Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Developing and benchmarking computational tools to facilitate T cell receptor repertoire analysis
(USC Thesis Other)
Developing and benchmarking computational tools to facilitate T cell receptor repertoire analysis
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Developing and benchmarking computational tools to facilitate T cell receptor repertoire analysis
by
Kerui Peng
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CLINICAL AND EXPERIMENTAL THERAPEUTICS)
August 2023
Copyright 2023 Kerui Peng
ii
Acknowledgments
First, I would like to thank my advisors, Dr. Serghei Mangul and Dr. Houda Alachkar. I
am fortunate to be Dr. Serghei Mangul’s first PhD student. Dr. Serghei Mangul has provided
guidance and support during my PhD journey, a unique and unforgettable journey because the
majority of the time was spent during the global pandemic. He always put himself in my position
and provided the best suggestions that I could ask for. You are an amazing advisor and an
inspiration to all our lab members including me. With a similar background, Dr. Houda Alachkar
has supported and challenged me to pursue my research and career goals. Dr. Houda Alachkar
made me feel included during lab meetings and social events so that I know that I am not alone
in the journey.
I would like to thank my committee member, Dr. Nicholas Mancuso for offering helpful
suggestions and comments on my research projects. I would like to thank Dr. Theodore Scott
Nowicki for providing clinical samples to support my research.
I am grateful to have mentors including Dr. Robert Straka, Dr. Youssef Roman, and Dr.
Harvey Arbit. I was fortunate to work in Dr. Robert Straka’s lab and participate in
pharmacogenomics research projects during my PharmD study in Minnesota. Dr. Robert Straka
and Dr. Youssef Roman motivated me to pursue a PhD when I was in pharmacy school
contemplating the future. Dr. Harvey Arbit offered me the opportunity to open my mind to
different career paths other than being a traditional pharmacist. They are all role models for me
to become a better scientist and a better person in general. Their continuous contribution to
science and education is a motivation for me to hopefully pay back to the scientific community
in the future.
iii
I am also thankful to my friends Angela, Haoyi, and Shu-Yi for going through the highs
and lows of the PhD journey together.
Last but most importantly, I am grateful to my parents for teaching me to be a kind
person and supporting me throughout my journey, my dear husband Edward for cheering me up
and always being there for me, and my husband’s family for encouraging me. You are all
important persons in my life, I love you and thank you.
iv
Table of Contents
Acknowledgments ......................................................................................................................... ii
List of Tables ................................................................................................................................ vi
List of Figures .............................................................................................................................. vii
Abstract ......................................................................................................................................... ix
Chapter 1: Background ................................................................................................................ 1
1.1 T cell receptor (TCR) ........................................................................................................................ 1
1.1.1 Definition .................................................................................................................................................... 1
1.1.2 Development of TCRs ............................................................................................................................... 1
1.1.3 Functionality of TCRs ............................................................................................................................... 2
1.2 T cell receptor sequencing (TCR-Seq)............................................................................................. 3
1.2.1 Multiplex Polymerase Chain Reaction (Multiplex PCR) ....................................................................... 4
1.2.2 Rapid Amplification of 5’ Complementary Ends (5’RACE) ................................................................. 4
1.3 TCR-Seq data analysis workflow ..................................................................................................... 5
1.4 T cell receptor studies ....................................................................................................................... 6
1.4.1 TCR studies and disease biomarkers ....................................................................................................... 6
1.4.2 TCR studies and treatment outcomes...................................................................................................... 7
1.5 Diversity in immunogenomics: the value and the challenge .......................................................... 7
1.5.1 Current state of diversity in genomics studies ........................................................................................ 8
1.5.2 The need for diversity in immunogenomics ............................................................................................ 9
1.5.3 Germline gene diversity and databases ................................................................................................. 10
1.5.4 Recommendations for the immunology community ............................................................................ 12
Chapter 2: Rigorous benchmarking of T cell receptor repertoire profiling methods
for cancer RNA sequencing........................................................................................................ 16
2.1 Introduction ..................................................................................................................................... 16
2.2 Methods ............................................................................................................................................ 18
2.2.1 Gold standard datasets............................................................................................................................ 18
2.2.2 Choice of RNA-Seq-based TCR profiling methods .............................................................................. 26
2.2.3 Isolate genomic DNA and RNA .............................................................................................................. 28
2.2.4 Generate DNA-based TCR-Seq data ..................................................................................................... 28
2.2.5 Analyze PBMC and melanoma biopsy TCR-Seq data ......................................................................... 29
2.2.6 RNA-Seq of PBMC and melanoma biopsy samples ............................................................................. 29
2.2.7 Download renal carcinoma clear cell samples ...................................................................................... 29
2.2.8 Download ileocecal lymph node and small intestine samples .............................................................. 30
2.2.9 Run RNA-Seq-based repertoire profiling tools .................................................................................... 30
2.2.10 Estimate TCR diversity and relative frequency ................................................................................. 31
2.2.11 Classify T cell rich and T cell poor tissues .......................................................................................... 31
2.2.12 Generate Hill curves .............................................................................................................................. 32
2.2.13 Computationally modify the properties of RNA-Seq samples .......................................................... 32
2.2.14 Additional dataset from healthy individuals ....................................................................................... 32
2.2.15 Compare computational resources across RNA-Seq-based repertoire profiling tools ................... 34
2.2.16 Code availability .................................................................................................................................... 34
2.2.17 Data availability ..................................................................................................................................... 34
2.3 Results............................................................................................................................................... 35
v
2.3.1 RNA-Seq-based TCR profiling methods were able to successfully capture TCRβ low SDI
repertoires across various T cell rich tissues ................................................................................................. 36
2.3.2 RNA-Seq-based TCR profiling methods were able to effectively estimate clonality in T cell
rich tissues ......................................................................................................................................................... 45
2.3.3 RNA-Seq-based TCR profiling methods were able to successfully estimate relative
frequencies of clonotypes in low SDI samples ................................................................................................ 58
2.3.4 Fewer clonotypes were detected by RNA-Seq-based TCR profiling methods compared to
TCR-Seq-based methods.................................................................................................................................. 65
2.3.5 The effect of the number of TCR derived reads on the performance of RNA-Seq-based
TCR profiling methods .................................................................................................................................... 71
2.3.6 The effect of read length on the performance of RNA-Seq-based TCR profiling methods ............. 73
2.3.7 Complement the TCRβ chain analysis with TCRɑ chain analysis ..................................................... 76
2.3.8 Performance of RNA-Seq methods in the samples from healthy individuals .................................... 77
2.3.9 Computational resources comparison across RNA-Seq-based tools .................................................. 84
2.4 Discussion ......................................................................................................................................... 85
Chapter 3: pyTCR: A comprehensive and scalable solution for TCR-Seq data
analysis to facilitate reproducibility and rigor of immunogenomics research ...................... 91
3.1 Introduction ..................................................................................................................................... 91
3.2 Methods ............................................................................................................................................ 93
3.2.1 TCR-Seq data........................................................................................................................................... 93
3.2.2 TCR-Seq data preprocessing .................................................................................................................. 94
3.2.3 Statistical analysis .................................................................................................................................... 94
3.2.4 Jupyter notebooks ................................................................................................................................... 94
3.2.5 pyTCR data structure and software implementation .......................................................................... 95
3.2.6 Comparison with other methods ............................................................................................................ 95
3.2.7 Data availability ....................................................................................................................................... 96
3.2.8 Ethics statement ....................................................................................................................................... 97
3.3 Results............................................................................................................................................... 97
3.3.1 pyTCR: a comprehensive and scalable solution for TCR-seq data analysis ..................................... 97
3.3.2 pyTCR is able to perform basic analysis to characterize the TCR repertoire ................................ 104
3.3.3 pyTCR is able to perform clonality analysis to assess the evenness of distribution of TCR
clonotypes ........................................................................................................................................................ 106
3.3.4 pyTCR is able to perform gene usage analysis to detect over and underrepresented TCR
genes across the samples ................................................................................................................................ 109
3.3.5 pyTCR is able to assess the diversity of TCR repertoires.................................................................. 111
3.3.6 pyTCR is able to effectively compare clonotypes and motifs across samples .................................. 113
3.3.7 pyTCR offers several advantages compared to the existing tools ..................................................... 116
3.4 Discussion ....................................................................................................................................... 120
Chapter 4: Concluding Remarks ............................................................................................. 123
References .................................................................................................................................. 126
vi
List of Tables
Table 1 Tools for inference of germline TCR and immunoglobulin genes from AIRR-seq
data ................................................................................................................................................ 14
Table 2.1 Overview of the gold standard dataset .......................................................................... 19
Table 2.2 Overview of the samples in the gold standard dataset .................................................. 20
Table 2.3 Overview of evaluated RNA-Seq-based TCR profiling methods’ parameters,
publication details, and technical characteristics .......................................................................... 26
Table 2.4 Execution of RNA-Seq-based TCR profiling methods ................................................ 27
Table 2.5 Overview of the samples in the additional dataset ....................................................... 32
Table 2.6 The comparison of the portion of the TCR clonotypes that are captured by the
tools ............................................................................................................................................... 38
Table 2.7 The average number of TCRβ derived reads by RNA-Seq-based methods in
different tissue types ..................................................................................................................... 39
Table 2.8 The average number of TCRβ derived reads by RNA-Seq-based methods in
different repertoire types ............................................................................................................... 40
Table 2.9 The comparison of the absolute error of the Shannon Diversity Index (SDI)
across tools .................................................................................................................................... 48
Table 2.10 The fraction of the top ten abundant TCR clonotypes across the samples ................. 49
Table 2.11 The comparison of the absolute error of clonality across tools .................................. 50
Table 2.12 The comparison of the absolute error of SDI and clonality when varying SDI
thresholds to classify low or high SDI samples ............................................................................ 51
Table 2.13 The formulas for Hill curves ....................................................................................... 53
Table 2.14 The comparison of the clonotype counts across tools ................................................ 66
Table 2.15 The comparison of the TCR-Seq confirmed clonotype counts across tools ............... 68
Table 2.16 Overview of the samples in the additional dataset ..................................................... 78
Table 3.1 TCR repertoire analysis functions in pyTCR ............................................................... 98
Table 3.2 The comparison of TCR repertoire analysis tools ...................................................... 117
vii
List of Figures
Figure 1.1 The structure and development of T cell receptors (Liu and Mardis, 2017) ................. 2
Figure 1.2 The process of multiplex PCR and 5’RACE (Li et al., 2020) ....................................... 4
Figure 1.3 The overview of TCR-Seq data analysis workflow ...................................................... 6
Figure 2.1 Basic metrics of the dataset ......................................................................................... 22
Figure 2.2 The distribution of clonotype groups in each sample .................................................. 25
Figure 2.3 The Shannon Diversity Index (SDI) (a), clonality (b), and estimated T cell
fraction (c) of each sample ............................................................................................................ 36
Figure 2.4 The ability to capture TCR-Seq confirmed clonotypes by RNA-Seq-based
methods in low SDI samples (a-d) and high SDI samples (e-h) ................................................... 41
Figure 2.5 The number of TCR derived reads by RNA-Seq-based TCR profiling methods
per 1 million RNA-Seq reads in different types of samples ......................................................... 42
Figure 2.6 The capability of capturing TCR-Seq confirmed clonotypes by RNA-Seq-
based methods in each sample ...................................................................................................... 44
Figure 2.7 TCR-Seq-based TCRβ and RNA-Seq-based diversity and clonality
comparison .................................................................................................................................... 54
Figure 2.8 TCR-Seq-based TCRβ and RNA-Seq-based diversity and clonality
comparison when singleton TCRs are included............................................................................ 54
Figure 2.9 Hill curves of each sample to estimate the diversity of TCR repertoire ..................... 57
Figure 2.10 Correlation of TCRβ clonotype frequency based on the TCR-Seq data (x-
axis) and the TCR derived reads from RNA-seq (y-axis)............................................................. 61
Figure 2.11 Correlation of TCRβ clonotype frequency based on the TCR-Seq data (x-
axis) and the TCR derived reads from RNA-seq (y-axis) in T cell rich low SDI samples ........... 63
Figure 2.12 Correlation of TCRβ clonotype frequency based on the TCR-Seq data (x-
axis) and the TCR derived reads from RNA-seq (y-axis) when singleton TCRs are
included ......................................................................................................................................... 65
Figure 2.13 TCRβ derived clonotype counts from RNA-Seq data ............................................... 71
Figure 2.14 The absolute error of RNA-Seq based TCRβ diversity and TCR-Seq based
TCRβ diversity and capability of capturing TCR-Seq confirmed clonotypes by RNA-Seq-
based methods in computationally modified low SDI samples with reduced reads ..................... 72
Figure 2.15 The capability of capturing TCR-Seq confirmed clonotypes by RNA-Seq-
based methods and absolute error of RNA-Seq based TCRβ diversity and TCR-Seq based
TCRβ diversity from reduced RNA-Seq read length results ........................................................ 76
Figure 2.16 TCRɑ chain and TCRβ chain metrics comparison .................................................... 77
Figure 2.17 The Shannon Diversity Index (SDI) (a), clonality (b), and estimated T cell
fraction (c) of samples from 10 healthy individuals ..................................................................... 79
viii
Figure 2.18 The ability to capture TCR-Seq confirmed clonotypes by RNA-Seq-based
methods in samples from 10 healthy individuals .......................................................................... 80
Figure 2.19 TCR-Seq-based TCRβ and RNA-Seq-based diversity and clonality
comparison in samples from 10 healthy individuals..................................................................... 81
Figure 2.20 Correlation of TCRβ clonotype frequency based on the TCR-Seq data (x-
axis) and the TCR derived reads from RNA-seq (y-axis) in samples from 10 healthy
individuals ..................................................................................................................................... 82
Figure 2.21 TCRβ derived clonotype counts from RNA-Seq data ............................................... 83
Figure 2.22 TCRɑ chain and TCRβ chain metrics comparison .................................................... 84
Figure 2.23 Central Processing Unit (CPU) time (a) and memory usage (b) for the
selected four samples across different tools.................................................................................. 85
Figure 3.1 Visualization of TCR repertoire metrics generated using pyTCR ............................ 103
Figure 3.2 The clonotype counts were shown in groups by hospitalization status ..................... 105
Figure 3.3 Clonal portion grouped by hospitalization status ...................................................... 107
Figure 3.4 The distribution of clonotype groups in each sample ................................................ 109
Figure 3.5 Heatmap of the weighted V gene usage in each sample and the Sankey plot for
V-J combinations ........................................................................................................................ 111
Figure 3.6 The diversity indices were shown in groups by hospitalization status ...................... 112
Figure 3.7 The overlap indices across samples ........................................................................... 116
Figure 3.8 The number of highly presented motifs across samples grouped by
hospitalization status ................................................................................................................... 116
Figure 3.9 Central Processing Unit (CPU) time (a) and Memory usage (b) for subsamples
of the COVID19-BWNW dataset for overlap analysis............................................................... 120
ix
Abstract
T cells respond to antigens through their diverse repertoire of T cell receptors (TCRs),
which are formed through the process of V(D)J recombination of the germline TCR locus. The
V(D)J recombination is a process that poses high diversity with selections from different V, D,
and J genes. Furthermore, deletion and insertion of nucleotides during the V(D)J recombination
process increases the complexity of TCR repertoires. Due to the complexity and diversity of
TCR repertoires, targeted sequencing techniques, as well as computational methods, are needed
for appropriate downstream analysis.
The recent advances in high throughput sequencing technologies enable cost-effective
characterization of the immune system, thus providing opportunities to study immune responses
and further promote translational and clinical research. TCR sequencing (TCR-Seq) is a
sequencing technology that targets the amplified DNA or RNA from the regions of TCR loci.
TCR-Seq data provides information on V, D, and J gene usage and complementarity determining
region (CDR) diversity. Specifically, the complementarity-determining region 3 (CDR3) region
is highly variable and has the greatest diversity. Computational methods are developed for
facilitating TCR-Seq data analysis.
The repertoires of TCRs are central to adaptive immunity, thus analyses of these loci are
important for understanding adaptive immune receptor repertoire. Additionally, TCR-Seq studies
have offered new opportunities to deepen the understanding of the adaptive immune system and
its roles in various human diseases, such as infectious diseases, cancers, autoimmune conditions,
and neurodegenerative diseases.
However, even with the advances in TCR studies, there are opportunities for
improvement. First, existing computational methods including MiXCR, ImRep, TRUST4, and
x
CATT allow the utilization of non-targeted next-generation sequencing (NGS) data to extract
TCR information. However, the performance of these methods on samples with variable T cells
level remains unknown. Second, the traditional computational methods require researchers to
acquire knowledge in the command-line interface. However, biomedical researchers usually lack
the relevant experiences, this situation hinders the application of computational tools to a broad
range of users.
Therefore, to tackle these challenges, the overall goals of this project are to benchmark
the existing computational RNA-Seq-based TCR profiling methods and to develop a new
platform for TCR-Seq data analysis. I utilized a data science approach to achieve these goals
with two aims: aim 1: a benchmark of RNA-Seq-based TCR repertoire profiling methods across
different tissue types and aim 2: development of pyTCR to facilitate scalability, reproducibility,
and transparency for TCR-Seq analysis.
In order to properly benchmark the RNA-Seq-based TCR repertoire profiling methods,
we conducted a comprehensive evaluation of four existing methods (MiXCR, ImRep, TRUST4,
and CATT) by using TCR-Seq as a gold standard. We utilized the largest dataset of 19 samples
from five different tissue types from cancer patients that were sequenced by both TCR-Seq and
RNA-Seq. The scenarios in which TCR repertoires can be effectively profiled by RNA-Seq have
been investigated by comparing fractions of clonotype captured, diversity and clonality
estimates, clonotype frequencies, and clonotype counts. The results provided researchers with
information to choose the most appropriate methods for individual purposes in TCR when RNA-
Seq data is available but TCR-Seq may not be.
In order to develop a platform for TCR-Seq data analysis for biomedical researchers with
limited computational backgrounds or skills, I developed pyTCR, a platform based on interactive
xi
computational notebooks for comprehensive TCR-Seq data analysis with enriched functionalities
including metrics analysis, visualization, and statistical analysis. The detailed instructions and
web-based platform reduce the barriers to using the command-line interface. This platform has a
rich set of functionalities including easy to customize and alter parameters. The outputs of
corresponding analyses are generated in one notebook, which decreases the chances of creating
errors when combining different output files or utilizing different tools for further analysis. As a
result, we provided a one-stop-shop solution for biomedical researchers to conduct TCR-Seq
data analysis, with the facilitation of transparency and reproducibility in immunogenomics
research.
Overall, this work has demonstrated the scalability and effectiveness of using RNA-Seq
data collected from cancer patients to assemble TCR clonotypes and estimate diversity, it also
delivered a user-friendly platform for TCR-Seq data analysis with enriched functionalities and
reduced computational complexity and resources.
1
Chapter 1: Background
1.1 T cell receptor (TCR)
1.1.1 Definition
T cell receptors (TCRs) are expressed on the surface of T cells. Each receptor consists of
two polypeptide chains (α and β, or γ and δ), these chains are linked by a disulfide bond. The T
cells with α and β chains are defined as αβ T cells, while the T cells with γ and δ chains are
defined as γδ T cells. The majority of the T cells express α and β chains. The function of γδ T
cells in the immune system and immune responses are not yet understood compared to αβ T
cells.
1.1.2 Development of TCRs
Exposure to naturally exposed antigens, including allergens, pathogens, and neoantigens,
as well as vaccines and certain therapeutic agents, stimulate the T cells to activate the adaptive
immune response and form TCRs. TCRs are formed through the process of V(D)J recombination
(X. S. Liu & Mardis, 2017). For the β chain, a D segment is firstly paired with a J segment, then
a V segment is selected and combined with the DJ segment. For the α chain, a similar process
occurs with the lack of the D segment. Deletion and insertion of nucleotides can occur during the
V(D)J recombination process as well. V(D)J recombination creates diverse and unique TCR
repertoires. According to the international ImMunoGeneTics information system (IMGT)
database, an integrated database with TCR and BCR references from various species, the human
germline TCR locus has a total number of 237-243 genes with 133-139 Variable (V), 5 Diversity
2
(D), and 84 Joining (J) known alleles. Additionally, it is estimated that there are 10
6
to 10
8
possible TCR clonotypes in human (Laydon et al., 2015).
Complementarity-determining regions (CDRs), also known as hypervariable regions,
consist of three TCRα chains and three TCRβ chains (Wong et al., 2019). There are classified to
be CDR1, CDR2, and CDR3. CDR3 has the greatest diversity because it includes all the D and J
segments, along with some V segments. In contrast, CDR1 and CDR2 only include V regions.
Figure 1.1 The structure and development of T cell receptors (Liu and Mardis, 2017)
1.1.3 Functionality of TCRs
Helper T cells (CD4 cells) and cytotoxic T cells (CD8 cells) are the major types of T cells
in humans. T cell receptors (TCRs) are expressed on both CD4 cells and CD8 cells. TCRs
recognize and bind to the epitopes of the antigens presented by the major histocompatibility
complex (MHC). However, the class of MHC that TCRs bind to is different between those on the
CD4 cells and on the CD8 cells. Specifically, the TCRs on the CD8 cells bind to the peptide
3
presented by MHC class I molecules and further kill the cells. While, the TCRs on the CD4 cells
bind to the peptide presented by MHC class II molecules and activate other cells such as B cells,
dendritic cells, and macrophages for the following responses. The highly diverse TCRs play key
roles in the recognition of various antigens including allergens, pathogens, and neoantigens. This
process is part of the adaptive immune response, which creates immunological memory for
humans and provides protection against the same foreign antigens.
1.2 T cell receptor sequencing (TCR-Seq)
The recent advances in high throughput sequencing technologies made it possible for
researchers to capture a detailed view of the T cell receptor repertoires, it also provides novel
opportunities to study T cell receptor repertoire at the population scale. There are two main
molecular approaches for TCR sequencing, multiplex polymerase chain reaction (multiplex
PCR) and rapid amplification of 5’ complementary ends (5’RACE) (N. Li et al., 2020). They
both leverage the deep sequencing of amplified DNA or RNA from T cell receptor loci.
4
Figure 1.2 The process of multiplex PCR and 5’RACE (Li et al., 2020)
1.2.1 Multiplex Polymerase Chain Reaction (Multiplex PCR)
Multiplex PCR is one of the main molecular approaches for TCR repertoire sequencing.
It requires the utilization of V gene primers, as well as C gene primers or J gene primers. Both
DNA and RNA can be used as the starting material for multiplex PCR. It is subject to PCR
amplification and sequencing biases because of the multiple rounds of PCR involved.
Additionally, it also limits the ability to discover novel V(D)J alleles since only the known V, D,
or J genes could be used as primers, and the TCRs that contain such genes would be further
sequenced.
1.2.2 Rapid Amplification of 5’ Complementary Ends (5’RACE)
Rapid Amplification of 5’ Complementary Ends (5’RACE) is another widely used TCR
repertoire sequencing technique. It uses RNA as the starting material and then syntheses the
complementary DNA (cDNA) via reverse transcription. Primers that target the 5’ end and the
constant regions of TCR α and β genes are required for capturing the variable regions of TCR.
The procedure of TCR repertoire sequencing via 5’RACE usually includes RNA extraction,
library preparation, library purification, library quality control, library quantification, library
pooling and sequencing. For example, by using the SMARTer Human TCR a/b Profiling Kit v2
(TaKaRa Bio, San Jose, CA, USA) and MiSeq system (Illumina, San Diego, CA, USA),
researchers are able to obtain 600 to 800 base pairs products.
Unlike multiplex PCR, the 5’RACE method captures the entire length of the TCR
sequences. Since it doesn’t rely on V primers, it also enables the detection of rare TCR
sequences. Furthermore, the utilization of unique molecular identifiers (UMIs) helps to reduce
PCR amplification and sequencing errors.
5
1.3 TCR-Seq data analysis workflow
TCR-Seq data analysis workflow includes both the TCR sequencing (wet lab
experiments) and the sequencing data analysis (computational analysis). As described in the
previous section, after the amplification of the TCR regions, targeted sequencing is performed
for those regions. The sequencing data is not human-readable because it is usually delivered in
the fastq format. There are computational methods, referred to as alignment/assembly methods,
developed to assemble TCR clonotypes and perform V(D)J gene mapping to the IMGT database
(Lefranc et al., 2015). MiXCR is the most widely used method among these methods.
ImmunoSEQ (Adaptive Biotechnologies, Seattle, WA, USA) is a commercially available
platform for TCR-Seq data generation and processing. After the processing, TCR clonotypes, as
well as the V, D, and J genes are reported. Even with such information, in order to generate
summarized statistics for TCR repertoires, researchers rely on TCR data analysis methods to
further analyze the TCR-Seq data. Such methods included VDJtools (Shugay et al., 2015),
Immunarch (Nazarov et al., 2022), and pyTCR (Peng et al., 2022). With the applications of TCR
data analysis methods, different TCR metrics and visualization could be produced.
6
Figure 1.3 The overview of TCR-Seq data analysis workflow
1.4 T cell receptor studies
T cell receptor (TCR) studies, which use TCR-Seq as the technique to examine TCR
repertoires in various disease states, have provided us with the opportunities to understand the
characteristics of the adaptive immune system and have advanced the discoveries in novel
therapeutic strategies.
1.4.1 TCR studies and disease biomarkers
TCR sequences have been shown to have the potential to be used as disease biomarkers
across different disease types. For example, researchers identified 164 cytomegaloviruses (CMV)
associated TCRβ sequences, among which 45 were associated with HLA alleles. These TCRβ
sequences provided highly precise predictions for CMV infection. Additionally, three TCRβ
sequences were proved to bind to CMV epitopes, and HLA alleles could be predicted by using
such information (Emerson et al., 2017). There are efforts in identifying diagnostic markers in
autoimmune diseases as well. For example, a study has shown that patients with primary
autoimmune cytopenias (AIC) lack specific TCRβ clusters compared with healthy individuals,
these can potentially be used as diagnostic signals for primary AIC (Simnica et al., 2019).
Another study revealed certain TCRβ clones that can distinguish systemic lupus erythematosus
(SLE) or rheumatoid arthritis (RA) patients from healthy individuals (X. Liu et al., 2019). In
recent years, this exploration has been expanded to neurological conditions, such as Alzheimer’s
disease. For example, by using single-cell TCR sequencing, novel TCRαβ sequences that were
specific to the Epstein-Barr virus (EBV) protein were found in the cerebrospinal fluid of
Alzheimer’s disease patients (Gate et al., 2020).
7
In addition to the identification of disease-specific TCR sequences, the characteristics of
TCR repertoires have shown to vary between disease populations versus healthy individuals, and
before and after treatments. For example, a less diverse TCR repertoire, which was indicated by
a higher Gini index was noticed in major depressive disorder patients. Additionally, differences
in V gene usage were detected across patients with major depressive disorder and non-depressed
individuals (Patas et al., 2018). In the context of cancer, the number of TCRβ sequences was
increased after treatment with tremelimumab in melanoma patients (Robert et al., 2014).
1.4.2 TCR studies and treatment outcomes
With the development of therapeutic agents that target the immune system, it is crucial to
investigate whether metrics of TCR repertoires could be leveraged to predict the treatment
outcomes. Since the immunotherapies advancements are mainly used in hematological
malignancies, the TCR studies with the focus on predicting outcomes are mostly conducted
among cancer patients. For example, researchers have demonstrated that metastatic melanoma
patients with clinical response to pembrolizumab had less diverse TCR repertoires (which was
determined by normalized Shannon index), compared to those in the non-response group (Tumeh
et al., 2014).
1.5 Diversity in immunogenomics: the value and the challenge
Section is published: Peng, K., Safonova, Y., Shugay, M., Popejoy, A. B., Rodriguez, O. L.,
Breden, F., Brodin, P., Burkhardt, A. M., Bustamante, C., Cao-Lormeau, V. M., Corcoran, M.
M., Duffy, D., Fuentes-Guajardo, M., Fujita, R., Greiff, V., Jönsson, V. D., Liu, X., Quintana-
Murci, L., Rossetti, M., Xie, J., … Mangul, S. (2021). Diversity in immunogenomics: the value
and the challenge. Nature methods, 18(6), 588–591.
8
1.5.1 Current state of diversity in genomics studies
Genomic studies have mainly used samples from individuals of European ancestry, at the
expense of learning from the largest and most genetically diverse populations. For example, 78%
of individuals included in genome-wide association studies (GWAS) reported in the GWAS
Catalog (https:// www.ebi.ac.uk/gwas/home) through January 2019 are of European descent
(Sirugo et al., 2019), while Asian populations account for 59.5% of the world population based
on the Population Reference Bureau’s World Population Data Sheet
(https://www.prb.org/datasheets/). Though this is partially due to inadequate sampling of non-
European populations, researchers tend to exclude data from minority groups when conducting
statistical analyses (Ben-Eghan et al., 2020) even when diverse datasets are available. The
limited inclusion of samples from diverse populations hinders the equitable advancement of
genomic medicine as a result of persistent uncertainty with respect to the genetic etiology of
disease across populations, as well as differential rates of adverse drug events, treatment
outcomes and other health disparities.
In recent years there has been an increased awareness of the limited generalizability of
findings across populations and the benefits for the discovery and interpretation of gene-trait
associations brought about by the inclusion of diverse populations in genomic studies. This has
motivated the inclusion of diverse, multiethnic populations in large-scale genomic studies. For
example, whole-genome sequencing in individuals of African descent (Choudhury et al., 2020)
and whole-exome sequencing in a southern African population (Retshabile et al., 2018) have
improved understanding of genetic variation in under-represented populations. Additional efforts
have been made to establish reference genome datasets for research in diverse populations; these
include the GenomeAsia 100K Project, Human Heredity and Health in Africa (H3Africa)
9
initiative, Taiwan Biobank, Population Architecture Using Genomics and Epidemiology (PAGE)
Consortium, Trans-Omics for Precision Medicine (TOPMed) program, Clinical Sequencing
Evidence-Generating Research (CSER) consortium, Human Genome Reference Program
(HGRP) and All of Us Research Program. However, the field of immunogenomics, especially
that related to adaptive immune receptors, has yet to benefit from a similar growth in diversity.
1.5.2 The need for diversity in immunogenomics
Central to immunity are the repertoires of T cell receptors (TCRs), immunoglobulins,
human leukocyte antigens (HLAs) and killer cell immunoglobulin-like receptors (KIRs). Thus,
analyses of the loci that encode these molecules are critical to immunogenomics studies.
T cells and B cells recognize antigens through their TCRs and immunoglobulins, which
are formed through the process of V(D)J (variable, diversity and joining region) recombination.
Capturing the vast diversity of recombined, expressed TCR and immunoglobulin repertoires was
not possible until the development of high-throughput sequencing techniques in the late 2000s.
Freeman and colleagues employed 5′ rapid amplification of cDNA ends (5′ RACE) PCR to
amplify TCR cDNA and to characterize TCR repertoires (Freeman et al., 2009). Weinstein and
colleagues sequenced an antibody repertoire in zebrafish (Weinstein et al., 2009) in 2009,
creating the foundation of adaptive immune receptor repertoire sequencing (AIRR-seq)
technologies. In 2010, Boyd and colleagues applied AIRR-seq to human immunoglobulins
(Boyd et al., 2010). Since then, studies including AIRR-seq have seen exponential growth, and
findings from these studies have shaped our understanding of human immune repertoires in
different settings (Briney et al., 2019).
AIRR-seq analysis and other immunogenomics studies offer new opportunities to deepen
our understanding of the immune system in the context of a variety of human diseases, including
10
infectious diseases (Avnir et al., 2016), cancer (Sharonov et al., 2020), autoimmune conditions
(Bashford-Rogers et al., 2019) and neurodegenerative diseases (Gate et al., 2020). Furthermore,
AIRR-seq data provides information on expression profiles; germline V, D and J gene usage;
complementarity-determining region (CDR) diversity; and, in the case of immunoglobulin
repertoires, somatic hypermutation levels. There have been extensive efforts to explore the
genetic diversity of the HLA and KIR systems (Singh et al., 2012; Dilthey et al., 2015), and the
knowledge gained from these efforts could be integrated into the AIRR-seq studies of TCR and
immunoglobulin germline gene diversity.
As in the field of genomics, greater diversity in immunogenomics research has the
potential to enable the discovery of novel genetic traits associated with immune system
phenotypes that are common or different across populations. While evidence for extensive
diversity in germline TCR and immunoglobulin genes have been reported in the human
population (Scheepers et al., 2015; Watson et al., 2013), most AIRR-seq studies that use
sequencing to study T and B cell receptor repertoires have been conducted in individuals of
European descent, leaving other populations under-represented (Watson et al., 2017). Exclusion
of non-European populations in genomics research limits our understanding of how pathogens
have exerted selective pressures on immune-related genes in populations living in different
environments, and thus on infectious disease manifestation (Quintana-Murci, 2019).
1.5.3 Germline gene diversity and databases
A critical step in AIRR-seq studies is germline gene assignments, which requires reliable
and comprehensive databases of germline V(D)J alleles representing different populations. So
far, such databases are lacking because the genetic regions encoding these genes have been
exceptionally challenging to characterize at the genomic level. Not only do these loci contain a
11
mixture of functional genes and pseudogenes with high similarity, but they are also characterized
by considerable structural variation, with deletions and duplications occurring at high frequency
in different populations. Given the complexity of the TCR and immunoglobulin genomic loci
and deficits in existing germline databases, the determination of immune receptor germline gene
usage from bulk RNA-seq or whole-genome sequencing is often inaccurate. Efforts to improve
germline databases are therefore critical for improved coverage of diversity in immune repertoire
analysis. Computational methods to infer germline TCR and immunoglobulin genes from AIRR-
seq data are expected to accelerate these efforts (Gadala-Maria et al., 2019; Ralph & Matsen,
2019; Corcoran et al., 2016; W. Zhang et al., 2016; Safonova & Pevzner, 2019; Bhardwaj et al.,
2020) (Table 1). Comparisons are also needed between results obtained from methods for
inferring germline gene variants from AIRR-seq repertoires (Ohlin et al., 2019) and from direct
sequencing of genomic DNA (Scheepers et al., 2015), such as the sequencing and assembly of
large-insert clones (for example, bacterial artificial chromosome (BAC) and fosmid clones)
(Watson et al., 2013) and, more recently, whole-genome sequencing and targeted long-read
sequencing (Rodriguez et al., 2020).
The most widely used reference database for immunogenomics data, the international
ImMunoGeneTics information system (IMGT) (Lefranc et al., 2015), has been a valuable
resource. However, it lacks a comprehensive set of human TCR and immunoglobulin alleles
representing diverse populations worldwide. Further uncertainty stems from descriptions of
sample populations in databases being based on geography or self-identified race and/or ethnicity
of study subjects, rather than genetic ancestry. As a result, we have a limited understanding of
population-level TCR and immunoglobulin germline gene variation. However, progress is being
made.
12
The AIRR Community (AIRR-C; http://www.airr-community.org) is an international
community of bioinformaticians and immunogeneticists that has been formed to develop
standards and protocols to promote sharing and common analysis approaches for AIRR-seq data,
including the AIRR Data Commons (Christley et al., 2020). As a means to enrich available
germline gene sets, the AIRR-C established the Inferred Allele Review Committee (IARC;
https://www.antibodysociety.org/the-airr-community/airr-subcomittees/ inferred-allele-review-
committee-iarc) to review and curate new immunoglobulin or TCR germline genes inferred from
AIRR-seq data. Its work is underpinned by the Open Germline Receptor Database, which
provides submission and review workflows. IARC-affirmed sequences are published in this
database, together with supporting evidence. VDJbase was also recently launched as a public
database that allows users to access population-level immunoglobulin and TCR germline data,
including reports and summary statistics on germline genes, alleles, single nucleotide and
structural variants, and haplotypes of interest derived from AIRR-seq and genomic sequencing
data. It currently contains AIRR-seq data from 421 human donors, representing 724
immunoglobulin heavy chain gene alleles. The integration of TCR datasets is in progress.
Together these initiatives will help pave the way for the development of approaches that extend
germline curation efforts to include more data types and ultimately ensure that population-level
metadata can be more effectively captured and leveraged.
1.5.4 Recommendations for the immunology community
The immunology community should make targeted efforts to include non-European
populations in AIRR-seq and other immunogenomics studies. Already, AIRR-seq studies in
more diverse populations have uncovered evidence for extensive genetic heterogeneity. For
example, in a study of South Africans with HIV, Scheepers and colleagues discovered many
13
immunoglobulin heavy chain variable (IGHV) alleles that were not represented in IMGT
(Scheepers et al., 2015), information of relevance to HIV vaccine design aimed at germline-
targeting immunogens (Burton, 2019). In a study in the Papua New Guinea population, 1 new
IGHV gene and 16 IGHV allelic variants were identified from AIRR-seq data (Wang et al.,
2011). These discoveries of alleles indicate the need for further population-based AIRR-seq
datasets and the identification and validation of the presence of new alleles so that they can be
added to public databases. It will be critical to conduct studies in various human populations if
we are to fully understand how AIRR-seq can be leveraged to make improvements in a wide
range of applications, including vaccine design.
Further, we suggest that extant open AIRR-seq datasets could be used to augment
immunoglobulin and TCR germline databases and inform AIRR-seq and other immunogenomics
studies across diverse populations. It may be possible in the future to use AIRR-seq data to infer
genetic ancestry, but such bioinformatics methods are yet to be developed and thus the utility of
genetic ancestry in this field has yet to be demonstrated. Conclusions about new germline
variants discovered through non-targeted sequencing data, including RNA-seq based on short
read sequences, should be drawn with caution owing to the complexity of the adaptive immune
receptor loci (Rodriguez et al., 2020), as described above. New methodologies and
computational approaches should be developed to facilitate the inclusion of diverse population
datasets into existing databases, with the aim of enhancing our knowledge base to reflect global
genomic immunological diversity in populations around the globe. Such enriched databases
would provide researchers with baseline resources to design and implement the next generation
of personalized and precision immunodiagnostics and therapeutics (Greiff et al., 2020).
14
At the current stage of the global COVID-19 pandemic, many vaccine trials and
programs are underway worldwide, offering opportunities to investigate the role of genetic
factors in vaccine-mediated immune responses. Such investigations will require careful study
designs to effectively address potential confounding factors such as environmental, economic
and social determinants of health that systematically differ between populations defined by self-
identified measures of diversity and that are correlated with continental-level ancestry (Ioannidis
et al., 2021). Incomplete representation of diverse populations limits our capacity to address the
impact of genetics on clinical phenotypes, and ideally this should be investigated alongside non-
genetic risk factors for disease. Different genetic variants in an etiologic pathway modify the
clinical presentation of disease, and these effects can differ by genomic background (The Severe
Covid-19 GWAS Group, 2020). Specific immunoglobulin germline genes, and in some cases
alleles, have been found to be preferentially used in the response to pathogens, suggesting a
degree of convergence in the antibody response, as observed for influenza (Avnir et al., 2016),
HIV-1 (J. Huang et al., 2016), Zika virus (Robbiani et al., 2017) and SARS CoV-2 (Yuan et al.,
2020). Therefore, in addition to environmental factors, genetic variability in immune genes is
likely to drive differential effects in vaccine effectiveness and infection outcomes (Watson et al.,
2017).
Table 1 Tools for inference of germline TCR and immunoglobulin genes from AIRR-seq data
Tool Type of
receptors
Type of inferring
genes
Needs gene
database for
inference
Comment
TIgGER Ig V Yes TIgGER and
Partis assign
AIRR-seq reads
to V genes from
the database and
report a list of V
Partis Ig V Yes
15
gene alleles
(both known
alleles and
alleles with
modifications)
IgDiscover Ig, TCR V, J Yes IgDiscover uses
the database for
annotation of
AIRR-seq reads,
clusters reads
with similar
annotations, and
reports both
known and
previously
unobserved V
genes
IMPre Ig, TCR V, J No IMPre infers V
and J genes from
clusters of
similar AIRR-
seq reads and
uses a germline
database (if
available) for
annotation of the
inferred genes
IgScout Ig D, J No Both IgScout
and MINING-D
infer D genes as
abundant
substrings of
CDR3s of
AIRR-seq reads
and use a
germline
database (if
available) for
annotation of the
inferred genes
MINING-D Ig, TCR D No
16
Chapter 2: Rigorous benchmarking of T cell receptor repertoire
profiling methods for cancer RNA sequencing
2.1 Introduction
Immunotherapy is an effective approach to treat a variety of advanced malignancies
(Galon & Bruni, 2019; Sanmamed & Chen, 2018). The success of immunotherapy relies in part
on the presence of CD8+ cytotoxic T cells, which recognize tumor antigens via their T cell
receptors (TCRs), and then induce targeted cell apoptosis (Galon & Bruni, 2019). The ability to
characterize the TCR repertoire in patient samples is increasingly central to the field of
immunotherapy and cancer research (Hogan et al., 2019; X. S. Liu & Mardis, 2017). Targeted
TCR Sequencing (TCR-Seq) approaches are currently used to measure the diversity and clonality
of the TCR repertoire (Kaplinsky & Arnaout, 2016; Warren et al., 2011; Chaara et al., 2018; Pai
& Satpathy, 2021; Rosati et al., 2017), which can be affected by immune checkpoint inhibitors
and can serve as a surrogate measure of effectiveness and overall prognosis (Schrama et al.,
2017; Robert et al., 2014, p. 4; Tumeh et al., 2014). Furthermore, the ability to track genetically
engineered T cells expressing chimeric TCRs that target specific tumor antigens is important to
determine the persistence of these cells and corresponding clinical responses. However, TCR-
Seq is not frequently done compared to RNA-Seq, thus there is a significant amount of RNA-Seq
data available that can be used to extract TCR data. RNA-Seq-based TCR profiling methods
have been developed to bridge the gap (Bolotin et al., 2015; Mandric et al., 2020; Song et al.,
2021, p. 4; Chen et al., 2020). For example, a recent study demonstrated that MiXCR could
detect all TCRβ sequences with relative frequencies greater than 0.15% in one of T cell rich
tissues (Bolotin et al., 2017). However, despite the great promise of RNA-Seq-based TCR
17
profiling methods, such methods were not systematically benchmarked and they were validated
in extremely small numbers of samples and limited scenarios. Thus, the biomedical communities
remain uninformed regarding the advantages and limitations of these RNA-Seq-based TCR
profiling methods. Additionally, the feasibility of applying these methods across various cancer
tissue types and TCR repertoire types remains unknown.
Here we performed a comprehensive benchmarking of existing RNA-Seq-based TCR
profiling methods. The performance of these methods was investigated by using TCR-Seq as a
gold standard across T cell rich and poor tissues from different cancer tissue types and immune
repertoire types. We have carefully examined the scenarios of low Shannon Diversity Index
(SDI) (repertoires are dominated by one or a few clonotypes with high frequencies) and high SDI
(repertoires are composed of clonotypes with frequencies nearly evenly distributed) TCR
repertoires under which such tools can leverage RNA-Seq data and provide reliable estimates of
TCR repertoires, which is currently unknown. Our results show that RNA-Seq-based TCR
profiling methods are able to effectively capture the majority of TCR-Seq confirmed clonotypes
in low SDI samples. These methods are also capable of precisely estimating the overall TCR
repertoire diversity and clonotype frequencies in T cell rich low SDI samples. However, in the
case of low SDI samples, the small number of receptor reads that capture major clonotypes is
enough to estimate diversity even in T cell poor tissues. Moreover, bulk RNA-Seq stored both
TCRɑ and TCRβ chain information, these methods offered the possibility to characterize both
TCRɑ and TCRβ repertoires. However, cautions need to be taken for T cell poor tissues as the
results typically were less accurate because of the lack of ability of RNA-Seq-based methods to
detect clonotypes with low frequencies. To conclude, we have examined the ability of RNA-Seq-
based TCR profiling methods in different types of tissues and repertoires, thus providing a
18
comprehensive guide in which tissues RNA-Seq-based TCR profiling methods are feasible to
deliver comparable results to targeted TCR-Seq.
2.2 Methods
2.2.1 Gold standard datasets
Peripheral blood mononuclear cells (PBMC) (N=5) and melanoma biopsy samples (N=9)
were generated from patients at UCLA enrolled in transgenic NY-ESO-1 TCR adoptive cell
therapy (IRB#12-000153 and 13-001624) (Nowicki et al., 2019) and PD-1 blockade clinical
trials (IRB#11-003066) (Tumeh et al., 2014), as previously described. The studies were
conducted in accordance with local regulations, the guidelines for Good Clinical Practice, and
the principles of the Declaration of Helsinki. Renal clear cell carcinoma (N=3) (B. Li et al.,
2016) samples were acquired from The Cancer Genome Atlas (TCGA), melanoma specimens
from the ileocecal lymph node (N=1) and the small intestine (N=1) (Bolotin et al., 2017) were
acquired from the Sequence Read Archive (SRA). For the PBMC samples, three were from
patients that were transduced with a retroviral vector containing the NY-ESO-1 TCR (Nowicki et
al., 2019), while another two were from patients being treated for melanoma (Tumeh et al.,
2014). The average number of RNA-Seq reads ranged from 66 million to 88.3 million. The
average numbers of TCR-Seq reads ranged from 0.01 million to 3.26 million reads (Table 2.1,
Table 2.2, Figure 2.1a-b). The number of clonotypes detected by TCR-Seq varied across
different tissue types. Kidney samples had the lowest average number of clonotypes (1,917) in
the tissue samples analyzed. Note that the lymph node sample has an extremely high number of
TCRβ clonotypes (202,869) compared to all other samples (Figure 2.1c). Importantly, the
cohorts we used correspond to various tissue types with substantially different levels of T cells
19
and repertoire types. We computationally estimated the T cell levels across all samples by
GEDIT (Nadel, Lopez, et al., 2021). Samples with high levels of T cells were from PBMC and
lymph nodes, which were considered T cell rich tissues. While low levels of T cells were the
samples from the renal cells, small intestine, and melanoma biopsies, which were considered as
T cell poor tissues.
We used Shannon Diversity Index (SDI) to classify TCR repertoires as low SDI or high
SDI samples. The minimal value of SDI is 0, which value closer to 0 corresponds to the
increased monoclonality of the TCR population. The three PBMC samples that were transduced
with a retroviral vector containing the NY-ESO-1 TCR (Nowicki et al., 2019) and one melanoma
biopsy sample were low SDI samples, while all other samples were high SDI samples. In our
dataset, the clonotypes with frequencies greater than 0.01 (hyperexpanded clonotypes), which
were usually only one or a few TCRβ clonotypes, on average composed 91.08% of all TCRβ
clonotypes in low SDI samples, while the hyperexpanded clonotypes only composed 24.36% of
all TCRβ clonotypes in high SDI samples (Figure 2.2).
Table 2.1 Overview of the gold standard dataset
We had different cohorts from five different tissue types in total. We documented the average
number of TCR-Seq reads in million (indicated in the column “Average number of TCR-Seq
reads”), the average number of RNA-Seq reads in million (indicated in the column “Average
number of RNA-Seq reads”), RNA-Seq read length (indicated in the column “Length of RNA-
Seq reads”), the fraction of T cells in certain tissue type (indicated in the column “Fraction of T
cells in tissue”), the number of samples from each tissue type (indicated in the column “Number
of samples ”), the number of low SDI samples and high SDI samples from each tissue type
(indicated in the column “Number of low SDI/high SDI samples”).
Tissue
type
Number
of TCR-
Seq reads
(millions)
(mean+S
D)
Number
of RNA-
Seq reads
(millions)
(mean+S
D)
Length
of
RNA-
Seq
reads
Fraction
of T cells
in tissue
Number
of
samples
Number of
low SDI/
high SDI
samples
Resources
20
PBMC 0.122+0.1 66.8+24.
4
150bp High 5 3/2 In-house
Melanom
a biopsy
0.951+0.5
23
74.7+15.
9
100bp Low 9 1/8 In-house
Renal
cell
0.0102+0.
007
88.3+28 50bp Low 3 0/3 TCGA
Lymph
node
3.26 66 100bp High 1 0/1 SRA
Small
intestine
3.05 72.7 100bp Low 1 0/1 SRA
Table 2.2 Overview of the samples in the gold standard dataset
The samples are sorted by the sample name (indicated in the column “Sample name”). We
documented the tissue of which the sample was taken (indicated in the column “Tissue”), the
tissue type based on the T cell levels (indicated in the column “Tissue type”), the repertoire type
based on the Shannon Diversity Index (indicated in the column “Repertoire type”), the number
of TCR-Seq reads from TCR-Seq data (indicated in the column “Number of TCR-Seq reads”),
the number of RNA-Seq reads from RNA-Seq data (indicated in the column “Number of RNA-
Seq reads”), the number of derived reads from RNA-Seq data after processing through RNA-
Seq-based TCR profiling methods (indicated in the columns “Number of derived reads from
RNA-Seq (MiXCR)”, “Number of derived reads from RNA-Seq (ImReP)”, “Number of derived
reads from RNA-Seq (TRUST4)”, and “Number of derived reads from RNA-Seq (CATT)”,
respectively).
N
o.
Sam
ple
name
Tissu
e
Tiss
ue
type
(T
cell
rich
or
poo
r)
T
cell
fracti
on
Repe
rtoire
type
(high
SDI
or
low
SDI)
Numb
er of
TCR-
Seq
reads
Numb
er of
RNA-
Seq
reads
Numb
er of
derive
d
reads
from
RNA-
Seq
(MiX
CR)
Numbe
r of
derived
reads
from
RNA-
Seq
(ImRe
P)
Numbe
r of
derive
d reads
from
RNA-
Seq
(TRUS
T4)
Numb
er of
derive
d
reads
from
RNA-
Seq
(CAT
T)
1 samp
le01
PBM
C
T
cell
rich
0.263 low
SDI
90,577 104,98
4,482
132,47
2
219,18
5
264,88
4
214,71
6
2 samp
le02
PBM
C
T
cell
rich
0.322 low
SDI
87,762 73,892
,845
41,370 57,700 89,947 72,800
21
3 samp
le03
PBM
C
T
cell
rich
0.209 low
SDI
305,95
3
71,654
,976
52,066 64,914 99,723 79,629
4 samp
le04
PBM
C
T
cell
rich
0.184 high
SDI
104,77
9
43,179
,989
158 1354 2006 4050
5 samp
le05
PBM
C
T
cell
rich
0.149 high
SDI
18,617 40,524
,817
55 111 186 1724
6 samp
le06
melan
oma
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
906,12
1
82,476
,159
91 231 689 4057
7 samp
le07
melan
oma
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
1,257,
571
72,397
,468
0 2 19 3298
8 samp
le08
melan
oma
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
1,408,
590
85,492
,431
150 259 657 3543
9 samp
le09
melan
oma
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
1,157,
845
68,584
,414
310 334 1009 1976
10 samp
le10
melan
oma
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
1,769,
522
63,320
,771
59 71 306 1972
11 samp
le11
melan
oma
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
1,006,
220
55,727
,841
12 31 63 1739
12 samp
le12
melan
oma
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
292,82
8
107,91
9,183
8 10 39 4985
13 samp melan T 0 low 12,954 80,622 1071 844 2855 5052
22
le13 oma cell
poo
r
(unde
tecte
d)
SDI ,502
14 samp
le14
melan
oma
T
cell
poo
r
0.004 high
SDI
749,68
6
55,931
,661
113 152 351 751
15 SRR
5233
637
small
intesti
ne
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
3,047,
629
72,720
,367
84 315 544 5616
16 SRR
5233
639
lymp
h
node
T
cell
rich
0.105 high
SDI
3,256,
697
65,998
,918
1346 6143 9054 5873
17 TCG
A-
CZ-
4862
renal
cell
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
16,784 58,529
,923
0 50 710 642
18 TCG
A-
CZ-
5463
renal
cell
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
806 122,63
2,451
0 16 366 845
19 TCG
A-
CZ-
5985
renal
cell
T
cell
poo
r
0
(unde
tecte
d)
high
SDI
12,998 83,862
,519
0 23 282 915
Figure 2.1 Basic metrics of the dataset
a. Bar plot and strip plot of the number of RNA-Seq reads in million across different tissue types.
b. Bar plot and strip plot of the number of TCR-Seq reads in million across different tissue types.
23
c. Bar plot and strip plot of the number of TCRβ clonotype counts from TCR-Seq across
different tissue types. PBMC samples are shown in blue, lymph node sample is shown in orange,
melanoma samples are shown in green, small intestine sample is shown in red, and the kidney
sample are shown in purple.
24
25
Figure 2.2 The distribution of clonotype groups in each sample
The clonotypes are categorized into five groups based on the clonotype frequencies.
Hyperexpanded clonotypes (cyan) are the clones with frequencies between 0.01 to 1, large
clonotypes (yellow) are the clones with frequencies between 0.001 to 0.01, medium clonotypes
(gray) are the clones with frequencies between 0.0001 to 0.001, small clonotypes (magenta) are
the clones with frequencies between 0.00001 to 0.0001, rare clonotypes (cambridge leather) are
the clones with frequencies between 0 to 0.00001. a-d. The results of low SDI samples. e-s. The
results of high SDI samples.
26
2.2.2 Choice of RNA-Seq-based TCR profiling methods
MiXCR supports various sequencing technologies, while ImReP, TRUST4, and CATT
are designed mainly for RNA-Seq data (Table 2.3). The input format varied across tools, MiXCR
requires RNA-Seq raw reads, ImReP accepts only aligned reads, TRUST4 and CATT accept
both raw reads and aligned reads. The details on how the RNA-Seq-based methods were run are
available in Table 2.4.
Table 2.3 Overview of evaluated RNA-Seq-based TCR profiling methods’ parameters,
publication details, and technical characteristics
RNA-Seq-based TCR profiling methods are sorted by the year of publication (indicated in the
column “Published year”). We documented the name of the methods (indicated in the column
“Method”), the version of methods that were used in this benchmarking project (indicated in the
column “Version”), the alignment algorithms (indicated in the column “Alignment/Assembly
algorithm”), the programming language (indicated in the column “Programming language”), the
input file type (indicated in the column “Input file type”), the types of sequencing data accepted
by the method (indicated in the column “Types of sequencing data”), the organism accepted by
the method (indicated in the column “Organism”), the journal that the method was published in
(indicated in the column “Journal”), the computational tools that the method was compared to in
the publication (indicated in the column “In the publication compared to”), the method webpage
(indicated in the column “Method webpage”).
Method Versi
on
Alignmen
t/Assemb
ly
algorithm
Publis
hed
year
Progra
mming
langua
ge
Input
file
type
Types
of
suppor
ted
data
Organi
sm
Journal In the
publication
compared
to
Webpage
MiXCR 3.0.13 modified
Smith-
Waterma
n and
Needlem
an-
Wunsch
2015 Java FAST
A,
FAST
Q,
FAST
Q[.gz
]
Paire
d-end
FAST
Q[.gz
]
AIRR-
seq,
RNA-
seq
Huma
n,
mouse,
rat
Nature
Method
s
MiTCR,
Decombina
tor,
IgBLAST,
IMGT/Hig
hV
QUEST
https://m
ixcr.com
27
ImReP 1.0 Not
required
2020 Python BAM RNA-
seq
Huma
n,
mouse
Nature
Commu
nication
MiXCR,
IMSEQ,
IgBLAST
https://gi
thub.com
/Mangul-
Lab-
USC/Im
ReP
TRUST
4
1.0.2 Seed-
extension
2021 C/C++ BAM,
FAST
Q
RNA-
seq
Huma
n
Nature
Method
s
MiXCR,
CATT,
TRUST3
https://gi
thub.com
/liulab-
dfci/TR
UST4
CATT 1.9.1 BioAlign
ments
2020 Julia BAM,
FAST
Q
RNA-
seq
Huma
n,
mouse,
swine
Bioinfo
rmatics
MiXCR,
IMSEQ,
LymAnaly
zer,
TraCeR,
RTCR
http://bio
info.life.
hust.edu.
cn/CAT
T
Table 2.4 Execution of RNA-Seq-based TCR profiling methods
We documented the required modules for running the methods (indicated in the column “Load
module”), the command used to process the RNA-Seq data (indicated in the column “Execute”),
the link to the manual for methods (indicated in the column “Manual”).
RNA-Seq-based
methods
Load module Execute Manual
MiXCR module load gcc/8.3.0
module load
openjdk/11.0.2
/your/path/mixcr-
3.0.13/mixcr analyze
shotgun --species hs -
-starting-material rna
--only-productive
sample_1.fastq
sample_2.fastq
sample
https://mixcr.com/mix
cr/reference/overview
-analysis-overview
ImReP module load gcc/4.9.4
module load
python/2.7.16
python2
/your/path/ImReP/Im
ReP.py --
extendedOutput --
hg38 --bam --
noOverlapStep --
noCast sample.bam
sample.cdr3
https://github.com/ma
ndricigor/ImReP/wiki
28
TRUST4 module load gcc/9.2.0
export
LC_CTYPE=en_US.
UTF-8
export
LC_ALL=en_US.UT
F-8
/your/path/TRUST4/r
un-trust4 -b
sample.bam -f
hg38_bcrtcr.fa --ref
human_IMGT+C.fa
https://github.com/liu
lab-dfci/TRUST4
CATT module load gcc/8.3.0
module load
python/3.7.6
module load
julia/1.3.1
module load bwa:
bwa/0.7.17
module load
samtools/1.10
/your/path/CATT/catt
–f1 sample_1.fastq –
f2 sample_2.fastq -o
outputName -t 2
https://github.com/Gu
oBioinfoLab/CATT
2.2.3 Isolate genomic DNA and RNA
Genomic DNA and RNA from patient-derived transgenic NY-ESO-1 TCR PBMCs and
normal control PBMCs (Nowicki et al., 2019) were isolated with an AllPrep DNA/RNA
isolation kit according to the manufacturer’s instructions (Qiagen). For patient melanoma
biopsies treated with anti-PD-1 blockade (Tumeh et al., 2014), genomic DNA was isolated from
formalin-fixed, paraffin-embedded tumor biopsy shavings on an Anaprep 12 nucleic acid
extraction platform (BioChain), while RNA was isolated from tumor samples which had been
preserved in RNAlater (Qiagen) and stored at -80℃ using an AllPrep RNA Isolation kit, as
above. Melanin was then removed from visibly pigmented samples using a PCR Inhibitor
Removal Kit (Zymo Research, Irvine, CA).
2.2.4 Generate DNA-based TCR-Seq data
TCRβ alleles were sequenced at 100,000 reads by Adaptive Biotechnologies. Briefly, this
process utilizes a synthetic immune repertoire, corresponding to every possible biological
29
combination of Variable (V) and Joining (J) gene segments for each TCR locus, spiked into
every sample at a known concentration. These inline controls enable rigorous quality assurance
for every sample assayed and allow for correction of multiplex PCR amplification bias,
providing an absolute quantitative measure of T cells containing the transgenic TCR relative to
the other endogenous TCR clonotypes, with no difference in amplification efficiency (Carlson et
al., 2013). Productive TCRβ sequences, i.e., those that could be translated into open reading
frames and did not contain a stop codon, were reported.
2.2.5 Analyze PBMC and melanoma biopsy TCR-Seq data
The assembled CDR3 sequences and corresponding VDJ recombinations were obtained
from Adaptive Biotechnologies. We have only considered clones supported by at least two reads.
Only full-length clonotypes (starting with C and ending F) were considered. Other clonotypes
were filtered out.
2.2.6 RNA-Seq of PBMC and melanoma biopsy samples
For both PBMC and melanoma biopsy samples, mRNA libraries were generated using
the Kapa Stranded mRNA Kit (Roche) and were subjected to 150 bp paired-end sequencing
(PBMCs) or 100 bp paired-end sequencing (melanoma biopsies) on a HiSeq 3000 platform.
2.2.7 Download renal carcinoma clear cell samples
We used RNA-seq data from the TCGA that corresponds to three renal carcinoma clear
cell samples sequenced by RNA-Seq and TCR-Seq. RNA-seq data are from Illumina HiSeq
sequencing of 50-bp paired-end reads. We downloaded the mapped and unmapped reads in BAM
format from the TCGA portal (TCGA-CZ-5463-01A, TCGA-CZ-5985-01A, TCGA-CZ-4862-
30
01A). The corresponding TCR-Seq data was downloaded from immuneACCESS® portal
(https://clients.adaptivebiotech.com/pub/Liu-2016-NatGenetics).
2.2.8 Download ileocecal lymph node and small intestine samples
We downloaded the RNA-Seq data of one sample of melanoma specimens from the
ileocecal lymph node (SPX6730, run: SRR5233639) and another sample of melanoma specimens
from the small intestine (SPX8151, run: SRR5233637) from SRA (Accession: PRJNA371303).
The corresponding TCR-Seq data was downloaded from
https://figshare.com/articles/dataset/Antigen_receptor_repertoires_profiling_from_RNA-
Seq/4620739?file=9787642.
2.2.9 Run RNA-Seq-based repertoire profiling tools
RNA-Seq-based repertoire profiling tools were run using the directions provided by each
of the respective tools. Wrappers were then prepared in order to run each of the respective tools
as well as create standardized log files. When running the tools, we have chosen the most
appropriate parameters. For MiXCR, we used the single analyze shotgun command. MiXCR
accepts the raw fastq files as input for generating a report of the clonotypes present within the
sample. Starting with the RNA sequence data, we mapped the reads onto the human chromosome
to generate the bam file. Together with the bam file and the indexed bam file generated from
samtools, we ran ImReP, TRUST4, or CATT to extract TCR clonotypes and corresponding
supporting reads. The computational pipeline to compare RNA-Seq-based repertoire profiling
methods with the gold standard is open source, free to use under the MIT license, and available
at https://github.com/Mangul-Lab-USC/TCR-profiling-benchmark.
31
2.2.10 Estimate TCR diversity and relative frequency
To estimate the diversity within each sample we only consider clonotypes that are
supported by at least 2 reads. The relative frequency of each clonotype is then calculated as the
number of supporting reads divided by the total number of reads supporting all clonotypes that
occur at least twice.
Shannon Diversity Index:
𝐻 = − ∑[(𝑝𝑖 ) × 𝐼𝑛 (𝑝𝑖 )]
𝑛 𝑖 =1
Clonality:
1 − 𝐻 /[𝐼𝑛 (𝑛 )]
pi = frequency of clonotype i
n = number of unique clonotype in the sample
2.2.11 Classify T cell rich and T cell poor tissues
We first ran Salmon (Patro et al., 2017) to quantify transcript levels from RNA-Seq data
with the reference transcriptome for homo sapiens. Then we used the Ensembl Human genes
(GRCh38.p13) as the reference. When converting transcript expression levels to gene expression
levels, the isoform with the highest expression transcript was used. Later, we ran the Gene
Expression Deconvolution Interactive Tool (GEDIT) (Nadel, Lopez, et al., 2021) v2.0 to
estimate the T cell fraction in each sample. Based on the recommendation of the benchmarking
results (Nadel, Oliva, et al., 2021), the Human Skin Signatures matrix was used as the reference
in our study. We estimated the total T cell fraction as the sum of fractions of CD3, CD4, and
CD8 cells. Samples with estimated T cell fractions greater than 0.05 are considered from T cell
rich tissues, otherwise from T cell poor tissues.
32
2.2.12 Generate Hill curves
We have used Hill curves functionality provided by pyTCR (Peng et al., 2022) to
estimate TCR diversity and generate Hill curves. We computed the diversity estimates when q
values varied from 1 to 6 across different RNA-Seq-based TCR profiling methods.
2.2.13 Computationally modify the properties of RNA-Seq samples
To reduce the length of RNA-Seq reads, we split each original sequence (150bp) from the
raw fastq files into 2 and 3 segments and got new 75bp and 50bp sequences. The sequences were
written into new fastq files. We also computationally reduced the number of TCR derived reads
in order to generate in silico samples based on the T cell rich low SDI samples. The clonotypes
that can be detected by the RNA-Seq based methods were randomly selected, in proportion to the
original clonotype frequency in the samples. The total numbers of TCR derived reads from each
tool were reduced to around 310, 1300, 2600, 13500, and 28000.
2.2.14 Additional dataset from healthy individuals
Ten samples from peripheral blood mononuclear cells (PBMC) samples collected from
healthy individuals were sequenced by TCR-Seq via 5’RACE with unique molecular identifiers
(UMI) and RNA-Seq technologies. The use of human materials was approved by the Office for
the Protection of Research Subjects (OPRS) at the University of Southern California (IRB#HS-
16-00274). TCR-Seq data was processed with MiXCR for clonotype identification and
quantification. The overview of the additional dataset was recorded in Table 2.5.
Table 2.5 Overview of the samples in the additional dataset
The samples are sorted by the sample name (indicated in the column “Sample name”). We
documented the tissue of which the sample was taken (indicated in the column “Tissue”), the
33
tissue type based on the T cell levels (indicated in the column “Tissue type”), the repertoire type
based on the Shannon Diversity Index (indicated in the column “Repertoire type”), the number
of TCR-Seq reads from TCR-Seq data (indicated in the column “Number of TCR-Seq reads”),
the number of RNA-Seq reads from RNA-Seq data (indicated in the column “Number of RNA-
Seq reads”), the number of derived reads from RNA-Seq data after processing through RNA-
Seq-based TCR profiling methods (indicated in the columns “Number of derived reads from
RNA-Seq (MiXCR)”, “Number of derived reads from RNA-Seq (ImReP)”, “Number of derived
reads from RNA-Seq (TRUST4)”, and “Number of derived reads from RNA-Seq (CATT)”,
respectively).
N
o.
Sam
ple
name
Tissu
e
Tiss
ue
type
(T
cell
rich
or
poor
)
T
cell
fracti
on
Repe
rtoire
type
(high
SDI
or
low
SDI)
Numb
er of
TCR-
Seq
reads
Numb
er of
RNA-
Seq
reads
Numb
er of
derive
d
reads
from
RNA-
Seq
(MiX
CR)
Numbe
r of
derived
reads
from
RNA-
Seq
(ImRe
P)
Numbe
r of
derive
d reads
from
RNA-
Seq
(TRUS
T4)
Numb
er of
derive
d
reads
from
RNA-
Seq
(CAT
T)
1 adds
ampl
e01
PBM
C
T
cell
rich
0.196 high
SDI
8286 101,92
4,084
342 268 1472 8704
2 adds
ampl
e02
PBM
C
T
cell
rich
0.3 high
SDI
20,897 104,35
2,202
788 511 2756 8849
3 adds
ampl
e03
PBM
C
T
cell
rich
0.274 high
SDI
52,379 89,333
,304
124 108 799 6803
4 adds
ampl
e04
PBM
C
T
cell
rich
0.36 high
SDI
224,04
9
120,84
0,832
323 246 1480 9012
5 adds
ampl
e05
PBM
C
T
cell
rich
0.152 high
SDI
59,517 90,101
,915
1139 649 3571 8168
6 adds
ampl
e06
PBM
C
T
cell
rich
0.463 high
SDI
29,586 97,169
,398
1521 914 4352 10,181
7 adds
ampl
e07
PBM
C
T
cell
rich
0.144 high
SDI
115,95
7
99,240
,419
354 261 1296 7599
34
8 adds
ampl
e08
PBM
C
T
cell
rich
0.306 high
SDI
68,388 107,43
0,151
704 499 2815 8952
9 adds
ampl
e09
PBM
C
T
cell
rich
0.247 high
SDI
453,90
6
93,539
,596
285 195 1346 7650
10 adds
ampl
e10
PBM
C
T
cell
rich
0.225 high
SDI
104,95
9
103,81
8,947
514 297 1907 9031
2.2.15 Compare computational resources across RNA-Seq-based repertoire profiling tools
We ran MiXCR, ImRep, TRUST4, and CATT separately on four selected samples
(sample02, sample05, sample13, TCGA-CZ-5463). The CPU time and RAM usage were
recorded. A high-performance computing cluster (HPCC) was utilized to acquire the results. We
set up the bash file by one node, eight cores per node, and 16 GB Memory Efficiency for each
run. The CPU time was recorded in seconds and the RAM was computed in gigabytes (GB).
2.2.16 Code availability
All code required to produce the figures and analysis performed in this paper is freely
available at https://github.com/Mangul-Lab-USC/TCR-profiling-benchmark.
2.2.17 Data availability
All RNA-Seq data used in the paper is available on Sequence Read Archive (SRA) or
The Cancer Genome Atlas (TCGA) portal. RNA-Seq data for five PBMC samples are available
in SRA (BioProject ID: PRJNA812076). RNA-Seq data for three renal carcinoma clear cell
samples are available in the TCGA portal (TCGA-CZ-5463-01A, TCGA-CZ-5985-01A, TCGA-
CZ-4862-01A). RNA-Seq data for one sample of melanoma specimens from the ileocecal lymph
node is available in SRA (SPX6730, run: SRR5233639) and another sample of melanoma
35
specimens from the small intestine is available in SRA (SPX8151, run: SRR5233637). TCR-Seq
data for three renal carcinoma clear cell samples are available
https://clients.adaptivebiotech.com/pub/Liu-2016-NatGenetics, TCR-Seq data for one sample of
melanoma specimens from the ileocecal lymph node and another sample of melanoma specimens
from the small intestine are available
https://figshare.com/articles/dataset/Antigen_receptor_repertoires_profiling_from_RNA-
Seq/4620739?file=9787642.
2.3 Results
We assembled the largest multi-cohort dataset which was composed of 19 samples and 5
tissue types (Table 2.1, Table 2.2, Figure 2.1). These samples include peripheral blood
mononuclear cells (PBMC) (N=5), melanoma biopsy samples (N=9) (Tumeh et al., 2014;
Nowicki et al., 2019), renal clear cell carcinoma (N=3) (B. Li et al., 2016) samples, and
melanoma specimens from the ileocecal lymph node (N=1) and the small intestine (N=1)
(Bolotin et al., 2017). All the samples were sequenced by both TCR-Seq (only for TCRβ chain)
and RNA-Seq. The samples have different T cell levels based on deconvolution results provided
by GEDIT (Nadel, Lopez, et al., 2021), and different diversity levels based on Shannon Diversity
Index (SDI) and clonality. We defined the samples with SDI less than 2 as low SDI samples,
otherwise high SDI samples (Figure 2.3). The distribution of clonotypes in the samples was
shown in Figure 2.2. Available RNA-Seq-based repertoire profiling methods that are designed to
assemble the complementarity determining regions 3 (CDR3) from RNA-Seq data were included
in this benchmarking study, which includes MiXCR (Bolotin et al., 2015), ImReP (Mandric et
al., 2020), TRUST4 (Song et al., 2021, p. 4), and CATT (Chen et al., 2020) (Table 2.3, Table
2.4). We excluded IMSEQ (Kuchenbecker et al., 2015) because it was originally designed for
36
TCR or immunoglobulin (Ig) sequencing, but not for RNA-Seq. The details of the samples and
methods are described in the method section.
Figure 2.3 The Shannon Diversity Index (SDI) (a), clonality (b), and estimated T cell fraction (c)
of each sample
Samples with SDI less than 2 are considered low SDI samples, otherwise high SDI samples. The
low SDI samples are shown in blue and the high SDI samples are shown in orange in a and b.
Samples with estimated T cell fraction greater than 0.05 are considered from T cell rich tissues,
otherwise from T cell poor tissues. The T cell rich samples are shown in blue and T cell poor
samples are shown in orange in c.
2.3.1 RNA-Seq-based TCR profiling methods were able to successfully capture TCRβ low
SDI repertoires across various T cell rich tissues
First, we have investigated the ability of RNA-Seq-based TCR profiling methods to
characterize TCRβ repertoires across different repertoire types and tissue types. The capturing
ability of the RNA-Seq-based methods was examined based on the sum of the TCR-Seq
confirmed TCRβ clonotypes frequencies that each method was able to capture. Each distinct
amino acid sequence was considered as one unique clonotype in the TCRβ repertoire.
The surveyed RNA-Seq-based methods were able to capture 93.2% of clonotypes in T
cell rich low SDI samples and 76.45% in T cell poor low SDI samples on average. The portions
of clonotypes that could be captured were much lower in high SDI samples. The results of the
37
portions from each tool were similar across T cell rich samples, CATT and TRUST4 had
statistically significant higher portions in T cell poor high SDI samples than MiXCR (Table 2.6).
In Figure 2.4, we presented the average fraction of cumulative frequencies of assembled
TCR clonotypes by RNA-Seq-based methods with clonotype frequency greater than the value
shown on the x-axis. Similar to the previous study (Bolotin et al., 2017), we observed that
MiXCR, TRUST4, and CATT were able to capture all the clonotypes with frequencies above
0.03% in T cell rich low SDI samples. All surveyed methods captured all the clonotypes with
frequencies above 0.036% in T cell rich low SDI samples and clonotypes with frequencies above
2.66% in T cell poor low SDI samples (Figure 2.4a-d). Moreover, they were able to capture all
the clonotypes with frequencies above 1.553% in T cell rich high SDI samples (Figure 2.4e-h).
However, only TCRβ clonotypes with relatively high frequencies in the T cell poor high SDI
samples were captured by ImRep, TRUST4, and CATT, but not by MiXCR. Among ImRep,
TRUST4, and CATT, TRUST4 and CATT achieved the best performance compared to ImReP in
the T cell poor high SDI samples, with TRUST4 and CATT being able to capture all the TCRβ
clonotypes with frequencies greater than 10.1%, while this threshold was 40.5% for ImReP
(Figure 2.4e-h). One reason that may have caused this was that on average TCRβ derived reads
from RNA-Seq reads in T cell poor tissues were much smaller than those in T cell rich tissues
(4.1 vs. 655.3 TCR derived reads per million RNA-Seq reads) (Table 2.7, Table 2.8, Figure 2.5).
The results for individual samples were shown in Figure 2.6.
We further examined the ability of the surveyed methods to capture the most abundant
clonotypes (top five and top ten clonotypes in each sample) in T cell rich tissues. MiXCR,
TRUST4, and CATT were able to capture the entire set of the top five clonotypes in T cell rich
low SDI samples while ImReP was able to capture 93.3% of those clonotypes. For the top ten
38
clonotypes in T cell rich low SDI samples, MiXCR, ImReP, TRUST4, and CATT were able to
capture 93.3%, 86.7%, 96.7%, and 93.3%, respectively. While in the T cell rich high SDI
samples, 68.3% of the top five clonotypes and 54.2% of the top ten clonotypes were captured by
these methods on average.
Table 2.6 The comparison of the portion of the TCR clonotypes that are captured by the tools
The portions were recorded in percentage. The mean, standard deviation, and p-values of the t-
test were presented.
T cell rich low SDI samples
(n=3)
p-values T cell poor low SDI sample
(n=1)
p-values
MiXCR ImRep MiXCR ImRep
92.9%+00.9
%
93.2%+00.8
%
0.638 77.1% 75.9% N/A
MiXCR TRUST4 MiXCR TRUST4
92.9%+00.9
%
93.4%+00.7
%
0.434 77.1% 76.8% N/A
MiXCR CATT MiXCR CATT
92.9%+00.9
%
93.3%+00.8
%
0.603 77.1% 76% N/A
ImRep TRUST4 ImRep TRUST4
93.2%+00.8
%
93.4%+00.7
%
0.752 75.9% 76.8% N/A
ImRep CATT ImRep CATT
93.2%+00.8
%
93.3%+00.8
%
0.963 75.9% 76% N/A
TRUST4 CATT TRUST4 CATT
93.4%+00.7
%
93.3%+00.8
%
0.784 76.8% 76% N/A
T cell rich high SDI samples
(n=3)
p-values T cell poor high SDI samples
(n=12)
p-values
39
MiXCR ImRep MiXCR ImRep
8.5%+8% 12%+11.9% 0.638 6.6%+7.6% 13.5%+14.4
%
0.154
MiXCR TRUST4 MiXCR TRUST4
8.5%+8% 13.8%+12.9
%
0.434 6.6%+7.6% 19.7%+17.6
%
0.027
MiXCR CATT MiXCR CATT
8.5%+8% 10.7%+9.6% 0.603 6.6%+7.6% 18.7%+15.1
%
0.02
ImRep TRUST4 ImRep TRUST4
12%+11.9% 13.8%+12.9
%
0.752 13.5%+14.4
%
19.7%+17.6
%
0.36
ImRep CATT ImRep CATT
12%+11.9% 10.7%+9.6% 0.963 13.5%+14.4
%
18.7%+15.1
%
0.399
TRUST4 CATT TRUST4 CATT
13.8%+12.9
%
10.7%+9.6% 0.784 19.7%+17.6
%
18.7%+15.1
%
0.886
Table 2.7 The average number of TCRβ derived reads by RNA-Seq-based methods in different
tissue types
Tissue Average TCRβ derived reads (mean+SD)
MiXCR ImReP TRUST4 CATT Overall
PBMC 45,224+5418
7
68,653+89,47
2
91,349+107,7
95
74,584+86,52
9
69,953+81,32
4
Melanoma
biopsies
202+340 215+264 665+888 3041+1516 1031+1473
Renal cell 0 30+18 453+227 801+142 321+363
Lymph node 1346 6143 9054 5873 5604+3183
Small 84 315 544 5616 1640+2657
40
intestine
Table 2.8 The average number of TCRβ derived reads by RNA-Seq-based methods in different
repertoire types
Repertoire Average TCRβ derived reads (mean+SD)
MiXCR ImReP TRUST4 CATT Overall
T cell rich
low SDI
75,303+49,7
98
113,933+91,22
2
151,518+98,29
9
122,382+80,0
37
115,784+75,455
T cell rich
high SDI
520+717 2536+3185 3749+4684 3882+2080 2672+2949
T cell
poor low
SDI
1,071 844 2,855 5052 2456+1951
T cell
poor high
SDI
69+92 125+127 420+309 2528+1730 785+1334
41
Figure 2.4 The ability to capture TCR-Seq confirmed clonotypes by RNA-Seq-based methods in
low SDI samples (a-d) and high SDI samples (e-h)
The x-axis corresponds to TCR-seq confirmed clonotypes with a frequency of Z on a log scale.
The y-axis corresponds to the average fraction of assembled TCR clonotypes by RNA-Seq-based
methods with clonotype frequency greater than Z across samples. The samples with no clonotype
(by RNA-Seq or TCR-Seq) at a given frequency are excluded to compute the average portion.
The brown dash lines indicate the minimal clonotype frequency above which all the clonotypes
are captured. The numbers on the brown dash lines are the actual values for the threshold. Area
plots show the proportion of the TCR clonotypes captured by MiXCR, ImReP, TRUST4, and
CATT based on clonotype frequencies. Results from T cell rich low SDI samples, T cell poor
42
low SDI samples, T cell rich high SDI samples, and T cell poor high SDI samples are shown in
blue, orange, green, and red, respectively.
Figure 2.5 The number of TCR derived reads by RNA-Seq-based TCR profiling methods per 1
million RNA-Seq reads in different types of samples
T cell rich low SDI samples (rich_low_SDI), T cell rich high SDI samples (rich_high_SDI), T
cell poor low SDI samples (poor_low_SDI), and T cell poor high SDI samples (poor_high_SDI).
The y-axis corresponds to the number of TCR derived reads on a log scale. The results from
MiXCR are represented in blue, the results from ImReP are represented in orange, the results
from TRUST4 are represented in green, and the results from CATT are represented in red.
43
44
Figure 2.6 The capability of capturing TCR-Seq confirmed clonotypes by RNA-Seq-based
methods in each sample
Line plots with x-axis corresponding to TCR-seq confirmed clonotypes with a frequency of Z,
and y-axis corresponding to the fraction of assembled TCR repertoire by RNA-Seq-based
methods with clonotype frequency greater than Z in (a-c) T cell rich low SDI samples, (d-f) T
cell rich high SDI samples, (g) T cell poor low SDI sample, and (h-s) T cell poor high SDI
samples. The capability of capturing by MiXCR, ImReP, TRUST4, and CATT are represented in
blue, orange, green, and red, respectively.
45
2.3.2 RNA-Seq-based TCR profiling methods were able to effectively estimate clonality in T
cell rich tissues
We evaluated the ability to effectively estimate the diversity of TCR repertoire by RNA-
Seq-based methods. The diversity is estimated by Shannon Diversity Index (SDI) and the
absolute error is used to examine the feasibility of the methods to estimate diversity. The
differences between the Shannon Diversity Index estimated from TCR-Seq results and RNA-Seq
TCR derived results were measured using the absolute error. The average absolute error of
diversity estimation from RNA-Seq-based methods compared to TCR-Seq results in T cell rich
tissues was 2.48. The best performance was achieved by TRUST4 with an absolute error of 2.07.
When we compared the absolute error of diversity estimation in different types of repertoires
among T cell rich samples, all the methods offered more precise diversity estimation in low SDI
samples than in high SDI samples. The average absolute error in T cell rich low SDI samples
was 0.57 while the average absolute error in T cell rich high SDI samples was 4.4 (Figure 2.7a).
Notably, MiXCR had the best performance in estimating diversity in low SDI samples but worse
performance in high SDI samples (Figure 2.7a). These tools had similar performance in
estimating diversity among low SDI samples (Table 2.9). In contrast to T cell rich low SDI
samples, all RNA-Seq-based TCR profiling methods showed poor performance in high SDI
samples (Figure 2.7a). Among the T cell poor low SDI samples, the average absolute error was
2.23 across the surveyed methods, which was larger than in T cell rich low SDI samples (Figure
2.7a). TRUST4 had statistically significantly better estimation of diversity in both T cell rich and
T cell poor high SDI samples than MiXCR and CATT (Table 2.9). For example, TRUST4 a had
smaller absolute error of SDI compared with CATT in T cell rich high SDI samples (TRSUT4:
3.49, CATT: 5.43, p=0.041).
46
T cell level and SDI classification can be confounded by each other. For example, T cell
level affects the Shannon Diversity Index, especially for high SDI samples. We examined the
fraction of the top ten TCR clonotypes in each sample (Table 2.10). On average, the fraction of
the top ten TCR clonotypes was higher in T cell rich tissues than it was in T cell poor tissues
(49.2% vs. 29.87%). Therefore, we also calculated clonality to account for the substantial
differences in the number of clonotypes which the Shannon Diversity Index doesn’t account for
directly. All these methods provided precise estimates of clonality for low SDI samples of T cell
rich tissues with an average absolute error of 0.0677 (Figure 2.7b). Among the surveyed
methods, MiXCR provided the most precise estimates for both clonality and diversity in T cell
rich low SDI samples, with average absolute errors of 0.015 and 0.141, respectively. However,
these were not statistically different (Table 2.9 and Table 2.11). Different from T cell rich
tissues, all the survey methods were not able to provide accurate clonality estimates in T cell
poor tissues, with average absolute errors of 0.184 in high SDI samples and 0.376 in low SDI
samples (Figure 2.7b). MiXCR and TRUST4 provided statistically more accurate clonality
estimates in T cell poor high SDI samples than CATT (Table 2.11). Additionally, we observe
consistent estimates of diversity and clonality across different SDI thresholds to classify low or
high SDI samples (Table 2.12).
Additionally, we investigated the ability of RNA-Seq-based methods to accurately
estimate the diversity and clonality when singleton (clonotypes supported by a single read) TCRs
were included. The presence of singletons in the sample decreases the accuracy of estimation for
all methods. The average absolute error of SDI in T cell rich tissues across all methods was
slightly higher than the samples excluding singleton TCRs, with values of 2.67 and 2.48,
respectively. Similar to the samples excluding singleton TCRs, the surveyed methods provided
47
reliable estimates of diversity in T cell rich low SDI samples (Figure 2.8a). Except for CATT,
other methods (MiXCR, ImRep, and TRUST4) had precise clonality estimates in T cell rich
tissues regardless of repertoire types (Figure 2.8b).
Furthermore, we have employed the Hill curves to access a broad spectrum of diversity
metrics, allowing systematically examining the ability of RNA-Seq-based methods to accurately
estimate the TCR diversity of a sample. Hill curves approach (Chao et al., 2014) represents an
array of diversity indices corresponding to various q values using different diversity metrics
(Table 2.13). These diversity metrics include the exponential of the SDI (when q value is 1) and
the inverse Simpson index (when q value is 2). Hill curves simultaneously account for the total
number of clonotypes and expansion of the clonotypes (Greiff et al., 2015). As the q value
increases, the clonotypes with high relative abundance have a larger contribution to the overall
diversity estimates. We have used pyTCR (Peng et al., 2022) to compute and plot the Hill curves.
In order to evaluate the ability of RNA-Seq-based methods to accurately estimate the TCR
diversity, we have compared the RNA-Seq based diversity estimates across various q values to
the diversity estimates based on TCR-Seq results, of which the latter was considered the gold
standard. The closer the estimates to the TCR-Seq based results, the more accurate the diversity
estimates were considered. According to all diversity metrics provided by Hill curves, RNA-Seq-
based methods provided the most accurate estimates of TCR diversity for T cell rich low SDI
samples compared to other tissue types and repertoire types (Figure 2.9a-c). We noticed that
RNA-Seq-based methods overestimated the diversity of TCR repertoires when the q value was 1
in low SDI samples, the estimations were more precise when q values were greater than 1
(Figure 2.9a-d). Even though the absolute errors of SDI estimates were relatively small in low
SDI samples, the exponential of the SDI (corresponds to the diversity estimate when the q value
48
is 1), may exaggerate the differences between diversity estimates. For high SDI samples, RNA-
Seq-based profiling methods underestimated the diversity regardless of the q values or tissue
types (Figure 2.9e-s).
Even though these methods captured much fewer clonotypes compared to TCR-Seq
results in T cell rich high SDI samples, if the clonotype counts were taken into account by
calculating clonality, these methods were able to offer comparable clonality estimates in T cell
rich high SDI samples as they were in T cell rich low SDI samples, with absolute errors of 0.113
and 0.068, respectively.
Table 2.9 The comparison of the absolute error of the Shannon Diversity Index (SDI) across
tools
The mean, standard deviation, and p-values of the t-test were presented.
T cell rich low SDI samples
(n=3)
p-values T cell poor low SDI sample
(n=1)
p-values
MiXCR ImRep MiXCR ImRep
0.14+0.12 0.81+0.68 0.166 2.22 2.64 N/A
MiXCR TRUST4 MiXCR TRUST4
0.14+0.12 0.65+0.56 0.199 2.22 2.53 N/A
MiXCR CATT MiXCR CATT
0.14+0.12 0.66+0.45 0.127 2.22 1.54 N/A
ImRep TRUST4 ImRep TRUST4
0.81+0.68 0.65+0.56 0.768 2.64 2.53 N/A
ImRep CATT ImRep CATT
0.81+0.68 0.66+0.45 0.753 2.64 1.54 N/A
TRUST4 CATT TRUST4 CATT
0.65+0.56 0.66+0.45 0.995 2.53 1.54 N/A
49
T cell rich high SDI samples
(n=3)
p-values T cell poor high SDI samples
(n=12)
p-values
MiXCR ImRep MiXCR ImRep
5.11+0.53 3.56+1.11 0.094 4.79+0.7 4.13+1.07 0.137
MiXCR TRUST4 MiXCR TRUST4
5.11+0.53 3.49+0.73 0.036 4.79+0.7 3.57+1.2 0.018
MiXCR CATT MiXCR CATT
5.11+0.53 5.43+0.86 0.612 4.79+0.7 4.91+1.4 0.835
ImRep TRUST4 ImRep TRUST4
3.56+1.11 3.49+0.73 0.932 4.13+1.07 3.57+1.2 0.238
ImRep CATT ImRep CATT
3.56+1.11 5.43+0.86 0.082 4.13+1.07 4.91+1.4 0.138
TRUST4 CATT TRUST4 CATT
3.49+0.73 5.43+0.86 0.041 3.57+1.2 4.91+1.4 0.019
Table 2.10 The fraction of the top ten abundant TCR clonotypes across the samples
Sample name Class Top ten clonotype abundance (%)
sample01 T cell rich low SDI 90.68
sample02 T cell rich low SDI 91.42
sample03 T cell rich low SDI 92.74
sample04 T cell rich high SDI 4.86
sample05 T cell rich high SDI 10.67
sample06 T cell poor high SDI 6.09
sample07 T cell poor high SDI 52.52
sample08 T cell poor high SDI 9.34
sample09 T cell poor high SDI 19.15
50
sample10 T cell poor high SDI 26.17
sample11 T cell poor high SDI 10.76
sample12 T cell poor high SDI 25.22
sample13 T cell poor low SDI 87.43
sample14 T cell poor high SDI 17.03
SRR5233637 T cell poor high SDI 16.74
SRR5233639 T cell rich high SDI 4.86
TCGA-CZ-4862 T cell poor high SDI 14.85
TCGA-CZ-5463 T cell poor high SDI 71.46
TCGA-CZ-5985 T cell poor high SDI 31.57
Table 2.11 The comparison of the absolute error of clonality across tools
The mean, standard deviation, and p-values of the t-test were presented.
T cell rich low SDI samples
(n=3)
p-values T cell poor low SDI sample
(n=1)
p-values
MiXCR ImRep MiXCR ImRep
0.02+0.01 0.1+0.08 0.156 0.44 0.49 N/A
MiXCR TRUST4 MiXCR TRUST4
0.02+0.01 0.08+0.07 0.189 0.44 0.4 N/A
MiXCR CATT MiXCR CATT
0.02+0.01 0.08+0.06 0.117 0.44 0.17 N/A
ImRep TRUST4 ImRep TRUST4
0.1+0.08 0.08+0.07 0.725 0.49 0.4 N/A
ImRep CATT ImRep CATT
0.1+0.08 0.08+0.06 0.761 0.49 0.17 N/A
TRUST4 CATT TRUST4 CATT
0.08+0.07 0.08+0.06 0.934 0.4 0.17 N/A
51
T cell rich high SDI samples
(n=3)
p-values T cell poor high SDI samples
(n=12)
p-values
MiXCR ImRep MiXCR ImRep
0.06+0.05 0.07+0.07 0.833 0.12+0.06 0.14+0.08 N/A
MiXCR TRUST4 MiXCR TRUST4
0.06+0.05 0.05+0.05 0.836 0.12+0.06 0.09+0.05 0.32
MiXCR CATT MiXCR CATT
0.06+0.05 0.28+0.2 0.612 0.12+0.06 0.36+0.15 0.0005
ImRep TRUST4 ImRep TRUST4
0.07+0.07 0.05+0.05 0.711 0.14+0.08 0.09+0.05 N/A
ImRep CATT ImRep CATT
0.07+0.07 0.28+0.2 0.148 0.14+0.08 0.36+0.15 N/A
TRUST4 CATT TRUST4 CATT
0.05+0.05 0.28+0.2 0.112 0.09+0.05 0.36+0.15 8.49e-06
Table 2.12 The comparison of the absolute error of SDI and clonality when varying SDI
thresholds to classify low or high SDI samples
The mean and standard deviation were presented.
Class T cell rich low
SDI
T cell rich high
SDI
T cell poor low
SDI
T cell poor high
SDI
SDI
thresh
old
Tool SDI clonalit
y
SDI clonalit
y
SDI clonalit
y
SDI clonalit
y
1 MiXC
R
0.17 0.007 3.12+2
.75
0.04+0
.04
N/A N/A 4.51+1
.08
0.15+0
.12
ImRep 0.91 0.12 2.44+1
.79
0.07+0
.08
N/A N/A 4.01+1
.1
0.17+0
.13
TRUS
T4
0.65 0.09 2.36+1
.68
0.06+0
.06
N/A N/A 3.49+1
.18
0.12+0
.1
52
CATT 0.62 0.09 3.53+2
.69
0.2+0.
18
N/A N/A 4.65+1
.63
0.35+0
.16
2 MiXC
R
0.14+0
.12
0.02+0
.01
5.11+0
.53
0.06+0
.05
2.22 0.44 4.79+0
.7
0.12+0
.06
ImRep 0.81+0
.68
0.1+0.
08
3.56+1
.11
0.07+0
.07
2.64 0.49 4.13+1
.07
0.14+0
.08
TRUS
T4
0.65+0
.56
0.08+0
.07
3.49+0
.73
0.05+0
.05
2.53 0.4 3.57+1
.2
0.09+0
.05
CATT 0.66+0
.45
0.08+0
.06
5.43+0
.86
0.28+0
.2
1.54 0.17 4.91+1
.4
0.36+0
.15
3 MiXC
R
0.14+0
.12
0.02+0
.01
5.11+0
.53
0.06+0
.05
2.22 0.44 4.79+0
.7
0.12+0
.06
ImRep 0.81+0
.68
0.1+0.
08
3.56+1
.11
0.07+0
.07
2.15+0
.7
0.4+0.
14
4.35+0
.77
0.12+0
.06
TRUS
T4
0.65+0
.56
0.08+0
.07
3.49+0
.73
0.05+0
.05
1.46+1
.52
0.28+0
.17
3.86+0
.68
0.09+0
.05
CATT 0.66+0
.45
0.08+0
.06
5.43+0
.86
0.28+0
.2
1.58+0
.06
0.2+0.
03
5.21+0
.99
0.38+0
.16
4 MiXC
R
0.14+0
.12
0.02+0
.01
5.11+0
.53
0.06+0
.05
2.22 0.44 4.79+0
.7
0.12+0
.06
ImRep 0.81+0
.68
0.1+0.
08
3.56+1
.11
0.07+0
.07
2.15+0
.7
0.4+0.
14
4.35+0
.77
0.12+0
.06
TRUS
T4
0.65+0
.56
0.08+0
.07
3.49+0
.73
0.05+0
.05
1.46+1
.52
0.28+0
.17
3.86+0
.68
0.09+0
.05
CATT 0.66+0
.45
0.08+0
.06
5.43+0
.86
0.28+0
.2
1.58+0
.06
0.2+0.
03
5.21+0
.99
0.38+0
.16
5 MiXC
R
0.14+0
.12
0.02+0
.01
5.11+0
.53
0.06+0
.05
2.22 0.44 4.79+0
.7
0.12+0
.06
ImRep 0.81+0
.68
0.1+0.
08
3.56+1
.11
0.07+0
.07
2.99+1
.54
0.4+0.
14
4.32+0
.8
0.12+0
.06
53
TRUS
T4
0.65+0
.56
0.08+0
.07
3.49+0
.73
0.05+0
.05
2.21+1
.68
0.23+0
.15
3.87+0
.72
0.08+0
.05
CATT 0.66+0
.45
0.08+0
.06
5.43+0
.86
0.28+0
.2
2.23+1
.12
0.21+0
.04
5.38+0
.86
0.39+0
.16
Table 2.13 The formulas for Hill curves
p represents the frequency of a clonotype and n represents the number of clonotypes.
Q value Formula
1
𝑒 ∑ −[(𝑝𝑖 )×𝐼𝑛 (𝑝𝑖 )]
𝑛 𝑖 =1
]
2
1
∑ (𝑝𝑖 )
2 𝑛 𝑖 =1
3
(∑ 𝑝 𝑖 𝑞 𝑛 𝑖 =1
)
(−1/2)
4
(∑ 𝑝 𝑖 𝑞 𝑛 𝑖 =1
)
(−1/3)
5
(∑ 𝑝 𝑖 𝑞 𝑛 𝑖 =1
)
(−1/4)
6
(∑ 𝑝 𝑖 𝑞 𝑛 𝑖 =1
)
(−1/5)
54
Figure 2.7 TCR-Seq-based TCRβ and RNA-Seq-based diversity and clonality comparison
a-b. Bar plot of absolute error between diversity (a) and clonality (b) estimated based on RNA-
Seq data and TCR-Seq data in T cell rich low SDI samples (rich_low_SDI, T cell rich high SDI
samples (rich_high_SDI), T cell poor low SDI samples (poor_low_SDI), and T cell poor high
SDI samples (poor_high_SDI). Each dot on the bar represents each sample. The results from
MiXCR are represented in blue, the results from ImReP are represented in orange, the results
from TRUST4 are represented in green, and the results from CATT are represented in red.
Figure 2.8 TCR-Seq-based TCRβ and RNA-Seq-based diversity and clonality comparison when
singleton TCRs are included
55
a-b. Bar plot of absolute error between diversity (a) and clonality (b) estimated based on RNA-
Seq data and TCR-Seq data in T cell rich low SDI samples (rich_low_SDI, T cell rich high SDI
samples (rich_high_SDI), T cell poor low SDI samples (poor_low_SDI), and T cell poor high
SDI samples (poor_high_SDI). Each dot on the bar represents each sample. The results from
MiXCR are represented in blue, the results from ImReP are represented in orange, the results
from TRUST4 are represented in green, and the results from CATT are represented in red.
56
57
Figure 2.9 Hill curves of each sample to estimate the diversity of TCR repertoire
a-c. T cell rich low SDI samples. d. T cell poor low SDI sample. e-g. T cell rich high SDI
samples. h-s. T cell poor high SDI samples. The x-axis represents q values, and the y-axis
represents the diversity estimates. The results from TCR-Seq are presented in purple, the results
from MiXCR are presented in blue, the results from ImRep are presented in orange, the results
from TRUST4 are presented in green, and the results from CATT are presented in red.
58
2.3.3 RNA-Seq-based TCR profiling methods were able to successfully estimate relative
frequencies of clonotypes in low SDI samples
While differences in diversity and clonality estimates were observed, we further
examined whether the surveyed methods were able to provide accurate estimates of the
frequency of assembled clonotypes in different types of repertoires. The results showed that all
four RNA-Seq based methods accurately estimated the relative frequencies in low SDI samples
(Figure 2.10a-h), especially in T cell rich tissues (MiXCR: r=1, p<0.001; ImReP: r=0.9999,
p<0.001; TRUST4: r=0.9999, p<0.001; CATT: r=0.9999, p<0.001). MiXCR had the best linear
correlation between RNA-Seq-based clonotype frequencies and TCR-Seq-based clonotype
frequencies compared to other methods. The linear correlations between the clonotype
frequencies measured by TCR-Seq and RNA-Seq in T cell rich low SDI samples were shown in
Figure 2.11. Additionally, the clonotype frequencies between results from TCR-Seq and from
RNA-Seq were positively correlated in T cell rich high SDI samples (MiXCR: r=0.7942, p=7.3e-
85; ImReP: r=0.7335, p<0.001; TRUST4: r=0.7205, p<0.001; CATT: r=0.8226, p=3.1e-245)
(Figure 2.10i-l). When comparing the clonotype frequencies in T cell poor high SDI samples, the
positive correlations were weaker compared to other tissue and repertoire types (MiXCR:
r=0.7334, p=4.2e-28; ImRep: r=0.5225, p=6.6e-23; TRUST4: r=0.5859, p=3e-48; CATT:
r=0.518, p=4.6e-27) (Figure 2.10m-p).
We also examined the scenario when singleton TCRs were included in the samples. In
this case, the correlations between RNA-Seq based frequencies and TCR-Seq based frequencies
were similar to the cases when singleton TCRs were filtered out (Figure 2.12). Furthermore, we
noticed that including singletons had a positive effect on the ability to estimate clonotype
59
frequencies in T cell rich high SDI samples (MiXCR: r=0.908, p<0.001; ImReP: r=0.8857,
p<0.001; TRUST4: r=0.9121, p<0.001; CATT: r=0.9263, p<0.001) (Figure 2.12i-l).
These findings suggested that all surveyed RNA-Seq-based methods were able to provide
accurate estimations of relative frequencies of the detected clonotypes in low SDI samples, while
the positive correlations were weaker in high SDI samples. CATT provided the best result in the
correlation of clonotypes in T cell rich high SDI samples than those provided by other methods.
60
61
Figure 2.10 Correlation of TCRβ clonotype frequency based on the TCR-Seq data (x-axis) and
the TCR derived reads from RNA-seq (y-axis)
Only clonotypes assembled from both TCR-Seq and RNA-Seq data are presented. The clonotype
frequencies are renormalized to only the set of clonotypes in the TCR-Seq and RNA-Seq
repertoire space. The clonotype frequencies are log-transformed. Pearson correlation coefficients
and the corresponding p-values are calculated and reported. a-d. The results of T cell rich low
SDI samples (blue). e-h. The results of T cell poor low SDI samples (orange). i-l. The results of
T cell rich high SDI samples (green). m-p. The results of T cell poor high SDI samples (red).
62
63
Figure 2.11 Correlation of TCRβ clonotype frequency based on the TCR-Seq data (x-axis) and
the TCR derived reads from RNA-seq (y-axis) in T cell rich low SDI samples
Only clonotypes assembled from both TCR-Seq and RNA-Seq data are presented. The clonotype
frequencies are renormalized to only the set of clonotypes in the TCR-Seq and RNA-Seq
repertoire space. The clonotype frequencies are log-transformed. Pearson correlation coefficients
and the corresponding p-values are calculated and reported. The results from MiXCR are
represented in blue, the results from ImReP are represented in orange, the results from TRUST4
are represented in green, and the results from CATT are represented in red. a-d. sample01
results. e-h. sample02 results. i-l. sample03 results.
64
65
Figure 2.12 Correlation of TCRβ clonotype frequency based on the TCR-Seq data (x-axis) and
the TCR derived reads from RNA-seq (y-axis) when singleton TCRs are included
Only clonotypes assembled from both TCR-Seq and RNA-Seq data are presented. The clonotype
frequencies are renormalized to only the set of clonotypes in the TCR-Seq and RNA-Seq
repertoire space. The clonotype frequencies are log-transformed. Pearson correlation coefficients
and the corresponding p-values are calculated and reported. a-d. The results of T cell rich low
SDI samples (blue). e-h. The results of T cell poor low SDI samples (orange). i-l. The results of
T cell rich high SDI samples (green). m-p. The results of T cell poor high SDI samples (red).
2.3.4 Fewer clonotypes were detected by RNA-Seq-based TCR profiling methods compared
to TCR-Seq-based methods
Next, we compared the number of clonotypes detected by RNA-Seq-based methods
compared to TCR-Seq-based methods. On average, TRUST4 captured the most clonotypes
among four RNA-Seq-based methods with 4,451 in T cell rich low SDI samples and 1,038 in T
66
cell rich high SDI samples, while MiXCR had the lowest with 1,723 in T cell rich low SDI
samples and 145 in T cell rich high SDI samples (Figure 2.13a). The clonotype counts detected
by MiXCR were statistically significantly smaller than the clonotype counts detected by other
methods in T cell rich low SDI samples (Table 2.14). When we considered the absolute count of
the TCR-Seq confirmed clonotypes, TRUST4 had the largest number of clonotypes confirmed
by TCR-Seq, while MiXCR had the lowest (Figure 2.13b). For example, on average, TRUST4
detected 538 TCR-Seq confirmed clonotypes and MiXCR only detected 317 in T cell rich low
SDI samples. However, these results were not statistically different when compared across tools
(Table 2.15). These TCR-Seq confirmed clonotypes were highly overlapped across the survey
methods in T cell rich low SDI samples compared to high SDI samples (Figure 2.13c-f).
However, when we considered the fraction of validated TCRs from RNA-Seq-based methods
when varying clonotype abundance thresholds, ImRep and TRUST4 captured all the
hyperexpanded clonotypes in T cell rich low SDI samples. MiXCR had the highest percentage of
detecting large clonotypes in all samples. CATT had the lowest fractions of TCR-Seq confirmed
hyperexpanded and large clonotypes in the majority of the tissue types and repertoire types
(Figure 2.13g-j). Combined with our findings above, these results suggested that ImRep,
TRUST4, and CATT were able to detect more clonotypes and TCR-Seq confirmed clonotypes
than MiXCR in T cell rich tissues. However, CATT had higher portions of the detected
hyperexpanded and large clonotypes that were not confirmed by TCR-Seq than all other
surveyed methods in T cell rich tissues and T cell poor high SDI samples.
Table 2.14 The comparison of the clonotype counts across tools
The mean, standard deviation, and p-values of the t-test were presented.
T cell rich low SDI samples
(n=3)
p-values T cell poor low SDI sample
(n=1)
p-values
67
MiXCR ImRep MiXCR ImRep
1723+802 3862+466 0.016 96 116 N/A
MiXCR TRUST4 MiXCR TRUST4
1723+802 4451+723 0.012 96 175 N/A
MiXCR CATT MiXCR CATT
1723+802 3885+746 0.027 96 242 N/A
ImRep TRUST4 ImRep TRUST4
3862+466 4451+723 0.302 116 175 N/A
ImRep CATT ImRep CATT
3862+466 3885+746 0.967 116 242 N/A
TRUST4 CATT TRUST4 CATT
4451+723 3885+746 0.399 175 242 N/A
T cell rich high SDI samples
(n=3)
p-values T cell poor high SDI samples
(n=12)
p-values
MiXCR ImRep MiXCR ImRep
145+181 966+1217 0.312 25+15 38+40 0.401
MiXCR TRUST4 MiXCR TRUST4
145+181 1038+1192 0.269 25+15 67+59 0.068
MiXCR CATT MiXCR CATT
145+181 612+552 0.236 25+15 92+53 0.003
ImRep TRUST4 ImRep TRUST4
966+1217 1038+1192 0.945 38+40 67+59 0.171
ImRep CATT ImRep CATT
966+1217 612+552 0.67 38+40 92+53 0.01
TRUST4 CATT TRUST4 CATT
1038+1192 612+552 0.605 67+59 92+53 0.287
68
Table 2.15 The comparison of the TCR-Seq confirmed clonotype counts across tools
The mean, standard deviation, and p-values of the t-test were presented.
T cell rich low SDI samples
(n=3)
p-values T cell poor low SDI sample
(n=1)
p-values
MiXCR ImRep MiXCR ImRep
317+152 478+160 0.273 17 14 N/A
MiXCR TRUST4 MiXCR TRUST4
317+152 538+161 0.158 17 16 N/A
MiXCR CATT MiXCR CATT
317+152 461+169 0.334 17 15 N/A
ImRep TRUST4 ImRep TRUST4
478+160 538+161 0.672 14 16 N/A
ImRep CATT ImRep CATT
478+160 461+169 0.902 14 15 N/A
TRUST4 CATT TRUST4 CATT
538+161 461+169 0.597 16 15 N/A
T cell rich high SDI samples
(n=3)
p-values T cell poor high SDI samples
(n=12)
p-values
MiXCR ImRep MiXCR ImRep
128+184 667+976 0.401 20+13 26+28 0.594
MiXCR TRUST4 MiXCR TRUST4
128+184 701+980 0.376 20+13 42+40 0.14
MiXCR CATT MiXCR CATT
128+184 331+441 0.503 20+13 31+29 0.313
ImRep TRUST4 ImRep TRUST4
667+976 701+980 0.968 26+28 42+40 0.24
ImRep CATT ImRep CATT
69
667+976 331+441 0.616 26+28 31+29 0.631
TRUST4 CATT TRUST4 CATT
701+980 331+441 0.583 42+40 31+29 0.433
70
71
Figure 2.13 TCRβ derived clonotype counts from RNA-Seq data
a. Bar plot and strip plot of the number of TCRβ derived clonotype counts in T cell rich low SDI
samples (rich_low_SDI), T cell rich high SDI samples (rich_high_SDI), T cell poor low SDI
samples (poor_low_SDI), and T cell poor high SDI samples (poor_high_SDI). b. Bar plot and
strip plot of the number of TCRβ derived clonotypes that are confirmed by TCR-Seq in T cell
rich low SDI samples (rich_low_SDI), T cell rich high SDI samples (rich_high_SDI), T cell poor
low SDI samples (poor_low_SDI), and T cell poor high SDI samples (poor_high_SDI). c-f.
Venn diagrams of TCR-Seq confirmed clonotypes across MiXCR, ImReP, TRUSTR4, and
CATT in T cell rich tissues. The numbers indicate the sum of clonotype counts across three low
SDI samples (c-d) and four high SDI samples (e-f). The overlapped numbers indicate the number
of TCR-Seq confirmed clonotypes that are presented in results from more than one method. g-j.
Bar plot and strip plot of the fractions of validated TCRs from RNA-Seq-based methods based
on different clonotype abundance thresholds (hyperexpanded, large, medium, small, rare). The
results from MiXCR are represented in blue, the results from ImReP are represented in orange,
the results from TRUST4 are represented in green, and the results from CATT are represented in
red.
2.3.5 The effect of the number of TCR derived reads on the performance of RNA-Seq-
based TCR profiling methods
We examined how the number of receptor derived reads from RNA-Seq data influences
the diversity estimates and capturing ability of the RNA-Seq-based methods. TCR derived reads
were calculated by the total number of TCR reads matching the assembled clonotypes detected
by RNA-Seq-based TCR profiling methods. To investigate the effect of TCR derived reads, we
subsampled the RNA-Seq TCR derived reads from the original to 310, which 310 is the average
TCR derived reads in T cell poor tissues, based on the original RNA-Seq assembled TCR
clonotype frequencies from surveyed methods in three T cell rich low SDI samples. We first
measured the ability of RNA-Seq-based methods to estimate the diversity among these
computationally modified samples by using the Shannon Diversity Index (SDI). The results
indicated an increased number of TCR derived reads resulted in a more precise diversity estimate
(Figure 2.14a).
72
We then investigated the portion of repertoire captured by these methods depending on
the number of TCR derived reads presented in the sample. As the number of receptor derived
reads decreases, the methods were more likely to miss the clonotypes with low abundance. For
example, all TCRβ clonotypes with frequencies above 0.094% were captured across all four
surveyed methods when the number of TCR derived read was 310 (Figure 2.14b-e). While,
without reducing TCR derived reads, all TCRβ clonotypes with frequencies greater than 0.045%
were captured. The surveyed methods failed to detect low abundant clonotypes also contributed
to the reduced accuracy in estimating diversity with the reduced number of TCR derived reads.
Figure 2.14 The absolute error of RNA-Seq based TCRβ diversity and TCR-Seq based TCRβ
diversity and capability of capturing TCR-Seq confirmed clonotypes by RNA-Seq-based
methods in computationally modified low SDI samples with reduced reads
a. Bar plots that present the absolute error of SDI with respect to the total number of TCR
derived reads across the RNA-Seq-based methods, with error bars with 95% confidence interval.
73
b-e. Line plots with x-axis corresponding to TCR-seq confirmed clonotypes with a frequency of
Z, and y-axis corresponding to the fraction of assembled TCR repertoire by RNA-Seq-based
methods with clonotype frequency greater than Z. Results from estimated TCR derived reads of
310, 1300, 2600, 13500, and 28000 are presented in blue, orange, green, red, and purple,
respectively.
2.3.6 The effect of read length on the performance of RNA-Seq-based TCR profiling
methods
We investigated the effect of RNA-Seq read length on the ability of RNA-Seq-based
methods to characterize the TCR repertoires by measuring the portion of clonotypes that can be
captured as well as diversity estimates. The length of RNA-Seq read was computationally
reduced to 50bp and 75bp for PBMC samples. We were able to generate results from MiXCR,
ImRep and TRUST4 with reduced read length, however, CATT failed to produce the result for
one 50bp T cell rich low SDI sample.
RNA-Seq-based methods were able to capture the majority of the clonotypes in T cell
rich low SDI samples even with the reduced read length (Figure 2.15a-d). Except for CATT,
other methods captured at least 91.8% of the clonotypes regardless of the read length that we
examined in T cell rich low SDI samples. MiXCR, ImRep, and TRUST4 captured all the
clonotypes with clonotype frequencies greater than 0.094% when the RNA-Seq samples had
50bp read length in T cell rich low SDI samples, which this threshold was 0.036% with the
original read length of 150bp. The portions of TCR clonotypes that can be captured were similar
in T cell rich low SDI samples among all read lengths by MiXCR, ImRep, and TRUST4.
However, such portions dropped substantially with reduced read length in high SDI samples
(Figure 2.15e-h). Only MiXCR captured all abundant clonotypes with reduced read lengths in T
cell rich high SDI samples. Furthermore, when we examined the diversity estimate in the
reduced read length samples, we noticed that the diversity estimates were only reliable in T cell
74
rich low SDI samples with reduced read length with an average absolute error of 0.58 in 50bp
read length samples and 0.48 in 75bp read length samples (Figure 2.15i). In T cell rich high SDI
samples, the accuracy of estimating diversity was worse. The absolute error decreased with an
increased RNA-Seq read length in ImReP results, while the accuracy was similar between
MiXCR results from 75bp and 150bp (Figure 2.15j). Selected RNA-Seq-based TCR profiling
methods provided reliable results in T cell rich low SDI samples across a variety of sequencing
parameters.
75
76
Figure 2.15 The capability of capturing TCR-Seq confirmed clonotypes by RNA-Seq-based
methods and absolute error of RNA-Seq based TCRβ diversity and TCR-Seq based TCRβ
diversity from reduced RNA-Seq read length results
a-h. Line plots with x-axis corresponding to TCR-seq confirmed clonotypes with a frequency of
Z, and y-axis corresponding to the fraction of assembled TCR repertoire by RNA-Seq-based
methods with clonotype frequency greater than Z in T cell rich low SDI samples (a-d) and T cell
rich high SDI samples (e-h). i-j. Bar plots that present the absolute error of SDI with respect to
RNA-Seq read length in T cell rich low SDI samples and T cell rich high SDI samples across the
RNA-Seq-based methods, with error bars with 95% confidence interval. Results from RNA-Seq
read lengths 50bp, 75bp, and 150bp are presented in blue, orange, and green, respectively.
2.3.7 Complement the TCRβ chain analysis with TCRɑ chain analysis
TCR-Seq requires separate library preparation and sequencing for ɑ and β chains, while
RNA-Seq can capture all four TCR chains (ɑ, β, γ and δ chains) at the same time with single
library preparation and sequencing. We compared the results for the TCRɑ chain with the TCRβ
chain across different tissue types and repertoire types. However, we have determined that delta
and gamma chains did not have enough reads to be reliably profiled.
First, we compared the number of TCRɑ and TCRβ derived reads from RNA-Seq. Except
for CATT, the average number of TCR ɑ chain reads derived from RNA-Seq was higher than the
number of TCR β chain reads derived from RNA-Seq in T cell rich low SDI samples, while more
reads from β chains than ɑ chains in T cell rich high SDI samples (Figure 2.16a-b). The number
of TCRɑ and TCRβ derived reads were highly positively correlated in T cell rich low SDI
samples across MiXCR, ImRep, and TRUST4, but not CATT (MiXCR: r=0.9998, p=0.013;
ImReP: r=0.9956, p=0.06; TRUST4: r=0.9855, p=0.11; CATT: r=0.549, p=0.63).
Next, we compared the number of TCRɑ and TCRβ clonotypes. Similar results were
observed that more TCRɑ and TCRβ clonotypes were captured by RNA-Seq-based methods in
low SDI samples than in high SDI samples among T cell rich tissues (Figure 2.16c-d). Other than
CATT, the number of clonotypes from ɑ and β chains was comparably similar across repertoire
77
types by RNA-Seq-based methods. All methods provided highly similar numbers of TCRɑ and
TCRβ clonotypes in T cell rich tissues. Additionally, the clonotype count correlations estimated
by MiXCR and TRUST4 were highly positive in T cell poor high SDI samples.
Figure 2.16 TCRɑ chain and TCRβ chain metrics comparison
a-b. TCR derived reads from RNA-Seq reads per one million RNA-Seq reads in T cell rich low
SDI samples (a) and T cell rich high SDI samples (b). c-d. TCRɑ and TCRβ chain clonotype
counts in T cell rich low SDI samples (c) and T cell rich high SDI samples (d). The results from
TCRɑ chain are shown in blue and the results from TCRβ are shown in orange.
2.3.8 Performance of RNA-Seq methods in the samples from healthy individuals
In order to examine whether the RNA-Seq-based TCR profiling methods perform
differently between cancer and non-cancer RNA-Seq samples, we performed the same analyses
78
as described above on 10 PBMC samples from healthy individuals (Table 2.16). Based on the
SDI values and predicted T cell levels, these samples were classified as T cell rich high SDI
samples (Figure 2.17). All four methods captured the clonotypes with frequencies above 2.39%
(Figure 2.18). The metrics were calculated and compared (Figure 2.19-2.22), these showed
similar results to those from samples from cancer patients.
Table 2.16 Overview of the samples in the additional dataset
The samples are sorted by the sample name (indicated in the column “Sample name”). We
documented the tissue of which the sample was taken (indicated in the column “Tissue”), the
tissue type based on the T cell levels (indicated in the column “Tissue type”), the repertoire type
based on the Shannon Diversity Index (indicated in the column “Repertoire type”), the number
of TCR-Seq reads from TCR-Seq data (indicated in the column “Number of TCR-Seq reads”),
the number of RNA-Seq reads from RNA-Seq data (indicated in the column “Number of RNA-
Seq reads”), the number of derived reads from RNA-Seq data after processing through RNA-
Seq-based TCR profiling methods (indicated in the columns “Number of derived reads from
RNA-Seq (MiXCR)”, “Number of derived reads from RNA-Seq (ImReP)”, “Number of derived
reads from RNA-Seq (TRUST4)”, and “Number of derived reads from RNA-Seq (CATT)”,
respectively).
N
o.
Sam
ple
name
Tissu
e
Tiss
ue
type
(T
cell
rich
or
poo
r)
T
cell
fracti
on
Repe
rtoire
type
(high
SDI
or
low
SDI)
Numb
er of
TCR-
Seq
reads
Numb
er of
RNA-
Seq
reads
Numb
er of
derive
d
reads
from
RNA-
Seq
(MiX
CR)
Numbe
r of
derived
reads
from
RNA-
Seq
(ImRe
P)
Numbe
r of
derive
d reads
from
RNA-
Seq
(TRUS
T4)
Numb
er of
derive
d
reads
from
RNA-
Seq
(CAT
T)
1 adds
ampl
e01
PBM
C
T
cell
rich
0.196 high
SDI
8286 101,92
4,084
342 268 1472 8704
2 adds
ampl
e02
PBM
C
T
cell
rich
0.3 high
SDI
20,897 104,35
2,202
788 511 2756 8849
3 adds
ampl
e03
PBM
C
T
cell
rich
0.274 high
SDI
52,379 89,333
,304
124 108 799 6803
79
4 adds
ampl
e04
PBM
C
T
cell
rich
0.36 high
SDI
224,04
9
120,84
0,832
323 246 1480 9012
5 adds
ampl
e05
PBM
C
T
cell
rich
0.152 high
SDI
59,517 90,101
,915
1139 649 3571 8168
6 adds
ampl
e06
PBM
C
T
cell
rich
0.463 high
SDI
29,586 97,169
,398
1521 914 4352 10,181
7 adds
ampl
e07
PBM
C
T
cell
rich
0.144 high
SDI
115,95
7
99,240
,419
354 261 1296 7599
8 adds
ampl
e08
PBM
C
T
cell
rich
0.306 high
SDI
68,388 107,43
0,151
704 499 2815 8952
9 adds
ampl
e09
PBM
C
T
cell
rich
0.247 high
SDI
453,90
6
93,539
,596
285 195 1346 7650
10 adds
ampl
e10
PBM
C
T
cell
rich
0.225 high
SDI
104,95
9
103,81
8,947
514 297 1907 9031
Figure 2.17 The Shannon Diversity Index (SDI) (a), clonality (b), and estimated T cell fraction
(c) of samples from 10 healthy individuals
Samples with SDI less than 2 are considered low SDI samples, otherwise high SDI samples.
Samples with estimated T cell fractions greater than 0.05 are considered T cell rich tissues,
otherwise T cell poor tissues. These samples are classified as all T cell rich high SDI samples.
80
Figure 2.18 The ability to capture TCR-Seq confirmed clonotypes by RNA-Seq-based methods
in samples from 10 healthy individuals
The x-axis corresponds to TCR-seq confirmed clonotypes with a frequency of Z on a log scale.
The y-axis corresponds to the average fraction of assembled TCR clonotypes by RNA-Seq-based
methods with clonotype frequency greater than Z across samples. The samples with no clonotype
at each frequency are excluded to compute the average portion. The brown dash lines indicate
the minimal clonotype frequency above which all the clonotypes are captured. The numbers on
the brown dash lines are the actual values for the threshold. Area plots show the proportion of the
TCR clonotypes captured by MiXCR, ImReP, TRUST4, and CATT based on clonotype
frequencies.
81
Figure 2.19 TCR-Seq-based TCRβ and RNA-Seq-based diversity and clonality comparison in
samples from 10 healthy individuals
a-b. Bar plot of absolute error between diversity (a) and clonality (b) estimated based on RNA-
Seq data and TCR-Seq data. Each dot on the bar represents each sample. The results from
MiXCR are represented in blue, the results from ImReP are represented in orange, the results
from TRUST4 are represented in green, and the results from CATT are represented in red.
82
Figure 2.20 Correlation of TCRβ clonotype frequency based on the TCR-Seq data (x-axis) and
the TCR derived reads from RNA-seq (y-axis) in samples from 10 healthy individuals
Only clonotypes assembled from both TCR-Seq and RNA-Seq data are presented. The clonotype
frequencies are renormalized to only the set of clonotypes in the TCR-Seq and RNA-Seq
repertoire space. The clonotype frequencies are log-transformed. Pearson correlation coefficients
and the corresponding p-values are calculated and reported.
83
Figure 2.21 TCRβ derived clonotype counts from RNA-Seq data
a. Bar plot and strip plot of the number of TCRβ derived clonotype counts in samples from 10
healthy individuals. b. Bar plot and strip plot of the number of TCRβ derived clonotypes that are
confirmed by TCR-Seq in samples from 10 healthy individuals. c-d. Venn diagrams of TCR-Seq
confirmed clonotypes across MiXCR, ImReP, TRUSTR4, and CATT. The numbers indicate the
sum of clonotype counts across 10 samples. The overlapped numbers indicate the number of
TCR-Seq confirmed clonotypes that are presented in results from more than one method. e. Bar
plot and strip plot of the fractions of validated TCRs from RNA-Seq-based methods based on
different clonotype abundance thresholds (hyperexpanded, large, medium, small, rare). The
results from MiXCR are represented in blue, the results from ImReP are represented in orange,
the results from TRUST4 are represented in green, and the results from CATT are represented in
red.
84
Figure 2.22 TCRɑ chain and TCRβ chain metrics comparison
a. TCR derived reads from RNA-Seq reads per one million RNA-Seq reads in samples from 10
healthy individuals. b. TCRɑ and TCRβ chain clonotype counts in samples from 10 healthy
individuals. The results from TCRɑ chain are shown in blue and the results from TCRβ are
shown in orange.
2.3.9 Computational resources comparison across RNA-Seq-based tools
We compared the run time and required computational resources across MiXCR, ImRep,
TRUST4, and CATT. We selected four samples with RNA-Seq reads ranging from 40,524,817
to 122,632,451, and recorded the central processing unit (CPU) time and memory usage for each
tool to process these samples. In general, ImRep and TRUST4 had shorter processing time
compared to MiXCR and CATT (Figure 2.23a). TRUST4 required the least amount of memory
across the samples we tested (Figure 2.23b).
85
Figure 2.23 Central Processing Unit (CPU) time (a) and memory usage (b) for the selected four
samples across different tools
Each dot on the line plot represents one sample. The x-axis corresponds to RNA-Seq reads, the
y-axis corresponds to the running time in seconds (a) memory usage in gigabytes (GB) (b).
2.4 Discussion
TCR-Seq is a powerful tool to profile TCR repertoires, gain novel insight into
immunological phenomena, and serve as a biomarker for immunotherapy efficacy. However,
such technology is cost-prohibitive in clinical cohorts, and is often unable to be routinely applied
in such settings (Ma et al., 2018). In contrast, RNA-seq is becoming a default technology to
profile patient tissues in clinical pathology (Marco-Puche et al., 2019). By utilizing an RNA-Seq
approach to characterize the presence and relative frequency of TCR sequences within a given
blood sample or T cell rich tissue sample, clinical samples can be more efficiently utilized
without the need for additional dedicated sequencing experiments using TCR-Seq.
This study is the first one to benchmark the performance of RNA-Seq-based TCR
profiling methods in both T cell rich and poor tissues across the largest number of clinical
86
samples. While the previous study only included a certain type of TCR repertoire (Bolotin et al.,
2017), we expanded the examination to a broad range of cancer tissue types and repertoire types.
The repertoire category is highly relevant to the biological features of a sample. A low SDI
sample is more predominated by one or a few TCR clonotypes. Such type of clonotype
presentation is often observed in transgenic TCR cell therapies, where a TCR of interest has been
inserted into a T cell via retroviral or lentiviral vectors, for the treatment of cancer of viral
infections, or in situations where only a few TCR clonotypes are preferentially expanded in the
setting of targeting cancer or viral antigens, thus reflecting their relative biological importance in
a given sample. Conversely, a high SDI sample would be more reflective of circulating T cells
which are not subject to the stressors of chronic antigen exposure, or tumor samples which have
T cell infiltrates directed at a wider variety of tumor neoantigens, which are diverse and unique
to a given patient or individual tumor.
We determined the scenarios under which RNA-Seq analysis using RNA-Seq-based TCR
profiling methods can offer a comparable quality of characterizing immune repertoires. Given
tissue with a sufficient degree of T cell level and with one or a few hyperexpanded clonotypes
made up for the whole repertoires, and with adequate RNA sequencing depth and satisfactory
quality, RNA-Seq-based TCR profiling methods are able to capture the majority of TCRβ
clonotypes and effectively estimate the diversity of the repertoire. Despite the inability of RNA-
Seq-based TCR profiling methods to capture the ultra-rare clonotypes supported by several TCR-
Seq reads, these methods are generally able to capture the clonotypes with a frequency greater
than 1.553% in T cell rich tissues. This suggests that RNA-Seq can potentially complement
TCR-Seq technology in T cell rich tissues to successfully estimate the overall diversity of the
sample and effectively detect clones with greater frequencies. The diversity estimated based on
87
high-throughput measures usually underestimates true diversity (Sanmamed & Chen, 2018;
Kaplinsky & Arnaout, 2016; Laydon et al., 2015; Rempała & Seweryn, 2013), but in cases where
there is a major clonotype existing with high relative frequency than even capturing the small
portion of the repertoire, one can estimate the diversity. However, when comparing different
samples, the commonly used diversity measurement, Shannon Diversity Index (SDI) is limited
when comparing samples with different TCR repertoire sizes. On the other hand, clonality,
which normalizes clonotype counts based on SDI, can be an appealing approach to estimate
diversity (L. Zhang et al., 2017).
In more heterogeneous tissues which are more relatively T cell poor tissues, such as
tumor biopsies, RNA-Seq-based TCR profiling methods are only able to capture the dominant
clonotypes with the highest frequencies. This results in unreliable estimates of diversity and
clonality in such tissues. These suggested that extra caution needs to be taken when utilizing
RNA-Seq-based TCR profiling methods in T cell poor tissues, especially in high SDI samples of
the T cell poor tissues.
The ability to effectively characterize the TCR repertoire depends on the proportion of
the total repertoire captured by a given technology. Even state-of-the-art high-throughput
technologies fail to capture the entire diversity of the T cell repertoire; instead of capturing the
most common clones as this is often sufficient to effectively characterize TCR diversity.
Identification of dominant TCR clones within responding patient biopsies can be useful for the
characterization of potential neoantigens being targeted by the TCRs. Capturing rare clones is
extremely challenging due to the enormous diversity of the individual T cell repertoire and the
limited throughout of commercially available TCR-Seq protocols. As a result, rare clones can
become unsampled and remain undetected in high-throughput measurements of the samples.
88
This may lead to the missing of important TCR information in some autoimmune disease
settings, such as autoimmune hepatitis (Renand et al., 2020) and patients with Sjögren's
Syndrome (Bende et al., 2015). In these diseases, clonotypes with low frequencies may relate to
disease activity. In other disease settings, such as myelodysplastic syndrome, responsiveness to
treatment with hypomethylating agents has been associated with rare and novel TCR clonotypes
which manifest during treatment (Abbas et al., 2021).
Diversity is measured based on the clonotypes that are present in the sequencing results
and their relative frequencies. The surveyed methods are able to provide reliable diversity
estimates in T cell rich low SDI samples but not so in high SDI samples. On average, the
absolute errors of estimating diversity in different types of samples range from 0.57 to 4.4. As
expected, the number of clonotypes and derived reads detected by RNA-Seq-based methods are
always lower than those from TCR-Seq. Clonotypes with low frequencies are not captured by the
RNA-Seq-based methods, this may have caused a systematic under-estimated error in estimating
diversity. In addition to the undetected clonotypes with low abundance, the differences in
clonotype frequency estimation as well as in the number of detected clonotypes could be the
reasons that cause differences in the accuracy of diversity estimates between various tissue types
and repertoire types.
There are a few limitations of our study. First, there is no standardized method to
distinguish between low SDI and high SDI TCR repertoires. We chose the Shannon Diversity
Index of value two as the threshold which represents the distribution of clonotypes in the
samples that we used in the study. We explored various SDI thresholds to classify low or high
SDI samples and observed consistent estimates of clonality and diversity across various
thresholds. Second, the TCR-Seq sequencing protocols that were used were not consistent across
89
different study cohorts. The lymph node and small intestine samples were sequenced by rapid
amplification of 5’ complementary ends (5’RACE) approach, while all other samples were
sequenced by immunoSEQ (Adaptive Biotechnology, Seattle, WA). The biases across multiplex
PCR and 5’RACE are known (Barennes et al., 2021; Dahal-Koirala et al., 2022), however, the
differences between immunoSEQ and other TCR-Seq profiling methods were unknown. Third,
TCR-Seq computational methods used for processing were different across study cohorts as well.
TCR-Seq raw data from lymph node and small intestine samples were processed through
MiXCR while the others were processed via internal methods at Adaptive Biotechnology.
Unfortunately, Adaptive Biotechnology does not share raw TCR-Seq data or bioinformatics
methods (Y.-N. Huang et al., 2022), so we were not able to process all samples through the same
bioinformatics software. The use of MiXCR for processing both TCR-Seq and RNA-Seq on
lymph node and small intestine samples may lead to biases in the results. Fourth, TCR derived
reads corresponded to T cell levels. However, RNA-Seq-based TCR profiling methods may also
detect natural killer (NK) cells. Though NK cells were relatively rare compared to T cells, they
could be added to the number of T cells that were deduced from the TCR reads. By utilizing
GEDIT, we observed values of zeros in several samples, which may not indicate that there were
no T cells in the samples, instead, these samples had low expressions which were below the
detection limit. Fifth, the distribution of the samples is heavily biased towards T cell poor high
SDI samples compared to other categories. The comparison between sample types with different
numbers of samples may lead to over or underestimating the differences in the results. Sixth,
only two types of cancer were included in this study, which were melanoma and kidney renal
clear cell carcinoma (KIRC). The conclusions from different types of cancers are inherently
biological-context dependent. Further validation is needed if more TCR-Seq and RNA-Seq data
90
on the same samples become available. Last, we observed that there were TCR clonotypes
discovered by RNA-Seq-based profiling methods that were not confirmed by TCR-Seq. These
may be false positives of RNA-Seq-based methods or false negatives of the TCR-Seq method
which are true TCR clonotypes that are not captured by TCR-Seq. However, the benchmarking
approach used in this project is incapable of differentiating between false positives of RNA-Seq-
based methods and false negatives of the TCR-Seq method. The presence of false positives in the
results of RNA-Seq-based methods may limit the application of RNA-Seq-based TCR profiling
in clinical settings.
While available approaches require separate assays for the detection of the ɑ and β chains
of the TCR, RNA-Seq-based TCR profiling methods are able to detect both ɑ and β chains
simultaneously from a single library preparation, reducing the overall cost of the assay. With the
increasing emphasis on personalized medicine, there is an increasing demand for cost-effective
and reproducible measures in order to comprehensively characterize cancer cells and profile
immune cell populations. This technique can be employed in large-scale clinical cohorts.
Further, the detection of TCR sequences at the RNA level provides increased assurances that
such sequences are transcriptionally active and more likely to be biologically relevant. It also
provides opportunities to reuse the existing RNA-Seq data instead of initiating new TCR-
Sequencing. Finally, RNA-Seq has already been used in clinical specimens and provides an
additional analysis tool for use on a variety of clinical samples. Overall, RNA-Seq-based
methods are appealing alternatives for profiling T cell receptor repertoires in T cell rich tissues
and low SDI samples when TCR-Seq is not yet available.
91
Chapter 3: pyTCR: A comprehensive and scalable solution for
TCR-Seq data analysis to facilitate reproducibility and rigor of
immunogenomics research
Chapter is published: Peng, K., Moore, J., Vahed, M., Brito, J., Kao, G., Burkhardt, A. M.,
Alachkar, H., & Mangul, S. (2022). pyTCR: A comprehensive and scalable solution for TCR-
Seq data analysis to facilitate reproducibility and rigor of immunogenomics research. Frontiers
in immunology, 13, 954078.
3.1 Introduction
T cell receptor (TCR) repertoire is a collection of all unique TCRs in an individual, which
is formed through the process of V(D)J recombination. With the growing understanding of TCR
repertoire, researchers are able to leverage detailed TCR-Seq datasets to reveal the changes of
TCR repertoires in a variety of human disease states such as cancer , autoimmune diseases
(Simnica et al., 2019; X. Liu et al., 2019), infectious diseases (Emerson et al., 2017; Luo et al.,
2021), and neurodegenerative diseases (Gate et al., 2020; Patas et al., 2018). Thus, these have
helped the biomedical community to deepen the understanding of the roles of the adaptive
immune system and adaptive immune responses. For example, studies have shown the usage and
diversity of TCR repertoires could be utilized to help select the most suitable immunotherapy for
cancer patients (Hogan et al., 2019; Schrama et al., 2017). Thus, effective TCR profiling and
analysis are informative to guide certain cancer treatments, which ultimately enables precision
and personalized medicine.
92
With the rapid development of high-throughput sequencing techniques in the past
decades, TCR-Seq has enabled researchers to effectively characterize TCR repertoires across
various tissue types and diseases with high specificity and sensitivity by targeting TCR loci.
Even with the available TCR profiling methods, TCR repertoire metrics such as diversity, gene
usage, and motif enrichment cannot be easily interpreted directly from TCR-Seq data after initial
TCR profiling. Post-analysis is required to calculate, visualize, and compare the sample level or
population level TCR repertoire characteristics.
Existing bioinformatics tools for TCR repertoire post-analysis are available as R
packages such as VDJTools (Shugay et al., 2015), Immunarch (Nazarov et al., 2022), and HTML
programs such as VisTCR (Ni et al., 2020) and Vidjil (Duez et al., 2016). These tools enable
biomedical researchers to analyze TCR-Seq data, however, multiple barriers and limitations
exist. First, as in any R package, users follow the instructions to enter commands in the
command-line interface and the output will be presented in the summarized tables or figures. The
analytical methods used for the particular step of the analysis are isolated, which can result in a
limited understanding of the details of the analysis. This also increases the probability of human
errors. Relying on the documentation of the tool is often not a reliable solution as it typically
lacks details, has unclear and ambiguous wording, and can be outdated for future users. Second,
the existing TCR-Seq analysis tools need to be installed and utilized with the command-line
interface which can be a challenge for biomedical researchers who lack the required
computational skill sets (Mangul et al., 2019). Third, the output files are generated as individual
files. Biomedical researchers need to work between different files and tools to finish the post-
analysis, which leads to high chances of creating manual mistakes. Last, none of the existing
tools cover all aspects of the TCR-Seq analysis. For example, researchers need to use multiple
93
packages or additional tools for statistical analysis, which adds an additional burden for
biomedical researchers.
Here, we present pyTCR (python TCR analysis), an easy-to-use, interactive, and scalable
solution with a wider range of functionalities compared to existing tools. pyTCR utilizes
interactive computational notebooks to facilitate reproducibility and rigor of performed TCR-Seq
analysis. The availability of well-documented code, visualization, and results of the analysis in a
single notebook will facilitate transparency and reproducibility of performed analysis, make the
users more aware of the details of the metrics and thresholds being used in the analysis, as well
as minimize the possibility of manual mistakes and misinterpretation of the TCR-Seq data
analysis results. Notably, pyTCR provides statistical analysis for the first time in a TCR-Seq
analysis tool, which is not available in the existing tools. We have demonstrated the utility of
pyTCR by applying it to the COVID19-BWNW dataset containing 46 TCR-Seq samples.
Additionally, we have compared the scalability of our tool with the existing tools and have
demonstrated substantial improvement in running time.
3.2 Methods
3.2.1 TCR-Seq data
We used the COVID19-BWNW dataset from the Adaptive ImmuneRace study to
demonstrate the functionality of pyTCR. COVID19-BWNW dataset contains 46 convalescent
COVID-19 patient samples collected at Bloodworks Northwest. Demographic and clinical
features including age, gender, smoking status, ICU admit status, birth year, blood type, CMV at
donation, days from the last symptom to sample date, ethnicity, race, height, weight, and
94
hospitalization status are reported. The extracted genomic DNA was sequenced based on
Multiplex PCR and only for the TCR beta chain by using the MiSeq platform.
3.2.2 TCR-Seq data preprocessing
We downloaded 46 TCR-Seq data samples and the file containing demographic and
clinical features in the tab-separated values (tsv) format. All the demographic and clinical
features were listed in the sample_tags column in the file. The features were split into one in
each column for further analysis.
3.2.3 Statistical analysis
The statistical analysis is available for comparing numerical values across two groups.
We first examine whether the datasets are normally distributed. If the dataset is normally
distributed, we use the student’s t-test to evaluate the statistical significance. Otherwise, we use
Wilcoxon rank-sum test to evaluate the statistical significance. Bonferroni correction is also
included to count for the multiple comparisons across different genes.
3.2.4 Jupyter notebooks
Jupyter notebooks (https://jupyter.org) is a web-based interactive computing platform that
contains code, markdown text, and visualizations. These features enable users to conduct
reproducible and transparent data analysis. We develop pyTCR based on Jupyter notebooks. In
each Jupyter cell, we include code for either calculation for analysis or visualization. Users can
easily change the parameters in the code to generate the results of their interests. Markdown text
is used for instructions and explanations.
95
3.2.5 pyTCR data structure and software implementation
The individual input data file can be text files, tab-separated values (tsv) files, or comma-
separated values (csv) files. pyTCR uses a small suite of robust open-source python libraries to
facilitate complex data analysis and visualizations. Python libraries including Pandas, Numpy,
Matplotlib, Seaborn, and Scipy are used to provide rich functionality with limited amounts of
code. Pandas is used to convert tabularly formatted TCR-Seq data into python data frames for
notebooks to utilize. Numpy is used to perform complex mathematical operations across python
data frames. Matplotlib and Seaborn are then used in tandem to generate rich data visualizations
from the resultant data.
One critical component of pyTCR functionality is the overlap analysis between two
samples. Such operations are unavoidably expensive in terms of computing power. Yet, pyTCR
employs multiple different technical optimizations by default to provide the most optimal
performance for researchers. In the case of any piecewise comparison between two samples, we
first index and group the data into a key-value pair hash table for instantaneous look-up time. We
then uniquely merge samples for comparison and index them using a hash table in the same
manner. By utilizing this method, we are able to negate a large amount of computing time that
would otherwise be associated with searching for the correct samples in the data set. This method
allows us to instantly retrieve the required samples for future look-ups which shifts most of the
computing time from slow searches, back onto the piecewise comparisons.
3.2.6 Comparison with other methods
We used the COVID19-BWNW dataset to compare pyTCR to VDJtools and Immunarch
for benchmarking purposes. We subsampled the COVID19-BWNW dataset to files containing 2,
4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46 TCR-Seq samples.
96
We then ran the overlap analysis of each tool ten times and computed the average CPU time and
RAM usage for each. For the comparison, we utilized a high-performance computing cluster
(HPCC) to acquire the most accurate benchmarking results. However, in the comparison of 2
TCR-Seq samples for VDJtools and Immunarch, the HPCC was unable to record the results due
to the short nature of the task. That is, the benchmark ended too quickly for the HPCC to
accurately record the results. Thus, the results for 2 TCR-Seq samples were not taken with the
average of ten runs. Instead, we recorded the results once by introducing an artificial stall in the
benchmark such that the HPCC had time to record, and then we subtracted the artificial stall time
from the final CPU time. The RAM usage remains unchanged with this workaround.
3.2.7 Data availability
COVID19-BWNW dataset was part of the ImmuneRACE Study, which was downloaded
from the ImmuneCODETM database (https://immunerace.adaptivebiotech.com/data). To
improve the discoverability of the data, we have copied the data to (https://github.com/Mangul-
Lab-USC/pyTCR_publication/tree/main/data/samples). All data required to produce the figures
and analysis performed in this paper are freely available at (https://github.com/Mangul-Lab-
USC/pyTCR_publication). pyTCR is freely available at (https://github.com/Mangul-Lab-
USC/pyTCR). pyTCR is distributed under the terms of the MIT License. The user manual for
pyTCR can be downloaded at (https://github.com/Mangul-Lab-USC/pyTCR/blob/main/User_
Manual_pyTCR.pdf). All code required to produce the figures and analysis performed in this
paper is freely available at (https://github.com/Mangul-Lab-USC/pyTCR_publication).
97
3.2.8 Ethics statement
The ImmuneRACE study involving human participants was reviewed and approved by
the Western Institutional Review Board (reference number 1-1281891-1) on March 17th, 2020.
The participants provided their written informed consent to participate in the ImmuneRACE
study.
3.3 Results
3.3.1 pyTCR: a comprehensive and scalable solution for TCR-seq data analysis
pyTCR, an open-source, user-friendly tool that addresses the issues mentioned above,
offers broader and more comprehensive TCR repertoire analysis with an increased number of
types of analyses compared to the existing tools. Six types of analysis are contained in the
pyTCR, which include basic analysis, clonality analysis, diversity analysis, overlap analysis,
gene usage analysis, and motif analysis (Table 3.1). TCR repertoire metrics, visualization, and
statistical analysis are included in all types of analyses (Figure 3.1, Table 3.1). Our tool for TCR-
Seq data analysis uses interactive computational notebooks for post analysis and visualization of
TCR-Seq data with a rich set of functionalities. Notably, we have used Google Colaboratory
(Google Colab) to provide the cloud option, which is free to use and provides different
subscriptions based on users’ needs. No installation of the software to the local computers is
required if the users choose to use the Google Colaboratory. However, large files can’t be run on
Google Colab if the users do not subscribe to Google Colab. The use of computational notebooks
enables users to execute analysis and produce tabular output and customizable visualization so
that users can use the desired features in their datasets to generate results.
98
The users can clone the GitHub repository to their local computer and utilize the pyTCR
notebooks locally via Jupyter Notebook (Randles et al., 2017). Users who choose to use Google
Colab (cloud option of pyTCR) can upload the data files from local computers, web-based
drives, or GitHub repositories. The results can be downloaded and stored locally or into web-
based drives by the code provided. pyTCR is capable of converting results from pre-processing
software such as MiXCR and ImmunoSEQ to the pyTCR analysis format. The minimal
clonotype information should include the counts of reads, clonotype frequency, CDR3 nucleotide
sequence, CDR3 amino acid sequence, and the inferred V, D, and J genes in the input data files.
pyTCR provides conversion to the corresponding format which consists of the columns of counts
of reads (#count), frequency (freq), CDR3 sequence (cdr3nt), CDR3 amino acids (cdr3aa), V
gene (v), D gene (d), J gene (j), features (if provided), sample name. These should be already
filtered for the non-coding CDR3 by the upstream tools. If the metadata is not available, a
notebook for combining individual files to a metadata file should be executed prior to any TCR-
Seq analysis to reduce the burden of analyzing individual data files separately. In order to
achieve this, all the sample files should be stored or uploaded in one folder prior to generating
the metadata file. The notebook that combines individual sample data files to a metatable with all
the files is provided.
Table 3.1 TCR repertoire analysis functions in pyTCR
Metric Description
Basic analysis
Read count Number of reads in a sample
Clonotype count Number of clonotypes in a sample
Mean frequency Mean of clonotype frequency in a sample
∑(𝑝𝑖 ) ÷ 𝑛 𝑛 𝑖 =1
pi = frequency of clonotype i
99
n = number of unique clonotype in the sample
Geometric mean frequency Geometric mean of clonotype frequency in a
sample
∏(𝑝𝑖 )
1/𝑛 𝑛 𝑖
pi = frequency of clonotype i
n = number of unique clonotype in the sample
Mean length of CDR3 nucleotide sequence Mean length of CDR3 nucleotide sequence,
weighted by clonotype frequency, in a sample
∑(𝑙𝑒𝑛 (𝑛𝑡 ) × 𝑝𝑖 )
𝑛 𝑖 =1
pi = frequency of clonotype i
n = number of unique clonotype in the
sample
len(nt) = length of CDR3 nucleotide sequence
Convergence Mean of unique CDR3 nucleotide sequences
that code for the same CDR3 amino acid
sequence
Spectratype Frequency of clonotypes based on CDR3
nucleotide lengths
Clonality analysis
The most or the least frequent clonotype The most or the least frequent clonotype
1-Pielou index The evenness of the distribution of the
clonotypes
1 + ∑[(𝑝𝑖 × 𝐼𝑛 (𝑝𝑖 )]/𝐼𝑛 (𝑛 )
𝑛 𝑖 =1
pi = frequency of clonotype i
n = number of unique clonotype in the sample
Clonal proportion The number of distinct clonotypes that
accounts for greater than or equal to
percentage (customizable) of the total of
sequencing reads
Relative abundance (in all repertoire, top
clonotypes, rare clonotypes)
The proportion of repertoire account for
clonal groups with specific abundances in a
sample
Diversity analysis
Shannon-Wiener index
𝑒 ∑ −[𝑝𝑖 ×𝐼𝑛 (𝑝𝑖 )]
𝑛 𝑖 =1
pi = frequency of clonotype i
n = number of unique clonotype in the sample
Normalized Shannon-Wiener index shannon-wiener index ÷ 𝐼𝑛 (𝑛 )
n = number of unique clonotype in the sample
Inverse Simpson index Simpson index: 𝐷 = ∑ 𝑝𝑖
2 𝑛 𝑖 =1
Inverse Simpson index: 1/D
100
Gini Simpson index Gini Simpson index: 1-D
D50 index The percentage of distinct clonotypes that
accounts for greater than or equal to 50% of
the total of sequencing reads
Chao1 estimate
Chao1 estimate: 𝑆 +
𝑓 1(𝑓 1−1)
2(𝑓 2+1)
Variation of Chao1 estimate:
𝑓 2[0.5 × (
𝑓 1
𝑓 2
)
2
+ (
𝑓 1
𝑓 2
)
3
+ 0.25 × (
𝑓 1
𝑓 2
)
4
]
f1 = number of clonotypes with read of 1
f2 = number of clonotypes with read of 2
S = number of clonotypes
Gini coefficient The ratio of the area that lies between the line
of equality and the Lorenz curve over the total
area under the line of equality
Gene usage analysis
V,D,J gene weighted usage Mean of gene frequency weighted by
clonotype frequency
V,D,J gene unweighted usage Mean of gene frequency by the number of the
reads
Overlap analysis
Morisita-Horn index
𝐶 ℎ =
(2 × ∑ 𝑥𝑖𝑦𝑖 𝑆 𝑖 =1
)
(
∑ 𝑥𝑖
2 𝑆 𝑖 =1
𝑋 2
+
∑ 𝑦𝑖
2 𝑆 𝑖 =1
𝑌 2
)𝑋𝑌
xi: the number of reads clonotype i is
represented in the total number of reads X
from one sample
yi: the number of reads clonotype i is
represented in the total number of reads Y
from another sample
S: the index of overlapped clonotypes
Jaccard index
𝐴 ∩ 𝐵 𝐴 ∪ 𝐵
A: sample A, B: sample B
Number of clonotypes that is in both samples
divided by the number of clonotypes that is in
either sample
Overlap coefficient
𝐴 ∩ 𝐵 𝐴
A: sample A (the sample with a smaller
clonotype count), B: sample B
Number of clonotypes that is in both samples
divided by the number of clonotypes that is in
the sample with a smaller clonotype count
Tversky index 𝐵𝑜𝑡 ℎ𝐴𝐵
𝛼 × 𝑜𝑛𝑙𝑦𝐴 + 𝛽 × 𝑜𝑛𝑙𝑦𝐵 + 𝑏𝑜𝑡 ℎ𝐴𝐵
101
α=β=0.5 (Sørensen–Dice coefficient)
BothAB, number of clonotypes that is in both
sample A and B, onlyA: number of
clonotypes that is only in sample A, onlyB:
number of clonotypes that is only in sample B
Cosine similarity ∑ 𝐴 𝑖 𝑛 𝑖 =1
𝐵 𝑖 √∑ 𝐴 𝑖 2 𝑛 𝑖 =1
√∑ 𝐵 𝑖 2 𝑛 𝑖 =1
Ai: the frequency of clonotype i in sample A
Bi: the frequency of clonotype i in sample B
n: the index of overlapped clonotypes
Pearson correlation of clonotype frequencies
𝑅𝑖𝑗 =
∑ (∅𝑖𝑘 − ∅
̅
𝑖 )(∅𝑗𝑘 − ∅
̅
𝑗 )
𝑁 𝑘 =1
√∑ (∅𝑖𝑘 − ∅
̅
𝑖 )
2 𝑁 𝑘 =1
∑ (∅𝑗𝑘 − ∅
̅
𝑗 )
2 𝑁 𝑘 =1
∅ik: the frequency of clonotype k in sample i
∅jk: the frequency of clonotype k in sample j
N: the index of overlapped clonotypes
Relative overlap diversity
𝐷𝑖𝑗 =
𝑑𝑖𝑗 𝑑𝑖 × 𝑑𝑗
dij: the number of clonotypes that is in both
samples
di: the number of clonotypes that is in sample
i
dj: the number of clonotypes that is in sample
j
Geometric mean of relative overlap
frequencies
𝐹𝑖𝑗 = √𝑓𝑖𝑗 × 𝑓𝑗𝑖
𝑓𝑖𝑗 = ∑ ∅𝑖𝑘
𝑁 𝑘 =1
: the total frequency of
clonotypes that are overlapped between
samples i and j in sample i
𝑓𝑗𝑖 = ∑ ∅𝑗𝑘
𝑁 𝑘 =1
: the total frequency of
clonotypes that are overlapped between
samples i and j in sample j
Сlonotype-wise sum of geometric mean
frequencies 𝐹 2𝑖𝑗 = ∑ √∅𝑖𝑘 ∅𝑗𝑘
𝑁 𝑘 =1
∅ik: the frequency of clonotype k in sample i
∅jk: the frequency of clonotype k in sample j
N: the index of overlapped clonotypes
Jensen-Shannon divergence
𝐽𝑆 (𝑃 , 𝑄 ) =
1
2
𝐾𝐿 (𝑃 ,
𝑃 + 𝑄 2
)
+
1
2
𝐾𝐿 (𝑄 ,
𝑃 + 𝑄 2
)
𝐾𝐿 (𝑃 , 𝑄 ) = ∑ 𝑝𝑖𝑙𝑜𝑔 2
𝑝𝑖
𝑞𝑖
𝑖
102
pi: the sum of overlapped variable segment
(V) frequencies in sample 1
qi:the sum of overlapped variable segment (V)
frequencies in sample 2
Motif analysis
Amino acid spectratype Frequency of clonotypes based on CDR3
amino acid lengths
Amino acid motif analysis Number of counts of the amino acid motifs
Nucleotide sequence motif analysis Number of counts of the nucleotide sequence
motifs
103
Figure 3.1 Visualization of TCR repertoire metrics generated using pyTCR
104
a. The clonotype counts of each sample grouped by hospitalization status were presented as a
box plot and strip plot. b. The normalized Shannon-Wiener index of each sample grouped by
hospitalization status was presented as a violin plot. c. The distribution of clonotype groups in
each sample was presented as a stacked bar plot. The clonotypes were categorized into five
groups based on the clonotype frequencies. Hyperexpanded clonotypes were the clones with
frequencies between 0.01 to 1, large clonotypes were the clones with frequencies between 0.001
to 0.01, medium clonotypes were the clones with frequencies between 0.0001 to 0.001, small
clonotypes were the clones with frequencies between 0.00001 to 0.0001, rare clonotypes were
the clones with frequencies between 0 to 0.00001. d. The V-J combinations with V and J gene
frequencies in sample 1445BW were presented as a Sankey plot.
3.3.2 pyTCR is able to perform basic analysis to characterize the TCR repertoire
The focus of the basic analysis is to group and provide the most fundamental TCR
repertoire metrics in one place. The basic analysis performed by pyTCR estimates provides the
number of reads, clonotype counts, mean clonotype frequency, the geometric mean of clonotype
frequency, mean length of CD3 nucleotide sequence, convergence, spectratype as TCR repertoire
metrics. The visualization is available for all the metrics (except for spectratype) in the basic
analysis at the individual sample level and group level. The available plot types are violin plot,
strip plot, swarm plot, box plot, boxen plot, point plot, and bar plot (Figure 3.1A, Figure 3.2). We
were able to detect that the mean reads count in the hospitalization group was lower than that in
the non-hospitalization group (480844.1 and 554580.3, respectively; t-test: p = 0.229), and the
mean clonotype count in the hospitalization group was lower than in the non- hospitalization
group (271777.6 and 328980.1, respectively; t-test: p = 0.136) in the COVID19-BWNW dataset.
105
Figure 3.2 The clonotype counts were shown in groups by hospitalization status
The patients that were not hospitalized were shown in blue while the patients that were
hospitalized were shown in orange. Different types of plots were shown as follows: (a) violin
plot (b) strip plot (c) swarm plot (d) box plot (e) boxen plot (f) point plot (g) bar plot.
106
3.3.3 pyTCR is able to perform clonality analysis to assess the evenness of distribution of
TCR clonotypes
The clonality analysis offers the measurements of clonality, which has been used to
assess the evenness of distribution of the clonotypes based on the relative abundance of
clonotypes in the sample. The metrics include the list of the most or the least frequent
clonotypes, 1-Pielou index for evenness measure (0 means no evenness, 1 means complete
evenness), clonal proportion, and the distribution of clonotype groups based on relative
abundance. Specifically, clonal proportion presents the number of clonotypes that consist of a
certain percentage of the clonotypes in the repertoire. In the COVID19-BWNW dataset, the
number of clonotypes that counts for 10% of the clonotypes in the repertoire was smaller in the
hospitalization group than in the non-hospitalization group (49.5 and 459, respectively;
Wilcoxon rank-sum test, p = 0.596), the corresponding plots were presented in various types
(Figure 3.3). Additionally, the distribution of clonotype groups based on clonotype frequency or
count in each sample can be presented in bar plots across all the clonotypes, the top clonotypes,
and the rare clonotypes (Figure 3.4). We presented the distribution of five clonotype groups
(hyperexpanded, large, medium, small, and rare) across all clonotypes in Figure 3.1C, this
categorization is similarly done in other existing tools. The users have full control of the
thresholds of the clonotype groups in our tool.
107
Figure 3.3 Clonal portion grouped by hospitalization status
The y-axis presented the number of clonotypes that counted for 10% of all the clonotypes in the
repertoire. The patients that were not hospitalized were shown in blue while the patients that
were hospitalized were shown in orange. Different types of plots were shown as follows: (a)
violin plot (b) strip plot (c) swarm plot (d) box plot (e) boxen plot (f) point plot (g) bar plot.
108
109
Figure 3.4 The distribution of clonotype groups in each sample
(a) The distribution of clonotype groups was based on the clonotype frequency across all the
clonotypes. Hyperexpanded clonotypes (blue) were the clones with frequencies between 0.01 to
1, large clonotypes (orange) were the clones with frequencies between 0.001 to 0.01, medium
clonotypes (green) were the clones with frequencies between 0.0001 to 0.001, small clonotypes
(red) were the clones with frequencies between 0.00001 to 0.0001, rare clonotypes (purple) were
the clones with frequencies between 0 to 0.00001. (b) The distribution of clonotype groups based
on the clonotype count across the top 100 clonotypes. The clonotypes with clone counts between
1001-5000 were presented in blue, the clonotypes with clone counts between 101-1000 were
resented in orange, and the clonotypes with clone counts between 11-100 were presented in
green. (c) The distribution of clonotype groups based on the clonotype count across the rare 100
clonotypes. The clonotypes with clone count of 1 were presented in blue.
3.3.4 pyTCR is able to perform gene usage analysis to detect over and underrepresented
TCR genes across the samples
Gene usage analysis provides the weighted and unweighted V/ D/J gene usage
calculations. For gene usage analysis, V gene usage, D gene usage, and J gene usage, both
weighted (which is based on clonotype frequency) and unweighted (which is based on clonotype
count) are provided as TCR repertoire metrics. Heatmap and hierarchically clustered heatmap are
the available visualizations (Figure 3.5A, B). Sankey plot is also available to visualize the V-J
combinations (Figure 3.1D, Figure 3.5C), this is not provided by other existing tools. We
observed higher V gene weighted usage of TRBV05- 05*01 (0.0084 and 0.0066, respectively)
and TRBV13-01*01 (0.0069 and 0.0042, respectively) in the non-hospitalization group. In
comparison, we observed higher V gene weighted usage of TRBV20 (0.0638 and 0.0588,
respectively) in the hospitalization group in the COVID19-BWNW dataset. We also observed
higher V gene unweighted usage of TRBV18-01*01 (0.035 and 0.031) and TRBV30-01*01
(0.025 and 0.019) in the hospitalization group. After the Bonferroni correction to account for the
multiple comparisons, according to the adjusted p values, the differences mentioned above were
not statistically significant.
110
111
Figure 3.5 Heatmap of the weighted V gene usage in each sample and the Sankey plot for V-J
combinations
(a) The heatmap of V gene weighted usage (b) The hierarchically-clustered heatmap of V gene
weighted usage. x-axis represented each sample, y-axis represented different V genes. The shade
of the color corresponded to the V gene frequency. (c) The V-J combinations with V and J gene
frequencies in sample 3602BW were presented as a Sankey plot.
3.3.5 pyTCR is able to assess the diversity of TCR repertoires
Diversity analysis offered by pyTCR includes all the widely adopted indices to
characterize the diversity of TCR repertoire, which contains Shannon-Wiener index, normalized
Shannon- Wiener index, inverse Simpson index, Gini Simpson index, D50 index, Chao1
estimate, Gini coefficient (Table 3.1). High Shannon-Wiener index, low normalized Shannon-
Wiener index, high inverse Simpson index, high Gini Simpson index, high Chao1 estimate, and
high Gini coefficient represent high clonal diversity. Additionally, the D50 index represents the
percentage of unique clonotypes that account for greater than 50% of the total number of
sequences. The visualization is available for all the diversity metrics at the sample or group level
as violin plot, strip plot, swarm plot, box plot, boxen plot, point plot, and bar plot (Figure 3.1B,
Figure 3.6). In the COVID19-BWNW dataset, the median Shannon-Wiener index, the median
inverse Simpson index, and the median Gini Simpson index were all lower in the hospitalization
group than in the non-hospitalization group. Even though none of the diversity indices was
statistically significant, most of the diversity indices showed the trend that patients in the non-
hospitalization group have more diverse TCR clonotypes than patients in the hospitalization
group. This finding was consistent with the results observed in the previously published studies,
that severe COVID-19 patients had reduced TCR diversity than moderate COVID-19 patients
(Chang et al., 2021, p.; F. Zhang et al., 2020).
112
Figure 3.6 The diversity indices were shown in groups by hospitalization status
The patients that were not hospitalized were shown in blue while the patients that were
hospitalized were shown in orange. (a) violin plot of the Shannon Wiener index (b) strip plot of
normalized Shannon Wiener index (c) swarm plot of inverse Simpson index (d) box plot of Gini
Simpson index (e) boxen plot of D50 index (f) point plot of Chao1 estimate (g) bar plot of Gini
coefficient.
113
3.3.6 pyTCR is able to effectively compare clonotypes and motifs across samples
The overlap analysis offers a comprehensive list of overlap metrics for comparing the
clonotype frequencies between two samples. These metrics include the Jaccard index, overlap
coefficient, Morisita-Horn index, Tversky index, Cosine similarity, Pearson correlation of
clonotype frequencies, relative overlap diversity, the geometric mean of relative overlap
frequencies, the clonotype-wise sum of geometric mean frequencies, and Jensen- Shannon
divergence. For overlap analysis, the visualization is shown in the heatmaps (Figure 3.7).
Currently, existing tools only accept one sample per file for overlap comparisons which can be
difficult to manage if the data already contains multiple samples per file. pyTCR allows for an
unlimited amount of samples per file which enables more flexibility and less file management.
Motif analysis provides enriched nucleotide and amino acid motif discovery with
customized length. For motif analysis, the amino acid motif counts and nucleotide motif counts
in each sample are provided. The users are able to customize the length of the motif and visualize
the distribution of the motifs in each sample or each group. In the COVID19-BWNW dataset, we
observed amino acid motifs NTEAFF, YNEQFF, CASSLG, TDTQYF, NQPQHF, TGELFF,
SYEQYF were the most abundant ones in both the hospitalization and non- hospitalization
groups (Figure 3.8A). We also observed nucleotide motifs such as TCTGTG, CTGTGC,
TGTGCC, GTGCCA, TGCCAG, GCCAGC, CCAGCA, CAGCAG were the most abundant
ones in both hospitalization and non-hospitalization groups (Figure 3.8B).
114
115
116
Figure 3.7 The overlap indices across samples
Heatmaps of each overlap index as shown above. The x-axis and the y-axis presented each
sample. (a) Jaccard index (b) Overlap coefficient (c) Morisita-Horn index (d) Tversky index (e)
Cosine similarity (f) Pearson correlation based on clonotype counts (g) Pearson correlation based
on clonotype frequency (h) Relative overlap diversity (i) Geometric mean of relative overlap
frequencies (j) Сlonotype-wise sum of geometric mean frequencies (k) Jensen-Shannon
divergence of variable gene usage distributions.
Figure 3.8 The number of highly presented motifs across samples grouped by hospitalization
status
The x-axis presented motifs and the y-axis presented the count of the motifs. (a) amino acid
motifs (k=6) that had numbers of counts of no less than 9,999 and were presented in more than 2
samples in hospitalization and non-hospitalization groups were shown in the strip plot (b)
nucleotide motifs (k=6) that had numbers of counts of more than 150,000 and were presented in
more than 2 samples in hospitalization and non-hospitalization group were shown in the strip
plot.
3.3.7 pyTCR offers several advantages compared to the existing tools
pyTCR provides more comprehensive functionalities compared with VDJtools,
Immunarch, and VisTCR (Table 3.2). Notably, pyTCR includes several innovations that have not
been implemented in the VDJtools, Immunarch, or VisTCR before. First, statistical analyses for
TCR-Seq datasets are embedded in pyTCR computational notebooks. No additional software or
platform is needed for statistical analysis. Second, pyTCR has the most comprehensive list of
overlap indices, which enable the thorough comparison between clonotypes across samples.
Third, pyTCR offers enriched motif detection for both amino acid sequences as well as
117
nucleotide sequences. Furthermore, we have compared the results produced by pyTCR with
VDJtools on the types of analysis that are provided by VDJtools on the COVID19-BWNW
dataset. The results are consistent across our tool and VDJtools.
We also evaluated the scalability of pyTCR by varying the number of samples in the
TCR-Seq data input files. After subsampling the COVID19-BWNW dataset into data files
containing 2 to 46 samples, we recorded the central processing unit (CPU) time and memory
usage required by pyTCR, VDJtools, and Immunarch when running overlap analysis. We
observed that pyTCR required a significantly less amount of CPU time across all the subsamples
compared to VDJtools and Immunarch (Figure 3.9A). For example, on average, pyTCR used 5
minutes and 25 seconds to process the overlap analysis for 46 samples, while Immunarch used 1
hour 49 minutes and 46 seconds to process the same number of samples. In terms of memory
usage, pyTCR had reduced memory usage for most of the subsamples (Figure 3.9B). According
to our benchmark results, we observe that pyTCR has up to 22 times faster performance than
existing TCR-seq analysis tools, especially for datasets with larger numbers of samples.
Table 3.2 The comparison of TCR repertoire analysis tools
Metric pyTCR VDJtools Immunarch VisTCR
Basic analysis
Read count Yes Yes Yes No
Clonotype count Yes Yes Yes No
Mean frequency Yes Yes No No
Geometric mean frequency Yes Yes No No
Mean length of CDR3
nucleotide sequence
Yes Yes No No
Convergence Yes Yes No Yes
Spectratype Yes Yes Yes Yes
118
Clonality analysis
The most or the least frequent
clonotype
Yes Yes Yes Yes
1-Pielou index Yes No No No
Clonal proportion Yes No Yes No
Relative abundance (in all
repertoire, top clonotypes, rare
clonotypes)
Yes No Yes No
Diversity analysis
Shannon-Wiener index
Yes Yes Presented by
Hill numbers
Yes
Normalized Shannon-Wiener
index
Yes Yes Presented by
Hill numbers
No
Inverse Simpson index Yes Yes Yes Yes
Gini Simpson index Yes No Yes Yes
D50 index Yes Yes Yes No
Chao1 estimate Yes Yes Yes No
Gini coefficient Yes No Yes No
Gene usage analysis
V,D,J gene weighted usage Yes Yes Yes Yes
V,D,J gene unweighted usage Yes Yes No No
Overlap analysis
Morisita-Horn index Yes Yes Yes Yes
Jaccard index Yes Yes Yes Yes
Overlap coefficient Yes No Yes No
Tversky index Yes No Yes No
Cosine similarity Yes No Yes No
119
Pearson correlation
of clonotype frequencies
Yes Yes No No
Relative overlap diversity Yes Yes No No
Geometric mean of relative
overlap frequencies
Yes Yes No No
Сlonotype-wise sum of
geometric mean frequencies
Yes Yes No No
Jensen-Shannon divergence Yes Yes No No
Motif analysis
Amino acid spectratype Yes No Yes No
Amino acid motif analysis Yes No Yes No
Nucleotide sequence motif
analysis
Yes No No No
Statistical analysis
Student's t-test Yes No No Yes
Wilcoxon rank-sum test Yes No No No
Bonferroni correction Yes No No No
120
Figure 3.9 Central Processing Unit (CPU) time (a) and Memory usage (b) for subsamples of the
COVID19-BWNW dataset for overlap analysis
Each point on the line plot represented the average of 10 runs for different input sizes.
3.4 Discussion
We have presented pyTCR, a comprehensive and scalable computational notebook-based
solution for TCR-Seq analysis and visualization with a rich set of functionalities. For the cloud-
based version, we use Google Colaboratory (Google Colab). Google Colab, as a user-friendly,
free with no installation needed prior to use service for Google account holders, is suitable for
biomedical researchers with a limited computational background. Using interactive
computational notebooks promotes high transparency for biomedical researchers because the
steps of analyzing and visualizing are recorded and saved, which are easy to be shared with the
scientific community. The goal of pyTCR is to provide straightforward scripts for useful analysis
which can be easily modified and complemented by users instead of stand-alone software tools
with well-structured code. The availability of the code as part of the notebook allows the users to
document all the steps of the analysis and share them in a reproducible and transparent way.
121
pyTCR offers several advantages compared to the existing tools. First, pyTCR includes
more comprehensive measurements than existing tools to analyze TCR-Seq data. The enriched
measurements can provide users with more options to effectively characterize TCR repertoires
and compare across various phenotypes. Furthermore, pyTCR provides code and analysis jointly
together. Users can understand the definition of measurements and interpret results easily with
pyTCR, as the explanation of the code and the math equations are available in the notebooks.
Additionally, pyTCR allows users to adjust parameters easily and directly in the notebooks.
Unlike other traditional bioinformatics tools, changing parameters that generate separate files
which leads to high error rates by analyzing across different files, pyTCR provides all the
analysis to be performed in the cloud where the files are automatically saved with the updated
parameters and no generation of different files is needed. Last, our tool is more scalable as it
requires less computational time for analysis.
We recognized that there are other tools available for TCR- Seq analysis, however, these
tools may not share the same purposes as pyTCR. For example, both VisTCR (Ni et al., 2020)
and Vidjil (Duez et al., 2016) are web-based tools for TCR-Seq analysis that uses fastq files as
input files. While pyTCR utilizes the tsv or csv files generated by pre-processing tools such as
MiXCR to conduct post-analysis. The sample files that we used in the manuscript were
generated by Adaptive Biotechnologies, unfortunately, the users do not have access to the raw
sequencing data per user policy. VDJviz is a web-based tool as well and it uses VDJtools as a
back-end.
However, there are limitations of pyTCR including the possibility of accidentally
modifying the code resulting in generating errors, limited available types of analysis, and storage
and processing speed limits from the Google Colab platform. For example, users with limited
122
experience with Python scripting may be prone to generate errors with the availability of code
and results in interactive notebooks. Additionally, pyTCR cannot be used directly on raw
sequencing files (such as fastq format).
In conclusion, our tool offers broader and more powerful functions in TCR repertoire
research. We expect the computational notebook-based tool to be adopted by the broad
biomedical community as it carries benefits that are superior or comparable to R packages.
123
Chapter 4: Concluding Remarks
The main components of the adaptive immune system are B cells and T cells. Exposures
to antigens including allergens, pathogens, and neoantigens stimulate these lymphocytes to
activate the adaptive immune responses. These naturally exposed antigens as well as vaccines
and certain therapeutic agents are the ways for humans to acquire immunity. Adaptive immunity
makes up immune history and creates immunological memory to protect long acting protection
for humans.
T cell receptors (TCRs) are on the surface of the T cells. Each receptor consists of two
polypeptide chains (α and β or γ and δ). The majority of the T cells are αβ T cells, while the
minority of the T cells express γ and δ. However, the function of γδ T cells in immune responses
is not yet entirely clear. My projects focus on the αβ T cells and T cell receptor repertoires. The
TCRs on the αβ T cells can recognize and bind to the epitopes of antigens presented by major
histocompatibility complex (MHC).
TCRs are formed through a process called V(D)J recombination. There are variable (V),
diversity (D), and joining (J) segments on the human germline locus. V, D, and J segments are
selected, rearranged, and combined. It is possible that the deletion and insertion of nucleotides
happen during the V(D)J recombination process, which increases the complexity and diversity of
TCRs. T cells later become mature and get selected via the antigen recognition properties
possessed by TCRs.
The development and innovations in high throughput sequencing technologies, especially
TCR sequencing made it possible for researchers to capture a detailed and precise view of TCR
repertoires. With these advancements, researchers are able to study the adaptive immune system
and immune responses at a larger scale. The current two methods for TCR sequencing are
124
multiplex PCR and 5’RACE, they both target TCR from amplified DNA or RNA samples. With
the TCR sequencing technology, TCRs have shown to have the potential to be utilized as
prognostic biological markers and treatment outcome prediction assistance. TCR studies have
been focused mainly on infectious diseases, autoimmune diseases, cancer, and neurological
conditions.
Even with the discoveries in TCR studies, there are a few critical challenges existing in
this field. First, the traditional computational methods require researchers to acquire knowledge
in the command line interface in order to properly utilize the tools. However, biomedical
researchers who generate the TCR-Seq data usually lack relevant experience. This barrier
hinders the application of these computational methods among a broad range of users. Second,
since TCR-Seq is a relatively new technology compared with other high throughput sequencing
techniques such as RNA-Sequencing, only limited datasets across diverse tissue types or
populations are available. Third, the existing computational algorithms allow the utilization of
non-targeted next generation sequencing (NGS) data to extract TCR information. However, the
performance of these methods on samples with different T cell levels remains unknown. Fourth,
the race/ethnicity/ancestry information of the TCR study participants is often not reported on
record in the original research. With the available race/ethnicity/ancestry data, we demonstrated
that the majority of the TCR-Seq datasets were generated from individuals of European ancestry
based on a survey that we conducted (Y.-N. Huang et al., 2021). Furthermore, there is a lack of a
comprehensive population-specific TCR-Seq database. These factors hinder our ability to
investigate TCR in diverse populations.
We have been motivated by these challenges and limitations. Our projects have addressed
the issues of the need for a comprehensive benchmark of RNA-Seq-based TCR profiling
125
methods as well as the development of a user-friendly, easy-to-use computational tool for TCR-
Seq data analysis.
Based on the benchmarking results, we have demonstrated that the performance of RNA-
Seq-based TCR profiling methods is highly dependent on the level of T cells as well as the
diversity of the TCR repertoire in a sample. Specifically, RNA-Seq-based TCR profiling
methods (MiXCR, ImRep, TRUST4, and CATT) are able to effectively capture the majority of
the TCR-Seq confirmed clonotypes in low Shannon Diversity Index (SDI) samples. Furthermore,
these methods are able to accurately estimate the diversity and clonality of the repertoire of T
cell rich low SDI samples. In the scenarios when TCR-Seq results are not available, these
methods may be considered as alternatives for profiling TCR repertoires in T cell rich tissues and
low SDI samples. However, cautions need to be taken into consideration when utilizing RNA-
Seq-based TCR profiling methods in T cell poor high SDI samples.
In addition to the benchmarking study, we have developed pyTCR, a computational
notebook based tool for TCR-Seq data analysis. It offers a comprehensive solution for TCR-Seq
data analysis and visualization. It is also a one-stop shop for biomedical researchers with limited
bioinformatics backgrounds. Moreover, our tool has broader and more powerful functionalities
in facilitating TCR repertoire research.
In conclusion, we have provided recommendations for choosing RNA-Seq-based TCR
profiling methods in different scenarios and established a platform for TCR-Seq data analysis for
biomedical researchers with a focus on promoting reproducibility and transparency.
126
References
Abbas, H. A., Reville, P. K., Jiang, X., Yang, H., Reuben, A., Im, J. S., Little, L., Sinson, J. C.,
Chen, K., Futreal, A., & Garcia-Manero, G. (2021). Response to Hypomethylating
Agents in Myelodysplastic Syndrome Is Associated With Emergence of Novel TCR
Clonotypes. Frontiers in Immunology, 12, 659625.
https://doi.org/10.3389/fimmu.2021.659625
Avnir, Y., Watson, C. T., Glanville, J., Peterson, E. C., Tallarico, A. S., Bennett, A. S., Qin, K.,
Fu, Y., Huang, C.-Y., Beigel, J. H., Breden, F., Zhu, Q., & Marasco, W. A. (2016).
IGHV1-69 polymorphism modulates anti-influenza antibody repertoires, correlates with
IGHV utilization shifts and varies by ethnicity. Scientific Reports, 6(1), 20842.
https://doi.org/10.1038/srep20842
Barennes, P., Quiniou, V., Shugay, M., Egorov, E. S., Davydov, A. N., Chudakov, D. M., Uddin,
I., Ismail, M., Oakes, T., Chain, B., Eugster, A., Kashofer, K., Rainer, P. P., Darko, S.,
Ransier, A., Douek, D. C., Klatzmann, D., & Mariotti-Ferrandiz, E. (2021).
Benchmarking of T cell receptor repertoire profiling methods reveals large systematic
biases. Nature Biotechnology, 39(2), 236–245. https://doi.org/10.1038/s41587-020-0656-
3
Bashford-Rogers, R. J. M., Bergamaschi, L., McKinney, E. F., Pombal, D. C., Mescia, F., Lee, J.
C., Thomas, D. C., Flint, S. M., Kellam, P., Jayne, D. R. W., Lyons, P. A., & Smith, K.
G. C. (2019). Analysis of the B cell receptor repertoire in six immune-mediated diseases.
Nature, 574(7776), 122–126. https://doi.org/10.1038/s41586-019-1595-3
Bende, R. J., Slot, L. M., Hoogeboom, R., Wormhoudt, T. A. M., Adeoye, A. O., Guikema, J. E.
J., & van Noesel, C. J. M. (2015). Stereotypic Rheumatoid Factors That Are Frequently
Expressed in Mucosa-Associated Lymphoid Tissue–Type Lymphomas Are Rare in the
Labial Salivary Glands of Patients With Sjögren’s Syndrome. Arthritis & Rheumatology,
67(4), 1074–1083. https://doi.org/10.1002/art.39002
Ben-Eghan, C., Sun, R., Hleap, J. S., Diaz-Papkovich, A., Munter, H. M., Grant, A. V., Dupras,
C., & Gravel, S. (2020). Don’t ignore genetic data from minority populations. Nature,
585(7824), 184–186. https://doi.org/10.1038/d41586-020-02547-3
Bhardwaj, V., Franceschetti, M., Rao, R., Pevzner, P. A., & Safonova, Y. (2020). Automated
analysis of immunosequencing datasets reveals novel immunoglobulin D genes across
127
diverse species. PLoS Computational Biology, 16(4), e1007837.
https://doi.org/10.1371/journal.pcbi.1007837
Bolotin, D. A., Poslavsky, S., Davydov, A. N., Frenkel, F. E., Fanchi, L., Zolotareva, O. I.,
Hemmers, S., Putintseva, E. V., Obraztsova, A. S., Shugay, M., Ataullakhanov, R. I.,
Rudensky, A. Y., Schumacher, T. N., & Chudakov, D. M. (2017). Antigen receptor
repertoire profiling from RNA-seq data. Nature Biotechnology, 35(10), 908–911.
https://doi.org/10.1038/nbt.3979
Bolotin, D. A., Poslavsky, S., Mitrophanov, I., Shugay, M., Mamedov, I. Z., Putintseva, E. V., &
Chudakov, D. M. (2015). MiXCR: Software for comprehensive adaptive immunity
profiling. Nature Methods, 12(5), 380–381. https://doi.org/10.1038/nmeth.3364
Boyd, S. D., Gaëta, B. A., Jackson, K. J., Fire, A. Z., Marshall, E. L., Merker, J. D., Maniar, J.
M., Zhang, L. N., Sahaf, B., Jones, C. D., Simen, B. B., Hanczaruk, B., Nguyen, K. D.,
Nadeau, K. C., Egholm, M., Miklos, D. B., Zehnder, J. L., & Collins, A. M. (2010).
Individual Variation in the Germline Ig Gene Repertoire Inferred from Variable Region
Gene Rearrangements. The Journal of Immunology, 184(12), 6986–6992.
https://doi.org/10.4049/jimmunol.1000445
Briney, B., Inderbitzin, A., Joyce, C., & Burton, D. R. (2019). Commonality despite exceptional
diversity in the baseline human antibody repertoire. Nature, 566(7744), Article 7744.
https://doi.org/10.1038/s41586-019-0879-y
Burton, D. R. (2019). Advancing an HIV vaccine; advancing vaccinology. Nature Reviews
Immunology, 19(2), Article 2. https://doi.org/10.1038/s41577-018-0103-6
Carlson, C. S., Emerson, R. O., Sherwood, A. M., Desmarais, C., Chung, M.-W., Parsons, J. M.,
Steen, M. S., LaMadrid-Herrmannsfeldt, M. A., Williamson, D. W., Livingston, R. J.,
Wu, D., Wood, B. L., Rieder, M. J., & Robins, H. (2013). Using synthetic templates to
design an unbiased multiplex PCR assay. Nature Communications, 4, 2680.
https://doi.org/10.1038/ncomms3680
Chaara, W., Gonzalez-Tort, A., Florez, L.-M., Klatzmann, D., Mariotti-Ferrandiz, E., & Six, A.
(2018). RepSeq Data Representativeness and Robustness Assessment by Shannon
Entropy. Frontiers in Immunology, 9. https://doi.org/10.3389/fimmu.2018.01038
128
Chang, C.-M., Feng, P., Wu, T.-H., Alachkar, H., Lee, K.-Y., & Chang, W.-C. (2021). Profiling
of T Cell Repertoire in SARS-CoV-2-Infected COVID-19 Patients Between Mild Disease
and Pneumonia. Journal of Clinical Immunology, 41(6), 1131–1145.
https://doi.org/10.1007/s10875-021-01045-z
Chao, A., Gotelli, N. J., Hsieh, T. C., Sander, E. L., Ma, K. H., Colwell, R. K., & Ellison, A. M.
(2014). Rarefaction and extrapolation with Hill numbers: A framework for sampling and
estimation in species diversity studies. Ecological Monographs, 84(1), 45–67.
https://doi.org/10.1890/13-0133.1
Chen, S.-Y., Liu, C.-J., Zhang, Q., & Guo, A.-Y. (2020). An ultra-sensitive T-cell receptor
detection method for TCR-Seq and RNA-Seq data. Bioinformatics, 36(15), 4255–4262.
https://doi.org/10.1093/bioinformatics/btaa432
Choudhury, A., Aron, S., Botigué, L. R., Sengupta, D., Botha, G., Bensellak, T., Wells, G.,
Kumuthini, J., Shriner, D., Fakim, Y. J., Ghoorah, A. W., Dareng, E., Odia, T., Falola,
O., Adebiyi, E., Hazelhurst, S., Mazandu, G., Nyangiri, O. A., Mbiyavanga, M., …
Hanchard, N. A. (2020). High-depth African genomes inform human migration and
health. Nature, 586(7831), Article 7831. https://doi.org/10.1038/s41586-020-2859-7
Christley, S., Aguiar, A., Blanck, G., Breden, F., Bukhari, S. A. C., Busse, C. E., Jaglale, J.,
Harikrishnan, S. L., Laserson, U., Peters, B., Rocha, A., Schramm, C. A., Taylor, S.,
Vander Heiden, J. A., Zimonja, B., Watson, C. T., Corrie, B., & Cowell, L. G. (2020).
The ADC API: A Web API for the Programmatic Query of the AIRR Data Commons.
Frontiers in Big Data, 3. https://www.frontiersin.org/articles/10.3389/fdata.2020.00022
Corcoran, M. M., Phad, G. E., Bernat, N. V., Stahl-Hennig, C., Sumida, N., Persson, M. A. A.,
Martin, M., & Hedestam, G. B. K. (2016). Production of individualized V gene databases
reveals high levels of immunoglobulin genetic diversity. Nature Communications, 7(1),
13642. https://doi.org/10.1038/ncomms13642
Dahal-Koirala, S., Balaban, G., Neumann, R. S., Scheffer, L., Lundin, K. E. A., Greiff, V.,
Sollid, L. M., Qiao, S.-W., & Sandve, G. K. (2022). TCRpower: Quantifying the
detection power of T-cell receptor sequencing with a novel computational pipeline
calibrated by spike-in sequences. Briefings in Bioinformatics, 23(2), bbab566.
https://doi.org/10.1093/bib/bbab566
129
Dilthey, A., Cox, C., Iqbal, Z., Nelson, M. R., & McVean, G. (2015). Improved genome
inference in the MHC using a population reference graph. Nature Genetics, 47(6), 682–
688. https://doi.org/10.1038/ng.3257
Duez, M., Giraud, M., Herbert, R., Rocher, T., Salson, M., & Thonier, F. (2016). Vidjil: A Web
Platform for Analysis of High-Throughput Repertoire Sequencing. PLOS ONE, 11(11),
e0166126. https://doi.org/10.1371/journal.pone.0166126
Emerson, R. O., DeWitt, W. S., Vignali, M., Gravley, J., Hu, J. K., Osborne, E. J., Desmarais, C.,
Klinger, M., Carlson, C. S., Hansen, J. A., Rieder, M., & Robins, H. S. (2017).
Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-
mediated effects on the T cell repertoire. Nature Genetics, 49(5), 659–665.
https://doi.org/10.1038/ng.3822
Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H., & Holt, R. A. (2009). Profiling the T-
cell receptor beta-chain repertoire by massively parallel sequencing. Genome Research,
19(10), 1817–1824. https://doi.org/10.1101/gr.092924.109
Gadala-Maria, D., Gidoni, M., Marquez, S., Vander Heiden, J. A., Kos, J. T., Watson, C. T.,
O’Connor, K. C., Yaari, G., & Kleinstein, S. H. (2019). Identification of Subject-Specific
Immunoglobulin Alleles From Expressed Repertoire Sequencing Data. Frontiers in
Immunology, 10, 129. https://doi.org/10.3389/fimmu.2019.00129
Galon, J., & Bruni, D. (2019). Approaches to treat immune hot, altered and cold tumours with
combination immunotherapies. Nature Reviews Drug Discovery, 18(3), 197–218.
https://doi.org/10.1038/s41573-018-0007-y
Gate, D., Saligrama, N., Leventhal, O., Yang, A. C., Unger, M. S., Middeldorp, J., Chen, K.,
Lehallier, B., Channappa, D., De Los Santos, M. B., McBride, A., Pluvinage, J., Elahi, F.,
Tam, G. K.-Y., Kim, Y., Greicius, M., Wagner, A. D., Aigner, L., Galasko, D. R., …
Wyss-Coray, T. (2020). Clonally expanded CD8 T cells patrol the cerebrospinal fluid in
Alzheimer’s disease. Nature, 577(7790), 399–404. https://doi.org/10.1038/s41586-019-
1895-7
Greiff, V., Bhat, P., Cook, S. C., Menzel, U., Kang, W., & Reddy, S. T. (2015). A bioinformatic
framework for immune repertoire diversity profiling enables detection of immunological
status. Genome Medicine, 7(1), 49. https://doi.org/10.1186/s13073-015-0169-8
130
Greiff, V., Yaari, G., & Cowell, L. G. (2020). Mining adaptive immune receptor repertoires for
biological and clinical information using machine learning. Current Opinion in Systems
Biology, 24, 109–119. https://doi.org/10.1016/j.coisb.2020.10.010
Hogan, S. A., Courtier, A., Cheng, P. F., Jaberg-Bentele, N. F., Goldinger, S. M., Manuel, M.,
Perez, S., Plantier, N., Mouret, J.-F., Nguyen-Kim, T. D. L., Raaijmakers, M. I. G.,
Kvistborg, P., Pasqual, N., Haanen, J. B. A. G., Dummer, R., & Levesque, M. P. (2019).
Peripheral Blood TCR Repertoire Profiling May Facilitate Patient Stratification for
Immunotherapy against Melanoma. Cancer Immunology Research, 7(1), 77–85.
https://doi.org/10.1158/2326-6066.CIR-18-0136
Huang, J., Kang, B. H., Ishida, E., Zhou, T., Griesman, T., Sheng, Z., Wu, F., Doria-Rose, N. A.,
Zhang, B., McKee, K., O’Dell, S., Chuang, G.-Y., Druz, A., Georgiev, I. S., Schramm, C.
A., Zheng, A., Joyce, M. G., Asokan, M., Ransier, A., … Connors, M. (2016).
Identification of a CD4-Binding-Site Antibody to HIV that Evolved Near-Pan
Neutralization Breadth. Immunity, 45(5), 1108–1121.
https://doi.org/10.1016/j.immuni.2016.10.027
Huang, Y.-N., Patel, N. A., Mehta, J. H., Ginjala, S., Brodin, P., Gray, C. M., Patel, Y. M.,
Cowell, L. G., Burkhardt, A. M., & Mangul, S. (2022). Data Availability of Open T-Cell
Receptor Repertoire Data, a Systematic Assessment. Frontiers in Systems Biology, 2.
https://www.frontiersin.org/articles/10.3389/fsysb.2022.918792
Huang, Y.-N., Peng, K., Popejoy, A. B., Hu, J., Nowicki, T. S., Gold, S. M., Quintana-Murci, L.,
Fuentes-Guajardo, M., Shugay, M., Greiff, V., Burkhardt, A. M., Alachkar, H., &
Mangul, S. (2021). Ancestral diversity is limited in published T cell receptor sequencing
studies. Immunity, 54(10), 2177–2179. https://doi.org/10.1016/j.immuni.2021.09.015
Ioannidis, J. P. A., Powe, N. R., & Yancy, C. (2021). Recalibrating the Use of Race in Medical
Research. JAMA, 325(7), 623–624. https://doi.org/10.1001/jama.2021.0003
Kaplinsky, J., & Arnaout, R. (2016). Robust estimates of overall immune-repertoire diversity
from high-throughput measurements on samples. Nature Communications, 7, 11881.
https://doi.org/10.1038/ncomms11881
Kuchenbecker, L., Nienen, M., Hecht, J., Neumann, A. U., Babel, N., Reinert, K., & Robinson,
P. N. (2015). IMSEQ—a fast and error aware approach to immunogenetic sequence
analysis. Bioinformatics, 31(18), 2963–2971.
https://doi.org/10.1093/bioinformatics/btv309
131
Laydon, D. J., Bangham, C. R. M., & Asquith, B. (2015). Estimating T-cell repertoire diversity:
Limitations of classical estimators and a new approach. Philosophical Transactions of the
Royal Society B: Biological Sciences, 370(1675), 20140291.
https://doi.org/10.1098/rstb.2014.0291
Lefranc, M.-P., Giudicelli, V., Duroux, P., Jabado-Michaloud, J., Folch, G., Aouinti, S., Carillon,
E., Duvergey, H., Houles, A., Paysan-Lafosse, T., Hadi-Saljoqi, S., Sasorith, S., Lefranc,
G., & Kossida, S. (2015). IMGT®, the international ImMunoGeneTics information
system® 25 years on. Nucleic Acids Research, 43(Database issue), D413–D422.
https://doi.org/10.1093/nar/gku1056
Li, B., Li, T., Pignon, J.-C., Wang, B., Wang, J., Shukla, S. A., Dou, R., Chen, Q., Hodi, F. S.,
Choueiri, T. K., Wu, C., Hacohen, N., Signoretti, S., Liu, J. S., & Liu, X. S. (2016).
Landscape of tumor-infiltrating T cell repertoire of human cancers. Nature Genetics,
48(7), 725–732. https://doi.org/10.1038/ng.3581
Li, N., Yuan, J., Tian, W., Meng, L., & Liu, Y. (2020). T‐cell receptor repertoire analysis for the
diagnosis and treatment of solid tumor: A methodology and clinical applications. Cancer
Communications, 40(10), 473–483. https://doi.org/10.1002/cac2.12074
Liu, X. S., & Mardis, E. R. (2017). Applications of Immunogenomics to Cancer. Cell, 168(4),
600–612. https://doi.org/10.1016/j.cell.2017.01.014
Liu, X., Zhang, W., Zhao, M., Fu, L., Liu, L., Wu, J., Luo, S., Wang, L., Wang, Z., Lin, L., Liu,
Y., Wang, S., Yang, Y., Luo, L., Jiang, J., Wang, X., Tan, Y., Li, T., Zhu, B., … Lu, Q.
(2019). T cell receptor β repertoires as novel diagnostic markers for systemic lupus
erythematosus and rheumatoid arthritis. Annals of the Rheumatic Diseases, 78(8), 1070–
1078. https://doi.org/10.1136/annrheumdis-2019-215442
Luo, L., Liang, W., Pang, J., Xu, G., Chen, Y., Guo, X., Wang, X., Zhao, Y., Lai, Y., Liu, Y., Li,
B., Su, B., Zhang, S., Baniyash, M., Shen, L., Chen, L., Ling, Y., Wang, Y., Liang, Q., …
Wang, F. (2021). Dynamics of TCR repertoire and T cell function in COVID-19
convalescent individuals. Cell Discovery, 7, 89. https://doi.org/10.1038/s41421-021-
00321-x
Ma, K.-Y., He, C., Wendel, B. S., Williams, C. M., Xiao, J., Yang, H., & Jiang, N. (2018).
Immune Repertoire Sequencing Using Molecular Identifiers Enables Accurate Clonality
132
Discovery and Clone Size Quantification. Frontiers in Immunology, 9, 33.
https://doi.org/10.3389/fimmu.2018.00033
Mandric, I., Rotman, J., Yang, H. T., Strauli, N., Montoya, D. J., Van Der Wey, W., Ronas, J. R.,
Statz, B., Yao, D., Petrova, V., Zelikovsky, A., Spreafico, R., Shifman, S., Zaitlen, N.,
Rossetti, M., Ansel, K. M., Eskin, E., & Mangul, S. (2020). Profiling immunoglobulin
repertoires across multiple human tissues using RNA sequencing. Nature
Communications, 11(1), 3126. https://doi.org/10.1038/s41467-020-16857-7
Mangul, S., Mosqueiro, T., Abdill, R. J., Duong, D., Mitchell, K., Sarwal, V., Hill, B., Brito, J.,
Littman, R. J., Statz, B., Lam, A. K.-M., Dayama, G., Grieneisen, L., Martin, L. S., Flint,
J., Eskin, E., & Blekhman, R. (2019). Challenges and recommendations to improve the
installability and archival stability of omics computational tools. PLoS Biology, 17(6),
e3000333. https://doi.org/10.1371/journal.pbio.3000333
Marco-Puche, G., Lois, S., Benítez, J., & Trivino, J. C. (2019). RNA-Seq Perspectives to
Improve Clinical Diagnosis. Frontiers in Genetics, 0.
https://doi.org/10.3389/fgene.2019.01152
Nadel, B. B., Lopez, D., Montoya, D. J., Ma, F., Waddel, H., Khan, M. M., Mangul, S., &
Pellegrini, M. (2021). The Gene Expression Deconvolution Interactive Tool (GEDIT):
Accurate cell type quantification from gene expression data. GigaScience, 10(2),
giab002. https://doi.org/10.1093/gigascience/giab002
Nadel, B. B., Oliva, M., Shou, B. L., Mitchell, K., Ma, F., Montoya, D. J., Mouton, A., Kim-
Hellmuth, S., Stranger, B. E., Pellegrini, M., & Mangul, S. (2021). Systematic evaluation
of transcriptomics-based deconvolution methods and references using thousands of
clinical samples. Briefings in Bioinformatics, 22(6), bbab265.
https://doi.org/10.1093/bib/bbab265
Nazarov, V. I., Tsvetkov, V. O., Rumynskiy, E., Popov, A. A., Balashov, I., & Samokhina, M.
(2022). immunarch: Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires.
https://immunarch.com/, https://github.com/immunomind/immunarch
Ni, Q., Zhang, J., Zheng, Z., Chen, G., Christian, L., Grönholm, J., Yu, H., Zhou, D., Zhuang, Y.,
Li, Q.-J., & Wan, Y. (2020). VisTCR: An Interactive Software for T Cell Repertoire
Sequencing Data Analysis. Frontiers in Genetics, 11.
https://www.frontiersin.org/articles/10.3389/fgene.2020.00771
133
Nowicki, T. S., Berent-Maoz, B., Cheung-Lau, G., Huang, R. R., Wang, X., Tsoi, J., Kaplan-
Lefko, P., Cabrera, P., Tran, J., Pang, J., Macabali, M., Garcilazo, I. P., Carretero, I. B.,
Kalbasi, A., Cochran, A. J., Grasso, C. S., Hu-Lieskovan, S., Chmielowski, B., Comin-
Anduix, B., … Ribas, A. (2019). A Pilot Trial of the Combination of Transgenic NY-
ESO-1–reactive Adoptive Cellular Therapy with Dendritic Cell Vaccination with or
without Ipilimumab. Clinical Cancer Research, 25(7), 2096–2108.
https://doi.org/10.1158/1078-0432.CCR-18-3496
Ohlin, M., Scheepers, C., Corcoran, M., Lees, W. D., Busse, C. E., Bagnara, D., Thörnqvist, L.,
Bürckert, J.-P., Jackson, K. J. L., Ralph, D., Schramm, C. A., Marthandan, N., Breden, F.,
Scott, J., Matsen IV, F. A., Greiff, V., Yaari, G., Kleinstein, S. H., Christley, S., …
Collins, A. M. (2019). Inferred Allelic Variants of Immunoglobulin Receptor Genes: A
System for Their Evaluation, Documentation, and Naming. Frontiers in Immunology, 10.
https://www.frontiersin.org/articles/10.3389/fimmu.2019.00435
Pai, J. A., & Satpathy, A. T. (2021). High-throughput and single-cell T cell receptor sequencing
technologies. Nature Methods, 18(8), 881–892. https://doi.org/10.1038/s41592-021-
01201-8
Patas, K., Willing, A., Demiralay, C., Engler, J. B., Lupu, A., Ramien, C., Schäfer, T., Gach, C.,
Stumm, L., Chan, K., Vignali, M., Arck, P. C., Friese, M. A., Pless, O., Wiedemann, K.,
Agorastos, A., & Gold, S. M. (2018). T Cell Phenotype and T Cell Receptor Repertoire in
Patients with Major Depressive Disorder. Frontiers in Immunology, 9, 291.
https://doi.org/10.3389/fimmu.2018.00291
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon: Fast and
bias-aware quantification of transcript expression using dual-phase inference. Nature
Methods, 14(4), 417–419. https://doi.org/10.1038/nmeth.4197
Peng, K., Moore, J., Vahed, M., Brito, J., Kao, G., Burkhardt, A. M., Alachkar, H., & Mangul, S.
(2022). pyTCR: A comprehensive and scalable solution for TCR-Seq data analysis to
facilitate reproducibility and rigor of immunogenomics research. Frontiers in
Immunology, 13. https://www.frontiersin.org/articles/10.3389/fimmu.2022.954078
Quintana-Murci, L. (2019). Human Immunology through the Lens of Evolutionary Genetics.
Cell, 177(1), 184–199. https://doi.org/10.1016/j.cell.2019.02.033
134
Ralph, D. K., & Matsen, F. A. (2019). Per-sample immunoglobulin germline inference from B
cell receptor deep sequencing data. PLoS Computational Biology, 15(7), e1007133.
https://doi.org/10.1371/journal.pcbi.1007133
Randles, B. M., Pasquetto, I. V., Golshan, M. S., & Borgman, C. L. (2017). Using the Jupyter
Notebook as a Tool for Open Science: An Empirical Study. 2017 ACM/IEEE Joint
Conference on Digital Libraries (JCDL), 1–2.
https://doi.org/10.1109/JCDL.2017.7991618
Rempała, G. A., & Seweryn, M. (2013). Methods for diversity and overlap analysis in T-cell
receptor populations. Journal of Mathematical Biology, 67(0), 10.1007/s00285-012-
0589–7. https://doi.org/10.1007/s00285-012-0589-7
Renand, A., Cervera-Marzal, I., Gil, L., Dong, C., Garcia, A., Kervagoret, E., Aublé, H., Habes,
S., Chevalier, C., Vavasseur, F., Clémenceau, B., Cardon, A., Judor, J.-P., Mosnier, J.-F.,
Tanné, F., Laplaud, D.-A., Brouard, S., Gournay, J., Milpied, P., & Conchon, S. (2020).
Integrative molecular profiling of autoreactive CD4 T cells in autoimmune hepatitis.
Journal of Hepatology, 73(6), 1379–1390. https://doi.org/10.1016/j.jhep.2020.05.053
Retshabile, G., Mlotshwa, B. C., Williams, L., Mwesigwa, S., Mboowa, G., Huang, Z., Rustagi,
N., Swaminathan, S., Katagirya, E., Kyobe, S., Wayengera, M., Kisitu, G. P., Kateete, D.
P., Wampande, E. M., Maplanka, K., Kasvosve, I., Pettitt, E. D., Matshaba, M., Nsangi,
B., … Hanchard, N. A. (2018). Whole-Exome Sequencing Reveals Uncaptured Variation
and Distinct Ancestry in the Southern African Population of Botswana. The American
Journal of Human Genetics, 102(5), 731–743. https://doi.org/10.1016/j.ajhg.2018.03.010
Robbiani, D. F., Bozzacco, L., Keeffe, J. R., Khouri, R., Olsen, P. C., Gazumyan, A., Schaefer-
Babajew, D., Avila-Rios, S., Nogueira, L., Patel, R., Azzopardi, S. A., Uhl, L. F. K.,
Saeed, M., Sevilla-Reyes, E. E., Agudelo, M., Yao, K.-H., Golijanin, J., Gristick, H. B.,
Lee, Y. E., … Nussenzweig, M. C. (2017). Recurrent Potent Human Neutralizing
Antibodies to Zika Virus in Brazil and Mexico. Cell, 169(4), 597-609.e11.
https://doi.org/10.1016/j.cell.2017.04.024
Robert, L., Tsoi, J., Wang, X., Emerson, R., Homet, B., Chodon, T., Mok, S., Huang, R. R.,
Cochran, A. J., Comin-Anduix, B., Koya, R. C., Graeber, T. G., Robins, H., & Ribas, A.
(2014). CTLA4 blockade broadens the peripheral T cell receptor repertoire. Clinical
Cancer Research : An Official Journal of the American Association for Cancer Research,
20(9), 2424–2432. https://doi.org/10.1158/1078-0432.CCR-13-2648
135
Rodriguez, O. L., Gibson, W. S., Parks, T., Emery, M., Powell, J., Strahl, M., Deikus, G.,
Auckland, K., Eichler, E. E., Marasco, W. A., Sebra, R., Sharp, A. J., Smith, M. L.,
Bashir, A., & Watson, C. T. (2020). A Novel Framework for Characterizing Genomic
Haplotype Diversity in the Human Immunoglobulin Heavy Chain Locus. Frontiers in
Immunology, 11. https://www.frontiersin.org/articles/10.3389/fimmu.2020.02136
Rosati, E., Dowds, C. M., Liaskou, E., Henriksen, E. K. K., Karlsen, T. H., & Franke, A. (2017).
Overview of methodologies for T-cell receptor repertoire analysis. BMC Biotechnology,
17(1), 61. https://doi.org/10.1186/s12896-017-0379-9
Safonova, Y., & Pevzner, P. A. (2019). De novo Inference of Diversity Genes and Analysis of
Non-canonical V(DD)J Recombination in Immunoglobulins. Frontiers in Immunology,
10. https://doi.org/10.3389/fimmu.2019.00987
Sanmamed, M. F., & Chen, L. (2018). A Paradigm Shift in Cancer Immunotherapy: From
Enhancement to Normalization. Cell, 175(2), 313–326.
https://doi.org/10.1016/j.cell.2018.09.035
Scheepers, C., Shrestha, R. K., Lambson, B. E., Jackson, K. J. L., Wright, I. A., Naicker, D.,
Goosen, M., Berrie, L., Ismail, A., Garrett, N., Karim, Q. A., Karim, S. S. A., Moore, P.
L., Travers, S. A., & Morris, L. (2015). Ability To Develop Broadly Neutralizing HIV-1
Antibodies Is Not Restricted by the Germline Ig Gene Repertoire. The Journal of
Immunology, 194(9), 4371–4378. https://doi.org/10.4049/jimmunol.1500118
Schrama, D., Ritter, C., & Becker, J. C. (2017). T cell receptor repertoire usage in cancer as a
surrogate marker for immune responses. Seminars in Immunopathology, 39(3), 255–268.
https://doi.org/10.1007/s00281-016-0614-9
Sharonov, G. V., Serebrovskaya, E. O., Yuzhakova, D. V., Britanova, O. V., & Chudakov, D. M.
(2020). B cells, plasma cells and antibody repertoires in the tumour microenvironment.
Nature Reviews Immunology, 20(5), 294–307. https://doi.org/10.1038/s41577-019-0257-
x
Shugay, M., Bagaev, D. V., Turchaninova, M. A., Bolotin, D. A., Britanova, O. V., Putintseva,
E. V., Pogorelyy, M. V., Nazarov, V. I., Zvyagin, I. V., Kirgizova, V. I., Kirgizov, K. I.,
Skorobogatova, E. V., & Chudakov, D. M. (2015). VDJtools: Unifying Post-analysis of T
Cell Receptor Repertoires. PLoS Computational Biology, 11(11), e1004503.
https://doi.org/10.1371/journal.pcbi.1004503
136
Simnica, D., Schliffke, S., Schultheiß, C., Bonzanni, N., Fanchi, L. F., Akyüz, N., Gösch, B.,
Casar, C., Thiele, B., Schlüter, J., Lohse, A. W., & Binder, M. (2019). High-Throughput
Immunogenetics Reveals a Lack of Physiological T Cell Clusters in Patients With
Autoimmune Cytopenias. Frontiers in Immunology, 10, 1897.
https://doi.org/10.3389/fimmu.2019.01897
Singh, K. M., Phung, Y. T., Kohla, M. S., Lan, B. Y.-A., Chan, S., Suen, D. L., Murad, S.,
Rheault, S., Davidson, P., Evans, J., Singh, M., Dohil, S., Osorio, R. W., Wakil, A. E.,
Page, K., Feng, S., & Cooper, S. L. (2012). KIR Genotypic Diversity Can Track
Ancestries in Heterogeneous Populations: A Potential Confounder for Disease
Association Studies. Immunogenetics, 64(2), 97–109. https://doi.org/10.1007/s00251-
011-0569-x
Sirugo, G., Williams, S. M., & Tishkoff, S. A. (2019). The Missing Diversity in Human Genetic
Studies. Cell, 177(1), 26–31. https://doi.org/10.1016/j.cell.2019.02.048
Song, L., Cohen, D., Ouyang, Z., Cao, Y., Hu, X., & Liu, X. S. (2021). TRUST4: Immune
repertoire reconstruction from bulk and single-cell RNA-seq data. Nature Methods,
18(6), 627–630. https://doi.org/10.1038/s41592-021-01142-2
The Severe Covid-19 GWAS Group. (2020). Genomewide Association Study of Severe Covid-
19 with Respiratory Failure. New England Journal of Medicine, 383(16), 1522–1534.
https://doi.org/10.1056/NEJMoa2020283
Tumeh, P. C., Harview, C. L., Yearley, J. H., Shintaku, I. P., Taylor, E. J. M., Robert, L.,
Chmielowski, B., Spasic, M., Henry, G., Ciobanu, V., West, A. N., Carmona, M., Kivork,
C., Seja, E., Cherry, G., Gutierrez, A., Grogan, T. R., Mateus, C., Tomasic, G., … Ribas,
A. (2014). PD-1 blockade induces responses by inhibiting adaptive immune resistance.
Nature, 515(7528), 568–571. https://doi.org/10.1038/nature13954
Wang, Y., Jackson, K. J., Gäeta, B., Pomat, W., Siba, P., Sewell, W. A., & Collins, A. M.
(2011). Genomic screening by 454 pyrosequencing identifies a new human IGHV gene
and sixteen other new IGHV allelic variants. Immunogenetics, 63(5), 259–265.
https://doi.org/10.1007/s00251-010-0510-8
Warren, R. L., Freeman, J. D., Zeng, T., Choe, G., Munro, S., Moore, R., Webb, J. R., & Holt, R.
A. (2011). Exhaustive T-cell repertoire sequencing of human peripheral blood samples
reveals signatures of antigen selection and a directly measured repertoire size of at least 1
137
million clonotypes. Genome Research, 21(5), 790–797.
https://doi.org/10.1101/gr.115428.110
Watson, C. T., Glanville, J., & Marasco, W. A. (2017). The Individual and Population Genetics
of Antibody Immunity. Trends in Immunology, 38(7), 459–470.
https://doi.org/10.1016/j.it.2017.04.003
Watson, C. T., Steinberg, K. M., Huddleston, J., Warren, R. L., Malig, M., Schein, J., Willsey,
A. J., Joy, J. B., Scott, J. K., Graves, T. A., Wilson, R. K., Holt, R. A., Eichler, E. E., &
Breden, F. (2013). Complete Haplotype Sequence of the Human Immunoglobulin Heavy-
Chain Variable, Diversity, and Joining Genes and Characterization of Allelic and Copy-
Number Variation. The American Journal of Human Genetics, 92(4), 530–546.
https://doi.org/10.1016/j.ajhg.2013.03.004
Weinstein, J. A., Jiang, N., White, R. A., Fisher, D. S., & Quake, S. R. (2009). High-Throughput
Sequencing of the Zebrafish Antibody Repertoire. Science, 324(5928), 807–810.
https://doi.org/10.1126/science.1170020
Wong, W. K., Leem, J., & Deane, C. M. (2019). Comparative Analysis of the CDR Loops of
Antigen Receptors. Frontiers in Immunology, 10.
https://www.frontiersin.org/articles/10.3389/fimmu.2019.02454
Yuan, M., Liu, H., Wu, N. C., Lee, C.-C. D., Zhu, X., Zhao, F., Huang, D., Yu, W., Hua, Y.,
Tien, H., Rogers, T. F., Landais, E., Sok, D., Jardine, J. G., Burton, D. R., & Wilson, I. A.
(2020). Structural basis of a shared antibody response to SARS-CoV-2. Science,
369(6507), 1119–1123. https://doi.org/10.1126/science.abd2321
Zhang, F., Gan, R., Zhen, Z., Hu, X., Li, X., Zhou, F., Liu, Y., Chen, C., Xie, S., Zhang, B., Wu,
X., & Huang, Z. (2020). Adaptive immune responses to SARS-CoV-2 infection in severe
versus mild individuals. Signal Transduction and Targeted Therapy, 5, 156.
https://doi.org/10.1038/s41392-020-00263-y
Zhang, L., Cham, J., Paciorek, A., Trager, J., Sheikh, N., & Fong, L. (2017). 3D: Diversity,
dynamics, differential testing – a proposed pipeline for analysis of next-generation
sequencing T cell repertoire data. BMC Bioinformatics, 18, 129.
https://doi.org/10.1186/s12859-017-1544-9
138
Zhang, W., Wang, I.-M., Wang, C., Lin, L., Chai, X., Wu, J., Bett, A. J., Dhanasekaran, G.,
Casimiro, D. R., & Liu, X. (2016). IMPre: An Accurate and Efficient Software for
Prediction of T- and B-Cell Receptor Germline Genes and Alleles from Rearranged
Repertoire Data. Frontiers in Immunology, 7, 457.
https://doi.org/10.3389/fimmu.2016.00457
Abstract (if available)
Abstract
T cells respond to antigens through their diverse repertoire of T cell receptors (TCRs), which are formed through the process of V(D)J recombination of the germline TCR locus. The V(D)J recombination is a process that poses high diversity with selections from different V, D, and J genes. Furthermore, deletion and insertion of nucleotides during the V(D)J recombination process increases the complexity of TCR repertoires. Due to the complexity and diversity of TCR repertoires, targeted sequencing techniques, as well as computational methods, are needed for appropriate downstream analysis.
The recent advances in high throughput sequencing technologies enable cost-effective characterization of the immune system, thus providing opportunities to study immune responses and further promote translational and clinical research. TCR sequencing (TCR-Seq) is a sequencing technology that targets the amplified DNA or RNA from the regions of TCR loci. TCR-Seq data provides information on V, D, and J gene usage and complementarity determining region (CDR) diversity. Specifically, the complementarity-determining region 3 (CDR3) region is highly variable and has the greatest diversity. Computational methods are developed for facilitating TCR-Seq data analysis.
The repertoires of TCRs are central to adaptive immunity, thus analyses of these loci are important for understanding adaptive immune receptor repertoire. Additionally, TCR-Seq studies have offered new opportunities to deepen the understanding of the adaptive immune system and its roles in various human diseases, such as infectious diseases, cancers, autoimmune conditions, and neurodegenerative diseases.
However, even with the advances in TCR studies, there are opportunities for improvement. First, existing computational methods including MiXCR, ImRep, TRUST4, and CATT allow the utilization of non-targeted next-generation sequencing (NGS) data to extract TCR information. However, the performance of these methods on samples with variable T cells level remains unknown. Second, the traditional computational methods require researchers to acquire knowledge in the command-line interface. However, biomedical researchers usually lack the relevant experiences, this situation hinders the application of computational tools to a broad range of users.
Therefore, to tackle these challenges, the overall goals of this project are to benchmark the existing computational RNA-Seq-based TCR profiling methods and to develop a new platform for TCR-Seq data analysis. I utilized a data science approach to achieve these goals with two aims: aim 1: a benchmark of RNA-Seq-based TCR repertoire profiling methods across different tissue types and aim 2: development of pyTCR to facilitate scalability, reproducibility, and transparency for TCR-Seq analysis.
In order to properly benchmark the RNA-Seq-based TCR repertoire profiling methods, we conducted a comprehensive evaluation of four existing methods (MiXCR, ImRep, TRUST4, and CATT) by using TCR-Seq as a gold standard. We utilized the largest dataset of 19 samples from five different tissue types from cancer patients that were sequenced by both TCR-Seq and RNA-Seq. The scenarios in which TCR repertoires can be effectively profiled by RNA-Seq have been investigated by comparing fractions of clonotype captured, diversity and clonality estimates, clonotype frequencies, and clonotype counts. The results provided researchers with information to choose the most appropriate methods for individual purposes in TCR when RNA-Seq data is available but TCR-Seq may not be.
In order to develop a platform for TCR-Seq data analysis for biomedical researchers with limited computational backgrounds or skills, I developed pyTCR, a platform based on interactive computational notebooks for comprehensive TCR-Seq data analysis with enriched functionalities including metrics analysis, visualization, and statistical analysis. The detailed instructions and web-based platform reduce the barriers to using the command-line interface. This platform has a rich set of functionalities including easy to customize and alter parameters. The outputs of corresponding analyses are generated in one notebook, which decreases the chances of creating errors when combining different output files or utilizing different tools for further analysis. As a result, we provided a one-stop-shop solution for biomedical researchers to conduct TCR-Seq data analysis, with the facilitation of transparency and reproducibility in immunogenomics research.
Overall, this work has demonstrated the scalability and effectiveness of using RNA-Seq data collected from cancer patients to assemble TCR clonotypes and estimate diversity, it also delivered a user-friendly platform for TCR-Seq data analysis with enriched functionalities and reduced computational complexity and resources.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Evaluating the robustness and reproducibility or AIRR sequencing tools using computational replicates
PDF
Evaluation of ancestral diversity in open immunogenetics studies and databases
PDF
Benchmarking of computational tools for ancestry prediction using RNA-seq data
PDF
A systematic assessment of the completeness of TCR databases across Mus musculus strains
PDF
reTCR: a unified repository for robust, rigorous, and reproducible analysis of TCR-Seq data
PDF
Investigating the effects of T cell mediated anti-leukemia activity in FLT3-ITD positive acute myeloid leukemia
PDF
An analysis of the robustness and reproducibility of computational tools used in biomedical research
PDF
Structure-based computational analysis and prediction of TCR CDR3 loops in the TCR-peptide-MHC complex using solvation parameters and peptide molecular dynamics.
PDF
Prediction of peptides in formation of MHC class I - peptide - TCR complexes using molecular models and artificial intelligence
PDF
Evaluating the robustness and reproducibility of RNA-Seq quantification tools using computational replicates
PDF
A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater
PDF
Genomics and transcriptomic alterations of the glutamate receptors in acute myeloid leukemia
PDF
Unlocking capacities of genomics datasets through effective computational methods
PDF
Global landscape of primary omics data generation and its secondary analysis across 193 countries and territories
PDF
Development of an NFAT-GFP Jurkat T cell reporter system for acceleration of T Cell receptor affinity screening
PDF
Clinical, functional and therapeutic analysis of CD99 in acute myeloid leukemia
PDF
Omics for clinical diagnostics: challenges, opportunities, and computational approaches
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
Development of methods and novel crosslinkers for RNA structure and interaction studies in living cells
PDF
Computational analysis of genome architecture
Asset Metadata
Creator
Peng, Kerui
(author)
Core Title
Developing and benchmarking computational tools to facilitate T cell receptor repertoire analysis
School
School of Pharmacy
Degree
Doctor of Philosophy
Degree Program
Clinical and Experimental Therapeutics
Degree Conferral Date
2023-08
Publication Date
05/22/2023
Defense Date
05/10/2023
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
immunogenomics,OAI-PMH Harvest,T cell receptor repertoire analysis,TCR sequencing,TCR-Seq
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Mangul, Serghei (
committee chair
), Alachkar, Houda (
committee member
), Mancuso, Nicholas (
committee member
)
Creator Email
keruipen@usc.edu,keruipeng1@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113132908
Unique identifier
UC113132908
Identifier
etd-PengKerui-11878.pdf (filename)
Legacy Identifier
etd-PengKerui-11878
Document Type
Dissertation
Format
theses (aat)
Rights
Peng, Kerui
Internet Media Type
application/pdf
Type
texts
Source
20230522-usctheses-batch-1047
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
immunogenomics
T cell receptor repertoire analysis
TCR sequencing
TCR-Seq