Evaluation of ancestral diversity in open immunogenetics studies and databases

by Yu Ning Huang

A Thesis Presented to the FACULTY OF THE USC SCHOOL OF PHARMACY, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE (CLINICAL AND EXPERIMENTAL THERAPEUTICS)

August 2022

Acknowledgments

I thank all the researchers (Dr. Anastasia A. Minervina, Dr. Mikhail V. Pogorelyy, Dr. Grigory A. Efimov, Dr. Zheming Lu, Dr. Yang Ke, Dr. Xiao Liu, Dr. Mascha Binder, Dr. Dmitriy M. Chudakov, Dr. Nathalie Bedard, Dr. Jun S. Liu, Dr. Michelle Miron, Dr. Maura Rossetti, Dr. Satu Mustjoki, Dr. Atsunari Kawashima, Dr. Matthew A. Brown, Dr. Jorge Correale, and Dr. Tae Jin Kim) who shared information about their study participants. I thank Yiting Meng in Dr. Houda Alachkar's lab for providing the RNA sequences of the T cell receptor repertoire samples for the analysis. I thank Dr. Houda Alachkar and Dr. Amanda Burkhardt for advising on the thesis work. I thank Dr. Serghei Mangul for instructing, mentoring, and advising on the thesis work.
TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures
Abstract
Introduction
Method
Chapter 1: Raw sequencing data availability in T cell receptor sequencing studies
    The availability of raw sequencing data in TCR-Seq studies is limited
    The platforms used to store the raw TCR-Seq data
    The presence of the “Data availability statement” increases the availability of raw TCR-Seq data
    Discussion
Chapter 2: The landscape of ancestral diversity in T cell receptor sequencing studies
    The majority of TCR-Seq studies lack study participants’ ancestry information
    Individuals of non-European descent are underrepresented in TCR-Seq studies
    Assessment of the completeness of the IMGT database representing the diverse populations
    Discussion
References
Appendices
    Appendix A: Python Script for counting the number of mismatches in V and J gene from MiXCR

LIST OF TABLES

Table 1: The raw data availability in the TCR-Seq studies
Table 2: The summary of TCR-Seq studies included in the analysis
Table 3: The mean and median mismatches of the samples’ V and J gene
Table 4: The mean and median mismatches of the V and J gene among the European and Asian samples

LIST OF FIGURES

Figure 1: The raw data availability across the 134 TCR-Seq studies
Figure 2: The Sankey plot of the raw data availability across the 134 TCR-Seq studies
Figure 3: Pie chart for the availability of the study participants’ ancestry information of the TCR-Seq studies
Figure 4: Bar chart for the comparisons of the proportion of reported ancestry information among total study participants and total studies
Figure 5: Pie chart for the proportion of the studies that reported with a single ancestry group and multiple ancestry groups
Figure 6: Pie chart for the proportion of Hispanic and non-Hispanic individuals in the U.S.-based TCR-Seq studies
Figure 7: Bar plot for the changes in the proportion of study participants’ ancestry information over years
Figure 8: Boxplot for the number of study participants in each ancestry group among the TCR-Seq studies
Figure 9: Pie chart for the proportion of COVID patients’ ancestry information
Figure 10: Bar plot for the proportion of the COVID patients’ ancestry information among the six COVID cohorts based on the number of the COVID patients
Figure 11: The boxplot and the strip plot of the mismatches, including substitutions, insertions, and deletions, of the TCR’s V and J genes among the European and Asian ancestry groups

Abstract

Secondary analysis, the re-analysis of raw data, promotes novel biomedical discoveries in modern data-driven research. Rigorous design and conduct of experiments are crucial, but making raw data available, repurposable, and well annotated is equally vital for efficient and accurate secondary analysis. Likewise, immunogenetics data should be reproducible and robust, and should be made publicly available to the scientific community for secondary analysis. To assess the raw data availability of published T cell receptor (TCR) repertoire studies, I examined 11,918 TCR sequencing (TCR-Seq) samples corresponding to 134 TCR-Seq studies published from 2006 to 2022. Among the 134 studies, only 38.1% had raw TCR-Seq data shared in public repositories. Additionally, I found a statistically significant association between the presence of data availability statements and raw data availability (p=0.014). Yet 46.8% of studies with data availability statements failed to share the raw TCR-Seq data. There is a pressing need for the biomedical community to raise awareness of the importance of raw data availability in scientific research and to take immediate action to improve it, enabling cost-effective secondary analysis of existing immunogenomics data by the larger scientific community.
In addition to raw data availability, metadata about the study participants, such as ancestry, is also crucial for secondary analysis, because it promotes research across diverse populations and the development of precision medicine. However, few studies have taken ancestry information into account, and the ancestry composition of open TCR-Seq studies is unknown. Therefore, I analyzed 3341 study participants in the TCR-Seq studies to estimate the proportion of each ancestry group and the extent to which ancestry information is available. I found that 84.1% (2809) of the study participants were European, 9.0% (301) were Asian, 4.0% (135) were African, and 2.9% (96) were reported with other or unknown ancestries. The ancestry distribution across the TCR-Seq studies is therefore highly disproportionate and mostly focused on European ancestries. Additionally, only 36.8% of the TCR-Seq studies have available ancestry information, indicating that the majority do not.

Lastly, I examined how completely the international ImMunoGeneTics information system® database 1 (IMGT) represents diverse populations. Using the bioinformatics software MiXCR 2, which aligns and compares TCR-Seq reads to the IMGT database, I examined the mismatches in the VDJ genes of reads from samples of different ancestry groups and evaluated the completeness of the immunogenetics database across diverse ancestry groups. Unveiling the ancestry distribution in TCR-Seq studies and the completeness of the immunogenetics database in representing diverse populations could highlight the need to improve the representation of underrepresented populations and guide future immunogenomics studies to improve ancestry availability and distribution.
Introduction

The emergence of high-throughput sequencing techniques and the development of efficient bioinformatics tools provide efficient ways to profile the human adaptive immune receptor repertoire (AIRR), including repertoires of T cell receptors (TCRs) and immunoglobulins, and have produced large amounts of raw TCR-Seq data as AIRR research has developed. Availability of raw TCR-Seq data allows effective re-analysis of the data 3 (also known as secondary analysis), which in turn may accelerate novel biomedical discoveries 4. TCR-Seq data allow researchers to examine an individual’s immune status and immune responses, and to profile T cells across individuals in healthy or disease states, such as autoimmune diseases, infectious diseases, and cancer 5–7. Additionally, TCR-Seq studies support the development of novel therapeutics and biomarkers, including diagnostics for autoimmune disease 8 and cancer 9–11, CAR-T cell therapy 12, vaccines 13, and monoclonal and therapeutic antibodies 14. To unlock the full potential of publicly available raw TCR-Seq data for secondary analyses, a study must be conducted accurately, with reproducible results and rigorous laboratory practices 15. In this study, I examined raw data availability across 134 TCR-Seq studies covering 11,918 TCR-Seq samples and examined where the researchers stored the raw TCR-Seq data. I found that only 38.1% of the TCR-Seq studies had available raw data; conversely, 61.9% either did not share the raw data with the original publication or stated that the raw data would be available only upon request. Additionally, the majority of studies with available raw data stored it in the public repository Sequence Read Archive 16 (SRA).
Next, although advances in bioinformatics tools have driven the popularity of AIRR sequencing (AIRR-Seq) studies and enabled the scientific community to profile the human AIRR efficiently, the extent to which populations are underrepresented in open AIRR-Seq studies remains unknown. It has been previously established that participant ancestry is disproportionate in genome-wide association studies (GWAS), in which most participants are of European ancestry 17–19. Multiple projects have been launched to diversify ancestry representation in genomics research, such as the Human Heredity and Health in Africa project 20 (H3Africa) and Genomic Insights into the Formation of Human Populations in East Asia 21, which have highlighted the importance of population-based genome studies. Similar initiatives for AIRR-Seq in underrepresented populations are missing 22. Expanding knowledge of underrepresented populations in AIRR studies will enhance the understanding of phenotypic differences in AIRR and elucidate the different human responses to immune-related diseases across diverse populations 23,24. Failing to include diverse populations in immunogenomics research will prevent researchers from examining differences in adaptive immune systems across populations and hinder the understanding of population-level differences in human immunity.

To address these issues and develop a strategic roadmap to engage diverse populations in AIRR-Seq studies, I investigated the ancestral diversity in open TCR-Seq studies and examined the completeness of the immunogenetics database across diverse populations. I completed a systematic review of the proportion of ancestry information across 114 open TCR-Seq studies encompassing 3341 study participants.
I found that 84.1% of the study participants are of European ancestry. The majority of studies have unavailable ancestry information because most authors do not report it in the paper, which restricts the study of immune responses across diverse populations. The results clarify the current ancestry distribution and the extent of ancestry availability across TCR-Seq studies. Furthermore, I examined the completeness of the international ImMunoGeneTics information system® database 25 (IMGT database) in representing diverse populations. The bioinformatics software MiXCR 2 can comprehensively examine mismatches in the VDJ genes of samples from different ancestry groups by aligning and comparing the TCR-Seq reads to the IMGT database. This approach can evaluate how completely the IMGT database represents diverse ancestry groups. If researchers include diverse populations when conducting immunogenetics research and share the raw sequencing data, we will be able to promote population-level research in immunogenetics and eventually promote health equity across diverse populations 26. Additionally, promoting equity in immunogenetics research among ancestrally diverse populations can mitigate health disparities in healthcare treatments 22,27,28 and guide future studies to increase diversity in immunogenomics.

Method

Examine the raw data availability in the TCR-Seq studies

I collected TCR-Seq studies from PubMed and examined the raw data availability, the reasons for not sharing the raw data, and the last authors’ affiliations. In total, I collected 134 TCR-Seq studies published from 2006 to 2022, covering 11,918 samples, for the analysis of raw data availability. A study was considered to have available raw data only if it shared the raw FASTA or FASTQ files.
I categorized the studies without available raw TCR-Seq data as having unavailable raw data and further investigated why these studies did not share the raw data. I then grouped them by three reasons for unavailability. The first is that the study shared only summary data or descriptive data, not raw data, in the publication, ImmuneACCESS® (Adaptive Biotechnologies), VDJdb, or the supplementary files. The second is that raw data availability was not mentioned in the publication, meaning that the article did not describe how to access the raw data; this group also includes studies that provided erroneous accession numbers, which made the raw data inaccessible. The third is that the raw data were available only upon request, that is, only by contacting the authors of the publication directly. I also examined the last author’s affiliation and assigned it to one of three categories: medical research institutes (medical schools, schools of medicine, hospitals, medical centers, private health/disease research institutes, and government health research institutes), engineering schools, and others (including other science-related research institutes).

Examine the ancestry information and its availability of the study participants in the TCR-Seq studies

I investigated the study participants’ ancestry information among the published TCR studies.
For the ancestry information, I collected the study participants’ self-reported ancestry from the original publications or from email inquiries to the authors and categorized it into four categories: European ancestry, Asian ancestry, African ancestry, and other ancestry groups. Because the self-reported ancestry acquired from the authors was highly heterogeneous, I categorized each study participant based on geographic origin. If a publication did not include ancestry information, I emailed the authors to request the ancestry information of the study participants. I made two follow-up email inquiries per author.

Leveraging the bioinformatics tool, MiXCR, for the assessment of the completeness of the immunogenetics database

MiXCR is a universal software tool that can accurately extract the adaptive immune receptor repertoire from any type of sequencing data and align the samples' reads to the IMGT database. MiXCR accepts three types of input files: FASTA files (.fasta), FASTQ files (.fastq, optionally gzipped), and paired FASTQ files (optionally gzipped). I used files of one of these three types, downloaded from the public repository Sequence Read Archive 29, as input for the alignment function of MiXCR, which aligns the sample reads to the IMGT database. After alignment, MiXCR's binary alignment files (.vdjca), which are not human-readable, are exported to human-readable text files (.txt) for analysis. I further processed the output text files with a Python script that I wrote using the Pandas library (Appendix A). The script counts the number of substitutions, insertions, and deletions in the samples’ VDJ genes and helps to normalize and visualize the results (Appendix A).
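The counting step described above can be illustrated with a minimal sketch. This is not the Appendix A script: it uses only the Python standard library rather than Pandas, and the function names (`count_mismatches`, `mutations_from_alignment`, `log_normalize`) are hypothetical. It assumes that each exported alignment string has the pipe-separated layout `targetFrom|targetTo|targetLength|queryFrom|queryTo|mutations|score` and that mutations are encoded as `S<from><position><to>` (substitution), `D<from><position>` (deletion), and `I<position><to>` (insertion); both assumptions should be checked against the documentation of the MiXCR version in use. The natural logarithm is also an assumption, since the thesis writes the normalization only as log(x+1).

```python
import math
import re
from collections import Counter

# One encoded mutation (assumed encoding, see lead-in):
#   S<from><pos><to> substitution, D<from><pos> deletion, I<pos><to> insertion.
MUTATION = re.compile(r"S[A-Z]\d+[A-Z]|D[A-Z]\d+|I\d+[A-Z]")


def mutations_from_alignment(alignment: str) -> str:
    """Return the mutations field of one exported alignment string.

    Assumes the layout
    targetFrom|targetTo|targetLength|queryFrom|queryTo|mutations|score.
    """
    fields = alignment.split("|")
    return fields[5] if len(fields) > 5 else ""


def count_mismatches(mutations: str) -> Counter:
    """Count substitutions (S), deletions (D), and insertions (I)."""
    counts = Counter({"S": 0, "D": 0, "I": 0})
    for token in MUTATION.findall(mutations):
        counts[token[0]] += 1
    return counts


def log_normalize(x: float) -> float:
    """log(x + 1) transformation; natural log is an assumption here."""
    return math.log1p(x)


if __name__ == "__main__":
    # Hypothetical alignment string, made up for illustration only.
    example = "0|283|312|0|283|SA12TDG15I20C|1380.0"
    counts = count_mismatches(mutations_from_alignment(example))
    total = sum(counts.values())
    print(dict(counts), total, round(log_normalize(total), 3))
```

In a Pandas workflow, the same two functions could be applied to the V and J alignment columns of the exported text file with `Series.map`, keeping the parsing logic separate from the table handling.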
The command lines for running MiXCR:

1. mixcr analyze amplicon -s hsa --starting-material rna --5-end <5End> --3-end <3End> \
   --adapters <adapters> \
   [OPTIONS] input_file1 [input_file2] analysis_name
2. mixcr exportAlignments analysis_name.vdjca analysis_name.txt

To compare the results between different ancestry groups, I normalized the results from MiXCR among the European and Asian ancestry groups by a logarithmic transformation (log(x+1)).

Chapter 1: Raw sequencing data availability in T cell receptor sequencing studies

I investigated the raw sequencing data availability of 134 published TCR-Seq studies indexed in PubMed, published from 2006 to 2022 and covering 11,918 samples. A study was considered to have available raw sequencing data if the samples’ raw FASTQ or FASTA files were freely accessible from public repositories.

The availability of raw sequencing data in TCR-Seq studies is limited

Only 38.1% (51 out of 134 studies) of the TCR-Seq studies shared raw TCR-Seq data in public genomic repositories (Figure 1a). Conversely, 61.9% (83 out of 134 studies) had unavailable raw sequencing data or made raw data available only upon request (Figure 1a). I observed a similar trend among the 11,918 samples of the 134 TCR-Seq studies, of which 25% had available raw data (Figure 1a). I also observed that raw TCR-Seq data availability has increased over the past decade (Figure 1b). Among the 134 studies, 89.6% (120 studies) generated their raw data in medical research institutes, including medical schools, hospitals, medical centers, private health/disease research institutes, and government health research institutes, such as the National Institutes of Health (Figure 1c). The remaining raw data were generated in engineering schools and other research institutes (Figure 1c).
Additionally, I examined the TCR chain types in the available raw sequencing data. Among the samples with available raw sequencing data, 65.2% were TCR beta (TCRβ) chain, 30.7% were TCR alpha (TCRα) chain, and 4.1% were TCR gamma (TCRγ) or TCR delta (TCRδ) chain (Figure 1d).

Figure 1: The raw data availability across the 134 TCR-Seq studies. a, The proportion of the raw sequencing data available in the TCR-Seq studies. Left: The raw sequencing data availability of the 134 TCR-Seq studies; Right: The raw sequencing data availability of the 11,918 samples in the 134 TCR-Seq studies. b, Bar plot of the changes in the raw TCR-Seq data availability from 2006 to 2022. c, The pie chart depicting the types of institutes where the data was generated. d, The pie chart depicting the chain types of the TCR-Seq data.

I further investigated the specific reasons that the raw data was unavailable among the 83 studies without available raw TCR-Seq data. Rather than sharing raw sequencing data, 44 of these studies shared only summary data, for example on ImmuneACCESS 30 ® (Adaptive Biotechnologies), on VDJdb 31, or in supplementary files. Another 34 studies did not include statements about the raw data or provide raw sequencing data in the original publications, and five studies indicated that the raw sequencing data would be available upon direct request to the authors (Figure 2a).

Figure 2: The Sankey plot of the raw data availability across the 134 TCR-Seq studies. a, The Sankey plot depicts the proportion of raw data availability (left) and platforms where available raw data is stored and the specific reasons why the raw data is unavailable (right). b, The Sankey plot shows the proportion of TCR-Seq studies with “Data availability statements” in the text of the article (left) and the proportion of studies with available raw data, studies reporting only summary data, studies mentioning that data is available upon request, and studies not mentioning data availability information (right).

The platforms used to store the raw TCR-Seq data

I further examined the platforms where researchers stored the raw TCR-Seq data. Among the TCR-Seq studies with available raw sequencing data, 68.6% (35 of the 51 studies) shared the raw data in the Sequence Read Archive (SRA) 16 and 19.6% (ten of the 51 studies) shared it in the Gene Expression Omnibus (GEO) 32. The remaining 11.8% (six of the 51 studies) shared the raw sequencing data in various other online repositories, including the European Genome-phenome Archive 33 (EGA) (3 studies), VDJServer 34 (1 study), and the National Genomics Data Center, China 35 (NGDC) (2 studies) (Figure 2a).

The presence of the “Data availability statement” increases the availability of raw TCR-Seq data

As part of promoting more transparent and reproducible research, many journals require a data availability statement for publication 36–40. Therefore, I examined the impact of the presence of the “Data availability statement” in research articles on raw TCR-Seq data availability. Raw data availability was 29.9% (26 out of 87 studies) among studies without data availability statements and 53.2% (25 out of 47 studies) among studies with them, a difference of 23.3 percentage points. The detailed information can be seen in Table 1.

Table 1: The raw data availability in the TCR-Seq studies.
Category | Subcategory | Number of studies
Studies with data availability statement (N=47) | Raw data available | 25
Studies with data availability statement (N=47) | Only summary data is available | 15
Studies with data availability statement (N=47) | Upon request | 4
Studies with data availability statement (N=47) | Raw data availability is not mentioned | 3
Studies without data availability statements (N=87) | Raw data available | 26
Studies without data availability statements (N=87) | Only summary data is available | 31
Studies without data availability statements (N=87) | Upon request | 1
Studies without data availability statements (N=87) | Raw data availability is not mentioned | 29

There are 47 studies containing “Data availability statements” in the corresponding publications. Among these, 53.2% (25 out of the 47 studies) shared the raw TCR-Seq data, while the rest did not share raw TCR-Seq data or made it available only upon request. Conversely, 87 studies did not have “Data availability statements” in the original publications, and only 29.9% (26 out of the 87 studies) of them shared the raw TCR-Seq data. Additionally, three studies gave erroneous SRA accession numbers in their data availability statements, so I was not able to access their raw data (Figure 2). These three studies are therefore categorized under “With data availability statement” and “Raw data availability is not mentioned”. Using Pearson's chi-squared test (χ²), I found a statistically significant association between the presence of data availability statements and the availability of raw data (p=0.014).

I also examined where in the publications raw data availability is mentioned, since authors place this information in various locations of the research articles. Of the 51 studies, 49.0% (25 out of 51 studies) gave it in the “Data availability statement” section of the article.
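The association test reported above can be checked with a short, standard-library-only sketch built from the Table 1 counts. The function name `chi2_2x2` is hypothetical, and the use of Yates' continuity correction is an assumption: the thesis does not say whether a correction was applied, although it is a common default for 2x2 tables in statistics packages. For one degree of freedom the chi-squared p-value can be written as erfc(sqrt(x/2)), which avoids any external dependency.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int, yates: bool = True):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With yates=True, Yates' continuity
    correction is applied (an assumption, see the lead-in).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0.0)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Rows: with / without a data availability statement (Table 1).
    # Columns: raw data available / raw data not available.
    with_das = (25, 47 - 25)
    without_das = (26, 87 - 26)
    stat, p = chi2_2x2(*with_das, *without_das)
    print(f"chi2 = {stat:.2f}, p = {p:.3f}")
    # Difference in availability, in percentage points.
    print(f"{100 * (25 / 47 - 26 / 87):.1f} percentage points")
```

With the correction the sketch gives a p-value of about 0.014, in line with the reported value; without it (`yates=False`) the p-value is smaller, about 0.008, so the conclusion of a significant association does not depend on this choice.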
Eight out of 51 studies (15.7%) mentioned raw data availability in a footnote, whereas the remaining 18 out of 51 studies (35.3%) mentioned it directly in the main text.

Discussion

Raw data availability is alarmingly low, with only 38.1% of the TCR-Seq studies sharing the raw data (Figure 2b). Low availability of raw data impedes secondary analysis and leaves published datasets with little further value for novel biomedical discoveries. I investigated the association between the presence of data availability statements in the articles and raw data availability, and found it to be statistically significant (p=0.014). Raw data availability was 23.3 percentage points higher among studies with data availability statements, so such statements might improve raw data availability. If journals mandate a raw data availability statement, this can help promote the raw data availability of published studies. However, even if such statements were mandated by all journals, authors could still include one without sharing raw data. For instance, authors might share only summary data or make the raw data available only upon request; 46.8% of the studies with data availability statements did not share raw data (Table 1).

Among the studies with unavailable raw data, the majority (44 studies) shared only summary data. However, summary data have less value for novel secondary analysis. For example, 23 studies shared summary data in the corporate-owned repository ImmuneACCESS® by Adaptive Biotechnologies.
Unfortunately, the ImmuneACCESS® repository offers access only to the summary data of the studies, so the raw data files remain inaccessible to the public and researchers cannot re-analyze the raw data generated by Adaptive Biotechnologies for novel biomedical discoveries. Additionally, five studies stated that the data would be available only by direct request to the authors. However, it has been previously shown that such statements do not guarantee the availability of the raw data: the authors might not reply to inquiries, and it might become hard to reach the authors who have access to the raw data in the future. Accordingly, making raw data available upon request is not a sustainable or practical way to improve research reproducibility and raw data availability 36,37.

The specific reasons for not sharing raw TCR-Seq data are beyond the scope of this manuscript and need to be investigated in future studies. The perceptual and technical barriers that researchers face when sharing data are yet to be determined 41. Previous work has suggested that cultural barriers might be one reason, leaving researchers unaware of the importance of sharing the raw data. Regulatory barriers might also deter raw data sharing, such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) 42, a federal regulation mandating the creation of national standards to protect sensitive patient health information from being disclosed without the patient’s consent or knowledge. Finally, there may be technical barriers that deter or prevent authors from sharing the raw data 37. Individual researchers, research institutes, and journals should all take part in ensuring raw data availability 37.
Individual researchers and research institutes, more specifically, should make the raw data publicly available by raising awareness of the importance of sharing raw data and by acquiring the consent of study participants or patients to share health information in conformance with the regulations. Journals also have a critical role in promoting raw data availability, and many journals are already taking action by requesting data availability statements within articles upon publication. However, journals need to impose more stringent measures to ensure raw data availability 43. Many journals have already mandated that authors share the raw data of their studies 44. Journal policies that mandate raw data sharing might be a feasible way to improve raw data availability 45.

In conclusion, there is a pressing need to increase awareness of raw data availability in scientific research, which can enable cost-effective secondary analysis of existing immunogenomics data for novel biomedical discoveries. In addition to immunogenomics data, metadata 40,46, raw sequencing data 47, and open human health data 48, such as medical history, should also be made publicly available for secondary analysis. Therefore, I recommend that all members of the biomedical community, including individual researchers, research institutions, and journals, especially journal editors, contribute to increasing raw data sharing and improving raw data availability in future studies.

Chapter 2: The landscape of ancestral diversity in T cell receptor sequencing studies

I examined the landscape of ancestral diversity in the field of TCR-Seq studies and the completeness of the immunogenetics database representing diverse populations. I performed a survey of 114 TCR-Seq studies including 3341 study participants on their ancestry information. The detailed information can be seen in Table 2.
I examined the current state of ancestry information availability across TCR-Seq studies. The ancestry information was self-reported and was either extracted from the text of TCR-Seq publications or acquired directly from the corresponding authors. Next, I used the bioinformatics software MiXCR to assess the completeness of the IMGT database in representing diverse populations.

Table 2: The summary of TCR-Seq studies included in the analysis.
Study Title | Number of study participants | Category | Type of information | Population labels in the study
Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H. & Holt, R. A. Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome Res. 19, 1817–1824 (2009). | 335 | European | Self-report ancestry | Caucasian
Wang, C. et al. High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. Proc. Natl. Acad. Sci. U. S. A. 107, 1518–1523 (2010). | 1 | Asian | Self-report ancestry | East Asian
Warren, R. L. et al. Exhaustive T-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. Genome Res. 21, 790–797 (2011). | 3 | European | Self-report ancestry | Caucasian
Putintseva, E. V. et al. Mother and child T cell receptor repertoires: deep profiling study. Front. Immunol. 4, 463 (2013). | 9 | European | Self-report ancestry | Caucasian
Britanova, O. V. et al. Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J. Immunol. Baltim. Md 1950 192, 2689–2698 (2014). | 39 | European | Self-report ancestry | Caucasian
Zvyagin, I. V. et al. Distinctive properties of identical twins’ TCR repertoires revealed by high-throughput sequencing. Proc. Natl. Acad. Sci. U. S. A. 111, 5980–5985 (2014). | 6 | European | Self-report ancestry | Russian
Robert, L. et al. CTLA4 blockade broadens the peripheral T-cell receptor repertoire. Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res. 20, 2424–2432 (2014). | 21 | European, Other | Ethnicity, race | Caucasian, Hispanic
Gaide, O. et al. Common clonal origin of central and resident memory T cells following skin immunization. Nat. Med. 21, 647–653 (2015). | 11 | European | Self-report ancestry | Caucasian
Wu, J., Liu, D., Tu, W., Song, W. & Zhao, X. T-cell receptor diversity is selectively skewed in T-cell populations of patients with Wiskott-Aldrich syndrome. J. Allergy Clin. Immunol. 135, 209–216 (2015). | 26 | Asian | Self-report ancestry | Chinese
Harden, J. L., Hamm, D., Gulati, N., Lowes, M. A. & Krueger, J. G. Deep Sequencing of the T-cell Receptor Repertoire Demonstrates Polyclonal T-cell Infiltrates in Psoriasis. F1000Research 4, 460 (2015). | 19 | European, Asian, African | Ethnicity | Black, Hispanic, White, Asian
Britanova, O. V. et al. Dynamics of Individual T Cell Repertoires: From Cord Blood to Centenarians. J. Immunol. Baltim. Md 1950 196, 5005–5013 (2016). | 65 | European | Self-report ancestry | Caucasian
Chen, Z. et al. T cell receptor β-chain repertoire analysis reveals intratumour heterogeneity of tumour-infiltrating lymphocytes in oesophageal squamous cell carcinoma. J. Pathol. 239, 450–458 (2016). | 7 | Asian | Self-report ancestry | Han
Rossetti, M. et al. TCR repertoire sequencing identifies synovial Treg cell clonotypes in the bloodstream during active inflammation in human arthritis. Ann. Rheum. Dis. 76, 435–441 (2017). | 30 | European | Ethnicity | Caucasian
Seay, H. R. et al. Tissue distribution and clonal diversity of the T and B cell repertoire in type 1 diabetes. JCI Insight 1, e88242 (2016). | 33 | European, African, Other | Ethnicity, race | Caucasian, African American, Hispanic
Li, B. et al. Landscape of tumor-infiltrating T cell repertoire of human cancers. Nat. Genet. 48, 725–732 (2016). | 3 | European | Ethnicity | White
Zvyagin, I. V. et al. Tracking T-cell immune reconstitution after TCRαβ/CD19-depleted hematopoietic cells transplantation in children. Leukemia 31, 1145–1153 (2017). | 24 | European | Self-report ancestry | Caucasian
Ramesh, M., Hamm, D., Simchoni, N. & Cunningham-Rundles, C. Clonal and constricted T cell repertoire in Common Variable Immune Deficiency. Clin. Immunol. Orlando Fla 178, 1–9 (2017). | 66 | European | Self-report ancestry | Caucasian
Savola, P. et al. Somatic mutations in clonally expanded cytotoxic T lymphocytes in patients with newly diagnosed rheumatoid arthritis. Nat. Commun. 8, 15869 (2017). | 82 | European | Self-report ancestry | Finnish
Wang, T. et al. The Different T-cell Receptor Repertoires in Breast Cancer Tumors, Draining Lymph Nodes, and Adjacent Tissues. Cancer Immunol. Res. 5, 148–156 (2017). | 16 | Asian | Self-report ancestry | Han
Abdel-Hakeem, M. S., Boisvert, M., Bruneau, J., Soudeyns, H. & Shoukry, N. H. Selective expansion of high functional avidity memory CD8 T cell clonotypes during hepatitis C virus reinfection and clearance. PLoS Pathog. 13, e1006191 (2017). | 5 | European | Self-report ancestry | French Caucasian, Swiss Caucasian
Pogorelyy, M. V. et al. Persisting fetal clonotypes influence the structure and overlap of adult human T cell receptor repertoires. PLoS Comput. Biol. 13, e1005572 (2017). | 10 | European | Self-report ancestry | Caucasian
Kargl, J. et al. Neutrophils dominate the immune cell composition in non-small cell lung cancer. Nat. Commun. 8, 14381 (2017). | 140 | European, Asian | Ethnicity | White, Asian
Herati, R. S. et al. Successive annual influenza vaccination induces a recurrent oligoclonotypic memory response in circulating T follicular helper cells. Sci. Immunol. 2, (2017). | 43 | European, African | Ethnicity | White, African American
Emerson, R. O. et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat. Genet. 49, 659–665 (2017). | 558 | European, Asian, African | Ethnicity, race | Non-Hispanic/Latino White, Hispanic/Latino White, African American, Asian non-Hispanic/Latino, Asian-Hispanic/Latino, American Indian or Alaska Native, Native Hawaiian or Pacific Islanders, Unknown Hispanic/Latino
Jia, Q. et al. Local mutational diversity drives intratumoral immune heterogeneity in non-small cell lung cancer. Nat. Commun. 9, 5361 (2018). | 15 | Asian | Self-report ancestry | Chinese
Komech, E. A. et al. CD8+ T cells with characteristic T cell receptor beta motif are detected in blood and expanded in synovial fluid of ankylosing spondylitis patients. Rheumatol. Oxf. Engl. 57, 1097–1104 (2018). | 25 | European | Self-report ancestry | Caucasian
Pogorelyy, M. V. et al. Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins. Proc. Natl. Acad. Sci. U. S. A. 115, 12704–12709 (2018). | 6 | European | Self-report ancestry | Russian
Patas, K. et al. T Cell Phenotype and T Cell Receptor Repertoire in Patients with Major Depressive Disorder. Front. Immunol. 9, 291 (2018). | 10 | European | Self-report ancestry | European
Miron, M. et al. Human Lymph Nodes Maintain TCF-1hi Memory T Cells with High Functional Potential and Clonal Diversity throughout Life. J. Immunol. Baltim. Md 1950 201, 2132–2140 (2018). | 5 | European, Other | Ethnicity, race | Hispanic, White
Carnero Contentti, E., Farez, M. F. & Correale, J. Mucosal-Associated Invariant T Cell Features and TCR Repertoire Characteristics During the Course of Multiple Sclerosis. Front. Immunol. 10, 2690 (2019). | 110 | European | Ethnicity | Caucasian
Lee, M. et al. Preferential Infiltration of Unique Vγ9Jγ2-Vδ2 T Cells Into Glioblastoma Multiforme. Front. Immunol. 10, 555 (2019). | 4 | Asian | Ethnicity | Korean
Simnica, D. et al. High-Throughput Immunogenetics Reveals a Lack of Physiological T Cell Clusters in Patients With Autoimmune Cytopenias. Front. Immunol. 10, 1897 (2019). | 47 | European | Self-report ancestry | Caucasian
Ramien, C. et al. T Cell Repertoire Dynamics during Pregnancy in Multiple Sclerosis. Cell Rep. 29, 810-815.e4 (2019). | 23 | European | Self-report ancestry | Central European Caucasian/White
Minervina, A. A. et al. Longitudinal high-throughput TCR repertoire profiling reveals the dynamics of T-cell memory formation after mild COVID-19 infection. eLife 10, (2021). | 2 | European | Ethnicity | Russian (Central European part of Russia)
Lin, Y.-H. et al. Dissecting efficiency of a 5’ rapid amplification of cDNA ends (5’-RACE) approach for profiling T-cell receptor beta repertoire. PloS One 15, e0236366 (2020). | 2 | Asian | Self-report ancestry | Asian
Hanson, A. L. et al. Altered Repertoire Diversity and Disease-Associated Clonal Expansions Revealed by T Cell Receptor Immunosequencing in Ankylosing Spondylitis Patients. Arthritis Rheumatol. Hoboken NJ 72, 1289–1302 (2020). | 85 | European | Ethnicity | Australians of white European
Schultheiß, C. et al. Next-Generation Sequencing of T and B Cell Receptor Repertoires from COVID-19 Patients Showed Signatures Associated with Severity of Disease. Immunity 53, 442-455.e4 (2020). | 36 | European | Self-report ancestry | Caucasian
Liao, M. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat. Med. 26, 842–844 (2020). | 16 | Asian | Self-report ancestry | Han
Nolan, S. et al. A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2. Res. Sq. (2020). | 1146 | European, Asian, African, Other | Self-report ancestry, race | Asian, Caucasian, Hispanic, African American, Native American
Shomuradova, A. S. et al. SARS-CoV-2 Epitopes Are Recognized by a Public and Diverse Repertoire of Human T Cell Receptors. Immunity 53, 1245-1257.e5 (2020). | 34 | European | Ethnicity | Jewish, Ukrainian, Russian, German, Armenian, Greek, Tajik, Tatars, Georgian, Osetian, Mordvin, Chuvash
Nishida, K. et al. Clinical importance of the expression of CD4+CD8+ T cells in renal cell carcinoma. Int. Immunol. 32, 347–357 (2020). | 104 | Asian | Self-report ancestry | Japanese
Kusnadi, A. et al. Severely ill COVID-19 patients display impaired exhaustion features in SARS-CoV-2-reactive CD8+ T cells. Sci. Immunol. 6, (2021). | 39 | European, Asian, African | Ethnicity | Arab, Black British, Chinese, Indian, White British, White Other

The majority of TCR-Seq studies lack study participants’ ancestry information

I examined the availability of study participants’ ancestry information in TCR-Seq publications. Among the 114 TCR-Seq studies surveyed, only 21 (18.4%) included ancestry information in the text of the paper (Figure 3). For the studies without ancestry information in the original article, I requested the study participants’ ancestry information directly from the authors through email inquiries and obtained it for an additional 21 studies. In other words, an additional 18.4% of the studies have ancestry information available by request. The authors of 7 TCR-Seq studies were unable to reveal the study participants’ ancestry information, and the authors of 65 TCR-Seq studies did not reply to our ancestry inquiries, leaving those 65 studies without available ancestry information. As a result, 42 out of the 114 TCR-Seq studies, across 3261 individuals, had available ancestry information (Figure 3).

Figure 3: Pie chart for the availability of the study participants’ ancestry information of the TCR-Seq studies.

I investigated the reasons for the unavailability of study participants’ ancestry information. The authors of 7 studies (6% of all surveyed studies) did not share data due to privacy concerns and limitations in study designs. First, the authors had concerns about violating the U.S. Health Insurance Portability and Accountability Act (HIPAA) rules 42 .
Second, the authors were not able to share participants’ information due to the limitations of study protocols that had already been approved by the Institutional Review Board (IRB). For example, ancestry information was not collected during the recruitment phase, or the study participants’ information was not approved to be shared with researchers from other institutions. Additionally, the researchers could no longer access the study participants’ ancestry information after the trials were over. Last, some studies used de-identified samples, which prevented the authors from obtaining the study participants’ information.

Individuals of non-European descent are underrepresented in TCR-Seq studies

Firstly, I analyzed the ancestry diversity among the 42 TCR-Seq studies with available ancestry information. Nearly 60% of the studies included participants of European ancestry or descent, while only 10% of the studies included participants of African ancestry or descent (Figure 4).

Figure 4: Bar chart for the comparisons of the proportion of reported ancestry information among total study participants and total studies. *A total of 42 studies were included in the analysis. For the studies reported with more than one ancestry, we counted each ancestry as one entry.

From the perspective of the study participants, more than 80% were reported to be from European ancestry groups, followed by 9% from Asian ancestry groups and 4% from African ancestry groups (Figure 4). Secondly, I examined the proportion of TCR-Seq studies that were conducted in a single ancestry group versus multiple ancestry groups. In total, 33 out of 42 studies were conducted on a single self-reported ancestry (Figure 5).

Figure 5: Pie chart for the proportion of the studies that reported with a single ancestry group and multiple ancestry groups.
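The study-level counting rule noted under Figure 4 (each ancestry in a multi-ancestry study counts as one entry) and the single versus multiple ancestry split can be sketched as follows. This is a minimal illustration, not the analysis code used for the thesis: the function names are hypothetical, and only five rows from Table 2 are included, so the resulting counts are illustrative rather than the reported results.

```python
from collections import Counter

# A few rows copied from Table 2: (first author and year, ancestry categories).
# Only a subset is shown, so the totals are illustrative, not the thesis results.
studies = [
    ("Freeman 2009", ["European"]),
    ("Wang 2010", ["Asian"]),
    ("Robert 2014", ["European", "Other"]),
    ("Harden 2015", ["European", "Asian", "African"]),
    ("Wu 2015", ["Asian"]),
]


def tally_ancestry_entries(rows):
    """Count each ancestry category once per study that reports it."""
    counts = Counter()
    for _, categories in rows:
        counts.update(set(categories))
    return counts


def split_single_multiple(rows):
    """Return (studies with one ancestry group, studies with several)."""
    single = sum(1 for _, categories in rows if len(set(categories)) == 1)
    return single, len(rows) - single


entry_counts = tally_ancestry_entries(studies)
single, multiple = split_single_multiple(studies)

for ancestry, n in entry_counts.most_common():
    print(f"{ancestry}: {n} of {len(studies)} studies")
print(f"single ancestry: {single}, multiple ancestries: {multiple}")
```

Counting at the study level in this way treats a one-person study and a thousand-person study equally, which is why the participant-level proportions are reported separately.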
Among the 33 studies with a single ancestry, European-based studies were dominant, followed by Asian-based studies; notably, no studies were conducted in African populations alone. Nine of the examined studies included study participants from multiple ancestry groups, all of which predominantly consisted of European populations (Table 2). Thirdly, I focused on the ethnic information among 12 U.S.-based TCR-Seq studies. Only 4% of study participants self-reported as Hispanic, which indicates that Hispanics are a highly underrepresented population in TCR-Seq studies in the United States relative to their proportion of the U.S. population (Figure 6).

Figure 6: Pie chart for the proportion of Hispanic and non-Hispanic individuals in the U.S.-based TCR-Seq studies.

Furthermore, I investigated the temporal dynamics of ancestry diversity in TCR-Seq studies. From 2009 to 2021, there was a 10% increase in the proportion of non-European individuals (Figure 7). Despite the increase, the distribution of study participants by ancestry is still highly disproportionate in TCR-Seq studies. Next, I examined the number of participants across different ancestry groups in the TCR-Seq studies. The results showed that the average number of study participants of European ancestry was much greater than the average number of study participants of Asian and African ancestries (Figure 8).

Figure 7: Bar plot for the changes in the proportion of study participants’ ancestry information over years.

Figure 8: Boxplot for the number of study participants in each ancestry group among the TCR-Seq studies.

Lastly, in light of the COVID-19 pandemic, the same pattern of skewed diversity was observed in TCR-Seq studies of COVID-19 patients. A substantial share of the study participants across all six COVID-19 TCR-Seq studies were of European ancestry (Figure 9), and five of the studies mainly included study participants of European ancestry (Figure 10).
Figure 9: Pie chart for the proportion of COVID patients’ ancestry information.

Figure 10: Bar plot for the proportion of the COVID patients’ ancestry information among the six COVID cohorts based on the number of the COVID patients.

Assessment of the completeness of the IMGT database representing the diverse populations

I assessed how completely the immunogenetics database, the international ImMunoGeneTics information system® 25 (IMGT®), represents diverse populations. The existing bioinformatics software MiXCR 2 aligns the reads of TCR samples against the reference sequences in the IMGT database. By analyzing the number of substitutions, insertions, and deletions in the V and J genes of the TCR samples’ RNA sequences (RNA-Seq) across diverse populations, I can evaluate the completeness of the immunogenetics database across those populations. I ran ten TCR-Seq samples through MiXCR and examined the mismatches in the VDJ genes of the samples. I counted the number of mismatches in the seven TCR-Seq samples with available ancestry information: three European samples and four Asian samples. From the output of MiXCR, the mean number of substitutions in the V gene is 7.01 for the European samples and 5.97 for the Asian samples. The detailed mismatches in the V and J genes can be seen in Table 3.
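The per-read mismatch counting described above can be sketched as follows. This is a hedged illustration, not the script used for the thesis. It assumes that each read's V or J gene alignment has already been exported as a mutation string in a MiXCR-style encoding (`S<ref><pos><new>` for a substitution, `D<ref><pos>` for a deletion, `I<pos><new>` for an insertion); that encoding, the function names, and the example strings are all assumptions made for illustration.

```python
import re
from statistics import mean, median

# Assumed token grammar (MiXCR-style mutation strings):
#   S<ref><pos><new>  substitution, e.g. SA12T
#   D<ref><pos>       deletion,     e.g. DA13
#   I<pos><new>       insertion,    e.g. I14G
TOKEN = re.compile(r"S[ACGTN]\d+[ACGTN]|D[ACGTN]\d+|I\d+[ACGTN]")
KIND = {"S": "substitution", "D": "deletion", "I": "insertion"}


def count_mismatches(mutations):
    """Count substitutions, deletions, and insertions in one mutation string."""
    counts = {kind: 0 for kind in KIND.values()}
    for token in TOKEN.finditer(mutations):
        counts[KIND[token.group(0)[0]]] += 1
    return counts


def summarize(mutation_strings):
    """Reduce per-read counts to a mean and median for each mismatch type."""
    per_read = [count_mismatches(m) for m in mutation_strings]
    return {
        kind: {
            "mean": mean(r[kind] for r in per_read),
            "median": median(r[kind] for r in per_read),
        }
        for kind in KIND.values()
    }


# Hypothetical V gene mutation strings for four reads
# (an empty string stands for a read that matches the reference exactly).
example_reads = ["SA12TSG40C", "", "DA13", "SA5GI9TSC20A"]
summary = summarize(example_reads)
print(summary)
```

Summaries of this kind, computed separately for the V and J genes of each sample, would give the median and mean values of the sort tabulated for each sample.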
Table 3: The mean and median mismatches of the samples’ V and J genes
Ancestry | Sample | Number of reads | Statistic | V gene Substitution | V gene Deletion | V gene Insertion | J gene Substitution | J gene Deletion | J gene Insertion
European | RS_10_SS | 680,507 | Median | 3 | 0 | 0 | 0 | 0 | 0
European | RS_10_SS | 680,507 | Mean ± SD | 8.0 ± 10.51 | 0.2 ± 0.62 | 0.2 ± 0.62 | 0.16 ± 0.53 | 0.02 ± 0.15 | 0.02 ± 0.15
European | RS_8_Jor1 | 310,637 | Median | 3 | 0 | 0 | 0 | 0 | 0
European | RS_8_Jor1 | 310,637 | Mean ± SD | 7.47 ± 10.12 | 0.18 ± 0.58 | 0.19 ± 0.58 | 0.19 ± 0.57 | 0.03 ± 0.17 | 0.03 ± 0.16
European | RS_9_Jar | 998,839 | Median | 1 | 0 | 0 | 0 | 0 | 0
European | RS_9_Jar | 998,839 | Mean ± SD | 6.09 ± 9.62 | 0.15 ± 0.53 | 0.16 ± 0.55 | 0.2 ± 0.66 | 0.03 ± 0.16 | 0.03 ± 0.16
Asian | RS_11_YZ | 1,538,245 | Median | 1 | 0 | 0 | 0 | 0 | 0
Asian | RS_11_YZ | 1,538,245 | Mean ± SD | 5.52 ± 9.45 | 0.15 ± 0.54 | 0.16 ± 0.56 | 0.2 ± 0.72 | 0.02 ± 0.16 | 0.03 ± 0.17
Asian | RS_12_JS | 1,538,245 | Median | 2 | 0 | 0 | 0 | 0 | 0
Asian | RS_12_JS | 1,538,245 | Mean ± SD | 6.6 ± 9.81 | 0.16 ± 0.55 | 0.17 ± 0.57 | 0.17 ± 0.6 | 0.02 ± 0.15 | 0.02 ± 0.15
Asian | RS_2_SW | 79,186 | Median | 12 | 0 | 0 | 0 | 0 | 0
Asian | RS_2_SW | 79,186 | Mean ± SD | 13.33 ± 12.08 | 0.33 ± 0.75 | 0.32 ± 0.73 | 0.2 ± 0.61 | 0.03 ± 0.16 | 0.03 ± 0.16
Asian | RS_4_LY | 682,429 | Median | 11 | 0 | 0 | 0 | 0 | 0
Asian | RS_4_LY | 682,429 | Mean ± SD | 12.69 ± 12.04 | 0.32 ± 0.73 | 0.31 ± 0.71 | 0.22 ± 0.84 | 0.03 ± 0.17 | 0.03 ± 0.17

The mean number of deletions in the V gene is 0.17 for the European samples and 0.15 for the Asian samples. The mean number of insertions in the V gene is 0.18 for the European samples and 0.16 for the Asian samples. The mean number of substitutions in the J gene is 0.18 for the European samples and 0.19 for the Asian samples. The mean number of deletions in the J gene is 0.03 for the European samples and 0.02 for the Asian samples. The mean number of insertions in the J gene is 0.02 for both the European and the Asian samples (Figure 11). The detailed comparisons of the mismatches among the European and Asian samples can be seen in Table 4.
Table 4: The mean and median mismatches of the V and J genes among the European and Asian samples
Ancestry | Number of reads | Statistic | V gene Substitution | V gene Deletion | V gene Insertion | J gene Substitution | J gene Deletion | J gene Insertion
European | 1,989,983 | Median | 2 | 0 | 0 | 0 | 0 | 0
European | 1,989,983 | Mean ± SD | 7.01 ± 10.06 | 0.17 ± 0.57 | 0.18 ± 0.58 | 0.18 ± 0.6 | 0.03 ± 0.16 | 0.02 ± 0.16
Asian | 3,838,105 | Median | 1 | 0 | 0 | 0 | 0 | 0
Asian | 3,838,105 | Mean ± SD | 5.97 ± 9.69 | 0.15 ± 0.55 | 0.16 ± 0.57 | 0.19 ± 0.69 | 0.02 ± 0.15 | 0.02 ± 0.16

Figure 11: The boxplot and the stripplot of the mismatches, including substitutions, insertions, and deletions, of the TCR’s V and J genes among the European and Asian ancestry groups.

To further compare the results from MiXCR, I normalized the data by logarithmic transformation and compared the mismatches between the European and Asian ancestry groups with the independent t-test. The V gene substitutions, insertions, and deletions are statistically significantly lower in the Asian ancestry group than in the European ancestry group (p<0.0001). Similarly, the J gene substitutions and deletions are statistically significantly lower in the Asian ancestry group than in the European ancestry group (p<0.0001). The insertions in the J gene are not statistically significantly lower in the Asian ancestry group than in the European ancestry group (p=0.721).

Discussion

The results reveal that individuals of non-European ancestry were severely underrepresented in the TCR-Seq studies, with ancestry proportions similar to those reported for genome-wide association studies (GWAS) 18 . The disproportionate distribution of study participants’ self-reported ancestry in the TCR-Seq studies might restrict our understanding of disease pathology in diverse populations, confine the discovery of immunogenomics variants across populations 18 , and hinder the development of precision medicine across diverse populations 27,49 .
Furthermore, most of the TCR-Seq studies severely lacked available ancestry information, which might limit the reusability of the raw TCR-Seq data for secondary analysis aimed at discovering novel population-specific TCR alleles and improving the representation of diverse populations in the current reference databases. There are a few limitations to the analysis of ancestry diversity in the TCR-Seq studies. First, the unavailability of genetic ancestry information may prevent us from accurately categorizing the study participants. Ultimately, it is preferable to examine ancestry with genetics-based methodologies. Genetic ancestry captures the single-nucleotide variants across geographic origin groups and the extent to which those variants differ among individuals of different ancestries 50 . However, none of the studies I examined performed genotyping or used computational methods to infer the study participants’ genetic ancestry. The use of self-reported ancestry information in TCR-Seq studies may be considered reasonable at the current stage of analysis 51 . Second, the non-standardized terminology used to report study participants’ ancestry information may introduce bias into the analysis. The lack of standardized ancestry terminology might cause inconsistency among researchers and make it challenging to categorize each ancestry group and conduct secondary analysis. Therefore, standardized experimental protocols and computational methods for reporting or inferring genetic ancestry in the field of immunogenomics are urgently needed 52 .

From our preliminary analysis of the completeness of the IMGT database across diverse populations, we did not find statistically significant differences in the mean mismatches between the European and Asian samples in the TCR’s V and J genes among the seven samples from the lab.
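The group comparison reported in the results (logarithmic transformation followed by an independent t-test) can be sketched with the Python standard library as follows. This is a minimal sketch under stated assumptions, not the analysis code used for the thesis: `log1p` is assumed as the transformation because per-read mismatch counts include zeros, the pooled-variance (Student) form of the t-test is assumed, and the counts shown are hypothetical.

```python
import math
from statistics import mean, variance


def log_transform(counts):
    """Apply log(1 + x); an assumed choice so that zero counts stay defined."""
    return [math.log1p(c) for c in counts]


def independent_t(a, b):
    """Student's two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom). The p-value would then be read from a
    t distribution with that many degrees of freedom, for example with a
    statistics library.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical per-read V gene substitution counts for two groups.
group_a = [3, 0, 12, 1, 7, 0, 2]
group_b = [1, 0, 9, 0, 2, 1, 0]

t_stat, dof = independent_t(log_transform(group_a), log_transform(group_b))
print(f"t = {t_stat:.3f}, df = {dof}")
```

With millions of reads per group, even very small differences in means can yield very small p-values, so effect sizes are worth reporting alongside the test result.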
The method of examining the completeness of the IMGT database with MiXCR is reliable and can efficiently evaluate how well the database represents diverse populations. However, further effort is still needed to examine this comprehensively. For future research, I plan to expand the analysis to the publicly available datasets on SRA to further confirm the completeness of the IMGT database in representing diverse populations. Disclosing the extent to which diverse ancestry groups are represented in the IMGT database can raise the scientific community’s awareness of the ancestry diversity in the database. Improving awareness of the importance of ancestry diversity can in turn promote ancestry diversity in the IMGT database and, ultimately, improve the accuracy and precision of ancestry diversity research that relies on it. According to a previously published work, over 50% of 448 researchers and clinical genetics professionals surveyed considered ancestry important in clinical settings 19 . The human leukocyte antigen (HLA) system elicits the adaptive immune response mediated by TCR 53 . While substantial research has examined HLA diversity across diverse populations 54–57 , the current understanding of ancestral diversity in TCR is more limited. With expanded inclusivity in immunogenetics studies, the scientific community will have a more comprehensive understanding of AIRR in diverse populations 22 . This knowledge will accelerate translational and clinical applications of AIRR and eventually promote health equity across diverse populations 27 . For example, the Moderna and Pfizer-BioNTech COVID-19 vaccines showed different efficacy in study participants of different races 58 .
Although the reasons for the variation in efficacy across different races remain unknown, AIRR might play an important role 59 , which is yet to be studied. Investigating the relationship between AIRR and the vaccine-mediated immune response may advance the development of future vaccines or therapeutics. To promote diversity in study participants’ ancestry, researchers’ awareness of the importance of ancestry information needs to increase. Researchers can take action before conducting an experiment to increase the availability of the study participants’ ancestry information. For example, if researchers or principal investigators include the protocol for collecting participants’ ancestry information, and the participants’ consent to share it, in their Institutional Review Board (IRB) applications, this may promote the availability and accessibility of the study participants’ ancestry information for secondary analysis. Additionally, researchers can specify in the IRB protocol whether it is acceptable to share data with external investigators. Furthermore, the scientific community, from individual researchers to scientific journals and funding agencies, should emphasize the importance of sharing raw data together with its metadata, especially ancestry information. An additional challenge is the non-standardized terminology used for ancestry categories. To address the discrepancies in the terminology and categories of reported ancestry information, I recommend that the scientific community establish standardized protocols or guidelines for reporting study participants’ ancestry information. For example, the source of the ancestry information, whether self-reported or genetically verified, is worth noting.
Additionally, the Human Ancestry Ontology 52 (HANCESTRO) provides a systematic description of ancestry but is yet to be widely accepted and adopted in scientific publications. The results of this thesis provide valuable input to the community on which population groups are underrepresented in TCR-Seq studies. Greater diversity in TCR-Seq studies is truly needed. I advocate for broadening knowledge in this field by studying diverse populations.

Reference
1. Lefranc, M.-P., Giudicelli, V., Regnier, L. & Duroux, P. IMGT, a system and an ontology that bridge biological and computational spheres in bioinformatics. Brief. Bioinform. 9, 263–275 (2008).
2. Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).
3. Brito, J. J. et al. Recommendations to enhance rigor and reproducibility in biomedical research. GigaScience 9, giaa056 (2020).
4. Johnston, M. P. Secondary Data Analysis: A Method of which the Time Has Come. Qual. Quant. Methods Libr. 3, 619–626 (2017).
5. Dziubianau, M. et al. TCR repertoire analysis by next generation sequencing allows complex differential diagnosis of T cell-related pathology. Am. J. Transplant. Off. J. Am. Soc. Transplant. Am. Soc. Transpl. Surg. 13, 2842–2854 (2013).
6. Hou, D., Chen, C., Seely, E. J., Chen, S. & Song, Y. High-Throughput Sequencing-Based Immune Repertoire Study during Infectious Disease. Front. Immunol. 7, 336 (2016).
7. Benichou, J., Ben-Hamo, R., Louzoun, Y. & Efroni, S. Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. Immunology 135, 183–191 (2012).
8. Arnaout, R. A., Prak, E. T. L., Schwab, N., Rubelt, F., & Adaptive Immune Receptor Repertoire Community. The Future of Blood Testing Is the Immunome. Front. Immunol. 12, 626793 (2021).
9. Linette, G. P. et al. Immunological ignorance is an enabling feature of the oligo-clonal T cell response to melanoma neoantigens. Proc. Natl. Acad. Sci. U. S. A. 116, 23662–23670 (2019).
10. Cowell, L. G. The Diagnostic, Prognostic, and Therapeutic Potential of Adaptive Immune Receptor Repertoire Profiling in Cancer. Cancer Res. 80, 643–654 (2020).
11. Ostmeyer, J. et al. Biophysicochemical motifs in T cell receptor sequences as a potential biomarker for high-grade serous ovarian carcinoma. PloS One 15, e0229569 (2020).
12. Sheih, A. et al. Clonal kinetics and single-cell transcriptional profiling of CAR-T cells in patients undergoing CD19 CAR-T immunotherapy. Nat. Commun. 11, 219 (2020).
13. Lee, J. et al. Molecular-level analysis of the serum antibody repertoire in young adults before and after seasonal influenza vaccination. Nat. Med. 22, 1456–1464 (2016).
14. Richardson, E. et al. A computational method for immune repertoire mining that identifies novel binders from different clonotypes, demonstrated by identifying anti-pertussis toxoid antibodies. mAbs 13, 1869406 (2021).
15. Miyakawa, T. No raw data, no science: another possible source of the reproducibility crisis. Mol. Brain 13, 24 (2020).
16. Kodama, Y., Shumway, M., Leinonen, R., & on behalf of the International Nucleotide Sequence Database Collaboration. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 40, D54–D56 (2012).
17. Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538, 161–164 (2016).
18. Peterson, R. E. et al. Genome-wide Association Studies in Ancestrally Diverse Populations: Opportunities, Methods, Pitfalls, and Recommendations. Cell 179, 589–603 (2019).
19. Popejoy, A. B. et al. Clinical Genetics Lacks Standard Definitions and Protocols for the Collection and Use of Diversity Measures. Am. J. Hum. Genet. 107, 72–82 (2020).
20. Mulder, N. et al. H3Africa: current perspectives. Pharmacogenomics Pers. Med. 11, 59–66 (2018).
21. Wang, C.-C. et al. Genomic insights into the formation of human populations in East Asia. Nature 591, 413–419 (2021).
22. Peng, K. et al. Diversity in immunogenomics: the value and the challenge. Nat. Methods 18, 588–591 (2021).
23. Goulielmos, G. N. et al. The genetics and molecular pathogenesis of systemic lupus erythematosus (SLE) in populations of different ancestry. Gene 668, 59–72 (2018).
24. Lewis, M. J. & Jawad, A. S. The effect of ethnicity and genetic ancestry on the epidemiology, clinical features and outcome of systemic lupus erythematosus. Rheumatol. Oxf. Engl. 56, undefined-undefined (2017).
25. Lefranc, M.-P. et al. IMGT, the international ImMunoGeneTics information system. Nucleic Acids Res. 37, D1006-1012 (2009).
26. Borrell, L. N. et al. Race and Genetic Ancestry in Medicine - A Time for Reckoning with Racism. N. Engl. J. Med. 384, 474–480 (2021).
27. Precision medicine needs an equity agenda. Nat. Med. 27, 737 (2021).
28. Borrell, L. N. et al. Race and Genetic Ancestry in Medicine - A Time for Reckoning with Racism. N. Engl. J. Med. 384, 474–480 (2021).
29. Katz, K. et al. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 50, D387–D390 (2022).
30. immunoSEQ® | The Gold Standard of immunosequencing. immunoseq.com https://www.immunoseq.com/.
31. Shugay, M. et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Res. 46, D419–D427 (2018).
32. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 41, D991-995 (2013).
33. Freeberg, M. A. et al. The European Genome-phenome Archive in 2021. Nucleic Acids Res. 50, D980–D987 (2022).
34. Christley, S. et al. VDJServer: A Cloud-Based Analysis Portal and Data Commons for Immune Repertoire Sequences and Rearrangements. Front. Immunol. 9, 976 (2018).
35. CNCB-NGDC Members and Partners. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2021. Nucleic Acids Res. 49, D18–D28 (2021).
36. Stodden, V., Seiler, J. & Ma, Z. An empirical analysis of journal policy effectiveness for computational reproducibility. Proc. Natl. Acad. Sci. 115, 2584–2589 (2018).
37. Tedersoo, L. et al. Data sharing practices and data availability upon request differ across scientific disciplines. Sci. Data 8, 192 (2021).
38. Schriml, L. M. et al. COVID-19 pandemic reveals the peril of ignoring metadata standards. Sci. Data 7, 188 (2020).
39. Gozashti, L. & Corbett-Detig, R. Shortcomings of SARS-CoV-2 genomic metadata. BMC Res. Notes 14, 189 (2021).
40. Rajesh, A. et al. Improving the completeness of public metadata accompanying omics studies. Genome Biol. 22, 106 (2021).
41. Corbyn, Z. Researchers failing to make raw data public. Nature (2011) doi:10.1038/news.2011.536.
42. Edemekong, P. F., Annamaraju, P. & Haydel, M. J. Health Insurance Portability and Accountability Act. in StatPearls (StatPearls Publishing, 2022).
43. Grant, R. & Hrynaszkiewicz, I. The Impact on Authors and Editors of Introducing Data Availability Statements at Nature Journals. Int. J. Digit. Curation 13, 195–203 (2018).
44. Kim, J., Kim, S., Cho, H.-M., Chang, J. H. & Kim, S. Y. Data sharing policies of journals in life, health, and physical sciences indexed in Journal Citation Reports. PeerJ 8, e9924 (2020).
45. Deshpande, D. et al. A comprehensive analysis of code and data availability in biomedical research. (2021) doi:10.31219/osf.io/uz7m5.
46. Field, D. et al. The minimum information about a genome sequence (MIGS) specification. Nat. Biotechnol. 26, 541–547 (2008).
47. Caspar, S. M. et al. Clinical sequencing: From raw data to diagnosis with lifetime value. Clin. Genet. 93, 508–519 (2018).
48. Peters, M. & Zeeb, H. Availability of open data for spatial public health research. GMS Ger. Med. Sci. 20, Doc01 (2022).
49. Greiff, V., Yaari, G. & Cowell, L. G. Mining adaptive immune receptor repertoires for biological and clinical information using machine learning. Curr. Opin. Syst. Biol. 24, 109–119 (2020).
50. Jorde, L. B. & Bamshad, M. J. Genetic Ancestry Testing: What Is It and Why Is It Important? JAMA 323, 1089–1090 (2020).
51. Oni-Orisan, A., Mavura, Y., Banda, Y., Thornton, T. A. & Sebro, R. Embracing Genetic Diversity to Improve Black Health. N. Engl. J. Med. 384, 1163–1167 (2021).
52. Morales, J. et al. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol. 19, 21 (2018).
53. Szeto, C., Lobos, C. A., Nguyen, A. T. & Gras, S. TCR Recognition of Peptide-MHC-I: Rule Makers and Breakers. Int. J. Mol. Sci. 22, E68 (2020).
54. Hou, L. et al. Next generation sequencing characterizes HLA diversity in a registry population from the Netherlands. HLA 93, 474–483 (2019).
55. Boquett, J. A., Bisso-Machado, R., Zagonel-Oliveira, M., Schüler-Faccini, L. & Fagundes, N. J. R. HLA diversity in Brazil. HLA 95, 3–14 (2020).
56. Hashimoto, S. et al. Implications of HLA diversity among regions for bone marrow donor searches in Japan. HLA 96, 24–42 (2020).
57. Mellet, J. et al. Human leukocyte antigen (HLA) diversity and clinical applications in South Africa. South Afr. Med. J. Suid-Afr. Tydskr. Vir Geneeskd. 109, 29–34 (2019).
58. Pilishvili, T. et al. Interim Estimates of Vaccine Effectiveness of Pfizer-BioNTech and Moderna COVID-19 Vaccines Among Health Care Personnel - 33 U.S. Sites, January-March 2021. MMWR Morb. Mortal. Wkly. Rep. 70, 753–758 (2021).
59. Unterman, A. et al. Single-Cell Omics Reveals Dyssynchrony of the Innate and Adaptive Immune System in Progressive COVID-19. 2020.07.16.20153437 https://www.medrxiv.org/content/10.1101/2020.07.16.20153437v1 (2020) doi:10.1101/2020.07.16.20153437.
Appendices
Appendix A: Python script for counting the number of mismatches in the V and J genes from MiXCR output

# Import Python libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Read the samples' CSV file (MiXCR export)
df = pd.read_csv('RS_12_JS.csv')
# Preview the alignment columns for the first three rows
alignment = df.loc[:2, ['allVAlignments', 'allJAlignments', 'allCAlignments']]

# Calculate the real read length
df['Real_read_length'] = df['targetSequences'].str.len()
df25 = df['Real_read_length']

# Calculate the V gene mismatches
V = df['allVAlignments'] = df['allVAlignments'].astype(str)
# Substitutions
S_V = V.str.count('S')
# Deletions
D_V = V.str.count('D')
# Insertions
I_V = V.str.count('I')
# Total mismatches
R_V = S_V + D_V + I_V
# Split the pipe-delimited alignment string; columns 3 and 4 hold the start and end positions
split = df['allVAlignments'].str.split('|', expand=True)
df1 = split
df1 = df1.convert_dtypes()
df2 = df1.iloc[:, 3]
df3 = df1.iloc[:, 4]
df4 = pd.concat([df2, df3], axis=1)
df4.columns = ['V_start', 'V_end']
df4 = df4.dropna(how='any')
df4 = df4.apply(pd.to_numeric)
df4['V_effective_read_length'] = df4['V_end'] - df4['V_start']

# Calculate the J gene mismatches
J = df['allJAlignments'] = df['allJAlignments'].astype(str)
S_J = J.str.count('S')
D_J = J.str.count('D')
I_J = J.str.count('I')
R_J = S_J + D_J + I_J
split_J = df['allJAlignments'].str.split('|', expand=True)
df5 = split_J
df5 = df5.convert_dtypes()
df6 = df5.iloc[:, 3]
df7 = df5.iloc[:, 4]
df8 = pd.concat([df6, df7], axis=1)
df8.columns = ['J_start', 'J_end']
df8 = df8.dropna(how='any')
df8 = df8.apply(pd.to_numeric)
df8['J_effective_read_length'] = df8['J_end'] - df8['J_start']

# Calculate the percentage of mismatches for the V gene
# Note: the ERL ratios below are stored as fractions of the effective read length (not multiplied by 100)
df4['V_Substition'] = S_V
df4['V_Insertion'] = I_V
df4['V_Deletion'] = D_V
# df4['V_Number_of_mismatches'] = R_V
df4['V_S/ERL(%)'] = S_V / df4['V_effective_read_length']
df4['V_I/ERL(%)'] = I_V / df4['V_effective_read_length']
df4['V_D/ERL(%)'] = D_V / df4['V_effective_read_length']
df20 = df4[['V_effective_read_length', 'V_Substition', 'V_Insertion', 'V_Deletion', 'V_S/ERL(%)', 'V_I/ERL(%)', 'V_D/ERL(%)']]

# Calculate the percentage of mismatches for the J gene
df8['J_Substition'] = S_J
df8['J_Insertion'] = I_J
df8['J_Deletion'] = D_J
df8['J_S/ERL(%)'] = S_J / df8['J_effective_read_length']
df8['J_I/ERL(%)'] = I_J / df8['J_effective_read_length']
df8['J_D/ERL(%)'] = D_J / df8['J_effective_read_length']
df21 = df8[['J_effective_read_length', 'J_Substition', 'J_Insertion', 'J_Deletion', 'J_S/ERL(%)', 'J_I/ERL(%)', 'J_D/ERL(%)']]

# Summarize the output from MiXCR
df23 = pd.concat([df20, df21, df25], axis=1)
df24 = df23[['V_effective_read_length', 'V_Substition', 'V_Insertion', 'V_Deletion', 'V_S/ERL(%)', 'V_I/ERL(%)', 'V_D/ERL(%)', 'J_effective_read_length', 'J_Substition', 'J_Insertion', 'J_Deletion', 'J_S/ERL(%)', 'J_I/ERL(%)', 'J_D/ERL(%)', 'Real_read_length']]
df24 = df24.dropna(how='any')
df24.to_csv('RS_12_JS_mismatches.csv')
Abstract
Secondary analysis, the re-analysis of existing raw data, promotes novel biomedical discoveries in modern data-driven research. Rigorous experimental design and conduct are crucial, but so is making raw data available, repurposable, and well annotated, because efficient and accurate secondary analysis depends on it. The same holds for immunogenetics data, which should be reproducible, robust, and publicly available to the scientific community for secondary analysis. To assess the raw data availability of published T cell receptor (TCR) repertoire studies, I examined 11,918 TCR sequencing (TCR-Seq) samples from 134 TCR-Seq studies published between 2006 and 2022. Among the 134 studies, only 38.1% shared raw TCR-Seq data in public repositories. I also found a statistically significant association between the presence of a data availability statement and raw data availability (p=0.014). Yet 46.8% of the studies with data availability statements failed to share the raw TCR-Seq data. The biomedical community urgently needs to raise awareness of the importance of raw data availability and to take immediate action to improve it, enabling cost-effective secondary analysis of existing immunogenomics data by the larger scientific community.
In addition to raw data, metadata about the study participants, such as ancestry, is crucial for secondary analysis because it supports research across diverse populations and the development of precision medicine. However, few studies have taken ancestry information into account, and the ancestry composition of open TCR-Seq studies is unknown. I therefore analyzed 3341 study participants in the TCR-Seq studies to estimate the participants' ancestry composition and the extent to which ancestry information is available. I found that 84.1% (2809) of the study participants were European, 9.0% (301) were Asian, 4.0% (135) were African, and 2.9% (96) were reported with other or unknown ancestries. The ancestry distribution across the TCR-Seq studies is thus highly skewed toward European ancestries. Additionally, only 36.8% of the TCR-Seq studies reported ancestry information, so the majority did not.
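The proportions in this paragraph follow directly from the reported participant counts. As a quick arithmetic check, the minimal sketch below recomputes them; the dictionary and function names are illustrative and are not part of the thesis code.

```python
# Participant counts per reported ancestry group, as given in the abstract.
counts = {"European": 2809, "Asian": 301, "African": 135, "Other or unknown": 96}


def ancestry_percentages(group_counts):
    """Return each group's share of all participants, in percent (one decimal)."""
    total = sum(group_counts.values())
    return {group: round(100 * n / total, 1) for group, n in group_counts.items()}


total_participants = sum(counts.values())
percentages = ancestry_percentages(counts)
for group, pct in percentages.items():
    print(f"{group}: {counts[group]} of {total_participants} ({pct}%)")
```

The four counts sum to the 3341 participants stated in the text, and the rounded shares match the reported percentages.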
Lastly, I examined how completely the international ImMunoGeneTics information system® (IMGT) database represents diverse populations. MiXCR, a bioinformatics tool, aligns TCR-Seq reads to the IMGT reference and reports where they differ. Using MiXCR, I examined the mismatches in the VDJ gene alignments of reads from samples of different ancestry groups and used them to evaluate the completeness of the immunogenetics database across those groups. Unveiling the ancestry distribution in TCR-Seq studies and the completeness of the immunogenetics database for diverse populations could highlight the need to improve the representation of underrepresented populations and guide future immunogenomics studies toward better ancestry reporting and a more balanced ancestry distribution.
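To make the mismatch counting concrete, the sketch below applies the same idea as the Appendix A script to a single alignment string. It assumes the pipe-delimited layout that the script relies on (fields 3 and 4 hold the alignment start and end) and that substitutions, deletions, and insertions are marked with the letters S, D, and I. The function name and the example string are hypothetical and are not real MiXCR output.

```python
def mismatch_rates(alignment: str) -> dict:
    """Count S/D/I markers in one alignment string and normalise by aligned length.

    Assumes a pipe-delimited record whose fields 3 and 4 (0-based) are the
    alignment start and end positions, as in the Appendix A script.
    """
    fields = alignment.split("|")
    start, end = int(fields[3]), int(fields[4])
    effective_length = end - start
    counts = {
        "substitutions": alignment.count("S"),
        "deletions": alignment.count("D"),
        "insertions": alignment.count("I"),
    }
    # Rates are fractions of the effective (aligned) read length.
    rates = {f"{name}_per_base": n / effective_length for name, n in counts.items()}
    return {"effective_length": effective_length, **counts, **rates}


# Hypothetical alignment record, made up for illustration only.
example = "0|100|120|5|105|SA12TSG40CDT77|480.0"
result = mismatch_rates(example)
print(result)
```

Normalising by the aligned span rather than the full read length keeps reads of different lengths comparable, which is why the script reports both the effective and the real read length.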
Asset Metadata
Creator: Huang, Yu Ning (author)
Core Title: Evaluation of ancestral diversity in open immunogenetics studies and databases
School: School of Pharmacy
Degree: Master of Science
Degree Program: Clinical and Experimental Therapeutics
Degree Conferral Date: 2022-08
Publication Date: 07/18/2022
Defense Date: 07/18/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: adaptive immune receptor repertoire, ancestral diversity, IMGT database, immunogenetics, OAI-PMH Harvest, TCR
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Mangul, Serghei (committee chair), Alachkar, Houda (committee member), Burkhardt, Amanda M. (committee member)
Creator Email: yuninghu@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111372258
Unique identifier: UC111372258
Legacy Identifier: etd-HuangYuNin-10851
Document Type: Thesis
Rights: Huang, Yu Ning
Type: texts
Source: 20220719-usctheses-batch-955 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu