Copyright 2023 Tejasvene Ramesh
Predicting Mortality of Sepsis with Machine Learning Model Approaches
by
Tejasvene Ramesh
A Thesis Presented to the
FACULTY OF THE USC SCHOOL OF PHARMACY
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(PHARMACEUTICAL SCIENCES)
August 2023
TABLE OF CONTENTS
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Opportunities and pitfalls of computational data-driven research
1.2 Sepsis
1.3 Study focus
Chapter 2: Methods
2.1 Data collection and processing
2.2 Data normalization and gene annotation
2.3 Generation of cell composition data with GEDIT
2.4 Statistical Analysis
2.5 Gene Expression Data Cleaning
2.6 Logistic regression model and hyperparameter tuning
2.7 Model Performance Analysis
Chapter 3: Results
3.1 Cohort selection for our study and logistic regression
3.2 Statistically significant features were selected for logistic regression
3.3 Prediction of sepsis mortality with cell composition data
3.4 Prediction of sepsis mortality with gene expression data
3.5 Assessment of prognostic powers
Chapter 4: Discussion
Chapter 5: Conclusion
Bibliography
LIST OF TABLES
Table 1.1.1: Opportunities of computational data-driven research
Table 3.1.1: Characteristics of sepsis cohorts summarized from metadata and original publication
Table 3.1.2: Characteristics of patients included in the logistic regression analysis
Table 3.2.1: Summary of Regression Analysis of the cell composition data
LIST OF FIGURES
Figure 1.1.2: Challenges and pitfalls of computational data-driven research
Figure 1.2.1: Overview of study design for multi-cohort analysis with a machine learning approach
Figure 2.2.1: Example of the R script to perform RMA normalization in a CEL formatted file
Figure 2.2.2: Example of the R script used to annotate gene names for microarray data from the
HG-U133_Plus_2 probe set
Figure 2.2.3: Example of the Python code used to annotate gene names for microarray data
generated from any probe set
Figure 2.4.1: Example of logit model code on Python
Figure 3.2.2: Comparison of non-survivor and survivor
Figure 3.3: Logistic Regression of Cell composition data
Figure 3.4: Logistic Regression of Gene Expression data
Figure 3.5: Model performance of training and validation cohorts
ABSTRACT
Sepsis is a life-threatening host inflammatory response to infection and a primary cause of
mortality worldwide. Current research approaches predict sepsis mortality using machine learning
based on clinical scores and criteria defining sepsis severity. Several studies explore sepsis
mortality at the molecular level, yet more insight is needed into sepsis mortality in homogeneous
groups of sepsis patients. We developed transcriptomic and cell composition-based machine
learning models in homogeneous cohorts to infer sepsis deaths. We derived cell composition data
from gene expression data using our recently developed statistical deconvolution tool, GEDIT.
We then combined publicly available transcriptomic and cell composition data into a
trans-ancestry retrospective sepsis cohort with similar clinical covariates. Our cohort includes
1132 adults diagnosed with a known type of sepsis across seven individual studies. We selected
key cell types and clinical covariates from statistical comparisons between survivors and
non-survivors. Subsequently, we built logistic regression models with cell composition and gene
expression data to estimate sepsis mortality status. We assessed mortality prediction by AUROC
for both cell composition (summary AUROC = 0.7046) and gene expression data (summary AUROC =
0.6884). Our results suggest that the relative abundance of various cells (e.g., CD4+ T cells,
neutrophils) differed considerably between survivors and non-survivors (p-value < 10^-4). Our
overall model performance suggests that cell composition data predict sepsis mortality better
than gene expression data. Despite the limitations of the study, these results will improve
our understanding of the relationship between the immune system and survival status across sepsis
patients of diverse backgrounds.
CHAPTER 1 INTRODUCTION
1.1 Opportunities and pitfalls of computational data-driven research
Over the past decade, the unprecedented advancement in omics technology has vastly
revolutionized modern biomedical research by expanding the diversity, abundance, and
accessibility of data and methods across varying domains. As a result, this has opened doors into
a new type of research, computational data-driven research, which is vital in developing
efficacious and robust computational methods and tools to examine sophisticated omics datasets
allowing for novel biological insights. Such computational data-driven research is performed in a
different kind of laboratory, the dry lab (Ramesh et al., 2022). Initially, dry labs often participated in
the primary analysis of data generated in wet labs through collaborations. Over time the pervasive
usage and advancement of computational power have resulted in the augmentation of data
collection or ‘Big Data’ (Gauthier et al., 2019). Hence, dry labs can now independently perform
secondary analysis on publicly available data that involves mining and reevaluating large-scale
open public datasets to create new biological discoveries (Ramesh et al., 2022).
Because data analysis at the meta level is rapid and effective, today's computational
data-driven research is not limited to biology. Instead, it has expanded widely and is
considered critical in a myriad of fields, where it plays a multifaceted role (Gauthier et al., 2019;
Ramesh et al., 2022; Yu et al., 2004). In this modern era, computational biology research has
created opportunities by enhancing biological research, promoting effective collaborations,
allowing for creativity, and changing the speed and scale of data analysis (Ramesh et al., 2022).
Computational biology plays a crucial role in advancing biological research through the
implementation of its tools. Data can be examined in two ways: primary and secondary analysis.
Primary analysis is the in-silico analysis of data collected from clinical assessments, trials, or
experiments in a clinical or wet laboratory setting. Conversely, secondary analysis investigates
existing datasets to extract additional information. With novel methods and tools for primary and
secondary data analysis, computational biology explores extensive and complex datasets to answer
specific scientific queries and hypotheses (Mangul et al., 2019). While tools and method
development represent a fraction of evolving research directions of computational data, the
applications of computational data-driven research expand beyond tool development into
discovering new remedies, diagnostics, and insights into a disease, from association modeling for
a novel assay to examining operational big data (Ramesh et al., 2022).
Besides advancing research, the novel methods and tools used for primary and secondary
analysis also promote effective collaborations between wet and dry lab researchers. While the wet
lab researchers generate and collect the data, the dry lab researchers can add their expertise to
suggest suitable tools to derive insightful information from the results generated by the wet lab
researchers (Ramesh et al., 2022). With these insights, wet science researchers could further
validate the post-computational analysis findings.
Consequently, specific journal articles have gained citations and strengthened the
significance of computational research in biology through the advancement, application, and
exploration of computational approaches (Van Noorden et al., 2014). Such journal articles have
opened doors for more studies to be published on purely computational work, contributing to more
than 30% of bioinformatics papers as highly cited scientific papers (Wren, 2016).
As much as the application and exploration of computational methods assist in analyzing
the data, computational research gives a unique platform to be creative. Research in wet labs is
often hypothesis-driven, whereas computational research lays the foundation to create new
hypotheses through data interpretation. Thus, computational data-driven research gives the liberty
to uncover unknown diseases, develop novel approaches and tools, or refine existing ones for a more
robust data analysis (Ramesh et al., 2022). With computational expertise, dry lab researchers can
add value to wet lab research by venturing into differing perspectives and proposing new
hypotheses or improving existing ones based on results generated in the wet lab.
Moreover, the speed and scalability of analysis in the dry lab make computational research
increasingly important as a complement to wet lab findings. Dry labs are faster-paced than wet
labs, which widens the scope for dry lab researchers to gather more data and work with wet-lab
biologists to corroborate their discoveries (Ramesh et al., 2022). Wet lab biologists can also look
beyond the public data by suggesting new hypotheses or building upon existing ones with the
expertise of dry lab researchers. Additionally, computational researchers probe large-scale data
containing larger sample sizes or combine data across several existing studies and publicly
available datasets. Hence, this promotes large-scale interpretation that can potentially provide
more statistically significant results (Ramesh et al., 2022).
Table 1.1.1: Opportunities of computational data-driven research.
Advancing biological research:
• Novel tools and methods emerge for primary and secondary data analysis
• Rich and diverse datasets can be analyzed rapidly
Promote effective collaboration:
• Experimental and computational labs draw on a more extensive range of analysis and expertise
from each other by working together
• Specific journal articles have strengthened their citations and the significance of
computational research in biology through the advancement, application, and exploration of
computational approaches
Get to be creative:
• Gives the liberty to uncover unknown diseases, develop novel approaches and tools, or refine
existing ones for a more robust data analysis
• Venture into differing perspectives and propose new hypotheses or improve existing ones based
on results generated in the wet lab
Speed and scale of data analysis:
• Dry lab has a faster working pace
• Amplifies the scope of dry lab researchers to gather more data and integrate with wet-lab
biologists to corroborate their discoveries, while the wet-lab biologists can look beyond the
public data by suggesting new hypotheses or building upon existing ones with the expertise of
dry lab researchers
• Computational scientists analyze data from large sample sizes, promoting large-scale analysis
that can potentially provide more statistically significant results
Despite the extensive opportunities for computational data-driven research, modern-day
computational biologists face some challenges and pitfalls, including, but not limited to, managing,
navigating, understanding, or acquiring meaningful and valid deductions from the metadata.
Chiefly, some key challenges include having control over the data, accessing and managing their
data, and proving the importance of computational biology.
One main challenge computational biologists deal with is control over the data. A myopic
view of computational research persists: computational researchers are often expected to have
secondary roles in biological projects rather than spearheading projects on translational biology
(Kwok, 2013). This view often stems from the assumption that computational researchers
concentrate on building bioinformatics tools and depend on experimental labs to generate data,
rather than engaging in the experimental phase or spearheading the project (Bartlett et al.,
2017). Frequently, computational biologists play a prominent role in analyzing the data or
building models or tools to analyze the data; meanwhile, the wet lab experimentalists usually do
the planning, execution, and data collection. Hence, the lack of involvement in the experimental
phase limits computational biologists' control over the planning, execution, and data. As a
result, many computational biologists have reported having only a secondary role, with little
autonomy or involvement in the project, including planning and execution (Ramesh et al., 2022).
Furthermore, dry lab biologists encounter issues with the accessibility and management of
metadata. Very often, data produced in the wet lab are shared in public databases such as Gene
Expression Omnibus (GEO) and Sequence Read Archive (SRA); nevertheless, in some cases, a
part of the data is not available or is restricted to the public (Callebaut, 2012; Huang et al., 2021).
Undeniably, a substantial abundance of data is being deposited in public databases. Even with this,
computational biologists have limited access to the amount of information being collected as they
may not be actively involved in project planning and management (Ramesh et al., 2022). Having
comprehensive data is a pivotal element of research. The quality of data affects everything from
analyses, interpretations, and deductions to the choice of tools employed (Ramesh et al., 2022).
When the data contain too little information to investigate and explain observations,
computational biologists may be forced to use a reduced sample size in their analysis, which
could adversely affect data quality. Hence, greater accessibility and availability of information
promote a better understanding and interpretation of data and a wiser choice of computational
tools for the study (Ramesh et al., 2022).
Besides the accessibility of data, the management of metadata is another critical
problem faced by computational biologists. Data amalgamation and data cleaning are intrinsic to
obtaining normalized data for unbiased interpretations. It is common for data from public
databases to be scattered or incomprehensible due to incorrect annotation or partial information
availability in metadata and raw omics data (Huang et al., 2021; Rajesh et al., 2021). Since
computational biologists are not the primary data administrators and may not know the
experimental methodology, they struggle to consolidate and clean raw information from multiple
datasets.
Lastly, computational biologists also confront the issue of proving the importance of
computational biology. Because they take a backseat in experimentation, computational
biologists receive less recognition and are often listed as joint or second authors in
collaborative studies between wet and dry labs, regardless of their immense contributions to
method and tool development and analyses (Bartlett et al., 2017). Occasionally, tool developers
are left uncited (Bartlett et al., 2017).
With the growing accessibility and prevalence of omics technologies, data analysis has
become increasingly sophisticated. This has gradually moved computational biology
from the backseat to the driver’s seat, where it takes primary collaborative roles in biomedical
research and moves toward independence (Yanai & Chmielnicki, 2017). This is evident from the
emergence of multiple “bioinformatics cores” in academic and medical institutions and from
biological papers that rely more on data-driven computational analysis (Mangul et al., 2019;
Markowetz, 2017; Ramesh et al., 2022).
Regardless of these challenges, computational data-driven research undeniably offers
immense opportunities and has become increasingly prominent in modern-day
biomedical research. With expanding advancements in omics technology and the growing complexity
of big data, computational biology is taking a primary role in experimental and biomedical
research, providing well-founded conclusions for novel hypotheses and biological inventions and
making computational data-driven research indispensable and integral to biomedical studies.
Figure 1.1.2: Challenges and pitfalls of computational data-driven research.
1.2 Sepsis
Sepsis is a life-threatening disorder caused by the systemic immunological response due to
an underlying infection. Severe sepsis can result in complications of organ dysfunction or even
death (Karnatovskaia & Festic, 2012). Sepsis occurs as a result of bacterial, viral, fungal, or
parasitic infections, or a combination of these infections (Dinu et al., 2020). These infections could
be community-acquired, healthcare-associated, or hospital-acquired. Community-acquired
infections are acquired outside the hospital (Page et al., 2015). Hospital-acquired infections are
primarily due to surgical-based infections, including bloodstream, pulmonary and genitourinary
infections from catheters and injections (Monegro et al., 2023; Page et al., 2015).
Healthcare-associated infections derive from prior treatment for a preexisting condition at a
healthcare facility, for example in patients who underwent hemodialysis, were readmitted to the
same healthcare facility within 30 days, or were admitted to an inpatient nursing facility (Page
et al., 2015).
When exposed to infection, the immune system activates innate immune cells, chiefly
neutrophils, macrophages, and monocytes. These cells activate pro- and anti-inflammatory
arms, promoting the release of cytokines, proteases, and reactive oxygen species. When the
immune response escalates into an unchecked overdrive of systemic inflammation, clinical
signs and symptoms of sepsis appear, gradually progressing from sepsis to septic shock
(Li et al., 2021; Mahapatra & Heffner, 2023; Pop-Began et al., 2014). The equilibrium of pro- and
anti-inflammatory responses determines the severity and mortality of the patient. In severe cases
of sepsis, multiple organ failures or fatalities could be inevitable (Karnatovskaia & Festic, 2012;
Mahapatra & Heffner, 2023).
Multiple studies have suggested that hospital-acquired sepsis is more severe and more
likely a contributor to sepsis mortality than other sepsis sources (Page et al., 2015; Westphal et al.,
2019). Additionally, sepsis was one of the leading causes of death in the United States in 2017.
On average, 1.7 million adult sepsis cases are reported yearly in the United States, and
one-third of hospital deaths are due to sepsis (Prest et al., 2022).
One of the ways to prevent death from sepsis is to predict sepsis mortality based on the
patient's clinical profile. Prediction of sepsis mortality could allow clinicians to make early
treatment decisions before the condition becomes fatal. With advancements in computational data-
driven research, today, many scientists have used machine learning approaches to predict sepsis
mortality. Karlsson et al. (2021) used symptoms of sepsis patients, including fever, abnormal
verbal response, chills, and low saturation, as the variables to predict sepsis mortality with a
Random Forest classifier. Although their model seems promising, they
focused purely on clinical criteria as the tool to predict sepsis fatalities. Such approaches may not
competently define the complexity of the pathology and diversity of clinical presentations
observed (Sweeney et al., 2018; Sweeney & Wong, 2016). Thus, attaining precision or
personalization with the existing treatment or preventive methods is challenging. Defining sepsis
on a molecular level becomes crucial to improve the understanding of sepsis prognosis. However,
there are limited studies that accurately categorize or describe sepsis at the molecular level.
Sweeney et al. (2018) investigated the mortality prediction models from three scientific teams by
performing logistic regression and Random Forest classification with gene expression data. They
studied cohorts that performed 30-day mortality studies, regardless of the availability of
information on age, gender, and type of sepsis. They identified 58 signature genes, 31
upregulated and 27 downregulated, that are essential in sepsis and its associated deaths
(Sweeney et al., 2018). Banerjee et al. (2021) focused on pediatric sepsis and identified
20 genetic markers that predict the complexity of general sepsis close to the time of intensive
care unit (ICU) admission. These studies address heterogeneous cases of sepsis (i.e., the
availability of age, gender, and type of sepsis information differs) and describe outcomes based
on gene expression data. However, they lack insight into sepsis mortality in homogeneous sepsis
patients (similar age group, gender types, known pathogen). Multiple studies have emphasized
that adult sepsis patients, especially males, are at higher risk of death than infants,
underscoring the significance of including adults and gender
while building a model to predict sepsis mortality (Nasir et al., 2015; Wheeler et al., 2011). Since
sepsis is an immunological disorder, extending the study beyond the gene expression level and
analyzing the effect of cell composition on sepsis mortality could add value to the existing studies.
1.3 Study focus
Considering these limitations in the existing literature, we investigate the role of age,
gender, cell composition, and gene expression data in predicting sepsis mortality in patients
from 30-day mortality studies. We first selected cohorts studying 30-day sepsis mortality from
public repositories. We then restricted the data to cohorts of adult sepsis patients (age 17 to
99) with information on gender (male and female only) and type of sepsis (bacterial, viral,
fungal, other, or mixed). After filtering, we acquired the gene expression (microarray) data
from public databases. Statistical deconvolution was performed to estimate cell composition
using the Gene Expression Deconvolution Interactive Tool (GEDIT) and the Human Primary Cell
Atlas (HPCA) orthogonal reference matrix (Nadel et al., 2021; Regev et al., 2017). After
processing the gene expression and cell composition data, we constructed a logistic regression
prediction model to relate critical features to mortality outcomes. The predictive power of the
models was assessed with receiver operating characteristic (ROC) analysis.
Figure 1.2.1: Overview of study design for multi-cohort analysis with a machine learning
approach.
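The ROC analysis mentioned in this section is usually summarized by the area under the ROC curve (AUROC). As a minimal sketch, and not the code used in this thesis, the AUROC can be computed from predicted probabilities as the probability that a randomly chosen non-survivor receives a higher score than a randomly chosen survivor. The function name and the toy labels and scores are illustrative assumptions.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 = non-survivor, 0 = survivor
    scores: predicted probability of death for each sample
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # non-survivor ranked above survivor
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the thesis cohorts).
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.4, 0.5, 0.1]
toy_auc = auroc(toy_labels, toy_scores)
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of non-survivors from survivors.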
CHAPTER 2 METHODS
2.1 Data collection and processing
We initially selected 28 cohorts with a total of 3850 patients from public
repositories, such as Gene Expression Omnibus (GEO) and ArrayExpress, that specifically studied
patients with sepsis-related conditions, in particular sepsis, septic shock, and Systemic
Inflammatory Response Syndrome (SIRS). In each cohort, we examined ten major clinical
phenotypes: age, gender, number of organisms, type of organism, type of sepsis, type of pathogen
that caused sepsis, the country where the study occurred, ancestry information, source of the
sample, and mortality rate. Most cohorts contained comprehensive information in the metadata.
Where information was missing from the metadata, we consulted the original publication. The
characteristics of the 28 cohorts are summarized
in Table 3.1.1.
The 28 cohorts contained a wide range of subjects, including sepsis patients, healthy
controls, patients with conditions other than sepsis, adult patients, and infant patients. Some
cohorts lacked age, gender, or type of sepsis information, seven of which studied general sepsis
rather than a particular sepsis type. For consistency, we retained seven cohorts
that assessed adult bacterial, viral, fungal, or other sepsis and had information about gender
(specifically male and female) and sepsis mortality. We retrieved the corresponding gene
expression microarray data from the NCBI GEO or EMBL-EBI ArrayExpress repositories. Healthy
controls, patients with conditions other than sepsis, samples whose gender was not recorded as
male or female, samples with missing values (NA), and patients younger than 17 were dropped.
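The inclusion criteria above can be sketched as a simple per-sample filter. This is an illustration under stated assumptions, not the thesis code: the field names (`age`, `gender`, `sepsis_type`, `mortality`), the label strings, and the toy records are hypothetical, and the accepted sepsis types follow the list given in Section 1.3.

```python
ALLOWED_SEPSIS_TYPES = {"bacterial", "viral", "fungal", "other", "mixed"}


def keep_sample(sample):
    """Return True if a metadata record passes the inclusion criteria."""
    age = sample.get("age")
    # Adults only (age 17 to 99); a missing age is treated as an NA value.
    if age is None or not (17 <= age <= 99):
        return False
    # Gender must be recorded as male or female.
    if sample.get("gender") not in {"male", "female"}:
        return False
    # Healthy controls and non-sepsis patients carry no known sepsis type.
    if sample.get("sepsis_type") not in ALLOWED_SEPSIS_TYPES:
        return False
    # Mortality status must be recorded.
    return sample.get("mortality") in {"survivor", "non-survivor"}


# Hypothetical metadata records.
toy_metadata = [
    {"id": "S1", "age": 54, "gender": "female", "sepsis_type": "bacterial", "mortality": "survivor"},
    {"id": "S2", "age": 3, "gender": "male", "sepsis_type": "viral", "mortality": "survivor"},
    {"id": "S3", "age": 61, "gender": "male", "sepsis_type": None, "mortality": "survivor"},
    {"id": "S4", "age": 47, "gender": None, "sepsis_type": "fungal", "mortality": "non-survivor"},
    {"id": "S5", "age": 70, "gender": "male", "sepsis_type": "mixed", "mortality": None},
]
filtered = [s for s in toy_metadata if keep_sample(s)]
```

Keeping each criterion as its own check makes it straightforward to count how many samples each rule removes.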
2.2 Data normalization and gene annotation
The gene expression data for the seven cohorts, filtered based on information availability,
were downloaded as CEL files containing raw microarray data. The CEL files were processed and
normalized with the Robust Multi-array Average (RMA)
normalization method in R (Figure 2.2.1).
Figure 2.2.1: Example of the R script to perform RMA normalization in a CEL formatted file.
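RMA combines background correction, quantile normalization, and probe-set summarization by median polish. The sketch below illustrates only the quantile normalization step in plain Python; it is not the R script shown in the figure, and the function name and toy intensities are assumptions made for illustration. Ties are broken by sort order rather than averaged.

```python
def quantile_normalize(columns):
    """Quantile-normalize arrays given as one list of intensities per array."""
    n_probes = len(columns[0])
    if any(len(col) != n_probes for col in columns):
        raise ValueError("all arrays must have the same number of probes")
    # The mean of the k-th smallest value across arrays is the reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_probes)
    ]
    normalized = []
    for col in columns:
        # Probe indices ordered from lowest to highest intensity in this array.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical arrays with three probes each.
toy_arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
toy_normalized = quantile_normalize(toy_arrays)
```

After this step every array shares the same distribution of values, so differences between arrays reflect probe ranks rather than array-wide intensity shifts.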
For some datasets, CEL files were unavailable. In those cases, series matrix files
containing normalized and processed microarray data were downloaded instead. Gene names were then
annotated with an R script or Python code. Microarray data generated from the
HG-U133_Plus_2 probe set were annotated with the R script (Figure 2.2.2).
Figure 2.2.2: Example of the R script used to annotate gene names for microarray data from the
HG-U133_Plus_2 probe set.
For other microarray data, a SOFT file containing a reference list of probe IDs and gene names
was downloaded from the GEO or ArrayExpress repository. The processed microarray data were
annotated with gene names according to the respective probe IDs using Python code and the
reference probe ID list, as shown in Figures 2.2.3A and 2.2.3B.
Figure 2.2.3: Example of the Python code used to annotate gene names for microarray data
generated from any probe set. (A) Python script to annotate the genes. (B) Bash code to run the
Python script on the processed microarray data to get an output of microarray data annotated with
gene names instead of probe IDs.
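The annotation step described above amounts to a lookup from probe ID to gene name. The following is a minimal sketch rather than the script shown in the figure: it assumes the reference list has already been reduced to a two-column, tab-separated table, and the function names, probe IDs, and gene names are hypothetical placeholders.

```python
import csv
import io


def load_probe_map(handle):
    """Read a two-column, tab-separated probe-ID-to-gene table into a dict."""
    reader = csv.reader(handle, delimiter="\t")
    # Skip probes that have no gene name in the reference list.
    return {row[0]: row[1] for row in reader if len(row) >= 2 and row[1]}


def annotate(expression_rows, probe_map):
    """Replace probe IDs with gene names; drop probes without an annotation."""
    annotated = []
    for probe_id, values in expression_rows:
        gene = probe_map.get(probe_id)
        if gene is not None:
            annotated.append((gene, values))
    return annotated


# Hypothetical reference list and expression rows (probe ID, values per sample).
toy_reference = io.StringIO("probe_001\tGENE_A\nprobe_002\tGENE_B\nprobe_003\t\n")
toy_expression = [
    ("probe_001", [7.1, 6.8]),
    ("probe_003", [3.2, 3.0]),
    ("probe_002", [5.5, 5.9]),
]
toy_annotated = annotate(toy_expression, load_probe_map(toy_reference))
```

Several probes can map to the same gene, so the annotated output may contain duplicated gene names that need to be collapsed later.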
2.3 Generation of cell composition data with the Gene Expression Deconvolution
Interactive Tool (GEDIT)
Gene name annotated microarray data were converted into a cell composition matrix using
Gene Expression Deconvolution Interactive Tool (GEDIT). A Human Primary Cell Atlas (HPCA)
matrix was used as the reference matrix. It is a rectangular array of gene expression values in
which every row represents a gene and every column represents a cell type.
The HPCA matrix covers 19716 signature genes important for identifying 26 major cell types,
including keratinocytes, immune, Schwann, and smooth muscle cells (Nadel et al., 2021). Hence,
using the HPCA matrix gives a better understanding of the cells potentially affected during sepsis. Each
entry in the HPCA reference matrix contains a gene expression level value of a specific gene
respective to a cell type. The gene expression profile for all signature genes within a cell type is
unique, thus making the cell type distinct from others.
Referring to the HPCA matrix, GEDIT computes a signature score for every gene in our
input gene expression data to identify the signature genes. GEDIT then performs linear regression
and row scaling to deconvolute the data and generate cell composition data.
2.4 Statistical Analysis
Metadata specific to each sample, such as the accession number, age,
gender, and mortality status, were appended to the cell composition data. The cell composition
data were further cleaned to remove samples with missing values and samples whose gender was not
recorded as male or female. Gender was redefined as Sex and converted from strings to integers,
where males were assigned a value of ‘1’ and females a
value of ‘0’. Similarly, mortality was redefined as Death, assigning ‘1’ to non-survivors who
died from sepsis and ‘0’ to sepsis survivors. The cell composition and clinical covariate
information from each cohort were combined into a single Combined Cell Composition data file
in TSV format.
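The recoding of gender and mortality into integer columns can be sketched as follows. The Sex and Death codes come from the text above; the input field names (`accession`, `age`, `gender`, `mortality`), the label strings, and the toy records are hypothetical.

```python
SEX_CODES = {"male": 1, "female": 0}
DEATH_CODES = {"non-survivor": 1, "survivor": 0}


def encode_sample(sample):
    """Return a record with integer Sex and Death, or None if a value is missing."""
    sex = SEX_CODES.get(sample.get("gender"))
    death = DEATH_CODES.get(sample.get("mortality"))
    if sex is None or death is None or sample.get("age") is None:
        return None  # dropped during cleaning
    return {
        "Accession": sample["accession"],
        "Age": sample["age"],
        "Sex": sex,
        "Death": death,
    }


# Hypothetical samples; the second has a missing mortality status.
toy_samples = [
    {"accession": "A1", "age": 63, "gender": "male", "mortality": "non-survivor"},
    {"accession": "A2", "age": 45, "gender": "female", "mortality": None},
    {"accession": "A3", "age": 38, "gender": "female", "mortality": "survivor"},
]
encoded = [row for row in map(encode_sample, toy_samples) if row is not None]
```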
Subsequently, the association between 30-day sepsis mortality and the composition of 26 cell
types and two clinical covariates was studied using regression analysis with the statsmodels
package in Python. First, a logistic (or logit) model, a form of the generalized linear
model (GLM), was fitted with the two clinical covariates, Sex and Age, as the predictor variables
and Death from sepsis as the response variable. The regression analysis was then performed on
each cell type individually, with that cell type, Age, and Sex as the predictors and Death as
the response variable. The analysis results were combined and summarized in Table 3.2.1. The
relative abundance of statistically significant cell types was compared between survivors and
non-survivors and visualized as box plots.
Figure 2.4.1: Example of logit model code on Python.
2.5 Gene Expression Data Cleaning
After the normalization, processing, and gene annotation described in Section 2.2, the gene expression data for each accession number were further cleaned for logistic regression analysis. Duplicated genes were removed from each cohort using the ‘collapseRows’ function of the weighted correlation network analysis (WGCNA) package in R. Mortality status for each sample was appended as ‘Death’, where ‘0’ was defined as survivor and ‘1’ as non-survivor. The cohorts were then combined into a single TSV file using an inner join in R. The combined gene expression file was used for logistic regression analysis.
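The study performed the join in R. Purely to illustrate what an inner join on gene symbols does, the sketch below uses plain Python dictionaries: only genes present in every cohort are kept, and the samples from all cohorts are placed side by side. The gene names, sample IDs, and values are made up for the example.

```python
def inner_join_on_genes(cohorts):
    """Keep only genes present in every cohort and merge their samples.

    Each cohort is a dict: gene symbol -> {sample ID -> expression value}.
    """
    shared_genes = set.intersection(*(set(cohort) for cohort in cohorts))
    combined = {}
    for gene in sorted(shared_genes):
        merged = {}
        for cohort in cohorts:
            merged.update(cohort[gene])  # sample IDs are assumed unique across cohorts
        combined[gene] = merged
    return combined


# Two toy cohorts with partially overlapping genes (illustrative values).
cohort_a = {
    "TNF": {"A1": 5.1, "A2": 4.8},
    "IL6": {"A1": 7.2, "A2": 6.9},
    "CD4": {"A1": 3.3, "A2": 3.0},
}
cohort_b = {
    "TNF": {"B1": 5.5},
    "IL6": {"B1": 7.0},
    "ELANE": {"B1": 9.1},
}
combined = inner_join_on_genes([cohort_a, cohort_b])
```

A consequence of this design is that the number of genes can only shrink as cohorts are added, since a gene missing from any single cohort is dropped from the combined file.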
2.6 Logistic regression model and hyperparameter tuning
Logistic regression analysis and hyperparameter tuning were performed in the same way for the gene expression and cell composition data. The analyses used scikit-learn, a machine learning library for Python. Separate classes were defined in Python for the two data types, and an object of each class was instantiated and run to produce the output. Within each class, functions were built to split the data into 80% training and 20% validation cohorts using ‘train_test_split’, with five-fold cross-validation.
GridSearchCV was then used to tune the hyperparameters, selecting the parameters that gave the best accuracy scores when fitting the logistic regression to the respective cell composition and gene expression data. Using the best parameters from GridSearchCV, the logistic regression model was fitted to the gene expression or cell composition data. Confusion matrices and ROC curves for the training and validation datasets were generated for the cell composition (Figure 3.3) and gene expression data (Figure 3.4).
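A minimal sketch of this workflow with scikit-learn is given below. It runs on synthetic data, and the hyperparameter grid (only the regularization strength `C`), the `max_iter` setting, and the random seed are assumptions chosen for illustration; the study's actual grid is not stated in this section.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic features and binary outcome standing in for the real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0.6).astype(int)

# 80% training / 20% validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Five-fold cross-validated grid search on the training data, scored by accuracy.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out validation cohort.
best_model = search.best_estimator_
val_auroc = roc_auc_score(y_val, best_model.predict_proba(X_val)[:, 1])
val_confusion = confusion_matrix(y_val, best_model.predict(X_val))
```

Repeating this procedure with `random_state` values from 0 to 100 and collecting `val_auroc` each time would correspond to the repeated-run assessment described in Section 2.7.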
2.7 Model Performance Analysis
To assess the prognostic performance of the cell composition and gene expression models, both data types were run with random states 0 to 100 while keeping the other parameters constant. The AUROC and AUPRC scores were collated and plotted as boxplots (Figure 3.5).
CHAPTER 3 RESULTS
3.1 Cohort selection for our study and logistic regression
To achieve the objectives of this study, we started from multiple cohorts, then filtered and selected seven cohorts to build prognostic models to predict sepsis mortality. We compared whether cell composition or gene expression data is more reliable in predicting sepsis mortality. For this study, we initially selected 28 cohorts from public repositories, such as GEO and ArrayExpress, that specifically studied sepsis, septic shock, and SIRS.
One of the fundamental challenges faced by computational researchers is the accessibility of information. Multiple public databases, such as GEO, ArrayExpress, and Synapse, share public omics data to allow for further analysis and the potential discovery of new hypotheses and biological insights. However, not all information in these datasets is available to the public; some datasets are only partially annotated or are restricted from public access (Huang et al., 2021; Ramesh et al., 2022). Information can also be scattered, which may affect data quality (Huang et al., 2021). As a result, the insights and conclusions that computational biologists can draw from these data are limited.
To obtain well-grounded insights, we extracted detailed information from the metadata provided by the data generators, mainly on age, gender, number and type of organism, type of sepsis, type of pathogen that caused sepsis, the country where the study was conducted, ancestry information, source of sample, and mortality rate. Unfortunately, not all information was available in the metadata for each cohort. Hence, we further studied the original publications to extract the missing information. If a detail was stated in neither the metadata nor the original publication, it was noted as ‘No info’ (Table 3.1.1).
Table 3.1.1: Characteristics of sepsis cohorts summarized from metadata and original publication.
Accession Number [Author] | Sepsis Source Type (and Disease Type) | Number of samples/patients (and Organism) | Genders | Age | Mortality: # died (% died) | Ancestry (and Country where study conducted) | Sample Source
E-MEXP-3567 [Irwin AD] | Bacterial (Pneumococcal Meningitis with some HIV co-infection) | 15 samples (Homo sapiens) | Male and Female | Infants | 6 (40%) | No info (Malawi) | Blood (Total RNA)
E-MEXP-3850 [Kwan A] | Bacterial (Meningococcal Sepsis) | 5 samples (Homo sapiens) | Male and Female | Children | 1 (20%) | No info (England) | Blood (Total RNA)
E-MTAB-1548 [Almansa R] | General (Normal vs. Systemic Inflammatory response syndrome (SIRS) vs. Sepsis) | 155 samples (Homo sapiens) | Male and Female (some no info about gender) | 42-88 | 17 (10.9%) | No info (Spain) | Blood
E-MTAB-4421 [Davenport EE] | Bacterial (Gram positive & Gram Negative) & Viral | 270 samples (Homo sapiens) | Male and Female | 18-89 | 58 (21.5%) | No info (United Kingdom) | Leukocyte Cells
E-MTAB-4451 [Davenport EE] | Bacterial (Gram positive & Gram Negative) & Viral | 114 Samples (Homo sapiens) | Male and Female | 31-91 | 57 (50%) | No info (United Kingdom) | Leukocyte Cells (Total RNA)
GSE10474 [Howrylak JA] | Bacterial (Gram-positive and Gram Negative) | 34 samples (Homo sapiens) | No Info | No Info | 12 (35.3%) | No info (United States of America) | Plasma (Total RNA)
GSE110487 [Barcella M] | General | 31 samples (Homo sapiens) | No Info | No Info | 7 (22.6%) | No info (Switzerland & Belgium) | Whole Blood (Total RNA)
GSE13015 [Pankla R] | Bacterial & Fungal (B. pseudomallei, A. baumannii, Corynebacterium spp., C. albicans, E. coli, Salmonella serotype B, S. aureus, Salmonella spp., Streptococcus infections) | 92 Samples (Homo sapiens) | Male and Female | 18-81 | 20 (21.7%) | Asian (Thailand) | Blood (Total RNA)
GSE13904 [Wong HR] | Mainly Bacteria (Gram positive and gram negative), Other Pathogen | 102 Samples (Homo sapiens) | No Info | Children (Ages no info) | 16 (15.6%) | No info (United States of America) | Whole Blood (Total RNA)
GSE21802 [Bermejo-Martin JF] | Viral (H1N1; Some co-infected with Bacterial or Fungal superinfection) | 19 samples (Homo sapiens) | No Info | No Info | 7 (36.8%) | No info (Spain) | Blood (Total RNA)
GSE25504 [Smith CL] | Bacterial and viral (Staphylococcus, Enterovirus, etc.) | 170 samples (Homo sapiens) | Female (68) and Male (102) | Neonates (Ages no info) | No info | No info (Scotland) | Whole Blood (Total RNA)
GSE27131 [Berdal JE] | Viral (H1N1) | 21 samples (Homo sapiens) | Female (5) and Male (16) | 28-59 | 2 (9.5%) | No info (Norway) | Blood (Total RNA)
GSE28750 [Sutherland A] | General (Sepsis vs Healthy vs Post-surgical) | 41 samples (Homo sapiens) | No info | No Info | No info | No info (Australia) | Whole Blood (Total RNA)
GSE32707 [Dolinay T] | General (Sepsis, SIRS, Sepsis + ARDS, No Sepsis ARDS, Untreated) | 225 samples (Homo sapiens) | No Info | No Info | 62 (27.6%) | Asian/Pacific Islander, Black, White, Hispanic (United States of America) | Whole Blood (Total RNA)
GSE33341 [Ahn SH] | Bacterial (Staphylococcus aureus, E coli, S Pneumonia, Infection) | 94 samples (Homo sapiens), 227 samples (Mus musculus) | Male- Human (53), Female- Human (41), Male Mouse (227) | 23-91 (Humans) | 2 (2.1%) | Asian, White, Black (United States of America) | Whole Blood (Total RNA)
GSE40586 [Lill M] | Bacterial (S. agalactiae, E. coli, S. pneumoniae, L. monocytogenes, N. meningitidis) | 39 samples (Homo sapiens) | No info | No info | 2 (5.1%) | No info (Estonia) | Peripheral Blood (Total RNA)
GSE4607 [Wong HR] | Viral, Fungal, Bacterial (C. albicans, E. coli, Varicella, etc., vs Healthy) | 57 Samples (Homo sapiens) | No info | Children | 9 (15.8%) | No info (No info) | Whole Blood (Total RNA)
GSE54514 [Parnell GP] | Bacterial (Type of Bacteria not specified for the affected) | 163 samples (Homo sapiens) | Male and female | 18-86 | 31 (19%) | No info (Australia) | Whole Blood (Total RNA)
GSE63042 [Langley] | Bacterial (S. pneumoniae, Staphylococcus, Enterobacteriaceae) | 129 Samples (Homo sapiens) | No info | No info | 28 (21.7%) | No info (United States of America) | Blood (Total RNA)
GSE63311 [Pena OM] | General (Sepsis vs no sepsis) | 83 Samples (Homo sapiens) | No info | No info | No info | No info (Canada) | Whole Blood (Total RNA)
GSE64457 [Demaret J] | Bacteria (Bacilli gram negative, Cocci gram positive), Fungal and Other | 23 Samples (Homo sapiens) | No Info | No Info | No info | No info (United States of America) | Purified Neutrophils (Total RNA)
GSE65682 [Scicluna BP] | Bacterial (Streptococcus pneumoniae, Staphylococcus aureus, etc., Unknown) | 802 Samples (Homo sapiens) | Male and female | 17-93 | 114 (14.2%) | No info (Netherlands) | Whole Blood (Total RNA)
GSE66099 [Wong] | General (Septic shock vs SIRS vs healthy) | 276 Samples (Homo sapiens) | No info | No info | No Info | No info | Blood (Total RNA)
GSE66890 [Kangelaris] | General (Sepsis vs Sepsis with ARDS) | 62 Samples (Homo sapiens) | Male and Female | 18-91 | 14 (22.6%) | No info | Blood (Total RNA)
GSE74224 [Mchugh L] | Bacterial (Gram-positive isolations, Gram-negative isolations), Fungal & Mixed infections | 105 Samples (Homo sapiens) | No info | No info | No info | White, East Indian/Asian, Aboriginal and Torres Strait Islander (Australia) | Peripheral Blood (Total RNA)
GSE8121 [Shanley TP] | Viral, Fungal, Bacterial (C. albicans, E. faecalis, E. coli, Varicella, etc) | 45 Samples (Homo sapiens) | No info | Children | 5 (11.1%) | Asian, Black, Caucasian, unreported (United States of America) | Whole Blood (Total RNA)
GSE95233 [Venet F] | Bacterial (Gram-positive, Gram-negative), Fungal and other | 125 Samples (Homo sapiens) | Homosexual, Male and Female | 25-85 | 34 (27.2%) | No info (France) | Whole Blood (Total RNA)
GSE9960 [Tang BM] | Bacterial (Gram Positive, Gram-negative and mixed) | 70 Samples (Homo sapiens) | No info | No info | No info | No info (Australia) | Circulating mononuclear cells (Total RNA)
We further filtered the 28 datasets down to seven cohorts, focusing on humans (Homo sapiens) aged between 17 and 99, with gender recorded as male or female, and with bacterial, viral, or fungal sepsis. Cohorts that did not meet these criteria were omitted, including those described as having “general” sepsis. After filtering and cleaning the data, we worked with 1132 samples for logistic regression. The cohorts selected for the study are described in Table 3.1.2.
Table 3.1.2: Characteristics of patients included in the logistic regression analysis.
Accession ID [Author] | Cohort description and Sepsis Type | Number of samples and Organism | Genders | Age
E-MTAB-4421 [Davenport EE] | Bacterial/Viral | 270 samples (Homo sapiens) | Male and Female | 18-89
E-MTAB-4451 [Davenport EE] | Bacterial/Viral | 114 Samples (Homo sapiens) | Male and Female | 31-91
GSE27131 [Berdal JE] | Viral | 21 samples (Homo sapiens) | Female (5) and Male (16) | 28-59
GSE33341 [Ahn SH] | Bacterial | 94 samples (Homo sapiens) | Male- Human (53), Female- Human (41) | 23-91 (Humans)
GSE54514 [Parnell GP] | Bacterial | 163 samples (Homo sapiens) | Male (64) and female (99) | 18-86
GSE65682 [Scicluna BP] | Bacterial | 802 Samples (Homo sapiens) | Male and female | 17-93
GSE95233 [Venet F] | Bacterial, Fungal and other | 125 Samples (Homo sapiens) | Homosexual, Male and Female | 25-85
3.2 Statistically significant features were selected for logistic regression
Our study compared gene expression and cell composition logistic regression models to determine which model would better predict sepsis mortality. We used data from seven cohorts that conducted 30-day mortality studies (Table 3.1.2). Gene expression data were downloaded from public repositories such as NCBI GEO and EMBL-EBI ArrayExpress, followed by normalization, processing, and annotation of gene symbols. Subsequently, cell composition data were generated by statistical deconvolution of the gene expression data using the GEDIT tool.
Before performing logistic regression on the cell composition data, we analyzed the data for statistical associations of the key covariates and cell types with mortality. We used a logit model to examine the association between sepsis-linked mortality and the covariates and cell type composition. A logit model is a generalized linear model that models the log-odds of the outcome, i.e., the natural logarithm (base e) of the odds, as a linear function of the predictors.
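For the per-cell-type models described in Section 2.4, the log-odds relationship can be written as below. The notation is ours: p is the probability that Death = 1, x_c is the composition of cell type c, and the intercept term \beta_0 is an assumption, since the text does not state whether one was included.

```latex
\[
\log\frac{p}{1-p}
  = \beta_0
  + \beta_{\mathrm{Age}}\,\mathrm{Age}
  + \beta_{\mathrm{Sex}}\,\mathrm{Sex}
  + \beta_{c}\,x_{c},
\qquad p = \Pr(\mathrm{Death} = 1)
\]
```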
Additionally, we used the logit model to perform forward feature selection for logistic regression. Forward feature selection is a machine learning technique for selecting the essential features needed to build an optimized, robust model and improve its performance. The logit model was generated using the ‘statsmodels’ package in Python (Figure 2.4.1), as described in Section 2.4.
Our regression analysis found 11 cell types and two clinical covariates that differed significantly between sepsis patients who died and those who survived. The results of the regression analysis are summarized in Table 3.2.1. We examined the coefficient, standard error, z-value, p-value, and the lower and upper bounds of the 95% confidence interval (the 0.025 and 0.975 columns). A coefficient is expressed in units of log odds and is defined as the change in the log odds of the outcome variable per unit change in the input variable. A positive coefficient indicates an increase in the log odds of mortality per unit increase in the input feature. In contrast, a negative coefficient indicates a decrease in the log odds of mortality per unit increase in the input feature. For example, for every unit increase in age, the log odds of mortality decrease by 0.0122; meanwhile, for each unit increase in endothelial cells, the log odds of mortality increase by 18.2526.
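Log-odds coefficients are often easier to read as odds ratios. The sketch below exponentiates the two coefficients quoted above. Two assumptions are made for illustration: that age is measured in years, and that cell composition is expressed as a fraction between 0 and 1, so that a change of 0.01 (one percentage point) is a more realistic step than a full unit.

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds for a `delta` increase in the feature."""
    return math.exp(coefficient * delta)


# Coefficients taken from Table 3.2.1.
age_or_per_year = odds_ratio(-0.0122)              # one-unit increase in Age
age_or_per_decade = odds_ratio(-0.0122, delta=10)  # ten-unit increase in Age

# Assumed step of 0.01 in endothelial cell fraction (see lead-in).
endothelial_or_per_point = odds_ratio(18.2526, delta=0.01)
```

An odds ratio below 1 corresponds to a negative coefficient and an odds ratio above 1 to a positive one, so the direction of each effect is unchanged by the transformation.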
Table 3.2.1: Summary of regression analysis of the cell composition data. A regression analysis was performed for 26 cell types, age, and sex based on sepsis mortality. The analysis reports coefficients, standard errors, z-values, p-values (P>|z|), and the lower and upper bounds of the 95% confidence interval ([0.025, 0.975]). The p-values of statistically significant coefficients are marked with asterisks (* for p-value < 0.05, ** for p-value < 0.01, *** for p-value < 0.001, **** for p-value < 0.0001).
Feature | Coefficient | Standard Error | z-value | p-value | [0.025 | 0.975]
Age | -0.0122 | 0.0015 | -8.0492 | 8.3368e-16 **** | -0.0152 | -0.0092
Sex | -0.3263 | 0.1283 | -2.5436 | 0.01097 * | -0.5778 | -0.0749
Mesenchymal Stem cells | -61.2000 | 43.7957 | -1.3974 | 0.1623 | -147.0380 | 24.6380
Bronchial Epithelial Cells | 0.5361 | 3.292 | 0.163 | 0.8706 | -5.915 | 6.987
Fibroblasts | 32.1632 | 37.925 | 0.848 | 0.3964 | -42.168 | 106.495
Endothelial cells | 18.2526 | 5.439 | 3.356 | 0.0007913 *** | 7.592 | 28.913
Adipocytes | 6.8374 | 5.997 | 1.140 | 0.2542 | -4.916 | 18.591
Keratinocytes | -14.7111 | 18.027 | -0.816 | 0.4145 | -50.043 | 20.621
Schwann cells | -18.0620 | 14.929 | -1.210 | 0.2263 | -47.323 | 11.199
Smooth Muscle Cells | -2.0380 | 10.827 | -0.188 | 0.8507 | -23.259 | 19.184
CD34+ | 10.8900 | 3.406 | 3.197 | 0.001389 ** | 4.214 | 17.566
Platelets | -0.1945 | 1.416 | -0.137 | 0.8908 | -2.971 | 2.582
Monocytes CD14+ CD16- | -5.9829 | 0.846 | -7.075 | 1.4954e-12 **** | -7.640 | -4.325
Monocytes CD14- CD16+ | -12.5595 | 4.203 | -2.988 | 0.002809 ** | -20.798 | -4.321
Macrophage | -6.5335 | 5.359 | -1.219 | 0.2228 | -17.036 | 3.96
M1 Macrophage (IFNγ) | -8.8978 | 3.206 | -2.776 | 0.005507 ** | -15.181 | -2.615
M2 Macrophage (IL4) | -50.7094 | 14.247 | -3.559 | 0.0003718 *** | -78.633 | -22.786
Monocyte derived Macrophages (IFNα) | 20.2936 | 6.632 | 3.060 | 0.002213 ** | 7.295 | 33.292
Dendritic cells (BDCA1+) | -78.2019 | 62.424 | -1.253 | 0.2103 | -200.551 | 44.147
Dendritic cells (BDCA3+) | -111.9394 | 93.014 | -1.203 | 0.2288 | -294.244 | 70.365
Dendritic cells (plasmacytoid+) | -3.0851 | 5.666 | -0.544 | 0.5861 | -14.190 | 8.020
T cells (CD4+) | -14.1071 | 1.845 | -7.646 | 2.0712e-14 **** | -17.723 | -10.491
T cells (CD8+) | 2.1898 | 3.791 | 0.578 | 0.5635 | -5.240 | 9.620
T cells (γδ) | -6.5137 | 2.067 | -3.152 | 0.001622 ** | -10.564 | -2.463
NK cells | -9.8719 | 2.659 | -3.712 | 0.0002053 *** | -15.084 | -4.660
Neutrophils | -3.9653 | 0.592 | -6.696 | 2.1464e-11 **** | -5.126 | -2.805
B cells | -1.5695 | 1.901 | -0.826 | 0.4089 | -5.295 | 2.156
The z-value and P>|z| columns report the z-values and p-values for the two-tailed test of the null hypothesis that the coefficient of each feature is zero. The significance threshold (alpha) was set at 0.05. Hence, coefficients with p-values less than 0.05 were considered statistically significant, and the null hypothesis was rejected for them.
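The quantities in Table 3.2.1 are related by simple formulas, sketched below with the Python standard library. This assumes the usual Wald test on a standard normal reference distribution: z is the coefficient divided by its standard error, the two-sided p-value comes from the normal tail, and the 95% interval is the coefficient plus or minus about 1.96 standard errors. The example uses the Sex row of the table; the function names are ours.

```python
import math


def wald_test(coefficient, standard_error, z_crit=1.959964):
    """Return (z, two-sided p-value, 95% confidence interval) for one coefficient."""
    z = coefficient / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area of N(0, 1)
    ci = (coefficient - z_crit * standard_error,
          coefficient + z_crit * standard_error)
    return z, p, ci


def stars(p):
    """Asterisk annotation following the thresholds in the Table 3.2.1 caption."""
    for cutoff, label in ((1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")):
        if p < cutoff:
            return label
    return ""


# Sex row of Table 3.2.1: coefficient -0.3263, standard error 0.1283.
z, p, ci = wald_test(-0.3263, 0.1283)
```

Small differences from the tabulated values are expected because the coefficient and standard error are rounded to four decimal places in the table.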
Based on the p-values, we identified the coefficients of 11 cell types (endothelial cells, CD34+, monocytes CD14+ CD16-, natural killer (NK) cells, monocytes CD14- CD16+, M1 macrophages, M2 macrophages, CD4+ T cells, monocyte-derived macrophages, gamma delta T cells, and neutrophils) as statistically significant, together with the coefficients of the two clinical covariates, age and sex, since each has a p-value of less than 0.05 (Figure 3.2.2). These statistically significant variables were selected as the features for building a logistic regression model of sepsis mortality on the cell composition data. The relative abundance of the 11 statistically significant cell types and the age distribution across non-survivors and survivors are presented in Figure 3.2.2.
Figure 3.2.2: Comparison of non-survivor and survivor based on (A) age distribution (B-L) relative
composition of cells with a significant difference for sepsis mortality
3.3 Prediction of sepsis mortality with cell composition data
Using the cell composition data, we built a single prognostic model to examine the association between cell composition and mortality status. Before building the logistic regression model, we performed feature selection to identify the important features that affect the survival status of patients with sepsis. We conducted a regression analysis to select the significant features, as summarized in Table 3.2.1. From our analysis, we selected age, sex (gender), endothelial cells, CD34+, monocytes CD14+ CD16-, NK cells, monocytes CD14- CD16+, M1 macrophages, M2 macrophages, CD4+ T cells, monocyte-derived macrophages, gamma delta T cells, and neutrophils as the features to predict mortality in the logistic regression on cell composition data. In addition, we applied GridSearchCV to tune the hyperparameters for a robust logistic regression model.
We used the ‘train_test_split’ function to hold out 20% of our data as a validation cohort and applied five-fold cross-validation. Five-fold cross-validation divides the data into five subparts, each containing 20% of the data. Logistic regression is then repeated five times. Each time, one of the five subparts is used as the validation data, and the remaining 80% of the data forms the training set. The cross-validation score (an error estimate) is then averaged over all five trials to give a more stable estimate of model performance. Combining train_test_split with cross-validation was intended to reduce fitting bias and variance, since every sample serves as validation data in one of the folds.
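The fold construction described above can be sketched in a few lines of standard-library Python. This is an illustration of the idea only, not scikit-learn's implementation; the function name and the seed are ours.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(100, k=5))
```

With 100 samples and k = 5, each round trains on 80 samples and validates on the remaining 20, and every sample is used for validation exactly once across the five rounds.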
Upon running the logistic regression model, we plotted the confusion matrix and ROC curves for training and validation data, as shown in Figure 3.3. A confusion matrix summarizes the predictions in matrix form and gives a quantitative understanding of whether the model classifies the samples correctly or incorrectly. The predicted label indicates the mortality status predicted from the input features in the test data, ‘0’ for those who survived sepsis and ‘1’ for those who died from sepsis. In comparison, the true label refers to the actual mortality status of the test data.
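As a small worked example of how a confusion matrix and the accuracy derived from it are computed, consider the sketch below. The labels are invented for illustration and follow the coding used here (‘0’ for survivors, ‘1’ for non-survivors).

```python
def confusion_counts(true_labels, predicted_labels):
    """Count true/false positives and negatives for binary labels (1 = died)."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == 1:
            counts["TP" if predicted == 1 else "FN"] += 1
        else:
            counts["FP" if predicted == 1 else "TN"] += 1
    return counts


def accuracy(counts):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Toy example: eight patients, three of whom died.
true_labels = [0, 0, 0, 0, 1, 1, 0, 1]
predicted_labels = [0, 0, 1, 0, 1, 0, 0, 0]
counts = confusion_counts(true_labels, predicted_labels)
model_accuracy = accuracy(counts)
```

In this toy case five of the eight predictions are correct, yet two of the three deaths are missed, which shows why accuracy alone can be misleading when survivors outnumber non-survivors.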
The ROC curves examine the performance of the classification model at all possible decision thresholds. ROC curves plot sensitivity against 1 - specificity and so show the trade-off between the true positive rate (TPR) and the false positive rate (FPR). A straight 45-degree diagonal line through the coordinates (0,0) and (1,1) is often drawn as the baseline of the ROC plot. This line depicts a test that has no predictive value and randomly guesses a positive or negative value regardless of the true mortality status. A performance metric is derived from the ROC curve: the area under the ROC curve (AUROC). AUROC indicates how well the model can discriminate between positive and negative cases and summarizes the test's overall discriminative ability. An AUROC of 0.5 means the model cannot distinguish whether the patients survived or died, and the ROC curve will likely fall on the 45-degree diagonal line. An AUROC score of 1.0 means a perfect classifier.
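AUROC has an equivalent rank-based interpretation: it is the probability that a randomly chosen non-survivor receives a higher predicted score than a randomly chosen survivor, with ties counted as one half. The sketch below computes it directly from that definition; it is meant to illustrate the metric and is not how scikit-learn computes it internally. The scores are invented.

```python
def auroc(true_labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(true_labels, scores) if y == 1]
    negatives = [s for y, s in zip(true_labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
perfect = auroc(labels, [0.1, 0.2, 0.8, 0.9])        # every positive outranks every negative
uninformative = auroc(labels, [0.5, 0.5, 0.5, 0.5])  # identical scores, no discrimination
partial = auroc(labels, [0.1, 0.6, 0.4, 0.9])        # one of four pairs is misordered
```

The three cases reproduce the reference points discussed above: a perfect classifier, a classifier on the 45-degree line, and something in between.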
The confusion matrix indicated that our model classified survival status with an accuracy of 76.2% on the cell composition data. The AUROCs for the training and validation cohorts were 0.7058 and 0.7046, respectively, which suggests that our model is reasonably good at predicting sepsis mortality when given age, gender, and the significant cell type features as input variables.
Figure 3.3: Logistic regression of cell composition data. (A) Summary of true and predicted labels in the confusion matrix (‘0’ for survivors and ‘1’ for non-survivors). (B) ROC curve of the training dataset. (C) ROC curve of the validation dataset.
3.4 Prediction of sepsis mortality with gene expression data
Similar to the cell composition data, we investigated whether our model can predict sepsis mortality using gene expression data. Upon acquiring gene expression data from the public repositories, we cleaned the data by removing control or healthy samples, samples with missing values, samples whose gender was recorded as ‘homosexual’ rather than male or female, and samples outside the age range of 17 to 99. Then, we removed the duplicated genes in the data using the WGCNA package in R, as stated in Section 2.5. In the logistic regression model of gene expression data, we employed 10596 genes to train the model and test it on the validation data set. We also applied GridSearchCV to tune the hyperparameters of the logistic regression model.
As with the cell composition data, we used the train_test_split function to hold out 20% of the dataset as a validation cohort, with five-fold cross-validation. We generated a confusion matrix and ROC curves for the training and validation cohorts, as shown in Figure 3.4.
The confusion matrix indicated that our model classified survival status with an accuracy of 73.7% on the gene expression data of the patients with sepsis. The AUROCs for the training and validation cohorts were 0.7635 and 0.6884, respectively, suggesting that our model is reasonably good at predicting sepsis mortality when given gene expression as the input variable.
Figure 3.4: Logistic regression of gene expression data. (A) Summary of true and predicted labels in the confusion matrix (‘0’ for survivors and ‘1’ for non-survivors). (B) ROC curve of the training dataset. (C) ROC curve of the validation dataset.
3.5 Assessment of prognostic powers
The performance of the logistic regression was determined by analyzing the ROC curves separately in the training and validation cohorts for both the cell composition and gene expression data. The areas under the ROC curves (AUROC) for the cell composition and gene expression data are shown in Figures 3.3 and 3.4, respectively. As described in Section 2.7, we also computed the areas under the precision-recall curves (AUPRC). The precision-recall (PR) curves show the fraction of true positives among the predicted positives (precision) at each level of recall, where the positive class is patients who died from sepsis.
As depicted in Figure 3.5, we computed the AUPRC and AUROC for the training and validation cohorts on the gene expression and cell composition data. AUPRC can be a valuable metric for understanding the performance of machine learning models, especially for imbalanced datasets, in which the numbers of negatives (survivors) and positives (non-survivors) are not equal. The AUPRCs obtained for predicting those who died from sepsis were noticeably lower than the AUROCs for both cell composition and gene expression data, consistent with an imbalance in the training and validation cohorts. The number of negatives (training cohort: 677 survivors; validation cohort: 166 survivors) notably exceeded the number of positives (training cohort: 228 non-survivors; validation cohort: 61 non-survivors, as implied by the baseline AUPRCs of 0.2519 and 0.2687 for the cell composition data and the total of 1132 samples). The gene expression data (training cohort mean AUPRC: 0.580 ± 0.0082 and validation cohort mean AUPRC: 0.479 ± 0.0082), in general, have higher AUPRCs than the cell composition data (training cohort mean AUPRC: 0.464 and validation cohort mean AUPRC: 0.457).
The cell composition data (training cohort mean AUROC: 0.708 ± 0.0082; validation cohort mean AUROC: 0.697 ± 0.0332) have AUROC scores similar to those of the gene expression data (training cohort mean AUROC: 0.766 ± 0.0060; validation cohort mean AUROC: 0.691 ± 0.0354) in the validation cohorts but slightly lower AUROC in the training cohorts. This suggests that the gene expression model fits the training cohort better than the cell composition model, whereas the two are comparable in the validation cohorts, i.e., they behave similarly when predicting sepsis mortality. On the other hand, the gene expression data (training cohort mean AUPRC: 0.580 ± 0.0082; validation cohort mean AUPRC: 0.479 ± 0.0082) in general have higher AUPRCs than the cell composition data (training cohort mean AUPRC: 0.464; validation cohort mean AUPRC: 0.457).
The baseline for ROC curves is constant. The baseline AUROC is 0.5, which corresponds to a classifier that performs no better than chance, while an AUROC of 1.0 indicates a perfect classifier. Unlike ROC curves, PR curves do not have a predetermined baseline. The baseline of the PR curve varies according to the percentage of positive cases in a cohort. In our logistic regression model built from cell composition data (Figure 3.3), the baseline AUPRC was 0.2519 for the training dataset and 0.2687 for the validation dataset, while our model AUPRC was 0.4631 for the training cohort and 0.4665 for the validation cohort. For the model built using gene expression data (Figure 3.4), the baseline was 0.2555 for the training dataset and 0.2675 for the validation dataset, while our model AUPRC was 0.5801 for the training cohort and 0.4592 for the validation cohort. Because both models exceed their baselines, this suggests that both perform well in detecting the patients likely to die within 30 days. However, on the validation datasets, the model developed from cell composition data performed slightly better than the model built from gene expression data based on the AUROC and AUPRC scores.
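Since the PR baseline is simply the fraction of positive cases, the comparison between model AUPRC and baseline can be sketched as below. The survivor count of 677 is taken from the text; the non-survivor count of 228 is an assumption, back-calculated so that the positive fraction matches the reported training baseline of 0.2519. The function names are ours.

```python
def auprc_baseline(n_positive, n_negative):
    """AUPRC expected from a no-skill classifier: the fraction of positives."""
    return n_positive / (n_positive + n_negative)


def lift_over_baseline(model_auprc, baseline):
    """How many times larger the model's AUPRC is than the no-skill baseline."""
    return model_auprc / baseline


# Training cohort, cell composition data (228 is an assumed, back-calculated count).
baseline = auprc_baseline(n_positive=228, n_negative=677)

# Reported cell composition AUPRCs relative to their reported baselines.
training_lift = lift_over_baseline(0.4631, 0.2519)
validation_lift = lift_over_baseline(0.4665, 0.2687)
```

Expressing AUPRC relative to its baseline makes values from cohorts with different proportions of non-survivors easier to compare than the raw scores alone.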
Figure 3.5: Model performance of training and validation cohorts. Model performance is measured
by (A) AUROC and (B) AUPRC. The two panels (top and bottom) depict the model performance
across cell composition and gene expression data, respectively.
CHAPTER 4 DISCUSSION
Sepsis is a lethal inflammatory response to underlying infections. It affects a wide range of people, including those with existing comorbidities, newly admitted hospital patients, and people who acquire sepsis through community exposure. This heterogeneous disease is caused by hyperactivation of the immune system, which drives uncontrolled systemic inflammation and can advance from the initial clinical manifestations of sepsis to septic shock. Severe forms of sepsis can progress from organ dysfunction to death. Many studies have highlighted the lethality of sepsis and the urgent need to predict sepsis mortality so that prognostic tools can help prevent it.
With advances in computational technology and the growth of data collection, or ‘Big data,’ computational data-driven research has developed rapidly, including the development of methods and tools and secondary analyses of large, publicly available datasets to generate novel biological insights. Many studies have exploited the opportunities offered by computational data-driven research and built prognostic machine-learning tools to predict sepsis using clinical and molecular outcomes. However, these studies lacked information on sepsis mortality in homogeneous groups of sepsis patients, or did not extend their perspective beyond gene expression to other aspects of the molecular picture, for example the cellular level, when building a machine-learning model.
To address these constraints, we built prognostic models comparing cell composition data and gene expression data for predicting sepsis mortality using machine learning approaches, specifically logistic regression models. Since our model was designed to train on existing examples before predicting outcomes from several input features, it followed a supervised machine learning approach (Tiwari, 2022). Moreover, sepsis mortality in our data was a binary variable with only two possible outcomes: survivor or non-survivor. Hence, our model was an example of a classification model in supervised machine learning, where the model assigns each sample to one of a fixed number of classes (in our case, two) (Jiang et al., 2020; Tiwari, 2022). Therefore, logistic regression, a common supervised classification approach, was applied to predict sepsis mortality in our study.
Unlike other studies, our study used homogeneous 30-day sepsis mortality cohorts. For that, we selected seven cohorts (N = 1132 patients) from 28 cohorts, keeping those that contained information on age, gender, and a known type of sepsis. These cohorts include only adult sepsis patients who are either male or female and were exposed to bacterial, fungal, viral, or mixed sepsis. By selecting datasets with complete information and similar backgrounds, we aimed to improve the data quality, thus allowing us to draw more accurate, unbiased, and well-grounded conclusions from our findings (Huang et al., 2021; Júlvez et al., 2018; Kondratyeva et al., 2022).
Using these cohorts, we built a prediction model in three stages. First, we selected significant features by performing a statistical test of clinical characteristics and different cell types across all the datasets. Our findings identified age, gender, endothelial cells, CD34+, monocytes CD14+ CD16-, NK cells, monocytes CD14- CD16+, M1 macrophages, M2 macrophages, CD4+ T cells, monocyte-derived macrophages, gamma delta T cells, and neutrophils as statistically significant features that could influence the sepsis mortality rate (Table 3.2.1 and Figure 3.2.2). Our data concurred with experimental data from multiple previous studies suggesting that age, gender, several types of immune cells, and non-immune cells like endothelial cells can play a role in sepsis. For instance, some features like CD4+ T cells, neutrophils, and NK cells were downregulated (reflected by a negative coefficient in Table 3.2.1), and cells like endothelial cells were upregulated (reflected by a positive coefficient in Table 3.2.1) in sepsis mortality, accentuating the importance of these variables in sepsis-related deaths (Boomer et al., 2014; Jin et al., 2019; Kondo et al., 2021; Lemire et al., 2017; Nilsson et al., 1999; Parent & Eichacker, 1999; Wardi et al., 2021; Zhu et al., 2022).
In the second stage, we built penalized logistic
regression models using (a) cell composition and (b) gene expression data to predict sepsis
mortality; in the third stage, we assessed model performance. Our models
returned a summary AUROC of 0.7046 (cell composition) and 0.6884 (gene expression)
on our test datasets, with accuracies of 76.2% and 73.7%, respectively.
Comparing the AUPRC scores of our models against the baseline,
both logistic regression models performed satisfactorily in predicting
sepsis mortality. Nonetheless, our results suggest that cell composition data could be a better
predictor of sepsis mortality than gene expression data.
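To make the evaluation metrics concrete, the following sketch computes AUROC, a step-wise AUPRC (average precision), and the no-skill AUPRC baseline, which equals the prevalence of the positive class. It is an illustration using only the Python standard library, not the code used in this study; the function names, labels, and predicted risks are assumptions.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tied scores count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    return total / n_pos


def auprc_baseline(y_true):
    """Expected AUPRC of a no-skill classifier: the positive-class prevalence."""
    return sum(y_true) / len(y_true)


# Toy labels (1 = nonsurvivor) and predicted risks; assumed values only
y = [0, 0, 1, 0, 1, 0, 0, 1]
risk = [0.10, 0.30, 0.80, 0.20, 0.60, 0.70, 0.05, 0.40]

print("AUROC:", round(auroc(y, risk), 4))
print("AUPRC:", round(average_precision(y, risk), 4))
print("AUPRC baseline:", auprc_baseline(y))
```

Because the AUPRC baseline depends on how many nonsurvivors the cohort contains, a model's AUPRC is only meaningful relative to that prevalence, which is why it is compared against the baseline rather than against 0.5.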
Despite the acceptable performance of our models, it is neither possible nor desirable to build
prediction models with 100% accuracy: such accuracy would imply that sepsis mortality is
predetermined and uninfluenced by any clinical factors. Instead, we applied hyperparameter
tuning with GridSearchCV to enhance our predictive performance and to reduce the risk of
overfitting or underfitting the model (Demšar & Zupan, 2021; Shibahara et al., 2023).
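The idea behind this tuning step can be sketched as follows. The example is not the study's code and does not call scikit-learn; it is a from-scratch illustration, using only the Python standard library, of what a cross-validated grid search over the penalty strength of an L2-penalized logistic regression does. The synthetic data, the penalty grid, the learning rate, the number of folds, and all function names are assumptions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, lam, lr=0.1, epochs=200):
    """L2-penalized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The penalty term lam * wj shrinks the weights; the intercept is not penalized
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    w, b = model
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def grid_search_cv(X, y, lambdas, k=5, seed=0):
    """Return the best penalty and the mean k-fold accuracy for each penalty."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    results = {}
    for lam in lambdas:
        scores = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            model = fit_logreg([X[i] for i in train], [y[i] for i in train], lam)
            scores.append(accuracy(model, [X[i] for i in fold], [y[i] for i in fold]))
        results[lam] = sum(scores) / k
    return max(results, key=results.get), results


# Synthetic data (assumed): one informative feature and one noise feature
rng = random.Random(42)
X, y = [], []
for _ in range(100):
    label = rng.randint(0, 1)
    X.append([rng.gauss(1.0 if label else -1.0, 1.0), rng.gauss(0.0, 1.0)])
    y.append(label)

best_lam, cv_scores = grid_search_cv(X, y, lambdas=[0.001, 0.01, 0.1, 1.0])
print("best penalty:", best_lam)
print("mean CV accuracy per penalty:", cv_scores)
```

Each candidate penalty is scored only on folds that were held out during fitting, so the selected value reflects performance on unseen samples rather than fit to the training data.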
Like other studies, ours has limitations, and we acknowledge that our current
models could be refined to give better predictive metrics for sepsis mortality by addressing
them. First, we derived our data from publicly accessible
platforms. Hence, we lacked control over the information available for each patient,
including the infection type, demographics, ancestry, age, gender, source of infection, and severity
of sepsis. Nonetheless, our selection of homogeneous cohorts contributed to more reliable
mortality prediction in our current cohort. Second, although we analyzed homogeneous cases of
sepsis, we included a small number of cohorts in our prognostic models. By adding more cohorts
that meet the existing demographic criteria, we could increase our sample size and thereby
strengthen the performance of our model.
Third, unlike for our cell composition data, we did not apply any particular feature
selection approach to our gene expression data to refine our predictions. This was evident in the
difference in AUROC scores between the training and validation cohorts, which suggests possible
overfitting of our model. Multiple studies have performed a rigorous selection of important genes
influencing sepsis and have shown better prediction scores than ours (Kreitmann et al., 2022;
Sweeney et al., 2018). Moreover, our cell composition data were not patient-derived; instead, they were
estimated from microarray data by a statistical deconvolution tool. Hence, our data and findings
may not be identical to those from patient-derived cell composition data, but they could be the best
available representation of actual patient-derived cell composition data, considering the
robustness of the tool. Nevertheless, testing our model with patient-derived cell
composition data could improve our knowledge of the role of cell composition in sepsis
mortality. Lastly, our model is based on one type of supervised machine learning approach. By
including other machine learning classification models, such as Random Forest, Naive Bayes, and
Support Vector Machines (SVM), we could compare and explore more robust machine learning
approaches, in addition to the comparison between cellular and gene expression data.
CHAPTER 5: CONCLUSION
Our study demonstrated that machine learning approaches can predict
sepsis-related deaths using information from cell composition and gene expression data. The
overall performance of our logistic regression models favored cell composition data as a better
predictor of sepsis mortality than gene expression data. Since cell composition and gene expression
models reflect underlying biological processes and the pathophysiology of sepsis, they could serve as
valuable clinical assessments and biological assays for defining the molecular mechanisms of sepsis
at both cellular and transcriptomic levels. These results could also serve as a reference point for future
prediction models while adding to the current publicly available resources for further insights.
Once these limitations are addressed, a refined prognostic model for sepsis mortality could help
clinicians make early treatment decisions or interventions, allocate hospital resources,
and prevent sepsis-related deaths. Ultimately, the data-driven machine learning approach
needs to be validated in a prospective clinical trial to assess its effectiveness and broader
applicability.
BIBLIOGRAPHY
Banerjee, S., Mohammed, A., Wong, H. R., Palaniyar, N., & Kamaleswaran, R. (2021). Machine
Learning Identifies Complicated Sepsis Course and Subsequent Mortality Based on 20
Genes in Peripheral Blood Immune Cells at 24 H Post-ICU Admission. Frontiers in
Immunology, 12. https://www.frontiersin.org/articles/10.3389/fimmu.2021.592303
Bartlett, A., Penders, B., & Lewis, J. (2017). Bioinformatics: Indispensable, yet hidden in plain
sight? BMC Bioinformatics, 18(1), 311. https://doi.org/10.1186/s12859-017-1730-9
Boomer, J. S., Green, J. M., & Hotchkiss, R. S. (2014). The changing immune system in sepsis.
Virulence, 5(1), 45–56. https://doi.org/10.4161/viru.26516
Callebaut, W. (2012). Scientific perspectivism: A philosopher of science’s response to the
challenge of big data biology. Studies in History and Philosophy of Science Part C:
Studies in History and Philosophy of Biological and Biomedical Sciences, 43(1), 69–80.
https://doi.org/10.1016/j.shpsc.2011.10.007
Demšar, J., & Zupan, B. (2021). Hands-on training about overfitting. PLoS Computational
Biology, 17(3), e1008671. https://doi.org/10.1371/journal.pcbi.1008671
Dinu, A. R., Rogobete, A. F., Bratu, T., Popovici, S. E., Bedreag, O. H., Papurica, M., Bratu, L.
M., & Sandesc, D. (2020). Cannabis Sativa Revisited-Crosstalk between microRNA
Expression, Inflammation, Oxidative Stress, and Endocannabinoid Response System in
Critically Ill Patients with Sepsis. Cells, 9(2), 307. https://doi.org/10.3390/cells9020307
Gauthier, J., Vincent, A. T., Charette, S. J., & Derome, N. (2019). A brief history of
bioinformatics. Briefings in Bioinformatics, 20(6), 1981–1996.
https://doi.org/10.1093/bib/bby063
Huang, Y.-N., Rajesh, A., Ayyala, R., Sarkar, A., Guo, R., Ling, E., Nakashidze, I., Wong, M.
Y., Hu, J., Nosov, A., Chang, Y., Abedalthagafi, M. S., & Mangul, S. (2021). The
systematic assessment of completeness of public metadata accompanying omics studies
(p. 2021.11.22.469640). bioRxiv. https://doi.org/10.1101/2021.11.22.469640
Jiang, T., Gradus, J. L., & Rosellini, A. J. (2020). Supervised machine learning: A brief primer.
Behavior Therapy, 51(5), 675–687. https://doi.org/10.1016/j.beth.2020.05.002
Jin, T., Mohammad, M., Hu, Z., Fei, Y., Moore, E. R. B., Pullerits, R., & Ali, A. (2019). A novel
mouse model for septic arthritis induced by Pseudomonas aeruginosa. Scientific Reports,
9(1), 16868. https://doi.org/10.1038/s41598-019-53434-5
Júlvez, J., Dikicioglu, D., & Oliver, S. G. (2018). Handling variability and incompleteness of
biological data by flexible nets: A case study for Wilson disease. Npj Systems Biology
and Applications, 4(1), Article 1. https://doi.org/10.1038/s41540-017-0044-x
Karlsson, A., Stassen, W., Loutfi, A., Wallgren, U., Larsson, E., & Kurland, L. (2021).
Predicting mortality among septic patients presenting to the emergency department–a
cross sectional analysis using machine learning. BMC Emergency Medicine, 21(1), 84.
https://doi.org/10.1186/s12873-021-00475-7
Karnatovskaia, L. V., & Festic, E. (2012). Sepsis. The Neurohospitalist, 2(4), 144–153.
https://doi.org/10.1177/1941874412453338
Kondo, Y., Miyazato, A., Okamoto, K., & Tanaka, H. (2021). Impact of Sex Differences on
Mortality in Patients With Sepsis After Trauma: A Nationwide Cohort Study. Frontiers
in Immunology, 12. https://www.frontiersin.org/articles/10.3389/fimmu.2021.678156
Kondratyeva, L., Alekseenko, I., Chernov, I., & Sverdlov, E. (2022). Data Incompleteness May
form a Hard-to-Overcome Barrier to Decoding Life’s Mechanism. Biology, 11(8), 1208.
https://doi.org/10.3390/biology11081208
Kreitmann, L., Bodinier, M., Fleurie, A., Imhoff, K., Cazalis, M.-A., Peronnet, E., Cerrato, E.,
Tardiveau, C., Conti, F., Llitjos, J.-F., Textoris, J., Monneret, G., Blein, S., & Brengel-
Pesce, K. (2022). Mortality Prediction in Sepsis With an Immune-Related
Transcriptomics Signature: A Multi-Cohort Analysis. Frontiers in Medicine, 9, 930043.
https://doi.org/10.3389/fmed.2022.930043
Kwok, R. (2013). Computing: Out of the hood. Nature, 504(7479), Article 7479.
https://doi.org/10.1038/nj7479-319a
Lemire, P., Galbas, T., Thibodeau, J., & Segura, M. (2017). Natural Killer Cell Functions during
the Innate Immune Response to Pathogenic Streptococci. Frontiers in Microbiology, 8,
1196. https://doi.org/10.3389/fmicb.2017.01196
Li, Y., Wang, J., Li, Y., Liu, C., Gong, X., Zhuang, Y., Chen, L., & Sun, K. (2021).
Identification of Immune-Related Genes in Sepsis due to Community-Acquired
Pneumonia. Computational and Mathematical Methods in Medicine, 2021, e8020067.
https://doi.org/10.1155/2021/8020067
Mahapatra, S., & Heffner, A. C. (2023). Septic Shock. In StatPearls. StatPearls Publishing.
http://www.ncbi.nlm.nih.gov/books/NBK430939/
Mangul, S., Martin, L. S., Langmead, B., Sanchez-Galan, J. E., Toma, I., Hormozdiari, F.,
Pevzner, P., & Eskin, E. (2019). How bioinformatics and open data can boost basic
science in countries and universities with limited resources. Nature Biotechnology, 37(3),
Article 3. https://doi.org/10.1038/s41587-019-0053-y
Markowetz, F. (2017). All biology is computational biology. PLOS Biology, 15(3), e2002050.
https://doi.org/10.1371/journal.pbio.2002050
Monegro, A. F., Muppidi, V., & Regunath, H. (2023). Hospital Acquired Infections. In
StatPearls. StatPearls Publishing. http://www.ncbi.nlm.nih.gov/books/NBK441857/
Nadel, B. B., Lopez, D., Montoya, D. J., Ma, F., Waddel, H., Khan, M. M., Mangul, S., &
Pellegrini, M. (2021). The Gene Expression Deconvolution Interactive Tool (GEDIT):
Accurate cell type quantification from gene expression data. GigaScience, 10(2),
giab002. https://doi.org/10.1093/gigascience/giab002
Nasir, N., Jamil, B., Siddiqui, S., Talat, N., Khan, F. A., & Hussain, R. (2015). Mortality in
Sepsis and its relationship with Gender. Pakistan Journal of Medical Sciences, 31(5),
1201–1206. https://doi.org/10.12669/pjms.315.6925
Nilsson, N., Bremell, T., Tarkowski, A., & Carlsten, H. (1999). Protective role of
NK1.1+ cells in experimental Staphylococcus aureus arthritis. Clinical and Experimental
Immunology, 117(1), 63–69. https://doi.org/10.1046/j.1365-2249.1999.00922.x
Page, D. B., Donnelly, J. P., & Wang, H. E. (2015). Community-, Healthcare- and Hospital-
Acquired Severe Sepsis Hospitalizations in the University HealthSystem Consortium.
Critical Care Medicine, 43(9), 1945–1951.
https://doi.org/10.1097/CCM.0000000000001164
Parent, C., & Eichacker, P. Q. (1999). Neutrophil and Endothelial Cell
Interactions in Sepsis: The Role of Adhesion Molecules. Infectious Disease
Clinics of North America, 13(2), 427–447.
https://doi.org/10.1016/S0891-5520(05)70084-2
Pop-Began, V., Păunescu, V., Grigorean, V., Pop-Began, D., & Popescu, C. (2014). Molecular
mechanisms in the pathogenesis of sepsis. Journal of Medicine and Life, 7(Spec Iss 2),
38–41.
Prest, J., Nguyen, T., Rajah, T., Prest, A. B., Sathananthan, M., & Jeganathan, N. (2022). Sepsis-
Related Mortality Rates and Trends Based on Site of Infection. Critical Care
Explorations, 4(10), e0775. https://doi.org/10.1097/CCE.0000000000000775
Rajesh, A., Chang, Y., Abedalthagafi, M. S., Wong-Beringer, A., Love, M. I., & Mangul, S.
(2021). Improving the completeness of public metadata accompanying omics studies.
Genome Biology, 22(1), 106. https://doi.org/10.1186/s13059-021-02332-z
Ramesh, T., Chhugani, K., Jönsson, V., & Mangul, S. (2022). Systematic overview of challenges
and opportunities of computational data-driven research in biology. OSF Preprints.
https://doi.org/10.31219/osf.io/v5kr4
Regev, A., Teichmann, S. A., Lander, E. S., Amit, I., Benoist, C., Birney, E., Bodenmiller, B.,
Campbell, P., Carninci, P., Clatworthy, M., Clevers, H., Deplancke, B., Dunham, I.,
Eberwine, J., Eils, R., Enard, W., Farmer, A., Fugger, L., Göttgens, B., … Yosef, N.
(2017). The Human Cell Atlas. ELife, 6, e27041. https://doi.org/10.7554/eLife.27041
Shibahara, T., Wada, C., Yamashita, Y., Fujita, K., Sato, M., Kuwata, J., Okamoto, A., & Ono,
Y. (2023). Deep learning generates custom-made logistic regression models for
explaining how breast cancer subtypes are classified. PLOS ONE, 18(5), e0286072.
https://doi.org/10.1371/journal.pone.0286072
Sweeney, T. E., Perumal, T. M., Henao, R., Nichols, M., Howrylak, J. A., Choi, A. M., Bermejo-
Martin, J. F., Almansa, R., Tamayo, E., Davenport, E. E., Burnham, K. L., Hinds, C. J.,
Knight, J. C., Woods, C. W., Kingsmore, S. F., Ginsburg, G. S., Wong, H. R., Parnell, G.
P., Tang, B., … Langley, R. J. (2018). A community approach to mortality prediction in
sepsis via gene expression analysis. Nature Communications, 9(1), Article 1.
https://doi.org/10.1038/s41467-018-03078-2
Sweeney, T. E., & Wong, H. R. (2016). Risk Stratification and Prognosis in Sepsis: What Have
We Learned from Microarrays? Clinics in Chest Medicine, 37(2), 209–218.
https://doi.org/10.1016/j.ccm.2016.01.003
Tiwari, A. (2022). Chapter 2 - Supervised learning: From theory to applications. In R. Pandey, S.
K. Khatri, N. kumar Singh, & P. Verma (Eds.), Artificial Intelligence and Machine
Learning for EDGE Computing (pp. 23–32). Academic Press.
https://doi.org/10.1016/B978-0-12-824054-0.00026-5
Van Noorden, R., Maher, B., & Nuzzo, R. (2014). The top 100 papers. Nature News, 514(7524),
550. https://doi.org/10.1038/514550a
Wardi, G., Tainter, C. R., Ramnath, V. R., Brennan, J. J., Tolia, V., Castillo, E. M., Hsia, R. Y.,
Malhotra, A., Schmidt, U., & Meier, A. (2021). Age-related incidence and outcomes of
sepsis in California, 2008–2015. Journal of Critical Care, 62, 212–217.
https://doi.org/10.1016/j.jcrc.2020.12.015
Westphal, G. A., Pereira, A. B., Fachin, S. M., Barreto, A. C. C., Bornschein, A. C. G. J.,
Caldeira Filho, M., & Koenig, Á. (2019). Characteristics and outcomes of patients with
community-acquired and hospital-acquired sepsis. Revista Brasileira de Terapia
Intensiva, 31(1), 71–78. https://doi.org/10.5935/0103-507X.20190013
Wheeler, D. S., Wong, H. R., & Zingarelli, B. (2011). Pediatric Sepsis – Part I: “Children are not
small adults!” The Open Inflammation Journal, 4, 4–15.
https://doi.org/10.2174/1875041901104010004
Wren, J. D. (2016). Bioinformatics programs are 31-fold over-represented among the highest
impact scientific papers of the past two decades. Bioinformatics, 32(17), 2686–2691.
https://doi.org/10.1093/bioinformatics/btw284
Yanai, I., & Chmielnicki, E. (2017). Computational biologists: Moving to the driver’s seat.
Genome Biology, 18(1), 223. https://doi.org/10.1186/s13059-017-1357-1
Yu, U.-S., Lee, S.-H., Kim, Y.-J., & Kim, S.-S. (2004). Bioinformatics in the Post-genome Era.
BMB Reports, 37(1), 75–82. https://doi.org/10.5483/BMBRep.2004.37.1.075
Zhu, C., Wang, Y., Liu, Q., Li, H., Yu, C., Li, P., Deng, X., & Wang, J. (2022). Dysregulation of
neutrophil death in sepsis. Frontiers in Immunology, 13, 963955.
https://doi.org/10.3389/fimmu.2022.963955
Abstract
Sepsis is a lethal host inflammatory response to infection and a primary cause of mortality worldwide. Current research approaches predict sepsis mortality using machine learning based on clinical scores and criteria defining sepsis severity. Several studies explore sepsis mortality on a molecular level, yet more insight is needed into sepsis mortality in homogeneous patient cohorts. We developed transcriptomic and cell composition-based machine learning models in homogeneous cohorts to predict sepsis deaths. We derived cell composition data from gene expression data using our recently developed statistical deconvolution tool, GEDIT. Then, we combined publicly available transcriptomics and cell composition data into a trans-ancestry retrospective sepsis cohort with similar clinical covariates. Our cohort includes 1132 adults diagnosed with a known type of sepsis across seven individual studies. We selected key cell types and clinical covariates from statistical comparisons between survivors and nonsurvivors. Subsequently, we built logistic regression models with cell composition and gene expression data to estimate sepsis mortality status. Measured by AUROC, mortality prediction reached a summary AUROC of 0.7046 for cell composition data and 0.6884 for gene expression data. Our results suggest that the relative abundance of various cells (e.g., CD4+ T cells, neutrophils) differed considerably between survivors and nonsurvivors (p-value < 10^-4). Our overall model performance suggests that cell composition data predict sepsis mortality better than gene expression data. Despite the limitations of the study, these results will improve our understanding of the relationship between the immune system and survival status across sepsis patients of diverse backgrounds.
Asset Metadata
Creator
Ramesh, Tejasvene (author)
Core Title
Predicting mortality of sepsis with machine learning model approaches
School
School of Pharmacy
Degree
Master of Science
Degree Program
Pharmaceutical Sciences
Degree Conferral Date
2023-08
Publication Date
08/02/2023
Defense Date
06/22/2023
Publisher
University of Southern California. Libraries (digital)
Tag
bioinformatics,challenges,computational data driven research,logistic regression,machine learning,OAI-PMH Harvest,Opportunities,prognostic studies,sepsis
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Mangul, Serghei (committee chair), Duncan, Roger (committee member), Haworth, Ian (committee member)
Creator Email
tejasvene@gmail.com,tramesh@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113296150
Unique identifier
UC113296150
Identifier
etd-RameshTeja-12183.pdf (filename)
Legacy Identifier
etd-RameshTeja-12183
Document Type
Thesis
Rights
Ramesh, Tejasvene
Internet Media Type
application/pdf
Type
texts
Source
20230803-usctheses-batch-1078 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA
Repository Email
cisadmin@lib.usc.edu