Malignant Cell Fraction Prediction Using Deep Learning: From
Point Estimate to Uncertainty Quantification
by
Jiawei Huang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
May 2025
Copyright 2025 Jiawei Huang
Dedication
To my beloved parents, Yongxia Wang and Yin Huang, for their endless support and encouragement throughout this challenging yet rewarding journey. To the God for always being there.
Acknowledgements
First of all, I would like to extend my most sincere gratitude to my advisor, Dr. Fengzhu Sun, , all
of which have been instrumental throughout my Ph.D. journey. An old Chinese saying goes, "A
teacher is one who imparts the way, teaches knowledge, and resolves doubt." Dr. Sun embodies
this sentiment. His commitment to academic excellence, passion for research, and insightful
feedback have significantly shaped this dissertation and enriched my academic experience. I am
deeply grateful for his inspiration and encouragement, which have contributed immensely to
both my academic and personal growth.
I would also like to thank my Doctoral Qualifying Exam and Dissertation committee members,
Dr. Remo Rohs, Dr. Liang Chen, Dr. Yingying Fan, Dr. Kevin R. Kelly, and Dr. Jiang F. Zhong, for
their invaluable feedback, constructive criticisms, and consistent encouragement. Their diverse
perspectives have enriched my research and broadened my academic horizons.
I am also grateful to Dr. Rui Jiang from the Department of Automation at Tsinghua University
for introducing me to the magnificent world of bioinformatics and computational biology. I am
truly appreciative of the opportunity to conduct research under his guidance as an undergraduate.
I would not have achieved what I have today without his tremendous support.
My heartfelt thanks also go to my collaborators, Dr. Yuxuan Du, Dr. Andres Stucky, Dr.
Yingying Fan, Dr. Jinchi Lv, Dr. Kevin R. Kelly, and Dr. Jiang F. Zhong, for their support and
commitment to our joint publications.
I am thankful to the professors at USC, Dr. Peter Calabrese, Dr. Mark Chaisson, Dr. Liang
Chen, Dr. Vsevolod Katritch, Dr. Adam Maclean, Dr. Andrew Smith, Dr. Remo Rohs, Dr. Rory
Spence, and Dr. Fengzhu Sun, for imparting invaluable knowledge in computational biology.
I would also like to extend my appreciation to my labmates in Sun Lab: Dr. Yuxuan Du,
Dallace Francis, Dr. Yilin Gao, Dr. Wenxuan Zuo, Dr. Xin Bai, Yue Huang, Dr. Kujin Tang, Dr.
Tianqi Tang, Beibei Wang, Dr. Weili Wang, Yuqiu Wang, Dr. Ziye Wang, and Dr. Zifan Zhu.
Additionally, I thank my friends at USC: Dr. Jared Sagendorf, Dr. Brendon Cooper, Dr. Tsu-pei
Chiu, Dr. Jinsen Li, Bryan Dinh, Bida Gu, Dr. Wei Jiang, Yibei Jiang, Dr. Jordy Lam, Dr. Tsung-Yu
Lu, Meilu McDermott, Dr. Raktim Mitra, Dandan Peng, Dr. Jingwen Ren, Dr. Bo Sun, Dr. Vardges
Tserunyan, Yingfei Wang, Dr. Xiaojun Wu, Dr. Quentin Yang, Qingyang Yin, Yuxiang Zhan, and
many others.
I would also like to take a moment to acknowledge myself—to the younger versions of me who
have, in their own way, paved the path to this moment. To the boy who was eight, thank you
for your pure heart and boundless curiosity about the world. Your wonder and openness set the
foundation for a lifelong journey of learning and discovery. And to the young man at eighteen,
thank you for your dedication, for pushing through uncertainty, and for your courage to take
on new challenges. Don’t be afraid—stay brave and keep moving forward. Your persistence and
resilience have brought us here, and I am grateful for every step you took along the way.
Finally, I want to express my deepest gratitude to my parents, Yongxia Wang and Yin Huang,
for their unwavering support and boundless love. I am grateful for their teachings, guidance, and
constant presence throughout this journey. To my sister, Jun Huang, and her husband, Xiaobing
Liu, I am thankful for their support and for shouldering family responsibilities in my absence. To
my dearest girlfriend, Heyi Huang, for her love, patience, and steadfast support at every step of
this journey.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Malignant cell fractions estimation using scRNA-seq and deep neural networks
1.2 Uncertainty quantification when estimating malignant cell fractions
1.3 Existing cell deconvolution methods
1.4 Dissertation outline
1.5 Authors and contributors to the dissertation
Chapter 2: DeepDecon Accurately Estimates Cancer Cell Fractions in Bulk RNA-seq Data
2.1 Introduction
2.2 Materials and Methods
2.2.1 Datasets
2.2.2 Generating artificial bulk RNA-seq datasets
2.2.3 Data Processing
2.2.4 The DeepDecon Model
2.2.5 The impact of gene expression perturbations and the number of cells per bulk sample on DeepDecon
2.3 Results
2.3.1 Methods overview
2.3.2 DeepDecon outperforms other methods for estimating malignant cell fraction
2.3.3 DeepDecon outperforms other deconvolution methods for other cancer types
2.3.4 The impacts of gene expression perturbations and cell number per bulk sample on the performance of DeepDecon
2.4 Discussion
Chapter 3: DeepDeconUQ Estimates Malignant Cell Fraction Prediction Intervals in Bulk RNA-seq Tissue
3.1 Introduction
3.2 Materials and Methods
3.2.1 Datasets
3.2.2 Generating artificial bulk RNA-seq datasets
3.2.3 Data Processing
3.2.4 DeepDeconUQ
3.2.4.1 Problem formulation
3.2.4.2 Quantile regression
3.2.4.3 Conformal prediction
3.2.4.4 Model structure
3.2.5 The impact of gene expression perturbations on DeepDeconUQ
3.3 Results
3.3.1 Methods overview
3.3.2 DeepDeconUQ outperforms other methods for estimating the prediction interval of malignant cell fraction
3.3.3 DeepDeconUQ is robust to gene expression perturbations
3.3.4 Time and memory usage
3.4 Discussion
Chapter 4: Conclusions and future work
4.1 Future work for DeepDecon
4.2 Future work for DeepDeconUQ
Chapter 5: Appendix
5.1 Appendix for chapter 1: Introduction
5.1.1 Cost analysis of bulk RNA-seq and scRNA-seq
5.2 Appendix for chapter 2: DeepDecon Accurately Estimates Cancer Cell Fractions in Bulk RNA-seq Data
5.2.1 Software comparison and settings
5.2.2 Comparison with and without subject information for RNA-Sieve, CIBERSORTx and NNLS
5.2.3 Preprocessing of single-cell gene expression data
5.2.4 Artificial bulk dataset simulation
5.2.5 Comparison of different normalization methods
5.2.6 Hyperparameter tuning
5.3 Appendix for chapter 3: DeepDeconUQ Estimates Malignant Cell Fraction Prediction Intervals in Bulk RNA-seq Tissue
5.3.1 Preprocessing of single-cell gene expression data
5.3.2 Software comparison and settings
Bibliography
List of Tables
2.1 DeepDecon outperforms other methods in simulated and real AML datasets. The
root mean square errors (RMSE)(%) for the estimated fraction of malignant cells
in leave-one-subject-out cross-validation for DeepDecon, Scaden, Bisque, MEAD,
RNA-Sieve, CIBERSORTx, MuSiC, ESTIMATE, and NNLS on AML datasets. The
boldfaced numbers indicate the best result.
3.1 DeepDeconUQ outperforms other methods in predicting malignant cell type
prediction interval on real AML bulk RNA-seq datasets. Coverage and average
prediction interval length (Lavg) are shown under different significance levels on
three real AML bulk RNA-seq datasets (‘primary’, ‘recurrent’, and ‘BeatAML’).
The total row aggregates all three real datasets.
5.1 The root mean square errors (RMSE)(%) for the estimated fraction of malignant
cells in leave-one-subject-out cross-validation on neuroblastoma datasets for
DeepDecon, Scaden, Bisque, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and
NNLS. The boldfaced numbers indicate the best result.
5.2 The root mean square errors (RMSE)(%) for the estimated fraction of malignant
cells in leave-one-subject-out cross-validation on HNSCC datasets for DeepDecon, Scaden, Bisque, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS.
The boldfaced numbers indicate the best result.
5.3 Preprocessing criteria for each subject in AML and neuroblastoma datasets.
The gene expression threshold is the maximum gene expression value of a cell.
The gene number threshold is the maximum number of expressed genes.
These thresholds exclude expression profiles that do not represent a single cell. The criteria
are based on the Scanpy (v. 1.7.2) functions ‘filter_cells’ and ‘filter_genes’.
5.4 Bulk AML RNA-seq datasets used in DeepDecon
5.5 Hyperparameters are tested for model optimization and final model parameters
are selected to minimize the RMSE between the true and predicted malignant
cell fraction. The metrics are shown as the average values across all 15 AML
datasets (with their standard deviation). All iterative DeepDecon models share
the same structure and parameters.
5.6 Preprocessing criteria for each subject in AML and neuroblastoma datasets.
The gene expression threshold is the maximum gene expression value of a cell.
The gene number threshold is the maximum number of expressed genes.
These thresholds exclude expression profiles that do not represent a single cell. The criteria
are based on the Scanpy (v. 1.7.2) functions ‘filter_cells’ and ‘filter_genes’.
5.7 Bulk AML RNA-seq datasets used in DeepDeconUQ
List of Figures
1.1 Overview of general reference-based cell deconvolution
1.2 An example of over-coverage (A) and under-coverage (B).
2.1 Overview of DeepDecon decomposition method. A: Constructing simulated
bulk RNA-seq samples with different fractions of malignant cells. $p$ is the
fraction of malignant cells in a simulated bulk sample. B: Training DeepDecon
models using simulated bulk datasets with different malignant cell fractions.
Simulated bulk samples whose malignant cell fraction $p \in [0.01i, 0.01j]$, $i, j = 0, 10, \cdots, 100$,
serve as the input to train a DeepDecon model $M_{i,j}$. C: Core
DeepDecon model structure. It consists of four fully connected layers with
dropout layers. All DeepDecon models in the iterative process share the same
structure. D: Predicting the fraction of malignant cells from a real bulk sample
iteratively. DeepDecon uses an iterative strategy to narrow down the
prediction interval of given bulk samples. When a new experimental tissue is
given, DeepDecon first generates an initial malignant cell prediction $\hat{P}$ using
the whole range model $M_{0,100}$. In each iteration step, DeepDecon tries to limit
the estimate to a smaller range, denoted by $[0.01i', 0.01j']$, based on the training
datasets and the previous iteration prediction value. If the prediction interval
can be shortened, DeepDecon will update the prediction value $\hat{P}$ by a newly
selected model $M_{i',j'}$. Ultimately, DeepDecon generates the final prediction $p_{\text{pred}}$
when the stopping conditions are satisfied. The flowchart shows the iterative
procedure and the stopping conditions.
2.2 DeepDecon outperforms other methods in predicting malignant cell
type fractions on AML simulated bulk RNA-seq datasets. A: Scatter plots
of true versus predicted malignant cell fractions based on DeepDecon (D),
Scaden (S), CIBERSORTx (C), Bisque (B), ESTIMATE (E), MuSiC (MU), MEAD
(M), RNA-Sieve (R), and NNLS (N) on three selected AML simulated datasets.
The x-axis is the true fraction and the y-axis is the predicted fraction. The
numbers on each subplot are the root mean square error (RMSE) values between
the true and predicted fraction of each method. B: Boxplots of RMSE values
between the predicted and true fractions of malignant cells on 15 AML simulated
bulk RNA-seq datasets. C: Boxplots of Pearson’s correlation coefficient (r)
values between the predicted and true fractions of malignant cells on 15 AML
simulated bulk RNA-seq datasets. D: Lin’s concordance correlation coefficient
(CCC) values between the predicted and true fractions of malignant cells on
15 AML simulated bulk RNA-seq datasets. The correlation and CCC values of
NNLS contain values that are not available (NAs). Therefore, paired tests of
correlation and CCC values between DeepDecon and NNLS are not available. *
0.01 < p-value ≤ 0.05, ** 0.001 < p-value ≤ 0.01, *** p-value ≤ 0.001.
2.3 The TF-IDF transformation and the iterative strategy improve the
performance of DeepDecon. A: Bar plots of RMSE values of DeepDecon
models with and without TF-IDF transformation. DeepDecon with TF-IDF
transformation achieves the lowest RMSE values in 14 out of 15 simulated AML
datasets. B: Bar plots of RMSE values on DeepDecon models with and without
the iterative strategy. Iterative DeepDecon achieves the lowest RMSE values in
all 15 simulated AML datasets. The x-axis is the simulated AML dataset. The
y-axis is the RMSE value.
2.4 DeepDecon outperforms other deconvolution methods on real AML
RNA-seq datasets. Boxplots of Root mean square error (RMSE) (A), Pearson’s
correlation coefficient (PCC) (B), and Lin’s concordance correlation coefficient
(CCC) (C) values between the predicted and true fractions of malignant cells.
Each bar in the boxplots contains three points corresponding to three real AML
bulk RNA-seq datasets, namely the ‘primary’, ‘recurrent’, and ‘BeatAML’ datasets.
2.5 DeepDecon outperforms other deconvolution methods on the simulated
neuroblastoma datasets. Boxplots of Root mean square error (RMSE) (A),
Pearson’s correlation coefficient (PCC) (B), and Lin’s concordance correlation
coefficient (CCC) (C) values between the predicted and true fractions of
malignant cells on 9 simulated neuroblastoma bulk RNA-seq datasets. Each bar
in the boxplot contains 9 points corresponding to 9 simulated neuroblastoma
bulk RNA-seq datasets. Each simulated neuroblastoma dataset contains bulk
samples constructed from only one subject. The correlation and CCC values
of MEAD, RNA-Sieve, and NNLS contain values that are not available (NAs).
Therefore, paired tests of correlation and CCC values between DeepDecon and
MEAD, RNA-Sieve, and NNLS are not available. * 0.01 < p-value ≤ 0.05, **
0.001 < p-value ≤ 0.01.
2.6 DeepDecon outperforms other deconvolution methods on the simulated
HNSCC dataset. Boxplots of Root mean square error (RMSE) (A), Pearson’s
correlation coefficient (PCC) (B), and Lin’s concordance correlation coefficient
(CCC) (C) values between the predicted and true fractions of malignant cells on
27 simulated HNSCC bulk RNA-seq datasets. Each bar in the boxplot contains
27 points corresponding to 27 simulated HNSCC bulk RNA-seq datasets. Each
simulated HNSCC dataset contains bulk samples constructed from only one
subject. The correlation and CCC values of MEAD, RNA-Sieve, and
NNLS contain values that are not available (NAs). Therefore, paired tests of correlation
and CCC values between DeepDecon and MEAD, RNA-Sieve, and NNLS are
not available. * 0.01 < p-value ≤ 0.05, ** 0.001 < p-value ≤ 0.01, ***
0.0001 < p-value ≤ 0.001, **** p-value ≤ 0.0001.
2.7 DeepDecon is robust to gene expression perturbations. Boxplots of
RMSE values between the true and estimated malignant cell fractions on
simulated AML datasets under different noise levels. We added random noise
generated from a Gaussian distribution with zero mean and variance equal to
α (α = 0.01, 0.05, 0.1) times the gene expression level for each gene in each sample.
We also randomly selected 10% of the genes in each sample and set their
expression values to 0. Each bar contains a total of 15 points, representing 15
separate AML datasets. The color represents the noise level α.
2.8 DeepDecon is robust to the number of cells per bulk sample when the
number of cells in testing data is above 3000. The x-axis is the trained
DeepDecon model. The subscript is the number of cells per bulk sample.
$\text{DeepDecon}_N$ denotes a DeepDecon model trained on a dataset in which one bulk
sample consists of $N$ single cells. The y-axis is the RMSE value between the
true and estimated malignant cell fractions. The color represents the number of
single cells per bulk sample in the testing data.
3.1 Overview of DeepDeconUQ. A: Constructing simulated bulk RNA-seq samples
with different fractions of malignant cells. p is the fraction of malignant cells
in a simulated bulk sample. B: Model structure used to train DeepDeconUQ. It
consists of four fully connected layers with dropout layers. Seventy percent of
the simulated data are used for training. The output is two quantile functions at a
given significance level α. C: Conformity scores are calculated on the remaining
30% of the simulated dataset. D: Estimating the prediction interval of malignant
cells from a real bulk sample. The trained model is used to calculate the lower
and upper bounds, and the conformity scores are used to adjust the quantiles,
which finally outputs the prediction interval $\{\hat{p}_{\alpha_{lo}}, \hat{p}_{\alpha_{hi}}\}$.
3.2 DeepDeconUQ outperforms other methods in predicting malignant
cell type prediction interval on AML simulated bulk RNA-seq datasets.
Boxplots of coverage (A) and average prediction interval length (B) on 15 AML
simulated bulk RNA-seq datasets. Coverage is defined as the proportion of
instances in which the true fraction of malignant cells falls within the prediction
interval for the testing dataset. The average length represents the mean length
of the prediction intervals across the testing datasets. Each bar in the boxplot
comprises 15 data points, each corresponding to one of 15 simulated AML
datasets. Significance levels are indicated with different colors.
3.3 DeepDeconUQ is robust to gene expression perturbations. Boxplots
of coverage and average prediction interval length on 15 AML simulated
bulk RNA-seq datasets under different noise levels. We added random noise
generated from a Gaussian distribution with zero mean and variance that equals
λ(λ = 0.01, 0.05, 0.1) times the gene expression level for each gene in each
sample. Each bar contains a total of 15 points, representing 15 separate AML
datasets. The color represents the noise level λ. Significance level
α = 0.1.
5.1 DeepDecon outperformed other methods for predicting malignant cell
type fractions based on 15 artificial AML bulk RNA-seq datasets. Scatter
plots of true versus predicted malignant cell type fractions based on DeepDecon,
Scaden, Bisque, MEAD, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS
for all 15 AML datasets. The x-axis is the true fraction and the y-axis is the
predicted fraction. Leave-one-out cross-validation was used, with one
dataset serving as the testing dataset and the remaining 14 datasets as the training
dataset. The number on each subplot is the RMSE value between the true and
predicted fraction for each method. D: DeepDecon, S: Scaden, B: Bisque, M:
MEAD, R: RNA-Sieve, MU: MuSiC, C: CIBERSORTx, E: ESTIMATE, and N: NNLS.
5.2 Boxplots of Root Mean Square Error (RMSE), Pearson correlation
coefficient (PCC), and Lin’s concordance correlation coefficient (CCC) of
DeepDecon under different normalization methods on simulated AML
datasets. A: Boxplots of RMSE values between the predicted and true fractions
of malignant cells on simulated bulk RNA-seq datasets of different normalization
methods. B: Boxplots of Pearson’s correlation coefficient values between the
predicted and true fractions of malignant cells on simulated bulk RNA-seq
datasets of different normalization methods. C: Boxplots of Lin’s concordance
correlation coefficient (CCC) values between the predicted and true fractions of
malignant cells on simulated bulk RNA-seq datasets of different normalization
methods.
5.3 Scatter plots of malignant cell type fractions estimated from DeepDecon
with and without iteration. The non-iterative model was trained on simulated
AML samples with malignant cell fractions in 0 ≤ p ≤ 1. The x-axis is the
true fraction and the y-axis is the predicted fraction. Leave-one-subject-out
cross-validation was used, with one subject serving as the testing dataset and the remaining
14 datasets from other subjects serving as training datasets. The numbers on each subplot
are the RMSE values between the true and predicted fractions for each method.
DeepDecon without iteration has poor prediction accuracy when the malignant
cell fraction is close to 0 or 1, and its predictions tend to follow an S-shape. It also has higher
RMSE values than iterative DeepDecon. The results indicate that the iterative
approach outperforms the non-iterative approach. N: Non-iterative model, I:
Iterative model.
5.4 UMAP projection of 15 scRNA-seq AML datasets reveals heterogeneity
across different datasets. AML1012-D0, AML475-D0, AML916-D0, and
AML707B-D0 were far away from the other datasets, indicating distinct
expression patterns.
5.5 DeepDecon outperformed other deconvolution methods based on real
primary AML RNA-seq expression data. Scatter plots of malignant cell
fractions estimated from DeepDecon, Scaden, Bisque, MEAD, RNA-Sieve,
MuSiC, CIBERSORTx, ESTIMATE, and NNLS on real primary AML tissues. The
x-axis is the true fraction and the y-axis is the predicted fraction. The number
on each subplot is the RMSE value between the true and predicted malignant
cell fraction for each method.
5.6 DeepDecon outperformed other deconvolution methods based on real
recurrent AML RNA-seq expression data. Scatter plots of malignant cell
fractions estimated from DeepDecon, Scaden, Bisque, MEAD, RNA-Sieve, MuSiC,
CIBERSORTx, ESTIMATE, and NNLS on real recurrent AML tissues. The x-axis
is the true fraction and the y-axis is the predicted fraction. The number on
each subplot is the RMSE value between the true and predicted malignant cell
fraction for each method.
5.7 DeepDecon outperformed other deconvolution methods based on real
BeatAML RNA-seq expression data. Scatter plots of malignant cell fractions
estimated from DeepDecon, Scaden, Bisque, MEAD, RNA-Sieve, MuSiC,
CIBERSORTx, ESTIMATE, and NNLS on real BeatAML dataset. The x-axis is
the true fraction, and the y-axis is the predicted fraction. The numbers on each
subplot are the RMSE values between the true and predicted malignant cell
fraction of each method.
5.8 DeepDecon outperforms CIBERSORTx, RNA-Sieve, and NNLS when
subject information is used. DeepDecon is robust to gene expression profiles
even when subject information is given in CIBERSORTx, RNA-Sieve, and
NNLS on real AML (A), neuroblastoma (B), and HNSCC (C) datasets. In mode
‘Aggregate’, single cell reference is constructed by combining the single cells
from different subjects together without considering the subject information. In
the ‘Separate’ mode, single cell reference is constructed separately across each
subject and the final result is the average of results under each patient-specific
reference. * 0.01 < p-value ≤ 0.05, ** 0.001 < p-value ≤ 0.01, ***
p-value ≤ 0.001.
5.9 DeepDecon is robust to gene expression perturbations. Boxplots of
correlation values between the true and estimated malignant cell fractions on
simulated AML datasets under different noise levels. We added random noise
generated from a Gaussian distribution with zero mean and variance equal to
α (α = 0.01, 0.05, 0.1) times the gene expression level for each gene in each sample.
We also randomly selected 10% of the genes in each sample and set their
expression values to 0. The color represents the noise level α.
5.10 DeepDecon is robust to gene expression perturbations. Boxplots of
CCC values between the true and estimated malignant cell fractions on
simulated AML datasets under different noise levels. We added random noise
generated from a Gaussian distribution with zero mean and variance equal to
α (α = 0.01, 0.05, 0.1) times the gene expression level for each gene in each sample.
We also randomly selected 10% of the genes in each sample and set their
expression values to 0. The color represents the noise level α.
5.11 Barplots of the numbers of malignant and normal cells in each scRNA-seq AML subject. Subjects with at least 100 malignant and 100 normal cells
were selected for this study.
5.12 DeepDeconUQ is robust to gene expression perturbations. Boxplots
of coverage and average prediction interval length on 15 AML simulated
bulk RNA-seq datasets under different noise levels. We added random noise
generated from a Gaussian distribution with zero mean and variance that equals
λ(λ = 0.01, 0.05, 0.1) times the gene expression level for each gene in each
sample. Each bar contains a total of 15 points, representing 15 separate AML
datasets. The color represents the noise level λ. Significance level
α = 0.15.
5.13 DeepDeconUQ is robust to gene expression perturbations. Boxplots
of coverage and average prediction interval length on 15 AML simulated
bulk RNA-seq datasets under different noise levels. We added random noise
generated from a Gaussian distribution with zero mean and variance that equals
λ(λ = 0.01, 0.05, 0.1) times the gene expression level for each gene in each
sample. Each bar contains a total of 15 points, representing 15 separate AML
datasets. The color represents the noise level λ. Significance level
α = 0.05.
Abstract
Accurately estimating the fractions of malignant cells in cancer tissues is vital for effective diagnosis, prognosis, and personalized treatment planning. Bulk RNA sequencing (RNA-seq) offers an
aggregate profile of gene expression across entire tissue samples, yet lacks the resolution required
to discern cellular heterogeneity within tumors. While single-cell RNA sequencing (scRNA-seq)
enables precise assessment of malignant cell fractions, its high cost and labor intensity make
it impractical for routine clinical use. This limitation constrains our ability to reliably estimate
malignant cell proportions, a crucial aspect of understanding tumor dynamics and therapeutic
responsiveness. To address these challenges, this dissertation introduces DeepDecon, a deep
learning-based model designed for precise estimation of cancer cell fractions in bulk RNA-seq
samples. DeepDecon leverages scRNA-seq data to simulate bulk profiles, enabling it to predict
cell fractions by training models on a comprehensive dataset. It provides a refining strategy
that the cancer cell fraction is iteratively estimated by a set of trained models. Further enhancing this approach, this dissertation also presents DeepDeconUQ, a deep neural network model
developed to estimate prediction intervals for malignant cell fractions based on bulk RNA-seq
data. DeepDeconUQ utilizes conformalized quantile regression to generate prediction intervals,
providing statistically valid and narrow prediction intervals that add robustness to predictions under variable gene expression conditions. Together, DeepDecon and DeepDeconUQ
offer a scalable, reliable framework for malignant cell deconvolution, advancing the precision of
cancer tissue analysis and supporting improved clinical decision-making.
Chapter 1
Introduction
Tumors are not merely homogeneous masses; they represent complex ecosystems composed of
diverse cell populations, including malignant, stromal, and immune cells within the tumor microenvironment (TME) [1, 2]. The malignant cell population within tumors exhibits significant
heterogeneity, characterized by the presence of distinct subclones with unique genotypic and phenotypic attributes, such as gene expression patterns, metabolic activity, and rates of proliferation
[1, 3, 4]. This cellular heterogeneity plays a critical role in determining tumor dynamics, invasive
capabilities, and responses to therapeutic interventions [2, 5, 6]. Notably, certain subclonal populations within tumors are frequently associated with the emergence of resistance to standard
therapies, leading to treatment failure and disease recurrence [7]. For instance, clonal diversity
has been implicated in the evolution of drug resistance, as specific subpopulations may survive
treatment and subsequently drive tumor repopulation [8]. Furthermore, the varying proportions
of malignant cells significantly influence the efficacy of targeted therapies and immunotherapies, as distinct subclones may evade immune recognition or exhibit different sensitivities to
treatment [6, 9]. Therefore, quantifying the proportion of malignant cells, or more broadly, performing cellular deconvolution within the TME, is crucial for precise cancer characterization and
for guiding therapeutic decision-making. By dissecting these cellular dynamics, researchers can
obtain critical insights into tumor progression, enabling the refinement of treatment strategies
and ultimately improving patient outcomes through personalized therapeutic interventions.
With the advent of next-generation sequencing (NGS) technologies, RNA sequencing (RNA-seq) has become a critical tool for exploring transcriptional features and kinetics in tissues and
organisms. Specifically, it allows researchers to detect both known and novel features in a single
assay, enabling the identification of transcript isoforms, gene fusions, single nucleotide variants,
and other features without the limitation of prior knowledge [10, 11, 12, 13]. Bulk RNA-seq provides a view of average gene expression profiles within a whole organ or tissue, reflecting the
cumulative gene expressions of different cell types in proportion to their prevalence [14]. However, bulk RNA-seq lacks the resolution to capture the variation among different cell types. In
complex tissues containing multiple heterogeneous cell types, bulk RNA-seq measures the average gene expression across all cells, which can obscure rare cell types and confound cell-specific
transcriptional signals due to overlapping gene expression profiles among cell types [15]. For example, signals from a rare cell type may be diluted by the more abundant types in a bulk RNA-seq
experiment [16]. Technologies like laser capture microdissection (LCM) [17] and fluorescence-activated cell sorting (FACS) [18] have therefore been developed to isolate individual cells, further
extending to single-cell RNA sequencing (scRNA-seq). In recent years, scRNA-seq has advanced
rapidly, enabling the characterization of individual cells, the discovery of rare or novel cell types,
and a deeper understanding of complex biological processes.
Single-cell experiments have become more affordable, making scRNA-seq accessible for large
cohorts comprising hundreds of samples. However, it still imposes a substantial financial burden
on researchers [16] (see cost analysis in Appendix 5.1.1). Bulk RNA-seq remains a preferred
method for profiling transcriptomic variations under different conditions. The ability to perform cellular deconvolution with bulk RNA-seq offers two major advantages. First, it allows
researchers to obtain cell-type-specific information from bulk data, providing some benefits of
single-cell analysis at a lower cost. Second, it enables researchers to extract cell-type insights from
extensive bulk RNA-seq datasets that are already available in research labs and public repositories, such as The Cancer Genome Atlas (TCGA) [19], Gene Expression Omnibus (GEO) [20], and
ArrayExpress [21]. These repositories hold data from billions of dollars worth of experiments, and
cellular deconvolution allows for new discoveries without repeating these costly experiments.
Given these limitations, integrating bulk RNA-seq with scRNA-seq data offers a promising
avenue for estimating malignant cell fractions within cancer tissues. While bulk RNA-seq provides an average expression profile across all cell types, scRNA-seq allows for the resolution of
individual cells, capturing cellular heterogeneity. By leveraging scRNA-seq data from samples
with similar microenvironmental contexts, it is possible to simulate bulk RNA-seq profiles and
address the challenges of cellular deconvolution.
This dissertation focuses on the estimation of malignant cell fractions in cancer tissues using
scRNA-seq. Specifically, we will first introduce how to predict the malignant cell fractions in
cancer tissues accurately. Then, we will move further and focus on how to predict the malignant
cell fractions in cancer tissues confidently.
1.1 Malignant cell fractions estimation using scRNA-seq and
deep neural networks
Most reference-based cell deconvolution methods model bulk data on a linear scale and can be
expressed by the following equation:
$$y_i = X p_i + \epsilon_i, \tag{1.1}$$
where $y_i \in \mathbb{R}^{P}$ represents the bulk RNA-seq gene expression vector with $P > 0$ features (genes)
from the $i$th individual, and $X \in \mathbb{R}^{P \times K}$ is the matrix of cell-type-specific gene expression profiles (GEPs) with $P$
genes and $K$ cell types. $p_i = (p_{i1}, p_{i2}, \ldots, p_{iK}) \in [0, 1]^{K}$ represents the proportions of the $K$ cell types
in the $i$th bulk RNA-seq gene expression, with $\sum_{k=1}^{K} p_{ik} = 1$. $\epsilon_i = (\epsilon_{i1}, \epsilon_{i2}, \ldots, \epsilon_{iP})$ is a vector of
random variables with mean zero.
Figure 1.1 also illustrates the general workflow of reference-based cell deconvolution. Our
objective is to estimate $p_i$, the cell-type proportions. There are no direct measurements of the
cell-type-specific gene expression profiles X. Deconvolution methods rely on scRNA-seq data
from reference samples—individuals with known gene expression profiles—to approximate X.
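As a minimal numerical sketch of the linear model in Equation 1.1, the following example simulates one bulk sample from a hypothetical GEP matrix and recovers the proportions. The matrix sizes, the gamma-distributed profiles, the noise level, and the use of unconstrained least squares followed by clipping and renormalization are all illustrative assumptions; the methods discussed in this dissertation use constrained or learned estimators instead.

```python
import numpy as np

rng = np.random.default_rng(0)

P, K = 200, 3                                        # genes, cell types (illustrative)
X = rng.gamma(shape=2.0, scale=50.0, size=(P, K))    # hypothetical GEP matrix
p_true = np.array([0.6, 0.3, 0.1])                   # proportions, sum to one
eps = rng.normal(loc=0.0, scale=1.0, size=P)         # zero-mean noise

# Forward model of Equation 1.1: bulk expression of one sample.
y = X @ p_true + eps

# Unconstrained least squares, used here only as a simple stand-in for
# the constrained solvers of real deconvolution methods.
p_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Map the solution to valid proportions: clip negatives, then renormalize.
p_hat = np.clip(p_ls, 0.0, None)
p_hat = p_hat / p_hat.sum()

print(np.round(p_hat, 3))
```

With an accurate GEP matrix and little noise, the estimate is close to the true proportions; the challenges described next arise precisely when these idealized conditions fail.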
Although cell deconvolution may appear straightforward due to its reliance on linear models,
several challenges complicate the statistical inference process in real-world applications. First, the
accuracy of deconvolution heavily depends on the selection of optimal gene expression profiles
(GEPs) [22, 23]. Substantial discrepancies between the true and approximated X can arise from
both technical and biological variability. Second, specific data processing platforms and preprocessing methods can further complicate cell-type inference [24]. Gene expression platforms often
introduce cross-platform and gene-specific biases, while coregulated genes may exhibit different
scales and noise levels, making the deconvolution task even more complex [25].
Figure 1.1: Overview of general reference-based cell deconvolution. [Schematic: bulk RNA-seq = cell-type-specific gene expression profile × proportion + bias; the scRNA-seq reference is shown as cells-by-genes matrices for subjects 1 to n.]
While GEP-based approaches underpin modern cell deconvolution algorithms, we hypothesize that deep neural networks (DNNs) could offer an advantage by learning optimal features for
deconvolution without relying on explicitly defined GEPs. DNNs, such as multilayer perceptrons,
are universal function approximators with state-of-the-art performance in classification and regression tasks [26, 27]. Although these advantages are minimal for strictly linear input data,
DNNs excel over linear regression models when data deviate from ideal linearity, as they can
capture complex patterns. For example, when input data are noisy or biased, the hidden layers
of a DNN can learn higher-order latent representations of cell types, making them more robust
to input noise and measurement bias. We theorize that with gene expression data as input, the
DNN’s hidden layers can form robust, higher-order representations of cell types, effectively bypassing the limitations of noise and technical bias. While ground-truth cell composition data for
bulk RNA-seq is limited, scRNA-seq data can generate large artificial bulk RNA-seq datasets with
predefined cell compositions, addressing the extensive training data demands of machine learning models. This approach, combining scRNA-seq and deep neural networks, shows promise for
tackling the cell deconvolution problem.
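The idea of generating artificial bulk samples with predefined compositions can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation procedure of DeepDecon: the function name `simulate_bulk`, the toy Poisson counts, sampling cells with replacement, summing raw counts, and drawing target fractions uniformly are all hypothetical choices.

```python
import numpy as np

def simulate_bulk(counts, is_malignant, fraction, n_cells, rng):
    """Sum randomly drawn cells so that `fraction` of them are malignant.

    counts       : (cells, genes) expression matrix
    is_malignant : boolean vector with one label per cell
    Returns the simulated bulk profile and the realized malignant fraction.
    """
    n_mal = int(round(fraction * n_cells))
    n_norm = n_cells - n_mal
    mal_idx = np.flatnonzero(is_malignant)
    norm_idx = np.flatnonzero(~is_malignant)
    # Sampling with replacement lets n_cells exceed the reference size.
    chosen = np.concatenate([
        rng.choice(mal_idx, size=n_mal, replace=True),
        rng.choice(norm_idx, size=n_norm, replace=True),
    ])
    bulk = counts[chosen].sum(axis=0)
    return bulk, n_mal / n_cells

rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(300, 50))   # toy data, not real scRNA-seq
is_malignant = np.arange(300) < 120             # toy labels
fractions = rng.uniform(0.0, 1.0, size=5)       # predefined target fractions
dataset = [simulate_bulk(counts, is_malignant, f, 100, rng) for f in fractions]
```

Each element of `dataset` pairs a simulated bulk profile with its known malignant fraction, which is the kind of labeled example a supervised model needs.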
Based on the aforementioned challenges and motivations, during my doctoral study, we developed DeepDecon [28], a deep neural network model that leverages single-cell gene expression
data to accurately estimate malignant cell fractions in bulk tissues. Trained on single-cell RNA
sequencing data, DeepDecon is robust to gene expression noise and variation in the number of
cells sampled in bulk tissues. It can iteratively and accurately predict malignant cell fractions,
achieving higher accuracy than traditional methods and offering a powerful, reliable tool for
cancer cell fraction estimation.
1.2 Uncertainty quantification when estimating malignant
cell fractions
In recent years, driven by advances in machine learning and deep learning, extensive research
has focused on developing high-performance regression models [29, 30]. However, this increased
predictive accuracy introduces a significant yet often overlooked challenge: while these models
frequently appear precise, they can occasionally produce predictions with substantial errors. This
issue arises from their reliance on point predictions—single scalar estimates designed to capture a
conditional measure of central tendency [31]. For example, when a regression model minimizes a
squared error loss function, its point predictions approximate the conditional mean of the target
variable. In contrast, minimizing the absolute error estimates the conditional median. Consequently, relying solely on point predictions, particularly in high-stakes domains, is problematic
because they fail to account for inherent uncertainty. This limitation is evident when identical
point predictions correspond to varying levels of uncertainty, such as two Gaussian distributions
with the same mean but different standard deviations, underscoring the need for models that
effectively quantify predictive uncertainty [32].
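The relationship between the loss function and the quantity a point prediction estimates can be checked on a small example. The sketch below uses hypothetical values and a simple grid search over constant predictions; it only illustrates that the squared error is minimized at the sample mean and the absolute error at the sample median.

```python
import statistics

# A small right-skewed sample (hypothetical values).
y = [0.1, 0.2, 0.2, 0.3, 0.9]

def mean_squared_error(c):
    return sum((v - c) ** 2 for v in y) / len(y)

def mean_absolute_error(c):
    return sum(abs(v - c) for v in y) / len(y)

# Grid search over constant predictions c in [0, 1].
grid = [i / 1000 for i in range(1001)]
best_sq = min(grid, key=mean_squared_error)
best_abs = min(grid, key=mean_absolute_error)

print(best_sq, statistics.mean(y))      # squared error: minimizer vs. mean
print(best_abs, statistics.median(y))   # absolute error: minimizer vs. median
```

Neither minimizer says anything about how spread out the sample is, which is the gap that prediction intervals are meant to fill.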
As a result, prediction intervals have become a preferred tool due to their ability to quantify
uncertainty [31]. Prediction intervals provide upper and lower bounds, defining the expected
range within which the target variable will likely fall for a given significance level (e.g., 10%).
Simply put, the width of a prediction interval helps determine the level of confidence we should
place in the model’s predictions. When the model is uncertain, it reflects this by producing a
wider interval; when the model is more confident, the interval is narrower.
To address the limitations of point predictions, various fields have adopted prediction intervals, leading to improved decision-making [33, 34, 35]. An ideal procedure for constructing
prediction intervals should satisfy two key properties:
• Validity [36]: This refers to the alignment between the desired and observed coverage.
A well-calibrated model ensures that the specified significance level α is achieved, with
approximately 1 − α of actual observations falling within the interval. It should provide
valid coverage even for finite samples without relying on strong distributional assumptions
like normality.
• Discrimination [37]: This pertains to the width of the prediction intervals. The intervals should be as narrow as possible in regions where the model is confident, making the
predictions more informative.
A good prediction interval strikes a balance between validity and discrimination. Overly narrow intervals may lead to under-coverage (failing to include the true value), while excessively
wide intervals diminish practical usefulness (over-coverage). Figure 1.2 illustrates examples of
under-coverage and over-coverage. Additionally, an often overlooked but essential requirement
is that prediction intervals should handle heteroscedasticity. Heteroscedasticity refers to the situation where error variance varies across the covariate space [31, 38]. Prediction intervals that
account for heteroscedasticity adapt to local uncertainties by adjusting their width according to
the variability at each point in the predictor space. In heteroscedastic data, achieving valid but
narrow prediction intervals necessitates tailoring the interval lengths to reflect the local variability of the input features.
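Validity and discrimination are commonly summarized by the empirical coverage and the average interval length. The following sketch computes both for two hypothetical sets of intervals; the function names and all numbers are illustrative assumptions.

```python
def coverage(y_true, lower, upper):
    """Fraction of observations that fall inside their interval."""
    hits = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(hits) / len(hits)

def average_length(lower, upper):
    """Mean interval width; smaller is more informative."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Hypothetical malignant fractions and two candidate sets of intervals.
y_true = [0.10, 0.35, 0.50, 0.80]
narrow = ([0.12, 0.30, 0.52, 0.78], [0.18, 0.40, 0.58, 0.82])
wide = ([0.00, 0.00, 0.00, 0.00], [1.00, 1.00, 1.00, 1.00])

for name, (lo, hi) in {"narrow": narrow, "wide": wide}.items():
    print(name, coverage(y_true, lo, hi), average_length(lo, hi))
```

The narrow intervals are short but miss half of the observations, whereas the wide intervals cover everything and carry no information; a useful procedure sits between these extremes.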
Quantiles provide an attractive framework for representing prediction intervals. They enable
the modeling of complex distributions without imposing strict parametric assumptions. Additionally, quantiles are directly interpretable, maintaining consistent monotonicity with the target
distribution, and they facilitate the straightforward construction of prediction intervals. Learning
the quantile for a single quantile level is a well-studied problem in quantile regression (QR) [36].
Unlike traditional regression methods that focus solely on point estimates of the conditional distribution, quantile regression provides a more comprehensive perspective by estimating multiple
quantiles, thereby characterizing the entire conditional distribution of the response variable.
Figure 1.2: An example of over-coverage (A) and under-coverage (B).
Technically, quantile regression aims to estimate conditional quantiles of the response variable as a function of predictor variables. This is achieved by minimizing the pinball loss function,
a specialized loss function that penalizes deviations from the predicted quantiles. Specifically,
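given a target value $y$, a prediction value $\hat{y}$, and quantile (significance) level $\alpha \in (0, 1)$, the pinball loss $\rho_\alpha$ is defined as:
$$\rho_\alpha(y, \hat{y}) = \begin{cases} \alpha (y - \hat{y}) & \text{if } y - \hat{y} > 0, \\ (1 - \alpha)(\hat{y} - y) & \text{otherwise.} \end{cases} \tag{1.2}$$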
This formulation of the pinball loss function allows for differential weighting of over- and
under-predictions based on the specified quantile level α. By minimizing this loss function, quantile regression effectively estimates the desired conditional quantile, providing a flexible and interpretable approach to uncertainty quantification.
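The pinball loss of Equation 1.2 is short enough to write out directly. The sketch below is an illustration on a toy sample (hypothetical values): it searches for the constant prediction with the smallest mean pinball loss, which should approximate the empirical quantile at level α.

```python
def pinball_loss(y, y_hat, alpha):
    """Pinball loss of Equation 1.2 for a single observation."""
    diff = y - y_hat
    return alpha * diff if diff > 0 else (1 - alpha) * (-diff)

def mean_pinball(ys, y_hat, alpha):
    return sum(pinball_loss(y, y_hat, alpha) for y in ys) / len(ys)

# Toy sample: 0.00, 0.01, ..., 1.00 (hypothetical values).
ys = [i / 100 for i in range(101)]

# The constant prediction minimizing the mean pinball loss approximates
# the empirical alpha-quantile of the sample.
for alpha in (0.05, 0.5, 0.95):
    best = min(ys, key=lambda c: mean_pinball(ys, c, alpha))
    print(alpha, best)
```

A quantile regression model replaces the constant prediction with a function of the predictors and minimizes the same loss over its parameters.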
Conformal regression provides a robust alternative for uncertainty quantification by constructing prediction intervals that offer guaranteed coverage [39]. Unlike traditional regression
methods, which typically yield a single point estimate, conformal regression generates an interval
around the predicted value, ensuring that the true response lies within this interval with a specified confidence level (e.g., 90%). This probabilistic guarantee distinguishes conformal regression
from conventional approaches, as it explicitly controls the coverage probability of the prediction
interval. Conformal regression achieves this by utilizing nonconformity scores, which measure
how far an observed response deviates from the model's prediction, for example as an absolute
residual, and which are computed on calibration data held out from model fitting [36]. The nonconformity
scores are ranked to determine a threshold corresponding to the desired significance level, allowing the construction of prediction intervals around the point estimate. One of the primary
advantages of conformal regression is its distribution-free nature, meaning it does not rely on
any parametric assumptions about the underlying data distribution, thus providing robust performance across a variety of scenarios. However, there is an inherent trade-off in conformal
regression: achieving higher coverage levels necessitates wider prediction intervals, while lower
coverage levels yield narrower intervals. This trade-off requires careful consideration, allowing
researchers to adjust the balance between coverage probability and interval width according to
the specific requirements of their analysis. Consequently, conformal regression offers a flexible
and reliable framework for uncertainty quantification, making it a valuable tool in predictive
modeling.
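A minimal sketch of split conformal regression, assuming absolute residuals as nonconformity scores, is given below. The stand-in predictor `predict`, the toy data generator `sample`, and the sample sizes are hypothetical; the threshold uses the usual finite-sample rank ⌈(n + 1)(1 − α)⌉ among the n calibration scores.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))   # rank of the threshold score
    if k > n:
        return float("inf")                # too few calibration points
    return sorted(scores)[k - 1]

random.seed(0)

def predict(x):
    # Stand-in point predictor; any fitted regression model could be used.
    return 0.5 * x

def sample(n):
    xs = [random.random() for _ in range(n)]
    ys = [0.5 * x + random.gauss(0.0, 0.05) for x in xs]
    return xs, ys

# Calibration: absolute residuals serve as nonconformity scores.
x_cal, y_cal = sample(500)
scores = [abs(y - predict(x)) for x, y in zip(x_cal, y_cal)]
q = conformal_quantile(scores, alpha=0.1)

# Prediction intervals for new points: point estimate plus/minus q.
x_new, y_new = sample(2000)
covered = [predict(x) - q <= y <= predict(x) + q for x, y in zip(x_new, y_new)]
print(sum(covered) / len(covered))         # empirical coverage on new data
```

Because the same margin `q` is added everywhere, these intervals have constant width and do not adapt to heteroscedastic noise.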
Motivated by these challenges, during my doctoral research, we developed DeepDeconUQ,
a deep neural network model designed to estimate prediction intervals for malignant cell fractions based on bulk RNA-seq data. DeepDeconUQ extends the capabilities of DeepDecon by
incorporating uncertainty quantification into cancer cell fraction predictions. The model leverages single-cell RNA sequencing (scRNA-seq) data in combination with conformalized quantile
regression to generate reliable prediction intervals. It trains a quantile regression neural network to establish upper and lower bounds for cancer cell proportions, followed by a calibration
step that refines these intervals to ensure both statistical validity (coverage probability) and discrimination (narrow intervals). This approach addresses the limitations of existing methods in
estimating malignant cell fractions, offering a robust solution that integrates uncertainty quantification for more reliable predictions.
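The calibration step of conformalized quantile regression can be sketched in a few lines. This is an illustration under stated assumptions, not the DeepDeconUQ implementation: the functions `q_lo` and `q_hi` are hand-written stand-ins for a trained quantile regression network and are deliberately too narrow, and the heteroscedastic toy data are hypothetical. The conformity score and the rank of the correction follow the standard conformalized quantile regression recipe.

```python
import math
import random

random.seed(1)

def noise_sd(x):
    # Heteroscedastic toy data: noise grows with x (an assumption).
    return 0.02 + 0.10 * x

def sample(n):
    xs = [random.random() for _ in range(n)]
    ys = [0.5 * x + random.gauss(0.0, noise_sd(x)) for x in xs]
    return xs, ys

def q_lo(x):
    # Stand-in lower quantile estimate (one standard deviation below).
    return 0.5 * x - noise_sd(x)

def q_hi(x):
    # Stand-in upper quantile estimate (one standard deviation above).
    return 0.5 * x + noise_sd(x)

def coverage(xs, ys, margin):
    inside = [q_lo(x) - margin <= y <= q_hi(x) + margin for x, y in zip(xs, ys)]
    return sum(inside) / len(inside)

alpha = 0.1
x_cal, y_cal = sample(500)

# Conformity score: signed distance by which y falls outside the interval.
scores = sorted(max(q_lo(x) - y, y - q_hi(x)) for x, y in zip(x_cal, y_cal))
k = math.ceil((len(scores) + 1) * (1 - alpha))
margin = scores[k - 1]   # k <= n here because n = 500 and alpha = 0.1

x_new, y_new = sample(2000)
raw_cov = coverage(x_new, y_new, 0.0)      # uncalibrated quantile intervals
cqr_cov = coverage(x_new, y_new, margin)   # after conformal calibration
print(raw_cov, cqr_cov)
```

The calibrated intervals keep the input-dependent width of the quantile estimates while the added margin restores the target coverage.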
1.3 Existing cell deconvolution methods
Cell-type deconvolution methods are generally classified into two main categories: reference-based and reference-free approaches. Reference-free methods estimate cell type compositions
solely from the bulk tissue or mixture data, without relying on external reference profiles. However, these methods often suffer from lower accuracy and are challenging to interpret, as they
do not provide clear information about the estimated cell-type components. This dissertation
focuses primarily on reference-based deconvolution methods, which rely on known reference
profiles to improve accuracy and interpretability.
The core of reference-based deconvolution methods is matrix decomposition, where the bulk
gene expression matrix is decomposed into a product of two matrices: one representing the cell-type-specific gene expression profiles (GEPs) and the other representing the cell-type composition matrix. The GEP matrix, typically structured as genes by cell types, captures the unique gene
expression characteristics of each cell type within a given tissue. The following sections provide
a detailed discussion of some commonly used reference-based deconvolution methods.
• MuSiC [14]: It is a reference-based method that allows users to estimate the cell type
proportions using transcriptomic data. The method requires the following input from users:
(i) the bulk gene expression matrix (genes by samples) and (ii) the gene expression profiles.
Alternatively, users can provide single-cell RNA sequencing data from known cell types
instead of predefined GEPs. In this case, the gene expression profiles are derived from the
single-cell data. Specifically, for each gene, the expression value for a cell type is calculated
as the average expression level across all cells of that type, normalized by the number of
cells and the total mRNA content. Once the gene expression profiles are established, the
method deconvolves the bulk data using a Weighted Non-Negative Least Squares (W-NNLS)
approach. This algorithm extends the standard constrained NNLS, which solves ordinary
least squares (OLS) with non-negativity and sum-to-one constraints on the coefficients, by
assigning a specific weight to each gene in the optimization process. Higher weights are
given to genes with lower cross-subject variance, emphasizing genes
that exhibit consistent expression across subjects. The final output of the method is the
estimated cell-type proportions for each sample, represented as a matrix of cell types by
samples. Prioritizing genes with stable expression in this way
enhances the accuracy and reliability of the deconvolution results.
• Bisque [40]: Bisque offers two deconvolution models for analyzing bulk gene expression
data: a reference-based model and a marker-based model. The reference-based model requires two main inputs: (i) the bulk gene expression matrix (genes by samples) and (ii) a
reference single-cell RNA-seq (scRNA-seq) data matrix (genes by cells). The deconvolution
process begins by constructing a signature matrix from the reference scRNA-seq data. The
signature matrix captures the cell-type-specific gene expression profiles, serving as a reference for subsequent analysis. To address potential distributional discrepancies between
bulk data and single-cell-based pseudo-bulk data (summed single-cell counts), the method
applies a transformation to the input bulk data. Although bulk gene expression and pseudo-bulk data are typically highly correlated, they may differ in their distributions. Thus, this
transformation step ensures better alignment between the bulk data and the pseudo-bulk
reference. After obtaining the signature matrix and transforming the bulk data, the method
employs constrained least squares regression to deconvolve the bulk expression data. The
deconvolution is performed under non-negativity and sum-to-one constraints on the estimated proportions, ensuring that the resulting cell-type proportions are interpretable and
biologically meaningful.
• CIBERSORT/CIBERSORTx [41, 42]: They are two widely used deconvolution methods.
The input for both methods includes (i) the bulk gene expression matrix and (ii) signature of
the known cell types. The bulk transcriptome data is a standard matrix of genes by samples.
The signature of cell types is a matrix of genes (markers) by cell types. CIBERSORT adopts
a linear support vector regression (SVR) approach, representing the gene expression of a
bulk sample as a weighted sum of gene expressions from different cell types. These weights
are determined based on predefined GEPs. On the other hand, CIBERSORTx is an enhanced
version of CIBERSORT that enables the generation of GEPs from scRNA-seq data.
• RNA-Sieve [43]: RNA-Sieve is a reference-based method and requires two inputs: (i) bulk
expression data matrix (genes by samples), and (ii) single-cell expression data (genes by
cells) of known cell types. The method proposes a customized maximum likelihood estimation procedure to estimate the cell type proportions in each sample. Given the initial
values, the algorithm then alternately estimates and updates GEP, cell type proportions, and
the number of single cells in the bulk RNA-seq sample by maximizing the log-likelihood
function of the parameters. The method repeats this process until the
log-likelihood converges. Finally, the method returns the estimated
cell type proportions, cell type expression matrix, and number of cells in each sample in the
bulk data. Additionally, prediction intervals of the estimated proportions can be calculated
as a by-product of the estimation procedure due to the likelihood estimation.
• MEAD [44]: MEAD is another statistical inference approach with two inputs: (i) bulk expression data matrix (genes by samples) and (ii) scRNA-seq reference data (genes by cells). It
incorporates a gene-gene dependency structure to improve the accuracy of cell proportion
estimates. The estimated proportions, which are constrained to be non-negative, are
shown to be asymptotically normal, and their covariance is estimated through a sandwich-type estimator with an estimated gene-gene dependence set. Therefore, in addition to
point estimates of the cell type proportions, MEAD can output a prediction interval for
each bulk sample.
• Scaden [45]: Scaden is a deep neural network method for cell deconvolution. The algorithm requires the following input from users: (i) bulk RNA-seq data and (ii) scRNA-seq
data. The scRNA-seq and RNA-seq data must come from the same tissue. The method
uses a Deep Neural Network (DNN) to predict the cell type proportions from bulk samples.
It uses the scRNA-seq to simulate artificial bulk RNA-seq and train the neural network.
The architecture of the deep neural network consists of multiple sub-networks, each of
which is a multilayer perceptron with a different number of neurons. Each sub-network is trained independently to predict cell type proportions from bulk data. The final
cell type proportions are the average of the predictions from all sub-networks.
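Two steps shared by several of the reference-based methods above, deriving a signature from scRNA-seq data and solving a weighted constrained least squares problem, can be sketched for the two-type case (malignant versus normal). This is a simplified illustration and not the implementation of MuSiC or any other listed method: the toy Poisson reference, the inverse-variance weights, and all variable names are assumptions. With two cell types and the sum-to-one constraint, the weighted least squares solution has a closed form, and clipping it to [0, 1] enforces non-negativity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 100

# Toy reference scRNA-seq: (cells, genes) matrices for each cell type.
mean_mal = rng.gamma(2.0, 5.0, size=n_genes)
mean_norm = rng.gamma(2.0, 5.0, size=n_genes)
cells_mal = rng.poisson(mean_mal, size=(200, n_genes))
cells_norm = rng.poisson(mean_norm, size=(200, n_genes))

# Signature (GEP): average expression per gene within each cell type.
x_mal = cells_mal.mean(axis=0)
x_norm = cells_norm.mean(axis=0)

# Gene weights: inverse of the pooled within-type variance (an assumption).
var = 0.5 * (cells_mal.var(axis=0) + cells_norm.var(axis=0))
w = 1.0 / (var + 1e-6)

# Toy bulk sample with a known malignant fraction (per-cell average scale).
p_true = 0.7
y = p_true * mean_mal + (1 - p_true) * mean_norm + rng.normal(0, 0.1, n_genes)

# Weighted least squares under p_mal + p_norm = 1, clipped to [0, 1].
d = x_mal - x_norm
p_hat = float(np.clip(np.sum(w * d * (y - x_norm)) / np.sum(w * d * d), 0, 1))
print(round(p_hat, 3))
```

With more than two cell types the closed form no longer applies, and a constrained solver such as NNLS is needed.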
1.4 Dissertation outline
This dissertation is organized into four chapters: Chapter 1 provides an introduction to foundational concepts, including RNA-seq and uncertainty quantification. It also offers a comprehensive
overview of existing methodologies and approaches used to tackle the challenges associated with
cell-type deconvolution and predictive uncertainty. Chapter 2 focuses on accurate prediction of
malignant cell fractions in cancer tissues. It introduces DeepDecon, an iterative, deep learning-based computational method designed for precise estimation of malignant cell fractions. Chapter
3 extends the DeepDecon framework to include uncertainty quantification, presenting DeepDeconUQ. This chapter addresses the challenge of predicting malignant cell fractions with confidence. DeepDeconUQ is an advanced deep neural network algorithm that leverages scRNA-seq
data to construct prediction intervals for malignant cell fractions, integrating uncertainty into
the estimation process. Finally, Chapter 4 summarizes the key contributions of this dissertation
and explores potential future directions for research in cell-type deconvolution and uncertainty
quantification.
1.5 Authors and contributors to the dissertation
I was fortunate to be supervised by Dr. Fengzhu Sun for the work I have done in this dissertation.
Dr. Yuxuan Du, Dr. Andres Stucky, Dr. Kevin R. Kelly, and Dr. Jiang F. Zhong helped with the
DeepDecon method. Additionally, Dr. Yuxuan Du, Dr. Yingying Fan, Dr. Jinchi Lv, Dr. Kevin R.
Kelly, and Dr. Jiang F. Zhong further helped with the DeepDeconUQ method.
Chapter 2
DeepDecon Accurately Estimates Cancer Cell Fractions in
Bulk RNA-seq Data
2.1 Introduction
For centuries, biologists have recognized that multicellular organisms are composed of a vast array of distinct cell types [46]. Cells and tissues play a critical role in all living organisms. Tissues
are composed of cells, and cells are responsible for making up the different types of tissues in all
multicellular organisms. Classifying and quantifying cells is crucial for understanding in detail how tissues function and interact with each other and the microenvironment, and for
revealing mechanisms underlying pathological states. For example, tumor tissues are heterogeneous
and consist of different fractions of cell types. Cancer identification, treatment, and clinical outcomes such as tumor growth, metastasis, recurrence, and drug resistance have a direct relation
with cell type composition and its changes [47, 48, 49]. Quantifying cell type fractions within
tumor tissues can provide insight into the role of heterogeneity in disease and how particular
environments can impact tumor biology.
RNA sequencing (RNA-seq) is an alternative method to conventional microarrays for transcriptome analysis [50, 51]. Bulk RNA-seq provides a view of average gene expression profiles
within a whole organ or tissue. It can be regarded as the sum of the product of cell type-specific
gene expressions and corresponding cell type proportions [14]. However, information on the
variations of different cell types is lost in bulk RNA-seq. Single-cell RNA sequencing (scRNA-seq) can instead help solve this problem. It allows for the quantification of transcripts for each
cell and the further identification of new cell types based on gene expression profiles [52]. Additionally, it enables the assessment of heterogeneity in cohorts of patient samples, providing a
deeper understanding of disease states and aiding in the development of effective treatments [49,
53, 54, 55]. As a result, scRNA-seq data generated from samples with similar microenvironmental
conditions can potentially help tackle the problem of bulk tissue deconvolution.
Many methods have been developed in recent years to decompose fractions of cell types in
bulk tissues, and most of them use cell-type-specific gene expression profiles (GEPs) as references
[22, 23]. ESTIMATE [56] uses The Cancer Genome Atlas to infer the fraction of stromal and immune
cells in tumor samples, which can be further used to approximate the proportion of cancer cells
in bulk RNA-seq data. Non-negative least squares regression (NNLS) [57, 58] is an optimization
method to solve this deconvolution problem through matrix decomposition, but it can be easily
affected by the choice of GEPs. Noise, imprecision, and missing data of GEPs can lead to poor
performance of NNLS. CIBERSORT/CIBERSORTx [41, 42] are two widely used deconvolution
methods. CIBERSORT adopts a linear support vector regression (SVR) approach, representing
the gene expression of a bulk sample as a weighted sum of gene expressions from different cell
types. These weights are determined based on predefined GEPs. On the other hand, CIBERSORTx
is an enhanced version of CIBERSORT that enables the generation of GEPs from scRNA-seq data.
Another approach, MuSiC [14], dynamically generates reference profiles from scRNA-seq data.
It assigns high weights to genes with low cross-subject variance and low weights to genes with
high cross-subject variance. However, MuSiC ignores the possibility of significant variations in
tumor conditions between reference data and bulk data. Bisque [40] addresses the issue of simple
summation of scRNA-seq profiles by adopting a linear transformation on artificially derived bulk
RNA-seq samples. This transformed data is then used for decomposition. However, the success
of this transformation heavily relies on the similarity in distribution between reference single
cells and actual data. An alternative method, RNA-Sieve [43], uses a likelihood-based inference
method. It assumes that the estimates of cell-type fractions are normally distributed around the
true fractions. MEAD [44], on the other hand, is a statistical inference method that introduces a
gene-gene dependence structure to improve accuracy. Nonetheless, the dependence matrix used
in MEAD is highly dependent on the choice of bulk samples and cannot be generated when there
is only one single bulk sample to decompose. Lastly, Scaden [45] leverages neural networks to
predict cell fractions and has demonstrated superior performance compared to traditional deconvolution methods. It generates cell fractions by averaging the outputs of three different neural
networks.
In this study, we introduce DeepDecon, an iterative deep neural network model designed to
accurately estimate the proportion of cancer cells in bulk RNA-seq data. DeepDecon makes use of
scRNA-seq gene expression information to generate artificial bulk RNA-seq datasets with known
proportions of cancer cells in each artificial bulk RNA-seq sample. The artificial bulk RNA-seq
datasets can be employed to train an iterative deep neural network model, which can subsequently be employed to accurately predict the proportions of cancer cells in novel cancer tissues.
Our approach utilizes an iterative process to refine predictions and enhance estimation accuracy.
Through extensive benchmark evaluations using both simulated and real data, we demonstrate
that DeepDecon outperforms other existing methods across different cancer tissues, and is also
robust to the influence of gene expression perturbations and the number of cells per bulk sample.
Overall, by leveraging scRNA-seq information, employing deep neural networks, and making
use of an iterative refinement process, DeepDecon achieves superior performance in cancer cell
deconvolution analysis.
2.2 Materials and Methods
2.2.1 Datasets
Acute myeloid leukemia (AML) is a heterogeneous disease in which haemopoietic progenitor cells
(blasts) lose the ability to differentiate and proliferate normally [59]. The diagnosis of AML
has a direct relation with the malignant cell percentage in bone marrow (BM) tissues [60, 61].
Therefore, we chose AML as our primary disease in this study. The single-cell AML datasets were
downloaded from Gene Expression Omnibus (GEO) with accession number GSE116256 [62]. This
dataset contains scRNA-seq gene expression sequenced from subjects who have different degrees
of AML disease. Each cell in the dataset has labeled cell types (malignant or normal). A total of 15
subjects (38,410 cells) were selected to simulate artificial bulk RNA-seq datasets. The scRNA-seq
data were processed following the preprocessing workflow of the widely-used single-cell gene
expression python package, Scanpy (v. 1.7.2) [63]. Initially, cell-gene matrices were filtered to
exclude cells with fewer than 500 detected genes and genes expressed in fewer than five cells.
Subsequently, the count matrix for each subject was filtered to remove extreme outliers in gene
expression values (Table 5.3). Then, gene expression was normalized by Scanpy’s ‘normalize_total’
function so that every cell has the same total count after normalization. This will counteract
the effect of different library sizes. Finally, the resulting normalized matrix of all filtered cells and
genes was saved for subsequent simulated bulk data generation. The details of data selection and
preprocessing are given in Chapter 5.2.3 and Figure 5.11.
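The filtering and total-count normalization steps described above can be illustrated on a toy count matrix. This is a minimal pure-Python sketch and not the Scanpy implementation: the helper names `filter_counts` and `normalize_total_counts`, the `target_sum` argument, and the small thresholds used on the toy data are hypothetical choices made only for illustration (the study used 500 genes per cell and five cells per gene).

```python
def filter_counts(counts, min_genes=500, min_cells=5):
    """Drop cells with fewer than min_genes detected genes, then drop genes
    detected in fewer than min_cells of the remaining cells.

    `counts` is a list of cells; each cell is a list of per-gene counts.
    """
    cells = [c for c in counts if sum(v > 0 for v in c) >= min_genes]
    if not cells:
        return []
    keep = [g for g in range(len(cells[0]))
            if sum(c[g] > 0 for c in cells) >= min_cells]
    return [[c[g] for g in keep] for c in cells]


def normalize_total_counts(counts, target_sum):
    """Rescale every cell so that its counts add up to target_sum."""
    out = []
    for c in counts:
        total = sum(c)
        out.append([v * target_sum / total if total else 0.0 for v in c])
    return out


# Toy matrix: 4 cells x 4 genes, with deliberately small thresholds.
toy = [
    [4, 0, 1, 0],
    [2, 2, 0, 0],
    [0, 0, 0, 9],   # only one detected gene, so this cell is removed
    [1, 3, 5, 0],
]
filtered = filter_counts(toy, min_genes=2, min_cells=2)
normalized = normalize_total_counts(filtered, target_sum=10.0)
```

After normalization every remaining cell has the same total count, which is what removes the library-size effect.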
DeepDecon was tested on real AML bulk RNA-seq datasets. We first downloaded AML data
from the GDC Data Portal (https://portal.gdc.cancer.gov/) with the project name ‘TARGET-AML’. The AML samples were further divided into primary AML and recurrent AML categories
according to different cancer stages. As a result, there were a total of 117 primary AML samples and 38 recurrent AML samples. Ground-truth cancer cell fractions from flow cytometry are
available in these bulk RNA-seq data. Moreover, an additional real AML dataset, ‘BeatAML’ [64],
was collected from cBioportal [65]. ‘BeatAML’ contains a total of 451 bulk RNA-seq samples, and
300 of them have corresponding ground-truth cancer cell fractions. The study used the ‘SureSelect’ sequencing platform, which is different from the sequencing platform used to generate the
single-cell data on the ‘TARGET-AML’ dataset (Table 5.4). These datasets enable us to evaluate
DeepDecon’s performance on data from different sources.
To test DeepDecon’s performance on other cancer tissues, we also collected 19,173 single
cells from 9 neuroblastoma cancer patients [66] and 184,868 single cells from 27 Head and neck
squamous cell carcinoma (HNSCC) cancer patients [67]. They were used to simulate artificial
RNA-seq bulk samples to build and evaluate DeepDecon. A real neuroblastoma bulk RNA-seq
dataset consisting of 99 bulk RNA-seq samples with known cancer cell fractions was collected
from cBioportal [65], and another real HNSCC bulk RNA-seq dataset, ‘TCGA-HNSC’, consisting
of 518 bulk RNA-seq samples with known cancer cell fractions was collected from LinkedOmics
[68]. These two real datasets were used for testing. The details of data selection and preprocessing
are given in Chapter 5.2.3.
2.2.2 Generating artificial bulk RNA-seq datasets
We used scRNA-seq datasets described in Chapter 2.2.1 to construct artificial bulk RNA-seq samples. The generated samples were designed to have predetermined malignant cell fractions, which
were then employed as training data for the DeepDecon model. Specifically, we first fixed the
total number of cells in an artificial bulk sample to be N, and a malignant cell number nm was
randomly generated from a uniform distribution between 0 and N. Subsequently, nm malignant
cells and N − nm normal cells were randomly sampled from the same scRNA-seq dataset. If
the total number of malignant or normal cells in the scRNA-seq dataset was smaller than nm or
N − nm, respectively, the cells were chosen with replacement, that is, each cell is chosen uniformly from all the single cells available; otherwise, the cells were chosen without replacement,
that is, each cell is chosen from the remaining cells. Importantly, cells from different subjects
(i.e., individuals) were not merged into an aggregated sample. This decision was motivated by
two primary considerations. Firstly, the aim was to safeguard within-subject relationships among
genes by preserving the unique gene expression patterns inherent to each subject. Secondly, the
intention was to capture the variability between subjects, commonly referred to as cross-subject
heterogeneity [45]. The single cells were merged into one bulk sample by summing their expression values, and the resulting artificial bulk sample was labeled with the fraction of malignant
cells nm/N. This process was repeated for each scRNA-seq dataset, generating a corresponding
artificial bulk RNA-seq dataset. Each bulk dataset contained T samples with known malignant
cell type proportions (see Chapter 5.2.4). We set N = 3,000 and T = 200 here for model training. We also investigated the impacts of N on DeepDecon. This procedure provides a valuable
resource for training and evaluating the DeepDecon algorithm.
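The sampling procedure of this section can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the code used in the study: the function name `simulate_bulk`, the list-of-vectors data layout, and the toy cell pools are hypothetical, and both pools are assumed to be non-empty.

```python
import random


def simulate_bulk(malignant, normal, n_cells, rng):
    """Build one artificial bulk sample from a single subject's cells.

    `malignant` and `normal` are lists of per-cell expression vectors.
    Returns (summed expression vector, malignant fraction n_m / N).
    """
    n_m = rng.randint(0, n_cells)                  # uniform on {0, ..., N}
    picked = []
    for pool, k in ((malignant, n_m), (normal, n_cells - n_m)):
        if k > len(pool):
            picked += rng.choices(pool, k=k)       # with replacement
        else:
            picked += rng.sample(pool, k)          # without replacement
    bulk = [sum(gene_values) for gene_values in zip(*picked)]
    return bulk, n_m / n_cells


# Toy pools with two genes; gene 0 is 1 in malignant cells and 0 otherwise,
# so the summed value of gene 0 equals the number of malignant cells drawn.
malignant_cells = [[1, 0], [1, 2]]
normal_cells = [[0, 3], [0, 4], [0, 5]]
rng = random.Random(0)
dataset = [simulate_bulk(malignant_cells, normal_cells, 4, rng) for _ in range(200)]
```

Because cells are drawn from one subject at a time, repeating this for every subject yields one artificial bulk dataset per subject, as described above.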
2.2.3 Data Processing
To ensure consistency between the data used for training and prediction, the artificial bulk RNA-seq samples underwent a preprocessing procedure before model training. Specifically, only genes
that were present in both training and testing datasets were retained, and genes with low expression variances (below 0.1) were removed. Next, a TF-IDF transformation was applied to the raw
RNA-seq count matrix. This transformation, commonly used in information retrieval and text
mining [69, 70], involves calculating the ‘term frequency (TF)’ for each gene in each sample by
normalizing the gene expression profile (see Formula 2.1). The ‘inverse document frequency
(IDF)’ was then calculated by dividing the total number of bulk samples by the total gene expression values of the gene across all samples (see Formula 2.2), followed by log-transformation and
multiplication by the TF value. The TF-IDF transformation weights genes with lower expression
levels more heavily, which helps to adjust for the imbalanced expression levels across genes [71].
This preprocessing procedure is an important step in ensuring the quality and consistency of the
data used for training the deep learning models.
\[
\mathrm{TF}(X_{i,j}) = \frac{X_{i,j}}{\sum_{j} X_{i,j}}, \tag{2.1}
\]
\[
\mathrm{IDF}(G_j) = \log\left(\frac{T}{\sum_{i} X_{i,j}} + 1\right), \tag{2.2}
\]
where X_{i,j} is the expression level of the j-th gene in the i-th sample, G_j indicates the j-th gene,
and T is the number of bulk samples.
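A small sketch of the TF-IDF weighting may help make Formulas 2.1 and 2.2 concrete. It assumes that the "+ 1" in Formula 2.2 sits inside the logarithm after the ratio, that every sample and every gene has a non-zero total (genes with near-zero variance are filtered beforehand), and that the function name `tf_idf` and the toy matrix are purely illustrative.

```python
import math


def tf_idf(X):
    """X[i][j] is the expression of gene j in bulk sample i.

    Returns the matrix of TF(X[i][j]) * IDF(G_j).
    """
    T = len(X)
    n_genes = len(X[0])
    gene_totals = [sum(row[j] for row in X) for j in range(n_genes)]
    # Assumed reading of Formula 2.2: log(T / sum_i X[i][j] + 1)
    idf = [math.log(T / gene_totals[j] + 1) for j in range(n_genes)]
    out = []
    for row in X:
        row_total = sum(row)                      # denominator of Formula 2.1
        out.append([row[j] / row_total * idf[j] for j in range(n_genes)])
    return out


# Two samples, two genes: gene 0 is highly expressed, gene 1 is lowly expressed.
X = [[10.0, 1.0],
     [30.0, 1.0]]
weighted = tf_idf(X)
idf_gene0 = math.log(2 / 40.0 + 1)
idf_gene1 = math.log(2 / 2.0 + 1)
```

In this toy case the lowly expressed gene receives the larger IDF weight, which is the rebalancing effect the transformation is meant to provide.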
Let X′ denote the gene expression matrix after TF-IDF transformation. A MinMax transformation was applied to the resulting expression matrix X′ to scale the expression values to
the [0, 1] range (see Formula 2.3). This is a common practice in deep learning models that use
gradient-based optimization algorithms [45, 72].
\[
X^{\mathrm{norm}}_{i} = \frac{X'_{i} - \min(X'_{i})}{\max(X'_{i}) - \min(X'_{i})}, \tag{2.3}
\]
where X′_i is the i-th row of X′ and X^norm_i is the i-th row of the resulting expression matrix after
the MinMax transformation.
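The row-wise MinMax scaling of Formula 2.3 can be sketched in a few lines. The function name `min_max_rows` is hypothetical, and mapping a constant row to zeros is an assumption made here to avoid division by zero; the source does not say how such rows are handled.

```python
def min_max_rows(X):
    """Scale each row (sample) of X to the [0, 1] range, as in Formula 2.3."""
    out = []
    for row in X:
        lo, hi = min(row), max(row)
        span = hi - lo
        # Assumption: a constant row carries no contrast and is mapped to zeros.
        out.append([(v - lo) / span if span else 0.0 for v in row])
    return out


scaled = min_max_rows([[2.0, 4.0, 6.0], [5.0, 5.0, 5.0]])
```

Each sample is scaled independently, so the smallest value in a row becomes 0 and the largest becomes 1.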
There are also several existing normalization methods, including fragments per kilobase per
million mapped fragments (FPKM) and transcripts per kilobase million (TPM). These methods are
mainly used for different sequencing methods [73]. For example, gene expression data from the
unique molecular identifier (UMI) counting can represent the real expression value while gene
expression data from smart-seq protocol need to be further normalized using methods like TPM
or FPKM [74]. We compared these normalization methods with TF-IDF normalization, and the
details can be accessed in Chapter 5.2.5.
2.2.4 The DeepDecon Model
The deep learning model used in this study consisted of two main components. The first component consisted of four fully connected layers with a dropout regularization between each layer,
and the rectified linear unit (ReLU) was used as the activation function in every internal layer.
The second component was a softmax function used to predict the malignant and normal cell
fractions. All model parameters were optimized using the Adam optimization algorithm [75]
with a learning rate of 0.0001 and a batch size of 128. The output of the DeepDecon model is
the estimated fraction of malignant (tumor) cells of given bulk RNA-seq samples. The model was
trained as a regression task, with the root mean square error (RMSE) as the loss function. Various combinations of learning rates, batch sizes, and dropout rates in the deep learning model
were tested, and the results are shown in Chapter 5.2.6 and Table 5.5. The Keras (v. 1.0.8) library
(https://keras.io/) was used to implement the deep learning model.
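The output stage and the loss can be illustrated without a deep learning framework. The sketch below is not the Keras model; it only shows how a softmax over two output units yields fractions that sum to one and how RMSE is computed. The ordering of the two units (malignant first, normal second) and the example logits are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the outputs are non-negative and sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rmse(pred, true):
    """Root mean square error between predicted and true fractions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


# Hypothetical last-layer activations for one bulk sample.
malignant_frac, normal_frac = softmax([1.2, -0.3])
loss = rmse([malignant_frac], [0.75])
```

Because of the softmax, predicting the malignant fraction implicitly fixes the normal fraction as its complement.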
To address the issue of poor prediction accuracy when the malignant cell fraction is close
to 0 or 1 (see Figure 5.3), an iterative deep-learning model was developed. This model involves
iteratively narrowing down the prediction interval of given samples. More specifically, let di,j
denote the set of artificial bulk RNA-seq samples whose malignant cell fractions p ∈ [0.01·i, 0.01·
j], i < j, i, j ∈ {0, 10, 20, · · · , 100} and Mi,j denote a DeepDecon model trained on di,j ; that
is, Mi,j was trained on artificial bulk samples with a particular range of cell fraction. A total of
55 models were trained in this experiment. DeepDecon model Mi,j was trained to minimize the
error between the predicted cell fraction and the true cell fraction. After training, the difference
between the predicted and true malignant fractions was calculated for each artificial sample in
di,j , and the set of differences was defined as diff(i, j).
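The bookkeeping behind the interval models can be sketched as follows. The snippet enumerates every interval (i, j) with i < j on the grid 0, 10, ..., 100, selects the training subset for one interval, and computes a difference set. The helper names, the toy samples, and the constant toy "model" are hypothetical, and taking the difference as predicted minus true is one reading of the text.

```python
from itertools import combinations

BOUNDS = range(0, 101, 10)
intervals = list(combinations(BOUNDS, 2))        # every (i, j) with i < j


def subset_for_interval(samples, i, j):
    """Keep (expression, fraction) pairs whose fraction lies in [0.01*i, 0.01*j]."""
    return [(x, p) for x, p in samples if 0.01 * i <= p <= 0.01 * j]


def diff_set(model, samples):
    """diff(i, j), read here as predicted minus true fraction per sample."""
    return [model(x) - p for x, p in samples]


toy_samples = [([1.0], 0.05), ([2.0], 0.15), ([3.0], 0.95)]
d_0_20 = subset_for_interval(toy_samples, 0, 20)
diffs_0_20 = diff_set(lambda x: 0.1, d_0_20)     # toy model that always predicts 0.1
```

The length of `intervals` reproduces the count of 55 models stated above.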
To predict the malignant fraction for a given real bulk sample X, the full range model M0,100
(with i = 0, j = 100) is used to provide an initial estimate P̂. DeepDecon tries to limit the
estimate to a smaller range, denoted as [0.01i′, 0.01j′], based on the previous prediction value P̂
and training datasets difference diff(i, j) (see Formula 2.4 and 2.5). Model Mi′,j′ is then used to
predict the malignant cell fraction of bulk sample X again, and the process continues to refine
the estimation. During each iteration, DeepDecon either shortens the intervals or moves them to
the left or right. Direction flags fl and fr are used to indicate the directions in which DeepDecon
moves. The number of intervals is finite, and DeepDecon cannot shrink the intervals indefinitely.
The intervals are also not allowed to oscillate between left and right. Therefore, the algorithm is
finally forced to stop.
\[
L(i,j) = \hat{P} + \mathrm{diff}(i,j)_{\lambda/2}, \qquad
i' = \max\left(0, \lfloor 100 \cdot L(i,j) \rfloor\right), \tag{2.4}
\]
\[
U(i,j) = \hat{P} + \mathrm{diff}(i,j)_{1-\lambda/2}, \qquad
j' = \min\left(100, \lceil 100 \cdot U(i,j) \rceil\right), \tag{2.5}
\]
where λ is a hyperparameter we use to select the lower and upper percentile of diff(i, j) and help
define the new model interval. The default value of λ is set at 10%. diff(i, j)_{λ/2} and diff(i, j)_{1−λ/2}
indicate the λ/2 and 1 − λ/2 percentiles of the set diff(i, j), respectively. ⌊·⌋ and ⌈·⌉ indicate the
floor and ceiling of a number, respectively. The specific steps of iterative DeepDecon are given
in Algorithm 1.
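As a worked illustration of Formulas 2.4 and 2.5, consider made-up numbers that are not taken from the study: the current model returns P̂ = 0.07, λ = 10%, and the 5th and 95th percentiles of diff(i, j) are −0.036 and 0.041, respectively. Then

\[
L(i,j) = 0.07 - 0.036 = 0.034, \qquad i' = \max\left(0, \lfloor 3.4 \rfloor\right) = 3,
\]
\[
U(i,j) = 0.07 + 0.041 = 0.111, \qquad j' = \min\left(100, \lceil 11.1 \rceil\right) = 12,
\]

so the candidate range for the next iteration is [0.03, 0.12]. The max and min only take effect when the shifted percentiles fall outside [0, 1].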
Algorithm 1 Iterative DeepDecon
Require: Trained DeepDecon models, M = {M_{i,j} : i < j; i, j ∈ {0, 10, 20, · · · , 100}}; difference sets,
    which are differences between the prediction and the true malignant fractions from training
    datasets, DIFF = {diff(i, j) : i < j; i, j ∈ {0, 10, 20, · · · , 100}}; testing bulk sample X
Ensure: Malignant cell fraction estimate, P̂
Record the direction of the interval that DeepDecon moves compared to the last iteration,
    denoted by f_l and f_r
Initialization: model start interval index i = 0, end interval index j = 100, left direction
    f_l = 0, right direction f_r = 0, iteration end flag flag = 0, and percentile hyperparameter
    λ = 10%
P̂ = M_{0,100}(X)
L(i, j) = P̂ + diff(i, j)_{λ/2}; i′ = max(0, ⌊100 ∗ L(i, j)⌋)
U(i, j) = P̂ + diff(i, j)_{1−λ/2}; j′ = min(100, ⌈100 ∗ U(i, j)⌉)
while flag = 0 do
    if i′ ≥ i and j′ ≤ j then
        flag = 0
    else if i′ ≤ i and j′ ≤ j then
        flag = 0; f_l = 1
    else if i′ ≥ i and j′ ≥ j then
        flag = 0; f_r = 1
    end if

    if i′ ≥ j′ or min(f_l, f_r) > 0 or (i′ ≤ i and j′ ≥ j) then
        flag = 1
    end if

    if flag = 0 then
        i = i′; j = j′
        P̂ = M_{i,j}(X)
        L(i, j) = P̂ + diff(i, j)_{λ/2}; i′ = max(0, ⌊100 ∗ L(i, j)⌋)
        U(i, j) = P̂ + diff(i, j)_{1−λ/2}; j′ = min(100, ⌈100 ∗ U(i, j)⌉)
    end if
end while
return P̂
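The control flow of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the released implementation. In particular, the text does not say how i′ and j′ are mapped onto the trained grid of bounds 0, 10, ..., 100, so the sketch snaps the interval outward to the nearest multiples of 10; that snapping rule, the linear-interpolation percentile, the function names, and the toy models are all hypothetical.

```python
import math

GRID_STEP = 10  # assumption: models exist only for bounds 0, 10, ..., 100


def percentile(values, q):
    """Percentile of `values` at level q in [0, 1], with linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def next_interval(p_hat, diffs, lam):
    """Formulas 2.4 and 2.5, then snapped outward to the model grid (assumption)."""
    lower = p_hat + percentile(diffs, lam / 2)
    upper = p_hat + percentile(diffs, 1 - lam / 2)
    i_new = max(0, math.floor(100 * lower))
    j_new = min(100, math.ceil(100 * upper))
    i_new = (i_new // GRID_STEP) * GRID_STEP
    j_new = min(100, math.ceil(j_new / GRID_STEP) * GRID_STEP)
    return i_new, j_new


def iterative_predict(x, models, diff_sets, lam=0.10):
    """Refine the estimate with narrower-range models until a stop rule fires."""
    i, j = 0, 100
    moved_left = moved_right = False
    p_hat = models[(i, j)](x)
    i_new, j_new = next_interval(p_hat, diff_sets[(i, j)], lam)
    while True:
        if i_new >= i and j_new <= j:
            pass                                  # interval shrinks or is unchanged
        elif i_new <= i and j_new <= j:
            moved_left = True                     # interval moved to the left
        elif i_new >= i and j_new >= j:
            moved_right = True                    # interval moved to the right
        # Stop on an empty interval, on oscillation, or when nothing narrows.
        if (i_new >= j_new
                or (moved_left and moved_right)
                or (i_new <= i and j_new >= j)):
            return p_hat
        i, j = i_new, j_new
        p_hat = models[(i, j)](x)
        i_new, j_new = next_interval(p_hat, diff_sets[(i, j)], lam)


# Toy stand-ins: each "model" clamps its input (a known fraction) to its own range.
pairs = [(i, j) for i in range(0, 101, 10) for j in range(i + 10, 101, 10)]
models = {(i, j): (lambda x, i=i, j=j: min(max(x, i / 100), j / 100))
          for i, j in pairs}
diff_sets = {pair: [-0.04, -0.02, 0.0, 0.02, 0.04] for pair in pairs}

estimate = iterative_predict(0.07, models, diff_sets)
```

With these toy inputs the full-range model is followed by the model for the range [0, 0.2], after which the interval no longer changes and the loop returns.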
2.2.5 The impact of gene expression perturbations and the number of
cells per bulk sample on DeepDecon
To test the model’s robustness to gene expression perturbations, we added different levels of
Gaussian noise to the expression levels of the simulated datasets. Specifically, we added random noise generated from a Gaussian distribution with zero mean and variance that equals
α(α = 0.01, 0.05, 0.1) times gene expression level for each gene in each sample (see Formula
2.6). Moreover, for each simulated bulk sample, we randomly selected 10% of the genes and
set their expression values to 0 to simulate missing data in practice.
\[
X^{\mathrm{noise}}_{ij} = \max\left(0,\; X_{ij} + \mathcal{N}(0, \alpha \cdot X_{ij})\right), \tag{2.6}
\]
where X_{ij} is the gene expression value of gene j in simulated bulk sample i and α is the noise
level.
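The perturbation of Formula 2.6 together with the 10% masking can be sketched as below. The function name `perturb`, the `mask_rate` argument, and the toy expression vector are hypothetical, and applying the mask after the noise is an assumption since the order is not stated. Note that the second argument of N(0, α·X) is a variance, so the standard deviation passed to the sampler is its square root.

```python
import math
import random


def perturb(sample, alpha, rng, mask_rate=0.10):
    """Add N(0, alpha * x) noise to each gene, clip at 0, then zero a random
    subset of genes (10% by default) to mimic missing values."""
    noisy = [max(0.0, x + rng.gauss(0.0, math.sqrt(alpha * x))) for x in sample]
    n_mask = round(mask_rate * len(noisy))
    for j in rng.sample(range(len(noisy)), n_mask):
        noisy[j] = 0.0
    return noisy


# Toy bulk sample with 100 genes and strictly positive expression values.
sample = [float(v) for v in range(1, 101)]
noisy = perturb(sample, alpha=0.05, rng=random.Random(1))
```

The noise scale grows with the expression level, so highly expressed genes are perturbed more in absolute terms.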
For each subject, we generated the simulated bulk datasets with different noise levels separately. Leave-one-out cross-validation was used to evaluate model performance across subjects.
Specifically, we selected one of the k artificial bulk RNA-seq datasets as the testing dataset, while
the remaining k−1 datasets served as the training set. This process was repeated k times to fully
evaluate the performance of our model.
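The leave-one-out scheme over subject datasets amounts to the following loop. The generator name `leave_one_out` and the subject labels are placeholders chosen for illustration.

```python
def leave_one_out(dataset_names):
    """Yield (held-out dataset, list of training datasets) for each of the k datasets."""
    names = list(dataset_names)
    for held_out in names:
        yield held_out, [n for n in names if n != held_out]


splits = list(leave_one_out(["subject_A", "subject_B", "subject_C"]))
```

Each subject's artificial bulk dataset serves as the test set exactly once, so no subject contributes to both training and testing within a fold.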
The total number of cells N in bulk RNA-seq samples can vary from sample to sample, and it
can be challenging to accurately estimate the number of cells in a given sample. In addition, bulk
RNA-seq samples in practice could also be influenced by factors, such as cell isolation, cell size,
and clustering, which can further complicate the estimation of cell numbers. In order to evaluate
the performance of DeepDecon under different numbers of single cells, we first fixed our DeepDecon model and generated a set of testing datasets Q = {qi,n | i = 0, 10, · · · , 90, 100; n =
500, 1000, 2000, 3000, 4000, 5000}. Each bulk sample in the dataset qi,n contains n single cells,
and the number of malignant cells follows a binomial distribution Binomial(n, i/100). This simulates the variation of a random sampling of malignant cells. In practice, the number of single cells
in tissue samples can vary widely among different patients and even among different sampling
periods for the same patient [76]. Finally, we used DeepDecon to estimate the fraction of malignant cells for each sample in the testing datasets. By evaluating the performance of DeepDecon
under different numbers of single cells, we can assess the robustness and accuracy of the model
in real-world scenarios.
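Drawing the number of malignant cells for each test configuration can be sketched with the standard library alone. The function name `malignant_count` and the fixed seed are illustrative choices; the grid of i and n values follows the text.

```python
import random


def malignant_count(n, i, rng):
    """Draw the number of malignant cells from Binomial(n, i / 100)."""
    p = i / 100
    return sum(rng.random() < p for _ in range(n))


rng = random.Random(42)
counts = {(i, n): malignant_count(n, i, rng)
          for i in range(0, 101, 10)
          for n in (500, 1000, 2000, 3000, 4000, 5000)}
```

For intermediate values of i the realized fraction fluctuates around i/100, and the fluctuation shrinks as n grows, which is one reason larger test samples are expected to be easier to predict.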
In addition to testing the scenario in which the DeepDecon model is fixed and the testing
datasets are varied, we also conducted additional experiments examining the impact of varying
the number of cells per sample during the training process. Specifically, we fixed the testing
datasets and trained different DeepDecon models using datasets where each bulk sample consisted of a different number of single cells, ranging from 500 to 3000. These models are denoted
as DeepDecon_N, where N = 500, 1000, 2000, 3000 is the number of cells per bulk
sample. These DeepDecon models were then used to predict the
fraction of malignant cells on the same testing bulk RNA-seq dataset qi,n, which was generated
as described earlier. This analysis provides insights into the performance of DeepDecon under
different training scenarios and can help with the optimal selection of training cell numbers for
a given experimental setup.
2.3 Results
2.3.1 Methods overview
Figure 2.1 shows the graphical overview of iterative DeepDecon. DeepDecon starts with scRNA-seq datasets and assumes the cells for each subject have labeled cell types (malignant/normal)
and known gene expression levels. Therefore, simulated bulk RNA-seq datasets with known
cell type fractions can be generated from these scRNA-seq datasets (Figure 2.1A). Additionally,
simulated bulk RNA-seq datasets can be generated with specific ranges of malignant cell fractions. This allows us to develop an iterative deconvolution model. During the model training
process, simulated bulk samples whose malignant cell fraction p ∈ [0.01i, 0.01j], i < j, i, j ∈
{0, 10, 20, · · · , 100} serve as the input to train a DeepDecon model Mi,j (Figure 2.1B). The whole
group of DeepDecon models will be used in the iterative process. The core architecture of DeepDecon is a group of deep neural networks (DNN) that take bulk RNA-seq data as input and output
predicted malignant cell fractions. These models share the same structure, consisting of four fully
connected layers with dropout layers (Figure 2.1C). When presented with a real bulk sample,
DeepDecon first generates an initial malignant cell prediction P̂ using the whole range model
M0,100. Then in each iteration, DeepDecon will narrow down the prediction interval and update
the prediction P̂ with models trained on narrow-range datasets (Figure 2.1D). The selection of
these narrow-range models is only determined by the previous prediction value and the training datasets (see Formula 2.4 and 2.5). By incorporating datasets with all kinds of malignant cell
fractions and dynamically determining fraction-specific model iterations, DeepDecon allows for
estimating cell proportions of bulk RNA-seq data accurately.
Our model was constructed using artificial bulk RNA-seq samples and evaluated through
leave-one-out cross-validation. Root mean square error (RMSE), Pearson’s correlation coefficient
(r), and Lin’s concordance correlation coefficient (CCC) values between the predicted fractions
and true fractions of malignant cells were used to evaluate the performance of different deconvolution methods.
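Pearson's r and Lin's CCC differ in how they treat systematic bias, which the following sketch illustrates. It uses the standard definitions with population (1/n) moments for the CCC; the function names and the toy fractions are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient with 1/n moments."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


true_frac = [0.1, 0.5, 0.9]
shifted = [0.2, 0.6, 1.0]          # every prediction is too high by 0.1
r = pearson_r(true_frac, shifted)
ccc = lin_ccc(true_frac, shifted)
```

A constant offset leaves r at 1 but lowers the CCC, so the CCC rewards agreement with the identity line rather than only linear association.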
2.3.2 DeepDecon outperforms other methods for estimating malignant
cell fraction
To demonstrate and evaluate the performance of DeepDecon, we first compared DeepDecon with
eight other methods (Scaden (v. 1.1.2) [45], CIBERSORTx (https://cibersortx.stanford.edu/)
[42], Bisque (v. 1.0.5) [40], ESTIMATE (v. 2.0.0) [56], MuSiC (v. 1.0.0) [14], MEAD (v. 1.0.1) [44],
RNA-Sieve (v. 0.1.4) [43], and NNLS (v. 1.4) [57, 14]) on artificial bulk RNA-seq datasets. scRNA-seq data described in Chapter 2.2.1 was used as reference data for Bisque, MEAD, RNA-Sieve,
MuSiC, and CIBERSORTx. MuSiC will also give the output of the NNLS method and we used it as
our NNLS result. Artificial bulk RNA-seq datasets were used to train the two neural network methods, DeepDecon and Scaden. We compared all benchmark methods with their default settings.
All methods were evaluated on the same testing datasets that were separate from the training
datasets used to train the above models. Details of implementations of these compared methods
are explained in Chapter 5.2.1.
Figure 2.1: Overview of DeepDecon decomposition method. A: Constructing simulated bulk
RNA-seq samples with different fractions of malignant cells. p is the fraction of malignant cells
in a simulated bulk sample. B: Training DeepDecon models using simulated bulk datasets with
different malignant cell fractions. Simulated bulk samples whose malignant cell fraction p ∈
[0.01i, 0.01j], i, j = 0, 10, · · · , 100 serve as the input to train a DeepDecon model Mi,j. C: Core
DeepDecon model structure. It consists of four fully connected layers with dropout layers. All
DeepDecon models in the iterative process share the same structure. D: Predicting the fraction
of malignant cells from a real bulk sample iteratively. DeepDecon designs an iterative strategy
to narrow down the prediction interval of given bulk samples. When a new experimental tissue
is given, DeepDecon first generates an initial malignant cell prediction P̂ using the whole range
model M0,100. In each iteration step, DeepDecon tries to limit the estimate to a smaller range,
denoted by [0.01i′, 0.01j′], based on the training datasets and the previous iteration prediction
value. If the prediction interval can be shortened, DeepDecon will update the prediction value
P̂ by a newly selected model Mi′,j′. Ultimately, DeepDecon generates the final prediction p_pred
when the stopping conditions are satisfied. The flowchart shows the iterative procedure and the
stopping conditions.
Figure 2.2A shows the scatter plots of true malignant fractions with predicted malignant fractions for each method in 3 simulated AML datasets (Figure 5.1 shows the scatter plot in all 15
simulated AML datasets). Figures 2.2B-D show the RMSE, correlation, and CCC metrics between
the true and estimated malignant cell fractions in all 15 simulated AML datasets. Table 2.1 also
gives the RMSE values and average performance ranks of each method in simulated and real
AML datasets. DeepDecon demonstrated exceptional performance in deconvoluting bulk RNA-seq data. It achieved the lowest RMSE values in 12 out of 15 simulated datasets. Even on the
three datasets where DeepDecon did not achieve the lowest RMSE values, its performance was
still highly competitive, with only a marginal difference between its RMSE values and the lowest
ones. Among the nine methods, we can see that the deep learning methods (i.e., DeepDecon and
Scaden) performed better than traditional methods (Bisque, MEAD, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS). They not only have lower RMSE values but also have higher
correlations and CCC values compared to other methods (see Figures 2.2B-D).
Figure 2.3A demonstrates the effectiveness of TF-IDF transformation on DeepDecon. Among
all 15 simulated AML datasets, DeepDecon with TF-IDF transformation exhibited lower RMSE
values in 14 datasets compared to DeepDecon without TF-IDF transformation. We used the paired
Wilcoxon signed-rank test to compare the RMSE values of DeepDecon with vs. without TF-IDF
normalization by combining all the 15 simulated datasets and the resulting p-value is 0.00099.
This suggests that the use of TF-IDF can enhance the predictive power of DeepDecon. We also
compared TF-IDF transformation with other existing normalization methods (FPKM and TPM
normalization) and the corresponding results are given in Figure 5.2. The figure shows that DeepDecon with TF-IDF normalization outperforms DeepDecon with FPKM and TPM normalization
methods. Figure 2.3B shows the effects of iterations on DeepDecon. Non-iterative DeepDecon
is only one neural network trained on datasets with malignant cell fractions ranging from 0.0 to
1.0. The RMSE values of iterative DeepDecon are lower than those of non-iterative DeepDecon
in all the 15 simulated AML datasets (p-value 0.00065). Figure 5.3 also gives the scatter plot of
iterative DeepDecon and non-iterative DeepDecon on simulated AML datasets. It shows non-iterative DeepDecon’s poor prediction accuracy when the malignant cell fraction is close to 0 or
1.
Figure 5.4 shows the cross-subject heterogeneity by presenting the uniform manifold approximation and projection (UMAP) [77] of all 15 subject datasets based on their scRNA-seq gene expression levels.
Each subject has a specific clinical outcome, leading to gene expression variations and model performance differences. The projection indicates the heterogeneity across different subjects and
further supports the need to simulate artificial bulk samples separately for each
single-cell subject.
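Table 2.1: DeepDecon outperforms other methods in simulated and real AML datasets. The root
mean square errors (RMSE) (%) for the estimated fraction of malignant cells in leave-one-subject-out cross-validation for DeepDecon, Scaden, Bisque, MEAD, RNA-Sieve, CIBERSORTx, MuSiC,
ESTIMATE, and NNLS on AML datasets. The boldfaced numbers indicate the best one.
Columns 210A D0 through 1012 D0 are subject IDs, followed by the mean, median, and mean rank across subjects; primary, recurrent, and BeatAML are real AML data.
| Method | 210A D0 | 328 D0 | 328 D113 | 328 D171 | 328 D29 | 329 D0 | 329 D20 | 419A D0 | 420B D0 | 475 D0 | 556 D0 | 707B D0 | 916 D0 | 921A D0 | 1012 D0 | mean | median | mean rank | primary | recurrent | BeatAML |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepDecon | 5 | 4 | 7 | 6 | 5 | 6 | 8 | 4 | 4 | 4 | 3 | 5 | 12 | 6 | 29 | 7.20 | 5.00 | 1.40 | 13 | 19 | 17 |
| Scaden | 19 | 15 | 22 | 12 | 11 | 10 | 24 | 28 | 15 | 17 | 18 | 30 | 47 | 16 | 52 | 22.40 | 18.00 | 4.40 | 17 | 21 | 20 |
| Bisque | 14 | 14 | 16 | 13 | 15 | 35 | 48 | 31 | 43 | 14 | 14 | 52 | 21 | 15 | 20 | 24.33 | 16.00 | 5.07 | 27 | 25 | 28 |
| MEAD | 5 | 11 | 23 | 33 | 14 | 13 | 29 | 39 | 23 | 29 | 31 | 37 | 45 | 21 | 53 | 27.07 | 29.00 | 5.60 | 27 | 28 | 28 |
| RNA-Sieve | 4 | 7 | 28 | 25 | 3 | 7 | 26 | 45 | 23 | 18 | 35 | 39 | 57 | 10 | 51 | 25.20 | 25.00 | 4.73 | 27 | 31 | 32 |
| CIBERSORTx | 15 | 16 | 16 | 13 | 14 | 5 | 29 | 24 | 21 | 21 | 16 | 38 | 39 | 13 | 54 | 22.27 | 16.00 | 4.60 | 29 | 28 | 23 |
| MuSiC | 6 | 9 | 25 | 28 | 8 | 12 | 31 | 52 | 27 | 26 | 41 | 33 | 42 | 10 | 48 | 26.53 | 27.00 | 5.33 | 27 | 29 | 23 |
| ESTIMATE | 34 | 28 | 33 | 33 | 31 | 29 | 24 | 29 | 34 | 34 | 34 | 38 | 38 | 30 | 34 | 32.20 | 33.00 | 6.67 | 22 | 23 | 29 |
| NNLS | 11 | 17 | 15 | 47 | 52 | 26 | 40 | 59 | 52 | 25 | 48 | 44 | 48 | 33 | 29 | 36.40 | 40.00 | 7.20 | 30 | 30 | 32 |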
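The summary columns of Table 2.1 can be reproduced from the per-subject values. As a small check, the snippet below recomputes the mean and median for the DeepDecon row using the 15 per-subject RMSE values as listed in the table; the variable names are illustrative.

```python
from statistics import mean, median

# Per-subject RMSE (%) for DeepDecon, in the column order of Table 2.1.
deepdecon_rmse = [5, 4, 7, 6, 5, 6, 8, 4, 4, 4, 3, 5, 12, 6, 29]

row_mean = round(mean(deepdecon_rmse), 2)
row_median = median(deepdecon_rmse)
```

The gap between the mean and the median reflects the single large value for subject 1012 D0, which pulls the mean up while leaving the median unchanged.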
We then investigated the decomposition performance of the nine methods using real bulk
RNA-seq data. We utilized all 15 artificial bulk RNA-seq datasets to train DeepDecon and Scaden.
To obtain the single-cell reference data for Bisque, MEAD, RNA-Sieve, MuSiC, and CIBERSORTx,
we selected single cells from all 15 scRNA-seq datasets and combined them together. Figure
2.4 shows the decomposition performance of the nine methods on real AML RNA-seq datasets
(‘TARGET-AML’ (primary and recurrent) and ‘BeatAML’). The RMSE values for each method
on real datasets were also given in Table 2.1. Figures 5.5, 5.6, and 5.7 show the scatter plots of
true malignant fractions with predicted malignant fractions for each method in real ‘primary’,
‘recurrent’, and ‘BeatAML’ AML datasets, respectively. DeepDecon outperforms Scaden, Bisque,
MEAD, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS in deconvolving the malignant
cell fraction on real AML datasets.
2.3.3 DeepDecon outperforms other deconvolution methods for other
cancer types
To test DeepDecon’s performance on other cancer types, we also applied DeepDecon to 9 neuroblastoma cancer patients [66]. Specifically, we constructed artificial bulk RNA-seq samples for
each subject separately. Then, we trained each DeepDecon model using the generated artificial
bulk RNA-seq datasets and evaluated the performance of DeepDecon and other methods using
leave-one-out cross-validation. Figures 2.5 and 2.6 show the boxplots of the RMSE, correlation,
and CCC values between the true and estimated cancer cell fractions among all methods on the
simulated neuroblastoma and HNSCC datasets. Tables 5.1 and 5.2 also give the RMSE values
and average performance ranks of each method on simulated and real neuroblastoma and HNSCC datasets. They show that DeepDecon still achieves the lowest RMSE values, the highest
correlations, and CCC values in neuroblastoma cancer, indicating that DeepDecon is robust and
applicable to other cancer types.
We also compared DeepDecon with regression-based methods such as CIBERSORTx, RNA-Sieve, and NNLS, which do not use subject-specific information in their original publications. We
designed a way to incorporate subject information in these methods and showed that DeepDecon
outperforms them in both ways. Details are given in Chapter 5.2.2 and Figure 5.8.
2.3.4 The impacts of gene expression perturbations and cell number per
bulk sample on the performance of DeepDecon
In Chapter 2.2, we discussed that bulk RNA-seq gene expression perturbations and the number of
cells N in a bulk sample can influence the accuracy of the decomposition algorithms. Figures 2.7,
5.9, and 5.10 show the influence of different levels of perturbations on the performance of various
decomposition methods. The RMSE values for most methods except Bisque slightly increase with
the noise level. DeepDecon consistently achieves the lowest RMSE among all methods under
different noise levels, showing its robustness.
We also investigated the influence of the number of cells in a bulk sample on the prediction
accuracy of DeepDecon using AML datasets. Figure 2.8 shows the RMSE values between true
and predicted malignant cell fractions under different combinations of cell numbers in a bulk
sample. More specifically, when the training model is fixed, the RMSE value decreases with the
cell number in bulk samples in the testing data. This shows that a better prediction performance
can be achieved when the testing bulk sample contains more single cells. If the number of cells
per bulk sample exceeds a certain threshold (> 3,000), the performance of the DeepDecon model
becomes stable. On the other hand, when the number of cells per bulk sample in testing datasets
is above 3000, the number of cells in the training dataset doesn’t have a strong influence on
DeepDecon. The RMSE values are stable, showing the robustness of DeepDecon to the number
of cells in the training datasets when the number of cells in testing data is above 3000.
DeepDecon was trained on a high-performance computing (HPC) cluster node with a 6-core Xeon-2640 CPU;
training one model took ∼20 minutes and predicting on one bulk tissue took ∼3 s.
2.4 Discussion
DeepDecon is an innovative deep learning-based algorithm that leverages single-cell RNA sequencing (scRNA-seq) information to accurately predict cancer cell fractions. Due to the latent
feature engineering capabilities of neural networks, which can automatically extract nonlinear
features in the hidden layers, DeepDecon can achieve superior performance by incorporating
all input genes (∼10^4). We showed that DeepDecon is applicable to multiple cancer datasets.
DeepDecon can iteratively predict malignant cell fractions with lower RMSE compared to other
methods, making it a powerful tool for accurate and reliable prediction of cancer cell fractions.
DeepDecon adopts a term frequency-inverse document frequency (TF-IDF) approach to weigh
the expression of different genes, which addresses the issue of imbalanced expression levels across
genes. In addition, our algorithm employs an iterative approach to refine the prediction, as opposed to the averaging of three deep neural network outputs used by Scaden. These two steps have
significantly improved the estimation accuracy of malignant cell fractions in bulk RNA-seq samples. By iteratively using small-range models Mi,j to predict the same bulk samples, where these
models share similar structures but work in different malignant fraction ranges, we have achieved
better prediction accuracy compared to using only one initial model M0,100. We have also evaluated DeepDecon’s performance with respect to gene expression perturbations and varying numbers of cells per bulk sample. We showed that DeepDecon is robust to gene expression perturbations and the number of cells in the training set if the number of cells in the testing data is at least
3000. These findings make DeepDecon a valuable tool for the accurate and reliable prediction of
malignant cells.
DeepDecon accurately estimates the malignant cell fraction in a tissue based on their transcriptomic features from bulk RNA-Seq data. In particular, for AML, this novel approach could
be used to accurately detect malignant clones in patients that appear to be in complete remission
by standard morphology and flow cytometric analysis. DeepDecon can also be used to measure
residual disease in AML patients with morphological remission and classify patients into different
phases such as accelerated or blast phase crisis depending on malignant cell fractions.
While DeepDecon can achieve good performance on different cancer samples and tissues, we
note that there are still limitations to this deep learning-based method. First, the quality of training data is very important. If the number of subjects is small or the single-cell data is dominated
by one specific cell type, DeepDecon can learn less information about real cell fraction distribution and cannot generalize and represent the latent features well. Second, experimental bias and
noise can greatly limit decomposition accuracy. These limitations can potentially be alleviated
by including more training subjects to increase the training set size and by reducing noise in
the expression data. However, more computational resources will be needed to train the DeepDecon models. How to efficiently train the model with large training data is a topic for further
research. Third, DeepDecon constructed simulated bulk RNA-seq datasets by assuming random
sampling of single cells from the tissue. However, it should be noted that simulated bulk RNA-seq
is not necessarily the same as real bulk RNA-seq samples. A potential limitation of DeepDecon
is that the exact cell type information may not be available. Preparation methods for generating
single-cell suspensions may result in the underrepresentation of certain cell types, particularly
those that are rare or do not survive dissociation. Therefore, the resulting cell composition may
differ from that in the real tissues. However, it is less likely that a particular cell type is consistently missing
from all single-cell suspensions when multiple training datasets are used. Since we
analyze both malignant and normal cell types in this study, this is less of an issue than in a general cell
type deconvolution study where a large number of cell types are considered in solid tissues. We
alleviate this issue further by only selecting subjects with more than 100 malignant cells and 100
normal cells. In future studies involving multiple cell types, we could adopt similar requirements
and add the cell type ‘Unknown’ to cover potential missing cell types.
We plan to further improve the performance and applicability of DeepDecon by implementing
several key modifications to the existing methodology. Firstly, we want to extend DeepDecon’s
capacity to include multiple cell types or subtypes. For instance, we considered two main cell
types in this study: malignant and normal cells. However, it has been reported that both cell types
consist of molecular subtypes [78]. Thus, it is important to extend DeepDecon to multiple cell
types. Secondly, it is essential to consider both known and unknown cell types in deconvolution.
Cell composition derived from biological experiments can contain cells that do not belong to any
of the existing cell types. These cells are labeled as unknown cell types and have more complex
gene expression patterns. Thirdly, the current DeepDecon model takes all genes into account.
Selecting genes that are only relevant to the cell types of interest may further increase prediction
accuracy.
Figure 2.2: DeepDecon outperforms other methods in predicting malignant cell type fractions on AML simulated bulk RNA-seq datasets. A: Scatter plots of true versus predicted
malignant cell fractions based on DeepDecon (D), Scaden (S), CIBERSORTx (C), Bisque (B), ESTIMATE (E), MuSiC (MU), MEAD (M), RNA-Sieve (R), and NNLS (N) on three selected AML
simulated datasets. The x-axis is the true fraction and the y-axis is the predicted fraction. The
numbers on each subplot are the root mean square error (RMSE) values between the true and
predicted fraction of each method. B: Boxplots of RMSE values between the predicted and true
fractions of malignant cells on 15 AML simulated bulk RNA-seq datasets. C: Boxplots of Pearson’s correlation coefficient (r) values between the predicted and true fractions of malignant cells
on 15 AML simulated bulk RNA-seq datasets. D: Lin’s concordance correlation coefficient (CCC)
values between the predicted and true fractions of malignant cells on 15 AML simulated bulk
RNA-seq datasets. The correlation and CCC values of NNLS contain values that are not available
(NAs). Therefore, paired tests of correlation and CCC values between DeepDecon and NNLS are
not available. * 0.01 < p-value ≤ 0.05, ** 0.001 < p-value ≤ 0.01, *** p-value ≤ 0.001.
Figure 2.3: The TF-IDF transformation and the iterative strategy improve the performance of DeepDecon. A: Bar plots of RMSE values of DeepDecon models with and without
TF-IDF transformation. DeepDecon with TF-IDF transformation achieves the lowest RMSE values in 14 out of 15 simulated AML datasets. B: Bar plots of RMSE values on DeepDecon models
with and without the iterative strategy. Iterative DeepDecon achieves the lowest RMSE values in
all 15 simulated AML datasets. The x-axis is the simulated AML dataset. The y-axis is the RMSE
value.
Figure 2.4: DeepDecon outperforms other deconvolution methods on real AML RNA-seq
datasets. Boxplots of Root mean square error (RMSE) (A), Pearson’s correlation coefficient (PCC)
(B), and Lin’s concordance correlation coefficient (CCC) (C) values between the predicted and
true fractions of malignant cells. Each bar in the boxplots contains three points corresponding to
three real AML bulk RNA-seq datasets, namely ‘primary’, ‘recurrent’, and ‘BeatAML’ datasets.
Figure 2.5: DeepDecon outperforms other deconvolution methods on the simulated neuroblastoma datasets. Boxplots of Root mean square error (RMSE) (A), Pearson’s correlation
coefficient (PCC) (B), and Lin’s concordance correlation coefficient (CCC) (C) values between
the predicted and true fractions of malignant cells on 9 simulated neuroblastoma bulk RNA-seq
datasets. Each bar in the boxplot contains 9 points corresponding to 9 simulated neuroblastoma
bulk RNA-seq datasets. Each simulated neuroblastoma dataset contains bulk samples constructed
from only one subject. The correlation and CCC values of MEAD, RNA-Sieve, and NNLS
contain values that are not available (NAs). Therefore, paired tests of correlation and CCC values between
DeepDecon and MEAD, RNA-Sieve, and NNLS are not available. * 0.01 < p-value ≤ 0.05, **
0.001 < p-value ≤ 0.01.
Figure 2.6: DeepDecon outperforms other deconvolution methods on the simulated HNSCC dataset. Boxplots of Root mean square error (RMSE) (A), Pearson’s correlation coefficient
(PCC) (B), and Lin’s concordance correlation coefficient (CCC) (C) values between the predicted
and true fractions of malignant cells on 27 simulated HNSCC bulk RNA-seq datasets. Each bar
in the boxplot contains 27 points corresponding to 27 simulated HNSCC bulk RNA-seq datasets.
Each simulated HNSCC dataset contains bulk samples constructed from only one subject. The
correlation and CCC values of MEAD, RNA-Sieve, and NNLS contain values that are not available (NAs).
Therefore, paired tests of correlation and CCC values between DeepDecon and MEAD,
RNA-Sieve, and NNLS are not available. * 0.01 < p-value ≤ 0.05, ** 0.001 < p-value ≤ 0.01,
*** 1.00e−04 < p-value ≤ 1.00e−03, **** p-value ≤ 1.00e−04.
[Figure 2.7 plot. x-axis: method (DeepDecon, Scaden, CIBERSORTx, Bisque, ESTIMATE, MuSiC, MEAD, RNA-Sieve, NNLS); y-axis: RMSE (ticks 0.1 to 0.6); legend: noise level (0.00, 0.01, 0.05, 0.10).]
Figure 2.7: DeepDecon is robust to gene expression perturbations. Boxplots of RMSE values between the true and estimated malignant cell fractions on simulated AML datasets under
different noise levels. We added random noise generated from a Gaussian distribution with zero
mean and variance equal to α (α = 0.01, 0.05, 0.1) times the gene expression level for each gene
in each sample. We also randomly selected 10% of the genes in each sample and set their
expression values to 0. Each bar contains a total of 15 points, representing 15 separate AML
datasets. The color represents the noise level α.
Figure 2.8: DeepDecon is robust to the number of cells per bulk sample when the number
of cells in testing data is above 3000. The x-axis is the trained DeepDecon model. The subscript
is the number of cells per bulk sample. DeepDecon$_N$ means a DeepDecon model trained on a
dataset in which one bulk sample consists of N single cells. The y-axis is the RMSE value between
the true and estimated malignant cell fractions. The color represents the number of single cells
per bulk sample in the testing data.
Chapter 3
DeepDeconUQ Estimates Malignant Cell Fraction
Prediction Intervals in Bulk RNA-seq Tissue
3.1 Introduction
Recent advancements in next-generation sequencing methodologies, particularly bulk RNA sequencing (RNA-seq) and single-cell RNA sequencing (scRNA-seq), have substantially driven progress
across biological and medical research domains [50, 51, 49, 53]. One prominent application is to
estimate malignant cell fraction from bulk RNA-seq samples [42, 14, 44, 45, 43]. This process
typically involves using regression-based methods that leverage malignant and normal expression data (e.g., scRNA-seq) as a reference profile [23]. Most available estimation methods merely
provide point estimates of cell-type proportions from bulk RNA-seq data [42, 14]. The accuracy
of these methods often depends on the choice and quality of the reference profile [45]. Furthermore, limited efforts have been made to investigate and quantify the impacts of uncertainties in
estimated cell-type proportions, which can critically impact downstream analyses in malignant-cell-associated disease research, leading to potential errors in findings [79]. Uncertainty quantification of the estimated malignant cell fraction is thus essential, as is the quantification of
prediction accuracy.
Uncertainty in malignant cell fraction estimation can be quantified through prediction intervals, which provide a range within which the true cell-type composition is likely to fall with a
high probability [37, 80]. An ideal procedure for generating prediction intervals should satisfy
two properties. The first property is validity [36]. It should provide valid coverage in finite samples without making strong distributional assumptions, such as normality. The second property
is discrimination [37]. The predicted intervals should be as narrow as possible at each point in the
input space so that the predictions will be informative. When the data is heteroscedastic, getting
valid but narrow prediction intervals requires adjusting the lengths of the intervals according to
the local variability at each query point in the predictor space.
RNA-Sieve [43] and MEAD [44] are two recently proposed statistical methods
that estimate cell-type proportions and, at the same time, quantify the uncertainties of the estimated cell proportions. RNA-Sieve [43] is a likelihood-based deconvolution
method. It assumes that the estimates of cell-type fractions are normally distributed around the
true fractions and that the errors arising from the gene expression profile and the observed bulk
gene expressions are independent. Under these assumptions, the confidence intervals of the cell proportions
can be calculated through likelihood estimation. However, these assumptions may not hold consistently in practice, as gene expression levels within samples (either bulk or single-cell) often
exhibit inter-gene dependencies due to coregulation mechanisms [81]. MEAD [44], another statistical inference approach, incorporates a gene-gene dependency structure to improve the accuracy of cell proportion estimates. MEAD asserts that the estimated proportions follow asymptotic
normal distributions, with solutions constrained to non-negative values. While MEAD considers
the correlation across different genes, the assumption that individuals in the bulk and reference
data are from the same population may not hold universally, especially in contexts like cancer
research, where gene expression levels vary greatly in different populations. Moreover, the dependence matrix used in MEAD is highly dependent on the choice of bulk samples and cannot
be generated when there is only one single bulk sample to decompose.
In this study, we introduce DeepDeconUQ, a deep learning model that is distribution-agnostic
and designed to estimate prediction intervals for malignant cell compositions in bulk RNA-seq
data. DeepDeconUQ trains a neural network on simulated bulk RNA-seq data, avoiding parametric assumptions about bulk gene expression distributions. Through conformalized quantile
regression [36], it provides both valid and precise prediction intervals for malignant cell fractions.
Specifically, DeepDeconUQ employs scRNA-seq data to simulate artificial bulk RNA-seq datasets
with predefined malignant cell proportions. These simulated datasets are then used to train a
quantile regression neural network, which predicts the lower and upper bounds of malignant
cell proportions in new cancer tissue samples. Following this, a conformal prediction process is
applied to a separate calibration dataset of artificial bulk RNA-seq to adjust the intervals generated by the neural network. This conformalization step ensures that the estimated malignant
cell proportions achieve stronger coverage guarantees. Benchmarking with both simulated and
real datasets demonstrates that DeepDeconUQ surpasses existing methods in performance and
remains robust against perturbations in gene expression levels. By leveraging scRNA-seq data,
employing deep neural networks, and utilizing conformalized quantile regression, DeepDeconUQ
achieves superior performance in cancer cell deconvolution analysis with uncertainty quantification.
3.2 Materials and Methods
3.2.1 Datasets
To initially train and test DeepDeconUQ, we utilized simulated datasets derived from Acute
Myeloid Leukemia (AML) single-cell data previously used in DeepDecon [28]. The single-cell
AML datasets were downloaded from Gene Expression Omnibus (GEO) with accession number
GSE116256 [62]. We selected 15 subjects, totaling 38,410 cells, to simulate artificial bulk RNA-seq
datasets, employing the same preprocessing and simulation procedures established in DeepDecon. Preprocessing of scRNA-seq data followed the workflow of Scanpy (v.1.7.2), a widely-adopted
Python package for single-cell gene expression analysis [63]. Initially, cells with fewer than 500
detected genes and genes expressed in fewer than five cells were filtered out (see Figure 5.11). Further, gene expression count matrices were processed to remove extreme outliers (see Table 5.6).
Gene expression values were normalized using Scanpy’s ‘normalize_total’ function to ensure uniform total counts across cells, which mitigates discrepancies arising from varying library sizes.
This produced a normalized matrix of all filtered cells and genes, ready for the generation of simulated bulk data. Ultimately, 30,000 simulated bulk samples (2,000 per subject) were generated
for training and testing DeepDeconUQ.
We further assessed DeepDeconUQ using real AML bulk RNA-seq datasets. Real AML data
were collected from the GDC Data Portal (https://portal.gdc.cancer.gov/) with the project name
“TARGET-AML”. The AML samples were further divided into primary and recurrent AML categories according to different cancer stages. As a result, there were a total of 117 primary AML
samples and 38 recurrent AML samples. For these bulk RNA-seq datasets, ground-truth cancer
cell fractions via flow cytometry are available. Additionally, an independent real AML dataset,
“BeatAML” [64], was collected from cBioPortal [65]. “BeatAML” contains a total of 451 bulk RNA-seq samples, 300 of which have corresponding ground-truth cancer cell fractions. This dataset
used the “SureSelect” sequencing platform, which is different from the sequencing platform for
the single-cell data in the “TARGET-AML” dataset (see Table 5.7). The inclusion of these diverse
datasets allowed us to evaluate DeepDeconUQ’s performance across different sequencing platforms and data sources.
3.2.2 Generating artificial bulk RNA-seq datasets
To generate artificial bulk RNA-seq samples, we used the previously described scRNA-seq datasets,
simulating each sample with predetermined malignant cell fractions for training the DeepDeconUQ model. Specifically, for each artificial bulk sample, we set a fixed total cell count, $N$, and a
malignant cell number $n_m$ was randomly sampled from a uniform distribution between 0 and $N$.
Subsequently, $n_m$ malignant cells and $N - n_m$ normal cells were randomly drawn from the same
scRNA-seq dataset. If the available malignant or normal cells were fewer than $n_m$ or $N - n_m$,
respectively, cells were sampled with replacement, meaning that each cell was uniformly drawn
from all single cells in the dataset; otherwise, cells were sampled without replacement to ensure no duplicates. Importantly, cells from different subjects (i.e., individuals) were not combined
within a single artificial sample to maintain individual-specific gene expression profiles. This
principle was motivated by two reasons. Firstly, the aim was to safeguard within-subject relationships among genes by preserving the unique gene expression patterns inherent to each subject. Secondly, the intention was to capture the variability between subjects, commonly referred
to as cross-subject heterogeneity [45]. After generating an artificial bulk sample by summing
the expression values of all selected cells, it was labeled according to the malignant cell fraction, $n_m/N$. This process was repeated for each scRNA-seq dataset, resulting in a corresponding
artificial bulk RNA-seq dataset with $T$ samples, each tagged with a known malignant cell proportion. Here, we set $N = 3{,}000$ and $T = 200$, consistent with the configuration in DeepDecon
[28]. This sampling strategy serves as a substantial data generation resource for training and
evaluating DeepDeconUQ.
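The sampling procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in this study: the function name `simulate_bulk_sample`, the toy matrix, and the choice to draw the malignant cell number as an integer uniformly from {0, ..., N} are all hypothetical.

```python
import numpy as np


def simulate_bulk_sample(expr, is_malignant, n_cells, rng):
    """Build one artificial bulk sample from a single subject's scRNA-seq data.

    expr: (cells x genes) count matrix for ONE subject (subjects are never mixed).
    is_malignant: boolean vector marking the malignant cells.
    Returns the summed expression vector and the label n_m / N.
    """
    malignant_idx = np.flatnonzero(is_malignant)
    normal_idx = np.flatnonzero(~is_malignant)

    # Assumption: n_m is an integer drawn uniformly from {0, ..., N}.
    n_m = int(rng.integers(0, n_cells + 1))
    n_normal = n_cells - n_m

    # Sample with replacement only when the pool is smaller than the request.
    pick_m = rng.choice(malignant_idx, size=n_m, replace=len(malignant_idx) < n_m)
    pick_n = rng.choice(normal_idx, size=n_normal, replace=len(normal_idx) < n_normal)

    picked = np.concatenate([pick_m, pick_n])
    bulk = expr[picked].sum(axis=0)  # sum expression over all selected cells
    return bulk, n_m / n_cells


# Toy data (hypothetical): 50 cells x 20 genes from one subject, 30 malignant cells.
rng = np.random.default_rng(0)
toy_expr = rng.poisson(2.0, size=(50, 20))
toy_labels = np.arange(50) < 30
bulk, fraction = simulate_bulk_sample(toy_expr, toy_labels, n_cells=40, rng=rng)
```

Repeating the call T times per subject would give one artificial bulk dataset per subject, each sample labeled with its malignant fraction.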
3.2.3 Data Processing
Before training, the artificial bulk RNA-seq samples were preprocessed to ensure alignment between training and prediction data. Only genes present in both the training and testing datasets
were retained, and genes with low expression variance (below 0.1) were excluded. To further
standardize the data, a TF-IDF transformation was applied to the raw RNA-seq count matrix.
This transformation, commonly used in information retrieval and text mining [69, 70], starts by
calculating the ‘term frequency (TF)’ for each gene in each sample by normalizing the gene expression profile (see Formula 3.1). The ‘inverse document frequency (IDF)’ was then calculated
by dividing the total number of bulk samples by the total gene expression values of the gene
across all samples (see Formula 3.2), followed by log-transformation and multiplication by the
TF value. The TF-IDF transformation weights genes with lower expression levels more heavily,
which helps to adjust for the imbalanced expression levels across genes [71].
$$\mathrm{TF}(X_{i,j}) = \frac{X_{i,j}}{\sum_{j} X_{i,j}}, \tag{3.1}$$
$$\mathrm{IDF}(G_j) = \log\left(\frac{T}{\sum_{i} X_{i,j}} + 1\right), \tag{3.2}$$
where $X_{i,j}$ is the expression level of the $j$th gene in the $i$th sample, $G_j$ indicates the $j$th gene, and $T$ is the number of bulk samples.
Let $X'$ denote the gene expression matrix after TF-IDF transformation. A MinMax transformation was applied to the resulting expression matrix $X'$ to scale the expression values to
the [0, 1] range (see Formula 3.3). This is a common practice in deep learning models that use
gradient-based optimization algorithms [45, 72].
$$X^{\mathrm{norm}}_i = \frac{X'_i - \min(X'_i)}{\max(X'_i) - \min(X'_i)}, \tag{3.3}$$
where $X'_i$ is the $i$th row of $X'$ and $X^{\mathrm{norm}}_i$ is the $i$th row of the resulting expression matrix after
the MinMax transformation.
This preprocessing procedure is an important step in ensuring the quality and consistency
of the data used for training the deep learning models. Although the input datasets varied between platforms and protocols, we utilized the same processing workflow to make it easy to apply
DeepDeconUQ to other datasets.
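As an illustration of Formulas 3.1 to 3.3, the following NumPy sketch applies the TF-IDF and MinMax steps to a small samples-by-genes matrix. It reads Formula 3.2 as log(T / sum_i X_ij + 1), which is an assumption about where the "+ 1" sits, and it assumes that no sample or gene has zero total expression. The function name `tfidf_minmax` and the toy matrix are hypothetical.

```python
import numpy as np


def tfidf_minmax(X):
    """Apply TF-IDF followed by row-wise MinMax scaling to a (samples x genes) matrix."""
    X = np.asarray(X, dtype=float)
    T = X.shape[0]                                   # number of bulk samples
    tf = X / X.sum(axis=1, keepdims=True)            # Formula 3.1: normalize each sample
    idf = np.log(T / X.sum(axis=0) + 1.0)            # Formula 3.2 (assumed placement of + 1)
    X_prime = tf * idf                               # TF-IDF matrix X'
    row_min = X_prime.min(axis=1, keepdims=True)
    row_max = X_prime.max(axis=1, keepdims=True)
    return (X_prime - row_min) / (row_max - row_min)  # Formula 3.3: scale each row to [0, 1]


# Toy matrix (hypothetical): 2 bulk samples x 3 genes with very different expression levels.
X_toy = np.array([[100.0, 10.0, 1.0],
                  [50.0, 20.0, 2.0]])
X_norm = tfidf_minmax(X_toy)
```

Because the IDF factor is larger for genes with small total expression, the weighted values of lowly and highly expressed genes end up closer together than the raw counts.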
3.2.4 DeepDeconUQ
3.2.4.1 Problem formulation
Suppose we are given $n$ bulk RNA-seq gene expression samples $\{(X_i, Y_i)\}_{i=1}^{n}$, where $X_i \in \mathbb{R}^p$
represents the $i$th bulk RNA-seq gene expression vector with $p > 0$ features (genes) and $Y_i = (y_i, 1 - y_i)$
is the corresponding $i$th cell fraction vector of malignant and normal cells. Our aim
is to construct a distribution-agnostic prediction interval $\hat{C}(X_{n+1})$ that contains the malignant
cell fraction $y_{n+1}$ for a new bulk RNA-seq sample $X_{n+1}$. Specifically, given a desired significance
level $\alpha$, the prediction interval $\hat{C}(X_{n+1})$ should contain the true malignant cell fraction
$y_{n+1}$ with a user-specified coverage probability of at least $1 - \alpha$:
$$P\{y_{n+1} \in \hat{C}(X_{n+1})\} \ge 1 - \alpha, \tag{3.4}$$
for any joint distribution $P_{XY}$ and any sample size $n$. Meanwhile, the estimated prediction interval $\hat{C}(X_{n+1})$ should be as narrow as possible while achieving the desired coverage level.
3.2.4.2 Quantile regression
Methods like DeepDecon [28] formulate the problem as a regression task, typically addressed using variations of non-negative least squares or more advanced machine learning methodologies.
The estimation of cell type proportions is often solved by minimizing squared residuals over the
$n$ training points $\{(X_i, Y_i)\}_{i=1}^{n}$ (see Formula 3.5):
$$\hat{\mu}(x) = \mu(x; \hat{\theta}), \qquad \hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{n} \sum_{i=1}^{n} (Y_i - \mu(X_i; \theta))^2 + R(\theta), \tag{3.5}$$
where $\theta$ are the parameters of the regression model, $\mu(x; \theta)$ is the learned regression model, and
$R(\theta)$ is a regularization module.
Similarly, quantile regression estimates the conditional quantiles of cell type proportions, assuming that the $\alpha$th conditional quantile is associated with gene expression profiles. A conditional
quantile function $q_\alpha$ is learned from $n$ training samples $\{(X_i, Y_i)\}_{i=1}^{n}$ at a specified quantile (or
significance) level $\alpha$ (see Formula 3.6).
$$\hat{q}_\alpha(x) = f(x; \hat{\theta}), \qquad \hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{n} \sum_{i=1}^{n} \rho_\alpha(Y_i, f(X_i; \theta)) + R(\theta), \tag{3.6}$$
where $f(x; \theta)$ is the quantile regression function and can be learned through neural networks.
$\rho_\alpha$ is the quantile (pinball) loss [82], defined as
$$\rho_\alpha(y, \hat{y}) = \begin{cases} \alpha (y - \hat{y}) & \text{if } y - \hat{y} > 0, \\ (1 - \alpha)(\hat{y} - y) & \text{otherwise,} \end{cases} \tag{3.7}$$
where $y$ and $\hat{y}$ are the observed and predicted cell type fractions, and $\alpha \in (0, 1)$ is the corresponding quantile (significance) level. Pinball loss is a skewed transformation of the absolute value
function and is commonly used in quantile regression [36].
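To make the pinball loss concrete, the sketch below implements Formula 3.7 in NumPy and checks numerically that the constant minimizing the mean loss at level 0.9 lies close to the empirical 0.9 quantile. The function name `pinball_loss`, the uniform toy data, and the grid search are illustrative assumptions, not part of the DeepDeconUQ implementation.

```python
import numpy as np


def pinball_loss(y, y_hat, alpha):
    """Element-wise quantile (pinball) loss of Formula 3.7."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    # Under-prediction (y > y_hat) is weighted by alpha, over-prediction by 1 - alpha.
    return np.where(diff > 0, alpha * diff, (1.0 - alpha) * (-diff))


# Toy check (hypothetical data): search a grid of constant predictions.
rng = np.random.default_rng(0)
y = rng.uniform(size=1000)
grid = np.linspace(0.0, 1.0, 101)
mean_loss = [pinball_loss(y, c, alpha=0.9).mean() for c in grid]
best_constant = grid[int(np.argmin(mean_loss))]
```

With $\alpha = 0.9$, under-predicting by 0.3 costs 0.27 while over-predicting by 0.3 costs only 0.03, which is what pushes the fitted function toward the upper quantile.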
Given a significance level $\alpha$, we can get the lower bound and upper bound predictions $\hat{q}_{\alpha_{lo}}$ and $\hat{q}_{\alpha_{hi}}$
through quantile regression. Here, $\alpha_{lo} = \alpha/2$ and $\alpha_{hi} = 1 - \alpha/2$. Then, $\hat{C}(X_{n+1}) = [\hat{q}_{\alpha_{lo}}(X_{n+1}), \hat{q}_{\alpha_{hi}}(X_{n+1})]$ can be
used as the estimate of the true prediction interval $C(X_{n+1})$. The simplicity and generality of
this approach make quantile regression highly versatile, allowing for the integration of various
machine learning techniques to model and learn $q_\alpha$ [83, 84, 36].
3.2.4.3 Conformal prediction
The quantile regression method is widely applicable and often works well in practice, yielding
intervals that are adaptive to heteroscedasticity. However, it is not guaranteed to satisfy the validity property when the true prediction interval $C(X_{n+1})$ is estimated by the prediction interval
$\hat{C}(X_{n+1})$. Conformal prediction [39] was developed to address this problem.
Specifically, split (inductive) conformal prediction [85, 86], which is general and whose computational cost is a small fraction of that of full conformal prediction, helps construct prediction
intervals that are valid and discriminative. We borrowed the idea from Romano et al. [36] and
combined DeepDecon with conformalized quantile regression (CQR) to obtain valid and discriminative cell fraction prediction intervals on bulk RNA-seq samples. We refer to the resulting algorithm
as DeepDeconUQ.
The split conformal method begins by splitting the training data into two disjoint subsets: a
proper training set $\{(X_i, Y_i) : i \in I_1\}$ and a calibration set $\{(X_i, Y_i) : i \in I_2\}$. We then apply
a neural network to estimate the lower and upper quantile functions, $\hat{q}_{\alpha_{lo}}$ and $\hat{q}_{\alpha_{hi}}$, as described
in Equation 3.6. This model’s architecture is similar to our previously developed cell fraction
estimation framework, DeepDecon [28], and will be further explained in the model structure
subsection.
Next, we compute conformity scores that quantify the error made by the prediction interval.
The scores are evaluated on the calibration set as follows:
$$E_i := \max(\hat{q}_{\alpha_{lo}}(X_i) - Y_i,\; Y_i - \hat{q}_{\alpha_{hi}}(X_i)), \quad i \in I_2. \tag{3.8}$$
Finally, given new input data $X_{n+1}$, we construct the prediction interval of $Y_{n+1}$ as:
$$\hat{C}(X_{n+1}) = [\hat{q}_{\alpha_{lo}}(X_{n+1}) - Q_{1-\alpha}(E, I_2),\; \hat{q}_{\alpha_{hi}}(X_{n+1}) + Q_{1-\alpha}(E, I_2)], \tag{3.9}$$
where $Q_{1-\alpha}(E, I_2)$ is the $(1 - \alpha/2)(1 + 1/|I_2|)$th quantile of $\{E_i : i \in I_2\}$. In this context, we select
$\alpha/2$ due to the presence of two distinct cell types within the dataset (malignant and normal), as
suggested in multivariate quantile regression [87]. Moreover, Romano et al. demonstrated that
when conformity scores $E_i$ are almost surely unique, the prediction interval achieves an approximate state of perfect calibration [36].
The specific steps of DeepDeconUQ are given in Algorithm 2.
Algorithm 2 DeepDeconUQ
Require: Bulk RNA-seq samples with labels $(X_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}^2$, $1 \le i \le n$
Significance level $\alpha$
Testing bulk sample $X_{n+1}$
Ensure: Cell fraction prediction interval $\hat{C}(X_{n+1})$ for $X_{n+1}$.
1: Randomly split $n$ bulk RNA-seq samples into two disjoint sets, $I_1$ and $I_2$.
2: Fit two conditional quantile functions $\{\hat{q}_{\alpha_{lo}}, \hat{q}_{\alpha_{hi}}\}$ according to Formula 3.6 on training set $I_1$.
3: Compute conformity scores $E_i$ according to Formula 3.8 on calibration set $I_2$.
4: Compute $Q_{1-\alpha}(E, I_2)$, the $(1 - \alpha/2)(1 + 1/|I_2|)$th quantile of $\{E_i : i \in I_2\}$.
5: Compute prediction interval $\hat{C}(X_{n+1})$ according to Formula 3.9 for $X_{n+1}$.
Lei et al. advocated for selecting a larger I1 compared to I2 to improve the accuracy of estimated quantile functions [88]. Given the size of our training dataset (30,000 simulated samples),
we opted for a 7:3 split ratio between the training and calibration sets to optimize the model
performance.
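The calibration and adjustment steps (Formulas 3.8 and 3.9) can be sketched as follows, assuming that lower and upper quantile predictions are already available for the calibration samples. This is a simplified illustration rather than the DeepDeconUQ code: the function names, the toy data with deliberately too-narrow bounds, and the clipping of the order-statistic index to the calibration set size are all assumptions.

```python
import math

import numpy as np


def conformal_offset(q_lo_cal, q_hi_cal, y_cal, level):
    """Empirical quantile of the conformity scores on the calibration set.

    level: quantile level before the finite-sample correction, e.g. 1 - alpha / 2
           as in the text (Romano et al. use 1 - alpha).
    """
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)   # Formula 3.8
    m = len(scores)
    # The level * (1 + 1/m) quantile is the ceil(level * (m + 1))-th smallest score;
    # clipping to m is a simplification for very small calibration sets.
    k = min(math.ceil(level * (m + 1)), m)
    return np.sort(scores)[k - 1]


def conformalize(q_lo_new, q_hi_new, offset):
    """Widen (or shrink, if offset < 0) the predicted bounds as in Formula 3.9."""
    return q_lo_new - offset, q_hi_new + offset


# Toy example (hypothetical): the "model" bounds are too narrow for the noise level.
rng = np.random.default_rng(0)
alpha = 0.10


def toy_data(m):
    center = rng.uniform(0.2, 0.8, size=m)
    y = center + rng.normal(0.0, 0.05, size=m)
    return center - 0.03, center + 0.03, y


q_lo_cal, q_hi_cal, y_cal = toy_data(1000)
offset = conformal_offset(q_lo_cal, q_hi_cal, y_cal, level=1 - alpha / 2)

q_lo_new, q_hi_new, y_new = toy_data(1000)
lower, upper = conformalize(q_lo_new, q_hi_new, offset)
coverage = np.mean((lower <= y_new) & (y_new <= upper))
```

In this toy setting the uncalibrated bounds cover far fewer than 90% of the new samples, and the positive offset learned on the calibration set restores the coverage.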
3.2.4.4 Model structure
The main neural network architecture of DeepDeconUQ is similar to DeepDecon, which consists
of two main components. The first component consists of four fully connected layers with a
dropout regularization between each layer, and the rectified linear unit (ReLU) is used as the activation function in every internal layer. The second component differs from DeepDecon, which
uses a softmax function to predict the malignant and normal cell fractions. To reduce the computational cost, instead of fitting two separate neural networks to estimate the lower and upper
quantile functions, we replaced the original one-dimensional estimate of the malignant cell fraction with a two-dimensional estimate of the lower and upper quantiles. In this way, most of the
network parameters are shared between the two quantile estimators. All model parameters were
optimized using the Adam optimization algorithm [75] with a learning rate of 0.0001 and a batch
size of 128. The model was trained as a regression task, with the pinball loss (see Formula 3.7)
as the loss function. Hyperparameters that are tested and tuned in DeepDecon were also used in
DeepDeconUQ.
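The shared-trunk design described above can be pictured with a plain NumPy forward pass. This is only a structural sketch: the hidden layer widths, the weight initialization, the dropout rate, and the linear two-unit output head are hypothetical choices, and no training loop (Adam, pinball loss) is shown.

```python
import numpy as np


def init_params(n_genes, hidden=(512, 256, 128, 64), seed=0):
    """Random weights for four fully connected layers plus a two-unit output head."""
    rng = np.random.default_rng(seed)
    sizes = (n_genes, *hidden, 2)   # the last layer outputs (lower, upper) quantiles
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, k)), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]


def forward(params, x, drop_rate=0.0, rng=None):
    """Return (q_lo, q_hi) for a batch x of shape (samples, genes)."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)               # fully connected layer + ReLU
        if drop_rate > 0.0 and rng is not None:      # inverted dropout, training only
            keep = rng.random(h.shape) >= drop_rate
            h = h * keep / (1.0 - drop_rate)
    W, b = params[-1]
    out = h @ W + b                                  # both quantiles share the trunk
    return out[:, 0], out[:, 1]


# Toy batch (hypothetical): 5 preprocessed bulk samples with 100 genes each.
x_toy = np.random.default_rng(1).random((5, 100))
q_lo, q_hi = forward(init_params(n_genes=100), x_toy)
```

Only the final weight matrix differs between the two quantile estimates, so fitting both bounds costs roughly the same as fitting a single network.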
3.2.5 The impact of gene expression perturbations on DeepDeconUQ
To test the model’s robustness to gene expression perturbations, we introduced varying levels of
Gaussian noise to the expression levels within the simulated datasets. Specifically, for each gene
in each sample, random noise was added, drawn from a Gaussian distribution with a mean of zero.
The variance of this noise was proportional to the expression level of each gene, set at λ times
the gene expression level, where λ was assigned values of 0.01, 0.05, and 0.1 (see Formula 3.10).
This approach allowed us to systematically examine the model stability and predictive accuracy
under controlled levels of expression variability.
$$X^{\mathrm{noise}}_{ij} = \max(0,\, X_{ij} + \mathcal{N}(0, \lambda X_{ij})), \tag{3.10}$$
where $X_{ij}$ is the gene expression value of gene $j$ in simulated bulk sample $i$ and $\lambda$ is the noise
level.
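A minimal NumPy sketch of the perturbation in Formula 3.10 is given below. It treats $\lambda X_{ij}$ as the variance of the Gaussian noise, so the standard deviation passed to the sampler is its square root; the function name `perturb_expression` and the toy matrix are hypothetical.

```python
import numpy as np


def perturb_expression(X, lam, rng):
    """Add zero-mean Gaussian noise with variance lam * X_ij and clip at zero."""
    X = np.asarray(X, dtype=float)
    # rng.normal takes a standard deviation, hence the square root of the variance.
    noise = rng.normal(0.0, np.sqrt(lam * X))
    return np.maximum(0.0, X + noise)


# Toy matrix (hypothetical): 2 simulated bulk samples x 3 genes.
rng = np.random.default_rng(0)
X_toy = np.array([[0.0, 4.0, 100.0],
                  [9.0, 0.0, 1.0]])
X_noisy = {lam: perturb_expression(X_toy, lam, rng) for lam in (0.01, 0.05, 0.1)}
```

Because the variance scales with the expression level, genes with zero expression stay at zero and highly expressed genes receive proportionally larger absolute perturbations.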
Following this processing, we applied the previously trained DeepDeconUQ models to each
simulated bulk RNA-seq dataset to estimate the prediction intervals. This enabled us to systematically evaluate the model’s robustness under various gene expression perturbations, providing
insights into its stability and reliability in producing accurate intervals when gene expression
data is subject to different levels of noise.
3.3 Results
3.3.1 Methods overview
Figure 3.1 provides a schematic representation of DeepDeconUQ. The framework begins with
single-cell RNA sequencing (scRNA-seq) datasets, where the cells from each subject are assumed
to have labeled cell types (malignant or normal) and known gene expression profiles. Therefore, simulated bulk RNA-seq datasets with known malignant and normal cell type fractions can
be generated from these scRNA-seq datasets (Figure 3.1A). Next, these simulated bulk RNA-seq
datasets are divided into two disjoint groups: a training set and a calibration set. Specifically,
70% of the data is randomly selected for training a highly accurate quantile function, while the
remaining 30% is reserved for conformal calibration. The trained model uses bulk RNA-seq data
x and a predefined significance level α as input and outputs predictions of the lower and upper
bounds for malignant cell fractions, {ˆqαlo (x), ˆqαhi (x)} (Figure 3.1B). Following model training,
the calibration set is employed to compute conformity scores using Formula 3.8, which are then used to adjust the predicted bounds. The adjustment
reduces both the risk of overly conservative predictions (over-coverage) and the potential for
overly narrow intervals that miss true values (under-coverage) (Figure 3.1C). Finally, for a real
bulk sample, DeepDeconUQ combines the trained model and conformity score to estimate the
prediction interval Cˆ(Xn+1) (Figure 3.1D). This prediction interval provides a measure of uncertainty, offering a more reliable estimate of the malignant cell fractions within a bulk RNA-seq
sample.
Our model was constructed using artificial bulk RNA-seq samples and evaluated through the
leave-one-out cross-validation. The evaluation is based on validity and discrimination. For validity, we check the coverage rate, defined as the frequency of true malignant cell fraction within
the prediction interval of the testing dataset (see Formula 3.11). For discrimination, we use the
average length of prediction interval of the testing datasets as an evaluation metric (see Formula
3.12).
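$$\mathrm{Coverage} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{p}_{i,\alpha_{lo}} \le y_i \le \hat{p}_{i,\alpha_{hi}}), \tag{3.11}$$
$$L_{\mathrm{avg}} = \frac{1}{n} \sum_{i=1}^{n} |\hat{p}_{i,\alpha_{hi}} - \hat{p}_{i,\alpha_{lo}}|, \tag{3.12}$$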
[Figure 3.1 diagram. Panels: A, B, C, D. Diagram labels: scRNA-seq data (normal, malignant); simulated bulk dataset (p, 1 − p); Training; Calibration; Dropout; experimental tissue.]
Figure 3.1: Overview of DeepDeconUQ. A: Constructing simulated bulk RNA-seq samples with
different fractions of malignant cells. p is the fraction of malignant cells in a simulated bulk
sample. B: Model structure used to train DeepDeconUQ. It consists of four fully connected layers
with dropout layers. Seventy percent of the simulated data are used for training. The output is
two quantile functions at a given significance level α. C: Conformity scores are calculated on
the remaining 30% of the simulated dataset. D: Estimating the prediction interval of malignant
cells from a real bulk sample. The trained model is used to calculate the lower and upper bounds,
and the conformity scores are used to adjust the quantiles, which finally outputs the prediction
interval $\{\hat{p}_{\alpha_{lo}}, \hat{p}_{\alpha_{hi}}\}$.
where $y_i$ is the true malignant cell fraction of the ith sample in the testing dataset, $\hat{p}_{i,\alpha_{lo}}$ and
$\hat{p}_{i,\alpha_{hi}}$ are the corresponding lower and upper bounds of the ith sample’s prediction interval, n is
the total number of samples in the testing dataset, and $\mathbf{1}(x)$ is an indicator function that equals 1 when x
is true and 0 otherwise.
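As a minimal sketch of Formulas 3.11 and 3.12, the two metrics can be computed from the true fractions and the interval bounds as shown below. The function names and the toy values are illustrative assumptions and are not taken from the DeepDeconUQ code.

```python
def coverage(y_true, lower, upper):
    """Proportion of samples whose true fraction lies inside the interval (Formula 3.11)."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def average_length(lower, upper):
    """Mean absolute width of the prediction intervals (Formula 3.12)."""
    return sum(abs(hi - lo) for lo, hi in zip(lower, upper)) / len(lower)


# Toy values, hypothetical and for illustration only.
y_true = [0.20, 0.55, 0.90, 0.40]
lower = [0.10, 0.60, 0.70, 0.35]
upper = [0.30, 0.80, 0.95, 0.50]

cov = coverage(y_true, lower, upper)    # three of the four intervals contain y
l_avg = average_length(lower, upper)
print(cov, l_avg)
```

A valid method should reach a coverage of at least 1 − α, and among valid methods a smaller average length is preferable.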
For each subject, we generated the simulated bulk datasets separately, as described in Chapter 3.2.1.
Leave-one-out cross-validation was used to evaluate model performance
across subjects. Specifically, we selected one of the k artificial bulk RNA-seq datasets as the testing dataset, while the remaining k − 1 datasets served as the training set. This process was
repeated k times to fully evaluate the performance of our model.
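The splitting scheme above can be sketched as follows. This is only an illustration of the leave-one-out loop; the subject names are hypothetical, and model training and evaluation are left as a comment because their details are outside this passage.

```python
def leave_one_out_splits(dataset_names):
    """Yield (test_name, train_names) pairs, holding out one dataset at a time."""
    names = list(dataset_names)
    for held_out in names:
        train = [name for name in names if name != held_out]
        yield held_out, train


# Hypothetical subject identifiers, one simulated bulk dataset per subject.
subjects = ["subject_1", "subject_2", "subject_3", "subject_4"]

for test_name, train_names in leave_one_out_splits(subjects):
    # In a real run, a model would be trained on `train_names`
    # and evaluated on `test_name` here.
    print(test_name, train_names)
```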
3.3.2 DeepDeconUQ outperforms other methods for estimating the prediction
interval of malignant cell fraction
To assess the performance of DeepDeconUQ, we conducted a comparative analysis against two
alternative methods, RNA-Sieve (v. 0.1.4) [43] and MEAD (v. 1.0.1) [44], both of which have been
proposed in the literature to quantify uncertainties in estimated cell-type proportions. This evaluation was performed on both simulated and real bulk RNA-seq datasets. Since RNA-Sieve and
MEAD are statistical inference methods and do not include a step for simulating artificial bulk
RNA-seq datasets for model training, we utilized the scRNA-seq data directly as the reference
for these methods. The same scRNA-seq data were also employed to generate the synthetic bulk
RNA-seq datasets for DeepDeconUQ. All benchmarking methods were executed using their default configurations, ensuring a consistent basis for comparison. Additionally, the methods were
evaluated on identical test datasets, which were kept separate from the training datasets used
to develop the models. Details of implementations of these compared methods are explained in
Chapter 5.3.2.
Figure 3.2 presents boxplots illustrating coverage and average prediction interval lengths for
15 simulated bulk RNA-seq datasets at three significance levels (15%, 10%, and 5%). Although
RNA-Sieve maintains relatively narrow prediction intervals, it often fails to meet the coverage
criterion across the datasets, indicating a tendency toward marked undercoverage. This suggests
that RNA-Sieve’s intervals may be too narrow to reliably contain the true malignant fraction. In
contrast, MEAD achieves the coverage criterion for some datasets but exhibits considerable variability in prediction interval lengths, with some interval lengths extending beyond 0.6. Such substantial intervals lead to overcoverage, reducing interpretability by producing intervals that are
too broad to offer precise estimates. DeepDeconUQ demonstrates the best performance among the
three methods on the simulated datasets, consistently satisfying the coverage requirement while
maintaining tight prediction intervals. This performance advantage is attributed to two primary
factors: first, the neural network’s effective quantile learning enables it to meet the coverage
criterion; second, the well-trained model generates low conformity scores on the calibration set,
ensuring that the quantile of these scores remains sufficiently small to yield narrow prediction
intervals.
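The role of the conformity scores can be illustrated with a minimal sketch of the standard conformalized quantile regression calibration step. It assumes that DeepDeconUQ follows the usual recipe: the score is the larger of the two signed distances from the true value to the predicted quantiles, and the interval is widened by an empirical quantile of those scores. The function names, the clipping of the interval to [0, 1], and the toy numbers are assumptions made for illustration.

```python
import math


def conformity_scores(q_lo, q_hi, y):
    """Score is positive when y falls outside [q_lo, q_hi], negative when inside."""
    return [max(lo - yi, yi - hi) for lo, hi, yi in zip(q_lo, q_hi, y)]


def conformal_quantile(scores, alpha):
    """The ceil((1 - alpha) * (n + 1))-th smallest calibration score."""
    n = len(scores)
    # The small tolerance guards against floating-point noise in the product.
    rank = math.ceil((1 - alpha) * (n + 1) - 1e-9)
    if rank > n:
        return float("inf")  # too few calibration samples for this alpha
    return sorted(scores)[rank - 1]


def calibrated_interval(q_lo_new, q_hi_new, q_hat):
    """Widen (or shrink) the predicted quantiles by q_hat; clip to [0, 1] (assumption)."""
    return max(0.0, q_lo_new - q_hat), min(1.0, q_hi_new + q_hat)


# Toy calibration set, hypothetical values.
cal_q_lo = [0.2, 0.4, 0.1, 0.5]
cal_q_hi = [0.4, 0.7, 0.3, 0.8]
cal_y = [0.3, 0.75, 0.05, 0.6]

scores = conformity_scores(cal_q_lo, cal_q_hi, cal_y)
q_hat = conformal_quantile(scores, alpha=0.2)
interval = calibrated_interval(0.3, 0.5, q_hat)
print(q_hat, interval)
```

When the network's quantiles already bracket most calibration samples, the scores are small or negative, so the adjustment q_hat stays small and the final intervals remain narrow.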
We further evaluated the performance of these three methods on real AML datasets, including
‘primary’, ‘recurrent’, and ‘BeatAML’ samples, as illustrated in Table 3.1. RNA-Sieve consistently
has the worst performance, with its average prediction interval length fixed at 1.0, indicating it
predicts 0.0 as the lower bound and 1.0 as the upper bound for every real sample. This likely stems
from RNA-Sieve’s limitations in handling gene expression data sourced from diverse sequencing
protocols. Consequently, while RNA-Sieve can provide an estimate of malignant cell fraction,
the results lack reliability. MEAD, conversely, accounts for variations in sequencing depth and
tissue sample size, thus yielding relatively robust performance on real datasets. DeepDeconUQ
demonstrates an even higher capability by addressing batch effects and sequencing biases via
TF-IDF transformation and Min-Max normalization, achieving superior performance relative to
MEAD, with more consistent coverage and narrower prediction intervals across the real datasets.
It should be noted that the malignant cell fractions given by flow cytometry most likely deviate
from the true fractions of malignant cells, so some undercoverage relative to the prespecified
coverage levels is expected. Despite these caveats, the results show that the coverages
of the prediction intervals from DeepDeconUQ are generally higher than those from MEAD,
while the lengths of the prediction intervals from DeepDeconUQ are shorter than those based on
MEAD.
Figure 3.2: DeepDeconUQ outperforms other methods in estimating the prediction interval
of malignant cell fraction on AML simulated bulk RNA-seq datasets. Boxplots of coverage (A) and
average prediction interval length (B) on 15 AML simulated bulk RNA-seq datasets. Coverage is
defined as the proportion of instances in which the true fraction of malignant cells falls within
the prediction interval for the testing dataset. The average length represents the mean length
of the prediction intervals across the testing datasets. Each box in the boxplot comprises 15 data
points, each corresponding to one of 15 simulated AML datasets. Significance levels are indicated
with different colors.
We further compared the coverages of the prediction intervals from DeepDeconUQ and MEAD using McNemar’s test [89], and the lengths of the
prediction intervals from the two methods using the Wilcoxon signed-rank test.
For the coverage analysis, each sample in the dataset was assigned a label of 1 if its true malignant cell fraction fell within the predicted interval; otherwise, it was labeled as 0. This approach
yielded a pair of binary outcomes for each sample, one from DeepDeconUQ and one from
MEAD, thereby providing paired nominal data suitable for McNemar’s test. We aggregated all samples across the three datasets into a consolidated dataset and performed the
statistical assessment on this unified sample set. The resulting p-values from McNemar’s test are
$1.4035 \times 10^{-6}$, $1.2438 \times 10^{-13}$, and $1.3977 \times 10^{-8}$ at significance levels 15%, 10%, and 5%, respectively.
Moreover, the p-values of the Wilcoxon signed-rank test on the prediction interval lengths are $3.33 \times 10^{-6}$,
$9.75 \times 10^{-9}$, and 0.0013 at the same significance levels. These findings indicate a statistically
significant performance difference between DeepDeconUQ and MEAD.
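To make the paired coverage comparison concrete, the following sketch runs an exact (binomial) McNemar test on two lists of 0/1 coverage labels. The text does not state whether the exact or the chi-square version of the test was used, so the exact version here is an assumption, and the labels are made-up toy data.

```python
from math import comb


def mcnemar_exact(covered_a, covered_b):
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (b, c, p_value), where b counts pairs covered only by method A
    and c counts pairs covered only by method B.
    """
    b = sum(1 for a, o in zip(covered_a, covered_b) if a == 1 and o == 0)
    c = sum(1 for a, o in zip(covered_a, covered_b) if a == 0 and o == 1)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy coverage labels (1 = true fraction inside the interval), hypothetical.
method_a = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
method_b = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]

b, c, p_value = mcnemar_exact(method_a, method_b)
print(b, c, p_value)
```

Only the discordant pairs (samples covered by exactly one method) enter the test, which is why paired labels rather than two separate coverage rates are needed.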
Table 3.1: DeepDeconUQ outperforms other methods in estimating the prediction interval of
malignant cell fraction on real AML bulk RNA-seq datasets. Coverage and average prediction interval length
(Lavg) are shown under different significance levels on three real AML bulk RNA-seq datasets
(‘primary’, ‘recurrent’, and ‘BeatAML’). The total row is the aggregation of all three real datasets.

Method        Dataset     15%                 10%                 5%
                          Coverage   Lavg     Coverage   Lavg     Coverage   Lavg
RNA-Sieve     primary     1.0        1.0      1.0        1.0      1.0        1.0
              recurrent   1.0        1.0      1.0        1.0      1.0        1.0
              BeatAML     1.0        1.0      1.0        1.0      1.0        1.0
              total       1.0        1.0      1.0        1.0      1.0        1.0
MEAD          primary     0.667      0.553    0.705      0.630    0.771      0.738
              recurrent   0.676      0.520    0.706      0.591    0.735      0.694
              BeatAML     0.496      0.386    0.544      0.433    0.663      0.515
              total       0.227      0.544    0.240      0.620    0.259      0.726
DeepDeconUQ   primary     0.800      0.434    0.876      0.572    0.912      0.662
              recurrent   0.824      0.606    0.853      0.604    0.882      0.685
              BeatAML     0.592      0.409    0.730      0.554    0.781      0.611
              total       0.665      0.432    0.778      0.563    0.824      0.630

3.3.3 DeepDeconUQ is robust to gene expression perturbations
In Chapter 3.2, we discussed how perturbations in bulk RNA-seq gene expression
data can affect the accuracy of the estimation algorithms. Figure 3.3, Figure 5.12, and Figure 5.13
illustrate the impact of various perturbation levels on the performance of these methods under
different significance levels. For RNA-Sieve, the performance remains comparable to prior results
without noise interference, with the prediction interval coverage consistently low. For MEAD,
increasing noise levels results in decreased coverage and increased variability in the intervals. In
the case of DeepDeconUQ, while coverage decreases as noise levels rise, the majority of coverage
values still meet the required threshold. Notably, the average length of DeepDeconUQ’s prediction intervals remains stable across different noise levels. DeepDeconUQ achieves the highest
coverage and smallest average interval length among all methods under various noise conditions,
demonstrating its robustness to expression perturbations.
Figure 3.3: DeepDeconUQ is robust to gene expression perturbations. Boxplots of coverage
and average prediction interval length on 15 AML simulated bulk RNA-seq datasets under different noise levels. We added random noise generated from a Gaussian distribution with zero mean
and variance equal to λ (λ = 0.01, 0.05, 0.1) times the gene expression level for each gene in
each sample. Each box contains a total of 15 points, representing 15 separate AML datasets. The
colors represent different noise levels λ. Significance level α = 0.1.
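The perturbation described in the caption can be sketched as below: each gene's value receives zero-mean Gaussian noise whose variance is λ times that gene's expression level. Clipping negative results to zero is an assumption added here to keep expression values non-negative; the text does not say how negative values were handled. The function name and sample values are illustrative.

```python
import random


def perturb_expression(expression, lam, rng):
    """Add N(0, lam * x) noise to every expression value x in one sample."""
    noisy = []
    for value in expression:
        sigma = (lam * value) ** 0.5  # standard deviation = sqrt(variance)
        noisy_value = value + rng.gauss(0.0, sigma)
        noisy.append(max(0.0, noisy_value))  # clipping at zero is an assumption
    return noisy


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [0.0, 12.0, 150.0, 3.5]  # hypothetical expression values for four genes
for lam in (0.01, 0.05, 0.1):
    print(lam, perturb_expression(sample, lam, rng))
```

Because the variance scales with the expression level, unexpressed genes stay at zero and highly expressed genes receive larger absolute noise.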
3.3.4 Time and memory usage
DeepDeconUQ was trained and tested on a high-performance computing (HPC) cluster node with a 6-core Xeon-2640
CPU. It is the only algorithm that requires the generation of in silico training data,
which took 20 min for 3000 samples with a peak memory usage of 10 GB. Additionally, training a model took
∼20 minutes, and predicting on one bulk tissue sample took ∼3 s.
3.4 Discussion
DeepDeconUQ is an advanced deep neural network-based algorithm designed to leverage single-cell RNA sequencing (scRNA-seq) data to generate prediction intervals for malignant cancer cell
fractions. Building on our earlier method, DeepDecon, DeepDeconUQ retains all its foundational
advantages, such as the ability to automatically extract complex nonlinear features within its hidden layers and to accurately estimate the quantile function by integrating a comprehensive input
of genes ($\sim 10^4$). To address intrinsic variability in RNA-seq data, DeepDeconUQ employs TF-IDF transformation and Min-Max normalization, which enables it to yield prediction intervals
that account for both biological and technical sources of noise. Additionally, it utilizes a calibration dataset to fine-tune the prediction interval, effectively mitigating risks of overcoverage
and undercoverage. Integrating training and calibration datasets in DeepDeconUQ represents a
significant advancement in malignant cancer cell fraction estimation, allowing for more accurate and interpretable predictions. By leveraging quantile regression and conformal inference,
DeepDeconUQ not only enhances confidence in the malignant cell prediction interval results but
also facilitates the application of the method to real-world datasets with minimal adjustments.
The framework’s ability to generate reliable uncertainty estimates positions DeepDeconUQ as a
valuable tool for the analysis of bulk RNA-seq data, particularly in contexts where precise quantification of cell type proportions is critical for downstream analyses and clinical decision-making.
While DeepDeconUQ can achieve good performance on AML cancer tissues, we note that this
method still has limitations. First of all, the quality of training data is very important. DeepDeconUQ is a neural network-based method, which means it needs a large amount of data to train.
Currently, we use single-cell data from 15 AML subjects to construct simulation bulk RNA-seq
datasets. If the number of subjects is small or the single-cell data is dominated by one specific cell
type, DeepDeconUQ can learn less information from the data and cannot generalize and represent the latent features well. Secondly, experimental bias and noise can greatly affect the estimation
performance. Even though we use TF-IDF transformation and Min-Max
normalization to mitigate batch effects and bias, the complexity of real RNA-seq data can still affect DeepDeconUQ’s performance. Thirdly, DeepDeconUQ can only estimate the
prediction interval of malignant cell fraction. In practice, tissues usually consist of multiple cell
types, and some tissues even contain unknown sub-cell types.
We plan to further improve the performance and applicability of DeepDeconUQ by implementing several key modifications to the existing methodology. Firstly, we want to extend DeepDeconUQ’s capacity to include multiple cell types or subtypes. In this case, we should not only
obtain the marginal confidence interval for one cell type but also adapt the algorithm to
estimate a confidence region for all cell types. Secondly, DeepDeconUQ’s capability to handle
technical bias and diverse sequencing protocols should be improved. In addition to the current normalization steps, methods like autoencoders [72, 90], transfer learning [91], and transformers
[92] can be used to generate latent embeddings to reduce these biases. Thirdly, the current DeepDeconUQ model takes all genes into account. Selecting only the genes that are relevant to the cell
types of interest may further increase prediction accuracy.
Chapter 4
Conclusions and future work
This dissertation addresses the critical challenge of accurately estimating malignant cell fractions
in bulk RNA-seq data, a task complicated by tumor heterogeneity and the limitations of traditional deconvolution methods. To overcome these challenges, we developed two novel models:
DeepDecon and DeepDeconUQ, both leveraging deep learning techniques and single-cell RNA
sequencing (scRNA-seq) data to enhance predictive accuracy and reliability.
In Chapter 2, we introduced DeepDecon, an iterative deep learning framework designed to
deconvolute bulk RNA-seq data by learning optimal features from scRNA-seq profiles. This approach allows for the simulation of bulk RNA-seq data, facilitating more accurate estimation
of malignant cell fractions. The model’s iterative refinement strategy and the application of
Term Frequency-Inverse Document Frequency (TF-IDF) transformations enhance its robustness
against gene expression noise and variability. Comprehensive evaluations on both simulated and
real datasets—including acute myeloid leukemia (AML), neuroblastoma, and head and neck squamous cell carcinoma (HNSCC)—demonstrated that DeepDecon outperforms traditional methods
such as CIBERSORT and MuSiC, providing more precise estimates of cancer cell proportions.
Building upon this foundation, Chapter 3 presented DeepDeconUQ, an extension of DeepDecon that incorporates uncertainty quantification into its predictions. By integrating conformalized quantile regression, DeepDeconUQ generates statistically valid prediction intervals instead
of pure point estimates, offering a measure of confidence for each prediction. This feature addresses a significant limitation in existing deconvolution methods, which often lack mechanisms
for assessing predictive uncertainty. Evaluations revealed that DeepDeconUQ maintains robust
performance across various noise levels in bulk samples, consistently providing accurate prediction intervals. This advancement enhances the reliability of malignant cell fraction estimations,
making the model particularly valuable in clinical settings where understanding the confidence
of predictions is crucial.
Collectively, the development of DeepDecon and DeepDeconUQ represents a significant advancement in the field of cancer cell deconvolution. These models offer scalable, data-driven
frameworks capable of adapting to diverse cancer types and experimental conditions, thereby
providing researchers and clinicians with more accurate and reliable tools for understanding tumor heterogeneity and informing personalized treatment strategies.
4.1 Future work for DeepDecon
In Chapter 2, we presented DeepDecon to estimate the malignant cell fraction in cancer tissues.
Though the method shows great power in malignant cell deconvolution, several avenues can be
explored to enhance our understanding of cell deconvolution further.
First of all, we can extend DeepDecon’s capacity to include multiple cell types and multi-omics data. Currently, we consider two main cell types in this study: malignant and normal
cells. However, it has been reported that both cell types consist of molecular subtypes [78]. Thus,
it is important to extend DeepDecon to multiple cell types. Computational deconvolution with
single-cell RNA sequencing data as reference is pivotal to interpreting transcriptomics data, but
the current methods are limited to scRNA-seq. Future research could explore the integration of
additional data modalities, such as DNA methylation, copy number variation, and proteomics
data, to further enhance the accuracy of cancer cell fraction estimates. Multi-omics approaches
could provide a more comprehensive view of tumor heterogeneity and improve the robustness
of the predictions. For instance, integrating DNA methylation data has been shown to improve
cell-type deconvolution accuracy in complex tissues [93].
Additionally, we can expand DeepDecon to include other cancer types. While DeepDecon has
been extensively tested on specific cancers (e.g., AML, neuroblastoma, HNSCC), expanding the
models to other cancer types could validate their generalizability and adaptability. This would
involve training and testing the models on additional datasets from diverse tumor microenvironments. Applying these models to cancers with distinct microenvironmental compositions, such
as breast cancer or melanoma, could further demonstrate their versatility and robustness.
Moreover, improving DeepDecon’s capability to handle batch effects and cross-platform variability is very important. One of the challenges faced in the current study is the variability in
sequencing platforms and experimental conditions across datasets. Future research could focus
on developing more sophisticated normalization techniques and domain adaptation methods to
better handle cross-platform biases, improving the transferability of the models across diverse
datasets. Techniques such as ComBat-seq [94] or domain adversarial neural networks [95] could
be explored to mitigate batch effects and enhance model robustness. Methods like autoencoders
[72, 90], transfer learning [91], and transformers [92] can be used to generate latent embeddings
to reduce technical biases.
4.2 Future work for DeepDeconUQ
In Chapter 3, we presented DeepDeconUQ, an advanced deep neural network-based algorithm
designed to leverage single-cell RNA sequencing data to generate prediction intervals for malignant cancer cell fractions. While DeepDeconUQ can achieve good performance on AML cancer
tissues, several future directions could be explored.
First of all, we can enhance the performance of quantile regression with adaptive calibration.
DeepDeconUQ currently uses conformalized quantile regression to generate prediction intervals
with fixed significance levels. However, this approach may not fully capture varying uncertainties across different sample types and conditions. A potential future improvement is to incorporate adaptive calibration methods, such as locally adaptive conformal prediction [96], which
can adjust the significance level based on the local variability in gene expression profiles. This
would enable the model to dynamically tailor its prediction intervals, providing more precise
uncertainty quantification in complex, heteroscedastic data.
Furthermore, we can explore the integration of Bayesian deep learning. One limitation of
the current DeepDeconUQ model is its reliance on frequentist methods for uncertainty quantification. Incorporating Bayesian deep learning techniques, such as Monte Carlo Dropout [97]
or Bayesian neural networks [98], could offer an alternative approach for capturing uncertainty.
Bayesian methods inherently account for model uncertainty by treating the weights of the neural network as random variables, allowing for a more comprehensive estimation of both aleatoric
(data-related) and epistemic (model-related) uncertainty [99]. This enhancement could lead to
more reliable and informative prediction intervals, especially in datasets with high variability.
Moreover, multi-source data fusion and cross-cancer applications can further increase DeepDeconUQ’s influence on tumor analysis. DeepDeconUQ currently relies solely on RNA-seq data
for prediction interval estimation. Future extensions could incorporate multi-source data fusion, combining information from other high-dimensional data types like DNA methylation, copy
number variations, and proteomics data. Using multi-view learning techniques, the model could
integrate complementary information from these diverse sources, potentially improving the accuracy and reliability of the prediction intervals. This fusion of multi-omics data could provide a
more holistic view of tumor composition and heterogeneity. Future research could also explore
transfer learning to adapt the model to other cancer types without extensive retraining. By pretraining DeepDeconUQ on a large, diverse dataset of scRNA-seq profiles and fine-tuning it on
smaller, cancer-specific datasets, the model could achieve better generalization across different
tumor microenvironments. This approach would also help mitigate the limited availability of
scRNA-seq data for rare cancers, enhancing the model’s utility in broader clinical applications.
Chapter 5
Appendix
5.1 Appendix for chapter 1: Introduction
5.1.1 Cost analysis of bulk RNA-seq and scRNA-seq
Nguyen et al. performed a cost analysis using data from the Genomics Core of a top
US university in 2024 [16]. The pricing provided by this core is comparable to those offered by
other genomics cores in the US. The smallest experiment that would provide useful data involved
at least three controls and three case samples. This core charges $65 for library prep and $420
for conventional mRNA-Seq, for a total of $485 for each sample. A bulk RNA-seq experiment
involving these six samples would cost approximately $3K. In contrast, to perform a single-cell
experiment with the same number of samples, the same core would charge about $300 for a VDJ
library prep, $1,800 for the capture (up to 10K cells), and $300/cell for sequencing. The total
to perform the single-cell experiment is about $27K, or 9 times as much as the cost of the bulk
experiment. In principle, a bulk experiment is cheaper than a single-cell experiment by $24K for
the minimal experiment contemplated above, assuming that the costs of analysis, software, and
personnel remain the same between the two. A typical experiment will include
more samples, perhaps on the order of 10 controls and 10 case samples, which will increase the
price difference to about $70K per experiment.
5.2 Appendix for chapter 2: DeepDecon Accurately Estimates
Cancer Cell Fractions in Bulk RNA-seq Data
5.2.1 Software comparison and settings
We compared DeepDecon with other deconvolution methods, including Scaden (v. 1.1.2) [45],
CIBERSORTx (https://cibersortx.stanford.edu/) [42], Bisque (v. 1.0.5) [40], ESTIMATE (v. 2.0.0)
[56], MuSiC (v. 1.0.0) [14], MEAD (v. 1.0.1) [44], and RNA-Sieve (v. 0.1.4) [43].
For Scaden [45], the training hyperparameters were set following the instructions of the original article and the source code. For the simulated bulk dataset generation, we used the artificial
datasets generated by DeepDecon to train Scaden for a fair comparison. Then, we ran Scaden
with default settings by following the example provided by the authors.
For CIBERSORTx (CSx) [42], we used the web-based application. We first used a single-cell
profile to generate a signature matrix. In the leave-one-out cross-validation, the single-cell profile
was constructed by combining the single cells of all subjects, excluding the held-out test subject, while
for the real bulk testing data, the single-cell profile was constructed by combining all available
single-cell data. CIBERSORTx does not consider subject information. Next, we deconvolved the
corresponding bulk data without batch correction as suggested by the authors. Other settings
were default.
For MuSiC [14], we installed the R package and ran it with default settings following its
tutorial. In the leave-one-out cross-validation, the single-cell profile was constructed by using the
single cells of all subjects, excluding the held-out test subject, while for the real bulk testing data, the single-cell profile was constructed by combining all available single-cell data. Subject information was
included in the single-cell reference. We also obtained the NNLS results from MuSiC’s output.
For ESTIMATE [56], as The Cancer Genome Atlas (TCGA) already contains various cancer types including AML, we directly used the default ‘stromal signature’ and ‘immune signature’ as references.
Next, we ran ESTIMATE with default settings by following the example provided by the authors.
Regarding Bisque [40], we installed the BisqueRNA R package and executed the tool using the
default settings. In the ‘Reference-based decomposition’ mode’s default configuration, Bisque
discards low variance genes and utilizes the remaining genes for decomposition. We opted to
input all the genes without specifying any marker genes. In the leave-one-out cross-validation,
the single-cell profile was constructed by using the single cells of all subjects, excluding the held-out
test subject, while for the real bulk testing data, the single-cell profile was constructed by combining all
available single-cell data. Subject information was included in the single-cell reference.
For MEAD [44], we installed the R package given in the manuscript and ran it with default
settings. In the leave-one-out cross-validation, the single-cell profile was constructed by using the
single cells of all subjects, excluding the held-out test subject, while for the real bulk testing data, the single-cell profile was constructed by combining all available single-cell data. Subject information was
included in the single-cell reference.
For RNA-Sieve [43], we executed it by following the example code provided. In the leave-one-out cross-validation, the single-cell profile was constructed by combining the single cells of all
subjects, excluding the held-out test subject, while for the real bulk testing data, the single-cell profile was
constructed by combining all available single-cell data.
5.2.2 Comparison with and without subject information for RNA-Sieve,
CIBERSORTx and NNLS
DeepDecon constructs simulated bulk RNA-seq samples within subjects. Cells from different subjects (i.e., individuals) were not merged into an aggregated sample. This decision was motivated
by two primary considerations. Firstly, the aim was to safeguard within-subject relationships among
genes by preserving the unique gene expression patterns inherent to each subject. Secondly, the
intention was to capture the variability between subjects, commonly referred to as cross-subject
heterogeneity. As a result, DeepDecon makes use of subject information. However,
several benchmarking methods (RNA-Sieve, CIBERSORTx, and NNLS) do not use subject
information. To investigate the most effective approach for modeling the heterogeneity of gene
expression in tumor cells, we evaluated two distinct modes (‘Aggregate’ and ‘Separate’) for constructing single-cell gene expression profiles. This evaluation was conducted while benchmarking regression-based methods that do not incorporate subject ID as a factor. In particular, in the
‘Aggregate’ mode, we constructed the single-cell reference by aggregating single cells from different subjects, without considering subject information. On the other hand, in the ‘Separate’ mode,
we initially constructed patient-specific single-cell references for each subject, utilizing only the
single cells from that particular subject. Subsequently, we obtained patient-specific results separately for each reference. Finally, we computed the average of all patient-specific results to obtain
the final result for the given testing bulk datasets. The key distinction between the ‘Aggregate’
and ‘Separate’ modes lies in their gene expression profile allocation strategy. The ‘Aggregate’
mode employs a combined reference and generates a single result, while the ‘Separate’ mode
employs patient-specific references and derives the final result by averaging the patient-specific
results.
We further conducted a paired signed-rank test to evaluate whether the difference between the estimate
from ‘Aggregate’ and the true value is the same as the difference between the estimate
from ‘Separate’ and the true value. To do this, for the i-th bulk RNA-seq sample in the testing
dataset, we calculated the absolute aggregate difference $Ad_i = |Ae_i - f_i|$, where $Ae_i$ and $f_i$ are
the estimated fraction using the ‘Aggregate’ mode and the true malignant cell fraction value,
respectively. Similarly, we define the absolute separate difference $Sd_i = |Se_i - f_i|$ for the
‘Separate’ mode, where $Se_i$ is the estimated fraction using the ‘Separate’ mode. Then we
test whether $Ad_i$ and $Sd_i$ have the same distribution using the paired Wilcoxon signed-rank test.
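The comparison can be sketched as follows: compute the paired absolute differences, then rank their pairwise differences to obtain the Wilcoxon signed-rank statistic. This sketch only computes the rank sums. A p-value would normally be obtained from a statistics library (for example `scipy.stats.wilcoxon`). All names and toy values are illustrative assumptions.

```python
def abs_differences(estimates, truth):
    """|estimate_i - f_i| for every bulk sample i."""
    return [abs(e - f) for e, f in zip(estimates, truth)]


def signed_rank_statistic(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Zero differences are dropped and tied absolute differences
    receive the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Toy fractions, hypothetical: true values and the two modes' estimates.
truth = [0.50, 0.20, 0.80, 0.40]
aggregate_est = [0.62, 0.11, 0.50, 0.45]
separate_est = [0.55, 0.26, 0.70, 0.20]

ad = abs_differences(aggregate_est, truth)
sd = abs_differences(separate_est, truth)
print(signed_rank_statistic(ad, sd))
```

A large imbalance between the two rank sums suggests that one mode has systematically larger absolute errors than the other.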
Fig. 5.8 gives the RMSE values of CIBERSORTx, RNA-Sieve and NNLS under the ‘Aggregate’
and ‘Separate’ modes when applied to the AML, neuroblastoma, and HNSCC bulk datasets. The
significance analysis for the Adi and Sdi pair on the real datasets for each method is also given
in Fig. 5.8. We can see that the relative performance of ‘Separate’ and ‘Aggregate’ modes highly
depends on the estimation methods and datasets. The ‘Separate’ mode of CIBERSORTx exhibits
inferior performance compared to the ‘Aggregate’ mode on the AML and HNSCC datasets while
outperforming it on the neuroblastoma dataset. On the other hand, the ‘Separate’ mode of both RNA-Sieve and NNLS either performs similarly to or surpasses the ‘Aggregate’ mode across all datasets.
Yet, DeepDecon outperforms these methods in both ‘Separate’ and ‘Aggregate’ modes.
5.2.3 Preprocessing of single-cell gene expression data
AML data was obtained from the Gene Expression Omnibus (GEO) under accession number
GSE116256 [62]. To ensure data quality, we utilized single-cell RNA sequencing (scRNA-seq)
data from subjects with at least 100 normal and 100 malignant cells, respectively (Figure 5.11).
This criterion was employed to avoid extreme scenarios in which very few normal or malignant
cells were selected. A total of 15 AML subjects were selected. For each subject, we filtered out
cells with fewer than 500 detected genes and genes expressed in fewer than five cells. The resulting
gene expression profile for each subject was further filtered for extreme outliers in gene expression values. The filtering criteria for each subject are given in Table 5.3. Finally, gene expression
was normalized to library size by total counts across all genes, and the processed data was saved
for subsequent RNA-seq bulk data generation. We used the tag ‘PredictionRefined’ [62] given in
the dataset as the final label (malignant/normal) to annotate the cell types. This tag was a manual
reclassification of cells by close inspection of mutations/expression profiles.
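The cell and gene filters and the library-size normalization described above can be sketched on a small count matrix as follows. The order of the two filters, the use of post-filter totals for normalization, and the `target_sum` parameter are assumptions, and the subject-specific outlier filtering of Table 5.3 is not reproduced here.

```python
def preprocess_counts(counts, min_genes=500, min_cells=5, target_sum=1.0):
    """Filter cells and genes, then normalize each cell by its total count.

    counts: list of cells, each a list of per-gene counts.
    Returns (normalized_matrix, kept_gene_indices).
    """
    # 1. Keep cells with at least `min_genes` detected (non-zero) genes.
    cells = [cell for cell in counts if sum(1 for v in cell if v > 0) >= min_genes]
    if not cells:
        return [], []
    # 2. Keep genes expressed in at least `min_cells` of the remaining cells.
    n_genes = len(cells[0])
    kept_genes = [
        g for g in range(n_genes)
        if sum(1 for cell in cells if cell[g] > 0) >= min_cells
    ]
    # 3. Library-size normalization: divide by the cell's total count.
    normalized = []
    for cell in cells:
        row = [cell[g] for g in kept_genes]
        total = sum(row)
        normalized.append([target_sum * v / total if total > 0 else 0.0 for v in row])
    return normalized, kept_genes


# Tiny toy matrix (4 cells x 3 genes) with lowered thresholds for illustration.
toy_counts = [[1, 0, 2], [0, 0, 0], [3, 1, 0], [2, 0, 4]]
matrix, genes = preprocess_counts(toy_counts, min_genes=2, min_cells=2)
print(genes, matrix)
```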
Neuroblastoma data was downloaded from the Gene Expression Omnibus (GEO) with accession number GSE137804 [66]. It contains single cells from 16 subjects with cell annotations. After
applying the same preprocessing workflow as AML, a total of 9 neuroblastoma subjects were selected for downstream artificial bulk RNA-seq datasets construction. We used the ‘celltype’ tag
given in the dataset as cell labels. Specifically, the ‘tumor’ cell type was labeled as malignant cells
and all the other cell types were labeled as normal cells.
Head and neck squamous cell carcinoma (HNSCC) data were collected from the TISCH
database [67]. After applying the same subject selection criterion (at least 100 malignant
and at least 100 normal single cells per subject), 27 subjects were available from this
database. We used the ‘Cluster Celltype (malignancy)’ column in the meta information as cancer
cell (malignant/normal) labels. Specifically, cells labeled ‘Malignant cells’ were treated as
malignant cells and all other cells as normal cells. Note that the single-cell expression
matrix from TISCH is log-transformed for each single cell. We de-normalized the matrix by
exponentiating the gene expression values before constructing simulated bulk samples. We did not perform
additional filtering or scaling because the data were already scaled to 10,000 for each single cell.
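As a rough illustration of the de-normalization step, the sketch below inverts a transform of the form log_b(x + c). The text does not state the logarithm base or whether a pseudocount was used, so `base` and `pseudocount` are assumptions exposed as parameters, and `delog` is a hypothetical helper name.

```python
import math

def delog(matrix, base=math.e, pseudocount=1.0):
    """Invert y = log_base(x + pseudocount) element-wise (assumed transform)."""
    # Clip at zero so rounding error cannot produce negative expression values.
    return [[max(base ** y - pseudocount, 0.0) for y in row] for row in matrix]
```

For example, under the assumed natural-log transform with a pseudocount of 1, a stored value of 0 maps back to an expression of 0.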
5.2.4 Artificial bulk dataset simulation
The simulated artificial bulk datasets were generated by subsampling within each scRNA-seq
subject. To preserve potential correlations between the expression levels of different genes within
subjects, cells from different subjects were not merged into one bulk sample. The true cell-type
proportion for each bulk sample was calculated by dividing the number of single cells with a
specific cell type by the total number of cells in the bulk sample.
First, for each simulated bulk sample, N cells were drawn from the two cell types (malignant/normal),
where the indices ‘1’ and ‘2’ correspond to the malignant and normal cell types, respectively. Let
$$N = n_1 + n_2, \tag{5.1}$$
$$f_i = \frac{n_i}{N}, \quad i = 1, 2, \tag{5.2}$$
where $n_i$ and $f_i$ are the number and fraction of cells of type $i$, respectively, and $N$ is the total
number of cells in one simulated bulk sample. Here $n_i$ was generated uniformly from 0 to $N$
using the Python random module [100]. Once $n_i$ was determined, $n_i$ cells were sampled from
the scRNA-seq gene expression matrix for each cell type $i$ (if $n_i$ is larger than the total number
of cells of type $i$ in a particular subject, the cells were chosen with replacement; otherwise,
the cells were chosen without replacement). Next, the selected single-cell expression profiles of
all cell types were aggregated by summing their expression values,
$$G = \sum_{i=1}^{2} \sum_{j=1}^{n_i} X_{ij}, \tag{5.3}$$
where $X_{ij}$ is the gene expression vector of the $j$th sampled cell of type $i$ and $G$ is the final bulk RNA-seq expression profile. The above steps were repeated $T$ times to construct a simulated bulk dataset with $T$
samples. In our simulations, $T$ was chosen as 200 for each subject. Finally, we had 15 simulated
bulk RNA-seq datasets, each with 200 bulk samples with known cell-type proportions.
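The simulation procedure can be sketched with the Python random module as follows. This is an illustrative sketch rather than the original implementation: the function names, the `seed` argument, and the choice to draw the malignant count first and set the normal count to the remainder are assumptions, and the value of N (`n_cells`) is left as a parameter because it is not stated in this section.

```python
import random

def simulate_bulk(malignant, normal, n_cells, rng):
    """malignant/normal: per-cell expression vectors from ONE subject."""
    n1 = rng.randint(0, n_cells)   # malignant count, uniform on 0..N
    n2 = n_cells - n1              # normal count, so that N = n1 + n2 (Eq. 5.1)

    def draw(pool, k):
        # With replacement only when more cells are requested than available.
        if k > len(pool):
            return rng.choices(pool, k=k)
        return rng.sample(pool, k)

    chosen = draw(malignant, n1) + draw(normal, n2)
    n_genes = len(malignant[0])
    # Sum expression over all selected cells, gene by gene (Eq. 5.3).
    bulk = [sum(cell[g] for cell in chosen) for g in range(n_genes)]
    return bulk, n1 / n_cells      # bulk profile and malignant fraction f_1 (Eq. 5.2)

def simulate_dataset(malignant, normal, n_cells, n_samples=200, seed=0):
    """Repeat the sampling T = n_samples times for one subject."""
    rng = random.Random(seed)
    return [simulate_bulk(malignant, normal, n_cells, rng)
            for _ in range(n_samples)]
```

Because cells are only ever drawn from a single subject's pools, samples from different subjects are never mixed, as required above.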
5.2.5 Comparison of different normalization methods
The 15 AML scRNA-seq datasets are based on Unique Molecular Identifier (UMI) counting for measuring gene expression levels. UMIs enable accurate quantitation of gene expression levels
because reads generated from the same mRNA molecule can be identified, and multiple reads
associated with the same UMI are collapsed into a single count. The properties of UMI-based
protocols enable us to apply TF-IDF normalization to counteract the effect of different library
sizes and data sources [101]. However, count data generated by other sequencing
methods, such as Smart-seq, need to be further normalized using a method like TPM or FPKM [74].
We compared the performance of DeepDecon using TPM or FPKM normalization methods with
the results using TF-IDF normalization. In particular, we replaced TF-IDF with FPKM and TPM
normalization and kept all other steps of the DeepDecon algorithm the same (the same MinMax
transformation and the same iterative steps, etc.). Figure 5.2 shows the boxplots of RMSE, Pearson correlation coefficient (PCC), and Lin’s concordance correlation coefficient (CCC) between
true and predicted malignant cell fractions under different normalization methods on simulated
datasets. The boxplots show that TF-IDF normalization on UMI count data outperforms FPKM and TPM
normalization in estimating the fraction of malignant cells in tumor samples.
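For reference, the three evaluation metrics can be computed as in the following sketch, which uses the standard definitions of RMSE, the Pearson correlation coefficient, and Lin's concordance correlation coefficient (with population, i.e. 1/n, moments). The function names are illustrative and are not taken from the DeepDecon code.

```python
import math

def _moments(x, y):
    """Means, population variances, and population covariance of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cov

def rmse(truth, pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def pearson(truth, pred):
    _, _, vx, vy, cov = _moments(truth, pred)
    return cov / math.sqrt(vx * vy)

def lin_ccc(truth, pred):
    mx, my, vx, vy, cov = _moments(truth, pred)
    # CCC penalizes both poor correlation and a shift between the two means.
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike PCC, CCC drops below 1 when predictions are perfectly correlated with the truth but systematically shifted, which is why both are reported.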
5.2.6 Hyperparameter tuning
To enhance model performance, we performed a grid search over hyperparameters [102]. The algorithm evaluated all combinations of candidate hyperparameter values to identify the set that yields the best performance. To simplify the analysis, we used the average RMSE
and CCC values across all 15 subjects as the final metrics and tested on full-range (malignant cell
fraction from 0.0 to 1.0) datasets. The results of the grid search are presented in Table 5.5. Based
on the grid search, all our models consist of four fully connected layers with dropout. More
specifically, the numbers of nodes in the four layers are 256, 128, 64, and 32. Dropout
with a probability of 0.1 is applied in the first three fully connected layers. We set the learning rate to 0.0001
and the batch size to 128, and used Adam as the optimizer.
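The selection step of the grid search can be sketched as below. In practice the `evaluate` callback would train a model and return its cross-validated average RMSE; here, as an assumption made purely for illustration, it is replaced by a lookup of the average RMSE values reported in Table 5.5. The names `grid_search` and `TABLE_5_5_RMSE` are hypothetical.

```python
from itertools import product

LEARNING_RATES = (0.0001, 0.001, 0.01)
BATCH_SIZES = (128, 64)
DROPOUT_RATES = (0.1, 0.2)

# Average RMSE across the 15 AML subjects, copied from Table 5.5.
# Keys are (learning rate, batch size, dropout rate).
TABLE_5_5_RMSE = {
    (0.0001, 128, 0.1): 0.1106, (0.0001, 128, 0.2): 0.1213,
    (0.0001, 64, 0.1): 0.1310, (0.0001, 64, 0.2): 0.1212,
    (0.001, 128, 0.1): 0.1322, (0.001, 128, 0.2): 0.1329,
    (0.001, 64, 0.1): 0.1270, (0.001, 64, 0.2): 0.1327,
    (0.01, 128, 0.1): 0.2763, (0.01, 128, 0.2): 0.2810,
    (0.01, 64, 0.1): 0.2810, (0.01, 64, 0.2): 0.2882,
}

def grid_search(evaluate):
    """Return the hyperparameter combination with the lowest score."""
    grid = product(LEARNING_RATES, BATCH_SIZES, DROPOUT_RATES)
    return min(grid, key=evaluate)

# Stand-in for training: look up the reported RMSE instead of fitting a model.
best = grid_search(TABLE_5_5_RMSE.__getitem__)
```

With the tabulated values, `best` is the combination adopted above: learning rate 0.0001, batch size 128, dropout rate 0.1.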
[Figure 5.1: 15 scatter-plot panels, one per AML dataset. x-axis: Truth (0.0 to 1.0); y-axis: Prediction (0.0 to 1.0). Legend: DeepDecon, Scaden, CIBERSORTx, Bisque, ESTIMATE, MuSiC, MEAD, RNA-Sieve, NNLS. RMSE values annotated on each panel:]
Dataset    D       S       C       B       E       MU      M       R       N
210A-D0    0.0493  0.1896  0.1546  0.1386  0.3359  0.0594  0.0469  0.0429  0.1051
328-D0     0.0444  0.1520  0.1574  0.1416  0.2827  0.0902  0.1141  0.0680  0.1730
328-D113   0.0706  0.2247  0.1608  0.1567  0.3321  0.2478  0.2269  0.2823  0.1493
328-D171   0.0576  0.1154  0.1329  0.1294  0.3343  0.2838  0.3266  0.2529  0.4665
328-D29    0.0537  0.1081  0.1376  0.1528  0.3061  0.0842  0.1378  0.0291  0.5194
329-D0     0.0561  0.0960  0.0519  0.3544  0.2897  0.1164  0.1264  0.0723  0.2569
329-D20    0.0770  0.2431  0.2935  0.4800  0.2351  0.3084  0.2926  0.2622  0.4003
419A-D0    0.0408  0.2787  0.2425  0.3135  0.2852  0.5158  0.3926  0.4504  0.5918
420B-D0    0.0449  0.1523  0.2096  0.4260  0.3444  0.2701  0.2270  0.2302  0.5173
475-D0     0.0416  0.1717  0.2127  0.1386  0.3428  0.2634  0.2887  0.1783  0.2533
556-D0     0.0309  0.1825  0.1623  0.1354  0.3402  0.4052  0.3066  0.3455  0.4783
707B-D0    0.0477  0.2973  0.3754  0.5235  0.3848  0.3293  0.3707  0.3942  0.4444
916-D0     0.1205  0.4671  0.3868  0.2133  0.3797  0.4218  0.4539  0.5653  0.4798
921A-D0    0.0640  0.1614  0.1275  0.1515  0.3042  0.1024  0.2075  0.1018  0.3318
1012-D0    0.2876  0.5162  0.5429  0.1959  0.3351  0.4805  0.5333  0.5076  0.2863
Figure 5.1: DeepDecon outperformed other methods for predicting malignant cell type
fractions based on 15 artificial AML bulk RNA-seq datasets. Scatter plots of true versus
predicted malignant cell type fractions based on DeepDecon, Scaden, Bisque, MEAD, RNA-Sieve,
MuSiC, CIBERSORTx, ESTIMATE, and NNLS for all 15 AML datasets. The x-axis is the true
fraction and the y-axis is the predicted fraction. Leave-one-out cross-validation was used,
where one dataset served as the testing dataset and the remaining 14 datasets served as the training dataset.
The numbers on each subplot are the RMSE values between the true and predicted fractions for each
method. D: DeepDecon, S: Scaden, B: Bisque, M: MEAD, R: RNA-Sieve, MU: MuSiC, C: CIBERSORTx, E: ESTIMATE, and N: NNLS.
Figure 5.2: Boxplots of Root Mean Square Error (RMSE), Pearson correlation coefficient
(PCC), and Lin’s concordance correlation coefficient (CCC) of DeepDecon under different normalization methods on simulated AML datasets. A: Boxplots of RMSE values between the predicted and true fractions of malignant cells on simulated bulk RNA-seq datasets
of different normalization methods. B: Boxplots of Pearson’s correlation coefficient values between the predicted and true fractions of malignant cells on simulated bulk RNA-seq datasets of
different normalization methods. C: Boxplots of Lin’s concordance correlation coefficient (CCC)
values between the predicted and true fractions of malignant cells on simulated bulk RNA-seq
datasets of different normalization methods.
[Figure 5.3: 15 scatter-plot panels, one per AML dataset. x-axis: Truth (0.0 to 1.0); y-axis: Prediction (0.0 to 1.0). Legend: Iterative, Non-iterative. RMSE values annotated on each panel:]
Dataset    I       N
210A-D0    0.0493  0.0761
328-D0     0.0444  0.0707
328-D113   0.0706  0.1152
328-D171   0.0576  0.1004
328-D29    0.0537  0.0753
329-D0     0.0561  0.0809
329-D20    0.0770  0.1703
419A-D0    0.0408  0.1945
420B-D0    0.0449  0.0450
475-D0     0.0416  0.1600
556-D0     0.0309  0.0933
707B-D0    0.0477  0.2026
916-D0     0.1205  0.2530
921A-D0    0.0640  0.0827
1012-D0    0.2876  0.3001
Figure 5.3: Scatter plots of malignant cell type fractions estimated from DeepDecon with
and without iteration. The non-iterative model was trained on simulated AML samples with
malignant cell fractions in 0 ≤ p ≤ 1. The x-axis is the true fraction and the y-axis is the predicted
fraction. Leave-one-subject-out cross-validation was used, where one subject served as the
testing dataset and the 14 datasets from the other subjects served as training data. The numbers on each
subplot are the RMSE values between the true and predicted fractions for each method. DeepDecon
without iteration has poor prediction accuracy when the malignant cell fraction is close to 0 or 1,
and its predictions tend to follow an S-shape. It also has higher RMSE values than iterative DeepDecon. The results
indicate that the iterative approach outperforms the non-iterative approach. N: Non-iterative
model, I: Iterative model.
Figure 5.4: UMAP projection of 15 scRNA-seq AML datasets reveals heterogeneity across
different datasets. AML1012-D0, AML475-D0, AML916-D0, and AML707B-D0 were far away
from the other datasets, indicating distinct expression patterns.
[Figure 5.5: nine scatter-plot panels, one per method. x-axis: Truth; y-axis: Prediction. RMSE annotated on each panel: DeepDecon 0.1347, Scaden 0.1674, CIBERSORTx 0.2930, Bisque 0.2683, ESTIMATE 0.2207, MuSiC 0.2731, MEAD 0.2662, RNA-Sieve 0.2741, NNLS 0.3030.]
Figure 5.5: DeepDecon outperformed other deconvolution methods based on real primary AML RNA-seq expression data. Scatter plots of malignant cell fractions estimated from
DeepDecon, Scaden, Bisque, MEAD, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS on
real primary AML tissues. The x-axis is the true fraction and the y-axis is the predicted fraction.
The number on each subplot is the RMSE value between the true and predicted malignant cell
fractions for that method.
[Figure 5.6: nine scatter-plot panels, one per method. x-axis: Truth; y-axis: Prediction. RMSE annotated on each panel: DeepDecon 0.1903, Scaden 0.2100, CIBERSORTx 0.2752, Bisque 0.2535, ESTIMATE 0.2339, MuSiC 0.2948, MEAD 0.2847, RNA-Sieve 0.3066, NNLS 0.2998.]
Figure 5.6: DeepDecon outperformed other deconvolution methods based on real recurrent AML RNA-seq expression data. Scatter plots of malignant cell fractions estimated from
DeepDecon, Scaden, Bisque, MEAD, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS on
real recurrent AML tissues. The x-axis is the true fraction and the y-axis is the predicted fraction.
The number on each subplot is the RMSE value between the true and predicted malignant cell
fractions for that method.
[Figure 5.7: nine scatter-plot panels, one per method. x-axis: Truth; y-axis: Prediction. RMSE annotated on each panel: DeepDecon 0.1733, Scaden 0.1973, CIBERSORTx 0.2320, Bisque 0.2817, ESTIMATE 0.2925, MuSiC 0.2326, MEAD 0.2816, RNA-Sieve 0.3187, NNLS 0.3172.]
Figure 5.7: DeepDecon outperformed other deconvolution methods based on real BeatAML RNA-seq expression data. Scatter plots of malignant cell fractions estimated from DeepDecon, Scaden, Bisque, MEAD, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS on real
BeatAML dataset. The x-axis is the true fraction, and the y-axis is the predicted fraction. The
numbers on each subplot are the RMSE values between the true and predicted malignant cell
fraction of each method.
Figure 5.8: DeepDecon outperforms CIBERSORTx, RNA-Sieve, and NNLS when subject
information is used. DeepDecon is robust to gene expression profiles even when subject information is given in CIBERSORTx, RNA-Sieve, and NNLS on real AML (A), neuroblastoma (B), and
HNSCC (C) datasets. In ‘Aggregate’ mode, the single-cell reference is constructed by combining the
single cells from different subjects without considering subject information. In
‘Separate’ mode, a single-cell reference is constructed separately for each subject and the final
result is the average of the results under each patient-specific reference. * 0.01 < p-value ≤ 0.05,
** 0.001 < p-value ≤ 0.01, *** p-value ≤ 0.001.
[Figure 5.9: grouped boxplots. x-axis: method (DeepDecon, Scaden, CIBERSORTx, Bisque, ESTIMATE, MuSiC, MEAD, RNA-Sieve, NNLS); y-axis: Correlation; legend: Noise Level (0.00, 0.01, 0.05, 0.10).]
Figure 5.9: DeepDecon is robust to gene expression perturbations. Boxplots of correlation
values between the true and estimated malignant cell fractions on simulated AML datasets under
different noise levels. We added random noise generated from a Gaussian distribution with zero
mean and variance equal to α (α = 0.01, 0.05, 0.1) times the gene expression level for each gene
in each sample. We also randomly selected 10% of the genes in each sample and set their
expression values to 0. The color represents the noise level α.
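The perturbation scheme described in this caption can be sketched as follows. The function name `perturb` and the `seed` argument are hypothetical, and the sketch does not clip negative values after adding noise because the text does not say whether that was done.

```python
import random

def perturb(sample, alpha, mask_fraction=0.10, seed=0):
    """sample: non-negative expression values of one bulk sample."""
    rng = random.Random(seed)
    # Gaussian noise with zero mean and variance alpha * expression,
    # i.e. standard deviation sqrt(alpha * expression).
    noisy = [x + rng.gauss(0.0, (alpha * x) ** 0.5) for x in sample]
    # Randomly choose 10% of the genes and set their expression to 0.
    n_mask = round(mask_fraction * len(sample))
    for g in rng.sample(range(len(sample)), n_mask):
        noisy[g] = 0.0
    return noisy
```

With `alpha = 0` only the masking step changes the sample, which corresponds to the 0.00 noise level in the legend under the assumption that masking was applied at every level.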
[Figure 5.10: grouped boxplots. x-axis: method (DeepDecon, Scaden, CIBERSORTx, Bisque, ESTIMATE, MuSiC, MEAD, RNA-Sieve, NNLS); y-axis: CCC; legend: Noise Level (0.00, 0.01, 0.05, 0.10).]
Figure 5.10: DeepDecon is robust to gene expression perturbations. Boxplots of CCC values between the true and estimated malignant cell fractions on simulated AML datasets under
different noise levels. We added random noise generated from a Gaussian distribution with zero
mean and variance equal to α (α = 0.01, 0.05, 0.1) times the gene expression level for each gene
in each sample. We also randomly selected 10% of the genes in each sample and set their
expression values to 0. The color represents the noise level α.
Figure 5.11: Barplots of the numbers of malignant and normal cells in each scRNA-seq
AML subject. Subjects with at least 100 malignant and 100 normal cells were selected for this
study.
Table 5.1: Root mean square errors (RMSE, %) for the estimated fraction of malignant cells
in leave-one-subject-out cross-validation on neuroblastoma datasets for DeepDecon, Scaden,
Bisque, MEAD, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS. The boldfaced numbers indicate the best one.
             Subject ID
Method       T10  T19  T27  T40  T44  T69  T71  T75  T92  Mean   Median  Mean rank  Real data
DeepDecon    19   16   23   21   24   26   27   17   22   21.67  22      1.11       24
Scaden       32   30   31   31   33   28   28   27   28   29.78  30      3.44       26
Bisque       56   13   26   41   59   30   30   38   31   36.00  31      5.44       31
MEAD         47   44   46   59   57   27   41   24   24   41.00  44      5.11       36
RNA-Sieve    58   60   57   59   62   29   45   56   59   53.89  58      7.78       36
CIBERSORTx   31   32   32   32   40   29   28   31   33   32.00  32      4.67       28
MuSiC        46   56   24   28   26   29   27   22   29   31.89  28      3.33       28
ESTIMATE     47   29   38   38   52   39   42   39   45   41.00  39      6.11       29
NNLS         58   56   57   59   55   59   58   24   52   53.11  57      7.22       36
Table 5.2: Root mean square errors (RMSE, %) for the estimated fraction of malignant cells
in leave-one-subject-out cross-validation on HNSCC datasets for DeepDecon, Scaden, Bisque,
MEAD, RNA-Sieve, MuSiC, CIBERSORTx, ESTIMATE, and NNLS. The boldfaced numbers indicate the
best one.
Subject ID
Method ATC265T ATC267T ATC269T ATC273T ATC275T HNSCC17 HNSCC18 HNSCC25 HNSCC5 HNSCC6 P01 P07 P1 P12 P13
DeepDecon 5 7 6 10 10 5 9 5 6 5 6 28 5 8 8
Scaden 8 17 13 10 11 6 10 4 7 7 37 61 9 11 9
Bisque 7 7 13 26 23 11 10 10 20 33 7 63 7 13 9
MEAD 5 22 15 13 5 7 14 6 14 32 40 58 13 21 9
RNA-Sieve 8 27 24 20 5 18 26 7 14 29 40 61 10 22 9
CIBERSORTx 18 17 7 22 25 7 23 7 15 15 34 53 18 12 7
MuSiC 15 30 26 21 13 5 15 3 11 28 39 59 13 24 15
ESTIMATE 19 27 19 16 16 23 20 24 14 33 45 24 25 32 31
NNLS 37 42 40 50 7 22 25 14 17 19 38 27 21 32 29
Subject ID (continued)
Method P14 P1802 P1805 P1810 P1813 P1815 P1816 P2 patient2 patient3 patient5 patient6  Mean  Median  Mean rank  Real data
DeepDecon 6 10 7 6 4 8 5 6 7 10 7 12 7.81 7.00 1.74 18
Scaden 18 15 3 7 8 6 7 12 8 25 11 22 13.41 10.00 3.26 19
Bisque 5 10 10 7 9 9 7 10 11 5 10 6 13.26 10.00 4.41 28
MEAD 3 14 12 11 12 19 16 3 9 41 36 37 18.04 14.00 5.00 29
RNA-Sieve 4 15 12 10 9 18 16 2 11 41 38 33 19.59 16.00 5.48 32
CIBERSORTx 19 6 10 7 5 10 10 10 12 26 12 33 16.30 12.00 4.59 22
MuSiC 11 19 12 12 10 22 19 5 13 41 47 36 20.89 15.00 6.15 25
ESTIMATE 36 25 37 33 34 33 31 21 26 33 28 37 27.48 27.00 7.59 20
NNLS 26 18 8 13 9 18 15 9 24 56 60 34 26.30 24.00 6.78 34
Table 5.3: Preprocessing criteria for each subject in AML and neuroblastoma datasets. Gene
expression threshold means the maximum gene expression value of a cell. The gene number
threshold means the maximum number of expressed genes. This is to avoid gene expressions
that do not represent a single cell. The criteria are based on the Scanpy (v. 1.7.2) functions
‘filter_cells’ and ‘filter_genes’.
Subject Gene expression threshold Gene number threshold
AML328-D29 7000 2500
AML1012-D0 5000 1600
AML556-D0 10000 3000
AML328-D171 5000 2000
AML210A-D0 6000 2000
AML419A-D0 7000 2500
AML328-D0 5000 2000
AML707B-D0 6000 2000
AML916-D0 5000 2000
AML328-D113 6000 2000
AML329-D0 8000 2000
AML420B-D0 7000 2000
AML329-D20 7000 2200
AML921A-D0 8000 2500
AML475-D0 4800 1500
T10 4000 12000
T19 5000 20000
T27 4000 19000
T40 5000 20000
T44 5000 20000
T69 4000 20000
T71 4000 15000
T75 3500 11000
T92 6000 30000
Table 5.4: Bulk AML RNA-seq datasets used in DeepDecon
Name Number of samples Sequencing platform Normalization method Source
primary 117 Affymetrix Gene ST Array [103] FPKM GDC Data Portal [104]
recurrent 38 Affymetrix Gene ST Array FPKM GDC Data Portal
BeatAML 300 SureSelect [105] CPM Tyner, et al. 2018 [64]
Pediatric Neuroblastoma 99 Illumina Hi-Seq 2000 [106] RPKM cBioPortal [107]
TCGA-HNSC 518 Illumina Hi-Seq 2000 RPKM LinkedOmics [68]
Table 5.5: Hyperparameters are tested for model optimization and final model parameters are selected to minimize the RMSE between the true and predicted malignant cell fraction. The metrics
are shown as the average values across all 15 AML datasets (with their standard deviation). All
iterative DeepDecon models share the same structure and parameters.
Learning rate  Batch size  Dropout rate  RMSE             CCC
0.0001         128         0.1           0.1106 ± 0.0598  0.8490 ± 0.2298
0.0001         128         0.2           0.1213 ± 0.0815  0.8513 ± 0.2880
0.0001         64          0.1           0.1310 ± 0.0777  0.8346 ± 0.2910
0.0001         64          0.2           0.1212 ± 0.0823  0.8063 ± 0.2881
0.001          128         0.1           0.1322 ± 0.0764  0.7603 ± 0.2745
0.001          128         0.2           0.1329 ± 0.0694  0.7603 ± 0.2513
0.001          64          0.1           0.1270 ± 0.0779  0.7603 ± 0.2746
0.001          64          0.2           0.1327 ± 0.0700  0.7703 ± 0.2418
0.01           128         0.1           0.2763 ± 0.0450  0.5632 ± 0.2298
0.01           128         0.2           0.2810 ± 0.0598  0.6074 ± 0.2298
0.01           64          0.1           0.2810 ± 0.0697  0.5734 ± 0.2298
0.01           64          0.2           0.2882 ± 0.0787  0.5764 ± 0.2298
5.3 Appendix for chapter 3: DeepDeconUQ Estimates Malignant
Cell Fraction Prediction Intervals in Bulk RNA-seq Tissue
5.3.1 Preprocessing of single-cell gene expression data
The single-cell RNA sequencing (scRNA-seq) dataset for acute myeloid leukemia (AML) utilized
in DeepDeconUQ is identical to that employed in DeepDecon [28]. The AML dataset was sourced
from the Gene Expression Omnibus (GEO) under accession number GSE116256 [62]. To ensure
high data quality, we included scRNA-seq samples only from subjects containing a minimum of
100 normal and 100 malignant cells, respectively (Figure 5.11). This threshold was chosen to avoid
scenarios with an insufficient number of normal or malignant cells. In total, 15 AML subjects met
the inclusion criteria. For each subject, cells with fewer than 500 detected genes were excluded,
as were genes expressed in fewer than five cells. Subsequently, the gene expression profiles were
further curated by removing extreme outliers. The detailed filtering criteria applied to each subject are summarized in Table 5.6. Gene expression levels were then normalized based on library
size using total counts across all genes, and the processed data were retained for subsequent bulk
RNA-seq data synthesis. Cell type annotations (malignant/normal) were determined using the
‘PredictionRefined’ tag [62], which represents a manual reclassification of cells based on detailed
analysis of mutation and expression profiles.
5.3.2 Software comparison and settings
We compared DeepDeconUQ with other deconvolution methods, namely MEAD (v. 1.0.1) [44] and
RNA-Sieve (v. 0.1.4) [43].
For MEAD [44], we installed the R package given in the manuscript and ran it with default
settings, following the example provided by the authors. In the leave-one-out cross-validation, the single-cell profile was constructed using the
single cells of all subjects except the held-out test subject, while for the real bulk testing data, the
single-cell profile was constructed by combining all available single-cell data. Subject information
was included in the single-cell reference.
For RNA-Sieve [43], we executed it by following the example code provided. In the leave-one-out cross-validation, the single-cell profile was constructed by combining the single cells of all
subjects except the held-out test subject, while for the real bulk testing data, the single-cell profile was
constructed by combining all available single-cell data.
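The leave-one-out reference construction used for both methods can be sketched as below. The generator name and the dictionary layout (`subject -> list of cells`) are assumptions for illustration; each reference cell is paired with its subject label so that subject information remains available to methods that use it.

```python
def leave_one_subject_out(cells_by_subject):
    """Yield (held-out subject, reference built from all other subjects)."""
    for held_out in cells_by_subject:
        reference = [(subject, cell)
                     for subject, cells in cells_by_subject.items()
                     if subject != held_out   # exclude the test subject's cells
                     for cell in cells]
        yield held_out, reference
```

For the real bulk testing data no subject is held out, so the reference is simply the union of all subjects' cells.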
Figure 5.12: DeepDeconUQ is robust to gene expression perturbations. Boxplots of coverage
and average prediction interval length on 15 AML simulated bulk RNA-seq datasets under different noise levels. We added random noise generated from a Gaussian distribution with zero mean
and variance equal to λ (λ = 0.01, 0.05, 0.1) times the gene expression level for each gene in
each sample. Each box contains a total of 15 points, representing 15 separate AML datasets. The
color represents the noise level λ. Significance level α = 0.15.
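The two quantities shown in these boxplots can be computed as in the following sketch, which uses the usual definitions of empirical coverage and mean interval length. The function name is illustrative and is not taken from the DeepDeconUQ code.

```python
def coverage_and_length(truth, lower, upper):
    """Empirical coverage and mean length of prediction intervals."""
    n = len(truth)
    # Coverage: share of true fractions that fall inside their interval.
    covered = sum(1 for t, lo, hi in zip(truth, lower, upper) if lo <= t <= hi)
    # Average interval length: mean of (upper - lower).
    mean_length = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, mean_length
```

At significance level α = 0.15 the nominal target coverage is 1 − α = 0.85, so a well-calibrated method should return a coverage close to that value with intervals as short as possible.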
Figure 5.13: DeepDeconUQ is robust to gene expression perturbations. Boxplots of coverage
and average prediction interval length on 15 AML simulated bulk RNA-seq datasets under different noise levels. We added random noise generated from a Gaussian distribution with zero mean
and variance equal to λ (λ = 0.01, 0.05, 0.1) times the gene expression level for each gene in
each sample. Each box contains a total of 15 points, representing 15 separate AML datasets. The
color represents the noise level λ. Significance level α = 0.05.
Table 5.6: Preprocessing criteria for each subject in the AML dataset. Gene
expression threshold means the maximum gene expression value of a cell. The gene number
threshold means the maximum number of expressed genes. This is to avoid gene expressions
that do not represent a single cell. The criteria are based on the Scanpy (v. 1.7.2) functions
‘filter_cells’ and ‘filter_genes’.
Subject Gene expression threshold Gene number threshold
AML328-D29 7000 2500
AML1012-D0 5000 1600
AML556-D0 10000 3000
AML328-D171 5000 2000
AML210A-D0 6000 2000
AML419A-D0 7000 2500
AML328-D0 5000 2000
AML707B-D0 6000 2000
AML916-D0 5000 2000
AML328-D113 6000 2000
AML329-D0 8000 2000
AML420B-D0 7000 2000
AML329-D20 7000 2200
AML921A-D0 8000 2500
AML475-D0 4800 1500
Table 5.7: Bulk AML RNA-seq datasets used in DeepDeconUQ
Name Number of samples Sequencing platform Normalization method Source
primary 117 Affymetrix Gene ST Array [103] FPKM GDC Data Portal [104]
recurrent 38 Affymetrix Gene ST Array FPKM GDC Data Portal
BeatAML 300 SureSelect [105] CPM Tyner, et al. 2018 [64]
Bibliography
[1] Nicole M Anderson and M Celeste Simon. “The tumor microenvironment”. In: Current
Biology 30.16 (2020), R921–R925. doi: 10.1016/j.cub.2020.06.081.
[2] Karin E De Visser and Johanna A Joyce. “The evolving tumor microenvironment: From
cancer initiation to metastatic outgrowth”. In: Cancer cell 41.3 (2023), pp. 374–403. doi:
10.1016/j.ccell.2023.02.016.
[3] Lijin K Gopi and Benjamin L Kidder. “Integrative pan cancer analysis reveals
epigenomic variation in cancer type and cell specific chromatin domains”. In: Nature
communications 12.1 (2021), p. 1419. doi: 10.1038/s41467-021-21707-1.
[4] Maurice Tubiana. “Tumor cell proliferation kinetics and tumor growth rate”. In: Acta
Oncologica 28.1 (1989), pp. 113–121. doi: 10.3109/02841868909111193.
[5] Jawad Fares, Mohamad Y Fares, Hussein H Khachfe, Hamza A Salhab, and Youssef Fares.
“Molecular principles of metastasis: a hallmark of cancer revisited”. In: Signal
transduction and targeted therapy 5.1 (2020), p. 28. doi: 10.1038/s41392-020-0134-x.
[6] Bo Li, Eric Severson, Jean-Christophe Pignon, Haoquan Zhao, Taiwen Li, Jesse Novak,
Peng Jiang, Hui Shen, Jon C Aster, Scott Rodig, et al. “Comprehensive analyses of tumor
immunity: implications for cancer immunotherapy”. In: Genome biology 17.1 (2016),
pp. 1–16.
[7] Michael W Schmitt, Lawrence A Loeb, and Jesse J Salk. “The influence of subclonal
resistance mutations on targeted cancer therapy”. In: Nature reviews Clinical oncology
13.6 (2016), pp. 335–347. doi: 10.1038/nrclinonc.2015.175.
[8] Xuan Wang, Haiyun Zhang, and Xiaozhuo Chen. “Drug resistance and combating drug
resistance in cancer”. In: Cancer drug resistance 2.2 (2019), p. 141. doi:
10.20517/cdr.2019.10.
[9] Rebecca Austin, Mark J. Smyth, and Steven W. Lane. “Harnessing the Immune System in
Acute Myeloid Leukaemia”. In: Crit Rev Oncol Hematol 103 (July 2016), pp. 62–77. issn:
1879-0461. doi: 10.1016/j.critrevonc.2016.04.020.
[10] Zhong Wang, Mark Gerstein, and Michael Snyder. “RNA-Seq: a revolutionary tool for
transcriptomics”. In: Nature reviews genetics 10.1 (2009), pp. 57–63. doi: 10.1038/nrg2484.
[11] Guanghao Qi, Benjamin J Strober, Joshua M Popp, Rebecca Keener, Hongkai Ji, and
Alexis Battle. “Single-cell allele-specific expression analysis reveals dynamic and
cell-type-specific regulatory effects”. In: Nature communications 14.1 (2023), p. 6317. doi:
10.1038/s41467-023-42016-9.
[12] Zijie Jin, Wenjian Huang, Ning Shen, Juan Li, Xiaochen Wang, Jiqiao Dong, Peter J Park,
and Ruibin Xi. “Single-cell gene fusion detection by scFusion”. In: Nature
Communications 13.1 (2022), p. 1084. doi: 10.1038/s41467-022-28661-6.
[13] Nadine Norton, Zhifu Sun, Yan W Asmann, Daniel J Serie, Brian M Necela,
Aditya Bhagwate, Jin Jen, Bruce W Eckloff, Krishna R Kalari, Kevin J Thompson, et al.
“Gene expression, single nucleotide variant and fusion transcript discovery in archival
material from breast tumors”. In: PloS one 8.11 (2013), e81925. doi:
10.1371/journal.pone.0081925.
[14] Xuran Wang, Jihwan Park, Katalin Susztak, Nancy R. Zhang, and Mingyao Li. “Bulk
Tissue Cell Type Deconvolution with Multi-Subject Single-Cell Expression Reference”.
In: Nature Communications 10 (Jan. 2019), p. 380. issn: 2041-1723. doi:
10.1038/s41467-018-08023-x.
[15] Erick Armingol, Adam Officer, Olivier Harismendy, and Nathan E Lewis. “Deciphering
cell–cell interactions and communication from gene expression”. In: Nature Reviews
Genetics 22.2 (2021), pp. 71–88. doi: 10.1038/s41576-020-00292-x.
[16] Hung Nguyen, Ha Nguyen, Duc Tran, Sorin Draghici, and Tin Nguyen. “Fourteen years
of cellular deconvolution: methodology, applications, technical evaluation and
outstanding challenges”. In: Nucleic Acids Research 52.9 (2024), pp. 4761–4783. doi:
10.1093/nar/gkae267.
[17] Virginia Espina, Julia D Wulfkuhle, Valerie S Calvert, Amy VanMeter, Weidong Zhou,
George Coukos, David H Geho, Emanuel F Petricoin III, and Lance A Liotta.
“Laser-capture microdissection”. In: Nature protocols 1.2 (2006), pp. 586–603. doi:
10.1038/nprot.2006.85.
[18] AF Maarten Altelaar and Albert JR Heck. “Trends in ultrasensitive proteomics”. In:
Current opinion in chemical biology 16.1-2 (2012), pp. 206–213. doi:
10.1016/j.cbpa.2011.12.011.
[19] Robert L Grossman, Allison P Heath, Vincent Ferretti, Harold E Varmus,
Douglas R Lowy, Warren A Kibbe, and Louis M Staudt. “Toward a shared vision for
cancer genomic data”. In: New England Journal of Medicine 375.12 (2016), pp. 1109–1112.
doi: 10.1056/NEJMp1607591.
[20] Ron Edgar, Michael Domrachev, and Alex E Lash. “Gene Expression Omnibus: NCBI
gene expression and hybridization array data repository”. In: Nucleic acids research 30.1
(2002), pp. 207–210. doi: 10.1093/nar/30.1.207.
[21] Gabriella Rustici, Nikolay Kolesnikov, Marco Brandizi, Tony Burdett, Miroslaw Dylag,
Ibrahim Emam, Anna Farne, Emma Hastings, Jon Ison, Maria Keays, et al. “ArrayExpress
update—trends in database growth and links to data analysis tools”. In: Nucleic acids
research 41.D1 (2012), pp. D987–D990. doi: 10.1093/nar/gks1174.
[22] Francisco Avila Cobos, Jo Vandesompele, Pieter Mestdagh, and Katleen De Preter.
“Computational Deconvolution of Transcriptomics Data from Mixed Cell Populations”.
In: Bioinformatics 34.11 (June 2018), pp. 1969–1979. issn: 1367-4811. doi:
10.1093/bioinformatics/bty019.
[23] Shahin Mohammadi, Neta Zuckerman, Andrea Goldsmith, and Ananth Grama. “A
Critical Survey of Deconvolution Methods for Separating Cell Types in Complex
Tissues”. In: Proceedings of the IEEE 105.2 (Feb. 2017), pp. 340–366. issn: 1558-2256. doi:
10.1109/JPROC.2016.2607121.
[24] Jason Rudy and Faramarz Valafar. “Empirical comparison of cross-platform
normalization methods for gene expression data”. In: BMC bioinformatics 12 (2011),
pp. 1–22. doi: 10.1186/1471-2105-12-467.
[25] Luise Wolf, Olin K Silander, and Erik van Nimwegen. “Expression noise facilitates the
evolution of gene regulation”. In: elife 4 (2015), e05856. doi: 10.7554/eLife.05856.
[26] Vahid Asghari, Yat Fai Leung, and Shu-Chien Hsu. “Deep neural network based
framework for complex correlations in engineering metrics”. In: Advanced Engineering
Informatics 44 (2020), p. 101058. doi: 10.1016/j.aei.2020.101058.
[27] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: nature 521.7553
(2015), pp. 436–444. doi: 10.1038/nature14539.
[28] Jiawei Huang, Yuxuan Du, Andres Stucky, Kevin R. Kelly, Jiang F. Zhong, and
Fengzhu Sun. “DeepDecon accurately estimates cancer cell fractions in bulk RNA-seq
data”. In: Patterns 38.5 (2020), pp. 716–733. doi: 10.1016/j.ccell.2020.08.014.
[29] Manuel Fernández-Delgado, Manisha Sanjay Sirsat, Eva Cernadas, Sadi Alawadi,
Senén Barro, and Manuel Febrero-Bande. “An extensive experimental survey of
regression methods”. In: Neural Networks 111 (2019), pp. 11–34. doi:
10.1016/j.neunet.2018.12.010.
[30] Stéphane Lathuilière, Pablo Mesejo, Xavier Alameda-Pineda, and Radu Horaud. “A
comprehensive analysis of deep regression”. In: IEEE transactions on pattern analysis and
machine intelligence 42.9 (2019), pp. 2065–2081. doi: 10.48550/arXiv.1803.08450.
[31] Martim Sousa, Ana Maria Tomé, and José Moreira. “Improving conformalized quantile
regression through cluster-based feature relevance”. In: Expert Systems with Applications
238 (2024), p. 122322. doi: 10.1016/j.eswa.2023.122322.
[32] Ian Farrance and Robert Frenkel. “Uncertainty of measurement: a review of the rules for
calculating uncertainty components through functional relationships”. In: The Clinical
Biochemist Reviews 33.2 (2012), p. 49.
[33] Joanna IntHout, John PA Ioannidis, Maroeska M Rovers, and Jelle J Goeman. “Plea for
routinely presenting prediction intervals in meta-analysis”. In: BMJ open 6.7 (2016),
e010247. doi: 10.1136/bmjopen-2015-010247.
[34] Zhanguo Song, Wei Feng, and Weiwei Liu. “Interval prediction of short-term traffic
speed with limited data input: Application of fuzzy-grey combined prediction model”. In:
Expert Systems with Applications 187 (2022), p. 115878. doi: 10.1016/j.eswa.2021.115878.
[35] Hongtao Li, Yang Yu, Zhipeng Huang, Shaolong Sun, and Xiaoyan Jia. “A multi-step
ahead point-interval forecasting system for hourly PM2.5 concentrations based on
multivariate decomposition and kernel density estimation”. In: Expert Systems with
Applications 226 (2023), p. 120140. doi: 10.1016/j.eswa.2023.120140.
[36] Yaniv Romano, Evan Patterson, and Emmanuel Candes. “Conformalized quantile
regression”. In: Advances in neural information processing systems 32 (2019). doi:
10.48550/arXiv.1905.03222.
[37] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. “Locally valid and discriminative
prediction intervals for deep learning models”. In: Advances in Neural Information
Processing Systems 34 (2021), pp. 8378–8391. doi: 10.48550/arXiv.2106.00225.
[38] Maximilian Seitzer, Arash Tavakoli, Dimitrije Antic, and Georg Martius. “On the pitfalls
of heteroscedastic uncertainty estimation with probabilistic neural networks”. In: arXiv
preprint arXiv:2203.09168 (2022). doi: 10.48550/arXiv.2203.09168.
[39] Volodya Vovk, Alexander Gammerman, and Craig Saunders. “Machine-Learning
Applications of Algorithmic Randomness”. In: Proceedings of the Sixteenth International
Conference on Machine Learning. ICML ’99. San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 1999, pp. 444–453. isbn: 1558606122.
[40] Brandon Jew, Marcus Alvarez, Elior Rahmani, Zong Miao, Arthur Ko,
Kristina M. Garske, Jae Hoon Sul, Kirsi H. Pietiläinen, Päivi Pajukanta, and
Eran Halperin. “Accurate Estimation of Cell Composition in Bulk Expression through
Robust Integration of Single-Cell Information”. In: Nature Communications 11.1 (Apr.
2020), p. 1971. issn: 2041-1723. doi: 10.1038/s41467-020-15816-6.
[41] Aaron M. Newman, Chih Long Liu, Michael R. Green, Andrew J. Gentles, Weiguo Feng,
Yue Xu, Chuong D. Hoang, Maximilian Diehn, and Ash A. Alizadeh. “Robust
Enumeration of Cell Subsets from Tissue Expression Profiles”. In: Nature Methods 12.5
(May 2015), pp. 453–457. issn: 1548-7105. doi: 10.1038/nmeth.3337.
[42] Aaron M. Newman, Chloé B. Steen, Chih Long Liu, Andrew J. Gentles,
Aadel A. Chaudhuri, Florian Scherer, Michael S. Khodadoust, Mohammad S. Esfahani,
Bogdan A. Luca, David Steiner, Maximilian Diehn, and Ash A. Alizadeh. “Determining
Cell Type Abundance and Expression from Bulk Tissues with Digital Cytometry”. In:
Nature Biotechnology 37.7 (July 2019), pp. 773–782. issn: 1546-1696. doi:
10.1038/s41587-019-0114-2.
[43] Dan D. Erdmann-Pham, Jonathan Fischer, Justin Hong, and Yun S. Song. “A
Likelihood-Based Deconvolution of Bulk Gene Expression Data Using Single-Cell
References”. In: Genome Research (July 2021), gr.272344.120. issn: 1088-9051, 1549-5469.
doi: 10.1101/gr.272344.120.
[44] Dongyue Xie and Jingshu Wang. “Robust Statistical Inference for Cell Type
Deconvolution”. In: arXiv preprint arXiv:2202.06420 (2022). doi:
10.48550/arXiv.2202.06420.
[45] Kevin Menden, Mohamed Marouf, Sergio Oller, Anupriya Dalmia,
Daniel Sumner Magruder, Karin Kloiber, Peter Heutink, and Stefan Bonn. “Deep
learning–based cell composition analysis from tissue expression profiles”. In: Science
Advances 6.30 (2020), eaba2619. doi: 10.1126/sciadv.aba2619.
[46] Luis A Corchete, Elizabeta A Rojas, Diego Alonso-López, Javier De Las Rivas,
Norma C Gutiérrez, and Francisco J Burguillo. “Systematic comparison and assessment
of RNA-seq procedures for gene expression quantitative analysis”. In: Scientific Reports
10.1 (2020), p. 19737. doi: 10.1038/s41598-020-76881-x.
[47] Huiting Xiao, Jiashuai Zhang, Kai Wang, Kai Song, Hailong Zheng, Jing Yang, Keru Li,
Rongqiang Yuan, Wenyuan Zhao, and Yang Hui. “A Cancer-Specific Qualitative Method
for Estimating the Proportion of Tumor-Infiltrating Immune Cells”. In: Frontiers in
Immunology 12 (2021), p. 672031. issn: 1664-3224. doi: 10.3389/fimmu.2021.672031.
[48] Xinmin Li and Cun-Yu Wang. “From Bulk, Single-Cell to Spatial RNA Sequencing”. In:
International Journal of Oral Science 13 (Nov. 2021), p. 36. issn: 2049-3169. doi:
10.1038/s41368-021-00146-0.
[49] Yufang Qin, Weiwei Zhang, Xiaoqiang Sun, Siwei Nan, Nana Wei, Hua-Jun Wu, and
Xiaoqi Zheng. “Deconvolution of Heterogeneous Tumor Samples Using Partial
Reference Signals”. In: PLOS Computational Biology 16.11 (Nov. 2020), e1008452.
issn: 1553-734X. doi: 10.1371/journal.pcbi.1008452.
[50] Manuel Garber, Manfred G Grabherr, Mitchell Guttman, and Cole Trapnell.
“Computational methods for transcriptome annotation and quantification using
RNA-seq”. In: Nature Methods 8.6 (2011), pp. 469–477. doi: 10.1038/nmeth.1613.
[51] Francesca Finotello and Barbara Di Camillo. “Measuring differential gene expression
with RNA-seq: challenges and strategies for data analysis”. In: Briefings in Functional
Genomics 14.2 (2015), pp. 130–142. doi: 10.1093/bfgp/elu035.
[52] Grace X. Y. Zheng, Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent,
Ryan Wilson, Solongo B. Ziraldo, Tobias D. Wheeler, Geoff P. McDermott, Junjie Zhu,
Mark T. Gregory, Joe Shuga, Luz Montesclaros, Jason G. Underwood,
Donald A. Masquelier, Stefanie Y. Nishimura, Michael Schnall-Levin, Paul W. Wyatt,
Christopher M. Hindson, Rajiv Bharadwaj, Alexander Wong, Kevin D. Ness,
Lan W. Beppu, H. Joachim Deeg, Christopher McFarland, Keith R. Loeb,
William J. Valente, Nolan G. Ericson, Emily A. Stevens, Jerald P. Radich,
Tarjei S. Mikkelsen, Benjamin J. Hindson, and Jason H. Bielas. “Massively Parallel
Digital Transcriptional Profiling of Single Cells”. In: Nature Communications 8.1 (Jan.
2017), p. 14049. issn: 2041-1723. doi: 10.1038/ncomms14049.
[53] Alice Giustacchini, Supat Thongjuea, Nikolaos Barkas, Petter S. Woll,
Benjamin J. Povinelli, Christopher A. G. Booth, Paul Sopp, Ruggiero Norfo,
Alba Rodriguez-Meira, Neil Ashley, Lauren Jamieson, Paresh Vyas, Kristina Anderson,
Åsa Segerstolpe, Hong Qian, Ulla Olsson-Strömberg, Satu Mustjoki, Rickard Sandberg,
Sten Eirik W. Jacobsen, and Adam J. Mead. “Single-Cell Transcriptomics Uncovers
Distinct Molecular Signatures of Stem Cells in Chronic Myeloid Leukemia”. In: Nature
Medicine 23.6 (June 2017), pp. 692–702. issn: 1546-170X. doi: 10.1038/nm.4336.
[54] Sidharth V. Puram, Itay Tirosh, Anuraag S. Parikh, Anoop P. Patel, Keren Yizhak,
Shawn Gillespie, Christopher Rodman, Christina L. Luo, Edmund A. Mroz,
Kevin S. Emerick, Daniel G. Deschler, Mark A. Varvares, Ravi Mylvaganam,
Orit Rozenblatt-Rosen, James W. Rocco, William C. Faquin, Derrick T. Lin, Aviv Regev,
and Bradley E. Bernstein. “Single-Cell Transcriptomic Analysis of Primary and
Metastatic Tumor Ecosystems in Head and Neck Cancer”. In: Cell 171.7 (Dec. 2017),
1611–1624.e24. issn: 1097-4172. doi: 10.1016/j.cell.2017.10.044.
[55] Ashraful Haque, Jessica Engel, Sarah A. Teichmann, and Tapio Lönnberg. “A Practical
Guide to Single-Cell RNA-sequencing for Biomedical Research and Clinical
Applications”. In: Genome Medicine 9.1 (Aug. 2017), p. 75. issn: 1756-994X. doi:
10.1186/s13073-017-0467-4.
[56] Kosuke Yoshihara, Maria Shahmoradgoli, Emmanuel Martínez, Rahulsimham Vegesna,
Hoon Kim, Wandaliz Torres-Garcia, Victor Treviño, Hui Shen, Peter W Laird,
Douglas A Levine, et al. “Inferring tumour purity and stromal and immune cell
admixture from expression data”. In: Nature Communications 4.1 (2013), p. 2612. doi:
10.1038/ncomms3612.
[57] Ken Sugino, Erin Clark, Anton Schulmann, Yasuyuki Shima, Lihua Wang, David L Hunt,
Bryan M Hooks, Dimitri Tränkner, Jayaram Chandrashekar, Serge Picard, et al.
“Mapping the transcriptional diversity of genetically and anatomically defined cell
populations in the mouse brain”. In: Elife 8 (2019), e38619. doi: 10.7554/eLife.38619.
[58] Donghui Chen and Robert J. Plemmons. Nonnegativity Constraints in Numerical
Analysis. London: World Scientific, Nov. 2009, pp. 109–139. isbn:
978-981-283-625-0. doi: 10.1142/9789812836267_0008.
[59] Elihu Estey and Hartmut Döhner. “Acute Myeloid Leukaemia”. In: The Lancet 368.9550
(Nov. 2006), pp. 1894–1907. issn: 0140-6736. doi: 10.1016/S0140-6736(06)69780-8.
[60] John M. Bennett, Daniel Catovsky, Marie T. Daniel, George Flandrin, David A. G. Galton,
Harvey R. Gralnick, and Claude Sultan. “Proposed Revised Criteria for the Classification
of Acute Myeloid Leukemia”. In: Annals of Internal Medicine 103.4 (Oct. 1985),
pp. 620–625. issn: 0003-4819. doi: 10.7326/0003-4819-103-4-620.
[61] James W. Vardiman, Nancy Lee Harris, and Richard D. Brunning. “The World Health
Organization (WHO) Classification of the Myeloid Neoplasms”. In: Blood 100.7 (Oct.
2002), pp. 2292–2302. issn: 0006-4971. doi: 10.1182/blood-2002-04-1199.
[62] Peter van Galen, Volker Hovestadt, Marc Wadsworth, Travis Hughes, Gabriel K. Griffin,
Sofia Battaglia, Julia A. Verga, Jason Stephansky, Timothy J. Pastika,
Jennifer Lombardi Story, Geraldine S. Pinkus, Olga Pozdnyakova, Ilene Galinsky,
Richard M. Stone, Timothy A. Graubert, Alex K. Shalek, Jon C. Aster, Andrew A. Lane,
and Bradley E. Bernstein. “Single-Cell RNA-seq Reveals AML Hierarchies Relevant to
Disease Progression and Immunity”. In: Cell 176.6 (Mar. 2019), 1265–1281.e24. issn:
0092-8674. doi: 10.1016/j.cell.2019.01.031.
[63] F Alexander Wolf, Philipp Angerer, and Fabian J Theis. “SCANPY: large-scale single-cell
gene expression data analysis”. In: Genome Biology 19 (2018), p. 15. doi:
10.1186/s13059-017-1382-0.
[64] Jeffrey W Tyner, Cristina E Tognon, Daniel Bottomly, Beth Wilmot, Stephen E Kurtz,
Samantha L Savage, Nicola Long, Anna Reister Schultz, Elie Traer, Melissa Abel, et al.
“Functional genomic landscape of acute myeloid leukaemia”. In: Nature 562.7728 (2018),
pp. 526–531. doi: 10.1038/s41586-018-0623-z.
[65] Ethan Cerami, Jianjiong Gao, Ugur Dogrusoz, Benjamin E Gross, Selcuk Onur Sumer,
Bülent Arman Aksoy, Anders Jacobsen, Caitlin J Byrne, Michael L Heuer, Erik Larsson,
et al. “The cBio cancer genomics portal: an open platform for exploring
multidimensional cancer genomics data”. In: Cancer Discovery 2.5 (2012), pp. 401–404.
doi: 10.1158/2159-8290.CD-12-0095.
[66] Rui Dong, Ran Yang, Yong Zhan, Hua-Dong Lai, Chun-Jing Ye, Xiao-Ying Yao,
Wen-Qin Luo, Xiao-Mu Cheng, Ju-Ju Miao, Jun-Feng Wang, et al. “Single-cell
characterization of malignant phenotypes and developmental trajectories of adrenal
neuroblastoma”. In: Cancer Cell 38.5 (2020), pp. 716–733. doi:
10.1016/j.ccell.2020.08.014.
[67] Dongqing Sun, Jin Wang, Ya Han, Xin Dong, Jun Ge, Rongbin Zheng, Xiaoying Shi,
Binbin Wang, Ziyi Li, Pengfei Ren, et al. “TISCH: a comprehensive web resource
enabling interactive single-cell transcriptome visualization of tumor microenvironment”.
In: Nucleic Acids Research 49.D1 (2021), pp. D1420–D1430. doi: 10.1093/nar/gkaa1020.
[68] Suhas V Vasaikar, Peter Straub, Jing Wang, and Bing Zhang. “LinkedOmics: analyzing
multi-omics data within and across 32 cancer types”. In: Nucleic Acids Research 46.D1
(2018), pp. D956–D963. doi: 10.1093/nar/gkx1090.
[69] Virginia Teller. “Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition”. In: Computational
Linguistics 26.4 (Dec. 2000), pp. 638–641. issn: 0891-2017. doi:
10.1162/089120100750105975.
[70] Gobinda G Chowdhury. Introduction to modern information retrieval. Facet publishing,
2010. isbn: 185604694X.
[71] Marmar Moussa and Ion I Măndoiu. “Single cell RNA-seq data clustering using TF-IDF
based methods”. In: BMC Genomics 19 (2018), p. 569. doi: 10.1186/s12864-018-4922-4.
[72] Yanshuo Chen, Yixuan Wang, Yuelong Chen, Yuqi Cheng, Yumeng Wei, Yunxiang Li,
Jiuming Wang, Yingying Wei, Ting-Fung Chan, and Yu Li. “Deep autoencoder for
interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis”. In:
Nature Communications 13.1 (2022), p. 6735. doi: 10.1038/s41467-022-34550-9.
[73] Yingdong Zhao, Ming-Chung Li, Mariam M Konaté, Li Chen, Biswajit Das,
Chris Karlovich, P Mickey Williams, Yvonne A Evrard, James H Doroshow, and
Lisa M McShane. “TPM, FPKM, or normalized counts? A comparative study of
quantification measures for the analysis of RNA-seq data from the NCI patient-derived
models repository”. In: Journal of Translational Medicine 19.1 (2021), p. 269. doi:
10.1186/s12967-021-02936-w.
[74] Christoph Ziegenhain, Beate Vieth, Swati Parekh, Björn Reinius,
Amy Guillaumet-Adkins, Martha Smets, Heinrich Leonhardt, Holger Heyn,
Ines Hellmann, and Wolfgang Enard. “Comparative analysis of single-cell RNA
sequencing methods”. In: Molecular Cell 65.4 (2017), pp. 631–643. doi:
10.1016/j.molcel.2017.01.023.
[75] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. Jan.
2017. doi: 10.48550/arXiv.1412.6980. arXiv: 1412.6980 [cs].
[76] Leonore A Herzenberg, James Tung, Wayne A Moore, Leonard A Herzenberg, and
David R Parks. “Interpreting flow cytometry data: a guide for the perplexed”. In: Nature
Immunology 7.7 (2006), pp. 681–685. doi: 10.1038/ni0706-681.
[77] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. “UMAP: Uniform
Manifold Approximation and Projection”. In: Journal of Open Source Software 3.29 (Sept.
2018), p. 861. issn: 2475-9066. doi: 10.21105/joss.00861.
[78] Li Wang, Robert P Sebra, John P Sfakianos, Kimaada Allette, Wenhui Wang,
Seungyeul Yoo, Nina Bhardwaj, Eric E Schadt, Xin Yao, Matthew D Galsky, et al. “A
reference profile-free deconvolution method to infer cancer cell-intrinsic subtypes and
tumor-type-specific stromal profiles”. In: Genome Medicine 12.1 (2020), p. 24. doi:
10.1186/s13073-020-0720-0.
[79] Francisco Avila Cobos, José Alquicira-Hernandez, Joseph E. Powell, Pieter Mestdagh,
and Katleen De Preter. “Benchmarking of Cell Type Deconvolution Pipelines for
Transcriptomics Data”. In: Nature Communications 11.1 (Nov. 2020), p. 5650. issn: 2041-1723. doi:
10.1038/s41467-020-19015-1.
[80] Biao Cai, Jingfei Zhang, Hongyu Li, Chang Su, and Hongyu Zhao. “Statistical Inference
of Cell-type Proportions Estimated from Bulk Expression Data”. In: arXiv preprint
arXiv:2209.04038 (2022). doi: 10.48550/arXiv.2209.04038.
[81] Chang Su, Zichun Xu, Xinning Shan, Biao Cai, Hongyu Zhao, and Jingfei Zhang.
“Cell-type-specific co-expression inference from single cell RNA-sequencing data”. In:
Nature Communications 14.1 (2023), p. 4846. doi: 10.1038/s41467-023-40503-7.
[82] Ingo Steinwart and Andreas Christmann. “Estimating conditional quantiles with the
help of the pinball loss”. In: Bernoulli 17.1 (2011), pp. 211–225. doi: 10.3150/10-BEJ267.
[83] James W Taylor. “A quantile regression neural network approach to estimating the
conditional density of multiperiod returns”. In: Journal of forecasting 19.4 (2000),
pp. 299–311. doi: 10.1002/1099-131X(200007)19:4<299::AID-FOR775>3.0.CO;2-V.
[84] Ichiro Takeuchi, Quoc V. Le, Timothy D. Sears, and Alexander J. Smola. “Nonparametric
Quantile Estimation”. In: Journal of Machine Learning Research 7.45 (2006),
pp. 1231–1264. url: http://jmlr.org/papers/v7/takeuchi06a.html.
[85] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman.
“Inductive confidence machines for regression”. In: Machine learning: ECML 2002: 13th
European conference on machine learning Helsinki, Finland, August 19–23, 2002
proceedings 13. Springer. 2002, pp. 345–356. doi: 10.1007/3-540-36755-1_29.
[86] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a
random world. Vol. 29. Springer, 2005. doi: 10.1007/b106715.
[87] Shai Feldman, Stephen Bates, and Yaniv Romano. “Calibrated multiple-output quantile
regression with representation learning”. In: Journal of Machine Learning Research 24.24
(2023), pp. 1–48. doi: 10.48550/arXiv.2110.00816.
[88] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman.
“Distribution-free predictive inference for regression”. In: Journal of the American
Statistical Association 113.523 (2018), pp. 1094–1111. doi: 10.48550/arXiv.1604.04173.
[89] Quinn McNemar. “Note on the sampling error of the difference between correlated
proportions or percentages”. In: Psychometrika 12.2 (1947), pp. 153–157. doi:
10.1007/BF02295996.
[90] Jared M Sagendorf, Raktim Mitra, Jiawei Huang, Xiaojiang S Chen, and Remo Rohs.
“Structure-based prediction of protein-nucleic acid binding using graph neural
networks”. In: Biophysical Reviews 16.3 (2024), pp. 297–314. doi:
10.1007/s12551-024-01201-w.
[91] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. “Deep transfer
learning with joint adaptation networks”. In: International conference on machine
learning. PMLR. 2017, pp. 2208–2217. doi: 10.48550/arXiv.1605.06636.
[92] Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin Givechian,
Dhananjay Bhaskar, and Smita Krishnaswamy. “Transformer-based protein generation
with regularized latent space optimization”. In: Nature Machine Intelligence 4.10 (2022),
pp. 840–851. doi: 10.1038/s42256-022-00532-1.
[93] Alexander J Titus, Rachel M Gallimore, Lucas A Salas, and Brock C Christensen.
“Cell-type deconvolution from DNA methylation: a review of recent applications”. In:
Human molecular genetics 26.R2 (2017), R216–R224. doi: 10.1093/hmg/ddx275.
[94] Yuqing Zhang, Giovanni Parmigiani, and W Evan Johnson. “ComBat-seq: batch effect
adjustment for RNA-seq count data”. In: NAR genomics and bioinformatics 2.3 (2020),
lqaa078. doi: 10.1093/nargab/lqaa078.
[95] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle,
François Laviolette, Mario Marchand, and Victor Lempitsky. “Domain-adversarial training
of neural networks”. In: Journal of machine learning research 17.59 (2016), pp. 1–35. doi:
10.48550/arXiv.1505.07818.
[96] Nicolo Colombo. “On training locally adaptive CP”. In: Conformal and Probabilistic
Prediction with Applications. PMLR. 2023, pp. 384–398. doi: 10.48550/arXiv.2306.04648.
[97] Abouzar Choubineh, Jie Chen, Frans Coenen, and Fei Ma. “Applying Monte Carlo
dropout to quantify the uncertainty of skip connection-based convolutional neural
networks optimized by big data”. In: Electronics 12.6 (2023), p. 1453. doi:
10.3390/electronics12061453.
[98] Ethan Goan and Clinton Fookes. “Bayesian neural networks: An introduction and
survey”. In: Case Studies in Applied Bayesian Data Science: CIRM Jean-Morlet Chair, Fall
2018 (2020), pp. 45–87. doi: 10.48550/arXiv.2006.12024.
[99] Eyke Hüllermeier and Willem Waegeman. “Aleatoric and epistemic uncertainty in
machine learning: An introduction to concepts and methods”. In: Machine learning 110.3
(2021), pp. 457–506. doi: 10.1007/s10994-021-05946-3.
[100] Makoto Matsumoto and Takuji Nishimura. “Mersenne Twister: A 623-Dimensionally
Equidistributed Uniform Pseudo-Random Number Generator”. In: ACM Trans. Model.
Comput. Simul. 8.1 (Jan. 1998), pp. 3–30. issn: 1049-3301. doi: 10.1145/272991.272995.
[101] Catalina A Vallejos, Davide Risso, Antonio Scialdone, Sandrine Dudoit, and
John C Marioni. “Normalizing single-cell RNA sequencing data: challenges and
opportunities”. In: Nature Methods 14.6 (2017), pp. 565–571.
[102] Petro Liashchynskyi and Pavlo Liashchynskyi. “Grid search, random search, genetic
algorithm: a big comparison for NAS”. In: arXiv preprint arXiv:1912.06059 (2019).
[103] Affymetrix Gene ST Array. TARGET’s Study of Acute Myeloid Leukemia.
https://www.cancer.gov/ccg/research/genome-sequencing/target/using-targetdata/technology#aml.
[104] GDC Data Portal. TARGET’s Study of Acute Myeloid Leukemia.
https://gdc.cancer.gov/content/target-aml-publication-summary.
[105] SureSelect 38Mb. Functional Genomic Landscape of Acute Myeloid Leukemia.
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001657.v1.p1.
[106] Illumina Hi-Seq 2000. Protocols used for TARGET’s study of Neuroblastoma (NBL).
https://www.cancer.gov/ccg/research/genome-sequencing/target/using-targetdata/technology.
[107] cBioPortal Pediatric Neuroblastoma. Neuroblastoma dataset from cBioPortal.
https://www.cbioportal.org/study/summary?id=nbl_target_2018_pub.
Abstract
Accurately estimating the fraction of malignant cells in cancer tissue is vital for effective diagnosis, prognosis, and personalized treatment planning. Bulk RNA sequencing (RNA-seq) provides an aggregate gene expression profile of an entire tissue sample but lacks the resolution required to discern cellular heterogeneity within tumors. Single-cell RNA sequencing (scRNA-seq) enables precise assessment of malignant cell fractions, but its high cost and labor intensity make it impractical for routine clinical use. As a result, malignant cell proportions, which are crucial for understanding tumor dynamics and therapeutic responsiveness, remain difficult to estimate reliably. To address these challenges, this dissertation introduces DeepDecon, a deep learning model designed to estimate cancer cell fractions in bulk RNA-seq samples. DeepDecon uses scRNA-seq data to simulate bulk profiles and trains models on the resulting comprehensive dataset to predict cell fractions. It also provides a refinement strategy in which the cancer cell fraction is estimated iteratively by a set of trained models. Building on this approach, the dissertation presents DeepDeconUQ, a deep neural network model that estimates prediction intervals for malignant cell fractions from bulk RNA-seq data. DeepDeconUQ uses conformalized quantile regression to produce statistically valid, narrow prediction intervals, which makes its predictions more robust under variable gene expression conditions. Together, DeepDecon and DeepDeconUQ offer a scalable, reliable framework for malignant cell deconvolution, advancing the precision of cancer tissue analysis and supporting improved clinical decision-making.
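To make the conformalized quantile regression step concrete, the following is a minimal sketch of the calibration idea described by Romano et al. [36]. It is not the DeepDeconUQ implementation. It assumes that some quantile regressor has already produced lower and upper quantile predictions for a held-out calibration set, and the function names `cqr_margin` and `cqr_interval`, the miscoverage level `alpha`, and the clipping of intervals to the range [0, 1] (because the target is a fraction) are illustrative assumptions.

```python
import math

import numpy as np


def cqr_margin(y_cal, lo_cal, hi_cal, alpha=0.1):
    """Compute the conformal margin from a held-out calibration set.

    y_cal  : true fractions on the calibration set
    lo_cal : predicted lower quantiles for the same samples
    hi_cal : predicted upper quantiles for the same samples
    """
    y_cal, lo_cal, hi_cal = (np.asarray(a, dtype=float) for a in (y_cal, lo_cal, hi_cal))
    # Conformity score: distance by which the truth falls outside [lo, hi];
    # the score is negative when the truth lies inside the interval.
    scores = np.sort(np.maximum(lo_cal - y_cal, y_cal - hi_cal))
    n = scores.size
    # Rank of the finite-sample corrected (1 - alpha) empirical quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples for this alpha: fall back to an
        # unbounded margin, which yields the trivial interval after clipping.
        return float("inf")
    return float(scores[k - 1])


def cqr_interval(lo_test, hi_test, margin):
    """Widen (or shrink) raw quantile predictions by the conformal margin."""
    # Clipping to [0, 1] is an assumption made here because the target is a fraction.
    lo = np.clip(np.asarray(lo_test, dtype=float) - margin, 0.0, 1.0)
    hi = np.clip(np.asarray(hi_test, dtype=float) + margin, 0.0, 1.0)
    return lo, hi
```

In use, the margin is computed once on calibration samples that were not used to train the quantile model and is then applied to every new bulk sample. Under the usual exchangeability assumption, the adjusted intervals cover the true fraction with probability at least 1 - alpha before clipping.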
Asset Metadata
Creator
Huang, Jiawei (author)
Core Title
Malignant cell fraction prediction using deep learning: from point estimate to uncertainty quantification
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2025-05
Publication Date
02/21/2025
Defense Date
12/10/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
cancer,cell deconvolution,deep learning,OAI-PMH Harvest,prediction interval,RNA-seq
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Sun, Fengzhu (committee chair), Kelly, Kevin R. (committee member), Rohs, Remo (committee member), Zhong, Jiang F. (committee member)
Creator Email
jellyhuang14@gmail.com,jiaweih@usc.edu
Unique identifier
UC11399HPOT
Identifier
etd-HuangJiawe-13827.pdf (filename)
Legacy Identifier
etd-HuangJiawe-13827
Document Type
Dissertation
Rights
Huang, Jiawei
Internet Media Type
application/pdf
Type
texts
Source
20250227-usctheses-batch-1242 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu