Scalable Latent Factor Models for Inferring Genetic Regulatory
Networks
by
Dong Yuan
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
August 2024
Copyright 2024 Dong Yuan
Acknowledgements
My journey as a student began at the age of 5 and has continued uninterrupted for 23 years, culminating in the highest academic degree, a Ph.D., from the University of Southern California.
Almost every child is asked what their dream job would be. When I was 7, I answered that I wanted to be a scientist. Although I had never seen a scientist in my life and had no idea what scientists actually did, this faint dream persistently guided me through the years and has now become a reality.
Throughout this journey, I have been incredibly fortunate to encounter numerous wonderful
people who have supported and inspired me.
In elementary school, I struggled academically and consistently received poor grades. This changed in middle school, when I met my English teacher, Dehuai Huang. His stringent and rigorous teaching methods taught me how to learn and synthesize knowledge efficiently. Under his mentorship, I gradually became a "top student" in my class, and that status lasted for the next 10 years. I am deeply grateful to Mr. Dehuai Huang, my first mentor, whose guidance set a strong foundation for my academic success.
My passion for learning continued through middle school, high school, and college. During
my master’s studies, I delved into research and realized that knowledge alone was insufficient to
solve research problems.
In the first year of my Ph.D., I met my remarkable academic mentor, Dr. Nicholas Mancuso. I
never anticipated that an advisor could become one of the closest people in my life. Nick always considered what was best for his trainees. In our first conversation, he reviewed my resume in
detail and proposed a project (Dual-Variational Autoencoders) that perfectly matched my interests.
Despite my lack of experience with deep learning models, Nick patiently taught me coding and
machine learning skills. In my first research project, I faced numerous challenges, including building the model, coding in Python, and fixing countless bugs. Then, in my second project,
SuSiE PCA, Nick guided me through complex mathematical derivations and helped me develop
my first Python package with well-structured API documentation. As I progressed, I began to formulate scientific ideas and solutions independently; the third project, PerturbVI, was real practice for me: I started to think scientifically about how to move the project forward, proposed new ideas and solutions to advance the model, and contributed more to our discussions. Finally, our project was selected as a talk at a prestigious conference. Looking back, I realize how pivotal Nick
has been as both a mentor and a beacon of light through the darkest and most difficult scientific
research paths. Thank you, Nick, for making my Ph.D. journey so memorable.
Moreover, I am honored to have met many wonderful faculty members in the Department of Population and Public Health Sciences (PPHS). In particular, I would like to thank my committee members, Dr. Kimberly Siegmund, Dr. Steven Gazal, Dr. Chun Li, and Dr. Harold Pimentel from UCLA. Dr. Siegmund provided valuable suggestions during my Annual Research Appraisal meeting, ensuring I stayed on the right path. Dr. Gazal offered insights on downstream analysis across my projects. Dr. Li's deep learning course greatly supported the progress of the Dual-VAE project. Dr.
Pimentel supported the Perturb-VI project with model specifications, data analysis, and manuscript
revisions. I am profoundly thankful for their time, effort, and support in reviewing and refining my
dissertation.
Next, I extend my heartfelt thanks to my parents for their unwavering support over the past
twenty years, both mentally and financially. I was born in a small city, where I spent my first 11 years. Then, my parents sent me to a larger city for a better education, but they had to stay in our hometown for work. For the next five years, my mother traveled two hours by train every weekend to visit me, never missing a single one. In my final year before the college entrance exam, she quit her job to stay with me for the entire year. To this day, I cannot imagine how much she sacrificed to give me a better life. I never got
the chance to express my feelings and say thank you, and I don’t think any words can truly express
it. But the only thing I can say now is that I love you so much, Mom. My father is a doctor and
is always busy with countless patients. Since I can remember, I seldom saw him even on the eve
of Chinese New Year, the most important night for family reunions in China, because there were
always numerous patients during that time. But I know clearly how much he has influenced me
throughout my life. He is the reason why I never gave up on dedicating myself to human health,
starting from wanting to be a doctor to now contributing to public health.
The best part of my five-year Ph.D. journey has been the friendships I’ve formed. I am grateful
to Dr. Zeyun Lu, Dr. Yinqi Zhao, Dr. Yue Tu, Dr. Yijie Li, Dr. Menglin Wang, Ziyi Jiang, and
Jingxuan He for their academic support and companionship. I also appreciate the camaraderie of
the Mancuso lab members: Zixuan Zhang, Xinyue Rui, Xinran Wang, and Zifeng Zhang. Additionally, I am thankful for my friends in the Los Angeles area, including Shengchao Hou, Yi
Zhang, Zhuanzhuan Liu, Juehan Wang, Siliangyu Cheng, Shuang Wu, Jiayi Shen, Anqi Wang,
Shutong Huo, Ji Tang, Carmen Chen, who have provided joy and support throughout this journey.
The pandemic was like a thick wall, bringing unexpected challenges and isolating us from daily social interactions. On March 15, 2019, I started playing the mobile game "Honor of Kings," and destiny brought me my best friend and gaming ally, Xinyi Jiang. Restricted by the "Stay at Home" order, I stayed in my apartment for over six months; I cannot imagine how I would have spent this period without her. We played games, chatted online, and more. I faced countless small and large difficulties in my research and daily life, but because of her consistent support, I never felt helpless or tired, always staying grounded and motivated. Thank you, Xinyi, for being a wonderful friend during this significant phase of my life. Lastly, I would like to thank all my other online allies, Jie Mu ("Good morning"), Zhaozhao ("Nine moons"), Bin Yang ("Good evening"), Yongkai Lin ("Little cucumber"), and Yitong Liu ("Sixty lady play games"), for their support and companionship.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Genome-wide Association Studies (GWAS) and eQTLs . . . . . . . . . . . . . . 1
1.2 Gene regulatory networks (GRN) . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 CRISPR Perturbation Screening . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Latent Factor Models (LFM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 General Form of LFM . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Probabilistic Principal Components Analysis (PPCA) . . . . . . . . . . 8
1.4.3 Probabilistic Canonical Correlation Analysis (PCCA) . . . . . . . . . . 9
1.5 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Sum of Single Effects Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.8 Project Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 2: Sum of Single Effects (SuSiE) Principal Component Analysis . . . . . . . . . 19
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 GTEx z-score dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.4 Perturb-seq dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 PIPs from SuSiE PCA outperform existing methods in feature selection 29
2.3.2 SuSiE PCA is robust to model misspecification . . . . . . . . . . . . . 30
2.3.3 Dissecting cross-tissue eQTLs in GTEx . . . . . . . . . . . . . . . . . 33
2.3.4 Identifying regulatory modules from Perturb-seq Data . . . . . . . . . . 35
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Limitations of the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Chapter 3: Perturb Variational Inference (PerturbVI) . . . . . . . . . . . . . . . . . . . . 41
3.1 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 PerturbVI model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Variational inference in PerturbVI . . . . . . . . . . . . . . . . . . . . 45
3.3.3 PIP and LFSR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.5 LUHMES CROP-seq Data . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.6 Essential-wide Perturb-seq Data . . . . . . . . . . . . . . . . . . . . . 48
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.1 PerturbVI is more accurate and robust in estimating perturbation effect . 50
3.4.2 CROP-Seq application . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.3 Perturb-Seq application . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter 4: Dual Variational Autoencoders (Dual-VAE) . . . . . . . . . . . . . . . . . . . 63
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.1 Product of Experts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.3 Metabolomics data from prostate cancer patient . . . . . . . . . . . . . 69
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.1 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.2 Application to Metabolomics Data from Prostate Cancer Patients . . . . 72
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
A Chapter 2. SuSiE PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.2 Derivation of Variational Distribution . . . . . . . . . . . . . . . . . . 90
A.3 Derivation of Evidence Lower Bound (ELBO) . . . . . . . . . . . . . . 98
A.4 Supplement Figure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
B Chapter 3. PerturbVI Appendix . . . . . . . . . . . . . . . . . . . . . . . . 120
B.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B.2 Derivation of Variational Distributions . . . . . . . . . . . . . . . . . . 122
B.3 Derivation of Evidence Lower Bound (ELBO) . . . . . . . . . . . . . . 127
B.4 Supplement Figure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
C Chapter 4. Dual Variational Autoencoders . . . . . . . . . . . . . . . . . . . 135
C.1 Model Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
C.2 Supplement Figure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
List of Figures
1.1 The Regulation Function of eQTLs (Albert et al., Nature Genetics, 2015) . . . 2
1.2 Representation of a transcription factor network (Thomas & Alvis, Bioinformatics, 2007) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Disrupted Gene Regulatory Network . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 The Decomposition of Log-likelihood in Variational Inference . . . . . . . . . 13
1.5 The Architecture of the Variational Autoencoders (Jeremy Jordan) . . . . . . . 16
2.1 PIPs exhibit a higher efficiency in selecting the true signals than the posterior
weights in SuSiE PCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2 SuSiE PCA generates a smaller Procrustes error in the weight matrix than sparse PCA and EBMF (A-D) and is robust to over-specified K and L (E-F). . 32
2.3 Factors z1 and z3 capture different types of tissues (non-brain vs. brain tissues). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4 The perturbations with top factor scores in the first component mostly belong to the RPL and RPS families (A), and the downstream genes in the same component are enriched for ribosome and coronavirus disease pathways (B). . . . . 36
3.1 PerturbVI generates a smaller Procrustes error in the B and W matrices than GSFA (A-C) and is more robust to mis-specified models (D-F). . . . . . . . . 51
3.2 PerturbVI results of perturbation effect from LUHMES CROP-seq data . . . 53
3.3 PerturbVI identifies the regulatory networks within the neuron differentiation process (partial downstream genes). . . . . . . . . . . . . . . . . . . . . . 56
3.4 PerturbVI results of perturbation effect from essential-wide Perturb-seq data 58
3.5 PerturbVI reveals regulatory networks in cell cycle and ribosome biogenesis . 60
4.1 Dual Variational Autoencoders for Gene Expression and Genotype Data . . . . 66
4.2 Dual-VAE has better estimation accuracy and is more robust to mis-specified models compared to sparse CCA . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Dual-VAE suggests that Sphingomyelins from the Lipid pathway are significantly associated with prostate cancer risk loci. . . . . . . . . . . . . . . . . . . 74
4.4 rs11701433 presents as the top effect in both Dual-VAE and sparse CCA, with several other variants neglected by sCCA . . . . . . . . . . . . . . . . . . . . 75
Abstract
Gene Regulatory Networks (GRNs) consist of complex interactions among DNA, RNA, and proteins, which regulate gene activities through stages such as transcription, RNA splicing, translation,
and post-translational modification. However, the dysregulated network can drive disease risk.
Therefore, inferring regulatory modules within GRNs is crucial for understanding cellular functions and disease mechanisms. Current approaches for inferring regulatory modules within GRNs
are limited by scalability and linearity assumptions. This dissertation addresses these limitations
through three projects: the first project, SuSiE PCA, integrates the sum of single effects (SuSiE)
model with probabilistic principal component analysis (PPCA) to effectively compute interpretable
latent factors capturing regulatory modules from Perturb-seq data. The second project, PerturbVI,
scales to genome-wide perturbations by incorporating perturbation information from Perturb-seq
data into latent components derived from SuSiE PCA to accurately estimate the perturbation effect.
The third project, Dual-VAE, employs a deep learning model to capture non-linear relationships
within complex regulatory networks by associating genetic risk variants with downstream gene activities. These advancements provide more efficient, scalable, and interpretable tools for inferring
regulatory modules through Perturb-seq and multi-omics data, offering insights into understanding
the complex regulatory mechanisms in human disease.
Chapter 1
Introduction
1.1 Genome-wide Association Studies (GWAS) and eQTLs
Genome-wide association studies (GWAS) have significantly advanced our understanding of genetics. GWAS is an experimental approach to detect associations between genetic variants and
complex traits from the population [1]. By scanning the genomes of individuals from a certain
population, GWAS identifies single nucleotide polymorphisms (SNPs) that occur more frequently
in those with a particular condition than in those without [2]. A typical GWAS involves collecting
DNA samples, genotyping SNPs, and performing statistical analyses to find associations between
genetic variants and traits [3].
The advantage of GWAS is its ability to detect previously unknown genetic components contributing to disease risk, offering insights into biological pathways and potential therapeutic targets for the disease [4]. However, GWAS requires large sample sizes to detect small effect sizes and can
potentially produce false positives due to population stratification and other confounding factors
[5]. Additionally, GWAS often identifies associations without pinpointing the exact causal variants or understanding the underlying mechanisms [6]: the risk loci provided by GWAS typically
contain several significant genetic variants highly correlated due to linkage disequilibrium (LD),
complicating the identification of true causal variants. Even when the causal variant is identified,
it often does not directly drive the disease. In fact, GWAS has identified thousands of putative
genetic variants associated with complex traits and diseases [7]; however, over 90% of these associated variants lie in non-coding regions, which makes interpretation difficult [8]. Multiple lines of evidence indicate that most of the identified non-coding variants are significantly related to gene regulatory elements and gene expression levels (expression quantitative trait loci, eQTLs) [9]. eQTLs are sequence variants that influence the expression level of nearby genes (cis-eQTLs) or distal genes (trans-eQTLs). Figure 1.1 by Albert et al. (2015) [10] illustrates various mechanisms by which eQTLs affect gene expression.
Figure 1.1: The Regulation Function of eQTLs (Albert et al., Nature Genetics, 2015)
1.2 Gene regulatory networks (GRN)
Regulatory elements play a crucial role in ensuring the desired genes are expressed in the cell at
the right time and in the right amounts throughout the cell's life. Together, these elements form the foundation of gene regulatory networks (GRNs): complex interactions among molecular regulators, including DNA, RNA, and proteins, which collectively control the levels of gene products. The process of gene expression can be broadly summarized in several stages: transcription, RNA splicing, translation, and post-translational modification of the protein, and each of these steps can be regulated. For example, a type of protein called a transcription factor can activate a certain gene by binding to the promoter region at the start of the coding region, initiating the expression process that produces another protein (Figure 1.2) [11].
Gene expression levels can be either up-regulated or down-regulated depending on regulatory DNA binding sites, DNA methylation, or RNA splicing [12]. Therefore, understanding GRNs is essential, as they represent the causal relationships in developmental processes and influence numerous cellular functions [13]. Moreover, a gene regulatory network can become disrupted and drive certain diseases [14], and the disruption typically manifests through the dysregulation of nearby or distal genes (Figure 1.3).
Numerous models have been developed to describe GRNs, which can broadly be classified into
three types [15]. The first type of model is the logical model, such as Boolean networks [16], which
describes GRN qualitatively and conveys the basic knowledge of different functionalities within
the network. The second type is the continuous model, which employs real-valued parameters such
as cell mass, cell-cycle length, and gene expression to compare the global state and experimental
data, therefore providing more accurate representations [17]. Continuous models can be further
divided into:
Figure 1.2: Representation of a transcription factor network (Thomas & Alvis, Bioinformatics, 2007)
1. Linear models: Assume each regulator contributes additively to the level of gene products.
Techniques like singular value decomposition and linear modeling can generate a family of
candidate networks [18].
2. Non-linear models: Capture the complexity of gene expression processes, discovering novel
regulators and regulatory relationships.
The third type is single-molecule-level models, which account for interactions between individual molecules and explain the relationship between stochastic states and gene regulation.
An example of this model is Gillespie’s stochastic simulation algorithm [19].
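The first (logical) model class above can be made concrete with a small sketch. The Boolean network below is entirely hypothetical, with three invented genes and invented rules, intended only to show how a logical model updates gene states and settles into an attractor:

```python
# Minimal Boolean-network sketch: three hypothetical genes (A, B, C).
# Rules (invented for illustration): A is constitutively on; B is
# activated by A; C is activated by B but repressed by A.

def step(state):
    """Apply one synchronous update of the Boolean rules."""
    a, b = state["A"], state["B"]
    return {
        "A": True,          # constitutive expression
        "B": a,             # B activated by A
        "C": b and not a,   # C activated by B, repressed by A
    }

state = {"A": False, "B": True, "C": False}
trajectory = [state]
for _ in range(4):
    state = step(state)
    trajectory.append(state)

# The network reaches a fixed point (attractor) once updating no longer
# changes the state.
print(trajectory[-1])  # → {'A': True, 'B': True, 'C': False}
```

Because A represses C while also (indirectly) activating it through B, C is transiently on and then permanently off, a qualitative behavior that logical models capture without any real-valued parameters.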
In this research work, we focused on developing continuous models to infer gene regulatory
networks from two different aspects:
1. In Chapters 2 and 3, we develop scalable linear latent factor models to infer the effect of dysregulated upstream genes on downstream gene networks from CRISPR perturbation data [20, 21].
Figure 1.3: Disrupted Gene Regulatory Network
2. In Chapter 4, we build a deep learning model to infer how genetic components (i.e., genetic variants) perturb downstream gene networks from a broader perspective.
The remainder of this introduction is organized as follows. In Section 1.3, we introduce CRISPR perturbation data and describe why it is an ideal candidate for inferring gene perturbation effects in regulatory networks. As all three methods developed in this work are built on latent factor models and fitted through variational inference, we introduce latent factor models in Section 1.4 and variational inference in Section 1.5. The first project leverages the Sum of Single Effects (SuSiE) model, described in Section 1.6, while the third project is based on variational autoencoders (VAEs), described in Section 1.7. Finally, we provide an overview of the projects in Section 1.8.
1.3 CRISPR Perturbation Screening
Clustered Regularly Interspaced Short Palindromic Repeats, or CRISPR, is a revolutionary genome
editing technology derived from a bacterial defense system that enables precise alterations of DNA
sequences within various organisms [22]. In genome engineering, "CRISPR" or "CRISPR-Cas9" is often used broadly to describe CRISPR-Cas9, CRISPR-Cpf1, and other systems that can be programmed to target specific genetic sequences for precise DNA editing and other applications
[23]. The advantage of CRISPR is its high specificity and programmability, allowing for targeted
modifications in genetic material, which holds significant potential for advancements in human
health and disease treatment [22].
Perturb-seq, a technique combining CRISPR-based gene perturbation with single-cell RNA sequencing [24], is a powerful tool that can generate single-cell level perturbation data for dissecting
GRNs [25]. Perturb-seq allows for testing thousands of regulatory genes simultaneously, offering
a high-resolution view of gene regulatory landscapes and revealing how genetic alterations can influence entire networks [26]. Recently, Replogle et al. generated genome-wide Perturb-seq data that assay thousands of perturbations with single-cell RNA sequencing in a chronic myeloid leukemia (CML) cell line (K562) [21]. The study demonstrated that Perturb-seq data hold unparalleled potential for functional discoveries on numerous poorly characterized genes.
1.4 Latent Factor Models (LFM)
1.4.1 General Form of LFM
Latent factor models have been developed and widely used to reveal the underlying structure of
high dimensional data with a low dimensional latent space [27]. Although there are many different
factor models designed for various purposes, such as principal component analysis (PCA) [28],
factor analysis (FA) [29], and canonical correlation analysis (CCA) [30], they share the same interpretation. Generally, a traditional latent factor model seeks a linear projection z_i ∈ R^k of the original data x_i ∈ R^p, i = 1, …, n, through a k × p loading matrix, where k is usually assumed to be much smaller than p. A common latent factor model has the following representation:
x_i = W⊺ z_i + ϵ_i (1.1)

where ϵ_i ∼ N_p(0, Σ) is the Gaussian noise for sample i. Depending on the constraints on the form of Σ, the general model (1.1) develops into specific models. For example, when Σ = σ^2 I_p, the model represents probabilistic PCA; when Σ is a diagonal matrix with possibly different elements on the diagonal, the model turns into factor analysis. Despite the various constraints in different latent models, we usually impose three general assumptions:
• the factors z_i and the error term ϵ_i are independent of each other;
• the factors z_i and z_j are independent of each other for i ≠ j;
• E[z_i] = 0 and V[z_i] = I_k.
Then we immediately obtain the feature covariance matrix Cov(x_i) := Ω:

Ω = W⊺W + Σ (1.2)
This factorization of Ω not only shows how the factors contribute to the feature covariance matrix through their loading matrix, but also provides a straightforward method to infer the desired loading matrix by performing a matrix decomposition on Ω. We next present the probabilistic reformulations of PCA and CCA.
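The covariance identity (1.2) can be checked numerically. The sketch below simulates data from model (1.1) with an arbitrary loading matrix and a diagonal Σ (the factor-analysis case); all dimensions and scales are illustrative choices, not values used elsewhere in this work:

```python
import numpy as np

# Numerical check of the factor-model covariance identity (1.2):
# x_i = W^T z_i + eps_i with z_i ~ N(0, I_k) and eps_i ~ N(0, Sigma)
# implies Cov(x_i) = W^T W + Sigma. Dimensions here are illustrative.
rng = np.random.default_rng(0)
k, p, n = 3, 5, 200_000

W = rng.normal(size=(k, p))                # k x p loading matrix
Sigma = np.diag(rng.uniform(0.5, 1.5, p))  # diagonal noise covariance

Z = rng.normal(size=(n, k))                # latent factors
E = rng.normal(size=(n, p)) @ np.sqrt(Sigma)
X = Z @ W + E                              # each row is x_i = W^T z_i + eps_i

omega_hat = np.cov(X, rowvar=False)        # sample covariance of x
omega = W.T @ W + Sigma                    # theoretical covariance (1.2)

# With large n, the sample covariance closely matches W^T W + Sigma.
print(np.abs(omega_hat - omega).max())
```

The maximal entrywise discrepancy shrinks at the usual 1/√n rate, which is what makes decomposing an estimate of Ω a viable route to the loading matrix.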
1.4.2 Probabilistic Principal Components Analysis (PPCA)
Principal components analysis (PCA) is a well-established dimension reduction technique [28],
and is widely used for exploratory data analysis. PCA aims to find the linear projection of the original data that possesses the largest variance in the projected space. The common solution to PCA is to perform an eigenvalue decomposition of the sample covariance matrix; the eigenvectors w_j with the k largest eigenvalues λ_j then serve as weights on the original features, so the k dominant principal components can be expressed as w_j⊺ x_n, where x_n is one observation of the data set. One notable property of PCA is that, of all linear projection spaces, the projection produced
by the principal components minimizes the mean squared error with the original data points [31].
Motivated by this property, Tipping and Bishop constructed a probabilistic reformulation of PCA [32]. Next, we summarize their work and show the important results.
Specifically, the PPCA is modeled as follows:
x | z ∼ N(Wz + µ, σ^2 I_p) (1.3)

where x is the p-dimensional observed data, z ∼ N(0, I_d) is the d-dimensional latent vector, and W ∈ R^{p×d}. With the probabilistic distribution given by (1.3), we can explicitly express the log-likelihood function and derive the maximum likelihood estimates of µ, W, and σ^2:
µ̂ = (1/n) Σ_{j=1}^{n} x_j;   Ŵ = U_d (Λ_d − σ̂^2 I_d)^{1/2} R;   σ̂^2 = (1/(p−d)) Σ_{i=d+1}^{p} λ_i (1.4)

where U_d is the p × d matrix whose columns are the principal eigenvectors of the sample covariance matrix, Λ_d is the d × d diagonal matrix with the corresponding eigenvalues on the diagonal, and R is an arbitrary orthogonal matrix. Moreover, the posterior mean of z is given by:

E(z | x) = R⊺ (Λ_d − σ̂^2 I_d)^{1/2} U_d⊺ (x − µ̂) (1.5)
Then, the ML estimates under this formulation are equivalent to the conventional PCA solution
and yield the same latent space. The advantages of probabilistic modeling are:
• the likelihood function allows for comparison of model performance with other probabilistic models;
• it enables the application of Bayesian methods;
• it is capable of dealing with missing data (as illustrated by Tipping and Bishop in their PPCA paper).
In this work, we mainly focus on the second merit of probabilistic models, that is, applying Bayesian methods to the loading matrix to achieve the desired sparsity.
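The closed-form estimates in (1.4) translate directly into a few lines of code. The sketch below is a minimal illustration of the PPCA ML solution (taking R = I) on simulated data; the dimensions and noise scale are arbitrary choices for the example, not values from the applications in later chapters:

```python
import numpy as np

# Sketch of the closed-form PPCA ML estimates (1.4), following Tipping
# & Bishop. Data are simulated; p = 6, d = 2 are illustrative.
rng = np.random.default_rng(1)
n, p, d = 5000, 6, 2

W_true = rng.normal(size=(p, d))
Z = rng.normal(size=(n, d))
X = Z @ W_true.T + rng.normal(scale=0.5, size=(n, p))

mu_hat = X.mean(axis=0)               # ML estimate of mu: the sample mean
S = np.cov(X, rowvar=False)           # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

sigma2_hat = eigvals[d:].mean()       # average of the discarded eigenvalues
U_d = eigvecs[:, :d]
L_d = np.diag(eigvals[:d])
# ML loading with R = I: W_hat = U_d (Lambda_d - sigma^2 I_d)^{1/2}
W_hat = U_d @ np.sqrt(L_d - sigma2_hat * np.eye(d))

# The implied model covariance W W^T + sigma^2 I reproduces the top-d
# eigen-structure of S exactly and averages the remaining directions.
model_cov = W_hat @ W_hat.T + sigma2_hat * np.eye(p)
```

Because the solution is an eigendecomposition rather than an iterative fit, it also serves as a convenient initialization or sanity check for variational fits of sparser extensions of PPCA.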
1.4.3 Probabilistic Canonical Correlation Analysis (PCCA)
When two data views come from the same set of samples but with different features, canonical
correlation analysis (CCA) can be employed to explore linear correlations between two data sets
[30]. In general, CCA computes the principal components (PCs) from each data set so that the PCs
are maximally correlated. Mathematically, given two data sets X_1 ∈ R^{m_1} and X_2 ∈ R^{m_2}, the objective function is:

argmax_{w_1, w_2} Cov(w_1⊺ x_1, w_2⊺ x_2)   s.t.   V(w_1⊺ x_1) = 1,  V(w_2⊺ x_2) = 1 (1.6)
To derive the solution, we first compute the joint sample covariance matrix Σ̃ of X_1 and X_2:

Σ̃ = [ Σ̃_11  Σ̃_12
       Σ̃_21  Σ̃_22 ] (1.7)

The diagonal blocks represent the variance-covariance matrices within each data set, and the off-diagonal blocks represent the covariance matrices between the data sets. It can then be shown that the canonical pair (w_1, w_2), called the canonical directions, that maximizes the correlation in (1.6) is:

(w_1, w_2) = (Σ̃_11^{−1/2} v_1, Σ̃_22^{−1/2} v_2) (1.8)

where v_1 and v_2 are the left and right singular vectors obtained from the singular value decomposition (SVD) of the matrix Σ̃_11^{−1/2} Σ̃_12 Σ̃_22^{−1/2}, respectively. The corresponding singular values are called the canonical correlations. Similar to PPCA, Bach and Jordan reformulated the probabilistic CCA (PCCA)
[33] as follows:
z ∼ N(0, I_d),  1 ≤ d ≤ min(m_1, m_2)
x_1 | z ∼ N(w_1 z + µ_1, Ψ_1),  Ψ_1 ⪰ 0
x_2 | z ∼ N(w_2 z + µ_2, Ψ_2),  Ψ_2 ⪰ 0
The maximum likelihood estimates of the parameters w_1, w_2, Ψ_1, Ψ_2, µ_1, µ_2 are given by:

ŵ_1 = Σ̃_11 U_{1d} M_1
ŵ_2 = Σ̃_22 U_{2d} M_2
Ψ̂_1 = Σ̃_11 − ŵ_1 ŵ_1⊺
Ψ̂_2 = Σ̃_22 − ŵ_2 ŵ_2⊺
µ̂_1 = (1/n) Σ_{i=1}^{n} x_{1i}
µ̂_2 = (1/n) Σ_{i=1}^{n} x_{2i}

where the columns of the matrices U_{1d} and U_{2d} are the first d canonical directions, M_1, M_2 ∈ R^{d×d} are arbitrary matrices such that M_1 M_2⊺ = P_d, and P_d is the diagonal matrix of the corresponding canonical correlations. Finally, the posterior means and variances have the expressions:
E(z | x_1) = M_1⊺ U_{1d}⊺ (x_1 − µ̂_1)
E(z | x_2) = M_2⊺ U_{2d}⊺ (x_2 − µ̂_2)
V(z | x_1) = I_d − M_1 M_1⊺
V(z | x_2) = I_d − M_2 M_2⊺

For any M_1, M_2 satisfying M_1 M_2⊺ = P_d, the projected d-dimensional spaces are equivalent to those obtained from conventional CCA.
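The classical solution (1.7)-(1.8) can be sketched directly: whiten the cross-covariance, take its SVD, and map the singular vectors back. The example below uses simulated two-view data sharing latent factors, so there is a real correlation to recover; all dimensions are illustrative:

```python
import numpy as np

# Sketch of the classical CCA solution (1.7)-(1.8): SVD of the whitened
# cross-covariance. Data and dimensions are illustrative.
rng = np.random.default_rng(2)
n, m1, m2, d = 10_000, 4, 3, 2

# Two views sharing latent factors, so a correlation exists to find.
Z = rng.normal(size=(n, d))
X1 = Z @ rng.normal(size=(d, m1)) + rng.normal(size=(n, m1))
X2 = Z @ rng.normal(size=(d, m2)) + rng.normal(size=(n, m2))

X1c, X2c = X1 - X1.mean(0), X2 - X2.mean(0)
S11 = X1c.T @ X1c / n
S22 = X2c.T @ X2c / n
S12 = X1c.T @ X2c / n

def inv_sqrt(S):
    """Inverse matrix square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# Singular values of the whitened cross-covariance are the canonical
# correlations; singular vectors map back to canonical directions (1.8).
K = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
V1, rho, V2t = np.linalg.svd(K)

w1 = inv_sqrt(S11) @ V1[:, 0]   # first canonical direction, view 1
w2 = inv_sqrt(S22) @ V2t[0, :]  # first canonical direction, view 2

# The empirical correlation of the projections equals the top singular value.
corr = np.corrcoef(X1c @ w1, X2c @ w2)[0, 1]
```

Note that the unit-variance constraints in (1.6) are satisfied by construction, since whitening makes w_1⊺ Σ̃_11 w_1 = v_1⊺ v_1 = 1.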
1.5 Variational Inference

In probabilistic modeling, as mentioned above, the posterior distribution of latent variables is typically complex and intractable to compute because there is no closed form for the data likelihood. To address this, one can use numerical sampling methods such as Markov Chain Monte Carlo (MCMC). MCMC is designed to provide a numerical approximation of the exact posterior distribution through intensive sampling [34] and therefore lacks computational efficiency, especially when the feature dimension is large. As an alternative to Monte Carlo sampling, variational inference provides an analytical approximation to the posterior distribution [35].
To approximate the conditional distribution of latent variables Z given the observed samples
X, variational methods first impose a family of densities over the latent variables, Q(Z), usually
predefined as known distributions parameterized with a set of variational parameters. Then the
goal is to infer those variational parameters such that the variational distribution Q(Z) is as similar
as possible to the true posterior distribution P(Z|X). A quantity commonly used to measure
dissimilarity between distributions is Kullback-Leibler divergence DKL(Q∥P) [36]:
DKL(Q∥P) = Σ_Z Q(Z) log [ Q(Z) / P(Z | X) ] (1.9)
However, since the KL divergence contains the unknown true posterior distribution P(Z | X), it cannot be directly computed. Instead, we can show that the log-likelihood of the data, log P(X), can be decomposed as:

log P(X) = DKL(Q(Z | X)∥P(Z | X)) + L(X) (1.10)
Figure 1.4: The Decomposition of Log-likelihood in Variational Inference
where L(X) = EQ[log P(Z, X) − log Q(Z)] is also known as the Evidence Lower Bound (ELBO). Since log P(X) is constant with respect to the variational parameters, minimizing the KL divergence is equivalent to maximizing the ELBO. The ELBO does not contain the unknown posterior distribution and is therefore tractable to compute. Various optimization methods, such as gradient descent, can then be applied to maximize the ELBO with respect to the variational parameters.
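The decomposition in (1.10) can be verified numerically on a toy discrete model with a single binary latent variable; the joint probabilities below are invented purely for illustration.

```python
import numpy as np

# Toy discrete model: latent z in {0, 1}, one fixed observation x.
# We verify log P(x) = KL(Q || P(z|x)) + ELBO, i.e. equation (1.10).
p_joint = np.array([0.3, 0.1])   # P(z=0, x), P(z=1, x) for the observed x
p_x = p_joint.sum()              # marginal likelihood P(x)
post = p_joint / p_x             # true posterior P(z | x)

q = np.array([0.6, 0.4])         # an arbitrary variational distribution Q(z)
kl = np.sum(q * np.log(q / post))                  # D_KL(Q || P(z|x)) >= 0
elbo = np.sum(q * (np.log(p_joint) - np.log(q)))   # E_Q[log P(z,x) - log Q(z)]
assert np.isclose(np.log(p_x), kl + elbo)
```

Because the KL term is non-negative, the ELBO is always a lower bound on log P(x), with equality exactly when Q matches the true posterior.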
To find the approximate posterior distribution, one common approach is the mean-field approximation [37], which factorizes the variational distribution such that each partition is independent of the others:

Q(Z) = Π_{i=1}^{K} Q_i(z_i) (1.11)
Using the calculus of variations, we can show that the distribution Q*_j(z_j) minimizing the KL divergence for each factor z_j can be expressed as:

ln Q*_j(z_j | X) = E_{i≠j}[ln P(Z, X)] + constant (1.12)

where E_{i≠j} denotes the expectation taken over all variables z_i with i ≠ j.
1.6 Sum of Single Effects Model

In this section, we present the Sum of Single Effects (SuSiE) model proposed by Wang et al. (2020) [38]. SuSiE is designed to address the variable selection problem in the linear regression model. Because the SuSiE model has a sparse structure and provides a posterior inclusion probability (PIP) to quantify the uncertainty in selecting each variable, it has become a suitable tool for genetic fine-mapping applications. SuSiE builds on Bayesian variable selection in regression (BVSR) [39], but features a simpler model structure and a fast model-fitting procedure via variational inference. Suppose y is the one-dimensional dependent variable, X is the matrix of p-dimensional predictors, and b is an effect vector of length p. Then we can express the SuSiE regression model as follows:

y = Xb + e (1.13)
e ∼ N_n(0, σ² I_n) (1.14)
b = Σ_{l=1}^{L} b_l (1.15)
b_l = γ_l b_l (1.16)
γ_l ∼ Multi(1, π) (1.17)
b_l ∼ N(0, σ²_{0l}) (1.18)
Notice that equation (1.15) assumes the overall effect vector b is a sum of L single-effect vectors b_l, each of which contains only one non-zero effect b_l. The coordinate of the scalar b_l within the single-effect vector is determined by γ_l, which follows a multinomial distribution with parameter π, while b_l itself follows a normal distribution with its own variance term σ²_{0l}. The SuSiE model implies that at most L single effects from X contribute to y because (1) with small probability, some single effects may land on the same coordinate of the summed effect vector b; and (2) if the true number of effects is less than L, the maximum likelihood estimates of σ²_{0l} shrink toward zero for some l, driving the posterior mean of the corresponding b_l to be almost zero.
Given σ² and σ²_{0l}, we need to compute the posterior distributions of b_l and γ_l:

γ_l | X, y, σ², σ²_{0l} ∼ Multi(1, α_l)
b_l | X, y, σ², σ²_{0l}, γ_{lj} = 1 ∼ N(µ_{1j}, σ²_{1j})
The authors leveraged the posterior estimates of µ_{1j}, σ²_{1j}, and α_l from the single-effect regression model while fitting the SuSiE model. Because it is intractable to compute the likelihood function, they incorporated variational inference to compute the evidence lower bound (ELBO). Taking the derivative with respect to the variance terms gives the MLEs of σ² and σ²_{0l}. Finally, the SuSiE model returns the posterior mean and variance for every single effect, as well as the corresponding posterior probabilities α_l, which are used to compute PIPs and credible sets for variable selection.
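The single-effect regression posterior that underlies the updates above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, the uniform prior π = 1/p (implicit in the softmax), and the closed-form Bayes factor expression are assumptions made for this example.

```python
import numpy as np

def single_effect_regression(X, y, sigma2, sigma2_0):
    """Posterior for the single-effect regression underlying SuSiE.

    For each variable j, computes the posterior inclusion weight alpha_j
    and the conditional posterior N(mu1_j, s1_j^2) of the effect given
    the j-th coordinate of gamma equals 1, assuming a uniform prior pi.
    """
    xtx = np.sum(X ** 2, axis=0)                    # x_j' x_j
    xty = X.T @ y                                   # x_j' y
    s1_2 = 1.0 / (xtx / sigma2 + 1.0 / sigma2_0)    # posterior variance
    mu1 = (s1_2 / sigma2) * xty                     # posterior mean
    # log Bayes factor of "variable j carries the effect" vs. the null
    bhat = xty / xtx                                # per-variable OLS estimate
    shat2 = sigma2 / xtx                            # its sampling variance
    log_bf = (0.5 * np.log(shat2 / (shat2 + sigma2_0))
              + 0.5 * bhat ** 2 / shat2 * sigma2_0 / (shat2 + sigma2_0))
    alpha = np.exp(log_bf - log_bf.max())
    alpha /= alpha.sum()                            # softmax over variables
    return alpha, mu1, s1_2
```

With a strong simulated signal, the inclusion weights concentrate on the causal column.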
1.7 Variational Autoencoders

Similar to latent factor models, the variational autoencoder (VAE) [40, 41] is an unsupervised model designed to learn the data-generating process through two coupled models: the encoder model, also called the recognition model, and the decoder model, or generative model [42]. The encoder learns the approximate posterior distribution of the latent variables from the observed data and delivers those learned latent variables to the decoder, which reconstructs the data from the latent space. Both the encoder and decoder are usually built on deep neural networks (DNNs) to learn potentially complicated data structure, and VAEs have therefore been widely employed in image recognition. Below is a graphical representation of a VAE with two latent variables, provided by Jeremy Jordan [43]. The original data with 6 features are passed through a two-layer neural network (the encoder) and encoded into two sets of variables (µ_i, σ_i), which are the parameters of the latent variables z_i. The VAE samples points from the distribution of z_i and pushes them through another neural network (the decoder) to decode the latent variables into predicted data with 6 features.
Figure 1.5: The Architecture of the Variation Autoencoders (Jeremy Jordan)
One crucial assumption we must make is to impose a known family of distributions Q(·) on the latent variables Z. Otherwise, the model could naively choose a pair of mutually inverse neural networks to reach zero loss during training. On the other hand, it is intractable to compute the data likelihood in such a complicated, non-linear model, impeding computation of the posterior distribution of Z. By restricting the latent variables to a known distributional family Q(·), we only need to infer the parameters of the pre-defined distribution such that Q(Z | X) is as close as possible to the true but unknown posterior distribution P(Z | X). This is exactly the task of variational inference, hence the name variational autoencoder. One major difference is that, in a VAE, the encoder is a function of the input data: the encoder network outputs Q(Z | X), and the decoder P(X | Z) serves as the "model" that defines the data-generating process. As equation (1.10) suggests, the log-likelihood of the data can be decomposed into the KL divergence DKL(Q∥P) and the ELBO. Since the log-likelihood of the data is fixed and the KL divergence is non-negative, maximizing the ELBO with respect to the variational parameters is equivalent to minimizing the KL divergence between Q(Z | X) and P(Z | X). The interpretation is that the model fits the data better as the ELBO increases, and the variational distribution provides a better approximation to the true posterior distribution [42].
Since the VAE is a non-linear model, it is impossible to find an analytical solution for the variational parameters via the mean-field approximation. However, the ELBO has an important property that allows joint optimization with respect to all variational parameters using stochastic gradient descent (SGD) [42].
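Two small pieces capture the mechanics described above: the reparameterization trick used to sample z while keeping the sampling step differentiable, and the closed-form KL term of the ELBO when both the approximate posterior and the prior are Gaussian. A minimal NumPy sketch (function names are assumptions for this illustration):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
    so gradients can flow through mu and log_var during SGD training."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), the regularizing KL
    term of the VAE objective, summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

For example, gaussian_kl vanishes exactly when the approximate posterior matches the standard normal prior (mu = 0, log_var = 0).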
1.8 Project Overview

Although sparse latent factor models have been widely applied for dimension reduction in genomics, they mostly focus on regularizing loadings to zero but do not account for feature selection within a statistical framework. Moreover, current methods to infer perturbation effects from Perturb-seq data cannot scale to genome-wide perturbations. Therefore, a more efficient and interpretable model is necessary. In this context, we developed SuSiE PCA (Chapter 2) and PerturbVI (Chapter 3). To address the testing burden and linearity constraints in associating genetic variants with downstream genes, we leveraged variational autoencoders to describe the relationship between genetic variants and downstream gene activities (Chapter 4). To summarize:

1. Project 1, SuSiE PCA: an efficient approach to compute interpretable latent factors from high-dimensional biological data by integrating the SuSiE model into PPCA.

2. Project 2, PerturbVI: a scalable approach to infer perturbation effects from CRISPR screening data by integrating perturbation information into the latent components of SuSiE PCA.

3. Project 3, Dual-VAE: a deep-learning model that associates a set of risk variants with downstream genes to capture the non-linearity within complex regulatory networks.
Chapter 2
Sum of Single Effects (SuSiE) Principal Component Analysis
2.1 Introduction
Principal component analysis (PCA) is a popular dimension reduction technique [28] that has been
widely applied for exploratory data analysis in many fields. One notable functionality of PCA is to
synthesize crucial information across features into a small number of principal components (PCs).
For example, PCA is commonly used to infer population structure from large-scale genetic data
[44, 45]. The top PCs explain differences in genetic variation arising from different geographic origins and ancestry of individuals due to historical migration, admixture, etc. [46]. Moreover, PCA provides a means to rank contributing relevant variables for each latent component, as Tipping and Bishop (1999) proposed in the probabilistic reformulation of principal component analysis (PPCA) [31]. Specifically, each PC is independent of the other PCs and has its own unique weights representing the "importance" of the original features, suggesting that different latent components arise from different combinations of variables, or distinct aspects of information in the data.
However, one disadvantage of conventional PCA is that PCs provide limited interpretability, as
each results from a linear combination of variables in the data [47]. To improve the interpretability
of PCs, while providing an identifiable solution in high-dimensional data, a common approach is
to impose sparsity on the PCA loadings. Broadly speaking, there are two types of approaches to
achieving sparsity on the loading matrix. The first is regularization methods such as sparse PCA [47], which rewrite PCA as a regression-based optimization problem and include an L1 penalty in the objective function to achieve sparse loadings. The second type of method is the
Bayesian treatment of PPCA, which imposes a sparsity-inducing prior on the factor loading matrix [48–53]. Despite the variety of methods focused on inducing sparse solutions for PCA, few provide a
statistically rigorous way to select variables relevant to each factor in a post-hoc manner. Although
several sparse models are capable of shrinking the loadings of uninformative variables to zero, for
those variables with non-zero weights, neither a reasonable threshold nor a formal statistical test is
provided to inform feature prioritization for validation or follow-up.
Here, we propose SuSiE PCA, a highly scalable Bayesian framework for sparse PCA that quantifies the uncertainty of contributing features for each latent component. Specifically, SuSiE
PCA leverages the recent “sum of single effects” (SuSiE) approach [38] to model a loading matrix
such that each latent factor contains at most L contributing features (see Introduction 1.6). Latent
factors and sparse loading weights are learned through an efficient variational algorithm (Appendix
A SuSiE PCA Algorithm). In addition to providing a sparse loading matrix, SuSiE PCA computes
posterior inclusion probabilities (PIPs) for each feature, which enables defining ρ-level credible
sets for feature selection. We demonstrate through extensive simulations that SuSiE PCA outperforms sparse PCA [47] and empirical Bayes matrix factorization (EBMF) [53] in identifying
relevant features contributing to structured data while being robust to data-generating assumptions.
Next, we apply SuSiE PCA to multi-tissue eQTL data from the GTEx v8 [53, 54] study to identify tissue-specific components of regulatory genetic features and contributing eGenes. We also
apply SuSiE PCA to high-dimensional Perturb-seq data (CRISPR-based screens with single-cell
RNA-sequencing readouts) [21] and identify gene sets more enriched in the ribosome, coronavirus
disease pathways when compared with sparse PCA (FDR = 9.2 × 10^-82 with 63 genes involved vs. 1.4 × 10^-33 with 35 genes involved) while requiring 17.8 times less computing time. Overall, we find SuSiE PCA provides an efficient approach to compute interpretable latent factors from high-dimensional biological data. We provide an open-source Python implementation that can run seamlessly on CPUs, GPUs, or TPUs, available at http://www.github.com/mancusolab/susiepca.
2.2 Materials and Methods
2.2.1 Methods
In this section, we present the SuSiE PCA method. Let X_{N×P} be the observed data matrix, Z_{N×K} be the K-dimensional latent factors, and W_{K×P} be the loading matrix. We denote the normal distribution with mean µ and variance σ² as N(µ, σ²), the multinomial distribution with n choices and probabilities π as Multi(n, π), and the matrix normal distribution with dimension N×K, mean M, row-covariance R, and column-covariance C as MN_{N,K}(M, R, C). We denote the basis vector whose k-th coordinate is 1 and 0 elsewhere as e_k. The sampling distribution of X under the SuSiE PCA model is given by:
X | Z, W, σ² ∼ MN_{N,P}(ZW, I_N, σ² I_P) (2.1)
Z ∼ MN_{N,K}(0, I_N, I_K) (2.2)
W = Σ_{k=1}^{K} e_k w_k⊺ (2.3)
w_k = Σ_{l=1}^{L} w_kl (2.4)
w_kl = w_kl γ_kl (2.5)
w_kl | σ²_{0kl} ∼ N(0, σ²_{0kl}) (2.6)
γ_kl | π ∼ Multi(1, π) (2.7)
where w_k corresponds to the k-th row of W and contains exactly L non-zero elements determined by the sum of L single-effect vectors w_kl. Each single-effect vector is described by a single random effect w_kl and an indicator vector γ_kl, which assigns the effect to a feature with prior probabilities π = (1/p)·1.
Posterior Inclusion Probability

One of the distinguishing features of the SuSiE model [38] is the posterior inclusion probability (PIP). The PIP reflects the posterior probability that a given variable has a non-zero effect given the observed data. Here we extend the PIP definition to include latent factors. Specifically, given variational parameters α_kl, we define the PIP that the i-th variable has a non-zero effect in the k-th latent component as:

PIP_ki := Pr(w_ki ≠ 0 | X) = 1 − Π_{l=1}^{L} (1 − α_kli) (2.8)

Similarly, a level-ρ credible set (CS) refers to a subset of variables that cumulatively explains at least ρ of the posterior density. Here, we define factor-specific level-ρ CSs, which can be computed across each α_kl independently, resulting in K × L total level-ρ credible sets. This lets us reflect the uncertainty in the variables identified to explain a single effect for each latent factor.
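Equation (2.8) and the level-ρ credible sets translate directly into code. The sketch below assumes the variational probabilities are stored as a (K, L, P) array; the array layout and function names are illustrative assumptions, not the package's API.

```python
import numpy as np

def pips(alpha):
    """PIP_ki = 1 - prod_l (1 - alpha_kli), equation (2.8).

    alpha has shape (K, L, P): for each factor k, the probability that the
    l-th single effect lands on each of the P features. Returns (K, P)."""
    return 1.0 - np.prod(1.0 - alpha, axis=1)

def credible_set(alpha_kl, rho=0.9):
    """Smallest set of features whose alpha_kl mass reaches rho: sort the
    probabilities in descending order and keep features until the
    cumulative mass first exceeds rho (a level-rho credible set)."""
    order = np.argsort(alpha_kl)[::-1]
    cum = np.cumsum(alpha_kl[order])
    n_keep = int(np.searchsorted(cum, rho) + 1)
    return order[:n_keep]
```

For instance, a single effect split 0.7/0.25/0.05 across three features yields a level-0.9 CS containing the first two features.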
Variational Inference in SuSiE PCA

We seek to perform inference on the model variables Z, w_kl, and γ_kl conditional on the observed data X; however, the marginal likelihood is intractable to compute and therefore we cannot evaluate the posterior exactly. While sampling-based approaches such as Markov Chain Monte Carlo (MCMC) methods provide a numerical approximation of the exact posterior distribution [34], they often lack computational efficiency in high-dimensional settings. As an alternative, we leverage recent advances in variational inference, which provides an analytical approximation to the posterior distribution [35] and remains computationally efficient.

Briefly, to approximate the conditional distribution of the latent variables Z given the observed samples X, variational methods first impose a family of densities over the latent variables, Q(Z), usually predefined as known distributions parameterized by a set of variational parameters. The goal is then to infer those variational parameters such that the variational distribution Q(Z) is as similar as possible to the true posterior distribution P(Z | X). A quantity commonly used to measure dissimilarity between distributions is the Kullback-Leibler divergence DKL(Q∥P) [36]. However, since the KL divergence contains the unknown true posterior distribution P(Z | X), it cannot be analytically computed. Instead, we can show that the log-likelihood of the data, log P(X), can be decomposed as:

log P(X) = DKL(Q∥P) + L(Q) (2.9)

where L(Q) = EQ[log P(Z, X) − log Q(Z)] is also known as the Evidence Lower Bound (ELBO). Since log P(X) is constant with respect to the variational parameters, minimizing the KL divergence is equivalent to maximizing the ELBO, which does not contain the unknown posterior distribution and is therefore tractable to compute and maximize over the variational parameters. In Appendix A, equation A.23, we provide the exact form of the ELBO. Based on the mean-field approximation (see Appendix A.1), we can compute the complete-data log-likelihood (Equation A.11) and derive the analytical forms of the variational distributions. The complete derivation is outlined in Appendix A.1.
In summary, the optimal variational distributions of the model parameters can be summarized as:

Q(Z) := MN_{n,k}(Z | µ_Z, I_n, Σ_Z) (2.10)
Q(w_kl | γ_kl) := N(µ_wkl, σ²_wkl) (2.11)
Q(γ_kl) := Multi(1, α_kl) (2.12)
The corresponding update rules for the variational parameters of Q(·) can be expressed as:

µ_Z = τ X E[W⊺] Σ_Z (2.13)
Σ_Z = (E[W W⊺] τ + I_k)^{-1} (2.14)
µ_wkl = τ σ²_wkl E[R_kl⊺ Z_k] (2.15)
Σ_wkl = σ²_wkl I_p (2.16)
σ²_wkl = (τ E[Z_k⊺ Z_k] + τ_0kl)^{-1} (2.17)
α_kl = softmax(log π − log N(0 | µ_wkl, σ²_wkl)) (2.18)
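As an illustration of the flavor of these updates, the Q(Z) step in (2.13)-(2.14) amounts to one matrix inverse and one matrix product. The NumPy sketch below is for exposition only (the package implements these updates in JAX); the function name and argument layout are assumptions.

```python
import numpy as np

def update_z(X, E_W, E_WWt, tau):
    """Variational update for Q(Z), eqs. (2.13)-(2.14): a matrix normal with
    shared column covariance Sigma_Z and mean mu_Z.

    E_W is E[W] (K x P), E_WWt is E[W W'] (K x K), and tau = 1/sigma^2 is
    the residual precision."""
    K = E_W.shape[0]
    Sigma_Z = np.linalg.inv(E_WWt * tau + np.eye(K))   # eq. (2.14)
    mu_Z = tau * X @ E_W.T @ Sigma_Z                   # eq. (2.13)
    return mu_Z, Sigma_Z
```

With E[W W⊺] = I and τ = 1 this reduces to Σ_Z = I/2 and µ_Z = X E[W⊺]/2, a quick sanity check.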
2.2.2 Simulations

To investigate the performance of SuSiE PCA in variable selection and model fitting, we simulated various data sets controlled by 4 parameters: the sample size N, number of features P, number of latent factors K, and number of single effects L in each factor. For simplicity, we assume L is the same across factors. The simulated data X are generated according to equation (2.1), with N = 1000 and P = 6000. The latent components z_k and loadings w_k, for k = 1, ..., 4, are simulated such that only 40 features (0.67%) are associated with each component, i.e., have a non-zero effect:

z_k ∼ N(0, I_N) (2.19)
w_{1,i} ∼ N(0, 1), i = 1, ..., 40 (2.20)
w_{2,i} ∼ N(0, 1), i = 41, ..., 80 (2.21)
w_{3,i} ∼ N(0, 2²), i = 81, ..., 120 (2.22)
w_{4,i} ∼ N(0, 1), i = 121, ..., 160 (2.23)

with the remaining effects set to zero. Because the scale of the estimated loadings may differ across methods, we normalized each loading matrix with respect to the Frobenius norm:

tr(A⊺A) = tr(B⊺B) = 1
To evaluate the accuracy of SuSiE PCA, we compared inferred posterior expectations with the true latent variables. However, due to the rotational invariance of latent factor models, evaluating loading or latent factor accuracy can be challenging. To account for possible rotation, we leverage the Procrustes transformation [55], which finds an orthogonal rotation matrix P to transform the estimated loading matrix into the space of the true loading matrix. Specifically, given an estimated loading matrix Ŵ := EQ[W] under the approximate posterior distribution Q and the true effect matrix W, the "Procrustes norm" is obtained as follows:

||W − Ŵ||²_P := min_{P : P⁻¹ = P⊺} ||ŴP − W||²_F (2.24)

Here, we perform the Procrustes analysis via the Procrustes package [56], in which P is obtained by performing a singular value decomposition on the matrix Ŵ⊺W (padding zeros onto Ŵ ensures the operation proceeds correctly).
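The alignment step in (2.24) has a closed-form solution via the SVD of Ŵ⊺W. A NumPy sketch of that step is shown below (the dissertation uses the Procrustes package; here, for illustration, loadings are laid out as features × factors so that the rotation acts on the factor axis, and the function name is an assumption):

```python
import numpy as np

def procrustes_error(W_hat, W):
    """Procrustes norm, eq. (2.24): find the orthogonal transform P from
    the SVD of W_hat' W that best aligns W_hat to W, then return the
    squared Frobenius distance after alignment."""
    U, _, Vt = np.linalg.svd(W_hat.T @ W)
    P = U @ Vt                        # optimal orthogonal transform
    return np.sum((W_hat @ P - W) ** 2)
```

A quick check: rotating a loading matrix by any orthogonal matrix leaves its Procrustes error against the original at (numerically) zero, which is exactly the invariance the metric is designed to remove.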
In addition, we employ the relative root mean squared error (RRMSE) to evaluate the reconstruction loss:

RRMSE(X, X̂) = sqrt( Σ_{i,j} (X̂_ij − X_ij)² / Σ_{i,j} X²_ij ) (2.25)
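Equation (2.25) in code (the function name is illustrative):

```python
import numpy as np

def rrmse(X, X_hat):
    """Relative root mean squared error of a reconstruction, eq. (2.25)."""
    return np.sqrt(np.sum((X_hat - X) ** 2) / np.sum(X ** 2))
```

Note the two extremes: a perfect reconstruction gives RRMSE 0, while predicting all zeros gives RRMSE 1.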
Lastly, we computed the log-likelihood of held-out data to assess generative modeling proficiency. Specifically, we first trained each model on simulated training data, then computed latent-space representations of the testing data under each trained model, and finally computed log-likelihoods under normality assumptions given the latent representations and the learned loadings and parameters.

For model comparison, we also evaluated sparse PCA [47] and Empirical Bayes Matrix Factorization (EBMF), a recently described variational approach [53], under the same simulation settings and compared their performance with SuSiE PCA via the criteria described above.
2.2.3 GTEx z-score dataset
To illustrate the application of SuSiE PCA in genetic research, we downloaded the Genotype-Tissue Expression (GTEx) [54] summary statistics data, composed of z-scores computed from testing the association between genetic variants and gene expression levels across 44 different human tissues [53]. The GTEx project collected genotype and gene expression data from 49 non-disease tissues across n = 838 individuals, providing an ideal resource for studying the relationship between genetic variants and gene expression levels [54]. Genetic variants that are statistically associated with gene expression levels are referred to as expression quantitative trait loci (eQTLs). To identify eQTLs, the GTEx project tested the association between each nearby genetic variant of a given gene and its expression levels using linear regression to yield a z-score. The summary data we explored reflect the most significant eQTL (equivalently, the largest absolute z-score in each SNP and gene pair) at each of 16,069 genes (rows) from 44 tissues (columns), curated from GTEx (v8) in ref. [53], as those 16,069 genes show evidence of being expressed in 44 of all 49 human tissues. To identify tissue-specific components of regulatory genetic features and contributing genes, we applied SuSiE PCA to this z-score matrix with a latent dimension of 27 and 18 single effects per factor. The prior information on the number of latent dimensions comes from Wang et al. (2021) [53], who contributed the z-score dataset and ran the EBMF model with 27 factors. To determine the appropriate L for the data, we ran SuSiE PCA with L ranging from 10 to 25 and selected the model at which the increase in the total percentage of variance explained (PVE) fell below 5%. PVE measures the amount of signal in the data captured by each latent component; the PVE of factor z_k is calculated as:

PVE_k = s_k / (Σ_k s_k + NP/τ) (2.26)

where s_k = Σ_{i=1}^{N} Σ_{j=1}^{P} (E[z_ik] E[w_kj])².
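Equation (2.26) in code, assuming the posterior means are available as arrays (the function name and argument layout are illustrative):

```python
import numpy as np

def pve(mu_z, mu_w, tau):
    """Percent of variance explained per factor, eq. (2.26).

    mu_z: posterior mean of Z (N x K); mu_w: posterior mean of W (K x P);
    tau: residual precision. s_k = sum_ij (E[z_ik] E[w_kj])^2."""
    N, K = mu_z.shape
    P = mu_w.shape[1]
    s = np.array([np.sum(np.outer(mu_z[:, k], mu_w[k]) ** 2) for k in range(K)])
    return s / (s.sum() + N * P / tau)
```

By construction each PVE_k is non-negative and the values sum to less than 1, with the remainder attributed to residual noise.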
2.2.4 Perturb-seq dataset
We next investigated genome-scale Perturb-seq data [21] to discover co-regulated gene sets affected by common types of perturbations. The data originated from Perturb-seq experiments performed by Replogle et al. [21]. Perturb-seq is a cutting-edge technique combining CRISPR-based perturbations with single-cell RNA-sequencing readouts, enabling the investigation of co-regulated gene sets affected by various perturbations. The researchers employed three cell lines: K562 cells, hTERT-immortalized RPE1 cells, and HEK293T cells. CRISPRi technology was used to generate cell lines expressing dCas9-BFP-KRAB (KOX1-derived) for the perturbation experiments.
Since we focus our analyses on the expression data from the K562 cell line, we give a brief description of the experiments performed on it. Namely, the authors targeted genes expressed in K562 cells, transcription factors, and Cancer Dependency Map common essential genes, and included non-targeting control sgRNAs accounting for 5% of the total library. The gene sets were defined based on a combination of bulk RNA-seq data from ENCODE and 10x Genomics 3' single-cell RNA-seq data. Libraries were constructed with dual-sgRNA pairs targeting each gene, expressed from tandem U6 expression cassettes in a single lentiviral vector, and ranked based on empirical data and computational predictions. Subsequently, the authors conducted Perturb-seq experiments on the K562 cells, with 2,056 distinct knocked-out genes and one non-targeting control group over an average of 150 different single cells each, and then measured the expression levels of 8,563 downstream genes in each cell.
The final dataset contains 310,385 rows, each representing one perturbation in a specific cell, with the expression levels of 8,563 downstream genes as columns. As an exploratory analysis, we omitted the single-cell-level information and aggregated the expression levels of downstream genes sharing the same perturbation over all cells, resulting in a "pseudo-bulk" data matrix with 2,057 rows and 8,563 columns. We then performed SuSiE PCA and sparse PCA to investigate the regulatory modules based on common perturbations. To exclude batch effects and other non-genetic covariates, we regressed out the germ-line group and the mitochondrial percentage from the original expression data before aggregating the expression levels of downstream genes with the same perturbation. Finally, the aggregated data were centered and standardized before input into SuSiE PCA.
As a comparison, we also ran sparse PCA with the same K on both datasets. Choosing an appropriate sparsity parameter alpha in sparse PCA is less straightforward than tuning L in SuSiE PCA, as we cannot directly control the number of non-zero genes even with a fairly large alpha (higher sparsity). To make a reasonable comparison, we ran sparse PCA with alpha ranging from 1 to 20 and chose two models to compare: first, the model giving the highest PVE, and second, the model having a similar level of PVE to SuSiE PCA.
2.3 Results
2.3.1 PIPs from SuSiE PCA outperform existing methods in feature selection
To evaluate the performance of SuSiE PCA, we performed extensive simulations (see details in Simulations). Briefly, we performed 100 simulations, varying model parameters one at a time, and performed inference using SuSiE PCA with the true number of latent variables (K) and effects (L) known. First, we evaluated the ability of inferred PIPs to discriminate between relevant and non-relevant features for latent factors. Specifically, we compared the sensitivity and specificity of inferred PIPs to those of normalized posterior mean weights from SuSiE PCA (see Figure 2.1). When selecting variables based on PIPs > 0.90, SuSiE PCA identifies 88.9% of true positive (non-zero) signals, demonstrating largely calibrated posterior inference. We observed that nearly all true negative signals exhibited PIPs < 0.05. As a comparison, the posterior weights performed well at excluding the true negative signals but failed to capture true positive signals as rapidly as PIP thresholds. Overall, the simulations demonstrate that PIPs provide a more intuitive and efficient indicator for feature selection than normalized posterior weights in SuSiE PCA. In addition, we examined the sensitivity and specificity using weights estimated from sparse PCA and EBMF (see Appendix A Figure 5.1), which show similar trends to the curves in Figure 2.1B and can only capture a small proportion of the true positive signals as the cutoff threshold increases.
[Figure 2.1 plots the proportion of correctly classified signals (y-axis, 0 to 1) against the cutoff (x-axis, 0 to 1) for PIPs (panel A) and posterior weights (panel B).]

Figure 2.1: PIPs exhibit a higher efficiency in selecting the true signals than the posterior weights in SuSiE PCA.
The proportion of correctly classified signals using PIPs as the cutoff (A) or posterior weights as the cutoff (B). The green dots represent sensitivity, i.e., Pr(PIPs ≥ cutoff | true positive signal); the red dots represent specificity, i.e., Pr(PIPs < cutoff | true negative signal). For consistency and to ensure comparability between PIPs and weights, the weights are standardized to range from 0 to 1.
2.3.2 SuSiE PCA is robust to model misspecification

Next, we examined the estimation accuracy of the loading matrix as a function of sample size (N), feature dimension (P), latent dimension (K), and the number of single effects (or sparsity level) (L), via the Procrustes errors [56] (the Frobenius norm after Procrustes transformation [55]; see Methods) (Figure 2.2 A-D). We found that SuSiE PCA has the smallest Procrustes errors across all simulation settings compared to sparse PCA and EBMF, and we noticed that the Bayesian methods, SuSiE PCA and EBMF, maintain a low error even with a small sample size or high feature dimension. Moreover, SuSiE PCA has the lowest RRMSE across all simulations compared with the other methods (Appendix A.2 Figure 5.2), and EBMF and SuSiE PCA have a lower Procrustes error for the factor Z than sparse PCA (Appendix A.2 Figure 5.3). In summary, SuSiE PCA exhibits the best estimation accuracy, consistent with its superior performance in variable selection.
We next investigated model robustness under misspecification. Like other latent factor models, SuSiE PCA can be misspecified, as it requires manually specifying the latent dimension K and the number of single effects L. To assess potential misspecification, simulation data sets were generated with K = 4 and L = 40 and then input into SuSiE PCA, sparse PCA, and EBMF under two misspecified scenarios: varying L while fixing K, or varying K while fixing L. Model estimation accuracy was then compared among the three models via the Procrustes error (see Figure 2.2 E-F). We observed that as K and L in the model approach the true values (i.e., K = 4 or L = 40), the Procrustes error of SuSiE PCA decreases rapidly to a low level and remains there even when K > 4 or L > 40. However, the error for sparse PCA follows a V-shape and reaches its minimum at the true K. The explanation is that when there are over-specified latent factors, SuSiE PCA and EBMF will not extract extra information from the data due to their probabilistic model structure; sparse PCA, on the other hand, imposes no probabilistic assumption on the weights. Instead, the values of the redundant latent factors in sparse PCA are close to 0, which ensures those latent components do not contribute.

Finally, to compare generative capacity, we computed and compared the log-likelihood of held-out data between sparse PCA and SuSiE PCA. We observed that SuSiE PCA outperforms sparse PCA, obtaining higher log-likelihoods across simulations (Appendix A.2 Figure 5.5). In addition to its overall superior performance, SuSiE PCA remains faster on both CPU and GPU than sparse PCA and EBMF due to the efficient variational algorithm we implement (see Methods) with the JAX library developed by Google [57].
[Figure 2.2 panels: (A-D) Procrustes error ||W − QŴ||²_P plotted against sample size (600-2000), feature dimension (4000-10000), latent dimension (4-9), and number of signals per factor (10-60); (E-F) the same error plotted against the L specified in SuSiE PCA (20-50) and the number of latent components (2-6). Legend: EBMF, Sparse PCA, SuSiE PCA, with reference lines for correctly specified models.]
Figure 2.2: SuSiE PCA generates smaller Procrustes errors in the weight matrix than sparse
PCA and EBMF (A-D) and is robust to over-specified K and L (E-F).
For each scenario in (A-D) we vary one of the parameters at a time to generate the simulation data while fixing the
other 3 parameters, and then input the true parameters (N, P, K, L) into models. Finally, we compute the Procrustes
error and plot them as a function of N, P, K, L. For (E-F), we use the same simulation setting in Figure 1 to generate
data but vary the specified L in SuSiE PCA (E) and K in all three models (F). Reference lines refer to the error from
the models with correctly specified parameters (i.e. L = 40, K = 4).
2.3.3 Dissecting cross-tissue eQTLs in GTEx
To illustrate the utility of SuSiE PCA in making inferences from biological data, we analyzed multi-tissue eQTL z-score results computed from GTEx v8 [53, 54] (see Methods). Specifically, we
sought to identify latent factors corresponding to tissue-specific and tissue-shared eQTLs similar
to ref[53]. Overall, we found that 27 latent factors explained 53.1% of the variance in the data
(Appendix A.2 Figure 5.6). Although we set L = 18 across all factors, we found the number
of tissues with PIP > 0.9 is frequently lower than 18 in different factors (Appendix A.2 Figure
5.9), which is due to inferred τ0kl acting to “shut off” uninformative features. Indeed, we observed
30/486 τ0kl with estimates greater than e^10 (Appendix A.2 Figure 5.7), which effectively shrink
the effect sizes of the corresponding single effects toward 0, driving the number of non-zero single
effects in some factors below the specified L. We found this behavior also reflected in estimated
level-0.9 credible sets, where 456/486 contained a single tissue, and the remaining 30 credible sets
contained at least two tissues.
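The per-feature PIPs and level-0.9 credible sets summarized above can be computed from the per-single-effect inclusion probabilities α in the usual SuSiE fashion; the sketch below is a hypothetical NumPy rendering (the helper names are ours, not the dissertation's implementation).

```python
import numpy as np

def pip_from_alpha(alpha):
    """Aggregate L x P single-effect inclusion probabilities into one PIP
    per feature: PIP_p = 1 - prod_l (1 - alpha_{l,p})."""
    return 1.0 - np.prod(1.0 - alpha, axis=0)

def credible_set(alpha_l, rho=0.9):
    """Smallest set of features whose inclusion probabilities for one
    single effect sum to at least rho (a level-rho credible set)."""
    order = np.argsort(alpha_l)[::-1]
    cum = np.cumsum(alpha_l[order])
    return order[: int(np.searchsorted(cum, rho)) + 1]

# One dominant tissue yields a singleton credible set
alpha_l = np.array([0.95, 0.03, 0.02])
print(credible_set(alpha_l))  # [0]
```

A single effect concentrated on one tissue thus yields a singleton credible set, matching the 456/486 singleton sets reported above.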
To understand what each factor represents, we examined inferred PIPs (Appendix A.2 Figure
5.9) and posterior mean weights of each tissue across 27 factors (Appendix A.2 Figure 5.8). Here
we present the results from factor z1 and z3 through the posterior weights (Figure 2.3; see Appendix
A.2 Figure 5.8 for the remainder). We observed that the latent factor z1 with the second largest
PVE demonstrates high absolute weights on most tissues except for the brain tissues, while the
latent factor z3 has large weights almost exclusively on brain tissues. Moreover, we observed that
brain tissues tend to appear as a group with similar effects, implying that eQTLs in brain tissues
differ from those in other tissues and that these strong signals are specifically captured by the
factor z3. For the rest of the factors, we noticed that factors with large PVE such as z2, z4, z5
tended to have large weights on multiple tissues; for example, factor z2 has large weights on
esophagus and thyroid, suggesting the eQTLs signals are mostly shared across those tissues; while
the factors with small PVE usually have large weights exclusively on one or a few tissues, for
example, liver-specific component z12, lung-specific component z15, etc. The only exception is
that the factor z0 with the largest PVE has a large weight only on testis, implying
that z0 captures testis-specific eQTL signals. This is consistent with an inspection of the
latent factor values of z0: the gene with the largest factor value in z0 is DDT (Appendix A.2
Figure 5.10), which has been shown to be associated with testis cancer [58]. To make a comparison
with the existing method, we expanded our investigation by applying sparse PCA to the GTEx
z-score dataset and observed comparable tissue weights and factor scores across components in
both SuSiE PCA and sparse PCA (Appendix A.2 Figure 5.11). However, a notable distinction
arises where certain tissues exhibit tiny weights and can easily be overlooked in sparse PCA; in
contrast, SuSiE PCA can successfully capture the signals in those tissues through the PIP. For
example, from the original analysis, both models identify the adrenal gland as the most relevant tissue
in factor 10, while the remaining tissues have a much smaller relative weight, and can effectively
be ignored. Despite this, SuSiE PCA assigns a PIP of 1 to the lowly weighted tissues, suggesting
that important signals would be missed if weights alone were used to provide insight. Overall, we
find that SuSiE PCA is able to identify tissue-specific components from multi-tissue eQTL data in
an intuitive, interpretable manner.
[Figure 2.3 panels: posterior weights (−1 to 1) for Factor 1 and Factor 3 across the 44 GTEx tissues, from Adipose_Subcutaneous through Whole_Blood.]
Figure 2.3: Factors z1 and z3 capture different types of tissues (non-brain vs. brain tissues).
The posterior weights refer to the inferred W matrix from SuSiE PCA. A clustering pattern appears across factors
because only a limited number of tissues have non-zero weights in each factor, since we set L = 18 while
the feature dimension is 44.
2.3.4 Identifying regulatory modules from Perturb-seq Data
To identify gene regulatory modules from genome-wide perturbation data, we ran SuSiE PCA on
Perturb-seq data from cell lines [21] (see Methods) with K = 10 and L = 300. Briefly, we input the
normalized expression data (2057 × 8563) to SuSiE PCA to identify gene regulatory modules (i.e.
Z) and downstream-regulated networks (i.e. W). To ensure our results were robust to K and L,
we explored a grid of possible combinations and found that K = 10 and L = 300 retain the most
important information while keeping the relevant gene set much smaller (see Appendix A.2 Figure
5.12 for a detailed explanation).
Overall, we found the total PVE was 10.71% across all components (Appendix A.2 Figure
5.13), with each component exhibiting 299 downstream genes with PIP > 0.9 on average. Focusing
on the leading component, we found that perturbations with the top 10 largest absolute factor scores
are primarily related to Ribosomal Protein Small (RPS) subunit genes and Ribosomal Protein Large
subunit (RPL) family (Figure 2.4 A). To provide a broader characterization of the module function,
we extracted downstream genes with PIP greater than 0.9 (298 genes) as input into ShinyGO [59]
to perform a gene set enrichment analysis (Figure 2.4B). We observed the most enriched pathway
was related to ribosome function (FDR = 9.2 × 10^-82, 63 genes involved), followed by Coronavirus
disease (FDR = 2.5 × 10^-62, 62 genes involved). Inspecting the loadings at these downstream genes,
we found nearly all weights were positive, suggesting that knockout of RPS and RPL genes
down-regulates the expression levels of those downstream genes. We found multiple elongation
factor genes (EEF1G, EEF1A1, EEF1B2, EIF4B, EIF3L) among the leading downstream genes,
which are known to be involved in ribosome function. Additionally, recent studies have suggested
that the decreased expression of elongation factor genes is associated with less severe conditions
among COVID-19 patients [60, 61]. We repeated pathway analysis for each latent factor using
corresponding loadings at genes with PIP greater than 0.9 (see Appendix A.2 Figure 5.14 - Figure
5.22).
[Figure 2.4 panels: (A) factor scores versus rank, with top perturbations DDX47, RPL7A, RPS4X, RPS8, RPS6, RPS18, RPS13, RPS24, RPS26, RPS27A and the non-targeting controls labeled; (B) fold enrichment of the top pathways (Ribosome, Coronavirus disease, Vasopressin-regulated water reabsorption, Salmonella infection, Endocytosis), annotated with gene counts and −log10(FDR).]
Figure 2.4: The perturbations with top factor scores in the first component mostly belong
to the RPL and RPS families (A), and the downstream genes in the same component are
enriched for ribosome and coronavirus disease pathways (B).
Each point in (A) represents the latent factor value of a perturbation. The top 9 points as well as the control group
are labeled in the plot and colored red and blue, respectively. In the gene set enrichment analysis, we input the downstream
genes with PIP > 0.9 and show the top enriched pathways with −log10(FDR) and the number of genes included in the
corresponding pathways.
To compare with Sparse PCA, we performed the same pathway analysis on factor loadings
and assessed enrichments. From the sparse PCA fit with the largest PVE (alpha = 1), we observed
that components identified by sparse PCA were less enriched with biological pathways than those
from SuSiE PCA (80 unique enriched pathways in sparse PCA versus 88 in SuSiE PCA),
and the top enriched pathways such as ribosome and coronavirus disease were less significant and
contained fewer selected genes (FDR = 1.4 × 10^-33, 35 genes; FDR = 2.9 × 10^-18, 29
genes). We noticed that when alpha equals 17, sparse PCA achieves a total PVE (10.91%)
similar to our model's (10.71%) but with a higher sparsity level (Appendix A.2 Figure
5.23). We then extracted the top 300 genes with non-zero weights in sparse PCA with alpha = 17
and performed the gene set enrichment analysis, finding significance levels similar to those in
SuSiE PCA (Appendix A.2 Figure 5.24). However, tuning alpha this way is a post hoc analysis,
which suggests SuSiE PCA is more suitable for sparse data analysis while retaining the power to
perform feature selection in a more principled, statistically grounded manner.
Overall, we find distinct biological functions identified by each component, with groupings
consistent with those reported in previous works [62–64].
2.4 Discussion
In this chapter, we propose SuSiE PCA, an efficient Bayesian variable selection approach to principal components analysis for structured biological data. The sparsity of the loading matrix is
achieved by restricting the number of features associated with each factor to be at most L. Through
simulations and real-data application, we find that SuSiE PCA outperforms existing approaches to
sparse latent structure learning in identifying contributing features, while maintaining a more efficient run time.
There are several advantages of SuSiE PCA as compared to other sparse factor models. First,
SuSiE PCA generates a posterior inclusion probability (PIP) for each feature that quantifies
the uncertainty of its selection, which cannot be provided by other sparse models such as
sparse PCA with regularization [38] or the Bayesian treatment of PPCA; assessing the selected
variables based on a probability is more principled and convenient than using weights. Second,
PIPs are capable of selecting more signals with high confidence. In simulations, we demonstrated
that using weights for variable selection in SuSiE PCA, sparse PCA, and EBMF can deliver
high specificity (a low false discovery rate) but low sensitivity as the cutoff value increases,
while using PIPs as selection tools can maintain a high sensitivity for any positive cutoff value
between 0 and 1. Third, SuSiE PCA provides a more precise estimate of the loadings and higher
prediction accuracy, even in the misspecified case, as we impose a probabilistic distribution over
the loadings that enables a much more accurate inference on the posterior distribution. Finally,
the inference procedure of SuSiE PCA works on the dimension of K and L, which is typically
set to be much smaller than feature dimension P; therefore, it is scalable to high-dimensional data
and requires less computational demands. We implement the SuSiE PCA with the JAX library
developed by Google [57] to enable fast convergence on CPU, GPU, or TPU.
In the SuSiE PCA, two parameters, the number of components K, and the number of single
effects L need to be prespecified by the user before fitting the model. The selection of K follows
a similar strategy as conventional principal component analysis, often informed by researchers’
domain expertise. The merit of SuSiE PCA is that when there are excessive latent components
being specified, the variance explained for those components would be extremely minimal with a
near-zero count of single effects exhibiting PIP > 0.9. This effectively allows for an initial choice
of a relatively large K and subsequently inspecting the PVE and PIPs in each component to decide
the most suitable K.
The choice of L determines the sparsity in the SuSiE PCA. Although SuSiE PCA only allows
one common L specified across all factors, the number of non-zero effects captured across factors
can be varied and learned from the data. This is because we treat the inverse of variance τ0kl
of the lth single effect in factor zk as a random variable. As Algorithm 1 (see Appendix A,
SuSiE PCA Algorithm) demonstrates, the MLE of τ0kl at step 3 is derived before inference
of other parameters. When the L specified in the model, for a certain factor k, is greater than the
true number of signals associated with that factor, the MLE of the τ0kl will be extremely large
for those excessive single effects, which then shrinks the wkl and PIP to be 0 or close to 0, and
therefore removes the redundant single effects from the model. For example, in the simulation
and GTEx z-score data analysis, we have shown that when the user-specified L is larger than the
data-generating L, the ARD-like prior over loadings will shrink the excess effects towards 0, adding little
additional predictive power and little overall MSE relative to the true loadings matrix. Although it seems like
the L parameter may be automatically set to the total number of variables (and thus “shut off” if
necessary), we emphasize that this still comes with an added computational cost, albeit a low one
due to the scalability of our approach. Therefore, we allow users to specify their own choice of L.
From this point of view, without prior knowledge of the data, one can specify a relatively larger L
during the initial model fitting, and then examine the estimates of τ0kl to explore how many single
effects are reasonable for the dataset.
Overall, SuSiE PCA provides a flexible approach to high-dimensional biological data with a
low-rank structure, and allows for feature selection in sparse principal component analysis.
2.5 Limitations of the Study
One limitation of SuSiE PCA is that under the mean-field approximation, all the posteriors, i.e.
Q(W) and Q(Z), are factorized to facilitate inference. Under this factorization, estimates of the mean
terms (i.e. E[W] and E[Z]) are approximately unbiased [65]. However, the factorization produces overconfident covariance structures within variables (W, Z, etc.) due to the assumed independence across the Q
functions.
Chapter 3
Perturb Variational Inference (PerturbVI)
3.1 Preface
In the last chapter, we introduced the SuSiE PCA, which provides a scalable and robust latent
factor model for analysis of high-dimensional data: the estimated loading matrix is more sparse
and interpretable than existing approaches, making it suitable for analyzing large-scale Perturb-seq
data. However, when we applied SuSiE PCA to genome-wide Perturb-seq data [21], we collapsed
the expression data by perturbation across different cells, therefore losing the cell-level information. The reason is not about the scale of the data. In fact, we applied SuSiE PCA directly on the
original scale of expression data (310,385 cells by 8,563 genes), and it converged in 1 hour and
40 minutes with 100 GB of memory. However, the SuSiE PCA does not account for the perturbation information, which is known a-priori. In other words, SuSiE PCA fails to leverage prior
knowledge of which gene is perturbed in which individual cell. In this chapter, we developed a
perturbation-informed latent factor model based on the SuSiE PCA framework, named Perturb
variational inference (PerturbVI), for a scalable and accurate estimation of the perturbation effect
through Perturb-seq data.
3.2 Introduction
Perturb-seq, a technique combining CRISPR-based gene perturbation with single-cell RNA sequencing [24], is a powerful tool that can generate single-cell level perturbation data for dissecting
GRNs [25]. Perturb-seq allows for testing thousands of regulatory genes simultaneously, offering a
high-resolution view of gene regulatory landscapes, revealing how genetic alterations can influence
entire networks [26]. Recently, Replogle et al. generated genome-wide Perturb-seq data that assays
thousands of perturbations with single-cell RNA sequencing in chronic myeloid leukemia (CML)
K562 cell lines. The study demonstrated that Perturb-seq data holds unparalleled potential
for functional discoveries on numerous poorly characterized genes.
Despite the promise, the substantial volume and complexity of Perturb-seq data pose significant
computational challenges. Traditional methods for processing and interpreting Perturb-seq data are
often inadequate. For example, differentially expressed gene (DEG) analysis can be applied to test
the distribution of each gene between the target group and the control group [66, 67]. However,
it tends to be underpowered and suffers from the burden of multiple testing. Matrix decomposition is another promising approach to extracting gene modules from large-scale expression data
[68]; still, it does not explicitly incorporate the perturbations into the model. Recently, Zhou et
al. proposed a new model, guided sparse factor analysis (GSFA), to bridge this gap by associating
the perturbation as a function of factor scores, providing a novel approach to infer the perturbation
effects within the factor analysis framework [69]. However, GSFA relies on Monte Carlo
methods for model inference and therefore cannot be applied to large-scale Perturb-seq data
due to computational inefficiency. Hence, there is a critical need to develop a scalable computational method that can effectively extract regulatory modules from large-scale Perturb-seq
data.
In this study, we present a novel scalable latent factor model, Perturb Variational Inference
(PerturbVI), a perturbation-informed ultra-sparse PCA that leverages variational inference to
estimate perturbation effects from large-scale CRISPR perturbation datasets. Specifically,
we incorporate the perturbation design matrix into the latent component estimated from SuSiE
PCA [70]. This approach offers a scalable and robust solution for identifying gene regulatory
modules within Perturb-seq datasets. Through extensive simulation, we found that PerturbVI has
a more precise estimate of the perturbation effect matrix and retains more power to detect true
effects than GSFA. In real data applications, PerturbVI identified more enriched pathways and a
new target gene, POGZ, from the CROP-seq data of Lalli et al. [71] compared to GSFA; moreover,
PerturbVI effectively identified multiple regulatory modules related to crucial cellular processes
such as the cell cycle and ribosome biogenesis in the essential-wide Perturb-seq data of Replogle et al. [21].
PerturbVI is a highly efficient tool for inferring perturbation effects on downstream genes through
CRISPR screening data.
3.3 Methods
3.3.1 PerturbVI model
In this section, we provide a detailed description of the PerturbVI model. Suppose XN×P is the
observed expression count matrix with N cells and P downstream genes, MN×G is the binary
perturbation design matrix with G perturbations. Note that PerturbVI assumes only one
perturbation gene is targeted in each cell; therefore, each row of the design matrix M contains a
single value of 1, indicating which perturbation targeted that cell. Then PerturbVI is
presented as:
X \mid Z, W, \sigma^2 \sim \mathcal{MN}_{N,P}(Z W_{\mathrm{SuSiE}},\, I_N,\, \sigma^2 I_P) \quad (3.1)
Z \mid B \sim \mathcal{MN}_{N,K}(MB,\, I_N,\, I_K) \quad (3.2)
B = \sum_{k=1}^{K} \beta_k e_k^\top \quad (3.3)
\beta_{jk} \sim \mathcal{N}(0, \sigma_\beta^2)^{\eta_{jk}} \cdot \delta_0^{\,1-\eta_{jk}} \quad (3.4)
\eta_{jk} \sim \mathrm{Bernoulli}(p) \quad (3.5)
where W_SuSiE is the loading matrix from SuSiE PCA, Z (N × K) is the latent component matrix that
captures gene modules of co-expressed genes, and B (G × K) is the upstream gene effect
matrix that estimates the perturbation effects on each latent component. We imposed a spike-and-slab prior on each column vector of B, reflecting that only a small portion of perturbations (those with η_jk = 1) affect
the gene modules within a given latent component. PerturbVI assumes that the target gene affects
the downstream network through the gene modules Z: the upstream gene effects are captured
in B, and the downstream gene effects are estimated in W. To infer the perturbation effect on
downstream genes, we noticed that:
E[X] = ZW (3.6)
E[Z] = MB (3.7)
Since M is a binary design matrix, by approximating Z with E[Z] we have:
E[X] ≈ MBW (3.8)
Therefore, BW is the estimated overall effect matrix (G by P) that represents the perturbation’s
effect on the downstream gene’s expression level.
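The generative model (3.1)-(3.5) and the resulting overall effect matrix B W can be sketched as a forward simulation. This is an illustrative script under our own assumed dimensions and hyperparameters (N, P, K, G, the spike probability, and the noise scales are all invented for the example, not taken from the dissertation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, K, G = 200, 50, 3, 10    # cells, genes, components, perturbations

# One-hot design M: each cell carries exactly one perturbation (model assumption)
M = np.zeros((N, G))
M[np.arange(N), rng.integers(0, G, size=N)] = 1.0

# Spike-and-slab upstream effects B (Eqs. 3.3-3.5) and dense loadings W
eta = rng.random((G, K)) < 0.3            # eta_jk ~ Bernoulli(p), here p = 0.3
B = rng.normal(0.0, 1.0, (G, K)) * eta    # slab N(0, sigma_beta^2) where eta = 1
W = rng.normal(0.0, 1.0, (K, P))

Z = M @ B + rng.normal(size=(N, K))       # Z | B ~ MN(MB, I_N, I_K)   (Eq. 3.2)
X = Z @ W + rng.normal(0.0, 0.1, (N, P))  # X | Z, W ~ MN(ZW, I_N, sigma^2 I_P)

effect = B @ W  # G x P overall perturbation effects on downstream genes (Eq. 3.8)
```

Row g of `effect` is the expected shift in downstream expression for cells receiving perturbation g, since E[X] ≈ M B W.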
Perturb-seq data typically come with a set of non-targeting cells (carrying negative control
RNA) representing gene activity in the absence of any perturbation. Therefore, it is crucial to account
for the presence of non-targeting cells to correctly estimate the perturbation effects. To address this,
PerturbVI leverages one-hot encoding in the design matrix M: the non-targeting cells are encoded as zero across all target gene columns. As a result,
this group naturally becomes the control group within the nested regression model. This is analogous to the
group coding approach in Analysis of Variance (ANOVA) [72], where the control group is treated
as the reference group, serving as a baseline for comparison with the treatment group.
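Concretely, the reference-group encoding might look like this (the guide labels and gene names below are invented for illustration):

```python
import numpy as np

# Hypothetical per-cell guide assignments; "NT" marks non-targeting controls
guides = ["GENE_A", "NT", "GENE_B", "GENE_A", "NT"]
targets = ["GENE_A", "GENE_B"]

# One-hot design matrix; control cells become all-zero rows
M = np.array([[1.0 if g == t else 0.0 for t in targets] for g in guides])

# Control rows contribute nothing to M @ B, so the control group serves as
# the baseline, analogous to reference-group coding in ANOVA
print(M)
```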
3.3.2 Variational inference in PerturbVI
Similar to SuSiE PCA inference, we performed the variational inference (VI) to infer the posterior distribution of model parameters Z,W, γ, β, η. VI is an efficient approach to approximate
the complex and intractable probability distributions that are difficult to compute directly [35].
It operates by optimizing a family of densities to approximate the target posterior density, called
variational distribution, leveraging the Kullback–Leibler (KL) divergence for measuring the similarity between true posterior and variational distribution [36]. Notably, VI distinguishes itself
by its efficiency, often outperforming other computational methods such as Markov chain Monte
Carlo (MCMC) sampling [34], particularly in applications requiring heavy computation such as
large-scale Perturb-seq data. The objective function to optimize the variational distribution is the
evidence lower bound (ELBO). In our model, it can be written as follows:
\mathrm{ELBO}(W, Z) = \mathbb{E}_Q[\log \Pr(X, Z, W) - \log Q(Z, W)]
= \mathbb{E}_Q[\log \Pr(X \mid Z, W)] + \mathbb{E}_Q[\log \Pr(Z, W) - \log Q(Z, W)]
= \mathbb{E}_Q[\log \Pr(X \mid Z, W)]
+ \mathbb{E}_{Q(Z \mid B)}[\log \Pr(Z \mid B) - \log Q(Z \mid B)]
+ \sum_k \sum_j \hat{p}_{jk} \, \mathbb{E}_{Q(\beta_{jk} \mid \eta_{jk})}[\log \Pr(\beta_{jk} \mid \eta_{jk}) - \log Q(\beta_{jk} \mid \eta_{jk})]
+ \sum_k \sum_j \mathbb{E}_Q[\eta_{jk} \log p - \eta_{jk} \log \hat{p}_{jk}]
+ \mathbb{E}_{Q(W, \Gamma)}[\log \Pr(W, \Gamma) - \log Q(W, \Gamma)]
Since ELBO is tractable to compute, we then derive the analytical form of the variational distribution of model parameters:
Q(Z) = \mathcal{MN}_{N,K}(Z \mid \mu_Z, I_N, \Sigma_Z) \quad (3.9)
Q(w_{kl} \mid \gamma_{kl}) = \mathcal{N}(\mu_{w_{kl}}, \sigma^2_{w_{kl}}) \quad (3.10)
Q(\gamma_{kl}) = \mathrm{Multi}(1, \alpha_{kl} = \mathrm{softmax}(\log \tilde{\alpha}_{kl})) \quad (3.11)
Q(\beta_{gk} \mid \eta_{gk} = 1) = \mathcal{N}(\beta_{gk} \mid \mu_{gk}, s^2_{gk}) \quad (3.12)
Q(\eta_{gk} = 1) = \mathrm{Bernoulli}(p_{gk} = \mathrm{softmax}(\log \hat{p}_{gk})) \quad (3.13)
The detailed derivation and the exact form of each parameter in the equation (3.9 - 3.13) are
outlined in Appendix B.2.
3.3.3 PIP and LFSR
In PerturbVI, we can quantify the uncertainty that a downstream gene is associated with
a latent component through the posterior inclusion probability (PIP). Benefiting from the SuSiE
PCA framework, one can compute the PIP via Equation 2.8. Genes with a high PIP (e.g.
PIP > 0.9) show a strong association with the latent component. Analyzing those associated
gene sets, or co-expressed genes, through gene set enrichment analysis (GSEA) allows us to interpret
the biological pathways of each latent component.
Although the exact distribution of the PerturbVI effect is difficult to compute, we leveraged the
local false sign rate (LFSR) [73], as suggested by Zhou et al. in GSFA, to assess the significance
of perturbation-gene effects [69]: we sampled B and W from their posterior distributions
n times, and the LFSR of the effect of upstream gene i on downstream gene j, defined as O_ij, is
given by:
\mathrm{LFSR}(O_{ij}) = \min\{\Pr(O^{(t)}_{ij} \geq 0 \mid Q(B, W)),\ \Pr(O^{(t)}_{ij} \leq 0 \mid Q(B, W))\} \quad (3.14)
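Equation (3.14) can be approximated with posterior draws as below; this is a sketch (the function name and sample shapes are our own convention), estimating each sign probability by the fraction of joint draws of B and W with that sign.

```python
import numpy as np

def lfsr(B_samples, W_samples):
    """Monte Carlo estimate of the local false sign rate of O = B @ W.
    B_samples: (T, G, K) posterior draws of B; W_samples: (T, K, P) draws of W.
    Returns a (G, P) matrix of LFSR values."""
    O = np.einsum("tgk,tkp->tgp", B_samples, W_samples)  # (T, G, P) draws of O
    p_nonneg = (O >= 0).mean(axis=0)
    p_nonpos = (O <= 0).mean(axis=0)
    return np.minimum(p_nonneg, p_nonpos)
```

An effect whose draws almost never change sign gets an LFSR near 0; pairs with LFSR < 0.05 are then called significant.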
3.3.4 Simulations
To evaluate model performance and robustness, we applied PerturbVI and GSFA under the
same simulation settings, with model parameters K and L either correctly specified or misspecified.
Specifically, we first generated a binary design matrix M containing 4000 cells with
100 perturbations and the sparse perturbation effect matrix B with 4 components. Subsequently,
we use the regression model to compute the latent components. We then simulated a sparse loading
matrix W (4 × 5000) to link the latent components with 5000 downstream genes, resulting in an
expression count matrix X (4000 × 5000). In simulation, we primarily examine the metrics as the
function of perturbations dimension (columns in B) and perturbation density/sparsity (sparsity in
B). The evaluation metrics included the Frobenius error of the estimated beta matrix and loading
matrix, alongside overall sensitivity and specificity. Given the rotation property inherent in latent
factor models, we utilized the Procrustes error to evaluate the estimation accuracy of matrices B
and W (Equation 2.24). Furthermore, to assess the model's efficacy in identifying perturbation-gene pairs, we calculated the PerturbVI effect matrix and the corresponding LFSR, selecting
significant pairs with LFSR < 0.05 to compute sensitivity and specificity.
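The sensitivity and specificity over perturbation-gene pairs can be tallied as follows (an illustrative helper of ours, thresholding the LFSR at 0.05 as in the text):

```python
import numpy as np

def sens_spec(true_effects, lfsr_vals, thresh=0.05):
    """Sensitivity and specificity of selected perturbation-gene pairs:
    a pair is called significant when its LFSR falls below `thresh`."""
    called = lfsr_vals < thresh
    truth = true_effects != 0
    tp = np.sum(called & truth)
    tn = np.sum(~called & ~truth)
    sens = tp / max(int(truth.sum()), 1)
    spec = tn / max(int((~truth).sum()), 1)
    return float(sens), float(spec)

truth = np.array([[1.0, 0.0], [0.0, -2.0]])
lfsr_vals = np.array([[0.01, 0.50], [0.20, 0.01]])
print(sens_spec(truth, lfsr_vals))  # (1.0, 1.0)
```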
3.3.5 LUHMES CROP-seq Data
To validate the model's efficiency on real data, we applied PerturbVI to CROP-seq data sequenced
from Lund human mesencephalic (LUHMES) neuronal cells undergoing neuronal differentiation
by Lalli et al. [71]. The authors used dCas9-KRAB to knock down 14 autism spectrum disorder
(ASD)-related genes and then performed single-cell RNA sequencing to capture the downstream
transcriptional activities. The original study aimed to provide a comprehensive overview of how
these target genes influence neuronal differentiation processes, offering insights into the molecular
dynamics of neuron development and ASD-related gene functions. Although Zhou et al. conducted an in-depth analysis of this dataset through GSFA, we reanalyzed the dataset with the same
preprocessing steps [69] to benchmark PerturbVI’s performance.
We first examined the B matrix to understand the effects of perturbations on the latent components. We analyzed the downstream genes with PIP > 0.9 from W to interpret each component
through gene set enrichment analysis (GSEA). Next, we computed the PerturbVI effect matrix
to identify differentially expressed and coregulated gene sets by each perturbation. Finally, we
specifically examined how those ASD target genes regulate the downstream genes involved in the
neuron differentiation pathway.
3.3.6 Essential-wide Perturb-seq Data
We next applied the PerturbVI on the essential-wide Perturb-seq data [21] involving Perturb-seq
experiments on CML K562 cell lines targeting 2057 essential genes based on genes expressed in
K562 cells, transcription factor, and Cancer Dependency essential genes. The dataset comprised
310,385 cells with 8563 downstream genes per cell, with a mean coverage of 183 cells per perturbation for the day 8 genome-wide experiment.
After running the model, we performed the GSEA on downstream genes with PIP > 0.9 and
top 100 perturbations with respect to each component to elucidate gene modules or biological
pathways captured. Given the large sample size, most of the perturbation effects estimated from
the model, although small in effect size, can still be significant (LFSR < 0.05). We therefore
focused on investigating and interpreting several highly significant enriched pathways. To do
so, we first identified components that are highly enriched with respect to a specific pathway,
then selected the top 200 perturbations from those components, input them into GSEA, and kept those
involved in the pathway of interest. Similarly, the downstream genes involved in the pathway of
interest were filtered for further investigation. Finally, we subsetted the PerturbVI effect matrix
to the potential targets and downstream genes and further filtered out entries with LFSR > 0.05.
3.4 Results
3.4.1 PerturbVI is more accurate and robust in estimating perturbation
effect
In the simulation, we first evaluated the Procrustes error and overall sensitivity (Methods) across
two distinct simulation scenarios: the number of perturbations (Figure 3.1 A-C) and perturbation
density (Appendix Figure 5.25). Overall, we found that the loading matrix estimations from PerturbVI are consistently more accurate than GSFA across all scenarios, benefiting from the ultra-sparse
architecture in the model setting. Moreover, we observed that PerturbVI outperforms GSFA in the
estimation accuracy of perturbation effect B, especially with more perturbations introduced and
more perturbation information (i.e., higher perturbation density) provided. Further, regarding the
overall performance in selecting the perturbation-gene pair, PerturbVI achieves around 5% higher
sensitivity, indicating a higher power to detect the true perturbation effects. Both models achieve
almost perfect specificity (> 0.999), indicating that they are less likely to produce potential false
discoveries.
To examine model robustness, we ran the model with the number of single effects L and
the number of components K mis-specified, and with true target genes missing (Figure
3.1 D-F, Appendix Figure 5.26). Overall, we found that PerturbVI maintains superior estimation
accuracy and sensitivity compared to GSFA, particularly when K and L are over-specified (i.e.,
K > 4 or L > 150). However, we also noted that the under-specified component (K = 3) reduced
model efficacy in capturing relevant information in both PerturbVI and GSFA. Moreover, an underspecification of L in PerturbVI (L < 150) resulted in increased zeros in the estimated loading
matrix, consequently decreasing the power to detect significant perturbation-gene pairs. When the
true targets are not perturbed through the experiment, PerturbVI is more robust in detecting the
true effects than GSFA.
Figure 3.1: PerturbVI generates smaller Procrustes errors for the B and W matrices than GSFA
(A-C) and is more robust to mis-specified models (D-F).
For each scenario in (A-C), we varied the number of perturbations while fixing all other parameters, then input the
true parameters (N, P, K, L) into models and computed the Procrustes error. For (D-E), we generated data based on
the base simulation setting (i.e., L = 150, K = 4) but assumed the input L in PerturbVI and input K in both models
are incorrectly specified, then we computed the sensitivity to select the true effect based on LFSR < 0.05. Reference
lines refer to the sensitivity from the correctly specified models. In (F), we assumed only a proportion of true target
genes are perturbed, meaning random sets of columns are dropped in design matrix M when input into both models.
In the simulation, we compared the beta estimates from the full model, with the control group specified in the design matrix, against the reduced model, with the control group dropped as the reference group. We noticed that:

βreduced = βfull − µz,full (3.15)

This implies that by setting the control group as the reference group, the effect of the control condition on the outcome Z is absorbed into the latent component scores. When we centered X, the posterior mean of Z is almost zero and the equation above still holds, suggesting that βreduced provides an accurate estimate of the perturbation effect when a control group exists.
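The relationship in Equation 3.15 mirrors a standard property of reference coding in linear models. The following is a minimal numerical sketch of that property using a plain linear regression rather than the PerturbVI model itself; all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
treat = rng.integers(0, 2, n)  # 1 = perturbed cell, 0 = control cell
ctrl = 1 - treat
y = 0.5 * ctrl + 2.0 * treat + rng.normal(0, 0.1, n)

# Full model: separate coefficients for control and treatment (no intercept)
X_full = np.column_stack([ctrl, treat])
b_ctrl, b_treat_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Reduced model: control as the reference group (intercept absorbs its effect)
X_red = np.column_stack([np.ones(n), treat])
b0, b_treat_red = np.linalg.lstsq(X_red, y, rcond=None)[0]

# The reduced-model effect equals the full-model effect minus the control effect
print(np.isclose(b_treat_red, b_treat_full - b_ctrl))  # True
```

In PerturbVI the role of the intercept is instead played by the latent component scores, which absorb the control-group mean when the control column is dropped.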
3.4.2 CROP-Seq application
We analyzed CROP-seq data targeting 14 ASD-relevant genes in 8708 human LUHMES cells
with expression level measurements across 6000 downstream genes (Methods). We leveraged the
expression count matrix (8708 cells with 6000 downstream genes) and perturbation design matrix
(8708 cells with 13 targets) preprocessed by Zhou et al. [69] and applied PerturbVI with 12 latent
components and 400 single effects. PerturbVI converged in 166 seconds on CPU, using only 2
GB memory. It marked a 58-fold decrease in computing time and over 90% reduction in memory
usage compared to GSFA.
PerturbVI identifies new target gene POGZ
Overall, we discovered that 8 of the 14 targets (ADNP, ARID1B, ASH1L, CHD2, DYRK1A, POGZ, PTEN, SETD5) have at least one significant association with downstream genes (LFSR < 0.05), based on the overall effect matrix (Figure 3.2 D). Although PerturbVI identified fewer associated genes than GSFA for most targets, its smaller gene sets revealed more significantly enriched pathways (FDR < 0.05) in the enrichment analysis (Figure 3.2 D), most of them neuron-related (Appendix Figure 5.29). Notably, previous studies overlooked the target gene POGZ due to its smaller sample size (236 out of 8,308), yet we found it significantly associated with 63 downstream genes, 45 of which are functionally relevant to neurons, such as NEUROD1 and HSP90B1, crucial for neuron differentiation and development [74].
Figure 3.2: PerturbVI results of perturbation effects from LUHMES CROP-seq data
Results are from CROP-seq data in the LUHMES neuron cell line. (A) The number of significant downstream genes (defined as PIP > 0.9) within each latent component for PerturbVI and GSFA. (B) Enrichment analysis of significant downstream genes; we highlight the components most enriched for neuron function. (C) The heatmap of the upstream effect matrix B from PerturbVI identifies 8 relevant target genes. (D) The number of differentially expressed genes (DEGs) for each perturbation. We computed the LFSR of the overall effect matrix and then counted the number of DEGs (LFSR < 0.05) within each perturbation.
PerturbVI produced a condensed set of downstream genes related to neuron functions
To characterize the biological representation of each latent component, we selected downstream genes with PIP greater than 0.9 from the loading matrix for GSEA. We found that 3 of the 12 components (components 3, 7, and 11) are strongly enriched for neuron-related pathways, including neuronal development, neuron differentiation, and axon development (Figure 3.2 B). Compared to the significant downstream genes from the GSFA loadings, PerturbVI presented a more condensed set of genes in each component (Figure 3.2 A), indicating a sparser estimate of the downstream loading matrix. Although PerturbVI identifies fewer total enriched pathways (201 vs. 279), the neuron-related pathways are largely consistent (27 vs. 30) but show higher fold enrichment (Appendix Figure 5.28), suggesting that PerturbVI retains the gene sets functionally relevant to neurons.
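The gene-selection step above is a simple thresholding of the loading PIPs. A minimal sketch, assuming the PIPs are available as a K × P array (`pip` is an illustrative variable name, not part of the PerturbVI API):

```python
import numpy as np

def significant_genes(pip, threshold=0.9):
    """For each latent component, return the indices of downstream genes
    whose posterior inclusion probability exceeds the threshold."""
    return [np.flatnonzero(pip[k] > threshold) for k in range(pip.shape[0])]

# Toy example: 2 components x 5 genes
pip = np.array([[0.99, 0.10, 0.95, 0.20, 0.00],
                [0.05, 0.92, 0.30, 0.97, 0.91]])
print(significant_genes(pip))  # [array([0, 2]), array([1, 3, 4])]
```

The resulting per-component gene lists are what we pass to GSEA.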
We next examined the upstream perturbation effect matrix B and found that 8 of the 14 perturbations were associated with multiple latent components (Figure 3.2 C, Appendix Figure 5.27). More specifically, PTEN and CHD2 are associated with all neuron-related components; PTEN, ASH1L, ADNP, and CHD2 are strongly associated with component 3, which represents neuron development and differentiation; and SETD5 and POGZ are strongly associated with component 11, which is related to neuron differentiation and neuron death.
PerturbVI provides insights into regulatory networks in the neuron differentiation process
To address the original study's research question, we further investigated how the upstream ASD-related genes influence the neuron differentiation network by highlighting downstream genes present in neuron differentiation-related pathways (Figure 3.3). First, we found that knockdown of ADNP, ASH1L, and CHD2 delayed the neuron differentiation process by downregulating mature neuron marker genes (NEFL, CRABP2, PCP4, MAP2, etc.) and upregulating genes that negatively influence neuron differentiation and development. Conversely, repressing PTEN accelerated the differentiation process by upregulating the mature neuron marker genes and downregulating those associated with negative regulation. These findings are consistent with Lalli's study and the GSFA results [69, 71], indicating that PerturbVI recovers these results with greatly improved efficiency.
Moreover, the target gene POGZ, previously excluded from DEG analysis in Lalli's study due to limited sample size and also missed by GSFA, was identified by PerturbVI as another crucial gene in regulating neuron differentiation. Specifically, repression of POGZ, SETD5, and ARID1B accelerated neuron differentiation not by directly upregulating the mature neuron markers but by suppressing the expression of genes that negatively regulate neuron projection and differentiation. Finally, PerturbVI also provides new insights into the downstream genes involved in neuron differentiation. Specifically, we found that AGT, BSG, and APLP2 show activity patterns similar to the gene sets that negatively regulate neuron development, suggesting a potential role in this regulatory process. Conversely, ALCAM, GPM6A, and STMN1 display patterns similar to mature neuron markers, indicating they might be involved in the positive regulation of neuron differentiation and development.
Figure 3.3: PerturbVI identifies the regulatory networks within the neuron differentiation
process (partial downstream genes).
From the overall effect matrix, we extracted the downstream genes from the neuron differentiation and development pathways and further subset the genes for interpretation. See the heatmap of the full list of downstream genes in Appendix B, Figure 5.30.
3.4.3 Perturb-Seq application
We next applied PerturbVI to the essential-wide Perturb-seq data generated from 310,385 CML K562 cells targeting 2057 common essential genes and measuring the expression of 8563 downstream genes (Replogle et al., 2022). To preserve interpretability in such a large-scale dataset, we restricted the maximum number of genes associated with each component to 500 (L = 500) and ran the model with K ranging from 8 to 15. We found that the PVE keeps increasing from K = 8 to K = 13, but when K > 10, the correlation among components increases, indicating that redundant information is captured with an over-specified K.
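The two K-selection diagnostics used above can be computed directly from the fitted factors. A minimal sketch, assuming `X` is the centered expression matrix and `Z`, `W` are the posterior-mean factor scores and loadings (names are illustrative):

```python
import numpy as np

def pve(X, Z, W):
    """Proportion of variance in X explained by the reconstruction Z @ W."""
    resid = X - Z @ W
    return 1.0 - resid.var() / X.var()

def max_offdiag_corr(Z):
    """Largest absolute off-diagonal correlation among latent components;
    values near 1 suggest redundant (over-specified) components."""
    C = np.corrcoef(Z, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return np.abs(C).max()
```

In practice one would sweep K, record both quantities, and stop increasing K once the PVE gain is offset by rising inter-component correlation.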
PerturbVI identifies numerous ribosome-related pathways in cellular processes
Overall, we found that 1742 of the 2057 perturbations are significantly associated with at least one downstream gene (LFSR < 0.05). By examining the sum of squared overall effects of each perturbation across all 8563 downstream genes, we found that the top perturbations are most enriched for pathways involved in the regulation of transcription and translation, such as regulation of transcription from the RNA polymerase II promoter, cytoplasmic translation, and ribosome biogenesis, indicating that the top perturbations mostly impact gene expression regulation and protein synthesis in CML K562 cells (Figure 3.4 A). Among them, we identified GATA1 as the most significant perturbation.
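The perturbation ranking described above can be sketched as follows, assuming `effects` holds the overall effect matrix with perturbations in rows and `target_names` their labels (both names are illustrative):

```python
import numpy as np

def top_perturbations(effects, target_names, k=30):
    """Rank perturbations by the sum of squared overall effects across
    all downstream genes and return the top-k labels."""
    scores = (effects ** 2).sum(axis=1)
    order = np.argsort(scores)[::-1][:k]
    return [target_names[i] for i in order]
```

The top-ranked labels are then passed to GSEA for annotation.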
We then examined the perturbation effect matrix B and found that, on average, 65.8% (std = 0.10) of perturbations in each component are drawn from the "slab" distribution (η > 0.05), representing a non-zero effect. In the downstream loading matrix W, we found that each component contains exactly 500 downstream genes with PIP > 0.9, implying that L is potentially under-specified in the context of the large sample size. We next performed gene set enrichment analysis on the significant downstream genes with PIP > 0.9 from each component and found that each component encompasses numerous cellular behavior pathways with FDR < 0.05; the most significant pathways relate to fundamental cellular processes such as cytoplasmic translation (FDR = 1.81e−89), cell cycle (FDR = 7.41e−86), and ribosome biogenesis (FDR = 7.11e−44) (Figure 3.4 A). Although the significant pathways can be shared by multiple components, we observed that factor 4 is highly enriched for cell cycle-related pathways, factors 1 and 5 are mainly enriched for ribosome biogenesis pathways, and factor 6 is enriched for mitochondrial function-related pathways.
Figure 3.4: PerturbVI results of perturbation effects from essential-wide Perturb-seq data
(A) From the PerturbVI overall effect matrix, we computed the sum of squared effects for each perturbation, then extracted the top 30 perturbations and annotated them through GSEA. (B) The downstream genes with PIP > 0.9 were input to GSEA. We present three latent components that capture different cellular processes, including ribosome, cell cycle, and mitochondrial functions.
PerturbVI reveals regulatory networks in cell cycle and ribosome biogenesis
To further summarize the regulatory networks related to essential cellular processes, we present findings for the cell cycle and ribosome biogenesis. First, we extracted the top 100 perturbations from factor 4 and filtered those closely related to the cell cycle through enrichment analysis. Then, we extracted the downstream genes involved in the cell cycle pathway and composed a subset of the overall effect matrix (370 by 40). For illustration, we selected the perturbation-gene pairs with large effect sizes (Figure 3.5 A). We found that inhibition of PHB2, GSPT1, and CHMP3 results in decreased expression of downstream genes involved in the cell cycle, cell division, and chromosome organization, and upregulation of genes involved in the negative regulation of cell cycle transition, implying a delay of the cell cycle process. This is consistent with previous studies; for example, overexpressed PHB2 usually enhances tumor cell proliferation in many types of cancer [75]. In our study, we elucidate this process: by knocking out PHB2, downstream genes such as RANBP1, YBX1, and NPM1 that foster the cell cycle are largely downregulated, while genes such as ATF5 and RPL26 that inhibit the cell cycle are highly upregulated.
Next, we demonstrate the ribosome biogenesis network in the same manner (Figure 3.5 B). First, we noticed that the target genes SNRPF, SNRPD1, SNRPD2, SNRPD3, and SNRPE act in exactly the opposite direction to the RPS family genes. They are part of the small nuclear ribonucleoprotein (snRNP) complex, which plays a crucial role in pre-mRNA splicing. This process is essential for removing introns from pre-mRNA, thereby facilitating the production of mature messenger RNA that can be translated into proteins [76]. PerturbVI reveals that repressing the snRNP complex disrupts rRNA processing but upregulates the eukaryotic initiation factors (eIFs) and the RPL genes involved in ribosome assembly. One explanation is that impaired splicing due to transcriptional repression can alter mRNA splicing patterns, which may trigger compensatory mechanisms involving eIFs and RPL genes [77]. In contrast, the knockdown of UTP, IMP4, DDX47, and genes from the RPS family demonstrates the opposite effect: it upregulates rRNA processing but downregulates the eIF and RPL genes from ribosome assembly.
Figure 3.5: PerturbVI reveals regulatory networks in cell cycle and ribosome biogenesis
(A) PerturbVI effects in cell cycle regulatory modules. Target genes (columns) are selected based on the effect size of perturbations in factor 4 from the upstream effect matrix B; downstream genes are selected from the gene sets of the cell cycle pathways. (B) PerturbVI effects in the ribosome biogenesis pathway. A similar processing step was used to select upstream and downstream genes based on latent factor 1.
3.5 Discussion
In this study, we introduced PerturbVI, a Bayesian latent factor model informed by experimental design, aimed at inferring perturbation networks from large-scale CRISPR screening data. By incorporating the guide matrix into ultra-sparse PCA (SuSiE PCA) and leveraging variational inference for scalability, PerturbVI provides an efficient computational tool for analyzing Perturb-seq data. Our findings demonstrate that PerturbVI not only outperforms existing methods in computational efficiency but also provides more interpretable results on gene regulation in the context of CRISPR perturbations. The application of PerturbVI to real-world datasets demonstrated its ability to identify and interpret regulatory modules. Specifically, the analysis of CROP-seq data targeting ASD-relevant genes in human LUHMES cells revealed crucial insights into neuron differentiation pathways. Furthermore, by applying PerturbVI to essential-wide Perturb-seq data, we identified key regulatory networks involved in essential cellular processes, such as cell cycle regulation and ribosome biogenesis.
While PerturbVI significantly improves the efficiency and accuracy of large-scale CRISPR perturbation analysis, there are several limitations to consider. First, PerturbVI leverages variational inference to approximate the posterior distribution; although computationally efficient, this approximation cannot directly provide a good estimate of the overall perturbation effect (i.e., BW). We currently approximate the posterior distribution of the overall effect matrix by sampling over the posteriors of B and W, which can be computationally inefficient when the feature dimension P is high. Moreover, in large-scale data where the sample size, i.e., the number of cells, is sufficiently large, the LFSR computed from the sampled posterior may not provide interpretable results due to potential inflation. Second, PerturbVI assumes that a perturbation affects downstream genes through the regulatory module via the latent components. However, some perturbed genes have a hierarchical regulation architecture, as discussed for the CROP-seq data by Lalli et al. [71], which can hardly be captured by PerturbVI under the current model setting. Third, PerturbVI does not consider cell type information, and many studies have shown that perturbations can act differentially across cell types. Finally, in this work, we show that PerturbVI works well when only one target is knocked out in a single cell, which might not always be the case in an experiment, depending on its goals and costs. Even under multi-perturbation conditions, PerturbVI can accommodate multiple indices in the design matrix, although a denser design matrix will also increase the computational effort.
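The sampling-based approximation of the overall effect and its LFSR described above can be sketched as follows. This toy version assumes independent Gaussian variational posteriors on the entries of B and W; the actual PerturbVI posteriors carry spike-and-slab and single-effect structure, so this is an illustration of the Monte Carlo step, not the exact model:

```python
import numpy as np

def lfsr_from_samples(B_mean, B_sd, W_mean, W_sd, n_draws=1000, seed=0):
    """Monte Carlo LFSR for the overall effect matrix BW: the posterior
    probability of the less-likely sign for each perturbation-gene entry."""
    rng = np.random.default_rng(seed)
    G, P = B_mean.shape[0], W_mean.shape[1]
    pos = np.zeros((G, P))
    for _ in range(n_draws):
        B = rng.normal(B_mean, B_sd)   # draw from the posterior of B
        W = rng.normal(W_mean, W_sd)   # draw from the posterior of W
        pos += (B @ W) > 0             # tally positive-sign events
    pos /= n_draws
    return np.minimum(pos, 1.0 - pos)  # mass on the less-likely sign
```

Each draw costs a G × K × P matrix product, which is why this step becomes expensive when P is large.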
Several directions could extend PerturbVI in future work. First, we can incorporate cell type information into the prior distributions of the model or include interaction terms in the regression model, allowing perturbations to have different effects on different cell types. Second, further improving the algorithm to accommodate multi-perturbation experiments could provide even deeper insights into the complexity of gene regulatory networks.
In conclusion, PerturbVI represents a powerful new tool in the genomic toolbox, offering unprecedented speed and precision in analyzing CRISPR perturbation data. By enabling a more detailed exploration of genetic regulatory modules, this work opens new pathways for understanding
the fundamental mechanisms of biology and disease.
Chapter 4
Dual Variational Autoencoders (Dual-VAE)
4.1 Introduction
Genome-wide association studies (GWAS) have identified hundreds of thousands of novel variants associated with complex human diseases, yet the majority of these variants are located in non-coding regions, leaving the biological mechanisms by which they drive disease largely unknown [1, 4]. There is substantial evidence that many of these non-coding variants are significantly associated with gene regulatory elements and affect gene expression levels, acting as expression quantitative trait loci (eQTLs) [9].
Traditional methods for inferring eQTLs generally rely on genome-wide scans for specific genes, leading to extensive statistical testing that can limit power [10]. While robust cis-regulatory effects are frequently identified with relative ease, trans-regulatory effects pose a greater challenge due to their weaker effects on distal genes and the substantial multiple testing burden they create, resulting in significantly decreased detection power. To address this power issue, we focus on exploring the association between GWAS risk variants and downstream gene activities beyond local genes from a broader network perspective. A natural starting point is canonical correlation analysis (CCA), which is designed to explore the linear relationship between two data sets [30, 33]. CCA seeks a pair of linear combinations of features, called canonical components, from both observed data sets that are maximally correlated, via singular value decomposition of the cross-covariance matrix (see Section 1.4). The coefficients of each canonical component then explain the correlations between the sets of variables from the two data sets, with larger absolute coefficients representing stronger correlations.
CCA, however, assumes a purely linear relationship between eQTLs and gene expression for simplicity and interpretability, whereas numerous studies have demonstrated that the regulatory mechanisms among genetic factors are inherently non-linear, driven by factors such as cooperative binding of transcription factors, feedback loops, and threshold effects [78–80]. These non-linear interactions enable diverse gene expression patterns, crucial for cellular functions like differentiation, environmental adaptation, and homeostasis maintenance [15]. In previous chapters, we primarily focused on developing linear latent factor models for dimension reduction and for capturing regulatory mechanisms underlying gene expression data, but the linearity assumption in these models limits their ability to capture the complexity of gene regulatory networks.
Recently, variational autoencoders (VAEs), unsupervised generative models that learn the data-generating process, have gained traction and been widely applied in biological research. VAEs have proven effective in revealing intricate data structures, making them suitable for applications such as dimension reduction, image recognition, and image denoising [81]. Notably, VAEs have successfully captured non-linearities within biological pathways. For instance, Way and Greene (2018) utilized VAEs to analyze The Cancer Genome Atlas (TCGA) and identified biologically relevant features in cancer gene expression data [82]. Daniel et al. (2021) applied VAEs to metabolomics data to identify distinct cellular processes using feature importance scores [83], and Zhang et al. (2019) developed the OmiVAE model, which integrates a VAE with a classifier for multi-class classification in multi-omics data [84].
In this work, instead of using a purely linear factor model such as CCA or sparse CCA [85], we leverage non-linear representations of regulatory networks and integrate them with linear models of known genetic risk variants. However, the conventional VAE only accepts a single data view, while this study involves two data views: gene expression (or metabolomics) data and genotype data. Wu et al. (2018) proposed a multimodal VAE (MVAE) that uses a product-of-experts inference network to address the multi-modal problem [86]. Building on the MVAE framework, we aim to develop a Dual-VAE that accommodates two data views by integrating a linear network for genotype data and a non-linear network for gene expression data.
4.2 Methods
Suppose XGE ∈ R^(n×m1) and XGT ∈ R^(n×m2) represent the gene expression data and genotype data, respectively, measured on the same set of subjects. The encoder function for the genotype data is a linear model that outputs a pair of random vectors (µgt, σ²gt) specific to the genotype data. The encoder function for the gene expression data is a p-layer deep neural network, which outputs another pair of random vectors (µge, σ²ge) corresponding to the gene expression data.
These two pairs of random vectors are then passed into the product of experts (PoE) function, which synthesizes the mean and variance terms from each view and outputs a pair of shared random vectors (µ, σ²) representing the mean and variance parameters of the shared latent space z. Finally, the decoder, which reverses the encoder, uses samples from the latent space to generate the predicted genotype data through a linear model and the predicted gene expression data through the deep neural network. The graphical representation of the Dual-VAE is shown in Figure 4.1.
Figure 4.1: Dual Variational Autoencoders for Gene Expression and Genotype Data
4.2.1 Product of Experts
In this section, we summarize the product of experts (PoE) and its application to the MVAE by Wu et al. (2018) [86]. As shown in Figure 4.1, the PoE is essentially a function that synthesizes the vectors from multiple data modalities into one shared vector. To approximate the true joint posterior distribution P(Z | X1, · · · , XN), consider the conditional independence assumption made for the decoder network: X1, · · · , XN are independent given Z. Then we have:
P(Z | X1, · · · , XN) = [P(Z) / P(X1, · · · , XN)] ∏_{i=1}^{N} P(Xi | Z)
= [P(Z) / P(X1, · · · , XN)] ∏_{i=1}^{N} [P(Z | Xi) P(Xi) / P(Z)]
= [∏_{i=1}^{N} P(Z | Xi) / ∏_{i=1}^{N−1} P(Z)] · [∏_{i=1}^{N} P(Xi) / P(X1, · · · , XN)]
∝ ∏_{i=1}^{N} P(Z | Xi) / ∏_{i=1}^{N−1} P(Z) (4.1)
The last line implies that the true posterior distribution of Z is proportional to the product of the individual posteriors divided by N − 1 copies of the prior. Suppose each individual posterior is approximated by a variational distribution Q(Z | Xi); then the joint variational distribution has the following form:
Q(Z | X1, · · · , XN) ∝ ∏_{i=1}^{N} Q(Z | Xi) / ∏_{i=1}^{N−1} P(Z) (4.2)
In this case, the variational distribution is a product and quotient of experts. To simplify further, we assume Q(Z | Xi) = Q̃(Z | Xi) P(Z), where Q̃(Z | Xi) is the inference network in the encoder; then the quotient term cancels:
P(Z | X1, · · · , XN) ∝ ∏_{i=1}^{N} P(Z | Xi) / ∏_{i=1}^{N−1} P(Z) (4.3)
≈ ∏_{i=1}^{N} Q̃(Z | Xi) P(Z) / ∏_{i=1}^{N−1} P(Z) (4.4)
= P(Z) ∏_{i=1}^{N} Q̃(Z | Xi) (4.5)
Here, we use a product of experts together with a prior expert to approximate the joint posterior distribution. Generally, the exact distribution in Equation 4.5 cannot be derived in closed form. However, when both the prior and the approximate posteriors Q̃(Z | Xi) are Gaussian distributions, we can show that the product of Gaussian experts remains Gaussian, with mean µ and covariance V given by:
µ = (∑_{i} µi Ti)(∑_{i} Ti)^{−1} (4.6)

V = (∑_{i} Ti)^{−1} (4.7)

where µi and Vi are the parameters of the i-th expert, and Ti = Vi^{−1} is the precision (inverse covariance) matrix of the i-th expert.
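For diagonal Gaussians, Equations 4.6–4.7 reduce to precision-weighted averaging. A minimal sketch:

```python
import numpy as np

def product_of_gaussian_experts(mus, variances):
    """Combine Gaussian experts with diagonal covariance (Eqs. 4.6-4.7).
    mus, variances: arrays of shape (n_experts, d); T_i = V_i^{-1}."""
    T = 1.0 / np.asarray(variances)               # per-expert precisions
    var = 1.0 / T.sum(axis=0)                     # V = (sum_i T_i)^{-1}
    mu = (np.asarray(mus) * T).sum(axis=0) * var  # mu = (sum_i mu_i T_i) V
    return mu, var

# Two experts with equal variance: the PoE mean is their average,
# and the PoE variance is halved.
mu, var = product_of_gaussian_experts([[0.0], [2.0]], [[1.0], [1.0]])
print(mu, var)  # [1.] [0.5]
```

A more confident expert (smaller variance) pulls the shared mean toward its own, which is the behavior the PoE inference network relies on.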
4.2.2 Simulations
To validate the performance of Dual-VAE in capturing the shared latent space and accurately modeling the data-generating process between correlated data, we conducted simulations involving two data sets of different dimensions, both generated from the same latent space. We compared the results with sparse canonical correlation analysis (CCA) [85]. In the simulation setup, we assumed a sample size of N = 600, a latent space dimension of K = 4, a genotype feature dimension of p1 = 200, and a gene expression feature dimension of p2 = 2000; we then constructed Z, W^{(1)}, and W^{(2)} as:
Z ∼ N(0, 1) (4.8)

w^{(1)}_{k,i} ∼ N(0, 1), i = 50(k − 1) + 1, · · · , 50k, k = 1, · · · , 4 (4.9)

w^{(2)}_{k,j} ∼ N(0, 1), j = 200(k − 1) + 1, · · · , 200(k − 1) + 100, k = 1, · · · , 4 (4.10)

Each column vector in W^{(1)} and W^{(2)} contains a fraction of non-zero effects, i.e., 50 in the genotype loadings (25%) and 100 (5%) in the gene expression loadings, sampled from the standard normal distribution as shown in Equations 4.9–4.10. We then generated the simulation data X^{(1)} and X^{(2)} by:

X^{(m)} = Z W^{(m)} + Σ^{(m)}, m = 1, 2 (4.11)
where Σ^{(m)} is the error term drawn from the standard normal distribution. We then input X^{(1)} and X^{(2)} into Dual-VAE with K = 4 and all networks set to linear networks. For comparison, we applied sparse CCA [85] with the penalty coefficients selected by grid search to minimize the test error.
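The generative scheme in Equations 4.8–4.11 can be sketched as follows; the block layout of the non-zero loadings is an assumption matching the stated sparsity (50 non-zero genotype loadings and 100 non-zero gene expression loadings per factor):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, p1, p2 = 600, 4, 200, 2000

Z = rng.normal(size=(N, K))             # shared latent factors (Eq. 4.8)
W1 = np.zeros((K, p1))                  # genotype loadings (Eq. 4.9)
W2 = np.zeros((K, p2))                  # gene expression loadings (Eq. 4.10)
for k in range(K):
    W1[k, 50 * k: 50 * (k + 1)] = rng.normal(size=50)
    W2[k, 200 * k: 200 * k + 100] = rng.normal(size=100)

X1 = Z @ W1 + rng.normal(size=(N, p1))  # genotype view (Eq. 4.11)
X2 = Z @ W2 + rng.normal(size=(N, p2))  # gene expression view (Eq. 4.11)
```

Both views share the same Z, so a model that recovers the shared latent space should recover the block structure of W1 and W2.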
To quantify the performance of each model, we calculated the Procrustes error [55] (Equation
2.24) for the decoder weights W(1) and W(2), as well as for the latent factors Z from both data
views.
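The Procrustes error aligns an estimate to the truth by the best orthogonal rotation before measuring the residual. A sketch of the computation using the standard SVD solution (not necessarily the exact form of Equation 2.24):

```python
import numpy as np

def procrustes_error(W_hat, W_true):
    """Frobenius-norm error after optimally rotating W_hat (K x P) onto
    W_true with an orthogonal K x K matrix, solved via SVD."""
    U, _, Vt = np.linalg.svd(W_hat @ W_true.T)
    R = Vt.T @ U.T   # orthogonal rotation minimizing ||R W_hat - W_true||_F
    return np.linalg.norm(R @ W_hat - W_true)
```

The rotation step matters because latent factor models are identified only up to an orthogonal transform of the factors and loadings.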
4.2.3 Metabolomics data from prostate cancer patients
We applied the Dual-VAE model to metabolomics and genotype data derived from serum samples
of 691 men of African ancestry from the Multiethnic Cohort (MEC) [87]. This cohort was established between 1993 and 1996 to investigate cancer and chronic disease risks among individuals
in Hawaii and Los Angeles. The dataset included 342 prostate cancer cases and 349 controls.
The metabolomics data were generated through untargeted serum metabolomic profiling using Ultrahigh Performance Liquid Chromatography-Tandem Mass Spectrometry (UPLC-MS/MS) performed at Metabolon Inc. This analysis identified 1433 metabolites, and extensive quality control resulted in 1098 metabolites included in the final dataset. Metabolites were categorized into various classes, including lipids (342), amino acids (199), xenobiotics (139), and others, ensuring comprehensive coverage of the metabolome. The genotype data consisted of genome-wide genotypes imputed using Minimac4 on the Michigan Imputation Server, referencing Phase 3 of the 1000 Genomes Project. Following quality control, 248 prostate cancer risk variants were identified based on a large multi-ancestry genome-wide association meta-analysis.
In this study, we aimed to identify associations between specific metabolites and prostate cancer risk SNPs. To achieve this, we first regressed out 14 principal components (PCs) of the metabolomics data to exclude batch effects and sample collection variations. For the genotype data, we regressed out the top 10 PCs to adjust for population stratification. The residuals from these regressions were then input into the Dual-VAE model with K = 10. The encoder for the metabolomics data is a two-layer neural network with 100 nodes per hidden layer and the Gaussian Error Linear Unit (GeLU) activation function. The goal was to uncover latent components that link specific sets of metabolites to prostate cancer risk SNPs, providing insights into the biological mechanisms underlying prostate cancer in this African-ancestry population.
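The metabolomics encoder described above can be sketched as a forward pass. The parameter names in `params` are illustrative, not part of any library API; the exact GeLU uses the Gaussian CDF:

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(x):
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def encoder_forward(x, params):
    """Two hidden GeLU layers, then linear heads for the mean and
    log-variance of the approximate posterior over the latent space."""
    h = gelu(x @ params["W1"] + params["b1"])
    h = gelu(h @ params["W2"] + params["b2"])
    mu = h @ params["W_mu"] + params["b_mu"]
    log_var = h @ params["W_lv"] + params["b_lv"]
    return mu, log_var
```

In the actual model these parameters are trained jointly with the decoder by maximizing the evidence lower bound; this sketch only shows the shape of the computation.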
To validate our results, we performed a permutation test. We permuted the rows of each dataset
to break the associations between the metabolomics and SNP data, then applied the model with
the same parameter settings and computed the p-value for the decoder weights. Metabolites and
SNPs with decoder weights showing a p-value of less than 0.05 from each latent component were
considered significantly associated with that component. For comparison, we also ran sparse CCA
with K = 10, tuning sparsity hyperparameters through a permutation test as implemented in the R
package PMA [88].
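The permutation p-value for a single decoder weight can be sketched as follows, with the standard +1 correction so the p-value is never exactly zero (variable names are illustrative):

```python
import numpy as np

def permutation_pvalue(weight_obs, null_weights):
    """Two-sided permutation p-value: the fraction of null weights (from
    models refit on row-permuted data) at least as extreme as the observed."""
    null_weights = np.asarray(null_weights)
    n_extreme = np.sum(np.abs(null_weights) >= abs(weight_obs))
    return (n_extreme + 1) / (null_weights.size + 1)
```

Row permutation breaks the subject-level link between the two views, so the null weights reflect what the model recovers from unrelated data.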
4.3 Results
4.3.1 Simulation Results
To benchmark the estimation accuracy of the Dual-VAE model, we computed the Procrustes error for WGE, WGT, and Z. Overall, Dual-VAE achieved lower Procrustes error in the estimated loading matrices from both data views than sparse CCA (Figure 4.2). Specifically, in the correctly specified model (i.e., K = 4), Dual-VAE exhibits a lower Procrustes error than sparse CCA across all sample sizes, with a greater discrepancy in the loadings at smaller sample sizes. Although larger sample sizes appeared to favor sparse CCA, allowing its Procrustes error to decrease rapidly, the error in Dual-VAE decreased only slightly. It is important to note that, as a deep-learning model, Dual-VAE requires additional effort to determine an appropriate learning rate and number of epochs; in our simulation, these parameters were not optimized across settings. Given that most real datasets have a sample size of less than 1000, we did not further investigate how the learning rate and training epochs should be adjusted for larger sample sizes. Finally, we observed that Dual-VAE also produces consistently more accurate estimates of the latent variable Z, although sample size does not affect the accuracy of Z for either Dual-VAE or sparse CCA.
We next investigated model performance when the latent dimension K is mis-specified (Figure 4.2 E-F). We found that both models exhibited increased Procrustes error when K was under-specified (K = 3), as they missed information from one latent component. However, Dual-VAE proved more robust when K was over-specified (K = 5), maintaining high accuracy relative to the correctly specified model (K = 4), unlike sparse CCA, which showed a rapid increase in Procrustes error when over-specified. This pattern is similar to the simulation results from SuSiE PCA in Chapter 2 (Figure 2.2), suggesting that probabilistic latent factor models better capture the data-generating process and are more robust to over-specified models than conventional latent factor models.
Figure 4.2: Dual-VAE has better estimation accuracy and is more robust to mis-specified models than sparse CCA
The upper panel (A-C) shows the Procrustes error in the correctly specified setting, i.e., with the true number of latent variables K, as a function of sample size ranging from 500 to 2500. The lower panel (E-F) shows the Procrustes error when K is mis-specified, with the true latent dimension fixed at 4.
4.3.2 Application to Metabolomics Data from Prostate Cancer Patients
To investigate the downstream metabolite modules as a consequence of variation at prostate cancer
risk loci, we applied Dual-VAE with K = 10 to the metabolomics data from prostate cancer patients. We first performed a grid search for the tuning parameters that achieved the lowest test data
error, resulting in hidden layer dimensions of 100 for metabolites and 50 for SNPs, a learning rate
of 0.001, and 25,000 training epochs over 165,772 parameters in Dual-VAE. The model converged
in 1 minute and 50 seconds on an Nvidia A100 GPU with 16 GB of memory. The relative root mean
square error (RRMSE) for the reconstructed data was 0.646 for metabolomics and 0.981 for SNPs.
In comparison, the RRMSE for sparse CCA was 0.678 for metabolomics and 0.930 for SNPs. By
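The RRMSE values above compare reconstructions on the scale of the original data. One common definition, which we assume here since the excerpt does not give the formula explicitly, is the Frobenius norm of the reconstruction residual relative to the norm of the data:

```python
import numpy as np

def rrmse(X, X_hat):
    """Relative root mean square error: ||X - X_hat||_F / ||X||_F.
    A value near 1 (as observed for the SNP view) means the
    reconstruction explains little variance beyond predicting zero."""
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(165, 20))
assert rrmse(X, X) == 0.0                             # perfect reconstruction
assert abs(rrmse(X, np.zeros_like(X)) - 1.0) < 1e-12  # trivial reconstruction
```

Under this definition, the SNP RRMSE of roughly 0.93-0.98 for both models indicates that most of the shared signal is carried by the metabolite view.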
inspecting the decoder weights, we found that, on average, each component encompassed 125.2
(std = 42.6) significant metabolites and 17.5 (std = 3.34) SNPs. In contrast, sparse CCA produced
184.6 (std = 47.5) non-zero metabolites and 45.8 (std = 10.0) SNPs.
In Darst’s study, it was shown that the latent factor scores from sCCA of SNP data are linearly
correlated with polygenic risk scores (PRS), while the scores of metabolites data are not associated
with PRS [87]. However, we found that the latent space from Dual-VAE is mostly correlated with
the scores of metabolites from sCCA, with the associations categorized into three types: small
correlation, linear correlation, and non-linear relationship (Appendix Figure 5.32 and Figure 5.33).
Moreover, we did not identify any significant association between the Dual-VAE latent scores and
the SNP scores from sCCA.
Next, we performed an enrichment analysis over these significant metabolites to characterize the metabolomics module captured by each latent component (Figure 4.3). We observed that
all components had at least one significantly enriched subpathway (mean = 4.6, std = 2.1), with
Lipids being the most frequent superpathway (33 metabolites), followed by Xenobiotics (4 metabolites); the Amino Acid, Cofactors and Vitamins, and Peptide superpathways contained 3 enriched metabolites each. Component 6 was the most significant metabolite module, containing 8 enriched subpathways, with
Sphingomyelins being the most significantly enriched (FDR = 3.0 × 10⁻²⁷). Additionally, Sphingomyelins were identified as the most significant pathway in components 2 and 4, indicating a
strong association with prostate cancer risk loci. This result is consistent with the network analysis
by Darst et al. [87]. It has been shown that alterations in these lipid molecules, specifically Sphingomyelins and HCER, are associated with cancer cell proliferation and survival: elevated levels
of sphingomyelins are linked to increased cancer aggressiveness, while HCER is implicated in the
modulation of apoptosis and cellular signaling pathways crucial for tumor growth. Our findings
validate the potential of targeting lipid metabolism in therapeutic strategies for prostate cancer [89,
90]. Furthermore, components 3, 5, and 10 demonstrated fatty acid metabolites (acetylcholine) as
highly significant enriched pathways, which were also significant in the network analysis [87]. This
further supports the hypothesis that metabolic reprogramming, particularly in lipid metabolism,
plays a crucial role in prostate cancer progression and may provide viable targets for therapeutic
intervention.
Figure 4.3: Dual-VAE suggests that Sphingomyelins from the Lipid pathway are significantly
associated with prostate cancer risk loci.
Enrichment analysis was performed on metabolites with permutation p-values less than 0.05. A pathway is considered
significant if its FDR-adjusted p-value is less than 0.05.
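This style of enrichment analysis can be viewed as a one-sided hypergeometric (over-representation) test per subpathway, followed by FDR adjustment. The sketch below is written under that assumption; the specific tool and counts used in the dissertation are not given in this excerpt, and the numbers here are purely illustrative:

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k): probability of drawing at least k pathway members
    when N significant metabolites are drawn from M total,
    n of which belong to the pathway."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

def bh_fdr(pvals):
    """Benjamini-Hochberg step-up adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q, running = [0.0] * m, 1.0
    for pos, i in enumerate(reversed(order)):       # largest p first
        running = min(running, pvals[i] * m / (m - pos))
        q[i] = running
    return q

# Toy example: 400 metabolites, 100 significant hits; a 20-member
# subpathway containing 15 of those hits is strongly over-represented.
p_enriched = hypergeom_sf(15, 400, 20, 100)
p_null = hypergeom_sf(5, 400, 20, 100)
adj = bh_fdr([p_enriched, p_null, 0.2, 0.8])
print(p_enriched, adj[0])
```

The BH adjustment controls the expected proportion of false discoveries across the tested subpathways, matching the FDR < 0.05 significance criterion used for Figure 4.3.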
To identify the risk variants associated with these enriched metabolite modules, we inspected
the SNPs with permutation p-values less than 0.05 from each component. The SLC22A1 rs4646284
at 6q25.3 was the SNP with the largest effect size (effect = -0.116, p-value = 0.005) within component 6. The SNP rs11701433 was the top effect within component 2 (effect = -0.115, p = 0.005),
suggesting a strong association with the lipid pathway containing Sphingomyelins and Hexosylceramides (HCER). Additionally, we identified ARHGAP6 rs17321482 (effect = -0.015, p-value =
0.04) as a significant SNP from component 2, which is enriched in primary bile acids. This finding
is consistent with the PheWAS results from Darst’s study [87].
To compare with sCCA, we targeted the component from sCCA that was mostly enriched for
the lipid pathway and Sphingomyelins (Appendix Figure 5.34). By inspecting and comparing
the weights of SNPs from this component, we identified rs11701433 as the top SNP from sCCA
as well. Notably, SNPs with larger effects were consistent between the two models, although
several SNPs, such as rs11691517 and rs10122990, which showed large effects in Dual-VAE, were
regularized to zero in sCCA. Given that the p-values computed from permutation tests indicate they
are less likely to be false positives, we suspect these variants are potentially associated with the
lipid module and are worth further investigation.
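Permutation p-values of the kind referenced above can be obtained by re-scoring after shuffling samples in one view, which breaks any genotype-metabolite association while preserving each view's marginal structure. A generic sketch, with a simple least-squares slope standing in for the actual Dual-VAE decoder weight (the real procedure refits the model itself):

```python
import numpy as np

rng = np.random.default_rng(1)

def permutation_pvalue(stat_fn, x, y, n_perm=200):
    """Two-sided permutation p-value for stat_fn(x, y): permute the
    rows of x to form the null distribution, then count null values
    at least as extreme as the observed statistic (the +1 correction
    keeps p strictly positive)."""
    observed = abs(stat_fn(x, y))
    null = [abs(stat_fn(x[rng.permutation(len(x))], y))
            for _ in range(n_perm)]
    return (1 + sum(v >= observed for v in null)) / (n_perm + 1)

slope = lambda x, y: float(x @ y / (x @ x))  # stand-in "weight"
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)
p = permutation_pvalue(slope, x, y)
print(p)  # very small for a strong signal
```

Note that with n_perm permutations the smallest attainable p-value is 1/(n_perm + 1), so the number of permutations bounds the resolution of the test.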
Figure 4.4: rs11701433 presents as the top effect in both Dual-VAE and sparse CCA, while
several other variants are neglected by sCCA
Shown are the significant SNPs from Dual-VAE components 2, 4, and 6 that are enriched for Sphingomyelins, as
well as the non-zero SNPs from sCCA component 4, which is enriched for Sphingomyelins.
Overall, these findings underscore the effectiveness of the Dual-VAE model in identifying significant metabolic pathways and their genetic associations in prostate cancer research.
4.4 Discussion
In this study, we developed a Dual-variational autoencoder (Dual-VAE) to detect the regulatory
functions of known genetic risk variants in relation to downstream gene activities. For interpretability, we employed linear functions for the decoder networks while using non-linear neural networks
for the encoders to better approximate the distribution of latent variables.
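A framework-free numpy sketch of this encoder/decoder split, with hypothetical layer sizes chosen for illustration: a small non-linear encoder emits a mean and log-variance, a reparameterized sample gives the latent code, and a single linear map decodes it, so each factor's loadings remain directly readable from the decoder weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K, H = 8, 30, 4, 16  # samples, features, factors, hidden units

# Non-linear encoder parameters (one tanh hidden layer).
W1 = rng.normal(size=(P, H)) * 0.1
W_mu = rng.normal(size=(H, K)) * 0.1
W_lv = rng.normal(size=(H, K)) * 0.1
# Linear decoder: rows of W_dec are per-factor loadings.
W_dec = rng.normal(size=(K, P)) * 0.1

def encode(X):
    h = np.tanh(X @ W1)
    return h @ W_mu, h @ W_lv  # mean and log-variance of q(z | x)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps the sample differentiable in (mu, logvar)
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

X = rng.normal(size=(N, P))
mu, logvar = encode(X)
Z = reparameterize(mu, logvar)
X_hat = Z @ W_dec  # linear decoding: X_hat = Z W, interpretable like a factor model
assert X_hat.shape == (N, P)
```

The asymmetry is the point: the flexible encoder only shapes the approximate posterior, while the linear decoder constrains the generative side to a factor-model form.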
The model is built using JAX and Haiku modules in Python, enabling fast training on GPUs
even with millions of parameters. In simulations, we demonstrated that Dual-VAE has higher estimation accuracy in latent space and decoder weights than sCCA, indicating its superior ability
to learn the data-generating process with two data views. In metabolomics data analysis, we observed that non-linearity in the latent space emerged compared to the purely linear sCCA. Moreover, we found consistent enriched metabolomics modules associated with prostate cancer loci,
validating Dual-VAE’s ability to capture shared biological modules from two datasets. Additionally, Dual-VAE identified several risk variants that were missed by sCCA, likely benefiting from
the non-linearity within the model.
There are several limitations to the current Dual-VAE model. First, it lacks sparsity constraints
on the decoder weights. Although we integrated various types of sparsity-inducing priors into
the decoders, such as automatic relevance determination (ARD), the spike-and-slab prior, and the three-parameter prior, the simulation results did not show improvements in regularizing zeros. This issue
likely arises because the additional sparsity parameters require estimation, and the sample size is
too small to provide sufficient information during the optimization process.
Second, we observed that the RRMSE for SNP data is much higher than that for metabolomics
data, similar to sCCA. This suggests that SNP data contribute much less information than metabolite data.
The essential reason is that raw SNP data are categorical (0, 1, and 2) and exhibit much smaller
variation than expression or metabolomics data even after regressing out covariates. As a result,
parameters from expression data dominate the gradient updates. One possible solution is to assign
a higher weight to the encoder network for SNP data during inference, allowing the model to learn
more information from SNP data.
The third limitation is the linear decoder structure within Dual-VAE. Although the Linear Decoder VAE (LDVAE) has shown efficiency in controlling model complexity while retaining interpretability [91], it constrains the data-generating process to a linear model. However, once the
decoder networks are made non-linear, the relevant genes in a given factor must be evaluated via feature-importance techniques. Therefore, incorporating a feature selection
algorithm would better help Dual-VAE capture the non-linearity in GRNs.
Finally, the current Dual-VAE model only contains one shared latent space between two data
views. As demonstrated in the metabolite data analysis, these shared latent factors are mostly
related to metabolite scores from sCCA but show no association with SNP scores from sCCA.
This limitation could be addressed by incorporating a private space into the Dual-VAE (Appendix
Figure 5.31). While we have constructed such a model, we have not yet conducted extensive
simulations to validate it, since the limitations described above are more pressing to address
first.
Overall, Dual-VAE can capture complex non-linear relationships and provide a robust tool for
interpreting the association between GWAS risk variants and downstream gene activities, offering
new insights into the biological mechanisms underlying complex human traits.
Chapter 5
Conclusion
Although Genome-Wide Association Studies (GWAS) have identified numerous genetic variants
associated with human complex diseases, understanding the molecular mechanisms driving these
diseases remains a significant challenge. Multiple lines of evidence suggest that many of these
variants are located in non-coding regions and play regulatory roles in the complex gene regulatory
networks (GRNs). Therefore, it is crucial to understand how these regulatory networks are perturbed. A
currently popular approach is to infer perturbation effects through experimental design, namely Perturb-seq. However, analyzing Perturb-seq data is challenging due to its sheer scale.
To address these challenges, we first developed the Sum of Single Effects (SuSiE) Principal
Components Analysis (PCA) to constrain only L (L ≪ P) features associated with each gene
module (latent component). SuSiE PCA also computes the posterior inclusion probability (PIP) to
evaluate the strength of association of each gene to the coexpressed gene module. Additionally, we
leveraged variational methods for efficient and fast inference in SuSiE PCA. In simulations, we
demonstrated that SuSiE PCA is more accurate and robust than sparse PCA and Empirical Bayes
Matrix Factorization (EBMF). In the Perturb-seq application, SuSiE PCA captured crucial cellular
processes consistent with previous findings.
To provide a more accurate estimate of the effects of upstream (perturbed) genes, we developed PerturbVI by incorporating experimental information into the latent components from SuSiE PCA, assuming that the upstream gene effects are mediated through the gene
modules. The ultra-sparse model structure and efficient variational methods implemented in PerturbVI distinguish it from current methods. In simulations, PerturbVI proved to be more accurate,
efficient, and robust in estimating upstream and downstream effects than GSFA. Analyzing CROP-seq data with PerturbVI identified the new target gene POGZ and revealed the potential regulatory
functions of several other upstream and downstream genes. In the application to Perturb-seq data,
PerturbVI identified multiple target genes, such as GATA1, which have a large effect on downstream pathways related to cellular processes; it also revealed multiple regulatory modules, including those involved in the cell cycle, ribosome biogenesis, and mitochondrial functions.
Finally, we addressed non-linearity in gene regulatory networks (GRNs) through the deeplearning model Dual Variational Autoencoders (Dual-VAE). Dual-VAE aims to associate a set of
known GWAS risk variants with downstream genes to infer how these variants regulate networks
on a broader scale. Unlike pure linear models such as Canonical Correlation Analysis (CCA),
Dual-VAE uses non-linear neural networks for gene expression data and integrates these with linear models of SNPs. By learning the data-generating process, Dual-VAE captures the shared latent
space containing the association between genetic variants and downstream genes. In simulations,
Dual-VAE outperformed CCA in estimation accuracy. Applying Dual-VAE to metabolomics data
from prostate cancer patients, it captured enriched metabolite pathways, including Sphingomyelins
from the Lipids super pathway, which is consistent with CCA and previous findings, while also
identifying several new genetic variants, such as rs11691517 and rs10122990, potentially associated with lipid pathways related to prostate cancer.
The SuSiE PCA project has been published in iScience (2023) [70]; PerturbVI was selected for
presentation at The Biology of Genomes (2024) and will be submitted for journal publication this
summer. In the future, we will continue to address the remaining limitations of the Dual-VAE project.
In conclusion, SuSiE PCA and PerturbVI are scalable latent factor approaches for inferring
gene modules and perturbation effects through Perturb-seq data. Dual-VAE offers a novel method
to infer how genetic variants regulate downstream networks. Together, these methods provide
significant insights into understanding the complex regulatory mechanisms potentially driving diseases.
References
1. Uffelmann, E., Huang, Q. Q., Munung, N. S., de Vries, J., Okada, Y., Martin, A. R., et al.
Genome-wide association studies. Nature Reviews Methods Primers 1. Publisher: Nature
Publishing Group, 1–21. doi:10.1038/s43586-021-00056-9 (2021).
2. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five Years of GWAS Discovery.
American Journal of Human Genetics 90, 7–24. doi:10.1016/j.ajhg.2011.11.029
(2012).
3. McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis,
J. P. A., et al. Genome-wide association studies for complex traits: consensus, uncertainty
and challenges. Nature Reviews Genetics 9. Publisher: Nature Publishing Group, 356–369.
doi:10.1038/nrg2344 (2008).
4. Tam, V., Patel, N., Turcotte, M., Bossé, Y., Paré, G. & Meyre, D. Benefits and limitations of
genome-wide association studies. Nature Reviews Genetics 20. Publisher: Nature Publishing
Group, 467–484. doi:10.1038/s41576-019-0127-1 (2019).
5. Pearson, T. A. & Manolio, T. A. How to interpret a genome-wide association study. JAMA
299, 1335–1344. doi:10.1001/jama.299.11.1335 (2008).
6. Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J.,
et al. Finding the missing heritability of complex diseases. Nature 461. Publisher: Nature
Publishing Group, 747–753. doi:10.1038/nature08494 (2009).
7. Claussnitzer, M., Cho, J. H., Collins, R., Cox, N. J., Dermitzakis, E. T., Hurles, M. E., et
al. A brief history of human disease genetics. Nature 577. Number: 7789 Publisher: Nature
Publishing Group, 179–189. doi:10.1038/s41586-019-1879-7 (2020).
8. Maurano, M. T., Humbert, R., Rynes, E., Thurman, R. E., Haugen, E., Wang, H., et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science
337. Number: 6099, 1190–1195. doi:10.1126/science.1222794 (2012).
9. Nica, A. C. & Dermitzakis, E. T. Expression quantitative trait loci: present and future. Philosophical Transactions of the Royal Society B: Biological Sciences 368. Number: 1620, 20120362.
doi:10.1098/rstb.2012.0362 (2013).
10. Albert, F. W. & Kruglyak, L. The role of regulatory variation in complex traits and disease. Nature Reviews Genetics 16. Number: 4 Publisher: Nature Publishing Group, 197–212.
doi:10.1038/nrg3891 (2015).
11. Schlitt, T. & Brazma, A. Current approaches to gene regulatory network modelling. BMC
Bioinformatics 8. Number: 6, S9. doi:10.1186/1471-2105-8-S6-S9 (2007).
12. De Jong, H. Modeling and simulation of genetic regulatory systems: a literature review. Journal of Computational Biology: A Journal of Computational Molecular Cell Biology 9, 67–
103. doi:10.1089/10665270252833208 (2002).
13. Davidson, E. & Levin, M. Gene regulatory networks. Proceedings of the National Academy
of Sciences of the United States of America 102. Number: 14, 4935. doi:10.1073/pnas.
0502024102 (2005).
14. Hwang, T. H., Atluri, G., Kuang, R., Kumar, V., Starr, T., Silverstein, K. A., et al. Large-scale
integrative network-based analysis identifies common pathways disrupted by copy number
alterations across cancers. BMC Genomics 14. Number: 1, 440. doi:10.1186/1471-2164-
14-440 (2013).
15. Karlebach, G. & Shamir, R. Modelling and analysis of gene regulatory networks. Nature
Reviews Molecular Cell Biology 9, 770–780. doi:10.1038/nrm2503 (2008).
16. Glass, L. & Kauffman, S. A. The logical analysis of continuous, non-linear biochemical
control networks. Journal of Theoretical Biology 39. Number: 1, 103–129. doi:10.1016/
0022-5193(73)90208-7 (1973).
17. Sauer, U., Hatzimanikatis, V., Hohmann, H. P., Manneberg, M., van Loon, A. P. & Bailey,
J. E. Physiology and metabolic fluxes of wild-type and riboflavin-producing Bacillus subtilis.
Applied and Environmental Microbiology 62. Number: 10 Publisher: American Society for
Microbiology, 3687–3696. doi:10.1128/aem.62.10.3687-3696.1996 (1996).
18. Yeung, M. K. S., Tegnér, J. & Collins, J. J. Reverse engineering gene networks using singular
value decomposition and robust regression. Proceedings of the National Academy of Sciences
99. Number: 9 Publisher: Proceedings of the National Academy of Sciences, 6163–6168.
doi:10.1073/pnas.092576199 (2002).
19. McAdams, H. H. & Arkin, A. Stochastic mechanisms in gene expression. Proceedings of
the National Academy of Sciences 94. Number: 3 Publisher: Proceedings of the National
Academy of Sciences, 814–819. doi:10.1073/pnas.94.3.814 (1997).
20. Horlbeck, M. A., Xu, A., Wang, M., Bennett, N. K., Park, C. Y., Bogdanoff, D., et al. Mapping the Genetic Landscape of Human Cells. Cell 174, 953–967.e22. doi:10.1016/j.cell.
2018.06.010 (2018).
21. Replogle, J. M., Saunders, R. A., Pogson, A. N., Hussmann, J. A., Lenail, A., Guna, A., et al.
Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq.
Cell 185, 2559–2575.e28. doi:10.1016/j.cell.2022.05.013 (2022).
22. Pacesa, M., Pelea, O. & Jinek, M. Past, present, and future of CRISPR genome editing technologies. Cell 187, 1076–1100. doi:10.1016/j.cell.2024.01.042 (2024).
23. Doudna, J. A. & Charpentier, E. Genome editing. The new frontier of genome engineering with CRISPR-Cas9. Science (New York, N.Y.) 346, 1258096. doi:10.1126/science.
1258096 (2014).
24. Klein, J. C., Agarwal, V., Inoue, F., Keith, A., Martin, B., Kircher, M., et al. A systematic evaluation of the design and context dependencies of massively parallel reporter assays.
Nature Methods 17. Number: 11 Publisher: Nature Publishing Group, 1083–1091. doi:10.
1038/s41592-020-0965-y (2020).
25. Dixit, A., Parnas, O., Li, B., Chen, J., Fulco, C. P., Jerby-Arnon, L., et al. Perturb-Seq:
Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic
Screens. Cell 167, 1853–1866.e17. doi:10.1016/j.cell.2016.11.038 (2016).
26. Rubin, A. J., Parker, K. R., Satpathy, A. T., Qi, Y., Wu, B., Ong, A. J., et al. Coupled SingleCell CRISPR Screening and Epigenomic Profiling Reveals Causal Gene Regulatory Networks. Cell 176, 361–376.e17. doi:10.1016/j.cell.2018.11.022 (2019).
27. West, M. Bayesian Factor Regression Models in the ”Large p, Small n” Paradigm in Bayesian
Statistics (Oxford University Press, 2003), 723–732.
28. Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24, 417–441. doi:10.1037/h0071325 (1933).
29. Bartholomew, D. J. Latent Variable Models and Factor Analysis OCLC: 464145923 (Charles
Griffin : Oxford University Press, London; New York, 1987).
30. Hotelling, H. & Pabst, M. R. Rank Correlation and Tests of Significance Involving No Assumption of Normality. The Annals of Mathematical Statistics 7. Publisher: Institute of Mathematical Statistics, 29–43 (1936).
31. Jolliffe, I. T. Principal Component Analysis (Springer-Verlag, New York, 1986).
32. Tipping, M. E. & Bishop, C. M. Probabilistic Principal Component Analysis. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 61, 611–622. doi:10.1111/1467-9868.00196 (1999).
33. Bach, F. R. & Jordan, M. I. A probabilistic interpretation of canonical correlation analysis
(2005).
34. Andrieu, C., de Freitas, N., Doucet, A. & Jordan, M. I. An Introduction to MCMC for Machine Learning. Machine Learning 50, 5–43. doi:10.1023/A:1020281327116 (2003).
35. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. & Saul, L. K. An Introduction to Variational Methods for Graphical Models. Machine Learning 37, 183–233. doi:10.1023/A:1007665907178 (1999).
36. Kullback, S. & Leibler, R. A. On Information and Sufficiency. The Annals of Mathematical
Statistics 22, 79–86 (1951).
37. Tanaka, T. A Theory of Mean Field Approximation in Advances in Neural Information Processing Systems (MIT Press, 1998).
38. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable
selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82, 1273–1300. doi:10.1111/rssb.12388
(2020).
39. Mitchell, T. J. & Beauchamp, J. J. Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association 83. Number: 404, 1023–1032. doi:10 . 1080 /
01621459.1988.10478694 (1988).
40. Kingma, D. P. & Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat]
(2014).
41. Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic Backpropagation and Approximate
Inference in Deep Generative Models. arXiv:1401.4082 [cs, stat] (2014).
42. Kingma, D. P. & Welling, M. An Introduction to Variational Autoencoders. Foundations and
Trends® in Machine Learning 12. Number: 4, 307–392. doi:10.1561/2200000056 (2019).
43. Jordan, J. Variational autoencoders. https://www.jeremyjordan.me/variational-autoencoders/.
44. Patterson, N., Price, A. L. & Reich, D. Population Structure and Eigenanalysis. PLoS Genetics 2, e190. doi:10.1371/journal.pgen.0020190 (2006).
45. Agrawal, A., Chiu, A. M., Le, M., Halperin, E. & Sankararaman, S. Scalable probabilistic PCA for large-scale genetic variation data. PLoS genetics 16, e1008773. doi:10.1371/
journal.pgen.1008773 (2020).
46. McVean, G. A Genealogical Interpretation of Principal Components Analysis. PLoS Genetics
5, e1000686. doi:10.1371/journal.pgen.1000686 (2009).
47. Zou, H., Hastie, T. & Tibshirani, R. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics 15, 265–286. doi:10.1198/106186006X113430 (2006).
48. Bishop, C. Bayesian PCA in Advances in Neural Information Processing Systems (MIT Press,
1998).
49. Guan, Y. & Dy, J. Sparse Probabilistic Principal Component Analysis in Proceedings of the
Twelth International Conference on Artificial Intelligence and Statistics (PMLR, 2009), 185–
192.
50. Ning, B. Spike and slab Bayesian sparse principal component analysis. arXiv:2102.00305
[stat] (2021).
51. Armagan, A., Clyde, M. & Dunson, D. Generalized Beta Mixtures of Gaussians in Advances
in Neural Information Processing Systems (Curran Associates, Inc., 2011).
52. Zhao, S., Gao, C., Mukherjee, S. & Engelhardt, B. E. Bayesian group factor analysis with
structured sparsity. Journal of Machine Learning Research 17, 1–47 (2016).
53. Wang, W. & Stephens, M. Empirical bayes matrix factorization. Journal of Machine Learning
Research 22, 1–40 (2021).
54. THE GTEX CONSORTIUM, Ardlie, K. G., Deluca, D. S., Segrè, A. V., Sullivan, T. J.,
Young, T. R., et al. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene
regulation in humans. Science 348, 648–660. doi:10.1126/science.1262110 (2015).
55. Borg, I. & Groenen, P. Modern Multidimensional Scaling: Theory and Applications (Springer
Series in Statistics) doi:10.1007/978-1-4757-2711-1 (2005).
56. Meng, F., Richer, M., Tehrani, A., La, J., Kim, T. D., Ayers, P. W., et al. Procrustes: A
python library to find transformations that maximize the similarity between matrices. Computer Physics Communications 276, 108334. doi:10.1016/j.cpc.2022.108334 (2022).
57. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., et al. JAX:
composable transformations of Python+NumPy programs version 0.3.13. 2018.
58. Cohn, B. A., Cirillo, P. M. & Christianson, R. E. Prenatal DDT Exposure and Testicular
Cancer: A Nested Case-Control Study. Archives of environmental & occupational health 65,
127–134. doi:10.1080/19338241003730887 (2010).
59. Ge, S. X., Jung, D. & Yao, R. ShinyGO: a graphical gene-set enrichment tool for animals and
plants. Bioinformatics 36 (ed Valencia, A.) 2628–2629. doi:10.1093/bioinformatics/btz931 (2020).
60. Amrute, J. M., Perry, A. M., Anand, G., Cruchaga, C., Hock, K. G., Farnsworth, C. W., et
al. Cell specific peripheral immune responses predict survival in critical COVID-19 patients.
Nature Communications 13, 882. doi:10.1038/s41467-022-28505-3 (2022).
61. Garg, M., Li, X., Moreno, P., Papatheodorou, I., Shu, Y., Brazma, A., et al. Meta-analysis of
COVID-19 single-cell studies confirms eight key immune responses. Scientific Reports 11,
20833. doi:10.1038/s41598-021-00121-z (2021).
62. Signorile, A., Sgaramella, G., Bellomo, F. & De Rasmo, D. Prohibitins: A Critical Role in Mitochondrial Functions and Implication in Diseases. Cells 8, 71. doi:10.3390/cells8010071
(2019).
63. Artal-Sanz, M., Tsang, W. Y., Willems, E. M., Grivell, L. A., Lemire, B. D., van der Spek, H.,
et al. The mitochondrial prohibitin complex is essential for embryonic viability and germline
function in Caenorhabditis elegans. The Journal of Biological Chemistry 278, 32091–32099.
doi:10.1074/jbc.M304877200 (2003).
64. Artal-Sanz, M. & Tavernarakis, N. Prohibitin couples diapause signalling to mitochondrial
metabolism during ageing in C. elegans. Nature 461, 793–797. doi:10.1038/nature08466
(2009).
65. Opper, M. & Saad, D. Advanced Mean Field Methods: Theory and Practice (MIT Press,
2001).
66. Carulli, J. P., Artinger, M., Swain, P. M., Root, C. D., Chee, L., Tulig, C., et al. High throughput analysis of differential gene expression. Journal of Cellular Biochemistry 72, 286–296. doi:10.1002/(SICI)1097-4644(1998)72:30/31+<286::AID-JCB35>3.0.CO;2-D (1998).
67. Rapaport, F., Khanin, R., Liang, Y., Pirun, M., Krek, A., Zumbo, P., et al. Comprehensive
evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biology 14, 3158. doi:10.1186/gb-2013-14-9-r95 (2013).
68. Tomfohr, J., Lu, J. & Kepler, T. B. Pathway level analysis of gene expression using singular
value decomposition. BMC Bioinformatics 6, 225. doi:10.1186/1471-2105-6-225 (2005).
69. Zhou, Y., Luo, K., Liang, L., Chen, M. & He, X. A new Bayesian factor analysis method
improves detection of genes and biological processes affected by perturbations in single-cell
CRISPR screening. Nature Methods 20, 1693–1703. doi:10.1038/s41592-023-02017-4
(2023).
70. Yuan, D. & Mancuso, N. SuSiE PCA: A Scalable Bayesian Variable Selection Technique
for Principal Component Analysis. iScience, 108181. doi:10.1016/j.isci.2023.108181
(2023).
71. Lalli, M. A., Avey, D., Dougherty, J. D., Milbrandt, J. & Mitra, R. D. High-throughput single-cell functional elucidation of neurodevelopmental disease-associated genes reveals convergent mechanisms altering neuronal differentiation. Genome Research 30, 1317–1331. doi:10.1101/gr.262295.120 (2020).
72. Scheffé, H. The Analysis of Variance (John Wiley & Sons, 1999).
73. Stephens, M. False discovery rates: a new deal. Biostatistics 18, 275–294. doi:10.1093/biostatistics/kxw041 (2017).
74. Gao, Z., Ure, K., Ables, J. L., Lagace, D. C., Nave, K.-A., Goebbels, S., et al. Neurod1
is essential for the survival and maturation of adult-born neurons. Nature Neuroscience 12.
Publisher: Nature Publishing Group, 1090–1092. doi:10.1038/nn.2385 (2009).
75. Qi, A., Lamont, L., Liu, E., Murray, S. D., Meng, X. & Yang, S. Essential Protein PHB2
and Its Regulatory Mechanisms in Cancer. Cells 12, 1211. doi:10.3390/cells12081211
(2023).
76. Schümperli, D. & Pillai, R. S. The special Sm core structure of the U7 snRNP: far-reaching
significance of a small nuclear ribonucleoprotein. Cellular and Molecular Life Sciences 61,
2560–2570. doi:10.1007/s00018-004-4190-0 (2004).
77. Mao, Y., Tamura, T., Yuki, Y., Abe, D., Tamada, Y., Imoto, S., et al. The hnRNP-Htt axis
regulates necrotic cell death induced by transcriptional repression through impaired RNA
splicing. Cell Death & Disease 7. Publisher: Nature Publishing Group, e2207–e2207. doi:10.
1038/cddis.2016.101 (2016).
78. Ernst, J., Vainas, O., Harbison, C. T., Simon, I. & Bar-Joseph, Z. Reconstructing dynamic
regulatory maps. Molecular Systems Biology 3. Publisher: Nature Publishing Group, 74.
doi:10.1038/msb4100115 (2007).
79. Maienschein-Cline, M., Zhou, J., White, K. P., Sciammas, R. & Dinner, A. R. Discovering
transcription factor regulatory targets using gene expression and binding data. Bioinformatics 28. Publisher: Oxford University Press, 206. doi:10.1093/bioinformatics/btr628
(2012).
80. Alon, U. An Introduction to Systems Biology: Design Principles of Biological Circuits doi:10.
1201/9781420011432 (Chapman and Hall/CRC, New York, 2006).
81. Singh, A. & Ogunfunmi, T. An Overview of Variational Autoencoders for Source Separation,
Finance, and Bio-Signal Applications. Entropy 24, 55. doi:10.3390/e24010055 (2021).
82. Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pacific Symposium on Biocomputing 23, 80–91 (2018).
83. Gomari, D. P., Schweickart, A., Cerchietti, L., Paietta, E., Fernandez, H., Al-Amin, H., et al. Variational autoencoders learn universal latent representations of metabolomics data. Preprint (bioRxiv, 2021). doi:10.1101/2021.01.14.426721.
84. Zhang, X., Zhang, J., Sun, K., Yang, X., Dai, C. & Guo, Y. Integrated Multi-omics Analysis
Using Variational Autoencoders: Application to Pan-cancer Classification in 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2019), 765–769. doi:10.
1109/BIBM47256.2019.8983228.
85. Witten, D. M., Tibshirani, R. & Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10,
515–534. doi:10.1093/biostatistics/kxp008 (2009).
86. Wu, M. & Goodman, N. Multimodal Generative Models for Scalable Weakly-Supervised Learning. arXiv:1802.05335 (arXiv, 2018). doi:10.48550/arXiv.1802.05335.
87. Darst, B., Yuan, D., Yinqi, Z., Peggy, W., Loreall, P., Xin, S., et al. Metabolomics implicates biological mechanisms impacted by a polygenic risk score of prostate cancer in African
ancestry men. in prep (2022).
88. Narasimhan, B. bnaras/PMA. GitHub repository (2022).
89. Hannun, Y. A. & Obeid, L. M. Sphingolipids and their metabolism in physiology and disease.
Nature Reviews Molecular Cell Biology 19. Publisher: Nature Publishing Group, 175–191.
doi:10.1038/nrm.2017.107 (2018).
90. Bataller, M., Sánchez-García, A., García-Mayea, Y., Mir, C., Rodríguez, I. & LLeonart, M. E.
The Role of Sphingolipids Metabolism in Cancer Drug Resistance. Frontiers in Oncology 11,
807636. doi:10.3389/fonc.2021.807636 (2021).
91. Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell
RNA-seq via variational autoencoders. Bioinformatics 36. Number: 11, 3418–3421. doi:10.
1093/bioinformatics/btaa169 (2020).
Appendices
A Chapter 2. SuSiE PCA
A.1 Model
In this section, we present the details of the derivation of variational inference in SuSiE PCA. Let $X_{N \times P}$ be the observed data matrix, $Z_{N \times K}$ the matrix of $K$-dimensional latent vectors, and $W_{K \times P}$ the loading matrix. We denote the normal distribution with mean $\mu$ and variance $\sigma^2$ as $\mathcal{N}(\mu, \sigma^2)$, the multinomial distribution with $n$ trials and probability vector $\pi$ as $\text{Multi}(n, \pi)$, and the matrix normal distribution with dimension $N \times K$, mean $M$, row-covariance $R$, and column-covariance $C$ as $\mathcal{MN}_{N,K}(M, R, C)$. We denote by $e_k$ the basis vector whose $k$th coordinate is 1 and all others are 0.

The sampling distribution of $X$ under the SuSiE PCA model is given by
\begin{align}
X \mid Z, W, \sigma^2 &\sim \mathcal{MN}_{N,P}(ZW, I_N, \sigma^2 I_P) \tag{A.1} \\
Z &\sim \mathcal{MN}_{N,K}(0, I_N, I_K) \tag{A.2} \\
W &= \sum_{k=1}^{K} e_k \mathbf{w}_k^\intercal \tag{A.3} \\
\mathbf{w}_k &= \sum_{l=1}^{L} \mathbf{w}_{kl} \tag{A.4} \\
\mathbf{w}_{kl} &= w_{kl} \gamma_{kl} \tag{A.5} \\
w_{kl} \mid \sigma^2_{0kl} &\sim \mathcal{N}(0, \sigma^2_{0kl}) \tag{A.6} \\
\gamma_{kl} \mid \pi &\sim \text{Multi}(1, \pi) \tag{A.7}
\end{align}

Then, the complete-data log-likelihood of data and parameters is given by
\begin{align*}
\ell_c(\sigma^2, \sigma_0^2, \pi \mid X, Z, W)
&= \log \Pr(X \mid Z, W, \sigma^2) + \log \Pr(Z) + \log \Pr(W \mid \sigma_0^2, \pi) \\
&= \log \mathcal{MN}_{n,p}(X \mid ZW, I_n, I_p \sigma^2) + \log \mathcal{MN}_{n,k}(Z \mid 0, I_n, I_k) \\
&\quad + \sum_{l=1}^{L} \sum_{k=1}^{K} \left[ \log \text{Multi}(\gamma_{kl} \mid 1, \pi) + \log \mathcal{N}(w_{kl} \mid 0, \sigma_0^2) \right]
\end{align*}
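As a concrete illustration, the generative process in equations (A.1)–(A.7) can be simulated in a few lines of NumPy; the dimensions and variance values below are illustrative choices, not settings used in the analyses of this dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K, L = 100, 50, 3, 2      # samples, features, factors, single effects per factor
sigma2, sigma2_0 = 1.0, 4.0     # residual and single-effect variances (illustrative)
pi = np.full(P, 1.0 / P)        # uniform prior over feature positions

# Z ~ MN(0, I_N, I_K): i.i.d. standard normal entries
Z = rng.standard_normal((N, K))

# Each row w_k of W is a sum of L single effects w_kl * gamma_kl,
# where gamma_kl ~ Multi(1, pi) is a one-hot position indicator.
W = np.zeros((K, P))
for k in range(K):
    for l in range(L):
        gamma = rng.multinomial(1, pi)
        w_scalar = rng.normal(0.0, np.sqrt(sigma2_0))
        W[k] += w_scalar * gamma

# X | Z, W ~ MN(ZW, I_N, sigma2 * I_P)
X = Z @ W + rng.normal(0.0, np.sqrt(sigma2), size=(N, P))
```

Each row of $W$ has at most $L$ non-zero entries, which is exactly the sparsity pattern that the sum-of-single-effects prior enforces.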
A.2 Derivation of Variational Distribution

Mean-Field Approximation

The mean-field approximation [37] is a common approach to finding the variational distribution that maximizes the objective function ELBO (see Methods). The basic assumption is that the variational distribution factorizes into independent components. Using the calculus of variations, one can show that the distribution $Q^*_j(z_j)$ minimizing the KL divergence for each factor $Z_j$ can be expressed as
\begin{align}
\ln Q^*_j(z_j \mid X) = \mathbb{E}_{i \neq j}[\ln P(Z, X)] + \text{constant} \tag{A.8}
\end{align}
Applying the mean-field approximation to SuSiE PCA, the approximate posterior is given by
\begin{align}
Q(Z, W) &= Q(Z) Q(W) \tag{A.9} \\
Q(W) &= \prod_{k=1}^{K} \prod_{l=1}^{L} Q(\mathbf{w}_{kl} \mid \gamma_{kl}) Q(\gamma_{kl}) \tag{A.10}
\end{align}
Equation (A.9) factorizes the variational densities of the latent variables $Z$ and the loading matrix $W$ into independent parts. We further assume that the variational distributions of the loadings $\mathbf{w}_{kl}$ within each factor are independent across the $L$ single effects as well, leading to equation (A.10).
For ease of notation, we first define $\tau = \frac{1}{\sigma^2}$ and $\tau_{0kl} = \frac{1}{\sigma^2_{0kl}}$. Based on the factorization, the complete-data log-likelihood of data and parameters of SuSiE PCA is given by
\begin{align}
\ell_c(\sigma^2, \sigma_0^2, \pi \mid X, Z, W)
&= \log \Pr(X \mid Z, W, \sigma^2) + \log \Pr(Z) + \log \Pr(W \mid \sigma_0^2, \pi) \tag{A.11} \\
&= \log \mathcal{MN}_{n,p}(X \mid ZW, I_n, I_p \sigma^2) + \log \mathcal{MN}_{n,k}(Z \mid 0, I_n, I_k) \tag{A.12} \\
&\quad + \sum_{l=1}^{L} \sum_{k=1}^{K} \left[ \log \text{Multi}(\gamma_{kl} \mid 1, \pi) + \log \mathcal{N}(w_{kl} \mid 0, \sigma_0^2) \right] \tag{A.13}
\end{align}
Helpful definitions

Before proceeding to the full derivation of the variational distributions of $Z$, $\mathbf{w}_{kl}$, and $\gamma_{kl}$, we first give some helpful definitions, including expansions of the first and second moments of $W$ and $Z$.

The second moments of $Z$ are
\begin{align*}
\mathbb{E}[Z^\intercal Z] &= \text{tr}(I_n) \Sigma_Z + \mathbb{E}[Z]^\intercal \mathbb{E}[Z] = n \Sigma_Z + \mathbb{E}[Z]^\intercal \mathbb{E}[Z] \\
\mathbb{E}[Z_k^\intercal Z_k] &= \text{tr}(\mathbb{V}[Z_k]) + \mathbb{E}[Z_k]^\intercal \mathbb{E}[Z_k] = \text{tr}(I_n (\Sigma_Z)_{kk}) + \mathbb{E}[Z_k]^\intercal \mathbb{E}[Z_k] = n (\Sigma_Z)_{kk} + \mathbb{E}[Z_k]^\intercal \mathbb{E}[Z_k]
\end{align*}

The first and second moments of $\mathbf{w}_k$ are listed as follows:
\begin{align*}
\mathbb{E}[\mathbf{w}_{kl} \mid \gamma_{kl}] &= \text{$p$-vector of posterior conditional means} \\
\mathbb{V}[\mathbf{w}_{kl} \mid \gamma_{kl}] &= \text{$p$-vector of posterior conditional variances} \\
\mathbb{E}[\mathbf{w}_k] &= \mathbb{E}\Big[\sum_l \mathbf{w}_{kl}\Big] = \sum_l \mathbb{E}[\mathbf{w}_{kl}] \\
\mathbb{E}[\mathbf{w}_{kl}] &= \mathbb{E}[\mathbf{w}_{kl} \mid \gamma_{kl}] \circ \mathbb{E}[\gamma_{kl}] \\
\mathbb{V}[\mathbf{w}_k] &= \mathbb{V}\Big[\sum_l \mathbf{w}_{kl}\Big] = \sum_l \mathbb{V}[\mathbf{w}_{kl}] \\
\mathbb{V}[\mathbf{w}_{kl}] &= \mathbb{E}[\mathbf{w}_{kl} \mathbf{w}_{kl}^\intercal] - \mathbb{E}[\mathbf{w}_{kl}] \mathbb{E}[\mathbf{w}_{kl}]^\intercal \\
&= \mathbb{E}[w_{kl}^2 \gamma_{kl} \gamma_{kl}^\intercal] - \mathbb{E}[\mathbf{w}_{kl}] \mathbb{E}[\mathbf{w}_{kl}]^\intercal \\
&= \text{diag}\left( \mathbb{E}[\mathbf{w}_{kl} \circ \mathbf{w}_{kl} \mid \gamma_{kl}] \circ \mathbb{E}[\gamma_{kl}] \right) - \mathbb{E}[\mathbf{w}_{kl}] \mathbb{E}[\mathbf{w}_{kl}]^\intercal \\
\text{diag}(\mathbb{V}[\mathbf{w}_{kl}]) &= \mathbb{E}[\mathbf{w}_{kl} \circ \mathbf{w}_{kl} \mid \gamma_{kl}] \circ \mathbb{E}[\gamma_{kl}] - \left( \mathbb{E}[\mathbf{w}_{kl} \mid \gamma_{kl}] \circ \mathbb{E}[\gamma_{kl}] \right)^2 \\
\mathbb{E}[\mathbf{w}_k^\intercal \mathbf{w}_k] &= \text{tr}(\mathbb{V}[\mathbf{w}_k]) + \mathbb{E}[\mathbf{w}_k]^\intercal \mathbb{E}[\mathbf{w}_k] \\
\mathbb{E}[w_{kl}^2] &= \left( \mathbb{E}^2[\mathbf{w}_{kl} \mid \gamma_{kl}] + \mathbb{V}[\mathbf{w}_{kl} \mid \gamma_{kl}] \right) \circ \mathbb{E}[\gamma_{kl}]
\end{align*}
The first and second moments of $W$ are listed as follows:
\begin{align*}
\mathbb{E}[W] &= \mathbb{E}\Big[\sum_k e_k \mathbf{w}_k^\intercal\Big] = \sum_k e_k \mathbb{E}[\mathbf{w}_k]^\intercal \\
\mathbb{E}[W W^\intercal] &= \mathbb{E}\Big[\Big(\sum_k e_k \mathbf{w}_k^\intercal\Big)\Big(\sum_{k'} e_{k'} \mathbf{w}_{k'}^\intercal\Big)^\intercal\Big] \\
&= \mathbb{E}\Big[\sum_k \sum_{k'} e_k \mathbf{w}_k^\intercal \mathbf{w}_{k'} e_{k'}^\intercal\Big] \\
&= \sum_k \sum_{k'} e_k e_{k'}^\intercal \mathbb{E}[\mathbf{w}_k^\intercal \mathbf{w}_{k'}] \\
&= \sum_k \sum_{k'} e_k e_{k'}^\intercal \mathbb{E}[\mathbf{w}_k]^\intercal \mathbb{E}[\mathbf{w}_{k'}] + \sum_k e_k e_k^\intercal \left( \mathbb{E}[\mathbf{w}_k^\intercal \mathbf{w}_k] - \mathbb{E}[\mathbf{w}_k]^\intercal \mathbb{E}[\mathbf{w}_k] \right) \\
&= \mathbb{E}[W] \mathbb{E}[W]^\intercal + \sum_k e_k e_k^\intercal \left( \mathbb{E}[\mathbf{w}_k^\intercal \mathbf{w}_k] - \mathbb{E}[\mathbf{w}_k]^\intercal \mathbb{E}[\mathbf{w}_k] \right) \\
&= \mathbb{E}[W] \mathbb{E}[W]^\intercal + \sum_k e_k e_k^\intercal \,\text{tr}(\mathbb{V}[\mathbf{w}_k]) \\
&= \mathbb{E}[W] \mathbb{E}[W]^\intercal + \text{diag}\left( \text{tr}(\mathbb{V}[\mathbf{w}_1]), \ldots, \text{tr}(\mathbb{V}[\mathbf{w}_K]) \right)
\end{align*}
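The final identity, $\mathbb{E}[WW^\intercal] = \mathbb{E}[W]\mathbb{E}[W]^\intercal + \mathrm{diag}(\mathrm{tr}(\mathbb{V}[\mathbf{w}_1]), \ldots, \mathrm{tr}(\mathbb{V}[\mathbf{w}_K]))$, can be checked numerically with a short Monte Carlo sketch; the Gaussian rows below are an illustrative choice, since the identity only requires the rows $\mathbf{w}_k$ to be independent:

```python
import numpy as np

rng = np.random.default_rng(1)
K, P, S = 2, 4, 400_000   # factors, features, Monte Carlo samples

# Independent rows w_k with arbitrary means and diagonal covariances
mu = rng.standard_normal((K, P))
var = rng.uniform(0.5, 2.0, size=(K, P))
Ws = mu + np.sqrt(var) * rng.standard_normal((S, K, P))   # S draws of W (K x P)

mc = np.einsum('skp,sjp->kj', Ws, Ws) / S                 # Monte Carlo E[W W^T]
analytic = mu @ mu.T + np.diag(var.sum(axis=1))           # E[W]E[W]^T + diag(tr V[w_k])
assert np.allclose(mc, analytic, atol=0.1)
```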
Other terms in the likelihood function:
\begin{align*}
\log \mathcal{MN}_{n,p}(X \mid ZW, I_n, I_p \sigma^2) &= -\frac{1}{2\sigma^2} \text{tr}\left[ (X - ZW)^\intercal (X - ZW) \right] - \frac{np}{2} \log(2\pi\sigma^2) \\
\log \mathcal{MN}_{n,k}(Z \mid 0, I_n, I_k) &= -\frac{1}{2} \text{tr}\left( Z^\intercal Z \right) - \frac{nk}{2} \log(2\pi) \\
\log \text{Multi}(\gamma_{kl} \mid 1, \pi) &= \sum_{i=1}^{p} \gamma_{kli} \log(\pi_i) \\
\log \mathcal{N}(w_{kl} \mid 0, \sigma_0^2) &= -\frac{1}{2\sigma_0^2} w_{kl}^2 - \frac{1}{2} \log(2\pi\sigma_0^2)
\end{align*}
\begin{align*}
\text{tr}\left( \mathbb{E}_{\neg Z}\left[ (X - ZW)^\intercal (X - ZW) \right] \right)
&= \text{tr}\left( \mathbb{E}_{\neg Z}[X^\intercal X - X^\intercal ZW - W^\intercal Z^\intercal X + W^\intercal Z^\intercal ZW] \right) \\
&= \text{tr}(X^\intercal X) - 2\,\text{tr}(\mathbb{E}[W] X^\intercal Z) + \text{tr}(Z^\intercal Z\, \mathbb{E}[W W^\intercal]) \\
&= \text{tr}(X^\intercal X) - 2\,\text{tr}(\mathbb{E}[W] X^\intercal Z) + \sum_{i=1}^{p} \text{tr}(Z \Sigma_{W_i} Z^\intercal) + \mathbb{E}[W^\intercal] Z^\intercal Z \mathbb{E}[W] \\
\text{tr}\left( \mathbb{E}_{\neg W}\left[ (X - ZW)^\intercal (X - ZW) \right] \right)
&= \text{tr}\left( \mathbb{E}_{\neg W}[X^\intercal X - X^\intercal ZW - W^\intercal Z^\intercal X + W^\intercal Z^\intercal ZW] \right) \\
&= \text{tr}(X^\intercal X) - 2\,\text{tr}(X^\intercal \mathbb{E}[Z] W) + \text{tr}(\mathbb{E}[Z^\intercal Z] W W^\intercal) \\
\mathbb{E}\left[ \text{tr}\left( (X - ZW)^\intercal (X - ZW) \right) \right]
&= \mathbb{E}\left[ \text{tr}(X^\intercal X - X^\intercal ZW - W^\intercal Z^\intercal X + W^\intercal Z^\intercal ZW) \right] \\
&= \text{tr}\left( X^\intercal X - X^\intercal \mathbb{E}[Z] \mathbb{E}[W] - \mathbb{E}[W^\intercal] \mathbb{E}[Z^\intercal] X + \mathbb{E}[W^\intercal Z^\intercal Z W] \right) \\
&= \text{tr}(X^\intercal X) - 2\,\text{tr}(X^\intercal \mathbb{E}[Z] \mathbb{E}[W]) + \text{tr}(\mathbb{E}[Z^\intercal Z] \mathbb{E}[W W^\intercal])
\end{align*}
Finally, define
\begin{align*}
R_{kl} &:= X - \mathbb{E}[Z] \Big( \sum_{k' \neq k} e_{k'} \mathbb{E}[\mathbf{w}_{k'}]^\intercal + \sum_{l' \neq l} e_k \mathbb{E}[\mathbf{w}_{kl'}]^\intercal \Big) \\
&= X - \sum_{k' \neq k} \mathbb{E}[Z_{k'}] \mathbb{E}[\mathbf{w}_{k'}]^\intercal - \sum_{l' \neq l} \mathbb{E}[Z_k] \mathbb{E}[\mathbf{w}_{kl'}]^\intercal
\end{align*}
We assume our approximate posterior $Q(Z, W)$ factorizes as $Q(Z) Q(W)$, where $Q(W) = \prod_{l=1}^{L} Q(W_l)$ and $Q(W_l) = Q(\mathbf{w}_l \mid \Gamma_l) Q(\Gamma_l)$.
In this section, we present the detailed derivation of the optimal variational distributions of the variables $Z$, $\mathbf{w}_{kl}$, and $\gamma_{kl}$. First, we derive $\log Q(Z)$:
\begin{align*}
\log Q(Z) &= \mathbb{E}_{\neg Z}\left[ \ell_c(\sigma^2, \sigma_0^2, \pi \mid X, Z, W) \right] \\
&= \mathbb{E}_{\neg Z}\left[ \log \mathcal{MN}_{n,p}(X \mid ZW, I_n, I_p \sigma^2) \right] + \log \mathcal{MN}_{n,k}(Z \mid 0, I_n, I_k) \\
&= -\frac{\tau}{2} \left[ -\text{tr}(X^\intercal Z \mathbb{E}[W]) - \text{tr}(\mathbb{E}[W^\intercal] Z^\intercal X) + \text{tr}(Z^\intercal Z\, \mathbb{E}[W W^\intercal]) \right] - \frac{1}{2} \text{tr}(Z^\intercal Z) + O(1) \\
&= -\frac{1}{2} \left[ \text{tr}\left( Z^\intercal Z (\tau \mathbb{E}[W W^\intercal] + I_k) \right) - \text{tr}(\tau X^\intercal Z \mathbb{E}[W]) - \text{tr}(\tau \mathbb{E}[W^\intercal] Z^\intercal X) \right] + O(1) \\
&= -\frac{1}{2} \Big[ \text{tr}\big( Z \underbrace{(\tau \mathbb{E}[W W^\intercal] + I_k)}_{\Sigma_Z^{-1}} Z^\intercal \big) - \text{tr}\big( Z \Sigma_Z^{-1} \underbrace{\Sigma_Z \mathbb{E}[W] X^\intercal \tau}_{\mu_Z^\intercal} \big) - \text{tr}\big( \underbrace{\tau X \mathbb{E}[W^\intercal] \Sigma_Z}_{\mu_Z} \Sigma_Z^{-1} Z^\intercal \big) \Big] + O(1) \\
&= -\frac{1}{2} \left[ \text{tr}(\Sigma_Z^{-1} Z^\intercal Z) - \text{tr}(\Sigma_Z^{-1} \mu_Z^\intercal Z) - \text{tr}(\Sigma_Z^{-1} Z^\intercal \mu_Z) + \text{tr}(\Sigma_Z^{-1} \mu_Z^\intercal \mu_Z) \right] + O(1) \\
&= -\frac{1}{2} \text{tr}\left[ \Sigma_Z^{-1} (Z - \mu_Z)^\intercal (Z - \mu_Z) \right] + O(1) \quad \Rightarrow \\
Q(Z) &= \mathcal{MN}_{n,k}(Z \mid \mu_Z, I_n, \Sigma_Z)
\end{align*}
Second, we derive $\log Q(w_{kl} \mid \gamma_{kli} = 1)$. Substituting the expansion of $R_{kl}$ and taking expectations over all variables other than $w_{kl}$,
\begin{align*}
\log Q(w_{kl} \mid \gamma_{kli} = 1)
&= -\frac{\tau}{2} \mathbb{E}_{\neg w_{kl}}\left[ \text{tr}\left( (R_{kl} - Z_k \mathbf{w}_{kl}^\intercal)^\intercal (R_{kl} - Z_k \mathbf{w}_{kl}^\intercal) \right) \right] - \frac{\tau_0}{2} w_{kl}^2 + O(1) \\
&= -\frac{\tau}{2} \mathbb{E}_{\neg w_{kl}}\left[ -2\,\text{tr}(R_{kl}^\intercal Z_k \mathbf{w}_{kl}^\intercal) + \text{tr}(Z_k^\intercal Z_k \mathbf{w}_{kl}^\intercal \mathbf{w}_{kl}) \right] - \frac{\tau_0}{2} w_{kl}^2 + O(1) \\
&= -\frac{1}{2} \Big[ -2\tau \Big( X_i^\intercal \mathbb{E}[Z_k] - \sum_{k' \neq k} \mathbb{E}[w_{k',i}] \mathbb{E}[Z_{k'}^\intercal Z_k] - \mathbb{E}[Z_k^\intercal Z_k] \sum_{l' \neq l} \mathbb{E}[w_{kl'i}] \Big) w_{kl} \\
&\qquad\quad + \underbrace{\left( \tau \mathbb{E}[Z_k^\intercal Z_k] + \tau_0 \right)}_{1/\sigma^2_{w_{kl}}} w_{kl}^2 \Big] + O(1) \\
&= -\frac{1}{2\sigma^2_{w_{kl}}} \Big[ w_{kl}^2 - 2 \underbrace{\tau \sigma^2_{w_{kl}} \Big( X_i^\intercal \mathbb{E}[Z_k] - \sum_{k' \neq k} \mathbb{E}[w_{k',i}] \mathbb{E}[Z_{k'}^\intercal Z_k] - \mathbb{E}[Z_k^\intercal Z_k] \sum_{l' \neq l} \mathbb{E}[w_{kl'i}] \Big)}_{\mu_{w_{kl}}} w_{kl} \Big] + O(1) \\
&\Rightarrow Q(w_{kl} \mid \gamma_{kli} = 1) = \mathcal{N}(\mu_{w_{kl}}, \sigma^2_{w_{kl}})
\end{align*}
Note that we can update $\mathbf{w}_{kl}$ for all features at once:
\begin{align*}
\mu_{\mathbf{w}_{kl}} = \tau \sigma^2_{w_{kl}} \mathbb{E}[R_{kl}^\intercal Z_k], \qquad \Sigma_{\mathbf{w}_{kl}} = \sigma^2_{w_{kl}} I_p
\end{align*}
Finally, we derive $\log Q(\gamma_{kl})$. Note that $\tau R_{kl}^\intercal \mathbb{E}[Z_k] = \mathbb{E}[\mathbf{w}_{kl} \mid \gamma_{kl}] / \sigma^2_{w_{kl}} = \mu_{\mathbf{w}_{kl}} / \sigma^2_{w_{kl}}$. Then
\begin{align*}
\log Q(\gamma_{kli} = 1)
&= \mathbb{E}_{\neg \gamma_{kl}}\left[ \ell_c(\sigma^2, \sigma_0^2, \pi \mid X, Z, W) \right] + \log \text{Multi}(\gamma_{kl} \mid \pi) + O(1) \\
&= -\frac{\tau}{2} \left[ -2 R_{kli}^\intercal \mathbb{E}[Z_k] \mathbb{E}[w_{kl} \mid \gamma_{kli} = 1] + \mathbb{E}[Z_k^\intercal Z_k] \mathbb{E}[w_{kl}^2 \mid \gamma_{kli} = 1] \right] + \log \pi_i + O(1) \\
&= \frac{1}{\sigma^2_{w_{kl}}} \mathbb{E}[w_{kl} \mid \gamma_{kli} = 1]^2 - \frac{1}{2} \mathbb{E}[w_{kl}^2 \mid \gamma_{kli} = 1] \left( \tau \mathbb{E}[Z_k^\intercal Z_k] + \tau_0 \right) + \log \pi_i + O(1) \\
&= \frac{1}{\sigma^2_{w_{kl}}} \mathbb{E}[w_{kl} \mid \gamma_{kli} = 1]^2 - \frac{1}{2\sigma^2_{w_{kl}}} \left( \sigma^2_{w_{kl}} + \mathbb{E}[w_{kl} \mid \gamma_{kli} = 1]^2 \right) + \log \pi_i + O(1) \\
&= \frac{1}{2\sigma^2_{w_{kl}}} \mathbb{E}[w_{kl} \mid \gamma_{kli} = 1]^2 + \log \pi_i + O(1) \quad \Rightarrow \\
\log \tilde{\alpha}_{kli} &= \log \pi_i - \log \mathcal{N}(0 \mid \mu_{w_{kli}}, \sigma^2_{w_{kl}}), \qquad
Q(\gamma_{kl}) = \text{Multi}\left( 1, \alpha_{kl} = \text{softmax}(\log \tilde{\alpha}_{kl}) \right)
\end{align*}
In summary, the optimal variational distributions of the model parameters are
\begin{align}
Q(Z) &:= \mathcal{MN}_{n,k}(Z \mid \mu_Z, I_n, \Sigma_Z) \tag{A.14} \\
Q(\mathbf{w}_{kl} \mid \gamma_{kl}) &:= \mathcal{N}(\mu_{\mathbf{w}_{kl}}, \sigma^2_{w_{kl}}) \tag{A.15} \\
Q(\gamma_{kl}) &:= \text{Multi}(1, \alpha_{kl}) \tag{A.16}
\end{align}
The corresponding update rules for the variational parameters of $Q(\cdot)$ can be expressed as
\begin{align}
\mu_Z &= \tau X \mathbb{E}[W^\intercal] \Sigma_Z \tag{A.17} \\
\Sigma_Z &= \left( \tau \mathbb{E}[W W^\intercal] + I_k \right)^{-1} \tag{A.18} \\
\mu_{\mathbf{w}_{kl}} &= \tau \sigma^2_{w_{kl}} \mathbb{E}[R_{kl}^\intercal Z_k] \tag{A.19} \\
\Sigma_{\mathbf{w}_{kl}} &= \sigma^2_{w_{kl}} I_p \tag{A.20} \\
\sigma^2_{w_{kl}} &= \left( \tau \mathbb{E}[Z_k^\intercal Z_k] + \tau_{0kl} \right)^{-1} \tag{A.21} \\
\alpha_{kl} &= \text{softmax}\left( \log \pi - \log \mathcal{N}(0 \mid \mu_{\mathbf{w}_{kl}}, \sigma^2_{w_{kl}}) \right) \tag{A.22}
\end{align}
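The update rules (A.17)–(A.22) translate almost line-for-line into NumPy. The sketch below assumes the required posterior moments have been computed elsewhere and is a simplified illustration, not the packaged SuSiE PCA implementation:

```python
import numpy as np

def update_Z(X, E_W, E_WWt, tau):
    """Updates (A.17)-(A.18): posterior mean and column-covariance of Z."""
    K = E_WWt.shape[0]
    Sigma_Z = np.linalg.inv(tau * E_WWt + np.eye(K))   # (A.18)
    mu_Z = tau * X @ E_W.T @ Sigma_Z                   # (A.17)
    return mu_Z, Sigma_Z

def update_w_kl(E_Rkl_T_Zk, E_ZkTZk, tau, tau0_kl):
    """Updates (A.19)-(A.21): per-feature posterior of the single effect w_kl."""
    s2 = 1.0 / (tau * E_ZkTZk + tau0_kl)               # (A.21), shared across features
    mu = tau * s2 * E_Rkl_T_Zk                         # (A.19), length-P vector
    return mu, s2

def update_alpha(mu, s2, log_pi):
    """Update (A.22): posterior inclusion probabilities via a softmax."""
    # -log N(0 | mu_i, s2) = 0.5*log(2*pi*s2) + mu_i^2/(2*s2); the first term is
    # shared across features (eq. A.20) and cancels inside the softmax.
    logits = log_pi + 0.5 * mu**2 / s2
    logits -= logits.max()                             # numerical stability
    alpha = np.exp(logits)
    return alpha / alpha.sum()
```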
A.3 Derivation of Evidence Lower Bound (ELBO)

The ELBO provides a natural criterion for evaluating model performance during training, and also provides a means to perform hyperparameter optimization for the model precision parameters $\tau$ and $\tau_0$ (or, equivalently, the variances). Given the above definitions for $Q$, we derive the ELBO for SuSiE PCA as
\begin{align}
\text{ELBO}(W, Z) &= \mathbb{E}_Q\left[ \log \Pr(X, Z, W) - \log Q(Z, W) \right] \tag{A.23} \\
&= \mathbb{E}_Q[\log \Pr(X \mid Z, W)] + \mathbb{E}_Q[\log \Pr(Z, W) - \log Q(Z, W)] \tag{A.24} \\
&= \mathbb{E}_Q[\log \Pr(X \mid Z, W)] + \mathbb{E}_{Q(Z)}[\log \Pr(Z) - \log Q(Z)] \tag{A.25} \\
&\quad + \sum_{l=1}^{L} \left\{ \mathbb{E}_{Q(\mathbf{w}_l \mid \Gamma_l)}[\log \Pr(\mathbf{w}_l \mid \Gamma_l) - \log Q(\mathbf{w}_l \mid \Gamma_l)] + \mathbb{E}_{Q(\Gamma_l)}[\log \Pr(\Gamma_l) - \log Q(\Gamma_l)] \right\} \tag{A.26} \\
&= \mathbb{E}_Q[\log \Pr(X \mid Z, W)] + \mathbb{E}_{Q(Z)}[\log \Pr(Z) - \log Q(Z)] \tag{A.27} \\
&\quad + \mathbb{E}_{Q(W, \Gamma)}[\log \Pr(W, \Gamma) - \log Q(W, \Gamma)] \tag{A.28}
\end{align}
Based on the above derivation, the ELBO can be decomposed into three parts. The first term is the expectation of the data log-likelihood with respect to all the parameters in the model:
\begin{align*}
\mathbb{E}_Q[\log \Pr(X \mid Z, W, \Gamma)]
&= \mathbb{E}_Q\left[ -\frac{1}{2\sigma^2} \text{tr}\left( (X - ZW)^\intercal (X - ZW) \right) - \frac{np}{2} \log(2\pi\sigma^2) \right] \\
&= -\frac{1}{2\sigma^2} \left[ \text{tr}(X^\intercal X) - 2\,\text{tr}(X^\intercal \mathbb{E}[Z] \mathbb{E}[W]) + \text{tr}(\mathbb{E}[Z^\intercal Z] \mathbb{E}[W W^\intercal]) \right] - \frac{np}{2} \log(2\pi\sigma^2)
\end{align*}
The second term is the negative KL divergence of $Z$:
\begin{align*}
\mathbb{E}_{Q(Z)}[\log \Pr(Z) - \log Q(Z)]
&= \mathbb{E}\left[ -\frac{1}{2} \text{tr}(Z^\intercal Z) - \frac{nk}{2} \log(2\pi) + \frac{1}{2} \text{tr}\left( \Sigma_Z^{-1} (Z - \mu_Z)^\intercal (Z - \mu_Z) \right) + \frac{nk}{2} \log(2\pi) + \frac{n}{2} \log |\Sigma_Z| \right] \\
&= -\frac{1}{2} \text{tr}(\mathbb{E}[Z^\intercal Z]) + \frac{1}{2} \text{tr}\left[ \Sigma_Z^{-1} \left( \mathbb{E}[Z^\intercal Z] - \mu_Z^\intercal \mu_Z \right) \right] + \frac{N}{2} \log |\Sigma_Z| \\
&= -\frac{1}{2} \text{tr}(\mathbb{E}[Z^\intercal Z]) + \frac{1}{2} \text{tr}\left[ \Sigma_Z^{-1} \left( N \Sigma_Z + \mu_Z^\intercal \mu_Z - \mu_Z^\intercal \mu_Z \right) \right] + \frac{N}{2} \log |\Sigma_Z| \\
&= -\frac{1}{2} \text{tr}(\mathbb{E}[Z^\intercal Z]) + \frac{NK}{2} + \frac{N}{2} \log |\Sigma_Z|
\end{align*}
The last term contains the joint negative KL divergence of $W$ and $\Gamma$ and can be further decomposed as follows:
\begin{align*}
\mathbb{E}_{Q(W,\Gamma)}[\log \Pr(W, \Gamma) - \log Q(W, \Gamma)]
&= \mathbb{E}_{Q(W,\Gamma)}[\log \Pr(W \mid \Gamma) \Pr(\Gamma) - \log Q(W \mid \Gamma) Q(\Gamma)] \\
&= \mathbb{E}_{Q(W,\Gamma)}[\log \Pr(W \mid \Gamma) - \log Q(W \mid \Gamma)] + \mathbb{E}_{Q(W,\Gamma)}[\log \Pr(\Gamma) - \log Q(\Gamma)] \\
&= \sum_{k=1}^{K} \sum_{l=1}^{L} \mathbb{E}_{Q(\mathbf{w}_{kl}, \gamma_{kl})}[\log \Pr(\mathbf{w}_{kl} \mid \gamma_{kl}) - \log Q(\mathbf{w}_{kl} \mid \gamma_{kl})] \\
&\quad + \sum_{k=1}^{K} \sum_{l=1}^{L} \mathbb{E}_{Q(\gamma_{kl})}[\log \Pr(\gamma_{kl}) - \log Q(\gamma_{kl})] \\
&= \sum_{k=1}^{K} \sum_{l=1}^{L} \sum_{i=1}^{P} \alpha_{kli} \mathbb{E}_{Q(\mathbf{w}_{kl} \mid \gamma_{kl})}[\log \Pr(w_{kl} \mid \gamma_{kli} = 1) - \log Q(w_{kl} \mid \gamma_{kli} = 1)] \\
&\quad + \sum_{k=1}^{K} \sum_{l=1}^{L} \sum_{i=1}^{P} \mathbb{E}_{\gamma_{kl}}[\log \Pr(\gamma_{kli} = 1) - \log Q(\gamma_{kli} = 1)]
\end{align*}
The first expectation term in the last line, $\mathbb{E}_{Q(\mathbf{w}_{kl} \mid \gamma_{kl})}$, can be expanded as follows:
\begin{align*}
\mathbb{E}_{Q(\mathbf{w}_{kl} \mid \gamma_{kl})}\left[ \log \frac{\Pr(\mathbf{w}_{kl} \mid \gamma_{kl})}{Q(\mathbf{w}_{kl} \mid \gamma_{kl})} \right]
&= \sum_{i=1}^{P} \mathbb{E}_{Q(w_{kl} \mid \gamma_{kli} = 1)}\left[ \log \frac{\Pr(w_{kl} \mid \gamma_{kli} = 1)}{Q(w_{kl} \mid \gamma_{kli} = 1)} \right] \\
&= \mathbb{E}\left[ \sum_{i=1}^{P} \left( -\frac{\tau_0}{2} w_{kli}^2 + \frac{1}{2\sigma^2_{w_{kl}}} (w_{kli} - \mu_{w_{kli}})^2 \right) - \frac{P}{2} \log(2\pi/\tau_0) + \frac{P}{2} \log(2\pi\sigma^2_{w_{kl}}) \right] \\
&= \sum_{i=1}^{P} \left[ \left( -\frac{\tau_0}{2} + \frac{1}{2\sigma^2_{w_{kl}}} \right) \mathbb{E}[w_{kli}^2] - \frac{1}{2\sigma^2_{w_{kl}}} \mu^2_{w_{kli}} \right] + \frac{P}{2} \log(\sigma^2_{w_{kl}} \tau_0) \\
&= \sum_{i=1}^{P} \left[ \left( -\frac{\tau_0}{2} + \frac{1}{2\sigma^2_{w_{kl}}} \right) \left( \mu^2_{w_{kli}} + \sigma^2_{w_{kl}} \right) - \frac{1}{2\sigma^2_{w_{kl}}} \mu^2_{w_{kli}} \right] + \frac{P}{2} \log(\sigma^2_{w_{kl}} \tau_0) \\
&= \sum_{i=1}^{P} \left[ -\frac{\tau_0}{2} \mu^2_{w_{kli}} - \frac{\tau_0}{2} \sigma^2_{w_{kl}} + \frac{1}{2} \right] + \frac{P}{2} \log(\sigma^2_{w_{kl}} \tau_0)
\end{align*}
And the second expectation term, $\mathbb{E}_{\gamma_{kl}}$, can be decomposed as
\begin{align*}
\mathbb{E}_{Q(\Gamma)}[\log \Pr(\Gamma) - \log Q(\Gamma)]
&= \sum_{k=1}^{K} \sum_{l=1}^{L} \sum_{i=1}^{P} \mathbb{E}_{Q(\gamma_{kli} = 1)}\left[ \gamma_{kli} \log \pi_i - \gamma_{kli} \log \alpha_{kli} \right] \\
&= \sum_{k=1}^{K} \sum_{l=1}^{L} \sum_{i=1}^{P} \left[ \mathbb{E}[\gamma_{kli}] \log(\pi_i) - \mathbb{E}[\gamma_{kli}] \log(\alpha_{kli}) \right] \\
&= \sum_{k=1}^{K} \sum_{l=1}^{L} \sum_{i=1}^{P} \alpha_{kli} \left( \log(\pi_i) - \log \alpha_{kli} \right)
\end{align*}
With the explicit form of the ELBO, we obtain maximum likelihood estimates of the model precision parameters $\tau$ and $\tau_{0kl}$ by setting the derivative of the ELBO with respect to each variance parameter to 0, which results in closed-form update equations given by
\begin{align}
\hat{\tau}_{0kl} &= \frac{\sum_{i=1}^{P} \alpha_{kli}}{\sum_{i=1}^{P} \alpha_{kli} \left( \mu^2_{w_{kli}} + \sigma^2_{w_{kli}} \right)} \tag{A.30} \\
\hat{\tau} &= \frac{NP}{\sum_{i,j} X_{ij}^2 - 2\,\text{tr}(\mathbb{E}[W] X^\intercal \mu_Z)} \tag{A.31}
\end{align}
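Update (A.30) is a one-liner in practice; the variable names below are illustrative:

```python
import numpy as np

def update_tau0(alpha, mu, s2):
    """MLE update (A.30) for the single-effect precision tau_0kl.

    alpha : (P,) posterior inclusion probabilities alpha_kl
    mu    : (P,) posterior conditional means mu_wkl
    s2    : scalar posterior conditional variance sigma^2_wkl
    """
    # numerator: sum of PIPs; denominator: PIP-weighted posterior second moment
    return alpha.sum() / np.sum(alpha * (mu**2 + s2))

# two candidate positions with equal inclusion probability
print(update_tau0(np.array([0.5, 0.5]), np.array([1.0, 1.0]), 1.0))  # -> 0.5
```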
Finally, we provide the algorithm for SuSiE PCA.

Algorithm 1 SuSiE PCA Algorithm
Require: data $X_{N \times P}$
Require: number of factors $K$; number of single effects in each factor $L$
Require: initialized variational parameters $(\mu_Z, \Sigma_Z;\ \mu_{\mathbf{w}_{kl}}, \sigma_{w_{kl}};\ \alpha_{kl})$; hyperparameters $\tau$, $\tau_{0kl}$ for $l = 1, \ldots, L$; $k = 1, \ldots, K$
Require: update equations for each variable: $F_Z$; $F_{w_{kl}}$; $F_{\alpha_{kl}}$; $F_{\tau_0}$; $F_\tau$
Require: a function to compute the ELBO, $F_{\text{ELBO}}$
Ensure: the ELBO increases
1: repeat
2:   $W \leftarrow \sum_{l=1}^{L} \mu_w \circ \alpha$  ▷ define $\mu_w$, $\alpha$ as $(L, K, P)$ arrays by arranging $\mu_{\mathbf{w}_{kl}}$, $\alpha_{kl}$
3:   $\tau_0 \leftarrow F_{\tau_0}(\mu_w, \sigma_w, \alpha)$
4:   for $k = 1, \ldots, K$ do
5:     $\mathbb{E}[R_{kl}^\intercal Z_k]^{(1)} = X^\intercal \mu_{z_k} - \sum_{k' \neq k} \mathbb{E}[\mathbf{w}_{k'}] \mathbb{E}[Z_{k'}^\intercal Z_k]$  ▷ compute the first two terms
6:     for $l = 1, \ldots, L$ do
7:       $\mathbb{E}[\mathbf{w}_{kl'}] = \mathbf{w}_k - \mu_{\mathbf{w}_{kl}} \circ \alpha_{kl}$  ▷ remove the $l$th effect from $\mathbf{w}_k$
8:       $\mathbb{E}[R_{kl}^\intercal Z_k] = \mathbb{E}[R_{kl}^\intercal Z_k]^{(1)} - \mathbf{w}_k \mathbb{E}[Z_k^\intercal Z_k]$  ▷ complete the calculation of $\mathbb{E}[R_{kl}^\intercal Z_k]$
9:       $(\mu_{\mathbf{w}_{kl}}, \sigma_{w_{kl}}) \leftarrow F_{w_{kl}}(\mathbb{E}[R_{kl}^\intercal Z_k], \mathbb{E}[Z_k^\intercal Z_k], \tau_{0kl}, \tau)$
10:      $\alpha_{kl} \leftarrow F_{\alpha_{kl}}(\mathbb{E}[R_{kl}^\intercal Z_k], \mu_{\mathbf{w}_{kl}}, \sigma_{w_{kl}})$
11:      $\mathbf{w}_k = \mathbb{E}[\mathbf{w}_{kl'}] + \mu_{\mathbf{w}_{kl}} \circ \alpha_{kl}$  ▷ update $\mathbf{w}_k$
12:    end for
13:  end for
14:  $(\mu_Z, \Sigma_Z) \leftarrow F_Z(X, \tau, \mathbb{E}[W])$
15:  $\tau \leftarrow F_\tau(X, \tau, \mathbb{E}[W], \mathbb{E}[Z])$
16:  ELBO $\leftarrow F_{\text{ELBO}}$
17: until convergence criterion satisfied
A.4 Supplementary Figures
Figure 5.1: Sensitivity decreases rapidly as the cutoff value increases when using weights from sparse PCA and EBMF for variable selection.
The proportion of correctly classified signals using the posterior weights from sparse PCA (A) and EBMF (B) as the cutoff. The green dots represent sensitivity, i.e. Pr(weight ≥ cutoff | true positive signal); the red dots represent specificity, i.e. Pr(weight < cutoff | true negative signal). For comparability between PIPs and weights, the weights are standardized to range from 0 to 1.
Figure 5.2: SuSiE PCA has the lowest RRMSE compared to EBMF and sparse PCA across all simulation settings.
The RRMSE, defined in the Simulation equation (2.31), assesses model prediction performance. The base simulation data are the same as in the simulation setting of Figure 1. For each scenario in (A-D), we vary one parameter at a time to generate the simulation data while fixing the other three, then input the true parameters (N, P, K, L) into the models. Finally, we compute the RRMSE based on equation (2.31) and plot it as a function of N, P, K, and L.
Figure 5.3: SuSiE PCA and EBMF have lower Procrustes error for the latent factor Z than sparse PCA across all simulation settings.
The Procrustes error for the latent factor Z is computed in the same manner as for the loadings (Figure 1) using equation (2.30). For each scenario in (A-D), we vary one parameter at a time to generate the simulation data while fixing the other three, then input the true parameters (N, P, K, L) into the models. Finally, we compute the Procrustes error of Z based on equation (2.30) and plot it as a function of N, P, K, and L.
Figure 5.4: SuSiE PCA remains the fastest method, on either CPU or GPU, compared with sparse PCA and EBMF across all simulation settings.
All analyses were performed on a high-performance computing center with the same CPU (AMD EPYC 7302 16-Core Processor) or GPU (Nvidia Tesla A40). Note that the platform on which the runtime data in this figure were collected differs from that in Table 1, so the two are not comparable.
Figure 5.5: Held-out log-likelihood in simulations. SuSiE PCA exhibits consistently higher log-likelihood than sparse PCA on held-out data in simulations.
Figure 5.6: Percent of variance explained (PVE) by 27 factors from SuSiE PCA in the GTEx Z-score summary data.
PVE is a measure of the variance explained by the model and is computed based on equation (3.1).
Figure 5.7: The MLE of the precision parameter log τ0kl in SuSiE PCA becomes extremely large for over-specified single effects in the GTEx Z-score summary data.
τ0kl is the inverse variance of the random variable wkl. When an excessive number of single effects is specified in SuSiE PCA, the MLE of the corresponding τ0kl becomes extremely large and, as a result, shrinks the redundant single effects to 0.
Figure 5.8: Posterior weights by factor from SuSiE PCA across different tissues in the GTEx Z-score summary data.
The posterior weights (or loadings) reflect the strength of each tissue's contribution to a factor. L is set to 18, meaning each factor has at most 18 tissues with non-zero effects.
Figure 5.9: Posterior inclusion probabilities (PIPs) by factor from SuSiE PCA across different tissues in the GTEx Z-score summary data.
Most PIPs are exactly 1 across tissues and factors, implying the model is quite confident about which tissues contribute to each factor.
Figure 5.10: Scatterplot of latent factor values of z1 vs. z0 in the GTEx Z-score summary data.
Latent factor values refer to the posterior means of the latent factor Z. Each point represents a specific gene; the genes with the top 5 largest absolute latent factor values are shown as labeled red points. The "outlier" gene DDT has been found to be associated with testicular cancer.
Figure 5.11: The loadings from sparse PCA in the GTEx Z-score data, which are similar to those observed in SuSiE PCA.
Figure 5.12: Total percent of variance explained (PVE) as a function of the number of latent dimensions K (A) and the number of single effects L (B) in SuSiE PCA.
(A) We first fixed L = 300 and varied K from 6 to 12. The increase in total PVE between consecutive values of K becomes smaller after K reaches 10. (B) We then fixed K = 10 and varied L from 200 to 800. Although the total PVE reaches its maximum at L = 700, only the first three components have 600 downstream genes with PIP > 0.9, while the remaining components have only 200-400 such genes. Comparing the results between L = 300 and L = 700, the smaller L retains the same top significant downstream genes relevant to each component. Considering the parsimony and interpretability of the model, we finally chose K = 10 and L = 300.
Figure 5.13: Percent of variance explained (PVE) by 10 factors from SuSiE PCA in gene expression data from the Perturb-seq experiment.
PVE is a measure of the variance explained by the model and is computed based on equation (3.1).
Figure 5.14: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 1 in the Perturb-seq data.
Figure 5.15: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 2 in the Perturb-seq data.
Figure 5.16: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 3 in the Perturb-seq data.
Figure 5.17: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 4 in the Perturb-seq data.
Figure 5.18: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 5 in the Perturb-seq data.
Figure 5.19: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 6 in the Perturb-seq data.
Figure 5.20: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 7 in the Perturb-seq data.
Figure 5.21: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 8 in the Perturb-seq data.
Figure 5.22: Factor scores (A) and top enriched pathways from gene set enrichment analysis of downstream genes (B) for SuSiE PCA factor 9 in the Perturb-seq data.
Figure 5.23: Sparsity in the loadings from sparse PCA decreases as alpha decreases.
(A) The total percent of variance explained by sparse PCA in the Perturb-seq data keeps decreasing as alpha (the sparsity parameter) increases. (B) The percent of non-zero weights in the loading matrix (i.e., the sparsity) decreases as alpha increases. The yellow dashed line represents the total PVE (A) and sparsity (B) from SuSiE PCA when L = 300.
Figure 5.24: Gene set enrichment analysis results from the top factor in sparse PCA.
Results from sparse PCA with alpha = 17. The enriched pathways have a similar significance level to those from SuSiE PCA.
B Chapter 3. PerturbVI
B.1 Model
To further incorporate perturbation information into the latent factor model, we treat the latent component $Z$ from SuSiE PCA as a function of the perturbation design matrix $M_{N \times G}$, i.e. $\mathbb{E}[Z] = MB$:
\begin{align}
X \mid Z, W, \sigma^2 &\sim \mathcal{MN}_{N,P}(ZW, I_N, \sigma^2 I_P) \tag{B.1} \\
Z \mid B &\sim \mathcal{MN}_{N,K}(MB, I_N, I_K) \tag{B.2} \\
B &= \sum_{k=1}^{K} \beta_k e_k^\intercal \tag{B.3} \\
\beta_{jk} &\sim \mathcal{N}(0, \sigma^2_\beta)^{\eta_{jk}} \cdot \delta_0^{1 - \eta_{jk}} \tag{B.4} \\
\eta_{jk} &\sim \text{Bernoulli}(p) \tag{B.5} \\
W &= \sum_{k=1}^{K} e_k \mathbf{w}_k^\intercal \tag{B.6} \\
\mathbf{w}_k &= \sum_{l=1}^{L} \mathbf{w}_{kl} \tag{B.7} \\
\mathbf{w}_{kl} &= w_{kl} \gamma_{kl} \tag{B.8} \\
w_{kl} \mid \sigma^2_{0kl} &\sim \mathcal{N}(0, \sigma^2_{0kl}) \tag{B.9} \\
\gamma_{kl} \mid \pi &\sim \text{Multi}(1, \pi) \tag{B.10}
\end{align}
where $\beta_{jk}$ is the effect of the $j$th perturbation on factor $k$ in the perturbation effect matrix $B$, following a spike-and-slab prior, and $\eta_{jk}$ follows a Bernoulli distribution with parameter $p$ that determines the proportion of effects sampled from the slab prior with mean 0 and variance $\sigma^2_\beta$.

To derive the variational distributions for $\beta_{jk}$ and $\eta_{jk}$, we make the following factorization assumptions:
\begin{align*}
Q(B, H) &= \prod_k Q(\beta_k, \eta_k) \\
Q(\beta_k, \eta_k) &= \prod_{j=1}^{G} Q(\beta_{jk}, \eta_{jk}) = \prod_{j=1}^{G} Q(\beta_{jk} \mid \eta_{jk} = 1) Q(\eta_{jk} = 1)
\end{align*}
First, the complete-data log-likelihood of data and parameters is given by
\begin{align*}
\ell_c(\sigma^2, \{\sigma^2_{0kl}\}_{k,l}, \{\pi_k\}_k \mid X, Z, W, B)
&= \log \Pr(X \mid Z, W, \sigma^2) + \log \Pr(Z \mid B) + \log \Pr(B) + \log \Pr(W \mid \sigma_0^2, \pi_k) \\
&= \log \mathcal{MN}_{n,p}(X \mid ZW, I_n, I_p \sigma^2) + \log \mathcal{MN}_{n,k}(Z \mid MB, I_n, I_k) \\
&\quad + \sum_k \sum_j \left[ \eta_{jk} \log \mathcal{N}(\beta_{jk} \mid 0, \sigma^2_\beta) + (1 - \eta_{jk}) \log \delta_0 \right] + \sum_k \sum_j \eta_{jk} \log p \\
&\quad + \sum_{l=1}^{L} \sum_{k=1}^{K} \left[ \log \text{Multi}(\gamma_{kl} \mid 1, \pi_k) + \log \mathcal{N}(w_{kl} \mid 0, \sigma^2_{0kl}) \right]
\end{align*}
The helpful definitions, including the expansions of the first and second moments of $W$ and $Z$ as well as the expansion of the log-likelihood function, are the same as for SuSiE PCA; please see Appendix A.
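To make the added spike-and-slab layer (B.2)–(B.5) concrete, here is a small NumPy sketch of sampling the design matrix $M$, the perturbation effects $B$, and the latent factors $Z$; the dimensions and hyperparameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, G, K = 200, 10, 3            # cells, perturbation targets, factors
p_slab, sigma2_beta = 0.3, 1.0  # slab probability and slab variance (illustrative)

# Perturbation design: each cell receives one of G perturbations (one-hot rows)
M = np.eye(G)[rng.integers(0, G, size=N)]

# Spike-and-slab perturbation effects B (G x K), eqs. (B.3)-(B.5):
# eta_jk ~ Bernoulli(p); beta_jk is 0 (spike) or N(0, sigma2_beta) (slab)
eta = rng.binomial(1, p_slab, size=(G, K))
B = eta * rng.normal(0.0, np.sqrt(sigma2_beta), size=(G, K))

# Z | B ~ MN(MB, I_N, I_K), eq. (B.2): perturbations shift the latent means
Z = M @ B + rng.standard_normal((N, K))
```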
B.2 Derivation of Variational Distributions

In this section, we formally present the detailed derivation of the variational distributions of all variables in the model, including $Q(Z)$, $Q(\mathbf{w}_{kl} \mid \gamma_{kli} = 1)$, and $Q(\gamma_{kl})$. The derivation of the optimal variational distribution is based on the mean-field approximation (see Methods) and the corresponding equation (2.10). For ease of notation, let $\tau = \frac{1}{\sigma^2}$ and $\tau_0 = \frac{1}{\sigma_0^2}$.

First, we derive $\log Q(Z \mid B)$:
\begin{align*}
\log Q(Z \mid B) &= \mathbb{E}_{\neg Z}\left[ \ell_c(\sigma^2, \sigma_0^2, \{\pi_k\}_k \mid X, Z, W) \right] \\
&= \mathbb{E}_{\neg Z}\left[ \log \mathcal{MN}_{n,p}(X \mid ZW, I_n, I_p \sigma^2) \right] + \log \mathcal{MN}_{n,k}(Z \mid MB, I_n, I_k) \\
&= -\frac{\tau}{2} \left[ -\text{tr}(X^\intercal Z \mathbb{E}[W]) - \text{tr}(\mathbb{E}[W^\intercal] Z^\intercal X) + \text{tr}(Z^\intercal Z\, \mathbb{E}[W W^\intercal]) \right] - \frac{1}{2} \left[ \text{tr}(Z^\intercal Z) - 2\,\text{tr}(Z \mathbb{E}[B^\intercal] M^\intercal) \right] + O(1) \\
&= -\frac{1}{2} \left[ \text{tr}\left( Z^\intercal Z (\tau \mathbb{E}[W W^\intercal] + I_k) \right) - 2\,\text{tr}\left( Z \left( \tau \mathbb{E}[W] X^\intercal + \mathbb{E}[B^\intercal] M^\intercal \right) \right) \right] + O(1) \\
&= -\frac{1}{2} \Big[ \text{tr}\big( Z \underbrace{(\tau \mathbb{E}[W W^\intercal] + I_k)}_{\Sigma_Z^{-1}} Z^\intercal \big) - 2\,\text{tr}\big( Z \Sigma_Z^{-1} \underbrace{\Sigma_Z \left( \tau \mathbb{E}[W] X^\intercal + \mathbb{E}[B^\intercal] M^\intercal \right)}_{\mu_Z^\intercal} \big) \Big] + O(1) \\
&= -\frac{1}{2} \text{tr}\left[ \Sigma_Z^{-1} (Z - \mu_Z)^\intercal (Z - \mu_Z) \right] + O(1) \quad \Rightarrow \\
Q(Z) &= \mathcal{MN}_{n,k}(Z \mid \mu_Z, I_n, \Sigma_Z)
\end{align*}
Second, we derive $\log Q(w_{kl} \mid \gamma_{kli} = 1)$. Because the conditional model for $X$ given $Z$ and $W$ is unchanged, this derivation is identical to that for SuSiE PCA (Appendix A.2), yielding
\begin{align*}
Q(w_{kl} \mid \gamma_{kli} = 1) = \mathcal{N}(\mu_{w_{kl}}, \sigma^2_{w_{kl}}), \qquad \sigma^2_{w_{kl}} = \left( \tau \mathbb{E}[Z_k^\intercal Z_k] + \tau_0 \right)^{-1}
\end{align*}
and, updating $\mathbf{w}_{kl}$ for all features at once,
\begin{align*}
\mu_{\mathbf{w}_{kl}} = \tau \sigma^2_{w_{kl}} \mathbb{E}[R_{kl}^\intercal Z_k], \qquad \Sigma_{\mathbf{w}_{kl}} = \sigma^2_{w_{kl}} I_p
\end{align*}
Next, we derive $\log Q(\gamma_{kl})$. Note that $\tau R_{kl}^\intercal \mathbb{E}[Z_k] = \mathbb{E}[\mathbf{w}_{kl} \mid \gamma_{kl}] / \sigma^2_{w_{kl}} = \mu_{\mathbf{w}_{kl}} / \sigma^2_{w_{kl}}$. The derivation again mirrors Appendix A.2, with the prior $\pi$ replaced by the factor-specific prior $\pi_k$, giving
\begin{align*}
\log Q(\gamma_{kli} = 1) &= \frac{1}{2\sigma^2_{w_{kl}}} \mathbb{E}[w_{kl} \mid \gamma_{kli} = 1]^2 + \log \pi_{k,i} + O(1) \quad \Rightarrow \\
\log \tilde{\alpha}_{kli} &= \log \pi_{k,i} - \log \mathcal{N}(0 \mid \mu_{w_{kli}}, \sigma^2_{w_{kl}}), \qquad
Q(\gamma_{kl}) = \text{Multi}\left( 1, \alpha_{kl} = \text{softmax}(\log \tilde{\alpha}_{kl}) \right)
\end{align*}
Then, we derive $\log Q(\beta_{gk} \mid \eta_{gk} = 1)$. Let $M_g$ be the $g$th column vector of the perturbation design matrix, $Z_k$ the $k$th column vector of the latent component $Z$, and $Z'_k = Z_k - \sum_{j \neq g} M_j \beta_{jk}$.
\begin{align*}
\log Q(\beta_{gk} \mid \eta_{gk} = 1)
&= \mathbb{E}_{\neg \beta_{gk}}\left[ -\frac{1}{2} (Z'_k - M_g \beta_{gk})^\intercal (Z'_k - M_g \beta_{gk}) \right] - \frac{1}{2\sigma^2_\beta} \mathbb{E}_{\neg \beta_{gk}}[\eta_{gk} \beta^2_{gk}] + O(1) \\
&= -\frac{1}{2} \left[ -2 \mathbb{E}[Z'^\intercal_k] M_g \beta_{gk} + M_g^\intercal M_g \beta^2_{gk} \right] - \frac{1}{2\sigma^2_\beta} \sum_{j \neq g} p_{jk} \mathbb{E}[\beta^2_{jk}] - \frac{1}{2\sigma^2_\beta} \beta^2_{gk} + O(1) \\
&= -\frac{1}{2} \Big[ \underbrace{\Big( \frac{1}{\sigma^2_\beta} + M_g^\intercal M_g \Big)}_{1/s^2_{gk}} \beta^2_{gk} - 2 \mathbb{E}[Z'^\intercal_k] M_g \beta_{gk} \Big] + O(1) \\
&= -\frac{1}{2 s^2_{gk}} \Big[ \beta^2_{gk} - 2 \underbrace{s^2_{gk} \mathbb{E}[Z'^\intercal_k] M_g}_{\mu_{gk}} \beta_{gk} \Big] + O(1)
\end{align*}
Therefore we have
\begin{align*}
Q(\beta_{gk} \mid \eta_{gk} = 1) = \mathcal{N}(\beta_{gk} \mid \mu_{gk}, s^2_{gk})
\end{align*}
where
\begin{align*}
\mu_{gk} = s^2_{gk} \mathbb{E}[Z'^\intercal_k] M_g, \qquad
s^2_{gk} = \Big( \frac{1}{\sigma^2_\beta} + M_g^\intercal M_g \Big)^{-1}, \qquad
\mathbb{E}[Z'^\intercal_k] M_g = \mathbb{E}[Z_k]^\intercal M_g - \sum_{j \neq g} M_j^\intercal M_g \mathbb{E}[\beta_{jk}]
\end{align*}
Finally, we derive $\log Q(\eta_{gk} = 1)$:
\begin{align*}
\log Q(\eta_{gk} = 1)
&= -\frac{1}{2} \mathbb{E}_{\neg \eta_{gk}}\left[ -2 Z'^\intercal_k M_g \beta_{gk} + M_g^\intercal M_g \beta^2_{gk} \right] - \frac{1}{2\sigma^2_\beta} \mathbb{E}[\eta_{gk} \beta^2_{gk}] + \mathbb{E}[\eta_{gk}] \log p \\
&= \frac{\mu^2_{gk}}{2 s^2_{gk}} + \log p + O(1)
\end{align*}
Therefore $\log \hat{p}_{gk} = \log p - \log \mathcal{N}(0 \mid \mu_{gk}, s^2_{gk})$, and
\begin{align*}
Q(\eta_{gk} = 1) = \text{Bernoulli}\left( p_{gk} = \text{softmax}(\log \hat{p}_{gk}) \right)
\end{align*}
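The updates for $(\beta_{gk}, \eta_{gk})$ can be sketched as a single coordinate step in NumPy. The statistics passed in are assumed to be precomputed, and the binary "softmax" is written here as a sigmoid of the slab-vs-spike log-odds, one common convention for normalizing $\hat{p}_{gk}$; this is an illustrative sketch rather than the PerturbVI implementation:

```python
import numpy as np

def update_beta_eta(E_Zk, M, E_beta_k, g, sigma2_beta, log_p, log_1mp):
    """One coordinate update for (beta_gk, eta_gk).

    E_Zk     : (N,) posterior mean of the k-th latent column Z_k
    M        : (N, G) perturbation design matrix
    E_beta_k : (G,) current posterior means E[beta_jk] (spike-and-slab averaged)
    """
    Mg = M[:, g]
    s2 = 1.0 / (1.0 / sigma2_beta + Mg @ Mg)                  # s^2_gk
    # E[Z'_k]^T M_g = E[Z_k]^T M_g - sum_{j != g} M_j^T M_g E[beta_jk]
    resid = E_Zk @ Mg - (M.T @ Mg) @ E_beta_k + (Mg @ Mg) * E_beta_k[g]
    mu = s2 * resid                                           # mu_gk
    # log p_hat_gk = log p - log N(0 | mu_gk, s^2_gk), normalized against the spike
    log_odds = (log_p - log_1mp) + 0.5 * np.log(2 * np.pi * s2) + 0.5 * mu**2 / s2
    p_hat = 1.0 / (1.0 + np.exp(-log_odds))
    return mu, s2, p_hat
```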
B.3 Derivation of Evidence Lower Bound (ELBO)

To compute the maximum likelihood estimates of the variance terms $\tau$ and $\tau_0$ and determine the likelihood of the data under PerturbVI, we write out the Evidence Lower Bound (ELBO):
\begin{align*}
\text{ELBO}(W, Z) &= \mathbb{E}_Q\left[ \log \Pr(X, Z, W) - \log Q(Z, W) \right] \\
&= \mathbb{E}_Q[\log \Pr(X \mid Z, W)] + \mathbb{E}_Q[\log \Pr(Z, W) - \log Q(Z, W)] \\
&= \mathbb{E}_Q[\log \Pr(X \mid Z, W)] + \mathbb{E}_{Q(Z \mid B)}[\log \Pr(Z \mid B) - \log Q(Z \mid B)] \\
&\quad + \sum_k \sum_j \hat{p}_{jk} \mathbb{E}_{Q(\beta_{jk} \mid \eta_{jk})}[\log \Pr(\beta_{jk} \mid \eta_{jk}) - \log Q(\beta_{jk} \mid \eta_{jk})] \\
&\quad + \sum_k \sum_j \mathbb{E}_Q[\eta_{jk} \log p - \eta_{jk} \log \hat{p}_{jk}] \\
&\quad + \mathbb{E}_{Q(W, \Gamma)}[\log \Pr(W, \Gamma) - \log Q(W, \Gamma)]
\end{align*}
The first term is the expectation of the data log-likelihood with respect to all the parameters in the model:
\begin{align*}
\mathbb{E}_Q[\log \Pr(X \mid Z, W, \Gamma)]
&= \mathbb{E}_Q\left[ -\frac{1}{2\sigma^2} \text{tr}\left( (X - ZW)^\intercal (X - ZW) \right) - \frac{np}{2} \log(2\pi\sigma^2) \right] \\
&= -\frac{1}{2\sigma^2} \left[ \text{tr}(X^\intercal X) - 2\,\text{tr}(X^\intercal \mathbb{E}[Z] \mathbb{E}[W]) + \text{tr}(\mathbb{E}[Z^\intercal Z] \mathbb{E}[W W^\intercal]) \right] - \frac{np}{2} \log(2\pi\sigma^2)
\end{align*}
The second term is the negative KL divergence of $Z$:
\begin{align*}
\mathbb{E}_{Q(Z)}[\log \Pr(Z) - \log Q(Z)]
&= -\frac{1}{2} \left[ \text{tr}(\mathbb{E}[Z Z^\intercal]) - 2\,\text{tr}(\mu_Z^\intercal M \mathbb{E}[B]) + \mathbb{E}[\text{tr}(B^\intercal M^\intercal M B)] - n \log |\Sigma_Z| - nk \right] \\
&= -\frac{1}{2} \left[ \text{tr}(\mathbb{E}[Z Z^\intercal]) - 2\,\text{tr}(\mu_Z^\intercal M \mathbb{E}[B]) + \text{tr}(M^\intercal M \mathbb{E}[B B^\intercal]) - n \log |\Sigma_Z| - nk \right]
\end{align*}
Note that $\mathbb{E}[B B^\intercal] = \text{diag}\left( \sum_k \mathbb{E}[\beta^2_{1k}], \ldots, \sum_k \mathbb{E}[\beta^2_{Gk}] \right)$, and
\begin{align*}
\mathbb{E}[\beta^2_{jk}] &= \mathbb{E}[\beta^2_{jk} \mid \eta_{jk} = 1] \,\mathbb{E}[\eta_{jk} = 1] \\
&= \left( \mathbb{E}^2[\beta_{jk} \mid \eta_{jk} = 1] + \mathbb{V}[\beta_{jk} \mid \eta_{jk} = 1] \right) \mathbb{E}[\eta_{jk} = 1] \\
&= \left( \mu^2_{jk} + s^2_{jk} \right) \hat{p}_{jk}
\end{align*}
The next term is the negative KL divergence for $\beta_{jk}$:
\begin{align*}
\mathbb{E}_{Q(\beta_{jk} \mid \eta_{jk} = 1)}[\log \Pr(\beta_{jk} \mid \eta_{jk} = 1) - \log Q(\beta_{jk} \mid \eta_{jk} = 1)]
&= \mathbb{E}\left[ -\frac{\beta^2_{jk}}{2\sigma^2_\beta} + \frac{(\beta_{jk} - \mu_{jk})^2}{2 s^2_{jk}} + \frac{1}{2} \log \frac{s^2_{jk}}{\sigma^2_\beta} \right] \\
&= -\frac{\mathbb{E}[\beta^2_{jk}]}{2\sigma^2_\beta} + \frac{\mathbb{E}[\beta^2_{jk}] - \mu^2_{jk}}{2 s^2_{jk}} + \frac{1}{2} \log \frac{s^2_{jk}}{\sigma^2_\beta} \\
&= -\frac{\mu^2_{jk} + s^2_{jk}}{2\sigma^2_\beta} + \frac{1}{2} + \frac{1}{2} \log \frac{s^2_{jk}}{\sigma^2_\beta}
\end{align*}
followed by the negative KL divergence for $\eta_{jk}$:
\begin{align*}
\mathbb{E}_{Q(\eta_{jk} = 1)}[\eta_{jk} \log p - \eta_{jk} \log \hat{p}_{jk}] = \mathbb{E}[\eta_{jk}] \left( \log p - \log \hat{p}_{jk} \right) = \hat{p}_{jk} \left( \log p - \log \hat{p}_{jk} \right)
\end{align*}
The last term contains the joint negative KL divergence of $W$ and $\Gamma$. Its expression is exactly the same as for SuSiE PCA; please see Appendix A.3.
B.4 Supplementary Figures
Figure 5.25: PerturbVI performs best when the perturbation effects are denser.
Figure 5.26: PerturbVI is more robust to an over-specified number of components and single effects.
(A-C) The simulation data are generated with K = 4; we run PerturbVI and GSFA with K = 3, 4, 6. The reference line shows the results from the correctly specified model. (D-F) The simulation data are generated with L = 150; we run PerturbVI and GSFA with L = 100, 125, 175, 200. The reference line shows the results from the correctly specified model.
Figure 5.27: PerturbVI upstream effect matrix B in CROP-seq data from LUHMES cells (all
targets included)
Figure 5.28: PerturbVI retains neuron-related information with a higher enrichment ratio than GSFA in the enrichment analysis of downstream genes.
[Figure 5.29 panels: fold enrichment (0-6) of the neuron-related pathway by perturbation target; PerturbVI panel: ADNP, ARID1B, ASH1L, CHD2, POGZ, PTEN, SETD5; GSFA panel: ARID1B, ASH1L, CHD2, PTEN, SETD5.]
Figure 5.29: PerturbVI identified more neuron-related enrichment pathways than GSFA
across perturbations in CROP-seq data from LUHMES cells
Figure 5.30: PerturbVI identifies the regulatory networks within the neuron differentiation
process (all genes in the pathways)
C Chapter 4. Dual Variational Autoencoders
C.1 Model Improvements
There are several possible changes and improvements to the model described above. First, in the simulations and real data analyses we conducted, we replaced the decoder neural networks with a linear function for more straightforward interpretation. This change was made because we have not implemented any feature-importance algorithms, which poses a significant difficulty in interpreting decoder networks. However, Svensson et al. (2020) demonstrated that using a flexible non-linear inference model alongside a linear decoder allows the model to benefit from the efficiency of the VAE while retaining its interpretability [91]. Second, similar to probabilistic CCA, the original Dual-VAE model assumes a single shared latent space between the two data views. In reality, each data type contains exclusive information that should be preserved in its own private latent space; that information is currently absorbed into the error terms. To better capture the private latent spaces of each data view, we propose the following construction for the Dual-VAE:
Figure 5.31: Dual variational autoencoders with a private space to retain the view-specific information of each data view.
C.2 Supplementary Figures
Figure 5.32: Components 1 and 9 in Dual-VAE show weak correlation with the sCCA components (A); components 3, 5, and 6 show a linear correlation with the sCCA components (B).
We perform a Procrustes transformation of the Dual-VAE decoder weights onto the sCCA weight vectors to align the latent spaces between the two methods, then draw scatter plots of the latent component values for matched components from Dual-VAE and sCCA.
Figure 5.33: Components 2, 4, 7, 8, and 10 in Dual-VAE show non-linear relationships with the sCCA components.
[Figure 5.34 panels: bar charts of −log10(FDR) for the top enriched metabolic pathways of each sCCA component; panel titles report the number of significant pathways per component: K = 1 (1), K = 2 (6), K = 3 (5), K = 4 (13), K = 5 (9), K = 6 (8), K = 7 (3), K = 8 (0), K = 9 (3), K = 10 (5). Super-pathway legend: Amino Acid, Carbohydrate, Partially Characterized Molecules, Peptide, Xenobiotics, Food Component/Plant.]
Figure 5.34: sCCA enrichment analysis shows that the Lipid super pathway (e.g., Sphingomyelins) dominated the enriched metabolite modules
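For context, the −log10(FDR) values on the x-axes of these panels come from multiple-testing correction of per-pathway enrichment p-values. A minimal Benjamini–Hochberg sketch (the p-values below are hypothetical, not the study's data):

```python
import math

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR q-values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        q[i] = prev
    return q

# Hypothetical enrichment p-values for a handful of pathways.
pvals = [0.0001, 0.004, 0.03, 0.2, 0.6]
fdr = bh_fdr(pvals)
neglog = [round(-math.log10(x), 2) for x in fdr]
print(neglog)  # → [3.3, 2.0, 1.3, 0.6, 0.22]
```

A pathway is typically called significant when its adjusted value falls below 0.05, i.e., −log10(FDR) > 1.3, which is the threshold implied by the "signif." counts in the panels above.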
Abstract
Gene Regulatory Networks (GRNs) consist of complex interactions among DNA, RNA, and proteins that regulate gene activity through stages such as transcription, RNA splicing, translation, and post-translational modification. Dysregulation of these networks can drive disease risk, so inferring regulatory modules within GRNs is crucial for understanding cellular functions and disease mechanisms. Current approaches for inferring regulatory modules within GRNs are limited by scalability and linearity assumptions. This dissertation addresses these limitations through three projects. The first project, SuSiE PCA, integrates the sum of single effects (SuSiE) model with probabilistic principal component analysis (PPCA) to efficiently compute interpretable latent factors that capture regulatory modules from Perturb-seq data. The second project, PerturbVI, scales to genome-wide perturbations by incorporating perturbation information from Perturb-seq data into the latent components derived from SuSiE PCA to accurately estimate perturbation effects. The third project, Dual-VAE, employs a deep learning model to capture non-linear relationships within complex regulatory networks by associating genetic risk variants with downstream gene activities. These advances provide more efficient, scalable, and interpretable tools for inferring regulatory modules from Perturb-seq and multi-omics data, offering insights into the complex regulatory mechanisms underlying human disease.
Asset Metadata
Creator
Yuan, Dong (author)
Core Title
Scalable latent factor models for inferring genetic regulatory networks
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Degree Conferral Date
2024-08
Publication Date
08/22/2024
Defense Date
08/22/2024
Publisher
Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tag
Bayesian inference,factor model,gene regulatory network,OAI-PMH Harvest,variational inference
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Mancuso, Nicholas (committee chair), Gazal, Steven (committee member), Li, Chun (committee member), Pimentel, Harold (committee member), Siegmund, Kimberly (committee member)
Creator Email
dongyuan@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113999GUT
Unique identifier
UC113999GUT
Identifier
etd-YuanDong-13419.pdf (filename)
Legacy Identifier
etd-YuanDong-13419
Document Type
Dissertation
Rights
Yuan, Dong
Internet Media Type
application/pdf
Type
texts
Source
20240823-usctheses-batch-1201 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu