Stochastic Variational Inference as a Solution to the Intractability Problem
of Bayesian Inference
by
Jose Rafael Espinosa Mena
A Thesis Presented to the
FACULTY OF THE USC DORNSIFE COLLEGE OF LETTERS
ARTS AND SCIENCES
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the degree
MASTER OF ARTS
(APPLIED MATHEMATICS)
May 2024
Copyright 2024 Jose Rafael Espinosa Mena
Acknowledgments
I would like to express my deepest gratitude to my thesis committee members, Professor
Richard Arratia, Professor Mohammed Ziane, and Professor Xiaohui Chen, for their help
throughout my research journey. Their patience has been a gift to me during these times. I
am truly fortunate to have had the opportunity to learn from such distinguished scholars.
Additionally, I would like to thank Dr. Jeremias Knoblauch at University College London for his help during my research. His constant support and dedication to my progress have been invaluable in finishing this project.
I also extend my heartfelt thanks to my parents and family, whose love, encouragement,
and unwavering support have been my constant source of strength. Their belief in me has
been a driving force throughout my academic pursuits, and I am forever grateful for their
sacrifices and the values they have instilled in me. Without their support, this achievement
would not have been possible.
Table of Contents
Acknowledgments
List of Algorithms
Abstract
Chapter 1: Introduction
1.1 Bayes' Theorem
1.2 Variational Inference
1.3 An Example of Variational Inference for Variational Autoencoders
1.3.1 Problem Setup
1.3.2 Generative Process
1.3.3 Variational Inference
1.3.4 Gaussian Approximate Posterior
1.3.5 Training and Optimization
1.3.6 Inference and Generation
Chapter 2: Stochastic Variational Inference
2.1 Stochastic Variational Inference
2.2 Minibatches for Stochastic Variational Inference
2.3 Learning Rate Schedules for Stochastic Variational Inference
Chapter 3: Markov Chain Monte Carlo
3.1 Markov Chain Monte Carlo (MCMC)
3.1.1 Metropolis-Hastings Algorithm
3.1.2 Gibbs Sampling
3.1.3 Hamiltonian Monte Carlo
3.1.4 Convergence Diagnostics
3.1.5 Limitations and Challenges
Chapter 4: Comparative Analysis
4.1 Theoretical Properties of MCMC Methods
4.1.1 Asymptotic Convergence
4.1.2 Markov Chain Central Limit Theorem
4.2 Theoretical Properties of Stochastic Variational Inference
4.2.1 Stochastic Optimization
4.2.2 Convergence and Consistency
4.3 Advantages of Stochastic Variational Inference over MCMC
4.3.1 Computational Efficiency
4.3.2 Scalability
4.3.3 Flexibility in Variational Family
4.3.4 Deterministic Approximation
4.4 Conclusion
Bibliography
List of Algorithms
1. SVI with Mini-Batch and Robbins-Monro Schedule
2. Metropolis-Hastings Algorithm
3. Gibbs Sampling Algorithm
4. Hamiltonian Monte Carlo Algorithm
Abstract
Bayes’ theorem has been widely used to estimate posterior distributions for complex data.
However, directly estimating the distribution can be computationally expensive, and in some
cases, intractable. To address this challenge, new techniques such as Variational Inference
have been developed, which approximate the posterior through optimization instead of direct
computation. In this thesis, we analyze Stochastic Variational Inference (SVI), a powerful
framework for approximating the posterior distribution in large-scale and streaming data
settings. SVI combines the benefits of stochastic optimization and variational inference,
enabling efficient and scalable Bayesian inference. We compare SVI with classical Markov
Chain Monte Carlo (MCMC) methods, highlighting the advantages of SVI in terms of computational efficiency and its ability to handle large datasets. This analysis demonstrates
that SVI offers a compelling alternative to MCMC for Bayesian inference in dynamic environments, providing a scalable and efficient solution for real-time data processing.
Chapter 1
Introduction
1.1 Bayes’ Theorem
Bayes’ theorem is a fundamental principle in probability theory and statistics that describes
the probability of an event based on prior knowledge and new evidence [Alexanian 2015].
Theorem 1 (Bayes' Theorem). Let (Ω, F, P) be a probability space, and let H, E ∈ F be two events such that P(E) > 0. Then, the conditional probability of H given E, denoted as P(H|E), is given by:

P(H|E) = P(E|H)P(H) / P(E)    (1.1)

where:
• P(H) is the prior probability of the hypothesis H,
• P(E|H) is the likelihood of the evidence E given the hypothesis H,
• P(E) is the marginal probability of the evidence E, which can be calculated as:

P(E) = P(E|H)P(H) + P(E|H̄)P(H̄)    (1.2)

where H̄ denotes the complement of the event H. Furthermore, if H_1, H_2, …, H_n forms a partition of the sample space Ω, then the marginal probability of the evidence E can be calculated using the law of total probability:

P(E) = Σ_{i=1}^{n} P(E|H_i)P(H_i)    (1.3)
Proof. To prove Bayes' Theorem, we start from the definition of conditional probability:

P(H|E) = P(H ∩ E) / P(E)    (1.4)

where P(H ∩ E) denotes the joint probability of events H and E, and P(E) > 0. By the definition of conditional probability, we also have:

P(E|H) = P(H ∩ E) / P(H)    (1.5)

Rearranging this equation, we get:

P(H ∩ E) = P(E|H)P(H)    (1.6)

Substituting this expression for P(H ∩ E) into the definition of P(H|E), we obtain:

P(H|E) = P(H ∩ E) / P(E)    (1.7)
       = P(E|H)P(H) / P(E)    (1.8)

which is the statement of Bayes' Theorem. Therefore, Bayes' Theorem follows directly from the definition of conditional probability.
Bayes’ theorem is a powerful tool for updating beliefs in light of new evidence. It allows
us to combine our prior knowledge about a hypothesis with the likelihood of observing
the evidence under that hypothesis to obtain a revised probability estimate, known as the
posterior probability.
The theorem is named after the Reverend Thomas Bayes, an 18th-century British statistician and Presbyterian minister. Although Bayes’ original work on the theorem was not
published during his lifetime, it was later rediscovered and popularized by Pierre-Simon
Laplace, a French mathematician and astronomer, in the early 19th century.
The components of Bayes’ theorem can be interpreted as follows:
• Prior probability P(H): This represents the initial degree of belief in the hypothesis
H before considering any evidence. It encapsulates our prior knowledge or subjective
belief about the likelihood of the hypothesis being true.
• Likelihood P(E|H): This term quantifies the probability of observing the evidence
E given that the hypothesis H is true. It measures how compatible the evidence is
with the hypothesis.
• Marginal probability P(E): This is the probability of observing the evidence E
regardless of the hypothesis.
• Posterior probability P(H|E): This is the updated probability of the hypothesis H
after taking into account the observed evidence E. It represents our revised degree of
belief in the hypothesis in light of the new information.
Bayes’ theorem has wide-ranging applications in various fields, including machine learning, data science, and artificial intelligence. It forms the foundation of Bayesian inference,
a powerful framework for reasoning under uncertainty. Some notable applications of Bayes’
theorem include:
• Spam email filtering: Bayes’ theorem can be used to classify emails as spam or not
spam based on the presence of certain words or phrases in the email content.
• Medical diagnosis: Given the observed symptoms and test results, Bayes’ theorem can
be employed to calculate the probability of a patient having a specific disease.
• Parameter estimation: In Bayesian statistics, Bayes’ theorem is used to update the
probability distributions of model parameters as new data becomes available.
• Text classification: Bayes’ theorem can be applied to classify documents into predefined
categories based on the occurrence of specific words or phrases.
• Sensor fusion: In robotics and autonomous systems, Bayes’ theorem is used to combine
information from multiple sensors to obtain a more accurate estimate of the system’s
state.
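To make the update concrete, consider the medical diagnosis application above with hypothetical numbers: a disease with 1% prevalence, a test with 95% sensitivity, and a 5% false-positive rate. A minimal sketch in Python (all values illustrative, not from the thesis):

prior = 0.01                # P(H): prevalence of the disease
likelihood = 0.95           # P(E|H): probability of a positive test if diseased
false_positive = 0.05       # P(E|H̄): probability of a positive test if healthy

# Marginal probability of the evidence, Equation (1.2):
marginal = likelihood * prior + false_positive * (1 - prior)
# Posterior probability, Equation (1.1):
posterior = likelihood * prior / marginal
print(f"P(H|E) = {posterior:.3f}")   # about 0.161

Even with an accurate test, the low prior keeps the posterior below 17%, which is exactly the prior-evidence trade-off the theorem formalizes.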
Despite its power and versatility, Bayes’ theorem has some limitations and challenges.
One major challenge is the determination of appropriate prior probabilities, which can be
subjective and may require domain expertise. Additionally, the computation of the marginal
probability P(E) can be intractable for complex models with many variables, leading to the
need for approximation techniques such as variational inference or Markov chain Monte Carlo
(MCMC) methods.
In summary, Bayes’ theorem is a fundamental principle in probability theory that allows
us to update our beliefs about a hypothesis based on new evidence. It provides a principled
way to combine prior knowledge with observed data, enabling probabilistic reasoning and
decision-making in various domains. Understanding Bayes’ theorem is crucial for anyone
working in fields that involve uncertainty and inference, such as machine learning, data
science, and artificial intelligence.
1.2 Variational Inference
Variational Inference (VI) is a technique used to approximate intractable posterior distributions in Bayesian inference. The main idea behind VI is to transform the inference problem
into an optimization problem by introducing a family of approximate distributions q(z) and
minimizing the Kullback-Leibler (KL) divergence between q(z) and the true posterior distribution p(z|x):

q*(z) = argmin_{q(z)} KL(q(z) ∥ p(z|x))    (1.9)
Theorem 2 (Variational Inference). Let p(x, z) be a joint probability distribution over observed variables x and latent variables z. Maximizing the evidence lower bound (ELBO):
ELBO(q) := E_{q(z)}[log p(x, z)] − E_{q(z)}[log q(z)]    (1.10)
minimizes the KL divergence between the approximate distribution and the true posterior,
relative to the given family [Ganguly and Earp 2021].
Proof. The ELBO is a lower bound on the logarithm of the marginal likelihood p(x), also
known as the log-evidence:
log p(x) = log ∫ p(x, z) dz    (1.11)

The ELBO is derived using Jensen's inequality and the properties of the KL divergence:

log p(x) = log ∫ p(x, z) dz    (1.12)
         = log ∫ q(z) (p(x, z) / q(z)) dz    (1.13)
         ≥ ∫ q(z) log(p(x, z) / q(z)) dz    (1.14)
         = E_{q(z)}[log p(x, z)] − E_{q(z)}[log q(z)]    (1.15)
         = ELBO(q)    (1.16)
The inequality in the third line follows from Jensen’s inequality, which states that for a
concave function f (such as the logarithm) and a random variable X:
E[f(X)] ≤ f(E[X]) (1.17)
Applying Jensen's inequality to the logarithm of the expectation of p(x, z)/q(z) under q(z) yields:

log E_{q(z)}[p(x, z) / q(z)] ≥ E_{q(z)}[log(p(x, z) / q(z))]    (1.18)
The ELBO consists of two terms: the expected log-likelihood of the data under the approximate distribution, and the negative KL divergence between the approximate distribution
and the prior distribution:
ELBO(q) = E_{q(z)}[log p(x, z)] − E_{q(z)}[log q(z)]    (1.19)
        = E_{q(z)}[log p(x|z)] + E_{q(z)}[log p(z)] − E_{q(z)}[log q(z)]    (1.20)
        = E_{q(z)}[log p(x|z)] − KL(q(z) ∥ p(z))    (1.21)
Maximizing the ELBO with respect to the parameters of the approximate distribution q(z)
effectively minimizes the KL divergence between q(z) and the true posterior p(z|x), as the
KL divergence is non-negative and reaches its minimum value of zero when q(z) = p(z|x).
Variational inference has emerged as a powerful and widely used technique for approximating intractable posterior distributions in Bayesian models. In many real-world applications, the true posterior distribution p(z|x) is computationally intractable due to the high
dimensionality of the latent variables z or the complexity of the model. Variational inference addresses this challenge by introducing a family of approximate distributions q(z) and
finding the best approximation within this family to the true posterior. Minimizing the KL
divergence encourages the approximate distribution q(z) to be close to the true posterior
p(z|x).
The choice of the variational family q(z) is crucial in variational inference. The most
common choice is the mean-field variational family, where the approximate distribution is assumed to factorize over the latent variables:

q(z) = ∏_{i=1}^{M} q_i(z_i)    (1.22)
Here, M is the number of latent variables, and each qi(zi) is an independent variational
factor. The mean-field assumption simplifies the optimization problem and allows for efficient
updates of the variational factors.
The variational parameters of the approximate distribution q(z) are typically optimized
using gradient-based methods, such as stochastic gradient descent (SGD) or coordinate ascent variational inference (CAVI). In SGD, the ELBO is approximated using mini-batches of
data, and the variational parameters are updated using the gradients of the ELBO with respect to these parameters. CAVI, on the other hand, updates each variational factor in turn
while keeping the others fixed, leveraging the mean-field assumption to derive closed-form
updates.
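To illustrate the optimization view in practice, the following sketch fits a Gaussian q(z) = N(µ, σ²) to the posterior of a conjugate normal-normal model by stochastic gradient ascent on a Monte Carlo estimate of the ELBO, using the pathwise (reparameterization) gradient estimator discussed further in Section 1.3.5. The model and all constants are hypothetical, chosen so that the exact posterior is available for comparison:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)   # synthetic observed data
prior_mu, prior_var, lik_var = 0.0, 10.0, 1.0  # hypothetical model constants

mu, log_sigma = 0.0, 0.0                       # parameters of q(z) = N(mu, sigma^2)
lr, n_samples = 1e-3, 32
for step in range(2000):
    eps = rng.standard_normal(n_samples)
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps                       # reparameterized samples from q
    # d/dz log p(x, z) for the normal-normal model:
    dlogp = (prior_mu - z) / prior_var + np.sum(x[:, None] - z, axis=0) / lik_var
    # Pathwise gradients of the ELBO; the +1 is the gradient of the
    # Gaussian entropy term with respect to log_sigma.
    grad_mu = np.mean(dlogp)
    grad_log_sigma = np.mean(dlogp * eps * sigma) + 1.0
    mu += lr * grad_mu                         # gradient ascent on the ELBO
    log_sigma += lr * grad_log_sigma

exact_var = 1.0 / (1.0 / prior_var + len(x) / lik_var)
exact_mu = exact_var * (prior_mu / prior_var + np.sum(x) / lik_var)
print(f"VI:    mu = {mu:.3f}, sigma = {np.exp(log_sigma):.3f}")
print(f"Exact: mu = {exact_mu:.3f}, sigma = {np.sqrt(exact_var):.3f}")

Because this model is conjugate, the printed VI estimates can be checked against the closed-form posterior; in realistic models only the left-hand side is available.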
Variational inference has several advantages over other inference techniques, such as
Markov Chain Monte Carlo (MCMC) methods. VI is often faster and more scalable than
MCMC, as it transforms the inference problem into an optimization problem that can be
solved using efficient optimization algorithms. Additionally, VI provides a deterministic
approximation to the posterior distribution, which can be more convenient for certain applications, such as model selection and prediction.
However, variational inference also has some limitations. The quality of the approximation depends on the expressiveness of the variational family q(z). If the variational family is
not rich enough to capture the true posterior distribution, the approximation may be poor.
This can lead to underestimating the posterior uncertainty and biased parameter estimates.
Another challenge in variational inference is the choice of the prior distribution. The prior
distribution plays a crucial role in Bayesian inference, as it encodes our prior knowledge or
beliefs about the latent variables. In some cases, the prior distribution may not be conjugate
to the likelihood, making the updates of the variational factors more difficult. Non-conjugate
priors often require additional approximations or numerical integration techniques.
Despite these challenges, variational inference has been successfully applied to a wide
range of models and tasks, including topic modeling, Gaussian processes, Bayesian neural networks, and probabilistic graphical models. It has become a standard tool in the
Bayesian inference toolbox, enabling researchers and practitioners to handle complex and
high-dimensional models efficiently.
Recent advancements in variational inference have focused on scalability, flexibility, and
automation. Stochastic variational inference (SVI) has been proposed to handle large-scale
datasets by processing mini-batches of data and updating the variational parameters incrementally. Black-box variational inference (BBVI) automates the derivation of the ELBO
and the variational updates, making it easier to apply VI to new models. Furthermore, the
integration of variational inference with deep learning techniques, such as variational autoencoders and variational Bayes neural networks, has opened up new possibilities for learning
rich and expressive approximate posteriors.
In conclusion, variational inference is a powerful and widely used technique for approximating intractable posterior distributions in Bayesian models. By introducing a family of
approximate distributions and optimizing the evidence lower bound (ELBO), VI transforms
the inference problem into an optimization problem. While VI has some limitations, such
as the potential for underestimating posterior uncertainty, it offers a scalable and efficient
alternative to MCMC methods. With ongoing research and advancements, variational inference continues to be a valuable tool for Bayesian inference in various domains, including
machine learning, statistics, and data science.
1.3 An Example of Variational Inference for Variational
Autoencoders
Kingma and Welling [Diederik P Kingma and Welling 2022] developed VAEs to overcome
the limitations of traditional autoencoding neural networks, which struggle to generate new
data points. They work by compressing the inputs into a probabilistic latent space, and
can thus sample new information from it, which closely resembles but is not identical to
the original inputs. In this subsection, we will delve into a detailed example to illustrate
how variational inference is applied in the context of VAEs and explore the mathematical
formulation behind it.
1.3.1 Problem Setup
Let us consider a dataset D = {x^{(1)}, x^{(2)}, …, x^{(N)}} consisting of N independently and identically distributed (i.i.d.) samples of a continuous random variable x ∈ R^d. Our objective is to learn a latent representation z ∈ R^k for each data point, where k < d. The latent space should capture the essential information present in the input data while being more compact.
1.3.2 Generative Process
In the VAE framework, we assume that each data point x^{(i)} is generated from a corresponding latent variable z^{(i)} through a generative process defined by the probability distribution pθ(x|z). The joint distribution over the observed data and the latent variables can be written as:

pθ(x, z) = pθ(x|z)p(z),    (1.23)
where p(z) is the prior distribution over the latent variables. In VAEs, the prior is typically
chosen to be a standard multivariate Gaussian distribution, i.e., p(z) = N (z; 0, I).
1.3.3 Variational Inference
The true posterior distribution pθ(z|x), which represents the distribution of the latent variables given the observed data, is often intractable to compute directly. Variational inference
addresses this challenge by introducing a variational approximate posterior qϕ(z|x), parameterized by ϕ. By Theorem 2, finding the parameters θ and ϕ that maximize the evidence lower bound (ELBO) ensures that the true posterior is approximated as closely as possible within the chosen variational family.
In other words, the ELBO consists of two terms: the reconstruction term and the KL
divergence term. The reconstruction term encourages the VAE to learn a decoder that can
accurately reconstruct the input data from the latent representation. The KL divergence
term acts as a regularizer, ensuring that the approximate posterior remains close to the prior
distribution.
1.3.4 Gaussian Approximate Posterior
In practice, the approximate posterior qϕ(z|x) is often chosen to be a multivariate Gaussian
distribution with a diagonal covariance matrix:
qϕ(z|x) = N(z; µϕ(x), σϕ²(x) I),    (1.24)

where µϕ(x) and σϕ²(x) are the mean and variance vectors, respectively, output by the encoder neural network with parameters ϕ. The encoder takes the input data x and maps it to the parameters of the approximate posterior distribution.
1.3.5 Training and Optimization
During training, the VAE jointly optimizes the parameters of the encoder (ϕ) and the decoder (θ) to maximize the ELBO using stochastic gradient descent. However, the sampling process of z from the approximate posterior is not differentiable, which prevents direct backpropagation. To overcome this issue, the reparameterization trick is employed.
The reparameterization trick allows for the sampling of z to be expressed as a deterministic function of the parameters and a noise variable:
z = µϕ(x) + ϵ ⊙ σϕ(x), (1.25)
where ϵ ∼ N (0, I) is a noise variable sampled from a standard Gaussian distribution, and
⊙ denotes element-wise multiplication. By expressing the sampling process in this way,
gradients can flow through the deterministic components, enabling backpropagation and
optimization of the ELBO.
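A minimal sketch of the reparameterization step in isolation (the encoder outputs below are hypothetical placeholder values, not a trained network):

import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, sigma):
    # z = mu + eps * sigma with eps ~ N(0, I). The randomness lives
    # entirely in eps, so z is a deterministic function of (mu, sigma)
    # and gradients can flow through mu and sigma during backpropagation.
    eps = rng.standard_normal(mu.shape)
    return mu + eps * sigma

mu = np.array([0.5, -1.0])      # hypothetical encoder mean output
sigma = np.array([0.1, 0.2])    # hypothetical encoder scale output
print(sample_latent(mu, sigma))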
1.3.6 Inference and Generation
At inference time, given a new data point x, the VAE can generate a compressed representation z by sampling from the approximate posterior qϕ(z|x) using the trained encoder.
The decoder can then reconstruct the original data point by sampling from the generative
distribution pθ(x|z). Moreover, the VAE can generate new samples by sampling random
latent vectors z from the prior distribution p(z) and passing them through the decoder to
obtain the corresponding generated data points.
Chapter 2
Stochastic Variational Inference
2.1 Stochastic Variational Inference
[Hoffman et al. 2013] developed Stochastic Variational Inference (SVI) as an extension of
variational inference that enables scalable and efficient inference on large datasets. In SVI,
the ELBO is approximated using mini-batches of data, and the variational parameters are
updated using stochastic optimization techniques. In many real-world applications, the
datasets can be massive, consisting of millions or even billions of data points. Traditional
variational inference, which requires iterating over the entire dataset at each iteration, becomes computationally infeasible in such scenarios. SVI addresses this challenge by leveraging stochastic optimization techniques and processing mini-batches of data to update the
variational parameters incrementally.
The key idea behind SVI is to approximate the ELBO using mini-batches of data instead
of the full dataset. At each iteration, a mini-batch ℬ of size B is randomly sampled from the dataset. The stochastic ELBO for the mini-batch is then computed as:

ELBO_ℬ(q) = (N/B) Σ_{i∈ℬ} E_{q(z_i)}[log p(x_i, z_i)] − E_{q(z)}[log q(z)]    (2.1)

where N is the total number of data points, and q(z_i) is the approximate posterior distribution for the latent variables associated with the i-th data point in the mini-batch. The
stochastic ELBO is an unbiased estimator of the full ELBO, and its expectation over the
random mini-batches is equal to the full ELBO.
The variational parameters are updated using stochastic gradient ascent on the stochastic
ELBO. The update rule for the variational parameters θ at iteration t is given by:
θ^{(t+1)} = θ^{(t)} + ρ_t ∇_θ ELBO_ℬ(q_{θ^{(t)}})    (2.2)

where ρ_t is the learning rate at iteration t. The learning rate is typically chosen to satisfy the Robbins-Monro conditions, which ensure convergence of the stochastic optimization algorithm, presented below.
Algorithm 1 SVI with Mini-Batch and Robbins-Monro Schedule
1: Initialize variational parameters λ
2: Choose mini-batch size m
3: Set initial learning rate η
4: Initialize time step t ← 0
5: while not converged do
6:   t ← t + 1
7:   Update learning rate η_t = η / (1 + t · decay_rate)
8:   Randomly select a mini-batch of m data points
9:   Compute the stochastic gradient of the ELBO with respect to λ using the mini-batch: ĝ = ∇_λ L(λ; mini-batch)
10:  Update the variational parameters by ascending the ELBO: λ ← λ + η_t · ĝ
11: end while
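As a concrete instance of Algorithm 1, the sketch below runs SVI on a hypothetical conjugate model p(z) = N(0, v0), p(x_i|z) = N(z, v), maintaining the Gaussian q(z) in its natural parameters. Each mini-batch produces an intermediate estimate as if the batch were replicated N/B times, and the update is the natural-gradient step developed further in Section 2.2. None of this setup comes from the thesis's experiments:

import numpy as np

rng = np.random.default_rng(1)
N, B = 100_000, 100
x = rng.normal(loc=3.0, scale=1.0, size=N)     # synthetic dataset
v0, v = 10.0, 1.0                              # prior and likelihood variances

# Natural parameters of q(z): (precision-weighted mean, precision).
eta1, eta2 = 0.0, 1.0
tau, kappa = 1.0, 0.7                          # Robbins-Monro schedule
for t in range(1, 501):
    rho = (t + tau) ** (-kappa)                # step size rho_t
    batch = x[rng.choice(N, size=B, replace=False)]
    # Intermediate estimate: prior term (prior mean is zero here) plus
    # batch sufficient statistics rescaled by N/B.
    eta1_hat = (N / B) * np.sum(batch) / v
    eta2_hat = 1.0 / v0 + N / v
    eta1 = (1 - rho) * eta1 + rho * eta1_hat   # convex-combination step
    eta2 = (1 - rho) * eta2 + rho * eta2_hat

exact_var = 1.0 / (1.0 / v0 + N / v)
exact_mu = exact_var * np.sum(x) / v
print(f"SVI:   mean {eta1 / eta2:.3f}, var {1.0 / eta2:.2e}")
print(f"Exact: mean {exact_mu:.3f}, var {exact_var:.2e}")

For this conjugate family the intermediate estimate is available in closed form, which is the setting in which Hoffman et al. derived SVI; the same loop structure applies when the gradient in line 9 of Algorithm 1 must be estimated instead.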
SVI has several advantages over traditional variational inference. The first is scalability: SVI can handle large datasets by processing mini-batches of data at each iteration, making it computationally efficient and scalable. It allows for inference on datasets that are too large to fit into memory, as only a small subset of the data is required at each iteration.
Faster convergence: SVI often converges faster than traditional variational inference,
especially in the early iterations. By updating the variational parameters more frequently
using mini-batches, SVI can quickly move towards the optimal solution.
Flexibility: SVI can be applied to a wide range of models, including complex hierarchical
models and models with non-conjugate priors. It is particularly well-suited for models with
large numbers of latent variables or parameters.
Online learning: SVI naturally supports online learning scenarios, where data arrives in
a streaming fashion. The variational parameters can be updated incrementally as new data
becomes available, without the need to reprocess the entire dataset.
However, SVI also has some challenges and considerations.
Noisy gradients: The stochastic gradients used in SVI are noisy estimators of the true gradients, as they are computed
using mini-batches of data. This noise can impact the convergence properties of the algorithm and may require careful tuning of the learning rate and mini-batch size.
Variance reduction: The variance of the stochastic gradients can be high, especially
when the mini-batch size is small. Variance reduction techniques, such as control variates or
importance sampling, can be employed to reduce the variance and improve the convergence
of SVI.
Model-specific considerations: The effectiveness of SVI may depend on the specific model
and the choice of the variational family. Some models may require additional approximations or reparameterization tricks to enable efficient stochastic optimization.
SVI has been successfully applied to various models and domains, including topic modeling, matrix factorization, and Bayesian non-parametric models. It has become a standard tool for scalable Bayesian inference, particularly in the context of large-scale machine learning applications.
Recent advancements in SVI have focused on further improving its scalability, robustness,
and automation. Techniques such as stochastic gradient Langevin dynamics (SGLD) and
stochastic gradient Hamiltonian Monte Carlo (SGHMC) combine SVI with Markov Chain
Monte Carlo (MCMC) methods to obtain more accurate posterior approximations. Additionally, black-box stochastic variational inference (BBSVI) has been proposed to automate
the derivation of the stochastic gradients, making it easier to apply SVI to new models.
2.2 Minibatches for Stochastic Variational Inference
While the basic stochastic variational inference algorithm optimizes the ELBO by subsampling individual data points, we can also more generally subsample minibatches consisting
of multiple data points to enhance the algorithm’s efficiency and convergence properties
[Bottou and Bousquet 2007].
Consider a minibatch S_t of size S sampled randomly from the set of data points {x_1, …, x_N} at iteration t. Let S_t = {x_{t,1}, …, x_{t,S}} denote the minibatch, where x_{t,s} represents the s-th sampled data point in the minibatch at iteration t.
For each data point x_{t,s} in the minibatch, we optimize the corresponding local variational parameters ϕ_{t,s}(λ^{(t−1)}) using the current global parameters λ^{(t−1)} from the previous iteration. This yields a set of S optimized local variational parameters {ϕ_{t,s}(λ^{(t−1)})}_{s=1}^{S} for the current minibatch.
We then compute an intermediate global parameter λ̂_{t,s} for each data point in the minibatch using the optimized local parameters:

λ̂_{t,s} = α + N · E_{ϕ_{t,s}}[t(x_{t,s}, z_{t,s})]  for s = 1, …, S    (2.3)

Here, α are the global hyperparameters, N is the total number of data points, and t(·, ·) are the sufficient statistics. The expectation is taken with respect to the optimized local variational distribution ϕ_{t,s} for the s-th data point in the minibatch.
Since we randomly sampled a minibatch of size S, we can average the S intermediate global parameters to obtain an overall intermediate global parameter λ̂_t for the current iteration t:

λ̂_t = (1/S) Σ_{s=1}^{S} λ̂_{t,s}    (2.4)
This averaging step allows the intermediate global parameters to aggregate information
from all S data points in the minibatch. Finally, we take a step in the stochastic natural gradient direction using λ̂_t to update the current global parameters λ^{(t)}:

λ^{(t)} = (1 − ρ_t)λ^{(t−1)} + ρ_t λ̂_t    (2.5)
The learning rate ρ_t follows a Robbins-Monro schedule ρ_t = (t + τ)^{−κ}, where κ ∈ (0.5, 1] controls the decay and τ ≥ 0 down-weights early iterations. Using minibatches provides two
main benefits:
1. Computational efficiency: Updating the global parameters using a minibatch allows
us to amortize any computational overhead over multiple data points. For example, if
the sufficient statistics or other quantities needed to compute the intermediate global
parameters are expensive, using minibatches enables reusing these computations for S
data points instead of just one.
2. Improved convergence: Compared to using only a single data point, a minibatch
of size S provides a lower-variance estimate of the stochastic natural gradient. This is
because the random sampling noise is averaged out over S points, leading to a more
stable gradient estimate. More stable gradients can help the algorithm converge more
quickly and reliably to a better local optimum, reducing the risk of getting stuck in a
suboptimal solution due to noisy single-point gradient estimates.
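The variance-reduction claim in the second benefit is easy to check numerically; the sketch below (synthetic data and a hypothetical estimand, not from the thesis) compares the spread of the rescaled mini-batch gradient of a Gaussian log-likelihood across batch sizes:

import numpy as np

rng = np.random.default_rng(2)
N = 10_000
x = rng.normal(3.0, 1.0, size=N)
z = 0.0                                    # current parameter value
full_grad = np.sum(x - z)                  # full-data gradient, d/dz log-likelihood

for S in (1, 10, 100, 1000):
    # 500 independent mini-batch gradient estimates, each rescaled by N/S:
    est = [(N / S) * np.sum(rng.choice(x, size=S) - z) for _ in range(500)]
    print(f"S = {S:4d}: mean {np.mean(est):9.0f}, sd {np.std(est):9.0f}")
print(f"full gradient: {full_grad:.0f}")

The estimates are unbiased (means near the full gradient) while the standard deviation shrinks roughly as 1/√S.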
The size of the minibatch S can be tuned to trade off between the computational efficiency of
larger batches and the convergence speed of smaller batches. If S is too small, the gradient
estimates may be too noisy for the algorithm to make rapid progress. But if S is too
large, the updates may become inefficient. Hoffman et al. explored various settings of S
and found that S between 500 and 1000 worked well for LDA topic models [Hoffman et al.
2013]. While minibatches change the effective batch size used to compute stochastic gradient
updates, stochastic variational inference still provides unbiased estimates of the natural
gradient of the ELBO and converges to a local optimum of the objective function. Thus,
minibatches preserve the scalability and theoretical guarantees of the basic SVI algorithm
while potentially improving its efficiency and convergence in practice.
2.3 Learning Rate Schedules for Stochastic Variational
Inference
The performance and convergence properties of stochastic variational inference critically
depend on the choice of learning rate schedule, which determines the sequence of step sizes {ρ_t}_{t=1}^{∞} used to update the global variational parameters in the direction of the stochastic natural gradient.
In the stochastic variational inference algorithm, at each iteration t, the global variational
parameters λ^{(t)} are updated according to:

λ^{(t)} = λ^{(t−1)} + ρ_t ∇̂_λ L_t    (2.6)
        = (1 − ρ_t)λ^{(t−1)} + ρ_t λ̂_t,    (2.7)

where ρ_t is the step size at iteration t, ∇̂_λ L_t denotes a stochastic estimate of the natural gradient of the ELBO obtained using the sampled data point(s), and λ̂_t represents an intermediate global parameter value computed from the optimized local variational parameters.
The step size ρt plays a crucial role in determining the magnitude of the update in
the direction of the stochastic natural gradient. If ρ_t is set too small, the algorithm will make very slow progress towards the optimum. On the other hand, if ρ_t is set too large, the algorithm may overshoot the optimum and oscillate or diverge. To ensure convergence to a local optimum while maintaining efficient progress, Hoffman et al. proposed using a Robbins-Monro learning rate schedule [Robbins and Monro 1951] of the form:

ρ_t = (t + τ)^{−κ}    (2.8)
where κ ∈ (0.5, 1] is known as the forgetting rate and τ ≥ 0 is termed the delay. The
forgetting rate κ controls the speed at which the influence of old gradient information decays.
A higher value of κ leads to more rapid forgetting of old gradients, while a smaller value
of κ assigns more weight to previous gradient estimates. The delay parameter τ helps to
down-weight the step sizes in the early iterations of the algorithm. By setting τ to a larger
value, the algorithm can take more aggressive steps in later iterations after having seen more
data. For the stochastic variational inference algorithm to converge to a local optimum, the
learning rate schedule {ρ_t}_{t=1}^{∞} must satisfy the following Robbins-Monro conditions:

Σ_{t=1}^{∞} ρ_t = ∞  and  Σ_{t=1}^{∞} ρ_t² < ∞.    (2.9)
The first condition ensures that the cumulative sum of step sizes diverges, allowing the
algorithm to eventually reach the optimum no matter how far away the initial point is. The
second condition guarantees that the step sizes decrease quickly enough so that the variance
of the parameter estimates remains bounded, preventing the algorithm from overshooting
the optimum due to noisy gradient estimates.
In their experiments, Hoffman et al. found that setting the forgetting rate κ in the range
[0.5, 1.0] yielded good performance in practice, with values of κ closer to 1 often leading to
faster convergence. They fixed the delay parameter τ = 1 throughout their empirical studies.
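As a quick illustration of Equation (2.8) with these reported settings (τ = 1, κ ∈ (0.5, 1]), the following sketch prints the first few step sizes for several forgetting rates:

def rho(t, tau=1.0, kappa=0.7):
    # Robbins-Monro schedule of Equation (2.8): rho_t = (t + tau)^(-kappa).
    return (t + tau) ** (-kappa)

for kappa in (0.6, 0.8, 1.0):
    steps = ", ".join(f"{rho(t, kappa=kappa):.4f}" for t in (1, 10, 100, 1000))
    print(f"kappa = {kappa}: {steps}")
# Any kappa in (0.5, 1] makes the sum of rho_t diverge while the sum of
# rho_t^2 stays finite, which is the pair of conditions in Equation (2.9).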
The choice of learning rate schedule can significantly impact the efficiency and reliability of
stochastic variational inference. If the learning rate decays too slowly (i.e., κ is set too small),
the algorithm may oscillate around the optimum or even diverge due to the accumulation of
noisy gradient estimates. Conversely, if the learning rate decays too rapidly (i.e., κ is set too
close to 1), the algorithm may converge very slowly, requiring many more iterations to reach
a good solution. Therefore, carefully tuning the learning rate schedule is crucial in practice
to achieve the best balance between stability and efficiency.
It is worth noting that the optimal learning rate schedule may depend on the specific
characteristics of the model and dataset under consideration. In some cases, more advanced
adaptive learning rate schedules that dynamically adjust the step sizes based on the observed
gradients may yield better performance than a fixed Robbins-Monro schedule.
Some additional points to keep in mind when considering learning rate schedules for
stochastic variational inference:
• The learning rate ρt should gradually decay to zero as the number of iterations t tends
to infinity, but the rate of decay should not be too rapid.
• While the Robbins-Monro conditions provide theoretical guarantees for convergence,
the specific values of the constants κ and τ should be carefully tuned in practice to
obtain good performance on a given problem.
• The optimal learning rate schedule may depend on various factors such as the size of
the dataset, the complexity of the model, the choice of minibatch size, and the specific
structure of the variational family.
• Advanced adaptive learning rate schedules, such as AdaGrad [Asi et al. 2021] or Adam
[Diederik P. Kingma and Ba 2017], can potentially improve convergence by dynamically
adapting the step sizes based on the observed gradients. However, these adaptive
methods may introduce additional hyperparameters that need to be tuned.
• It is a good practice to monitor the convergence of the algorithm and adjust the
learning rate schedule if needed based on the observed behavior. Techniques such as
gradient clipping [Pascanu, Mikolov, and Bengio 2013] or learning rate decay can help
to stabilize the optimization process.
In summary, the learning rate schedule is a key ingredient of stochastic variational inference that largely determines the efficiency and reliability of the optimization process.
A well-tuned learning rate schedule is essential to ensure that the algorithm makes rapid
progress towards a good solution while avoiding instabilities and divergence. Understanding the role of the learning rate and being able to effectively adjust it are crucial skills for
successfully applying stochastic variational inference in practice.
Chapter 3
Markov Chain Monte Carlo
3.1 Markov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms used for sampling
from complex probability distributions, particularly in Bayesian inference where the posterior distribution is often intractable. MCMC techniques construct a Markov chain whose stationary distribution is the desired posterior distribution. By sampling from
this Markov chain, we can obtain samples that approximate the posterior distribution and
estimate various quantities of interest.
3.1.1 Metropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm is a fundamental MCMC method that forms the
basis for many other MCMC techniques. The MH algorithm generates samples from a target
distribution p(θ) by proposing moves from a proposal distribution q(θ′|θ) and accepting or rejecting these moves based on an acceptance probability [Robert 2016]. The MH algorithm proceeds as follows:
Algorithm 2 Metropolis-Hastings Algorithm
1: Initialize θ^{(0)}
2: for t = 1, …, T do
3:   Sample θ′ ∼ q(θ′|θ^{(t−1)})
4:   Compute the acceptance probability:
     α = min{1, [p(θ′) q(θ^{(t−1)}|θ′)] / [p(θ^{(t−1)}) q(θ′|θ^{(t−1)})]}
5:   Sample u ∼ Uniform(0, 1)
6:   if u ≤ α then
7:     θ^{(t)} = θ′    ▷ Accept the proposal
8:   else
9:     θ^{(t)} = θ^{(t−1)}    ▷ Reject the proposal
10:  end if
11: end for
The choice of the proposal distribution q(θ′|θ) is crucial for the efficiency of the MH
algorithm. Common choices include Gaussian distributions centered at the current state or
distributions that exploit the structure of the problem.
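A minimal sketch of the algorithm for a one-dimensional target (a standard normal, chosen purely for illustration) with a symmetric Gaussian random-walk proposal, for which the q-ratio in the acceptance probability cancels:

import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    return -0.5 * theta ** 2              # log p(theta) up to a constant

T, step = 50_000, 1.0
theta = 0.0
samples = np.empty(T)
accepted = 0
for t in range(T):
    proposal = theta + step * rng.standard_normal()
    log_alpha = log_target(proposal) - log_target(theta)
    if np.log(rng.uniform()) <= log_alpha:   # accept with prob min(1, alpha)
        theta = proposal
        accepted += 1
    samples[t] = theta

print(f"acceptance rate: {accepted / T:.2f}")
print(f"sample mean {samples.mean():.3f}, sd {samples.std():.3f}")  # ~0, ~1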
3.1.2 Gibbs Sampling
Gibbs sampling is a special case of the Metropolis-Hastings algorithm that is particularly
useful when the joint distribution is complex but the conditional distributions are easy to
sample from. In Gibbs sampling, the proposal distribution is chosen to be the conditional distribution of each variable given the current values of all other variables. Let θ = (θ1, . . . , θd)
be the set of variables in the model. The Gibbs sampling algorithm iteratively samples each
variable θi
from its conditional distribution given the current values of all other variables:
Algorithm 3 Gibbs Sampling Algorithm
1: Initialize θ^{(0)} = (θ_1^{(0)}, …, θ_d^{(0)})
2: for t = 1, …, T do
3:   for i = 1, …, d do
4:     Sample θ_i^{(t)} ∼ p(θ_i | θ_1^{(t)}, …, θ_{i−1}^{(t)}, θ_{i+1}^{(t−1)}, …, θ_d^{(t−1)})
5:   end for
6: end for
Gibbs sampling is particularly useful in models with conjugate priors, where the conditional distributions have a closed-form expression and are easy to sample from.
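A minimal sketch for an illustrative two-dimensional case: a bivariate normal with correlation ρ = 0.8, whose full conditionals are the closed-form Gaussians θ₁|θ₂ ∼ N(ρθ₂, 1 − ρ²) and symmetrically for θ₂:

import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 50_000
theta = np.zeros(2)
samples = np.empty((T, 2))
for t in range(T):
    # Sample each coordinate from its full conditional in turn:
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho ** 2))
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho ** 2))
    samples[t] = theta

print(np.corrcoef(samples.T)[0, 1])   # should be close to rho = 0.8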
3.1.3 Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC), also known as Hybrid Monte Carlo, is an MCMC method
that uses the principles of Hamiltonian dynamics to propose efficient moves in the parameter
space [Betancourt 2018]. HMC introduces auxiliary momentum variables and simulates the
motion of a particle in a potential energy landscape defined by the negative log-posterior.
The HMC algorithm alternates between two steps: a deterministic proposal step based on
Hamiltonian dynamics and a stochastic acceptance step similar to the Metropolis-Hastings
algorithm. The Hamiltonian dynamics step allows for efficient exploration of the parameter space by taking into account the gradient information of the log-posterior. The HMC
algorithm proceeds as follows:
Algorithm 4 Hamiltonian Monte Carlo Algorithm
1: Initialize θ^{(0)} and p^{(0)} ∼ N(0, M)
2: for t = 1, …, T do
3:   Sample momentum p ∼ N(0, M)
4:   Set θ′ = θ^{(t−1)} and p′ = p
5:   for l = 1, …, L do
6:     Update θ′ and p′ using a symplectic integrator (e.g., leapfrog)
7:   end for
8:   Compute the acceptance probability:
     α = min{1, exp(−H(θ′, p′)) / exp(−H(θ^{(t−1)}, p))}
     where H(θ, p) = U(θ) + K(p) is the Hamiltonian
9:   Sample u ∼ Uniform(0, 1)
10:  if u ≤ α then
11:    θ^{(t)} = θ′    ▷ Accept the proposal
12:  else
13:    θ^{(t)} = θ^{(t−1)}    ▷ Reject the proposal
14:  end if
15: end for
HMC has several advantages over other MCMC methods, including better mixing properties and the ability to efficiently explore high-dimensional spaces. However, it requires the
computation of gradients and careful tuning of the step size and number of leapfrog steps.
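A minimal sketch for a standard normal target, with U(θ) = θ²/2, an identity mass matrix, and a hand-rolled leapfrog integrator (the step size and trajectory length are illustrative and would need tuning on real problems):

import numpy as np

rng = np.random.default_rng(0)

def grad_U(theta):
    return theta                          # gradient of U(theta) = theta^2 / 2

def leapfrog(theta, p, eps, L):
    p = p - 0.5 * eps * grad_U(theta)     # initial half step for momentum
    for _ in range(L - 1):
        theta = theta + eps * p           # full step for position
        p = p - eps * grad_U(theta)       # full step for momentum
    theta = theta + eps * p
    p = p - 0.5 * eps * grad_U(theta)     # final half step for momentum
    return theta, p

T, eps, L = 10_000, 0.1, 20
theta = 0.0
samples = np.empty(T)
for t in range(T):
    p = rng.standard_normal()
    theta_new, p_new = leapfrog(theta, p, eps, L)
    H_old = 0.5 * theta ** 2 + 0.5 * p ** 2
    H_new = 0.5 * theta_new ** 2 + 0.5 * p_new ** 2
    if np.log(rng.uniform()) <= H_old - H_new:   # accept w.p. min(1, e^{-dH})
        theta = theta_new
    samples[t] = theta

print(f"mean {samples.mean():.3f}, sd {samples.std():.3f}")  # ~0, ~1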
3.1.4 Convergence Diagnostics
One important aspect of MCMC methods is assessing the convergence of the Markov chain
to the stationary distribution. Several diagnostic tools have been proposed to monitor convergence and ensure that the samples obtained from the MCMC algorithm are representative
of the true posterior distribution. Some commonly used convergence diagnostics include:
• Trace Plots: Plotting the values of the parameters over the iterations of the Markov
chain can provide visual evidence of convergence. A well-mixing chain should exhibit
random fluctuations around a stable mean.
• Autocorrelation Plots: Plotting the autocorrelation of the samples can indicate the
degree of correlation between successive samples. High autocorrelation suggests slow
mixing and may require thinning the samples.
• Gelman-Rubin Statistic: The Gelman-Rubin statistic compares the between-chain
and within-chain variances of multiple parallel chains to assess convergence. A value
close to 1 indicates good convergence.
• Effective Sample Size: The effective sample size (ESS) measures the number of independent samples that would be equivalent to the correlated samples obtained from the
MCMC algorithm. A higher ESS indicates better mixing and more reliable estimates.
It is important to carefully monitor convergence diagnostics and adjust the MCMC algorithm
accordingly to ensure reliable posterior inference.
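The last two diagnostics can be computed in a few lines; the sketch below uses the standard textbook formulas (not formulas given in this thesis) and synthetic chains:

import numpy as np

def gelman_rubin(chains):
    # chains: array of shape (m, n), m parallel chains of n samples each.
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)              # values near 1 suggest convergence

def effective_sample_size(x):
    # Crude ESS: n / (1 + 2 * sum of positive-lag autocorrelations).
    n = len(x)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for r in acf[1:]:
        if r < 0.05:                         # truncate once correlation is small
            break
        tau += 2 * r
    return n / tau

rng = np.random.default_rng(0)
chains = rng.standard_normal((4, 1000))      # hypothetical well-mixed chains
print(f"R-hat: {gelman_rubin(chains):.3f}")
print(f"ESS of one chain: {effective_sample_size(chains[0]):.0f}")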
3.1.5 Limitations and Challenges
While MCMC methods have been widely successful in Bayesian inference, they also have
certain limitations and challenges:
• Computational Complexity: MCMC methods can be computationally intensive,
especially for high-dimensional models or large datasets. The need to generate a large
number of samples and the sequential nature of the algorithms can lead to long computation times.
• Tuning and Convergence: MCMC algorithms often require careful tuning of hyperparameters, such as the proposal distribution or the step size, to achieve good
mixing and convergence. Assessing convergence can also be challenging, particularly
in high-dimensional spaces.
• Multimodal Distributions: MCMC methods may struggle to explore multimodal
posterior distributions efficiently. The Markov chain can get trapped in local modes,
leading to biased estimates and slow convergence.
• Sensitivity to Initialization: The initial state of the Markov chain can influence
the convergence and mixing properties of the algorithm. Poor initialization can lead
to slow convergence or even failure to converge to the correct posterior distribution.
Despite these challenges, MCMC methods remain a powerful tool for Bayesian inference and
have been successfully applied to a wide range of problems in various domains.
Chapter 4
Comparative Analysis
4.1 Theoretical Properties of MCMC Methods
4.1.1 Asymptotic Convergence
One of the key theoretical properties of MCMC methods is their asymptotic convergence.
Under certain regularity conditions, MCMC algorithms are guaranteed to converge to the
true posterior distribution as the number of iterations tends to infinity. This property is
formalized by the ergodic theorem for Markov chains. Let π(θ) denote the target posterior
distribution and P^t(θ_0, ·) denote the distribution of the Markov chain after t iterations, starting from an initial state θ_0. The ergodic theorem states that for any integrable function f:

lim_{t→∞} ∫ f(θ) P^t(θ_0, dθ) = ∫ f(θ) π(θ) dθ    (4.1)
This means that the average of the function f over the samples generated by the Markov
chain converges to the true posterior expectation of f as the number of iterations increases.
4.1.2 Markov Chain Central Limit Theorem
Another important theoretical property of MCMC methods is the Markov chain central limit
theorem (MCCLT). The MCCLT states that, under certain conditions, the distribution of
the sample average of a function f over the Markov chain converges to a normal distribution as the number of iterations tends to infinity. Let f̄_t = (1/t) Σ_{i=1}^{t} f(θ_i) be the sample average of f over t iterations of the Markov chain. The MCCLT states that:

√t (f̄_t − E_π[f]) →_d N(0, σ_f²)    (4.2)

where E_π[f] is the true posterior expectation of f and σ_f² is the asymptotic variance,
which depends on the autocovariance of the Markov chain. The MCCLT provides a basis
for constructing confidence intervals and assessing the uncertainty of the estimates obtained
from MCMC methods.
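One standard way to use the MCCLT in practice is the batch-means estimator of σ_f²: split one long chain into non-overlapping batches, estimate σ_f² from the variance of the batch means, and form a confidence interval. A sketch on a synthetic AR(1) chain (illustrative, not from the thesis):

import numpy as np

rng = np.random.default_rng(0)

# A correlated chain: AR(1) with unit innovations, x_i = a x_{i-1} + e_i.
a, t = 0.9, 100_000
chain = np.empty(t)
chain[0] = 0.0
for i in range(1, t):
    chain[i] = a * chain[i - 1] + rng.standard_normal()

n_batches = 100
batch_size = t // n_batches
batches = chain.reshape(n_batches, batch_size).mean(axis=1)
sigma2_hat = batch_size * batches.var(ddof=1)    # estimates sigma_f^2
mean = chain.mean()
half_width = 1.96 * np.sqrt(sigma2_hat / t)      # 95% CI from Eq. (4.2)
print(f"estimate {mean:.3f} +/- {half_width:.3f}")  # true mean is 0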
4.2 Theoretical Properties of Stochastic Variational Inference
4.2.1 Stochastic Optimization
SVI leverages stochastic optimization techniques to efficiently optimize the ELBO and find
the optimal variational parameters. The key idea is to approximate the expectations in the
ELBO using Monte Carlo estimates based on mini-batches of data. Let D = {x_1, …, x_N} be the dataset and D_B ⊂ D be a mini-batch of size B. The stochastic gradient of the ELBO with respect to ϕ can be estimated as:

∇_ϕ L(ϕ) ≈ (N/B) Σ_{x∈D_B} ∇_ϕ E_{q_ϕ}[log p(θ, x)] − ∇_ϕ E_{q_ϕ}[log q_ϕ(θ)]    (4.3)
By using stochastic gradients, SVI can efficiently update the variational parameters and
converge to a local optimum of the ELBO.
4.2.2 Convergence and Consistency
The convergence and consistency properties of SVI have been studied in the literature. Under
certain conditions, such as the properness of the variational family and the smoothness of the
log-likelihood, SVI has been shown to converge to a local optimum of the ELBO. Moreover,
as the mini-batch size m increases, the stochastic gradient estimates become more accurate,
and SVI approaches the behavior of batch variational inference. In the limit m → N, SVI
recovers the exact batch updates.
4.3 Advantages of Stochastic Variational Inference over
MCMC
Stochastic Variational Inference offers several advantages over MCMC methods, making it
a preferred choice in many practical scenarios.
4.3.1 Computational Efficiency
One of the main advantages of SVI is its computational efficiency compared to MCMC
methods. SVI exploits stochastic optimization techniques and mini-batch updates, allowing
it to scale to large datasets and high-dimensional models. While MCMC methods require
processing the entire dataset at each iteration, SVI can update the variational parameters
using mini-batches of data. This leads to faster convergence and reduced computational cost
per iteration. The computational complexity of SVI is typically O(B) per iteration, where B
is the mini-batch size, compared to O(N) for MCMC methods, where N is the total number
of data points (typically B ≪ N).
4.3.2 Scalability
SVI is highly scalable and can handle massive datasets that are infeasible for MCMC methods. By leveraging stochastic optimization and mini-batch updates, SVI can process data in
a streaming fashion and adapt to new data points as they arrive. The scalability of SVI is
particularly advantageous in scenarios where the dataset is too large to fit into memory or
when the data arrives in a streaming manner. SVI can continuously update the variational
parameters without the need to revisit previous data points.
4.3.3 Flexibility in Variational Family
SVI offers flexibility in the choice of the variational family qϕ(θ). While MCMC methods
directly sample from the true posterior distribution, SVI approximates the posterior using a
tractable variational distribution. The variational family can be chosen to balance between
expressiveness and tractability. Common choices include mean-field factorization, Gaussian
distributions, or more complex distributions such as normalizing flows. The flexibility in the
variational family allows SVI to adapt to the structure of the problem and capture important
dependencies between variables. It also enables the incorporation of domain knowledge and
prior information into the variational approximation.
4.3.4 Deterministic Approximation
SVI provides a deterministic approximation to the posterior distribution, in contrast to the
stochastic samples obtained from MCMC methods. The variational parameters ϕ uniquely
determine the variational distribution qϕ(θ). The deterministic nature of SVI has several
benefits. It allows for easier interpretation and analysis of the posterior approximation. The
variational parameters can be directly inspected and provide insights into the structure of
the posterior. Moreover, the deterministic approximation facilitates downstream tasks such
as prediction, model selection, and decision-making. The variational distribution can be
easily sampled from or integrated over to obtain predictive distributions or expected values.
4.4 Conclusion
In this chapter, we compared the theoretical properties of Markov Chain Monte Carlo
(MCMC) and Stochastic Variational Inference (SVI) for Bayesian inference. MCMC methods, based on the construction of Markov chains, offer asymptotic convergence guarantees
and a principled framework for sampling from the true posterior distribution. However,
MCMC methods can be computationally intensive and may struggle with large datasets and
high-dimensional models. On the other hand, Stochastic Variational Inference provides a
powerful alternative by approximating the posterior distribution using variational methods
and stochastic optimization. SVI offers several advantages over MCMC, including computational efficiency, scalability, flexibility in the choice of variational family, and deterministic
approximation.
The computational efficiency of SVI stems from its ability to leverage stochastic optimization techniques and mini-batch updates, allowing it to handle large datasets and converge
faster than MCMC methods. SVI is highly scalable and can process data in a streaming
fashion, making it suitable for massive datasets and online learning scenarios. Moreover,
SVI provides flexibility in the choice of the variational family, enabling the incorporation of
domain knowledge and the capture of important dependencies between variables. The deterministic nature of the variational approximation facilitates interpretation, analysis, and
downstream tasks.
While MCMC methods remain a fundamental tool in Bayesian inference, the advantages
of Stochastic Variational Inference make it an attractive choice in many practical scenarios,
particularly when dealing with large-scale datasets and complex models.
In conclusion, understanding the theoretical properties and advantages of SVI over MCMC
is crucial for making informed choices in Bayesian inference. The scalability, efficiency, and
flexibility offered by SVI have made it a popular approach in various domains, ranging from
machine learning to computational biology and beyond.
Bibliography
Alexanian, Moorad (2015). Nature, Science, Bayes’ Theorem, and the Whole of Reality.
arXiv: 1506.05040 [physics.hist-ph].
Asi, Hilal et al. (2021). Private Adaptive Gradient Methods for Convex Optimization. arXiv:
2106.13756 [cs.LG].
Betancourt, Michael (2018). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:
1701.02434 [stat.ME].
Bottou, Leon and Olivier Bousquet (2007). "The Tradeoffs of Large Scale Learning". In: Advances in Neural Information Processing Systems. Vol. 20. url: https://proceedings.neurips.cc/paper/2007/file/0d3180d672e08b4c5312dcdafdf6ef36-Paper.pdf.
Ganguly, Ankush and Samuel W. F. Earp (2021). An Introduction to Variational Inference.
arXiv: 2108.13083 [cs.LG].
Hoffman, Matt et al. (2013). Stochastic Variational Inference. arXiv: 1206.7051 [stat.ML].
Kingma, Diederik P and Max Welling (2022). Auto-Encoding Variational Bayes. arXiv: 1312.6114 [stat.ML].
Kingma, Diederik P. and Jimmy Ba (2017). Adam: A Method for Stochastic Optimization.
arXiv: 1412.6980 [cs.LG].
Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio (2013). On the difficulty of training
Recurrent Neural Networks. arXiv: 1211.5063 [cs.LG].
Robbins, Herbert and Sutton Monro (Sept. 1951). “A Stochastic Approximation Method”.
In: The Annals of Mathematical Statistics 22.3, pp. 400–407.
Robert, Christian P. (2016). The Metropolis-Hastings algorithm. arXiv: 1504.01896 [stat.CO].