Information Geometry of Annealing Paths for Inference and Estimation by Rob Brekelmans A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) August 2022 Copyright 2022 Rob Brekelmans Dedicated to my sister, Nina May Brekelmans ii Acknowledgements First, I would like to thank my advisors, Greg Ver Steeg and Aram Galstyan, for creating a supportive research environment with near-complete academic freedom. Being allowed to be utterly confused and unproductive for my rst two to three years made it that much more rewarding when I could eventually contribute to literature on the same material. Particular thanks to Greg for parsing my long-winded research ramblings with equanimity and insight; it was amazing to have an advisor so closely involved in similar research and to observe such an original thinker at work. Further thanks to Aram for advisement over the years and facilitating a fruitful group environment. Beyond my advisors, I am grateful to have engaged with a community of like-minded researchers, which have been some of the most stimulating and rewarding relationships built in the last six years. Highlights include path sampling and philosophy with Vaden Masrani, synergistic discussions with Kyle Reing, and the community of USC ISI labmates including Palash Goyal, Dan Moyer, Hrayr Harutyunyan, Myrl Marmarelis, Neal Lawton, and Serban Stan. Particular thanks to Alireza Makhzani, for long discussions and paper-writing sessions which have helped transform my un- derstanding of mcmc, inference, and generative modeling, and Frank Nielsen for his guidance, including a trove of information geometry references as a pandemic study program in lieu of a Tokyo internship. I am grateful for Pedro Ortega for inspiring discussions over the years and the chance to intern with his team at DeepMind, where Tim Genewein was a relentlessly positive and supportive manager. Beyond the limitless love of my parents, thanks to Ashok, Chantle, and especially Caroline for providing support and light through many ups and downs in the PhD program. iii Table of Contents Dedication ii Acknowledgements iii Abstract vi Chapter 1: Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Conjugate Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.5 Contributions and Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Chapter 2: Improving Mutual Information Estimation Using Annealed and Energy- Based Bounds 26 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.2 Unifying Mutual Information Bounds via Importance Sampling . . . . . . . . . . . . 28 2.3 Multi-Sample AIS Bounds on logp(x) and Mutual Information . . . . . . . . . . . . 34 2.4 MINE-AIS Estimation of Mutual Information . . . . . . . . . . . . . . . . . . . . . . 39 2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.6 Conjugate Duality Interpretations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
64 Chapter 3: Bregman Duality in Thermodynamic Variational Inference 66 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.2 Thermodynamic Variational Objective . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.3 Likelihood Ratio Exponential Family Interpretation . . . . . . . . . . . . . . . . . . 70 3.4 TVO logp(x) Bound Gaps via Bregman Divergences . . . . . . . . . . . . . . . . . . 72 3.5 Moment-Spacing Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.6 Doubly-Reparameterized TVO Gradient . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 3.8 Discussion: Likelihood Ratio Exponential Families . . . . . . . . . . . . . . . . . . . 83 3.9 Discussion: TVO and AIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Chapter 4: q-Paths: Generalizing the Geometric Mixture Path using Power Means and -Divergences 97 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.3 q-Paths from Power Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 iv 4.4 q-Likelihood Ratio Exponential Families . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.5 Variational Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 4.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Chapter 5: Rho-Tau Bregman Information and the Geometry of Annealing Paths113 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.2 Rho-Tau Bregman Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.3 Rho-Tau Bregman Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.4 Information Geometry of Rho-Tau Divergence . . . . . . . . . . . . . . . . . . . . . . 130 5.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 Chapter 6: Your Policy Regularizer is Secretly an Adversary: Conjugate Duality in Reinforcement Learning 145 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 6.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 6.3 Adversarial Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 Bibliography 168 Appendices Chapter A: Conjugate Duality in Exponential Families (Ch. 3) 181 Chapter B: Conjugate Duality Beyond the Exponential Family (Ch. 2 and 5) 187 B.1 Bregman Divergence as the Gap in Conjugate Optimization . . . . . . . . . . . . . . 190 B.2 Conjugate Duality Interpretation of the Evidence Lower Bound . . . . . . . . . . . . 192 Chapter C: Detailed Conjugate Derivations for KL- & -Divergences (Ch. 6) 196 C.1 Conjugate Derivations without Normalization Constraint . . . . . . . . . . . . . . . 
196 C.2 Conjugate Derivations with Normalization Constraint . . . . . . . . . . . . . . . . . 203 Chapter D: Proof of Prop. D.0.1: Linear Bias Reduction for AIS (Ch. 2 and 3) 206 Chapter E: Appendix for \Your Policy Regularizer is Secretly an Adversary" (Ch. 6)208 E.1 Implications of Conjugate Duality Optimality Conditions . . . . . . . . . . . . . . . 208 E.2 Soft Value Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 E.3 Robust Set of Perturbed Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 E.4 Tsallis Entropy and -Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 E.5 Worked Example for Deterministic Regularized Policy . . . . . . . . . . . . . . . . . 224 Chapter F: Appendix for \Bregman Information and the Geometry of Annealing Paths" (Ch. 5) 226 F.1 -Deformed Logarithm Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 F.2 Limiting Behavior of Rho-Tau Divergences . . . . . . . . . . . . . . . . . . . . . . . . 227 v Abstract Normalization constants or marginal likelihoods that integrate over high-dimensional state spaces are ubiquitous in probabilistic machine learning and statistical physics. Important problems where these quantities appear include maximum likelihood learning, Bayesian inference, and calculating physical or information theoretic quantities such as entropy and mutual information (mi). Impor- tance sampling, in which samples from a tractable proposal distribution are reweighted based on their likelihood under the target density, is at the heart of many successful solutions, although its accuracy can crucially depend on how closely the proposal distribution matches the target. For example, Annealed Importance Sampling (ais) decomposes the sampling problem into a se- quence of easier subproblems using Markov Chain Monte Carlo (mcmc), providing a practical way to trade computation for estimation accuracy. Popular variational bounds such as the Evidence Lower Bound (elbo) and Importance Weighted Autoencoder (iwae) bound can also be understood as members of the broader class of importance sampling methods. In this thesis, we provide unifying and general perspectives on these methods for estimating intractable normalization constants, providing methodological improvements and extending their application to new estimation or inference problems. First, we propose a general approach for deriving extended state space importance sampling bounds, which yields upper and lower bounds whose tightness and probabilistic inference interpretations are given by construction. This leads to novel multi-sample ais bounds which combine insights from iwae and ais. We consider mi estimation as our primary problem setting, and nd that our importance sampling perspectives help shed light on the limitations of existing methods. We demonstrate that multi-sample ais can accurately estimate large values of ground truth mi when an analytic joint distribution is available, and provide improved estimation methods for more general settings where the conditional likelihood is unknown. An important design choice in ais is to choose an annealing path which bridges between a tractable initial distribution and the unnormalized density function whose normalization constant we seek to estimate. 
We provide a thorough analysis of the commonly-used geometric averaging vi path through the lens of thermodynamic integration (ti) and convex duality in exponential fam- ilies, which allows us to characterize the gap in the recent Thermodynamic Variational inference Objective (tvo). More generally, our `likelihood ratio exponential family' approach can be used to understand importance sampling with arbitrary energy functions, including the extended state space importance sampling bounds from earlier sections. Using this perspective, we prove a linear bias reduction result for ais under perfect transitions, which helps explain its success in our mi estimation application. We further extend our analysis of annealing paths to propose a novel one-parameter family of paths which generalize the geometric averaging path. Our approach leverages convex duality to highlight intimate connections between divergence functions, parametric families, and quasi- arithmetic means. For example, the kl divergence, exponential family, and geometric mean arise from using the natural logarithm as a monotonic transformation function. By contrast, our `q-paths' are associated with either the-divergence or-divergence, correspond to a deformed exponential family of unnormalized densities, and use Tsallis's q-logarithm as the transformation function to dene the quasi-arithmetic mean. We demonstrate that q-paths can improve estimation accuracy in example applications, but envision that our exible framework may lead to further improvement by adapting annealing paths to a sampling problem of interest. Finally, we leverage tools from convex duality to analyze the role of regularization in reinforce- ment learning (rl) settings, where optimizations analogous to the elbo appear when choosing a decision policy based on current reward estimates. We show that regularization can be interpreted as providing robustness to perturbations of the reward function by an implicit adversary, and high- light how the properties of this robustness change for -divergence regularization compared to the more standardkl divergence regularization. While, at rst glance, the developments in this section appear distinct from the rest of the thesis, we reference recent work showing that similar adversarial interpretations apply in the setting of (generalized) Bayesian inference. vii Chapter 1 Introduction 1.1 Motivation Estimating high-dimensional integrals is a fundamental problem which appears across probabilistic machine learning and statistical physics. While it is conceptually simple to assign a non-negative value to some conguration of physical states, model parameters, latent variables, and/or data examples, a sum or integral over all possible states is necessary to calculate the normalization constant and interpret these values as a probability distribution. Calculating the normalization constant or its logarithm is often intractable, but appears in various important settings such as posterior inference, maximum likelihood learning, and calculating physical or information theoretic quantities such as entropy and mutual information. We review several of these problem settings below as motivation for the analysis in this thesis. As a running example for this introductory chapter, we will derive the ubiquitous Evidence Lower BOund (elbo) on the log normalization constant using two approaches which form the backbone of this thesis: importance sampling (Sec. 1.4.4) and convex duality (Sec. 1.3.1). 
In par- ticular, the elbo is commonly used as an objective function for variational inference (Sec. 1.2) and maximum likelihood learning in latent variable generative models. We consider the problem of estimating normalization constants using importance sampling in Sec. 1.4, including background on annealed importance sampling in Sec. 1.4.3. Finally, we show that unbiased importance sampling estimators of the normalization constant translate to stochastic lower bounds on the log normal- ization constant, which leads to extended state space generalizations of the elbo in Sec. 1.4.4 and our mutual information estimators in Ch. 2. 1 Posterior Inference and Model Evaluation For example, in Bayesian inference of parame- ters in a given modelM, we are often interested in sampling from or evaluating the posterior distribution of parameters p(jM;D) after observing dataD, p(jD;M) = p(Dj;M)p(;M) R p(Dj;M)p(;M)d : (1.1) where the priorp(;M) and conditional likelihoodp(Dj;M) in the numerator are usually tractable by design. However, evaluating the posterior still requires calculating the `marginal likelihood' p(DjM) = R p(Dj;M)p(jM)d as a normalization constant which integrates or marginalizes over. Since represents a vector of parameters, the number of congurations scales exponentially in the number of dimensions. The marginal likelihood is also useful for model selection in choosing between two modelsM 1 orM 2 , which has been argued to re ect Occam's Razor (MacKay et al. (2003) Ch. 28) and balance prediction accuracy with model complexity. Maximum Likelihood Estimation in Energy-Based Models A direct approach to modeling the unknown distribution q d (x) of the data generating process is to learn an arbitrary energy function E (x), which is parameterized by and assigns a value to each data point. To match the notation of statistical physics, we might also consider a xed inverse temperature parameter which scales the negative energy function and leads to the Gibbs-Boltzmann distribution p (x) = 1 Z(;) e E (x) where Z(;) = Z e E (x) dx: (1.2) where the `partition function' or normalization constantZ(;) integrates over the high-dimensional data space. A common principle for learning is maximum likelihood estimation, in which we seek to nd parameters which assign high probabilityp (x) to empirical datafx (i) g N i=1 , where eachx (i) q d (x) is drawn independently and identically from a true data distribution q d (x). It is often convenient to work with the log probability due to its additivity with respect to products log Q N i=1 p (x (i) ) = P N i=1 logp (x (i) ). With a xed , learning proceeds by optimizing the following objective max L mle () := max E q d (x) logp (x) max 1 N N X i=1 logp (x (i) ) (1.3) = max 1 N N X i=1 E (x (i) ) logZ(;) (1.4) 2 where the `log partition function' logZ(;) is similarly intractable. We give more detailed con- sideration to the dierences between estimating the partition function and log partition function in Section 1.4 below (see also Grosse et al. (2015)). Entropy and Mutual Information Fundamental quantities in information theory and statis- tical physics also involve estimation of (log) normalization constants. In particular, the thermo- dynamic entropy describing the `disorder', or possible number of microstates, of a physical system and the information theoretic entropy, which measures the fundamental lossless compression rate of a source random variable, can both be shown the have the form of the Shannon entropy (Jaynes, 1957; Cover and Thomas, 2012). 
We can consider both marginal and conditional forms for the entropy under a distribution p(x;z) over random variables X and Z, H(X) =E p(x) [logp(x)] H(XjZ) =E p(x) E p(xjz) [logp(xjz)]: (1.5) The marginal entropy is often intractable, since p(x) = R p(x;z)dz involves an integral over z. Mutual information is among the most general measures of dependence between two random variables, and measures, for example, the reduction in entropy about X after conditioning on Z, I(X :Z) =E p(x;z) log p(x;z) p(x)p(z) =H(X)H(XjZ) =E p(x;z) [logp(xjz)]E p(x) [logp(x)] (1.6) =H(Z)H(ZjX) =E p(x;z) [logp(zjx)]E p(z) [logp(z)]: (1.7) In Ch. 2, we provide detailed discussion and experimental results for estimation of these quantities, which have proven insightful in evaluating various generative models (Salakhutdinov and Murray, 2008; Wu et al., 2016; Alemi and Fischer, 2018; Huang et al., 2020). Maximum Likelihood Learning in Latent Variable Models via Expectation-Maximization We encounter both the challenges of posterior inference and maximizing intractable marginal like- lihoods when learning latent variable models p (x;z), which seek to approximateq d (x) using addi- tional unobserved variables z and learnable parameters . Using maximum likelihood estimation to learn the model parameters , we nd that the marginal likelihood p (x (i) ) of a data point 3 x (i) q d (x) under the model p (x;z) involves an integral over all congurations of the latent variables z, max L mle () := max E q d (x) logp (x) = max E q d (x) log Z p (x;z)dz (1.8) max 1 N N X i=1 log Z p (x (i) ;z)dz: (1.9) However, since the latent variables z are, by denition, not observed, it is not clear how we should assign or infer values ofz for a given observed data pointx. While the model species a prior p (z) over the latent variables, estimating the integral in Eq. (1.9) by sampling from this distribution may require prohibitive sample complexity (Chatterjee and Diaconis (2018); Brekelmans et al. (2022b)). The posterior distribution p (zjx) instead provides a natural choice for setting the latent vari- ables, and in fact yields the optimal importance sampling estimate of R p (x;z)dz (Owen (2013) 9.1) p (zjx) = p (xjz)p (z) R p (xjz)p (z)dz = p (x;z) p (x) : (1.10) The posterior is concentrated on z which have high joint probability p (x;z) for a given x. While the joint distribution p (x;z) is often tractable by design, the posterior again requires calculating the marginal likelihood p (x). The above observations give rise to the famous Expectation-Maximization algorithm (Dempster et al., 1977; Csisz ar, 1984; Kunstner et al., 2021), which is an alternating optimization algorithm which converges to a local maximum (Wu, 1983) of the maximum likelihood objective by opti- mizing a lower bound on Eq. (1.9). We direct the reader to Kunstner et al. (2021) for a detailed consideration of the em algorithm within exponential family models, involving concepts from con- vex duality as in Ch. 3 and App. A.0.1. However, the em algorithm applies more generally outside of exponential family models (Wu, 1983; Neal and Hinton, 1998), as we now discuss. We rst recall the denition of the Kullback-Leibler (kl) divergence (Kullback and Leibler, 1951) between two distributions q;p over an arbitrary (set of) random variable(s) !, D KL [q(!) :p(!)] := Z q(!) log q(!) p(!) d!; (1.11) which is equal to zero if and only i q(!) =p(!). 
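Since Eqs. (1.5)-(1.7) and (1.11) only involve expectations of log density ratios (in particular, $I(X:Z) = D_{KL}[p(x,z) : p(x)p(z)]$), they are straightforward to verify numerically on a small discrete space. The following sketch uses an arbitrary illustrative joint table (not taken from any experiment in this thesis) to check that the expected log ratio in Eq. (1.6) agrees with the entropy difference $H(X) - H(X|Z)$.

```python
import numpy as np

# Toy check of Eqs. (1.5)-(1.7) on a small discrete joint p(x, z).
# The joint table below is arbitrary (illustrative values only).
p_xz = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])          # rows index x, columns index z
p_x = p_xz.sum(axis=1)                   # marginal p(x)
p_z = p_xz.sum(axis=0)                   # marginal p(z)

H_x = -np.sum(p_x * np.log(p_x))                         # H(X)
p_x_given_z = p_xz / p_z                                 # p(x|z), columns normalized
H_x_given_z = -np.sum(p_xz * np.log(p_x_given_z))        # H(X|Z)

# Mutual information as an expected log density ratio, Eq. (1.6)
mi_ratio = np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z)))

print(H_x - H_x_given_z, mi_ratio)       # the two expressions agree
```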
4 The em algorithm may now be viewed in terms of the following alternating divergence mini- mization problem (Dempster et al., 1977; Csisz ar, 1984), which we discuss in detail below. Letting t indicate the iteration timestep, E Step: q (t+1) (zjx) argmin q(zjx) D KL h q d (x)q(zjx) p (t)(x;z) i M Step: (t+1) argmin D KL h q d (x)q (t+1) (zjx) p (x;z) i : (1.12) We rst consider the M-step, noticing that the data distribution q d (x) is constant with respect to optimization of the model. Simplifying the joint kl divergence and switching the minimization to a maximization, we have argmin DKL q d (x)q (t+1) (zjx) p (x;z) = argmin E q d (x) [logq d (x)] | {z } unknown, constant E q d (x) [logp (x)] +E q d (x) h DKL q (t+1) (zjx)kp (zjx) i = argmax E q d (x) [logp (x)]E q d (x) h DKL q (t+1) (zjx)kp (zjx) i : (1.13) We can thus view theM-step as maximizing a lower bound on the marginal likelihood in Eq. (1.9), which follows from the nonnegativity of the kl divergence. Evidence Lower Bound This quantity is known as the Evidence Lower Bound (elbo) and is ubiquitous throughout machine learning. As a function of a particular x in the data domain, and with respect to a distribution q(zjx) and parameters , we can write the elbo as elbo(x;q;) := logp (x)D KL q(zjx)kp (zjx) : (1.14) The elbo will play a crucial role throughout this thesis, with our developments in Ch. 2 and Ch. 3 providing general families of lower bounds which include the elbo as a special case. For the E-step, note that the elbo is tight when D KL q(zjx)kp (zjx) = 0 or q(zjx) =p (zjx) for allx, sincep (x) is constant with respect toq(zjx). This implies that the optimal update in the E-step is the posterior, which can also be seen by expanding the joint kl divergence as in (1.13), q (t+1) (zjx) p (t)(zjx) = argmin q(zjx) D KL h q d (x)q(zjx) p (t)(x;z) i : (1.15) However, as in Eq. (1.10), this update may be dicult to perform exactly due to the appearance of the marginal likelihood p (t)(x) as the normalization constant of p (t)(zjx). Further, we must solve a separate inference problem to calculate the posterior for each data pointfx (i) g N i=1 . We discuss variational inference approaches to these problems below. 5 1.2 Variational Inference Variational inference (vi) frames posterior inference as an optimization problem, often using the elbo objective in Eq. (1.14). However, a crucial component of vi approaches is that optimization is performed within a tractable familyQ of approximate posterior distributionsq (zjx) parameterized by . While vi may not produce exact samples or posterior distributions, its ability to quickly obtain approximate posteriors for which density evaluation and sampling are tractable has made it a popular approach for performing inference on large-scale datasets. vi is often framed as seeking to minimize a statistical divergence betweenp (zjx) andq (zjx)2 Q . However, as in the elbo, it is usually necessary to include a term such as logp (x) to cancel the intractable p (x) in the posterior, q (zjx) = argmin q (zjx)2Q D[q (zjx) :p (zjx)] = argmax q (zjx)2Q logp (x)D q (zjx)kp (zjx) : (1.16) Note that q (zjx) may not be equal p (zjx) if the variational family is too limited to include the true posterior, such as when using a Gaussian family to approximate a multi-modal posterior. 
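To make the decomposition $\mathrm{elbo}(x;q,\theta) = \log p_\theta(x) - D_{KL}[q(z|x)\,\|\,p_\theta(z|x)]$ in Eq. (1.14) concrete, the following sketch evaluates both sides on a toy conjugate-Gaussian model of our own choosing ($z \sim N(0,1)$, $x|z \sim N(z,\sigma^2)$), where $\log p_\theta(x)$ and the exact posterior are available in closed form; the bound is tight exactly when $q(z|x)$ is set to the posterior, as in the E-step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal Gaussian example (not from the thesis) where log p(x) and the
# posterior are available in closed form:  z ~ N(0, 1),  x | z ~ N(z, sigma2).
sigma2, x = 0.5, 1.3
log_px = -0.5 * (np.log(2 * np.pi * (1 + sigma2)) + x**2 / (1 + sigma2))
post_mean, post_var = x / (1 + sigma2), sigma2 / (1 + sigma2)

def elbo(q_mean, q_var, n=200_000):
    """Monte Carlo estimate of E_q[log p(x,z) - log q(z|x)], as in Eq. (1.14)."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n)
    log_joint = (-0.5 * (np.log(2 * np.pi) + z**2)                            # log p(z)
                 - 0.5 * (np.log(2 * np.pi * sigma2) + (x - z)**2 / sigma2))  # log p(x|z)
    log_q = -0.5 * (np.log(2 * np.pi * q_var) + (z - q_mean)**2 / q_var)
    return np.mean(log_joint - log_q)

def kl_gauss(m1, v1, m2, v2):
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2)**2) / v2 - 1.0)

# A suboptimal q(z|x): elbo = log p(x) - KL[q || p(z|x)]  (up to MC error)
m, v = 0.0, 1.0
print(elbo(m, v), log_px - kl_gauss(m, v, post_mean, post_var))
# The bound is tight when q(z|x) equals the true posterior
print(elbo(post_mean, post_var), log_px)
```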
Expressive Variational Families Numerous works have thus considered more exible varia- tional families using techniques such as normalizing ows (Rezende and Mohamed, 2015; Kingma et al., 2016) or Markov Chain Monte Carlo (mcmc) transformations (Salimans et al., 2015; Ho- man, 2017; Caterini et al., 2018; Ding and Freedman, 2019; Ruiz and Titsias, 2019; Thin et al., 2021; Zhang et al., 2021; Gener and Domke, 2021). In many cases, these methods optimize the standardelbo in Eq. (1.16) for variational inference in the E-step. Notable exceptions include Caterini et al. (2018); Thin et al. (2021); Zhang et al. (2021); Gener and Domke (2021), which optimize an extended state space generalization of the elbo (see Sec. 1.4.4 Eq. (1.46) and Eq. (1.47)). We discuss these bounds in Ch. 2-3, while the Thermodynamic Variational Objective (tvo) (Masrani et al., 2019; Brekelmans et al., 2020a) in Ch. 3 should naturally benet from sample transformations as in the related work above. Partial Optimization in Variational Autoencoders So far, we have ignored the possibility that the E- and M- steps in Eq. (1.12) or the E-step within a variational family in Eq. (1.16) may not be performed optimally at each iteration. In fact, when jointly performing variational inference and maximum likelihood learning in generative models such as Variational Autoencoders (vae) 6 (Kingma and Welling, 2013; Rezende et al., 2014), it is common practice to take only one or several stochastic gradient steps to maximize the elbo objective in terms of either or . To the best of our knowledge, convergence guarantees for the em algorithm in this setting are lacking. We mention this issue only to highlight distinctions between our above exposition on the em algorithm and practical applications later in this thesis. We direct the interested reader to Cremer et al. (2017); He et al. (2019) for a detailed study of inference suboptimality in the E-step for generative modeling. For a xed modelp (x;z) or posteriorp (zjx), we also emphasize that the variational inference optimization in Eq. (1.16) may be performed on its own, without reference to an M-step. Alternative VI Objectives Moving beyond kl divergence minimization in the elbo, Li and Turner (2016); Dieng et al. (2017) use R enyi's -divergence to construct vi objectives of the form Eq. (1.16). While any divergence minimization will yield the true posterior when p (zjx)2Q , dierent divergences can cause the optimization in Eq. (1.16) to select dierent optimal variational distributions q (zjx) which, for example, might prioritize placing probability density on modes of p (zjx) or prioritize covering all regions where p (zjx)> 0. However, Knoblauch et al. (2019) argue that the posterior divergence minimization approach in Eq. (1.16) prevents a modular understanding of the behavior of variational inference. For ex- ample, changing the divergenceD[q (zjx)kp (zjx)] to the standard Bayesian posteriorp (zjx) may simultaneously change both the sensitivity of q (zjx)2Q to misspecication of the prior p (z) and its robustness to misspecication of the likelihood function p (xjz) (see Eq. (1.10)). While these arguments are admittedly subtle, they inspire Knoblauch et al. (2019) to propose generalized variational inference, an optimization-centric view of variational inference (vi) which closely aligns with our convex duality approaches in Sec. 1.3.1, Ch. 5, Ch. 6, and App. B. In particular, for a given variational familyQ , priorp (z), and loss function`(x;z), Knoblauch et al. 
(2019) propose to view the optimal approximate posterior as the solution to the optimization
$$q_\phi(z|x) = \operatorname*{argmin}_{q_\phi(z|x)\in\mathcal{Q}} \;\big\langle q_\phi(z|x),\, \ell(x,z)\big\rangle + D\big[q_\phi(z|x)\,\big\|\,p_\theta(z)\big] \qquad (1.17)$$
where inner product notation $\langle q_\phi(z|x), \ell(x,z)\rangle = \int q_\phi(z|x)\,\ell(x,z)\,dz$ indicates summation or integration over the appropriate domain, here $z\in\mathbb{R}^d$. Notice that the divergence involves the prior $p_\theta(z)$, in contrast to Eq. (1.16).

After rearranging terms in Eq. (1.14), we can recognize elbo maximization as a special case of this optimization for the choice of the kl divergence and $\ell(x,z) = -\log p_\theta(x|z)$,
$$\mathrm{elbo}(x; q_\phi, \theta) := \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big[q_\phi(z|x)\,\big\|\,p_\theta(z)\big]. \qquad (1.18)$$
To draw explicit connections between Eq. (1.17) and our convex conjugate duality notation in the rest of this thesis, we can view $T(x,z) = -\ell(x,z)$ as a negative energy, sufficient statistic, or distortion function. Rewriting Eq. (1.17), we have
$$q_\phi(z|x) = \operatorname*{argmax}_{q_\phi(z|x)\in\mathcal{Q}} \;\big\langle q_\phi(z|x),\, T(x,z)\big\rangle - D\big[q_\phi(z|x)\,\big\|\,p_\theta(z)\big]. \qquad (1.19)$$
Up to the variational family restriction $q_\phi(z|x)\in\mathcal{Q}$, we will see that a similar optimization arises from the convex conjugate function, assuming that $\Omega(q_\phi(z|x)) := D[q_\phi(z|x)\,\|\,p_\theta(z)]$ is a convex function of $q_\phi(z|x)$.

1.3 Conjugate Duality

In later chapters, we will leverage convex conjugate duality to gain additional insights into importance sampling problems (Sec. 2.6, Ch. 3), annealing paths (Ch. 5), and even regularized reinforcement learning methods (Ch. 6). To provide initial background and a motivating example, our goal in this section is to derive the elbo in Eq. (1.14) from the perspective of convex duality.¹ Throughout this section, we use $\tilde{\pi}(z)$ to indicate a possibly unnormalized, nonnegative density function over a possibly continuous sample space $z\in\mathcal{Z}$.

Conjugate Duality   For a convex function $\Omega(\tilde{\pi})$ with domain $\tilde{\pi}\in\mathcal{D}$, the conjugate function or Legendre transform $\Omega^*(T)$ is defined via the optimization
$$\Omega^*(T) = \sup_{\tilde{\pi}(z)\in\mathcal{D}} \;\big\langle \tilde{\pi}(z), T(z)\big\rangle - \Omega(\tilde{\pi}) \qquad (1.20)$$
where $T\in\mathcal{D}^*$ is a variable in the continuous dual space of $\mathcal{D}$. In what follows, we will often choose $\tilde{\pi}(z)\in\mathcal{D}$ to be the space of normalized probability distributions or unnormalized positive measures, with $T(z)\in\mathcal{D}^*$ as an appropriate corresponding function space. Inner product $\langle\cdot,\cdot\rangle$ notation indicates integration or summation over the domain, for example $\langle \pi(z), T(z)\rangle := \int \pi(z)\,T(z)\,dz$.

¹ For the multi-sample case and iwae lower bound on $\log p_\theta(x)$, see Sec. 2.6.6.
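As a small numerical bridge between Eq. (1.17) and the conjugate optimization in Eq. (1.20), the sketch below solves the unrestricted problem ($\mathcal{Q}$ all distributions, $D$ the kl divergence) on a three-state latent space; the prior and loss values are illustrative choices of ours. The minimizer is the Gibbs posterior $q^*(z|x) \propto p_\theta(z)\,e^{-\ell(x,z)}$, and the optimal value is the free energy $-\log \mathbb{E}_{p_\theta(z)}[e^{-\ell(x,z)}]$, foreshadowing the log partition functions that appear below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the generalized variational problem in Eq. (1.17) on a small
# discrete latent space, with an arbitrary loss ell(x, z) and prior p(z).
# When Q is unrestricted and D is the KL divergence, the minimizer is the
# Gibbs posterior  q*(z|x) proportional to p(z) exp(-ell(x, z)).
p_z = np.array([0.2, 0.5, 0.3])          # prior (illustrative values)
ell = np.array([1.7, 0.4, 2.3])          # loss ell(x, z) for a fixed x

def objective(q):
    return np.dot(q, ell) + np.dot(q, np.log(q / p_z))   # <q, ell> + KL[q || p]

q_star = p_z * np.exp(-ell)
q_star /= q_star.sum()

# q* attains a lower objective than randomly drawn competitors
competitors = rng.dirichlet(np.ones(3), size=5)
print(objective(q_star), [round(objective(q), 4) for q in competitors])
# and the optimal value equals -log E_p[exp(-ell)], the (negative) free energy
print(-np.log(np.sum(p_z * np.exp(-ell))))
```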
Figure 1.1: Convex Conjugate Representation of the KL Divergence. To give a graphical interpretation of the convex conjugate function, consider the graph of the function $\Omega^*(T)$ as a function of $T$. Note that $T$ may be a function in general, but we plot a one-dimensional variable for simplicity. The conjugate optimization $\Omega(\pi) = \sup_T \langle\pi, T\rangle - \Omega^*(T)$ can be visualized by drawing the line (hyperplane) with a given slope $\pi$ and finding the maximum difference with the convex function $\Omega^*(T)$. This occurs at the value of $T$ such that $\pi = \nabla_T \Omega^*(T)$, so that the gradient of the function at this point has the desired slope $\pi$. The $y$-intercept of this tangent line corresponds to the negative value of the function $-\Omega(\pi)$, so that this entire procedure yields a way to represent the function $\Omega$. For a different slope $\pi'$, we would obtain a $y$-intercept of $-\Omega(\pi')$ at a different optimizing point.

The conjugate operation is an involution for any proper, lower semi-continuous, convex $\Omega$ (Boyd and Vandenberghe, 2004), which implies that $(\Omega^*)^* = \Omega$. Note that $\Omega^*$ is also convex in this case. We can thus represent $\Omega(\tilde{\pi})$ via a conjugate optimization
$$\Omega(\tilde{\pi}) = \sup_{T(z)\in\mathcal{D}^*} \;\big\langle \tilde{\pi}(z), T(z)\big\rangle - \Omega^*(T). \qquad (1.21)$$
The dual function $T(z)$ is often called a `critic function' and appears prominently in the f-gan framework (Nowozin et al., 2016), which generalizes the minimax optimization found in generative adversarial network (gan) training beyond the special case of the Jensen-Shannon divergence (Goodfellow et al., 2014).

Dual Correspondence   Solving for the optimizing argument in each of Eq. (1.20) and Eq. (1.21) yields the following dual correspondence
$$\tilde{\pi}_T(z) = \nabla_T \Omega^*(T) \qquad\qquad T_{\tilde{\pi}}(z) = \nabla_{\tilde{\pi}} \Omega(\tilde{\pi}). \qquad (1.22)$$
The conjugate optimizations in Eq. (B.1) and (B.2) also suggest optimality conditions of the form $\tilde{\pi}_T(z) = (\nabla_{\tilde{\pi}}\Omega)^{-1}(T)$ and $T_{\tilde{\pi}}(z) = (\nabla_T \Omega^*)^{-1}(\tilde{\pi})$.

Conjugate duality thus provides an alternative representation of a convex function $\Omega(\tilde{\pi})$, using a related conjugate function $\Omega^*(T)$ whose input arguments correspond to gradients $T = \nabla\Omega(\tilde{\pi})$ of the original function. We show a graphical interpretation in Fig. 1.1.
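The graphical picture in Fig. 1.1 can also be checked numerically. The following sketch replaces the function-space variable $\tilde{\pi}$ with a single scalar and uses a convex function of our own choosing, $\Omega(\mu) = e^{\mu}$ (so that $\Omega^*(T) = T\log T - T$); it evaluates Eq. (1.20) by grid search and confirms that the optimizer satisfies the gradient correspondence in Eq. (1.22).

```python
import numpy as np

# Numerical illustration of Eqs. (1.20)-(1.22) for a simple scalar convex
# function (a stand-in for Omega; the choice Omega(mu) = exp(mu) is ours).
# Its conjugate is Omega*(T) = T log T - T, and the supremum in either
# direction is attained where the gradients match (Fig. 1.1).
mu_grid = np.linspace(-4.0, 4.0, 20001)
omega = np.exp                                   # Omega(mu) = e^mu
omega_star = lambda T: T * np.log(T) - T         # closed form, for comparison

def conj(T):
    """Omega*(T) = sup_mu  mu*T - Omega(mu), by grid search (Eq. 1.20)."""
    vals = mu_grid * T - omega(mu_grid)
    i = np.argmax(vals)
    return vals[i], mu_grid[i]

for T in [0.5, 1.0, 2.0]:
    val, mu_opt = conj(T)
    # grid value vs closed form; the optimizer satisfies T = Omega'(mu) = e^mu
    print(T, round(val, 4), round(omega_star(T), 4), round(np.exp(mu_opt), 4))
```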
Conjugate (T ) of the KL Divergence The kl divergence to a xed reference or prior distribution p(z) is a convex function of the distribution in the rst argument () :=D KL [jjp(z)]. Here, we will restrict thekl divergence to accept only normalized distributions as input and assume that p(z) is normalized. 2 Conditioned on a particular x, we consider the following Lagrangian p(z) (T ) = sup ~ (zjx) ~ (zjx);T (x;z) Z ~ (zjx) log ~ (zjx) p(z) dz (x) Z ~ (zjx)dz 1 +(a;s) (1.23) where we have included an explicit Lagrange multiplier (x) to enforce normalization of (zjx)/ ~ (zjx), and (a;s) is a Lagrange multiplier enforcing ~ (zjx) 0. Dierentiating Eq. (1.23) leads to the optimality condition T ~ =r (~ ) as in Eq. (1.22), T (x;z) = 1 log (zjx) p(z) + (x)(a;s): (1.24) Inverting this expression to solve for the optimizing argument , we have T (zjx) =p(z) expfT (x;z) (x)g: (1.25) Note that we ignore the Lagrange multiplier (a;s), since the exp function automatically enforces T (zjx) 0 for all z. Using the condition R T (zjx)dz = 1 in Eq. (1.25), we can solve for (x) as (x) = log Z p(z) expfT (x;z)gdz = logE p(z) h e T (x;z) i (1.26) Finally, substituting back into the conjugate optimization in Eq. (1.23), we obtain p(z) (T ) = (x) = logE p(z) h e T (x;z) i : (1.27) Thus, the conjugate function associated with () = D KL [ : p(z)] corresponds to the log- partition function of an exponential family (see App. A.0.1) with base measure p(z) and sucient statistics T (x;z). 2 The convex conjugate of the kl divergence over unnormalized density functions will play a role in Sec. 2.6 and Ch. 6, and is derived in App. C.1.1. 10 1.3.1 Conjugate Duality Interpretation of the ELBO We are nally ready to express the elbo as a lower bound induced by the conjugate dual opti- mization p(z) (T ) = sup h;Ti p(z) (). We choose p(z) as the base or reference distribution and T p (zjx) (x;z) = log p (x;z) p(z) = logp (xjz). Using these choices in Eq. (1.25) and Eq. (1.24), the optimizing argument T (zjx) =p (zjx) recovers the true posterior, with the log partition function corresponding to (T p (zjx) ) = logp (x), T (zjx) =p (zjx) (logp (xjz)) = logp (x) Finally, the conjugate optimization for (logp (xjz)) indicates that any suboptimal q (zjx)6= p (zjx) will provide a lower bound, which leads to (logp (xjz)) = logp (x) = sup (zjx) h(zjx); logp (xjz)iD KL [(zjx)kp(z)] (1.28) hq (zjx); logp (xjz)iD KL [q (zjx)kp(z)] (1.29) =E q (zjx) log p (x;z) q (zjx) (1.30) =:elbo(x;;) (1.31) which matches theelbo. Note that Eq. (1.28) is identical to the optimization-centric interpretation of the elbo from Knoblauch et al. (2019) and Eq. (1.17). A similar optimization is also used to dene Fenchel-Young losses for structured prediction (Blondel et al., 2020) and continuous attention mechanism (Martins et al., 2021). We can also think of the functional (T ) as a `free energy' corresponding to the sucient statistic or negative energy function T (x;z). If we consider T (x;z) = logp (xjz) to be a function of the parameters , the conjugate duality interpretation in this section recovers the variational interpretation of the em algorithm from Neal and Hinton (1998). Noting that the lower bound in Eq. (1.29) may also be rewritten in terms of the entropy H q (ZjX), we have (logp (xjz))F (q ;) :=hq (zjx); logp (x;z)i +H q (ZjX): (1.32) This matches Neal and Hinton (1998), where it is used provide a principled alternating maximization interpretation of em algorithm variants which, for example, only update a subset of variables in the e- or m steps. 
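The conjugate pair derived above is easy to verify on a discrete latent space. In the sketch below (with illustrative numerical values of our own choosing), $T(x,z) = \log p_\theta(x|z)$ and the reference is the prior $p_\theta(z)$: the conjugate $\Omega^*(T) = \log \mathbb{E}_{p(z)}[e^{T}]$ evaluates to $\log p_\theta(x)$, the optimizing argument in Eq. (1.25) is the posterior, and any other $q(z|x)$ yields an elbo that lower bounds $\log p_\theta(x)$, as in Eq. (1.29).

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete sanity check of Eqs. (1.27)-(1.31): with T(x,z) = log p(x|z) and
# reference p(z), the conjugate equals log p(x), the optimizer is the
# posterior, and any other q(z|x) yields an elbo below log p(x).
# (Numerical values are illustrative, not from the thesis.)
p_z = np.array([0.25, 0.25, 0.5])                  # prior over 3 latent states
p_x_given_z = np.array([0.1, 0.7, 0.3])            # likelihood of the observed x

log_px = np.log(np.sum(p_z * p_x_given_z))         # Omega*(T) = log E_p[exp(T)]
posterior = p_z * p_x_given_z / np.exp(log_px)     # optimizing argument, Eq. (1.25)

def elbo(q):                                       # Eq. (1.29): <q, log p(x|z)> - KL[q || p(z)]
    return np.dot(q, np.log(p_x_given_z)) - np.dot(q, np.log(q / p_z))

print(log_px, elbo(posterior))                     # bound is tight at the posterior
for q in rng.dirichlet(np.ones(3), size=3):        # below log p(x) otherwise
    print(elbo(q), elbo(q) <= log_px)
```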
11 Finally, in App. B.1 we show that we can characterize the gap in the conjugate optimizations Eq. (1.29) or Eq. (1.32) as a Bregman divergence, which in this case corresponds toD KL [q (zjx)kp (zjx)]. We consider the conjugate optimizations for alternative regularization functions in Ch. 6 and App. E, along with a more general framework for conjugate duality in Ch. 5. Further consideration of alternative convex regularization functions in the setting of generalized variational inference or Fenchel-Young losses remains for future work. 1.4 Importance Sampling As a second major theme of this thesis, we review various importance sampling techniques in this section, which we will build upon in later chapters. Importance sampling is among the most successful and well-studied approaches for estimating normalization constants or expectations un- der intractable distributions (Owen, 2013). We rst discuss bounds on the partition function or marginal likelihood p (x), before translating these approaches to lower bounds on logp (x) in Sec. 1.4.4. We will consider the notation of latent variable models p (x;z), where we have access to a tractable joint probability. For each x, we consider p (x;z) to be an unnormalized density over z, where the posterior p (zjx) = p (x;z) p (x) accounts for the marginal likelihood p (x) = R p (x;z)dz as a normalization constant. We emphasize the need to perform a separate estimation problem, or integration over z, for each data point x. In what follows, we will focus on estimation of the marginal likelihood p (x). 1.4.1 Simple Importance Sampling As a starting point, we consider multiplying the joint distribution inside the integral by a factor of 1, using a proposal or variational distribution q (zjx) which is tractable to sample and evaluate, p (x) = Z p (x;z)dz = Z q (zjx) q (zjx) p (x;z)dz =E q (zjx) p (x;z) q (zjx) (1.33) We dene the importance weights using this unnormalized density ratio, w(z) := p (x;z) q (zjx) ; (1.34) 12 and note that w provides an unbiased estimator of p (x) since its expectation under the sampling distribution q (zjx) is equal to p (x) (Eq. (1.33)). However, as we visualize in Fig. 1.3, this estimator may be extremely high variance if the pro- posalq (zjx) does not closely match the target. This may be exacerbated when using a parametric family of proposals such as Gaussian q (zjx). For the multi-modal target in Fig. 1.3, the uni- modal Gaussian cannot match both modes of the target and, in this case, attempts to spread mass across both separated modes. 3 Inspecting the importance weights (Fig. 1.3b) and their frequency (Fig. 1.3c), we see that we are most likely to obtain low importance weights when sampling from the mode of q (zjx), which is not high probability under p (x;z). These low importance weights are balanced by a heavy tail of large importance weights, in regions which have high probability under p (x;z) but are relatively unlikely to be sampled from q (zjx). These large contributions from rare events balance the more frequently small weights to yield an unbiased estimator of the partition function ratio. In fact, we show in Sec. 3.8.2 that the variance of the importance weights Var q [w(z)] p (x) 2 (expfD KL [p (zjx) :q (zjx)]g 1) scales exponentially with the kl divergence between pro- posal and target (Song and Ermon, 2019). This motivates our search for lower variance proposals which can reduce this kl divergence, and we will study extended state space importance sampling approaches which achieve this goal in Ch. 2. 
1.4.2 Multi-Sample Importance Sampling As a rst example, we can consider taking K samples from q (zjx) and reporting their arithmetic mean as our estimator w(z (1:K) ) := 1 K K X k=1 p (x;z (k) ) q (z (k) jx) : (1.35) 3 As discussed in \Alternative VI Objectives" paragraph in Sec. 1.2, the mass-covering or mode-seeking behavior for a learned proposal distribution will depend on the divergence used during optimization. For example, the mass- covering behavior observed here results from minqDKL[p (zjx)kq (zjx)], whereas minimizing thekl divergence in the opposite direction minqDKL[q (zjx)kp (zjx)] may result in mode-seeking behavior whereq (zjx) gravitates toward a single mode of the target distribution. 13 Taking the average will reduce the variance of the estimator across sets ofK samples, as we observe for our running example in Fig. 1.3d. This also provides an unbiased estimator of the partition function p (x), since E h w(z (1:K) ) i = Z K Y k=1 q (z (k) jx) 1 K K X k=1 p (x;z (k) ) q (z (k) jx) dz (1:K) = Z 1 K K X s=1 p (x;z (s) ) K Y k6=s q (z (k) jx)dz (1:K) =p (x) (1.36) where the second equality in the rst line reveals a mixture distribution with components indicating which sample is evaluated under p (x;z) (see Ch. 2 Sec. 2.2.3). The integral will evaluate to p (x) if q (zjx) is normalized. In Ch. 2, we show that this multi-sample importance sampling scheme has variance which scales as Var w(z (1:K) ) p (x) 2 exp n D KL h 1 K P K s=1 p (x;z (s) ) Q K k6=s q (z (k) jx)k Q K k=1 q (z (k) jx) io 1 , where this kl divergence can be shown to be less than or equal to the single-sample kl divergence D KL [p (zjx)kq (zjx)] (Brekelmans et al. (2022b) App. B2, Burda et al. (2016)). Intuitively, con- structing the mixture with additional samples fromq (zjx) `dilutes' the discrepancy betweenp (zjx) and q (zjx) in measuring the K-sample kl divergence. This probabilistic interpretation also un- derlies the Importance Weighted Autoencoder bounds (Burda et al., 2016; Sobolev and Vetrov, 2019). 1.4.3 Annealed Importance Sampling For simple importance sampling, it may be dicult to learn a tractable proposal distribution which closely matches the target distribution. If the kl divergence between the proposal and target is large, we have seen that we will have high variance importance weights. Instead, Annealed Importance Sampling (ais) (Neal, 2001; Jarzynski, 1997) deconstructs the importance sampling problem into a series of smaller subproblems, each of which use importance sampling to compare intermediate distributions which bridge between the initial and target distributions and are thus are closer in distribution space. More concretely, consider an initial distribution 0 (zjx) which is tractable to sample and is often chosen to have normalization constantZ 0 (x) = 1. We denote the target distribution T (zjx) = p(zjx) with unnormalized density T (x;z) =p(x;z) and normalizing constantZ T (x) =p(x). 14 (a) Target p(z) and Proposal q(z) (b) Importance Weights for various z (c) Single-Sample IS Weights w(z) (d) Multi-Sample (K = 5) IS Weights w(z (1:K) ) Figure 1.3: Importance Sampling Example. For the example target and proposal in (a), we calculate the importance weights at each point in the sample space in (b). In (c)-(d), we draw 1000 samples from q(z) and calculate the histogram and variance of importance weights for 1000 realizations of the single-sample estimator w(z) in (c), or by averaging K = 5 samples for 200 realizations of the estimator w(z (1:K) ) in (d). 
15 Algorithm 1: Annealed IS input : Endpoint densities ~ 0 (z); ~ 1 (z) Schedulef t g T t=0 Annealing Path t 7! ~ t (z) for k = 1 to K do Z 0 ~ 0 (z) w (k) 0 1 for t = 1 to T do Z (k) T T t (z (k) t jz (k) t1 ) w (k) t w (k) t1 ~ t(z (k) t ) ~ t1 (z (k) t ) return Approximate samples z T T (z) Z T =Z 0 1 K K P k=1 w (k) T ais (Neal, 2001) constructs a sequence of intermediate distributionsf t (z)g T t=0 , which bridge between the initial and target distributions. A common choice for intermediate distributions is the geometric mixture path parameterized byf t g T t=0 , with 0 = 0 and T = 1: t (zjx) = 0 (x;z) 1t T (x;z) t Z t (x) where Z t (x) = Z 0 (x;z) 1t T (x;z) t dz: (1.37) In Ch. 4, we propose a family of annealing paths which generalize the geometric average path and highlight connections between divergences, convex duality, and annealing paths. We provide extensive analysis of the geometric averaging path in Ch. 3. ais corresponds to importance sampling using an extended state space proposal distribution q ais prop (z 0:T jx), obtained by sampling from the initial 0 (zjx) and constructing transitionsT f (z t jz t1 ) which leave t1 (zjx) invariant. The target distributionp ais tgt (z 0:T jx) is given by running the reverse transitionsT r (z t1 jz t ) starting from a target or posterior sample T (zjx), as shown in Fig. 2.2. q ais prop (z 0:T jx) = 0 (z 0 jx) T Y t=1 T f (z t jz t1 ) p ais tgt (x;z 0:T ) = T (x;z T ) T Y t=1 T r (z t1 jz t ) (1.38) Since each transition kernel is normalized, the normalization of extended state space target will match p (x), withZ T (x) =p (x) = R p (x;z)dz = R p ais tgt (x;z 0:T )dz 0:T . 16 The property thatT f (z t jz t1 ) leave t1 (zjx) invariant, or R t1 (z t1 jx)T f (z t jz t1 )dz t1 = t1 (z t jx), is useful to ensure tractable calculation of the reverse transitions T r (z t1 jz t ) = t1 (z t1 jx)T f (z t jz t1 ) R t1 (z t1 jx)T f (z t jz t1 )dz t1 = t1 (z t1 jx)T f (z t jz t1 ) t1 (z t jx) The invariance assumption thus implies the identityT f (z t jz t1 ) t1 (z t1 jx) =T r (z t1 jz t ) t1 (z t jx). Often, Metropolis-Hastings accept-reject steps are used to ensure invariance or, in fact, the more strict detailed balance condition whereT r (z t1 jz t ) =T f (z t1 jz t ) is the `transpose' of the forward kernel (Habeck, 2017). Using the invariance assumption, we can simplify the importance weights between unnormalized densities in the extended state space as w 0:T = p ais tgt (x;z 0:T ) q ais prop (x;z 0:T ) = T (x;z T ) Q T t=1 T r (z t1 jz t ) 0 (x;z 0 ) Q T t=1 T f (z t jz t1 ) = T (x;z T ) 0 (x;z 0 ) T Y t=1 t1 (x;z t1 ) t1 (x;z t ) = T Y t=1 t (x;z t ) t1 (x;z t ) (1.39) = T Y t=1 T (x;z t ) 0 (x;z t ) tt1 : Note that these importance weights involve unnormalized densities in the general case whereZ 0 6= 1. 
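Algorithm 1 leaves the transition kernels and endpoint densities unspecified. The sketch below fills them in with common defaults of our own choosing: a geometric path (Eq. (1.37)) between a normalized Gaussian initial distribution and an unnormalized bimodal target, a uniform $\beta$ schedule, and a few random-walk Metropolis steps as the $\pi_{t-1}$-invariant kernel. It is a minimal runnable illustration rather than the implementation used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(4)

# Runnable sketch of Algorithm 1 (annealed importance sampling) with choices
# filled in by us: geometric path (Eq. 1.37), uniform beta schedule, and
# random-walk Metropolis as the pi_{t-1}-invariant transition kernel.
def log_pi0(z):                                    # normalized N(0, 2^2) initial density
    return -0.5 * (np.log(2 * np.pi * 4.0) + z ** 2 / 4.0)

def log_piT(z):                                    # unnormalized target, Z_T = 2.0
    comps = 0.5 * np.exp(-0.5 * (z - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (z + 2.0) ** 2)
    return np.log(2.0 * comps / np.sqrt(2 * np.pi))

def log_pi_beta(z, beta):                          # geometric mixture path, Eq. (1.37)
    return (1.0 - beta) * log_pi0(z) + beta * log_piT(z)

def metropolis(z, beta, n_steps=5, step=1.0):
    for _ in range(n_steps):
        prop = z + step * rng.standard_normal(z.shape)
        accept = np.log(rng.random(z.shape)) < log_pi_beta(prop, beta) - log_pi_beta(z, beta)
        z = np.where(accept, prop, z)
    return z

def ais(K=2000, T=100):
    betas = np.linspace(0.0, 1.0, T + 1)
    z = 2.0 * rng.standard_normal(K)               # z ~ pi_0
    log_w = np.zeros(K)
    for t in range(1, T + 1):
        z = metropolis(z, betas[t - 1])            # kernel leaves pi_{t-1} invariant
        log_w += log_pi_beta(z, betas[t]) - log_pi_beta(z, betas[t - 1])
    return z, log_w                                # approximate samples and log weights

z_T, log_w = ais()
print(np.mean(np.exp(log_w)))                      # estimate of Z_T / Z_0, close to 2.0
```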
ForZ 0 = R q ais prop (x;z 0:T )dz 0:T andZ T (x) = R p ais tgt (x;z 0:T )dz 0:T , we have w 0:T = p ais tgt (x;z 0:T ) q ais prop (x;z 0:T ) = Z T (x) Z 0 (x) p ais tgt (z 0:T jx) q ais prop (z 0:T jx) (1.40) Now, since the expectation of the normalized importance weights under q ais prop (z 0:T jx) equals 1, or E q ais prop (z 0:T jx) [ p ais tgt (z 0:T jx) q ais prop (z 0:T jx) ], the ais weights provide an unbiased estimator of the ratioZ T (x)=Z 0 (x), E q ais prop (z 0:T jx) [w 0:T ] =E q ais prop (z 0:T jx) Z T (x) Z 0 (x) p ais tgt (z 0:T jx) q ais prop (z 0:T jx) = Z T (x) Z 0 (x) (1.41) Assuming the initial distribution is normalizedZ 0 (x) = 1, ais provides the following unbiased estimate of the partition function ^ Z T (x) ^ Z T (x) = 1 K K X k=1 w (k) 0:T where w (k) 0:T = p ais tgt (x;z (k) 0:T ) q ais prop (z (k) 0:T jx) and z (k) 0:T q ais prop (z 0:T jx): (1.42) In this case, our lower bound on the variance of the importance weights Var q ais prop [w 0:T ]p (x) 2 (expfD KL [p ais tgt (z 0:T jx)kq ais prop (z 0:T jx)]g 1) scales exponentially inD KL [p ais tgt (z 0:T jx)kq ais prop (z 0:T jx)]. In the limit of asT!1, each intermediate distribution is innitesimally close and this divergence 17 approaches 0. While this is suggestive of the performance gains which can be obtained using ais, we provide a more formal analysis in App. D. 1.4.4 ELBO using Importance Sampling We have seen that it is relatively simple to construct an unbiased importance sampling estimator of the partition function p (x), although the variance of these estimators may be extremely large. We will now see that unbiased estimatorsp (x) will translate to stochastic lower bounds on the log partition function logp (x), as reasoned in Grosse et al. (2015) Sec. 4. We rst introduce notation which will preserve the generality of our exposition, since the same derivations hold for the simple, multi-sample, and annealed importance sampling approaches in the previous sections. In Ch. 2, we consider extended state space importance sampling over a collection of latent variables z ext , which will encompass the multi-sample and annealing examples from the previous section. The importance sampling proposal is given by q prop (x;z ext ), often chosen to have nor- malization constantZ 0 = 1 so that q prop (x;z ext ) = q prop (z ext jx). We construct the unnormalized target distribution p tgt (x;z ext ) to have normalization constantZ T = R p tgt (x;z ext )dz ext =p (x), so that the importance weights w(z ext ) := p tgt (x;z ext ) q prop (x;z ext ) E [w(z ext )] =E q prop (zextjx) p tgt (x;z ext ) q prop (x;z ext ) = Z T Z 0 =p (x) (1.43) provide an unbiased estimator of the partition function. We will now see that these unbiased estimators of p (x) translate to biased estimators of logp (x). First, since the logarithm is concave, Jensen's inequality implies that logp (x) = logE q prop (zextjx) p tgt (x;z ext ) q prop (z ext jx) E q prop (zextjx) log p tgt (x;z ext ) q prop (z ext jx) (1.44) where we use z ext to preserve generality across the multi-sample and annealing examples from the previous section (see Ch. 2). 
Sincep tgt (x;z ext ) =p (x)p tgt (z ext jx) in our settings, we can conrm that the gap in this bound is the kl divergence D KL [p tgt (z ext jx)kq prop (z ext jx)] logp (x)E q prop log p (x)p tgt (z ext jx) q prop (z ext jx) =E q prop log p tgt (z ext jx) q prop (z ext jx) =D KL [p tgt (z ext jx)kq prop (z ext jx)]: (1.45) We provide alternative perspectives on why this gap in Jensen's inequality is equal to the kl divergence or, more generally, a Bregman divergence, in Ch. 2 and App. B. 18 In Ch. 2, we argue that it is insightful to directly construct extended state space proposal q prop (z ext jx) and target p tgt (z ext jx) distributions which have the desired ratio of normalization constantsZ p tgt (x)=Z q prop (x) = p (x). Instead of having to posit an appropriate application of Jensen's inequality, we can directly construct a lower bound using elbo ext (x;q prop ;) =E q prop (zextjx) log p tgt (x;z ext ) q prop (z ext jx) = logp (x)D KL [q prop (z ext jx)kp tgt (z ext jx)]: (1.46) For q prop (z ext jx) = q (zjx), we recover the standard elbo in Eq. (1.14), while we will see in Ch. 2 that we can recover the Importance Weighted Autoencoder (iwae) lower bound of Burda et al. (2015b) using multi-sample importance sampling. For ais, we use the simplication in Eq. (1.39) to write log p tgt (x;zext) q prop (zextjx) = logw 0:T = log Q T t=1 T (x;zt) 0 (ztjx) t t1 . Plugging this into Eq. (1.46), we obtain an ais extended state space elbo, elbo ais (x;q ais prop ;) =E q ais prop (z 0:T jx) " T X t=1 t t1 log T (x;z t ) 0 (z t jx) # (1.47) where the expectation is over the ais proposal q ais prop (z 0:T jx) in Eq. (1.38) which samples in the forward direction. In Ch. 3, we provide intuitive visual interpretations of the bound in Eq. (1.47) in terms of Riemann sums and Taylor approximations. Under perfect ais transitions, we show in Sec. 3.9 that Eq. (1.47) matches the recent Thermodynamic Variational Objective (tvo)(Masrani et al., 2019; Brekelmans et al., 2020a). As we discuss in Ch. 2 Sec. 2.2.1, the extended state space approach in Eq. (1.46) can be used to derive further generalizations of theelbo with improved importance sampling proposals. As thekl divergence between proposal and target, D KL [q prop (z ext jx)kp tgt (z ext jx)], decreases, the resulting elbo will more tightly bound logp (x). For a given choice ofp tgt (z ext jx) andq prop (z ext jx), we can also obtain a stochastic upper bound on logp (x) which involves sampling fromp tgt (z ext jx). We refer to this as an Evidence Upper Bound (eubo), which has a gap equal to thekl divergence in the reverse directionD KL [p tgt (z ext jx)kq prop (z ext jx)]. The single-sample eubo also appears as a special case of the tvo framework in Ch. 3. 19 1.5 Contributions and Publications With this motivation and background, we are now ready to outline this thesis and highlight our contributions, linking each chapter to our central themes of importance sampling and conjugate duality. At the end of the section, we provide references to work that contributed to this thesis. Ch. 2: Improving Mutual Information Estimation Using Annealed and Energy-Based Bounds Brekelmans et al. (2022b) (ICLR 2022) In this chapter, we view estimation of the mutual information (mi, Eq. (1.7)) from the per- spective of importance sampling, where the log partition function logp (x) must be estimated even if the joint distribution is known. 
We propose a general approach for constructing lower and up- per bounds on the log partition function using extended state space importance sampling (as in Sec. 1.4.1-1.4.3). These bounds correspond to extended state space analogues of the elbo and eubo. We use this framework to present novel perspectives on annealed importance sampling (ais) bounds and propose methods which provide signicant gains over existing estimators of mi. How- ever, our methods often require additional assumptions such as availability of densities for the joint distribution p (x;z) or marginal distribution p(z). In particular, when p (x;z) is known, we nd that our proposed ais upper and lower bounds can accurately sandwich ground truth values of mi which are orders of magnitude larger than the limitations of existing estimators. For settings where only a single marginal distribution p(z) is known, we propose a method based on variational inference of the posteriorp (zjx) using an energy- based model (1.2). We draw connections with existing `contrastive' estimators such as InfoNCE (van den Oord et al., 2018; Poole et al., 2019) and demonstrate the eectiveness of our energy-based training scheme for the `critic' or negative energy function T (x;z). While the majority of Ch. 2 is written using the importance sampling viewpoint, we show in Sec. 2.6 that many of the bounds discussed in this chapter also have interpretations in terms of conjugate duality. In contrast to our derivations for the log partition function (T ) in Sec. 1.3.1, these interpretations are based on dual expansions of thekl divergence () =D KL [ : 0 ]. Further, in contrast toelbo interpretation in Sec. 1.3.1, these conjugate optimizations re ect Evidence Upper Bounds (eubo) on logp (x) and lower bounds on mi. Ch. 3: Bregman Duality in Thermodynamic Variational Inference Brekelmans et al. (2020a) (ICML 2020) Brekelmans et al. (2020d) (NeurIPS Workshop 2020) 20 Nguyen et al. (2020) (NeurIPS 2020) This chapter summarizes a line of work analyzing the Thermodynamic Variational (inference) Objective (TVO, Masrani et al. (2019)) and the geometric annealing path commonly used in ais (also used in Ch. 2). We use the Bregman divergences and conjugate duality associated with the exponential family (Sec. B.2) to characterize the gap in the tvo lower and upper bounds on logp (x). In particular, we nd that the tvo objectives correspond to the elbo ais and eubo ais bounds obtained from ais under perfect transitions. We use the tvo framework to prove that the bias of the ais sandwich bounds reduces linearly with increasing T , under perfect transitions and uniform scheduling. However, most work on tvo to date (Masrani et al. (2019); Brekelmans et al. (2020b); Nguyen et al. (2020); Chen et al. (2021)) does not usemcmc sample transformations, as would be prescribed from the ais perspective. We use a static approximate sampling scheme to report experimental results in 3.7, and describe several algorithmic improvements for this version of the tvo in Sec. 3.5- 3.6. However, in Sec. 3.9.2, we highlight the shortfall of this sampling scheme compared to ground truth estimates of the tvo objective obtained using ais. We nally discuss recent work (Zhang et al., 2021; Gener and Domke, 2021) proposing reparameterization gradients through ais, which might be used to improve optimization of the tvo objective. 
Our analysis of the tvo involves what we call a likelihood ratio exponential family (Brekelmans et al., 2020d), which is a one-dimensional exponential family that matches the geometric annealing path between arbitrary endpoints. We discuss connections between this interpretation, hypothesis testing, and Renyi's -divergence in Sec. 3.8. This interpretation will also play a role in Ch. 5. Ch. 4: q-Paths: Generalizing the Geometric Path using Power Means Masrani et al. (2021) (UAI 2021, RB as joint rst author) Brekelmans et al. (2020a) (NeurIPS Workshop 2020, Best Paper Award) So far, we have used the geometric annealing path of intermediate densities to facilitate accu- rate importance sampling (Sec. 1.4.3, Ch. 2) and analyze `thermodynamic' objectives for variational inference (Ch. 3). However, the geometric path in Eq. (1.37) is nothing more than the arithmetic mean of the endpoint unnormalized densities after applying the logarithm as a monotonic trans- formation of the density function, log ~ (z) = (1) log ~ 0 (z) + log ~ 1 (z). In this chapter, we propose a family of annealing paths based on generalized or quasi-arithmetic means associated with a one-parameter family of q-deformed logarithms (Tsallis, 2009; Naudts, 2011). In particular, the geometric path corresponds to the value q = 1. We present a rough 21 heuristic for choosing the best q-path for a given sampling problem, and show that values of q just smaller than 1 can improve estimation performance in evaluating generative models and calculating Bayesian model evidences. Finally, we mention recent work proposing q-paths for variational inference with the tvo (Chen et al., 2021). Ch. 5: Bregman Information and the Geometry of Annealing Paths Brekelmans and Nielsen (2022) (Information Geometry for Data Science conference, in prep for Springer Information Geometry journal) In addition to the derivation in terms of quasi-arithmetic means, Masrani et al. (2021) inter- preted q-paths as arising from minimizing the expected -divergence to the endpoint densities. Grosse et al. (2013) show an analogous result for the geometric averaging path and the kl diver- gence. To interpret these results, we leverage the rho-tau Bregman divergence and nonparametric information geometry framework of Zhang (2004, 2013); Naudts and Zhang (2018) to generalize the `centroid' property of Bregman divergences (Banerjee et al., 2005c) from arithmetic means to quasi-arithmetic means. In addition to the derivation of q-paths as the solution to an ex- pected-divergence minimization, we provide an alternative interpretation in terms of an expected -divergence minimization. These results highlight an intimate relationship between divergence functions, generalized means, and monotonic embedding functions of probability densities from the perspective of convex duality. Ch. 6: Your Policy is Secretly an Adversary Brekelmans et al. (2022a) (Transactions on Machine Learning Research ) Finally, we apply our convex duality perspectives in the context of reinforcement learning, where we interpret kl and -divergence regularization of the behavioral policy of an agent as providing robustness to adversarial perturbations of the reward function. We derive the convex conjugate functions for the -divergence and characterize the set of reward perturbations which are feasible for the adversary. This feasible set describes the robust set of rewards to which a policy generalizes. 
Our work claries and unies several related works on this topic (Ortega and Lee, 2014; Eysenbach and Levine, 2021; Husain et al., 2021). Although this chapter uses dierent notation and problem setting from the rest of the thesis, the recent work of Husain and Knoblauch (2022) translates these insights to the setting and notation of variational inference. 22 Rate-Distortion and Information Bottleneck in Representation Learning We have omit- ted several earlier works (Brekelmans et al., 2019; Moyer et al., 2018; Jaiswal et al., 2019) from this thesis. In (Brekelmans et al., 2019), we consider a rate-distortion (rd) perspective on the elbo for standard variational inference and propose the Echo noise model, which admits an analytical form for the mutual information I q (x;z) induced by a learned encoding distribution q (zjx). In Jaiswal et al. (2019), we apply this noise model in the setting of invariant representation learning, where we would like our encoder to produce latent variables which are invariant to certain nuisance variables or sensitive attributes. We use the Information Bottleneck (ib) framework to jointly learn downstream classiers and/or generator networks from this compressed, invariant representation (Moyer et al., 2018; Jaiswal et al., 2019). In Brekelmans et al. (2020d), we view rate-distortion (r-d), Information Bottleneck (ib), and combined rate-distortion-classication approaches under the same likelihood ratio exponential fam- ily approach that gures prominently in our interpretation of the tvo in Ch. 3 and analysis of the geometric averaging path in Ch. 5. In particular, changing the Lagrange multipliers in the loss function corresponds to changing the natural parameters of the exponential family, and allows one to traverse the landscape of possible tradeos between compression, reconstruction loss, and/or classication accuracy. In the case of lossy compression, r-d, orib, the Lagrange multiplier con- trolling the tradeo between compression and delity corresponds to the mixing parameter along the geometric annealing path commonly used in ais. Huang et al. (2020) leverage this insight to evaluate the entire rd curve in one run of ais. We provide detailed analysis into the role of the parameter in Ch. 3 and Ch. 5, including connections with hypothesis testing and Renyi or Amari -divergences. 23 Publications R. Brekelmans and F. Nielsen. Rho-tau bregman information and the geometry of annealing paths. Information Geometry for Data Science Conference (under review), 2022. R. Brekelmans, D. Moyer, A. Galstyan, and G. V. Steeg. Exact rate-distortion in autoencoders via echo noise. In Advances in Neural Information Processing Systems, pages 3889{3900, 2019. R. Brekelmans, V. Masrani, T. Bui, F. Wood, A. Galstyan, G. V. Steeg, and F. Nielsen. Annealed importance sampling with q-paths. In NeurIPS Workshop on Information Geometry in Deep Learning, 2020a. URL https://openreview.net/pdf?id=ZBJ20FRVPD. R. Brekelmans, V. Masrani, F. Wood, G. V. Steeg, and A. Galstyan. All in the exponential family: Bregman duality in thermodynamic variational inference. In International Conference on Machine Learning, 2020b. R. Brekelmans, F. Nielsen, A. Galstyan, and G. V. Steeg. Likelihood ratio exponential fami- lies. In NeurIPS Workshop on Information Geometry in Deep Learning, 2020c. URL https: //openreview.net/forum?id=RoTADibt26_. R. Brekelmans, T. Genewein, J. Grau-Moya, G. Del etang, M. Kunesch, S. Legg, and P. Ortega. Your policy regularizer is secretly an adversary. 
Transactions on Machine Learning Research, 2022a. R. Brekelmans, S. Huang, M. Ghassemi, G. V. Steeg, R. B. Grosse, and A. Makhzani. Improving mutual information estimation with annealed and energy-based bounds. In International Confer- ence on Learning Representations, 2022b. URL https://openreview.net/forum?id=T0B9AoM_ bFg. A. Jaiswal, R. Brekelmans, D. Moyer, G. V. Steeg, W. AbdAlmageed, and P. Natarajan. Discovery and separation of features for invariant representation learning. arXiv preprint:1912.00646, 2019. 24 V. Masrani, R. Brekelmans (Equal Contribution), T. Bui, F. Nielsen, A. Galstyan, G. V. Steeg, and F. Wood. q-paths: Generalizing the geometric annealing path using power means. Uncertainty in Articial Intelligence, 2021. D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg. Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pages 9084{9093, 2018. V. Nguyen, V. Masrani, R. Brekelmans, M. Osborne, and F. Wood. Gaussian process bandit opti- mization of the thermodynamic variational objective. Advances in Neural Information Processing Systems, 33:5764{5775, 2020. 25 Chapter 2 Improving Mutual Information Estimation Using Annealed and Energy-Based Bounds 2.1 Introduction Mutual information (mi) is among the most general measures of dependence between two random variables. Among other applications in machine learning, mi has been used for both training (Alemi et al., 2016, 2018; Chen et al., 2016; Zhao et al., 2018) and evaluating (Alemi and Fischer, 2018; Huang et al., 2020) generative models. Furthermore, successes in neural network function approximation have encouraged a wave of variational or contrastive methods for mi estimation from samples only (Belghazi et al., 2018b; van den Oord et al., 2018; Poole et al., 2019). However, McAllester and Stratos (2020) have shown strong theoretical limitations on any estimator based on direct sampling without an analytic form of at least one marginal distribution. In light of these limitations, we consider mi estimation in settings where a single marginal or the full joint distribution are known. In this work, we view mi estimation from the perspective of importance sampling. Using a general approach for constructing extended state space bounds on mi, we combine insights from Importance Weighted Autoencoder (iwae) (Burda et al., 2016; Sobolev and Vetrov, 2019) and Annealed Importance Sampling (ais) (Neal, 2001) to propose Multi-Sample ais bounds in Sec. 2.3. We empirically show that this approach can tightly estimate large values of mi when the full joint distribution is known. Our importance sampling perspective also suggests improved mi lower bounds that assume ac- cess to only joint samples for optimization, but require a single marginal distribution for evaluation. In Sec. 2.2.4, we propose Generalized iwae (giwae), which generalizes both iwae and InfoNCE 26 (Poole et al., 2019) and highlights how variational learning can complement multi-sample con- trastive estimation to improve mi lower bounds. Finally, in Sec. 2.4 we propose mine-ais, which optimizes a tighter lower bound than mine (Belghazi et al., 2018b), called the Implicit Barber- Agakov Lower bound (ibal). We demonstrate that the ibal corresponds to the innite-sample limit of thegiwae lower bound, although our proposed energy-based training scheme involves only a single `negative' contrastive sample obtained using Markov Chain Monte Carlo (mcmc). 
mine-ais then uses Multi-Sample ais to evaluate the lower bound on mi, and shows notable improvement over existing variational methods in the challenging setting of mi estimation for deep generative models. We summarize the mi bounds discussed in this chapter and their relationships in Fig. 2.1.

2.1.1 Problem Setting

The mutual information between two random variables $x$ and $z$ with joint distribution $p(x,z)$ is

\[
I(x;z) = \mathbb{E}_{p(x,z)}\left[\log \frac{p(x,z)}{p(x)p(z)}\right] = H(x) - H(x|z) = \mathbb{E}_{p(x,z)}[\log p(x|z)] - \mathbb{E}_{p(x)}[\log p(x)], \tag{2.1}
\]

where $H(x|z)$ denotes the conditional entropy $-\mathbb{E}_{p(x,z)}[\log p(x|z)]$. We primarily focus on bounds that assume either a single marginal distribution or the full joint distribution are available. A natural setting where the full joint distribution is available is estimating mi in deep generative models between the latent variables, with a known prior $z \sim p(z)$, and data $x \sim p(x)$ simulated from the model (Alemi and Fischer, 2018).¹ Settings where a single marginal is known appear, for example, in simulation-based inference (Cranmer et al., 2020), where information about input parameters $\theta$ is known and a simulator can generate $x$ for a given $\theta$, but the likelihood $p(x|\theta)$ is intractable.

While sampling from the posterior $p(z|x)$ for an arbitrary $x$ is often intractable, we can obtain a single posterior sample for $x \sim p(x)$ in cases where samples from the joint distribution $p(x)p(z|x)$ are available. We will refer to bounds which involve only a single posterior sample as practical, and those involving multiple posterior samples as impractical.

When the conditional $p(x|z)$ is tractable to sample and evaluate, simple Monte Carlo sampling provides an unbiased, low variance estimate of the conditional entropy term in Eq. (2.1). The difficulty of mi estimation then reduces to estimating the log partition function $\log p(x)$, for which importance sampling (is) based methods are among the most well studied and successful solutions.

¹ An alternative, "encoding" mi between the real data and the latent code is often of interest (see Brekelmans et al. (2022b) App. N), but cannot be directly estimated using our methods due to the unavailability of $p_d(x)$ or $q(z) = \int p_d(x) q(z|x)\, dz$.

Figure 2.1: Schematic of various mi bounds. Green shading indicates our contributions, while columns and gold labels indicate single- or multi-sample bounds. Blue arrows indicate special cases using the indicated proposal distribution. Several bounds with unknown $p(x|z)$ use learned energy or critic functions, where the optimal critic function reflects the true $p(x|z)$. Relationships based on critic functions are indicated by red arrows. Bounds for unknown $p(x|z)$ provide only lower bounds on mi, while we obtain both upper and lower bounds with known $p(x|z)$. All bounds require a single known marginal $p(z)$ for evaluation, except (Structured) Info-NCE.

2.2 Unifying Mutual Information Bounds via Importance Sampling

In this section, we present a unified view of mutual information estimation from the perspective of extended state space importance sampling. This general approach provides a probabilistic interpretation of many existing mi bounds and will suggest novel extensions in Sec. 2.3 and Sec. 2.4.
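Before introducing the individual bounds, it may help to make the bookkeeping implied by Eq. (2.1) explicit: given joint samples and a tractable $p(x|z)$, any per-example lower (upper) bound on $\log p(x)$ yields an upper (lower) bound on mi. The sketch below is only an illustration of this reduction; the function name and arguments are our own and do not correspond to the released code.

```python
import numpy as np

def mi_bounds_from_logpx_bounds(log_p_x_given_z, logpx_lower, logpx_upper):
    """Turn per-example bounds on log p(x) into bounds on I(x; z), following Eq. (2.1).

    log_p_x_given_z : array of log p(x_i | z_i) for N joint samples (x_i, z_i) ~ p(x, z)
    logpx_lower     : array of stochastic lower bounds on log p(x_i), e.g. an elbo per example
    logpx_upper     : array of stochastic upper bounds on log p(x_i), e.g. a eubo per example
    """
    cond_term = np.mean(log_p_x_given_z)          # Monte Carlo estimate of -H(x|z)
    mi_upper = cond_term - np.mean(logpx_lower)   # lower bound on log p(x) -> upper bound on mi
    mi_lower = cond_term - np.mean(logpx_upper)   # upper bound on log p(x) -> lower bound on mi
    return mi_lower, mi_upper
```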
2.2.1 A General Approach for Extended State Space Importance Sampling Bounds To estimate the log partition function, we construct a proposalq prop (z ext jx) and targetp tgt (x;z ext ) distribution over an extended state space, such that the normalization constant of p tgt (x;z ext ) is Z tgt = R p tgt (x;z ext )dz ext = p(x) and the normalization constant of q prop (x;z ext ) isZ prop = 1. Taking expectations of the log importance weights logp tgt (x;z ext )=q prop (z ext jx) under the proposal and target, respectively, we obtain lower and upper bounds on the log partition function E qprop(zextjx) log p tgt (x;z ext ) q prop (z ext jx) | {z } elbo x;q prop ;p tgt logp(x)E ptgt(zextjx) log p tgt (x;z ext ) q prop (z ext jx) | {z } eubo x;q prop ;p tgt : (2.2) 28 These bounds correspond to extended state space versions of the elbo and eubo. In particular, the gap in the lower bound is the forward kl divergence D KL [q prop (z ext jx)jjp tgt (z ext jx)] and the gap in the upper bound equal to the reverse kl divergence D KL [p tgt (z ext jx)jjq prop (z ext jx)]. For example, E q prop (zextjx) p tgt (x;z ext ) q prop (z ext jx) = logp(x)D KL [q prop (z ext jx)kp tgt (z ext jx)] | {z } elbo(x;q prop ;p tgt ) log logp(x) (2.3) with similar derivations for the eubo (see Brekelmans et al. (2022b) App A). 2.2.2 Barber-Agakov Lower and Upper Bounds As a rst example, consider the standard elbo(q ) and eubo(q ) bounds, which are derived from simple importance sampling using a variational distribution q (zjx) and z ext = z in Eq. (2.2). Plugging these lower and upper bounds on logp(x) into Eq. (2.1), we obtain upper and lower bounds on mi as I ba L (q ) :=E p (x;z) log q (zjx) p(z) I(x;z)E p (x)q (zjx) log q (zjx) p (x;z) H(xjz) =:I ba U (q ): (2.4) The left hand side of Eq. (2.4) is the well-known Barber-Agakov (ba) bound (Barber and Agakov, 2003), which has a gap of E p(x) [D KL [p(zjx)jjq (zjx)]]. We refer to the right hand side as the ba upper boundI ba U (q ), with a gap ofE p(x) [D KL [q (zjx)jjp(zjx)]]. In contrast toI ba U (q ), note that I ba L (q ) does not require access to the conditional p(xjz) to evaluate the bound. 2.2.3 Importance Weighted Autoencoder The iwae lower and upper bounds on logp(x) (Burda et al., 2016; Sobolev and Vetrov, 2019) improve upon simple importance sampling by extending the state space using multiple samples z ext =z (1:K) . Consider a proposal q iwae prop (z (1:K) jx) with K independent samples from a given varia- tional distributionq (zjx). The extended state space targetp iwae tgt (z (1:K) jx) is a mixture distribution involving a single sample from the posterior p(zjx) or jointp(x;z) andK 1 samples fromq (zjx) q iwae prop (z (1:K) jx) := K Y k=1 q (z (k) jx) p iwae tgt (x;z (1:K) ) := 1 K K X s=1 p(x;z (s) ) K Y k=1;k6=s q (z (k) jx): (2.5) 29 The log importance weight log p iwae tgt (x;z (1:K) ) q iwae prop (z (1:K) jx) reduces to the familiar ratio in theiwae objective, while the normalization constant of p iwae tgt (x;z (1:K) ) is p(x). As in Sec. 2.2.1, taking expectations under the proposal or target yields a lower or upper bound, respectively, E K Q k=1 q (z (k) jx) " log 1 K K X i=1 p(x;z (k) ) q (z (k) jx) # | {z } elbo iwae (x;q ;K) logp (x)E p(z (1) jx) K Q k=2 q (z (k) jx) " log 1 K K X i=1 p(x;z (k) ) q (z (k) jx) # | {z } eubo iwae (x;q ;K) : (2.6) For simplicity of notation, we assume s = 1 and z (1) p(zjx) when writing the expectation in eubo iwae (q ;K), due to invariance of Eq. (2.5) to permutation of the indices. See Brekelmans et al. 
(2022b) App. B for detailed derivations of these bounds. As for the standard elbo and eubo, the gap in the iwae lower and upper bounds on logp(x) are D KL [q iwae prop jjp iwae tgt ] and D KL [p iwae tgt jjq iwae prop ], respectively. As in Sec. 2.1.1, with known p(xjz), the lower and upper bounds on logp(x), elbo iwae (x;q ;K) and eubo iwae (x;q ;K) translate to upper and lower bounds on mi, which we denote as I IWAE U (q ;K) and I IWAE L (q ;K). Complexity in K While it is well-known that increasing K leads to tighter iwae bounds (Burda et al., 2016; Sobolev and Vetrov, 2019), we explicitly characterize the improvement of multi-sampleiwae bounds over the single-sample elbo oreubo in the following proposition. This result will lay the foundation for similar results throughout the rest of the paper. In particular, any bound which involves expectations under a mixture of one `positive' sample and K 1 `negative' samples, such as eubo iwae (x;q ;K), will be limited to logarithmic improvement in K. Proposition 2.2.1 (Improvement of IWAE with Increasing K). Let p iwae tgt (sjx;z (1:K) ) = p(x;z (s) ) q (z (s) jx) = P K k=1 p(x;z (k) ) q (z (k) jx) denote the normalized importance weights andU(s) indicate the uniform distribution overK discrete values. Then, we can characterize the improvement ofelbo IWAE (x;q ;K) and eubo IWAE (x;q ;K) over elbo(x;q ) and eubo(x;q ) using kl divergences, as follows elbo IWAE (x;q ;K) =elbo(x;q ) +E q iwae prop (z (1:K) jx) D KL [U(s)kp iwae tgt (sjz (1:K) ;x)] | {z } 0kl of uniform from snis weightsD KL [q (zjx)kp(zjx)] ; (2.7) eubo IWAE (x;q ;K) =eubo(x;q )E p iwae tgt (z (1:K) jx) D KL [p iwae tgt (sjz (1:K) ;x)kU(s)] | {z } 0kl of snis weights from uniform logK : (2.8) 30 Prop. 2.2.1 demonstrates that the improvement of the iwae log partition function bounds over its single-sample counterparts is larger for more non-uniform snis weights. Notably, the improve- ment of eubo IWAE (x;q ;K) over the single-sample eubo(x;q ) is limited by logK. Translating Prop. 2.2.1 to the iwae bounds on mi yields the following corollary. Corollary 2.2.2. iwae bounds on mi improve upon the ba bounds with the following relationships: I BA L (q )I IWAE L (q ;K)I BA L (q ) + logK; I IWAE U (q ;K)I BA U (q ): (2.9) Cor. 2.2.2 shows that, in order to obtain a tight bound on mi, the iwae lower bound requires exponential sample complexity inE p(x) [D KL [p(zjx)kq (zjx)]], which is the gap of eithereubo(x;q ) or I ba L (q ). Although elbo iwae (x;q ;K) and I IWAE U (q ;K) are not guaranteed to be limited to logarithmic improvement with increasing K, it has been argued that the same exponential sam- ple complexity, K/ exp D KL [p(zjx)kq (zjx)] , is required to achieve tight importance sampling bounds (Chatterjee and Diaconis (2018)). These observations motivate our improved ais proposals in Sec. 2.3, which achieve linear bias reduction in the number of intermediate distributions T used to bridge between q (zjx) and p(zjx). 2.2.4 Generalized IWAE In this section, we consider a family of Generalized iwae (giwae) lower bounds, which improve upon I ba L (q ) using multiple samples K > 1 and a contrastive critic function T (x;z), but do not require access to p(xjz) as in I iwae L (q ;K). While similar bounds appear in (Lawson et al., 2019; Sobolev, 2019), we provide a thorough discussion of special cases, and empirical analysis for mi estimation in Sec. 2.5.3. We also show that our ibal bound in Sec. 2.4 corresponds to the innite sample limit of giwae. 
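Before turning to the giwae construction, the iwae sandwich of Eq. (2.6) can be made concrete on a toy problem. The sketch below uses a conjugate linear-Gaussian model of our own choosing, in which $\log p(x)$ and the exact posterior sample required by $\mathrm{eubo}_{\mathrm{iwae}}$ are available in closed form; it illustrates the bounds themselves, not the models used in Sec. 2.5.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5                              # toy model: z ~ N(0,1), x | z ~ N(z, sigma2)

def log_normal(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mu) ** 2 / (2 * var)

def log_joint(x, z):                      # log p(x, z) = log p(z) + log p(x | z)
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, sigma2)

post_mean = lambda x: x / (1.0 + sigma2)  # exact posterior p(z | x) is Gaussian here
post_var = sigma2 / (1.0 + sigma2)

q_mean, q_var = lambda x: 0.5 * x, 1.0    # deliberately mismatched variational q(z | x)

def iwae_bounds(x, K):
    # elbo_iwae: K independent samples from the proposal q(z | x)    (Eq. 2.6, left)
    z = rng.normal(q_mean(x), np.sqrt(q_var), size=K)
    log_w = log_joint(x, z) - log_normal(z, q_mean(x), q_var)
    elbo_iwae = np.logaddexp.reduce(log_w) - np.log(K)

    # eubo_iwae: replace one sample with a true posterior sample     (Eq. 2.6, right)
    z[0] = rng.normal(post_mean(x), np.sqrt(post_var))
    log_w = log_joint(x, z) - log_normal(z, q_mean(x), q_var)
    eubo_iwae = np.logaddexp.reduce(log_w) - np.log(K)
    return elbo_iwae, eubo_iwae

x = 1.3
true_logpx = log_normal(x, 0.0, 1.0 + sigma2)
lo, hi = np.mean([iwae_bounds(x, K=64) for _ in range(2000)], axis=0)
print(f"elbo_iwae {lo:.3f} <= log p(x) {true_logpx:.3f} <= eubo_iwae {hi:.3f}")  # holds in expectation
```

Increasing $K$ tightens both estimates; by Prop. 2.2.1, the eubo side can gain at most $\log K$ over the single-sample bound.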
To derive a probabilistic interpretation for giwae, we begin by further extending the state space of the iwae target distribution in Eq. (2.5), using a uniform index variable p(s) = 1 K 8s that species which sample z (s) p(zjx) is drawn from the posterior p giwae tgt (z (1:K) ;sjx) := 1 K p(z (s) jx) K Y k=1;k6=s q(z (k) jx): (2.10) 31 Note that marginalization overs leads to theiwae targetp iwae tgt (z (1:K) jx) in Eq. (2.5). The posterior p giwae tgt (sjx;z (1:K) ) over the index variable s, which infers the index of the positive sample drawn from p(zjx) given a set of samples z (1:K) , corresponds to the normalized importance weights. p giwae tgt (sjx;z (1:K) ) = p(x;z (s) ) q (z (s) jx) K P k=1 p(x;z (k) ) q (z (k) jx) : (2.11) For thegiwae extended state space proposal, consider a categorical index variableq giwae prop (sjz (1:K) ;x) drawn using snis, with weights calculated by a learned critic function T q giwae prop (z (1:K) ;sjx) = K Y k=1 q(z (k) jx) ! q giwae prop (sjz (1:K) ;x); (2.12) where q giwae prop (sjz (1:K) ;x) = e T (x;z (s) ) K P k=1 e T (x;z (k) ) : We can view thesnis distributionq giwae prop (sjz (1:K) ;x) as performing variational inference of the pos- terior p giwae tgt (sjx;z (1:K) ). We will show in Prop. 2.2.3 that the optimal giwae critic function is T (x;z) = log p(x;z) q (zjx) +c(x), in which case Eq. (2.11)-(2.12) recover the iwae probabilistic inter- pretation from Domke and Sheldon (2018) (see Brekelmans et al. (2022b) App. B1). We will focus on the giwae lower bound on logp(x), as we nd in Brekelmans et al. (2022b) App. C1-C2 that the giwae upper bound on logp(x) does not provide practical benet. The corresponding mi lower bound is I GIWAE L (q ;T ;K) =E p(x;z) log q (zjx) p(z) | {z } I BA L (q ) +E p(x)p(z (1) jx) K Q k=2 q (z (k) jx) " log e T (x;z (1) ) 1 K P K i=1 e T (x;z (k) ) # | {z } 0 contrastive term logK : (2.13) We observe that the giwae lower bound decomposes into the sum of two terms, where the rst is theba variational lower bound forq (zjx) and the second is a contrastive term which distinguishes negative samples from q (zjx) and positive samples from p(zjx). Relationship with BA With a constant T (x;z) = const, the second term in giwae vanishes and we haveI GIWAE L (q ;T = const;K) =I BA L (q ) for allK. ForK = 1,I GIWAE L (q ;T ;K = 1) also equals the ba lower bound for all T . Similarly to ba, giwae requires access to the analytical 32 form of p(z) to evaluate the bound on mi. However, both the ba and giwae lower bounds can be used to optimize mi even if no marginal is available. See Brekelmans et al. (2022b) App. N. Relationship with InfoNCE When the priorp(z) is used in place ofq (zjx), we can recognize the second term in Eq. (2.13) as the InfoNCE contrastive lower bound (van den Oord et al., 2018; Poole et al., 2019), with I InfoNCE L (T ;K) = I giwae L (p(z);T ;K). From this perspective, the giwae lower bound highlights how variational learning can complement contrastive bounds to improve mi estimation beyond the known logK limitations of InfoNCE (van den Oord et al. (2018), see Eq. (2.17) below). However, using the prior as the proposal in InfoNCE does allow the critic function to admit a bi-linear implementation T (x;z) = f x (x) T f z (z) , which requires onlyN +K forward passes instead ofNK forgiwae, whereN is the batch size andK is the total number of positive and negative samples. Relationship with IWAE The following propositions characterize the relationship between the giwae lower bound in Eq. 
(2.13) and the iwae lower bound on mi from Sec. 2.2.3. Proposition 2.2.3 (Improvement of IWAE over GIWAE). For a given q (zjx), the iwae lower bound on mi is tighter than the giwae lower bound for anyT (x;z). Their dierence is the average kl divergence between the normalized importance weights p giwae tgt (sjz (1:K) ;x) and the variational distribution q giwae prop (sjz (1:K) ;x) in Eq. (2.12), I IWAE L (q ;K) =I GIWAE L (q ;T ;K) +E p(x)p iwae tgt (z (1:K) jx) h D KL [p giwae tgt (sjz (1:K) ;x)jjq giwae prop (sjz (1:K) ;x)] i : Corollary 2.2.4 (Optimal GIWAE Critic Function yields IWAE). For a givenq (zjx) andK > 1, the optimal giwae critic function is equal to the true log importance weights up to an arbitrary constant: T (x;z) = log p(x;z) q (zjx) +c(x). With this choice of T (x;z), we have I GIWAE L (q ;T ;K) =I IWAE L (q ;K): (2.14) Corollary 2.2.5 (Logarithmic Improvement of GIWAE). Suppose the critic function T (x;z) is parameterized by , and9 0 s:t:8 (x;z); T 0 (x;z) = const. For a given q (zjx), let T (x;z) denote 33 the critic function that maximizes the giwae lower bound. Using Cor. 2.2.2 and Cor. 2.2.4, we have I BA L (q )I GIWAE L (q ;T ;K)I IWAE L (q ;K)I BA L (q ) + logK: (2.15) See Brekelmans et al. (2022b) C4-C5 for proofs. Note that Prop. 2.2.1, which relates the iwae and ba lower bounds, can be seen as a special case of Prop. 2.2.3, since I ba L (q ) is a special case of giwae. While the giwae lower bound on mi does not assume access to the full joint distribution, Cor. 2.2.4 suggests that the role of the critic function is to learn the true log importance weights for the variational q (zjx). Thus, when p(x;z) is known, iwae is always preferable to giwae. Cor. 2.2.5 shows that while I GIWAE L (q;T ;K) can improve upon ba, this improvement is at most logarithmic in K. Finally, Eq. (2.13) suggests a similar decomposition of I IWAE L (q ;K) into a ba term and a contrastive term I IWAE L (q ;K) =E p(x;z) log q (zjx) p(z) | {z } I BA (q) +E p(x)p(z (1) jx) K Q k=2 q (z (k) jx) 2 4 log p(x;z (1) ) q (z (1) jx) 1 K P K i=1 p(x;z (k) ) q (z (k) jx) 3 5 | {z } 0 contrastive termlogK : (2.16) Relationship with Structured InfoNCE InfoNCE and Structured InfoNCE are special cases of giwae and iwae, respectively, which use p(z) as the variational distribution. Since I ba L (p(z)) = 0, Cor. 2.2.5 suggests the following relationship 0I InfoNCE L (T ;K)I S-InfoNCE L (K) logK: (2.17) From the giwae perspective, we can interpret the logK limitations of (Structured) InfoNCE as arising from the logK improvement results for I giwae L (q ;T ;K) and I iwae L (q ;K) in Cor. 2.2.5. 2.3 Multi-Sample AIS Bounds on logp(x) and Mutual Information ais (Neal, 2001) is considered the gold standard for obtaining unbiased and low variance estimates of the partition function, while Bidirectional Monte Carlo (bdmc) (Grosse et al., 2015, 2016) provides lower and upper bounds on the log partition function using forward and reverse ais chains. In this section, we propose various Multi-Sample ais bounds, which highlight that extending the 34 state space over multiple samples, as in iwae, is complementary to extending the state space over intermediate distributions as in ais. Our approach includes bdmc bounds as special cases, and we obtain a novel upper bound which can match or improve upon the performance of bdmc. 
We derive novel probabilistic interpretations of Multi-Sample ais bounds which, perhaps surprisingly, suggest that the practical sampling schemes used in bdmc do not correspond to the same probabilistic interpretation. We present our Multi-Sample ais bounds of $\log p_\theta(x)$ within the context of estimating the generative mutual information (Alemi and Fischer, 2018), as in Sec. 2.1.1. However, our methods are equally applicable for other applications, including evaluating the marginal likelihood (Wu et al., 2016; Grosse et al., 2015) or rate-distortion curve (Huang et al., 2020) of generative models.

We first review background on ais, before describing Multi-Sample ais log partition function bounds in Sec. 2.3.2 and 2.3.3. For all bounds in this section, we assume that both the true marginal $p(z)$ and conditional $p(x|z)$ densities are known.

2.3.1 Annealed Importance Sampling Background

ais (Neal, 2001) constructs a sequence of intermediate distributions $\{\pi_t(z)\}_{t=0}^{T}$, which bridge between a normalized initial distribution $\pi_0(z|x)$ and target distribution $\pi_T(z|x) = p(z|x)$. The target has an unnormalized density $\tilde{\pi}_T(x,z) = p(x,z)$ and normalizing constant $\mathcal{Z}_T(x) = p(x)$. A common choice for intermediate distributions is the geometric path parameterized by $\{\beta_t\}_{t=0}^{T}$:

\[
\pi_t(z|x) = \frac{\pi_0(z|x)^{1-\beta_t}\, \tilde{\pi}_T(x,z)^{\beta_t}}{\mathcal{Z}_t(x)}, \quad \text{where} \quad \mathcal{Z}_t(x) = \int \pi_0(z|x)^{1-\beta_t}\, \tilde{\pi}_T(x,z)^{\beta_t}\, dz. \tag{2.18}
\]

In the probabilistic interpretation of ais, we consider an extended state space proposal $q^{\mathrm{ais}}_{\mathrm{prop}}(z_{0:T}|x)$, obtained by sampling from the initial $\pi_0(z|x)$ and constructing transitions $\mathcal{T}_f(z_t|z_{t-1})$ which leave $\pi_{t-1}(z|x)$ invariant. The target distribution $p^{\mathrm{ais}}_{\mathrm{tgt}}(z_{0:T}|x)$ is given by running the reverse transitions $\mathcal{T}_r(z_{t-1}|z_t)$ starting from a target or posterior sample $z_T \sim \pi_T(z|x)$, as shown in Fig. 2.2,

\[
q^{\mathrm{ais}}_{\mathrm{prop}}(z_{0:T}|x) := \pi_0(z_0|x) \prod_{t=1}^{T} \mathcal{T}_f(z_t|z_{t-1}), \qquad p^{\mathrm{ais}}_{\mathrm{tgt}}(x, z_{0:T}) := \tilde{\pi}_T(x, z_T) \prod_{t=1}^{T} \mathcal{T}_r(z_{t-1}|z_t). \tag{2.19}
\]

[Figure 2.2 compares, for iwae, Single-Sample ais, Independent Multi-Sample ais, Independent Reverse Multi-Sample ais, and Coupled Reverse Multi-Sample ais, whether each bound is practical (✓, ✓, ✓, ✗, ✓), and diagrams the target chains (yielding the eubo and mi lower bound) and the proposal chains (yielding the elbo and mi upper bound).]

Figure 2.2: Extended state-space probabilistic interpretations of Multi-Sample ais bounds. Forward chains are colored in blue, and backward chains are colored in red. Note that elbos and eubos are obtained by taking the expectation of the log importance weight $\log p_{\mathrm{tgt}}(\cdot)/q_{\mathrm{prop}}(\cdot)$ under either the proposal or target distribution, and can then be translated to mi bounds.

As in Sec.
2.2.1, taking expectations of the log importance weights under the proposal and target yields a lower and upper bound on the log partition function logp (x) E z 0:T q ais prop log p ais tgt (x;z 0:T ) q ais prop (z 0:T jx) | {z } elbo ais (x; 0 ;T ) logp (x)E z 0:T p ais tgt log p ais tgt (x;z 0:T ) q ais prop (z 0:T jx) | {z } eubo ais (x; 0 ;T ) : (2.20) These single-chain lower and upper bounds translate to upper and lower bounds on mi,I ais L ( 0 ;T ) and I ais U ( 0 ;T ), which were suggested for mi estimation in the blog post of Sobolev (2019). To characterize the bias reduction forais with increasingT , we prove the following proposition. Proposition 2.3.1 (Complexity in T ). Assuming perfect transitions and a geometric annealing path with linearly-spacedf t g T t=1 , the gap of the ais upper and lower bounds (Eq. (2.20)) reduces linearly with increasing T , eubo ais (x; 0 ;T )elbo ais (x; 0 ;T ) = 1 T D KL [ T (zjx)k 0 (zjx)] +D KL [ 0 (zjx)k T (zjx)] : (2.21) See App. D for a proof, which involves techniques from Ch. 3. Our proposition generalizes Thm. 1 of Grosse et al. (2013), which holds for the case of T!1 instead of nite T as above. In our experiments in Sec. 2.5, we will nd that this linear bias reduction inT is crucial for achieving tight mi estimation when both p(z) andp(xjz) are known. We can further tighten the single-sample ais bounds with multiple annealing chains (K > 1) using two dierent approaches, which we present in the following sections. 36 2.3.2 Independent Multi-Sample AIS Bounds To derive Independent Multi-Sample ais (im-ais), we construct an extended state space proposal by runningK independentais forward chainsz (k) 0:T q ais prop in parallel. Similarly to theiwae upper bound (Eq. (2.6)), the extended state space target involves selecting a indexs uniformly at random, and running a backward ais chain z (s) 0:T p ais tgt starting from a true posterior sample z T p(zjx). The remainingK 1 samples are obtained by running forwardais chains, as visualized in Fig. 2.2 q im-ais prop (z (1:K) 0:T jx) := K Y k=1 q ais prop (z (k) 0:T jx); p im-ais tgt (z (1:K) 0:T ;x) := 1 K K X s=1 p ais tgt (x;z (s) 0:T ) K Y k=1;k6=s q ais prop (z (k) 0:T jx); (2.22) whereq ais prop andp ais tgt were dened in Eq. (2.19). Note that sampling from the extended state space target distribution is practical, as it only requires one sample from the true posterior distribution. As in Sec. 2.2.1, taking the expectation of the log unnormalized density ratio under the proposal and target yields lower and upper bounds on logp (x), respectively, E z (1:K) 0:T q ais prop log 1 K K X k=1 p ais tgt (x;z (k) 0:T ) q ais prop (z (k) 0:T jx) | {z } elbo im-ais (x; 0 ;K;T ) logp(x)E z (1) 0:T p ais tgt z (2:K) 0:T q ais prop log 1 K K X k=1 p ais tgt (x;z (k) 0:T ) q ais prop (z (k) 0:T jx) | {z } eubo im-ais (x; 0 ;K;T ) ; (2.23) which again have extended-state space kl divergences as the gap in their bounds. Independent Multi-Sample ais reduces to iwae for T = 1, and reduces to single-sample ais for K = 1. Both upper and lower bounds are tight as K ! 1 or T ! 1, and translate to lower and upper bounds on mi as in Sec. 2.1.1. In Brekelmans et al. (2022b) App C5, we show a similar result to Prop. 2.2.1, which characterizes the improvement of Independent Multi-Sample ais with increasing K. In particular, the lower bound on mi is limited to logarithmic improvement over single-sample ais, with I im-ais L ( 0 ;K;T )I ais L ( 0 ;T ) + logK. 
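Prop. 2.3.1 can be checked numerically in a setting where perfect transitions are available. The sketch below uses a one-dimensional Gaussian toy problem of our own choosing, in which every intermediate distribution along the geometric path of Eq. (2.18) is Gaussian, so "perfect transitions" can be simulated by sampling each $\pi_{\beta}$ exactly. It illustrates the single-chain sandwich of Eq. (2.20) and the roughly $1/T$ gap of Prop. 2.3.1; it is not the mcmc-based implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy annealing problem (illustrative constants): initial pi_0 = N(0, 1),
# unnormalized target pi_tilde_T(z) = Z_true * N(z; mu, sig2), so log Z_T = log Z_true.
Z_true, mu, sig2 = 50.0, 2.0, 0.3

def log_normal(z, m, v):
    return -0.5 * np.log(2 * np.pi * v) - (z - m) ** 2 / (2 * v)

log_pi0 = lambda z: log_normal(z, 0.0, 1.0)
log_piT = lambda z: np.log(Z_true) + log_normal(z, mu, sig2)

def intermediate_params(beta):
    # Geometric path pi_beta ∝ pi_0^(1-beta) * pi_tilde_T^beta is Gaussian in this toy problem.
    prec = (1 - beta) * 1.0 + beta / sig2
    mean = (beta * mu / sig2) / prec
    return mean, 1.0 / prec

def ais_sandwich(T, n_chains=2000):
    betas = np.linspace(0.0, 1.0, T + 1)
    elbo = np.zeros(n_chains)
    eubo = np.zeros(n_chains)
    for t in range(1, T + 1):
        dbeta = betas[t] - betas[t - 1]
        # "Perfect transitions": draw each intermediate exactly instead of running mcmc.
        m_lo, v_lo = intermediate_params(betas[t - 1])   # forward-chain state ~ pi_{t-1}
        m_hi, v_hi = intermediate_params(betas[t])       # reverse-chain state ~ pi_t
        z_lo = rng.normal(m_lo, np.sqrt(v_lo), size=n_chains)
        z_hi = rng.normal(m_hi, np.sqrt(v_hi), size=n_chains)
        elbo += dbeta * (log_piT(z_lo) - log_pi0(z_lo))  # left Riemann sum: lower bound
        eubo += dbeta * (log_piT(z_hi) - log_pi0(z_hi))  # right Riemann sum: upper bound
    return elbo.mean(), eubo.mean()

print(f"log Z_T = {np.log(Z_true):.3f}")
for T in (2, 8, 32, 128):
    lo, up = ais_sandwich(T)
    print(f"T={T:4d}:  elbo_ais={lo:7.3f}   eubo_ais={up:7.3f}   gap={up - lo:.3f}")  # gap shrinks ~ 1/T
```

With uniformly spaced $\beta_t$, the printed gap decays linearly in $T$, matching Eq. (2.21).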
2.3.3 Coupled Reverse Multi-Sample AIS Bounds We can exchange the role of the forward and backward annealing chains in Independent Multi- Sample ais to obtain alternative bounds on the log partition function. We dene Independent Reverse Multi-Sample ais (ir-ais) using the following proposal and target distribution, as shown in Fig. 2.2. q ir-ais prop (x;z (1:K) 0:T ) := 1 K K X s=1 q ais prop (z (s) 0:T jx) K Y k=1;k6=s p ais tgt (x;z (k) 0:T ); p ir-ais tgt (x;z (1:K) 0:T ) := K Y k=1 p ais tgt (x;z (k) 0:T ): 37 Note that partition function ratio isZ tgt =Z prop = p(x) K =p(x) K1 = p(x). Using these distribu- tions, we derive logp(x) and mi bounds in Brekelmans et al. (2022b) App. G. However, the these bounds will be impractical in most settings since they require multiple true posterior samples (see Sec. 2.1.1). To address this, we propose Coupled Reverse Multi-Sample ais (cr-ais). As shown in Fig. 2.2, the extended state space target distribution initializes K backward chains from a single target sample z T T (zjx), with the remaining transitions p ais tgt (z 0:T1 jz T ) matching standard ais in Eq. (2.19). p cr-ais tgt (z (1:K) 0:T1 ;z T ;x) := T (z T ;x) K Y k=1 p ais tgt (z (k) 0:T1 jz T ;x): (2.24) The extended state space proposal is obtained by selecting an index s uniformly at random and running a single forward ais chain. We then run K 1 backward chains, all starting from the last state of the selected forward chain, as visualized in Fig. 2.2 q cr-ais prop (z (1:K) 0:T1 ;z T jx) := 1 K K X s=1 q ais prop (z (s) 0:T1 ;z T jx) K Y k=1;k6=s p ais tgt (z (k) 0:T1 jz T ;x): (2.25) Taking the expected log ratio under the proposal and target yields lower and upper bounds on logp(x), E z (1) 0:T1 ;z T q ais prop (z 0:T jx) z (2:K) 0:T1 p ais tgt (z 0:T1 jz T ;x) log 1 K K X k=1 q ais prop () p ais tgt () | {z } elbo cr-ais (x; 0 ;K;T ) logp(x)E z T T (z T jx) z (1:K) 0:T1 p ais tgt (z 0:T1 jz T ;x) log 1 K K X k=1 q ais prop () p ais tgt () | {z } eubo cr-ais (x; 0 ;K;T ) : We show in Brekelmans et al. (2022b) App. H3 that the Coupled Reverse Multi-Sample ais upper bound onmi is limited to logarithmic improvement over single-sampleais, withI cr-ais U ( 0 ;K;T ) I ais U ( 0 ;T ) logK. 2.3.4 Discussion Relationship with BDMC Bidirectional Monte Carlo (bdmc) (Grosse et al., 2015, 2016) was the rst to propose multi-sample log partition function bounds using ais chains. In particular, the multi-sample bdmc lower and upper bounds (Grosse et al., 2015, 2016) on the log partition 38 function correspond to the lower bound of Independent Multi-Sample ais and upper bound of Coupled Reverse Multi-Sample ais, respectively. Our probabilistic interpretations provide novel perspective on these existing bdmc bounds. Perhaps surprisingly, we found that the multi-sample bdmc lower bound (Fig. 2.2 Col. 4, Row 4) and upper bound (Fig. 2.2 Col. 6, Row 3) do not arise from the same probabilistic interpretation (i.e. are not in the same row in Fig. 2.2). In other words, the gap in their log partition function bounds do not correspond to the forward and reverse kl divergences between the same pair of extended state space proposal and target distributions. We experimentally compare all Multi- Sample ais bounds, including bdmc, in Sec. 2.5.1 to provide recommendations for which bounds to use in practice. Eect of K and T We have shown in Prop. D.0.1 that Multi-Sample ais bounds can achieve linear bias reduction with increasing T , although this computation must be done in serial fashion. 
While increasing K involves parallel computation, its bias reduction is often only logarithmic, as we show for I iwae L (q ;K) (Cor. 2.2.2), I im-ais L ( 0 ;K;T ) (Brekelmans et al. (2022b) App. E3 Prop. E2), and I cr-ais U ( 0 ;K;T ) (Brekelmans et al. (2022b) App. H3 Prop. H2). Based on these arguments, we recommend increasingK until computation can no longer be parallelized on a given hardware and allocating all remaining resources to increasing T . 2.4 MINE-AIS Estimation of Mutual Information In this section, we present energy-based bounds which are designed for settings where the true conditional density p(xjz) is unknown. We rst review Mutual Information Neural Estimation (mine) (Belghazi et al., 2018b) in Sec. 2.4.1, and present probabilistic interpretations which allow us to derive Generalized mine lower bounds with a variational q (zjx) as the base distribution instead of the marginal p(z). Our main contribution in this section is the mine-ais method, which optimizes a tighter lower bound onmi than the Generalizedmine lower bound and extends our Multi-Sample ais evaluation method to the case where p(xjz) is unknown. 39 2.4.1 Generalized Mutual Information Neural Estimation To derive a probabilistic interpretation for mine (Belghazi et al., 2018b), consider an energy based approximation to the joint distribution p(x;z). To extend mine, we consider a general base varia- tional distribution q (zjx) in place of the marginal p(z) ; (x;z) := 1 Z ; p(x)q (zjx)e T (x;z) ; where Z ; =E p(x)q (zjx) h e T (x;z) i ; (2.26) where T (x;z) is the negative energy function and the partition functionZ ; integrates over both x and z. Subtracting a joint kl divergence D KL [p(x;z)k ; (x;z)] from I(x;z), we obtain the Generalized mine-dv lower bound on mi I(x;z)I gmine-dv (q ;T ) :=E p(x;z) log q (zjx) p(z) | {z } I BAL (q ) +E p(x;z) [T (x;z)] logE p(x)q (zjx) h e T (x;z) i | {z } contrastive term : (2.27) Note that the standard mine-dv lower bound (Belghazi et al., 2018b) corresponds to Generalized mine-dv using the prior p(z) as the proposal, I mine-dv (T ) = I gmine-dv (p(z);T ). In Sec. 2.6.3, we interpret the contrastive term in Eq. (2.27) as arising from the dual representation of the kl divergence D KL [p(x;z)kp(x)q (zjx)]. The Generalized mine-f bound is a looser lower bound than Generalized mine-dv, which can be derived from Eq. (2.27) using the inequality logu u e , I gmine-dv (q ;T )I gmine-f (q ;T ) :=E p(x;z) h log q (zjx) p(z) i +E p(x;z) [T (x;z)]E p(x)q (zjx) e T (x;z)1 : Generalized mine-f reduces to standard mine-f when the base distribution is the prior, with I mine-f (T ) =I gmine-f (p(z);T ). We provide a probabilistic interpretation of Generalized mine-f in Brekelmans et al. (2022b) App. J3, and a conjugate duality interpretation in Sec. 2.6.4. Despite the intractability of the log partition function term logZ ; in Eq. (2.27), Belghazi et al. (2018b) use direct sampling inside the logarithm for training T . Our mine-ais method will both improve the training and evaluation schemes of (Generalized) mine and optimize a tighter lower bound. 40 2.4.2 MINE-AIS Estimation of Mutual Information In this section, we presentmine-ais, which is inspired by Generalized mine but optimizes a tighter lower bound on mi that involves an intractable log partition function. We show in Prop. 2.4.2 that this bound corresponds to the limiting behavior of the giwae lower bound as K!1. 
However, we present a qualitatively dierent training scheme inspired by contrastive divergence learning of energy-based models (Hinton, 2002) in Sec. 2.4.2.1. In Sec. 2.4.2.2, we use Multi-Sample ais to evaluate the intractable log partition function, and show that mine-ais reduces to the methods in Sec. 2.3 for the optimal critic function or known p(xjz). To begin, we consider a exible energy-based distribution ; (zjx) as an approximation to the posterior p (zjx) (Poole et al., 2019; Arbel et al., 2020) ; (zjx) := 1 Z ; (x) q (zjx)e T (x;z) ; where Z ; (x) =E q (zjx) h e T (x;z) i ; (2.28) whereq (zjx) is a base variational distribution which can be tractably sampled from and evaluated and T (x;z) is the negative energy or critic function. Plugging ; (zjx) into theba lower bound, we denote the resulting bound as theibal. We use the term \implicit", since it is often dicult to explicitly evaluate the ibal due to the intractable log partition function term. After simplifying I ba L ( ; ), we obtain I(x;z)ibal(q ;T ) :=I ba L ( ; ) =E p (x;z) log q (zjx) p(z) +E p (x;z) [T (x;z)]E p (x) [logZ ; (x)] =E p(x;z) log q (zjx) p(z) | {z } I ba L (q ) +E p(x;z) " log e T (x;z) E q (zjx) e T (x;z) # | {z } 0 contrastive termE p (x) D KL [p(zjx)kq (zjx)] ; (2.29) where the gap of the ibal is E p (x) [D KL [p (zjx))jj ; (zjx)]]. Note that the ibal generalizes the Unnormalized Barber-Agakov bound from Poole et al. (2019), with I uba (T ) =ibal(p(z);T ). Proposition 2.4.1. For a givenq (zjx), the optimal ibal critic function equals the log importance weights up to a constant T (x;z) = log p(x;z) q (zjx) +c(x). For this T , we have ibal(q ;T ) =I(x;z). 41 Relationship with MINE ibal(q ;T ) provides a tighter lower bound than Generalized mine, as we discuss in detail in Brekelmans et al. (2022b) App. J and summarize in Fig. 2.6. In particular, we have I(x;z)ibal(q ;T )I gmine-dv (q ;T )I gmine-f (q ;T ): (2.30) Our mine-ais method will optimize and evaluate the intractable ibal directly. Relationship with GIWAE The ibal lower bound resembles the giwae lower bound in that both improve upon the ba variational bound using a contrastive term. Further, Prop. 2.4.1 and Cor. 2.2.4 show that, in both cases, the optimal critic function equals the true log importance weights plus a constant. The following proposition shows that ibal(q ;T ) can be viewed the limiting behavior of the giwae lower bound as K!1. Proposition 2.4.2 (IBAL as Limiting Behavior of GIWAE). For given q (zjx) and T (x;z), we have lim K!1 I GIWAE L (q ;T ;K) =ibal(q ;T ): (2.31) See Brekelmans et al. (2022b) App. L2 for proof. To gain intuition for Prop. 2.4.2, we consider a closely related result involving the marginal SNIS distribution of giwaeq giwae prop (zjx;K), which re- sults from samplingK times fromq (zjx) and returning the sample in indexsq giwae prop (sjx;z (1:K) )/ e T (x;z (s) ) . In Brekelmans et al. (2022b) App. L3, we show that in the limit asK!1, the marginal snis distribution of giwae matches themine-ais energy-based posterior ; (zjx)/q (zjx)e T (x;z) . In both cases, the energy or critic function `modulates' the base distribution q (zjx) to better ap- proximate the true posterior. The contrastive term in the giwae bound is tractable, as it only requires a single posterior sample and K 1 samples from the base distribution. As shown in Cor. 2.2.4, this limits the improvement ofI giwae L (q ;T ) overI ba L (q ) to logK nats, even for the optimal critic function. 
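A toy example may help make the energy-based posterior of Eq. (2.28) and Prop. 2.4.1 concrete. The sketch below uses a one-dimensional conjugate Gaussian model of our own construction (all constants and the choice of $q_\phi$ are illustrative): with the optimal critic, the base $q_\phi(z|x)$ is reweighted into the exact posterior and $\log \mathcal{Z}_{\theta,\phi}(x) = \log p(x)$, while a Monte Carlo average inside the logarithm, as used in mine-style training, underestimates $\log \mathcal{Z}_{\theta,\phi}(x)$ on average by Jensen's inequality, which is one motivation for the ais-based evaluation of Sec. 2.4.2.2. The tractable giwae contrastive term, meanwhile, remains capped at $\log K$ even with this optimal critic.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 0.5                                    # toy model: z ~ N(0,1), x | z ~ N(z, sigma2)

def log_normal(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mu) ** 2 / (2 * var)

log_joint = lambda x, z: log_normal(z, 0.0, 1.0) + log_normal(x, z, sigma2)

x = 1.3
log_px = log_normal(x, 0.0, 1.0 + sigma2)       # exact log p(x) in this conjugate model

# Base variational distribution q(z|x) and the optimal critic of Prop. 2.4.1 (taking c(x) = 0).
q_mean, q_var = 0.5 * x, 1.0
log_q = lambda z: log_normal(z, q_mean, q_var)
T_opt = lambda z: log_joint(x, z) - log_q(z)    # = log p(x, z) / q(z | x)

# Energy-based posterior pi(z|x) ∝ q(z|x) exp(T(x,z)): with the optimal critic the unnormalized
# density equals p(x, z), so normalizing recovers p(z|x) and log E_q[exp(T)] equals log p(x).
grid = np.linspace(-5.0, 6.0, 4001)
dz = grid[1] - grid[0]
log_unnorm = log_q(grid) + T_opt(grid)
log_Z_grid = np.log(np.sum(np.exp(log_unnorm)) * dz)
print(f"grid log Z = {log_Z_grid:.3f}    exact log p(x) = {log_px:.3f}")

# A K-sample Monte Carlo average inside the logarithm underestimates log Z on average (Jensen).
K = 10
z = rng.normal(q_mean, np.sqrt(q_var), size=(5000, K))
log_Z_mc = np.log(np.mean(np.exp(T_opt(z)), axis=1)).mean()
print(f"E[ log (1/K) sum_k exp(T) ] = {log_Z_mc:.3f}  <=  log Z")
```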
By contrast, in the following proposition we show that the contrastive term inibal(q ;T ) is expressive enough to potentially close the gap in the ba bound. This improved contrastive term comes at the cost of tractability, as ibal(q ;T ) involves an intractable partition function logE q (zjx) [e T (x;z) ]. 42 Proposition 2.4.3. Suppose the critic function T (x;z) is parameterized by , and that9 0 s:t: T 0 (x;z) = const. For a given q (zjx), let T (x;z) denote the critic function that maximizes ibal(q ;T ). Then, I BA L (q )ibal(q ;T )I(x;z) =I BA L (q ) +E p (x) [D KL [p(zjx)kq (zjx)]]: (2.32) In particular, the contrastive term in Eq. (2.29) is upper bounded by E p (x) [D KL [p(zjx)kq (zjx)]]. See Brekelmans et al. (2022b) App L1 for proof. In the following sections, we present our mine-ais method, which overcomes the intractability ofibal(q ;T ) using anmcmc-based training scheme and ais-based evaluation. 2.4.2.1 MINE-AIS Training Although the log partition function logZ ; (x) in theibal is intractable to evaluate, we only require an unbiased estimator of its gradient for training. Dierentiating Eq. (2.29) with respect to the parameters and of the variational and energy function, respectively, we obtain @ @ ibal(q ;T ) =E p (x;z) @ @ logq (zjx) E p (x) ; (zjx) @ @ logq (zjx) : (2.33) @ @ ibal(q ;T ) =E p (x;z) @ @ T (x;z) E p (x) ; (zjx) @ @ T (x;z) : (2.34) Eq. (2.33) indicates that in order to maximize the ibal as a function of and , we need to increase the value ofT (x;z) or logq (zjx) on positive samples fromp(x;z), and lower it on negative samples from p(x) ; (zjx). As is common in training energy-based models, it is dicult to draw samples from ; (zjx). A natural approach is to initialize mcmc chains from a sample of the base distribution z 0 q (zjx), and runM steps of Hamiltonian Monte Carlo (hmc) transition kernelsT (M) (zjz 0 ;x) (Neal, 2011). However, in practice, we may require infeasibly long mcmc chains when the base distribution is far from desired energy-based model ; (zjx). Instead, we choose to initialize the chains from the true posterior sample z 0 p(zjx) for a simulated data point x p(x). The approximate energy function gradient becomes @ @ ibal(q ;T )E p (x;z) @ @ T (x;z) E p(x;z 0 )T (M) (zjz 0 ;x) @ @ T (x;z) ; (2.35) 43 We can use an identical modication for the gradient with respect to . This initialization is similar in spirit to Contrastive Divergence learning of energy-based models (Hinton, 2002), and can signicantly reduce the computational cost and variance of the estimated gradient. In our experiments, we see that this approach enables ecient training of energy functions in complex latent spaces of deep generative models. 2.4.2.2 Evaluation of IBAL in MINE-AIS After learning an critic function using our improved mine-ais training scheme, we still need to evaluate the ibal lower bound on mi. Although we have an unbiased, low variance estimate for E p (x;z) [T (x;z)], the log partition function logZ ; (x) is intractable. We use our Multi-Sampleais bounds from Sec. 2.3 to provide both an upper bound and approximate lower bound onibal(q ;T ): MINE-AIS upper bound on IBAL. We rst consider estimating a lower bound on the log partition function logZ ; (x) of ; (zjx) = 1 Z ; (x) q (zjx)e T (x;z) . In order to do so, we can use a Multi-Sampleais lower bound with a base distributionq (zjx), intermediate distributions t (zjx)/ q (zjx)e tT (x;z) , and nal distribution ; (zjx) at T = 1. 
This yields a stochastic lower bound on $\log \mathcal{Z}_{\theta,\phi}(x)$, which translates to a stochastic upper bound for estimating $\mathrm{ibal}(q_\phi, T_\theta)$.

MINE-AIS approximate lower bound on IBAL. In order to obtain a lower bound on ibal and preserve a lower bound on $I(x;z)$, we would like to estimate an upper bound on $\log \mathcal{Z}_{\theta,\phi}(x)$ using Multi-Sample ais. However, note that this requires true samples from $\pi_{\theta,\phi}(z|x)$ to initialize backward chains. Since these samples are unavailable, we argue in Brekelmans et al. (2022b) App. M2 that Multi-Sample ais can preserve an approximate upper bound on $\log \mathcal{Z}_{\theta,\phi}(x)$ and lower bound on $\mathrm{ibal}(q_\phi, T_\theta)$ by initializing reverse chains from the true posterior $p(z|x)$ instead of $\pi_{\theta,\phi}(z|x)$. With the optimal critic function, we have $\pi_{\theta,\phi}(z|x) = p(z|x)$, and our approximate lower bound becomes a true stochastic lower bound on $\mathrm{ibal}(q_\phi, T^*_\theta) = I(x;z)$. If the full joint density $p(x,z)$ is known, the mine-ais upper bound and approximate lower bound on $\mathrm{ibal}(q_\phi, T_\theta)$ reduce to the Multi-Sample ais upper and lower bounds on $I(x;z)$.

2.5 Experiments

In this section, we evaluate our proposed Multi-Sample ais, mine-ais, and giwae bounds for estimating the mutual information of deep generative models such as vaes and gans trained on mnist and cifar datasets. We chose deep generative models as they reflect complex relationships between latent variables and high-dimensional images, and allow comparison across methods with different assumptions.

[Figure 2.3 plots mutual information against the number of ais distributions for (a) Linear VAE (K=10000), (b) MNIST VAE20 (K=1000), and (c) MNIST GAN20 (K=1000), showing the Independent ais upper and lower bounds, the Coupled Reverse ais upper and lower bounds, and, for the Linear VAE, the analytical true mi.]

Figure 2.3: Comparing Multi-Sample ais sandwich bounds for varying number of ais distributions.

We first compare our various Multi-Sample ais bounds in Sec. 2.5.1 with the goal of making recommendations for which bound to use in practice. We then compare Multi-Sample ais and iwae bounds in Sec. 2.5.2 for estimating large ground truth values of mi. We will show that Multi-Sample ais can tightly bound the ground truth mi, assuming access to the true $p(x|z)$. This ground truth information is valuable for evaluating the bias of our energy-based lower bounds, including giwae and mine-ais, which are applicable even when $p(x|z)$ is unknown. We provide a detailed comparison of these energy-based mi bounds in Sec. 2.5.3. We describe experimental details and link to our publicly available code in Brekelmans et al. (2022b) App. O.

2.5.1 Comparison of Multi-Sample AIS Bounds

In Fig. 2.3, we compare the performance of our various Multi-Sample ais bounds for mi estimation of a Linear vae with 10 latent variables and random weights, and vae and gan models with 20 latent variables trained on mnist.
To obtain an upper bound on mi, we recommend using the Independent Multi-Sample ais elbo for log partition function estimation. This corresponds to the forward direction of bdmc and achieves the best performance in all cases. This upper bound uses independent samples and is not limited to $\log K$ improvement, in contrast to the Coupled Reverse Multi-Sample ais upper bound on mi (Brekelmans et al. (2022b) Prop. H2).

The results are less conclusive for the Multi-Sample ais lower bounds on mi, where either the Independent Multi-Sample ais eubo or Coupled Reverse ais eubo may be preferable for log partition function estimation. Recall that these bounds have different sources of stochasticity that provide improvement over single-chain ais. The stochasticity in the Independent Multi-Sample ais lower bound on mi comes from $K-1$ independent negative forward chains which, by Brekelmans et al. (2022b) Prop. E2, can only lead to $\log K$ improvement over single-sample ais. However, these gains are easily attained for low values of $T$. For example, with two total ais distributions, which corresponds to simple importance sampling with $\pi_0(z) = p(z)$, the Independent Multi-Sample ais lower bound on mi reduces to Structured InfoNCE and saturates to $\log K$. This may be useful for quickly estimating values of mi at a similar order of magnitude as $\log K$.

The stochasticity in the Coupled Reverse ais lower bound on mi is induced by mcmc transitions in $K$ coupled backward chains. While this does not formally limit the improvement over single-sample ais, we see in Fig. 2.3 that at least moderate values of $T$ may be needed to match or marginally improve upon Independent Multi-Sample ais. These observations suggest that the preferred lower bounds on mi may vary based on the scale of the true mi and the amount of computation available.

2.5.2 Multi-Sample AIS Estimation of Mutual Information

We compare Multi-Sample ais mi estimation against iwae, since both methods assume the full joint distribution is available. For the initial distribution of ais or variational distribution of iwae, we can use any distribution that is tractable to sample and evaluate. We experiment using both the prior $p(z)$ and a learned Gaussian $q_\phi(z|x)$. Table 2.1 summarizes our results.

IWAE As described in Sec. 2.2.3, iwae bounds encompass a wide range of mi estimators. The $K = 1$ bounds with learned $q_\phi(z|x)$ correspond to ba bounds, while for $K > 1$ and $p(z)$ as the proposal, we obtain Structured InfoNCE. While the iwae upper bound on mi, which uses the $\log p(x)$ lower bound with independent sampling from $q_\phi(z|x)$, is tight for certain models, we can see that the improvement of the iwae lower bound on mi is limited to $\log K$ as expected from Prop. 2.2.1. In particular, an exponentially large sample size is required to close the gap from the ba lower bound ($K = 1$) to the true mi. For example, on cifar gan100, at least $e^{460}$ total samples are required to match the lower bound estimated by ais. In Brekelmans et al. (2022b) App. B3, we explicitly decompose $I_L^{\mathrm{IWAE}}(q_\phi, K)$ into an $I_L^{\mathrm{BA}}(q_\phi)$ term and a contrastive term to validate this logarithmic improvement with increasing $K$.

Table 2.1: mi Estimation on mnist and cifar-10 with iwae (with varying number of samples $K$), and Multi-Sample ais (with varying number of intermediate distributions $T$). Bounds with a gap of less than 2 nats from the ground truth mi are in bold.
Method Proposal MNIST VAE2 MNIST VAE10 MNIST VAE100 MNIST GAN2 MNIST GAN10 MNIST GAN100 AIS (T=1) p(z) (0:00; 249:82) (0:00; 1929:84) (0:00; 5830:52) (0:00; 726:27) (0:00; 786:12) (0:00; 861:38) q(zjx) (7.59, 9.24) (21:06; 63:00) (34:49; 362:13) (7:21; 19:13) (3:67; 314:72) (2:61; 513:33) AIS (T=500) p(z) (8.63, 9.12) (34:05; 39:09) (79.90; 95:17) (9.21, 10.83) (21.57, 22.47) (25.86, 27.55) q(zjx) (9.09, 9.09) (34.16, 34.29) (80.19, 82.34) (10.69, 11.06) (21.60, 23.06) (25.58; 29:53) AIS (T=30K) p(z) (8.98, 9.09) (34.21, 34.21) (80.78, 80.84) (10.56, 10.81) (21.97, 22.02) (26.47, 26.52) q(zjx) (9.09, 9.09) (34.21, 34.21) (80.77, 80.80) (10.80, 10.81) (22.01, 22.01) (26.53, 26.54) IWAE (K=1) p(z) (0:00; 799:55) (0:00; 3827:58) (0:00; 11501:92) (0:00; 1638:10) (0:00; 1630:00) (0:00; 1740:39) q(zjx) (8.63, 9.19) (25:20; 35.34) (44:54; 95:63) (8:83; 17:58) (4:23; 57:47) (3:23; 260:87) IWAE (K=1K) p(z) (6:81; 29:40) (6:91; 1197:75) (6:91; 4234:19) (6:88; 121:89) (6:91; 446:80) (6:91; 494:73) q(zjx) (9.09, 9.10) (31:69; 34.24) (51:44; 85:30) (10.74, 11.40) (11:14; 52:73) (10:14; 201:18) IWAE (K=1M) p(z) (9.09, 9.09) (13:82; 376:89) (13:82; 2247:73) (10.76, 10.99) (13:81; 81:51) (13:82; 114:01) q(zjx) (9.09, 9.09) (34.10, 34.22) (58:35; 83:39) (10.81, 10.81) (17:76; 30:88) (16:98; 58:04) Method Proposal CIFAR GAN5 CIFAR GAN10 CIFAR GAN100 AIS T=1 p(z) (0:00; 3601256:00) (0:00; 4035635:75) (0:00; 4853410:50) q(zjx) (10:80; 157378:25) (13:02; 758075:75) (12:47; 2724562:25) AIS T=500 p(z) (18:37; 259:27) (29:52; 33089:90) (104:51; 63290:40) q(zjx) (32:47; 69:54) (48:16; 136:15) (145:19; 2786:53) AIS T=100K p(z) (39.58, 41.06) (71.87, 73.98) (480:26; 488:07) q(zjx) (39.22, 40.05) (72.85, 73.54) (479:27; 484:84) IWAE K=1 p(z) (0:00; 7095384:00) (0:00; 7765695:50) (0:00; 9916102:00) q(zjx) (14:53; 40.31) (17:45; 77:52) (20:00; 5346:85) IWAE K=1K p(z) (6:91; 1065552:25) (6:91; 2044170:75) (6:91; 2856714:50) q(zjx) (21:43; 39.73) (24:06; 74.00) (26:98; 5283:13) IWAE K=1M p(z) (13:82; 96698:10) (13:82; 710511:63) (13:82; 1903854:50) q(zjx) (28:34; 39.71) (30:73; 73.36) (33:81; 5271:56) Multi-Sample AIS We evaluate Multi-Sampleais withK = 48 chains onmnist andK = 12 on cifar, and a varying number of intermediate distributionsT . We show results for the Independent Multi-Sampleais upper bound onmi and Coupled Reverse Multi-Sampleais lower bound onmi in Table 2.1. Using large enough values of T , Multi-Sample ais can tightly sandwich large values of ground truthmi for all models and datasets considered. This is in stark contrast to the exponential sample complexity required for theiwaemi lower bound, and highlights that increasingT in Multi- Sample ais is a practical way to reduce bias using additional computation. We provide runtime details in Brekelmans et al. (2022b) App O. 2.5.3 Energy-Based Estimation of Mutual Information In this section, we evaluate the performance of our giwae and mine-ais methods, which assume access to a known marginal p(z) but not the conditional p(xjz). In Cor. 
2.2.5, we have shown that, for a fixed q_ϕ(z|x),

I_L^BA(q_ϕ) ≤ I_L^GIWAE(q_ϕ, T_θ, K) ≤ I_L^IWAE(q_ϕ, K) ≤ I_L^BA(q_ϕ) + log K.

Although we perform separate optimizations and obtain a different q_ϕ(z|x) for each entry in Fig. 2.5a, we find that these relationships hold in almost all cases. We summarize the gaps in these bounds and their relationships in Fig. 2.5b.

Figure 2.4: Estimating the IBAL using Multi-Sample AIS for various methods of critic function training. Panels: (a) Linear VAE, (b) MNIST-VAE, (c) MNIST-GAN; curves show AIS upper and lower bounds on the true MI and on the IBAL for MINE-AIS, GIWAE (K=100), and InfoNCE (K=100) critics, along with the GIWAE, IWAE, InfoNCE, and Barber-Agakov lower bounds.

BA, IWAE, and GIWAE Bounds Recall that I_L^GIWAE(q_ϕ, T_θ, K) in Eq. (2.13) and I_L^IWAE(q_ϕ, K) in Eq. (2.16) can be decomposed into the sum of a BA lower bound and a contrastive term. We report the contribution of each term along with the overall lower bound in Fig. 2.5a. Despite the fact that GIWAE uses a learned T_θ(x, z) instead of the optimal critic in IWAE (Cor. 2.2.4), we observe that GIWAE can approach the performance of IWAE. We can also confirm that both bounds improve upon the BA lower bound. In all cases, (Structured) InfoNCE bounds, which use q_ϕ(z|x) = p(z), saturate at log K.

We can further analyze the contribution of the contrastive term across different models. For the Linear VAE, the true posterior is in the Gaussian variational family and I_L^BA(q_ϕ) is close to the analytical MI. In this case, the contrastive term provides much less than log K improvement for GIWAE and IWAE, since even the optimal critic function cannot distinguish q_ϕ(z|x) and p_θ(z|x). As K increases, we learn a worse q_ϕ in almost all cases, as measured by a lower BA term. This allows the contrastive term to achieve closer to its full potential of log K, resulting in a higher overall bound.² For more complex VAE and GAN posteriors, there is a reduced tradeoff between the terms. In these cases, the variational family is far enough from the true posterior (in reverse KL divergence) that either GIWAE or IWAE critic functions can approach log K improvement without significantly lowering the BA term.

² A similar observation can be made in training VAEs with the IWAE ELBO objective and a restricted variational family, where increasing K results in a worse q_ϕ(z|x) (in forward KL divergence) but a better overall bound.
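The decomposition into a BA term plus a contrastive term, and the log K ceiling on the contrastive term, can be checked numerically. The sketch below is a toy illustration rather than the experimental code: it assumes d independent correlated Gaussian pairs (x_i, z_i) with known MI, a deliberately over-dispersed Gaussian q_ϕ(z|x), and the optimal critic T(x, z) = log p(z|x) - log q_ϕ(z|x), which corresponds to the IWAE case.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rho, sigma_q = 20, 0.9, 1.0      # sigma_q^2 != 1 - rho^2: deliberately over-dispersed q_phi(z|x)
true_mi = -0.5 * d * np.log(1.0 - rho ** 2)

def log_gauss(z, mean, var):        # sum of independent per-dimension Gaussian log densities
    return (-0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)).sum(-1)

def ba_plus_contrastive(K, n=500):
    x = rng.standard_normal((n, d))
    z_pos = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal((n, d))       # z ~ p(z|x)
    z_neg = rho * x[:, None, :] + sigma_q * rng.standard_normal((n, K - 1, d))  # z ~ q(z|x)

    # Barber-Agakov term E_{p(x,z)}[log q(z|x) - log p(z)], with marginal p(z) = N(0, I)
    ba = np.mean(log_gauss(z_pos, rho * x, sigma_q ** 2) - log_gauss(z_pos, 0.0, 1.0))

    def T(z, x_):                   # optimal critic T(x,z) = log p(z|x) - log q(z|x)
        return log_gauss(z, rho * x_, 1 - rho ** 2) - log_gauss(z, rho * x_, sigma_q ** 2)

    t = np.concatenate([T(z_pos, x)[:, None], T(z_neg, x[:, None, :])], axis=1)  # (n, K)
    m = t.max(axis=1, keepdims=True)
    log_mean_exp = (m + np.log(np.exp(t - m).mean(axis=1, keepdims=True))).squeeze(1)
    contrastive = np.mean(t[:, 0] - log_mean_exp)       # bounded above by log K
    return ba, contrastive

print(f"true MI = {true_mi:.2f} nats")
for K in [1, 10, 100, 1000]:
    ba, con = ba_plus_contrastive(K)
    print(f"K={K:5d}: BA {ba:6.2f} + contrastive {con:5.2f} (cap {np.log(K):5.2f}) = {ba + con:6.2f}")
```

Even with the optimal critic, the printed totals never exceed the BA term plus log K, mirroring the saturation behavior of the IWAE and (Structured) InfoNCE entries in Fig. 2.5a.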
48 Input Used X X X X X X X X X X X Bound Model Linear VAE10 MNIST VAE20 MNIST GAN20 p(x;z) Analytical 23.23 N/A N/A p(x;z) Joint Samples AIS Bound on True MI (23:23; 23:23) (65:11; 65:17) (53:43; 53:50) IWAE LB (K = 1000) 20:53 + 2:66 = 23:19 38:21 + 6:90 = 45:11 20:97 + 6:91 = 27:88 IWAE LB (K = 100) 21:64 + 1:50 = 23:14 38:86 + 4:61 = 43:47 20:86 + 4:60 = 25:46 Structured InfoNCE LB (K = 1000) 6:91 6:91 6:91 Structured InfoNCE LB (K = 100) 4:61 4:61 4:61 p(z) Joint Samples AIS Bound on IBAL (MINE-AIS) (23:15; 23:15) (57:72; 57:74) (40:79; 40:79) AIS Bound on IBAL (GIWAE K = 100) (22:87; 22:87) (44:97; 44:97) (28:61; 28:62) AIS Bound on IBAL (InfoNCE K = 100) (11:38; 11:39) (5:18; 5:18) (7:42; 7:42) Generalized IWAE LB (K = 1000) 22:31 + 0:38 = 22:69 37:23 + 6:55 = 43:78 20:50 + 6:72 = 27:22 Generalized IWAE LB (K = 100) 22:48 + 0:39 = 22:87 37:56 + 4:34 = 41:90 20:68 + 4:57 = 25:25 Barber-Agakov LB (K = 1) 22:69 37:92 21:42 Joint Samples InfoNCE LB (K = 1000) 6:91 6:91 6:91 InfoNCE LB (K = 100) 4:61 4:61 4:61 (a) (b) Figure 2.5: (a) Comparison of energy-based bounds (giwae and mine-ais) with other mi bounds. (b) Visualizing the gaps of various energy-based lower bounds and their relationships. MINE-AIS Bounds We now discuss results for mine-ais, where we have used xed standard Gaussian p(z) as the base variational distribution, energy-based training for T (x;z), and Multi- Sample ais evaluation of ibal(p(z);T ). We can see in Fig. 2.5a that mine-ais improves over ba due to its exible, energy-based variational family. To evaluate the quality of the learned T (x;z), we also compare theibal to the Multi-Sampleais lower bound, which assumes access top(xjz) and corresponds to the optimal critic (Prop. 2.4.1). We nd that mine-ais underestimates the ground truth mi by 11% and 24% on mnist-vae and mnist-gan, respectively. We also observe that mine-ais outperforms iwae and giwae by up to 14 nats, and investigate whether this improvement is due to a more costly Multi-Sample ais evaluation or our energy-based training scheme for the critic T . In Fig. 2.4 and Fig. 2.5a, we use Multi-Sample ais to evaluate the ibal corresponding to (q ;T ), which are learned by optimizing the giwae (with K = 100) and InfoNCE (with K = 100 and q = p(z)) lower bounds. As argued in Prop. 2.4.2, the ibal corresponds to the limiting behavior of giwae as K!1. We observe that the ais evaluation of the ibal corresponding to giwae or InfoNCE critic functions only marginally improves upon evaluation of the original giwae or InfoNCE lower bounds. This indicates that the improvement of mine-ais overgiwae orInfoNCE can be primarily attributed to learning a better critic function using energy-based training. 49 Recall that the iwae and Structured InfoNCE critics use the true log importance weights T (x;z) = log p(x;z) q (zjx) +c(x) (Cor. 2.2.4), and with this optimal critic, ibal(q ;T ) =I(x;z) regard- less ofq (zjx) (Prop. 2.4.1). We would thus expect Multi-Sampleais evaluation of theibal(q ;T ) to tightly bound the true mi if the critic were optimal. We instead nd that ais evaluation for a learned giwae critic falls short of the true mi by up to 20 nats. This is despite the fact that the giwae critic came close to matching the performance of iwae and saturating the contrastive term at logK in the original objective. These observations highlight a further shortcoming of the contrastive bounds used in giwae and InfoNCE: beyond their logK limitations for evaluation (Cor. 
2.2.5), these bounds may not be conducive to learning the true log importance weights.

Validation of Approximate Reverse Annealing for IBAL Lower Bound As discussed in Sec. 2.4.2.2, obtaining a lower bound on the IBAL is intractable due to the need for exact samples from π_{θ,ϕ}(z|x). In Fig. 2.4, we confirm that our approximate reverse annealing procedure described in Brekelmans et al. (2022b) App. M2 underestimates the IBAL for all training methods and numbers of intermediate distributions, although we cannot mathematically guarantee that this procedure provides a lower bound. We can use our Multi-Sample AIS upper bound on the IBAL to validate the convergence of our estimation procedure, despite the fact that this quantity does not lower or upper bound the true MI in general. In other words, we can conclude that the true IBAL(q_ϕ, T_θ) has been obtained when its lower and upper bounds converge to the same estimate for a large number of intermediate distributions T.

MINE-DV and MINE-F We do not include results for MINE-DV and MINE-F, as they are highly unstable in large-MI settings due to the difficulty of direct Monte Carlo estimation of E_{p(x)p(z)}[e^{T(x,z)}] (McAllester and Stratos, 2020). We expect Generalized MINE to suffer from the same challenges and instead recommend using MINE-AIS.

2.6 Conjugate Duality Interpretations

In this section, we provide alternative derivations of the energy-based MI lower bounds in MINE-AIS, Generalized MINE-DV, Generalized MINE-F, GIWAE, IWAE, and InfoNCE from the perspective of conjugate duality. In particular, we highlight that the critic or negative energy function in the above bounds arises as a dual variable in the convex conjugate representation of the KL divergence. In all cases, the KL divergence of interest corresponds to the gap in the BA lower bound

I(x; z) = E_{p(x,z)}[ log q_ϕ(z|x) / p(z) ] + E_{p(x)}[ D_KL[ p(z|x) || q_ϕ(z|x) ] ]. (2.36)

For q_ϕ(z|x) = p(z), the BA lower bound term is 0 and our derivations correspond to taking the dual representation of MI directly, e.g. E_{p(x)}[ D_KL[ p(z|x) || p(z) ] ], as in Belghazi et al. (2018b). Our conjugate duality interpretations are complementary to our probabilistic interpretations, with either approach equally valid for deriving lower bounds and characterizing their gaps.

2.6.1 Convex Conjugate Background

The KL divergence to a fixed reference distribution π_0 is a convex function of the distribution in its first argument, ψ(·) := D_KL[ · || π_0(z) ]. In later subsections, we will also consider KL divergences over joint distributions, unnormalized density functions, and extended state spaces. For a given ψ(·), we can define the conjugate function ψ*(·) over a dual variable or function T(z), with the following relationships (Boyd and Vandenberghe, 2004)

ψ*(T(z)) := sup_{π(z)} ⟨π(z), T(z)⟩ - ψ(π(z)), (2.37)
ψ(π(z)) = sup_{T(z)} ⟨π(z), T(z)⟩ - ψ*(T(z)), (2.38)

where inner product notation indicates ⟨π(z), T(z)⟩ := ∫ π(z) T(z) dz. Solving for the optimizing argument in each of Eq. (2.37) and Eq. (2.38) yields the following dual correspondences

π_T(z) = ∇_T ψ*(T(z)),    T_π(z) = ∇_π ψ(π(z)). (2.39)

We will proceed to derive closed-form expressions for various special cases of ψ and ψ* below. For an arbitrary p(z) and T(z) which are not in dual correspondence according to Eq. (2.39), we can use Eq.
(2.38) to derive a lower bound on (p(z)) known as Fenchel's inequality (p(z)) =hp(z);T (z)i (T (z)) +D () p(z); T (z) (2.40) hp(z);T (z)i (T (z)): (2.41) 51 where the gap in the inequality D () p(z); T (z) is the Bregman divergence generated by and T (z) is the dual variable corresponding to T (z) using Eq. (2.39). 3 2.6.2 Conjugate Duality Interpretation of IBAL To obtain an alternative derivation of ibal(q ;T ), we consider the conditional kl divergence from a reference q (zjx), which is a convex function of its rst argument () =D KL [jjq (zjx)]; (2.42) To derive the conjugate function (T ), note that we must restrict the optimization to the simplex (zjx)2 , since the standard kl divergence requires a normalized distribution as input (T (x;z)) := sup (zjx) Z (zjx)T (x;z)dz ((zjx)) Z (zjx)dz 1 (2.43) = logE q (zjx) h e T (x;z) i (2.44) where we have solved for the optimizing argument T (zjx) to obtain the conjugate function in Eq. (2.44). Eq. (2.39) suggests the following dual correspondence between primal and dual variables T (zjx) = 1 Z(x;T ) q (zjx)e T (x;z) T (x;z) = log (zjx) q (zjx) +c(x): (2.45) We would like to leverage this duality to estimate thekl divergence (p(zjx)) =D KL [p(zjx)jjq (zjx)] fromq (zjx) to the true posteriorp(zjx). In particular, plugging Eq. (2.45) into Eq. (2.38) suggests the following variational representation D KL [p(zjx)jjq (zjx)] = sup T (x;z) Z p(zjx)T (x;z)dz logE q (zjx) h e T (x;z) i : (2.46) 3 This follows directly from the denition of the Bregman divergence by using Tq =r (q) and the identity in Eq. (2.37), D [p;q] = (p) (q)hr (q);pqi = (p) (q) +hTq;qihTq;pi = (p) + (q)hTq;pi. 52 Figure 2.6: Generalized Energy Based Bounds. On the left, arrows indicate the gaps in each mi lower bound or its relationship to other lower bounds. D gkl (k) represents the extended KL diver- gence between two unnormalized densities. All bounds are written in terms of a base variational distribution q (zjx), which may be chosen to be the marginal p(z) as in mine-dv and mine-f. For a suboptimal T (x;z), which is in dual correspondence with T (zjx) instead of the desired posterior p(zjx), we can use Eq. (2.41) to obtain a lower bound on D KL [p(zjx)jjq (zjx)]. To char- acterize the gap in this inequality, one can conrm that the Bregman divergence generated by the kl divergence in Eq. (2.42) is also the kl divergence D D KL [jjq] [p;] =D KL [pjj]. Thus, we have D KL [p(zjx)jjq (zjx)] = Z p(zjx)T (x;z)dz logE q (zjx) h e T (x;z) i +D KL [p(zjx)jj T (zjx)] (2.47) Finally, the ibal(q ;T ) uses this variational representation of the gap in the ba lower bound, E p(x) [D KL [p(zjx)jjq (zjx)]], to obtain a tighter bound on mi. In particular, for any learned critic function T (x;z), we can use Eq. (2.47) to derive the ibal and its gap I(x;z) =E p(x;z) log q (zjx) p(z) | {z } Ibal (q ) +E p(x) [D KL [p(zjx)jjq (zjx)]] (2.48) =E p(x;z) log q (zjx) p(z) +E p(x) h E p(zjx) [T (x;z)] logE q (zjx) h e T(x;z) ii | {z } ibal(q ;T ) +E p(x) [D KL [p(zjx)jj T (zjx)]] : The optimal critic functionT (x;z) provides the maximizing argument in Eq. (2.46) and is in dual correspondence with the true posterior p(zjx). In particular, we have T (zjx) =p(zjx), resulting in I(x;z) =ibal(q ;T ). 53 2.6.3 Conjugate Duality Interpretation of Generalized MINE-DV To obtain a conjugate duality interpretation of (Generalized) mine-dv, we consider the dual rep- resentation of the kl divergence over joint distributions. 
Choosing 0 (x;z) = p(x)q (zjx) as the reference distribution, the kl divergence is a convex function of the rst argument () =D KL [jjp(x)q (zjx)] (2.49) Thiskl divergence is equivalent to the gap in theba boundE p(x) [D KL [p(zjx)jjq (zjx)]] after noting the marginal distribution of both p(x;z) and p(x)q (zjx) is p(x). However, the duality associated with () =D KL [jj 0 (x;z)] holds for general joint distributions, and we will see that the conjugate duality perspective using this divergence leads to looser bound on mi than in App. 2.6.2. To evaluate the expression ((x;z)) = sup T (x;z) h(x;z);T (x;z)i (T (x;z)), we need to derive the conjugate function (T (x;z)). Similarly to Eq. (2.43), we constrain the joint distribution to be normalized and obtain (T (x;z)) := sup (x;z) Z (x;z)T (x;z)dxdz ((x;z)) Z (x;z)dzdx 1 (2.50) = logE p(x)q (zjx) h e T (x;z) i ; (2.51) where we have used 0 (x;z) =p(x)q (zjx). Note that the expectation overp(x) now appears inside the log in (T (x;z)), compared with the conjugate for the conditionalkl divergence in Eq. (2.44). Solving for the optimizing arguments in Eq. (2.38) and Eq. (2.50), we have the following rela- tionship between primal and dual variables, T (x;z) := 1 Z(T ) p(x)q (zjx)e T (x;z) ; T (x;z) = log (x;z) p(x)q (zjx) +c; (2.52) Finally, we use Eq. (2.38) to write the dual representation of joint kl divergence as D KL [p(x;z)jjp(x)q (zjx)] = sup T (x;z) Z p(x;z)T (x;z)dxdz logE p(x)q (zjx) [e T (x;z) ]: (2.53) 54 As in the previous section and Eq. (2.41), a suboptimalT (x;z), which is in dual correspondence with T (zjx) instead of the desired posterior p(zjx), yields a lower bound on D KL [p(x;z)jjp(x)q (zjx)], with the gap equal to a kl divergence D KL [p(x;z)jjp(x)q (zjx)] = Z p(x;z)T (x;z)dxdz logE p(x)q (zjx) [e T(x;z) ] +D KL [p(x;z)jj T (x;z)]: (2.54) The Generalized mine-dv bound I gmine-dv (q ;T ) uses this variational representation of the gap in the ba lower bound to obtain a tighter bound on mi. For a learned critic T (x;z), we can use Eq. (2.54) to write I(x;z) =E p(x;z) log q (zjx) p(z) | {z } Ibal (q ) +E p(x) [D KL [p(zjx)jjq (zjx)]] (2.55) =E p(x;z) log q (zjx) p(z) +E p(x)p(zjx) [T (x;z)] logE q (zjx) h e T(x;z) i | {z } Igmine-dv(q ;T ) +D KL [p(x)p(zjx)jj T (x;z)] As discussed in Brekelmans et al. (2022b) App. J2, the Generalized mine-dv bound is looser than the ibal. 2.6.4 Conjugate Duality Interpretation of Generalized MINE-F Nguyen et al. (2010) consider the conjugate duality associated with the family of f-divergences, of which thekl divergence is a special case. We will show that this dual representation corresponds to taking the conjugate function of an extended kl divergence which can take unnormalized densities as input. We use tilde notation ~ (z) to indicate possibly unnormalized measures over z. For example, ~ p(zjx) corresponds to p(x;z). The extendedkl divergence, which is convex in either argument since it diers from the standard kl divergence by only a linear term, is dened as D ExKL [~ r(z)jj~ s(z)] = Z ~ r(z) log ~ r(z) ~ s(z) dz Z ~ r(z)dz + Z ~ s(z)dz: (2.56) For () =D ExKL [jj~ q (zjx)], we write the dual variable using the notation T 0 (x;z) and consider (T 0 (x;z)) := sup ~ (zjx) Z ~ (zjx)T 0 (x;z)dzD ExKL [~ (zjx)jj~ q (zjx)] (2.57) 55 Note that we do not explicitly include a Lagrange multiplier to enforce restriction to normalized distributions. Solving for the optimizing argument or writing the dual correspondence in Eq. 
(2.39), we obtain ~ T (zjx) = ~ q (zjx)e T 0 (x;z) ; T 0 ~ (x;z) = log ~ (zjx) ~ q (zjx) : (2.58) Plugging this ~ T (zjx) back into Eq. (2.57) yields the conjugate function (T 0 (x;z)) = Z ~ q(zjx)e T 0 (x;z) dz Z ~ q(zjx)dz; (2.59) which leads to a dual representation of the extended kl divergence that matches Nguyen et al. (2010) after plugging into Eq. (2.38) D ExKL [~ p(zjx)jj~ q(zjx)] = sup T 0 (x;z) Z ~ p(zjx)T 0 (x;z)dz Z ~ q(zjx)e T 0 (x;z) dz + Z ~ q(zjx)dz: (2.60) We now use the reparameterization T 0 (x;z) =T (x;z) 1. Assuming a normalized ~ q(zjx) =q(zjx) and ~ p(zjx) =p(zjx), and noting that D ExKL [p(zjx)jjq (zjx)] =D KL [p(zjx)jjq (zjx)] for normalized distributions, we obtain D KL [p(zjx)jjq (zjx)] = sup T (x;z) Z p(zjx)T (x;z)dz Z q (zjx)e T (x;z)1 dz; (2.61) which matches dual representation of the kl divergence found in Belghazi et al. (2018b); Nowozin et al. (2016). See Ruderman et al. (2012) for further discussion. For a suboptimal T (x;z) which is in dual correspondence with T (zjx) instead of the de- sired posterior p(zjx), we can use Eq. (2.41) to obtain a lower bound on D ExKL [p(zjx)jjq (zjx)] = D KL [p(zjx)jjq (zjx)]. To characterize the gap of this lower bound, note that the Bregman divergence generated by the D ExKL divergence is also the D ExKL divergence D D ExKL [jj~ q] [~ pjj~ ] = D ExKL [~ pjj~ ]. Thus, using Eq. (2.40), we have D KL [p(zjx)jjq (zjx)] = Z p(zjx)T (x;z)dz Z q (zjx)e T (x;z)1 dz +D ExKL [p(zjx)jjq (zjx)e T (x;z)1 ]: (2.62) 56 Finally, we obtain the Generalized mine-f bound by evaluating the dual representation of the extended kl divergence for normalized p(zjx) and q (zjx). In particular, I gmine-f (q ;T ) uses the critic function T to tighten the gap in I bal (q ) via the dual optimization in Eq. (2.62), I(x;z) =E p(x;z) log q (zjx) p(z) | {z } Ibal (q ) +E p(x) [D KL [p(zjx)jjq (zjx)]] (2.63) =E p(x;z) log q (zjx) p(z) +E p(x) h E p(zjx) [T (x;z)] logE q (zjx) h e T(x;z) ii | {z } Igmine-f(q ;T ) +E p(x) [D ExKL [p(zjx)jj~ T (zjx)]] The optimal critic function is the dual variable of the true posterior p(zjx), which can be found using Eq. (2.58) asT (x;z) = 1+log ~ p(zjx) ~ q (zjx) . With this optimal critic, the Generalizedmine-f bound is tight. 2.6.5 Conjugate Duality Interpretation of GIWAE, IWAE, and InfoNCE In this section, we use conjugate duality to derive the giwae, iwae, and Info-NCE bounds on mutual information and characterize their gaps. Our approach extends that of Poole et al. (2019), where the mine-f dual representation (Sec. 2.6.4) was used to derive Info-NCE. We provide an alternative derivation using the dual representation associated with the conditional kl divergence and ibal in Sec. 2.6.2. For either dual representation, giwae, iwae, and Info-NCE arise from limiting the family of the critic functions T in order to eliminate the intractable log partition function term. We start from the decomposition of I(x;z) into I ba L (q ) and its gap I(x;z) =E p(x;z) log q (zjx) p(z) | {z } Iba L (q ) +E p(x) [D KL [p(zjx)jjq (zjx)]] : (2.64) We will focus on the dual representation of the normalized, conditional kl divergence () = D KL [jjq (zjx)], as in Sec. 2.6.2. 
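Before turning to the multi-sample construction, the difference between these two single-sample dual representations can be seen numerically. The sketch below is a toy example on a small discrete alphabet (the function names dv_bound and nwj_bound are ours): dv_bound implements the log-partition form associated with the normalized KL divergence, as in Eq. (2.46), while nwj_bound implements the extended-KL form of Eq. (2.61).

```python
import numpy as np

rng = np.random.default_rng(1)

# two discrete distributions on a small alphabet stand in for p(z|x) and q_phi(z|x)
q = rng.dirichlet(np.ones(8))
p = rng.dirichlet(np.ones(8))
true_kl = np.sum(p * np.log(p / q))

def dv_bound(T):     # log-partition form from the normalized-KL conjugate, as in Eq. (2.46)
    return np.sum(p * T) - np.log(np.sum(q * np.exp(T)))

def nwj_bound(T):    # extended-KL form over unnormalized measures, as in Eq. (2.61)
    return np.sum(p * T) - np.sum(q * np.exp(T - 1.0))

log_ratio = np.log(p / q)
critics = [("optimal for NWJ: 1 + log p/q ", 1.0 + log_ratio),
           ("shifted:         log p/q - 3 ", log_ratio - 3.0),
           ("noisy:           log p/q + eps", log_ratio + 0.5 * rng.standard_normal(8))]
for name, T in critics:
    print(f"{name}  DV = {dv_bound(T):7.4f}  NWJ = {nwj_bound(T):7.4f}  true KL = {true_kl:.4f}")
```

For any critic, the first form is at least as tight and is invariant to constant shifts of T, while the second avoids the logarithm of an expectation; both recover the true KL divergence at their respective optimal critics.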
57 Multi-Sample IBAL Consider extending the state space of the posterior or target distribution p(zjx), by using an additionalK1 samples from a base variational distributionq (zjx) to construct p tgt (z (1:K) ;sjx) :=U(s)p(z (s) jx) K Y k6=s k=1 q (z (k) jx); (2.65) wheres is an index variable sU(s) drawn uniformly at random from 1sK, which species the index of the posterior samplep(zjx). We similarly expand the state space of the base variational distribution to write q prop (z (1:K) ;sjx) :=U(s) K Y k=1 q (z (k) jx): (2.66) It can be easily veried that this construction does not change the value of the kl divergence D KL [p tgt (z (1:K) ;sjx)jjq prop (z (1:K) ;sjx)] =E ptgt(z (1:K) ;sjx) log p tgt (z (1:K) ;sjx) q prop (z (1:K) ;sjx) =E ptgt(z (1:K) ;sjx) h log U(s)p(z (s) jx) Q k6=s q (z (k) jx) U(s) K Q k=1 q (z (k) jx) i =E U(s)p(z (s) jx) log p(z (s) jx) q (z (s) jx) =E p(zjx) log p(zjx) q (zjx) =D KL [p(zjx)jjq (zjx)]: Consider the convex function () :=D KL jjq prop (z (1:K) ;sjx) ; (2.67) where the primal variable is a distribution in the extended-state space of z (1:K) ;s , and the dual variable is a critic functionT (x;z (1:K) ;s). We derive a conjugate optimization in similar fashion to Sec. 2.6.2, but now over the extended state space. For this (), the conjugate function (T ) takes a log-mean-exp form analogous to Eq. (2.44) 4 . We can write the variational representation of (p tgt ) as (ptgt) = sup T Z K X s=1 ptgt(z (1:K) ;sjx)T (x;z (1:K) ;s)dz (1:K) logE qprop (z (1:K) ;sjx) h e T (x;z (1:K) ;s) i : (2.68) 4 We consider the conjugate with restriction to normalized distributions as in Sec. 2.6.2 Eq. (2.43). 58 For a particular choice ofT (x;z (1:K) ;s), we can use Eq. (2.68) to obtain a lower bound on the kl divergenceD KL [p(zjx)jjq (zjx)] (as in Eq. (2.41)). This lower bound translates to the Multi-Sample ibal lower bound on mi, I ms-ibal (q ;T ) via Eq. (2.64) . I(x;z)E p(x;z) log q (zjx) p(z) +E p(x) h E ptgt(z (1:K) ;sjx) h T (x;z (1:K) ;s) ii logE qprop(z (1:K) ;sjx) h e T (x;z (1:K) ;s) i | {z } Z(x;T ) =:I ms-ibal (q ;T ); (2.69) whereZ(x;T ) is the normalization constant of the dual distribution T (z (1:K) ;sjx) corresponding toT , T (z (1:K) ;sjx) = 1 Z(x;T ) q prop (z (1:K) ;sjx)e T (x;z (1:K) ;s) : (2.70) As in Eq. (2.40), we can write the gap of I ms-ibal (q ;T ) as a Bregman divergence or kl divergence I(x;z) =I ms-ibal (q ;T ) +E p(x) h D KL p tgt (z (1:K) ;sjx)jj T (z (1:K) ;sjx) i : (2.71) The optimalK-sample energy function in Eq. (2.68) should result in T (z (1:K) ;sjx) =p tgt (z (1:K) ;sjx). Using similar reasoning as in Brekelmans et al. (2022b) App C5 or L1, this occurs forT (x;z (1:K) ;s) = log p(z (s) jx) q (z (s )jx) +c(x), for which we have I(x;z) =I ms-ibal (q ;T ). GIWAE is a Multi-Sample IBAL with a Restricted Function Family Although extending the state space did not change the value of the kl divergence or alter the optimal critic function, it does allow us to consider a restricted class of multi-sample energy functions that yield tractable, low variance estimators. In particular, giwae and Info-NCE arise from choosing a restricted family of functionsT giwae (x;z (1:K) ;s), under which the problematic logZ(x;T ) term evaluates to 0. This function family is dened as T giwae (x;z (1:K) ;s) := log e T (x;z (s) ) 1 K K P k=1 e T (x;z (k) ) ; (2.72) 59 whereT giwae (x;z (1:K) ;s) is specied by an arbitrary single-sample critic function T (x;z). 
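A quick numerical check of the claim that this restricted family makes the log partition term vanish is given below. It is a toy sketch with an arbitrary single-sample critic and a standard Gaussian standing in for q_ϕ(z|x); the specific critic is irrelevant.

```python
import numpy as np

rng = np.random.default_rng(2)
K, n = 5, 10000

def critic(z):            # an arbitrary single-sample critic T(x, z); its form does not matter
    return np.sin(3 * z) + 0.5 * z ** 2

# z^(1:K) ~ prod_k q(z^(k)|x), with a standard Gaussian standing in for q_phi(z|x)
z = rng.standard_normal((n, K))
t = critic(z)                                                           # (n, K)
m = t.max(axis=1, keepdims=True)
log_mean_exp = m + np.log(np.exp(t - m).mean(axis=1, keepdims=True))    # log (1/K) sum_k e^{T_k}
T_giwae = t - log_mean_exp        # Eq. (2.72), evaluated at every possible index s

# averaging e^{T_giwae} over the uniform index s gives exactly 1 for every draw of z^(1:K),
# so Z(x, T_giwae) = E_{q_prop}[e^{T_giwae}] = 1 and the log partition term vanishes
per_draw = np.exp(T_giwae).mean(axis=1)
print("max |E_s[exp(T_giwae)] - 1| over draws:", np.abs(per_draw - 1.0).max())
```

The Monte Carlo estimate is exact up to floating-point error, since the average over the index s is identically one for every draw; the analytic argument follows.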
We can now see that logZ(x;T giwae ) = 0, logZ(x;T giwae ) = logE qprop(z (1:K) ;sjx) h e Tgiwae(x;z (1:K) ;s) i = logE U(s) K Q k=1 q (z (k) jx) " e T (x;z (s) ) 1 K K P k=1 e T (x;z (k) ) # = 0; using the fact that z (1:K) K Q k=1 q (z (k) jx) is invariant to re-indexing. With this simplication, I ms-ibal (q ;T giwae ) recovers the giwae lower bound on mi I(x;z)I ms-ibal (q ;T giwae ) (2.73) =E p(x;z) log q (zjx) p(z) +E 1 K p(x)p(z (s) jx) K Q k6=s k=1 q (z (k) jx) h log e T (x;z (s) ) 1 K K P k=1 e T (x;z (k) ) i (2.74) =E p(x;z) log q (zjx) p(z) +E p(x)p(z (1) jx) K Q k=2 q (z (k) jx) h log e T (x;z (1) ) 1 K K P k=1 e T (x;z (k) ) i (2.75) =I giwae (q ;T ;K): Finally, we can use Eq. (2.71) to recover the probabalistic interpretation of giwae and the gap in the lower bound on mi. As we saw above, the dual distribution Tgiwae (z (1:K) ;sjx) is normalized withZ(x;T giwae ) = 1. In particular, we can write Tgiwae (z (1:K) ;sjx) = 1 Z(x;T giwae ) q prop (z (1:K) ;sjx)e Tgiwae(x;z (1:K) ;s) (2.76) = 1 Z(x;T giwae ) U(s) K Y k=1 q (z (k) jx) e T (x;z (s) ) 1 K K P k=1 e T (x;z (k) ) (2.77) = K Y k=1 q (z (k) jx) e T (x;z (s) ) K P k=1 e T (x;z (k) ) (2.78) which recovers q giwae prop (z (1:K) ;sjx) from the probabilistic interpretation of giwae in Eq. (2.12). We can write the gap ofI giwae (q ;T ;K) =I ms-ibal (q ;T giwae ) as the Bregman divergence or kl diver- gence as in Eq. (2.40) I(x;z) =I ms-ibal (q ;T giwae ) +E p(x) h D KL h p tgt (z (1:K) ;sjx)jj Tgiwae (z (1:K) ;sjx) ii ; (2.79) 60 which matches the reversekl divergenceE p(x) D KL [p giwae tgt (z (1:K) ;sjx)jjq giwae prop (z (1:K) ;sjx)] derived from the probabilistic approach in Brekelmans et al. (2022b) App. C1. Recall that Info-NCE is a special case of giwae with q (zjx) =p(z) (Sec. 2.2.4). Conjugate Duality Interpretation of IWAE We can gain alternative perspective on iwae from this conjugate duality interpretation. In particular, iwae is a special case of giwae, where the optimal single-sample critic function T (x;z) = log p(x;z) q (zjx) +c(x) (see Sec. 2.2.4) is used in Eq. (2.72) to construct the optimal multi-sample functionT iwae , within thegiwae restricted multi- sample function family. Thus, we have I ms-ibal (q ;T iwae ) =I iwae (q ;K). Althoughiwae uses the optimal critic function, the restriction to the function family in Eq. (2.72) is necessary to obtain a tractable bound on the kl divergenceD KL [p(zjx)jjq (zjx)] and mutual in- formation. Without the restricted function family, the intractable log partition term in Eq. (2.68) would require mcmc methods such as ais to accurately estimate, as we saw for the single-sample ibal in Sec. 2.4. 2.6.6 Conjugate Duality Interpretation of IWAE Lower Bound on logp (x) In the previous section, we have interpreted the iwae lower bound on mi using the conjugate dual expansion of the single-sample kl divergence D KL [p(zjx)kq (zjx)]. However, the iwae mi bound I iwae L (q ;T ;K) involves a `contrastive' upper bound on logp (x), eubo iwae (q ;T ;K), involving a single posterior sample. In this section, we provide a conjugate duality interpretation of the standard single-sample elbo(q ) and multi-sample elbo iwae (q ;T ;K) lower bounds on logp (x). Evidence Lower Bound To establish this duality, we begin with the single-sample case and note that 0 () = D KL [k 0 ] is convex in its rst argument. 
We consider the following dually coupled functions, 0 () =D KL [(z)jj 0 (z)] = sup T E (z) [T (z)] 0 (T ) and 0 (T ) = logE 0 (z) [e T (z) ] = sup (z) E (z) [T (z)]D KL [(z)jj 0 (z)] (2.80) where the expressions for the conjugate functions can be easily veried, with derivations in Sec. 1.3.1. 61 Note that for the log importance weight as the dual function T (x;z) = log p (x;z) 0 (zjx) , we have q iwae prop (T ) = logE 0 (z) [e log p (x;z) 0 (zjx) ] = log R p (x;z)dz = logp (x). Noting that any variational distri- bution (z) =q (zjx) provides a lower bound in Eq. (2.80), we can write 0 (T ) = logp (x)E q (zjx) h log p (x;z) 0 (zjx) | {z } T (x;z) i D KL q (zjx)jj 0 (z) (2.81) For 0 (z) =p(z), we recover the standard elbo logp (x)E q (zjx) h logp (xjz) i D KL q (zjx)jjp(z) : (2.82) where the gap is the Bregman divergence generated by (T ), with D [T :T q ] =D KL [q (zjx) : p(zjx)] since the Bregman divergence induced by the log partition function of the exponential family (T ) = (T ) corresponds to the kl divergence with the order of the arguments reversed. IWAE Lower Bound For the multi-sample case, recall the iwae probabilistic interpretation from Sec. 2.2.3, q iwae prop (z (1:K) jx) := K Y k=1 q (z (s) jx) p iwae tgt (x;z (1:K) ) := 1 K K X s=1 p (x;z (k) ) K Y k=1 k6=s q (z (k) jx): (2.83) Since we will be using the (extended state space) kl divergence duality, we know that for a dual function or critic T (z (1:K) ), the corresponding primal variable takes the form of an exponential family distribution, T (z) (1:K) = 0 (z (1:K) ) expfT (z (1:K) ) (T )g. In particular, we show that the iwae target distribution p iwae tgt (x;z (1:K) ) corresponds to the following critic function, T p iwae tgt (z (1:K) jx) = log 1 K K X k=1 p (x;z (k) ) q (z (k) jx) (2.84) =) T (z (1:K) ) = 1 Z(T ) K Y k=1 q (z (k) jx) exp log 1 K K X s=1 p (x;z (s) ) q (z (s) jx) = 1 Z(T ) 1 K K X s=1 p (x;z (s) ) K Y k=1 k6=s q (z (k) jx) =p iwae tgt (z (1:K) jx): (2.85) 62 Using 0 = q iwae prop (z (1:K) jx) as the base distribution or second argument of the kl divergence, consider the consider the following dually coupled functions, q iwae prop () =D KL [(z (1:K) )jjq iwae prop (z (1:K) )] (2.86) = sup T E (z (1:K) ) h T (z (1:K) ) i q iwae prop (T ); q iwae prop (T ) = logE q iwae prop (z (1:K) ) [e T (z (1:K) ) ] (2.87) = sup E (z (1:K) ) [T (z (1:K) )]D KL [(z (1:K) )jjq iwae prop (z (1:K) )]: For example, it is natural considerp iwae tgt (z (1:K) jx) as the input to the divergence, with q iwae prop (p iwae tgt ) = D KL [p iwae tgt (z (1:K) jx)jjq iwae prop (z (1:K) jx)]. 5 Considering the critic functionT p iwae tgt (z (1:K) jx) in Eq. (2.84) as the argument of q iwae prop (T p iwae tgt ), we follow similar derivations in Eq. (2.85) to show q iwae prop (T p iwae tgt ) = logp (x), (T p iwae tgt ) = log Z q iwae prop (z (1:K) jx)e T p iwae tgt (z (1:K) jx) = log Z K Y k=1 q (z (k) jx) exp log 1 K K X s=1 p (x;z (s) ) q (z (s) jx) dz (1:K) = log Z 1 K K X s=1 p (x;z (s) ) K Y k=1 k6=s q (z (k) jx)dz (1:K) = logp (x) (2.88) Now, we can easily conrm (T p iwae tgt ) = logp(x), i.e. the normalization constant ofp iwae tgt (z (1:K) ;x). We can obtain a lower bound for any multi-sample distribution(z (1:K) jx). 
In particular, we choose =q iwae prop to recover the iwae log partition function bound (T p iwae tgt ) = logp (x) = sup (z (1:K) jx) E (z (1:K) jx) [T p iwae tgt (z (1:K) )]D KL [kq iwae prop ] (2.89) E q iwae prop (z (1:K) jx) [T p iwae tgt (z (1:K) )] ( ( ( ( ( ( ( ( ( D KL [q iwae prop kq iwae prop ] (2.90) =E K Q k=1 q (z (s) jx) " log 1 K K X k=1 p (x;z (k) ) q (z (k) jx) # ; (2.91) 5 Note, this is slightly dierent than Sec. 2.6.5 or Brekelmans et al. (2022b) App K5, since this construction involves the multi-sample KL divergence directly, e.g. DKL[p iwae tgt (z (1:K) jx)kq iwae prop (zjx)], which is not equal to the single-sample KL divergence DKL[p(zjx)kq (zjx)]. 63 where the gap is the Bregman divergenceD [T p iwae tgt :T q iwae prop ] =D KL [q iwae prop (z (1:K) jx)kp iwae tgt (z (1:K) jx)] since the Bregman divergence generated by the log partition function is the kl divergence with order of arguments reversed. Thus, we have (T p iwae tgt ) = logp(x) =E K Q k=1 q (z (s) jx) " log 1 K K X k=1 p (x;z (k) ) q (z (k) jx) # +D KL [q iwae prop (z (1:K) jx)kp iwae tgt (z (1:K) jx)]; as desired. 2.7 Discussion We have provided a unifying view of mutual information estimation from the perspective of impor- tance sampling. We derived probabilistic interpretations of each bound, which shed light on the limitations of existing estimators, and motivated our novelgiwae, Multi-Sampleais, andmine-ais bounds. When the conditional is not known, our giwae bounds highlight how variational bounds can complement contrastive learning to improve lower bounds on mi beyond known logK limita- tions. When the full joint distribution is known, we have shown that our Multi-Sample ais bounds can tightly estimate large values of mi without exponential sample complexity, and thus should be considered the gold standard for mi estimation in these settings. Finally, mine-ais extends Multi- Sample ais evaluation to unknown conditional densities, and can be viewed as the innite-sample behavior of giwae and existing contrastive bounds. Our mine-ais and Multi-Sample ais methods highlight how mcmc techniques can be used to improve mutual information estimation when a single analytic marginal or conditional density is available. AIS-Inspired MI Estimation from Samples Only Despite the limitations mentioned in Sec. 2.1 or McAllester and Stratos (2020), estimating mutual information from samples only remains the most widely applicable problem setting. Our methods require density information of at least one marginal distribution in order dene mcmc transition kernels or evaluate the ba lower bound on mi. Promising directions for mi estimation from joint samples only include telescoping Density Ratio Estimation (dre, Rhodes et al. (2020) or its innitesimal dre extension in Choi et al. (2021). These approaches are similar in spirit to ais, as they construct a sequence of easier problems by bridging between the joint distribution p(x;z) and product of marginals p(x)p(z). However, since density information is not available to obtain samples using mcmc, these works construct intermediate samples using a xed interpolation scheme. In particular, for z 0 p(x)p(z) andz T 64 p(x;z), dene intermediate samples using z t = p 1 2 t z 0 + t z T forf t g T1 t=1 , 0 = 0, and T = 1. The goal of the estimation problem is to now learn a classier T (z) which distinguishes adjacent samples z t1 ;z t , in similar spirit to the giwae or InfoNCE critic function. Choi et al. 
(2021) consider the continuous time limit of `innitesimal' classication, highlighting connections with recent advances in score-based generative modeling (Song et al., 2021) and adopting optimization techniques to learn how the density ratio changes with t . Further connections between these approaches and the developments in this chapter remain interesting questions for future work. Conjugate Duality Interpretations Our general approach for constructing importance sam- pling bounds in Sec. 2.2.1 leverages analogues of the elbo and eubo in extended state space. We will discuss the single sampleelbo andeubo bounds in considerable detail in Ch. 3 as the endpoints of the Thermodynamic Variational Objective (tvo) integrand (Masrani et al., 2019; Brekelmans et al., 2020a). We have brie y considered extended state space versions of these bounds, such as multi-sample tvo with an iwae-style energy function. The analysis in Ch. 3 will also hold for arbitrary energy or critic functions from Sec. C.1, which highlights the exibility of our importance sampling and exponential family perspectives for analyzing a very general class of energy-based or Gibbs-Boltzmann distributions (Sec. 2.4). We have also extended the conjugate representation of the Kullback-Leibler (kl) divergence to multi-sample critic functions to derive the giwae and iwae lower bounds on mi (2.6.5). This direction is relatively unexplored in the literature, and might be used to derive tractable lower bounds with desirable properties for other divergence functions (e.g. in the f-gan framework (Nowozin et al., 2016)). In Sec. 2.6.6, we derived the elbo and iwae lower bounds on logp (x) using conjugate duality. Dai et al. (2019) adopt this conjugate duality view of (single-sample) maximum likelihood estimation for learning energy-based models, where the critic function T and an approximate sampler of the corresponding T are trained jointly. Finally, we have used the conjugate duality associated with thekl divergence over unnormalized positive measures to derive the mine-f lower bound of Belghazi et al. (2018a). Notably, we have used functional dualities between density functions ~ and critic functions T in a similar style to the representational duality of Ch. 4. We will again leverage these conjugate dualities, including extensions to -divergences, in the context of reinforcement learning in Ch. 6. 65 Chapter 3 Bregman Duality in Thermodynamic Variational Inference 3.1 Introduction Modern variational inference (vi) techniques are able to jointly perform maximum likelihood param- eter estimation and approximate posterior inference using stochastic gradient ascent (Kingma and Welling, 2013; Rezende et al., 2014). Commonly, this is done by optimizing a tractable bound to the marginal log likelihood logp (x) = R p (x;z) dz, obtained by introducing a divergence D[q (zjx)jjp (zjx)] between the variational distribution q (zjx) and true posterior p (zjx) (Blei et al., 2017; Li and Turner, 2016; Dieng et al., 2017; Cremer et al., 2017; Wang et al., 2018). The recent Thermodynamic Variational Objective (Thermodynamic Variational Objective (tvo)) (Masrani et al., 2019) reframes likelihood estimation in terms of numerical integration along a ge- ometric mixture path connecting q (zjx) and p (zjx). This perspective yields a natural family of lower and upper bounds via Riemann sum approximations, with the Evidence Lower Bound (elbo) appearing as a single-term lower bound and wake-sleep (ws) update corresponding to the simplest upper bound. 
The tvo generalizes these objectives by using a K-term Riemann sum to obtain tighter bounds on marginal likelihood. We refer to the discrete partitionf t g T t=0 used to construct this estimator as an `integration schedule.' However, the gaps associated with these intermediate bounds was not previously known, an important roadblock to understanding the objective. Further, the tvo was limited by a grid search procedure for choosing the integration schedule. While tvo bounds should become tighter with more rened partitions, Masrani et al. (2019) actually observe deteriorating performance in practice with high K. Our central contribution is an exponential family interpretation of the geometric mixture curve underlying the tvo and various path sampling methods (Gelman and Meng, 1998; Neal, 2001). 66 Using the Bregman divergences associated with this family, we characterize the gaps in the tvo upper and lower bounds as the sum of KL divergences along a given path, resolving this open question about the tvo. Further, we propose to choose intermediate distributions in the tvo based on the `moment- averaged' path of Grosse et al. (2013), which arises naturally from the dual parameterization of our exponential family. This scheduling scheme was originally proposed in the context of Annealed Importance Sampling (ais), where additional sampling procedures may be required to even approx- imate it. We provide an ecient implementation for the tvo setting, which allows the choice of to adapt to the shape of the integrand and degree of posterior mismatch throughout training. In Figure 3.8, we observe that this exible schedule yields near-optimal performance compared to grid search for a single intermediate distribution, so that the tvo can signicantly improve upon the elbo for minimal additional cost. However, our moments scheduler can still suer the previously observed performance degradation as the number of intermediate distributions increases. As a nal contribution, we propose a doubly reparameterized gradient estimator for thetvo, which we show can avoid this undesirable behavior and improve overall performance in continuous models. Our exponential family analysis may be of wider interest given the prevalence of Markov Chain Monte Carlo (mcmc) techniques utilizing geometric mixture paths (Neal, 1996, 2001; Grosse et al., 2016; Huang et al., 2020). To this end, we also present a framework for understanding thermody- namic integration (ti) (Ogata, 1989) and the tvo using Taylor series remainders, which claries that the tvo is a rst-order objective and provides geometric intuition for several results from Grosse et al. (2013). We hope these connections can help open new avenues for analysis at the intersection of mcmc, vi, and statistical physics. 3.2 Thermodynamic Variational Objective Thermodynamic Integration Thermodynamic integration (ti) is a technique from statistical physics, which frames estimating ratios of partition functions as a one-dimensional integration problem. 
Commonly, this integral is taken over 2 [0; 1], which parameterizes a path of geometric mixtures between a base distribution 0 , and a target distribution 1 (Gelman and Meng, 1998) (z) := ~ (z) R ~ (z)dz = 1 0 (z) 1 (z) Z : (3.1) 67 E ⇡ [log p ✓ (x,z) q (z|x) ]<latexit sha1_base64="2gwDVx1FkoExCqGSX0IfpeQ76OY=">AAACsHicbZFdixMxFIbT8WutH9vVS2+CReiClJmloJeLIni5gt0tdIYhyWTabDOTmJzR1pg/6D/wX3irV2bayjq7Hgi8POcN5+QN1VJYiOMfvejW7Tt37x3c7z94+Ojx4eDoyblVjWF8ypRUZkaJ5VLUfAoCJJ9pw0lFJb+gq7dt/+IzN1ao+iNsNM8qsqhFKRiBgPJBkVYElpS6dz53qRZ5SjkQP0+lWuC0NIQ5naewDHC0s5Zu7V/iv/qrx8fefcpTvRSjK/jtynvss3wwjMfxtvBNkezFEO3rLD/qTdJCsabiNTBJrJ0nsYbMEQOCSe77aWO5JmxFFnweZE0qbjO3jcPjF4EUuFQmnBrwlv57w5HK2k1Fg7Pd0l7vtfB/vXkD5evMiVo3wGu2G1Q2EoPCbba4EIYzkJsgCDMi7IrZkoQMIfxAZ0rVSBBGffFdSlaccSm7lCq1AkJt99V8rZVpIykuGwtUrX2/H3JOrqd6U5yfjJPJePJhMjx9s0/8AD1Dz9EIJegVOkXv0RmaIoa+o5/oF/odnUSzKI/Izhr19neeok5Fl38AGajbmA==</latexit>0 <latexit sha1_base64="+bcvVsgReT4e+AN7PozBeVeZPe8=">AAACS3icbVDLTgJBEJxFUcQX6NHLRmLiiewqRo9ELx4hkUcCGzI7NDgyu7OZ6VXIhi/wqp/lB/gd3owHh8fBBSvppFLVne4uPxJco+N8WpmNzezWdm4nv7u3f3BYKB41tYwVgwaTQqq2TzUIHkIDOQpoRwpo4Ato+aO7md96BqW5DB9wEoEX0GHIB5xRNFLd6RVKTtmZw14n7pKUyBK1XtGqdPuSxQGEyATVuuM6EXoJVciZgGm+G2uIKBvRIXQMDWkA2kvml07tM6P07YFUpkK05+rfiYQGWk8C33QGFB/1qjcT//M6MQ5uvISHUYwQssWiQSxslPbsbbvPFTAUE0MoU9zcarNHqihDE05qSxAL5Eq+TNMqHQEDIdKqL+UIqa/TX8M4kmoWSf8p1ujL8TSfNzm7q6muk+ZF2b0sX9UrpertMvEcOSGn5Jy45JpUyT2pkQZhBMgreSPv1of1ZX1bP4vWjLWcOSYpZLK/KcW0OQ==</latexit> <latexit sha1_base64="Cd4fUhlXYW481UKJTiYhjMA+33Q=">AAACT3icbZBNTwIxEIa7+AX4BXr0spGYeCK7itGj0YtHTERNgJhpGaDS3W7aWZUQfoNX/Vke/SXejAX34KqTNHnzvNPMzMsTJS0FwbtXWFhcWl4plsqra+sbm5Xq1rXVqRHYElppc8vBopIxtkiSwtvEIERc4Q0fnc/8mwc0Vur4isYJdiMYxLIvBZBDrQ5HgrtKLagH8/L/ijATNZZV867qNTo9LdIIYxIKrG2HQULdCRiSQuG03EktJiBGMMC2kzFEaLuT+bZTf8+Rnt/Xxr2Y/Dn9+WMCkbXjiLvOCGhof3sz+J/XTql/0p3IOEkJY/E9qJ8qn7Q/O93vSYOC1NgJEEa6XX0xBAOCXEC5KVGqSBr9OM1TGKFApfKUaz0i4DZ/NT4l2swi6d2nlrh+mpbLLufwd6p/xfVBPTysH102aqdnWeJFtsN22T4L2TE7ZResyVpMMMme2Qt79d68D++zkLUWvExss1wVSl8Fl7Uq</latexit> 0 <latexit sha1_base64="+bcvVsgReT4e+AN7PozBeVeZPe8=">AAACS3icbVDLTgJBEJxFUcQX6NHLRmLiiewqRo9ELx4hkUcCGzI7NDgyu7OZ6VXIhi/wqp/lB/gd3owHh8fBBSvppFLVne4uPxJco+N8WpmNzezWdm4nv7u3f3BYKB41tYwVgwaTQqq2TzUIHkIDOQpoRwpo4Ato+aO7md96BqW5DB9wEoEX0GHIB5xRNFLd6RVKTtmZw14n7pKUyBK1XtGqdPuSxQGEyATVuuM6EXoJVciZgGm+G2uIKBvRIXQMDWkA2kvml07tM6P07YFUpkK05+rfiYQGWk8C33QGFB/1qjcT//M6MQ5uvISHUYwQssWiQSxslPbsbbvPFTAUE0MoU9zcarNHqihDE05qSxAL5Eq+TNMqHQEDIdKqL+UIqa/TX8M4kmoWSf8p1ujL8TSfNzm7q6muk+ZF2b0sX9UrpertMvEcOSGn5Jy45JpUyT2pkQZhBMgreSPv1of1ZX1bP4vWjLWcOSYpZLK/KcW0OQ==</latexit> 1<latexit sha1_base64="N7w/IXb3C506Zqfo+1ruaPO5NkQ=">AAACS3icbVBNT8JAEN2iKOIX6NFLIzHxRFpDokeiF4+QyEcCDdluB1zZdpvdqUIIv8Cr/ix/gL/Dm/HgAj1Y8CWTvLw3k5l5fiy4Rsf5tHJb2/md3cJecf/g8Oi4VD5pa5koBi0mhVRdn2oQPIIWchTQjRXQ0BfQ8cd3C7/zDEpzGT3gNAYvpKOIDzmjaKSmOyhVnKqzhL1J3JRUSIrGoGzV+oFkSQgRMkG17rlOjN6MKuRMwLzYTzTElI3pCHqGRjQE7c2Wl87tC6ME9lAqUxHaS/XvxIyGWk9D33SGFB/1urcQ//N6CQ5vvBmP4gQhYqtFw0TYKO3F23bAFTAUU0MoU9zcarNHqihDE05mS5gI5Eq+zLMqHQMDIbKqL+UYqa+zX8MklmoRSfCUaPTlZF4smpzd9VQ3Sfuq6taqtWatUr9NEy+QM3JOLolLrkmd3JMGaRFGgLySN/JufVhf1rf1s2rNWenMKckgl/8FK660Og==</latexit> <latexit 
sha1_base64="Cd4fUhlXYW481UKJTiYhjMA+33Q=">AAACT3icbZBNTwIxEIa7+AX4BXr0spGYeCK7itGj0YtHTERNgJhpGaDS3W7aWZUQfoNX/Vke/SXejAX34KqTNHnzvNPMzMsTJS0FwbtXWFhcWl4plsqra+sbm5Xq1rXVqRHYElppc8vBopIxtkiSwtvEIERc4Q0fnc/8mwc0Vur4isYJdiMYxLIvBZBDrQ5HgrtKLagH8/L/ijATNZZV867qNTo9LdIIYxIKrG2HQULdCRiSQuG03EktJiBGMMC2kzFEaLuT+bZTf8+Rnt/Xxr2Y/Dn9+WMCkbXjiLvOCGhof3sz+J/XTql/0p3IOEkJY/E9qJ8qn7Q/O93vSYOC1NgJEEa6XX0xBAOCXEC5KVGqSBr9OM1TGKFApfKUaz0i4DZ/NT4l2swi6d2nlrh+mpbLLufwd6p/xfVBPTysH102aqdnWeJFtsN22T4L2TE7ZResyVpMMMme2Qt79d68D++zkLUWvExss1wVSl8Fl7Uq</latexit> <latexit sha1_base64="VziWYX+3e3DkDYTZSy3/SQ5Mddg=">AAAHUnichVXbbhMxEHW5NAXKpfDIy4qqUqmiqmmDCuKlKkWAAHERpRVNiLze2Y1Vr9fxOm0Ss7/AK3wYL/wKT4y3TZvEKVhKMjvnzMzxTNYOleC5WVv7PXPp8pWrs5W5a9dvzN+8dfvOwt3PedbVDHZZJjK9H9IcBJewa7gRsK800DQUsBcePnP43hHonGfyk+kraKY0kTzmjBrnagjotO4srq2ulSvwjdqpsbg1R8r1vrUw+7gRZaybgjRM0Dw/WK8rUwWZoOB201JtOBNQXG90c1CUHdIEDtCUNIW8aUvRRbCEniiIM40faYLSOxphaZrn/TREZkpNO5/EnPMMWxorZeLHTcul6hqQ7KRS3BWByQLXgiDiGpgRfTQo0xzFBqxNNWUGGzUm+lOtaZ06l2YMENg+WRspMqYupCGIYkyUVT2XKEdiBDEOq9yv7easpSEq7McX24WtPdqsbtSrjzaQJeGYZWlKZWQbIRTuK+HSQkdSrWm/CE4dVPBEYoaJEHAhgNYQD8qn83CvhJRekZX/VymjLq6z4hU66hXWNtzUwjjoTSY8Oh5Bjz20P4L2PXQwgg48NB5B40n0UCC607Kv3xSTUAQMMdWyDdMGQ4tgGffwDas99LohHbODTNXmJW+AvJ7HS0D6GavBtJRCRD7VYym3c7U8LYHqjcVPDe+4GrYzBJfKPUT0QvpgZJdTi/IOMl61XMr9p1/8PnXORhHa54XjedNKRqaVeF2hDg15EqCJpw5MEvQ5QU8llN1QJYnq4LQ5k69EDCo/l4GiqEYPF5n09IIQSvMUHB3tr9ic8tEbB1egU8dSfAo2TKH4hRlkxvHcxIthqIxRgU1cPpHr9TpKqfsHRC1H7nnZOI9yXurJ3C0BxqIH32jndNwdwANew1us8w6VU5PpFdugOinT4m+j6qx/EbkcEtHCu6Y2ebP4xuf11Vp99cmH+uLW9smlQ+bIffKALJMa2SRb5CV5T3YJI23ynfwgP2d/zf6pzFQun1AvzZzG3CNjqzL/F1+otkg=</latexit><latexit sha1_base64="VziWYX+3e3DkDYTZSy3/SQ5Mddg=">AAAHUnichVXbbhMxEHW5NAXKpfDIy4qqUqmiqmmDCuKlKkWAAHERpRVNiLze2Y1Vr9fxOm0Ss7/AK3wYL/wKT4y3TZvEKVhKMjvnzMzxTNYOleC5WVv7PXPp8pWrs5W5a9dvzN+8dfvOwt3PedbVDHZZJjK9H9IcBJewa7gRsK800DQUsBcePnP43hHonGfyk+kraKY0kTzmjBrnagjotO4srq2ulSvwjdqpsbg1R8r1vrUw+7gRZaybgjRM0Dw/WK8rUwWZoOB201JtOBNQXG90c1CUHdIEDtCUNIW8aUvRRbCEniiIM40faYLSOxphaZrn/TREZkpNO5/EnPMMWxorZeLHTcul6hqQ7KRS3BWByQLXgiDiGpgRfTQo0xzFBqxNNWUGGzUm+lOtaZ06l2YMENg+WRspMqYupCGIYkyUVT2XKEdiBDEOq9yv7easpSEq7McX24WtPdqsbtSrjzaQJeGYZWlKZWQbIRTuK+HSQkdSrWm/CE4dVPBEYoaJEHAhgNYQD8qn83CvhJRekZX/VymjLq6z4hU66hXWNtzUwjjoTSY8Oh5Bjz20P4L2PXQwgg48NB5B40n0UCC607Kv3xSTUAQMMdWyDdMGQ4tgGffwDas99LohHbODTNXmJW+AvJ7HS0D6GavBtJRCRD7VYym3c7U8LYHqjcVPDe+4GrYzBJfKPUT0QvpgZJdTi/IOMl61XMr9p1/8PnXORhHa54XjedNKRqaVeF2hDg15EqCJpw5MEvQ5QU8llN1QJYnq4LQ5k69EDCo/l4GiqEYPF5n09IIQSvMUHB3tr9ic8tEbB1egU8dSfAo2TKH4hRlkxvHcxIthqIxRgU1cPpHr9TpKqfsHRC1H7nnZOI9yXurJ3C0BxqIH32jndNwdwANew1us8w6VU5PpFdugOinT4m+j6qx/EbkcEtHCu6Y2ebP4xuf11Vp99cmH+uLW9smlQ+bIffKALJMa2SRb5CV5T3YJI23ynfwgP2d/zf6pzFQun1AvzZzG3CNjqzL/F1+otkg=</latexit> k 1<latexit sha1_base64="y1hmrlc22j/B0ZHMi84Yh17NbFY=">AAACVXicbZBNS8NAEIY38avWz+rRS7AIXiyJFPQoevGoYFVoS5ndTnXNJht2J2oJ+Rle9WeJP0ZwW3sw6gsLL8/MMLMvz5S0FIYfnj83v7C4VFuur6yurW9sNraurc6NwI7QSptbDhaVTLFDkhTeZgYh4QpveHw2qd88orFSp1c0zrCfwF0qR1IAOdTtcSQYFPFBVA42m2ErnCr4a6KZabKZLgYNr90bapEnmJJQYG03CjPqF2BICoVlvZdbzEDEcIddZ1NI0PaL6c1lsOfIMBhp415KwZT+nCggsXaccNeZAN3b37UJ/K/WzWl03C9kmuWEqfheNMpVQDqYBBAMpUFBauwMCCPdrYG4BwOCXEyVLUmuSBr9VFYpxChQqSrlWscE3FZ/jc+ZNpNIhg+5Ja6fy3rd5Rz9TvWvuT5sRe1W+7LdPDmdJV5jO2yX7bOIHbETds4uWIcJptkLe2Vv3rv36c/7i9+tvjeb2WYV+Rtf3zS2hw==</latexit> k<latexit 
sha1_base64="gcrTaGRC2c7QE6GyZV1NF+ODePk=">AAACUXicbVBNSyNBEK0Zv2L8Xo9eBoPgKcxIQI+iF49Z2BghCaG6U9F2eqaH7ho1hPwIr7s/a0/7U7xtJ+bgqA8KHu9VUVVPFFo5juN/Qbiyura+Udusb23v7O7tH/y4daa0kjrSaGPvBDrSKqcOK9Z0V1jCTGjqivR67nefyDpl8l88KWiQ4X2uxkoie6nbF8Q4TIf7jbgZLxB9JcmSNGCJ9vAgaPVHRpYZ5Sw1OtdL4oIHU7SspKZZvV86KlCmeE89T3PMyA2mi3tn0YlXRtHYWF85Rwv148QUM+cmmfCdGfKD++zNxe+8Xsnji8FU5UXJlMv3ReNSR2yi+fPRSFmSrCeeoLTK3xrJB7Qo2UdU2ZKVmpU1z7OqiilJ0rqqCmNSRuGqX9NLYew8ktFj6ViYl1m97nNOPqf6ldyeNZNWs/Wz1bi8WiZegyM4hlNI4Bwu4Qba0AEJKbzCb/gT/A3eQgjD99YwWM4cQgXh1n+yAbUJ</latexit>0 <latexit sha1_base64="+bcvVsgReT4e+AN7PozBeVeZPe8=">AAACS3icbVDLTgJBEJxFUcQX6NHLRmLiiewqRo9ELx4hkUcCGzI7NDgyu7OZ6VXIhi/wqp/lB/gd3owHh8fBBSvppFLVne4uPxJco+N8WpmNzezWdm4nv7u3f3BYKB41tYwVgwaTQqq2TzUIHkIDOQpoRwpo4Ato+aO7md96BqW5DB9wEoEX0GHIB5xRNFLd6RVKTtmZw14n7pKUyBK1XtGqdPuSxQGEyATVuuM6EXoJVciZgGm+G2uIKBvRIXQMDWkA2kvml07tM6P07YFUpkK05+rfiYQGWk8C33QGFB/1qjcT//M6MQ5uvISHUYwQssWiQSxslPbsbbvPFTAUE0MoU9zcarNHqihDE05qSxAL5Eq+TNMqHQEDIdKqL+UIqa/TX8M4kmoWSf8p1ujL8TSfNzm7q6muk+ZF2b0sX9UrpertMvEcOSGn5Jy45JpUyT2pkQZhBMgreSPv1of1ZX1bP4vWjLWcOSYpZLK/KcW0OQ==</latexit> 1<latexit sha1_base64="N7w/IXb3C506Zqfo+1ruaPO5NkQ=">AAACS3icbVBNT8JAEN2iKOIX6NFLIzHxRFpDokeiF4+QyEcCDdluB1zZdpvdqUIIv8Cr/ix/gL/Dm/HgAj1Y8CWTvLw3k5l5fiy4Rsf5tHJb2/md3cJecf/g8Oi4VD5pa5koBi0mhVRdn2oQPIIWchTQjRXQ0BfQ8cd3C7/zDEpzGT3gNAYvpKOIDzmjaKSmOyhVnKqzhL1J3JRUSIrGoGzV+oFkSQgRMkG17rlOjN6MKuRMwLzYTzTElI3pCHqGRjQE7c2Wl87tC6ME9lAqUxHaS/XvxIyGWk9D33SGFB/1urcQ//N6CQ5vvBmP4gQhYqtFw0TYKO3F23bAFTAUU0MoU9zcarNHqihDE05mS5gI5Eq+zLMqHQMDIbKqL+UYqa+zX8MklmoRSfCUaPTlZF4smpzd9VQ3Sfuq6taqtWatUr9NEy+QM3JOLolLrkmd3JMGaRFGgLySN/JufVhf1rf1s2rNWenMKckgl/8FK660Og==</latexit>1<latexit sha1_base64="N7w/IXb3C506Zqfo+1ruaPO5NkQ=">AAACS3icbVBNT8JAEN2iKOIX6NFLIzHxRFpDokeiF4+QyEcCDdluB1zZdpvdqUIIv8Cr/ix/gL/Dm/HgAj1Y8CWTvLw3k5l5fiy4Rsf5tHJb2/md3cJecf/g8Oi4VD5pa5koBi0mhVRdn2oQPIIWchTQjRXQ0BfQ8cd3C7/zDEpzGT3gNAYvpKOIDzmjaKSmOyhVnKqzhL1J3JRUSIrGoGzV+oFkSQgRMkG17rlOjN6MKuRMwLzYTzTElI3pCHqGRjQE7c2Wl87tC6ME9lAqUxHaS/XvxIyGWk9D33SGFB/1urcQ//N6CQ5vvBmP4gQhYqtFw0TYKO3F23bAFTAUU0MoU9zcarNHqihDE05mS5gI5Eq+zLMqHQMDIbKqL+UYqa+zX8MklmoRSfCUaPTlZF4smpzd9VQ3Sfuq6taqtWatUr9NEy+QM3JOLolLrkmd3JMGaRFGgLySN/JufVhf1rf1s2rNWenMKckgl/8FK660Og==</latexit> (✓ , ,x) <latexit sha1_base64="YA9/3VggtBVRB7Ddblx9ChCZX2k=">AAACc3icbVBNT9tAEN245aPho6E9clkRqoIEkV2oyhG1lx6p1ABSHEWzmzFZsvZau2NIZPkn8Gt6bX9If0jvXYccauiTdvX03oxm5olcK0dh+LsVvHi5srq2/qq9sbm1/bqz8+bSmcJK7Eujjb0W4FCrDPukSON1bhFSofFKTL/U/tUdWqdM9p3mOQ5TuMlUoiSQl0ad9zHhjJwsUQtTHcQ0QYIjHucT5f8UaCKSclYdjjrdsBcuwJ+TaEm6bImL0U7rNB4bWaSYkdTg3CAKcxqWYElJjVU7LhzmIKdwgwNPM0jRDcvFRRV/55UxT4z1LyO+UP/tKCF1bp4KX1nv6J56tfg/b1BQcjYsVZYXhJl8HJQUmpPhdTx8rCxK0nNPQFrld+VyAhYk+RAbU9JCk7LmvmqqMEWJWjdVYcyUQLjm1TjLja0jGd8WjoSZVe22zzl6mupzcvmhF530Pn477Z5/Xia+znbZHjtgEfvEztlXdsH6TLIH9oP9ZL9af4LdYC/YfywNWsuet6yB4PgvuanBgA==</latexit> D KL [q (z|x)||p ✓ (z|x)]<latexit 
sha1_base64="k/HtgedfsD7MDw2o8IwUqqIwB5w=">AAAHcXichVVbbxM5FDYsNOWyS9l9QrwMVJVKN6qaUgQrXiouWtCCuIhCRSdEHs+ZiVWPx/E4aRJ3nvk1+wq/hd/BH9jjSUOTccqOlOTM+b5zz7EjJXhhtra+nTv/y4WLS43lS5evXP31t2sr139/X+R9zWCP5SLX+xEtQHAJe4YbAftKA80iAR+iw8cO/zAAXfBcvjMjBe2MppInnFGDqs7KrScd+8+L8iAEyY6PVceGpguGluvhYHwcDoZ32p2V1a3NreoJfKF1IqzuLpPqed25vvQgjHPWz0AaJmhRHGzvKNMEmWI93bal2nAmoLwc9gtQlB3SFA5QlDSDom2rmspgDTVxkOQaP9IElXbWwtKsKEZZhMyMmm5Rx5zyB7Y2F8okD9qWS9U3WPQkUtIXgckD16Eg5hqYESMUKNMckw1Yl2rKDPZxLul3rbZ12Tk3c4DA7srWTJC57CIagSjnkrJq6BwVSIwhwVlW9dp+wToa4tK+/ftRaVv37jfv7jTv3UWWhCOWZxmVsQ0jKN1XyqWFnqRa01EZnCio4KlEDzUTcCaA0hQPqrdTcy+ElF6Qjf+PUlmdHWfDCzQYltaGbmpREgzrDgdHM+iRh45m0JGHjmfQsYcmM2hSRw8FopNNqUMxMMROFyfAzRni5ozveN2QjtlDpuryijfZsDovBel7bAaLXAoR+1SPpVzlan2RAzWcs19o3nMxbG8KrlU1xPRM+nimyoVBeQ8ZzzvO5f7Dj36fej9GEdmnpeN500pnppV6XaEOjXgaoIinDtQJ+pSgFxKqbqiKRHVw0pz6SiSgitM0MCmqUcNFLr18QQileQaOjvInbE716o2DK9CZYym+AJu6UPxMDzLneG7ivTHNjFGBTVyfpOv1Os6o+wfEHUceet44jwte5ZO7SwSMRQ1utFM67hPAA17DS4zzCjOnJtcbNqQ6rdzib9h00s+IXE6JKOFd06rfLL7wfnuztbP515ud1d1Hk0uHLJOb5DZZJy1yn+ySZ+Q12SOMfCb/ki/k69L3xo1G0Lg9oZ4/d2LzB5l7Gn/+Byr1wg4=</latexit>logp ✓ (x)<latexit sha1_base64="HjDqHWliNcCsWEIM5Iwm3pen4xw=">AAAHVnichVXbbhMxEDUFmnJv4ZGXFVWlUkVV04taxEtVigAB4iIKFd0Qeb2zG6ter+N12qRmf4JX+DD4GcR429AkTsFSktk5Z2aOZ7J2pAQvzMrKr0tTl69cna7NXLt+4+at23dm5+5+LPKuZrDHcpHr/YgWILiEPcONgH2lgWaRgE/R4ROHfzoCXfBcfjB9Bc2MppInnFGDrv1Q5GkQql5rdn5leaVagW80zoz57RlSrbetuemtMM5ZNwNpmKBFcbC6rkwdZIqi201LteFMQHk97BagKDukKRygKWkGRdNWwstgAT1xkOQaP9IElXc4wtKsKPpZhMyMmnYxjjnnX2xhpJRJtpqWS9U1INlppaQrApMHrg1BzDUwI/poUKY5ig1Ym2rKDDZrRPSHRtM6dS7NCCCwhbIxVGREXUQjEOWIKKt6LlGBxBgSHFi1X9stWEtDXNr3z3ZK29jYrK+t1zfWkCXhmOVZRmVswwhK95VyaaEjqda0XwZnDip4KjHDWAi4EEBrgAfV03m4V0JKr8jS/6tUURfXWfIKHfVKa0M3tSgJeuMJj46H0GMP7Q+hfQ89GUJPPDQZQpNx9FAgutuyL1+V41AMDDHVsqFpg6FlsIh7+IrVHnrdkI7ZQaZq84p3gryex0tB+hnrwaSUQsQ+1WMpt3O1OCmB6o3ETwzvuBq2MwAXqj3E9EL6ydAuJxblHWS8aLmU+48/+33q/B1FZJ+WjudNKx2aVup1hTo04nh8CYqnDowT9DlBTyRU3VAViergrDnjr0QCqjiXgaKoRg8XufT0ghBK8wwcHe0v2Jzq0RsHV6Azx1J8AjZIofiFGWTO8dzEy2GgjFGBTVw8lev1Os6o+wfELUfuedk4jwte6cndTQHGogffaOd03F3AA17Da6zzBpVTk+slG1KdVmnxN6w7619ELgdEtPCuaYzfLL7xcXW5sb786N36/PbO6aVDZsh98oAskgbZJNvkOXlL9ggjgnwj38mP6Z/Tv2tXa7VT6tSls5h7ZGTVZv8AsGS31A==</latexit> L (✓ , ,x) <latexit sha1_base64="xphD6EiyjCYRWS8CbnGhtEVK3cc=">AAACdHicbVBNb9NAEN24QEP4SuAIhxVRpSKhyKatyrEqFw5IFKlJKyVRNLsZN9usvdbuuE1k+S/wa7i2/4M/wpl1mgNOeNKu3ryZ0cw8kWnlKAx/N4KdR4+f7Daftp49f/HyVbvzeuBMbiX2pdHGXgpwqFWKfVKk8TKzCInQeCHmX6r8xQ1ap0x6TssMxwlcpSpWEshLk/b+iHBBThbng+/l5JsPZ0jwkY+ymfJ/AjQTcbEoP0za3bAXrsC3SbQmXbbG2aTTOBxNjcwTTElqcG4YhRmNC7CkpMayNcodZiDncIVDT1NI0I2L1Ukl3/PKlMfG+pcSX6n/dhSQOLdMhK+sdnSbuUr8X26YU/x5XKg0ywlT+TAozjUnwyt/+FRZlKSXnoC0yu/K5QwsSPIu1qYkuSZlzW1ZV2GOErWuq8KYOYFw9atxkRlbWTK9zh0JsyhbLe9ztOnqNhl86kUHvaMfh92T07XjTfaWvWf7LGLH7IR9ZWeszyT7yX6xO3bf+BO8C7rB3kNp0Fj3vGE1BL2/6AbBjA==</latexit> E ⇡ [·]<latexit 
sha1_base64="tTJPeHdmduE7mFmZ2r5wwU1BpfQ=">AAAHb3ichVXbbhMxEDW3ptxbeOABCVZUlUoVVU0vahEvFRcBAsRF9CKaEHm9sxurXq/jddqkZh/5Gl7hY/gM/oDxpqFJnIKlbGbnnJk5Hq/tUAmem+XlX+fOX7h4aaoyffnK1WvXb9ycmb21k2cdzWCbZSLTeyHNQXAJ24YbAXtKA01DAbvhwVOH7x6CznkmP5megkZKE8ljzqhBV3Pmfj2lphWG9nnRtHXF8RGCoUWxX2dRZhrNmbnlpeVyBL5ROzHmtqZJOd43Z6c261HGOilIwwTN8/2VNWWqIBOcTathqTacCSiu1Ds5KMoOaAL7aEqaQt6w5YyKYB49URBnGn/SBKV3OMLSNM97aYhMpz8fx5zzLzY/UsrEmw3LpeoYkKxfKe6IwGSB608QcQ3MiB4alGmOYgPWopoyg10cEf2p1rBOnUszAgjsrawNFRlRF9IQRDEiyqquS5QjMYIYV7Kcr+3krKkhKuzHF08KW1vfqK6uVddXkSXhiGVpSmXklqtwj4RLC21Jtaa9IjhxUMETiRnGQsCFAFoDPCjfTsO9ElJ6RRb/X6WMOrvOolfosFtY2/8i46A7nvDwaAg98tDeENrz0OMh9NhD4yE0HkcPBKLPmvb1m2IcioAhpnDXmJbbNsECzuErVnvodUM6ZtttshYvecfI63q8BKSfsRpMSilE5FM9lnIzVwuTEqjuSPzE8LarYdsDcL6cQ0TPpB8PzXJiUd5GxqumS7n3+LPfp/bfpegfSm2v6YfJ0GolXleoQ0OeBGjiqQPjBH1K0BMJZTdUSaI6OGnO+JaIQeWnMlAU1ejhIpOeXhBCaZ6Co6P9BZtTvnrLwRXo1LEUn4ANUih+ZgaZcTw38dYYKGNUYBMX+nK9XkcpdV9A1HTkrpeN8yjnpZ7MXSFgLHpwRzun4z4DPOA1vMU671A5NZletHWqkzIt/terzvoXkcsBES28a2rjN4tv7Kws1daWHn1Ym9t60r90yDS5Sx6QBVIjG2SLvCTvyTZh5Bv5Tn6Qn1O/K3cq9ypBn3r+3EnMbTIyKg//ALs2wYo=</latexit>Figure 3.1: Thetvo is a K-term Riemann sum approximation of logp (x), which can be expressed as a scalar integral over the unit interval in (3.6) and on the right. The elbo is a single-term left Riemann approximation of the same integral using the point = 0 with 0 =q (zjx). Note that the integrand is negative in practice, but shown as positive for interpretability. The insight of ti is to recognize that, while the log partition function is intractable, its derivative can be written as an expectation that may be estimated using sampling or simulation techniques (Neal, 2001; Habeck, 2017) r logZ =E log 1 (z) 0 (z) : (3.2) In Sec. 3.3, we will see that this identity arises from an interpretation of the geometric mixture curve Eq. (3.1) as an exponential family. Applying this within the fundamental theorem of calculus, logZ 1 logZ 0 = Z 1 0 r logZ d (3.3) = Z 1 0 E log 1 (z) 0 (z) d: (3.4) While (3.3) holds for any choice of path parameterized by, we can construct ecient estimators of the integrand in (3.4) and estimate the partition function ratio logZ 1 =Z 0 using numerical integration techniques. Thermodynamic Variational Objective Thetvo (Masrani et al., 2019) usesti in the context of variational inference to provide natural upper and lower bounds on the log evidence, which can then be used as objectives for training latent variable models. In particular, the geometric mixture path interpolates between the approximate posteriorq (zjx) and the joint generative modelp (x;z) (zjx) = ~ (x; z) R ~ (x; z) dz := q (zjx) 1 p (x;z) Z (x) : (3.5) 68 As distributions over z, we can identify the endpoints as 0 (zjx) =q (zjx) and 1 (zjx) =p (zjx), with corresponding normalizing constants Z 0 = 1 and Z 1 = R p (x;z)dz =p (x). Applying ti Eq. (3.3) for this set of log partition functions, Masrani et al. (2019) express the generative model likelihood using a one-dimensional integral over the unit interval logp (x) = Z 1 0 E log p (x;z) q (zjx) d: (3.6) The left and right endpoints of this integrand correspond to familiar lower and upper bounds on logp (x). 
The evidence lower bound (elbo) occurs at \beta = 0, while the analogous Evidence Upper Bound (eubo) at \beta = 1 uses the `reverse' KL divergence and appears in various wake-sleep objectives (ws) (Hinton et al., 1995; Bornschein and Bengio, 2014)

elbo(\theta, \phi, x) = \log p_\theta(x) - D_{KL}[\, q_\phi \,\|\, p_\theta \,]    (3.7)
eubo(\theta, \phi, x) = \log p_\theta(x) + D_{KL}[\, p_\theta \,\|\, q_\phi \,].    (3.8)

To arrive at the tvo, a discrete partition schedule P = \{\beta_t\}_{t=0}^{T} is chosen with \beta_0 = 0 and \beta_T = 1. The integral in Eq. (3.6) is then approximated using a left or right Riemann sum. Masrani et al. (2019) show the integrand is increasing, which implies that these integral approximations yield valid lower and upper bounds on the marginal likelihood

tvo_L(\theta, \phi, x) := \sum_{t=1}^{T} (\beta_t - \beta_{t-1})\, \mathbb{E}_{\pi_{\beta_{t-1}}}\Big[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \Big]    (3.9)
tvo_U(\theta, \phi, x) := \sum_{t=1}^{T} (\beta_t - \beta_{t-1})\, \mathbb{E}_{\pi_{\beta_t}}\Big[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \Big]    (3.10)

with

tvo_L(\theta, \phi, x) \leq \log p_\theta(x) \leq tvo_U(\theta, \phi, x).    (3.11)

The first term of tvo_L(\theta, \phi, x) corresponds to the elbo, while the last term of tvo_U(\theta, \phi, x) corresponds to the eubo. Thus, the tvo generalizes both objectives, with additional partitions leading to tighter bounds on the likelihood, as visualized in Fig. 3.1.

Although we consider thermodynamic integration over 0 \leq \beta \leq 1 to approximate \log p_\theta(x), note that this integral does not avoid the need for integration over z, since each intermediate distribution must be normalized. Masrani et al. (2019) propose an efficient, self-normalized importance sampling (snis) scheme with proposal q_\phi(z|x), so that expectations at any intermediate \beta can be estimated by simply reweighting a single set of importance samples

\mathbb{E}_{\pi_\beta}[\cdot] \approx \sum_{i=1}^{S} \frac{w_i^{\beta}}{\sum_{s=1}^{S} w_s^{\beta}} [\cdot], \quad \text{where} \quad w_i := \frac{p_\theta(x, z_i)}{q_\phi(z_i|x)}.    (3.12)

We will use this approximate sampling scheme for our method of choosing \{\beta_t\}_{t=1}^{T-1} in Sec. 3.5 and the experiments in Sec. 3.7. However, in Sec. 3.9.1, we discuss connections between the tvo and methods such as ais and Bidirectional Monte Carlo (bdmc). This perspective suggests using mcmc transition operators to transform samples from the initial distribution z_0 \sim q_\phi(z|x) and provide more accurate samples z_t \sim \pi_{\beta_t}(z) from each intermediate distribution. We highlight the limitations of snis with a fixed set of samples from q_\phi(z|x) in Fig. 3.15.

3.3 Likelihood Ratio Exponential Family Interpretation

In this section, we consider the geometric mixture path in Eq. (3.5) as an exponential family of distributions. Following Grünwald (2007) Ch. 17, we refer to this as a likelihood ratio exponential family, since the sufficient statistics correspond to T(x, z) = \log \frac{p_\theta(x, z)}{q_\phi(z|x)}. We then show that several key quantities in the tvo arise from familiar properties of exponential families. In Sec. 3.4, we leverage the Bregman divergences associated with our exponential family to naturally characterize the gap in tvo bounds as a sum of KL divergences.

Definition
To match the tvo setting in Eq. (3.5), we consider an exponential family of distributions with natural parameter \beta, base measure q_\phi(z|x), and sufficient statistics equal to the log importance weights as in Eq. (3.9)-Eq. (3.10)

\pi_\beta(z|x) := \pi_0(z|x) \exp\{ \beta\, T(x, z) - \psi(x; \beta) \}    (3.13)
where T(x, z) := \log \frac{p_\theta(x, z)}{q_\phi(z|x)}, \quad \pi_0(z|x) := q_\phi(z|x).

This induces a log partition function \psi(x; \beta), which normalizes over z and corresponds to \log Z_\beta(x) in Eq. (3.5)

\psi(x; \beta) := \log \int q_\phi(z|x) \exp\Big\{ \beta \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \Big\}\, dz = \log \int q_\phi(z|x)^{1-\beta}\, p_\theta(x, z)^{\beta}\, dz    (3.14)
= \log Z_\beta(x).    (3.15)

The log partition function will play a key role in our analysis, and is often written as \psi(\beta) to omit the dependence on x.
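Since the snis scheme in (3.12) reuses a single batch of samples from q_\phi(z|x) for every \beta, the Riemann-sum bounds (3.9)-(3.10) can be estimated with a few lines of array arithmetic. The sketch below is a minimal illustration under the assumption that log_p_joint and log_q are arrays holding \log p_\theta(x, z_i) and \log q_\phi(z_i|x) for samples z_i \sim q_\phi(z|x); it is not the reference implementation of Masrani et al. (2019).

import numpy as np

def tvo_bounds(log_p_joint, log_q, betas):
    # betas is an increasing array with betas[0] = 0 and betas[-1] = 1
    log_w = log_p_joint - log_q                       # log importance weights, shape (S,)

    def snis_expectation(beta):
        # E_{pi_beta}[log w] estimated by self-normalized reweighting, Eq. (3.12)
        logits = beta * log_w
        snis = np.exp(logits - logits.max())
        snis = snis / snis.sum()
        return np.sum(snis * log_w)

    heights = np.array([snis_expectation(b) for b in betas])
    widths = np.diff(betas)                           # beta_t - beta_{t-1}
    tvo_lower = np.sum(widths * heights[:-1])         # left Riemann sum, Eq. (3.9)
    tvo_upper = np.sum(widths * heights[1:])          # right Riemann sum, Eq. (3.10)
    return tvo_lower, tvo_upper

With betas = np.linspace(0, 1, 2), the lower bound reduces to the usual single-batch elbo estimate, while finer schedules interpolate toward \log p_\theta(x) from both sides, subject to the snis bias discussed above.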
We emphasize that we have made no additional assumptions on p_\theta(x, z) or q_\phi(z|x), and do not assume that they come from exponential families themselves. This `higher-order' exponential family thus maintains full generality and may be constructed between arbitrary distributions. We now show that a number of key quantities, which were manually derived in the original tvo work, may be directly obtained from our exponential family.

TI Integrates the Mean Parameters
It is well known that the log partition function \psi(\beta) is convex, with its first (partial) derivative equal to the expectation of the sufficient statistics under \pi_\beta (Wainwright and Jordan, 2008)

\eta_\beta := \nabla_\beta \psi(\beta) = \mathbb{E}_{\pi_\beta}\big[ T(x, z) \big] = \mathbb{E}_{\pi_\beta}\Big[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \Big].    (3.16)

This quantity is known as the mean parameter \eta_\beta, which provides a dual coordinate system for indexing intermediate distributions (Wainwright and Jordan (2008) Sec. 3.5.2). Comparing with Eq. (3.2) and Eq. (3.6), we observe that the ability to trade derivatives of the log partition function for expectations in ti arises from this property of exponential families. We may then interpret the tvo as integrating over the mean parameters \eta_\beta = \nabla_\beta \log Z_\beta(x) of our path exponential family, which can be seen by rewriting Eq. (3.3)

\psi(1) - \psi(0) = \int_0^1 \eta_\beta \, d\beta = \int_0^1 \mathbb{E}_{\pi_\beta}\Big[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \Big] d\beta.    (3.17)

TVO Likelihood Bounds
The convexity of the log partition function arises from the fact that entries in its matrix of second partial derivatives with respect to the natural parameters correspond to the (co)variance of the sufficient statistics (Wainwright and Jordan, 2008). In our 1-d case, this corresponds to the variance of the log importance weights

\nabla^2_\beta \psi(\beta) = \mathrm{Var}_{\pi_\beta}\big[ T(x, z) \big] = \mathrm{Var}_{\pi_\beta}\Big[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \Big].    (3.18)

We can see that the tvo integrand \nabla_\beta \psi(\beta) is increasing from the non-negativity of \nabla^2_\beta \psi(\beta) \geq 0 for all \beta, which ensures that the left and right Riemann sums will yield valid lower and upper bounds on the marginal log likelihood.

ELBO on the Graph of \eta_\beta
Inspecting Fig. 3.1, we see that the gap in the tvo bounds corresponds to the amount by which a Riemann approximation under- or over-estimates the area under the curve (auc). We can solidify this intuition for the case of the elbo, a single-term approximation of \log p_\theta(x) using \beta = 0 for the entire interval \beta_1 - \beta_0 = 1 - 0

gap = \underbrace{\int_0^1 \nabla_\beta \psi(\beta)\, d\beta}_{\text{auc}} - \underbrace{(1 - 0)}_{\text{width}}\, \underbrace{\mathbb{E}_{\pi_0}\Big[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \Big]}_{\text{height}}
    = \log p_\theta(x) - elbo(\theta, \phi, x) = D_{KL}\big[\, q_\phi(z|x) \,\|\, p_\theta(z|x) \,\big].    (3.19)

In the next section, we generalize this reasoning to more refined partitions, showing that the gap in arbitrary tvo bounds corresponds to a sum of KL divergences between adjacent \pi_{\beta_t} along a given path \{\beta_t\}_{t=0}^{T}.

3.4 TVO log p(x) Bound Gaps via Bregman Divergences

In previous work, it was shown only that tvo_L(\theta, \phi, x) minimizes a quantity that is non-negative and vanishes at q_\phi(z|x) = p_\theta(z|x) (Masrani et al., 2019). Using the Bregman divergences associated with our path exponential family, we can now provide a unified characterization of the gaps in tvo bounds.

Bregman Divergence
We begin with a brief review of the Bregman divergence, which can be visualized on the graph of the tvo integrand in Fig. 3.2 or the log partition function in Fig. 3.3. A Bregman divergence D_\psi is defined with respect to a convex function \psi (Banerjee et al., 2005b) which, in our case, takes distributions indexed by natural parameters \beta and \beta' as its arguments

D_\psi[\beta : \beta'] = \psi(\beta) - \underbrace{\big( \psi(\beta') + (\beta - \beta')\, \nabla_\beta \psi(\beta') \big)}_{\text{first-order Taylor approx.}}.    (3.20)

Geometrically, the Bregman divergence corresponds to the gap in a first-order Taylor approximation of \psi(\beta) around the second argument \beta', as depicted in Fig. 3.3.
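Identities (3.16) and (3.18) also suggest a simple way to read off the tvo integrand and its slope from the same snis weights used in (3.12): the mean parameter is the snis-weighted mean of the log importance weights, and the curvature is their snis-weighted variance. The helper below is an illustrative sketch of this observation (log_w is again an assumed array of log importance weights for samples from q_\phi(z|x)), not a routine taken from the tvo codebase.

import numpy as np

def mean_and_curvature(log_w, beta):
    # SNIS weights proportional to w_i^beta, as in Eq. (3.12)
    logits = beta * log_w
    snis = np.exp(logits - logits.max())
    snis = snis / snis.sum()
    eta = np.sum(snis * log_w)                        # estimate of grad_beta psi(beta), Eq. (3.16)
    curvature = np.sum(snis * (log_w - eta) ** 2)     # estimate of Var_{pi_beta}[log w], Eq. (3.18)
    return eta, curvature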
Note that this difference is guaranteed to be non-negative, since we know that the tangent will everywhere underestimate a convex function (Boyd and Vandenberghe, 2004). The Bregman divergence D_\psi for the exponential family in (3.13) is also equivalent to the KL divergence, with the order of the arguments reversed (also see App. A.0.1). Applying (3.16) and adding and subtracting a base measure term,

D_\psi[\beta : \beta'] = \psi(\beta) - \psi(\beta') - (\beta - \beta')\, \nabla_\beta \psi(\beta')    (3.21)
= \psi(\beta) - \beta\, \mathbb{E}_{\pi_{\beta'}}[T] - \mathbb{E}_{\pi_{\beta'}}[\log \pi_0]    (3.22)
  \;\; - \psi(\beta') + \beta'\, \mathbb{E}_{\pi_{\beta'}}[T] + \mathbb{E}_{\pi_{\beta'}}[\log \pi_0]
= \mathbb{E}_{\pi_{\beta'}}[\log \pi_{\beta'} - \log \pi_\beta],    (3.23)

where in the third line, we use the fact that \mathbb{E}_{\pi}[\log \pi_\beta] = \mathbb{E}_{\pi}[\log \pi_0(z|x) + \beta\, T(x, z)] - \psi(x; \beta) from (3.13). We then obtain our desired result, with

D_\psi[\beta : \beta'] = D_{KL}[\, \pi_{\beta'} \,\|\, \pi_\beta \,].    (3.24)

KL Divergence on the Graph of \eta_\beta
We can also visualize the Bregman divergence on the graph of the integrand \eta_\beta = \nabla_\beta \psi(\beta) in Fig. 3.2, which leads to a natural expression for the gaps in tvo upper and lower bounds. To begin, we consider a single subinterval [\beta_{t-1}, \beta_t] and follow the same reasoning as for the elbo in Sec. 3.3. In particular, the area under the integrand in this region is auc = \int_{\beta_{t-1}}^{\beta_t} \nabla_\beta \psi(\beta)\, d\beta = \psi(\beta_t) - \psi(\beta_{t-1}), with the left-Riemann approximation corresponding to (\beta_t - \beta_{t-1})\, \nabla_\beta \psi(\beta_{t-1}).
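The equivalence (3.24) between the Bregman divergence (3.20) and the reverse KL divergence is easy to check numerically. The sketch below does so on the same illustrative one-dimensional quadrature setup as the earlier ti check; the densities, grid, and the chosen \beta, \beta' are assumptions for illustration only.

import numpy as np

z = np.linspace(-10.0, 10.0, 4001)
dz = z[1] - z[0]
log_pi0 = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)          # base measure pi_0 = q_phi(z|x) = N(0, 1)
T_stat = -0.5 * (z - 2.0) ** 2 / 4.0 - log_pi0           # T(x, z) = log p_theta(x, z) - log q_phi(z|x)

def psi(beta):
    # log partition function of the likelihood ratio exponential family, Eq. (3.14)
    log_unnorm = log_pi0 + beta * T_stat
    m = log_unnorm.max()
    return m + np.log(np.sum(np.exp(log_unnorm - m)) * dz)

def density(beta):
    return np.exp(log_pi0 + beta * T_stat - psi(beta))   # pi_beta(z|x) on the grid

beta, beta_prime = 0.8, 0.3
eta_prime = np.sum(density(beta_prime) * T_stat) * dz    # eta_{beta'} = E_{pi_beta'}[T], Eq. (3.16)
bregman = psi(beta) - psi(beta_prime) - (beta - beta_prime) * eta_prime        # D_psi[beta : beta'], Eq. (3.20)
log_ratio = (beta_prime - beta) * T_stat + psi(beta) - psi(beta_prime)         # log pi_beta' - log pi_beta
reverse_kl = np.sum(density(beta_prime) * log_ratio) * dz                      # D_KL[pi_beta' || pi_beta]
print(bregman, reverse_kl)                               # the two values should match up to quadrature error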
sha1_base64="xphD6EiyjCYRWS8CbnGhtEVK3cc=">AAACdHicbVBNb9NAEN24QEP4SuAIhxVRpSKhyKatyrEqFw5IFKlJKyVRNLsZN9usvdbuuE1k+S/wa7i2/4M/wpl1mgNOeNKu3ryZ0cw8kWnlKAx/N4KdR4+f7Daftp49f/HyVbvzeuBMbiX2pdHGXgpwqFWKfVKk8TKzCInQeCHmX6r8xQ1ap0x6TssMxwlcpSpWEshLk/b+iHBBThbng+/l5JsPZ0jwkY+ymfJ/AjQTcbEoP0za3bAXrsC3SbQmXbbG2aTTOBxNjcwTTElqcG4YhRmNC7CkpMayNcodZiDncIVDT1NI0I2L1Ukl3/PKlMfG+pcSX6n/dhSQOLdMhK+sdnSbuUr8X26YU/x5XKg0ywlT+TAozjUnwyt/+FRZlKSXnoC0yu/K5QwsSPIu1qYkuSZlzW1ZV2GOErWuq8KYOYFw9atxkRlbWTK9zh0JsyhbLe9ztOnqNhl86kUHvaMfh92T07XjTfaWvWf7LGLH7IR9ZWeszyT7yX6xO3bf+BO8C7rB3kNp0Fj3vGE1BL2/6AbBjA==</latexit> D ! KL [⇡ k 1 ||⇡ k ]<latexit sha1_base64="FudQyCFuPcGU6lGFK0H5L8yo2SM=">AAAHkHichVVbbxM5FDbXFJbdbeGRF4uqUrfKVk0pahEPlJuWXVgtiyhU24SRx3NmYtXjcTxOm9Sdv8Ov2VeQ+Dd7PGloEqeslcuZ833nPrZjLUVpNza+Xrp85eq1642FGzd/uPXjTz8vLt1+XxZ9w2GPF7Iw+zErQQoFe1ZYCfvaAMtjCR/iw2ce/3AEphSFemeHGjo5y5RIBWcWVdHi7vPIvXpdfXTtJvUfI7KuZcYUx9VBW4vItWOwLHKHv7aqip6e0illVXWixeWN9Y160VBonQnLuwukXm+ipes77aTg/RyU5ZKV5cHmlrZNUBmW2+04ZqzgEqqb7X4JmvFDlsEBiorlUHZcXXJFV1CT0LQw+FWW1tpJC8fyshzmMTJzZrvlLOaV37CVqVA23ek4oXTfguKjSGlfUltQ30CaCAPcyiEKjBuByVLeZYZxi22eSvpdq+N8dt7NFCCx+ao1EWQqu5jFIKuppJweeEclEhNIcdR1va5f8shAUrm3vz2tXOvBdvP+VvPBfWQpOOZFnjOV+FlV/icTykFP4WzZsKJnCiZFptDDjAl4E0BpjNP66dw8CKFUEGTt/6PUVhfHWQsCHQ0q59p+anFKB7MOj44n0OMAHU6gwwA9mUBPAjSdQNNZ9FAiOtpIs1ACHDGNW8Z2cc9UdBVrOMVovwTdUJ7ZQ6buipp3grxBwMtAhR5x585xKWUSUgOW9pXr1XkO9GDKfq55z8dwvTG4UteQsAvpJxNVzg0qesj4PfIu9x/9E/ap920UsXtReV4wrWxiWlnQFebRWGQURTx1YJZgzglmLqHuhq5JzNCz5sxuiRR0eZ4GJsUMaoQsVJAvSKmNyMHTUcbDePQYjENoMLlnaTEHG7vQ4kIPqhB4buK1Ms6MM4lNXB2lG/Q6yZl/A5LIkweBNyGSUtT5FP6OAetQgzvaKz33OeABb+BPjPMXZs5sYdZcm5msdov/7aaXvkcUakxECe+a1uzNEgrvN9dbW+sP/95a3n06unTIArlL7pFV0iLbZJe8JG/IHuHkE/mXfCZfGrcbO43HjScj6uVLZzZ3yNRq/PEfUVTNjA==</latexit>⌘ k 1<latexit sha1_base64="i3wN8NxUWEy4N1svKSK1u7p0GG0=">AAAHYnichVVbb9MwFM64rGPctvEIDxHTpDGVqd1FG+JlGkOAADHQbmItleOcpNYcx3XcrZ3JX+EV/hLv/BCOs3Zr6w4itTk+33fusR1IzjJdqfyeuHHz1u3J0tSd6bv37j94ODM7d5ClbUVhn6Y8VUcByYAzAfuaaQ5HUgFJAg6Hwckrix+egspYKvZ0V0I9IbFgEaNEo6oxM1cDTRqmFhSvk+fVPG/MzFeWK8Xju0K1J8xvTXnFs9uYndyshSltJyA05STLjlfWpC6DiLGAZt0QpRnlkE/X2hlIQk9IDMcoCpJAVjdFEbm/gJrQj1KFP6H9QjtoYUiSZd0kQGZCdDMbxazyElsYCqWjzbphQrY1CHoRKWpzX6e+bYkfMgVU8y4KhCqGyfq0SRShGhs3lPRetW5sdtbNEMCxnaI6EGQou4AEwPOhpIzsWEcZEkOIcHhFvaad0YaCMDdf3mznprq+UV5dK6+vIkvAGU2ThIjQDiu3fzETBlqCKEW6ud9TEM5igR5GTMCaAEp93C9WV+ZOCCGcIEv/j1JYXR9nyQl02smNqdmpBZHfGXV4ejaAnjlodwDtOuj5AHruoNEAGo2iJxzRnYZ5/yEfhUKgiEncM7qJmyb3F7GG7xjtmdMNYZktZMomK3jnyOs4vBiE67Hsj3PJeehSHZa0lcvFcQ5kZ8h+rHnLxjCtPrhQ1BCSa+nnA1WODcpayHjXsC6PXn51+9S6HEVgXueW50wrHphW7HSFWDRgsY8injowSlBXBDWWUHRDFiSi/F5zRrdEBDK7SgOTIgo1jKfCyRc4l4olYOkof8PmFEtnHEyCSixLsjFY34Vk13oQKcNzEy+KfmaUcGzi4kW6Tq/DhNgvIGxYcsfxxliYsSKf1N4aoA1qcEdbpeXuAB7wCj5inE+YOdGpWjI1ouLCLb5rZSv9i8hEn4gS3jXV0ZvFFQ5Wlqtryy8+r81vbV9cOt6U99h76i16VW/D2/Leervevke9jvfD++n9mvxTmi7Nlh5dUG9M9Gx66/5TevIXR4y7bQ==</latexit>r ( )<latexit 
sha1_base64="/IGyrEo1VTNfnNGWSPjh0ox0vIk=">AAAHYnichVVbTxNBFF68UMQb6KM+bCQkhTSEAkaNLwQxatSIhltkazM7e3Y7YXZ2OjuFlnH/iq/6l3z3h3hmaaHtFJ2k7ZnzfefemQklZ7leXf09de36jZvTlZlbs7fv3L13f27+wX6edRSFPZrxTB2GJAfOBOxppjkcSgUkDTkchMevLH5wAipnmdjVPQmNlCSCxYwSjarm3INAkJATP5A5qwYhaLLUnFtYXVktl+8K9b6wsDnjlWunOT/9PIgy2klBaMpJnh+tbUhdA5FgAa2GIUozyqGYDTo5SEKPSQJHKAqSQt4wZRGFv4iayI8zhR+h/VI7bGFImue9NERmSnQrH8es8gJbHAml4+cNw4TsaBD0PFLc4b7OfNsSP2IKqOY9FAhVDJP1aYsoQjU2biTp3XrD2OysmxGAYztFfSjISHYhCYEXI0kZ2bWOciRGEOPwynpNJ6dNBVFhvrzZKkz96bPa+kbt6TqyBJzSLE2JiAyOqbBfCRMG2oIoRXqF31cQzhKBHsZMwJoASgPcL3eX5k4IIZwgy/+PUlpdHWfZCXTSLYwJ7NTC2O+OOzw5HUJPHbQ3hPYc9GwIPXPQeAiNx9Fjjuh207z/UIxDEVDEZNMEuoXHpfCrWMN3jLbkdENYZhuZssVK3hnyug4vAeF6rPmTXHIeuVSHJW3lsjrJgeyO2E80b9sYpj0AF8saInIl/WyoyolBWRsZ75rW5eHLr26f2hejCM3rwvKcaSVD00qcrhCLhizxUcRbB8YJ6pKgJhLKbsiSRJTfb874kYhB5pdpYFJEoYbxTDj5AudSsRQsHeVv2Jxy64yDSVCpZUk2ARu4kOxKDyJjeG/iQzHIjBKOTayep+v0OkqJ/QdETUvuOt4Yi3JW5pPZVwO0QQ2eaKu03G3AC17BR4zzCTMnOlPLJiAqKd3ib1Cz0r+ITAyIKOFbUx9/WVxhf22lvrHy4vPGwubW+aPjzXiPvCde1at7z7xN76234+151Ot6P7yf3q/pP5XZynzl4Tn12lTfpr8frMrjvxmwutM=</latexit>Figure 3.2: The Bregman divergence D KL [ t1 jj t ] can be visualized as the area under the curve minus the left-Riemann sum via Eq. (3.25). This term contributes to the gap in the likelihood bound tvo L . We also derive an integral form for the KL divergence in Sec. 3.4.1. Note that both the integrand and () are negative in practice. k 1<latexit sha1_base64="y1hmrlc22j/B0ZHMi84Yh17NbFY=">AAACVXicbZBNS8NAEIY38avWz+rRS7AIXiyJFPQoevGoYFVoS5ndTnXNJht2J2oJ+Rle9WeJP0ZwW3sw6gsLL8/MMLMvz5S0FIYfnj83v7C4VFuur6yurW9sNraurc6NwI7QSptbDhaVTLFDkhTeZgYh4QpveHw2qd88orFSp1c0zrCfwF0qR1IAOdTtcSQYFPFBVA42m2ErnCr4a6KZabKZLgYNr90bapEnmJJQYG03CjPqF2BICoVlvZdbzEDEcIddZ1NI0PaL6c1lsOfIMBhp415KwZT+nCggsXaccNeZAN3b37UJ/K/WzWl03C9kmuWEqfheNMpVQDqYBBAMpUFBauwMCCPdrYG4BwOCXEyVLUmuSBr9VFYpxChQqSrlWscE3FZ/jc+ZNpNIhg+5Ja6fy3rd5Rz9TvWvuT5sRe1W+7LdPDmdJV5jO2yX7bOIHbETds4uWIcJptkLe2Vv3rv36c/7i9+tvjeb2WYV+Rtf3zS2hw==</latexit> k<latexit sha1_base64="gcrTaGRC2c7QE6GyZV1NF+ODePk=">AAACUXicbVBNSyNBEK0Zv2L8Xo9eBoPgKcxIQI+iF49Z2BghCaG6U9F2eqaH7ho1hPwIr7s/a0/7U7xtJ+bgqA8KHu9VUVVPFFo5juN/Qbiyura+Udusb23v7O7tH/y4daa0kjrSaGPvBDrSKqcOK9Z0V1jCTGjqivR67nefyDpl8l88KWiQ4X2uxkoie6nbF8Q4TIf7jbgZLxB9JcmSNGCJ9vAgaPVHRpYZ5Sw1OtdL4oIHU7SspKZZvV86KlCmeE89T3PMyA2mi3tn0YlXRtHYWF85Rwv148QUM+cmmfCdGfKD++zNxe+8Xsnji8FU5UXJlMv3ReNSR2yi+fPRSFmSrCeeoLTK3xrJB7Qo2UdU2ZKVmpU1z7OqiilJ0rqqCmNSRuGqX9NLYew8ktFj6ViYl1m97nNOPqf6ldyeNZNWs/Wz1bi8WiZegyM4hlNI4Bwu4Qba0AEJKbzCb/gT/A3eQgjD99YwWM4cQgXh1n+yAbUJ</latexit>⌘ k 1 <latexit sha1_base64="3R28E0kocs5n8EjJ7WsJ03+OryI=">AAACXnicbVBNS8NAEN3Gr1q/ar0IXoJF8GJJVNCj6MWjglWhLWV2O9U1m2zYnWhLyF/xqn/Jmz/Fbe3BqA+Wfbw3w8w8nippKQg+Kt7c/MLiUnW5trK6tr5R32zcWp0ZgW2hlTb3HCwqmWCbJCm8Tw1CzBXe8ehi4t89o7FSJzc0TrEXw0Mih1IAOalfb3SRoJ93+fSLDsKi6NebQSuYwv9Lwhlpshmu+puV4+5AiyzGhIQCazthkFIvB0NSKCxq3cxiCiKCB+w4mkCMtpdPly/8PacM/KE27iXkT9WfHTnE1o5j7ipjoEf725uI/3mdjIanvVwmaUaYiO9Bw0z5pP1JEv5AGhSkxo6AMNLt6otHMCDI5VWaEmeKpNEvRVmFCAUqVVa51hEBt+WrcZRqM4lk8JRZ4npU1Gou5/B3qn/J7WErPGqF18fNs/NZ4lW2w3bZPgvZCTtjl+yKtZlgI/bK3th75dNb9Na8je9SrzLr2WIleNtf+z+46A==</latexit> ⌘ k <latexit 
sha1_base64="5G5Rvhv2LdburDjDlvVXtfI6uco=">AAACXHicbVBNS8NAEN3G+tX6URW8eAkWwVNJVNCj6MWjgq1CW8rsdqprNtmwO1FLyD/xqv/Ji7/Fbe3BWB8s+3hvhpl5PFXSUhB8VryF6uLS8spqrb62vrHZ2NruWJ0ZgW2hlTb3HCwqmWCbJCm8Tw1CzBXe8ehy4t89o7FSJ7c0TrEfw0MiR1IAOWnQaPSQYJD3+PSLimLQaAatYAp/noQz0mQzXA+2Kie9oRZZjAkJBdZ2wyClfg6GpFBY1HqZxRREBA/YdTSBGG0/n65e+AdOGfojbdxLyJ+qvztyiK0dx9xVxkCP9q83Ef/zuhmNzvq5TNKMMBE/g0aZ8kn7kxz8oTQoSI0dAWGk29UXj2BAkEurNCXOFEmjX4qyChEKVKqscq0jAm7LV+Nrqs0kkuFTZonr16JWczmHf1OdJ52jVnjcCm9OmucXs8RX2B7bZ4csZKfsnF2xa9Zmgj2zN/bOPipfXtWre+s/pV5l1rPDSvB2vwHxEbh2</latexit> D KL [⇡ k ||⇡ k 1 ] <latexit sha1_base64="Gsmp7VT8MMoWLhgAFPIBdBuDWAk=">AAACi3icbZDfa9RAEMf3Un/U0+q1PvqyeAg+6JG0BUUEiz9A0IcKXlu4xDDZm7RrNtmwO2l7pPvn+Nf4qg/+N26u92Bah1348p0ZZuaT1UpaCsM/g2Dtxs1bt9fvDO/e27j/YLS5dWB1YwROhVbaHGVgUckKpyRJ4VFtEMpM4WFWvOvyh6dorNTVV1rUmJRwXMlcCiBvpaM377+18TPePYU5gTH6zKXtp89uFtcybeMMCdK2cI5fXPCe9TxyLklH43ASLoNfF9FKjNkq9tPNwW4816IpsSKhwNpZFNaUtGBICoVuGDcWaxAFHOPMywpKtEm7vNTxJ96Z81wb/yviS/ffjhZKaxdl5itLoBN7NdeZ/8vNGspfJq2s6oawEpeD8kZx0rzDxufSoCC18AKEkX5XLk7AgCAPtzelbBTJjmLfhQIFKtV3M60Lgsz2r8bzWpsOyfx7YynT52449Jyjq1Svi4PtSbQzib7sjvferoivs0fsMXvKIvaC7bGPbJ9NmWA/2E/2i/0ONoKd4FXw+rI0GKx6HrJeBB/+AiSIyoo=</latexit> D ! KL [⇡ k 1 ||⇡ k ] <latexit sha1_base64="qorabaxKW42wcji08leOxe+V48s=">AAACjHicbZDfaxNBEMc354/WqDXVR18Wg+CDhrsasCBCUBFBHyqYtpA7j9nNJFlv7/bYndOG6/07/jW+KvjfuJfmwWsdduHLd2aYmY8otXIUhn96wbXrN27u7N7q375zd+/eYP/+sTOVlTiVRht7KsChVgVOSZHG09Ii5ELjicjetPmTb2idMsVnWpeY5LAs1EJJIG+lg8nbL3X8lLfPquWKwFrzvUnrDx+bWVyqtI4FEqR19ixqGn5+zjtm0yTpYBiOwk3wqyLaiiHbxlG63xvHcyOrHAuSGpybRWFJSQ2WlNTY9OPKYQkygyXOvCwgR5fUm1Mb/tg7c74w1v+C+Mb9t6OG3Ll1LnxlDrRyl3Ot+b/crKLFYVKroqwIC3kxaFFpToa33PhcWZSk116AtMrvyuUKLEjydDtT8kqTail2XchQotZdVxiTEQjXvRrPSmNbJPOvlSNhzpp+33OOLlO9Ko4PRtHzUfRpPJy83hLfZQ/ZI/aERewFm7D37IhNmWQ/2E/2i/0O9oJx8DJ4dVEa9LY9D1gngnd/ATfNywc=</latexit> 1 <latexit sha1_base64="f/ucaHSqP+D9Bp31BanY7u7K7A0=">AAACS3icbVDLTgJBEJxFUcQX6NHLRmLiiewqRo9ELx4hkUcCGzI7NDgyu7OZ6VXIhi/wqp/lB/gd3owHh8fBBSvppFLVne4uPxJco+N8WpmNzezWdm4nv7u3f3BYKB41tYwVgwaTQqq2TzUIHkIDOQpoRwpo4Ato+aO7md96BqW5DB9wEoEX0GHIB5xRNFLd7RVKTtmZw14n7pKUyBK1XtGqdPuSxQGEyATVuuM6EXoJVciZgGm+G2uIKBvRIXQMDWkA2kvml07tM6P07YFUpkK05+rfiYQGWk8C33QGFB/1qjcT//M6MQ5uvISHUYwQssWiQSxslPbsbbvPFTAUE0MoU9zcarNHqihDE05qSxAL5Eq+TNMqHQEDIdKqL+UIqa/TX8M4kmoWSf8p1ujL8TSfNzm7q6muk+ZF2b0sX9UrpertMvEcOSGn5Jy45JpUyT2pkQZhBMgreSPv1of1ZX1bP4vWjLWcOSYpZLK/K6y0Og==</latexit> k <latexit sha1_base64="7QiWRbKqlTf8AJ3kox2y2xrtvWY=">AAACknicbVFLbhNBEG0Pv2B+CbADoREWEitrBhYgVoZsWEQikWI7km1Z1T3lpDP9GXXXJLZGs+QAbOEiuQpn4BL02AExCSW1+tV7VaofL5T0lCQ/O9GNm7du39m62713/8HDR9s7j0felk7gUFhl3REHj0oaHJIkhUeFQ9Bc4Zjnu40+PkPnpTWHtCpwpuHYyIUUQIEaTzkSzPP5di/pJ2uLr4P0EvQGzy8Ofn19cbE/3+mU08yKUqMhocD7SZoUNKvAkRQK6+609FiAyOEYJwEa0Ohn1brfOn4VmCxeWBeeoXjN/ptRgfZ+pXmI1EAn/qrWkP/TJiUt3s8qaYqS0IhNoUWpYrJxM3ycSYeC1CoAEE6GXmNxAg4EhRW1quhSkXT2vG6zkKNApdostzYn4L49NS4L65qVZKelJ26XQTZ4LqzWYLJqSmdW8Tr8uCQvqsPRl3q+8ST9Zf/4e3Vj3W44VHr1LNfB6E0/fdtPD9Le4BPb2BZ7xl6y1yxl79iAfWb7bMgEy9k39p39iJ5GH6KP0e4mNOpc5jxhLYv2fgNQH9Od</latexit> k 1 <latexit 
sha1_base64="8Nw9YOLXY+ONGgmvT8htIKUpWnQ=">AAAClnicbVFNbxMxEHWWj7bhq4ULEpeFCKkciHbbQzmhCITggESQmrRSNops76Q164+VPW4TrfY3cOIK/6X/gh/CHW9SENvyJMtv3pvRjMeslMJhkvzsRDdu3rq9sbnVvXP33v0H2zsPx854y2HEjTT2mFEHUmgYoUAJx6UFqpiEI1a8bfyjM7BOGH2IyxKmip5oMRecYpAmGQOks6p4mdaz7V7ST1aIr5P0kvQGT3d/XXzNXgxnOx2f5YZ7BRq5pM5N0qTEaUUtCi6h7mbeQUl5QU9gEqimCty0Ws1cx8+DksdzY8PRGK/UfysqqpxbKhYyFcVTd9VrxP95E4/zV9NK6NIjaL5uNPcyRhM3C4hzYYGjXAZCuRVh1pifUks5hjW1uigvUVhzXrdVWgAHKdsqM6ZAylz71bAojW1Wkn/xDplZBFvDOTdKUZ1XGZ4ZyepwwwIdrw7Hn+rZOhL4V/0Tf6wbdLvho9Kr33KdjPf66X4//Zz2Bm/IGpvkCXlGdklKDsiAfCBDMiKcGPKNfCc/osfR6+hd9H6dGnUuax6RFqLhbylC1MQ=</latexit> ( )<latexit sha1_base64="onIJ3jXM+0D/nG5/DqDVVA1lDFI=">AAAHT3ichVVbb9MwFPa4dBduGzzyEjFNKlM1rbBpIF6mAQIEiIsYTKylcpyT1JrjuI67tTP5GbzCb+KRX8Ib4jjrWBsXiNTm5Hyfz/f5HCUOleC5WV//MXPu/IWLtdm5+YVLl69cvba4dP19nvU1g12WiUzvhTQHwSXsGm4E7CkNNA0FfAgPHjr8wyHonGfynRkqaKc0kTzmjBpM7bdUzuutEAy93VlcXl9bL6/AD5qjYHl7jpTX685Sba0VZayfgjRM0Dzfv7OhTANkgr67bUu14UxAsdDq56AoO6AJ7GMoaQp525bei2AFM1EQZxp/0gRldnyFpWmeD9MQmSk13byKueQfbGVCysT32pZL1Tcg2YlS3BeByQLXiSDiGpgRQwwo0xzNBqxLNWUG+zVh+l2zbZ07V2YCENhF2RwTmXAX0hBEMWHKqoErlCMxghhnVu7X9nPW0RAV9u2TncI2N7cadzcam3eRJeGIZWlKZWRxTIX7S7i00JNUazosglGCCp5IrFBZAm4JYHSKB+XT2XJPQkpPZPX/KuWqv+usekKHg8LalptaGAeDasHDozH0yEOHY+jQQ4/H0GMPjcfQuIoeCEQfdezzF0UVioAhpjq2Zbr4uhRBHffwGdVue92QjtlDpupy5B1/HnicBGSl2qAR+KWEiHxRj6XcjlV9mhc1mFhf9xf3nILtFaPKK6XziDryNK3e8djepkryHjKedVzJvQcf/e70/gwgtI8Lx/NmlIzNKPF6Qh0a8iTAEL81UCXoM4KeSih7oUoS1cGoNdUXIQaVn9lAU1RjhotMen5BCKV5Co6O8SdsTvnoDYMr0KljKT4FOy2h+F8ryIzj1xJPhVNnjApsYv3ErtfrKKVu/lHHkQdeNc6jnJd+MndEgLGYwffYJR33EeBnXcNL1HmFzqnJ9KptUZ2UZfHearjoX0QuT4kY4QnTrJ4nfvD+zlpzY+3+m43l7Z2To4bMkZvkFqmTJtki2+QpeU12CSMZ+UK+km+177WftV+zI+q5mVFwg0xcs/O/AdEatYo=</latexit>D ! KL [⇡ k 1 ||⇡ k ] <latexit sha1_base64="qorabaxKW42wcji08leOxe+V48s=">AAACjHicbZDfaxNBEMc354/WqDXVR18Wg+CDhrsasCBCUBFBHyqYtpA7j9nNJFlv7/bYndOG6/07/jW+KvjfuJfmwWsdduHLd2aYmY8otXIUhn96wbXrN27u7N7q375zd+/eYP/+sTOVlTiVRht7KsChVgVOSZHG09Ii5ELjicjetPmTb2idMsVnWpeY5LAs1EJJIG+lg8nbL3X8lLfPquWKwFrzvUnrDx+bWVyqtI4FEqR19ixqGn5+zjtm0yTpYBiOwk3wqyLaiiHbxlG63xvHcyOrHAuSGpybRWFJSQ2WlNTY9OPKYQkygyXOvCwgR5fUm1Mb/tg7c74w1v+C+Mb9t6OG3Ll1LnxlDrRyl3Ot+b/crKLFYVKroqwIC3kxaFFpToa33PhcWZSk116AtMrvyuUKLEjydDtT8kqTail2XchQotZdVxiTEQjXvRrPSmNbJPOvlSNhzpp+33OOLlO9Ko4PRtHzUfRpPJy83hLfZQ/ZI/aERewFm7D37IhNmWQ/2E/2i/0O9oJx8DJ4dVEa9LY9D1gngnd/ATfNywc=</latexit> logp ✓ (x) <latexit sha1_base64="EuQLouLJNOc/j7x87FGJ1WRCcbg=">AAACZXicbVBNb9NAEN0YWkr6lQJCQhxYNUIql8imreAYwYVjkEhbKY6i2c042WbttXbHbSLL4tdwhd/DL+BvsE5zwC1PGunpvRnNzBO5Vo7C8HcrePR4a/vJztP27t7+wWHn6NmFM4WVOJRGG3slwKFWGQ5Jkcar3CKkQuOlWHyu/csbtE6Z7ButchynMMtUoiSQlyadV7E2M55PYpojwUmcAs1FUi6rd5NON+yFa/CHJNqQLttgMDlqncVTI4sUM5IanBtFYU7jEiwpqbFqx4XDHOQCZjjyNIMU3bhc/1Dxt16Z8sRYXxnxtfrvRAmpc6tU+M76Rnffq8X/eaOCko/jUmV5QZjJu0VJoTkZXgfCp8qiJL3yBKRV/lYu52BBko+tsSUtNClrbqumCguUqHVTFcYsCIRrfo3L3Ng6kul14UiYZdVu+5yj+6k+JBfve9Fp7/zrWbf/aZP4DnvNjtkJi9gH1mdf2IANmWTf2Q/2k/1q/Qn2gxfBy7vWoLWZec4aCN78BQtnu7s=</latexit> 0 <latexit sha1_base64="+bcvVsgReT4e+AN7PozBeVeZPe8=">AAACS3icbVDLTgJBEJxFUcQX6NHLRmLiiewqRo9ELx4hkUcCGzI7NDgyu7OZ6VXIhi/wqp/lB/gd3owHh8fBBSvppFLVne4uPxJco+N8WpmNzezWdm4nv7u3f3BYKB41tYwVgwaTQqq2TzUIHkIDOQpoRwpo4Ato+aO7md96BqW5DB9wEoEX0GHIB5xRNFLd6RVKTtmZw14n7pKUyBK1XtGqdPuSxQGEyATVuuM6EXoJVciZgGm+G2uIKBvRIXQMDWkA2kvml07tM6P07YFUpkK05+rfiYQGWk8C33QGFB/1qjcT//M6MQ5uvISHUYwQssWiQSxslPbsbbvPFTAUE0MoU9zcarNHqihDE05qSxAL5Eq+TNMqHQEDIdKqL+UIqa/TX8M4kmoWSf8p1ujL8TSfNzm7q6muk+ZF2b0sX9UrpertMvEcOSGn5Jy45JpUyT2pkQZhBMgreSPv1of1ZX1bP4vWjLWcOSYpZLK/KcW0OQ==</latexit> Figure 3.3: tvo L 
Figure 3.3: tvo_L(\theta, \phi, x) may be viewed as constructing successive first-order Taylor approximations to intermediate \psi(\beta_t), with the accumulated error corresponding to the gap in the bound. The upper bound takes KL divergences in the reverse direction, with the first argument decreasing along the path.

Taking the difference between these expressions, we obtain the definition of the Bregman divergence in (3.20)

gap = \underbrace{\psi(\beta_t) - \psi(\beta_{t-1})}_{\text{auc}} - \underbrace{(\beta_t - \beta_{t-1})\, \nabla_\beta \psi(\beta_{t-1})}_{\text{term in } tvo_L(\theta, \phi, x)} = D_\psi[\beta_t : \beta_{t-1}] = \overrightarrow{D}_{KL}[\, \pi_{\beta_{t-1}} \,\|\, \pi_{\beta_t} \,],    (3.25)

where arrows indicate whether the first argument of the KL divergence is increasing or decreasing along the path. For the gap in the right-Riemann upper bound, we follow similar derivations with the order of the arguments reversed in Sec. 3.4. This results in a gap of \overleftarrow{D}_{KL}[\, \pi_{\beta_t} \,\|\, \pi_{\beta_{t-1}} \,], with expectations \eta_{\beta_t} = \nabla_\beta \psi(\beta_t) taken under \pi_{\beta_t}.

TVO Lower Bound Gap
Extending the above reasoning to the entire unit interval, we can consider any sorted partition P = \{\beta_t\}_{t=0}^{T} with \beta_0 = 0 and \beta_T = 1. Summing (3.25) across intervals, note that intermediate \psi(\beta_t) terms cancel in telescoping fashion

\sum_{t=1}^{T} D_\psi[\beta_t : \beta_{t-1}] = \psi(1) - \psi(0) - \sum_{t=1}^{T} (\beta_t - \beta_{t-1})\, \nabla_\beta \psi(\beta_{t-1}),    (3.26)

where the last term matches the tvo_L objective in Eq. (3.9).
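Concretely, (3.26) says that refining the schedule replaces one large KL gap with a sum of smaller ones, so tvo_L tightens toward \log p_\theta(x) as T grows. The sketch below illustrates this on an assumed linear-Gaussian toy model where \log p_\theta(x) is available in closed form and the prior serves as the proposal; the model, sample size, and schedules are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
x = 1.5                                                   # observed datapoint
z = rng.normal(size=200_000)                              # z_i ~ q(z|x) := p(z) = N(0, 1)
log_w = -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)     # log w_i = log p(x|z_i), since q = prior
log_px = -0.25 * x**2 - 0.5 * np.log(4 * np.pi)           # exact log p(x) = log N(x; 0, 2)

def tvo_lower(log_w, T):
    # left Riemann sum (3.9) with a uniform schedule and SNIS expectations (3.12)
    betas = np.linspace(0.0, 1.0, T + 1)
    def snis_mean(beta):
        logits = beta * log_w
        s = np.exp(logits - logits.max())
        return np.sum(s / s.sum() * log_w)
    heights = np.array([snis_mean(b) for b in betas[:-1]])
    return np.sum(np.diff(betas) * heights)

for T in (1, 2, 5, 50):
    print(T, log_px - tvo_lower(log_w, T))                # the gap shrinks as the partition is refined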
Assuming perfect transitions and a geometric annealing path with linearly-spaced { t } T t=1 , the sum of the gaps in the AIS sandwich bounds on MI,I AIS U (⇡ 0 ,T) I AIS L (⇡ 0 ,T), reduces linearly with increasingT. SeeApp.D.1foraproof. InourexperimentsinSec.5,wewillfindthatthislinearbiasreductionin T iscrucialforachievingtight MI estimationwhenbothp(z)andp(x|z)areknown. However,we canfurthertightenthese AIS boundsusingmultipleannealingchains(K> 1),andwepresenttwo practicalextendedstatespaceapproachesinthefollowingsections. 3.2 INDEPENDENT MULTI-SAMPLE AIS BOUNDS Toderive Independent Multi-Sample AIS (IM-AIS),weconstructanextendedstatespaceproposalby running K independent AIS forward chainsz (k) 0:T ⇠ q AIS PROP in parallel. Similarly to the IWAE upper bound(Eq.(5)),theextendedstatespacetargetinvolvesselectingaindexsuniformlyatrandom,and runningabackward AIS chainz (s) 0:T ⇠ p AIS TGT startingfromatrueposteriorsamplez T ⇠ p(z|x). The remainingK 1samplesareobtainedbyrunningforward AIS chains,asvisualizedinFig.2 5 PublishedasaconferencepaperatICLR2022 ✗ Figure2: Extendedstate-spaceprobabilisticinterpretationsofmulti-sample AIS bounds. Forward chainsarecoloredinblue,andbackwardchainsarecoloredinred. Notethat ELBOsand EUBOsare obtainedbytakingtheexpectationofthelogunnormalizedimportanceweights logp TGT (·)/q PROP (·) undereithertheproposalortargetdistribution,andcanthenbetranslatedto MI bounds. 3MULTI-SAMPLE AIS BOUNDS FOR ESTIMATING MUTUAL INFORMATION Intheprevioussections,wederivedprobabilisticinterpretationsofextendedstatespaceimportance samplingusingmultiplesamplesK,asin IWAE. Inthissection,wefirstreview AIS,whichextends thestatespaceusingT intermediatedensitiesWethenshowthattheseapproachesarecomplementary, and derive two practical Multi-Sample AIS methods, which provide tighter bounds by combining insightsfrom IWAE and AIS. 3.1 ANNEALED IMPORTANCE SAMPLING BACKGROUND AIS (Neal, 2001) constructs a sequence of intermediate distributions {⇡ t (z)} T t=0 , which bridge between a normalized initial distribution ⇡ 0 (z|x) and target distribution ⇡ T (z|x)= p(z|x) with unnormalized density ⇡ T (x,z)= p(x,z) and normalizing constant Z T (x)= p(x). A common choiceforintermediatedistributionsisthegeometricmixturepathparameterizedby { t } T t=0 ⇡ t (z|x)= ⇡ 0 (z|x) 1 t ⇡ T (x,z) t Z t (x) , where Z t (x)= Z ⇡ 0 (z|x) 1 t ⇡ T (x,z) t dz. (11) Intheprobabilisticinterpretationof AIS,weconsideranextendedstatespaceproposalq AIS PROP (z 0:T |x), obtainedbysamplingfromtheinitial⇡ 0 (z|x)andconstructingtransitionsT f (z t+1 |z t ,x)whichleave ⇡ t (z|x) invariant. The target distribution p AIS TGT (z 0:T |x) is given by running the reverse transitions T r (z t |z t+1 ,x)startingfromatargetorposteriorsample⇡ T (z|x),asshowninFig.2, q AIS PROP (z 0:T |x) :=⇡ 0 (z 0 |x) T 1 Y t=0 T f (z t+1 |z t ),p AIS TGT (x,z 0:T ) :=⇡ T (x,z T ) T 1 Y t=0 T r (z t |z t+1 ,x). (12) Takingexpectationsofthelogimportanceweightsundertheproposalandtargetagainyieldsalower and upper bound on the log partition function logp(x) (App. E). These single-chain lower and upperboundstranslatetoupperandlowerboundson MI,I AIS U (⇡ 0 ,T)andI AIS L (⇡ 0 ,T),whichwere suggestedfor MI estimationintheblogpostofSobolev(2019). Tocharacterizethebiasreductionfor AIS withincreasingT,weprovethefollowingproposition. Proposition 3.1(ComplexityinT). 
Assuming perfect transitions and a geometric annealing path with linearly-spaced { t } T t=1 , the sum of the gaps in the AIS sandwich bounds on MI,I AIS U (⇡ 0 ,T) I AIS L (⇡ 0 ,T), reduces linearly with increasingT. SeeApp.D.1foraproof. InourexperimentsinSec.5,wewillfindthatthislinearbiasreductionin T iscrucialforachievingtight MI estimationwhenbothp(z)andp(x|z)areknown. However,we canfurthertightenthese AIS boundsusingmultipleannealingchains(K> 1),andwepresenttwo practicalextendedstatespaceapproachesinthefollowingsections. 3.2 INDEPENDENT MULTI-SAMPLE AIS BOUNDS Toderive Independent Multi-Sample AIS (IM-AIS),weconstructanextendedstatespaceproposalby running K independent AIS forward chainsz (k) 0:T ⇠ q AIS PROP in parallel. Similarly to the IWAE upper bound(Eq.(5)),theextendedstatespacetargetinvolvesselectingaindexsuniformlyatrandom,and runningabackward AIS chainz (s) 0:T ⇠ p AIS TGT startingfromatrueposteriorsamplez T ⇠ p(z|x). The remainingK 1samplesareobtainedbyrunningforward AIS chains,asvisualizedinFig.2 5 PublishedasaconferencepaperatICLR2022 ✗ Figure2: Extendedstate-spaceprobabilisticinterpretationsofmulti-sample AIS bounds. Forward chainsarecoloredinblue,andbackwardchainsarecoloredinred. Notethat ELBOsand EUBOsare obtainedbytakingtheexpectationofthelogunnormalizedimportanceweights logp TGT (·)/q PROP (·) undereithertheproposalortargetdistribution,andcanthenbetranslatedto MI bounds. 3MULTI-SAMPLE AIS BOUNDS FOR ESTIMATING MUTUAL INFORMATION Intheprevioussections,wederivedprobabilisticinterpretationsofextendedstatespaceimportance samplingusingmultiplesamplesK,asin IWAE. Inthissection,wefirstreview AIS,whichextends thestatespaceusingT intermediatedensitiesWethenshowthattheseapproachesarecomplementary, and derive two practical Multi-Sample AIS methods, which provide tighter bounds by combining insightsfrom IWAE and AIS. 3.1 ANNEALED IMPORTANCE SAMPLING BACKGROUND AIS (Neal, 2001) constructs a sequence of intermediate distributions {⇡ t (z)} T t=0 , which bridge between a normalized initial distribution ⇡ 0 (z|x) and target distribution ⇡ T (z|x)= p(z|x) with unnormalized density ⇡ T (x,z)= p(x,z) and normalizing constant Z T (x)= p(x). A common choiceforintermediatedistributionsisthegeometricmixturepathparameterizedby { t } T t=0 ⇡ t (z|x)= ⇡ 0 (z|x) 1 t ⇡ T (x,z) t Z t (x) , where Z t (x)= Z ⇡ 0 (z|x) 1 t ⇡ T (x,z) t dz. (11) Intheprobabilisticinterpretationof AIS,weconsideranextendedstatespaceproposalq AIS PROP (z 0:T |x), obtainedbysamplingfromtheinitial⇡ 0 (z|x)andconstructingtransitionsT f (z t+1 |z t ,x)whichleave ⇡ t (z|x) invariant. The target distribution p AIS TGT (z 0:T |x) is given by running the reverse transitions T r (z t |z t+1 ,x)startingfromatargetorposteriorsample⇡ T (z|x),asshowninFig.2, q AIS PROP (z 0:T |x) :=⇡ 0 (z 0 |x) T 1 Y t=0 T f (z t+1 |z t ),p AIS TGT (x,z 0:T ) :=⇡ T (x,z T ) T 1 Y t=0 T r (z t |z t+1 ,x). (12) Takingexpectationsofthelogimportanceweightsundertheproposalandtargetagainyieldsalower and upper bound on the log partition function logp(x) (App. E). These single-chain lower and upperboundstranslatetoupperandlowerboundson MI,I AIS U (⇡ 0 ,T)andI AIS L (⇡ 0 ,T),whichwere suggestedfor MI estimationintheblogpostofSobolev(2019). Tocharacterizethebiasreductionfor AIS withincreasingT,weprovethefollowingproposition. Proposition 3.1(ComplexityinT). 
Assuming perfect transitions and a geometric annealing path with linearly-spaced { t } T t=1 , the sum of the gaps in the AIS sandwich bounds on MI,I AIS U (⇡ 0 ,T) I AIS L (⇡ 0 ,T), reduces linearly with increasingT. SeeApp.D.1foraproof. InourexperimentsinSec.5,wewillfindthatthislinearbiasreductionin T iscrucialforachievingtight MI estimationwhenbothp(z)andp(x|z)areknown. However,we canfurthertightenthese AIS boundsusingmultipleannealingchains(K> 1),andwepresenttwo practicalextendedstatespaceapproachesinthefollowingsections. 3.2 INDEPENDENT MULTI-SAMPLE AIS BOUNDS Toderive Independent Multi-Sample AIS (IM-AIS),weconstructanextendedstatespaceproposalby running K independent AIS forward chainsz (k) 0:T ⇠ q AIS PROP in parallel. Similarly to the IWAE upper bound(Eq.(5)),theextendedstatespacetargetinvolvesselectingaindexsuniformlyatrandom,and runningabackward AIS chainz (s) 0:T ⇠ p AIS TGT startingfromatrueposteriorsamplez T ⇠ p(z|x). The remainingK 1samplesareobtainedbyrunningforward AIS chains,asvisualizedinFig.2 5 <latexit sha1_base64="g0I7ndMk9Nuzy1o7PNdW9Rbwvkk=">AAAI63icjZVbc9tEFMfVcqkptxQeeRFkMuNmNBkrdpp2eMmUMsAAQ2GaNkPtelarI3knq9V6tU5kb/TMB+CN4ZVHXniAz8K34axsJ7I2ATS+HJ//71z2rOSNJGeF7vX+vnX7tdffePNO5627b7/z7nvvb9374HmRzxSFY5rzXJ1EpADOBBxrpjmcSAUkizi8iE4/s/qLM1AFy8UzPZcwykgqWMIo0egab3081FDqOo+JospMx6b2FNRIlcuqqsZb2729Xn35rhGujO2jjldfT8f37vwxjHM6y0BoyklRvNwfSB2ASHE9k5EhSjPKobo7nBUgCT0lKbxEU5AMipGpe6n8HfTEfpIrfAvt195mhCFZUcyzCMmM6EnR1qzzUtvZKKWThyPDhJxpEHRZKZlxX+e+nZAfMwVU8zkahCqGzfp0QhShGue40fSzcGRsdzbNhsBxuiJsFNnoLiIR8GqjKSNLm6hAMIYE93K5I7OCjhXElfnhi8eVCQ8Og/4gOOgjJeCc5llGRIz7JVmFn4zHYIZo/+8kTYoV7JLoB72g19Ije3fU+sMHQfioH+wfHLSQNE1XyKNBUL9aQK6ISGHFYHwQDg6dQqkCALHuJQjDvkV2Ws2s9F4w6AfhYbtQQ/fDQbjqtTmz6aJc3uxywqru4qK83wJkmS4qI+3jMAFNqm55sWgz00Uzh5NhM97RI5wEfqRMGJgKohSZV/7KQThLBe5ZKwRsCKC11v3611W4U0IIp8juf1epo26us+sUOsNpmqF92KLEL9sJz84b6rmjzhvq3FEXDXXhqElDTdrqKUf1ydh8/U3VlmKgqF1tj9/FNVxgtfvONIQl19tsuQVypcOl9qZtZwz861JyHruoQ8lyA7qWmdpEZroWd+pGY3ITzqaIfzW2ASef/ugudXo5zch8XlnOGXjaGHjqLIxYNWKpjyb+30MbUFeAuhao1ypriCh/tfT2XZ2ALK7awKaIQg/juXD6Bc6lYhlYHO1XONb6pzNsJkFlllr+gba0dQrJbswgcoYnFp7Y684o4TjE7rJdZ9ZxRuz+xmMLl042xuKC1f3k9vgGbdCDD6V1WvYJ4NGq4Fus8x12TnSuds2QqLROi9/DwFr/BjKxBtHCUz5sn+mu8Xx/L3ywN/h+sH30eHncex3vI+8Tr+uF3qF35H3pPfWOPer95P3u/en91ck6P3d+6fy6RG/fWsV86G1cnd/+ATgfTjE=</latexit> q prop <latexit 
sha1_base64="uiSPZAM6O+4yeOGrcMHgcCsaPB8=">AAAI63icjZXNbttGEMeZpG2UNG2d5pgLW8OAYhCGaMlxglyMJEVTtEXSIk6MRoqwXA6phZdLarmyKa157gP0FvTaYy89tM/St+ksRcUU125L6GM0/9/OzM6Q2iDjLFe93t9Xrl774MOPrndu3Pz41ieffrZx+/NXeTqTFA5pylN5FJAcOBNwqJjicJRJIEnA4XVw/MTor09A5iwVL9U8g1FCYsEiRolC13jji6GCQlVxNMtZqbOxrlw51SpWZVmONzZ7O73qcm3Dr43Ng45TXS/Gt6//MQxTOktAKMpJnr/ZHWTKAxHjfiYjTaRilEN5czjLISP0mMTwBk1BEshHuqqldLfQE7pRKvEtlFt5mys0SfJ8ngRIJkRN8rZmnO+1rbVUKnow0kxkMwWCLjNFM+6q1DUdckMmgSo+R4NQybBYl06IJFRhH9eKfumPtKnOhFkTOHZX+I0ka9UFJABerhWls8IEyhEMIcJZLicyy+lYQljqH79+XGp/b9/rD7y9PlICTmmaJESEOK4M5zZUjIegh2j/7yBNqhp+TfS9ntdr6UEQ1PqD+57/sO/t7u21kDiOa+ThwKteLSCVRMRQM7je8wf7VqJYAoBY1eL5ft8gW61iar3nDfqev99O1NBdf+DXtTZ7Nl0UpZ7ivZ5NWNldnBX3WkBWxIv6aZiAImW3OFu0memiGcOKsL7e0gPsBH7ETGiYCiIlmZdu7SCcxQJn1loCZgmgtdLd6tf5ciuFEFaS7f/OUq26PM+2legEu6mH5mELIrdoBzw5bainljpvqHNLXTTUhaVGDTVqq8cc1adj/e13ZVsKgaJ2Ph63i3s4w2z3rG4IQ67GbLgFcoXFxeambUf03ItCch7aqEVlxRp0ITM1gfR0JW5VhYbkMpxNEf9mbBYcPfrJ3ur0fTcD/VVpOKvhcaPhsbUxYtSAxS6a+H8PbUCeA/JCoNprVkFEuvXW23d1BFl+XgYWRSR6GE+FVS9wnkmWgMHRfottrX5azWYZyMRQyz/QlrYKkbFLI4iU4YmFJ/aqMko4NrG7LNfqdZgQM99wbODCisZYmLOqntQc36A0evChNE7DPgU8WiV8j3meY+VEpXJbD4mMq7D4PfSM9W8gEysQLTzl/faZbhuvdnf8+zuDHwabB4+Xx73Tce46Xzpdx3f2nQPnmfPCOXSo87Pzu/On81cn6fzSedf5dYlevVKvueOsXZ3f/gGu6E4/</latexit> p tgt PublishedasaconferencepaperatICLR2022 ✗ Figure2: Extendedstate-spaceprobabilisticinterpretationsofmulti-sample AIS bounds. Forward chainsarecoloredinblue,andbackwardchainsarecoloredinred. Notethat ELBOsand EUBOsare obtainedbytakingtheexpectationofthelogunnormalizedimportanceweights logp TGT (·)/q PROP (·) undereithertheproposalortargetdistribution,andcanthenbetranslatedto MI bounds. 3MULTI-SAMPLE AIS BOUNDS FOR ESTIMATING MUTUAL INFORMATION Intheprevioussections,wederivedprobabilisticinterpretationsofextendedstatespaceimportance samplingusingmultiplesamplesK,asin IWAE. Inthissection,wefirstreview AIS,whichextends thestatespaceusingT intermediatedensitiesWethenshowthattheseapproachesarecomplementary, and derive two practical Multi-Sample AIS methods, which provide tighter bounds by combining insightsfrom IWAE and AIS. 3.1 ANNEALED IMPORTANCE SAMPLING BACKGROUND AIS (Neal, 2001) constructs a sequence of intermediate distributions {⇡ t (z)} T t=0 , which bridge between a normalized initial distribution ⇡ 0 (z|x) and target distribution ⇡ T (z|x)= p(z|x) with unnormalized density ⇡ T (x,z)= p(x,z) and normalizing constant Z T (x)= p(x). A common choiceforintermediatedistributionsisthegeometricmixturepathparameterizedby { t } T t=0 ⇡ t (z|x)= ⇡ 0 (z|x) 1 t ⇡ T (x,z) t Z t (x) , where Z t (x)= Z ⇡ 0 (z|x) 1 t ⇡ T (x,z) t dz. (11) Intheprobabilisticinterpretationof AIS,weconsideranextendedstatespaceproposalq AIS PROP (z 0:T |x), obtainedbysamplingfromtheinitial⇡ 0 (z|x)andconstructingtransitionsT f (z t+1 |z t ,x)whichleave ⇡ t (z|x) invariant. The target distribution p AIS TGT (z 0:T |x) is given by running the reverse transitions T r (z t |z t+1 ,x)startingfromatargetorposteriorsample⇡ T (z|x),asshowninFig.2, q AIS PROP (z 0:T |x) :=⇡ 0 (z 0 |x) T 1 Y t=0 T f (z t+1 |z t ),p AIS TGT (x,z 0:T ) :=⇡ T (x,z T ) T 1 Y t=0 T r (z t |z t+1 ,x). (12) Takingexpectationsofthelogimportanceweightsundertheproposalandtargetagainyieldsalower and upper bound on the log partition function logp(x) (App. E). These single-chain lower and upperboundstranslatetoupperandlowerboundson MI,I AIS U (⇡ 0 ,T)andI AIS L (⇡ 0 ,T),whichwere suggestedfor MI estimationintheblogpostofSobolev(2019). 
Tocharacterizethebiasreductionfor AIS withincreasingT,weprovethefollowingproposition. Proposition 3.1(ComplexityinT). Assuming perfect transitions and a geometric annealing path with linearly-spaced { t } T t=1 , the sum of the gaps in the AIS sandwich bounds on MI,I AIS U (⇡ 0 ,T) I AIS L (⇡ 0 ,T), reduces linearly with increasingT. SeeApp.D.1foraproof. InourexperimentsinSec.5,wewillfindthatthislinearbiasreductionin T iscrucialforachievingtight MI estimationwhenbothp(z)andp(x|z)areknown. However,we canfurthertightenthese AIS boundsusingmultipleannealingchains(K> 1),andwepresenttwo practicalextendedstatespaceapproachesinthefollowingsections. 3.2 INDEPENDENT MULTI-SAMPLE AIS BOUNDS Toderive Independent Multi-Sample AIS (IM-AIS),weconstructanextendedstatespaceproposalby running K independent AIS forward chainsz (k) 0:T ⇠ q AIS PROP in parallel. Similarly to the IWAE upper bound(Eq.(5)),theextendedstatespacetargetinvolvesselectingaindexsuniformlyatrandom,and runningabackward AIS chainz (s) 0:T ⇠ p AIS TGT startingfromatrueposteriorsamplez T ⇠ p(z|x). The remainingK 1samplesareobtainedbyrunningforward AIS chains,asvisualizedinFig.2 5 PublishedasaconferencepaperatICLR2022 ✗ Figure2: Extendedstate-spaceprobabilisticinterpretationsofmulti-sample AIS bounds. Forward chainsarecoloredinblue,andbackwardchainsarecoloredinred. Notethat ELBOsand EUBOsare obtainedbytakingtheexpectationofthelogunnormalizedimportanceweights logp TGT (·)/q PROP (·) undereithertheproposalortargetdistribution,andcanthenbetranslatedto MI bounds. 3MULTI-SAMPLE AIS BOUNDS FOR ESTIMATING MUTUAL INFORMATION Intheprevioussections,wederivedprobabilisticinterpretationsofextendedstatespaceimportance samplingusingmultiplesamplesK,asin IWAE. Inthissection,wefirstreview AIS,whichextends thestatespaceusingT intermediatedensitiesWethenshowthattheseapproachesarecomplementary, and derive two practical Multi-Sample AIS methods, which provide tighter bounds by combining insightsfrom IWAE and AIS. 3.1 ANNEALED IMPORTANCE SAMPLING BACKGROUND AIS (Neal, 2001) constructs a sequence of intermediate distributions {⇡ t (z)} T t=0 , which bridge between a normalized initial distribution ⇡ 0 (z|x) and target distribution ⇡ T (z|x)= p(z|x) with unnormalized density ⇡ T (x,z)= p(x,z) and normalizing constant Z T (x)= p(x). A common choiceforintermediatedistributionsisthegeometricmixturepathparameterizedby { t } T t=0 ⇡ t (z|x)= ⇡ 0 (z|x) 1 t ⇡ T (x,z) t Z t (x) , where Z t (x)= Z ⇡ 0 (z|x) 1 t ⇡ T (x,z) t dz. (11) Intheprobabilisticinterpretationof AIS,weconsideranextendedstatespaceproposalq AIS PROP (z 0:T |x), obtainedbysamplingfromtheinitial⇡ 0 (z|x)andconstructingtransitionsT f (z t+1 |z t ,x)whichleave ⇡ t (z|x) invariant. The target distribution p AIS TGT (z 0:T |x) is given by running the reverse transitions T r (z t |z t+1 ,x)startingfromatargetorposteriorsample⇡ T (z|x),asshowninFig.2, q AIS PROP (z 0:T |x) :=⇡ 0 (z 0 |x) T 1 Y t=0 T f (z t+1 |z t ),p AIS TGT (x,z 0:T ) :=⇡ T (x,z T ) T 1 Y t=0 T r (z t |z t+1 ,x). (12) Takingexpectationsofthelogimportanceweightsundertheproposalandtargetagainyieldsalower and upper bound on the log partition function logp(x) (App. E). These single-chain lower and upperboundstranslatetoupperandlowerboundson MI,I AIS U (⇡ 0 ,T)andI AIS L (⇡ 0 ,T),whichwere suggestedfor MI estimationintheblogpostofSobolev(2019). Tocharacterizethebiasreductionfor AIS withincreasingT,weprovethefollowingproposition. Proposition 3.1(ComplexityinT). 
Assuming perfect transitions and a geometric annealing path with linearly-spaced { t } T t=1 , the sum of the gaps in the AIS sandwich bounds on MI,I AIS U (⇡ 0 ,T) I AIS L (⇡ 0 ,T), reduces linearly with increasingT. SeeApp.D.1foraproof. InourexperimentsinSec.5,wewillfindthatthislinearbiasreductionin T iscrucialforachievingtight MI estimationwhenbothp(z)andp(x|z)areknown. However,we canfurthertightenthese AIS boundsusingmultipleannealingchains(K> 1),andwepresenttwo practicalextendedstatespaceapproachesinthefollowingsections. 3.2 INDEPENDENT MULTI-SAMPLE AIS BOUNDS Toderive Independent Multi-Sample AIS (IM-AIS),weconstructanextendedstatespaceproposalby running K independent AIS forward chainsz (k) 0:T ⇠ q AIS PROP in parallel. Similarly to the IWAE upper bound(Eq.(5)),theextendedstatespacetargetinvolvesselectingaindexsuniformlyatrandom,and runningabackward AIS chainz (s) 0:T ⇠ p AIS TGT startingfromatrueposteriorsamplez T ⇠ p(z|x). The remainingK 1samplesareobtainedbyrunningforward AIS chains,asvisualizedinFig.2 5 PublishedasaconferencepaperatICLR2022 ✗ Figure2: Extendedstate-spaceprobabilisticinterpretationsofmulti-sample AIS bounds. Forward chainsarecoloredinblue,andbackwardchainsarecoloredinred. Notethat ELBOsand EUBOsare obtainedbytakingtheexpectationofthelogunnormalizedimportanceweights logp TGT (·)/q PROP (·) undereithertheproposalortargetdistribution,andcanthenbetranslatedto MI bounds. 3MULTI-SAMPLE AIS BOUNDS FOR ESTIMATING MUTUAL INFORMATION Intheprevioussections,wederivedprobabilisticinterpretationsofextendedstatespaceimportance samplingusingmultiplesamplesK,asin IWAE. Inthissection,wefirstreview AIS,whichextends thestatespaceusingT intermediatedensitiesWethenshowthattheseapproachesarecomplementary, and derive two practical Multi-Sample AIS methods, which provide tighter bounds by combining insightsfrom IWAE and AIS. 3.1 ANNEALED IMPORTANCE SAMPLING BACKGROUND AIS (Neal, 2001) constructs a sequence of intermediate distributions {⇡ t (z)} T t=0 , which bridge between a normalized initial distribution ⇡ 0 (z|x) and target distribution ⇡ T (z|x)= p(z|x) with unnormalized density ⇡ T (x,z)= p(x,z) and normalizing constant Z T (x)= p(x). A common choiceforintermediatedistributionsisthegeometricmixturepathparameterizedby { t } T t=0 ⇡ t (z|x)= ⇡ 0 (z|x) 1 t ⇡ T (x,z) t Z t (x) , where Z t (x)= Z ⇡ 0 (z|x) 1 t ⇡ T (x,z) t dz. (11) Intheprobabilisticinterpretationof AIS,weconsideranextendedstatespaceproposalq AIS PROP (z 0:T |x), obtainedbysamplingfromtheinitial⇡ 0 (z|x)andconstructingtransitionsT f (z t+1 |z t ,x)whichleave ⇡ t (z|x) invariant. The target distribution p AIS TGT (z 0:T |x) is given by running the reverse transitions T r (z t |z t+1 ,x)startingfromatargetorposteriorsample⇡ T (z|x),asshowninFig.2, q AIS PROP (z 0:T |x) :=⇡ 0 (z 0 |x) T 1 Y t=0 T f (z t+1 |z t ),p AIS TGT (x,z 0:T ) :=⇡ T (x,z T ) T 1 Y t=0 T r (z t |z t+1 ,x). (12) Takingexpectationsofthelogimportanceweightsundertheproposalandtargetagainyieldsalower and upper bound on the log partition function logp(x) (App. E). These single-chain lower and upperboundstranslatetoupperandlowerboundson MI,I AIS U (⇡ 0 ,T)andI AIS L (⇡ 0 ,T),whichwere suggestedfor MI estimationintheblogpostofSobolev(2019). Tocharacterizethebiasreductionfor AIS withincreasingT,weprovethefollowingproposition. Proposition 3.1(ComplexityinT). 
Assuming perfect transitions and a geometric annealing path with linearly-spaced { t } T t=1 , the sum of the gaps in the AIS sandwich bounds on MI,I AIS U (⇡ 0 ,T) I AIS L (⇡ 0 ,T), reduces linearly with increasingT. SeeApp.D.1foraproof. InourexperimentsinSec.5,wewillfindthatthislinearbiasreductionin T iscrucialforachievingtight MI estimationwhenbothp(z)andp(x|z)areknown. However,we canfurthertightenthese AIS boundsusingmultipleannealingchains(K> 1),andwepresenttwo practicalextendedstatespaceapproachesinthefollowingsections. 3.2 INDEPENDENT MULTI-SAMPLE AIS BOUNDS Toderive Independent Multi-Sample AIS (IM-AIS),weconstructanextendedstatespaceproposalby running K independent AIS forward chainsz (k) 0:T ⇠ q AIS PROP in parallel. Similarly to the IWAE upper bound(Eq.(5)),theextendedstatespacetargetinvolvesselectingaindexsuniformlyatrandom,and runningabackward AIS chainz (s) 0:T ⇠ p AIS TGT startingfromatrueposteriorsamplez T ⇠ p(z|x). The remainingK 1samplesareobtainedbyrunningforward AIS chains,asvisualizedinFig.2 5 PublishedasaconferencepaperatICLR2022 ✗ Figure2: Extendedstate-spaceprobabilisticinterpretationsofmulti-sample AIS bounds. Forward chainsarecoloredinblue,andbackwardchainsarecoloredinred. Notethat ELBOsand EUBOsare obtainedbytakingtheexpectationofthelogunnormalizedimportanceweights logp TGT (·)/q PROP (·) undereithertheproposalortargetdistribution,andcanthenbetranslatedto MI bounds. 3MULTI-SAMPLE AIS BOUNDS FOR ESTIMATING MUTUAL INFORMATION Intheprevioussections,wederivedprobabilisticinterpretationsofextendedstatespaceimportance samplingusingmultiplesamplesK,asin IWAE. Inthissection,wefirstreview AIS,whichextends thestatespaceusingT intermediatedensitiesWethenshowthattheseapproachesarecomplementary, and derive two practical Multi-Sample AIS methods, which provide tighter bounds by combining insightsfrom IWAE and AIS. 3.1 ANNEALED IMPORTANCE SAMPLING BACKGROUND AIS (Neal, 2001) constructs a sequence of intermediate distributions {⇡ t (z)} T t=0 , which bridge between a normalized initial distribution ⇡ 0 (z|x) and target distribution ⇡ T (z|x)= p(z|x) with unnormalized density ⇡ T (x,z)= p(x,z) and normalizing constant Z T (x)= p(x). A common choiceforintermediatedistributionsisthegeometricmixturepathparameterizedby { t } T t=0 ⇡ t (z|x)= ⇡ 0 (z|x) 1 t ⇡ T (x,z) t Z t (x) , where Z t (x)= Z ⇡ 0 (z|x) 1 t ⇡ T (x,z) t dz. (11) Intheprobabilisticinterpretationof AIS,weconsideranextendedstatespaceproposalq AIS PROP (z 0:T |x), obtainedbysamplingfromtheinitial⇡ 0 (z|x)andconstructingtransitionsT f (z t+1 |z t ,x)whichleave ⇡ t (z|x) invariant. The target distribution p AIS TGT (z 0:T |x) is given by running the reverse transitions T r (z t |z t+1 ,x)startingfromatargetorposteriorsample⇡ T (z|x),asshowninFig.2, q AIS PROP (z 0:T |x) :=⇡ 0 (z 0 |x) T 1 Y t=0 T f (z t+1 |z t ),p AIS TGT (x,z 0:T ) :=⇡ T (x,z T ) T 1 Y t=0 T r (z t |z t+1 ,x). (12) Takingexpectationsofthelogimportanceweightsundertheproposalandtargetagainyieldsalower and upper bound on the log partition function logp(x) (App. E). These single-chain lower and upperboundstranslatetoupperandlowerboundson MI,I AIS U (⇡ 0 ,T)andI AIS L (⇡ 0 ,T),whichwere suggestedfor MI estimationintheblogpostofSobolev(2019). Tocharacterizethebiasreductionfor AIS withincreasingT,weprovethefollowingproposition. Proposition 3.1(ComplexityinT). 
Assuming perfect transitions and a geometric annealing path with linearly-spaced { t } T t=1 , the sum of the gaps in the AIS sandwich bounds on MI,I AIS U (⇡ 0 ,T) I AIS L (⇡ 0 ,T), reduces linearly with increasingT. SeeApp.D.1foraproof. InourexperimentsinSec.5,wewillfindthatthislinearbiasreductionin T iscrucialforachievingtight MI estimationwhenbothp(z)andp(x|z)areknown. However,we canfurthertightenthese AIS boundsusingmultipleannealingchains(K> 1),andwepresenttwo practicalextendedstatespaceapproachesinthefollowingsections. 3.2 INDEPENDENT MULTI-SAMPLE AIS BOUNDS Toderive Independent Multi-Sample AIS (IM-AIS),weconstructanextendedstatespaceproposalby running K independent AIS forward chainsz (k) 0:T ⇠ q AIS PROP in parallel. Similarly to the IWAE upper bound(Eq.(5)),theextendedstatespacetargetinvolvesselectingaindexsuniformlyatrandom,and runningabackward AIS chainz (s) 0:T ⇠ p AIS TGT startingfromatrueposteriorsamplez T ⇠ p(z|x). The remainingK 1samplesareobtainedbyrunningforward AIS chains,asvisualizedinFig.2 5 PublishedasaconferencepaperatICLR2022 ✗ Figure2: Extendedstate-spaceprobabilisticinterpretationsofmulti-sample AIS bounds. Forward chainsarecoloredinblue,andbackwardchainsarecoloredinred. Notethat ELBOsand EUBOsare obtainedbytakingtheexpectationofthelogunnormalizedimportanceweights logp TGT (·)/q PROP (·) undereithertheproposalortargetdistribution,andcanthenbetranslatedto MI bounds. 3MULTI-SAMPLE AIS BOUNDS FOR ESTIMATING MUTUAL INFORMATION Intheprevioussections,wederivedprobabilisticinterpretationsofextendedstatespaceimportance samplingusingmultiplesamplesK,asin IWAE. Inthissection,wefirstreview AIS,whichextends thestatespaceusingT intermediatedensitiesWethenshowthattheseapproachesarecomplementary, and derive two practical Multi-Sample AIS methods, which provide tighter bounds by combining insightsfrom IWAE and AIS. 3.1 ANNEALED IMPORTANCE SAMPLING BACKGROUND AIS (Neal, 2001) constructs a sequence of intermediate distributions {⇡ t (z)} T t=0 , which bridge between a normalized initial distribution ⇡ 0 (z|x) and target distribution ⇡ T (z|x)= p(z|x) with unnormalized density ⇡ T (x,z)= p(x,z) and normalizing constant Z T (x)= p(x). A common choiceforintermediatedistributionsisthegeometricmixturepathparameterizedby { t } T t=0 ⇡ t (z|x)= ⇡ 0 (z|x) 1 t ⇡ T (x,z) t Z t (x) , where Z t (x)= Z ⇡ 0 (z|x) 1 t ⇡ T (x,z) t dz. (11) Intheprobabilisticinterpretationof AIS,weconsideranextendedstatespaceproposalq AIS PROP (z 0:T |x), obtainedbysamplingfromtheinitial⇡ 0 (z|x)andconstructingtransitionsT f (z t+1 |z t ,x)whichleave ⇡ t (z|x) invariant. The target distribution p AIS TGT (z 0:T |x) is given by running the reverse transitions T r (z t |z t+1 ,x)startingfromatargetorposteriorsample⇡ T (z|x),asshowninFig.2, q AIS PROP (z 0:T |x) :=⇡ 0 (z 0 |x) T 1 Y t=0 T f (z t+1 |z t ),p AIS TGT (x,z 0:T ) :=⇡ T (x,z T ) T 1 Y t=0 T r (z t |z t+1 ,x). (12) Takingexpectationsofthelogimportanceweightsundertheproposalandtargetagainyieldsalower and upper bound on the log partition function logp(x) (App. E). These single-chain lower and upperboundstranslatetoupperandlowerboundson MI,I AIS U (⇡ 0 ,T)andI AIS L (⇡ 0 ,T),whichwere suggestedfor MI estimationintheblogpostofSobolev(2019). Tocharacterizethebiasreductionfor AIS withincreasingT,weprovethefollowingproposition. Proposition 3.1(ComplexityinT). 
Assuming perfect transitions and a geometric annealing path with linearly-spaced { t } T t=1 , the sum of the gaps in the AIS sandwich bounds on MI,I AIS U (⇡ 0 ,T) I AIS L (⇡ 0 ,T), reduces linearly with increasingT. SeeApp.D.1foraproof. InourexperimentsinSec.5,wewillfindthatthislinearbiasreductionin T iscrucialforachievingtight MI estimationwhenbothp(z)andp(x|z)areknown. However,we canfurthertightenthese AIS boundsusingmultipleannealingchains(K> 1),andwepresenttwo practicalextendedstatespaceapproachesinthefollowingsections. 3.2 INDEPENDENT MULTI-SAMPLE AIS BOUNDS Toderive Independent Multi-Sample AIS (IM-AIS),weconstructanextendedstatespaceproposalby running K independent AIS forward chainsz (k) 0:T ⇠ q AIS PROP in parallel. Similarly to the IWAE upper bound(Eq.(5)),theextendedstatespacetargetinvolvesselectingaindexsuniformlyatrandom,and runningabackward AIS chainz (s) 0:T ⇠ p AIS TGT startingfromatrueposteriorsamplez T ⇠ p(z|x). The remainingK 1samplesareobtainedbyrunningforward AIS chains,asvisualizedinFig.2 5 PublishedasaconferencepaperatICLR2022 ✗ Figure2: Extendedstate-spaceprobabilisticinterpretationsofmulti-sample AIS bounds. Forward chainsarecoloredinblue,andbackwardchainsarecoloredinred. Notethat ELBOsand EUBOsare obtainedbytakingtheexpectationofthelogunnormalizedimportanceweights logp TGT (·)/q PROP (·) undereithertheproposalortargetdistribution,andcanthenbetranslatedto MI bounds. 3MULTI-SAMPLE AIS BOUNDS FOR ESTIMATING MUTUAL INFORMATION Intheprevioussections,wederivedprobabilisticinterpretationsofextendedstatespaceimportance samplingusingmultiplesamplesK,asin IWAE. Inthissection,wefirstreview AIS,whichextends thestatespaceusingT intermediatedensitiesWethenshowthattheseapproachesarecomplementary, and derive two practical Multi-Sample AIS methods, which provide tighter bounds by combining insightsfrom IWAE and AIS. 3.1 ANNEALED IMPORTANCE SAMPLING BACKGROUND AIS (Neal, 2001) constructs a sequence of intermediate distributions {⇡ t (z)} T t=0 , which bridge between a normalized initial distribution ⇡ 0 (z|x) and target distribution ⇡ T (z|x)= p(z|x) with unnormalized density ⇡ T (x,z)= p(x,z) and normalizing constant Z T (x)= p(x). A common choiceforintermediatedistributionsisthegeometricmixturepathparameterizedby { t } T t=0 ⇡ t (z|x)= ⇡ 0 (z|x) 1 t ⇡ T (x,z) t Z t (x) , where Z t (x)= Z ⇡ 0 (z|x) 1 t ⇡ T (x,z) t dz. (11) Intheprobabilisticinterpretationof AIS,weconsideranextendedstatespaceproposalq AIS PROP (z 0:T |x), obtainedbysamplingfromtheinitial⇡ 0 (z|x)andconstructingtransitionsT f (z t+1 |z t ,x)whichleave ⇡ t (z|x) invariant. The target distribution p AIS TGT (z 0:T |x) is given by running the reverse transitions T r (z t |z t+1 ,x)startingfromatargetorposteriorsample⇡ T (z|x),asshowninFig.2, q AIS PROP (z 0:T |x) :=⇡ 0 (z 0 |x) T 1 Y t=0 T f (z t+1 |z t ),p AIS TGT (x,z 0:T ) :=⇡ T (x,z T ) T 1 Y t=0 T r (z t |z t+1 ,x). (12) Takingexpectationsofthelogimportanceweightsundertheproposalandtargetagainyieldsalower and upper bound on the log partition function logp(x) (App. E). These single-chain lower and upperboundstranslatetoupperandlowerboundson MI,I AIS U (⇡ 0 ,T)andI AIS L (⇡ 0 ,T),whichwere suggestedfor MI estimationintheblogpostofSobolev(2019). Tocharacterizethebiasreductionfor AIS withincreasingT,weprovethefollowingproposition. Proposition 3.1(ComplexityinT). 
sha1_base64="d1fy6Bo1Ko3P3AqbBK7idgjfTuw=">AAAIoHicjZVbb9s2FMfV7lKvu7Xb416EBgGcQAis2Gla7CXoOqzDNrQdmjZo5BmUdCQToSiapBPZjLBPsdftc+3b7FC2E1tMtgm+HJ//71x4KJmxYFTpXu/vO3c/+PCjj+91Prn/6Weff/Hlg4dfvVXlVCZwnJSslCcxUcAoh2NNNYMTIYEUMYN38dl3Vn93DlLRkr/RMwHDguScZjQhGl2nkVC0G8Wgyc7owVZvr9dcvmuES2PrqOM116vRw3u/R2mZTAvgOmFEqdP9gdAB8Bz7Hg8NkZomDOr70VSBIMkZyeEUTU4KUEPT9F772+hJ/ayU+Obab7zrEYYUSs2KGMmC6LFqa9Z5pW1vlNLZk6GhXEw18GRRKZsyX5e+nYSfUgmJZjM0SCIpNusnYyJJonFeG02/CYfGdmfTbAgMp8jDtSIb3cUkBlZvNGVEZRMpBFPIcM+a9ZqpSkYS0tr8+sOz2oQHh0F/EBz0keJwkZRFQXhqIi1ojZ+UpWAitP93knWKKnpF9INe0GvpcRwv9SePg/BpP9g/OGgheZ4vkaeDoHm1gFISnsOSwfggHBw6hXIJAHzVSxCG/Rt6Wcq9YNAPwsNBayCTeVWbyQiHMaZ1d35Z7bQAUeXz2ggk9Bjv8bpbXc7bzGS+nsPJsBnv6DEuEz9yyg1MOJGSzGp/6SCM5hw3pBUCNgTQWul+8+s63CnBuVNk97+rNFG319l1Cp3jNE3VTnR+gd7IPl9x5l846mxNnTkqTs/MHW+2FpO11TOG6vOR+ennui2lkKB2vR1+F3u+xCo7zuq5JVfbark5cpXD5fYObGcM/JtSMpa6qEOJagO6kZnYRGayErebRlNyG04niP84sgEn3753lzq5mmZsvq8t5ww8Xxt47iyMWDWmuY8m/nlDG5DXgLwRaNYqGohIf7n09l2cgVDXbWBTRKKHspI7/QJjQtICLI72bzjW5qczbCpAFpZa/Bu2tFUKQW/NwEuKxw8es6vOEsJwiN1Fu86s04LY/U1HFq6cbJSmijb9lPbMBW3Qgw+hdVr2OeA5KeEXrPMSOye6lLsmIjJv0uJ3FFjr30DKVyBaeGSH7QPaNd7u74WP9wavB1tHzxZnt9fxvvEeeV0v9A69I++F98o79hKv9P7w/vT+6jzqvOi87LxeoHfvLGO+9jauzvt/AEQlMCU=</latexit> ( ) <latexit sha1_base64="rro3cm1glFq0FzsaB9M4zhmUrkE=">AAAInHicjZVbb9s2FMfZbl3d7tZujwUGYUEANxACK3aaFnsJugzd0BVri+aC1Z5BUUcyEYqSKTqRzQj9DHvdPtm+zQ5lO7HFpJvgy/H5/86Fh5IZ5oIXutP559btTz6989nd1r37n3/x5VdfP3j4zVGRTRSDQ5aJTJ2EtADBJRxqrgWc5ApoGgo4Dk9/tPrxGaiCZ/KdnuYwSGkiecwZ1eg67oeg6VAPH2x0tjv15blGsDA29lukvl4PH9790I8yNklBaiZoUbzf6eXaB5lgz6OBoUpzJqC6358UkFN2ShN4j6akKRQDU/ddeZvoibw4U/iW2qu9qxGGpkUxTUMkU6pHRVOzzkttc62Ujp8ODJf5RINk80rxRHg68+wUvIgrYFpM0aBMcWzWYyOqKNM4q7Wm3wUDY7uzadYEgROUwUqRte5CGoKo1poyeWkTFQhGEON+1es1k4INFUSVefvieWWC3T2/2/N3u0hJOGdZmlIZmb7OeYWfXERg+mj/7ySrFC/4JdH1O36noYdhuNCfPvGDZ11/Z3e3gSRJskCe9fz61QAyRWUCCwbj/aC35xRKFADIZS9+EHSv6WUhd/xe1w/2eo2BjGdlZcZDHMaIV+3ZRfm4AeRlMqtMjoQe4U1etcuLWZMZz1ZzOBnW4x09xGXiR8KlgbGkStFp5S0cVPBE4oY0QsCGAFpL3at/XYU7JaR0imz9d5U66uY6W06hM5ymKZuJzs7R27fPVxh75446XVGnjorTMzPHG6/ExE31VKB6MDQvf62aUgQMtavt8NrY8wVWeeysXlpyua2WmyFXOlxi78BmRt+7LqUQkYs6VF6uQdcyY5vIjJfiZt1oRG/C+RjxX4Y24OSH392lji+nGZqfKss5A09WBp44C6NWDXnioYl/3tAE1BWgrgXqteY1RJW3WHrzLo4hL67awKaoQg8XmXT6BSFyxVOwONp/4Fjrn86weQ4qtdT837ChLVPk/MYMMuN4/OARu+yMUYFDbM/bdWYdpdTubzS0cOlk4zwqeN1PZs9b0AY9+BBap2UPAM9JBa+wzm/YOdWZ2jJ9qpI6LX73fWt9DORyCaKFR3bQPKBd42hnO3iy3XvT29h/Pj+7SYs8It+TNgnIHtknP5PX5JAwckr+JH+Rv1vftQ5aL1uv5ujtW4uYb8na1Tr6Fx8HLtc=</latexit> t Figure 3.4: (Left): tvo Upper Bound visualized on the graph of the integrand =r () = E [log p (x;z) q (zjx) ]. (Right): tvo upper bound on the graph of the log partition function (), where the gap is the sum of the error in rst order Taylor approximations along the path. WritingD as a KL divergence as in (3.24) and recalling that (1) (0) = logp (x), we obtain logp(x)tvo L (;;x) = T X t=1 D ! KL [ t1 jj t ]: (3.27) We therefore see that the gap in the tvo lower bound is the sum of KL divergences between adjacent t distributions. Alternatively, we can view (3.26) as constructing successive rst-order Taylor approximations to intermediate ( t ) in Fig. 3.3. The likelihood bound gap of P T t=1 D [ t : t1 ] measures the accumulated error along the path. While the elbo estimates (1) = logp (x) directly from = 0, more rened partitions can reduce the error and improve our bounds. As K!1, tvo L (;;x) becomes tight as our t are innitesimally close, and the Riemann integral estimate would become exact given exact estimates of t . 
TVO Upper Bound Gap   To characterize the gap in the upper bound, we first leverage convex duality to obtain a Bregman divergence in terms of the conjugate function ψ*(η) and the mean parameters. As shown in App. A.0.1, this divergence, D_{ψ*}, is equivalent to D_ψ with the order of arguments reversed,

D_{ψ*}[η_t : η_{t−1}] = D_ψ[β_{t−1} : β_t] = D_KL[π_{β_t} || π_{β_{t−1}}].   (3.28)
sha1_base64="onIJ3jXM+0D/nG5/DqDVVA1lDFI=">AAAHT3ichVVbb9MwFPa4dBduGzzyEjFNKlM1rbBpIF6mAQIEiIsYTKylcpyT1JrjuI67tTP5GbzCb+KRX8Ib4jjrWBsXiNTm5Hyfz/f5HCUOleC5WV//MXPu/IWLtdm5+YVLl69cvba4dP19nvU1g12WiUzvhTQHwSXsGm4E7CkNNA0FfAgPHjr8wyHonGfynRkqaKc0kTzmjBpM7bdUzuutEAy93VlcXl9bL6/AD5qjYHl7jpTX685Sba0VZayfgjRM0Dzfv7OhTANkgr67bUu14UxAsdDq56AoO6AJ7GMoaQp525bei2AFM1EQZxp/0gRldnyFpWmeD9MQmSk13byKueQfbGVCysT32pZL1Tcg2YlS3BeByQLXiSDiGpgRQwwo0xzNBqxLNWUG+zVh+l2zbZ07V2YCENhF2RwTmXAX0hBEMWHKqoErlCMxghhnVu7X9nPW0RAV9u2TncI2N7cadzcam3eRJeGIZWlKZWRxTIX7S7i00JNUazosglGCCp5IrFBZAm4JYHSKB+XT2XJPQkpPZPX/KuWqv+usekKHg8LalptaGAeDasHDozH0yEOHY+jQQ4/H0GMPjcfQuIoeCEQfdezzF0UVioAhpjq2Zbr4uhRBHffwGdVue92QjtlDpupy5B1/HnicBGSl2qAR+KWEiHxRj6XcjlV9mhc1mFhf9xf3nILtFaPKK6XziDryNK3e8djepkryHjKedVzJvQcf/e70/gwgtI8Lx/NmlIzNKPF6Qh0a8iTAEL81UCXoM4KeSih7oUoS1cGoNdUXIQaVn9lAU1RjhotMen5BCKV5Co6O8SdsTvnoDYMr0KljKT4FOy2h+F8ryIzj1xJPhVNnjApsYv3ErtfrKKVu/lHHkQdeNc6jnJd+MndEgLGYwffYJR33EeBnXcNL1HmFzqnJ9KptUZ2UZfHearjoX0QuT4kY4QnTrJ4nfvD+zlpzY+3+m43l7Z2To4bMkZvkFqmTJtki2+QpeU12CSMZ+UK+km+177WftV+zI+q5mVFwg0xcs/O/AdEatYo=</latexit>D ! KL [⇡ k 1 ||⇡ k ] <latexit sha1_base64="qorabaxKW42wcji08leOxe+V48s=">AAACjHicbZDfaxNBEMc354/WqDXVR18Wg+CDhrsasCBCUBFBHyqYtpA7j9nNJFlv7/bYndOG6/07/jW+KvjfuJfmwWsdduHLd2aYmY8otXIUhn96wbXrN27u7N7q375zd+/eYP/+sTOVlTiVRht7KsChVgVOSZHG09Ii5ELjicjetPmTb2idMsVnWpeY5LAs1EJJIG+lg8nbL3X8lLfPquWKwFrzvUnrDx+bWVyqtI4FEqR19ixqGn5+zjtm0yTpYBiOwk3wqyLaiiHbxlG63xvHcyOrHAuSGpybRWFJSQ2WlNTY9OPKYQkygyXOvCwgR5fUm1Mb/tg7c74w1v+C+Mb9t6OG3Ll1LnxlDrRyl3Ot+b/crKLFYVKroqwIC3kxaFFpToa33PhcWZSk116AtMrvyuUKLEjydDtT8kqTail2XchQotZdVxiTEQjXvRrPSmNbJPOvlSNhzpp+33OOLlO9Ko4PRtHzUfRpPJy83hLfZQ/ZI/aERewFm7D37IhNmWQ/2E/2i/0O9oJx8DJ4dVEa9LY9D1gngnd/ATfNywc=</latexit> Figure 3.5: Adding the KL divergences in each direction, we can visualize the symmetrized KL divergence as the area of a rectangle. The shape of the tvo integrand suggests which direction of the KL divergence is larger, with the divergence becoming symmetric when is linear in . As in Eq. (3.26), we expand the dual divergences along a path as T X t=1 D [ t1 : t ] = (0) (1) T X t=1 ( t1 t )r ( t ) Since the last term corresponds to a right-Riemann sum, we can rearrange to characterize the gap in tvo U (;;x) using a sum of KL divergences in the reverse direction tvo U (;;x) logp(x) = T X t=1 D KL [ t jj t1 ]: (3.29) 3.4.1 Integral Forms and Symmetrized KL To further contextualize the developments in this section, we show in App. C of Brekelmans et al. (2020a) that both thermodynamic integration and the tvo may be understood using the integral form of the Taylor remainder theorem (Kountourogiannis and Loya, 2003). In particular, for the k-th order approximation of (x) around a, with 2 [a;x] 1 , we have R t (x) = Z x a r (k+1) () k! (x) k d (3.30) 1 We use generic variable x, not to be confused with data x, for notational simplicity. 76 Recall that the kl divergence is the gap in a rst order Taylor approximation of (), with r 2 () = Var h log p (x;z) q (zjx) i . We obtain can thus obtain integral representation of the kl diver- gence between t1 and t by integrating from t1 to t , D ! KL [ t1 jj t ] = t Z t1 ( t )Var log p (x;z) q (zjx) d: (3.31) This lends further intuition for our interpretation of the kl divergence as the area of a region in Fig. 3.2 or Fig. 3.5. Switching the order of integration, we can write the reverse kl divergence D KL [ t jj t1 ] = t Z t1 ( t1 )Var log p (x;z) q (zjx) d: (3.32) Combining the remainders in each direction, we recover a known identity (Eq. 
Combining the remainders in each direction, we recover a known identity, Eq. (3.33), relating the symmetrized kl divergence to the integral of the Fisher information (Dabak and Johnson, 2002),

D↔_KL[β_{t−1}; β_t] := D_KL[π_{β_{t−1}} || π_{β_t}] + D_KL[π_{β_t} || π_{β_{t−1}}]
    = (β_t − β_{t−1}) ∫_{β_{t−1}}^{β_t} Var_{π_β}[log p_θ(x,z)/q_φ(z|x)] dβ   (3.33)
    = (β_t − β_{t−1}) (η_{β_t} − η_{β_{t−1}}),   (3.34)

where the last line follows from the following reasoning. Writing the integrand in Eq. (3.33) as Var_{π_β}[log p_θ(x,z)/q_φ(z|x)] = ∇²ψ(β), we can apply the fundamental theorem of calculus, in similar fashion to thermodynamic integration, but for the function ∇ψ(β) instead of ψ(β). In particular, we can write ∫_{β_{t−1}}^{β_t} ∇²ψ(β) dβ = ∇ψ(β_t) − ∇ψ(β_{t−1}) = η_{β_t} − η_{β_{t−1}}, which expresses the integral in Eq. (3.33) in terms of the mean parameters. We thus obtain an interpretation of the symmetrized kl divergence as the area of a rectangle, which we visualize in Fig. 3.5. For β_0 = 0 and β_1 = 1, we can confirm that η_1 − η_0 = eubo − elbo = D_KL[q_φ || p_θ] + D_KL[p_θ || q_φ]. Note that this identity will also play a key role in characterizing the bias reduction of ais (Neal, 2001) in Ch. 2 Sec. 2.3.1. Also see Brekelmans et al. (2020a) App. D or Brekelmans et al. (2022b) App. D.

To summarize, we have given several equivalent ways of understanding the kl divergence and its symmetrization. The `forward' and `reverse' kl divergences arise as gaps in the tvo left- and right-Riemann approximations (Fig. 3.2, 3.4), or first-order Taylor remainders as in Eq. (3.31) and Eq. (3.32). Summing these quantities corresponds to the area of a rectangle on the graph of the tvo integrand (Eq. (3.34), Fig. 3.5). Equivalently, this can be written as the integral of a variance term via the Taylor remainder theorem, Eq. (3.33). Before presenting our proposed approach for choosing β in the next section, we note that Grosse et al. (2013) describe a `coarse-grained' linear binning schedule, which allocates intermediate distributions based on the identity in Eq. (3.34) and is evaluated as a baseline in Sec. 4.6.

3.5 Moment-Spacing Schedule

Masrani et al. (2019) observe that the performance of the tvo can depend heavily on the choice of partition schedule P, and propose a grid search to choose β_1 first, with log-uniform spacing of the remaining {β_2, ..., β_{K−1}}. Instead, we propose choosing β_t to yield equal spacing in the y-axis of the tvo integrand η_β = ∇_β log Z_β, as shown in Fig. 3.6. The resulting spacing in the x-axis will thus reflect how quickly η_β is changing as a function of β, which reflects the intuition that we should allocate more partition points in regions where the integrand is changing quickly. Our moment-spacing schedule can thus adapt to the shape of the tvo integrand, which reflects the mismatch between q_φ(z|x) and p_θ(z|x) (note that the tvo integrand will be flat when q_φ(z|x) = p_θ(z|x), so that changing β has no effect) and can change significantly across samples and training epochs.

This scheduling naturally arises from our exponential family interpretation in the previous section. More formally, distributions in our (minimal) family Eq. (3.13) may be indexed by either the natural parameter β_t or mean parameter η_t, where the bijection between parameterizations arises from the convexity of ψ(β) (see Wainwright and Jordan (2008) Sec. 3.5.2). Equal spacing in the mean parameters also corresponds to the `moment-averaged' path of Grosse et al. (2013), which was shown to yield robust estimators and natural generative samples from intermediate π_β in the context of ais.
Given a budget of intermediate distributions T = |P|, we seek β_t such that η_t are uniformly distributed between the endpoints η_0 = elbo and η_1 = eubo (see Eq. (3.7)-Eq. (3.8)):

β_t = η^{−1}[ (1 − t/T) elbo + (t/T) eubo ].   (3.35)
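A minimal sketch of solving Eq. (3.35) by bisection on an snis estimate of η_β, which is monotone in β since ∇η_β = Var_{π_β}[log w] ≥ 0; the procedure is described in more detail below, and all names here are illustrative.

import numpy as np
from scipy.special import logsumexp

def snis_eta(log_w, beta):
    p = np.exp(beta * log_w - logsumexp(beta * log_w))
    return np.sum(p * log_w)

def beta_for_target_moment(log_w, eta_target, tol=1e-4, max_iter=100):
    # Bisection for the beta whose moment eta_beta matches eta_target.
    lo, hi = 0.0, 1.0
    while hi - lo > tol and max_iter > 0:
        mid, max_iter = 0.5 * (lo + hi), max_iter - 1
        lo, hi = (mid, hi) if snis_eta(log_w, mid) < eta_target else (lo, mid)
    return 0.5 * (lo + hi)

def moment_spacing_schedule(log_w, T):
    elbo, eubo = snis_eta(log_w, 0.0), snis_eta(log_w, 1.0)
    targets = [(1 - t / T) * elbo + (t / T) * eubo for t in range(T + 1)]
    return np.array([beta_for_target_moment(log_w, tgt) for tgt in targets])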
= ⌘ K<latexit sha1_base64="f8EJMs5j7bvOJ3iluGJnswg+gpM=">AAACZHicbZDPSiNBEMY7s3/UqLtREQRBmg2CpzCzBNzLguhF8KJgVEhCqO5UtE3P9NBdvZswzD6N19338QV8DjsxB0f3g4aPX1VRXZ/ItXIUx4+16MPHT5+Xllfqq2vrX742NjavnPFWYkcabeyNAIdaZdghRRpvcouQCo3XYnwyq1//QuuUyS5pmmM/hdtMjZQECmjQ2OkRTsjJAr0wJf/Je0gwKM7KQaMZt+K5+HuTLEyTLXQ+2Ki1e0MjfYoZSQ3OdZM4p34BlpTUWNZ73mEOcgy32A02gxRdv5ifUPL9QIZ8ZGx4GfE5fT1RQOrcNBWhMwW6c29rM/i/WtfT6Ee/UFnuCTP5smjkNSfDZ3nwobIoSU+DAWlV+CuXd2BBUkitsiX1mpQ1v8sqhTFK1LpKhTFjAuGqV+MkN3YWyfDeOxJmUtbrIefkbarvzdX3VtJutS/azaPjReLLbJd9YwcsYYfsiJ2yc9Zhkv1hD+wv+1d7itairWj7pTWqLWa2WEXR3jNqsLty</latexit>Figure 3.6: By enforcing equal spacing in the mean parameter space, our moments sched- ule naturally `adapts' by allocating more parti- tions to regions where the integrand is changing quickly. Figure 3.7: We visualize placement of t for our moments-spacing schedule across the rst 100 epochs, withT = 20. Most t concentrate near 0 in early epochs, but spread out as training proceeds and the integrand becomes atter as a function of . We use 1 [] to indicate the value of the natural parameter such that the expected sucient statistics =E [log p (x;z) q (zjx) ] match a desired target . This mapping between our dual parame- terizations is also known as the Legendre transform (Amari, 2016) and is a dicult optimization in its own right (Wainwright and Jordan, 2008). However, in the context of tvo, estimating the moments at a given simply involves reweighting and normalizing importance samples using snis in Eq. (3.12). Equipped with this cheap evaluation mechanism, we can easily apply a procedure such as binary search or gradient descent to nd the t with a target expectation value t as in (3.35). We choose to update our choice of schedule at the end of each epoch, and provide further implementation details in App. G of Brekelmans et al. (2020b). 3.6 Doubly-Reparameterized TVO Gradient To optimize the tvo, Masrani et al. (2019) derive a reinforce-style gradient estimator (see their App. F), which provides lower variance gradients and improved performance with discrete latent variables. Writing to denotef;g, withw =p (x;z)=q (zjx) and ~ (x;z) as in (3.5), we obtain gradients for expectations of arbitrary f(z) under , with the tvo integrand corresponding to f(z) = logw, d d E [f(z)] =E [ d d f(z)] + Cov f(z); d d log ~ (x;z) (3.36) 79 However, whenz i q (zjx) can be reparameterized viaz i =z( i ;); i p(), we can improve the estimator in Eq. (3.36) by more directly incorporating f(z) gradient information. To this end, we derive a doubly-reparameterized gradient estimator in Brekelmans et al. (2020a) App. I d d E [f(z)] = E d d f(z) @z @ @f(z) @z + (1) Cov f(z); @z @ @ logw @z : (3.37) Doubly-reparameterized gradient estimators avoid a known signal-to-noise ratio issue for inference network gradients (Rainforth et al., 2018b), using a second application of the reparameterization trick within the expanded total derivative (Tucker et al., 2018). We use a simplied form of Eq. (3.37) (see Brekelmans et al. (2020b) App. I Eq. 75) for learning and Eq. (3.36) for learning . Comparing the covariance terms of Eq. (3.36) and Eq. (3.37), note that d d log ~ (x;z) and @z @ @ logw @z dier by their dierentation operator and a factor of logq due to reparameterization, with log ~ = logq + logw. Further, the eect of the partial derivative @z @ @f(z) @z in the rst term of Eq. (3.37) linearly decreases as ! 1 and (zjx) has less dependence on . Finally, we see that Eq. 
Comparing the covariance terms of Eq. (3.36) and Eq. (3.37), note that d/dλ log π̃_β(x,z) and (∂z/∂φ)(∂ log w/∂z) differ by their differentiation operator and a factor of log q_φ due to reparameterization, with log π̃_β = log q_φ + β log w. Further, the effect of the partial derivative (∂z/∂φ)(∂f(z)/∂z) in the first term of Eq. (3.37) linearly decreases as β → 1 and π_β(z|x) has less dependence on φ. Finally, we see that Eq. (3.37) passes two basic sanity checks, with the covariance correction term vanishing at both endpoints. At β = 0, we recover the gradient of the elbo, d/dφ E_{π_0}[f(z)] = E_{z(ε;φ)}[d/dφ f(z)]. At β = 1, note that the (∂z/∂φ)(∂f(z)/∂z) term cancels when expanding d/dφ f(z), leaving d/dφ E_{π_1}[f(z)] = E_{p_θ}[∂f(z)/∂φ]. This is to be expected for expectations under p_θ(z|x), since the derivative with respect to φ passes inside the expectation and ∂z/∂φ = 0.

3.7 Experiments

We investigate the effect of our moment-spacing schedule and reparameterization gradients using a continuous latent variable model on the Omniglot dataset. We estimate test log p_θ(x) using the iwae bound (Burda et al., 2015a) with 5k samples, and use S = 50 samples for training unless noted. In all plots, we report averages over five random seeds, with error bars indicating min and max values. We describe our model architecture and experiment design in App. F of Brekelmans et al. (2020a) (code at https://github.com/vmasrani/tvo_all_in), with runtimes and additional results on binary mnist in App. H of Brekelmans et al. (2020a).

Figure 3.8: The original tvo paper recommended using two partition points, with a single intermediate β_1 in addition to the elbo at β_0 = 0. We report test log p_θ(x) values from training a separate Variational Autoencoder (vae) at each β_1, but this grid search is prohibitively expensive in practice. Our moment-spacing schedule is an adaptive method for choosing points, which yields near-optimal performance on Omniglot and provides notable improvement over the elbo.

Figure 3.9: Scheduling performance by number of partitions K ∈ {2, 5, 10, 30, 50} on Omniglot, with S = 50. Legend shows (min / max) test log p_θ(x) across K. (a) tvo with reinforce gradients: linear (-107.63 / -105.62), moments (-106.09 / -104.89), log-uniform (-108.68 / -104.68), coarse-grained (-109.10 / -104.90), versus iwae (-104.59) and iwae dreg (-104.01). (b) tvo with doubly-reparameterized gradients: linear (-105.81 / -104.75), moments (-105.31 / -104.14), log-uniform (-108.57 / -104.09), coarse-grained (-109.26 / -104.07).

Moment Spacing Dynamics   We seek to understand the dynamics of our moment-spacing schedule in Fig. 3.7, visualizing the choice of β_t across training epochs with T = 20. Our intermediate distributions concentrate near β = 0 at the beginning of training, since q_φ(z|x) and p_θ(z|x) are mismatched and the tvo integrand rises sharply away from q_φ(z|x). This effect is particularly dramatic within the first five epochs. While the curve is still fairly noisy within the first twenty epochs, it begins to flatten as training progresses and q_φ(z|x) learns to match p_θ(z|x). This is reflected in the β_t achieving a given proportion of the moments difference (eubo − elbo) moving to higher values. We found the moment-spacing partitions to be relatively stable after 100 epochs.
We nd that our adaptive scheduling matches the best performance from grid search, with the optimal intermediate distribution occurring at 1 0:3 on both datasets. With a single, properly chosen intermediate distribution, we nd that the tvo can achieve notable improvements over the elbo at minimal additional cost. Evaluating Scheduling Strategies From a numerical integration perspective, the tvo bounds should become arbitrarily tight as T!1. However, Masrani et al. (2019) observe that additional partitions can be detrimental for learning in practice. We thus investigate the performance of our moment spacing schedule with a varying number of partitions. We plot test log likelihood at T =f2; 5; 10; 30; 50g, and compare against three scheduling baselines: linear, log-uniform spacing, and the `coarse-grained' schedule from Grosse et al. (2013). We begin the log-uniform spacing at 1 = 0:025, a choice which results from grid search over 1 for T > 2 in Masrani et al. (2019). We observe in Fig. 3.9a that the moment scheduler provides the best performance at high and low T , while the log-uniform schedule can perform best for particular T . As previously observed, all scheduling mechanisms still suer degradation in performance at large T . Reparameterized TVO Gradients While our scheduling techniques do not address the detri- mental eect of using many intermediate , we now investigate the use of our reparameterization gradient estimator from Sec. 3.6. Repeating the previous experiment in Fig. 3.9b, we nd that reparameterization helps preserve competitive performance for high T and improves overall model likelihoods. Our moments schedule is still particularly useful at lowT , while the various scheduling methods converge to similar performance with many partition points. All scheduling techniques will be equivalent in the limit, as discussed in App. D of Brekelmans et al. (2020a). Comparison with IWAE Finally, we compare tvo with moments scheduling against the Im- portance Weighted Autoencoder (iwae) (Burda et al., 2015a) and doubly reparameterized iwae dreg (Tucker et al., 2018) for model learning and posterior inference. It is interesting to note that iwae corresponds to a direct estimate of (1), with the snis normalizer P S i=1 w 1 i in tvo (3.12) appearing inside the log. In Fig. 3.10, we observe that tvo with reparameterization gradients achieves model learning performance in between that of iwae and iwae dreg, with lower KL divergences across all values 82 10 30 50 100 Importance Samples (S) 106.5 106.0 105.5 105.0 104.5 104.0 log p (x) TVO Original (-106.55 / -104.43) TVO Reparam (-105.26 / -104.05) IWAE (-105.45 / -104.31) IWAE-DReG (-105.00 / -103.89) (a) Omniglot Test logp (x) 10 30 50 100 Importance Samples (S) 8 9 10 11 12 13 14 15 16 D KL [q |p ] TVO Original (9.69 / 12.91) TVO Reparam (9.01 / 12.84) IWAE (10.27 / 15.40) IWAE-DReG (9.75 / 14.74) (b) Omniglot Test D KL [q (zjx)jjp (zjx)] Figure 3.10: Model Learning and Inference by number of samples S with T = 5. Legend shows (min / max) values across S. 
Figure 3.11: Model learning and inference by number of importance samples S ∈ {10, 30, 50, 100}, with T = 5. (a) mnist test log p_θ(x): tvo original (-85.80 / -84.57), tvo reparam (-85.35 / -84.50), iwae (-85.23 / -84.44), iwae dreg (-85.45 / -84.41). (b) mnist test D_KL[q_φ(z|x) || p_θ(z|x)]: tvo original (6.89 / 8.88), tvo reparam (6.64 / 9.06), iwae (7.79 / 13.44), iwae dreg (7.71 / 12.79).

We repeat this experiment for mnist in Fig. 3.11, where the tvo matches iwae dreg model learning with better inference. Although we tend to obtain lower D_KL with lower model likelihood, we do not observe strong evidence of the signal-to-noise ratio issues of Rainforth et al. (2018b) on either dataset. The tvo with reparameterization thus appears to provide a favorable tradeoff between model learning and posterior inference.

3.8 Discussion: Likelihood Ratio Exponential Families

The preceding sections are adapted from the ICML 2020 conference paper (Brekelmans et al., 2020a). In the following two sections, we provide additional perspectives on the Thermodynamic Variational Objective from the workshop paper (Brekelmans et al., 2020c), further study by the authors, and recent related work in the literature. We begin by connecting the likelihood ratio exponential family interpretation of the geometric mixture path to Rényi vi (Li and Turner, 2016; Dieng et al., 2017) and hypothesis testing (Nielsen, 2013; Cover and Thomas, 2012). This allows us to identify the point where the tvo integrand switches from a lower bound to an upper bound in Sec. 3.8.4, as in the workshop paper (Brekelmans et al., 2020d). In Sec. 3.8.2, we find that our framework yields a simple proof that the variance of importance sampling weights scales exponentially with the kl divergence between target and proposal.

3.8.1 Rényi Divergence Variational Inference

In this section, we show that each intermediate log partition function log Z_β in our likelihood ratio exponential family corresponds to a scaled version of the Rényi vi objective L_α (Li and Turner, 2016), with order α = 1 − β. To begin, we recall the definition of Rényi's divergence,

D_α[p || q] = 1/(α − 1) log ∫ q(ω)^{1−α} p(ω)^{α} dω.

Note that this involves geometric mixtures similar to (3.14). Pulling out the factor of β log p(x) to consider normalized distributions over z|x, we obtain the objective of Li and Turner (2016). This is similar to the elbo, but instead subtracts a Rényi divergence of order 1 − β,

ψ(β) = log ∫ q_φ(z|x)^{1−β} p_θ(x,z)^{β} dz
     = β log p(x) − (1 − β) D_β[p_θ(z|x) || q_φ(z|x)]
     = β ( log p(x) − D_{1−β}[q_φ(z|x) || p_θ(z|x)] ) := β L_{1−β},

where we have used the skew symmetry property D_β[p || q] = (β/(1 − β)) D_{1−β}[q || p] for 0 < β < 1 (Van Erven and Harremoës, 2014). Note that ψ(0) = 0 and ψ(1) = L_0 = log p_θ(x), as in Li and Turner (2016) and Sec. 3.3.

The fundamental theorem of calculus also suggests that the Rényi vi objective may be obtained by integrating the tvo curve from 0 to β,

β L_{1−β} = ψ(β) = ∫_0^β E_{π_{β'}}[ log p_θ(x,z)/q_φ(z|x) ] dβ'.   (3.38)

The cubo objectives of Dieng et al. (2017) are a special case of Rényi vi, corresponding to upper bounds on log p(x) and log partition functions with β ∈ [1, 2]. From our exponential family perspective, there is no explicit restriction that our natural parameters remain in the unit interval, with the χ² divergence at β = 2 of notable interest due to its link with importance sampling variance (as in Sec. 3.8.2 below).
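Computationally, ψ(β) can be estimated from the same log-weights used throughout this chapter by a log-mean-exp, and the relation ψ(β) = β L_{1−β} then gives a Monte Carlo Rényi vi objective; β = 2 exposes the second moment of the importance weights discussed next. A minimal sketch with illustrative names (the log-mean-exp estimator is consistent but biased):

import numpy as np
from scipy.special import logsumexp

def psi_hat(log_w, beta):
    # Estimate of psi(beta) = log E_q[w^beta] from log_w_i = log p(x, z_i) - log q(z_i | x).
    return logsumexp(beta * log_w) - np.log(len(log_w))

def renyi_vi_hat(log_w, beta):
    # L_{1 - beta} = psi(beta) / beta: approaches the ELBO as beta -> 0 and log p(x) at beta = 1.
    return psi_hat(log_w, beta) / beta

# psi_hat(log_w, 2.0) estimates log E_q[w^2] = log(Var_q[w] + p(x)^2),
# the second-moment quantity whose link to the KL divergence is made precise in Sec. 3.8.2.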
3.8.2 TVO Proof that Importance Sampling Variance Scales Exponentially in the KL Divergence

The fact that the importance sampling variance scales exponentially in the kl divergence D_KL[p_θ(z|x) : q_φ(z|x)] between target and proposal can be easily shown from the tvo perspective. Similar results are found in Song and Ermon (2019); Sanz-Alonso (2018). First, recall the definition of the log partition function of the likelihood ratio exponential family. We slightly adapt our approach for the variational inference setting, and instead consider the normalized target distribution p_θ(z|x) as the endpoint,

ψ(β) = log ∫ q_φ(z|x)^{1−β} p_θ(x,z)^{β} dz.   (3.39)

We first note that ψ(2) is related to the variance of importance weights,

ψ(2) = log ∫ q_φ(z|x)^{−1} p_θ(x,z)^{2} dz = log E_{q_φ(z|x)}[ (p_θ(x,z)/q_φ(z|x))² ] = log( Var_{q_φ(z|x)}[ p_θ(x,z)/q_φ(z|x) ] + p_θ(x)² ),

where Var_{q_φ(z|x)}[p_θ(x,z)/q_φ(z|x)] = E_{q_φ(z|x)}[(p_θ(x,z)/q_φ(z|x))²] − E_{q_φ(z|x)}[p_θ(x,z)/q_φ(z|x)]², and the second term simplifies to p_θ(x)².

Now, recall that ψ(β) is a convex function of β, which means that the first-order Taylor approximation will everywhere underestimate the function (as in the definition of the Bregman divergence). We will consider the Taylor approximation around β = 1, with

ψ(1) = log ∫ p_θ(x,z) dz = log p_θ(x),   (3.40)
∇ψ(1) = E_{p_θ(z|x)}[ log p_θ(x,z)/q_φ(z|x) ] = log p_θ(x) + D_KL[p_θ(z|x) : q_φ(z|x)].   (3.41)

The first-order Taylor approximation now suggests that

ψ(2) ≥ ψ(1) + (2 − 1) ∇ψ(1)   (3.42)
log( Var_{q_φ(z|x)}[ p_θ(x,z)/q_φ(z|x) ] + p_θ(x)² ) ≥ log p_θ(x) + log p_θ(x) + D_KL[p_θ(z|x) : q_φ(z|x)]   (3.43)
⟹ Var_{q_φ(z|x)}[ p_θ(x,z)/q_φ(z|x) ] ≥ p_θ(x)² ( e^{D_KL[p_θ(z|x) : q_φ(z|x)]} − 1 ).   (3.44)

3.8.3 Variational Representations and Hypothesis Testing

Grosse et al. (2013) note that any distribution along the geometric mixture path can be given a variational representation as the solution to an expected kl divergence minimization,

π_β(z) = argmin_{r(z)} (1 − β) D_KL[r(z) || π_0(z)] + β D_KL[r(z) || π_1(z)].   (3.45)

We proceed to interpret Eq. (3.45) as a Bregman information (or gap in Jensen's inequality) (Banerjee et al., 2005c), and as describing an optimal decision rule for hypothesis testing using the Neyman-Pearson lemma.

Bregman Information   Banerjee et al. (2005c) define the Bregman information as the minimum expected Bregman divergence to a representative point in the second argument. Regardless of the Bregman generator, the optimal representative corresponds to the mean over the input arguments (Banerjee et al., 2005c). Since D_KL[r(z) || π_0(z)] = D_ψ[0 : β_r] when optimizing over r(z) in the exponential family, we can rewrite Eq. (3.45) as

π_β = argmin_{π_{β_r}} (1 − β) D_ψ[0 : β_r] + β D_ψ[1 : β_r],   whose optimum is attained at β_r = (1 − β) · 0 + β · 1 = β.   (3.46)

At this optimum, the expected kl divergence (3.46) can be written as a gap in Jensen's inequality for the convex function ψ(β) (Banerjee et al., 2005c),

J_{ψ, {1−β, β}, {0, 1}} = (1 − β) D_ψ[0 : β] + β D_ψ[1 : β]   (3.47)
                        = (1 − β) ψ(0) + β ψ(1) − ψ(β).   (3.48)

We visualize this gap in Jensen's inequality in Fig. 3.14. Nielsen (2019, 2010) utilizes J_ψ to construct additional divergence measures, which Deasy et al. (2020) explore in the context of variational autoencoders. As shown in Nielsen and Nock (2011) or App. E1 of Brekelmans et al. (2020d), we can also view J_ψ, or the expected kl divergence (3.45), as a Rényi divergence of order β,

J_{ψ, {1−β, β}, {0, 1}} = (1 − β) D_KL[π_β(z) || π_0(z)] + β D_KL[π_β(z) || π_1(z)]   (3.49)
                        = (1 − β) D_β[π_1(z) : π_0(z)].

Grünwald (2007) and Harremoës (2006) provide additional coding interpretations of the Rényi divergence. We discuss the Bregman information in further detail in Ch. 5.
Neyman-Pearson Lemma   Suppose we have access to n i.i.d. observations from an unknown distribution, z_{1:n} ∼ r(z), and are interested in testing the hypotheses that either H_0 : r(z) = π_0(z) or H_1 : r(z) = π_1(z). The Neyman-Pearson lemma states that the likelihood ratio test is optimal, in the sense that, for any other decision region with type-1 error Pr(e_1) = R, the type-2 error is no better than that of the likelihood ratio test (Cover and Thomas (2012) Ch. 11, Borade and Zheng (2006)). (While the Neyman-Pearson lemma is often obtained via the discrete method of types (Cover and Thomas, 2012), Csiszár (1998) gives a derivation for the continuous setting.) This is known as the size-power tradeoff, and we will see that the size of each region will be governed by a thresholding parameter (Eguchi and Copas, 2006). In particular, given observed samples z_{1:n}, the likelihood ratio test concludes z_{1:n} ∼ π_1(z) if

z_{1:n} ∈ A_n(π_1, η̄) := { z_{1:n} | (1/n) Σ_{i=1}^{n} log π_1(z_i)/π_0(z_i) ≥ η̄ }   (3.50)

for some threshold η̄. Let a type-1 error occur when n i.i.d. draws {z_i}_{i=1}^{n} from π_0(z) yield empirical expectations exceeding the threshold η̄. Sanov's Theorem and large deviation theory (Cover and Thomas (2012) Ch. 11, Csiszár and Shields (2004)) state that the asymptotic error exponent corresponds to a kl divergence,

lim_{n→∞} (1/n) log Pr(e_1) = −D_KL[r*(z) || π_0(z)],   (3.51)
where   r*(z) = argmin_{r(z) ∈ M_{η̄}} D_KL[r(z) || π_0(z)].   (3.52)

The feasible set M_{η̄} := { r(z) | E_r[ log π_1(z)/π_0(z) ] = η̄ } reflects an expectation constraint corresponding to a given decision threshold, and the error exponent is obtained by minimizing the divergence subject to this constraint. The projection of π_0(z) onto the set M_{η̄} exactly matches the conjugate or maximum entropy optimization for a given expected sufficient statistic, and thus r*(z) lies within the likelihood ratio exponential family,

r*(z) = π_0(z) exp{ β* log π_1(z)/π_0(z) − ψ(β*) }.   (3.53)

Note that we also have ψ*(η̄) = D_KL[π_{β*}(z) || π_0(z)] as the conjugate function, which reflects the optimal value in Eq. (3.52). As shown in Fig. 3.12, Sanov's Theorem implies a similar expression for the asymptotic type-2 error, when draws from π_1(z) achieve a lower expected likelihood ratio than η̄. Expressing the conditions of the Neyman-Pearson lemma in terms of these asymptotic error exponents, we can write

Pr(e_2) = min_{r(z)} D_KL[r(z) || π_1(z)]   subj. to   D_KL[r(z) || π_0(z)] = R.   (3.54)

Using a Lagrange multiplier λ = (1 − β)/β to enforce the constraint, we obtain the variational form (3.45),

(1/β) Pr(e_2) = min_{r(z)} ((1 − β)/β) D_KL[r(z) || π_0(z)] + D_KL[r(z) || π_1(z)].   (3.55)

Thus, any distribution in our likelihood ratio exponential family corresponds to a likelihood ratio test with decision threshold η̄, which is optimal for a type-1 error region of size ψ*(η̄) = R.
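The tilting in Eq. (3.53) and the conjugate value ψ*(η̄) can be illustrated numerically. The sketch below uses two one-dimensional Gaussian hypotheses of our own choosing and simple grid quadrature; all names and constants are illustrative.

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

z = np.linspace(-10.0, 10.0, 4001)
dz = z[1] - z[0]
log_pi0 = norm(0.0, 1.0).logpdf(z)      # H0 density on the grid
log_pi1 = norm(1.5, 1.0).logpdf(z)      # H1 density on the grid
llr = log_pi1 - log_pi0                 # log likelihood ratio = sufficient statistic

def tilt(beta):
    # log r_beta(z) with r_beta proportional to pi_0 exp(beta * llr), and psi(beta), by quadrature.
    log_unnorm = log_pi0 + beta * llr
    psi = logsumexp(log_unnorm) + np.log(dz)
    return log_unnorm - psi, psi

log_r, psi = tilt(0.7)                  # any beta in (0, 1) indexes a likelihood ratio test
r = np.exp(log_r)
eta_bar = np.sum(r * llr) * dz          # decision threshold eta_bar = E_{r*}[llr], Eq. (3.50)
kl_r_pi0 = np.sum(r * (log_r - log_pi0)) * dz
print(kl_r_pi0, 0.7 * eta_bar - psi)    # both approximate psi*(eta_bar), the type-1 error exponent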
Figure 3.12: Sanov's Theorem. Each exponential arc intersects the manifold M_{η̄} = { r(z) | E_{r(z)}[φ(z)] = η̄ } with a right angle, due to a generalized Pythagorean theorem for the kl divergence (Sec. 1.6, 2.8 of Amari (2016)). Note that φ(z) = T(z) indicates the sufficient statistics of the exponential family or, in this case, the log likelihood ratio.

3.8.4 Finding the Point on the TVO Integrand where E_{π_β}[log p_θ(x,z)/q_φ(z|x)] = log p_θ(x)

Chernoff Information   While each choice of β determines a likelihood ratio test and error region, how should we choose this parameter? Regardless of the prior probabilities p_0, p_1 that we might assign to each hypothesis in a Bayesian setting, the Chernoff information provides the best achievable error exponent in the large sample limit (Nielsen (2013), Cover and Thomas (2012) Ch. 11),

C* := −min_β C(β) = −min_β log ∫ π_0(z)^{1−β} π_1(z)^{β} dz.   (3.56)

Notice that the Chernoff information in (3.56) involves the log partition function C(β) for the geometric mixture between normalized π_0 and π_1, whereas we have defined ψ(β) = log ∫ π̃_0(z)^{1−β} π̃_1(z)^{β} dz using unnormalized π̃_0 and π̃_1. Rewriting C(β) using π_0 = π̃_0/Z_0 and π_1 = π̃_1/Z_1, we can pull out factors of (1 − β) log Z_0 and β log Z_1 to obtain the relation C(β) = ψ(β) − (1 − β) log Z_0 − β log Z_1. The Chernoff information can thus be written using the Jensen gap J_ψ from (3.48),

C* := −min_β C(β) = max_β { (1 − β) ψ(0) + β ψ(1) − ψ(β) }.   (3.57)

The optimum over β, or β*, is denoted the Chernoff point (Nielsen, 2013).
In Brekelmans et al. (2020d) App. E2, we derive the moment-matching condition

η_{β*} = ( ψ(β_1) − ψ(β_0) ) / ( β_1 − β_0 ).   (3.58)
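For two normalized hypotheses (so that ψ(0) = ψ(1) = 0), Eq. (3.58) says the Chernoff point is where η_β crosses zero. A minimal sketch, using bisection and simple quadrature for two illustrative one-dimensional Gaussian hypotheses of our own choosing:

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

z = np.linspace(-10.0, 10.0, 4001); dz = z[1] - z[0]
log_pi0, log_pi1 = norm(0.0, 1.0).logpdf(z), norm(1.5, 1.0).logpdf(z)
llr = log_pi1 - log_pi0                       # sufficient statistic log pi_1 / pi_0

def eta(beta):
    # E_{pi_beta}[llr] for the geometric mixture pi_beta proportional to pi_0^{1-beta} pi_1^beta.
    log_unnorm = log_pi0 + beta * llr
    psi = logsumexp(log_unnorm) + np.log(dz)   # quadrature estimate of psi(beta)
    return np.sum(np.exp(log_unnorm - psi) * llr) * dz

lo, hi = 0.0, 1.0
for _ in range(60):                            # eta(beta) is increasing, so bisection applies
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if eta(mid) < 0.0 else (lo, mid)
print(0.5 * (lo + hi))                         # Chernoff point beta*, where eta_{beta*} = psi(1) - psi(0) = 0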
sha1_base64="BdZg5yERDb73UauFiWXKSilchqA=">AAAIDXichVVtcxs1ED4XqNPwlsLHftGQCeNkTCZu02kZ2plMSYEOMJTQtJn2XI9Ot3fWRKeTJTmxI+438Gv4BnzlN/BvWJ3txLYcuJnEe/s8u89q9yQlSnBj9/b+adx45933bjbXbq2//8GHH328cfuTl6YcagbHrBSlPkmoAcElHFtuBZwoDbRIBLxKTr/2+Ksz0IaX8oUdK+gWNJc844xadPU2/owtjKz7RgOQpxJ0Pq5IPBjSlMTK8FYsMFVKt8nnj0lshorEghfcmp6LwVKktsmUQmKWlpZ4N/miDn7rdqqWf98mcbw+ia7FSMnYUBty3gdJLuUyTZlLZ4GVS8lEwitPJR77zBPeERIOq97G5t7uXv2Q0OhMjc2Dtah+nvdu33wYpyUbFiAtE9SYN3f3lW2DzLHX/a6j2nImoFqPhwYUZac0hzdoSlqA6bq63xXZQk9KslLjn7Sk9s5HOFoYMy4SZBbU9s0y5p2X2NaClM0edh2XamhBsolSNhTElsRPj6RcA7NijAZlmmOxhPUpdsTijBeKftHpOl+dT7MACJy87MyJLFSX0AREtVCUUyOfyCAxhQy/s3q9bmhYT0NauaNvn1Suc/9B+95++/49ZEk4Z2VRUJm6OIHK/8u5dDCQVGvqP7CJgwqeS8ywFAI+BNCa4aR+uwoPJKQMRHb+X6WOul5nJxA6G1XOxX5qSUZGywnPzufQ8wAdz6HjAL2YQy8CNJtDs2X0VCB62HPf/1AtQykwxBRuVtuv91IL1/Arqm0H3ZCeOUCm6vOad4G8UcDLQYYZ8QxYkVKINKQGLOVXrlqrEqjRQvzK8IHXcIMZuFWvIaXX0i/mVrlSlA+Q8aznU5589Trs0+ByFIl7WnleMK18blp50BXq0YTn/lDDUweWCfqKoFcS6m6omkQ1mTZneUtkoMxVGVgU1ejhopRBvSCE0rwAT0f7LTanfg3GwRXowrMUX4HNUih+bQZZcjw38U6bVcaowCa2JuUGvU4L6r+AtOfJoyAb56nhdT2lv+DAOvTgjvZOzz0EPOA1/Ig6P2Hl1JZ6x8VU53Va/I3b3vovIpczIlp413SWb5bQeHl3t7O/++XP+5sHjyaXTrQW3Yk+i1pRJ3oQHUTfRc+j44g1NhvPGkeNX5q/NX9v/tH8a0K90ZjGfBotPM2//wUA5vkb</latexit> ⇤ (⌘ )<latexit sha1_base64="4/ysG8VUnprM9Natvajo/+9nFw4=">AAAHeXichVXbbhMxEDW3ptwLPPKyUFVKo6hqaBEgeKi4CBAgLqJQwYbI653dWPV6Ha/TJjX7DXwNr/AdfAsvjDcNTeIUVkoyO+fMmfFMbEdK8MKsr/86cfLU6TMLtcWz585fuHjp8tKVqx+KvK8ZbLNc5HonogUILmHbcCNgR2mgWSTgY7T7yOEf90AXPJfvzVBBO6Op5Aln1KCrs7QaGhiYSsdqiEsbqoJ/sY2yHoKhHRsK1IppuVp2lpbX19arJ/CN1qGxvLVIqudN58rC3TDOWT8DaZigRfH51qYyTZAprqvbtlQbzgSU58J+AYqyXZrCZzQlzaBo26qmMlhBTxwkucaPNEHlnYywNCuKYRYhM6OmW8xizvkXW5lKZZK7bcul6huQbJQp6YvA5IHrVBBzDcyIIRqUaY7FBqxLNWUG+zlV9PtW27rqnMwUILDLsjWRZKq6iEYgyqmirBo4oQKJMSQ409Fk+gXrVNN59/RhaVu37zQ3Npu3N5AlYZ/lWUZlbMMISveVcmmhJ6nWdFgGhw4qeCpRYSYEXAigNcaD6u0o3EshpZek8f8sVdTxeRpeor1BaW3ophYlwWBWcG9/At330OEEOvTQgwn0wEOTCTSZRXcFoo879sXLchaKgSGmcMeYLu6cMqjjGr5itlWvG9Ixe8hUXV7xDpA38HgpSF+xGcyTFCL2qR5LuZWr+jwBNZiKnxveczlsbwyuVGuI6bH0g4lVzk3Ke8h43nGSO/c/+X3q/R1FZJ+UjudNK52YVup1hTo04mmAJp46MEvQRwQ9l1B1Q1UkqoPD5sxuiQRUcVQGFkU1erjIpVcvCKE0z8DR0f6CzalevXFwBTpzLMXnYGMJxY9VkDnHcxPvj3FljApsYn1UrtfrOKPuHxB3HHngqXEeF7yqJ3eXCRiLHtzRzum4jwEPeA2vMM9rrJyaXDdsSHVayeJv2HTWv4hcjolo4V3Tmr1ZfOPDrbXW5tq9t5vLWw9Glw5ZJNfJTVInLXKHbJFn5A3ZJox8I9/JD/Jz4XftRq1ea4yoJ08cxlwjU09t4w8sN8XW</latexit> ⌘ <latexit sha1_base64="L6ScZ3aF0vuvVW7VOdwitZPL9nA=">AAAHcHichVXbbhMxEDW3ptxbeEHigYWqUqlC1dAiQPBQcREgQFxEaQUbIq93dmPV63W8TpvU7Ctfwyv8C7/BFzDeNDSJU1gpyeycM2fGM7EdKcELs7r669jxEydPzdRmT585e+78hYtz85c+FnlXM9hkucj1dkQLEFzCpuFGwLbSQLNIwFa089jhW7ugC57LD6avoJnRVPKEM2rQ1ZoLQgM9U+lYDXFpb4VgaMuGAjViWpatuYXVldXqCXyjcWAsbMyS6nnbmp+5F8Y562YgDRO0KD7fXlemDjLF5bSblmrDmYDyTNgtQFG2Q1P4jKakGRRNW5VSBovoiYMk1/iRJqi8oxGWZkXRzyJkZtS0i0nMOf9ii2OpTHKvablUXQOSDTIlXRGYPHANCmKugRnRR4MyzbHYgLWppsxgG8eK/tBoWledkxkDBDZXNkaSjFUX0QhEOVaUVT0nVCAxhgRHORhIt2Ctaijvnz0qbePO3fraev3OGrIk7LE8y6iMbRhB6b5SLi10JNWa9svgwEEFTyUqTISACwG0hnhQvR2Geymk9JIs/z9LFXV0nmUv0W6vtDZ0U4uSoDcpuLs3gu55aH8E7Xvo/gi676HJCJpMojsC0Sct+/JVOQnFwBBTuGFMGzdOGSzhGr5itpteN6RjdpCp2rzi7SOv5/FSkL5iPZgmKUTsUz2WcitXS9MEVG8sfmp4x+WwnSG4WK0hpkfS90dWOTUp7yDjRctJbj/45Pep83cUkX1aOp43rXRkWqnXFerQiKcBmnjqwCRBHxL0VELVDVWRqA4OmjO5JRJQxWEZWBTV6OEil169IITSPANHR/sLNqd69cbBFejMsRSfgg0lFD9SQeYcz028NoaVMSqwiUuDcr1exxl1/4C45cg9T43zuOBVPbm7Q8BY9OCOdk7HfQJ4wGt4jXneYOXU5HrZhlSnlSz+hnVn/YvI5ZCIFt41jcmbxTc+3l5prK/cf7e+sPFwcOmQWXKV3CBLpEHukg3ynLwlm4SRb+Q7+UF+zvyuXaldq10fUI8fO4i5TMae2vIfVtXCMA==</latexit> ( )<latexit 
sha1_base64="hrEnsqzl3FLBMgGHZZwCzev8Pkg=">AAAHcHichVXbbhMxEDW3ptwLvCDxwEJVKVShaqAIEDxUXAQIEBdRWsGGyOud3Vj1eh2v0yY1+8rX8Ar/wm/wBYw3DU3iFCwlmZ1z5sx4JmtHSvDCrK7+OnL02PETc7X5k6dOnzl77vzChYsfi7ynGWywXOR6K6IFCC5hw3AjYEtpoFkkYDPafuzwzR3QBc/lBzNQ0MpoKnnCGTXoai8EoYG+qXRsJHpQ2puhKng9FKgR0xtle2FxdWW1WoFvNPeNxfV5Uq237Qtz98I4Z70MpGGCFsXnW2vKNECmuJ1Oy1JtOBNQngp7BSjKtmkKn9GUNIOiZatSymAJPXGQ5Bo/0gSVdzzC0qwoBlmEzIyaTjGNOedfbGkilUnutSyXqmdAsmGmpCcCkweuQUHMNTAjBmhQpjkWG7AO1ZQZbONE0R+aLeuqczITgMDmyuZYkonqIhqBKCeKsqrvhAokxpDgKIcD6RWsrSEu7ftnj0rbvHO3cXutcec2siTssjzLqIxtGOHQ8Cvl0kJXUq3poAz2HVTwVKLCVAi4EEBrhAfV00G4l0JKL8ny/7NUUYfnWfYS7fRLa0M3tSgJ+tOCO7tj6K6HDsbQgYfujaF7HpqMock0ui0QfdK2L1+V01AMDDHVtqHpgKFlUMc9fMVsN7xuSMfsIlN1eMXbQ17f46UgfcVGMEtSiNineizldq7qswRUfyJ+ZnjX5bDdEbhU7SGmh9L3xnY5MynvIuNF20luPfjk96n7dxSRfVo6njetdGxaqdcV6tCIpwGaeOrANEEfEPRMQtUNVZGoDvabM/1KJKCKgzKwKKrRw0UuvXpBCKV5Bo6O9hdsTvXojYMr0JljKT4DG0kofqiCzDmem3htjCpjVGAT68NyvV7HGXX/gLjtyH1PjfO44FU9ubtDwFj04BvtnI77BPCA1/Aa87zByqnJ9bINqU4rWfwNG876F5HLEREtvGua0zeLb3y8tdJcW7n/bm1x/eHw0iHz5Aq5TuqkSe6SdfKcvCUbhJFv5Dv5QX7O/a5drl2tXRtSjx7Zj7lEJlZt+Q+u8cGp</latexit> ⇤<latexit sha1_base64="m8nVgFoP/qYUlLvqgUQGNeGILaQ=">AAAHl3ichVVtbxM5EDYvRwr3QoFPiA+sqCqVKooS2goqPlDxfoLTAaJQQcLi9c5urHq9jtdpk5r9yK/hK/yY+zc33jQ0Wad3lqLMzvPMPOOZ9TpSghem3f7nzNlz53+50Fi6eOnX337/4/LylavvinyoGeyyXOR6L6IFCC5h13AjYE9poFkk4H20/8jh7w9AFzyXb81YQS+jqeQJZ9SgK1y+2TUwMlUeq/MojMQQStuNwNBPdr0sw+WVdqtdrcA3OsfGys4Sqdar8MqFz904Z8MMpGGCFsXHO5vKNEGmuJt+z1JtOBNQXuoOC1CU7dMUPqIpaQZFz1aVlMEqeuIgyTX+pAkq72yEpVlRjLMImRk1/aKOOedPbHVOyiT3epZLNTQg2UQpGYrA5IHrTxBzDcyIMRqUaY7FBqxPNWUGuzhX9NtOz7rqXJo5QGBvZWdGZK66iEYgyrmirBq5RAUSY0hwkpN5DAsWaohL++bZw9J2tu42NzabWxs11snUdBqVtt1sbWxsNVvb28iTcMjyLKMydiOt5ppyaWEgqdZ0XAbHDip4KlGpFgIuBNCa4kH1dBLuSUjpiaz/v0oVdbrOuid0MCqt7brpRkkwqic8OJxBDz10PIOOPfRoBj3y0GQGTerovkD0cWhfvCzrUAwMMRXarunj0SqDNdzDF1S77XVDOuYAmarPK94R8kYeLwXpZ2wGi1IKEftUj6XcztXaogRqNBe/MHzgNOxgCq5We4jpqfSjmV0uFOUDZPwZupR79z/4fRr8HEVkn5SO500rnZlW6nWFOjTiaYAmfp2gTtAnBL2QUHVDVSSqg+Pm1I9EAqo4KQOLoho9XOTSqxeEUJpn4Ohof8LmVI/eOLgCnTmW4guwaQrFT80gc47fV7xdppUxKrCJa5NyvV7HGXVvQBw68sjLxnlc8Kqe3F01YCx68EQ7p+M+BrwINPyFOn9j5dTket12qU6rtPjfbTrrv4hcTolo4Z3Uqd9AvvHuTquz2dp+vbmy83ByOZElcoPcImukQ+6SHfKcvCK7hJGv5Bv5Tn40rjceNJ42nk+oZ88cx1wjc6vx+l9Y6dAb</latexit>D KL [⇡ ⇤ ||⇡ 1 ]<latexit sha1_base64="IfBBpeq5xourjsLI1o0DZ6TcM78=">AAAHq3ichVVtb9s2EGZfVqfdS9P2474IDQJkgWfYdYI26JdgbdEO27Bua9qgkatS1EkmQlE0RSd2GP2Q/pp93X7C/s2OctzYorMRsH2657m7h3emGCvBS9Pt/nPt+o2bX9xqrd2+8+VXX39zd/3e/bdlMdYMDlghCn0Y0xIEl3BguBFwqDTQPBbwLj5+5vB3J6BLXsg3ZqpgkNNM8pQzatAVrfdDAxNT57HjkkUakso+j+xPP1dHoeKRDWMw9IPdrqrzc+foDapofaPb6dYr8I3ehbGxv0bq9Tq6d+tjmBRsnIM0TNCyPHq0o0wbZIY7HA4s1YYzAdWdcFyCouyYZnCEpqQ5lANbq6uCTfQkQVpo/EgT1N7FCEvzspzmMTJzaoZlE3POz9jmUimTPhlYLtXYgGSzSulYBKYIXM+ChGtgRkzRoExzFBuwIdWUGezskug3vYF16lyaJUBgv2VvociSupjGIKolUVZNXKISiQmkON3GjH5/+UNle7uP2/2d9m6/wdJFHMViDJXVWVzZbrvT7++2O3t7yJNwyoo8pzJxs63cV8alhZGkWtNpFVw4qOCZxEqNEHAhgNYcD+qny3CvhJReke3/r1JHXV1n2yt0MqmsDd104zSYNBOenC6gpx46XUCnHnq2gJ55aLqApk30WCA6O09NKAGGmMIjZoZ4xqpgC/dwjtW+87ohHXOETDXkNe8MeROPl4H0M7aDVSmFSHyqx1Ju52prVQI1WYpfGT5yNexoDm7We0jolfSzhV2uLMpHyPgxcikPn773+zT6PIrYvqgcz5tWtjCtzOsKdWjMswBNfDtBk6AvCXoloe6GqklUBxfNaR6JFFR5KQNFUY0eLgrp6QUhlOY5ODraH7A59aM3Dq5A546l+ApsnkLxKzPIguP7FW+cuTJGBTZxaybX63WSU/cPSCJHnnjZOE9KXusp3PUDxqIHT7RzOu5zwItAwy9Y51dUTk2ht21IdVanxd+w7az/InI5J6KFd1KveQP5xttHnd5OZ++3nY39/uxyImvkW/KQbJEeeUz2ySvymhwQRj6RP8lf5O/W960/Wu9b4Yx6/dpFzAOytFrwLzJ72KI=</latexit>D KL [⇡ ⇤ ||⇡ 0 ]<latexit 
sha1_base64="4dB3lG6U6b3giZY8ogArFUeD+sU=">AAAHq3ichVVtb9s2EGZfVqfdS9P2474IDQJkgWfYdYI26JdgbdEO27Bua9qgkatS1EkmQlE0RSd2GP2Q/pp93X7C/s2OctzYorMRsH2657m7h3emGCvBS9Pt/nPt+o2bX9xqrd2+8+VXX39zd/3e/bdlMdYMDlghCn0Y0xIEl3BguBFwqDTQPBbwLj5+5vB3J6BLXsg3ZqpgkNNM8pQzatAVrfdDAxNT57HjkkUakso+j+xPP1dHoeKRDWMw9IPdrqrzc+foDqpofaPb6dYr8I3ehbGxv0bq9Tq6d+tjmBRsnIM0TNCyPHq0o0wbZIY7HA4s1YYzAdWdcFyCouyYZnCEpqQ5lANbq6uCTfQkQVpo/EgT1N7FCEvzspzmMTJzaoZlE3POz9jmUimTPhlYLtXYgGSzSulYBKYIXM+ChGtgRkzRoExzFBuwIdWUGezskug3vYF16lyaJUBgv2VvociSupjGIKolUVZNXKISiQmkON3GjH5/+UNle7uP2/2d9m6/wdJFHMViDJXVWVzZbrvT7++2O3t7yJNwyoo8pzJxs63cV8alhZGkWtNpFVw4qOCZxEqNEHAhgNYcD+qny3CvhJReke3/r1JHXV1n2yt0MqmsDd104zSYNBOenC6gpx46XUCnHnq2gJ55aLqApk30WCA6O09NKAGGmMIjZoZ4xqpgC/dwjtW+87ohHXOETDXkNe8MeROPl4H0M7aDVSmFSHyqx1Ju52prVQI1WYpfGT5yNexoDm7We0jolfSzhV2uLMpHyPgxcikPn773+zT6PIrYvqgcz5tWtjCtzOsKdWjMswBNfDtBk6AvCXoloe6GqklUBxfNaR6JFFR5KQNFUY0eLgrp6QUhlOY5ODraH7A59aM3Dq5A546l+ApsnkLxKzPIguP7FW+cuTJGBTZxaybX63WSU/cPSCJHnnjZOE9KXusp3PUDxqIHT7RzOu5zwItAwy9Y51dUTk2ht21IdVanxd+w7az/InI5J6KFd1KveQP5xttHnd5OZ++3nY39/uxyImvkW/KQbJEeeUz2ySvymhwQRj6RP8lf5O/W960/Wu9b4Yx6/dpFzAOytFrwLytm2KE=</latexit>1<latexit sha1_base64="0hkwRoabOXQORf1Ezwi2mTQEJQs=">AAAHd3ichVXbbts4EGV6idPupbfHPlTYIAtvYBhWnSAN+hL0gm6xu9i2aNpga9elqJFMhKJoik7ssPqCfd1+XD+lbztU4sYWnVaA7dGcM3MOh5YYKcEL0+l8Xrl0+crV1cbates//PjTzzdu3rr9psjHmsE+y0WuDyJagOAS9g03Ag6UBppFAt5Gh48d/vYIdMFz+dpMFfQzmkqecEYNpl6Gg5vrnXanugI/CM+C9b01Ul0vBrdWP/TinI0zkIYJWhTv7m8p0wKZotth31JtOBNQXu+NC1CUHdIU3mEoaQZF31aOy2ADM3GQ5Bo/0gRVdr7C0qwoplmEzIyaYVHHXPIrtrEgZZIHfculGhuQ7FQpGYvA5IFbfxBzDcyIKQaUaY5mAzakmjKDU1ow/TrsW+fOtVkABM5OhnMiC+4iGoEoF0xZNXGNCiTGkOBOVeu144INNMSlffXsUWnD7Z1Wd6u13a2xdB4NIjGG0uo0Km2n1e52t1vt3V3kSThmeZZRGdtehAz8Srm0MJJUazotg7MEFTyVqFQrAVcCGM3woLo7L/ckpPRENr+vUlVdrLPpCR1NSmt7bnejJJjUGx4dz6HHHjqdQ6ceejKHnnhoMocmdfRQIPpkYP/4s6xDMTDE1MD2zBAMLYMmruEjqv3mTUM65giZasgr3gnyJh4vBel3bAXLWgoR+1SPpdzKVXNZAzVZqF9aPnIadjQDN6o1xPRC+sncKpeK8hEyng9cy4OH//hzGn3disg+LR3P2610brdSbyrUoRFPAwzx7QR1gj4n6KWEahqqIlEdnA2n/kgkoIpzG2iKasxwkUvPLwihNM/A0TF+j8Opbr3t4Ap05liKL8FmLRS/sIPMOb5f8fSYOWNU4BCbp3a9WccZdf+AeODIE68b53HBKz+5O0rAWMzgE+2SjvsE8CDQ8Bfq/I3Oqcn1pu1RnVZt8bfXctG3iFzOiBjhmRTWTyA/eHO/HW61d19ure91Tw8nskbukl9Ik4Rkh+yR38kLsk8YAfIv+Y98Wv3SuNf4tdE8pV5aOau5QxauRvg/VRvC7Q==</latexit> ⇤ (⌘ ⇤ )<latexit sha1_base64="CqGKO5XULoKWsqLr2vdEOk0ApXk=">AAAHp3ichVVtb9s2EGbbrc66vqTdx30RFgRwAsO16wRt0C/B1mErtqHt0KRBK0elqJNMhKJoik7scPoZ+zX7uv6I/ZsdlbixRWcTYOt0z3P3HO9EMVaCl6bX++fGzVtffHm7tfbVna/v3rv/YP3ho8OymGgGB6wQhT6KaQmCSzgw3Ag4UhpoHgt4F5/84PB3p6BLXsi3ZqZgmNNM8pQzatAVrT8ODUxNncdOShZpSCobqpIf2+2qHYKhkQ1jvLnnaquK1jd63V59Bb7RvzQ29tdIfb2OHt7+GCYFm+QgDRO0LD882VGmAzLD1Y2GlmrDmYDqTjgpQVF2QjP4gKakOZRDW1dWBZvoSYK00PiTJqi9ixGW5mU5y2Nk5tSMyibmnJ+xzSUpkz4bWi7VxIBkF0rpRASmCFy/goRrYEbM0KBMcyw2YCOqKTPY1aWi3/aH1lXn0iwBAnst+wsiS9XFNAZRLRVl1dQlKpGYQIqTbczn95++r2x/92lnsNPZHTRYuoijWEygsjqLK9vrdAeD3U53bw95Es5YkedUJm6olfvLuLQwllRrOquCSwcVPJOo1AgBFwJozfGgfroK9ySk9ES2/1+ljrpeZ9sTOp1W1oZuunEaTJsJT88W0DMPnS2gMw89X0DPPTRdQNMmeiIQfRHZX36tmlACDDGFe8uMcHNVQRvX8AeqbXndkI45RqYa8Zp3jrypx8tA+hk7waqUQiQ+1WMpt3LVXpVATZfiV4aPnYYdz8HNeg0JvZZ+vrDKlaJ8jIyXkUt59Py936fx51HE9sfK8bxpZQvTyryuUIfGPAvQxK8TNAn6iqBXEupuqJpEdXDZnOaWSEGVV2VgUVSjh4tCevWCEErzHBwd7WNsTv3ojYMr0LljKb4Cm6dQ/NoMsuD4fcXTZl4ZowKb2L4o1+t1klP3BiSRI0+9bJwnJa/rKdzRA8aiB3e0czruC8CDQMNvqPMKK6em0Ns2pDqr0+I97Djrv4hczolo4ZnUb55AvnH4pNvf6e692dnYH1wcTmSNfEu+I23SJ0/JPvmZvCYHhJE/yV/kb/KptdV61TpsHV1Qb964jPmGLF0t+i95M9at</latexit>=<latexit 
sha1_base64="QgR5W/V+UxaZ8+RT4cKtja5eHXw=">AAAHjnichVVRbxNHEF6g4BRoCSD1hZcTUaQ0siwbJ4QIRUQtFVRtVagIRMTm2NubO6+yt7feWyd2lvszvMIf4t8we4mJfevQk2zPzffNfLMzvptICV6YdvvLpctXfrh6rbH04/UbN3/6+dby7Tuvi3ykGeyxXOR6P6IFCC5hz3AjYF9poFkk4E10+LvD3xyBLnguX5mJgn5GU8kTzqhBV7j8S8/A2FR57KhgoYa4tDtluLzSbrWrK/CNzpmxsrtEqutFePva+16cs1EG0jBBi+LgwYYyTZApHmLQt1QbzgSU13ujAhRlhzSFAzQlzaDo26qAMlhFTxwkucaPNEHlnY2wNCuKSRYhM6NmUNQx5/yGrc5JmeRR33KpRgYkO1VKRiIweeDaEsRcAzNiggZlmmOxARtQTZnB5s0V/arTt646l2YOENhS2ZkRmasuohGIcq4oq8YuUYHEGBIcYG0M/z37rbSdza1md6O52a2xdB6FkRhBaXUalbbdbHW7m83W9jbyJByzPMuojG0vQgZ+pVxaGEqqNZ2UwZmDCp5KVKqFgAsBtKZ4UN2dh3sSUnoi6/+vUkVdrLPuCR2NS2t7brpREozrCY+OZ9BjD53MoBMPPZlBTzw0mUGTOnooEH0a2r/+LutQDAwxFdqeGYChZbCGZ/iAar963ZCOOUSmGvCKd4K8scdLQfoZm8GilELEPtVjKXdytbYogRrPxS8MHzoNO5yCq9UZYnoh/WTmlAtF+RAZf4Yu5f7jt36fht9GEdk/SsfzppXOTCv1ukIdGvE0QBPfTlAn6HOCXkiouqEqEtXBWXPqj0QCqjgvA4uiGj1c5NKrF4RQmmfg6Gi/w+ZUt944uAKdOZbiC7BpCsUvzCBzju9XXCrTyhgV2MS103K9XscZdf+AOHTksZeN87jgVT252zBgLHrwiXZOx30KuAg0/IM6/2Ll1OR63faoTqu0+NtrOut7RC6nRLRwJ3XqG8g3Xj9odTZa2y83Vna7p8uJLJF75D5ZIx2yRXbJc/KC7BFGPpCP5BP53FhuPGzsNJ6cUi9fOou5S+auxvOvyvzMEg==</latexit> ( ⇤ )<latexit sha1_base64="1mbdaqtzDXwIQxwv1A4lTnyzH3c=">AAAHnXichVVtb9s2EGa7tk7bvaTbx36YsCCAGxiGXSfogn0JthbdsG5ri6YJVrkaRZ1kIhRFU3Rih9Mv6K/p1+2X7N/0KMeNLTodAdune567e3hnirESvDS93n/Xrn924+at1sbtO3c//+LLrzbvff26LCaawSErRKGPY1qC4BIODTcCjpUGmscCjuKTnxx+dAq65IV8ZWYKhjnNJE85owZd0eZ2aGBq6jxWF3EUiwlUNlQlb4cxGPrW7lQPqmhzq9ft1Svwjf6FsXWwQer1PLp3668wKdgkB2mYoGX55uGuMh2QGe5pNLRUG84EVHfCSQmKshOawRs0Jc2hHNpaTxVsoycJ0kLjR5qg9i5HWJqX5SyPkZlTMyqbmHN+xLZXSpn0+6HlUk0MSDavlE5EYIrAdSlIuAZmxAwNyjRHsQEbUU2ZwV6uiH7VH1qnzqVZAQR2WPaXiqyoi2kMoloRZdXUJSqRmECK85xPZVKySENS2ZdPf6xsf+9RZ7Db2Rs0WJez01lc2V6nOxjsdbr7+8iTcMaKPKcysTjTyn1lXFoYS6o1nVXBhYMKnkms1AgBFwJoLfCgfroM90pI6RXZ+f8qddTVdXa8QqfTytrQTTdOg2kz4enZEnrmobMldOah50vouYemS2jaRE8Eoo8j++uzqgklwBBTkQ3NCM9WFbRxD39jtQdeN6RjjpGpRrzmnSNv6vEykH7GTrAupRCJT/VYyu1ctdclUNOV+LXhY1fDjhfgdr2HhF5JP1/a5dqifIyMXyKX8viHP/0+jT+OIrZPKsfzppUtTSvzukIdGvMsQBPfTtAk6EuCXkuou6FqEtXBRXOaRyIFVV7KQFFUo4eLQnp6QQileQ6OjvZbbE796I2DK9C5Yym+BlukUPzKDLLg+H7FO2ahjFGBTWzP5Xq9TnLq/gFJ5MhTLxvnSclrPYW7cMBY9OCJdk7HfQx4EWj4Dev8gcqpKfSODanO6rT4G3ac9SkilwsiWngn9Zs3kG+8ftjt73b3X+xuHQzmlxPZIPfJd6RN+uQROSA/k+fkkDDyjrwn/5B/W9+2nrSetX6fU69fu4j5hqys1tEHnELSQQ==</latexit>⌘ <latexit sha1_base64="o292XNuHKqZokwRUPXEZgh+bYWQ=">AAAHgnichVXbbtNAEF2uKffbIy8WVaVSRVHStIIKHiouAgSIiyhUkBDW67Gz6nq9WW/apIt/g1f4Lf6GWTehiTcFS4nHc87MnJ3xekMleG6azd+nTp85e+58benCxUuXr1y9dv3GzY95NtQMdlgmMr0b0hwEl7BjuBGwqzTQNBTwKdx77PBP+6BznskPZqygm9JE8pgzatDV6YChPdsJ8Vb0ri83G83yCnyjNTGWt5dIeb3t3Tj/rRNlbJiCNEzQPP+yvqFMHWSCwvtdS7XhTEBxsTPMQVG2RxP4gqakKeRdW4ovghX0REGcafxJE5Te2QhL0zwfpyEyU2r6eRVzzr/YylwpE9/vWi7V0IBkR5XioQhMFrhWBBHXwIwYo0GZ5ig2YH2qKTPYsDnRH1pd69S5NHOAwDbK1kyROXUhDUEUc6KsGrlEORIjiHFo5XrtMGc9DVFh3z97VNjW5r16e6O+2a6wdBb2QjGEwuokLGyz3mi3N+uNrS3kSThgWZpSGblxFu4v4dLCQFKt6bgIJg4qeCKxUiUEXAigNcWD8uk43CshpVdk7f9VyqiT66x5hfZHhbUdN90wDkbVhPsHM+iBh45n0LGHHs6ghx4az6BxFd0TiD7p2ZeviioUAUNM4a4yfbetglVcw3esdtfrhnTMATJVn5e8Q+SNPF4C0s9YDxalFCLyqR5LuZWr1UUJ1GgufmH4wNWwgym4Uq4hoifSD2dWubAoHyDjRc+l3H3w2e/T4O8oQvu0cDxvWsnMtBKvK9ShIU8CNPHrBFWCPibohYSyG6okUR1MmlPdEjGo/FgGiqIaPVxk0tMLQijNU3B0tL9ic8pHbxxcgU4dS/EF2DSF4idmkBnH7yseJFNljAps4uqRXK/XUUrdGxD1HHnkZeM8ynmpJ3OnChiLHtzRzum4TwAPAg2vsc4bVE5Nptdsh+qkTIv3Tt1Z/yJyOSWihWdSq3oC+cbH9UZro7H1bmN5u310OJElcpvcIaukRe6RbfKcvCU7hBFFfpCf5FftbG2t1qpNuKdPTWJukbmr9vAPY1THEA==</latexit>logp ✓ (x)<latexit 
sha1_base64="XiB/TbqA98lrmgL2pTtLE1lTTxg=">AAAHoHichVVRbxNHEF4oxYEWGtpHXk5EkUxkWTZJRKO+RJSqoLYCKgIRnLnu7c2dV9nbW++tEzvL/YX+mr7S/8G/YfYcY/vWgZUsz833zXyzM7e3sRK8NL3exytXv7n27fXWxo2b331/6/YPm3d+fFUWY83giBWi0McxLUFwCUeGGwHHSgPNYwGv45NfHf76FHTJC/nSTBUMcppJnnJGDbqizXZoYGLqPFYXcRSLMVQ2FEUWqMiGZgiGVu3J/Sra3Op1e/UKfKN/YWwdbpB6PY/uXP8nTAo2zkEaJmhZvn2wp0wHZIbbGg4s1YYzAdXNcFyCouyEZvAWTUlzKAe2LqkKttGTBGmh8SdNUHuXIyzNy3Kax8jMqRmWTcw5P2PbK1Im/XlguVRjA5LNlNKxCEwRuEYFCdfAjJiiQZnmWGzAhlRTZrCdK0W/7A+sq86lWQEENln2l0RWqotpDKJaKcqqiUtUIjGBFEc6G8y4ZJGGpLJ///6osv39h53dvc7+boO1GJ/O4sr2Ot3d3f1O9+AAeRLOWJHnVCY2jN2AY8i4tDCSVGs6rYILBxU8k6jUCAEXAmjN8aB+WoR7ElJ6IjtfV6mjLtfZ8YROJ5W1oZtunAaTZsLTsyX0zEOnS+jUQ8+X0HMPTZfQtImeCEQfR/aPP6smlABDbHG0gjbu4T2q3fe6IR1zhEw15DXvHHkTj5eB9DN2gnUphUh8qsdSbueqvS6BmqzErw0fOQ07moPb9R4Sein9fGmXa0X5CBlPI5fy+Jc3fp9Gn0cR298qx/OmlS1NK/O6Qh0a8yxAE79O0CToBUGvJdTdUDWJ6uCiOc0jkYIqF2VgUVSjh4tCevWCEErzHBwd7XfYnPrRGwdXoHPHUnwNNk+h+KUZZMHx+4rXzLwyRgU2sT0r1+t1klP3BiSRI0+8bJwnJa/rKdydA8aiB0+0czruY8CLQMNfqPMMK6em0Ds2pDqr0+J/2HHWl4hczolo4Z3Ub95AvvHqQbe/1z14sbd1+Gh2OZENcpfcI23SJw/JIXlCnpMjwsi/5D/ygfzfutd60nrWejGjXr1yEfMTWVmtN58Ak1jTvQ==</latexit>1<latexit sha1_base64="0hkwRoabOXQORf1Ezwi2mTQEJQs=">AAAHd3ichVXbbts4EGV6idPupbfHPlTYIAtvYBhWnSAN+hL0gm6xu9i2aNpga9elqJFMhKJoik7ssPqCfd1+XD+lbztU4sYWnVaA7dGcM3MOh5YYKcEL0+l8Xrl0+crV1cbates//PjTzzdu3rr9psjHmsE+y0WuDyJagOAS9g03Ag6UBppFAt5Gh48d/vYIdMFz+dpMFfQzmkqecEYNpl6Gg5vrnXanugI/CM+C9b01Ul0vBrdWP/TinI0zkIYJWhTv7m8p0wKZotth31JtOBNQXu+NC1CUHdIU3mEoaQZF31aOy2ADM3GQ5Bo/0gRVdr7C0qwoplmEzIyaYVHHXPIrtrEgZZIHfculGhuQ7FQpGYvA5IFbfxBzDcyIKQaUaY5mAzakmjKDU1ow/TrsW+fOtVkABM5OhnMiC+4iGoEoF0xZNXGNCiTGkOBOVeu144INNMSlffXsUWnD7Z1Wd6u13a2xdB4NIjGG0uo0Km2n1e52t1vt3V3kSThmeZZRGdtehAz8Srm0MJJUazotg7MEFTyVqFQrAVcCGM3woLo7L/ckpPRENr+vUlVdrLPpCR1NSmt7bnejJJjUGx4dz6HHHjqdQ6ceejKHnnhoMocmdfRQIPpkYP/4s6xDMTDE1MD2zBAMLYMmruEjqv3mTUM65giZasgr3gnyJh4vBel3bAXLWgoR+1SPpdzKVXNZAzVZqF9aPnIadjQDN6o1xPRC+sncKpeK8hEyng9cy4OH//hzGn3disg+LR3P2610brdSbyrUoRFPAwzx7QR1gj4n6KWEahqqIlEdnA2n/kgkoIpzG2iKasxwkUvPLwihNM/A0TF+j8Opbr3t4Ap05liKL8FmLRS/sIPMOb5f8fSYOWNU4BCbp3a9WccZdf+AeODIE68b53HBKz+5O0rAWMzgE+2SjvsE8CDQ8Bfq/I3Oqcn1pu1RnVZt8bfXctG3iFzOiBjhmRTWTyA/eHO/HW61d19ure91Tw8nskbukl9Ik4Rkh+yR38kLsk8YAfIv+Y98Wv3SuNf4tdE8pV5aOau5QxauRvg/VRvC7Q==</latexit> ⇤<latexit sha1_base64="m8nVgFoP/qYUlLvqgUQGNeGILaQ=">AAAHl3ichVVtbxM5EDYvRwr3QoFPiA+sqCqVKooS2goqPlDxfoLTAaJQQcLi9c5urHq9jtdpk5r9yK/hK/yY+zc33jQ0Wad3lqLMzvPMPOOZ9TpSghem3f7nzNlz53+50Fi6eOnX337/4/LylavvinyoGeyyXOR6L6IFCC5h13AjYE9poFkk4H20/8jh7w9AFzyXb81YQS+jqeQJZ9SgK1y+2TUwMlUeq/MojMQQStuNwNBPdr0sw+WVdqtdrcA3OsfGys4Sqdar8MqFz904Z8MMpGGCFsXHO5vKNEGmuJt+z1JtOBNQXuoOC1CU7dMUPqIpaQZFz1aVlMEqeuIgyTX+pAkq72yEpVlRjLMImRk1/aKOOedPbHVOyiT3epZLNTQg2UQpGYrA5IHrTxBzDcyIMRqUaY7FBqxPNWUGuzhX9NtOz7rqXJo5QGBvZWdGZK66iEYgyrmirBq5RAUSY0hwkpN5DAsWaohL++bZw9J2tu42NzabWxs11snUdBqVtt1sbWxsNVvb28iTcMjyLKMydiOt5ppyaWEgqdZ0XAbHDip4KlGpFgIuBNCa4kH1dBLuSUjpiaz/v0oVdbrOuid0MCqt7brpRkkwqic8OJxBDz10PIOOPfRoBj3y0GQGTerovkD0cWhfvCzrUAwMMRXarunj0SqDNdzDF1S77XVDOuYAmarPK94R8kYeLwXpZ2wGi1IKEftUj6XcztXaogRqNBe/MHzgNOxgCq5We4jpqfSjmV0uFOUDZPwZupR79z/4fRr8HEVkn5SO500rnZlW6nWFOjTiaYAmfp2gTtAnBL2QUHVDVSSqg+Pm1I9EAqo4KQOLoho9XOTSqxeEUJpn4Ohof8LmVI/eOLgCnTmW4guwaQrFT80gc47fV7xdppUxKrCJa5NyvV7HGXVvQBw68sjLxnlc8Kqe3F01YCx68EQ7p+M+BrwINPyFOn9j5dTket12qU6rtPjfbTrrv4hcTolo4Z3Uqd9AvvHuTquz2dp+vbmy83ByOZElcoPcImukQ+6SHfKcvCK7hJGv5Bv5Tn40rjceNJ42nk+oZ88cx1wjc6vx+l9Y6dAb</latexit>⌘ ⇤ =<latexit 
sha1_base64="4RaIMYGMW1R/TXcq8EXp6e2eDx0=">AAAHoXichVVtb9s2EGbbrU67l6bdx34hFgRIA8Ow6wRdMAwItg5dsQ3LiqYNFrsaRZ1kIhRFU3Rih9Vv2K/Z1+137N/sKMeNLTqbAFune56753gnirGWorTd7j+3bt/56OO7rY179z/59LPPH2w+fPSmLCaGwzEvZGFOYlaCFAqOrbASTrQBlscS3sZn33n87TmYUhTqtZ1pGOYsUyIVnFl0RZtPBhamts7jTBFHsZxA5QZgWeQGMd7eud2qot/QKtrc6na69UVDo3dlbB1ukPo6ih7e/X2QFHySg7JcsrI8fbqnbRtUhusaDR0zVnAJ1f3BpATN+BnL4BRNxXIoh66uqaLb6EloWhj8KUtr73KEY3lZzvIYmTmzo7KJeecHbHtFyqZfDZ1QemJB8blSOpHUFtR3iibCALdyhgbjRmCxlI+YYdxiP1eKft0bOl+dT7MCSOyy6i2JrFQXsxhktVKU01OfqERiAinOdD6ZSckjA0nlXr34tnK9/Wft/l57v99gXc/PZHHluu1Ov7/f7hwcIE/BBS/ynKnEj7Xyf5lQDsaKGcNmFb1yMCkyhUqNEKhfCrQWOK2frsMDCaUCkd3/V6mjbtbZDYTOp5VzAz/dOKXTZsLziyX0IkBnS+gsQC+X0MsATZfQtImeSUSfR+7Hn6omlABHTOPusiPcXhXdwTW8R7UnQTeUZ46RqUei5l0ibxrwMlBhxjZdl1LKJKQGLO1XrnfWJdDTlfi14WOv4cYLcLteQ8JupF8urXKtqBgj42XkU558/VvYp/GHUcTu+8rzgmllS9PKgq4wj8Yio2ji1wmaBHNNMGsJdTd0TWKGXjWnuSVS0OV1GVgUM+gRslBBvSClNiIHT0f7HTanfgzGITSY3LO0WIMtUmhxYwZVCPy+4jmzqIwziU3cmZcb9DrJmX8DksiTp0E2IZJS1PUU/tAB69CDO9o7Pfc54EFg4GfU+QUrZ7Ywu27ATFanxfug7a3/Igq1IKKFZ1KveQKFxpunnd5e5+DXva3D/vxwIhvkMfmS7JAeeUYOyQ/kiBwTTv4gf5K/yN+trdbL1lHr1Zx6+9ZVzBdk5Wqd/gu73dPa</latexit> ( )<latexit sha1_base64="uWNtXWGR0fiaNB4SWXTPJ3L9Mg8=">AAAHgXichVVtb9s2EGbSNs6ybn37uC9CgwBOYBh2nKAN+iVoU3TFNjQtkjZY7HkUdZKJUBRN0YkdTj9jX9fftX+zoxIntuh0BGyf7nnu7uGdKYZK8Ny0Wv8uLd+7/2Cltvrd2vcPf/jx0eMnTz/n2UgzOGaZyPRJSHMQXMKx4UbAidJA01DAl/DsjcO/nIPOeSaPzERBL6WJ5DFn1KDrtKtyXu+GYOhm//F6q9kqV+Ab7WtjfX+VlOuw/2Tlz26UsVEK0jBB8/x0e0eZBsgEdQ96lmrDmYBirTvKQVF2RhM4RVPSFPKeLbUXwQZ6oiDONH6kCUrvbISlaZ5P0hCZKTWDvIo55w22MVfKxC97lks1MiDZVaV4JAKTBa4TQcQ1MCMmaFCmOYoN2IBqygz2a070UbtnnTqXZg4Q2EXZnikypy6kIYhiTpRVY5coR2IEMc6s3K8d5ayvISrsp3evC9vefdHo7DR2OxWWzsJ+KEZQWJ2EhW01mp3ObqO5t4c8CRcsS1MqI4vjLNxXwqWFoaRa00kRXDuo4InESpUQcCGA1hQPyqfbcK+ElF6Rrf+vUkbdXWfLK3Q+LqztuumGcTCuJjy/mEEvPHQyg0489HIGvfTQeAaNq+iZQPSgb3/5tahCETDEVN92zQCPVRHUcQ9/YbVNrxvSMYfIVANe8i6RN/Z4CUg/YyNYlFKIyKd6LOV2ruqLEqjxXPzC8KGrYYdTcKPcQ0TvpF/O7HJhUT5Exvu+S3ny6ne/T8ObUYT2beF43rSSmWklXleoQ0OeBGji2wmqBH1L0AsJZTdUSaI6uG5O9UjEoPJbGSiKavRwkUlPLwihNE/B0dH+A5tTPnrj4Ap06liKL8CmKRS/M4PMOL5f8R6ZKmNUYBPrV3K9Xkcpdf+AqO/IYy8b51HOSz2Zu1TAWPTgiXZOxz0AvAg0/IZ1PqByajK9ZbtUJ2Va/O02nPUtIpdTIlp4J7WrN5BvfN5utneaex931vc7V5cTWSU/keekTtrkBdknP5NDckwYycjf5B/ytXavtllr1bavqMtL1zHPyNyqvfoPdLrGEg==</latexit> ⇤ =max (1 ) (0)+ (1) ( )<latexit sha1_base64="hiFabegHXYUvglkSq5YOxlO1PjU=">AAAH0XichVVtbxtFEL4WqEN5S+Fjv6waRXKCa/nqRCVClSJaBAgQBZo2peeavbu58yp7e+u9dWJnOQnxlR/F7+AH8BX+ArPrOLFvHTjJ9uw8z8wzO+u9iSVnle71/rxx840337rV2nj79jvvvvf+B5t3PnxelROVwFFS8lIdx7QCzgQcaaY5HEsFtIg5vIhPHlv8xSmoipXimZ5JGBQ0FyxjCdXoGm6+jGLQ9LXZrckjEhV0SiLOCqaroXFITaIOaYf33WLHLiJZsXZvh3xMnO/SFe6Q+3Nrzh1ubvW6PfcQ3wgvjK3DjcA9T4d3bv0cpWUyKUDohNOqevVgT+oOiBz7MBoYqjRLONS3o0kFkiYnNIdXaApaQDUwrhc12UZPSrJS4Udo4rzLEYYWVTUrYmQWVI+qJmadl9j2ipTOPhkYJuREg0jmStmEE10S21mSMgWJ5jM0aKIYFkuSEVU00dj/laKfhQNjq7NpVgCOpyLCJZGV6mIaA69XijJyahNVSEwhw/+A26+ZVMlQQVqbH774rDbh/sNOf6+z32+wVBkPYz6B2qg8rk2v0+339zvdgwPkCThLyqKgIrX/g9p+5UwYGAuqFJ3V5MJBOcsFKjVCwIYAWgucuNVVuCchhCey+/8qLup6nV1P6HRaGxPZ040zMm0mPD1bQs88dLaEzjz0fAk999BsCc2a6AlH9MnQfP1N3YRSSBCTeB31yN3HNu7hF1Tb8bohLHOMTDlijneOvKnHy0H4GfESr0nJeepTPZa0O5ftdQnkdCV+bfjYapjxAtx2e0jptfTzpV2uFWVjZHw1tCmPP/3J79P48ihi83lted5p5UunlXtdoRaNWY7vSopvJ2gS1BVBrSW4bkhHoopcNKd5JTKQ1VUZWBRV6GG8FF69wLlUrABLR/s1NsctveNgElRhWZKtwRYpJLs2gygZvl9xLi0qSyjHJrbn5Xq9TnGiIDMdWvLUy8ZYWjFXT2mHFGiDHrzR1mm5TwAHgYJvUec7rJzqUu2aiKrcpcXfqGOt/yIysSCihTMpbE4g33j+oBvudQ++39s67M+HU7AR3A3uBe0gDB4Gh8GXwdPgKEiCP4K/gr+Df1o/tmatX1u/zak3b1zEfBSsPK3f/wUPSuFl</latexit> ⇤ =max (1 ) (0)+ (1) ( )<latexit 
sha1_base64="bEIKiqDfdoDbID/ucVGNeXROeRc=">AAAH53ichVVfbxtFEL+Utg7lXwqPfVk1iuQE17LrRCVClSJaBAgQBTVtRM899u7mzqvs7a331omd5T4Db4hXPhQPfBZemF3HiX3rwEmWZ+f3mz87s7sTS84q3ev9vXHrndt37rY237333vsffPjR1v2PX1XlRCVwnJS8VCcxrYAzAceaaQ4nUgEtYg6v49NnFn99BqpipXipZxKGBc0Fy1hCNaqirTLUMNXOj1FlHMV8ArUJY9D0rdmryVMSFnRKQs4KpqtojtQk7JB2/5Fb7NpFKCvW7u2ST4nTXan6u+TRXJpz62hru9ftuY/4Qv9S2D7aDNz3Irp/95cwLZNJAUInnFbVm8f7UndA5Fid0dBQpVnCob4XTiqQNDmlObxBUdACqqFxO6vJDmpSkpUKf0ITp122MLSoqlkRI7OgelQ1Mau8wnZWQunss6FhQk40iGQeKZtwokti601SpiDRfIYCTRTDZEkyooomGruykvTL/tDY7KybFYBjr0R/KchKdjGNgdcrSRk5tY4qJKaQ4cmY93dSJZGCtDY/ffVFbfoHTzqD/c7BoMG6PgUqj2vT63QHg4NO9/AQeQLOk7IoqEjtQXDnJGfCwFhQpeisJpcKylkuMFLDBKwJoLTAiVtdm3shhPCC7P1/FGd1c5w9L9DZtDYmtN2NMzJtOjw7X0LPPXS2hM489GIJvfDQbAnNmugpR/R5ZL79rm5CKSSISbyPeuQuZBv38CtG2/WqISxzjEw5Yo53gbypx8tB+B7xFq9xyXnqUz2WtDuX7XUO5HTFfq352MYw4wW44/aQ0hvpF0u7XBuUjZHxTWRdnnz+s1+n8VUrYvNlbXlet/KlbuVeVahFY5bjY0nxdYImQV0T1FqCq4Z0JKrIZXGaVyIDWV2ngUlRhRrGS+HlC5xLxQqwdJTfYnHc0msHk6AKy5JsDbZwIdmNHkTJ8H3FabXILKEci9iep+vVOsWRgsw0suSp542xtGIun9KOLtAGNXijrdJynwMOAgXfY5wfMHOqS7VnQqpy5xb/w46V/ovIxIKIEs6kfnMC+cKrx93+fvfwx/3to8F8OAWbwYPgYdAO+sGT4Cj4OngRHAdJ8Ffwz8btjTst1vqt9Xvrjzn11salzSfBytf6818heum9</latexit>= 1 Z 0 ⌘ d<latexit sha1_base64="o7qVgkQTvnGFaRb15QfT46dh38g=">AAAHuXichVVtb9s2EFa7rU6zl6brx34RFgTIAsOw6wRtUAxIX4atWId1Q9MGq1yVok4yF4qiSTqxw+n37Nfsa4f9mx5lu7FFZxNg63TPc3cP70QxkZxp0+3+e+36J59+dqO1cXPz8y++/OrW1u2vX+lyrCgc05KX6iQhGjgTcGyY4XAiFZAi4fA6OX3i8NdnoDQrxUszlTAoSC5Yxigx6Iq3HkUGJqbOY1WZxAkfQ2W/CyMmTBhxVjCj4+7bXhiBIbGNErxV4WbUDtO5HW9tdzvd+gp9ozc3to82gvp6Ed++8S5KSzouQBjKidZv7u1L0waR43qHA0uUYZRDtRmNNUhCT0kOb9AUpAA9sLXWKtxBTxpmpcIfSq29yxGWFFpPiwSZBTFD3cSc8yO2s1LKZA8Glgk5NiDorFI25qEpQ9fBMGUKqOFTNAhVDMWGdEgUoQb7vCL6ZW9gnTqXZgXg2H3RWyqyoi4hCfBqRZSVE5dIIzGFDGc9m9hY01hBWtnffnhc2d7B/XZ/v33Qb7Au56rypLLddqffP2h3Dg+RJ+CclkVBROpmW7m/nAkLI0GUItMqnDsIZ7nASo0QcCGA1gIP66fLcK+EEF6Rvf+vUkddXWfPK3Q2qayN3HSTLJw0E56dL6HnHjpdQqceerGEXnhotoRmTfSUI/o0tj89r5pQChQxiVvMDOt9tYtr+BOrfet1QzjmCJlyyGreBfImHi8H4Wdsh+tScp76VI8l3crl7roEcrISvzZ85GrY0QLcqdeQkivpF0urXFuUjZDxLHYpTx7+7vdp9HEUif2+cjxvWvnStHKvK8ShCcvxQ0jw6wRNgrokqLWEuhuyJhEVzpvT3BIZSH0pA0URhR7GS+HpBc6lYgU4OtpvsTn1ozcOJkEVjiXZGmyRQrIrM4iS4fcVz5+FMko4NnF3JtfrdVoQ9waksSNPvGyMpZrVekp3GIGx6MEd7ZyO+xTwIFDwM9b5BZUTU6o9GxGV12nxHrWd9V9EJhZEtPBM6jVPIN94da/T2+8c/rq/fdSfHU7BRnA3+CbYDXrB/eAo+DF4ERwHNPgr+Dt4H/zTetgirWHrjxn1+rV5zJ1g5WrpD5DJ3Gs=</latexit>logp ✓ (x)<latexit 
sha1_base64="XiB/TbqA98lrmgL2pTtLE1lTTxg=">AAAHoHichVVRbxNHEF4oxYEWGtpHXk5EkUxkWTZJRKO+RJSqoLYCKgIRnLnu7c2dV9nbW++tEzvL/YX+mr7S/8G/YfYcY/vWgZUsz833zXyzM7e3sRK8NL3exytXv7n27fXWxo2b331/6/YPm3d+fFUWY83giBWi0McxLUFwCUeGGwHHSgPNYwGv45NfHf76FHTJC/nSTBUMcppJnnJGDbqizXZoYGLqPFYXcRSLMVQ2FEUWqMiGZgiGVu3J/Sra3Op1e/UKfKN/YWwdbpB6PY/uXP8nTAo2zkEaJmhZvn2wp0wHZIbbGg4s1YYzAdXNcFyCouyEZvAWTUlzKAe2LqkKttGTBGmh8SdNUHuXIyzNy3Kax8jMqRmWTcw5P2PbK1Im/XlguVRjA5LNlNKxCEwRuEYFCdfAjJiiQZnmWGzAhlRTZrCdK0W/7A+sq86lWQEENln2l0RWqotpDKJaKcqqiUtUIjGBFEc6G8y4ZJGGpLJ///6osv39h53dvc7+boO1GJ/O4sr2Ot3d3f1O9+AAeRLOWJHnVCY2jN2AY8i4tDCSVGs6rYILBxU8k6jUCAEXAmjN8aB+WoR7ElJ6IjtfV6mjLtfZ8YROJ5W1oZtunAaTZsLTsyX0zEOnS+jUQ8+X0HMPTZfQtImeCEQfR/aPP6smlABDbHG0gjbu4T2q3fe6IR1zhEw15DXvHHkTj5eB9DN2gnUphUh8qsdSbueqvS6BmqzErw0fOQ07moPb9R4Sein9fGmXa0X5CBlPI5fy+Jc3fp9Gn0cR298qx/OmlS1NK/O6Qh0a8yxAE79O0CToBUGvJdTdUDWJ6uCiOc0jkYIqF2VgUVSjh4tCevWCEErzHBwd7XfYnPrRGwdXoHPHUnwNNk+h+KUZZMHx+4rXzLwyRgU2sT0r1+t1klP3BiSRI0+8bJwnJa/rKdydA8aiB0+0czruY8CLQMNfqPMMK6em0Ds2pDqr0+J/2HHWl4hczolo4Z3Ub95AvvHqQbe/1z14sbd1+Gh2OZENcpfcI23SJw/JIXlCnpMjwsi/5D/ygfzfutd60nrWejGjXr1yEfMTWVmtN58Ak1jTvQ==</latexit>Figure 3.13: Cherno point on =r (). R(D)= ⇤ (⌘ )<latexit sha1_base64="JEfImWcd4tzRYs3lL34XtQI6Eds=">AAAHZXichVXdbhM5FDawNF1+dssCe8MFFlWltIqqBooALStVSxGsltWyiJYKJkQez5mJVY/H8ThtUjMPwy08EU/Aa3A8bWgSp7sjJTlzvu9858exHWspSrux8eXc+Qs/XFxoLP546fKVqz/9vHTtl92yGBgOO7yQhdmLWQlSKNixwkrY0wZYHkt4E+8/8fibAzClKNRrO9LQyVmmRCo4s+jqLt181dxepb/TSJfivVurmhFYttpdWt5Y36gfGhrtE2N5a5HUz8vutYWHUVLwQQ7KcsnK8t3dTW1boDLsoddxzFjBJVSXokEJmvF9lsE7NBXLoey4uo+KrqAnoWlh8KMsrb2TEY7lZTnKY2TmzPbKWcw7v2MrU6ls+rDjhNIDC4ofZ0oHktqC+qnQRBjgVo7QYNwILJbyHjOMW5zdVNGv2x3nq/MyU4DEiar2RJKp6mIWg6yminJ66IVKJCaQ4vrV/bpBybsGksq9evZH5dr3H7Tubbbu30OWgkNe5DlTiYtiqPxXJpSDvmLGsFFFTxxMikyhwkwI+BBAa4zT+u00PEihVJBk7f+z1FFn51kLEh0MK+civ2pxSoezggeHE+hhgI4m0FGAHk2gRwGaTqDpLLovEd3uur9eVLNQAhwx3XWR7eF2qWgTe/iA2VaDaSjP7CNT90TNO0LeMOBloELFFp0nKWUSUgOW9p3r5jwBPZyKnxve9zlcfwyu1D0k7Ez60USXc5OKPjL+7HrJvd/ehnPqf1+K2D2tPC9YrWxitbJgKsyjscgomnjqwCzBnBLMXEI9DV2TmKEnw5ndEino8rQMLIoZ9AhZqKBekFIbkYOno/0eh1O/BsshNJjcs7SYg40ltDhTQRUCz028K8aVcSZxiM3jcoNZJznz/4Ck68nDQE2IpBR1PYW/OMA69OCO9k7P3QY84A38jXn+wcqZLcyai5jJaln8jVre+i+iUGMiWnjXtGdvltDYvbve3lx/9O/m8tbj40uHLJJb5A5pkjZ5QLbIc/KS7BBOHPlIPpHPC18bVxs3Gr8eU8+fO4m5Tqaexu1v0c+68w==</latexit>D= ⌘<latexit sha1_base64="aDUOtYH3UWaz/mxAfPuXqAL0KMM=">AAAHWHichVXbbhMxEDW3ptxvj7xYVJVKFaoGigABUgVFgABRUG+ChMjrnd1Y9Xodr9MmNfsXvMJ/wdcw3jY0iVNYKcnsnDMzxzOxHWkpCru8/OvU6TNnz83UZs9fuHjp8pWr167f2CrynuGwyXOZm52IFSCFgk0rrIQdbYBlkYTtaPeFx7f3wBQiVxt2oKGVsVSJRHBm0fV5jT6jd5tgGW1fm1teWq4eGhqNI2NudZZUz3r7+syjZpzzXgbKcsmK4su9FW3roFKU3Wk5ZqzgEsoLzV4BmvFdlsIXNBXLoGi5SnpJ59ET0yQ3+FGWVt7RCMeyohhkETIzZjvFJOadf7H5sVI2edRyQumeBcUPKyU9SW1OfSNoLAxwKwdoMG4EiqW8wwzjFts1Jnqj0XJenU8zBkhsomqMFBlTF7EIZDkmyum+T1QgMYYER1at1/UK3jYQl+7Tq+elazx4WL+/Un9wH1kK9nmeZUzFrhlB6b9SoRx0FTOGDUp65GBSpAozTISADwG0hjit3o7DgxJKBUUW/1+lijq5zmJQaK9fOtf0U4sS2p9MuLc/gu4H6GAEHQTowQh6EKDJCJpMorsS0bW2e/uunIRi4IjptmvaDm6Xki7gGr5htTtBN5RndpGpO6LiHSCvH/BSUGHGOp2WUso4pAYs7VeuF6Yl0P2x+KnhXV/DdYfgfLWGmJ1IPxhZ5dSioouMN22fcufJ57BP3b+jiNzL0vOCaaUj00qDrjCPRiKlaOKpA5MEc0wwUwlVN3RFYoYeNWdySySgi2MZKIoZ9AiZq0AvSKmNyMDT0f6Kzaleg3EIDSbzLC2mYMMUWpyYQeUCz028HobKOJPYxIVDuUGv44z5f0Dc9uR+kE2IuBCVntzfFWAdenBHe6fnrgEe8AbeY50PqJzZ3Cy6JjNplRZ/m3Vv/Yso1JCIFt41jcmbJTS27i01VpYef1yZW316eOmQWXKL3CYLpEEeklXymqyTTcKJIt/JD/Jz5neN1Gq184fU06eOYm6Ssad24w9URbaF</latexit>Free Energy ( )=sup ⌘ ·⌘ ⇤ (⌘ ) sup occurs when d ⇤ d⌘ = = dR dD<latexit 
sha1_base64="BdZg5yERDb73UauFiWXKSilchqA=">AAAIDXichVVtcxs1ED4XqNPwlsLHftGQCeNkTCZu02kZ2plMSYEOMJTQtJn2XI9Ot3fWRKeTJTmxI+438Gv4BnzlN/BvWJ3txLYcuJnEe/s8u89q9yQlSnBj9/b+adx45933bjbXbq2//8GHH328cfuTl6YcagbHrBSlPkmoAcElHFtuBZwoDbRIBLxKTr/2+Ksz0IaX8oUdK+gWNJc844xadPU2/owtjKz7RgOQpxJ0Pq5IPBjSlMTK8FYsMFVKt8nnj0lshorEghfcmp6LwVKktsmUQmKWlpZ4N/miDn7rdqqWf98mcbw+ia7FSMnYUBty3gdJLuUyTZlLZ4GVS8lEwitPJR77zBPeERIOq97G5t7uXv2Q0OhMjc2Dtah+nvdu33wYpyUbFiAtE9SYN3f3lW2DzLHX/a6j2nImoFqPhwYUZac0hzdoSlqA6bq63xXZQk9KslLjn7Sk9s5HOFoYMy4SZBbU9s0y5p2X2NaClM0edh2XamhBsolSNhTElsRPj6RcA7NijAZlmmOxhPUpdsTijBeKftHpOl+dT7MACJy87MyJLFSX0AREtVCUUyOfyCAxhQy/s3q9bmhYT0NauaNvn1Suc/9B+95++/49ZEk4Z2VRUJm6OIHK/8u5dDCQVGvqP7CJgwqeS8ywFAI+BNCa4aR+uwoPJKQMRHb+X6WOul5nJxA6G1XOxX5qSUZGywnPzufQ8wAdz6HjAL2YQy8CNJtDs2X0VCB62HPf/1AtQykwxBRuVtuv91IL1/Arqm0H3ZCeOUCm6vOad4G8UcDLQYYZ8QxYkVKINKQGLOVXrlqrEqjRQvzK8IHXcIMZuFWvIaXX0i/mVrlSlA+Q8aznU5589Trs0+ByFIl7WnleMK18blp50BXq0YTn/lDDUweWCfqKoFcS6m6omkQ1mTZneUtkoMxVGVgU1ejhopRBvSCE0rwAT0f7LTanfg3GwRXowrMUX4HNUih+bQZZcjw38U6bVcaowCa2JuUGvU4L6r+AtOfJoyAb56nhdT2lv+DAOvTgjvZOzz0EPOA1/Ig6P2Hl1JZ6x8VU53Va/I3b3vovIpczIlp413SWb5bQeHl3t7O/++XP+5sHjyaXTrQW3Yk+i1pRJ3oQHUTfRc+j44g1NhvPGkeNX5q/NX9v/tH8a0K90ZjGfBotPM2//wUA5vkb</latexit> ⇤ (⌘ )<latexit sha1_base64="4/ysG8VUnprM9Natvajo/+9nFw4=">AAAHeXichVXbbhMxEDW3ptwLPPKyUFVKo6hqaBEgeKi4CBAgLqJQwYbI653dWPV6Ha/TJjX7DXwNr/AdfAsvjDcNTeIUVkoyO+fMmfFMbEdK8MKsr/86cfLU6TMLtcWz585fuHjp8tKVqx+KvK8ZbLNc5HonogUILmHbcCNgR2mgWSTgY7T7yOEf90AXPJfvzVBBO6Op5Aln1KCrs7QaGhiYSsdqiEsbqoJ/sY2yHoKhHRsK1IppuVp2lpbX19arJ/CN1qGxvLVIqudN58rC3TDOWT8DaZigRfH51qYyTZAprqvbtlQbzgSU58J+AYqyXZrCZzQlzaBo26qmMlhBTxwkucaPNEHlnYywNCuKYRYhM6OmW8xizvkXW5lKZZK7bcul6huQbJQp6YvA5IHrVBBzDcyIIRqUaY7FBqxLNWUG+zlV9PtW27rqnMwUILDLsjWRZKq6iEYgyqmirBo4oQKJMSQ409Fk+gXrVNN59/RhaVu37zQ3Npu3N5AlYZ/lWUZlbMMISveVcmmhJ6nWdFgGhw4qeCpRYSYEXAigNcaD6u0o3EshpZek8f8sVdTxeRpeor1BaW3ophYlwWBWcG9/At330OEEOvTQgwn0wEOTCTSZRXcFoo879sXLchaKgSGmcMeYLu6cMqjjGr5itlWvG9Ixe8hUXV7xDpA38HgpSF+xGcyTFCL2qR5LuZWr+jwBNZiKnxveczlsbwyuVGuI6bH0g4lVzk3Ke8h43nGSO/c/+X3q/R1FZJ+UjudNK52YVup1hTo04mmAJp46MEvQRwQ9l1B1Q1UkqoPD5sxuiQRUcVQGFkU1erjIpVcvCKE0z8DR0f6CzalevXFwBTpzLMXnYGMJxY9VkDnHcxPvj3FljApsYn1UrtfrOKPuHxB3HHngqXEeF7yqJ3eXCRiLHtzRzum4jwEPeA2vMM9rrJyaXDdsSHVayeJv2HTWv4hcjolo4V3Tmr1ZfOPDrbXW5tq9t5vLWw9Glw5ZJNfJTVInLXKHbJFn5A3ZJox8I9/JD/Jz4XftRq1ea4yoJ08cxlwjU09t4w8sN8XW</latexit> ⌘ <latexit sha1_base64="L6ScZ3aF0vuvVW7VOdwitZPL9nA=">AAAHcHichVXbbhMxEDW3ptxbeEHigYWqUqlC1dAiQPBQcREgQFxEaQUbIq93dmPV63W8TpvU7Ctfwyv8C7/BFzDeNDSJU1gpyeycM2fGM7EdKcELs7r669jxEydPzdRmT585e+78hYtz85c+FnlXM9hkucj1dkQLEFzCpuFGwLbSQLNIwFa089jhW7ugC57LD6avoJnRVPKEM2rQ1ZoLQgM9U+lYDXFpb4VgaMuGAjViWpatuYXVldXqCXyjcWAsbMyS6nnbmp+5F8Y562YgDRO0KD7fXlemDjLF5bSblmrDmYDyTNgtQFG2Q1P4jKakGRRNW5VSBovoiYMk1/iRJqi8oxGWZkXRzyJkZtS0i0nMOf9ii2OpTHKvablUXQOSDTIlXRGYPHANCmKugRnRR4MyzbHYgLWppsxgG8eK/tBoWledkxkDBDZXNkaSjFUX0QhEOVaUVT0nVCAxhgRHORhIt2Ctaijvnz0qbePO3fraev3OGrIk7LE8y6iMbRhB6b5SLi10JNWa9svgwEEFTyUqTISACwG0hnhQvR2Geymk9JIs/z9LFXV0nmUv0W6vtDZ0U4uSoDcpuLs3gu55aH8E7Xvo/gi676HJCJpMojsC0Sct+/JVOQnFwBBTuGFMGzdOGSzhGr5itpteN6RjdpCp2rzi7SOv5/FSkL5iPZgmKUTsUz2WcitXS9MEVG8sfmp4x+WwnSG4WK0hpkfS90dWOTUp7yDjRctJbj/45Pep83cUkX1aOp43rXRkWqnXFerQiKcBmnjqwCRBHxL0VELVDVWRqA4OmjO5JRJQxWEZWBTV6OEil169IITSPANHR/sLNqd69cbBFejMsRSfgg0lFD9SQeYcz028NoaVMSqwiUuDcr1exxl1/4C45cg9T43zuOBVPbm7Q8BY9OCOdk7HfQJ4wGt4jXneYOXU5HrZhlSnlSz+hnVn/YvI5ZCIFt41jcmbxTc+3l5prK/cf7e+sPFwcOmQWXKV3CBLpEHukg3ynLwlm4SRb+Q7+UF+zvyuXaldq10fUI8fO4i5TMae2vIfVtXCMA==</latexit> ( )<latexit 
sha1_base64="hrEnsqzl3FLBMgGHZZwCzev8Pkg=">AAAHcHichVXbbhMxEDW3ptwLvCDxwEJVKVShaqAIEDxUXAQIEBdRWsGGyOud3Vj1eh2v0yY1+8rX8Ar/wm/wBYw3DU3iFCwlmZ1z5sx4JmtHSvDCrK7+OnL02PETc7X5k6dOnzl77vzChYsfi7ynGWywXOR6K6IFCC5hw3AjYEtpoFkkYDPafuzwzR3QBc/lBzNQ0MpoKnnCGTXoai8EoYG+qXRsJHpQ2puhKng9FKgR0xtle2FxdWW1WoFvNPeNxfV5Uq237Qtz98I4Z70MpGGCFsXnW2vKNECmuJ1Oy1JtOBNQngp7BSjKtmkKn9GUNIOiZatSymAJPXGQ5Bo/0gSVdzzC0qwoBlmEzIyaTjGNOedfbGkilUnutSyXqmdAsmGmpCcCkweuQUHMNTAjBmhQpjkWG7AO1ZQZbONE0R+aLeuqczITgMDmyuZYkonqIhqBKCeKsqrvhAokxpDgKIcD6RWsrSEu7ftnj0rbvHO3cXutcec2siTssjzLqIxtGOHQ8Cvl0kJXUq3poAz2HVTwVKLCVAi4EEBrhAfV00G4l0JKL8ny/7NUUYfnWfYS7fRLa0M3tSgJ+tOCO7tj6K6HDsbQgYfujaF7HpqMock0ui0QfdK2L1+V01AMDDHVtqHpgKFlUMc9fMVsN7xuSMfsIlN1eMXbQ17f46UgfcVGMEtSiNineizldq7qswRUfyJ+ZnjX5bDdEbhU7SGmh9L3xnY5MynvIuNF20luPfjk96n7dxSRfVo6njetdGxaqdcV6tCIpwGaeOrANEEfEPRMQtUNVZGoDvabM/1KJKCKgzKwKKrRw0UuvXpBCKV5Bo6O9hdsTvXojYMr0JljKT4DG0kofqiCzDmem3htjCpjVGAT68NyvV7HGXX/gLjtyH1PjfO44FU9ubtDwFj04BvtnI77BPCA1/Aa87zByqnJ9bINqU4rWfwNG876F5HLEREtvGua0zeLb3y8tdJcW7n/bm1x/eHw0iHz5Aq5TuqkSe6SdfKcvCUbhJFv5Dv5QX7O/a5drl2tXRtSjx7Zj7lEJlZt+Q+u8cGp</latexit> ⇤<latexit sha1_base64="m8nVgFoP/qYUlLvqgUQGNeGILaQ=">AAAHl3ichVVtbxM5EDYvRwr3QoFPiA+sqCqVKooS2goqPlDxfoLTAaJQQcLi9c5urHq9jtdpk5r9yK/hK/yY+zc33jQ0Wad3lqLMzvPMPOOZ9TpSghem3f7nzNlz53+50Fi6eOnX337/4/LylavvinyoGeyyXOR6L6IFCC5h13AjYE9poFkk4H20/8jh7w9AFzyXb81YQS+jqeQJZ9SgK1y+2TUwMlUeq/MojMQQStuNwNBPdr0sw+WVdqtdrcA3OsfGys4Sqdar8MqFz904Z8MMpGGCFsXHO5vKNEGmuJt+z1JtOBNQXuoOC1CU7dMUPqIpaQZFz1aVlMEqeuIgyTX+pAkq72yEpVlRjLMImRk1/aKOOedPbHVOyiT3epZLNTQg2UQpGYrA5IHrTxBzDcyIMRqUaY7FBqxPNWUGuzhX9NtOz7rqXJo5QGBvZWdGZK66iEYgyrmirBq5RAUSY0hwkpN5DAsWaohL++bZw9J2tu42NzabWxs11snUdBqVtt1sbWxsNVvb28iTcMjyLKMydiOt5ppyaWEgqdZ0XAbHDip4KlGpFgIuBNCa4kH1dBLuSUjpiaz/v0oVdbrOuid0MCqt7brpRkkwqic8OJxBDz10PIOOPfRoBj3y0GQGTerovkD0cWhfvCzrUAwMMRXarunj0SqDNdzDF1S77XVDOuYAmarPK94R8kYeLwXpZ2wGi1IKEftUj6XcztXaogRqNBe/MHzgNOxgCq5We4jpqfSjmV0uFOUDZPwZupR79z/4fRr8HEVkn5SO500rnZlW6nWFOjTiaYAmfp2gTtAnBL2QUHVDVSSqg+Pm1I9EAqo4KQOLoho9XOTSqxeEUJpn4Ohof8LmVI/eOLgCnTmW4guwaQrFT80gc47fV7xdppUxKrCJa5NyvV7HGXVvQBw68sjLxnlc8Kqe3F01YCx68EQ7p+M+BrwINPyFOn9j5dTket12qU6rtPjfbTrrv4hcTolo4Z3Uqd9AvvHuTquz2dp+vbmy83ByOZElcoPcImukQ+6SHfKcvCK7hJGv5Bv5Tn40rjceNJ42nk+oZ88cx1wjc6vx+l9Y6dAb</latexit>D KL [⇡ ⇤ ||⇡ 1 ]<latexit sha1_base64="IfBBpeq5xourjsLI1o0DZ6TcM78=">AAAHq3ichVVtb9s2EGZfVqfdS9P2474IDQJkgWfYdYI26JdgbdEO27Bua9qgkatS1EkmQlE0RSd2GP2Q/pp93X7C/s2OctzYorMRsH2657m7h3emGCvBS9Pt/nPt+o2bX9xqrd2+8+VXX39zd/3e/bdlMdYMDlghCn0Y0xIEl3BguBFwqDTQPBbwLj5+5vB3J6BLXsg3ZqpgkNNM8pQzatAVrfdDAxNT57HjkkUakso+j+xPP1dHoeKRDWMw9IPdrqrzc+foDapofaPb6dYr8I3ehbGxv0bq9Tq6d+tjmBRsnIM0TNCyPHq0o0wbZIY7HA4s1YYzAdWdcFyCouyYZnCEpqQ5lANbq6uCTfQkQVpo/EgT1N7FCEvzspzmMTJzaoZlE3POz9jmUimTPhlYLtXYgGSzSulYBKYIXM+ChGtgRkzRoExzFBuwIdWUGezskug3vYF16lyaJUBgv2VvociSupjGIKolUVZNXKISiQmkON3GjH5/+UNle7uP2/2d9m6/wdJFHMViDJXVWVzZbrvT7++2O3t7yJNwyoo8pzJxs63cV8alhZGkWtNpFVw4qOCZxEqNEHAhgNYcD+qny3CvhJReke3/r1JHXV1n2yt0MqmsDd104zSYNBOenC6gpx46XUCnHnq2gJ55aLqApk30WCA6O09NKAGGmMIjZoZ4xqpgC/dwjtW+87ohHXOETDXkNe8MeROPl4H0M7aDVSmFSHyqx1Ju52prVQI1WYpfGT5yNexoDm7We0jolfSzhV2uLMpHyPgxcikPn773+zT6PIrYvqgcz5tWtjCtzOsKdWjMswBNfDtBk6AvCXoloe6GqklUBxfNaR6JFFR5KQNFUY0eLgrp6QUhlOY5ODraH7A59aM3Dq5A546l+ApsnkLxKzPIguP7FW+cuTJGBTZxaybX63WSU/cPSCJHnnjZOE9KXusp3PUDxqIHT7RzOu5zwItAwy9Y51dUTk2ht21IdVanxd+w7az/InI5J6KFd1KveQP5xttHnd5OZ++3nY39/uxyImvkW/KQbJEeeUz2ySvymhwQRj6RP8lf5O/W960/Wu9b4Yx6/dpFzAOytFrwLzJ72KI=</latexit>D KL [⇡ ⇤ ||⇡ 0 ]<latexit 
sha1_base64="4dB3lG6U6b3giZY8ogArFUeD+sU=">AAAHq3ichVVtb9s2EGZfVqfdS9P2474IDQJkgWfYdYI26JdgbdEO27Bua9qgkatS1EkmQlE0RSd2GP2Q/pp93X7C/s2OctzYorMRsH2657m7h3emGCvBS9Pt/nPt+o2bX9xqrd2+8+VXX39zd/3e/bdlMdYMDlghCn0Y0xIEl3BguBFwqDTQPBbwLj5+5vB3J6BLXsg3ZqpgkNNM8pQzatAVrfdDAxNT57HjkkUakso+j+xPP1dHoeKRDWMw9IPdrqrzc+foDqpofaPb6dYr8I3ehbGxv0bq9Tq6d+tjmBRsnIM0TNCyPHq0o0wbZIY7HA4s1YYzAdWdcFyCouyYZnCEpqQ5lANbq6uCTfQkQVpo/EgT1N7FCEvzspzmMTJzaoZlE3POz9jmUimTPhlYLtXYgGSzSulYBKYIXM+ChGtgRkzRoExzFBuwIdWUGezskug3vYF16lyaJUBgv2VvociSupjGIKolUVZNXKISiQmkON3GjH5/+UNle7uP2/2d9m6/wdJFHMViDJXVWVzZbrvT7++2O3t7yJNwyoo8pzJxs63cV8alhZGkWtNpFVw4qOCZxEqNEHAhgNYcD+qny3CvhJReke3/r1JHXV1n2yt0MqmsDd104zSYNBOenC6gpx46XUCnHnq2gJ55aLqApk30WCA6O09NKAGGmMIjZoZ4xqpgC/dwjtW+87ohHXOETDXkNe8MeROPl4H0M7aDVSmFSHyqx1Ju52prVQI1WYpfGT5yNexoDm7We0jolfSzhV2uLMpHyPgxcikPn773+zT6PIrYvqgcz5tWtjCtzOsKdWjMswBNfDtBk6AvCXoloe6GqklUBxfNaR6JFFR5KQNFUY0eLgrp6QUhlOY5ODraH7A59aM3Dq5A546l+ApsnkLxKzPIguP7FW+cuTJGBTZxaybX63WSU/cPSCJHnnjZOE9KXusp3PUDxqIHT7RzOu5zwItAwy9Y51dUTk2ht21IdVanxd+w7az/InI5J6KFd1KveQP5xttHnd5OZ++3nY39/uxyImvkW/KQbJEeeUz2ySvymhwQRj6RP8lf5O/W960/Wu9b4Yx6/dpFzAOytFrwLytm2KE=</latexit>1<latexit sha1_base64="0hkwRoabOXQORf1Ezwi2mTQEJQs=">AAAHd3ichVXbbts4EGV6idPupbfHPlTYIAtvYBhWnSAN+hL0gm6xu9i2aNpga9elqJFMhKJoik7ssPqCfd1+XD+lbztU4sYWnVaA7dGcM3MOh5YYKcEL0+l8Xrl0+crV1cbates//PjTzzdu3rr9psjHmsE+y0WuDyJagOAS9g03Ag6UBppFAt5Gh48d/vYIdMFz+dpMFfQzmkqecEYNpl6Gg5vrnXanugI/CM+C9b01Ul0vBrdWP/TinI0zkIYJWhTv7m8p0wKZotth31JtOBNQXu+NC1CUHdIU3mEoaQZF31aOy2ADM3GQ5Bo/0gRVdr7C0qwoplmEzIyaYVHHXPIrtrEgZZIHfculGhuQ7FQpGYvA5IFbfxBzDcyIKQaUaY5mAzakmjKDU1ow/TrsW+fOtVkABM5OhnMiC+4iGoEoF0xZNXGNCiTGkOBOVeu144INNMSlffXsUWnD7Z1Wd6u13a2xdB4NIjGG0uo0Km2n1e52t1vt3V3kSThmeZZRGdtehAz8Srm0MJJUazotg7MEFTyVqFQrAVcCGM3woLo7L/ckpPRENr+vUlVdrLPpCR1NSmt7bnejJJjUGx4dz6HHHjqdQ6ceejKHnnhoMocmdfRQIPpkYP/4s6xDMTDE1MD2zBAMLYMmruEjqv3mTUM65giZasgr3gnyJh4vBel3bAXLWgoR+1SPpdzKVXNZAzVZqF9aPnIadjQDN6o1xPRC+sncKpeK8hEyng9cy4OH//hzGn3disg+LR3P2610brdSbyrUoRFPAwzx7QR1gj4n6KWEahqqIlEdnA2n/kgkoIpzG2iKasxwkUvPLwihNM/A0TF+j8Opbr3t4Ap05liKL8FmLRS/sIPMOb5f8fSYOWNU4BCbp3a9WccZdf+AeODIE68b53HBKz+5O0rAWMzgE+2SjvsE8CDQ8Bfq/I3Oqcn1pu1RnVZt8bfXctG3iFzOiBjhmRTWTyA/eHO/HW61d19ure91Tw8nskbukl9Ik4Rkh+yR38kLsk8YAfIv+Y98Wv3SuNf4tdE8pV5aOau5QxauRvg/VRvC7Q==</latexit> ⇤ (⌘ ⇤ )<latexit sha1_base64="CqGKO5XULoKWsqLr2vdEOk0ApXk=">AAAHp3ichVVtb9s2EGbbrc66vqTdx30RFgRwAsO16wRt0C/B1mErtqHt0KRBK0elqJNMhKJoik7scPoZ+zX7uv6I/ZsdlbixRWcTYOt0z3P3HO9EMVaCl6bX++fGzVtffHm7tfbVna/v3rv/YP3ho8OymGgGB6wQhT6KaQmCSzgw3Ag4UhpoHgt4F5/84PB3p6BLXsi3ZqZgmNNM8pQzatAVrT8ODUxNncdOShZpSCobqpIf2+2qHYKhkQ1jvLnnaquK1jd63V59Bb7RvzQ29tdIfb2OHt7+GCYFm+QgDRO0LD882VGmAzLD1Y2GlmrDmYDqTjgpQVF2QjP4gKakOZRDW1dWBZvoSYK00PiTJqi9ixGW5mU5y2Nk5tSMyibmnJ+xzSUpkz4bWi7VxIBkF0rpRASmCFy/goRrYEbM0KBMcyw2YCOqKTPY1aWi3/aH1lXn0iwBAnst+wsiS9XFNAZRLRVl1dQlKpGYQIqTbczn95++r2x/92lnsNPZHTRYuoijWEygsjqLK9vrdAeD3U53bw95Es5YkedUJm6olfvLuLQwllRrOquCSwcVPJOo1AgBFwJozfGgfroK9ySk9ES2/1+ljrpeZ9sTOp1W1oZuunEaTJsJT88W0DMPnS2gMw89X0DPPTRdQNMmeiIQfRHZX36tmlACDDGFe8uMcHNVQRvX8AeqbXndkI45RqYa8Zp3jrypx8tA+hk7waqUQiQ+1WMpt3LVXpVATZfiV4aPnYYdz8HNeg0JvZZ+vrDKlaJ8jIyXkUt59Py936fx51HE9sfK8bxpZQvTyryuUIfGPAvQxK8TNAn6iqBXEupuqJpEdXDZnOaWSEGVV2VgUVSjh4tCevWCEErzHBwd7WNsTv3ojYMr0LljKb4Cm6dQ/NoMsuD4fcXTZl4ZowKb2L4o1+t1klP3BiSRI0+9bJwnJa/rKdzRA8aiB3e0czruC8CDQMNvqPMKK6em0Ns2pDqr0+I97Djrv4hczolo4ZnUb55AvnH4pNvf6e692dnYH1wcTmSNfEu+I23SJ0/JPvmZvCYHhJE/yV/kb/KptdV61TpsHV1Qb964jPmGLF0t+i95M9at</latexit>=<latexit 
sha1_base64="QgR5W/V+UxaZ8+RT4cKtja5eHXw=">AAAHjnichVVRbxNHEF6g4BRoCSD1hZcTUaQ0siwbJ4QIRUQtFVRtVagIRMTm2NubO6+yt7feWyd2lvszvMIf4t8we4mJfevQk2zPzffNfLMzvptICV6YdvvLpctXfrh6rbH04/UbN3/6+dby7Tuvi3ykGeyxXOR6P6IFCC5hz3AjYF9poFkk4E10+LvD3xyBLnguX5mJgn5GU8kTzqhBV7j8S8/A2FR57KhgoYa4tDtluLzSbrWrK/CNzpmxsrtEqutFePva+16cs1EG0jBBi+LgwYYyTZApHmLQt1QbzgSU13ujAhRlhzSFAzQlzaDo26qAMlhFTxwkucaPNEHlnY2wNCuKSRYhM6NmUNQx5/yGrc5JmeRR33KpRgYkO1VKRiIweeDaEsRcAzNiggZlmmOxARtQTZnB5s0V/arTt646l2YOENhS2ZkRmasuohGIcq4oq8YuUYHEGBIcYG0M/z37rbSdza1md6O52a2xdB6FkRhBaXUalbbdbHW7m83W9jbyJByzPMuojG0vQgZ+pVxaGEqqNZ2UwZmDCp5KVKqFgAsBtKZ4UN2dh3sSUnoi6/+vUkVdrLPuCR2NS2t7brpREozrCY+OZ9BjD53MoBMPPZlBTzw0mUGTOnooEH0a2r/+LutQDAwxFdqeGYChZbCGZ/iAar963ZCOOUSmGvCKd4K8scdLQfoZm8GilELEPtVjKXdytbYogRrPxS8MHzoNO5yCq9UZYnoh/WTmlAtF+RAZf4Yu5f7jt36fht9GEdk/SsfzppXOTCv1ukIdGvE0QBPfTlAn6HOCXkiouqEqEtXBWXPqj0QCqjgvA4uiGj1c5NKrF4RQmmfg6Gi/w+ZUt944uAKdOZbiC7BpCsUvzCBzju9XXCrTyhgV2MS103K9XscZdf+AOHTksZeN87jgVT252zBgLHrwiXZOx30KuAg0/IM6/2Ll1OR63faoTqu0+NtrOut7RC6nRLRwJ3XqG8g3Xj9odTZa2y83Vna7p8uJLJF75D5ZIx2yRXbJc/KC7BFGPpCP5BP53FhuPGzsNJ6cUi9fOou5S+auxvOvyvzMEg==</latexit> ( ⇤ )<latexit sha1_base64="1mbdaqtzDXwIQxwv1A4lTnyzH3c=">AAAHnXichVVtb9s2EGa7tk7bvaTbx36YsCCAGxiGXSfogn0JthbdsG5ri6YJVrkaRZ1kIhRFU3Rih9Mv6K/p1+2X7N/0KMeNLTodAdune567e3hnirESvDS93n/Xrn924+at1sbtO3c//+LLrzbvff26LCaawSErRKGPY1qC4BIODTcCjpUGmscCjuKTnxx+dAq65IV8ZWYKhjnNJE85owZd0eZ2aGBq6jxWF3EUiwlUNlQlb4cxGPrW7lQPqmhzq9ft1Svwjf6FsXWwQer1PLp3668wKdgkB2mYoGX55uGuMh2QGe5pNLRUG84EVHfCSQmKshOawRs0Jc2hHNpaTxVsoycJ0kLjR5qg9i5HWJqX5SyPkZlTMyqbmHN+xLZXSpn0+6HlUk0MSDavlE5EYIrAdSlIuAZmxAwNyjRHsQEbUU2ZwV6uiH7VH1qnzqVZAQR2WPaXiqyoi2kMoloRZdXUJSqRmECK85xPZVKySENS2ZdPf6xsf+9RZ7Db2Rs0WJez01lc2V6nOxjsdbr7+8iTcMaKPKcysTjTyn1lXFoYS6o1nVXBhYMKnkms1AgBFwJoLfCgfroM90pI6RXZ+f8qddTVdXa8QqfTytrQTTdOg2kz4enZEnrmobMldOah50vouYemS2jaRE8Eoo8j++uzqgklwBBTkQ3NCM9WFbRxD39jtQdeN6RjjpGpRrzmnSNv6vEykH7GTrAupRCJT/VYyu1ctdclUNOV+LXhY1fDjhfgdr2HhF5JP1/a5dqifIyMXyKX8viHP/0+jT+OIrZPKsfzppUtTSvzukIdGvMsQBPfTtAk6EuCXkuou6FqEtXBRXOaRyIFVV7KQFFUo4eLQnp6QQileQ6OjvZbbE796I2DK9C5Yym+BlukUPzKDLLg+H7FO2ahjFGBTWzP5Xq9TnLq/gFJ5MhTLxvnSclrPYW7cMBY9OCJdk7HfQx4EWj4Dev8gcqpKfSODanO6rT4G3ac9SkilwsiWngn9Zs3kG+8ftjt73b3X+xuHQzmlxPZIPfJd6RN+uQROSA/k+fkkDDyjrwn/5B/W9+2nrSetX6fU69fu4j5hqys1tEHnELSQQ==</latexit>⌘ <latexit sha1_base64="o292XNuHKqZokwRUPXEZgh+bYWQ=">AAAHgnichVXbbtNAEF2uKffbIy8WVaVSRVHStIIKHiouAgSIiyhUkBDW67Gz6nq9WW/apIt/g1f4Lf6GWTehiTcFS4nHc87MnJ3xekMleG6azd+nTp85e+58benCxUuXr1y9dv3GzY95NtQMdlgmMr0b0hwEl7BjuBGwqzTQNBTwKdx77PBP+6BznskPZqygm9JE8pgzatDV6YChPdsJ8Vb0ri83G83yCnyjNTGWt5dIeb3t3Tj/rRNlbJiCNEzQPP+yvqFMHWSCwvtdS7XhTEBxsTPMQVG2RxP4gqakKeRdW4ovghX0REGcafxJE5Te2QhL0zwfpyEyU2r6eRVzzr/YylwpE9/vWi7V0IBkR5XioQhMFrhWBBHXwIwYo0GZ5ig2YH2qKTPYsDnRH1pd69S5NHOAwDbK1kyROXUhDUEUc6KsGrlEORIjiHFo5XrtMGc9DVFh3z97VNjW5r16e6O+2a6wdBb2QjGEwuokLGyz3mi3N+uNrS3kSThgWZpSGblxFu4v4dLCQFKt6bgIJg4qeCKxUiUEXAigNcWD8uk43CshpVdk7f9VyqiT66x5hfZHhbUdN90wDkbVhPsHM+iBh45n0LGHHs6ghx4az6BxFd0TiD7p2ZeviioUAUNM4a4yfbetglVcw3esdtfrhnTMATJVn5e8Q+SNPF4C0s9YDxalFCLyqR5LuZWr1UUJ1GgufmH4wNWwgym4Uq4hoifSD2dWubAoHyDjRc+l3H3w2e/T4O8oQvu0cDxvWsnMtBKvK9ShIU8CNPHrBFWCPibohYSyG6okUR1MmlPdEjGo/FgGiqIaPVxk0tMLQijNU3B0tL9ic8pHbxxcgU4dS/EF2DSF4idmkBnH7yseJFNljAps4uqRXK/XUUrdGxD1HHnkZeM8ynmpJ3OnChiLHtzRzum4TwAPAg2vsc4bVE5Nptdsh+qkTIv3Tt1Z/yJyOSWihWdSq3oC+cbH9UZro7H1bmN5u310OJElcpvcIaukRe6RbfKcvCU7hBFFfpCf5FftbG2t1qpNuKdPTWJukbmr9vAPY1THEA==</latexit>logp ✓ (x)<latexit 
sha1_base64="XiB/TbqA98lrmgL2pTtLE1lTTxg=">AAAHoHichVVRbxNHEF4oxYEWGtpHXk5EkUxkWTZJRKO+RJSqoLYCKgIRnLnu7c2dV9nbW++tEzvL/YX+mr7S/8G/YfYcY/vWgZUsz833zXyzM7e3sRK8NL3exytXv7n27fXWxo2b331/6/YPm3d+fFUWY83giBWi0McxLUFwCUeGGwHHSgPNYwGv45NfHf76FHTJC/nSTBUMcppJnnJGDbqizXZoYGLqPFYXcRSLMVQ2FEUWqMiGZgiGVu3J/Sra3Op1e/UKfKN/YWwdbpB6PY/uXP8nTAo2zkEaJmhZvn2wp0wHZIbbGg4s1YYzAdXNcFyCouyEZvAWTUlzKAe2LqkKttGTBGmh8SdNUHuXIyzNy3Kax8jMqRmWTcw5P2PbK1Im/XlguVRjA5LNlNKxCEwRuEYFCdfAjJiiQZnmWGzAhlRTZrCdK0W/7A+sq86lWQEENln2l0RWqotpDKJaKcqqiUtUIjGBFEc6G8y4ZJGGpLJ///6osv39h53dvc7+boO1GJ/O4sr2Ot3d3f1O9+AAeRLOWJHnVCY2jN2AY8i4tDCSVGs6rYILBxU8k6jUCAEXAmjN8aB+WoR7ElJ6IjtfV6mjLtfZ8YROJ5W1oZtunAaTZsLTsyX0zEOnS+jUQ8+X0HMPTZfQtImeCEQfR/aPP6smlABDbHG0gjbu4T2q3fe6IR1zhEw15DXvHHkTj5eB9DN2gnUphUh8qsdSbueqvS6BmqzErw0fOQ07moPb9R4Sein9fGmXa0X5CBlPI5fy+Jc3fp9Gn0cR298qx/OmlS1NK/O6Qh0a8yxAE79O0CToBUGvJdTdUDWJ6uCiOc0jkYIqF2VgUVSjh4tCevWCEErzHBwd7XfYnPrRGwdXoHPHUnwNNk+h+KUZZMHx+4rXzLwyRgU2sT0r1+t1klP3BiSRI0+8bJwnJa/rKdydA8aiB0+0czruY8CLQMNfqPMMK6em0Ds2pDqr0+J/2HHWl4hczolo4Z3Ub95AvvHqQbe/1z14sbd1+Gh2OZENcpfcI23SJw/JIXlCnpMjwsi/5D/ygfzfutd60nrWejGjXr1yEfMTWVmtN58Ak1jTvQ==</latexit>logp ✓ (x)<latexit sha1_base64="XiB/TbqA98lrmgL2pTtLE1lTTxg=">AAAHoHichVVRbxNHEF4oxYEWGtpHXk5EkUxkWTZJRKO+RJSqoLYCKgIRnLnu7c2dV9nbW++tEzvL/YX+mr7S/8G/YfYcY/vWgZUsz833zXyzM7e3sRK8NL3exytXv7n27fXWxo2b331/6/YPm3d+fFUWY83giBWi0McxLUFwCUeGGwHHSgPNYwGv45NfHf76FHTJC/nSTBUMcppJnnJGDbqizXZoYGLqPFYXcRSLMVQ2FEUWqMiGZgiGVu3J/Sra3Op1e/UKfKN/YWwdbpB6PY/uXP8nTAo2zkEaJmhZvn2wp0wHZIbbGg4s1YYzAdXNcFyCouyEZvAWTUlzKAe2LqkKttGTBGmh8SdNUHuXIyzNy3Kax8jMqRmWTcw5P2PbK1Im/XlguVRjA5LNlNKxCEwRuEYFCdfAjJiiQZnmWGzAhlRTZrCdK0W/7A+sq86lWQEENln2l0RWqotpDKJaKcqqiUtUIjGBFEc6G8y4ZJGGpLJ///6osv39h53dvc7+boO1GJ/O4sr2Ot3d3f1O9+AAeRLOWJHnVCY2jN2AY8i4tDCSVGs6rYILBxU8k6jUCAEXAmjN8aB+WoR7ElJ6IjtfV6mjLtfZ8YROJ5W1oZtunAaTZsLTsyX0zEOnS+jUQ8+X0HMPTZfQtImeCEQfR/aPP6smlABDbHG0gjbu4T2q3fe6IR1zhEw15DXvHHkTj5eB9DN2gnUphUh8qsdSbueqvS6BmqzErw0fOQ07moPb9R4Sein9fGmXa0X5CBlPI5fy+Jc3fp9Gn0cR298qx/OmlS1NK/O6Qh0a8yxAE79O0CToBUGvJdTdUDWJ6uCiOc0jkYIqF2VgUVSjh4tCevWCEErzHBwd7XfYnPrRGwdXoHPHUnwNNk+h+KUZZMHx+4rXzLwyRgU2sT0r1+t1klP3BiSRI0+8bJwnJa/rKdydA8aiB0+0czruY8CLQMNfqPMMK6em0Ds2pDqr0+J/2HHWl4hczolo4Z3Ub95AvvHqQbe/1z14sbd1+Gh2OZENcpfcI23SJw/JIXlCnpMjwsi/5D/ygfzfutd60nrWejGjXr1yEfMTWVmtN58Ak1jTvQ==</latexit>1<latexit sha1_base64="0hkwRoabOXQORf1Ezwi2mTQEJQs=">AAAHd3ichVXbbts4EGV6idPupbfHPlTYIAtvYBhWnSAN+hL0gm6xu9i2aNpga9elqJFMhKJoik7ssPqCfd1+XD+lbztU4sYWnVaA7dGcM3MOh5YYKcEL0+l8Xrl0+crV1cbates//PjTzzdu3rr9psjHmsE+y0WuDyJagOAS9g03Ag6UBppFAt5Gh48d/vYIdMFz+dpMFfQzmkqecEYNpl6Gg5vrnXanugI/CM+C9b01Ul0vBrdWP/TinI0zkIYJWhTv7m8p0wKZotth31JtOBNQXu+NC1CUHdIU3mEoaQZF31aOy2ADM3GQ5Bo/0gRVdr7C0qwoplmEzIyaYVHHXPIrtrEgZZIHfculGhuQ7FQpGYvA5IFbfxBzDcyIKQaUaY5mAzakmjKDU1ow/TrsW+fOtVkABM5OhnMiC+4iGoEoF0xZNXGNCiTGkOBOVeu144INNMSlffXsUWnD7Z1Wd6u13a2xdB4NIjGG0uo0Km2n1e52t1vt3V3kSThmeZZRGdtehAz8Srm0MJJUazotg7MEFTyVqFQrAVcCGM3woLo7L/ckpPRENr+vUlVdrLPpCR1NSmt7bnejJJjUGx4dz6HHHjqdQ6ceejKHnnhoMocmdfRQIPpkYP/4s6xDMTDE1MD2zBAMLYMmruEjqv3mTUM65giZasgr3gnyJh4vBel3bAXLWgoR+1SPpdzKVXNZAzVZqF9aPnIadjQDN6o1xPRC+sncKpeK8hEyng9cy4OH//hzGn3disg+LR3P2610brdSbyrUoRFPAwzx7QR1gj4n6KWEahqqIlEdnA2n/kgkoIpzG2iKasxwkUvPLwihNM/A0TF+j8Opbr3t4Ap05liKL8FmLRS/sIPMOb5f8fSYOWNU4BCbp3a9WccZdf+AeODIE68b53HBKz+5O0rAWMzgE+2SjvsE8CDQ8Bfq/I3Oqcn1pu1RnVZt8bfXctG3iFzOiBjhmRTWTyA/eHO/HW61d19ure91Tw8nskbukl9Ik4Rkh+yR38kLsk8YAfIv+Y98Wv3SuNf4tdE8pV5aOau5QxauRvg/VRvC7Q==</latexit> ⇤<latexit 
sha1_base64="m8nVgFoP/qYUlLvqgUQGNeGILaQ=">AAAHl3ichVVtbxM5EDYvRwr3QoFPiA+sqCqVKooS2goqPlDxfoLTAaJQQcLi9c5urHq9jtdpk5r9yK/hK/yY+zc33jQ0Wad3lqLMzvPMPOOZ9TpSghem3f7nzNlz53+50Fi6eOnX337/4/LylavvinyoGeyyXOR6L6IFCC5h13AjYE9poFkk4H20/8jh7w9AFzyXb81YQS+jqeQJZ9SgK1y+2TUwMlUeq/MojMQQStuNwNBPdr0sw+WVdqtdrcA3OsfGys4Sqdar8MqFz904Z8MMpGGCFsXHO5vKNEGmuJt+z1JtOBNQXuoOC1CU7dMUPqIpaQZFz1aVlMEqeuIgyTX+pAkq72yEpVlRjLMImRk1/aKOOedPbHVOyiT3epZLNTQg2UQpGYrA5IHrTxBzDcyIMRqUaY7FBqxPNWUGuzhX9NtOz7rqXJo5QGBvZWdGZK66iEYgyrmirBq5RAUSY0hwkpN5DAsWaohL++bZw9J2tu42NzabWxs11snUdBqVtt1sbWxsNVvb28iTcMjyLKMydiOt5ppyaWEgqdZ0XAbHDip4KlGpFgIuBNCa4kH1dBLuSUjpiaz/v0oVdbrOuid0MCqt7brpRkkwqic8OJxBDz10PIOOPfRoBj3y0GQGTerovkD0cWhfvCzrUAwMMRXarunj0SqDNdzDF1S77XVDOuYAmarPK94R8kYeLwXpZ2wGi1IKEftUj6XcztXaogRqNBe/MHzgNOxgCq5We4jpqfSjmV0uFOUDZPwZupR79z/4fRr8HEVkn5SO500rnZlW6nWFOjTiaYAmfp2gTtAnBL2QUHVDVSSqg+Pm1I9EAqo4KQOLoho9XOTSqxeEUJpn4Ohof8LmVI/eOLgCnTmW4guwaQrFT80gc47fV7xdppUxKrCJa5NyvV7HGXVvQBw68sjLxnlc8Kqe3F01YCx68EQ7p+M+BrwINPyFOn9j5dTket12qU6rtPjfbTrrv4hcTolo4Z3Uqd9AvvHuTquz2dp+vbmy83ByOZElcoPcImukQ+6SHfKcvCK7hJGv5Bv5Tn40rjceNJ42nk+oZ88cx1wjc6vx+l9Y6dAb</latexit>⌘ ⇤ =<latexit sha1_base64="4RaIMYGMW1R/TXcq8EXp6e2eDx0=">AAAHoXichVVtb9s2EGbbrU67l6bdx34hFgRIA8Ow6wRdMAwItg5dsQ3LiqYNFrsaRZ1kIhRFU3Rih9Vv2K/Z1+137N/sKMeNLTqbAFune56753gnirGWorTd7j+3bt/56OO7rY179z/59LPPH2w+fPSmLCaGwzEvZGFOYlaCFAqOrbASTrQBlscS3sZn33n87TmYUhTqtZ1pGOYsUyIVnFl0RZtPBhamts7jTBFHsZxA5QZgWeQGMd7eud2qot/QKtrc6na69UVDo3dlbB1ukPo6ih7e/X2QFHySg7JcsrI8fbqnbRtUhusaDR0zVnAJ1f3BpATN+BnL4BRNxXIoh66uqaLb6EloWhj8KUtr73KEY3lZzvIYmTmzo7KJeecHbHtFyqZfDZ1QemJB8blSOpHUFtR3iibCALdyhgbjRmCxlI+YYdxiP1eKft0bOl+dT7MCSOyy6i2JrFQXsxhktVKU01OfqERiAinOdD6ZSckjA0nlXr34tnK9/Wft/l57v99gXc/PZHHluu1Ov7/f7hwcIE/BBS/ynKnEj7Xyf5lQDsaKGcNmFb1yMCkyhUqNEKhfCrQWOK2frsMDCaUCkd3/V6mjbtbZDYTOp5VzAz/dOKXTZsLziyX0IkBnS+gsQC+X0MsATZfQtImeSUSfR+7Hn6omlABHTOPusiPcXhXdwTW8R7UnQTeUZ46RqUei5l0ibxrwMlBhxjZdl1LKJKQGLO1XrnfWJdDTlfi14WOv4cYLcLteQ8JupF8urXKtqBgj42XkU558/VvYp/GHUcTu+8rzgmllS9PKgq4wj8Yio2ji1wmaBHNNMGsJdTd0TWKGXjWnuSVS0OV1GVgUM+gRslBBvSClNiIHT0f7HTanfgzGITSY3LO0WIMtUmhxYwZVCPy+4jmzqIwziU3cmZcb9DrJmX8DksiTp0E2IZJS1PUU/tAB69CDO9o7Pfc54EFg4GfU+QUrZ7Ywu27ATFanxfug7a3/Igq1IKKFZ1KveQKFxpunnd5e5+DXva3D/vxwIhvkMfmS7JAeeUYOyQ/kiBwTTv4gf5K/yN+trdbL1lHr1Zx6+9ZVzBdk5Wqd/gu73dPa</latexit> ( )<latexit sha1_base64="uWNtXWGR0fiaNB4SWXTPJ3L9Mg8=">AAAHgXichVVtb9s2EGbSNs6ybn37uC9CgwBOYBh2nKAN+iVoU3TFNjQtkjZY7HkUdZKJUBRN0YkdTj9jX9fftX+zoxIntuh0BGyf7nnu7uGdKYZK8Ny0Wv8uLd+7/2Cltvrd2vcPf/jx0eMnTz/n2UgzOGaZyPRJSHMQXMKx4UbAidJA01DAl/DsjcO/nIPOeSaPzERBL6WJ5DFn1KDrtKtyXu+GYOhm//F6q9kqV+Ab7WtjfX+VlOuw/2Tlz26UsVEK0jBB8/x0e0eZBsgEdQ96lmrDmYBirTvKQVF2RhM4RVPSFPKeLbUXwQZ6oiDONH6kCUrvbISlaZ5P0hCZKTWDvIo55w22MVfKxC97lks1MiDZVaV4JAKTBa4TQcQ1MCMmaFCmOYoN2IBqygz2a070UbtnnTqXZg4Q2EXZnikypy6kIYhiTpRVY5coR2IEMc6s3K8d5ayvISrsp3evC9vefdHo7DR2OxWWzsJ+KEZQWJ2EhW01mp3ObqO5t4c8CRcsS1MqI4vjLNxXwqWFoaRa00kRXDuo4InESpUQcCGA1hQPyqfbcK+ElF6Rrf+vUkbdXWfLK3Q+LqztuumGcTCuJjy/mEEvPHQyg0489HIGvfTQeAaNq+iZQPSgb3/5tahCETDEVN92zQCPVRHUcQ9/YbVNrxvSMYfIVANe8i6RN/Z4CUg/YyNYlFKIyKd6LOV2ruqLEqjxXPzC8KGrYYdTcKPcQ0TvpF/O7HJhUT5Exvu+S3ny6ne/T8ObUYT2beF43rSSmWklXleoQ0OeBGji2wmqBH1L0AsJZTdUSaI6uG5O9UjEoPJbGSiKavRwkUlPLwihNE/B0dH+A5tTPnrj4Ap06liKL8CmKRS/M4PMOL5f8R6ZKmNUYBPrV3K9Xkcpdf+AqO/IYy8b51HOSz2Zu1TAWPTgiXZOxz0AvAg0/IZ1PqByajK9ZbtUJ2Va/O02nPUtIpdTIlp4J7WrN5BvfN5utneaex931vc7V5cTWSU/keekTtrkBdknP5NDckwYycjf5B/ytXavtllr1bavqMtL1zHPyNyqvfoPdLrGEg==</latexit> ⇤ =max (1 ) (0)+ (1) ( )<latexit 
sha1_base64="hiFabegHXYUvglkSq5YOxlO1PjU=">AAAH0XichVVtbxtFEL4WqEN5S+Fjv6waRXKCa/nqRCVClSJaBAgQBZo2peeavbu58yp7e+u9dWJnOQnxlR/F7+AH8BX+ArPrOLFvHTjJ9uw8z8wzO+u9iSVnle71/rxx840337rV2nj79jvvvvf+B5t3PnxelROVwFFS8lIdx7QCzgQcaaY5HEsFtIg5vIhPHlv8xSmoipXimZ5JGBQ0FyxjCdXoGm6+jGLQ9LXZrckjEhV0SiLOCqaroXFITaIOaYf33WLHLiJZsXZvh3xMnO/SFe6Q+3Nrzh1ubvW6PfcQ3wgvjK3DjcA9T4d3bv0cpWUyKUDohNOqevVgT+oOiBz7MBoYqjRLONS3o0kFkiYnNIdXaApaQDUwrhc12UZPSrJS4Udo4rzLEYYWVTUrYmQWVI+qJmadl9j2ipTOPhkYJuREg0jmStmEE10S21mSMgWJ5jM0aKIYFkuSEVU00dj/laKfhQNjq7NpVgCOpyLCJZGV6mIaA69XijJyahNVSEwhw/+A26+ZVMlQQVqbH774rDbh/sNOf6+z32+wVBkPYz6B2qg8rk2v0+339zvdgwPkCThLyqKgIrX/g9p+5UwYGAuqFJ3V5MJBOcsFKjVCwIYAWgucuNVVuCchhCey+/8qLup6nV1P6HRaGxPZ040zMm0mPD1bQs88dLaEzjz0fAk999BsCc2a6AlH9MnQfP1N3YRSSBCTeB31yN3HNu7hF1Tb8bohLHOMTDlijneOvKnHy0H4GfESr0nJeepTPZa0O5ftdQnkdCV+bfjYapjxAtx2e0jptfTzpV2uFWVjZHw1tCmPP/3J79P48ihi83lted5p5UunlXtdoRaNWY7vSopvJ2gS1BVBrSW4bkhHoopcNKd5JTKQ1VUZWBRV6GG8FF69wLlUrABLR/s1NsctveNgElRhWZKtwRYpJLs2gygZvl9xLi0qSyjHJrbn5Xq9TnGiIDMdWvLUy8ZYWjFXT2mHFGiDHrzR1mm5TwAHgYJvUec7rJzqUu2aiKrcpcXfqGOt/yIysSCihTMpbE4g33j+oBvudQ++39s67M+HU7AR3A3uBe0gDB4Gh8GXwdPgKEiCP4K/gr+Df1o/tmatX1u/zak3b1zEfBSsPK3f/wUPSuFl</latexit> ⇤ =max (1 ) (0)+ (1) ( )<latexit sha1_base64="bEIKiqDfdoDbID/ucVGNeXROeRc=">AAAH53ichVVfbxtFEL+Utg7lXwqPfVk1iuQE17LrRCVClSJaBAgQBTVtRM899u7mzqvs7a331omd5T4Db4hXPhQPfBZemF3HiX3rwEmWZ+f3mz87s7sTS84q3ev9vXHrndt37rY237333vsffPjR1v2PX1XlRCVwnJS8VCcxrYAzAceaaQ4nUgEtYg6v49NnFn99BqpipXipZxKGBc0Fy1hCNaqirTLUMNXOj1FlHMV8ArUJY9D0rdmryVMSFnRKQs4KpqtojtQk7JB2/5Fb7NpFKCvW7u2ST4nTXan6u+TRXJpz62hru9ftuY/4Qv9S2D7aDNz3Irp/95cwLZNJAUInnFbVm8f7UndA5Fid0dBQpVnCob4XTiqQNDmlObxBUdACqqFxO6vJDmpSkpUKf0ITp122MLSoqlkRI7OgelQ1Mau8wnZWQunss6FhQk40iGQeKZtwokti601SpiDRfIYCTRTDZEkyooomGruykvTL/tDY7KybFYBjr0R/KchKdjGNgdcrSRk5tY4qJKaQ4cmY93dSJZGCtDY/ffVFbfoHTzqD/c7BoMG6PgUqj2vT63QHg4NO9/AQeQLOk7IoqEjtQXDnJGfCwFhQpeisJpcKylkuMFLDBKwJoLTAiVtdm3shhPCC7P1/FGd1c5w9L9DZtDYmtN2NMzJtOjw7X0LPPXS2hM489GIJvfDQbAnNmugpR/R5ZL79rm5CKSSISbyPeuQuZBv38CtG2/WqISxzjEw5Yo53gbypx8tB+B7xFq9xyXnqUz2WtDuX7XUO5HTFfq352MYw4wW44/aQ0hvpF0u7XBuUjZHxTWRdnnz+s1+n8VUrYvNlbXlet/KlbuVeVahFY5bjY0nxdYImQV0T1FqCq4Z0JKrIZXGaVyIDWV2ngUlRhRrGS+HlC5xLxQqwdJTfYnHc0msHk6AKy5JsDbZwIdmNHkTJ8H3FabXILKEci9iep+vVOsWRgsw0suSp542xtGIun9KOLtAGNXijrdJynwMOAgXfY5wfMHOqS7VnQqpy5xb/w46V/ovIxIKIEs6kfnMC+cKrx93+fvfwx/3to8F8OAWbwYPgYdAO+sGT4Cj4OngRHAdJ8Ffwz8btjTst1vqt9Xvrjzn11salzSfBytf6818heum9</latexit>Figure 3.14: Cherno point on () which holds between arbitrary 0 ; 1 and implies = (1) (0) for 0 = 0; 1 = 1. At this critical point, the kl divergence to the endpoints is equal, as shown in Cover and Thomas (2012) Ch. 11 or Brekelmans et al. (2020d) App. E3 D KL [ (z)jj 0 (z)] =D KL [ (z)jj 1 (z)]: (3.59) Cherno Point on the TVO Integrand For the unnormalized likelihood ratio log ~ 1 (z)= 0 (z), we can interpret the Cherno point using thermodynamic integration bounds (3.60) T1 X t=0 ( t+1 t )E t log p (x;z) q(zjx) logZ 1 T1 X t=0 ( t+1 t )E t+1 log p (x;z) q(zjx) : (3.60) With 0 (z) =q(zjx) as intvo (Masrani et al., 2019; Brekelmans et al., 2020a), we note that the inte- grand at 0 = 0 corresponds to the familiarelbo,E 0 log ~ 1 (x;z) 0 (zjx) = logZ 1 (x)D KL [q(zjx)jjp (zjx)]. Similarly, at 1 = 1, the integrandE 1 [] = logZ 1 (x) +D KL [p (zjx)jjq(zjx)] is an upper bound. Since (1) = logp (x) and (0) = 0, the condition for the Cherno point in (3.58) becomes =E log p (x;z) q(zjx) = logp (x); (3.61) 90 or the point after which the expected likelihood ratio switches from an lower bound to an upper bound. 
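To make the Chernoff-point conditions concrete, the following is a minimal numerical sketch, assuming a one-dimensional Gaussian base $q(z|x)$, an unnormalized Gaussian target $p(x,z)$, and grid quadrature; the densities, grid, and variable names are illustrative assumptions rather than any setup used elsewhere in this work. It locates $\beta^*$ by maximizing the objective in (3.57) and checks that the integrand at $\beta^*$ approximately equals $\log p_\theta(x)$, as in (3.61).

```python
# A minimal 1-D numerical check of the Chernoff point conditions (3.57), (3.58), (3.61).
# The Gaussian base/target below are illustrative assumptions, not the models used in this thesis.
import numpy as np

z = np.linspace(-12.0, 12.0, 20001)          # quadrature grid
dz = z[1] - z[0]

def log_normal(z, mu, sigma):
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

log_q = log_normal(z, 0.0, 1.0)                           # normalized base pi_0(z) = q(z|x)
log_p_joint = np.log(0.05) + log_normal(z, 2.0, 0.7)      # unnormalized target pi~_1(z) = p(x, z)

def psi(beta):
    """psi(beta) = log int q(z)^(1-beta) * p(x,z)^beta dz, via quadrature."""
    log_pi_beta = (1 - beta) * log_q + beta * log_p_joint
    return np.log(np.sum(np.exp(log_pi_beta)) * dz)

def integrand(beta):
    """TVO integrand E_{pi_beta}[log p(x,z)/q(z|x)]."""
    log_pi_beta = (1 - beta) * log_q + beta * log_p_joint
    w = np.exp(log_pi_beta - np.max(log_pi_beta))
    w /= np.sum(w)                                        # normalized pi_beta on the grid
    return np.sum(w * (log_p_joint - log_q))

log_px = psi(1.0)                                         # psi(1) = log p(x); here log(0.05)
betas = np.linspace(0.0, 1.0, 1001)

# Chernoff point via (3.57): maximize (1-beta)*psi(0) + beta*psi(1) - psi(beta)
gap = np.array([(1 - b) * psi(0.0) + b * psi(1.0) - psi(b) for b in betas])
beta_star = betas[np.argmax(gap)]

# Check (3.58)/(3.61): at beta*, the integrand eta_beta = psi'(beta) equals psi(1) - psi(0) = log p(x)
print("log p(x)           :", log_px)
print("beta*              :", beta_star)
print("integrand at beta* :", integrand(beta_star))      # approximately log p(x)
```

In this example $\psi(0) = 0$ and $\psi(1) = \log p(x)$, so the maximizer of (3.57) is exactly the point where the TVO integrand crosses $\log p(x)$.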
We visualize this in Fig. 3.13, with $\eta_{\beta^*}$, as a point on the y-axis, equal to the area under the curve, $\log p_\theta(x)$. Note that the red shaded regions correspond to the KL divergence from $\pi_{\beta^*}$ to each endpoint (see Brekelmans et al. (2020a)), and will have equal area due to Eq. (3.59).

3.9 Discussion: TVO and AIS

In this section, we investigate the limitations of the SNIS sampling procedure used in initial work on the TVO (Masrani et al., 2019; Brekelmans et al., 2020b; Chen et al., 2021). We highlight connections between the TVO objectives and AIS under perfect transitions (Neal, 2001; Grosse et al., 2013) or Bidirectional Monte Carlo (BDMC, Grosse et al. (2016)), which indicates that MCMC transition operators should be used for improved sampling of the expectations in the TVO. We investigate the accuracy of the SNIS sampling procedure in Sec. 3.9.2, and point to recent work (Zhang et al., 2021; Geffner and Domke, 2021) providing reparameterization gradients through MCMC transformations. These approaches might be considered in future work to obtain best results using TVO-style objectives. Finally, in Sec. 3.9.3 we discuss recent related work (Chen et al., 2021) using q-paths (Masrani et al., 2021) (see Ch. 4) to construct TVO objectives. We highlight that there are in fact two different approaches for TVO with q-paths.

3.9.1 TVO matches Single-Sample AIS under Perfect Transitions

BDMC (Grosse et al., 2015, 2016) provides multi-sample lower and upper bounds on $\log p_\theta(x)$ by sampling AIS chains starting from the initial or target distribution, respectively. In particular, these bounds correspond to the Independent Multi-Sample AIS ELBO and Coupled Reverse Multi-Sample AIS EUBO in Ch. 2. However, for the case of single-sample AIS lower and upper bounds on $\log p_\theta(x)$ ($\text{ELBO}_{AIS}$ and $\text{EUBO}_{AIS}$ from Eq. (2.20)), we show that under perfect transitions these reduce to the TVO lower or upper bound objective, i.e., a $T$-term Riemann sum with the same choice of schedule $\{\beta_t\}_{t=0}^{T}$.

We use the lower bounds as an example, recalling the definition of $\text{ELBO}_{AIS}$ from Eq. (1.47) or Eq. (2.20). As in Sec. 1.4.4, we can use the simplification of the log importance weights
$$\log \frac{p_{tgt}(x, z_{ext})}{q_{prop}(z_{ext}|x)} = \log w_{0:T} = \log \prod_{t=1}^{T} \left[\frac{p_\theta(x, z_t)}{q_\phi(z_t|x)}\right]^{\beta_t - \beta_{t-1}} = \sum_{t=1}^{T} (\beta_t - \beta_{t-1}) \log \frac{p_\theta(x, z_t)}{q_\phi(z_t|x)} \tag{3.62}$$
to rewrite the lower bound as
$$\log p_\theta(x) \;\geq\; \text{ELBO}_{AIS}(x; q^{ais}_{prop}, \beta) = \mathbb{E}_{q^{ais}_{prop}(z_{0:T}|x)}\left[\sum_{t=1}^{T} (\beta_t - \beta_{t-1}) \log \frac{p_\theta(x, z_t)}{q_\phi(z_t|x)}\right], \tag{3.63}$$
where the sampling distribution is given by $q^{ais}_{prop}(z_{0:T}|x) = \pi_0(z_0|x) \prod_{t=0}^{T-1} \mathcal{T}_f(z_{t+1}|z_t)$. While each transition kernel $\mathcal{T}_f(z_{t+1}|z_t)$ is constructed to leave $\pi_{\beta_t}(z|x)$ invariant (Neal, 2001), we may not achieve 'perfect transitions' which produce exact, independent samples from the invariant distribution. However, in the case of perfect transitions, we have
$$q^{ais}_{prop}(z_{0:T}|x) \overset{(pt)}{:=} q_\phi(z_0|x) \prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_t) \quad \text{and} \quad z_{0:T} \sim q^{ais}_{prop}(z_{0:T}|x). \tag{3.64}$$
Plugging this into Eq. (3.63),
$$\text{ELBO}_{AIS}(x; q^{ais}_{prop}, \beta) = \mathbb{E}_{q^{ais}_{prop}(z_{0:T}|x)}\left[\sum_{t=1}^{T} (\beta_t - \beta_{t-1}) \log \frac{p_\theta(x, z_t)}{q_\phi(z_t|x)}\right] \tag{3.65}$$
$$= \mathbb{E}_{\pi_0(z_0) \prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_t)}\left[\sum_{t=1}^{T} (\beta_t - \beta_{t-1}) \log \frac{p_\theta(x, z_t)}{q_\phi(z_t|x)}\right] \tag{3.66}$$
$$= \sum_{t=1}^{T} (\beta_t - \beta_{t-1})\, \mathbb{E}_{\pi_{\beta_{t-1}}(z_t)}\left[\log \frac{p_\theta(x, z_t)}{q_\phi(z_t|x)}\right] \tag{3.67}$$
$$= \text{TVO}_L(\theta, \phi; x), \tag{3.68}$$
which matches the TVO lower bound in Eq. (3.9). Similar reasoning applies for $\text{EUBO}_{AIS}(x; q^{ais}_{prop}, \beta)$ and the TVO upper bound, where sampling is performed using the extended state space target $p^{ais}_{tgt}(z_{0:T}|x)$ (see Sec. 2.3.1).
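As a sanity check on the equivalence above, the following sketch computes both quantities for a toy one-dimensional example in which each geometric intermediate $\pi_\beta$ is Gaussian and can be sampled exactly, so exact sampling stands in for perfect transitions; the endpoint densities, schedule, and sample sizes are illustrative assumptions rather than any model used in this thesis.

```python
# Toy 1-D check that the single-sample AIS ELBO under perfect transitions coincides with the
# TVO left-Riemann lower bound (Eqs. 3.63-3.68).  Gaussian endpoints chosen so that each
# pi_beta is Gaussian and can be sampled exactly; all settings are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, log_c = 2.0, 0.7, np.log(0.05)    # p(x,z) = exp(log_c) * N(z; mu, sigma), so log p(x) = log_c

def log_q(z):                                # q(z|x) = N(0, 1)
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def log_p_joint(z):                          # unnormalized target pi~_1(z) = p(x, z)
    return log_c - 0.5 * ((z - mu) / sigma)**2 - np.log(sigma * np.sqrt(2 * np.pi))

def sample_pi(beta, n):
    """Exact sample from pi_beta proportional to q(z)^(1-beta) p(x,z)^beta (Gaussian here)."""
    prec = (1 - beta) + beta / sigma**2
    mean = (beta * mu / sigma**2) / prec
    return rng.normal(mean, 1.0 / np.sqrt(prec), size=n)

T, K = 50, 2000
betas = np.linspace(0.0, 1.0, T + 1)

# TVO left-Riemann sum: expectations under pi_{beta_{t-1}} of the log ratio (Eq. 3.67)
tvo_lower = 0.0
for t in range(1, T + 1):
    z = sample_pi(betas[t - 1], K)
    tvo_lower += (betas[t] - betas[t - 1]) * np.mean(log_p_joint(z) - log_q(z))

# Single-sample AIS ELBO with perfect transitions: average of log w_{0:T} over K chains (Eq. 3.63)
log_w = np.zeros(K)
for t in range(1, T + 1):
    z = sample_pi(betas[t - 1], K)           # 'perfect transition' targeting pi_{beta_{t-1}}
    log_w += (betas[t] - betas[t - 1]) * (log_p_joint(z) - log_q(z))
elbo_ais = log_w.mean()

print("log p(x)        :", log_c)
print("TVO lower bound :", tvo_lower)
print("AIS ELBO (pt)   :", elbo_ais)
```

Both printed quantities use the same per-step expectations, so they agree up to Monte Carlo noise, and both lower-bound $\log p_\theta(x)$ by the sum of KL divergences in (3.77).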
For the upper bound, in particular, we need to initialize with an exact posterior sample $z_T \sim p_\theta(z|x)$. Under perfect transitions using reverse kernels, we have
$$p^{ais}_{tgt}(z_{0:T}|x) \overset{(pt)}{:=} p_\theta(z_T|x) \prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_{t-1}) \quad \text{and} \quad z_{0:T} \sim p^{ais}_{tgt}(z_{0:T}|x). \tag{3.69}$$
This yields expectations of each log ratio term under $\pi_{\beta_t}(z_t)$, which will correspond to the right-Riemann sum from the TVO perspective. Noting that taking expectations under $p_\theta(z_T|x) = \pi_{\beta_T}(z_T|x)$ corresponds to the single-sample EUBO in Eq. (3.8), we have
$$\text{EUBO}_{AIS}(x; q^{ais}_{prop}, \beta) = \mathbb{E}_{p^{ais}_{tgt}(z_{0:T}|x)}\left[\sum_{t=1}^{T} (\beta_t - \beta_{t-1}) \log \frac{p_\theta(x, z_t)}{q_\phi(z_t|x)}\right] \tag{3.70}$$
$$= \mathbb{E}_{p_\theta(z_T|x) \prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_{t-1})}\left[\sum_{t=1}^{T} (\beta_t - \beta_{t-1}) \log \frac{p_\theta(x, z_t)}{q_\phi(z_t|x)}\right] \tag{3.71}$$
$$= \sum_{t=1}^{T} (\beta_t - \beta_{t-1})\, \mathbb{E}_{\pi_{\beta_t}(z_t)}\left[\log \frac{p_\theta(x, z_t)}{q_\phi(z_t|x)}\right] \tag{3.72}$$
$$= \text{TVO}_U(\theta, \phi; x). \tag{3.73}$$
Thus, we have confirmed that the TVO lower and upper bounds correspond to single-sample AIS log partition bounds under perfect transitions.

Extended State Space Bounds
To confirm the above relationship with the TVO, we consider the gap in the lower bound on $\log p_\theta(x)$. The gap in the single-sample AIS lower bound $\text{ELBO}_{AIS}$ from Ch. 2 is an extended state space KL divergence which, under perfect transitions, becomes
$$\log p_\theta(x) - \text{ELBO}_{AIS}(x; q^{ais}_{prop}, \beta) = D_{KL}\big[q^{ais}_{prop}(z_{0:T}|x) \,\|\, p^{ais}_{tgt}(z_{0:T}|x)\big] \tag{3.74}$$
$$= \mathbb{E}_{q_\phi(z_0|x) \prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_t)}\left[\log \frac{q_\phi(z_0|x) \prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_t)}{p_\theta(z_T|x) \prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_{t-1})}\right] \tag{3.75}$$
$$= \mathbb{E}_{q_\phi(z_0|x)}\left[\log \frac{q_\phi(z_0|x)}{\pi_0(z_0)}\right] + \mathbb{E}_{\prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_t)}\left[\log \frac{\prod_{t=1}^{T} \pi_{\beta_{t-1}}(z_t)}{\prod_{t=1}^{T} \pi_{\beta_t}(z_t)}\right] \tag{3.76}$$
$$= \sum_{t=1}^{T} D^{\rightarrow}_{KL}\big[\pi_{\beta_{t-1}} \,\|\, \pi_{\beta_t}\big], \tag{3.77}$$
which matches the gap in $\text{TVO}_L(\theta, \phi; x)$ derived in Section 3.4.

[Plot omitted; x-axis $\beta \in [0, 1]$; legend: AIS (k=1) (-80.75), k=1 w/ AIS (-80.82), k=1 w/ SNIS (-84.00).]
Figure 3.15: TVO integrand $\mathbb{E}_{\pi_\beta}\big[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\big]$ estimated using AIS with $T = 200$ intermediate distributions. The solid, dark blue line indicates the AIS estimate of the expectation under each $\pi_\beta$. The overall left Riemann sum lower bounds are shown in parentheses, and the dotted line reflects the Independent Multi-Sample AIS ELBO using the same set of samples. We use AIS as ground truth to evaluate the accuracy of SNIS sampling with a fixed set of samples from $q_\phi(z|x)$, whose estimates are shown in the light blue line. We see that the accuracy of this estimator deteriorates as $\beta$ increases and $\pi_\beta$ is less well approximated by $q_\phi(z|x)$. This TVO sampling scheme, used throughout Masrani et al. (2019); Brekelmans et al. (2020a), was used to train the VAE model under consideration in this figure.

3.9.2 How Bad is SNIS Sampling?

While the above connections with AIS would suggest optimizing the TVO lower bound using forward MCMC transition kernels to draw samples which more closely approximate the intermediate target distributions $z_t \sim \pi_{\beta_{t-1}}(z)$, previous work on the TVO (Masrani et al., 2019; Brekelmans et al., 2020a; Nguyen et al., 2020; Chen et al., 2021) has been limited to SNIS sampling with a fixed set of samples $z_0 \sim q_\phi(z|x)$. While Masrani et al. (2019) hypothesize that this may incur significant bias in estimation of TVO integrand terms as $\beta \to 1$, this issue has not been concretely investigated. In Fig. 3.15, we consider a VAE model on the MNIST dataset trained using the TVO objective, the sampling scheme from Masrani et al. (2019), and reparameterization gradients from Section 3.6. After training, we consider estimating the TVO integrand $\mathbb{E}_{\pi_\beta}\big[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\big]$ at each point along the geometric mixture path.
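The SNIS estimator in question re-weights a fixed set of samples from $q_\phi(z|x)$ by the tempered importance weights $w \propto \big(p_\theta(x,z)/q_\phi(z|x)\big)^\beta$. The sketch below is a toy one-dimensional illustration of how this estimate of the integrand degrades as $\beta \to 1$ relative to an exact quadrature value; the Gaussian endpoints and sample size are assumptions for illustration and are not the MNIST VAE setting of Fig. 3.15.

```python
# Toy illustration of SNIS re-weighting of the TVO integrand E_{pi_beta}[log p(x,z)/q(z|x)]
# with a fixed set of samples z ~ q(z|x).  Endpoints and sample size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, log_c = 3.0, 0.5, np.log(0.05)    # unnormalized target p(x,z), poorly covered by q

def log_q(z):
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def log_p_joint(z):
    return log_c - 0.5 * ((z - mu) / sigma)**2 - np.log(sigma * np.sqrt(2 * np.pi))

def snis_integrand(beta, z):
    """Self-normalized IS estimate of E_{pi_beta}[log p(x,z)/q(z|x)] using fixed z ~ q."""
    log_r = log_p_joint(z) - log_q(z)
    log_w = beta * log_r                     # pi_beta / q is proportional to (p(x,z)/q(z|x))^beta
    w = np.exp(log_w - log_w.max())
    return np.sum(w * log_r) / np.sum(w)

def exact_integrand(beta, grid=np.linspace(-10, 10, 20001)):
    """Quadrature reference for the same expectation."""
    log_r = log_p_joint(grid) - log_q(grid)
    log_pi = log_q(grid) + beta * log_r
    w = np.exp(log_pi - log_pi.max())
    return np.sum(w * log_r) / np.sum(w)

z_fixed = rng.normal(size=1000)              # fixed proposal samples from q(z|x)
for beta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(beta, snis_integrand(beta, z_fixed), exact_integrand(beta))
# The SNIS estimate tracks the quadrature value for small beta, but becomes dominated by a few
# large weights, and hence biased and noisy, as beta approaches 1 and pi_beta approaches p(z|x).
```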
We compare snis re-weighting of a xed set of z 0 q (zjx) to an ais procedure which samples z 0:T q ais prop (z 0:T jx) using Hamiltonian Monte Carlo (hmc) (Neal, 2011) with 3 leapfrog steps per transition and Metropolis-Hastings accept reject steps. We plot the 94 integrand estimate for each method, at each along thex-axis, and report the overall left-Riemann sum in parenthesis (higher is better). The dotted line represents the Independent Multi-Sample ais elbo using the same set of samples. 5 As expected,mcmc transitions are crucial for accurate estimation of thetvo integrand of overall log partition function. As increases, (z)/ q (zjx) 1 p (x;z) becomes closer to p (zjx) and sampling from q (zjx) no longer provides an eective importance sampling proposal. While these results only indicate the suboptimality of the snis procedure at the end of training, the impact of providing better estimators of the tvo integrand during training remains not well-understood. Beyond estimation accuracy, we also must consider the behavior of the gradients of thetvo ob- jective when optimizing the parameters of; during training. While reparameterization gradients (Kingma and Welling, 2013; Rezende et al., 2014; Tucker et al., 2018) have been shown to provide lower variance gradient estimators and improve both model learning and variational inference, dif- culties arise when trying to derive reparameterization gradients through the Metropolis-Hastings accept-reject steps commonly used in ais. While score function gradients are available in this case, these tend to be higher variance and were not eective for learning in our initial experiments. Recently, Zhang et al. (2021); Gener and Domke (2021) proposed `dierentiable ais', which preserves the motivation of extended state space importance sampling along a path of sampling transitions, but ignores accept-reject steps in order to derive reparameterization gradients. At the end of training, the nal sample from the forward sampling procedure may be used as an approximate posterior sample. Zhang et al. (2021) train vaes using the Independent Multi-Sample aiselbo, which generalizes theiwae lower bound usingais transitions. As noted above, this bound averages samples inside the logarithm instead of outside the logarithm, as in the standard elbo or tvo. While this im-ais objective may prove to be preferable to the tvo objective (Rainforth et al., 2018a), it is apparent that leveraging the full potential of thetvo would require dierentiable sampling transformations as in Zhang et al. (2021); Gener and Domke (2021). Applying these approaches to dicult inference problems, in which the exible variational posteriors provided by ais transitions are necessary for strong performance, remains a question for future work. 5 Notably, this bound averages K samples inside the log ( 1 K P K k=1 log p (x;z (k) ) q (z (k) jx) ]) instead of averaging K samples outside the log as in estimating the tvo integrand ( 1 K P K k=1 log p (x;z (k) ) q (z (k) jx) ]). 95 3.9.3 TVO with q-Paths We have seen in Section 3.9.1 that, for the geometric averaging path, the ais extended state space elbo, under perfect sampling transitions, matches the tvo objective. However, this is not guaranteed to be true in general, since these bounds were derived using distinct approaches. In Ch. 2, we constructed theaiselbo by subtracting an extended state spacekl divergence, whereas in Sec. 3.2-3.3, we derived thetvo objective using thermodynamic integration and the fundamental theorem of calculus. 
We use theq-paths as an example for how these bounds would dier for more general annealing paths, with detailed exposition on q-paths in Ch. 4. ~ (q) (zjx) = (1)q (zjx) 1q +p (x;z) 1q 1 1q : (3.78) Considering (q) (zjx) = 1 Zq (x;) ~ (q) (zjx), we can write the multiplicative normalization constant as Z q (x;) = Z h (1)q (zjx) 1q +p (x;z) 1q i 1 1q dz: (3.79) Note that the partition function recoversZ q (x; = 1) = R p (x;z)dz =p (x) andZ q (x; = 0) = R q (zjx)dz = 1 for these choices of endpoint densities. Inspired by thermodynamic integration, Chen et al. (2021) dierentiate the log-partition func- tion of the q-path intermediate density logZ q (x;) = log R [(1)q (zjx) 1q +p (x;z) 1q ] 1 1q dz to construct the integrand logp (x) = logZ q (x; = 1) logZ q (x; = 0) = Z 1 0 @ @ logZ q (x;)d: (3.80) Alternatively, viewing the tvo in terms of ais under perfect transitions suggests the lower bound (Bui, 2020a; Grosse et al., 2013) L TVO L = logp (x) T X t=1 D KL h (q) t1 (zjx)k (q) t (zjx) i (3.81) which is the same objective as in Sec. 3.4, but uses a dierent path of intermediate ~ (q) (zjx). Note that this approach also provides a valid lower bound even without perfect transitions, and simply corresponds to the single-sample ais elbo using q-paths for the intermediate densities. To understand and leverage the distinctions between these two approaches, future work on the tvo should apply sampling transformations to more faithfully sample from densities along the path. 96 Chapter 4 q-Paths: Generalizing the Geometric Mixture Path using Power Means and -Divergences 4.1 Introduction Given a tractable and often normalized base distribution 0 (z) and unnormalized target ~ 1 (z), many statistical methods require a path : [0; 1] ! P, whereP is a family of unnormalized density functions with (0) = 0 (z) and (1) = ~ 1 (z). For example, marginal likelihood estimation methods such as thermodynamic integration (ti) (Ogata, 1989) or Annealed Importance Sampling (ais) (Neal, 2001) and Markov Chain Monte Carlo (mcmc) methods such as parallel tempering (Earl and Deem, 2005) and Sequential Monte Carlo (smc) (Del Moral et al., 2006) typically use the geometric path with mixing parameter , ~ (z) = expf(1) log 0 (z) + log ~ 1 (z)g; (4.1) In the Bayesian context, 0 (z) and 1 (z) can represent the prior and posterior distribution, re- spectively, in which case the geometric path amounts to tempering the likelihood term (Friel and Pettitt, 2008). Previous work has demonstrated theoretical or empirical improvements upon the geometric path can be achieved, but the applicability of these methods remains limited in practice due to restrictive assumptions on the parametric form of the endpoint distributions. Gelman and Meng (1998) derive an optimal path in distribution space but this is intractable to implement beyond toy examples. The moment-averaging path of Grosse et al. (2013) demonstrates performance gains for partition function estimation in Restricted Boltzmann Machines, but is only applicable for endpoint distributions which come from an exponential family. 
Bui (2020b) proposed a path based on \alpha-divergence minimization using an iterative projection scheme from Minka (2005), which is also reliant on exponential family assumptions.

Figure 4.1: Intermediate q-path densities between N(-4, 3) and N(4, 1), with 10 equally spaced \beta. Panels (a)-(d) show q = 0, 0.5, 0.9, and 1. For low q, the q-path approaches a mixture distribution at q = 0, and becomes the geometric mixture parameterized by \beta at q = 1.

In this work, we propose q-paths, which can be constructed between arbitrary endpoint distributions and admit a closed form that can be used directly for mcmc sampling,
\tilde{\pi}_{\beta, q}(z) = \left[ (1-\beta) \, \pi_0(z)^{1-q} + \beta \, \tilde{\pi}_1(z)^{1-q} \right]^{\frac{1}{1-q}}.   (4.2)
Our q-paths adapt the \alpha-integration of Amari (2007) to the problem of annealing between two unnormalized densities, with our notation q intended to highlight connections with the deformed logarithm and exponential functions from nonextensive thermodynamics (Tsallis, 2009; Naudts, 2011). q-paths may be viewed as taking the generalized mean (Kolmogorov, 1930; de Carvalho, 2016) of the endpoint densities according to a mixing parameter \beta and monotonic transformation function \ln_q(u) = \frac{1}{1-q}(u^{1-q} - 1). As q \to 1, we recover the natural logarithm and geometric mean in Eq. (4.1), while the arithmetic mean corresponds to q = 0.

Grosse et al. (2013) show that intermediate distributions along the geometric and moment-averaged paths correspond to the solution of a weighted forward or reverse kl divergence minimization objective, respectively. In Sec. 4.5, we generalize these variational representations to q-paths, showing that \tilde{\pi}_{\beta,q}(z) minimizes the expected \alpha-divergence to the endpoints for an appropriate mapping between q and \alpha. Finally, we highlight several implementation considerations in Sec. 4.6, observing that values of q slightly less than 1 appear most useful both for qualitative mixing behavior and numerical stability. We provide a simple heuristic for setting an appropriate value of q, and find that q-paths can yield empirical gains for Bayesian inference using smc and marginal likelihood estimation for generative models using ais.

4.2 Background

4.2.1 Geometric Annealing Path

The geometric mixture path is the most ubiquitous method for specifying a set of intermediate distributions between a tractable base distribution \pi_0 and unnormalized target \tilde{\pi}_1,
\pi_\beta(z) = \frac{\pi_0(z)^{1-\beta} \, \tilde{\pi}_1(z)^{\beta}}{Z(\beta)}, \quad \text{where}   (4.3)
Z(\beta) = \int \pi_0(z)^{1-\beta} \, \tilde{\pi}_1(z)^{\beta} \, dz.   (4.4)
The geometric path may also be written as an exponential family of distributions, with natural parameter \beta and sufficient statistic T(z) = \log \tilde{\pi}_1(z)/\pi_0(z) corresponding to the log importance ratio. We follow Grünwald (2007); Brekelmans et al. (2020a,d) in referring to this as a likelihood ratio exponential family, with
\pi_\beta(z) = \pi_0(z) \exp\left\{ \beta \log \frac{\tilde{\pi}_1(z)}{\pi_0(z)} - \psi(\beta) \right\}   (4.5)
\psi(\beta) := \log Z(\beta) = \log \int \pi_0(z)^{1-\beta} \, \tilde{\pi}_1(z)^{\beta} \, dz.   (4.6)
It is often more convenient to work with Eq. (4.5), because one gains access to known exponential family properties that are not apparent from Eq. (4.3) (Grosse et al., 2013; Brekelmans et al., 2020a,d). In Section 4.4 we provide an analogous interpretation for q-paths in terms of q-exponential families.
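As a small illustration of the two paths introduced above, the following sketch evaluates the unnormalized log-densities of Eq. (4.1)/(4.3) and Eq. (4.2) from the endpoint log-densities; the naive q-path implementation exponentiates densities directly and is only intended to convey the formula (see Sec. 4.6 for numerically stable evaluation):

import numpy as np

def log_geometric_path(log_pi0, log_pi1, beta):
    """Geometric path, Eq. (4.1): log ~pi_beta = (1 - beta) log pi_0 + beta log ~pi_1."""
    return (1.0 - beta) * log_pi0 + beta * log_pi1

def log_q_path(log_pi0, log_pi1, beta, q):
    """q-path, Eq. (4.2): ~pi_{beta,q} = [(1-beta) pi_0^{1-q} + beta ~pi_1^{1-q}]^{1/(1-q)}.

    Naive version that works with densities directly; numerically unstable when
    the endpoint log-densities are large in magnitude.
    """
    if np.isclose(q, 1.0):
        return log_geometric_path(log_pi0, log_pi1, beta)
    mix = (1.0 - beta) * np.exp((1.0 - q) * log_pi0) + beta * np.exp((1.0 - q) * log_pi1)
    return np.log(mix) / (1.0 - q)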
4.2.2 Moment Averaging Path Previous work (Grosse et al., 2013) considers alternative annealing paths in the restricted setting where 0 (z) and 1 (z) are members of the same exponential family, with parameters 0 and 1 respectively. Writing the base measure as g(z) and sucient statistics as (z), (z) =g(z) expf(z) ()g (4.7) 99 Grosse et al. (2013) propose the moment-averaged path based on the dual or `moment' param- eters of the exponential family, which correspond to the expected sucient statistics () = d () d =hE [ j (z)]i N j=1 ; (4.8) withhi indicating vector notation and () denoting the log partition function of Eq. (4.7). In minimal exponential families, the sucient statistic function () is a bijective mapping between a natural parameter vector and dual parameter vector (Wainwright and Jordan, 2008). The moment-averaged path is dened using a convex combination of the dual parameter vectors (Grosse et al., 2013), ( ) = (1)( 0 ) +( 1 ): (4.9) To solve for the corresponding natural parameters, we calculate the Legendre transform, or a function inversion 1 . = 1 (1)( 0 ) +( 1 ) : (4.10) This inverse mapping is often not available in closed form and can itself be a dicult estimation problem (Wainwright and Jordan, 2008; Grosse et al., 2013), which limits the applicability of the moment-averaged path in practice. 4.2.3 q-Deformed Logarithm / Exponential While the standard exponential arises in statistical mechanics via the Boltzmann-Gibbs distribu- tion, Tsallis (1988) proposed a generalized exponential which has formed the basis of nonextensive thermodynamics and found wide application in the study of complex systems (Gell-Mann and Tsallis, 2004; Tsallis, 2009). Consider modifying the integral representation of the natural logarithm lnu := R u 1 1 x dx using an arbitrary power function ln q u = Z u 1 1 x q dx: (4.11) 100 Solving Eq. (4.11) yields the denition of the q-logarithm ln q (u) := 1 1q u 1q 1 : (4.12) We dene the q-exponential as the inverse of q-logarithm exp q (u) := ln 1 q (u) exp q (u) = 1 + (1q)u 1 1q + ; (4.13) where [x] + = maxf0;xg = relu(x) ensures that exp q (u) is non-negative and fractional powers can be taken for q < 1, and thus restricts the domain where exp q (u) takes nonzero values to u>1=(1q). We omit this notation in subsequent derivations because our q-paths in Eq. (4.2) take non-negative densities as arguments for the 1=(1q) power. Note also that both theq-log andq-exponential recover the standard logarithm and exponential function in the limit, lim q!1 ln q (u) lim q!1 exp q (u) = lim q!1 d dq (u 1q 1) d dq (1q) = lim q!1 [1 + (1q)u] 1 1q = loguu 1q 1 q=1 = lim n!1 h 1 + u n i n = log(u) := exp(u): In Section 4.4, we use this property to show q-paths recover the geometric path as q! 1. 4.3 q-Paths from Power Means q-paths are derived using a generalized notion of the mean due to Kolmogorov (1930). For any monotonic function (u), we dene the generalized mean (u; w) = 1 N X i=1 w i (u i ) ! ; (4.14) 101 Geometric Path q-Path ~ (geo) (z) = 0 (z) 1 ~ 1 (z) ~ (q) (z) = (1) 0 (z) 1q + ~ 1 (z) 1q 1 1q ~ (geo) (z) = 0 (z) exp n log ~ 1 (z) 0 (z) o ~ (q) (z) = 0 (z) exp q n ln q ~ 1 (z) 0 (z) o log ~ (geo) (z) = (1) log 0 (z) + log ~ 1 (z) ln q ~ (q) (z) = (1) ln q 0 (z) + ln q ~ 1 (z) (z) = argmin ~ r (1)D KL [~ rk 0 ] +D KL [~ rk~ 1 ] ~ ;q (z) = argmin ~ r (1)D [~ 0 jj~ r] +D [~ 1 jj~ r] Figure 4.2: Summary ofq-paths (right) in relation to the geometric path (left). q-paths recover the geometric path as q! 1 and = 2q 1 in Amari's -divergence D . 
The deformed logarithm ln q and its inverse exp q are dened in Section 4.2.3. where outputs a scalar given a normalized measure w = (w 1 ;:::;w N ) (with P N i=1 w i = 1) over a set of input elements u = (u 1 ;:::;u N ) (de Carvalho, 2016). 1 The generalized mean can be thought of as rst applying a nonlinear transformation function to each input, applying the desired weights in the transformed space, and nally mapping back to the distribution space. The geometric and arithmetic means are homogeneous, that is, they have the linear scale-free property (c u; w) =c (u; w). Hardy et al. (1953) shows the unique class of functions (u) that yield means with the homogeneity property are of the form q (u) = 8 > < > : au 1q +b q6= 1 logu q = 1 : (4.15) for anya andb. Settinga =b = 1=(1q), we can recognize q (u) as the deformed logarithm ln q (u) from Eq. (4.12). We refer generalized means which use the class of functions q (u) as power means, and it can be shown (Masrani et al. (2021) App. A, Amari (2007)) that for any choice of a and b, q (u; w) = " N X i=1 w i u 1q i # 1 1q : (4.16) 1 The generalized mean is also referred to as the abstract, quasi-arithmetic, or Kolmogorov-Nagumo mean in the literature. 102 Notable examples include the arithmetic mean at q = 0, geometric mean as q! 1, and the min or max operation as q!1. For q = 1+ 2 , a = 1 1q , and b = 0, the function q (u) matches the -representation in information geometry (Amari, 2016), and the resulting power mean over normalized probability distributions as input u is known as the -integration (Amari, 2007). For annealing between unnormalized density functions, we propose the q-path of intermediate ~ ;q (z) based on the power mean. Observing that the geometric mixture path in Eq. (4.1) takes the form of a generalized mean for (u) = ln(u), we choose the deformed logarithm q (u) := ln q (u) 1 q (u) = exp q (u); (4.17) as the transformation function for the power mean. This choice will facilitate our parallel discussion of geometric and q-paths in terms of generalized logarithms and exponentials in Section 4.4. Using u = ( 0 ; ~ 1 ) as the input elements and w = (1;) as the mixing weights in Eq. (4.16), we obtain a simple, closed form expression for the q-path intermediate densities ~ ;q (z) = (1) 0 (z) 1q + ~ 1 (z) 1q 1 1q (4.18) Crucially, Eq. (4.18) can be directly used as an energy function in mcmc sampling methods such as Hamiltonian Monte Carlo (hmc) (Neal, 2011), and our q-paths do not require additional assump- tions on the endpoint distributions. Finally, to compare against the geometic path, we write the q-path in terms of the generalized mean in Eq. (5.3) ~ ;q = exp q (1) ln q 0 (z) + ln q ~ 1 (z) ; (4.19) from which we can see that ~ ;q recovers the geometric path in Eq. (4.1) asq! 1, ln q (u)! log(u), and exp q (u)! exp(u). Taking the deformed logarithm of both sides also yields an interpretation of the geometric or q-paths as ln or ln q -mixtures of density functions, respectively. 103 4.4 q-Likelihood Ratio Exponential Families Similarly to Eq. (4.5), we relate ~ ;q to a q-exponential family with a single sucient statistic and natural parameter ~ ;q (z) = (1) 0 (z) 1q +~ 1 (z) 1q 1 1q (4.20) = 0 (z) 1q + ~ 1 (z) 1q 0 (z) 1q 1 1q (4.21) = 0 (z) " 1 + ~ 1 (z) 0 (z) 1q 1 !# 1 1q (4.22) = 0 (z) 1 + (1q) ln q ~ 1 (z) 0 (z) 1 1q (4.23) = 0 (z) exp q ln q ~ 1 (z) 0 (z) : (4.24) To mirror the likelihood ratio exponential family interpretation of the geometric path in Eq. 
(4.5), we multiply by a factor Z_q(\beta) to write the normalized q-path distribution as
\pi_{\beta, q}(z) = \frac{1}{Z_q(\beta)} \, \pi_0(z) \exp_q\{\beta \, T(z)\}   (4.25)
Z_q(\beta) := \int \tilde{\pi}_{\beta, q}(z) \, dz, \qquad T(z) := \ln_q \frac{\tilde{\pi}_1(z)}{\pi_0(z)}   (4.26)
which recovers Eq. (4.5) as q \to 1. Note that we normalize using Z_q(\beta) instead of subtracting a \psi_q(\beta) term inside the \exp_q as in the standard definition of a parametric q-exponential family (Naudts, 2009, 2011; Amari and Ohara, 2011),
\pi_{\theta, q}(z) = g(z) \exp_q\{ \theta \cdot \phi_q(z) - \psi_q(\theta) \},   (4.27)
where we use \phi_q(z) to indicate a general sufficient statistic vector which will often differ from our likelihood ratio family above, which uses T(z) = \ln_q \tilde{\pi}_1(z)/\pi_0(z).

While \log Z(\beta) = \psi(\beta) for q = 1, translating between these normalization constants for q \neq 1 requires a non-linear transformation of the parameters. This delicate issue of normalization has been noted in (Matsuzoe et al., 2019; Suyari et al., 2020; Naudts, 2011), and we give detailed discussion in Masrani et al. (2021) App. B. In Masrani et al. (2021) App. D, we use the \psi_q normalization constant to derive an analogue of the moment-averaging path between parametric q-exponential family endpoints.

Figure 4.3: (a) Moment-averaging path and (b) q = 0 mixture path between N(-4, 3) and N(4, 1). See Section 4.5 for discussion.

q-Paths for Parametric Endpoints The geometric path has a particularly simple form when annealing between exponential family endpoint distributions,
\theta_\beta = (1-\beta) \, \theta_0 + \beta \, \theta_1.   (4.28)
In Masrani et al. (2021) App. D2, we verify Eq. (4.28) and show that the same result holds for q-paths between endpoint distributions within the same q-exponential family. Intuitively, for the (generalized) exponential family distribution in Eq. (4.27), we can write the unnormalized density ratio \ln_q \tilde{\pi}_\theta(z)/g(z) = \theta \cdot \phi(z) as a linear function of the parameters. Thus, the q-path generalized mean over density functions with h_q(\tilde{\pi}_i) = \ln_q \tilde{\pi}_i(z) will translate to an arithmetic mean in the parameter space, h(\theta_i) = \theta_i. Note that the moment-averaging path in Eq. (4.10) may also be viewed as a generalized mean for the function \eta(\theta).

4.5 Variational Representations

Grosse et al. (2013) observe that intermediate distributions along the geometric path can be viewed as the solution to a weighted kl divergence minimization
\pi_\beta = \underset{r}{\text{argmin}} \;\; (1-\beta) \, D_{KL}[r \, \| \, \pi_0] + \beta \, D_{KL}[r \, \| \, \pi_1]   (4.29)
where the optimization is over arbitrary distributions r(z). When the endpoints come from an exponential family of distributions and the optimization is limited to only this parametric family \mathcal{P}_e, Grosse et al. (2013) find that the moment-averaged path is the solution to a kl divergence minimization with the order of the arguments reversed,
\pi_\beta = \underset{r \in \mathcal{P}_e}{\text{argmin}} \;\; (1-\beta) \, D_{KL}[\pi_0 \, \| \, r] + \beta \, D_{KL}[\pi_1 \, \| \, r].   (4.30)
In Masrani et al. (2021) App. C, we follow similar derivations as Amari (2007) to show that the q-path density \tilde{\pi}_{\beta,q} minimizes the \alpha-divergence to the endpoints
\tilde{\pi}_{\beta,q} = \underset{\tilde{r}}{\text{argmin}} \;\; (1-\beta) \, D_\alpha[\tilde{\pi}_0 \, \| \, \tilde{r}] + \beta \, D_\alpha[\tilde{\pi}_1 \, \| \, \tilde{r}]   (4.31)
where the optimization is over arbitrary measures \tilde{r}(z). Amari's \alpha-divergence over unnormalized measures, for \alpha = 2q - 1 (Amari (2016) Ch. 4), is defined
D_\alpha[\tilde{r} : \tilde{p}] = \frac{4}{1-\alpha^2} \left( \frac{1-\alpha}{2} \int \tilde{r}(z) \, dz + \frac{1+\alpha}{2} \int \tilde{p}(z) \, dz - \int \tilde{r}(z)^{\frac{1-\alpha}{2}} \, \tilde{p}(z)^{\frac{1+\alpha}{2}} \, dz \right).   (4.32)
The \alpha-divergence variational representation in Eq. (4.31) generalizes Eq. (4.29), since the kl divergence D_{KL}[\tilde{r} \, \| \, \tilde{p}] is recovered (with the order of arguments reversed, see footnote 2) as q \to 1. However, while the \alpha-divergence tends to D_{KL}[\tilde{p} \, \| \, \tilde{r}] as q \to 0, Eq. (4.31) does not generalize Eq. (4.30), since the optimization in Eq. (4.30) is restricted to the parametric family \mathcal{P}_e. For the case of arbitrary endpoints, the mixture distribution rather than the moment-averaging distribution minimizes the reverse kl divergence in Eq. (4.30), producing different paths as seen in Fig. 4.3. We discuss this distinction in greater detail in Masrani et al. (2021) App. C1 and D3, along with Ch. 5 Sec. 5.5.4.1. In Ch. 5, we consider these variational representations as a particular property of a generalized Bregman divergence called the rho-tau divergence (Zhang, 2004, 2013).
We also show that the q-path can be obtained as the solution to an expected divergence minimization, ~ (q) (z) = argmin ~ r(z) (1)D (2q) B [~ r(z) : ~ 0 (z)] +D (2q) B [~ r(z) : ~ 1 (z)]; (4.33) 2 The kl divergence extended to unnormalized measures is dened DKL[~ q : ~ p] = R ~ q(z) log ~ q(z) ~ p(z) dz R ~ q(z)dz + R ~ p(z)dz. 106 0.9871 0.9923 0.9954 0.9972 0.9983 0.999 geometric q (log spaced) 900 600 300 marginal likelihood number of move steps ground truth 1 3 5 7 10 15 20 (a) SMC tempering using q-Paths on a binary regression model over 10 runs. q = 0:9972 outperforms the geometric path both in terms of marginal likelihood estimation and reduced variability across runs. Table 4.1: SMC sampling with linear/adaptive scheduling in a binary regression model forf1; 3; 5g move steps. lin indicates a linearly spaced schedule (K = 10) and ada uses an adaptive schedule (cf. Section 4.6.1). Median err =j log ^ p(D) logp(D)j across 10 seeds is reported against ground truth. q-path (grid) shows best of 20 log-spaced2 [10 5 ; 10 1 ], andq-path (ess) uses theess heuristic to initialize q as described in Masrani et al. (2021) App. G1. Error for most runs (8/12) is q-path (grid) < q-path (ess) < geo. q-path q-path pima geo (ess heuristic) (grid) lin-1 79.02 (39.1) 80.64 (42.33) 10.77 (2.30) lin-3 59.11 (41.71) 59.64 (47.41) 5.79 (1.46) lin-5 45.63 (19.86) 41.96 (25.23) 6.63 (2.62) ada-1 2.51 (1.35) 2.31 (2.99) 1.62 (1.79) ada-3 1.49 (0.43) 1.12 (1.05) 0.84 (0.84) ada-5 0.48 (0.60) 0.76 (0.29) 0.52 (0.59) sonar lin-1 228.7 (80.9) 217.92 (72.51) 93.33 (15.79) lin-3 175.21 (38.66) 172.66 (61.55) 55.94 (5.69) lin-5 218.94 (92.08) 222.07 (78.76) 36.67 (10.32) ada-1 20.17 (15.99) 18.15 (15.43) 15.32 (8.19) ada-3 3.83 (3.44) 3.78 (2.77) 3.11 (3.26) ada-5 2.79 (2.41) 2.68 (1.95) 2.23 (0.72) where the -divergence of order 2q is dened as D (2q) B [~ a : ~ b ] = Z 1 1q 1 2q ~ a (z) 2q + 1 2q ~ b (z) 2q 1 1q ~ a (z)~ b (z) 1q dz: (4.34) See Ch. 5 Sec. 5.5.2 for detailed discussion. 4.6 Experiments Code for all experiments is available at https://github.com/vmasrani/qpaths_uai_2021. 107 15 5 5 15 q = 0.98 geo qpath 15 5 5 15 q = 0.99 15 5 5 15 q = 0.999 Figure 4.5: q-paths betweenN (10; 1) andN (10; 1), which are notably more separated than those in Fig. 4.1. For dicult annealing problems such as those in our experiments, small deviations from the geometric path (grey) can achieve mass-covering behaviour (center), which is lost if the q-path too much resembles the arithmetic (left) or geometric mean (right). 0 100 200 300 400 500 600 700 800 N 0.0 0.2 0.4 0.6 0.8 1.0 best q min q Figure 4.6: Evaluating the choice of q for smc. Since the scale of the likelihood ~ 1 depends on the number of data examples, we expect the numerical stability of q-paths to vary by N. While the minimum q yielding a stable estimator (orange) increases with N, the best performing q-path (blue) is still q = 1 for small > 0. 4.6.1 Sequential Monte Carlo in Bayesian Inference In this section, we usesmc to sample posterior parameters 1 () =p(jD)/p() Q N n=1 p(x n j) and estimate the log marginal likelihood logp(D) = log R p()p(Dj)d in a Bayesian logistic regression models on the \tall" Pima Indians diabetes dataset (N = 768;D = 8) and \wide" Sonar dataset (N = 208;D = 61) (see Masrani et al. (2021) App G). Ground truth logp(D) is computed using 50k samples and 20 move steps, and for all runs we use 10k samples and plot median error across ten seeds. Grid search shows best of 20 runs, where we sweep over 20 log-spaced 2 [10 5 ; 10 1 ]. 
We explore the use ofq-paths in both the non-adaptive case, with a xed linear schedule with K = 10 intermediate distributions, and the adaptive case, where the next value of t+1 is chosen to yield an eective sample size (ess) of N=2 (Chopin et al., 2020). 108 For the non-adaptive case, we nd in Section 4.5 that q2 [0:9954; 0:9983] can achieve more accurate marginal likelihood estimates than the geometric path with fewer movement steps and drastically reduced variance. In Table 4.1 we see that q-paths achieve gains over the geometric path in both the linear and adaptive setting across both datasets. Numerical Stability and Implementation To implement q-paths in practice, we begin by considering the log of the expression in Eq. (4.24), which is guaranteed to be non-negative because ~ ;q (z) is an unnormalized density. log ~ ;q (z) = log 0 (z) + 1 1q log 1 + (1q) ln q ~ 1 (z) 0 (z) ; We focus attention on ln q ~ 1 (z)= 0 (z) term, which is potentially unstable for q6= 1 since it takes importance weights w = ~ 1 (z)= 0 (z) as input. Since we are usually given log weights in practice, we consider the identity mapping w = exp(logw) and reparameterize q = 1 1 to obtain ln q (exp logw) = 1 1q h (exp logw) 1q 1 i (4.35) = h (exp logw) 1 1 i (4.36) = expf 1 logwg 1 : (4.37) This suggestsq should be chosen such that the exponential doesn't over ow or under ow, which can be accomplished by setting on the order of = max i j logw i j: (4.38) where i indexes a set of particlesfz i g. This choice is reminiscent of the log-sum-exp trick and ensuresj 1 logwj 1. In Fig. 4.6, we explore the impact of changing the scale of logw on the numerical stability of q-paths. For the case of inferring global model parameters over N i.i.d. data points p(D) = Q N n=1 p(x n ), we can see that the scale of the unnormalized densities ~ 1 (;D) =p() Q N n=1 p(x n j) diers based on the number of datapoints, where increasing N decreases the magnitude of logw = log ~ 1 (;D) with ~ 0 () =p(). We randomly subsample N data points for conditioning our model, and observe the eect on both the best-performing q and the numerical stability of smc with q-paths. The minimum value 109 of q for which we can obtain stable estimators rises as the number of datapoints N increases and the scale of ~ 1 (;D) becomes smaller. Sensitivity to q While setting on the order of max i j logw i j ensures numeric stability, Fig. 4.6 indicates that numerical stability may not be sucient for achieving strong performance in smc. In fact, q-paths with values just less than 1 consistently perform best across all values of N. To understand this observation, recall the example in Fig. 4.5 where the initial and target distribution are well-separated and even theq = 0:98 path begins to resemble a mixture distribution. This is clearly undesirable for path sampling techniques, where the goal is to bridge between base and target densities with distributions that are easier to sample. Heuristic for Choosing q Motivated by the observations above and the desire to avoid grid search, we provide a rough heuristic to nd a q which is well-suited to a given estimation problem. 
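The stable evaluation described above, which the heuristic below also builds on, can be sketched as follows; the reparameterization uses \delta = 1/(1-q) as in Eq. (4.35)-(4.37), and the helper for choosing \delta from Eq. (4.38) is an illustrative simplification:

import numpy as np

def log_q_path_stable(log_pi0, log_pi1, beta, q):
    """Stable q-path log-density via Eq. (4.24) and (4.35)-(4.37):
    log ~pi_{beta,q} = log pi_0 + (1/(1-q)) log(1 + (1-q) beta ln_q(w)),
    with log w = log ~pi_1 - log pi_0 and ln_q(w) = delta (exp(log w / delta) - 1)
    for delta = 1/(1-q).  Expects q < 1."""
    log_w = log_pi1 - log_pi0
    if np.isclose(q, 1.0):
        return log_pi0 + beta * log_w                 # geometric path in the q -> 1 limit
    delta = 1.0 / (1.0 - q)
    lnq_w = delta * (np.exp(log_w / delta) - 1.0)
    return log_pi0 + delta * np.log1p(beta * lnq_w / delta)

def heuristic_q(log_w):
    """Rough choice of q from Eq. (4.38): delta ~ max_i |log w_i| keeps
    |log w / delta| <= 1, and q = 1 - 1/delta."""
    delta = np.max(np.abs(log_w))
    return 1.0 - 1.0 / delta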
Taking inspiration from the ess criterion used to select t+1 in our smc experiments above (Chopin et al., 2020), we select q to obtain a target value of ess for the rst intermediate 1 L( 1 ;q) =jjESS( 1 ;q) ESS target jj 2 2 (4.39) ESS(;q) = P i w i (;q) 2 P i w i ;q 2 with w i (;q) = ~ ;q (z i ) 0 (z i ) : As in the case of the adaptive scheduling heuristic for smc, we set the target ESS target =N=2 to ensure adequate sampling diversity (Jasra et al., 2011; Sch afer and Chopin, 2013; Buchholz et al., 2021; Chopin et al., 2020). For xed scheduling, the value of 1 may be known and thus we can easily select q to obtain the target value ESS( 1 ;q) ESS target . However, in adaptive scheduling, 1 is not known and the objectiveL( 1 ;q) is non-convex in 1 ;q. In Masrani et al. (2021) App. G2, we provide a coordinate descent algorithm to nd local optima using random initializations around an initial q = 1 1 for as in Eq. (4.38), with results in Table 4.1. Note that this heuristic sets q based on a set of initial z i 0 (z), and thus does not consider information about the mcmc sampling used to transform and improve samples. Nevertheless, in Table 4.1 we observe that q-paths initialized by this heuristic can outperform the geometric path on benchmarksmc binary regression tasks. Comparison with grid search results indicate that further performance gains might be achieved with an improved heuristic. 110 1.00 0.990 0.992 0.994 0.996 0.998 q 104 105 106 107 108 109 Negative lower bound on real data (nat) q-path geometric K = 500 K = 1000 K = 5000 (a) Estimating logp(x) on real data using ais. 0.990 0.992 0.994 0.996 0.998 1.000 q 0.10 1.00 BDMC gap (nat) q-path geometric K = 500 K = 1000 K = 5000 (b) Bidirectional Monte Carlo (bdmc) Gap on simu- lated data. Figure 4.7: Evaluating Generative Models usingais withq-paths on Omniglot dataset. Best viewed in color. 4.6.2 Evaluating generative models using AIS ais with geometric paths is often considered the gold-standard for evaluating decoder-based gener- ative models (Wu et al., 2016). In this section, we evaluate whether q-paths can improve marginal likelihood estimation for a Variational Autoencoders (vae) trained using the Thermodynamic Vari- ational Objective (tvo) (Masrani et al., 2019) on the Omniglot dataset. First, we use ais to evaluate the trained generative model on the true test set, with a Gaussian prior 0 (z) =p(z) as the base distribution and true posterior 1 (z) =p(zjx)/p(x;z) as the target. Intermediate distributions then become ~ (z) = p(z)p(xjz) . We report stochastic lower bound estimates (Grosse et al., 2015) ofE p data (x) logp(x) in Fig. 4.7b, where we have plotted the negative likelihood bound so that lower is better. Even for a large number of intermediate distributions, we nd that q2 [0:992; 0:998] can outperform the geometric path. When exact posterior samples are available, we can use a reverse ais chain from the target density to the base to obtain a stochastic upper bound on the log marginal likelihood (Grosse et al., 2015). While such samples are not available on the real data, we can use simulated data drawn from the model using ancestral sampling x;z p(z)p(xjz) as the dataset, and interpret z as a posterior sample. We use the bdmc gap, or dierence between the stochastic lower and upper bounds obtained from forward and reverse chains on simulated data, to evaluate the quality of the ais procedure. In Fig. 
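A simplified version of this heuristic, replacing the coordinate descent procedure of Masrani et al. (2021) App. G2 with a grid search over q < 1 for a fixed \beta_1, might look as follows (function names are illustrative):

import numpy as np
from scipy.special import logsumexp

def log_ess(log_w):
    """log ESS = 2 logsumexp(log w) - logsumexp(2 log w), for importance weights
    w_i = ~pi_{beta_1, q}(z_i) / pi_0(z_i) at particles z_i ~ pi_0."""
    return 2.0 * logsumexp(log_w) - logsumexp(2.0 * log_w)

def choose_q_by_ess(log_w0, beta1, q_grid, target_frac=0.5):
    """Pick q from a grid (all entries < 1) so that ESS(beta_1, q) is closest to
    target_frac * N, as in Eq. (4.39). `log_w0` holds log ~pi_1(z_i) - log pi_0(z_i)."""
    n = len(log_w0)
    log_target = np.log(target_frac * n)
    best_q, best_gap = None, np.inf
    for q in q_grid:
        delta = 1.0 / (1.0 - q)
        lnq_w = delta * (np.exp(log_w0 / delta) - 1.0)
        log_w = delta * np.log1p(beta1 * lnq_w / delta)   # log ~pi_{beta1,q}(z_i)/pi_0(z_i)
        gap = abs(log_ess(log_w) - log_target)
        if gap < best_gap:
            best_q, best_gap = q, gap
    return best_q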
4.7, we report the average bdmc gap on 2500 simulated data examples, and observe that q-paths with q = 0:994 or q = 0:996 consistently outperform the geometric path as we vary the number of intermediate distributions K. 111 4.7 Conclusion In this work, we proposed q-paths as a generalization of the geometric mixture path which can be constructed between arbitrary endpoint distributions and admits a closed form energy function. We provided a q-likelihood ratio exponential family interpretation of our paths, and derived a variational representation ofq-path intermediate densities as minimizing the expected-divergence to the endpoints. Finally, we observed empirical gains in smc andais sampling usingq-paths with q = 1 for small . Future work might consider more involved heuristics for choosing q, such as running truncated, parallel sampling chains, to capture the interplay between choices of ;q; and sampling method. Applyingq-paths in settings such as sampling with Parallel Tempering (pt) or variational inference using the tvo, remain interesting questions for future work. In the next chapter, we will provide a more detailed exploration of the `variational representa- tions' in Sec. 4.5. In particular, we will interpret this property as arising from the `centroid' property of Bregman divergences after embedding by an arbitrary monotonic function (~ ) (Banerjee et al., 2005c; Zhang, 2004; Nielsen and Nock, 2009). We show q-paths may also be derived using a -divergence minimization, in addition to the -divergence minimization shown in Sec. 4.5. 112 Chapter 5 Rho-Tau Bregman Information and the Geometry of Annealing Paths 5.1 Introduction Markov Chain Monte Carlo (mcmc) methods such as Annealed Importance Sampling (ais) (Neal, 2001; Jarzynski, 1997), Sequential Monte Carlo (smc) (Del Moral et al., 2006), thermodynamic integration (ti) (Ogata, 1989; Gelman and Meng, 1998) and Parallel Tempering (pt) (Earl and Deem, 2005) are fundamental tools in machine learning and statistical physics, which can be used to sample from complex distributions, estimate normalization constants, and calculate physical quantities such as entropy or free energy. The key insight is to use importance sampling to compare ratios of unnormalized density functions along a sequence of samples generated by mcmc transition kernels such as Langevin dynamics (Rossky et al., 1978; Welling and Teh, 2011) or Hamiltonian Monte Carlo (hmc) (Duane et al., 1987; Neal, 2011; Betancourt et al., 2017). mcmc algorithms commonly break the problem of sampling from the target distribution into a sequence of easier subproblems along an annealing path of intermediate densitiesf~ t (z)g T t=0 , which bridge between a tractable ~ 0 (z) and the target density of interest ~ 1 (z). We indicate unnormalized positive measures using the notation ~ , and are often interested in sampling from 1 (z)/ ~ (z) or estimating the normalization constant Z 1 = R ~ 1 (z)dz or its logarithm logZ 1 . Transition kernelsT (z t jz t1 ) such as importance resampling, Langevin dynamics (Rossky et al., 1978; Welling and Teh, 2011) or hmc (Duane et al., 1987; Neal, 2011; Betancourt et al., 2017) are used to transform samples to more closely approximate the target distribution. For example, we describe the algorithm for annealed importance sampling (ais, Neal (2001); Jarzynski (1997)) in Algorithm 1. 
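As a reference point for Algorithm 1, a minimal single-chain ais estimator can be sketched as follows; the mcmc transition kernel is left abstract and the function names are illustrative:

def ais_log_weight(sample_pi0, log_annealed, transition, betas):
    """Single-chain AIS (Neal, 2001): returns a log importance weight log_w such
    that E[exp(log_w)] = Z_1 / Z_0 for an annealing path with unnormalized
    log-density log_annealed(z, beta) = log ~pi_beta(z).

    `transition(z, beta)` should be an MCMC kernel leaving ~pi_beta invariant
    (e.g. HMC or Langevin); its implementation is assumed, not shown.
    """
    z = sample_pi0()
    log_w = 0.0
    for t in range(1, len(betas)):
        # incremental weight: log ~pi_{beta_t}(z) - log ~pi_{beta_{t-1}}(z)
        log_w += log_annealed(z, betas[t]) - log_annealed(z, betas[t - 1])
        # move the sample with a kernel targeting ~pi_{beta_t}
        z = transition(z, betas[t])
    return log_w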
113 Most commonly, intermediate unnormalized densities are constructed using geometric averag- ing ~ (z) = ~ 0 (z) 1 ~ 1 (z) of the initial ~ 0 (z) and target ~ 1 (z) densities, here taken to be over a continuous sample space z2R D . Masrani et al. (2021) propose annealing paths corresponding to quasi-arithmetic means (Kolmogorov, 1930) with the deformed logarithm used in nonextensive thermodynamics (Naudts (2011) Ch 7, Tsallis (2009)). Intriguingly, Grosse et al. (2013); Mas- rani et al. (2021) show that these paths can be viewed as arising from an expected divergence minimization log ~ (1) (z) = (1) log ~ 0 (z) + log ~ 1 (z) = argmin ~ r(z) (1)D KL ~ r(z) : ~ 0 (z) +D KL [~ r(z) : ~ 1 (z) (q = 1) log q ~ (q) (z) = (1) log q ~ 0 (z) + log q ~ 1 (z) = argmin ~ r(z) (1)D () A ~ 0 (z) : ~ r(z) +D () A [~ 1 (z) : ~ r(z) (q6= 1) (5.1) where we reparameterize the Amari -divergence (Amari, 1982, 2016) using 1+ 0 2 so that =q, D () A [~ a : ~ b ] = 1 Z ~ a (z)dz + 1 1 Z ~ b (z)dz 1 1 1 Z ~ a (z) 1 ~ b (z) dz: (5.2) Eq. (5.1) is reminiscent of the problem of nding the `centroid' argmin r E (u) D[u : r] of a random variable U(u) with respect to a statistical divergence D and sampling measure (u). While uniqueness properties depend on the choice of divergence in general, Banerjee et al. (2005c) show that for any Bregman divergence, minimization in the second argument yields the arithmetic mean as the unique optimal centroid r = E (u) [u]. At this minimizing argument, the value of the divergence minimization corresponds to a gap in Jensen's inequality, called the Bregman Information (Banerjee et al., 2005c). However, notice that our minimizing arguments in Eq. (5.1) are arithmetic means after applying a monotonic representation function (~ ) = log q ~ . In this work, we use the rho-tau Bregman divergence framework of Zhang (2004, 2013) to formally extend the Bregman Information results of Banerjee et al. (2005c) to quasi-arithmetic means, or arithmetic means after transformation by (~ ) as in Eq. (5.1). Intriguingly, the Bregman Information associates a divergence function D (;q) [~ 0 (z) : ~ 1 (z)] (Zhang, 2004, 2013) with each intermediate density along the q-path between ~ 0 and ~ 1 , encom- passing many common divergences as special cases in Table 5.1. Our analysis highlights intimate connections between quasi-arithmetic means and divergence functions, and naturally bridges be- tween parametric (Amari, 2016; Nielsen, 2020) and nonparametric information geometry (Zhang, 2013). Through the familiar example of annealing paths, we seek to familiarize the wider machine 114 Familiar Divergences from scaled Bregman Information 1 (1) (1)Z q (0) +Z q (1)Z q () 0: and Quasi-Arithmetic Means: inputs:f ~ 0 ; ~ 1 g (q) (z) = 1 q (1)q ~ 0(z) +q ~ 1(z) weights:f1;g representation: q (~ ) = log q ~ Bregman Divergence -Divergence Convex Function ! 0 ! 1 =62f0; 1g logZ 1 () = log R (geo) (z)dz D KL [~ 0 : ~ 1 ] D KL [~ 1 : ~ 0 ] D () R [~ 0 : ~ 1 ] Z 1 () = R (geo) (z)dz D KL [~ 0 : ~ 1 ] D KL [~ 1 : ~ 0 ] D () A [~ 0 : ~ 1 ] Z q () = R (q) (z)dz D (q) A [~ 0 : ~ 1 ] D (q) A [~ 1 : ~ 0 ] D (;q) Z [~ 0 : ~ 1 ] Z 0 () = R (arith) (z)dz D KL [~ 1 : ~ 0 ] D KL [~ 0 : ~ 1 ] D () JS [~ 0 : ~ 1 ] Z esc q () = R (q) (z) 2q dz D (2q) B [~ 1 : ~ 0 ] D (2q) B [~ 0 : ~ 1 ] see Eq. (5.81) Table 5.1: Bregman Information and Divergence Functions. 
Note thatD () R is R enyi's-divergence, D () A is Amari's-divergence,D () JS is the Jensen-Shannon divergence with mixture weight,D (;q) Z is the (;) divergence of Zhang (2004), andD (q) B is the-divergence of orderq (Basu et al., 1998). learning community with the referential-representational biduality in information geometry (Zhang, 2004, 2013). 5.1.1 Quasi-Arithmetic Means and Annealing Paths For a strictly monotonic representation function (u), the quasi-arithmetic mean (Kolmogorov, 1930) is (u; w) = 1 N X i=1 w i (u i ) ! ; (5.3) where (u; w) outputs a scalar for given normalized mixing positive weights, w = (w 1 ;:::;w N ) with P N i=1 w i = 1, over a set of input elements u = (u 1 ;:::;u N ). Since the function is monotonic and thus invertible, we may represent a given density function ~ (x) equivalently using ~ (x) . 115 We refer to (~ ) as the -representation of ~ (Amari, 2007, 2016). Finally, note that (u; w) corresponds to the arithmetic mean of the inputs after transformation by (u) ( (u; w)) = N X i=1 w i (u i ); (5.4) In other words, (u; w) is -ane, or linear in the -representation of the u i 's (Zhang, 2004). Geometric Annealing Path Geometric averaging is the most common way to construct annealing paths. In particular, for two endpoint densities of interest, u = f~ 0 (x); ~ 1 (x)g, the mixing parameter t is used to dene a convex combination w =f1 t ; t g. Using (~ ) = log ~ as the representation function, an intermediate density along the geometric path is dened using the quasi-arithmetic mean, log ~ (x) = (1) log ~ 0 (x) + log ~ 1 (x); (5.5) which simplies to ~ (x) = ~ 0 (x) 1 ~ 1 (x) . q-Paths and -Representation As an additional example, we will consider quasi-arithmetic means derived using the q-logarithm (Tsallis, 2009; Naudts, 2011) from nonextensive thermody- namics. Recall the denition of the q-logarithm and its inverse, the q-exponential, log q (u) = 1 1q u 1q 1 exp q (t) = [1 + (1q)t] 1 1q + : (5.6) where [] + = max(; 0). While log q (u) is also known as the -representation (Amari, 2007) (for = 2q 1), we will use the parameter q to avoid later confusion as to the role of the parameter (Zhang, 2004, 2013). It can be shown that log q (u) is concave and strictly increasing in u (Naudts, 2011) while, taking the limiting behavior as q! 1, we recover the natural logarithm log(u) and standard exponential exp(t) as its inverse. Masrani et al. (2021) consider `q-paths', which generalize the geometric path using the q- logarithm as the representation function in the quasi-arithmetic mean with the endpoint densities as inputs, log q ~ (q) (x) = (1) log q ~ 0 (x) + log q ~ 1 (x): (5.7) This also simplies as ~ (q) (x) = (1)~ 0 (x) 1q + ~ 1 (x) 1q 1 1q + , and recovers the geometric path in Eq. (5.5) as q! 1. 116 Annealing paths might also be constructed using the more general family of -deformed loga- rithms (Naudts (2004, 2011); Naudts and Zhang (2018), App. F.1), including the -logarithm of Kaniadakis and Scarfone (2002). Our later results will apply for arbitrary monotonic representation functions (u). 
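As a concrete illustration of Eq. (5.3), the following sketch computes a quasi-arithmetic mean for an arbitrary monotone representation function and recovers the geometric and arithmetic means as special cases (function names are illustrative):

import numpy as np

def quasi_arithmetic_mean(u, w, rho, rho_inv):
    """Quasi-arithmetic mean, Eq. (5.3): rho^{-1}( sum_i w_i rho(u_i) ),
    for a strictly monotone representation rho with inverse rho_inv."""
    u, w = np.asarray(u, dtype=float), np.asarray(w, dtype=float)
    return rho_inv(np.sum(w * rho(u)))

# Examples: geometric mean (rho = log) and arithmetic mean (rho = identity)
u, w = np.array([2.0, 8.0]), np.array([0.5, 0.5])
geo = quasi_arithmetic_mean(u, w, np.log, np.exp)              # = 4.0
arith = quasi_arithmetic_mean(u, w, lambda x: x, lambda x: x)  # = 5.0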
Annealing Paths with Exponential Family Endpoints We also consider the special case where endpoint densities 0 and 1 belong to the same exponential family, with base measure g(z), natural parameters =f i g N i=1 , sucient statistics T (z) =fT i (x)g N i=1 , and log partition function (), (x) =g(z) expfT (z) ()g: (5.8) Ignoring the normalization constants ~ 0 (x) =g(z) expf 0 T (z)g and ~ 1 (x) =g(z) expf 1 T (z)g, we can see that the unnormalized density is linear in after applying the (~ ) = log ~ (x) representation function. Since the quasi-arithmetic mean also has this-ane property (Eq. (5.4)), the geometric path is simply a linear interpolation in the natural parameters = (1) 0 + 1 : (5.9) Grosse et al. (2013) propose the moment averaging path, which uses the dual parameter mapping () =() =E [T (z)] as a representation function for the quasi-arithmetic mean, ( ) = (1)( 0 ) +( 1 ): (5.10) While Grosse et al. (2013) show performance gains using the moment averaging path, additional sampling procedures may be required to nd via the inverse mapping 1 () (Wainwright and Jordan, 2008; Zellner and Higheld, 1988). Parametric Interpretation ofq-Paths Using the-anity property to link quasi-arithmetic means and parametric families as in Eq. (5.9) (Zhang, 2004), we can also interpret the geometric or q-annealing paths as a one-dimensional (deformed) exponential family (Brekelmans et al., 2020a,c; Masrani et al., 2021). In particular, considering the endpoint ~ 0 (x) as the base distribution, the 117 q-log likelihood ratio T (x) = log q ~ 1 (x) ~ 0 (x) as the sucient statistic, and the mixing weight as the natural parameter, ~ (q) (x) = ~ 0 (x) exp q n log q ~ 1 (x) ~ 0 (x) o : (5.11) This one-dimensional parametric family is an alternative expression of the q-annealing paths in Eq. (5.7) and can be constructed between arbitrary endpoint densities. The multiplicative normal- ization constant for the q-likelihood ratio family in Eq. (5.11) will play a role in our later analysis, with Z q () = Z ~ (q) (x)dx = Z ~ 0 (x) exp q n log q ~ 1 (x) ~ 0 (x) o dx: (5.12) For the geometric path, this recovers Z 1 () = R ~ 0 (x) 1 ~ 1 (x) dx. In Sec. 5.5.4.3, we explain how Z 1 () appears in relation to Amari's -divergence, whereas the log partition function () = logZ 1 () is related to R enyi's -divergence (Nielsen and Nock, 2011). We will focus on Z q () in the case of q-paths (Sec. 5.5.1), noting that the multiplicative and subtractive normalization constants do not align for deformed exponential families (see Amari and Ohara (2011); Matsuzoe et al. (2019); Wong and Zhang (2021)). 5.1.2 Convexity, Divergences, and Bregman Information Convex functions play a crucial role in dening divergence functionals which measure the `contrast' between two densities ~ a ; ~ b or probability distributions a ; b . We rst recall Jensen's inequality for a convex function f : X7! R with respect to a normalized probability distribution (x) over an input space X. Writing both the case of two inputs =f1;g and the general case, f (1)x 0 +x 1 (1)f(x 0 ) +f(x 1 ) (5.13a) f E (x) [x] E (x) f(x) : (5.13b) Multiplying both sides of Eq. (5.13a) by 1 (1) to induce limiting behavior as !f0; 1g, we recover the condition that a rst-order Taylor approximation everywhere underestimates the convex 118 function f (Boyd and Vandenberghe (2004) 3.1.3). For example, using L'H^ opital's rule as ! 
0, we have f(x 1 )f(x 0 ) +hrf(x 0 );x 1 x 0 i: (5.14) Using these properties of convex functions, we review various divergence functionals in the following subsections. 5.1.2.1 f-Divergence Thef-divergence family (Csisz ar, 1967; Ali and Silvey, 1966) is a fundamental example of a decom- posable divergence which can be written as a sum or integral over the sample space. For a convex function f : [0;1)7! (1;1], I f [~ a : ~ b ] = Z f ~ a (x) ~ b (x) ~ b (x)dx (5.15) In particular, f-divergences have the property of being invariant under reparameterization and monotonic under coarse-graining (Amari (2016) Ch. 3). With normalized input densities a ; b and the condition f(1) = 0, we can conrm nonnegativity of I f [ a : b ] f( R d a ) = 0 using Jensen's inequality. Amari-Divergence Using the generatorf (u) = 1 log u+ 1 1 = 1 (1) u 1 + 1 u+ 1 1 , we recover Amari's-divergence (Havrda (1967),Amari (1982, 2016) Ch. 4) as a common example of the f-divergence I f [~ a : ~ b ], D () A [~ a : ~ b ] = 1 Z ~ a (x)dx + 1 1 Z ~ b (x)dx 1 1 1 Z ~ a (x) 1 ~ b (x) dx: (5.16) From this denition, it is clear that D () A [~ a : ~ b ] = D (1) A [~ b : ~ a ], where D (1) corresponds to the generator f 1 (u) =uf ( 1 u ). While this presentation of the -divergence emphasizes the role of the representation function (u) = log q u (with =q), we will also derive the -divergence using the mixture weight = in Sec. 5.3. KL Divergence Taking the limiting behavior as ! 0 or ! 1, we recover the kl diver- gence and reversekl divergence, respectively, where the domain is again extended to unnormalized densities (Zhu and Rohwer, 1995). 119 In particular, as! 0, the generatorf 0 (u) =u loguu + 1 yields the kl divergence with the same order of arguments as the f-divergence, D KL [~ a : ~ b ] =I f0 [~ a : ~ b ] = Z ~ a (x) log ~ a (x) ~ b (x) dx Z ~ a (x)dx + Z ~ b (x)dx: (5.17) As ! 1, we recover the reverse kl divergence using f 1 (u) = logu +u 1, D KL [~ b : ~ a ] =I f1 [~ a : ~ b ] = Z ~ b (x) log ~ b (x) ~ a (x) dx Z ~ b (x)dx + Z ~ a (x)dx: (5.18) 5.1.2.2 Bregman Divergence The Bregman Divergence generated by a strictly convex function f is dened as the gap in the rst-order Taylor approximation in Eq. (5.14). For example, consider the generator to be the log-partition function of an exponential family : 7! R of an exponential family with natural parameters and sucient statisticsT (z). Noting that @ @ i () = i =E [T i (x)], the Bregman divergence over the parameter space D [ a : b ] = ( a ) ( b )hr ( b ); a b i (5.19) can be shown to equal the kl divergenceD KL [ b : a ] between the respective distributions, with the order of arguments reversed (e.g. Amari (2016) 1.3). More generally, we consider decomposable Bregman divergences which apply the inequality in Eq. (5.14) to a scalar density at each point in the sample space (Zhang, 2004; Nielsen and Nock, 2009). For a strictly convex scalar function f :R7!R, D f [~ a : ~ b ] = Z f ~ a (x) f ~ b (x) ~ a (x) ~ b (x) f 0 ~ b (x) dx; (5.20) wheref 0 = df du . Since convexity is preserved by nonnegative weighted sums (Boyd and Vandenberghe (2004) 3.2.1), we will associate the Bregman divergence in Eq. (5.20) with a decomposable convex generator f [~ ], where f [~ ] = Z f ~ (x) dx: (5.21) We derive the kl divergence below, and consider additional examples in Sec. 5.5.2. 
120 KL Divergence Using the same pointwise generator f 0 (~ ) = ~ log ~ ~ + 1 as for the f- divergence, the decomposable generator f 0 [~ ] = R ~ (x) log ~ (x)dx R ~ (x)dx + 1 is equal to the negative Shannon entropy, up to an ane term which does not aect the induced Bregman divergence (Banerjee et al. (2005b) App A). Noting thatr f [~ ] = log ~ , the Bregman divergence becomes D KL [~ a : ~ b ] = f 0 [~ a ] f 0 [~ b ]hr f 0 [~ b ]; ~ a ~ b i = Z ~ a(x) log ~ a(x) ~ a(x) ~ b (x) log ~ b (x) + ~ b (x) ~ a(x) ~ b (x) log ~ b (x)dx = Z ~ a (x) log ~ a (x) ~ b (x) dx Z ~ a (x)dx + Z ~ b (x)dx: (5.22) In fact, using the kl divergence to a reference ~ 0 as a decomposable generator kl ~ 0 [~ ] = R ~ (x) log ~ (x) ~ 0 (x) dx R ~ (x)dx+ R ~ 0 (x)dx also inducesD KL [~ a : ~ b ] as a Bregman divergenceD kl ~ 0 [~ a : ~ b ]. Relationship withf-divergence Assuming the standard form of the generatorf(1) =f 0 (1) = 0, anyf-divergence can be written using a decomposable Bregman divergence with a base measure ~ b (x) over the sample space, I f [~ a : ~ b ] =D f h ~ a ~ b : 1 i = Z f ~ a(x) ~ b (x) f(1) ~ a(x) ~ b (x) 1 f 0 (1) ~ b (x) = Z f ~ a (x) ~ b (x) ~ b (x)dx: (5.23) 5.1.2.3 Jensen Diversity Instead of using the rst order convexity condition in Eq. (5.14) to construct the Bregman diver- gence, we might also consider constructing divergences from Jensen's inequality in Eq. (5.13a). For a (decomposable) convex functional f [~ ], I () f [~ a : ~ b ] := (1) f [~ a ] + f [~ b ] f (1)~ a +~ b ; (5.24) As in Eq. (5.14), we can introduce scaling factors 1 (1) I f [~ a : ~ b ] to recover the Bregman divergenceD f [~ a : ~ b ] in the limit as! 0, orD f [~ a : ~ b ] =D f [~ b : ~ a ] in the limit as! 1. 121 Eq. (5.24) is known as the -divergence in Zhang (2004, 2013) but, more generally for N > 2 and inputs ~ =f~ i g N i=1 with weights =f i g N i=1 , might be referred to as a Jensen diversity I f [~ ;] = N X n=1 i f [ i ] f h N X n=1 i i : (5.25) Jensen-Shannon Divergence The Jensen-Shannon divergence (jsd) (Burbea and Rao, 1982; Lin, 1991) is the most natural example of a Jensen diversity. Using the negative Shannon entropy f 0 [~ ] = R ~ (x) log ~ (x)dx R ~ (x)dx + 1, the same Bregman generator that we used to generate the kl divergence, the Jensen-Shannon divergence with mixing weight becomes D () JS ~ 0 : ~ 1 = Z (1)~ 0 (x) log ~ 0 (x) (arith) (x) +~ 1 (x) log ~ 1 (x) (arith) (x) dx (5.26) =I () f 0 ~ 0 : ~ 1 := (1) f 0 [~ 0 ] + f 0 [~ 1 ] f 0 [ (arith) ] where (arith) (x) = (1)~ 0 (x) +~ 1 (x) = P N1 i=0 i ~ i (x) is the mixture distribution. Banerjee et al. (2005b); Sibson (1969) show that the arithmetic mean appears in the Jensen diversity as the `centroid', or the argmin in an expected divergence minimization similar to Eq. (5.1), D () JS ~ 0 : ~ 1 =I () f 0 ~ 0 : ~ 1 = min ~ r(x) (1)D f 0 ~ 0 : ~ r +D f 0 [~ 1 : ~ r : (5.27) We revisit the Jensen-Shannon divergence in Sec. 5.5.4.1 as a special case of our main result in Theorem 5.3.1, which yields quasi-arithmetic means and Jensen diversities as the solution to an expected Bregman divergence minimization. 5.1.3 Parametric Information Geometry Information geometry studies dierential geometric structures on the space of probability distri- butions or densities. In particular, the `Eguchi relations', from the seminal work of Eguchi (1983, 1985), show that a statistical divergence D[ a : b ] naturally induces a Riemannian metric and pair of conjugate ane connections on the tangent space of a manifold M of distributions or densities. 
We rst review the Eguchi relations for the parametric case, where the parameters () :M7!UR N provide a coordinate system for points on the manifold 2M . In coordinate notation with basis vectors of the tangent space @ i = @ @ i , the metric g ij () = h@ i ;@ j i species a bilinear form at a point indexed by . An ane connectionr denes notions 122 of curvature (of the manifold), covariant dierentiation (of vector elds), and parallel transport (of tangent vectors along curves). In particular, the covariant derivative can be expressed using the (scalar) Christoel symbols ij;k () =hr @ i @ j ;@ k i. We refer to Amari and Nagaoka (2000) or Nielsen (2020) for detailed background. For a given divergence function, taking the second and third order dierentials yield the following metric and conjugate pair of ane connections g ij () =(@ j ) a (@ k ) b D[ a : b ] a = b (5.28) ij;k () =(@ i ) a (@ j ) a (@ k ) b D[ a : b ] a = b (5.29) ij;k () =(@ i ) b (@ j ) b (@ k ) a D[ a : b ] a = b (5.30) where (@ j ) a indicates partial dierentiation with respect to the parameter j a of the rst argument. Note that conjugacy of connections amounts to ij;k + ik;j =@ i g jk and ensures the inner product of tangent vectors is preserved by parallel translation of each vector according to the respective connection (Amari and Nagaoka (2000) Sec. 3.1). It can be shown that any f-divergence (Eguchi, 1983), including the kl divergence in either direction, Amari's-divergence, and the Jensen-Shannon divergence 1 , yields the Fisher Information metric, which is the unique metric that is invariant to reparameterization (Chentsov, 1982), g ij () = Z (x) @ log (x) @ i @ log (x) @ j dx = Z @ log (x) @ i @ (x) @ j dx (5.31) = Z @ log q (x) @ i @ log 1q (x) @ j dx; where the expression in terms of q-representations re ects a duality of representation functions (Amari and Nagaoka, 2000; Zhang, 2004; Nielsen, 2020) which we will study in Sec. 5.5.1. However, the divergences we consider will dier in their induced dual ane connections (Eguchi (1983); Zhang (2013);Table 5.2). Bregman Divergence and KL Divergence It is well-known that any Bregman divergence induces a dually at space, in which there exists a coordinate system such that either ij;k or ij;k vanishes everywhere (Amari and Nagaoka (2000) Ch. 3, Nielsen (2020) Sec. 3.7, Amari 1 The jsd, rescaled by 1 (1) to ensure the standard form that f 00 (1) = 1, is also an f-divergence for f (u) = 1 (1) (1)u logu + (1)u + log (1)u + . 123 Divergence D KL [~ a : ~ b ] 0 1 D () A [~ a : ~ b ] 1 1 (1) D () JS [~ a : ~ b ] 1 I f [~ a : ~ b ] f 000 (1) 1 f 000 (1) + 2 Table 5.2: Dual pair of -connections induced by common divergence functions, where values of refer to expressions in Eq. (5.32)-(5.34). We assume f 00 (1) = 1. (2016) Ch. 6). It is easy to conrm that ij;k () = 0 for the parametric Bregman divergence D [ a : b ] =D KL [ b : a ] (Eq. (5.19) associated with the exponential family. For arbitrary input distributions, dierentiating the reverse kl divergence (! 1, Eq. (5.18)) using(@ i ) a (@ j ) a (@ k ) b D KL [ b : a ], we obtain ij;k () = (1) ij;k () := Z @ 2 log (x) @ i @ j @ (x) @ k dx (5.32) ij;k () = (0) ij;k () := Z @ 2 (x) @ i @ j @ log (x) @ k dx: (5.33) These dual connections are well-known as the e- and m-connections, respectively (Amari and Na- gaoka, 2000). Similarly, the forward kl divergence (! 0, Eq. (5.17)) induces ij;k = (0) ij;k and ij;k = (1) ij;k . 
-Connection and f-Divergence More generally, Amari (1982) introduced the family of - connections which, among other interpretations, may be viewed as an arithmetic mixture of (1) ij;k and (0) ij;k , () ij;k () = (1) (0) ij;k + (1) ij;k (5.34) = Z 1 (x) @ 2 (x) @ i @ j @ (x) @ k 1 (x) @ (x) @ i @ (x) @ j @ (x) @ k dx where the second line can be derived by simplifying from Eq. (5.32)-(5.33). Adapted to our notation, it can be shown that any f-divergence induces a pair of conjugate connections ij;k = () ij;k and ij;k = (1) ij;k where =f 000 (1) 1 (see Table 5.2). In particular, we see that both the -divergence and the (weighted, scaled) Jensen-Shannon divergence induce the -connection. Although the limiting behavior as !f0; 1g recovers the kl divergence of both divergences, note that the f-divergence generator for the -divergence in 124 Eq. (5.16) is a function of the representation q (u) = log q (u). By contrast, the jsd in Eq. (5.26) is derived as either a Jensen diversity or f-divergence using the mixture parameter . To explain this observation, we rst consider the role of the representation (u) in generating the rho-tau Bregman divergence (Zhang, 2004, 2013) in Sec. 5.2, before describing the role of the mixture parameter in the induced Bregman Information or Jensen diversity in Section 5.3. 5.2 Rho-Tau Bregman Divergence The rho-tau divergence, rst introduced in Zhang (2004), provides an elegant framework for under- standing the relationship between divergence functions and representations of probability densities. The family of rho-tau divergences naturally extends bothf-divergences and decomposable Bregman divergences, and highlights two distinct notions of duality in nonparametric information geometry. Following Zhang (2004, 2013), we nd that these referential and representational dualities are en- coded in the parameters andq, respectively, of the quasi-arithmetic means or annealing paths in Sec. 5.1.1. 5.2.1 Rho-Tau Bregman Divergence Consider and to be scalar functions from7! R which, for an unnormalized density function ~ (x) :Z 7!, we will use to map ~ (x)7! ~ (x) . We consider a proper, convex, lower semi- continuous function f :R7!R applied to (~ )2R, and dene f its convex conjugate. Using the Fenchel-Moreau biconjugation theorem, we can express f or f via a conjugate optimization, f () = sup f() f() = sup f (): (5.35) Solving for the optimizing arguments above suggests the conjugacy conditions, =f 0 () = (f ) 0 1 () = (f ) 0 () = (f 0 ) 1 (): (5.36) Zhang (2004, 2013) refer to these choices of and as conjugate representations with respect to the convex functionf, emphasizing that the choice of two of these functions (;) or (;f) determines the third (Zhang and Matsuzoe, 2021). Note that monotonicity of = (f ) 0 () is guaranteed iff is strictly convex. We will indeed interpret(~ ) as the representation function for the quasi-arithmetic mean in Sec. 5.5. 125 We now consider using the convex functions f() or f () to generate decomposable Bregman divergences, where the arguments are now expressed as the(~ ) or(~ ) representations of the input density functions. Equivalently, as in Sec. 5.1.2.2, we can consider the decomposable generators f [] = Z f ~ (x) dx f [] = Z f ~ (x) dx; (5.37) which are referred to as the negative rho-tau entropy functionals in Naudts and Zhang (2018). 
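The conjugacy conditions in Eq. (5.36) can be illustrated with a minimal numerical check. The sketch below assumes the hypothetical choice f(ρ) = exp(ρ) with conjugate f*(τ) = τ log τ − τ, for which ρ(u) = log u and τ(u) = f'(ρ(u)) = u form a conjugate pair of representations; the Fenchel-Young inequality f(ρ) + f*(τ') ≥ ρ τ' then holds with equality exactly when τ' is the conjugate representation of the same density value.

```python
import numpy as np

# hypothetical convex f and its conjugate: f(rho) = exp(rho), f*(tau) = tau log tau - tau,
# giving the conjugate representations rho(u) = log(u) and tau(u) = f'(rho(u)) = u
f      = lambda rho: np.exp(rho)
f_star = lambda tau: tau * np.log(tau) - tau

u = np.linspace(0.1, 3.0, 50)          # values of an unnormalized density
rho, tau = np.log(u), u                # conjugate representations of u

# Fenchel-Young: f(rho) + f*(tau') >= rho * tau', with equality iff tau' = f'(rho) (Eq. 5.35-5.36)
assert np.allclose(f(rho) + f_star(tau), rho * tau)            # equality at the conjugate pair
tau_wrong = tau + 0.5
assert np.all(f(rho) + f_star(tau_wrong) > rho * tau_wrong)    # strict inequality otherwise
```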
The Bregman divergence generated by f becomes, D f ~ a : (~ b ) = f [(~ a )] f [(~ b )] r f [(~ b )]; (~ a )(~ b ) = Z f ~ a (x) f ~ b (x) ~ a (x) ~ b (x) f 0 ( ~ b (x))dx; (5.38) where can also substitute (~ (x)) = f 0 ((~ (x))). We will use the notation ~ (x) := (~ (x)) as a slight shorthand which emphasizes that our convex dualities are with respect to the representation functions. Using the conjugate functionf () or the corresponding generator f [], we can derive a dual rho-tau Bregman divergence. Noting that ~ b (x) = (f ) 0 ( ~ b ), the dual divergences can be shown to be equivalent up to ordering of the arguments. Using the conjugate relationships in Eq. (5.35) and underlining terms involved in simplication steps, we have D f (~ a ) :(~ b ) = Z f ~ a (x) f ~ b (x) ~ a (x) ~ b (x) ~ b (x)dx (5.39) = Z f ~ a (x) +f ~ b (x) ~ a (x) ~ b (x)dx (5.40) = Z f ~ b (x) f ~ a (x) ~ b (x) ~ a (x) ~ a (x)dx (5.41) =D f (~ b ) :(~ a ) : (5.42) Eq. (5.40) is the canonical form of the rho-tau Bregman divergence (Amari and Nagaoka (2000) 3.4), which we write using a mixed parameterization in terms of (~ a ) and (~ b ) D f;f (~ a ) :(~ b ) = Z f ~ a (x) +f ~ b (x) ~ a (x) ~ b (x)dx (5.43) = f [ ~ a ] + f [ ~ b ] Z ~ a (x) ~ b (x)dx; (5.44) 126 Constructing D f ;f (~ b ) : (~ a ) in a similar fashion, we note that the various divergences are related by D f (~ a ) :(~ b ) =D f;f (~ a ) :(~ b ) =D f ;f (~ b ) :(~ a ) =D f (~ b ) :(~ a ) : The canonical divergences D f;f ;D f ;f are analogous to the Fenchel-Young losses studied in (Blondel et al., 2020; Martins et al., 2021; Nielsen, 2022), here using the rho-tau negative entropy functionals. 5.3 Rho-Tau Bregman Information In this section, we extend the results of Banerjee et al. (2005a,b), where the arithmetic mean over inputs provides the minimizing argument for the expected Bregman divergence to a representative `centroid' in the second argument. Thm. 5.3.1 uses the rho-tau Bregman divergence framework to extend this result to quasi-arithmetic means, clarifying the variational interpretations of annealing paths in Eq. (5.1). At this minimizing argument, the expected divergence objective matches the representational -divergence of (Zhang, 2004, 2013), or Jensen diversity from Sec. 5.1.2.3. 127 Theorem 5.3.1. (Rho-Tau Bregman Information): Consider N unnormalized measures ~ = f~ i (x)g N i=1 on x2X , and a normalized distribution of mixture weights =f i g N i=1 , P i i = 1. (i) For a rho-tau Bregman divergence D f [(~ a ) :(~ b )] with generator f , the optimization I f; ~ ; := min ~ r(x) N X i=1 i D f [(~ i ) :(~ r)] : (5.45) has a unique minimizer given by the arithmetic mean over arguments (Banerjee et al., 2005b), which corresponds to quasi-arithmetic mean with representation function (~ ) ~ () (x) := 1 N X i=1 i ~ i (x) ! = argmin ~ r(x) N X i=1 i D f [(~ i ) :(~ r)]: (ii) At this minimizing argument, the value of the expected divergence in Eq. (5.45) is called the Bregman Information and is equal to the Jensen diversity, or gap in Jensen's inequality, for the convex functional [ ~ ] = R f ~ (x)) dx, mixture weights, and inputs ~ , I f; ~ ; = N X i=1 i f ~ i f ~ () : (5.46) (iii) Using ~ r 6= ~ () as the representative in Eq. (5.45), the suboptimality gap is a rho-tau Bregman divergence N X i=1 i D f ~ i : ~ r I f; ~ ; =D f ~ () : ~ r (5.47) = f [ ~ () ] + f [ ~ r ] Z ~ () (x) ~ r (x)dx: The last expression is the Bregman divergence canonical form in Eq. (5.44), which also arises as the gap in the conjugate optimization in Eq. 
(5.35), f ( ~ () ) R ~ () (x) ~ r (x)dx f ( ~ r ), for suboptimal ~ r. Proof. (ii): We begin by showing that the optimal representative ~ () (x) = 1 P N i=1 i ~ i (x) yields a Jensen diversity in (ii), before proving this choice is the unique minimizing argument. 128 Expanding the expected divergence in Eq. (5.45) for ~ r = ~ () , we have P N i=1 i D f [(~ i ) : ~ () ] = P N i=1 i f [ ~ i ] f [ ~ () ] RP N i=1 i ~ i (x) ~ () (x) ~ () (x)dx. Since P i i ~ i (x) = ~ () , the nal term cancels to yield N X i=1 i D f h (~ i ) : ~ () i = N X i=1 i f [ ~ i ] f [ ~ () ]: (5.48) (i): To show ~ () is indeed optimal in (i), we consider the dierence in the expected divergence using an arbitrary ~ r(x) instead of ~ () . We nd that this dierence corresponds to the rho-tau Bregman divergence in (iii), which has a unique minimizer at ~ r(x) = ~ () . Using Eq. (5.48), we see that P i i f [ ~ i ] terms cancel to yield N X i=1 i D f [(~ i ) :(~ r)] N X i=1 i D f h (~ i ) : ~ () i (5.49) = N X i=1 i f [ ~ i ] f [ ~ r ] Z N X i=1 i ~ i (x) ~ r (x) ~ r (x)dx N X i=1 i f [ ~ i ] f [ ~ () ] =D f ~ () : ~ r : (5.50) where the last line follows by denition after noting that ~ () := P N i=1 i ~ i (x). The rho-tau divergence is minimized if and only if (~ () ) =(~ r) (Zhang, 2004), thus proving (i). (iii): Finally, we can express the suboptimality gap in Eq. (5.49) or rho-tau Bregman divergence in Eq. (5.50) as the gap in a conjugate optimization. Considering the conjugate expansion of f [(~ () )], we have f [ ~ () ] = sup (x) Z ~ () (x)(x)dx f [] Z ~ () (x) ~ r (x)dx f [ ~ r ] (5.51) for any choice of ~ r (x). This provides a lower bound on [ ~ () ], where the gap in the lower bound is the canonical form of the Bregman divergence. Indeed, substituting f [ ~ r ] = R ~ r (x) ~ r (x)dx f [ ~ r ] in Eq. (5.49), we have D f (~ ;) : ~ r = f [ ~ () ] + f [ ~ r ] Z ~ () (x) ~ r (x)dx (5.52) = f [ ~ () ] f [ ~ r ] Z ~ () (x) ~ r (x) ~ r (x)dx: Conjugate optimizations which treat f-divergences as a convex function are popular for providing variational lower bounds on divergences (Nguyen et al., 2010; Poole et al., 2019) or min-max optimizations for adversarial training (Nowozin et al., 2016; Nock et al., 2017). Note however, that 129 this proof provides a variational upper bound on the Bregman Information, which includes the Jensen-Shannon divergence (Sec. 5.1.2.3, 5.5.4.1) and mutual information (Banerjee et al. (2005b) Ex. 6) as examples. Our focus in the rest of the paper will be on annealing paths with two endpoint densities (N = 2). In this case, the Bregman Information or Jensen diversity can be viewed as a statistical divergence, as we emphasize in the following corollary. Corollary 5.3.2. ForN = 2, consider the Bregman Information with additional scaling factors to induce limiting behavior for !f0; 1g. This quantity recovers the representational -divergence of Zhang (2004, 2013), 1 (1) I f ; f~ 0 ; ~ 1 g;f1;g (5.53) = 8 > > > > > < > > > > > : 1 (1) (1) f ~ 0 + f ~ 1 f ~ () (6= 1) D f [(~ 1 ) :(~ 0 )] =D f [(~ 0 ) :(~ 1 )] (! 0) D f [(~ 1 ) :(~ 0 )] =D f [(~ 0 ) :(~ 1 )] (! 1) where(~ () ) = (1)(~ 0 )+(~ 1 ) is the quasi-arithmetic mean. From the annealing path per- spective, Eq. (5.53) associates a divergence functional comparing ~ 0 and ~ 1 with each intermediate density along the path. Here, we also note the `referential duality' (Zhang, 2004, 2013) in terms of , where 1 (1) I f ; (f~ 0 ; ~ 1 g;f1;g) = 1 (1) I f ; (f~ 1 ; ~ 0 g;f; 1g). 5.4 Information Geometry of Rho-Tau Divergence Using the Eguchi relations in Eq. 
(5.28)-(5.30), Zhang (2004, 2013) derive the metric and ane connections induced by the representational -divergence or scaled Jensen diversity in Eq. (5.53). In particular, the metric becomes g ij () = Z @ (x) @ i @ (x) @ j dx = Z 0 (x) 0 (x) @ (x) @ i @ (x) @ j dx; (5.54) which generalizes Eq. (5.31) beyond () = log q () and () = log 1q (). 130 Zhang (2004) show that the divergence 1 (1) I f ; f~ 0 ; ~ 1 g;f1;g in Cor. 5.3.2 induces the following analogue of the -connection () f; ij;k () = Z (1) @ 2 (x) @ i @ j @ (x) @ k dx + @ 2 (x) @ i @ j @ (x) @ k dx: = Z 0 0 @ 2 @ i @ j @ @ k + (1) 00 0 + 00 0 @ @ i @ @ j @ @ k dx which is a mixture () f; ij;k () = (1) (0) f; ij;k () + (1) f; ij;k () of the conjugate connections induced by the rho-tau Bregman divergences at = 0 and = 1. This generalizes the expressions for the -connection () ij;k in Eq. (5.32)-(5.34), which are recovered for () = and () = log (Sec. 5.5.4.2). Since the target density in mcmc applications is usually unnormalized and can not be repre- sented using a parametric model, we would like to consider a nonparametric information geometry over the space of unnormalized measures. In this case, the approach is to construct a Banach manifoldM 0 of all densities which are absolutely continuous with respect to a base measure 0 (Pistone and Sempi, 1995; Loaiza and Quiceno, 2013a). Similarly to our likelihood ratio interpreta- tion in Eq. (5.11), the coordinate mappings correspond toq-logarithmic likelihood ratios, while the tangent space can be identied with one-dimensional q-exponential families (Loaiza and Quiceno, 2013a). For detailed constructions over normalized nonparametric densities, see Loaiza and Quiceno (2013a,b) or Pistone and Sempi (1995); Gibilisco and Pistone (1998); Grasselli (2010). For the nonparametric setting, Zhang (2004, 2013) consider general tangent vectors u(x);v(x) and obtain the following analogues of the Fisher metric and -connection, g u;v (~ ) =hu;vi = Z 0 ~ (x) 0 ~ (x)u(x)v(x)dx (5.55) () f; wu;v (~ ) =hr w u;vi = Z 0 ~ (x) 0 ~ (x) dwu(x)v(x) + (1) 00 ~ (x) 0 ~ (x) + 00 ~ (x) 0 ~ (x) u(x)w(x)v(x) dx: (5.56) where d w u is the directional derivative of u in the direction of w and 0 ~ (x) = 0 ~ (x) . The parametric expression above can be recovered using, for example, u(x) = @ @ i (x), and we also note the similar form of Eq. (5.56) to the expression in Eq. (5.34). Again, the connections () f; wu;v = (1) (0) f; wu;v + (1) f; wu;v are obtained as mixtures of the conjugate connections induced by the rho-tau Bregman divergence. In our examples below, we derive special cases of the representational -divergences or rho-tau Jensen diversities in Corollary 5.3.2 and analyze the induced geometric structures, noting that each 131 divergence is associated with a quasi-arithmetic mean or annealing path intermediate density via Thm. 5.3.1 (i). 5.5 Examples Note that there are two degrees of freedom in specifying the rho-tau divergence. For a given choice of monotonic representation function, choosing either the convex functionf or a conjugate representation will uniquely determine the divergence (Naudts and Zhang, 2018). As examples, we consider the two choices proposed in Naudts and Zhang (2018); Zhang and Matsuzoe (2021) and comment on further options in Sec. 5.5.3. 
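Note that the representation functions enter the induced metric in Eq. (5.54)-(5.55) only through the product ρ'τ'. As a quick numerical illustration (a sketch which assumes the q-logarithmic representations ρ = log_q and τ = log_{1−q} introduced in Eq. (5.59) below), this product equals 1/μ̃ for every q, so the metric collapses to the Fisher metric, whereas the ρ-identity gauge does not share this property.

```python
import numpy as np

# derivative of the deformed q-logarithm log_q(u): d/du log_q(u) = u^{-q}
d_log_q = lambda u, q: u**(-q)

u = np.linspace(0.05, 5.0, 200)                  # values of an unnormalized density
for q in [0.0, 0.3, 0.7, 1.0]:
    # rho = log_q and tau = log_{1-q}: rho'(u) * tau'(u) = u^{-q} * u^{-(1-q)} = 1/u for every q,
    # so the metric in Eq. (5.54) reduces to the Fisher metric in Eq. (5.31)
    assert np.allclose(d_log_q(u, q) * d_log_q(u, 1 - q), 1.0 / u)

# in the rho-identity gauge (tau(u) = u, so tau' = 1), the product is u^{-q}, which differs
# from 1/u unless q = 1, so the induced metric is no longer the Fisher metric (cf. Eq. (5.83))
print(np.allclose(d_log_q(u, 0.5) * np.ones_like(u), 1.0 / u))   # False
```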
-Deformed Gauge: ~ (x) = log ~ (x) (5.57) f( ~ ) =c ~ (x) +a ~ (x) +b -Identity Gauge: ~ (x) = log ~ (x) (5.58) ~ (x) = ~ (x): We present example constructions which recover the kl divergence (Kullback and Leibler, 1951), Amari's -divergence (Amari, 1982, 2016), and the -divergence (Basu et al., 1998; Cichocki and Amari, 2010). In particular, we will derive the q-paths in Eq. (5.7) using the -representation ~ (x) = log q (~ (x)) and each of the `gauges' in Eq. (5.57)-(5.58). We recover the expected - divergence minimization in Eq. (5.1) for the -deformed gauge, whereas the -id gauge yields the q-path as the solution to an expected -divergence (Basu et al., 1998; Eguchi, 2006) minimization. 5.5.1 Amari -Divergence and the -Deformed Gauge As an example of the -deformed gauge, we consider the q-logarithm and 1 q logarithm as representation functions, ~ (x) = log q ~ (x) = 1 1q ~ (x) 1q 1 1q ~ (x) = log 1q ~ (x) = 1 q ~ (z) q 1 q (5.59) 132 where the duality betweenq and 1q deformations mirrors the and 1 connections induced by the-divergence in Table 5.2. These choices of and are conjugate with respect to the following f() and f (), f() = 1 q exp q fg 1 q 1 q f () = 1 1q exp 1q fg 1 1q 1 1q : (5.60) Note that these expressions correspond to f(u) = c 1 (u) and f (t) = c 1 (t) up to ane terms which do not change the rho-tau Bregman divergence. Comparing Eq. (5.60) with the expression in Eq. (5.57), we can ensure that f((~ )) is convex in ~ by choosing a< 0, since (~ ) = log (~ ) is concave in ~ (Naudts, 2004, 2011). Amari -Divergence as Rho-Tau Bregman Divergence In contrast to Zhang (2004); Nielsen and Nock (2009), these choices yield a generator function which is strictly convex in ~ (x) 0. In particular, the decomposable generators f [ ~ ] = R f (~ (x) dx and f [ ~ ] = R f (~ (x) dx become f ~ = 1 q Z ~ (x)dx 1 q(1q) Z ~ (x) 1q dx 1 1q f ~ = 1 1q Z ~ (x)dx 1 q(1q) Z ~ (x) q dx 1 q : (5.61) The -divergence with = q = 1+ 0 2 may now be derived using either D f , D f , or the canonical form in Eq. (5.44). We focus on the latter form, which can be rewritten using the decomposable generators in Eq. (5.61), D (q) f;f (~ a ) :(~ b ) = f ~ a + f ~ b Z ~ a (x) ~ b (x)dx (5.62) = 1 q Z ~ a (x)dx + 1 1q Z ~ b (x)dx 1 q 1 1q Z ~ a (x) 1q ~ b (x) q dx =D (q) A [~ a : ~ b ]; (5.63) Expressing the divergence in terms of the pairs f; or f ;, we can see that D (q) A [~ a : ~ b ] =D f (~ a ) :(~ b ) =D f;f (~ a ) :(~ b ) =D f (~ b ) :(~ a ) : 133 Zhang's (;) Divergence as Bregman Information Applying Thm. 5.3.1, we now consider the rho-tau Bregman Information or Jensen diversity induced by the Amari -divergence of order q. As in Amari (2007); Masrani et al. (2021), the q-path or quasi-arithmetic mean ~ (q) (x) = exp q (1) log q ~ 0 (x) + log q ~ 1 (x) (5.64) = (1)~ 0 (x) 1q + ~ 1 (x) 1q 1 1q + = argmin ~ r (1)D (q) A ~ 0 : ~ r +D (q) A ~ 1 : ~ r minimizes the expected Amari-divergence to the second argument. RecognizingD (q) A as a rho-tau Bregman divergence D f [~ a : ~ b ] as in Eq. (5.63), we can rewrite the Jensen diversity induced by a mixture parameter 2 (0; 1) in several insightful forms 1 (1) I f f~ 0 ; ~ 1 g;f1;g = 1 (1) (1)D (q) A ~ 0 : ~ (q) +D (q) A ~ 1 : ~ (q) (5.65) = 1 (1) (1) f ~ 0 + f ~ 1 f ~ (q) (5.66) = 1 (1) 1 q Z (1) ~ 0 (x) + ~ 1 (x) ~ (q) (x)dx (5.67) =:D (;q) [~ 0 : ~ 1 ] (5.68) where, in the third line, we have simplied using the denition of the generator f and q-path density ~ (q) . We recognize the scaled Bregman Information in Eq. 
(5.66)-(5.67) as the (;) divergence from Zhang (2004, 2013), which we rename as D (;q) [~ 0 : ~ 1 ] to distinguish the role of each parameter. Viewing the q-path as a one-dimensional q-exponential family of unnormalized densities as in Eq. (5.11), note that the last term in Eq. (5.67) corresponds to the normalization constant Z q () = R ~ (q) (x)dx. Since Z q (0) = R ~ 0 (x)dx for all q and similarly for Z q (1), we have the following parametric interpretation of the divergenceD (;q) [~ 0 : ~ 1 ] or scaled Bregman Information D (;q) [~ 0 : ~ 1 ] = 1 (1) 1 q (1)Z q (0) +Z q (1)Z q () : (5.69) 134 Dual Bregman Information Finally, we can generate the 1q path using the dual rho-tau Bregman divergence D f [(~ a ) : (~ b )] in the (~ ) = log 1q (~ ) representation. Since the dual divergence reverses the order of arguments, the Bregman Information can be written 1 (1) I f f~ 0; ~ 1g;f1;g (5.70) = 1 (1) (1)D (q) A ~ (1q) : ~ 0 +D (q) A ~ (1q) : ~ 1 = 1 (1) (1) f ~ 0 + f ~ 1 f ~ (q) = 1 (1) 1 1q Z (1) ~ 0(x) + ~ 1(x) ~ (1q) (x)dx where the normalization constantZ 1q () of the intermediate density ~ (1q) (x) appears in the nal term, similarly to Eq. (5.69). Geometry of - and (;q)-Divergence To see that Amari's -divergence or Zhang's (;q) induce the Fisher Information metric in Eq. (5.31), note that the choice of and in Eq. (5.59) have the property 0 (~ ) 0 (~ ) = ~ q ~ q1 = ~ 1 regardless of the choice of q. Using Eq. (5.54), this leads to the metric g ij () = Z 1 (x) @ (x) @ i @ (x) @ j dx g uv (~ ) = Z 1 ~ (x) u(x)v(x)dx: (5.71) In Sec. 5.1.3, we found that the-divergence induces the-connection, which can now be viewed as arising from the rho-tau Bregman divergence (! 1) for the representation function(~ ) = log q ~ . Varying the mixing parameter in the Jensen diversity, we can use Eq. (5.55) to analyze the ane connection induced by the (;q) divergence. Since 00 (~ ) 0 (~ ) =(1q)~ 1 and 00 (~ ) 0 (~ ) =q~ 1 , we () f; wu;v (~ ) = Z 1 ~ (x) d w u(x)v(x) (1)(1q) +q 1 ~ (x) u(x)w(x)v(x) dx: (5.72) Comparing with Eq. (5.34), this is the standard-connection with order = (1)(1q) +q = +q 2q 1 (Zhang, 2004, 2013). 2 Note that for we recover the =q connection for = 1 and = 1q connection for = 0, matching the case of the Amari -divergence in Table 5.2. 5.5.2 -Divergence and the -Id Gauge As a representative example of the -id gauge, we will consider the Beta divergence of order 2q (Cichocki and Amari (2010), Naudts (2011) Sec. 8.7), which has found fruitful application in 2 Under the parameterization ^ = 1+ 2 , ^ q = 1+q 2 , Zhang (2004, 2013) nd that the -connection has coecient 1+^ 2 = 1+ ^ ^ q 2 , which matches the ^ = ^ ^ q connection. 135 promoting robustness to outliers (Basu et al., 1998; Mihoko and Eguchi, 2002; Futami et al., 2018; Knoblauch et al., 2018) D (2q) B [~ a : ~ b ] = 1 1q 1 2q Z ~ a (x) 2q dx + 1 2q Z ~ b (x) 2q dx (5.73) 1 1q Z ~ a (x)~ b (x) 1q dx: As shown in Table 5.4, special cases of the Beta divergence include the squared Euclidean distance for q = 0, and the Itakura-Saito divergence D IS [~ a : ~ b ] = R ~ a(x) ~ b (x) dx R log ~ a(x) ~ b (x) dx 1 as q! 2. 
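These limiting cases are straightforward to confirm numerically. The sketch below uses the standard integrand form of the Beta divergence (Basu et al., 1998; Cichocki and Amari, 2010) of order β = 2 − q, with illustrative unnormalized Gaussian-shaped densities on a grid: the q = 0 case recovers half the squared Euclidean distance exactly, while the q → 1 and q → 2 limits approach the extended KL and Itakura-Saito divergences.

```python
import numpy as np

def beta_div(a, b, order, dx=1.0):
    # standard Beta divergence of order `order` = 2 - q between unnormalized densities
    # a, b evaluated on a common grid with spacing dx
    o = order
    if np.isclose(o, 1.0):                      # q -> 1: extended KL divergence
        integrand = a * np.log(a / b) - a + b
    elif np.isclose(o, 0.0):                    # q -> 2: Itakura-Saito divergence
        integrand = a / b - np.log(a / b) - 1
    else:
        integrand = a**o / (o * (o - 1)) + b**o / o - a * b**(o - 1) / (o - 1)
    return np.sum(integrand) * dx

x = np.linspace(-4, 4, 2001); dx = x[1] - x[0]
a = np.exp(-0.5 * (x - 0.5)**2)                 # illustrative unnormalized densities
b = np.exp(-0.5 * (x + 0.5)**2 / 1.5**2)

# q = 0 (order 2): half the squared L2 distance, exactly
assert np.isclose(beta_div(a, b, 2.0, dx), 0.5 * np.sum((a - b)**2) * dx)
# q -> 1 (order -> 1): approaches the extended KL divergence
print(beta_div(a, b, 1.001, dx), beta_div(a, b, 1.0, dx))
# q -> 2 (order -> 0): approaches the Itakura-Saito divergence
print(beta_div(a, b, 0.001, dx), beta_div(a, b, 0.0, dx))
```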
-Divergence as Rho-Tau Bregman Divergence We now show that theq-path and mixture path arise from an expected Beta divergence minimization via the (~ ) = log q ~ and (~ ) = ~ representations, respectively ~ (x) = log q ~ (x) = 1 1q ~ (z) 1q 1 1q ~ (x) = ~ (x): (5.74) These choices of and are conjugate with respect to the following f() and f (), where we include additional additive constants to induce kl divergences as limiting behavior (see Table 5.4) f() = 1 2q [1 + (1q)] 2q 1q + 1 2q f () = 1 1q 1 2q 2q 1 1q + 1 2q : (5.75) This leads to the decomposable convex generators f [ ~ ] = 1 2q Z ~ (x) 2q dx 1 2q (5.76) f [ ~ ] = 1 (1q)(2q) Z ~ (x) 2q dx Z 1 1q ~ (x)dx + 1 2q : (5.77) Up to ane terms, note that each convex generator is equivalent to the negative Tsallis entropy of order 2q (Naudts (2011) Sec. 8.7), which reduces to the negative Shannon entropy as q! 1. Using the canonical form of the rho-tau divergence, we obtain the Beta divergence (Cichocki and Amari, 2010) of order 2q, D f;f [(~ a ) :(~ b )] = f ( ~ a ) + f ( ~ b ) Z ~ a (x) ~ b (x)dx (5.78) = 1 1q 1 2q Z ~ b (x) 2q dx + 1 2q Z ~ a (x) 2q dx 1 1q Z ~ b (x)~ a (x) 1q dx =:D 2q B [~ b : ~ a ]: (5.79) 136 We emphasize that the representation parameter q sets the order of the -divergence, rather than the mixing parameter. Further, note that thef; orf;f forms of the Bregman divergence have a dierent order of the arguments compared to the -divergence, with D (2q) B [~ b : ~ a ] =D f [(~ a ) :(~ b )] =D f;f [(~ a ) :(~ b )] =D f [(~ b ) :(~ a )]: Bregman Information from -Divergence Using Thm. 5.3.1, we also recover theq-path in Eq. (5.7) from an expected divergence minimization in the (~ ) = log q ~ representation ~ (q) (x) = exp q (1) log q ~ 0 (x) + log q ~ 1 (x) (5.80) = argmin ~ r(x) (1)D f [(~ 0 ) :(~ r)] +D f [(~ 1 ) :(~ r)]: Since minimizingD f [(~ i ) :(~ r)] over ~ r in the second argument corresponds to minimizing over the rst argument of the -divergence, we can write the scaled Bregman Information in the following forms, 1 (1) I f~ 0 ; ~ 1 g;f1;g (5.81) = 1 (1) (1)D (2q) B [~ (q) : ~ 0 ] +D (2q) B [~ (q) : ~ 1 ] = 1 (1) (1) f ~ 0 + f ~ 1 f ~ = 1 (1) 1 2q (1) Z ~ 0 (x) 2q dx + Z ~ 1 (x) 2q dx Z ~ (q) (x) 2q dx ; where the q-path intermediate density appears raised to the 2q power inside the integral. Dual Bregman Information In contrast to the example for the -divergence, the annealing path generated by the dual divergence D f [(~ a ) :(~ b )] and potential function f generates the mixture path, since (~ ) = ~ . 1 (1) I f~ 0 ; ~ 1 g;f1;g (5.82) = 1 (1) (1)D (2q) B [~ 0 : (arith) ] +D (2q) B [~ 1 : (arith) ] = 1 (1) (1) f ~ 0 + f ~ 1 f (arith) = 1 (1) 1 1q 1 2q (1) Z ~ 0 (x) 2q dx + Z ~ 1 (x) 2q dx Z (arith) (x) 2q dx ; 137 Indeed, several previous works have noted the convenience of the -divergence for optimization under linear constraints. Csiszar (1991) provide an axiomatic characterization of -divergences as providing scale-invariant projection onto the set of positive measures satisfying expectation con- straints. Naudts (2011) Ch. 8 discuss related benets of using the-divergence for thermodynamic interpretations in place of the Tsallis entropy or -divergence, which induce escort expectations due to the deformed dual representation(~ ) = 1 q (~ (x) q 1) from Sec. 5.5.1 (Naudts, 2011; Zhang and Matsuzoe, 2021). 
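The variational characterization of the q-path in Eq. (5.80)-(5.81) can also be confirmed numerically. The sketch below (illustrative unnormalized densities on a grid, with hypothetical choices β = 0.4 and q = 0.5) checks that the q-path attains a smaller weighted Beta divergence, with the candidate density in the first argument, than the mixture path, the geometric path, or a q-path of a different order.

```python
import numpy as np

def q_path(p0, p1, beta, q):
    # quasi-arithmetic mean with representation rho = log_q (Eq. 5.80)
    if np.isclose(q, 1.0):
        return p0**(1 - beta) * p1**beta          # geometric path as q -> 1
    return ((1 - beta) * p0**(1 - q) + beta * p1**(1 - q))**(1.0 / (1 - q))

def beta_div(a, b, order, dx):
    # Beta divergence of order 2 - q (standard form), between unnormalized densities
    o = order
    return np.sum(a**o / (o * (o - 1)) + b**o / o - a * b**(o - 1) / (o - 1)) * dx

x = np.linspace(-4, 4, 2001); dx = x[1] - x[0]
p0 = np.exp(-0.5 * (x - 1.0)**2)
p1 = np.exp(-0.5 * (x + 1.0)**2 / 1.5**2)
beta, q = 0.4, 0.5

# weighted Beta divergence with the candidate r in the *first* argument, as in Eq. (5.81)
objective = lambda r: ((1 - beta) * beta_div(r, p0, 2 - q, dx)
                       + beta * beta_div(r, p1, 2 - q, dx))

r_star = q_path(p0, p1, beta, q)
# any other candidate annealing density gives a strictly larger expected Beta divergence
for r in [q_path(p0, p1, beta, 0.9),               # q-path of a different order
          (1 - beta) * p0 + beta * p1,              # mixture path
          p0**(1 - beta) * p1**beta]:               # geometric path
    assert objective(r) > objective(r_star)
```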
Geometry of Beta Divergence and the Induced Jensen Diversity In contrast to the case of the -divergence in the -deformed gauge, the -divergence or its associated Jensen diversities in the-id gauge do not necessarily induce the standard Fisher metric. Instead, since 0 (~ ) 0 (~ ) = ~ q 1 = ~ q , the metric becomes g ij () = Z 1 (x) q @ (x) @ i @ (x) @ j dx g uv (~ ) = Z 1 ~ (x) q u(x)v(x)dx; (5.83) which matches the Fisher metric only for q = 1 and (~ ) = log ~ . Similarly, for the Beta divergence of order 2q as the Bregman divergence ( = 1), the Jensen diversity or Bregman Information associated with the mixture parameter induces the ane connection () f; wu;v (~ ) = Z 1 ~ (x) q d w u(x)v(x)q 1 ~ (x) u(x)w(x)v(x) dx: (5.84) using Eq. (5.56), 00 (~ ) 0 (~ ) = 0, and 00 (~ ) 0 (~ ) =q~ 1 . While the outer integration over ~ (x) q prevents us from identifying Eq. (5.84) with the standard -connection in Eq. (5.34), the corresponding coecient would be the product q. 5.5.3 Cichocki & Amari's (;) Divergence using -Deformations While the - and Beta-divergence constitute well-known examples from the literature, there is no apparent reason to limit our analysis to only these choices. In particular, xing (~ ) as the q-logarithm as above, we can consider the following (~ ) = log q (~ ) = 1 1q ~ 1q 1 1q (~ ) = log 1 (~ ) = 1 ~ 1 : (5.85) 138 For any choice of , , it is possible to nd an appropriate f = 1 due to the fact that set of strictly monotonic functions form a group, where the group action is function composition (Zhang, 2015). This function f is convex as long as 0 q 1. Thus, the and representations in Eq. (5.85) are conjugate with respect to the convex function f() = 1 1 + 1q exp q +1q 1 +c; (5.86) for an additive constant c. The decomposable generator becomes f [ ~ ] = Z f (~ (x)) dx = 1 1 + 1q Z ~ (x) +1q dx 1 1 1q Z ~ (x) 1q dx +c: Finally, the rho-tau Bregman divergence recovers the (;) divergence of Cichocki and Amari (2010) D f [(~ a ) :(~ b )] = 1 (1q)( + 1q) (1q) Z ~ a (x) +1q dx (5.87) + Z ~ b (x) +1q dx ( + 1q) Z ~ a (x) ~ b (x) 1q dx : Note that this recovers the -divergence and-deformed gauge for =q, and the Beta-divergence and -id gauge for = 1. 5.5.4 Rho-Tau Bregman Information for KL Divergences We nally analyze the rho-tau Bregman Information for the forward and reverse kl divergences, which recover the mixture path and geometric path, respectively. As summarized in Table 5.4, paths based on the kl divergence can be derived as limiting behavior in either the -deformed (Sec. 5.5.1) or -id gauges (Sec. 5.5.2) . 5.5.4.1 Mixture Path using Forward KL Divergence To recover the kl divergence from the rho-tau Bregman divergence, consider the following choices (~ ) = ~ (x) f() = log + 1 (5.88) f ~ = Z f (~ (x) dx = Z ~ (x) log ~ (x)dx Z ~ (x)dx + 1; (5.89) 139 Divergence --Divergence (~ );(~ ) Convex f(), f () D (q) A [~ a : ~ b ] 1 q R ~ a (z)dz + 1 1q R ~ b (z)dz 1 q(1q) R ~ a (z) 1q ~ b (z) q dz D f [ ~ a : ~ b ] (~ ) = log q ~ f() = 1 q exp q fg 1 q 1 q D f [ ~ b : ~ a ] (~ ) = log 1q ~ f () = 1 1q exp 1q fg 1 1q 1 1q D (2q) B [~ a : ~ b ] R 1 1q 1 2q ~ a (z) 2q + 1 2q ~ b (z) 2q 1 1q ~ a (z)~ b (z) 1q dz D f [ ~ b : ~ a ] (~ ) = log q ~ f() = 1 2q exp q fg 2q 1 2q D f [ ~ a : ~ b ] (~ ) = ~ f () = 1 1q 1 2q 2q 1 1q + 1 2q Table 5.3: - and -divergences as rho-tau Bregman divergences. See Table 5.4 for special cases. -Deformed q! 0 q! 
1 q = 2 ; Convex f(), f () D f [ ~ a : ~ b ] D KL [~ a : ~ b ] D KL [~ b : ~ a ] D P 2[~ b : ~ a ] (~ ) = log q ~ f() = 1 q exp q fg 1 q 1 q D f [ ~ a : ~ b ] D KL [~ b : ~ a ] D KL [~ a : ~ b ] D P 2[~ a : ~ b ] (~ ) = log 1q ~ f () = 1 1q exp 1q fg 1 1q 1 1q -id q = 0 q! 1 q! 2 ; Convex f(), f () D f [ ~ a : ~ b ] 1 2 k~ a ~ b k 2 2 D KL [~ b : ~ a ] D IS [~ b : ~ a ] (~ ) = log q ~ f() = 1 2q [1 + (1q)] 2q 1q + 1 2q D f [ ~ a : ~ b ] 1 2 k~ a ~ b k 2 2 D KL [~ a : ~ b ] D IS [~ b : ~ a ] (~ ) = ~ f () = 1 1q 1 2q 2q 1 1q + 1 2q Table 5.4: Limiting Behavior in q for the - and -divergences as rho-tau Bregman Divergences. Note that D P 2 indicates Pearson's 2 divergence and D IS indicates the Itakura-Saito divergence. where f [ ~ ] is the negative Shannon entropy up to an ane term and matches f 0 [~ ] from Sec. 5.1.2. Substituting into Eq. (5.38), we can conrm that the resulting rho-tau divergence recovers the kl divergence with the same order of arguments, D f [(~ a ) :(~ b )] =D KL [~ a : ~ b ]. Jensen-Shannon Divergence as Bregman Information While Thm. 5.3.1 is necessary to extend the results of Banerjee et al. (2005b) to rho-tau divergences with arbitrary representation functions, it is well-known that the arithmetic mean minimizes the expected Bregman divergence in the second argument with density functions as input ((~ ) = ~ ). (arith) (x) := (1)~ 0 (x) +~ 1 (x) = argmin ~ r (1)D KL [~ 0 : ~ r] +D KL [~ 1 : ~ r] : (5.90) As in Sec. 5.1.2.3, we obtain the weighted Jensen-Shannon divergence as the Bregman Information or Jensen diversity 1 (1) I f (f~ 0 ; ~ 1 g;f1;g). 1 (1) D () JS ~ 0 : ~ 1 = 1 (1) (1)D KL ~ 0 : ~ r +D KL [~ 1 : ~ r (5.91) = 1 (1) (1) f [~ 0 ] + f [~ 1 ] f [ (arith) ] : (5.92) 140 5.5.4.2 Geometric Averaging Path using Reverse KL Divergence To obtain the geometric averaging path from Eq. (5.5), Thm. 5.3.1 suggests using the (~ ) = log ~ representation, (~ ) = log ~ f() = expfg 1 (5.93) f ~ = Z f (~ (x) dx = Z log ~ (x)dx + Z ~ (x)dx 1 (5.94) Using Eq. (5.38), we nd that the rho-tau divergence recovers the kl divergence with the order of arguments reversed, D f [(~ a ) :(~ b )] =D KL [~ b : ~ a ]. Amari's -Divergence as Bregman Information Using the reverse kl divergence for the expected divergence minimization, we recover the result from Grosse et al. (2013) for the geometric annealing path, ~ (geo) (x) = ~ 0 (x) 1 ~ 1 (x) = argmin ~ r (1)D KL [~ r : ~ 0 ] +D KL [~ r : ~ 1 ]: (5.95) Note that optimization is over the rst argument of the kl divergence, or second argument of D f [(~ a ) :(~ b )]. The scaled Bregman Information recovers Amari's -divergence where, as in Cor. 5.3.2 but in contrast to Sec. 5.5.1, the mixing parameter sets the order of the -divergence 1 (1) I f ~ ; = 1 (1) (1)D KL [~ (geo) : ~ 0 ] +D KL [~ (geo) : ~ 1 ] (5.96) = 1 (1) (1) Z ~ 0 (x)dx + Z ~ 1 (x)dx Z ~ 0 (x) 1 ~ 1 (x) dx =D () A [~ 0 : ~ 1 ]: (5.97) Thus, we have derived Amari's -divergence as a Bregman Information or Jensen diversity using the geometric averaging path, or the representation (~ ) = log q ~ as q! 1. The order of the -divergence is set by the mixture parameter , which is analogous to an inverse temperature parameter in maximum entropy or lossy compression applications (Jaynes, 1957; Rose et al., 1990; Tishby et al., 1999; Alemi et al., 2016, 2018). By contrast, in Sec. 5.5.1, the-divergence was derived as a rho-tau Bregman divergence (! 0 or ! 1), where the order of the divergence was set by the representation function (~ ) = log q ~ . 
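The geometric-path results in Eq. (5.95)-(5.97) admit a simple numerical verification. In the sketch below (illustrative unnormalized densities on a grid, mixing weight β = 0.3), the geometric path attains a smaller weighted reverse KL divergence than the mixture path, and the scaled Bregman Information evaluated at the geometric path matches the closed form of Amari's α-divergence in Eq. (5.97).

```python
import numpy as np

def kl(a, b, dx):
    # extended KL divergence between unnormalized densities on a grid
    return np.sum(a * np.log(a / b) - a + b) * dx

x = np.linspace(-5, 5, 4001); dx = x[1] - x[0]
p0 = np.exp(-0.5 * x**2)                              # illustrative unnormalized densities
p1 = 0.7 * np.exp(-0.5 * (x - 1.5)**2 / 0.8**2)
beta = 0.3

geo = p0**(1 - beta) * p1**beta                       # geometric annealing path (Eq. 5.95)

# the geometric path minimizes the weighted reverse KL divergence (Grosse et al., 2013)
obj = lambda r: (1 - beta) * kl(r, p0, dx) + beta * kl(r, p1, dx)
assert obj((1 - beta) * p0 + beta * p1) > obj(geo)    # e.g. the mixture path is suboptimal

# the scaled Bregman Information equals Amari's alpha-divergence of order beta (Eq. 5.96-5.97)
bregman_info = obj(geo) / (beta * (1 - beta))
amari = (((1 - beta) * np.sum(p0) + beta * np.sum(p1) - np.sum(geo)) * dx
         / (beta * (1 - beta)))
assert np.isclose(bregman_info, amari)
```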
141 5.5.4.3 Geometric Averaging Path as an Exponential Family Parametric Interpretation of Amari -Divergence We can equivalently consider the geo- metric path as a one-dimensional exponential family as in Eq. (5.11) (Gr unwald (2007) Ch 17, (Brekelmans et al., 2020a,c)). Considering the normalized (geo) with a subtractive normalization constant, we have (geo) (x) = ~ 0 (x) exp log ~ 1 (x) ~ 0 (x) () ; (5.98) whereT (x) = log ~ 1 (x) ~ 0 (x) is the sucient statistic and () = logZ() is the log of the normalization constant Z() = R ~ 0 (x) 1 ~ 1 (x) dx. Noting that @ @ Z() = R ~ (x) log ~ 1 (x) ~ 0 (x) dx, we can use Z() to generate a parametric Bregman divergence with mixing weights as the arguments, D Z() [ a : b ] =Z( a )Z( b ) ( a b ) Z ~ b (x) log ~ 1 (x) ~ 0 (x) dx =D KL [~ b : ~ a ]: (5.99) Since this kl divergence over unnormalized densities appears in the minimization in Eq. (5.95), and the the minimizing argument ~ (geo) is within our exponential family, we can write the scaled Bregman Information or Amari -divergence from Eq. (5.97) using a parametric form, D () A [~ 0 : ~ 1 ] = 1 (1) (1)Z(0) +Z(1)Z() ; (5.100) which is a Jensen diversity for the convex function Z(). R enyi Divergence as Bregman Information Finally, we note that we can construct a Bregman Information using the parametric divergence generated by the log partition function (). As in Sec. 5.1.2.2, this divergence reduces to the kl divergence between normalized distributions D [ a : b ] =D KL [ b : a ]. As in Banerjee et al. (2005b), the arithmetic mean over arguments minimizes the expected divergence = (1) 0 + 1 = argmin r (1)D [0 : r ] +D [1 : r ]; (5.101) where we can identify with either the unnormalized ~ or normalized along the geometric averaging path. We can recognize the resulting Bregman Information as a R enyi divergence (R enyi, 1961; Van Er- ven and Harremos, 2014), which matches the result in Nielsen and Nock (2011) that the R enyi di- vergence between normalized distributions 0 ; 1 in the same exponential family is proportional 142 to a gap in Jensen's inequality (1) ( 0 ) + ( 1 ) ( ) for the log partition function. As in Van Erven and Harremos (2014) Thm. 30, we can write 1 (1) I f0; 1g;f1;g = 1 (1) (1)D KL [ 0 : (geo) ] +D KL [ 1 : (geo) ] = 1 (1) (1) (0) + (1) () (5.102) = 1 (1) (1) log Z ~ 0 (x)dx + log Z ~ 1 (x)dx log Z ~ 0 (x) 1 ~ 1 (x) dx : =D R 0 (x) : 1 (x) : where the R enyi divergence of order is dened as D R [ a : b ] = 1 (1) log R a (x) 1 b (x) dx 3 , using constant factors to induce limiting behavior of D KL [ a : b ] as ! 0 and D KL [ b : a ] as ! 1. 5.6 Discussion In this chapter, we have analyzed existing annealing paths (Neal, 2001; Grosse et al., 2013; Masrani et al., 2021) using the rho-tau Bregman divergence framework (Zhang, 2004, 2013; Naudts and Zhang, 2018). In particular, we have generalized the Bregman divergence `centroid' results of Banerjee et al. (2005b) to arbitrary monotonic embedding functions or quasi-arithmetic means. Using this perspective, the q-deformed logarithmic paths of Masrani et al. (2021) may be derived from an expected divergence minimization in either the-divergence or-divergence minimization. For the-divergence case, our denitions yield f and f which are strictly convex as a func- tion of ~ . This is in contrast to previous work (Zhang, 2004; Nielsen and Nock, 2009) where the decomposable generator f ( ~ ) = R f((~ (z)))dz is linear in ~ . 
We drew additional connections with the (;q) divergence of Zhang (2004, 2013), which we interpret in terms of an expected diver- gence minimization and a gap in Jensen's inequality for the multiplicative normalization constant of a parametric q-exponential family. It is interesting to notice that Amari's -divergence can be derived using either the geometric path, where the -divergence appears as a Bregman Information with its order set by the mixture parameter (Eq. (5.100)), or the q-path, where the -divergence appears as a rho-tau Bregman divergence with order set by the deformation parameter q (Eq. (5.63)). For the geometric path, we also constructed a Bregman divergence using the log partition function of the one-dimensional 3 Note that we have changed the order of arguments in order to match our dention of the -divergence (Zhang, 2004; Amari, 2007, 2016). 143 exponential family () := logZ() with natural parameter . In this case, the corresponding scaled Bregman Information recovers the R enyi divergence as the gap in Jensen's inequality for () (Nielsen and Nock, 2011). This derivation in terms of the mixing parameter is distinct from the approach of Wong and Zhang (2021), where the R enyi entropy or divergence with =q is the dual potential associated with the log partition function logZ q () of a q-exponential family via a generalized c-convex duality (Wong, 2018). Further understanding the relationship between these and q constructions and leveraging this in algorithmic applications such as `safe' or tempered variational inference (Gr unwald, 2012; Knoblauch et al., 2019) remains an interesting question for future work. We have also shown that the q-paths from Ch. 4 can be derived using the -divergence, which corresponds the Bregman divergence induced by the Tsallis entropy or -divergence as a convex generator (Naudts and Zhang (2018) Sec. 8.7, Belousov (2017)). Moving beyond the ubiquitous use of thekl divergence in machine learning, it would be interesting to further explore the implications of the identity dual representation for the-divergence in applications such as variational inference (Knoblauch et al., 2019), structured prediction (Blondel et al., 2020), continuous attention mecha- nisms (Martins et al., 2021), and regularized reinforcement learning (Geist et al., 2019; Brekelmans et al., 2022a). Finally, further exploration is needed to construct or learn annealing paths which are suitable to particular sampling or estimation problems. How can we (adaptively) choose paths based on features of a sampling problem? Syed et al. (2021) provide a intriguing example of optimizing the annealing path using a dierentiable measure of sample quality. We might consider learning the deformation function (u) for a -deformed logarithmic path, although care would be needed to ensure tractible density and gradient calculations. 144 Chapter 6 Your Policy Regularizer is Secretly an Adversary: Conjugate Duality in Reinforcement Learning 6.1 Introduction Regularization plays a crucial role in various settings across reinforcement learning (rl), such as trust-region methods (Peters et al., 2010; Schulman et al., 2015, 2017; Bas-Serrano et al., 2021), oine learning (Levine et al., 2020; Nachum et al., 2019a,b; Nachum and Dai, 2020), multi-task learning (Teh et al., 2017; Igl et al., 2020), and soft Q-learning or actor-critic methods (Fox et al., 2016; Nachum et al., 2017; Haarnoja et al., 2017, 2018; Grau-Moya et al., 2018). 
Various justi- cations have been given for policy regularization, such as improved optimization (Ahmed et al., 2019), connections with probabilistic inference (Levine, 2018; Kappen et al., 2012; Rawlik et al., 2013; Wang et al., 2021), and robustness to perturbations in the environmental rewards or dynamics (Derman et al., 2021; Eysenbach and Levine, 2021; Husain et al., 2021). In this work, we use convex duality to highlight that reward robustness naturally arises from policy regularization in rl. In particular, we interpret regularized reward maximization as a two- player game between the agent and an imagined adversary that modies the reward function. For a policy (ajs) regularized with a convex function () = E [ _ ()] and regularization strength 1=, we investigate statements of the form max (ajs) (1 )E () " 1 X t=0 t r(a t ;s t ) 1 _ (a t js t ) # = max (ajs) min r 0 (a;s)2R (1 )E () " 1 X t=0 t r 0 (a t ;s t ) # (6.1) where r 0 (a;s) indicates a modied reward chosen from an appropriate robust setR (see Fig. 6.1- 6.2). Eq. (6.1) suggests that an agent may translate uncertainty in its estimate of the reward 145 Figure 6.1: Robust setR (red region) of perturbed reward functions to which a stochastic policy generalizes, in the sense of Eq. (6.2). Red star indicates the worst-case perturbed reward r 0 = r r (Prop. 6.3.2) chosen by the adversary. The robust set also characterizes the set of reward perturbations r(a;s) that are feasible for the adversary, which diers based on the choice of regularization function, regularization strength , and reference distribution 0 (see Sec. 6.4.1, Fig. 6.2). We show the robust set for the optimal single-step policy with value estimates Q(a;s) = r(a;s) and kl divergence regularization to a uniform 0 , and = 1. Our robust set is larger, with qualitatively dierent shape compared to the robust set analyzed by Derman et al. (2021) (dotted lines, see Sec. 6.5.2). function into regularization of a learned policy, which is particularly relevant in applications such as inverserl (Ng et al., 2000; Arora and Doshi, 2021) or learning from human preferences (Christiano et al., 2017). This reward robustness further implies that regularized policies achieve a form of `zero-shot' generalization to new environments where the reward has been adversarially chosen. In particular, for any given (ajs) and a modied reward r 0 (a;s)2R within the corresponding robust set, we obtain the following performance guarantee E () " 1 X t=0 t r 0 (a t ;s t ) # E () " 1 X t=0 t r(a t ;s t ) 1 _ t # : (6.2) Eq. (6.2) states that the expected modied reward under (ajs), with r 0 (a;s)2R as in Fig. 6.1, will be greater than the value of the regularized objective with the original, unmodied reward. It is in this particular sense that we make claims about robustness and zero-shot generalization throughout the paper. Our analysis unies and extends recent work exploring similar interpretations (Ortega and Lee, 2014; Husain et al., 2021; Eysenbach and Levine, 2021), as summarized in Sec. 6.5 and Table 6.1. Our contributions include • A thorough analysis of the robustness associated with kl and -divergence policy regu- larization, which includes popular Shannon entropy regularization as a special case. Our derivations for the -divergence generalize the Tsallis entropy rl framework of Lee et al. (2019). 146 Ortega and Lee (2014) Eysenbach and Levine (2021) Husain et al. (2021) Derman et al. 
(2021) Ours Multi-Step Analysis 7 3 3 3 3 Worst-Case r(a;s) policy form policy form value form policy (via dual lp Eq. (6.11)) policy & value forms Robust Set 7 3 (see our Fig. 6.6) 7 3( exible specication) 3 Divergence Used KL ( = 1) Shannon entropy (Sec. 6.5.1) any convex derived from robust set any convex , -Div examples (a;s) or (ajs) Reg.? (ajs) (ajs) Both (ajs) Both Indierence 3 7 7 7 3 Path Consistency 7 7 7 7 3 Table 6.1: Comparison to related work. • We derive the worst-case reward perturbations r =rr 0 corresponding to any stochastic policy and a xed regularization scheme (Prop. 6.3.2). • For the optimal regularized policy in a given environment, we show that the corresponding worst-case reward perturbations match the advantage function for any -divergence. We relate this nding to the path consistency optimality condition, which has been used to construct learning objectives in (Nachum et al., 2017; Chow et al., 2018), and a game- theoretic indierence condition, which occurs at a Nash equilibrium between the agent and adversary (Ortega and Lee, 2014). • We visualize the setR of adversarially perturbed rewards against which a regularized policy is robust in Fig. 6.1-6.2, with details in Prop. 6.3.1. Our use of divergence instead of entropy regularization to analyze the robust setR claries several unexpected conclusions from previous work. In particular, similar plots in Eysenbach and Levine (2021) suggest that MaxEnt rl is not robust to the reward function of the training environment, and that increased regularization strength may hurt robustness. Our analysis in Sec. 6.5.1 and Brekelmans et al. (2022a) App. F establishes the expected, opposite results. • We perform experiments for a sequential grid-world task in Sec. 6.4 where, in contrast to previous work, we explicitly visualize the reward robustness and adversarial strategies resulting from our theory. We use the path consistency or indierence conditions to certify optimality of the policy. Finally, we point to the recent work Husain and Knoblauch (2022), in which many results from this chapter are applied to the generalized variational inference framework of Knoblauch et al. (2019) (see Section 1.2). In Section 6.3, we provide notation and basic tools for understanding how our adversarial interpretation for regularized rl translates to this setting. 6.2 Preliminaries In this section, we review linear programming (lp) formulations of discounted Markov Decision Processes (mdp) and extensions to convex policy regularization. 147 Notation For a nite setX , let R X denote the space of real-valued functions overX , with R X + indicating the restriction to non-negative functions. We use jXj to denote the probability simplex with dimension equal to the cardinality ofX . For ;q2 R X , we useh;qi = P x2X (x)q(x) to indicate the inner product in Euclidean space. 6.2.1 Convex Conjugate Function We begin by reviewing the convex conjugate function, also known as the Legendre-Fenchel trans- form, which will play a crucial role throughout our paper. For a convex function () which, in our context, has domain 2R X + , the conjugate function is dened via the optimization (r) = sup 2R X + ; r (); (6.3) where r2R X . The conjugate operation is an involution for proper, lower semi-continuous, convex (Boyd and Vandenberghe, 2004), so that ( ) = and is also convex. We can thus represent () via a conjugate optimization () = sup r2R X ; r (r): (6.4) Dierentiating with respect to the optimization variable in Eq. 
(6.3) or (6.4) suggests the optimality conditions r =r (r) r =r (): (6.5) Note that the above conditions also imply relationships of the form r = (r ) 1 (r). This dual correspondence between values of and r will form the basis of our adversarial interpretation in Sec. 6.3. 6.2.2 Divergence Functions We are interested in the conjugate duality associated with policy regularization, which is often ex- pressed using a statistical divergence () over a joint density(a;s) =(s)(ajs) (see Sec. 6.2.3). In particular, we consider the family of -divergences (Amari, 2016; Cichocki and Amari, 2010), which includes both the forward and reverse kl divergences as special cases. In the following, we 148 consider extended divergences that accept unnormalized density functions as input (Zhu and Ro- hwer, 1995) so that we may analyze function space dualities and evaluate Lagrangian relaxations without projection onto the probability simplex. KL Divergence The `forward' kl divergence to a reference policy 0 (ajs) is commonly used for policy regularization in rl. Extending the input domain to unnormalized measures, we write the divergence as 0 () =E (s) h D KL [ : 0 ] i = X s2S (s) X a2A (ajs) log (ajs) 0 (ajs) (ajs) + 0 (ajs) : (6.6) Using a uniform 0 (ajs) = 18 (a;s), we recover the Shannon entropy up to an additive constant. -Divergence The -divergence E (s) D [ 0 : ] over possibly unnormalized measures is de- ned as () 0 () = 1 (1) X s2S (s) (1) X a2A 0 (ajs) + X a2A (ajs) X a2A 0 (ajs) 1 (ajs) (6.7) Taking the limiting behavior, we recover the `forward' kl divergence D KL [ : 0 ] as ! 1 or the `reverse' kl divergence D KL [ 0 :] as ! 0. To provide intuition for the -divergence, we dene the deformed -logarithm as in Lee et al. (2019), which matches Tsallis's q-logarithm (Tsallis, 2009) for = 2q. Its inverse is the - exponential, with log (u) = 1 1 u 1 1 ; exp (u) = [1 + ( 1)u] 1 1 + : (6.8) where [] + = max(; 0) ensures fractional powers can be taken and suggests that exp (u) = 0 for u 1=(1). Using the -logarithm, we can rewrite the -divergence similarly to the kl divergence in Eq. (6.6) () 0 () = 1 X s2S (s) X a2A (ajs) log (ajs) 0 (ajs) (ajs) + 0 (ajs) : For a uniform reference 0 , the-divergence diers from the Tsallis entropy by only the 1= factor and an additive constant (see App. E.4). 149 6.2.3 Unregularized MDPs A discounted mdp is a tuplefS;A;P; 0 ;r; g consisting of a state spaceS, action spaceA, transi- tion dynamicsP (s 0 js;a) fors;s 0 2S,a2A, initial state distribution 0 (s)2 jSj in the probability simplex, and reward function r(a;s) :SA7!R. We also use a discount factor 2 (0; 1) (Puter- man (1994) Sec 6). We consider an agent that seeks to maximize the expected discounted reward by acting ac- cording to a decision policy (ajs)2 jAj for each s2S. The expected reward is calculated over trajectories () := 0 (s 0 ) Q (a t js t )P (s t+1 js t ;a t ), which begin from an initial s 0 0 (s) and evolve according to the policy (ajs) and mdp dynamics P (s 0 js;a) RL(r) := max (ajs) (1 )E () 1 X t=0 t r(s t ;a t ) : (6.9) We assume that the policy is stationary and Markovian, and thus independent of both the timestep and trajectory history. Linear Programming Formulation We will focus on a linear programming (lp) form for the objective in Eq. (6.9), which is common in the literature on convex duality. 
With optimization over the discounted state-action occupancy measure,(a;s) := (1 )E () P 1 t=0 t I(a t =a;s t =s) , we rewrite the objective as RL(r) := max ;r subject to (a;s) 0 8(a;s)2AS; (6.10) X a (a;s) = (1 ) 0 (s) + X a 0 ;s 0 P (sja 0 ;s 0 )(a 0 ;s 0 ) 8s2S: We refer to the constraints in the second line of Eq. (6.10) as the Bellman ow constraints, which force(a;s) to respect themdp dynamics. We denote the set of feasible asMR AS + . For nor- malized 0 (s) andP (sja 0 ;s 0 ), we show in App. E.1.2 that(a;s)2M implies(a;s) is normalized. It can be shown that any feasible(a;s)2M induces a stationary(ajs) =(a;s)=(s), where (s) := P a 0(a 0 ;s) and(ajs)2 jAj is normalized by denition. Conversely, any stationary policy (ajs) induces a unique state-action visitation distribution(a;s) (Syed et al. (2008), Feinberg and Shwartz (2012) Sec. 6.3). Along with the denition of (a;s) above, this result demonstrates the equivalence of the optimizations in Eq. (6.9) and Eq. (6.10). We will proceed with the lp notation 150 Divergence Conjugate Conjugate Expression Optimizing Argument ( r or r ) 1 D KL [ : 0 ] 1 0; (r) 1 P a 0 (ajs) expf r(a;s)g 1 0 (ajs) expf r(a;s)g 1 D KL [ : 0 ] 1 0; (r) 1 P a;s 0 (a;s) expf r(a;s)g 1 0 (a;s) expf r(a;s)g 1 D [ 0 :] 1 () 0; (r) 1 1 P a 0(ajs) exp r(a;s) r(s;) 1 1 + r(s;) 0 (ajs) exp (r(a;s) r (s;)) 1 D [ 0 :] 1 () 0; (r) 1 1 P a;s 0 (a;s) exp r(a;s) 1 1 0 (a;s) exp r(a;s) Table 6.2: Conjugate Function expressions forkl and-divergence regularization of either the pol- icy (ajs) or occupancy (a;s). See App. C.1.1-C.1.4 for derivations. The nal column refers to the optimizing argument in the denition of the conjugate function 1 (r), for example r (a;s) := argmax (a;s) h(a;s); r(a;s)i 1 0 (). Note that each conjugate expression for (ajs) regularization also contains an outer expectation over (s). from Eq. (6.10) and assume (s) is induced by (ajs) whenever the two appear together in an expression. Importantly, the ow constraints in Eq. (6.10) lead to a dual optimization which re ects the familiar Bellman equations (Bellman, 1957). To see this, we introduce Lagrange multipliersV 2R S for each ow constraint and(a;s)2R AS + for the nonnegativity constraints. Summing overs2S, and eliminating (a;s) by setting d=d(a;s) = 0 yields the dual lp RL (r) := min V; (1 ) 0 ;V subject to V (s) =r(a;s) + E s 0 a;s V (s 0 ) +(a;s) 8(a;s)2AS; (6.11) where we have used E s 0 a;s V (s 0 ) as shorthand for E P (s 0 ja;s) [V (s 0 )] and reindexed the transition tuple from (s 0 ;a 0 ;s) to (s;a;s 0 ) compared to Eq. (6.10). Note that the constraint applies for all (a;s)2AS and that (a;s) 0. By complementary slackness, we know that (a;s) = 0 for (a;s) such that (a;s)> 0. 6.2.4 Regularized MDPs We now consider regularizing the objective in Eq. (6.10) using a convex penalty function () with coecient 1=. We primarily focus on regularization using a conditional divergence 0 () := E (s)(ajs) [ _ ()] between the policy and a normalized reference distribution 0 (ajs), as in Sec. 6.2.2 and (Ortega and Braun, 2013; Fox et al., 2016; Haarnoja et al., 2017, 2018). We also use the notation 0 () = E (a;s) [ _ ()] to indicate regularization of the full state-action occupancy measure to a 151 normalized reference 0 (a;s), which appears, for example, in Relative Entropy Policy Search (reps) (Peters et al., 2010; Belousov and Peters, 2019). We dene the regularized objectiveRL ; (r), RL ; (r) := max 2M ;r 1 0 () (6.12) where 0 () contains an expectation under(a;s) as in Eq. (6.6)-(6.7). 
We can also derive a dual version of the regularized lp, by rst writing the Lagrangian relaxation of Eq. (6.12) max min V; (1 ) 0 ;V +h;r + E s 0 a;s V V +i 1 0 (): (6.13) Swapping the order of optimization under strong duality, we can recognize the maximization over (a;s) as a conjugate function 1 0 ; , as in Eq. (6.3), leading to a regularized dual optimization RL ; (r) = min V; (1 ) 0 ;V + 1 0 ; r + E s 0 a;s V V + (6.14) which involves optimization over dual variables V (s) only and is unconstrained, in contrast to Eq. (6.11). Dual objectives of this form appear in (Nachum and Dai, 2020; Belousov and Peters, 2019; Bas-Serrano et al., 2021; Neu et al., 2017). We emphasize the need to include the Lagrange multiplier (a;s), with (a;s) > 0 when the optimal policy has (ajs) = 0, since an important motivation for-divergence regularization is to encourage sparsity in the policy (see Eq. (6.8), Lee et al. (2018, 2019); Chow et al. (2018)). Soft Value Aggregation In iterative algorithms such as (regularized) modied policy iteration (Puterman and Shin, 1978; Scherrer et al., 2015), it is useful to consider the regularized Bellman optimality operator (Geist et al., 2019). For given estimates of the state-action value Q(a;s) := r(a;s) + E s 0 a;s V (s 0 ) , the operatorT 0 ; updates V (s) as V (s) 1 0 ; (Q) = max 2 jAj ;Q 1 0 () (6.15) Note that this conjugate optimization is performed in each state s2S and explicitly constrains each(ajs) to be normalized. Although we proceed with the notation of Eq. (6.12) and Eq. (6.14), our later developments are compatible with the `soft-value aggregation' perspective above. See App. E.2 for detailed discussion. 152 6.3 Adversarial Interpretation In this section, we interpret regularization as implicitly providing robustness to adversarial pertur- bations of the reward function. To derive our adversarial interpretation, recall from Eq. (6.4) that conjugate duality yields an alternative representation of the regularizer 1 () 0 () = max r2R AS h; ri 1 () 0 ; (r): (6.16) Using this conjugate optimization to expand the regularization term in the primal objective of Eq. (6.12), RL ; (r) = max 2M min r h;r ri + 1 () 0 ; r : (6.17) We interpret Eq. (6.17) as a two-player minimax game between an agent and an implicit adversary, where the agent chooses an occupancy measure(a;s)2M or its corresponding policy(ajs), and the adversary chooses a reward perturbation r(a;s) subject to the convex conjugate 1 () 0 ; (r) as a penalty function (Ortega and Lee, 2014). To understand the limitations this penalty imposes on the adversary, we transform the op- timization over r in Eq. (6.17) to a constrained optimization in Sec. 6.3.1. This allows us to characterize the feasible set of reward perturbations available to the adversary or, equivalently, the set of modied rewards r 0 (a;s)2R to which a particular stochastic policy is robust. In Sec. 6.3.2 and 6.3.4, we interpret the worst-case adversarial perturbations corresponding to an arbitrary stochastic policy and the optimal policy, respectively. 6.3.1 Robust Set of Modied Rewards In order to link our adversarial interpretation to robustness and zero-shot generalization as in Eq. (6.1)-(6.2), we characterize the feasible set of reward perturbations in the following proposition. We state our proposition for policy regularization, and discuss dierences for (a;s) regularization in App. E.3.2. Proposition 6.3.1. Assume a normalized policy (ajs) for the agent is given, with P a (ajs) = 18s2S. 
Under -divergence policy regularization to a normalized reference 0 (ajs), the optimiza- tion over r(a;s) in Eq. (6.17) can be written in the following constrained form min r2R ;r r where R := r2R AS () 0 ; (r) 0 ; (6.18) 153 We refer toR as the feasible set of reward perturbations available to the adversary. This translates to a robust setR of modied rewards r 0 (a;s) = r(a;s) r(a;s) for the given policy. These sets depend on the -divergence and regularization strength via the conjugate function. For kl divergence regularization, the constraint is X a2A 0 (ajs) expf r(a;s)g 1: (6.19) See App. E.3.1 for proof, and Table 6.2 for the convex conjugate function 1 () 0 ; (r) associated with various regularization schemes. The proof proceeds by evaluating the conjugate function at the minimizing argument r in Eq. (6.17) (see Sec. 6.3.2), with () 0 ; (r ) = 08 for normalized (ajs) and 0 (ajs). The constraint then follows from the fact that () 0 ; (r ) is convex and increasing in r. We visualize the robust set for a two-dimensional action space in Fig. 6.2, with additional discussion in Sec. 6.4.1. As in Eq. (6.2), we can provide `zero-shot' performance guarantees using this set of modied rewards. For any perturbed reward in the robust set r 0 2R , we haveh;r 0 ih;ri 1 () 0 (), so that the policy achieves an expected modied reward which is at least as large as the regularized objective. However, notice that this form of robustness is sensitive to the exact value of the regularized objective function. Although entropy regularization and divergence regularization with a uniform reference induce the same optimal(a;s), we highlight crucial dierences in their reward robustness interpretations in Sec. 6.5.1. 6.3.2 Worst-Case Perturbations: Policy Form From the feasible set in Prop. 6.3.1, how should the adversary select its reward perturbations? In the following proposition, we use the optimality conditions in Eq. (6.5) to solve for the worst-case reward perturbations r (a;s) which minimize Eq. (6.17) for an xed but arbitrary stochastic policy (ajs). Proposition 6.3.2. For a given policy (ajs) or state-action occupancy (a;s), the worst-case adversarial reward perturbations r or r associated with a convex function () and regular- ization strength 1= are r =r 1 (): (6.20) 154 See App. E.1.1 for proof. We now provide example closed form expressions for the worst-case reward perturbations under common regularization schemes. We emphasize that the same stochas- tic policy (ajs) or joint occupancy measure (a;s) can be associated with dierent adversarial perturbations depending on the choice of -divergence and strength . 1 KL Divergence For kl divergence policy regularization, the worst-case perturbations are r (a;s) = 1 log (ajs) 0 (ajs) ; (6.21) which corresponds to the pointwise regularization r (a;s) = _ 0 ((ajs) for each state-action pair, with 0 () =E (a;s) [ _ 0 ((ajs)]. See App. C.1.1. We show an analogous result in App. C.1.2 for state-action occupancy regularization D KL [ : 0 ], where r (a;s) = 1 log (a;s) 0 (a;s) = _ 0 ((a;s)). -Divergence Forkl divergence regularization, the worst-case reward perturbations had a sim- ilar expression for conditional and joint regularization. However, we observe notable dierences for the -divergence in general. For policy regularization to a reference 0 , r (a;s) = 1 log (ajs) 0 (ajs) + r (s;); (6.22) where we dene r (s;) as r (s;) := 1 1 X a2A 0 (ajs) X a2A 0 (ajs) 1 (ajs) ! : (6.23) As we discuss in App. 
\alpha-Divergence
For KL divergence regularization, the worst-case reward perturbations had a similar expression for conditional and joint regularization. However, we observe notable differences for the \alpha-divergence in general. For policy regularization to a reference \pi_0,

\Delta r_{*}(a,s) \;=\; \frac{1}{\beta} \log_{\alpha} \frac{\pi(a|s)}{\pi_0(a|s)} \;+\; \Delta r_{\lambda}(s;\alpha),   (6.22)

where we define \Delta r_{\lambda}(s;\alpha) as

\Delta r_{\lambda}(s;\alpha) \;:=\; \frac{1}{\beta}\,\frac{1}{\alpha} \left( \sum_{a \in A} \pi_0(a|s) \;-\; \sum_{a \in A} \pi_0(a|s)^{1-\alpha}\, \pi(a|s)^{\alpha} \right).   (6.23)

As we discuss in App. C.1.3, \Delta r_{\lambda}(s;\alpha) plays the role of a normalization constant for the optimizing argument \Delta r_{*}(a|s) in the definition of \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\beta}(\Delta r) (see Eq. (6.3), Table 6.2). This term arises from differentiating \Omega^{(\alpha)}_{\pi_0}(\mu) with respect to \mu(a,s), rather than from an explicit constraint. Assuming the given \pi(a|s) and reference \pi_0(a|s) are normalized, note that \Delta r_{\lambda}(s;\alpha) = \frac{1}{\beta}(1-\alpha)\, D_{\alpha}[\pi_0 : \pi]. With normalization, we also observe that \Delta r_{\lambda}(s;\alpha) = 0 for KL divergence regularization (\alpha = 1), which confirms that Eq. (6.21) is a special case of Eq. (6.22).

For any given state-action occupancy measure \mu(a,s) and joint \alpha-divergence regularization to a reference \mu_0(a,s), the worst-case perturbations become

\Delta r_{*}(a,s) \;=\; \frac{1}{\beta} \log_{\alpha} \frac{\mu(a,s)}{\mu_0(a,s)},   (6.24)

with detailed derivations in App. C.1.4. In contrast to Eq. (6.22), this expression lacks an explicit normalization constant, as this constraint is enforced by the Lagrange multipliers V(s) and the condition \mu(a,s) \in \mathcal{M} (App. E.1.2).
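The following sketch evaluates the \alpha-divergence perturbations in Eq. (6.22)-(6.23) for a single state and numerically confirms that they reduce to the KL expression in Eq. (6.21) as \alpha \to 1, with the normalization term \Delta r_{\lambda}(s;\alpha) vanishing in that limit for normalized \pi and \pi_0. The \alpha-logarithm convention used below, \log_\alpha(u) = (u^{\alpha-1} - 1)/(\alpha - 1), is an assumption of this sketch; only the limiting behaviour is being checked.

```python
import numpy as np

def log_alpha(u, alpha):
    """alpha-logarithm; recovers the natural log as alpha -> 1 (assumed convention)."""
    if abs(alpha - 1.0) < 1e-12:
        return np.log(u)
    return (u ** (alpha - 1.0) - 1.0) / (alpha - 1.0)

def dr_alpha(pi, prior, beta, alpha):
    """Worst-case perturbations under alpha-divergence policy regularization (Eq. 6.22-6.23)."""
    dr_lam = (np.sum(prior) - np.sum(prior ** (1 - alpha) * pi ** alpha)) / (beta * alpha)
    return log_alpha(pi / prior, alpha) / beta + dr_lam, dr_lam

beta = 2.0
pi = np.array([0.6, 0.3, 0.1])
prior = np.array([0.5, 0.3, 0.2])

dr_kl = np.log(pi / prior) / beta                      # KL case, Eq. (6.21)
for alpha in [0.5, 0.9, 0.999, 1.001, 2.0]:
    dr, dr_lam = dr_alpha(pi, prior, beta, alpha)
    print(alpha, np.round(dr, 4), round(dr_lam, 6))

# As alpha -> 1, dr_lambda -> 0 and the perturbations approach the KL expression
dr_near_kl, _ = dr_alpha(pi, prior, beta, 1.0 + 1e-6)
assert np.allclose(dr_near_kl, dr_kl, atol=1e-4)
```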
We consider the optimal policy in an mdp with -divergence policy regu- larization 1 () 0 (), which is derived via similar derivations as Lee et al. (2019) or by eliminating (a;s) in Eq. (6.13). (ajs) = 0 (ajs) exp Q (a;s)V (s) +(a;s) r (s;) : (6.27) We prove Prop. 6.3.4 by plugging this optimal policy into the worst-case reward perturbations from Eq. (6.22), r (a;s) = 1 log (ajs) 0 (ajs) + r (s;). We can also use Eq. (6.26) to verify (ajs) is normalized, since r ensures normalization for the policy corresponding to r . In App. E.2.1, we also show Q (s;) = V (s) + r (s;), where Q (s;) is a Lagrange multiplier enforcing normalization in Eq. (6.15). Path Consistency Condition The equivalence between r (a;s) and r V (a;s) at optimality matches the path consistency conditions from (Nachum et al., 2017; Chow et al., 2018) and suggests 157 generalizations to general-divergence regularization. Indeed, combining Eq. (6.22) and (6.26) and rearranging, r(a;s) + E s 0 a;s V (s 0 ) 1 log (ajs) 0 (ajs) r (s;) =V (s) (a;s) (6.28) for all s2S and a2A. This is a natural result, since path consistency is obtained using the kkt optimality condition involving the gradient with respect to of the Lagrangian relaxation in Eq. (6.13). Similarly, we have seen in Prop. 6.3.2 that r =r 1 () 0 (). See App. E.1.4. Path consistency conditions were previously derived for the Shannon entropy (Nachum et al., 2017) and Tsallis entropy with = 2 (Chow et al., 2018), but our expression in Eq. (6.28) provides a generalization to -divergences with arbitrary reference policies. See Brekelmans et al. (2022a) App E2 for additional comparison with Chow et al. (2018). Indierence Condition As Ortega and Lee (2014) discuss for the single step case, the saddle point of the minmax optimization in Eq. (6.17) re ects an indierence condition which is a well- known property of Nash equilibria in game theory (Osborne and Rubinstein, 1994). Consider Q(a;s) = r(a;s) + E s 0 a;s V (s 0 ) to be the agent's estimated payo for each action in a particular state. For the optimal policy, value, and worst-case reward perturbations, Eq. (6.28) shows that the pointwise modied reward Q (a;s) r (a;s) = V (s) is equal to a constant. 2 Against the optimal strategy of the adversary, the agent becomes indierent between the actions in its mixed strategy. The value or conjugate function V (s) = 1 0 ; (Q ) (see App. E.2) is known as the certainty equivalent (Fishburn, 1988; Ortega and Braun, 2013), which measures the total expected utility for an agent starting in state s, in a two-player game against an adversary dened by the regularizer with strength . We empirically conrm the indierence condition in Fig. 6.3 and 6.5. Adversarial Interpretation of Bayesian Inference Before proceeding to our main results, we recall from Eq. (1.17)-1.28 that the standard Bayesian posterior may be viewed as the solution to similar conjugate optimization as Eq. (6.12). As shown by Husain and Knoblauch (2022), we many results in this chapter may be translated to the setting of generalized variational inference (Knoblauch et al., 2019). As an example, we describe the appropriate choices for the standard Bayesian posterior and resulting Evidence Lower Bound (elbo) below. 2 This holds for actions with (ajs) > 0 and (a;s) = 0. Note, we treat Q(a;s) as the reward in the sequential case. 
Adversarial Interpretation of Bayesian Inference
Before proceeding to our main results, we recall from Eq. (1.17)-(1.28) that the standard Bayesian posterior may be viewed as the solution to a similar conjugate optimization as Eq. (6.12). As shown by Husain and Knoblauch (2022), many results in this chapter may be translated to the setting of generalized variational inference (Knoblauch et al., 2019). As an example, we describe the appropriate choices for the standard Bayesian posterior and resulting Evidence Lower Bound (ELBO) below.

Figure 6.2: Robust Set (red region) of perturbed reward functions to which a stochastic policy generalizes, in the sense that the policy is guaranteed to achieve an expected modified reward greater than or equal to the value of the regularized objective (Eq. (6.2)). The robust set characterizes the perturbed rewards which are feasible for the adversary. Red stars indicate the worst-case perturbed reward r' = r - \Delta r_{*} (Prop. 6.3.2). We show robust sets for the optimal \pi_{*}(a|s) with fixed Q_{*}(a,s) = r(a,s) values (blue star), where the optimal policy differs based on the regularization parameters \alpha, \beta, \pi_0 (see Eq. (6.27)). The robust set is more restricted with decreasing regularization strength (increasing \beta), implying decreased generalization. Importantly, the slope of the robust set boundary can be linked to the action probabilities under the policy (see Sec. 6.4.1). [Panels (a)-(c) and (f)-(h): D_{KL} with \beta \in \{0.1, 1, 10\}; panels (d), (i): \alpha = -1, \beta = 10; panels (e), (j): \alpha = 3, \beta = 10. Axes show the perturbed rewards r'(a_1, s) and r'(a_2, s).]

Using r(a,s) = \ell(x,z) = \log p_{\theta}(x|z) and quantifying uncertainty relative to the prior \pi_0(z) using a divergence functional \Omega_{\pi_0}(\cdot) = D[\,\cdot\, \| \pi_0], we have the following conjugate optimization and adversarial interpretation,

\mathcal{L}_{\mathrm{VI}}(\ell, \Omega_{\pi_0}, \mathcal{Q}) \;=\; \max_{q_{\phi}(z|x) \in \mathcal{Q}} \; \int q_{\phi}(z|x) \log p_{\theta}(x|z)\, dz \;-\; \frac{1}{\beta}\, D\big[ q_{\phi}(z|x) \,\|\, \pi_0(z) \big]   (6.29)

\;=\; \max_{q_{\phi}(z|x) \in \mathcal{Q}} \; \min_{\Delta \ell(x,z)} \; \int q_{\phi}(z|x) \big( \log p_{\theta}(x|z) - \Delta \ell(x,z) \big)\, dz \;+\; \frac{1}{\beta}\, \Omega^{*}_{\pi_0,\beta}(\Delta \ell).   (6.30)

For KL divergence regularization \Omega_{\pi_0}(\cdot) = D_{\mathrm{KL}}[\,\cdot\, \| \pi_0] and \ell(x,z) = \log p_{\theta}(x|z), we obtain the true Bayesian posterior as the maximizing argument in Eq. (6.29) for \beta = 1. Solutions of `\beta-VAE' or rate-distortion optimizations are obtained for \beta \neq 1 (Alemi et al., 2018). In this interpretation, regularizing the variational distribution q_{\phi}(z|x) to match the prior \pi_0(z) corresponds to playing a minimax game against an implicit adversary that perturbs the conditional likelihood. These perturbations are chosen to lower the conditional likelihood or raise the reconstruction loss achieved by q_{\phi}(z|x). See Husain and Knoblauch (2022) for detailed discussion.
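As a minimal illustration of Eq. (6.29), the sketch below evaluates the \beta-weighted ELBO for a conjugate Gaussian model, where the posterior is available in closed form: at \beta = 1 the bound is tight for the exact posterior, while for \beta \neq 1 a tempered posterior achieves a higher value. The model and parameter values are assumptions of this sketch rather than any experiment in this thesis.

```python
import numpy as np

def beta_elbo(m, s2, x, sigma2, beta):
    """E_q[log p(x|z)] - (1/beta) KL(q || p) for q = N(m, s2), prior p(z) = N(0, 1),
    and likelihood p(x|z) = N(z, sigma2)."""
    exp_loglik = -0.5 * np.log(2 * np.pi * sigma2) - ((x - m) ** 2 + s2) / (2 * sigma2)
    kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return exp_loglik - kl / beta

x, sigma2 = 1.5, 0.5

# Exact posterior for this conjugate model
post_var = sigma2 / (1.0 + sigma2)
post_mean = x / (1.0 + sigma2)

# At beta = 1 the bound is tight: it equals log p(x) = log N(x; 0, 1 + sigma2)
log_marginal = -0.5 * np.log(2 * np.pi * (1.0 + sigma2)) - x ** 2 / (2 * (1.0 + sigma2))
print(beta_elbo(post_mean, post_var, x, sigma2, beta=1.0), log_marginal)

# For beta != 1, the exact posterior is no longer the maximizer; the tempered
# posterior with precision 1 + beta / sigma2 does better
beta = 0.5
temp_var = 1.0 / (1.0 + beta / sigma2)
temp_mean = (beta * x / sigma2) * temp_var
print(beta_elbo(post_mean, post_var, x, sigma2, beta),
      beta_elbo(temp_mean, temp_var, x, sigma2, beta))
```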
159 (i) = 1:0 a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 1.5 2.0 r(a, s) Environment r(a, s) a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 Policy (a|s) a 1 a 2 a 3 a 4 a 5 a 6 -1.0 -0.5 0 0.5 1.0 r (a, s) Perturbation r (a, s) a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 1.5 2.0 r 0 (a, s) Perturbed r 0 = r r (ii) = 10 a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 1.5 2.0 Q(a, s) Agent Q(a, s) Values a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 Policy (a|s) a 1 a 2 a 3 a 4 a 5 a 6 -1.0 -0.5 0 0.5 1.0 r (a, s) Perturbation r (a, s) a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 1.5 2.0 r 0 (a, s) Perturbed r 0 = r r (a) Optimal Policy (i) = 1:0 a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 1.5 2.0 r(a, s) Environment r(a, s) a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 Policy (a|s) a 1 a 2 a 3 a 4 a 5 a 6 -1.0 -0.5 0 0.5 1.0 r (a, s) Perturbation r (a, s) a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 1.5 2.0 r 0 (a, s) Perturbed r 0 = r r (ii) = 10 a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 1.5 2.0 Q(a, s) Agent Q(a, s) Values a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 Policy (a|s) a 1 a 2 a 3 a 4 a 5 a 6 -1.0 -0.5 0 0.5 1.0 r (a, s) Perturbation r (a, s) a 1 a 2 a 3 a 4 a 5 a 6 0.0 0.5 1.0 1.5 2.0 r 0 (a, s) Perturbed r 0 = r r (b) Suboptimal Policy Figure 6.3: Single-Step Reward Perturbations forkl regularization to uniform reference policy 0 (ajs). Q-values in leftmost columns are used for each in columns 2-4. We report the worst-case r (a;s) from Eq. (6.22), so that negative values correspond to reward decreases. (a) Optimal policy (Q (a;s) = r(a;s)), where the perturbed reward r 0 (a;s) = c8a re ects the indierence condition. (b) Suboptimal policy where indierence does not hold. In all cases, actions with high Q(a;s) are robust to reward decreases. 6.4.1 Visualizing the Robust Set In Fig. 6.2, we visualize the robust set of perturbed rewards for the optimal policy in a two- dimensional action space for thekl or-divergence, various, and a uniform or non-uniform prior policy 0 . Since the optimal policy can be easily calculated in the single-step case, we consider xed Q (a;s) = r(a;s) =f1:1; 0:8g and show the robustness of the optimal (ajs), which diers based on the choice of regularization scheme using Eq. (6.27). We determine the feasible set of r using the constraint in Prop. 6.3.1 (see App. E.3.3 for details), and plot the modied reward r 0 (a;s) =Q (a;s) r (a;s) for each action. Inspecting the constraint for the adversary in Eq. (6.19), note that both reward increases r(a;s) < 0 and reward decreases r(a;s) > 0 contribute non-negative terms at each action, which either up- or down-weight the reference policy 0 (ajs). The constraint on their summation forces the adversary to trade o between perturbations of dierent actions in a particular state. Further, since the constraints in Prop. 6.3.1 integrate over the action space, the rewards for all actions in a particular state must be perturbed together. While it is clear that increasing the reward in both actions preserves the inequality in Eq. (6.2), Fig. 6.2 also includes regions where one reward decreases. For high regularization strength ( = 0:1), we observe that the boundary of the feasible set is nearly linear, with the slope 0 (a 1 js) 0 (a 2 js) based on the ratio of action probabilities in a policy that matches the prior. The boundary steepens for lower regularization strength. We can use the indierence condition to provide further geometric insight. First, drawing a line from the origin 160 with slope 1 will intersect the feasible set at the worst-case modied reward (red star) in each panel, with r 0 (a 1 ;s) = r 0 (a 2 ;s). 
Inspecting the constraint for the adversary in Eq. (6.19), note that both reward increases (\Delta r(a,s) < 0) and reward decreases (\Delta r(a,s) > 0) contribute non-negative terms for each action, which either down- or up-weight the reference policy \pi_0(a|s). The constraint on their summation forces the adversary to trade off between perturbations of different actions in a particular state. Further, since the constraints in Prop. 6.3.1 integrate over the action space, the rewards for all actions in a particular state must be perturbed together. While it is clear that increasing the reward in both actions preserves the inequality in Eq. (6.2), Fig. 6.2 also includes regions where one reward decreases.

For high regularization strength (\beta = 0.1), we observe that the boundary of the feasible set is nearly linear, with slope -\pi_0(a_1|s)/\pi_0(a_2|s) based on the ratio of action probabilities in a policy that matches the prior. The boundary steepens for lower regularization strength. We can use the indifference condition to provide further geometric insight. First, drawing a line from the origin with slope 1 will intersect the feasible set at the worst-case modified reward (red star) in each panel, with r'(a_1,s) = r'(a_2,s). At this point, the slope of the tangent line yields the ratio of action probabilities in the regularized policy, as we saw for the \beta = 0.1 case. With decreasing regularization as \beta \to \infty, the slope approaches 0 or -\infty for a nearly deterministic policy and a rectangular feasible region.

Finally, we show the \alpha-divergence robust set with \alpha \in \{-1, 3\} and \beta = 10 in Fig. 6.2 (d)-(e) and (i)-(j), with further visualizations in Fig. 6.7 and 6.8 at the end of this chapter. Compared to the KL divergence, we find a wider robust set boundary for \alpha = -1. For \alpha = 3 and \beta = 10, the boundary is more strict and we observe much smaller reward perturbations as the optimal policy becomes deterministic (\pi_{*}(a_1|s) = 1) for both reference distributions. However, in contrast to the unregularized deterministic policy, the reward perturbations \Delta r_{*}(a,s) \neq 0 are nonzero. We provide a worked example in App. E.5, and note that indifference does not hold in this case, r'(a_1,s) \neq r'(a_2,s), due to the Lagrange multiplier \lambda(a_2,s) > 0.

6.4.2 Visualizing the Worst-Case Reward Perturbations

In this section, we consider KL divergence regularization to a uniform reference policy, which is equivalent to Shannon entropy regularization but more appropriate for analysis, as we discuss in Sec. 6.5.1.

Single-Step Case
In Fig. 6.3, we plot the negative worst-case reward perturbations -\Delta r_{*}(a,s) and the modified reward for a single-step decision-making case. For the optimal policy in Fig. 6.3(a), the perturbations match the advantage function as in Eq. (6.26), and the perturbed reward for all actions matches the value function V_{*}(s). While we have shown in Sec. 6.3.2 that any stochastic policy may be given an adversarial interpretation, we see in Fig. 6.3(b) that the indifference condition does not hold for suboptimal policies. The nearly deterministic policy in Fig. 6.3(a)(ii) also provides intuition for the unregularized case as \beta \to \infty. Although we saw in Sec. 6.3.3 that \Delta r_{*}(a,s) = 0 \;\forall a in the unregularized case, Eq. (6.11) and (6.26) suggest that \lambda_{*}(a,s) = V_{*}(s) - Q_{*}(a,s) plays a similar role to the (negative) reward perturbations in Fig. 6.3(a)(ii), with \lambda_{*}(a_1,s) = 0 and \lambda_{*}(a,s) > 0 for all other actions.

Sequential Setting
In Fig. 6.4(a), we consider a grid world where the agent receives +5 for picking up the reward pill, -1 for stepping in water, and zero reward otherwise. We train an agent using tabular Q-learning and a discount factor \gamma = 0.99.
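A heavily simplified stand-in for this training procedure is sketched below: soft (KL-regularized) value iteration on a tiny randomly generated MDP, using the regularized Bellman operator of Eq. (6.15) in its KL/logsumexp form. The environment, \beta, and \gamma below are illustrative and do not correspond to the grid world itself.

```python
import numpy as np

# Minimal soft (KL-regularized) value iteration on a toy 3-state, 2-action MDP.
n_states, n_actions, gamma, beta = 3, 2, 0.99, 1.0
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
r = rng.normal(size=(n_actions, n_states))                        # r[a, s]
prior = np.full((n_actions, n_states), 1.0 / n_actions)           # uniform pi_0(a|s)

V = np.zeros(n_states)
for _ in range(5000):
    Q = r + gamma * np.einsum("ast,t->as", P, V)                  # Q(a,s) = r + gamma E[V(s')]
    # Regularized Bellman optimality operator (Eq. 6.15), KL case = soft maximum
    V_new = np.log(np.sum(prior * np.exp(beta * Q), axis=0)) / beta
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

Q = r + gamma * np.einsum("ast,t->as", P, V)
pi_star = prior * np.exp(beta * (Q - V))       # optimal regularized policy
dr_star = np.log(pi_star / prior) / beta       # worst-case perturbations (Eq. 6.21)

soft_backup = np.log(np.sum(prior * np.exp(beta * Q), axis=0)) / beta
print(np.max(np.abs(soft_backup - V)))         # small Bellman residual at convergence
print(pi_star.sum(axis=0))                     # each column sums to (approximately) 1
print(np.allclose(Q - dr_star, V))             # indifference: Q* - dr* = V* in every state
```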
Figure 6.4: Grid-World Reward Perturbations. (a) Sequential task. (b)-(d) Policies trained with Shannon entropy regularization of different strengths. Action probabilities are indicated via relative arrow lengths; the goal state is gray and without annotations. Colors indicate the worst-case adversarial reward perturbations \Delta r_{*}(a,s) = \frac{1}{\beta}\log\frac{\pi_{*}(a|s)}{\pi_0(a|s)} for each state and action (up, down, left, right) against which the policy is robust. Red (positive \Delta r_{*}(a,s)) implies that the policy is robust to reward decreases (up to the value shown) imposed by the adversary. These decreases are balanced by adversarial reward increases (blue) for other actions in the same state. We confirm the optimality of each policy using path consistency in Fig. 6.5. [Panels: (a) Environment (uniform \nu_0(s), r(a,s) = -1 water, r(a,s) = +5 goal); (b) \beta = 0.2 (high regularization); (c) \beta = 1; (d) \beta = 10 (low regularization).]

We visualize the worst-case reward perturbations \Delta r_{*}(a,s) = \frac{1}{\beta}\log\frac{\pi_{*}(a|s)}{\pi_0(a|s)} for each state-action pair, for policies trained with various regularization strengths, in Fig. 6.4(b)-(d). While it is well known that there exists a unique optimal policy for a given regularized MDP, our results additionally display the adversarial strategies and resulting Nash equilibria which can be associated with a regularization scheme specified by \Omega, \pi_0, \alpha, and \beta in a given MDP. Each policy implicitly hedges against an adversary that perturbs the rewards according to the values and colormap shown. For example, inspecting the state to the left of the goal state in Fig. 6.4(b)-(c), we see that the adversary reduces the immediate reward for moving right (in red, \Delta r_{*} > 0). Simultaneously, the adversary raises the reward for moving up or down towards the water (in blue). This is in line with the constraints on the feasible set, which imply that the adversary must balance reward decreases with reward increases in each state. In Fig. 6.5, we certify the optimality of each policy using the path consistency conditions, which also confirms that the adversarial perturbations have rendered the agent indifferent across actions in each state. Although we observe that the agent with high regularization in Fig. 6.4(b) is robust to a strong adversary, the value of the regularized objective is also lower in this case.
As expected, lower regularization strength reduces robustness to negative reward perturbations. With low regularization in Fig. 6.4(d), the behavior of the agent barely deviates from the deterministic policy in the face of the weaker adversary.

Figure 6.5: Confirming Optimality of the Policy. We show the perturbed rewards r'(a,s) = Q(a,s) - \Delta r(a,s) for policies trained with KL divergence regularization. The indifference condition holds in all cases, with r'(a,s) = c(s) for each state-action pair, which confirms that the policy is optimal (Ortega and Lee, 2014; Nachum et al., 2017). In particular, for KL regularization, c(s) = V_{*}(s), and the shading and value in each grid cell reflect the value function, or expected discounted reward that the agent expects to receive by following its policy from the given state. [Panels: (a) Environment (r(a,s) = -1 water, r(a,s) = +5 goal); (b) \beta = 0.2 (high regularization); (c) \beta = 1; (d) \beta = 10 (low regularization).]

6.5 Discussion

Our analysis in Sec. 6.3 unifies and extends several previous works which analyze the reward robustness of regularized policies (Ortega and Lee, 2014; Eysenbach and Levine, 2021; Husain et al., 2021). We summarize our contributions with respect to previous work in Table 6.1, with additional discussion below.

6.5.1 Comparison with Entropy Regularization

As argued in Sec. 6.3, the worst-case reward perturbations preserve the value of the regularized objective function. Thus, we should expect our robustness conclusions to depend on the exact form of the regularizer. When regularizing with the Tsallis or Shannon (\alpha = 1) entropy, the worst-case reward perturbations become

\Delta r_{*}(a,s) \;=\; \frac{1}{\beta}\log_{\alpha}\pi(a|s) \;+\; \frac{1}{\beta}\,\frac{1}{\alpha}\left( 1 - \sum_{a \in A} \pi(a|s)^{\alpha} \right).   (6.31)

In Brekelmans et al. (2022a) App. F.2, we show that for 0 < \alpha \leq 1, these perturbations cannot decrease the reward, with \Delta r_{*}(a,s) \leq 0 and r'(a,s) \geq r(a,s).
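To make the contrast concrete, the sketch below compares the worst-case perturbations under Shannon entropy regularization (\alpha = 1 in Eq. (6.31), which are never positive) with those under KL divergence regularization to a uniform reference (Eq. (6.21), which decrease the reward for preferred actions and increase it for others), for the same softmax policy. The values are illustrative.

```python
import numpy as np

beta = 2.0
q = np.array([1.1, 0.8, 0.5, 0.2])
n = len(q)

# The optimal policy is the same softmax under entropy regularization and under
# KL regularization to a uniform reference pi_0(a|s) = 1/|A|
pi = np.exp(beta * q)
pi /= pi.sum()

# Shannon entropy regularization (alpha = 1): dr*(a,s) = (1/beta) log pi(a|s) <= 0,
# so the adversary can only increase rewards
dr_entropy = np.log(pi) / beta
print(np.round(dr_entropy, 3), (dr_entropy <= 0).all())

# KL divergence to a uniform reference: dr*(a,s) = (1/beta) log(pi(a|s) / (1/|A|)),
# positive for above-average actions (robust to reward decreases), negative otherwise
dr_kl_uniform = np.log(pi * n) / beta
print(np.round(dr_kl_uniform, 3), (dr_kl_uniform > 0).any(), (dr_kl_uniform < 0).any())
```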
In the rest of this section, we argue that this property leads to several unsatisfying conclusions in previous work (Lee et al., 2019; Eysenbach and Levine, 2021), which are resolved by using the KL and \alpha-divergence for analysis instead of the corresponding entropic quantities.³

³ Entropy regularization corresponds to divergence regularization with the uniform reference distribution \pi_0(a|s).

Figure 6.6: Comparing robust sets for entropy regularization (a) vs. divergence regularization (b). (a) As in Fig. 2 or Fig. 9 (left) of Eysenbach and Levine (2021), with constraint \sum_a \exp\{\beta \Delta r(a)\} \leq 1. (b) With reference \pi_0(a|s) = u(a) = \frac{1}{2} and constraint \sum_a \pi_0(a)\exp\{\beta \Delta r(a)\} \leq 1.

First, this means that a Shannon entropy-regularized policy is only `robust' to increases in the reward function. However, for useful generalization, we might hope that a policy still performs well when the reward function decreases in at least some states. Including the reference distribution via divergence regularization resolves this issue, and we observe in Fig. 6.2 and Fig. 6.4 that the adversary chooses reward decreases in some actions and increases in others. For example, for the KL divergence, \Delta r_{*}(a,s) = \frac{1}{\beta}\log\frac{\pi_{*}(a|s)}{\pi_0(a|s)} = Q_{*}(a,s) - V_{*}(s) implies robustness to reward decreases when \pi_{*}(a|s) > \pi_0(a|s) or Q_{*}(a,s) > V_{*}(s).

Similarly, Lee et al. (2019) note that, for any \alpha,

\frac{1}{\beta}\,\Omega^{*}_{(-H_{\alpha}),\beta}(Q) \;=\; \max_{\pi \in \Delta^{|A|}} \; \langle \pi, Q \rangle + \frac{1}{\beta}H_{\alpha}(\pi) \;\geq\; Q(a_{\max}, s),

where a_{\max} = \arg\max_a Q(a,s) and the Tsallis entropy H_{\alpha}(\pi) equals the Shannon entropy for \alpha = 1. This soft value aggregation yields a result that is larger than any particular Q-value. By contrast, for the \alpha-divergence, we show in Brekelmans et al. (2022a) App. F.3 that for fixed \pi_0 and \beta > 0,

Q(a_{\max}, s) + \frac{1}{\beta}\log_{2-\alpha}\pi_0(a_{\max}|s) \;\leq\; \frac{1}{\beta}\,\Omega^{*(\alpha)}_{\pi_0,\beta}(Q) \;\leq\; Q(a_{\max}, s).

This provides a more natural interpretation of the Bellman optimality operator V(s) \leftarrow \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\beta}(Q) as a soft maximum operation. As a function of \beta, we see that the conjugate ranges between \mathbb{E}_{\pi_0}[Q(a,s)] \leq \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\beta}(Q) \leq Q(a_{\max},s), as shown in Brekelmans et al. (2022a) App. C.4 and F.3.

Finally, using entropy instead of divergence regularization also affects interpretations of the feasible set. Eysenbach and Levine (2021) consider the same constraint as in Eq. (6.19), but without the reference \pi_0(a|s),

\sum_{a \in A} \exp\{\beta\, \Delta r(a,s)\} \;\leq\; 1 \qquad \forall s \in S.   (6.32)

This constraint suggests that the original reward function (\Delta r = 0) is not feasible for the adversary. More surprisingly, Eysenbach and Levine (2021) App. A8 argue that increasing regularization strength (with lower \beta) may lead to less robust policies, based on the constraint in Eq. (6.32). In Fig. 6.6 and Brekelmans et al. (2022a) App. F.4, we visualize how including \pi_0(a|s) in the constraint via divergence regularization (Prop. 6.3.1) avoids this conclusion. As expected, Fig. 6.2 shows that increasing regularization strength leads to more robust policies.
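The difference between the two constraints is easy to see numerically: under Eq. (6.32) the unperturbed reward \Delta r = 0 is infeasible (the left-hand side equals |A| > 1), while under Eq. (6.19) it lies exactly on the boundary of the feasible set for any normalized reference. A two-action check with an illustrative \beta:

```python
import numpy as np

beta = 1.0
prior = np.array([0.5, 0.5])    # normalized reference policy pi_0(a|s)
dr = np.zeros(2)                # unperturbed reward: Delta r = 0

# Entropy-form constraint (Eq. 6.32): sum_a exp(beta dr) <= 1 fails at dr = 0
print(np.sum(np.exp(beta * dr)))            # -> 2.0 > 1, infeasible

# Divergence-form constraint (Eq. 6.19): sum_a pi_0 exp(beta dr) <= 1 holds with equality
print(np.sum(prior * np.exp(beta * dr)))    # -> 1.0, on the boundary
```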
6.5.2 Related Algorithms

Several recent works provide algorithmic insights which build upon convex duality and complement or extend our analysis. Derman et al. (2021) derive practical iterative algorithms based on a general equivalence between robustness and regularization, which can be used to enforce robustness to both reward perturbations (through policy regularization) and changes in environment dynamics (through value regularization). For policy regularization, Derman et al. (2021) translate the specification of a desired robust set into a regularizer using the convex conjugate of the set indicator function. In particular, Derman et al. (2021) associate KL divergence or (scaled) Tsallis entropy policy regularization with the rectangular robust set of modified rewards

\tilde{\mathcal{R}} \;:=\; \Big\{ r' \;\Big|\; r'(a,s) \in \Big[\, r(a,s) - \tfrac{1}{\beta}\log\tfrac{\pi(a|s)}{\pi_0(a|s)},\; \infty \Big) \;\; \forall (a,s) \in A \times S \Big\}.

Our analysis proceeds in the opposite direction, from regularization to robustness, using the conjugate of the divergence. While the worst-case perturbations result in the same modified objective, our approach yields a larger robust set with a qualitatively different shape (see Fig. 6.1).

Zahavy et al. (2021) analyze a general `meta-algorithm' which alternates between updates of the occupancy measure \mu(a,s) and the modified reward r'(a,s) in online fashion. This approach highlights the fact that the modified reward r' or worst-case perturbations \Delta r change as the policy or occupancy measure is optimized. The results of Zahavy et al. (2021) and Husain et al. (2021) hold for general convex MDPs, which encompass common exploration and imitation learning objectives beyond the policy regularization setting we consider.

As discussed in Sec. 6.3.4, path consistency conditions have been used to derive practical learning objectives in (Nachum et al., 2017; Chow et al., 2018). These algorithms might be extended to general \alpha-divergence regularization via Eq. (6.28), which involves an arbitrary reference policy \pi_0(a|s) that can be learned adaptively as in (Teh et al., 2017; Grau-Moya et al., 2018).

Finally, previous work has used dual optimizations similar to Eq. (6.14) to derive alternative Bellman error losses (Dai et al., 2018; Belousov and Peters, 2019; Nachum and Dai, 2020; Bas-Serrano et al., 2021), highlighting how convex duality can be used to bridge between policy regularization and Bellman error aggregation (Belousov and Peters, 2019; Husain et al., 2021).

6.5.3 Conclusion

In this work, we analyzed the robustness of convex-regularized RL policies to worst-case perturbations of the reward function, which implies generalization to adversarially chosen reward functions from within a particular robust set. We have characterized this robust set of reward functions for KL and \alpha-divergence regularization, provided a unified discussion of existing works on reward robustness, and clarified apparent differences in robustness arising from entropy versus divergence regularization. Our advantage-function interpretation of the worst-case reward perturbations provides a complementary perspective on how Q-values appear as dual variables in convex programming forms of regularized MDPs. Compared to a deterministic, unregularized policy, a stochastic, regularized policy places probability mass on a wider set of actions and requires state-action value adjustments via the advantage function or adversarial reward perturbations. Conversely, a regularized agent, acting based on given Q-value estimates, implicitly hedges against the anticipated perturbations of an appropriate adversary.
Figure 6.7: Reference distribution \pi_0 = (\frac{1}{2}, \frac{1}{2}). See caption of Fig. 6.8. [Panels show KL (\alpha = 1) and \alpha-divergence (\alpha \in \{-1, 2, 3\}) regularization with \beta \in \{0.1, 1, 5, 10\}; axes show the perturbed rewards r'(a_1, s) and r'(a_2, s).]
Figure 6.8: Reference distribution \pi_0 = (\frac{2}{3}, \frac{1}{3}). Feasible set (red region) of perturbed rewards available to the adversary, for KL (\alpha = 1) and \alpha-divergence (\alpha \in \{-1, 2, 3\}) regularization, various \beta, and fixed Q_{*}(a,s) = r(a,s) values (blue star). We consider the optimal \pi_{*}(a|s) with regularization parameters \alpha, \beta, \pi_0 and the given Q-values. The red star indicates the worst-case perturbed reward r' = r - \Delta r_{*} for the optimal policy. [Axes show the perturbed rewards r'(a_1, s) and r'(a_2, s).]

Bibliography

Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pages 151-160. PMLR, 2019.
A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, pages 159-168, 2018.
A. A. Alemi and I. Fischer. GILBO: One metric to measure them all. In Advances in Neural Information Processing Systems, 2018.
A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131-142, 1966.
S.-i. Amari. Differential geometry of curved exponential families--curvatures and information loss. The Annals of Statistics, pages 357-385, 1982.
S.-i. Amari. Integration of stochastic models by minimizing alpha-divergence. Neural Computation, 19(10):2780-2796, 2007.
S.-i. Amari.
Information geometry and its applications, volume 194. Springer, 2016. S.-i. Amari and H. Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2000. S.-i. Amari and A. Ohara. Geometry of q-exponential family of probability distributions. Entropy, 13(6): 1170{1185, 2011. B. Amos, L. Xu, and J. Z. Kolter. Input convex neural networks. In International Conference on Machine Learning, pages 146{155. PMLR, 2017. M. Arbel, L. Zhou, and A. Gretton. Generalized energy based models. arXiv e-prints, pages arXiv{2003, 2020. S. Arora and P. Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress. Articial Intelligence, 297:103500, 2021. A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664{2669, 2005a. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman Divergences. Journal of Machine Learning Research, 6:1705{1749, 2005b. A. Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh, and J. Laerty. Clustering with bregman divergences. Journal of machine learning research, 6(10), 2005c. D. Barber and F. Agakov. The im algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pages 201{ 208, 2003. 168 J. Bas-Serrano, S. Curi, A. Krause, and G. Neu. Logistic q-learning. In International Conference on Articial Intelligence and Statistics, pages 3610{3618. PMLR, 2021. A. Basu, I. R. Harris, N. L. Hjort, and M. Jones. Robust and ecient estimation by minimising a density power divergence. Biometrika, 85(3):549{559, 1998. I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018a. M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm. Mutual information neural estimation. In International Conference on Machine Learning, pages 531{540. PMLR, 2018b. R. Bellman. A markovian decision process. Journal of mathematics and mechanics, 6(5):679{684, 1957. B. Belousov. Bregman divergence of alpha divergence. Blog post, 2017. URL http://www.boris-belousov. net/2017/04/16/bregman-divergence/. B. Belousov and J. Peters. Entropic regularization of markov decision processes. Entropy, 21(7):674, 2019. M. Betancourt, S. Byrne, S. Livingstone, M. Girolami, et al. The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 23(4A):2257{2298, 2017. D. M. Blei, A. Kucukelbir, and J. D. McAulie. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859{877, 2017. M. Blondel, A. F. Martins, and V. Niculae. Learning with Fenchel-Young losses. J. Mach. Learn. Res., 21 (35):1{69, 2020. S. Borade and L. Zheng. I-projection and the geometry of error exponents. In Proceedings of the Forty-Fourth Annual Allerton Conference on Communication, Control, and Computing, Sept 27-29, 2006. J. Bornschein and Y. Bengio. Reweighted Wake-Sleep. 1, 2014. URL http://arxiv.org/abs/1406.2751. S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004. R. Brekelmans, V. Masrani, F. Wood, G. V. Steeg, and A. Galstyan. All in the exponential family: Bregman duality in thermodynamic variational inference. In International Conference on Machine Learning, 2020a. R. Brekelmans, V. Masrani (Equal Contribution), F. Wood, G. Ver Steeg, and A. Galstyan. 
All in the exponential family: Bregman duality in thermodynamic variational inference. In International Conference on Machine Learning, pages 1111{1122. PMLR, 2020b. R. Brekelmans, F. Nielsen, A. Galstyan, and G. V. Steeg. Likelihood ratio exponential families. In NeurIPS Workshop on Information Geometry in Deep Learning, 2020c. URL https://openreview.net/forum? id=RoTADibt26_. R. Brekelmans, F. Nielsen, A. Galstyan, and G. V. Steeg. Likelihood ratio exponential families. In NeurIPS Workshop on Information Geometry in Deep Learning, 2020d. URL https://openreview.net/forum? id=RoTADibt26_. R. Brekelmans, T. Genewein, J. Grau-Moya, G. Del etang, M. Kunesch, S. Legg, and P. Ortega. Your policy regularizer is secretly an adversary. Transactions on Machine Learning Research, 2022a. R. Brekelmans, S. Huang, M. Ghassemi, G. V. Steeg, R. B. Grosse, and A. Makhzani. Improving mutual information estimation with annealed and energy-based bounds. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=T0B9AoM_bFg. A. Buchholz, N. Chopin, and P. E. Jacob. Adaptive Tuning of Hamiltonian Monte Carlo Within Sequential Monte Carlo. Bayesian Analysis, -1(-1):1{27, Jan. 2021. ISSN 1936-0975, 1931-6690. doi: 10.1214/ 20-BA1222. 169 T. Bui. Connecting the thermodynamic variational objective and annealed importance sampling. 2020a. T. Bui. Connecting the thermodynamic variational objective and annealed importance sampling. 2020b. URL https://thangbui.github.io/docs/reports/tvo_annealed_is.pdf. J. Burbea and C. Rao. Entropy dierential metric, distance and divergence measures in probability spaces: A unied approach. Journal of Multivariate Analysis, 12(4):575{596, 1982. ISSN 0047-259X. doi: https: //doi.org/10.1016/0047-259X(82)90065-3. URL https://www.sciencedirect.com/science/article/ pii/0047259X82900653. Y. Burda, R. Grosse, and R. Salakhutdinov. Importance Weighted Autoencoders. Iclr-2015, pages 1{12, 2015a. URL http://arxiv.org/abs/1509.00519. Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015b. Y. Burda, R. B. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In International Con- ference on Learning Representations, 2016. A. L. Caterini, A. Doucet, and D. Sejdinovic. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, pages 8167{8177, 2018. S. Chatterjee and P. Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099{1135, 2018. J. Chen, D. Lu, Z. Xiu, K. Bai, L. Carin, and C. Tao. Variational inference with holder bounds. arXiv preprint arXiv:2111.02947, 2021. X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable rep- resentation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2180{2188, 2016. N. Chentsov. Statistical decision rules and optimal inference. Translations of Mathematical Monographs, 53, 1982. K. Choi, C. Meng, Y. Song, and S. Ermon. Density ratio estimation via innitesimal classication. arXiv preprint arXiv:2111.11010, 2021. N. Chopin, O. Papaspiliopoulos, et al. An introduction to sequential Monte Carlo. Springer, 2020. Y. Chow, O. Nachum, and M. Ghavamzadeh. Path consistency learning in tsallis entropy regularized mdps. In International Conference on Machine Learning, pages 979{988. PMLR, 2018. P. F. Christiano, J. Leike, T. 
Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017. A. Cichocki and S.-i. Amari. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532{1568, 2010. T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012. K. Cranmer, J. Brehmer, and G. Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055{30062, 2020. C. Cremer, Q. Morris, and D. Duvenaud. Reinterpreting importance-weighted autoencoders. arXiv preprint arXiv:1704.02916, 2017. I. Csisz ar. Information-type measures of dierence of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229{318, 1967. 170 I. Csisz ar. Information geonetry and alternating minimization procedures. Statistics and decisions, 1:205{ 237, 1984. I. Csiszar. Why least squares and maximum entropy? an axiomatic approach to inference for linear inverse problems. The annals of statistics, 19(4):2032{2066, 1991. I. Csisz ar. The method of types [information theory]. IEEE Transactions on Information Theory, 44(6): 2505{2523, 1998. I. Csisz ar and P. C. Shields. Information theory and statistics: A tutorial. Now Publishers Inc, 2004. A. G. Dabak and D. H. Johnson. Relations between kullback-leibler distance and sher information. 2002. B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pages 1125{1134. PMLR, 2018. B. Dai, Z. Liu, H. Dai, N. He, A. Gretton, L. Song, and D. Schuurmans. Exponential family estimation via adversarial dynamics embedding. Advances in Neural Information Processing Systems, 32, 2019. M. de Carvalho. Mean, what do you mean? The American Statistician, 70(3):270{274, 2016. J. Deasy, N. Simidjievski, and P. Li o. Constraining variational inference with geometric jensen-shannon divergence. Advances in Neural Information Processing Systems, 33:10647{10658, 2020. P. Del Moral, A. Doucet, and A. Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411{436, 2006. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1{22, 1977. E. Derman, M. Geist, and S. Mannor. Twice regularized mdps and the equivalence between robustness and regularization. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1{5, 2016. A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei. Variational inference vian upper bound minimization. In Advances in Neural Information Processing Systems, pages 2732{2741, 2017. X. Ding and D. J. Freedman. Improving importance weighted auto-encoders with annealed importance sampling. arXiv preprint arXiv:1906.04904, 2019. J. Domke and D. R. Sheldon. Importance weighting and variational inference. In Advances in neural information processing systems, pages 4470{4479, 2018. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics letters B, 195(2): 216{222, 1987. D. J. Earl and M. W. Deem. 
Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23):3910{3916, 2005. S. Eguchi. Second order eciency of minimum contrast estimators in a curved exponential family. The Annals of Statistics, pages 793{803, 1983. S. Eguchi. A dierential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima mathematical journal, 15(2):341{391, 1985. S. Eguchi. Information geometry and statistical pattern recognition. Sugaku Expositions, 19(2):197{216, 2006. 171 S. Eguchi and J. Copas. Interpreting kullback{leibler divergence with the neyman{pearson lemma. Journal of Multivariate Analysis, 97(9):2034{2040, 2006. B. Eysenbach and S. Levine. Maximum entropy RL (provably) solves some robust RL problems. In In- ternational Conference on Learning Representations, 2021. URL https://openreview.net/forum?id= PtSAD3caaA2. E. A. Feinberg and A. Shwartz. Handbook of Markov decision processes: methods and applications, volume 40. Springer Science & Business Media, 2012. P. C. Fishburn. Nonlinear preference and utility theory. Number 5. Johns Hopkins University Press, 1988. R. Fox, A. Pakman, and N. Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Articial Intelligence, pages 202{211, 2016. N. Friel and A. N. Pettitt. Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):589{607, 2008. F. Futami, I. Sato, and M. Sugiyama. Variational inference based on robust divergences. In International Conference on Articial Intelligence and Statistics, pages 813{822. PMLR, 2018. T. Gener and J. Domke. Mcmc variational inference via uncorrected hamiltonian annealing. Advances in Neural Information Processing Systems, 34, 2021. M. Geist, B. Scherrer, and O. Pietquin. A theory of regularized markov decision processes. In International Conference on Machine Learning, pages 2160{2169. PMLR, 2019. M. Gell-Mann and C. Tsallis. Nonextensive entropy: interdisciplinary applications. Oxford University Press, 2004. A. Gelman and X.-L. Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical science, pages 163{185, 1998. P. Gibilisco and G. Pistone. Connections on non-parametric statistical manifolds by orlicz space geometry. Innite Dimensional Analysis, Quantum Probability and Related Topics, 1(02):325{347, 1998. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014. M. R. Grasselli. Dual connections in nonparametric classical information geometry. Annals of the Institute of Statistical Mathematics, 62(5):873{896, 2010. J. Grau-Moya, F. Leibfried, and P. Vrancx. Soft q-learning with mutual-information regularization. In International Conference on Learning Representations, 2018. R. B. Grosse, C. J. Maddison, and R. R. Salakhutdinov. Annealing between distributions by averaging moments. In Advances in Neural Information Processing Systems, pages 2769{2777, 2013. R. B. Grosse, Z. Ghahramani, and R. P. Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv preprint arXiv:1511.02543, 2015. R. B. Grosse, S. Ancha, and D. M. Roy. Measuring the reliability of MCMC inference with bidirectional Monte Carlo. 
In Advances in Neural Information Processing Systems, pages 2451{2459, 2016. P. Gr unwald. The safe bayesian. In International Conference on Algorithmic Learning Theory, pages 169{ 183. Springer, 2012. P. D. Gr unwald. The minimum description length principle. MIT press, 2007. 172 T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1352{1361. PMLR, 06{11 Aug 2017. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: O-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861{1870. PMLR, 2018. M. Habeck. Model evidence from nonequilibrium simulations. In Advances in Neural Information Processing Systems, pages 1753{1762, 2017. G. Hardy, J. Littlewood, and G. P olya. Inequalities. The Mathematical Gazette, 37(321):236{236, 1953. P. Harremo es. Interpretations of r enyi entropies and divergences. Physica A: Statistical Mechanics and its Applications, 365(1):57{62, 2006. C. F. Havrda, Jan. Quantication method of classication processes. concept of structural a-entropy. Ky- bernetika, 03(1):(30){35, 1967. URL http://eudml.org/doc/28681. J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534, 2019. G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14 (8):1771{1800, 2002. G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158{1161, 1995. M. D. Homan. Learning deep latent gaussian models with markov chain monte carlo. In International conference on machine learning, pages 1510{1519, 2017. S. Huang, A. Makhzani, Y. Cao, and R. Grosse. Evaluating lossy compression rates of deep generative models. In International Conference on Machine Learning. PMLR, 2020. H. Husain and J. Knoblauch. Adversarial interpretation of bayesian inference. In International Conference on Algorithmic Learning Theory, pages 553{572. PMLR, 2022. H. Husain, K. Ciosek, and R. Tomioka. Regularized policies are reward robust. International Conference on Articial Intelligence and Statistics, 2021. M. Igl, A. Gambardella, J. He, N. Nardelli, N. Siddharth, W. B ohmer, and S. Whiteson. Multitask soft option learning. In Conference on Uncertainty in Articial Intelligence, pages 969{978, 2020. A. Jaiswal, R. Brekelmans, D. Moyer, G. V. Steeg, W. AbdAlmageed, and P. Natarajan. Discovery and separation of features for invariant representation learning. arXiv preprint:1912.00646, 2019. C. Jarzynski. Equilibrium free-energy dierences from nonequilibrium measurements: A master-equation approach. Physical Review E, 56(5):5018, 1997. A. Jasra, D. A. Stephens, A. Doucet, and T. Tsagaris. Inference for L evy-Driven Stochastic Volatility Models via Adaptive Sequential Monte Carlo. Scandinavian Journal of Statistics, 38(1):1{22, 2011. ISSN 1467-9469. doi: 10.1111/j.1467-9469.2010.00723.x. E. T. Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957. G. Kaniadakis and A. Scarfone. A new one-parameter deformation of the exponential function. Physica A: Statistical Mechanics and its Applications, 305(1-2):69{75, 2002. H. J. Kappen, V. G omez, and M. Opper. 
Optimal control as a graphical model inference problem. Machine learning, 87(2):159{182, 2012. 173 D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive ow. In Advances in Neural Information Processing Systems, pages 4743{4751, 2016. J. Knoblauch, J. E. Jewson, and T. Damoulas. Doubly robust bayesian inference for non-stationary streaming data with -divergences. Advances in Neural Information Processing Systems, 31, 2018. J. Knoblauch, J. Jewson, and T. Damoulas. Generalized variational inference: Three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063, 2019. A. Kolmogorov. On the notion of mean. Mathematics and Mechanics, 1930. D. Kountourogiannis and P. Loya. A derivation of taylor's formula with integral remainder. Mathematics magazine, 76(3):217{219, 2003. S. Kullback and R. A. Leibler. On information and suciency. The annals of mathematical statistics, 22(1): 79{86, 1951. F. Kunstner, R. Kumar, and M. Schmidt. Homeomorphic-invariance of em: Non-asymptotic convergence in kl divergence for exponential families via mirror descent. In International Conference on Articial Intelligence and Statistics, pages 3295{3303. PMLR, 2021. D. Lawson, G. Tucker, B. Dai, and R. Ranganath. Energy-inspired models: learning with sampler-induced distributions. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 8501{8513, 2019. K. Lee, S. Choi, and S. Oh. Sparse markov decision processes with causal sparse tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466{1473, 2018. K. Lee, S. Kim, S. Lim, S. Choi, and S. Oh. Tsallis reinforcement learning: A unied framework for maximum entropy reinforcement learning. arXiv preprint arXiv:1902.00137, 2019. S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018. S. Levine, A. Kumar, G. Tucker, and J. Fu. Oine reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020. Y. Li and R. E. Turner. R enyi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073{1081, 2016. J. Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37 (1):145{151, 1991. G. I. Loaiza and H. Quiceno. A q-exponential statistical banach manifold. Journal of Mathematical Analysis and Applications, 398(2):466{476, 2013a. G. I. Loaiza and H. R. Quiceno. A riemannian geometry in the q-exponential banach manifold induced by q-divergences. In Geometric Science of Information. First International Conference, GSI 2013, Paris, France, August 28-30, 2013. Proceedings, pp 737-742. Springer Berlin Heidelberg, 2013b. D. J. MacKay, D. J. Mac Kay, et al. Information theory, inference and learning algorithms. Cambridge university press, 2003. A. F. Martins, M. Treviso, A. Farinhas, P. M. Aguiar, M. A. Figueiredo, M. Blondel, and V. Niculae. Sparse continuous distributions and fenchel-young losses. arXiv preprint arXiv:2108.01988, 2021. 174 V. Masrani, T. A. Le, and F. Wood. The thermodynamic variational objective. arXiv preprint arXiv:1907.00031, 2019. V. Masrani, R. Brekelmans (Equal Contribution), T. Bui, F. Nielsen, A. Galstyan, G. V. Steeg, and F. Wood. 
H. Matsuzoe, A. M. Scarfone, and T. Wada. Normalization problems for deformed exponential families. In International Conference on Geometric Science of Information, pages 279–287. Springer, 2019.
D. McAllester and K. Stratos. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, pages 875–884. PMLR, 2020.
M. Mihoko and S. Eguchi. Robust blind source separation by beta divergence. Neural Computation, 14(8):1859–1886, 2002.
T. Minka. Divergence measures and message passing. Technical report, 2005.
D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg. Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pages 9084–9093, 2018.
O. Nachum and B. Dai. Reinforcement learning via fenchel-rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017.
O. Nachum, Y. Chow, B. Dai, and L. Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. Reinforcement Learning for Real Life Workshop (ICML), 2019a.
O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.
J. Naudts. Estimators, escort probabilities, and phi-exponential families in statistical physics. arXiv preprint math-ph/0402005, 2004.
J. Naudts. The q-exponential family in statistical physics. Open Physics, 7(3):405–413, 2009.
J. Naudts. Generalised thermostatistics. Springer Science & Business Media, 2011.
J. Naudts and J. Zhang. Rho–tau embedding and gauge freedom in information geometry. Information Geometry, 1(1):79–115, 2018.
R. M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353–366, 1996.
R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, page 113, 2011.
R. M. Neal and G. E. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.
G. Neu, A. Jonsson, and V. Gómez. A unified view of entropy-regularized markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, 2000.
V. Nguyen, V. Masrani, R. Brekelmans, M. Osborne, and F. Wood. Gaussian process bandit optimization of the thermodynamic variational objective. Advances in Neural Information Processing Systems, 33:5764–5775, 2020.
X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
F. Nielsen. A family of statistical symmetric divergences based on jensen's inequality. arXiv preprint arXiv:1009.4004, 2010.
F. Nielsen. An information-geometric characterization of chernoff information. IEEE Signal Processing Letters, 20(3):269–272, 2013.
F. Nielsen. On a generalization of the jensen-shannon divergence and the js-symmetrization of distances relying on abstract means. CoRR, abs/1904.04017, 2019. URL http://arxiv.org/abs/1904.04017.
F. Nielsen. An elementary introduction to information geometry. Entropy, 22(10), 2020.
F. Nielsen. Statistical divergences between densities of truncated exponential families with nested supports: Duo bregman and duo jensen divergences. Entropy, 24(3):421, 2022.
F. Nielsen and R. Nock. The dual voronoi diagrams with respect to representational bregman divergences. In 2009 Sixth International Symposium on Voronoi Diagrams, pages 71–78. IEEE, 2009.
F. Nielsen and R. Nock. On Rényi and Tsallis entropies and divergences for exponential families. arXiv preprint arXiv:1105.3259, 2011.
R. Nock, Z. Cranko, A. K. Menon, L. Qu, and R. C. Williamson. f-gans in an information geometric nutshell. Advances in Neural Information Processing Systems, 2017.
S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 271–279, 2016.
Y. Ogata. A Monte Carlo method for high dimensional integration. Numerische Mathematik, 55(2):137–157, 1989.
P. Ortega and D. Lee. An adversarial interpretation of information-theoretic bounded rationality. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
P. A. Ortega and D. A. Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469(2153), 2013.
M. J. Osborne and A. Rubinstein. A course in game theory. 1994.
A. B. Owen. Monte Carlo theory, methods and examples. 2013.
J. Peters, K. Mulling, and Y. Altun. Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, 2010.
G. Pistone and C. Sempi. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. The Annals of Statistics, pages 1543–1561, 1995.
B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180, 2019.
M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
M. L. Puterman and M. C. Shin. Modified policy iteration algorithms for discounted markov decision problems. Management Science, 24(11), 1978.
T. Rainforth, R. Cornish, H. Yang, A. Warrington, and F. Wood. On nesting monte carlo estimators. In International Conference on Machine Learning, pages 4267–4276. PMLR, 2018a.
T. Rainforth, A. Kosiorek, T. A. Le, C. Maddison, M. Igl, F. Wood, and Y. W. Teh. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, pages 4277–4285. PMLR, 2018b.
K. Rawlik, M. Toussaint, and S. Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In International Joint Conference on Artificial Intelligence, 2013.
D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
B. Rhodes, K. Xu, and M. U. Gutmann. Telescoping density-ratio estimation. Advances in Neural Information Processing Systems, 2020.
K. Rose, E. Gurewitz, and G. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(9):589–594, 1990.
P. J. Rossky, J. Doll, and H. Friedman. Brownian dynamics as smart Monte Carlo simulation. The Journal of Chemical Physics, 69(10):4628–4633, 1978.
A. Ruderman, M. D. Reid, D. García-García, and J. Petterson. Tighter variational representations of f-divergences via restriction to probability measures. In Proceedings of the 29th International Conference on Machine Learning, pages 1155–1162, 2012.
F. Ruiz and M. Titsias. A contrastive divergence for combining variational inference and mcmc. In International Conference on Machine Learning, pages 5537–5545, 2019.
A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547–561, Berkeley, Calif., 1961. University of California Press. URL https://projecteuclid.org/euclid.bsmsp/1200512181.
R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879, 2008.
T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In Proceedings of the 32nd International Conference on Machine Learning, pages 1218–1226, 2015. URL http://arxiv.org/abs/1410.6460.
D. Sanz-Alonso. Importance sampling and necessary sample size: An information theory approach. SIAM/ASA Journal on Uncertainty Quantification, 6(2):867–879, 2018. doi: 10.1137/16M1093549.
C. Schäfer and N. Chopin. Sequential Monte Carlo on large binary sampling spaces. Statistics and Computing, 23(2):163–184, Mar. 2013. ISSN 1573-1375. doi: 10.1007/s11222-011-9299-z.
B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist. Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16(49):1629–1676, 2015.
J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv e-prints, arXiv–1707, 2017.
R. Sibson. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(2):149–160, 1969.
A. Sobolev. Thoughts on mutual information estimation: More estimators. Blog post, 2019. URL http://artem.sobolev.name/posts/2019-08-10-thoughts-on-mutual-information-more-estimators.html.
A. Sobolev and D. P. Vetrov. Importance weighted hierarchical variational inference. In Advances in Neural Information Processing Systems, volume 32, 2019.
J. Song and S. Ermon. Understanding the limitations of variational mutual information estimators. In International Conference on Learning Representations, 2019.
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021.
H. Suyari, H. Matsuzoe, and A. M. Scarfone. Advantages of q-logarithm representation over q-exponential representation from the sense of scale and shift on nonlinear systems. The European Physical Journal Special Topics, 229(5):773–785, 2020.
S. Syed, V. Romaniello, T. Campbell, and A. Bouchard-Côté. Parallel tempering on optimized paths. International Conference on Machine Learning, 2021.
U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pages 1032–1039, 2008.
Y. W. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4499–4509, 2017.
A. Thin, N. Kotelevskii, A. Doucet, A. Durmus, E. Moulines, and M. Panov. Monte carlo variational auto-encoders. In International Conference on Machine Learning, pages 10247–10257. PMLR, 2021.
N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. 37th Annual Allerton Conference on Communications, Control and Computing, 1999, pages 368–377, 1999.
C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479–487, 1988.
C. Tsallis. Introduction to nonextensive statistical mechanics: approaching a complex world. Springer Science & Business Media, 2009.
G. Tucker, D. Lawson, S. Gu, and C. J. Maddison. Doubly reparameterized gradient estimators for monte carlo objectives. arXiv preprint arXiv:1810.04152, 2018.
A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
T. Van Erven and P. Harremoës. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
D. Wang, H. Liu, and Q. Liu. Variational inference with tail-adaptive f-divergence. In Advances in Neural Information Processing Systems, pages 5737–5747, 2018.
Z. Wang, O. So, J. Gibson, B. Vlahov, M. S. Gandhi, G.-H. Liu, and E. A. Theodorou. Variational inference mpc using tsallis divergence. arXiv preprint arXiv:2104.00241, 2021.
M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688. Citeseer, 2011.
T.-K. L. Wong. Logarithmic divergences from optimal transport and Rényi geometry. Information Geometry, 1(1):39–78, 2018.
T.-K. L. Wong and J. Zhang. Tsallis and Rényi deformations linked via a new λ-duality. arXiv preprint arXiv:2107.11925, 2021.
C. J. Wu. On the convergence properties of the em algorithm. The Annals of Statistics, pages 95–103, 1983.
Y. Wu, Y. Burda, R. Salakhutdinov, and R. Grosse. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.
T. Zahavy, B. O'Donoghue, G. Desjardins, and S. Singh. Reward is enough for convex mdps. Advances in Neural Information Processing Systems, 2021.
A. Zellner and R. A. Highfield. Calculation of maximum entropy distributions and approximation of marginal posterior distributions. Journal of Econometrics, 37(2):195–209, 1988.
G. Zhang, K. Hsu, J. Li, C. Finn, and R. B. Grosse. Differentiable annealed importance sampling and the perils of gradient noise. Advances in Neural Information Processing Systems, 34, 2021.
J. Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16(1):159–195, 2004.
J. Zhang. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy, 15(12):5384–5418, 2013.
J. Zhang. On monotone embedding in information geometry. Entropy, 17(7):4485–4499, 2015.
J. Zhang and H. Matsuzoe. Entropy, cross-entropy, relative entropy: Deformation theory (a). Europhysics Letters, 134(1):18001, 2021.
S. Zhao, J. Song, and S. Ermon. The information autoencoding family: A lagrangian perspective on latent variable generative models. In Proc. 34th Conference on Uncertainty in Artificial Intelligence, 2018.
H. Zhu and R. Rohwer. Information geometric measurements of generalisation. Technical Report, 1995.

Appendices

Appendix A
Conjugate Duality in Exponential Families (Ch. 3)

The Bregman divergence associated with a convex function $F : \Omega \to \mathbb{R}$ can be written as (Banerjee et al., 2005c):
$$ D_F[p : q] = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle $$
The family of Bregman divergences includes many familiar quantities, including the KL divergence corresponding to the negative entropy generator $F(p) = \int p \log p \, d\omega$. Geometrically, the divergence can be viewed as the difference between $F(p)$ and its linear approximation around $q$. Since $F$ is convex, its first-order Taylor approximation lies below the function, yielding $D_F[p : q] \geq 0$.

For our purposes, we can let $F \triangleq \psi(\theta) = \log Z(\theta)$ over the domain of probability distributions indexed by natural parameters $\theta$ of an exponential family (e.g. Eq. (3.13)):
$$ D_\psi[\theta_p : \theta_q] = \psi(\theta_p) - \psi(\theta_q) - \langle \theta_p - \theta_q, \nabla\psi(\theta_q) \rangle \qquad (A.1) $$
This is a common setting in the field of information geometry (Amari, 2016), which introduces dually flat manifold structures based on the natural parameters $\theta$ and the mean parameters $\eta$.

A.0.1 KL Divergence as a Bregman Divergence

For an exponential family with partition function $\psi(\theta)$ and sufficient statistics $T(\omega)$ over a random variable $\omega$, the Bregman divergence $D_\psi$ corresponds to a KL divergence. Recalling that $\nabla\psi(\theta) = \eta = \mathbb{E}_{\pi_\theta}[T(\omega)]$ from Eq. (3.16), we simplify the definition in Eq. (A.1) to obtain
$$ \begin{aligned}
D_\psi[\theta_p : \theta_q] &= \psi(\theta_p) - \psi(\theta_q) - \langle \theta_p, \eta_q \rangle + \langle \theta_q, \eta_q \rangle \\
&= \psi(\theta_p) - \psi(\theta_q) - \mathbb{E}_{q}\big[\theta_p \cdot T(\omega)\big] + \mathbb{E}_{q}\big[\theta_q \cdot T(\omega)\big] \\
&= \mathbb{E}_{q}\big[\,\theta_q \cdot T(\omega) - \psi(\theta_q) + \log \pi_0(\omega)\,\big] - \mathbb{E}_{q}\big[\,\theta_p \cdot T(\omega) - \psi(\theta_p) + \log \pi_0(\omega)\,\big] \\
&= \mathbb{E}_{q(\omega)} \log \frac{q(\omega)}{p(\omega)} = D_{\mathrm{KL}}[\,q(\omega) \,\|\, p(\omega)\,] \qquad (A.2)
\end{aligned} $$
where the bracketed terms are $\log q(\omega)$ and $\log p(\omega)$ respectively, we have added and subtracted terms involving the base measure $\pi_0(\omega)$, and we have used the definition of our exponential family from Eq. (3.13). The Bregman divergence $D_\psi$ is thus equal to the KL divergence with arguments reversed.

A.0.2 Dual Divergence

We can use convex duality to derive an alternative divergence based on the conjugate function $\psi^*$,
$$ \psi^*(\eta) = \sup_{\theta} \; \langle \theta, \eta \rangle - \psi(\theta) \qquad (A.3) $$
which suggests the following optimality conditions
$$ \eta = \nabla\psi(\theta), \qquad \text{and} \qquad \theta = (\nabla\psi)^{-1}(\eta). $$
Finally, we can rewrite the conjugate function for $\theta$ and $\eta$ in dual correspondence
$$ \psi^*(\eta) = \langle \theta, \eta \rangle - \psi(\theta) \qquad (A.4) $$
The conjugate measures the maximum distance between the line $\langle \theta, \eta \rangle$ and the function $\psi(\theta)$, which occurs at the unique point where $\eta = \nabla\psi(\theta)$. This yields a bijective mapping between $\theta$ and $\eta$ for minimal exponential families (Wainwright and Jordan, 2008). Thus, a distribution $\pi$ may be indexed by either its natural parameters $\theta_\pi$ or mean parameters $\eta_\pi$. Noting that $(\psi^*)^* = \psi(\theta) = \sup_{\eta} \langle \theta, \eta \rangle - \psi^*(\eta)$ (Boyd and Vandenberghe, 2004), we can use a similar argument as above to write this correspondence as $\theta = \nabla\psi^*(\eta)$. The dual divergence $D_{\psi^*}$ becomes
$$ \begin{aligned}
D_{\psi^*}[\eta_p : \eta_q] &= \psi^*(\eta_p) - \psi^*(\eta_q) - \langle \eta_p - \eta_q, \nabla\psi^*(\eta_q) \rangle \\
&= \psi^*(\eta_p) - \psi^*(\eta_q) - \langle \eta_p, \theta_q \rangle + \langle \eta_q, \theta_q \rangle \\
&= \psi^*(\eta_p) + \psi(\theta_q) - \langle \eta_p, \theta_q \rangle \qquad (A.5)
\end{aligned} $$
where we have used Eq. (A.4) to simplify the final two terms. Similarly,
$$ \begin{aligned}
D_\psi[\theta_p : \theta_q] &= \psi(\theta_p) - \psi(\theta_q) - \langle \theta_p - \theta_q, \nabla\psi(\theta_q) \rangle \\
&= \psi(\theta_p) - \psi(\theta_q) - \langle \theta_p, \eta_q \rangle + \langle \theta_q, \eta_q \rangle \\
&= \psi(\theta_p) + \psi^*(\eta_q) - \langle \theta_p, \eta_q \rangle \qquad (A.6)
\end{aligned} $$
Comparing Eq. (A.5) and Eq.
(A.6), we see that the divergences are equivalent with the arguments reversed, so that: D [ p : q ] =D [ q : p ] (A.7) This indicates that the Bregman divergence D should also be a KL divergence, but with the same order of arguments. We derive this fact directly in Eq. (A.12) , after investigating the form of the conjugate function . 183 A.0.3 Conjugate as Negative Entropy or KL Divergence We rst treat the case of an exponential family with no base measure 0 (!), with derivations including a base measure below. For a distribution p in an exponential family, indexed by p or p , we can write logp(!) = p T (!) (). Then, Eq. (A.4) becomes: ( p ) = p p ( p ) (A.8) = p E p [T (!)] ( p ) (A.9) =E p logp(!) (A.10) =H p (!) (A.11) since p and ( p ) are constant with respect to !. Utilizing ( p ) =E p logp(!) from above, the dual divergence with q becomes: D [ p : q ] = ( p ) ( q )h p q ;r ( q )i =E p logp(!) ( q ) p q + q q =E p logp(!) p q + ( q ) =E p logp(!)E p [T (!) q ] + ( q ) =E p logp(!)E p logq(!) =D KL [p(!)jjq(!)] (A.12) Thus, the conjugate function is the negative entropy and induces the KL divergence as its Bregman divergence (Wainwright and Jordan, 2008). Conjugate as a KL Divergence Note that, by ignoring the base distribution 0 (!) in Eq. (A.8)-Eq. (A.11), we have instead assumed that 0 (!) :=u(!) is uniform over the domain. We now illustrate that the eect of adding a base distribution is to turn the conjugate function into a 184 KL divergence, with the base 0 (!) in the second argument. This is consistent with our derivation of negative entropy, since D KL [p (!)jju(!)] =H p ( ) +const. () = sup () (A.13) = ( ) =E [ T (!)] ( ) =E [ T (!)] ( )E [log 0 (!)] =E [log (!) log 0 (!)] =D KL [ (!)jj 0 (!)] (A.14) Note that we have added and subtracted a factor of E log 0 (!) in the fourth line. Comparing with the derivations in Eq. (A.9)-Eq. (A.10), we need to include a term ofE p [log 0 (!)] in moving to an expected log-probabilityE p [logp(!)], with the extra, subtracted base measure term transforming the negative entropy into a KL divergence. In the Thermodynamic Variational Objective (tvo) setting in Ch. 3, we have a one-dimensional natural parameter vector = equal to the geometric mixture parameter. With the base distri- bution 0 (zjx) =q (zjx) equal to the approximate posterior, the conjugate corresponds to () =D KL [ (zjx)jjq (zjx)]: (A.15) When including a base distribution, the induced Bregman divergence is still the KL divergence since, as in the derivation of Eq. (A.2), both E p logp(!) andE p logq(!) will contain terms involving the base distributionE p log 0 (!). A.0.4 Canonical Divergence The Bregman divergences D and D are in fact equivalent up to reordering of the arguments D [ : 0 ] =D [ 0 :]; (A.16) where we abbreviate 0 = 0 and = . The conjugacy relationships ( 0 ) = 0 0 ( 0 ) () = () (A.17) 185 can be used to translate between these dual divergences. D [ : 0 ] = () ( 0 )h 0 ;r ( 0 )i (A.18) = () ( 0 ) 0 + 0 0 = () + ( 0 ) 0 (A.19) = ( 0 ) () + 0 = ( 0 ) ()h 0 ;r ()i =D [ 0 :] The intermediate expression Eq. (A.19) is known as the canonical form of the Bregman divergence (Amari, 2016), written in mixed coordinates as D ; [ : 0 ] =D [ : 0 ] =D [ 0 :] = () + ( 0 ) 0 (A.20) From this expression, it is clear that the divergence vanishes for and in dual correspondence according to Eq. 
(A.17), since each parameterization refers to the same distribution ( ) + ( ) = 0 =) D [ : ] =D [ : ] = 0: (A.21) This canonical form of the Bregman divergence can more generally be seen as arising as the gap in a convex conjugate optimization for a particular, suboptimal 0 () = sup () 0 () (A.22) which implies the canonical form in Eq. (A.20) () + () 0 0: (A.23) We consider these derivations in more detail and generality in App. B.1, noting that this fact holds for convex conjugate optimizations outside the exponential family. In particular, considering an f-divergence as a convex function of its argument as in the f-GAN framework (Nowozin et al., 2016), we can derive the gap in the conjugate optimization as an appropriate Bregman divergence. We also made use of this observation for the rho-tau Bregman Information in Ch. 5 5.3.1. 186 Appendix B Conjugate Duality Beyond the Exponential Family (Ch. 2 and 5) In this section, we will generalize the conjugate duality for parametric exponential families in the previous section to arbitrary distributions or positive measures. Convex Conjugate For a convex function (~ ) with domain ~ 2D, the conjugate function is dened via the optimization (T ) = sup ~ (z)2D ~ (z);T (z) (~ ) (B.1) whereT2D is associated with a linear functional in the dual space ofD. In what follows, we will often choose ~ (z)2D to be the space of normalized probability distributions or unnormalized positive measures. Inner producth;i notation indicates integration or summation over the domain, for exampleh(z);T (z)i := R (z)T (z)dz. By the Fenchel-Moreau biconjugation theorem, the conjugate operation is an involution for any proper, lower semi-continuous, convex (Boyd and Vandenberghe, 2004), In other words, ( ) = . Note that is also convex in this case. We can thus represent () via a conjugate optimization (~ ) = sup T (z)2D h~ (z);T (z)i (T ); (B.2) The dual variable T (z) is often called a `critic function', and appears prominently in the f- gan framework (Nowozin et al., 2016) for generalizing the generative adversarial network (gan) minimax framework beyond the Jensen-Shannon divergence (Goodfellow et al., 2014). Dual Correspondence Solving for the optimizing argument in each of Eq. (B.1) and Eq. (B.2) yields the following dual correspondence 187 ~ T (z) =r T (T ) T ~ (z) =r (~ ): (B.3) The conjugate optimizations in Eq. (B.1) and B.2 also suggest optimality conditions of the form ~ T (z) = (r ~ ) 1 (T ) and T ~ (z) = (r T ) 1 (~ ). Conjugate duality thus provides an alternative representation of a convex function (), using a related conjugate function (T ) whose input arguments correspond to gradients T =r () of the original function. We provide a graphical interpretation in Fig. 1.1. We discuss an example using 0 () =D KL [ : 0 ] in Sec. 1.3.1. To summarize, we have T (zjx) = 0 (z) expfT (x;z) (x)g 0 (T ) = (x) := logE 0 (z) h e T (x;z) i (B.4) where (x) is a log normalization constant. We will derive closed form expressions for additional special cases of and in Ch. 6. Comparison with Exponential Family Duality To reconcile this interpretation with the exponential family duality in the previous section, we can consider expanding and then re-assigning components of the dot product , where = E (z) [T (z)] represents the expected sucient statistics of the exponential family. For , in dual correspondence, we have () = () =E (z) [T (z)] (): (B.5) When = is a one-dimensional or scalar parameter vector, we can simply divide both sides of Eq. (B.5) by this parameter. 
With slight abuse of notation, we have () =h (z);T (z)i 1 (): (B.6) whereh;i integrates over the domain of z and T (z) is a scalar product over the number of natural parameter or sucient statistic dimensions. Since we have seen in the previous section that the dual potential () = D KL [ (z)k 0 (z)] corresponds to the negative entropy or kl divergence, our approaches above directly consider a statistical divergences as a convex functional of the distribution (z), for example () = D KL [k 0 (z)]. Similarly, we have seen in Sec. 1.3 or Sec. 2.6 that the log partition function 188 () = log R 0 (z)e T (z) dz can be expressed as a function of the (usually xed) sucient statis- tic, `critic', or negative energy function T (x;z) for a xed . We provide detailed derivations in App. C.1.1 to conrm this correspondence. 0 (T ) =h ;T (z);T (z)iD KL [ ;T (z)k 0 (z)] (B.7) When = is a one-dimensional or scalar parameter vector, we can simply divide both sides of Eq. (B.5) by this parameter. 1 0 (T ) =h T (z);T (z)i 1 D KL [ T (z)k 0 (z)]: (B.8) = sup (z) h(z);T (z)i 1 D KL [(z)k 0 (z)] (B.9) We emphasize that, although () = 1 D KL [k 0 (z)] is still a convex function to , the conjugate function changes to include , as shown in Table 6.2. This is the approach taken in Ch. 6 and App. E, where we consider the parameter as a static regularization strength parameter given in the problem description. Varying this parameter over optimization or iteration steps would more closely align with annealed importance sampling or simulated annealing, and principled methods for changing remain an interesting question for future work. To summarize, we translate exponential family duality in the one-dimensional case to a function space duality, where the kl divergence and log partition function still provide the dual potential functions, but the arguments now correspond to (z) and T (z), respectively. With respect to a base measure 0 , ()$ 0 (T ) ()$ 0 (~ ) (B.10) Note that this notation switches the notion of which function is `primal' and which is `dual', although this is interchangeable in general. Comparison with Rho-Tau Representational Duality Finally, from the perspective of rho- tau representational Bregman divergences in Ch. 5, we can view the term ~ (z) as arising from the (~ ) = ~ identity representation function, or log q (u) as q! 0. For the unnormalized exponential family density ~ ;T (z) = ~ 0 (z) expfT (z)g, we see that T (z) = log ~ ;T (z) corresponds to the (~ ) = log ~ representation of the density function, orq = 1 (up to the base measure term). We can 189 also see this in the conjugate duality conditions for thekl divergence and = 1 in Eq. (B.4) below. Note that these interpretations ignore projection onto the probability simplex by normalizing ~ (z). B.1 Bregman Divergence as the Gap in Conjugate Optimization Bregman divergences will play a crucial role in our analysis of the tvo in Ch. 3 and more general annealing paths in Ch. 4. In this section, we present an insightful result showing how the Bregman divergence appears as the gap in the conjugate optimizations in Eq. (B.1) and Eq. (B.2). Bregman Divergence and Dual Bregman Divergence The Bregman divergence generated by a convex function corresponds to the gap in a rst order Taylor approximation of the function (p) around the second argument 2D, D [p :] = (p) ()hp;r ()i (B.11) The Bregman divergence is nonnegative since, by the convexity of , the rst order Taylor approx- imation will everywhere underestimate the function (Boyd and Vandenberghe, 2004). 
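As an editorial aside, the function-space duality summarized above admits a simple numerical check on a discrete domain. The following minimal NumPy sketch (not part of the original derivations; all variable names are illustrative) verifies that the conjugate of the map from a distribution to its KL divergence against a base measure is the log-partition function, that the supremum is attained by the exponentially tilted distribution as in Eq. (B.4), and that evaluating the conjugate pair at corresponding arguments recovers the KL divergence as in Eqs. (B.7)-(B.9).

import numpy as np

rng = np.random.default_rng(0)

# Discrete toy domain: base measure pi0(z) and an arbitrary critic T(z)
pi0 = rng.dirichlet(np.ones(6))          # reference distribution pi_0(z)
T = rng.normal(size=6)                   # critic / negative energy T(z)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Log-partition function (conjugate of the KL divergence), cf. Eq. (B.4)
log_Z = np.log(np.sum(pi0 * np.exp(T)))

# Optimizing argument: exponentially tilted (Gibbs) distribution
pi_T = pi0 * np.exp(T - log_Z)

# (i) The Gibbs distribution attains the supremum of <pi, T> - KL[pi || pi0],
#     and the optimal value equals log E_{pi0}[exp(T)].
value_at_opt = np.dot(pi_T, T) - kl(pi_T, pi0)
assert np.isclose(value_at_opt, log_Z)

# Any other normalized distribution gives a strictly smaller objective value.
pi_other = rng.dirichlet(np.ones(6))
assert np.dot(pi_other, T) - kl(pi_other, pi0) < log_Z

# (ii) Conversely, the conjugate of the log-partition function recovers the
#      KL divergence:  <pi_T, T> - log_Z = KL[pi_T || pi0].
assert np.isclose(np.dot(pi_T, T) - log_Z, kl(pi_T, pi0))

The same check extends to the temperature-scaled case by replacing T with beta * T and dividing the KL term by beta, matching the scalar-parameter discussion above.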
We can also consider the Bregman divergence generated by the convex conjugate of . Letting T p ;T 2D indicate the corresponding dual variables, we can write D [T :T p ] = (T ) (T p )hT T p ;r (T p )i: (B.12) We have reversed the order of the arguments, since we will now show that Eq. (B.11) and Eq. (B.12) are equal, D [p : ] = D [T : T p ]. To show this, rst note that T =r () by the conjugate optimality conditions. Highlighting crucial changes, we rewrite Eq. (B.11) as D [p :] = (p) ()hp;r (y)i +h;T i (B.13) = (p) + (T )hp;T i (B.14) = (T p ) +hp;T p i + (T )hp;T i (B.15) = (T ) (T p )hT T p ;r (T p ) | {z } p i (B.16) =D [T :T p ] (B.17) 190 We have thus shown that the Bregman divergences generated by a conjugate dual pair of functions and are equivalent up to reordering of the arguments, D [p :] = (p)+ (T )hp;T i =D [T :T p ]: (B.18) The expression in mixed coordinates, above or in Eq. (B.13), is known as the canonical form of the Bregman divergence (Amari, 2016). Bregman Divergence as Gap in Fenchel's Inequality Changing notation to re ect ourelbo example, consider an arbitraryp(zjx)2D andT (x;z)2D which are not in dual correspondence according to Eq. (B.3). We use the notationT (x;z) to indicate that this dual variable corresponds to a certain T 2D according to Eq. (B.3). Since the supremum in Eq. (2.38) is achieved for T p (x;z), we can obtain a lower bound on (~ ) using a suboptimal T (x;z), (p) = sup T hp(zjx);T (x;z)i (T ) (B.19) hp(zjx);T (x;z)i (T ): (B.20) Following similar derivations as in Eq. (B.13)-(B.17), we now show that the Bregman divergence corresponds to the gap in this inequality, also known as the Fenchel or Fenchel-Young inequality. Letting T (zjx) be the distribution which is in dual correspondence with T (x;z) (p)hp(zjx);T (x;z)i + (T ) 0 (B.21) (p)hp(zjx);T (x;z)i+h T (z);T (x;z)i ( T ) 0 (B.22) (p) ( T )hp(zjx) T (zjx);T (x;z)i 0 (B.23) (p) ( T )hp(zjx) T (zjx);r ( T )i 0 (B.24) D () p(zjx) : T (zjx) 0 (B.25) where, in the second line, we have expanded the denition of the conjugate function. In Eq. (B.24), we have used the fact that T (x;z) =r ( T ), and the last line follows by the denition of the Bregman divergence in Eq. (B.11). Note that the canonical form of the Bregman divergence from Eq. (B.13) appears in the rst line above, Eq. (B.21). For p(zjx) and T p (x;z) which are in dual correspondence, equality holds and the Bregman divergence is 0. 191 Similar reasoning holds for the conjugate dual function (T p ), (T p ) = sup q(zjx) hq(zjx);T p (x;z)i (q) (B.26) h(zjx);T p (x;z)i () (B.27) for a given (zjx) which is not in dual correspondence with p (zjx). The gap in this lower bound corresponds to the Bregman divergence, with the target T p in the rst argument, (T p )h(zjx);T p (x;z)i + () 0 (B.28) (T p )h(zjx);T p (x;z)i + +h(zjx);T (x;z)i (T ) 0 (B.29) (T p ) (T )hT p (x;z)T (x;z);r (T )i 0 (B.30) D () T p (x;z) :T (x;z) 0 (B.31) Thus, we have shown that the gap in the lower bound induced by a conjugate optimization is a Bregman divergence, where the target is in the rst argument and the optimization variable is in the second argument. (p) =hp(zjx);T (x;z)i (T ) +D () p(zjx) : T (zjx) (B.32) (T p ) =h(zjx);T p (x;z)i () +D () T p (x;z) :T (x;z) (B.33) B.2 Conjugate Duality Interpretation of the Evidence Lower Bound We now provide a worked example of conjugate duality to provide an alternative derivation of the elbo lower bound on logp (x). 
In general, we can approach conjugate dualitiy derivations using the following steps: • Choose a convex function () • Derive its conjugate function (T ) = sup h;Ti (), which also involves solving for the optimizing argument T . • Express () or (T ) using a conjugate optimization (Eq. (B.1)-(B.2)). We can consider a lower bound on either () or (T ) using an optimization over learnedT or, respectively (Eq. (B.20)). 192 • Derive the Bregman divergences associated with and ((Eq. (B.24)), in order to char- acterize the gap in the conjugate optimization lower bound. To derive the elbo, we consider the Bregman divergence and convex duality associated with the function (T p ), forT p (x;z) = log p (x;z) p(z) = logp (xjz) which is in dual correspondence withp(zjx). Conjugate Dual (T ) of the KL Divergence Thekl divergence to a xed reference or prior distribution p(z) is a convex function of the distribution in the rst argument () :=D KL [jjp(z)]. We restrict the kl divergence to accept only normalized distributions as input, so that p(z) is assumed to be normalized. Conditioned on a particular x, we have p(z) (T ) = sup (zjx) (zjx);T (x;z) Z (zjx) log (zjx) p(z) dz (x) Z (zjx)dz 1 +(a;s) (B.34) where we have included an explicit Lagrange multiplier (x) to enforce normalization of (zjx), and (a;s) is a Lagrange multiplier enforcing (zjx) 0. Dierentiating Eq. (B.34) leads to the optimality condition T =r () as in Eq. (B.3), T (x;z) = 1 log (zjx) p(z) + (x)(a;s): (B.35) Inverting this expression to solve for the optimizing argument , we have T (zjx) =p(z) expfT (x;z) (x)g: (B.36) Note that we ignore the Lagrange multiplier (a;s), since the exp function automatically enforces T (zjx) 0 for all z. Using the condition R T (zjx)dz = 1 in Eq. (B.36), we can solve for (x) (x) = log Z p(z) expfT (x;z)gdz = logE p(z) h e T (x;z) i (B.37) Finally, substituting back into the conjugate optimization in Eq. (B.34), we obtain p(z) (T ) = T (zjx); T (x;z) Z T (zjx) log p(z) expf T (x;z) (x)g p(z) dz (x) Z T (zjx)dz 1 | {z } 11=0 : p(z) (T ) = (x) = logE p(z) h e T (x;z) i : (B.38) 193 Thus, we can see that the conjugate function associated with () =D KL [ :p(z)] matches the form of the log-partition function of an exponential family (see App. A.0.1) with base measure p(z) and sucient statistics T (x;z), Conjugate Duality Interpretation of the ELBO We are nally ready to express the elbo as a lower bound induced by the conjugate dual optimization p(z) (T ) = sup h;Ti p(z) (). We choose p(z) as the base or reference distribution and T p (zjx) (x;z) = log p (x;z) p(z) = logp (xjz). With these choices, the optimizing argument T (zjx) =p (zjx) recovers the true posterior, which implies that (T p (zjx) ) = logp (x), T (zjx) =p(zjx) since T (zjx) = expflogp (xjz) (x)g = 1 e (x) p (x;z) (logp (xjz)) = logp (x) since (x) = log Z p(z) exp log p (x;z) p(z) dz = log Z p (x;z)dz Finally, the conjugate optimization for (logp (xjz)) indicates that any suboptimal q (zjx)6= p (zjx) will provide a lower bound, which leads to (logp (xjz)) = logp (x) = sup (zjx) h(zjx); logp (xjz)iD KL [(zjx) :p(z)] (B.39) hq (zjx); logp (xjz)iD KL [q (zjx) :p(z)] (B.40) =E q (zjx) log p (x;z) q (zjx) =:elbo(x;;) (B.41) which matches the elbo. Gap in ELBO as Bregman Divergence As shown in Eq. (3.19), the gap in the elbo is D KL [q (zjx) :p (zjx)]. However, we can also derive this from the Bregman divergence perspective. In particular, we show that D 0 [T p :T q ] =D KL [q (zjx) :p (zjx)] (B.42) 194 for () = D KL [ : 0 ]. First, recall from Eq. 
(B.36) that, in this case, the primal distribu- tion T (zjx) associated with a critic function T (x;z) has an exponential family form T (zjx) = 0 (z) expfT (x;z) (x)g D 0 [T p :T q ] = 0 (T p ) 0 (T q ) T p (x;z)T q (x;z);r 0 (T q ) (B.43) = logE 0 [e Tp (x;z) ] logE 0 [e Tq (x;z) ] T p (x;z)T q (x;z);q (zjx) E q (zjx) [log 0 (z)] =E q (zjx) h log 0 (z) +T q (x;z) logE 0 [e Tq (x;z) ] | {z } logq (zjx) i (B.44) E q (zjx) h log 0 (z) +T p (x;z) logE 0 [e Tp (x;z) | {z } logp (zjx) i =D KL [q (zjx) :p (zjx)] (B.45) which matches the gap in the elbo. This completes our worked example of the elbo, where we have derived both the lower bound and its gap using conjugate duality. Bregman Divergence Generated by KL Divergence Finally, we conrm that the Bregman divergenceD 0 [p (zjx) :q (zjx)] generated by () =D KL [ : 0 ] corresponds to thekl divergence D KL [q (zjx) : p (zjx)]. This aligns with the result in Eq. (B.18) that the Bregman divergences generated by and are equivalent up to reordering of the arguments. Noting thatr q (q) =r q R q (zjx) log q (zjx) 0 (z) dz = log q (zjx) 0 (z) + 1, we have D 0 [p (zjx) :q (zjx)] = 0 (p ) 0 (q ) p (zjx)q (zjx);r 0 (q ) (B.46) = Z p (zjx) log p (zjx) 0 (z) dz Z q (zjx) log q (zjx) 0 (z) dz (B.47) Z p (zjx) log q (zjx) 0 (z) +q (zjx) log q (zjx) 0 (z) +p (zjx)q (zjx) dz = Z p (zjx) log p (zjx) q (zjx) dz Z p (zjx)dz + Z q (zjx)dz: (B.48) =D KL [p (zjx) :q (zjx)] (B.49) App. C.1 provides detailed derivations for additional special cases of conjugate duality, including for thekl divergence over positive measures (Zhu and Rohwer, 1995) and Amari's-divergence (Amari, 2016). We summarize these conjugate functions and their optimizing arguments in Table 6.2. 195 Appendix C Detailed Conjugate Derivations for KL- & -Divergences (Ch. 6) C.1 Conjugate Derivations without Normalization Constraint In this section, we derive the convex conjugate associated with kl and-divergence regularization of the policy (ajs) or state-action occupancy (a;s). We summarize these results in Table 6.2, with equation references in Table C.1. In both cases, we treat the regularizer 1 () as a function of (a;s) and optimize over all states jointly, 1 () = sup 2D ; 1 (): (C.1) These conjugate derivations can be used to reason about the optimal policy via 1 r(a;s) + E s 0 a;s V (s 0 ) V (s)(a;s) , as argued in App. E.1.2, or the worst case reward perturbations using 1 (r). We use r as the argument or dual variable throughout this section. In App. E.2, we derive alternative conjugate functions which optimize over the policy in each state, where(ajs)2 jAj is constrained to be a normalized probability distribution. This conjugate arises in considering soft value aggregation or regularized iterative algorithms as in Sec. 6.2.4. See Table C.2 for equation references. C.1.1 KL Divergence Policy Regularization: 1 0 ; (r) The conjugate function forkl divergence of the policy(ajs) to a reference 0 (ajs) has the following closed form 1 0 ; (r) = 1 X s (s) X s 0 (ajs) expf r(a;s)g 1 ! : (C.2) 196 Divergence 1 (r) r (a;s) r (a;s) 1 D KL [ : 0 ] Eq. (C.2) Eq. (C.5) Eq. (C.6) 1 D KL [ : 0 ] Eq. (C.10) Eq. (C.8) Eq. (C.9) 1 D [ 0 :] Eq. (C.11) Eq. (C.16) Eq. (C.17) 1 D [ 0 :] Eq. (C.20) Eq. (C.21) Eq. (C.22) Table C.1: Equations for r or `mdp Optimality' Conjugate ( Optimization, No Normalization Constraint) Divergence 1 (Q) Q (a;s) Q (a;s) 1 D KL [ : 0 ] Eq. (C.28) Eq. (C.25) Eq. (C.26) 1 D [ 0 :] Eq. (C.32) Eq. (C.31) Eq. 
(C.30) Table C.2: Equations for `Soft Value' V (s) Conjugates ( Optimization, Normalization Constraint) Proof. We start from the optimization in Eq. (6.3) or (E.1), using conditional kl divergence regu- larization 0 () =E (s) [D KL [ : 0 ]] as in Eq. (6.6). 1 0 ; (r) = max (a;s) (a;s); r(a;s) 1 (a;s) log (a;s) (s) 0 (ajs) + 1 X a;s (a;s) 1 X a;s (s) 0 (ajs) (C.3) =) r(a;s) =r 1 (a;s) log (a;s) (s) 0 (ajs) + 1 X a;s (a;s) 1 X a;s (s) 0 (ajs) ! (C.4) Worst-Case Reward Perturbations r (ajs) We can recognize Eq. (C.4) as an instance of Prop. 6.3.2. To derive the worst-case reward perturbations r(a;s), we take care to dierentiate the conditional regularizer with respect to (a;s) as in Neu et al. (2017); Lee et al. (2019), with P s 0 d d(a;s) (s 0 ) = P s 0 ;a 0 d d(a;s) (a 0 ;s 0 ) = P s 0 ;a 0(a 0 ;s 0 =a;s) = 1. r(a;s) = 1 log (a;s) (s) 0 (ajs) 1 X a;s d(a;s) d(a 0 ;s 0 ) | {z } 1 + 1 X a;s (a;s) (s) d P a 00(a 00 ;s) d(a 0 ;s 0 ) | {z } (s=s 0 ) + 1 1 X a;s d P a 00(a 00 ;s) d(a 0 ;s 0 ) | {z } (s=s 0 ) 0 (ajs) = 1 log (a;s) (s) 0 (ajs) + 1 X a (a;s) (s) 1 X a 0 (ajs) = 1 log (a;s) (s) 0 (ajs) : (C.5) In the last line, we assume P a 0 (ajs) = 1 and note that P a (a;s) (s) = P a (a;s) (s) = (s) (s) = 1. 197 Optimizing Argument r (ajs) We derive the conjugate function by solving for the optimizing argument (a;s) in terms of r(a;s) and substituting back into Eq. (C.3). Dening r (ajs) = r (a;s) r (s) as the policy induced by the optimizing argument r in Eq. (C.5), we can solve for r to obtain r (ajs) = 0 (ajs) expf r(a;s)g (C.6) Conjugate Function 1 0 ; (r) We plug this back into the conjugate optimization Eq. (C.3), with r (a;s) =(s) r (a;s). Assuming 0 (ajs) is normalized, we also have P a;s (s) 0 (ajs) = 1 and 1 0 ; (r) = (s) r (ajs); r(a;s) 1 X a;s (s) r (ajs) log 0 0 expfr(a;s)g(s) 0 (ajs) expfr(a;s)g +(s) 0 (ajs) = 1 X s (s) X s 0 (ajs) expf r(a;s)g 1 ! (C.7) as desired. Note that the conjugate function also depends on the regularization strength . Finally, we verify that our other conjugate optimality condition r (a;s) = r r 1 0 ; 1 ( r ), or r (a;s) =r r 1 0 ; (r ) holds for this conjugate function. Indeed, dierentiating with respect to r(a;s) above, we see thatr r 1 0 ; (r) = (s) 0 (ajs) expf r(a;s)g matches r (a;s) =(s) r (ajs) via Eq. (C.6). Although our regularization 0 () =E (s) [D KL [ : 0 ]] applies at each 0 (ajs), we saw that performing the conjugate optimization over (a;s) led to an expression for a policy r (ajs) = r (a;s)=(s) that is normalized by construction P a r (ajs) = P a r (a;s) (s) = 1. Conversely, for a given normalized(ajs), the above conjugate conditions yield r (a;s) such that Eq. (C.6) is also normalized. 198 C.1.2 KL Divergence Occupancy Regularization: 1 0 ; (r) Nearly identical derivations as App. C.1.1 apply when regularizing the divergence 0 () =D KL [ : 0 ] between the joint state-action occupancy (a;s) and a reference 0 (a;s). This leads to the following results Worst-Case Perturbations: r (a;s) = 1 log (a;s) 0 (a;s) (C.8) Optimizing Argument: r (a;s) = 0 (a;s) expf r(a;s)g: (C.9) Conjugate Function: 1 0 ; (r) = 1 X a;s 0 (a;s) expf r(a;s)g 1 X a;s 0 (a;s) (C.10) Such regularization schemes appear in reps (Peters et al., 2010), while Bas-Serrano et al. (2021) consider both policy and occupancy regularization. 
C.1.3 -Divergence Policy Regularization: 1 () 0 ; (r) The conjugate function for -divergence regularization of the policy (ajs) to a reference 0 (ajs) takes the form 1 () 0; (r) = 1 1 X a;s (s) 0 (ajs) 1 +( 1) r(a;s) r (s;) 1 1 + X s (s) r (s;): (C.11) where r (s;) is a normalization constant for the optimizing argument r (ajs) corresponding to r(a;s). We provide explicit derivations of the conjugate function instead of leveraging f-divergence duality (Belousov and Peters, 2019; Nachum and Dai, 2020) in order to account for the eect of optimization over (a;s). We will see in App. C.2.2 see that the conjugate in Eq. (C.11) takes a similar to form as the conjugate with restriction to normalized(ajs)2 jAj , where this constraint is not captured using f-divergence function space duality. 199 Proof. We begin by writing the -divergence () 0 () = E (s) [D [ 0 : ]] as a function of the occupancy measure , with (ajs) = (a;s) (s) . As in Prop. 6.3.2, the conjugate optimization implies an optimality condition for r(a;s). 1 () 0 ; (r) = max (a;s) (a;s); r(a;s) 1 1 (1) 0 @ (1) X a;s (s) 0 (ajs) + X a;s (a;s) X a;s (s) 0 (ajs) 1 (a;s) (s) 1 A (C.12) =) r(a;s) =r 1 1 (1) 0 @ (1) X a;s (s) 0 (ajs) + X a;s (a;s) X a;s (s) 0 (ajs) 1 (a;s) (s) 1 A Worst-Case Reward Perturbations r (ajs) Dierentiating with respect to(a;s) as in Lee et al. (2019), r(a;s) =r 1 () 0 () (C.13) = 1 1 (1) X a 0 ;s 0 d d(a;s) (1)(s 0 )0(a 0 js 0 ) +(a 0 ;s 0 )(s 0 ) 1 0(a 0 js 0 ) 1 (a 0 ;s 0 ) = 1 1 (1) (1) X s 0 d P a 0 (a 0 ;s 0 ) d(a;s) | {z } (s=s 0 ) X a 0 0(a 0 js) + X a 0 ;s 0 d(a 0 ;s 0 ) d(a;s) | {z } (a 0 ;s 0 =a;s) X a 0 ;s 0 (a 0 ;s 0 ) 1 d(a 0 ;s 0 ) d(a;s) | {z } (a 0 ;s 0 =a;s) (s 0 ) 1 0(a 0 js 0 ) 1 (1) X a 0 ;s 0 (s 0 ) d P a 0 (a 0 ;s 0 ) d(a;s) | {z } (s=s 0 ) 0(a 0 js 0 ) 1 (a 0 ;s 0 ) = 1 1 (1) (1) X a 0(ajs) + (a;s) (s)0(ajs) 1 (1) X a 0 0(ajs) 1 (a;s) (s) (1) = 1 1 + 1 1 1 1 1 1 (ajs) 0(ajs) 1 1 1 X a 0(ajs) 1 (ajs) = 1 1 1 (ajs) 0(ajs) 1 1 + 1 1 1 X a 0(ajs) 1 (ajs) ! (C.14) where we have rewritten (1) in terms of the policy(ajs) = (a;s) (s) and assumed 0 (ajs) is normalized. Letting r (ajs) indicate the policy which is in dual correspondence with r(a;s), we would eventually like to invert the equality in Eq. (C.14) to solve for (ajs) in each (a;s). However, the nal term depends on a sum over all actions. To handle this, we dene r (s;) = 1 1 X a 0 (ajs) X a 0 (ajs) 1 r (ajs) : (C.15) Since r (ajs) = r (a;s) (s) is normalized by construction, the constant r (s;) with respect to actions has appeared naturally when optimizing with respect to (a;s). In App. C.2.2-E.2.1, we will relate this quantity to the Lagrange multiplier used to enforce normalization when optimizing over (ajs)2 jAj . 200 Finally, we use Eq. (C.14) to write r (a;s) as r (a;s) = 1 log (ajs) 0 (ajs) + r (s;): (C.16) Optimizing Argument r (ajs) Solving for the policy in Eq. (C.16) and denoting this as r (ajs), we have r (ajs) = 0 (ajs) exp r(a;s) r (s;) = 0 (ajs) 1 +( 1) (r(a;s) r (s;)) 1 1 + (C.17) Note that r (ajs) is dened in self-consistent fashion due to the dependence of r (s;) on r (ajs) in Eq. (C.15). Further, r (s;) does not appear as a divisive normalization constant for general , which is inconvenient for practical applications (Lee et al., 2019; Chow et al., 2018). Conjugate Function 1 () 0 ; (r) Finally, we plug this into the conjugate optimization Eq. (C.12). Although we eventually need to obtain a function of r(a;s) only, we write r (ajs) in initial steps to simplify notation. 
1 () 0 ; (r) = (s) r (ajs); r(a;s) 1 1 (1) (1) X a;s (s) 0 (ajs) + X a;s (s) r (ajs) X a;s (s) r (ajs) r (ajs) 0 (ajs) 1 = (s) r (ajs); r(a;s) 1 1 X a;s (s) 0 (ajs) 1 1 1 X a;s (s) r (ajs) (C.18) + 1 1 (1) X a;s (s) r (ajs) 1 + ( 1) | {z } 1 r(a;s) r (s;) (1) = 1 1 | {z } 1 (s) r (ajs); r(a;s) 1 1 X a;s (s) 0 (ajs) 1 1 1 1 1 (1) | {z } 1 1 X a;s (s) r (ajs) + 1 r (s;) (2) = 1 1 X a;s (s) r (ajs) + 1 X a;s (s) r (ajs) r(a;s) 1 r (s;) + 1 r (s;) 1 1 X a;s (s) 0 (ajs) where in (1) we note that 1 1 (1) ( 1) = 1 . In (2), we add and subtract the term in blue, which will allow to factorize an additional term of [1 +( 1)(r r (s;))] and obtain a function of r(a;s) only 1 () 0 ; (r) = 1 1 X a;s (s) r (ajs) 1 +( 1) r(a;s) r (s;) + 1 + 1 | {z } 1 r (s;) 1 1 X a;s (s) 0 (ajs) (1) = 1 1 X a;s (s) 0 (ajs) 1 +( 1) r(a;s) r (s;) 1 1 + X s (s) r (s;)i (C.19) where in (1) we have used Eq. (C.17) and 1 1 + 1 = 1 , along with P a 0 (ajs) = 1. 201 Conrming Conjugate Optimality Conditions Finally, we conrm that dierentiating Eq. (C.19) with respect to r(a;s) yields the conjugate condition r (ajs) =r 1 () 0 ; (r). Noting that 1 1 = 1 1 , r 1 () 0 ; (r) = (1) 1 P s (s) P a 0 (ajs) 1 +( 1) r(a;s) r (s;) 1 1 dr(a;s) dr(a 0 ;s 0 ) d r (s;) dr(a 0 ;s 0 ) + P s (s) d r (s;) dr(a 0 ;s 0 ) which simplies to r =r 1 () 0 ; (r) = 0 (ajs) 1 +( 1) r(a;s) r (s;) 1 1 and matches Eq. (C.17). C.1.4 -Divergence Occupancy Regularization: 1 () 0 ; (r) The conjugate function 1 () 0 ; (r) for -divergence regularization of the state-action occupancy (a;s) to a reference 0 (a;s) can be written in the following form 1 () 0 ; (r) = 1 1 X a;s 0 (a;s) 1 +( 1)r(a;s) 1 1 1 (C.20) Note that this conjugate function can also be derived directly from the duality of general f- divergences, and matches the form of conjugate considered in (Belousov and Peters, 2019; Nachum and Dai, 2020). Proof. Worst-Case Reward Perturbations r (ajs) r(a;s) = 1 1 (1) r (1) X a;s 0 (a;s) + X a;s (a;s) X a;s 0 (a;s) 1 (a;s) ! (C.21) = 1 1 1 1 1 1 0 (a;s) 1 (a;s) 1 Optimizing Argument r (a;s). Solving for r (a;s), r (a;s) = 0 (a;s) exp f r(a;s)g = 0 (a;s)[1 +(1)r(a;s)] 1 1 + (C.22) 202 Conjugate Function 1 () 0 ; (r). Plugging this back into the conjugate optimization, we obtain 1 () 0; = r ; r 1 1 (1) (1) X a;s 0 (a;s) + X a;s r (a;s) X a;s r (a;s) r (a;s) 0 (a;s) 1 | {z } 1+(1)r(a;s) = 1 1 r (a;s); r(a;s) 1 1 X a;s 0 (a;s) + 1 1 (1) 1 1 1 X a;s r (a;s) = 1 0 (a;s) 1 +( 1)r(a;s) 1 1 ; r(a;s) + 1 1 X a;s 0 (a;s) 1 +( 1)r(a;s) 1 1 1 1 X a;s 0 (a;s) = X a;s 0 (a;s) 1 +( 1)r(a;s) 1 1 1 1 1 +( 1)r(a;s) 1 1 X a;s 0 (a;s) (C.23) = 1 1 X a;s 0 (a;s) 1 +( 1)r(a;s) 1 1 1 X a;s 0 (a;s) (C.24) where, to obtain the exponent in the last line, note that 1 1 + 1 = 1 . C.2 Conjugate Derivations with Normalization Constraint In this section, we consider conjugate optimizations where the primal variables(ajs) are restricted to be normalized probability distributions (Ruderman et al., 2012). We justify our notation for this case in App. E.2, which aligns with the problem of soft-value aggregation in regularized rl. Again, these results translate to other choices of dual variables and distributions over other random variables beyond the Q(a;s) used here. For example, the log-mean-exp form of the kl divergence conjugate function (T ) in Eq. (C.28) was used in the context of mutual information estimation in Ch. 2 Sec. 2.6.2 Eq. 
(2.44) C.2.1 KL Divergence Soft Value Aggregation: 1 0 ; (Q) We proceed to derive a closed form for the conjugate function of the kl divergence 0 () as a function of (ajs)2 jAj , which we write using the Q-values as input 1 0 ; (Q) = max (ajs) ;Q 1 (ajs) log (ajs) 0 (ajs) + 1 X a (ajs) 1 X a 0 (ajs) ! Q (s;) 0 @ X a2A (ajs) 1 1 A + X a2A (a;s) =) Q(a;s) = 1 log (ajs) 0(ajs) + Q(s;)(a;s) (C.25) 203 Optimizing Argument Solving for yields the optimizing argument Q (ajs) = 0 (ajs) exp Q(a;s) Q (s;)) (C.26) where we can ignore the Lagrange multiplier for the nonnegativity constraint(a;s) since expfg 0 ensures Q (ajs) 0: We can pull the normalization constant out of the exponent to solve for Q (s;) = 1 log X a 0 (ajs) expfQ(a;s)g : (C.27) Plugging Eq. (C.26) into the conjugate optimization, 1 0 ; (Q) = Q ;Q 1 Q ; log 0 (ajs) 0 (ajs) exp Q(a;s) Q (s;) + 1 X a Q (ajs) | {z } 1 1 Q (s;) X a2A Q (ajs) | {z } 1 1 : Conjugate Function We nally recover the familiar log-mean-exp form for the kl-regularized value function V (s) 1 0 ; (Q) = Q (s;) = 1 log X a 0 (ajs) expfQ(a;s)g: (C.28) Notice that the conjugate or value functionV (s) 1 0 ; (Q) is exactly equal to the normalization constant of the policy Q (s;). We will show in App. C.2.2 that this property does not hold for general -divergences, with example visualizations in App. E.2.1 Fig. E.1. C.2.2 -Divergence Soft Value Aggregation: 1 0 ; (Q) We now consider soft value aggregation using the -divergence, where in contrast to App. C.1.3, we perform the conjugate optimization over (ajs)2 jAj in each state, with Lagrange multipliers Q (s;) and (a;s) to enforce normalization and nonnegativity. 1 0; (Q) = max (ajs) ;Q 1 1 X a 0 (ajs) 1 1 1 X a (ajs) + 1 1 (1) X a 0 (ajs) 1 (ajs) (C.29) Q (s;) X a (ajs) 1 ! + X a (a;s) =) Q(a;s) = 1 log (ajs) 0 (ajs) + Q (s;)(a;s) (C.30) 204 Optimizing Argument Solving for yields the optimizing argument for the soft value aggre- gation conjugate, Q (ajs) = 0 (ajs) exp Q(a;s) +(a;s) Q (s;)) : (C.31) Unlike the case of the standard exponential, we cannot easily derive a closed-form solution for Q (s;). Note that the expressions in Eq. (C.30) and Eq. (C.31) are similar to the form of the worst-case reward perturbations r (ajs) in Eq. (C.16) and optimizing policy r (ajs) in Eq. (C.17), except for the fact that Q (s;) arises as a Lagrange multiplier and does not have the same form as r (s;) = 1 (1)D [ 0 : ] as in Eq. (6.23) and Eq. (C.15). We will nd that Q (s;) and r (s;) dier by a term of V (s) in App. E.2.1 (Eq. (E.23)). Conjugate Function Plugging Eq. (C.31) into the conjugate optimization, we use similar deriva- tions as in Eq. (C.18)-Eq. (C.19) to write the conjugate function, or regularized Bellman optimality operator as V (s) 1 () 0 ; (Q) = 1 1 X a 0 (ajs) h 1 +( 1) Q(a;s) +(a;s) Q (s;) i 1 + 1 1 + Q (s;) (C.32) = 1 1 X a 0 (ajs) exp n Q(a;s) +(a;s) Q (s;) o 1 1 + Q (s;) Comparison with KL Divergence Regularization Note that for general , the conjugate or value function V (s) = 1 0 ; (Q) in Eq. (C.32) is not equal to the normalization constant of the policy Q (s;). We discuss this further in the next section. We also note that the form of the conjugate function is similar using two dierent approaches: optimizing over with an explicit normalization constraint, as in Eq. (C.32), or optimizing over with regularization of but no explicit normalization constraint, as in App. C.1.3 or Table 6.2. This is in contrast to the kl divergence, where the normalization constraint led to a log-mean-exp conjugate in Eq. 
(C.25) which is dierent from App. C.1 Eq. (C.2). 205 Appendix D Proof of Prop. D.0.1: Linear Bias Reduction for AIS (Ch. 2 and 3) In this section, we prove Prop. D.0.1, which relates the sum of the gaps in the single-sample Annealed Importance Sampling (ais) upper and lower bounds to the symmetrized kl divergence between the endpoint distributions. Since this proof assumes perfect transitions, it turns out to be a special case of the analy- sis for the tvo in Ch. 3, where the symmetrized kl divergence appears as the area of a rect- angle as in Fig. 3.5. In particular, combining the left- and right- Riemann sums with uniform spacing in (i.e. linear scheduling), all intermediate terms cancel and we are left with the de- sired 1 T D KL [ T (zjx)k 0 (zjx)] +D KL [ 0 (zjx)k T (zjx)] as the sum of the gap in the bounds. See Sec. 3.9 for an explicit connection between the tvo objective and single-sample ais under perfect transitions. Proposition D.0.1 (Complexity in T ). Assuming perfect transitions and a geometric annealing path with linearly-spacedf t g T t=1 , the gap of the ais upper and lower bounds (Eq. (2.20)) reduces linearly with increasing T , eubo ais (x; 0 ;T )elbo ais (x; 0 ;T ) = 1 T D KL [ T (zjx)k 0 (zjx)] +D KL [ 0 (zjx)k T (zjx)] : (D.1) Note thateubo ais (x; 0 ;T )elbo ais (x; 0 ;T ) on the left hand side also corresponds to the sum of the gapsD KL [q ais prop (z 0:T jx)kp ais tgt (z 0:T jx)]+D KL [p ais tgt (z 0:T jx)kq ais prop (z 0:T jx)]. This result translates to mutual information bounds as in Sec. 2.1.1. In contrast to Grosse et al. (2013) Thm. 1, our linear bias reduction result holds for nite T . 206 Proof. For linear scheduling and perfect transitions, we simplify the dierence in the single-sample upper and lower bounds as T;K=1 L + T;K=1 U =eubo ais (x; 0 ;T )elbo ais (x; 0 ;T ) =E z 0:T p ais tgt log p ais tgt (x;z 0:T ) q ais prop (z 0:T jx) E z 0:T q ais prop log q ais prop (z 0:T jx) p ais tgt (x;z 0:T ) =E z 0:T p ais tgt 2 6 6 4 log T (z T jx) T Q t=1 T r (z t1 jz t ) 0 (z 0 jx) T Q t=1 T f (z t jz t1 ) 3 7 7 5 E z 0:T q ais prop 2 6 6 4 log T (z T jx) T Q t=1 T r (z t1 jz t ) 0 (z 0 jx) T Q t=1 T f (z t jz t1 ) 3 7 7 5 =E z 0:T p ais tgt " log T Y t=1 ~ T (x;z t ) ~ 0 (z t jx) tt1 # E z 0:T q ais prop " log T Y t=1 ~ T (x;z t ) ~ 0 (z t jx) tt1 # =E z 0:T p ais tgt " T X t=1 ( t t1 ) log ~ T (x;z t ) ~ 0 (z t jx) # E z 0:T q ais prop " T X t=1 ( t t1 ) log ~ T (x;z t ) ~ 0 (z t jx) # : (D.2) (1) = T X t=1 E t (z) ( t t1 ) log ~ T (x;z) ~ 0 (zjx) T X t=1 E t1 (z) ( t t1 ) log ~ T (x;z) ~ 0 (zjx) (2) = 1 T E T (z) log ~ T (x;z) ~ 0 (zjx) 1 T E 0(z) log ~ T (x;z) ~ 0 (zjx) = 1 T D KL ( 0 k T ) +D KL ( T k 0 ) ; where in (2), we use the linear annealing schedule t t1 = 1 T 8t and note that intermediate terms cancel in telescoping fashion. In (1), we have used the assumption of perfect transitions (pt), which is common in analysis of ais (Neal, 2001; Grosse et al., 2013). In this case, the ais proposal and target distributions have the following factorial form z 0:T q ais prop (z (1:K) 0:T jx) (pt) = 0 (z 0 ) T Y t=1 t1 (z t ); (D.3) z 0:T p ais tgt (z (1:K) 0:T jx) (pt) = T (z T ) T Y t=1 t1 (z t1 ): (D.4) In other words, for 1 t T , perfect transitions results in independent, exact samples from z t t1 (z) in the forward direction, andz t t (z) in the reverse direction. Using the factorized structure of Eq. (D.3) and Eq. (D.4), the expectations over the extended state space simplify to a sum of expectations at each z t . 
The above proves the proposition for the case of single sample ais, but our multi-sample ais bounds will also inherit this linear bias reduction since they are tighter. See Brekelmans et al. (2022b) App E2 and App H4 for proofs. 207 Appendix E Appendix for \Your Policy Regularizer is Secretly an Adversary" (Ch. 6) For the purposes of the reinforcement learning applications in Ch. 6, we have usedr(a;s), r(a;s), and r 0 (a;s) as dual variables in this section, instead of T (x;z) as in Ch. 2. However, these derivations are equivalent and indeed, Husain and Knoblauch (2022) translates similar robustness results from the setting of rl (Husain et al., 2021; Brekelmans et al., 2022a) to the language of Bayesian inference. E.1 Implications of Conjugate Duality Optimality Conditions In this section, we show several closely-related results which are derived from the conjugate op- timality conditions. We provide additional commentary in later Appendix sections which more closely follow the sequence of the main text. First, recall from Section 6.2.1 the denition of the conjugate optimizations for functions over X :=AS. We restrict 2R AS + to be a nonnegative function overX , so that 1 (r) = sup 2R AS + ; r 1 (); 1 () = sup r2R AS ; r 1 (r); (E.1) and the implied optimality conditions are r = 1 r () = r r 1 1 () = 1 r r (r) = r 1 1 (r): (E.2) 208 E.1.1 Proof of Prop. 6.3.2 : Policy Form Worst-Case Reward Perturbations Proposition 6.3.2. For a given policy (ajs) or state-action occupancy (a;s), the worst-case adversarial reward perturbations r or r associated with a convex function () and regular- ization strength 1= are r =r 1 (): (6.20) Proof. The reward perturbations are dened via conjugate optimization for () in Eq. (E.1), where r2 R AS . The proposition follows directly from the optimality condition in Eq. (E.2), and we focus on the r = 1 r () condition for convenience. In App. C.1, we derive the explicit forms for the worst-case reward perturbations for kl and -divergence regularization from Sec. 6.3.2 of the main text. See App. C.1 Table C.1 for references to particular derivations. Note that we do not consider further constraints on in the conjugate optimization. Instead, we view the Bellman ow constraints(a;s)2M (and normalization constraint(a;s)2 jAjjSj ) as arising from the overall (regularized) mdp optimization in Eq. (6.10) or (6.12), as we discuss in the next subsection. E.1.2 Optimal Policy in a Regularized MDP In Lemma E.1.1 below, we show that the Bellman ow constraints Eq. (6.10), which are enforced by the optimal Lagrange multipliers V (s), ensure that the optimal (a;s) is normalized. This suggests that an explicit normalization constraint is not required. In Prop. E.1.2, we then proceed to derive the optimal policy in a regularized mdp using the conjugate optimality conditions in Eq. (E.2). Lemma E.1.1 (Flow Constraints Ensure Normalization). Assume that the initial state distribution 0 (s) and transition dynamics P (s 0 ja;s) are normalized, with P s 0 (s) = 1 and P s 0P (s 0 ja;s) = 1. If a state-occupancy measure satises the Bellman ow constraints (a;s)2M, then it is a normalized distribution (a;s)2 jAjjSj . Proof. 
E.1.2 Optimal Policy in a Regularized MDP

In Lemma E.1.1 below, we show that the Bellman flow constraints Eq. (6.10), which are enforced by the optimal Lagrange multipliers $V^{*}(s)$, ensure that the optimal $\mu^{*}(a,s)$ is normalized. This suggests that an explicit normalization constraint is not required. In Prop. E.1.2, we then proceed to derive the optimal policy in a regularized MDP using the conjugate optimality conditions in Eq. (E.2).

Lemma E.1.1 (Flow Constraints Ensure Normalization). Assume that the initial state distribution $\nu_0(s)$ and transition dynamics $P(s'|a,s)$ are normalized, with $\sum_{s}\nu_0(s)=1$ and $\sum_{s'}P(s'|a,s)=1$. If a state-occupancy measure satisfies the Bellman flow constraints $\mu(a,s)\in\mathcal{M}$, then it is a normalized distribution $\mu(a,s)\in\Delta^{|\mathcal{A}||\mathcal{S}|}$.

Proof. Starting from the Bellman flow constraints $\sum_a\mu(a,s) = (1-\gamma)\,\nu_0(s) + \gamma\sum_{a',s'}P(s|a',s')\,\mu(a',s')$, we consider taking the summation over states $s\in\mathcal{S}$,
\[
\sum_{a,s}\mu(a,s) = (1-\gamma)\sum_{s}\nu_0(s) + \gamma\sum_{s}\sum_{a',s'}P(s|a',s')\,\mu(a',s')
\overset{(1)}{=} (1-\gamma) + \gamma\sum_{a',s'}\mu(a',s')\sum_{s}P(s|a',s')
\overset{(2)}{=} (1-\gamma) + \gamma\sum_{a',s'}\mu(a',s'),
\]
where (1) uses the normalization assumption on $\nu_0(s)$ and the distributive law, and (2) uses the normalization assumption on $P(s|a',s')$. Finally, we rearrange the first and last equality to obtain
\[
(1-\gamma)\sum_{a,s}\mu(a,s) = (1-\gamma) \implies \sum_{a,s}\mu(a,s) = 1, \tag{E.3}
\]
which shows that $\mu(a,s)$ is normalized as a joint distribution over $a\in\mathcal{A}$, $s\in\mathcal{S}$, as desired.

Proposition E.1.2 (Optimal Policy in Regularized MDP). Given the optimal value function $V^{*}(s)$ and Lagrange multipliers $\lambda^{*}(a,s)$, the optimal policy in the regularized MDP is given by
\[
\mu^{*} = \nabla_{r}\frac{1}{\beta}\Omega^{*}\Big( r + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}\big[V^{*}\big] - V^{*} + \lambda^{*} \Big)
= \Big(\nabla_{\mu}\frac{1}{\beta}\Omega\Big)^{-1}\Big( r + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}\big[V^{*}\big] - V^{*} + \lambda^{*} \Big).
\]
This matches the conjugate conditions in Eq. (E.2) using the argument $r(a,s)\leftarrow r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')] - V^{*}(s) + \lambda^{*}(a,s)$.

Proof. In Sec. 6.2.4, we moved from the regularized primal optimization (Eq. (6.12)) to the dual optimization (Eq. (6.14)) via the regularized Lagrangian
\[
\min_{V,\lambda}\;\max_{\mu}\;(1-\gamma)\big\langle \nu_0, V\big\rangle + \Big\langle \mu,\; r + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V] - V + \lambda \Big\rangle - \frac{1}{\beta}\Omega_{\pi_0}(\mu).
\]
Note that the Lagrange multipliers $\lambda(a,s)$ enforce $\mu(a,s)\ge 0$ while $V(s)$ enforces the flow constraints and thus, by Lemma E.1.1, normalization of $\mu(a,s)$. We recognized the final two terms as a conjugate optimization
\[
\frac{1}{\beta}\Omega^{*}_{\pi_0}\Big( r + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V] - V + \lambda \Big) = \max_{\mu}\;\Big\langle \mu,\; r + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V] - V + \lambda \Big\rangle - \frac{1}{\beta}\Omega_{\pi_0}(\mu) \tag{E.4}
\]
to yield a dual optimization over $V(s)$ and $\lambda(a,s)$ only in Eq. (6.14). After solving the dual optimization for the optimal $V^{*}(s)$, $\lambda^{*}(a,s)$, we can recover the optimal policy in the MDP using the optimizing argument of Eq. (E.4). Differentiating Eq. (E.4) and solving for $\mu$ yields the condition $\nabla_{\mu}\frac{1}{\beta}\Omega(\mu^{*}) = r + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}] - V^{*} + \lambda^{*}$, which we invert to obtain Prop. E.1.2. The other equality follows from the conjugate optimality conditions in Eq. (E.2).

Table 6.2 provides explicit forms for the optimal policy or state-action occupancy, where the same results for $\nabla\frac{1}{\beta}\Omega^{*}(\cdot)$ in App. C.1 apply with $r(a,s)\leftarrow r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s)$ as the dual variable. Note that the argument to the conjugate function in Eq. (E.4) accounts for the flow and nonnegativity constraints using $V^{*}(s)$ and $\lambda^{*}(a,s)$, in contrast to $\Delta r(a,s)$ in App. E.1.1. For $\alpha$-divergence regularization, the optimal policy is

Policy Reg. (App. C.1.3 Eq. (C.17)):
\[
\mu^{*}(a,s) = \mu(s)\,\pi_0(a|s)\,\exp_{\alpha}\Big\{ \beta\Big( r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')] - V^{*}(s) + \lambda^{*}(a,s) - \Psi_{\Delta r}(s;\beta) \Big) \Big\} \tag{E.5}
\]
Occupancy Reg. (App. C.1.4 Eq. (C.22)):
\[
\mu^{*}(a,s) = \mu_0(a,s)\,\exp_{\alpha}\Big\{ \beta\Big( r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')] - V^{*}(s) + \lambda^{*}(a,s) \Big) \Big\} \tag{E.6}
\]
where $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta\alpha}\big(\sum_a \pi_0(a|s) - \sum_a \pi_0(a|s)^{1-\alpha}\pi^{*}(a|s)^{\alpha}\big)$ appears from differentiating $\frac{1}{\beta}\Omega^{(\alpha)}_{\pi_0}(\mu)$ as in Eq. (6.23) or App. C.1.3. This means that the optimal policy is only available in self-consistent fashion, with the normalization constant inside the $\exp_{\alpha}$, which can complicate practical applications (Lee et al., 2019; Chow et al., 2018).
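The following sketch illustrates Lemma E.1.1 and Prop. E.1.2 on a small randomly generated MDP (an assumed example, not from the thesis), minimizing the dual objective of Eq. (6.14) under generalized-KL occupancy regularization and recovering the occupancy measure from the conjugate optimality condition. Since the KL conjugate keeps $\mu>0$, the nonnegativity multipliers $\lambda$ can be taken to be zero here.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
S, A, gamma, beta = 3, 2, 0.9, 2.0
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = P(s' | s, a)
r = rng.uniform(0, 1, size=(S, A))           # rewards r(a, s), indexed [s, a]
nu0 = np.ones(S) / S                         # initial state distribution
mu0 = np.ones((S, A)) / (S * A)              # normalized reference measure mu_0(a, s)

def advantage(V):
    # u(a, s) = r(a, s) + gamma * E_{s'|a,s}[V(s')] - V(s)
    return r + gamma * np.einsum('sax,x->sa', P, V) - V[:, None]

def dual(V):
    # Eq. (6.14) with (generalized) KL occupancy regularization:
    # (1-gamma) <nu0, V> + (1/beta) sum_{a,s} mu0 * (exp(beta * u) - 1)
    return (1 - gamma) * nu0 @ V + np.sum(mu0 * (np.exp(beta * advantage(V)) - 1.0)) / beta

V_star = minimize(dual, x0=np.zeros(S), method="BFGS", options={"gtol": 1e-10}).x
mu_star = mu0 * np.exp(beta * advantage(V_star))   # conjugate optimality condition (Prop. E.1.2)

# Lemma E.1.1: the flow constraints (stationarity of the dual in V) imply normalization of mu_star
flow_lhs = mu_star.sum(axis=1)                                      # sum_a mu(a, s)
flow_rhs = (1 - gamma) * nu0 + gamma * np.einsum('sax,sa->x', P, mu_star)
print(np.abs(flow_lhs - flow_rhs).max(), mu_star.sum())             # ~0 and ~1
```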
E.1.3 Proof of Prop. 6.3.4: Policy Form Worst-Case Perturbations Match Value Form at Optimality

The substitution $r(a,s)\leftarrow r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s)$ above already anticipates the result in Prop. 6.3.4, which links the reward perturbations for the optimal policy $\Delta r_{\pi^{*}}$ or state-action occupancy $\Delta r_{\mu^{*}}$ to the advantage function. See the proof of Thm. 6.3.3 in Brekelmans et al. (2022a); Husain et al. (2021) for additional context in relation to the value-form reward perturbations $\Delta r_{V}(a,s)$.

Proposition 6.3.4. For the optimal policy $\pi^{*}(a|s)$ and value function $V^{*}(s)$ corresponding to $\alpha$-divergence policy regularization with strength $1/\beta$, the policy and value forms of the worst-case adversarial reward perturbations match, $\Delta r_{\pi^{*}} = \Delta r_{V^{*}}$, and are related to the advantage function via
\[
\Delta r_{\pi^{*}}(a,s) = Q^{*}(a,s) - V^{*}(s) + \lambda^{*}(a,s), \tag{6.26}
\]
where we define $Q^{*}(a,s):= r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]$ and recall $\lambda^{*}(a,s)\,\pi^{*}(a|s)=0$ by complementary slackness. Note that $V^{*}(s)$ depends on the regularization scheme via the conjugate function in Eq. (6.25).

Proof. The result follows by combining Prop. 6.3.2, which states that $\Delta r_{\pi} = \nabla_{\mu}\frac{1}{\beta}\Omega(\mu)$, and Prop. E.1.2, which implies $\nabla_{\mu}\frac{1}{\beta}\Omega(\mu^{*}) = r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s)$ as a condition for optimality of $\{\mu^{*}(a,s), V^{*}(s), \lambda^{*}(a,s)\}$. Thus, for the optimal policy $\pi^{*}(a|s)$ and Lagrange multipliers $\{V^{*}(s), \lambda^{*}(a,s)\}$, we have
\[
\Delta r_{\pi^{*}}(a,s) = r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s),
\]
and similarly for $\Delta r_{\mu^{*}}(a,s)$.

We can confirm this using the expression for the optimal policy in Eq. (E.5) and the worst-case reward perturbations in Sec. 6.3.2. For example, recalling $\mu^{*}(a,s) = \mu(s)\,\pi^{*}(a|s)$, we can write the $\alpha$-divergence policy regularization case as
\[
\Delta r_{\pi^{*}}(a,s) = \frac{1}{\beta}\log_{\alpha}\frac{\pi^{*}(a|s)}{\pi_0(a|s)} + \Psi_{\Delta r}(s;\beta)
= \Big( r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s) - \Psi_{\Delta r}(s;\beta) \Big) + \Psi_{\Delta r}(s;\beta).
\]

E.1.4 Path Consistency and KKT Conditions

Finally, note that the KKT optimality conditions (Boyd and Vandenberghe, 2004) include the condition that we have used in the proof of Prop. 6.3.4. At optimality, we have
\[
r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s) - \nabla_{\mu}\frac{1}{\beta}\Omega(\mu^{*}) = 0. \tag{E.7}
\]
This KKT condition is used to derive path consistency objectives in Nachum et al. (2017); Chow et al. (2018). For general $\alpha$-divergence policy regularization, we substitute $\nabla_{\mu}\frac{1}{\beta}\Omega^{(\alpha)}_{\pi_0}(\mu^{*}) = \Delta r_{\pi^{*}}(a,s) = \frac{1}{\beta}\log_{\alpha}\frac{\pi^{*}(a|s)}{\pi_0(a|s)} + \Psi_{\Delta r}(s;\beta)$ using Eq. (6.22) (see App. C.1.3 for detailed derivations). This leads to the condition
\[
r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s) - \frac{1}{\beta}\log_{\alpha}\frac{\pi^{*}(a|s)}{\pi_0(a|s)} - \Psi_{\Delta r}(s;\beta) = 0, \tag{E.8}
\]
which matches Eq. (6.28). We compare our $\alpha$-divergence path consistency conditions to previous work in Brekelmans et al. (2022a) App. E.

E.1.5 Modified Rewards and Duality Gap for Suboptimal Policies

We can also use the conjugate duality of state-action occupancy measures and reward functions ($r(a,s)$ or $r'(a,s)$) to express the optimality gap for a suboptimal $\mu(a,s)$. Consider the regularized primal objective as a (constrained) conjugate optimization,
\[
\mathcal{RL}^{*}_{\Omega,\beta}(r) := \frac{1}{\beta}\Omega^{*}_{\mathcal{M}}(r) = \max_{\mu\in\mathcal{M}}\;\big\langle\mu, r\big\rangle - \frac{1}{\beta}\Omega(\mu) \tag{E.9}
\]
\[
\ge \big\langle\mu_{r'}, r\big\rangle - \frac{1}{\beta}\Omega(\mu_{r'}), \tag{E.10}
\]
where the inequality follows from the fact that any feasible $\mu_{r'}\in\mathcal{M}$ provides a lower bound on the objective. We use the notation $\mu_{r'}$ to anticipate the fact that, assuming appropriate domain considerations, we would like to associate this occupancy measure with a modified reward function $r'$ using the conjugate optimality conditions in Eq. (E.2) (with $r'$ as the dual variable). In particular, for a given $\mu_{r'}$, we use the fact that $\mu_{r'} = \nabla_{r}\frac{1}{\beta}\Omega^{*}(r')$ to recognize the conjugate duality gap as a Bregman divergence. Rearranging Eq. (E.10),
\[
\frac{1}{\beta}\Omega^{*}(r) - \big\langle\mu_{r'}, r\big\rangle + \frac{1}{\beta}\Omega(\mu_{r'}) \ge 0 \tag{E.11}
\]
\[
\frac{1}{\beta}\Omega^{*}(r) - \big\langle\mu_{r'}, r\big\rangle + \big\langle\mu_{r'}, r'\big\rangle - \frac{1}{\beta}\Omega^{*}(r') \ge 0 \tag{E.12}
\]
\[
\frac{1}{\beta}\Omega^{*}(r) - \frac{1}{\beta}\Omega^{*}(r') - \Big\langle r - r',\; \underbrace{\nabla_{r}\tfrac{1}{\beta}\Omega^{*}(r')}_{\mu_{r'}} \Big\rangle \ge 0 \tag{E.13}
\]
\[
D_{\frac{1}{\beta}\Omega^{*}}[r : r'] \ge 0, \tag{E.14}
\]
where the last line follows from the definition of the Bregman divergence (Amari, 2016). For example, using the KL divergence $\Omega(\mu) = D_{\mathrm{KL}}[\mu : \mu_0]$, one can confirm that the Bregman divergence generated by $\frac{1}{\beta}\Omega^{*}$ is also a KL divergence, $D_{\mathrm{KL}}[\mu_{r'} : \mu_{r}]$ (Belousov, 2017; Banerjee et al., 2005c).
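As a sanity check of Eq. (E.11)-(E.14), the sketch below evaluates the duality gap for KL regularization in a single-step setting, where $\mathcal{M}$ reduces to the simplex and the constrained conjugate takes a log-sum-exp form, with $\beta=1$ and arbitrary rewards. It confirms numerically that the Bregman divergence between reward functions equals $D_{\mathrm{KL}}[\mu_{r'} : \mu_{r}]$.

```python
import numpy as np

# Single-state sketch (assumed setup): KL policy regularization to pi0, beta = 1,
# with the simplex-constrained conjugate  Omega*(r) = log sum_a pi0(a) exp(r(a)).
rng = np.random.default_rng(1)
pi0 = rng.dirichlet(np.ones(4))
r, r_prime = rng.normal(size=4), rng.normal(size=4)

def conjugate(u):
    return np.log(np.sum(pi0 * np.exp(u)))

def occupancy(u):
    # gradient of the conjugate = optimizing argument mu_u, cf. Eq. (E.2)
    w = pi0 * np.exp(u)
    return w / w.sum()

bregman_gap = conjugate(r) - conjugate(r_prime) - (r - r_prime) @ occupancy(r_prime)  # Eq. (E.13)
kl = np.sum(occupancy(r_prime) * np.log(occupancy(r_prime) / occupancy(r)))           # D_KL[mu_{r'} : mu_r]
print(bregman_gap, kl)   # the two values agree, illustrating Eq. (E.14) for KL regularization
```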
E.2 Soft Value Aggregation

Soft value aggregation (Fox et al., 2016; Haarnoja et al., 2017) and the regularized Bellman optimality operator (Neu et al., 2017; Geist et al., 2019) also rely on the convex conjugate function, but in a slightly different setting than our derivations for the optimal regularized policy or reward perturbations in App. C.1. In particular, in each state we optimize over the policy $\pi(a|s)\in\Delta^{|\mathcal{A}|}$ using an explicit normalization constraint (Eq. (E.17)).

We derive the regularized Bellman optimality operator from the primal objective in Eq. (6.12). Factorizing $\mu(a,s)=\mu(s)\,\pi(a|s)$, we can imagine optimizing over $\mu(s)$ and $\pi(a|s)\in\Delta^{|\mathcal{A}|}$ separately,
\[
\max_{\mu(s)\in\mathcal{M}}\;\max_{\pi(a|s)\in\Delta^{|\mathcal{A}|}}\;\min_{V(s)}\;(1-\gamma)\big\langle \nu_0(s), V(s)\big\rangle + \Big\langle \mu(a,s),\; r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')] - V(s)\Big\rangle - \frac{1}{\beta}\Omega_{\pi_0}(\pi). \tag{E.15}
\]
Eliminating $\mu(s)$ (by setting $d/d\mu(s)=0$) leads to a constraint on the form of $V(s)$, since both may be viewed as enforcing the Bellman flow constraints,
\[
V(s) = \Big\langle \pi(a|s),\; r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')]\Big\rangle - \frac{1}{\beta}\Omega_{\pi_0}(\pi). \tag{E.16}
\]
We define $Q(s,a):= r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')]$ and write $V(s)=\langle \pi(a|s), Q(a,s)\rangle - \frac{1}{\beta}\Omega_{\pi_0}(\pi)$ moving forward. As an operator for iteratively updating $V(s)$, Eq. (E.16) corresponds to the regularized Bellman operator $T^{\pi}_{\Omega,\beta}$ and may be used to perform policy evaluation for a given $\pi(a|s)$ (Geist et al., 2019).

The regularized Bellman optimality operator $T^{*}_{\Omega,\beta}$, which can be used for value iteration or modified policy iteration (Geist et al., 2019), arises from including the maximization over $\pi(a|s)\in\Delta^{|\mathcal{A}|}$ from Eq. (E.15),
\[
V(s)\;\leftarrow\;\frac{1}{\beta}\Omega^{*}_{\pi_0,\pi}(Q) = \max_{\pi\in\Delta^{|\mathcal{A}|}}\;\big\langle \pi(a|s), Q(a,s)\big\rangle - \frac{1}{\beta}\Omega_{\pi_0}(\pi). \tag{E.17}
\]

Comparison of Conjugate Optimizations. Eq. (E.17) has the form of a conjugate optimization $\frac{1}{\beta}\Omega^{*}_{\pi_0,\pi}(Q)$ (Geist et al., 2019). However, in contrast to the setting of App. E.1.2 and App. C.1, we optimize over the policy in each state, rather than the state-action occupancy $\mu(a,s)$. Further, we must include normalization and nonnegativity constraints $\pi(a|s)\in\Delta^{|\mathcal{A}|}$, which can be enforced using Lagrange multipliers $\Psi_{Q}(s;\beta)$ and $\lambda(a,s)$. We derive expressions for this conjugate function for the KL divergence in App. C.2.1 and the $\alpha$-divergence in App. C.2.2, and plot the value $V(s)$ as a function of $\alpha$ and $\beta$ in App. E.2.2.

Compared with the optimization for the optimal policy in Eq. (E.4), note that the argument of the conjugate function does not include the value function $V(s)$ in this case. We will highlight the relationship between the normalization constants $\Psi_{Q}(s;\beta)$, $\Psi_{\Delta r}(s;\beta)$, and the value function $V^{*}(s)$ in App. E.2.1, where $\Psi_{Q}(s;\beta) = V^{*}(s) + \Psi_{\Delta r}(s;\beta)$ as in Lee et al. (2019) App. D.
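For KL divergence regularization, the conjugate in Eq. (E.17) has a familiar log-sum-exp closed form (derived in App. C.2.1). The sketch below evaluates this soft value and its optimizing policy for a single state with arbitrary Q-values, illustrating the limits discussed in App. E.2.2: the expectation under $\pi_0$ for strong regularization and $\max_a Q(a,s)$ as $\beta\to\infty$.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value(Q, pi0, beta):
    # KL-regularized Bellman optimality operator, Eq. (E.17), in closed form:
    #   V(s) = (1/beta) log sum_a pi0(a|s) exp(beta * Q(a,s))
    return logsumexp(beta * Q, b=pi0) / beta

def soft_argmax(Q, pi0, beta):
    # Optimizing argument of Eq. (E.17): pi*(a|s) proportional to pi0(a|s) exp(beta * Q(a,s))
    w = pi0 * np.exp(beta * (Q - Q.max()))
    return w / w.sum()

Q = np.array([1.1, 0.8])          # single-step values as in the worked examples of Sec. 6.4
pi0 = np.array([0.5, 0.5])
for beta in [0.01, 1.0, 10.0, 1000.0]:
    print(beta, soft_value(Q, pi0, beta), soft_argmax(Q, pi0, beta))
# As beta -> 0, V(s) -> E_{pi0}[Q(a,s)]; as beta -> infinity, V(s) -> max_a Q(a,s)  (cf. App. E.2.2).
```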
E.2.1 Relationship between Normalization Constants $\Psi_{\Delta r}$, $\Psi_{Q}$, and Value Function $V^{*}(s)$

In this section, we analyze the relationship between the conjugate optimizations that we have considered above, either optimizing over $\mu(a,s)$ as in deriving the optimal policy, or optimizing over $\pi(a|s)\in\Delta^{|\mathcal{A}|}$ as in the regularized Bellman optimality operator or soft-value aggregation. Using $Q(a,s) = r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')]$,

Optimal Policy (or Worst-Case Reward Perturbations) (App. C.1.3):
\[
\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}\Big( r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')] - V(s) + \lambda(a,s) \Big) \tag{E.18}
\]
\[
= \max_{\mu(a,s)\in\mathcal{F}}\;\Big\langle \mu(a,s),\; r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')] - V(s) + \lambda(a,s) \Big\rangle - \frac{1}{\beta}\Omega^{(\alpha)}_{\pi_0}(\mu) \tag{E.19}
\]
Soft Value Aggregation (App. C.2.2):
\[
V(s)\;\leftarrow\;\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\pi}\Big( r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')] \Big) \tag{E.20}
\]
\[
= \max_{\pi(a|s)\in\Delta^{|\mathcal{A}|}}\;\Big\langle \pi(a|s),\; r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')] \Big\rangle - \frac{1}{\beta}\Omega^{(\alpha)}_{\pi_0}(\pi) \tag{E.21}
\]
Note that the arguments differ by a term of $V(s)$. We can essentially consider $\lambda(a,s)$ as an argument of the conjugate in Eq. (E.20), since a linear term of $\langle \mu(a,s), \lambda(a,s)\rangle$ will appear when enforcing $\pi(a|s)\in\Delta^{|\mathcal{A}|}$. Evaluating the optimizing arguments,

Optimal Policy (or Worst-Case Reward Perturbations) (App. C.1.3, Eq. (C.17), Table 6.2):
\[
\pi^{*}(a|s) = \pi_0(a|s)\,\exp_{\alpha}\Big\{ \beta\big( Q(a,s) + \lambda(a,s) - V(s) - \Psi_{\Delta r}(s;\beta) \big) \Big\} \tag{E.22}
\]
Soft Value Aggregation (App. C.2.2, Eq. (C.32)):
\[
\pi^{*}(a|s) = \pi_0(a|s)\,\exp_{\alpha}\Big\{ \beta\big( Q(a,s) + \lambda(a,s) - \Psi_{Q}(s;\beta) \big) \Big\}
\]
For the optimal $V^{*}(s)$ and $Q^{*}(a,s) = r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]$, the two policies match. This can be confirmed using similar reasoning as in Lee et al. (2019) App. D-E or Geist et al. (2019) to show that iterating the regularized Bellman optimality operator leads to the optimal policy and value.

[Figure E.1: Value function $V^{*}(s) = \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\pi}(Q^{*})$ (first row) and normalization constant $\Psi_{Q}(s;\beta)$ (second row) as a function of $\alpha$ (x-axis) for various regularization strengths $\beta\in\{0.5, 1, 2, 5\}$ (panels (a)-(d)). We use the same rewards as in Fig. 6.3 and Fig. E.3 and a uniform reference. We plot $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta}(1-\alpha)D_{\alpha}[\pi_0 : \pi^{*}]$ in the third row, and confirm the identity $V^{*}(s) = \Psi_{Q}(s;\beta) - \Psi_{\Delta r}(s;\beta)$ from Eq. (E.23) and (E.27) in the last row. We find that this equality holds for all $\alpha$ up to small optimization errors on the order of $10^{-3}$.]

[Figure E.3: Value function $V^{*}(s) = \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\pi}(Q)$ as a function of regularization strength $1/\beta$ (x-axis) and $\alpha$ (colored lines), using the $Q(a,s)$ and $\pi_0(a|s)$ shown in the insets, for (a) a uniform prior $\pi_0(a|s)$ and (b) a prior $\pi_0(a|s)\propto r(a,s)$. See Eq. (C.28) and Eq. (C.32) for closed forms.]

Relationship between $V^{*}(s)$ and $\Psi_{Q}(s;\beta)$. This implies the condition which is the main result of this section,
\[
\Psi_{Q}(s;\beta) = V^{*}(s) + \Psi_{\Delta r}(s;\beta). \tag{E.23}
\]
In Fig. E.1, we empirically confirm this identity and inspect how each quantity varies with $\alpha$ and $\beta$ (App. E.2.2).¹ Eq. (E.23) highlights distinct roles for the value function $V^{*}(s)$ and the Lagrange multiplier $\Psi_{Q}(s;\beta)$ enforcing normalization of $\pi(a|s)$ in soft-value aggregation (Eq. (C.29) or Eq. (E.20)). It is well known that these coincide in the case of KL divergence regularization, with $V^{*}(s) = \Psi_{Q}(s;\beta)$ as in App. C.2.1.

¹ Note that $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta\alpha}\sum_a \pi_0(a|s) - \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)^{1-\alpha}\pi(a|s)^{\alpha}$ appears from differentiating $\frac{1}{\beta}\Omega^{(\alpha)}_{\pi_0}(\mu)$ with respect to $\mu$ (App. C.1.3 Eq. (C.15)). We also write this as $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta}(1-\alpha)D_{\alpha}[\pi_0 : \pi]$ for normalized $\pi_0$, $\pi$.
We can also confirm that $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta\alpha}\big(\sum_a \pi_0(a|s) - \sum_a \pi_0(a|s)^{1-\alpha}\pi(a|s)^{\alpha}\big) = 0$ vanishes for KL regularization ($\alpha = 1$), a normalized $\pi_0$, and the normalized optimal policy $\pi^{*}$. However, in the case of $\alpha$-divergence regularization, optimization over the joint $\mu(a,s)$ in Eq. (E.18) introduces the additional term $\Psi_{\Delta r}(s;\beta)$, which is not equal to 0 in general.

Relationship between Conjugate Functions. We might also like to compare the value of the conjugate functions in Eq. (E.18) and Eq. (E.20), in particular to understand how including $V(s)$ as an argument and optimizing over $\mu$ versus $\pi$ affect the optima. We write the expressions for the conjugate function in each case, highlighting the terms from Eq. (E.23).

Optimal Policy (or Worst-Case Reward Perturbations) (App. C.1.3, Eq. (C.11), Table 6.2):
\[
\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}\Big( r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s) \Big)
= \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)\,\exp_{\alpha}\Big\{\beta\big(Q^{*}(a,s)+\lambda^{*}(a,s)-V^{*}(s)-\Psi_{\Delta r}(s;\beta)\big)\Big\}^{\alpha} - \frac{1}{\beta\alpha} + \Psi_{\Delta r}(s;\beta) \tag{E.24}
\]
Soft Value Aggregation (App. C.2.2, Eq. (C.32)):
\[
V^{*}(s)
= \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)\,\exp_{\alpha}\Big\{\beta\big(Q^{*}(a,s)+\lambda^{*}(a,s)-\Psi_{Q}(s;\beta)\big)\Big\}^{\alpha} - \frac{1}{\beta\alpha} + \Psi_{Q}(s;\beta) \tag{E.25}
\]
Note that we have rewritten $V(s)\leftarrow\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\pi}\big(r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V(s')]\big)$ directly as $V^{*}(s)$. To further simplify, note that the optimal policy matches as in Eq. (E.22), with
\[
\pi^{*}(a|s) = \pi_0(a|s)\,\exp_{\alpha}\big\{\beta\big(Q(a,s)+\lambda(a,s)-\Psi_{Q}(s;\beta)\big)\big\}
= \pi_0(a|s)\,\exp_{\alpha}\big\{\beta\big(Q(a,s)+\lambda(a,s)-V^{*}(s)-\Psi_{\Delta r}(s;\beta)\big)\big\}.
\]
We can use this expression to write terms of the form $\pi_0(a|s)\exp_{\alpha}\{\cdot\}^{\alpha}$ in Eq. (E.24)-(E.25) as $\pi_0(a|s)^{1-\alpha}\pi^{*}(a|s)^{\alpha}$. Finally, simplifying the value function expression in Eq. (E.25) simply recovers the equality in Eq. (E.23),
\[
\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\pi}\big(r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]\big) = V^{*}(s)
= \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)^{1-\alpha}\pi^{*}(a|s)^{\alpha} - \frac{1}{\beta\alpha} + \Psi_{Q}(s;\beta) \tag{E.26}
\]
\[
= \Psi_{Q}(s;\beta) - \Psi_{\Delta r}(s;\beta), \tag{E.27}
\]
where we use $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta\alpha}\sum_a \pi_0(a|s) - \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)^{1-\alpha}\pi^{*}(a|s)^{\alpha}$ from Eq. (C.15). We can use the same identity to show that the conjugate in Eq. (E.24) evaluates to zero,
\[
\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}\Big( r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s) \Big) = 0. \tag{E.28}
\]
In Lemma E.3.1, we provide a more detailed proof and show that this identity also holds for suboptimal policies and their worst-case reward perturbations, $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi}) = 0$, where Eq. (E.28) is a special case for $\Delta r_{\pi^{*}}(a,s) = r(a,s)+\gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]-V^{*}(s)+\lambda^{*}(a,s)$ via Prop. 6.3.4.

Finally, note that the condition in Eq. (E.28) implies that for the optimal $V^{*}(s)$, the regularized dual objective in Eq. (6.14) reduces to the value function averaged over initial states, $\mathcal{RL}^{*}_{\Omega,\beta}(r) = (1-\gamma)\,\langle \nu_0(s), V^{*}(s)\rangle$. This is intuitive since $V^{*}(s)$ measures the regularized objective attained from running the optimal policy for infinite time in the discounted MDP.

E.2.2 Plotting the Value Function as a Function of Regularization Parameters $\alpha, \beta$; Confirming the Relationship between the Normalization Constants $\Psi_{\Delta r}(s;\beta)$, $\Psi_{Q}(s;\beta)$ and the Value Function $V^{*}(s)$

In Fig. E.1, we plot both $V^{*}(s)$ and $\Psi_{Q}(s;\beta)$ for various values of $\alpha$ (x-axis) and $\beta$ (in each panel). We also plot $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta}(1-\alpha)D_{\alpha}[\pi_0 : \pi^{*}]$ in the third row, and confirm the identity in Eq. (E.23) in the fourth row.

As we also observe in Fig. E.3, the soft value function or certainty equivalent $V^{*}(s) = \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0,\pi}(Q)$ is not monotonic in $\alpha$ for this particular set of single-step rewards (the same $r(a,s)$ as in Fig. 6.3 or Fig. E.3). Note the small scale of the y-axis in the top row of Fig. E.1. While it can be shown that the $\alpha$-divergence $D_{\alpha}[\pi_0 : \pi]$ is convex as a function of $\alpha$ (Amari and Ohara, 2011), we see that $\Psi_{\Delta r}(s;\beta)$ is not necessarily convex in $\alpha$ and appears to be monotonically decreasing in $\alpha$.
Finally, we find that the identity in Eq. (E.23)-(E.27) holds empirically, with only small numerical optimization issues.

Value $V^{*}(s)$ as a Function of $\alpha$, $\beta$. In Fig. E.3, we visualize the optimal value function $V^{*}(s) = \frac{1}{\beta}\Omega^{*}_{\pi_0,\pi}(Q)$, for KL or $\alpha$-divergence regularization and different choices of regularization strength $1/\beta$. The choice of divergence particularly affects the aggregated value at low regularization strength, although we do not observe a clear pattern with respect to $\alpha$.² In all cases, the value function ranges between $\max_a Q(a,s)$ for an unregularized deterministic policy as $\beta\to\infty$, and the expectation under the reference policy $\mathbb{E}_{\pi_0}[Q(a,s)]$ for strong regularization as $\beta\to 0$. We also discuss this property in Sec. 6.5.1.

² See Belousov and Peters (2019); Lee et al. (2018, 2019), or Section E.2.2 for additional discussion of the effect of $\alpha$-divergence regularization on learned policies.

E.3 Robust Set of Perturbed Rewards

In this section, we characterize the robust set of perturbed rewards to which a given policy $\pi(a|s)$ or occupancy $\mu(a,s)$ is robust, which also provides performance guarantees as in Eq. (6.2) and describes the set of strategies available to the adversary. For proving Prop. 6.3.1, we focus our discussion on policy regularization with KL or $\alpha$-divergence regularization and compare with state-occupancy regularization in App. E.3.2.

E.3.1 Proof of Prop. 6.3.1: Robust Set of Perturbed Rewards for Policy Regularization

We begin by stating two lemmas, which we will use to characterize the robust set of perturbed rewards. All proofs are organized under paragraph headers below the statement of Prop. 6.3.1.

Lemma E.3.1. For the worst-case reward perturbations $\Delta r_{\pi}(a,s)$ associated with a given, normalized policy $\pi(a|s)$ and $\alpha$- or KL-divergence regularization, the conjugate function evaluates to zero,
\[
\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi}) = 0. \tag{E.29}
\]

Lemma E.3.2. The conjugate function $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r)$ is increasing in $\Delta r$. In other words, if $\widetilde{\Delta r}(a,s) \ge \Delta r(a,s)$ for all $(a,s)\in\mathcal{A}\times\mathcal{S}$, then $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\widetilde{\Delta r}) \ge \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r)$.

Proposition 6.3.1. Assume a normalized policy $\pi(a|s)$ for the agent is given, with $\sum_a \pi(a|s) = 1\ \forall s\in\mathcal{S}$. Under $\alpha$-divergence policy regularization to a normalized reference $\pi_0(a|s)$, the optimization over $\Delta r(a,s)$ in Eq. (6.17) can be written in the following constrained form
\[
\min_{\Delta r\in\mathcal{R}_{\Delta r}} \big\langle \mu, r - \Delta r\big\rangle \quad \text{where} \quad \mathcal{R}_{\Delta r} := \Big\{ \Delta r \in \mathbb{R}^{\mathcal{A}\times\mathcal{S}} \;\Big|\; \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r) \le 0 \Big\}. \tag{6.18}
\]
We refer to $\mathcal{R}_{\Delta r}$ as the feasible set of reward perturbations available to the adversary. This translates to a robust set $\mathcal{R}_{r'}$ of modified rewards $r'(a,s) = r(a,s) - \Delta r(a,s)$ for the given policy. These sets depend on the $\alpha$-divergence and regularization strength $\beta$ via the conjugate function. For KL divergence regularization, the constraint is
\[
\sum_{a\in\mathcal{A}} \pi_0(a|s)\,\exp\{\beta\,\Delta r(a,s)\} \le 1. \tag{6.19}
\]

Proof. Recall the adversarial optimization in Eq. (6.17) for a fixed $\mu(a,s) = \mu(s)\,\pi(a|s)$,
\[
\min_{\Delta r(a,s)} \big\langle \mu(a,s),\; r(a,s) - \Delta r(a,s)\big\rangle + \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r), \tag{E.30}
\]
which we would like to transform into a constrained optimization. From Lemma E.3.1, we know that $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi}) = 0$ for the optimizing argument $\Delta r_{\pi}$ in Eq. (E.30), but it is not clear whether this should appear as an equality or inequality constraint. We now show that allowing $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r) \ge 0$ changes the value of the objective, whereas the inequality constraint $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r) \le 0$ does not change the value of the optimization.

$\ge$ Inequality. First, consider the optimization $\min_{\Delta r}\langle\mu, r - \Delta r\rangle$ subject to $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r) \ge 0$. From the optimizing argument $\Delta r_{\pi}(a,s)$, consider an increase in the reward perturbations $\widetilde{\Delta r}(a,s) \ge \Delta r_{\pi}(a,s)\ \forall (a,s)$, where $\exists (a,s)$ s.t. $\mu(a,s) > 0$ and $\widetilde{\Delta r}(a,s) > \Delta r_{\pi}(a,s)$.
By Lemma E.3.2, we have $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\widetilde{\Delta r}) \ge \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi}) = 0$. However, the objective now satisfies $\langle\mu, r - \widetilde{\Delta r}\rangle < \langle\mu, r - \Delta r_{\pi}\rangle$ for fixed $\mu(a,s)$, which is a contradiction since $\Delta r_{\pi}(a,s)$ provides a global minimum of the convex objective in Eq. (E.30).

$\le$ Inequality. We would like to show that this constraint does not introduce a different global minimum of Eq. (E.30). Assume there exists $\widetilde{\Delta r}(a,s)$ with $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\widetilde{\Delta r}) < 0$ and $\langle\mu, r - \widetilde{\Delta r}\rangle < \langle\mu, r - \Delta r_{\pi}\rangle$. By convexity of $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r)$, we know that a first-order Taylor approximation around $\Delta r_{\pi}$ everywhere underestimates the function, $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\widetilde{\Delta r}) \ge \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi}) + \langle \widetilde{\Delta r}(a,s) - \Delta r_{\pi}(a,s),\; \nabla\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi})\rangle$. Noting that $\mu(a,s) = \nabla\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi})$ by the conjugate optimality conditions (Eq. (6.5), App. E.1), we have $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\widetilde{\Delta r}) - \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi}) \ge \langle\mu, \widetilde{\Delta r}\rangle - \langle\mu, \Delta r_{\pi}\rangle$. This now introduces a contradiction, since we have assumed both that $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\widetilde{\Delta r}) - \frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi}) < 0$, and that $\widetilde{\Delta r}(a,s)$ provides a global minimum, where $\langle\mu, r - \widetilde{\Delta r}\rangle < \langle\mu, r - \Delta r_{\pi}\rangle$ implies $\langle\mu, \widetilde{\Delta r}\rangle - \langle\mu, \Delta r_{\pi}\rangle > 0$. Thus, including the inequality constraint $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r) \le 0$ cannot introduce different minima.

This constraint is consistent with the constrained optimization and generalization guarantee in Eq. (6.1)-(6.2), where it is clear that increasing the modified reward away from the boundary of the robust set (i.e. decreasing $\Delta r(a,s)$ and $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r)$) is feasible for the adversary and preserves our performance guarantee. See Eysenbach and Levine (2021) A2 and A6 for alternative reasoning.

Proof of Lemma E.3.1. For $\alpha$-divergence policy regularization and a given $\pi(a|s)$, we substitute the worst-case reward perturbations $\Delta r_{\pi}(a,s) = \frac{1}{\beta}\log_{\alpha}\frac{\pi(a|s)}{\pi_0(a|s)} + \Psi_{\Delta r}(s;\beta)$ (Eq. (6.22) or Eq. (C.16)) in the conjugate function $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi})$ (Eq. (C.11) or Table 6.2). Assuming $\sum_a \pi(a|s) = \sum_a \pi_0(a|s) = 1$, we have
\[
\begin{aligned}
\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r_{\pi})
&= \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)\,\exp_{\alpha}\big\{\beta\big(\Delta r_{\pi}(a,s) - \Psi_{\Delta r}(s;\beta)\big)\big\}^{\alpha} - \frac{1}{\beta\alpha} + \Psi_{\Delta r}(s;\beta) \\
&= \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)\Big[ 1 + (\alpha-1)\log_{\alpha}\frac{\pi(a|s)}{\pi_0(a|s)} \Big]^{\frac{\alpha}{\alpha-1}} - \frac{1}{\beta\alpha} + \Psi_{\Delta r}(s;\beta) \\
&= \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)^{1-\alpha}\pi(a|s)^{\alpha} - \frac{1}{\beta\alpha} + \Psi_{\Delta r}(s;\beta) = 0.
\end{aligned}
\]
In the last line, we recall that $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta\alpha}\sum_a \pi_0(a|s) - \frac{1}{\beta\alpha}\sum_a \pi_0(a|s)^{1-\alpha}\pi(a|s)^{\alpha}$ from Eq. (6.23) or (C.15).

For KL regularization, we plug $\Delta r_{\pi}(a,s) = \frac{1}{\beta}\log\frac{\pi(a|s)}{\pi_0(a|s)}$ (Eq. (6.21), (C.5)) into the conjugate in Eq. (C.2) or Table 6.2,
\[
\frac{1}{\beta}\Omega^{*}_{\pi_0}(\Delta r_{\pi}) = \frac{1}{\beta}\Big( \sum_a \pi_0(a|s)\exp\{\beta\,\Delta r_{\pi}(a,s)\} - 1 \Big)
= \frac{1}{\beta}\Big( \sum_a \pi_0(a|s)\exp\Big\{\log\frac{\pi(a|s)}{\pi_0(a|s)}\Big\} - 1 \Big)
= \frac{1}{\beta}\Big( \sum_a \pi(a|s) - 1 \Big) = 0.
\]

Proof of Lemma E.3.2. See Husain et al. (2021) Lemma 3.

E.3.2 Robust Set for $\alpha$-Divergence under $\mu(a,s)$ Regularization

For state-action occupancy regularization and KL divergence, Lemma E.3.1 still holds, with $\frac{1}{\beta}\Omega^{*}_{\mu_0}(\Delta r_{\mu}) = 0$ for normalized $\mu(a,s)$ and $\Delta r_{\mu}(a,s) = \frac{1}{\beta}\log\frac{\mu(a,s)}{\mu_0(a,s)}$.

However, the reasoning in App. E.3 no longer holds for $\alpha$-divergence regularization to a reference $\mu_0(a,s)$. Substituting the worst-case reward perturbations (Eq. (6.24) or (C.21)) into the conjugate function (Eq. (C.20) or Table 6.2),
\[
\begin{aligned}
\frac{1}{\beta}\Omega^{*(\alpha)}_{\mu_0}(\Delta r_{\mu})
&= \frac{1}{\beta\alpha}\sum_a \mu_0(a,s)\,\exp_{\alpha}\big\{\beta\,\Delta r_{\mu}(a,s)\big\}^{\alpha} - \frac{1}{\beta\alpha} \\
&= \frac{1}{\beta\alpha}\sum_a \mu_0(a,s)\Big[ 1 + (\alpha-1)\log_{\alpha}\frac{\mu(a,s)}{\mu_0(a,s)} \Big]^{\frac{\alpha}{\alpha-1}} - \frac{1}{\beta\alpha}
= \frac{1}{\beta\alpha}\sum_a \mu_0(a,s)^{1-\alpha}\mu(a,s)^{\alpha} - \frac{1}{\beta\alpha},
\end{aligned} \tag{E.31}
\]
whose value is not equal to 0 in general and instead is a function of the given $\mu(a,s)$. This may result in the original environmental reward not being part of the robust set, since $\Delta r(a,s) = 0$ evaluates to $\frac{1}{\beta}\Omega^{*(\alpha)}_{\mu_0}(\Delta r) = 0$.
E.3.3 Plotting the $\alpha$-Divergence Feasible Set

To plot the boundary of the feasible set in the single-step case, for KL divergence regularization in two dimensions, we can simply solve for the $\Delta r(a_2,s)$ which satisfies the constraint $\sum_a \pi_0(a|s)\exp\{\beta\,\Delta r(a,s)\} = 1$ for a given $\Delta r(a_1,s)$,
\[
\Delta r(a_2,s) = \frac{1}{\beta}\log\frac{1}{\pi_0(a_2|s)}\Big( 1 - \pi_0(a_1|s)\exp\{\beta\,\Delta r(a_1,s)\} \Big). \tag{E.32}
\]
The interior of the feasible set contains perturbations $\Delta r(a_1,s)$ and $\Delta r(a_2,s)$ less than or equal to these boundary values, which corresponds to modified rewards $r'(a,s) = r(a,s) - \Delta r(a,s)$ greater than or equal to the boundary of the robust set.

However, we cannot analytically solve for the feasible set boundary for general $\alpha$-divergences, since the conjugate function $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r)$ depends on the normalization constant of $\Delta r_{\pi}(a,s)$. Instead, we perform an exhaustive search over a range of $\Delta r(a_1,s)$ and $\Delta r(a_2,s)$ values. For each pair of candidate reward perturbations, we use cvxpy (Diamond and Boyd, 2016) to solve the conjugate optimization and evaluate $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r)$. We terminate our exhaustive search and record the boundary of the feasible set when we find that $\frac{1}{\beta}\Omega^{*(\alpha)}_{\pi_0}(\Delta r) = 0$ within appropriate precision.

E.4 Tsallis Entropy and $\alpha$-Divergence

To show a relationship between the Tsallis entropy and the $\alpha$-divergence, we first recall the definition of the q-deformed logarithm $\ln_q$ (Tsallis, 2009). We also explicitly write the form $\log_{\alpha}(u) = \ln_{2-\alpha}(u)$ with $\alpha = 2-q$, so that our use of $\log_{\alpha}(u)$ in Ch. 6 matches Lee et al. (2019) Eq. (5),
\[
\ln_q(u) = \frac{1}{1-q}\big(u^{1-q}-1\big), \qquad \log_{\alpha}(u) := \ln_{2-\alpha}(u) = \frac{1}{\alpha-1}\big(u^{\alpha-1}-1\big). \tag{E.33}
\]
The Tsallis entropy of order $q$ (Naudts (2011) Ch. 7-8) can be expressed using either $\ln_q$ or $\ln_{2-q}$,
\[
H^{T}_{q}[\pi(a)] = \frac{1}{q-1}\Big( 1 - \sum_{a\in\mathcal{A}}\pi(a)^{q} \Big) = \sum_{a\in\mathcal{A}}\pi(a)\,\ln_q\frac{1}{\pi(a)} \tag{E.34}
\]
\[
= -\sum_{a\in\mathcal{A}}\pi(a)\,\ln_{2-q}\pi(a). \tag{E.35}
\]
Eq. (E.34) and Eq. (E.35) mirror the two equivalent ways of writing the Shannon entropy for $q=1$. In particular, we have $\alpha = 2-q$ and $H_1[\pi(a)] = \sum\pi(a)\log\frac{1}{\pi(a)} = -\sum\pi(a)\log\pi(a)$. See Naudts (2011) Ch. 7 for discussion of these two forms of the deformed logarithm.

Note that the $\log_{\alpha}(u) = \ln_{2-\alpha}(u)$ function used throughout Ch. 6 matches the order of the $\alpha$-divergence generated by the $\rho(u) = \ln_q(u)$ representation in the $\rho$-id gauge of Ch. 5 Sec. 5.5.2. The analysis of Cichocki and Amari (2010) uses a different ordering of the arguments compared to our Ch. 5, but indeed notes³ that the same generating function $f(u)$ can be used to derive the Amari $\alpha$-divergence (as an f-divergence generator) or the $\beta$-divergence (as a decomposable Bregman divergence generator). Further investigation of these connections is needed (see Naudts (2011) Sec. 7.2).

³ See below Eq. 57 in Cichocki and Amari (2010), at the bottom of page 1546.

To connect the Tsallis entropy and the $\alpha$-divergence in Eq. (6.7), we can consider a uniform reference measure $\pi_0(a) = 1\ \forall a$. For normalized $\sum_a \pi(a) = 1$,
\[
D_{\alpha}[\pi_0(a) : \pi(a)] = \frac{1}{\alpha(1-\alpha)}\Big( (1-\alpha)\sum_{a\in\mathcal{A}}\pi_0(a) + \alpha\sum_{a\in\mathcal{A}}\pi(a) - \sum_{a\in\mathcal{A}}\pi_0(a)^{1-\alpha}\pi(a)^{\alpha} \Big) \tag{E.36}
\]
\[
= \frac{1}{\alpha(1-\alpha)}\Big( 1 - \sum_{a\in\mathcal{A}}\pi(a)^{\alpha} \Big) + \frac{1}{\alpha(1-\alpha)}\Big( (1-\alpha)\,|\mathcal{A}| + \alpha - 1 \Big) \tag{E.37}
\]
\[
= \frac{1}{\alpha}\Big( -H^{T}_{\alpha}[\pi(a)] \Big) + c, \qquad c = \frac{|\mathcal{A}|-1}{\alpha}, \tag{E.38}
\]
which recovers the negative Tsallis entropy of order $\alpha$, up to a multiplicative factor $\frac{1}{\alpha}$ and an additive constant. Note that including this constant factor via $\alpha$-divergence regularization allows us to avoid an inconvenient $\frac{1}{\alpha}$ factor in optimal policy solutions (Eq. (6.27)) compared with Eq. 8 and 10 of Lee et al. (2019).
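A short numerical check of Eq. (E.36)-(E.38): with a uniform reference measure $\pi_0(a) = 1$, the $\alpha$-divergence and $-\frac{1}{\alpha}H^{T}_{\alpha}[\pi]$ differ by a constant that does not depend on $\pi$ (equal to $(|\mathcal{A}|-1)/\alpha$ under the conventions above). The distributions below are arbitrary test cases.

```python
import numpy as np

def alpha_divergence(pi0, pi, alpha):
    # Eq. (E.36), for (possibly unnormalized) nonnegative measures pi0, pi
    return ((1 - alpha) * pi0.sum() + alpha * pi.sum()
            - np.sum(pi0 ** (1 - alpha) * pi ** alpha)) / (alpha * (1 - alpha))

def tsallis_entropy(pi, alpha):
    # Eq. (E.34): Tsallis entropy of order alpha
    return (1.0 - np.sum(pi ** alpha)) / (alpha - 1.0)

rng = np.random.default_rng(0)
A, alpha = 4, 2.0
pi0 = np.ones(A)                      # uniform reference *measure*, pi0(a) = 1 for all a
for _ in range(3):
    pi = rng.dirichlet(np.ones(A))    # random normalized policy
    lhs = alpha_divergence(pi0, pi, alpha)
    rhs = -tsallis_entropy(pi, alpha) / alpha
    print(lhs - rhs)                  # constant across pi: (|A| - 1) / alpha, per Eq. (E.38)
```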
E.5 Worked Example for Deterministic Regularized Policy

We consider the single-step example in Sec. 6.4.1 Fig. 6.2 or Fig. 6.7-6.8, with a two-dimensional action space, optimal state-action value estimates $Q^{*}(a,s) = r(a,s) = \{1.1, 0.8\}$, and a uniform prior $\pi_0(a|s)$. The case of policy regularization with $\alpha = 2$ and $\beta = 10$ is particularly interesting, since the optimal policy is deterministic with $\pi^{*}(a_1|s) = 1$.⁴

⁴ We use $\alpha = 2$ instead of $\alpha = 3$ in Fig. 6.2 to simplify calculations. See Fig. 6.7 for $\alpha = 2$ robust set plots.

First, we solve for the optimal policy for $Q^{*}(a,s) = r(a,s)$ as in App. C.2.2,
\[
\frac{1}{\beta}\Omega^{*}_{\pi_0,\pi}(Q^{*}) = \max_{\pi\in\Delta^{|\mathcal{A}|}}\;\big\langle \pi(a|s), Q^{*}(a,s)\big\rangle - \frac{1}{\beta}\Omega^{(\alpha)}_{\pi_0}(\pi) - \Psi_{Q}(s;\beta)\Big(\sum_a \pi(a|s) - 1\Big) + \sum_a \lambda(a,s)\,\pi(a|s)
\]
\[
\implies \pi^{*}(a|s) = \pi_0(a|s)\Big[ 1 + \beta(\alpha-1)\Big( Q^{*}(a,s) + \lambda^{*}(a,s) - \underbrace{\big( V^{*}(s) + \Psi_{\Delta r}(s;\beta) \big)}_{=\,\Psi_{Q}(s;\beta)\ \text{(see App. E.2.1)}} \Big) \Big]^{\frac{1}{\alpha-1}},
\]
where for $\alpha = 2$, we obtain $\pi^{*}(a|s) = \pi_0(a|s)\big( 1 + \beta\big( Q^{*}(a,s) + \lambda^{*}(a,s) - V^{*}(s) - \Psi_{\Delta r}(s;\beta) \big) \big)$.

Using cvxpy (Diamond and Boyd, 2016) to solve this optimization with $\alpha = 2$, $\beta = 10$, $\pi_0(a|s) = \frac{1}{2}\ \forall a$, and the given $Q^{*}(a,s)$, we obtain
\[
\begin{aligned}
&Q^{*}(a_1,s) = 1.1 \quad Q^{*}(a_2,s) = 0.8 \quad \lambda^{*}(a_1,s) = 0 \quad \pi^{*}(a_1|s) = 1 \quad \pi^{*}(a_2|s) = 0 \quad \lambda^{*}(a_2,s) = 0.1 \\
&V^{*}(s) = 1.05 \quad \Psi_{\Delta r}(s;\beta) = -0.05 \quad \Psi_{Q}(s;\beta) = 1.0.
\end{aligned} \tag{E.39}
\]
Our first observation is that, although the policy is deterministic with $\pi^{*}(a_1|s) = 1$, the value function $V^{*}(s) = 1.05$ is not equal to $\max_a Q^{*}(a,s) = 1.1$ as it would be in the case of an unregularized policy. Instead, we still need to subtract the $\alpha$-divergence regularization term, which is nonzero. With $\alpha = 2$, we have
\[
V^{*}(s) = \big\langle \pi^{*}(a|s), Q^{*}(a,s)\big\rangle - \underbrace{\frac{1}{\beta}\frac{1}{\alpha(1-\alpha)}\Big( 1 - \sum_a \pi_0(a|s)^{1-\alpha}\pi^{*}(a|s)^{\alpha} \Big)}_{\frac{1}{\beta}D_{\alpha}[\pi_0(a|s)\,:\,\pi^{*}(a|s)]}
= 1.1 - \frac{1}{10}\cdot\frac{1}{2(1-2)}\Big( 1 - \frac{1^2}{0.5} - \frac{0^2}{0.5} \Big) = 1.1 - 0.05 = 1.05.
\]
Recall that for normalized $\pi_0$, $\pi^{*}$, we have $\Psi_{\Delta r}(s;\beta) = \frac{1}{\beta}(1-\alpha)D_{\alpha}[\pi_0(a|s) : \pi^{*}(a|s)] = -0.05$, so that we can confirm Eq. (E.39) for $\alpha = 2$.

Finally, we confirm Prop. 6.3.4 by calculating the reward perturbations in two different ways for both $a_1$ and $a_2$,
\[
\begin{aligned}
\Delta r_{\pi^{*}}(a_1,s) &= \frac{1}{\beta}\frac{1}{\alpha-1}\bigg( \Big(\frac{\pi^{*}(a_1|s)}{\pi_0(a_1|s)}\Big)^{\alpha-1} - 1 \bigg) + \Psi_{\Delta r}(s;\beta) = \frac{1}{10}\cdot\frac{1}{1}\Big( \frac{1}{0.5} - 1 \Big) - 0.05 = 0.05 \\
&= Q^{*}(a_1,s) - V^{*}(s) + \lambda^{*}(a_1,s) = 1.1 - 1.05 + 0 = 0.05, \\
\Delta r_{\pi^{*}}(a_2,s) &= \frac{1}{\beta}\frac{1}{\alpha-1}\bigg( \Big(\frac{\pi^{*}(a_2|s)}{\pi_0(a_2|s)}\Big)^{\alpha-1} - 1 \bigg) + \Psi_{\Delta r}(s;\beta) = \frac{1}{10}\cdot\frac{1}{1}\Big( \frac{0}{0.5} - 1 \Big) - 0.05 = -0.15 \\
&= Q^{*}(a_2,s) - V^{*}(s) + \lambda^{*}(a_2,s) = 0.8 - 1.05 + 0.1 = -0.15,
\end{aligned}
\]
so that we have $\Delta r_{\pi^{*}}(a_1,s) = 0.05$ and $\Delta r_{\pi^{*}}(a_2,s) = -0.15$. We can observe that the indifference condition does not hold, since $Q^{*}(a_1,s) - \Delta r_{\pi^{*}}(a_1,s) = 1.1 - 0.05 = 1.05$ does not match $Q^{*}(a_2,s) - \Delta r_{\pi^{*}}(a_2,s) = 0.8 - (-0.15) = 0.95$. However, adding the Lagrange multiplier $\lambda^{*}(a_2,s) = 0.1$ accounts for the difference in these values. This allows us to confirm the path consistency condition (Eq. (6.28)),
\[
\underbrace{r(a,s) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|a,s)}[V^{*}(s')]}_{Q^{*}(a,s)} - \underbrace{\Big( \frac{1}{\beta}\log_{\alpha}\frac{\pi^{*}(a|s)}{\pi_0(a|s)} + \Psi_{\Delta r}(s;\beta) \Big)}_{\Delta r_{\pi^{*}}(a,s)} = V^{*}(s) - \lambda^{*}(a,s) \qquad \forall (a,s)\in\mathcal{A}\times\mathcal{S}, \tag{E.40}
\]
with $Q^{*}(a_1,s) - \Delta r_{\pi^{*}}(a_1,s) - V^{*}(s) + \lambda^{*}(a_1,s) = 1.1 - 0.05 - 1.05 + 0 = 0$ and $Q^{*}(a_2,s) - \Delta r_{\pi^{*}}(a_2,s) - V^{*}(s) + \lambda^{*}(a_2,s) = 0.8 - (-0.15) - 1.05 + 0.1 = 0$.

Appendix F

Appendix for "Bregman Information and the Geometry of Annealing Paths" (Ch. 5)

F.1 $\phi$-Deformed Logarithm Paths

For a strictly positive function $\phi(v)$, Naudts (2004, 2009, 2011) defines the $\phi$-deformed logarithm as
\[
\log_{\phi}u = \int_{1}^{u}\frac{1}{\phi(v)}\,dv. \tag{F.1}
\]
The natural logarithm $\log u$ is recovered for $\phi(v) = v$, and the q-deformed logarithm is recovered for $\phi(v) = v^{q}$. Note that $\log_{\phi}\tilde\pi(z)$ is monotonically increasing and concave in $\tilde\pi(z)$, since $\frac{d^{2}}{du^{2}}\log_{\phi}(u) = -\frac{\phi'(u)}{\phi(u)^{2}} < 0$ for increasing $\phi$.

In order to define the inverse function $\exp_{\phi}$ such that $u = \exp_{\phi}\big(\log_{\phi}(u)\big)$, it appears we need to know the indefinite integral of $1/\phi(v)$ in Eq. (F.1). Instead, the $\phi$-exponential can be defined using the integral of an additional function $\psi(v) = \frac{d}{dv}\exp_{\phi}(v)$, where
\[
\exp_{\phi}(u) = 1 + \int_{0}^{u}\psi(v)\,dv. \tag{F.2}
\]
We then have the relationships (Naudts and Zhang, 2018)
\[
\psi(u) = \phi\big(\exp_{\phi}(u)\big), \qquad \phi(u) = \psi\big(\log_{\phi}(u)\big). \tag{F.3}
\]
The $\phi$-deformed logarithmic path can now be written as
\[
\tilde\pi_{\beta}(z) = \exp_{\phi}\Big( (1-\beta)\log_{\phi}\tilde\pi_{0}(z) + \beta\log_{\phi}\tilde\pi_{1}(z) \Big). \tag{F.4}
\]
Its gradient with respect to $z$ is necessary for taking Markov Chain Monte Carlo (MCMC) transition steps, for example,
\[
\frac{d}{dz}\tilde\pi_{\beta}(z) = \phi\big(\tilde\pi_{\beta}(z)\big)\left( (1-\beta)\,\frac{d\log_{\phi}\tilde\pi_{0}(z)}{dz} + \beta\,\frac{d\log_{\phi}\tilde\pi_{1}(z)}{dz} \right)
= \phi\big(\tilde\pi_{\beta}(z)\big)\left( \frac{1-\beta}{\phi(\tilde\pi_{0}(z))}\,\frac{d\tilde\pi_{0}(z)}{dz} + \frac{\beta}{\phi(\tilde\pi_{1}(z))}\,\frac{d\tilde\pi_{1}(z)}{dz} \right). \tag{F.5}
\]
Exploring further special cases of the deformed logarithm (Kaniadakis and Scarfone, 2002), or even learning the deformation function $\phi(u)$ with respect to a quantitative measure of sample quality (Syed et al., 2021), remain interesting directions for future work. For the latter approach, parameterizing $\phi(u)$ using a strictly positive neural network would still require integration to obtain $\log_{\phi}$. If we were to instead directly parameterize $\log_{\phi}(u)$ using an input-convex neural network (Amos et al., 2017), evaluating (one of) the functions $\exp_{\phi}$, $\log_{\phi}$, or $\psi$ to evaluate $\tilde\pi_{\beta}(z)$ may still be challenging.
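As a concrete special case of Eq. (F.1)-(F.4), the sketch below implements the q-deformed logarithm and exponential ($\phi(v) = v^{q}$) and the resulting q-path between two unnormalized densities; the endpoints are arbitrary Gaussian-shaped choices for illustration. Setting $q=1$ recovers the geometric averaging path and $q=0$ the arithmetic mixture.

```python
import numpy as np

def log_q(u, q):
    # q-deformed logarithm, phi(v) = v^q in Eq. (F.1); q = 1 recovers log(u)
    return np.log(u) if q == 1.0 else (u ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(u, q):
    # inverse of log_q; the clipping at zero keeps the density nonnegative
    if q == 1.0:
        return np.exp(u)
    return np.maximum(1.0 + (1.0 - q) * u, 0.0) ** (1.0 / (1.0 - q))

def q_path(pi0_tilde, pi1_tilde, beta, q):
    # phi-deformed logarithmic path between unnormalized densities, Eq. (F.4)
    return exp_q((1.0 - beta) * log_q(pi0_tilde, q) + beta * log_q(pi1_tilde, q), q)

# Illustrative unnormalized Gaussian endpoints
z = np.linspace(-5, 5, 1001)
pi0 = np.exp(-0.5 * (z + 2.0) ** 2)
pi1 = np.exp(-0.5 * ((z - 2.0) / 0.5) ** 2)

for q in [1.0, 0.5, 0.0]:
    # q = 1: geometric mean of the endpoints; q = 0: arithmetic mixture (1-beta)*pi0 + beta*pi1
    path_half = q_path(pi0, pi1, beta=0.5, q=q)
    print(q, (path_half * (z[1] - z[0])).sum())   # unnormalized mass of the midpoint density
```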
F.2 Limiting Behavior of Rho-Tau Divergences

In this section, we show that the limiting behavior of the functions generating the $\alpha$-divergence as a rho-tau Bregman divergence recovers the familiar generators for the KL divergence. Similar techniques can be used to derive the limiting behavior for the $\beta$-divergence choice of $\rho$, $\tau$, $f$, and $f^{*}$ in Sec. 5.4.

$f, \rho$ Limiting Behavior as $q\to 0$: Recall the choices of $\rho$ and $f$ which generate the $\alpha$-divergence,
\[
\rho_{\tilde\pi} = \log_{q}\tilde\pi(z) = \frac{1}{1-q}\big(\tilde\pi(z)^{1-q}-1\big), \qquad
f(\rho) = \frac{1}{q}\Big( \exp_{q}\{\rho\} - \rho - 1 \Big) = \frac{1}{q}\big[1+(1-q)\rho\big]^{\frac{1}{1-q}} - \frac{1}{q}\rho - \frac{1}{q}. \tag{F.6}
\]
Considering the limiting behavior of $f(\rho)$ as $q\to 0$, note that both the denominator $d(q) = q$ and numerator $n(q) = \big[1+(1-q)\rho\big]^{\frac{1}{1-q}} - \rho - 1$ have $\lim_{q\to 0}d(q)=0$, $\lim_{q\to 0}n(q)=0$ and are well defined away from $q=0$. We can thus use L'Hopital's rule to find
\[
\lim_{q\to 0}f(\rho) = \lim_{q\to 0}\frac{n'(q)}{d'(q)} = n'(q)\Big|_{q=0}. \tag{F.7}
\]
Differentiating $n(q)$ with respect to $q$ using $\frac{d}{dq}f(q)^{g(q)} = \frac{d}{dq}e^{g(q)\log f(q)} = f(q)^{g(q)}\big( f'(q)\frac{g(q)}{f(q)} + g'(q)\log f(q) \big)$, we have
\[
\begin{aligned}
n'(q)\Big|_{q=0} &= \big[1+(1-q)\rho\big]^{\frac{1}{1-q}}\left( -\frac{1}{1-q}\cdot\frac{\rho}{1+(1-q)\rho} + \frac{1}{(1-q)^{2}}\log\big(1+(1-q)\rho\big) \right)\Bigg|_{q=0} \\
&= (1+\rho)\left( -\frac{\rho}{1+\rho} + \log(1+\rho) \right)
\implies \lim_{q\to 0}f(\rho) = (1+\rho)\log(1+\rho) - \rho, \tag{F.8}
\end{aligned}
\]
where we have $\rho_{\tilde\pi}(z) = \log_{0}\tilde\pi(z) = \tilde\pi - 1$ and $1+\rho = \tilde\pi(z)$. The function $f(\rho(\tilde\pi)) = \tilde\pi(z)\log\tilde\pi(z) - \tilde\pi(z) + 1$ thus matches the Bregman generator for the negative Shannon entropy and the KL divergence $D_{\mathrm{KL}}[\tilde\pi_{a} : \tilde\pi_{b}]$. Indeed, the rho-tau Bregman divergence simplifies as
\[
\begin{aligned}
D_{f,\rho}[\tilde\pi_{a} : \tilde\pi_{b}] &= \int f\big(\rho_{\tilde\pi_{a}}(z)\big) - f\big(\rho_{\tilde\pi_{b}}(z)\big) - \big(\rho_{\tilde\pi_{a}}(z)-\rho_{\tilde\pi_{b}}(z)\big)\,\tau_{\tilde\pi_{b}}(z)\,dz \\
&= \int \tilde\pi_{a}(z)\log\tilde\pi_{a}(z) - \tilde\pi_{a}(z) - \tilde\pi_{b}(z)\log\tilde\pi_{b}(z) + \tilde\pi_{b}(z) - \big(\tilde\pi_{a}(z)-\tilde\pi_{b}(z)\big)\log\tilde\pi_{b}(z)\,dz \\
&= \int \tilde\pi_{a}(z)\log\frac{\tilde\pi_{a}(z)}{\tilde\pi_{b}(z)}\,dz - \int \tilde\pi_{a}(z)\,dz + \int \tilde\pi_{b}(z)\,dz = D_{\mathrm{KL}}[\tilde\pi_{a} : \tilde\pi_{b}]. \tag{F.9}
\end{aligned}
\]

$f, \rho$ Limiting Behavior as $q\to 1$: For $q\to 1$, we directly reason that $\rho_{\tilde\pi}(z) = \log\tilde\pi(z)$ and $f(\rho) = \exp\{\rho\} - \rho - 1$. This leads to the generator $f(\rho(\tilde\pi)) = -\log\tilde\pi(z) + \tilde\pi(z) - 1$, which matches the f-divergence generator of the KL divergence.

$f^{*}, \tau$ Limiting Behavior as $q\to 1$: Recall the choices of $\tau$ and $f^{*}$ which generate the $\alpha$-divergence,
\[
\tau_{\tilde\pi}(z) = \log_{1-q}\tilde\pi(z) = \frac{1}{q}\big(\tilde\pi(z)^{q}-1\big), \qquad
f^{*}(\tau) = \frac{1}{1-q}\Big( \exp_{1-q}\{\tau\} - \tau - 1 \Big) = \frac{1}{1-q}\big[1+q\tau\big]^{\frac{1}{q}} - \frac{1}{1-q}\tau - \frac{1}{1-q}. \tag{F.10}
\]
Note that the order of the arguments is reversed, $D_{f^{*},\tau}[\tilde\pi_{a} : \tilde\pi_{b}] = D_{f,\rho}[\tilde\pi_{b} : \tilde\pi_{a}]$, using these functions. For $q\to 1$, we have $\tau_{\tilde\pi}(z) = \tilde\pi(z) - 1$.

We apply L'Hopital's rule to calculate $\lim_{q\to 1}f^{*}(\tau)$, using similar reasoning as above for $d(q) = 1-q$ and $n(q) = \big[1+q\tau\big]^{\frac{1}{q}} - \tau - 1$. We differentiate using $\frac{d}{dq}f(q)^{g(q)} = f(q)^{g(q)}\big( f'(q)\frac{g(q)}{f(q)} + g'(q)\log f(q) \big)$ with $g(q) = \frac{1}{q}$, $f(q) = 1+q\tau$ to obtain
\[
\begin{aligned}
\lim_{q\to 1}f^{*}(\tau) &= \lim_{q\to 1}\frac{n'(q)}{d'(q)} = -\,n'(q)\Big|_{q=1}
= -\big[1+q\tau\big]^{\frac{1}{q}}\left( \frac{1}{q}\cdot\frac{\tau}{1+q\tau} - \frac{1}{q^{2}}\log(1+q\tau) \right)\Bigg|_{q=1} \\
&= -(1+\tau)\left( \frac{\tau}{1+\tau} - \log(1+\tau) \right)
\implies \lim_{q\to 1}f^{*}(\tau) = (1+\tau)\log(1+\tau) - \tau. \tag{F.11}
\end{aligned}
\]
Using $\tau_{\tilde\pi}(z) = \tilde\pi(z) - 1$ for $q = 1$, we have $f^{*}(\tau(\tilde\pi)) = \tilde\pi(z)\log\tilde\pi(z) - \tilde\pi(z) + 1$, which matches the pointwise negative Shannon entropy or the f-divergence generator for $D_{f^{*},\tau}[\tilde\pi_{a} : \tilde\pi_{b}] = D_{\mathrm{KL}}[\tilde\pi_{a} : \tilde\pi_{b}]$.
Identical derivations as in Eq. (F.9) confirm that the rho-tau divergence recovers $D_{f^{*},\tau}[\tilde\pi_{a} : \tilde\pi_{b}] = D_{\mathrm{KL}}[\tilde\pi_{a} : \tilde\pi_{b}]$. This observation is indicative of the representational duality with respect to the functions $\rho(u) = \log_{q}u$ and $\tau(t) = \log_{1-q}t$. Clearly, switching $q\to 1-q$ and $1-q\to q$ switches the roles of $f,\rho$ and $f^{*},\tau$, so that we obtain the same divergence using $D^{(q)}_{f,\rho}[\tilde\pi_{a} : \tilde\pi_{b}]$ and $D^{(1-q)}_{f^{*},\tau}[\tilde\pi_{a} : \tilde\pi_{b}]$. For example, we have seen that using $q=0$ for $\rho(u)$ and $q=1$ for $\tau(t)$ recovers the same divergence.

$f^{*}, \tau$ Limiting Behavior as $q\to 0$: For $q\to 0$, we similarly find that $\tau_{\tilde\pi}(z) = \log\tilde\pi(z)$ and $f^{*}(\tau) = \exp\{\tau\} - \tau - 1$. This leads to the generator $f^{*}(\tau(\tilde\pi)) = -\log\tilde\pi(z) + \tilde\pi(z) - 1$, which matches the f-divergence generator of the KL divergence, with $D_{\mathrm{KL}}[\tilde\pi_{b} : \tilde\pi_{a}] = D_{f^{*},\tau}[\tilde\pi_{a} : \tilde\pi_{b}] = D^{(q=1)}_{f,\rho}[\tilde\pi_{a} : \tilde\pi_{b}] = D^{(q=0)}_{f^{*},\tau}[\tilde\pi_{a} : \tilde\pi_{b}]$.
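A small numerical check of the $q\to 0$ limit in Eq. (F.7)-(F.8), using the form of $f(\rho)$ recalled in Eq. (F.6) above: $f_{q}(\rho)$ approaches $(1+\rho)\log(1+\rho)-\rho$ as $q\to 0$, i.e. the KL generator $\tilde\pi\log\tilde\pi-\tilde\pi+1$ at $\rho=\tilde\pi-1$. The test point is an arbitrary choice.

```python
import numpy as np

def exp_q(u, q):
    return np.exp(u) if q == 1.0 else (1.0 + (1.0 - q) * u) ** (1.0 / (1.0 - q))

def f_q(rho, q):
    # Bregman generator for the alpha-divergence in the rho-representation, Eq. (F.6)
    return (exp_q(rho, q) - rho - 1.0) / q

rho = 1.7                                      # arbitrary test point; rho = pi_tilde - 1 at q = 0
limit = (1.0 + rho) * np.log(1.0 + rho) - rho  # Eq. (F.8)
for q in [0.5, 0.1, 0.01, 0.001]:
    print(q, f_q(rho, q), limit)               # f_q(rho) approaches the limit as q -> 0
```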