Reinforcement Learning for the Optimal Dividend Problem

by

Thejani C.M. Gamage

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (APPLIED MATHEMATICS)

August 2024

Copyright 2024 Thejani C.M. Gamage

Dedication

This thesis is dedicated to my father, Chandraprema Gamage, and my mother, Pushpa Gamage, whose support and encouragement in my education enabled me to pursue my doctoral studies.

Acknowledgements

First and foremost, I extend my deepest appreciation to my advisor, Professor Jin Ma, for his steadfast support, invaluable insights, and continuous guidance throughout my PhD research career. Without his mentorship, none of my achievements would have been attainable. Furthermore, I extend my thanks to Professor Jianfeng Zhang and Professor Renyuan Xu for their contributions to my education and academic career during graduate school and for serving on my thesis committee.

I am deeply grateful to the faculty and staff of the Department of Mathematics at the University of Southern California and the Department of Mathematics at the University of Colombo, Sri Lanka, for their invaluable role in my academic career. My special thanks go to my high school class teacher, Mrs. H. A. D. D. Deepashika, for her guidance and support at the beginning of my academic career. I extend my special gratitude to the hardworking citizens of Sri Lanka whose selfless contributions financed my education; their generosity will always be remembered.

I want to extend my deep gratitude to my parents, whose unwavering support and encouragement have been the cornerstone of my success; I owe them a debt of gratitude that words cannot adequately express. Moreover, while I am unable to thank each person individually, I am thankful for the unwavering support and encouragement of my wonderful friends.

Lastly, I am deeply grateful to my beloved husband, Rusiru, for his constant companionship, unwavering love, and in particular the invaluable support over the past five years, which has greatly facilitated navigating graduate school.

Table of Contents

Dedication
Acknowledgements
List of Tables
Abstract
Chapter 1: Introduction
  1.1 Machine Learning
    1.1.1 Supervised Learning
    1.1.2 Unsupervised Learning
    1.1.3 Reinforcement Learning
  1.2 Exploration and Exploitation
  1.3 Entropy-regularized Exploratory Optimal Control Problem
Chapter 2: Optimal Dividend Problem under the Diffusion Model
  2.1 Preliminaries and Problem Formulation
  2.2 The Value Function and Its Regularity
    2.2.1 Convergence in the Temperature Parameter
  2.3 Policy Update
  2.4 Policy Gradient (PG)
  2.5 Policy Evaluation — A Martingale Approach
  2.6 Temporal Difference (TD) Based Learning
  2.7 Alternate Approach for Approximating the Optimal Policy
    2.7.1 On-policy and Off-policy Algorithms
  2.8 Numerical Results
    2.8.1 Parametrization of the Cost Functional
    2.8.2 Algorithm Design
    2.8.3 CTD Methods
    2.8.4 The ML Algorithm
Chapter 3: Optimal Dividend Problem under the Perturbed Cramér–Lundberg Risk Model
  3.1 Introduction and Formulation
  3.2 Entropy-regularized Exploratory Control Problem
  3.3 Policy Update
  3.4 Policy Evaluation
  3.5 Method of "q-learning" as an Off-policy Algorithm
Chapter 4: Possible Extensions and Future Research
Bibliography
Appendix: RL Concepts in Discrete Time
  Appendix A: RL in Discrete Time and Space
    Appendix A1: Finite Markov Decision Process
    Appendix A2: Value Functions of a Finite MDP
    Appendix A3: Q-Learning
  Appendix B: Introduction to Temporal Difference Methods
    Appendix B1: Prediction Problems
    Appendix B2: Problem Formulation: Finite Time Horizon
    Appendix B3: TD Methods in Finite Horizon
    Appendix B4: TD Methods in Infinite Horizon

List of Tables

2.1 Results for the CTD(0) method
2.2 Convergence results for the CTD(0) method

Abstract

We study the optimal dividend problem with the dividend rate restricted to a given interval, first under the continuous-time diffusion model and then under the well-known "Cramér–Lundberg" model. Unlike the standard literature, we are particularly interested in the case when the parameters (e.g., drift and diffusion coefficients) of the model are not specified, so that the optimal control cannot be explicitly determined. To approximate the optimal strategy, we use methods from the Reinforcement Learning (RL) literature, specifically, the method of solving the corresponding RL-type entropy-regularized exploratory control problem, which randomizes the control actions and balances the levels of exploitation and exploration.
We shall first carry out a theoretical analysis of the entropy-regularized exploratory control problem, focusing particularly on the corresponding HJB equation. We will then use a policy improvement argument, along with policy evaluation devices, to construct approximating sequences of the optimal strategy. These algorithms are essentially "on-policy" algorithms, which have certain drawbacks in practical applications in some contexts. Hence we also use an "off-policy" algorithm, namely the "q-learning" algorithm, to approximate the optimal strategy. We present some numerical results using different parametrization families for the cost functional, to illustrate the effectiveness of the approximation schemes and to discuss possible methodologies for improving the effectiveness of the Policy Evaluation procedures.

Chapter 1
Introduction

The problem of maximizing the cumulative discounted dividend payment can be traced back to the work of de Finetti [13]. Since then the problem has been widely studied in the literature under different models. In particular, for the optimal dividend problem and its many variations in continuous time under diffusion models, we refer to the works of, among others, [1, 3, 6, 7, 12, 45, 51] and the references cited therein, and for the optimal dividend problem under the perturbed Cramér–Lundberg model to the works of [5, 14, 29, 32, 49]. In many cases the problem of maximizing the cumulative discounted dividend payment can be explicitly solved when the model parameters are known. The optimal dividend problem for the controlled diffusion model was first solved by Asmussen and Taksar [1]. They proved that the optimal control is a threshold control when restricted dividend payments are considered, and a barrier control when unrestricted dividend payments are allowed. It was proven in [5], by a verification theorem, that in the optimal dividend problem under the perturbed Cramér–Lundberg model the optimal strategy is a barrier strategy and that the optimal value function is a concave function of the state (reserve) value. The main motivation of this thesis is to study the optimal dividend problem in which the model parameters are not specified, so that the optimal control cannot be explicitly determined, using recently developed methods of Reinforcement Learning (RL), a subcategory of Machine Learning.

1.1 Machine Learning

A computer program takes a set of inputs and uses an algorithm to produce an output. Machine Learning differs from other types of programs in one key aspect: the methodology for generating the output is learned by the machine itself, as opposed to other computer programs, where this methodology is given to the machine explicitly by the programmer. The machine learns with varying degrees of supervision from the programmer, and based on the level of supervision and the learning methodology, machine learning can be divided into three main categories.

1.1.1 Supervised Learning

In supervised learning, the machine learns from a training set of labelled data (inputs for which an output has been specified) provided by a supervisor. The goal is to use the trained algorithm to make predictions when presented with new data. Supervised learning is in general used for regression and classification problems. Some of the well-known supervised learning algorithms include linear regression, random forests, and support vector machines.
1.1.2 Unsupervised Learning

In unsupervised learning, the machine learns from a set of unlabelled data (inputs for which no output has been specified). Unlike in supervised learning, these data do not form a training set. The machine tries to analyze the data and recognize its underlying patterns. Unsupervised learning is in general used to classify and make associations within a given data set. It is also used to reduce the dimension of a data set by identifying its most vital and relevant features. Some of the well-known unsupervised learning algorithms include K-means clustering, principal component analysis, and hierarchical clustering.

1.1.3 Reinforcement Learning

In Reinforcement Learning, the machine learns by interacting with the environment through a process of trial and error. Unlike in supervised and unsupervised learning, the goal is not to predict or classify, but to maximize a specific reward that is designed to achieve certain goals according to the context of the problem. The agent tries a set of actions and finds the favorable ones depending on the reward she is trying to maximize. A Reinforcement Learning agent observes relevant aspects of the environment (about which the agent usually has varying levels of uncertainty) and selects actions from a given set of options so as to maximize her reward. This translates the RL problem into a controlled optimization problem, and the ability of an RL algorithm to analyze such a problem under model uncertainty (i.e., unknown parameters) is the feature we are most interested in in this dissertation. We can categorize RL into two categories, namely model-based RL, where the agent tries to learn the model parameters, and model-free RL, where the agent observes multiple simulated trajectories of the state process in order to find the optimal strategy (see [19], Chapter 4). The method of using Reinforcement Learning to solve discrete Markov decision problems has been well studied, but the extension of these concepts to the continuous time and space setting is still fairly new. In this dissertation we analyze a stochastic optimal control problem in the continuous time and space setting, namely the optimal dividend problem.

1.2 Exploration and Exploitation

One of the main challenges in RL is the trade-off between exploration (learning by trying new actions) and exploitation (optimizing using the information already learned, by repeating actions that produced higher rewards in the past). It is crucial to balance the agent's levels of exploration and exploitation, since the former is usually computationally expensive and time consuming, while the latter may lead to suboptimal outcomes. In RL theory, a typical idea to balance exploration and exploitation in an optimal control problem is to "randomize" the control action and add a (Shannon) entropy term to the cost function, weighted by a temperature parameter. By maximizing the entropy one encourages exploration, and by decreasing the temperature parameter one gives more weight to exploitation. The resulting optimal control problem is often referred to as the entropy-regularized exploratory optimal control problem, which will be the starting point of our investigation.
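To illustrate the role of the temperature parameter concretely, the following minimal Python sketch (with hypothetical reward estimates for a finite set of actions; nothing here is taken from the thesis) computes the maximizer of the entropy-regularized objective over the probability simplex, which is the Gibbs (softmax) distribution with temperature λ. A large λ spreads the probability mass across actions (exploration), while λ → 0 concentrates it on the greedy action (exploitation).

```python
import numpy as np

def entropy_regularized_policy(rewards, lam):
    """Maximizer of sum_i p_i*r_i - lam * sum_i p_i*ln(p_i) over the simplex,
    i.e. the Gibbs/softmax distribution with temperature lam."""
    z = np.asarray(rewards, dtype=float) / lam
    z -= z.max()                      # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

rewards = [1.0, 1.2, 0.8]             # hypothetical estimated rewards of three actions
for lam in [5.0, 1.0, 0.1, 0.01]:
    p = entropy_regularized_policy(rewards, lam)
    print(f"lambda = {lam:5.2f}   policy = {np.round(p, 3)}")
# large lambda: nearly uniform (exploration); lambda -> 0: mass on the best action (exploitation)
```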
1.3 Entropy-regularized Exploratory Optimal Control Problem

Optimizing the corresponding entropy-regularized exploratory control problem in order to approximate the classical optimal strategy has been applied to various problems in the literature, including [18, 43, 47], and is in general carried out as a model-based RL problem, due to the inability to create the necessary learning steps without the (approximated) model parameters. The optimal dividend problem can be analyzed with both a model-based and a model-free RL approach, since the specific nature of our particular problem allows us to use a model-free approach, as we discuss in detail in Section 2.3. Hence in this dissertation we shall choose to approach it as a model-free RL problem and follow the method of optimizing the corresponding entropy-regularized exploratory control problem introduced in [46], along the lines of Policy Improvement (PI) and Policy Evaluation (PE) schemes.

Following one of the standard Reinforcement Learning schemes, we shall solve the entropy-regularized exploratory optimal dividend problem via a sequence of Policy Evaluation and Policy Improvement procedures. The former evaluates the cost functional for a given policy, and the latter produces a new policy that is "better" than the current one. We note that the idea of Policy Improvement Algorithms (PIA), as well as their convergence analysis, is not new in the numerical optimal control literature (see, e.g., [21, 26, 27, 33, 37]). The main difference in the current RL setting is the involvement of the entropy regularization, which causes some technical complications in the convergence analysis. For the continuous-time entropy-regularized exploratory control problem with diffusion models, a successful convergence analysis of the PIA was first established for a particular Linear-Quadratic (LQ) case in [47], in which the exploratory HJB equation (i.e., the HJB equation corresponding to the entropy-regularized problem) can be directly solved, and the Gaussian nature of the optimal exploratory control is known. A more general case was recently investigated in [20], in which the convergence of the PIA is proved in a general infinite-horizon setting, without requiring knowledge of the explicit form of the optimal control. The problem studied in Chapter 2 of this dissertation is very close to the one in [20], but not identical. While some of the analysis in this dissertation benefits from the fact that the spatial variable is one-dimensional, there are particular technical subtleties because of the presence of the ruin time, although the problem is essentially an infinite-horizon one, like the one studied in [20]. The problem studied in Chapter 3 is further complicated because of the "jump" in the state process and the "non-locality" of the corresponding HJB equation.

There are two main issues that this dissertation will focus on. The first is to design PE and PI algorithms that are suitable for continuous-time optimal dividend problems. We shall follow some of the "popular" schemes in RL, such as the well-understood Temporal Difference (TD) methods, combined with the so-called martingale approach, to design the PE learning procedure.
Two technical points are worth noting: 1) since the cost functional involves the ruin time, and observing the ruin time of the state process is sometimes practically impossible (especially when ruin actually occurs beyond the time horizon we can practically observe), we shall propose algorithms that are insensitive to the ruin time; 2) although the infinite-horizon nature of the problem somewhat prevents the use of the so-called "batch" learning method, we shall nevertheless study the temporally "truncated" problem so that the batch learning method can be applied. It should also be noted that one of the main difficulties in PE methods is to find an effective parameterization family of functions from which the best approximation of the cost functional is chosen, and the choice of the parameterization family directly affects the accuracy of the approximation. Since there are no proven standard methods of finding a suitable parameterization family, except for the LQ (Gaussian) case where the optimal value function is explicitly known, we shall use the classical "barrier"-type (restricted) optimal dividend strategy in [1] to propose the parametrization family in Chapter 2, and carry out numerical experiments using the corresponding families.

The second main issue is the convergence analysis of the PIA. Similar to [20], in this dissertation we focus on the regularity analysis of the solution to the exploratory HJB equation and some related PDEs. Compared to the heavy PDE arguments in [20], with the help of the one-dimensionality of the state process and the regularity results it allows us to establish, we prove the convergence of the PIA to the value function along the lines of [20] but with a much simpler argument. Although we derive PE methods based on Temporal Difference (TD) methods and the martingale approach, and PI methods based on the PIA derived in Sections 2.3 and 3.3, which enable us to design algorithms to approximate the optimal dividend rate, we are also interested in the alternative approach provided by "q-learning" algorithms. This is because the approach of PE and PI sequences we propose is inherently an "on-policy" approach, and it is sometimes necessary and advantageous to have an "off-policy" approach, as we discuss in Section 2.7. Motivated by this, we derive "q-learning" algorithms similar to the ones first introduced in [24]. We emphasize here that the algorithms we derive in this dissertation present an alternative data-driven approach for approximating the optimal dividend rate, and that there exist many other general data-driven stochastic control learning methods, such as those discussed in [9, 8, 31], that can be extended and modified to numerically approximate the optimal dividend rate using high-frequency observations of the state process. Our theoretical objectives here are to rigorously analyze the entropy-regularized exploratory value function and, by extension, the convergence of the Policy Evaluation and Policy Improvement methods. Our practical objectives are to empirically analyze the accuracy and applicability of optimizing the corresponding entropy-regularized exploratory control problem as a method of approximating the optimal dividend rate of the classical problem, and to formulate methodologies for selecting suitable parameterization families for PE algorithms.
We observe that the choice of a suitable parametrization family of functions is crucial to guarantee the convergence of the algorithms to accurate results. To the best of our knowledge, a rigorous theoretical foundation for such a choice is an open problem in the Reinforcement Learning literature. To this end, we shall take advantage of the fact that in Chapter 2 of this dissertation the state process is one-dimensional, so that some stability arguments for two-dimensional first-order nonlinear systems can be applied to conclude that the exploratory HJB equation has a concave, bounded classical solution, which coincides with the viscosity solution (of class (L)) of the HJB equation and with the value function of the optimal dividend problem. We explicitly calculate a viscosity subsolution and a supersolution, which we use as a basis, along with empirical observations, to propose certain guidelines towards establishing rigorous methodologies for choosing suitable parametrization families. In Section 2.8.1, we also propose certain modifications to the Policy Evaluation methodologies that will potentially work effectively with a bigger class of parameterization families. In Chapter 3, we formalize some of the proposed methodologies rigorously. Empirical analysis of these modifications is very important for assessing their computational effectiveness and warrants extensive study, but for now we leave that for future research.

Chapter 2
Optimal Dividend Problem under the Diffusion Model

2.1 Preliminaries and Problem Formulation

Throughout this chapter we consider a filtered probability space (Ω, F, {F_t}_{t≥0}, P) on which is defined a standard Brownian motion {W_t, t ≥ 0}. We assume that the filtration F := {F_t} = {F_t^W}, with the usual augmentation so that it satisfies the usual conditions. For any metric space X with topological Borel sets B(X), we denote by L^0(X) the set of all B(X)-measurable functions, and by L^p(X), p ≥ 1, the space of p-th power integrable functions. The spaces L^0_F([0, T]; R) and L^p_F([0, T]; R), p ≥ 1, etc., are defined in the usual ways. Furthermore, for a given domain D ⊂ R, we denote by C^k(D) the space of all k-times continuously differentiable functions on D, and C(D) = C^0(D). In particular, for R_+ := [0, ∞), we denote by C^k_b(R_+) the space of all bounded, k-times continuously differentiable functions on R_+ with all derivatives bounded.

Consider the simplest diffusion approximation of a Cramér–Lundberg model with dividend:

    dX_t = (μ − α_t) dt + σ dW_t,   t > 0,   X_0 = x ∈ R,    (2.1)

where x is the initial state, μ and σ are constants determined by the premium rate and the claim frequency and size (cf., e.g., [1]), and α_t is the dividend rate at time t ≥ 0. We write X = X^α when necessary, and say that α = {α_t, t ≥ 0} is admissible if it is F-adapted and takes values in a given "action space" [0, a]. Furthermore, we define the ruin time to be τ^α_x := inf{t > 0 : X^α_t < 0}. Clearly, X^α_{τ^α} = 0, and the problem is considered "ruined" as no dividend will be paid after τ^α. Our aim is to maximize the expected total discounted dividend given the initial condition X^α_0 = x ∈ R:

    V(x) := sup_{α ∈ U[0,a]} E_x[ ∫_0^{τ^α_x} e^{−ct} α_t dt ] := sup_{α ∈ U[0,a]} E[ ∫_0^{τ^α_x} e^{−ct} α_t dt | X^α_0 = x ],    (2.2)

where c > 0 is the discount rate, and U[0, a] is the set of admissible dividend rates taking values in [0, a].
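As a concrete illustration of the classical problem (2.1)-(2.2), the following Monte Carlo sketch estimates the expected discounted dividends of a fixed threshold strategy α_t = a·1{X_t > b} by Euler–Maruyama simulation, truncating the horizon at a large T. The parameter values and the helper name discounted_dividends are illustrative only and are not taken from the thesis; when μ and σ are known, sweeping the barrier b in such a simulation recovers the threshold-type behaviour described in [1].

```python
import numpy as np

def discounted_dividends(mu, sigma, c, a, barrier, x0,
                         dt=1e-2, T=100.0, n_paths=10000, seed=0):
    """Monte Carlo estimate of E_x[ int_0^tau e^{-ct} alpha_t dt ] for the
    threshold strategy alpha_t = a * 1{X_t > barrier} under the diffusion
    dX_t = (mu - alpha_t) dt + sigma dW_t, stopped at the ruin time tau
    (the horizon is truncated at T, which is harmless for c > 0 and T large)."""
    rng = np.random.default_rng(seed)
    n_steps = int(T / dt)
    x = np.full(n_paths, x0, dtype=float)
    alive = np.ones(n_paths, dtype=bool)
    payoff = np.zeros(n_paths)
    for k in range(n_steps):
        rate = np.where(x > barrier, a, 0.0) * alive          # dividend rate alpha_t
        payoff += np.exp(-c * k * dt) * rate * dt              # accumulate e^{-ct} alpha_t dt
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)
        x += ((mu - rate) * dt + sigma * dw) * alive           # freeze ruined paths
        alive &= (x >= 0.0)                                    # ruin: no further dividends
    return payoff.mean()

# illustrative (non-thesis) parameters
print(discounted_dividends(mu=1.0, sigma=1.0, c=0.1, a=3.0, barrier=1.5, x0=2.0))
```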
The problem (2.1)-(2.2) is often referred to as the classical optimal dividend problem with the restricted dividend rate in a given interval [0, a]. It is well-understood that when the parameters µ and σ are known, then the optimal control is of the “feedback” form: α ∗ t = a ∗ (X∗ t ), where X∗ t is the corresponding state process and a ∗ (·) is a deterministic function taking values in [0, a], often in the form of a threshold control (see, e.g., [1]). However, in practice the exact form of a ∗ (·) is not implementable since the model parameters are usually not known, thus the “parameter insensitive” method through Reinforcement Learning (RL) becomes much more desirable alternative, which we now elaborate. In the RL formulation, the agent follows a process of exploration and exploitation via a sequence of trial and error evaluation. A key element is to randomize the control action as a probability distribution over [0, a], similar to the notion of relaxed control in control theory, 10 and the classical control is considered as a special point-mass (or Dirac c-measure) case. We refer to [40, 16, 50] for more details of relaxed control. We emphasize here that even though the randomized policy has the same structure as the relaxed control, they are derived using two distinct approaches with two different motivations. To make the idea more accurate mathematically, let us denote B([0, a]) to be the Borel field on [0, a], and P([0, a]) to be the space of all probability measure on ([0, a], B([0, a])), endowed with, say, the Wasserstein metric. A “relaxed control” is a randomized policy defined as a measure-valued progressively measurable process (t, ω) 7→ π(·;t, ω) ∈ P([0, a]). Assuming that π(·;t, ω) has a density, denoted by πt(·, ω) ∈ L 1 +([0, a]) ⊂ L 1 ([0, a]), (t, ω) ∈ [0, T] × Ω, then we can write π(A;t, ω) = Z A πt(w, ω)dw, A ∈ B([0, a]), (t, ω) ∈ [0, T] × Ω. In what follows we shall often identify a relaxed control with its density process π = {πt , t ≥ 0}. Now, we denote G := B([0, a]) ⊗ F. For t ∈ [0, T], we define a probability measure on ([0, a] × Ω, G) as follows: for A ∈ B([0, a]) and B ∈ F, Qt(A × B) := Z A Z B π(dw;t, ω)P(dω) = Z A Z B πt(w, ω)dwP(dω). (2.3) We expand the original filtered probability space to ([0, a] × Ω, G, Q; {Gs}s⩾0) where Gs = B([0, a]) ⊗ Fs. Let is denote G = {Gs}t≥0. We call a function Aπ : [0, T] × [0, a] × Ω 7→ [0, a] the “canonical representation” of a relaxed control π = {π(·, t, ·)}t≥0, if Aπ t (w, ω) = w. Then, for t ≥ 0 we have E Qt [A π t ] = Z Ω Z a 0 A π t (w, ω)π(dw;t, ω)P(dω) = E P h Z a 0 wπt(w)dwi . (2.4) We can now derive the exploratory dynamics of the state process X along the lines of entropyregularized relaxed stochastic control arguments (see, e.g., [46]). Roughly speaking, consider 11 the discrete version of the dynamics (2.1): for small ∆t > 0, ∆xt := xt+∆t − xt ≈ (µ − at)∆t + σ(Wt+∆t − Wt), t ≥ 0. (2.5) Let {a i t} N i=1 and {(x i t , Wi t )} N i=1 be N independent samples of (at) under the distribution πt , and the corresponding samples of (Xπ t , Wt), respectively. Then, the law of large numbers and (2.4) imply that X N i=1 ∆x i t N ≈ X N i=1 (µ − a i t ) ∆t N ≈ E Qt [µ − A π t ]∆t = E P h µ− Z a 0 wπt(w, ·)dwi ∆t, (2.6) as N → ∞. 
This, together with the fact 1 N PN i=1(∆x i t ) 2 ≈ σ 2∆t, leads to the follow form of the exploratory version of the state dynamics: dXt = µ − Z a 0 wπt(w, ·)dw dt + σdWt , X0 = x, (2.7) where {πt(w, ·)} is the (density of) relaxed control process, and we shall often denote X = Xπ,0,x = Xπ,x to specify its dependence on control π and the initial state x. To formulate the entropy-regularized optimal dividend problem, we first give a heuristic argument. Similar to (2.6), for N large and ∆t small we should have 1 N X N i=1 e −cta i t1[t≤τ i ]∆t ≈ E Qt h e −ctA π t 1[t≤τ π x ]∆t i = E P h 1[t≤τ π x ]e −ct Z a 0 wπt(w)dw∆t i . Therefore, in light of [46] we shall define the entropy-regularized cost functional of the optimal expected dividend control problem under the relaxed control π as J(x, π) = Ex h Z τ π x 0 e −ctHπ λ (t)dti , (2.8) 12 where Hπ λ (t) := R a 0 (w − λ ln πt(w))πt(w)dw, τ π x = inf{t > 0 : X π,x t < 0}, and λ > 0 is the so-called temperature parameter balancing the exploration and exploitation. We now define the set of open-loop admissible controls as follows. Definition 2.1.1. A measurable (density) process π = {πt(·, ·)}t≥0 ∈ L 0 ([0, ∞) × [0, a] × Ω) is called an open-loop admissible relaxed control if 1. πt(·; ω) ∈ L 1 ([0, a]), for dt ⊗ dP-a.e. (t, ω) ∈ [0,∞) × Ω; 2. for each w ∈ [0, a], the process (t, ω) 7→ πt(w, ω) is F-progressively measurable; 3. Ex R τ π x 0 e −ct|Hπ λ (t)|dt < +∞. We shall denote A (x) to be the set of open-loop admissible relaxed controls. Consequently, the value function (2.2) now reads V (x) = sup π∈A (x) Ex n Z τ π x 0 e −ctHπ λ (t)dto , x ≥ 0. (2.9) An important type of π ∈ A (x) is of the “feedback” nature, that is, πt(w, ω) = π(w, Xπ,x t (ω)) for some deterministic function π, where Xπ,x satisfies the SDE: dXt = µ − Z a 0 wπ(w, Xt)dw dt + σdWt , t ≥ 0; X0 = x. (2.10) Definition 2.1.2. A function π ∈ L 0 ([0, a] × R) is called a closed-loop admissible relaxed control if, for every x > 0, 1. The SDE (2.10) admits a unique strong solution Xπ,x , 2. The process π = {πt(·; ω) := π(·, Xπ,x t (ω)); (t, ω) ∈ [0, T] × Ω} ∈ A (x). We denote Acl ⊂ A (x) to be the set of closed-loop admissible relaxed controls. 13 Remark 2.1.3. In the RL setting, the original dynamics (2.1) can be thought of as that the agent samples an action αt = a π t from π(., Xπ t ) at each time t, so that the corresponding solution Xα := X˜π = {X˜π s : s ≥ 0} will be given by the following equation. X˜ t = x + (µ − w π s )ds + σWt , X˜ 0 = x, (2.11) where w π = {w π s : s ≥ 0} is the action process generated from π. For a fixed w π , the dynamics (2.11) has a unique strong solution which has the same distribution as the solution to dynamics (2.10). We can also rewrite the cost functional using the probability measure Q as J(x, π) = E Q h Z τ π x 0 e −ct w π t − λ ln πt(w π t ) dt X˜π 0 = x i . (2.12) where dQ := Qtdt is the measure on [0,∞) × [0, a] × Ω with Qt being defined by (2.3). The following properties of the value function is straightforward. . Proposition 2.1.4. Assume a > 1. Then the value function V satisfies the following properties: (1) V (x) ≥ V (y), if x ≥ y > 0; (2) 0 ≤ V (x) ≤ λ ln a+a c , x ∈ R+. Proof. (1) Let x ≥ y, and π ∈ A (y). Consider ˆπt(w, ω) := πt(w, ω)1{t<τ π y (ω)}+ e w λ λ(e a λ −1) 1{t≥τ π y (ω)}, (t, w, ω) ∈ [0,∞)×[0, a]×Ω. Then, it is readily seen that J(x, πˆ) ≥ J(y, π), for a > 1. Thus V (x) ≥ J(x, πˆ) ≥ V (y), proving (1), as π ∈ A (y) is arbitrary. 
(2) By definition R a 0 wπt(w)dw ≤ a and − R a 0 ln πt(w))πt(w)dw ≤ ln a by the well-known Kullback-Leibler divergence property. Thus, Hπ λ (t) ≤ λ ln a + a, and then V (x) ≤ λ ln a+a c . On the other hand, since πˆ(w, x) ≡ 1 a , (x, w) ∈ R+ × [0, a], is admissible and J(x,πˆ) ≥ 0 for a > 1, the conclusion follows. 14 On the other hand, it is also natural to assume that the surplus process X of an insurance company to have a positive return that the return, including the safety loading, is higher than the interest rate c. This, together with the analysis on the dividend rate, leads to the following assumption, which will be used throughout the paper. We remark that in optimal dividend control problems it is often assumed that the maximal dividend rate is greater than the average return rate (that is, a > 2µ), and that the average return of a surplus process X, including the safety loading, is higher than the interest rate c. These, together with Proposition 2.1.4, lead to the following standing assumption that will be used throughout the dissertation. Assumption 2.1.5. The maximal dividend rate a satisfies a > max {1, 2µ} and the average return µ satisfies µ > max {c, σ2/2}. 2.2 The Value Function and Its Regularity In this section we study the value function of the entropy-regularized, relaxed control problem (2.7-2.9). We note that while most of the results are well-understood, some details still require justification, especially concerning the regularity, due particularly to the nonsmoothness of the exit time τx. We begin by recalling the Bellman optimality principle (cf. e.g., [48]): V (x) = sup π(·)∈A (x) Ex h Z s∧τ π 0 e −ctHπ λ (t)dt + e −c(s∧τ π)V (X π s∧τ π ) i , s > 0. Noting that V (0) = 0, we can (formally) argue that V satisfies the HJB equation: cv(x)= sup π∈L1[0,a] Z a 0 h w−λ ln π(w) + 1 2 σ 2 v ′′(x)+(µ − w)v ′ (x) i π(w)dw; v(0) = 0. (2.1) 15 For a fixed x ∈ R+, denote Fx(w, y) := [(1−v ′ (x))w−λ ln y]y where y : [0, a] → R. Then the maximization problem on the right hand side can be rewritten as supπ∈L1[0,a] R a 0 Fx(w, π)dw and the maximizer satisfies the Euler-Lagrange equation given by, ∂ ∂yFx(w, y) = 0 (see [15]). Solving the equation ∂ ∂yFx(w, y) = 0 and multiplying by a normalizing constant using the fact that π ∈ P([0, a]), we readily obtain the optimal feedback control which has the following Gibbs form, assuming all derivatives exist: π ∗ (w, x) = G(w, 1 − v ′ (x)), (2.2) where G(w, y) = y λ[e a λ y−1] · e w λ y1{y̸=0} + 1 a 1{y=0} for y ∈ R. Plugging (2.2) into (2.1), we see that the HJB equation (2.1) becomes the following second order ODE: 1 2 σ 2 v ′′(x) + f(v ′ (x)) − cv(x) = 0, x ≥ 0; v(0) = 0, (2.3) where the function f is defined by f(z) := n µz + λ ln h λ(e a λ (1−z) − 1) 1 − z io1{z̸=1} + [µ + λ ln a]1{z=1}. (2.4) The following result regarding the function f is important in our discussion. Since the proof is more or less elementary, we shall omit it and refer to [4] for details. Proposition 2.2.1. The function f defined by (2.4) enjoys the following properties: (1) f(z) = µz + λ lnw(z) for all z ∈ R, where w(z) = a + P∞ n=2 a n(1−z) n−1 n!λn−1 , z ∈ R. In particular, f ∈ C ∞(R); (2) the function f(·) is convex and has a unique intersection point with k(x) = µx, x ∈ R. Moreover, the abscissa value of intersection point H ∈ (1, 1 + λ). 16 We should note that (2.3) can be viewed as either a boundary value problem of an elliptic PDE with unbounded domain [0,∞) or a second order ODE defined on [0,∞). 
But in either case, there is missing information on boundary/initial conditions. Therefore the wellposedness of the classical solution is actually not trivial. Let us first consider the equation (2.3) as an ODE defined on [0,∞). Since the value function is non-decreasing by Proposition 2.1.4, for the sake of argument let us first consider (2.3) as an ODE with initial condition v(0) = 0 and v ′ (0) = ˜α > 0. By denoting X1(x) = v(x) and X2(x) = v ′ (x), we see that (2.3) is equivalent to the following system of first order ODEs: for x ∈ [0,∞), X′ 1 = X2, X1(0) = v(0) = 0; X′ 2 = 2c σ2 X1 − 2 σ2 f(X2), X2(0) = v ′ (0). (2.5) Here f is an entire function. Let us define X˜ 1 := X1 − f(0) c , X := (X˜ 1, X2) T , A := 0 1 2c σ2 − 2 σ2 h ′ (0) and q(X) = 0 − 2 σ2 k(X2) where h(y) := f(y)−f(0) = yh′ (0)+P∞ n=2 h (n) (0)y n n! = yh′ (0) + k(y). Then, X satisfies the following system of ODEs: X ′ = AX + q(X), X(0) = (−f(0)/c, v′ (0))T . (2.6) It is easy to check A has eigenvalues λ1,2 = −h ′ (0)∓ √ 2cσ2+h′(0)2 σ2 , with λ1 < 0 < λ2. Now, let Y = P X, where P is such that P AP −1 = diag[λ1, λ2] := B. Then Y satisfies Y ′ = BY + g(Y ), Y (0) = P X(0), (2.7) where g(Y ) = P q(P −1Y ). Since ∇Y g(Y ) exists and tends to 0 as |Y | → 0, and λ1 < 0 < λ2, 17 we can follow [10, Theorem 13.4.3] to construct a solution ϕ˜ to (2.7) for certain values of ˜α, such that |ϕ˜(x)| ≤ C1e −C2x for some constants C1, C2 > 0. Hence |ϕ˜(x)| → 0, as x → ∞. Thus, the function ϕ(x) := P −1ϕ˜(x) is a solution to (2.6) satisfying |ϕ(x)| → 0, as x → ∞. In other words, (2.5) has a solution such that (X1(x), X2(x)) → (0 + f(0) c , 0) = ( f(0) c , 0) as x → ∞. We summarize the discussion above as the following result. Proposition 2.2.2. The differential equation (2.3) has a classical solution v that enjoys the following properties: (i) v ′ (0) > 0 and limx→∞(v(x), v′ (x)) = ( f(0) c , 0); (ii) v is increasing and concave. Proof. Following the discussion proceeding the proposition we know that the classical solution v to (2.3) satisfying (i) exists. We need only check (ii). To this end, we shall follow an argument of [41]. Let us first formally differentiate (2.3) to get v ′′′(x) = 2c σ2 v ′ (x) − 2 σ2 f ′ (v ′ (x))v ′′(x), x ∈ [0,∞). Since v ∈ C 2 b ([0,∞)), denoting m(x) := v ′ (x), we can write m′′(x) = 2c σ2m(x) − 2 σ2 f ′ (m(x))m′ (x), x ∈ [0,∞). Now, noting Proposition 2.2.1, we define a change of variables such that for x ∈ [0,∞), φ(x) := R x 0 exp R v 0 − 2 σ2 f ′ (m(w))dw dv, and denote l(y) = m(φ −1 (y)), y ∈ (0,∞). Since φ(0) = 0, and φ ′ (0) = 1, we can define φ −1 (0) = 0 as well. Then we see that, l ′′(y) = [φ ′ (φ −1 (y))]−2 2c σ 2 l(y), y ∈ (0, ∞); l(0) = m(0) = v ′ (0) = α > 0. (2.8) Since (2.8) is a homogeneous ODE, by uniqueness l(0) = α > 0 implies that l(y) > 0, y ≥ 0. That is, m(x) = v ′ (x) > 0, x ≥ 0, and v is (strictly) increasing. Finally, from (2.8) we see that l(y) > 0, y ∈ [0,∞) also implies that l ′′(y) > 0, y ∈ [0,∞). Thus l(·) is convex on [0, +∞), and hence would be unbounded unless l ′ (y) ≤ 0 for all y ∈ [0,∞). 18 This, together with the fact that v(x) is a bounded and increasing function, shows that l(·) (i.e., v ′ (·)) can only be decreasing and convex, thus v ′′(x) (i.e., l ′ (y)) ≤ 0, proving the concavity of v, whence the proposition. Viscosity Solution of (2.3). We note that Proposition 2.2.2 requires that v ′ (0) exists, which is not a priorily known. 
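As a purely numerical complement to this discussion, the sketch below (Python with SciPy; the parameter values are illustrative, chosen only to satisfy Assumption 2.1.5, and are not from the thesis) approximates the bounded solution of (2.3) by treating it as a two-point boundary value problem on a truncated interval [0, R], imposing v(0) = 0 and v'(R) ≈ 0 as a proxy for the decay of v' at infinity; the computed slope at the origin then serves as a numerical stand-in for the unknown v'(0). Whether the solver converges depends on the initial guess and on R, so this is a heuristic check rather than a substitute for the analysis that follows.

```python
import numpy as np
from scipy.integrate import solve_bvp

# illustrative parameters satisfying a > max(1, 2*mu) and mu > max(c, sigma^2/2)
mu, sigma, c, a, lam = 1.0, 1.2, 0.1, 3.0, 0.5

def f(z):
    """f(z) = mu*z + lam*ln( lam*(e^{a(1-z)/lam} - 1)/(1-z) ), extended continuously at z = 1."""
    z = np.asarray(z, dtype=float)
    d = np.where(np.abs(1.0 - z) < 1e-12, 1e-12, 1.0 - z)   # guard the removable singularity
    return mu * z + lam * np.log(lam * np.expm1(a * d / lam) / d)

def rhs(x, y):
    # y[0] = v, y[1] = v'; equation (2.3): 0.5*sigma^2*v'' + f(v') - c*v = 0
    return np.vstack([y[1], (2.0 / sigma**2) * (c * y[0] - f(y[1]))])

def bc(ya, yb):
    # v(0) = 0 and v'(R) ~ 0, approximating the behaviour at infinity
    return np.array([ya[0], yb[1]])

R = 30.0
x = np.linspace(0.0, R, 400)
target = f(0.0) / c                                          # expected limit of v at infinity
y_guess = np.vstack([target * (1.0 - np.exp(-x)), target * np.exp(-x)])
sol = solve_bvp(rhs, bc, x, y_guess, max_nodes=100000)
print("converged:", sol.success, " v'(0+) ~", sol.y[1, 0],
      " v(R) ~", sol.y[0, -1], " f(0)/c =", target)
```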
We now conside (2.3) as an elliptic PDE defined on [0,∞), and follow Perron’s method to argue that it possesses a unique bounded viscosity solution. We will then identify its value v ′ (0) and argue that it must coincide with the classical solution identified in Proposition 2.2.2. We first recall the following definition (see, e.g., [2]). For D ⊆ R, we denote the set of all upper (resp. lower) semicontinuous function in D by USC(D) (resp. LSC(D)). Definition 2.2.3. We say that u ∈ USC([0, +∞)) is a viscosity sub-(resp. super-)solution of (2.3) on [0, +∞), if u(0) = 0 and for any x ∈ (0, +∞) and φ ∈ C 2 (R) such that 0 = [u − φ](x) = maxy∈(0,+∞) (resp. miny∈(0,+∞))[u − φ](y), it holds that 1 2 σ 2φ ′′(x) + f(φ ′ (x)) − cu(x) ≥ (resp. ≤ ) 0. We say that u ∈ C([0, +∞)) is a viscosity solution of (2.3) on [0, +∞) if it is both a viscosity subsolution and a viscosity supersolution of (2.3) on [0, +∞). Furthermore, a viscosity solution u is said to be of class (L) if it is bounded and increasing on [0, +∞). To see that both viscosity subsolution and viscosity supersolution to (2.3) exist, we first consider the following two functions: ψ(x) := 1 − e −x , ψ(x) := A M (1 − e −M(x∧b) ), x ∈ [0,∞), (2.9) 19 where A, M, b > 0 are constants satisfying M > 2µ/σ2 and the following constraints: 1 M n ln A A−M ∨ ln A A− f(0) c M o < b < 1 M n ln A H ∧ ln σ 2 2µ M o; A > max n M + H, f(0) c M + H, σ 2M2 σ2M−2µ , f(0) c · σ 2M2 σ2M−2µ o . (2.10) The following proposition, similar to [2], can be proved straightforwardly. We again omit the proof here and refer the interested reader to [4] for detailed arguments. Proposition 2.2.4. Assume that Assumption 2.1.5 holds, and let ψ, ψ be defined by (2.9). Then ψ(·) is a viscosity subsolution of (2.3) on [0, ∞), ψ(·) is a viscosity supersolution of (2.3) on [0,∞). Furthermore, it holds that ψ(x) ≤ ψ(x) on [0,∞). Now let ψ and ψ be defined by (2.9), and consider the set F := {u ∈ C(R+) | ψ ≤ u ≤ ψ; u is a class (L) vis. subsolution to (2.3)}. (2.11) Clearly, ψ ∈ F, so F ̸= ∅. Define ˆv(x) = supu∈F u(x), x ∈ [0, +∞), and let v ∗ (resp. v∗) be the USC (resp. LSC) envelope of ˆv, defined respectively by v ∗ (x) = lim r↓0 sup{v(y) : y ∈ (0, +∞), |y − x| ≤ r}, v∗(x) = lim r↓0 inf{v(y) : y ∈ (0, +∞), |y − x| ≤ r}. Note that by definition we have v∗(x) ≤ v ∗ (x), x ∈ [0,∞). The following theorem gives us the existence and uniqueness of the viscosity solution to (2.3) of class (L). Theorem 2.2.5. (i) v ∗ (resp. v∗) is a viscosity sub-(resp. super-)solution of class (L) to (2.3) on R+; and (ii) (Comparison Principle) Let v¯ be a viscosity supersolution and v a viscosity subsolution 20 of (2.3), both of class (L). Then v ≤ v¯. Consequently, v ∗ = v∗ = ˆv is the unique vis. solution of class (L) to (2.3) on R+. Proof. The proof of (i) is the same as a similar result in [2]. The comparison principle can be argued along the lines of [11, Theorem 5.1], we omit the proof here. Following our discussion we can easily raise the regularity of the viscosity solution. Corollary 2.2.6. Let v be a vis. solution of class (L) to the HJB equation (2.1). Then, v has a right-derivative v ′ (0+) > 0, and consequently v ∈ C 2 b ([0,∞)). Furthermore, v is concave and satisfies limx→∞ v(0) = f(0)/c and limx→+∞ v ′ (x) = 0. Proof. Let v be a viscosity solution of class (L) to (2.1). We first claim that v ′ (0+) > 0 exists. Indeed, consider the subsolution ψ and supersolution ψ defined by (2.9). 
Applying Theorem 2.2.5, for any x > 0 but small enough we have 1 − e −x x = ψ(x) x ≤ v(x) x ≤ ψ(x) x = A M (1 − e −Mx) x . Sending x ↘ 0 we obtain that 1 ≤ limx↘0 v(x) x ≤ limx↘0 v(x) x ≤ A. Also, it follows from Proposition 2.2.2 that the ODE (2.3) has a bounded classical solution in C 2 b ([0,∞)) satisfying v ′ (0+) = ˜α, and is increasing and concave. Hence it is also a viscosity solution to (2.3) of class (L). But by Theorem 2.2.5, the bounded viscosity solution to (2.3) of class (L) is unique, thus the viscosity solution v ∈ C 2 b ([0,∞)) and we can conclude that v ′ (0+) = ˜α = limx↘0 v(x) x = limx↘0 v(x) x . The rest of the properties are the consequences of Proposition 2.2.2. Verification Theorem and Optimal Strategy. Having argued the well-posedness of ODE (2.3) from both classical and viscosity sense, we now look at its connection to the value function. We have the following Verification Theorem. Theorem 2.2.7. Assume that Assumption 2.1.5 is in force. Then, the value function V 21 defined in (2.9) is a viscosity solution of class (L) to the HJB equation (2.3). More precisely, it holds that V (x) = sup u∈F u(x) := ˆv(x), x ∈ [0, +∞), (2.12) where the set F is defined by (2.11). Moreover, V coincides with the classical solution of (2.3) described in Proposition 2.2.2, and the optimal control has the following form: π ∗ t (w) = G(w, 1 − V ′ (X π ∗ t )). (2.13) Proof. The proof that V is a viscosity solution satisfying (2.12) is more or less standard (see, e.g., [48]), and Proposition 2.1.4 shows that V must be of class (L). It then follows from Corollary 2.2.6 that V ′ (0+) exists and V is the (unique) classical solution of (2.3). It remains to show that π ∗ defined by (2.13) is optimal. To this end, note that |Hπ ∗ λ (t)| = R a 0 ¯f(V ′ (X∗ t ))π ∗ t (w)dw , where ¯f(z) := wz + λ ln λ(e a λ (1−z)−1) 1−z 1{z̸=1} + [w + λ ln a]1{z=1}. Thus Ex h Z τ π ∗ 0 e −ct|Hπ ∗ λ (t)|dti = Ex h Z τ π ∗ 0 e −ct Z a 0 ¯f(V ′ (X ∗ t ))π ∗ t (w)dw dti < +∞, as V ′ (X∗ t ) ∈ (0, V ′ (0+)], thanks to the concavity of V . Consequently π ∗ ∈ A(x). Finally, since V ∈ C 2 b ([0,∞)) and π ∗ is defined by (2.13) is obviously the maximizer of the Hamiltonian in HJB equation (2.1), the optimality of π ∗ follows from a standard argument via Itˆo’s formula. We omit it. 22 2.2.1 Convergence in the Temperature Parameter To end this section we would like to comment on the dependence of the entropy-regulated optimal control on the learning parameter λ, which indicates the weight of the exploration. That is, the greater λ indicates more exploration, while the smaller λ means more exploitation. Furthermore, as λ → 0, we should expect that the entropy-regularized stochastic control problem of dividend payment should degenerate to the classical form. To this end, let us denote the the optimal strategy defined by (2.13) as π ∗,λ to indicate its dependence on λ. That is, π ∗,λ t (w) = 1 − V ′ (X∗ t ) λ h e a−w λ (1−V ′(X∗ t )) − e − w λ (1−V ′(X∗ t ))i. Then a simple calculation shows that, as λ → 0, we have lim λ→0 π ∗,λ t (w) = δ{a}(w)1{1−V ′(X∗ t )>0} + δ{0}(w)1{1−V ′(X∗ t )<0}. We see that this is exactly the optimal (barrier) strategy of the classical optimal dividend problem (cf. [1]). optimal dividend problem degenerates to the classical form. To analyze the convergence of the entropy-regularized exploratory value function corresponding to λ, say Vλ to the classical value function, say V , we refer to [44]. 
In [44] section 3.4, Theorem 10 states that Vλ → V locally uniformly as λ → 0 + as long as the system parameters satisfy the [44] section 3.4 Assumption 1. A simple calculation confirms that our model parameters satisfy the [44] section 3.4 Assumption 1, hence we can conclude that Vλ → V locally uniformly as λ → 0 +. This justifies the approach of optimizing the entropy-regularized exploratory control problem as a method of optimizing the classical optimal control. 23 2.3 Policy Update We now turn to an important step in the RL scheme, that is, the so-called Policy Update. More precisely, we prove a Policy Improvement Theorem which states that for any closeloop policy π ∈ Acl(x), we can construct another π˜ ∈ Acl(x), such that J(x,π˜) ≥ J(x, π). Furthermore, we argue that such a policy policy updating procedure can be constructed without using the system parameters, and we shall discuss the convergence of the iterations to the optimal policy To begin with, for x ∈ R and π ∈ Acl(x), let Xπ,x be the unique strong solution to the SDE (2.10). For t > 0, we consider the process Wˆ s := Ws+t − Wt , s > 0. Then Wˆ is an Fˆ-Brownian motion, where Fˆ s = Fs+t , s > 0. Since the SDE (2.10) is time-homogeneous, the path-wise uniqueness then renders the flow property: X π,x r+t = Xˆ π,Xπ,x t r , r ≥ 0, where Xˆ satisfies the SDE dXˆ s = µ − Z a 0 wπ(w, Xˆ s)dw ds + σdWˆ s, s ≥ 0; Xˆ 0 = X π,x t . (2.1) Now we denote ˆπ := π(·, Xˆ ·) ∈ Aol(X π,x t ) to be the open-loop strategy induced by the closed-loop control π. Then the corresponding cost functional can be written as (denoting Xπ = Xπ,x) J(X π t ;π) = EXπ t h Z τ π Xπ t 0 e −crh Z a 0 (w − λ ln ˆπr(w))ˆπr(w)dwi dri , t ≥ 0, (2.2) where τ π X π,x t = inf{r > 0 : Xˆ π,Xπ,x t r < 0}. It is clear that, by flow property, we have τ π x = τ π X π,x t + t, P-a.s. on {τ π x > t}. Next, for any admissible policy π ∈ Acl, we formally define a new feedback control policy as follows: for 24 (w, x) ∈ [0, a] × R +, π˜(w, x) := G(w, 1 − J ′ (x;π)), (2.3) where G(·, ·) is the Gibbs function defined by (2.2). We would like to emphasize that the new policy ˜π in (2.3) depends on J and π, but is independent of the coefficients (µ, σ)(!). To facilitate the argument we introduce the following definition. Definition 2.3.1. A function x 7→ π(·; x) ∈ P([0, a]) is called “Strongly Admissible” if its density function enjoys the following properties: (i) there exist u, l > 0 such that l ≤ π(x, w) ≤ u, x ∈ R + and w ∈ [0, a]; (ii) there exists K > 0 such that |π(x, w) − π(y, w)| ≤ K|x − y|, x, y ∈ R +, uniformly in w. (iii) The support of π(x, ·) = [0, a], for all x ∈ R +. The set of strongly admissible controls is denoted by A s cl. The following lemma justifies the Definition 2.3.1. Lemma 2.3.2. Suppose that a function x 7→ π(·; x) ∈ P([0, a]) whose density takes the form π(w, x) = G(w, c(x)) where c ∈ C 1 b (R+). Then π ∈ A s cl. Proof. Since c ∈ C 1 b (R+), and G is positive, continuous, and for fixed w, G(w, ·) ∈ C ∞(R), it is easy to check that as the composition, π(·, ·) = G(·, c(·)) ∈ A s cl. In what follows we shall use the following notations. For any π ∈ Acl, r π (x) := Z a 0 (w − λ lnπ(w, x))π(w, x)dw; b π (x) = µ − Z a 0 wπ(w, x)dw. (2.4) Clearly, for π ∈ A s cl, b π and r π are bounded and are Lipschitz continuous. We denote 25 X := Xπ,x to be the solution to SDE (2.10), and rewrite the cost function (2.8) as J(x, π) = Ex h Z τ π x 0 e −csr π (X π,x s )dsi . (2.5) where τ π x = inf{t > 0 : X π,x t < 0}. 
Thus, in light of the Feynman-Kac formula, for any π ∈ A s cl, J(·, π) is the probabilistic solution to the following ODE on R+: L π [u](x) + r π (x):= 1 2 σ 2uxx(x) + b π (x)ux(x) − cu(x) + r π (x) = 0, u(0) = 0. (2.6) Now let us denote u π R to be solution to the linear elliptic equation (2.6) on finite interval [0, R] with boundary conditions u(0) = 0 and u(R) = J(R, π), then by the regularity and the boundedness of b π and r π , and using only the interior type Schauder estimates (cf. [17]), one can show that u π R ∈ C 2 b ([0, R]) and the bounds of (u π R) ′ and (u π R) ′′ depend only on those of the coefficients b π , r π and J(·, π), but uniform in R > 0. By sending R → ∞ and applying the standard diagonalization argument (cf. e.g., [28]) one shows that limR→∞ u π R(·) = J(·, π), which satisfies (2.6). We summarize the above discussion as the following proposition for ready reference. Proposition 2.3.3. If π ∈ A s cl, then J(·, π) ∈ C 2 b (R +), and the bounds of J ′ and J ′′ depend only on those of b π , r π , and J(·, π). Our main result of this section is the following Policy Improvement Theorem. Theorem 2.3.4. Assume that Assumption 2.1.5 is in force. Then, let π ∈ A s cl and let π˜ be defined by (2.3) associate to π, it holds that J(x,π˜) ≥ J(x, π), x ∈ R+. Proof. Let π ∈ A s cl be given, and let ˜π be the corresponding control defined by (2.3). Since π ∈ A s cl, b π and r π are uniformly bounded, and by Proposition 2.3.3, (1−J ′ (·, π)) ∈ C 1 b (R +). Thus Lemma 2.3.2 (with c(x) = 1 − J ′ (x, π)) implies that π˜ ∈ A s cl as well. Moreover, since π ∈ A s cl, J(·, π) is a C 2 -solution to the ODE (2.6). Now recall that π˜ ∈ A s cl is the maximizer 26 of supπb∈A s cl [b πb (x)J ′ (x, π) + r πb (x)], we have L π˜ [J(·, π)](x) + r π˜ (x) ≥ 0, x ∈ R+. (2.7) Now, let us consider the process Xπ˜ , the solution to (2.10) with π being replaced by π˜. Applying Itˆo’s formula to e −ctJ(Xπ˜ t , π) from 0 to τ π˜ x ∧ T, for any T > 0, and noting the definitions of b π˜ and r π˜ , we deduce from (2.7) that e −c(τ π˜ x ∧T) J(X π˜ τ π˜ x ∧T , π) ≥ J(x, π) − Z τ π˜ x ∧T 0 e −crr π˜ (X π˜ r )dr + Z τ π˜ x ∧T 0 e −crJ ′ (X π˜ r , π)σdWr. Taking expectation on both sides above, sending T → ∞ and noting that J(Xπ˜ τ π˜ x , π) = J(0, π) = 0, we obtain that J(x, π) ≤ J(x,π˜), x ∈ R +, proving the theorem. In light of Theorem 2.3.4 we can naturally define a “learning sequence” as follows. We start with c0 ∈ C 1 b (R +), and define π0(x, w) := G(w, c0(x)), and v0(x) := J(x, π0), πn(x, w) := G(w, 1 − J ′ (x, πn−1)), (w, x) ∈ [0, a] × R +, for n ≥ 1. (2.8) Also for each n ≥ 1, let vn(x) := J(x, πn). The natural question is whether this learning sequence is actually a “maximizing sequence”, that is, vn(x) ↗ v(x), as n → ∞. Such a result would obviously justify the policy improvement scheme, and was proved in the LQ case in [47]. Before we proceed, we note that by Proposition 2.3.3 the learning sequence vn = J(·, πn ) ∈ C 2 b (R+), n ≥ 1, but the bounds may depend on the coefficients b π n , r π n , thus may not be uniform in n. But by definition b π n and Proposition 2.1.4, we see that supn ∥b π n ∥L∞(R+) + ∥V ∥L∞(R+) ≤ C for some C > 0. Moreover, since for each n ≥ 1, J(·, π0 ) ≤ J(·, πn ) ≤ V (·), if we choose π 0 ∈ A s cl be such that J(x, π0 ) ≥ 0 (e.g., π 0 t ≡ 1 a ), then we have ∥J(·, πn )∥L∞(R+) ≤ ∥V ∥L∞(R+) ≤ C for all n ≥ 1. That is, vn’s are uniformly bounded, and uniformly in n, 27 provided that r π n ’s are. The following result, based on the recent work [20], is thus crucial. 
Proposition 2.3.5. The functions r πn , n ≥ 1 are uniformly bounded, uniformly in n. Consequently, the learning sequence vn = J(·, πn ) ∈ C 2 b (R+), n ≥ 1, and the bounds of vn’s, up to their second derivatives, are uniform in n. Our main result of this section is the following. Theorem 2.3.6. Assume that the Assumption 2.1.5 is in force. Then the sequence {vn}n≥0 is a maximizing sequence. Furthermore, the sequence {πn}n≥0 converges to the optimal policy π ∗ . Proof. We first observe that by Lemma 2.3.2 the sequence {πn} ⊂ A s cl, provided π0 ∈ A s cl. Since vn = J(·, πn), Proposition 2.3.5 guarantees that vn ∈ C 2 b (R+), and the bounds are independent of n; Theorem 2.3.4 shows that {vn} is monotonically increasing, and thus {vn}n≥0 must converge, say, to v ∗ (·). Let us fix any compact set E ⊂ R+. A simple application of Arzella-Ascolli Theorem shows that there exist a subsequence {nk}k≥1 such that {v ′ nk }k≥0 converge uniformly on E, say, to v ∗∗(·). Since obviously limk→∞ vnk = v ∗ , noting that the derivative operator is a closed operator, it follows that v ∗∗(x) = (v ∗ ) ′ (x), x ∈ E. By the same argument, for any subsequence of {v ′ n}, there exists a sub-subsequence that converges uniformly on E to the same limit (v ∗ ) ′ , and thus the sequence {v ′ n} itself converges uniformly on E to (v ∗ ) ′ . Since E is arbitrary, this shows that {(vn, v′ n )}n≥0 converges uniformly on compacts to (v ∗ ,(v ∗ ) ′ ). Since πn is a continuous function of v ′ n , we see that {πn}n≥0 converges uniformly to π ∗ ∈ Acl defined by π ∗ (x, w) := G(w, 1 − (v ∗ ) ′ (x)). Finally, applying Lemma 2.3.2 we see that π ∗ ∈ A s cl, and the structure of the π ∗ (·, ·) guarantees that v ∗ satisfies the HJB equation (2.1) on the compact set E. By expanding the result to R + using the fact that E is arbitrary, v ∗ satisfies the HJB equation (2.1) (or equivalently (2.3)). 28 Now by using the slightly modified verification argument in Theorem 4.1 in [20] we conclude that v ∗ = V ∗ is the unique solution to the HJB equation (2.1) and thus π ∗ by definition is the optimal control. The details of the modified verification argument is as follows. We consider an arbitrary π and its corresponding SDE, and apply Ito’s formula to v ∗ (Xπ t ) and use the equation (2.1) to obtain following inequality. v ∗ (x) = E " e −c(τ π x ∧T) v ∗ (X π τ π x ∧T ) − Z τ π x ∧T 0 e −crL π [v ∗ ](X π r )dr# ≥ E " e −c(τ π x ∧T) v ∗ (X π τ π x ∧T ) + Z τ π x ∧T 0 e −crr π (X π r )dr# (2.9) Again following along the lines of PI theorem, sending T → ∞, and noting that v ∗ (Xπ τ π x ) = v ∗ (0) = 0, we obtain the inequality, v ∗ (x) ≥ E " R τ π x 0 e −crr π (Xπ r )dr# = J(x, π). Taking the supremum of this inequality over π ∈ Acl, we obtain, v ∗ (x) ≥ V (x) for all x. Now We consider π ∗ and its corresponding SDE, and apply Ito’s formula to v ∗ (Xπ ∗ t ) and use the equation (2.1) to obtain the following equality. v ∗ (x) = E " e −c(τ π ∗ x ∧T) v ∗ (X π ∗ τ π∗ x ∧T ) + Z τ π ∗ x ∧T 0 e −crL π ∗ [v ∗ ](X π r )dr# = E " e −c(τ π ∗ x ∧T) v ∗ (X π ∗ τ π∗ x ∧T ) + Z τ π ∗ x ∧T 0 e −crr π ∗ (X π ∗ r )dr# (2.10) Again following along the lines of PI theorem, sending T → ∞, and noting that v ∗ (Xπ ∗ τ π∗ x ) = v ∗ (0) = 0, we obtain the inequality, v ∗ (x) = E " R τ π ∗ x 0 e −crr π ∗ (Xπ ∗ r )dr# = J(x, π∗ ) ≤ V (x). Hence the result. In theory, since the Theorem 2.3.4 is based on finding the maximizer of the Hamiltonian, the learning strategy (2.8) may depend on the system parameters, making model-based RL approach more suitable. 
However, a closer look at the learning parameters cn and dn in (2.8) 29 shows that, in our context they depend only on vn, but not (µ, σ) directly, allowing us to follow a model-free RL approach using the learning strategy (2.8) for our numerical analysis in §7. Remark 2.3.7. Another well known method for policy update in Reinforcement Learning, in the setting of continuous time and space, is the so-called Policy Gradient (PG) method introduced in [23]. This is an alternate policy update method suitable for model-free RL, applicable for both finite and infinite horizon problems. Roughly speaking, a PG method parametrizes the policies π ϕ ∈ As cl and then solves for ϕ via the equation ∇ϕJ(x, πϕ ) = 0, using stochastic approximation method. The advantage of a PG method is that it does not depend on the system parameter, whereas in theory Theorem 2.3.4 is based on finding the maximizer of the Hamiltonian, and thus the learning strategy (2.8) may depend on the system parameter. However, a closer look at the learning parameters cn and dn in (2.8) we see that they depend only on vn, but not (µ, σ) directly. In fact, we believe that in our case the PG method would not be advantageous, especially given the convergence result in Theorem 2.3.6 and the fact that the the PG method requires also a proper choice of the parameterization family which, to the best of our knowledge, remains a challenging issue in practice. We shall therefore content ourselves to algorithms using learning strategy (2.8) for our numerical analysis in §7. As stated in remark 2.3.7, the learning strategy (2.8) does not depend on (µ, σ) directly, but only through the J(x, πn ) = vn terms. But instead of explicitly calculating the J(x, πn ) terms, in the algorithms 1 and 2 we use Policy Evaluation methods that are independent of the coefficients µ and σ to approximate J(x, πn ), making these algorithms implementable without the knowledge of coefficients µ and σ, assuming the availability of an environment simulator that does not require the knowledge of these coefficients. However when we consider a more general framework where the state process and the value 30 function of the classical problem is given by dXw t = x + µ(x, w)dt + σ(x, w)dWt , t > 0, x ∈ R, and V (x) := sup w∈W Ex h Z τ w x 0 e −ctρ(x, w)dti respectively, where x is the initial state and w is a action that belongs to a given action space W , we obtain the following learning sequence by considering the relevant entropy regularized relaxed control problem and following a similar argument as in the proof of the theorem (2.3.4). π n+1(x, w) = e 1 λ [µ(x,w)v ′ n(x)+ 1 2 σ 2 (x,w)v ′′ n(x)+ρ(x,w)] R W e 1 λ [µ(x,w)v ′ n(x)+ 1 2 σ2(x,w)v ′′ n(x)+ρ(x,w)]dw (2.11) where vn = J(x, πn ). As evident by the learning sequence 2.11, the Policy Update step will not be independent of the drift and diffusion functions, if the drift or the diffusion function depends on the control w. Thus in such a scenario, to design an algorithm that can be implemented without the knowledge of the functions µ and σ, one has to use alternate methods of Policy Update such as the Policy Gradient method discussed in remark (2.3.7). In next section, we develop the theoretical foundation for PG methods for our context, following the work of [23]. 2.4 Policy Gradient (PG) For PG method a parameterized family of policies π ϕ where ϕ ∈ Φ ∈ R l where l ≥ 1 is considered.The aim is to compute the policy gradient, g(x, ϕ) = ∇ϕJ(x, πϕ ), assuming π ϕ is admissible. 
Once the gradient $g$ is calculated, the parameter is updated by the rule $\phi \leftarrow \phi + \alpha g(x, \phi)$, where $\alpha > 0$ is a learning rate. We recall from Section 2.3 that for any strongly admissible policy $\pi^\phi$, $J(\cdot, \pi^\phi)$ satisfies the equation $\mathcal{L}^{\pi^\phi}[J(x, \pi^\phi)] + r^{\pi^\phi}(x) = 0$. Writing $\tilde r(x, w, \pi^\phi)$ for the corresponding integrand, so that $\mathcal{L}^{\pi^\phi}[J(x, \pi^\phi)] + r^{\pi^\phi}(x) = \int_A \tilde r(x, w, \pi^\phi)\,\pi^\phi(x, w)\,dw$, we obtain
$$\int_A \tilde r(x, w, \pi^\phi)\,\pi^\phi(x, w)\,dw = 0, \quad x \in \mathbb{R}_+; \qquad J(0, \pi^\phi) = 0. \qquad (2.12)$$
Here we assume that the family of parameters $\Phi$ is chosen so that $g(0, \phi) = 0$ whenever $J(0, \pi^\phi) = 0$. Differentiating (2.12) with respect to $\phi$, we obtain
$$\int_A \Big\{\nabla_\phi\big[\tilde r(x, w, \pi^\phi)\big]\,\pi^\phi(x, w) + \tilde r(x, w, \pi^\phi)\,\nabla_\phi \pi^\phi(x, w)\Big\}\,dw = 0, \quad x \in \mathbb{R}_+; \qquad g(0, \phi) = 0. \qquad (2.13)$$
We can rewrite (2.13) as follows:
$$\mathcal{L}^{\pi^\phi}[g(x, \phi)] + \int_A \big(\tilde r(x, w, \pi^\phi) - \lambda\big)\,\nabla_\phi \ln \pi^\phi(w, x)\,\pi^\phi(w, x)\,dw = 0; \qquad g(0, \phi) = 0.$$
Defining $p(x, w, \pi^\phi) := \big(\tilde r(x, w, \pi^\phi) - \lambda\big)\nabla_\phi \ln \pi^\phi(w, x)$, we have
$$\mathcal{L}^{\pi^\phi}[g(x, \phi)] + \int_A p(x, w, \pi^\phi)\,\pi^\phi(w, x)\,dw = 0; \qquad g(0, \phi) = 0. \qquad (2.14)$$
Now, applying the Feynman–Kac formula, we observe that
$$g(x, \phi) = \mathbb{E}^P_x\Big[\int_0^{\tau_x} e^{-\beta s}\int_A p(X^{\pi^\phi}_s, w, \pi^\phi)\,\pi^\phi(w, X^{\pi^\phi}_s)\,dw\,ds\Big],$$
or equivalently,
$$g(x, \phi) = \mathbb{E}^Q_x\Big[\int_0^{\tau_x} e^{-\beta s}\,p(\tilde X^{\pi^\phi}_s, w^{\pi^\phi}_s, \pi^\phi)\,ds\Big]. \qquad (2.15)$$
This expectation cannot be computed from sample trajectories because the term $p(\tilde X^{\pi^\phi}_s, w^{\pi^\phi}_s, \pi^\phi)$ relies on the coefficients $\sigma$ and $\mu$, which are unknown in the RL setting. We therefore consider
$$\tfrac12 \sigma^2 J''(\tilde X^{\pi^\phi}_s, \pi^\phi) + (\mu - w)J'(\tilde X^{\pi^\phi}_s, \pi^\phi) \;\sim\; dJ(\tilde X^{\pi^\phi}_s, \pi^\phi) - J'(\tilde X^{\pi^\phi}_s, \pi^\phi)\,\sigma\,dW_s$$
in the sense of "stochastic differentials" (or by, say, Itô's formula). More precisely, observing that $\mathbb{E}^Q\big[J'(\tilde X^{\pi^\phi}_s, \pi^\phi)\,\sigma\,dW_s\big] = 0$, we see that
$$\mathbb{E}^Q_x\Big[\int_0^{\tau_x} e^{-\beta s}\big[\tfrac12\sigma^2 J''(\tilde X^{\pi^\phi}_s, \pi^\phi) + (\mu - w)J'(\tilde X^{\pi^\phi}_s, \pi^\phi)\big]\nabla_\phi \ln \pi^\phi(w^{\pi^\phi}_s, X^{\pi^\phi}_s)\,ds\Big] = \mathbb{E}^Q_x\Big[\int_0^{\tau_x} e^{-\beta s}\,\nabla_\phi \ln \pi^\phi(w^{\pi^\phi}_s, X^{\pi^\phi}_s)\,dJ(\tilde X^{x,\pi^\phi}_s, \pi^\phi)\Big],$$
and we can define the function
$$\hat p(x, w, \phi) := \big(w - cJ(x, \pi^\phi) - \lambda \ln \pi^\phi(w, x) - \lambda\big)\nabla_\phi \ln \pi^\phi(w, x). \qquad (2.16)$$
In fact, in light of the argument above, we can write
$$g(x, \phi) = \mathbb{E}^Q_x\Big[\int_0^{\tau_x} e^{-\beta s}\,\hat p(\tilde X^{\pi^\phi}_s, w^{\pi^\phi}_s, \phi)\,ds + \int_0^{\tau_x} e^{-\beta s}\,\nabla_\phi \ln \pi^\phi(w^{\pi^\phi}_s, X^{\pi^\phi}_s)\,dJ(\tilde X^{x,\pi^\phi}_s, \pi^\phi)\Big].$$
We note that this expectation does not depend on the coefficients. However, we cannot observe the sample trajectory up to $\tau_x$, so we cannot compute the above expectation directly; we therefore need an online learning approach. For online learning we consider the case where $\phi^*$ is the optimal point of $J(x, \pi^\phi)$ for every $x > 0$ and is an interior point of the set $\Phi$ (i.e., the first-order condition holds). Then $g(x, \phi^*) = 0$ for all $x$, and thus by (2.14) we obtain
$$\int_A p(x, w, \pi^{\phi^*})\,\pi^{\phi^*}(w, x)\,dw = 0 \qquad (2.17)$$
for all $x$. Thus we have the following theorem.

Theorem 2.4.1. If there is an interior optimal point $\phi^*$ that maximizes $J(x, \pi^\phi)$ for every $x$, then
$$0 = \mathbb{E}^Q_x\Big[\int_0^{\tau_x} \eta_s\Big\{\nabla_\phi \ln \pi^{\phi^*}(w^{\pi^{\phi^*}}_s, \tilde X^{\pi^{\phi^*}}_s) + \zeta_s\Big[dJ(\tilde X^{\pi^{\phi^*}}_s, \pi^{\phi^*}) + \big(w^{\pi^{\phi^*}}_s - \lambda \ln \pi^{\phi^*}(w^{\pi^{\phi^*}}_s, \tilde X^{\pi^{\phi^*}}_s) - cJ(\tilde X^{\pi^{\phi^*}}_s, \pi^{\phi^*})\big)\,ds\Big] - \lambda \nabla_\phi \ln \pi^{\phi^*}(w^{\pi^{\phi^*}}_s, \tilde X^{\pi^{\phi^*}}_s)\Big\}\,ds\Big]$$
for any $\eta, \zeta \in L^2_{\mathcal{F}}([0, T]; M^\theta)$.

Proof: This is identical to the proof of Theorem 3 in [23] and thus we omit it.

Theorem 2.4.1 can be used for online learning by selecting $\eta_s = 0$ for $s \geq \rho$ when we have observed the trajectory only up to the present time $\rho > 0$. To learn the optimal policy, we solve the system of equations given by Theorem 2.4.1 using stochastic approximation. The update rule is then given by $\phi \leftarrow \phi + l(k)\,\alpha_\phi\,\Delta\phi$, where
$$\Delta\phi = \eta_{t_k}\big(\xi_{t_k} + \nabla_\phi \ln \pi^\phi(w_{t_k}, x_{t_k})\big)\,\Delta \;-\; \lambda\,\nabla_\phi \ln \pi^\phi(w_{t_k}, x_{t_k})\,\Delta t,$$
and $w_{t_k}$, $x_{t_k}$ denote the action and the state at time $t_k$, respectively.
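To make the above update step concrete, the following is a minimal Python sketch of one stochastic-approximation iteration, written under the reading of $\Delta\phi$ displayed above. All of the helper quantities (the score $\nabla_\phi \ln\pi^\phi(w_{t_k}, x_{t_k})$, the observed critic increment, and the test-process values $\eta_{t_k}$, $\xi_{t_k}$) are assumed to be supplied by the user's policy and critic models; the function and argument names are hypothetical and do not appear elsewhere in this thesis.

import numpy as np

def pg_online_step(phi, eta_k, xi_k, grad_log_pi, delta_J, J_k, w_k, log_pi_wk,
                   lam, c, dt, alpha_phi, lr_k):
    """One stochastic-approximation update  phi <- phi + l(k) * alpha_phi * Delta_phi.

    grad_log_pi : grad_phi ln pi^phi(w_k, x_k), evaluated by the user's policy model
    delta_J     : observed critic increment J(x_{k+1}, pi^phi) - J(x_k, pi^phi)
    eta_k, xi_k : values of the test processes at time t_k
    """
    # discretization of dJ + (w - lambda * ln pi - c * J) ds
    td = delta_J + (w_k - lam * log_pi_wk - c * J_k) * dt
    # Delta_phi = eta_k * (xi_k + grad_log_pi) * td  -  lambda * grad_log_pi * dt
    delta_phi = eta_k * (xi_k + grad_log_pi) * td - lam * grad_log_pi * dt
    return phi + lr_k * alpha_phi * delta_phi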
2.5 Policy Evaluation — A Martingale Approach Having proved the policy improvement theorem, we turn our attention to an equally important issue in the learning process, that is, the evaluation of the cost (value) functional, or the Policy Evaluation. The main idea of the policy evaluation in reinforcement learning literature usually refers to a process of approximating the cost functional J(·, π), for a given 34 feedback control π, by approximating J(·, π) by a parametric family of functions J θ , where θ ∈ Θ ⊆ R l . Throughout this section, we shall consider a fixed feedback control policy π ∈ A s cl. Thus for simplicity of notation, we shall drop the superscript π and thus write r(x) = r π (x), b(x) = b π (x) and J(x, π) = J(x); τx = τ π x . We note that for π ∈ A s cl, the functions r, b ∈ Cb(R+) and J ∈ C 2 b (R+). Now let Xx = Xπ,x be the solution to the SDE (2.10), and J(·) satisfies the ODE (2.6). Then, applying Itˆo’s formula we see that Mx t := e −ctJ(X x t ) + Z t 0 e −csr(X x s )ds, t ≥ 0, (2.1) is an F-martingale. Furthermore, the following result is more or less standard. Proposition 2.5.1. Assume that Assumption 2.3.5 holds, and suppose that J˜(·) ∈ Cb(R+) is such that J˜(0) = 0, and for all x ∈ R+, the process M˜ x := {M˜ x t = e −csJ˜(Xx s ) + R t 0 e −csr(Xx s )ds; t ≥ 0} is an F-martingale. Then J ≡ J˜. Proof. First note that J(0) = J˜(0) = 0, and Xx τx = 0. By (2.1) and definition of M˜ x we have M˜ x τx = R τx 0 e −csr(Xx s )ds = Mx τx . Now, since r, J, and J˜ are bounded, both M˜ x and Mx are uniformly integrable F-martingales, by optional sampling we have J˜(x) = M˜ x 0 = E[M˜ x τx |F0] = E[Mx τx |F0] = Mx 0 = J(x), x ∈ R+. The result follows. We now consider a family of functions {J θ (x) : (x, θ) ∈ R+ × Θ}, where Θ ⊆ R l is a certain index set. For the sake of argument, we shall assume further that Θ is compact. Moreover, we shall make the following assumptions. Assumption 2.5.2. (i) The mapping (x, θ) 7→ J θ (x) is sufficiently smooth, so that all the derivatives required exist in the classical sense. (ii) For all θ ∈ Θ, φ θ (Xx · ) are square-integrable continuous processes, and the mappings θ 7→ ∥φ θ∥L 2 F ([0,T]) are continuous, where φ θ = J θ ,(J θ ) ′ ,(J θ ) ′′ . 35 (iii) There exists a continuous function K(·) > 0, such that ∥J θ∥∞ ≤ K(θ). In what follows we shall often drop the superscript x from the processes Xx , Mx etc., if there is no danger of confusion. Also, for practical purpose we shall consider a finite time horizon [0, T], for an arbitrarily fixed and sufficiently large T > 0. Denoting the stopping time ˜τx = τ T x := τx ∧ T, by optional sampling theorem, we know that M˜ t := Mτ˜x∧t = Mτx∧t , for t ∈ [0, T], is an F˜-martingale on [0, T], where F˜ = {Fτ˜x∧t}t∈[0,T] . Let us also denote M˜ θ t := Mθ τx∧t , t ∈ [0, T]. We now follow the idea of [22] to construct the so-called Martingale Loss Function. For any θ ∈ Θ, consider the parametrized approximation of the process M = Mx : Mθ t = M θ,x t := e −ctJ θ (X x t ) + Z t 0 e −csr(X x s )ds, t ∈ [0, T]. (2.2) For notational simplicity in what follows we denote ˜r X t := e −ctr(Xt), t ≥ 0. In light of the Martingale Loss function introduced in [22], we denote ML(θ)= 1 2 E hZ τ˜x 0 |Mτx−M˜ θ t | 2 dti = 1 2 E hZ τ˜x 0 e −ctJ θ (Xt)− Z τx t r˜ X s ds 2 dti . (2.3) We should note that the last equality above indicates that the martingale loss function is actually independent of the function J, which is one of the main features of this algorithm. 
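Before proceeding, we illustrate how this loss can be evaluated in practice. The following is a minimal Python sketch of a one-path estimate of the (discretized) martingale loss, anticipating the approximation $ML_{\Delta t}$ introduced below. It assumes that $J^\theta$ is available as a vectorized function and that the running reward observed along the path at time $t_j$ is $w_{t_j} - \lambda\ln\pi(w_{t_j}, \tilde X_{t_j})$, in accordance with the $Q$-measure representation used in this chapter; the helper names are illustrative only.

import numpy as np

def martingale_loss_on_path(J_theta, x_path, r_path, c, dt):
    """One-sample estimate of the discretized martingale loss ML_{Delta t}(theta).

    J_theta : vectorized callable, parametric approximation J^theta(x)
    x_path  : states X_{t_0}, ..., X_{t_{N-1}} observed up to ruin or the horizon
    r_path  : running rewards observed on the same grid (w - lambda * ln pi in the RL setting)
    """
    x_path = np.asarray(x_path, dtype=float)
    r_path = np.asarray(r_path, dtype=float)
    N = len(x_path)
    t = np.arange(N) * dt
    disc_r = np.exp(-c * t) * r_path                    # e^{-c t_j} r(X_{t_j})
    # tail sums sum_{j >= i} e^{-c t_j} r(X_{t_j}) dt  ~  int_{t_i}^{tau} e^{-c s} r(X_s) ds
    tail = np.cumsum(disc_r[::-1])[::-1] * dt
    resid = np.exp(-c * t) * J_theta(x_path) - tail     # e^{-c t_i} J^theta(X_{t_i}) - tail_i
    return 0.5 * np.sum(resid ** 2) * dt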
Furthermore, inspired by the mean-squared and discounted mean-squared value errors we define MSVE(θ) = 1 2 E h Z τ˜x 0 |J θ (Xt) − J(Xt)| 2 dti , (2.4) DMSVE(θ) = 1 2 E h Z τ˜x 0 e −2ct|J θ (Xt) − J(Xt)| 2 dti . (2.5) 36 The following result connects the minimizers of ML(·) and DMSV E(·). Theorem 2.5.3. Assume that Assumption 2.5.2 is in force. Then, it holds that arg min θ∈Θ ML(θ) = arg min θ∈Θ DMSVE(θ). (2.6) Proof. First, note that J(0) = J θ (0) = 0, and Xτx = 0, we have M˜ θ t = Mθ τx = Mτx = R τx 0 e −csr(Xs)ds, t ∈ (˜τx, T). (Here we use the convention that (˜τx, T) = ∅ if ˜τx = T.) Consequently, since M˜ θ t = Mθ t , for t ∈ [0, τ˜x], by definition (2.3) we can write 2ML(θ) = E h Z τ˜x 0 |Mτx−M˜ θ t | 2 dti = E h Z τ˜x 0 |Mτx − Mθ t | 2 dti (2.7) = E h Z τ˜x 0 |Mτx−Mt | 2+|Mt− Mθ t | 2 + 2(Mτx− Mt)(Mt − Mθ t ) dti . Next, noting (2.1) and (2.2), we see that E hZ τ˜x 0 |Mt − Mθ t | 2 dti = E hZ τ˜x 0 e −2ct|J(Xt) − J θ (Xt)| 2 dti = 2DMSVE(θ). Also, applying optional sampling we can see that E hZ τ˜x 0 (Mτx−Mt)(Mt−Mθ t )dti = Z T 0 E E (Mτx−Mt)|Ft 1{τx≥t}(Mt − Mθ t ) dt = E h Z T 0 E (Mτx − Mt)|Ft∧τx 1{τx≥t}(M˜ t − M˜ θ t )dti = 0. Combining above we see that (2.7) becomes 2ML(θ) = 2DMSVE(θ)+E R τ˜x 0 |Mτx −Mt | 2dt . Since E[ R τ˜x 0 |Mτx − Mt | 2dt] is independent of θ, we conclude the result. Remark 2.5.4. Since the minimizers of MSVE(θ) and DMSVE(θ) are obviously identical, Theorem 2.5.3 suggests that if θ ∗ is a minimizer of either one of ML(·), MSV E(·), DMSV E(·), then J θ ∗ would be an acceptable approximation of J. In the rest of the section we shall therefore focus on the identification of θ ∗ . 37 We now propose an algorithm that provides a numerical approximation of the policy evaluation J(·) (or equivalently the martingale Mx ), by discretizing the integrals in the loss functional ML(·). To this end, let T > 0 be an arbitrary but fixed time horizon, and consider the partition 0 = t0 < · · · < tn = T, and denote ∆t = ti − ti−1, i = 1, · · · , n. Now for x ∈ R+, we define Kx = min{l ∈ N : ∃t ∈ [l∆t,(l + 1)∆t) : Xx t < 0}, and ⌊τx⌋ := Kx∆t so that τx ∈ [Kx∆t,(Kx + 1)∆t). Finally, we define Nx = min{Kx, n}. Clearly, both Kx and Nx are integer-valued random variables, and we shall often drop the subscript x if there is no danger of confusion. In light of (2.3), let us define 2ML∆t(θ)=E h N X−1 i=0 e −ctiJ θ (Xti )− K X−1 j=i r˜ X tj ∆t 2∆t i =:E h N X−1 i=0 |∆M˜ θ ti | 2∆t i . (2.8) Furthermore, for t ∈ [0, τx], we define m(t, θ) := −e −ctJ θ (Xt) + R τx t r˜ X s ds. Now note that {τx ≥ T} = {⌊τx⌋ < T ≤ τx} ∪ {⌊τx⌋ ≥ T}, and {⌊τx⌋ ≥ T} = {N = n}. Denoting F˜ 1 = E h R T 0 |m(t, θ)| 2dt1{⌊τx⌋<T ≤τx} i , we have E h Z T 0 |m(t, θ)| 2 dt1{τx≥T} i = E h Z T 0 |m(t, θ)| 2 dt 1{⌊τx⌋<T ≤τx} + 1{⌊τx⌋≥T} i = F˜ 1 + E h N X−1 i=0 Z ti+1 ti |m(t, θ)| 2 dt1{N=n} i . (2.9) Since {⌊τx⌋ < T} ={N < n} = {τx < T} ∪ {⌊τx⌋ < T ≤ τx}, denoting F˜ 2 = E h Z τx 0 |m(t, θ)| 2 dt1{⌊τx⌋<T ≤τx} i and F˜ 3 = E h Z τx ⌊τx⌋ |m(t, θ)| 2 dt1{⌊τx⌋<T} i , 3 we obtain, E h Z τx 0 |m(t, θ)| 2 dt1{τx<T} i = E h Z ⌊τx⌋ 0 |m(t, θ)| 2 dt1{N<n} i + F˜ 3 − F˜ 2 = E h N X−1 i=0 Z ti+1 ti |m(t, θ)| 2 dt1{N<n} i + F˜ 3 − F˜ 2 Combining (2.9) and (2.10), similar to (2.7)) we can now rewrite (2.3) as 2ML(θ) = E h Z τ˜x 0 |m(t, θ)| 2 dti = E h Z T 0 |m(t, θ)| 2 dt 1{τx≥T} + 1{τx<T} i = E h N X−1 i=0 Z ti+1 ti |m(t, θ)| 2 dti + F˜ 1 − F˜ 2 + F˜ 3. (2.10) We are now ready to give the main result of this section. Theorem 2.5.5. Let Assumptions 2.3.5 and 2.5.2 be in force. 
Then it holds that lim ∆t→0 ML∆t(θ) = ML(θ), uniformly in θ on compacta. Proof. Fix a partition 0 = t0 < · · · < tn = T. By (2.8) and (2.10) we have, for θ ∈ Θ, 2|ML(θ) − ML∆t(θ)| ≤ E h N X−1 i=0 Z ti+1 ti |m(t, θ)| 2 − |∆M˜ θ ti | 2 dti + X 3 i=1 |F˜ i |. (2.11) Let us first check |F˜ i |, i = 1, 2, 3. First, by Assumption 2.5.2, we see that |m(t, θ)| ≤ |J θ (Xt)| + Z ∞ 0 e −cs|r(Xs)|ds ≤ K(θ) + R c =: C1(θ), t > 0, (2.12) where C1(·) is a continuous function, and R is the bound of r(·). Thus we have |F˜ 3| = E h Z τx ⌊τx⌋ |m(t, θ)| 2 dt1{N<n} i ≤ |C1(θ)| 2∆t. (2.13) 3 Note that ⌊τx⌋ ≤ T implies τx ≤ ⌊τx⌋ + ∆t ≤ T + ∆t, and for small ∆t (e.g., ∆t < 1), by definitions of F˜ 1 and F˜ 2 we have |F˜ 1|+|F˜ 2| ≤ 2E hZ T +1 0 |m(t, θ)| 2 dt1{⌊τx⌋≤T ≤τx} i ≤ 2|C1(θ)| 2 (T + 1)P{⌊τx⌋ ≤ T ≤ τx} ≤ 2|C1(θ)| 2 (T + 1)P{|T − τx| ≤ ∆t}. (2.14) Since X is a diffusion, one can easily check that lim∆t→0 P{|T − τx| ≤ ∆t} = P{T = τx} ≤ P{XT = 0} = 0. Furthermore, noting that |C1(θ)| 2 is uniformly bounded for θ in any compact set, from (2.13) and (2.14) we conclude that lim ∆t→0 (|F˜ 1| + |F˜ 2| + |F˜ 3|) = 0, uniformly in θ on compacta. (2.15) Next, let E˜∆t 1 :=E PN−1 i=0 R ti+1 ti |m(t, θ)| 2−|m(ti , θ)| 2 dt , E˜∆t 2 :=E PN−1 i=0 |m(ti , θ)| 2−|∆M˜ θ ti | 2 ∆t , then we have E h N X−1 i=0 Z ti+1 ti |m(t, θ)| 2 − |∆M˜ θ ti | 2 dti ≤ E˜∆t 1 + E˜∆t 2 , (2.16) Now, by definition of m(t, θ) and ∆M˜ θ ti , i = 1, · · · , n, we can easily check that Di := m(ti , θ) − ∆M˜ θ ti = K X−1 j=i Z tj+1 tj [˜r X s − r˜ X tj ]dt + Z τx ⌊τx⌋ r˜ X s ds. (2.17) Clearly, |E[ R τx ⌊τx⌋ r˜ X s ds]| ≤ K1∆t for some K1 > 0. Moreover, note that ˜r is a bounded and continuous process, for any ε > 0, let M˜ ∈ N be such that e −ct ≤ εc 4R , t ≥ M˜ , and define 40 ρ M˜ 2 (˜r X, ∆t) := sup|t−s|≤∆t,t,s∈[0,M˜ ] ∥r˜ X t − r˜ X s ∥L2(Ω), we have E hX∞ j=i Z tj+1 tj |r˜ X s − r˜ X tj |dsi ≤ M X˜ −1 j=1 Z tj+1 tj E|r˜ X s − r˜ X tj |ds + X∞ j=M˜ Z tj+1 tj E|r˜ X s − r˜ X tj |ds ≤ M X˜ −1 j=1 Z tj+1 tj ρ2(˜r X, ∆t)ds + 4R Z ∞ M˜ e −csds = ∆t(M˜ − 1)ρ M˜ 2 (˜r X, ∆t) + 4R e −cM˜ c ≤ ∆t(M˜ − 1)ρ M˜ 2 (˜r X, ∆t) + ε. Sending ∆t → 0 we have lim∆t→0 supi E hP∞ j=i R tj+1 tj |r˜ X s − r˜ X tj |dsi ≤ ε. Since ε > 0 is arbitrary, this implies supi E hP∞ j=i R tj+1 tj |r˜ X s − r˜ X tj |dsi → 0 as ∆t → 0. Consequently, we deduce from (2.17) that supi≥0 E|Di | → 0 as ∆t → 0. On the other hand, from definition (2.8) we see that under Assumption 2.5.2 it holds that |∆M˜ θ ti | ≤ C1(θ), i = 1, · · · , n. Therefore, we have E˜∆t 2 = E h N X−1 i=0 |m(ti , θ) + ∆M˜ θ ti |Di ∆t i ≤ 2∆t|C1(θ)|E h N X−1 i=0 |Di | i (2.18) ≤ 2n∆t|C1(θ)|sup i≥0 E|Di | → 0, as ∆t → 0. Since C1(·) is continuous in θ, we see that the convergence above is uniform in θ on compacta. Similarly, note that by Assumption 2.5.2 the process m(·, θ) is also a square-integrable continuous process, uniformly in θ, and by (2.12) we have E˜∆t 1 ≤ 2C1(θ)E hXn−1 i=0 Z ti+1 ti |m(t, θ) − m(ti , θ)|dti (2.19) ≤ 2C1(θ) Xn−1 i=0 Z ti+1 ti ρ(m(·, θ), ∆t)dt = 2C1(θ)T ρ(m(·, θ), ∆t), where ρ(m(·, θ), ∆t) := sup|t−s|≤∆t,t,s∈[0,T] ∥m(t, θ)− m(s, θ)∥L2(Ω) → 0, as ∆t → 0, uniformly in θ on compacta. Combining (2.16)–(2.19), and noting (2.15) as well as (2.11), we complete the proof of the theorem. 41 Now let us denote h = ∆t, and consider the functions f(θ) := ML(θ), fh(θ) := MLh(θ), rh(θ) := MLh(θ) − ML(θ). Then fh(θ) = f(θ) + rh(θ), and by Assumption 2.5.2 we can easily check that the mappings θ 7→ fh(θ), rh(θ) are continuous functions. 
Applying Theorem 2.5.5 we see that rh(θ) → 0, uniformly in θ on compacta, as h → 0. Note that if Θ is compact, then for any h > 0, there exists θ ∗ h ∈ arg minθ∈Θ fh(θ). In general, we have the following corollary of Theorem 2.5.5. Corollary 2.5.6. Assume that all assumptions in Theorem 2.5.5 are in force. If there exists a sequence {hn}n≥0 ↘ 0, such that Θn := arg minθ∈Θ fhn (θ) ̸= ∅, then any limit point θ ∗ of the sequence {θ ∗ n}θ ∗ n∈Θn must satisfy θ ∗ ∈ arg minθ∈Θ f(θ). Proof. This is a direct consequence of [22, Lemma 1.1]. Remark 2.5.7. We should note that, by Remark 2.5.4, the set of minimizers of the martingale loss function ML(θ) is the same as that of DMVSE(θ). Thus Corollary 2.5.6 indicates that we have a reasonable approach for approximating the unknown function J. Indeed, if {θ ∗ n} has a convergent subsequence that converges to some θ ∗ ∈ Θ, then J θ ∗ is the best approximation for J by either the measures of MSVE or DMSVE. To end this section we discuss the ways to fulfill our last task: finding the optimal parameter θ ∗ . There are usually two learning methods for this task in RL, often referred to as online and batch learning, respectively. The batch learning methods use multiple sample trajectories, X (k) t , t ∈ [0, T] for k ≤ M, where M is the number of training episodes, (i.e. number of sample trajectories) used. In offline learning, an initial θ (1) will be specified. At each k ≥ 1, θ (k) will be updated into θ (k+1) after observing the whole sample trajectory X(k) . In online learning setting, instead of multiple sample trajectories, only a single sample trajectory X will be observed. In online learning, an initial θ (1) will be specified. At each i ≥ 1, θ (i) will be updated into θ (i+1) using 42 the observation X(ti) at the time t = ti . Hence, for online learning methods, the learner does not have to wait till the end of the trajectory to update θ. They can update θ at each time step, either until θ converges or till the number of iterations that will be used for updating θ is exhausted. (There will be a specified number of maximum iterations used for updating θ for computational efficiency). Clearly, the online learning is particularly suitable for infinite horizon problem, whereas the ML function is by definition better suited for batch learning. Although our problem is by nature an infinite horizon one, we shall first create a batch learning algorithm via the ML function by restricting ourselves to an arbitrarily fixed finite horizon T > 0, converting it to an finite time horizon problem. Note that in ML∆t(·) (cf. (2.8)), K may be unbounded, and may not be computed using sample data when T < τx because we observe the sample trajectories only in time [0, T]. we shall consider instead the function: 2ML g∆t(θ) = E h N X−1 i=0 e −ctiJ θ (Xti ) − N X−1 j=i r˜ X tj ∆t 2 ∆t i . We observe that the difference |2ML∆t(θ) − 2ML g∆t(θ)| = N X−1 i=0 he −ctiJ θ (Xti ) − N X−1 j=i r˜ X tj ∆t 2 ∆t − e −ctiJ θ (Xti ) − K X−1 j=i r˜ X tj ∆t 2 ∆t i = N X−1 i=0 h K X−1 j=N r˜ X tj ∆t 2e −ctiJ θ (Xti ) − N X−1 j=i r˜ X tj ∆t − K X−1 j=i r˜ X tj ∆t i ∆t ≤ K(θ)∆t N X−1 i=0 K X−1 j=N r˜ X tj ∆t ≤ Ke(θ)∆t N X−1 i=0 K X−1 j=N e −ctj∆t ≤ Ke(θ)e −cT T ∆t, for some continuous function Ke(θ). Thus if Θ is compact, for T large enough or ∆t small enough, the difference between ML∆t(θ) and ML g∆t(θ) is negligible. 
Furthermore, we note 4 that under Q, 2ML g∆t(θ) = E Q h N X−1 i=0 e −ctiJ θ (X˜ ti ) − N X−1 j=i e −ctj (w π tj − λ ln π(w π tj , X˜ tj ))∆t 2 ∆t i , we now follow the method of Stochastic Gradient Descent (SGD) to minimize ML g∆t(θ) and obtain the updating rule: θ (k+1) ← θ (k) − α(k)∇θML g∆tθ, which we shall name as the MLalgorithm. Here α(k) denotes the learning rate for the k th iteration and is chosen in a way such that P∞ k=0 α(k) = ∞, P∞ k=0 α 2 (k) < ∞ to help guarantee the convergence of the algorithm, based on the literature on the convergence of SGD. 2.6 Temporal Difference (TD) Based Learning In this section we consider another policy evaluation method utilizing the parametric family {J θ}θ∈Θ. The starting point of this method is Proposition 2.5.1, which states that the best approximation J θ is one whose corresponding approximating process Mθ defined by (2.2) is a martingale (in which case J θ = J(!)). Next, we recall the following simple fact (see, e.g., [22], proposition 2 for a proof). Proposition 2.6.1. An Itˆo process Mθ ∈ L 2 F ([0, T]) is a martingale if and only if E h Z T 0 ξtdMθ t i = 0, for any ξ ∈ L 2 F([0, T]; Mθ ). (2.1) The functions ξ ∈ L 2 F ([0, T]; Mθ ) are called test functions. Proposition 2.6.1 suggests that a reasonable approach for approximating the optimal θ ∗ could be solving the martingale orthogonality condition (2.1). However, since (2.1) involves infinitely many equations, for numerical approximations we should only choose a finite number of test functions, often referred to as moment conditions. If θ ∈ Θ ∈ R l , we need at least l test functions. Throughout this section, we consider all the vectors to be column vectors 44 of order l × 1. There are many ways to choose the test functions. This is where the knowledge of Temporal Difference algorithms from the discrete time and space RL literature (see Appendix B for more details) come to play. In the finite horizon case, [22] proposes certain test functions using the observation that, for certain choices of test functions, solving equation (2.1) using stochastic approximation is equivalent to continuous analogs of TD methods, known as CTD methods. In this section, we derive analogous algorithms for the infinite horizon problem, using the fact that in this setting, • E[dMθ t ] = e −ctE[dJθ (Xt) − cJθ (Xt)dt + rtdt]. • That is denoting the sampled action at time t by wt , E Q [dMθ t ] = e −ctE Q [dJθ (X˜ t) − cJθ (X˜ t)dt + (wt − λlnπ(wt , X˜ t))dt]. Thus denoting ∆J θ (X˜ ti ) := J θ (X˜ ti+1 ) − J θ (X˜ ti ), we have Mθ ti+1 − Mθ ti ≈ e −cti ∆J θ ti (X˜) + (−cJθ (X˜ ti ) + wti − λlnπ(wti , X˜ ti )∆t , where wti is the action sampled at time ti from the policy distribution π(., X˜ ti ).Throughout this section we will use ∆i to denote e −cti ∆J θ ti (X˜)+(−cJθ (X˜ ti )+wti−λlnπ(ati , X˜ ti )∆t . We also use ∆(k) i to denote e −cti ∆J θ ti (X˜(k) ) + (−cJθ (X˜ (k) ti ) + w (k) ti − λlnπ(a (k) ti , X˜ (k) ti )∆t , where w (k) ti is the action sampled at time ti from the policy distribution π(., X˜ (k) ti ), when we are using the k th sample trajectory, X˜(k) . Throughout this section, we use an arbitrary time T to derive the batch learning and relevant online learning algorithms. We notice here that we can choose any T > 0 for the family of equations (2.1) since, for any ξ ∈ L 2 F([0,∞], Mθ ), ˜ξ = { ˜ξt : ˜ξt = ξt1t≤T , t ≥ 0} ∈ L 2 F([0, T]) and ˜ξ will satisfy the equation (2.1). 
Thus when deriving online algorithms we use T = N˜ on × ∆t, where ∆t is the time scale and N˜ on is the maximum number of iterations used 45 in the online algorithm. We note that the online algorithms are independent of T, and our interest in this section is exclusively online Algorithms. Hence the choice of T has no real consequences. We only consider T here to demonstrate the concepts underlying the online algorithms. In the following algorithms, method of stochastic approximation from Robbins and Monroe(1951) [39] is used to find the root of the equation E R T 0 ξtdMθ t = 0 for different choices of ξt . The Stochastic Approximation (Robins & Monroe) is a method of finding roots of an equation M(θ) = β where M(θ) = E[N(θ)]. The suggested method is to find the root by solving the set of iterative equations, θn+1 = θn−αn[N(θn)−β] for each n ≥ 0 starting from an arbitrary θ0. This method is proven to converge under following conditions. 1. Learning rates αn satisfy P∞ n=1 αn = ∞ and P∞ n=1 α 2 n < ∞. 2. The process N is bounded for all θ. 3. The function M is non decreasing in θ . 4. The root θ ∗ satisfies M′ (θ ∗ ) > 0. We should note that although our problem is essentially an infinite horizon one, we can consider a sufficiently large truncated time horizon [0, T] as we did in previous section, so that offline CTD methods similar to [22] can also be applied. However, in the Numerical analysis in section 2.8, we shall focus only on an online version of CTD(γ) method that is more suitable to the infinite horizon case. Following examples demonstrate the different CTD methods that can be derived by using the stochastic approximation method from Robbins and Monroe(1951) [39] to find the root of the equation E Q R T 0 ξtdMθ t = 0 for different choices of ξt . 46 Example 2.6.2. Assume that ξt = ∇θJ θ (X˜ t). The offline algorithm is as follows. θ (k+1) = θ (k) + α(k) Z T 0 ∇θJ θ (k) (X˜ (k) t )dMθ (k) t ≈ θ (k) + α(k) Xn−1 i=0 ∇θJ θ (k) (X˜ (k) ti )(Mθ (k) ti+1 − Mθ (k) ti ) ≈ θ (k) + α(k) Xn−1 i=0 ∇θJ θ (k) (X˜ (k) ti )∆(k) i . Here tn = T and α(k) denotes the learning rate for the k th iteration. Here α(k) is chosen so that P∞ k=0 α(k) = ∞, P∞ k=0 α 2 (k) < ∞ to help guarantee the convergence of the algorithm , based on the convergence conditions of Stochastic Approximation methods. This corresponds to offline learning procedure of CTD(0) method. We now consider the following online algorithm. θ (i+1) = θ (i) + α(i)∇θJ θ (i) (X˜ ti )dMθ (i) t ≈ θ (i) + α(i)∇θJ θ (i) (X˜ ti )(Mθ (i) ti+1 − Mθ (i) ti ≈ θ (i) + α(i)∇θJ θ (i) (X˜ ti )∆i This corresponds to the online learning procedure of CTD(0). Here α(i) denotes the learning rate for i th iteration and it is chosen in such a way so that P∞ i=0 α(i) = ∞, P∞ i=0 α 2 (i) < ∞ , in order to help guarantee the convergence of the algorithm. Example 2.6.3. Assume that ξt = R t 0 γ t−s∇θJ θ (Xs)ds, 0 < γ ≤ 1. The offline algorithm is 47 as follows. θ (k+1) = θ (k) + α(k) Z T 0 Z t 0 γ t−s∇θJ θ (k) (X˜(k) s )ds dMθ (k) t ≈ θ (k) + α(k) Xn−1 i=1 X i−1 j=0 γ ∆t(i−j)∇θJ θ (k) (X˜ (k) tj )∆t (Mθ (k) ti+1 − Mθ (k) ti ) ≈ θ (k) + α(k) Xn−1 i=1 X i−1 j=0 γ ∆t(i−j)∇θJ θ (k) (X˜ (k) tj )∆t ∆ (k) i Here tn = T. This corresponds to the offline learning procedure CT D(γ). On the other hand, an online algorithm can be derived as follows. θ (k) = θ (i) + α(i) Z t 0 γ t−s∇θJ θ (i) (X˜ s)ds dMθ t ≈ θ (i) + α(i) X i−1 j=0 γ ∆t(i−j)∇θJ θ (i) (X˜ tj )∆t (Mθ (i) ti+1 − Mθ (i) ti ) ≈ θ + α(i) X i−1 j=0 γ ∆t(i−j)∇θJ θ (i) (X˜ tj )∆t ∆i Example 2.6.4. 
By selecting J θ (Xt) = Pl j=1 θjψj (Xt) = ψ ⊤ t θ, where ψt denotes the column vector (ψj (Xt)j∈{1,2,...,l} and ξt = ∇θJ θ (Xt), we reduce (2.1) to the solvable equation E[ Z T 0 ξtdMθ t ] = E[ Z T 0 e −ctψt(dJθ (Xt)) + Z T 0 e −ctψt(rt − cJθ (Xt))dt] = E[ Z T 0 e −ctψt(dψ⊤ t )θ + Z T 0 e −ctψt(rt − cψ⊤ t θ)dt] = E[ Z T 0 e −ctψt(dψ⊤ t − cψ⊤ t dt)θ + Z T 0 e −ctψtrtdt] 48 Thus the optimal value θ ∗ is given by θ ∗ = − " E h Z T 0 e −ctψt(dψ⊤ t − cψ⊤ t dt) i #−1 E " Z T 0 ψte −ctrtdt# = − " E Q h Z T 0 e −ctψt(dψ⊤ t − cψ⊤ t dt) i #−1 E Q " Z T 0 ψte −ct(at − λ lnπ(at , X˜ t))dt# provided the inverse exists. Define D = " E Q h R T 0 e −ctψt(dψ⊤ t − cψ⊤ t dt) i # and E = E Q " R T 0 ψte −ct(at − λ lnπ(at , X˜ t))dt# . In this case using a standard method of estimating D and E, an approximation of θ ∗ can be obtained. One standard method of estimating the Expected value of R T 0 g(X˜ t)dt is to observe M trajectories of X˜ t , where X˜(k) denotes the k th trajectory and approximating the expected value by, E Q [ Z T 0 g(X˜ t)dt] ≈ Xn−1 i=0 E Q g(X˜ ti )∆t ≈ Xn−1 i=0 hPM k=1 g(X˜ (k) ti ) M i ∆t = ∆t PM k=1 hPn−1 i=0 g(X˜ (k) ti ) i M . Using this methodology, we approximate DM = " E Q h Z T 0 e −ctψt(dψ⊤ t − cψ⊤ t dt) i # = PM k=1 " Pn−1 i=0 e −ctiψ(X˜ (k) ti ) [ψ(X˜ (k) ti+1 )]⊤ − [ψ(X˜ (k) ti )]⊤ − c[ψ(X˜ (k) ti )]⊤∆t # M . and EM = E Q " Z T 0 ψte −ct(at − λ lnπ(at , X˜ t))dt# = PM k=1 " Pn−1 i=0 e −ctiψ(X˜ (k) ti ) ati − λ lnπ(ati , X˜ (k) ti ) ∆t # M . 49 Thus we calulate θ (M) = [DM] −1EM, for each M ∈ {1, 2, ...N˜ of f } where N˜ of f is the total number of trajectories observed, provided the inverse exists. This is an offline learning method because we use the whole sample trajectory and multiple trajectories to approximate the Expected values D and E. Here we note that θ N˜ off is the best approximation for θ, but we nevertheless define θ M for each M ∈ {1, 2, ...N˜ of f } , since in the algorithms, policy evaluation and policy improvement is carried out simultaneously, and policy is updated in each trajectory M using θ M. Same principle applies for the online learning methods as well. The corresponding online method is as follows. When we have sample trajectories available up-to time tk we use the long term average method defined below. E[ Z T 0 g(X˜ t)dt] ≈ R tk 0 g(X˜ t)dt tn ≈ Pk−1 i=0 g(X˜ ti )∆t tk . Using this methodology, at each time step k we can approximate D and E using Dk and Ek defined below. Dk = Pk−1 i=0 e −ctiψ(X˜ ti ) ψ(X˜ ti ) ⊤ − ψ(X˜ ti ) ⊤ − cψ(X˜ ti ) ⊤∆t tk . and Ek = Pk−1 i=0 e −ctiψ(X˜ ti ) ati − λ lnπ(ati , X˜ ti ) ∆t tk . Hence at each time step θ (k) can be computed as [Dk] −1Ek. This example corresponds to the CLSTD(0) (Continuous Least Square Temporal difference) method. Now we shift our attention into another method of solving (2.1). Generalized Methods of Moments(GMM) , is a standard way of solving moment conditions. GMM method is 50 described as the minimization of the function, GMM(θ) = 1 2 b ⊤Ab where b = E Q R T 0 ξtdMθ t and A is an appropriate matrix. We consider the following examples that are analogous to the methods discussed in Zhou(2021) [22], for a finite horizon problem. Example 2.6.5. Here we choose, ξt = ∇θJ θ (Xt) and A = I and J θ (Xt) as the linear function Pl j=1 θjψj (Xt) = ψ ⊤ t θ. When we take the derivative of the corresponding GMM(θ) function w.r.t. θ, we obtain E Q Z T 0 d(∇θMθ t )ξ ⊤ t + Z T 0 ∇θξ ⊤ t dMθ t E Q Z T 0 ξtdMθ t = vE Q Z T 0 ξtdMθ t where v := E Q R T 0 d(∇θMθ t )ξ ⊤ t + R T 0 ∇θξ ⊤ t dMθ t . 
Since ∇θMθ t = e −ct∇θJ θ (X˜ t) = e −ctξt = e −ctψt and ∇θξt = 0 we have v = E Q Z T 0 d(e −ctξt)ξ ⊤ t = E Q Z T 0 e −ct(dξt − cξtdt)ξ ⊤ t = E Q Z T 0 e −ct(dψt − cψtdt)ψ ⊤ t . Motivated by well known GT D(0) algorithm discrete time RL, we propose using long term average method to approximate v by vk at each time step k, as discussed in online learning 51 method in 2.6.4. Thus the update rule for θ in the k th time step is given by θ ← θ + α(k)vkξtk dMθ tk = θ + α(k)vkψ(X˜ tk )e −ctk [ψ(X˜ tk+1 )]⊤θ − [ψ(X˜ tk )]⊤θ − cψ[(X˜ tk )]⊤θ∆t + (atk − λ ln π(a, X˜ tk ))∆t This is the CGTD(0) method, i.e. the continuous counterpart of GTD(0) method. Example 2.6.6. Now we choose ξt = ∇θJ θ (Xt) and A = E Q R T 0 ξtξ ⊤ t dt−1 . We know from Zhou(2021) [22], that ∇θGMM(θ) with respect to θ is given by E Q h Z T 0 d(∇θMθ t )ξ ⊤ t i u + E Q h Z T 0 ∇θξ ⊤ t dMθ t u − E Q h Z T 0 u ⊤ξt∇θξ ⊤ t udti , where u = Ab. There are 2 online algorithms based on the above equation. In both algorithms u is updated by the rule u ← u + αu[ξtdMθ t − ξtξ ⊤ t u∆t] ≈ u + αu[ξti∆i − ξti ξ ⊤ ti u∆t]. u is updated using long term average in Zhou(2021) [22], but we observe that this update rule is easily derived by applying stochastic approximation method to the function f(u) = b − A−1u = 0 and then using the corresponding online learning method. In one of the algorithms in this example, θ is updated using the following algorithm at time step i, based on SGD method to minimize GMM(θ). θ ← θ − αθ h d(∇θMθ t )ξ ⊤ t u + ∇θξ ⊤ t dMθ t u − u ⊤ξt∇θξ ⊤ t u∆t i ≈ θ − α(i) h (∇θMθ ti+1 − ∇θMθ ti )ξ ⊤ ti u + ∇θξ ⊤ ti (Mθ ti+1 − Mθ ti )u − u ⊤ξti∇θξ ⊤ ti u∆t i ≈ θ − α(i) h (e −cti+1 ξti+1 − e −cti ξti )ξ ⊤ ti u + ∇θξ ⊤ ti ∆iu − u ⊤ξti∇θξ ⊤ ti u∆t i . 52 this corresponds to the CGTD(2) method. When deriving the next algorithm , by using the fact ξt = ∇θJ θ (X˜ t), we can rewrite d(∇θMθ t ) = d(e −ct∇θJ θ (X˜ t)) = −ce−ct∇θJ θ (X˜ t)) + e −ctd(∇θJ θ t ) = e −ct[−cξt + d(ξt)] . Thus we see d(∇θMθ t )ξ ⊤ t u = e −ct[−cξt + d(ξt)]ξ ⊤ t u ≈ e −cti [−cξti + ξti+1 − ξti ]ξ ⊤ ti u ≈ e −cti [−(1 + c)ξti + ξti+1 ]ξ ⊤ ti u Thus we can use the update rule θ ← θ − αθ h e −cti [−(1 + c)ξti + ξti+1 ]ξ ⊤ ti u + ∇θξ ⊤ ti ∆iu − u ⊤ξti∇θξ ⊤ ti u∆t i . this corresponds to CT DC method. Though the two methods are derived using same basic foundation, we can observe that the algorithms themselves are different from one another. And it has been shown empirically that TDC perform slightly better than GTD2 in, when applied to certain discrete cases. Both CTDC and CGTD2 algorithms are online algorithms, thus can be used for infinite horizon problems. Example 2.6.7. Now we choose ξt = ∇θJ θ (Xt) and A = E Q R T 0 ξtξ ⊤ t dt−1 similar to 2.6.6. Furthermore we use the linear function J θ (Xt) = Pl j=1 θjψj (Xt) = θ ⊤ψt is used. Thus we have ∇θMθ t = e −ct∇θJ θ (X˜π t ) = e −ctξt = e −ctψt and ∇θξt = 0. In this case gradient of GMM(θ) with respect to θ is given by E Q Z T 0 d(∇θMθ t )ξ ⊤ t u. We use the same rule as 2.6.6 to update u, u ← u + αu[ξti e −cti∆i − ξti ξ ⊤ ti u∆t]. But update rule for θ using GT D methods is much simpler now. We see that the update rule for θ in 53 the i th step is now given by, θ ← θ − αθ d(∇θMθ t )ξ ⊤ t u ≈ θ − αθ (e −cti+1 ξti+1 − e −cti ξti )ξ ⊤ ti u . This is a special case of CGT D2 methods. For a fixed ∆t, we observe that the convergence analysis of the above methods coincides with those of the SGD and stochastic approximation methods. Now we are interested in analysing the convergence with respect to ∆t. 
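Before turning to this analysis, and to fix ideas, the following is a minimal Python sketch of the online CTD(0) update of Example 2.6.2 for a generic differentiable parametric family $\{J^\theta\}$. The callables J_theta and grad_J_theta, and the convention that the observed reward at time $t_k$ is $w_{t_k} - \lambda\ln\pi(w_{t_k}, \tilde X_{t_k})$, are assumptions consistent with the definition of $\Delta_i$ above; the names themselves are hypothetical.

import numpy as np

def ctd0_online_step(theta, J_theta, grad_J_theta, x_k, x_next, w_k, log_pi_wk,
                     c, lam, t_k, dt, alpha_i):
    """One online CTD(0) update:  theta <- theta + alpha_i * grad_theta J^theta(x_k) * Delta_k."""
    # Delta_k ~ e^{-c t_k} [ J^theta(x_{k+1}) - J^theta(x_k)
    #                        + ( -c * J^theta(x_k) + w_k - lam * ln pi(w_k, x_k) ) * dt ]
    delta_k = np.exp(-c * t_k) * (
        J_theta(theta, x_next) - J_theta(theta, x_k)
        + (-c * J_theta(theta, x_k) + w_k - lam * log_pi_wk) * dt
    )
    return theta + alpha_i * grad_J_theta(theta, x_k) * delta_k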
We prove the following results analogous to results in Zhou(2021) [22], in the discounted problem setting. Assumption 2.6.8. We consider ξ ∈ L 2 F([0, T], Mθ ) to be a test function if there exists a continuous function C(θ) and a constant α > 0 such that E|ξt − ξs| 2 ≤ C(θ)|t − s| α . Proposition 2.6.9. Denote by θ ∗ the solution to the equation E[ R T 0 ξtdMθ t ] = 0 and by θ ∗ ∆t solution to the discretization of the above equation with fixed ∆t, E[ Pn−1 i=0 ξti (Mθ ti+1 − Mθ ti )]. Then if a convergent sub-sequence of (θ ∗ ∆t )∆t>0 exists, its limit is θ ∗ . Also E[ R T 0 ξtdMθ ∗ ∆t t ] ≤ C(∆t) α 2 for some C > 0. Proof. The proof is similar to Zhou(2021) [22], section 4.2 Theorem 3. We observe that E Z T 0 ξtdMθ t − Xn−1 i=0 ξti (Mθ ti+1 − Mθ ti ) = E Xn−1 i=0 Z ti+1 ti (ξti − ξt)dMθ t = E Xn−1 i=0 Z ti+1 ti (ξti − ξt)e −ct[dJθ t + (rt − cJθ t )dt] ≤ P1 + P2 + P3 where 54 P1 = X i E Z ti+1 ti (ξti − ξt)e −ctdJθ t P2 = X i E Z ti+1 ti (ξti − ξt)e −ctrtdt P3 = X i E Z ti+1 ti (ξti − ξt)e −ctcJθ t dt . By Assumption 2.5.2 (iii), we know |J θ (x)| ≤ K(θ) and r(x) ≤ R for all x > 0. Thus, P2 = Xn−1 i=0 E Z ti+1 ti (ξti − ξt)e −ctrtdt ≤ R Xn−1 i=0 E Z ti+1 ti |ξti − ξt |dt ≤ R Xn−1 i=0 sZ ti+1 ti E|ξti − ξt | 2dtsZ ti+1 ti dt ≤ R Xn−1 i=0 (∆t) α+2 2 = R p C(θ)T(∆t) α 2 From similar arguments P3 ≤ cK(θ) p C(θ)T(∆t) α 2 . By observing DJθ t = LJ θ t dt, P1 ≤ Xn−1 i=0 E Z ti+1 ti (ξti − ξt)LJ θ t dt ≤ Xn−1 i=0 s E Z ti+1 ti (ξti − ξt) 2dts E Z ti+1 ti (LJ θ t ) 2dt ≤ Xn−1 i=0 sZ ti+1 ti C(ti − t) 2dts E Z ti+1 ti (LJ θ t ) 2dt ≤ p C(θ)(∆t) α+1 2 Xn−1 i=0 s E Z ti+1 ti (LJ θ t ) 2dt ≤ p C(θ)(∆t) α+1 2 vuutE Xn−1 i=0 Z ti+1 ti (LJ θ t ) 2dt√ n = p C(θ)(∆t) α+1 2 √ n||LJ θ t ||L2[0,T] = p C(θ)(∆t) α 2 √ T||LJ θ t ||L2[0,T] By assumption ||LJ θ t ||L2[0,T] < ∞. That is, P1, P2, P3 → 0 as ∆t → 0 uniformly on any compact set. Now let us consider the functions f(θ) := E[ Z T 0 ξtdMθ t ], f∆t(θ) := E[ Xn−1 i=0 ξti (Mθ ti+1 − Mθ ti )], r∆t(θ) := f∆t(θ) − f(θ). 55 Then f∆t(θ) = f(θ)+r∆t(θ), and by Assumption 2.5.2 we can easily check that the mappings θ 7→ f∆t(θ), r∆t(θ) are continuous functions. We proved that r∆t(θ) → 0, uniformly in θ on compacta, as ∆t → 0. Note that if Θ is compact, then for any ∆t > 0, there exists θ ∗ ∆t ∈ arg minθ∈Θ f∆t(θ). Thus by corollary (2.5.6) any convergent sub-sequence of θ ∗ ∆t → θ ∗ as ∆t → 0. Since by definition f∆t(θ ∗ ∆t ) = 0, |f(θ ∗ ∆t )| = |f(θ ∗ ∆t ) − f∆t(θ ∗ ∆t )| = |r∆t(θ ∗ ∆t )| ≤ C|∆t| α 2 . Here C = max{supθ∈Θ p C(θ)RT,supθ∈Θ cTK(θ) p C(θ),supθ∈Θ p T C(θ)||LJ θ t ||L2[0,T]}. That is E[ R T 0 ξtdMθ ∗ ∆t t ] ≤ C(∆t) α 2 for some C > 0. We also have the following proposition about the convergence of GMM methods. Proposition 2.6.10. Denote by θ ∗ (GMM) the minimizer of GMM(θ) = 1 2 b ⊤Ab where b = E R T 0 ξtdMθ t and A is a given matrix. Also denote by θ ∗ ∆t (GMM) minimizer of the discretization of function GMM(θ) with fixed ∆t, GMM∆t(θ) = 1 2 (b∆t) ⊤A∆tb∆t where b∆t = E Pn−1 i=0 ξ θ ti (Mθ ti+1 −Mθ ti ) and A∆t is a discretization of A satisfying |A−A∆t | ≤ C˜(θ)|∆t| β , C˜(θ) is a continuous function and β > 0 is a constant. Then if a convergent sub-sequence of (θ ∗ ∆t (GMM))∆t>0 exists, its limit is θ ∗ (GMM). We also have |GMM(θ ∗ ∆t (GMM)) − GMM(θ ∗ (GMM))| ≤ C(∆t) min( α 2 ,β) for some C > 0. Proof. The proof is similar to Zhou(2021) [22], section 4.2 Theorem 4. Let ˜b(∆t) := b∆t − b and A˜(∆t) = A∆t − A. 
Then, 2|GMM∆t(θ) − GMM(θ)| = |(b∆t) ⊤A∆tb∆t − b ⊤Ab| = |(b + ˜b∆t) ⊤(A + A˜∆t)(b + ˜b∆t) − b ⊤Ab| ≤ |A||˜b∆t | 2 + 2|A||b||˜b∆t | + 2|A˜∆t ||b| 2 + 2| ˜b∆t | 2 |A˜∆t |. By definition |A˜∆t | → 0 as ∆t → 0 uniformly on a compact set. By proposition (2.6.9) | ˜b∆t | → 0 as ∆t → 0 uniformly on a compact set. Hence the result follows by corollary 56 (2.5.6). In what follows we denote f(θ) = GMM(θ), f∆t = GMM∆t(θ), r∆t(θ) = f(θ) − f∆t(θ) and θ ∗ (GMM) = θ ∗ , θ∗ ∆t (GMM) = θ ∗ ∆t . By an argument similar to proposition (2.6.9), |r∆t | ≤ C˜(∆t) min( α 2 ,β) for some C >˜ 0. We also notice based on lemma 8 in Zhou(2021) [22] , |GMM(θ ∗ ) − GMM(θ ∗ ∆t )| = |f(θ ∗ ) − f(θ ∗ ∆t )| = f(θ ∗ ∆t ) − f(θ ∗ ) = r∆t(θ ∗ ∆t ) − r∆t(θ ∗ ) + f∆t(θ ∗ ∆t ) − f∆t(θ ∗ ) ≤ r∆t(θ ∗ ∆t ) − r∆t(θ ∗ ) ≤ 2C˜(∆t) min( α 2 ,β). Hence the result. We also note here the result from Zhou(2021) [22], that states the condition |A − A∆t | ≤ C˜(θ)|∆t| β indeed holds for A = E R T 0 ξtξ ⊤ t dt−1 , justifying the use of A = E R T 0 ξtξ ⊤ t dt−1 . Finally, we remark that, although we discussed many PE methods analogous to well known TD methods , many of these are particularly well suited for linear parameterization families. Since we are interested in parameterized families that are nonlinear in nature, in section 2.8, we focus on CT D(γ) algorithms (online) that are well suited for infinite horizon problems with nonlinear parameterized families. 2.7 Alternate Approach for Approximating the Optimal Policy 2.7.1 On-policy and Off-policy Algorithms As discussed in [42], a policy can be categorized either as a Behavior Policy / Sampling Policy or an Update Policy/ Target Policy. A behavior policy, is used to sample actions from in the environment simulation, whereas an update policy is the policy that is being 57 updated in the algorithm with the target of obtaining the optimal policy. It is standard in Reinforcement Learning literature to consider on-policy algorithms that sample actions from the update policy itself, as well as off-policy algorithms that use a behavior policy that is different from the update policy to sample actions from. The choice between on-policy and off-policy learning is dictated by the availability of the data and the learning agent’s choice. If the agent has the ability to generate data for any given policy, the agent has the option to carry out on-policy learning as well as off-policy learning. But, if the agent does not have the ability to generate data and has to depend on the available data of a simulated path of some specific policy, agent’s only option is to use off-policy learning. We can use the learning sequence designed in section 2.3, along with the policy evaluation methods based on the Martingale approach proven in section 2.5 to create iterative algorithms to approximate the optimal policy and the optimal value function. This is inherently an “on-policy” approach since, to update a policy π into an improved policy π˜, we need to approximate J ′ (·, π), which requires the simulated state values of the process Xπ . Since “on-policy” approaches can only be used when the learning agent is in control of generating data, it is reasonable to analyze and design alternative methods to approximate the optimal policy that can be carried out as an “off-policy” learning. To this end we consider the q-function associated with the optimal policy π ∗ is adapted from [24] and is defined as follows for our context. q ∗ (x, w) = (µ − w)v ′ (x) + 1 2 σ 2 v ′′(x)) − cv(x) + w; (x, w) ∈ R + × [0, a]. 
where v is the optimal value function. It is proven in [24] that we have R a 0 exp{ 1 λ q ∗ (x, w)}dw = 1, for all x, and consequently the optimal policy π ∗ is π ∗ (w, x) = exp{ 1 λ q ∗ (x, w). We use the q −function in designing algorithms, and an important step in the algorithms is generating the actions for trial and error. To formalize this idea mathematically we use the probability measure Q defined by (2.3) and the filtration G. Let us denote G˜ = {Gτ˜x∧t}t≥0. 58 Now to use the q − function in designing algorithms, we consider the following theorem adapted from theorem 4 from [24]. Theorem 2.7.1. Let a function vb ∈ C 2 with polynomial growth and a continuous function qb∗ : R × [0, a] → R + be given satisfying vb(0) = 0, Z a 0 exp{ 1 λ qb∗ (x, w)}dw = 1, ∀x ∈ R +. (2.2) Then (i) If vb and qb∗ are respectively the optimal value function and the optimal q-function, then for any π ∈ A s cl and all x ∈ R +, the following process e −c(τ π x ∧s) vb(X˜ τ π x ∧s) + Z τ π x ∧s 0 e −cu w π u − qb∗ (X˜ u, wπ u ) du (2.3) is an (G˜ , Q)-martingale, where {X˜ π s , s ≥ 0} is the solution to (2.11) under the policy π. (ii) If there exists a π ∈ A s cl such that for all x ∈ R +,(2.3) is an (G˜ , Q)-martingale, then vb and qb∗ are respectively the optimal value function and the optimal q-function. When = q ∗ is the optimal q-function, πb∗ (w, x) = exp{ 1 λ qb∗ (x, w)} is the optimal policy. proof: Let vb = v and qb∗ = q ∗ respectively be the optimal value function and the optimal q-function. Then, for any π ∈ A s cl and all x ∈ R +, by using Ito’s formula, e −c(τ π x ∧s) v(X˜ τ π x ∧s) − v(x) + Z τ π x ∧s 0 e −cu[w π u − q ∗ (X˜ τ π x ∧s, wπ u )]du = Z τ π x ∧s 0 e −cuL π [v](X π u )du + Z τ π x ∧s 0 e −cuv ′ (X π u , π)σdWu + Z τ π x ∧s 0 e −cu[w π u − q ∗ (X˜ τ π x ∧s, wπ u )]du = Z τ π x ∧s 0 e −cuv ′ (X π u , π)σdWu. (2.4) Since R τ π x ∧s 0 e −cuv ′ (Xπ u , π)σdWu is a (G˜ , Q) martingale, it follows that (2.3) is a (G˜ , Q) martingale. Thus (i) is proved. 59 For (x, w) ∈ R + × [0, a], let us define the function ˆl(x, w) := (µ − w)(vb) ′ (x) + 1 2 σ 2 (vb) ′′(x)) − cvb(x). Let us consider the process e −c(τ π x ∧s) vb(X˜ τ π x ∧s) − R τ π x ∧s 0 e −cuˆl(X˜ τ π x ∧s, wπ u )du. By using Ito’s formula, e −c(τ π x ∧s) vb(X˜ τ π x ∧s) − vb(x) − Z τ π x ∧s 0 e −cuˆl(X˜ τ π x ∧s, wπ u )du = Z τ π x ∧s 0 e −cuL π [vb](X π u )du + Z τ π x ∧s 0 e −cu(vb) ′ (X π u , π)σdWu − Z τ π x ∧s 0 e −cuˆl(X˜ τ π x ∧s, wπ u )du = Z τ π x ∧s 0 e −cu(vb) ′ (X π u , π)σdWu. (2.5) Thus we see that e −c(τ π x ∧s) vb(X˜ τ π x ∧s) − R τ π x ∧s 0 e −cuˆl(X˜ τ π x ∧s, wπ u )du is an (G˜ , Q) martingale and by assumption, (2.3) is an (G˜ , Q) martingale. Therefore R τ π x ∧s 0 e −cu w π u − qb∗ (X˜ u, wπ u ) + ˆl(X˜ τ π x ∧s, wπ u ) du is a continuous (G˜ , Q) martingale with finite variation and thus zero quadratic variation. Thus Q− almost surely, R τ π x ∧s 0 e −cu w π u − qb∗ (X˜ u, wπ u ) + ˆl(X˜ τ π x ∧s, wπ u ) du = 0 for all t ≥ 0. Following a similar argument to [24] theorem 1-(i), since π ∈ A s cl, Q− almost surely, R τ π x ∧s 0 e −cu w π u −qb∗ (X˜ u, wπ u ) +ˆl(X˜ τ π x ∧s, wπ u ) du = 0 for all t ≥ 0 implies w −qb∗ (x, w) +ˆl(w, x) for all (w, x) ∈ [0, a] × R+. That is, qb∗ (x, w) = ˆl(w, x) + w for all (w, x) ∈ [0, a] × R+. Since R a 0 exp{ 1 λ qb∗ (x, w)}dw = 1, we have that, πb∗ (w, x) = exp{ 1 λ qb∗ (x, w)} is a probability distribution function on [0, a]. Now, qb∗ (x, w) = λ lnπb∗ (w, x). 
Therefore we have, 0 = Z a 0 qb∗ (x, w) − λ lnπb∗ (w, x) πb∗ (w, x)dw = Z a 0 ˆl(w, x) + w − λ lnπb∗ (w, x) πb∗ (w, x)dw = Z a 0 (µ − w)(vb) ′ (x) + 1 2 σ 2 (vb) ′′(x)) − cvb(x) + w − λ lnπb∗ (w, x) πb∗ (w, x)dw = L πc∗ [vb](x) + r πc∗ (x). (2.6) By the uniqueness of the solution to the pde (2.6), we can conclude that vb = J(.,πb). Furthermore since qb∗ (x, w) = ˆl(w, x) + w = (p − w)(vb) ′ (x) + 1 2 σ 2 (vb) ′′(x)) − cvb(x) + w = 60 (1 − vb) ′ (x))w + µ(vb) ′ (x) + 1 2 σ 2 (vb) ′′(x)) − cvb(x), we observe that (1 − vb) ′ (x))w = qb∗ (x, w) − p(vb) ′ (x) − 1 2 σ 2 (vb) ′′(x)) + cvb(x). Thus the improved policy of πb∗ , denoted by ˜ πb∗ is given by, ˜ πb∗ = e (1−vb ′ )(x))w λ R a 0 e (1−vb′)(x))w λ dw = e qc∗(x,w) λ R a 0 e qc∗(x,w) λ dw = e qc∗(x,w) λ = πb∗ . Therefore by Theorem 2.3.4, πb∗ is the optimal policy, vb is the optimal value function and qb∗ = q. To learn the optimal value function and q-function based on Theorem 2.7.1 we can use approximators J θ and q ψ that satisfy J θ (0) = 0 and R a 0 exp{ 1 λ q ψ (x, w)}dw = 1 for each x. Now, learning the optimal value function and q-function is equivalent to finding a function J θ and q ψ such that e −c(τ π˜ x ∧s) vb(X˜ τ π˜ x ∧s ) +R τ π˜ x ∧s 0 e −cu w π u −qb∗ (X˜ u, wπ u ) du is an (G˜ , Q) martingale for any given π. Hence we can again use stochastic approximation method and stochastic gradient descent methods similar to [22] to learn the optimal value function and q-function. Furthermore since we can use any policy π to learn the optimal value function and q-function, this creates an ”off-policy” algorithm, as was needed. 2.8 Numerical Results In this section we present the numerical results along the lines of PE and PI schemes discussed in sections 2.3,2.5 and 2.6. In particular, we shall consider the CTD(γ) methods and ML Algorithm and some special parametrization based on the knowledge of the explicit solution of the original optimal dividend problem (with λ = 0), but without specifying the market parameter µ and σ. To test the effectiveness of the learning procedure, we shall use the so-called environment simulator: (x ′ ) = ENV∆t(x, a), that takes the current state x and action a as inputs and generates state x ′ at time t + ∆t, and we shall use the outcome of the simulator as the dynamics of X. We note that the environment simulator will be problem specific, and should be created using historic data pertaining to the problem, without using environment coefficients, which is considered as unknown in the RL setting. But for the testing purpose, we shall use “dummy” values of µ and σ, along with the following 61 Euler–Maruyama discretization for the SDE (2.10) as our environment simulator: xti+1 = xti + (µ − ati )∆t + σZ, i = 1, 2, · · · , (2.1) where Z ∼ N(0, √ ∆t) is a normal random variable and ati is the action at time ti . We note that by (2.2), the optimal policy function has the form π ∗ (x, w) = G(w, c˜(x)), where ˜c(x) is a continuous function, and thus we shall considered only policies of this format. For a policy distribution π of this form, the inverse of the cumulative distribution function is given by F π −1 (x, w) = λ ln w(e ac˜(x) λ −1)+1 c˜(x) 1{c˜(x)̸=0} + aw1{c˜(x)=0}, (w, x) ∈ [0, a] × R +. Thus, to sample the policy distribution π, we shall simulate U ∼ U[0, 1], the uniform distribution on [0, 1], and use the inversion method using the fact that F π −1 (x, U) ∼ π(·, x). 2.8.1 Parametrization of the Cost Functional The next step is to choose the parametrization of J θ . In light of the well-known result (cf. 
e.g., [1]), we know that if (µ, σ) are given, and β = a c − 1 β3 > 0, (thanks to Assumption 2.1.5), the classical solution for the optimal dividend problem is given by V (x) = K(e β1x − e −β2x )1{x≤m} + h a c − e −β3(x−m) β3 i 1{x>m}. (2.2) where K = β e β1m−e−β2m , β1,2 = ∓µ+ √ 2cσ2+µ2 σ2 , β3 = µ−a+ √ 2cσ2+(a−µ) 2 σ2 , and m = log( 1+ββ2 1−ββ1 ) β1+β2 . We should note that the threshold m > 0 in (2.2) is most critical in the value function, as it determines the switching barrier of the optimal dividend rate. That is, optimal dividend rate is of the “bang-bang” form: αt = a1{X(t)>m}, where X is the reserve process (see, e.g., [1]). We therefore consider the following two parametrizations based on the initial state x = X0. (i) x < m. By (2.2) we use the approximation family: J θ (x) = θ3(e θ1x − e −θ2x ), θ1 ∈ h 4c (1 + √ 5)a , 1 i ; θ2 ∈ h 1 + 4c (1 + √ 5)a , 2 i ; θ3 ∈ [15, 16], (2.3) where θ1, θ2, θ3 represent β1, β2 and K of the classical solution respectively. In particular, the bounds for θ1 and θ2 are due to the fact β1 ∈ [ 4c (1+√ 5)a , 1] and β2 ∈ [1 + 4c (1+√ 5)a ,∞) under Assumption 2.1.5. We should note that these bounds alone are not sufficient for the algorithms to converge, and we actually enforced some additional bounds. In practice, the range of θ2 and θ3 should be obtained from historical data for this method to be effective in real life applications. Even though we follow a model-free RL approach theoretically, we use the knowledge that the optimal dividend rate is of the “bang-bang” form: αt = a1{X(t)>m}, where X is the reserve process, to derive the optimal strategy using the results of our algorithms. To this end, we observe that (2.2) actually implies that µ = c(β2−β1) β2β1 and σ 2 = 2c β2β1 . We can therefore approximate µ, σ by c(θ ∗ 2−θ ∗ 1 ) θ ∗ 2 θ ∗ 1 and q 2c θ ∗ 2 θ ∗ 1 , respectively, whenever the limit θ ∗ can be obtained. The threshold m can then be approximated via the approximated values µ and σ. We emphasize here that, even though our algorithms were created to approximate the optimal strategy for the entropy regularized exploratory control problem, by sending λ → 0 and using the knowledge we have about the classical problem, we use the results of the algorithms to directly approximate the optimal dividend rate of the classical problem, which is essentially approximating the threshold m, because of the “bang-bang” nature of the optimal strategy of the classical problem. (ii) x > m. Again, in this case by (2.2) we choose J θ (x) = a c − θ1 θ2 e −θ2x , θ1 ∈ [1, 2], θ2 ∈ h c a , 2c a i , (2.4) 63 where θ1, θ2 represent (e m) β3 and β3 respectively, and the bounds for θ2 are the bounds of parameter β3 in (2.2). To obtain an upper bound of θ1, we note that θ1 ≤ θ2a c is necessary to ensure J θ (x) > 0 for each x > 0, and thus the upper bound of θ2 leads to that of θ1. For the lower bound of θ1, note that e m > 1 and hence so is (e m) β3 . Using J θ ∗ x (m) = 1, we approximate m by ln(θ1) θ2 . Remark 2.8.1. The parametrization above depends heavily on the knowledge of the explicit solution for the classical optimal dividend problem. In general, the explicit solution for the classical problem may not be available. In such scenarios, it is natural to consider the use of the viscosity solution of the entropy regularized relaxed control problem as the basis for the parameterization family. However, although we did identify both viscosity super- and subsolutions in (2.9), we found that the specific sub and super-solutions do not work effectively in our algorithms. 
We believe that the reasons for this include the computational complexity caused by the piecewise nature of the super-solution, as well as the complicated nature of the bounds of the parameters involved (see (2.10)); the viscosity sub-solution, on the other hand, being a simple function independent of all the parameters we consider, does not seem to be an effective choice for a parameterization family either. These may not be the only possible reasons, since we also observed empirically that simple linear parameterization families do not converge to accurate results either.

The choice of a parameterization family can affect the convergence of the algorithm in two ways. Firstly, the parameterization family may not contain the cost functional we are trying to approximate using Policy Evaluation methods, in which case we obtain an incorrect approximation of the cost functional. Secondly, we may approximate the cost functional J(·, π) correctly using Policy Evaluation methods and obtain J^{θ*}(·) ≈ J(·, π), yet the algorithm may still exhibit poor policy iteration as a result of the choice of the parameterization family. This is because we use the derivative (J^{θ*})'(·) to generate a "better" policy π̃ in Policy Iteration; since the PE methods do not take the derivatives of the cost functional J into account, the difference between the derivatives of the approximated and true cost functionals, namely |(J^{θ*})'(·) − J'(·)|, can result in a poor policy iteration, negatively affecting the convergence of the algorithm. We believe that one of the main reasons the parameterization family based on the classical value function is effective in these algorithms is that the approximation of J in this case naturally approximates the derivatives of J as well. For parameterization families based on viscosity solutions and for linear parameterization families, carrying out PE methods that are only concerned with J can result in losing information about the derivatives of J, making them too rough a choice. While the preservation of derivative information is a necessity for effective policy iteration, it can potentially increase the efficiency of Policy Evaluation as well. In general interpolation problems, it has been shown that cubic spline interpolation has a much lower interpolation error than linear interpolation, precisely because cubic spline interpolation takes into account the first and second derivatives of the function to be approximated. Thus, Policy Evaluation methods based on approximating J and its derivatives simultaneously could prove much more effective, and could possibly be used with a larger class of parameterization families. Using high-dimensional linear parameterization families or parameterization families based on viscosity solutions, and approximating J, J' and J'' simultaneously using methodologies such as neural networks (see, e.g., [30, 36, 38]), can potentially provide effective and accurate Policy Evaluation methods. Such methods would have the advantage of a larger class of suitable parameterization families that does not depend on the knowledge of an explicit classical value function. The derivation of such PE methods deserves to be studied extensively, and we shall start formulating some of these concepts mathematically in Chapter 3.

2.8.2 Algorithm Design

In the following two subsections we summarize our numerical experiments following the analysis so far.
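As a concrete reference for the algorithms below, the following is a minimal Python sketch of the two building blocks they share: the environment simulator $ENV_{\Delta t}$ based on the Euler–Maruyama step (2.1), and the inverse-CDF sampler for policies of the Gibbs form $\pi(\cdot, x) = G(\cdot, \tilde c(x))$. The "dummy" coefficients $\mu, \sigma$ appear only inside the simulator, reflecting the fact that the learning algorithms never read them; the function names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def env_step(x, a, mu, sigma, dt):
    """Environment simulator ENV_dt: one Euler-Maruyama step of the surplus process.
    mu, sigma are 'dummy' coefficients hidden inside the simulator (unknown to the learner)."""
    return x + (mu - a) * dt + sigma * np.sqrt(dt) * rng.standard_normal()

def sample_action(c_tilde_x, a_max, lam):
    """Sample w ~ pi(., x) = G(., c_tilde(x)) on [0, a_max] by inverting the Gibbs CDF."""
    u = rng.uniform()
    if abs(c_tilde_x) < 1e-12:                 # c_tilde(x) = 0: uniform on [0, a_max]
        return a_max * u
    # F^{-1}(x, u) = (lam / c_tilde(x)) * ln( u * (exp(a_max * c_tilde(x) / lam) - 1) + 1 )
    return lam / c_tilde_x * np.log(u * np.expm1(a_max * c_tilde_x / lam) + 1.0)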
For testing purpose, we choose “dummy” parameters a = 3, µ = 0.4, σ = 0.8 and c = 0.02, so that Assumption 2.1.5 holds. We use T = 10 to limit the number of iterations, and we observe that on average the ruin time of the path simulations occurs in the interval [0, 10]. We also use the error bound ϵ = 10−8 , and make the convention that d ∼ 0 whenever |d| < ϵ. 2.8.3 CTD methods Algorithm 1 CTD(0) Algorithm Define Initial state x0, Time horizon T, time scale ∆t, K = T ∆t , Initial temperature λ, Initial learning rate α, functional forms of l(.), p(.), Jθ (.), ∇θJ θ (.), number of simulated paths M, Variable sz , an environment simulator ENV∆t(t, x, a). Initialize θ , j = 1 and set V ar = θ . while j < M do Set θ = V ar. if mod (j − 1, sz) = 0 AND j > 1 then Compute and store Aj = Average(θ ∗ ) over the last sz iterations. if j > sz AND the absolute difference DA = |Aj − A(j−sz) | < ϵ then End iteration Initialize k = 0, observe x0 and store xtk ⇐ x0 while k < K do λ = λl(k) Compute π(., xtk ) = G(., 1 − J V ar x (xtk )) and generate action atk ∼ π(., xtk ) Apply atk to ENV∆t(tk, xtk , atk ) to observe and store xtk+1 if xtk+1 < ϵ then End iteration Compute ∆θ = ∇θJ θ (xtk ) e −ctk∆k. if ∥∆θ∥2 < ϵ then End iteration Update θ ← θ + αp(k)∆θ. Update k ← k + 1 Set θ ∗ = θ and update j ← j + 1. Set θ ∗ = Aj In Algorithm 1 below we carry out the PE procedure using the CT D(0) method. We choose λ = λ(k) as a function of iteration number: λ(k) = 2l(k) = 2(0.2)k∗∆t ≥ 2 · 0.2 T = 2.048 × 10−7 . This particular function is chosen so that λ → 0 and the entropy regularized control problem converges to the classical problem, but λ is still bounded away from 0 so 6 as to ensure that π is well defined. We shall initialize the learning rate at 1 and decrease it using the function p(k) = 1/k so as to ensures that the conditions P∞ k=0 αk = ∞ and P∞ k=0(αk) 2 < ∞ are satisfied. We note that Algorithm 1 is designed as a combination of online learning and the so-called batch learning, which updates parameter θ at each temporal partition point, but only updates the policy after a certain number (the parameter “sz” in Algorithm 1) of path simulations. This particular design is to allow the PE method to better approximate J(·, π) before updating π. Convergence Analysis. To analyze the convergence as ∆t → 0, we consider ∆t = 0.005, 0.001, 0.0005, 0.0001, 0.00005, respectively. We take M = 40000 path simulations and sz = 250 in the implementation. Note that with the choice of dummy parameters a, µ and σ, the classical solution is given by m = 4.7797 , V (3) = 17.9522 and V (10) = 24.9940. We thus consider two parameterization families, for initial values x = 3 < m and x = 10 > m respectively. Table 2.1: Results for the CT D0 method ∆t J θ ∗ m J θ ∗ m J θ ∗ m J θ ∗ m family(i) family(ii) x=3 x=10 x=3 x=10 0.01 15.49 5.383 31.276 3.476 32.489 19.359 55.667 9.635 0.005 17.188 4.099 22.217 4.292 31.108 18.532 53.262 11.942 0.001 16.68 4.474 23.082 3.931 37.58 11.925 60.948 10.691 0.0005 16.858 4.444 23.079 4.049 40.825 11.797 65.179 10.5 0.0001 17.261 4.392 23.094 4.505 38.899 18.994 55.341 18.142 Case 1. x = 3 < m. As we can observe from Tables 2.1 and 2.2 , in this case using the approximation (2.3) (family (i)) shows reasonably satisfactory level of convergence towards the known classical solution values of J(x0) and m as ∆t → 0, despite some mild fluctuations. 
We believe that such fluctuations are due to the randomness of the data we observe and that averaging over the sz paths in our algorithm reduced the occurrence of these fluctuations to a satisfactory level. As we can see from Table 2.2, despite the minor anomalies, the general trajectory of these graphs tends towards the classical solution as ∆t → 0. We should also observe that using family (ii) (2.4) does not produce any satisfactory convergent results. But 67 this is as expected, since the function (2.4) is based on the classical solution for x > m. Case 2. x = 10 > m. Even though the family (2.3) is based on the classical solution for x < m, as we can see from Tables 2.1 and 2.2 , the algorithm using family (2.3) converges to the values of the classical solution even in the case x = 10 > m, whereas the algorithm using family (ii) (2.4) does not. While a bit counter intuitive, this is actually not entirely unexpected since the state process can be seen to reach 0 in the considered time interval in general, but the parameterization (2.4) is not suitable when value of the state reaches below m. Consequently, it seems that the parameterization (2.3) suits better for CT D(0) method, regardless of the initial value. Table 2.2: Convergence results for the CT D0 method J θ ∗ w.r.t. ∆t m w.r.t. ∆t Results obtained using family (i) for x = 3 Results obtained using family (i) for x = 10 Finally, we would like to point out that the case for CT D(γ) methods for γ ̸= 0 is much more complicated, and the algorithms are computationally much slower than CT D(0) method. We believe that the proper choice of the learning rate in this case warrants further investigation, 68 but we prefer not to discuss these issues in this dissertation. 2.8.4 The ML Algorithm In Algorithm 2 we present the so-called ML-algorithm in which we use a batch learning approach where we update the parameters θ by ˜θ at the end of each simulated path using the information from the time interval [0, T ∧τx]. We use M = 40000 path simulations and initial temperature λ = 2 and decrease it using the function l(j) = (0.9)j where j represents the iteration number. We also initialize the learning rate at 1 and decrease it using the function p(j) = 1/j. To represent the convergence as ∆t → 0, we use ∆t = 0.005, 0.001, 0.0005, 0.0001 respectively. For both the initial values x = 3 and x = 10, parameterized family (i) gives optimal θ ∗ i as the lower bound of each parameter θi , for i = 1, 2, 3 and parameterized family (ii) gives optimal θ ∗ i as the average of the lower and upper bounds of each parameter θi for i = 1, 2, since in each iteration θi is updated as the upper and lower boundary alternatively. The reason is because the learning rate 1 j is too large for this particular algorithm. Decreasing the size of the learning rate results in optimal θ values occur away from the boundaries, but the algorithms were shown not to converge empirically, and thus the final results depend on the number of iterations used (M). In general, the reason for this could be due to the loss of efficiency occurred by decreasing the learning rates, since Gradient Descent Algorithms are generally sensitive to learning rates. Specific to our problem, among many possible reasons, we believe that the limiting behavior of the optimal strategy when λ → 0 is a serious issue, as π is not well defined when λ = 0 and a Dirac c-measure is supposed to be involved. 
Furthermore, the ”bang-bang” nature and the jump of the optimal control could also affect the convergence of the algorithm. Finally, the algorithms seems to be quite sensitive to the value of m since value function V (x) is a piece-wisely smooth function depending on m. Thus, to rigorously analyze the effectiveness 69 of the ML-algorithm with parameterization families (i) and (ii), further empirical analysis are needed which involves finding effective learning rates. All these issues call for further investigation, but based on our numerical experiment we can nevertheless conclude that the CTD(0) method using the parameterization family (i) is effective in approximating the value m and V (x), provided that the effective upper and lower bounds for the parameters can be identified using historic data. Algorithm 2 ML Algorithm 1: Define Initial state x0, Time horizon T, time scale ∆t, K = T ∆t , Initial temperature λ, Initial learning rate α, functional forms of l(.), p(.), Jθ (.), ∇θJ θ (.), number of simulated paths M, Variable sz , an environment simulator ENV∆t(t, x, a). 2: Initialize θ , j = 1 . 3: while j < M do 4: λ = λl(j). 5: if mod (j − 1, sz) = 0 AND j > 1 then 6: Compute and store Aj = Average(θ) over the last sz iterations 7: if j > sz AND the absolute difference DA = |Aj − A(j−sz) | < ϵ then 8: End iteration 9: Initialize k = 0, observe x0 and store xtk ⇐ x0 10: while k < K do 11: Compute π(., xtk ) = G(., 1 − J θ x (xtk )) and generate action atk ∼ π(., xtk ). 12: Apply atk to ENV∆t(tk, xtk , atk ) to observe and store xtk+1 13: if xtk+1 < ϵ then 14: End iteration 15: observe and store J θ (xtk+1 ), ∇θJ θ (xtk+1 ) 16: Update k ← k + 1 17: Compute ∆θ using Ml algorithm and Update θ ← θ − αp(j)∆θ 18: Update j ← j + 1. 19: θ ∗ = Aj 70 Chapter 3 Optimal Dividend Problem under the Perturbed Cram´er–Lundberg Risk Model A natural extension of the Optimal Dividend Problem under the diffusion model discussed in chapter 2 is to extend our analysis to the Optimal Dividend Problem where the surplus of the insurance company is modeled by the Cram´er-Lundberg Model purturbed by diffusion. We again use the corresponding entropy regularized exploratory control problem to analyze the classical problem under model uncertainty. While some of the results are analogous to those proven in chapter 2, analysis in this chapter are somewhat more involved due to the “Jump” of the state process and the “non-locality” of the corresponding HJB equation. 3.1 Introduction and Formulation Let us again consider the probability space (Ω, F, P) introduced in chapter 2. Let N = {Nt}t≥0 be a Poisson process with intensity λ0, independent of the Brownian motion. We assume that the free surplus of an insurance company follows the Cram´er-Lundberg risk 71 model, that is, a compound Poisson process with drift given by Xt = x + pt − X Nt n=1 Un, where x is the initial surplus level; p is the premium rate; Un is the size of the n th claim of the company, which are independent and identically distributed random variables with common distribution F.Here, the processes N, W and the random variables {Un} are mutually independent. The Cram´er-Lundberg risk model follows an ideal Insurance model that assumes a constant rate of premium income as the only source of income and hence does not account for the fluctuations in the company’s income caused by the changes of the size of the customer base, or the varying income of the insurance company by the return of possible investments of the capital. 
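Before moving to the perturbed model, a quick simulation sketch of the classical dynamics X_t = x + pt − Σ_{n=1}^{N_t} U_n just described may be useful. The Python code below is purely illustrative: the exponential claim law stands in for the generic distribution F, and all numerical parameter values are placeholders rather than values taken from this thesis.

import numpy as np

rng = np.random.default_rng(0)

def simulate_surplus(x0, p, lam0, claim_sampler, T, dt=0.001):
    """Grid simulation of X_t = x0 + p*t - sum_{n=1}^{N_t} U_n; returns (path, ruin_time)."""
    x, path = x0, [x0]
    for k in range(int(T / dt)):
        n_claims = rng.poisson(lam0 * dt)          # claims arriving in (t_k, t_{k+1}]
        total_claim = claim_sampler(n_claims).sum() if n_claims else 0.0
        x += p * dt - total_claim
        path.append(x)
        if x < 0:                                  # ruin: surplus falls below 0
            return np.array(path), (k + 1) * dt
    return np.array(path), None                    # no ruin before T

# Placeholder parameters; Exp(mean 0.4) claims stand in for the claim law F.
path, ruin_time = simulate_surplus(
    x0=3.0, p=0.5, lam0=1.0,
    claim_sampler=lambda n: rng.exponential(scale=0.4, size=n),
    T=10.0)
print(len(path), ruin_time)

An exact simulation would instead draw the exponential inter-arrival times of N directly; the grid-based version is kept deliberately close to the time-discretized algorithms used elsewhere in this work.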
To address this issue, following [14] we use the Cram´er-Lundberg risk model purturbed by a diffusion process (that represents the uncertainty of the size of the customer base and varying income of the insurance company by the return of possible investments of the capital) to model the Surplus Level as follows. Xt = x + pt − X Nt n=1 Un + σWt , where σ is the diffusion parameter. As discussed in chapter 2, The company takes part of its surplus to pay dividends to its shareholders and we define the associated controlled process Xα t with initial surplus level x as X α t = x + pt − X Nt n=1 Un − Z t 0 αtdt + σWt . (3.1) Let us call τ α the ruin time at which the company gets ruined, i.e. τ α := inf{t ⩾ 0 : Xα t < 0}. 72 Our objective again is to maximize the expected discounted dividends paid by the company until ruin, with the constant discount factor c > 0. So, for any initial surplus level x ⩾ 0, the optimal value function is defined by, V (x) = sup α∈U[0,a] Jα(x) (3.2) where Jα(x) = Ex[ R τ α 0 e −csαs ds]. The optimal value function satisfies the following HJB equation, sup α∈[0,a] n (p − α)v ′ (x) + λ0R[v](x) − cv(x) + α + σ 2 2 v ′′(x) o (3.3) where R[v](x) = R ∞ 0 (v(x − y) − v(x))dF(y). 3.2 Entropy-regularized Exploratory Control Problem We analyze this problem using RL approach, in particularly by optimizing the corresponding entropy regularized-exploratory control problem. Using the notations and methodology explained in section 2.1, we can derive the exploratory dynamics of the state process X and the entropy regularized value function corresponding to the classical problem (3.1)-(3.2). The exploratory dynamics of the state process is given by, Xt = x + Z t 0 p − Z a 0 wπs(w, ·)dw ds − X Nt n=1 Un + σWt , X0 = x, (3.1) where {πt(w, ·)} is the (density of) relaxed control process, and we shall often denote X = Xπ,0,x = Xπ,x to specify its dependence on control π and the initial state x. We shall define the entropy-regularized cost functional of the optimal expected dividend control problem 73 under the relaxed control π as J(x, π) = Ex h Z τ π x 0 e −ctHπ λ (t)dti , (3.2) where Hπ λ (t) := R a 0 (w − λ ln πt(w))πt(w)dw, τ π x = inf{t > 0 : X π,x t < 0}, and λ > 0 is the so-called temperature parameter balancing the exploration and exploitation. Since we are only interested in the scenario when x ∈ R +, and J(x) = 0 by definition, WLOG we define J(x) = 0 for x < 0. For a feedback policy π ∈ A (x), Xπ,x satisfies the following dynamics: Xt = x + Z t 0 p − Z a 0 wπ(w, Xs)dw ds − X Nt n=1 Un + σWt , X0 = x, (3.3) Along the lines of the proof of proposition (2.1.4), it is straight forward to check that value function positive, non-decreasing and bounded above by λ ln a+a c and it (formally) satisfies the HJB equation: cv(x)= sup π∈L1[0,a] Z a 0 h w−λ ln π(w) +(p − w)v ′ (x) i π(w)dw + λ0R[v](x) + σ 2 2 v ′′(x); v(0) = 0. (3.4) Next, similar to section 2.2, optimal feedback control has the Gibbs form π ∗ (w, x) = G(w, 1− v ′ (x)), assuming all derivatives exist. Plugging (2.2) into (2.1), we see that the HJB equation (2.1) becomes the following second order ODE: 1 2 σ 2 v ′′(x) + g(v ′ (x)) − cv(x) + λ0R[v](x) = 0, x ≥ 0; v(0) = 0, (3.5) where the function g(z) := {pz + λ ln[λ(e a λ (1−z)−1) 1−z ]}1{z̸=1} + [p + λ ln a]1{z=1}. 74 3.3 Policy Update We are now interested in proving a PI theorem to create an iterative Policy Update algorithm. Roughly speaking we shall argue that for any close-loop policy π ∈ Acl(x), we can construct another π˜ ∈ Acl(x), such that J(x,π˜) ≥ J(x, π). 
This section will also discuss the convergence of these iterations to the optimal policy. To begin with, for x ∈ R and π ∈ Acl(x), let Xπ,x be the unique strong solution to (3.3). For t > 0, we consider the process Wˆ s := Ws+t − Wt , s > 0. Then Wˆ is an Fˆ-Brownian motion, where Fˆ s = Fs+t , s > 0. Consider the solution to Xˆ s = X π,x t + Z s 0 p − Z a 0 wπ(w, Xˆ s)dw ds + σWˆ s − N Xt+s n=Nt+1 Un, s ≥ 0. (3.1) (Here PNt+s n=Nt+1 Un := 0 if Nt = Nt+s). By the Markovian Property of the compound Poisson process and the Brownian motion, we have the flow property: X π,x r+t = Xˆ π,Xπ,x t r , r ≥ 0. Now we denote ˆπ := π(·, Xˆ ·) ∈ Acl(X π,x t ) to be the open-loop strategy induced by the closed-loop control π. Then the corresponding cost functional can be written as (denoting Xπ = Xπ,x) J(X π t ;π) = EXπ t h Z τ π Xπ t 0 e −crh Z a 0 (w − λ ln ˆπr(w))ˆπr(w)dwi dri , t ≥ 0, (3.2) where τ π X π,x t = inf{r > 0 : Xˆ π,Xπ,x t r < 0}. It is clear that, by flow property, we have τ π x = τ π X π,x t + t, P-a.s. on {τ π x > t}. Next, for any admissible policy π ∈ Acl, we formally define a new feedback control policy as follows: for (w, x) ∈ [0, a] × R +, π˜(w, x) := G(w, 1 − J ′ (x;π)) (3.3) 75 We would like to emphasize that the new policy ˜π is independent of the system parameters (p, λ0, σ)(!), although it depends on the dividend bound a and the temperature parameter λ. For any π ∈ Acl, we denote ˜b π (x) = p− R a 0 wπ(w, x)dw. We recall the definition of “Strongly Admissible” from section 2.3 , and observe that for π ∈ A s cl, ˜b π and r π are bounded and are Lipschitz continuous. We denote X := Xπ,x to be the solution to (3.3), and rewrite the cost function (2.8) as J(x, π) = Ex h Z τ π x 0 e −csr π (X π,x s )dsi . (3.4) where τ π x = inf{t > 0 : X π,x t < 0}. We define the operator, L π [u](x) := 1 2 σ 2uxx(x) + ˜b π (x)ux(x) − cu(x) + λ0R[u](x). Thus, in light of the Feynman-Kac formula, for any π ∈ A s cl, J(·, π) is the probabilistic solution to the following non homogeneous, non local PDE on R+: L π [u](x) + r π (x)= 0, u(0) = 0 (3.5) given that (3.5) has a classical solution. Proposition 3.3.1. If π ∈ A s cl, then J(·, π) ∈ C 2 b (R +), and the bounds of J ′ and J ′′ depend only on those of ˜b π , r π , and J(·, π). proof: First we are going to use a variation of the Picard–Lindel¨of theorem to prove that the non homogeneous, non local PDE (3.5) has a unique classical solution, assuming that u ′ (0) = β0 exists. For a fixed π, |r π (·)| ≤ C and | ˜b π (·)| ≤ B for some C, B > 0. Let us denote Rπ (x, u) := λ0 R x 0 u(x−y)dF(y)+r π (x). The second order non local PDE (3.5) is equivalent to the system 76 of first order non local PDEs, X′ = AX + q(X), where X = X(x) = X1(x) X2(x) = u(x) u ′ (x) , A = 0 1 2(λ0+c) σ2 −2˜b π (x) σ2 , and q(X) = 0 −2Rπ (x,X1) σ2 . Now we consider the function space C(Id(0), R 2 ) of continuous functions, where Id(0) := [0, d] for a constant d > 0 to be specified later. For ϕ : R + → R 2 , we define the operator T : C(Id(0), R 2 ) → C(Id(0), R 2 ) by T(ϕ(t)) = X(0) + R t 0 [A(x)ϕ(x) + q(ϕ(x))]dx. denote ∥ϕ∥d,∞ = supt∈[0,d] |ϕ(t)|. ∥T(ϕ(t))−T(ψ(t))∥d,∞ = Z t 0 A(x)[ϕ(x) − ψ(x)]dx + Z t 0 [q(ϕ(x)) − q(ψ(x))]dx d,∞ ≤ Z d 0 (|A(x)[ϕ(x) − ψ(x)]| + |q(ϕ(x)) − q((ψ(x))|)dx ≤ Kd∥ϕ − ψ∥d,∞ + Z d 0 |q[ϕ](x) − q[ψ](x)|dx. Here, K = ∥A(·)∥∞. 
On the other hand, for any x ∈ [0, d], |q[ϕ](x) − q[ψ](x)| = 2 σ 2 λ Z x 0 ϕ1(x − y)dF(y) − λ Z x 0 ψ1(x − y)dF(y) = 2λ σ 2 Z x 0 (ϕ1(x − y) − ψ1(x − y))dF(y) ≤ 2λ σ 2 ∥ϕ − ψ∥d,∞ | Z x 0 dF(y)| ≤ 2λ σ 2 ∥ϕ − ψ∥d,∞ , since F(·) is a distribution function. Thus we can conclude that for t ∈ Id(0), it holds that ∥T(ϕ(t)) − T(ψ(t))∥d,∞ ≤ K + 2λ σ 2 d∥ϕ − ψ∥d,∞ =: Kc ˜ ∥ϕ − ψ∥d,∞ . (3.6) for some K > ˜ 0. So choosing d ≤ 1 K˜ , we can conclude that T is a contraction mapping. That is, the ODE (2.6) has a unique solution on the interval [0, d]. To extend the solution to the whole positive half plane, we shall argue that the maximum 77 existence interval of ODE (3.5) is [0,∞). To this end, suppose that the maximum existence interval of (3.5) is [0, x0] for some x0 > 0. We denote by M := supx∈[0,x0] |v(x)| and consider the following general problem. ˜b π (x)ux(x) + 1 2 σ 2uxx(x) − (λ0 + c)u(x) + λ0 Z x x0 u(x − y)dF(y) + λ0 Z x0 0 v(x − y)dF(y) + r π (x) = 0; u(x0) u ′ (x0) = α˜0 β˜ 0 . (3.7) where ˜α0, β˜ 0 > 0 . By considering λ0 R x0 0 v(x−y)dF(y) +r π (x) to be a function independent from the unknown function u, and following the same methodology as above, we can prove that the equation (3.7) has a unique solution on the interval [x0, x0 + d]. We can extend this result by partitioning the positive real line into intervals of size d, and conclude that the equation (3.5) has a unique solution on R +. Thus, we conclude that if π ∈ A s cl, then J(·, π) ∈ C 2 b (R +). Also if ϕ : R + → R 2 is a solution to the system of equations X′ = AX + q(x, X), we know that ϕ(t) = X(0) + Z t 0 A(x)ϕ(x) + q(ϕ(x))dx. Since A(x), q(x, X1(x)) and J(x, π) are bounded in x, so is J ′ (x, π). Since by equation (3.5), J ′′(x, π) is a combination of functions bounded in x, J ′′(x, π) is also bounded in x. Hence the result. Our main result of this section is the following Policy Improvement Theorem. Theorem 3.3.2. Let π ∈ A s cl and let π˜ be defined by (3.3) associate to π, it holds that J(x,π˜) ≥ J(x, π), x ∈ R+. Also, if π˜ ≡ π, then π˜ ≡ π is the optimal policy. Proof. If (1−J ′ (·, π)) ∈ C 1 b (R +). Thus Lemma 2.3.2 (with c(x) = 1−J ′ (x, π)) implies that π˜ ∈ A s cl as well. 78 Moreover, since π ∈ A s cl, J(·, π) is a C 2 -solution to the ODE (3.5). Now recall that π˜ ∈ A s cl is the maximizer of supπb∈A s cl [b πb (x)J ′ (x, π) + r πb (x)], we have L π˜ [J(·, π)](x) + r π˜ (x) ≥ 0, x ∈ R+. (3.8) Now, let us consider the process Xπ˜ , the solution to (3.1) with π being replaced by π˜. Applying Itˆo’s formula to e −ctJ(Xπ˜ t , π) from 0 to τ π˜ x ∧ T, for any T > 0, and noting the definitions of ˜b π˜ and r π˜ , we deduce from (3.8) that e −c(τ π˜ x ∧T) J(X π˜ τ π˜ x ∧T , π) (3.9) = J(x, π) + Z τ π˜ x ∧T 0 e −crL π˜ [J(·, π)](X π˜ r )dr + Z τ π˜ x ∧T 0 e −crJ ′ (X π˜ r , π)σdWr ≥ J(x, π) − Z τ π˜ x ∧T 0 e −crr π˜ (X π˜ r )dr + Z τ π˜ x ∧T 0 e −crJ ′ (X π˜ r , π)σdWr, Taking expectation on both sides above, sending T → ∞ and noting that J(Xπ˜ τ π˜ x , π) = 0, we obtain that J(x, π) ≤ J(x,π˜), x ∈ R +. Now suppose ˜π ≡ π. Then J(·,π˜) = J(·, π). Let J ∗ denote J(·,π˜) = J(·, π). Since π ∈ A s cl, J ∗ satisfies (3.5). That is, 1 2 σ 2 J ∗ xx(x) + ˜b π˜ (x)J ∗ x (x) − cJ∗ (x) + λ0R[J ∗ ](x) + r π˜ (x) = 0. 
(3.10) But, since ˜π ∈ A s cl is the maximizer of supπb∈A s cl [ ˜b πb (x)J ′ (x, π)+r πb (x)] = supπb∈A s cl [ ˜b πb (x)J ∗ (x)+ r πb (x)], substituting ˜b π˜ (x)J ∗ x (x) + r π˜ (x) = supπb∈A s cl [ ˜b πb (x)J ∗ (x) + r πb (x)] to (3.10), we have 1 2 σ 2 J ∗ xx(x) + sup πb∈A s cl [ ˜b πb (x)J ∗ (x) + r πb (x)] − cJ∗ (x) + λ0R[J ∗ ](x) = 0. (3.11) That is J ∗ satisfies (3.4) and hence J ∗ is the optimal value function and ˜π is the optimal policy. 79 In light of Theorem 3.3.2 we can naturally define a “learning sequence” as follows. Since πˆ(w, x) ≡ 1 a , (x, w) ∈ R+ × [0, a], is strongly admissible and J(x,πˆ) ≥ 0, let us define π0(x, w) ≡ 1 a , and v0(x) := J(x, π0), πn(x, w) := G(w, 1 − J ′ (x, πn−1)), (w, x) ∈ [0, a] × R +, for n ≥ 1 (3.12) Also for each n ≥ 1, let vn(x) := J(x, πn). The natural question is whether this learning sequence is actually a “maximizing sequence”, that is, vn(x) ↗ v(x), as n → ∞. It turns out that if we have a global uniform bound for the sequence vn, we can prove that it is a “maximizing sequence”. Hence we carry our following analysis of the bounds of r πn , on which depends the bounds of vn. To this end, consider a policy π ∈ Acl in the form, π(w, x) = c(x)e w λ c(x) d(x) 1{c(x)̸=0} + 1 a 1{c(x)=0},(w, x) ∈ [0, a] × R + where d(x) = λ(e a λ c(x) − 1). Then for c(x) = 0, rπ (x) = a 2 + λ ln(a). For c(x) ̸= 0, r π (x) = R a 0 (w − λ lnπ(w, x))π(w, x)dw = R a 0 (1 − c(x))wπ(w, x)dw − λ ln( c(x) d(x) ) and R a 0 wπ(w, x)dw = c(x) d(x) ( ac(x) λ e a λc(x) − d(x) λ ) (c(x))2 λ2 . Thus we have, r π (x) = (1 − c(x)) ae a λ c(x) (e a λ c(x)−1) − (1 − c(x)) λ c(x) − λ ln( c(x) λ(e a λ c(x)−1) ). Lemma 3.3.3. Suppose that π ∈ Acl is in the form, π(w, x) = c(x)e w λ c(x) d(x) 1{c(x)̸=0} + 1 a 1{c(x)=0},(w, x) ∈ [0, a] × R +. Let C1 > 0 be a given constant. Then, for all x ∈ R, |r π (x)| ≤ C0 +C1|c(x)| for some C0 > 0 depending on C1. Consider the function k : R \ {0} → R defined by k(y) = ae a λ y (e a λ y−1) − (1 − y) λ y . There exists 80 ˜k > 0 that depends only on a, λ such that |k(y)| ≤ ˜k, y ̸= 0. Since for x such that c(x) ̸= 0 we have r π (x) = k(c(x)) − ac(x)e a λ c(x) (e a λ c(x)−1) − λ ln( c(x) λ(e a λ c(x)−1) , we observe that, if c(x) ̸= 0, |r π (x)| ≤ ˜k + − ac(x)e a λ c(x) (e a λ c(x) − 1) − λ ln c(x) λ(e a λ c(x) − 1)! . (3.13) Consider the function g(z) = zez e z−1 + ln z e z−1 = ln( ze zez ez−1 e z−1 ) (defined at 0 by limz→0 g(z)). We can rewrite the inequality (3.13) using the function g as follows. |r π (x)| ≤ ˜k + λ g( ac(x) λ ) − ln(a) . (3.14) There exists z0 such that for |z| > z0, |g(z)| ≤ C1 a |z|. Since g is an even function we can consider z ≥ 0. g(z) ≥ 1 for all z ≥ 0. Hence if we prove that there exists z0 such that ze zez ez−1 e z−1 ≤ e C1z a when z > z0, we are done. This holds true since limz→∞ ze zez ez−1 e z−1 e C1z a = limz→∞ z e C1z a e zez ez−1 e z − 1 = limz→∞ z e C1z a e zez ez−1 e z e z e z − 1 = limz→∞ z e C1z a e z ez−1 e z e z − 1 = limz→∞ z e C1z a limz→∞ e z ez−1 limz→∞ e z e z − 1 = 0 × 1 × 1 = 0. Let ˜l = sup|z|≤z0 |g(z)|. Thus for all z ∈ R, |g(z)| ≤ C1 a |z| + ˜l. Hence, for c(x) ̸= 0, |r π (x)| ≤ ˜k+λ ln a+γ( C1 a | ac(x) λ |+˜l) = ˜k+γ ˜l+λ ln a+γ C1 a a|c(x)| λ = ˜k+γ ˜l+ λ ln a+C1|c(x)|. Taking C0 = max(˜k+γ ˜l+λ ln a, a 2 +λ ln(a)), we get |r π (x)| ≤ C0 +C1|c(x)| for all x ∈ R +. Hence the result. 
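Because the policy update repeatedly manipulates densities of this Gibbs form, a small numerical sanity check of the above computation can be reassuring. The Python sketch below is our own illustration (a and λ are placeholder values): it evaluates the density π(w, x) = c e^{wc/λ}/(λ(e^{ac/λ} − 1)) and compares the closed-form expression for r^π(x) derived above with a direct numerical evaluation of ∫_0^a (w − λ ln π(w)) π(w) dw.

import numpy as np

a, lam = 3.0, 2.0   # dividend bound and temperature (placeholder values)

def gibbs_density(w, c):
    """pi(w) = c e^{wc/lam} / (lam (e^{ac/lam} - 1)) for c != 0, and 1/a for c = 0."""
    if abs(c) < 1e-12:
        return np.full_like(np.asarray(w, dtype=float), 1.0 / a)
    return c * np.exp(w * c / lam) / (lam * (np.exp(a * c / lam) - 1.0))

def r_closed_form(c):
    """The expression for r^pi(x) derived above, with c = c(x)."""
    if abs(c) < 1e-12:
        return a / 2.0 + lam * np.log(a)
    e = np.exp(a * c / lam)
    return (1 - c) * a * e / (e - 1) - (1 - c) * lam / c \
           - lam * np.log(c / (lam * (e - 1)))

def r_by_quadrature(c, n=20000):
    """Direct midpoint-rule evaluation of int_0^a (w - lam ln pi(w)) pi(w) dw."""
    w = (np.arange(n) + 0.5) * a / n
    pi = gibbs_density(w, c)
    return float(np.sum((w - lam * np.log(pi)) * pi) * (a / n))

for c in (-1.5, -0.3, 0.7, 2.0):
    print(c, r_closed_form(c), r_by_quadrature(c))

The two printed columns should agree to several decimal places for both positive and negative values of c, giving a quick empirical confirmation of the algebra behind Lemma 3.3.3.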
Since in our iteration, we have πn(w, x) is of the form c(x)e w λ c(x) d(x) 1{c(x)̸=0} + 1 a 1{c(x)=0},(w, x) ∈ [0, a] × R + when c(x) = 1 − J ′ (x, πn−1) for each n ≥ 1, we are interested in the bound of the function J ′ (·, πn) for each n ≥ 0. To this end, we consider the following lemma. 8 Lemma 3.3.4. Consider a compact set E ⊂ R. For each n ≥ 0, let Rn = supx∈E |r πn (x)|. Then for each n ≥ 0, for x ∈ E, |J ′ (x, πn)| ≤ K0 +K1|Rn| for some K0, K1 > 0, independent of n. Suppose that J ′ (0, πn) = β0. Since J(x, πn) is a solution to the equation (2.6), (when π = πn), we know that for y ∈ E, J(y, πn) J ′ (y, πn) = 0 β0 + Z y 0 A(x) J(x, πn) J ′ (x, πn) + q J(x, πn) J ′ (x, πn) ! dx. Thus, we have J ′ (y, πn) = β0 + R y 0 2 σ2 ((λ0 +c)J(x, πn)−˜b πn (x)J ′ (x, πn)−Rπn (x, J(x, πn))dx. Since 0 ≤ J(x, π0) ≤ J(x, πn) ≤ V (x) and |V (x)| ≤ λ ln a+a c , we have |J(x, πn)| ≤ U := λ ln a+a c , x ∈ R +. Thus |J ′ (y, πn)| ≤ |β0|+ 2 σ2 (λ0+c)Uy+| R y 0 2 σ2 ˜b πn (x)J ′ (x, πn)dx|+ 2 σ2 λ0Uy+| R y 0 2 σ2 r πn (x)dx|. Let U˜ = 2 σ2 (λ0 + c)U + 2 σ2 λ0U . Since | ˜b πn (x)| ≤ µ + a, and Rn = supx∈E |r πn (x)|, we have |J ′ (y, πn)| ≤ |β0|+Uy ˜ + 2 σ2 (a+µ) R y 0 |J ′ (x, πn)|dx+ 2 σ2Rny. We use the notation Ln = U˜+ 2 σ2Rn and B = 2 σ2 (a + µ) to obtain, |J ′ (y, πn)| ≤ |β0| + Lny + B R y 0 |J ′ (x, πn)|dx. Then by Gronwall’s enequality, |J ′ (y, πn)| ≤ |β0| + Lny + Z y 0 (Lns + |β0|)BeB(y−s) ds = |β0|e By + Ln B (e By − 1). Since 0 ≤ J(x, πn) ≤ V (x) and J(x, πn) ∈ C 2 , we have |β0| = |J ′ (0, πn)| ≤ V ′ (0). That is, |J ′ (y, πn)| ≤ V ′ (0)e By + Ln B (e By − 1) ≤ V ′ (0)e B|E| + Ln B (e B|E| − 1) (Here we have assumed |y| ≤ |E| WLOG. Taking K0 = |V ′ (0)e B|E| + U˜ B (e B|E| − 1)| and K1 = 2 Bσ2 (e B|E| − 1) proves the lemma. Lemma 3.3.5. Consider a compact set E ⊂ R. There exists L > 0 such that for each n ≥ 0, |r πn (x)| ≤ L on E. Consequently, the bounds of vn, v′ n and v ′′ n are independent of n. 82 proof Let C1 = Bσ2 3λ(eB|E|−1) . Since πn+1(w, x) = c(x)e w λ c(x) λ(e a λ c(x)−1) 1{c(x)̸=0}+ 1 a 1{c(x)=0},(w, x) ∈ [0, a]× R + where c(x) = 1 − J ′ (x, πn), we have by lemma 3.3.3,that |r πn+1 (x)| ≤ C0 + C1|1 − J ′ (x, πn)| ≤ C0+C1+C1|J ′ (x, πn)|. But by lemma 3.3.4, for x ∈ E, |J ′ (x, πn)| ≤ K0+K1|Rn|. That is, Rn+1 = supx∈E |r πn+1 (x)| ≤ C0 + C1 + C1(K0 + K1|Rn|) for each n ≥ 0. But, C1K1 = 2 3 < 1. Hence, for each n ≥ 0, |Rn| ≤ L for some L > 0, that only depends on C0, C1, K0, K1 and |R0|, all of which are independent of n. Since by proposition 2.3.3 the bounds of vn, v′ n and v ′′ n depends on n only through the bounds of r πn which we proved to be independent of n, the result follows. Now we turn for the convergence analysis of the learning sequence. Theorem 3.3.6. The sequence {vn}n≥0 is a maximizing sequence. Furthermore, the sequence {πn}n≥0 converges to the optimal policy π ∗ . Proof. We first observe that by Lemma 2.3.2 the sequence {πn} ⊂ A s cl, provided π0 ∈ A s cl. Let us fix any compact set E ⊂ R+. Since vn = J(·, πn), Proposition 2.3.3 and lemma 3.3.5 guarantees that vn ∈ C 2 b (E), and the bounds are independent of n. Thus a simple application of Arzella-Ascolli Theorem shows that there exist subsequences {nk}k≥1 and {n ′ k }k≥1 such that {vnk }k≥0 and {v ′ n ′ k }k≥0 converge uniformly on compacts. Assume that limk→∞ vnk (·) = v ∗ (·), uniformly on E, for some function v ∗ . By definition of πn’s we know that {vn} is monotonically increasing, thanks to Theorem 2.3.4, thus the whole sequence {vn}n≥0 must converge uniformly on E to v ∗ . 
Next, let us assume that limk→∞ v ′ n ′ k (·) = v ∗∗(·), uniformly on E, for some function v ∗∗. Since obviously limk→∞ vn ′ k (·) = v ∗ (·) as well, and note that the derivative operator is a closed operator, it follows that v ∗∗(x) = (v ∗ ) ′ (x), x ∈ E. Applying the same argument one shows that for any subsequence of {v ′ n}, there exists a sub-subsequence that converges uniformly on E to the same limit (v ∗ ) ′ , we conclude that the sequence {v ′ n} itself converges uniformly on E to (v ∗ ) ′ . Since E is arbitrary, this shows that {(vn, v′ n )}n≥0 converges uniformly on compacts 83 to (v ∗ ,(v ∗ ) ′ ). Since πn is a continuous function of v ′ n , we see that {πn}n≥0 converges uniformly to π ∗ ∈ Acl defined by π ∗ (x, w) := G(w, 1 − (v ∗ ) ′ (x)). Finally, applying Lemma 2.3.2 we see that π ∗ ∈ A s cl, and the structure of the π ∗ (·, ·) guarantees that v ∗ satisfies the HJB equation (3.4) on the compact set E. By expanding the result to R + using the fact that E is arbitrary, v ∗ satisfies the HJB equation (3.4) (or equivalently (3.5)). Now by using the slightly modified verification argument in Theorem 4.1 in [20] we conclude that v ∗ = V ∗ is the unique solution to the HJB equation (2.1) and thus π ∗ by definition is the optimal control. 3.4 Policy Evaluation To achieve our target of finding the optimal policy and optimal value function, we can design algorithms based on the learning sequence (2.8). To this end, we need a methodology to evaluate the function J ′ (·, πn), for given πn, for each n ≥ 0. To this end we introduce a Policy Evaluation method to approximate J ′ (·, π) for a given π ∈ Acl(x). Thus in this section we consider a fixed π ∈ A s cl(x) and use the notation r π = r, τπ x = τx, ˜b π = ˜b and J(·, π) = J(·). Since J satisfies the equation (3.5), by differentiating the equation (3.5) we observe that J ′ satisfies the following second order ODE. 1 2 σ 2 y ′′(x) + (˜b) ′ (x)y(x) + ˜b(x)y ′ (x) − cy(x) + λG[y](x) + r ′ (x) = 0. (3.1) Thus, in light of the Feynman-Kac formula, J ′ (·) can be expressed as, J ′ (x) = Ex h Z τx 0 e − R s 0 l(Xx ρ )dρr ′ (X x s )ds + e − R τx 0 l(Xx ρ )dρJ ′ (X x τx ) i (3.2) where Xx be the unique strong solution to the dynamics (2.10), and l(x) := c − ( ˜b) ′ (x) for 84 all x. We define Yt = e − R t 0 l(Xx ρ )dρJ ′ (Xx t ) + R t 0 e − R ρ 0 l(Xx ρ )dρr ′ (Xs)ds t ≤ τx; Yτx , t > τx. Here Xx be the unique strong solution to the dynamics (2.10) for x ∈ R. Theorem 3.4.1. For all x ∈ R +, the process Y = {Yt ;t ≥ 0} is an F˜-martingale where F˜ = {Fτ˜x∧t}t≥0. Conversely, if there exists a continuous function J˜ such that for all x ∈ R +, Y˜ = {Y˜ t = e − R t 0 l(Xx ρ )dρJ˜′ (Xx t ) + R t 0 e − R ρ 0 l(Xx ρ )dρr ′ (Xs)ds;t ≥ 0} is an F˜-martingale, J˜(x) = 0 for x ≤ 0 and J ′ (0) = J˜′ (0), we have J ′ = J˜′ . Proof. For a fixed t ≥ 0, we have E[Yτx I{t<τx}|Fs] = E[I{t<τx} Z τx 0 e − R ρ 0 l(Xx ρ )dρr ′ (Xs)ds + e − R τx 0 l(Xx ρ )dρJ ′ (X x τx )|Ft ] = I{t<τx}E h Z τx t e − R ρ 0 l(Xx ρ )dρr ′ (Xs) + e − R τx 0 l(Xx ρ )dρJ ′ (X x τx )|Ft i + I{t<τx} Z t 0 e − R ρ 0 l(Xx ρ )dρr ′ (Xs)ds Consider the solution to Xˆ ρ = X x t + Z ρ 0 p − Z a 0 wπ(w, Xˆ s)dw ds + σWˆ ρ − N Xt+ρ n=Nt+1 Un, t ≥ 0. (3.3) (Here PNt+ρ n=Nt+1 Un := 0 if Nt = Nt+ρ). By the Markovian Property of the compound Poisson process and the Brownian motion, we have the flow property: Xx t+s = Xˆ Xx t s , s ≥ 0. We denote by τXx t = inf{r > 0 : Xˆ Xx t r < 0}. It is clear that, by flow property, we have τx = τXx t + t, P-a.s. on {τx > t}. 
85 Thus we obtain E h R τx t e − R ρ 0 l(Xx ρ )dρr ′ (Xs) + e − R τx 0 l(Xx ρ )dρJ ′ (Xx τx )|Ft i = e − R t 0 l(Xx ρ )dρJ ′ (Xt). And then it follows that E[Yτx I{t<τx}|Ft ] = I{t<τx}Yt . Thus, E[Yτx |Fs] = E[Yτx I{s<τx}|Fs] + E[Yτx I{s≥τx}|Fs] = I{s<τx}Ys + E[YsI{s≥τx}|Fs] = I{s<τx}Ys + I{s≥τx}Ys = Ys. Thus the first part of the theorem is proved. Since Y˜ is an F˜-martingale, we have e − R t 0 l(Xx ρ )dρJ˜′ (Xx t ) + R t 0 e − R ρ 0 l(Xx ρ )dρr ′ (Xs)ds = Y˜ t = E[Y˜ τx |Ft ] = E[Yτx |Ft ] = Yt = e − R t 0 l(Xx ρ )dρJ ′ (Xx t ) + R t 0 e − R ρ 0 l(Xx ρ )dρr ′ (Xs)ds, P-a.s. (Y˜ τx = Yτx follows from the fact that J ′ (Xx τx ) = J˜′ (Xx τx )). That is J ′ (Xt) = J˜′ (Xt) a.s. for all t ≥ 0. Hence the result. Now, if J ′ (0) is known, approximating J ′ (·) is equivalent to find a function R˜ such that R˜(0) = J ′ (0) and e − R t 0 l(Xx ρ )dρR˜(Xx t )+R t 0 e − R ρ 0 l(Xx ρ )dρr ′ (Xs)ds is an F˜-martingale. Hence we can use stochastic approximation method and stochastic gradient descent methods similar to [22] to approximate J ′ (·) using a family of parameterized functions Rθ such that Rθ (0) = J ′ (0) for all θ. Hence we need to find the value of J ′ (0). To this end, we propose carrying out PE for J on the interval [0, ∆t], and use the value J(∆t) ∆t as an approximation for J ′ (0). We define Mt = e −ctJ(Xx t ) + R t 0 e −csr(Xx s )ds t ≤ τx; Mτx t > τx. Theorem 3.4.2. For all x ∈ R +, the process M = {Mt ;t ≥ 0} is an F˜-martingale. Conversely, if there exists a continuous function J˜ such that for all x ∈ R +, M˜ = {M˜ t = 86 e −ctJ˜(Xx t ) + R t 0 e −csr(Xx s )ds;t ≥ 0} is an F˜-martingale and J˜(0) = 0, then J = J˜. Proof. For a fixed t ≥ 0, we have E[Mτx I{t<τx}|Ft ] = E[I{t<τx} Z τx 0 e −csr(X x s )ds|Ft ] = I{t<τx} Z t 0 e −csr(X x s )ds + E h I{t<τx} Z τx t e −csr(X x s )ds|Ft i = I{t<τx} Z t 0 e −csr(X x s )ds + I{t<τx}E h Z τx t e −csr(X x s )ds|Ft i Now, since E[ R τx t e −csr(Xx s )ds|Ft ] = e −ctJ(Xt), we have E[Mτx |Ft ] = E[Mτx I{t<τx}|Ft ] + E[Mτx I{t≥τx}|Ft ] = I{t<τx} Z t 0 e −csr(X x s )ds + I{t<τx}e −ctJ(Xt) + E[MtI{t≥τx}|Ft ] = I{t<τx}Mt + I{t≥τx}Mt = Mt . Hence for T > t we have, E[MT |Ft ] = E[E[Mτx |FT |Ft ] = E[Mτx |Ft ] = Mt . Thus the first part of the theorem is proved. Since M˜ is an F˜-martingale, we have e −ctJ˜(Xt)+ R t 0 e −csr(Xs)ds = M˜ t = E[M˜ τx |Ft ] = E[Mτx |Ft ] = Mt = e −ctJ(Xt) + R t 0 e −csr(Xs)ds, P-a.s. That is J(Xt) = J˜(Xt) a.s. for all t ∈ R +. Hence the result. Now, approximating J(·) is again equivalent to finding a function R such that R(0) = 0 and e −ctR(Xt) + R t 0 e −csr(Xs)ds is an F˜-Martingale. Hence we can again use stochastic approximation method and stochastic gradient descent methods similar to [22] to approximate J(·) using a family of parameterized functions J θ such that J θ (0) = 0 for all θ. Then we can use these values to approximate J ′ (0) as discussed above. 87 3.5 Method of “q-learning” as an Off-policy Algorithm Similar to section 2.7, we can consider “q-learning” approach to approximate the optimal strategy. To this end we consider the q-function associated with the optimal policy π ∗ , adapted from [24] and is defined as follows for our context. q ∗ (x, w) = (p − w)v ′ (x) + 1 2 σ 2 v ′′(x)) + λ0G[v](x) − cv(x) + w; (x, w) ∈ R + × [0, a]. Now to use the q − function in designing algorithms, we consider the following theorem adapted from theorem 4 from [24]. Theorem 3.5.1. Let functions vb ∈ C 2 with polynomial growth and qb∗ ∈ C be given satisfying vb(0) = 0, R a 0 exp{ 1 λ qb∗ (x, w)}dw = 1, ∀x ∈ R +. 
Then, (i) If vb and qb∗ are respectively the optimal value function and the optimal q-function, then for any π ∈ A s cl and all x ∈ R +, the following process e −c(τ π x ∧s) vb(X˜ τ π x ∧s) + Z τ π x ∧s 0 e −cu w π u − qb∗ (X˜ u, wπ u ) du (3.4) is an (G˜ , Q)-martingale, where {X˜ π s , s ≥ 0} is the solution to (2.11) under the policy π. (ii) If there exists a π ∈ A s cl such that for all x ∈ R +,(3.4) is an (G˜ , Q)-martingale, then vb and qb∗ are respectively the optimal value function and the optimal q-function. When = q ∗ is the optimal q-function, πb∗ (w, x) = exp{ 1 λ qb∗ (x, w)} is the optimal policy. proof: We omit the proof since it is almost identical to the proof of the Theorem (2.7.1). Learning the optimal value function and q-function based on Theorem 3.5.1 is equivalent to finding parameters θ and ψ such that e −c(τ π˜ x ∧s) vb(X˜ τ π˜ x ∧s ) + R τ π˜ x ∧s 0 e −cu w π u − qb∗ (X˜ u, wπ u ) du is an (G˜ , Q) martingale for any given π. 88 Chapter 4 Possible Extensions and Future Research Throughout the chapter 2, we have developed a theoretical approach for approximating the optimal dividend rate of the Optimal Dividend problem under the Perturbed Cram´erLundberg Model using Reinforcement Learning. We have devised both “on-policy” and “offpolicy” approaches for approximating the optimal dividend rate. The immediate and natural future direction is to carry out the numerical simulations and compare the effectiveness of the devised algorithm by comparing the obtained results with the classical optimal dividend rate of the purturbed Cram´er Lundberg model, which we leave for our future research. In both of the problems discussed above in chapter 2 and 3, the reserve of the Insurance Company depends only on one control, namely the dividend rate. In general, the company can control its reserve via a combination of consumption (e.g., dividends payment), investment, and reinsurance (see [35]). The optimal dividend problem under various combinations of these controls creates different interesting problems that presents unique technical challenges, as discussed below. In what follows the reserve is modelled by the classical Cram´erLundberg model and we consider the corresponding Entropy-regularized exploratory control 89 problem to surpass the issue of model uncertainity. • The case where the reserve is controlled by the dividend rate and the reinsurance offers a suitable context to develop Policy Improvement and Evaluation methods for vector valued policies, as well as to analyze the controlled jump-process (the optimal feedback policies no longer are in the Gibbs format). RL for Stochastic control problems with these properties has not yet been explored in the continuous time and space setting, to the best of our knowledge. One approach for solving this problem is a different iteration method based directly on the Dynamic Programming principle, similar to [34], where we consider a initial value function V 0 and update it for n ≥ 1, by V n (x) = supL∈Πx {Ex[ R τ1 0 e −δsHL s ds + e−δτ1 V n−1 (XL τ1 )1{τ1<τ}]}, where τ1 is the time of the first claim arrival, HL s is the entropy regularized exploratory “cumulative dividend paid” upto time s, XL is the reserve under vector valued policy L, and Πx is the set of (admissible) vector valued policies. • In the case where the reserve is controlled by the dividend rate and the investment, we get a controlled diffusion term. 
Even with a single control where the optimal feedback policy is of the Gibbs form, when the diffusion is controlled, convergence of the PI Algorithms (PIAs) based on the HJB equation of the value function remains an open problem. As explained in [20], such a result is hard to obtain, since the presence of the second order term of the value function V in the Gibbs form of the optimal feedback policy would make it difficult to find a uniform global bound for the iterating sequence of functions, which is the basis of the PIA convergence results proven in [20]. The following different approaches can be considered to surpass this problem. ⋄ Upon close inspection of the proof of the convergence of the iterating sequence based on HJB equation in [20] it is evident that, due to its reliance on the Arzel´a-Ascoli theorem, it suffices to obtain a uniform bound of the iterating sequence of functions on an arbitrary compact set, and not necessarily on the whole space of R. Considering a PIA based on the HJB equation and obtaining such a uniform bound to prove the desired convergence results 90 is one possible approach. ⋄ Another appraoch is to propose different PIAs instead of a PIA based on HJB Equation, such as a PIA based on the Pontryagin’s optimality principle, or on the dynamic programming approach (as discussed above). For instance, Method of Successive Approximations (MSA) is an iterative method for solving stochastic control problems that involves a controlled diffusion term, derived from Pontryagin’s optimality principle. But this method has the possibility of failing to converge. In [25], the corresponding backward stochastic differential equation (BSDE) is used to suggest a modification to the MSA algorithm which is proven to converge for general stochastic control problems with both controlled drift and controlled diffusion. A generalization of this concept could prove a useful approach to our particular problem. The recent advancements in continuous time optimal control problems analysed using RL techniques developed for discrete models are not only natural extensions of the classical stochastic control methodologies, (Ex: PIA, relaxed control, dynamical programming), but further provide a theoretical basis for widely used RL methods, that lack theoretical basis. (Ex: Continuous time q-learning algorithm gives a theoretical foundation to widely used Boltzmann exploration in discrete time RL; see [24]). Hence, extending the RL techniques and concepts into more generalized contexts of stochastic control theory can potentially contribute to a stronger understanding and applications in both stochastic control theory and reinforcement learning. 91 Bibliography [1] Soren Asmussen and Michael Taksar. “Controlled diffusion models for optimal dividend pay-out”. In: Insurance Math. Econom. 20.1 (1997), pp. 1–15. issn: 0167-6687,1873- 5959. doi: 10.1016/S0167- 6687(96)00017- 0. url: https://doi.org/10.1016/ S0167-6687(96)00017-0. [2] Lihua Bai and Jin Ma. “Optimal investment and dividend strategy under renewal risk model”. In: SIAM J. Control Optim. 59.6 (2021), pp. 4590–4614. issn: 0363-0129,1095- 7138. doi: 10.1137/20M1317724. url: https://doi.org/10.1137/20M1317724. [3] Lihua Bai and Jostein Paulsen. “Optimal dividend policies with transaction costs for a class of diffusion processes”. In: SIAM J. Control Optim. 48.8 (2010), pp. 4987–5008. issn: 0363-0129,1095-7138. doi: 10.1137/090773210. url: https://doi.org/10. 1137/090773210. [4] Lihua Bai et al. 
“Reinforcement Learning for optimal dividend problem under diffusion model”. In: arXiv:2309.10242v1 (2023). [5] Mohamed Belhaj. “Optimal Dividend Payments When Cash Reserves Follow a JumpDiffusion Process”. In: Mathematical Finance 20 (June 2008). doi: 10.1111/j.1467- 9965.2010.00399.x. [6] Jun Cai, Hans U. Gerber, and Hailiang Yang. “Optimal dividends in an OrnsteinUhlenbeck type model with credit and debit interest”. In: N. Am. Actuar. J. 10.2 (2006), pp. 94–119. issn: 1092-0277. doi: 10.1080/10920277.2006.10596250. url: https://doi.org/10.1080/10920277.2006.10596250. [7] Tahir Choulli, Michael Taksar, and Xun Yu Zhou. “A diffusion model for optimal dividend distribution for a company with constraints on risk control”. In: SIAM J. Control Optim. 41.6 (2003), pp. 1946–1979. issn: 0363-0129,1095-7138. doi: 10.1137/ S0363012900382667. url: https://doi.org/10.1137/S0363012900382667. [8] S¨oren Christensen and Claudia Strauch. “Nonparametric learning for impulse control problems—Exploration vs. exploitation”. In: The Annals of Applied Probability 33 (Apr. 2023). doi: 10.1214/22-AAP1849. 92 [9] S¨oren Christensen, Claudia Strauch, and Lukas Trottner. “Learning to reflect: A unifying approach for data-driven stochastic control strategies”. English. In: Bernoulli (2023). issn: 1350-7265. [10] Earl A. Coddington and Norman Levinson. Theory of ordinary differential equations. McGraw-Hill Book Co., Inc., New York-Toronto-London, 1955, pp. xii+429. [11] Michael G. Crandall, Hitoshi Ishii, and Pierre-Louis Lions. “User’s guide to viscosity solutions of second order partial differential equations”. In: Bull. Amer. Math. Soc. (N.S.) 27.1 (1992), pp. 1–67. issn: 0273-0979,1088-9485. doi: 10.1090/S0273-0979- 1992-00266-5. url: https://doi.org/10.1090/S0273-0979-1992-00266-5. [12] Tiziano De Angelis and Erik Ekstr¨om. “The dividend problem with a finite horizon”. In: Ann. Appl. Probab. 27.6 (2017), pp. 3525–3546. issn: 1050-5164,2168-8737. doi: 10.1214/17-AAP1286. url: https://doi.org/10.1214/17-AAP1286. [13] B. De Finetti. “Su un’ impostazione alternativa dell teoria collettiva del risichio”. In: Transactions of the 15th congress of actuaries, New York 2 (1957), pp. 433–443. [14] Fran¸cois Dufresne and Hans U. Gerber. “Risk theory for the compound Poisson process that is perturbed by diffusion”. In: Insurance: Mathematics and Economics 10.1 (1991), pp. 51–59. issn: 0167-6687. doi: https://doi.org/10.1016/0167-6687(91)90023-Q. url: https://www.sciencedirect.com/science/article/pii/016766879190023Q. [15] James Ferguson. A Brief Survey of the History of the Calculus of Variations and its Applications. 2004. arXiv: math/0402357 [math.HO]. [16] W. Fleming and M. Nisio. “On stochastic relaxed control for partially observed diffusions”. In: Nagoya Math. J. 93 (Mar. 1984). doi: 10.1017/S0027763000020742. [17] David Gilbarg and Neil S. Trudinger. Elliptic partial differential equations of second order. Classics in Mathematics. Reprint of the 1998 edition. Springer-Verlag, Berlin, 2001, pp. xiv+517. isbn: 3-540-41160-7. [18] Xin Guo, Renyuan Xu, and Thaleia Zariphopoulou. Entropy Regularization for Mean Field Games with Learning. Sept. 2020. [19] Ruimeng Hu and Mathieu Lauri`ere. Recent Developments in Machine Learning Methods for Stochastic Control and Games. 2024. arXiv: 2303.10257 [math.OC]. [20] Yu-Jui Huang, Zhenhua Wang, and Zhou Zhou. Convergence of Policy Improvement for Entropy-Regularized Stochastic Control Problems. 2023. arXiv: 2209.07059 [math.OC]. 93 [21] Saul D. Jacka and Aleksandar Mijatovi´c. 
“On the policy improvement algorithm in continuous time”. In: Stochastics 89.1 (2017), pp. 348–359. issn: 1744-2508,1744-2516. doi: 10.1080/17442508.2016.1187609. url: https://doi.org/10.1080/17442508. 2016.1187609. [22] Yanwei Jia and Xun Yu Zhou. “Policy evaluation and temporal-difference learning in continuous time and space: a martingale approach”. In: J. Mach. Learn. Res. 23 (2022), Paper No. [154], 55. issn: 1532-4435,1533-7928. doi: 10.1515/bejte-2021-0070. url: https://doi.org/10.1515/bejte-2021-0070. [23] Yanwei Jia and Xun Yu Zhou. “Policy gradient and actor-critic learning in continuous time and space: theory and algorithms”. In: J. Mach. Learn. Res. 23 (2022), Paper No. [275], 50. issn: 1532-4435,1533-7928. [24] Yanwei Jia and Xun Yu Zhou. q-Learning in Continuous Time. 2023. arXiv: 2207. 00713 [cs.LG]. [25] B. Kerimkulov, David Siska, and Lukasz Szpruch. “A Modified MSA for Stochastic Control Problems”. In: Applied Mathematics and Optimization 84 (Dec. 2021). doi: 10.1007/s00245-021-09750-2. [26] B. Kerimkulov, D. Siˇska, and L. Szpruch. “A modified MSA for stochastic control ˇ problems”. In: Appl. Math. Optim. 84.3 (2021), pp. 3417–3436. issn: 0095-4616,1432- 0606. doi: 10 . 1007 / s00245 - 021 - 09750 - 2. url: https : / / doi . org / 10 . 1007 / s00245-021-09750-2. [27] Bekzhan Kerimkulov, David Siˇska, and Lukasz Szpruch. “Exponential convergence ˇ and stability of Howard’s policy improvement algorithm for controlled diffusions”. In: SIAM J. Control Optim. 58.3 (2020), pp. 1314–1340. issn: 0363-0129,1095-7138. doi: 10.1137/19M1236758. url: https://doi.org/10.1137/19M1236758. [28] O. A. Ladyˇzenskaja, V. A. Solonnikov, and N. N. Ural’ceva. Linear and Quasilinear Equations of Parabolic Type. AMS, Providence, RI, 1968, p. 736. [29] Shuanming Li. “The distribution of the dividend payments in the compound poisson risk model perturbed by diffusion”. In: Scandinavian Actuarial Journal 2006.2 (2006), pp. 73–85. doi: 10.1080/03461230600589237. eprint: https://doi.org/10.1080/ 03461230600589237. url: https://doi.org/10.1080/03461230600589237. [30] Xingjian Li, Deepanshu Verma, and Lars Ruthotto. A Neural Network Approach for Stochastic Optimal Control. Sept. 2022. [31] Timothy P. Lillicrap et al. “Continuous control with deep reinforcement learning”. In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, 94 Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2016. url: http://arxiv.org/abs/1509.02971. [32] X.Sheldon Lin and Kristina P. Pavlova. “The compound Poisson risk model with a threshold dividend strategy”. In: Insurance: Mathematics and Economics 38.1 (2006), pp. 57–80. issn: 0167-6687. doi: https : / / doi . org / 10 . 1016 / j . insmatheco . 2005 . 08 . 001. url: https : / / www . sciencedirect . com / science / article / pii / S0167668705001095. [33] Guoxin Liu, Xiaoying Liu, and Zhaoyang Liu. “The policy iteration algorithm for a compound Poisson process applied to optimal dividend strategies under a Cram´er–Lundberg risk model”. In: Journal of Computational and Applied Mathematics 413 (2022), p. 114368. issn: 0377-0427. doi: https : / / doi . org / 10 . 1016 / j . cam.2022.114368. url: https://www.sciencedirect.com/science/article/pii/ S0377042722001649. [34] Guoxin Liu, Xiaoying Liu, and Zhaoyang Liu. “The policy iteration algorithm for a compound Poisson process applied to optimal dividend strategies under a Cram´er–Lundberg risk model”. 
In: Journal of Computational and Applied Mathematics 413 (2022), p. 114368. issn: 0377-0427. doi: https : / / doi . org / 10 . 1016 / j . cam.2022.114368. url: https://www.sciencedirect.com/science/article/pii/ S0377042722001649. [35] Yuping Liu and Jin Ma. “Optimal reinsurance/investment problems for general insurance models”. In: arXiv.org, Quantitative Finance Papers 19 (Aug. 2009). doi: 10.1214/08-AAP582. [36] Marcus Pereira et al. Neural Network Architectures for Stochastic Control using the Nonlinear Feynman-Kac Lemma. Feb. 2019. [37] M. L. Puterman. “On the convergence of policy iteration for controlled diffusions”. In: J. Optim. Theory Appl. 33.1 (1981), pp. 137–144. issn: 0022-3239,1573-2878. doi: 10.1007/BF00935182. url: https://doi.org/10.1007/BF00935182. [38] Maziar Raissi. “Forward-Backward Stochastic Neural Networks: Deep Learning of High-dimensional Partial Differential Equations”. In: (Apr. 2018). [39] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: Ann. Math. Statistics 22 (1951), pp. 400–407. issn: 0003-4851. doi: 10 . 1214 / aoms / 1177729586. url: https://doi.org/10.1214/aoms/1177729586. [40] Bahlali Se¨ıd, Brahim Mezerdi, and Boualem Djehiche. “Approximation and optimality necessary conditions in relaxed stochastic control problems”. In: Journal of Applied Mathematics and Stochastic Analysis 2006 (June 2006). doi: 10.1155/JAMSA/2006/ 72762. 95 [41] S. E. Shreve, J. P. Lehoczky, and D. P. Gaver. “Optimal consumption for general diffusions with absorbing and reflecting barriers”. In: SIAM J. Control Optim. 22.1 (1984), pp. 55–75. issn: 0363-0129. doi: 10.1137/0322005. url: https://doi.org/ 10.1137/0322005. [42] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second. The MIT Press, 2018. url: http://incompleteideas.net/book/the-book2nd.html. [43] Lukasz Szpruch, Tanut Treetanthiploet, and Yufei Zhang. Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning. 2023. arXiv: 2208.04466 [cs.LG]. [44] Wenpin Tang, Yuming Zhang, and Xun Zhou. “Exploratory HJB Equations and Their Convergence”. In: SIAM Journal on Control and Optimization 60 (Nov. 2022), pp. 3191–3216. doi: 10.1137/21M1448185. [45] Stefan Thonhauser and Hansjoerg Albrecher. “Dividend maximization under consideration of the time value of ruin”. In: Insurance: Mathematics and Economics 41 (July 2007), pp. 163–184. doi: 10.1016/j.insmatheco.2006.10.013. [46] Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. “Reinforcement learning in continuous time and space: a stochastic control approach”. In: J. Mach. Learn. Res. 21 (2020), Paper No. 198, 34. issn: 1532-4435,1533-7928. [47] Haoran Wang and Xun Yu Zhou. “Continuous-time mean-variance portfolio selection: a reinforcement learning framework”. In: Math. Finance 30.4 (2020), pp. 1273–1308. issn: 0960-1627,1467-9965. doi: 10.1111/mafi.12281. url: https://doi.org/10. 1111/mafi.12281. [48] Jiongmin Yong and Xun Yu Zhou. Stochastic controls. Vol. 43. Applications of Mathematics (New York). Hamiltonian systems and HJB equations. Springer-Verlag, New York, 1999, pp. xxii+438. isbn: 0-387-98723-1. doi: 10.1007/978-1-4612-1466-3. url: https://doi.org/10.1007/978-1-4612-1466-3. [49] Kam Yuen, Yuhua Lu, and Rong Wu. “The compound Poisson process perturbed by a diffusion with a threshold dividend strategy”. In: Applied Stochastic Models in Business and Industry 25 (Jan. 2009), pp. 73–93. doi: 10.1002/asmb.734. [50] Xun Zhou. 
“On the Existence of Optimal Relaxed Controls of Stochastic Partial Differential Equations”. In: Siam Journal on Control and Optimization - SIAM J CONTR OPTIMIZAT 30 (Mar. 1992). doi: 10.1137/0330016. [51] Jinxia Zhu and Hailiang Yang. “Optimal financing and dividend distribution in a general diffusion model with regime switching”. In: Adv. in Appl. Probab. 48.2 (2016), 96 pp. 406–422. issn: 0001-8678,1475-6064. url: http://projecteuclid.org/euclid. aap/1465490755. 97 Appendix: RL Concepts in Discrete Time Appendix A: RL in Discrete Time and Space As we discussed in the introduction, in the Reinforcement Learning literature, the learning and decision making entity is called the agent, and entities separate from the agent is known as the environment. An environment has states it can take and the set of possible states is denoted by S and the state of the environment at time t is denoted by st . We consider a sequence of discrete times (t = 0, 1, 2, ..). At each time step t the agent perceives the state st and executes an action at ∈ A(st) (set of possible actions) according to the Policy function, π. Policy function can be either deterministic, ( π is a mapping from state space to action space and π(s) is the action taken at state s) or Stochastic( given the state st , the agent chooses an action at from the set of possible actions, A(st), and πt(s, a) is the probability of choosing action a in state s at time t ). Then the environments state changes to st+1 and the agent gets a reward rt+1. The agents goal is set to maximize the expected cumulative discounted return,where cumulative discounted return is given by, Rt = Σ∞ k=0γ k rt+k+1 . 98 Here γ is a discounting factor and γ < 1. Appendix A1: Finite Markov Decision Process In a reinforcement Learning problem we consider a 4-tuple (S, A, Pa , Ra ), where S is the set of states the environment can take, A is the set of possible actions Pa ss′ := P{st+1 = s ′ |st = s, at = a} and Ra ss′ = E{rt+1|st = s, at = a, st+1 = s ′}. A Markov Decision processes (MDP) is a such a tuple that satisfies the Markov property. Given that st , at , rt are the State, action and reward at time t respectively, the task is said to satisfy the Markov property if, P(st+1 = s, rt+1 = r|st , rt , at , st−1, rt−1, at−1..., s0, a0) = P(st+1 = s, rt+1 = r|st , rt , at). A Markov Decision Process with a finite state and action space is called a finite Markov Decision Process. RL problem in discrete time is commonly associated with a finite MDP, and in what follows we consider the 4-tuple (S, A, Pa , Ra ) as a finite MDP. Appendix A2: Value Functions of a Finite MDP A value function is an indicator of how desirable the agents actions are, given a particular state.There are two different versions of value functions. 1. The state-value function of state s under each stochastic policy π is denoted by V π (s) and defined as follows. V π (s) = Eπ{Rt |st = s} = Eπ{Σ ∞ k=0γ k rt+k+1|st = s} 2. The action-value function of state s , under action a and stochastic policy π is denoted by Qπ (s, a) and defined as follows. Qπ (s, a) = Eπ{Rt |st = s, at = a} = Eπ{Σ ∞ k=0γ k rt+k+1|st = s, at = a}. This is the expected value of return when action a is taken at state s and following policy π thereafter. The goal of the RL problem is to maximize the expected cumulative discounted return, i.e, 99 the state value function. Appendix A3: Q-Learning The goal of the RL problem is to maximize the state-value function. We will analyze the optimal value functions in this section. 
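To make the value-function definitions of Appendix A2 concrete before turning to optimality, here is a small self-contained Python sketch. The three-state, two-action MDP and all numbers are our own toy example (not taken from the text); it computes V^π by solving the linear system V^π = R_π + γ P_π V^π and then Q^π from it.

import numpy as np

gamma = 0.9
# P[a][s, s'] = transition probability under action a; R[a][s] = E[r_{t+1} | s_t = s, a_t = a].
P = {0: np.array([[0.8, 0.2, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]),
     1: np.array([[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]])}
R = {0: np.array([1.0, 0.0, 2.0]),
     1: np.array([0.5, 1.5, 0.0])}

# A stochastic policy pi(s, a): rows are states, columns are action probabilities.
pi = np.array([[0.5, 0.5], [0.1, 0.9], [1.0, 0.0]])

# Policy-averaged dynamics and one-step rewards.
P_pi = sum(pi[:, [a]] * P[a] for a in (0, 1))
R_pi = sum(pi[:, a] * R[a] for a in (0, 1))

# V^pi(s) = E_pi[ sum_k gamma^k r_{t+k+1} | s_t = s ] solves (I - gamma P_pi) V = R_pi.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Q^pi(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) V^pi(s').
Q = np.stack([R[a] + gamma * P[a] @ V for a in (0, 1)], axis=1)
print(V)
print(Q)

For a finite MDP of this size the linear system can be solved exactly, which is why acting greedily with respect to Q^π is a convenient way to see policy improvement at work in the discrete setting discussed in this appendix.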
From this section on-wards, we will only consider deterministic policy functions. Maximum state-value function is denoted by V ∗ and defined as V ∗ (s) = M ax π V π (s) for all s ∈ S. Optimal policy, denoted by π ∗ , is defined as the policy under which the maximum statevalue function is obtained. There can be multiple optimal policies that results in the same maximum state-value function. The action-value function under π ∗ is denoted by Q∗ . Dynamic programming principle for Q∗ is given by, Q∗(s, a) = n R(s, a) + γΣs ′P(s ′ , r|a, s) max a ′ Q∗(s ′ , a′ ) o (4.1) This is the basis for ’Q-learning’ methods of Policy Improvement often used in discrete RL problems. One of the main properties of the dynamic program principle for ’Q-function” is that it gives rise to the ’Q-learning’ method, (an off policy learning method) as well as the SARSA method (an on policy learning method). The Q-function, by definition, is a function of the current state and action, assuming that the agent takes a particular action at the current time and follows through a given control policy starting from the next time step. Therefore, until recently, it has been theorised that the notion of Q− functions cannot be extended to the continuous time. Q− functions has been developed for continuous-time problems by discretizing the problem and analyzing the resulting discrete time problem. Recently, [24] introduced a extension of the Q− learning concept to the continuous time. They introduce the “q-function” and propose data driven 100 model free Policy Evaluation and Policy Improvement algorithms based on the “q-function”. Appendix B: Introduction to Temporal Difference Methods Temporal difference(TD) methods are a way of solving prediction problems, which are defined as “the problem of using past experience with an incompletely known system to predict its future behavior” in Sutton(1988) [42]. Appendix B1: Prediction Problems TD methods are introduced as an improvement on supervised learning approaches for prediction problems. The approach of supervised learning is asking the learner to associate pairs of items and develop the ability to correctly match inputs to the outputs. In a prediction problem approached via supervised learning methods, first item, (the input) will be the data based on which the prediction is to be made, and the second item is the actual outcome. In conventional methods for prediction problems the ’correctness’ of the prediction is assessed using the difference between predicted and actual outcomes, i.e. the difference between a function of first item, and the second item in supervised learning. TD methods use the difference between temporally successive predictions to asses the ’correctness’ of a prediction. Sutton(1988) [42], classify the prediction problems as single-step and multi-step. In a singlestep prediction problem all information about the ’correctness’ of the prediction is revealed at once while in a multi-step prediction problem, this information is only partially revealed at each stage. Thus, in a single step problem data is naturally labelled as observation-outcome pairs and hence the methods of TD and conventional methods will be one and the same. Thus TD methods can only be recognized as a different and better method only in multi-step prediction problems. 101 Appendix B2: Problem Formulation: Finite Time Horizon We formulate the multi-step prediction problem with following notation. The experience, that we use to train the learner is given as observation-outcome sequences. 
An observationoutcome sequence is of the format x1, x2, x3, ..., xm, z where each xt is the vector of observations at time t. z is the observed outcome of that sequence. components of each xt ∈ R and z is a real valued scalar. For each observation-outcome sequence, the learner will create a sequence of predictions of z, denoted by P1, P2, ..., Pm. Each prediction is based on a vector of weights w. Generally Pt can be a functions of x1, x2, ...xt , but in this setting for simplicity we assume Pt = P(xt , w). In a supervised learning setting the observation-outcome sequence is viewed as a sequence of pairs as (x1, z),(x2, z), ...,(xm, z) and is analyzed as such. The learning is happening in terms of updating weight w. Assuming the w is changed only after the completion of an observation-outcome sequence,(also known as offline learning) the modification of w is given as w = w + Xm t=1 ∆wt where ∆wt = α(z − Pt)∇wPt (4.2) Here α > 0 is the learning rate, (z − Pt) assess the correctness of prediction as discussed above and ∇wPt determines the effect of change of w on Pt . w will be updated using observation-outcome sequences in this manner, till we observe a convergence pattern in w or the observation-outcome sequences are exhausted, and use that w to calculate the prediction Pt(w).This is the supervised learning approach. There are more variations of this approach where w is updated after each observation in the sequence using ∆wt , (known as online learning) , or over a training set of several sequences. Equation (4.2) is heavily inspired by the gradient descent method, where the goal is to minimize an error function Rt(w) over 102 the set of weights by increasing the weight vector in the direction of steepest descent of Rt(w) with respect to w, or the direction of ∆wt = −α∇wRt(w) where α > 0 is the learning rate. In the above analysis the error function Rt(w) = EX E[z|xt ] − P(xt , w) 2 where EX denotes the expectation over the observation vectors X.) Since we only observe one observation vector at a time, we can define a ’per-observation’ error function and the update rule now reads ∆wt == α(E[z|xt ] − P(xt , w))∇wP(xt , w). This equation corresponds to the equation (4.2), when we use the actual outcome z to approximate E[z|xt ]. Appendix B3: TD Methods in Finite Horizon In this section we describe the TD method (batch learning approach) following the notations and methodology in [42]. Observing that z − Pt = Pm i=t (Pi+1 − Pi) where Pm+1 := z using the telescoping sum, ∆wt = α(z − Pt)∇wPt = α Pm i=t (Pi+1 − Pi) ∇wPt . Thus, w = w + Xm t=1 ∆wt = w + Xm t=1 α Xm i=t (Pi+1 − Pi) ∇wPt = w + Xm i=1 α(Pi+1 − Pi) X i t=1 ∇wPt = w + Xm t=1 α(Pt+1 − Pt) X t i=1 ∇wPi Hence ∆wt can be redefined by ∆wt = α(Pt+1−Pt) Pt i=1 ∇wPi , and this equation describes the TD method. It is named as the Temporal Difference (TD) method, based on the fact that computing ∆wt in this format, uses the temporal difference Pt+1 − Pt rather than the difference z − Pt .The T D(1) learning procedure is described as w = w + Xm t=1 ∆wt . TD method have the following computational advantages over the supervised learning approach. 1. Since ∆wt is computed incrementally. there is no need to store individual values of 103 Pt at each time stage t till the outcome z is observed. At time t, we need only store Pt , ∆wt , Pt i=1 ∇wPt . The storage units for Pt , Pt i=1 ∇wPi can be reused at each time step since,Pt , Pt i=1 ∇wPi is replaced by Pt+1, Pt+1 i=1 ∇wPi at each time step. This saves memory over the supervised learning method. 
2. Though there are slightly more arithmetic operations in TD method since Pt i=1 ∇wPi + ∇wPt+1 should be computed at time t + 1, these operations are evenly spaced in time. And since ∆wt is computed incrementally in time in TD, while in conventional method all the ∆wi must be computed only after observing z,hence at once. Thus, the arithmetic operations of the conventional method occur all at once, and hence need more processing power. The T D(λ) family of learning procedures is decribed as follows. The concept is to add more value to recent predictions by using a factor 0 ≤ λ ≤ 1. In T D(λ) learning procedure weight changes in time t are defined as follows. ∆wt = α(Pt+1 − Pt) X t i=1 λ t−i∇wPi (4.3) Though the process of adding more value to recent predictions can be achieved by other forms of discounting, the use of an exponential discounting factor gives us the advantage of computing ∆wt s incrementally. If we define et = Pt i=1 λ t−i∇wPi , et+1 = ∇wPt+1 + λet . The T D(λ) learning procedure is described as w = w + Xm t=1 ∆wt where ∆wt is described by equation (4.3). Appendix B4: TD Methods in Infinite Horizon Since the T D(λ) family of learning procedures is developed for prediction problems with finite time horizon, Sutton(1988) [42] analyze a particular infinite horizon problem and proposes 104 a TD method for that problem. If a process generates costs ct+1 at each transition from t to t + 1, for t ∈ N, the discounted sum at time t is given by zt = P∞ i=0 γ i ct+1+i = ct+1 + γzt+1 where 0 < γ < 1 is the discounting factor. If Pt is used to predict zt at each t, the prediction is correct if zt − Pt = ct+1 + γzt+1 − Pt = ct+1 + γPt+1 − Pt = 0. The issue of using (4.3) in infinite horizon setting is that Temporal difference term Pt+1 − Pt is significant there because at the end they are a measure of Zt − Pt . Hence there is a need for a different temporal difference term in the infinite horizon setting that measure the difference Zt − Pt . As described above, ct+1 + γPt+1 − Pt is the natural candidate for temporal difference in the infinite horizon case. With this insight and the format of (4.3) as the basis. TD learning procedure for infinite horizon is defined as follows in Sutton(1988) [42]. ∆wt = α(ct+1 + γPt+1 − Pt) X t i=1 λ t−i∇wPi (4.4) Remark: In this case online method of updating w should be used since infinite number of ∆wt cannot be computed in most cases, unless in specific cases where ∆wt follows a pattern and P∞ i=1 Wt exists finitely . That is at each time step t, the updating w = w + ∆wt will occur. 105
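As a concluding illustration of the infinite-horizon rule (4.4), the following sketch implements online TD(λ) with a linear predictor P(x, w) = w·x, so that ∇_w P_t = x_t and the eligibility trace obeys e_{t+1} = ∇_w P_{t+1} + λ e_t as noted above. The three-state cost process and every numerical value are our own toy example, used only to show the mechanics.

import numpy as np

rng = np.random.default_rng(2)

# Toy 3-state Markov chain with one-hot features and per-step costs (our own example).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])
costs = np.array([1.0, 0.0, 2.0])

gamma, lam, alpha = 0.9, 0.5, 0.05   # discount, trace decay lambda, learning rate
w = np.zeros(3)                      # linear predictor P(x, w) = w . x, so grad_w P = x

state = 0
x = np.eye(3)[state]
e = x.copy()                         # eligibility trace e_t = sum_i lam^{t-i} grad_w P_i

for t in range(100_000):
    state = rng.choice(3, p=P[state])
    x_next = np.eye(3)[state]
    c_next = costs[state]                           # cost c_{t+1} incurred on the transition
    delta = c_next + gamma * w @ x_next - w @ x     # temporal-difference term in (4.4)
    w += alpha * delta * e                          # online update  w <- w + Delta w_t
    e = x_next + lam * e                            # incremental trace update
    x = x_next

# w now approximates the expected discounted future cost z_t seen from each state.
print(w)

Because the update is applied online at every step, there is no need to wait for a terminal outcome z, which is exactly the point of the temporal-difference formulation in this infinite-horizon setting.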
Abstract
We study the optimal dividend problem with the dividend rate restricted to a given interval, first under the continuous-time diffusion model and then under the well-known “Cramer-Lundberg” model. Unlike the standard literature, we are particularly interested in the case when the parameters (e.g., the drift and diffusion coefficients) of the model are not specified, so that the optimal control cannot be determined explicitly. To approximate the optimal strategy, we use methods from the Reinforcement Learning (RL) literature; specifically, we solve the corresponding RL-type entropy-regularized exploratory control problem, which randomizes the control actions and balances the levels of exploitation and exploration. We first carry out a theoretical analysis of the entropy-regularized exploratory control problem, focusing in particular on the corresponding HJB equation. We then use a policy improvement argument, along with policy evaluation devices, to construct approximating sequences of the optimal strategy. These algorithms are essentially “on-policy” algorithms, which have certain drawbacks in practical applications in some contexts; hence we also employ an “off-policy” algorithm, namely the “q-learning” algorithm, to approximate the optimal strategy. We present numerical results using different parametrization families for the cost functional, to illustrate the effectiveness of the approximation schemes and to discuss possible ways to improve the effectiveness of the Policy Evaluation methods.
Asset Metadata
Creator
Gamage, Thejani Chamodika Malshani
(author)
Core Title
Reinforcement learning for the optimal dividend problem
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Applied Mathematics
Degree Conferral Date
2024-08
Publication Date
07/12/2024
Defense Date
03/18/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
continuous time diffusion model, Cramer-Lundberg model, entropy-regularized exploratory control problem, exploitation and exploration, HJB equation, Martingale Loss (ML), OAI-PMH Harvest, optimal dividend problem, Policy Evaluation (PE), Policy Improvement (PI), Q-learning, Reinforcement Learning (RL), Temporal Difference (TD) algorithm
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Ma, Jin (committee chair), Xu, Renyuan (committee member), Zhang, Jianfeng (committee member)
Creator Email
gamage@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113997L9N
Unique identifier
UC113997L9N
Identifier
etd-GamageThej-13192.pdf (filename)
Legacy Identifier
etd-GamageThej-13192
Document Type
Dissertation
Format
theses (aat)
Rights
Gamage, Thejani Chamodika Malshani
Internet Media Type
application/pdf
Type
texts
Source
20240712-usctheses-batch-1179 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu