Modern Nonconvex Optimization: Parametric Extensions and Stochastic Programming

by

Ziyu He

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(INDUSTRIAL AND SYSTEMS ENGINEERING)

August 2023

Copyright 2023 Ziyu He

To all those who have shown me the true meaning of love.

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Jong-Shi Pang, for being as exceptional a role model as one can ask for. Throughout my doctoral studies, you not only equipped me with all the knowledge about research, but also taught me the qualities required of a true academician: dedication, humility, rigor and an unwavering curiosity. Having the privilege to work alongside one of the greatest minds in my field at such an early stage of my career is beyond my imagination, and this experience will forever hold a special place in my heart.

Furthermore, I am profoundly thankful to my collaborators, Dr. Ying Cui, Dr. Andrés Gómez, Dr. Shaoning Han and Dr. Junyi Liu. The experience we shared while working together, the accomplishments we have achieved as a team, and the invaluable knowledge I have gained from each of you will be treasured forever.

Additionally, I would also like to express my sincere appreciation to several professors in my department, including Dr. Vishal Gupta, Dr. Meisam Razaviyayn, Dr. Suvrajeet Sen, Dr. Maged Dessouky, and many others. Our meaningful discussions and your valuable insights have played an instrumental role in my growth and development as a researcher, and I will always be grateful for your guidance and contributions.

Moreover, I would like to acknowledge the wonderful staff in my department, including Shelly Lewis, Grace Owh and many others. I would not have completed this phase of my journey without your assistance and support.

I would like to extend my heartfelt gratitude to the University of Southern California for giving me the opportunity to experience the beautiful city of Los Angeles and to meet my amazing friends, Bo Jones, Sina Baharlouei, Yunan Zhou, Mingdong Lyu, Han Yu, Yue Yu, Qing Jin, Peng Dai, Yinxiao Ye, Suyanpeng Zhang, Caroline Johnston, Xinyao Zhang, Jing Jin, Ke Shen, Yue Fang, Mingxi Wang, Michael Huang, Yang Liu and many others. I will always cherish the memory and the bond we share. Finally, Xinyue, thank you for walking into my life and being such a special one for me.

Last but not least, I would like to dedicate this dissertation to my parents, Dr. Jinliang He and Youping Tu. Your endless love, guidance and support have been the bedrock of every stage of my journey, and I am so proud to have such remarkable parents as you.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 A source application: Markov Random Field
    1.2 Sparsity learning
    1.3 Hyperparameter selection and parametric extension
    1.4 Intractable normalizers and stochastic programming
Chapter 2: Linear-Step Solvability of Sparse Optimization with Z-Property
    2.1 Preliminaries
        2.1.1 Folded concave family
        2.1.2 Z- and M-functions
        2.1.3 Stationarity
    2.2 The GHP Algorithm
    2.3 Algorithm Analysis
        2.3.1 On linear-step solvability and strong polynomial complexity
    2.4 Numerical Experiments
        2.4.1 $\ell_1$-penalized problems
        2.4.2 Capped $\ell_1$-penalized problems
        2.4.3 Verifying strong polynomial complexity
Chapter 3: Parametric Extensions with Capped $\ell_1$-Regularizers
    3.1 Summary of Previous Studies
    3.2 Analytical Studies
        3.2.1 Piecewise structures
        3.2.2 On the number of pieces
    3.3 Computing D-stationary Paths for Capped $\ell_1$
        3.3.1 Alternative GHP algorithms
        3.3.2 Basis and degeneracy
        3.3.3 The pivoting scheme
        3.3.4 Analysis of the pivoting scheme
        3.3.5 Advantage of the specialized GHP initializations
        3.3.6 Absorbing basis
    3.4 Numerical Experiments
        3.4.1 The $\ell_1$- and capped $\ell_1$-paths on synthetic problems
        3.4.2 Results on the GMRF problem
Chapter 4: Adaptive Importance Sampling for Logarithmic Integral Optimization
    4.1 The 3-Layer Bayesian Hierarchical Model
        4.1.1 Exponential families
    4.2 Discrete Priors
    4.3 The Logarithmic Integral Optimization Problem
        4.3.1 Surrogation and stationarity
        4.3.2 AIS-based surrogation method
    4.4 Analysis of the AIS-based Surrogation Algorithm
        4.4.1 A Digression: ULLN for triangular arrays
        4.4.2 Convergence of deterministic surrogation algorithm
        4.4.3 Sufficient descent with stochastic error
        4.4.4 Asymptotic fixed-point stationarity
        4.4.5 The main theorem with discussion
    4.5 Numerical Experiments
        4.5.1 Comparisons with SMM and SAA
        4.5.2 Applications in BHM inference
    4.6 Proofs and Supplement Materials
        4.6.1 Proofs of some intermediate results
            4.6.1.1 Proof of Lemma 62
            4.6.1.2 Proof of Proposition 64
            4.6.1.3 Proof of Lemma 66
        4.6.2 On an alternative AIS scheme with better sample complexity
        4.6.3 Some details for numerical experiments
            4.6.3.1 Subproblems for the experiments in Section 4.5.1
            4.6.3.2 Piecewise affine approximation of indicator function
Chapter 5: Conclusion and future work
References

List of Tables

2.1 Computational results for the $\ell_1$-problems.
2.2 Computational results for the capped $\ell_1$-problems.
2.3 Computational results for the capped $\ell_1$-problems.
3.1 Summary of piece number complexities.
3.2 Summary of tests on parametric $\ell_1$ and capped $\ell_1$-solution.
3.3 Summary of computational times (in seconds).
3.4 Some key statistics for the GMRF experiments.
4.1 Experiment configurations.
4.2 Comparison of AIS, SMM and SAA.
4.2 Comparison of AIS, SMM and SAA (continued).
4.3 Stability: AIS vs SMM.

List of Figures

1.1 Image recovery.
1.2 Disease mapping.
1.3 Signal recovery.
1.4 Signal recovery vs. support recovery.
1.5 Comparison of $\ell_0$, $\ell_1$ and capped $\ell_1$.
1.6 MRF graph topology can be difficult to determine.
1.7 Naïve estimation of $\mathcal{E}$ and $\{\beta^\star_{ij}\}_{(i,j)\in\mathcal{E}}$ leads to undesirable recovery.
2.1 The $\ell_0$-function and some folded concave functions.
2.2 The original folded concave function.
2.3 The modified folded concave function.
2.4 Computing time vs. size $n$ for capped $\ell_1$-problems.
3.1 Quadratic loss $L(x)$ as a function of the sparsity $\|x\|_0$ (unconstrained cases).
3.2 Quadratic loss $L(x)$ as a function of the sparsity $\|x\|_0$ (constrained cases).
3.3 Quadratic loss $L(x)$ as a function of the sparsity $\|x\|_0$.
3.4 Signal recovery as a function of support recovery.
3.5 Modified GHP vs. naïve GHP.
4.1 Effectiveness of sampling from $\pi_{\rm IS}^{H(\bar x,\cdot)}$ with a fixed $\bar x \in X$.
4.2 AIS vs. SMM.
4.3 Comparison of MRF models for signal recovery ($\sigma^2 = 0.25$).
4.4 Comparison of MRF models for signal recovery ($\sigma^2 = 0.09$).
4.5 Comparison of MRF models for image recovery (case 1).
4.6 Comparison of MRF models for image recovery (case 2).
4.7 Piecewise affine $\varphi_{ij}$ approximation of the indicator $F^{-1}_{\widetilde\beta_{ij}}$.

Abstract

Nonconvex optimization plays a crucial role in many modern applications in operations research and data science, allowing for more faithful modeling of the complicated real-world environment. In this dissertation, we address several under-resolved topics in nonconvex optimization that commonly arise in various domains, driving the need for advancements in both theoretical understanding and computational tools.

In Chapter 2, we provide an affirmative answer to a question that has been overlooked in the existing literature: can a certain class of sparsity learning problems be solved with linear-step complexity, or even in strongly polynomial time? We identify such a class of problems, characterized by a loss function with the Z-property and a folded concave regularizer. To tackle the challenges associated with the inherent nonconvexity and nonsmoothness exhibited by this class, we propose the GHP algorithm, which effectively exploits the special structure of the problem and achieves the desired complexity results through rigorous analytical investigations. The effectiveness and efficiency of the algorithm are further demonstrated through numerical experiments.

Our study in sparsity learning extends to the parametric setting, which aims at conducting hyperparameter selection by computing the solution paths as functions of the sparsity-inducing parameter. As discussed in Chapter 3, through a comprehensive analysis and computational investigations, we substantiate the intuition that a nonconvex parametric approach, specifically computing the capped $\ell_1$ d-stationary paths, offers a resolution to the trade-off between computational efficiency and practical performance exhibited by the dilemma in choosing between the $\ell_0$- and $\ell_1$-regularizer. In particular, we examine and enhance the known properties of the optimal solution paths computed from the $\ell_0$- and $\ell_1$-regularizers, supplementing them with new analytical results for the previously unexplored capped $\ell_1$-paths. The computation of a capped $\ell_1$ d-stationary path is enabled by our proposal of the pivoting scheme accompanied by the modified GHP algorithms. The primary challenge in this computation lies in dealing with potential discontinuities in the path. The modified GHP algorithms, leveraging the Z-property, play a crucial role in rapidly restoring the path when such discontinuities occur. This aspect, which has not been extensively addressed in the classical parametric programming literature, is one of the contributions of our approach.

One of the applications of our studies in Chapters 2 and 3 is the inference of a Markov random field with a known network topology. However, a practical issue arises when the network topology is unknown and cannot be reliably estimated in advance.
This situation prompts us to consider a broader extension of our studies, encompassing the challenges encountered in Bayesian hierarchical models (BHMs) with intractable normalizers. In Chapter 4, we address these concerns by focusing on a class of logarithmic integral (stochastic) optimization problems, where nonconvexity and nonsmoothness are coupled within an intractable integral. Through the proposal of the adaptive importance sampling-based surrogation method, we provide an efficient tool to simultaneously handle nonconvexity and nonsmoothness, while also improving the sampling approximation of the intractable integral via variance reduction. Through rigorous analysis, we guarantee the performance of this algorithm, showcasing almost sure subsequential convergence to a surrogation stationary point, a necessary candidate for a local minimizer. Furthermore, extensive numerical experiments confirm the effectiveness of the algorithm, demonstrating its efficiency and stability in facilitating the application of advanced BHMs, where intractable normalizers often arise as a result of enhanced modeling capability.

Chapter 1

Introduction

For decades, mathematical optimization has played a vital role in resolving fundamental problems across various scientific and engineering fields, such as machine learning, data science, operations research, decision science and econometrics, encompassing a range of applications that can be captured by the following generic optimization problem
\[
\operatorname*{minimize}_{x \in X} \;\; L(x) + R(x) \tag{1.1}
\]
where $X \subseteq \mathbb{R}^n$ and $L : \mathbb{R}^n \to \mathbb{R}$ is convex or smooth (or both), which are the pillar properties in the classic optimization literature, while $R : \mathbb{R}^n \to \mathbb{R}$ exhibits nonconvexity or nonsmoothness (or both), which typically arises when we respect fidelity in modeling the complicated real-world environment. For instance, the objective function of training a neural network is typically nonconvex and nonsmooth, and later we will encounter additional applications where these two "non-" properties are coupled together. Although it brings us closer to a more accurate representation, the function $R$ in (1.1) exposes new technical challenges beyond state-of-the-art techniques for convex and smooth optimization. Such obstacles cannot be straightforwardly overcome by naive simplification of the model or by deploying heuristic methods; such compromises, though easy to implement and compute, can result in undesirable solutions that significantly deviate from reality or lack provable quality.

Considerable analytical and computational progress has been made in the optimization community to tackle the challenges posed by nonconvexity and nonsmoothness. For a comprehensive overview of these advancements, we refer to [29]. In this dissertation, we explore the following two additional extensions pertaining to the function $R$:

• Parametrization: when $R$ is characterized by the structure $R(x) = \gamma\,\Phi(x)$, where $\Phi : \mathbb{R}^n \to \mathbb{R}$ is possibly nonconvex and nonsmooth, and $\gamma \ge 0$ is a so-called hyperparameter whose value we are interested in determining.

• Stochasticity: when $R$ comes from an expectation $\mathbb{E}\,F(x,\zeta)$, where $\zeta$ is a random variable, or simply a (Lebesgue) integral $\int F(x,z)\,dz$ for a given function $F : \mathbb{R}^n \times \mathbb{R}^d \to \mathbb{R}$.

These two extensions, when combined with nonconvexity and nonsmoothness, are prevalent in contemporary applications, as we will soon delve into in greater detail. However, they pose under-resolved challenges, motivating us to investigate them thoroughly in this dissertation. More specifically, our research focuses on addressing the following questions:
1. Given the presence of nonconvexity and nonsmoothness, a global optimal solution of (1.1) may be unattainable. How, then, do we properly define weaker notions of solutions that are computable in practical terms?

2. How can we provably compute such solutions? Can we design more efficient algorithms by leveraging special structures that are inherent in the function $L$?

3. Can more advanced nonconvex and nonsmooth modeling strike a favorable balance between practical performance and computational efficiency, at least from an empirical standpoint?

Aligned with these overarching goals, our study in this dissertation aims at enriching the existing literature on nonconvex optimization and advancing computational techniques in the various application domains that have inspired our work. Detailed presentations of these applications will be provided in the following sections, along with a highlight of the critical technical difficulties and a glimpse of our mathematical treatments.

1.1 A source application: Markov Random Field

Throughout the upcoming sections, we will utilize the graphical model of Markov Random Fields (MRFs) [10, 11] as a running example to illustrate how the two key extensions we focus on in this dissertation, namely parametrization and stochasticity, naturally occur in real-world applications. Specifically, consider an MRF defined on an undirected graph $(V, \mathcal{E})$ with node set $V$, where each node $i \in V$ carries a $d$-dimensional random vector $\widetilde\theta_i$ whose components represent the features of an object of interest (e.g., pixels of an image) at this location. Conventionally, with a given edge set $\mathcal{E} \subseteq V \times V$ and fixed nonnegative weights $\{\beta^\star_{ij}\}_{(i,j)\in\mathcal{E}}$, an MRF represents our belief that features on pairs of nodes incident to a common edge should be similar. Such beliefs are modeled through the following probability density function $q$ of $\widetilde\theta \triangleq \{\widetilde\theta_i\}_{i\in V}$, whose support is $\Theta \subseteq \mathbb{R}^d$ and whose realization is denoted by $\theta$:
\[
q(\theta) \;\triangleq\; \frac{1}{Z}\,\exp\big(h(\theta)\big), \quad\text{where}\quad h(\theta) \;\triangleq\; -\sum_{(i,j)\in\mathcal{E}} \beta^\star_{ij}\, h_{ij}(\theta_i,\theta_j) \quad\text{and}\quad Z \;\triangleq\; \int_{\Theta} \exp\big(h(\theta')\big)\, d\theta'. \tag{1.2}
\]
Here $h_{ij}(\theta_i,\theta_j)$ is a metric measuring the "similarity" between $\theta_i$ and $\theta_j$, e.g., the Euclidean norm $h_{ij}(\theta_i,\theta_j) = \|\theta_i - \theta_j\|_2$. The strength of such similarity is encoded in $\beta^\star_{ij}$. Consequently, through (1.2), the probability distribution of $\widetilde\theta$ is more concentrated if $\widetilde\theta_i$ and $\widetilde\theta_j$ take similar values when $(i,j) \in \mathcal{E}$. Suppose each node $i \in V$ is additionally associated with an $m$-dimensional random vector $\widetilde y_i$ whose realization $y_i$ is observed, and $\{\widetilde y_i\}_{i\in V}$ are conditionally independent (shorthanded as "ind.") given $\widetilde\theta$, with conditional density function denoted by $p_{y_i}(\bullet \mid \widetilde\theta)$. Based on (1.2), we can define the following benchmark Bayesian hierarchical model (BHM), which intends to recover the underlying $\widetilde\theta$ from the observation $y \triangleq \{y_i\}_{i\in V}$:
\[
\begin{array}{ll}
\text{level 1:} & \widetilde y_i \mid \widetilde\theta \;\overset{\text{ind.}}{\sim}\; p_{y_i}(y_i \mid \widetilde\theta), \quad \text{for } i \in V \\[4pt]
\text{level 0:} & \widetilde\theta \;\sim\; q(\theta), \quad \text{as defined in (1.2)}
\end{array} \tag{1.3}
\]
This model finds extensive applications in diverse domains, especially when we are interested in recovering some latent information $\widetilde\theta$ from observed data $y$ while knowing a priori that the elements of $\widetilde\theta$ exhibit certain "similarity structures" that can be captured by the edge set $\mathcal{E}$. Take image processing [12, 40] as an example: in this context each $i \in V$ represents a pixel and $y$ is a blurry image we observe, see Figure 1.1a. By applying (1.3), we can utilize the underlying structure of the image to recover the uncontaminated image $\widetilde\theta$ as in Figure 1.1b. More specifically, we can leverage the knowledge that darker pixels in the center of the image should exhibit similar values, and likewise for the lighter pixels in the background.

Figure 1.1: Image recovery. (a) Blurry image $y$. (b) True image $\widetilde\theta$.
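To make the benchmark concrete, the following minimal sketch evaluates the MAP objective of (1.3) for the image-recovery setting, assuming (purely for illustration) a Gaussian likelihood $y_i = \theta_i + \text{noise}$ with variance $\sigma^2$, scalar pixel values, and $h_{ij}$ taken as the squared difference; the constant $\log Z$ is omitted since it does not depend on $\theta$. The grid-graph construction and all parameter values are hypothetical and not taken from the dissertation's experiments.

```python
import numpy as np

def grid_edges(rows, cols):
    """4-neighbor edges of a rows x cols pixel grid (nodes indexed row-major)."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols: edges.append((i, i + 1))      # right neighbor
            if r + 1 < rows: edges.append((i, i + cols))   # bottom neighbor
    return edges

def map_objective(theta, y, edges, beta, sigma2):
    """Negative log-posterior of (1.3) up to the constant log Z, cf. (1.5):
    Gaussian data-fidelity term plus the pairwise MRF similarity term."""
    data_term = np.sum((theta - y) ** 2) / (2.0 * sigma2)
    mrf_term = sum(b * (theta[i] - theta[j]) ** 2 for (i, j), b in zip(edges, beta))
    return data_term + mrf_term

# Toy usage on a 4x4 image with uniform edge weights beta* = 1 (hypothetical values).
rng = np.random.default_rng(0)
edges = grid_edges(4, 4)
theta_true = np.zeros(16); theta_true[[5, 6, 9, 10]] = 1.0   # bright 2x2 center block
y = theta_true + 0.3 * rng.standard_normal(16)               # noisy observation
print(map_objective(theta_true, y, edges, beta=np.ones(len(edges)), sigma2=0.09))
```

Minimizing `map_objective` over `theta` (here a smooth convex quadratic) yields the MAP estimate; the regularized and generalized variants discussed below modify exactly these two terms.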
Another classic application of the MRF model (1.3) is disease mapping [57, 60] within public health, where each $y_i$ can be understood as the data for a specific type of disease at a geographical area $i \in V$, e.g., COVID-19 fatality data among 10 counties in southern California as in Figure 1.2b. For a given $\widetilde\theta_i$, suppose the conditional distribution $p_{y_i}(y_i \mid \widetilde\theta_i)$ is Poisson with parameter $\widetilde\theta_i$; then we can interpret $\widetilde\theta_i$ as the relative risk of this region, which we are interested in discovering.

Figure 1.2: Disease mapping. (a) COVID-19 fatality $y$. (b) Relative risk $\widetilde\theta$.

Based on the benchmark MRF model (1.3), one of the fundamental inference methods is the maximum a posteriori (MAP) approach. It seeks to estimate $\widetilde\theta$ by solving the following optimization problem, which maximizes its posterior density (up to a normalizing constant):
\[
\operatorname*{maximize}_{\theta \in \Theta} \;\; \Big( \prod_{i\in V} p_{y_i}(y_i \mid \theta) \Big)\, \bar q(\theta) \tag{1.4}
\]
This problem can be reformulated as follows:
\[
\operatorname*{minimize}_{\theta \in \Theta} \;\; \underbrace{-\sum_{i\in V} \log p_{y_i}(y_i \mid \theta)}_{\text{from density of } y_i} \;+\; \underbrace{\sum_{(i,j)\in\mathcal{E}} \beta^\star_{ij}\, h_{ij}(\theta_i,\theta_j)}_{\text{from MRF}} \;+\; \underbrace{\log Z}_{\text{constant}} \tag{1.5}
\]
Problem (1.5) serves as a benchmark which allows us to illustrate the emergence of the two additional extensions under certain additional considerations. It is worth mentioning that our intention here is to highlight the ease with which nonconvexity and nonsmoothness can arise from the parametric and stochastic extensions. Nevertheless, the objective of (1.5), which we treat as the $L$ part in (4.39), can already exhibit such "non-" properties.

1.2 Sparsity learning

One important form of prior knowledge we can impose on the parameter $\widetilde\theta$ is its sparsity. For example, when observing the noisy signal in Figure 1.3a, it becomes evident that the underlying true signal should have very few nonzero values. To incorporate sparsity into our inference, a common approach is to include a "sparsity-inducing regularizer" $\gamma\,\Phi$ in the objective function. Here, $\Phi$ is a function that penalizes non-sparse solutions, and $\gamma \ge 0$ is a hyperparameter controlling the strength of this penalization. In the context of the MAP for the MRF (1.5), this leads to the following regularized optimization problem:
\[
\operatorname*{minimize}_{\theta \in \Theta} \;\; \underbrace{-\sum_{i\in V} \log p_{y_i}(y_i \mid \theta) + \sum_{(i,j)\in\mathcal{E}} \beta^\star_{ij}\, h_{ij}(\theta_i,\theta_j)}_{\triangleq\, L(\theta)} \;+\; \gamma\,\Phi(\theta) \tag{1.6}
\]
where $\gamma\,\Phi$ is added to the objective of (1.5) and the constant $\log Z$ is dropped. Additionally, the function $L(\theta)$ will be referred to as the loss function in this context, and for now $\gamma \ge 0$ is treated as a fixed constant. Note that we can also obtain (1.6) from the MAP of an enhancement of (1.3), in which $\bar q(\theta)$ is instead proportional to
\[
\exp\Big( -\sum_{(i,j)\in\mathcal{E}} \beta^\star_{ij}\, h_{ij}(\theta_i,\theta_j) \;-\; \gamma\,\Phi(\theta) \Big).
\]

Figure 1.3: Signal recovery. (a) Noisy signal. (b) Bad configuration. (c) Better configuration.

Sparsity learning is a classic topic in contemporary statistical machine learning, which dates back to the famous LASSO model [81]; readers can refer to [47] for a comprehensive review. Here we are interested in studying the sparse estimation problem (1.6) from the following two perspectives:

1. What is an appropriate choice of regularizer $\Phi$ that can offer an effective recovery of the underlying sparsity while maintaining efficient computation in practice?
2. Is there any favorable structure of the loss function $L$ which we can leverage to gain faster computation, and what kind of complexity guarantee can we ensure?

The first question stems from the following practical dilemma:

• The ideal candidate for $\Phi$ is an $\ell_0$ or weighted $\ell_0$ function, namely $\Phi(\theta) = \sum_{i=1}^n p_i\,|\theta_i|_0$, where $|t|_0 = 0$ if $t = 0$ and $|t|_0 = 1$ otherwise, and $p_i > 0$ for all $i \in [n]$ with $[n] \triangleq \{1,\dots,n\}$. Although the $\ell_0$-regularizer is preferable as the actual count of cardinality, due to its discontinuous nature the resulting optimization problems can be computationally expensive, and hence of limited utility for large-scale inference in practice.

• On the other hand, to handle the discontinuity of the $\ell_0$-function, various convex (hence continuous) approximations have been proposed; see [47] for a review. Among this family, the $\ell_1$-function or weighted $\ell_1$-functions, i.e., the famous LASSO model with $\Phi(\theta) = \sum_{i=1}^n p_i\,|\theta_i|$ where $p_i > 0$ for all $i \in [n]$, is arguably the most popular choice as it enjoys efficient computation. However, the $\ell_1$-function can be a crude approximation of $\ell_0$ and lacks some favorable statistical properties (see the discussion in [35]), hence it can lead to undesirable results in practice.

To bridge the gap between these two extremes, namely $\ell_0$ that is ideal but expensive and $\ell_1$ that is simpler but crude, nonconvex and potentially nonsmooth choices of the sparsity regularizer $\Phi$ have been proposed in the statistical community. In particular, the class of folded concave functions, see [2] and Chapter 2 for formal definitions, serves as a unified framework that incorporates many nonconvex regularizers as special cases.

The folded concave family offers a compelling compromise between $\ell_0$ and $\ell_1$ regularization, providing a more accurate approximation to $\ell_0$ while still allowing for continuous optimization. However, effectively handling this family computationally remains a challenge. The nonconvex nature of the problem makes finding globally optimal solutions difficult, if not impossible. In such cases, it becomes crucial to define weaker solution concepts and develop algorithms that guarantee convergence to these solutions, particularly when the regularizer is additionally nonsmooth. Previous research has explored the folded concave family from a difference-of-convex perspective [2], but recent studies have also highlighted negative results regarding its computation, namely the NP-hardness of various sparse optimization problems involving folded concave regularizers [22].

These previous studies motivate our investigations in Chapter 2 to answer the second question posed earlier: does there exist a class of folded-concave-regularized sparsity learning problems that are solvable by a linear-step algorithm, or even by an algorithm with strongly polynomial complexity? For clarity, strongly polynomial complexity here means that the total number of arithmetic operations required by the algorithm is a low-degree polynomial in the number of decision variables, and the term linear-step refers to a linear number (in the number of decision variables) of subproblems required by the algorithm, each of which can be handled by solving a finite number of algebraic equations. When these equations are additionally linear, as is the case when the loss function $L$ is quadratic, the linear-step algorithm becomes a strongly polynomial one.
The cornerstone of our studies in Chapter 2 is the concept of the Z-property, which generalizes the scenario in which all the distributions in (1.2) are Gaussian and will be formally defined in Chapter 2. This property traces its origins back to the early work of [62, 68, 69] in the field of solving linear complementarity problems (LCPs). In the context of LCPs, the Z-property manifests as the Z-matrix, which will also be defined in Chapter 2, and the relevant notion of the hidden Z-matrix. These properties serve as the basis for the strongly polynomial complexity of several renowned algorithms for LCPs, including the parametric principal pivoting [23] and Lemke's algorithms [1].

This intriguing connection between the strongly polynomial solvability of LCPs and Z-matrices, along with the utility of such matrices in more recent sparsity learning studies [8, 36], has inspired our work in Chapter 2, which aims to modernize the classic literature on LCPs while complementing contemporary statistical learning research. Specifically, we identify an important class of sparse learning problems that exhibit linear-step and strongly polynomial solvability, accompanied by the promised algorithm that achieves these favorable computational guarantees. To justify the performance of this algorithm, we also provide in Chapter 2 its formal analysis, supported by extensive numerical experiments.

1.3 Hyperparameter selection and parametric extension

Our exploration of sparsity learning extends beyond the scope of the previous section, which is restricted to a fixed value of $\gamma$. A natural question is what the appropriate choice of $\gamma$ is and how to determine it. As demonstrated by Figure 1.3, a larger $\gamma$ yields a better configuration (see Figure 1.3c) compared to a smaller $\gamma$, as evidenced by the sparser signal recovered in the former case, which is more desirable in the context of sparsity recovery. The endeavor of choosing the scalar-valued $\gamma$ is a specialization of a more general class of problems known as hyperparameter selection, which is fundamental in statistics. In this dissertation, we consider hyperparameter selection for the following regularized quadratic program, which generalizes (1.6) when all the distributions in (1.3) are Gaussian:
\[
\operatorname*{minimize}_{\ell \le x \le u} \;\; \underbrace{\tfrac{1}{2}\, x^\top Q x + q^\top x}_{\triangleq\, q(x)} \;+\; \gamma\,\Phi(x) \tag{1.7}
\]
under the following specifications:

• $Q \in \mathbb{R}^{n\times n}$ is a Stieltjes matrix (to be defined later), and $q$, $\ell$ and $u$ are vectors in $\mathbb{R}^n$;

• $\ell \le x \le u$ is shorthand for $x \in \{\chi \in \mathbb{R}^n : \ell_i \le \chi_i \le u_i,\ \forall i \in [n]\}$.

Given problem (1.7) and a function $C : \mathbb{R}^n \to \mathbb{R}^{n'}$ to evaluate its solution, hyperparameter selection of $\gamma$ can be cast as the following two-step procedure:

1. Compute a path $\bar x(\gamma)$ of solutions (of a computable kind) of (1.7) as a function of $\gamma \ge 0$.

2. Select an appropriate $\gamma$ based on the path $C(\bar x(\gamma))$.

The first step, also known as parametric programming in the mathematical programming literature, is the primary focus of our study. This is because once we obtain the solution path $\bar x(\gamma)$, accomplishing the second step becomes relatively straightforward; for instance, when $n' = 1$, it involves solving a univariate optimization problem of maximizing or minimizing $C(\bar x(\gamma))$ over $\gamma \ge 0$. It is important to note that in practice, the first step is typically carried out using grid search, where we compute $\bar x(\gamma)$ for a finite set of predetermined $\gamma$ values; a minimal sketch of this baseline is given below. However, parametric programming offers distinct advantages over grid search when it can be computed within finite time. Firstly, parametric programming allows us to capture all the solution information as a function of $\gamma$. Secondly, it eliminates the need to be concerned about the granularity of the grid. This aspect is important in practice, since an overly sparse grid can yield inaccurate hyperparameter selection, while an excessively dense grid can impose unnecessary workload. We will revisit this topic later with a more detailed discussion.
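The following sketch illustrates the grid-search baseline for the two-step procedure above on a toy instance of (1.7) with a diagonal $Q$, so that each $\gamma$-subproblem has a closed-form coordinatewise solution (soft-thresholding clipped to the box) when $\Phi$ is the $\ell_1$-function. The criterion $C$ used here (distance to a known synthetic ground truth) and all parameter values are hypothetical choices for illustration only.

```python
import numpy as np

def solve_l1_box_diagonal(Qd, q, lo, hi, gamma):
    """Coordinatewise minimizer of 0.5*Qd_i*x_i^2 + q_i*x_i + gamma*|x_i| over [lo_i, hi_i]
    (valid because Q is diagonal, so the problem separates across coordinates)."""
    x = -np.sign(q) * np.maximum(np.abs(q) - gamma, 0.0) / Qd   # soft-thresholding
    return np.clip(x, lo, hi)                                   # 1-D projection is exact

rng = np.random.default_rng(1)
n = 50
x_true = np.zeros(n); x_true[rng.choice(n, 5, replace=False)] = 2.0   # sparse ground truth
Qd = np.ones(n)                                   # diagonal of Q
q = -(x_true + 0.1 * rng.standard_normal(n))      # so that -q is a noisy observation
lo, hi = -5.0 * np.ones(n), 5.0 * np.ones(n)

# Step 1: x̄(γ) on a finite grid; Step 2: pick γ by a criterion C (signal recovery here).
gammas = np.logspace(-3, 1, 60)
path = [solve_l1_box_diagonal(Qd, q, lo, hi, g) for g in gammas]
scores = [np.sum((x - x_true) ** 2) for x in path]
k = int(np.argmin(scores))
print(f"selected gamma = {gammas[k]:.4f}, nonzeros = {np.count_nonzero(path[k])}")
```

A parametric-programming approach replaces the finite grid by tracing $\bar x(\gamma)$ exactly over all $\gamma \ge 0$, which is the subject of Chapter 3.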
Similar to our concern for the fixed-$\gamma$ problem, the parametric extension of (1.7) also presents the dilemma of choosing between $\ell_0$ and $\ell_1$ as the regularizer $\Phi$. To illustrate this, let us consider (1.7) as the problem of recovering a sparse ground-truth signal $X^*$, and let $\bar x(\gamma)$ represent the solution path when $\Phi$ is either $\ell_0$ or $\ell_1$. For the sake of comparing the performance of the $\ell_0$- and $\ell_1$-paths, we suppose that $X^*$ is known and introduce two criteria based on the following expressions:
\[
\text{signal recovery:}\;\; \sum_{i=1}^n \big( \bar x_i(\gamma) - X^*_i \big)^2
\qquad\qquad
\text{support recovery:}\;\; \sum_{i=1}^n \Big|\, |\bar x_i(\gamma)|_0 - |X^*_i|_0 \,\Big|
\]
By considering these two criteria, we are able to compare the $\ell_0$- and $\ell_1$-paths via the two Pareto curves depicted in Figure 1.4, which illustrate the trade-off between signal and support recovery along the two paths for different problem dimensions $n$.

Figure 1.4: Signal recovery vs. support recovery. (a) $n = 10^2$. (b) $n = 10^4$.

For a smaller-scale problem, as illustrated in Figure 1.4a, it becomes evident that the $\ell_0$-path outperforms its $\ell_1$ counterpart. This is because achieving minimal signal recovery and minimal support recovery are highly correlated for $\ell_0$, often occurring at the same $\gamma$ value. The same correlation does not hold for $\ell_1$, as obtaining good signal recovery for $\ell_1$ comes at the expense of unfavorable support recovery, and vice versa. On the other hand, in larger-scale problems, although the $\ell_1$-path still necessitates a trade-off between the two recovery quantities, the $\ell_0$-path becomes intractable to compute within a reasonable amount of time. Hence, the practicality of using the $\ell_0$-path is jeopardized as the problem size increases.

Similar to the previous section, we resort to a nonconvex regularizer $\Phi$ for a reconciliation of this dilemma. More specifically, here we employ a (weighted) capped $\ell_1$ regularizer:
\[
\Phi(x) \;=\; \sum_{i=1}^n p_i \, \min\left\{ \frac{|x_i|}{\delta},\, 1 \right\}
\]
where $p_i > 0$ for all $i \in [n]$. From Figure 1.5, it is evident that capped $\ell_1$ provides a more accurate approximation to $\ell_0$ than $\ell_1$ does, and this approximation is controlled by the predetermined parameter $\delta > 0$: a smaller $\delta$ leads to a more accurate approximation of $\ell_0$. Without loss of generality we assume that this parameter remains the same for all dimensions $i \in [n]$.

Figure 1.5: Comparison of $\ell_0$, $\ell_1$ and capped $\ell_1$. (a) $\ell_0$ function. (b) $\ell_0$, $\ell_1$ and capped $\ell_1$.

Unlike $\ell_1$ and $\ell_0$, parametric programming for capped $\ell_1$ has received little attention in the existing literature. This can be attributed to its coupling of nonconvexity and nonsmoothness, which requires a thorough investigation into computable solutions and poses difficulties in designing an algorithm that can accurately trace a path consisting of such solutions. These motivations drive our studies in Chapter 3, where we systematically explore the comparison between parametric programming of problem (1.7) when the regularizer $\Phi$ is $\ell_0$, $\ell_1$, and capped $\ell_1$, with a particular focus on the analytical and computational aspects of the path of local optimal solutions of the capped $\ell_1$-problem.
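The three regularizers just compared are easy to evaluate side by side. The snippet below (a minimal sketch with hypothetical weights $p_i = 1$) computes the weighted $\ell_0$, $\ell_1$, and capped $\ell_1$ penalties and shows how the capped $\ell_1$ value approaches the $\ell_0$ count as $\delta$ shrinks, while remaining continuous in $x$ for each fixed $\delta > 0$.

```python
import numpy as np

def ell0(x, p=None):
    p = np.ones_like(x) if p is None else p
    return float(np.sum(p * (x != 0)))                 # exact cardinality count

def ell1(x, p=None):
    p = np.ones_like(x) if p is None else p
    return float(np.sum(p * np.abs(x)))                # convex surrogate

def capped_ell1(x, delta, p=None):
    p = np.ones_like(x) if p is None else p
    return float(np.sum(p * np.minimum(np.abs(x) / delta, 1.0)))  # folded concave surrogate

x = np.array([0.0, 0.05, -0.4, 2.0, 0.0, -3.5])
print("ell0:", ell0(x))
for delta in (1.0, 0.1, 0.01):
    print(f"capped ell1 (delta={delta}):", capped_ell1(x, delta))
print("ell1:", ell1(x))
```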
From an analytical perspective, the most prominent property we concentrate on is the complexity of the aforementioned solution paths, which, as will be demonstrated in Chapter 3, are all of a piecewise affine nature. In practical applications, computing these paths involves tracing them piece by piece, making the number of pieces a crucial factor for computational efficiency. Specifically, we aim to identify problem classes where the three paths are guaranteed to have polynomially many pieces or, in the worst-case scenario, can possess exponentially many pieces; these complexities are measured in terms of the number of decision variables.

Regarding computational considerations, the solution paths considered in traditional parametric programming techniques are typically continuous, making them easier to compute. However, we will demonstrate later that a capped $\ell_1$-path can typically be discontinuous, due to the nonsmoothness of the capped $\ell_1$-function. Consequently, our initial focus is on designing an algorithmic framework that can robustly and efficiently handle such discontinuities, which, as will be elaborated in Chapter 3, is made possible courtesy of the Stieltjes-matrix property of $Q$. The culmination of our investigations will yield a comprehensive package of analytical and computational results supporting the use of parametric capped $\ell_1$ as a favorable compromise between parametric $\ell_0$ and $\ell_1$. In practical scenarios where $\ell_0$ paths are intractable, we can compute the capped $\ell_1$ path relatively quickly. Moreover, the performance of this path is comparable to $\ell_0$ and superior to $\ell_1$, eliminating the need to compromise between the qualities of signal and support recovery, as illustrated in the previous example.

1.4 Intractable normalizers and stochastic programming

In this section, we introduce an additional extension to the benchmark MRF model (1.3) which makes it more aligned with practical applications. Recall that in model (1.3), we assume that the graph topology $\mathcal{E}$ and weights $\{\beta^\star_{ij}\}_{(i,j)\in\mathcal{E}}$, which encode the structure of similarities among $\{\widetilde\theta_i\}_{i\in V}$, are known a priori. However, this is rarely the case in reality, and prescribing such topological information can be challenging in practice.

Consider the example of a disease mapping problem, where the goal is to estimate the relative risk of several neighborhoods, including the University of Southern California (USC) campus, as illustrated in Figure 1.6b. In this case, the edges $\mathcal{E}$ indicate whether two neighborhoods are correlated in a meaningful way during a pandemic. From a spatial, geographical perspective, one might be tempted to assume that USC is connected to Historic South Central (HSC), since these two regions are adjacent to each other. Conversely, it might be assumed that the USC campus is not connected to Koreatown (K-town) and Downtown Los Angeles (DTLA), as they do not share borders. However, from an epidemiological point of view, we can also argue that it would make more sense to connect the USC campus to K-town and DTLA, as these areas are popular hubs for USC students, thereby raising the risk of disease transmission. Likewise, we can say that there should be no edge between USC and HSC due to the minimal transmission between these two neighborhoods. Clearly, prescribing the edges for a disease mapping task can be a complex undertaking.

Figure 1.6: MRF graph topology can be difficult to determine. (a) Image recovery. (b) Disease mapping.
It is also worth acknowledging that the graph topology $\mathcal{E}$ and weights $\{\beta^\star_{ij}\}_{(i,j)\in\mathcal{E}}$ can be prone to misspecification when estimated heuristically from the data $\{y_i\}_{i\in V}$. This can result in an unreliable recovery of $\widetilde\theta$, especially when $\{y_i\}_{i\in V}$ is considered a noisy version of $\widetilde\theta$. For instance, in the context of image recovery, given the blurry image $\{y_i\}_{i\in V}$ shown in Figure 1.6a, a crude approach to specify the graph topology could involve setting $\mathcal{E} = \{(i,j) \in [n]\times[n] : i < j\}$ and $\beta^\star_{ij} = \mathbb{I}\{\|y_i - y_j\|^2 > \bar y\}$, where $\mathbb{I}$ represents the indicator function and $\bar y$ is a fixed threshold whose specific definition can be found in the numerical section of Chapter 4. However, if we examine the images recovered by solving (1.5) under various $\bar y$ values, as depicted in Figures 1.7b to 1.7d, none of these recovered images closely resembles the true image shown in Figure 1.7a.

Figure 1.7: Naïve estimation of $\mathcal{E}$ and $\{\beta^\star_{ij}\}_{(i,j)\in\mathcal{E}}$ leads to undesirable recovery. (a) True. (b) Small $\bar y$. (c) Moderate $\bar y$. (d) Large $\bar y$.

In light of the difficulty in prescribing $\mathcal{E}$ and $\{\beta^\star_{ij}\}_{(i,j)\in\mathcal{E}}$, a more reasonable approach is to incorporate them as part of our inference via a generalized MRF distribution for $\widetilde\theta$, whose density function (conditioned on the random parameter $\widetilde\beta \triangleq \{\widetilde\beta_{ij} : (i,j) \in [n]\times[n],\, i < j\}$) is specified as follows:
\[
q(\theta \mid \widetilde\beta) \;\triangleq\; \frac{1}{Z(\widetilde\beta)}\,\exp\big\{ h(\theta,\widetilde\beta) \big\}
\quad\text{with}\quad
h(\theta,\widetilde\beta) \;=\; -\sum_{i<j} \widetilde\beta_{ij}\, h_{ij}(\theta_i,\theta_j),
\qquad
Z(\widetilde\beta) \;\triangleq\; \int_{\Theta} \exp\big\{ h(\theta',\widetilde\beta) \big\}\, d\theta' \tag{1.8}
\]
which gives us the following generalizations of the benchmark model (1.3):

• MRF with unknown edges:
\[
\begin{array}{ll}
\text{level 2:} & \widetilde y_i \mid \widetilde\theta \;\overset{\text{ind.}}{\sim}\; p_{y_i}(y_i \mid \widetilde\theta), \quad \text{for } i \in V \\[2pt]
\text{level 1:} & \widetilde\theta \mid \widetilde\beta \;\sim\; q(\theta \mid \widetilde\beta), \quad \text{as defined in (1.8)} \\[2pt]
\text{level 0:} & \widetilde\beta_{ij} \;\overset{\text{ind.}}{\sim}\; \text{Bernoulli}(p_{ij}), \quad \text{where } p_{ij} \in [0,1] \text{ are fixed for } i < j.
\end{array} \tag{1.9}
\]

• MRF with unknown weights:
\[
\begin{array}{ll}
\text{level 2:} & \widetilde y_i \mid \widetilde\theta \;\overset{\text{ind.}}{\sim}\; p_{y_i}(y_i \mid \widetilde\theta), \quad \text{for } i \in V \\[2pt]
\text{level 1:} & \widetilde\theta \mid \widetilde\beta \;\sim\; q(\theta \mid \widetilde\beta), \quad \text{as defined in (1.8)} \\[2pt]
\text{level 0:} & \widetilde\beta_{ij} \;\overset{\text{ind.}}{\sim}\; \text{Uniform}\big[\underline\beta_{ij},\, \overline\beta_{ij}\big], \quad \text{where } (\underline\beta_{ij}, \overline\beta_{ij}) \in \mathbb{R}^2_+ \text{ are fixed for } i < j.
\end{array} \tag{1.10}
\]

Model (1.9) makes the assumption that the occurrence of an edge between $(i,j)$ has probability $p_{ij}$, while (1.10) assumes that the "strength" of the connection between nodes $i$ and $j$ is unknown and to be estimated. These frameworks indeed generalize (1.3); e.g., if $\beta^\star_{ij} = 1$ for all $(i,j) \in \mathcal{E}$ in (1.3), then model (1.9) is apparently more flexible by allowing us to learn the appearance of edges $\mathcal{E}$ from solving the associated MAP:
\[
\begin{array}{ll}
\operatorname*{minimize}_{\theta,\,\beta} & -\Big( \displaystyle\sum_{i\in V} \log p_{y_i}(y_i \mid \theta) + h(\theta,\beta) \Big) \;+\; \log \displaystyle\int_{\Theta} \exp\{ h(\theta',\beta) \}\, d\theta' \;+\; \displaystyle\sum_{i<j} \beta_{ij}\, \log\frac{p_{ij}}{1 - p_{ij}} \\[8pt]
\text{subject to} & \theta \in \Theta, \quad \beta_{ij} \in \{0,1\} \ \text{for } i < j.
\end{array} \tag{1.11}
\]
However, despite being more flexible, the generalized MRF models (1.9) and (1.10) clearly give rise to MAPs that are more involved than the benchmark problem. More specifically, we introduce additional nonconvexity (stemming from $h(\theta,\beta)$) and intractability (due to the integral $Z(\beta)$) into the original MAP formulation (1.5). Moreover, the MAP problem (1.11) associated with (1.9) becomes a mixed zero-one program since $\widetilde\beta_{ij}$ is binary-valued.

In a more general context, consider a generic BHM represented using the following notation:
\[
\begin{array}{ll}
\text{likelihood:} & \widetilde y \;\sim\; p(y \mid \widetilde\xi) \\[2pt]
\text{prior:} & \widetilde\xi \;\sim\; q(\xi)
\end{array}
\qquad\text{with MAP:}\qquad
\operatorname*{minimize}_{\xi} \; -\log p(y \mid \xi) \;-\; \log q(\xi) \tag{1.12}
\]
where $y$ represents the data we observe, and $\widetilde\xi$ is a comprehensive notation encompassing all the (random) model parameters, whose density $q$ may include additional dependencies among its components.
Note that when the model parameter $\widetilde\xi$ is assumed to be deterministic, the MAP (1.12) reduces to the maximum likelihood estimation (MLE) of a "non-Bayesian" model, and the subsequent discussions also apply to this scenario.

The challenging features that we encounter when employing the generalized MRF models are increasingly prevalent in modern applications. These features remain under-resolved and can significantly impede the MAP inference of a BHM that incorporates enhanced modeling capabilities. In what follows, we summarize these features in more detail:

• (Intractable normalizers) In a BHM, the conditional densities are often expressed as ratios, which in generic notation are given by $\pi(\zeta \mid \chi) = \dfrac{\ell(\zeta,\chi)}{L(\chi)}$, where $\zeta \in \Xi$ and $(\zeta,\chi)$ can be understood either as $(\widetilde y, \widetilde\xi)$ in (1.12) or as a further decomposition of the parameters $\widetilde\xi$. Unfortunately, the closed form of the normalizer $L(\chi) \triangleq \int_{\Xi} \ell(z,\chi)\, dz$ is typically unavailable. This issue is pervasive in many interesting models where an un-normalized $\ell(\zeta,\chi)$ is designed to facilitate the modeling capabilities, albeit at the expense of a resulting normalizer that cannot be evaluated effectively. Such difficulty becomes a major handicap for MAP and MLE inference on models of this type, which can be found in a wide span of applications including machine learning [76], graphical models [10], social networks [44], population genetics [72], epidemiology [57, 60] and image processing [12, 53].

• (Nonconvexity and nonsmoothness) To capture complex phenomena in realistic settings, BHMs often employ (conditional) density functions that are nonsmooth and potentially lead to a nonconvex problem (1.12). A common source is the family of indicator functions (see [58]), which can be approximated by piecewise affine functions (see our discussions in Chapter 4). Nonconvexity also typically arises when previously deterministic parameters are randomized to improve the flexibility of a benchmark model. Notably, the generalized MRF models (1.10) and (1.9), which we introduced earlier, exhibit these nonconvex and nonsmooth properties due to their unknown network topology.

BHMs with intractable normalizers are also referred to as being doubly intractable in the literature. A substantial amount of previous work on these models is dedicated to resolving this issue for Markov chain Monte Carlo inference, which aims to sample from the posterior distribution of the model parameters; see for instance [39, 63, 65]. In contrast, studies on MAP and MLE under the doubly intractable setting are relatively limited. Early approaches, including pseudo-likelihood [10] and stochastic gradient descent [86], are not adequate as they lack theoretical guarantees in nonconvex and nonsmooth cases. While Sample Average Approximation (SAA) [38, 41, 42] has been shown to be consistent for MLE, it is not practically an algorithm for such cases either.

To date there has been an absence of rigorous investigations into computational algorithms for MAP/MLE of BHMs with intractable normalizers that are additionally nonconvex and nonsmooth, hindering their utility in modern data science. In response, the primary purpose of our studies in Chapter 4 is to bridge this gap with the following contributions:
1. Offering a unified treatment for MAP/MLE, we propose a practical algorithmic framework termed Adaptive Importance Sampling (AIS)-based Surrogation to tackle the following logarithmic integral optimization problem, written in generic notation:
\[
\operatorname*{minimize}_{x \in X} \;\; c(x) + \log Z(x)
\qquad\text{where}\qquad
Z(x) \;\triangleq\; \int_{\Xi} r(x,z)\, dz \tag{1.13}
\]
where $X$ and $\Xi$ are given sets in their respective spaces (to be specified later). This algorithm employs two major techniques. First, we introduce surrogation [29, Chapter 7] to address the challenge of nonconvexity and nonsmoothness. Second, an adaptive importance sampling (AIS) scheme is incorporated to allow improved control over the variances of the sampling-based approximations of the intractable $Z$ during the iterations.

2. As a practical solution to accommodate applications with discrete parameters $\widetilde\xi$, we present a systematic approach that approximates the inverses of the cumulative distribution functions of the discrete priors and converts an otherwise mixed-integer programming formulation of the MAP into a continuous formulation that can be handled by our proposed algorithm.

3. Via a rigorous SAA consistency analysis of the non-independent and identically distributed (non-iid) triangular array produced by the iterate-dependent sampling process in the AIS-based surrogation method, we establish the convergence of our algorithm to a kind of surrogation stationary point, which serves as a computable candidate for a local minimizer. To further support the numerical stability and efficiency of the algorithm in practice, we conduct extensive numerical experiments and compare its performance with related approaches on some realistic BHMs.

It is important to highlight that problem (1.13) can be readily reformulated as a stochastic program, which inspires our algorithmic approach in the dissertation and justifies the use of the term "stochastic programming" in the title of this section. By pursuing the aforementioned endeavors, we anticipate that our studies in Chapter 4 will not only enhance the field of computational statistics, but also extend the literature on nonconvex stochastic programming by offering an effective variance reduction algorithm for the logarithmic integral optimization problem, which is interesting in its own right.

Last but not least, our AIS scheme, which at a high level iteratively alternates between sampling approximations of the integral function and updates of the sampling distribution, is the major workhorse of our proposed algorithm. Therefore, we close our discussion here with a brief overview of previous works that are relevant to our AIS scheme. Emerging from the simulation community, the concept of AIS was initially proposed as a variance reduction technique for estimating a scalar-valued integral through sampling, cf. [67] and [66]. Early approaches to stochastic optimization using the AIS scheme that are most closely related to ours were introduced in [30] and [54]. These studies focused on decomposition methods for multistage stochastic programs and were followed by a few subsequent studies such as [71]. While more recent works have explored AIS in conjunction with stochastic (proximal) gradient descent [3, 56] and coordinate primal descent (dual ascent) methods [25, 80] on problems with convexity or smoothness, the utility of AIS for the challenging type of problem (1.13) considered here remains open, which further motivates the proposal and analysis of our algorithm.
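To convey the sampling idea behind the intractable term $\log Z(x)$ in (1.13), the sketch below forms a plain importance-sampling estimate of $\log Z(x)$ for a one-dimensional toy integrand with a known answer. It illustrates only the basic estimator that the AIS scheme of Chapter 4 adapts over iterations (the adaptation and surrogation steps are not shown), and the integrand, proposal distribution and sample size are hypothetical choices.

```python
import numpy as np

def log_Z_importance_sampling(r, x, propose, proposal_pdf, num_samples, rng):
    """Estimate log Z(x) = log ∫ r(x, z) dz with samples z_k from a proposal:
    Z(x) ≈ (1/N) Σ_k r(x, z_k) / proposal_pdf(z_k)."""
    z = propose(num_samples, rng)
    weights = r(x, z) / proposal_pdf(z)
    return np.log(np.mean(weights))

# Toy integrand: r(x, z) = exp(-(z - x)^2 / 2), so Z(x) = sqrt(2*pi) for every x.
r = lambda x, z: np.exp(-0.5 * (z - x) ** 2)
mu, sd = 2.0, 1.5                                           # proposal N(mu, sd^2)
propose = lambda n, rng: mu + sd * rng.standard_normal(n)   # tracks where r(x, .) has mass
pdf = lambda z: np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
est = log_Z_importance_sampling(r, x=2.0, propose=propose, proposal_pdf=pdf,
                                num_samples=20_000, rng=rng)
print(est, "vs exact", 0.5 * np.log(2 * np.pi))
```

The variance of such an estimator depends heavily on how well the proposal matches $r(x,\cdot)$ at the current iterate $x$, which is precisely what the adaptive scheme in Chapter 4 is designed to control.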
Chapter 2

Linear-Step Solvability of Sparse Optimization with Z-Property

In this chapter, we continue the discussion of Section 1.2 and address the following class of sparsity learning problems, written in generic notation:
\[
\operatorname*{minimize}_{\ell \le x \le u} \;\; L(x) + \gamma\,\Phi(x) \tag{2.1}
\]
where the function $L : \mathbb{R}^n \to \mathbb{R}$ is referred to as the loss function, $\Phi : \mathbb{R}^n \to \mathbb{R}_+$ is the sparsity-inducing regularizer, $\ell \le x \le u$ is shorthand for the box constraints $\{x \in \mathbb{R}^n : \ell_i \le x_i \le u_i \text{ for all } i \in [n]\}$, and $\gamma \ge 0$ is assumed to be a fixed scalar throughout this chapter. Our primary goal is to facilitate the computational efficiency of a folded concave $\Phi$, which is of the nonconvex and nonsmooth type, as a practical compromise between the ideal yet computationally intensive $\ell_0$-norm and the convex $\ell_1$-norm, which may lead to an unsatisfactory relaxation of $\ell_0$. By exploiting a special structure of the loss function known as the Z-property, which generalizes the MAP inference of a Gaussian MRF, we are able to design an algorithm that ensures convergence to a computable solution of (2.1) in a linear number of steps. In some cases, this algorithm can even achieve strongly polynomial complexity. For the definitions of these complexity concepts, we refer to Section 1.2.

The subsequent sections of this chapter are based on the materials of the published paper [43] and are organized as follows. In Section 2.1, we introduce several key concepts such as stationary solutions, the Z-property and the folded concave family, and present some additional motivating examples. Our main algorithm, named the Gómez-He-Pang (GHP) algorithm after the authors of [43], is proposed in Section 2.2 and analyzed in Section 2.3. Finally, in Section 2.4, we evaluate the performance of GHP through various experiments.

2.1 Preliminaries

In this section, we formalize our definitions of the folded concave family, the Z-property and some relevant concepts, together with a discussion of stationarity as a practically computable notion of solution. All the notation in this chapter should be understood as self-contained.

2.1.1 Folded concave family

We begin with a generic definition of the folded concave family.

Definition 1. We call a function $\Phi : \mathbb{R}^n \to \mathbb{R}$ folded concave if it can be formulated as
\[
\Phi(x) \;\triangleq\; \sum_{i=1}^n p_i\, \Phi_i\!\left( \frac{|x_i|}{\delta} \right)
\]
for some $p_i > 0$ for all $i \in [n]$ and $\delta > 0$, where each $\Phi_i : \mathbb{R}_+ \to \mathbb{R}$ satisfies the following:

• $\Phi_i$ is concave piecewise smooth;

• there exist $J_i$ intervals, $J_i - 1$ positive break points $\xi^{(i)}_1, \dots, \xi^{(i)}_{J_i-1}$ and a family of functions $\{\Phi^{(i)}_j\}_{j\in[J_i]}$ such that $\Phi_i = \Phi^{(i)}_j$ on the interval $\big[\xi^{(i)}_{j-1}, \xi^{(i)}_j\big]$, where $\xi^{(i)}_0 \triangleq 0$ and $\xi^{(i)}_{J_i} \triangleq \infty$;

• each piece $\Phi^{(i)}_j$ (for all $j \in [J_i]$) is twice continuously differentiable and concave.

Without loss of generality, we only consider break points $\xi^{(i)}_1, \dots, \xi^{(i)}_{J_i-1}$ at which $\Phi_i$ is not differentiable. For example, even though the well-known Smoothly Clipped Absolute Deviation (SCAD) function [35] is technically defined by three "pieces" on $\mathbb{R}_+$, here we treat it as a single continuously differentiable concave piece since it is continuously differentiable everywhere on $\mathbb{R}_{++}$. Moreover, the parameter $\delta$ is introduced to control the approximation of the folded concave family to $\ell_0$. For convenience, we set $\delta = 1$ as our choice.

Note that each $\Phi_i$ is proposed as an approximation to the univariate $\ell_0$-function; thus, beyond its concavity and smoothness conditions, we also impose the following restrictions on its shape.

Assumption 2 (Shape restrictions).
Our folded concave $\Phi_i$ also satisfies the following conditions:

• $\Phi_i(0) = 0$ and $\Phi_i(t) \ge 0$ for all $t \ge 0$;

• $\Phi_i$ is non-decreasing, hence $\Phi_i'(t) \ge 0$ for all $t \ge 0$ with $t \ne \xi^{(i)}_j$, $\forall j \in [J_i-1]$;

• $\Phi_i(t) = c_i > 0$ if $t \ge \bar t \in \mathbb{R}_{++}$, namely, $\Phi_i'(t) = 0$ and $\Phi^{(i)}_{J_i} = c_i$ for all $t \ge \bar t$.

A special case of $\Phi_i$ is a concave piecewise affine function, which has the pointwise-minimum formulation $\Phi_i(t) = \min_{j\in[J_i]} \big\{ \alpha^{(i)}_j t + \beta^{(i)}_j \big\}$ for $t \ge 0$. For such a $\Phi_i$ to satisfy Assumption 2 we need $\alpha^{(i)}_1 > \alpha^{(i)}_2 > \dots > \alpha^{(i)}_{J_i} = 0 = \beta^{(i)}_1 < \beta^{(i)}_2 < \dots < \beta^{(i)}_{J_i} = c_i$. Figure 2.1 (cited from [29]) showcases several known folded concave functions that satisfy Assumption 2, all of which provide a better approximation to $\ell_0$ than the convex $\ell_1$.

Figure 2.1: The $\ell_0$-function and some folded concave functions (SCAD, MCP, capped $\ell_1$, truncated transformed $\ell_1$, truncated logarithmic).

In addition to the shape assumptions above, we also propose the following key conditions on the pieces $\{\Phi^{(i)}_j\}_{j\in[J_i],\, i\in[n]}$, which play a crucial role in the analysis of our algorithm.

Assumption 3 (Derivative dominance). Suppose that for all $i \in [n]$, we have

• $\Phi^{(i)\,\prime}_j(x_i) > \Phi^{(i)\,\prime}_{j'}(x_i)$, for all $x_i \in \big[\xi^{(i)}_{j-1}, \xi^{(i)}_j\big]$, $j' > j$, and $j \le J_i - 1$;

• $\Phi^{(i)\,\prime}_j(x_i) < \Phi^{(i)\,\prime}_{j'}(x_i)$, for all $x_i \in \big[\xi^{(i)}_{j-1}, \xi^{(i)}_j\big]$, $j' < j$, and $\forall j > 1$.

In simpler terms, within its designated interval $\big[\xi^{(i)}_{j-1}, \xi^{(i)}_j\big]$, the derivative $\Phi^{(i)\,\prime}_j$ should dominate the derivatives of all subsequent pieces $j' > j$, while being dominated by the derivatives of all preceding pieces $j' < j$. At first glance, this assumption may appear quite restrictive. However, in what follows we demonstrate that it can easily be met for all concave piecewise smooth $\Phi_i$ satisfying the conditions in Definition 1 and Assumption 2, after we slightly modify the pieces $\Phi^{(i)}_j$ without changing the overall $\Phi$.

Proposition 4. Let $\{\Phi^{(i)}_j\}_{j\in[J_i],\, i\in[n]}$ be a class of functions satisfying the conditions in Definition 1 and Assumption 2. Substitute each $\Phi^{(i)}_j$ with the respective $\widehat\Phi^{(i)}_j$ defined below:
\[
\widehat\Phi^{(i)}_1(x_i) \triangleq
\begin{cases}
\Phi^{(i)}_1(x_i) & \text{if } 0 \le x_i \le \xi^{(i)}_1 \\
a^{(i)}_{1>}\, x_i + b^{(i)}_{1>} & \text{if } x_i \ge \xi^{(i)}_1
\end{cases}
\qquad
\widehat\Phi^{(i)}_{J_i}(x_i) \triangleq
\begin{cases}
a^{(i)}_{J_i<}\, x_i + b^{(i)}_{J_i<} & \text{if } x_i \le \xi^{(i)}_{J_i-1} \\
\Phi^{(i)}_{J_i}(x_i) & \text{if } x_i \ge \xi^{(i)}_{J_i-1}
\end{cases}
\]
\[
\widehat\Phi^{(i)}_j(x_i) \triangleq
\begin{cases}
a^{(i)}_{j<}\, x_i + b^{(i)}_{j<} & \text{if } x_i \le \xi^{(i)}_{j-1} \\
\Phi^{(i)}_j(x_i) & \text{if } \xi^{(i)}_{j-1} \le x_i \le \xi^{(i)}_j \\
a^{(i)}_{j>}\, x_i + b^{(i)}_{j>} & \text{if } x_i \ge \xi^{(i)}_j
\end{cases}
\qquad \text{for all } j = 2, \dots, J_i-1,
\]
where
\[
\begin{array}{ll}
a^{(i)}_{j<} \triangleq \Phi^{(i)\,\prime}_j\big(\xi^{(i)}_{j-1}\big), & b^{(i)}_{j<} \triangleq \Phi^{(i)}_j\big(\xi^{(i)}_{j-1}\big) - \Phi^{(i)\,\prime}_j\big(\xi^{(i)}_{j-1}\big)\, \xi^{(i)}_{j-1} \qquad \text{for all } j = 2, \dots, J_i, \\[4pt]
a^{(i)}_{j>} \triangleq \Phi^{(i)\,\prime}_j\big(\xi^{(i)}_j\big), & b^{(i)}_{j>} \triangleq \Phi^{(i)}_j\big(\xi^{(i)}_j\big) - \Phi^{(i)\,\prime}_j\big(\xi^{(i)}_j\big)\, \xi^{(i)}_j \qquad \text{for all } j = 1, \dots, J_i-1.
\end{array}
\]
Then the new pieces $\widehat\Phi^{(i)}_j$ satisfy Assumption 3 and each $\Phi_i$ remains the same.

Proof. We can verify that for all $x_i \in \big[\xi^{(i)}_{j-1}, \xi^{(i)}_j\big]$, $j' > j$ and $j < J_i - 1$:
\[
\widehat\Phi^{(i)\,\prime}_j(x_i) \;\ge\; \widehat\Phi^{(i)\,\prime}_j\big(\xi^{(i)}_j\big) \;=\; \Phi^{(i)\,\prime}_j\big(\xi^{(i)}_j\big) \;>\; \Phi^{(i)\,\prime}_{j'}\big(\xi^{(i)}_{j'-1}\big) \;=\; \widehat\Phi^{(i)\,\prime}_{j'}(x_i),
\]
and for all $x_i \in \big[\xi^{(i)}_{j-1}, \xi^{(i)}_j\big]$, $j' < j$ and $j > 1$:
\[
\widehat\Phi^{(i)\,\prime}_j(x_i) \;\le\; \widehat\Phi^{(i)\,\prime}_j\big(\xi^{(i)}_{j-1}\big) \;=\; \Phi^{(i)\,\prime}_j\big(\xi^{(i)}_{j-1}\big) \;<\; \Phi^{(i)\,\prime}_{j'}\big(\xi^{(i)}_{j'}\big) \;=\; \widehat\Phi^{(i)\,\prime}_{j'}(x_i).
\]
Thus we have verified that $\{\widehat\Phi^{(i)}_j\}_{j\in[J_i],\, i\in[n]}$ meets the conditions in Assumption 3, and the new $\Phi_i$ obtained after substituting $\Phi^{(i)}_j$ with $\widehat\Phi^{(i)}_j$ is unchanged by construction.

To illustrate the modification of the pieces $\Phi^{(i)}_j$ described in Proposition 4, we present a simple example. Consider a univariate folded concave function $\Phi(x) = \min\{\Phi_1(x), \Phi_2(x)\}$, see Figure 2.2a for a visualization, which comprises two smooth pieces specified as follows:
\[
\Phi_1(x) = -e^{-x} + 1 \qquad\text{and}\qquad \Phi_2(x) = -\tfrac{1}{2}\, e^{-4x} + \tfrac{3}{4}.
\]
The break point occurs at $\xi_1 \approx 1.3783$. As shown in Figure 2.2b, the derivatives of $\Phi_1$ and $\Phi_2$ do not satisfy Assumption 3.

Figure 2.2: The original folded concave function. (a) Original pieces. (b) Original derivatives.

After applying the modifications in Proposition 4, the derivatives of the modified pieces $\widehat\Phi_1$ and $\widehat\Phi_2$ satisfy Assumption 3, as depicted in Figure 2.3b, while the modified folded concave function $\min\{\widehat\Phi_1(x), \widehat\Phi_2(x)\}$ is identical to the original one, as shown in Figure 2.3a.

Figure 2.3: The modified folded concave function. (a) Modified pieces. (b) Modified derivatives.
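The tangent-line construction of Proposition 4 is straightforward to implement. The sketch below applies it to the two-piece example just given and numerically confirms both claims of the proposition: the pointwise minimum is unchanged and the modified derivatives are ordered as Assumption 3 requires. The evaluation grid and tolerance are arbitrary illustration choices.

```python
import numpy as np
from scipy.optimize import brentq

phi1  = lambda x: 1.0 - np.exp(-x)
phi2  = lambda x: 0.75 - 0.5 * np.exp(-4.0 * x)
dphi1 = lambda x: np.exp(-x)
dphi2 = lambda x: 2.0 * np.exp(-4.0 * x)

# Break point xi_1 where the two pieces cross (~1.3783 in the text).
xi1 = brentq(lambda x: phi1(x) - phi2(x), 0.5, 3.0)

# Proposition 4: extend each piece past the break point by its tangent line there.
a1, b1 = dphi1(xi1), phi1(xi1) - dphi1(xi1) * xi1     # right extension of piece 1
a2, b2 = dphi2(xi1), phi2(xi1) - dphi2(xi1) * xi1     # left extension of piece 2
phi1_hat  = lambda x: np.where(x <= xi1, phi1(x), a1 * x + b1)
phi2_hat  = lambda x: np.where(x <= xi1, a2 * x + b2, phi2(x))
dphi1_hat = lambda x: np.where(x <= xi1, dphi1(x), a1)
dphi2_hat = lambda x: np.where(x <= xi1, a2, dphi2(x))

x = np.linspace(0.0, 4.0, 2001)
# (i) the folded concave function itself is unchanged ...
assert np.allclose(np.minimum(phi1_hat(x), phi2_hat(x)),
                   np.minimum(phi1(x), phi2(x)), atol=1e-10)
# (ii) ... and derivative dominance (Assumption 3) now holds everywhere.
assert np.all(dphi1_hat(x) > dphi2_hat(x))
print("xi_1 =", round(xi1, 4), "- both checks passed")
```

In Figure 2.3a these extensions appear as the straight-line continuations of each piece beyond its own interval.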
2.1.2 Z- and M-functions

Although $\Phi$ is nonsmooth, we assume that our loss function $L$ is smooth, and the so-called Z-property of $L$ is characterized through its gradient $\nabla L : \mathbb{R}^n \to \mathbb{R}^n$ viewed as a vector-valued map. For clarity, we introduce the following concepts for a generic vector-valued map and a matrix.

Definition 5 (Classes of functions). We say that $F : \mathbb{R}^n \to \mathbb{R}^n$ is

• a Z-function, or off-diagonally antitone, if the univariate, scalar-valued function $t \mapsto F_i(x + t e_j)$, where $e_j$ is the $j$-th coordinate vector, is nonincreasing for all $i \ne j$ and $x \in \mathbb{R}^n$;

• inverse isotone if $F(x) \le F(y)$ implies $x \le y$ for all $x$ and $y$ in $\mathbb{R}^n$;

• an M-function if it is an inverse isotone Z-function;

• monotone if $(F(x) - F(y))^\top (x - y) \ge 0$ for all $x, y \in \mathbb{R}^n$;

• strictly monotone if $(F(x) - F(y))^\top (x - y) > 0$ for all $x, y \in \mathbb{R}^n$ with $x \ne y$;

• strongly monotone if a constant $c > 0$ exists such that $(F(x) - F(y))^\top (x - y) \ge c\,(x - y)^\top (x - y)$ for all $x, y \in \mathbb{R}^n$;

• a P-function if $\max_{i\in[n]} (x_i - y_i)\big(F_i(x) - F_i(y)\big) > 0$ for all $x, y \in \mathbb{R}^n$ with $x \ne y$.

Definition 6 (Classes of matrices). We say a matrix $Q \in \mathbb{R}^{n\times n}$ is

• a Z-matrix if $Q_{ij} \le 0$ for all $i, j \in [n]$ with $i \ne j$;

• an M-matrix if it is a Z-matrix and there exist $s > 0$ and $B \in \mathbb{R}^{n\times n}$ with $B_{ij} \ge 0$ for all $i, j \in [n]$ and $s \ge \rho(B)$, such that $Q = sI - B$, where $I \in \mathbb{R}^{n\times n}$ is the identity matrix and $\rho(B)$ is the spectral radius of $B$;

• a Stieltjes matrix if it is a symmetric positive definite Z-matrix (hence a nonsingular M-matrix);

• a P-matrix if all of its principal minors are positive.

More comprehensive discussions and detailed analyses of the concepts mentioned above can be found in the references [23, 34, 64], which provide a rich variety of related conclusions as well as the interconnections among these function and matrix classes. In what follows, we highlight a couple of results from these works that are relevant to our context.

Proposition 7. Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be given.

• If $F$ is restricted to a rectangle, then $F$ is an M-function if and only if $F$ is both a Z- and a P-function.

• If $F$ is a continuous, strongly monotone Z-function, then it is an M-function.

• If $F$ is affine, namely there exist $Q \in \mathbb{R}^{n\times n}$ and $q \in \mathbb{R}^n$ such that $F(x) = Qx + q$, then $F$ is a Z-function (M- or P-function) if and only if $Q$ is a Z-matrix (M- or P-matrix).

• If $F$ is restricted to a rectangle and is additionally Fréchet-differentiable with Jacobian matrix $JF$, then $F$ is a Z-function (M- and P-function) if and only if $JF$ is a Z-matrix (nonsingular M- and P-matrix) on the rectangle.

• If $F$ is Gâteaux-differentiable with a positive semidefinite (positive definite) Jacobian matrix $JF$ on a convex set, then $F$ is monotone (strictly monotone).

• If $F$ is the gradient map of a convex (strictly convex or strongly convex) function, then $F$ is monotone (strictly monotone or strongly monotone).

• If $F$ is the gradient map of a continuously differentiable strongly convex function and $F$ is additionally a Z-function, then $F$ is a surjective M-function.

Proposition 8. Let $Q \in \mathbb{R}^{n\times n}$ be a Z-matrix. Then the following are equivalent:

• $Q$ is a nonsingular M-matrix;

• $Q$ is nonsingular and $Q^{-1}$ has nonnegative elements;

• $Q$ has positive leading principal minors.

As discussed in Section 1.2, our GHP algorithm is inspired by the connection between Z-matrices and the strongly polynomial solvability of certain (parametric) linear complementarity problems (LCPs). Taking reference [68] as an example, suppose we are given the following LCP in $(x,y) \in \mathbb{R}^n \times \mathbb{R}^n$, which admits the Karush-Kuhn-Tucker (KKT) conditions of a box-constrained quadratic program as a special case:
\[
\begin{array}{rcl}
0 \;\le\; x & \perp & c + Dx + y \;\ge\; 0 \\[2pt]
0 \;\le\; y & \perp & b - x \;\ge\; 0
\end{array} \tag{2.2}
\]
where the notation "$u \perp v$" is understood as $u^\top v = 0$, and $D \in \mathbb{R}^{n\times n}$, $b$ and $c \in \mathbb{R}^n$. The favorable connection between the Z-matrix and the existence of a strongly polynomial algorithm for solving the LCP (2.2) is built upon a critical analytical result named the Least Element Theorem (LET), which at a high level states that when $D$ is a Z-matrix there exists a solution of (2.2) that is element-wise dominated by all the items in a certain set, and when $D$ is additionally a nonsingular M-matrix, such a solution is unique. We summarize this LET result below.

Proposition 9 (Linear LET: Theorem 3.1 and Proposition 3.2 in [68]). (i) If $D$ is a Z-matrix, then there exists a least element $\bar x$ of the set
\[
\Big\{ \xi \in \mathbb{R}^n \;:\; 0 \le \xi \le b \ \text{ and, for all } i \in [n],\ (c + D\xi)_i \ge 0 \text{ whenever } \xi_i < b_i \Big\}.
\]

The remainder of this subsection collects a few loss functions, drawn from the applications discussed in Chapter 1, that motivate the structural assumptions above. For instance, in a signal recovery problem where each observation $y_i$ is a noisy measurement of $\theta_i$ with variance $\sigma_i^2 > 0$, the loss function $L(\theta)$ is
\[
L(\theta) \;=\; \sum_{i\in[n]} \frac{1}{\sigma_i^2}\,(\theta_i - y_i)^2 \;+\; \lambda \sum_{(i,j)\in\mathcal{E}} (\theta_i - \theta_j)^2.
\]
Here the second term on the right-hand side can be interpreted as the smoothness of the $\theta$ we aim to recover, and it is controlled by a prescribed parameter $\lambda$.
Scaled nearly isotonic regression with sparsity control The nearly isotonic regression problem is introduced in [83] wherein the classic isotonic constraints x i ≤ x i+1 for all i∈ [n− 1] (see [18]) are replaced by penalty terms max{x i − x i+1 ,0} for all i∈ [n− 1]. Let y i be given for all i∈ [n], a possible scaled near isotonicity generalization is as follows: L(x)= X i∈[n] (y i − x i ) 2 +λ X i∈[n− 1] max{x i − b i+1 x i+1 ,0} 2 where a scaling factor λ > 0 is introduced in the second term on the right-hand side, to penalize the violation of scaled isotonic constraints x i ≤ b i+1 x i+1 with b i+1 > 0. A potentialapplicationofthisoptimizationproblemisspikedetectionincalciumimagingdata [20][55][84]. 31 2.1.3 Stationarity As previously emphasized, the folded concave regularizer gives rises to nonconvexity and nonsmoothness that are coupled together, making the computation of a global optimal solu- tion of (2.1) out of the question. Instead, in this chapter we settle for a stationary solution which we can provably compute, and strive for one that is of the sharpest kind, known as the directional stationary (d-stationary) solution which we formally define as follows. Definition 12. Let f :R n 7→R be a given function and d∈R n be a give direction, then the directional derivative of f at ¯x∈R n along the direction d is defined as: f ′ (¯x;d) ≜ lim τ ↓0 f(¯x+τd )− f(¯x) τ andwesaythatf isdirectionallydifferentiable at ¯xiff ′ (¯x;d)iswell-definedalongall d∈R n . Suppose X ⊆ R n is a closed convex set and f is directionally differentiable at all x ∈ X, then a vector ¯x∈X is a directional stationary solution of the following problem minimize x∈X f(x) if f ′ (¯x;x− ¯x)≥ 0 for all x∈X. More generally speaking, a stationary solution of an optimization problem refers to a feasible solution that satisfies a certain necessary condition for a local minimizer, and such condition is commonly defined via generalized notions of differentiations given that differ- entiability is not present. In this spirit, directional derivative is employed to define the stationarity condition for the d-stationary solutions: f ′ (¯x;x− ¯x)≥ 0 for all x∈X, which is necessarily satisfied for all the local solutions. Furthermore, we can also extend such condi- tions to a nonconvex closed setX through the help of tangent cones. A more comprehensive discussions on various concepts of stationary solutions can be found in [29] and [70]. What isworthemphasizingisthat, forafunctionthatisdirectionallydifferentiable, ad-stationary 32 solution is of the sharpest kind among other common types of stationary points, e.g., Clarke and limiting stationary solutions. This means that a d-stationary solution sufficiently satis- fies the conditions for these alternative types of stationary solutions, see Chapter 6.1 in [29] for a nice summary of the relevant theories. Due to the fact that both L and Φ are directionally differentiable everywhere, we can subsequentlyobtainthefollowingcharacterizationofthedirectionalderivativeofφ≜L+γ Φ at a ¯x∈R n along a d∈R n : φ ′ (¯x;d) ≜ ∇L(¯x) ⊤ d+γ X i∈A>(¯x) p i Φ ′ i (¯x i ;d i )+ X i∈A<(¯x) p i Φ ′ i (− ¯x i ;− d i )+ X i∈A=(¯x) p i Φ ′ i (0;|d i |) where we define the following three index sets: A > (¯x) ≜ {i∈[n]: ¯x i >0}, A < (¯x) ≜ {i∈[n]: ¯x i <0} and A = (¯x) ≜ {i∈[n]: ¯x i =0} Utilizing this result, we can derive the following equivalent conditions for a d-stationary solution of (2.1). The detailed proof is omitted here, as it closely resembles the arguments for the upcoming Lemma 15. 
Proposition 13. Let sign(s)=1 if s≥ 0 and sign(s)=− 1 if s<0. Then, a vector ¯x∈R n is a d-stationary solution of (2.1) if and only if the following conditions hold: • Φ i is differentiable at |¯x i | for all i∈A > (¯x)∪A < (¯x) so that ¯x i ̸=ℓ i or u i ; • ∂L(¯x) ∂x i +sign(¯x i )γp i Φ ′ i (|¯x i |)=0 for all i∈A > (¯x)∪A < (¯x) so that ¯x i ̸=ℓ i or u i ; • ∂L(¯x) ∂x i ≤ γp i Φ ′ i (0;1) for all i∈A = (¯x); • ∂L(¯x) ∂x i − γp i Φ ′ i (|¯x i |)≥ 0 for all i∈[n] such that ¯x i =ℓ i ; • ∂− L(¯x) ∂x i − γp i Φ ′ i (|¯x i |)≥ 0 for all i∈[n] such that ¯x i =u i . 33 Finally, an interesting fact about the d-stationary solutions of (2.1) is that they are not only necessary but also sufficient to be local optimal solutions when each Φ i is additionally piecewiselinear-quadratic, e.g., whenitispiecewiseaffine, see[26]foramorecomprehensive discussion. 2.2 The GHP Algorithm In what follows, we present the promised linear-step algorithm, named as the GHP algo- rithm, for computing a d-stationary solution of problem (2.1), where L and Φ satisfy the assumptions we stated in the previous section. Note that the GHP algorithm and its ana- lytical results remain valid when p i =1,ℓ i =−∞ and u i =∞ for all i∈[n], thus for ease of notation we assume that theses configurations hold for the rest of this chapter. At a high level, GHP is a double-loop scheme that is guaranteed to obtain a d-stationary solution of (2.1) by solving at most linearly (in n) many subproblems. These subproblems have simple sign constraints and smooth objective functions. By the end of the forthcom- ing section we will discuss how to leverage the Z-function property of ∇L to compute an appropriate notion of solution for such subproblem by solving at most linearly (in n) many nonlinear equations. TheouterloopofGHPischaracterizedbyanonnegativeintegerν ,whereeachouterstep ν is associated with a tuple (S ν , ¯ S ν ) of disjoint index sets S ν and ¯ S ν ⊆ [n],S ν ∪ ¯ S ν = [n], whichwewillrefertoasthesignstructureofthisouterstep. Each(S ν , ¯ S ν )isassociatedwith an outer loop subproblem that is solved by an inner loop subroutine. Each step of this inner loop is further indexed by (ν )t for nonnegative integer t, and we define j (ν )t i ∈ [J i ], referred to as the piece selection, to assign index i to a particular piece of Φ i . Such piece selection will determine an inner loop subproblem for us to solve. When an inner loop, corresponding totheouterloopν , isterminatedatt=t ∗ , wedefine j (ν ) i ∈[J i ]tobethefinalpieceselection at this last inner step (ν )t ∗ . Additionally, we denote e Φ (i) j (x i )≜Φ (i) j (− x i ). 34 Before we proceed with the formal introduction of the outer and inner loops of GHP, we would like to present the following assumptions, which will hold true when γ is sufficiently small or when Φ i functions are concave piecewise affine. Assumption 14. The upcoming inner loop subproblem (2.6) is always strongly convex hence has a unique global optimal solution. It is important to note that these assumptions are temporarily imposed solely for the sake of illustrating our inner loop subroutine and our analysis of the GHP algorithm. At the end of the next section, we will demonstrate how these assumptions can be easily relaxed without affecting our theoretical results. Algorithm 1 GHP: Outer Loop 1: Initialize with S 1 =[n], ¯ S 1 =∅, j (0) i ≜J i for all i∈[n] and E 0 =∅. 2: for ν =1,2,... 
do 3: Initialize the inner loop method with: j (ν )1 i =j (ν − 1) i for all i∈(S ν − 1 \E ν − 1 )∪ ¯ S ν − 1 j (ν )1 i =1 for all i∈E ν − 1 4: to solve a d-stationary solution x ν of the following outer loop subproblem: minimize x∈R n L(x)+γ n X i=1 Φ i (s ν i x i ) subject to x¯S ν ≤ 0, x S ν ≥ 0. (2.5) 5: where s ν i =1 if i∈S ν , s ν i =− 1 if i∈ ¯ S ν . 6: Output{j (ν ) i } i∈[n] (to be defined in inner loop). 7: Check E ν ≜ i∈S ν :x ν i =0, (∇L(x ν )) i +γ e Φ (i) 1 ′ (0)>0 . 8: if E ν ̸=∅ then 9: Update S ν +1 =S ν \E ν , ¯ S ν +1 = ¯ S ν ∪E ν and proceed to the next step ν +1 10: else 35 11: Terminate and output solution x ∗ ≜x ν . 12: end if 13: end for Algorithm 2 GHP: Inner Loop for Problem (2.5) at Outer Step ν 1: Input n j (ν − 1) i o i∈[n] ,E ν − 1 and initialize with: j (ν )1 i =j (ν − 1) i for all i∈(S ν − 1 \E ν − 1 )∪ ¯ S ν − 1 j (ν )1 i =1 for all i∈E ν − 1 2: for t=1,2,... do 3: Solve the global optimal solution x (ν )t of the following inner loop subproblem: minimize x∈R n L(x)+γ X i∈ ¯S ν e Φ (i) j (ν )t i (x i )+γ X i∈S ν Φ (i) j (ν )t i (x i ) subject to x¯S ν ≤ 0, x S ν ≥ 0 (2.6) 4: and check ¯ E (ν )t ≜ i∈ ¯ S ν : x (ν )t i ≤− ξ (i) j (ν )t i if j (ν )t i <J i E (ν )t ≜ i∈S ν : x (ν )t i ≤ ξ (i) j (ν )t i − 1 if j (ν )t i >1 5: if ¯ E (ν )t ∪E (ν )t ̸=∅ then 36 6: Update the followings and proceed to the next step t+1: for all i∈ ¯ E (ν )t , set j (ν )t+1 i = j if j (ν )t i <j <J i ,− ξ (i) j <x (ν )t i ≤− ξ (i) j− 1 J i if x (ν )t i ≤− ξ (i) J i − 1 for all i∈E (ν )t , set j (ν )t+1 i = j if 1<j <j (ν )t i , ξ (i) j− 1 <x (ν )t i ≤ ξ (i) j 1 if 0≤ x (ν )t i ≤ ξ (i) 1 for all i∈[n]\( ¯ E (ν )t ∪E (ν )t ), set j (ν )t+1 i =j (ν )t i 7: else 8: Terminate and output solution x ν ≜x (ν )t with n j (ν ) i ≜j (ν )t i o i∈[n] . 9: end if 10: end for Thegeneralideabehindtheoverall algorithmisthatweleveragethepropertyofM-function to tackle the combinatorial nature of the original problem (2.1). Indeed, we can break down the non-smoothness of the absolute value function by fixing the sign structure ( S ν , ¯ S ν ) of variable x in the outer loop. The non-smoothness of the folded concave Φ i can be further peeled off in the inner loop by designating the piece selection j (ν )t i , which implicitly restricts thevalueofx i ineither − ξ (i) j (ν )t i ,− ξ (i) j (ν )t i − 1 or ξ (i) j (ν )t i − 1 ,ξ (i) j (ν )t i . Asaresult,weobtainasmooth problem of type (2.6) which is known to be solvable inO(n 4 ) if it is additionally a quadratic program. For instance when L(x) = 1 2 x ⊤ Qx+q ⊤ x, where Q is an M-matrix and each Φ i is concave piecewise affine. Inthissense, ourgoalcanbeequivalentlycastedasfindingthecorrectsignstructureand piece selection, whose associated inner loop subproblem (2.6) yields a d-stationary solution of (2.1). The GHP algorithm accomplishes this through the following dynamics. Initially,westartwithallthevariablesrestrictedtobex i ≥ 0henceS 1 =[n], ¯ S 1 =∅,and alltheindicesi∈[n]aredesignatedwithj (0) i =J i hencetheyareassignedtopieceΦ (i) J i onthe nonnegativeorthant. Byupdatingpieceselection n j (ν )t i o i∈[n] , theinnerloopkeepsrectifying 37 the errors that prevent inner loop problem (2.6) from producing a d-stationary point for an outer loop problem (2.5). More specifically, it corrects the index i∈[n] whose solution x (ν )t i falls out of the region assigned by its associated piece, namely either − ξ (i) j (ν )t ,− ξ (i) j (ν )t − 1 or ξ (i) j (ν )t − 1 ,ξ (i) j (ν )t . 
Once a d-stationary solution of problem (2.5) is obtained by the terminated inner loop, theouterloopwillverifyifthissolutionisindeedd-stationaryfortheoriginalproblem(2.1). This is accomplished by examining whether an index i∈ [n] achieving x ν i = 0 possesses the correct derivative information from the left of the origin 0. With the aid of the M-function property of∇L, we later can prove that throughout this entireprocedure,thesolutionsobtainedfromtheinnerloopproblems(2.6)exhibitadecrease in their element-wise value. This renders us a monotone scheme for searching the correct sign structure and piece selection. If we view a piece selection n j (ν )t i o i∈[n] under S ν , ¯ S ν as a soft constraint on the variable values to be situated within region − ξ (i) j (ν )t ,− ξ (i) j (ν )t − 1 or ξ (i) j (ν )t − 1 ,ξ (i) j (ν )t ,thenouralgorithmcommenceswiththeassumptionthatallvaluesliewithin region ξ (i) J i ,∞ , and uni-directionally moves indices i to the downstream regions towards −∞ . 2.3 Algorithm Analysis In what follows, we prove that our operations in the inner loop (respectively, the outer loop) willterminatewithad-stationarysolutionof (2.5)(respectively, (2.1))inlinear(in n)steps, and in worst case our algorithm needs to solve at most linearly (in n) many subproblems of type (2.6). The sketch of our proof is as follows. In Lemma 15 (respectively, Lemma 17), we give conditions for a d-stationary solution of (2.5) (respectively, (2.6)) to be a d-stationary solution of (2.1) (respectively, (2.5)). Then we present Corollary 18 as an extension of Theorem 10, namely the non-linear generalization of the classic LET for LCP, which will serve as the key tool for us to show the linear-step convergence of the inner loop, which 38 is summarized in Proposition 21. The high level idea behind Proposition 21 is that, we inductively show that in each inner step t, our solution x (ν )t i will not fall to the right of ξ (i) j (ν )t i or− ξ (i) j (ν )t i − 1 . The base case of such inductive argument will be taken care of by Lemma 19, and the intermediate steps can be handled by Lemma 20. Finally in Theorem 22, we show the convergence of the outer loop to a d-stationary point of (2.1), and the overall inner steps we need to achieve this is linear in n. Lemma 15. Let x ν be a d-stationary solution of (2.5) with x ν ¯S ν <0, then (i)|x ν i |̸=ξ (i) j for all j∈[J i − 1], (ii) x ν is a d-stationary solution of (2.1) if and only if E ν =∅. Proof. Denote g i (x)≜Φ i (s ν i x i ) in (2.5), the characterization of d-stationarity of x ν is n X i=1 (∇L(x ν )) i d i +γg ′ i (x ν ;d)≥ 0 forallfeasibledirectiondofconstraintsetof (2.5)atx ν . Notethatforalli∈S ν (respectively ¯ S ν ), we have ξ (i) j− 1 ≤ x ν i < ξ (i) j (respectively − ξ (i) j ≤ x ν i < − ξ (i) j− 1 ) for some j ≥ 1 and we denote such j as j(i). 
The directional derivative g ′ i (x ν ;d) can be determined as follow, for all i∈S ν : g ′ i (x ν ;d) = Φ (i) j(i) ′ (x ν i )d i when x ν i ∈ ξ (i) j(i)− 1 ,ξ (i) j(i) if j(i)̸=1 x ν i ∈ h ξ (i) j(i)− 1 ,ξ (i) j(i) if j(i)=1 g ′ i (x ν ;d) = Φ (i) j(i)− 1 ′ (x ν i )d i if d i ≤ 0 Φ (i) j(i) ′ (x ν i )d i if d i ≥ 0 when x ν i =ξ (i) j(i)− 1 and j(i)̸=1 39 and for all i∈ ¯ S ν : g ′ i (x ν ;d) = e Φ (i) j(i) ′ (x ν i )d i when x ν i ∈ − ξ (i) j(i) ,− ξ (i) j(i)− 1 g ′ i (x ν ;d) = e Φ (i) j(i)− 1 ′ (x ν i )d i if d i ≤ 0 e Φ (i) j(i) ′ (x ν i )d i if d i ≥ 0 when x ν i =− ξ (i) j(i) If we have some i∈S ν and x ν i =ξ (i) j(i)− 1 then (∇L(x ν )) i d i +γg ′ i (x ν ;d)= (∇L(x ν )) i +γ Φ (i) j(i)− 1 ′ (x ν i ) d i if d i ≤ 0 (∇L(x ν )) i +γ Φ (i) j(i) ′ (x ν i ) d i if d i ≥ 0 By Φ (i) j(i)− 1 ′ (x ν i ) > Φ (i) j(i) ′ (x ν i ) we can always make the term (∇L(x ν )) i d i + γg ′ i (x ν ;d) arbitrarily small by making d i sufficiently large or small, hence this case cannot happen. Similarly, we cannot have x ν i =− ξ (i) j(i) for all i∈ ¯ S ν . In sum, we have shown part (i). Denote A ν 0 ≜{i∈ S ν : x ν i = 0}, keeping n X i=1 (∇L(x ν )) i d i +γg ′ i (x ν ;d)≥ 0 for all feasible direction d is equivalent to (∇L(x ν )) i +γ e Φ (i) j(i) ′ (x ν i )=0 ∀i∈ ¯ S ν (∇L(x ν )) i +γ Φ (i) j(i) ′ (x ν i )=0 ∀i∈S ν \A ν 0 (∇L(x ν )) i +γ Φ (i) 1 ′ (0)≥ 0 ∀i∈A ν 0 Forx ν tobead-stationarysolutionof (2.1),weadditionallyneed(∇L(x ν )) i +γ e Φ (i) 1 ′ (0)≤ 0 for all i∈A ν 0 , which proves our statement (ii). Remark 16. The part (i) of Lemma 15, i.e.,|x ν i |̸= ξ j for all j∈ [J i − 1], is crucial for our latter analysis. Thus in what follows, we will highlight the importance of this statement in advance. 40 • We will apply this to derive the conditions in Lemma 17 and to justifying the base case in Proposition 21. • This statement tells us that, for an inner loop x (ν )t to be a d-stationary solution of (2.5), when an index i is assigned to some e Φ (ν )t j , then the “correct” region for the value ofx (ν )t i shouldbestrictly − ξ (i) j ,− ξ (i) j− 1 sincex (ν )t i cannottake− ξ j or− ξ j− 1 . Likewise, when an index i is assigned to Φ (ν )t j , the “correct” region for x (ν )t i is ξ (i) j− 1 ,ξ (i) j . Thus in Lemma 17, we will use weak inequalities, namely “≥ ” and “≤ ”, to introduce the two types of error preventing x (ν )t from being d-stationary for (2.5), namely when x (ν )t j falls to the left of its “correct” region which we call type E errors (sets ¯ E (ν )t ,E (ν )t in Lemma 17), and when x (ν )t j falls to the right of its “correct” region which we call type G errors (sets ¯ G (ν )t ,G (ν )t in Lemma 17). • Theideabehindthedesignofourinnerloopisthat,forafixed n j (ν )t i o i∈[n] ateachinner stept,werectifytheindiceswithtypeEerrorsbyre-assigningthemtothepieceswhich correspond to the regions where their value x (ν )t i currently reside. In our upcoming analysis, we will show that such regions are always on the left of − ξ (i) j (ν )t i ,− ξ (i) j (ν )t i − 1 or ξ (i) j (ν )t i − 1 ,ξ (i) j (ν )t i . A key detail is that whenever x (ν )t i = − ξ (i) j ′ for some j ′ ≥ j or x (ν )t i =ξ (i) j ′ for some j ′ <j, we will always re-assign it to the region on the left of± ξ (i) j ′ in the next inner step t+1. For these cases, the statement (ii) of Lemma 20 will allow us to show that x (ν )t+1 i < x (ν )t i = ± ξ j ′. Thus, type G error will not happen on these indices in the subsequent step t+1. Lemma 17. 
Suppose the global optimal solution x (ν )t of problem (2.6) has x (ν )t ¯S ν < 0, then x (ν )t will be a d-stationary solution of problem (2.5) if the following sets are empty: ¯ E (ν )t ≜ i∈ ¯ S (ν )t :x (ν )t i ≤− ξ (i) j (ν )t i ,j (ν )t i <J i , E (ν )t ≜ i∈S (ν )t :x (ν )t i ≤ ξ (i) j (ν )t i − 1 ,j (ν )t i >1 ¯ G (ν )t ≜ i∈ ¯ S (ν )t :x (ν )t i ≥− ξ (i) j (ν )t i − 1 ,j (ν )t i >1 , G (ν )t ≜ i∈S (ν )t :x (ν )t i ≥ ξ (i) j (ν )t i ,j (ν )t i <J i 41 Proof. Denote the objective function of (2.6) as f (ν )t (x), since this is a convex problem by Assumption14,anequivalentcharacterizationofx (ν )t isthatf ′ (ν )t (x (ν )t ;d)≥ 0forallfeasible direction d of constraint set of (2.6) at x (ν )t . More specifically we have: for all i∈S ν , if x (ν )t i >0, ∇L(x (ν )t ) i +γ Φ (i) j (ν )t i ′ x (ν )t i =0 if x (ν )t i =0, ∇L(x (ν )t ) i +γ Φ (i) j (ν )t i ′ x (ν )t i ≥ 0 for all i∈ ¯ S ν , ∇L(x (ν )t ) i +γ e Φ (i) j (ν )t i ′ x (ν )t i =0 It is easy to verify that if ¯ E (ν )t =E (ν )t = ¯ G (ν )t =G (ν )t =∅, then x (ν )t will be a d-stationary solution of (2.5). Corollary 18. If f :R n 7→R is continously differentiable and F ≜∇f is a Z-function, then there exists a d-stationary solution ¯x of the following problem minimize x∈R n f(x) subject to − η ¯S ≤ x¯S ≤ 0, x S ≥ 0 (2.7) where η ∈R n is nonnegative, being the least element of set: E f ≜ ξ ∈R n : ξ S ≥ 0,− η ¯S ≤ ξ ¯S ≤ 0 for all i∈S :(F(ξ )) i ≥ 0 for all i∈ ¯ S : if ξ i <0⇒(F(ξ )) i ≥ 0 If we additionally have f being strongly convex, such ¯x is unique and globally optimal. Proof. Problem (2.7) only possesses box and sign constraints, hence solving its d-stationary solution is equivalent to solving its corresponding KKT system. By a simple application of Theorem 10, we can conclude the existence of a d-stationary ¯x for (2.7) being the least element ofE f . Furthermore, when f is strongly convex, we have F strongly monotone hence 42 a P-function, thus it is also an M-function and we can apply Theorem 10 to conclude the uniqueness of ¯x. Lemma 19. For any outer loop step ν , the solution of its first inner loop step t=1 has (i) x (ν )1 ≤ x ν − 1 and (ii) x (ν )1 E ν − 1 <0. Proof. First we introduce a vector η ∈ R n , whose element η i > 0, for all i ∈ [n], is large enough so that x (ν )1 ≥− η,x ν − 1 ≥− η . Note that x (ν )1 is the solution of: minimize x∈R n L(x)+γ X i∈ ¯S ν e Φ (i) j (ν )1 i (x i )+γ X i∈S ν Φ (i) j (ν )1 i (x i ) subject to − η ¯S ν ≤ x¯S ν ≤ 0, x S ν ≥ 0. (2.8) By Corollary 18 we have x (ν )1 being the least element of the following set: E (ν )1 ≜ ζ ∈R n : ζ S ν ≥ 0,− η ¯S ν ≤ ζ ¯S ν ≤ 0 for all i∈S ν :(∇L(ζ )) i +γ Φ (i) j (ν )1 i ′ (ζ i )≥ 0 for all i∈ ¯ S ν : if ζ i <0⇒(∇L(ζ )) i +γ e Φ (i) j (ν )1 i ′ (ζ i )≥ 0 Also note that x ν − 1 is the unique solution of minimize x∈R n L(x)+γ X i∈ ¯S ν − 1 e Φ (i) j (ν − 1) i (x i )+γ X i∈S ν − 1 Φ (i) j (ν − 1) i (x i ) subject to − η ¯S ν − 1 ≤ x¯S ν − 1 ≤ 0, x S ν − 1 ≥ 0. Again, by Corollary 18 we have x ν − 1 being the least element of the following set: E ν − 1 ≜ z∈R n : z S ν − 1 ≥ 0,− η ¯S ν − 1 ≤ z¯S ν − 1 ≤ 0 for all i∈S ν − 1 :(∇L(z)) i +γ Φ (i) j (ν − 1) i ′ (z i )≥ 0 for all i∈ ¯ S ν − 1 : if z i <0⇒(∇L(z)) i +γ e Φ (i) j (ν − 1) i ′ (z i )≥ 0 43 Note that by definition S ν − 1 = S ν ∪ E ν − 1 , ¯ S ν = ¯ S ν − 1 ∪ E ν − 1 and x ν − 1 S ν ≥ 0,x ν − 1 E ν − 1 = 0,− η ¯S ν − 1 ≤ x ν − 1 ¯S ν − 1 ≤ 0 hence x ν − 1 satisfies the first condition in E (ν )1 . 
It is also easy to check that x ν − 1 satisfies the second and third conditions in E (ν )1 by our definition that j (ν )1 i = j (ν − 1) i for all i ∈ (S ν − 1 \E ν − 1 )∪ ¯ S ν − 1 . In sum, we have shown that x ν − 1 ∈ E (ν )1 , hence x (ν )1 ≤ x ν − 1 which gives us the statement in part (i). Now we show the statement in (ii), namely x (ν )1 E ν − 1 < 0. Suppose on the contrary x (ν )1 i ′ = 0 for some i ′ ∈ E ν − 1 . If we substitute x¯S ν = y¯S ν − η ¯S ν in (2.8), then we have x (ν )1 S ν ,y (ν )1 ¯S ν ≜x (ν )1 ¯S ν +η ¯S ν uniquelysolvesthefollowingnonlinearcomplementarityproblem: 0≤ x S ν ⊥ (∇L(x)) S ν +γ F (ν )1 x S ν y¯S ν − η ¯S ν S ν ≥ 0 0≤ y¯S ν ⊥ (∇L(x))¯S ν +γ F (ν )1 x S ν y¯S ν − η ¯S ν ¯S ν +λ ¯S ν ≥ 0 0≤ λ ¯S ν ⊥ η ¯S ν − y¯S ν ≥ 0 where (F (ν )1 (z)) i ≜ Φ (i) j (ν )1 i ′ (z i ) if i∈ S ν and (F (ν )1 (z)) i ≜ e Φ (i) j (ν )1 i ′ (z i ) if i∈ ¯ S ν . Notice that now y (ν )1 i ′ =η i ′ >0 so that 0=(∇θ (x (ν )1 )) i ′ +γ (F (ν )1 (x (ν )1 )) i ′ +λ (ν )1 i ′ ≥ (∇θ (x ν − 1 )) i ′ +γ e Φ (i ′ ) 1 ′ (0)+λ (ν )1 i ′ >0 where the first inequality is by x (ν )1 i ′ = x ν − 1 i ′ = 0,x (ν )1 ≤ x ν − 1 , ∇L being a Z-function and our choice of j (ν )1 i =1 for all i∈E ν − 1 , whereas the second inequality is due to the definition of E ν − 1 . Note that we have derived 0 > 0 which is apparently a contradiction, hence we must have x (ν )1 E ν − 1 <0. Lemma 20. Suppose we fix arbitrary S⊆ [n], ¯ S≜ [n]\S, and take some i + ∈ S,i − ∈ ¯ S if possible. Let f be strongly convex, smooth with∇f being Z-function, and for g 0 ,g + :R + 7→ 44 R and h 0 ,h + :R − 7→R (all differentiable functions), we assume that f +g 0 +h 0 ,f +g + + h 0 ,f +g 0 +h − are strongly convex. Furthermore the following problems (P 0 ) x 0 =argmin x f(x)+g 0 (x i + )+h 0 (x i − ) s.t. x S ≥ 0,x¯S ≤ 0 (P + ) x + =argmin x f(x)+g + (x i + )+h 0 (x i − ) s.t. x S ≥ 0,x¯S ≤ 0 (P − ) x − =argmin x f(x)+g 0 (x i + )+h − (x i − ) s.t. x S ≥ 0,x¯S ≤ 0 satisfy g ′ 0 (x + i + ) < g ′ + (x + i + ) and h ′ 0 (x − i − ) > h ′ − (x − i − ). Based on the conditions above, we can conclude that (i) x + ≤ x 0 ≤ x − , and (ii) ifx 0 i + >0 thenx + i + <x 0 i + , ifx 0 i − <0 thenx − i − >x 0 i − . Proof. Here, we use an argument that is similar to the proof of Lemma 19. By introducing η ≥ max{|x 0 |,|x + |,|x − |} element-wise we can restrict a lower bound− η ¯S ≤ x¯S ≤ 0 on (P 0 ) and (P + ). Consequently, we have x 0 being the least element of the following set: E 0 ≜ z∈R n : z S ≥ 0,− η ¯S ≤ z¯S ≤ 0 for all i∈S :(∇f(z)) i ≥ 0, for i̸=i + and (∇f(z)) i + +g ′ 0 (z i + )≥ 0 for all i∈ ¯ S : if z i <0, for i̸=i − ⇒(∇f(z)) i ≥ 0, if z i − <0 ⇒(∇f(z)) i− +h ′ 0 (z i − )≥ 0 Similarly x + is the least element of E + ≜ ζ ∈R n : ζ S ≥ 0,− η ¯S ≤ ζ ¯S ≤ 0 for all i∈S :(∇f(ζ )) i ≥ 0, for i̸=i + and (∇f(ζ )) i + +g ′ + (ζ i + )≥ 0 for all i∈ ¯ S : if ζ i <0, for i̸=i − ⇒(∇f(ζ )) i ≥ 0, if ζ i − <0 ⇒(∇f(ζ )) i− +h ′ 0 (ζ i − )≥ 0 45 It is straightforward to verify that x 0 ∈ E + hence x + ≤ x 0 . Moreover, for the comparison between (P 0 ) and (P − ), we can substitute x=− ξ in both problems and apply the previous conclusion to show that x 0 ≤ x − which completes (i). As for the statement (ii), when x 0 i + > 0, we can without loss of generality assume that x + i + > 0, otherwise we trivially have x + i + < x 0 i + . Suppose on the contrary x 0 i + = x + i + > 0, from the KKT system of (P + ) and (P 0 ), we can deduce that (∇f(x + )) i + +g ′ + (x + i + ) =0 and (∇f(x 0 )) i + +g ′ 0 (x 0 i + )=0. 
Hence, we have 0= (∇f(x + )) i + − (∇f(x 0 )) i + + g ′ + (x + i + )− g ′ 0 (x 0 i + ) >0 where we get the inequality from (∇f(x + )) i + − (∇f(x 0 )) i + ≥ 0 (since x 0 ≥ x + ,x 0 i + = x + i + and ∇f is a Z-function) and g ′ + (x + i + )− g ′ 0 (x 0 i + ) > 0 by our assumption. Note that now we have derived 0 > 0 which is apparently a contradiction, hence we must have x + i + < x 0 i + . On the other hand, we can similarly flip the sign ξ 0 ≜− x 0 and ξ − ≜− x − in (P 0 ) and (P − ), and conclude that ξ − i − <ξ 0 i − since ξ 0 i − >0 by assumption, which shows x − i − >x 0 i − . Proposition 21. Assume that x ν − 1 ¯S ν − 1 <0, then the inner loop subroutine will: (i) produce a non-increasing sequence x (ν )t+1 ≤ x (ν )t , (ii) terminate within n X i=1 (J i − 1) steps and when we terminate at step t ∗ , we have x ν ≜ x (ν )t ∗ being a d-stationary solution of (2.5), (iii) obtain x ν ≤ x ν − 1 with x ν ¯S ν <0. Proof. First, we inductively prove that for an arbitrary step t, we have ¯ G (ν )t = G (ν )t = ∅ and x (ν )t ¯S ν < 0, hence Lemma 17 is applicable at every step t. For the base case t = 1, by definition we have j (ν )1 i = j (ν − 1) i for all i∈ (S ν − 1 \E ν − 1 )∪ ¯ S ν − 1 , j (ν )1 i = 1 for all i∈ E ν − 1 and x ν − 1 being the d-stationary point of (2.5) in outer step ν − 1, we can apply the first part of Lemma 19 and conclude that: 46 • For all i∈S ν s.t. j (ν )1 i <J i : x (ν )1 j (ν )1 i ≤ x ν − 1 j (ν )1 i =x ν − 1 j (ν − 1) i <ξ (i) j (ν − 1) =ξ (i) j (ν )1 =⇒ x (ν )1 j (ν )1 i <ξ (i) j (ν )1 • For all i∈ ¯ S ν s.t. j (ν )1 i >1: x (ν )1 j (ν )1 i ≤ x ν − 1 j (ν )1 i =x ν − 1 j (ν − 1) i <− ξ (i) j (ν − 1) − 1 =− ξ (i) j (ν )1 − 1 =⇒ x (ν )1 j (ν )1 i <− ξ (i) j (ν )1 − 1 In other words, we have shown that ¯ G (ν )1 = G (ν )1 = ∅. Moreover by the second part of Lemma 19 we have x (ν )1 E ν − 1 <0, which gives us x (ν )1 ¯S ν <0 when combined with the first part of Lemma 19, namely x (ν )1 ¯S ν − 1 ≤ x ν − 1 ¯S ν − 1 <0. Suppose now we are at some arbitrary inner step t > 1, and from the last step t− 1 we have x (ν )t− 1 ¯S ν < 0 and ¯ G (ν )t− 1 = G (ν )t− 1 = ∅ satisfied by the induction hypothesis. By the first part of Lemma 20, our operations in step t− 1 make x (ν )t ≤ x (ν )t− 1 , hence we achieve x (ν )t ¯S ν ≤ x (ν )t− 1 ¯S ν < 0. To show G (ν )t = ¯ G (ν )t =∅, we first fix a i∈ S ν so that j (ν )t i <J i . When j (ν )t i = j (ν )t− 1 i , by G (ν )t− 1 =∅ we have x (ν )t i ≤ x (ν )t− 1 i < ξ (i) j (ν )t− 1 i = ξ (i) j (ν )t i . On the other hand, when j (ν )t i < j (ν )t− 1 i , by the rule of inner loop and the second part of Lemma 20, we obtain x (ν )t i < x (ν )t− 1 i ≤ ξ (i) j (ν )t i . In total we have shown that G (ν )t = ∅. Under a similar argument, we can show that ¯ G (ν )t =∅ as well. Thus we have completed the proof for G (ν )t = ¯ G (ν )t =∅ and x (ν )t ¯S ν <0 for an arbitrary inner step t. Note that eventually the inner loop will terminate either when ¯ E (ν )t = E (ν )t = ∅, or when j (ν )t i = J i for all ∈ ¯ S ν and j (ν )t i = 1 for all i ∈ S ν (which is sufficient to show ¯ E (ν )t =E (ν )t =∅). When we stop at step t ∗ , which should take us no more than n X i=1 (J i − 1) inner loop steps to accomplish, we have G (ν )t ∗ = ¯ G (ν )t ∗ = ¯ E (ν )t ∗ = E (ν )t ∗ =∅ and x (ν )t ∗ ¯S ν < 0. By Lemma 17, we can conclude that x (ν )t ∗ is a d-stationary solution for (2.5). 47 In summary, the part (ii) is proven and the part (i) has already been justified in our analysis above. 
Finally, to show the part (iii) we apply the part (i) and Lemma 19 to conclude x ν =x (ν )t ∗ ≤ x (ν )1 ≤ x ν − 1 and x ν ¯S ν =x (ν )t ∗ ¯S ν ≤ x (ν )1 ¯S ν <0. Theorem 22. The outer loop will terminate in n steps. When we terminate we will obtain a d-stationary solution x ∗ of (2.1), and the overall inner steps we need is linear in n. Proof. We start with S 1 = [n], ¯ S 1 =∅, hence trivially we have x 1 ¯S 1 < 0. By the statement (iii) in Proposition 21 and a trivial induction argument we obtain x ν ¯S ν <0 for all outer step ν , thus Lemma 15 is always applicable. The outer loop will terminate either with E ν =∅ or S ν =∅ (hence E ν =∅) in no more than n steps. By Lemma 15, this means that we will achieve a d-stationary point x ∗ ≜x ν ∗ of (2.1) when we stop at step ν ∗ ≤ n. Finally, our overall procedure can be seen as monotonically moving all the indices from initially being assigned to the rightmost region (i.e., assuming x i ≥ 0 and selecting piece J i ) towards the leftmost region (i.e., assuming x i ≤ 0 and selecting piece J i ). Suppose in each inner loop step, we only move one index from its current region to its left one, then in worst case we will execute n X i=1 (2J i − 1)≤ 2 max i∈[n] J i − 1 n inner loops, since each index will be moved at most 2J i − 1 times under this assumption. Remark 23. In Theorem 22, the linearity of the total inner step count is based on the assumption that J i are independent of problem dimension n. 2.3.1 On linear-step solvability and strong polynomial complexity The major computational workhorse of the GHP algorithm is the inner loop subproblem (2.6). From our previous analysis, to maintain our theoretical results, what we essentially 48 need is that the x (ν )t in each inner loop step to be a least d-stationary solution of (2.6), which means that it is d-stationary for (2.6) and is the least element of the following set E (ν )t ≜ ζ ∈R n : ζ S ν ≥ 0,− η ¯S ν ≤ ζ ¯S ν ≤ 0 for all i∈S ν :(∇L(ζ )) i +γ Φ (i) j (ν )t i ′ (ζ i )≥ 0 for all i∈ ¯ S ν : if ζ i <0⇒(∇L(ζ )) i +γ e Φ (i) j (ν )t i ′ (ζ i )≥ 0 where η i > 0 is sufficiently large for all i ∈ [n]. Note that the existence of such least d- stationary solution is guaranteed by Corollary 18, and such solution will be unique if the objective of (2.6) is additionally strongly convex. Consequently under Assumption 14, the unique global solution x (ν )t in each inner step will be the least element of E (ν )t and our analysis naturally applies. In more general cases where Assumption 14 does not hold, despite the possibility of havingmultipledistinct d-stationarysolutionsof (2.6), our theoreticalresultsare stillsound if we additionally assume that a subroutine is applied so that x (ν )t is computed as the least d-stationary solution of (2.6). Details of such subroutine is summarized in Section 6.4 of [43], and here we only discuss the major implication of its existence. From Section 6.4 of [43], we know that this subroutine can be terminated with the targeting least d-stationary solution by solving at most linearly (in n) many smooth nonlinear equations. Combining with Theorem 22, this means that the overall work load of GHP in the general scenario is at most solvingO(n 2 ) such equations. When our loss function L is quadratic and each Φ i is piecewise quadratic, the inner loop subproblem (2.6) can be formulated as a box constrained quadratic program. Notably, the matrix associated with the quadratic term in the objective function remains a symmetric Z-matrix. 
According to [68], it is known that the least d-stationary solution of such problem can be exactly computed by a subroutine withO(n 3 ) complexity. Thus, by employing such 49 subroutine, our GHP algorithm will achieve a strong polynomial complexity. We formally state this result in the following theorem which does not need a proof. Theorem 24. Let L(x) = 1 2 x ⊤ Qx+q ⊤ x, where q ∈ R n is a given vector and Q ∈ R n× n is a symmetric Z-matrix, and let each Φ be piecewise quadratic. The GHP algorithm will compute a d-stationary solution of (2.1) inO(n 4 ) complexity. Finally, it is worth noting that when each Φ i is already smooth, e.g., when it is a SCAD functionorsimplyalinearfunction,theinnerloopsubroutineisnolongernecessarysinceitis specificallydesignedtotacklethenonsmoothfoldedconcavefunctions. Insuchcases,wecan modify the outer loop scheme to compute a d-stationary solution x ν that satisfies x ν ≤ x ν − 1 elementwise in each step ν , except for ν = 1. By applying an argument similar to that in Lemma 19 and Theorem 22, we can inductive prove the same linear-step termination as stated in Theorem 22. The existence and computation of such x ν follows the same principle as discussed previously, where x ν should be the least d-stationary solution of the outer loop subproblem (2.5), and a subroutine can be designed to find this solution by solving at most linearly many smooth nonlinear equations. Further details regarding this discussion can be found in Section 6.1 of [43]. Remark25. Itisimportanttoemphasizethatourcomplexityresultsinthischapterpertain to the exact computation of the desired solutions. This should not be confused with an iterative algorithm that only converge to a solution in its limit. The complexity of such iterative algorithm is typically measured by the number of iterations required to compute an approximate solution within a specified accuracy. 2.4 Numerical Experiments In this section, we test the practical efficiency of our GHP algorithm on problem (2.1) when a quadratic loss L(x) = 1 2 x ⊤ Qx+q ⊤ x, whose Q is a Stieltjes matrix (namely a symmetric positive definite Z-matrix, see Definition 6), is paired up with the following two choices of 50 regularizer. The first choice is when Φ i (t) = t which leads to the ℓ 1 -regularizer, and the second choice is Φ i (t)=min t δ ,1 with a fixed δ > 0, which is the capped ℓ 1 -regularizer. Notethattheproblemswithℓ 1 -regularizersareconvexquadraticprograms, hencecanbe handledbywell-establishedsolverssuchasGUROBI[45]andMOSEK9.3[5]. Consequently, we are interested in comparing GHP with theses available commercial solvers. On the other hand, fortheproblemswiththecapped ℓ 1 -regularizer, wecomparetheGHPAlgorithmwith GUROBI when they are applied to a mixed integer reformulation specified Subsection 2.4.2. We synthetically generate a sparse Stieltjes matrix Q with n = 5,000 (a meaningful size for experimental purposes) and the density of the off-diagonal entries equals to σ ∈ {0.01,0.2}, namely thenumberof nonzero off-diagonalentriesis σn (n− 1). Thenonzero off- diagonalentriesareuniformlygeneratedwithintheinterval[− 1,0); thenwesetthediagonal terms Q ii =1.2 X j∈[n],j̸=i |Q ij | for all i ∈ [n] so that Q is positive definite. Additionally, we randomly generate the components of q independently and uniformly in [− 100,100]. 
As mentioned earlier, the GHP algorithm can be readily extended to handle problem (2.1) with box constraints ℓ≤ x≤ u and weighted folded concave regularizer Φ= n X i=1 p i Φ i , where ℓ i < 0,u i > 0 and p i > 0 for all i ∈ [n]. In our experiments, we test with p i independently and uniformly generated in (0,1]. When the box constraints are imposed, we set− ℓ i =u i = 2 3 ∥Q − 1 q∥ ∞ foralli∈[n],where− Q − 1 q istheunconstrainedminimizerofloss L. All the statistics reported in the tables below are averaged over 5 random instances. The experiments were carried out within MATLAB R2017b on a Mac OS X personal computer with 2.3 GHz Intel Core i7 and 8 GB RAM. 2.4.1 ℓ 1 -penalized problems WhenapplyingtheGHPalgorithmtotheℓ 1 -penalizedproblems,itisworthnotingthateach outer loop subproblem (2.5) is already strongly convex and smooth. As a result, there is no needtoinvoketheinnerloopsubroutine. Instead, wesolveeachouterloopsubproblem(2.5) as an LCP using the method described in [68]. In Table 2.1, we provide a summary of the 51 computational time (in seconds) and the number of outer loop steps (referred to as “steps I”) taken before termination. Furthermore, we fix the hyperparameter γ in (2.1) to be 0.01, 1, or 10, and abbreviate the computational results of GUROBI and MOSEK as “GRB” and “MSK” respectively. In this case, all methods solve the optimization problem to optimality, due to the convexity of ℓ 1 -function. From the results presented in Table 2.1, it is evident that GHP outperforms the bench- mark commercial solvers by a significant margin, with a speedup of at least five times in almost all settings. Moreover, the proposed method demonstrates superior exploitation of data sparsity. Particularly, for cases with low off-diagonal density ( σ = 0.01), GHP can be 160 times faster than GRB (respectively MSK), as observed in the constrained setting with (γ,σ ) = (10,0.01). For similar cases with σ = 0.2, GHP is 40 times faster than GRB and 6 times faster than MSK. Additionally, it is noteworthy that while GHP may require a worst-case of n outer steps, in practice it typically terminates within an average of five steps. Thus, the practical per- formance of GHP often surpasses the worst-case complexity of O(n 4 ) (as discussed in the previous section). Settings Unconstrained Constrained GHP GRB time MSK time GHP GRB time MSK time time steps I time steps I (γ,σ ) = (0.01,0.01) 1.32 5.2 176.20 59.71 1.49 5.4 197.50 77.12 (γ,σ ) = (0.01,0.2) 10.51 5.4 286.17 61.16 11.10 5.2 335.78 63.63 (γ,σ ) = (1,0.01) 1.35 5.2 174.93 52.49 1.32 5.0 198.54 81.95 (γ,σ ) = (1,0.2) 12.28 6.0 295.37 77.52 11.39 5.4 333.67 60.80 (γ,σ ) = (10,0.01) 1.13 5.0 196.06 56.13 1.16 5.0 185.70 88.44 (γ,σ ) = (10,0.2) 9.23 5.2 312.28 63.86 9.34 5.0 331.36 59.10 Table 2.1: Computational results for the ℓ 1 -problems. 52 2.4.2 Capped ℓ 1 -penalized problems Besides the ℓ 1 -cases, we also test and highlight the advantages of the GHP Algorithm when applied to capped ℓ 1 -problems. We fix γ =1 and δ = 1 6 ∥Q − 1 q∥ ∞ in both unconstrained and constrained settings. Table 2.2 summarizes the computing time, total number of inner loop steps (labelled steps II) and the final objective value (denoted as “obj.”) when our method terminates with a d-stationary point x ghp . As a comparison, we apply GUROBI to solve the cappedℓ 1 -problemafteritisreformulatedasthefollowingmixedintegerquadraticprogram: minimize x,t,s∈R n 1 2 x ⊤ Qx + q ⊤ x + γp ⊤ t subject to |x i | δ ≤ t i +Ms i ,∀i∈[n] 1≤ t i +M(1− s i ),∀i∈[n] ℓ≤ x≤ u, and s∈{0,1} n . 
where M≜ max 2∥˜ q∥ 2 δλ min (Q) , 1 if unconstrained u 1 δ (=4) if constrained ˜ q i ≜q i +sign(q i ) γ δ ∀i ∈ [n] λ min (Q)≜ minimal eigenvalue of Q The choice of M for the unconstrained cases is motivated by the fact that if x ∗ is an optimal solution of the capped ℓ 1 -problem, then∥x ∗ ∥ ∞ ≤ 2∥q ∗ ∥ 2 λ min (Q) ≤ 2∥˜ q∥ 2 λ min (Q) , where q ∗ i ≜ q i + γ δ p i if 0≤ x ∗ i ≤ δ q i − γ δ p i if − δ ≤ x ∗ i <0 q i if|x ∗ i |>δ In Table 2.2, we record the best objective values found by GUROBI when it is set to stop at the time limit of 200 and 600 seconds. Additionally, we also initialize GUROBI with 53 x ghp and rerun it for 600 seconds, with the results included in the column named “GRB cont.”. Additionally, we also present the gaps between the upper and lower bounds of global optimality when GUROBI is terminated. Finally, we use “Unc.” (respectively, “Con.”) to represent the unconstrained (respectively, constrained) cases. Settings GHP GRB (time lim.=200) GRB (time lim.=600) GRB cont. (time lim.=600) obj. time steps II obj. gap obj. gap obj. gap Unc. σ =0.01 -3.3732e5 1.58 13.4 -3.3732e5 >100% -3.3732e5 0.58% -3.3732e5 0.53% Unc. σ =0.2 -1.6360e4 21.18 23.0 0 >100% -1.6306e4 >100% -1.6360e4 >100% Con. σ =0.01 -3.7314e5 1.74 13.0 0 >100% -3.3713e5 0.88% -3.7314e5 0.42% Con. σ =0.2 -1.6290e4 18.53 21.4 0 >100% -1.6214e5 20.93% -1.6291e4 18.61% Table 2.2: Computational results for the capped ℓ 1 -problems. Itisevidentthatthed-stationarypointsx ghp achievedbyGHP(in2secondsforσ =0.01, and20secondsforσ =0.2)arebetterthantheincumbentsfoundbyGUROBIin200seconds, with the exception of the unconstrained, σ = 0.01 scenario. On the other hand, when the time limit is set to be 600 seconds, GUROBI usually finds feasible solutions whose objective values are close to (but slightly worse than) the objective value of x ghp , while requiring significantly more time to compute. Finally, we note that the optimality gaps provide by GUROBI can serve as a certificate that, for the σ =0.01 scenario, the solution delivered by GHP has an objective value that is (at most) 1% worse than the global minimizer of the problem. 2.4.3 Verifying strong polynomial complexity Toverifythestrongpolynomialcomplexityofourmethod, wetestitoncapped ℓ 1 -problems, bothconstrainedandunconstrained,whereQhavingoff-diagonaldensity σ ∈{0.01,0.05,0.2}. We consider various sizes n ∈ {500,1000,2000,4000,8000,16000}. Parameters γ , δ , u and ℓ are chosen as identical to the previous experiments. The computing time and the total number of inner loop steps (labelled as steps II) are summarized in Table 2.3. We also plot 54 the computing time as a function of n under various settings in Figure 2.4. These results indicate that doubling n results in an 4 to 6 times increase in computing times, suggesting that the practical complexity of the GHP method is roughly of the order n 2.5 in the exper- iments, which is considerably less than the O(n 4 ) complexity as suggested by Theorem 24. Additionally, the total number of steps II is also significantly less than the worst-case bound of linear (in n) number of inner loop subproblems (2.6) as stipulated in Theorem 22. 
Unconstrained Settings n=500 n=1000 n=2000 n=4000 n=8000 n=16000 σ =0.01 time 0.03 0.04 0.15 0.61 2.79 11.39 steps II 6.4 6.4 7.4 8.2 9.6 10.2 σ =0.05 time 0.04 0.10 0.43 2.18 9.64 50.36 steps II 7.0 8.4 8.6 10.0 11.4 14.6 σ =0.2 time 0.07 0.32 1.44 6.33 25.92 86.68 steps II 8.6 9.2 10.8 12.2 14.2 14.0 Constrained Settings n=500 n=1000 n=2000 n=4000 n=8000 n=16000 σ =0.01 time 0.04 0.04 0.14 0.71 3.72 15.89 steps II 6.4 6.4 7.4 8.6 9.8 10.4 σ =0.05 time 0.03 0.10 0.49 2.45 12.12 61.28 steps II 7.0 8.4 9.2 10.2 12.2 15.0 σ =0.2 time 0.08 0.32 1.49 6.23 27.92 85.06 steps II 8.6 8.8 10.4 11.4 14.2 13.6 Table 2.3: Computational results for the capped ℓ 1 -problems. Insummary,ourfindingsinthesenumericalexperimentsdemonstratethatGHP,withits favorable computational complexity, translates into a highly efficient algorithm in practice when Z-property is present. It significantly outperforms leading off-the-shelf solvers, and at the same time provides solutions of high-precision without requiring additional computa- tional effort. 55 (a) Unconstrained. (b) Constrained. Figure 2.4: Computing time vs. size n for capped ℓ 1 -problems. 56 Chapter 3 Parametric Extensions with Capped ℓ 1 -Regularizers In this chapter, following our disucssions in Section 1.3, we present our parametric program- ming study of the sparsity-regularized quadratic program (1.7), which we re-introduce in below: minimize ℓ≤ x≤ u 1 2 x ⊤ Qx + q ⊤ x + γ Φ( x) (3.1) Here, q ∈ R n is a given vector, Q ∈ R n× n is a Stieltjes matrix (see Definition 6) hence a symmetric and positive definite Z-matrix. Moreover, the vectors u,ℓ satisfy the followings: 0<min{− ℓ i ,u i }≤ max{− ℓ i ,u i } for all i∈[n] . Additionally, we consider three candidates for Φ, namely the (weighted) ℓ 0 -function Φ( x)= n X i=1 p i |x i | 0 , the (weighted) ℓ 1 -function Φ( x) = n X i=1 p i |x i |, and the (weighted) capped ℓ 1 - function. The capped ℓ 1 -function, which is a special case of the piecewise affine folded concave regularizer, is formulated as: Φ( x) = n X i=1 p i min |x i | δ ,1 where p i > 0 for all i ∈ [n] and δ > max max i∈[n] u i , max i∈[n] (− ℓ i ) are prescribed parameters. Specifically, we are interested in the analytical properties and computational approaches for 57 the solution paths of (3.1) as a function of hyperparameter γ ≥ 0, when Φ is chosen as its three candidates respectively. For convenience, we introduce the following notations: f 0 (γ ) ≜ minimum ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x+γ n X i=1 p i |x i | 0 f 1 (γ ) ≜ minimum ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x+γ n X i=1 p i |x i | f cap (γ ) ≜ minimum ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x+γ n X i=1 p i min |x i | δ ,1 f d cap (γ ) ∈ loc-min ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x+γ n X i=1 p i min |x i | δ ,1 and x 0 (γ ) ∈ argmin ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x+γ n X i=1 p i |x i | 0 x 1 (γ ) ≜ argmin ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x+γ n X i=1 p i |x i | x cap (γ ) ∈ argmin ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x+γ n X i=1 p i min |x i | δ ,1 x d cap (γ ) ∈ arg-loc-min ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x+γ n X i=1 p i min |x i | δ ,1 where “loc-min” stands for the set of objective values pertaining to the local minimizers of (3.1) when Φ is capped ℓ 1 , which is denoted as “arg-loc-min”. It is worth noting that by [26], all of the d-stationary solutions (see Definition 12) of a capped ℓ 1 -problem (3.1) are equivalenttoitslocalminimizers, henceforthwewillusethesetwoconceptsinterchangeably. Inparticular,wewillrefertof d cap andx d cap as“cappedℓ 1 d-stationarypaths”. 
Inwhatfollows, we summarize our contributions in this chapter: • Inlightofthegapsinthepreviousliteratureconcerningparametricprogrammingstud- ies involving ℓ 0 , ℓ 1 -functions and other types of nonconvex choices for Φ (summarized in the forthcoming Section 3.1), we present a set of analytical findings regarding the concavity, monotonicity, differentiability and piecewise structure of paths f 0 ,f 1 ,f cap and f d cap , as well as the complexity of the number of pieces in paths x 0 ,x 1 ,x cap and 58 x d cap . Some of these analytical results can be comprehended as enhancements of the existing ones, which will be listed in Section 3.2 for self-containedness and ease of comparison. These results are accompanied by some new findings that complement the existing studies. For example, in Section 3.2, we will demonstrate that when Q is Stieltjes and the lower bound ℓ = 0, all the aforementioned paths in worst case can have at most a polynomial (in n) number of pieces, which is favorable for practical computation. Conversely, we provide a special class of (3.1) where Q is not a Z-matrix and p i values are not all equal, in which case the ℓ 0 -path can have exponentially (in n) many pieces in worst case. • Considering the nonconvex nature of the capped ℓ 1 -function, our focus lies in tracing a d-stationary path x d cap , which can be computationally more efficient than obtaining the global solution path x cap . In Section 3.3, we propose a finite algorithm to compute this d-stationary path and address a key challenge that distinguishes it from conven- tionalparametricprogrammingliterature. Specifically, thecapped ℓ 1 -functionexhibits nonsmoothness at± δ , leading to potential discontinuities in the path x d cap . To tackle this issue, we introduce a modified GHP algorithm (described in Section 2.1), which capitalizes on the Stieltjes structure of Q and the latest information of the path to achieve efficient restoration of x d cap whenever a discontinuity arises. These analytical and computational investigations, strengthened by our extensive numerical studies presented in Section 3.4, provide compelling evidence supporting the utility of para- metric capped ℓ 1 as a practical tool for hyperparameter selection. Specifically, this approach offers an efficient and effective approximation of the ℓ 0 -counterpart, making it a compet- itive alternative especially when the ℓ 0 -path is computationally intractable for large scale problems. Moreover, capped ℓ 1 d-stationary path exhibits superior properties compared to the crude approximation of ℓ 0 provided by ℓ 1 -path. Overall, these findings underscore the 59 significanceofparametriccapped ℓ 1 studies, whichisarelativelyunexploredtopicinmathe- matical programming prior to our research summarized in this chapter. Finally, the content presented in this chapter is based on the published paper [50]. 3.1 Summary of Previous Studies In this section, we provide a summary of the existing literature on parametric studies in- volving ℓ 0 , ℓ 1 -functions, and nonconvex regularizers other than capped ℓ 1 . Our aim here is to highlight the aspects that have not been extensively explored in the previous studies, which serve as the motivations for presenting our analytical and computational findings in the subsequent sections. Parametric ℓ 1 Until now, there has been a lack of comprehensive investigations into parametric approaches for ℓ 0 or capped ℓ 1 . 
In contrast, parametric ℓ 1 -approaches can be dated back to the study of parametric LASSO path, which pertains to problem (3.1) with a general symmetric pos- itive (semi)definite matrix Q, see [32] and [82] for reference. The Least Angle Regression (LARS) method introduced in [82], exhibits characteristics akin to the classical parametric quadraticprogrammingtechniquesinoptimizationliterature[13],albeitwithamoreexplicit exploitation of the LASSO structure. It is known that the complexity of an algorithm to compute an ℓ 1 -path can in general be exponential in the number of variables, due to the possible exponential number of pieces in the path, see [61] for a particular class of problem which in worst case obtains such unfavorable complexity. 60 Parametric ℓ 0 The exact ℓ 0 -path, when the weights are equal (p i = 1 for all i∈ [n]), has been studied in [79]. This work establishes the piecewise affine nature of the path and describes it in terms of solving a linear (in n) number of the fixed-cardinality variable selection problems. Similar approaches involving the enumeration of all n+1 cardinalities are commonly employed in the literature, see for instance [9] and [24]. In contrast, to date there has been a lack of studies when p i are not all equal. Methods for multiobjective optimization, often designed in the context of mixed-integer linear optimization, typically call for solving a large sequence of mixed-integer programs [14, 15]. Naturally,suchmethodsmayperformpoorlyifalargenumberofcallstoamixed-integer optimization solver are necessary, particularly in the context of mixed-integer nonlinear optimization, where each subproblem is comparatively more difficult than the linear case. Parametric approaches for other nonconvex Φ It is worth noting that, even though the parametric capped ℓ 1 -approach has not been exten- sively investigated, there have been parametric studies involving other types of nonconvex regularizers. An example of this would be [87], which introduces an ℓ p -regularized “critical path” (0δ . ∗ When δ is sufficiently small, if x d cap (γ ) is continuous, then it takes value of a unique constant vector with no zero elements. – Let Q additionally be Stieltjes, then there exists a finite time algorithm which computes a particular x d cap (γ ) that satisfy the followings: ∗ This computed path x d cap is (possibly discontinuous) piecewise affine. ∗ Let f d cap (γ ) = L(x d cap (γ ))+γ Φ( x d cap (γ )) where Φ is capped ℓ 1 , then f d cap (γ ) is non-decreasing and lower semicontinuous, L(x d cap (γ )) is non-increasing and Φ( x d cap (γ )) is non-decreasing. ∗ If ℓ=0, then x d cap has at most 3n pieces. The capped ℓ 1 d-stationary solutions, hence the d-stationary paths, are not necessarily unique. In our investigation, we focus on a specific d-stationary path that can be com- puted using a finite-time algorithm. Specifically, when the matrix Q is Stieltjes, the results pertaining to f d cap and x d cap are obtained through the exploitation of the Stieltjes structure by our algorithm. Therefore, we will defer the formal introduction of these results until the next section when we formally present our capped ℓ 1 d-stationary path algorithm. 64 3.2.1 Piecewise structures From the previous summary, it is worth observing that all the paths exhibit a piecewise structure. The intuition behind this is similar to the combinatorial perspective we adopt when designing our GHP algorithm in the previous chapter. 
Recall that for the GHP al- gorithm, we have defined sign structures ( S ν , ¯ S ν ) and piece selections {j (ν )t i } i∈[n] to either explicitly or implicitly restrict the values of solutions. Similarly, a certain type of “basis”, usually defined as a tuple of index sets, can be designed to relax the original problem (3.1), whenΦis ℓ 0 ,ℓ 1 andcappedℓ 1 , toasmoothandconvexsubproblem. Withinthisframework, the values of f 1 (γ ),f 0 (γ ) and f cap (γ ) for a specific γ ≥ 0 can be determined by selecting the bestamongtheoptimalvaluesobtainedfromtheserelaxedsubproblems,eachcorresponding to a particular “basis” under the same γ . As a natural consequence of the finite number of “bases”, the piecewise structures of the paths emerge, where each piece is associated with a specific “basis”. All the results, which we present in below, follow this recipe, and we begin with the case of ℓ 0 . Proposition 26. Let Q be symmetric and positive definite. For γ ≥ 0, the function f 0 (γ ) is concave,non-decreasingandpiecewiseaffine,while x 0 (γ )ispiecewiseconstant. Furthermore, when p i =p j for all i,j∈[n] and i̸=j, then f 0 (γ ) has at most n+1 pieces. Proof. We begin with a definition of “basis” as a pair of disjoint index sets ( S, ¯ S) so that S, ¯ S ⊆ [n] and S∪ ¯ S = [n], which controls the sparsity of solutions. More specifically, it is fairly straightforward to verify that f 0 is equivalent to the following reformulation f 0 (γ ) = minimum S⊆ [n] v 0 (S)+ X i∈S p i ! γ where v 0 (S) ≜ minimum ℓ≤ x≤ u 1 2 x ⊤ Qx+q ⊤ x subject to x i =0,∀i∈ ¯ S. 65 Once this reformulation is established, we obtain the concavity and piecewise affinity of f 0 by an application of the Danskin’s Theorem, and the non-decreasing property of f 0 is also fairly obvious. Note that each piece of f 0 is associated with a particular choice of (S, ¯ S), and its solution is the one we obtain by solving the minimization problem as defined in v 0 (S). Thus, the solution value on each piece is a constant (with respect to γ ). Finally, noticethatitalltheweights p i areequal, whichwecanwithoutlossofgenerality say that p i =1 for all i∈[n], then f 0 (γ ) can be further reformulated as follows: f 0 (γ ) = minimum 0≤ k≤ n v k 0 +kγ where v k 0 ≜ minimum S:|S|=k v 0 (S), and|S| denotes the cardinality of set S. From this new formulation, it is quite evident that the number of peices in the path is at most n+1. Proposition 27. Let Q be symmetric positive definite. For γ ≥ 0, the function f 1 (γ ) is concave, non-decreasing, once continously differentiable and piecewise linear-quadratic, which means there exists a finite number of intervals of [0 ,+∞) so that f 1 (γ ) is quadratic whenγ isrestrictedtoeachoftheseintervals. Additionally, the(unique)solutionpath x 1 (γ ) is piecewise affine. Proof. Again, we define a “basis” ( S, ¯ S) to be a mutually exclusive decomposition of [n], exceptfornowitcontrolsthesignofsolutions. Morespecifically,wecanreformulatefunction f 1 (γ ) as follows: f 1 (γ ) = minimum S⊆ [n] v 1 (γ,S ) where v 1 (γ,S ) ≜ minimum x 1 2 x ⊤ Qx+q ⊤ x+ X i∈S p i x i − X i∈ ¯S p i x i ! γ subject to 0≤ x S ≤ u S , ℓ¯S ≤ x¯S ≤ 0. 66 Oncethisisestablished, thepiecewiselinear-quadraticresultwillbeasimpleconsequenceof standard parametric convex quadratic programming techniques. The concavity and differ- entiability can be easily obtained by the Danskin’s Theorem and Q being positive definite, which ensures the uniqueness of solution. 
To study the solution path x 1 (γ ), we resort to an alternative definition of “basis” ( α 0 ,α ± ,σ ± ), which gives rise to the following relaxed solution (¯x(γ )) α 0 ≜0, (¯x(γ )) σ + ≜u σ + , (¯x(γ )) σ − ≜ℓ σ − , (¯x(γ )) α ± ≜− (Q α ± α ± ) − 1 q α ± +Q α ± σ ± ξ σ ± − γ (Q α ± α ± ) − 1 η α ± , where ξ σ + ≜u σ + , ξ σ − ≜ℓ σ − , η α + ≜p α + , η α − ≜− p α − ; for a vector d∈R n and a set β ⊆ [n], notation d β stands for the subvector determined by the elements in β ; similarly, for a matrix M ∈R n× n and subsets β,β ′ ⊆ [n], notation M ββ ′ represents the submatrix determined by the row indices in β and the column indices in β ′ . The relaxed solution ¯x(γ ) will be optimal for the original problem (3.1) under ℓ 1 -regularizer, if and only if γ ≥ 0 satisfies the following linear inequalities induced by optimality condition: 0≤ (¯x(γ )) α + ≤ u α + , ℓ α − ≤ (¯x(γ )) α − ≤ 0 (Q¯x+q) α 0 +γp α 0 ≥ 0, (Q¯x+q) α 0 − γp α 0 ≤ 0 (Q¯x+q) σ + +γp σ + ≤ 0, (Q¯x+q) σ − − γp σ − ≥ 0 Solving these inequalities will yield scalar values γ and γ such that x 1 (γ ) = ¯x(γ ) for all γ ∈[γ ,γ ]. It is important to note that ¯x(γ ) is affine in γ . Consequently, we have partitioned the interval [0,∞) into a finite number of subintervals where x 1 (γ ) is affine within each of these subintervals. In other words, we can conclude that x 1 (γ ) is piecewise affine. Proposition 28. Let Q be symmetric positive definite. For γ ≥ 0, f cap (γ ) is concave, non-decreasing and piecewise linear-quadratic, and the solution path x cap (γ ) is piecewise affine. 67 Proof. Define the “basis” ( S, ¯ S) as a mutually exclusive decomposition of [n], and it is straightforward to the verify the followings: f cap (γ ) = minimum S⊆ [n] v cap (γ,S ) where v cap (γ,S ) ≜ minimum x 1 2 x ⊤ Qx+q ⊤ x+ X i∈S p i δ |x i |− X i∈ ¯S p i ! γ subject to |x i |≤ δ, ∀i∈S, x i =1,∀i∈ ¯ S. The rest of the proof is similar to Proposition 26 and Proposition 27, hence will be omitted here. 3.2.2 On the number of pieces Once we have established the piecewise structures of the paths, a naturally follow-up ques- tion arises regarding whether we can provide a guarantee on the worst-case complexity of the number of pieces. This issue is particularly important as it directly determines the com- plexity of computing these paths. In Table 3.1, we summarize such complexity results. The abbreviation “exp.” represents “exponential in n”, “PD” stands for “positive definite”, “d- stat.” is short for “d-stationary”, “eq.” and “uneq.” are abbreviations for “equal weights” and “unequal weights” respectively. From this table, it becomes apparent that the Stieltjes property (and thus the Z-property) plays a crucial role in achieving linear and polynomial complexity in the number of pieces for the solution paths. This is not surprising as we have seen a similar phenomenon when we design our GHP algorithm in the previous chapter. Namely, we are able to achieve the favorable linear-step and strong polynomial complexity by leveraging the Z-property. 68 Regularizer Pieces Optimality Class Reference ℓ 1 exp. global Non-Stieltjes Q Proposition 1 in [61] 2n+1 global ℓ =0, Stieltjes Q Proposition 31 ℓ 0 n+1 (eq.) global general PD Q Proposition 26 exp. (uneq.) global Non-Stieltjes Q Proposition 29 n+1 (uneq.) global ℓ =0, Stieltjes Q Proposition 30 capped ℓ 1 2n 2 +3n+1 global ℓ =0, Stieltjes Q Proposition 32 3n+1 d-stat. ℓ =0, Stieltjes Q Proposition 50 Table 3.1: Summary of piece number complexities. 
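To make the support-enumeration recipe behind these counts concrete, the following minimal Python sketch (our illustration, not code from the dissertation) evaluates f_0(γ) as the lower envelope of v_0(S) + γ Σ_{i∈S} p_i over all supports S of a tiny instance. For brevity it assumes the box constraints are inactive, so each restricted problem has the closed-form minimizer x_S = −(Q_SS)^{−1} q_S; with equal weights the envelope has at most n+1 distinct slopes, in line with Proposition 26.

```python
import itertools
import numpy as np

def l0_value_path(Q, q, p, gammas):
    """f_0(gamma) as the lower envelope of v_0(S) + gamma * sum(p[S]) over all
    supports S (brute force, so only sensible for tiny n).  The box constraints
    are assumed inactive, giving the closed form x_S = -Q_SS^{-1} q_S."""
    n = len(q)
    intercepts, slopes = [], []
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            S = list(S)
            if S:
                xS = np.linalg.solve(Q[np.ix_(S, S)], -q[S])
                v = 0.5 * xS @ Q[np.ix_(S, S)] @ xS + q[S] @ xS
            else:
                v = 0.0
            intercepts.append(v)
            slopes.append(p[S].sum() if S else 0.0)
    vals = np.array(intercepts)[None, :] + np.outer(gammas, np.array(slopes))
    return vals.min(axis=1)              # concave and piecewise affine in gamma

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)              # symmetric positive definite
q = rng.uniform(-1.0, 1.0, n)
p = np.ones(n)                           # equal weights: at most n+1 pieces
f0 = l0_value_path(Q, q, p, np.linspace(0.0, 2.0, 200))
```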
We initiate with a negative finding concerning the piece complexity of the ℓ 0 -path when the weights p i are not all equal. The subsequent proposition is extracted from the paper [50], which features the author of this dissertation. It can be understood as analogous to Proposition1in[61],whichprovidesaclassofparametricℓ 1 -problemsthatadmitexponential (in n) piece complexity. Proposition 29. Let the parameters Q,q,p,ℓ and u in (3.1) be constructed as follows: • Q=cI +qq ⊤ where I is the n-by-n identity matrix and c>0; • For i∈[n], set q i = √ 2 i and p i =2 i ; • Let u=− ℓ so that for all S⊆ [n] we have ℓ S ≤− (Q SS ) − 1 q S ≤ u S . Then f 0 (γ ) has 2 n +1 pieces. Proof. See Example 5 in [50] for the details. NoticethatthematrixQconstructedinProposition29haspositiveoff-diagonalelements, hence cannot be a Z-matrix and certainly will not be a Stieltjes one. Subsequently, we will present a series of results, all cited from the papers [50] and [43] (co-authored by the author ofthisdissertation),tohighlightthesignificanceofStieltjesmatricesinachievingpolynomial piece complexities. 69 Proposition 30. Let Q be Stieltjes, ℓ=0, then f 1 has at most n+1 pieces. Proof. Recall that from Proposition 26, each piece of f 0 corresponds to a particular support (sparsity) of the ℓ 0 -solution. Denote S(γ )≜{i∈[n]:(x 1 (γ )) i ̸=0} as the support of x 1 (γ ), by courtesy of the Stieltjes Q, we can verify that S(γ )⊆ S(γ ′ ) whenever 0≤ γ ′ <γ . Notice that S(γ ) = ∅ when γ is sufficiently large, hence the support will change at most n times which means there is at most n+1 pieces in f 0 . More details of the proof can be found in Proposition 6 in [50]. Proposition 31. Let Q be Stieltjes, ℓ = 0 and p satisfies ( Q SS ) − 1 p S > 0 for all S ⊆ [n], then f 1 has at most 2n+1 pieces. Proof. The proof of this result is algorithmic, aided by an algorithm which computes one piece of x 1 (γ ) at each step. Leveraging the Stieltjes structure of Q and our construction of p vector, this algorithm will terminate within 2n+1 steps. For a more comprehensive presentation of this algorithm, see Theorem 18 in [43]. Proposition 32. Let Q be Stieltjes, ℓ=0, then f cap has at most n 2 +3n+1 pieces. Proof. This result can be proved by an argument similar to that of Proposition 30, see Proposition 7 of [50] for more details. 3.3 Computing D-stationary Paths for Capped ℓ 1 Inthissection,wewillformallyintroduceanalgorithm,whichwewillrefertoasthePivoting Scheme, to trace a capped ℓ 1 d-stationary path x d cap when matrix Q is Stieltjes, along with some relevant analyses. We begin with the following characterization of a d-stationary solution of (3.1) when Φ is the capped ℓ 1 function. 70 Proposition33. LetΦbecapped ℓ 1 ,avectorx ∗ ∈R n suchthatℓ≤ x ∗ ≤ uisad-stationary solution of (3.1) if and only ifS = (x ∗ )=∅ and the followings hold for all i∈S 0 (x ∗ ): (Qx ∗ +q) i + γ δ p i ≥ 0, (Qx ∗ +q) i − γ δ p i ≤ 0, for all i∈S < (x ∗ ): (Qx ∗ +q) i + γ δ p i sign(x ∗ i )=0, for all i∈S > (x ∗ ): (Qx ∗ +q) i =0, for all i∈S u (x ∗ ): (Qx ∗ +q) i ≤ 0, for all i∈S ℓ (x ∗ ): (Qx ∗ +q) i ≥ 0. (3.2) where S 0 (x ∗ )≜{i∈[n]:x ∗ i =0}, S = (x ∗ )≜{i∈[n]:|x ∗ i |=δ } S u (x ∗ )≜{i∈[n]:x ∗ i =u i }, S u (x ∗ )≜{i∈[n]:x ∗ i =ℓ i } S < (x ∗ )≜{i∈[n]:0<|x ∗ i |<δ }, S > (x ∗ )≜{i∈[n]:δ <x ∗ i 0 for capped ℓ 1 , a relatively high number of discontinuities in the path x d cap should be expected. In this regard, employing a fast algorithm for computing d-stationary solutions becomes advantageous in reducing the workload. 
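The conditions of Proposition 33 are simple to test numerically for a candidate point. The following Python sketch is a minimal checker of these conditions (the tolerance handling and the assumption ℓ < −δ < 0 < δ < u are ours): it first confirms that S_=(x) is empty, then classifies each index into S_0, S_u, S_ℓ, S_< or S_> and verifies the corresponding inequality. Such a check is convenient for validating the output of the path algorithms discussed next.

```python
import numpy as np

def is_dstat_capped_l1(x, Q, q, p, gamma, delta, lo, up, tol=1e-8):
    """Check the d-stationarity conditions of Proposition 33 for the capped
    l1 problem.  Assumes lo < -delta < 0 < delta < up elementwise; the
    tolerances are illustrative, not part of the dissertation's statement."""
    g = Q @ x + q                         # gradient of the quadratic loss L
    r = gamma * p / delta
    if np.any(np.abs(np.abs(x) - delta) <= tol):
        return False                      # S_=(x) must be empty
    for i in range(len(x)):
        if abs(x[i]) <= tol:                      # i in S_0
            ok = g[i] + r[i] >= -tol and g[i] - r[i] <= tol
        elif abs(x[i] - up[i]) <= tol:            # i in S_u
            ok = g[i] <= tol
        elif abs(x[i] - lo[i]) <= tol:            # i in S_l
            ok = g[i] >= -tol
        elif abs(x[i]) < delta:                   # i in S_<
            ok = abs(g[i] + r[i] * np.sign(x[i])) <= tol
        else:                                     # i in S_>  (|x_i| > delta)
            ok = abs(g[i]) <= tol
        if not ok:
            return False
    return True
```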
The GHP algorithm, developed in the previous chapter, aligns well with this need. It is specifically designed to leverage the Stieltjes structure so that a fast convergence (in our case, strong polynomial complexity) to a d-stationary solution can be ensured, see 71 Theorem 24 for more details. To make our discussions self-contained and for easy reference, thenextsectionwillpresenttwoalternativeversionsoftheGHPalgorithm, alongwithsome relevant results that will prove useful for our subsequent discussions. These two alternative versions of the GHP algorithm will serve as the fundamental building blocks for the design and analysis of our pivoting scheme. Furthermore, we will dedicate a section to demon- strate that GHP not only offers the benefit of fast convergence, but also provides interesting theoretical and practical advantages regarding the d-stationary path traced by our pivoting scheme. 3.3.1 Alternative GHP algorithms The GHP algorithm, as introduced in the previous chapter, was presented as a double loop scheme with a specific initialization, leading to intermediate solutions that exhibit an ele- mentwise non-increasing property. In this section, we will introduce two alternative GHP methods that merge these two loops together and allow for a more generalized set of ini- tializations, making it more convenient to embed them into our pivoting scheme for discon- tinuity restoration. Notably, one of these alternative methods will exhibit an elementwise non-decreasing property among its solutions. Algorithm 3 non-increasing GHP (abbreviated as the non-inc-GHP) 1: Initialize S 1 =[n], ¯ S 1 =∅,S 1 > =[n],S 1 < = ¯ S 1 < = ¯ S 1 > =∅ 2: for t=1,2,... do 3: Solve the global optimal solution x t of the following subproblem: minimize x∈R n 1 2 x ⊤ Qx+q ⊤ x+γ X i∈S t < p i δ x i − γ X i∈ ¯S t > p i δ x i subject to 0≤ x S t ≤ u S t, ℓ¯S t ≤ x¯S t ≤ 0 (3.3) 72 4: Check the following sets: E t ≜ n i∈S t < : x t i =0, (Qx t +q) i − γ p i δ >0 o ¯ F t > ≜ i∈ ¯ S t > :x t i ≤− δ , F t > ≜{i∈S t > :x t i ≤ δ } 5: if E t ∪F t > ∪ ¯ F t > =∅ then 6: Terminate with x ∗ ≜ x t . Otherwise update the followings and proceed to step t+1: S t+1 < ≜(S t < ∪F t > )\E t S t+1 > ≜S t > \F t > S t+1 ≜S t+1 < ∪S t+1 > ¯ S t+1 < ≜ ¯ S t < ∪ ¯ F t > ¯ S t+1 > ≜( ¯ S t > \ ¯ F t > )∪E t ¯ S t+1 ≜ ¯ S t+1 < ∪ ¯ S t+1 > 7: end if 8: end for Algorithm 4 non-decreasing GHP (abbreviated as the non-dec-GHP) 1: Initialize S 1 =∅, ¯ S 1 =[n], ¯ S 1 < =[n],S 1 > =S 1 < = ¯ S 1 > =∅ 2: for t=1,2,... do 3: Solve the global optimal solution x t of the subproblem (3.3) and check: ¯ E t ≜ n i∈ ¯ S t > : x t i =0, (Qx t +q) i +γ p i δ <0 o ¯ F t < ≜{i∈ ¯ S t < :x t i ≥− δ }, F t < ≜{i∈S t < :x t i ≥ δ } 4: if ¯ E t ∪F t < ∪ ¯ F t < =∅ then 5: Terminate with x ∗ ≜ x t . Otherwise, update the followings and proceed to step t+1: S t+1 < ≜(S t < ∪ ¯ E t )\F t < S t+1 > ≜S t > ∪F t < S t+1 ≜S t+1 < ∪S t+1 > ¯ S t+1 < ≜ ¯ S t < \ ¯ F t < ¯ S t+1 > ≜( ¯ S t > ∪ ¯ F t < )\ ¯ E t ¯ S t+1 ≜ ¯ S t+1 < ∪ ¯ S t+1 > 6: end if 73 7: end for Remark 34. In order for our upcoming analysis to hold for both GHP algorithms, it is not necessary to adhere strictly to the initializations as stated above. Instead, we only need to ensure that S 1 < ,S 1 > , ¯ S 1 < , ¯ S 1 > will satisfy the following conditions non-inc-GHP: x 1 ¯S 1 <0 and F 1 < = ¯ F 1 < =∅; non-dec-GHP: x 1 S 1 >0 and F 1 > = ¯ F 1 > =∅. (3.4) As long as these conditions are met, our main conclusion, as stated in Theorem 41, will remain unchanged. 
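To illustrate the mechanics of Algorithm 3, the rough Python sketch below mirrors its index-set updates. The subproblem (3.3) is solved here with a generic box-constrained solver (scipy's L-BFGS-B) and the tolerances are our own choices, so this is only a schematic rendering of the method, not the implementation analyzed in the sequel.

```python
import numpy as np
from scipy.optimize import minimize

def solve_box_qp(Q, c, lb, ub):
    """Minimize 0.5 x'Qx + c'x subject to lb <= x <= ub (Q positive definite)."""
    x0 = np.clip(np.zeros(len(c)), lb, ub)
    res = minimize(lambda x: 0.5 * x @ Q @ x + c @ x, x0,
                   jac=lambda x: Q @ x + c,
                   method="L-BFGS-B", bounds=list(zip(lb, ub)))
    return res.x

def non_inc_ghp(Q, q, p, gamma, delta, lo, up, tol=1e-9):
    """Schematic version of the non-inc-GHP (Algorithm 3); set names follow the text."""
    n = len(q)
    S_lt, S_gt = set(), set(range(n))        # S^t_< and S^t_>
    Sb_lt, Sb_gt = set(), set()              # bar S^t_< and bar S^t_>
    x = np.zeros(n)
    for _ in range(3 * n + 1):               # Theorem 41: at most 3n steps
        S = S_lt | S_gt                      # indices restricted to [0, u_i]
        c = q.copy()
        for i in S_lt:
            c[i] += gamma * p[i] / delta
        for i in Sb_gt:
            c[i] -= gamma * p[i] / delta
        in_S = np.array([i in S for i in range(n)])
        lb = np.where(in_S, 0.0, lo)
        ub = np.where(in_S, up, 0.0)
        x = solve_box_qp(Q, c, lb, ub)       # subproblem (3.3)
        g = Q @ x + q
        E = {i for i in S_lt
             if abs(x[i]) <= tol and g[i] - gamma * p[i] / delta > tol}
        Fb = {i for i in Sb_gt if x[i] <= -delta}
        F = {i for i in S_gt if x[i] <= delta}
        if not (E | F | Fb):
            return x                         # d-stationary for (3.1)
        S_lt, S_gt = (S_lt | F) - E, S_gt - F
        Sb_lt, Sb_gt = Sb_lt | Fb, (Sb_gt - Fb) | E
    return x
```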
This more general family of initializations is crucial for us to integrate both GHP methods into the forthcoming pivoting scheme. We will later explain how, when encountering discontinuities in the path x d cap , we can utilize the most up-to-date information to construct an initialization that satisfies (3.4). The fundamental concept behind these two algorithms is once again the enumeration of the combinatorial structure (S t < ,S t > , ¯ S t < , ¯ S t > ), which governs the sign restrictions and piece selections in the capped ℓ 1 -problem. The goal here is to find the appropriate combina- torial structure whose associated subproblem (3.3) has the unique optimal solution being d-stationary for (3.1). This goal is accomplished by reassigning the indices within the “error sets” (E t ,F t > , ¯ F t > ) or ( ¯ E t ,F t < , ¯ F t < ). The Stieltjes matrix Q once again ensures such reassign- ment to exhibit a monotonic property and ultimately lead to a termination within linear (in n) steps. To analyze these two GHPs, we begin with the following conditions for a given (S < ,S > , ¯ S < , ¯ S > ) to produce a d-stationary solution for (3.1). Lemma 35. Let S < ,S > , ¯ S < , ¯ S > and ¯ S = ¯ S < ∪ ¯ S > ,S = S < ∪S > be given, and the solution ¯x of: minimize x∈R n 1 2 x ⊤ Qx+q ⊤ x+γ X i∈S< p i δ x i − γ X i∈ ¯S> p i δ x i +γ X i∈ ¯S<∪ ¯S> p i subject to 0≤ x S ≤ u S , ℓ¯S ≤ x¯S ≤ 0 (3.5) 74 satisfies ¯ x¯S <0. Then ¯x will be a d-stationary point of (3.1) if the following sets are empty: E≜ n i∈S < : ¯x i =0, (Q¯x+q) i − γ p i δ >0 o ¯ F > ≜{i∈ ¯ S > : ¯x i ≤− δ }, F > ≜{i∈S > : ¯x i ≤ δ } ¯ F < ≜{i∈ ¯ S < : ¯x i ≥− δ }, F < ≜{i∈S < : ¯x i ≥ δ } Proof. Trivial by inspection. Remark 36. If we assume that the ¯x in Lemma 35 satisfies ¯ x S > 0, then the following conditions are also sufficient for ¯ x to be d-stationary for (3.1): ¯ E ≜ {i ∈ ¯ S > : ¯x i = 0, (Q¯x+q) i +γ p i δ <0}=∅ and ¯ F < = ¯ F > =F < =F > =∅. The following results, namely Lemma 37, Lemma 38 and Corollary 40, summarize the behavior of the intermediate steps of GHP algorithms. Lemma 37. Let H ∈R n be a given Stieltjes matrix, h∈R n and fix arbitrary S⊆ [n], ¯ S≜ [n]\S; take some i + ∈S,i − ∈ ¯ S, if possible, and let ε>0; for the following three problems: (P 0 ) x 0 ≜ argmin x 1 2 x ⊤ Hx+h ⊤ x s.t. 0≤ x S ≤ u S , ℓ¯S ≤ x¯S ≤ 0, (P + ) x + ≜ argmin x 1 2 x ⊤ Hx+(h+εe i + ) ⊤ x s.t. 0≤ x S ≤ u S , ℓ¯S ≤ x¯S ≤ 0, (P − ) x − ≜ argmin x 1 2 x ⊤ Hx+(h− εe i − ) ⊤ x s.t. 0≤ x S ≤ u S , ℓ¯S ≤ x¯S ≤ 0, where e j denotes the jth coordinate vector, we can obtain that (i) x + ≤ x 0 ≤ x − , (ii) if x 0 i + >0 then x + i + <x 0 i + , if x 0 i − <0 then x − i − >x 0 i − . Proof. The key argument is the Least Element Theorem (Corollary 18). More specifically x 0 is the least element of the following set: E 0 ≜ z∈R n 0≤ z S ≤ u S , ℓ¯S ≤ z¯S ≤ 0: for all i∈S : z i 0 and by the KKT of (P 0 ) we have h i + +H i + • x 0 =0. Without loss of generality we can assume that x + i + >0 (otherwise we are done), which shows h i + +ε+H i + • x + =0 by the KKT of (P + ). If x 0 i + =x + i + then: 0=ϵ +H i + • (x + − x 0 )=ϵ + X j̸=i + H i + j (x + j − x 0 j )>0 where the second equality is by x 0 i + = x + i + , and the first inequality is by ϵ > 0,x + ≤ x 0 and H being Stieltjes. Notice that we have just derived 0 < 0, thus we must have x + i + < x 0 i + . Finally, for (P 0 ) and (P − ), we can once again flip the sign x=− ξ and obtain x − i − >x 0 i − . Lemma 38. 
Let Σ 1 < ,Σ 1 > , ¯Σ 1 < , ¯Σ 1 > be arbitrary subsets of [n], and define the following prob- lems: (P 1 ) x 1 ≜ argmin x 1 2 x ⊤ Qx+q ⊤ x+γ X i∈Σ 1 < p i δ x i − γ X i∈ ¯Σ 1 > p i δ x i subject to 0≤ x Σ 1 ≤ u Σ 1, ℓ¯Σ 1 ≤ x¯Σ 1 ≤ 0. (P 2 ) x 2 ≜ argmin x 1 2 x ⊤ Qx+q ⊤ x+γ X i∈Σ 2 < p i δ x i − γ X i∈ ¯Σ 2 > p i δ x i subject to 0≤ x Σ 2 ≤ u Σ 2, ℓ¯Σ 2 ≤ x¯Σ 2 ≤ 0. 76 based on the following sets: Σ 1 ≜Σ 1 < ∪Σ 1 > , ¯Σ 1 ≜ ¯Σ 1 < ∪ ¯Σ 1 > , E ≜ n i∈S > :ξ i =0,(Qξ +q)− γ p i δ >0 o , ¯Σ 2 < ≜ ¯Σ 1 < , Σ 2 > ≜Σ 1 > , ¯Σ 2 > ≜ ¯Σ 1 > ∪E, Σ 1 < ≜Σ 1 < \E, where the pair (ξ,S < ) are given so that ξ ≥ x 1 and S < ⊆ Σ 1 < . Then we have (i) x 2 ≤ x 1 , (ii) x 2 E <0. Proof. Forj∈{1,2}, we denote the objective function of problem (P j ) as 1 2 x ⊤ Qx+(q (j) ) ⊤ x, then by Corollary 18, solution x j is the least element ofE j which is defined as E j ≜ z∈R n : 0≤ z Σ j ≤ u Σ j, ℓ¯Σ j ≤ z¯Σ j ≤ 0: for all i∈Σ j : z j 0 where the first inequality is by x 2 ≤ x 1 ≤ ξ,x 2 i = ξ i = 0, and Q being Stieltjes, the second inequalityisbyourdefinitionof i∈E. Thisisapparentlyacontradictionhence x 2 E <0. 77 Remark 39. If instead we are given a pair ( ¯ξ, ¯ S > ) so that ¯ξ ≤ x 1 and ¯ S > ⊆ ¯Σ 1 > , we can similarly define the following sets: ¯ E ≜ n i∈ ¯ S > : ¯ξ i =0,(Q ¯ξ +q) i +γ p i δ <0 o , ¯Σ 2 < ≜ ¯Σ 1 < , ¯Σ 2 ≜ ¯Σ 1 \ ¯ E, Σ 2 < ≜Σ 1 < ∪ ¯ E, Σ 2 > ≜Σ 1 > , and prove that the solution x 2 of (P 2 ) satisfies: (i) x 2 ≥ x 1 , (ii) x 2 ¯ E >0. Corollary 40. Let t≥ 1 be an arbitrary step, then • For the non-inc-GHP, we have (i) x t+1 ≤ x t , (ii) x t+1 F t > <x t F t > ,x t+1 ¯ F t > <x t ¯ F t > ,x t+1 E t <0; • For the non-dec-GHP, we have (iii) x t+1 ≥ x t , (iv) x t+1 F t < >x t F t < ,x t+1 ¯ F t < >x t ¯ F t < ,x t+1 ¯ E t >0. Proof. Here we only prove for the non-inc-GHP, since the arguments for the non-dec-GHP is essentially symmetrical by Remark 36 and Remark 39. First notice that reassignment of indices from step t to t + 1 can be equivalently done in a sequential fashion. Denote m ≜ |F t > |, ¯m ≜ | ¯ F t > | and indices {j i : i = 1,...,m+ ¯m} so that F t > = {j 1 ,...,j m }, ¯ F t > = {j m+1 ,...,j m+¯m }. Then, we construct the following sets: for all i=1...m S t(i) > ≜S t > \{j 1 ,...,j i } S t(i) < ≜S t < ∪{j 1 ,...,j i } ¯ S t(i) > ≜ ¯ S t > ¯ S t(i) < ≜ ¯ S t < for all i=m...m+ ¯m S t(i) > ≜S t > \F t > S t(i) < ≜S t < ∪F t > ¯ S t(i) > ≜ ¯ S t > \{j m+1 ,...,j i } ¯ S t(i) < = ¯ S t < ∪{j m+1 ,...,j i } Note that S t+1 > = S t(m+¯m) > ,S t+1 < = S t(m+¯m) < \E t , ¯ S t+1 > = ¯ S t(m+¯m) > ∪E t and ¯ S t+1 < = ¯ S t(m+¯m) < . Denote the solution of subproblem (3.3), defined by S t(i) < ,S t(i) > , ¯ S t(i) < , ¯ S t(i) > , as x t(i) . Then, by Lemma 37, we get x t(i+1) ≤ x t(i) and x t(i+1) j i <x t(i) j i , for all i=0,...,m+ ¯m, where we denote 78 x t(0) ≜x t . Henceinaccumulationweobtainx t(m+¯m) ≤ x t ,x t(m+¯m) F t > <x t F t > andx t(m+¯m) ¯ F t > <x t ¯ F t > . Finally, apply Lemma 38 under the following settings (i=m+ ¯m): (Σ 1 < ,Σ 1 > , ¯Σ 1 < , ¯Σ 1 > )=(S t(i) < ,S t(i) > , ¯ S t(i) < , ¯ S t(i) > ), (Σ 2 < ,Σ 2 > , ¯Σ 2 < , ¯Σ 2 > )=(S t+1 < ,S t+1 > , ¯ S t+1 < , ¯ S t+1 > ), ξ =x t , E =E t , we obtain x t+1 ≤ x t(m+¯m) and x t+1 < 0. In summary, we have shown both part (i) and (ii). Theorem 41. Both GHP algorithms will terminate in 3n steps with a d-stationary point x ∗ of (3.1). Proof. Here we only prove for the non-increasing case, since the non-decreasing case can be proven in a similar fashion. 
The key argument is to inductively show that x t ¯S t < 0 and F t < = ¯ F t < =∅ hold for all steps t≥ 1. Notice that these are trivially satisfied for the base case, namely t = 1, by our initialization. Now suppose we are at step t > 1 and assume that x t− 1 ¯S t− 1 < 0,F t− 1 < = ¯ F t− 1 < =∅. From Corollary 40, we immediately have x t ¯S t ≤ x t− 1 ¯S t− 1 < 0 and x t E t− 1 < 0, thus in total we get x t ¯S t < 0 since ¯ S t = ¯ S t− 1 ∪ E t− 1 . Moreover, by our induction hypothesis and part (i) of Corollary 40, for i ∈ S t− 1 < such that i ̸∈ F t− 1 > we can obtain x t i ≤ x t− 1 i <δ , and for i∈ ¯ S t− 1 < such that i̸∈ ¯ F t− 1 > we similarly have x t i ≤ x t− 1 i <− δ . On the other hand, by part (ii) of Corollary 40, we have x t i <x t− 1 i ≤ δ for all i∈ F t− 1 > , and x t i < x t− 1 i ≤− δ for all i∈ ¯ F t− 1 > . In summary, these arguments give us x t i < δ for all i∈ S t < , and x t i <− δ for all i∈ ¯ S t < , namely F t < = ¯ F t < =∅. With x t ¯S t < 0 holds for all t≥ 1, Lemma 35 is applicable for all the steps. Notice that thealgorithmwithterminateeitherwhenE t =F t > = ¯ F t > =∅or ¯ S t < =[n]withsomex ∗ ≜x t . In both cases, Lemma 35 tells us that x ∗ is d-stationary for (3.1). Finally, suppose in each step we have reassigned exactly one index, then in worst case we will execute 3n steps, since each index will be moved at most 3 times. 79 Finally,denotetheoriginalobjectivefunctionof (3.1)asF,anddefine F t astheobjective function of subproblem (3.3) at step t after we augment it with a constant: F t (x) ≜ L(x)+γ X i∈S t < p i δ x i − γ X i∈ ¯S t > p i δ x i +γ X i∈S t > p i + X i∈ ¯S t < p i where L(x) = 1 2 x ⊤ Qx+q ⊤ x. We present the following result, which will prove useful later in demonstrating that, the d-stationary path x d cap (γ ) obtained from our pivoting scheme achieves a non-decreasing objective value F(x d cap (γ )) with respect to γ ≥ 0. Proposition 42. For both GHP methods, we have F(x t+1 )≤ F(x t ) holds for all t≥ 1. Proof. Here we only work with the non-inc-GHP, since the non-decreasing case can be sim- ilarly proven. Suppose we are at step t+1, then F t+1 (x t ) = L(x t )+γ X i∈S t+1 < p i δ x t i − γ X i∈ ¯S t+1 > p i δ x t i +γ X i∈S t+1 > p i + X i∈ ¯S t+1 < p i = L(x t )+γ X i∈[n] p i min |x t i | δ ,1 = F(x t ) which is straightforward to verify by the updating rule of sets ¯ S t+1 < , ¯ S t+1 > ,S t+1 < ,S t+1 > . More- over, it is easy to see that x t is feasible for subproblem (3.3) at step t+1 thus: F t+1 (x t+1 ) ≤ F t+1 (x t ) = F(x t ), (3.6) 80 and it is also straightforward to verify that F(x t+1 ) = L(x t+1 )+γ X i∈[n] p i min |x t+1 i | δ ,1 ≤ L(x t+1 )+γ X i∈S t+1 < p i δ x t+1 i − γ X i∈ ¯S t+1 > p i δ x t+1 i +γ X i∈S t+1 > p i + X i∈ ¯S t+1 < p i = F t+1 (x t+1 ) Combining this result with (3.6) will give us F(x t+1 )≤ F(x t ). 3.3.2 Basis and degeneracy The design of our pivoting scheme is based on what we call the basis of a d-stationary solutionofcappedℓ 1 -problem(3.1), whichisadisjointdecomposition(α 0 ,α ± ,β ± ,σ ± )of[n], where α ± is shorthanded for α + ∪α − (also applies to β ± and σ ± ). 
Serving a similar role as the concept of basis for the (parametric) simplex method in linear programming, we use our basis to characterize a d-stationary solution x ∗ of (3.1) at γ as follows: i∈α 0 ⇒ x ∗ i =0, i∈σ + ⇒ x ∗ i =u i , i∈σ − ⇒ x ∗ i =ℓ i i∈α + ⇒ 0≤ x ∗ i <δ and (Qx ∗ +q) i + γ δ p i =0 i∈α − ⇒ − δ <x ∗ i ≤ 0 and (Qx ∗ +q) i − γ δ p i =0 i∈β + ⇒ δ <x ∗ i ≤ u i and (Qx ∗ +q) i =0 i∈β − ⇒ ℓ i ≤ x i <− δ and (Qx ∗ +q) i =0 (3.7) Analogous to linear programming, where each basis is linked with a basic feasible solution alongwithasetofconditionsforthissolution(hencethebasis)tobeoptimal,wecansimilarly define a basic solution associated with our basis (α 0 ,α ± ,β ± ,σ ± ) and derive conditions that render this basic solution d-stationary for (3.1). We formally introduce these fundamental concepts in below. 81 Definition 43. Let a basis (α 0 ,α ± ,β ± ,σ ± ) and γ ≥ 0 be given. • A vector ¯x(γ )∈R n is a basic solution of (α 0 ,α ± ,β ± ,σ ± ) at γ if it satisfies: ¯x G ± (γ )=− (Q G ± G ± ) − 1 (q G ± +Q G ± σ ± ξ σ ± +γη G ± ), ¯x α 0 (γ )=0, ¯x σ + (γ )=u σ + , ¯x σ − (γ )=ℓ σ − where G ± ≜α ± ∪β ± and ξ i ≜ u i if i∈σ + ℓ i if i∈σ − and η i ≜ p i δ if i∈α + − p i δ if i∈α − 0 if i∈β ± • A basis (α 0 ,α ± ,β ± ,σ ± ) is d-stationary at γ if its basic solution at γ is d-stationary, which is guaranteed by the following conditions: 0≤ ¯x α + (γ )<δ 1 α + , − δ 1 α − < ¯x α − (γ )≤ 0, δ 1 β + < ¯x β + (γ )≤ u β + , ℓ β − ≤ ¯x β + (γ )<− δ 1 β − , (Q¯x(γ )+q) α 0 + γ δ p α 0 ≥ 0, (Q¯x(γ )+q) α 0 − γ δ p α 0 ≤ 0, (Q¯x(γ )+q) σ + ≤ 0, (Q¯x(γ )+q) σ − ≥ 0. (3.8) where 1∈R n is a vector whose elements are all one. Since Q is a Stieltjes matrix hence positive definite, it is evident that when a basis (α 0 ,α ± ,β ± ,σ ± ) is d-stationary at some γ , then there is a unique d-stationary solution ¯x of (3.1) at γ satisfying the following constraints: 0≤ ¯x α + <δ 1 α + , δ 1 β + < ¯x β + ≤ u β + , − δ 1 α − < ¯x α − ≤ 0, ℓ β − ≤ ¯x β + <− δ 1 β − ¯x α 0 =0, ¯x σ + =u σ + , ¯x σ − =ℓ σ − 82 However, it is important to note that the converse of this statement is generally not true. In other words, a d-stationary solution of (3.1) at γ that satisfies the above inequalities may be the basic solution of multiple distinct bases. The reason for this lies in the conceptual overlap between α 0 and α ± , as well as σ ± and β ± , in our definition of a basis. This overlap is intended to accommodate the degeneracy of a d-stationary solution, which is analogous to the degeneracy of a basic feasible solution in linear programming. Definition 44. A d-stationary solution x ∗ at γ is non-degenerate if the followings hold: • There is no i∈[n] so that x ∗ i =0 and (Qx ∗ +q) i ± γ δ p i =0 simultaneously; • There is no i∈[n] so that when x ∗ i =u i or ℓ i , the value of (Qx ∗ +q) i is zero. 
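Before turning to degeneracy, note that the basic solution of Definition 43 is affine in γ on each piece, so it is completely described by a constant vector and a slope vector. A minimal sketch of this computation is given below (ours, not the dissertation's code; the index sets are plain Python lists partitioning [n], the data are numpy arrays, and no validity checks are performed).

```python
import numpy as np

def basic_solution_coeffs(Q, q, p, delta, lo, up,
                          a0, a_pos, a_neg, b_pos, b_neg, s_pos, s_neg):
    """Return (const, slope) with xbar(gamma) = const + gamma * slope, following
    Definition 43.  The seven index lists partition range(n)."""
    n = len(q)
    const, slope = np.zeros(n), np.zeros(n)
    const[s_pos] = up[s_pos]                  # components fixed at the bounds
    const[s_neg] = lo[s_neg]
    G = a_pos + a_neg + b_pos + b_neg         # G_pm = alpha_pm union beta_pm
    if G:
        eta = np.zeros(n)
        eta[a_pos] = p[a_pos] / delta
        eta[a_neg] = -p[a_neg] / delta
        sig = s_pos + s_neg
        QGG = Q[np.ix_(G, G)]
        rhs = q[G] + (Q[np.ix_(G, sig)] @ const[sig] if sig else 0.0)
        const[G] = -np.linalg.solve(QGG, rhs)
        slope[G] = -np.linalg.solve(QGG, eta[G])
    return const, slope   # the piece is xbar(gamma) = const + gamma*slope on its interval
```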
Suppose for a given d-stationary x ∗ , we have the following sets: e α 0 (x ∗ )≜{i∈[n]:x ∗ i =0}, e α ± 0 (x ∗ )≜ n i∈[n]:x ∗ i =0,(Qx ∗ +q) i ± γ δ p i =0 o , e α + (x ∗ )≜{i∈[n]:0<x ∗ i <δ }, ˜ α − (x ∗ )≜{i∈[n]:− δ <x ∗ i <0}, e β + (x ∗ )≜{i∈[n]:δ <x ∗ i 0 − ¯q i +δ ¯p i | {z } Case 1 , max i∈β + :¯p i <0 − ¯q i +δ ¯p i | {z } Case 2 , max i∈α − :¯p i <0 − ¯q i − δ ¯p i | {z } Case 3 max i∈β − :¯p i >0 − ¯q i − δ ¯p i | {z } Case 4 , max i∈α + :¯p i <0 − ¯q i ¯p i | {z } Case 5 , max i∈α − :¯p i >0 − ¯q i ¯p i | {z } Case 6 max i∈α 0 :¯p i >0,¯q i <0 − ¯q i ¯p i | {z } Case 7 , max i∈α 0 :¯p i <2 p i δ ,¯q i >0 − ¯q i ¯p i − 2 p i δ | {z } Case 8 , max i∈β + :¯p i >0 − ¯q i +u i ¯p i | {z } Case 9 max i∈β − :¯p i <0 − ¯q i +ℓ i ¯p i | {z } Case 10 , max i∈σ + :¯p i <0 − ¯q i ¯p i | {z } Case 11 , max i∈σ − :¯p i >0 − ¯q i ¯p i | {z } Case 12 We refer to γ ∗ as the break point, which is the value that marks the left limit of the range of γ for which our current basis remains d-stationary. When γ ∗ is achieved by Cases 1- 4, the resulting d-stationary solution path will be discontinuous at γ ∗ , while the other cases yield a continuous break point. We will provide a detailed explanation of these cases later. For now, we focus on introducing the procedure for updating the d-stationary basis (α new 0 ,α new ± ,β new ± ,σ new ± ) for the subsequent interval of γ ≤ γ ∗ , under different scenarios Discontinuous Break Points For Case 1-4, we implement non-inc-GHP or non-dec-GHP with the following initializations to obtain a d-stationary solution of (3.1) at γ =γ ∗ : • Case 1 (namely ¯x k (γ ) increases to δ at γ ∗ ): initialize the non-dec-GHP with ¯ S 1 < =β − ∪σ − , ¯ S 1 > =α 0 ∪α − , S 1 < =α + \{k}, S 1 > =β + ∪{k}∪σ + . • Case 2 (namely ¯x k (γ ) decreases to δ at γ ∗ ): initialize the non-inc-GHP with ¯ S 1 < =β − ∪σ − , ¯ S 1 > =α − , S 1 < =α 0 ∪α + ∪{k}, S 1 > =β + \{k}∪σ + . 
86 • Case 3 (namely ¯x k (γ ) decreases to− δ at γ ∗ ): initialize the non-inc-GHP with ¯ S 1 < =β − ∪{k}∪σ − , ¯ S 1 > =α − \{k}, S 1 < =α 0 ∪α + , S 1 > =β + ∪σ + • Case 4 (namely ¯x k (γ ) increases to− δ at γ ∗ ): initialize the non-dec-GHP with ¯ S 1 < =β − \{k}∪σ − , ¯ S 1 > =α 0 ∪α − ∪{k}, S 1 < =α + , S 1 > =β + ∪σ + Suppose the GHP methods terminate with a d-stationary point x ∗ which is associated with the tuple ( ¯ S ∗ < , ¯ S ∗ < ,S ∗ > ,S ∗ > ), we update the basis for the next pivoting step as follows: • Case 1 and 4: α new 0 ≜{i∈ ¯ S ∗ > :x ∗ i =0}, σ new + ≜{i∈S ∗ > :x ∗ i =u i }, σ new − ≜{i∈ ¯ S ∗ < :x ∗ i =ℓ i } α new + ≜S ∗ < , α new − ≜ ¯ S ∗ > \α new 0 , β new + ≜S ∗ > \σ new + , β new − ≜ ¯ S ∗ < \σ new − • Case 2 and 3: α new 0 ≜{i∈S ∗ < :x ∗ i =0}, σ new + ≜{i∈S ∗ > :x ∗ i =u i }, σ new − ≜{i∈ ¯ S ∗ < :x ∗ i =ℓ i } α new + ≜S ∗ < \α new 0 , α new − ≜ ¯ S ∗ > , β new + ≜S ∗ > \σ new + , β new − ≜ ¯ S ∗ < \σ new − Continuous Break PointsWhenγ ∗ comesfromCase5-12,weupdatethebasisasfollows: • Case 5 (namely ¯x k (γ ) decreases to 0 at γ ∗ ): α new 0 =α 0 ∪{k}, α new + =α + \{k} 87 • Case 6 (namely ¯x k (γ ) increases to 0 at γ ∗ ): α new 0 =α 0 ∪{k}, α new − =α − \{k} • Case 7 (namely ¯x k (γ ) turns positive from 0 after γ ∗ ): α new 0 =α 0 \{k}, α new + =α + ∪{k} • Case 8 (namely ¯x k (γ ) turns negative from 0 after γ ∗ ): α new 0 =α 0 \{k}, α new − =α − ∪{k} • Case 9 (namely ¯x k (γ ) increases to u k at γ ∗ ): β new + =β + \{k}, σ new + =σ + ∪{k} • Case 10 (namely ¯x k (γ ) decreases to ℓ k at γ ∗ ): β new − =β − \{k}, σ new − =σ − ∪{k} • Case 11 (namely ¯x k (γ ) becomes smaller than u k after γ ∗ ): β new + =β + ∪{k}, σ new + =σ + \{k} • Case 12 (namely ¯x k (γ ) becomes larger than ℓ k after γ ∗ ): β new − =β − ∪{k}, σ new − =σ − \{k} 88 The sets α new 0 ,α new ± ,β new ± ,σ new ± that are not mentioned in the updates above are determined to be the same as α 0 ,α ± ,β ± ,σ ± . We summarize the whole pivoting scheme in below. Algorithm for x d cap : the pivoting scheme Initialization. Let γ ∗ =max i∈[n] δq i p i ,e γ =+∞, α 0 =[n],α ± =β ± =σ ± =∅ General Step. For the most up-to-date basis (α 0 ,α ± ,β ± ,σ ± ) ande γ : • Compute ¯p and ¯q in (3.10), and conduct ratio test to compute γ ∗ ; • Obtain (α new 0 ,α new ± ,β new ± ,σ new ± ) for (dis-)continuous γ ∗ as described above; • Retrieve the current piece as follows, x d cap (γ ) G ± =− ¯q G ± − γ ¯q G ± , x d cap (γ ) α 0 =0, x d cap (γ ) σ + =u σ + , x d cap (γ ) σ − =ℓ σ − , where γ ∈(γ ∗ ,e γ ] if γ ∗ is discontinuous, and γ ∈[γ ∗ ,e γ ] if γ ∗ is continuous; • Update (α 0 ,α ± ,β ± ,σ ± )=(α new 0 ,α new ± ,β new ± ,σ new ± ) ande γ =γ ∗ ; • Proceed to the next step where γ ≤ e γ . Termination. We stop when γ ∗ <0 or the most recent ratio test has no candidate. In the case where the break point γ ∗ is continuous, the primary computational load arises from solving linear equations to compute vectors ¯p and ¯q. The size of these equations is determined by the cardinality of G ± . It is worth noting that in this scenario, the sets G ± between two consecutive steps differ by only one index. Consequently, we do not need to solve the aforementioned linear system from scratch. This advantage is one of the key strengthsofparametricprogrammingtechniquescomparedtoheuristicgridsearchmethods, which require solving (3.1) anew at each selected grid point. 
In contrast, when encountering 89 a discontinuous break point, the GHP algorithms are employed to restore the path, which is considerably more intensive compared to the continuous cases. However, this additional workload should be regarded as a necessary cost for achieving a better approximation of the ℓ 0 -path, where discontinuities are expected as discussed in the previous section. 3.3.4 Analysis of the pivoting scheme Once we have established the design of our pivoting scheme, our first order of business is to address the following questions: • For Cases 1-4, can we guarantee obtaining a d-stationary solution at γ ∗ using the specialized initializations for the GHP algorithms? • For Cases 5-12, is x d cap indeed continuous at γ ∗ ? • Under what conditions can we terminate the pivoting scheme in finite time? Before delving into a formal analysis of these issues, we present a simple result that will prove useful in identifying continuous and discontinuous break points. Lemma 46. Let γ ≥ 0 be given, and x ∗ be the basic solution of a d-stationary basis (α 0 ,α ± ,β ± ,σ ± ) at γ , then x ∗ is the optimal solution of GHP subproblem: minimize x∈R n 1 2 x ⊤ Qx+q ⊤ x+γ X i∈S< p i δ x i − γ X i∈ ¯S> p i δ x i subject to 0≤ x S ≤ u S , ℓ¯S ≤ x¯S ≤ 0 (3.11) whereS =S < ∪S > , ¯ S = ¯ S < ∪ ¯ S > ,and(S < ,S > , ¯ S < , ¯ S > )isconstructedasoneofthefollowings: (i) S < =α 0 ∪α + , S > =β + ∪σ + , ¯ S > =α − , ¯ S < =β − ∪σ − . (ii) S < =α + , S > =β + ∪σ + , ¯ S > =α 0 ∪α − , ¯ S < =β − ∪σ − . 90 Proof. By optimality condition, x ∗ should be a solution of the following system: 0≤ x α + <δ 1 α + , (Qx+q) α + + γ δ p α + =0, − δ 1 α − <x α − ≤ 0, (Qx+q) α − − γ δ p α − =0, δ 1 β + <x β + ≤ u β + , (Qx+q) β + =0, ℓ β − ≤ x β − <− δ 1 β − , (Qx+q) β − =0, (Qx+q) α 0 + γ δ p α 0 ≥ 0, (Qx+q) α 0 − γ δ p α 0 ≤ 0, x α 0 =0, (Qx+q) σ + ≤ 0, (Qx+q) σ − ≥ 0, x σ + =u σ + , x σ − =ℓ σ − . (3.12) From (3.12), it is easy to verify that x ∗ satisfies the optimality condition of (3.11). Remark 47. Let ¯x be a solution that satisfies all the conditions in system (3.12), except for the strict inequalities. In other words, we have the followings hold only in the weak sense: 0≤ ¯x α + ≤ δ 1 α + , δ 1 β + ≤ ¯x β + ≤ u β + , − δ 1 α − ≤ ¯x α − ≤ 0, ℓ β − ≤ ¯x β + ≤− δ 1 β − We will refer to this type of solution as an almost d-stationary solution. It is important to note that even though ¯x is not necessarily d-stationary for (3.1), as the latter cannot take on the values± δ , it is straightforward to verify that ¯x is still the solution of the GHP subproblem (3.11) under the constructions introduced in Lemma 46. Notice that Lemma 46 and Remark 47 provide us the connection between a basis that admits an (almost) d-stationary solution, and a particular design of the GHP subproblem. In the next proposition, we will utilize this connection to demonstrate the convergence of GHP algorithms in Case 1-4, and the continuity of x d cap at break points γ ∗ obtained from Case 5-12. Proposition 48. The following results hold as valid for the pivoting scheme: (i) When a break point γ ∗ comes from Case 1-4, the path x d cap is discontinuous at γ ∗ , and 91 the GHP algorithms under the specialized initializations will find a d-stationary point x ∗ at γ ∗ in 3n steps. (ii) When a break point γ ∗ comes from Case 5-12, the path x d cap is continuous at γ ∗ and the proposed (α new 0 ,α new ± ,β new ± ,σ new ± ) are d-stationary for the next step. 
(iii)Iftheratiotestalwaysattainsitsmaximumatauniqueindex, thenthepivotingscheme will terminate in finitely many steps. Proof. For the statement (i), we will only prove it for Case 2, where lim γ ↓γ ∗ (x d cap (γ )) k = δ and k ∈ [n] is the index achieving the maximum in the ratio test. In other words, the kth element in the path decreases to δ in its limit when we approach γ ∗ from its right. The same argument can be applied to prove the rest of Case 1-4. Notice that in Case 2, we can easily verify that ¯x≜ lim γ ↓γ ∗ x d cap (γ ) is almost d-stationary, hence by Lemma 46 and Remark 47, ¯x is the solution of the GHP subproblem (3.11) when γ =γ ∗ and ¯ S < =β − ∪σ − , ¯ S > =α − , S < =α + ∪α 0 , S > =β + ∪σ + where (α 0 ,α ± ,β ± ,σ ± ) is the most up-to-date basis before γ ∗ . Also note that, under the specialized initialization (S 1 < ,S 1 > , ¯ S 1 < , ¯ S 1 > ) we introduced for the non-inc-GHP in Case 2, the solution of the first non-inc-GHP subproblem (denoted as e x for convenience) is optimal for (3.11) when γ =γ ∗ and ¯ S < = ¯ S 1 < =β − ∪σ − , ¯ S > = ¯ S 1 > =α − , S < =S 1 < =α 0 ∪α + ∪{k}, S > =S 1 > =β + \{k}∪σ + . By Remark 34, in order for the non-inc-GHP to terminate with a d-stationary solution x ∗ of (3.1) in 3n steps, we only need to demonstrate that the base case for the inductive argument in the proof of Theorem 41 holds, namely F 1 < = ¯ F 1 < =∅ ande x¯S 1 < 0 where ¯ S 1 = ¯ S 1 < ∪ ¯ S 1 > . By applying Lemma 37, it is easy to verify thate x≤ ¯x ande x k < ¯x k =δ . Additionally, since ¯x is almost d-stationary, we can obtaine x i ≤ ¯x i <δ for all i∈S 1 < \{k} ande x i ≤ ¯x i <− δ for all i∈ ¯ S 1 < . Combining these inequalities withe x k < δ , we can conclude that F 1 < = ¯ F 1 < =∅. 92 Moreover, using Lemma 37 and our construction of initialization, we have e x¯S 1 ≤ ¯x¯S 1 < 0. Consequently, by employing a similar argument to that used in the proof of Theorem 41, we canestablishthatthenon-inc-GHPconvergestoad-stationary x ∗ atγ ∗ in3nsteps. Finally, based on Corollary 40, we have x ∗ k ≤ e x k < ¯x k , which implies that x d cap is discontinuous at γ ∗ . Thus, we have completed our proof for statement (i). We will only prove the statement (ii) for Case 5, and the proof for the remaining cases follows a similar approach. Let ¯x be lim γ ↓γ ∗ x d cap (γ ), then by Lemma 46 it is optimal for GHP subproblem (3.11) when γ =γ ∗ and ¯ S < =β − ∪σ − , ¯ S > =α − , S < =α + ∪α 0 , S > =β + ∪σ + . (3.13) On the other hand, denotee x as lim γ ↑γ ∗ x d cap (γ ), which by Lemma 46 is the optimal solution of (3.11) when γ =γ ∗ and ¯ S < =β new − ∪σ new − , ¯ S > =α new − , S < =α new + ∪α new 0 , S > =β new + ∪σ new + . (3.14) Based on the construction of the new basis (α new 0 ,α new ± ,β new ± ,σ new ± ), it is easy to verify that the sets in (3.13) and (3.14) are identical. Therefore, since e x and ¯x are both the (unique) optimalsolutiontothesameproblem,weconcludethate x= ¯x. Asaresult,x d cap iscontinuous at γ ∗ . Finally, we observe that in each step of the pivoting scheme, the ratio test identifies the smallest γ ∗ at which the current basis remains d-stationary. This implies that as long as the ratio test always selects a unique index k that maximizes the ratio, the pivoting scheme will never revisit the same d-stationary basis. Consequently, considering the fact that there are onlyafinitenumberofbases,wecanconcludethatstatement(iii)holds,andthetermination of the pivoting scheme within a finite number of steps is guaranteed. 
93 In addition to the properties of the solution path x d cap , we can also apply Lemma 46 and Remark 47 to study the objective value path f d cap (γ ) = L(x d cap (γ )) + γ Φ( x d cap (γ )), where L(x)= 1 2 x ⊤ Qx+q ⊤ x. These results will be summarized in below. Proposition 49. The objective path have the following properties: (i) Let (γ 1 ,γ 2 ] be the range of γ associated with an arbitrary piece in the computed path x d cap (γ ), then f d cap (γ ) is concanve, non-decreasing and differentiable in γ ∈(γ 1 ,γ 2 ]. (ii) L(x d cap (γ )) is non-decreasing and Φ( x d cap (γ )) is non-increasing in γ ∈(γ 1 ,γ 2 ]. (iii) The overall f d cap (γ ) is non-decreasing and lower semi-continuous in γ ≥ 0. Proof. Let (α 0 ,α ± ,β ± ,σ ± ) be the d-stationary basis associated with the piece in (i) and (ii), and let γ ∈ (γ 1 ,γ 2 ] be arbitrary. By Lemma 46, x d cap (γ ) is the optimal solution of (3.11) when S < =α 0 ∪α + ,S > =β + ∪σ + , ¯ S > =α − and ¯ S < =β − ∪σ − . It is easy to verify that: f d cap (γ ) = minimum 0≤ x S ≤ u S ,ℓ¯S ≤ x¯S ≤ 0 L(x)+γ X i∈S< p i δ x i − X i∈ ¯S> p i δ x i + X i∈S>∪ ¯S< p i (3.15) From the Danskin’s Theorem, we have the conditions in (i) hold, where differentiability comes from uniqueness of solutions in (3.15). Moreover, the derivative of f d cap at γ can be specified as f d cap ′ (γ ) = X i∈S< p i δ x i − X i∈ ¯S> p i δ x i + X i∈S>∪ ¯S< p i = Φ( x d cap (γ )), hence by the concavity of f d cap (γ ) in γ ∈ (γ 1 ,γ 2 ], its derivative Φ( x d cap (γ )) is non-increasing in γ . Also note that if we take γ 1 <γ <γ ′ ≤ γ 2 , then we can show that: L(x d cap (γ ))+γ Φ( x d cap (γ ′ )) ≤ L(x d cap (γ ))+γ Φ( x d cap (γ )) ≤ f d cap (γ ′ )+Φ( x d cap (γ ))(γ − γ ′ ) = L(x d cap (γ ′ ))+γ Φ( x d cap (γ ′ )) 94 where the first inequality is by Φ( x d cap (γ )) non-increasing, the second inequality by (3.15) and the gradient inequality of a concave function, and the last equality is simply from expanding f d cap (γ ) = L(x d cap (γ ′ )) + γ Φ( x d cap (γ ′ )). From the derivations above, we obtain L(x d cap (γ ))≤ L(x d cap (γ ′ )), thus we have completed the proof for part (ii). To prove part (iii), assume that we have two consecutive pieces in the solution path x d cap to be associated with intervals (γ 1 ,γ 2 ] and (γ 2 ,γ 3 ]. If γ 2 is a continuous break point, then f d cap is trivially lower semi-continuous at γ 2 and it is nonincreasing in (γ 1 ,γ 3 ] by our conclusion in part (i). Thus, what remains to be shown is when γ 2 is discontinuous. First, observe that it is suffice to prove that f d cap (γ 2 ) = lim γ ↑γ 2 f d cap (γ ) ≤ v 0 ≜ lim γ ↓γ 2 f d cap (γ ) since this is enough to guarantee that f d cap is overall non-decreasing in (γ 1 ,γ 3 ] and f d cap (γ )≤ liminf γ →γ 2 f d cap (γ ). Noticethatv 0 =L(¯x)+γ 2 Φ(¯ x), where ¯x≜ lim γ ↓γ 2 x d cap (γ )isalmostd-stationary. By Lemma 46 and Remark 47, it is straightforward to verify that ¯x is the optimal solution and v 0 is the optimal value of (3.11) when (S < ,S > , ¯ S < , ¯ S > ) is specified as follows Case 1 and 4: ¯ S < =β − ∪σ − , ¯ S > =α − ∪α 0 , S < =α + , S > =β + ∪σ + Case 2 and 3: ¯ S < =β − ∪σ − , ¯ S > =α − , S < =α + ∪α 0 , S > =β + ∪σ + Ontheotherhand, denotev 1 (respectively, v ∗ )astheoptimalvalueofthefirst(respectively, last) GHP subproblem of the non-inc-GHP or non-dec-GHP, when they are initialized as specified in Case 1-4. By employing an argument similar to the proof of Proposition 42, we can conclude that v 1 ≤ v 0 . 
Moreover, through a simple inductive argument that iteratively applies Proposition 42, we can establish v ∗ ≤ v 1 hence v ∗ ≤ v 0 . Based on our construction 95 of (α new 0 ,α new ± ,β new ± ,σ new ± ), which is connected to the last GHP subproblem at termination by Lemma 46, we can easily conclude that v ∗ =f d cap (γ 2 ). In summary, we have derived: f d cap (γ 2 ) = v ∗ ≤ v 1 ≤ v 0 = lim γ ↓γ 2 f d cap (γ ) Consequently, considerourpreviousdiscussion, wefindthat f d cap isnon-decreasingin(γ 1 ,γ 3 ] andlowersemiconituousatγ 2 . Finally,theaforementioneddiscussionissufficienttoconclude the statement in part (iii). 3.3.5 Advantage of the specialized GHP initializations In principle, when encountering a discontinuous break point γ ∗ as described in Case 1-4, we can restore the computation of path x d cap using any d-stationary solution of (3.1) at γ =γ ∗ . For instance, we can utilize the d-stationary solution computed by: Case 1,4: non-dec-GHP with initialization ¯ S 1 > =S 1 < =S 1 > =∅, ¯ S 1 < =[n] Case 2,3: non-inc-GHP with initialization ¯ S 1 < = ¯ S 1 > =S 1 < =∅,S 1 > =[n] (3.16) namely, the nominal initializations as initially stated in Section 3.3.1. However, the special- ized GHP initializations, which we will refer to as the modified GHP , can offer us several crucialadvantages, bothanalyticallyandcomputationally. Fromananalyticallystandpoint, we have already observed the monotonicity and lower semi-continuity of the computed ob- jective path f d cap , which specifically depend on the modified GHPs, as demonstrated in the proofofProposition49. Subsequently,wepresentanotheranalyticalresultwhichguarantees a linear complexity in the number of pieces within the computed x d cap , when the modified GHPs are employed. Proposition 50. Let ℓ = 0, then our pivoting scheme will terminate in no more than 3n steps, namely, the computed x d cap (γ ) will have no more than 3n pieces. 96 Proof. Let (α 0 ,α ± ,β ± ,σ ± ) be the basis associated with an arbitrary step in the pivoting scheme, then α − =β − =σ − =∅ by ℓ=0. First, observe that vector the ¯p in (3.10) satisfies ¯p G ± = (Q G ± G ± ) − 1 η G ± ≥ 0 since η G ± ≥ 0 in this case by construction, and (Q G ± G ± ) − 1 has nonnegative elements by submatrix Q G ± G ± being Stieltjes. Moreover, by the Stieltjes property of Q, we have the off-diagonal blocks Q α 0 G ± and Q σ ± G ± possessing only nonpositive elements, thus ¯p α 0 = 1 δ p α 0 − Q α 0 G ± ¯p G ± ≥ 0 and ¯p σ ± = − Q σ ± G ± ¯p G ± ≥ 0 Altogether we have proven that ¯p≥ 0. In such scenario, only the following cases are feasible for the ratio test, where γ ∗ is the break point and k is the index achieving it: Case 1: x d cap increases to δ at γ ∗ Case 7: x d cap turns positive from zero after γ ∗ Case 9: x d cap increases to u k at γ ∗ Without loss of generality, assume that the final piece in x d cap computed by the pivoting scheme corresponds to the basis with σ + = [n] while the rest of the sets are empty. It is straightforward to verify that the reassignment of an arbitrary index j ∈ [n] during the pivoting scheme will follow the movements in below α 0 → α + → β + → σ + (3.17) To see this, note that when k = j, namely j achieves the maximum in the ratio test, the reassignment of j occurs during Case 7 and 9 will correspond to the first and third arrow in (3.17), respectively. When index j triggers Case 1, then according to our analysis in 97 Theorem 41, the non-dec-GHP will reassign j from α + to either β new + or σ new + . 
On the other hand, if j′ ≠ j and j′ = k under Case 1, then by the analysis in Theorem 41, the reassignment of index j in this step still adheres to the movements indicated in (3.17). In summary, throughout the pivoting scheme, an index j always moves uni-directionally from set α_0 toward σ_+, thus undergoing at most 3 movements. Hence, in the worst case, the pivoting scheme terminates in 3n steps.

In practical applications, the modified GHPs offer several empirical properties that are advantageous for computation. Firstly, by utilizing the most recent basis before a discontinuous break point, the modified GHPs are generally highly efficient in practice; the worst-case complexity of 3n steps guaranteed by Theorem 41 is often an overestimate. Secondly, when the modified GHPs are employed, the bases obtained throughout the entire pivoting scheme tend to be less sensitive to changes in γ. This is achieved through the specialized initializations and the properties of the GHP algorithms. Consequently, we avoid abrupt and significant changes among the bases obtained along the path, which commonly occur when other algorithms are applied, such as the GHPs with the initializations specified in (3.16). Such abrupt changes in the bases can undermine the effectiveness of the computed path when it is applied to hyperparameter selection tasks. Further details regarding this discussion will be provided in Section 3.4.

3.3.6 Absorbing basis

In our pivoting scheme, we deliberately choose to drive γ from +∞ to 0 in a decreasing fashion, as this choice brings several significant advantages to the practical computation. In what follows, we conduct a more formal examination of these advantages and present several interesting results that emerge from these studies. For convenience, we call a pivoting scheme with an increasing (respectively, decreasing) γ an increasing (respectively, decreasing) pivoting scheme, and a basis whose α_0 = [n], with the rest of the sets being empty, will be referred to as the zero basis.

An evident advantage of a decreasing pivoting scheme is that, if on the contrary an increasing pivoting scheme is applied (its implementation is similar to the decreasing version and hence omitted here), the first step of pivoting requires solving a full-dimensional (in n) linear system. This can be computationally expensive when n is large, and it is also unnecessary: in reality α_0 ∪ σ_± ≠ ∅, so the true dimension of the linear system to be solved can be far less than n. Another more subtle but important reason is that if an increasing pivoting scheme is employed, we have no control over the final basis attained at its termination. In practice, we can obtain a final basis that is not the zero basis; namely, the d-stationary solutions in the tail of the computed path x^d_cap will not be zero, which conflicts with the intuition behind the sparsity-inducing utility of the capped ℓ_1-penalty in practice. To formally characterize this phenomenon, we define the following concepts for a basis to be absorbing.

Definition 51. Let R be the range of γ for a basis (α_0, α_±, β_±, σ_±) to be d-stationary. If R is unbounded, then this basis is absorbing, with the following three types.
• If R = ℝ, then this basis is doubly absorbing.
• If R = (−∞, γ̄] for some γ̄ ∈ ℝ, then this basis is left absorbing.
• If R = [γ, +∞) for some γ ∈ ℝ, then this basis is right absorbing.

Notice that the left and right end points of R in Definition 51 come from solving the linear inequalities (3.8) in γ.
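Since every condition in (3.8) is affine in γ once the basic solution is written as const + γ·slope, the endpoints of R can be found by intersecting one-dimensional half-lines. The generic helper below is a sketch under that representation (the coefficient pairs are our way of encoding the inequalities, not notation from the text); an unbounded interval signals an absorbing basis, and in the pivoting scheme the finite left endpoint plays the role of the break point γ*.

```python
import numpy as np

def gamma_interval(affine_conditions):
    """Interval of gamma on which every condition a + b*gamma >= 0 holds.
    `affine_conditions` is an iterable of (a, b) pairs; the returned interval
    (g_lo, g_hi) is empty whenever g_lo > g_hi."""
    g_lo, g_hi = -np.inf, np.inf
    for a, b in affine_conditions:
        if b > 0:
            g_lo = max(g_lo, -a / b)     # the condition binds for small gamma
        elif b < 0:
            g_hi = min(g_hi, -a / b)     # the condition binds for large gamma
        elif a < 0:
            return np.inf, -np.inf       # a constant condition that never holds
    return g_lo, g_hi
```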
In the context of pivoting, an absorbing basis is the case in which the ratio test has no candidate, hence the pivoting scheme terminates. More specifically, we terminate the pivoting scheme when we encounter a right (respectively, left) absorbing basis in the current step if we are driving γ upwards (respectively, downwards), and we terminate at the first step if we start with a doubly absorbing basis. Note that we can similarly define a basis for the computation of the ℓ_1-path x_1(γ): namely, we discard β_± and re-define α_± so that i ∈ α_+ ⇒ 0 ≤ x_i ≤ u_i and i ∈ α_− ⇒ ℓ_i ≤ x_i ≤ 0. In this case, due to the uniqueness of the optimal solution of (3.1) when Φ is the ℓ_1-function, it is easy to verify that the ℓ_1-problem has only one trivial right absorbing basis, which is the zero basis. Additionally, the ℓ_1-problem does not admit a doubly absorbing basis. Consequently, if we apply an increasing pivoting scheme to the ℓ_1-case, we will always terminate with a path x_1(γ) such that x_1(γ) = 0 for all sufficiently large γ ≥ 0.

However, this is generally not the case for the parametric capped ℓ_1-problem. To illustrate this, we begin with a detailed characterization of the doubly, right and left absorbing bases for a capped ℓ_1-problem. We will call a basis (α_0, α_±, β_±, σ_±) that satisfies a certain property unique up to degeneracy when every other basis satisfying the same property admits a basic solution identical to the basic solution of (α_0, α_±, β_±, σ_±).

Proposition 52. Let (α_0, α_±, β_±, σ_±) be a basis of the parametric capped ℓ_1-problem with R being the range of γ for this basis to be d-stationary, and suppose that R ≠ ∅.
(i) When α_± ≠ ∅, it cannot be absorbing.
(ii) If it is absorbing, then it is either right or doubly absorbing.
(iii) It is right absorbing if and only if α_± = ∅ and α_0 ≠ ∅.
(iv) It is doubly absorbing if and only if α_± = α_0 = ∅.
(v) If it is doubly absorbing, then it is unique up to degeneracy and its basic solution is the unique solution of (3.1) when γ = 0.

Proof. According to (3.10), the basic solution x̄(γ) of (α_0, α_±, β_±, σ_±) at γ satisfies x̄_{G±}(γ) = −q̄_{G±} − γ p̄_{G±}, where G_± = α_± ∪ β_±, p̄_{G±} = (Q_{G±G±})^{−1} η_{G±}, η_{α+} = (1/δ) p_{α+}, η_{α−} = −(1/δ) p_{α−} and η_{β±} = 0. If α_± ≠ ∅ then η_{G±} ≠ 0, thus we have p̄_{G±} ≠ 0 by Q_{G±G±} being non-singular. Observe
On the other hand, when α 0 ̸=∅, the inequalities in (3.8) are reduced to: γ δ p α 0 ≥ q α 0 +Q α 0 σ ± ξ σ ± − Q α 0 β ± ¯q β ± whose solution set is equivalent to γ ≥ γ for some γ ≥ 0, since δ > 0 and p i > 0 for all i∈ [n]. Thus when α 0 ̸=∅, the basis is right absorbing. In summary, we have proven the statements in (ii), (iii) and (iv). Finally, to prove (v), notice that when the basis is doubly absorbing hence α 0 = α ± = ∅, it is d-stationary for all γ ∈ R, and its basic solution is constant in γ according to (3.18). In particular, this basic solution is d-stationary for γ =0, where the objective of (3.1) is simply a strongly convex quadratic function, therefore only obtains a unique global hence d-stationary solution. 101 Corollary 53. Let x 0 be the unique optimal solution of (3.1) when γ =0. (i) There exists a doubly absorbing basis if and only if|x 0 i |>δ for all i∈[n]. (ii) If there is a doubly absorbing basis, it is unique up to degeneracy. Additionally, there is a unique continuous d-stationary path x d cap , and it satisfies x d cap (γ )=x 0 for all γ ≥ 0. Proof. The verification of the statement in part (i) can be readily done by referring to the proofsofpart(iv)and(v)inProposition52. Giventheuniquenessofx 0 ,alldoublyabsorbing bases will share the same basic solution, making them unique up to degeneracy. Suppose there exists a doubly absorbing basis and assume that it is unique without loss ofgenerality, atrivialcontinuousd-stationarypathisgivenby x d cap (γ )=x 0 forallγ ≥ 0. To demonstrate its uniqueness, we can employ an increasing pivoting scheme aiming at finding an alternative continuous path. Notice that any d-stationary path x d cap satisfies x d cap (0)=x 0 duetotheuniquenessofx 0 . Thus,weinitializetheschemewiththeuniquedoublyabsorbing basis. In order to obtain a distinct continuous d-stationary path, we would need to change the basis at some γ ∗ > 0. However, by referring to (3.18), it can be easily verified that any such change in the basis would introduce discontinuity in the path. As a result, the path with a constant value of x 0 remains the unique continuous d-stationary path. Corollary54. Thereexistsaγ ≥ 0sothatforallγ ≥ γ ,ad-stationarybasis(α 0 ,α ± ,β ± ,σ ± ) at γ has α ± =∅. Namely, for all γ ≥ γ , a d-stationary solution x ∗ of (3.1) has either x ∗ i =0 or|x ∗ i |>δ . Proof. Take γ to be the largest γ so that there exists a non-absorbing or left absorbing basis to be d-stationary, and the rest is trivial by Proposition 52. Proposition 52 confirms the existence of a doubly absorbing basis whose basic solution has no sparsity, and a right absorbing basis that is not a zero basis. On the one hand, if there exists a doubly absorbing basis, then it will be the unique (up to degeneracy) d- stationary basis at γ = 0, which is the starting point when an increasing pivoting scheme is applied. Consequently, we will terminate immediately, and the resulting path will be a 102 constantvectorwithoutanyzeroelements. Forinstance,considerthefollowingspecifications for the parameters of (3.1): Q= 2 − 1 − 1 2 , q = − 1 − 1 , p= 1 1 , u=− ℓ= 2 2 , δ =0.1 Then, in the given scenario, β + = {1,2} (with the remaining sets being empty) is the unique doubly absorbing basis, and the path we obtain from the increasing pivoting scheme is x d cap (γ )=(1,1) ⊤ for all γ ≥ 0. 
On the other hand, even if our problem does not possess a doubly absorbing basis, due to the potential discontinuity in the path and non-uniqueness of d-stationary solutions, it is possible for an increasing pivoting scheme to encounter a right absorbing basis that is not a zero basis. In such cases, we will terminate and obtain a d-stationary path x d cap (γ ) with nonzero values for any γ ≥ 0. For instance, when the parameters for problem (3.1) are specified in below Q= 2 − 1 − 1 2 , q = − 1 0.9 , p= 1 1 , u=− ℓ= 2 2 , δ =0.3 In this particular example, our problem has no doubly absorbing basis. However, α 0 = {1},β + = {2} (with the rest being empty) is right absorbing, and it remains d-stationary for all γ ≥ 0.225. Meanwhile, the zero basis is only d-stationary after γ ≥ 0.3. Therefore, if an increasing pivoting scheme obtains α 0 ={1},β + ={2} as the new basis at 0.225 < 0.3, we will terminate with a path whose value is consistently determined as (0,0.45) ⊤ , instead of 0, for all γ ≥ 0.225. These observations are sound from a mathematically perspective. However, they appear to be counter-intuitive since, in order to induce sparsity, we would expect the value of the computed path to attain a value of zero when γ is sufficiently large. To address such issue, a sensible approach is to apply the decreasing pivoting scheme, where we start with 103 γ = max i∈[n] | δq i p i |, which is the lower bound of γ for the zero basis to be d-stationary. By adopting this initialization, we intentionally ensure that the tail of our path is zero, thereby making it more comparable to the ℓ 0 - and ℓ 1 -paths. We conclude our discussion by highlighting an important implication of the existence of a doubly absorbing basis, as presented in Corollary 53. Recall that when we present the design of our pivoting scheme, we have emphasized that the ℓ 0 -path is piecewise constant hence discontinuous. Consequently, instead of computing a continuous path, we should expect and embrace the discontinuities in computing a capped ℓ 1 d-stationary path, in order to achieve a good approximation of the ℓ 0 -path. Corollary 53 formalizes this intuition. Namely, in order to facilitate the approximation of a capped ℓ 1 -function to an ℓ 0 -function, it is common practice to employ a δ > 0 that is sufficiently small. In this case, the condition that|x 0 i |<δ foralli∈[n]canbeeasilymet,andfromCorollary53wecanconcludethatthe only continuous capped ℓ 1 d-stationary path we can compute is the one that is constant and lacks any sparsity. Therefore, to effectively induce sparsity in practical scenarios, we should anticipateatrade-offwherethecomputationworkloadincreasesduetotheGHPalgorithms, which we apply to restore discontinuities. 3.4 Numerical Experiments In this section, we compare the numerical performance of the three solution paths: • Theexact ℓ 0 -path: thisiscomputedbysolving(independently)asequenceofmixed- integer nonlinear programs determined by the weighted sum method, see [4, 14]. Some details of the mixed-integer formulation used to solve (3.19) (for a fixed value of γ ) is giveninSection3.4.2. WeusetheCPLEXsolvertosolveeachmixed-integerprogram. • The ℓ 1 -path: this is computed by aMATLAB R2017b implementation of the para- metric procedure in [43] when specialized to (3.1) when Φ is ℓ 1 -function. 104 • The capped ℓ 1 d-stationary path: this is computed by our pivoting scheme coded inMATLAB R2017b. All the numerical experiments are conducted on a Mac OS X personal computer with 2.3 GHz Intel Core i7 and 8 GB RAM. 
The reported times are in seconds on this computer. 3.4.1 The ℓ 1 - and capped ℓ 1 -paths on synthetic problems To gain some preliminary experience with the performance of our pivoting scheme, we first carry out a set of experiments on some synthetic ℓ 1 - and capped ℓ 1 -problems with randomly generated data and two dimensions: n = 500 and 5,000. We did not include the ℓ 0 -path in this set of experiments because these dimensions are too large for this path to be computed in practical scenarios (more details below). Since our next goal is to compare all three paths on the Gaussian Markov Random Field (GMRF) problems whose matrix Q is very sparse, we generate Q in the synthetic problems as a sparse Stieltjes-matrix in the following way. First, we set a ratio 2/n of the off-diagonal elements of Q being nonzero, which is the same amount of nonzero off-diagonal elements as a tridiagonal matrix. The nonzero off-diagonal entries of Q are independent random numbers that are uniformly sampled from the interval [− 1,0]. With these off- diagonal elements generated, we add sufficiently large diagonal terms to keep Q positive definite. Additionally, we randomly generate independently and identically distributed (iid) entries q i ∼ Uniform([− 10,10]) and p i ∼ Uniform([0,1]) for all i∈[n]. The experiments consist of unconstrained and constrained problems; for the latter, we set − ℓ i = u i = max{ 11 10 δ, 1 2 |(− Q − 1 q) i |} for all i ∈ [n] and test several values of δ ∈ {10,1,10 − 1 ,10 − 4 }, which controls the approximation of capped ℓ 1 to ℓ 0 . The results are summarized in Table 3.2, where “Bpts.” and “Dis. Bpts.” stand for the total numbers of break points and discontinuous break points (i.e., the number of GHP restorations). All the statistics are averaged over 10 runs. 105 n= 500 n =5000 Unconstrained Unconstrained Settings Time Bpts. Dis.Bpts. Settings Time Bpts. Dis.Bpts. cap δ = 10 − 4 1.03 959 479 cap δ =10 − 4 96.48 9568 4783 cap δ = 10 − 1 1.17 951 459 cap δ =10 − 1 108.24 9490 4582 cap δ = 1 0.86 797 297 cap δ =1 86.86 7974 2968 cap δ = 10 0.17 504 0 cap δ =10 11.08 5029 0 ℓ 1 0.11 504 N/A ℓ 1 11.03 5029 N/A Constrained Constrained Settings Time Bpts. Dis.Bpts. Settings Time Bpts. Dis.Bpts. cap δ = 10 − 4 0.74 974 487 cap δ =10 − 4 27.01 9762 4881 cap δ = 10 − 1 0.80 965 469 cap δ =10 − 1 30.89 9646 4681 cap δ = 1 0.75 807 299 cap δ =1 38.84 8046 2981 cap δ = 10 0.22 504 0 cap δ =10 11.66 5029 0 ℓ 1 δ = 10 − 4 0.35 989 N/A ℓ 1 δ =10 − 4 13.13 9886 N/A ℓ 1 δ = 10 − 1 0.31 978 N/A ℓ 1 δ =10 − 1 13.64 9763 N/A ℓ 1 δ = 1 0.29 886 N/A ℓ 1 δ =1 15.09 7846 N/A ℓ 1 δ = 10 0.19 504 N/A ℓ 1 δ =10 11.81 5029 N/A Table 3.2: Summary of tests on parametric ℓ 1 and capped ℓ 1 -solution. By examining Table 3.2, we can observe that when δ = 10, a relatively large value, the behavior of the capped ℓ 1 -path closely resembles that of the ℓ 1 -path in terms of the computational time and the number of breakpoints. Additionally, in this scenario, there is no need to restore discontinuities in the computation of the capped ℓ 1 -path, making it comparable to the ℓ 1 -path. On the contrary, for other values of δ , the computation of the capped ℓ 1 -path requires more time, and these paths exhibit more pieces and discontinuous breakpoints. This aligns with our previous analysis in Proposition 53: when δ is small (e.g., δ = 10 − 4 ), the only continuous d-stat path is a constant one, given by x d cap (γ )=x 0 for all γ ≥ 0, where x 0 is the unique optimal solution of (3.1) at γ = 0. 
In general, such constant paths lack sparsity. As 106 a result, the capped ℓ 1 -path computed by the pivoting scheme becomes discontinuous with more breakpoints when δ is smaller, leading to significantly longer computational times. The increased workload depends on the instances; in some cases, the computation time for cappedℓ 1 exceedsthatoftheℓ 1 -pathbymorethan10times. Nevertheless,itremainswithin a reasonable range of 2 minutes on our personal computer (for n=5,000 and δ =10 − 4 ). However, for the unconstrained cases, capped ℓ 1 -paths with small δ (e.g., 10 − 4 ) consis- tently achieve better loss function values L(x) than both the ℓ 1 -path and capped ℓ 1 -paths underlargerδ ,whenthesolutionsfromthesepathshasthesamesparsity. Thisphenomenon persists in the constrained cases, when a capped ℓ 1 -path is compared to its ℓ 1 counterpart. Toillustratethis,weplotL(x)asafunctionofsparsity|x| 0 fortheconsideredpathsinFigure 3.1 and 3.2, which are computed from some representative instances in both unconstrained and constrained scenarios. (a) n=500. (b) n=5000. Figure 3.1: Quadratic loss L(x) as a function of the sparsity∥x∥ 0 (unconstrained cases). These figures confirm that the increased computation workload of the capped ℓ 1 -path benefits us in solutions with higher qualities per the computed L(x)-values. In the next section, we show more advantages of the capped ℓ 1 -path in the context of the GMRF model. Notethatinourexperimentstheconstrainedpathsgenerallytakelesstimetotracethan the unconstrained paths. This is reasonable in that the two major computing expenses, namelysolvinglinearsystemsforthecontinuouscases(Case5-12inthepivotingscheme)and 107 discontinuityrestorationsbytheGHPalgorithms,aresignificantlyreducedintheconstrained cases. This is due to the presence of sets σ ± , which reduces the effective dimensions of the linear systems and GHP subproblems which we need to solve. Finally, it is important to note that the reported run-times correspond to computing local minimizers for all values of γ , which can be viewed as solving O(n) fixed-parameter problemswithcarefullyselectedparametervalues. Thetimecomplexityperfixed-parameter problem falls within the range of 1 to 10 milliseconds for the case of n = 5,000. This run- timeiscompetitive, ifnotbetter,comparedtoexistingheuristicsfoundintheliterature[85]. Furthermore, for context, in [48], the authors evaluate the solution at 100 discrete values of γ in a problem with n=17,000 within 3 seconds. While the reported run-times in our case are larger by an order of magnitude, it is essential to consider that we compute the complete path rather than a discrete approximation. In most situations, run-times of minutes are perfectly acceptable. (a) n=500,δ =10 − 4 . (b) n =500,δ =10 − 1 . (c) n=500,δ =1. (d) n=500,δ =10. (e) n=5000,δ =10 − 4 . (f) n=5000,δ =10 − 1 . (g) n =5000,δ =1. (h) n=5000,δ =10. Figure 3.2: Quadratic loss L(x) as a function of the sparsity∥x∥ 0 (constrained cases). 108 3.4.2 Results on the GMRF problem Consider a two-dimensional graphical model with the following prescribed parameters: a gridsizet∈Z + (withn=t 2 ), a“spikesize”parameter s∈Z + , a“spikenumber”parameter h ∈ Z + and a noise parameter σ . We generate the true values of the stochastic process X ∈R t× t as follows. Construct the precision matrix Θ ∈R (s× s)× (s× s) such that Θ ij,ij =4 for all i,j∈ [s], Θ ij,(i+1)j = Θ (i+1)j,ij =− 1 for i∈ [s− 1] and j∈ [s], Θ ij,i(j+1) = Θ i(j+1),ij =− 1 for i∈ [s] and j∈ [s− 1], and Θ ij,kℓ = 0 otherwise. 
We use the notation X [i,j],[k,ℓ] to denote the submatrix of X from rows i to j (inclusive) and columns k to ℓ (inclusive). Initially, X is fully sparse, this is, X =0. Then we iteratively repeat h times the following process. • Randomly select indexes i∈[t+1− s] and j∈[t+1− s], corresponding to the initial row and column of a spike. • Sample a Gaussian shock w∈R s× s with zero mean and Θ − 1 as the covariance. • Add shock w to the current X [i,i+s],[j,j+s] . The resulting X is thus mostly sparse, but each non-zero s× s spike corresponds to a two- dimensional GMRF. Finally, we sample noisy observations from X, given by y ij =X ij +ε ij , where ε ij are iid Gaussian noise with zero mean and σ 2 variance. The values y ij are the inputs of the following ℓ 0 -regularized problem of interest, which is a special case of (3.1) when Φ is ℓ 0 -function: f 0 (γ ) = minimum x p X i=1 p X j=1 1 σ 2 (y ij − x ij ) 2 + p− 1 X i=1 p X j=1 (x ij − x i+1,j ) 2 + p X i=1 p− 1 X j=1 (x ij − x i,j+1 ) 2 +γ ∥x∥ 0 . (3.19) The mixed-integer formulation for solving the above problem for a given value of γ is based on the following convexification from [7]. The resulting relaxation is stronger than the perspective relaxation, commonly used in mixed-integer programming [9, 85]. 109 Proposition 55. Let set X = (x,z,τ )∈R n + ×{ 0,1} n × R: n X i=1 x i ! 2 ≤ τ, x i (1− z i )=0, for all i∈[n] . Then the closure of the convex hull of X is given by (x,z,τ )∈R n + × [0,1] n × R: n X i=1 x i ! 2 ≤ τ min ( 1, n X i=1 z i ) . An application of Proposition 55 to problem (3.19) yields the mixed-integer second-order conic formulation: minimize x,z,u,v,w,s,t p X i=1 p X j=1 1 σ 2 y 2 ij − 2y ij x ij + u ij + p− 1 X i=1 p X j=1 v ij + p X i=1 p− 1 X j=1 w ij +γ p X i=1 p X j=1 z ij subject to x 2 ij ≤ u ij z ij , for all (i,j) (x ij − x i+1,j ) 2 ≤ v ij s ij , s ij ≤ z ij + z i+1,j , 0≤ s ij ≤ 1, for all (i,j) (x ij − x i,j+1 ) 2 ≤ w ij t ij , t ij ≤ z ij + z i,j+1 , 0≤ t ij ≤ 1, for all (i,j) x ∈ R p× p , z ∈ {0,1} p× p . We compare the three paths on problems with t = 10 (thus 100 variables in total, in this setting we use h = s = 3), where the ℓ 1 - and capped ℓ 1 -paths are obtained from problem (3.19) with the ℓ 0 -regularizer substituted by the ℓ 1 - and capped ℓ 1 -regularizer. In addition, the ℓ 1 - and capped ℓ 1 -paths are also tested with t = 100 (thus 10,000 variables in total, in this setting we use h=s=10). The numerical tracing of the ℓ 0 -path is handicapped by the challenge of solving non- linear mixed-integer programs by theCPLEX solver; on problems with 100 variables, the 110 computation already takes 800 seconds. Thus, the computation of the exact ℓ 0 -path on the larger-sized problems is expected to be prohibitively impractical and thus omitted. Theexperimentsaimtoevaluatethedifferentmethodsbothfromanoptimizationstand- point(run-timeandobjectivevalues)andastatisticalstandpoint(howwellcanthemethods recover the underlying “true” signal?). Moreover, we are interested in: (i) confirming the improved quality of the capped ℓ 1 -dstat path as a surrogate for the ℓ 0 -path from an opti- mizationpointofview,and(ii)showingtheadvantageofthecappedℓ 1 -pathovertheℓ 1 -path when applied to the hyper-parameter selection for the GMRF maximum a posteriori infer- ence, namely problem (3.19) with the appropriate regularizer. 
Through these experiments, we can confirm the effectiveness of the capped ℓ 1 d-stationary path as a practical compro- mise between the ℓ 0 -path and the ℓ 1 -path, remedying the slow computational speed of the former for large-scale problems, and improving the solution quality over the latter without sacrificing the efficiency. Optimization standpoint Each plot in Figure 3.3 shows, the value of the quadratic loss L(x) as a function of the spar- sity∥x∥ 0 , where each x corresponding to a break point in the solution path of a given kind. Additionally, each plot corresponds to a single instance. In the cases when the instances are small, the exact ℓ 0 -paths always produce the best solutions, as expected. Moreover, the approximation provided by the ℓ 1 -path is consistently the worst, and the capped ℓ 1 - paths generally perform better as δ decreases, despite the increasing non-convexity of the optimization problem. In particular, for δ ∈ {10 − 4 ,10 − 3 }, the capped ℓ 1 -paths are almost indistinguishable from the ℓ 0 -path, showing that our pivoting scheme is effective at consis- tently finding high-quality local (if not global) minimizers of the associated optimization problems. Inthelarge-scalecases,whileitisnotpossibletocomparewiththeexactℓ 0 -path,westill observe that the path of the capped ℓ 1 -method delivers substantially better solutions than 111 the ℓ 1 -path. Additionally, for both small and large instances, the improvements achieved by the capped ℓ 1 -formulation over the ℓ 1 -formulation are particularly pronounced in the low signal-to-noise regimes. Small instances n=100 Method σ = 0.02 σ =0.1 σ =0.3 σ =0.5 σ =0.7 σ =1 ℓ 0 39.75 76.60 340.80 966.00 1,307.80 1,935.60 capped ℓ 1 0.66 1.03 0.45 0.32 0.84 0.92 ℓ 1 0.12 0.03 0.02 0.02 0.03 0.06 Large instances n=10,000 Method σ = 0.02 σ =0.1 σ =0.3 σ =0.5 σ =0.7 σ =1 ℓ 0 N/A N/A N/A N/A N/A N/A capped ℓ 1 200.45 376.79 415.13 467.36 509.37 579.95 ℓ 1 29.57 29.48 33.57 36.96 38.85 42.17 Table 3.3: Summary of computational times (in seconds). Table3.3presentsthecomputationaltimes(inseconds)requiredtocomputethesolution paths. We only report the capped ℓ 1 -results for δ = 10 − 4 (averaged over 5 instances) since this is a preferred choice according to our previous experience. It is worth mentioning that similar to what is reported in Table 3.2, capped ℓ 1 -paths under δ = 10 − 4 generally require more time to compute than the capped ℓ 1 -paths under larger δ values, e.g., δ = 10. All the methods require more time as the noise (σ ) increases: for the ℓ 1 - and capped ℓ 1 -problems, computational times increase at most by a factor of three, whereas for the ℓ 0 -problem, computational times increase by two orders of magnitude. We observe that for the small instances, the exact ℓ 0 -path can be computed in approxi- mately one minute in high signal-to-noise regimes, and under one hour in low signal-to-noise regimes. It is important to note that computing the solution path for n = 100 requires solving approximately 160 nonlinear mixed-integer optimization problems. Although each problemissolvedrelativelyquickly(rangingfromunderonesecondto15seconds,depending on the noise level), the lack of an integrated parametric scheme leads to long computational 112 times. As a point of reference, computing the exact ℓ 0 -path for n = 225 and σ = 1 takes more than one day, indicating that handling instances with n = 10,000 exactly is currently beyond the capabilities of existing solvers. 
On the other hand, computing the capped ℓ 1 -path is up to 10 times more expensive than the ℓ 1 -path, but it is four orders of magnitude faster than the ℓ 0 -method for the instances with n = 100. In fact, capped ℓ 1 -paths can be computed in one second for n = 100, and in 10 minutes for n = 10,000. These computation times are acceptable for an experimental MATLAB implementation. Therefore, considering that the capped ℓ 1 -method also delivers near-optimal solutions, we conclude that it is a much more practical choice than the ℓ 0 -path for the large instances, without compromising the solution quality. It is worth mentioning thattheℓ 1 -pathscanbecomputedveryquickly,inunderoneminuteevenforlargeinstances. However, as previously noted, the fast computation of the ℓ 1 -path comes at the expense of a worse solution quality. Statistical standpoint In the following experiments, we expand upon our discussion from Section 1.3, and compare the performance of the ℓ 0 -, ℓ 1 - and capped ℓ 1 -paths in the context of hyperparameter selec- tion. Specifically, we focus on the inference of the GRMF model and aim to determine the appropriate value γ that provides a reasonable recovery of the underlying signal X. To this end, we re-introduce the two criteria we employ for selecting γ : Signal recovery: X i∈[t] X j∈[t] (x ∗ ij (γ )− X ij ) 2 Support recovery: X i∈[t] X j∈[t] |x ∗ ij (γ )| 0 −| X ij | 0 where x ∗ (γ ) denotes a computed solution of the three paths at a given γ . 113 (a) Small inst., σ =0.1 (b) Large inst., σ =0.1 (c) Small inst., σ =0.5 (d) Large inst., σ =0.5 (e) Small inst., σ =0.7 (f) Large inst., σ =0.7 Figure 3.3: Quadratic loss L(x) as a function of the sparsity∥x∥ 0 . 114 (a) Small inst., σ =0.1 (b) Large inst., σ =0.1 (c) Small inst., σ =0.5 (d) Large inst., σ =0.5 (e) Small inst., σ =0.7 (f) Large inst., σ =0.7 Figure 3.4: Signal recovery as a function of support recovery. 115 Small instances n=100 Settings σ =0.02 σ =0.1 σ =0.3 Supp.Rec. Sig.Rec. Supp.Rec. Sig.Rec. Supp.Rec. Sig.Rec. ℓ 0 1.2e0 1.41e-1 4.0e0 5.34e-1 8.8e0 1.06e0 cap(δ =10 − 4 ) 1.2e0 1.41e-1 4.0e0 5.34e-1 8.8e0 1.06e0 cap(δ =10 − 3 ) 1.2e0 1.41e-1 4.0e0 5.34e-1 8.8e0 1.06e0 cap(δ =10 − 2 ) 1.2e0 1.41e-1 4.0e0 5.34e-1 8.8e0 1.06e0 cap(δ =10 − 1 ) 1.2e0 1.43e-1 4.0e0 5.37e-1 8.8e0 1.07e0 cap(δ =10 0 ) 1.2e0 1.45e-1 4.0e0 5.41e-1 8.8e0 1.10e0 ℓ 1 1.2e0 1.45e-1 3.6e0 5.41e-1 8.8e0 1.10e0 Settings σ =0.5 σ =0.7 σ =1 Supp.Rec. Sig.Rec. Supp.Rec. Sig.Rec. Supp.Rec. Sig.Rec. ℓ 0 1.4e1 1.33e0 2.0e1 1.29e0 2.2e1 8.36e-1 cap(δ =10 − 4 ) 1.4e1 1.33e0 1.9e1 1.26e0 2.2e1 8.18e-1 cap(δ =10 − 3 ) 1.4e1 1.33e0 1.9e1 1.26e0 2.2e1 8.18e-1 cap(δ =10 − 2 ) 1.4e1 1.33e0 1.9e1 1.26e0 2.2e1 8.18e-1 cap(δ =10 − 1 ) 1.5e1 1.34e0 2.0e1 1.26e0 2.2e1 8.18e-1 cap(δ =10 0 ) 1.5e1 1.42e0 2.0e1 1.27e0 2.2e1 8.23e-1 ℓ 1 1.5e1 1.42e0 2.0e1 1.28e0 2.2e1 8.27e-1 Large instances n=10,000 Settings σ =0.02 σ =0.1 σ =0.3 Supp.Rec. Sig.Rec. Supp.Rec. Sig.Rec. Supp.Rec. Sig.Rec. ℓ 0 N/A N/A N/A N/A N/A N/A cap(δ =10 − 4 ) 3.7e1 4.63e-1 1.1e2 1.75e0 2.6e2 3.42e0 cap(δ =10 − 3 ) 3.7e1 4.63e-1 1.1e2 1.75e0 2.6e2 3.42e0 cap(δ =10 − 2 ) 3.7e1 4.63e-1 1.1e2 1.75e0 2.6e2 3.42e0 cap(δ =10 − 1 ) 3.7e1 4.67e-1 1.1e2 1.75e0 2.7e2 3.49e0 cap(δ =10 0 ) 3.6e1 4.68e-1 1.0e2 1.77e0 2.7e2 3.70e0 ℓ 1 3.6e1 4.68e-1 1.0e2 1.77e0 2.7e2 3.70e0 Settings σ =0.5 σ =0.7 σ =1 Supp.Rec. Sig.Rec. Supp.Rec. Sig.Rec. Supp.Rec. Sig.Rec. 
ℓ 0 N/A N/A N/A N/A N/A N/A cap(δ =10 − 4 ) 5.3e2 4.68e0 8.1e2 4.68e0 9.6e2 5.11e0 cap(δ =10 − 3 ) 5.3e2 4.68e0 8.1e2 4.68e0 9.6e2 5.11e0 cap(δ =10 − 2 ) 5.4e2 4.68e0 8.3e2 4.68e0 9.6e2 5.11e0 cap(δ =10 − 1 ) 6.0e2 4.75e0 8.8e2 4.75e0 9.6e2 5.00e0 cap(δ =10 0 ) 6.3e2 5.20e0 8.9e2 5.20e0 9.6e2 5.07e0 ℓ 1 6.3e2 5.20e0 8.9e2 5.20e0 9.6e2 5.07e0 Table 3.4: Some key statistics for the GMRF experiments. 116 Each plot in Figure 3.4 visualizes the value of signal recovery as a function of support recovery along the solution paths of different kinds. More specifically, each point on a plotted curve corresponds to a particular pair of support and signal recovery at a give γ value. Additionally, we also provide the best support and signal recovery results achieved by the various paths (averaged over 5 instances), as displayed in the “Supp. Rec.” and “Supp. Rec.” columns of Table 3.4. The plots in Figure 3.4 reveal interesting insights about the performance of the ℓ 1 - and cappedℓ 1 -pathindifferentnoiseregimes. Inthesmallnoisescenarios,bothmethodsgenerate solutionsthatexhibitfavorablesignalandsupportrecovery,thatcanbepotentiallyidentified byhyperparameterselection. However, asthenoiselevel σ increases, thestatisticalbehavior of these two paths diverges. • For the capped ℓ 1 -path, if the parameter γ is selected such that the support of the solutionapproximatelymatchesthetruesupport,theresultingestimatorsdemonstrate satisfactory signal recovery. Conversely, as γ deviates from this critical value, the estimators have worse performance in terms of both signal and support recovery. In other words, when the capped ℓ 1 -pathis employed, a good support recovery andsignal recovery can be roughly achieved by the same γ value. • On the other hand, the ℓ 1 -path has a different behavior. Choosing γ to align with the true support often leads to estimators with subpar signal recovery. Conversely, the values of γ that yield good signal recovery tend to result in solutions with poor support recovery. As a result, it becomes unclear which γ value yields better overall performance. These findings suggest that the capped ℓ 1 d-stationary path is more attractive from a sta- tistical perspective. Indeed, hyperparameter selection can potentially identify solutions that simultaneously achieve good signal and support recovery, a feature that is absent in the ℓ 1 - solutionpath. Itisworthnotingthattheℓ 0 -pathsdemonstrateasimilarstatisticalproperty, 117 assupportedbytheplotsinFigure3.4a,3.4c,and3.4e. Consequently,thesefindingsprovide additional evidence in favor of the capped ℓ 1 -path. Lastly, we observe that, in general, there are multiple solutions obtained from the capped ℓ 1 -path that outperform all ℓ 1 -solutions in terms of signal recovery. This observation suggests that the capped ℓ 1 -method is preferable when signal recovery is the primary criterion. The specialized GHP initializations Recall that in Section 3.3.5, we highlighted the significance of our specialized GHP initial- ization strategies, which we will refer to as the modified GHP . By taking advantage of the most recent almost d-stationary solution in the path, the modified GHP provides several favorable analytical results, as discussed in Section 3.3.5. In the following experiments, we willelaboratethecomputationalbenefitsbroughtbythemodifiedGHP.Specifically,ouraim is to demonstrate that the specialized initialization is crucial for preserving all the desirable properties of the capped ℓ 1 -paths as previously observed. 
As a comparison, we also test the pivoting scheme when the GHPs are initialized as shown in (3.16), which we refer to as the na¨ ıve GHP . Notably, as per our analysis in Section 3.3.1, the na¨ ıve GHP also guarantees a d-stationary solution at a discontinuous break point, albiet potentially different from the one computed by the modified version. To illustrate how the na¨ ıve initializations could potentially undermine the judicious se- lectionoftheparameterγ ,wesummarizethebehaviorofthecappedℓ 1 -pathsobtainedusing different GHP initialization strategies. These paths are tested on the GMRF problem with n = 100,σ = 0.4, and the results are presented in Figure 3.5. More specifically, Figure 3.5a displays the curves of support recovery and sparsity as functions of log(γ ) (details can be foundinthelegend), whereasFigure3.5bpresentstheParetocurvesofsignalversussupport recovery. As observed in Figure 3.5a, the path computed by the modified GHP attains a d- stationary solution x ∗ that achieves a minimal support recovery of 5 at approximately 118 γ ∗ ≈ 10 − 4 . For γ ranging from 10 − 4.5 to 10 − 3.5 , the modified GHP results are significantly betterthanthosefromthena¨ ıveGHP,whosebestsupportrecoveryisinthemid-50’s, which is 10 times more than the best result from the modified GHP. Additionally, the superiority of the modified GHP is evident even in the early stages, as observed in the initial pivoting steps. More precisely, when γ is near the right end of the curves in Figure 3.5a, the solution from the na¨ ıve GHP path abruptly changes from being zero to relatively dense. As a result, the overall sparsity of this path has been elevated to a relatively high level even in the early phaseofthepivotingscheme. Consequently,atγ ∗ ,thena¨ ıveGHPpathselectsad-stationary solution that is far worse than x ∗ in both measures considered in Figure 3.5b. This explains why the path computed by the the na¨ ıve GHP lacks the desirable properties exhibit by the modified GHP path, as depicted in Figure 3.5b. (a) Support recovery and sparsity vs. γ . (b) Signal recovery vs. support recovery. Figure 3.5: Modified GHP vs. na¨ ıve GHP. 119 Chapter 4 Adaptive Importance Sampling for Logarithmic Integral Optimization Inthischapter,wedelveintoourstudiesinresponsetotheconcernsweraisedinSection1.4, namely the challenging features of nonconvexity, nonsmoothness and intractable integrals that are prevalent for Maximum a Posteriori (MAP) inference of a Bayesian Hierarchical Model (BHM) with advanced modelling power. As a quick recap, here we are interested in solving the following Logarithmic Integral Optimization problem, which captures these computational obstacles minimize x∈X c(x) + logZ(x) where Z(x) ≜ Z Ξ r(x,z)dz (4.1) where X and Ξ are given sets in their respective spaces. Once again, the notations in this Chapter should be understood as self-contained hence independent from other chapters. The rest of the chapter is organized as follows. In Section 4.1 we formalize the details on the relevant background of BHM within the scope of our investigation. Section 4.2 introduces our approach to handle models with discrete priors. Our major computational method, named AIS-based surrogation algorithm, is proposed in Section 4.3 and analyzed in Section 4.4, and its numerical performance is evaluated in Section 4.5. Finally, Section 4.6 provides supplementary details on our proofs and numerical implementations. 
120 4.1 The 3-Layer Bayesian Hierarchical Model In its most general form, a BHM involves a finite number of uncertainty layers, where the randomness of each layer is conditioned on the randomness of the lower layers. While han- dling the general model can introduce significant notational complications, our methodology can be based on and extended from a 3-layer model, which we described as follows: level 2: e y i | e θ ind. ∼ p y i (y i | e θ ), i=1,...,M level 1: e θ | e β ∼ q(θ | e β ), e β ≜ ( e β ℓ ) L ℓ=1 level 0: e β ℓ ind. ∼ p 0 ℓ (β ℓ ), ℓ=1,...,L; (4.2) where “ind.” stands for “independently distributed”; the tilde notation denotes a random variable whose realizations are written without the tilde; the vertical notation denotes con- ditioningwithp y i (y i | e θ ),q(θ | e β ), andp 0 ℓ (β ℓ )denotingtheprobabilitydensityfunctionsofthe respective random variables; the random vector e y i (the data) is m-dimensional, whereas e θ (the intermediate) and e β ℓ (the prior) have support Θ ⊆ R d andB ℓ ⊆ R respectively. Notice that level 2 of the model as stated assumes that the output {e y i } M i=1 are conditionally inde- pendent given e θ with conditional density p y i (y i | e θ ) and are related to e β =( e β ℓ ) L ℓ=1 , which has independent components, only through the intermediate e θ . Given observations{y i } M i=1 for some positive integer M, we can determine the unknown parameters θ and β via solving the following MAP, which is also termed as the o-BHOP (for original-Bayesian Hierarchical Optimization Problem) here: minimize θ ∈Θ; β ∈B − M X i=1 logp y i (y i |θ )− logq(θ |β )− L X ℓ=1 logp 0 ℓ (β ℓ ) where B ≜ L Y ℓ=1 B ℓ . (4.3) Theformulationpresentedabovesuffersfromasignificantdrawbackwhenthepriorparame- ters e β are discretely distributed, meaning thatB ℓ is a discrete set for all ℓ. In such cases, the last term in the objective function becomes undefined when β ℓ falls outside this finite set. 121 As a result, approximation or relaxation-based solution methods encounter significant chal- lenges. However, a well-known transformation can address this issue by reformulating the BHMmodel(4.2), eliminatingthisdifficultyandprovidingafoundationforthedevelopment of a continuous optimization algorithm. Fact: Letχ bearandomvariablewithcumulativedistributionfunction(cdf)F χ (t)≜ P(χ ≤ t) whose generalized inverse F − 1 χ :[0,1]→[infχ, supχ ] is given by F − 1 χ (s) ≜ inf{t | F χ (t) ≥ s} if s∈(0,1) infχ if s=0 supχ if s=1. so that F − 1 χ (U), where U is uniformly distributed in [0,1], has the same distribution as χ . Applying the above transformation to each random variable e β ℓ and letting e U = (e u ℓ ) L ℓ=1 be the vector of associated uniformly distributed random variables e u ℓ with realizations u ℓ , we may consider, as an alternative to (4.2), the uniformly-transformed BHM: level 2: e y i | e θ ind. ∼ p y i (y i | e θ ), i=1,...,M level 1: e θ | e U ∼ b q(θ | e U) ≜ q(θ |ϕ ( e U)), where ϕ (U) ≜ F − 1 e β ℓ (u ℓ ) L ℓ=1 level 0: e U = (e u ℓ ) L ℓ=1 : L independent uniform random variables in [0,1]. (4.4) This leads to the formulation of the uniform-transformed MAP problem, referred to as the u-BHOP (uniform-Bayesian Hierarchical Optimization Problem): minimize θ ∈Θ; u∈[0,1] L − M X i=1 logp y i (y i |θ )− logb q(θ |u). (4.5) where the term associated with the density function of e U, being constant, is excluded from the objective function. 
122 4.1.1 Exponential families We focus on a class of BHMs wherein with e θ = ( e θ i ) M i=1 , the conditional distributions e y i | e θ and e θ | e β have densities belonging to the exponential family so that: p y i (y i |θ ) = exp[g(y i ,θ i )− a(θ i )] and q(θ |β ) = exp[h(θ,β )− b(β )] (4.6) for some bivariate functions g(y i ,θ i ) and h(θ,β ) with a(θ i ) and b(β ) being the following normalizing factors that ensure the integration of p y i (y i |θ ) and q(θ |β ) equal to unity: a(θ i ) = log Z R m exp(g(y ′ ,θ i ))dy ′ , b(β ) = log Z Θ exp(h(θ ′ ,β ))dθ ′ . Thus, under the framework of (4.6), the associated u-BHOP (4.5) becomes: minimize θ ∈Θ; u∈[0,1] L − M X i=1 g(y i ,θ i )+ M X i=1 log Z R m exp(g(y ′ ,θ i )) dy ′ − h(θ,ϕ (u))+log Z Θ exp(h(θ ′ ,ϕ (u)))dθ ′ . (4.7) This problem is confronted with several computational challenges, which we have previously discussed in Section 1.4. Here, we will summarize them again: • Intractabilityofthelogarithmicintegralfunction(s): therearetwoofthesein(4.7)which posessignificantcomputationaldifficultiesunlesstheassociateddistributionsarespeciallike multivariate Gaussian. For instance, in the case where p y i (y i |θ ) is the density function of a multivariatenormaldistributionwithmean θ i andcovariancematrix V, thefunctiona(θ i )is theconstantlog (2π ) d/2 √ detV thatcanremovedfrom(4.7). Todesignasolutionmethod for problem (4.7), our first order of business is to effectively handle these intractable integral functions with efficiency. 123 • Nonconvexity of the functions g(y i ,θ i ) and h(θ,β ): often, these functions are products of functions of their arguments; for example, h(θ,β )=β ⊤ v(θ ), for some v :Θ →R L , as ob- servedinMarkovRandomFields(MRFs)whichwehavediscussedthoroughlyinSection1.4. • Nonsmoothness of the functions g(y i ,θ i ) and h(θ,β ): in cases where distance functions are utilized as measures of deviation or similarity, such as in MRFs, g and h can be nons- mooth. • Discontinuity of the transformation function ϕ (u): when e β is a discreterandom variable, the inverse cdf F − 1 e β ℓ is a step function hence discontinuous. Approximating such discon- tinuous function with continuous ones, which we will thoroughly investigate in the next section, can easily introduce an alternative source of nonconvexity and nonsmoothness to (4.7). Whatsworse, in this case, these two additionally introduced “non”-properties are cou- pled and embedded in the integral function, making the resulting continuous optimization problem more challenging to deal with. 4.2 Discrete Priors Motivated by a Bernoulli random variable for e β ℓ in (4.2), as exemplified by the generalized MRF model (1.9) proposed in Section 1.4, we describe the transformation function ϕ (u) in (4.4) when each e β ℓ is a discrete random variable whose range{β k ℓ } K ℓ k=1 satisfies 0 ≜ β 0 ℓ ≤ β 1 ℓ < β 2 ℓ < ··· < β K ℓ ℓ | {z } values of the random variable e β ℓ (4.8) 124 and the associated probabilities {p k ℓ } K ℓ k=1 are positive and sum up to unity. Under such configurations, the cdf of e β ℓ and its generalized inverse are given by: F e ℓ (t) = X k:β k ℓ ≤ t p k ℓ , t ∈ (−∞ ,∞), F − 1 e β ℓ (s) = β 1 ℓ + K ℓ X k=2 b β k ℓ 1 (0,∞) (s− b p k− 1 ℓ ), for s∈[0,1]. (4.9) where 1 (0,∞) (• ) is the indicator function of (0,∞) and b β k ℓ ≜ β k ℓ − β k− 1 ℓ ≥ 0 by (4.8) and b p k ℓ ≜ k X i=1 p i ℓ for k∈[K ℓ ]. 
Note that F − 1 e β ℓ (• ) is lower semicontinuous (i.e., left-continuous), nondecreasing, piecewise constant on (0,1), and by definition, right-continuous and left-continuous at the left and right end point of the interval, respectively. Following the schemes in the two references [27, 28], we approximate F − 1 e β ℓ (s) by continuous piecewise function employing a convex function b φ cvx :R→R and a concave function b φ cve :R→R satisfying b φ cvx (0) = 0 = b φ cve (0) and b φ cvx (1) = 1 = b φ cve (1), with both (continuous) functions being increasing in the interval [0,1] and nondecreasing outside. Truncating these two functions to the range [0, 1], we obtain the upper and lower bounds of the two indicator functions 1 [0,∞) (t) and 1 (0,∞) (t) as follows: for any (t,δ ) ∈ R × R ++ , φ ub (t,δ ) ≜ min max b φ cvx 1+ t δ , 0 , 1 ≥ 1 [0,∞) (t) ≥ 1 (0,∞) (t) ≥ max min b φ cve t δ , 1 , 0 ≜ φ lb (t,δ ). (4.10) we have the following result stated and proved in [27, Proposition 2]. 125 Proposition 56. The bivariate functions φ ub and φ lb defined above have the following properties: 1. (a) For any t∈R, φ ub (t,δ ) is nondecreasing in t onR ++ and φ lb (t,δ ) is nonincreasing in t on R ++ . Both functions φ ub and φ lb are Lipschitz continuous on every compact set T × ∆ ⊆ R× R ++ . 2. The following equalities hold: 1 [0,∞) (t) = infimum δ> 0 φ ub (t,δ ) = lim δ ↓0 φ ub (t,δ ), ∀t ∈ R and 1 (0,∞) (t) = supremum δ> 0 φ lb (t,γ ) =lim δ ↓0 φ lb (t,δ ), ∀t ∈ R. (4.11) Applying φ ub and φ lb as the basic ingredients, we obtain the following approximations: F − 1 e β ℓ (u ℓ ) = β 1 ℓ + K ℓ X k=2 b β k ℓ 1 (0,∞) (u ℓ − b p k− 1 ℓ ) ≥ β 1 ℓ + K ℓ X k=2 b β k ℓ φ lb u ℓ − b p k− 1 ℓ ,δ ≜ b φ lb (u ℓ ,δ ) and F − 1 e β ℓ (u ℓ ) ≤ β 1 ℓ + K ℓ X k=2 b β k ℓ φ ub u ℓ − b p k− 1 ℓ ,δ ≜ b φ ub (u ℓ ,δ ). Therefore, for a function: h(θ,β ) = β ⊤ v(θ ) = L X ℓ=1 β ℓ v ℓ (θ ) where each v ℓ is a nonpositive function (cf. (1.8)), it follows that its uniform-transformation h(θ,ϕ (u)) (as in (4.4) and (4.7)) satisfies L X ℓ=1 b φ ub (u ℓ ,δ )v ℓ (θ ) | {z } denoted h lb (θ,u ) ≤ h(θ,ϕ (u)) = L X ℓ=1 F − 1 e β ℓ (u ℓ )v ℓ (θ ) ≤ L X ℓ=1 b φ lb (u ℓ ,δ )v ℓ (θ ) | {z } denoted h ub (θ,u ) 126 More generally, provided that the function h(θ, • ) is isotone in the second argument (cf. e.g. (1.8)), β ≤ β ′ ⇒ h(θ,β ) ≤ h(θ,β ′ ) we can bound the discontinuous function h(θ,ϕ (u)) by continuous functions. In the rest of the paper, we assume that such continuous bounding functions h lb (θ,u ) and h ub (θ,u ) have been derived for h(θ,ϕ (u)); needless to say, if the latter function is already continuous (for instance,ifthecdfofeach e β ℓ hasacontinuousinverse),thenthereisnoneedforsuchbounds. Summarizing the above discussion, and assuming that a(θ i ) in (4.6) is a constant for notational simplicity, we obtain the following two problems that provide upper and lower bounds for (4.7): minimum θ ∈Θ; u∈[0,1] L − M X i=1 g(y i ,θ i )− h ub (θ,u )+log Z Θ exp(h lb (θ ′ ,u))dθ ′ ≤ minimum θ ∈Θ; u∈[0,1] L − M X i=1 g(y i ,θ i )− h(θ,ϕ (u))+log Z Θ exp(h(θ ′ , ϕ (u))dθ ′ (problem (4.7)) ≤ minimum θ ∈Θ; u∈[0,1] L − N X i=1 g(y i ,θ i )− h lb (θ,u )+log Z Θ exp(h ub (θ ′ ,u))dθ ′ The above inequalities are in terms of the global minimum objective values of the respective problems. Yet, the two bounding problems remain (highly) nonconvex and (often) non- smooth; moreover, they still contain the practically intractable logarithmic integral func- tions. 
To address the former “non”-features, we settle for a practically computable solution that satisfies a stationarity condition of some sort (to be defined in the next section). This modest goal is the general principle of our modern point of view of nonconvex nonsmooth optimization detailed in the recent monograph [29]; that is, instead of the impossible task of computing global minimizers, we emphasize practical computability along with the valida- tion of some stationarity conditions (i.e., necessary conditions for local optimality) satisfied by the computed solutions. 127 4.3 The Logarithmic Integral Optimiation Problem As a unification of the upper and lower bounding minimization problems of (4.7), we consider the following nonconvex nonsmooth optimization problem whose objective con- tains an intractable logarithmic integral function, which is the vanilla problem (4.1) with r(x,z)=exp(H(x,z))tohighlightitsconnectiontoexponentialfamilies,althoughourmeth- ods in this section apply to general r: minimize x∈X c(x)+logZ(x), where Z(x) ≜ Z Ξ exp(H(x,z))dz. (4.12) We impose the following blanket assumptions on (4.12): 1. (Sets) X ⊆ R n is compact and convex; Ξ ⊆ R d is compact. 2. (Continuity) Both c :O→R and H :O×Z → R are continuous, where O andZ are open sets containing X and Ξ. Hence H(x,z) is Borel measurable in z∈ Ξ for all x∈X. 3. (B-differentiability) Functions c and H(• ,z) (for all z∈ Ξ) are Bouligand differen- tiable (B-differentiable); that is, they are locally Lipschitz continuous on O and their directional derivatives c ′ (x;dx) ≜ lim τ ↓0 c(x+τdx )− c(x) τ and H(• ,z) ′ (x;dx)≜ lim τ ↓0 H(x+τdx,z )− H(x,z) τ exist for all (x,dx) ∈O× R n . Lastly, we assume that there exists Lip H :Z → R ++ satisfying |H(x,z)− H(x ′ ,z)| ≤ Lip H (z)∥x− x ′ ∥, ∀x,x ′ ∈ O and z ∈ Z and Z Ξ Lip H (z)dz <∞. 128 By the Dominated Convergence Theorem and Theorem 7.44 of [78], we obtain the following basic properties on function Z: • Z is continuous on X and Z(x)<∞ for all x∈X. • Z, hence logZ, is B-differentiable on X. In terms of the directional derivative, we recall that a local minimizer of (4.12) must be directionally stationary; that is, if ¯x is such a minimizer, then c ′ (¯x;x− ¯x)+(logZ) ′ (¯x;x− ¯x) ≥ 0, ∀x ∈ X In the absence of a practical way to provably compute a local minimizer of a nonconvex nondifferentiable optimization problem, the practical goal of “solving” such a problem is to settle for the computation of a directional stationary solution. Even this less demanding task is not easy in general (see [29, Chapter 7]), and for (4.12) in particular. This task is complicated by the taunting, if not impossible, evaluation of Z(x). As a necessary step to alleviate the latter, we propose a combination of the methods of surrogation [29, Chapter 7] and adaptive importance sampling [75]. The former substitutes the nonconvex functions c and H(• ,z) at an arbitrary reference vector ¯x by respective convex majoring functions that are the basis for the development of effective algorithms; the latter replaces the integral function by an equivalent expectation function which is then discretized into a finite sum via sampling from a judiciously chosen density function, and this is adaptively employed to control the variance of such discretization at each iteration. The end result of this fu- sion of deterministic surrogation and stochastic sampling allows us to design a practically implementable algorithm for computing a “stationary solution” of (4.12) of a certain kind. 
4.3.1 Surrogation and stationarity We assume that for every (¯x,z)∈X× Ξ , there exist convex b c(• ;¯x) and b H(• ,z;¯x) such that 129 1. (Majorization) c(x)≤ b c(x;¯x) and H(x,z)≤ b H(x,z;¯x) for all x∈X; 2. (Touching) c(¯x)=b c(¯x;¯x) and H(¯x,z)= b H(¯x,z;¯x); 3. (Upper semicontinuity)b c(• ,• ) and b H(• ,z;• ) are upper semicontinuous onO×O ; 4. (Continuity) b H(x,• ;¯x) is continuous hence Borel measurable on Z for all (x,¯x) ∈ O×O ; Denote b Z(x;¯x)≜ Z Ξ exp( b H(x,z;¯x))dz, then under the assumptions stated above, we can deduce the following properties ofb c and b Z by Fatou’s Lemma and Theorem 7.44 of [78]: • b Z is upper semicontinuous on X× X and b Z(x;¯x)<∞ for all (x,¯x)∈X× X. • The following inequalities hold true for any dx∈R n : b c(• ;¯x) ′ (¯x;dx) ≥ c ′ (¯x;dx) and (log b Z(• ;¯x)) ′ (¯x;dx) ≥ (logZ) ′ (¯x;dx), (4.13) The inequalities (4.13) pertain todirectional derivative majorization of the surrogation functions. If equalities hold in (4.13), then the majoring functionsb c(• ;¯x) and b H(• ,z;¯x) are said to be directional derivative consistent at ¯x. With the convex surrogation functionsb c(• ;¯x) and b H(• ,z;¯x), we may consider the surro- gate optimization problem, for a given ¯x∈X and ρ> 0: b P ρ (¯x): minimize x∈X b c(x;¯x)+log Z Ξ exp( b H(x,z;¯x))dz+ 1 2ρ ∥x− ¯x∥ 2 2 (4.14) The lemma below asserts that this is a convex program. For later purposes, we also include the same convexity of an empirical-average-version of the integral function. We omit the proof as it is a simple consequence of the renowned Jensen’s inequality. 130 Lemma57. Supposethatthebivariatefunctione:X× Ξ →Rissuchthate(• ,z)isconvex for all z∈Ξ and e(x,• ) is integrable on Ξ. Then the two functions log Z Ξ exp(e(• ,z))dz and log 1 N N X s=1 exp(e(• ,ζ s )) ! are convex for all ζ N ≜{ζ s } N s=1 . □ The problem (4.14) allows us to define an important solution concept for the problem (4.12). Let c M ρ (¯x) denote the optimal solution set of problem b P ρ (¯x), Definition 58. For given bivariate surrogation functionsb c(• ;¯x) and b H(• ,z;¯x) of c(• ) and H(• ,z), respectively, satisfying the majorizaiton and touching conditions, a vector ¯x∈X is a (b c, b H)-surrogation stationary point of (4.12) if ¯x∈ c M ρ (¯x) for some ρ> 0. □ Equivalently, the inclusion ¯x∈ c M ρ (¯x) states that ¯x is a fixed point of the “surrogation stationarity map”: c M ρ : ¯x∈X → argmin b P ρ (¯x) ⊆ X. The role of the surrogation stationarity concept is that it is a necessary condition for a local minimizer of (4.12) as asserted by the following simple result; the condition is sufficient for directional stationarity if the surrogation is directional derivative consistent. Being an immediate consequence of the directional derivative majorization (4.13), the result does not require a proof. Proposition 59. Let ¯x be a local minimizer of (4.12), then ¯x is a (b c, b H)-surrogation sta- tionary point of (4.12) for any pair of bivariate functions (b c(• ;¯x), b H(• ,z;¯x)) that majorizes and touches (c,H(• ,z)) at ¯x for all z ∈ Ξ. Conversely, if the pair of surrogation func- tions (b c(• ;¯x), b H(• ,z;¯x)) are directional derivative consistent at ¯x for all z ∈ Ξ, and if ¯ x is a (b c, b H)-surrogation stationary point of (4.12), then ¯x is a directional stationary point of (4.12). □ 131 While (4.14) is a convex program, its practical solution remains daunting, if not in- tractable, because of the challenging task of evaluating the multi-dimensional integral func- tion. 
Thus it is necessary to approximate the latter function; as mentioned before, this is accomplished by the statistical technique of importance sampling to discretize the integra- tion. 4.3.2 AIS-based surrogation method For any bivariate function e : O×Z → R that is (Lebesgue) integrable in z ∈ Z for all x∈O, and d-dimensional random vector e ζ with support Ξ and positive probability density π , we have Z e (x) ≜ Z Ξ exp(e(x,z))dz = Z Ξ π (z) exp(e(x,z)) π (z) dz = E e ζ ∼ π " exp(e(x, e ζ )) π ( e ζ ) # . (4.15) For a given x and a batch ζ N ≜{ζ s } N s=1 of size N of iid samples drawn from distribution π , written as ζ N iid ∼ π , the Sample Average Approximation (SAA) of Z e (x) is given by Z e (x) ≈ Z π e (x,ζ N ) ≜ 1 N N X s=1 exp(e(x,ζ s )) π (x,ζ s ) . (4.16) Applying the importance sampling (IS) reformulation (4.15) to function Z in (4.12) turns it intoastochasticprogramalbeitofnonconvexandnondifferentiabletype,whichcaneitherbe na¨ ıvely approximated by SAA as shown in (4.16) or more rigorously handled by algorithms thatcombinesurrogationandincrementalSAA(seeforinstance[59]andSection10.2of[29]). It is worth noting that such treatments will achieve their respective convergence results for any arbitrary positive density π . A natural question arises as to what density π to choose if we desire a more efficient approximation from sampling. It turns out that this question can be answered based on the following principle of importance sampling, which suggests to minimize the variance of the SAA. 132 Lemma 60. [75, Theorem 3.12] Let f :Ξ →R be (Lebesgue) integrable with|f|>0. Let Z Ξ f(z)dz SAA ≈≈ I(π,ζ N ) ≜ 1 N N X s=1 f(ζ s ) π (ζ s ) where ζ N ≜{ζ s } N s=1 iid ∼ π and π is a positive density function. Then for all N π f IS ≜ |f| Z Ξ |f(z)|dz ∈ argmin π Var e ζ ∼ π " f( e ζ ) π ( e ζ ) # = argmin π Var ζ N ∼ π I(π ;ζ N ) , where “Var e ζ ∼ π ” and “Var ζ N ∼ π ” denote the variance when the random variables follow π . Moreover, if f is positive, then Var e ζ ∼ π f IS " f( e ζ ) π f IS ( e ζ ) # = 0; thus Z Ξ f(z)dz = I(π f IS ,ζ N ) almost surely. □ The lemma above, when applied to function f(• ) = expH(¯x,• ) at a given ¯x∈ X, indi- cates that the density achieving the minimal variance of estimating Z at ¯x actually depends onthereferencepoint ¯x. Thus, asfarasvariancereductionisconcerned, theprobabilitydis- tribution in the stochastic programming reformulation (4.15) of (4.12) is implicitly decision dependent, as there exists the following density family parametrized by ¯x∈X π H(¯x,• ) IS (z) ≜ exp(H(¯x,z)) Z Ξ exp(H(¯x,z ′ ))dz ′ = exp(H(¯x,z)) Z(¯x) . (4.17) whoseassociatedSAAoftype(4.16)equalstoZ at ¯x. Inotherwords,byexplicitlycontrolling the variance, sampling from (4.17) and constructing (4.16) will yield a good approximation of the function Z locally around ¯x with moderate sample size. 133 (a) uniform π , N =1,000. (b) π =π H(¯x,• ) IS , N =100. Figure 4.1: Effectiveness of sampling from π H(¯x,• ) IS with a fixed ¯ x∈X. Todemonstratethis, supposeweapproximatethefunction Z(x)= Z 1 − 1 exp{− zsin(x)}dz with SAA of type (4.16), and we compare the variance of SAA from a π that is uniform on [− 1,1] with π = π H(¯x,• ) IS under a particular choice of ¯x. The variance is exemplified by 50 independent replications of SAA under the aforementioned two choices of π . As indicated by Figure 4.1a, when we apply the uniform π , the variance around a local maximizer of Z (the blue dot) is large even when the sample size N is 1,000. 
On the contrary, from Figure 4.1b, we can obtain a better recovery of the landscape around the local maximizer (the blue dot in Figure 4.1b) with smaller variance if we adopt π = π H(¯x,• ) IS with a ¯x (e.g., the green dot in Figure 4.1b) that is near the local maximizer, and we are able to achieve this with fewer samples, e.g., N =100. This observation provides the following hints for designing an algorithm for (4.12) that iteratively solves subproblems which depends on IS-based SAA (4.16) of Z. Namely, instead of restricting to one IS distribution for SAA throughout, in each iteration t we should make such distribution adapted to our current iterate x t ; more specifically, sampling from π H(x t ,• ) IS so that we can benefit from its reduced variance. Combining such an adaptive IS scheme 134 with the previously defined surrogation b H results in the following surrogated sampling ap- proximation of logZ at a reference point ¯x∈X: logZ(x) IS = log E e ζ ∼ π H(¯x,• ) IS " exp(H(x, e ζ )) π H(¯x,• ) IS ( e ζ ) #! ≤ log E e ζ ∼ π H(¯x,• ) IS " exp( b H(x, e ζ ;¯x)) π H(¯x,• ) IS ( e ζ ) #! , surrogation of H(• , e ζ ) by b H(• , e ζ ;¯x) SAA ≈≈ log 1 N N X s=1 exp( b H(x,ζ s ;¯x)) π H(¯x,• ) IS (ζ s ) ! , draw ζ N ≜{ζ s } N s=1 iid ∼ π H(¯x,• ) IS = log 1 N N X s=1 exp( b H(x,ζ s ;¯x)) exp(H(¯x,ζ s )) ! + logZ(¯x) | {z } constant given ¯x , substituting out π H(¯x,• ) IS ( e ζ ) in the denominator. With the above constructions, we obtain the following approximation of (4.12), regularized by a proximal term with coefficient ρ > 0 to ensure its strict convexity hence uniqueness of solution b P N ρ (¯x;ζ N ): minimize x∈X b c(x;¯x)+log 1 N N X s=1 exp( b H(x,ζ s ;¯x)) π H(¯x,• ) IS (ζ s ) ! + 1 2ρ ∥x− ¯x∥ 2 2 . (4.18) Problem (4.18) is the computational workhorse in the iterative algorithm described below. At each iteration, we employ the most recent iterate to define the reference vector ¯ x; thus both the IS density π H(¯x,• ) IS and surrogation are iterate dependent. The overall procedure is the promised AIS-based surrogation scheme that aims to compute a surrogation stationary solution of the original logarithmic integral optimization problem (4.12) corresponding to a given pair (b c, b H) of surrogation functions; such a solution is a x ∞ ∈ X such that x ∞ ∈ c M γ (x ∞ ). Algorithm for (4.12): AIS-based Surrogation Method 135 Initialization. Let x 0 ∈X, ρ> 0, and a sequence of (positive) increasing integers{N t } ∞ t=0 be given. Set t=0. General Step. Given x t ∈ X and sample batch ζ t ≜{ζ st } Nt s=1 with ζ st |x t iid ∼ π H(x t ,• ) IS . Let x t+1 be the unique optimal solution of the problem b P Nt ρ (x t ;ζ t ), which is equivalent to minimize x∈X b c(x;x t )+log 1 N t Nt X s=1 exp( b H(x,ζ st ;x t )) exp(H(x t ,ζ st )) ! | {z } convex in x given x t + 1 2ρ ∥x− x t ∥ 2 2 . (4.19) Stop if a prescribed termination criterion is satisfied; otherwise, return to the general step with t replaced by t+1. □ 4.4 Analysis of the AIS-based Surrogation Algorithm In this section, we present an analysis of the AIS-based surrogation algorithm which estab- lishesitsalmostsuresubsequentialconvergencetoasurrogationstationarypoint. Whileour adaptive scheme has the advantage of minimizing the (conditional) SAA variance at x t for each iteration, it jeopardizes much of the typical analysis of SAA-based surrogation method (cf. [29, Theorem 10.2.1]), which relies critically on a uniform law of large numbers (ULLN) for iid samples. 
Since our iterate-dependent sampling process easily leads to a special trian- gular array of non-iid samples, our first order of business is to extend such laws, which are then applied to prove the convergence of our algorithm via two key steps, namely showing 136 sufficient descent and asymptotic fixed point stationarity . Before further details, we sum- marize the notations that will be used in the analysis. For a given probability distribution with density π and samples ζ s iid ∼ π for all s=1,··· ,N, define SAA: ¯ Z N π (x) ≜ 1 N N X s=1 exp(H(x,ζ s )) π (ζ s ) Surrogation given ¯x: b Z(x;¯x)≜ Z Ξ exp( b H(x,z;¯x))dz = E ζ ∼ π exp( b H(x,ζ ;¯x)) π (ζ ) ! SAA-surrogation given ¯x: b Z N π (x;¯x) ≜ 1 N N X s=1 exp( b H(x,ζ s ;¯x)) π (ζ s ) . Additionally we write π ∗ (¯x) as a shorthand for π H(¯x,• ) IS when it appears in the subscripts. 4.4.1 A Digression: ULLN for triangular arrays In this section, we extend ULLN to a special triangular array that generalizes our AIS scheme, which can be understood as the following data generating mechanism written in generic notations. Definition 61. (AIS triangular array) For a subsetY ⊆ R n and a compact setK⊂ R m , let {η (• ,y) : K → R ++ } y∈Y be a parametric family of probability density functions and {N t } t≥ 1 be a sequence of positive integers. We generate an array {z t } t≥ 1 of samples by the following process: Step 1 =⇒ fix an arbitrary y 0 ∈Y Step t≥ 1 =⇒ given y t ∈Y sample z t ≜{z st } Nt s=1 with z st |y t iid ∼ η (• ,y t ), and set y t+1 =ξ t (z t ,y t )∈Y (4.20) where ξ t :K Nt ×Y →Y is measurable. 137 Note that samples{z t } t≥ 1 are obviously not iid, instead{z st } Nt s=1 are conditionally inde- pendent given y t with conditional distribution η (• ,y t ). In what follows, we establish several asymptotic results for such a triangular array. The first result is a strong LLN for the array {φ(z st ):s∈[N t ],t≥ 1} when φ is some function with a bounded range. We then generalize it to a strong ULLN for triangular array of random functions, which is then applied to prove a customized strong “one-sided” ULLN for a triangular array of random functions defined by our surrogate b H. Lemma 62. Let the following be given: • an array{z t } t≥ 1 and sequence{y t } t≥ 1 as generated in Definition 61; • a measurable function φ:K→R such that sup z∈K |φ(z)|<∞; and • a sequence of positive integers{N t } t≥ 1 satisfying∃κ > 0 and T κ such that N t ≥ κt for all t≥ T κ . It then holds that lim t→∞ 1 N t Nt X s=1 φ(z st )− E ζ ∼ η (• ,y t ) φ(ζ ) = 0, almost surely. (4.21) Proof. See Section 4.6.1.1 for details. With Lemma 62, we can follow the same line of proof of [78, Theorem 7.48], i.e., ULLN of SAA, and obtain the following result for the{z t } t≥ 1 . Details of the proof are omitted. Proposition 63. In addition to the settings of Lemma 62, let X be a compact set in an Euclidean space and function ψ :X ×K → R so that ψ (χ, • ) is measurable for all χ ∈X and there exists M ψ <∞ with|ψ (χ,z )|<M ψ for all (χ,z )∈X ×K . Then lim t→∞ sup χ ∈X 1 N t Nt X s=1 ψ (χ,z st )− E ζ ∼ η (• ,y t ) ψ (χ,ζ ) = 0, almost surely. 138 By [78, Theorem 7.48], for a given x t , we can easily obtain error sup x∈X b Z Nt π ∗ (x t ) (x,x t )− b Z(x,x t ) →0 almost surely as N t → ∞. However, as we will see in the analysis of AIS method in the subsequent sections, we need the rate for such convergence to be uniform in x t , and the following proposition provides a “one-sided” version of this type of result. Proposition 64. 
Let {x t } ∞ t=0 be as generated in the AIS-based surrogation algorithm and furtherassumethat{N t } t≥ 0 isasequenceofpositiveintegerssatisfyingforsomescalarκ> 0 and integer T κ the condition that N t ≥ κt for all t≥ T κ . Then almost surely we can find a subsequence{x t } t∈T so that x t (t∈T)→x ∞ ∈X and that for all x∈X limsup t(∈T)→∞ b Z Nt π ∗ (x t ) (x;x t ) ≤ b Z(x;x ∞ ) Proof. See Section 4.6.1.2 for details. 4.4.2 Convergence of deterministic surrogation algorithm To make our discussion self-contained, we present the following convergence result for the deterministicsurrogationmethodasareferencefortheintuitionbehindtheformalanalysisof AIS-basedsurrogation. Themainideaisthatourmethodcanbetreatedasacombinationof deterministicsurrogationandsomestochasticerrorsubjectedtooursamplingscheme. While the convergence of deterministic surrogation method can be straightforwardly established as below, what remains to be shown is that the accumulated stochastic error will diminish asymptotically. Proposition 65. Let x 0 ∈X,ρ> 0 be arbitrary and for all t≥ 0 let x t+1 ∈ c M ρ (x t ). Then the sequence{x t } t≥ 0 has an accumulation point x ∞ ∈X such that x ∞ ∈ c M ρ (x ∞ ). 139 Proof. Step 1 (sufficient descent) With notation b Z(• ,x t ) = Z Ξ exp( b H(• ,z; x t ))dz we have: c(x t+1 )+logZ(x t+1 )+ 1 2ρ ∥x t+1 − x t ∥ 2 2 ≤ b c(x t+1 ; x t )+log b Z(x t+1 ; x t )+ 1 2ρ ∥x t+1 − x t ∥ 2 2 ←− by majorization ≤ b c(x t ; x t )+log b Z(x t ; x t ) ←− by optimality of x t+1 = c(x t )+logZ(x t ) ←− by touching (4.22) By c and logZ being bounded from below, we can easily deduce that lim t→∞ ∥x t+1 − x t ∥ 2 =0. Step 2 (asymptotic fixed-point stationarity) Fix an arbitrary x∈X, we have c(x t+1 )+logZ(x t+1 )+ 1 2ρ ∥x t+1 − x t ∥ 2 2 ≤ b c(x; x t )+log b Z(x; x t )+ 1 2ρ ∥x− x t ∥ 2 2 (4.23) from majorization and optimality of x t+1 . By compactness of X, we can restrict to a subse- quence indexed byT so that x t →x ∞ ∈X when t(∈T)→∞ and b c(x ∞ ; x ∞ )+log b Z(x ∞ ; x ∞ )+ 1 2ρ ∥x ∞ − x ∞ ∥ 2 2 = c(x ∞ )+logZ(x ∞ ) ←− by touching ≤ b c(x; x ∞ )+log b Z(x; x ∞ )+ 1 2ρ ∥x− x ∞ ∥ 2 2 ←− by upper semicontinuity if we take “limsup” for t(∈T)→∞ on both sides of (4.23), which gives us x ∞ ∈ c M ρ (x ∞ ). 140 4.4.3 Sufficient descent with stochastic error Similar to the analysis of the deterministic surrogation method, the detailed convergence analysisoftheAIS-basedsurrogationmethodconsistsoftwomainsteps, thefirstofwhichis the sufficient descent property that aims to show ∥x t+1 − x t ∥ 2 →0 almost surely. Note that we can obtain the following inequality similar to (4.22), but with the SAA of the function Z instead: c(x t+1 )+log ¯ Z Nt π ∗ (x t ) (x t+1 )+ 1 2ρ ∥x t+1 − x t ∥ 2 2 ≤ c(x t )+log ¯ Z Nt π ∗ (x t ) (x t ) (4.24) Define e t ≜ log ¯ Z Nt π ∗ (x t ) (x t )− log ¯ Z N t− 1 π ∗ (x t− 1 ) (x t ) and note that ¯ Z Nt π ∗ (x t ) (x t ) = Z(x t ) by the AIS scheme and Lemma 60; thus e t ≤ e ′ t− 1 ≜ sup x∈X logZ(x)− log ¯ Z N t− 1 π ∗ (x t− 1 ) (x) . For all t≥ 1 we get: c(x t+1 )+log ¯ Z Nt π ∗ (x t ) (x t+1 )+ 1 2ρ ∥x t+1 − x t ∥ 2 2 ≤ c(x t )+log ¯ Z N t− 1 π ∗ (x t− 1 ) (x t )+e ′ t− 1 (4.25) As for the case when t=0, we can let log ¯ Z N − 1 π ∗ (x − 1 ) (x 0 ) as logZ(x 0 ) and e ′ − 1 ≜0. In this way, the last inequality (4.25) holds for all t. To obtain∥x t+1 − x t ∥ 2 → 0 from (4.25), we essentially need error e ′ t− 1 → 0 fast enough in some appropriate notion of convergence. 
This can be ensured by the following conditions on{N t } t≥ 0 , which we assume to be valid for the remaining analysis: ∞ X t=0 1 N α t <∞ for a fixed α ∈ 0, 1 2 . To show this, the following result is useful in providing a uniform (in ¯x∈ X) non-asymptotic rate for the convergence of log ¯ Z N π ∗ (¯x) to logZ on X when N →∞. Lemma 66. For any α ∈ 0, 1 2 there exists C α >0 such that for all integer N ≥ 0, E sup x∈X log ¯ Z N π ∗ (¯x) (x)− logZ(x) ≤ 2C α N α , for all ¯x∈X (4.26) where the expectation is understood as the data in ¯ Z N π ∗ (¯x) having density π H(¯x,• ) IS . 141 Proof. See Section 4.6.1.3 for details. Denote byD t the σ -algebra generated by∪ t τ =0 ∪ Nτ sτ =1 {ζ sτ τ } for all t≥ 0, namely all the samples generated until step t, and define D − 1 ≜ {∅,Ω }. Then the following holds for all t≥ 0 from (4.25): E c(x t+1 )+log ¯ Z Nt π ∗ (x t ) (x t+1 ) D t− 1 + 1 2γ E ∥x t+1 − x t ∥ 2 2 D t− 1 ≤ c(x t )+log ¯ Z N t− 1 π ∗ (x t− 1 ) (x t )+e ′ t− 1 (4.27) Furthermore, ∞ X τ =− 1 e ′ τ <∞ almost surely, otherwise we haveP ∞ X τ =− 1 e ′ τ =∞ ! >0 hence: E ∞ X τ =− 1 e ′ τ ! = ∞ This contradicts the followings under our previous assumptions on{N t } t≥ 0 : E ∞ X τ =0 e ′ τ ! = ∞ X τ =0 E(e ′ τ ) = ∞ X τ =0 E E(e ′ τ |D τ − 1 ) ≤ ∞ X τ =0 2C α N α τ < ∞ (4.28) The first equality of (4.28) is by e ′ τ ≥ 0 and applying Theorem 2.15 in [37]. The second equality of (4.28) is by the law of total expectation. The first inequality of (4.28) is by x 0 being fixed, the fact that E(e ′ 0 |D − 1 ) = E(e ′ 0 ), E(e ′ τ |D τ − 1 ) = E(e ′ τ |x τ ) for all τ ≥ 1 and Lemma 66 whose applicability is established by our definition of the conditional distribution ζ sτ |x τ iid ∼ π ∗ (• ,x τ ) = π H(x τ ,• ) IS for all s ∈ [N τ ]. The second inequality of (4.28) is by our assumptions on{N τ } τ ≥ 0 . In sum, we almost surely have ∞ X τ =− 1 e ′ τ <∞ holds. Finally, by c(x t )+log ¯ Z N t− 1 π ∗ (x t− 1 ) (x t ) being uniformly bounded from below, we can with- out loss of generality combine the result that ∞ X τ =− 1 e ′ τ <∞ almost surely with (4.27) and 142 apply the Robbins-Siegmund nonnegative almost supermartingale convergence theorem [74, Theorem 1] to conclude that the following holds almost surely ∞ X t=0 E ∥x t+1 − x t ∥ 2 2 D t− 1 < ∞ This in turns gives us ∥x t+1 − x t ∥ 2 → 0 almost surely by a straightforward application of conditional Markov inequality and the second Borel-Cantelli lemma [31, Theorem 5.3.2]. 4.4.4 Asymptotic fixed-point stationarity Similar to Step 2 in the analysis of Proposition 65, we can fix an arbitrary x∈X and get: c(x t+1 )+log ¯ Z Nt π ∗ (x t ) (x t+1 ) | {z } Term I + 1 2ρ ∥x t+1 − x t ∥ 2 2 ≤ b c(x;x t )+log b Z Nt π ∗ (x t ) (x;x t ) | {z } Term II + 1 2ρ ∥x− x t ∥ 2 2 (4.29) The key intuition is that restricted to a subsequence T such that x t → x ∞ ∈ X, when t(∈T)→∞(duetothecompactnessofX),TermIin(4.29)shouldconvergetologZ(x ∞ )= log b Z(x ∞ ;x ∞ ) and the limit of Term II in (4.29) should be upper bounded by log b Z(x;x ∞ ). More precisely, since Term I and II can be understood as the SAA of their respective con- ditional expectation, when N t → ∞ as we let t(∈ T) → ∞, the law of large number that dominates the convergence of these SAAs should be uniform in x t . In what follows, Term I and II will be analyzed under such overarching goals. 
Term I Denote ψ (¯x,y,z)≜Z(y)exp{H(¯x,z)− H(y,z)} so that ¯ Z Nt π ∗ (x t ) (x t+1 )= 1 N t Nt X s=1 ψ (x t+1 ,x t ,ζ st ), Z(x t+1 )=E ζ ∼ π ∗ (x t ) ψ (x t+1 ,x t ,ζ ) 143 Note thatby ourassumptions on{N t } t≥ 0 , Proposition63 is applicableto ψ hence foralmost every ω∈Ω (Ω being the set of possible outcomes) we can restrict to a subsequence indexed byT(ω) so that x ∞ (ω)≜ lim t(∈T(ω))→∞ x t (ω)∈X and the followings hold: lim t(∈T(ω))→∞ sup (¯x,y)∈X× X 1 N t Nt X s=1 ψ (¯x,y,ζ st (ω))− E ζ ∼ π ∗ (x t (ω)) (ψ (¯x,y,ζ )) = 0 (4.30) Also note that after some simple operations: ¯ Z Nt π ∗ (x t (ω)) (x t+1 (ω))− Z(x ∞ (ω)) ≤ sup (¯x,y)∈X× X 1 N t Nt X s=1 ψ (¯x,y,ζ st (ω))− E ζ ∼ π ∗ (x t (ω)) (ψ (¯x,y,ζ )) | {z } Term I.1 + Z(x t+1 (ω))− Z(x ∞ (ω)) | {z } Term I.2 When we take t(∈T(ω))→∞, Term I.1 of this inequality goes to zero by (4.30) and Term I.2 also goes to zero by an application of Dominated Convergence Theorem. Thus lim t(∈T(ω))→∞ ¯ Z Nt π ∗ (x t (ω)) (x t+1 (ω)) = Z(x ∞ (ω)) In sum if we take t(∈T(ω))→∞, the left hand side of (4.29) goes to c(x ∞ (ω))+logZ(x ∞ (ω)) = b c(x ∞ (ω);x ∞ (ω))+log b Z(x ∞ (ω);x ∞ (ω)) since from the sufficient descent analysis ∥x t+1 (ω)− x t (ω)∥ 2 →0 as t(∈T(ω))→∞. Term II ByourassumptionswecantreatlogarithmasLipschitzcontinuouswithconstant e Lhence log b Z Nt π ∗ (x t (ω)) (x;x t (ω)) ≤ e L b Z Nt π ∗ (x t (ω)) (x;x t (ω))− b Z(x;x ∞ (ω)) +log b Z(x;x ∞ (ω)) (4.31) 144 Given the assumptions on {N t } t≥ 0 , Proposition 64 is applicable so that without loss of generality limsup t(∈T(ω))→∞ b Z Nt π ∗ (x t (ω)) (x;x t (ω))− b Z(x;x ∞ (ω)) ≤ 0 If we apply this to (4.31) and subsequntly to the right hand side of (4.29) we can obtain: b c(x ∞ (ω);x ∞ (ω))+log b Z(x ∞ (ω);x ∞ (ω))+ 1 2ρ ∥x ∞ (ω)− x ∞ (ω)∥ 2 2 ≤ b c(x;x ∞ (ω))+log b Z(x;x ∞ (ω))+ 1 2ρ ∥x− x ∞ (ω)∥ 2 2 namely, x ∞ (ω)∈argmin ¯x∈X b c(¯x;x ∞ (ω))+log b Z(¯x;x ∞ (ω))+ 1 2ρ ∥¯x− x ∞ (ω)∥ 2 2 = c M ρ (x ∞ (ω)). 4.4.5 The main theorem with discussion Combining the analysis presented in Section 4.4.3 and 4.4.4, we formalize the following conclusion which guarantees an almost sure subsequential convergence of the AIS-based surrogationmethodtoasurrogationstationarypointofthelogarithmicintegraloptimization problem (4.12). Theorem 67. Let the following be given: • A pair (b c, b H) of surrogate functions satisfying the conditions in Section 4.3.1; • A sequence of positive integers{N t } t≥ 0 such that ∞ X t=0 1 N α t <∞ for some α ∈ 0, 1 2 . Then with probability one, the sequence {x t } t≥ 0 produced by the AIS-based surrogation method has an accumulation point x ∞ ∈ X such that x ∞ is a (b c, b H)-surrogation stationary point of (4.12). Remark 68. We have the following comments pertaining to Theorem 67: 1. By Proposition 59, the accumulation point x ∞ will be directional stationary for (4.12) if the surrogation (b c, b H) satisfies the condition that ( b c(• ;x), b H(• ,z;x)) is directional derivative consistent at x for all (x,z)∈X× Ξ, namely, b c(• ;x) ′ (x;dx)=c ′ (x;dx) and 145 b H(• ,z;x) ′ (x;dx) = H ′ (x,z;dx), which ensures (log b Z(• ;¯x)) ′ (¯x;dx) = (logZ) ′ (¯x;dx), for any dx∈R n . 2. From Theorem 67, a sequence of N t that grows faster than t 2 should be sufficient to guaranteeourconvergenceresult. However, asindicatedbyournumericalresults, such estimation of sample complexity can be too conservative in practice. In Section 4.6.2, wepresentamodifiedversionofourschemewhichsuggestsa N t thatincreaseslinearly in t. 
This alternative scheme requires $\rho$ to be small enough rather than an arbitrary positive value, and it uses $\|x - x^t\|_2$ for the regularization term instead of its square in subproblem (4.19). However, this modification should be regarded as a theoretical insight into improving the sample complexity at the cost of sacrificed practical utility, due to the restricted "step size" $\rho$ and the use of nonsmooth subproblems, which are more challenging to solve. Therefore, in the subsequent experiments we implement only the original AIS-based surrogation scheme, since, as just mentioned, its practical effectiveness requires a more manageable sample size than what the theoretical results suggest.

4.5 Numerical Experiments

In this section, we test the numerical performance of our AIS-based surrogation method (abbreviated as "AIS" from now on) with two driving goals. First, we aim to demonstrate the advantage of AIS when compared with the following two approaches for (4.12):

• Sample Average Approximation (SAA): We substitute the integral term in (4.12) by an SAA of type (4.16) under a prescribed density $\pi_{\mathrm{SAA}}$ and sample size $S$, and obtain
$$\operatorname*{minimize}_{x \in X}\; c(x) + \log\left( \frac{1}{S}\sum_{s=1}^{S} \frac{\exp(H(x, \zeta^s))}{\pi_{\mathrm{SAA}}(\zeta^s)} \right) \;=\; c(x) + \log \bar{Z}^{S}_{\pi_{\mathrm{SAA}}}(x). \tag{4.32}$$
Problem (4.32) is handled by a surrogation method, which, with fixed $x^0 \in X$ and $\rho > 0$, iteratively solves
$$x^{t+1} \in \widehat{M}^{\mathrm{SAA}}_{\rho}(x^t; S) \;\triangleq\; \operatorname*{argmin}_{x \in X}\; \widehat{c}(x; x^t) + \log \widehat{Z}^{S}_{\pi_{\mathrm{SAA}}}(x; x^t) + \frac{1}{2\rho}\|x - x^t\|_2^2$$
and produces a sequence $\{x^t\}_{t \ge 0}$ with an accumulation point $x_{\mathrm{SAA}} \in \widehat{M}^{\mathrm{SAA}}_{\rho}(x_{\mathrm{SAA}}; S)$. The proof is virtually the same as that of Proposition 65 and is hence omitted here.

• Stochastic Majorization-Minimization (SMM): With a continuous density function $\pi_{\mathrm{SMM}}$ prescribed throughout the procedure, we have the following two implementations of SMM:
— Non-incremental SMM: this is essentially the same as the AIS method, except that we now sample non-adaptively from $\pi_{\mathrm{SMM}}$ in each step $t$, with sample size $N_t$ identical to that of AIS.
— Incremental SMM: in each step we still draw $N_t$ iid samples from $\pi_{\mathrm{SMM}}$, but now we use all the samples up to step $t$ to construct subproblem (4.19). A convergence result for SMM in this setting can be found in Theorem 10.2.1 of [29], where the same subsequential convergence as in Theorem 67 can be established if the sequence $N_t$ satisfies
$$\sum_{t=1}^{\infty} \frac{N_t - N_{t-1}}{N_t\, N_{t-1}^{\alpha}} < \infty, \quad \text{for some } \alpha \in \left(0, \tfrac{1}{2}\right).$$

Through applications of AIS, SMM and SAA to the same set of generic problems under various configurations, we verify that AIS outperforms the other methods in terms of computing time, objective value at termination, and stability, namely the variance in the solutions near convergence.

Our second goal is to show that AIS, as a suitable method for MAP inference of BHMs with intractable normalizers, enables the practical benefits of such models that are otherwise handicapped by naïve treatments of the intractable normalizers, e.g., model (1.2) as a simplification of (1.8).

Finally, all the numerical experiments are conducted on a Mac OS X personal computer with a 2.3 GHz Intel Core i7 and 8 GB RAM. The reported times are in seconds on this computer.

4.5.1 Comparisons with SMM and SAA

This section compares the performance of AIS, SMM and SAA by applying them to randomly generated instances of the o-BHOP (4.3) for model (1.10) under the following specifications:

• Both $y_i$ and $\widetilde{\theta}_i$ are scalar valued, i.e., $m = 1$, hence the vectors $y$ and $\widetilde{\theta}$ are both $|V| = M$-dimensional, and $p_{y_i}(y_i \mid \widetilde{\theta})$ in (1.10) is Gaussian with mean $\widetilde{\theta}_i$ and a known variance $\sigma_i^2 > 0$.
• The density $q(\theta \mid \widetilde{\beta})$ in (1.10) is specified with $h_{ij}(\theta_i, \theta_j) = |\theta_i - \theta_j|$.

The o-BHOP (4.3) under these settings is formulated as:
$$\begin{array}{ll}
\operatorname*{minimize}_{\theta, \beta} & \displaystyle\sum_{i=1}^{M} \frac{1}{2\sigma_i^2}(y_i - \theta_i)^2 + \sum_{i<j} \beta_{ij}|\theta_i - \theta_j| + \log\Bigg( \underbrace{\int_{\Theta} \exp\Big\{ -\sum_{i<j} \beta_{ij}|\theta'_i - \theta'_j| \Big\}\, d\theta'}_{\triangleq\, Z(\beta)} \Bigg) \\[2ex]
\text{subject to} & \theta \in \Theta = [0,1]^M, \quad \underline{\beta}_{ij} \le \beta_{ij} \le \bar{\beta}_{ij}, \ \forall\, i<j.
\end{array} \tag{4.33}$$
We set $M$ to 10 and 20, resulting in $D = 55$ and $D = 210$ total variables in (4.33), respectively. The data $y$ are generated from model (1.10), with $\sigma_i = 0.1$ for all $i \in V$, and $\underline{\beta}_{ij} = 0$ and $\bar{\beta}_{ij} = 0.01$ for all $i < j$, except for two randomly chosen pairs whose upper bound $\bar{\beta}_{ij}$ is set to 100. When we implement AIS, we adopt the sample size $N_t = \min\{t^{1.2}, \bar{N}\}$, where $\bar{N}$ is a predetermined positive integer. Given the current $\beta^t$, we apply the Metropolis–Hastings method [75] to draw conditionally independent samples $\{z^{st}\}_{s=1}^{N_t}$ from the density function (in $z$)
$$\frac{1}{Z(\beta^t)} \exp\Big\{ -\sum_{i<j} \beta^t_{ij}|z_i - z_j| \Big\}$$
and then construct the AIS subproblem (4.19), whose formulation is specified as (4.39) in Section 4.6.3.1. On the other hand, the sampling distribution $\pi_{\mathrm{SMM}}$ for both incremental and non-incremental SMM is fixed as uniform over the set $\Theta$. In each step $t$, given iid samples from $\pi_{\mathrm{SMM}}$, we solve subproblems whose formulations under the two SMM schemes can also be found in (4.39) of Section 4.6.3.1. Finally, for SAA we set $\pi_{\mathrm{SAA}}$ as uniform over $\Theta$, and for a fixed sample size $S$ we solve (4.32) with the surrogation method introduced at the beginning of this section.

Throughout our experiments, we test AIS, SMM and SAA with $S \in \{100, 1000, 10000\}$, $\bar{N} \in \{10, 20, 50, 100\}$ and $\rho = 100$; the detailed configurations are summarized in Table 4.1. For SAA, we refer to distinct batches of $S$ samples from $\pi_{\mathrm{SAA}}$ as "replications". For each $M$, we use "3 instances" to denote three randomly generated problems that are shared by the different methods under varying arrangements, including different choices of $\bar{N}$ and $S$, as well as different initializations and replications. All the statistics shown in Table 4.2 are averaged over the total number of runs under a particular pair $(M, \bar{N})$ or $(M, S)$: 15 for AIS and SMM, 45 for SAA.

Table 4.1: Experiment configurations.
  Methods               Configurations
  AIS under a (M, N̄)    3 instances × 5 initializations
  SMM under a (M, N̄)    3 instances × 5 initializations
  SAA under a (M, S)    3 instances × 5 initializations × 3 replications

Finally, all the methods are terminated when two stopping criteria are met. The first requires the relative difference between two consecutive solutions to be small enough (we use the notation $x$ to represent $(\theta, \beta)$ in what follows):
$$\frac{\|x^t - x^{t-1}\|}{\|x^{t-1}\|} < \varepsilon_{\mathrm{sol}},$$
with the values of $\varepsilon_{\mathrm{sol}}$:
            N̄ = 10   N̄ = 20   N̄ = 50   N̄ = 100
  M = 10    7e-5     2e-5     2e-5     2e-5
  M = 20    1e-4     1e-4     1e-4     1e-4
The second requires the changes in the objective to be relatively small among the three most recent steps:
$$\frac{1}{3}\sum_{\tau = t-2}^{t} \frac{\big| \widehat{f}(x^\tau) - \widehat{f}(x^{\tau-1}) \big|}{\big| \widehat{f}(x^{\tau-1}) \big|} < \varepsilon_{\mathrm{obj}},$$
with the values of $\varepsilon_{\mathrm{obj}}$:
            N̄ = 10   N̄ = 20   N̄ = 50   N̄ = 100
  M = 10    2e-5     1e-5     1e-5     1e-5
  M = 20    2e-5     1e-5     1e-5     1e-5
where $\widehat{f}$ is the objective of (4.33) with $Z(\beta)$ approximated by $10^5$ predetermined iid uniform samples on $\Theta$. Additionally, we terminate if these two stopping rules are not yet satisfied at step 100.

Table 4.2: Comparison of AIS, SMM and SAA (M = 10, D = 55).
  Methods                Tot. Time   Samp. Time   Sol. Time   Obj.     Steps
  SAA        S = 100         12.99         0.01       12.98   -4.627      26
             S = 1000        12.33         0.02       12.31   -4.633      15
             S = 10000       75.57         0.12       75.45   -4.635      15
  AIS        N̄ = 10           7.87         2.50        5.37   -4.635      12
             N̄ = 20          11.76         4.59        7.17   -4.636      16
             N̄ = 50          13.04         5.79        7.25   -4.636      17
             N̄ = 100         16.31         7.48        8.83   -4.636      20
  SMM        N̄ = 10          11.60         0.01       11.59   -4.632      23
  (inc.)     N̄ = 20          18.36         0.01       18.35   -4.633      34
             N̄ = 50          20.03         0.02       20.02   -4.634      33
             N̄ = 100         28.15         0.03       28.12   -4.635      41
  SMM        N̄ = 10          41.39         0.02       41.37   -4.637     100
  (non-inc.) N̄ = 20          43.22         0.03       43.19   -4.637     100
             N̄ = 50          44.39         0.06       44.33   -4.636     100
             N̄ = 100         45.97         0.10       45.68   -4.636     100

All the test results are summarized in Table 4.2, in which "Tot. Time" stands for how long it takes for the algorithms to terminate, "Samp. Time" and "Sol. Time" represent the time spent purely on sampling and on solving the subproblems, respectively, "Obj." is the approximated objective value $\widehat{f}$ at termination, and "Steps" records the number of subproblems solved before stopping. Incremental and non-incremental SMM are referred to as "SMM (inc.)" and "SMM (non-inc.)". First, from Table 4.2, the final objective attained by AIS is generally better than that of the two SMM schemes. While a larger $\bar{N}$ does not significantly affect the objective value at termination, it does increase the computing time. Additionally, both AIS and incremental SMM can be stopped within 100 steps, but the former requires fewer steps to terminate and is clearly faster than the latter, e.g., twice as fast when $M = 20$, $\bar{N} = 10$. On the other hand, for all sizes of $\bar{N}$, non-incremental SMM is not capable of meeting the stopping criteria within 100 iterations. In fact, as demonstrated by the curves of the approximated objective $\widehat{f}$ vs. steps for AIS and the two SMM schemes when applied to a typical case with $M = 20$, $\bar{N} = 10$ for 100 steps (see Figure 4.2), non-incremental SMM oscillates too significantly for us to conclude its convergence, while AIS has the most stable performance. It is worth noting that AIS achieves all these properties despite spending a substantially longer time on sampling, as the distributions we sample from are more complex than a simple uniform law. However, a further decrease in the AIS sampling time can be expected if we adopt a more adequate sampling method than the current naïve implementation of Metropolis–Hastings.

Table 4.2: Comparison of AIS, SMM and SAA (continued; M = 20, D = 210).
  Methods                Tot. Time   Samp. Time   Sol. Time   Obj.     Steps
  SAA        S = 100         31.11         0.01       31.11   -6.228      29
             S = 1000        68.97         0.02       68.95   -6.251      26
             S = 10000      635.19         0.11      635.08   -6.262      30
  AIS        N̄ = 10          24.51         5.23       19.28   -6.263      23
             N̄ = 20          33.10         9.41       23.69   -6.264      27
             N̄ = 50          39.45        13.84       25.62   -6.264      30
             N̄ = 100         43.91        15.98       27.94   -6.264      31
  SMM        N̄ = 10          41.37         0.01       41.36   -6.241      34
  (inc.)     N̄ = 20         120.51         0.02      120.49   -6.253      63
             N̄ = 50         165.91         0.03      165.88   -6.258      58
             N̄ = 100        316.44         0.06      316.38   -6.258      68
  SMM        N̄ = 10          89.03         0.02       89.00   -6.256     100
  (non-inc.) N̄ = 20          87.19         0.04       87.16   -6.257     100
             N̄ = 50          92.55         0.06       92.49   -6.257     100
             N̄ = 100        100.05         0.11       99.94   -6.258     100

The advantage of AIS over both SMM schemes comes from its adaptive sampling scheme, which provides accurate approximations of the objective with smaller variance (at least locally around the current $(\theta^t, \beta^t)$), even with a sample size as small as $\bar{N} = 10$. This is particularly crucial as most of the computation in all methods is spent on solving their respective subproblems, whose cost depends heavily on the sample size. Thus, AIS achieves a lighter computation than incremental SMM by requiring fewer samples to retain an effective approximation in each step. On the other hand, non-incremental SMM, which applies the same sample size as AIS without accumulation, suffers from a larger variance at each step.
This is the key reason behind the smoother and more stable behavior of AIS compared to non-incremental SMM, as seen in Figure 4.2. Another interesting observation from Figure 4.2 is that the approximated objective curve of AIS is nearly non-increasing. It is known that the deterministic surrogation method described in Proposition 65 has the property of a non-increasing objective along the iterations. Therefore, if we view AIS as an inexact surrogation method, the near monotonicity of the AIS curve further supports its approximation accuracy.

Figure 4.2: AIS vs. SMM.

Table 4.3: Stability: AIS vs SMM.
  Methods            N̄ = 10                 N̄ = 20
                     Var. Obj.   Var. Sol.  Var. Obj.   Var. Sol.
  AIS                2.2e-4      9.3e-4     1.5e-4      1.1e-3
  SMM (inc.)         2.4e-3      4.2e-3     2.4e-4      5.7e-3
  SMM (non-inc.)     1.0e-3      1.2e-2     1.2e-3      3.6e-3
  Methods            N̄ = 50                 N̄ = 100
                     Var. Obj.   Var. Sol.  Var. Obj.   Var. Sol.
  AIS                1.3e-4      6.3e-4     1.2e-4      9.7e-4
  SMM (inc.)         1.9e-3      4.8e-3     1.5e-3      4.9e-3
  SMM (non-inc.)     7.6e-4      2.8e-3     5.0e-4      2.2e-3

Table 4.3 presents some additional quantifications for the stability comparison between AIS and SMM, where 10 repetitions are conducted for each pair of initialization and random instance ($M = 10$). The standard deviations of the final objectives and of the norms of the final solutions among the 10 repetitions are averaged over all the instances and initializations, and reported as "Var. Obj." and "Var. Sol." respectively. Based on the table, AIS can effectively control the variance from the intermediate sampling and hence the variability in the solutions computed, which is favorable in practice. For instance, when $\bar{N} = 10$, AIS achieves one-fifth (resp. one-tenth) of the standard deviation in objective (resp. norm of solution) of non-incremental SMM.

The advantage of AIS over SAA is also significant in Table 4.2. Overall, AIS outperforms SAA in terms of efficiency and final objective value. The performance of SAA relies primarily on the sample size $S$, meaning that an inaccurate SAA from a small sample size such as $S = 100$ leads to an unreliable solution despite being easier to compute. As $S$ grows larger, the quality of the SAA solutions approaches that obtained by AIS, as indicated by the objective value at termination. Interestingly, even when $\bar{N} = 10$, AIS can still achieve an objective comparable to SAA under $S = 10000$, but with a computing speed that is at least five times faster.

4.5.2 Applications in BHM inference

In this part, AIS is applied to the following approximation of the MRF with unknown edges (1.9):
$$\begin{array}{lll}
\text{level 2} & \widetilde{y}_i \mid \widetilde{\theta} \;\stackrel{\text{ind.}}{\sim}\; p_{y_i}(y_i \mid \widetilde{\theta}), & \text{for all } i \in V, \\[1ex]
\text{level 1} & \widetilde{\theta} \mid \widetilde{u} \;\sim\; \dfrac{1}{Z_{\mathrm{edge}}(\widetilde{u})}\exp\Big\{ -\displaystyle\sum_{i<j} \phi_{ij}(\widetilde{u}_{ij})\, h_{ij}(\theta_i, \theta_j) \Big\}, & \\[1ex]
\text{level 0} & \widetilde{u}_{ij} \;\stackrel{\text{iid}}{\sim}\; \text{Uniform}([0,1]), & \text{for all } i < j,
\end{array} \tag{4.34}$$
where $Z_{\mathrm{edge}}(\widetilde{u}) \triangleq \displaystyle\int_{\Theta} \exp\Big\{ -\sum_{i<j} \phi_{ij}(\widetilde{u}_{ij})\, h_{ij}(z_i, z_j) \Big\}\, dz$. This model is obtained by first applying the uniform transformation as in (4.4), i.e., substituting $\widetilde{\beta}_{ij}$ in (1.9) by $F^{-1}_{\widetilde{\beta}_{ij}}(\widetilde{u}_{ij})$, with $\widetilde{u}_{ij}$ uniformly distributed on $[0,1]$ and $F^{-1}_{\widetilde{\beta}_{ij}}(s) = \mathbb{1}_{(0,\infty)}(s - p_{ij})$ being the generalized inverse of the Bernoulli cdf. We then employ the treatment in Section 4.2 to approximate $F^{-1}_{\widetilde{\beta}_{ij}}$ by a nonconvex piecewise affine $\phi_{ij}$, derived from $\phi^{\mathrm{ub}}$ in (4.10), whose formulation is listed in Section 4.6.3.2. Additionally, $p_{y_i}(y_i \mid \widetilde{\theta})$ is univariate Gaussian with mean $\widetilde{\theta}_i$ and fixed variance $\sigma_i^2 > 0$ for all $i \in V$.

The MAP problem associated with (4.34) is typically nonconvex and nondifferentiable with the intractable integral term $Z_{\mathrm{edge}}$, making it computationally challenging.
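The piecewise affine map $\phi_{ij}$ just mentioned has the closed form recorded later in Section 4.6.3.2 as (4.40), together with its difference-of-convex rewriting (4.41). The following is a minimal sketch of both forms, with illustrative (not prescribed) values of $p_{ij}$ and $\delta_{ij}$:

```python
import numpy as np

def phi(s, p, delta):
    """Piecewise affine surrogate (4.40) of the Bernoulli generalized inverse cdf
    1_(0,inf)(s - p): equals 0 well below p, 1 well above p, and is linear in between."""
    return np.minimum(np.maximum(1.0 + (s - p) / delta, 0.0), 1.0)

def phi_dc(s, p, delta):
    """Difference-of-convex rewriting (4.41): phi = max(a*s + b, 0) - max(a*s + b - 1, 0)
    with a = 1/delta and b = 1 - p/delta."""
    a, b = 1.0 / delta, 1.0 - p / delta
    return np.maximum(a * s + b, 0.0) - np.maximum(a * s + b - 1.0, 0.0)

s = np.linspace(0.0, 1.0, 1001)
assert np.allclose(phi(s, 0.3, 0.05), phi_dc(s, 0.3, 0.05))  # the two forms agree
# As delta -> 0, phi(., p, delta) approaches the 0/1 indicator of s > p pointwise.
```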
For a simplification, one can resort to the naïve benchmark modeling (1.3) by setting $\widetilde{u}_{ij}$ in (4.34) to be deterministic, so that the intractable $Z_{\mathrm{edge}}$ vanishes. For example, if we let $\widetilde{u}_{ij}$ be some known $\bar{u}_{ij} \in [0,1]$ in (4.34) when $h_{ij}(\theta_i, \theta_j) = A_{ij}(\theta_i - \theta_j)^2$ with fixed $A_{ij} > 0$, then the simplified MAP becomes
$$\operatorname*{minimize}_{\theta \in \Theta}\; \sum_{i=1}^{M} \frac{1}{2\sigma_i^2}(y_i - \theta_i)^2 + \sum_{i<j} \phi_{ij}(\bar{u}_{ij})\, A_{ij}(\theta_i - \theta_j)^2, \tag{4.35}$$
whose objective is quadratic with a special Stieltjes structure, making its global optimal solution computable in strongly polynomial time when $\Theta$ is defined by box constraints; see [69]. However, as discussed in Section 1.4, such simplifications are not always reliable, as the prescribed $\bar{u}_{ij}$ might be misspecified if naïvely estimated from the noisy data $y$. In this sense, model (4.34) benefits us with its flexibility in allowing both the network topology $\widetilde{u}_{ij}$ and the ground truth $\widetilde{\theta}$ to be recovered simultaneously. To emphasize such advantages, comparisons between the more general model (4.34) and its simplification are made within the context of signal and image recovery tasks. Through the following experiments, we highlight the value of AIS as an effective method for enabling the utility of model (4.34), which summarizes the underlying data generating process more faithfully.

Smooth signal recovery. Suppose we discretize the function $\sin(t)$ for $t \in [0, 4\pi]$ by sampling it at equidistant points with a spacing of 0.05 (except for $t = 4\pi$), resulting in $M \triangleq |V| = 253$. We define $\Theta = [-1, 1]^M$ and construct the noisy signal $y$ by adding iid Gaussian noise with mean 0 and variance $\sigma^2$ to each sampled point, where $\sigma^2$ is 0.09 (resp. 0.25) to represent small (resp. large) noise. Model (4.34) is applied with $h_{ij}(\theta_i, \theta_j) = A_{ij}(\theta_i - \theta_j)^2$, where $A_{ij} = 50$. Intuitively, the ideal value $\widetilde{u}^*$ in (4.34) should reflect the smoothness of the underlying signal, namely $\phi_{ij}(\widetilde{u}^*_{ij}) = 1$ if $i + 1 = j$ and $\phi_{ij}(\widetilde{u}^*_{ij}) = 0$ otherwise. To simplify the computation, we only estimate the similarities around the four peaks of the sine signal between 0 and $4\pi$. More specifically, with the edge set $\mathcal{N} \triangleq \{(k\pi + l - 1,\, k\pi + l): k = 1, \ldots, 4,\ l = -5, \ldots, 5\}$, we fix $\widetilde{u}_{ij}$ in (4.34) to the ideal value $\widetilde{u}^*_{ij}$ whenever $(i,j) \notin \mathcal{N}$. When our AIS method is applied to the MAP of (4.34), whose formulation can be found as (4.42) in Section 4.6.3.2, subproblem (4.19) is formulated as (4.43) and solved by MOSEK [5]. Finally, AIS is terminated when the relative difference of solutions (in $\ell_2$ norm) between the two most recent steps is less than $5 \times 10^{-5}$.

To facilitate comparison, we simplify model (4.34) for $(i,j) \in \mathcal{N}$ by heuristically setting $\widetilde{u}_{ij}$ to $\bar{u}_{ij} = 1$ if $|y_i - y_j| < \bar{y}$ and to $\bar{u}_{ij} = 0$ otherwise, where the threshold is
$$\bar{y} \;\triangleq\; \frac{1}{M-1}\sum_{i=1}^{M-1} |y_{i+1} - y_i| \;+\; \kappa \nu$$
with $\kappa \in \{-1, -0.5, 0, 0.5, 1\}$ and $\nu$ being the standard deviation among $\{|y_{i+1} - y_i| : i = 1, \ldots, M-1\}$. The resulting simplified MAP is (4.35) and is solved by the method introduced in [69].

Figure 4.3: Comparison of MRF models for signal recovery ($\sigma^2 = 0.25$); panels (a) o.g., (b) noisy, (c) generalized MRF, and (d)–(h) s. MRF with $\kappa \in \{-1, -0.5, 0, 0.5, 1\}$.

Figures 4.3 and 4.4 depict the signals recovered from model (4.34) (referred to as "generalized MRF") and from its simplifications (referred to as "s. MRF") at various $\kappa$ for the threshold $\bar{y}$. For reference, we also plot the noisy signal (labelled "noisy") and the ground truth (labelled "o.g.").
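For concreteness, the construction of the noisy signal and of the heuristic threshold $\bar{y}$ described above can be sketched as follows. This is an illustrative reconstruction under our reading of the setup (the restriction to the edge set $\mathcal{N}$ and the AIS solve itself are omitted), not the code used for the reported experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretize sin(t) on [0, 4*pi] with spacing 0.05, keeping the endpoint, so M = 253.
t = np.append(np.arange(0.0, 4 * np.pi, 0.05), 4 * np.pi)
theta_true = np.sin(t)                                       # ground truth ("o.g.")
sigma2 = 0.25                                                # large-noise case; 0.09 for small noise
y = theta_true + rng.normal(0.0, np.sqrt(sigma2), t.size)    # noisy observation ("noisy")

# Heuristic threshold: mean absolute first difference plus kappa times its standard deviation.
diffs = np.abs(np.diff(y))
kappa = 0.5                                                  # one of {-1, -0.5, 0, 0.5, 1}
y_bar = diffs.mean() + kappa * diffs.std()

# Prescribed u_bar on an edge (i, j): similar-looking neighbors are declared connected.
u_bar = lambda i, j: 1.0 if abs(y[i] - y[j]) < y_bar else 0.0
print(t.size, y_bar)
```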
Figure 4.4: Comparison of MRF models for signal recovery ($\sigma^2 = 0.09$); panels as in Figure 4.3.

As demonstrated by the figures, the signals computed using model (4.34) are overall smooth around the peaks and closely resemble the ground truth signal. In other words, the solutions obtained by AIS are capable of capturing the ideal similarity structure $\widetilde{u}^*$. Conversely, the signals recovered from the simplified model exhibit some undesired spikes around the peaks, even when the noise is relatively small ($\sigma^2 = 0.09$); see for instance Figure 4.4d. This indicates that while the simplified model benefits from faster inference, the quality of its solution may be compromised by an unreliable specification of the hyperparameters $\bar{u}_{ij}$ from noisy data.

Image recovery. AIS is also tested on two image recovery tasks. In case 1, the original image, denoted "o.g." in Figure 4.5a, is an 8-by-8 black and white image ($M = 64$) with the 16 pixels in the middle taking value 25.5 and the background pixels being 153. We set $\Theta = [0, 255]^M$ and contaminate each pixel with iid Gaussian noise with zero mean and variance $\sigma^2 = 25$. The resulting observation $y$ is further trimmed to take values in $[0, 255]$; see Figure 4.5b. The observation for case 2 is generated similarly, except that the original image is block diagonal, as shown in Figure 4.6a.

Figure 4.5: Comparison of MRF models for image recovery (case 1); panels (a) o.g., (b) noisy, (c) generalized MRF, and (d)–(h) s. MRF with $\kappa \in \{-1, -0.5, 0, 0.5, 1\}$.

Figure 4.6: Comparison of MRF models for image recovery (case 2); panels as in Figure 4.5.

To recover the original image as well as the pairwise similarities between pixels, we consider (4.34) with $h_{ij}(\theta_i, \theta_j) = A_{ij}|\theta_i - \theta_j|$ and $A_{ij} = 50$. The associated MAP and its corresponding AIS subproblem (4.19) are similar to (4.42) and (4.43) in Section 4.6.3.2, respectively, and are thus omitted here. AIS is terminated when we reach a $5 \times 10^{-4}$ relative difference in the norm of solutions between the two most recent steps. The image recovered from (4.34) is shown in Figure 4.5c. Similar to signal recovery, we also implement the simplification of (4.34) by fixing $\widetilde{u}_{ij}$ to $\bar{u}_{ij} = 1$ if $|y_i - y_j| < \bar{y}$ and to $\bar{u}_{ij} = 0$ otherwise, where the threshold is
$$\bar{y} \;=\; \frac{2}{M(M-1)}\sum_{i<j} |y_i - y_j| \;+\; \kappa \nu$$
with $\kappa \in \{-1, -0.5, 0, 0.5, 1\}$ and $\nu$ the standard deviation among $\{|y_i - y_j| : i < j\}$. The images recovered by solving the simplified MAPs under different $\kappa$ are presented in Figures 4.5d to 4.5h and Figures 4.6d to 4.6h, labelled "s. MRF".

The solutions obtained by AIS from model (4.34), as shown in Figures 4.5 and 4.6, are able to accurately capture the dark pixels in the original image, which indicates that (4.34) has effectively identified the relevant similarities between pixels that distinguish the dark blocks from the background. It is worth noting that such a similarity structure is challenging to specify a priori, especially under the added noise. Consequently, the simplified MRF approaches considered with the various $\kappa$ values are unable to recover the original image, as they lack the generality of model (4.34).
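Both sets of experiments above rely on Metropolis–Hastings draws from a Gibbs-type density over the box $\Theta$: Section 4.5.1 samples from $\tfrac{1}{Z(\beta^t)}\exp\{-\sum_{i<j}\beta^t_{ij}|z_i - z_j|\}$, and Section 4.5.2 from the analogous density built from $\phi_{ij}(u^t_{ij})\,h_{ij}$. The following random-walk sketch is only an assumed, minimal stand-in for a naïve Metropolis–Hastings implementation of this kind; the step size, burn-in, and thinning values are arbitrary placeholders, and thinning only approximates the conditional independence of the draws assumed in the analysis.

```python
import numpy as np

def rw_metropolis(neg_energy, lo, hi, n_samples, step=0.1, burn_in=1000, thin=10, seed=0):
    """Random-walk Metropolis sampler for a density proportional to exp(neg_energy(z))
    on the box [lo, hi]; proposals outside the box are rejected (the density is zero there)."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(lo, hi)                       # start from a uniform point in the box
    log_p = neg_energy(z)
    out = []
    for it in range(burn_in + n_samples * thin):
        prop = z + step * rng.normal(size=z.size)
        if np.all(prop >= lo) and np.all(prop <= hi):
            log_p_prop = neg_energy(prop)
            if np.log(rng.uniform()) < log_p_prop - log_p:   # symmetric proposal: MH ratio
                z, log_p = prop, log_p_prop
        if it >= burn_in and (it - burn_in) % thin == 0:
            out.append(z.copy())
    return np.array(out)

# Example target from Section 4.5.1: exp(-sum_{i<j} beta_ij |z_i - z_j|) on [0, 1]^M.
M = 10
beta = np.full((M, M), 0.01)
iu = np.triu_indices(M, k=1)
neg_energy = lambda z: -np.sum(beta[iu] * np.abs(z[iu[0]] - z[iu[1]]))
samples = rw_metropolis(neg_energy, np.zeros(M), np.ones(M), n_samples=100)
```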
4.6 Proofs and Supplementary Materials

In this section, we formalize the proofs of some intermediate results that serve as building blocks for our analysis of the AIS-based surrogation method, along with an alternative algorithmic scheme that achieves an improved sample complexity. Additionally, we provide some necessary implementation details for our numerical experiments.

4.6.1 Proofs of some intermediate results

In what follows, we complete the proofs of several results that are useful for our analysis, namely Lemma 62, Proposition 64 and Lemma 66.

4.6.1.1 Proof of Lemma 62

Denote $M_\varphi \triangleq \sup_{z \in Z} \varphi(z) - \inf_{z \in Z} \varphi(z) < \infty$. Then for arbitrary $\bar{y} \in Y$ and $\varepsilon > 0$ we have
$$\mathbb{P}\left( \left|\, \frac{1}{N_t}\sum_{s=1}^{N_t} \varphi(z^{st}) - \mathbb{E}_{\zeta \sim \eta(\bullet, y^t)}\, \varphi(\zeta) \,\right| \ge \varepsilon \;\middle|\; y^t = \bar{y} \right) \;\le\; 2\exp\left( -\frac{2 N_t \varepsilon^2}{M_\varphi^2} \right), \tag{4.36}$$
which holds due to the conditional independence among $\{z^{st}\}_{s=1}^{N_t}$, the fact that $\mathbb{E}\big(\varphi(z^{st}) - \mathbb{E}_{\zeta \sim \eta(\bullet, y^t)}\varphi(\zeta) \mid y^t = \bar{y}\big) = 0$ for all $s = 1, \ldots, N_t$, and virtually the same argument used to prove the conventional Hoeffding inequality. For all $t \ge 1$, denote $S_t \triangleq \{z^1, \ldots, z^t, y^1, \ldots, y^t\}$ and let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $S_t$, with $\mathcal{F}_0 \triangleq \{\emptyset, \Omega\}$, where $\Omega$ is the underlying set of possible outcomes. By our data generating process and the definition of the conditional probability distribution, from (4.36) the following holds almost surely:
$$\mathbb{P}\left( \left|\, \frac{1}{N_t}\sum_{s=1}^{N_t} \varphi(z^{st}) - \mathbb{E}_{\zeta \sim \eta(\bullet, y^t)}\, \varphi(\zeta) \,\right| \ge \varepsilon \;\middle|\; \mathcal{F}_{t-1} \right) = \mathbb{P}\left( \left|\, \frac{1}{N_t}\sum_{s=1}^{N_t} \varphi(z^{st}) - \mathbb{E}_{\zeta \sim \eta(\bullet, y^t)}\, \varphi(\zeta) \,\right| \ge \varepsilon \;\middle|\; y^t \right) \le 2\exp\left( -\frac{2 N_t \varepsilon^2}{M_\varphi^2} \right). \tag{4.37}$$
By the second Borel–Cantelli theorem [31, Theorem 4.3.4] and our assumptions on $N_t$,
$$0 = \mathbb{P}\left( \sum_{t=1}^{\infty} \mathbb{P}\left( \left|\, \frac{1}{N_t}\sum_{s=1}^{N_t} \varphi(z^{st}) - \mathbb{E}_{\zeta \sim \eta(\bullet, y^t)}\, \varphi(\zeta) \,\right| \ge \varepsilon \;\middle|\; \mathcal{F}_{t-1} \right) = \infty \right) = \mathbb{P}\left( \left|\, \frac{1}{N_t}\sum_{s=1}^{N_t} \varphi(z^{st}) - \mathbb{E}_{\zeta \sim \eta(\bullet, y^t)}\, \varphi(\zeta) \,\right| \ge \varepsilon \ \text{i.o.} \right),$$
where "i.o." stands for infinitely often. Note that this is sufficient to show that
$$\frac{1}{N_t}\sum_{s=1}^{N_t} \varphi(z^{st}) - \mathbb{E}_{\zeta \sim \eta(\bullet, y^t)}\, \varphi(\zeta) \longrightarrow 0 \quad \text{almost surely.}$$

4.6.1.2 Proof of Proposition 64

Denote $\phi(\chi, \bar{x}, z) \triangleq \dfrac{\exp \widehat{H}(\chi, z; \bar{x})}{\exp H(\bar{x}, z)}\, Z(\bar{x})$. Then, by Proposition 63 and our assumptions, we have
$$\lim_{t \to \infty}\; \sup_{(\chi, \bar{x}) \in X \times X} \left|\, \frac{1}{N_t}\sum_{s=1}^{N_t} \phi(\chi, \bar{x}, \zeta^{st}) - \mathbb{E}_{\zeta \sim \pi^*(x^t)}\, \phi(\chi, \bar{x}, \zeta) \,\right| = 0.$$
Note that for an arbitrary $x \in X$,
$$\left|\, \widehat{Z}^{N_t}_{\pi^*(x^t)}(x; x^t) - \widehat{Z}(x; x^t) \,\right| = \left|\, \frac{1}{N_t}\sum_{s=1}^{N_t} \phi(x, x^t, \zeta^{st}) - \mathbb{E}_{\zeta \sim \pi^*(x^t)}\, \phi(x, x^t, \zeta) \,\right| \le \sup_{(\chi, \bar{x}) \in X \times X} \left|\, \frac{1}{N_t}\sum_{s=1}^{N_t} \phi(\chi, \bar{x}, \zeta^{st}) - \mathbb{E}_{\zeta \sim \pi^*(x^t)}\, \phi(\chi, \bar{x}, \zeta) \,\right|.$$
Thus we obtain $\lim_{t \to \infty} \big| \widehat{Z}^{N_t}_{\pi^*(x^t)}(x; x^t) - \widehat{Z}(x; x^t) \big| = 0$, which implies that $\limsup_{t \to \infty} \widehat{Z}^{N_t}_{\pi^*(x^t)}(x; x^t) - \widehat{Z}(x; x^t) \le 0$. Then, for any convergent subsequence $\{x^t\}_{t \in \mathcal{T}}$ with limit point $x^\infty \in X$, we have
$$\limsup_{t(\in \mathcal{T}) \to \infty}\; \widehat{Z}^{N_t}_{\pi^*(x^t)}(x; x^t) \;\le\; \limsup_{t(\in \mathcal{T}) \to \infty}\; \widehat{Z}(x; x^t) \;\le\; \widehat{Z}(x; x^\infty),$$
where the last inequality is obtained by the upper semicontinuity of $\widehat{H}(x; \bullet)$ and Fatou's Lemma.

4.6.1.3 Proof of Lemma 66

By our assumptions, we can without loss of generality treat the logarithm as Lipschitz continuous with constant $\widetilde{L}$; moreover, there exists $M_H > 0$ such that $\sup_{(x,z) \in X \times \Xi} \dfrac{\exp(H(x,z))}{\pi^*(z, \bar{x})} < M_H$ for all $\bar{x} \in X$, and there exists $L_H > 0$ such that $\dfrac{\exp(H(x,z))}{\pi^*(z, \bar{x})}$ is Lipschitz continuous in $x \in X$ with constant $L_H$ for all $(\bar{x}, z) \in X \times \Xi$. By an application of Theorem 3.5 in [33], these conditions are enough to ensure that (4.26) holds with
$$C_\alpha \;\triangleq\; \widetilde{L}\,\sqrt{n}\left( L_H D + \frac{M_H}{\sqrt{(1 - 2\alpha)\,e}} \right),$$
where $D$ is the diameter of the compact set $X$.

4.6.2 On an alternative AIS scheme with better sample complexity

As discussed in Section 4.4.5, we can improve $N_t$ if we make a slight adjustment to our AIS-based surrogation method and require $\rho > 0$ to be small enough.
More specifically, if $\|x - x^t\|_2$ is used for the regularizer instead of $\|x - x^t\|_2^2$ as in the original algorithm, then we can achieve the following result, which is similar to Theorem 67 but only requires $N_t$ to increase linearly in $t$.

Proposition 69. Let the following be given:
• the regularization term in the AIS-based surrogation method is changed to $\frac{1}{2\rho}\|x - x^t\|_2$;
• a sequence of positive integers $\{N_t\}_{t \ge 0}$ satisfying, for some $\kappa > 0$ and integer $T_\kappa$, the condition that $N_t \ge \kappa t$ for all $t \ge T_\kappa$.
Then there exists $\bar{\rho} > 0$ such that, if we fix $\rho < \bar{\rho}$, then with probability one the sequence $\{x^t\}_{t \ge 0}$ produced by the alternative algorithm attains an accumulation point $x^\infty \in X$ that satisfies
$$x^\infty \in \operatorname*{argmin}_{x \in X}\; \widehat{c}(x; x^\infty) + \log \widehat{Z}(x; x^\infty) + \frac{1}{2\rho}\|x - x^\infty\|_2.$$

Proof. First note that if $\|x^{t+1} - x^t\|_2 \to 0$ (almost surely) is given, then the analysis of asymptotic fixed-point stationarity, i.e., Section 4.4.4 with $\|\bullet\|_2^2$ replaced by $\|\bullet\|_2$, only requires $N_t$ to grow linearly in $t$. Thus it suffices to show that $\|x^{t+1} - x^t\|_2 \to 0$ almost surely if we additionally restrict $\rho > 0$ to be small enough. Indeed, after some simple rearrangements, (4.24) gives:
$$\begin{aligned}
\frac{1}{2\rho}\|x^{t+1} - x^t\|_2 \;&\le\; |c(x^t) - c(x^{t+1})| + 2\left| \log Z(x^t) - \log \bar{Z}^{N_{t-1}}_{\pi^*(x^{t-1})}(x^t) \right| + |\log Z(x^t) - \log Z(x^{t+1})| + \left| \log Z(x^{t+1}) - \log \bar{Z}^{N_t}_{\pi^*(x^t)}(x^{t+1}) \right| \\
&\le\; |c(x^t) - c(x^{t+1})| + \widetilde{L}\left( |Z(x^{t+1}) - Z(x^t)| + 2\left| Z(x^t) - \bar{Z}^{N_{t-1}}_{\pi^*(x^{t-1})}(x^t) \right| + \left| Z(x^{t+1}) - \bar{Z}^{N_t}_{\pi^*(x^t)}(x^{t+1}) \right| \right),
\end{aligned} \tag{4.38}$$
where the second inequality follows from our assumptions, under which the logarithm can be treated as Lipschitz continuous with constant $\widetilde{L}$. Furthermore, by the identity
$$Z(x^t) = \frac{1}{N_{t-1}}\sum_{s=1}^{N_{t-1}} \frac{\exp H(x^t, \zeta^{s(t-1)})}{\pi^{H(x^t, \bullet)}_{\mathrm{IS}}(\zeta^{s(t-1)})},$$
there exists a constant $L_H > 0$ such that
$$\left| Z(x^t) - \bar{Z}^{N_{t-1}}_{\pi^*(x^{t-1})}(x^t) \right| \;\le\; \frac{1}{N_{t-1}}\sum_{s=1}^{N_{t-1}} \exp H(x^t, \zeta^{s(t-1)})\, \left| \exp\big(-H(x^t, \zeta^{s(t-1)})\big) Z(x^t) - \exp\big(-H(x^{t-1}, \zeta^{s(t-1)})\big) Z(x^{t-1}) \right| \;\le\; L_H \|x^t - x^{t-1}\|_2,$$
where the second inequality holds because $H$ and $Z$ are continuous on a compact set, hence there exists $M_H < \infty$ such that $\exp(H(x,z)) < M_H$ for all $(x,z) \in X \times \Xi$, and $Z(x)\exp(-H(x,z))$ can be treated as jointly Lipschitz continuous in $x$ and $z$. For the same reason,
$$\left| Z(x^{t+1}) - \bar{Z}^{N_t}_{\pi^*(x^t)}(x^{t+1}) \right| \;\le\; L_H \|x^{t+1} - x^t\|_2.$$
Furthermore, since $c$ and $Z$ are continuous on the compact set $X$ and hence Lipschitz continuous with constants $L_c$ and $L_Z$ respectively, we can eventually derive from (4.38) that
$$\frac{1}{2\rho}\|x^{t+1} - x^t\|_2 \;\le\; \big(L_c + \widetilde{L} L_Z + \widetilde{L} L_H\big)\, \|x^{t+1} - x^t\|_2 + 2\widetilde{L} L_H\, \|x^t - x^{t-1}\|_2.$$
If we take $\rho$ small enough so that
$$\bar{C} \;\triangleq\; \frac{2\widetilde{L} L_H}{\dfrac{1}{2\rho} - \big(L_c + \widetilde{L} L_Z + \widetilde{L} L_H\big)} \;\in\; (0, 1),$$
then $\|x^{t+1} - x^t\|_2 \le \bar{C}\, \|x^t - x^{t-1}\|_2$ for any $t \ge 1$, which implies $\|x^{t+1} - x^t\|_2 \to 0$ since $X$ is compact.

Remark 70. Intuitively, restricting $\rho > 0$ to be small enough in exchange for a better complexity in $N_t$ is necessary if we interpret $\rho$ as the step size of our algorithm. By Lemma 57, AIS-based surrogation can only accurately approximate $Z(x)$ in (4.12) with fewer samples when we are locally around $x^t$; hence we cannot afford a step size $\rho$ that is too large if we intend to control $N_t$.

4.6.3 Some details for the numerical experiments

Below we provide the formulations of the subproblems in Section 4.5.1 and the piecewise affine approximation of the indicator function employed in Section 4.5.2.
4.6.3.1 Subproblems for the experiments in Section 4.5.1

Given $\theta^t, \beta^t$ from the most recent iteration, we construct the following subproblem:
$$\begin{array}{ll}
\operatorname*{minimize}_{\theta, \beta} & \displaystyle\sum_{i=1}^{M} \frac{1}{2\sigma_i^2}(y_i - \theta_i)^2 + \frac{1}{2}\sum_{i<j}\big(\beta_{ij} + |\theta_i - \theta_j|\big)^2 + \log\big(Z^t(\beta)\big) + \frac{1}{2\rho}\Big( \|\beta - \beta^t\|_2^2 + \|\theta - \theta^t\|_2^2 \Big) \\[2ex]
& \displaystyle\; - \sum_{i<j}\left[ \frac{1}{2}\big(\beta^t_{ij}\big)^2 + \beta^t_{ij}\big(\beta_{ij} - \beta^t_{ij}\big) + \frac{1}{2}\big(\theta^t_i - \theta^t_j\big)^2 + \big(\theta^t_i - \theta^t_j\big)\Big( (\theta_i - \theta^t_i) - (\theta_j - \theta^t_j) \Big) \right] \\[2ex]
\text{subject to} & \theta \in \Theta, \quad \underline{\beta}_{ij} \le \beta_{ij} \le \bar{\beta}_{ij}, \ \forall\, i < j,
\end{array} \tag{4.39}$$
where $Z^t(\beta)$ is specified as follows for AIS-based surrogation: with iid $\{z^{st}\}_{s=1}^{N_t}$ drawn from $\dfrac{1}{Z(\beta^t)}\exp\Big\{ -\displaystyle\sum_{i<j} \beta^t_{ij}|z_i - z_j| \Big\}$,
$$Z^t(\beta) = \sum_{s=1}^{N_t} \exp\Big\{ -\sum_{i<j} \big(\beta_{ij} - \beta^t_{ij}\big)\big| z^{st}_i - z^{st}_j \big| \Big\}.$$
For the SMM methods, $Z^t(\beta)$ is defined as follows. Non-incremental SMM: with iid $\{z^{st}\}_{s=1}^{N_t}$ drawn from the uniform distribution over $\Theta$,
$$Z^t(\beta) = \sum_{s=1}^{N_t} \exp\Big\{ -\sum_{i<j} \beta_{ij}\big| z^{st}_i - z^{st}_j \big| \Big\}.$$
Incremental SMM: with iid $\{z^{st}\}_{s=1}^{N_t}$ drawn from the uniform distribution over $\Theta$,
$$Z^t(\beta) = \sum_{\tau=1}^{t}\sum_{s=1}^{N_\tau} \exp\Big\{ -\sum_{i<j} \beta_{ij}\big| z^{s\tau}_i - z^{s\tau}_j \big| \Big\}.$$
Note that, by Lemma 57, problem (4.39) is convex and is handled by MOSEK [5] in our experiments.

4.6.3.2 Piecewise affine approximation of the indicator function

Suppose $\widetilde{\beta}_{ij}$ is Bernoulli with parameter $p_{ij} > 0$. Then the generalized inverse of its cdf is $F^{-1}_{\widetilde{\beta}_{ij}}(s) = \mathbb{1}_{(0,\infty)}(s - p_{ij})$ for $s \in [0,1]$. We can apply the $\widehat{\phi}^{\mathrm{ub}}$ treatment in (4.10) with $\widetilde{\phi}^{\mathrm{cvx}}(s) = s$, which results in the following nonconvex piecewise affine approximation of $F^{-1}_{\widetilde{\beta}_{ij}}$:
$$\phi_{ij}(s) \;\triangleq\; \min\left\{ \max\left\{ 1 + \frac{s - p_{ij}}{\delta_{ij}},\, 0 \right\},\, 1 \right\}, \tag{4.40}$$
where $\delta_{ij} > 0$ is a fixed hyperparameter controlling the approximation; as $\delta_{ij} \downarrow 0$ we have $\phi_{ij} \to F^{-1}_{\widetilde{\beta}_{ij}}$ pointwise. A visualization of this approximation can be found in Figure 4.7.

Figure 4.7: Piecewise affine $\phi_{ij}$ approximation of the indicator $F^{-1}_{\widetilde{\beta}_{ij}}$; panels (a) indicator function $F^{-1}_{\widetilde{\beta}_{ij}}$, (b) piecewise affine $\phi_{ij}$.

Note that $\phi_{ij}$ in (4.40) can be reformulated as the following difference-of-convex (DC) function:
$$\phi_{ij}(s) \;=\; \underbrace{\max(a_{ij} s + b_{ij},\, 0)}_{\triangleq\, \phi^+_{ij}(s)} \;-\; \underbrace{\max(a_{ij} s + b_{ij} - 1,\, 0)}_{\triangleq\, \phi^-_{ij}(s)}, \tag{4.41}$$
where $a_{ij} \triangleq \dfrac{1}{\delta_{ij}}$ and $b_{ij} \triangleq 1 - \dfrac{p_{ij}}{\delta_{ij}}$.

The MAP problem associated with model (4.34) in Section 4.5.2 is
$$\begin{array}{ll}
\operatorname*{minimize}_{\theta, u} & \displaystyle\sum_{i=1}^{M} \frac{1}{2\sigma_i^2}(y_i - \theta_i)^2 + \sum_{i<j} \phi_{ij}(u_{ij})\, A_{ij}(\theta_i - \theta_j)^2 + \log\big(Z_{\mathrm{edge}}(u)\big) \\[2ex]
\text{subject to} & \theta \in \Theta, \quad u_{ij} \in [0,1] \text{ for } i < j,
\end{array} \tag{4.42}$$
where $h_{ij}(\theta_i, \theta_j)$ is substituted by $A_{ij}(\theta_i - \theta_j)^2$ with fixed $A_{ij} > 0$. In the context of the AIS method, given $u^t$ and $\theta^t$ from the previous step, we draw conditionally independent samples $\{z^{st}\}_{s=1}^{N_t}$ from the density
$$\frac{1}{Z_{\mathrm{edge}}(u^t)}\exp\Big\{ -\sum_{i<j} \phi_{ij}(u^t_{ij})\, A_{ij}(z_i - z_j)^2 \Big\}$$
and construct the following formulation of the AIS subproblem (4.19):
$$\begin{array}{ll}
\operatorname*{minimize}_{\theta, u} & \displaystyle\sum_{i=1}^{M} \frac{1}{2\sigma_i^2}(y_i - \theta_i)^2 + \sum_{i<j} \widehat{G}^t_{ij}(u, \theta) + \frac{1}{2\rho}\Big( \|\theta - \theta^t\|_2^2 + \|u - u^t\|_2^2 \Big) \\[2ex]
& \displaystyle\; + \log\left( \frac{1}{N_t}\sum_{s=1}^{N_t} \exp\Big\{ -\sum_{i<j} A_{ij}\Big( \widehat{\phi}^t_{ij}(u_{ij}) - \phi_{ij}(u^t_{ij}) \Big) \big( z^{st}_i - z^{st}_j \big)^2 \Big\} \right) \\[2ex]
\text{subject to} & \theta \in \Theta, \quad u_{ij} \in [0,1] \text{ for } i < j,
\end{array} \tag{4.43}$$
where
$$\widehat{G}^t_{ij}(u, \theta) \;\triangleq\; \begin{cases}
A_{ij}(\theta_i - \theta_j)^2 & \text{if } a_{ij} u^t_{ij} + b_{ij} \ge 1, \\[1.5ex]
\begin{aligned}
&\tfrac{1}{2}\Big( \phi^+_{ij}(u_{ij}) + A_{ij}(\theta_i - \theta_j)^2 \Big)^2 - \xi^t_{ij}\big(u_{ij} - u^t_{ij}\big) \\
&\quad - 2 A_{ij}^2 \big(\theta^t_i - \theta^t_j\big)^3 \Big( (\theta_i - \theta^t_i) - (\theta_j - \theta^t_j) \Big) - \tfrac{1}{2} A_{ij}^2 \big(\theta^t_i - \theta^t_j\big)^4 - \tfrac{1}{2}\Big( \phi^+_{ij}(u^t_{ij}) \Big)^2
\end{aligned} & \text{otherwise,}
\end{cases}$$
$$\xi^t_{ij} \;\triangleq\; \max\Big( a_{ij}\big( a_{ij} u^t_{ij} + b_{ij} \big),\, 0 \Big), \qquad \widehat{\phi}^t_{ij}(u_{ij}) \;\triangleq\; \begin{cases} \min\{ a_{ij} u_{ij} + b_{ij},\, 1 \} & \text{if } a_{ij} u^t_{ij} + b_{ij} \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$

Chapter 5: Conclusion and future work

In this dissertation, we conduct a comprehensive investigation into various aspects of nonconvex optimization, which is increasingly pervasive yet remains challenging in many modern applications.
Our research focuses on several key topics, unified by the inference of Markov random fields as a motivating example in Chapter 1, where we identify several under-resolved issues that serve as the inspiration for our studies.

• In Chapter 2, we present a class of sparsity learning problems with a loss function possessing the Z-property and a folded concave regularizer, guaranteeing linear-step solvability and achieving strongly polynomial complexity when specialized to quadratic losses and piecewise quadratic regularizers. Through the design of the GHP algorithm, tailored to the special structure of these problems, we achieve the desired complexity results, as supported by rigorous analysis and numerical experiments.

• Our research in Chapter 3 formally validates the effectiveness of the parametric capped $\ell_1$-approach in bridging the gap between the ideal but computationally expensive $\ell_0$-path and the fast but less desirable $\ell_1$-path. By comparing the $\ell_0$-, $\ell_1$- and capped $\ell_1$-paths, we conduct analytical studies highlighting the complexity with respect to the number of pieces in the solution paths, among other properties of interest. To facilitate the computation of a capped $\ell_1$ d-stationary path, we propose a pivoting method equipped with the modified GHP algorithms to efficiently handle path discontinuities by leveraging the Z-property. These investigations, enhanced by our numerical studies, confirm that capped $\ell_1$-paths offer nearly identical empirical performance to $\ell_0$-paths while maintaining an acceptable computational workload, making them a powerful tool for practical hyperparameter selection.

• In Chapter 4, we introduce and study the AIS-based surrogation method for a general class of logarithmic integral (stochastic) optimization problems, which encompass the challenging features of nonconvexity, nonsmoothness and intractable integrals that are central to the inference of a Bayesian hierarchical model. Our analysis establishes consistency results for the algorithm, ensuring a computable candidate for a local minimizer in its subsequential limit. Furthermore, the practical efficiency brought about by combining the surrogation technique with the AIS variance reduction scheme is examined and confirmed through extensive numerical experiments.

Throughout this dissertation, our research has highlighted the advantage of embracing faithful nonconvex modeling in practical applications, provided that efficient computational methods can be carefully designed to leverage the problem structures. This should not be considered the endpoint, but rather the beginning of further research in nonconvex optimization that follows this theme. In the subsequent discussion, we conclude this dissertation by outlining potential directions that extend our work on nonconvex stochastic optimization presented in Chapter 4.

• In Chapter 4, we took the MAP problem with intractable normalizers as the motivation for our AIS-based surrogation method. However, in the BHM context, intractable normalizers are more prevalent, presenting themselves in other situations such as maximizing the marginal likelihood for hyperparameter optimization. Specifically, suppose we have a simple BHM with likelihood $\widetilde{y} \mid \widetilde{\theta} \sim p(y \mid \widetilde{\theta})$, which is the conditional density of the data $\widetilde{y}$.
Additionally, the (random) parameter $\widetilde{\theta}$ has a prior density $q(\theta; \beta)$ with (deterministic) hyperparameter $\beta \in B$, which can be selected by maximizing the following (logarithmic) marginal likelihood function:
$$\operatorname*{maximize}_{\beta \in B}\; \log \int p(y \mid z)\, q(z \mid \beta)\, dz. \tag{5.1}$$
See Chapter 5 in [73] as a reference. An application of our method to this problem, when the integral in (5.1) is intractable, could be an interesting extension.

• In addition to the surrogation method explored in Chapter 4, the AIS variance reduction technique has mainly been investigated in conjunction with stochastic gradient descent for empirical risk minimization and with coordinate descent (ascent) methods. This raises the question of whether AIS can be applied to other stochastic optimization algorithms, such as the stochastic proximal point method [6] and the stochastic decomposition method [51]. Furthermore, the need to reduce variance in uncertainty estimation extends beyond the realm of stochastic optimization. It would be intriguing to explore the potential application of the AIS technique to other mathematical programming problems, including stochastic (non)linear complementarity problems [23] and stochastic variational inequalities [34]. This is particularly relevant for problems involving the estimation of rare event probabilities.

• In Chapter 4, we focus on a problem involving logarithms, which naturally arises from MAP inference. This allows us to employ the exact minimal-variance density within our AIS-based surrogation scheme. However, when combining AIS with other stochastic optimization algorithms for more general stochastic programs, the exact minimal-variance density is generally not applicable. Consequently, it is crucial to develop approaches to overcome this limitation. Here, we propose several possible directions to address this challenge:
– Instead of a fully adaptive scheme, we can utilize a fixed but effective importance sampling density, which can be optimized based on the iteration complexity of the nominal algorithm; see for instance [89]. This provides a more meaningful benchmark than simple prescriptions like the uniform distribution.
– We can obtain an estimate of the minimal-variance density by solving an approximating variance minimization problem, where the objective function is approximated using appropriate methods; see [80] for a reference.
– Another approach is to approximate the minimal-variance density by constraining the search within a specific family of distributions, which can adapt to the current progress of the algorithm. Previous research along this line can be found in [17] and [71].
– In the case of discrete uncertainty, we can view the derivation of the minimal-variance density as a multi-arm bandit problem (see [16] and [77]), which allows us to leverage advances in online optimization to design and analyze an approximation scheme.
– It is worth noting that the inapplicability of the minimal-variance density is partly due to the use of the empirical average as our estimator. To mitigate this issue, alternative estimators such as the self-normalized estimator [19] can be adopted. However, further analytical verification of the algorithm under this estimator is needed.

• Finally, it is worth noting that, while Chapter 4 demonstrates the empirical superiority of the AIS-based surrogation method over a non-adaptive approach, a comprehensive theoretical explanation of this phenomenon is still lacking.
Here, we propose several potential avenues for further investigation:
– Proving the improvement in asymptotic variance achieved by employing AIS; see [49] for an example.
– Examining whether the sample complexity or step size of a nominal algorithm can be improved through the adoption of AIS.
– Exploring possibilities for improving non-asymptotic results, such as the convergence rate of the nominal algorithm when combined with AIS.

References

1. Adler, I., Cottle, R. W. & Pang, J. S. Some LCPs Solvable in Strongly Polynomial Time with Lemke's Algorithm. Mathematical Programming 160, 477–493 (2016).
2. Ahn, M., Pang, J. S. & Xin, J. Difference-of-convex Learning: Directional Stationarity, Optimality, and Sparsity. SIAM Journal on Optimization 27, 1637–1665 (2017).
3. Alain, G., Lamb, A., Sankar, C., Courville, A. C. & Bengio, Y. Variance Reduction in SGD by Distributed Importance Sampling. ArXiv abs/1511.06481 (2015).
4. Aneja, Y. P. & Nair, K. Bicriteria Transportation Problem. Management Science 25, 73–78 (1979).
5. ApS, M. The MOSEK Optimization Toolbox for MATLAB Manual. Version 9.3.21 (2022).
6. Asi, H. & Duchi, J. C. Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity. SIAM Journal on Optimization 29, 2257–2290 (2018).
7. Atamtürk, A. & Gómez, A. Rank-One Convexification for Sparse Regression. ArXiv abs/1901.10334 (2019).
8. Atamtürk, A., Gómez, A. & Han, S. Sparse and Smooth Signal Estimation: Convexification of ℓ0-Formulations. Journal of Machine Learning Research 22, 52:1–52:43 (2018).
9. Bertsimas, D., Pauphilet, J. & Parys, B. P. G. V. Sparse Regression: Scalable Algorithms and Empirical Performance. Statistical Science (2019).
10. Besag, J. Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society: Series B (Methodological) 36, 192–225 (1974).
11. Besag, J. & Kooperberg, C. L. On Conditional and Intrinsic Autoregressions. Biometrika 82, 733–746 (1995).
12. Besag, J., York, J. & Mollié, A. Bayesian Image Restoration, with Two Applications in Spatial Statistics. Annals of the Institute of Statistical Mathematics 43, 1–20 (1991).
13. Best, M. J. An Algorithm for the Solution of the Parametric Quadratic Programming Problem. Physica-Verlag HD, 57–76 (1996).
14. Boland, N. L., Charkhgard, H. & Savelsbergh, M. W. P. A Criterion Space Search Algorithm for Biobjective Integer Programming: The Balanced Box Method. INFORMS Journal on Computing 27, 735–754 (2015).
15. Boland, N. L., Charkhgard, H. & Savelsbergh, M. W. P. A Criterion Space Search Algorithm for Biobjective Mixed Integer Programming: The Triangle Splitting Method. INFORMS Journal on Computing 27, 597–618 (2015).
16. Borsos, Z., Krause, A. & Levy, K. Y. Online Variance Reduction for Stochastic Optimization in Annual Conference on Computational Learning Theory (2018).
17. Bouchard, G., Trouillon, T., Perez, J. & Gaidon, A. Accelerating Stochastic Gradient Descent via Online Learning to Sample. ArXiv abs/1506.09016 (2015).
18. Brunk, H. D., Barlow, R. E., Bartholomew, D. J. & Bremner, J. M. Statistical Inference under Order Restrictions: the Theory and Application of Isotonic Regression. International Statistical Review 41, 395 (1973).
19. Bugallo, M. F., Elvira, V., Martino, L., Luengo, D., Míguez, J. & Djurić, P. M. Adaptive Importance Sampling: The Past, the Present, and the Future. IEEE Signal Processing Magazine 34, 60–79 (2017).
20. Chen, T.-W., Wardill, T. J., Sun, Y., Pulver, S. R., Renninger, S.
L., Baohan, A., et al. Ultra-Sensitive Fluorescent Proteins for Imaging Neuronal Activity. Nature 499, 295–300 (2013).
21. Chen, X., Ge, D., Wang, Z. & Ye, Y. Complexity of Unconstrained L2−Lp Minimization. Mathematical Programming 143, 371–383 (2011).
22. Chen, Y., Ge, D., Wang, M., Wang, Z., Ye, Y. & Yin, H. Strong NP-Hardness for Sparse Optimization with Concave Penalty Functions in International Conference on Machine Learning (2015).
23. Cottle, R. W., Pang, J.-S. & Stone, R. E. The Linear Complementarity Problem (Society for Industrial and Applied Mathematics, 2021).
24. Cozad, A., Sahinidis, N. V. & Miller, D. C. Learning surrogate models for simulation-based optimization. AIChE Journal 60, 2211–2227 (2014).
25. Csiba, D., Qu, Z. & Richtárik, P. Stochastic dual coordinate ascent with adaptive probabilities in International Conference on Machine Learning (2015).
26. Cui, Y., Chang, T.-H., Hong, M. & Pang, J. S. A Study of Piecewise Linear-Quadratic Programs. Journal of Optimization Theory and Applications 186, 523–553 (2017).
27. Cui, Y., Liu, J. & Pang, J. S. Nonconvex and Nonsmooth Approaches for Affine Chance-Constrained Stochastic Programs. Set-Valued and Variational Analysis 30, 1149–1211 (2022).
28. Cui, Y., Liu, J. & Pang, J. S. The Minimization of Piecewise Functions: Pseudo Stationarity. ArXiv abs/2305.14798 (2023).
29. Cui, Y. & Pang, J.-S. Modern Nonconvex Nondifferentiable Optimization (Society for Industrial and Applied Mathematics, 2021).
30. Dantzig, G. B. & Infanger, G. Large-Scale Stochastic Linear Programs: Importance Sampling and Benders Decomposition (1991).
31. Durrett, R. Probability: Theory and Examples (Cambridge University Press, 2019).
32. Efron, B., Hastie, T. J., Johnstone, I. M. & Tibshirani, R. Least Angle Regression. Annals of Statistics 32, 407–499 (2004).
33. Ermoliev, Y. M. & Norkin, V. I. Sample Average Approximation Method for Compound Stochastic Optimization Problems. SIAM Journal on Optimization 23, 2231–2263 (2013).
34. Facchinei, F. & Pang, J.-S. Finite-Dimensional Variational Inequalities and Complementarity Problems (Springer New York, 2003).
35. Fan, J. & Li, R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association 96, 1348–1360 (2001).
36. Fattahi, S. & Gómez, A. Scalable Inference of Sparsely-Changing Markov Random Fields with Strong Statistical Guarantees. ArXiv abs/2102.03585 (2021).
37. Folland, G. B. Real Analysis: Modern Techniques and Their Applications (John Wiley & Sons, 1999).
38. Gelfand, A. E. & Carlin, B. P. Maximum-likelihood Estimation for Constrained- or Missing-data Models. Canadian Journal of Statistics 21, 303–311 (1993).
39. Gelman, A. & Meng, X.-L. Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling. Statistical Science 13, 163–185 (1998).
40. Geman, S. & Graffigne, C. Markov Random Field Image Models and Their Applications to Computer Vision. Proceedings of the International Congress of Mathematicians 1 (1986).
41. Geyer, C. J. On the Convergence of Monte Carlo Maximum Likelihood Calculations. Journal of the Royal Statistical Society, Series B (Methodological) 56, 261–274 (1994).
42. Geyer, C. J. & Thompson, E. A. Constrained Monte Carlo Maximum Likelihood for Dependent Data. Journal of the Royal Statistical Society, Series B (Methodological) 54, 657–683 (1992).
43. Gómez, A., He, Z. & Pang, J. S.
Linear-step solvability of some folded concave and singly-parametric sparse optimization problems. Mathematical Programming 198, 1339–1380 (2022).
44. Goodreau, S. M., Kitts, J. A. & Morris, M. M. Birds of a Feather, or Friend of a Friend? Using Exponential Random Graph Models to Investigate Adolescent Social Networks. Demography 46, 103–125 (2009).
45. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual (2023).
46. Hallac, D., Park, Y., Boyd, S. P. & Leskovec, J. Network Inference via the Time-Varying Graphical Lasso in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017).
47. Hastie, T. J., Tibshirani, R. & Wainwright, M. J. Statistical Learning with Sparsity: the Lasso and Generalizations (2015).
48. Hazimeh, H. & Mazumder, R. Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms. Operations Research 68, 1517–1537 (2018).
49. He, S., Jiang, G., Lam, H. & Fu, M. C. Adaptive Importance Sampling for Efficient Stochastic Root Finding and Quantile Estimation. Operations Research (2021).
50. He, Z., Han, S., Gómez, A., Cui, Y. & Pang, J. S. Comparing solution paths of sparse quadratic minimization with a Stieltjes matrix. Mathematical Programming, 1–50 (2023).
51. Higle, J. L. & Sen, S. Stochastic Decomposition: An Algorithm for Two-Stage Linear Programs with Recourse. Mathematics of Operations Research 16, 650–669 (1991).
52. Hochbaum, D. S. & Lu, C. A Faster Algorithm Solving a Generalization of Isotonic Median Regression and a Class of Fused Lasso Problems. SIAM Journal on Optimization 27, 2563–2596 (2017).
53. Ibáñez, M. V. & Simó, A. Parameter Estimation in Markov Random Field Image Modeling with Imperfect Observations. A Comparative Study. Pattern Recognition Letters 24, 2377–2389 (2003).
54. Infanger, G. Monte Carlo (Importance) Sampling within a Benders Decomposition Algorithm for Stochastic Linear Programs. Annals of Operations Research 39, 69–95 (1991).
55. Jewell, S. W. & Witten, D. Exact Spike Train Inference via ℓ0 Optimization. The Annals of Applied Statistics 12, 2457–2482 (2017).
56. Johnson, T. B. & Guestrin, C. Training Deep Models Faster with Robust, Approximate Importance Sampling in Neural Information Processing Systems (2018).
57. Lawson, A. B. Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology (2008).
58. Lee, W., Yu, H. & Yang, H. Reparameterization Gradient for Non-differentiable Models. ArXiv abs/1806.00176 (2018).
59. Liu, J., Cui, Y. & Pang, J. S. Solving Nonsmooth and Nonconvex Compound Stochastic Programs with Applications to Risk Measure Minimization. Mathematics of Operations Research (2020).
60. MacNab, Y. C. Bayesian Disease Mapping: Past, Present, and Future. Spatial Statistics 50, 100593 (2022).
61. Mairal, J. & Yu, B. Complexity Analysis of the Lasso Regularization Path. ArXiv abs/1205.0079 (2012).
62. Mangasarian, O. L. Linear Complementarity Problems Solvable By a Single Linear Program. Mathematical Programming 10, 263–270 (1976).
63. Møller, J. M., Pettitt, A. N., Reeves, R. & Berthelsen, K. K. An Efficient Markov Chain Monte Carlo Method for Distributions with Intractable Normalising Constants. Biometrika 93, 451–458 (2006).
64. Moré, J. J. & Rheinboldt, W. C. On P- and S-functions and Related Classes of n-Dimensional Nonlinear Mappings. Linear Algebra and its Applications 6, 45–68 (1973).
65. Murray, I., Ghahramani, Z. & MacKay, D. J. C.
MCMC for Doubly-Intractable Distributions in Conference on Uncertainty in Artificial Intelligence (2006).
66. Cappé, O., Guillin, A., Marin, J.-M. & Robert, C. P. Population Monte Carlo. Journal of Computational and Graphical Statistics 13, 907–929 (2004).
67. Oh, M.-S. & Berger, J. O. Adaptive Importance Sampling in Monte Carlo Integration. Journal of Statistical Computation and Simulation 41, 143–168 (1992).
68. Pang, J. S. On a Class of Least-element Complementarity Problems. Mathematical Programming 16, 111–126 (1979).
69. Pang, J. S. & Chandrasekaran, R. Linear Complementarity Problems Solvable by a Polynomially Bounded Pivoting Algorithm. Mathematical Programming Essays in Honor of George B. Dantzig Part II, 13–27 (1985).
70. Pang, J. S., Razaviyayn, M. & Alvarado, A. Computing B-Stationary Points of Nonsmooth DC Programs. Mathematics of Operations Research 42, 95–118 (2015).
71. Parpas, P., Ustun, B., Webster, M. D. & Tran, Q. K. Importance Sampling in Stochastic Programming: A Markov Chain Monte Carlo Approach. INFORMS Journal on Computing 27, 358–377 (2015).
72. Pritchard, J. K., Seielstad, M., Pérez-Lezaun, A. & Feldman, M. W. Population Growth of Human Y Chromosomes: a Study of Y Chromosome Microsatellites. Molecular Biology and Evolution 16, 1791–1798 (1999).
73. Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, 2006).
74. Robbins, H. & Siegmund, D. A Convergence Theorem for Non-negative Almost Supermartingales and Some Applications. Optimizing Methods in Statistics, 233–257 (1971).
75. Robert, C. P. & Casella, G. Monte Carlo Statistical Methods (Springer Texts in Statistics) (2005).
76. Salakhutdinov, R. & Larochelle, H. Efficient Learning of Deep Boltzmann Machines in International Conference on Artificial Intelligence and Statistics (2010).
77. Salehi, F., Celis, L. E. & Thiran, P. Stochastic Optimization with Bandit Sampling. ArXiv abs/1708.02544 (2017).
78. Shapiro, A., Dentcheva, D. & Ruszczynski, A. Lectures on Stochastic Programming: Modeling and Theory (MOS-SIAM Series on Optimization, 2009).
79. Soussen, C., Idier, J., Duan, J. & Brie, D. Homotopy Based Algorithms for ℓ0-Regularized Least-Squares. IEEE Transactions on Signal Processing 63, 3301–3316 (2014).
80. Stich, S. U., Raj, A. & Jaggi, M. Safe Adaptive Importance Sampling. ArXiv abs/1711.02637 (2017).
81. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58, 267–288 (1996).
82. Tibshirani, R. J. The Lasso Problem and Uniqueness. Electronic Journal of Statistics 7, 1456–1490 (2012).
83. Tibshirani, R. J., Höfling, H. & Tibshirani, R. Nearly-Isotonic Regression. Technometrics 53, 54–61 (2011).
84. Vogelstein, J. T., Packer, A., Machado, T. A., Sippy, T., Babadi, B., Yuste, R., et al. Fast Nonnegative Deconvolution for Spike Train Inference from Population Calcium Imaging. Journal of Neurophysiology 104, 3691–3704 (2009).
85. Xie, W. & Deng, X. Scalable Algorithms for the Sparse Ridge Regression. SIAM Journal on Optimization 30, 3359–3386 (2018).
86. Younes, L. Estimation and Annealing for Gibbsian Fields. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 24, 269–294 (1988).
87. Yukawa, M. & Amari, S.-I. ℓp-Regularized Least Squares (0
Abstract
Nonconvex optimization plays a crucial role in many modern applications in operations research and data science, allowing for more faithful modeling of the complicated real-world environment. In this dissertation, we address several under-resolved topics in nonconvex optimization that commonly arise in various domains, driving the need for advancements in both theoretical understanding and computational tools.
In Chapter 2, we provide an affirmative answer to a question that has been overlooked in the existing literature: can a certain class of sparsity learning problems be solved with linear-step complexity or even in strongly polynomial time? We identify such a class of problems, characterized by a loss function with the Z-property and a folded concave regularizer. To tackle the challenges associated with the inherent nonconvexity and nonsmoothness exhibited by this class, we propose the GHP algorithm, which effectively exploits the special structure of the problem and achieves the desired complexity results through rigorous analytical investigations. The effectiveness and efficiency of the algorithm are further demonstrated through numerical experiments.
Our study in sparsity learning extends to the parametric setting, which aims at conducting hyperparameter selection via computing the solution paths as functions of the sparsity inducing parameter. As discussed in Chapter 3, through a comprehensive analysis and computational investigations, we substantiate the intuition that a nonconvex parametric approach, specifically computing the capped L1 d-stationary paths, offers a resolution to the tradeoff between computational efficiency and practical performance, as exhibited by the dilemma in choosing between the L0- and L1-regularizer. In particular, we examine and enhance the known properties of the optimal solution paths computed from the L0- and L1-regularizers, supplementing them with new analytical results for the previously unexplored capped L1-paths. The computation of a capped L1 d-stationary path is enabled by our proposal of the pivoting scheme accompanied by the modified GHP algorithms. The primary challenge in this computation lies in dealing with potential discontinuities in the path. The modified GHP algorithms, leveraging the Z-property, play a crucial role in rapidly restoring the path when such discontinuities occur. This aspect, which has not been extensively addressed in the classical parametric programming literature, is one of the contributions of our approach.
One of the applications of our studies in Chapter 2 and 3 is the inference of a Markov random field with a known network topology. However, a practical issue arises when the network topology is unknown and cannot be reliably estimated in advance. This situation prompts us to consider a broader extension of our studies, encompassing the challenges encountered in Bayesian hierarchical models (BHMs) with intractable normalizers. In Chapter 4, we address these concerns by focusing on a class of logarithmic integral (stochastic) optimization problems, where nonconvexity and nonsmoothness are coupled within an intractable integral. Through the proposal of the adaptive importance sampling-based surrogation method, we provide an efficient tool to simultaneously handle nonconvexity and nonsmoothness, while also improving the sampling approximation of the intractable integral via variance reduction. Through rigorous analysis, we guarantee the performance of this algorithm, showcasing an almost sure subsequential convergence to a surrogation stationary point, a necessary candidate for a local minimizer. Furthermore, extensive numerical experiments confirm the effectiveness of the algorithm, demonstrating its efficiency and stability in facilitating the application of advanced BHMs, where intractable normalizers often arise as a result of enhanced modeling capability.