LEARNING AND CONTROL FOR WIRELESS NETWORKS VIA GRAPH SIGNAL PROCESSING

by

Libin Liu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2020

Copyright 2020 Libin Liu

Dedication

To My Family: Past, present and future...

Acknowledgements

Obtaining a PhD degree is a fantastic and challenging adventure, one that is impossible to undertake alone. Over the past five years, I have received support, feedback and encouragement from many individuals. This dissertation would not have been possible without their help.

First of all, I would like to express my sincere gratitude and appreciation to my PhD advisor, Prof. Urbashi Mitra, for her support, guidance, care and patience. Prof. Mitra offered me many opportunities for learning, independent thinking, attending conferences and internships. It has been a privilege and an honor to work with her.

I would like to thank my qualifying exam and dissertation committee members, Prof. Antonio Ortega, Prof. Sze-chuan Suen, Prof. Ashutosh Nayyar and Prof. Bhaskar Krishnamachari, for their intriguing questions and constructive comments, which helped me complete this thesis. A special thanks to Dr. Arpan Chattopadhyay for helping me develop the network model and guiding me through the proofs of the theorems in Chapter 2.

Life at USC has been amazing. I would like to thank the EEB staff members Diane Demetras, Gerrielyn Ramos, Corine Wong and Susan Wiedem, who were always available to help and answer all my questions. I would also like to thank my wonderful previous and current group members, Dr. Junting Chen, Dr. Marcos Vasconcelos, Dr. Sajjad Beygi, Dr. Amr Elnakeeb, Dhruva Kartik, Tze-Yang Tung, Jianxiu Li, Mustafa Can Gursoy, Joni Shaska and Madhavi Rajiv, for all the interesting discussions, arguments and fun that we had together.
Lastly and most importantly, I would like to express my deepest gratitude to my parents, Yaohua Liu and Baoman Li, for their enduring love, support, understanding and encouragement.

Table of Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
Related Publications

1 Introduction
  1.1 Thesis background
  1.2 Thesis contributions
    1.2.1 Reduced Complexity MDP algorithms
    1.2.2 Reduced Complexity RL algorithms
    1.2.3 Analysis of policy gradient
  1.3 Thesis outline

2 On Solving Large Scale Markov Decision Processes Problems: Exploitation of Spectral Properties and Policy Structures
  2.1 Introduction
  2.2 Preliminaries: MDP and GSP
    2.2.1 Markov Decision Processes and Policy Iteration
    2.2.2 Graph Signal Processing
  2.3 Wireless Model and Policy Structures
  2.4 Reduced Dimension MDP and Subspace Construction
  2.5 General Subspace Construction using GSP methods
    2.5.1 SYM: Natural Symmetrization
    2.5.2 BIB: Bibliometric Symmetrization
    2.5.3 AVF: Approximated Value Function Graph
    2.5.4 Jordan Form
  2.6 Numerical Results
    2.6.1 Zig-zag Policy Update
    2.6.2 Optimal Basis Selection
    2.6.3 Subspace Construction with GSP
    2.6.4 More General Results
  2.7 Conclusions

3 On Sampled Reinforcement Learning in Wireless Networks: Exploitation of Policy Structures
  3.1 Introduction
  3.2 Preliminary on Q-learning
  3.3 Policy Sampling, Interpolation and Refinement
    3.3.1 Policy Sampling
    3.3.2 Policy Interpolation
    3.3.3 Policy Refinement
  3.4 Performance Analysis
    3.4.1 Complexity of Policy Sampling
    3.4.2 Policy Error Analysis
    3.4.3 Analytical Bounds
  3.5 Numerical Results
    3.5.1 Policy Sampling and Interpolation
    3.5.2 Error Bounds
    3.5.3 Policy Refinement
    3.5.4 Extensions
  3.6 Conclusions

4 Neural Policy Gradient in Wireless Networks: Analysis and Improvement
  4.1 Introduction
  4.2 Policy Gradient
  4.3 Case Study and Motivation
  4.4 Algorithm Analysis
    4.4.1 Properties of gradients
    4.4.2 Approximations of the output probability
    4.4.3 Generalization
    4.4.4 Implications for local convergence
  4.5 Improving Policy Gradient
  4.6 Numerical Results
    4.6.1 Point-to-point system
    4.6.2 MIMO system
  4.7 Conclusions

5 Conclusions

A On Solving Large Scale Markov Decision Processes Problems: Exploitation of Spectral Properties and Policy Structures
  A.1 Proof of Theorem 1
    A.1.1 Monotonicity of V*(q, h) in h
    A.1.2 Monotonicity of V*(q, h) in q
  A.2 Proof of Theorem 2
  A.3 Proof of Theorem 3
    A.3.1 Thresholded policy in h
    A.3.2 Thresholded policy in q
  A.4 Proof of Theorem 4
  A.5 Proof of Theorem 5

B On Sampled Reinforcement Learning in Wireless Networks: Exploitation of Policy Structures
  B.1 Proof of Theorem 6
    B.1.1 Upper bound
    B.1.2 Lower bound
  B.2 Proof of Theorem 7
  B.3 Proof of Theorem 8

C Neural Policy Gradient in Wireless Networks: Analysis and Improvement
  C.1 Proof of Theorem 9
  C.2 Proof of Lemma 1
  C.3 Proof of Theorem 10
  C.4 Proof of Theorem 11

References

List of Figures

2.1 Markov chain of the system, where the arrows show all possible transitions.
2.2 Optimal policy of the point-to-point transmission system; network setting: Q = 40, H = 30, β = 10000.
2.3 Block diagram for the modified policy improvement; the actions {0, 1} denote silence and transmit, respectively.
2.4 Modified policy improvement step, where red nodes denote transmit and blue nodes denote silence; the actions {0, 1} denote silence and transmit, respectively.
2.5 Pictorial representation of the value function and optimal policy. For the optimal policy, {0, 1} denote silence and transmission, respectively.
2.6 Poisson arrivals: optimal policy. For the optimal policy, {0, 1} denote silence and transmission, respectively.
2.7 Normalized error versus σ², network size = 1200.
2.8 Runtime comparison between original PI and simplified PI.
2.9 Number of iterations of zig-zag policy update, network size = 2040, versus β (left figure); runtime comparison versus network size (right figure).
2.10 Performance of the optimal basis selection method, network size = 2040, versus β.
2.11 Performance of the subspace approach using GSP methods, with different network sizes, versus β.
2.12 Left figure: runtime of subspace approaches with network size = 35, versus β; right two figures: runtime of subspace approaches (with/without computation overhead for subspace construction), versus network size.
2.13 Performance of subspace approaches using GSP methods, network size = 2040, versus β. In the SNR figure, the discontinuities represent infinite SNR.
2.14 Left figure: policy error for the equipment replacement model, network size = 2000, versus R; right figure: policy error for the random graph model.
3.1 Sampled policy for a subset of states with random input policy (a) and random thresholded input policy (b); sampling budget Ψ = 20%.
3.2 Random thresholded input policy generation.
3.3 Generated input policies for different δ.
3.4 Normalized error w.r.t. K and T with the other parameter fixed.
3.5 First two figures: Q values as a function of the number of visits to states (10,10) and (15,10) in the wireless network, number of network states = 1200. Third figure: Q values for state (10,10) in a wireless network with size 1500.
3.6 From left to right: image graph structure, sampled policy signal, interpolated policy signal, final policy after thresholding.
3.7 Decomposition of total variation on the lattice graph.
3.8 Pictorial representation of the cushion method (a) and boundary perturbation (b).
3.9 Line sampling problem; variables less than t will be labeled 0, variables greater than t will be labeled 1.
3.10 Policy error with respect to sample fraction, network size = 1200.
3.11 Performance comparison of the policy sampling and interpolation algorithm.
3.12 Decay rate bounds and analytical bounds. Upper figure: decay rate regarding Theorem 6. Lower figure: analytical result regarding Propositions 7 and 8.
3.13 Performance comparison of the policy sampling, interpolation and refinement algorithm.
3.14 Optimal policy and policy interpolation for a system with multiple actions.
3.15 Optimal policy for the server in the queueing system and performance comparison.
3.16 Histograms of policy error of DQN and the policy sampling/interpolation approach, network size = 900.
4.1 Pictorial illustration of a policy network; the numbers of nodes are 3-5-4-2. Activation functions are shown on top of each layer.
4.2 Normalized weights of the neural network during the learning process for different policy gradient algorithms.
4.3 Pictorial representation of the simple policy network.
4.4 Surface plot of π(0|s).
4.5 Evolution of the output probability for the silence action.
4.6 2D surface plot of the objective function J as a function of W^(2)_12 and W^(2)_23.
4.7 Pictorial illustration of continual learning. The tasks labeled in red are pre-train tasks; the task labeled in blue is a future task.
4.8 Policy error for policy gradient with smart initialization.
4.9 Comparison of the optimal policy, the policy obtained by policy gradient, and the policy obtained by policy gradient with smart initialization. The weighting factor is β = 12.
4.10 Performance of policy gradient with continual learning for initialization.
4.11 Pictorial representation of the MIMO system.
4.12 Performance comparison of random initialization and smart initialization.
A.1 Pictorial representation of the low rank property. States 1 and 2 have the same ancestor states, and their incoming transition probabilities are scalar multiples of each other.

List of Tables

2.1 Proposed methods for subspace design
3.1 Policy sampling, interpolation and refinement methods
3.2 Comparison of number of visits (1200 network states)

Abstract

With the proliferation of the Internet-of-Things and cyber-physical systems, wireless communication networks have grown in size, heterogeneity and complexity. Optimal control of such networks is critical to achieving the desired system performance. Such network control problems are usually modeled as Markov decision processes (MDPs). Direct solution of MDPs is generally prohibitively complex and further challenged by unknown system parameters in practical environments. As such, reinforcement learning (RL) has become a popular approach to solving MDPs in the presence of unknown network parameters. Classical algorithms such as policy iteration and Q-learning have been successful in solving various types of wireless network control problems, such as throughput maximization for energy harvesting, design of media access control (MAC) protocols, power control and rate adaptation, and resource allocation.
However, as noted above, conventional methods suffer from complexity challenges as a result of the exponential growth of the number of system states; thus it is of great importance to develop reduced dimension MDP/RL algorithms. In this thesis, three different approaches are proposed to find the optimal control of the packet transmission strategy in a wireless network.

First, under the assumption that the system dynamics are available, the control problem is modeled as a Markov decision process. Projection methods have been proposed to solve MDP problems with a large state space, but the design of a good subspace is usually application dependent. In our wireless network problem, the low rank structural property of the probability transition matrix (PTM) is first examined, based on which a good subspace can be constructed that enables both dimension reduction and perfect reconstruction of the optimal control signal. In addition, we take a novel perspective from graph signal processing and propose subspace construction algorithms using only the PTM. The subspace constructed via bibliometric symmetrization can also achieve perfect reconstruction of the optimal control signal. Furthermore, we are able to prove the monotonicity of the value functions as well as a thresholded structure of the optimal control policy. A zig-zag policy update algorithm is also proposed to modify the policy iteration algorithm, and a 50% runtime reduction is observed.

Second, a more realistic scenario is considered where the transition probabilities are unknown. The Q-learning algorithm also suffers from high complexity in the presence of a large state space. To develop a low complexity Q-learning algorithm, the thresholded policy structure is exploited and a three-stage algorithm is proposed, comprising policy sampling, interpolation and refinement. In particular, a novel "image graph" perspective is taken in the design of the policy interpolation step.
The proposed algorithm achieves both a 65% runtime reduction and a 50% error reduction. Finally, the decay rate of the averaged error associated with the proposed algorithm is derived and is shown to be polynomial.

Third, to avoid learning the optimal policy through value functions or Q-values, policy gradient approaches have been applied to our wireless network system to directly learn the control strategy. Theoretical analysis has been conducted to study the behavior of convergence to local optima. In addition, a smart initialization strategy using continual learning is proposed to modify the policy gradient algorithms, with which a 60% error reduction can be achieved.

Related Publications

• Conference Papers

1. L. Liu and U. Mitra, "Policy Sampling and Interpolation for Wireless Networks: A Graph Signal Processing Approach", IEEE Global Communications Conference (GLOBECOM), IEEE, 2019.
2. L. Liu, A. Chattopadhyay and U. Mitra, "Exploiting Policy Structure for Solving MDPs with Large State Space", 52nd Annual Conference on Information Sciences and Systems, IEEE, Mar. 2018.
3. L. Liu, A. Chattopadhyay and U. Mitra, "On Exploiting Spectral Properties for Solving MDP with Large State Space", 55th Annual Allerton Conference on Communication, Control and Computing, pp. 1213-1219, IEEE, Oct. 2017.

• Journal Papers

4. L. Liu and U. Mitra, "Policy Gradient in Wireless Network: Neural Network Analysis and Improvement", submitted.
5. L. Liu and U. Mitra, "On Sampled Reinforcement Learning in Wireless Networks: Exploitation of Policy Structures", IEEE Transactions on Communications, vol. 68, no. 5, pp. 2823-2837, 2020.
6. L. Liu, A. Chattopadhyay and U. Mitra, "On Solving Large Scale MDPs: Exploitation of Spectral Properties and Policy Structures", IEEE Transactions on Communications, vol. 67, no. 6, pp. 4151-4165, 2019.
Chapter 1

Introduction

1.1 Thesis background

Sequential decision making problems are commonly used to model network control problems, where the network behavior varies according to different control strategies and thus achieves different long-term expected costs/rewards. Therefore, the key is to understand the transition behavior (the logical transition graph) between states under various control strategies, so that an optimal policy for each system state can be derived. Depending on the availability of the system dynamics, the optimization problem can be further categorized as a Markov decision process (MDP) or a reinforcement learning (RL) problem [1]. While such problems can be solved by classical policy iteration algorithms and Q-learning methods [1], these approaches suffer from high complexity in the presence of large state/action spaces. A possible complexity reduction strategy is function approximation, which can be achieved by projecting signals onto lower dimensional subspaces. A key challenge with this approach is the design of the lower dimensional subspace. An alternative approach seeks a compact system representation via model reduction. Model reduction can be achieved via state aggregation, but suffers from unrealistic assumptions or from methods that are overly dependent on the particular instantiation of the problem.

In this thesis, we mainly focus on developing low complexity algorithms for MDP/RL problems in wireless networks, using graph signal processing (GSP) theory [2]. The work consists of three parts: 1) reduced dimension MDP algorithm design; 2) efficient RL algorithm design; 3) analysis and improvement of policy gradient algorithms. The proposed algorithms all achieve a complexity reduction of more than 50% with negligible policy error.

1.2 Thesis contributions

1.2.1 Reduced Complexity MDP algorithms

Fine-grained modeling techniques come at the cost of high complexity.
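To make that cost concrete, the following minimal policy iteration loop for a generic finite MDP (a textbook sketch with illustrative shapes only, not the thesis's wireless model) solves an S x S linear system at every evaluation step, which is exactly what becomes prohibitive as the state space grows:

```python
import numpy as np

def policy_iteration(P, c, alpha=0.9, max_iter=100):
    """Minimal policy iteration for a finite MDP with discount alpha.

    P: array (A, S, S) of per-action transition matrices.
    c: array (A, S) of per-action expected single-stage costs.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)          # start from an arbitrary policy
    V = np.zeros(S)
    for _ in range(max_iter):
        # Policy evaluation: solve (I - alpha * P_pi) V = c_pi, an O(S^3) step.
        P_pi = P[policy, np.arange(S)]       # (S, S): row s is P[policy[s], s, :]
        c_pi = c[policy, np.arange(S)]
        V = np.linalg.solve(np.eye(S) - alpha * P_pi, c_pi)
        # Policy improvement: greedy one-step lookahead (costs are minimized).
        new_policy = (c + alpha * P @ V).argmin(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```

The O(S^3) evaluation and the O(A S^2) improvement sweep are the terms that the reduced dimension methods in Chapter 2 aim to shrink.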
In our Markov decision process model, fine-grained modeling results in a logical graph that is larger than the graph induced by the physical transmitting and receiving nodes. The behavior of the graph is captured via the probability transition matrix. The cost of being at any given state is the graph signal that we manipulate via optimized control. Our first approach was to design lower dimensional subspaces for the graph signals using graph signal processing techniques [2]. Under a smoothness assumption for the state costs, and leveraging tools from GSP, a good subspace (10% of the original size) can be constructed via the eigenvalue decomposition of the graph Laplacian matrix. The bulk of GSP theory focuses on undirected graphs; however, our wireless network graphs are highly directed. Thus, we initially developed symmetric proxies for the probability transition matrix. We proposed the use of the bibliometric symmetrization, which was able to achieve a 40% complexity reduction and perfect reconstruction of the cost function/graph signal, as well as determine the optimal policy. This result relied on perfect knowledge of the probability transition matrix [3]. We also analyzed the structure of the optimal policy and proved that the optimal policy has a threshold structure. We designed a new policy iteration algorithm which incorporated this structural property, resulting in a further complexity reduction. We were also able to prove that the probability transition matrix has a low rank property which can be exploited for subspace design. Our overall reduced complexity design was able to achieve perfect reconstruction of the optimal policy. There is a large class of problems for which this thresholded structure exists, and thus our methods have potential application to problems such as inventory problems and many MIMO communication problems.
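The subspace idea can be illustrated on a toy example (a path graph and a cosine signal, which are stand-ins, not the thesis's wireless graph): a smooth graph signal is well approximated by the low-frequency eigenvectors of the graph Laplacian, so keeping roughly 10% of the basis loses little.

```python
import numpy as np

n, k = 50, 5                                  # 50 states, keep 10% of the basis
A = np.zeros((n, n))
idx = np.arange(n - 1)
A[idx, idx + 1] = A[idx + 1, idx] = 1.0       # adjacency of an undirected path graph
L = np.diag(A.sum(axis=1)) - A                # combinatorial Laplacian L = D - A
eigvals, U = np.linalg.eigh(L)                # eigenpairs, sorted by graph frequency
Phi = U[:, :k]                                # low-frequency subspace basis

v = np.cos(np.pi * np.arange(n) / n)          # a smooth stand-in "value function"
v_hat = Phi @ (Phi.T @ v)                     # orthogonal projection onto the subspace
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Because the signal's energy concentrates on the first few Laplacian eigenvectors, the relative reconstruction error after a 10x dimension reduction stays small.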
1.2.2 Reduced Complexity RL algorithms

As noted above, our initial work assumed perfect knowledge of the probability transition matrix of the underlying graph; such information is not known in practice. While Q-learning is often considered as a solution in such scenarios, the learning strategy suffers from high complexity and requires the traversal of many long trajectories to implicitly learn the probability transition matrix. Leveraging insights from our prior work, we proposed a three-stage algorithm comprising policy sampling, interpolation and refinement. We proposed sampled Q-learning, wherein a random subset of states is selected and Q-learning is terminated when there is a sufficient number of visits to these states. To learn the Q-functions for the unsampled states, a novel interpolation strategy was proposed. We view the policy signal as a partial image graph, hence GSP techniques can be adapted for graph signal interpolation. In addition, given the previously proved threshold structural property, machine learning techniques such as support vector machines (SVMs) can also be employed. Finally, the policy is further refined via boundary perturbation methods (exploiting this "image" view of the policy). Numerical results showed that in a system with 2000 states, we obtain a large runtime reduction of 80%, and the policy error is 50% smaller than that achieved by full-dimension classical Q-learning. We were also able to prove that the decay rate of the average policy error scales with 1/√N, where N is the total number of states of the system. Our theoretical analysis shows that our three-stage algorithm achieves a significant complexity reduction and zero error asymptotically [4], which is exactly the large network regime of most practical interest.

1.2.3 Analysis of policy gradient

The above methods are categorized as intermediate methods since the optimal policy is derived with the help of value functions/Q-values.
However, they can only learn deterministic policies, and in many cases the policy signal has a simpler structure than that of the value functions or Q-values. Policy gradient [1] is a class of methods that optimize parametrized policies with respect to the long-term cumulative cost via gradient descent. Neural networks are a popular tool for parametrizing the policy, although potential challenges exist, such as hyperparameter tuning, overfitting and the need for large training data sets. Neural networks have enjoyed success in many applications such as gaming, robotic locomotion and continuous action control. However, in simulations of our wireless network model, we discovered that the policy gradient algorithms always converge to a locally optimal solution. Gradient analysis has been performed, and we are able to show that certain weights in the neural network are monotonically increasing/decreasing as we train the network. To tackle the local optimum issue, we have incorporated the idea of continual learning [5] and pre-train the neural network with a few known settings and their corresponding optimal policies. To our knowledge, continual learning has not been widely applied to wireless network control problems. With this smart initialization trick, the policy gradient algorithm converges to the optimal solution. In addition, we have started a joint project with Prof. Helen Li at Duke University to investigate the application of sparse approximation techniques to neural networks. One preliminary result is that by introducing a sparsity constraint in the objective function, the learned neural network is sparse in terms of network weights, with negligible performance loss. Thus, complexity reduction can be achieved when the neural network is used for future tasks.

1.3 Thesis outline

The rest of this thesis is organized as follows. Chapter 2 mainly focuses on reduced dimension MDP problems.
Section 2.2 first provides the necessary background on Markov decision processes and graph signal processing. Section 2.3 presents a wireless network control problem under the framework of Markov decision processes. The projection method for MDPs is presented and the optimality criterion for subspace construction is derived in Section 2.4. Section 2.5 provides more detail on general subspace construction methods using GSP. Numerical validation is provided in Section 2.6, and the proofs are provided in Appendix A. Chapter 3 extends the MDP work to the reinforcement learning setting. Section 3.2 provides background on Q-learning. Section 3.3 describes the policy sampling, interpolation and refinement algorithm. Section 3.4 provides a theoretical analysis of the averaged decay rate of the policy error associated with our proposed algorithm. Numerical results are given in Section 3.5. Appendix B provides the proofs of the theorems in Chapter 3. Chapter 4 studies the network control problem from a different perspective, using the policy gradient approach. Section 4.2 provides background on RL and policy gradient methods. A case study containing an interesting convergence result is provided in Section 4.3, and it serves as the motivation for further analysis. Theoretical analysis regarding the properties of the gradients and network weights is given in Section 4.4. Section 4.5 provides smart initialization techniques for the neural network, and numerical results are shown in Section 4.6. The proofs are provided in Appendix C. Finally, Chapter 5 concludes the thesis.

Chapter 2

On Solving Large Scale Markov Decision Processes Problems: Exploitation of Spectral Properties and Policy Structures

2.1 Introduction

Markov Decision Processes (MDPs) are useful in modeling a wide range of network control and optimization problems, where the system evolves stochastically as a function of the current system state and control input.
The application of MDPs can be found in wireless networks [6], sensor networks [7–9], inventory problems [10], resource allocation [11], agriculture and population studies [12], etc. The theory of MDPs is well established [13, 14]. Classical algorithms such as value iteration, policy iteration and linear programming serve as useful tools to solve the optimization problem. However, in practice, large scale networks are ubiquitous (e.g. wireless networks, sensor networks, biological networks, etc.). The size of such networks usually scales polynomially or exponentially with the number of state variables. This is generally referred to as the curse of dimensionality, rendering the standard algorithms computationally prohibitive.

There are two main methodologies tailored for MDPs to tackle the curse of dimensionality. One is value function approximation, which seeks compact representations of the value function in a lower dimensional subspace [13–15]. Finding a proper subspace can be non-trivial; our early work [16] presented several approaches to, and comparisons of, subspace selection. Work reported in [17–19] showed that diffusion wavelets [20] can be useful for generating the subspace and solving MDP problems efficiently, although the probability transition matrix (PTM) is constrained to be symmetric. The challenges with subspace approaches are finding the right basis and ensuring the convergence of the associated iterative methods [14].

The other dimension reduction approach is realized through model reduction. This approach seeks compact representations of the system (the Markov chain) rather than the value function, via state aggregation and disaggregation. Model reduction can be achieved with different techniques and metrics. For example, the notion of stochastic bisimulation is proposed in [21], wherein two states are aggregated if they have the same cost functions and transition probabilities.
In [22], the Kullback-Leibler divergence is used as a metric to minimize the distance between the original Markov chain and the one after aggregation and disaggregation. The Kron-reduction technique can also be employed, given the identification of an independent set [23]. Recently, stochastic factorization has been proposed in [24], where a low rank factorization of the PTM is performed to create a lower dimensional MDP by swapping the positions of the two factored stochastic matrices. The disadvantages of these methods are: complexity (it is NP-hard to find the right independent set for Kron-reduction), limitation to specific kinds of Markov chains (K-L aggregation and stochastic factorization), and idealized conditions (stochastic bisimulation).

In addition to the approximation methods mentioned above, structural information can also be incorporated into the standard algorithms to achieve complexity reduction. For example, periodicity of the system is explored in [10] to reduce computation; as a generalization, graphs that are M-block cyclic [25] can potentially allow complexity reduction in analysis. Moreover, a threshold structure for optimal policies is a common phenomenon in many MDP applications in wireless networks [26–29]. Thresholded policies have thresholding states that characterize the optimal control decisions. The generalization of these types of networks is still challenging. In [30], a sufficient condition for the value function and optimal strategy to be even and quasi-convex is derived, but this work mainly focuses on MDPs where the state space is even (symmetric on the real line), although [30] also defines a folding operation to extend the analysis to R+. However, the conditions are still too strong for typical wireless networks. For particular wireless networks exhibiting such a thresholded policy structure, we will show that incorporating the policy structure in policy optimization can strongly reduce complexity.
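A toy sketch of why the thresholded structure helps (the grid size and thresholds below are invented for illustration): on a (q, h) state grid, a thresholded policy is fully described by one threshold per queue length, so it can be stored and searched over O(Q) numbers instead of a full Q x H action table.

```python
import numpy as np

# Thresholded policy on a (q, h) grid: transmit (action 1) iff h >= t(q).
Q, H = 6, 8
t = np.array([7, 6, 4, 3, 1, 0])             # hypothetical nonincreasing thresholds
policy = (np.arange(H)[None, :] >= t[:, None]).astype(int)   # full (Q, H) table

# The full table is redundant: the thresholds can be recovered as the first
# h at which the action switches to 1 (or H if the row is all zeros).
t_rec = np.array([row.argmax() if row.any() else H for row in policy])
```

Searching over monotone threshold vectors rather than arbitrary binary tables is the kind of structural restriction the modified policy iteration in this chapter exploits.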
In this chapter, we consider a value function approximation method employing Graph Signal Processing techniques. Graph Signal Processing (GSP) is a theory for analyzing signals defined on graphs. Early work [2, 31–34] provided insightful extensions of frequency and Fourier transforms from the classical signal domain to the graph domain. In our MDP problem, although the probability transition matrix can be viewed as a transition graph and the value function of each state can be viewed as a graph signal, directly applying GSP theory appears to be challenging. By the nature of the MDP problem, our transition graph is a directed graph; GSP theory tailored for directed graphs [32–34] requires the Jordan decomposition of the PTM, which is much more complex than the Eigenvalue Decomposition (EVD), a technique usually used for undirected graphs. Fortunately, various kinds of symmetrization techniques have been proposed [35]. Here, we adopt several such methods and use classical GSP theory to obtain a good subspace for value function approximation. The main contributions in this chapter are:

• The policy structure of a MDP for a wireless network example is derived.

• A modified policy iteration algorithm exploiting the policy structure is proposed. Numerical results show that it achieves reasonable complexity reduction with no performance loss.

• Optimal subspace construction criteria for reduced dimension MDPs are derived, and a good subspace is generated based on the system model that achieves zero policy error with faster runtime.

• Various methods are proposed to generate the subspace using GSP techniques for value function approximation. One particular method, based on the graph symmetrization technique in [35], achieves zero policy error.

• Finally, we show the application of our methods to another MDP admitting a thresholded policy [36].

The rest of the chapter is organized as follows: Section 2.2 provides background on MDPs and GSP.
In Section 2.3, the wireless system is elaborated upon, the structure of the optimal policy is derived, and a modified policy iteration algorithm is proposed. The projection method for MDPs is presented and the optimal criteria for subspace construction are derived in Section 2.4. Section 2.5 provides more detail on general subspace construction methods using GSP. Numerical validation is provided in Section 2.6 and Section 2.7 concludes the chapter. Finally, the proofs of the technical results are provided in Appendix A.

2.2 Preliminaries: MDP and GSP

In this section, we provide key background on MDPs and GSP. For notational clarity, column vectors are in bold lower case (e.g. $\mathbf{x}$); matrices are in bold upper case (e.g. $\mathbf{A}$); sets are in calligraphic font (e.g. $\mathcal{S}$); and scalars are non-bold (e.g. $\alpha$, $a$).

2.2.1 Markov Decision Processes and Policy Iteration

Markov Decision Processes provide mathematical tools for modeling systems that involve decision making. Typically, a MDP is a 5-tuple $\{\mathcal{S}, \mathcal{U}, \mathbf{P}, c, \alpha\}$, where $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ denotes a finite state space and $\mathcal{U} = \{u_1, u_2, \ldots, u_k\}$ denotes the finite action space. The probability transition matrix is given by $\mathbf{P}$, and we denote by $s(t)$ and $u(t)$ the state and action taken at time slot $t$, respectively, where $t = \{1, 2, 3, \ldots\}$. The transition probability of going to state $s'$, given the current state $s$ and current action $u$, is given by:

$$p(s, u, s') = P\big(s(t+1) = s' \mid s(t) = s,\ u(t) = u\big), \quad (2.1)$$

$\forall s, s' \in \mathcal{S}$, $u \in \mathcal{U}$. For each transition, there is an associated instantaneous cost $\varphi(s, u, s')$. The average cost (single stage cost) defined for each state $s$, given action $u$, is:

$$c(s, u) = \sum_{s' \in \mathcal{S}} \varphi(s, u, s')\, p(s, u, s') \quad \forall s, u. \quad (2.2)$$

In different networking settings, depending on the application, the cost function $c(s, u)$ can be used to describe performance metrics such as throughput, delay, failure probability, etc.
The core problem of MDP optimization can be formulated as:

$$v^*(s) = \min_{\mu} v_\mu(s) = \min_{\mu} \mathbb{E}_\mu \left\{ \sum_{t=0}^{\infty} \alpha^t c\big(s(t), \mu(s(t))\big) \,\Big|\, s(0) = s \right\}, \quad (2.3)$$

where $\alpha \in (0, 1)$ is the discount factor, and $v^*(s)$ is called the value function [13], measuring the expected sum of discounted costs over an infinite horizon starting from state $s$. A policy, $\mu : \mathcal{S} \to \mathcal{U}$, is a mapping from the state space to the action space, specifying the decision rule for each state. If, under policy $\mu$, the action taken at state $s$ is $u$, then this action is represented as $\mu(s) = u$. A policy $\mu_t$ is stationary if $\mu_t = \mu$ for all $t$; it is also called deterministic if $P(\mu(s) = u) = 1$ for some $u \in \mathcal{U}$. Without loss of generality [13], we focus on stationary and deterministic policies that solve the optimization problem (2.3).

Policy iteration [13] is a classical algorithm to compute the solution to the optimization problem. It starts with an arbitrary policy $\mu^{(0)}$ and iteratively generates a sequence of policies $\{\mu^{(k)}\}_{k \ge 1}$. Denote by $P_{\mu^{(k)}}$ and $c_{\mu^{(k)}}$ the PTM and cost function under policy $\mu^{(k)}$, respectively; the value function under policy $\mu^{(k)}$ is denoted by $v_{\mu^{(k)}}$. Each iteration consists of two steps:

• Policy evaluation: Under the current policy $\mu^{(k)}$, the system evolves as a Markov chain. The value function is evaluated by solving the Bellman fixed point equation [13]:

$$v_{\mu^{(k)}} = c_{\mu^{(k)}} + \alpha P_{\mu^{(k)}} v_{\mu^{(k)}}, \quad (2.4)$$

where $c_{\mu^{(k)}}$ and $P_{\mu^{(k)}}$ are the average cost and probability transition matrix under policy $\mu^{(k)}$.

• Policy improvement: After obtaining $v_{\mu^{(k)}}$, the policy is updated by a greedy search over the one-step look-ahead of the current value function:

$$\mu^{(k+1)}(s) = \arg\min_{u \in \mathcal{U}} \left\{ c(s, u) + \alpha \sum_{s' \in \mathcal{S}} p(s, u, s')\, v_{\mu^{(k)}}(s') \right\}. \quad (2.5)$$

We keep iterating between steps (2.4) and (2.5) until two successive policies are the same, at which point we have the optimal policy and value function. It can be shown that each iteration improves the value function.
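The two steps above can be sketched in a few lines of numerical code. The two-state, two-action MDP at the bottom is a hypothetical toy example (not the wireless model of Section 2.3); all numbers are illustrative.

```python
import numpy as np

def policy_iteration(P, c, alpha):
    """Policy iteration for a finite MDP.
    P[u] is the |S| x |S| transition matrix under action u,
    c[s, u] is the single-stage cost, alpha the discount factor."""
    n, k = c.shape
    mu = np.zeros(n, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - alpha * P_mu) v = c_mu  (Eq. 2.4)
        P_mu = np.array([P[mu[s]][s] for s in range(n)])
        c_mu = c[np.arange(n), mu]
        v = np.linalg.solve(np.eye(n) - alpha * P_mu, c_mu)
        # Policy improvement: greedy one-step look-ahead (Eq. 2.5)
        Qf = np.array([c[:, u] + alpha * P[u] @ v for u in range(k)]).T
        mu_new = np.argmin(Qf, axis=1)
        if np.array_equal(mu_new, mu):   # two successive policies agree
            return mu, v
        mu = mu_new

# Toy 2-state, 2-action MDP (hypothetical numbers, illustration only)
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # transitions under action 0
     np.array([[0.5, 0.5], [0.6, 0.4]])]   # transitions under action 1
c = np.array([[1.0, 2.0], [4.0, 0.5]])     # c[s, u]
mu, v = policy_iteration(P, c, alpha=0.9)
```

At convergence, `v` satisfies the optimality equation: for every state, `v` equals the minimum of the one-step look-ahead over actions.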
Due to the finite nature of the state and action spaces, policy iteration is guaranteed to converge to the optimal solution in a finite number of iterations [13]. The optimal solution satisfies the optimality equation:

$$v^*(s) = \min_{u \in \mathcal{U}} \left\{ c(s, u) + \alpha \sum_{s' \in \mathcal{S}} p(s, u, s')\, v^*(s') \right\}. \quad (2.6)$$

2.2.2 Graph Signal Processing

Graphs provide an efficient representation tool for data in many domains. Many networks, such as wireless networks, social networks and biological networks, have this kind of structure, with data located on vertices and links representing their relationships (connectivity, similarity, etc.). The graph can be either the physical graph of the system itself (e.g. a transportation network or social network) or the logical graph induced by the network protocol (e.g. Markov chains, Finite State Machines). When the vertices of the graph are appropriately labeled, we can form a column vector containing the data on the vertices; this vector is termed the graph signal.

The analysis of a graph signal starts with the graph structure. Denote by $\mathbf{x}$ the graph signal and by $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathbf{W}\}$ the corresponding graph, where $\mathcal{V}$ is the node set and $\mathcal{E}$ is the edge set. The relationship between nodes is represented by the adjacency matrix $\mathbf{W}$. If the graph is undirected ($\mathbf{W}$ is symmetric), then the graph Laplacian matrix [37] is defined as $\mathbf{L} = \mathbf{D} - \mathbf{W}$, where $\mathbf{D}$ is the degree matrix, a diagonal matrix with $D_{i,i} = \sum_{j \in \mathcal{V}} W_{i,j}$, and $W_{i,j}$ is the edge weight between nodes $i$ and $j$. It is easy to see that $\mathbf{L}$ is a positive semidefinite (PSD) matrix. The spectral representation of the original graph signal is given by the eigenvalue decomposition of the matrix $\mathbf{L}$:

$$\mathbf{L} = \mathbf{B} \boldsymbol{\Lambda} \mathbf{B}^T = \sum_{i=1}^{|\mathcal{V}|} \lambda_i \mathbf{b}_i \mathbf{b}_i^T, \quad \lambda_i \ge 0\ \forall i, \quad (2.7)$$

where the columns of $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_{|\mathcal{V}|}]$ form an orthogonal eigenspace, with projection matrices $\mathbf{b}_i \mathbf{b}_i^T$. Clearly, every graph signal defined on graph $\mathcal{G}$ can be decomposed as a linear combination of the eigenvectors $\mathbf{b}_i$. Similar to classical signal processing, the total variation (i.e.
graph frequency) [2] of the basis function $\mathbf{b}_i$ is defined as a quadratic sum:

$$\mathrm{TV}(\mathbf{b}_i) = \frac{1}{2} \sum_{j,k} W_{j,k}\, \big(b_i(j) - b_i(k)\big)^2 = \mathbf{b}_i^T \mathbf{L}\, \mathbf{b}_i = \lambda_i. \quad (2.8)$$

Therefore, $\sigma(\mathbf{L}) = \{0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_{|\mathcal{V}|}\}$ represents the graph frequencies from low to high. It has been shown that the eigenvalues and eigenvectors of the Laplacian matrix $\mathbf{L}$ provide a harmonic analysis of graph signals [37]; hence the Graph Fourier Transform is defined as the projection of the graph signal onto the eigenspace of $\mathbf{L}$: $\tilde{\mathbf{x}} = \mathbf{B}^T \mathbf{x}$, where $\tilde{\mathbf{x}}$ is the vector of graph frequency coefficients. The inverse transform is given by $\mathbf{x} = \mathbf{B} \tilde{\mathbf{x}}$.

For directed graphs, due to the asymmetry induced by the directivity of the adjacency matrix, the graph Laplacian concept cannot be easily extended. While a graph Laplacian can be defined for directed graphs [38], it is restricted to probability transition graphs (Markov chains) and involves the calculation of the stationary distribution, which makes the computation more complicated. There are also other endeavors in analyzing graph signals defined on directed graphs: the authors in [32–34] directly analyzed the adjacency matrix and showed a nice frequency interpretation using Jordan decompositions, where the frequency is measured by the distance between each eigenvalue and the point (1, 0) on the complex plane. However, such methods also raise important issues requiring further attention. First, the generalized eigenvectors form a bi-orthogonal basis, so Parseval's identity does not hold. In addition, the total variation introduced in [34] does not guarantee that a constant graph signal has zero total variation. Furthermore, numerical complexity and instability can be an issue even for moderate matrix sizes, and complex field analysis is usually required. Without a proper mapping from the complex field to the real field, it is difficult to achieve good approximation of the graph signal in this manner.
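For the undirected case, the pipeline of Equations (2.7)–(2.8) — Laplacian, eigendecomposition, total variation, and Graph Fourier Transform — can be sketched in a few lines. The small path graph below is a hypothetical example chosen only for illustration.

```python
import numpy as np

# Adjacency matrix of a small undirected path graph (illustrative example)
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
D = np.diag(W.sum(axis=1))          # degree matrix
L = D - W                           # graph Laplacian (Eq. 2.7), PSD

lam, B = np.linalg.eigh(L)          # ascending eigenvalues = graph frequencies
# Total variation of each basis vector equals its eigenvalue (Eq. 2.8)
tv = np.array([b @ L @ b for b in B.T])

x = np.array([1.0, 2.0, 3.0, 4.0])  # a graph signal on the 4 nodes
x_hat = B.T @ x                     # Graph Fourier Transform
x_rec = B @ x_hat                   # inverse transform recovers x
```

The smallest eigenvalue is zero (the constant eigenvector has zero total variation), and the transform pair is lossless since the eigenbasis is orthonormal.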
2.3 Wireless Model and Policy Structures

The MDP model for a wireless network can vary depending on the application and objective function. For example, we can focus on the throughput of a network, whose state can be described by buffer length and retransmission index [6, 39], or on the optimal transmission strategy in an energy harvesting network, where the state of the network can be characterized by the energy stored in each device [40]. In this section, a wireless transmission example is examined; although it has a simple physical network structure, the probability transition graph induced by this MDP can have a large state space. We will demonstrate its thresholded policy structure and modify the policy iteration algorithm by incorporating this information. In the general case, a network control problem for a large network leads to an MDP with a large transition graph; hence we consider an example of an MDP with a large state space. The structure of the network enables us to solve the MDP theoretically and compare it with our proposed algorithms.

Consider a system of only one transmitter and one receiver. Time is discretized into slots with equal duration. In each time slot the transmitter can decide whether or not to send a packet to the receiver. Packet arrival at the transmitter can be characterized by a Bernoulli process with arrival probability $p$ (this is not an uncommon assumption for queueing systems; Poisson arrivals can also be considered, and our numerical results yielded similar performance behavior, see Sec. 2.6). Incoming packets are stored in a buffer of capacity $Q$, and full occupancy of the buffer leads to a packet drop when a new packet arrives. Packets are transmitted through a channel between the transmitter and the receiver. The channel can be modeled as having path-loss, shadowing and fading [41].
The received power of a packet at the receiver is given by:

$$P_{rcv} = P_T\, C_0\, l^{-\eta}\, h, \quad (2.10)$$

where $P_T$ is the transmitting power at the transmitter, $C_0$ is a constant (path-loss at a reference distance), $\eta$ is the path loss exponent, and the distance between transmitter and receiver is denoted by $l$. The Rayleigh fading gain is denoted by $h$, which is exponentially distributed with mean 1 and i.i.d. over time slots. The transition dynamics of the buffer/channel state pair are:

$$(q_{t+1}, h_{t+1}) = \begin{cases} (0, h_{t+1}) & \text{w.p. } (1-p)\,P(h = h_{t+1}), & \text{if } q_t = 0 \\ (1, h_{t+1}) & \text{w.p. } p\,P(h = h_{t+1}), & \text{if } q_t = 0 \\ (q_t, h_{t+1}) & \text{w.p. } (1-p)\,P(h = h_{t+1}), & \text{if } 0 < q_t < Q \text{ and } U_t = 0 \\ (q_t + 1, h_{t+1}) & \text{w.p. } p\,P(h = h_{t+1}), & \text{if } 0 < q_t < Q \text{ and } U_t = 0 \\ (q_t, h_{t+1}) & \text{w.p. } p\,P(h = h_{t+1}), & \text{if } 0 < q_t < Q \text{ and } U_t = 1 \\ (q_t - 1, h_{t+1}) & \text{w.p. } (1-p)\,P(h = h_{t+1}), & \text{if } 0 < q_t < Q \text{ and } U_t = 1 \\ (q_t, h_{t+1}) & \text{w.p. } 1 \cdot P(h = h_{t+1}), & \text{if } q_t = Q \text{ and } U_t = 0 \\ (q_t - 1, h_{t+1}) & \text{w.p. } (1-p)\,P(h = h_{t+1}), & \text{if } q_t = Q \text{ and } U_t = 1 \\ (q_t, h_{t+1}) & \text{w.p. } p\,P(h = h_{t+1}), & \text{if } q_t = Q \text{ and } U_t = 1 \end{cases} \quad (2.9)$$

We assume that the length of each time slot is greater than the channel coherence time, so that the i.i.d. assumption holds. Under the assumption of a flat channel in each time slot, the channel state $h$ can be obtained via channel estimation, see e.g. [41]: pilot signals are sent at the beginning of each time slot and the estimate is used as the reference channel condition. This channel estimation strategy can be found in multiple papers and books [42–45]. For the successful transmission of a packet, we assume that there exists a threshold $P_{th}$ at the receiver. Thus, for any given channel state, there is also a power threshold at the transmitter to meet the successful transmission requirement. Therefore, setting $P_{rcv} = P_{th}$, the required $P_T$ is given by:

$$P_T(h) = \frac{P_{th}}{C_0\, l^{-\eta}\, h}. \quad (2.11)$$

Notice that the channel state is continuous; discretization can be employed to simplify the analysis.
The Gilbert–Elliott channel model [46, 47] is widely used for describing burst errors in transmission channels, where the evolution of the channel is modeled as a Markov chain with two states (good and bad). In our model, the channel state is represented by the Rayleigh fading gain $h$; here we adopt the idea of discretization and partition the range of $h$ into disjoint intervals, with midpoints representing the channel quality within each range. Thus the probability density function (PDF) is replaced by the probability mass function (PMF):

$$P(h = H_i) = \int_{a_i}^{b_i} e^{-x}\, dx, \quad \bigcup_i\, [a_i, b_i) = [0, +\infty), \quad (2.12)$$

where $H_i \in [a_i, b_i)$ is a particular number that represents the channel state within that interval.

The state of the system can thus be described by the pair $(q, h)$, where $q \in \{0, 1, 2, \ldots, Q\}$ indicates the number of packets in the buffer and $h \in \{1, 2, \ldots, H\}$ represents the index of the channel state. We consider a binary action space $\mathcal{U} = \{\text{transmit}, \text{idle}\} \triangleq \{1, 0\}$ for the transmitter. We also assume an i.i.d. channel whose coherence time equals the length of a time slot. The transition probabilities are given by Equation (2.9), and the Markov chain representation is shown in Fig. 2.1.

Figure 2.1. Markov chain of the system, where the arrows show all possible transitions.

We consider a discounted cost infinite horizon problem (2.3), and the single-stage cost function in time slot $t$ is defined as:

$$c(s(t), u(t)) = \mathbb{1}\{\text{packet drop at time } t\} + \beta\, P_T(s(t)) \cdot \mathbb{1}\{u(t) = \text{transmit}\}, \quad (2.13)$$

where $\mathbb{1}(\cdot)$ is the indicator function:

$$\mathbb{1}(x) = \begin{cases} 1, & \text{if } x \text{ is true} \\ 0, & \text{otherwise,} \end{cases} \quad (2.14)$$

and $\beta$ is a weighting factor measuring the emphasis on the transmitting power in the single stage cost. Denote by $V^*(q, h)$ the optimal value function at state $(q, h)$.
Then the optimality equations are:

$$V^*(0, h) = \alpha\, \mathbb{E}_{a, h'} V^*(a, h'), \quad q = 0, \quad (2.15)$$

$$V^*(q, h) = \min\Big\{ \beta P_T(h) + \alpha\, \mathbb{E}_{a, h'} V^*(q - 1 + a, h'),\ \ \alpha\, \mathbb{E}_{a, h'} V^*(q + a, h') \Big\}, \quad 1 \le q \le Q - 1, \quad (2.16)$$

$$V^*(Q, h) = \min\Big\{ \beta P_T(h) + \alpha\, \mathbb{E}_{a, h'} V^*(Q - 1 + a, h'),\ \ p + \alpha\, \mathbb{E}_{a, h'} V^*(Q, h') \Big\}, \quad q = Q, \quad (2.17)$$

where packet arrival is denoted by the binary random variable $a$ with $P(a = 1) = p$, and the expectation is taken over $a$ and $h'$. Equation (2.15) follows since no transmission is taken when the buffer is empty; in (2.16) the buffer is neither empty nor full, and the two terms correspond to the expected costs of transmission and of silence; (2.17) represents the situation when the buffer is full, where being silent incurs no transmission cost but an additional expected cost $p$ for packet dropping due to a new packet arrival.

There are three main theorems regarding this system:

Theorem 1 The value function $V^*(q, h)$ is nondecreasing in $q$ and nonincreasing in $h$.

Proof. See Appendix A.1.

Remark 1. Notice that the optimal value function can also be obtained by the value iteration algorithm. The main idea is to employ induction and show that such a monotonicity structure holds in each iteration.

Theorem 2 The one-step difference function $p \cdot \mathbb{1}(q = Q) + \mathbb{E}_{a, h'} V^*(\min\{q + a, Q\}, h') - \mathbb{E}_{a, h'} V^*(q - 1 + a, h')$ is increasing in $q$.

Proof. See Appendix A.2.

Remark 2. Two inequalities are required to show the monotonicity: one directly follows from the optimality equations, the other comes from induction.

Given Theorems 1 and 2, we are able to show the thresholded structure of the optimal policy (shown in Fig. 2.2).

Figure 2.2. Optimal policy of the point-to-point transmission system; network setting: Q = 40, H = 30, β = 10000.

Theorem 3 (Thresholded policy) The optimal control policy is that, at state $(q, h)$, we transmit only when $q \ge q_{th}(h)$ and $h \ge h_{th}(q)$, where $q_{th}$ and $h_{th}$ are threshold functions.

Proof. See Appendix A.3.

Remark 3.
The thresholded policy can be proved by directly applying Theorems 1 and 2 to the optimality equations.

The structured policy revealed by Theorem 3 is appealing since it enables efficient computation in the policy iteration. Due to the particular form of the optimal policy, a specialized algorithm can be developed that searches among policies of the same form as the optimal policy, avoiding the need to check all states when performing the policy update, which is a typical step in the original policy iteration algorithm. The flowchart and pictorial representation of the modified policy update method are shown in Figs. 2.3 and 2.4.

Figure 2.3. Block diagram for the modified policy improvement; the actions {0, 1} denote silence and transmit, respectively.

Figure 2.4. Modified policy improvement step, where red nodes denote transmit and blue nodes denote silence; the actions {0, 1} denote silence and transmit, respectively.

The general idea of the modified policy update is as follows, with states placed on a 2D plane. We start from state $(Q, 1)$ and proceed diagonally towards $(0, H)$. At a particular state $(q, h)$, we keep increasing $h$ by 1 until we reach a state whose optimal action is to transmit. From the threshold property, we know that all the states above it should also transmit. Then the buffer index $q$ is decreased by 1 and the process is repeated until we reach the boundary states $q = 0$ or $h = H$. It should be noted that this is a heuristic algorithm, but it is also easy to see that the original policy check is reduced to a “zig-zag” check.
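The zig-zag traversal just described can be sketched as follows. Here `greedy_action` is a hypothetical stand-in for the one-step look-ahead of Equation (2.5) (not the wireless model itself), and the simple additive threshold used at the bottom is purely illustrative.

```python
import numpy as np

def zigzag_policy_update(Q, H, greedy_action):
    """Modified policy improvement exploiting the threshold structure.
    greedy_action(q, h) returns 1 (transmit) or 0 (silence) via the
    one-step look-ahead of Eq. (2.5); states are (q, h), q=0..Q, h=1..H."""
    mu = np.zeros((Q + 1, H + 1), dtype=int)
    q, h = Q, 1                      # start at state (Q, 1)
    while q > 0 and h < H:
        if greedy_action(q, h) == 1:
            mu[q, h:] = 1            # threshold: all states with h' >= h transmit
            q -= 1                   # decrease the buffer index
        else:
            h += 1                   # climb the channel index
    if greedy_action(q, h) == 1:     # handle the boundary state reached last
        mu[q, h:] = 1
    return mu

# Hypothetical oracle: transmit iff q + h exceeds a fixed threshold
mu = zigzag_policy_update(Q=5, H=6, greedy_action=lambda q, h: int(q + h >= 6))
```

The returned policy is, by construction, nondecreasing along the channel axis for each buffer level, which is exactly the thresholded form of Theorem 3.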
In addition, the original policy iteration needs to perform two policy checks for each of the $(Q+1) \times H$ states, whereas in the zig-zag check at most $Q + H$ states are needed (from state $(Q, 1)$ to state $(0, H)$). Therefore, the complexity of the policy update reduces from $O(2 \cdot (Q+1) \cdot H)$ to $O(2 \cdot (Q+H))$. It is clear that the algorithm essentially identifies the boundary states. Although we could update the boundary based on the previous execution of the zig-zag policy, we may need to check the states around the previous boundary to determine the new boundary, which may require visiting more than $Q + H$ states.

2.4 Reduced Dimension MDP and Subspace Construction

In spite of the existence of a nice optimal policy structure, it is quite common for the state space, and even the action space, to grow large in real systems; this curse of dimensionality is still a major obstacle to the implementation of policy iteration. Therefore, a dimension reduction technique is highly desirable. In this section, we first demonstrate the idea of projection for reduced dimension MDPs and derive the optimal subspace construction criteria. We further show that fast subspace construction is enabled under the system proposed in Sec. 2.3.

To address the complexity challenge, approximate dynamic programming [14] has been proposed. We first notice that the system has composite states $(q, h)$ and the state variable $h$ is uncontrollable due to the i.i.d. channel property (an important property, as we will explain later). A simplification method for uncontrollable state components has been proposed (Section 6.1.5 of [14]), where dimension reduction is achieved by averaging out the uncontrollable state variable, yielding a Bellman equation of lower dimension. The details are omitted due to space constraints, but numerical results for this method will be provided in Sec. 2.6.
While we will observe that this method has promise for one particular wireless network example, it has limited applicability to general MDP problems. In contrast, the strategy we propose herein works well for general MDP problems.

A more general approach is the projected equation method, where we seek a compact representation of the value function in a lower dimensional subspace, $\hat{v}_\mu = M r$. The subspace is denoted by the matrix $M$ of dimension $N \times K$, where $N = |\mathcal{V}|$ is the size of the graph, $K \ll N$ is the size of the subspace, and $r$ is the coefficient vector. Without loss of generality, we assume that the columns of $M$ are orthonormal. Since the original value function $v_\mu$ is a fixed point of the Bellman operator, a way to find a good approximation is to force the linear approximation $\hat{v}_\mu = M r$ to also be a fixed point under the Bellman operator; this is in fact the main idea of the least-squares fixed point approximation in [48]. Therefore, we seek an approximate value function $\hat{v}_\mu$ that is invariant under the Bellman operator followed by projection, which yields the following derivation:

$$\hat{v}_\mu = M M^T (c_\mu + \alpha P_\mu \hat{v}_\mu) \;\Rightarrow\; M r = M M^T c_\mu + \alpha M M^T P_\mu M r \;\Rightarrow\; r = M^T c_\mu + \alpha M^T P_\mu M r \;\Rightarrow\; r = (I - \alpha M^T P_\mu M)^{-1} M^T c_\mu, \quad (2.18)$$

where the third equation follows from the orthonormality of $M$. The approximated value function is then:

$$\hat{v}_\mu = M (I - \alpha M^T P_\mu M)^{-1} M^T c_\mu. \quad (2.19)$$

In standard policy iteration, the complexity of each iteration is generally $O(N^3)$, due to the matrix inversion in the policy evaluation step (2.4). In the projected equation case, the pure matrix inversion is reduced to $O(K^3)$ and the matrix–vector multiplications cost $O(N^2)$; therefore complexity reduction is achieved. It should be noted that the convergence of such a reduced dimension method is not always guaranteed, since the approximated value function is used for policy improvement (2.5); one possible reason is policy oscillation [14].
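Equations (2.18)–(2.19) reduce policy evaluation to a $K \times K$ solve. A minimal numerical sketch, using a random row-stochastic matrix and a random orthonormal subspace as placeholders (all sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, alpha = 50, 5, 0.9

# Random row-stochastic P_mu and cost c_mu (illustrative placeholders)
P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)
c = rng.random(N)

# Orthonormal subspace M (N x K) via QR
M, _ = np.linalg.qr(rng.random((N, K)))

# Exact policy evaluation: an N x N solve (Eq. 2.4)
v_exact = np.linalg.solve(np.eye(N) - alpha * P, c)

# Projected evaluation: only a K x K solve (Eqs. 2.18-2.19)
r = np.linalg.solve(np.eye(K) - alpha * M.T @ P @ M, M.T @ c)
v_hat = M @ r
```

By construction, `v_hat` is exactly invariant under the Bellman operator followed by projection, which is the defining property used in the derivation above.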
To ensure convergence to the true value function, one trivial condition is to have $v_\mu = \hat{v}_\mu$; the criterion for subspace selection is then given by the following theorem:

Theorem 4 If $P_\mu$ is low rank, then to achieve $v_\mu = \hat{v}_\mu$, the subspace for Equation (2.19) is the set of orthonormal basis vectors that span the column space of $P_\mu \oplus c_\mu$, where $\oplus$ denotes the direct sum.

Proof. See Appendix A.4.

Remark 4. The proof follows from a direct comparison of the original, full dimensional value function and the approximated value function.

This theorem provides a general strategy for optimal subspace construction and has potential application in many networks, since it is possible that the PTM induced by the protocol in a FSM is low rank. One major shortcoming of this method lies in the changing subspace, and possibly changing rank, of $P_\mu$ in each iteration. However, the model given in Section 2.3 admits further complexity reduction as a result of the following rank preservation theorem.

Theorem 5 Under the network setting in Section 2.3, the rank of $P_\mu$ is always equal to $Q + 1$, where $Q$ is the buffer capacity. For any given channel state $h_0$, the columns associated with states $(q, h_0)$, $q \in \{0, 1, 2, \ldots, Q\}$, in $P_\mu$ form an independent set of basis vectors that spans the column space of $P_\mu$.

Proof. See Appendix A.5.

Remark 5. The proof exploits the i.i.d. fading of the channel, as well as the independence between packet arrivals and the channel. It can then be shown that states with a common buffer length $q$ share a similar incoming transition pattern.

The significance of the uncontrollability of $h$ should also be emphasized. Since $h$ is uncontrollable, Theorem 5 can also be interpreted as saying that the rank of the PTM equals the number of controllable elements of the state space, from which Theorem 4 can directly benefit.
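As a numerical check of Theorems 4 and 5, the sketch below builds a synthetic row-stochastic matrix of rank $Q+1$ (columns repeated across channel blocks to mimic the rank structure of Theorem 5 — a placeholder, not the actual wireless PTM), forms the subspace from the first $Q+1$ columns plus the cost vector with QR playing the role of Gram–Schmidt, and verifies that the projected evaluation (2.19) recovers the exact value function.

```python
import numpy as np

def fast_subspace(P_mu, c_mu, Q):
    # Basis: first Q+1 columns of P_mu (independent per Theorem 5) plus c_mu,
    # orthonormalized via QR (equivalent to Gram-Schmidt)
    basis = np.column_stack([P_mu[:, :Q + 1], c_mu])
    M, _ = np.linalg.qr(basis)
    return M                                  # N x (Q+2), orthonormal columns

rng = np.random.default_rng(1)
Q, H = 3, 4
N = (Q + 1) * H
B = rng.random((N, Q + 1))
B /= H * B.sum(axis=1, keepdims=True)         # rows of the tiled matrix sum to 1
P_mu = np.tile(B, (1, H))                     # rank Q+1: columns repeat per block
c_mu = rng.random(N)

M = fast_subspace(P_mu, c_mu, Q)
# With this subspace, the projected evaluation (2.19) is exact (Theorem 4)
alpha = 0.5
v_exact = np.linalg.solve(np.eye(N) - alpha * P_mu, c_mu)
r = np.linalg.solve(np.eye(Q + 2) - alpha * M.T @ P_mu @ M, M.T @ c_mu)
v_hat = M @ r
```

The exactness follows because $v_\mu = c_\mu + \alpha P_\mu v_\mu$ lies in the span of $c_\mu$ and the columns of $P_\mu$, so projection onto that span loses nothing.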
In each iteration, the independent set of basis can be selected from the firstQ + 1 columns of P μ and then concatenate with c μ ; and orthogonality can be simply achieved by Gram-Schmidt orthogonalization. The subspace ratio can be defined as the ratio between the size of the subspace and the original dimension. In our wireless network, the original dimension is (Q+1)×H, whereas the size of the subspace isQ+2. Thus the ratio is Q+2 (Q+1)H ≈ 1 H when Q,H are large. The efficiency increases as Q and H get larger. In Section 2.6, we will numerically show that such a subspace selection method achieves zero policy error as well as faster runtime compared with the standard policy iteration algorithm. 2.5 General Subspace Construction using GSP methods It can be observed that the design of subspace M is still application dependent. Besides, it may be a strict condition for the PTM of the wireless system to be low rank, or to have the same rank in each iteration. This serves as the motivation for the investigation of general methods, and Graph Signal Processing can be one possible solution. From the Bellman Equation (2.4), we notice that the probability transition matrix can be viewed as a transition graph, and the value function can be viewed as the corresponding graph signal. The theory of GSP in Section 2.2.2 provides a way of constructing a set of orthonormal basis vectors that can represent any 26 graph signal. The decomposition of the graph signal is given by its graph Fourier transform, i.e., the projection of the graph signal onto each basis. Compact representation of the graph signal can be achieved if it is smooth on the graph, i.e., most of the energy will be concentrated on the low frequency components. 
This can also be seen from the Bellman Equation (2.4), where there is a smoothing operation on the value function (the value function of each node is an averaging of the value functions of its neighboring nodes); therefore we can assume that the value function is a relatively smooth signal across the graph, so that it can be approximated by the low frequency basis.

In typical MDP settings, a major challenge is that the relationship captured by the edge directivity in directed graphs is fundamentally different from that of undirected graphs. To combat this challenge, the directed graph (PTM) is replaced with an undirected proxy that both preserves system information and has a computational advantage. Therefore, the general strategy is to find a proxy graph on which the value function is smooth; the subspace can then be formed by picking the low frequency eigenvectors of the Laplacian of the proxy graph. The construction of undirected proxies involves graph symmetrization [35, 39] or other methods, so that the GSP techniques designed for undirected graphs can apply.

Another challenge is that the probability transition matrix changes after each policy update in policy iteration. Generating the subspace anew in each iteration would incur high complexity. To reduce complexity, a fixed subspace is generated from functions of the averaged probability transition matrix $\bar{P}$, so that the key subspace is computed only once. The definitions of $\bar{P}$ and $\bar{c}$ are given by:

$$\bar{P}(s, s') = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} p(s, u, s'), \qquad \bar{c}(s) = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} c(s, u). \quad (2.20)$$

In the sequel, we demonstrate methods to construct undirected proxies given a directed graph $P$; they are described as follows:

2.5.1 SYM: Natural Symmetrization

The undirected proxy can be obtained as $A_1 = \frac{1}{2}(P + P^T)$. The Laplacian matrix is then defined as $L_1 = D_1 - A_1$, where $D_1$ is the degree matrix of $A_1$.
A set of eigenvectors is given by the eigenvalue decomposition of the Laplacian matrix $L_1$. Given the subspace size $K$, the subspace $M_1$ is then formed by selecting the eigenvectors associated with the smallest $K$ eigenvalues (corresponding to low frequencies, see Equation (2.8)). The advantage of this method is, of course, its simplicity: it retains the same set of edges but ignores edge directivity.

2.5.2 BIB: Bibliometric Symmetrization

In the analysis of document citations, due to the directed nature of the citation graph, a symmetrized graph can be obtained as $PP^T$, a.k.a. the bibliographic coupling matrix [49]; it measures the similarity of nodes that share similar out-links. Similarly, we can also obtain the co-citation matrix $P^T P$ [50], which emphasizes the significance of common in-links. Since there is no obvious reason to value out-links more than in-links, we take the sum of the two matrices and obtain the bibliometric symmetrization [35] matrix $A_2 = PP^T + P^T P$. Although this may result in an undirected graph with a completely different structure, the edge directivity is preserved. In addition, depending on the application and the emphasis on in-degree versus out-degree, the bibliometric matrix can also be constructed as $P_{2,\gamma} = \gamma PP^T + (1 - \gamma) P^T P$, $\gamma \in [0, 1]$; in this work, for simplicity, we set $\gamma = 0.5$. Notice that $A_2$ is a PSD matrix, thus picking the low frequency eigenvectors of its Laplacian matrix is equivalent to picking the eigenvectors of $A_2$ with large eigenvalues. Therefore, the subspace $M_2$ can be formed by selecting the eigenvectors associated with the largest $K$ eigenvalues.

2.5.3 AVF: Approximated Value Function Graph

The notion of an approximated value function graph first appeared in [39]. It is a virtual graph that measures the similarity of value functions between states. Denote by $d(s)$ the minimum number of hops from state $s$ to a predefined high cost region of the system.
The main idea is that two states $s$ and $s'$ are similar if $|d(s) - d(s')| \le 1$; a virtual link with a weight indicating the similarity then connects them. This similarity matrix is denoted by $W_a$. The construction of $W_a$ requires knowledge of the value function, but we cannot obtain the value function without actually implementing the standard algorithms. To avoid this complexity, we propose a method to estimate the value function and use the approximated value function to construct $W_a$. To this end, $\mathrm{sgn}(\bar{P})$ and $\mathrm{sgn}(\bar{c})$ are required, where $\bar{P}$ and $\bar{c}$ are defined in Equation (2.20) and $\mathrm{sgn}(\cdot)$ is the sign function. The construction of the AVF graph is described in Algorithm 1. The AVF graph $W_a$ serves as a hidden undirected graph for the system; therefore we can select eigenvectors of the corresponding Laplacian matrix to form the subspace. AVF was our early attempt to apply GSP to MDPs; we include this method [39] here for comparison.

2.5.4 Jordan Form

As a comparison to symmetrization, the Jordan form approach [33] is also considered. The Jordan decomposition of $\bar{P}$ provides another set of basis vectors for signals defined on directed graphs. Therefore, a set of low frequency basis vectors can be selected, with complex numbers mapped to real numbers using their magnitudes.

Algorithm 1: AVF graph construction
1. Define the high cost region $\mathcal{H} = \{s : \bar{c}(s) > \tau, \forall s \in \mathcal{S}\}$, with the threshold $\tau$ chosen such that $\mathrm{card}(\mathcal{H})/\mathrm{card}(\mathcal{S}) = 0.1$. Compute $d(s)$ for each state $s \in \mathcal{S}$.
2. Set all the non-zero entries of $\bar{P}$, $\bar{c}$ to 1, i.e., $\bar{P} := \mathrm{sgn}(\bar{P})$, $\bar{c} := \mathrm{sgn}(\bar{c})$; normalize $\bar{P}$ so that it is row stochastic.
3. Set $\bar{v}^{(0)} = \mathbf{1}$ and iteratively apply $\bar{v}^{(k+1)} = \bar{c} + \alpha \bar{P} \bar{v}^{(k)}$ until the following condition holds (the “SNR” between two consecutive value functions is above a threshold):

$$20 \log_{10} \left( \frac{\|\bar{v}^{(k+1)}\|_2}{\|\bar{v}^{(k+1)} - \bar{v}^{(k)}\|_2} \right) > 40.$$
4. Set v̄ = v̄^(k+1), and W_a(s, s') = exp{ −[v̄(s) − v̄(s')]^2 / (2σ^2) } if |d(s) − d(s')| ≤ 1 (σ^2 is the variance of v̄), ∀s, s' ∈ S.

Algorithm 2: Basis construction with the Jordan form
1. Perform the Jordan decomposition of P̄: P̄ = Y J Y^{-1}.
2. The set T_j is formed by the rule: y_i ∈ T_j if |λ_i − 1| < ε, ∀i, where ε is a threshold that controls the size of the subspace.
3. Set M_j = |T_j| (elementwise magnitudes, mapping complex entries to real numbers).
4. Perform Gram–Schmidt orthogonalization on M_j.

Gram–Schmidt orthogonalization is employed to tackle the non-orthogonality of the basis vectors; the procedure is described in Algorithm 2. A summary of all the methods is given in Table 2.1.

Table 2.1: Proposed methods for subspace design
Basis  | Prior            | Proxy                   | Description
SYM    | P̄                | (1/2)(P̄ + P̄^T)          | EVD of the Laplacian matrix of the proxy; pick eigenvectors with small eigenvalues
BIB    | P̄                | P̄ P̄^T + P̄^T P̄           | EVD of the bibliometric symmetrization; pick eigenvectors with large eigenvalues
AVF    | sgn(P̄), sgn(c̄)   | W_a                     | EVD of the Laplacian matrix of the proxy; pick eigenvectors with small eigenvalues
Jordan | P̄                | None                    | Jordan decomposition of P̄; pick eigenvectors whose eigenvalues are closest to (1, 0) in the complex plane

2.6 Numerical Results

We simulate the system in Section 2.3. The channel parameters come from a sensor network application [7, 51]; they are set as C_0 = 10^{0.17}, η = 4.7, l = 20 m, and P_rcv = −97 dBm. The packet arrival rate is p = 0.9 and the buffer can store at most 50 packets. In our simulations, the channel partition is set to [0, 1) ∪ [1, 2) ∪ ... ∪ [H−2, H−1) ∪ [H−1, +∞), with the midpoint of each interval representing the channel state. Therefore, the channel state can be characterized by a discrete random variable h ∈ {0.5, 1.5, ..., H − 0.5}, and the PMF of h can be calculated as:

P(h = n − 0.5) = ∫_{n−1}^{n} e^{−x} dx = e^{−(n−1)} − e^{−n},  n ∈ {1, 2, ..., H − 1},
P(h = H − 0.5) = ∫_{H−1}^{∞} e^{−x} dx = e^{−(H−1)}.   (2.21)

The channel is discretized into 40 states; thus, the total number of states is 2040.
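As a sanity check on the discretization in Equation (2.21), the following is a minimal numpy sketch that builds the channel PMF of the unit-mean exponential (Rayleigh power) and the midpoint states; the variable names are our own, and the closed forms telescope so the PMF is proper by construction:

```python
import numpy as np

H = 40                                    # number of channel states
n = np.arange(1, H)                       # finite bins [n-1, n), n = 1, ..., H-1
pmf = np.exp(-(n - 1.0)) - np.exp(-n)     # P(h = n - 0.5) for each finite bin
pmf = np.append(pmf, np.exp(-(H - 1.0)))  # tail bin [H-1, inf): P(h = H - 0.5)
states = np.arange(H) + 0.5               # interval midpoints 0.5, 1.5, ..., H - 0.5
```

Because the finite-bin masses telescope against the tail mass, `pmf.sum()` equals 1 exactly (up to floating-point rounding), regardless of H.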
In later sections, the performance of the different methods as a function of network size will be demonstrated. For different network (probability transition graph) sizes, Q and H are chosen so that they are close to each other. Typical parameter settings for the network sizes are: [100, 500, 900, 1200, 1600, 2000] = [10·10, 25·20, 30·30, 40·30, 40·40, 50·40].

Fig. 2.5 shows the value function and optimal policy for a particular β; in the optimal-policy plot, {0, 1} indicate silence and transmission, respectively. As shown in the figure, the monotonicity of the value function and the thresholded policy structure are clearly observed.

Figure 2.5. Pictorial representation of the value function and optimal policy. For the optimal policy, {0, 1} denote silence and transmission, respectively.

For Poisson arrivals, the numerical results are shown in Fig. 2.6. It is observed that the optimal policy also has a threshold form. As channel state information can be obtained via channel estimation, we can examine the effect of channel estimation error. The channel estimation error is modeled as e ~ N(0, σ^2) and the estimated channel state is h' = h + e. Let μ* be the optimal policy and μ' the policy determined under channel estimation errors. We compare the value functions under these two policies, v_{μ*} and v_{μ'}. We use the normalized error (NE) to measure the discrepancy between them; it is defined as NE = ||v_{μ*} − v_{μ'}|| / ||v_{μ*}||. A typical signal-to-noise ratio in the channel is 3 dB to 10 dB, with SNR_dB = 10 log10(1/σ^2), i.e., σ^2 ∈ [0.1, 0.5]. The normalized error is shown in Fig. 2.7. It can be observed that if good channel estimation is achieved, the performance (value function) is close to optimal.
For simplicity, channel quantization is applied; we can also consider a finer quantization:

[0, 1/N_0) ∪ [1/N_0, 2/N_0) ∪ ... ∪ [((H−1)N_0 − 1)/N_0, H−1) ∪ [H−1, +∞),   (2.22)

where N_0 tunes the "resolution" of the quantization. A larger N_0 yields more accuracy in the optimal policy, since the Rayleigh fading variable is continuous. Fig. 2.5 also shows a finer boundary when N_0 = 3; for example, given q = 40, it can be calculated that the threshold states for the original and finer quantizations are 12 and 35/3, respectively. Even in the absence of quantization, we have a thresholded policy; however, the state space then becomes uncountable, which is prohibitive for value or policy iteration computations.

As mentioned in Sec. 2.4, the simplification method for uncontrollable state components [14] can be applied; the numerical result is shown in Fig. 2.8. It can be clearly observed that the simplification dramatically reduces runtime complexity. However, the simplification method only works for special types of MDP problems in which there are uncontrollable state components. For the general MDP problems provided in Sec. 2.6.4, this method cannot be applied, whereas our GSP approach still works and achieves good performance.

Figure 2.6. Poisson arrivals: optimal policy. For the optimal policy, {0, 1} denote silence and transmission, respectively.
Figure 2.7. Normalized error versus σ^2, network size = 1200.

In the sequel, we numerically evaluate the "zig-zag" policy update (Section 2.3), low-rank subspace construction (Section 2.4) and GSP subspace construction (Section 2.5). To evaluate the performance under different network settings, the weighting factor β in Equation (2.13) is tuned, yielding different cost functions and thus different policies.
The cost for a packet drop in the cost function (2.13) is 1, and it can be calculated that the transmit power is of order O(10^{−4}) mW; thus, for a fair comparison, β is tuned from O(1) to O(10^4) (increasing the emphasis on transmit power).

Figure 2.8. Runtime comparison between original PI and simplified PI.

Since our major concern is optimal control, the policy error is defined as:

policy error = (1/N) Σ_{i=1}^{N} 1( μ̂(i) ≠ μ*(i) ),   (2.23)

where N is the total number of states, and the approximated policy and optimal policy are denoted by μ̂ and μ*, respectively.

2.6.1 Zig-zag Policy Update

The complexity of the modified algorithm can be measured by both the number of iterations to convergence and the runtime with respect to network size. From Fig. 2.9, it can be observed that, compared with the original policy iteration, the zig-zag policy update reduces the number of iterations by 50% for a fixed network size; as the network size increases, we obtain an even larger runtime reduction. In the zig-zag policy update algorithm, the exact value function is used for the policy update. It is also worthwhile to investigate the performance of the zig-zag policy update with an approximated value function; this performance will be shown in the sequel.

Figure 2.9. Number of iterations of zig-zag policy update, network size = 2040, versus β (left); runtime comparison versus network size (right).
Figure 2.10. Performance of the optimal basis selection method, network size = 2040, versus β.

2.6.2 Optimal Basis Selection

Fig. 2.10 shows the performance of the optimal basis selection method of Section 2.4.
In iteration k, the subspace M^(k) is formed by picking the independent columns of P_{μ_k} ⊕ c_{μ_k} and performing Gram–Schmidt orthogonalization. We see that the runtime varies for different β: for each β we start with a random initial policy, and this initial policy is the same for both policy iteration and the subspace approach. Still, the runtime improves by roughly 20%. Also, it should be noticed that we have perfect reconstruction of the policy, since the algorithm forces v̂ = v. The subspace ratio is (Q+2)/((Q+1)×H) = 52/(51×40) = 0.0255; clearly, we are constructing a subspace that is much smaller than the original size.

Figure 2.11. Performance of the subspace approach using GSP methods, with different network sizes, versus β.

2.6.3 Subspace Construction with GSP

The policy error is shown in Fig. 2.11, with the different GSP approaches and network sizes indicated by the labels. To further analyze the performance, a reference line labelled "Random" is introduced, for which the subspace is generated randomly. Due to the complexity of the Jordan decomposition, it is only implemented on a network with a small state space (size 35, Q = 6, H = 5), with a subspace size of 40% of the original size (see Fig. 2.12). It can be clearly observed that the Jordan form, while having reasonably good performance, suffers from high complexity; thus, scaling to larger networks is problematic. It can also be observed from Fig. 2.12 that, although the GSP methods incur some additional cost over policy iteration due to the computational overhead of subspace construction, the subspace is constructed only once and thus the overall runtime is close to that of classical policy iteration.
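The one-time subspace construction for the SYM and BIB proxies (Sections 2.5.1 and 2.5.2) can be sketched as follows; the toy row-stochastic matrix P stands in for the averaged PTM P̄, and the sizes N, K and all names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy row-stochastic matrix standing in for the averaged PTM P-bar.
N, K = 20, 4
P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)

# SYM: natural symmetrization; low frequencies = smallest Laplacian eigenvalues.
A_sym = 0.5 * (P + P.T)
L_sym = np.diag(A_sym.sum(axis=1)) - A_sym
w_sym, V = np.linalg.eigh(L_sym)          # eigh returns ascending eigenvalues
M_sym = V[:, :K]                          # K lowest-frequency eigenvectors

# BIB: bibliometric symmetrization; pick the K largest-eigenvalue eigenvectors.
A_bib = P @ P.T + P.T @ P                 # PSD by construction
w_bib, U = np.linalg.eigh(A_bib)
M_bib = U[:, -K:]
```

Since `eigh` returns orthonormal eigenvectors, both M_sym and M_bib come out orthonormal directly, with no extra Gram–Schmidt step needed.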
Figure 2.12. Left: runtime of subspace approaches with network size = 35, versus β. Right two figures: runtime of subspace approaches (with/without computational overhead for subspace construction), versus network size.

In networks with a large state space (see Fig. 2.11), where the network size is 930 (Q = 30, H = 30) and 2040 (Q = 50, H = 40), the subspace is chosen to be 10% of the original size. Clearly, the natural symmetrization performs better than the random approach, but it still has high policy error, because it ignores the directivity of the graph. The performance is further improved by the AVF graph approach because, instead of simple symmetrization, the undirected proxy graph is constructed by examining the similarity of the value function between nodes. Finally, and surprisingly, the best performance (zero policy error) is achieved by the bibliometric symmetrization. We conjecture that a possible reason for this good performance is that it not only considers the similarity between nodes in terms of transition probabilities, but also preserves directivity information by accounting for both in-degree and out-degree links.

Fig. 2.13 shows the overall performance (policy error and SNR) of all methods (network size = 2040). The policy error is defined in Equation (2.23) and the signal-to-noise ratio (SNR) is defined as:

SNR = 20 log10( ||v*||_2 / ||v* − v̂||_2 ),   (2.24)

where v* and v̂ denote the original value function and the approximated value function, respectively.
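A direct transcription of the SNR metric (2.24), with names of our own choosing:

```python
import numpy as np

def value_snr_db(v_star, v_hat):
    """SNR in dB between the exact and approximated value functions, Eq. (2.24)."""
    v_star = np.asarray(v_star, dtype=float)
    v_hat = np.asarray(v_hat, dtype=float)
    return 20.0 * np.log10(np.linalg.norm(v_star) / np.linalg.norm(v_star - v_hat))

# Toy example: signal norm 3, error norm 1 -> 20*log10(3) dB.
snr = value_snr_db([1.0, 2.0, 2.0], [1.0, 2.0, 1.0])
```

Note that a perfect reconstruction makes the denominator zero, which is exactly the "infinite SNR" shown as discontinuities in Fig. 2.13.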
Figure 2.13. Performance of subspace approaches using GSP methods, network size = 2040, versus β: (a) policy error; (b) SNR. In the SNR figure, the discontinuities represent infinite SNR.

As already discussed, the zig-zag policy update with the exact value function reduces the number of iterations to convergence; in addition, its SNR is very high (the discontinuities represent infinite SNR), so the policy error is negligible. However, when combined with the subspace approach, the error is not negligible, because an approximated value function is used, which has low SNR for the natural symmetrization and AVF methods. Still, the performance is better than that of the pure subspace approach.

2.6.4 More General Results

Two more examples are given for further validation of the GSP subspace methods. We note that the simplification method [14] cannot be applied here, due to the absence of uncontrollable state components. The first example is an equipment replacement model [36, Chapter 4.7.5], where the set of integers s = {1, 2, 3, ...} represents the condition of the equipment from good to bad. At each state, the decision maker chooses from the action set u ∈ {u_0, u_1}, where u_0 corresponds to operating the equipment for an additional period, and u_1 corresponds to replacing it immediately.

Figure 2.14. Left: policy error for the equipment replacement model, network size = 2000, versus R. Right: policy error for the random graph model.
The transition probabilities satisfy:

p(s, u, s') = 0,            for s' ≤ s, u = u_0;
            = g(s' − s),    for s' > s, u = u_0;
            = g(s'),        for u = u_1,   (2.25)

where g(·) is a predefined probability distribution; in our simulation, g is set to be a uniform distribution over the integers 1 to 10. The reward function consists of three parts: a fixed income of R units; a nondecreasing state-dependent cost c_1(s) = ζs, ζ = 0.01; and a replacement cost c_r = 5. To convert the reward maximization problem to a cost minimization problem and apply the definition of the value function in Equation (2.3), the cost function is obtained by taking the negative of the reward:

c(s, u) = −[R − c_1(s)],         for u = u_0;
        = −[R − c_r − c_1(s)],   for u = u_1.   (2.26)

It can be observed that the transition probabilities have a structured form, and it has also been proven that this example exhibits a thresholded policy, i.e., the equipment is replaced only when its condition is worse than a threshold state s*. Therefore, a modified policy iteration algorithm incorporating the threshold information can also be developed. We apply our general subspace design method to different network sizes and tune the reward R (so that we obtain different policies); the subspace is set to 10% of the original size, and the performance is shown in the left plot of Fig. 2.14. Although the bibliometric method does not achieve zero policy error, it is still the best among all the methods.

The previous examples focus on MDPs in which the PTM is a structured graph. A simulation on random graphs is also conducted, and the result is shown in the right plot of Fig. 2.14. The network size is 2000; the graph structure, transition probabilities and cost functions are all generated randomly, so that there is no policy structure to exploit. The subspace methods are therefore a pure GSP approach, and it can be observed that all methods achieve negligible policy error.

2.7 Conclusions

In this chapter, a point-to-point transmission control problem is formulated as an MDP.
The structure of the optimal policy is examined and a modified policy iteration algorithm is proposed, with which the number of iterations is reduced by half. Based on the application, a fast subspace construction method is also presented, with which we attain both complexity reduction (runtime improved by roughly 20%) and perfect reconstruction of the value function and policy. Furthermore, general subspace construction methods using GSP are proposed. Among all the methods, the subspace obtained by the eigenvalue decomposition of the bibliometric symmetrization of the averaged PTM gives the best performance (zero policy error).

Chapter 3
On Sampled Reinforcement Learning in Wireless Networks: Exploitation of Policy Structures

3.1 Introduction

Reinforcement learning is a classical tool for solving network control or policy optimization problems in unknown environments. In order to learn the optimal policy correctly, the classical Q-learning algorithm requires sufficient visits to all state-action pairs, resulting in the need for a large number of observations in the presence of a large state-action space. Nevertheless, complexity reduction can be achieved by exploiting the particular structure of the optimal policy. A sampled reinforcement learning algorithm is proposed, in which the optimal policy is estimated only for a subset of states; a machine learning technique, as well as a graph signal processing approach, is applied to interpolate the policy for unvisited states. A policy refinement algorithm is further proposed to improve the performance of policy interpolation. Performance analysis and bounds are also provided for the proposed policy sampling and interpolation algorithms. Numerical experiments on a single-link wireless network with a large state space show that the sampled Q-learning algorithm with policy interpolation achieves a much faster runtime with negligible performance loss compared to classical Q-learning.
The goal of reinforcement learning (RL) [1] is to learn good policies for sequential decision problems, where the challenge is to understand the impact of the current action on future costs when the objective is a sum of discounted costs up to some horizon. If there is an explicit model, the problem can be well modeled as a Markov decision process (MDP), and dynamic programming techniques such as value iteration and policy iteration [13, 14] can be employed to obtain the optimal policy. In the absence of known system dynamics, the agent must explore the unknown environment, trying different actions exhaustively in each state in order to unveil the best action, i.e., the one yielding the least cumulative cost. Q-learning [52] is one of the most popular RL algorithms: the agent updates the Q function after collecting multiple trajectories of states, actions and costs, and an estimate of the optimal policy can then be derived from the Q functions. In recent years, RL has been extensively employed in the design of wireless networks, finding application in multiple arenas, e.g., throughput maximization for energy harvesting [53], design of media access control (MAC) protocols [54], power control and rate adaptation [55], and resource allocation [56]. However, most of the prior work did not consider the attendant sample complexity of RL algorithms. Since networks with large state spaces are ubiquitous, it is of great importance to develop efficient algorithms that achieve sample complexity reduction. Function approximation can be applied in Q-learning to achieve dimension reduction, where the Q function is linearly approximated by a few basis functions (a.k.a. a feature vector) [1]. However, the drawback is that one must carefully hand-design the feature vector in order to obtain good performance, and it is typically not easy to find proper features for the Markov chain associated with the wireless network.
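To make the linear function approximation concrete: the Q table is replaced by an inner product between a small hand-designed feature vector and a learned parameter vector, so only a few parameters are stored instead of a full table. The particular features below are purely illustrative assumptions of ours, not those of any cited work:

```python
import numpy as np

def phi(s, u, n_states, n_actions):
    """Hand-designed feature vector phi(s, u): a normalized state index
    plus a one-hot action indicator (illustrative choice only)."""
    f = np.zeros(1 + n_actions)
    f[0] = s / n_states          # normalized state index
    f[1 + u] = 1.0               # action indicator
    return f

def q_value(s, u, theta, n_states, n_actions):
    """Linear approximation Q(s, u) ~ phi(s, u)^T theta."""
    return phi(s, u, n_states, n_actions) @ theta

# 3 parameters stand in for a full 8x2 Q table.
theta = np.array([0.5, 1.0, -1.0])
q = q_value(s=4, u=0, theta=theta, n_states=8, n_actions=2)
```

The dimension reduction is the point: here 3 parameters replace 16 table entries, but the quality of the approximation hinges entirely on how well the features capture the true Q function.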
In this chapter, we seek a general way to construct a proper subspace for reduced-dimension RL. Graph signal processing (GSP) techniques can be applied when the RL problem is mapped to an MDP (i.e., when the system dynamics are known), since GSP provides analysis and design tools for irregular graph-based signals, and the value functions in MDPs can be viewed as graph signals defined on the probability transition graph. Due to the directed nature of the probability transition graph, the direct application of GSP to MDPs/RL is challenging, since most prior work in GSP focuses on undirected graphs. Although there exist methods for the analysis of graph signals on directed graphs [32, 33], they suffer from high complexity in constructing the proper subspace, as well as the lack of a proper mapping from the complex to the real field, which is necessary for our policy optimization problem. By constructing undirected proxies of the directed graphs, several reduced-dimension algorithms have been developed [6, 16, 19] to solve wireless network control problems. However, it is difficult to apply those methods in the RL setting, since the transition probabilities are unknown. Instead of focusing on the logical transition graph and the value functions as graph signals, for RL we take the novel perspective of treating the policy itself as the graph signal and constructing the corresponding "image" graph, so that the methods for undirected graphs can be applied. Thus, our goal is to combine GSP and RL to develop low-complexity algorithms for wireless network control. We consider an RL problem for a point-to-point transmission network. Our previous work [57] showed that the optimal policy has a special thresholded structure; this is a common property in many network applications [26–28]. The previous chapter only proposed a "zig-zag" policy update algorithm based on the thresholded policy structure, and no GSP technique was involved.
In this chapter, based on the "image" graph, a GSP technique is proposed for policy interpolation. We also develop a method to sample the policy and interpolate to obtain the remainder of the policy, as well as multiple strategies to improve the estimate of the optimal policy. We further provide a theoretical analysis of how the performance of the policy sampling and interpolation algorithms scales with network size. There has been recent interest in the use of neural networks to solve a variety of non-linear optimization problems. We compare the performance of our proposed methods to that of a deep Q network (DQN) [58], which uses a neural network to learn the Q functions. The DQN offers inferior performance despite efforts at parameter tuning, and requires much more training than our proposed method. The main contributions of this chapter are:

• A value-function-based policy sampling algorithm is proposed, with GSP/machine learning exploiting the structure of the optimal policy for policy interpolation. This algorithm achieves significant runtime complexity reduction (up to 60%) compared to the classical Q-learning algorithm.
• A reduced-complexity Q-learning algorithm is proposed to further reduce runtime complexity (up to 80%) while the policy error remains similar to that of classical Q-learning.
• A policy refinement algorithm is proposed to further reduce the policy reconstruction error (by roughly 50%).
• The decay rate of the averaged performance of the policy sampling and interpolation algorithms is derived and shown to match that of the numerical results.
• Our proposed algorithms outperform the DQN approach (roughly 80% error reduction).

The rest of the chapter is organized as follows: Section 3.2 provides background on Q-learning. Section 3.3 describes the policy sampling, interpolation and refinement algorithms. Analytical and numerical results are given in Sections 3.4 and 3.5, respectively. Section 3.6 concludes the chapter.
Proofs of the technical results are provided in Appendix B.

3.2 Preliminaries on Q-learning

In this section, key background on Q-learning is provided. For notational clarity, column vectors are in bold lower case (e.g., x); matrices are in bold upper case (e.g., A); sets are in calligraphic font (e.g., S); and scalars are non-bold (e.g., α, a). The efficiency of dynamic programming in solving MDP problems relies on the exact evaluation of the value functions using the Bellman fixed-point equation [14], which requires full knowledge of the system. Without such information, value function evaluation is challenging and the original problem falls into the category of RL. Fortunately, Q-learning is a model-free, online algorithm that addresses this challenge. The main objective in RL is the same as that of MDP problems, i.e., to find an optimal policy that minimizes the value function (2.3). Q-learning learns the Q function Q(s, a) (updated via (3.1) below) for each state-action pair, so as to measure the "quality" of taking a particular action in a given state. As the name suggests, the Q functions are updated through the interaction between the agent and the environment, which typically consists of sequences of episodes. Each episode is a state transition trajectory of finite length that usually ends in a particular set of terminal states. The episode length is random due to random transitions, but can be set to a constant for systems without explicit terminal states. After observing the cost at each transition, the Q function is updated by the following rule:

Q(s, u) ← Q(s, u) + θ [ c(s, u) + α min_{a∈U} Q(s', a) − Q(s, u) ],   (3.1)

where s and s' are the current and next state, respectively, u is the action taken at state s, and θ is the learning rate.
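A minimal tabular transcription of the update rule (3.1) for a cost-minimization MDP (the min over actions replaces the usual max over rewards); the array shapes, parameter values and names are our own:

```python
import numpy as np

def q_update(Q, s, u, cost, s_next, theta=0.1, alpha=0.9):
    """One tabular Q-learning step, Eq. (3.1), for a cost-minimization MDP."""
    target = cost + alpha * Q[s_next].min()   # bootstrap with the best (cheapest) action
    Q[s, u] += theta * (target - Q[s, u])
    return Q

Q = np.zeros((3, 2))                          # toy table: 3 states, 2 actions
Q = q_update(Q, s=0, u=1, cost=1.0, s_next=2)
```

Starting from an all-zero table, this single step moves Q(0, 1) to θ · c(0, 1) = 0.1 while leaving every other entry untouched, which is exactly the local, per-transition nature of the rule.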
To balance exploitation and exploration, an "ε-greedy" action selection scheme can be employed: with probability 1 − ε the agent selects the action that minimizes the current Q(s, u) (exploitation), and with probability ε it chooses among the remaining actions randomly (exploration). A large number of transitions is required to obtain good estimates, and the Q function has been shown to converge to Q* with probability 1 as the number of transitions goes to infinity [1]; afterwards, the optimal policy can be computed as:

μ*(s) = arg min_{u∈U} Q*(s, u).   (3.2)

3.3 Policy Sampling, Interpolation and Refinement

In the classical Q-learning algorithm [1], the Q functions for all state-action pairs must be learned; the overall complexity is high and the convergence rate is slow for a large state space. Given the thresholded structure of the optimal policy, if correct policy estimates can be made for a subset of states, a global policy for all states can be obtained via a proper interpolation algorithm, achieving complexity reduction. Refinement algorithms can also be developed to further reduce the policy reconstruction error. We propose such methods herein.

3.3.1 Policy Sampling

We seek a subset of states to select. Since the states (q, h) can be placed on a 2D plane, good exploration is achieved by selecting the sampled states randomly on the plane, so that an overall summary/distribution of the policy can be obtained. To be more precise, we assume the states are ordered lexicographically (first h, then q). Given a sampling budget Ψ ∈ (0, 1), the indices of the sampled states in S_s are selected randomly as follows:

S_s = S_r(1 : ⌈ΨN⌉)  s.t.  S_r = rndperm(S),  S = {1, 2, ..., N},   (3.3)

where S consists of the indices of all states, rndperm(·) is a random permutation function, and ⌈·⌉ is the ceiling function. S_s is constructed by selecting the first ⌈ΨN⌉ elements of S_r.
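The sampling rule (3.3) is simply a truncated random permutation; a short numpy sketch (0-based indices and all names are ours):

```python
import numpy as np
from math import ceil

def sample_states(N, psi, rng):
    """Eq. (3.3): randomly permute {0, ..., N-1} and keep the first ceil(psi*N) indices."""
    return rng.permutation(N)[:ceil(psi * N)]

# Example matching the text: N = 1230 states, sampling budget psi = 0.2.
rng = np.random.default_rng(0)
S_s = sample_states(N=1230, psi=0.2, rng=rng)
```

Permuting before truncating guarantees the sampled indices are distinct and uniformly spread over the state space, which is the "overall summary of the policy" the text asks for.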
The estimate of the optimal policy for a particular state s ∈ S_s can be obtained either from estimates of the value functions v̂_μ(s) under different policies, or from estimates of the Q functions Q̂(s, u), ∀u ∈ U. We consider two policy sampling algorithms based on these approaches.

Value Function Estimation

From the optimality condition of an MDP [14], the optimal policy satisfies:

μ*(s) = arg min_μ v_μ(s),   (3.4)

which reveals the necessity of computing value functions under different policies. Notice that the Bellman fixed-point equation cannot be applied here to evaluate the exact value function. An alternative is to apply the original definition of the value function (2.3), i.e., under the current policy μ_i, starting from each s ∈ S_s, we
The settings for the other system parameters can be found in the numerical results section (Sec. 3.5). With a sampling budget of Ψ =|S s |/|S| = 0.2 (i.e., only 20% of the states are sampled), the sampled policy is far from any sampled version of the optimal policy if the input policy is generated randomly (see Fig. 3.1(a)); this sparse random sampling yields a policy that is far from the optimal policy (Fig. 3.1(a)). However, since the optimal policy has the thresholded form, the input policies can be restricted to be thresholded rather than random. To generate a random 48 0 10 20 30 40 Buffer 5 10 15 20 25 30 Channel Sampled policy: random policy input silence transmit (a) 0 10 20 30 40 Buffer 5 10 15 20 25 30 Channel Sampled policy: random thresholded policy input silence transmit boundray (b) Figure 3.1. Sampled policy for subset of states with random input policy (a) and random thresholded input policy (b); sampling budget Ψ = 20%. thresholded input policy, we only need to identify its boundary states. Starting from state (Q, 1) for t = 1, a sequence of states can be generated in the following way until the boundary states are reached: (q t+1 ,h t ) = (q t − 1,h t ), w.p. δ, (q t ,h t + 1), w.p. 1−δ, (3.7) where δ∈ (0, 1). Such an operation will yield a sequence of states that form a border, separating the two actions and hence the thresholded policy is obtained. Different δ can be selected to yield different thresholded policies and we select 10 values of δ uniformly from 0.2 to 0.9 which results in a rich enough set of boundary shapes and thus policies. The impact of the choice of different values of δ is shown in Fig. 3.2 and 3.3. With such a constraint, the sampled policy exhibits much better behavior: the recovered policy is close to the optimal policy (see Fig. 3.1(b)). 
Clearly, the complexity of the policy sampling algorithm, as well as the performance of policy recovery, depends on the hyperparameters: the number of input policies M, the number of sample trajectories K per state, and the trajectory length T.

Figure 3.2. Random thresholded input policy generation.
Figure 3.3. Generated input policies for different δ.

To improve performance, it is desirable to increase the values of the parameters listed above; however, the sample complexity of the algorithm then increases. For example, denote by v(S_s) the vector of value functions of the sampled states S_s, and define the normalized error as ||v̂(S_s) − v(S_s)|| / ||v(S_s)||. The complexity-performance trade-off is clearly shown in Fig. 3.4: as K and T increase, the normalized error decreases. To tackle this issue, instead of sampling the value function and estimating the policy according to Equations (3.5) and (3.6) (an on-policy approach), we simplify the algorithm and directly learn the Q values for the subset of states (an off-policy approach), from which the policy can be derived directly.

Sampled Q-learning

The sampled Q-learning process is adapted from classical Q-learning, but with the constraint that the starting state of each episode must be a state in the sampling subset S_s. In addition, Q-learning only achieves Q* asymptotically, i.e., it requires an infinite number of visits to each state-action pair. Hence, for quick estimation,

Figure 3.4. Normalized error w.r.t. K and T with the other parameter fixed.
0 50 100 150 Number of visits 0.2 0.4 0.6 0.8 1 Q values for state (10,10) Number of states: 1200 Learned Q values Optimal Q values (a) Q values for state (10,10) with network size 1200 0 50 100 150 Number of visits 0.4 0.5 0.6 0.7 0.8 0.9 1 Q values for state (15,10) Number of states: 1200 Learned Q values Optimal Q values (b) Q values for state (15,10) with network size 1200 0 50 100 150 Number of visits 0 0.2 0.4 0.6 0.8 1 Q values for state (10,10) Number of states: 1500 Learned Q values Optimal Q values (c) Q values for state (10,10) with network size 1500 Figure 3.5. First two figures: Q values as a function of number of visits to state (10,10) and (15,10) in the wireless network, number of network states = 1200. Third figure: Q values for state (10,10) in wireless network with size 1500. the termination condition for the sampled Q-learning is weakened compared to classical Q-learning algorithm. Since the ultimate goal is to estimate policies for s∈S s , policy estimation can be potentially done withoutQ ∗ . For example, given a particular state s 0 with Q ∗ (s 0 ,u 1 ) > Q ∗ (s 0 ,u 2 ), meaning that μ ∗ (s 0 ) = u 2 . Correct estimation can still be made if there are some estimated Q functions for s 0 that satisfy ˆ Q(s 0 ,u 1 ) > ˆ Q(s 0 ,u 2 ), then the estimated policy ˆ μ(s 0 ) =u 2 is the same asμ ∗ (s 0 ). This implies that strong convergence is not necessary, and the policy can be derived through Q functions that have been updated only for a finite number of times. Therefore, in the sampled Q-learning algorithm, the condition for algorithm termination is modified such that a counter CT (s,u),∀s∈S s is introduced. As the Q functions are updated according to Equation (3.1), the counter for that 51 particular (s,u) will increase correspondingly. The algorithm terminates when CT (s,u) > CT 0 , ∀s ∈ S s , ∀u ∈ U, where CT 0 is an appropriately selected threshold. 
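A minimal sketch of this counter-based termination rule on a toy MDP follows (the toy costs, uniform transitions, and uniform exploration are illustrative assumptions; the thesis applies the rule to the wireless network model with the Q update of Equation (3.1)):

```python
import random

def sampled_q_learning(S_s, costs, P, alpha=0.9, theta=0.1, CT0=50, ep_len=10, seed=0):
    """Sketch of sampled Q-learning with visit counters CT(s, u).

    S_s   : sampled subset of states; every episode starts from one of them
    costs : costs[s][u], immediate cost of action u in state s (minimized)
    P     : P[s][u], list of next-state probabilities
    Stops once every (s, u) with s in S_s has been visited at least CT0 times.
    """
    rng = random.Random(seed)
    nS, nU = len(costs), len(costs[0])
    Q = [[0.0] * nU for _ in range(nS)]
    CT = {(s, u): 0 for s in S_s for u in range(nU)}
    while min(CT.values()) < CT0:
        s = rng.choice(sorted(S_s))              # episode starts inside S_s
        for _ in range(ep_len):
            u = rng.randrange(nU)                # exploratory action choice
            s2 = rng.choices(range(nS), weights=P[s][u])[0]
            target = costs[s][u] + alpha * min(Q[s2])
            Q[s][u] += theta * (target - Q[s][u])
            if s in S_s:
                CT[(s, u)] += 1                  # count updates of sampled pairs
            s = s2
    return Q, CT
```

Only the counters of pairs with s ∈ S_s gate termination, so the Q values of unsampled states may still be rough; this is precisely why the policy is then only read off on S_s and interpolated elsewhere.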
The learning rate in the Q-learning algorithm is the same for all states; however, due to the random transitions, the number of visits to each state-action pair differs. Therefore, given a fixed duration for learning, the Q values of some states will be close to convergence, but this may not be true for all states. Based on cross-validation, a visit threshold for the states in S_s is set, with the assumption that once the visit threshold has been met, the Q values of that state are close to their optimal values and thus a near-optimal policy can be derived. Through cross-validation, we determined that a visit threshold of 50 visits was sufficient for the various network sizes we considered. In fact, Fig. 3.5 shows the learned Q values for particular states in networks of size 1200 and 1500; it can be observed that the Q values have essentially converged to the true values once the number of visits exceeds 50. This value appears to hold irrespective of network size, and we use the same threshold for all the simulations. The estimated policy for s ∈ S_s can be derived using Equation (3.2), and the sampled policy exhibits a structure similar to Fig. 3.1(b).

3.3.2 Policy Interpolation

In this subsection, two algorithms for policy interpolation are proposed.

SVM Approach

Given the nicely clustered structure of the sampled data in Fig. 3.1(b) and the threshold structure of the optimal policy, a natural way to perform policy interpolation is to apply a support vector machine (SVM) [59] with a polynomial kernel. The soft-margin SVM is applied; it can be formulated as the following optimization problem:

min_w ||w||^2 + ζ Σ_{i=1}^{N_0} max(0, 1 − y_i f(z_i)),  s.t. f(z) = w^T Φ(z) + b,   (3.8)

where (w, b) determines a hyperplane in some high-dimensional space, and y_i is the label of data point z_i. In our application, y_i and z_i correspond to the transmit/silence labels and the sampled state's coordinates, respectively.
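For intuition about the kernel behind Problem (3.8): a degree-2 polynomial kernel implicitly works in the feature space of an explicit map Φ satisfying Φ(x)·Φ(y) = (x·y + c_0)^2. Below is a small numpy check of this identity (a standard construction, not code from the thesis):

```python
import numpy as np

def phi_poly2(z, c0=1.0):
    """Explicit degree-2 polynomial feature map.

    Built so that phi(x) @ phi(y) == (x @ y + c0) ** 2, the polynomial
    kernel the soft-margin SVM evaluates implicitly.
    """
    z = np.asarray(z, dtype=float)
    n = len(z)
    feats = [z[i] ** 2 for i in range(n)]                    # squared terms
    feats += [np.sqrt(2.0) * z[i] * z[j]                     # cross terms
              for i in range(n) for j in range(i + 1, n)]
    feats += [np.sqrt(2.0 * c0) * z[i] for i in range(n)]    # linear terms
    feats.append(c0)                                         # constant term
    return np.array(feats)

x, y = np.array([2.0, 3.0]), np.array([1.0, -1.0])
lhs = phi_poly2(x) @ phi_poly2(y)
rhs = (x @ y + 1.0) ** 2
```

The kernel trick means the SVM never forms Φ explicitly; the map is shown only to make the "high-dimensional space" in (3.8) concrete.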
The parameter ζ specifies the trade-off between the margin size and the error tolerance, and Φ(·) is the degree-2 polynomial feature mapping for each data point, which usually takes the following form:

Φ(z) = (z_n^2, ..., z_1^2, √2 z_n z_{n−1}, ..., √(2c_0) z_n, ..., c_0),   (3.9)

where c_0 is a predefined constant. There are toolboxes for solving the SVM (e.g., in Matlab) and, due to space constraints, the details of solving Problem (3.8) are omitted. The SVM approach immediately identifies a boundary that separates the transmit region from the silence region (see Fig. 3.1(b)), and thus the estimated policy μ̂* is obtained.

GSP Approach

A second strategy for policy interpolation is to view the sampled policy signal as a partial signal on an image graph (see Fig. 3.6(a)). The adjacency matrix A of the image graph is defined as:

A_{i,j} = 1 if d(i, j) = 1, and 0 otherwise,   (3.10)

where d(i, j) is the l_1 distance between i and j on the lattice.

Figure 3.6. From left to right: image graph structure, sampled policy signal, interpolated policy signal, final policy after thresholding.
Figure 3.7. Decomposition of total variation on the lattice graph.

The image graph is simple, undirected, and has regular connectivity, so GSP theory can be applied. For any valid policy signal x with x_i ∈ {0, 1}, the total variation (see (2.8)) under the image graph can be expressed as:

TV(x) = x^T L_0 x = Σ_{i,j} A_{i,j} (x_i − x_j)^2,   (3.11)

where L_0 is the graph Laplacian of the image graph with adjacency matrix A. Given the image graph structure, the summation in Equation (3.11) can be decomposed into summations along the vertical and horizontal directions (see Fig. 3.7).
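The image-graph machinery can be sketched numerically: the snippet below builds the lattice Laplacian of (3.10), contrasts the total variation x^T L_0 x of a thresholded (column-step) signal with that of a maximally oscillating checkerboard, and shows how minimizing x^T L_0 x with the sampled entries fixed reduces to a linear solve on the unsampled vertices, the constrained interpolation formulated in this section (the grid size and sample set below are illustrative assumptions):

```python
import numpy as np

def lattice_laplacian(H, W):
    """Graph Laplacian L_0 of an H x W image (lattice) graph, per Eq. (3.10):
    vertices are grid points, adjacent iff their l1 grid distance is 1."""
    N = H * W
    A = np.zeros((N, N))
    for r in range(H):
        for c in range(W):
            i = r * W + c
            if c + 1 < W:                       # horizontal edge
                A[i, i + 1] = A[i + 1, i] = 1
            if r + 1 < H:                       # vertical edge
                A[i, i + W] = A[i + W, i] = 1
    return np.diag(A.sum(axis=1)) - A

H, W = 4, 6
L0 = lattice_laplacian(H, W)

# x^T L_0 x sums (x_i - x_j)^2 over edges: a thresholded column-step signal
# has far smaller total variation than an alternating checkerboard.
step = np.array([[1.0 if c >= 3 else 0.0 for c in range(W)] for _ in range(H)]).ravel()
checker = np.array([[(r + c) % 2 for c in range(W)] for r in range(H)], float).ravel()
tv_step, tv_checker = step @ L0 @ step, checker @ L0 @ checker

# Interpolation with sampled entries fixed: solve L_UU x_U = -L_US x_S
idx = np.arange(0, H * W, 2)                    # illustrative 50% sample set
U = np.setdiff1d(np.arange(H * W), idx)
x = np.zeros(H * W)
x[idx] = step[idx]
x[U] = np.linalg.solve(L0[np.ix_(U, U)], -L0[np.ix_(U, idx)] @ x[idx])
policy = (x > x.mean()).astype(float)           # final thresholding at the mean
```

On this toy grid the interpolated signal takes values in [0, 1] with mild oscillation near the boundary column, and thresholding at the mean recovers the step exactly, mirroring the pipeline of Fig. 3.6.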
For any valid policy signal, the maximum total variation is achieved by alternating the values 0 and 1 on adjacent vertices of the graph, which yields a total variation of QH + (Q + 1)(H − 1) (horizontal plus vertical summation); whereas for a thresholded policy signal, the total variation is upper bounded by Q + 1 + H, since each row or column sum is upper bounded by 1. This is small compared to the maximum total variation, so a thresholded policy signal is, in fact, a smooth/low-pass graph signal. Therefore, a graph signal interpolation problem can be formulated as the following optimization problem:

min_x  x^T L_0 x   s.t.  x(s) = μ̂(s), ∀s ∈ S_s,   (3.12)

where μ̂ is the sampled policy signal. Denote the sampling matrix by D, where D_{s,s} = 1, ∀s ∈ S_s, and 0 elsewhere. The equality constraint can then be expressed in the compact form Dx = μ̂, and the closed-form solution (via Lagrange multipliers) to Problem (3.12) is x* = L_0^{−1} D^T (D L_0^{−1} D^T)^{−1} μ̂, where (·)^{−1} denotes the proper matrix inverse/pseudoinverse. Problem (3.12) has a convex objective function and equality constraints, so it can also be solved by the standard CVX toolbox. The interpolated policy signal is discrete except for some oscillations around the boundary; hence a final thresholding of x is necessary. We set the threshold to the mean of x, since the mean is the optimal threshold for a binary detection problem with equal priors and symmetric pdfs. The overall process of policy interpolation using GSP is shown in Fig. 3.6. The sampling algorithms share the same sampling budget Ψ, and the interpolation algorithms share the same sampled data as input. The policy sampling and interpolation algorithms are summarized in Algorithm 3 and Algorithm 4.

Algorithm 3 Value function based policy sampling and reconstruction
Input: sampling budget Ψ, input policies {μ_1, μ_2, ..., μ_M}.
1: for i = 1 to M do
2:   Run the system under policy μ_i; estimate the value function v̂_i(s) for s ∈ S_s from sample trajectories (3.5).
3: end for
4: Obtain an estimate of the optimal policy μ̂(s) for s ∈ S_s using v̂_i(s) and μ_i, i ∈ {1, 2, ..., M} (3.6).
5: Perform policy interpolation to obtain the policy μ̂*(s), ∀s ∈ S (SVM (3.8) or GSP (3.12)).
Output: policy μ̂*.

Algorithm 4 Sampled Q-learning and policy reconstruction
Input: sampling budget Ψ; CT(s, u) = 0, ∀s ∈ S_s; counter threshold CT_0.
1: while ∃s ∈ S_s, ∃u ∈ U with CT(s, u) < CT_0 do
2:   Generate episodes of length 10; update the Q functions (3.1) and the counters CT(s, u) correspondingly.
3: end while
4: Estimate the policy: μ̂(s) = arg min_u Q(s, u), ∀s ∈ S_s.
5: Perform policy interpolation with SVM (3.8) / GSP (3.12).
Output: policy μ̂*.

3.3.3 Policy Refinement

We observe that both the SVM and GSP approaches to policy interpolation yield a coarse estimate of the policy boundary. Starting from the coarse boundary, further refinement can be made to reduce the policy error. We assume that the coarse boundary is close to the optimal boundary, so a finer boundary can be obtained if good estimates of the Q functions are available for the states near the boundary. Following this main idea, we present two policy refinement algorithms.

Cushion Method

As the name suggests, a cushion region S_c is created around the coarse boundary, and the Q functions for the states in S_c are learned in a manner similar to the sampled Q-learning algorithm described in Sec. 3.3.1.

Figure 3.8. Pictorial representation of the cushion method (a) and boundary perturbation (b).

Algorithm 5 Policy refinement with cushion method
Input: policy boundary, width of the cushion d.
1: Generate a cushion region S_c of width d with the coarse boundary at its center.
2: Learn the Q functions for the states inside the cushion S_c.
3: for h = 1 to H do
4:   Identify the range of q such that (q, h) ∈ S_c, where q ∈ [q_min, q_max].
5:   Find the smallest q_0 ∈ [q_min, q_max] such that the optimal action is to transmit, i.e., Q[(q_0, h), 0] > Q[(q_0, h), 1].
6: end for
Output: refined boundary and the corresponding policy.

The refined boundary can then be determined by Equation (3.2); a detailed description of this method is given in Algorithm 5.

Boundary Perturbation

Instead of learning the Q functions in the cushion region S_c, a direct perturbation can be applied to the coarse boundary. Given the current boundary state (q, h), the policy can be derived after its Q functions are learned. There are two basic scenarios:

• If the optimal action at (q, h) is to transmit, set q := q − 1 until the optimal action is to remain silent or q = 0.
• If the optimal action at (q, h) is to remain silent, set q := q + 1 until the optimal action is to transmit or q = Q.

It should be noted that new Q functions need to be learned along the new boundary after each perturbation. The boundary perturbation algorithm is given in Algorithm 6.

Algorithm 6 Policy refinement with boundary perturbation
Input: policy boundary
1: while two consecutive boundaries are different do
2:   Learn the Q functions for the current boundary states.
3:   for h = 1 to H do
4:     if Q((q, h), 0) > Q((q, h), 1) then
5:       if the previous action is transmit then
6:         q := q − 1.
7:       else
8:         the thresholding state is fixed for h: q_th = q.
9:       end if
10:    else
11:      if the previous action is silence then
12:        q := q + 1.
13:      else
14:        the thresholding state is fixed for h: q_th = q + 1.
15:      end if
16:    end if
17:  end for
18: end while
Output: new policy boundary and corresponding policy.

The pictorial representation of both methods is shown in Fig. 3.8. The cushion method is slightly more complex than boundary perturbation (this is shown numerically in Sec. 3.5, Fig. 3.13).
This is because the cushion method samples more state-action pairs. To summarize, the proposed sampling, interpolation and refinement methods are listed in Table 3.1. There are two methods for each layer, and thus a total of 2^3 = 8 combinations.

Table 3.1. Policy sampling, interpolation and refinement methods
Sampling: value function based / sampled Q-learning
Interpolation: SVM / GSP
Refinement: cushion method / boundary perturbation

Since our main focus is optimal control, the policy error, already defined in Equation (2.23), is used to measure how close the recovered policy is to the optimal policy.

3.4 Performance Analysis

In this section, analytical results are provided to validate the performance of our proposed algorithms. We mainly focus on the performance of the policy sampling and interpolation algorithms, since they are more tractable. We simulate the system described in Sec. 2.3. All experiments were performed in Matlab R2018b, running on a MacBook Pro with an Intel Core i7 2.5 GHz CPU and 16 GB RAM. The packet arrival probability is p = 0.9, and the parameters for the channel in the wireless network come from a sensor network application [7]; they are set as: C_0 = 10^{0.17}, η = 4.7, l = 20 m, and P_rcv = −97 dBm. For the policy sampling algorithm using value functions, the sampling budget is set to Ψ = 20% and the number of input policies to M = 10; for each state in the sampling set, we observe K = 5 trajectories of length T = 100 (the discount factor is α = 0.9, and α^100 = O(10^{−5})). The learning rate for Q-learning can be either θ = 0.1 or θ = 1/n(s, a), where n(s, a) is the number of visits to the pair (s, a). Recall that in Sec. 3.3.1 we determined a common termination threshold via cross-validation for both classical Q-learning and our algorithm.
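As a quick check of the truncation length: with a constant unit cost per step, the exact discounted value is 1/(1 − α), and a length-T truncated rollout misses exactly α^T/(1 − α); for α = 0.9 and T = 100 the missing tail is of order 10^{−4}, consistent with α^100 = O(10^{−5}). A minimal sketch (the constant-cost chain is an illustrative assumption):

```python
def truncated_return(cost, alpha, T):
    """Discounted return of a length-T rollout with constant per-step cost."""
    return sum((alpha ** t) * cost for t in range(T))

alpha, T = 0.9, 100
exact = 1.0 / (1 - alpha)            # infinite-horizon value for unit cost
truncated = truncated_return(1.0, alpha, T)
bias = exact - truncated             # geometric tail: alpha^T / (1 - alpha)
```

The same tail bound holds for any bounded per-step cost up to the cost's magnitude, which is why T = 100 suffices for the Monte Carlo value estimates above.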
For the sampled Q-learning algorithm, the visit threshold for each state-action pair in the sampled subset S_s is set to 50, and the learning rate can be either θ = 0.1 or θ = 1/n(s, a), where n(s, a) is the number of visits to the pair (s, a).

3.4.1 Complexity of Policy Sampling

To obtain a good estimate of the policies, sufficiently many visits to state-action pairs are required. Once the estimated policy is obtained, the policy interpolation and refinement steps are computationally much cheaper. In addition, simulation results reveal that the majority of the complexity lies in the policy sampling phase. Thus, a rough complexity comparison can be conducted by comparing the numbers of visits to state-action pairs. The convergence of the classical Q-learning algorithm has been studied in [60], where the number of visits to states is used to reflect the overall complexity. Denote by Q_L the learned Q functions after L visits of states. The main result in [60, Theorem 5] indicates that ||Q_L − Q*||_∞ < ε with probability 1 − δ after

L = Ω( ln( |S||U| / (δκε) ) / (κε)^2 )   (3.13)

visits, where |·| denotes the cardinality of a set, κ = (1 − α)/2, and Ω(·) is the Big Omega notation. We provide a simplification of the original result to focus on the dominant term. For a fair comparison, we also consider the number of visits for the policy sampling algorithms. The sampled Q-learning clearly achieves a complexity reduction, since the Q functions are learned only for a subset of states. For the value function based policy sampling, the number of visits equals Ψ|S|KTM, since for each of the Ψ|S| sampled states we estimate the value functions under M policies, generating K trajectories of length T per state. To better visualize the sample complexity, consider the system with network size (number of states) 1200.
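Plugging the simulation settings into the two visit counts gives a back-of-the-envelope check of this comparison (only the dominant term of the Ω(·) bound is evaluated, since the hidden constant is unknown):

```python
import math

# Settings from Sec. 3.4: |S| = 1200 states, |U| = 2 actions (transmit/silence)
S_size, U_size = 1200, 2
alpha, eps, delta = 0.9, 0.1, 0.1     # discount, accuracy, failure probability
kappa = (1 - alpha) / 2               # kappa = 0.05

# Dominant term of the classical Q-learning visit bound (Eq. 3.13)
L = math.log(S_size * U_size / (delta * kappa * eps)) / (kappa * eps) ** 2

# Value-function-based policy sampling: Psi*|S| states, M policies,
# K trajectories of length T per state
psi, K, T, M = 0.2, 5, 100, 10
visits_vf = round(psi * S_size) * K * T * M
```

The dominant term evaluates to roughly 6×10^5 and the value-function count to 1.2×10^6, matching the magnitudes used in the comparison.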
After finite termination of Q-learning, we observe that ε ≈ 0.1; we also set δ = 0.1 and apply the parameters to Equation (3.13), which yields roughly L ≈ 6.1 × 10^5 visits. As our complexity analysis is based on big Ω, the true complexity may be larger; in fact, from simulation the actual number of visits is roughly 2 × 10^6. The number of visits for the sampled Q-learning is roughly 3 × 10^5 from simulation, and for the value function based policy sampling the number of visits equals |S|ΨKTM = 1.2 × 10^6. Hence, a clear sample complexity reduction is observed (40% reduction). The overall comparison is shown in Table 3.2. In the simulations, the runtime of the overall algorithm is used as a metric to reflect overall complexity.

Table 3.2. Comparison of the number of visits (1200 network states)
Methods: Q-learning / value function based / sampled Q-learning
Number of visits: L (3.13) / Ψ|S|KTM / O(ΨL)
Numerical results: 2 × 10^6 / 1.2 × 10^6 / 3 × 10^5

3.4.2 Policy Error Analysis

We mainly focus on the policy error of the policy sampling and interpolation algorithms. For the sake of tractability, several modifications and assumptions are made. The optimal thresholded policy can be represented by its boundary states, i.e., by the boundary vector q* = [q*_1, q*_2, ..., q*_H]^T, where q*_h is the buffer index of the thresholding state when the channel state is h. To analyze the average performance, we compute the expectation of the policy error: Σ_{h=1}^{H} |E[q̂_h] − q*_h| / N. The original SVM problem is difficult to analyze, since the states are placed on a 2-D plane and the boundary given by the algorithm is a curve, for which it is challenging to obtain an explicit expression of q̂_i. However, notice that the policy signal can be represented by a matrix (Fig. 3.6(b)), where the rows and columns correspond to the channel index and buffer index, respectively, and the binary entries of the matrix represent the policy signal.
We are able to reduce the dimension from 2D to 1D by looking at the boundary state of each row.

Figure 3.9. Line sampling problem: variables less than t are labeled 0, variables greater than t are labeled 1.

After policy sampling, a particular row may take the following form (assuming that, for the sampled states, the estimated policy agrees with the true optimal policy):

[φ, φ, ..., 0, φ, ..., 0, ..., φ, 1, ..., 1, ..., φ],   (3.14)

where 0 and 1 represent sampled states already labeled as silence and transmission, and unsampled states are represented by φ (null). To ease the analysis, the discrete sampling can be reformulated as a continuous line sampling problem by normalization (dividing by Q; see Fig. 3.9), with the following assumptions:

• M = ⌈Ψ(Q + 1)⌉ points are sampled uniformly at random on the line segment [0, 1], i.e., we assume each row has an equal number of sampled states.
• X = {x_1, x_2, ..., x_M} denotes the M variables ordered from smallest to largest.
• Variables less than the threshold t (which represents the thresholding state) are labeled 0, and 1 otherwise.

It should be noted that these assumptions may be violated in simulation: random sampling cannot guarantee that each row has exactly M sampled entries, and erroneous decisions can still occur during Q-learning, i.e., the sampled states may not be properly labeled. The main result regarding the policy error is stated in the following theorem:

Theorem 6. If the logical network is constructed such that Q + 1 ≈ H, then the decay rate of the averaged policy error after the policy sampling and interpolation algorithms is O(1/√N), where N = (Q + 1)H is the size of the network.

Proof. See Appendix B.1.

Remark 6. The proof proceeds by finding upper and lower bounds on the averaged policy error and showing that these bounds have the same decay rate.
3.4.3 Analytical Bounds

Theorem 6 only provides the decay rate of the averaged policy error. It is also interesting to investigate the performance of our algorithms with genie-aided information, i.e., the indices of the boundary states. For the GSP approach to policy interpolation, although there is a closed-form solution to Problem (3.12), it is challenging to analyze for two reasons: 1) the recovered policy signal differs under different subset selection schemes; 2) the closed-form solution requires the exact inverse of the graph Laplacian L_0, which is rank deficient. If we replace the exact inverse with the Moore-Penrose pseudoinverse, numerical results show that the pseudoinverse solution differs from the solution given by the CVX toolbox and, furthermore, that its performance is inferior. On the other hand, given the line sampling formulation (Fig. 3.9), it is still challenging to determine an explicit expression for the index of the boundary state. Although the objective function x^T L_0 x can be decomposed as a summation along the vertical and horizontal directions (see Fig. 3.7), the minimizations cannot be taken independently. For example, in order to minimize each horizontal summation, the boundary state of a row may lie anywhere between the closest states labeled 0 and 1 (see Fig. 3.9, where x_n < t < x_{n+1}). After accounting for the vertical summation, we need to carefully tune each boundary state's position so that the total sum is minimized; it is clear that the optimal solution of the objective function can only be found through enumeration. Since it is challenging to obtain an exact solution, we derive bounds on the averaged performance. In the line sampling problem, given a threshold t, we seek an upper bound x_u and a lower bound x_l on the estimated boundary index. For a particular realization, there is an n such that x_n < t < x_{n+1}; immediately, we have the upper bound x_u = x_{n+1} and the lower bound x_l = x_n.
It is clear that x_u and x_l are random variables, since n is a random variable; thus our goal is to compute E[x_l] and E[x_u]. The results are given by the following theorem:

Theorem 7. Given a threshold t, the expectations of x_l and x_u are given by:

E[x_l] = Σ_{n=0}^{M} C(M, n) (n/(n+1)) t^{n+1} (1 − t)^{M−n},   (3.15)

E[x_u] = Σ_{n=0}^{M} C(M, n) t^n [ t(1 − t)^{M−n} + (1 − t)^{M−n+1}/(M − n + 1) ],   (3.16)

where C(M, n) denotes the binomial coefficient.

Proof. The proof is provided in Appendix B.2.

Therefore, given a channel state h, we have the threshold t_h = q*_h / Q; the closest sampled state to the left of the boundary state has index Q E[x_{l,h}], and the closest sampled state to its right has index Q E[x_{u,h}]. Since the estimated boundary state lies in this interval, the maximum number of erroneous states is max{Q t_h − Q E[x_{l,h}], Q E[x_{u,h}] − Q t_h}. Thus the policy error can be upper bounded by:

policy error < (1/N) Σ_{h=1}^{H} max{Q t_h − Q E[x_{l,h}], Q E[x_{u,h}] − Q t_h}.   (3.17)

For the SVM approach, since the SVM algorithm creates the largest margin between the data of the two classes, the estimated threshold index x_svm in a particular realization can be defined as follows:

x_svm = x_1 if x_1 > t;  (x_n + x_{n+1})/2 if x_n < t < x_{n+1}, 1 ≤ n ≤ M − 1;  x_M if x_M < t,   (3.18)

where the first case occurs when all the sampled points are greater than t, in which case we choose the smallest one as the boundary index; and the third case occurs when all the sampled points are less than t, in which case the largest one is chosen. Otherwise, the midpoint of the two consecutive variables with different labels is the boundary index. The computation of E[x_svm] is given by the following theorem:

Theorem 8. Given a threshold t and using the SVM algorithm, the expected value of the separation boundary index x_svm is:

E[x_svm] = (M/(M+1)) t^{M+1} + Σ_{n=1}^{M−1} (1/2) { C(M, n) (n/(n+1)) t^{n+1} (1 − t)^{M−n} + C(M, n) t^n [ t(1 − t)^{M−n} + (1 − t)^{M−n+1}/(M − n + 1) ] } + t(1 − t)^M + (1 − t)^{M+1}/(M + 1).   (3.19)

Proof. See Appendix B.3.

Again, once we have the optimal policy, we normalize each row h, find the corresponding threshold t_h, and the expectation of the thresholding state is given by Q E[x_{svm,h}]. The averaged policy error can then be approximated by:

policy error ≈ (1/N) Σ_{h=1}^{H} |Q t_h − Q E[x_{svm,h}]|.   (3.20)

Numerical validation of Theorems 6, 7 and 8 is provided in the following section.

3.5 Numerical Results

The simulation parameters can be found in Sec. 3.4. First, Fig. 3.10 shows the relationship between the policy error and the sampling budget. As expected, the policy error decreases as the sampling budget increases. However, we observe the classical trade-off between performance and complexity in selecting the fraction of samples: too many incurs high complexity, too few results in poor performance. In our experience, as revealed in Fig. 3.10, a good sampling fraction is around 15%–20%.

Figure 3.10. Policy error with respect to sample fraction, network size = 1200.

As a comparison, the classical Q-learning algorithm is also implemented. Since there is no terminal state in the wireless network system, for a fair comparison with sampled Q-learning, the Q-learning also stops when the number of visits for all state-action pairs exceeds the same counting threshold as in sampled Q-learning. Although there are in total 8 combinations of the policy sampling, interpolation and refinement algorithms, not all of their performances will be demonstrated.
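The expectations in Theorems 7 and 8 can be checked by Monte Carlo simulation of the line sampling problem; the sketch below implements (3.15), (3.16) and (3.19) directly and compares them with empirical averages (the particular M, t and trial count are illustrative choices):

```python
import math
import numpy as np

def E_xl(M, t):
    """Eq. (3.15): expected largest sample below t (0 if none)."""
    return sum(math.comb(M, n) * n / (n + 1) * t ** (n + 1) * (1 - t) ** (M - n)
               for n in range(M + 1))

def E_xu(M, t):
    """Eq. (3.16): expected smallest sample above t (1 if none)."""
    return sum(math.comb(M, n) * t ** n
               * (t * (1 - t) ** (M - n) + (1 - t) ** (M - n + 1) / (M - n + 1))
               for n in range(M + 1))

def E_xsvm(M, t):
    """Eq. (3.19): expected SVM boundary index under the midpoint rule (3.18)."""
    mid = sum(0.5 * (math.comb(M, n) * n / (n + 1) * t ** (n + 1) * (1 - t) ** (M - n)
                     + math.comb(M, n) * t ** n
                     * (t * (1 - t) ** (M - n) + (1 - t) ** (M - n + 1) / (M - n + 1)))
              for n in range(1, M))
    return (M / (M + 1) * t ** (M + 1) + mid
            + t * (1 - t) ** M + (1 - t) ** (M + 1) / (M + 1))

def monte_carlo(M, t, trials=200_000, seed=0):
    """Empirical E[x_l], E[x_u], E[x_svm] for M uniform samples on [0, 1]."""
    rng = np.random.default_rng(seed)
    X = np.sort(rng.random((trials, M)), axis=1)
    below = X < t
    nb = below.sum(axis=1)                       # samples below the threshold
    xl = np.where(below, X, 0.0).max(axis=1)     # 0 when no sample below t
    xu = np.where(~below, X, 1.0).min(axis=1)    # 1 when no sample above t
    xsvm = np.where(nb == 0, X[:, 0],
                    np.where(nb == M, X[:, -1], 0.5 * (xl + xu)))
    return xl.mean(), xu.mean(), xsvm.mean()

M, t = 5, 0.4
xl_mc, xu_mc, xs_mc = monte_carlo(M, t)
```

Conditioning on the number n of samples below t makes the derivation transparent: given n, the largest of n uniforms on [0, t] has mean nt/(n + 1) and the smallest of M − n uniforms on [t, 1] has mean t + (1 − t)/(M − n + 1), which are exactly the terms appearing in (3.15) and (3.16).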
Since the whole process has a 3-layer structure (sampling, interpolation and refinement), we focus on the performance of the algorithms within each layer. After the analysis, we select the best combinations.

3.5.1 Policy Sampling and Interpolation

We first examine the policy sampling and interpolation algorithms. Since the sampling set is constructed randomly, for each algorithm we run 20 instances to obtain the averaged performance. From Fig. 3.11 the following results can be observed:

1. The proposed policy sampling and interpolation algorithms achieve sample complexity reductions compared to standard Q-learning (60% and 80%, respectively, for network size 2000).
2. The sampled Q-learning with the SVM/GSP approach is faster than the value function based policy sampling with the SVM/GSP approach; moreover, the sampled Q-learning with the SVM/GSP approach has a smaller policy error.
3. The sampled Q-learning with the SVM/GSP approach achieves performance similar to standard Q-learning.

Figure 3.11. Performance comparison of the policy sampling and interpolation algorithms.

Results 2) and 3) show the superiority of sampled Q-learning: it has higher accuracy than the value function based policy sampling.

Figure 3.12. Decay rate bounds and analytical bounds.
Upper figure: decay rate regarding Theorem 6. Lower figure: analytical results regarding Theorems 7 and 8.

Figure 3.13. Performance comparison of the policy sampling, interpolation and refinement algorithms.

This is because sampled Q-learning directly learns the Q functions, whereas the value function estimation depends on several design variables with an attendant complexity-performance trade-off: the trajectory number K, the trajectory length T, and the input policy number M.

3.5.2 Error Bounds

The average policy error decay rate (Theorem 6), the performance bounds (Inequality (3.17)) and the analytical results (Equation (3.20)) derived in Sec. 3.4 are shown in Fig. 3.12. For the policy error decay rate, the true performance curves are well bounded, so we infer that the decay rate of our algorithm is O(1/√N). This is further validated by the fact that the actual performance curves are well fitted by 1.1/√N (Fig. 3.12). For Theorems 7 and 8, the analytical bound for the SVM approach matches the performance curve quite well. For the GSP approach, we can only estimate the upper and lower boundaries; although the upper bound is close to the actual performance curve, it is not as tight as the SVM bound, due to the lack of information that determines the location of the boundary states. However, by applying the same technique as in the proof of Theorem 6, it can also be shown that the GSP bound is upper bounded by a curve with decay rate O(1/√N).

3.5.3 Policy Refinement
From the results in Fig. 3.11, to assess the performance of policy refinement we focus on sampled Q-learning for policy sampling and the SVM approach for policy interpolation (we underscore that SVM and GSP achieve similar performance; this can be verified numerically, but the results are omitted here). The overall performance of the sampling, interpolation and refinement algorithms is shown in Fig. 3.13. It can be observed that, with a slight increase in complexity, the averaged policy error is further reduced by 50%, and the runtime is also less than half that of classical Q-learning.

3.5.4 Extensions

A binary action space was considered in the original system; it can easily be extended to a multi-action system, which is more general for modern networks. For example, by allowing the transmitter to transmit more packets, and modifying the cost functions and transition probabilities accordingly, the optimal policy again exhibits a thresholded structure, and our policy sampling and interpolation algorithms still apply (see Fig. 3.14). The performance curve for the policy error is similar, so it is omitted here.

Figure 3.14. Optimal policy and policy interpolation for a system with multiple actions.
Figure 3.15. Optimal policy for the server in the queueing system and performance comparison.

The thresholded policy is a common property of many queueing systems [26–28]. Therefore, we examine a queueing system to further validate our proposed methods. Consider a system with 2 queues sharing a single server.
Customers arrive at the two queues following two independent Poisson processes with rates λ_1 and λ_2, and the service times of the server are i.i.d. exponential with parameter ξ. The state of the system is then characterized by the number of customers in each queue, with holding costs c_1 and c_2 per customer in the respective queues. The objective is to minimize the infinite-horizon discounted cost with discount rate α, and the decision is which queue the server should serve next. Notice that this is in fact a continuous-time semi-Markov problem, and solving such a continuous-time MDP is challenging. Fortunately, the uniformization technique [14] can be employed to convert the original problem into an equivalent discrete-time MDP, so that the classical algorithms apply. It has been shown that the optimal policy for this queueing system also exhibits a thresholded structure [61] (see Fig. 3.15(a)). Therefore, our proposed algorithms apply, and the results are shown in Fig. 3.15(b)(c). It is also observed that our proposed algorithm achieves a complexity reduction with performance similar to classical Q-learning. We have also implemented a Deep Q Network (DQN) for our wireless network system with 900 states (30 buffer states and 30 channel states). In our simulation, the DQN is a classical feed-forward, fully connected neural network with two input nodes (buffer and channel). There are 4 hidden layers, each containing 100 neurons; the final output layer has two nodes, where the value of each node represents the Q value of the corresponding action for the particular input state. In the simulation, we run both the DQN and our algorithm 30 times; the policy errors are binned to approximate the probability distribution of the policy error for each approach. Fig. 3.16 shows the histograms of the policy errors of the two approaches.
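The described DQN architecture (two inputs, four fully connected hidden layers of 100 neurons, two Q-value outputs) can be sketched as a plain numpy forward pass; the ReLU activation and the random placeholder weights below are our assumptions, not trained parameters or settings from the thesis:

```python
import numpy as np

def dqn_forward(state, weights, biases):
    """Forward pass of a 2-100-100-100-100-2 fully connected network.

    The hidden layers use ReLU (an assumption; the activation is not
    specified in the text); the output layer is linear and returns one
    Q value per action, Q(s, u0) and Q(s, u1).
    """
    x = np.asarray(state, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, W @ x + b)      # ReLU hidden layers
    W, b = weights[-1], biases[-1]
    return W @ x + b                        # linear Q-value output

rng = np.random.default_rng(0)
sizes = [2, 100, 100, 100, 100, 2]          # layer widths from the description
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
q_values = dqn_forward([10, 10], weights, biases)   # Q values for state (10, 10)
```

Counting weights and biases, this architecture carries about 3×10^4 trainable parameters for a 900-state problem, which illustrates why its training and tuning cost dwarfs that of the structured sampling approach.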
It can be clearly observed that our proposed policy sampling and interpolation approach has a much smaller average policy error (2.8% vs. 10%) and a smaller variance (7×10^{-5} vs. 1.5×10^{-3}) compared to the DQN approach. Finally, it took roughly 1 minute to train the DQN, whereas our proposed algorithm takes only a few seconds to complete. We note that although there are basic guidelines for hyperparameter tuning and architecture design for a DQN, finding a good DQN for our problem requires significant effort: each time the network parameters are changed, we need to retrain or even redesign the DQN. Significant hand-tuning went into optimizing the DQN, and the results in Fig. 3.16 show the best of our tuning effort.

Figure 3.16. Histograms of the policy error of the DQN and the policy sampling/interpolation approach; network size = 900.

We believe our proposed approach leverages structural information well, while using machine learning techniques in key sub-modules to effectively learn key unknowns. The DQN implementation shows that neural networks do not always provide the best performance, and they often incur the cost of significant hyperparameter tuning and training, in contrast to our proposed method. Finally, we conclude this section by observing that our model can be easily extended to a MIMO system by considering multiple antennas at both the transmitter and the receiver. The state of the system is then represented by the number of packets in the buffer, together with all the channel states between every pair of transmit and receive antennae. Denoting N_t, N_r as the numbers of transmit and receive antennae, the total number of states is B·(N_t N_r)^H, where B is the buffer capacity. As the complexity of the MIMO system increases, not only does the state space grow exponentially, but visualizing the policy structure is also challenging due to the multi-dimensionality. One way to tackle this challenge is to perform state composition [62].
The authors in [62] also consider a point-to-point transmission network with a MIMO transceiver setting, and the state composition is performed by aggregating all the channel state information at the transmitter, i.e., h = {h_{1,1}, h_{1,2}, ..., h_{N_t,N_r}}. As a result, the state of the system is "reduced" to 2 dimensions (the buffer dimension B and the "composite channel state" dimension h), which is suitable for visualization and resembles the state space of our model. Moreover, it is also shown that the optimal policy of the system in [62] exhibits a thresholded structure (Fig. 3.1(b)). Thus our model can serve as a representative for a variety of MIMO systems, and our proposed algorithms can be applied to MIMO systems.

3.6 Conclusions

In this chapter, based on the thresholded structure of the optimal policy, reduced-complexity algorithms for the wireless network control problem were developed. The algorithms first estimate policies for a subset of states using value functions or Q functions; then SVM or GSP techniques are employed for global policy reconstruction. Policy refinement algorithms can further reduce the reconstruction error. The proposed algorithms dramatically reduce the runtime compared to classical Q-learning while achieving similar or even better performance. Specifically, the sampled Q-learning algorithm, together with SVM reconstruction and the boundary perturbation algorithm, reduces the runtime by 60% and the policy error by 50%.

Chapter 4
Neural Policy Gradient in Wireless Networks: Analysis and Improvement

4.1 Introduction

As seen in the previous chapter, reinforcement learning (RL) [1] seeks to find an optimal decision rule for sequential decision-making problems that are typically modeled as Markov decision processes. Classical Q-learning algorithms can be employed to first learn the optimal Q values, from which an optimal policy can then be derived.
However, as noted above, conventional methods suffer from complexity challenges as a result of the exponential growth of the number of system states; thus it is of high importance to develop reduced-dimension MDP/RL algorithms. The key to the algorithm design is to obtain an efficient representation of the value functions/Q values so that the original problem can be solved in a lower-dimensional space. Good representations include hand-designed basis functions [14]; bases derived from graph signal processing methods [3, 4, 6]; and bases built from polynomials, Fourier bases and/or radial basis functions [1]. In addition, artificial neural networks [58] have also been employed for non-linear representations. Typically, these approaches are referred to as value-based methods, since the derivation of an optimal policy is based on first computing the value functions or Q values. However, there are several limitations to these value-based approaches. First, although function approximation techniques [14] are necessary in developing low-complexity algorithms, even a small change in the estimated value can lead to a dramatic change in the policy; for example, without a proper basis for representing Q values, Q-learning and dynamic programming methods have been shown to be unstable and thus fail to converge even for a simple MDP problem [63, 64]. Learning the policy directly can reduce the impact of value function errors. Second, for a system with a continuous action space, applying value-based methods is challenging, since one needs to search for the global optimum over the continuous action space. Finally, given additional constraints in the original optimization problem, the optimal policy is a stochastic policy [65, 66]; different actions are then selected according to a specific probability distribution, which cannot easily be computed directly from value functions or Q values.
Therefore, policy-based methods [1], generally referred to as policy gradient approaches, have been proposed to learn the optimal policy directly. The output policy is usually parameterized by a neural network, with the state variables as the network input and a probability distribution over the action space as the output. To update the weights of the neural network, the system first collects cumulative costs from multiple sample trajectories under the current policy; gradients with respect to the neural network parameters can then be computed with the help of the policy gradient theorem [1], and the update can be performed by gradient descent. A key disadvantage of this Monte-Carlo policy gradient method (also called REINFORCE) [67] is the high variance of the computed gradient. To reduce the variance, the Actor-Critic (AC) method [68] has been proposed, where the neural network not only learns a probability distribution, but also estimates the value function, which is used as a baseline to subtract from the cumulative costs. Nevertheless, these algorithms may still be too sensitive to updates in the parameter space: catastrophic performance degradation can still occur due to an improper update magnitude. As a result, new surrogate objective functions have been proposed to constrain the updates during the learning process, and trust region policy optimization (TRPO) [69] and proximal policy optimization (PPO) [70] have been developed to mitigate this issue. Policy gradient methods have also been applied to various wireless network applications such as caching in distributed wireless networks [71], energy harvesting for mobile devices [72] and power allocation in satellite networks [73]. Despite these successes, the actual optimization behavior of policy gradient methods is not well understood, and there are simple scenarios in which the algorithms do not behave well.
In this work, we apply policy gradient approaches to a point-to-point transmission system, and numerical results suggest that the algorithms converge to the same local optimum with high probability. Research on RL algorithms has not fully resolved the two key challenges of convergence to local optima and the related issue of how to initialize the neural network implementation of RL. The main contributions of this chapter are as follows:
• A theoretical analysis of the properties of the gradient of the neural network is provided, showing the monotonic behavior of the network weights. This analysis, in turn, explains the preponderance of local optima.
• The analysis is generalized to more complex neural networks for more complicated systems.
• Smart initialization techniques using tools from continual learning are employed, achieving at least a 60% error reduction versus random initialization.
• Our modified policy gradient algorithms are implemented in a more complex multiple-input and multiple-output (MIMO) system, achieving roughly an 80% error reduction compared to random initialization.

The rest of the chapter is organized as follows: Section 4.2 provides background on policy gradient algorithms. A case study containing an interesting convergence result is provided in Section 4.3; it serves as the motivation for further analysis. Theoretical analysis of the properties of the gradients and network weights is given in Section 4.4. Section 4.5 provides smart initialization techniques for the neural network, and numerical results are shown in Section 4.6. Finally, Section 4.7 concludes the chapter. Proofs of the technical results are provided in Appendix C.

4.2 Policy Gradient

As mentioned in Section 3.1, the classical Q-learning algorithm determines an optimal policy by computing the Q values for all state-action pairs, resulting in high complexity.
Policy gradient algorithms, on the other hand, perform a direct policy search via a gradient method. The policy is usually parametrized by a neural network; denote θ as the parameters of the neural network. Given an input state s, the policy network generates a probability distribution π_θ(a|s) on the action space A. The objective function J(θ) in policy gradient is defined as the expected value of the cumulative cost over a trajectory of length T:

$$J(\theta) = \mathbb{E}_{\tau(T)}\left[\sum_{t=1}^{T} c_t\right], \quad (4.1)$$

where c_t denotes the cost at time t and τ(T) denotes a length-T trajectory; note that T is tunable. With the help of the policy gradient theorem [1], the gradient ∇_θ J(θ) is given by:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau(T)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right], \quad (4.2)$$

where $G_t = \sum_{t'=t}^{T} c_{t'}$ is the cumulative cost from time t to time T.* The detailed derivation of the gradient is as follows. From Equation (4.1) we have:

$$J(\theta) = \mathbb{E}_{\tau(T)}\left[\sum_{t=1}^{T} c_t\right] = \sum_{t=1}^{T} P_\theta(s_t, a_t|\tau)\, c_t, \quad (4.3)$$

where P_θ(s_t, a_t|τ) denotes the probability of taking action a_t at state s_t given trajectory τ at time step t; note that this probability is parameterized by the neural network parameters θ. Taking the derivative of J(θ), we have:

$$\nabla_\theta J(\theta) = \sum_{t=1}^{T} \nabla_\theta P_\theta(s_t, a_t|\tau)\, c_t = \sum_{t=1}^{T} P_\theta(s_t, a_t|\tau)\, \frac{\nabla_\theta P_\theta(s_t, a_t|\tau)}{P_\theta(s_t, a_t|\tau)}\, c_t = \mathbb{E}_{\tau(T)}\left[\sum_{t=1}^{T} \nabla_\theta \log P_\theta(s_t, a_t|\tau)\, c_t\right]. \quad (4.4)$$

* For the sake of tractability, we only consider the basic REINFORCE algorithm without discounting.

To further simplify the expression, we take a more careful look at ∇_θ log P_θ(s_t, a_t|τ). By definition,

$$P_\theta(s_t, a_t|\tau) = P(s_1)\,\pi_\theta(a_1|s_1)\,P(s_2|s_1, a_1)\,\pi_\theta(a_2|s_2)\cdots P(s_t|s_{t-1}, a_{t-1})\,\pi_\theta(a_t|s_t). \quad (4.5)$$

Taking the log of both sides and differentiating:

$$\nabla_\theta \log P_\theta(s_t, a_t|\tau) = \nabla_\theta \log \pi_\theta(a_1|s_1) + \nabla_\theta \log \pi_\theta(a_2|s_2) + \cdots + \nabla_\theta \log \pi_\theta(a_t|s_t).$$
(4.6)

Therefore, the gradient can be simplified as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau(T)}\left[\sum_{t=1}^{T} \nabla_\theta \log P_\theta(s_t, a_t|\tau)\, c_t\right] = \mathbb{E}_{\tau(T)}\left[\sum_{t=1}^{T} \Big(\sum_{t'=1}^{t} \nabla_\theta \log \pi_\theta(a_{t'}|s_{t'})\Big) c_t\right] = \mathbb{E}_{\tau(T)}\left[\sum_{t=1}^{T} \Big(\sum_{t'=t}^{T} c_{t'}\Big) \nabla_\theta \log \pi_\theta(a_t|s_t)\right], \quad (4.7)$$

where the last equality follows from exchanging the order of summation, and the result is the same as Equation (4.2). We observe that, in simulation, the lack of knowledge of the exact probability of each possible trajectory means that the gradient ∇_θ J(θ) is approximated by the sample average over a few trajectories.

4.3 Case Study and Motivation

In this section, we provide a case study wherein multiple policy gradient algorithms are applied to the system described in Section 2.3. The goal is for these methods to learn the optimized control. The numerical behavior observed will reveal interesting convergence results that motivate our subsequent theoretical analysis in Section 4.4.

Figure 4.1. Pictorial illustration of a policy network; the numbers of nodes are 3−5−4−2. Activation functions r(·) and σ(·) are shown on top of each layer.

To run the policy gradient algorithms, a feed-forward, fully connected neural network with one hidden layer is first constructed as our policy network (see Figure 4.1 for a pictorial representation),† with the number of nodes in each layer being 2-16-2. We observe that our signals are fundamentally different from images or sequential data; as such, convolutional or recurrent neural networks are not particularly suited as an architecture for our problem. The ReLU activation function is used in every hidden layer as our non-linearity, in part to avoid the vanishing-gradient issue; it is defined as:

$$r(x) = \begin{cases} 0, & x \le 0 \\ x, & x > 0 \end{cases}. \quad (4.8)$$

At the output, in order to obtain a probability distribution, the softmax activation function is employed:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \quad \text{for } i = 1,\ldots,N \text{ and } z = [z_1,\ldots,z_N].$$
(4.9)

† Since the input is neither an image nor sequential data, using a convolutional or recurrent neural network is not the best choice.

There are two input variables, since the state of the wireless system is characterized by the buffer length q and the channel state h. There are also two output nodes, since we have a binary action space as mentioned earlier. In terms of network settings, the packet arrival probability is p = 0.9, and the parameters for the channel in the wireless network come from a sensor network application [7]; they are set as C_0 = 10^{0.17}, η = 4.7, l = 20 m, and P_rcv = −97 dBm. The total number of network states is 300. We run a total of 100 episodes; each episode contains 10 sample trajectories of length T = 30 to obtain a fair estimate of the gradient in Equation (4.2). Finally, the learning rate for the neural network is 0.001. Three policy gradient algorithms, REINFORCE [67], Actor-Critic (AC) [68] and PPO [70], are implemented, and the numerical results suggest that, with high probability, the algorithms converge to the same local optimum: the decision suggested by the policy network is silence for all states, which is different from the optimal thresholded policy clearly seen in Figure 2.2. We see that, in our neural network, the weights in the last layer associated with the silence action keep increasing, while the weights associated with the transmit action keep decreasing (see Figure 4.2), for all three policy gradient approaches.‡ Since this monotonicity property exists for all three methods, and the more advanced methods such as Actor-Critic and PPO do not improve performance over REINFORCE, we analyze the REINFORCE algorithm. We first seek to determine whether this behavior is inherent to the approach or an artifact of our simulation scenario. Looking ahead, we will see that this behavior is, in fact, inherent to REINFORCE applied to our problem.
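For concreteness, the REINFORCE update of Equation (4.2), averaging ∇_θ log π_θ(a_t|s_t)·G_t over sampled trajectories and then descending, can be sketched as follows. The environment dynamics and costs below are an illustrative stand-in, not the wireless model of Section 2.3, and all hyperparameters are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy(W, b, s):
    """Linear softmax policy pi(a|s) = softmax(W s + b)."""
    return softmax(W @ s + b)

def grad_log_pi(W, b, s, a):
    """Gradient of log pi(a|s) w.r.t. (W, b) for the softmax-linear policy."""
    p = policy(W, b, s)
    onehot = np.zeros_like(p); onehot[a] = 1.0
    dlogits = onehot - p             # d log pi / d logits for softmax
    return np.outer(dlogits, s), dlogits   # (dW, db)

def reinforce_gradient(W, b, sample_trajectory, n_traj=10, T=30):
    """Monte-Carlo estimate of Eq. (4.2): sum_t grad log pi(a_t|s_t) * G_t."""
    gW = np.zeros_like(W); gb = np.zeros_like(b)
    for _ in range(n_traj):
        states, actions, costs = sample_trajectory(W, b, T)
        G = np.cumsum(costs[::-1])[::-1]   # G_t = sum over t' >= t of c_t'
        for s, a, g in zip(states, actions, G):
            dW, db = grad_log_pi(W, b, s, a)
            gW += dW * g; gb += db * g
    return gW / n_traj, gb / n_traj

def sample_trajectory(W, b, T):
    """Toy stand-in dynamics: state [q, h], action 0 = silence, 1 = transmit."""
    states, actions, costs = [], [], []
    q, h = 1.0, 1.0
    for _ in range(T):
        s = np.array([q, h])
        a = rng.choice(2, p=policy(W, b, s))
        c = q if a == 0 else 0.5          # assumed illustrative costs
        q = min(q + 1.0, 5.0) if a == 0 else max(q - 1.0, 0.0)
        states.append(s); actions.append(a); costs.append(c)
    return states, actions, np.array(costs)

W = rng.normal(scale=0.1, size=(2, 2)); b = np.zeros(2)
for _ in range(50):                       # gradient *descent* on the cost
    gW, gb = reinforce_gradient(W, b, sample_trajectory)
    W -= 0.01 * gW; b -= 0.01 * gb
```

A useful sanity check of `grad_log_pi` is the score-function identity: the expectation of ∇ log π(a|s) under π itself is zero, which is the single-step version of Lemma 1 in the next section.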
We considered a number of additional techniques to determine whether policy gradient methods could be improved to avoid these local optima. In particular: training acceleration via batch normalization [74], improved weight initialization with Xavier methods [75], adaptive learning rates [76], and weight regularization [77]. The network size was also varied, to no avail.

‡ The weights during the learning process have been normalized between −1 and 1 for better visualization and comparison among the three methods.

Figure 4.2. Normalized weights of the neural network during the learning process for the different policy gradient algorithms (silence and transmit weights for PPO, REINFORCE and AC).

Given the failure of these methods, we turned our attention to analyzing why policy gradient methods were unsuccessful for our problem framework.

4.4 Algorithm Analysis

Based on the numerical observations in the previous section, the goal of this section is to provide a theoretical analysis of the algorithms. We mainly focus on the properties of the gradients, and we also seek to obtain a closed-form expression for the evolution of the output policy.

4.4.1 Properties of gradients

For the sake of tractability, we first consider a simple 2−2 neural network (see Figure 4.3), but we emphasize that similar behavior of the network weights can be observed in more complex neural networks; generalized theoretical results for more complex networks are provided in the sequel in Theorem 11. In this simple policy network, the input layer has 2 variables, x = [q, h]^T, and the output layer produces a Bernoulli distribution on the action space. The connection between the input

Figure 4.3.
Pictorial representation of the simple policy network (weights w_{11}, w_{12}, w_{21}, w_{22}, bias b = [b_1, b_2]^T, softmax activation at the output).

and output can be described by a 2×2 matrix W = [w_1, w_2]^T. Thus, given a state input x, the output probability is:

$$\pi_{W,b}(0|s) = 1 - \pi_{W,b}(1|s) = \frac{e^{w_1^T x + b_1}}{e^{w_1^T x + b_1} + e^{w_2^T x + b_2}}, \quad (4.10)$$

where b = (b_1, b_2)^T is the bias vector; thus the parameters of the policy network defined in Equation (4.1) are θ = {W, b}. Without loss of generality, we use π_0 and π_1 to denote the probabilities of taking the silence and transmit actions, respectively. Thus w_1 and b_1 are the parameters associated with the silence action, and w_2 and b_2 are the parameters associated with the transmit action. In this simple policy network, the learned policy converged to the all-silence suboptimal policy, and the weights in the policy network associated with the silence action (w_{11} and w_{12}) keep increasing (similar to Figure 4.2). Conversely, the weights associated with the transmit action (w_{21} and w_{22}) keep decreasing. Due to the properties of the softmax function, we can prove the following theorem:

Theorem 9 The gradients of the cost function with respect to w_1 and w_2 are additive inverses of each other:

$$\frac{\partial J}{\partial w_1} = -\frac{\partial J}{\partial w_2}. \quad (4.11)$$

Proof. The proof is provided in Appendix C.1.

Remark 7. Although the derivations are for a simple neural network, the negative-gradient property still holds for the weights in the last layer of a more complex multilayer perceptron, due to the characteristics of the softmax activation function (see Equation (4.9)).

Remark 8. The negative-gradient property suggests that, instead of analyzing the whole policy network, it is sufficient to focus on only two of the network parameters (w_1 and b_1), since the other two (w_2 and b_2) are negatively correlated. Our next objective is to show that ∇_{w_1} J < 0 and ∇_{b_1} J < 0. We show that ∇_{w_{11}} J < 0; the derivations for the other parameters are similar.
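Theorem 9 can also be checked numerically by finite differences on a one-step surrogate objective. The state, weights and per-action costs below are illustrative assumptions, and the objective J here is a stand-in for the trajectory objective of Equation (4.1); the additive-inverse structure comes purely from the softmax, so it survives the substitution.

```python
import numpy as np

def pi0(W, b, x):
    """pi(0|s) from Eq. (4.10): softmax over the two linear outputs."""
    z = W @ x + b
    e = np.exp(z - z.max())
    return e[0] / e.sum()

def J(W, b, x, c_silence=2.0, c_transmit=1.0):
    """One-step expected cost under pi_{W,b} (assumed illustrative costs)."""
    p0 = pi0(W, b, x)
    return p0 * c_silence + (1 - p0) * c_transmit

x = np.array([3.0, 2.0])                 # state [q, h] (assumed)
W = np.array([[0.3, -0.1], [0.2, 0.4]])  # row 0 = w_1 (silence), row 1 = w_2
b = np.array([0.1, -0.2])
eps = 1e-6

def fd(i, j):
    """Central finite difference dJ/dW[i, j]."""
    Wp, Wm = W.copy(), W.copy()
    Wp[i, j] += eps; Wm[i, j] -= eps
    return (J(Wp, b, x) - J(Wm, b, x)) / (2 * eps)

g = np.array([[fd(i, j) for j in range(2)] for i in range(2)])
# Row 0 (w_1 gradients) and row 1 (w_2 gradients) are additive inverses,
# exactly as Theorem 9 states.
```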
For clarity, in the sequel we use ∇ to represent ∇_{w_{11}} unless otherwise specified, and use π to denote the parameterized output probability π_θ. Before providing the second theorem, we first rewrite Equation (4.2) and provide a lemma for a specific quantity in the equation, which will be used in the proof of the next theorem. Rearranging terms in Equation (4.2), we obtain:

$$\nabla J = \mathbb{E}_{\tau(T)}\Big\{\sum_{t=1}^{T} \nabla \log\pi(a_t|s_t)\cdot G_t\Big\} = \mathbb{E}_{\tau(T)}\Big\{\sum_{t=1}^{T} \nabla \log\pi(a_t|s_t)\cdot \sum_{t'=t}^{T} c_{t'}\Big\} \overset{(a)}{=} \mathbb{E}_{\tau(T)}\Big\{\sum_{t'=1}^{T} c_{t'}\cdot \sum_{t=1}^{t'} \nabla \log\pi(a_t|s_t)\Big\}, \quad (4.12)$$

where (a) follows from exchanging the order of the summations. Define Z_T as follows:

$$Z_T \doteq \nabla \log\pi(a_1|s_1) + \nabla \log\pi(a_2|s_2) + \cdots + \nabla \log\pi(a_T|s_T). \quad (4.13)$$

We have the following lemma:

Lemma 1 Z_T is a zero-mean random variable, i.e.,

$$\mathbb{E}_{\tau(T)}\{Z_T\} = 0. \quad (4.14)$$

Proof. The proof is provided in Appendix C.2.

Remark 9. Since each ∇ log π(a_i|s_i), 1 ≤ i ≤ T, is a zero-mean binary random variable with respect to action a_i, Z_T can be viewed as a generalized random walk. However, the proof of Lemma 1 is non-trivial because the expectation is taken with respect to trajectories rather than action sequences in the action space. Thus, for each ∇ log π(a_i|s_i), the expectation is taken backward from time T to time i and eventually yields the expectation of a binary random variable.

Remark 10. It can also be shown that Z_T is a martingale, due to the Markovian property and the zero-mean property of each ∇ log π(a_i|s_i). Although properties of martingales are not directly applied in the proofs of the lemma, we believe this classical set of tools will be useful in deriving results for neural networks.

With the definition of Z_T, the gradient can be rewritten in a much simpler form as follows:

$$\nabla J = \mathbb{E}_{\tau(T)}\Big\{\sum_{t'=1}^{T} c_{t'}\cdot \sum_{t=1}^{t'} \nabla \log\pi(a_t|s_t)\Big\} = \mathbb{E}_{\tau(T)}\Big\{\sum_{t'=1}^{T} c_{t'} Z_{t'}\Big\} = \mathbb{E}_{\tau(T)}\{X_T\}, \quad (4.15)$$

where $X_T \doteq \sum_{t=1}^{T} c_t Z_t$.
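The zero-mean property of Lemma 1 can be illustrated by Monte Carlo on a toy Markov chain. The two-state chain, the scalar policy parameter and the feature map below are our own illustrative choices, not the wireless model; the score d/dθ log π is zero-mean at every step, so the running sum Z_T is zero-mean as well.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.4                          # scalar policy parameter (assumed)

def pi0(s, th):
    """pi(a=0 | s): logistic in th with a state-dependent feature s + 1."""
    return 1.0 / (1.0 + np.exp(-th * (s + 1)))

def score(s, a, th):
    """d/dth log pi(a|s) for the Bernoulli policy above."""
    p = pi0(s, th)
    feat = s + 1
    return (1 - p) * feat if a == 0 else -p * feat

T, n_traj = 5, 20_000
P = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy state transitions
z_sum = 0.0
for _ in range(n_traj):
    s = rng.integers(2)
    Z = 0.0
    for _ in range(T):
        a = 0 if rng.random() < pi0(s, theta) else 1
        Z += score(s, a, theta)
        s = rng.choice(2, p=P[s])
    z_sum += Z
mean_Z = z_sum / n_traj              # close to 0, consistent with Lemma 1
```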
Echoing the observation from Figure 4.2 that the weight w_{11} exhibits monotonic behavior, we have the following theorem:

Theorem 10 Under the wireless network setting of Section 2.3, when applying the REINFORCE algorithm, the gradient of the objective function with respect to w_{11} is negative:

$$\nabla_{w_{11}} J < 0. \quad (4.16)$$

Proof. The proof is provided in Appendix C.3.

Remark 11. The result is proven via mathematical induction and by exploiting the relationship between X_T and Z_T. A similar analysis can be undertaken for the other weights, and due to Theorem 9 we can also prove the following results: ∇_{w_{12}} J < 0, ∇_{w_{21}} J > 0, ∇_{w_{22}} J > 0, ∇_{b_1} J < 0, ∇_{b_2} J > 0.

Remark 12. Since the trajectory can start from any state, without loss of generality we assume that the starting state s_1 is drawn from a uniform distribution on the state space. From the above theorems, we have shown that w_1, b_1 keep increasing and w_2, b_2 keep decreasing. The increasing and decreasing directions are shown in the surface plot of π(0|s) in Figure 4.4, and thus convergence to the all-silence policy from all initial states is observed.

Figure 4.4. Surface plot of π(0|s).

4.4.2 Approximations of the output probability

The next objective is to characterize the evolution of the output probability. To be more specific, we wish to derive a closed-form expression for the silence probability π_t(0|s) for state s at iteration time t, provided by Equation (4.10):

$$\pi_t(0|s) = \frac{e^{w_{1,t}^T x + b_{1,t}}}{e^{w_{1,t}^T x + b_{1,t}} + e^{w_{2,t}^T x + b_{2,t}}}. \quad (4.17)$$

At iteration step t+1, due to the gradient updates, we have w_{i,t+1} = w_{i,t} + Δw_{i,t}, b_{i,t+1} = b_{i,t} + Δb_{i,t}, i = 1, 2; thus π_{t+1}(0|s) is given as follows:

$$\pi_{t+1}(0|s) = \frac{e^{(w_{1,t}+\Delta w_{1,t})^T x + b_{1,t} + \Delta b_{1,t}}}{e^{(w_{1,t}+\Delta w_{1,t})^T x + b_{1,t} + \Delta b_{1,t}} + e^{(w_{2,t}+\Delta w_{2,t})^T x + b_{2,t} + \Delta b_{2,t}}} = \frac{\pi_t(0|s)}{\pi_t(0|s) + (1-\pi_t(0|s))\, e^{(\Delta w_{2,t}-\Delta w_{1,t})^T x + \Delta b_{2,t} - \Delta b_{1,t}}}. \quad (4.18)$$

Since the perturbation terms consist of the gradient, the learning rate and a factor of −1 (e.g.,
Δw_{1,t} = −(learning rate) × ∇_{w_{1,t}} J), from the opposite- and negative-gradient properties we have: Δw_{1,t} > 0, Δb_{1,t} > 0, Δw_{2,t} < 0, Δb_{2,t} < 0. Therefore, the exponential term in Equation (4.18) lies between 0 and 1, as a result of x = [q, h]^T > 0. The exponential term complicates the derivation of a closed-form formula for π_t(0|s): even with a Taylor series expansion, the perturbation terms are time-dependent and difficult to express explicitly. For analytical tractability, we consider both a linear and a constant approximation of the exponential term in Equation (4.18). Our numerical results in the sequel show that both approximations provide a comparable fit to the simulated behavior. For the linear approximation, the exponential term is approximated by e^x ≈ a + bx (where a and b are found via fitting), and π_t(0|s) can only be obtained by numerical computation.

Figure 4.5. Evolution of the output probability for the silence action: (a) varying the initial probability π_0 ∈ {0.5, 10^{-3}, 10^{-5}} for fixed κ = 0.7; (b) varying κ ∈ {0.3, 0.6, 0.8} for fixed π_0 = 10^{-5}; (c) simulation results for states (20,15) and (10,10) compared with the constant approximation (κ = 0.7, π_0 = 0.63) and the linear approximation.

In contrast, for the constant approximation, we fit the best value κ ∈ (0, 1) to the exponential term, and we can then determine a closed-form expression for π_t(0|s). With the constant approximation, we have:

$$\pi_{t+1}(0|s) \approx \frac{\pi_t(0|s)}{\pi_t(0|s) + \kappa\,(1-\pi_t(0|s))} \;\Rightarrow\; \frac{1}{\pi_{t+1}(0|s)} - 1 = \kappa\left(\frac{1}{\pi_t(0|s)} - 1\right) \;\Rightarrow\; \pi_t(0|s) = \frac{1}{1 + \kappa^t\left(\frac{1}{\pi_0(0|s)} - 1\right)}. \quad (4.19)$$

It can be observed from Equation (4.19) that two key variables affect the convergence behavior of π_t(0|s): the initial probability π_0(0|s) and the update factor κ. Figures 4.5(a) and (b) show the convergence curves of π_t(0|s) under different parameter settings.
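The closed form in Equation (4.19) can be verified directly against the constant-approximation recursion; the values κ = 0.7 and π_0 = 0.63 below mirror the values quoted for Figure 4.5(c).

```python
import numpy as np

def pi_closed_form(t, pi0, kappa):
    """Closed form of Eq. (4.19)."""
    return 1.0 / (1.0 + kappa**t * (1.0 / pi0 - 1.0))

def pi_iterated(t, pi0, kappa):
    """Iterate the constant-approximation recursion pi -> pi / (pi + kappa (1 - pi))."""
    p = pi0
    for _ in range(t):
        p = p / (p + kappa * (1.0 - p))
    return p

kappa, pi0 = 0.7, 0.63
traj = [pi_iterated(t, pi0, kappa) for t in range(50)]
closed = [pi_closed_form(t, pi0, kappa) for t in range(50)]
# The two sequences agree and increase monotonically toward 1, matching the
# convergence to the all-silence policy described in the text.
```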
Given the update factor κ, if the initial probability π_0(0|s) is closer to 1, the output probability converges to 1 faster (see Figure 4.5(a)); on the other hand, for a fixed initial probability, a larger κ yields a slower convergence rate (see Figure 4.5(b)). Although Equation (4.19) is derived from a constant approximation of the exponential term, it appears to be relatively tight versus the simulation result (see Figure 4.5(c)); furthermore, we see that the constant and linear approximations provide comparable matches to the simulated computation of π_t(0|s).

4.4.3 Generalization

Although the previous analysis is done on a simple 2−2 neural network for a point-to-point transmission system with a binary action space, the results can be extended to more complex neural networks for systems with multi-action spaces. Consider an RL system where each state can take k actions and a neural network is built to learn the policy for each state. The weight matrix W_l and the bias vector b_l between the penultimate layer and the last layer are given by W_l = [w_1, w_2, ..., w_k]^T, b_l = [b_1, b_2, ..., b_k]^T. The opposite-gradient property of Theorem 9 generalizes to the following zero-sum gradient theorem:

Theorem 11 For an RL system where |A| = k, given a multi-layer perceptron as a policy network for the REINFORCE algorithm, the gradients with respect to the parameters in W_l in the policy network have zero sum:

$$\frac{\partial J}{\partial w_1} + \frac{\partial J}{\partial w_2} + \cdots + \frac{\partial J}{\partial w_k} = 0. \quad (4.20)$$

Proof. The proof is provided in Appendix C.4.

Remark 13. Similar techniques can be applied to prove the result for the parameter vector b_l. Since the proof mainly exploits the characteristics of the softmax activation function, this property holds for a set of control problems irrespective of the action space and the number of nodes in the multi-layer perceptron.
The extension of the negative-gradient property (Theorem 10) to a complex system is less straightforward, since the derivation involves exploiting the structure of the cost functions as well as the transition behavior between states. In the point-to-point transmission system, the cost function is sparse and monotonically increasing in the action space; the transition is also independent of the channel dimension. In the numerical section (Section 4.6), we also construct a MIMO system which naturally admits the i.i.d. channel property and possesses a cost function monotonic in the action space.§ Numerical results suggest that the policy gradient algorithms again converge only to local optima. We conjecture that for many network control problems, the application of policy gradient methods will result in convergence to local minima. On the other hand, if we can determine methods that obviate the local-minima challenge for the point-to-point system, these methods have the potential to cope with local minima in more complex systems.

§ The i.i.d. channel property naturally exists in MIMO systems. While not universal, several important cost functions of interest do share the monotonicity property.

4.4.4 Implications for local convergence

The prior subsections provide theoretical insight into why the REINFORCE algorithm persistently converges to a local optimum, irrespective of the initialization of the algorithm parameters. However, we have previously proved that the optimal policy has a thresholded structure [3] (see Figure 2.2). In this subsection, we further investigate the interplay between these two results. First, we notice that the optimal policy in our point-to-point system is a deterministic policy, i.e., the system takes one specific action at each state. However, the output policy of the neural network is a stochastic policy, meaning that the algorithm can take different actions according to the distribution on the action space.
A stochastic policy reduces to a deterministic policy when the probability distribution is concentrated on one specific action; alternatively, a deterministic policy can be obtained from a stochastic policy by policy trimming, i.e., picking the action that has the largest probability. Denote r(s, a) as the probability of taking action a at state s; the deterministic policy d after policy trimming is given as follows:

$$d(s) = \arg\max_a r(s, a). \quad (4.21)$$

As policy gradient is employed, ideally the stochastic policy should converge to a deterministic policy; alternatively, one can transform the stochastic policy into a deterministic policy after each iteration, but the gradient computation is then slightly more complicated. We next provide a surface plot of the objective function in Equation (4.1) under the original stochastic output policy and under the trimmed deterministic policy, to see the effect of output policy trimming. Since it is difficult to visualize the high-dimensional parameter space, we fix all other parameters and create a 2-D surface plot of the objective function with respect to two manually selected parameters. Here, a simple policy network with three layers and 2−4−2 nodes per layer is considered; the total number of parameters is 8 + 4 + 8 + 2 = 22. Denote W^{(2)} as the weight matrix between the second and last layers of the policy network; it is a 2-by-4 matrix. We select W^{(2)}_{12} and W^{(2)}_{23} (respectively W^{(2)}_{13} and W^{(2)}_{22}) as the two variables and fix all other weights and biases. As the network weights W^{(2)}_{12} (W^{(2)}_{13}) and W^{(2)}_{23} (W^{(2)}_{22}) are tuned, different output policies are obtained. From Equation (4.1) we can compute both the objective function under the original stochastic policy and the objective function under the trimmed deterministic policy; the surface plot is constructed in Figure 4.6.
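Policy trimming in Equation (4.21) amounts to a row-wise argmax over the stochastic policy table. A minimal sketch, with an illustrative 3-state, 2-action table that is not taken from the thesis:

```python
import numpy as np

# Stochastic policy r(s, a): rows are states, columns are actions
# (0 = silence, 1 = transmit); values are illustrative.
r = np.array([
    [0.9, 0.1],   # state 0: mostly silence
    [0.4, 0.6],   # state 1: mostly transmit
    [0.5, 0.5],   # state 2: tie (argmax picks the first action)
])

# Trimmed deterministic policy d(s) = argmax_a r(s, a), one action per state.
d = r.argmax(axis=1)
```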
We observe that the objective function under the original policy gradient algorithm without policy trimming is continuous, since the output probability is continuous. However, the objective function evaluated under the trimmed deterministic policy becomes non-smooth. For the objective function associated with the trimmed policy, there are two constant surfaces.

Figure 4.6. 2D surface plot of the objective function J as a function of W^{(2)}_{12} and W^{(2)}_{23}.

Numerical validation suggests that the upper region represents the parameter space for which, after trimming, the output deterministic policy is the all-transmit policy; on the other hand, the lower surface yields the all-silence policy, which is the local optimum to which the smooth objective function converges. Furthermore, notice that there is a "trench" lying between the upper and lower surfaces; the globally optimal policy after trimming is, in fact, in the trench. Therefore, the dilemma is how to trim the stochastic output policy such that we find the globally optimal policy. In theory, there should exist a set of parameters that yields a stochastic policy arbitrarily close to the optimal deterministic policy. However, our numerical simulations show that in a high-dimensional space, the region containing this set of parameters is too small to be found. On the other hand, if we run policy gradient algorithms with policy trimming at the output, it is also difficult to converge to the globally optimal policy: from Figure 4.6 it is clear that on the two large surfaces, the gradients with respect to the parameters are zero.

4.5 Improving Policy Gradient

We consider two possible methods by which to improve the performance of vanilla gradient descent. One way is to employ a smart gradient update strategy and the

Figure 4.7. Pictorial illustration of continual learning.
The tasks labeled in red are pre-training tasks; the task labeled in blue is the future task.
other is to perform smart initialization. Both of these methods seek either to avoid local optima or to jitter the updates out of the basins of attraction of local optima. Given our previous analysis of the REINFORCE algorithm in Theorems 9 and 10 and the observations on the objective function (recall Figure 4.6), we focus on the original policy gradient algorithms without policy trimming, with the goal that when the algorithm terminates, the trimmed policy is close to the optimal deterministic policy. Employing smart gradient updates via SPSA (simultaneous perturbation stochastic approximation) [78, 79] is challenging: as seen from Figure 4.6, we have two undesirable regions of constant objective function and a narrow desired region wherein gradient updates can be effective. If the policy is initialized in the undesired constant regions, even gradient perturbations are unlikely to move the gradient in a desirable direction. If the algorithm is initialized in the desirable region, a gradient perturbation could move the policy into one of the undesirable constant regions. Therefore, the key is to perform smart initialization, where the algorithm parameters are ideally initialized such that the starting point is close to the global optimum. Techniques such as parameter tuning and early stopping should also be employed so that the algorithm does not stray too far from the global optimum. A good stopping time can be found via cross-validation; the stopping time will be provided in the numerical section. To perform smart initialization, one possible method is to "teach" the policy network with existing examples of good policies so that it gains prior knowledge of the overall structure of the policy. To this end, continual learning [5] is well-matched to our goal.
Algorithm 7 Policy gradient with continual learning
Input: policy network, pre-training tasks.
1: Pre-train the policy network with several tasks.
2: Run the policy gradient algorithm (REINFORCE) with early stopping in the new network setting.
3: Trim the stochastic policy to a deterministic policy by selecting the action with the highest probability.
Output: policy μ̂∗.
Continual learning considers the problem of learning from a stream of possibly infinite data generated from different input domains associated with different tasks, aiming to solve future learning problems with the knowledge acquired from previous tasks (see Figure 4.7). The goal here is to pre-train the policy network so that the starting point is close to the optimal solution in the parameter space. To successfully run the continual learning algorithm, the network should be able to distinguish between different tasks. Herein, a task is to find the optimal policy given a certain network setting. To further characterize different network settings, the input space is augmented to x = [q, h, β], where q is the buffer length, h is the current channel state and β is the weighting factor in the cost function (2.13). Notice that although q and h take finitely many values, the input space can be infinite since β is continuous, meaning that we may face an infinite number of network scenarios. Essential to continual learning is the assumption that we already know the optimal policy under a few network settings, which may not appear in a future learning setting. Each pre-training task is then a neural network estimation problem in which the goal for the neural network is to output the optimal policy signal. After training is completed, the policy network is used to run the policy gradient algorithm in a new network setting (a different β). The entire procedure is described in Algorithm 7.
4.6 Numerical Results
We apply the REINFORCE approach with smart initialization via continual learning to the point-to-point model described in Section 2.3.
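The pre-training stage of Algorithm 7 is ordinary supervised learning: the network is fit, task by task, to ground-truth policies from a few settings before REINFORCE is run in the new setting. A toy sketch with a linear-softmax stand-in for the policy network; the augmented inputs [q, h, β] follow the text, while the thresholded target policies, problem sizes and learning rate are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_task(beta, Q=5, H=4):
    # hypothetical pre-training task: augmented inputs x = [q, h, beta] with a
    # thresholded target policy (transmit iff q + h exceeds a beta-dependent level)
    X, y = [], []
    for q in range(Q + 1):
        for h in range(1, H + 1):
            X.append([q, h, beta])
            y.append(1 if q + h > beta / 2.0 else 0)
    return np.array(X, float), np.array(y)

def pretrain(tasks, lr=0.05, epochs=1000):
    W, b = np.zeros((2, 3)), np.zeros(2)
    for X, y in tasks:                   # tasks arrive sequentially
        T = np.eye(2)[y]                 # one-hot targets
        for _ in range(epochs):
            P = softmax(X @ W.T + b)     # forward pass
            G = (P - T) / len(X)         # cross-entropy gradient at the logits
            W -= lr * (G.T @ X)
            b -= lr * G.sum(axis=0)
    return W, b

tasks = [make_task(beta) for beta in (5, 10)]
W, b = pretrain(tasks)
X, y = tasks[-1]
acc = np.mean(np.argmax(softmax(X @ W.T + b), axis=1) == y)
```

After the sequential pre-training pass, the stand-in network typically reproduces the most recent task's policy well, which is exactly the "warm start" Step 2 of Algorithm 7 relies on.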
The wireless network parameters can be found in Section 4.3, and the neural network constructed to parameterize the policy is as follows. The neural network is a fully connected, feed-forward network with 3-128-64-2 nodes per layer. The numbers of nodes in each layer are chosen to be powers of two and relatively large, so that the overall architecture is able to find good representations of complex policies. Again, the ReLU activation function (4.8) is used in the hidden layers and the softmax activation function (4.9) is used at the output. Finally, the learning rate is set to 0.001. Two performance metrics are used to evaluate our proposed algorithm and conventional policy gradient. The first metric is the policy error, defined as:
policy error = (1/N) Σ_{s∈S} 1(μ̂(s) ≠ μ∗(s)); (4.22)
the second metric is the averaged value function, defined as:
V(μ) = Σ_{s∈S} v_μ(s) d_μ(s), (4.23)
where v_μ(s) and d_μ(s) are the value function and the stationary distribution for state s under policy μ.
4.6.1 Point-to-point system
In simulation, we pre-trained the network with ground-truth policies whose corresponding network cost weighting factors are β ∈ {5, 10, 15, 20}. In addition, the number of training episodes for each neural network is 300. While running the policy gradient algorithm, it can be observed from Figure 4.8 that the policy error first decreases
Figure 4.8. Policy error (versus iterations) for policy gradient with smart initialization.
Figure 4.9. Comparison of the optimal policy, the policy obtained by policy gradient (panel (a), "Policy from PG"), and the policy obtained by policy gradient with smart initialization; axes are buffer length and channel state, with silence/transmit actions marked. The weighting factor is β = 12.
and then increases as we pass by the globally optimal point.
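Both metrics in Equations (4.22)-(4.23) are straightforward to evaluate once a policy's transition matrix and cost vector are in hand. A sketch, assuming a discount factor α and using a tiny three-state chain purely for illustration:

```python
import numpy as np

def policy_error(mu_hat, mu_star):
    # Equation (4.22): fraction of states where the two policies disagree
    return np.mean(np.asarray(mu_hat) != np.asarray(mu_star))

def averaged_value(P_mu, c_mu, alpha=0.9):
    # Equation (4.23): V(mu) = sum_s v_mu(s) d_mu(s)
    n = P_mu.shape[0]
    v = np.linalg.solve(np.eye(n) - alpha * P_mu, c_mu)   # v_mu = (I - alpha P)^-1 c
    # stationary distribution d_mu: left eigenvector of P for eigenvalue 1
    vals, vecs = np.linalg.eig(P_mu.T)
    d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    d = d / d.sum()
    return float(v @ d)

# illustration on a 3-state chain under some fixed policy
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
c = np.array([1.0, 0.5, 0.0])
print(policy_error([0, 1, 1], [0, 1, 0]))   # one of three states disagrees
print(averaged_value(P, c))
```

For the actual experiments, P_mu and c_mu would be the transition matrix and cost vector induced by the trimmed policy on the (q, h) state space.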
To ensure close proximity to the global optimum after algorithm termination, a good stopping episode for updating the neural network is selected to be between 30 and 70. Figure 4.9 shows the comparison of the policies obtained by vanilla REINFORCE and by REINFORCE with smart initialization. It can be clearly observed that with smart initialization and early stopping, the learned policy is much closer to the optimal policy. To further test the performance of the modified algorithm in different network settings, the weighting factor β is varied so that a performance curve can be obtained (see Figure 4.10). We have the following observations:
1. The policy error is dramatically reduced compared to policy gradient with random initialization.
2. The averaged value function associated with the policy given by smart initialization is much closer to the optimal averaged value function.
Figure 4.10. Performance of policy gradient with continual learning for initialization: (a) policy error and (b) averaged value function versus β, for random versus pre-trained initialization; (c) policy error with 3, 4 and 5 pre-training tasks.
Figure 4.11. Pictorial representation of the MIMO system: two links, each with its own transmitter, buffer and channel, deliver packets to a common receiver.
We further consider the effect of randomly shuffling the order of the pre-training tasks and the effect of increasing the number of pre-training tasks. Numerical results suggest that random shuffling has less impact on performance as we increase the number of pre-training tasks; on the other hand, the overall performance still improves compared to random initialization.
Finally, from Figure 4.10(c) we observe that increasing the number of pre-training tasks can help reduce the policy error in future learning; on the other hand, increasing the number of pre-training tasks also incurs a forgetting effect [80], as can be observed from the solid curve with diamond markers. It is possible that by the end of the pre-training process, the neural network has forgotten what it learned from the early tasks. As a result, when faced with a new task similar to the early tasks, the neural network may not perform well.
4.6.2 MIMO system
Apart from the simple point-to-point transmission network, we can also apply the policy gradient algorithm with continual learning initialization to a MIMO network, to show the applicability of our algorithm to a more complex system. Consider a system with two transmitters and one receiver (see Figure 4.11), where the receiver is able to receive packets from both transmitters. We assume independent arrival processes at each transmitter and independent channel evolution across the two links. The state of this MIMO system is represented by the 4-tuple (q1, h1, q2, h2), where s(1) = (q1, h1) and s(2) = (q2, h2) represent the buffer length and channel state associated with each link, respectively. The action space for each transmitter is also binary, a(1), a(2) ∈ {0, 1} ≜ {silence, transmit}; thus there are 4 possible actions at each state. The instantaneous cost function for the whole system is defined as a weighted sum of the two costs associated with the individual links:
c(s(1), a(1), s(2), a(2)) = ρ c(s(1), a(1)) + (1 − ρ) c(s(2), a(2)), (4.24)
where c(s(1), a(1)) and c(s(2), a(2)) are defined in Equation (2.13) and ρ ∈ (0, 1) is a joint cost weighting factor that balances the importance of the costs of the individual links. In simulation, for the first link we set Q1 = 10, H1 = 8; for the second link we set Q2 = 10, H2 = 6.
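The product state space and the joint cost in Equation (4.24) can be written down directly. In the sketch below, the per-link cost function is a hypothetical stand-in for the one in Equation (2.13); the state counts follow the text:

```python
from itertools import product

Q1, H1, Q2, H2 = 10, 8, 10, 6
rho = 0.5

# composite states (q1, h1, q2, h2); the text counts (Q1*H1)*(Q2*H2) = 4800 of them
states = list(product(range(Q1), range(1, H1 + 1),
                      range(Q2), range(1, H2 + 1)))

def link_cost(q, h, a, beta=10.0):
    # stand-in for the per-link cost of Equation (2.13): a transmission cost
    # decreasing in the channel state h, plus a buffer-holding term when silent
    return beta / h if a == 1 else 0.1 * q

def joint_cost(s, a1, a2):
    # Equation (4.24): rho * c(s1, a1) + (1 - rho) * c(s2, a2)
    q1, h1, q2, h2 = s
    return rho * link_cost(q1, h1, a1) + (1 - rho) * link_cost(q2, h2, a2)

print(len(states))                 # 4800 composite states
print(joint_cost(states[0], 1, 0))
```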
Thus we have a total of (Q1 H1)·(Q2 H2) = 4800 states. For the other parameters, the arrival rate at each transmitter is 0.9 and the joint cost weighting factor is ρ = 0.5. By tuning the individual weighting factors β1 and β2, we change the structure of the policy. We pre-train the policy network with policies given by β1, β2 ∈ {5, 10} (300 training episodes) and we test the network in new network settings with β1, β2 ∈ {8, 12}. The neural network used as the policy network has structure 4-128-64-4.
Figure 4.12. Performance comparison of random initialization and smart initialization: (a) optimal policy; (b) learned policy; (c) policy error and (d) averaged value function for [β1, β2] ∈ {[8,8], [8,12], [12,8], [12,12]}, random versus pre-trained initialization.
Since we have 4 state variables, state aggregation is needed to facilitate visualization of the policy. Here the states are aggregated by link, i.e., (q1, h1) and (q2, h2) are each aggregated into a composite state. Figure 4.12 shows the optimal policy and the policy learned by the policy gradient algorithm with smart initialization. First, from the optimal policy (Figure 4.12(a)), we conjecture that there exists a state aggregation method such that the optimal policy is also in a thresholded form, similar to the result in [62]. In addition, with smart initialization using continual learning, the learned policy is close to the optimal policy. Furthermore, Figures 4.12(c) and (d) show a comparison between random initialization of the policy network and smart initialization; a significant error reduction of roughly 80% can be observed. In terms of the averaged value function, policy gradient with smart initialization yields value functions close to the optimal values. Therefore, our proposed algorithm is applicable to more complex systems.
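The link-wise aggregation used for visualization flattens each pair (q_i, h_i) into a single composite index, so the 4-D policy becomes a 2-D matrix. A sketch; the enumeration order and the illustrative policy below are assumptions:

```python
import numpy as np

Q1, H1, Q2, H2 = 10, 8, 10, 6

def composite(q, h, H):
    # flatten one link's (q, h) pair into a single index (row-major, h in 1..H)
    return q * H + (h - 1)

def aggregate(policy):
    # policy: dict mapping (q1, h1, q2, h2) -> action index in {0, 1, 2, 3};
    # returns a (Q1*H1) x (Q2*H2) matrix suitable for a 2-D policy plot
    M = np.zeros((Q1 * H1, Q2 * H2), dtype=int)
    for (q1, h1, q2, h2), a in policy.items():
        M[composite(q1, h1, H1), composite(q2, h2, H2)] = a
    return M

# illustrative policy: transmit on link 1 only (action 1) when its buffer is long
policy = {(q1, h1, q2, h2): 1 if q1 > 5 else 0
          for q1 in range(Q1) for h1 in range(1, H1 + 1)
          for q2 in range(Q2) for h2 in range(1, H2 + 1)}
M = aggregate(policy)
print(M.shape)  # (80, 60)
```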
4.7 Conclusions
In this chapter, a theoretical analysis of the policy gradient algorithm for wireless network control problems is performed. Properties of the gradients of the neural network are investigated, and the monotonic behavior of the gradients is proved. Leveraging tools from continual learning, a smart initialization technique is also employed, which achieves an error reduction of more than 50%. The analysis is generalized and the proposed algorithm is also implemented on a more complex MIMO system, showing the potential applicability of the proposed algorithm to a larger set of wireless control problems.
Chapter 5 Conclusions
In this thesis, we have considered finding the optimal control policy for a point-to-point wireless network. The control problem is modeled as a Markov decision process or as reinforcement learning, depending on the availability of the system dynamics. Several efficient algorithms are developed to tackle the curse of dimensionality, and they all achieve complexity reduction and error reduction compared to classical algorithms such as policy iteration and Q-learning. In Chapter 2, the main goal is to solve an MDP problem using a projection method, where the key is the design of a good subspace. Based on the i.i.d. channel property, the low-rank property of the probability transition matrix (PTM) is first proved. A subspace constructed from the Gram-Schmidt orthogonalization of the PTM is then proved to achieve perfect reconstruction of the policy signal. In addition, monotonicity of the value functions as well as the thresholded structure of the optimal policy are proved; a new zig-zag policy update method exploiting the policy structure is also proposed to facilitate the policy update step. Furthermore, we also propose a more general subspace design technique using graph signal processing.
Symmetrization methods need to be employed to tackle the directedness of the PTM, and numerical results show that the subspace constructed from bibliometric symmetrization achieves perfect reconstruction of the optimal policy. In Chapter 3, an efficient algorithm is proposed to solve reinforcement learning problems with a large state space, exploiting the thresholded structure of the optimal policy. The proposed algorithm consists of three stages: policy sampling, policy interpolation and policy refinement. Value-function-based and Q-value-based approaches are proposed for estimating the policy on a subset of states; both support vector machines and graph signal processing are applied to help reconstruct the global policy; a cushion method and boundary perturbation are also proposed to further refine the policy boundary. The proposed algorithm achieves a 65% runtime reduction when the number of network states is 2000, as well as a 50% error reduction compared to the classical Q-learning algorithm. Theoretical analysis has also been conducted to study the decay rate of the averaged policy error associated with our proposed algorithm. In Chapter 4, three policy gradient approaches (REINFORCE, Actor-Critic, PPO) are applied to directly learn the optimal policy. Numerical results suggest that the classical algorithms converge to an all-silence local optimum with high probability. To analyze the convergence behavior, theoretical analysis has been conducted, in which opposite and negative gradient properties of the parameters in the neural network are shown. To tackle the local convergence problem, techniques from continual learning and early stopping are applied. The modified policy gradient algorithm achieves more than 50% error reduction and also applies to a complex MIMO system.
Appendix A
On Solving Large Scale Markov Decision Process Problems: Exploitation of Spectral Properties and Policy Structures
A.1 Proof of Theorem 1
A.1.1 Monotonicity of V∗(q,h) in h
Since the channel states are i.i.d. over time, E_{a,h'} V∗(q−1+a, h') and E_{a,h'} V∗(q+a, h') are constants with respect to h, and hence can also be viewed as nonincreasing functions of h. In addition, βP_T(h) is a strictly decreasing function of h. Therefore, both functions inside the min are nonincreasing in h, so V∗(q,h) is nonincreasing in h for a given q.
A.1.2 Monotonicity of V∗(q,h) in q
Notice that the optimal value function can also be obtained by value iteration. We can start with V^(0)(q,h) = 0 for all q, h (i.e., a zero vector), which automatically satisfies the nondecreasing property in q for any given h. Consider 1 ≤ q ≤ Q−1 and assume that in the k-th iteration V^(k)(q,h) is nondecreasing in q for a given h. In the (k+1)-th iteration we have:
V^(k+1)(q+1, h) = min{ βP_T(h) + αE_{a,h'} V^(k)(q+a, h'), p·1[q = Q−1] + αE_{a,h'} V^(k)(min{q+1+a, Q}, h') }
≥ min{ βP_T(h) + αE_{a,h'} V^(k)(q−1+a, h'), αE_{a,h'} V^(k)(min{q+a, Q}, h') }
= V^(k+1)(q, h), 1 ≤ q ≤ Q−1, (A.1)
where the inequality follows from the induction hypothesis. For q = 0:
V^(k+1)(0, h) = αE_{a,h'} V^(k)(a, h')
≤ min{ βP_T(h) + αE_{a,h'} V^(k)(a, h'), αE_{a,h'} V^(k)(1+a, h') }
= V^(k+1)(1, h); (A.2)
the inequality follows because αE_{a,h'} V^(k)(a, h') ≤ βP_T(h) + αE_{a,h'} V^(k)(a, h'), and from the induction hypothesis αE_{a,h'} V^(k)(a, h') ≤ αE_{a,h'} V^(k)(1+a, h').
Hence, V^(k+1)(q,h) is nondecreasing in q for any given h, and so V^(j)(q,h) is nondecreasing in q for any given h, for all j = 0, 1, 2, .... Since V^(j) → V∗, the monotonicity holds for V∗ as well.
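The induction above can be mirrored numerically: run value iteration for the buffer/channel model and check that the (near-converged) iterate is nondecreasing in q and nonincreasing in h. A sketch under illustrative parameters (uniform i.i.d. channel, P_T(h) = 1/h, Bernoulli arrivals with probability p; the exact cost model of Chapter 2 may differ in details):

```python
import numpy as np

Q, H = 6, 4              # buffer levels 0..Q, channel states 1..H (illustrative)
alpha, beta, p = 0.9, 2.0, 0.4

def PT(h):
    # transmission cost, strictly decreasing in the channel state (assumption)
    return 1.0 / h

def EV(V, q):
    # E_{a,h'} V(min(q+a, Q), h'): Bernoulli(p) arrival, uniform i.i.d. channel
    return p * V[min(q + 1, Q)].mean() + (1 - p) * V[min(q, Q)].mean()

V = np.zeros((Q + 1, H))
for _ in range(200):
    Vn = np.empty_like(V)
    for q in range(Q + 1):
        for i, h in enumerate(range(1, H + 1)):
            silence = p * float(q == Q) + alpha * EV(V, q)  # expected overflow cost if full
            if q == 0:
                Vn[q, i] = silence                           # empty buffer: nothing to send
            else:
                transmit = beta * PT(h) + alpha * EV(V, q - 1)
                Vn[q, i] = min(transmit, silence)
    V = Vn

# Theorem 1: V* is nondecreasing in q and nonincreasing in h
assert np.all(np.diff(V, axis=0) >= -1e-9)
assert np.all(np.diff(V, axis=1) <= 1e-9)
```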
A.2 Proof of Theorem 2 We need to show for 1≤q≤Q− 2 and for q =Q− 1: αE a,h 0V ∗ (q + 1 +a,h 0 )−αE a,h 0V ∗ (q +a,h 0 )≥ αE a,h 0V ∗ (q +a,h 0 )−αE a,h 0V ∗ (q− 1 +a,h 0 ), (A.3) p +αE a,h 0V ∗ (Q,h 0 )−αE a,h 0V ∗ (Q− 1 +a,h 0 )≥ αE a,h 0V ∗ (Q− 1 +a,h 0 )−αE a,h 0V ∗ (Q− 2 +a,h 0 ), (A.4) which is equivalent to: E a,h 0V ∗ (q + 1 +a,h 0 ) + E a,h 0V ∗ (q− 1 +a,h 0 )≥ 2E a,h 0V ∗ (q +a,h 0 ), (A.5) p +αE a,h 0V ∗ (Q,h 0 ) +αE a,h 0V ∗ (Q− 2 +a,h 0 )≥ 2αE a,h 0V ∗ (Q− 1 +a,h 0 ). (A.6) Sufficient conditions for the above inequalities to hold are 1≤q≤Q− 1 and ∀h 0 we have: E h 0V ∗ (q + 1,h 0 ) + E h 0V ∗ (q− 1,h 0 )≥ 2E h 0V ∗ (q,h 0 ), (A.7) 1 +αE h 0V ∗ (Q− 1,h 0 )≥αE h 0V ∗ (Q,h 0 ). (A.8) To verify inequality (A.5), simply multiply inequality (A.7) by p with q := q + 1 and by 1−p with q := q and add them up, a comparison with the expansion of inequality (A.5) w.r.t. a can directly show that they are equivalent. Inequality (A.6) can be verified by multiplying inequality (A.7) byα(1−p) withq :=Q−1 and 105 adding with inequality (A.8) multiplied by p. In the sequel, we focus on proving inequality (A.5) since the derivations for inequality (A.6) are similar. Recall from the optimality equation (2.16): V ∗ (q,h) = min ( βP T (h) +αE a,h 0V ∗ (q− 1 +a,h 0 ),αE a,h 0V ∗ (q +a,h 0 ) ) ≤αE a,h 0V ∗ (q +a,h 0 ) (equality if q = 0) =αpE h 0V ∗ (q + 1,h 0 ) +α(1−p)E h 0V ∗ (q,h 0 ). (A.9) Take expectation on both side w.r.t. h and it follows that: E h 0V ∗ (q + 1,h 0 )≥ 1 +αp−α αp E h 0V ∗ (q,h 0 ). (A.10) To show inequality (A.7), we only need to show: E h 0V ∗ (q− 1,h 0 )≥ αp +α− 1 αp E h 0V ∗ (q,h 0 ). (A.11) In value iteration, we start with V (0) = 1, which automatically satisfies condition (A.11). We assume that in the kth iteration we have E h 0V (k) (q− 1,h 0 )≥ αp +α− 1 αp E h 0V (k) (q,h 0 ). (A.12) Then in the (k + 1)th iteration: 106 If q = 0. 
V (k+1) (0,h) =αE a,h 0V (k) (a,h 0 ) ≥ αp +α− 1 αp αE a,h 0V (k) (1 +a,h 0 ) ≥ αp +α− 1 αp min ( βP T (h) +αE a,h 0V (k) (a,h 0 ),αE a,h 0V (k) (1 +a,h 0 ) ) = αp +α− 1 αp V (k+1) (1,h). (A.13) The first inequality follows because a and h are independent random variables so we can directly use the induction hypothesis. Similar analysis can be applied to the case where 1≤ q≤ Q− 1. Equation (A.12) holds in each iteration and therefore holds in the optimal value functionV ∗ since V (k) →V ∗ . As a result, we have proved Equation (A.11). By simple summation, inequality (A.10) and inequality (A.11) will give us inequality (A.7), which is a sufficient condition for inequality (A.5). Together with inequality (A.6), we prove that p· 1(q = Q) + E a,h 0V ∗ (min(q +a,Q),h 0 )− E a,h 0V ∗ (q− 1 +a,h 0 ) is increasing in q. A.3 Proof of Theorem 3 If the optimal action is to transmit, the expected cost of transmission should not be greater than that of silence: βP T (h)≤αE a,h 0V ∗ (q +a,h 0 )−αE a,h 0V ∗ (q− 1 +a,h 0 ) 1≤q≤Q− 1 (A.14) βP T (h)≤p +αE a,h 0V ∗ (Q,h 0 )−αE a,h 0V ∗ (Q− 1 +a,h 0 ) (A.15) 107 A.3.1 Thresholded policy in h. When q = 0, we do not transmit, which can be viewed as a special thresholding policy in h. Given 1 ≤ q ≤ Q, the R.H.S of Equation (A.14) and (A.15) are just a constant w.r.t. h, and βP T (h) is a decreasing function of h. Therefore there is a threshold valueh th such that the inequality holds whenh≥h th . Therefore we have thresholded policy in h given q. A.3.2 Thresholded policy in q. When q = 0, we do not transmit. If 1≤q≤Q− 1, given h, βP T (h) in Equation (A.14) is just a constant. Theorem 2 reveals thatp· 1(q =Q) +αE a,h 0V ∗ (min(q + a,Q),h 0 )−αE a,h 0V ∗ (q− 1 +a,h 0 ) is increasing in q, thus there exists a threshold q th such that the inequality holds when q≥q th . A.4 Proof of Theorem 4 The original value function and the approximated value function are given by v = (I−αP) −1 c and ˆ v = M(I−αM T PM) −1 M T c,where M is the subspace. 
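Before deriving the sufficient condition, the claimed equality can be sanity-checked numerically: build a low-rank, row-stochastic P, take M as an orthonormal basis for the column space of [P, c], and compare the exact and projected value functions. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, alpha = 12, 3, 0.9

# random rank-r, row-stochastic transition matrix P
A, B = rng.random((n, r)), rng.random((r, n))
P = A @ B
P /= P.sum(axis=1, keepdims=True)        # row normalization preserves rank <= r
c = rng.random(n)

# M: orthonormal basis for the column space of [P, c], via the SVD
S = np.column_stack([P, c])
U, sv, _ = np.linalg.svd(S, full_matrices=False)
M = U[:, sv > 1e-10]

v_exact = np.linalg.solve(np.eye(n) - alpha * P, c)      # v = (I - alpha P)^-1 c
k = M.shape[1]
v_proj = M @ np.linalg.solve(np.eye(k) - alpha * M.T @ P @ M, M.T @ c)
print(np.allclose(v_exact, v_proj))      # the two value functions coincide
```

This works because col(M) contains c and is invariant under P, so the projected fixed-point equation reproduces the exact one, which is exactly what the derivation below establishes.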
One sufficient condition for v = ˆ v is to have (I− αP) −1 = M(I− αM T PM) −1 M T . From this equality we have: (I−αP) −1 = M(I−αM T PM) −1 M T (A.16) ⇒ (I−αP) −1 M = M(I−αM T PM) −1 (A.17) ⇔ M(I−αM T PM) = (I−αP)M (A.18) ⇔ MM T PM = PM, (A.19) where the second equality follows from the orthogonality of M. 108 Notice that MM T is a projection matrix, thus the columns of PM lie in the column space of M. Also, columns of PM is in the column space of P. Since P is low rank, a sufficient condition for Equation (A.19) to hold is to have columns of M spanning the column space of P. There will be an addition problem going back from Equation (A.17) to Equation (A.16) due to the existence of a projection matrix. However, if c also lies in the column space of M, then from Equation (A.17) we have: ˆ v = M(I−αM T PM) −1 M T c = (I−αP) −1 MM T c = (I−αP) −1 c = v. (A.20) Therefore, for v = ˆ v, subspace M should be the set of orthonormal basis that span the column space of P⊕ c. A.5 Proof of Theorem 5 One possible case for P being low rank is that it has multiple similar columns. From the perspective of Markov chain, it means that several states have similar in-degree. In our model, the state of the system is represented by the pair (q,h), the state can be placed on a 2D plane (shown in Fig. A.1). Notice that given the protocol, for a state with buffer length q, it can only be reached from the states that have buffer length q− 1 (packet arrival and no transmission), or buffer lengthq (no packet arrival and no transmission), or buffer length q + 1 (no packet arrival and transmission). Due to the i.i.d property of the channel, for state i and state j that have the same buffer length, if there exists 109 Buffer Channel p 1 p 2 p 3 p 5 p 6 q q −1 q +1 1 2 3 4 5 p 4 Figure A.1. Pictorial representation of low rank propoerty. State 1 and 2 have same ancestor states, and their incoming transition probabilities are scalar multiple of each other. 
a transition from state k to state i, there also exists a transition from state k to state j. In other words, state i and state j have same set of ancestor states. Denote P[(q 0 ,h 0 )→ (q,h)] as the transition probability from state (q 0 ,h 0 ) to (q,h). Consider two states that have the same buffer length (q,h 1 ), (q,h 2 ). The in-degree/incoming probabilities for state (q,h 1 ) are as follows: P[(q 0 ,h 0 )→ (q,h 1 )] = p·P h (h =h 1 ), q 0 =q− 1, μ(q 0 ,h 0 ) = 0. (1−p)·P h (h =h 1 ), q 0 =q, μ(q 0 ,h 0 ) = 0. (1−p)·P h (h =h 1 ), q 0 =q + 1, μ(q 0 ,h 0 ) = 1. 0, otherwise, where p is the probability for packet arrival, P h is the probability mass function for channel state, and μ(q,h) ={0, 1} Δ ={silence, transmit}. 110 Similarly, for the in-degree for state (q,h 2 ), we have: P[(q 0 ,h 0 )→ (q,h 2 )] = p·P h (h =h 2 ), q 0 =q− 1, μ(q 0 ,h 0 ) = 0. (1−p)·P h (h =h 2 ), q 0 =q, μ(q 0 ,h 0 ) = 0. (1−p)·P h (h =h 2 ), q 0 =q + 1, μ(q 0 ,h 0 ) = 1. 0, otherwise. By simple comparison, it can be found that P[(q 0 ,h 0 ) → (q,h 2 )] = P h (h=h 2 ) P h (h=h 1 ) P[(q 0 ,h 0 )→ (q,h 1 )]. Therefore, all the states sharing the same buffer length have similar in-degree distribution, i.e., their in-degrees are just scalar multiples of each other. As a result, the rank of the probability transition matrix is always Q + 1. 111 Appendix B On Sampled Reinforcement Learning in Wireless Networks: Exploitation of Policy Structures B.1 Proof of Theorem 6 In the line sampling problem shown in Fig. 3.9, there are M sampled points {x 1 ,x 2 ,...,x M } ranging from smallest to largest. From order statistics [81], the probability density function of the k-th smallest random variable x k is given by: f (k) (x) =M M− 1 k− 1 ! x k−1 (1−x) M−k , 0<x< 1. 
(B.1) It is easy to verify that E[x k ] = k M+1 , hence the expected index of the k-th sampling state on this row is k M+1 Q, i.e., on average, the range of the buffer index [0,Q] can be partitioned into the unions of the subintervals h k M+1 Q, k+1 M+1 Q i , k∈{0, 1,...,M}, M =dΨ(Q + 1)e. The true boundary state must lie within one particular subinterval, causing its left sampled state labeled 0 and right sampled state labeled 1 due to the 112 thresholded policy structure. Hence the estimated boundary state will also lie in the same subinterval when performing policy interpolation. For a particular row with channel state h, denote Δ h as the average number of states that have erroneous policies, then the average policy error Υ can be expressed as: Υ = P H h=1 Δ h (Q + 1)H . (B.2) Our goal is to derive the decay rate of the policy error w.r.t. network size N, under the assumption thatQ + 1≈H andN = (Q + 1)H (hence (Q + 1)≈ √ N). We look at upper and lower bounds of the Υ and will see that the two bounds have the same decay rate. To this end, we first seek upper and lower bounds of Δ h . B.1.1 Upper bound As described above, since the true boundary state and the estimated boundary state both lie in the same subinterval, the maximum difference between the two indices can not be greater than Q M+1 , i.e., Δ h ≤ Q M+1 . Therefore, the upper bound for the average policy error can be obtained: Υ = P H h=1 Δ h (Q + 1)H ≤ Q M+1 ·H (Q + 1)·H = Q (Q + 1)(M + 1) = √ N− 1 √ N· (dΨ √ Ne + 1) = O(1/ √ N). (B.3) B.1.2 Lower bound Since we have no prior information about the location of the true boundary state, it is reasonable to assume that the boundary state is uniformly distributed on interval 113 k M+1 Q, k+1 M+1 Q i with probability M+1 Q (since there are totally Q M+1 elements) ∗ . 
Denote ˆ x as the index of the estimated boundary states and x i ∈ k M+1 Q, k+1 M+1 Q i be the possible true boundary state, a lower bound for Δ h can obtained as follows: Δ h = X i M + 1 Q |x i − ˆ x| > Q M + 1 − 1 M + 1 Q · 1 = Q−M− 1 Q , (B.4) where the inequality follows from the fact that when the true boundary state x i 6= ˆ x,|x i − ˆ x|≥ 1, and there are a total of Q M+1 − 1 values of x i that satisfy |x i − ˆ x|≥ 1. Therefore, the lower bound for the average policy error is: Υ = P H h=1 Δ h (Q + 1)·H > (Q−M− 1)·H Q· (Q + 1)·H = Q−M− 1 Q· (Q + 1) ≈ √ N−dΨ √ Ne− 2 ( √ N− 1) √ N =O(1/ √ N). (B.5) Notice that the lower bound is trivial when Ψ = 1. We compute upper and lower bounds and show that they have the same decay rate O(1/ √ N). Thus proving the Proposition. ∗ The left most state is not included since it is labeled as 0 (i.e., silence) 114 B.2 Proof of Theorem 7 Since t∈ (x n ,x n+1 ) for some n∈{0, 1, 2,...,M} (here we assume that x 0 = 0 and x M+1 = 1), a natural lower bound is x l = x n . In addition, n is also a random variable, hence the law of total expectation is employed: E[x l ] = E n {E(x n |x n <t<x n+1 )} = M X n=0 P (x n <t<x n+1 )· E x n x n <t<x n+1 = M X n=0 P (x n <t<x n+1 ) Z t 0 x n f(x n |x n <t<x n+1 )dx n . (B.6) One simplification can be made, notice that: f(x n =u|x n <t<x n+1 ) = ∂ ∂u P (x n <u|x n <t<x n+1 ) = ∂ ∂u " P (x n <u≤t<x n+1 ) P (x n <t<x n+1 ) # . (B.7) Therefore, Equation (B.6) can be simplified as: E[x l ] = M X n=0 Z t 0 u ∂ ∂u P (x n <u≤t<x n+1 )du, (B.8) where x n <u≤t<x n+1 denotes the event that there are exactly n variables less thanu (which is less thant), and the remaining variables are greater thant. Using properties of order statistics, the probability for this event is: P (x n <u≤t<x n+1 ) = M n ! u n (1−t) M−n . (B.9) 115 Finally, the closed form expression for E[x l ] can be derived: E[x l ] = M X n=0 Z t 0 u· M n ! nu n−1 · (1−t) M−n du = M X n=0 M n ! n n + 1 t n+1 (1−t) M−n . 
(B.10) Similar techniques can be employed to obtain E[x u ]: E[x u ] = E n {E(x n+1 |x n <t<x n+1 )} = M X n=0 P (x n <t<x n+1 )· E x n+1 x n <t<x n+1 = M X n=0 Z 1 t v· ∂ ∂v P (x n <t<x n+1 <v)dv = M X n=0 Z 1 t M n ! t n ·v(M−n)(1−v) M−n−1 dv = M X n=0 M n ! t n " t(1−t) M−n + (1−t) M−n+1 M−n + 1 # . (B.11) The fourth equality follows since: P (x n <t<x n+1 <v) = M n ! t n [(1−t) M−n − (1−v) M−n ]. B.3 Proof of Theorem 8 Given that x svm = x 1 , if x 1 >t xn+x n+1 2 , if x n <t<x n+1 , 1≤n≤M− 1 x M , if x M <t , (B.12) 116 we have: E[x svm ] =P (x 1 >t)E(x 1 |x 1 >t) + M−1 X n=1 P (x n <t<x n+1 )E x n +x n+1 2 x n <t<x n+1 ! +P (x M <t)E(x M |x M <t). (B.13) Notice that the first term and third term reduce to computing the expectation of the minimum and maximum random variable. The computation for the second term follows from the same procedure as the proof for Theorem. 7 (See Appendix. B.2) 117 Appendix C Neural Policy Gradient in Wireless Networks: Analysis and Improvement C.1 Proof of Theorem 9 From the softmax function (4.9), the output probabilities for silence and transmit action are: π W,b (0|s) = 1−π W,b (1|s) = e w T 1 x+b 1 e w T 1 x+b 1 + e w T 2 x+b 2 . (C.1) For notational simplicity, we just use π to represent π W,b . By simple derivations, it is easy to obtain: ∂ logπ(0|s) ∂w 1 = [1−π(0|s)]x, ∂ logπ(1|s) ∂w 1 =−π(0|s)x ∂ logπ(0|s) ∂w 2 =−π(1|s)x, ∂ logπ(1|s) ∂w 2 = [1−π(1|s)]x. (C.2) 118 Therefore, we can plug in the above equations to Equation (4.2) and compute the gradient: ∂J ∂w 1 = E τ (T) ( T X t=1 ∂ logπ(a t |s t ) ∂w 1 G t ) = E τ (T) ( T X t=1 [I(a t = 0)−π(0|s t )]G t · x t ) , similarly, we have : ∂J ∂w 2 = E τ (T) ( T X t=1 [I(a t = 1)−π(1|s t )]G t · x t ) . Since we have a binary action space: I(a t = 0)−π(0|s t ) + I(a t = 1)−π(1|s t ) = 0. (C.3) Therefore, we have ∂J ∂w 1 =− ∂J ∂w 2 , a similar analysis can be obtained for b 1 , b 2 , ∂J ∂b 1 =− ∂J ∂b 2 . 
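The opposite-gradient property ∂J/∂w1 = −∂J/∂w2 can be checked directly on the score functions of Equation (C.2), since for either action the two weight-gradients are exact negatives of each other; any return-weighted sum of them therefore is too. A small numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)                       # input vector to the final layer
w1, w2 = rng.normal(size=3), rng.normal(size=3)
b1, b2 = 0.3, -0.1

z = np.array([w1 @ x + b1, w2 @ x + b2])
pi = np.exp(z - z.max())
pi /= pi.sum()                               # [pi(0|s), pi(1|s)]

def score(a):
    # score functions from Equation (C.2):
    # returns (d log pi(a|s)/d w1, d log pi(a|s)/d w2)
    if a == 0:
        return (1 - pi[0]) * x, -pi[1] * x
    return -pi[0] * x, (1 - pi[1]) * x

# since pi(0|s) + pi(1|s) = 1, the two gradients are negatives for either action
for a in (0, 1):
    g1, g2 = score(a)
    assert np.allclose(g1, -g2)
```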
C.2 Proof of Lemma 1 We first show that E at {∇ logπ(a t |s t )} = 0, since: E at {∇ logπ(a t |s t )} =π(0|s t )∇ logπ(0|s t ) +π(1|s t )∇ logπ(1|s t ) =π(0|s t )[1−π(0|s t )]·q + [1−π(0|s t )]· (−π(0|s t ))·q = 0, (C.4) where the gradient computation follows from (C.2). 119 Denote P(τ (T ) ) as the probability of a specific trajectory τ (T ) , we have: P(τ (T ) ) = " T−1 Y t=1 π(a t |s t )p(s t+1 |s t ,a t ) # ·π(a T |s T ). (C.5) For each∇ logπ(a i |s i ) in Z T (1≤i≤T ) we have : E τ (T){∇ logπ(a i |s i )} = X τ (T) P(τ (T ) )·∇ logπ(a i |s i ) = X st,at 1≤t≤T h T−1 Y t=1 π(a t |s t )p(s t+1 |s t ,a t ) i π(a T |s T )·∇ logπ(a i |s i ) (b) = X st,at 1≤t≤i h i−1 Y t=1 π(a t |s t )p(s t+1 |s t ,a t ) i π(a i |s i )·∇ logπ(a i |s i ) = X st,at 1≤t≤i h i−1 Y t=1 π(a t |s t )p(s t+1 |s t ,a t ) i · E a i {∇ logπ(a i |s i )} = 0, (C.6) where (b) follows from performing the summation with respect to all variables a j , s j , j >i first. Therefore, we have E τ {Z T } = 0. C.3 Proof of Theorem 10 We use mathematical induction. Starting from T = 1, we have: E τ (1){X 1 } =E τ (1){c 1 Z 1 } =E s 1 ,a 1 {c 1 ∇ logπ(a 1 |s 1 )} =E s 1 n π(0|s 1 )·c(s 1 , 0)·∇ logπ(0|s 1 ) +π(1|s 1 )·c(s 1 , 1)·∇ logπ(1|s 1 ) o =E s 1 n π(0|s 1 )[1−π(0|s 1 )]· [1(overflow)−c(s 1 , 1)] o . (C.7) 120 Notice that c(s 1 , 1) is positive. The indicator function is 1 only when the buffer lengthq is full. For a system with (Q + 1)H number of states, the quantity c(s 1 , 0)−c(s 1 , 1) is positive for at most H number of the states; for the rest of the states, c(s 1 , 0)−c(s 1 , 1) is negative. Under the uniform assumption of s 1 in Remark. 12 the expectation is negative, i.e.,E τ (1){X 1 }< 0. SupposeE τ (T){X T }< 0 holds for some T≥ 2, then for T + 1: E τ (T+1){X T +1 } =E τ (T+1){X T +c T +1 Z T +1 } (c) =E τ (T+1){X T } +E τ (T+1){c T +1 Z T } +E τ (T+1){c T +1 ∇ logπ(a T +1 |s T +1 )}, (C.8) where (c) follows from expanding Z T +1 as a summation of Z T and ∇ logπ(a T +1 |s T +1 ). 
We examine the three terms individually. For the first term, notice that \(X_T\) is a random variable containing the trajectory information from time 1 to time \(T\); thus the summation can be performed over the variables after time \(T\) first:
\[
\begin{aligned}
\mathbb{E}_{\tau^{(T+1)}}\{X_T\} &= \sum_{\tau^{(T+1)}} P(\tau^{(T+1)})\,X_T\\
&= \sum_{\tau^{(T+1)}} \left[\prod_{t=1}^{T}\pi(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t)\right]\pi(a_{T+1}\mid s_{T+1})\,X_T\\
&\overset{(d)}{=} \sum_{\tau^{(T)}} \left[\prod_{t=1}^{T-1}\pi(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t)\right]\pi(a_T\mid s_T)\,X_T\\
&= \mathbb{E}_{\tau^{(T)}}\{X_T\} < 0, \quad \text{(C.9)}
\end{aligned}
\]
where (d) follows from performing the summation with respect to \(s_{T+1}, a_{T+1}\) first, and the final inequality is the induction hypothesis.

For the third term, \(\mathbb{E}_{\tau^{(T+1)}}\{c_{T+1}\nabla\log\pi(a_{T+1}\mid s_{T+1})\}\), the same technique as in Equation (C.7) shows that it is negative.

For the second term, \(\mathbb{E}_{\tau^{(T+1)}}\{c_{T+1} Z_T\}\), we first condition on \(c_{T+1}\):
\[
\mathbb{E}_{\tau^{(T+1)}}\{c_{T+1} Z_T\} = \mathbb{E}_{c_{T+1}}\Big\{c_{T+1}\,\mathbb{E}_{\tau^{(T)}}\big[Z_T\mid c_{T+1}\big]\Big\}. \quad \text{(C.10)}
\]
Conditioned on \(c_{T+1}\), denote by \(\mathcal{S}_c\) the set of states that achieve the instantaneous cost \(c_{T+1}\). It can be shown that the states in \(\mathcal{S}_c\) can be reached from any other state, due to the structural similarity induced by the network protocol. Therefore, before reaching a state \(s_{T+1}\in\mathcal{S}_c\), the set of length-\(T\) trajectories entering \(Z_T\) is exactly the same as if we were considering \(Z_T\) alone, without any transition to the future, and these trajectories have the same probability distribution. Then, by Lemma 1, \(\mathbb{E}_{\tau^{(T)}}[Z_T\mid c_{T+1}] = 0\), and as a result \(\mathbb{E}_{\tau^{(T+1)}}\{c_{T+1} Z_T\} = 0\).

Finally, \(\mathbb{E}_{\tau^{(T+1)}}\{X_{T+1}\} < 0\), since in Equation (C.8) the first and third terms are strictly negative and the second term is zero.

C.4 Proof of Theorem 11

Given the input vector \(y\) to the last layer, the weight matrix \(W_l = [w_1, w_2, \ldots, w_k]^T\) and the bias vector \(b_l = [b_1, b_2, \ldots, b_k]^T\), the output probability is given by
\[
\pi(a_t = i\mid s) = \frac{e^{w_i^T y + b_i}}{\sum_{j=1}^k e^{w_j^T y + b_j}}. \quad \text{(C.11)}
\]
It is easy to verify that the gradient with respect to each \(w_i\), \(i = 1,2,\ldots,k\), has the following form:
\[
\begin{aligned}
\nabla_{w_i} \log\pi(a_t=i\mid s_t)
&= \frac{\partial}{\partial w_i}\log\frac{e^{w_i^T y + b_i}}{\sum_{j=1}^k e^{w_j^T y + b_j}}\\
&= y - \frac{e^{w_i^T y + b_i}}{\sum_{j=1}^k e^{w_j^T y + b_j}}\,y\\
&= \big[1-\pi(a_t=i\mid s_t)\big]\,y. \quad \text{(C.12)}
\end{aligned}
\]
Similarly, we have
\[
\nabla_{w_i} \log\pi(a_t=j\mid s_t) = -\pi(a_t=i\mid s_t)\,y, \quad j\ne i. \quad \text{(C.13)}
\]
Combining Equations (C.12) and (C.13) yields
\[
\nabla_{w_i} \log\pi(a_t\mid s_t) = \big[\mathbb{I}(a_t=i)-\pi(a_t=i\mid s_t)\big]\, y. \quad \text{(C.14)}
\]
As a result, when summing the gradients,
\[
\begin{aligned}
\sum_{i=1}^k \frac{\partial J}{\partial w_i}
&= \sum_{i=1}^k \mathbb{E}_{\tau^{(T)}}\left\{\sum_{t=1}^T \frac{\partial\log\pi(a_t\mid s_t)}{\partial w_i}\, G_t\right\}\\
&= \mathbb{E}_{\tau^{(T)}}\left\{\sum_{t=1}^T \sum_{i=1}^k \big[\mathbb{I}(a_t=i)-\pi(a_t=i\mid s_t)\big]\, G_t\, y\right\} = 0, \quad \text{(C.15)}
\end{aligned}
\]
where the last equality exploits the facts that \(\sum_{i=1}^k \mathbb{I}(a_t=i)=1\) and \(\sum_{i=1}^k \pi(a_t=i\mid s_t)=1\).
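The identities (C.12)–(C.15) can be sanity-checked numerically. The sketch below uses an illustrative \(k\)-action softmax layer (the values of \(k\), the feature dimension, and all weights are arbitrary choices for the example), compares the closed-form score (C.14) against a central finite-difference gradient, and confirms that the \(k\) per-action gradients sum to zero, which for \(k=2\) reduces to \(\partial J/\partial w_1 = -\partial J/\partial w_2\) from Theorem 9:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 4, 6                          # illustrative action count / feature size
y = rng.normal(size=d)               # input vector to the last layer
W = rng.normal(size=(k, d))          # rows w_1, ..., w_k
b = rng.normal(size=k)

def log_pi(W, a):
    """log softmax probability of action a, as in (C.11)."""
    z = W @ y + b
    z = z - z.max()                  # numerical stability
    return z[a] - np.log(np.exp(z).sum())

pi = np.exp([log_pi(W, a) for a in range(k)])

def score(a):
    """Closed-form (C.14): row i is [I(a = i) - pi(i|s)] y."""
    return np.outer(np.eye(k)[a] - pi, y)

eps = 1e-6
for a in range(k):
    # Central finite differences over every entry of W.
    num = np.zeros_like(W)
    for i in range(k):
        for j in range(d):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            num[i, j] = (log_pi(Wp, a) - log_pi(Wm, a)) / (2 * eps)
    assert np.allclose(num, score(a), atol=1e-5)               # (C.14) holds
    assert np.allclose(score(a).sum(axis=0), 0.0, atol=1e-12)  # rows sum to 0, cf. (C.15)
```

Because the rows of the score sum to zero for every realized action, any return-weighted combination of them also sums to zero, which is the mechanism behind (C.15).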