ACCELERATING REINFORCEMENT LEARNING USING HETEROGENEOUS PLATFORMS: CO-DESIGNING HARDWARE, ALGORITHM, AND SYSTEM SOLUTIONS

by Yuan Meng

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER ENGINEERING)

August 2024

Copyright 2024 Yuan Meng

Dedication

To my family and friends.

Acknowledgements

I would like to extend my profound and heartfelt gratitude to my advisor, Professor Viktor K. Prasanna, for his steadfast guidance and support. Throughout my Ph.D. journey, he has patiently trained me in diverse research activities, generously supported my work, and guided me through various challenges and obstacles. I am also deeply thankful to Professor Rajgopal Kannan for his valuable guidance and collaboration on numerous research projects and publications. I am especially thankful to Sanmukh (now Professor Kuppannagari!), who is always passionate about discussing difficult problems and has helped me stay optimistic about my work during the most challenging times. Additionally, I am grateful to Professor Yue Zhao and Professor Bhaskar Krishnamachari for their willingness to serve on my dissertation committee and for their invaluable input and support. I would also like to thank Professor Bistra Dilkina and Professor Paul Bogdan, who served on my qualifying exam committee. Their kind suggestions and insightful feedback have sparked many research ideas.

I feel lucky to have been in an active research group. I would like to thank Bingyi Zhang, Hongkuan Zhou, Hanqing Zeng, and Chi Zhang for their guidance during my initial Ph.D. years. They were sincere in sharing their life and work experiences during graduate studies, as well as tips for living better in the LA area. I was lucky to be able to work closely with Sam Wiggins, Hongjiang Men, and Qian Wang on several research papers. All of them are brilliant students with perseverance, and I wish them all the best in their academic and industrial careers. Thanks to Sasindu Wijeratne, Jason Lin, Pengmiao Zhang, Yang Yang, and many others for their inspiring discussions.

I am deeply thankful for the unwavering support of my many friends who have journeyed with me through our Ph.D. endeavors. To Hanyuan Xiao and Wen Lin, who have emotionally supported me through countless struggling days and nights: I wish us all success in achieving the goals we set at the start of our Ph.D. careers. To Anguo Hu, Hang Yu, Weiye Wang, Yannan Li, Tian Sang, and Yuying Li, who have been very warm and accompanied me in numerous outdoor and sports activities: I cherish the memories of our gaming adventures, dining out, and festival celebrations.

Finally, I wish to convey my heartfelt gratitude to my devoted parents, Jun Sun and Wentao Meng. They share in the achievement of this degree. They always actively strive to learn about my work and my research group, and to truly understand my situations and experiences as a Ph.D. student. They have provided a safety net for me financially and have shared in all the highs and lows of my Ph.D. journey. Completing my Ph.D. would have been impossible without their unwavering support and the boundless freedom they have granted me throughout my life.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Reinforcement Learning
    1.2 Challenges and Motivation
    1.3 Metrics for Evaluating an RL Implementation
        1.3.1 System Implementation Performance
        1.3.2 Algorithm Performance
    1.4 Thesis Contributions
        1.4.1 Hardware-Algorithm Co-Design for Key RL Primitives
        1.4.2 System Solutions for RL Using Heterogeneous Platforms
    1.5 Organization
Chapter 2: Background and Related Works
    2.1 Reinforcement Learning Primitives and Workload Characterization
        2.1.1 Model-Free Deep Reinforcement Learning
        2.1.2 Model-Based Reinforcement Learning using MCTS
    2.2 Computing Platforms
    2.3 Parallel Methods, Libraries and Frameworks for RL
    2.4 Hardware Accelerators for Reinforcement Learning
    2.5 Target Reinforcement Learning Algorithms in This Thesis
Chapter 3: Acceleration of Reinforcement Learning Primitives
    3.1 DYNAMAP: A Dynamic Algorithm Mapping Framework for DNN Inference
        3.1.1 Motivation
        3.1.2 Formulation for Algorithm Mapping
        3.1.3 Accelerator Design for Dynamic Algorithm Switching
        3.1.4 Design Exploration
        3.1.5 Experiments
    3.2 FSCAccel: Hardware-Algorithm Co-Design for Fractionally Strided Convolution in Training
        3.2.1 Fractionally Strided Convolution in CNN Training: Computational Challenges
        3.2.2 FSC in CNN Training
        3.2.3 Algorithmic Optimization
        3.2.4 Accelerator Design
        3.2.5 Evaluation
    3.3 Acceleration of Tree-based Policy in MCTS
        3.3.1 MCTS Performance Analysis and Challenges
        3.3.2 Accelerator Algorithm-Hardware Co-Optimizations
        3.3.3 Accelerator Evaluation
Chapter 4: Acceleration System for Model-Free Deep Reinforcement Learning using Heterogeneous Platforms
    4.1 Motivation
    4.2 Runtime System & Training Protocol
        4.2.1 System Design
        4.2.2 DRL Heterogeneous Training Protocol
        4.2.3 Runtime System Optimizations
            4.2.3.1 Task Parallelism and Data Parallelism over Heterogeneous Hardware
            4.2.3.2 Communication Overhead Reduction
    4.3 System Composer
        4.3.1 Heterogeneous System Composition
        4.3.2 Accelerator Setup and Performance Estimation
    4.4 Parameterized Library of Primitives
        4.4.1 Replay Manager (RM)
            4.4.1.1 RM on CPU and GPU
            4.4.1.2 RM on FPGA
        4.4.2 Learner
            4.4.2.1 Learner on CPU and GPU
            4.4.2.2 Learner on FPGA
    4.5 Evaluation
        4.5.1 Experiment Setup
        4.5.2 Performance of Accelerated Primitives
        4.5.3 System Composition
        4.5.4 Comparison with Existing DRL Libraries
        4.5.5 User Productivity
Chapter 5: Acceleration Systems for Model-Based Monte-Carlo Tree Search using Heterogeneous Platforms
    5.1 MCTS System Solution on CPU-FPGA Platforms
        5.1.1 System and Framework Specifications
        5.1.2 Framework Workflow
        5.1.3 Evaluation
    5.2 MCTS System Solution on Multi-Core and CPU-GPU Platforms
        5.2.1 Adaptive Parallelism and Implementation
            5.2.1.1 Parallelization Schemes
            5.2.1.2 Adaptive Parallelism
            5.2.1.3 GPU-offloaded DNN Inference
        5.2.2 Performance Analysis
        5.2.3 Evaluation
Chapter 6: Conclusion and Future Work
    6.1 Summary of Contributions
    6.2 Future Work
        6.2.1 Emerging Composable Heterogeneous Platforms
        6.2.2 Multi-Agent RL Systems
        6.2.3 Distributed RL for Emerging Large Models
        6.2.4 Real-Time RL Deployment on the Edge
Bibliography

List of Tables

3.1 Tensor Layout Transformations
3.2 Load/Store Latency
3.3 End-to-end Latency Improvement due to Dynamic Algorithm Mapping
3.4 Comparison with State-of-the-art Implementations
3.5 Hardware Parameters/Resource Utilization for Benchmark CNN Models
3.6 Comparisons with Existing Works
3.7 FPGA Resource Consumption
3.8 Itv of Dynamic vs Static Tree Management
4.1 Specification of Heterogeneous Platforms
4.2 Benchmarking Environments and Algorithms
4.3 Comparison with Existing DRL Frameworks
4.4 User Productivity
5.1 API Functions

List of Figures

2.1 RL Workflow
2.2 Replay Manager Data Layout and Operations
2.3 DRL Primitives Workload Analysis
2.4 Heterogeneous Platform
2.5 Categorization of Key RL Algorithms in This Thesis
3.1 Computation and Memory Loads of GEMM-CONV Algorithms on Different Layer Configurations
3.2 Architecture Overview
3.3 PE Dataflow and Optimizations to Support Stall-Free Operations
3.4 LTU (Data-Store Side): FSM Flow
3.5 DYNAMAP Software Tool Flow
3.6 A Snippet of an Example G_PSA1,PSA2
3.7 Layer Execution Times: Inception-v4
3.8 Layer Execution Times: GoogleNet
3.9 FSC Producing a 224^2-Pixel Feature Map with 64 (3) Input (Output) Channels
3.10 Stride-2 CONV FW and Its Corresponding BW Fractionally-Strided CONV
3.11 FSC as Interpolated Pixel-Broadcast-Multiplication: Up-Sampling from a 2 × 2 to a 5 × 5 Grid
3.12 Im2col Applied to FSC
3.13 MCMK kn2row Applied to FSC
3.14 Dataflow Accelerator Engine Architecture
3.15 1x1 Conv Core Dataflow
3.16 Pad-Accumulation Module
3.17 Effective Utilization by BW Layers: Nature-CNN
3.18 Effective Utilization for Common Auto-Decoding Layer Configurations
3.19 Effective Utilization by FW Layers: DCGAN
3.20 Back-Propagation Time: Nature-CNN
3.21 Inference Time: Autodecoders
3.22 Inference Time: DCGAN
3.23 MCTS System Performance on CPUs
3.24 Accelerator Design: Overview
3.25 Example Custom Butterfly-based Interconnection: D = 4, Y = 8
3.26 Example Comparison-LookUp Design: F = 9, f = 3
3.27 Latency of In-Tree Operations (Y-axes in log scale)
4.1 Framework Overview
4.2 Runtime System
4.3 DRL Heterogeneous Training Protocol
4.4 Inputs to the System Composition Algorithm: I. a Task Dependency Graph; II. a Heterogeneous Compute Latency Table; and III. a Heterogeneous Interconnection Bandwidth Table
4.5 FPGA - Replay Manager Hardware Module
4.6 FPGA - Learner Hardware Module
4.7 RM Operation Latency across Devices
4.8 Learner Latency across Devices
4.9 System Composition (TP: task parallelism, DP: data parallelism for Parallelized Learners over Devices)
4.10 Adopted Task Graph to Device Mappings for Task-Parallel Learners (top: the optimal mapping in Figure 4.9-(h); bottom: the optimal mapping in Figure 4.9-(c)/(f))
4.11 Rewards over Time
5.1 Hybrid Parallel Execution Model Workflow (Exp-Sim: Expansion and Simulation; Upd-Sel: Update, i.e., node updates in Back-Up, and Selection)
5.2 Heterogeneous System Overview (Global Memory is the CPU DRAM)
5.3 Framework Design Tool Flow
5.4 Timeline of the Parallel Execution in [87] and Our Framework (p = 128)
5.5 System Throughput Comparisons (D = 32)
5.6 Rewards under Various Frameworks (async. (sync.) stands for execution with (without) dependency-relaxation)
5.7 Shared-Tree Method
5.8 Local-Tree Method
5.9 Design Exploration of Inference Batch Size
5.10 Training Throughput under Optimal Configurations
5.11 DNN Loss over Time

Abstract

Reinforcement Learning (RL) is an area of AI that comprises a wide range of algorithms, enabling autonomous agents to learn optimal decisions through online environment interactions, data collection, and training. Recently, certain categories of RL algorithms have witnessed widespread adoption due to their generalizability and reliability, including model-free RL based on Deep Neural Network (DNN) policy optimizations and model-based RL using Monte Carlo Tree Search. The efficiency of running an RL algorithm on a hardware platform depends on several factors, including (a) the suitability of the hardware architecture for supporting the heterogeneous computation patterns fundamental to RL; (b) the capability of the hardware architecture's memory hierarchy to minimize data-communication overheads; and (c) the ability of the system to hide overheads introduced by the deeply nested, highly irregular computations in RL. General-purpose processors cannot simultaneously satisfy these requirements for all the RL components and algorithms. Optimized acceleration systems that exploit heterogeneity across different architectures to support the variations of compute kernels and memory characteristics in RL are crucial to fast and efficient development.

In this dissertation, we develop acceleration frameworks for two key categories of RL algorithms, i.e., model-free Deep RL and model-based RL using Monte Carlo Tree Search (MCTS). We implement these frameworks by addressing two objectives. Objective 1 is to develop algorithm-hardware co-optimized accelerators for the fundamental primitives in the key categories of RL algorithms. These include inference and training of DNN models, replay operations, and dynamic tree-based operations.
Objective 2 is to create end-to-end RL acceleration systems by identifying the scheduling, mapping, and design configurations of RL primitives onto heterogeneous devices based on the task dependency, compute, and memory characteristics of the target RL algorithms.

To realize Objective 1, we present fundamental acceleration methods for the key components of RL through algorithm-hardware co-optimizations. For DNN policy inference and training, we develop acceleration methods, DYNAMAP and FSCAccel, and demonstrate speedups using Field-Programmable Gate Arrays (FPGAs). DYNAMAP uses a Partition-Boolean Quadratic solver to optimally map layer-specific algorithms for computing different layers in arbitrarily shaped DNN computation graphs while re-using a unified hardware module. FSCAccel focuses on the backpropagation phase of Convolutional Neural Network (CNN) training and presents a novel parallel algorithm to eliminate wasteful zero-multiplications. We also design accelerators for MCTS tree models and tree-based replay operations using the on-chip memory of FPGAs. These constitute a library of accelerated primitives as the building blocks for realizing Objective 2.

To realize Objective 2, we present PEARL, a framework for portable, productive, and high-performance RL using heterogeneous platforms. We develop novel intermediate abstraction layers below a unified Python interface to enable efficient use of diverse hardware resources. These include a System Composer for parallel task mapping and scheduling based on an RL task dependency graph abstraction, and a Runtime Coordinator with a generalized RL training protocol that ensures portability across various platforms and model-free Deep RL algorithms. Additionally, for model-based RL, we introduce MCTS heterogeneous acceleration systems. We develop a general design space exploration method to address the tradeoff between parallelization and mapping strategies on CPU-GPU platforms. We also develop a CPU-FPGA acceleration framework with a hybrid parallel execution model that concurrently exploits the computing power of both devices; it leverages a near-memory accelerator design that encapsulates the tree-based search model on the FPGA.

To demonstrate the effectiveness of our approaches, we conduct detailed experiments using interconnected CPUs, FPGAs, and GPUs to showcase performance improvements across diverse models, algorithms, hardware platforms, and benchmark environments. By optimizing the compute processes from fundamental primitives to complete systems and ensuring efficient utilization of heterogeneous resources, these frameworks enable robust RL agents and significantly accelerate their development cycles.

Chapter 1: Introduction

1.1 Reinforcement Learning

Reinforcement Learning (RL) is a key method widely adopted in recent Artificial Intelligence research, offering a paradigm that enables agents to learn optimal decision-making strategies through interaction with an environment. Unlike supervised learning, where a model is trained on labeled data, and unsupervised learning, where patterns are extracted from unlabeled data, RL focuses on learning from feedback in the form of rewards or penalties. This dynamic process mimics how humans learn from trial and error, making RL a powerful class of methods for solving complex decision-making problems. The versatility of RL has led to its widespread adoption across various domains, including robotics, finance, healthcare, gaming, and autonomous systems [129].
Model-free and model-based RL are two important categories within this field, each with distinct advantages and applications. Model-free methods excel in scenarios where the dynamics of the environment are complex or unknown [82], while model-based approaches leverage explicit models of the environment to make decisions more efficiently [94].

Model-free RL algorithms operate without explicit knowledge of the environment's dynamics and aim to directly learn optimal policies or value functions from experiences. Prominent example algorithms in this category include Deep Q-Networks (DQN) [92], which utilize deep neural networks to approximate Q-values; Policy Gradient methods such as Proximal Policy Optimization (PPO) [118], which directly optimize the policy parameters to maximize expected cumulative rewards; and Actor-Critic methods such as Deep Deterministic Policy Gradient (DDPG) [79] and Soft Actor-Critic (SAC) [42].

Model-based RL algorithms aim to learn a model of the environment's dynamics and use it to make decisions. One of the most widely used model-based algorithms is Monte Carlo Tree Search (MCTS). MCTS is particularly effective in domains with large state spaces and complex dynamics [13]. It builds a search tree by iteratively sampling actions and simulating future trajectories, using the collected information to guide decision-making towards states with higher expected rewards. MCTS has demonstrated remarkable success in various applications, including neural architecture search [142], game playing [104], robotics [10], and optimization problems [72].

1.2 Challenges and Motivation

While RL methods are widely applied in various domains, there are several challenges for RL application developers in efficiently training RL agents:

1. Heterogeneity of compute primitives: The compute primitives intrinsic to RL algorithms include policy inference, policy update, and dataset management. Even within the same algorithm, they have a large variance in terms of their arithmetic intensity, computation, and memory requirements. Furthermore, different RL algorithms have different policy representations (e.g., decision trees [13], fully-connected deep neural networks [7], convolutional neural networks [159]), and they have different dataset management mechanisms (e.g., table-based [111] or prioritized replay [114]). These primitives cannot be simultaneously optimized using general-purpose processors, as this leads to various (compute or memory) overheads that become the performance bottleneck. Motivated by this RL characteristic, we propose to utilize heterogeneous hardware resources in state-of-the-art data centers to apply optimizations tailored to different RL primitives.

2. Multifarious hardware characteristics: Before deployment, application developers use simulation software to train their agents in data centers or servers to avoid trial-and-error scenarios that could potentially damage physical infrastructure [86, 154]. This involves iterative testing with various algorithms and hyper-parameters. State-of-the-art data centers provide hardware with vastly different characteristics and programming models. The computing architectures include general-purpose processors like CPUs, data-parallel architectures like GPUs, and spatial architectures like FPGAs, with different hardware characteristics that suit different workloads. Additionally, there are various external memory technologies such as 3D-stacked HBM and DRAM [62].
The interconnection also has various latency and bandwidth characteristics [123]. Sub-optimal placement and scheduling of RL primitives onto these hardware components can lead to under-utilization of heterogeneous resources, resulting in low training performance. Furthermore, the optimal RL primitive-to-hardware assignments can change based on varying algorithms and hardware platforms. Consistently achieving high-performance implementations is challenging, since it requires portable and flexible solutions that can optimally map RL onto various devices.

3. Intrinsic dependency in RL algorithms: Most RL algorithms follow a sequential process in which an agent's action step depends on its learning in the previous steps, so iterations are sequential and cannot be trivially parallelized. On one hand, increasing algorithm performance usually complicates parallelization for speed. For example, in the class of model-free RL, [114] proposes to use a prioritized replay to replace a random replay buffer in off-policy model-free RL. This improves algorithm efficiency at the cost of introducing additional sequential computation in the replay operations, which prolongs the latency of each iteration. On the other hand, enabling more parallelism for speed improvements leads to a tradeoff in algorithm performance. For example, in the class of model-based RL, leaf parallelization of MCTS enables high-throughput independent simulations, but it lowers algorithm efficiency by introducing unnecessary and repeated action steps [18].

Existing frameworks that support parallel and distributed RL [96, 78, 75] focus on fixed mapping solutions on general-purpose and data-parallel architectures. They do not address the challenges mentioned above. In this thesis, we bridge this gap by developing frameworks to accelerate two classes of RL algorithms. These frameworks adaptively utilize heterogeneous resources, including general-purpose processors (CPUs) interconnected with data-parallel architectures (GPUs) or spatial architectures (FPGAs), or both. We also address the trade-off between parallelization and algorithm performance: by selectively preserving serial execution and relaxing dependencies where applicable, we aim for optimal training throughput without negatively affecting reward convergence.

1.3 Metrics for Evaluating an RL Implementation

In this thesis, we aim to improve RL system implementation performance without negatively affecting RL algorithm performance, in order to accelerate RL application development. The metrics for evaluating system implementation performance and algorithm performance are summarized as follows:

1.3.1 System Implementation Performance

• System Training Throughput. The training throughput of RL is a latency-bound throughput defined as

$$\text{Training Throughput} = \frac{\text{Number of Actions or Training Steps Processed in One Iteration}}{\text{Iteration Latency}}.$$

The denominator is bounded by the latency of all the primitives involved in the RL algorithm, i.e., Iteration Latency = policy inference latency ⊛ policy update latency ⊛ replay operations latency, where ⊛ is an operator determined by how the primitives are scheduled: if they are all sequential, ⊛ is equivalent to +; if they are all overlapped, a ⊛ b ⊛ c is equivalent to max(a, b, c). A small sketch of these metrics follows this list.

• Latency: Policy Inference, Policy Update, and Dataset Management (Replay) Operations.
Since the system training throughput can be bounded by the latency of these primitives, we use latency to measure the time taken to complete each primitive when processing a batch of samples in an iteration. Low-latency implementations of all the primitives are essential for achieving high-throughput RL training.

• Effective Resource Utilization. It is defined as

$$\text{Effective Resource Utilization} = \frac{\text{Achieved Effective Operations Per Second}}{\text{Peak Operations Per Second Provided by the Hardware Platform}}.$$

Its value lies in the range [0, 1]: an effective resource utilization close to 0 means the implementation is not able to fully utilize the available compute power (i.e., it stalls, is bottlenecked by memory accesses, or performs wasteful computations); a high effective resource utilization (close to 1) means the implementation utilizes most of the available compute power and delivers near-optimal computational efficiency.

• Throughput Performance Portability. Consistent with the definition in [103], it is defined as

$$\Phi(P) =
\begin{cases}
0 & \text{if } \exists i \in P,\ \text{Throughput}_i = 0 \\[1ex]
\dfrac{|P|}{\sum_{i \in P} \dfrac{1}{\text{Throughput}_i}} & \text{otherwise}
\end{cases}
\tag{1.1}$$

where P is a set of heterogeneous platforms. It measures the ability of the implementation to achieve high throughput across different heterogeneous platforms; the higher Φ is, the better the performance portability.
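To make the composition of these metrics concrete, the following is a minimal Python sketch that computes the iteration-latency composition under sequential versus fully overlapped scheduling and the throughput performance portability of Equation 1.1. The function names and the example numbers are illustrative only; this is not part of the dissertation's tooling.

```python
from typing import Iterable

def iteration_latency(inference: float, update: float, replay: float,
                      overlapped: bool = False) -> float:
    """Compose primitive latencies: '+' when primitives run sequentially,
    'max' when they are fully overlapped (the circled-asterisk operator above)."""
    parts = (inference, update, replay)
    return max(parts) if overlapped else sum(parts)

def training_throughput(steps_per_iteration: int, iter_latency: float) -> float:
    """Latency-bound throughput: steps processed in one iteration / iteration latency."""
    return steps_per_iteration / iter_latency

def performance_portability(throughputs: Iterable[float]) -> float:
    """Harmonic mean of per-platform throughputs (Equation 1.1);
    returns 0 if any platform fails to run the implementation."""
    t = list(throughputs)
    if any(x == 0 for x in t):
        return 0.0
    return len(t) / sum(1.0 / x for x in t)

# Example with hypothetical numbers (actions/s on CPU, CPU-GPU, and CPU-FPGA platforms):
phi = performance_portability([1.2e4, 4.5e4, 3.8e4])
```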
1.3.2 Algorithm Performance

• Average Reward. This metric measures the average cumulative reward obtained by the agent over a specified period of episodes. It provides an overall indication of how well the agent is performing in the environment.

• Learning Curve & Reward Convergence Speed. The learning curve plots the agent's average reward over episodes, visualizing the learning progress. A steeply increasing learning curve indicates rapid improvement, while a plateau suggests convergence or stagnation. The convergence speed measures how quickly the RL algorithm converges to an optimal or near-optimal policy based on the learning curve. It can be quantified by dividing the converged average reward by the time (number of episodes) taken to reach convergence.

1.4 Thesis Contributions

In this thesis, we develop hardware-algorithm co-optimized accelerators and build frameworks for composing heterogeneous RL systems to address the above challenges.

1.4.1 Hardware-Algorithm Co-Design for Key RL Primitives

We propose three hardware-algorithm co-optimized accelerator designs for the fundamental primitives that are common to various RL algorithms. These are deep neural network (DNN) policy inference, DNN policy training, and tree-based models (e.g., MCTS decision trees).

DNN policy inference. CNNs are powerful techniques used in many computer-vision-related RL tasks such as automatic learning of agents in robotics, autonomous driving, and games. Our novelty is in enabling the dynamic use of algorithms for the diversely characterized layers in a CNN for low-latency inference. The main contributions are:

• A unified template hardware overlay re-used across layers. It enables low-overhead layer-wise layout transformation, allowing dynamic switching among three popular GEMM-based convolution algorithms: the im2col, kn2row, and Winograd algorithms;

• Accurate modeling of the computation and communication latency for various convolution algorithm combinations, to allow easy construction of a parameterized dependency graph representation for any CNN model, capturing architectural parameters, FPGA device capabilities, and CNN metadata;

• A framework, DYNAMAP, that proposes a two-step design space exploration workflow:
  – Hardware Mapping: identifying the fixed architectural parameters as well as the most efficient dataflow for each layer under all algorithm settings;
  – Algorithm Mapping: polynomial-time optimal algorithm selection exploiting the series-parallel characteristic of CNN graphs.

DNN policy training. We focus on Fractionally Strided Convolution (FSC), which is a key operation in training image-based Deep Learning models, i.e., it is used in the back propagation of CNN training. FSC typically performs up-convolution on a 2-D grid image, resulting in a substantial amount of zero-computations at structured locations of the feature maps. These zero-computations lead to low effective resource utilization on the hardware platform. Our main novelty is a multi-channel-multi-kernel parallel algorithm, kn2row, associated with a dedicated accelerator design, to eliminate zero-computations in FSC. The main contributions are:

• We propose a novel methodology that completely eliminates zero-spacing based on kn2row, a multi-channel-multi-kernel parallel convolution algorithm.

• We develop an accelerator template that integrates a novel reconfigurable zero-skipping mechanism and enables both convolution and FSC operations on arbitrary feature map, kernel, and stride sizes.

MCTS decision tree model. Monte Carlo Tree Search (MCTS) is a pivotal model-based RL approach. We accelerate the key primitive of MCTS, a tree-based search model that explores the state-action space iteratively. Our novelties include adopting the spatial architecture (FPGAs) for low-overhead dependency handling and alleviating the memory overheads, and a novel mechanism for enabling dynamic tree management on FPGAs. The main contributions are:

• To support arbitrary dynamic accesses by all the in-tree operations with minimal area overhead, we propose an accelerator design with a custom Butterfly-based Interconnection between the computing units and the memory banks;

• Based on our interconnection design, we propose an on-chip memory bank assignment algorithm for MCTS tree construction that minimizes runtime bank conflicts during all the in-tree operations.

1.4.2 System Solutions for RL Using Heterogeneous Platforms

An Acceleration Framework for Composing Deep Reinforcement Learning Implementations using Heterogeneous Systems. We introduce a framework for composing parallel Deep RL (DRL) systems on heterogeneous platforms consisting of general-purpose processors (CPUs) and accelerators (GPUs, FPGAs). Our main innovation is enabling flexible partitioning, mapping, and scheduling of primitives onto devices in order to tune the implementation for arbitrary algorithms and platforms, achieving portable performance. The main contributions are:

• We propose a general DRL heterogeneous training protocol that is agnostic of the types of underlying accelerators and thus portable to different heterogeneous platforms.
• We abstract any DRL algorithm using a generalized task dependency graph representation, whose nodes denote fine-grained tasks within the Replay Manager and Learner primitives. This facilitates the parallelization of a single primitive over multiple devices.

• Based on the above graph representation of DRL, we develop a novel System Composer for exploiting task and data parallelism within the DRL Learner over heterogeneous devices, improving the heterogeneous hardware utilization and the achievable throughput for large-scale training. This also makes the framework's performance optimization generalizable to new Learner functions involving multiple DNNs.

• We develop a parameterized library that contains accelerated DRL primitives on various architectures (CPU, GPU, and FPGA). We offer a Python-based User API to enable productive DRL application development on heterogeneous platforms.

Frameworks for Monte-Carlo Tree Search on Heterogeneous Platforms. We develop two system solutions for general MCTS and Deep Learning guided MCTS, targeting CPU-FPGA and CPU-GPU heterogeneous platforms, respectively.

On CPU-FPGA platforms, we develop a framework with an accelerator encapsulating the MCTS decision tree model in a near-memory fashion. To enable efficient hardware mapping and end-to-end execution, the framework consists of:

• An Accelerator Generator that decides the design parameters based on the algorithm and benchmarks;

• A hybrid parallel execution model that concurrently exploits the compute power of both the CPU (data parallelism) and the FPGA (pipeline parallelism);

• A Python-based API wrapping the FPGA in-tree accelerator to make our design portable to state-of-the-art RL benchmarking libraries.

On multi-core and CPU-GPU platforms, we propose a novel framework for accelerating Deep Neural Network guided Monte-Carlo Tree Search (DNN-MCTS). This framework uses an adaptive parallel scheme that optimally chooses between two CPU parallelization schemes for the MCTS component based on the hardware platform and algorithm specifications. The main contributions are:

• We perform a tradeoff analysis between two parallel implementations (the shared-tree and local-tree methods), and propose an acceleration methodology that adaptively selects the implementation at compile time given an arbitrary DNN-MCTS algorithm targeting a multi-core CPU.

• We utilize an efficient search method that determines the best DNN-request-processing batch size in the design configuration workflow to fine-tune DNN-MCTS performance on an arbitrary CPU-GPU platform.

1.5 Organization

The rest of the dissertation is organized as follows. In Chapter 2, we review the fundamentals and workload characteristics of two classes of RL: model-free Deep RL and model-based RL using MCTS. In Chapter 3, we discuss the three fundamental works on acceleration of primitives (for CNN inference, CNN training, and MCTS decision tree model operations). In Chapter 4, we describe the system solution for efficiently parallelizing and mapping model-free Deep RL using CPU-GPU-FPGA platforms. In Chapter 5, we discuss the system solutions for MCTS-based RL algorithms on two types of heterogeneous platforms. In Chapter 6, we conclude the dissertation by summarizing the impact of the completed works and possible future directions.

Chapter 2: Background and Related Works

2.1 Reinforcement Learning Primitives and Workload Characterization

Figure 2.1: RL Workflow

We show a generic view of the RL workflow in Figure 2.1.
It contains a Sample Generation loop and a Model Update loop. In Sample Generation, the Actor infers on a policy model (commonly represented as a deep neural network (DNN) [119], a decision tree model [63], or a combination of both [125]). In each iteration, the environment outputs the current state s. The policy network computes an action a given the current state s via neural network inference. The action a is actuated in the environment to obtain the next state s′ and a reward r. This generates a sample of experience (s, a, s′, r) to be stored in a dataset (i.e., the Replay Buffer) for training. The Sample Generation process iterates until the end of RL training and populates the dataset in an online manner. In Model Update, the loop occurs between the Learner and the Replay Buffer. In each iteration, the Learner samples a batch of experiences from the Replay Buffer to perform training. The training technique depends on the RL algorithm and policy model (e.g., stochastic gradient descent (SGD) [3] is commonly used for RL with DNN models). In model-based RL such as MCTS [63], an environment model (i.e., a decision tree) is utilized to facilitate policy training. In certain RL algorithms, the Learner also updates the experiences sampled from the Replay Buffer (e.g., in RL algorithms with prioritized replay, the Learner sets the priorities of the sampled batch data to the new priorities after the SGD update and samples from the replay based on the priorities; the priorities of experiences are managed using a sum tree data structure [46, 153]).

We define the key RL primitives as the Actor, the Learner, and the Replay Manager (RM, for managing experiences stored in the Replay Buffer). In model-free Deep RL algorithms (further detailed in Section 2.1.1), the Actor performs DNN policy inferences, the Learner performs SGD training of DNN policies, and the RM performs replay operations on a sum tree data structure that stores the priorities of experiences [114]. In model-based RL algorithms using MCTS (further detailed in Section 2.1.2), the Actor performs tree-based action selection, and the Learner performs tree expansion and tree update, all using a decision tree data structure as the environment model.
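As a concrete illustration of the two loops described above, the following is a minimal Python sketch of the generic workflow with an Actor policy, a Replay Buffer, and a Learner. The environment, policy, replay, and learner objects and their methods (policy.act, env.step, replay.store, replay.sample, replay.update_priorities, learner.update) are illustrative placeholders, not the interface of the frameworks developed in this dissertation.

```python
def run_training(env, policy, replay, learner, num_iterations, batch_size):
    """Generic RL workflow: a Sample Generation loop interleaved with Model Update."""
    state = env.reset()
    for _ in range(num_iterations):
        # --- Sample Generation: the Actor infers on the policy and steps the environment ---
        action = policy.act(state)                     # policy inference
        next_state, reward, done = env.step(action)    # environment simulation
        replay.store((state, action, next_state, reward))
        state = env.reset() if done else next_state

        # --- Model Update: the Learner trains on a batch drawn by the Replay Manager ---
        if len(replay) >= batch_size:
            batch, indices = replay.sample(batch_size)         # e.g., prioritized sampling
            new_priorities = learner.update(batch)             # e.g., one SGD step
            replay.update_priorities(indices, new_priorities)  # only for prioritized replay
    return policy
```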
The characteristics of Reinforcement Learning (RL) primitives exhibit variations not only among themselves but also across different learning functions, policy models, hyper-parameters, classes of algorithms, etc. Consequently, relying on a fixed architectural solution proves inadequate for optimizing hardware utilization and achieving high-throughput DRL across the diverse spectrum of algorithms and applications. In the following, we summarize the operations performed by the various primitives and analyze the intrinsically heterogeneous workload characteristics in the two classes of RL algorithms.

2.1.1 Model-Free Deep Reinforcement Learning

In this category, we consider both on-policy algorithms such as PPO [118] and A2C [91], and off-policy algorithms such as DQN [92], DDPG [79], SAC [42], and TD3 [38]. The primitives standard to these algorithms include the Learner and the replay operations conducted by the Replay Manager (RM). The Learner performs stochastic gradient descent [112], and its computations vary based on the algorithm and policy model (e.g., MLP [107], CNN [50]). The RM performs sampling and update operations on a sum tree data structure [114], and the replay operations also vary based on replay configurations (i.e., the depth and fanout of the sum tree [153]). We show the key replay operations associated with the RM as follows:

• Replay Sampling decides which experiences (indices) should be sampled from the Prioritized Replay Buffer for training the policy. For each sample, a data point $x_i$ is selected according to a priority distribution $\Pr(i) = P(i) / \sum_{j} P(j)$, $i \in [0, S_E)$, where $S_E$ is the total number of data points in the Prioritized Replay Buffer. To do so, we first sample $x \sim U(0, 1)$. Then, we use the cumulative distribution function $\mathrm{cdf}(i) = \sum_{j=1}^{i} \Pr(j)$, $i \in [0, S_E)$, to derive the sample index $i = \mathrm{cdf}^{-1}(x)$. This is equivalent to finding the minimum index i such that the prefix sum of the priorities up to i is greater than or equal to x times the total priority:

$$\min_i \ \sum_{j=1}^{i} P(j) \ \geq\ x \cdot \sum_{j=1}^{S_E} P(j) \tag{2.1}$$

Such an index i is known as the prefix sum index. To find index i, we traverse from the root node to a leaf node level by level, as shown in Figure 2.2. During the traversal of each level, we need to read the prefix sums of the priority values from all the child nodes. The time complexity of finding the prefix sum index is $O(K \log_K N)$, where N is the number of elements in the replay buffer and K is the fanout of the sum tree.

• Replay Update requires updating the current priorities using newly computed priorities. This operation is performed after each training iteration. To update a priority, we update the node values from the leaf node up to the root node, as shown in Figure 2.2. The time complexity of a priority update is $O(\log_K N)$.

Figure 2.2: Replay Manager Data Layout and Operations

Note that sampling and priority update are always performed on a batch of data. Since computing the prefix sum index is read-only, sampling different data points inside a batch can be fully parallelized. The arithmetic intensity of sampling is 1 FLOPS/word, as each word read from memory is used only once during the tree traversal. The arithmetic intensity of priority update is 0.5 FLOPS/word, because each datum is used once per read and write. Both operations are sketched below on a binary sum tree.
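The following is a minimal illustration of the two replay operations on a K-ary sum tree, specialized here to a binary tree (K = 2) stored as a flat array. It is a simplified reference sketch, not the CPU/GPU/FPGA implementations evaluated later in this dissertation.

```python
import random

class SumTree:
    """Binary sum tree over `capacity` leaf priorities, stored as a flat array."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)   # internal nodes in [1, capacity), leaves after

    def update(self, leaf: int, priority: float):
        """Replay Update: write a new priority at a leaf, then fix sums up to the root."""
        node = leaf + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self) -> int:
        """Replay Sampling: find the prefix-sum index for x ~ U(0, total priority)."""
        x = random.random() * self.tree[1]   # tree[1] holds the total priority
        node = 1
        while node < self.capacity:          # descend root -> leaf, level by level
            left = 2 * node
            if x <= self.tree[left]:
                node = left
            else:
                x -= self.tree[left]
                node = left + 1
        return node - self.capacity          # leaf index = replay buffer index

# Example: three experiences with priorities 0.5, 1.0, 0.2
tree = SumTree(capacity=4)
for i, p in enumerate([0.5, 1.0, 0.2]):
    tree.update(i, p)
batch = [tree.sample() for _ in range(32)]
```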
Figure 2.3: DRL Primitives Workload Analysis

In Figure 2.3, we illustrate the throughput performance of the key compute primitives (replay sampling, replay update, and learner) for two algorithms (DQN [92], DDPG [79]) and policy models (MLP, CNN) on the roofline models of a CPU, a GPU, and an FPGA. In this example, primitives such as small MLP policies (commonly used in classical control and robotics benchmarks [33]) and replay operations exhibit low arithmetic intensities and high-latency memory accesses, making them memory-bound and challenging to optimize on multi-core or data-parallel architectures (CPU and/or GPU). The performance of these primitives can benefit from a near-memory design using a spatial architecture (FPGA). Learner functions with higher arithmetic intensity and data reuse, such as CNN policies used in vision-based applications [92], justify the data-parallel resources provided by GPUs. Still, the characteristics of DRL performance may vary due to significant differences among replay and learner configurations based on applications, as well as diverse rooflines resulting from device bandwidth and compute capabilities.

Overall, we observe that the characteristics of model-free Deep RL primitives exhibit variations not only among themselves but also across different learning functions, policy models, hyper-parameters, etc. Consequently, relying on a fixed architectural solution proves inadequate for optimizing hardware utilization and achieving high-throughput DRL across the diverse spectrum of algorithms and applications.

2.1.2 Model-Based Reinforcement Learning using MCTS

Monte Carlo Tree Search (MCTS) is an iterative process composed of tree-based action selection and environmental simulations for Sample Generation, and tree expansion followed by tree update for Model Update. In certain algorithms (e.g., AlphaZero [125], AlphaFold [61]), DNNs are additionally used to predict simulation results for expanding the tree in Sample Generation, and they are also trained in Model Update. Our work focuses on (variations of) tree-parallel MCTS [19, 81], currently the most popular MCTS parallelization technique used in state-of-the-art DNN-MCTS applications [125, 142]. In tree-parallel MCTS, multiple workers share accesses to and update the same tree. The operations in each iteration of a tree-parallel MCTS [19] worker are as follows (a brief sketch of the selection step follows this list):

• Tree-based action selection: The search of each worker starts from the current state (the root node of the tree) and traverses down the tree. At every traversed node s, the next edge (s, a) is selected as $a = \arg\max_a U(s, a)$, where the UCT score U(s, a) is computed from the statistics stored in the search tree as follows:

$$U(s, a) =
\begin{cases}
V(s, a) + c \cdot \sqrt{\dfrac{\ln\left(\sum_b N(s, b)\right)}{N(s, a)}} & \text{for the UCT algorithm} \\[2.5ex]
Q(s, a) + c \cdot P(s, a) \cdot \dfrac{\sqrt{\ln\left(\sum_b N(s, b)\right)}}{1 + N(s, a)} & \text{for the AlphaZero algorithm}
\end{cases}
\tag{2.2}$$

This leads the agents towards states with high reward values (exploitation), high policy-action probability, and low visit counts (exploration). c is a pre-set constant controlling the tradeoff between exploitation and exploration. V(s, a) is the reward obtained from environmental simulations. P and Q are derived from policy and value DNN inferences. N tracks the number of visits to a specific node. In tree-parallel MCTS, each worker subtracts a virtual loss VL from U of the traversed edges to lower their weights, encouraging other workers to take different paths. To avoid potential conflicts between two workers sharing the same node, a mutex is utilized to ensure atomic accesses by multiple workers [19].

• Tree Expansion: When the tree traversal of a worker reaches a leaf node and encounters an edge that was never visited before, the search process adds a new successor node s′ and initializes Q(s′, a) and N(s′, a) for all its adjacent edges a.

• Node Evaluation: The Node Evaluation starts at the node expanded in Tree Expansion, and returns a sampled or estimated reward along with other necessary statistics to be used in the following Tree Update operation. In the UCT (Upper Confidence bounds applied to Trees) algorithm, a rollout simulation is performed to return a reward V. In DNN-MCTS such as AlphaZero [125] and AlphaX [142], a DNN is used to approximate the policy to return P(s, a), the probability of taking action a.

• Tree Update: To synchronize the tree with the most recent node evaluation, V is propagated from the new leaf node back to the root for all the workers. At each tree level, the visit count N is incremented, and the state value Q is updated using V. A mutex is used to protect accesses to each node to avoid race conditions from multiple workers accessing the same nodes.
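To make the selection rule of Equation 2.2 and the role of virtual loss concrete, the following is a minimal sketch of tree-based action selection for one tree-parallel worker, using the UCT variant of the score and a per-node lock. The node fields (children, N, V_sum, vloss) and the way virtual loss is folded into the value estimate are illustrative assumptions, not the data layout of the accelerators described later.

```python
import math
import threading

class Node:
    def __init__(self):
        self.children = {}      # action -> child Node
        self.N = 0              # visit count
        self.V_sum = 0.0        # accumulated simulation reward
        self.vloss = 0          # virtual loss applied by in-flight workers
        self.lock = threading.Lock()

def uct_score(parent: Node, child: Node, c: float) -> float:
    """UCT variant of Equation 2.2; virtual loss lowers the weight of busy edges."""
    if child.N == 0:
        return float("inf")                       # always try unvisited edges first
    v = (child.V_sum - child.vloss) / child.N     # exploitation term (with virtual loss)
    return v + c * math.sqrt(math.log(parent.N) / child.N)   # exploration term

def select_leaf(root: Node, c: float = 1.4) -> Node:
    """Tree-based action selection: descend from the root until reaching a leaf."""
    node = root
    while node.children:
        with node.lock:                           # atomic access, as in tree-parallel MCTS
            _, node = max(node.children.items(),
                          key=lambda kv: uct_score(node, kv[1], c))
            node.vloss += 1                       # discourage other workers from this edge
    return node
```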
From a computation characterization perspective, we can view an MCTS process as two parts: the in-tree operations (i.e., tree-based action selection, tree expansion, and tree update) and the simulations. The in-tree operations are highly memory-bound and incur irregular memory accesses dynamically determined at runtime, which are hard to optimize using thread-level or data parallelism. On the other hand, the simulations (and DNN operations, if applicable) among different workers have highly parallelizable compute patterns. As a result, the in-tree operations may become a bottleneck that hinders performance scalability to a large number of workers on homogeneous multi-core systems: because they are difficult to scale, they pose a constant upper bound on the system throughput. Therefore, optimizing system performance requires heterogeneous hardware. For example, the in-tree operations require acceleration using a spatial architecture in a near-memory fashion to improve their performance, while the simulations and the DNN workload require thread-parallel and data-parallel resources.

2.2 Computing Platforms

The architectures of emerging data centers are highly heterogeneous; they integrate a variety of processors, accelerators, and memory [4, 51, 8]. In this thesis, our target platforms consist of a subset (or all) of three types of devices: general-purpose multi-core architectures (CPUs), data-parallel architectures (GPUs), and spatial architectures (FPGAs). An example of such a platform is shown in Figure 2.4. These platforms are well-suited for accelerating RL due to their massive parallelism and heterogeneity. The devices in these platforms include data-parallel architectures (e.g., GPUs) that are suitable for compute-intensive policy training, spatial architectures (e.g., FPGAs) that can be optimized for memory-intensive dataset management, and general-purpose processors suitable for application simulations in RL.

Figure 2.4: Heterogeneous Platform

2.3 Parallel Methods, Libraries and Frameworks for RL

Model-Free RL. Several libraries support prototyping of RL algorithms, ranging from single-threaded implementations [16, 31, 33] to distributed settings [77, 15, 37, 47]. These libraries typically focus on providing frameworks for hyperparameter tuning (e.g., loss, exploration, or optimization steps) and offer accessible codebases, enabling users to experiment with various algorithm components. However, they are not optimized for execution speed and often under-utilize the underlying resources available on modern machines. [76] proposes parallel reinforcement learning using the MapReduce [29] framework with linear function approximation. Other works, such as [48, 153, 110], implement parallel DRL algorithms by employing multiple parallel Actor threads and a centralized Learner thread, utilizing deep learning libraries like TensorFlow and LibTorch. These works leverage CPU and GPU data-parallel resources for training, but do not efficiently optimize memory-bound primitives (such as small-model training and replay operations) on specialized hardware. RLlib introduces high-level abstractions for distributed reinforcement learning, built on top of the Ray library [77]. Dopamine [16] provides a research framework for fast prototyping of reinforcement learning algorithms building upon pre-compiled modular components. These works offer high-level programming interfaces for modular RL algorithm composition, but still suffer from high latency overheads and result in low training throughput due to the absence of fine-grained dataflow and parallelism specialized for RL inference, training, and dataset management tasks.

Model-Based RL using MCTS.
Parallel MCTS: Multiple parallel algorithms have been developed for high-throughput MCTS and DNN-MCTS. Leaf-parallel MCTS [17] uses a single tree and creates multiple parallel node simulations at the same leaf node, but it wastes parallelism due to the lack of diverse evaluation coverage on different selected paths, which degrades algorithm performance [65]. Root-parallel MCTS [66] creates multiple trees at different workers and aggregates their statistics periodically, but still lets multiple workers visit repetitive states. Speculated DNN-MCTS [67] complies with the sequential in-tree operations and uses a speculative model in addition to the main model for faster node evaluation; this preserves the decision-making quality of sequential MCTS but introduces additional computations. Tree-parallel MCTS and its variants achieve significantly better algorithm performance than the other parallel methods [19, 90, 81, 87], and have been adopted in various successful applications such as Go [124], traffic control [22], and robotics path planning [27].

Execution Models for Tree-Parallel MCTS: Existing tree-parallel MCTS on CPU can be categorized into two parallel execution models: the shared-tree method (with multi-threaded tree traversal) [19] and the local-tree method (with single-threaded tree traversal) [81]. In the shared-tree method, each worker accessing the tree is assigned a separate thread, and a local mutex at each tree node is used for accessing the shared tree. The main disadvantage of this method is that multiple threads communicate through DDR memory, which leads to high synchronization overheads dominated by DDR access time (hundreds of CPU cycles [54, 5]). In the local-tree method, only a single master thread performs the in-tree operations exclusively, and multiple worker threads perform simulations (or DNN inferences) exclusively. It has the advantage of low-latency memory access since the tree can be managed in local memory (e.g., the last-level cache). It also achieves higher throughput than multi-threaded tree traversal, because the in-tree operations can be overlapped with simulations. In recent work [81], the local-tree method combined with a novel formulation of the virtual loss has shown better algorithm and throughput performance on Atari game benchmarks. However, the sequential time interval between workers for in-tree operations is still large because all the workers are serialized, and the system performance may not scale well to a large number of workers.
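The contrast between the two execution models can be sketched briefly: in the local-tree method, a single master thread performs all in-tree operations while a pool of workers runs simulations concurrently. The snippet below is an illustrative sketch of that pattern using Python's thread pool; the callables select_and_expand, simulate, and backup are placeholders for the MCTS steps, not an existing library API.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def local_tree_mcts(tree, env_model, num_workers: int, num_iterations: int,
                    select_and_expand, simulate, backup):
    """Local-tree execution model: one master thread owns the tree (in-tree operations
    stay in its local memory), while simulations run in parallel worker threads."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        in_flight = {}                         # future -> leaf it is simulating
        for _ in range(num_iterations):
            leaf = select_and_expand(tree)     # in-tree ops: master thread only
            fut = pool.submit(simulate, env_model, leaf)   # offload the rollout
            in_flight[fut] = leaf
            if len(in_flight) >= num_workers:  # keep at most num_workers rollouts active
                done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
                for f in done:
                    backup(tree, in_flight.pop(f), f.result())  # tree update by master
        for f, leaf in in_flight.items():      # drain the remaining rollouts
            backup(tree, leaf, f.result())
    return tree
```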
A CPU-FPGA implementation of Asynchronous Advantage Actor-Critic (A3C) algorithm is presented in [24]; Using a two-level buffer hierarchy associated with the inference and training module, it outperforms GPU-based A3C implementation by 30%. [70] utilizes FPGA in accelerating the Sample Generation phase of Deep RL and integrates it in a robotics system. While these hardware-accelerated approaches show performance improvement over implementations on 20 general-purpose processors, they focus on a specific algorithm or RL primitive, and lack generalizability to various RL algorithms. Model-Based RL using MCTS. [58, 108] design Blokus Duo Game solvers on FPGA that uses MCTS. Their accelerators target Blokus Duo game only and implement the simulator circuit on FPGA. It is difficult for their designs to generalize to various applications due to the lack of general-purpose simulators provided by CPU processors. [87] proposed to accelerate MCTS in CPU-FPGA heterogeneous systems, and developed FPGA accelerator for in-tree operations. However, the accelerator design in [87] requires static memory allocation for a full tree at compile time. This is because it assumes a static one-to-one association between the topological ordering of tree nodes with the on-chip memory addresses. As the memory requirement for the full tree increases exponentially wrt the tree height, the supported tree height is extremely limited on FPGAs which typically have limited on-chip resources. This constrain the asymptotically growing characteristic the tree, thus affecting the domain-specific algorithm performance of MCTS algorithms. In summary, none of the existing hardware design support generalized MCTS with dynamic tree management which is critical in achieving high algorithm performance. 2.5 Target Reinforcement Learning Algorithms in This Thesis We develop accelerators for the common primitives as well as system solutions for the two major categories of state-of-the-art RL algorithms, model-free and model-based algorithms, as shown in Figure 2.5. Specifically, in the category of model-free RL, we focus on Deep RL whose workloads include policy and value optimizations using DNNs and replay operations. Examples include Deep Q Network (DQN) [92], Deep Deterministic Policy Gradient (DDPG) [79], Soft Actor Critic (SAC) [42], Twin Delayed DDPG (TD3) [38], and Proximal Policy Optimization (PPO) [118]. In the category of model-based RL, we focus on Monte 21 Carlo Tree Search (MCTS) and Deep Neural Network guided MCTS (DNN-MCTS) algorithms such as AlphaZero [125]. Figure 2.5: Categorization of Key RL Algorithms in This Thesis 22 Chapter 3 Acceleration of Reinforcement Learning Primitives 3.1 DYMAMAP: A Dynamic Algorithm Mapping Framework for DNN Inference 3.1.1 Motivation CNNs are powerful techniques used in vision based RL applications, such as learning of agents in autonomous vehicles [73], robotics [134], and games [92]. Recently, new families of CNNs featuring various convolution (CONV) layers [132, 131, 49] have been developed. These CNNs show superior performance wrt prediction accuracy while introducing new convolution operations (e.g. depthwise CONV in MobileNet, "Fire" module in SqueezeNet, 1×7 filter in "Inception module"). Most solutions for CNN inference use a specific algorithm across all the layers and reuse a generic architecture (e.g., CPU, GPU, or systolic array processors) to perform the algorithm [152, 106, 98, 155, 150]. 
These approaches leave performance and hardware efficiency on the table due to (1) the use of a fixed algorithm across diverse layers and (2) under-utilization of the fixed hardware. The problem is compounded as state-of-the-art CNNs adopt increasingly diverse layer shapes and complex structures, resulting in sub-optimal latency.

Convolution (CONV) layers are the major building blocks of CNNs, and their metadata are defined as follows: each CONV layer has C_in (C_out) input (output) channels, where each input (output) channel is an H_1 × H_2 (O_1 × O_2) 2D feature map. The layer weights W contain C_in × C_out kernels, each of size K_1 × K_2. A number of algorithms have been proposed for efficient implementation of the convolution operation. Among these, general matrix multiplication (GEMM) based methods are the most widely adopted for spatial convolution [23, 139, 69]. In this section, we summarize three families of popular GEMM-CONV algorithms and their trade-offs.

im2col Method. im2col [23] is a popular algorithm that converts spatial convolution into GEMM. For the feed-forward pass of a CONV layer, im2col stretches each group of C_in kernels into a row of the weight/kernel matrix W, and each group of C_in corresponding windows of the input feature maps into a column of the input activation matrix X, expressing the feed-forward pass as

z_l = W_l^{(C_out × K_1 K_2 C_in)} · X_{l−1}^{(K_1 K_2 C_in × O_1 O_2)}.

kn2row Method. The kn2row [139] method is based on decomposing the convolution and reordering the data layout. In the first phase, "unit-CONV GEMM", a K_1 × K_2 convolution is computed using K_1 K_2 separate 1 × 1 unit convolutions, each of which is equivalent to a GEMM call:

p_{k_1,k_2} = W_l^{(C_out × C_in)} · X_{l−1}^{(C_in × O_1 O_2)}.

In the second phase, "Pad-and-Accumulate", the intermediate output patches of all unit convolutions, p_{k_1,k_2}, are shifted by their offsets with respect to the center patch, zero-padded on the non-overlapping areas, and Hadamard-added to generate the final output feature maps.

Winograd Minimal Filtering Method. The Winograd algorithm [69] is a fast matrix multiplication algorithm that reduces the number of operations in a GEMM call. An F(m × m, r × r) Winograd algorithm generates its output as

Y = A^T [ (G g G^T) ⊙ (B^T d B) ] A,

where Y and d denote output and input tiles, g denotes a kernel, A, G, and B are constant transformation matrices, and ⊙ denotes Hadamard (element-wise) multiplication. The input feature map D is partitioned into multiple input tiles d of size (m + r − 1) × (m + r − 1), where adjacent tiles share an overlap of r − 1. Each output tile has size m × m and each kernel has size r × r. The final output feature map is obtained by concatenating all output tiles Y and summing over the depth of the input tensor. Equivalently, we can reduce over the C_in channels in the transform space before applying the inverse transform A to the sum. This amortizes the cost of the inverse transform over the number of channels, and allows us to re-express the Hadamard products and depth-wise additions as (m + r − 1) × (m + r − 1) independent GEMMs [69].

State-of-the-art CNNs [132, 131, 56, 133] have highly complex architectures with multiple branches and wide variations in CONV layer configurations. As illustrated in Figure 3.1 using three common layer configurations, the relative performance of the three GEMM-based convolution algorithms depends heavily on the layer configuration. Thus, using a single algorithm across all the layers is not an optimal strategy for minimizing the latency of CNN inference. Rather, the ability to switch algorithms between layers is needed.

Figure 3.1: Computation and Memory Loads of GEMM-CONV algorithms on different layer configurations
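To make the contrast between the first two lowerings concrete, the following is a minimal NumPy sketch (our own illustration, not DYNAMAP code) of im2col and kn2row for a single-image, stride-1, unpadded convolution; shapes follow the notation above, with input (C_in, H_1, H_2) and weights (C_out, C_in, K_1, K_2). Both functions produce identical outputs, but they build very different GEMM shapes and intermediate layouts, which is exactly what drives the per-layer trade-offs shown in Figure 3.1.

```python
# Minimal sketch of the im2col and kn2row lowerings (stride 1, no padding).
import numpy as np

def conv_im2col(x, w):
    Cin, H1, H2 = x.shape
    Cout, _, K1, K2 = w.shape
    O1, O2 = H1 - K1 + 1, H2 - K2 + 1
    # Toeplitz matrix: one flattened (Cin x K1 x K2) window per column
    cols = np.stack([x[:, i:i + K1, j:j + K2].reshape(-1)
                     for i in range(O1) for j in range(O2)], axis=1)
    # one large GEMM: (Cout, K1K2Cin) x (K1K2Cin, O1O2)
    return (w.reshape(Cout, -1) @ cols).reshape(Cout, O1, O2)

def conv_kn2row(x, w):
    Cin, H1, H2 = x.shape
    Cout, _, K1, K2 = w.shape
    O1, O2 = H1 - K1 + 1, H2 - K2 + 1
    out = np.zeros((Cout, O1, O2))
    for a in range(K1):                    # K1*K2 unit (1x1) convolutions,
        for b in range(K2):                # each a (Cout, Cin) x (Cin, H1H2) GEMM
            patch = (w[:, :, a, b] @ x.reshape(Cin, -1)).reshape(Cout, H1, H2)
            out += patch[:, a:a + O1, b:b + O2]   # pad-and-accumulate, realized here by slicing
    return out

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
assert np.allclose(conv_im2col(x, w), conv_kn2row(x, w))
```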
Enabling efficient algorithm switching requires us to solve the following problems: (a) Since a wide variety of CONV configurations and convolution algorithms exist, we need a unified architecture that is general enough to execute all combinations while simultaneously supporting algorithm-specific optimizations to extract maximum performance. (b) Since each algorithm requires its input and produces its output in its own specific format, we need a low-overhead data re-ordering mechanism. (c) Since state-of-the-art CNNs are extremely deep, with the number of CONV layers easily surpassing 50 or 100 (Inception-v4 has 141 CONV layers), we need a low-complexity algorithm that can handle the combinatorial explosion in design choices (3^141 for Inception-v4) to determine the optimal algorithm for each layer. We solve the problems mentioned above and present DYNAMAP, a framework that takes any CNN model and maps it onto FPGA in order to obtain extremely low latency CNN inference.

3.1.2 Formulation for Algorithm Mapping

Given a graph representation of a CNN model and the target FPGA platform, we need to determine: (a) for each layer, the choice of the convolution algorithm and the dataflow (i.e., the algorithm-dataflow pair); and (b) the parameters to customize the architecture overlay for the CNN model on the target platform. We discuss (a) in this section and (b) in Section 3.1.4. Note that since the optimal dataflow for each algorithm in each layer can be determined in step (b), algorithm mapping here implicitly implies algorithm-dataflow pair mapping.

The choice of algorithm-dataflow pair not only impacts the execution time of the layer, it also impacts the execution time of all the neighboring layers, as represented by the edges in the graph G. This is because data layout transformations are needed to ensure data is available to the algorithms in the correct format. Thus, greedily selecting algorithm-dataflow pairs that minimize the execution time at each layer will not minimize the overall execution time of the CNN. The CNN graph is G = (V, E, C_v, T_e), where each vertex v ∈ V represents a layer and each edge e ∈ E represents the ordering between two layers. C_v is the cost vector array that represents the computation costs of the vertices under different algorithm-dataflow pairs; for vertex i, c_i denotes its cost vector. T_e is the set of transition cost matrices that represent the cost of data layout transformation between vertices; T_ij denotes the transition matrix for edge (i, j). The objective is to determine an algorithm-dataflow mapping for each layer of the CNN such that the cost, i.e., the total latency of executing the CNN, is minimized. x_i is a 0-1 assignment vector with x_i(k) = 1 if algorithm k is chosen and 0 otherwise; exactly one entry of x_i can be set to 1. The problem can be formulated as follows:

minimize   \sum_{1 \le i < j \le N} x_i^T T_{ij} x_j + \sum_{1 \le i \le N} x_i^T c_i
s.t.       x_i \in \{0, 1\}^{|c_i|}   \forall 1 \le i \le N,    ||x_i||_1 = 1        (3.1)

This problem formulation is known as the Partitioned Boolean Quadratic Programming (PBQP) problem [116]. PBQP has been used to model a number of problems in compiler optimization, such as register allocation for architectures with irregular instruction sets [116] and instruction selection on DAGs [36]. PBQP is NP-Complete [116, 6].
However, we show that for a class of graphs, known as series-parallel graphs, PBQP can be solved in polynomial time. Moreover, we show that the graphs of a majority of popular CNN architectures fall into this class. This allows us to develop a polynomial time optimal algorithm for the algorithm mapping optimization problem defined above. Definition 1: A (undirected) graph, with two distinguished vertices – source s and sink t, is a seriesparallel graph if it can be turned into a K2 graph (a graph with two vertices connected with an edge) by a sequence of the following operations [34]: 1. Remove a degree 2 vertex other than s or t and the edges incident on it. Directly connect the two neighbors with a single edge. 2. Replace a pair of parallel edges with a single edge that connects the two endpoint vertices. Theorem 1. PBQP can be solved in polynomial time if the graph is a series parallel graph. Moreover, for a graph with N vertices and d = maxi |⃗ci |, the running time is O(N d2 ). 27 PROOFSKETCH. For a graph G1 with three vertices i, j, k and edges (i, k),(k, j), applying operation 1 on vertex k to obtain K2 graph — G1 K2 and setting Tij (di , dj ) = mindk {⃗ck(dk) + Ti,k(di , dk) + Tk,j (dk, dj )} for all (di , dj ) (where (di , dj ) are algorithm choices for i and j) in G1 K2 , one can show that the optimality is preserved. Similarly, for parallel edges between vertices i and j, by updating Ti,j (di , dj ) = T 1 i,j (di , dj ) + T 2 i,j (di , dj ), where T 1 and T 2 are matrices of the parallel edges, one can show that the optimality is preserved. Each operation requires O(d 2 ) amount of work and is performed O(N) times — operation 1 is applied only on vertices, operation 2 is applied on parallel edges which are generated only as a result of operation 1. (Detailed proof can be found in [88]). □ Lemma 1. Resnet [44], VGG [126], Alexnet [68], etc. which do not have any branches are series parallel graphs. Proof. Let the input layer be denoted by vertex s and the output layer by vertex t. The degrees of vertices corresponding to all the other layers for VGG and Alexnet are 2. By repeatedly applying operation 1, we obtain a K2 graph. In ResNet, some vertices have higher degrees due to skip connections. However, the vertices between the end points of each skip connection have degree 2. Thus, repeatedly applying operation 1 on these nodes until an edge parallel to the skip connection edge is obtained, and then applying operation 2 results in a graph with all vertices, except s and t, having degree 2. By repeatedly applying operation 1, we obtain a K2 graph. Lemma 2. GoogleNet [132], Inception-v1 to v4 and Inception-ResNet-v1 [131, 133], which are composed of inception modules are series parallel graphs. Proof. Due to space limitations, we prove this lemma only for Inception-v4. Similar arguments can be made for the other networks. Consider Inception-C block (Figure 6 in [130]). The output of the Filter concat layer at the bottom splits into 4 branches. The left 2 branches have only degree 2 nodes and can be converted to a single edge by operation 1. For the third branch from left, applying operation 1 on 1×3 and 28 3 × 1 CONV layers, followed by operation 2 on the resulting parallel edge and then applying operation 1 on 1 × 1 CONV layer will result in a single edge. Similarly, the rightmost branch can also be converted into a single edge. The four parallel edges can be merged (operation 2) resulting in a K2 graph. 
In a similar manner, Inception-A, B and Stem modules can also be reduced to K2. Thus, Inception-v4 network (Figure 9 in [130]) can be reduced into a number of K2 graphs connected in series which can in turn be trivially reduced to K2. 3.1.3 Accelerator Design for Dynamic Algorithm Switching Figure 3.2: Architecture Overview We develop a unified architecture template with a shared central Computing Unit which can be used by all the supported algorithms and separate algorithm-specific auxiliary functional modules. We propose a low overhead data layout transformation module to enable fast algorithm switching between layers. Architecture Overview. Figure 3.2 shows a high-level overview of the template architecture design. The accelerator is composed of a GEMM Computing Unit performing, Linear Transform Modules for Winograd, a Pad-andAccumulation Module for kn2row, Data Layout Transformation (DLT) Modules for data re-ordering when 29 switching between different layers and algorithms, and a Pooling Module for max pooling on spatial feature maps. The Computing Unit (CU) is a PSA1×PSA2 2-D systolic array of Multiply-Accumulation (MAC) units optimized for GEMM. The Input Buffer and Kernel Buffer are organized into PSA1 and PSA2 banks, respectively; Each bank stores a partition of input feature / kernel matrix. During GEMM execution, PSA1 and PSA2 data elements are read concurrently by parallel PEs in the systolic array, and written to PSA1 or PSA2 banks of output buffer in parallel. Under im2col mode, the Toeplitz matrices of input feature maps and kernel parameters are loaded into the Input and Kernel Buffers. The output feature maps are directly written into the output buffer. Under kn2row mode, the independent 1×1 unit convolutions are expressed as a series of GEMM calls. Then Pad-and-Accumulate Module shifts each intermediate output patch to align with the position determined by the original K1 × K2 kernel and produces the final output feature maps using an accumulation buffer. The bank indices and address offsets for Pad-and-Accumulation are pre-computed based on CONV layer meta data [155]. The "unit-CONV GEMM" and "Pad-and-Accumulate" phases are pipelined enabling CU to start working on the next patch of unit-CONV GEMM while accumulation buffer still processes the last patch. This reduces the overall "Pad-and-Accumulate" overhead of kn2row. Under Winograd mode, both input feature maps and kernels are fed into the Linear Transformation Modules so that the GEMM operates in Winograd-transformed space. Each of(m+r−1)×(m+r−1)input feature map tile (overlapped by r−1 in adjacent tiles) and r×r kernel tile are transformed into a (m+r− 1)×(m+r−1)sized tile. All the tiles are scattered and reordered into (m+r−1)×(m+r−1)independent input and kernel matrices [69] sized at ( H1 m × H1 m , Cin) and Cin, Cout, respectively. These GEMMs are fed into the systolic array sequentially and the output is again multiplied with transformation matrices to recover the output tensor shape (O1 × O2, Cout). Linear Transform Module requires multiplication by constants and additions which are determined by Winograd hyper-parameters (m, r). 
For instance, in F(2, 3), the transformation matrices are composed only of the values ±1 and ±1/2, which can be implemented using shift and add operations.

Table 3.1: Tensor Layout Transformations
Transformation        | I        | Step_b | Step_d                | I1    | inc_b2   | inc_d2   | I2    | inc_b3   | inc_d3
3D Tensor → Toeplitz  | O2       | S      | K1K2                  | K1    | 1        | 1        | K2    | H1·S     | 1
3D Tensor → Winograd  | H1H2/m²  | m      | −H1H2(m+r−1)²/m² + 1  | m+r−1 | 1        | H1H2/m²  | m+r−1 | H1       | H1H2/m²
Winograd → 3D Tensor  | H1H2/m²  | 1      | m²                    | m     | H1H2/m²  | 1        | m     | H1H2/m²  | 1

Dataflow-switchable and Stall-free PE Design.
The performance of GEMM on a fixed systolic array depends heavily on the dimensions chosen for parallelism. For example, consider a 31 × 31 systolic array multiplying input matrices of sizes (a, b), (b, c) = (62, 124), (124, 64). Parallelizing along dimensions a, c and breaking each input matrix into tiles of size (124, 31) requires extensive zero padding in the last tile, which has only 2 columns along dimension c; the effective PE utilization is only 68%. However, if we parallelize along dimensions a, b instead, no under-utilization occurs. To handle such scenarios, we develop a novel PE design that allows no-overhead switching between different dataflows to improve PE utilization.

Non-Stationary (NS) Dataflow: The input matrices W (b × c) and X (a × b) are partitioned along dimension a (or c) into tiles of size b × P_SA1 (or b × P_SA2). Each Processing Element (PE) computes a vector product that contributes to one output feature. In each clock cycle, each PE performs one multiply-accumulate (MAC) and shifts the input and weight along two directions. Once all the MAC computations for a pixel are finished, the final result is shifted out of the PE and the PE proceeds to work on a new pixel. The NS datapath in a PE is shown in Figure 3.3 with black and red colored wires. Each pass over matrix partitions incurs an initialization overhead I_SA that is proportional to max(P_SA1, P_SA2) and, in a naive implementation, is incurred for each pass in each layer. We alleviate these overheads to implement stall-free GEMM as follows (Figure 3.3): the MUX highlighted in grey selects between shifting out the accumulation result of PE_{x,y} and those of other PEs. When one PE completes the dot product for one pixel, its accumulation result is directly shifted out, and during the computation of the next pixel the accumulation results of other PEs can be shifted concurrently, so that I_SA is overlapped with next-pass computation. To further avoid stalls due to accumulation-result congestion between passes when b < P_SA (common in layers with small filters and shallow feature maps), we widen the wire(s) used to shift down accumulation results by a factor of j/L_pass for PEs in the j-th row. This ensures that the rate at which outputs are shifted out of each PE matches the rate at which outputs are generated.

Figure 3.3: PE dataflow and optimizations to support stall-free operations
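The utilization argument at the beginning of this subsection can be made concrete with a small calculation. The following sketch uses an idealized cost model (our assumption: pure tiling counts, ignoring the initialization overhead I_SA) to reproduce the 31 × 31 example: parallelizing the (62, 124) × (124, 64) GEMM over dimensions (a, c) pads the last tile and yields roughly 68% effective PE utilization, while parallelizing over (a, b) keeps every PE busy.

```python
# Effective PE utilization under two tiling/dataflow choices (idealized model).
from math import ceil

def utilization(a, b, c, p1, p2, dims):
    macs = a * b * c                                     # useful multiply-accumulates
    if dims == ("a", "c"):                               # tile dimensions a and c across the array
        slots = ceil(a / p1) * ceil(c / p2) * b * p1 * p2
    elif dims == ("a", "b"):                             # tile dimensions a and b instead
        slots = ceil(a / p1) * ceil(b / p2) * c * p1 * p2
    return macs / slots                                  # fraction of PE slots doing useful work

print(utilization(62, 124, 64, 31, 31, ("a", "c")))      # ~0.69 (padded last tile along c)
print(utilization(62, 124, 64, 31, 31, ("a", "b")))      # 1.0  (tiles divide evenly)
```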
Weight-Stationary (WS) Dataflow: In WS, in each pass the PEs pre-load a (stationary) block of the weight/kernel matrix of size P_SA1 × P_SA2 into their local registers. The input matrix is then fed as tiles of size P_SA1 × a into the systolic array in a pipelined fashion. In each clock cycle, each PE performs a MAC operation, shifting the input to the next neighbor along the P_SA1 direction and shifting the partial result down to accumulate the partial sums. Each of the ⌈b/P_SA1⌉ passes produces an intermediate accumulation result for the corresponding P_SA2 × a partition of the output matrix. These partial results are accumulated at the bottom of the systolic array using accumulators connected with FIFOs of depth P_SA1 + c. After one round (a total of ⌈b/P_SA1⌉ passes) is complete, the final results of one P_SA2 × a partition of the output matrix are written back. Such rounds are repeated for every block-tile pair until the entire GEMM is covered. To remove the initialization overheads in each pass, we modify the basic PE design to pre-load weights (inputs) in a ping-pong manner using two shift registers (highlighted in blue in Figure 3.3). The P_SA1 × P_SA2 weight block for the next pass is pre-fetched into the register while the current pass is still being processed. Input-Stationary (IS) Dataflow is the mirror of WS: in each pass the PEs pre-load a (stationary) block of the input matrix, shifting the weight and result tiles.

Data Layout Transformation Module.
Each algorithm (im2col, kn2row, and Winograd) requires its input and produces its output in a specific layout. We design the Data Layout Transformation (DLT) Module to achieve layout transformation on the fly with minimal overheads. On the data-store (data-load) side, the DLT module streams in [on-chip SRAM (DRAM) address, data] tuples, converts the output layout of the previous layer into the correct input tensor layout for the algorithm implemented in the following layer, generates [DRAM (on-chip SRAM) address, data] tuples, and streams them out to the DDR controller. Because the DLT on the data-load side performs operations symmetric to those on the data-store side, with the on-chip SRAM / DRAM address tuples flipped, we only show the transformation scheme for the data-store side. While the three algorithms use different input tensor layouts, im2col and kn2row output the intermediate feature map in the same layout, the spatial 3D tensor layout. Therefore, the DLT Module selects from one of six available combinations for layout conversion. When both layers use the kn2row algorithm, the output layout is the same (3D tensor) as the next input layout; in this case, the transformation is simply a one-to-one matching between consecutive on-chip SRAM and DRAM addresses. We list the other conversions below:

Figure 3.4: LTU (Data-Store side): FSM flow
The process for transforming a 3D Tensor output to a Toeplitz input is shown in Table 3.1 first row, state 2 loops inside each row (length K1) of sliding window, state 3 iterates all K2 rows in a sliding window and state 1 steps over all overlapped sliding windows. Winograd Layout Transformation: In Winograd Input Tensor Layout, H1H2 m 2 input feature tiles (each sized at (m + r − 1)2 ) are stretched into rows and adjacent tiles share an overlap with width of r − 1. As each tile sized at (m + r − 1)2 is initially scattered to (m + r − 1)2 different matrices for separate matrix multiplications [69], H1H2 m 2 elements corresponding to the same relative position in each overlapping tile should be adjacent in on-chip SRAM, When transforming from 3D tensor layout to Winograd input Layout (Table 3.1 row 2). Consistently, in the Winograd output Tensor Layout, the m2 elements of each output tile are scattered to locations spaced out at H1 m horizontally or H2 m vertically, and H1H2 m 2 elements from different tiles are adjacent in the Output Buffer. Therefore, to transform from Winograd output layout to Toeplitz layout, we first need to restore the 3D Tensor layout (Table 3.1 row 3), then transform to Toeplitz input layout using row 2 configurations. To avoid extra roundtrip to DRAM, we double-buffer the Output Buffer 34 into two bank groups, where the systolic array writes to bank group A, LTU #1 transforms 3D Tensor and writes into bank group B, while LTU #2 takes input from bank group B and writes into DRAM. 3.1.4 Design Exploration Figure 3.5: DYNAMAP Software Tool Flow DYNAMAP Software Execution Flow. DYNAMAP uses a hardware overlay template (Section 3.1.3), which is parameterized with PSA1, PSA2, sequence of ψ, and sequence of control signals encoded by specific algorithm mapping that defines the behavior of DLT, Linear Transform and Pad-and-Accumulate modules. As shown in Figure 3.5, after inputs (FPGA device capabilities, CNN meta data) are provided, DYNAMAP executes the following steps: 1. Identify PSA1, PSA2 and best ψ associated with available algorithms in each layer. 2. These parameters are used to construct and populate the CNN graph as discussed in Section 3.1.4 - Cost Graph Construction. 3. Then, an off-the-shelf PBQP solver [115] is utilized to perform the node reduction steps for algorithm mapping as described in Section 3.1.2. The PBQP solver outputs the optimal algorithm assignment vectors for all layers. 4. Based on the algorithm-dataflow mappings, the template overlay is customized. 5. CNN is scanned to identify any consecutive layers whose total memory consumption do not exceed on-chip SRAM capability. Store-side LTUs are allocated and customized to generate SRAM addresses and store the layer output into the Input Buffer. Thus, on FPGA devices with larger on-chip memory, redundant off-chip data traffic will be avoided. 6. Integrating all the optimizations, control signal sequences are generated 35 to support the algorithm switching on the hardware overlay. The output of DYNAMAP is synthesizable VERILOG program that can be deployed on the target FPGA. In the following, we discuss in detail (i) the CNN Cost Graph Construction, which assumes a fixed systolic array of size PSA1 × PSA2, and identifies the cost vectors c and Transition matrices T for each layer; (ii) the Hardware Customization, which performs DSE to identify the systolic array dimensions and dataflow mapping to algorithms, providing input to the Cost Graph Construction. Cost Graph Construction. 
To construct the cost graph G = (V′, E′, C_v, T_e) from the CNN graph G = (V, E), for each vertex v^i ∈ V we add a node v^i_c to V′. Moreover, for each v^i ∈ V with outdegree(v^i) > 1, we add another node v^i_s to V′, for the following reason: a layer i that is connected directly to multiple downstream layers can store its output in only one format, and the data load time of each downstream layer depends on this format. The vertex v^i_s captures the format in which layer i stores its data to DRAM. Now, for an edge (v^i, v^j) ∈ E, if outdegree(v^i) ≤ 1, we simply add the edge (v^i_c, v^j_c) to E′. Otherwise, we add the new edges (v^i_c, v^i_s) and (v^i_s, v^j_c) for all j.

Cost Vector Array Construction: Let ψ denote the dataflow. For a GEMM with input dimensions a × b and weight dimensions b × c, the execution time on a P_SA1 × P_SA2 systolic array is given by:

C^mm_{(P_SA1, P_SA2, Ψ)}(a, b, c) =
    ⌈a / P_SA1⌉ × ⌈c / P_SA2⌉ × b + I_SA,   if Ψ = NS
    ⌈b / P_SA1⌉ × ⌈c / P_SA2⌉ × a + I_SA,   if Ψ = WS
    ⌈b / P_SA1⌉ × ⌈a / P_SA2⌉ × c + I_SA,   if Ψ = IS                                  (3.2)

where I_SA represents the one-time initialization overhead. Thus, the latencies of executing a CONV layer on a device with frequency FREQ are given by Equation 3.3 for im2col, Equation 3.4 for kn2row, and Equation 3.5 for the Winograd (m, r) algorithm:

C^mm_{(P_SA1, P_SA2, Ψ)}(O1 O2, K1 K2 C_in, C_out) / FREQ                              (3.3)

C^mm_{(P_SA1, P_SA2, Ψ)}(O1 O2, C_in, C_out) × K1 K2 / FREQ                            (3.4)

( C^mm_{(P_SA1, P_SA2, Ψ)}(H1 H2 / m², C_in, C_out) + LT ) × (m + r − 1)² K1 K2 / r² / FREQ    (3.5)

where LT denotes the linear transformation overhead. Let the number of algorithm-dataflow pairs in the i-th layer be |A_i|. We define the cost vector array C_v, consisting of |V′| vectors, as follows. For i ∈ V_c, define c_i ∈ R^{|A_i|}, where entry j is computed by plugging the dimensions of layer i into one of Equations 3.3-3.5 with the appropriate ψ for algorithm-dataflow pair j. For i ∈ V_s, define a zero vector c_i ∈ R^{\sum_{d′=1}^{d} |A_{d′}|}, where d is the outdegree of i and d′ indexes each downstream layer.
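For reference, the following is a small Python transcription of the per-layer cost model of Equations 3.2-3.5 (illustrative only: the overheads I_SA and LT are folded in as optional constants, the example layer dimensions and array size are arbitrary, and latency in seconds is obtained by dividing the returned cycle counts by FREQ).

```python
# Cycle-count model for one CONV layer under each algorithm-dataflow pair.
from math import ceil

def c_mm(a, b, c, p1, p2, dataflow, I_SA=0):
    if dataflow == "NS":
        return ceil(a / p1) * ceil(c / p2) * b + I_SA
    if dataflow == "WS":
        return ceil(b / p1) * ceil(c / p2) * a + I_SA
    if dataflow == "IS":
        return ceil(b / p1) * ceil(a / p2) * c + I_SA

def layer_cycles(algo, O1, O2, K1, K2, Cin, Cout, H1, H2, p1, p2, df, m=2, r=3, LT=0):
    if algo == "im2col":                     # Eq. 3.3 (in cycles)
        return c_mm(O1 * O2, K1 * K2 * Cin, Cout, p1, p2, df)
    if algo == "kn2row":                     # Eq. 3.4: K1*K2 smaller GEMMs
        return c_mm(O1 * O2, Cin, Cout, p1, p2, df) * K1 * K2
    if algo == "winograd":                   # Eq. 3.5: (m+r-1)^2 GEMMs per r x r kernel tile group
        return (c_mm(H1 * H2 // m**2, Cin, Cout, p1, p2, df) + LT) \
               * (m + r - 1)**2 * (K1 * K2) // r**2

# example: one hypothetical 3x3 layer on a 92x66 array, non-stationary dataflow
for algo in ("im2col", "kn2row", "winograd"):
    print(algo, layer_cycles(algo, 28, 28, 3, 3, 192, 64, 28, 28, 92, 66, "NS"))
```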
Transition Matrix Construction: Each layer fetches its input from external memory (DRAM), stores it in on-chip memory, performs its computations, and stores the output back into external memory. Each CONV algorithm has specific input and output layouts; we denote the layout associated with layer i as its algorithm format (AF_i). The transition overhead between two layers is composed of the following: (a) Store: the latency to store the current-layer output from on-chip SRAM into DRAM in the layout needed by the next layer's algorithm; (b) Load: the latency to load the next-layer input from DRAM into on-chip SRAM in the layout needed by the next layer's algorithm; and (c) Overheads: the latency of max pooling, etc. Feature map Load/Store latencies can be calculated using the equations in Table 3.2, where BW denotes the DDR bandwidth. For a store, AF_i → AF_{i+1} means layer i computes using algorithm AF_i and the output needs to be stored (transformed) into the format of AF_{i+1}. For a load, it means the output of layer i was stored in the format of algorithm AF_i and layer i + 1 will use AF_{i+1} as input.

Table 3.2: Load/Store Latency
Algo. Format: AF_i → AF_{i+1}                          | Load/Store Latency
im2col → im2col; kn2row → im2col                       | O1 O2 K1 K2 C_out(i) / BW
im2col → kn2row; kn2row → kn2row; Winograd → kn2row    | H1 H2 C_out(i) / BW
im2col → Winograd; kn2row → Winograd                   | H1 H2 (m+r−1)² C_out(i) / (m² · f(BW, C_out(i)))
Winograd → Winograd                                    | H1 H2 (m+r−1)² C_out(i) / (m² · BW)
Winograd → im2col                                      | O1 O2 K1 K2 C_out(i) / BW + ovhd
* H1, H2, K1, K2, O1, O2 are Layer_{i+1} metadata.

The 1st row of Table 3.2 covers the transformation from the 3D Tensor to the Toeplitz layout, which incurs data copies due to overlapping sliding windows but can be streamed out to consecutive DRAM addresses. In the 2nd row, between im2col/kn2row output and kn2row 3D Tensor input, a one-to-one matching is required; with Winograd output features, some re-ordering is required. For the 3D Tensor to Winograd input layout transformation (3rd row), both data re-ordering and duplication are needed, and the generated DDR addresses are H1H2/m² apart. Depending on whether each transaction of C_out(i) addresses saturates the DDR burst length, burst-length wastage may occur. We use f to capture such possible wastage of bandwidth:

f(BW, C_out(i)) =
    BW,                                                 if C_out(i) ≥ BL
    [ C_out(i) / ( C_out(i) + m² / (H1 H2) ) ] × BW,    otherwise                      (3.6)

The 4th row models the transformation from the Winograd output to the Winograd input layout; since both are in the "scattered" layout, streaming access can be achieved. The 5th row models the time for the two-step transformation from Winograd output to 3D Tensor, followed by 3D Tensor to Toeplitz layout; we use two pipelined LTUs operating on double-buffered SRAM, and use ovhd to denote the initialization overhead.

Figure 3.6: A snippet of an example G_{P_SA1, P_SA2}

For an edge (v^i_c, v^j_c), the transition matrix T_ij of size |A_i| × |A_j| is constructed as follows: for algorithm m in layer i and algorithm n in layer j, the entry T_ij(m, n) is evaluated as the transformation execution time overhead Store(m, n, dim(j)) + Load(n, n, dim(j)) plus other overheads (pooling, etc.) if applicable. For an edge (v^i_c, v^i_s) such that v^i_s has b outgoing neighbor vertices, indexed {1, ..., b} for simplicity, the transition matrix T_ii is of size |A_i| × \sum_{b′=1}^{b} |A_{b′}| and is constructed as follows: T_ii(m; b′, o) = Store(m, o, dim(b′)) plus other overheads, for algorithm m in layer i and the storage format corresponding to algorithm o in the layer corresponding to vertex b′. For an edge (v^i_s, v^j), where v^j is the b′-th neighbor of v^i_s, the transition matrix T_ij is of size \sum_{b′=1}^{b} |A_{b′}| × |A_j| and is constructed as follows: T_ij(o; b′, p) = Load(o, p, dim(j)), where p is the algorithm in layer j. Figure 3.6 shows a snippet of an example graph G_{P_SA1, P_SA2}.

Hardware Customization.
We determine (1) P_SA1 × P_SA2 and (2) the optimal dataflow mapping for each algorithm in each layer. P_SA1 × P_SA2 is then used to construct the CNN cost graph G_{P_SA1, P_SA2}. The process is as follows: we iterate through possible values of P_SA1 and P_SA2. For a fixed (P_SA1, P_SA2) pair, we calculate the empirical total node cost τ_temp, which is the sum of the execution times of all the algorithms over all the layers, where the execution time of an algorithm for a layer is calculated using the dataflow that leads to the minimum value. The (P_SA1, P_SA2) pair with minimum τ_temp is output.

3.1.5 Experiments

The fundamental objective of our framework is to reduce the hardware under-utilization induced by diverse layer shapes and to minimize the total end-to-end inference latency.
In this section, we use our framework to generate the hardware-algorithm co-designs for two state-of-the-art CNNs, GoogleNet[132] and Inceptionv4 [132] and show: (1) how our dynamic dataflow selection and architecture configuration technique achieves local optimal acceleration at each layer by driving up effective PE utilization; (2) how our novel algorithm mapping achieves global optimal acceleration by improving end-to-end inference latency. 39 We use Xilinx Alveo U200 FPGA board hosted on a Xeon Server CPU (E5-2698 v4 @2.2GHz) to evaluate the designs generated by our framework. We use 8-bit fixed-point data representation to perform CNN inference. The designs were synthesized using Vivado 2018. We input the CNN model and FPGA device meta data into our framework to obtain the architecture customization as output. We limit the systolic array DSP consumption to 6084 instead of using all the available DSPs to obtain a fair performance comparison with the state-of-the-art implementations. DYNAMAP returns optimal (PSA1, PSA2) as (92,66) for GoogleNet, and (95,64) for Inception-V4. Effect of layer-wise algorithm switching. We calculate the execution time of each CNN module using different algorithms. Figure 3.7 and 3.8 shows the results for the two CNNs, where on the x axis we group all the CONV layers in each Inception (Reduction) Modules, consistent with the notions in [132, 131] and the corresponding columns show the sum of computation and communication latency of all layers in an Inception (Reduction) Module. The STEM module in Inception-v4 is broken down by the first Filter Concatenation layer for better visibility. The "im2col-only" (bl3) columns show the result of using one algorithm - im2col - across all the layers, the "kn2row-applied" (bl4) columns show the results of applying kn2row where possible and im2col everywhere else, and the "wino-applied" (bl5) columns show the results of applying Winograd (m = 2, r = 3) where applicable (i.e. layers with square-shaped kernels) and im2col everywhere else. OP Treturned are the results using algorithm mapping returned by DYNAMAP, which is observed to be superior than all bl3−5 on all modules. In Inception-v4, a large portion of the kernels are shaped 7(3)×1, making such layers more memory-bound, therefore kn2row almost always out-perform im2col, which requires data duplication and results in less data-reuse. However, on GoogleNet, for most layers the lower-communication-cost benefit of kn2row do not offset the overheads due to "Pad-and-Accumulate" and serializing a large GEMM into K2 smaller ones, making it less advantageous. A typical GoogleNet Inception Module has two layers with square-shaped 3 × 3 and 5 × 5 kernels among others. While applying Winograd on such layers always 40 reduces the computation complexity, it is not always optimal overall. This is because for kernels larger than 3×3, K1K2 3 2 rounds of Winograd is required, resulting in severe transformation overheads and amortized decrease in computation complexity reduction. Winograd also imposes high memory overheads and Figure 3.7: Layer exe. times: Inception-v4 Figure 3.8: Layer exe. times: GoogleNet layout transformation cost, so kn2row is overall better for such layers with slightly higher computation cost but significantly lower communication cost. These observations suggest that an algorithm mapping scheme that greedily chooses the algorithm with the smallest layer node cost c would not return the optimal mapping. 
DYNAMAP captures the tradeoffs that occur in such algorithm transitions, yielding lower end-to-end latency than using any of the algorithm or even all three algorithms greedily selected based on layer node costs. The algorithm mapping obtained in DYNAMAP is optimal on the given systolic array (as 41 supported by Theorem 1) and is obtained within 2 seconds on an AMD 3700X cpu. The overall percentage decrease in the latency of the designs returned by DYNAMAP compared to the base-lines (bl3−5) are summarized in Table 3.3. Table 3.3: End-to-end Latency Improvement due to Dynamic Algorithm Mapping bl3 bl4 bl5 GoogleNet 67.5% 78% 22% Inception-V4 86% 61% 17% 42 Comparison with State-of-the-art. Table 3.4 compares the performance of our design produced by DYNAMAP with existing works accelerating the same networks. We achieve 286MHz frequency for both GoogleNet and Inception-v4 accelerator designs. GoogleNet acceleration using DYNAMAP significantly outperforms [84] and [151] in terms of both latency and throughput. This is partly due to the advantage of DYNAMAP’s optimizations on dataflow and algorithm switching, partly due to the lower-precision we adopted enabling more PEs. Even if we scale down the systolic array size (2 DSP consumption per PE), in the worst case the performance will be halved and we still achieve 2× and 1.4× lower latency compared with [84] and [151] respectively. For Inception-v4, we compare with [156] which applies dynamic memory management to overcome data transfer bottlenecks and [143] that uses kn2row method for all layers in GoogleNet. Compared to [143], even with lower frequency, our design achieves 20% speedup. While using Winograd on some layers leads to low complexity, its impact is limited as there are more memory-bound than computation-bound layers in Inception-v4. However, DYNAMAP allocates kn2row to those memory-bounded kernels while keeping the computation-bounded layers optimized as well, integrating dataflow optimization to improve hardware utilization in both cases. As CNNs evolve to be more layer-diverse and the tradeoffs become less obvious, the benefits of using DYNAMAP will become much more pronounced. The motivation of FlexCNN [127] is similar to that of DYNAMAP. However, it uses dynamic tiling with data layout optimizations across different layers to drive up effective DSP utilization to as high as 93.5%/91.4% on 3x3-/1x1-kernel layers on the Open Pose-v2 network (2.9 GOPS). It achieves a singleimage inference latency of 24.7ms using 8x8x8 systolic array. To estimate the best-case performance using FlexCNN to accelerate Googlenet (∼ 3 GOPS) and Inception-v4 (∼ 9 GOPS), we project this latency onto GoogleNet (Inception-v4) with 92x66(95x64) PEs as deployed in our design (optimistically assuming 100% DSP utilization on all types of layers): Lprojected−GN = 24.7ms × 8×8×8×93% 92×66×100% × 3GOP S 2.9GOP S = 2ms, 43 Lprojected−Incp4 = 24.7ms × 8×8×8×93% 95×64×100% × 9GOP S 2.9GOP S = 6ms, both higher than DYNAMAP’s achieved latency. This is because DYNAMAP uses compute-reducing algorithm, Winograd, and memory-saving algorithm, kn2row, to resolve bottlenecks in both compute-intensive and memory-bound layers. The achieved performance benefits offsets the additional overheads for switching between different algorithms in DYNAMAP. 
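The best-case projection above can be written out explicitly. The following small sketch (our own transcription; the 8 × 8 × 8 array, 93% utilization, 24.7 ms latency, and 2.9 GOPS workload are FlexCNN's reported figures quoted above) scales FlexCNN's measured latency by the ratio of effective PE counts and by the ratio of workload sizes, reproducing the roughly 2 ms and 6 ms estimates.

```python
# Best-case latency projection used in the FlexCNN comparison (illustrative transcription).
def project_latency(lat_ms, base_pes, base_util, base_gops, target_pes, target_gops):
    # scale by effective PE ratio (assuming 100% utilization on the larger array)
    # and by the ratio of workload sizes in GOPS
    return lat_ms * (base_pes * base_util) / target_pes * (target_gops / base_gops)

print(project_latency(24.7, 8 * 8 * 8, 0.93, 2.9, 92 * 66, 3.0))   # ~2 ms  (GoogleNet, 92x66 PEs)
print(project_latency(24.7, 8 * 8 * 8, 0.93, 2.9, 95 * 64, 9.0))   # ~6 ms  (Inception-v4, 95x64 PEs)
```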
Table 3.4: Comparison with state-of-the-art implementations
                                     | This Paper (GoogleNet) | This Paper (Inception-v4) | [84] (GoogleNet) | [151] (GoogleNet) | [156] (Inception-v4) | [143] (Inception-v4)
Device                               | Alveo U200             | Alveo U200                | Stratix 10 GX    | KU115             | XCVU9P               | XCVU9P
Datatype                             | INT8                   | INT8                      | INT16            | INT16             | INT8                 | INT8
Frequency [MHz]                      | 286                    | 286                       | 300              | 250               | 300                  | 180
DSP (% total)                        | 6239 (91%)             | 6230 (91%)                | 6304 (55%)       | 4214 (76%)        | 5254 (75%)           | 5130 (75%)
On-chip Memory [BRAM/M20K] (% total) | 2K (93%)               | 2.1K (97%)                | 1949 (17%)       | 2160 (100%)       | 1664 (77%)           | 562 (26%), 845 (88%) URAM
LUT/ALM (% total)                    | 745K (60%)             | 806K (65%)                | 528K (97%)       | 663K (71%)        | 469K (40%)           | 543K (46%)
Throughput [GOPS/s]                  | 3568                   | 3650                      | 557              | 1630              | 3448                 | 1528
Latency/image [ms]                   | 1.34                   | 4.39                      | 5.7              | 3.8               | 5.29                 | 6.03

3.2 FSCAccel: Hardware-Algorithm Co-Design for Fractionally Strided Convolution in Training

3.2.1 Fractionally Strided Convolution in CNN Training: Computational Challenges

Fractionally Strided Convolution (FSC) [35] is a key operation in training image-based RL and machine learning models. Specifically, it is the operation computed in the back propagation of CNN training [140, 83], as well as in the decoding stage of convolutional auto-encoders and generative CNNs (GANs) [85, 109]. FSC typically performs up-convolution on a 2-D grid image, i.e., it expands the image to a larger one, as opposed to conventional (down-)convolution, resulting in more complex computation patterns. Specifically, FSC, when expressed as matrix multiplication, introduces additional zero-insertions within the features and zero-paddings on the edges of the feature maps [35]. This leads to severe under-utilization of hardware resources. Moreover, since these zeros are spread across all of the sliding-window regions in an unstructured manner, it is very difficult to avoid the significant pipeline stalls introduced by unnecessary zero computations.

Figure 3.9: FSC producing 224²-pixel feature maps with 64 (3) input (output) channels

The challenge of hardware under-utilization due to zero-spacing is especially severe in layers with large kernels and high strides, which are common in the early layers of down-sampling CNNs and in all up-sampling (decoding and generative) CNNs. To illustrate this, in Figure 3.9 we show the breakdown of non-zero and zero computations for commonly used stride and kernel sizes (s, k) in FSC layers operating on a typical 224 × 224 feature map. It is evident that the zero computations can be a severe bottleneck, taking up to 85% of computation time when the layer has larger kernel and stride sizes [149, 147]. While several existing works propose to tackle the zero-insertion problem by interpolation of broadcast multiplication [105], they usually suffer from high control and communication overhead or poor data reuse, which limits effective hardware utilization or scalability. Moreover, none of the existing FPGA training frameworks [140, 158, 32, 97, 83] address the issue of zero-padding in computing FSC.

3.2.2 FSC in CNN Training

The training of a CNN involves both forward propagation (FW) and backward propagation (BW) using a batch of data. The forward propagation in a CNN model involves iteratively convolving kernels w with the input features to generate the output features used as input for the next layer(s):

o^l_{x,y} = \sum_{x′} \sum_{y′} w^l_{x′,y′} a^{l−1}_{(x+x′),(y+y′)}                    (3.7)

After the completion of FW, the performance of the network is estimated using an objective function that leverages the label/reward value and the output of the FW process. The derivative of the cost function with respect to this output is the error obtained at the output layer.
Error values are back-propagated (BW) to all hidden layers to obtain their local gradients, as described in the following equation [140]:

δ^l_{x,y} = φ′_l(o^l_{x,y}) \sum_{x′} \sum_{y′} δ^{l+1}_{x′,y′} w^{l+1}_{(x−x′),(y−y′)}                    (3.8)

where δ are the local gradients, w′ is the flipped kernel, and φ′_l(·) is the activation derivative of layer l. For down-sampling models, the local gradients of layer l are obtained by applying an FSC of the flipped kernels to the local gradients of layer l + 1. The use of FSC arises from the need for a transformation going in the opposite direction of a normal down-sampling CONV; i.e., FSC reconstructs the shape of the input of a CONV from the shape of its output while maintaining a connectivity pattern that is consistent with that CONV. This case is more complex than BW in a fully-connected layer, which only requires a transposed dense weight matrix.

FSC as Matrix Multiplication.
There are multiple strategies for executing a general convolution layer. One way is to directly express convolution as a matrix multiplication (matmul) by flattening the input/output feature maps and re-arranging the kernel into a sparse Toeplitz matrix; BW can then be modeled as a matrix multiplication with the transposed kernel matrix. This sparse-weight strategy introduces significant sparsity and data duplication in the weights and is not hardware-friendly [35]. Instead, existing GPU and FPGA implementations of CNN training usually express FSC as a normal convolution to ensure efficient hardware reuse across the FW, BW, and weight-update processes of training [32, 158, 97, 93], and then further apply im2col [23] to the normal convolution, a more common technique than the aforementioned sparse-weight strategy for expressing convolution as a matmul while keeping the kernel dense. However, this method still suffers from inefficiency because it usually involves zero-spacing in the feature maps.

(a) Conv: down-sampling from 5² to 2²  (b) FSC as CONV: up-sampling from 2² to 5²
Figure 3.10: Stride = 2 CONV FW and its corresponding BW fractionally-strided CONV
Note: The figures are created based on the repository at https://github.com/vdumoulin/conv_arithmetic

In the BW (FSC), to maintain the same connectivity pattern in its equivalent convolution as that of the down-sampling FW, the BW of a convolution with FW stride > 1 involves an equivalent convolution with stride < 1. As a result, zeros are inserted between input units, which makes the kernel move at a slower pace than with stride = 1. Specifically, the following relationship holds [35, 9]:

Lemma 1. A FW convolution described by input size i, output size o, padding p = 0, kernel size k and stride s has an associated BW FSC described by input size ĩ′, output size o′ = i, kernel size k′ = k, stride s′ = 1 and padding p′ = k − 1, where ĩ′ = s(o − 1) + 2k − 1 is the size of the stretched input obtained by adding s − 1 zeros between input elements, and its output feature size is o′ = i = s(o − 1) + k.

Fig. 3.10 shows an example pair of a mirroring CONV (i = 5, o = 2, k = 3, s = 2, p = 0) and its FSC (according to Lemma 1: i′ = o = 2, o′ = i = 5, k′ = k = 3, s′ = 1, p′ = k − 1 = 2). When executing the FSC, it is essentially translated into a CONV such that the input feature maps are padded with k − 1 = 2 rows and columns of zeros, with s − 1 = 1 zeros inserted between all the pixels, making the input dimension ĩ′ = s(i′ − 1) + 2k − 1 = 2 × 1 + 2 × 3 − 1 = 7.
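The bookkeeping of Lemma 1 is summarized in the following sketch (illustrative only; the zero-fraction estimate is our own reading of the argument, counting only the inserted zeros inside one unit kernel's sliding region). It reproduces the Fig. 3.10 example, ĩ′ = 7 and o′ = 5, and shows how quickly the zero fraction grows with k and s.

```python
# Lemma 1: parameters of the equivalent stride-1 convolution that computes the BW FSC,
# plus a rough estimate of how many multiplications land on inserted zeros under a
# naive (im2col over the stretched, padded input) lowering.
def fsc_equivalent(i, k, s):
    o = (i - k) // s + 1                 # forward output size (p = 0)
    i_tilde = s * (o - 1) + 2 * k - 1    # stretched + padded FSC input size
    o_prime = i                          # FSC output recovers the forward input size
    return o, i_tilde, o_prime, k, 1, k - 1   # (o, i~, o', k'=k, s'=1, p'=k-1)

def zero_fraction(i, k, s):
    o, i_tilde, *_ = fsc_equivalent(i, k, s)
    region = i_tilde - (k - 1)           # side of one unit kernel's sliding region
    return 1.0 - (o * o) / (region * region)   # only o x o pixels in that region are non-zero

print(fsc_equivalent(i=5, k=3, s=2))     # (2, 7, 5, 3, 1, 2), matching the Fig. 3.10 example
print(round(zero_fraction(i=5, k=3, s=2), 2))    # 0.84
print(round(zero_fraction(i=224, k=8, s=4), 2))  # zero fraction grows with larger k and s
```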
From the rules to express FSC as a direct CONV, it can be observed that the amount of zeros inserted are proportional to the kernel size and non-unit stride size in a FW CONV layer. Because there is no direct relation between the kernel sliding window size and the amount of zero insertions, such zero-spacing can become excessive in layers with large kernels and strides. This introduces severe inefficiency when deployed on a spatial accelerator. The common technique to use im2col[23] to express CONV as a matrix multiplication by stretching the kernels into rows and receptive fields into columns cannot skip the padded and inserted zeros, so it causes a significant amount of unuseful computation (order of ˜i ′ טi ′ − o 2 , which is O((s(o − 1) + 2k − 1)2 − o 2 ) in FSC. FSC as Interpolated Pixel-Broadcast Multiplication. Figure 3.11: FSC as Interpolated Pixel-Broadcast-Multiplication: upsampling from 2 × 2 to 5 × 5 grid Apart from FSC expressing as normal CONV and matmul, another method for performing FSC consists of three steps: (1) each input pixel is broadcast to and multiplied with all k × k kernel weights, thus furnishing a block of k × k output products; (2) The generated blocks are each added to different but overlapped local k × k region of the output feature map based on their input pixel location and the layer stride size; (3) After steps (1) and (2) are repeated for all input pixels, the padded edges on the output feature map is cropped out. 48 It is worth noting that in step (2), neighboring input pixels lead to a total of i ′ × i ′ overlapping output blocks, which are overlapped and accumulated onto the output feature map with the same stride as its mirroring CONV operation. An order of i ′ × i ′ × (k − s) overlapping rows and columns must be properly managed to provide the correct FSC result. Although this method eliminates the zero-insertions introduced by the matmul method described in FSC as matrix multiplication, it requires additional calculation to account for which pairs of blocks need accumulations in the overlapped region, which can cause unnecessary design complexity in terms of routing, etc and introduce expensive control overhead in hardware. In this paper, we propose another novel approach to address the drawback of both the above mentioned methods by accelerating FSC with an algorithm — kn2row. kn2row completely eliminates all zerocomputations introduced by the normal CONV/matmul methods and easily scale to different layer metadata avoiding potential complex control logic and routing resulted from the Interpolated Pixel-BroadcastMultiplication method. Accordingly, we design a scalable 1 × 1 CONV Core based hardware architecture without zero-skipping control overhead while completely getting rid of zero-computations introduced by large kernel/stride sizes, thereby, achieving maximal hardware utilization. 3.2.3 Algorithmic Optimization We define the parameters of a Fractionally-Strided Convolution based on the following notations: A FSC layer has c ′ in (c ′ out) input (output) channels, where each channel is a ˜i ′ טi ′ (o ′ × o ′ ) 2D feature map (note ˜i ′ × ˜i ′ is zero-padded based on the kernel size, k and stride size, s in a mirroring normal CONV of the same layer, as discussed in Lemma 1). The layer weights W contain c ′ in × c ′ out kernel slices, each sized at k ′ ×k ′ . 
If the Fractionally-Strided Convolution is naively expressed as a matmul using the im2col method, it stretches each group of c′_in kernels into a row of the weight/kernel matrix W, and each group of c′_in corresponding windows of the input feature maps into a column of the input activation matrix X, expressing the feed-forward pass as

z_l = W_l^{(c′_out × k′² c′_in)} · X_{l−1}^{(k′² c′_in × o′²)}                    (3.9)

where · stands for matrix product. As shown in the example of Fig. 3.12 (k = 3, s = 2, consistent with Fig. 3.10b), this method leads to zero-padding, which is worsened by the data-duplicating nature of the required im2col input layout. This can cause severe hardware under-utilization in Fractionally-Strided CONV layers. To address this problem, we propose to adapt the kn2row [139] algorithm to avoid (1) zero computation and (2) data duplication in this process.

Figure 3.12: Im2col applied to FSC

Our kn2row-based methodology consists of decomposing the convolution and reordering the data layout. In the first phase, "unit-CONV GEMM", a k′ × k′ convolution is computed using k′² separate 1 × 1 unit convolutions indexed by the tuple (a, b), a, b ∈ (0, k′], each of which is equivalent to a GEMM call:

f_{a,b} = W_l^{(c′_out × c′_in)} · X_{l−1}^{(c′_in × o′²)}                    (3.10)

In the second phase, "Pad-and-Accumulate", the intermediate output patches of all 1 × 1 unit convolutions, f_{a,b}, are accumulated to generate the final output feature maps. Each unit convolution produces an intermediate feature map of dimension (o′ × o′), corresponding to one of the k′ × k′ different 1 × 1 sub-kernels. These intermediate feature maps can be added by offsetting every pixel in all the output channels vertically and/or horizontally based on the sub-kernel indices in the original kernel. For example, if k′ = 3, the central sub-kernel of each kernel outputs an intermediate feature map that is perfectly aligned with the final output feature map and does not need to be shifted or padded. For the upper-right sub-kernel, the feature map must be offset up and right by one pixel. After shifting, the intermediate feature pixels that fall outside the final feature map bounds are discarded and the empty spaces on the lower and left columns are zero-padded. Formally, each unit convolution contributes to the final output feature map whose upper-left pixel is indexed by (x, y) as

z^{x,y}_l = \sum_{a=0}^{k′} \sum_{b=0}^{k′} f^{x+a, y+b}_{a,b}                    (3.11)

Figure 3.13: MCMK kn2row applied to FSC

The motivation for adapting the kn2row algorithm to FSC lies in the opportunity to eliminate the unnecessary zero computations that occur in the im2col method. An example of how kn2row eliminates the zero computations is shown in Fig. 3.13, which shows an FSC equivalent to that in Fig. 3.12. According to Lemma 1, the amount of zero insertion around the edge of the input feature map along dimension ĩ′ is 2(k − 1). Since the convolution of any 1 × 1 kernel element slides through an entire window of size (ĩ′ − (k − 1)) × (ĩ′ − (k − 1)), as shown in Fig. 3.13, all the 1 × 1 kernel elements must cover the same number of non-zero pixels in their own sliding-window regions. This suggests an opportunity to exploit the parallelism between the 1 × 1 sub-kernel convolutions and to extract out the non-zero pixels, so that we not only completely eliminate the unnecessary zero computations encountered in the im2col method, but also minimize the complexity of accumulating the intermediate feature maps generated by all 1 × 1 sub-kernels.
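The following is a minimal single-channel NumPy sketch (our own illustration; a multi-channel version replaces the scalar multiply by the (c′_out × c′_in) GEMM of Equation 3.10) of the kn2row-based FSC described above: each of the k′² sub-kernels touches only the i′ × i′ non-zero input pixels, and its patch is accumulated into the output at stride-interleaved positions offset by the sub-kernel index. The result is checked against a brute-force pixel-broadcast reference.

```python
# Single-channel kn2row FSC: k'^2 unit convolutions over non-zero pixels only,
# followed by stride-interleaved pad-and-accumulate.
import numpy as np

def fsc_kn2row(x, w, s):
    i_in, k = x.shape[0], w.shape[0]
    o = s * (i_in - 1) + k                       # FSC output size (forward conv had p = 0)
    out = np.zeros((o, o))
    for a in range(k):
        for b in range(k):                       # one "unit convolution" per kernel element
            out[a:a + s * (i_in - 1) + 1:s,
                b:b + s * (i_in - 1) + 1:s] += w[a, b] * x   # accumulate at offset (a, b), stride s
    return out

x = np.random.rand(2, 2)
w = np.random.rand(3, 3)

# brute-force reference: broadcast each input pixel through the full 3x3 kernel (s = 2)
ref = np.zeros((5, 5))
for px in range(2):
    for py in range(2):
        ref[px * 2:px * 2 + 3, py * 2:py * 2 + 3] += x[px, py] * w
assert np.allclose(fsc_kn2row(x, w, s=2), ref)   # same 2x2 -> 5x5 up-sampling, no zero work
```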
Figure 3.14: Dataflow Accelerator Engine Architecture 3.2.4 Accelerator Design To leverage both the data-level and the task-level parallelism, our methodology embraces a spatial dataflow accelerator engine to execute a subgraph (layer) of a CNN at a time and store the intermediate outputs to the DRAM. The accelerator is described in a HLS template for quick development, easy customization at compile-time and easy integration with high-level languages such as tensorflow at run-time. As illustrated in the architectural diagram in Fig. 3.14, the hardware template is composed of four Super Stages, and the components in different Super Stages are connected with FIFOs. Using our proposed algorithmic optimization described in Sec. 3.2.3, we express any arbitrary convolution and fractionally-strided convolution as a combination of two major operations: (i) 1 × 1 convolution and (ii) Padded Accumulation, which corresponds to the Super Stages 1 and 2. Super stage 0 fetches input features and weights from DRAM, and in Super Stage 3 the output feature maps are streamed out into DRAM from the on-chip memory banks. 52 It is worth noting that the accelerator does not require complex pre-processing of the input data layout, it simply streams in the inputs(weights) in a flattened row(column)-major layout w.r.t. all pixels(kernel weights). It streams out into DRAM with the exactly same layout to be easily processed by adjacent layers. Tiled Interleaving Convolution Cores. As shown in Fig. 3.15, each Conv Core performs tiled matrix multiplication using a 2D Processing Element (PE) array of size P1 × P2. It repeatedly takes one tile from the input feature matrix and one from the 1 × 1 sub-kernel matrix, and outputs a block of the intermediate output feature matrix. Both the input and the kernel tiles are buffered and sufficiently re-used on-chip. Specifically, each input pixel is reused by n 1 × 1 kernel weights (CONV Cores) and P2 output channels (PEs in the same row). Each weight is reused by all output channels and input pixels. Since the CONV operation is essentially one 1 × 1 sliding window through the entire ˜i ′ טi ′ padded input feature map, we can completely eliminate all redundant computations due to zero-insertions and paddings as described in Lemma 1. Formally, the total amount of MAC (Multiply-Accumulate) operations required for completing one fractionally-strided convolution layer is k ′2 ×cout ×cin ×i ′2 . Compared to using direct CONV (or im2col) method which requires cout ×k ′2 ×cin ×o ′2 MAC operations, the computation load onto the Conv Core(s) is reduced by an order of o ′2 i ′2 = (s(i ′−1)+k) 2 i ′2 = O(s 2 + sk i ′ + k 2 i ′2 ). The Conv Core templates support both variable-precision fixed Figure 3.15: 1x1 Conv Core Dataflow point and float point arithmetics. Each PE in the 2D array accumulates into a temporary register, which is written back to a tile buffer at the end of each iteration covering all c ′ in input channels. The accumulation 53 in each cycle requires the partial sum in the previous cycle of the accumulation loop (read-after-write dependency). This causes a loop-carried dependency, which is especially severe for applications that require float point precision (e.g. CNN training, Deep Reinforcement Learning) since float-point accumulators need > 10 cycles to finish. 
To avoid pipeline stalls from such dependencies from contiguous accumulation onto the same register, we re-arrange the iteration space, making the accumulation loop out-most, while the inner loops iterate through all pairs of tiles in the control logic. Accordingly, we implement an accumulation buffer of size c ′ out P2 × o ′2 P1 local to each PE to hold all partial sums from all pairs of input tiles. Such optimization ensures that in adjacent cycles, all the PEs read from different locations than those being processed in the immediate previous cycle, thus hiding the large intervals between accumulations onto the same output pixel and resolving the issue of pipeline stalls. In Super Stage 1 of the dataflow accelerator, we deploy n such PE arrays to conduct the reduced computations from 1×1 sub-kernels for a number of rounds to cover all k ×k convolutions, and their outputs are written to FIFOs in an aggregated manner. The FIFOs ensure that all Super Stages are executed in a task-level-ipelined manner. Pad-and-Accumulation Module. Pad-and-Accumulation (P.A.C.) Module takes as input all n outputs from the 1 × 1 Conv Cores and combine them in a stride-interleaving manner using a single scratchpad memory as shown in Fig. 3.16. We maintain an index-aware counter in the control of each P.A.C. accumulator to keep track of the address to which pixel (x, y, z) ∈ (o ′ , o′ , c′ out) in round a from CONV Core b accumulates onto, where a is indicator 54 of the number of rounds into the CONV Cores to cover all k × k kernel weights. The relation between the pixel and its address to store into the scratchpad is: index1 = k − 1 − (a ∗ n + b)/k + x ∗ s (3.12) index2 = k − 1 − (a ∗ n + b)%k + y ∗ s (3.13) Addrx,y,z,a,b = Scratchpad[index1][index2][z] (3.14) Essentially, each intermediate pixel needs to be spaced out by s − 1 when they get accumulated into the Figure 3.16: Pad-Accumulation Module final output feature maps. An example of s = 2, i′ = 2, o′ = 5 is shown in Fig. 3.16. To avoid (1) race condition where the accumulation operations of different intermediate feature maps overlap onto the same address in the same cycle and (2) loop-carried dependency for float accumulation, we carefully schedule the accumulation order such that all the pixels in different output channels and from the same input pixel but multiplied with different 1 × 1 weights are accumulated concurrently into different (adjacent) output locations. For example, in Fig. 3.16, the intermediate features of the same color are accumulated concurrently into adjacent locations, while different-colored features are accumulated in adjacent cycles where no loop-carried dependencies occur from read-after-write at the same location. The memory banks are 55 partitioned and arranged accordingly to support fully independent and parallel accumulations into independent memory banks. Generally, it only requires n × P2 memory banks at minimum to fully exploit the conflict-free parallel pad-and-accumulation without additional control. Model Partition Scheme. In modern multi-die FPGAs, an FPGA board is built by a manufacturing process called Stacked Silicon Interconnect technology that combines multiple Super-Logic Region (SLR) components (i.e. dies) mounted on a passive Silicon Interposer [145]. When fitting a design which has large amount of internal routing across SLRs, cross-die routing requires substantially long wires to distributed on-chip memory and computing resources. 
To avoid such intensive cross-SLR wiring that causes clock failure and/or long data transfer latencies, we scale our dataflow accelerator engine to multiple SLRs by making m copies of the dataflow accelerator engine in a SIMD manner. We let each engine compute all the output pixels in c ′ out m output channels. We thus ensure the computation performed by each engine is independent and write into the same DDR bank for seamless processing of the adjacent layers. 3.2.5 Evaluation Experimental Setup and Customization. The fundamental objective of our work is to reduce hardware under-utilization induced by zeroinsertions in computing FSC. Since FSC is an upsampling procedure and usually appears in the backpropagation of CNN training, we show the effectiveness of our co-design by generating the hardware for conducting the back-propagation phase of a state-of-the-art CNN model commonly used in imagebased Deep Reinforcement Learning workloads, Nature-CNN[93], and inference of a standard generative CNN, DCGAN [109]. We also generate and evaluate a general accelerator for common convolutional autoencoders [85], and show the results on typical FSC layers in the decoding stage of the inference task covering different kernel and stride sizes. 56 Model Hardware Parameters/Resource Utilization n P1 P2 l 6 8 8 16 LUT REG SRAM DSP Nature -CNN 579K (81%) 433K (30%) 56Mb (24%) 3904 (86%) n P1 P2 l 2 32 16 64 LUT REG SRAM DSP AutoEncoder, DCGAN 483K (68%) 726K (50%) 77Mb (33%) 2176 (47%) Table 3.5: Hardware Parameters/Resource Utilization for Benchmark CNN Models In this section, we show: (1) the improvement obtained from kn2row method compared to a direct im2col implementation using the same amount of computational resources; (2) the advantages of our FPGA hardware-algorithm co-design compared to general-purpose processors. We then (3) compare our method with other existing work that accelerate FSC kernel using the DCGAN benchmark. [109]. We use Xilinx Alveo U200 FPGA board to evaluate the designs generated by our framework. We use 32- bit float-point numbers to perform Nature-CNN computations, and 16-bit fixed-point numbers for image classification and restoration benchmarks (Auto Encoder and GAN). We build a parameterized overlay source template in HLS with customizable design parameters (data types, PE array dimensions, number of Conv Cores, etc.). The HLS designs were synthesized, emulated and implemented on the FPGA board using Vitis 2020. We limit the resource consumptions of each dataflow accelerator engine to those specified by a single SLR, and use two SLRs of the FPGA board for demonstration purpose (For Auto-encoder and DCGAN benchmarks, we limit the total constraint to that of only one SLR for a fair comparison with existing work). The major design parameters returned and their corresponding resource consumption for the two CNN back-propagation computations are shown in Table 3.5. For comparisons with general-purpose procesors, we use Intel Xeon Gold 5120 CPU [55] at 2.20GHz as the CPU baseline, and TITAN Xp GPU [100] at 1404 MHz as the GPU baseline. 57 Comparison with im2col. We define βi as ratio of the total number of effective computations and the total number of computation performed by all PEs in FSC layer i. 
That is,

\beta_i = \frac{\sum_{j=1}^{P_1 P_2 n} \sum_{t=1}^{T} PE_{j,eff}^{t}}{T \cdot \sum_{j=1}^{P_1 P_2 n} PE_{total}} = \frac{Y_{eff}}{T \cdot P_1 P_2 n}     (3.15)

As im2col suffers from more excessive zero computation (which is treated as ineffective computation) than kn2row, β serves as an accurate metric for comparing the computational speedup of kn2row given the same amount of computation resources. For all the benchmark CNNs, we summarize the effective utilization β of kn2row using the n P1 × P2-sized Conv Cores shown in Table 3.5, while im2col uses one Conv Core of size equivalent to n Conv Cores concatenated together ((n × P1) × P2), so that the im2col and kn2row designs under comparison consume the same amount of computational resources (i.e., PEs). For Nature-CNN and DCGAN, we show β layer by layer. For auto-encoders, we sample a variety of common configurations (input feature map size, kernel size, stride size, etc.) of up-sampling decoding layers in an auto-encoder architecture [85]. The variations help empirically confirm our previous theoretical analysis and evaluate the performance in a wide range of scenarios.

From Fig. 3.17, 3.18 and 3.19, it can be observed that our methodology always leads to higher effective hardware utilization. It is also clear that the stride size has a larger impact on the zero insertions (and thus the achievable improvement) than the kernel size (which only contributes to edge paddings). The zero insertions and paddings incurred in im2col lead to approximately s² times more (futile) MACs compared to kn2row.

Figure 3.17: Effective Utilization by BW Layers: Nature-CNN
Figure 3.18: Effective Utilization for Common Auto-Decoding Layer Configurations
Figure 3.19: Effective Utilization by FW Layers: DCGAN

From the experimental results, we observe the same trend: the β improvement factors are close to 4 in the Nature-CNN and DCGAN benchmarks, while they grow almost quadratically with increasing s from 2 to 4 in the common auto-decoding layers. For power efficiency comparisons, we record the sum of the power consumed by the accelerator and the DRAM (Global Memory for the GPU). The GPU power is sampled and averaged over a period of 1 ms during runtime. We report the improvement in GOPs/W (Giga-Operations Per Watt) using the actual execution time and the overall power consumption for the GPU and FPGA implementations.

Table 3.6: Comparisons with Existing Works
Design (Device) | SRAM | DSP | Freq (MHz) | Precision | GOPs | Power (W) | Scalability
[105] XC7VX690T | 840 Kb | 1680 | 320 | FIXED 16 | 921 | 4.1 | N/A
[147] N/A | 2.16 Mb | 256 PEs | 800 | FIXED 32 | N/A | 0.32 | 0.26/16
[157] XC7Z020 | 2.35 Mb | 220 | 100 | FIXED 16 | 2.6 | N/A | N/A
Ours XCU200 | 77 Mb | 2176 | 300 | FIXED 16 | 1125 | 9 | 0.9/16

We define another metric, γ, to measure the efficiency of the actual speedup with regard to the ideal speedup based on the raw total operations required by the FSC for a given algorithm, using the same number of PEs:

\gamma_i = \frac{\beta_{kn2row} / \beta_{im2col}}{OPS_{im2col} / OPS_{kn2row}}     (3.16)

A γ value of 1 suggests that the actual speedup is the maximum speedup achievable. In all of our experiments, we achieve γ = 81%−95% (with a mean of 90%), illustrating that our methodology scales to FSC layers of various shapes without incurring significant overhead.

Comparison with General-Purpose Processors. We deploy the entire CNN workloads (all layers) and compare the execution time with general-purpose processors. For Nature-CNN, we follow the batch sizes commonly used in reinforcement learning training scenarios and evaluate back-propagation at batch sizes ranging from 1 to 16 (Fig. 3.20).
For the auto-encoder, we use a standard architecture [85] of 5 down-sampling layers (encoding phase) followed by 5 up-sampling layers (decoding phase) with s = 2, k = 3 and evaluate the inference time of single sample up to batch size of 4. The same is done for the DCGAN benchmark. It can be first observed that FPGA always out-performs the CPU baselines. When compared with GPUs, even with ∼ 0.5× fewer number of computing units and ∼ 0.2× lower clock rate, FPGA kn2row implementation has only ∼ 1.8× higher latency in the worst case (batch-64 nature-CNN BW) than GPU, implicating that our implementation is efficient in terms of effective hardware utilization (higher at least 60 by a factor of 6.7×). Although our methodology sometimes takes longer end-to-end execution time, it always outperforms GPU baseline in terms of power efficiency (GOPs/W) by a factor of 2× to 6.5×. Figure 3.20: Back-Propagation Time: Nature-CNN Figure 3.21: Inference Time: Autodecoders Comparison with Existing Works. We compare throughput per Processing Element (effective GOP per cycle per PE) and scalability, x ′ (×n), where x ′ = fspeedup fPE increase to measure how well the speedup factor resulting from increasing parallel PEs by a factor of n scales (x ′ closer to 1 the better). [105],[147] and [157] all use the Interpolated Pixel-Broadcast-Multiplication method. Compared with [147], we exploit another level of parallelism from different 1 × 1 “kernel bars” (with kn2row) such that we not only re-use weight within each CONV Core but also re-use (broadcast) input features across different CONV Cores; Also, we re-use input features 61 Figure 3.22: Inference Time: DCGAN both across different output feature maps and across different kernel weights spatially. This enables us to efficiently overlap the required off-chip data accesses with computation even with a large number of PEs. Growing from 2 × 4 × 2 PEs to 8 × 16 × 2 PEs, we observe a speedup factor of 14.5 when accelerating DCGAN, while [147] only observes ∼ 6.7× speedup from 4 × 4 to 16 × 16 PEs. [147, 157, 105] also requires additional control to decide whether results from adjacent PEs overlap in the k × k broadcast region of output feature map, which may lead to control overheads (e.g. the accumulators needs additional logic to determine if they operate in the next cycle for adding adjacent PE products, and which address to store afterwards). [105] elminated such control but replace it with fixed routing between specific PEs and from certain PEs to additional accumulators. Compared with [105], we achieve higher operations per cycle per PE (effective throughput per PE = 1.83 OP/cycle/PE (Ours) vs 1.7 OP/cycle/PE). Essentially, we trade larger SRAM buffers (use of scratchpad in P.A.C. module) for simple routing/control and stall-free dataflow with efficiently overlapped communication and computation. Furthermore, our methodology ensures that no expensive hand-tuning of control or wiring targeting different FSC metadata is required, only minimum change of key parameters is required. 62 Figure 3.23: MCTS system performance on CPUs 3.3 Acceleration of Tree-based Policy in MCTS 3.3.1 MCTS Performance Analysis and Challenges The in-tree operations are key computations in Model-Free RL using MCTS, and are often memory bound [19]. The MCTS system throughput is described as the number of worker-Actions performed Per Second, or AP S. 
A worker-Action is composed of all the operations conducted by a single worker to derive an action and add a new node in the tree (detailed in Chapter 2, Section 2.1.2). In tree-parallel MCTS, during each iteration, the number of in-tree operations and the number application-specific simulations are fixed. Therefore, AP S is upper-bounded by: AP Supper bound = min(P Tsim, P Tin−tree) (3.17) where P Tsim is the peak throughput of simulations (number of simulations performed per second) by all the workers, and P Tin−tree is the peak throughput of the in-tree operations (number of SelectionExpansion-Backup performed per second) by all the workers. The simulation by all the workers are completely independent. Assuming there are p workers, P Tsim can be modeled as the total number of workers divided by the latency of single-worker simulation: P Tsim = p Tsim . On the other hand, in the in-tree operations, all the workers are serialized at the root node for Selection and Update. We denote the amortized 63 time interval between any two consecutive workers that access the root node as Itv. The peak throughput of the in-tree operations is thus bound by Itv: P Tin−tree ≤ p p×Itv = 1 Itv . This poses a constant upperbound that prevent the system throughput from linearly increasing as the number of workers p scale up. In Fig. 3.23, we show the performance analysis based on Equation 3.17 using a classical control benchmark on a 128-core CPU. The line plots show the peak performance bound and the blue area is the range of actual AP S achieved. Note that the AP S for a specific p spans a range as it depends on the specific execution model (shared tree and local tree methods for Tree-Parallel MCTS, details discussed in Chapter 2, Section 2.3 - Execution Models for Tree-Parallel MCTS). Additionally, a key perspective of efficiently balancing exploration-exploitation tradeoff in MCTS is the dynamic construction of its tree policy. The pattern of the tree growth is determined at runtime by the random simulations. While it is simple to perform dynamic tree management using runtime dynamic memory allocation on CPUs, this is a challenging task on FPGAs and ASICs. This is due to the FPGA bitstream’s nature of static memory assignation. A naive method of dynamically re-allocating memory blocks for the growing tree at runtime is through hardware reconfiguration, which causes unnecessary and large time overhead in the end-to-end MCTS execution. We are motivated to address this challenge by proposing the first dynamic MCTS accelerator design without the need for hardware reconfiguration, while alleviating the throughput bound posed by serialized in-tree operation as discussed above. 3.3.2 Accelerator Algorithm-Hardware Co-Optimizations Overview. Data Structure and Operations: The MCTS tree is maintained on-chip of the FPGA accelerator. In the MCTS tree data structure, each node is associated with an ID based on insertion order, its number of visits, and the average reward gained by visiting it. Each edge has a parent ID, a child ID and a weight 64 Figure 3.24: Accelerator Design: Overview (UCT value). Assuming there are p workers, the accelerator performs all their in-tree operations (treebased action selection, tree expansion and tree update as listed in Chapter 2, Section 2.1.2). 
Note that the application-specific environmental states are stored in the CPU memory rather than FPGA memory, and the rest of the Expansion phase including 1-Step simulation and environmental state management are also performed on the CPU instead of the FPGA (further discussed in Chapter 5). Accelerator Overview: The overview of the accelerator is depicted in Fig. 3.24. The key idea of the accelerator design is to exploit pipeline parallelism among the workers that propagate through multiple stages, each stage operating on a certain tree level stored in on-chip SRAM. Assuming the maximum tree height is D, a pipeline is allocated with D pipeline stages, each stage equipped with an Inserter for tree expansion, a Selector for tree-based action selection, and an Updater for tree update corresponding to operations on a specific tree level. Worker requests for the in-tree operations are streamed into the compute units (Inserter, Selector or Updater) from the PCIe Interface. Upon the completion of Selection and Node Insertion requests, The pipeline outputs requests for simulation back to the PCIe. The pipeline 65 require read/write accesses to Y on-chip SRAM banks that store the tree through a custom Stage-Bank Interconnection. Each DRAM bank stores multiple tree node entries. The content of a node entry s stored in the SRAM banks is listed in the following, assuming F is the fanout of the tree: • s.ID, number of visits N(s) • [uct(s, s′ 0 ), ..., uct(s, s′ F )] • [Bank Index(s ′ 0 ), ..., Bank Index(s ′ F )], [Node Index(s ′ 0 ), ...,Node Index(s ′ F )] In the above list, N(s) and uct are updated by the Selectors; Bank Index (s) denotes the ID of the SRAM bank that stores node entry s, and Node Index(s) denotes the address of node entry s in its bank. Bank Indices and Node Indices are only assigned once by Inserters in each MCTS agent step. The key objectives of the accelerator design are to (1) support dynamic run-time accesses to the tree nodes by the pipeline, and (2) ensure high performance by minimizing the Itv between workers, where Itv is composed of the propagation time in the Stage-Bank Interconnection and the Selector Compute time. We realize objective (1) through a custom interconnection. In Objective (2), we develop a novel bank assignment algorithm for tree construction to improve the throughput of in-tree operations by reducing the bank conflicts in the interconnection, and several hardware optimizations to improve the time performance of the Selector. Stage-Bank Interconnection. For completely run-time dynamic tree management, the shape and topological ordering of tree nodes are not known at compile time, and are different for every MCTS agent step. For this reason, it is not practical to determine a fixed access pattern from the pipeline stages to the SRAM banks storing the tree nodes at compile time. Therefore, to avoid expensive run-time re-configuration between consecutive MCTS agent steps, an all-to-all interconnection between the stages and SRAM banks is required to support arbitrary 66 run-time access patterns. An intuitive solution to meet this requirement is to build an all-to-all broadcast network that routes from all the stages to all the banks. However, this solution is not scalable to large or deep trees. Assuming the total number of SRAM banks used to store the tree is Y , the maximum height supported for the MCTS tree is D, the total area consumption of the all-to-all fully-connected network is O(DY ). 
This is impractical to implement on typical FPGA devices with limited on-chip resources. For example, such a design cannot be produced even for a tree height D = 32 when targeting a small MCTS tree stored in 256 SRAM banks. We propose an area-efficient solution that can be better scaled to MCTS tree with large heights, while maintaining the capability of all-to-all interconnection for fully dynamic accesses by the compute units. The design of our Stage-Bank Interconnection follows a butterfly network pattern, as shown in Fig. 3.25. Figure 3.25: Example Custom Butterfly-based Interconnection: D = 4, Y = 8 Algorithm 1 Stage-Bank Routing from stage i, i ∈ [0, D) Input: Binary representation of target SRAM bank index idx; Output: A sequence of request-routing actions to SWC{x, y} with the goal of accessing bank idx, where x ∈ [0, log D); y ∈ [0, D); 1: SWC{x, y} ← SWC{0, i} 2: for x ∈ [0, ..., log D − 1) do ▷ bit 0 (log D − 1) is the LSB (MSB) 3: if idx.bit[x]== 0 then ▷ Route Up 4: if 0 <= y%(2x+1) < 2 x then SWC{x, y} ← SWC{x + 1, y}; 5: else: SWC{x, y} ← SWC{x + 1, y − 2 x }; 6: else ▷ Route Down 7: if 2 x <= y%(2x+1) < 2 x+1 then SWC{x, y} ← SWC{x + 1, y}; 8: else: SWC{x, y} ← SWC{x + 1, y + 2x }; 67 The first part of the Stage-Bank Interconnection is a complete butterfly network with D input ports and D output ports that communicate the in-tree operation requests. The second part is a broadcast network from D switches to Y banks, which can be viewed as D independent 1−to−⌈ Y D ⌉ connection units, and they facilitate processing the in-tree operation requests by providing read/write accesses to the memory banks. Our design reduced the area requirement to O(D log D + Y ) while supporting the any-to-any fully connected access pattern which is essential for dynamic tree management. The routing algorithm is shown in Algorithm 1. Note that while our implementation of the Stage-Bank Interconnection can handle routing congestions, these congestions could negatively affect the throughput of in-tree operations. We further discuss optimizations to alleviate such performance degrade. Algorithm Optimization. With the basic design discussed above, we derive the Itv between consecutive workers performing in-tree operations as the interval between consecutive workers making Selection requests. This is the sum of interconnection propagation time and the Selector compute time (i.e., latency for F−way comparison and applying virtual loss, where F is the fanout of the MCTS tree). Note that for both Node Insertion and Update requests, the interval between workers is much smaller than that for the Selection requests. This is because neither the Node Insertion nor the Update have RAW dependency between accessing different tree levels, such that O(1) interval between workers can be easily achieved. On the other hand, the Selection request of a subsequent worker can only be processed after the completion of f−way comparison and virtual loss update by its previous worker, so the overall Itv is equivalent to the interval between consecutive workers’ Selection requests. 68 In the formulation of Itv, the interconnection propagation time can be further decomposed into singleworker latency and overhead from butterfly network congestions. 
We show the time complexity T(·) and the worst-case time complexity O(·) of these components of Itv in Equation 3.18:

Itv = T_{interconnection} + \underbrace{T_{selector}}_{O(F)}, \quad \text{where} \quad T_{interconnection} = \underbrace{T_{butterfly}}_{T(\log D)} + \underbrace{T_{congestions}}_{O(D)}     (3.18)

Based on the above analysis, we present our novel algorithm-hardware co-optimizations for reducing the time complexity of T_selector (the latency of the F-way comparison) and T_congestions (the latency overhead from congestion of multiple Selection requests on the same interconnection switch).

Dynamic Node Insertion Algorithm for Minimizing Interconnection Propagation Time: A potential bottleneck in the Itv time complexity is the time overhead of handling congestions by serializing the Selection requests at a certain bank or interconnection switch. The data layout of the tree (i.e., the bank assignment of each node entry during Node Insertion) plays a critical role in the number of congestions in the butterfly network while Selection requests are processed. In a basic scenario where the node entries are simply stacked into the banks one by one in their insertion order without any constraints, node entries belonging to different tree levels may be stored in the same bank. In this case, all D Selection requests in the Selection pipeline could collide on the same output switch of the butterfly network (although they are all independent requests by different workers), such that T_congestions = D cycles. To avoid such a scenario, we first put a constraint on the Node Insertion logic to ensure that nodes inserted on different tree levels cannot share the same bank (Constraint 1). While this avoids requests colliding on the same bank, it cannot avoid all switch collisions, since multiple banks are connected to the same output port of the butterfly network part of the Stage-Bank Interconnection. To minimize T_congestions for any given number of stages and SRAM banks, we propose a bank assignment algorithm for Node Insertion that minimizes the total number of switch congestions in Selection, based on the butterfly network properties, as shown in Algorithm 2. The intuition behind the proposed algorithm is as follows: we keep track of all established routing paths using a scratchpad. When a new routing path is constructed by inserting a node into a new bank, we use the scratchpad information to select the bank that minimizes the total number of potential congestions with the existing routing paths; a short illustrative sketch of this idea is given below, followed by the formal listing in Algorithm 2.
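The sketch below (Python, our own illustration; the function names route and pick_bank are hypothetical and not part of the accelerator code) expresses the same objective as Algorithm 2, but counts potential congestions by explicitly tracing butterfly paths with the routing rule of Algorithm 1 rather than using the bit-manipulation shortcut; D is assumed to be a power of two and the sketch iterates over all log D butterfly layers.

def route(src, dest_port, D):
    # Switches (layer, index) traversed from input port `src` to output port
    # `dest_port` of a D-port butterfly, following the rule of Algorithm 1.
    layers = D.bit_length() - 1              # log2(D); D is a power of two
    y, path = src, [(0, src)]
    for x in range(layers):
        half = 1 << x
        bit = (dest_port >> x) & 1
        if bit == 0:                          # bit x == 0: route "up"
            y = y if (y % (2 * half)) < half else y - half
        else:                                 # bit x == 1: route "down"
            y = y if (y % (2 * half)) >= half else y + half
        path.append((x + 1, y))
    return path

def pick_bank(src, stage_ports, empty_banks, bank_port, D):
    # Among the empty candidate banks, pick the one whose butterfly path from
    # `src` shares the fewest switches with the paths already established by
    # the other stages. stage_ports maps a stage to the output ports it uses;
    # bank_port maps a bank to the butterfly output port serving it.
    best, best_overlap = None, None
    for bank in empty_banks:
        cand = set(route(src, bank_port[bank], D))
        overlap = sum(len(cand & set(route(other, p, D)))
                      for other, ports in stage_ports.items() if other != src
                      for p in ports)
        if best_overlap is None or overlap < best_overlap:
            best, best_overlap = bank, overlap
    return best

# Example: D = 4 stages, Y = 8 banks, two banks behind each output port.
bank_port = {b: b // 2 for b in range(8)}
stage_ports = {3: {2}}                        # stage 3 already routes to port 2
print(pick_bank(2, stage_ports, [4, 5, 6, 7], bank_port, 4))   # -> 6

In this toy example, banks 4–5 sit behind output port 2, whose path from stage 2 would share two switches with stage 3's existing path, so a bank behind port 3 is chosen instead.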
Algorithm 2 Bank Assignment Algorithm for Node Insertion Input: Node Insertion stage (tree level) id src; Input: Scratchpad Memory BankInfo[bank id]={stage id, bool full}; ▷ BankInfo tracks the tree level (stage) of the nodes stored in a bank Input: Scratchpad Memory StageInfo[stage id]={list of destination port id}; ▷ StageInfo tracks all the destination ports that a stage routes to Output: The bank id destOP T that minimizes Tcongestions; 1: current_bank ← StageInfo[src][−1] ▷ Most recent accessed bank 2: if !BankInfo[current_bank].full then 3: destOP T ← current_bank; Update BankInfo[destOP T ].full 4: ▷ Inserting to existing bank under Constraint1 5: else 6: min_count ← ∞ 7: for BankCandidate in BankInfo with stage id==Null do 8: congestion_count ← 0 9: for x ∈ [0, ..., log D − 1) do ▷ bit 0 is the LSB 10: srcpotential_conf lict ← src.toggle[bit(x)] 11: if ∃Bankconf lict ∈ StageInfo[srcpotential_conf lict] 12: such that Bankconf lict.bit(log D − x) == 13: BankCandidate.bit(log D − x) then 14: congestion_count + + 15: ▷ Identified congestion at Switch Group x 16: if congestion_count <= min_count then 17: min_count ← congestion_count 18: destOP T ← BankCandidate return destOP T Our proposed bank assignment algorithm avoids exhaustively checking every pair of stage-bank connections. This is done by taking advantage of the recursive property of butterfly network structure and its routing algorithm 1: at every switch group (i.e. interconnection layer) x, the request from a given input port src can only collide with the request from another input port srcpotential_conf lict if bit x is the only different bit in the binary representations of both input ports (Algorithm 2 line 10), and the said collision can be checked by comparing bit log D−x of their destination output ports (this is same as bit log D−x of 70 the target bank index, Algorithm 2 line 11-13). We denote the number of input/output ports to the StageBank Interconnection as D (this is equivalent to the number of pipeline stages and the tree height), and the total number of nodes in the tree as X. Each node insertion into an existing bank takes 1 cycle (Algorithm 2 line 3). Each node insertion identifying a new bank assignment takes O(Y log D) cycles. Overall, for a complete tree construction in a MCTS agent step, the amortized time complexity of the bank assignment algorithm for each node is (1+Y )Y log D 2 × 1 X cycles. The scratchpad memory BankInfo, StageInfo are dynamically filled and used by the bank assignment logic. Their memory overhead are O(Y ) for BankInfo and O(DY /2) for StageInfo. In typical MCTS benchmarks, D ranges from 8 to 32 and X ranges from 500 to 10K. Based on these parameters, the amortized single Node Insertion latency is only 1.02 to 20 cycles. Overall, our bank assignment algorithm optimization trade for low Itv during Selection by introducing latency overhead during Node Insertion. Algorithm 2 benefits the system throughput by reducing Itv between workers. Although it has the tradeoff of increasing the Node Insertion latency, this latency can be hidden in the heterogeneous system using our parallel execution model (to be discussed in the MCTS system execution in Chapter 5). Hardware Optimization for Minimizing Selector Compute Time. Given a tree with Fanout F, each Selector can identify the best child node in F cycles using one comparator. To reduce Tselector and improve the selector performance scalability to large F, we use a hierarchical comparison-lookup design for low latency. 
Specifically, we define a comparison-lookup factor f, and recursively divide the F uct values into f groups until each group is of size ≤ f. Within each group, we obtain the maximum of f uct values in a single cycle using a comparison-lookup unit. The design is shown in Figure 3.26. Each comparison-lookup unit has C(f, 2) = f!/((f−2)!·2!) comparators. Each comparator outputs a 1-bit comparison result for a unique pair of uct values. The concatenation of these results (a C(f, 2)-bit number) is used to index a look-up table (of size 2^{C(f, 2)}) that outputs the best child index ŝ (the one with the maximum uct value). This design responds to any change in the F input uct values with a latency of ⌈log_f F⌉ cycles. It is instantiated in every Selector to concurrently process the Selection requests of different workers. f should be tuned for optimal performance within the FPGA resource constraint.

Figure 3.26: Example Comparison-LookUp Design: F = 9, f = 3

Other Optimizations.
Shift Register: After the Selection request accessing a certain tree level is completed at the output port of the Stage-Bank Interconnection, the workers propagate to the next pipeline stage by populating a shift-register array. The array collects the request processed at every tree level, and shifts by one tree level to serve as the input to the next round of Selection requests. Upon every shift-register operation of the array, one output Expansion-Simulation request is popped from the array and sent to the PCIe interface.
Memoization for BackUp: To eliminate the O(D) overhead of sequentially back-tracing from leaf to root in the BackUp of each worker, we allocate Y Updaters, one associated with every bank, and let the update operations on all tree levels proceed in a data-parallel manner. A BackUp Memoization Buffer of D − 1 words is associated with each worker to memorize the node entries to be updated in BackUp during Selection. Thus, for each worker, BackUp can be completed in 2 cycles.

3.3.3 Accelerator Evaluation

The objectives of our work are to support dynamic in-tree operations with a low interval (Itv) between workers, and to improve scalability compared to CPU-only systems. In this subsection, we evaluate how the proposed dynamic tree management affects the performance of in-tree operations.

Implementation Specifications.
Benchmark environments: We evaluate our framework on Atari games, a classic benchmark for evaluating reinforcement learning and planning algorithms [41]. We choose three benchmarks: Carnival and Pong, both with action space (i.e., tree fanout F) 6, and Alien with action space 18. For these games, both the Simulation and the 1-step simulation in Expansion use the OpenAI Gym library. We set the MCTS tree size limit (X) to 10K for all our experiments, consistent with state-of-the-art implementations.
Platforms: Our CPU baseline experiments are conducted on an AMD EPYC 7763 64-core processor server with 2 sockets (256 hardware threads in total) at 1.5 GHz. The CPU-FPGA platform consists of the same CPU and a Xilinx Alveo U200 board [144] connected by PCIe. In all the experiments, p denotes the number of workers; each worker uses a CPU Simulation thread. We use two CPU-only baseline implementations that follow the multi-threaded and single-thread tree traversal execution models, respectively. Both are implemented using the Python Multiprocessing class.
FPGA Implementation specifications: We develop the FPGA kernel template using High-Level Synthesis (HLS).
We follow the Vitis development flow [64] for bitstream generation. OpenCL [95] is used to implement the data transfer between the CPU and FPGA. The FPGA kernel code generator takes less than 3 seconds to generate the HLS code for any of our test cases. The resource utilization of our accelerator for the benchmarks is shown in Table 3.7. Note that the Carnival and Pong benchmarks use the same hardware configuration because they have the same tree fanout F and tree size X. The resource consumption bottleneck is in LUTs, since both the interconnection and the selector comparison-lookup modules compete for LUT resources. Specifically, the LUT consumption of the butterfly interconnection is ∝ D log(D), and the LUT consumption of the selectors also grows asymptotically with increasing F and D. The largest design that we can place on the target FPGA is the Alien benchmark with F = 18, D = 32, which uses up to ∼50% of the available LUTs.

Our design achieves a 250 MHz operating frequency. For D ≤ 4, the all-to-all interconnection is chosen by the DSE engine. As D increases, the butterfly-based interconnection is chosen. For both interconnection configurations, the critical path is in the switches of the Stage-Bank Interconnection. The critical wire length is expected to grow with increasing D. For the largest D we can place on the target FPGA board (D = 32), the design achieves single-cycle interconnection-layer propagation operating at 250 MHz. For larger D, inserting a register on the critical wire can help maintain a high operating frequency.

Table 3.7: FPGA Resource Consumption
Benchmarks | SRAM | DSP | FF | LUT
Carnival, Pong | 1.7∼1.88 MB (4.9%∼5.4%) | 289∼385 (4.9%∼6.5%) | 151∼204K (8.5%∼11.5%) | 121∼179K (13.9%∼20.6%)
Alien | 5.01∼5.43 MB (14.4%∼15.8%) | 273∼385 (4.7%∼6.5%) | 181∼401K (10.1%∼22.5%) | 162∼418K (19.2%∼48%)
Note: D = 8∼32 for all the benchmarks. The number of SRAM banks (Y) is set to 128 for all the test cases.

Performance of In-tree Operations.
As discussed in Sections 3.3.1 and 3.3.2, the serial time interval (Itv) between two workers sharing access to the tree determines the upper bound of the system throughput when scaling to a large number of workers. The lower the Itv, the higher the scalability achieved.
Effect of Dynamic Algorithm Optimization: In Table 3.8, comparing the rows labeled “Ours (Dynamic)" with the rows labeled “Ours (w/o Alg. 2)", it can be observed that our algorithm optimization reduces Itv to 1/5 of its baseline value. The tradeoff of this optimization is a longer time overhead for node insertion; however, this can be hidden in our parallel execution model (see the evaluation in Chapter 5).
Comparison with state-of-the-art: We compare the Itv of our design with the existing work [87], which developed a pipelined accelerator for in-tree operations with static memory allocation for a full tree. The key difference of this work compared with [87] is that our proposed design is capable of supporting dynamic construction of arbitrarily shaped trees at run time. We test our accelerator and the baseline accelerator ([87]) by feeding a synthetic sequence of in-tree operations generated over an entire episode of agent steps, with various tree shape constraints parameterized by the tree height limit D. As shown in Table 3.8, our design supports various shapes since we do not set compile-time constraints on the one-to-one correspondence between the topological location of the tree nodes and SRAM addresses.
The higher Itv on Alien compared with the other benchmarks is due to its larger tree fanout, which increases the comparison-lookup latency. Our design shows higher Itv on narrower trees (large D) compared with wider trees (small D), because the butterfly-based interconnection leads to Itv ≥ log D cycles. On the other hand, because [87] pre-allocates SRAM banks for a complete tree and constrains the nodes to their corresponding SRAM addresses, its maximum tree height is limited to log_F S, where S is the largest number of node entries that can be stored on-chip. This limit is as low as 8 for Carnival and Pong, and 4 for Alien. In summary, our design can support a larger variety of tree shapes with different tree depth limits, whereas [87] can only support small tree depths, as summarized in Table 3.8.

Table 3.8: Itv of Dynamic vs Static Tree Management
Carnival | D=8 | D=16 | D=32
Ours (Dynamic) | Itv=3.2 | Itv=3.5 | Itv=9.7
Ours (w/o Alg. 2) | Itv=8.1 | Itv=17.6 | Itv=36.4
[87] (Static) | Itv=2 | No Support | No Support
Pong | D=8 | D=16 | D=32
Ours (Dynamic) | Itv=3.3 | Itv=3.7 | Itv=9.4
Ours (w/o Alg. 2) | Itv=8.5 | Itv=16.9 | Itv=34.1
[87] (Static) | Itv=2 | No Support | No Support
Alien | D=4 | D=16 | D=32
Ours (Dynamic) | Itv=4.3 | Itv=4.7 | Itv=10.5
Ours (w/o Alg. 2) | Itv=7.3 | Itv=18.2 | Itv=37.8
[87] (Static) | Itv=3 | No Support | No Support
Note: Itv is the average number of cycles over all the iterations in an agent step. The lower the Itv, the better the scalability to a large number of workers. The rows labeled “w/o Alg. 2" describe a baseline design without the bank assignment algorithm for node insertion (a simple heuristic of inserting the node into the next empty bank is used instead).

Comparison with CPU baselines: Our framework using the FPGA incurs additional communication overhead through PCIe compared with the CPU-only baselines. Therefore, we measure the end-to-end latency of the in-tree operations including the PCIe data transfer time. The PCIe data transfer time is obtained using the Xilinx Run-Time (XRT) Profiler [146]. The CPU-only baselines include multi-threaded tree traversal and single-thread tree traversal. We first observe that on the CPU, the multi-threaded implementation does not significantly outperform the single-thread implementation. This is because threads must be serialized at the root level, where the root-child nodes must be protected by a mutex to ensure only one thread can access them at a time. The sequential time interval between a pair of consecutive threads accessing the shared root children is dominated by the high-latency access to the CPU shared memory (DDR), which cannot be reduced by increasing p. On the other hand, the single-thread implementation only allows a master thread to access the tree, so the tree can be accessed with lower latency in its local memory (cache), at the cost of serializing the in-tree operations of all the workers. For all the benchmarks, the FPGA accelerator leads to lower latency than the CPU, and consistently shows higher speedup compared with the CPU at larger numbers of workers. The PCIe overhead increases very little as p increases, because the PCIe data transfer time is negligible compared with the fixed PCIe initiation latency (∼0.04 ms).

Figure 3.27: Latency of in-tree operations. Y-axes are in log scale for better visualization.

Chapter 4
Acceleration System for Model-Free Deep Reinforcement Learning using Heterogeneous Platforms

4.1 Motivation

DRL training is highly time-consuming.
Due to the distinct compute kernels in DRL that may not be efficiently optimized using a homogeneous architecture (such as multi-core CPUs), there has been a growing trend in using heterogeneous architectures to accelerate DRL algorithms [77, 24, 89]. However, even with access to heterogeneous resources, DRL application developers still face several challenges: (a) Sub-optimal performance: DRL’s distinct components require careful placement and scheduling onto heterogeneous devices based on both computational and hardware characteristics. Sub-optimal placement and scheduling can lead to under-utilization of heterogeneous resources, resulting in missed opportunities for performance improvement. (b) Lack of portability across platforms: The optimal DRL primitive-to-hardware assignments can change based on varying algorithms and platforms. Consistently achieving high performance implementations requires portable solutions that can map and distribute DRL onto various devices, but existing works lack such flexibility. (c) Low development productivity: The growing diversity of heterogeneous resources in data centers [51, 148, 138] has increased the need for hardware optimizations and 77 bridging between different programming models. This significantly increases the required learning effort and programming time for application developers. In this work, we address the above challenges by proposing PEARL, a framework that enhances the performance, productivity, and portability [103] of DRL system development on heterogeneous platforms. PEARL provides DRL application developers with tools and familiar interfaces for running DRL using heterogeneous platforms, while abstracting away the low-level hardware intricacies. Our framework is composed of several layers facilitating DRL implementations from user application specifications to deployment on physical hardware. The inputs to our framework are the algorithm configuration, application simulator specification, and hardware specifications. The output is an automatically optimized DRL system implementation. As shown in Figure 4.1, our framework implements a Host Runtime Coordinator (Section 4.2), a System Composer (Section 4.3), and a Parameterized Library of Accelerated Primitives (Section 4.4). These components create additional system, runtime, and programming layers below the Python & Torch interfaces. These components are the main novelties that distinguish PEARL from existing RL libraries, where we apply our unique approaches to runtime task scheduling, design exploration for effectively utilizing heterogeneous resources through task graph analyses, and fine-grained acceleration of individual primitives. Figure 4.1: Framework Overview 78 4.2 Runtime System & Training Protocol 4.2.1 System Design Figure 4.2: Runtime System The implementation generated by PEARL is based on a parallel DRL system managed by a Host Runtime Thread. Figure 4.2 shows the setup of such a system. Multiple Actor threads generate new data points (experiences) and periodically synchronize weights from the Learner. They send the experiences to the Host Runtime Thread through Data Collection Queues (DCQs). The Host Runtime Thread interacts with the RM through an RM Request Queue (RRQ), where the host initiates sampling (or update) requests and receives outputs of sampled indices (or updated priorities). Parallel Learner sub-modules can be implemented using one or multiple accelerators, and they are initiated by the Learner Assignment Queues (LAQ). 
In a configuration where the Learner and RM are individually mapped onto distinct, not directly connected devices (as depicted by the default mode in Figure 4.2), the Host Runtime Thread interacts with the Learner through a Leaner Assignment Queue (LAQ) that sends experiences and initiates training. The Learner acknowledges the completion of a training iteration using the Learner Done Queue (LDQ), and the runtime program synchronizes parameters before the next iteration. We also introduce a communicationreduction mode optimized for the cases when the RM and Learner are assigned the same device or devices with direct interconnections such as NVLink [99] (i.e., comm-reduction mode in Figure 4.2), where the LAQ and LDQ are directly connected to the modules without communicating through the host. 79 Figure 4.3: DRL Heterogeneous Training Protocol 4.2.2 DRL Heterogeneous Training Protocol To perform training on a given heterogeneous system, we propose a general DRL heterogeneous training protocol (Figure 4.3). The training protocol can be ported to various heterogeneous devices since the interactions among processors and accelerators are defined at the application layer (i.e., DRL logical components), and are not bound to a specific type of accelerator. We show the essential data exchange and handshake signals between modular components as 1 - 8 in Figure 4.3. We provide a runtime code template that manages the thread pools, the accelerators, and allows the “plug and play" of heterogeneous devices for DRL primitives. It is a Python program executed on the Host Runtime Thread, which utilizes a loop whose iterations follow this protocol. Our protocol features a novel scheduling optimization (Replay-Collision-Free Scheduling) to encourage concurrency while maintaining algorithm correctness. We adopt a strategy of deferring the immediate insertion of experiences into the Replay Buffer when experiences are received from Actor threads. We maintain a data collection buffer to cache experiences generated by the Actors, and only insert them when the buffer is full. Upon experience insertion, we schedule the batched insertion operations after the sampling process concludes. This optimization has two advantages: Firstly, this approach permits us to compare the insertion index against the sampled indices, hence effectively mitigating the potential contamination of data when the Learner and Actors concurrently modify the same indices of the Replay 80 memory. We refer to this procedure as “collision-free data collection", which is shown in Figure 4.3. Secondly, by sequencing data insertion after the sampling phase, we align its execution concurrently with the training process. This hides the time overheads of the priority retrieval and the update operations initiated by experience insertion in the training pipeline. 4.2.3 Runtime System Optimizations PEARL’s runtime system design enables the Actor Threads, the RM Module, and the Learner Module to perform their computations in parallel by letting them continuously read from (write into) their input (output) queues. The runtime program handles the necessary data dependencies as it processes messages between the queues. It integrates a few optimizations that ensure the effective utilization of heterogeneous resources and hide communication overheads. 
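As a minimal sketch of the host-side loop implied by this protocol (our own simplification; the queue objects and the names host_runtime_loop and replay_insert are hypothetical and do not reproduce PEARL's actual runtime code), the deferred, collision-free insertion can be expressed as follows: Actor experiences are cached, and the batched insertion is issued only after the sampled indices of the current iteration are known, so insertions never touch indices that are being trained on.

def host_runtime_loop(dcq, rrq_out, rrq_in, laq, ldq, replay_insert, sync_weights,
                      batch_size=32, cache_size=128, capacity=2**20):
    """Single-thread sketch of one protocol iteration: sample -> train -> update,
    with collision-free, deferred insertion of Actor experiences.
    dcq: Data Collection Queue (from Actor threads); rrq_out/rrq_in: requests to /
    responses from the Replay Manager; laq/ldq: Learner Assignment / Done Queues."""
    cache, insert_ptr = [], 0
    while True:
        while not dcq.empty():                  # drain Actor experiences, no insert yet
            cache.append(dcq.get())
        rrq_out.put(("sample", batch_size))     # request a prioritized sample
        sampled_idx, batch = rrq_in.get()
        laq.put(batch)                          # launch Learner training on the batch
        if len(cache) >= cache_size:            # batched, collision-free insertion
            slots = [(insert_ptr + j) % capacity for j in range(len(cache))]
            if not set(slots) & set(sampled_idx):
                replay_insert(cache, slots)     # overlapped with Learner compute
                insert_ptr = (insert_ptr + len(cache)) % capacity
                cache = []
            # otherwise defer the flush to a later iteration
        td_errors = ldq.get()                   # Learner done; new priorities
        rrq_out.put(("update", sampled_idx, td_errors))
        sync_weights()                          # Actors pick up the new policy weights

Because the insertion is scheduled after sampling, the replay_insert call and the priority update overlap with the Learner's training time, which is the overlap described above.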
4.2.3.1 Task Parallelism and Data Parallelism over Heterogeneous Hardware The Actor threads used in the experience collection loop are strictly separate from the threads used to host accelerators in the training loop. This ensures concurrent execution of the processes that generate and consume training data. The training loop involving Replay Management and Learning is usually the time bottleneck in such a concurrent system. Thus, we further apply fine-grained parallelization for them using heterogeneous accelerators. Our runtime system supports both task parallelism and data parallelism in the Learner implementation. Specifically, in task parallelism, we enable dividing the Learner function into segments (based on different different groups of layers in DNNs, as later discussed in Section 4.3.1), and each segment is processed on a separate accelerator. We allocate a separate CPU thread for each accelerator that is assigned a segment of the Learner (or the RM) to host the accelerated computations. We leverage the Torch Distributed Remote Procedure Call (RPC) framework [28] to launch these threads. RPC links together local 81 gradient computation engines on all the CPU and GPU devices involved in the forward pass of all DNNs in the Learner, and automatically reaches out to them during the backward pass to compute gradients. In data parallelism, we copy the complete Learner function and deploy it on a set of accelerators, where each accelerator computes partial gradients using a subset of the batched experiences. We allocate one CPU thread as the "parameter server" (denoted pc) to collect and aggregate these partial gradients to update the model parameters for the Host Runtime Thread (which further syncs with the Actor threads). The CPU thread pc then broadcasts the parameters back to the accelerators for next-iteration training. We leverage the Torch Distributed Data Parallel (DDP) framework [74] to manage the Learner replicas on GPUs. For integrating task parallelism and data parallelism with FPGA devices, we develop Python wrappers using PyBind to handle data type conversion when interfacing the partial gradient generated by FPGA with the CPU thread pc. Task parallelism, given optimal partitioning and segmentation of Learner tasks, typically involves less communication overheads than data parallelism due to alleviating the need to aggregate weight gradients over devices. Task parallelism works well for Learners with large-sized DNNs or a multitude of DNNs. On the other hand, data parallelism is advantageous for tightly interconnected accelerators with low communication overheads in synchronizing weight gradients, given that the complete Learner can fit in the memory of a single accelerator device. By supporting both these parallelization strategies, our runtime system facilitates portable performance across different DRL algorithms and different heterogeneous platforms. This is because the specific system implementation (task-parallel or data-parallel) can be adapted to the specific characteristics of the algorithm and hardware platform. Note that while the data parallel and task parallel implementations of the Learner are supported across all the devices - CPU, GPU, and FPGA, the specific parallel method and device mapping of tasks within the Learner are only determined at the System Composition stage. Our System Composition mechanism for choosing parallelization and mapping approaches is further discussed in Section 4.3.1. 
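For the data-parallel path, a minimal CPU-backend sketch using Torch DDP is given below (our own illustration, not PEARL's code; ddp_learner and make_policy are hypothetical names, the gloo backend and loss are placeholder choices, and DDP's built-in gradient all-reduce stands in for the parameter-server aggregation and the FPGA PyBind integration described above).

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_learner(rank, world_size, make_policy, sub_batches):
    # One process per accelerator: each rank computes gradients on its own
    # sub-batch, and DDP all-reduces them so every replica applies the same update.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(make_policy())                       # wrap the policy network
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for states, targets in sub_batches:              # this rank's share of the batch
        loss = torch.nn.functional.mse_loss(model(states), targets)
        opt.zero_grad()
        loss.backward()                              # gradients are all-reduced here
        opt.step()
    dist.destroy_process_group()

Such a function would typically be launched with torch.multiprocessing.spawn(ddp_learner, args=(world_size, make_policy, sub_batches), nprocs=world_size), with each rank indexing its own shard of the batch; the load-balanced sub-batch sizing discussed in Section 4.3.1 is omitted here.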
82 4.2.3.2 Communication Overhead Reduction Our scheduling allows concurrent execution of the Actor threads (data collection) and the sampling → policy training (Learner) → experience update (RM) process. We also overlap Learner computation with replay operations. This is achieved by host-device (or on-chip) streaming communication queues between the RM and the Learner, so that training using each data point starts asynchronously as soon as the Learner receives them (rather than waiting for the full batched sampling). Similarly, experience updates are overlapped with the Learner. Additionally, we use double buffering to alleviate the weight transfer overheads between the processor and Learner accelerator. Two buffers with sizes of the complete policy weights is allocated in the host memory (shared by Actors threads and runtime thread). In each iteration i, the CPU threads read from buffer i%2 while the Learner writes into buffer 1 − i%2. Using double buffering helps hide weight communication time compared to using a single buffer. With a single buffer, the CPU needs to wait for the Learner to complete training tasks and finish writing new weights before proceeding, causing idle time in the Actor threads pool. In contrast, double buffering allows concurrent operations. While the Learner writes the updated weights into one buffer, the CPU can simultaneously read the weights from the other buffer. This overlap of reading and writing operations reduces idle time. As a result, the weight transfer overhead is alleviated, leading to more efficient utilization of heterogeneous resources. 4.3 System Composer Given the user-specified Replay Manager (RM) and Learner metadata as inputs, the main goal of the system composer is to (A) determine an optimal primitive(operation)-to-device assignment that maximizes heterogeneous system performance. To realize (A), we also need to realize the goal (B) determine the best-performing accelerator configuration within each device for all the primitive operations, and their associated latency costs. 83 To achieve (A), we leverage a Heterogeneous System Composition Algorithm (Section 4.3.1) based on the arbitrary task dependencies among one or multiple DNNs training in the DRL Learner function as well as the device specifications. The inputs to this algorithm are summarized in Figure 4.4. We formulate a DRL training iteration as a Task Dependency Graph (Figure 4.4-III). Each node represents an operation in RM or the computations of a group of layer propagations in the Learner. Each node is associated with the latency costs of computing the RM operation or layer propagations; all the latency costs are stored in a Heterogeneous Compute Latency Table (Figure 4.4-I). For example, the entry node to a DQN Learner (node ID 2 in Figure 4.4-I) represents all the forward layer propagations in the policy DNN, and its latencies on all candidate devices are stored in the corresponding row of the Heterogeneous Compute Latency Table. This Table is filled by completing (B), which is further described in Section 4.3.2. Each edge represents the dependency between operations or layer propagations. Each edge of the Task Dependency Graph records the communication costs (in bytes) required among RM operations and/or layer propagations. The communication costs can be determined at compile time-based on RM and DNN metadata. 
For computing the actual latency along an edge, we also sample from a Heterogeneous Interconnection Bandwidth Table (Figure 4.4-II), which stores the bandwidth between any pair of devices. Figure 4.4: Inputs to the System Composition Algorithm: I. a Task Dependency Graph; II. a Heterogeneous Compute Latency Table; and III. a Heterogeneous Interconnection Bandwidth Table 84 Algorithm 3 Heterogeneous System Composition Algorithm. w denotes node computation latency in an iteration. c denotes communication latency. 1: Input: Heterogeneous Compute Latency Table T, Heterogeneous Interconnection Bandwidth Table I, DRL Task Dependency Graph G(V, E) 2: # Step 1: Compute Primitive Prioritization 3: Initialize avail(a) = 0 for all candidate devices a 4: Create a task array Q capable of storing all nodes in V ; 5: for v in V do 6: rank(v) = maxu∈successor(v)(ccvu + rank(u)) + wcv 7: Store nodes (tasks) in decreasing order of their ranks in Q 8: # Step 2: Compute Primitive Placement 9: while not all tasks in Q are assigned do 10: Pick the first unassigned node (task) i from the beginning of Q 11: Compute Earliest Finish Time (EF T) of task i on each accelerator a, where EF T(i, a) = wi,a+ max{avail(a), maxj∈preced(i)( AF T(j) + cj,i) }; AF T(j) + cj,i is the total time for completing the computation and communication of a full batch of data 12: 13: 14: Assign task i on accelerator that minimizes EF T: Device[i] ← argmina′EF T(i, a′ ) 15: Assign AF T(i) ← EF T(i); 16: avail(Device[i]) ← AF T(i); 17: if |Device[Learner task nodes]| == 1 or speedup compared to using single device ≤ 1.5 then 18: Evaluate latency under data-parallelism exploration mode ← TDP 19: Update Device[Learner task nodes] if TDP < AF T of the exit task node 20: Output Device[Learner task nodes], Device[RM] 21: # Step 3: Memory Component Placement 22: Initialize Device[Replay Memory]; min_traffic← ∞ 23: CLearner entry nodes ← BS × (E + 1); CActor ← NActor × E; CRM ← BS 24: for i in [Learner entry nodes, Actors, RM] do 25: Total data traffic = 26: Pi ′∈{Learner entry nodes, Actors, RM} Ci ′ bandwidth(Device[i],Device[i ′ ]) 27: if Total data traffic < min_traffic then 28: min_traffic ← Total data traffic; Device[Replay Memory] ← Device[i]; 29: Output Device[Replay Memory] 85 4.3.1 Heterogeneous System Composition Based on the inputs shown in Figure 4.4 (Heterogeneous Compute Latency Table, Heterogeneous Interconnection Bandwidth Table and DRL Task Dependency Graph), the problem of minimizing the DRL iteration latency is essentially a task scheduling problem of a weighted directed acyclic graph (DAG) to a set of heterogeneous processors to minimize the makespan (i.e., the end-to-end total execution time from the entry node to the exit nodes). It is an NP-Complete problem, but efficient heuristics have been proposed to output the desired scheduling in polynomial time [136, 2]. We develop a Heterogeneous System Composition Algorithm (Algorithm 3) adapting from the Heterogeneous Earliest Finish Time heuristic proposed in [136]. Our system composition algorithm first determines the best device assignment of the primitive operations to maximize achievable compute throughput (i.e., minimize DRL iteration latency), then places the Replay Memory to minimize the total data traffic. In Step 1 (lines 3-5, Algorithm 3), we set the priority of each task (node) with a rank value, which is based on the mean computation latencies w over all the candidate devices and mean communication latencies c over all the device-interconnection links. 
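A condensed Python sketch of Steps 1 and 2 of Algorithm 3 is given below (our own simplification; the function name compose is hypothetical, and the data-parallelism exploration mode, the RM hypernode constraint, and the memory placement of Step 3 are omitted). It computes the upward ranks from mean compute and communication costs, then greedily places each task on the device with the earliest finish time.

def compose(tasks, succ, w, c, devices, bw):
    """tasks: topologically ordered node ids; succ[v]: successor ids;
    w[v][d]: compute latency of task v on device d; c[(u, v)]: bytes on edge u->v;
    bw[(d1, d2)]: bandwidth between devices (assume a very large value when d1 == d2)."""
    # Step 1: upward rank = mean compute + mean communication on the critical path
    rank = {}
    mean_bw = sum(bw.values()) / len(bw)
    for v in reversed(tasks):                       # successors are ranked first
        mean_w = sum(w[v][d] for d in devices) / len(devices)
        rank[v] = mean_w + max((c[(v, u)] / mean_bw + rank[u] for u in succ[v]),
                               default=0.0)
    order = sorted(tasks, key=lambda v: -rank[v])   # highest rank scheduled first

    # Step 2: place each task on the device with the Earliest Finish Time (EFT)
    avail = {d: 0.0 for d in devices}               # earliest free time per device
    aft, place = {}, {}
    pred = {v: [u for u in tasks if v in succ[u]] for v in tasks}
    for v in order:
        best_d, best_eft = None, float("inf")
        for d in devices:
            ready = max((aft[u] + c[(u, v)] / bw[(place[u], d)] for u in pred[v]),
                        default=0.0)
            eft = w[v][d] + max(avail[d], ready)
            if eft < best_eft:
                best_d, best_eft = d, eft
        place[v], aft[v] = best_d, best_eft
        avail[best_d] = best_eft
    return place, aft

# Toy example: a 3-node chain (RM sample -> train -> RM update) on {"cpu", "gpu"}.
tasks = ["sample", "train", "update"]
succ = {"sample": ["train"], "train": ["update"], "update": []}
w = {"sample": {"cpu": 1.0, "gpu": 0.6}, "train": {"cpu": 9.0, "gpu": 1.5},
     "update": {"cpu": 1.0, "gpu": 0.7}}
c = {("sample", "train"): 4e6, ("train", "update"): 1e5}
bw = {("cpu", "cpu"): 1e12, ("gpu", "gpu"): 1e12, ("cpu", "gpu"): 1e10, ("gpu", "cpu"): 1e10}
print(compose(tasks, succ, w, c, {"cpu", "gpu"}, bw))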
Basically, rank(v) is the length of the critical path from task v to the exit task, estimating the time left to finish computing. Then, we generate a task array by sorting the tasks by decreasing order of their ranks. This helps preserve the primitive-operation precedence constraints, and ensures that higher-ranked nodes are processed first to minimize critical-path execution time. The complexity of Step 1 is O(NP O × log NP O), where NP O is the number of nodes (primitive operations) in the Task Dependency Graph. In Step 2 (lines 8-13, Algorithm 3), we assign the primitive operations (task nodes) one by one based on the heterogeneous earliest finish time heuristic. The Earliest Finish Time (EF T) of any task on a device a is based on the device’s earliest available time avail(a) and the Actual Finish Time (AF T) of all its preceding tasks and all necessary input data are communicated to a, as shown in line 11-12 of Algorithm 3. The available time slots associated with each device are updated after the task is placed on the device, yielding 86 minimum EF T according to AF T. This ensures that in future iterations, lower-ranked tasks can select devices and appropriate available time slots based on the newest status of all devices. Note that our Heterogeneous System Composition Algorithm may not occupy all the candidate devices, as there exist cases when using a single (or a subset of) devices delivers the best performance, especially with smaller models and batch sizes where the additional communication costs override the parallel beneficiary of using multiple devices to distribute a single training pipeline. Therefore, we additionally provide an optional data-parallelism exploration mode (lines 15-17, Algorithm 3). This mode is only used when the functional parallelism within the Learner is limited (e.g., a linear computation graph with only one DNN policy). In such cases, the while loop in Algorithm 3 (lines 9-14) assigns all the Learner layer propagations to a single accelerator while distributing data over multiple devices may still lead to higher speedup. In the data-parallelism exploration mode, assume |A| devices are available for mapping the Learner, we partition a batch of experiences into |A| groups, each group to be mapped to a specific device a, such that the total computational workloads in each group divided by the peak performance of the devices are approximately the same for all groups. This helps ensure load balancing in data-parallel processing of all sub-batches. Then, we profile the single-iteration gradient update latency, compare it to using a single device as returned by the while loop, and output the method that yields the minimum latency. Note that this mode is advantageous for large batched training with linear training task graph in the Learner function. For smaller batched training (e.g., classical control and robotics [33]) where all the intermediate gradients and Learner DNN weights can easily fit on a single accelerator device without saturating the hardware resources, the data-parallel mode is less likely to lead to speedup since it introduces additional communications without significantly increasing computation efficiency. We put a constraint that forces the two task nodes, RM sample and RM update, to always be placed on the same device. This is because RM operations are memory-bound and not suitable for distributing over two devices. 
If they are placed on different devices, the synchronization of the sum tree contents between 87 two devices outweighs the acceleration from the RM sample and update operations. Since RM sample is always at the beginning of an iteration and RM update is always at the end of an iteration, we group them into a hypothetical hypernode. The hypernode is the combination of the RM update from iteration i and RM sample from iteration i + 1. Then, Step 2 of Algorithm 3 picks the device that best accelerates both RM operations. In Step 3 (lines 22-26, Algorithm 3), we decide on the device assignment of the Replay Memory. The data traffic with respect to the Replay Memory during each iteration includes BS words of sampling indices from the DLearner, BS × E sampled experiences to the DLearner (where E is the size of each experience for the given benchmark), and NActor × E inserted experiences from the Actors. These communication costs are denoted as C in Algorithm 3. We place Replay Memory on the device that minimizes the total data traffic based on available bandwidths between devices (e.g., PCIe) and within each device (e.g., DDR). The complexity of Step 3 is O(NP O). 4.3.2 Accelerator Setup and Performance Estimation To realize goal (B) and provide accurate latency estimations for the primitives in the Heterogeneous Compute Latency Table, we customize the parameterized accelerators described in Section 4.4 to suit the userinput RM and Learner specifications. Based on the customized accelerators, we obtain the expected latency of executing operations (tasks) in each primitive in one DRL iteration on each of the available devices, and store these latency numbers in a Heterogeneous Compute Latency Table for further analysis of system performance in goal (B). The Heterogeneous Compute Latency Table is an NP O × ND table. The ND columns stand for the choices of ND available devices; the NP O rows correspond to the primitive operations (each correspond to a certain node in the DRL Task Dependency Graph) to be assigned to one of the devices, and each entry on [row x, column y] denotes the latency of performing one iteration of the 88 primitive operation (node) x on device y. The table entries are populated based on accelerator setups as follows: Primitive Setup on a CPU/GPU : For the primitives that can be mapped to the CPU at compile time, i.e., RM and Actors, we allocate their number of threads initially based on the ratio of their single-iteration latency for processing/producing one experience in order to match their throughput. For the RM on a GPU (described in Section 4.4.1.1), the degree of parallelism is set to BS. The sum tree is stored in the GPU global memory. For the Learner on a GPU (described in Section 4.4.2.1) , we search for the best-performing number of streams in the range [1, BS] by recording their per-SGD-step latencies. Note that if multiple GPUs can be used to parallelize the Learner, we fix the same sub-batch size in a stream for all devices based on the search result on one GPU. Accelerator Configuration on an FPGA: The RM can be mapped to the FPGA device only if the total buffer size required by the sum tree is smaller than the total amount of SRAM resources. The training pipeline for a certain DNN in the Learner can be mapped to the FPGA device only if the buffer size required to store intermediate tensors can fit on-chip. This is to avoid pipeline stage idling in waiting for accesses to offchip memory. 
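To make these feasibility constraints concrete, the fit checks can be sketched as follows (a simplified illustration only; the byte-size models are placeholders, and the actual estimates come from the accelerator designs described next):

def sum_tree_bytes(capacity, fanout, bytes_per_priority=4):
    # Total nodes of a complete sum tree with `capacity` leaves and the given fanout.
    nodes, level = 0, capacity
    while level > 1:
        nodes += level
        level = -(-level // fanout)          # ceiling division to the next tree level
    return (nodes + 1) * bytes_per_priority  # +1 for the root

def learner_intermediate_bytes(layer_widths, batch_size, bytes_per_value=4):
    # Rough estimate: activations and their gradients buffered between pipeline stages.
    return sum(2 * batch_size * w * bytes_per_value for w in layer_widths)

def fpga_candidate(required_bytes, sram_bytes):
    # A primitive is only a candidate for the FPGA if its working set fits in on-chip SRAM,
    # so that no pipeline stage stalls on off-chip memory accesses.
    return required_bytes <= sram_bytes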
The specific resource consumption and cycle latency estimations are based on the accelerator designs described in Sections 4.4.1.2 and 4.4.2.2. For the RM, the number of pipeline stages is configured to match the tree depth, and the buffer sizes are configured based on their corresponding tree levels. For the Learner, the amount of compute resources allocated to each pipeline stage, UF (see Section 4.4.2.2), is tuned such that all pipeline stages are load balanced (for the maximal effective hardware utilization):

T_{stage} = \frac{\#MAC_{FP\,L_1}}{UF_{FP\,L_1}} = \cdots = \frac{\#MAC_{GA\,L_N}}{UF_{GA\,L_N}}, \quad \text{where} \quad \sum_{1 \ldots N} UF(DSPs),\ UF(SRAM),\ UF(LUT) \leq \#DSPs,\ \#SRAM,\ \#\text{Logic Resources}. \quad (4.1)

We obtain the latency of the accelerators on FPGA through performance modeling:

T^{sampling}_{RM} = 2 \times F \times (BS + D) \quad (4.2)
T^{update\ or\ insert}_{RM} = 2 \times (BS + D) \quad (4.3)
T_{Learner} = T_{stage} \times (BS + 3 \times (\#layers - 1)) \quad (4.4)

In Equations 4.2-4.4, the pipeline latencies are calculated by multiplying the single pipeline stage latency by the sum of the batch size BS and the pipeline fill/drain overhead D (D equals the sum tree depth in the RM and the number of layer propagations in the Learner, respectively). Note that for the Learner, either only a portion or all of the layer propagations could be mapped to the FPGA. To avoid the need to adjust UFs for different stages during Step 2 of Algorithm 3, we derive the UFs (and, accordingly, the latencies based on these UFs that fill the Heterogeneous Compute Latency Table) assuming all the layer propagations in the entire Learner need to be placed and routed. If the Learner cannot fit on the FPGA device, we fix the UFs to be the smallest values such that all layer propagations are load-balanced, and check the resource limit at line 14 of Algorithm 3 whenever we attempt to assign a layer propagation to the FPGA. If the total resource consumption exceeds the limit, we remove the FPGA as an available device for future task nodes to consider.

4.4 Parameterized Library of Primitives

4.4.1 Replay Manager (RM)

The RM performs three replay operations on a sum tree, where leaf nodes store the priorities of all experiences, and a parent node stores the sum of the priorities of its children: (1) Priority sampling: based on Equation 2.1 (Chapter 2, Section 2.1.1), sampled indices are obtained by traversing the tree performing a prefix sum from root to leaf. The computations are explained in [153]. (2) Priority retrieval: given the indices of the experiences, it outputs the priorities stored at the corresponding leaf nodes. (3) Priority update: the inputs are the indices of the experiences and the changes ∆ to their priorities; it applies the changes ∆ to the priorities (and sums of priorities) stored in parent nodes at all tree levels. Note that insertion of priorities is realized with priority retrievals followed by priority updates.

Figure 4.5: FPGA - Replay Manager Hardware Module

4.4.1.1 RM on CPU and GPU

The computations in replay operations can be viewed as a sequence of operations traversing all levels of the sum tree from the root to a leaf. Our RM implementations on CPU and GPU are parameterized with the tree depth, fanout, BS, and W, where BS is the batch size of the replay operation requests, and W is the number of workers (degree of parallelism) allocated. Each worker is responsible for sampling or updating BS/W priorities. All workers share concurrent accesses to the sum tree. We use a mutex to ensure the correctness of parallel priority updates that potentially collide on the same node.
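A minimal sketch of this parameterized CPU implementation is shown below (illustrative only, assuming a binary fanout; the class and method names are not PEARL's actual interface):

import threading

class SumTreeRM:
    """Minimal sketch of the parameterized CPU Replay Manager.
    Leaves live at indices [capacity, 2*capacity); node 1 is the root."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)   # parent nodes hold the sum of their children
        self.lock = threading.Lock()

    def sample_one(self, prefix_sum):
        # Priority sampling: walk root -> leaf, descending by prefix sum.
        i = 1
        while i < self.capacity:
            left = 2 * i
            if prefix_sum <= self.tree[left]:
                i = left
            else:
                prefix_sum -= self.tree[left]
                i = left + 1
        return i - self.capacity             # index of the sampled experience

    def update_one(self, idx, priority):
        # Priority update: apply the change at the leaf and propagate partial sums upward.
        i = idx + self.capacity
        with self.lock:                      # workers updating colliding paths must not race
            delta = priority - self.tree[i]
            while i >= 1:
                self.tree[i] += delta
                i //= 2

    def run_batched(self, op, requests, num_workers):
        # Each of the W workers handles BS/W requests; results keep their original order.
        results = [None] * len(requests)
        def work(w):
            for j in range(w, len(requests), num_workers):
                results[j] = op(*requests[j])
        threads = [threading.Thread(target=work, args=(w,)) for w in range(num_workers)]
        for t in threads: t.start()
        for t in threads: t.join()
        return results

For example, a batched priority update with W = 8 workers would be issued as rm.run_batched(rm.update_one, list(zip(indices, priorities)), num_workers=8), and batched sampling proceeds analogously with one prefix sum per request.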
4.4.1.2 RM on FPGA We develop an accelerator template (parameterized with the tree depth and fanout) that can be reconfigured to support a range of fanout and tree sizes. We adopt a design of multiple pipeline stages processing a stream of operation requests as shown in Figure 4.5. Each pipeline stage is a hardware module responsible for operating on a certain tree level and exclusively stores all the nodes on that level. Different replay operation requests in a batch are concurrently processed by different pipeline stages. The request fed into 91 the accelerator has a unified operation code as shown in the top of Figure 4.5. The requests are decoded at each pipeline stage, and the corresponding operations are executed in an online manner. We apply the memoization technique in the updaters by using a dedicated register to store the sampled indices at each tree level so that the replay update does not need to backtrace through the tree, re-computing these indices. 4.4.2 Learner The Learner takes a batch of experiences, and performs SGD constituting forward propagation (FP), loss function (LOSS), backward propagation (BP), and gradient aggregation (GA). The Learner may involve one DNN or collaborative training with multiple DNNs [92, 79, 42]. 4.4.2.1 Learner on CPU and GPU We use PyTorch [102, 52] to implement DNN training on CPUs and GPUs. On the GPU, PyTorch utilizes CuDNN [102] or Xe Matrix Extensions [52] backend to exploit SIMD parallelism. We support using multiple streams to process the FP, LOSS, BP, and GA in a pipelined manner. The GPU-based Learner code is parameterized to specify the number of streams. To enable task parallelism across GPUs, we utilize the Torch Distributed Remote Procedure Call (RPC) framework [28] to initiate and host multiple GPUs performing computation on different segments of the Learner using multiple CPU host processes. This connects local gradient computation engines across all GPU devices engaged in the forward pass of all DNNs within the Learner, and synchronously contacts them during the backward pass to compute gradients. 4.4.2.2 Learner on FPGA On FPGA, we design a Learner Module that supports both pipeline parallelism across different neural network layers and data parallelism among sub-batches of data. As an example, we show the design for an N-layer MLP in Figure 4.6. Each pipeline stage uses buffers to store intermediate activations, and uses 92 Figure 4.6: FPGA - Learner Hardware Module an array of multiplier-accumulator units to compute matrix-vector multiplication for a given input. The number of multiplier-accumulator units allocated to each layer is controlled by a unique unroll factor UF, which will be tuned to ensure load balancing for best performance (Section 4.3.2). To realize data streaming between modules, the modules are connected by on-chip FIFO pipes. To support task parallelism, where only a segment of the training pipeline is placed on an FPGA device, we develop a separate set of modules. Each FP module has a DDR interface for streaming activations, and every BP module has a DDR interface for streaming activation gradients. By enabling these modules to send intermediate activations and/or gradients to the host or other devices responsible for gradient aggregation (GA modules), we allow flexible composition of task parallel Learner implementation given an arbitrary slicing point to segment the Learner task graph. The communication among FPGAs with other devices is coordinated through the Host Runtime thread. 
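As a simplified illustration of the load-balancing rule in Equation 4.1 (this sketch budgets only DSPs; the real configuration step also accounts for SRAM and logic resources, and the MAC counts below are placeholders):

def balance_unroll_factors(macs_per_stage, total_dsps, dsps_per_mac=1):
    """Allocate DSP-backed MAC units proportionally to each stage's MAC count so that
    #MAC_i / UF_i is approximately equal across stages, within the DSP budget."""
    total_macs = sum(macs_per_stage)
    budget = total_dsps // dsps_per_mac
    ufs = [max(1, (m * budget) // total_macs) for m in macs_per_stage]
    # The per-stage latency that results from this allocation (Equation 4.1's T_stage):
    t_stage = max(m / uf for m, uf in zip(macs_per_stage, ufs))
    assert sum(ufs) * dsps_per_mac <= total_dsps, "allocation exceeds the DSP budget"
    return ufs, t_stage

# Example: forward, backward, and gradient-aggregation stages of a small MLP layer group.
ufs, t_stage = balance_unroll_factors([64 * 128, 128 * 128, 128 * 4], total_dsps=4096)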
93 Table 4.1: Specification of Heterogeneous Platforms Platform Device Process Hardware Parallelism External Memory Frequency Server CG1 CPU Intel Core i9-12MB 10 nm 2 sockets, 16 cores 32 GB 3.3 GHz GPU Intel UHD Graphics Xe 10 nm 32 Unified Pipelines 32 GB 1.6 GHz Server CG2 CPU AMD EPYC 7763 7 nm 2 sockets, 128 cores 512 GB, DDR4 1.5 GHz GPU NVIDIA RTX A5000 (x3) 8 nm 8192 CUDA cores, 256 Tensor cores 24 GB, GDDR6 1.17 GHz Server CGF CPU Xeon Gold 6326 10 nm 2 sockets, 64 cores 256 GB, DDR4 2.9 GHz GPU NVIDIA Geoforce 3090 8 nm 10496 CUDA cores 24 GB, HBM 1.7 GHz FPGA Intel DE-10 Agilex 10 nm 4510 DSPs 32 GB, DDR4 400 MHz 4.5 Evaluation 4.5.1 Experiment Setup To show the portability of our toolkit to different platforms, we conduct our experiments on three heterogeneous platforms. The first platform, ServerCG1, has a Host CPU and an integrated GPU that share the same die. The second platform, ServerCG2, consists of a Host CPU connected to 3 GPUs through PCIe; The GPUs are also interconnected by NVLink [99] . The third platform, ServerCGF , consists of a Host CPU connected to a GPU and an FPGA, both through PCIe with 16 GB/s bandwidth. The specifications of these platforms are summarized in Table 4.1. For FPGA bitstream generation, we follow the oneAPI development flow [53]. We select three widely-used RL benchmarking environments: CartPole, MountainCar, and Pong, in the OpenAI Gym software simulation environment [11]. We demonstrate our toolkit using three representative DRL algorithms widely applied in various applications, DQN [92], DDPG [79], and SAC [43]. We 94 Table 4.2: Benchmarking Environments and Algorithms Environment Algorithm DNN Policy Number of DNNs CartPole DQN 3-layer MLP, hidden size 64 2 MountainCar DDPG 4-layer MLP, hidden sizes 256,128 4 Pong DQN CNN in [92] 2 Humanoid SAC 5-layer MLP, hidden size 2048 6 accelerate these algorithms under a wide range of benchmarking software environments spanning the domains of classical control, games and robotics [11]. The algorithm, size of the states and actions, and policy model for solving each benchmark are shown in Table 4.2. We evaluate the system training throughput as the number of Experiences processed Per Second (EP S = Training batch size Titr , where Titr is the execution time of one training iteration, i.e., the actual makespan returned by Algorithm 3). 4.5.2 Performance of Accelerated Primitives Since EP S is bounded by latencies of the primitives in each iteration, we first show the device assignment tradeoffs for each primitive. In Figure 4.7, we present the actual total execution latencies for batched Replay Manager (RM) operations. They are plotted across a range of commonly-used training batch sizes (a significant DRL hyperparameter affecting DRL iteration time). For PCIe-connected GPU and FPGA on ServerCGF , all the latencies of the primitives in Figure 4.7 include the data transfer (PCIe) time. Note that the latencies for priority retrieval and priority update are combined since these operations are typically performed together during priority insertion and update processes. Our observations reveal superior scalability of GPU- and FPGAaccelerated replay operations compared to the multi-threaded CPU implementation. The RM operations are memory-bound. While GPU data parallel compute resources exhibit good scalability, they are underutilized due to high-latency global memory accesses that cannot be hidden by the computations. 
The 95 (a) Priority Sampling (b) Priority Retrieval and Update Figure 4.7: RM Operation Latency across Devices FPGA accelerator processes the sum tree operations in a near-memory manner, storing the data structure on-chip, thus delivering the highest scalability. In Figure 4.8, we show the complete actual Learner execution times for one gradient update iteration for different sized MLP and CNN models. Batched layer propagations exhibit a higher arithmetic intensity compared to replay operations. Consequently, the advantages of utilizing data parallel architectures (GPUs) lead to consistently lower gradient update latency compared to CPU. The FPGA accelerator design surpasses GPU performance when arithmetic intensity is low. This is particularly evident when dealing with smaller neural network sizes and batch sizes. As the training batch size or the size of the DNN weights involved in the Learner increases, the execution time of training primitives on GPU begins to outperform 96 Figure 4.8: Learner Latency across Devices that on the FPGA. This shift is due to hidden memory overhead at larger batch computations and a higher clock frequency on the GPU. When populating the Heterogeneous Compute Latency Table, the operation latencies of RM and layer propagations on the CPU and GPU are obtained by profiling. This accurately predicts the runtime operation latency since the computations in every iteration are the same. For operation latencies on the FPGA, we use the performance model described in Section 4.3.2 to avoid costly effort in generating multiple versions of the bitstreams before System Composition. In our implementations, the latencies generated by these performance models differ from the actual runtime latencies (profiled after loading the bitstream) by no more than 10% for the Learner and no more than 12% for the RM. The communication latencies used in Algorithm 3 are computed using the data transfer size, interconnection bandwidth, and data transfer 97 initiation latency. In our experiment, the estimated communication overheads differ from the actual incurred communication latencies by less than 5%. The accurate predictions of primitive and communication latencies help ensure the validity of our task mapping. (a) DQN - CP, ServerCG1 (b) DDPG - MC, ServerCG1 (c) SAC - Humanoid, ServerCG1 (d) DDPG - MC, ServerCGF (e) DQN - Pong, ServerCGF (f) SAC - Humanoid, ServerCGF (g) DQN - Pong, ServerCG2 (h) SAC - Humanoid, ServerCG2 Figure 4.9: System Composition. For Parallelized Learner over Devices, TP stands for task parallelism and DP stands for data parallelism. 4.5.3 System Composition In Figure 4.9, we selected a few scenarios to showcase the necessity of different primitive-to-device mappings based on different algorithms, hyper parameters, and platforms, and confirm the of superiority of our system composition over other design choices in these scenarios. In all the subfigures, the two axes indicate different device choice for the two primitive, RM and Learner, respectively; The color gradients 98 in the grids are proportional to the magnitudes of achieved throughput in their corresponding device assignments. The stars label the optimal mappings yielding the highest throughput. We observe that the choice of device for the primitive with the highest latency significantly influences variations in throughput. 
Specifically, for small-batch computations and small DNNs (e.g., grid plots with batch size 32 in Figure 4.9(a)(b)(d), the color gradient changes most drastically along the axis with varying device for the RM, because replay operations result in significant overheads as Learner computations are small. On the other hand, for large-batch computations and large (number of) DNNs (e.g., grids with batch size 512 in Figure 4.9(b)-(h)), the color gradient changes most drastically along the vertical axis, as the Learner dominates each training iteration and replay operation overheads are hidden. Note that when multiple device assignment choices lead to the same throughput, our toolkit selects the one with the lowest total data traffic (e.g., Figure 4.9(e)). Note that for the mapping of Learner, it is not practical to list all possible ways to perform task parallel and data parallel training in Figure 4.9. Thus, we only include the multi-device assignments in the Learner axis when the optimal Learner mapping returned by Algorithm 3 uses multiple devices. There are still cases where Algorithm 3 returns the optimal design point without partitioning any primitive over multiple devices. For example, in Figure 4.9(f), we observe that Task Parallelism over two accelerators is only favored when the larger batched training over-saturates the compute power provided by a single device; For the batch-32 experiment, the compute intensity does not fully saturate a single GPU. In this case, involving another device introduces communication overheads without improving speed. This is captured in Algorithm 3, so that the optimal design is obtained using a single accelerator for the Learner. The partition and assignment of the Learner tasks onto the GPU and FPGA is shown at the bottom of Figure 4.10. For the cases where using multiple accelerators for the Learner is advantageous compared to using a single device, we also include the scenarios showing the tradeoff between task parallelism and data parallelism. In Figure 4.9(g), we show the case in Batch-1024, where data parallelism is favored over Task 99 Parallelism. In the DQN algorithm, only one DNN is trained; splitting one DNN over two devices would introduce intermediate (CNN) tensors proportional to batch size, and it causes large communication overheads between the accelerators. When using data parallelism over 2 GPUs, the weight gradient aggregation overhead is smaller than the communication overheads in Task Parallelism. Also, if using 3 GPUs with Data Parallelim, the gradient aggregation overheads again become large so that the optimal design point is chosen at Data Parallelim over 2 GPUs. In Figure 4.9(h), for Batch-1024, Task Parallelism outperforms Data Parallelism for the SAC algorithm with 6 DNNs involved in training. The device assignment returned by the System Composer is shown at the top of Figure 4.10. Although it still introduces intermediate results communicated over 2 GPUs, the communication overhead is trivial. Specifically, a batch of Q values is communicated, leading to 8KB data traffic in total through NVLink. This communication overhead is much smaller than the CPU gradient aggregation overhead, which will appear if we alternatively choose data parallelism. Figure 4.10: Adopted Task Graph to Device Mappings for Task Parallel Learners. The Top Sub-Figure Corresponds to the Optimal Mapping in Figure 4.9-(h). The Bottom Sub-Figure Corresponds to the Optimal Mapping in Figure 4.9-(c)/(f). 
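The intuition behind these choices can be illustrated with a back-of-the-envelope comparison of the two communication patterns (a hedged sketch; the byte counts are simplified, and the System Composer instead uses the measured tables of Figure 4.4):

def data_parallel_comm_bytes(num_weights, bytes_per_value=4):
    # Data parallelism exchanges weight gradients once per iteration (aggregate + redistribute),
    # so the volume scales with the model size, not the batch size.
    return 2 * num_weights * bytes_per_value

def task_parallel_comm_bytes(batch_size, cut_activation_width, bytes_per_value=4):
    # Task parallelism exchanges the activations (and their gradients) at the slicing point,
    # so the volume scales with the batch size and the width of the cut tensor.
    return 2 * batch_size * cut_activation_width * bytes_per_value

tp = task_parallel_comm_bytes(1024, cut_activation_width=1)   # e.g., a batch of scalar Q values
dp = data_parallel_comm_bytes(num_weights=5_000_000)          # e.g., gradients of a large policy

With a batch of 1024 scalar Q values crossing the cut, the task-parallel traffic is on the order of a few KB, while aggregating the gradients of a multi-million-parameter Learner under data parallelism is orders of magnitude larger, which is consistent with the tradeoffs observed above.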
4.5.4 Comparison with Existing DRL Libraries

We compare PEARL-generated optimal implementations with two state-of-the-art DRL frameworks, RLlib [77] and OpenAI Stable Baselines 3 (SB3) [110], on ServerCGF. The performance of RLlib and SB3 is obtained using the optimal settings required by each of them (i.e., using the GPU for training). The detailed performance across the different benchmarks is shown in Table 4.3.

Table 4.3: Comparison with Existing DRL Frameworks
Benchmark | Implementation | EPS (Optimal) ServerCGF | EPS (CPU-GPU) ServerCGF | EPS (Optimal) ServerCG2 | EPS (Optimal) ServerCG1 | Performance Portability Φ(P)
DQN CartPole | Ours | 94.1K | 69.0K | 65.8K | 49.2K | 65.0K
DQN CartPole | RLlib | 50.3K | 50.3K | 39.4K | 52.8K | 46.8K
DQN CartPole | Stable Baselines | 56.1K | 56.1K | 46.9K | 44.6K | 48.2K
DDPG MountainCar | Ours | 95.2K | 58.2K | 60.5K | 57.1K | 67.2K
DDPG MountainCar | RLlib | 48.5K | 48.5K | 41.2K | 36.2K | 41.4K
DDPG MountainCar | Stable Baselines | 50.1K | 50.1K | 37.9K | 35.4K | 40.2K
DQN Pong | Ours | 9.6K | 9.6K | 17.1K | 4.1K | 7.4K
DQN Pong | RLlib | 6.2K | 6.2K | 12.4K | 3.6K | 5.7K
DQN Pong | Stable Baselines | 6.9K | 6.9K | 13.0K | 3.2K | 5.6K
SAC Humanoid | Ours | 17.9K | 14.7K | 32.8K | 25.1K | 23.7K
SAC Humanoid | RLlib | 11.2K | 11.2K | 22.9K | 16.9K | 15.6K
SAC Humanoid | Stable Baselines | 13.0K | 13.0K | 17.5K | 16.2K | 15.3K

System Throughput. The additional flexibility of supporting FPGA accelerators, along with our runtime optimizations, enables PEARL to achieve up to 1.9×, 2.2×, and 1.4× improvements in EPS for the three benchmarks. Even using the same set of hardware (CPU-GPU), our novel scheduling and resource allocation leads to 21% to 55% higher EPS. We also evaluate the effect of our runtime dynamic heterogeneous resource allocation. Another study focused on mapping DRL onto FPGA-based heterogeneous platforms [154] and evaluated using the CartPole benchmark. Due to the different hardware and optimal device assignments, EPS is not directly comparable. Nonetheless, we compare the effective heterogeneous resource utilization (achieved throughput given the peak throughput of all the processors and accelerators in the platform). For CartPole DQN batch-32 training, PEARL achieves 7.9K EPS using a CPU-FPGA with a total peak performance of 0.46 TFLOPS; [154] achieved an amortized throughput of 7.1K EPS using a CPU-FPGA with 0.72 TFLOPS. Despite having 36% lower available device performance, our result shows an 11% higher EPS.

Portability. To show the performance portability of our toolkit, we adopt the portability metric for a framework to be consistent with that described in [103]:

\Phi(H) = \begin{cases} 0 & \text{if } \exists i \in H,\ EPS_i = 0 \\ \frac{|H|}{\sum_{i \in H} \frac{1}{EPS_i}} & \text{otherwise} \end{cases} \quad (4.5)

where H denotes the set of heterogeneous platforms and EPS_i is the achieved EPS using the i-th device assignment choice or platform in the set H. The results are shown in the last two rows of Table 4.3. Φ(D) quantifies the ability to use the different heterogeneous resources provided by a single platform. Other existing works that do not support an accelerated RM or an FPGA-based Learner are not portable to these device assignments (∃i ∈ D, EPS_i = 0), thus having Φ(D) = 0. In contrast, our work is portable to all assignment choices provided by ServerCGF. Our work enables the utilization of the compute power of a wider range of heterogeneous devices, thus achieving better device portability and higher performance. Φ(P) quantifies the ability to achieve performance across different platforms (i.e., ServerCG1, ServerCG2 and ServerCGF), where EPS_i is the highest throughput achieved on the i-th platform. Our toolkit consistently achieves higher platform-throughput portability Φ(P) compared with the existing works.

Algorithm Performance.
Figure 4.11 plots the cumulative rewards collected by the agent policy over wall clock time. The curves are smoothed to show the sliding average rewards obtained in a window of 100 training iterations, and each curve is the mean of 5 runs of the algorithm-benchmark pair. For all the algorithms and benchmark applications tested on two platforms, we consistently observe faster convergence compared to the baselines (Stable Baselines 3 and RLlib), meaning our implementation improves throughput without significantly sacrificing algorithm performance in terms of reward and convergence rate. 102 Figure 4.11: Rewards over Time 4.5.5 User Productivity For a quick assessment of programmability, we enlisted 5 graduate students familiar with RL but lacking expertise in heterogeneous hardware, aligning with PEARL’s target user community, to implement two algorithms using PEARL. Table 4.4 quantifies the average development effort involved. Note that we exclude the FPGA image (bitstream) compilation time in Table 4.4 (as consistent with established practice [21]), since it is an integral part of the oneAPI workflow, and is not a step directly specified by PEARL users. Although we did not include this time, it is worth noting that meeting the resource constraints (Equation 4.1 in accelerator configuration for FPGA) typically required multiple runs of synthesis using the Quartus 103 Table 4.4: User Productivity Algorithms DQN DDPG SAC User code ∼85 lines ∼120 lines ∼160 lines Development effort ◁ ∼12 minutes ∼17 minutes ∼24 minutes Productivity: CD across platforms ∼0.1 ∼0.06 ∼0.14 ◁ The compiling time for the FPGA image is excluded. compiler, and the synthesis time is included in the development effort. In addition to illustrating the effort required for developing a specific algorithm, we also present the Code Divergence (CD) to demonstrate productivity differences between development on the distinct platforms (ServerCGF , ServerCG1 and ServerCG2). The code divergence cd(i,j) between two platforms, i and j, is computed using the formula cd(i,j) = 1 − |ci∩cj | |ci∪cj | [103], where c represents the lines of user code. The cd(i,j) value falls within the range [0,1]: a value of 0 indicates that a “single-source" code can be shared between both platforms, while a value of 1 implies that the user code is entirely different for the two platforms. CD among |H| platforms is defined as the average of cd from all pairs of platforms in the set H: CD = |H| 2 −1 P i,j∈H×H cd(i,j) . In our case, CD is close to 0, as the only required changes when porting to different devices are modifying the paths to input files, and modifying the function arguments specifying devices for each primitive construct. For DQN and SAC, the CDs are higher since using task-parallel or data-parallel training requires a few additional dedicated lines to manage data synchronization from multiple devices in the Learner function. Overall, DRL application development through training in simulation is for tuning the best model and set of hyper-parameters before physical deployment. This requires repeated rounds of testing with different algorithms, hyper-parameters, and environmental scenarios to ensure the reliability of the agent. On state-of-the-art data-centers, it is unrealistic for application users to hand-tune each round of testing. 
With PEARL, developers write only dozens of lines of code for generating the accelerated DRL implementation 10 within minutes, significantly reducing the development effort and leading to more robust AI agents with faster development cycles. 105 Chapter 5 Acceleration Systems for Model-Based Monte-Carlo Tree Search using Heterogeneous Platforms As described in the end of Section 2.1, an MCTS process include the in-tree operations and the node evaluations. The in-tree operations are highly memory-bound and incurs irregular memory accesses dynamically determined at runtime, which is hard to optimize using thread-level parallelism or data parallelism. On the other hand, the node evaluations (simulations, or DNN operations) among different workers have highly parallelizable compute patterns. As a result, in-tree operations may become a bottleneck that hinder performance scalability to large number of workers on homogeneous multi-core systems. In this Chapter, we provide two sets of system solutions targeting two different MCTS algorithms and two different types of heterogeneous platforms, respectively. In Section 5.1, we propose a CPU-FPGA system with a hybrid parallel execution model for MCTS. It concurrently exploits the thread-level parallelism provided by multi-core CPUs and algorithm-hardware co-optimized accelerator design (Chapter 3, Section 3.3.2) for alleviating memory bound of in-tree operations on the FPGA. In Section 5.2, we propose an adaptiveparallel methodology for DNN-MCTS (e.g., AlphaZero [125]) based on an analysis of tradeoffs between two parallel techniques (local tree and shared tree), supporting CPU and CPU-GPU platforms. 106 Figure 5.1: Hybrid Parallel Execution Model workflow. Exp-Sim denotes Expansion and Simulation; UpdSel denotes Update (i.e., node updates in Back-Up) and Selection. 5.1 MCTS System Solution on CPU-FPGA Platforms 5.1.1 System and Framework Specifications Our framework takes a benchmark software simulator, MCTS tree specifications, and CPU-FPGA platform specifications as inputs. It outputs the end-to-end mapping of the Tree-parallel MCTS on the given heterogeneous platform. It is composed of a parallel execution model that defines the primitives executed on each processor/accelerator and their interactions, and a tool flow for generating FPGA bitstream and interfacing heterogeneous programming languages (Section 5.1.2). Hybrid Parallel Execution Model. As introduced in Section 2.3, both existing CPU execution models (shared tree method and local tree method) have their tradeoffs. In this work, using our FPGA accelerator, we propose a hybrid parallel execution model that outperforms both existing execution models. The proposed hybrid execution model allows low-latency on-chip memory accesses and concurrent in-tree operations on FPGA that outperforms the multi-threaded CPU tree traversal, while keeping the advantage of high-throughput simulation of single-thread tree traversal. This is achieved by preventing the in-tree operations from occupying the simulation threads. Our execution model uses a task decomposition scheme consistent with existing work 107 Figure 5.2: Heterogeneous System Overview. Global Memory is the CPU DRAM. [87] to reduce the CPU-FPGA data traffic and FPGA on-chip memory consumption - the system dynamically maintains two memory components: The MCTS tree and State Table. The MCTS tree is stored on the FPGA (Chapter 3, Section 3.3.2) and its node entries do not store the environment states. The State Table is stored in the CPU DRAM. 
It is implemented as a table with X entries (X is the tree size), where the index of each entry is a unique node index maintained in the MCTS tree, and the value is an application-specific environment state represented by that node. The high-level heterogeneous system architecture of our framework is shown in Figure 5.2. We use a master-worker architecture to implement parallel MCTS under our hybrid execution model with the following considerations: First, the simulation operations during Expansion and Simulation phases are application-specific, worker-independent, and usually more time-consuming compared to the in-tree operations. So, they are implemented on the CPU in a data-parallel fashion using multiple worker threads. Second, a centralized Master FPGA Thread hosting a pipelined accelerator is dedicated for high-throughput in-tree operations using localized fast on-chip memory. This also prevents the in-tree operations from occupying and blocking the simulation processes. 108 Master FPGA thread and Simulation threads Workflow. The execution workflow of the FPGA kernel, and each CPU thread is summarized in Figure 5.1. The Master FPGA Thread (1) serves as the host program for the FPGA accelerator, and (2) is used for coordinating and scheduling worker requests among the Simulation threads. The Master FPGA Thread is critical in overlapping in-tree operations with simulations. As shown in Figure 5.1-(b), the master process repeatedly executes the in-tree operations using the FPGA accelerator and assigns Expansion-Simulation tasks to different Worker Simulation Threads through the shared Exp-Sim request buffer (the buffer is implemented as a thread-safe queue). It collects the Update-Selection requests returned by the Worker Simulation Threads to update the MCTS tree statistics. The CPU-FPGA data communication and the communication between master and worker threads are asynchronous, allowing the in-tree operations and simulation by different workers to overlap. As shown in Figure 5.1-(a), after the pipelined processing of Update-Selection requests, the node IDs to be expanded can be generated before deciding the bank assignment of inserted nodes using Algorithm 2. Therefore, the Exp-Sim requests can be sent to CPU in the same time the Node Insertion is executed on the FPGA, hiding the additional overhead introduced by our algorithm optimization for the Node Insertion. The Worker Simulation Thread process is shown in Figure 5.1-(c). Note that The State Table allows fully data-parallel operations by all the simulation threads and does not incur additional synchronization between threads. Dependency-Relaxed Task Scheduling. In synchronous tree-parallel MCTS, a barrier is put after the BackUp to make sure updates by all the workers are completed before the Selection in the next iteration can start. In practice, although this allows all the workers to access the most up-to-date statistics of the uct values in the MCTS tree, it leads to idling of the FPGA hardware in the Selection pipeline. To alleviate this idling, we relax the dependency between different workers in adjacent iterations. Specifically, instead of having all the p workers waiting for the completion of BackUp by all the workers in the previous iteration, we only make sure each individual 109 Figure 5.3: Framework Design Tool Flow worker waits for the completion of BackUp by itself in the previous iteration before beginning Selection of its next iteration. 
This dependency-relaxed task scheduling can result in staleness of the tree policy by up to p updates behind the tree policy constructed with the barrier after BackUp. The effect on algorithm performance is negligible because p (usually tens to hundreds) is much smaller than the total number of iterations in an MCTS agent step (up to tens of thousands).

5.1.2 Framework Workflow

Figure 5.3 summarizes the design tool flow in our framework. The tool is composed of a DSE (Design Space Exploration) Engine, an accelerator code generator, and a System API (Application Programming Interface) for interfacing between the FPGA kernel, the Master FPGA thread, and the simulation threads in a high-level language (Python) under our proposed hybrid execution model. The inputs to the framework include configuration files describing the MCTS tree and the FPGA hardware, given at compile time, and command-line arguments specifying the benchmark environment and the number of workers, given at run time. The output is an executable Python program that performs the specified MCTS application on a CPU-FPGA platform. The DSE engine and the API are discussed in detail in the following paragraphs.

FPGA Accelerator Generation. The input MCTS tree specifications include the tree height limit D, the fanout of the tree (i.e., action space) F, and the iteration budget (tree size limit) X. Our proposed FPGA accelerator has two major design parameters: the Stage-Bank interconnection configuration, M, and the Comparison-lookup factor, f. The interconnection configuration has two modes: "butterfly" specifies the custom Butterfly-based Interconnection (Chapter 3, Section 3.3.2 - Stage-Bank Interconnection); "all-to-all" specifies an all-to-all connection between the stages and banks, yielding the optimal (1-cycle) bank access latency. f is an integer factor (Section 3.3.2 - Algorithm Optimization - Hardware Optimization for Minimizing Selector Compute Time). The DSE engine determines these design parameters with the goal of minimizing Itv (Equation 3.18), using the following models:

T_{selector} = \log_f F
T_{interconnection} = \begin{cases} \log_2 D & \text{if } M = \text{"butterfly"} \\ 1 & \text{if } M = \text{"all-to-all"} \end{cases}
R_{selector} = C_f^2 \times R_{comparator} + R_{Lookup} \times \sum_{n=0}^{\log_f F - 1} f^n
R_{interconnection} = \begin{cases} (D \times \log_2 D + Y) \times R_{buffer} & \text{if } M = \text{"butterfly"} \\ D \times Y \times R_{buffer} & \text{if } M = \text{"all-to-all"} \end{cases} \quad (5.1)

Based on the above models, our DSE engine searches for the globally optimal design point on a target FPGA that satisfies Equation 5.2:

M, f = \arg\min \{ D \times T_{selector} + T_{interconnection} \} \quad \text{such that} \quad D \times R_{selector} + R_{interconnection} < R_{FPGA}, \quad \text{where } M \in \{\text{"butterfly"}, \text{"all-to-all"}\},\ f \in [1, F] \quad (5.2)

In Equations 5.1 and 5.2, R_{FPGA} denotes the available set of FPGA resources (DSP, LUT, SRAM, etc.), R_{module} denotes the resource consumption of a module, and T_{module} denotes the latency (number of FPGA cycles) to process a request using the module. After the Stage-Bank interconnection configuration and the Comparison-lookup factor are determined, they are used by an FPGA Kernel Generator script to produce the HLS code and compile the code into an FPGA bitstream.

System API. Our framework needs to link across the heterogeneous programming languages of the CPU and FPGA. Specifically, state-of-the-art AI benchmarking simulators executed on the CPU need to be invoked through high-level libraries in Python [12, 135], while our FPGA kernel is described and hosted using HLS (C++) code. Therefore, an API is needed for the CPU-FPGA runtime system to port the FPGA kernel initiation and communication protocols into Python functions.
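As a rough illustration of how these functions compose into a master-thread main program under the hybrid execution model (the function names are taken from Table 5.1 below; the arguments and the episode-control helper are purely illustrative):

# Illustrative master-thread main loop; only the API function names come from the framework.
Init(bitstream="mcts_kernel.aocx", env="Pong")
MCTS_Parameters(depth_limit=32, fanout=6, budget=10000, num_workers=128)
LoadKrnlParameters()

while not environment_done():                # hypothetical helper for episode control
    # Forward completed Update-Selection requests from worker threads to the FPGA.
    SendInTreeRequests(CollectSimTasks())
    # Receive Expansion-Simulation requests produced by the Selection pipeline ...
    exp_sim_requests = ReceiveSimRequests()
    # ... and hand them to idle worker simulation threads.
    for request in exp_sim_requests:
        AssignExpSimTasks(request)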
Table 5.1 summarizes the API functions provided by our framework. The listed functions are called from a main program executed on the CPU master thread. For the functions interfacing with FPGA (rows 1, 3, 6, 7 of Table 5.1), we use the Pybind library [59] to develop Python wrappers for our HLS C++ host code. Table 5.1: API functions API Functions Description Init() Initialize the platform with FPGA bitstream, Initialize the benchmarking environment for simulation MCTS_Parameters() Set the MCTS tree parameters and number of workers LoadKrnlParameters() Generate and Load the static content of the Comparison-LookUp Tables on FPGA AssignExpSimTasks() Check the parallel worker pool and execute thread-safe protocol for sending Expansion-Simulation requests from the master thread to a worker simulation thread CollectSimTasks() Check the parallel worker pool and execute thread-safe protocol for receiving Update-Selection requests from a worker simulation thread to the master thread SendInTreeRequests() Send the Update-Selection requests from the master thread to FPGA ReceiveSimRequests() Receive the Expansion-Simulation requests from FPGA to the master thread 112 5.1.3 Evaluation MCTS System Throughput. Figure 5.4 shows the timeline of the operations on the FPGA, the CPU master thread and one CPU worker Simulation thread in each iteration of our framework. We observe that the node insertion process with our bank assignment algorithm optimization (Algorithm 2) can be completely hidden by the simulation process. We also observe that the overhead for managing the request buffers and queues for communication between the master thread and worker simulation threads on CPU are small, and they can be overlapped with the simulation processes as they are fetched on-the-fly. This means that these overheads will not become bottlenecks that can hinder system scalability to large number of workers. Figure 5.4: Timeline of the parallel execution in [87] and our framework. p=128. The achieved system throughput in AP S (worker-Actions processed per second) is plotted in Figure 5.5. Comparison with state-of-the-art: We compare our system execution timeline with the state-of-the-art [87], as shown in Figure 5.4. Note that [87] adopts a different execution model where the PCIe communication and State Table accesses are blocking (implicit barriers are present before and after PCIe data transfer). On the other hand, our execution model allows overlapping the communication with the Selection and Node Insertion processes on FPGA. While the in-tree operation latency of our design is higher 113 than those achieved in [87], its effect on the system throughput is very small, since the accelerated FPGA kernels only leads to small (≤ 10%) overheads in each iteration. As a result, the achieved throughputs in both [87] and our work are close to the peak simulation throughputs p Tsim . Figure 5.5: System Throughput Comparisons. D=32. Comparison with CPU baselines: In both CPU-only system execution models, the throughput can linearly scale up with increasing p until the total latency of serialized in-tree operations become the bottleneck (For single-thread tree traversal, in-tree operations by workers are completely serialized. For multithreaded tree traversal, in-tree operations by workers are serialized by the overhead of communicating root-level information across different threads through DDR memory). This threshold is p =16, 32 and 64 on the Carnival, Alien, and Pong benchmarks, respectively. 
The value of the threshold is affected by the latency ratio of the in-tree operations to the simulation. On the CPU, for benchmarks with lowerlatency simulation, the faster in-tree operations become the bottleneck as p scales up. For p larger than this threshold, the AP S no longer scales up. By reducing Itv between workers, the proposed hybrid parallel execution model that leverages FPGA acceleration alleviates the bottleneck imposed by in-tree 114 operations. In our CPU-FPGA system, higher system throughput improvements are consistently observed for larger p, as evident in Figure 5.5. Overall, we obtain up to 2.8×, 5.14× and 6.8× higher throughput for the three benchmarks compared with the CPU-only baselines. MCTS algorithm performance. The bar plots in Figure 5.6 show the total accumulative rewards gained using our framework, an existing CPU-FPGA baseline [87], and the CPU baseline with the same number of simulations. The tree height limit is set to D = 32 for both benchmarks in our work and the CPU baseline, while D = 4(8) for Alien(Carnival) in the existing work [87]. Comparison with CPU baselines: For the CPU-only baseline, we only show the rewards from the single-thread tree traversal as it is very close to the rewards using multi-threaded tree traversal (the average difference is within 2%). Overall, our framework achieves similar algorithm performance in terms of rewards gained in an episode compared to the CPU baseline, with better scalability to large number of workers (shown by lower time per agent step in the line plots). We also show that the dependency relaxation (Section 5.1.1 - Dependency-Relaxed Task Scheduling) improves speed without significantly affecting algorithm performance. This is done by comparing our execution model (async.) with a synchronous version (sync.) that enforces the dependency of Selection upon all the worker Backups in the previous iteration. Comparison with state-of-the-art: As shown in Figure 5.6, in the Alien and Carnival benchmarks, the rewards achieved by our framework are significantly higher than the rewards achieved in [87]. This is due to the ability of the proposed design to dynamically adjust the tree shapes constructed in different agent steps. For the Pong benchmark, the rewards obtained over all the baselines do not show a significant difference and are saturated at 21. This is because the agent wins the game and terminates it once it hits the score of 21 without letting the enemy hit the same score. We still notice a disadvantage of [87] in terms of the achieved score compared with our work and the CPU baselines. In our FPGA-based design, achieving 115 higher algorithm performance comes at the cost of the additional overheads from interconnection routing and the node insertion latency algorithm. However, because the node insertion overheads can be completely hidden using our proposed execution model, the time per agent step achieved by our framework is very close to that of [87]. Figure 5.6: Rewards under various frameworks. async.(sync.) stands for execution with(without) dependency-relaxation. 116 5.2 MCTS System Solution on Multi-Core and CPU-GPU Platforms 5.2.1 Adaptive Parallelism and Implementation 5.2.1.1 Parallelization Schemes Assume that we allocate N workers sharing the tree during the tree-based search. We consider two methods to implement tree-parallel MCTS on multi-core CPUs. These methods are characterized by their usage of a local tree and a shared tree, respectively: Shared Tree. 
The shared-tree method uses N threads in total - it assigns each worker an individual thread. Each thread is responsible for its own assigned worker’s in-tree operations and DNN inference. The tree is stored in a shared memory (typically DDR memory of the CPU). The shared-tree method on a multi-core (a) Shared-tree on multi-core system (b) Execution timeline of the shared-tree method Figure 5.7: Shared-tree method system is shown in Figure 5.7-(a). The in-tree operations by each worker are protected with locks to ensure exclusive access at a time. The operation execution timeline of the shared-tree method is shown in Figure 5.7-(b). All workers start at a common root node, and the virtual loss applied to the root children needs to 117 be updated for all workers. So, the time interval between consecutive workers involves the overhead for communicating the root-level information through DDR, creating latency offsets between workers. The main advantage of the shared-tree method is that in-tree operations are parallelized. The disadvantage is that the more compute-intensive node evaluation process cannot fully utilize the compute power provided by the parallel threads, since they need to wait for the completion of in-tree operations by all workers, and these in-tree operations are bounded by memory access latencies. Local Tree. The local-tree method uses N + 1 threads in total - it uses a centralized master thread to manage the complete tree, and it allocates N threads to execute the node evaluations for N workers (each solely dedicated to DNN inferences). The complete tree is stored in the local memory of the master thread (e.g., cache memory). The master thread also manages a worker-thread pool where the master thread communicates with each worker thread through a FIFO (first-in-first-out) communication pipe, as shown in Figure 5.8-(a). The master thread executes a while(1)loop; In each iteration, it selects new nodes to send to worker threads, and checks for backup requests received from any worker in the worker-thread pool. The worker threads’ processes are completely independent of one another; they only coordinate with the master thread. The main advantage of the local-tree method is that it can overlap the computation of DNN inferences and in-tree operations by separating them into different hardware resources (Figure 5.8-(b)); Also, for small-sized trees that can fit in last-level cache, the memory access latencies in in-tree operations are reduced compared to the shared-tree method. The disadvantage is that all the in-tree operations are completely serialized, leading to lower in-tree throughput. 118 (a) Local-tree on multi-core system (b) Execution timeline of the local-tree method Figure 5.8: Local-tree method 5.2.1.2 Adaptive Parallelism The local-tree and shared-tree methods have tradeoffs that suit different scenarios. When DNN inference throughput is the limiting factor, the local-tree method maximizes parallelism for independent node evaluations. Conversely, as the number of workers increases or the tree’s depth leads to sequential in-tree operations becoming the bottleneck, the shared-tree method efficiently parallelizes these in-tree operations. Our work aims to leverage the strengths of both methods through an adaptive tree-parallel DNN-MCTS implementation. This implementation employs an empirical performance model to dynamically select the optimal method during compilation, considering the specific DNN-MCTS algorithm specification and multi-core CPU device specification. 
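To make the structural difference concrete before describing the adaptive implementation, a minimal local-tree skeleton might look as follows (a sketch only; tree.select_and_expand, tree.backup, and evaluate_with_dnn are hypothetical helpers, and the shared-tree variant would instead run the whole loop body inside each of N threads with a lock around tree accesses):

import queue, threading

def local_tree_search(tree, num_workers, num_iterations):
    """Master thread owns the tree; N worker threads only run DNN evaluations."""
    requests, results = queue.Queue(), queue.Queue()

    def evaluation_worker():
        while True:
            node, state = requests.get()
            if node is None:                               # shutdown signal
                return
            results.put((node, evaluate_with_dnn(state)))  # hypothetical DNN inference

    workers = [threading.Thread(target=evaluation_worker) for _ in range(num_workers)]
    for w in workers: w.start()

    in_flight = 0
    for _ in range(num_iterations):
        node, state = tree.select_and_expand()   # sequential in-tree operations on the master
        requests.put((node, state))
        in_flight += 1
        # If all workers are busy, block on a result; otherwise drain any finished evaluations.
        while in_flight >= num_workers or (in_flight and not results.empty()):
            done_node, value = results.get()
            tree.backup(done_node, value)
            in_flight -= 1

    while in_flight:                             # drain the remaining evaluations
        done_node, value = results.get()
        tree.backup(done_node, value)
        in_flight -= 1
    for _ in workers: requests.put((None, None))
    for w in workers: w.join()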
To support adaptive parallelism, we implement a single DNN-MCTS program that selects between the shared-tree and local-tree methods based on an input flag.

In the shared-tree method, a pool of threads is spawned to execute all the in-tree operations and DNN inferences in parallel. A threadsafe_rollout function is executed by each thread in the thread pool. The function first traverses the tree from root to leaf, performing node selection, then performs node evaluation through neural_network_simulate, followed by node expansion and backup. During the virtual loss update and backup, multiple threads may share write accesses to the same nodes, so locks are used to ensure atomic accesses.

In the local-tree method, a centralized master thread performs all the in-tree operations, and a thread pool is spawned to execute all the DNN inferences asynchronously. The master thread executes the rollout_n_times function. It repeatedly performs node selection, expansion, and backup, and assigns a neural_network_simulate function as a node evaluation request to the thread pool through a first-in-first-out queue. When all the threads in the thread pool are occupied by DNN inferences, the master thread waits until it receives a value for backup. Otherwise, it continues with the in-tree operation loop to generate node evaluation requests.

5.2.1.3 GPU-offloaded DNN Inference

Our implementation also supports offloading the DNN inferences onto a GPU. We utilize a dedicated accelerator queue for accumulating DNN inference task requests produced by the tree selection process. When the queue size reaches a predetermined threshold, all tasks are submitted together to the GPU for computation. However, careful tuning of the communication batch size associated with the accelerator queue is needed. In the shared-tree method, the communication batch size is always set to the number of threads employed (i.e., the thread pool size). This is because the selection processes are parallel, resulting in the nearly simultaneous arrival of all inference tasks and leaving only a small gap to wait for the inference queue to fill. The local-tree method necessitates empirical tuning of the communication batch size. This is because the selection processes on the master thread are sequential and lead to long waiting times for the worker threads; submitting a small batch of inference tasks before the worker threads reach full capacity can help reduce accelerator waiting time, overlapping DNN inference computation with in-tree operations.

5.2.2 Performance Analysis

Performance Model. We provide a theoretical analysis of the time performance to understand the tradeoff between the shared-tree and local-tree methods. The main parallel parameters that affect their performance include the number of threads, the latency of executing in-tree operations and inferences on each thread, and the data access and/or data transfer latencies. Assuming the tree-based search is conducted on a multi-core CPU with a thread pool size of N, with batched DNN computations offloaded onto a GPU, the amortized latency of each iteration of the shared-tree method can be estimated as:

T^{CPU-GPU}_{shared} \approx T_{shared\ access} \times N + T_{select} + T_{backup} + T^{GPU}_{DNN}(batch = N) \quad (5.3)

The in-tree operation latency and the DNN inference latency are summed because they execute sequentially on each thread. In the local-tree method, the in-tree operations and DNN inferences are overlapped.
Therefore, the per-iteration execution time is bounded by either the DNN inference latency or the total latency of the sequential in-tree operations:

T^{CPU-GPU}_{local} \approx \max\{ (T_{select} + T_{backup}) \times N,\ T_{PCIe},\ T^{GPU}_{DNN-compute}(batch = B) \} \quad (5.4)

For batched DNN computations on the GPU, we select a (sub-)batch size B < N such that N/B CUDA streams are initiated; each CUDA stream processes a sub-batch after B loop counts of in-tree operations complete. Therefore, the execution timeline can be visualized similarly to that depicted in Figure 5.8-(b); the only differences are (1) the N worker threads are replaced with N/B CUDA streams, and (2) the blue-colored pipe communication arrows appear every B iterations (instead of every iteration) of in-tree operations.

Design Configuration Workflow. To decide the parallel method and the relevant design parameters (i.e., the accelerator inference batch size) at compile time, we first obtain T^{CPU-GPU}_{DNN}, T_{select}, and T_{backup} of a single worker on a single thread by profiling their amortized execution time on the target CPU for one iteration. In our implementation, the tree is managed as a dynamically allocated array of nodes in the CPU DDR memory. Therefore, we estimate T_{shared access} as the DDR access latency documented for the target CPU device. These are plugged into the performance models to decide the optimal parallel method. For deciding the choice of the parameter B (i.e., the number of CUDA streams, each processing a sub-batch), a naive method is to iterate over all possible values of B (B ∈ [1, N]) and empirically run an episode to test the average latency of each iteration. However, this makes the design space exploration complexity linearly proportional to N and hard to scale to large CPU-GPU systems. To address this, we make the following observations about Equation 5.4:

• (T_{select} + T_{backup}) remains constant or monotonically decreases with increasing B. This is because the Expand operation waits for a batch of inferences to complete before the new nodes can be traversed in Backup and Selection. The higher the CUDA stream batch size B, the less frequently nodes become available to be traversed (the frequency of making new node-UCT scores available is about once per N/B loop counts on the Master Thread).

• T_{PCIe} is the time for transferring a total of N data samples between the CPU and GPU. It can be viewed as N/B transfers, each processing a batch of B data samples, and each transfer is associated with a fixed communication and kernel launch latency L. Therefore, T_{PCIe} can be modeled as (N/B) \times L + N / BW_{PCIe}, where BW_{PCIe} is the PCIe bandwidth. Based on this model, T_{PCIe} is a monotonically decreasing sequence wrt B ∈ [1, N].

• T^{GPU}_{DNN}(batch = B) monotonically increases with B. This is because a larger B leads to higher computational workloads.

• Based on Equation 5.4, the element-wise maximum of two monotonically decreasing sequences ((T_{select} + T_{backup}) and T_{PCIe}) is also a monotonically decreasing sequence. The element-wise maximum of this resulting monotonically decreasing sequence and a monotonically increasing sequence (T^{GPU}_{DNN}(batch = B)) is a "V-sequence", i.e., a sequence that first monotonically decreases and then monotonically increases wrt B.

Essentially, we want to search the design space of B and find the value yielding the minimum execution time, i.e., \arg\min_B T^{CPU-GPU}_{local}. Based on the above observations, we can exploit the property of a "V-sequence" and develop an efficient algorithm to determine B at design time.
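As a concrete rendering of the compile-time selection step, a simplified version of Equations 5.3 and 5.4 can be evaluated directly on the profiled numbers (the dictionary layout below is illustrative, not part of our implementation):

def t_shared(t_access, t_select, t_backup, t_dnn_batch_n, n):
    # Equation 5.3: per-iteration estimate for the shared-tree method.
    return t_access * n + t_select + t_backup + t_dnn_batch_n

def t_local(t_select, t_backup, t_pcie, t_dnn_batch_b, n):
    # Equation 5.4: the local-tree method overlaps in-tree work, PCIe transfers, and GPU compute.
    return max((t_select + t_backup) * n, t_pcie, t_dnn_batch_b)

def choose_method(profile, n, b):
    """profile: profiled latencies for one worker / one batch, measured offline on the target CPU-GPU."""
    shared = t_shared(profile["ddr_access"], profile["select"], profile["backup"],
                      profile["dnn_gpu"][n], n)
    local = t_local(profile["select"], profile["backup"],
                    profile["pcie"][b], profile["dnn_gpu"][b], n)
    return "shared-tree" if shared < local else "local-tree"

The efficient O(log N) search over B itself is described next.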
We achieve this by modeling the problem of finding the best-performing CUDA stream batch size B as the problem of finding the minimum value of a “V-sequence” T (T is the array of per-iteration latency across different values of B ∈ {1, ..., N}). Instead of testing every possible value of B ∈ [1, N], we sample a subset with a reduced complexity of O(log N), as shown in Algorithm 4. Note that this is the mirror of the problem of finding the maximum value of a bitonic sequence in O(log N) time using binary search.

Algorithm 4 Exploring the optimal batch size B
1: function FindMin(T, lo, hi)
2:   if lo == hi then
3:     return lo
4:   mid ← ⌊(lo + hi)/2⌋
5:   Test run with B = mid and B = mid + 1
6:   Record amortized latency T[mid], T[mid + 1]
7:   if T[mid] ≥ T[mid + 1] then
8:     return FindMin(T, mid + 1, hi)
9:   else
10:    return FindMin(T, lo, mid)

5.2.3 Evaluation

Benchmark and Hardware Platform. We use the Gomoku game benchmark to evaluate the performance of our proposed method. The board size is 15×15; the neural network is composed of 5 convolution layers and 3 fully-connected layers; the tree size limit per move is 1600 (i.e., the total number of selection-expansion-inference-backup operations performed per agent move is 1600). We use the AMD Ryzen Threadripper 3990X @ 2.2GHz as our target CPU platform. The CPU is connected to an NVIDIA RTX A6000 GPU through PCIe 4.0.

Design Exploration. We show the amortized worker-iteration latency $T^{CPU\text{-}GPU}_{local}$ obtained during the design configuration process for choosing the CUDA stream batch size B in Figure 5.9, specific to the local-tree method mapped to a CPU-GPU heterogeneous platform.

Figure 5.9: Design Exploration of Inference Batch Size

We observe that at smaller batch sizes, sub-batches of inferences are serialized, which hinders performance. At larger batch sizes, inferences are parallelized to a higher degree on the GPU, but the inference request is made only after waiting for all the serial in-tree operations to complete on the master thread, leading to a large overhead. Our design exploration finds the balance point where there are enough inferences within each sub-batch to saturate GPU parallelism, while enough sub-batch requests are made that the GPU computation can overlap with the computation on the CPU master thread (i.e., the GPU does not stay idle waiting for the CPU computation to finish). The resulting optimal batch sizes are 8 when N = 16, and 20 when N = 32 or 64.

Throughput Analysis. We plot the overall DNN-MCTS training throughput in terms of processed samples per second, calculated as $\frac{\text{Number of samples processed per episode}}{\text{Tree-based search time} + \text{DNN training time}}$, for both CPU-only and CPU-GPU platforms in Figure 5.10, varying the number of workers used in the tree-based search. In the CPU-only implementations, given the limited number of available CPU hardware threads, we allocate 32 threads for conducting training on the CPU (these are different threads from those used for the tree-based search workers). In contrast to GPU-accelerated training, CPU-based DNN training becomes the bottleneck even for a small number of DNN-MCTS workers. Therefore, the throughput improvements from increasing the number of DNN-MCTS workers are not as scalable as those from the CPU-GPU implementations. Overall, in the CPU-GPU implementations, as the number of workers increases, we observe near-linear improvements in throughput, since the time spent producing the same number of samples in the tree-based search is reduced.
When the number of workers increases above 16, the tree-based search time is reduced to the extent that it is lower than the training time. As a result, the throughput improvement becomes less obvious. Overall, we are able to adaptively choose the best-performing parallel method and design configurations, enabling near-linear scalability with respect to the number of workers in tree-based search acceleration.

Figure 5.10: Training throughput under optimal configurations

Algorithm Performance. To demonstrate that our implementation leads to speedups without negatively affecting policy loss convergence, we show the DNN loss over time as the measure of parallel DNN-MCTS training algorithm performance in Figure 5.11. The experiments are conducted using the optimal parallel methods returned for different numbers of parallel workers. The convergence curve is steeper for larger numbers of workers, meaning the wall-clock time taken to reach the same converged loss is reduced using the optimal parallel configurations of our adaptive parallel implementation. This confirms that the throughput improvements are effective toward the overall algorithm performance.

Figure 5.11: DNN loss over time

Chapter 6
Conclusion and Future Work

6.1 Summary of Contributions

In this dissertation, we improved the performance, portability, and productivity of developing RL systems using heterogeneous hardware, achieving significantly better resource utilization and execution speed compared with the state-of-the-art. We identified the key primitives and abstracted the generalized workflows for two key classes of RL algorithms: model-free Deep RL and model-based RL using MCTS. We designed hardware-algorithm co-optimized accelerators for the primitives, and developed generalized and portable system implementations for both classes of algorithms.

In the domain of model-free Deep RL, our contributions included the following:

• Acceleration of Primitives.

– DNN Policy Inference. We proposed a framework, DYNAMAP, to bridge the performance gap between the widely used “one size fits all” design methodology and an ideal design methodology in which dedicated layer-wise algorithm tuning is performed to realize low end-to-end latency and maximal hardware reuse across layers. To achieve this goal, we presented a linear-time dynamic algorithm mapping method based on PBQP, and developed an accelerator that can support various GEMM-based CONV algorithms for different layers. Our designs achieved up to 2.8× and 1.4× end-to-end latency improvements compared with state-of-the-art FPGA implementations on two classes of CNNs, GoogleNet and Inception-V4.

– DNN Policy Training. We developed a methodology to eliminate zero computations in fractionally strided convolution for high-throughput CNN training. We achieved this by adapting a multi-channel multi-kernel parallel algorithm, kn2row. We further developed a unified accelerator for kn2row-based convolution and FSC operations. Benefiting from the compute reduction of kn2row, we achieved up to 14.6× improvement in effective resource utilization in typical convolutional auto-decoding layers, GAN layers, and the backward pass of Nature-CNN, an RL policy model. These led to overall speedups of up to 3.8× in the complete forward or backward propagation phases of the above benchmarks. Our methodology also delivered up to 8× speedup and 11× better power efficiency than general-purpose processors.

• System Solution.
– We introduced a framework for composing parallel DRL systems on heterogeneous platforms consisting of general-purpose processors (CPUs) and accelerators (GPUs, FPGAs). Our innovations included:

1. A general training protocol agnostic of the underlying hardware, enabling portable implementations across various processors and accelerators.
2. Efficient design exploration and automatic task placement for parallelizing tasks within each DRL primitive over one or multiple heterogeneous devices.
3. Incorporation of DRL-specific optimizations in runtime scheduling and resource allocation, facilitating parallelized training and enhancing the overall system performance.
4. A high-level API for productive development using the framework.

We showcased our framework through experimentation with three widely used DRL algorithms, DQN, DDPG, and SAC, on three heterogeneous platforms with diverse hardware characteristics and interconnections. The generated implementations outperformed state-of-the-art libraries for CPU-GPU platforms with throughput improvements of up to 2×, and achieved 1.7× higher performance portability across platforms.

In the domain of model-based RL using MCTS, our contributions include the following:

• Acceleration of Primitives.

– We developed a scalable FPGA accelerator encapsulating the in-tree operations. It supported in-tree operations on dynamically evolving trees without expensive hardware reconfiguration through a custom butterfly-based interconnection between the computing units and the memory banks. Based on our interconnection design, we proposed an on-chip memory bank assignment algorithm for MCTS tree construction that minimizes runtime bank conflicts during all the in-tree operations. This led to a 6 to 35× speedup for in-tree operations compared with a multi-core CPU.

• System Solutions.

– We proposed an MCTS acceleration framework for CPU-FPGA heterogeneous platforms that adopted a hybrid parallel execution model to fully exploit the compute power of a CPU-FPGA heterogeneous system. The framework supported a Python-based programming API for easy integration of the proposed accelerator with RL domain-specific benchmarking libraries at run-time. We showed that, using our framework, we achieved up to 6.8× speedup and superior scalability with the number of parallel workers compared with state-of-the-art parallel MCTS on multi-core systems.

– We analyzed the trade-offs of existing MCTS parallel schemes, the shared-tree and local-tree methods, on CPU platforms. We proposed an adaptive mapping scheme that optimally selects the MCTS component's parallel scheme on the CPU. Additionally, we developed an efficient method for determining the optimal communication batch size when the CPU interfaces with DNN operations on an accelerator (GPU). Using a DNN-MCTS algorithm on board game benchmarks, we demonstrated that our approach adaptively generates the best-performing parallel implementation, achieving speedups ranging from 1.5× to 3× compared with baseline methods.

6.2 Future Work

6.2.1 Emerging Composable Heterogeneous Platforms

While this thesis focused on heterogeneous platforms with a given and fixed set of hardware resources, some data centers have composable infrastructure. In composable heterogeneous platforms, users can flexibly create and deploy heterogeneous architectures using a pool of disaggregated and heterogeneous compute devices, storage, and network fabric [80, 45, 1, 120].
Composable infrastructure further allows configuring the heterogeneous resources adaptively, optimizing the platform for executing diverse RL algorithms. Extending the proposed adaptive parallel schemes, future work could explore dynamic resource allocation algorithms that adjust the composition of computing resources in real time based on the specific demands of different RL tasks. This involves developing intelligent schedulers that can dynamically allocate CPUs, GPUs, FPGAs, and memory resources from a disaggregated pool to optimize performance and efficiency for various RL workloads. It also involves future research on developing scalable multi-tenant RL frameworks [141] that efficiently share and manage resources in a composable infrastructure. Such frameworks would benefit cloud-based RL services, allowing multiple users to leverage heterogeneous resources without interference. This includes creating mechanisms to ensure fair resource distribution among multiple concurrent RL tasks, optimizing throughput, and minimizing latency.

6.2.2 Multi-Agent RL Systems

Multi-agent reinforcement learning (MARL) extends traditional RL to environments where multiple agents interact and learn simultaneously [14]. This presents new challenges and opportunities for algorithm and hardware co-design. Future work could investigate efficient architectures and frameworks to support MARL, addressing issues such as inter-agent communication, coordination, and scalability. Developing specialized accelerators for primitives unique to MARL, such as the multi-head attention mechanisms used to distribute workloads and messages among multiple agents [57], and policies supporting graph-based state-action representations [113, 60], could lead to significant performance improvements. Additionally, exploring novel parallelization strategies, communication protocols, and training protocols tailored for multi-agent settings could enhance the robustness and efficiency of MARL systems.

6.2.3 Distributed RL for Emerging Large Models

As RL models continue to grow in complexity and size, distributed training becomes essential for handling the computational and memory demands of these large models. This is particularly relevant for training large language models and implementing reinforcement learning with human feedback (RLHF) [101], both of which require substantial computational resources and efficient parallelization strategies.

Training Large Language Models. The training of large language models, such as GPT-3 and beyond, involves processing vast amounts of data and performing extensive computations across many layers of transformer-based neural networks. Distributed RL frameworks can be leveraged to handle this scale, providing several research directions: (1) Parallelization. Implementing model and data parallelism techniques allows the distribution of transformer computations across multiple nodes, reducing training time. This involves partitioning the model and data into chunks that can be processed concurrently, while minimizing the data traffic across distributed and heterogeneous compute nodes. (2) Resource Management. Effective resource management algorithms can dynamically allocate computational resources based on workload demands, optimizing the use of CPUs, GPUs, and specialized accelerators like FPGAs and TPUs.

Reinforcement Learning with Human Feedback (RLHF).
RLHF involves training RL agents using feedback from human evaluators, combining human insights with automated learning to improve performance on complex tasks. The predominant RL algorithm for fine-tuning LLMs is Proximal Policy Optimization (PPO) [118]. This process leverages an Actor-Critic architecture [91], comprising four interconnected models: a Reference model, a Reward model, an Actor model, and a Critic model. The pipeline is segmented into three sequential stages: query generation, inference on all the models, and training of the Actor and Critic models. Future research on distributed and accelerated RLHF frameworks can improve current methods by adopting adaptive model placement strategies. The current mainstream parallel RLHF training methods, such as Transformer Reinforcement Learning [137], ColossalChat [25], and DeepSpeed-Chat [30], adopt a fixed model placement strategy, treating all four models as a single entity to be placed on the same GPU and distributing the batched inputs across all devices, regardless of the different workloads inherent to each model. Future research should develop adaptive separation and model placement strategies that support flexible model mapping, interleaving the tasks of the four models and reducing memory redundancy and communication for RLHF training on heterogeneous (and distributed) devices in modern data centers.

6.2.4 Real-Time RL Deployment on the Edge

Deploying RL models on edge devices is prevalent for managing communication, caching, and control in IoT applications such as smart grids, intelligent transportation systems, and mobile crowdsensing [71, 20]. While this thesis focuses on the training-in-simulation phase of RL development, future work could focus on optimizing the edge deployment of RL applications. In such edge applications, low-latency, real-time decision-making is crucial. RL policy models and replay buffers need to fit the constraints of edge devices, such as limited power, memory, and computational capacity. This involves developing lightweight RL models, efficient compression techniques for real-time environmental data and models, partitioning and distributing the replay buffer over edge devices, and hardware accelerators tailored for sensors and edge environments. Additionally, adaptive algorithms that can learn and make decisions locally on edge devices while occasionally synchronizing with central servers could enhance the robustness and responsiveness of edge-based RL systems.

Bibliography

[1] Nikolaos Alachiotis, Andreas Andronikakis, Orion Papadakis, Dimitris Theodoropoulos, Dionisios Pnevmatikatos, Dimitris Syrivelis, Andrea Reale, Kostas Katrinis, George Zervas, Vaibhawa Mishra, et al. “dReDBox: A disaggregated architectural perspective for data centers”. In: Hardware Accelerators in Data Centers (2019), pp. 35–56.

[2] Shaikhah AlEbrahim and Imtiaz Ahmad. “Task scheduling for heterogeneous computing systems”. In: The Journal of Supercomputing 73 (2017), pp. 2313–2338.

[3] Shun-ichi Amari. “Backpropagation and stochastic gradient descent method”. In: Neurocomputing 5.4-5 (1993), pp. 185–196.

[4] AMD. AMD Heterogeneous Accelerated Compute Clusters. https://www.amd-haccs.io/.

[5] AMD. Zen Specification. 2019. url: https://www.7-cpu.com/cpu/Zen2.html.

[6] Andrew Anderson and David Gregg. “Optimal DNN primitive selection with partitioned boolean quadratic programming”. In: Proceedings of the 2018 International Symposium on Code Generation and Optimization. 2018, pp. 340–351.
[7] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. “Deep reinforcement learning: A brief survey”. In: IEEE Signal Processing Magazine 34.6 (2017), pp. 26–38. [8] Lorena A Barba, Andreas Klockner, Prabhu Ramachandran, and Rollin Thomas. “Scientific computing with Python on high-performance heterogeneous systems”. In: Computing in Science & Engineering 23.04 (2021), pp. 5–7. [9] Laurent Boué. “Deep learning for pedestrians: backpropagation in CNNs”. In: arXiv preprint arXiv:1811.11987 (2018). [10] Bruno Bouzy. “Monte-carlo fork search for cooperative path-finding”. In: Workshop on Computer Games. Springer. 2013, pp. 1–15. [11] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “Openai gym”. In: arXiv preprint arXiv:1606.01540 (2016). [12] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “Openai gym”. In: arXiv preprint arXiv:1606.01540 (2016). 134 [13] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. “A survey of monte carlo tree search methods”. In: IEEE Transactions on Computational Intelligence and AI in games 4.1 (2012), pp. 1–43. [14] Lucian Busoniu, Robert Babuska, and Bart De Schutter. “A comprehensive survey of multiagent reinforcement learning”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38.2 (2008), pp. 156–172. [15] Itai Caspi, Gal Leibovich, Gal Novik, and Shadi Endrawis. Reinforcement Learning Coach. Dec. 2017. doi: 10.5281/zenodo.1134899. [16] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. “Dopamine: A Research Framework for Deep Reinforcement Learning”. In: (2018). url: http://arxiv.org/abs/1812.06110. [17] Tristan Cazenave and Nicolas Jouandeau. “On the parallelization of UCT”. In: Computer games workshop. 2007. [18] Guillaume MJ-B Chaslot, Mark HM Winands, and HJVD Herik. “Parallel monte-carlo tree search”. In: International Conference on Computers and Games. Springer. 2008, pp. 60–71. [19] Guillaume MJ-B Chaslot, Mark HM Winands, and HJVD Herik. “Parallel monte-carlo tree search”. In: International Conference on Computers and Games. Springer. 2008, pp. 60–71. [20] Wuhui Chen, Xiaoyu Qiu, Ting Cai, Hong-Ning Dai, Zibin Zheng, and Yan Zhang. “Deep reinforcement learning for Internet of Things: A comprehensive survey”. In: IEEE Communications Surveys & Tutorials 23.3 (2021), pp. 1659–1692. [21] Xinyu Chen, Hongshi Tan, Yao Chen, Bingsheng He, Weng-Fai Wong, and Deming Chen. “ThunderGP: HLS-based graph processing framework on FPGAs”. In: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2021, pp. 69–80. [22] Yanqiu Cheng, Xianbiao Hu, Qing Tang, Hongsheng Qi, and Hong Yang. “Monte carlo tree search-based mixed traffic flow control algorithm for arterial intersections”. In: Transportation research record 2674.8 (2020), pp. 167–178. [23] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. “cudnn: Efficient primitives for deep learning”. In: arXiv preprint arXiv:1410.0759 (2014). [24] Hyungmin Cho, Pyeongseok Oh, Jiyoung Park, Wookeun Jung, and Jaejin Lee. “FA3C: FPGA-Accelerated Deep Reinforcement Learning”. 
In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM. 2019, pp. 499–513. [25] Colossal-AI: Making large AI models cheaper, faster, and more accessible. https://github.com/hpcaitech/ColossalAI/tree/main. Accessed: 05.26.2024. 135 [26] Lucileide MD Da Silva, Matheus F Torquato, and Marcelo AC Fernandes. “Parallel implementation of reinforcement learning Q-learning technique for FPGA”. In: IEEE Access 7 (2018), pp. 2782–2798. [27] Tuan Dam, Georgia Chalvatzaki, Jan Peters, and Joni Pajarinen. “Monte-carlo robot path planning”. In: IEEE Robotics and Automation Letters 7.4 (2022), pp. 11213–11220. [28] Pritam Damania, Shen Li, Alban Desmaison, Alisson Azzolini, Brian Vaughan, Edward Yang, Gregory Chanan, Guoqiang Jerry Chen, Hongyi Jia, Howard Huang, et al. “Pytorch rpc: Distributed deep learning built on tensor-optimized remote procedure calls”. In: Proceedings of Machine Learning and Systems 5 (2023). [29] Jeffrey Dean and Sanjay Ghemawat. “MapReduce: simplified data processing on large clusters”. In: Communications of the ACM 51.1 (2008), pp. 107–113. [30] DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales. https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat. Accessed: 05.26.2024. [31] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. 2017. [32] Roberto DiCecco, Lin Sun, and Paul Chow. “FPGA-based training of convolutional neural networks with a reduced precision floating-point library”. In: 2017 International Conference on Field Programmable Technology (ICFPT). IEEE. 2017, pp. 239–242. [33] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. “Benchmarking deep reinforcement learning for continuous control”. In: International conference on machine learning. PMLR. 2016, pp. 1329–1338. [34] Richard J Duffin. “Topology of series-parallel networks”. In: Journal of Mathematical Analysis and Applications 10.2 (1965), pp. 303–318. [35] Vincent Dumoulin and Francesco Visin. “A guide to convolution arithmetic for deep learning”. In: arXiv preprint arXiv:1603.07285 (2016). [36] Erik Eckstein, Oliver König, and Bernhard Scholz. “Code instruction selection based on SSA-graphs”. In: International Workshop on Software and Compilers for Embedded Systems. Springer. 2003, pp. 49–65. [37] William A Falcon. “Pytorch lightning”. In: GitHub 3 (2019). [38] Scott Fujimoto, Herke Hoof, and David Meger. “Addressing function approximation error in actor-critic methods”. In: International conference on machine learning. PMLR. 2018, pp. 1587–1596. [39] Pranay Reddy Gankidi. “FPGA accelerator architecture for Q-learning and its applications in space exploration rovers”. PhD thesis. Arizona State University, 2016. 136 [40] Ce Guo, Wayne Luk, Stanley Qing Shui Loh, Alexander Warren, and Joshua Levine. “Customisable Control Policy Learning for Robotics”. In: 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP). Vol. 2160. IEEE. 2019, pp. 91–98. [41] Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. “Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning”. In: Advances in neural information processing systems 27 (2014). [42] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 
“Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor”. In: International conference on machine learning. PMLR. 2018, pp. 1861–1870. [43] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor”. In: International conference on machine learning. PMLR. 2018, pp. 1861–1870. [44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778. [45] Lorraine M Herger, Kaoutar El Maghraoui, I-Hsin Chung, Chekuri Choudary, Kim Tran, and Todd Deshane. “Toward an enterprise-ready composable infrastructure as a service”. In: 2021 IEEE International Conference on Services Computing (SCC). IEEE. 2021, pp. 116–125. [46] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. “Rainbow: Combining improvements in deep reinforcement learning”. In: Proceedings of the AAAI conference on artificial intelligence. Vol. 32. 1. 2018. [47] Matthew W Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Nikola Momchev, Danila Sinopalnikov, Piotr Stańczyk, Sabela Ramos, Anton Raichuk, Damien Vincent, et al. “Acme: A research framework for distributed reinforcement learning”. In: arXiv preprint arXiv:2006.00979 (2020). [48] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. “Distributed prioritized experience replay”. In: arXiv preprint arXiv:1803.00933 (2018). [49] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. “Mobilenets: Efficient convolutional neural networks for mobile vision applications”. In: arXiv preprint arXiv:1704.04861 (2017). [50] Tianyi Huang, Min Li, Xiaolong Qin, and William Zhu. “A CNN-based policy for optimizing continuous action control by learning state sequences”. In: Neurocomputing 468 (2022), pp. 286–295. [51] Intel. Intel Heterogeneous DevCloud. https://devcloud.intel.com/oneapi/. 137 [52] Intel Extension for PyTorch. 2022. url: https://github.com/intel/intel-extension-for-pytorch. [53] Intel. OneAPI for Heterogeneous Cloud. url: https://www.intel.com/content/www/us/en/developer/articles/technical/comparing-cpus-gpusand-fpgas-for-oneapi.html. [54] Intel. SkyLake Specification. 2018. url: https://www.7-cpu.com/cpu/Skylake.html. [55] Intel. Xeon Gold 5120. url: https://ark.intel.com/content/www/us/en/ark/products/120474/intelxeon-gold-5120-processor-19-25m-cache-2-20-ghz.html. [56] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167 (2015). [57] Shariq Iqbal and Fei Sha. “Actor-attention-critic for multi-agent reinforcement learning”. In: International conference on machine learning. PMLR. 2019, pp. 2961–2970. [58] Ali Jahanshahi, Mohammad Kazem Taram, and Nariman Eskandari. “Blokus Duo game on FPGA”. In: The 17th CSI International Symposium on Computer Architecture & Digital Systems (CADS 2013). IEEE. 2013, pp. 149–152. [59] Wenzel Jakob. PyBind11. 2022. url: https://github.com/pybind/pybind11. [60] Jiechuan Jiang, Chen Dun, Tiejun Huang, and Zongqing Lu. “Graph convolutional reinforcement learning”. In: arXiv preprint arXiv:1810.09202 (2018). 
[61] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. “Highly accurate protein structure prediction with AlphaFold”. In: Nature 596.7873 (2021), pp. 583–589. [62] Hongshin Jun, Jinhee Cho, Kangseol Lee, Ho-Young Son, Kwiwook Kim, Hanho Jin, and Keith Kim. “Hbm (high bandwidth memory) dram technology and architecture”. In: 2017 IEEE International Memory Workshop (IMW). IEEE. 2017, pp. 1–4. [63] Bilal Kartal, Pablo Hernandez-Leal, and Matthew E Taylor. “Action guidance with MCTS for deep reinforcement learning”. In: Proceedings of the AAAI conference on artificial intelligence and interactive digital entertainment. Vol. 15. 1. 2019, pp. 153–159. [64] Vinod Kathail. “Xilinx vitis unified software platform”. In: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2020, pp. 173–174. [65] Hideki Kato and Ikuo Takeuchi. “Parallel monte-carlo tree search with simulation servers”. In: 2010 International Conference on Technologies and Applications of Artificial Intelligence. IEEE. 2010, pp. 491–498. [66] Hideki Kato and Ikuo Takeuchi. “Parallel monte-carlo tree search with simulation servers”. In: 2010 International Conference on Technologies and Applications of Artificial Intelligence. IEEE. 2010, pp. 491–498. 138 [67] Juhwan Kim, Byeongmin Kang, and Hyungmin Cho. “SpecMCTS: Accelerating Monte Carlo Tree Search Using Speculative Tree Traversal”. In: IEEE Access 9 (2021), pp. 142195–142205. [68] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105. [69] Andrew Lavin and Scott Gray. “Fast algorithms for convolutional neural networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 4013–4021. [70] Daniel Pinheiro Leal, Midori Sugaya, Hideharu Amano, and Takeshi Ohkawa. “Fpga acceleration of ros2-based reinforcement learning agents”. In: 2020 Eighth International Symposium on Computing and Networking Workshops (CANDARW). IEEE. 2020, pp. 106–112. [71] Sangyoon Lee and Dae-Hyun Choi. “Energy management of smart home with home appliances, energy storage system and electric vehicle: A hierarchical deep reinforcement learning approach”. In: Sensors 20.7 (2020), p. 2157. [72] Kexin Li, Qianwang Deng, Like Zhang, Qing Fan, Guiliang Gong, and Sun Ding. “An effective MCTS-based algorithm for minimizing makespan in dynamic flexible job shop scheduling problem”. In: Computers & Industrial Engineering 155 (2021), p. 107211. [73] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. “Stereo r-cnn based 3d object detection for autonomous driving”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 7644–7652. [74] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. “Pytorch distributed: Experiences on accelerating data parallel training”. In: arXiv preprint arXiv:2006.15704 (2020). [75] Yuxi Li and Dale Schuurmans. “Mapreduce for parallel reinforcement learning”. In: European Workshop on Reinforcement Learning. Springer. 2011, pp. 309–320. [76] Yuxi Li and Dale Schuurmans. “Mapreduce for parallel reinforcement learning”. In: European Workshop on Reinforcement Learning. Springer. 2011, pp. 309–320. 
[77] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. “RLlib: Abstractions for Distributed Reinforcement Learning”. In: International Conference on Machine Learning (ICML). 2018. [78] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Goldberg, and Ion Stoica. “Ray rllib: A composable and scalable reinforcement learning library”. In: arXiv preprint arXiv:1712.09381 85 (2017). [79] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. “Continuous control with deep reinforcement learning”. In: arXiv preprint arXiv:1509.02971 (2015). 139 [80] Liqid Composable Infrastructure. https://www.liqid.com/solutions. Accessed: 11.08.2023. [81] Anji Liu, Jianshu Chen, Mingze Yu, Yu Zhai, Xuewen Zhou, and Ji Liu. “Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search”. In: International Conference on Learning Representations. 2020. url: https://openreview.net/forum?id=BJlQtJSKDB. [82] Yongshuai Liu, Avishai Halev, and Xin Liu. “Policy learning with constraints in model-free reinforcement learning: A survey”. In: The 30th international joint conference on artificial intelligence (ijcai). 2021. [83] Zhiqiang Liu, Yong Dou, Jingfei Jiang, Qiang Wang, and Paul Chow. “An FPGA-based processor for training convolutional neural networks”. In: 2017 International Conference on Field Programmable Technology (ICFPT). IEEE. 2017, pp. 207–210. [84] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. “Automatic compilation of diverse cnns onto high-performance fpga accelerators”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2018). [85] Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. “Image restoration using convolutional auto-encoders with symmetric skip connections”. In: arXiv preprint arXiv:1606.08921 (2016). [86] Yuan Meng, Rajgopal Kannan, and Viktor Prasanna. “A framework for monte-carlo tree search on cpu-fpga heterogeneous platform via on-chip dynamic tree management”. In: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 2023, pp. 235–245. [87] Yuan Meng, Rajgopal Kannan, and Viktor Prasanna. “Accelerating Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform”. In: 2022 International Conference on Field-Programmable Logic and Applications (FPL). IEEE. 2022. [88] Yuan Meng, Sanmukh Kuppannagari, Rajgopal Kannan, and Viktor Prasanna. “DYNAMAP: Dynamic Algorithm Mapping Framework for Low Latency CNN Inference”. In: arXiv e-prints, arXiv:2012.00912 (Dec. 2020), arXiv:2012.00912. arXiv: 2012.00912 [cs.DC]. [89] Yuan Meng, Sanmukh Kuppannagari, and Viktor Prasanna. “Accelerating proximal policy optimization on cpu-fpga heterogeneous platforms”. In: 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE. 2020, pp. 19–27. [90] S Ali Mirsoleimani, H Jaap van den Herik, Aske Plaat, and Jos Vermaseren. “Pipeline Pattern for Parallel MCTS.” In: ICAART (2). 2018, pp. 614–621. [91] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. “Asynchronous methods for deep reinforcement learning”. In: International conference on machine learning. PMLR. 2016, pp. 1928–1937. [92] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 
“Playing atari with deep reinforcement learning”. In: arXiv preprint arXiv:1312.5602 (2013). 140 [93] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. “Human-level control through deep reinforcement learning”. In: nature 518.7540 (2015), pp. 529–533. [94] Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker, et al. “Model-based reinforcement learning: A survey”. In: Foundations and Trends® in Machine Learning 16.1 (2023), pp. 1–118. [95] Aaftab Munshi. “The opencl specification”. In: 2009 IEEE Hot Chips 21 Symposium (HCS). IEEE. 2009, pp. 1–314. [96] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. “Massively parallel methods for deep reinforcement learning”. In: arXiv preprint arXiv:1507.04296 (2015). [97] Hiroki Nakahara, Youki Sada, Masayuki Shimoda, Kouki Sayama, Akira Jinguji, and Shimpei Sato. “FPGA-based training accelerator utilizing sparseness of convolutional neural network”. In: 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE. 2019, pp. 180–186. [98] Yue Niu, Rajgopal Kannan, Ajitesh Srivastava, and Viktor Prasanna. “Reuse Kernels or Activations? A Flexible Dataflow for Low-latency Spectral CNN Acceleration”. In: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2020, pp. 266–276. [99] Nvidia. NVLink. url: https://www.nvidia.com/en-us/data-center/nvlink/. [100] Nvidia. Titan Xp. url: https://www.nvidia.com/en-us/titan/titan-xp/. [101] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. “Training language models to follow instructions with human feedback”. In: Advances in neural information processing systems 35 (2022), pp. 27730–27744. [102] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. “Pytorch: An imperative style, high-performance deep learning library”. In: Advances in neural information processing systems 32 (2019). [103] S John Pennycook, Jason D Sewall, Douglas W Jacobsen, Tom Deakin, and Simon McIntosh-Smith. “Navigating performance, portability, and productivity”. In: Computing in Science & Engineering 23.5 (2021), pp. 28–38. [104] Diego Perez, Spyridon Samothrakis, and Simon Lucas. “Knowledge-based fast evolutionary MCTS for general video game playing”. In: 2014 IEEE Conference on Computational Intelligence and Games. IEEE. 2014, pp. 1–8. 141 [105] Stefania Perri, Cristian Sestito, Fanny Spagnolo, and Pasquale Corsonello. “Efficient Deconvolution Architecture for Heterogeneous Systems-on-Chip”. In: Journal of Imaging 6.9 (2020), p. 85. [106] Abhinav Podili, Chi Zhang, and Viktor Prasanna. “Fast and efficient implementation of Convolutional Neural Networks on FPGA”. In: 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE. 2017, pp. 11–18. [107] Marius-Constantin Popescu, Valentina E Balas, Liliana Perescu-Popescu, and Nikos Mastorakis. “Multilayer perceptron and neural networks”. In: WSEAS Transactions on Circuits and Systems 8.7 (2009), pp. 579–588. [108] Ehsan Qasemi, Amir Samadi, Mohammad H Shadmehr, Bardia Azizian, Sajjad Mozaffari, Amir Shirian, and Bijan Alizadeh. 
“Highly scalable, shared-memory, Monte-Carlo tree search based Blokus Duo Solver on FPGA”. In: 2014 International Conference on Field-Programmable Technology (FPT). IEEE. 2014, pp. 370–373. [109] Alec Radford, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks”. In: arXiv preprint arXiv:1511.06434 (2015). [110] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. “Stable-Baselines3: Reliable Reinforcement Learning Implementations”. In: Journal of Machine Learning Research 22.268 (2021), pp. 1–8. url: http://jmlr.org/papers/v22/20-1364.html. [111] Rachit Rajat, Yuan Meng, Sanmukh Kuppannagari, Ajitesh Srivastava, Viktor Prasanna, and Rajgopal Kannan. “Qtaccel: A generic fpga based design for q-table based reinforcement learning accelerators”. In: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2020, pp. 323–323. [112] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The annals of mathematical statistics (1951), pp. 400–407. [113] Heechang Ryu, Hayong Shin, and Jinkyoo Park. “Multi-agent actor-critic with hierarchical graph attention network”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 05. 2020, pp. 7236–7243. [114] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. “Prioritized experience replay”. In: arXiv preprint arXiv:1511.05952 (2015). [115] Bernhard Scholz. Partitioned Boolean Quadratic Programming. 2003. url: http://www.complang.tuwien.ac.at/scholz/pbqp.html. [116] Bernhard Scholz and Erik Eckstein. “Register allocation for irregular architectures”. In: Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems. 2002, pp. 139–148. 142 [117] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. “Trust region policy optimization”. In: International conference on machine learning. PMLR. 2015, pp. 1889–1897. [118] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017). [119] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017). [120] Yizhou Shan, Will Lin, Zhiyuan Guo, and Yiying Zhang. “Towards a fully disaggregated and programmable data center”. In: Proceedings of the 13th ACM SIGOPS Asia-Pacific Workshop on Systems. 2022, pp. 18–28. [121] Shengjia Shao and Wayne Luk. “Customised pearlmutter propagation: A hardware architecture for trust region policy optimisation”. In: 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE. 2017, pp. 1–6. [122] Shengjia Shao, Jason Tsai, Michal Mysior, Wayne Luk, Thomas Chau, Alexander Warren, and Ben Jeppesen. “Towards hardware accelerated reinforcement learning for application-specific robotic control”. In: 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE. 2018, pp. 1–8. [123] Debendra Das Sharma. “Compute Express Link®: An open industry-standard interconnect enabling heterogeneous data-centric computing”. In: 2022 IEEE Symposium on High-Performance Interconnects (HOTI). IEEE. 2022, pp. 5–12. 
[124] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. “Mastering the game of Go with deep neural networks and tree search”. In: nature 529.7587 (2016), pp. 484–489. [125] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. “Mastering chess and shogi by self-play with a general reinforcement learning algorithm”. In: arXiv preprint arXiv:1712.01815 (2017). [126] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014). [127] Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. “End-to-End Optimization of Deep Learning Applications”. In: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2020, pp. 133–139. [128] Jiang Su, Jianxiong Liu, David B Thomas, and Peter YK Cheung. “Neural network based reinforcement learning acceleration on FPGA platforms”. In: ACM SIGARCH Computer Architecture News 44.4 (2017), pp. 68–73. [129] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. 143 [130] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. “Inception-v4, inception-resnet and the impact of residual connections on learning”. In: arXiv preprint arXiv:1602.07261 (2016). [131] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. “Inception-v4, inception-resnet and the impact of residual connections on learning”. In: Thirty-first AAAI conference on artificial intelligence. 2017. [132] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going deeper with convolutions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9. [133] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. “Rethinking the inception architecture for computer vision”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 2818–2826. [134] Lei Tai and Ming Liu. “Mobile robots exploration through cnn-based reinforcement learning”. In: Robotics and biomimetics 3 (2016), pp. 1–8. [135] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. “Deepmind control suite”. In: arXiv preprint arXiv:1801.00690 (2018). [136] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. “Performance-effective and low-complexity task scheduling for heterogeneous computing”. In: IEEE transactions on parallel and distributed systems 13.3 (2002), pp. 260–274. [137] Transformer Reinforcement Learning. https://github.com/huggingface/trl. Accessed: 05.26.2024. [138] Brian Van Essen, Chris Macaraeg, Maya Gokhale, and Ryan Prenger. “Accelerating a random forest classifier: Multi-core, GP-GPU, or FPGA?” In: 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines. IEEE. 2012, pp. 232–239. [139] Aravind Vasudevan, Andrew Anderson, and David Gregg. “Parallel multi channel convolution using general matrix multiplication”. In: 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE. 2017, pp. 19–24. 
[140] Shreyas Kolala Venkataramanaiah, Yufei Ma, Shihui Yin, Eriko Nurvithadhi, Aravind Dasu, Yu Cao, and Jae-sun Seo. “Automatic compiler based FPGA accelerator for CNN training”. In: 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE. 2019, pp. 166–172. [141] Irene Vilà, Jordi Pérez-Romero, Oriol Sallent, and Anna Umbert. “A multi-agent reinforcement learning approach for capacity sharing in multi-tenant scenarios”. In: IEEE Transactions on vehicular Technology 70.9 (2021), pp. 9450–9465. 144 [142] Linnan Wang, Yiyang Zhao, Yuu Jinnai, Yuandong Tian, and Rodrigo Fonseca. “Alphax: exploring neural architectures with deep neural networks and monte carlo tree search”. In: arXiv preprint arXiv:1903.11059 (2019). [143] Xuechao Wei, Yun Liang, and Jason Cong. “Overcoming data transfer bottlenecks in FPGA-based DNN accelerators via layer conscious memory management”. In: 2019 56th ACM/IEEE Design Automation Conference (DAC). IEEE. 2019, pp. 1–6. [144] Xilinx. Alveo U250 Data Center Accelerator Card. 2020. url: https://www.xilinx.com/products/boards-and-kits/alveo/u250.html. [145] Xilinx. Large FPGA methodology guide. url: https://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/ug872_largefpga.pdf. 2012.4.24. [146] AMD Xilinx. XRT Profiling. 2022. url: https://docs.xilinx.com/r/en-US/ug1393-vitisapplication-acceleration/Profiling-the-Application. [147] Dawen Xu, Kaijie Tu, Ying Wang, Cheng Liu, Bingsheng He, and Huawei Li. “FCN-engine: Accelerating deconvolutional layers in classic CNN processors”. In: Proceedings of the International Conference on Computer-Aided Design. 2018, pp. 1–6. [148] Abdurrahman Yasar, Sivasankaran Rajamanickam, Jonathan W Berry, and Umit V Catalyurek. “PGAbB: A Block-Based Graph Processing Framework for Heterogeneous Platforms”. In: arXiv preprint arXiv:2209.04541 (2022). [149] Amir Yazdanbakhsh, Michael Brzozowski, Behnam Khaleghi, Soroush Ghodrati, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. “Flexigan: An end-to-end solution for fpga acceleration of generative adversarial networks”. In: 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE. 2018, pp. 65–72. [150] Hanchen Ye, Xiaofan Zhang, Zhize Huang, Gengsheng Chen, and Deming Chen. “HybridDNN: A Framework for High-Performance Hybrid DNN Accelerator Design and Implementation”. In: arXiv preprint arXiv:2004.03804 (2020). [151] Xiaoyu Yu, Yuwei Wang, Jie Miao, Ephrem Wu, Heng Zhang, Yu Meng, Bo Zhang, Biao Min, Dewei Chen, and Jianlin Gao. “A data-center FPGA acceleration platform for convolutional neural networks”. In: 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE. 2019, pp. 151–158. [152] Hanqing Zeng, Ren Chen, Chi Zhang, and Viktor Prasanna. “A framework for generating high throughput CNN implementations on FPGAs”. In: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2018, pp. 117–126. [153] Chi Zhang, Sanmukh Rao Kuppannagari, and Viktor K Prasanna. “Parallel actors and learners: A framework for generating scalable rl implementations”. In: 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC). IEEE. 2021, pp. 1–10. 145 [154] Chi Zhang, Yuan Meng, and Viktor Prasanna. “A framework for mapping drl algorithms with prioritized replay buffer onto heterogeneous platforms”. In: IEEE Transactions on Parallel and Distributed Systems (2023). 
[155] Wentai Zhang, Ming Jiang, and Guojie Luo. “Evaluating Low-Memory GEMMs for Convolutional Neural Network Inference on FPGAs”. In: 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE. 2020, pp. 28–32. [156] Wentai Zhang, Ming Jiang, and Guojie Luo. “Evaluating Low-Memory GEMMs for Convolutional Neural Network Inference on FPGAs”. In: 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE. 2020, pp. 28–32. [157] Xinyu Zhang, Srinjoy Das, Ojash Neopane, and Ken Kreutz-Delgado. “A design methodology for efficient implementation of deconvolutional neural networks on an FPGA”. In: arXiv preprint arXiv:1705.02583 (2017). [158] Wenlai Zhao, Haohuan Fu, Wayne Luk, Teng Yu, Shaojun Wang, Bo Feng, Yuchun Ma, and Guangwen Yang. “F-CNN: An FPGA-based framework for training convolutional neural networks”. In: 2016 IEEE 27Th international conference on application-specific systems, architectures and processors (ASAP). IEEE. 2016, pp. 107–114. [159] TAN Ziya and Mehmet KARAKOSE. “Comparative study for deep reinforcement learning with CNN, RNN, and LSTM in autonomous navigation”. In: 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI). IEEE. 2020, pp. 1–5. 146