METASCALABLE HYBRID MESSAGE-PASSING AND MULTITHREADING ALGORITHMS FOR N-TUPLE COMPUTATION

by

Manaschai Kunaseth

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 13th, 2013

© 2013 Manaschai Kunaseth

Abstract

The emergence of the multicore era has granted unprecedented computing capabilities. Widely available multicore clusters have made hybrid message-passing and multithreading parallel algorithms a standard parallelization approach for modern clusters. However, hybrid parallel applications with portable scalability on emerging high-end multicore clusters consisting of multimillion cores have yet to be accomplished. Achieving scalability on emerging multicore platforms is an enormous challenge, since we do not even know the architecture of future platforms, with new hardware features such as hardware transactional memory (HTM) constantly being deployed.

Scalable implementation of molecular dynamics (MD) simulations on massively parallel computers has been one of the major driving forces of supercomputing technologies. In particular, recent advances in reactive MD simulations based on many-body interatomic potentials have necessitated efficient dynamic n-tuple computation. Hence, it is of great significance to develop scalable hybrid n-tuple computation algorithms that provide a viable foundation for high-performance parallel-computing software on forthcoming architectures.

This dissertation research develops a scalable hybrid message-passing and multithreading algorithm for n-tuple MD simulation, which will continue to scale on future architectures (i.e., achieve metascalability). The two major goals of this dissertation research are: (1) design a scalable hybrid message-passing and multithreading parallel algorithmic framework on multicore architectures and evaluate it on the most advanced parallel architectures; and (2) develop a computation-pattern algebraic framework to design scalable algorithms for general n-tuple computation and prove its optimality in a systematic and mathematically rigorous manner.

To achieve the first goal, we have developed and thoroughly analyzed algorithms for hybrid message passing interface (MPI) + open multiprocessing (OpenMP) parallelization of n-tuple MD simulation, which are scalable on large multicore clusters. Two data-privatization thread scheduling algorithms via nucleation-growth allocation have been designed: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). These two algorithms combine fine-grain dynamic load balancing and minimal memory-footprint threading. Theoretical study has revealed good asymptotic memory efficiency for both algorithms, reducing memory consumption by 75% compared to a naïve-threading algorithm. Furthermore, performance benchmarks have confirmed higher performance of the hybrid MD algorithm over a traditional algorithm on large multicore clusters, where a 2.58-fold speedup of the hybrid algorithm over the traditional algorithm was observed on 32,768 nodes of IBM BlueGene/P.

We have also investigated the performance characteristics of HTM on the IBM BlueGene/Q computer in comparison with conventional concurrency control mechanisms, using an MD application as an example.
Benchmark tests, along with overhead-cost and scalability analysis, have quantified the relative performance advantages of HTM over other mechanisms. We found that the bookkeeping cost of HTM is high but that the rollback cost is low. We have proposed transaction fusion and spatially compact scheduling techniques to reduce the overhead of HTM with minimal programming effort. A strong scalability benchmark has shown that the fused HTM has the shortest runtime among various concurrency control mechanisms without extra memory. Based on the performance characterization, we have derived a decision tree in the concurrency-control design space for general multithreading applications.

To achieve the second goal, we have developed a computation-pattern algebraic framework to mathematically formulate general n-tuple computation. Based on translation/reflection-invariant properties of computation patterns within this framework, we have designed a shift-collapse (SC) algorithm for cell-based parallel MD. Theoretical analysis has quantified the compact n-tuple search space and small communication cost of SC-MD for arbitrary n, which reduce to those of the best pair-computation approaches (e.g., the eighth-shell method) for n = 2. Benchmark tests have shown that SC-MD outperforms our production MD code at the finest grain, with 9.7- and 5.1-fold speedups on Intel-Xeon and BlueGene/Q clusters. SC-MD has also exhibited excellent strong scalability.

In addition, we have analyzed the computational and data-access patterns of MD, which led to the development of a performance prediction model for short-range pair-wise force computations in MD simulations. The analysis and performance model provide fundamental understanding of computation patterns and the optimality of certain parameters in MD simulations, thus allowing scientists to determine the optimal cell dimension in a linked-list cell method. The model has accurately estimated the number of operations during the simulations, with a maximum error of 10.6% compared to actual measurements. Analysis and benchmarks of the model have revealed that the optimal cell dimension minimizing the computation time is determined by a trade-off between decreasing search space and increasing linked-list cell access for smaller cells.

One difficulty with MD is that it is a dynamically irregular application, which often suffers considerable performance deterioration during execution. To address this problem, an optimal data-reordering schedule has been developed for runtime memory-access optimization of MD simulations on parallel computers. Analysis of the memory-access penalty during MD simulations has shown that the performance improvement from computation and data reordering degrades gradually as data translation lookaside buffer misses increase. We have also found correlations between the performance degradation and physical properties such as the simulated temperature, as well as computational parameters such as the spatial-decomposition granularity. Based on a performance model and pre-profiling of data fragmentation behaviors, we have developed an optimal runtime data-reordering schedule, thereby achieving speedups of 1.35, 1.36, and 1.28, respectively, for MD simulations of silica at temperatures 300 K, 3,000 K, and 6,000 K.
The main contributions of this dissertation research are two-fold: a metascalable hybrid message-passing and multithreading parallel algorithmic framework on emerging multicore parallel clusters, and a novel computation-pattern algebraic framework to design scalable algorithms for general n-tuple computation and prove their optimality in a mathematically rigorous manner. We expect that the proposed hybrid algorithms and mathematical approaches will provide a generic framework for a broad range of applications on future extreme-scale computing platforms.

Table of Contents

Chapter 1  Introduction
  1.1  Motivations
  1.2  Challenges and Goals
  1.3  Background
    1.3.1  Molecular Dynamics Simulation
    1.3.2  Multicore Architectures
    1.3.3  Hybrid Parallelization
    1.3.4  n-Tuple Molecular Dynamics
    1.3.5  Recent Development of Parallel MD Algorithms
  1.4  Contributions
  1.5  Structure of This Dissertation
Chapter 2  Hybrid Message-Passing and Multithreading Parallelization
  2.1  Introduction
  2.2  Domain Decomposition Molecular Dynamics
    2.2.1  Internode Operations
    2.2.2  Intranode Operations
    2.2.3  Hybrid MPI/OpenMP Parallelization
  2.3  Data Privatization Scheduling Algorithm
    2.3.1  Naïve Data Privatization Threading Algorithm
    2.3.2  Nucleation-Growth Allocation Algorithm
  2.4  Theoretical Analysis
    2.4.1  Load Imbalance Analysis
    2.4.2  Memory Consumption Analysis
    2.4.3  Computation Complexity Analysis
  2.5  Performance Evaluation
    2.5.1  Scheduling Cost
    2.5.2  Thread-Level Load Balancing
    2.5.3  Memory Consumption
    2.5.4  Reduction-Sum Operation Cost
    2.5.5  Strong-Scaling Benchmark
  2.6  Summary
Chapter 3  Performance Characteristics of Concurrency Control Mechanisms
  3.1  Introduction
  3.2  Background
    3.2.1  Hybrid MPI/OpenMP Molecular Dynamics
    3.2.2  Race Conditions and Traditional Concurrency Control Mechanisms
    3.2.3  Hardware Transactional Memory on BlueGene/Q
  3.3  Performance Characterization of Concurrency Control Mechanisms
    3.3.1  Concurrency-Control Cost Modeling
    3.3.2  Model Parameter Fitting Using Microbenchmark
  3.4  Molecular Dynamics Benchmark
    3.4.1  Performance Comparison of Concurrency Controls
    3.4.2  Rollback Penalty
    3.4.3  Effect of Scheduling Algorithms on Rollback
  3.5  Design Space for Concurrency Controls
  3.6  Conclusion and Discussion
Chapter 4  Embedded Performance Prediction Model
  4.1  Embedded Performance Prediction Model
    4.1.1  Grand Scheme of the Embedded Performance Prediction Model
    4.1.2  Tailored Scheme of the Embedded Performance Prediction Model for Molecular Dynamics
    4.1.3  Model Validation
  4.2  Optimal Parallelization Prediction via Performance Model
    4.2.1  Performance Model Design Principles
    4.2.2  Model Fitting and Validation
  4.3  Summary
Chapter 5  n-Tuple Computation Formalism and Solution for Range-Limited Many-Body Molecular Dynamics
  5.1  Problem Statement
  5.2  Background
    5.2.1  Many-Body Molecular Dynamics
    5.2.2  Dynamic Range-Limited n-Tuple Computation
  5.3  Computation-Pattern Algebraic Framework
    5.3.1  Algebraic Formulation of Cell-Based MD
    5.3.2  Shift-Collapse Algorithm
    5.3.3  Correctness of SC Algorithm
  5.4  Theoretical Analysis
    5.4.1  Search-Cost Analysis of SC Pattern
    5.4.2  Communication Analysis of SC Pattern
    5.4.3  Relation to Previous Works
  5.5  Performance Benchmark
    5.5.1  Search Cost of SC and FS Algorithms
    5.5.2  Fine-Grain Parallelism
    5.5.3  Strong-Scaling Benchmark
  5.6  Conclusion
Chapter 6  Performance Analysis, Modeling, and Optimization of Cell-Based Molecular Dynamics
  6.1  Introduction
  6.2  Cell-Based Molecular Dynamics Simulations
  6.3  Performance Analysis
    6.3.1  Short-Range Interaction Cost Model of the Conventional Linked-List Cell Method
    6.3.2  Short-Range Interaction Cost Model of the Reduced Linked-List Cell Method
  6.4  Model Validation
    6.4.1  Validation of the Operation Counts Model
    6.4.2  Parameter Fitting
  6.5  Performance Results
  6.6  Summary
Chapter 7  Memory-Access Analysis and Optimization via Dynamic Data Reordering
  7.1  Introduction
  7.2  Data Fragmentation in Parallel Molecular Dynamics
    7.2.1  Temperature-Induced Fragmentation
    7.2.2  Granularity-Induced Fragmentation
  7.3  Performance Measurement
  7.4  Reordering Frequency Optimization
  7.5  Summary
Chapter 8  Conclusions
References

List of Tables

Table 3.1: Specification of BlueGene/Q at LLNL.
Table 3.2: Parallel efficiency estimate resulting from the decision tree.
Table 4.1: Benchmark cluster specifications.
Table 4.2: Parameter-fitting result of MPI-only and hybrid models using bi-square regression.
Table 5.1: Specifications of testing platforms.
Table 6.1: Asymptotic complexity of parameters P_s′, P_c′, and P_l′ over µ.
Table 6.2: Accuracy of the operation counts model.
Table 6.3: Estimated operation costs from various regression algorithms.
Table 6.4: The predicted versus measured values of the optimal µ with their corresponding speedup compared to the case of no cell dimension reduction.
Table 7.1: 3,000 K silica datasets with varying granularities: the number of linked-list cells, the portion of surface cells, and the surface-to-volume ratio of the spatial subsystem are listed for each dataset.
Table 7.2: Runtime result with different ordering methods for silica systems at temperature 300 K.
Table 7.3: 3,000 K silica datasets with varying granularities.

List of Figures

Figure 2.1: (a) 2D schematic of the linked-list cell method for pair computation with the cell dimension r_c. Only forces exerted by particles within the cutoff radius (represented by a two-headed arrow) are computed for particle i. The blue particles represent particles in the top-half neighbor cells, where the force is computed directly, while the gray particle is one whose force is computed via the reaction force. (b) 2D schematic of a single computation unit λ_k. The shaded cells C_j pointed to by the arrows constitute the half neighbor cells, nn+(C_k).
Figure 2.2: Schematic workflow of a hybrid MPI/OpenMP scheme. The master thread schedules work for the worker threads before entering the parallel threading section. The master thread also handles internode communication, which is explicitly performed outside the threading region.
Figure 2.3: Schematic of a memory layout for naïve data privatization. To avoid data race conditions among threads, the entire output array is locally duplicated for each thread. These local arrays are privatized in the thread memory space and thus eliminate race conditions during the computation phase. This approach also incurs additional reduction operations to sum all local results into the global array.
Figure 2.4: Memory layout and three-step algorithm for the nucleation-growth allocation algorithm. Worker threads only allocate the essential portion of the force arrays, corresponding to their assigned workload.
Figure 2.5: Thread load-balancing algorithm.
Figure 2.6: Pseudocode of the compact-volume allocation scheduling algorithm.
Figure 2.7: Pseudocode of the breadth-first allocation scheduling algorithm.
Figure 2.8: Geometric effects on surface-to-volume ratio. (a) and (b) show the surface volume (shell area) of cubic and spherical shapes with the same main volume. (c) and (d) show the discretized version, which indicates that the spherical shape consumes less volume than the cubic shape. The square grid represents cells in which particles reside. The blue area refers to the main volume while the red area shows the surface volume. Variable l is defined as the side length of the main volume created by BFAS, while r denotes the radius of CVAS. The surface thickness is r_c.
Figure 2.9: Lower-bound memory consumption estimation of the CVAS and BFAS algorithms. The graph indicates that CVAS consumes less memory than BFAS.
Figure 2.10: Scheduling cost of the CVAS and BFAS algorithms for a 128,000-particle system on 64 BG/P nodes. The circles and squares are average scheduling times of the CVAS and BFAS algorithms, respectively.
Figure 2.11: Load-imbalance factor γ of CVAS as a function of p from the theoretical bound, scheduler estimation, and actual measurement.
Figure 2.12: Average memory consumption for the private force arrays as a function of p using CVAS compared to the naïve method. Numbers in the legend denote n.
Figure 2.13: Average reduction-sum operation time of CVAS as a function of p. Numbers in the legend denote n.
Figure 2.14: (a) Thread-level strong scalability of the parallel section on a four quad-core AMD Opteron 2.3 GHz with fixed problem size at n = 8,192 particles. (b) Total running time per MD step on 1,024–32,768 PowerPC 450 850 MHz cores on BG/P for a fixed problem size at N = 0.84 million particles and (c) 1.68 million particles.
Figure 3.1: The benchmark kernel of a serial code without concurrency control.
Figure 3.2: The benchmark kernel with HTM control over 4 updates.
Figure 3.3: The benchmark kernel with HTM control over 4 updates.
Figure 3.4: Code fragment from the force update routine in the ddcMD code, which has potential memory conflicts.
Figure 3.5: (a) Strong-scaling runtime of ddcMD comparing the performance of concurrency control mechanisms. (b) Runtime overhead of each concurrency control normalized by the baseline runtime.
Figure 3.6: Impact of rollback on runtime, indicating a linear correlation between runtime overhead and rollback ratio.
Figure 3.7: Workload distribution in physical space, where cells assigned to different threads are distinguished by different colors. (a) Random scheduling; and (b) spatially-compact scheduling.
Figure 3.8: Rollback ratio (a) and runtime overhead (b) as a function of the number of threads for the three scheduling algorithms.
Figure 3.9: Typical program structure of loop-oriented applications.
Figure 3.10: Decision tree for the concurrency-control design space. We should note that OMP-critical at the bottom has a very high performance penalty.
Figure 3.11: Analysis of four concurrency control mechanisms. Circles and triangles denote strong and weak advantages, respectively, while crosses denote disadvantages.
Figure 4.1: Hierarchical architecture of a modern high-end machine. Combinations of distributed- and shared-memory parallelism are used at each architecture level.
Figure 4.2: Optimal region of MPI-only versus hybrid MPI/OpenMP implementations. The crossover line is expected to remain constant at N/P.
Figure 4.3: Validation of (a) MPI-only and (b) hybrid MPI/OpenMP performance models. Blue lines denote the model results while the red dots denote the actual measurement on BlueGene/P.
Figure 4.4: The performance model predictions for a 1.6 million-particle system of MPI-only versus hybrid MPI/OpenMP implementations, indicating the same optimality region as the actual measurements.
Figure 4.5: Extrapolation of the prediction models, keeping granularity per core constant. The predicted model suggests that the crossover moves up to ~200–400 particles/core.
Figure 5.1: Schematic of range-limited n-tuples: (a) pair (n = 2), (b) triplet (n = 3), and (c) quadruplet (n = 4).
Figure 5.2: Set diagram showing the relationship of a bounding force set S^(n) with Γ^(n) and Γ^*(n).
Figure 5.3: (a) Cell data structure, where circles are atoms and small squares are cells. The dashed circle denotes the cutoff radius centered at atom i. (b) Uniform-cell pattern for pair computation, which generates all pairs between each cell and its neighbor cells.
Figure 5.4: n-tuple (n = 3) generation for cell search-space S_cell of cell c(q) (colored in magenta) and computation path (v_0, v_1, v_2). S_cell generates all triplets where the 1st, 2nd, and 3rd atoms in the triplet are in cells c(q+v_0), c(q+v_1), and c(q+v_2), respectively. The dashed line shows one of the triplets (r_i, r_j, r_k).
Figure 5.5: Uniform-cell pattern MD algorithm.
Figure 5.6: 2D schematic of a shift-collapse (SC) computation pattern. (a) FS computation pattern for n = 3, containing all possible paths of length 3 originating from the center cell (magenta). (b) SC computation pattern after OC-shift and R-collapse subroutines. Yellow cells denote the cell coverage of the center cell, which for SC is much smaller than that for FS.
Figure 5.7: Shift-collapse algorithm.
Figure 5.8: FS generation algorithm.
Figure 5.9: OC-shift algorithm.
Figure 5.10: Reflective-collapse algorithm.
Figure 5.11: Schematic of shell methods for pair computation: (a) full-shell, (b) half-shell, and (c) eighth-shell patterns.
Figure 5.12: Average number of triplets as a function of domain size. Error bars are too small to be visible in the plot.
Figure 5.13: Runtime of SC-MD (red), FS-MD (green), and Hybrid-MD (blue) as a function of the granularity on (a) 48 Intel Xeon nodes and (b) 64 BlueGene/Q nodes. The plot shows that SC-MD is the fastest for N/P < 2,095 and 425 on the Intel Xeon and BlueGene/Q platforms, respectively.
Figure 5.14: Strong-scaling speedup of SC-MD, Hybrid-MD, and FS-MD on (a) the Intel Xeon cluster and (b) BlueGene/Q.
Figure 6.1: 2D schematic of the linked-list cell method. (a) Illustration of the conventional linked-list cell method with cell dimension R_s ≥ R_c. (b) 2D schematic of the reduced linked-list cell method. Atoms in the shaded region are excluded from the force computation of center atom i. Only forces exerted by atoms within the cutoff radius (represented by a two-headed arrow) are computed.
Figure 6.2: Force computation algorithm.
Figure 6.3: The running time from the prediction and the measurement for R_c = 10 Å. The numerals are the numbers of atoms.
Figure 6.4: Comparison of speedup gained from predicted and actual μ* for R_c = 6.6 and 10 Å.
Figure 7.1: Schematic of memory layout for atom data in (a) a fully disordered state and (b) a fully ordered state, where C(i,j,k) is the linked list for the cell with indices i, j, and k in the x, y, and z directions, respectively, and r[a] is the data associated with the a-th atom.
Figure 7.2: Time variation of the data fragmentation ratio over 3,000 MD steps. Three datasets of different initial temperatures (300 K, 3,000 K, and 6,000 K) are plotted.
Figure 7.3: Mean squared displacement (MSD) of the 300 K, 3,000 K, and 6,000 K datasets. The MSD of the 300 K dataset remains constant at 0.23 Å² (too small to be seen in the figure). The inset shows the diffusion coefficients at the three temperatures.
Figure 7.4: Data fragmentation ratio during an aluminum combustion simulation over 2,000 MD steps. The dataset involves 15 million atoms on 1,024 processors.
Figure 7.5: (a) Time variation of the data fragmentation ratio of 12,288-atom silica at 3,000 K over 3,000 MD steps, for surface cells, core cells, and both combined. (b) Time variation of the data fragmentation ratio of 3,000 K silica with varying granularities over 3,000 MD steps.
Figure 7.6: Time variation of the DTLB miss rate in the 300 K, 3,000 K, and 6,000 K datasets over 3,000 MD steps.
Figure 7.7: Relation between the DTLB miss rate and the data fragmentation ratio in the 300 K, 3,000 K, and 6,000 K datasets.
Figure 7.8: Running time per MD step as a function of MD steps at temperatures 300 K, 3,000 K, and 6,000 K for 98,304 silica atoms.
Figure 7.9: Schematic of (a) sequential ordering, (b) 3rd-order 2D Hilbert-curve ordering, and (c) 3rd-order 2D Morton-curve ordering.
Figure 7.10: Total running time after 3,000 MD steps as a function of reordering period. The optimal periods for the 300 K, 3,000 K, and 6,000 K datasets are 69, 5, and 3 steps, respectively.
Figure 7.11: Comparison of the total running time of silica MD over 3,000 steps achieved by model prediction and actual measurement of periodic reordering, compared with that of the original code without ordering. (All tests are conducted with 98,304 atoms per core.)
Chapter 1
Introduction

1.1 Motivations

Computer simulation has become a vital tool for solving a broad range of scientific and engineering problems, such as studying properties of nanomaterials at the atomistic scale, designing efficient wind-turbine farms for high-yield clean energy production, simulating catastrophic earthquakes for disaster preparedness, and modeling drugs for curing fatal diseases. However, larger and longer simulations, along with sophisticated modeling of real-world problems, are required in order to characterize realistic systems. One simulation model of great significance is molecular dynamics (MD) simulation, which has broad applications in diverse fields such as physics, chemistry, biology, and materials science. Recent advances in chemically reactive MD simulations have driven the need for vast computing power. Thanks to the advent of multicore architectures, the rapid increase in computing power has enabled scientists to confront these grand challenges for mankind.

Nevertheless, efficiently harvesting computing power from modern architectures is challenging for application developers due to the sophisticated structure of computing platforms. Research has focused on evolving legacy MD algorithms to execute efficiently on modern architectures. In particular, recent advances in reactive MD simulations based on many-body interatomic potentials have necessitated efficient dynamic n-tuple computation. The aspiration to have scalable and efficient parallel frameworks, along with a systematic understanding of reactive MD problems, has motivated us to find solutions to these challenges. This will enable software developers to continually enhance MD applications to cope with the advent of novel architectures in the future.

1.2 Challenges and Goals

The emergence of the multicore era has granted unprecedented computing capabilities. However, the longstanding message-passing programming model tends to underutilize the full potential of modern architectures. This has posed difficulties for software developers, since legacy programs have relied heavily on the increasing processing power of a single processor rather than on the increasing number of cores [1-3]. Thus, many algorithms currently in use are not suitable for the shared-memory model, thereby demanding substantial alterations in software [4, 5]. Also, the performance of traditional concurrency techniques tends to be limited on a large-scale concurrent system. Addressing these issues typically requires extensive coding effort, which limits wide use of massive shared-memory platforms. Hence, it is a great challenge to develop a parallel algorithm that possesses "metascalability" (i.e., design once, continue to scale on future architectures) [6] for applications on continuously evolving platforms.

Despite remarkable progress in parallel algorithms for simple MD, parallel dynamic n-tuple computation for reactive MD is still in its infancy. The major challenge emerges from the lack of fundamental understanding of general n-tuple computation characteristics: How can we understand the fundamental characteristics of n-tuple computation so that we can apply the benefits of advanced techniques in simple MD to arbitrary dynamic n-tuple computations?
To address these challenges, this dissertation develops a metascalable parallel algorithmic framework, along with mathematical formulations and optimal solutions for general n-tuple computation. This allows scientists to fully utilize the computational capacity of high-end multicore machines for n-tuple computation, overcoming ever larger spatiotemporal barriers of physical simulations on current and forthcoming platforms.

1.3 Background

In this section, we provide background information on our research. First, we introduce MD, a prototypical scientific application that is characterized by a number of complications toward parallelization on multicore architectures. Second, we provide background information on emerging multicore architectures. Third, we discuss our hierarchical parallelization, specifically, a hybrid message-passing and multithreading algorithm for scientific applications without loss of generality. Next, we explain n-tuple computation in many-body MD, which is used to perform reactive MD simulation. Finally, we provide a description of state-of-the-art scalable parallel MD algorithms.

1.3.1 Molecular Dynamics Simulation

MD simulation has become an important tool [7] to study a broad range of scientific problems at the atomistic level [8-14]. MD simulation follows the phase-space trajectories of an N-particle system, where the forces between particles are given by the gradient of a potential energy function. Positions and velocities of all particles are updated at each MD step by numerically integrating coupled ordinary differential equations. The dominant computation of MD simulations is the evaluation of the potential energy function and associated forces. Many potential energy models have been proposed to describe numerous properties of various materials. Some of these impose significant computational challenges [15-19], and many methods exist to reduce this computational complexity [17-21]. Another important class of potential models is one consisting of many-body terms [16]. The dominant computation in such MD simulations is the evaluation of many-body terms involving summation over particle n-tuples.

1.3.2 Multicore Architectures

The rapid growth in computing power enables scientists to address more challenging problems using high-end supercomputers [22, 23]. However, recent improvements in computing power are now gained using multicore architectures instead of increased clock speed [2, 24]. Application scientists at the forefront of high-performance computing (HPC) have been investigating the use of hybrid architectures to improve the performance of scientific codes. Hierarchical architectures attempt to improve application performance by combining conventional processors with any of a variety of more specialized processors such as graphics processing units (GPUs) [25, 26], manycores [27-29], and Cells [30, 31]. This marks the end of the free-ride era: legacy applications can no longer increase performance on a newer chip without substantial modification. Furthermore, the number of cores per chip is expected to grow continuously, which deepens the performance impact on legacy software.

Multicore processors with sophisticated computational and memory organizations have been developed as a result of the pursuit of high performance with low energy usage. At the low level of the memory hierarchy, most recent multicore chip designs adopt heterogeneous cache access.
This design has local level-1 and level-2 caches for each individual core, while subsets of cores share a last-level cache (LLC). The cache-coherency concept is usually adopted, which guarantees data consistency among all cores in a node. At the node level, there are two major approaches to the memory hierarchy: (1) symmetric multiprocessing (SMP), where all compute cores have the same cost of accessing memory; and (2) non-uniform memory access (NUMA), where each processor has a different cost to access each memory module. The NUMA paradigm requires more attention from software developers to maintain maximal performance [32, 33] (e.g., thread pinning is required to avoid threads migrating between cores). The SMP paradigm, on the other hand, tends to have higher latency and is more difficult to realize in circuit design. Nevertheless, the shared-memory concept is usually implemented on a multicore chip. This raises the issues of race conditions and load balancing when multithreading is used.

As multicore becomes the standard of modern supercomputers, two significant issues have emerged: (1) the performance of traditional parallel applications, which are solely based on the message passing interface (MPI), is expected to degrade substantially [34]; and (2) the memory available per core tends to decline, since the number of cores per chip grows considerably faster than the available memory. To address these complications, hierarchical parallelization frameworks, which integrate several parallel methods to provide different levels of parallelism, have been proposed as a solution to this scalability problem on multicore platforms [6, 35-37].

In addition, a recent trend in HPC also focuses on heterogeneous computing. Combinations of generic multicore chips and accelerators, such as general-purpose graphics processing units (GPGPUs) [38-42] and manycore processors [43] (i.e., lightweight general processing components such as the Intel Many Integrated Core (MIC) chip), are used. Although accelerators are powerful for specific types of problems, such as highly efficient floating-point operations on GPGPUs, they usually do not execute every instruction efficiently, if at all. Therefore, generic multicore chips are still needed to run non-specialized code sections in the program.

1.3.3 Hybrid Parallelization

Hybrid parallelization combines multiple parallelization techniques in order to achieve the best overall performance. It has been widely used for applications executing on emerging multicore platforms, using a hybrid scheme that reflects the two-level hierarchical design of multicore clusters. The most common scheme is distributed memory for inter-node communication via message passing, combined with shared memory for intra-node communication via multithreading.

At the cluster level, divide-and-conquer parallel algorithms based on spatial decomposition implemented with MPI are widely used, since MPI has been the de facto standard for high-end parallel computing for decades. MPI provides a moderate-transparency communication framework for parallel applications by allowing software developers to explicitly code communication routines from elementary communication building blocks without being concerned with sophisticated network protocols [44].
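As a concrete illustration of this inter-node/intra-node split, the following minimal C sketch combines MPI message passing between nodes with OpenMP threading (introduced below) within each node. It is not code from ddcMD or from this dissertation; the particle count, the placeholder force kernel, and the choice of MPI_THREAD_FUNNELED are assumptions made purely for illustration.

/* Minimal sketch of the hybrid scheme: MPI works across nodes while OpenMP
 * threads share the intra-node work.  All names and sizes are illustrative. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_LOCAL 1024            /* particles owned by this MPI rank (assumed) */

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the master thread makes MPI calls, i.e., communication
     * is performed outside the threaded region (cf. Figure 2.2). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double force[N_LOCAL] = {0.0};

    /* ... exchange boundary-particle data with neighbor ranks here (MPI) ... */

    /* Intra-node parallelism: threads share the force computation. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < N_LOCAL; ++i)
        force[i] += 1.0;        /* placeholder for the pair/n-tuple kernel */

    /* Master thread performs collective communication after the threaded
     * region, e.g., a global reduction of a scalar observable. */
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < N_LOCAL; ++i)
        local_sum += force[i];
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global_sum);
    MPI_Finalize();
    return 0;
}

Such a sketch would be built with an MPI compiler wrapper with OpenMP enabled (e.g., mpicc -fopenmp).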
The basic communication functions provided by the MPI standard include synchronous/asynchronous end-to-end communication and collective communication, while secondary functions built from the basic ones are also provided.

At the node level, multithreading is a common strategy to exploit the shared-memory property of a multicore chip. Numerous multithreading frameworks are widely available in the computer science community. In HPC, the multithreading frameworks generally shipped with modern compilers include POSIX threads (Pthreads) [45] and Open Multiprocessing (OpenMP) [46]. Pthreads exposes most low-level operations to the user with low overhead, at the cost of complicated coding. OpenMP, on the other hand, offers a simpler multithreading framework, but many of its operations incur considerable overhead.

Variants of this message-passing/multithreading hybrid scheme have been proposed and studied broadly [37, 47-51]. Peng et al. [35] added one more level of parallelism at the data level through single-instruction multiple-data (SIMD) techniques, while recent studies [41, 52] have extended this scheme to the GPGPU context. The hierarchical framework is expected to exploit data concurrency efficiently and continue to scale on future multicore platforms.

Since most legacy software already operates efficiently on supercomputers using MPI, the great challenge for hybrid message-passing and multithreading parallelization lies in the multithreading part. The most typical issues associated with multithreading are race conditions and load balancing. Race conditions are a common problem that has plagued program developers since the earliest days of computing. Although numerous methods have been proposed to deal with race conditions (e.g., semaphores, critical sections, transactional memory), the performance-sensitive nature of high-performance applications usually renders these naïve methods impractical. On the other hand, load imbalance at the thread level usually occurs at fine-grain scope, which requires load balancers to be much more dynamic. Also, issues like context switching and false sharing place tighter constraints on the design and usage of multithreading.

1.3.4 n-Tuple Molecular Dynamics

MD simulation using a differentiable interatomic potential-energy function Φ was started by Rahman in 1964 [53] using a pair-wise potential energy, in which Φ is a sum of atomic-pair energies. Since then, scientists have performed many-body MD simulations that use n-tuple (n ≥ 3) energy functions for an accurate description of a wider range of materials. In one type of n-tuple computation (i.e., static n-tuple computation), used typically in biomolecular simulations [54], the list of atomic n-tuples is fixed throughout the simulation. In the other (i.e., dynamic n-tuple computation), n-tuple lists within given interaction ranges are constructed at every simulation time step [55, 56]. Recent advances in chemically reactive MD simulations [57] have renewed interest in efficient implementation of dynamic n-tuple computation [58]. Reactive MD describes the formation and breakage of chemical bonds based on a reactive bond-order concept [57]. In the ReaxFF approach, for example, n is explicitly 4, and force computation involves up to n = 6 due to chain-rule differentiations through the bond-order terms [59-61].
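To make dynamic n-tuple computation concrete, the sketch below rebuilds a triplet (n = 3) list from the current atomic positions, as would be done at every time step. It is a deliberately naïve, brute-force illustration rather than the dissertation's method (Chapter 5's shift-collapse algorithm replaces this O(N^3) search with a compact cell-based one); the atom count, cutoff value, and the rule that all three pairwise distances must lie within the cutoff are assumptions made for illustration only.

/* Naive dynamic triplet (n = 3) enumeration: the list is rebuilt from current
 * positions each step, keeping triplets whose members are mutually within
 * RCUT.  Illustrative only; see Chapter 5 for the scalable cell-based search. */
#include <math.h>

#define N    64                 /* number of atoms (assumed) */
#define RCUT 3.0                /* interaction range in Angstroms (assumed) */

static double dist(const double a[3], const double b[3])
{
    double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return sqrt(dx * dx + dy * dy + dz * dz);
}

/* Count range-limited triplets (i, j, k) with i < j < k; the 3-body term
 * would be evaluated where the counter is incremented. */
long count_triplets(const double r[N][3])
{
    long ntrip = 0;
    for (int i = 0; i < N; ++i)
        for (int j = i + 1; j < N; ++j) {
            if (dist(r[i], r[j]) > RCUT)
                continue;
            for (int k = j + 1; k < N; ++k)
                if (dist(r[i], r[k]) <= RCUT && dist(r[j], r[k]) <= RCUT)
                    ++ntrip;
        }
    return ntrip;
}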
1.3.5 Recent Development of Parallel MD Algorithms

Scalable implementation of MD on massively parallel computers has been one of the major driving forces of supercomputing technologies [62-68]. Earlier parallel implementations of MD were based on spatial decomposition, in which the simulated physical volume is subdivided into spatially localized sub-volumes that are assigned to different processors [69]. For long-range pair (n = 2) computation, octree-based O(N) algorithms (N is the number of atoms) [20] have highly scalable parallel implementations [70, 71]. For short-range (or range-limited) pair computation, Plimpton relaxed the conventional "owner-compute" rule (i.e., computation is performed by the processor that owns the data) to design a force-decomposition algorithm that increases concurrency [72]. Since then, various hybrid spatial-force decomposition algorithms have been designed [73]. On distributed-memory parallel computers, atomic data needed for range-limited pair computations are copied from neighbor processors. The most primitive scheme for these atom-caching operations is the full-shell (FS) scheme, in which data from 26 (face-, edge-, and corner-sharing) neighbor sub-volumes are imported from other processors. In the half-shell (HS) scheme, Newton's third law is utilized to halve the number of imported sub-volumes to 13 [69]. Relaxation of the owner-compute rule further reduces this number to 7 in the eighth-shell (ES) scheme [11]. In the case of special-purpose computers with low network latency, neutral-territory (NT) [74] and related [75] schemes achieve asymptotically smaller import volumes for fine granularities N/P (P is the number of processors). In addition to these parallel algorithms for dynamic pair computations, numerous schemes have been employed in biological MD codes to efficiently perform static n-tuple computations [12, 73].

1.4 Contributions

The main contributions of this dissertation research are two-fold: a metascalable hybrid message-passing and multithreading parallel algorithmic framework on emerging multicore parallel clusters, and a novel computation-pattern algebraic framework to design a scalable algorithm for general n-tuple computation and prove its optimality in a mathematically rigorous manner.

To design a metascalable algorithm, we have developed and thoroughly analyzed algorithms for hybrid MPI + OpenMP parallelization of n-tuple MD simulation, which are scalable on large multicore clusters. Two data-privatization thread scheduling algorithms via nucleation-growth allocation have been designed: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). These two algorithms combine fine-grain dynamic load balancing and minimal memory-footprint threading. Theoretical study has revealed good asymptotic memory efficiency for both algorithms, reducing memory consumption by 75% compared to a naïve-threading algorithm. Furthermore, performance benchmarks have confirmed higher performance of the hybrid MD algorithm over a traditional algorithm on large multicore clusters, where a 2.58-fold speedup of the hybrid algorithm over the traditional algorithm was observed on 32,768 nodes of IBM BlueGene/P.

To aid multithreading program-design decisions, we have investigated the performance characteristics of HTM on the IBM BlueGene/Q computer in comparison with conventional concurrency control mechanisms, using an MD application as an example.
Benchmark tests, along with overhead-cost and scalability analysis, have quantified the relative performance advantages of HTM over other mechanisms. We found that the bookkeeping cost of HTM is high but that the rollback cost is low. We have proposed transaction fusion and spatially compact scheduling techniques to reduce the overhead of HTM with minimal programming effort. A strong scalability benchmark has shown that the fused HTM has the shortest runtime among various concurrency control mechanisms without extra memory. Based on the performance characterization, we have derived a decision tree in the concurrency-control design space for general multithreading applications.

Regarding the second contribution, we have developed a computation-pattern algebraic framework to mathematically formulate general n-tuple computation. Based on translation/reflection-invariant properties of computation patterns within this framework, we have designed a shift-collapse (SC) algorithm for cell-based parallel MD. Theoretical analysis has quantified the compact n-tuple search space and small communication cost of SC-MD for arbitrary n, which reduce to those of the best pair-computation approaches (e.g., the eighth-shell method) for n = 2. Benchmark tests have shown that SC-MD outperforms our production MD code at the finest grain, with 9.7- and 5.1-fold speedups on Intel-Xeon and BlueGene/Q clusters. SC-MD has also exhibited excellent strong scalability.

In addition, we have analyzed the computational and data-access patterns of MD, which led to the development of a performance prediction model for short-range pair-wise force computations in MD simulations. The analysis and performance model provide fundamental understanding of computation patterns and the optimality of certain parameters in MD simulations, thus allowing scientists to determine the optimal cell dimension in a linked-list cell method. The model has accurately estimated the number of operations during the simulations, with a maximum error of 10.6% compared to actual measurements. Analysis and benchmarks of the model have revealed that the optimal cell dimension minimizing the computation time is determined by a trade-off between decreasing search space and increasing linked-list cell access for smaller cells.

One difficulty with MD is that it is a dynamically irregular application, which often suffers considerable performance deterioration during execution. To address this problem, an optimal data-reordering schedule has been developed for runtime memory-access optimization of MD simulations on parallel computers. Analysis of the memory-access penalty during MD simulations has shown that the performance improvement from computation and data reordering degrades gradually as data translation lookaside buffer misses increase. We have also found correlations between the performance degradation and physical properties such as the simulated temperature, as well as computational parameters such as the spatial-decomposition granularity. Based on a performance model and pre-profiling of data fragmentation behaviors, we have developed an optimal runtime data-reordering schedule, thereby achieving speedups of 1.35, 1.36, and 1.28, respectively, for MD simulations of silica at temperatures 300 K, 3,000 K, and 6,000 K.

We expect that the hybrid algorithms and mathematical approaches developed in this dissertation will provide a generic framework for a broad range of applications on future extreme-scale computing platforms.
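To make the data-privatization idea behind CVAS and BFAS concrete, the following OpenMP sketch shows the naïve baseline they improve upon (cf. Figure 2.3): each thread accumulates into a full private copy of the force array, and a final reduction merges the private copies into the shared array. The pair-list layout, scalar forces, and function signature are assumptions for illustration; the dissertation's schedulers instead allocate only the portion of the array that each thread actually needs.

/* Naive data-privatization threading: no synchronization is needed inside the
 * pair loop, at the cost of O(threads x n_atoms) memory plus a reduction. */
#include <omp.h>
#include <stdlib.h>
#include <string.h>

void compute_forces(int n_atoms, int n_pairs, const int (*pairs)[2],
                    const double *pair_force, double *force)
{
    memset(force, 0, n_atoms * sizeof(double));

    #pragma omp parallel
    {
        /* Full private force buffer for this thread. */
        double *local = calloc(n_atoms, sizeof(double));

        #pragma omp for schedule(dynamic)
        for (int p = 0; p < n_pairs; ++p) {
            int i = pairs[p][0], j = pairs[p][1];
            local[i] += pair_force[p];      /* action   */
            local[j] -= pair_force[p];      /* reaction */
        }

        /* Reduction: sum the private copies into the shared array. */
        #pragma omp critical
        for (int a = 0; a < n_atoms; ++a)
            force[a] += local[a];

        free(local);
    }
}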
1.5 Structure of This Dissertation The organization of this dissertation is as follows. Chapter 2 presents the proposed hybrid message-passing and multithreading algorithm along with detailed theoretical and statistical analyses. Chapter 3 characterizes and compares several concurrency control mechanisms for multhreading. Chapter 4 presents a hybrid performance model. Chapter 5 presents the formulation and analysis of a computation-pattern algebraic framework for n-tuple computation. A theoretical analysis and design of an MD application on a single node are discussed in chapter 6. Chapter 7 discusses analysis and optimization of irregular memory-access behaviors in MD. Finally, conclusions are drawn in chapter 8. 13 Chapter 2 Hybrid Message-Passing and Multithreading Parallelization In this chapter, we propose threading algorithms for hybrid MPI/OpenMP parallelization of MD simulation, which are scalable on large multicore clusters [76]. Two data privatization thread scheduling via nucleation-growth allocation are presented: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation algorithm (BFAS). This chapter is organized as follows. Section 2.1 provides a brief introduction to the problem and a hybrid parallel algorithm. Section 2.2 summarizes the hierarchy of parallel operations in a sample MD code, ddcMD, followed by the detail description of the proposed data-privatization algorithms in section 2.3. Theoretical analysis of data-privatization algorithms is given in section 2.4. Section 2.5 evaluates the performance of the hybrid parallelization algorithm, and conclusions are drawn in section 2.6. 2.1 Introduction As multicore chip becomes the standard of modern supercomputers, two significant issues have emerged: (1) the performance of traditional parallel applications, which are solely 14 based on MPI, is expected to degrade substantially [34]; and (2) the memory available per core tends to decline, since the growth of the number of cores per chip is considerably faster than that of the available memory. To address the emerging complications, hierarchical parallelization frameworks, integrating several parallel methods to provide different levels of parallelism, have been proposed as a solution to solve the scalability problem on multicore platforms [6, 35-37]. On multicore architectures, hybrid parallelization based on MPI/threading schemes will likely replace MPI-only parallel MD. This is because the benefit from a shared-memory model of threading can be exploited within a multicore chip, while MPI handles the communication among compute nodes. In the next section, we describe detail algorithmic concept of ddcMD, our prototypical MD application representing broad-range of scientific codes. 2.2 Domain Decomposition Molecular Dynamics In this chapter, we focus on the highly efficient particle-particle/particle-mesh (PPPM) method [17] to compute Coulomb potential. In PPPM the Coulomb potential is decomposed into two parts: A short-range part that converges quickly in real space and a long-range part that converges quickly in reciprocal space. The split of work between the short-range and long-range part is controlled though a “screening parameter” α. With the appropriate choice of α, computational complexity for these methods can be reduced to O(NlogN). 
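To make the decomposition concrete, the following minimal C sketch evaluates the real-space (short-range) contribution of such a screened Coulomb sum with a direct double loop; the array layout, function name, and O(n^2) pair search are illustrative assumptions rather than the ddcMD implementation, which uses linked-list cells to reach O(n) as described in section 2.2.2.

#include <math.h>

/* Illustrative short-range (real-space) energy of a screened Coulomb sum,
 * phi = sum_{i<j} q_i q_j erfc(alpha*r_ij)/r_ij, truncated at cutoff rc.
 * A brute-force pair loop is used here purely for clarity. */
double shortRangeEnergy(int n, const double x[], const double y[],
                        const double z[], const double q[],
                        double alpha, double rc)
{
    double phi = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = x[i] - x[j];
            double dy = y[i] - y[j];
            double dz = z[i] - z[j];
            double r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < rc*rc) {
                double r = sqrt(r2);
                phi += q[i]*q[j]*erfc(alpha*r)/r;  /* real-space term */
            }
        }
    }
    return phi;
}

The screening parameter alpha controls how quickly erfc(alpha*r)/r decays, and therefore how much work falls on this real-space loop versus the reciprocal-space part.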
Because the long-range part of the Coulomb potential can be threaded easily (as a parallel loop over many individual 1D fast Fourier transforms), this study explores efficient parallelization of the more challenging short-range part of the Coulomb potential using OpenMP threading. The short-range part is a sum over pairs: φ = ∑ i<j q i q j erfc(αr ij )/r ij , where q i is the charge of particle i and r ij is the separation between particles i and j. Though this work is focused 15 on this particular pair function, much of the work can be readily applied to other pair functions. In addition to this intranode parallelization, the ddcMD code is already parallelized across nodes using a particle-based domain decomposition implemented using MPI. Combining the existing MPI-based decomposition with the new intranode parallelization yields a hybrid MPI/OpenMP parallel code. An extensive comparison of MPI-only ddcMD with other pure MPI codes can be found in [77]. In section 2.2.1, we provide node-level communication concepts of ddcMD together with the detail of a load-balancing algorithm used in ddcMD. Section 2.2.2 describes how computation is handled within each domain. Hybrid MPI/OpenMP approach is described in section 2.2.3. 2.2.1 Internode Operations In typical parallel MD codes the first level of parallelism is obtained by decomposing the simulation volume into domains, each of which is assigned to a compute core (i.e., an MPI task). Because particles near domain boundaries interact with particles in nearby domains, internode communication is required to exchange particle data between domains. The surface-to-volume ratio of the domains and the choice of potential set the balance of communication to computation. The domain-decomposition strategy in ddcMD allows arbitrarily shaped domains that may even overlap spatially. Also, remote particle communication between nonadjacent domains is possible when the interaction length exceeds the domain size. A domain is defined only by the position of its center and the collection of particles that it “owns.” Particles are initially assigned to the closest domain center, creating a set of domains that approximates a Voronoi tessellation. 16 The choice of the domain centers controls the shape of this tessellation and hence the surface-to- volume ratio for each domain. The commonly used rectilinear domain decomposition employed by many parallel codes is not optimal from this perspective. The best surface-to-volume ratio in a homogeneous system is achieved if domain centers form a bcc, fcc, or hcp lattice, which are common high-density packing of atomic crystals. In addition to setting the communication cost, the domain decomposition also controls load imbalance. Because the domain centers in ddcMD are not required to form a lattice, simulations with a non-uniform spatial distribution of particles (e.g., voids or cracks) can be load balanced by an appropriate non-uniform arrangement of domain centers. The flexible domain strategy of ddcMD allows for the migration of the particles between domains by shifting the domain centers. As any change in their positions affects both load balance and the ratio of computation to communication, shifting domain centers is a convenient way to optimize the overall efficiency of the simulation. Given an appropriate metric (such as overall time spent in MPI barriers) the domains can be shifted “on-the-fly” in order to maximize efficiency [78]. 
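As an illustration of the ownership rule just described, a particle can be assigned to the domain whose center is nearest to it, which produces the approximate Voronoi tessellation. The function below is a hedged sketch; the array layout and the name nearestDomain are assumptions for illustration, not ddcMD's actual data structures.

#include <float.h>

/* Return the index of the domain whose center is closest to particle p. */
int nearestDomain(const double p[3], int ndomains, const double centers[][3])
{
    int best = -1;
    double bestDist2 = DBL_MAX;
    for (int d = 0; d < ndomains; d++) {
        double dx = p[0] - centers[d][0];
        double dy = p[1] - centers[d][1];
        double dz = p[2] - centers[d][2];
        double dist2 = dx*dx + dy*dy + dz*dz;
        if (dist2 < bestDist2) { bestDist2 = dist2; best = d; }
    }
    return best;   /* domain (MPI task) that owns this particle */
}

Shifting a domain center changes which particles this test assigns to it, which is the mechanism behind the on-the-fly load balancing described above.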
2.2.2 Intranode Operations Once particles are assigned to domains and remote particles are communicated, the force calculation begins. Figure 2.1(a) shows a schematic of the linked-list cell method used by ddcMD to compute pair interactions in O(N) time. In this method, each simulation domain is divided into small cubic cells, and a linked-list data structure is used to organize particle data (e.g., coordinates, velocities, type, and charge) in each cell. By traversing the linked list, one retrieves the information of all particles belonging to a cell, and thereby computes interparticle 17 interactions. The dimension of the cells is determined by the cutoff length of the pair interaction, r c . Linked-list traversal introduces a highly irregular memory-access pattern, resulting in performance degradation. To alleviate this problem, we reorder the particles within each node at the beginning of every MD step, so that the particles within the same cell are arranged contiguously in memory when the computation kernel is called. At present we choose an ordering specifically tailored to take advantage of the BlueGene “double-Hummer” SIMD operations. However, we could just as easily reorder the data to account for NUMA or GPGPU architectural details. We consistently find that the benefit of the regular memory access far outweighs the cost of particle reordering. The threading techniques proposed here are specifically constructed to preserve these memory-ordering advantages. The computation within each node is described as follows. Let L be the total number of cells in the system, {C k | 0 ≤ k < L} be the set of cells within each domain. The computation within each node is divided into a collection of small chunks of work called a computation unit λ. A single computation unit λ k = {(r i , r j ) | r i ∈ C k ; r j ∈ nn + (C k )} for cell C k is defined as a collection of pair-wise computations (see Figure 2.1(b)), where nn + (C k ) is a set of half the nearest-neighbor cells of C k . Newton’s third law allows us to halve the number of force evaluations and use nn + (C k ) instead of the full set of nearest-neighbor cells, nn(C k ). The pairs in all computation units are unique, and thus the computation units are mutually exclusive. The set of all computation units on each node is denoted as Λ = {λ k | 0 ≤ k < L}. Since most of our analysis is performed at a node level, n = N/P hereafter denotes the number of particles in each node (P is the number of nodes), and p is the number of threads in each node. 18 Figure 2.1: (a) 2D schematic of the linked-list cell method for pair computation with the cell dimension r c . Only forces exerted by particles within the cutoff radius (represented by a two- headed arrow) are computed for particle i. The blue particles represent particles in the top-half neighbor cells where the force is computed directly, while the gray particles is a particle whose force is computed via reaction force. (b) 2D schematic of a single computation unit λ k . The shaded cells C j pointed by the arrows constitute the half neighbor cells, nn + (C k ). 2.2.3 Hybrid MPI/OpenMP Parallelization The hybrid MPI/OpenMP parallelization of ddcMD is implemented by introducing a thread scheduler into the MPI-only ddcMD. Figure 2.2 shows the workflow of the hybrid MPI/OpenMP code using a master-slave approach. 
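Before the phases of this workflow are described in detail below, the following self-contained sketch shows the overall schedule-then-compute structure: the assignment of computation units to threads is performed outside the OpenMP parallel region, and each worker then processes only its own units. The round-robin assignment and the dummy work() function are placeholders for the real scheduler and pair kernel, and are not part of ddcMD.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NUNITS 64

static double work(int unit) { return (double)unit * unit; }  /* stands in for a computation unit */

int main(void)
{
    int nthreads = omp_get_max_threads();
    int *owner = malloc(NUNITS * sizeof(int));

    /* master: scheduling phase (round-robin stands in for CVAS/BFAS) */
    for (int k = 0; k < NUNITS; k++)
        owner[k] = k % nthreads;

    double total = 0.0;
    #pragma omp parallel reduction(+ : total)
    {
        int tid = omp_get_thread_num();
        for (int k = 0; k < NUNITS; k++)       /* workers: compute phase */
            if (owner[k] == tid)
                total += work(k);
    }
    printf("total = %f\n", total);
    free(owner);
    return 0;
}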
The program repeats the following computational phases: First, the master thread performs initialization and internode communications using MPI; the scheduler computes the scheduled work for each thread; and the worker threads execute the workloads in an OpenMP parallel section. Our algorithm utilizes a scheduler to distribute the workload before entering the pair computation, i.e., parallel section. In this approach, the scheduling cannot interfere with the worker threads, since the scheduling is already completed before the worker threads are started. Because the schedule is recomputed every MD step (or perhaps every few MD steps), there is adequate flexibility to adapt load balancing to the changing dynamics of the simulation. This approach also reduces the number of 19 synchronizations and minimizes context switching compared to a real-time dynamic scheduling approach. We parallelize the explicit pair-force computation kernel of ddcMD, which is the most computationally intensive kernel, at the thread level using OpenMP. Two major problems commonly associated with threading are: (1) race condition among threads; and (2) thread-level load imbalance. The race condition occurs when multiple threads try to update the force of the same particle concurrently. Figure 2.2: Schematic workflow of a hybrid MPI/OpenMP scheme. The master thread schedules works for the worker threads before entering the parallel threading section. The master thread also handles internode communication, which is explicitly performed outside the threading region. We have designed algorithms that combine a mutually exclusive scheduler with a reduced memory data-privatization scheme to address a race condition and fine-grain load balancing issues. The algorithm will be discussed in detail in section 2.3. This research focuses on hybrid MPI/OpenMP parallelization on Sequoia, the third generation of BlueGene, which will be online in 2012 at LLNL. On this symmetric multiprocessing (SMP) platform, MPI-only programming will not be an option for full-scale runs with at least 3.2 million concurrent threads. 20 2.3 Data Privatization Scheduling Algorithm Molecular-dynamics computation patterns often lead to serious fine-grain race conditions. If reaction force is exploited, every pair interaction involving an identical particle (e.g. interactions between particles i-j and j-k) possibly causes a write conflict when two different threads computes on each interaction concurrently. The simplest solution is not to use reaction force but this approach immediately doubled the computation (or more than doubled in case of many-body potentials). Hence, this approach is not the best solution in most circumstances. To address these issues, we use a data privatization algorithm, which allows the reaction forces to be exploited while eliminating race conditions from pair computations. However, a naïve implementation of this algorithm consumes considerably more memory. This will be described in section 2.3.1. In section 2.3.2, the nucleation-growth allocation algorithm, an enhanced data privatization with reduced memory consumption, will be discussed. 2.3.1 Naïve Data Privatization Threading Algorithm A naïve data-privatization algorithm avoids write conflicts by replicating the entire write- shared data structure and allocating a private copy to each thread (Figure 2.3). The memory requirement for this redundant allocation scales as O(np). 
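A hedged sketch of this replicate-and-reduce pattern, which is described in more detail below, is given here; the scatter-style dummy updates stand in for the pair-force accumulation, and the function name and layout are illustrative only.

#include <omp.h>
#include <stdlib.h>
#include <string.h>

/* Naive data privatization: every thread gets a full private copy of the
 * force array (O(n*p) memory), and the copies are reduced afterward. */
void naivePrivatizedForces(int n, double *fglobal)
{
    int p = omp_get_max_threads();
    double *fpriv = calloc((size_t)n * p, sizeof(double)); /* p full copies */

    #pragma omp parallel
    {
        double *mine = fpriv + (size_t)omp_get_thread_num() * n;
        #pragma omp for
        for (int i = 0; i < n; i++) {
            mine[i] += 1.0;            /* stands in for the force on particle i */
            mine[(i + 1) % n] += 1.0;  /* reaction-style scatter that would be  */
        }                              /* unsafe on a shared array              */
    }

    memset(fglobal, 0, (size_t)n * sizeof(double));
    for (int t = 0; t < p; t++)        /* reduction of the private copies */
        for (int i = 0; i < n; i++)
            fglobal[i] += fpriv[(size_t)t * n + i];

    free(fpriv);
}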
Each thread computes forces for each of its computation units and stores the force values in its private array instead of the global array. This allows each thread to compute forces independently without a critical section. After the force computation for each MD step is completed, the private force arrays are reduced to obtain the global forces. 21 Figure 2.3: Schematic of a memory layout for a naïve data privatization. To avoid data race condition among threads, the entire output array is locally duplicated for each thread. These local arrays are privatized on the thread memory space and thus eliminate race condition during the computation phase. This approach also incurs additional reduction operations to sum all local results to the global array. The naïve data privatization threading is simple yet powerful. This approach can be used to eliminate race condition regardless of threading complexity in most of the situations. However, the naïve algorithm usually is not an optimal solution because the algorithm allocates memory entirely without considering the actual work assigned to each thread. This drawback can be remedied if the data-access pattern is known ahead of the computation. 2.3.2 Nucleation-Growth Allocation Algorithm As explained in the previous section, the memory requirement of the data-privatization algorithm is O(np). However, it is not necessary to allocate a complete copy of the force array for each thread, since only a subset of all computation units Λ is assigned to each thread. Therefore, we allocate only the necessary portion of the global force array corresponding to the computation units assigned to each thread as a private force array. This idea is embodied in a three-step algorithm (Figure 2.4): (1) the scheduler assigns computation units to threads and then determines which subset of the global data each thread requires; (2) each thread allocates its 22 private memory as determined by the scheduler; and (3) private force arrays from all threads are reduced into the global force array. Figure 2.4: Memory layout and three-step algorithm for nucleation-growth allocation algorithm. Worker threads only allocate the essential portion of the force arrays, corresponding to their assigned workload. To do this, we create a mapping table between the global force-array index of each particle and its thread-array index in a thread memory space. Since ddcMD sorts the particle data based on the cell they reside in, only the mapping from the first global particle index of each cell to the first local particle index is required. The local ordering within each cell is identical in both the global and private arrays. It should be noted that assigning computation unit λ k to thread T i requires memory allocation more than the memory for the particles in C k . Since each computation unit computes the pair forces of particles in cell C k and half of its neighbor cells nn + (C k ) as shown in figure 1(b), the force data of particles in nn + (C k ) need to be allocated as well. In order to minimize the memory requirement of each thread, the computation units assigned to it must be spatially proximate, so that the union of their neighbor-cell sets has a minimal size. This is achieved by minimizing the surface-to-volume ratio of T i . To achieve this, we implement two algorithms 23 based on a nucleation-growth concept. This approach systematically divides the work as equally as possible with minimal memory overhead. 
These two algorithms are: (1) compact volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS) algorithms. CVAS and BFAS employ a fine-grain load-balancing algorithm, which will be shown in subsection 2.3.2.1. In subsections 2.3.2.2 and 2.3.2.3, we will discuss CVAS and BFAS algorithms in more detail. 2.3.2.1 Thread-level load-balancing algorithm We implement thread-level load balancing based on a greedy approach, i.e., iteratively assigning a computation unit to the least-loaded thread, until all computation units are assigned. Let T i ⊆ Λ denote a mutually exclusive subset of computation units assigned to the i-th thread, ∩ k=0...L-1 λ k = ∅. The computation time spent on λ k is denoted as τ(λ k ). Thus, the computation time of each thread τ(T i ) is a sum of all computation units assigned to thread T i . The algorithm initializes T i to be empty, and loops over λ k in Λ. Each iteration selects the least-loaded thread T min = argmin(τ(T i )), and assigns λ k to it. This original approach is a 2-approximation algorithm [79] and its pseudocode is shown in Figure 2.5. Algorithm Fine-Grain Load Balancing 1. for 0 ≤ i < p 2. T i ← ∅ 3. end for 4. for each λ k in Λ 5. T min ← argmin i (τ(T i )) 6. T min ← T min ∪ λ k 7. end for 24 Figure 2.5: Thread load-balancing algorithm. 2.3.2.2 Compact-volume allocation scheduling (CVAS) algorithm CVAS algorithm consists of the following steps. First, we randomly assign a root computation unit to each thread. Then, the iteration begins by selecting the least-loaded thread T min . From the surrounding volume of T min , we select a computation unit λ j* that has the minimum distance to the centroid of T min , and add it to T min . The algorithm repeats until all computation units are assigned. If all of the surrounding computation units of T min are already assigned, T min randomly chooses a new unassigned computation unit as a new cluster’s root and continue to grow from that point. The pseudocode of the algorithm is shown in Figure 2.6. Algorithm Compact volume allocation scheduling 1. Assign random root cell λ root to each thread 2. while not all of the cells are assigned 3. Find the least loaded thread T min 4. if nearest neighbor cells of T min are already assigned to other threads 5. Randomly pick new root λ root* for thread T min 6. Assign λ root* to T min 7. Set λ root* to be a new seed of T min 8. else 9. Find an unassigned cell λ j* which has minimal distance to the centroid of T min 10. Assign λ j* to T min 11. end if 12. end while Figure 2.6: Pseudocode of compact-volume allocation scheduling algorithm. 25 2.3.2.3 Breadth-first allocation scheduling (BFAS) algorithm One weakness of the CVAS algorithm is a fairly expensive scheduling cost. To improve this aspect, we propose a breadth-first allocation scheduling (BFAS) algorithm as an alternative. BFAS is designed to slightly reduce memory saving but significantly reduces scheduling cost. BFAS consists of the following steps. The initialization step begins by randomly assigning the root computation unit λ root to each thread. Then, an empty queue data structure Q i is defined as a search queue for thread T i . After that, each thread T i adds all of the computation units surrounding λ root into Q i . In the next step, the algorithm proceeds iteratively. First, find the least- loaded thread T min . Then, from the top of the queue Q Tmin , find the computation unit λ j that has not assigned to any thread. 
Subsequently, assign λ_j to T_min, update the workload sum of each thread, and repeat the search for T_min. The algorithm terminates when all of the computation units are assigned. Note that the BFAS algorithm proceeds similarly to the breadth-first search (BFS) algorithm. The pseudocode of BFAS is shown in Figure 2.7.

Algorithm Breadth-first allocation scheduling
1.  for each thread T_i
2.    Initialize queue Q_i = ∅ as a search queue
3.    Assign random seed cell λ_root to T_i
4.    Enqueue all neighbor cells of λ_root to Q_i
5.  end for
6.  while not all of the cells are assigned
7.    Find the least loaded thread T_min
8.    if Q_Tmin is empty
9.      Randomly pick new seed λ_root* for thread T_min
10.     Assign λ_root* to T_min
11.     Add all neighbor cells of λ_root* to Q_Tmin
12.   else
13.     repeat
14.       λ_j = dequeue(Q_Tmin)
15.     until λ_j is unassigned
16.     Assign λ_j to T_min
17.     Enqueue all neighbor cells of λ_j to Q_Tmin
18.   end if
19. end while
Figure 2.7: Pseudocode of the breadth-first allocation scheduling algorithm.

2.4 Theoretical Analysis
In this section, we present an upper-bound analysis of load imbalance (section 2.4.1), a lower-bound analysis of memory consumption (section 2.4.2), and a computational complexity analysis (section 2.4.3) of the proposed algorithms.

2.4.1 Load Imbalance Analysis
The fine-grain load balancing algorithm presented in the previous section is simple yet provides an excellent load-balancing capability. In this section, we show that this approach has a well-defined upper bound on the load imbalance. First, to quantify the load imbalance, we define a load-imbalance factor γ as the difference between the runtime of the slowest thread and the average runtime, normalized by the average runtime:
γ = [max(τ(T_i)) − τ_average] / τ_average .  (2.1)
By definition, γ = 0 when the loads are perfectly balanced. Since min(τ(T_i)) ≤ τ_average,
γ ≤ [max(τ(T_i)) − min(τ(T_i))] / τ_average .  (2.2)
In our load-balancing algorithm, the workload of T_min is increased at most by max(τ(λ_k)) at each iteration. This procedure guarantees that the spread of the workloads among all threads is limited by
max(τ(T_i)) − min(τ(T_i)) ≤ max(τ(λ_k)) .  (2.3)
Substituting Eq. (2.3) in Eq. (2.2) provides an upper limit for the load-imbalance factor,
γ ≤ max(τ(λ_k)) / τ_average .  (2.4)
Performance of this load-balancing algorithm depends critically on knowledge of the time spent on each computation unit, τ(λ_k). Since the runtimes of the computation units are unknown to the scheduler prior to the actual computation, the scheduler has to estimate the workload of each computation unit accurately. Fortunately, since particle positions change slowly, τ(λ_k) remains highly correlated between consecutive MD steps. Therefore, we use τ(λ_k) measured in the previous MD step as an estimator of the current τ(λ_k). This automatically takes into account any local variations in the cost of the potential evaluation. For the first step, as well as for steps when the cell structure changes significantly (e.g., redistribution of the domain centers), the workload τ(λ_k) is estimated by counting the number of pairs in λ_k.

2.4.2 Memory Consumption Analysis
Our nucleation-growth scheduling algorithms tend to distribute the workload equally among threads while minimizing the surface volume of the assigned workload in physical space. Let us assume that the number density of particles in the system, ρ, is uniform. The memory required for storing particle forces for each thread, m_t, is proportional to the volume V_t occupied by the particles required for the force computation of that thread:
m_t = ρV_t .  (2.5)
The particles within volume V_t consist of two groups: (1) main particles, i.e. the particles that are directly assigned to the thread; and (2) surface particles, i.e. the particles that interact with the main particles of the thread. We define the main volume Ω as the volume occupied by the main particles and the surface volume ω as the volume occupied by the surface particles. Then, V_t is written as
V_t = Ω + ω .  (2.6)
From Eqs. (2.5) and (2.6), we can calculate the memory consumption of data privatization if the main and surface volumes are known. Fortunately, both the CVAS and BFAS algorithms tend to form a particular geometry, which can be estimated analytically. The CVAS and BFAS algorithms differ only during the workload assignment phase. CVAS chooses the unassigned cell closest to its centroid, because this minimizes the change of the centroid after assignment. This results in a spherical shape formed by CVAS (Figure 2.8(b)). On the other hand, BFAS chooses unassigned cells layer-by-layer, starting with those closest to its root cell. Since the cells have cubic geometry, the BFAS volume tends to form a cubic shape (Figure 2.8(a)). These considerations on the CVAS and BFAS volume geometry do not take into account the following facts: (1) the other threads might prevent an algorithm from obtaining the optimal cells; and (2) discretization error might cause sub-optimal geometry for the algorithms (see Figure 2.8(c) and (d)). Since these factors always increase the memory consumption above that of the optimal shape, the assumption of perfect CVAS and BFAS geometry implies the best-case (i.e. lower-bound) memory consumption for both algorithms. We perform this analysis in detail in the following subsections.

Figure 2.8: Geometric effects on the surface-to-volume ratio. (a) and (b) show the surface volume (shell area) of cubic and spherical shapes with the same main volume. (c) and (d) show the discretized version, which indicates that the spherical shape consumes less volume than the cubic shape. The square grid represents cells in which particles reside. The blue area refers to the main volume while the red area shows the surface volume. The variable l is defined as the side length of the main volume created by BFAS, while r denotes the radius of CVAS. The surface thickness is r_c.

2.4.2.1 Lower-bound memory consumption of CVAS algorithm
In the best case, Ω of CVAS forms a spherical shape. In this situation, the main volume obtained by each thread is
Ω_CVAS = (4/3)πr^3 ,  (2.7)
or
r = (3Ω_CVAS/4π)^{1/3} ,  (2.8)
where r is the radius of the main volume's sphere; see Figure 2.8(b). The surface volume can be calculated from half of the shell volume around the main volume (due to Newton's third law) with shell thickness equal to r_c:
ω_CVAS = (2/3)π[(r + r_c)^3 − r^3]
       = (2/3)π(r^3 + 3r^2 r_c + 3r r_c^2 + r_c^3 − r^3)
       = 2πr^2 r_c + 2πr r_c^2 + (2/3)πr_c^3 .  (2.9)
Substitution of r from Eq. (2.8) into Eq. (2.9) yields
ω_CVAS = 2π(3Ω/4π)^{2/3} r_c + 2π(3Ω/4π)^{1/3} r_c^2 + (2/3)πr_c^3 ,  (2.10)
where Ω = Ω_CVAS. Let the dimensionless quantity β be
β = Ω^{1/3}/r_c ,  (2.11)
or
r_c = Ω^{1/3}/β .  (2.12)
Substituting r_c from Eq. (2.12) into Eq. (2.10), we obtain
ω_CVAS = 2π(3/4π)^{2/3} Ωβ^{−1} + 2π(3/4π)^{1/3} Ωβ^{−2} + (2/3)πΩβ^{−3}
       = Ω[2π(3/4π)^{2/3} β^{−1} + 2π(3/4π)^{1/3} β^{−2} + (2/3)πβ^{−3}]
       ≈ Ω(2.4180β^{−1} + 3.8978β^{−2} + 2.0944β^{−3}) .  (2.13)
Therefore, the total volume required by each thread under CVAS is
V_t^CVAS = Ω_CVAS + ω_CVAS = Ω_CVAS(1 + 2.4180β^{−1} + 3.8978β^{−2} + 2.0944β^{−3}) .  (2.14)

2.4.2.2 Lower-bound memory consumption of BFAS algorithm
In the best case, Ω_BFAS forms a cubic shape. In this situation, the main volume obtained by each thread is
Ω_BFAS = l^3 ,  (2.15)
or
l = Ω_BFAS^{1/3} ,  (2.16)
where l is the side length of the main volume; see Figure 2.8(a). The surface volume can be calculated from half of the cubic shell volume around the main volume with shell thickness equal to r_c:
ω_BFAS = (1/2)[(l + 2r_c)^3 − l^3]
       = (1/2)(l^3 + 6l^2 r_c + 12l r_c^2 + 8r_c^3 − l^3)
       = 3l^2 r_c + 6l r_c^2 + 4r_c^3 .  (2.17)
Substitution of l from Eq. (2.16) into Eq. (2.17) yields
ω_BFAS = 3Ω^{2/3} r_c + 6Ω^{1/3} r_c^2 + 4r_c^3 ,  (2.18)
where Ω = Ω_BFAS. Substituting r_c from Eq. (2.12) into Eq. (2.18), we obtain
ω_BFAS = 3Ω^{2/3}(Ω^{1/3}β^{−1}) + 6Ω^{1/3}(Ω^{2/3}β^{−2}) + 4Ωβ^{−3} = Ω(3β^{−1} + 6β^{−2} + 4β^{−3}) .  (2.19)
Therefore, the total volume required by each thread under BFAS is
V_t^BFAS = Ω_BFAS + ω_BFAS = Ω_BFAS(1 + 3β^{−1} + 6β^{−2} + 4β^{−3}) .  (2.20)

2.4.2.3 Memory scaling comparison between CVAS and BFAS
From Eqs. (2.14) and (2.20), we can compare how the memory consumption of CVAS and BFAS scales. First, let us assume that Ω_CVAS = Ω_BFAS = Ω for a direct comparison. Then, substituting β from Eq. (2.12) into Eq. (2.14), we obtain
V_t^CVAS/Ω = 1 + 2.4180Ω^{−1/3} r_c + 3.8978Ω^{−2/3} r_c^2 + 2.0944Ω^{−1} r_c^3 .  (2.21)
Since each cell has a volume of r_c^3, we express the main volume in units of r_c^3 as
Ω* = Ω/r_c^3 .  (2.22)
Substitution of Ω* from Eq. (2.22) into the right-hand side of Eq. (2.21) yields
V_t^CVAS/Ω = 1 + 2.4180Ω*^{−1/3} + 3.8978Ω*^{−2/3} + 2.0944Ω*^{−1} .  (2.23)
Similarly, applying the same analysis to the BFAS volume in Eq. (2.20), we obtain
V_t^BFAS/Ω = 1 + 3Ω*^{−1/3} + 6Ω*^{−2/3} + 4Ω*^{−1} .  (2.24)
Figure 2.9 shows the scaling ratio V_t/Ω as a function of Ω* for both algorithms. The ratio is a decreasing function for both CVAS and BFAS. The ratios decrease sharply when Ω* is less than 6 for both algorithms, and subsequently decrease more slowly because of the slow decay of the Ω*^{−1/3} term. We observe that the CVAS algorithm consumes less memory than the BFAS algorithm, due to the geometric effect on the surface-to-volume ratio of a sphere versus a cube, as seen in Figure 2.8(c) and (d). The memory consumption difference between BFAS and CVAS is large for small Ω* and diminishes for larger Ω*. BFAS uses 18% more memory for Ω* = 50, and less than 10% more memory when Ω* > 269. This suggests that for a very large system, the relative memory saving between the two algorithms is insignificant.

Figure 2.9: Lower-bound memory consumption estimation of the CVAS and BFAS algorithms. The graph indicates that CVAS consumes less memory than BFAS.

2.4.2.4 Asymptotic memory consumption for CVAS and BFAS
Equations (2.14) and (2.20) show that both algorithms have a minimum memory requirement of the analytical form
m_t = ρV_t = ρΩ(1 + aβ^{−1} + bβ^{−2} + cβ^{−3}) ,  (2.25)
where a, b, and c are positive prefactors. Substitution of β from Eq. (2.12) into Eq. (2.25) yields
m_t = ρΩ(1 + a r_c Ω^{−1/3} + b r_c^2 Ω^{−2/3} + c r_c^3 Ω^{−1}) = ρ(Ω + a r_c Ω^{2/3} + b r_c^2 Ω^{1/3} + c r_c^3) .  (2.26)
Let n be the number of particles in a node and p be the number of threads per node. On average, each thread has n/p particles in its main volume Ω. From the assumption that the system has uniform density,
Ω = n/(pρ) .  (2.27)
Substitution of Eq. (2.27) into Eq. (2.26) yields
m_t = ρ[n/(pρ) + a r_c (n/(pρ))^{2/3} + b r_c^2 (n/(pρ))^{1/3} + c r_c^3]
    = n/p + a(n/p)^{2/3} + b(n/p)^{1/3} + c
    = Θ(n/p + (n/p)^{2/3}) ,  (2.28)
where the constant factors of ρ and r_c have been absorbed into the prefactors a, b, and c. This is the lower-bound memory consumption per thread, which is identical to our average-case analysis [80] of Θ(n/p + (n/p)^{2/3}) per thread, or Θ(n + n^{2/3}p^{1/3}) per node.

2.4.3 Computation Complexity Analysis
In this section, we analyze the computational complexity of CVAS and BFAS. The CVAS algorithm starts by assigning one cell to each thread. The least-loaded thread then has to choose the cell that minimizes the change of its centroid after assignment, which is the unassigned cell closest to the current centroid. To find that cell, the distances from the centroid to all of the surrounding cells (i.e., the surface cells of the thread's volume) need to be calculated. Focusing on the cost for one thread T_i: in the first step there is only one cell, the root cell, in the thread volume, so the number of distance computations scales as the surface area of one cell. In the second assignment step of T_i, there are two cells in T_i, so the number of distance computations scales as the surface area of two cells. In the last assignment step of T_i, T_i has L/p assigned cells, so the number of distance computations scales as the surface area of L/p cells. Since the surface area scales as the 2/3 power of the volume, the number of distance computations for one thread is
t_thread = 1 + 2^{2/3} + 3^{2/3} + ... + (L/p)^{2/3} = Σ_{k=1}^{L/p} k^{2/3} .  (2.29)
For p threads in one node, the cost is
t_node = p·t_thread = p Σ_{k=1}^{L/p} k^{2/3} .  (2.30)
For large L/p, the summation can be approximated by integration. Therefore, the computation cost is approximated as
p Σ_{k=1}^{L/p} k^{2/3} ≈ p ∫_0^{L/p} k^{2/3} dk = (3/5) p (L/p)^{5/3} = Θ(L^{5/3} p^{−2/3}) .  (2.31)
Since particles are uniformly distributed, the number of particles in each node, n, scales as the number of cells L. Therefore,
t_node = Θ(L^{5/3} p^{−2/3}) = Θ(n^{5/3} p^{−2/3}) .  (2.32)
On the other hand, the BFAS algorithm is identical to BFS in the work-assignment phase. The worst-case complexity of a generic BFS is O(V+E), where V is the number of vertices and E is the number of edges. In the BFAS algorithm, the number of vertices is equal to the number of cells L and the number of edges is equal to 27L. Although multiple BFS searches run simultaneously in BFAS, the total number of vertex traversals is L, and each traversal involves a constant number of neighbor-cell lookups. Thus, the running time of the BFS in the BFAS algorithm scales as Θ(L) = Θ(n).

2.5 Performance Evaluation
In this section, we measure the performance of the CVAS and BFAS algorithms described in section 2.3. Section 2.5.1 shows benchmarks of the scheduling cost of the CVAS and BFAS algorithms. Section 2.5.2 measures the load-imbalance factor of our scheduler, and section 2.5.3 measures the memory-footprint reduction of CVAS, confirming its Θ(n + p^{1/3}n^{2/3}) scaling on average.
Section 2.5.4 demonstrates the reduction of the scheduling cost of CVAS without affecting the quality of load balancing, followed by strong-scaling comparison of the hybrid MPI/OpenMP and MPI-only schemes in section 2.5.5 for CVAS. 2.5.1 Scheduling Cost We have measured average scheduling time of a 128,000-particles system over 1,000 MD steps on 64 BG/P nodes. Figure 2.10 shows the average scheduling cost when the scheduling is performed at every 15 steps. We observe that CVAS spends 8.1, 1.5, and 9.8% more time on scheduling than BFAS for the number of threads p = 2, 3, and 4, respectively. Since the scheduling itself takes almost no time, this measurement implies the overhead of a scheduling routine. Based on this assumption, Fig. 10 shows that the routine overhead accounts for 30-40% of CVAS and BFAS costs. 38 Figure 2.10: Scheduling cost for CVAS and BFAS algorithms for 128,000-particle system on 64 BG/P nodes. The circles and squares are average scheduling times of CVAS and BFAS algorithms, respectively. 2.5.2 Thread-Level Load Balancing We perform a load-balancing test for the CVAS on a dual six-core AMD Opteron 2.3 GHz with n = 8,192. In Figure 2.11, the actual measurement of the load-imbalance factor γ is plotted as a function of p, along with its estimator introduced in section 2.3.2.1 and the theoretical bound, Eq. (2.4). The results show that γ estimated and γ actual are close, and are below the theoretical bound. γ is an increasing function of p, which indicates the severity of the load imbalance for a highly multi-threaded environment and highlights the importance of the fine- grain load balancing. We also observe that the performance fluctuates slightly depending on a selection of seed nodes in CVAS algorithm. While the random root selection tends to provide robust performance compared to deterministic selection, it is possible to use some optimization techniques to dynamically optimize the initial cell selection at runtime. For more irregular applications, it is 39 conceivable to combine the light-overhead thread-level load balancing in this chapter with a high quality node-level load balancer such as a hypergraph-based approach [81]. Figure 2.11: Illustrates the load-imbalance factor γ of CVAS as a function of p from theoretical bound, scheduler estimation, and actual measurement. 2.5.3 Memory Consumption To test the memory efficiency of the CVAS algorithm, we perform simulations on a four quad-core AMD Opteron 2.3 GHz machine with a fixed number of particles n = 8,192, 16,000, and 31,250. We measure the memory allocation size for 100 MD steps while varying the number of threads p from 1 to 16. Figure 2.12 shows the average memory allocation size of the force array as a function of the number of threads for the proposed algorithm compared to that of a naïve data-privatization algorithm. The results show that the memory requirement for 16 threads is reduced by 65%, 72%, and 75%, respectively, for n = 8,192, 16,000, and 31,250 compared with the naïve O(np) memory per-node requirement. In Figure 2.12, the dashed curves show the reduction of memory requirement per thread estimated as 40 m=ap −1 +bp −2/3 (2.33) where the first term represents the memory scaling from actual assigned cells and the second term represents scaling from surface cells of each thread, see Eq. (2.28). The regression curves fit the measurements well, indicating that the average memory requirement is accurately modeled by Θ(n+p 1/3 n 2/3 ). 
Figure 2.12: Presents average memory consumption for the private force arrays as a function of p using CVAS compared to the naïve method. Numbers in the legend denote n. 2.5.4 Reduction-Sum Operation Cost We also measure the computation time spent for the reduction sum of the private force arrays to obtain the global force array. Figure 2.13 shows the reduction-sum time as a function of the number of threads p for n = 8,192, 16,000, and 31,250 particles. Here, dashed curves represent the regression, t reduction =ap 1/3 +b (2.34) which fit all cases well. 41 Figure 2.13: Shows average reduction-sum operation time of the CVAS as a function of p. Numbers in the legend denote n. 2.5.5 Strong-Scaling Benchmark We measure the strong-scaling benchmarks of CVAS. Figure 2.14(a) shows the thread- level strong-scaling speedup on a four quad-core AMD Opteron 2.3 GHz. The algorithm achieves a speedup of 14.43 on 16 threads, i.e., the strong-scaling multi-threading parallel efficiency is 0.90. As shown in section 2.5.3, the combined algorithm reduces the memory consumption up to 65% for n = 8,192, while still maintaining excellent strong scalability. Next, we compare the strong-scaling performance of the hybrid MPI/OpenMP and MPI- only schemes for large-scale problems on BlueGene/P at LLNL. One BlueGene/P node consists of four PowerPC 450 850 MHz processors. The MPI-only implementation treats each core as a separate task, while the hybrid MPI/OpenMP implementation has one MPI task per node, which spawns four worker threads for the force computation. The test is performed on P = 8,192 nodes, which is equivalent to 32,768 MPI tasks in the MPI-only case and 32,768 threads for hybrid MPI/OpenMP. Figure 2.14 (b) and (c) show the running time of 843,750 and 1,687,500 particles 42 systems for the total number of cores ranging from 1,024 to 32,768. The result indicates that the hybrid scheme performs better when the core count is larger than 8,192. On the other hand, the MPI-only scheme gradually stops gaining benefit from the increased number of cores and becomes slower. The MPI/OpenMP code shows 2.58× and 2.16× speedup over the MPI-only implementation for N = 0.84 and 1.68 million, respectively, when using 32,768 cores. Note that the crossover granularity of the two schemes is n/p ~ 100 particles/core for both cases. 43 Figure 2.14: (a) Thread-level strong scalability of the parallel section on a four quad-core AMD Opteron 2.3 GHz with fixed problem size at n = 8,192 particles. (b) Total running time per MD steps of 1,024 – 32,768 Power PC 450 850 MHz cores on BG/P for a fixed problem size at N = 0.84-million particles and (c) 1.68-million particles. 44 This result indicates that the Amdahl’s law limits the performance of the MPI/OpenMP code when using small number of nodes. Namely, only the pair kernel is parallelized, while the rest of the program is sequential in the thread level. This disadvantage of the MPI/OpenMP code diminishes as the number of cores increases. Eventually, the hybrid MPI/OpenMP code performs better than the MPI-only code after 8,192 cores. The main factors underlying this result are: (1) the surface-to-volume ratio of the MPI-only code is larger than that of the hybrid MPI/OpenMP code; and (2) the communication latency for each node of the MPI-only code is four times larger than that of the hybrid MPI/OpenMP code. 
This result confirms the assertion that the MPI/OpenMP model (or similar hybrid schemes) will be required to achieve better strong- scaling performance on large-scale multicore architectures. 2.6 Summary We have designed two new data privatization algorithms CVAS and BFAS based on a nucleation-growth allocation, which have excellent memory consumption compared to a naïve data privatization. Both algorithms have been thoroughly analyzed theoretically and compared. We found that the geometric effect is the cause of more memory consumption in BFAS. The best-case memory consumption is found to scale as the same as the previously proposed average case, i.e. Θ(n+p 1/3 n 2/3 ). The analysis shows that the computational costs of CVAS and BFAS are bounded by Θ(n 5/3 p -2/3 ) and Θ(n), respectively. Also, benchmarks of massively parallel MD simulations suggest significant performance benefits of the hybrid MPI/OpenMP scheme for fine-grain large-scale applications. Using data privatization algorithm. Namely, 2.58× speedup has been observed for 0.8-million particle simulation on 32,768 cores of BlueGene/P. 45 Chapter 3 Performance Characteristics of Concurrency Control Mechanisms In this chapter, we investigate the performance characteristics of hardware transactional memory (HTM) on the BlueGene/Q computer in comparison with conventional concurrency control mechanisms, using an MD application as an example. Benchmark tests, along with overhead-cost and scalability analyses, quantify relative performance advantages of HTM over other mechanisms. We have found that the bookkeeping cost of HTM is high but that the rollback cost is low. We propose transaction fusion and spatially compact scheduling techniques to reduce the overhead of HTM with minimal programming. A strong scalability benchmark shows that the fused HTM has the shortest runtime among various concurrency control mechanisms without extra memory. Based on the performance characterization, we derive a decision tree in the concurrency-control design space for multithreading application. This chapter is organized as follows. Section 3.1 introduces concurrency control and race condition problem in general. Section 3.2 provides background information on MD and 46 concurrency controls. Section 3.3 characterizes each concurrency control mechanism based on a cost model. Section 3.4 analyzes the performance of concurrency controls including HTM for ddcMD code. Section 3.5 proposes a decision tree in the multithreading concurrency-control design space. Conclusions are drawn in section 3.6. 3.1 Introduction The delivery of Sequoia, a 1.6-million core BlueGene/Q supercomputer at Lawrence Livermore National Laboratory (LLNL), marks the end of the free-ride era—legacy parallel applications can no longer increase performance on a new chip without substantial modification. Especially with 64 threads per node of BlueGene/Q, most of the existing distributed-memory parallel algorithms via message-passing are not suitable for the shared-memory model in multicore platform. Thus, an efficient multithreading scheme is needed to exploit the benefits from such large-scale multicore architectures. However, this is a major challenge due to the difficulty of handling race conditions. Traditionally, race conditions in time-sharing processors and early multicore chips were avoided effectively by conventional concurrency controls such as locks, atomic operations, and data privatization. 
However, the performance of these techniques tends to be limited on systems with a large number of concurrent multithreads such as BlueGene/Q. Furthermore, the fine-grain race conditions often encountered in real-world applications such as MD [6, 12, 82] further limits the utility of traditional concurrency controls. Addressing these issues typically requires extensive coding effort, which limits wide use of massive multithreading platforms. Hence, alternative concurrent controls need to be established. 47 To improve the programmability on its large multithreading system, BlueGene/Q has become the first commercial platform with transactional memory (TM) implemented in hardware [83, 84]. TM is an opportunistic concurrency control [85], allowing multiple threads to execute through a critical section concurrently by simply defining a critical code block. Since software TM usually suffers considerable overhead [86], hardware TM (HTM) on BlueGene/Q is expected to significantly reduce runtime overhead. This makes HTM on BlueGene/Q a viable option to deal with race conditions effectively in a large multithreading environment. However, the actual performance of HTM on BlueGene/Q compared to other concurrency control mechanisms on real-world applications has not been widely reported. A study of HTM performance along with the traditional mechanisms on BlueGene/Q is of great significance. In this chapter, we characterize the performance of HTM on BlueGene/Q, along with those of traditional concurrency control mechanisms using the MD code, ddcMD [22, 80], as an example application. We also propose transaction fusion and spatially-compact scheduling techniques to reduce the overhead of HTM with minimal programming. Based on the characteristics of concurrency controls, we propose a decision tree in the concurrency-control design space for multithreading applications. 3.2 Background 3.2.1 Hybrid MPI/OpenMP Molecular Dynamics Molecular dynamics simulation follows the phase-space trajectories of an N-particle system. The forces between particles are computed from the gradient of a potential energy function φ(r 1 , r 2 ,…, r N ), where r i is the position of the i-th particle. Positions and velocities of 48 all particles are updated at each MD step by numerically integrating coupled ordinary differential equations. The dominant computation of MD simulations is the evaluation of the forces and associated potential energy function. One model of great physical importance is the interaction between a collection of point charges, which is described by the long-range, pair-wise Coulomb field 1/r (where r is the interparticle distance), requiring O(N 2 ) operations to evaluate. Many methods exist to reduce this computational complexity [17-19]. We focus on the highly efficient particle-particle/particle-mesh (PPPM) method [17]. In PPPM, the Coulomb potential is decomposed into two parts: A short-range part that converges quickly in real space and a long- range part that converges quickly in reciprocal space. The division of work between the short- range and long-range part is controlled through a “screening parameter” α, and computational complexity for the force calculation is reduced to O(Nlog N). In this study, we use a hybrid parallel implementation of the ddcMD code [80], based on MPI and OpenMP. In this program, particles are assigned to nodes using a particle-based domain decomposition and “ghost-atom” information is exchanged via MPI. 
Parallelization across the cores of a node is accomplished using multithreading via OpenMP. The number of particles in each node is denoted by n = N/P, where P denotes the number of nodes used in the simulation. Although both short-range and long-range kernels are parallelized using OpenMP, we focus on the more challenging short-range computation. The short-range kernel of PPPM computes a sum over pairs: φ = ∑ i<j q i q j erfc(αr ij )/r ij , where q i is the charge of particle i and r ij is the separation between particles i and j. Though this work is focused on this particular pair function, much of the work can be readily applied to other pair functions. 49 3.2.2 Race Conditions and Traditional Concurrency Control Mechanisms A race condition is an unsafe situation where program behavior depends on the relative timing of events. For example, when two or more threads read from and write to the same memory address simultaneously, program results may depend on the order in which the reads and writes are processed. The code section with this possible memory conflict is called a conflict region. Memory conflicts can occur in ddcMD when multiple threads simultaneously update the force of the same particle. Although there is a large number of particle pairs that are prone to memory conflict, the actual occurrence of memory conflict during the force update is rare. To handle the race condition, several concurrency control mechanisms are available. The conventional concurrency control mechanisms such as lock and atomic operation are available in the OpenMP framework. Alternative data privatization schemes require more coding effort, since automatic array reduction is not yet implemented in general OpenMP distributions (data privatization only involving a scalar reduction variable is available in OpenMP). The detail of these mechanisms is described in the following. 3.2.2.1 OpenMP Critical (OMP-critical) OMP-critical is the simplest form of concurrency control, where the conflict region is wrapped by the #pragma omp critical directive. OMP-critical ensures atomicity of the execution in the conflict region by serialization, i.e., allowing only one thread at a time to compute in the conflict region. To achieve this, OMP-critical uses a single global lock to enforce serialization. The major advantage of OMP-critical is its ease of use, but the global lock can cause a severe bottleneck, resulting in tremendous performance degradation. 50 3.2.2.2 OpenMP Atomic (OMP-atomic) Atomic operations in OpenMP are applied by inserting the #pragma omp atomic directive before an atomic update. This causes the compiler to take extra measures to guarantee the atomicity of the scalar update on the following line of code. This mechanism is suitable only for conflict regions that consist solely of a single scalar update. 3.2.2.3 Data Privatization Data Privatization: Data privatization avoids race conditions completely by performing computation on a thread-local buffer instead of reading/writing into the global memory directly. After each thread finishes its computation, partial results from all the threads are combined into the global memory, incurring computational overhead. The main advantage of data privatization is its excellent scalability. However, data privatization needs much more memory than other concurrency controls for local buffer allocation. 3.2.3 Hardware Transactional Memory on BlueGene/Q TM is an opportunistic concurrency control mechanism. 
It avoids memory conflicts by monitoring a transaction, a set of speculative operations in a defined code section. Numerous TM algorithms have been proposed [87-91]. BlueGene/Q is the first commercially available platform that implements TM in hardware, using a multi-versioned L2 cache. This hardware feature allows multiple copies of the same physical memory address to exist. When a conflict is detected by the hardware, the OS kernel is signaled to discard and restart the speculative executions in the conflicted transaction, a process called rollback. If no memory conflict is detected, the transaction is committed at its end and the changes made by the transaction become visible to all threads. To use the HTM, programmers annotate conflict regions with the #pragma tm_atomic directive. (This directive is similar to OMP-critical.) This procedure requires minimal coding effort while allowing efficient non-blocking synchronization for multiple threads. However, rollback possibly undermines the potential benefit from HTM. Numerous factors, such as the algorithm, the number of threads, the memory layout, and the granularity, are likely to influence the number of rollbacks and hence the performance. Earlier evaluations of HTM performance on BlueGene/Q are available in [83, 84]. Our HTM benchmarks are performed on Sequoia, the BlueGene/Q cluster at LLNL. The cluster and compiler specifications used in the benchmark are summarized in Table 3.1. The -qtm compiler option is used to enable the HTM features. HTM statistics such as the numbers of transactions and rollbacks are obtained from the report generated by the HTM runtime. Details of the BlueGene/Q architecture can be found in [92].

Table 3.1: Specification of BlueGene/Q at LLNL.
Processor: PowerPC A2
Clock speed: 1.6 GHz
Physical cores per node: 16
Hardware threads per core: 4
Instruction unit: 1 integer/load/store OP + 1 floating-point OP per clock cycle
L1 data cache: 16 KB (per core)
L2 data cache: 32 MB (shared by all cores)
Memory per node: 16 GB DDR3
Networking: 5D torus
Number of nodes/cores: 98,304 nodes / 1,572,864 cores
Compiler: IBM XLC/C++ version 12.1 (mpixlc_r)

3.3 Performance Characterization of Concurrency Control Mechanisms
In this section, the characteristics of four concurrency-control mechanisms (OMP-critical, OMP-atomic, data privatization, and HTM) are analyzed using a microbenchmark running a single thread on a single core to obtain a cost model.
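The overhead of each mechanism is obtained by differencing single-thread timings of a controlled and an uncontrolled kernel, as formalized in Eq. (3.1) below. A harness for this measurement might look like the following sketch, where the two kernel stubs stand in for the loops of Figures 3.1 and 3.2; the function names are illustrative only.

#include <omp.h>
#include <stdio.h>

static void kernel_serial(void)     { /* the loop of Figure 3.1 would go here */ }
static void kernel_controlled(void) { /* the loop of Figure 3.2 would go here */ }

int main(void)
{
    double t0 = omp_get_wtime();
    kernel_serial();
    double t_serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    kernel_controlled();
    double t_control = omp_get_wtime() - t0;

    printf("overhead t_beta = %g s\n", t_control - t_serial);
    return 0;
}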
3.3.1 Concurrency-Control Cost Modeling Consider a computational kernel, where a conflict region consists of several atomic updates placed inside a loop structure, see Fig. 3.1. The total overhead cost of a control mechanism is estimated from: t β = t control − t serial (β∈{critical,atomic,HTM}) (3.1) where t control is the running time of the kernel with concurrency control, t serial is the running time of the serial kernel. The cost model for each concurrency control mechanism is: OMP-critical: Here the cost mainly comes from the process of obtaining a single global lock at the beginning of the critical section, and releasing the lock at the end. Since the lock/unlock process is not involved in any calculation inside the conflict region, t critical is independent of the size of calculation within the conflict region. Therefore, the cost model of OMP-critical reads t critical =mc lock+unlock (3.2) where m denotes the number of iterations of the for loop in the benchmark kernel, and c lock+unlock denotes the overhead cost of each OMP-critical. 54 OMP-atomic: The cost is associated with an atomic load/store hardware instruction, which is more costly than the ordinary load/store instruction. In BlueGene/Q, high-performance atomic operations are implemented as part of L2 cache access [92]. Since OMP-atomic can only avoid conflict in one atomic update, the cost model of OMP-atomic depends on the number of atomic updates inside the conflict region. The atomic operation cost model reads t atomic =mµc atomic (3.3) where µ is the number of atomic updates enforced with OMP-atomic, and c atomic is the overhead cost of each OMP-atomic. HTM: Each memory address involved in a transaction needs to be monitored for conflict during runtime. This requires additional steps to prepare each transaction in case of rollback. This process incurs a bookkeeping cost (i.e., register check-pointing, operation confinement [83]) associated with every transaction. Furthermore, all speculative executions occur in L2 cache, which incurs considerable latency for L2 accesses. Therefore, the cost of each transaction also depends on the size of conflict region. Hence, the HTM cost model reads: t HTM =m(c HTM_overhead +µc HTM_update ) (3.4) where c HTM_overhead denotes the bookkeeping overhead per transaction, and c HTM_update denotes extra cost per atomic update within the transaction. Since the benchmark only involves a single thread, no rollback occurs and thus the cost of rollback is not included in the model in this section. Data Privatization: The computational overhead is caused by reduction operation in the combining phase. Since the reduction cost is independent of the number of iterations, the cost of data privatization is often negligible compared to the large amount of time spent in the loop. The 55 reduction cost is not applicable to the single thread cost model in this section, and will be included in the ddcMD benchmark in section 3.4. 3.3.2 Model Parameter Fitting Using Microbenchmark Figure 3.1 shows the serial code as a reference. The serial kernel consists of a loop, in which each iteration contains 8 atomic updates. Figure 3.2 shows an example of HTM-control benchmark codes. To identify the cost parameters depending on the number of atomic updates, we vary the number of atomic updates subjected to concurrency controls. For example in Fig. 3.2, HTM is applied to 4 atomic updates. The benchmark is executed using a single thread on a single core. 
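Only the serial and HTM variants of the kernel are listed in Figures 3.1 and 3.2 below. For comparison, an OMP-atomic variant with the same four protected updates would look like the following sketch; SIZE, niter, and stride are assumed to be defined as in those figures, and this fragment is an illustration rather than the exact benchmark code.

double x[SIZE];
for (int i = 0; i < niter; i++)
{
    for (int j = 0; j <= SIZE-8; j += 8*stride)
    {
        #pragma omp atomic
        x[j]          += i*j;        /* protected updates */
        #pragma omp atomic
        x[j+  stride] += i*(j+1);
        #pragma omp atomic
        x[j+2*stride] += i*(j+2);
        #pragma omp atomic
        x[j+3*stride] += i*(j+3);
        x[j+4*stride] += i*(j+4);    /* unprotected updates */
        x[j+5*stride] += i*(j+5);
        x[j+6*stride] += i*(j+6);
        x[j+7*stride] += i*(j+7);
    }
}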
Figure 3.3 shows the average overhead cost of OMP-atomic and HTM per iteration. The benchmark results reveal that the costs of the OMP-atomic and HTM mechanisms increase as the number of controlled updates increases. The average cost of OMP-atomic per update, c_atomic, is 393.9 cycles, while the average cost of HTM per update, c_HTM_update, is 139.4 cycles. HTM has an additional bookkeeping overhead of c_HTM_overhead = 970.9 cycles per transaction. The plot shows that OMP-atomic has less overhead than HTM for a small number of updates (≤ 4), while HTM has less overhead for larger numbers of updates. HTM has a large bookkeeping overhead, which is absent in OMP-atomic, but its cost per update is smaller than that of OMP-atomic. This results in the observed crossover of the relative performance advantage between OMP-atomic and HTM. For OMP-critical, the fitted cost of lock+unlock, c_lock+unlock, is 418.0 cycles. This cost is independent of the number of atomic updates.

 1 double x[SIZE];
 2 for (int i = 0; i < niter; i++)
 3 {
 4   for (int j = 0; j <= SIZE-8; j += 8*stride)
 5   {
 6     x[j         ] += i*j;
 7     x[j+  stride] += i*(j+1);
 8     x[j+2*stride] += i*(j+2);
 9     x[j+3*stride] += i*(j+3);
10     x[j+4*stride] += i*(j+4);
11     x[j+5*stride] += i*(j+5);
12     x[j+6*stride] += i*(j+6);
13     x[j+7*stride] += i*(j+7);
14   }
15 }
Figure 3.1: The benchmark kernel of a serial code without concurrency control.

 1 double x[SIZE];
 2 for (int i = 0; i < niter; i++)
 3 {
 4   for (int j = 0; j <= SIZE-8; j += 8*stride)
 5   {
 6     #pragma tm_atomic
 7     {
 8       x[j         ] += i*j;
 9       x[j+  stride] += i*(j+1);
10       x[j+2*stride] += i*(j+2);
11       x[j+3*stride] += i*(j+3);
12     }
13     x[j+4*stride] += i*(j+4);
14     x[j+5*stride] += i*(j+5);
15     x[j+6*stride] += i*(j+6);
16     x[j+7*stride] += i*(j+7);
17   }
18 }
Figure 3.2: The benchmark kernel with HTM control over 4 updates.

Figure 3.3: Average overhead cost per iteration of OMP-atomic and HTM as a function of the number of controlled updates.

3.4 Molecular Dynamics Benchmark

To evaluate the performance of the concurrency control mechanisms in an actual application, we focus on the race conditions in the force-update routine of ddcMD. Figure 3.4 shows the kernel of the ddcMD force-update routine, where memory conflicts may occur. The code has two conflict regions: (1) the force update of array fB[j] in lines 28-33; and (2) the force update of array fA[0] in lines 36-41 of Fig. 3.4. Concurrency control needs to be applied to these two regions. Here, we define the baseline code as a reference in which no concurrency control is used (i.e., the best-possible execution time, but the result is likely wrong). For OMP-critical- or HTM-controlled concurrency, the corresponding compiler directive (i.e., #pragma omp critical or #pragma tm_atomic) substitutes the comments in lines 27 and 35. In the case of OMP-atomic, the directive #pragma omp atomic is inserted before every statement in both conflict regions.
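As a minimal illustration (not the ddcMD source; four_vector stands in for ddcMD's FOUR_VECTOR type), the two annotation styles applied to conflict region #1 of the listing below would look as follows: a single block-level directive for OMP-critical or HTM, versus one OMP-atomic directive per statement.

typedef struct { double v, x, y, z; } four_vector;   /* stand-in for FOUR_VECTOR */

/* Block-level protection: one directive covers the whole conflict region. */
void update_fB_block(four_vector *fBj, double qA0,
                     double vij, double fxij, double fyij, double fzij)
{
    #pragma tm_atomic            /* or: #pragma omp critical */
    {
        fBj->v += qA0 * vij;
        fBj->x -= qA0 * fxij;
        fBj->y -= qA0 * fyij;
        fBj->z -= qA0 * fzij;
    }
}

/* Statement-level protection: OMP-atomic must be repeated for every update. */
void update_fB_atomic(four_vector *fBj, double qA0,
                      double vij, double fxij, double fyij, double fzij)
{
    #pragma omp atomic
    fBj->v += qA0 * vij;
    #pragma omp atomic
    fBj->x -= qA0 * fxij;
    #pragma omp atomic
    fBj->y -= qA0 * fyij;
    #pragma omp atomic
    fBj->z -= qA0 * fzij;
}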
 1 FOUR_VECTOR ftmp;
 2 ftmp.v = ftmp.x = ftmp.y = ftmp.z = 0.0;
 3 double rA0x, rA0y, rA0z;
 4 double qA0 = qA[0];
 5
 6 rA0x = rA[0].x;
 7 rA0y = rA[0].y;
 8 rA0z = rA[0].z;
 9
10 for (int k=0;k<nPList;k++)
11 {
12   double fs,vij,fxij,fyij,fzij;
13   int j=list[k];
14   double complex ff = fv[k];
15   vij= creal(ff);
16   fs = cimag(ff);
17   fxij=fs*(rA0x - rB[j].x);
18   fyij=fs*(rA0y - rB[j].y);
19   fzij=fs*(rA0z - rB[j].z);
20
21   double qBj = qB[j];
22   ftmp.v += qBj*vij;
23   ftmp.x += qBj*fxij;
24   ftmp.y += qBj*fyij;
25   ftmp.z += qBj*fzij;
26
27   //conflict region #1 in fB[j]
28   {
29     fB[j].v += qA0*vij;
30     fB[j].x -= qA0*fxij;
31     fB[j].y -= qA0*fyij;
32     fB[j].z -= qA0*fzij;
33   }
34 }//end nPList loop
35 //conflict region #2 in fA[0]
36 {
37   fA[0].v += ftmp.v;
38   fA[0].x += ftmp.x;
39   fA[0].y += ftmp.y;
40   fA[0].z += ftmp.z;
41 }
Figure 3.4: Code fragment from the force-update routine in the ddcMD code, which has potential memory conflicts.

3.4.1 Performance Comparison of Concurrency Controls

In this subsection, we perform a strong-scaling benchmark of ddcMD on 64 BlueGene/Q nodes. The total number of particles used in the benchmark is fixed at 1,024,000, and the particles are uniformly distributed. Figure 3.5(a) shows the average runtime per MD step for each concurrency control mechanism as a function of the number of threads per node p from 1 to 64.

The baseline result (black dashed line) exhibits excellent scalability, with a 15.7-fold speedup over single-thread performance for 16 threads. This result is unsurprising, since each thread runs on 1 of the 16 physical cores of a BlueGene/Q node. However, the speedup of the baseline code increases to 26.5 when the number of threads is increased to 32. The increased performance from 16 to 32 threads comes from the dual instruction units of the PowerPC A2 processor of BlueGene/Q. We observe a further increase of the speedup to 40.0 for 64 threads. This additional speedup likely arises from latency hiding enabled by hardware multithreading. These special hardware features of BlueGene/Q thus enable a multithreading speedup larger than the number of physical cores. Since the baseline code does not guarantee the correctness of the result, this speedup should be regarded as the upper limit of the actual speedup with concurrency controls.

The data-privatization code (blue squares) achieves nearly ideal speedup (i.e., almost the same runtime as the baseline that does not guarantee correctness). The measured speedup is 15.7, 26.5, and 38.9 for 16, 32, and 64 threads, respectively. It is worth mentioning that the data privatization implemented in ddcMD employs a spatially-compact scheduling technique, which will be discussed in subsection 3.4.3. This technique significantly reduces the memory footprint, i.e., a memory reduction by a factor of 64/5.04 ~ 12.7 compared with naïve data privatization for 64 threads [80].

The OMP-critical runtime (orange open squares) is not monotonically decreasing, but instead starts to increase above 16 threads. This is due to the serialization in the OMP-critical mechanism, which limits the parallel speedup for larger numbers of threads according to Amdahl's law. Thus, OMP-critical is only suitable for a small number of threads. The best speedup that the OMP-critical mechanism achieves is 6.3 for 16 threads. The runtime of OMP-atomic (red circles) is high compared to the baseline. However, it still exhibits excellent scalability, i.e., a 40.7-fold speedup for 64 threads.
The HTM result (green diamonds) shows the highest overhead cost among all the tested methods. Nevertheless, the scalability of HTM is still high: the measured speedup is 14.9, 24.1, and 33.5 for 16, 32, and 64 threads, respectively. The slightly reduced scalability above 32 threads likely originates from the rollback cost due to increased frequency of conflict. To improve the performance of HTM, we aim to reduce the overhead cost by fusing multiple transactions into a single transaction, so that the number of transactions is reduced. According to Eq. (3.4), this will reduce the bookkeeping cost mc HTM_overhead , though the update cost mµc HTM_update will be unchanged. To do this, the entire NPList loop, along with the second conflict region (line 10-41 in Fig. 3.4) is wrapped by #pragma tm_atomic instead of the conflict regions in the loop. Runtime of fused HTM (purple triangles) is significantly decreased from naïve HTM and becomes even smaller than OMP-critical and OMP-atomic. The speedup of fused HTM is 14.7, 26.1, and 38.2 for 16, 32, and 64 threads, respectively. HTM runtime report shows that the number of committed transactions is reduced by a factor of 5.9 by the transaction 62 fusion technique. Although HTM fusion is a useful technique to reduce the overhead cost, it should be noted that the hardware limitation (e.g. L2 cache capacity) poses an upper bound for the size of the fused transaction. Figure 3.5(b) shows the runtime of each mechanism normalized by that of the baseline for different numbers of threads, p. The numeral for each set mechanism in the graph indicates the normalized runtime averaged over p: 1.772, 3.126, 1.003, 2.202, and 1.435 for OMP-atomic, OMP-critical, data privatization, naïve HTM, and fused HTM, respectively. Although the runtime of data privatization is the smallest, it incurs a large memory overhead (5.04 times more). Among the concurrency controls without extra memory overhead (OMP-critical, OMP- atomic, naïve HTM, and fused HTM), the fused HTM exhibits the smallest runtime. 63 Figure 3.5: (a) Strong-scaling runtime of ddcMD comparing the performance of concurrency control mechanisms. (b) Runtime overhead of each concurrency controls normalized by the baseline runtime. 64 3.4.2 Rollback Penalty To quantify the cost penalty and frequency of rollback, we first define the rollback ratio, f rb = n rb /n transaction , where n rb and n transaction denote the total number of rollbacks and that of committed transactions, respectively. These two quantities can be obtained from the HTM- runtime generated report. Measuring rollback cost is challenging since we need to compare runtime of two HTM codes with/without the race condition, which have almost the same computational pattern. To achieve this, we quantify the rollback cost by measuring runtime difference between normal HTM code (i.e., fused HTM from subsection 3.4.1) and non-conflict HTM code. The non- conflict HTM code is constructed by combining data privatization and HTM mechanisms within a single run. In particular, the force computation is performed in the conflict-free local buffer of each thread. Then, the HTM is applied on top of the local buffers. This still involves the HTM overhead cost but without memory conflict. In this benchmark, 1,024,000 particles are simulated on 64 nodes of BlueGene/Q using fused HTM code and data privatization/HTM combined code. The number of threads on each node p varies from 1 to 64. 
In each run, extra runtime from non-conflict code compared to the normal HTM runtime are measured for different numbers of threads. Figure 3.6 shows the extra runtime as a function of rollback ratio. The total number of committed transactions in all 64 nodes is n transaction = 9.85×10 7 per MD step, in all test cases. We observe a linear relation between the increased runtime and the rollback ratio. From the slope of the plot, we estimate that the runtime increases by 0.69% per 1% increase of rollback ratio. This indicates insignificant 65 performance penalty caused by rollback in MD. Typical rollback ratio in ddcMD is less than 8% according to the HTM runtime report. Figure 3.6: Impact of rollback on runtime, indicating linear correlation between runtime overhead and rollback ratio. 3.4.3 Effect of Scheduling Algorithms on Rollback In ddcMD, particles assigned to each node are sorted into cubic cells with side length of r c (the cutoff radius of the short-range interactions) based on their spatial coordinate. This reduces the complexity of short-range force computation from O(N 2 ) to O(N). Here, the cells in each node are assigned to threads by the load-balance scheduler. In each scheduling step, scheduler picks a cell from the unassigned-cell pool and assigns it to the least-loaded thread to maintain load balance. This scheduling algorithm works well in most cases. However, the choice of the selected cell in each scheduling iteration affects the number of particle pairs with potential 66 memory conflict, which in turn influences the rollback ratio of HTM. Here, we propose a spatially-compact scheduling algorithm, which minimizes the number of rollbacks by preserving the spatial proximity of the cells assigned to each thread. The cell-selection algorithms used in ddcMD (including the proposed one) are described in the following. Baseline scheduling sweeps cell indices in the x, y, and z directions in turn. This scheduling algorithm has similar characteristic to the naïve #pragma omp parallel for with dynamic scheduling. It considerably increases memory conflicts since each thread is likely overlapping with at least two other threads that are assigned nearest-neighbor cells. Random scheduling picks a cell from the unassigned-cell pool randomly. Compared to the baseline scheduling, this reduces the chance that two or more threads are assigned nearest- neighbor cells; see Fig. 3.7(a). Spatially-compact scheduling assigns each cell to the least-loaded thread, while preserving the spatial proximity of the cells assigned to each thread, see Fig. 3.7(b). This reduces the surface area of the cluster of cells assigned to each thread, which is prone to memory conflicts. More information on the spatially-compact scheduling for MD can be found in [80]. 67 Figure 3.7: Workload distribution in a physical space, where cells assigned to different threads are distinguished by different colors. (a) Random scheduling; and (b) Spatially-compact scheduling. To quantify the effect of scheduling algorithms on the number of rollbacks, we perform a strong-scaling benchmark involving 8-million particles on 64 BlueGene/Q nodes using fused HTM. Here, the cell-selection algorithm used in the thread-level scheduling is varied between baseline, random, and spatially-compact algorithms. Due to the decreasing thread granularity in strong-scaling benchmark, higher rollback ratio is expected for larger numbers of threads. Figure 3.8(a) shows the rollback ratio as a function of the number of threads p. 
The graph shows that the rollback ratios of all scheduling algorithms increase almost linearly with p. The rollback ratios for baseline, random, and compact schedulings are 32.9%, 11.4%, and 7.3%, respectively for 64 threads. Figure. 3.8(b) shows the increase of the runtime as a function of p, corresponding to the rollback profiles in Fig. 3.8(a). Here, the extra runtime of baseline scheduling is much higher than the random and compact schedulings, which is in accord with their rollback characteristics. For 64 threads, the extra runtimes of 21.4%, 8.9%, and 5.4% are observed for baseline, random, and compact schedulings, respectively. The ratio of extra runtime in Fig. 3.8(b) and the rollback ratio in Fig. 3.8(a) is consistent with that in subsection 3.4.2. This 68 result demonstrates the importance of thread scheduling in MD when HTM is used. The number of rollbacks can be reduced by the proposed spatially-compact scheduling algorithm. Figure 3.8: Rollback ratio (a) and the runtime overhead (b) as a function of the number of threads for the three scheduling algorithms. 69 3.5 Design Space for Concurrency Controls Although the implementation of HTM on BlueGene/Q provides more options to deal with race conditions in massive-multicore multithread programming, choosing the optimal concurrency control mechanism for a particular application is still a difficult issue. In this section, we propose a concurrency-control decision tree to handle race condition in multithread programming, based on the performance characteristics from the previous sections. To discuss the concurrency-control design space, we introduce several parameters that are associated with program runtime and conflicts. First, we describe parameters that are obtained from pre-profiling of the serial code. Figure 3.9 shows a typical program structure of loop-oriented applications such as MD. Let conflict region (CR) be a part with a race condition in the target loop (TL) for multithreading. Pre-profiling is performed to obtain the following parameters: (1) the total time in serial code spent in target loop t TL ; and (2) the total time spent in the conflict region t CR . Based on these parameters, we define the fraction of time spent in the conflict region as f CR = t CR t TL (3.5) In addition to the pre-profiling parameters, we define the number of threads p and the number of atomic updates inside the conflict region µ. The user specifies p, while µ can be obtained easily from code inspection. M avail is the memory available for data privatization. 70 Figure 3.9: Typical program structure of loop-oriented applications Figure 3.10 shows the decision tree to choose the optimal concurrency control mechanism among OMP-critical, OMP-atomic, HTM, and data privatization. The decision tree focuses on providing the best performance, i.e., high scalability and low overhead cost, based on the analysis and benchmark results in the previous sections. In the following, we elaborate the reasoning for each decision. Data Privatization: According to the benchmark result in section 3.4, data privatization has the best performance. Namely, it has the smallest overhead among all mechanisms and the decent thread scalability up to 64 threads. Thus, data privatization is the first choice to be 71 considered. Although data privatization includes the reduction overhead (e.g. O(nlog p) for hypercube reduction), it is usually negligible. 
The major condition that restricts the use of data privatization is the available memory, since the simple application of data privatization could require up to p times more memory. If the available memory is not enough for data privatization, we should consider the other mechanisms that do not require extra memory. OMP-critical: The low overhead cost and the simplicity are the main advantages of OMP-critical. However, its scalability degrades significantly when p is large. Thus, OMP-critical is only suitable for small p (e.g. p ≤ 8 in Figs. 3.5(a) and 3.5(b)). The critical number of threads p crit = 1/f CR denotes the largest number of threads that can theoretically run concurrently without a lock contention for the serialized fraction of f CR . In the decision tree, we recommend p < p crit /2 to ensure reasonable scalability. OMP-atomic and HTM: Simplicity without compromising scalability is the major advantage of OMP-atomic and HTM. To choose between these two mechanisms, we use the cost models in section 3.3 to find the one with the lower overhead. If OMP-atomic is applicable (i.e., only atomic updates are present in the conflict region) and the OMP-atomic cost is lower, then OMP-atomic should be used. Otherwise, the decision depends on conflict rate f rb . If the conflict rate is low, HTM should be used. (Note that the overhead cost of HTM can be further reduced by using HTM fusion technique.) According to Schindewolf et al. [84], the threshold for the usability of HTM is f rb ≤ 1. For f rb > 1, OMP-critical can be used to avoid memory conflicts although it has high penalty. We found that in some cases, minor code structural change could improve the performance of concurrency controls. 72 Figure 3.10: Decision tree for concurrency-control design space. We should note that OMP- critical at the bottom has very high performance penalty. Once the decision of concurrency control mechanism is made, we can estimate the theoretical parallel efficiency for each mechanism. Table 3.2 summarizes the parallel efficiency estimate based on the profiling parameters. In Table 3.2, t total is the total runtime of the serial 73 code and c reduction denotes the reduction cost per particle per hypercube dimension. Figure 3.11 summarizes the advantages/disadvantages of the concurrency control mechanisms considered in this study. Table 3.2: Parallel efficiency estimate resulting from the decision tree Concurrency control mechanism Parallel efficiency OMP-critical e=min( 1 pf CR ,1) OMP-atomic e= t total t total +mµc atomic Data privatization e= t total t total +c reduction nlogp HTM e= t total t total +m(c HTM_overhead +µc HTM_update ) Figure 3.11: Analysis of four concurrency control mechanisms. Circles and triangles denote strong and weak advantages, respectively, while crosses denote disadvantage. 74 3.6 Conclusion and Discussion We have investigated the performance of HTM and conventional concurrency control mechanisms on BlueGene/Q using microbenchmarks and the MD application ddcMD. The microbenchmark indicates the overhead cost of HTM is smaller than OMP-atomic for a large number of threads. The strong scaling benchmark shows excellent scalability of HTM as well as OMP-atomic and data privatization. The proposed fused-HTM and spatially compact scheduling techniques have sped up naïve HTM by a factor of 1.53. An analysis of rollback penalty showed minor impact of rollback on the performance of ddcMD. 
Finally, we have derived a decision tree in the concurrency-control design space for multithreaded applications. This will help application developers choose the concurrency control mechanism with the best performance for their applications.

Chapter 4
Embedded Performance Prediction Model

We have shown that a hybrid message-passing and multithreading algorithm performs better than the MPI-only scheme when the number of executing cores is large and the problem size per core (i.e., granularity) is small. However, the number of cores in most emerging extreme-scale clusters is expected to exceed several million, which increases the communication cost significantly compared with all of the previous studies. This renders previous performance models insufficient for the future generation of extreme-scale clusters.

In this chapter, we propose a performance prediction model for hybrid message-passing and multithreading parallelization on future massive-scale clusters based on the LogP model [93], which aims to provide performance trends of hybrid and MPI-only algorithms. As a demonstration, we will develop a model for a hybrid parallelization scheme of MD simulations, which represents a computationally sophisticated scientific problem. The model will be validated on BlueGene/P clusters at Lawrence Livermore National Laboratory and Argonne National Laboratory.

4.1 Embedded performance prediction model

As the multicore era has emerged for extreme-scale clusters, it is uncertain which algorithms will be the best option at such a large scale. The longstanding MPI-only algorithm has proved to be exceptionally efficient on current supercomputers [22, 77], but might not perform well on future generations. While the prospective hybrid algorithm has yet to prove efficient in all scenarios [80], a guiding model is needed in order to foresee the scalability of current scientific software on the incoming clusters.

Shaw [54] proposed a novel parallel MD algorithm called Neutral Territories, along with a performance prediction model for bio-MD at extreme-scale strong scaling (i.e., a small problem on a large number of cores). The model showed that, as the number of processors approaches infinity, Shaw's algorithm becomes the optimal option among the competing decomposition algorithms. The performance model also provided a good analysis of the scaling of the bandwidth cost, but did not account for the latency cost because the program was run on specialized low-latency hardware [23]. We should note that the latency cost is significant for most commodity clusters.

Chow and Hysom [61] presented an empirical performance model suggesting that the message-passing efficiency can be measured as the ratio of the communication time of the hybrid application to the communication time of the pure MPI version. Adhianto and Chapman [58] extended the Chow-Hysom model so as to address multiple communication threads. Although these models have only been tested on small-scale clusters, they provide good insight into a hybrid scheme's behavior within a node. Liebrock and Goudy [94] proposed a relaxed analytical model and a detailed methodology for designing a performance model for a hybrid program, suggesting that a semi-empirical approach provides a good balance between prediction accuracy and extrapolation capability.
Wu and Taylor [95] proposed a performance prediction model and benchmarked it for hybrid memory-intensive applications, confirming that the reduced performance of such applications was due to bandwidth limitations.

With our hybrid performance model, we aim to predict performance trends toward large-scale applications using hybrid parallel algorithms. To achieve this, we need a model that balances the accuracy of an empirical approach against the adaptability of an analytical model. Based on our experience, we make the following assumptions: (i) hybrid parallel algorithms are based on combinations of shared- and distributed-memory models; (ii) in one program, different code sections have different computation/communication patterns; (iii) the scalability of domain decomposition via message passing is limited by the communication, while the parallel efficiency of the computation section is usually excellent; and (iv) multithreading benefits from the shared memory, which has negligible communication cost among cores.

Assumption (i) is based on the hierarchical structure of modern platforms: (1) inter-node parallelization utilizes distributed-memory parallelism through MPI; (2) inter-core communication uses a shared-memory paradigm via multithreading; (3) core-to-accelerator interaction explicitly transfers data by message passing; and (4) intra-accelerator communication uses a shared-memory paradigm through a specialized multithreading scheme (e.g., single-instruction multiple-thread execution on GPGPUs). The architectural hierarchy described above is shown in Figure 4.1. Assumption (ii) holds true since a typical program usually comprises many computation kernels that use different algorithms. Assumption (iii) is consistent with the models proposed in [58, 61, 95], which express the parallel efficiency of message passing via the ratio of the communication costs of the hybrid and MPI-only implementations. Assumption (iv) was confirmed in our previous study [35], where the time spent in the parallel section was overwhelmed by the sequential section as the number of cores in use increased. This finding agrees with the result presented by Chow and Hysom [61].

Figure 4.1: Hierarchical architecture of a modern high-end machine (node: distributed memory via MPI; core: shared memory via multithreading; accelerator (GPU/manycore): shared memory via SIMT). Combinations of distributed- and shared-memory parallelism are used at each architecture level.

Based on the aforementioned assumptions, we have designed an embedded performance prediction model (EPM) to predict the performance trend of applications that implement a combination of message passing and multithreading.

4.1.1 Grand Scheme of the Embedded Performance Prediction Model

The grand scheme of EPM expresses application runtime as a linear combination of computation and communication costs. Let l = 1, …, n_arch − 1 denote the architecture level, where l = 1 refers to the parallelism at the top-level architecture (e.g., message passing in the hybrid MPI/OpenMP scheme) and n_arch denotes the number of levels in the structure of the hybrid architecture.
The runtime prediction T(N) at the l-th level for the message-passing parallel scheme is
\[ T_l^{\mathrm{MP}}(N) = \sum_{k\in S} a_l^{k}\, \Omega_l^{k}\!\left(\frac{N}{P_l^{k}}\right) + \sum_{k'\in S'} L_l^{k'} + \sum_{k'\in S'} b_l^{k'}\, W_l^{k'}\, \omega_l^{k'}\!\left(N, P_l^{k'}\right), \]  (4.1)
while the runtime prediction for multithreading is
\[ T_l^{\mathrm{MT}}(N) = \left( \sum_{k\in S} \frac{a_l^{k}}{P_l^{k}} + \sum_{k'\in S'} b_l^{k'} \right) \Omega_l(N), \]  (4.2)
where
N = problem size;
S = set of computation code sections;
S' = set of communication code sections;
Ω_l^k(N) = computation cost function of the k-th section at the l-th level;
ω_l^{k'}(N, P) = communication cost function of the k'-th section at the l-th level;
L_l^{k'} = latency cost of the k'-th section at the l-th level;
W_l^{k'} = bandwidth cost of the k'-th section at the l-th level;
P_l^k = number of parallel units (e.g., nodes, cores) used by the k-th section at the l-th level;
a_l^k, b_l^k = prefactor costs of the k-th section at level l.

The model allows a program to have various sections with different computation/communication characteristics, following assumption (ii) in section 4.1. The model uses the semi-empirical approach suggested by Liebrock and Goudy [94] to balance accuracy and extrapolability. Architecture- and program-dependent parameters need to be fitted to provide accurate costs of the key operations in the applications. The grand scheme is a framework to describe the common factors that affect the performance of general scientific codes. To improve the accuracy of the model, specific analytical terms should be adopted. We illustrate this tailoring of the grand scheme for MD in subsection 4.1.2.

4.1.2 Tailored Scheme of the Embedded Performance Prediction Model for Molecular Dynamics

Molecular dynamics is a challenging application for performance prediction. The main complications are: (1) the dynamic nature of the simulation causes rapidly fluctuating performance; and (2) the irregular memory-access pattern incurs uneven performance in each run. Here, we assume a hybrid algorithm similar to the ddcMD code presented in chapter 2. Hence, the tailored model of the hybrid algorithm is expressed as
\[ T(N) = a\, \Omega_{\mathrm{thread}}\!\left(\frac{N}{P}\right) + 12\, t_{\mathrm{latency}} + b W \left(\frac{N}{P}\right)^{2/3} + t_{\mathrm{latency}} \log P, \]  (4.3)
and
\[ \Omega_{\mathrm{thread}}(N) = \left(\frac{c}{p} + d\right) \left( T_s \frac{13}{2} \rho^{2} r_c^{6} L^{3} + T_c \frac{2}{3}\pi \rho^{2} r_c^{6} L^{3} \right). \]  (4.4)
On the other hand, the MPI-only model is expressed as
\[ T(N) = a\, \Omega_{\mathrm{MPI}}\!\left(\frac{N}{Pp}\right) + 12\, t_{\mathrm{latency}} + b W p \left(\frac{N}{Pp}\right)^{2/3} + t_{\mathrm{latency}} \log(Pp) = a\, \Omega_{\mathrm{MPI}}\!\left(\frac{N}{Pp}\right) + 12\, t_{\mathrm{latency}} + b W p^{1/3} \frac{N^{2/3}}{P^{2/3}} + t_{\mathrm{latency}} \log(Pp), \]  (4.5)
and
\[ \Omega_{\mathrm{MPI}}(N) = T_s \frac{13}{2} \rho^{2} r_c^{6} L^{3} + T_c \frac{2}{3}\pi \rho^{2} r_c^{6} L^{3}, \]  (4.6)
where
N = number of atoms;
P = number of nodes;
p = number of cores per node;
ρ = number density of atoms per cell;
r_c = cutoff radius of the potential;
L = total number of cells per node;
T_s, T_c = key-operation costs;
a, b, c, d = fitting parameters.

The models in Eqs. (4.3)-(4.6) consider the following functions of the code: (1) force-computation routines; (2) atom caching through halo exchange; (3) the latency cost of global summations; (4) the pair-search kernel in the computation routines; and (5) the actual force-computation kernel. Since only a small amount of data is communicated through global summation in this MD code, the bandwidth cost of global communication is negligible. There are a total of six fitting parameters required in the tailored model. Fortunately, we can reduce the complexity of parameterization by fitting each parameter individually, because the runtime can be measured for each code section separately.
4.1.3 Model Validation In the first phase, the model will be tested on a single machine in order to validate the threading parts of the model. Subsequently, we will perform our small- and moderate-scale benchmarks at the HPCC facility of USC. In this facility, we have access to a collection of different architectures of Intel Xeon and AMD Opteron machines. In the final stage, the model will be validated on BlueGene/P and/or BlueGene/Q at Lawrence Livermore National Laboratory and Argonne National Laboratory. This gives an opportunity to validate the extrapolation capability of the model on large-scale runs, thus confirming the ability to predict the performance toward future architectures. The platform specifications are given in the Table 4.1. Table 4.1: Benchmark cluster specifications Specification USC-HPCC LLNL-BlueGene/P LLNL-BlueGene/Q Number of nodes 785 dual cores/dual processors + 1,990 quad- and hex-cores nodes 36,864 nodes 98,304 nodes Number of cores per node 2/8/12 4 16 computation + 1 operating system + 1 reserved Memory-access Mixed NUMA and SMP SMP 83 Specification USC-HPCC LLNL-BlueGene/P LLNL-BlueGene/Q model SMP Memory Mostly 16-24 GB/node 4 GB/node, 147 TB total 16 GB/node, 1.6 PB total Clock speed 2.3/2.5/2.66 GHz 850 MHz 1.6 GHz Floating-point performance (measured/peak) 126/176 teraFLOPS [96] 416/501 teraFLOPS [96] 20 petaFLOPS Networking Myrinet 10g Proprietary Proprietary Network topology Mixed 3D mesh + specialized for collective 3D mesh + specialized for collective 4.2 Optimal Parallelization Prediction via Performance Model Based on the benchmark results comparing hybrid MPI/OpenMP and MPI-only parallelization of MD in [80], we expect that the crossover of optimality of between the two schemes is strongly correlated with granularity. Figure 4.2 shows a performance benchmark on BlueGene/P at LLNL, which implies that the optimal-region boundary is linear in terms of granularity N/P. In this section, we develop a performance models which explain those results and predicts the behaviors at the extremities. 84 Figure 4.2: Optimal region of MPI-only versus hybrid MPI/OpenMP implementations. The crossover line expected to remain constant at N/P. 4.2.1 Performance Model Design Principles The observations from the benchmark results in [80] can be summarized as the following: • There are at least 2 terms in the performance model – decreasing as a function of number of nodes P and increasing as a function of P. • The decreasing term is dominant when P is small while the increasing term is dominant when P is large. • Since the MPI-only code uses the same computation kernel as the hybrid code, the model should be described by the same function with different parameters. 85 Based on the aforementioned observations, we design an embedded performance model (EPM) using on LogP paradigm [93] (i.e. the model described by latency, overhead, bandwidth, and number of processors). The following assumptions are made for the EPM model, which is closely to the environment/system in [80]: • One thread per core is used to obtain maximum performance. • Memory bandwidth is not described in this model. • The multithreading model is based on SMP architecture. • Network is not congested. • Not all of the computation routine can be multithreaded. Only a fixed fraction of the routine. (i.e. Amdahl’s law). • For a particular multicore architecture, the number of core per node is constant. • Network topology of the cluster is 3D torus (e.g. on BG/L and BG/P) and the MPI task assignment is random. 
The non-congested computation-communication cost model is defined as T = T_comp + T_comm, where T_comm = T_bandwidth + T_latency. Thus, the EPM model is
\[ T(N,P,\eta,\lambda,\alpha) = a \left( \frac{\alpha}{\lambda} + (1-\alpha) \right) \frac{N}{P\eta} + b\, \eta^{1/3} \left( \frac{N}{P} \right)^{2/3} + c\, \eta\, P^{1/3}, \]  (4.7)
where
T = predicted runtime;
N = number of atoms;
P = number of nodes;
η = number of MPI tasks per node;
λ = number of threads per MPI task;
α = thread-parallelizable fraction, 0 ≤ α ≤ 1;
a, b, c = fitting parameters.
Note that the number of cores per node q = ηλ is constant, while the total number of cores in the system is Pq. In this study, we compare the performance of the traditional MPI-only and hybrid MPI/OpenMP implementations. We derive the tailored models for the MPI-only implementation and the fully hybrid MPI/OpenMP implementation as special cases in the following subsections.

4.2.1.1 Traditional MPI-only performance model

In the traditional MPI-only implementation, we use one MPI task per core. In this case, we assign η = q and λ = 1 in the EPM model. Hence, the MPI-only model reads
\[ T_{\mathrm{MPI}}(N,P,q) = a \frac{N}{Pq} + b\, q^{1/3} \left( \frac{N}{P} \right)^{2/3} + c\, q\, P^{1/3}. \]  (4.8)

4.2.1.2 Hybrid MPI/OpenMP performance model

Hybrid MPI/OpenMP describes the hybrid implementation, which handles all of the compute cores in each node using multithreading. In particular, we assign η = 1 and λ = q in the EPM model. Therefore, the hybrid performance model can be expressed as
\[ T_{\mathrm{hybrid}}(N,P,q,\alpha) = a \left( \frac{\alpha}{q} + (1-\alpha) \right) \frac{N}{P} + b \left( \frac{N}{P} \right)^{2/3} + c\, P^{1/3}. \]  (4.9)

4.2.2 Model Fitting and Validation

In this section, we fit the MPI-only and hybrid performance models using the runtime data of ddcMD in [80]. The system size varies over 1.6, 0.8, 0.4, and 0.2 million particles on 256 to 8,192 BG/P nodes (2^10 to 2^15 cores) at Lawrence Livermore National Laboratory. We first need to find the prefactors (i.e., a, b, c in Eqs. (4.7)-(4.9)). We begin by using linear regression to fit the parameters. However, a valid result cannot be obtained (i.e., at least one of a, b, c becomes negative), because the residuals differ from the data values by two orders of magnitude. This is the same situation that we observed in our previous study [97]; we therefore use bi-square robust regression instead. The resulting parameters are shown in Table 4.2.

Table 4.2: Parameter-fitting result of MPI-only and hybrid models using bi-square regression
Model            a (µs)   b (µs)   c (µs)
MPI-only          52.42    68.14    182.8
Hybrid α = 0.25   40.79    228.8    297.6
Hybrid α = 0.50   53.02    228.8    297.6
Hybrid α = 0.75   75.75    228.8    297.6

Using the parameters from Table 4.2, we compare the predicted runtime of a 1.6 million-particle system with the actual measurement for the MPI-only implementation (Fig. 4.3(a)) and the hybrid implementation (Fig. 4.3(b)). The plots indicate that most of the predictions agree with the measurements for both implementations. For the MPI-only model, only at the smallest granularities does the model predict a slowdown that is not present in the actual result. When we compare the two predicted results (Fig. 4.4), the crossover of optimality is predicted to be at ~100 particles/core, which agrees well with the actual measurement.

To examine the model further, Figure 4.5 shows an extrapolation of the prediction model for 54.8 million particles on a cluster of up to 1 million cores. The number of cores per node remains at 4. The result suggests that the crossover shifts to 200-400 particles/core, instead of ~100 particles/core on the smaller cluster. The model thus predicts that the optimal region of the hybrid implementation expands at such extremities compared to the traditional scheme.
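Returning to the 1.6 million-particle fit, the following C sketch (illustrative only, not part of ddcMD or of the original study) evaluates Eqs. (4.8) and (4.9) with the Table 4.2 prefactors and reports which scheme the model favors at each node count; q = 4 cores per node matches BG/P, and α = 0.25 is simply one of the fitted rows chosen as an example.

#include <math.h>
#include <stdio.h>

/* Eq. (4.8): MPI-only model with prefactors a, b, c in microseconds. */
double t_mpi(double N, double P, double q, double a, double b, double c)
{
    return a * N / (P * q) + b * cbrt(q) * pow(N / P, 2.0 / 3.0)
           + c * q * cbrt(P);
}

/* Eq. (4.9): hybrid MPI/OpenMP model with thread-parallel fraction alpha. */
double t_hybrid(double N, double P, double q, double alpha,
                double a, double b, double c)
{
    return a * (alpha / q + (1.0 - alpha)) * N / P
           + b * pow(N / P, 2.0 / 3.0) + c * cbrt(P);
}

int main(void)
{
    const double N = 1.6e6, q = 4.0, alpha = 0.25;
    for (double P = 256; P <= 8192; P *= 2) {
        double mpi = t_mpi(N, P, q, 52.42, 68.14, 182.8);       /* Table 4.2 */
        double hyb = t_hybrid(N, P, q, alpha, 40.79, 228.8, 297.6);
        printf("P=%6.0f  N/(Pq)=%8.1f  T_MPI=%10.1f us  T_hybrid=%10.1f us  %s\n",
               P, N / (P * q), mpi, hyb, hyb < mpi ? "hybrid" : "MPI-only");
    }
    return 0;
}

For these example inputs, the predicted advantage switches to the hybrid scheme between 4,096 and 8,192 nodes, i.e., at a granularity of roughly 50-100 particles per core, consistent with the ~100 particles/core crossover quoted above.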
Figure 4.3: Validation of (a) MPI-only and (b) hybrid MPI/OpenMP performance models. Blue lines denote the model results while the red dots denote the actual measurement on BlueGene/P. 89 Figure 4.4: The performance model predictions of 1.6 million-particle system of MPI-only versus hybrid MPI/OpenMP implementations, indicating the same result as in terms of optimality region as the actual measurements. Figure 4.5: Extrapolation of the prediction models, fixing granularity per core constant. The predicted model suggested that the crossover is moved up to ~ 200 – 400 particles/core. 4.3 Summary We have designed performance prediction models for MPI and MPI/OpenMP schemes based on LogP. The crossover of optimality obtained from the model between the two schemes is strongly correlated with granularity, which agrees well with the experiment results. An extrapolation of the prediction model using 54.8 million particles on up to 1 million cores cluster indicates that the optimal region of hybrid implementation increase at such extremity compared 90 to the traditional model. This suggests the likely benefit of hybrid model on the extreme-scale run of small system, which aims to minimize time-to-solution of the problem. 91 Chapter 5 n-Tuple Computation Formalism and Solution for Ranged- Limited Many-Body Molecular Dynamics In this chapter, we discuss a development of a computation-pattern algebraic framework to mathematically formulate general n-tuple computation used in reactive MD. We also discuss translation/reflection-invariant properties of computation patterns within this framework, which is the basis of our shift-collapse (SC) algorithm for cell-based parallel MD. This chapter is organized as follows. Section 5.1 discusses the problem regarding n-tuple computation. Section 5.2 provides background information on n-tuple MD. Section 5.3 presents the computation-pattern algebraic framework for n-tuple computation and the SC algorithm along with a proof of its correctness. Section 5.4 analyzes the search-space size and communication cost of the SC algorithm. Section 5.5 presents performance benchmark results on BlueGene/Q and Intel Xeon clusters. Conclusions are drawn in section 5.6. 92 5.1 Problem Statement In contrast to these remarkable progresses in parallel algorithms for dynamic pair (n = 2) and static n-tuple computations, parallel dynamic n-tuple computation is still in its beginning stage. Fundamental questions include: How can we generalize the computation-redundancy removal in the HS scheme and the import-volume reduction in the ES scheme developed for pair computation into arbitrary dynamic n-tuple computations? To answer these questions in a mathematically rigorous and systematic manner, we here develop a novel computation-pattern algebraic framework. Based on this framework and translation- and reflection-invariant properties of n-tuple computations, we then design a shift- collapse (SC) algorithm. Our algebraic framework allows not only a formal proof of force- computation completeness but also a quantitative analysis of n-tuple search cost and import volume. We will show, for general dynamic n-tuple computations, that: (1) the pair HS scheme can be generalized to a reflective-collapse (RC) scheme to tighten the n-tuple search space; and (2) the pair ES scheme can be generalized to an octant-compression (OC) shift scheme to minimize the import volume. 
5.2 Background In this section, we describe many-body MD, followed by the introduction of dynamic range-limited n-tuple computation in it. 5.2.1 Many-Body Molecular Dynamics MD simulation follows the time evolution of an N-atom system by numerically integrating Newton’s equations of motion: 93 m i d 2 x i dt =f i =− ∂Φ(X) ∂x i , (5.1) where X = {x 0 ,…,x N-1 } denotes a set of atomic positions, m i and f i are the mass of and force acting on atom i, and t is time. In Eq. (5.1), the many-body interatomic potential-energy function Φ(X) is a sum of n-body potential terms Φ n : Φ(X)=Φ 2 +Φ 3 +...+Φ n max = Φ n n=2 n max ∑ , (5.2) where n max is the maximum n and Φ n is a function of n-tuples of atomic positions (x 0 ,...,x n−1 ). Force computation is the most time-consuming step in MD simulation. For each simulation time step, we need to map the current set of atomic positions, R = {r 0 , …, r N-1 }, to n- body force terms for all n ∈ [2, n max ] and all atoms i ∈ [0, N−1] as f i (n) =− ∂ ∂x i Φ n (x 0 ,...,x n−1 ) ∀(r 0 ,...,r n−1 )∈Γ (n) ∑ (x 0 ,...,x n−1 )=(r 0 ,...,r n−1 ) , (5.3) where Γ (n) denotes the set of all n-tuples of atom positions in R. For a given n-tuple, χ = (r 0 ,r 1 ,...,r n-1 ) (see Figures 5.1 (a)-(c) for n = 2-4), forces exerted on all atoms in χ can be calculated simultaneously in one computational step: ∀r i ∈ χ :f i (n) ← f i (n) − ∂ ∂x i Φ n (x 0 ,...,x n−1 ) (x 0 ,...,x n−1 )=χ . (5.4) Namely, the set of n-body forces on all atoms can be decomposed into partial contributions from different n-tuples χ: 94 f i (n) i∈[0,N−1] { } =F (n) = F χ (n) ∀χ∈Γ (n) . (5.5) In all problems we consider, χ in Eq. (5.5) is undirectional, reflecting the Newton’s 3 rd law [61, 98]. Namely, (r 0 ,r 1 ,...,r n-1 ) produces the same forces as (r n-1 ,r n-2 ,...,r 0 ), thus they are not counted as separate entities. We call this reflective equivalence. Assume that force calculation for each Fχ (n) can be executed in constant time. The computational complexity of n-body force computation is then proportional to the size of Γ (n) , i.e., N!/[2(N-n)!] = O(N n ) for a system with N >> n (which is the case in most MD simulations). Thus, the computational complexity is dictated by that of the largest n-body term, O(N n max ) . Figure 5.1: Schematic of range-limited n-tuples: (a) pair (n = 2), (b) triplet (n = 3), and (c) quadruplet (n = 4). 5.2.2 Dynamic Range-Limited n-Tuple Computation In Atomic interaction is often range-limited, i.e., only atoms within short distances contribute to forces. Force computation for a range-limited n-body interatomic potential is defined as Eq. (5.3), where Γ (n) is replaced by its subset Γ *(n) ⊆Γ (n) : Γ *(n) = (r 0 ,...,r n−1 ) r k,k+1 <r cut−n for all k∈ {0,...,n− 2} { } , (5.6) where r k,k+1 is the interatomic distance between r k and r k+1 and r cut-n is the cutoff distance for n- body interaction, see Figures 5.1(a)-(c). 95 Formal complexity of pruning Γ (n) to obtain Γ *(n) is exponential in n, which is not practical. However, it is possible to efficiently find a set of n-tuples S (n) ⊆ Γ (n) such that |S (n) | << |Γ (n) | and Γ *(n) ⊆S (n) (Figure 5.2). After finding S (n) , elements in Γ *(n) can be obtained simply by exhaustive “filtering” from S (n) . Since S (n) is used to compute forces, hereafter we denote S (n) as a force set, and S (n) that satisfies Γ *(n) ⊆S (n) as a bounding force set. Thus, MD force computation can be restated as a problem to find a bounding force set S (n) for all n ∈ {2,…,n max }. 
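To give a sense of the scale involved (round numbers chosen for illustration only, not a measurement from this work): for N = 10^4 atoms and n = 3,
\[ |\Gamma^{(3)}| = \frac{N!}{2(N-3)!} = \frac{N(N-1)(N-2)}{2} \approx 5\times 10^{11}, \]
whereas only those triplets whose consecutive members lie within r_cut-3 of one another survive in Γ*(3), a vanishingly small fraction in a typical atomistic system. Exploiting this gap, without paying the exponential cost of exact pruning, is precisely what the bounding force sets constructed below are for.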
To solve this problem, an efficient pruning algorithm A:R⇒S (n) is required. A widely used approach to achieve this is a cell method. It employs a cell data structure to prune Γ (n) in O(N) time to obtain S (n) that tightly bounds Γ *(n) . Figure 5.2: Set diagram showing the relationship of a bounding force set S (n) with Γ (n) and Γ *(n) Cell-based methods are powerful yet simple. Its procedure is straightforward to implement for the case of pair computation. However, cell methods for general n-tuple computation have not been formalized mathematically. To address this problem systematically, we formulate a cell-based n-tuple MD problem using an algebraic framework in the next section. 5.3 Computation-Pattern Algebraic Framework In this section, dynamic range-limited n-tuple MD is formalized using a computation- pattern algebraic framework. Subsection 5.3.1 introduces a uniform-cell pattern (UCP) formalism, which is an algebraic generalization of conventional cell methods to solve n-tuple 96 MD problems for arbitrary n. In subsection 5.3.2, we propose a shift-collapse (SC) algorithm to efficiently solve n-tuple MD on the basis of UCP. Correctness of the SC algorithm is proved in subsection 5.3.3. 5.3.1 Algebraic Formulation of Cell-Based MD 5.3.1.1 Cell data-structure Cell-based MD divides a simulation volume into a non-overlapping set of small cells with side lengths equal or slightly larger than r cut-n . Each cell volume ζ contains a subset of atoms in R that fall within its volume: c = {r i | ∀r i ∈ R and in volume ζ}, (5.7) see Figure 5.3(a). For simplicity, we consider a simple lattice of cubic cells. Then, each cell c(q) can be indexed using a 3-element vector q = (q x , q y , q z ), which specifies the Cartesian coordinate of the cell position in the lattice. Let L x , L y , and L z be the number of cells in the x, y, and z directions, respectively. Then, the set of all cell indices constitutes a vector space L: L= q=(q x ,q y ,q z ) ∀q x ∈{0,...,L x −1} ∀q y ∈{0,...,L y −1} ∀q z ∈{0,...,L z −1} $ % & & ' & & ( ) & & * & & . An essential ingredient of our algebraic formulation is a cell domain Ω, which is defined as a set of all cells: Ω={c(q)|∀q∈L} . (5.8) 97 In this chapter, we assume periodic boundary conditions in all Cartesian directions. Then, for cell c(q), we define a cell-offset operation c(q+Δ)→c( # q ) (Δ ∈ ℤ 3 ) as ! q α =(q α +Δ α )%L α (α ∈{x,y,z}) , where % is a modulo operation. Note that Ω needs to be dynamically constructed every MD step because the positions of atoms are changing as the computation proceeds. Figure 5.3: (a) Cell data-structure, where circles are atoms and small squares are cells. Dashed circle denotes the cutoff radius centered at atom i. (b) Uniform-cell pattern for pair computation, which generates all pairs between each cell and its neighbor cells. 5.3.1.2 Uniform-cell pattern MD UCP formalism is a generalization of a cell-based MD method using an algebraic approach (Figure 5.3(b)). To compute forces for each n-body potential term, UCP generates a force set S (n) using information from cell domain Ω along with a computation pattern Ψ (n) , which is a set of pre-define rules regarding interaction among cells. UCP applies Ψ (n) to each cell c and generates a cell search-space S cell , a set of n-tuples that forms a force subset S cell (c, Ψ (n) ) ⊆ S (n) . By looping over all cells, it generates the entire S (n) . 
This is analogous to stencil computation for solving partial differential equations, in which a stencil (equivalent to computation pattern in our 98 framework) defines local computation rules for each grid point and its neighbor grid points and the stencil is applied to all grid points in a lattice to solve the problem [27, 28]. To describe UCP, we first define a computation path p (n) for n-tuple computation as a list of n vectors in L (see each arrow in Figure 5.3(b) as an example of pair-computation path): p (n) = (v 0 ,...,v n−1 )∈L n . The inverse of p is defined as p -1 = (v n-1 ,…,v 0 ). In addition, we define a differential representation, σ(p) ∈ L n-1 , of each path p as σ(p)=(v 1 −v 0 ,...,v n−1 −v n−2 ) . Then, a computation pattern is defined as a set of computation paths Ψ (n) = {p (n) } (i.e. set of all arrows in Figure 5.3(b)). Given a cell domain Ω and a computation pattern Ψ (n) , a force set S (n) is obtained as a union of cell search-spaces S cell : S (n) =UCP(Ω,Ψ (n) )= S cell (c(q),Ψ (n) ) ∀c(q)∈Ω , (5.9) where cell search-space S cell for each cell c(q) is defined as S cell (c(q),Ψ (n) )= (r 0 ,...,r n−1 ) ∀p=(v 0 ,...,v n−1 )∈Ψ (n) ∀k∈{0,...,n−1}:∀r k ∈c(q+v k ) % & ' ( ' ) * ' + ' . (5.10) Namely, S cell is a set of n-tuples contained in all paths p ∈ Ψ (n) , which is associated with cell c(q), see Figure 5.4. An algorithm that performs operations in Eqs. (5.9) and (5.10) is given in Figure 5.5. S (n) in Eq. (5.9) is obtained by looping over all c ∈ Ω (line 2), where each iteration 99 adds S cell (c) to S (n) (line 9). In order for UCP to be a valid MD computation, its computation pattern Ψ (n) must satisfy the n-body completeness condition by generating a bounding force set: Γ *(n) ⊆UCP(Ω,Ψ (n) ). (5.11) In this way, MD problem amounts to finding a computation pattern that satisfies Eq. (5.11). Such a computation pattern is regarded as n-complete. Existing cell-based methods can be expressed as particular computation patterns, and our algebraic formalism can be used to prove their completeness (e.g. HS and ES for pair-completeness). This will be discussed in section 5.4.3. Satisfying Eq. (5.11) only guarantees that Ψ (n) is sufficient to generate a bounding force set using the UCP algorithm in Figure 5.5. The generated force set could still contain duplicated or reflectively equivalent tuples as mentioned in section 5.2.1. In a practical implementation, these redundant tuples must be removed. Hence, the cost of the UCP algorithm, T UCP , is a cost for filtering out the unnecessary tuples from cell search-space S cell of every cell. This is proportional to a sum of all search-space sizes |S cell |: T UCP ∝ S cell (c,Ψ (n) ) ∀c∈Ω ∑ . (5.12) 100 Figure 5.4: n-tuple (n = 3) generation for cell search-space S cell of cell c(q) (colored in magenta) and computation path (v 0 ,v 1 ,v 2 ). S cell generates all triplets where the 1 st , 2 nd , and 3 rd atoms in the triplet are in cells c(q+v 0 ), c(q+v 1 ), and c(q+v 2 ), respectively. Dashed line shows one of the triplets (r i ,r j ,r k ). Algorithm UCP Input: Ω : a cell domain Ψ (n) ⊆ L n : a computation pattern Output: S (n) ⊆ R n : a force set Steps: 1. S (n) = ∅ 2. for ∀c(q)∈Ω 3. S cell (c(q)) = ∅ 4. for ∀p= (v 0 ,...,v n−1 )∈Ψ (n) 5. for ∀r 0 ∈c(q+v 0 ) 6. 7. for ∀r n−1 ∈c(q+v n−1 ) 101 8. S cell (c(q)) ←S cell (c(q)) ∪{(r 0 ,...,r n−1 )} 9. S (n) ←S (n) ∪S cell (c(q)) Figure 5.5: Uniform-cell pattern MD algorithm. 
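As a concrete (and deliberately naive) instance of Fig. 5.5, the C sketch below enumerates the pair (n = 2) search space on a small periodic cell grid using the full pattern {(0, v1) : v1 ∈ {−1,0,1}^3}. The grid dimensions, cell capacity, and atom bookkeeping are illustrative assumptions; the generated candidate pairs would still be filtered by the cutoff test of Eq. (5.6) and pruned of reflective duplicates, as discussed next.

#include <stdio.h>

#define LX 4
#define LY 4
#define LZ 4
#define MAX_PER_CELL 8

/* A cell stores the indices of the atoms currently inside its volume. */
typedef struct { int n; int atom[MAX_PER_CELL]; } cell_t;

static cell_t grid[LX][LY][LZ];

/* c(q + v) with periodic wrap-around in every Cartesian direction. */
static cell_t *cell_at(int qx, int qy, int qz, int vx, int vy, int vz)
{
    int x = ((qx + vx) % LX + LX) % LX;
    int y = ((qy + vy) % LY + LY) % LY;
    int z = ((qz + vz) % LZ + LZ) % LZ;
    return &grid[x][y][z];
}

/* UCP (Fig. 5.5) specialized to n = 2 with the full 27-path pattern:
   every generated (i, j) with i != j is a candidate tuple of S(2). */
static long ucp_pair_candidates(void)
{
    long count = 0;
    for (int qx = 0; qx < LX; qx++)
    for (int qy = 0; qy < LY; qy++)
    for (int qz = 0; qz < LZ; qz++) {
        cell_t *c0 = &grid[qx][qy][qz];
        for (int vx = -1; vx <= 1; vx++)
        for (int vy = -1; vy <= 1; vy++)
        for (int vz = -1; vz <= 1; vz++) {
            cell_t *c1 = cell_at(qx, qy, qz, vx, vy, vz);
            for (int a = 0; a < c0->n; a++)
                for (int b = 0; b < c1->n; b++)
                    if (c0->atom[a] != c1->atom[b])
                        count++;        /* contribution of this path to S_cell */
        }
    }
    return count;
}

int main(void)
{
    int id = 0;                         /* two atoms per cell, for illustration */
    for (int x = 0; x < LX; x++)
        for (int y = 0; y < LY; y++)
            for (int z = 0; z < LZ; z++) {
                grid[x][y][z].n = 2;
                grid[x][y][z].atom[0] = id++;
                grid[x][y][z].atom[1] = id++;
            }
    printf("candidate pairs (including reflective duplicates): %ld\n",
           ucp_pair_candidates());
    return 0;
}

The redundancy visible here (each unordered pair is generated twice) is exactly what the reflective-collapse step of the SC algorithm removes at the level of computation paths rather than individual tuples.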
5.3.1.3 Parallel UCP We consider parallel MD algorithms, where different sets of atoms c along with associated computation S cell (c, Ψ (n) ) are assigned to different processors. Since S cell (c, Ψ (n) ) requires data from other cells, these data must be imported if they reside in other processors. Depending on computation paths in Ψ (n) , different cells need be imported. This leads to different import volumes (i.e. the number of imported cells), and hence different communication costs. To quantify the import volume, we define the cell coverage Π of a computation pattern Ψ (n) as a set of cells that are needed in order to compute S cell (c(q), Ψ (n) ): Π(c(q),Ψ (n) )= c(q+v k ) ∀p= (v 0 ,...,v n−1 )∈Ψ (n) ∀k :v k ∈p & ' ( ) ( * + ( , ( , for an arbitrary cell c(q) = c. We also define the cell footprint of a computation pattern to be the cardinality of Π(c,Ψ (n) ). The cell footprint is independent of cell, thus we denote it as |Π(Ψ (n) )|. Also, we define the cell-domain coverage as a union of the cell coverage of all cells in the domain: Π(Ω,Ψ (n) )= Π(c,Ψ (n) ) ∀c∈Ω , (5.13) which is the set of cells that are needed to do computation of all cells in Ω. Here and in subsequent discussions of parallel MD, Ω denotes a cell domain for each processor. Then, the set 102 of cells ω that needs to be imported from other processors are the cells that are in the cell-domain coverage but not in the cell domain: ω(Ω,Ψ (n) )=Π(Ω,Ψ (n) )−Ω . The import volume of n-tuple computation is defined as the size of all cells that need to be imported (i.e. the cardinality of ω): V ω (Ω,Ψ (n) )=ω(Ω,Ψ (n) ) . (5.14) The overall import volume V import is determined by the largest import volume among all n: V import =max n (V ω (Ω,Ψ (n) )) . 5.3.1.4 Optimal UCP-MD Problem Based on the computational and communication cost functions of UCP in Eqs. (5.12) and (5.14), the parallel MD problem is reduced to the following optimization problem: PROBLEM (OPTIMAL UCP-MD). Given a cell domain Ω, find a set of n-complete computation patterns {Ψ *(n) } for all n ∈ {2,…,n max } such that each Ψ *(n) satisfies the following: 1. Ψ *(n) =argmin Ψ (n) S cell (c,Ψ (n) ) ∀c∈Ω ∑ & ' ( ) * + 2. Ψ *(n) =argmin Ψ (n) Π(c,Ψ (n) ) ∀c∈Ω −Ω ' ( ) * + , The first condition minimizes the search cost (i.e. filtering out the unnecessary tuples), while the second condition minimizes the import volume in parallel MD. In the next subsection, we present an SC algorithm to solve the optimal UCP-MD problem. We will show in section 5.4 that the SC algorithm in the case of pair computation is reduced to the best cell methods. 103 5.3.2 Shift-Collapse Algorithm We propose an SC algorithm, which makes a complete computation-pattern while addressing the two optimality conditions presented in subsection 5.3.1.4. The SC algorithm consists of three main phases (see Figure 5.7). The first phase, full-shell generation subroutine (GENERATE-FS in Figure 5.8) performs (n-1)-fold nested loops (lines 2-5) to enumerate all possible paths to generate an n-complete pattern (e.g. for n = 3 in Figure 5.6(a)). This creates a computation pattern centered at the original cell (i.e. c(q) in Eq. (5.9)) surrounded by (n-1)-layers of nearest-neighbor cells (yellow region in Figure 5.6(a)). In the following, we refer to this computation pattern as full shell (FS). In the second phase, octant-compression shift subroutine (OC-SHIFT) shifts all paths in the FS pattern to the first octant in order to compact the cell footprint |Π(Ψ (n) )|. 
To achieve this, all of the paths in FS are shifted toward the upper corner (the highest Cartesian coordinates in all 3 directions, see Figure 5.6(b)) of the FS cell coverage (see lines 3-4 in Figure 5.9). In the last phase, reflective collapse subroutine (R-COLLAPSE) finds and removes redundant paths that generate the same force set (Figure 5.10). This is achieved by doubly nested loops (lines 2-3) over all pairs of paths to compare their equivalence (line 4). If a path-pair is found to be equivalent, one of the paths is removed from the pattern (line 5). The equivalence between two arbitrary paths p and pʹ′ is tested as σ(pʹ′) = σ(p -1 ), which will be proven in Lemma 3 in section 5.3.3. The most important property for the SC algorithm is that a path-shift operation does not alter the resulting force set, i.e., translational invariance. Let p = (v 0 ,…,v n-1 ) be an n-tuple 104 computation path, then path shifting is defined as a translation of the origin of the computation path: p+Δ= (v 0 +Δ,...,v n−1 +Δ) , where Δ ∈ ℤ n denotes a shifting vector. The following theorem proves the path-shift invariance of force computation. THEOREM 1 (PATH-SHIFT INVARIANCE). Let Ω be a cell domain and p be an n-tuple computation path. For an arbitrary shift vector Δ ∈ ℤ n , UCP(Ω,{p}) = UCP(Ω,{p+Δ}). PROOF. For a particular cell c(q)∈Ω and p = {v 0 ,…,v n-1 }, S cell (c(q),{p+Δ}) reads S cell (c(q),{p+Δ})={(r 0 ,...,r n−1 )|∀r k ∈c(q+v k +Δ)} ={(r 0 ,...,r n−1 )|∀r k ∈c((q+Δ)+v k )} =S cell (c(q+Δ),{p}) . (5.15) The union of Eq. (5.15) over all cells in the domain yields S cell (c(q+Δ),{p}) ∀c(q+Δ)∈Ω = S cell (c( % q),{p}) ∀c( % q )∈Ω =UCP(Ω,{p}) . The last equality comes from the definition of UCP in Eq. (5.9). Therefore, UCP(Ω,{p+Δ}) = UCP(Ω,{p}). □ 105 Figure 5.6: 2D schematic of a shift-collapse (SC) computation pattern. (a) FS computation pattern for n = 3, containing all possible paths of length 3 originated from the center cell (magenta). (b) SC computation pattern after OC-shift and R-collapse subroutines. Yellow cells denote the cell coverage of the center cell, which for SC is much smaller than that for FS. Algorithm Shift-Collapse (SC) Input: n ∈ {2,3,…}, the length of a tuple Output: Ψ (n) SC ⊆ L n : Shift-collapse pattern Steps: 1. Ψ FS ← GENERATE-FS(n) 2. Ψ OC ← OC-SHIFT(Ψ FS ) 3. Ψ SC ← COLLAPSE(Ψ OC ) Figure 5.7: Shift-Collapse algorithm. Algorithm Generate full-shell (GENERATE-FS) Input: n ∈ {2,3,…}, the length of a tuple Output: Ψ (n) FS ⊆ L n : Full-shell computation pattern Steps: 106 1. Ψ (n) FS = ∅, v 0 = (0,0,0) 2. for ∀v 1 = (α 1 x ,α 1 y ,α 1 z ) where -1≤α 1 x|y|z ≤1 3. for ∀v 2 = (α 2 x ,α 2 y ,α 2 z ) where α 1 x|y|z -1≤α 2 x|y|z ≤α 1 x|y|z +1 4. 5. for ∀v n−1 = (α n−1 x ,α n−1 y ,α n−1 z ) where α n−2 x|y|z -1≤α n−1 x|y|z ≤α n−2 x|y|z +1 6. Ψ FS (n) ←Ψ FS (n) ∪(v 0 ,...,v n−1 ) Figure 5.8: FS generation algorithm. Algorithm Octant-compression shift (OC-Shift) Input: Ψ (n) ⊆ L n : computation pattern Output Ψ (n) OC ⊆ L n : OC computation pattern Steps: 1. Ψ (n) OC = ∅ 2. for ∀p= (v 0 ,...,v n−1 )∈Ψ (n) 3. Δ min β = min 0≤i<n (v i β ) for all β∈ {x, y,z} 4. ! p ←p−Δ min 5. Ψ OC (n) ←Ψ OC (n) ∪{ $ p} Figure 5.9: OC-shift algorithm. Algorithm Reflective-collapse (R-COLLAPSE) Input: Ψ (n) ⊆ L n : computation pattern Output: Ψ (n) RC ⊆ L n : non-redundant computation pattern 107 Steps: 1. Ψ (n) RC = Ψ (n) 2. for ∀p∈Ψ (n) 3. for ∀ " p ∈Ψ (n) where " p ≠p 4. if σ ( ! p ) =σ (p −1 ) 5. Ψ RC (n) ←Ψ RC (n) −{ $ p} Figure 5.10: Reflective-collapse algorithm. 
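As a concrete illustration for pairs (n = 2): the FS path p = ((0,0,0), (−1,0,0)) has Δ_min = (−1,0,0), so OC-SHIFT maps it to ((1,0,0), (0,0,0)). Since σ(((1,0,0),(0,0,0))) = ((−1,0,0)) equals σ of the inverse of the path ((0,0,0),(1,0,0)), which is also present after shifting, R-COLLAPSE removes one of the two. Applied to all 27 pair paths of FS, this keeps the self path and one member of each of the 13 reflective pairs of neighbor paths, i.e., 14 paths in total, which is the familiar half-shell neighbor count for pair computation.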
5.3.3 Correctness of SC Algorithm To prove the correctness of the SC algorithm, we will show that the computation pattern generated by it is complete as defined in Eq. (5.11). The SC algorithm consists of three phases. We will show that the first phase generates a complete pattern, while the second and third phases do not alter the resulting force set. First, we prove that the GENERATE-FS routine generates a pattern that constitutes a bounding force set. LEMMA 1. Given a computation pattern Ψ (n) FS = GENERATE-FS(n), the force set S (n) = UCP(Ω,Ψ (n) FS ) is a bounding force set for an arbitrary cell domain Ω. PROOF. GENERATE-FS(n) in Figure 5.8 constructs Ψ (n) FS as a union of (v 0 ,…,v n-1 ) such that v k+1 =(v k x ±1,v k y ±1,v k z ±1) for all 0 ≤ k < n-1. Thus c(q+v k ) is a nearest neighbor cell of c(q+v k+1 ) for an arbitrary q ∈ L. Consider an arbitrary n-tuple χ =(r 0 ,...,r n−1 )∈Γ *(n) , where r k,k+1 <r cut−n for all 0 ≤ k < n-1. We will prove that χ is included in a cell set (c(q+v 0 ),c(q+v 1 ),...,c(q+v n−1 )) using mathematical induction: 108 Base: r 0 ∈c(q+v 0 ), Induction: if r k ∈c(q+v k ) , then ∃v k+1 :r k+1 ∈c(q+v k+1 ). Base step is trivial because we choose q such that r 0 ∈ c(q+v 0 ). For the induction step, atomic pair (r k ,r k+1 ) with distance r k,k+1 <r cut−n is guaranteed to reside in cell c(q+v k ) and c(q+v k+1 ) because the cell size is larger than the cutoff r cut-n . This proves the induction step. Hence, ∃v k :r k ∈c(q+v k ) for all 0 ≤ k < n-1 and thus, the entire n-tuple χ is contained in a cell set (c(q+v 0 ),c(q+v 1 ),...,c(q+v n−1 )) . From the definition of S cell in Eq. (5.10), for an arbitrary n-tuple in Γ *(n) there exists S cell (c(q),Ψ (n) FS ). Therefore from the definition of UCP in Eq. (5.9), Γ *(n) ⊆UCP(Ω,Ψ (n) ) is satisfied and UCP(Ω,Ψ (n) ) is complete. □ Next, we prove that OC-SHIFT operation does not alter the resulting force set. LEMMA 2. Consider an arbitrary computation pattern Ψ (n) and a cell domain Ω. For Ψ (n) OC = OC-SHIFT(Ψ (n) ), UCP(Ω,Ψ (n) OC ) = UCP(Ω,Ψ (n) ). PROOF. The OC-SHIFT subroutine performs a series of shifting operations to all paths in Ψ (n) . From Theorem 1, each shifting operation keeps the force set unchanged. Therefore, UCP(Ω,Ψ (n) OC ) = UCP(Ω,Ψ (n) ) □ To prove that R-COLLAPSE keeps the force set unchanged, we first prove the equivalence between paths (i.e. line 4 in Figure 5.10) used in R-COLLAPSE. LEMMA 3 (REFLECTIVE INVARIANCE). Consider a cell domain Ω and paths p, pʹ′ of size n. If σ(p -1 ) = σ(pʹ′), then UCP(Ω,{p}) = UCP((Ω,{pʹ′}). 109 PROOF. Let p= (v 0 ,...,v n−1 ), " p = (u 0 ,...,u n−1 ) and the inverse path of p be p -1 = (v n−1 ,...,v 0 ) . From the assumption σ(p -1 )=σ( ! p ), we have u k+1 −u k =v n−2−k −v n−1−k . (5.16) for all k ∈ {0,…,n-2}. From the shift-invariant property in Theorem 1, S cell (c(q),{ ! p}) ∀c(q)∈Ω = S cell (c(q),{ ! p +Δ}) ∀c(q)∈Ω . Let Δ = v n-1 -u 0 , then S cell (c(q),{ ! p +Δ}) reads S cell (c(q),{ ! p})={(r 0 ,...,r n−1 )|∀r k ∈c(q+u k +Δ)} ={(r 0 ,...,r n−1 )|∀r k ∈c(q+v n−1 −u 0 +u k )} . (5.17) Using telescoping, q+v n−1 −u 0 +u k in Eq. (5.17) can be rewritten as q+v n−1 −u 0 +u k =q+v n−1 + (u i+1 −u i ) i=0 k−1 ∑ . (5.18) u i+1 -u i in the R.H.S. of Eq. (5.18) can be replaced by v n-2-i -v n-1-i according to Eq. (5.16), which yields q+v n−1 + (u i+1 −u i ) i=0 k−1 ∑ =q+v n−1 + (v n−2−i −v n−1−i ) i=0 k−1 ∑ =q+v n−1−k . Therefore, q+v n−1 −u 0 +u k =q+v n−1−k . (5.19) Substituting Eq. (5.19) back into Eq. (5.17) yields, 110 S cell (c(q),{ ! 
p})={(r 0 ,...,r n−1 )|∀r k ∈c(q+v n−1−k )} ={(r 0 ,...,r n−1 )|∀r n−1−k ∈c(q+v k )} ={(r n−1 ,...,r 0 )|∀r k ∈c(q+v k )} . (5.20) Based on the undirectionality of tuples described in section 5.2.1, (r 0 ,...,r n−1 )= (r n−1 ,...,r 0 ). Therefore Eq. (5.20) reads S cell (c(q),{ ! p})={(r 0 ,...,r n−1 )|∀r k ∈c(q+v k )} =S cell (c(q),{p}) . Based on the definition of UCP in Eq. (5.9), this proves UCP(Ω,{p})=UCP(Ω,{ " p}) . □ Using Lemma 3, we can now prove that the R-COLLAPSE subroutine preserves the completeness of a pattern. LEMMA 4. Let Ω be an arbitrary cell domain and Ψ (n) be a computation pattern. For Ψ (n) RC = R-COLLAPSE(Ψ (n) ), UCP(Ω,Ψ (n) RC ) = UCP(Ω,Ψ (n) ). PROOF. In R-COLLAPSE subroutine (Figure 5.10), lines 2-3 loop over all pairs of computation paths {p, ! p} in Ψ (n) . For each pair of paths, we remove one of them (i.e. collapsing) only when σ( ! p )=σ(p -1 ) (lines 4-5). In such a case, Ψ RC (n) =Ψ (n) −{ # p} . Based on Lemma 3, p and pʹ′ produce the same force set, and thus UCP(Ω,Ψ (n) )=UCP(Ω,Ψ (n) −{ $ p}) . (5.21) Substituting Ψ RC (n) =Ψ (n) −{ # p} into Eq. (5.21) yields UCP(Ω,Ψ (n) )=UCP(Ω,Ψ RC (n) ) , which proves the Lemma. □ 111 Using Lemmas 1, 2, and 4, we can now prove the correctness of the SC algorithm by proving the completeness of SC pattern. THEOREM 2. Given an arbitrary cell domain Ω, a computation pattern Ψ (n) SC generated from the shift-collapse algorithm is n-complete. PROOF. Lemma 1 states that GENERATE-FS produces Ψ (n) FS , which is an n-complete pattern. Lemma 2 guarantees that for Ψ (n) OC = OC-SHIFT(Ψ (n) FS ), UCP(Ω,Ψ (n) OC ) = UCP(Ω,Ψ (n) FS ) and thus Ψ (n) OC is also complete. According to Lemma 4, for Ψ (n) SC = R- COLLAPSE(Ψ (n) OC ), UCP(Ω,Ψ (n) SC ) = UCP(Ω,Ψ (n) OC ). Therefore Ψ (n) SC is complete. □ 5.4 Theoretical Analysis In this section, we perform theoretical analysis of the SC algorithm. First, we quantify the cost of n-tuple searches of the SC pattern. We then analyze and estimate the import volume of the SC the pattern. Finally, we discuss the relation of SC to previous works in the case of pair computation. 5.4.1 Search-Cost Analysis of SC Pattern In this subsection, we show that the search cost for the SC pattern is approximately half that of FS. This size reduction arises from R-COLLAPSE, which removes all redundant computation paths using their reflective-invariant property in Lemma 3. First, we show that the size of a search space can be estimated by the cardinality of a computation pattern. LEMMA 5. Assume that the atom distribution is uniform. For an arbitrary cell domain Ω and a computation pattern Ψ (n) , the search cost TUCP is proportional to |Ψ (n) |. 112 PROOF. Let 〈ρ cell 〉 be the average number of atoms per cell. For all paths p = (v0,…,v n-1 ) in Ψ(n) and an arbitrary cell c(q) ∈ Ω, the average number of atomic pairs in a cell-pair c(q+v k ) and c(q+v k+1 ) is 〈ρ cell 〉 2 . Thus, for an n-1 consecutive cell-pairs, the number of tuples in a cell search-space S cell centered at an arbitrary cell c(q) is S cell (c(q),{p}) = ρ cell n−1 . (5.22) By summing Eq. (5.22) over all paths in Ψ (n) , we obtain the average number of n-tuples associated with cell c(q) as S cell (c(q),Ψ (n) ) = ρ cell n−1 Ψ (n) . (5.23) Summation of Eq. (5.23) over all cells in Ω yields the search cost in Eq. (5.12): S cell (c,Ψ (n) ) ∀c∈Ω ∑ =T UCP =L ρ cell n−1 Ψ (n) . (5.24) Hence, T UCP ∝ |Ψ (n) |. □ Based on Lemma 5, we can estimate the costs of UCP(Ω,Ψ (n) FS ) and UCP(Ω,Ψ (n) SC ) in terms of |Ψ (n) FS | as follows. 
For each path p in Ψ (n) FS , there are n-1 consecutive cell-vector pairs (v k , v k+1 ) for 0 ≤ k < n-1. For a particular cell q+v k , there are 27 nearest-neighbor cells q+v k+1 . Thus for n-1 neighbor-search steps, we have the number of paths as Ψ FS (n) =27 n−1 . (5.25) Substituting Eq. (5.25) in Eq. (5.24), the search cost for UCP(Ω,Ψ (n) FS ) is obtained as 113 T UCP =L 27 ρ cell ( ) n−1 . Next, we analyze the search cost of UCP(Ω,Ψ (n) SC ). Paths in Ψ (n) SC are only removed in the R-COLLAPSE routine, where redundant paths are collapsed. To estimate |Ψ (n) SC |, we need to find how many path are collapsed. Thus, we first categorize paths in Ψ (n) FS into collapsible and non-collapsible paths as Ψ FS (n) =ψ collapsible FS ∪ψ non-collapsible FS , (5.26) where ψ collapsible and ψ non-collapsible are disjoint sets of collapsible and non-collapsible paths, respectively. In the following, we prove that for each path in Ψ (n) FS , there exists a unique path in Ψ (n) FS that produces the same force set. LEMMA 6 (REFLECTIVE PATH-TWIN (RPT)). For an arbitrary cell domain Ω and a path p = (v 0 ,…,v n-1 ) ∈ Ψ (n) FS , there exists a unique reflective path-twin RPT(p) = p -1 -v n-1 ∈ Ψ (n) FS , such that σ(p -1 ) = σ(RPT(p)). PROOF. For p= (v 0 ...,v n−1 ) , we have p −1 = (v n−1 ,...,v 0 ) . From the definition of UCP (Eq. (5.9)) for the case of a single-path pattern {p} and the undirectionality of n-tuple, it is straightforward to prove that UCP(Ω,{p})=UCP(Ω,{p −1 }) . For a particular path ! p =p −1 −v n−1 , we have 114 ! p =p −1 −v n−1 =(v n−1 −v n−1 ,v n−2 −v n−1 ,...,v 0 −v n−1 ) =(0,v n−2 −v n−1 ,...,v 0 −v n−1 ) . Since the first cell offset of p is (0,0,0), we have pʹ′ ∈ Ψ (n) FS from the generation of Ψ (n) FS (line 1 in Figure 5.9). Thus, for every path p ∈ Ψ (n) FS , there exists a path-twin RPT(p) = ! p =p −1 −v n−1 ∈ Ψ (n) FS . The a differential representation of pʹ′ is σ( ! p )=(v n−2 −v n−1 −0,...,(v 0 −v n−1 )−(v 1 −v n−1 )) =(v n−2 −v n−1 ,...,v 0 −v 1 ) , while that of p -1 is σ(p −1 )=(v n−2 −v n−1 ,...,v 0 −v 1 ) . Hence σ(pʹ′) = σ(p -1 ). From Lemma 3, this proves the equivalence of p and its twin RPT(p). Next we consider the uniqueness of RPT(p). Since all paths in a computation pattern begin with the same cell, they are all inequivalent if the path is directional. The only source of equivalence therefore arises from the undirectionality of paths, which is a direct consequence of the undirectionality of n-tuples stated in section 5.2.1. The reflective equivalence is two-fold by definition, and it can produce at most one redundant path. □ From Lemmas 3 and 6, there exists a unique path twin that produces the identical force set for every path p. However, if p = p -1 , then RPT(p) is p itself. Thus, it is not collapsible. COROLLARY 1 (SELF-REFLECTION). For an arbitrary cell domain Ω and a path p ∈ Ψ (n) FS . If p = p -1 , RPT(p) = p. 115 PROOF. Let p be (v 0 ,…,v n-1 ). From Lemma 6, RPT(p) = p -1 – v n-1 . From the assumption, p = p -1 , thus RPT(p) = p – v n-1 . Also, p = p -1 implies v 0 = v n-1 . Since p ∈ Ψ (n) FS , v 0 = 0. Hence, RPT(p) = p – v n-1 = p – v 0 = p. □ According to the self-reflection in Corollary 1, ψ non-collapsible = {p | ∀p ∈ Ψ (n) FS : p = p -1 }. To estimate |ψ non-collapsible |, we need to count the number of paths that are self-reflective. Note that p=p −1 = (v 0 ,...,v (n−1)/2−1 ,v (n−1)/2 ,v (n−1)/2−1 ,...,v 0 ) ;n is odd (v 0 ,...,v n/2−1 ,v n/2−1 ,...,v 0 ) ;n is even " # $ % $ . thus, the number of self-reflective (i.e. 
non-collapsible) paths is ψ non-collapsible =27 n+1 2 ! " ! # $ # −1 . (5.27) Substituting Eq. (5.27) in Eq. (5.26) yields ψ collapsible FS = Ψ FS (n) −ψ non-collapsible FS =27 n−1 −27 n+1 2 # $ # % & % −1 . (5.28) Lemma 6 states that there is a unique reflective path-twin of p for each collapsible path p such that one of them can be removed. Thus, the number of collapsed paths in SC pattern is half of the collapsible paths in Eq. (5.28): Ψ SC (n) = 1 2 ψ collapsible FS +ψ non-collapsible FS = 1 2 (27 n−1 −27 n+1 2 # $ # % & % −1 )+27 n+1 2 # $ # % & % −1 = 1 2 27 n−1 +O(27 n/2 ) . (5.29) Hence, the search cost of SC is roughly half that of FS for large n. 116 5.4.2 Communication Analysis of SC Pattern In this subsection, we analyze the communication cost of the parallel SC-MD. The communication cost is defined as T comm =T bandwidth +T latency . (5.30) Here, we assume that the bandwidth cost T bandwidth is proportional to the amount of imported data. The imported data in parallel MD is proportional to the imported volume V import in subsection 5.3.1.3 in the average case. On the other hand, the latency cost T latency is proportional to the number of nodes n comm_node from which data needs to be imported. Hence, Eq. (5.30) can be rewritten as T comm =c bandwidth V import +c latency n comm_nodes , (5.31) where c bandwidth and c latency are prefactors. To quantify T comm for SC-MD, we first determine the import volume of the SC pattern. For simplicity, we assume that the given cell domain has a cubic shape such that L x = L y = L z = l. From the definition of import volume in Eq. (5.14), we need to find the union of cell coverage of all cells in the domain. To analyze the cell coverage, we define a layered cell coverage in the range [a, b] (where a and b are nonnegative integers such that a ≤ b) as c[a,b] q = c(q+ δ )∀δ β ∈ {a,a+1 ,...,b} for all β∈ {x,y,z} { } . In the SC algorithm, the key step that affects cell coverage is the OC-SHIFT subroutine. OC-SHIFT performs 1 st -octant compression, in which all computation paths in the pattern are shifted to the first octant of cell coverage (i.e. non-negative cell indices along x, y, and z 117 directions relative to the center cell). Thus, the cell coverage Π(c(q),Ψ (n) SC ) for an arbitrary cell c(q) is Π(c(q),Ψ SC (n) )=c[0,n−1] q . From Eq. (5.13), the cell-domain coverage of the SC pattern is Π(Ω,Ψ SC (n) )= c[0,n−1] q ∀c(q)∈Ω . Hence, the import volume of the SC pattern reads V ω (Ω,Ψ SC (n) )= c[0,n−1] q ∀c(q)∈Ω −Ω . (5.32) Since Ω⊆Π(Ω,Ψ SC (n) ) , the cardinality in the R.H.S. of Eq. (5.32) can be taken individually. Hence the import volume of Ψ (n) SC is V ω (Ω,Ψ SC (n) )= c[0,n−1] q ∀c(q)∈Ω −Ω =(l+n−1) 3 −l 3 . (5.33) Regarding the latency cost, unlike some methods such as NT [74] that minimize the bandwidth cost at the expense of latency cost, all cell-based methods have small constant latency. In SC-MD, we only need to import atom data from 7 nearest processors using only 3 communication steps via forwarded atom-data routing. 5.4.3 Relation to Previous Works Numerous algorithms have been designed to utilize the cell data structure for efficient range-limited n-tuple search. For the case of pair computation (n = 2), in particular, shell-based 118 methods—such as full-shell (FS), half-shell (HS), and eighth-shell (ES)—are used extensively for efficient search of pairs in Γ *(2) . These methods can be described systematically in terms of UCP. In the following subsections, we provide their unified descriptions and compare them with SC. 
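Before turning to these pair-computation special cases, the import-volume result of Eq. (5.33) can be checked numerically with a few lines of code. The sketch below is an illustration only; it assumes a cubic l x l x l cell domain and the first-octant coverage c[0, n-1] produced by OC-SHIFT, counts the imported cells directly, and compares the count with (l + n - 1)^3 - l^3.

```python
from itertools import product

def sc_import_volume(l, n):
    """Count the cells imported by an l x l x l cell domain under the SC pattern's
    first-octant coverage c[0, n-1] (cf. Eq. (5.32))."""
    domain = set(product(range(l), repeat=3))
    coverage = set()
    for q in domain:
        # Cell coverage of c(q) after OC-SHIFT: offsets 0..n-1 in each direction.
        coverage |= {tuple(q[i] + d[i] for i in range(3))
                     for d in product(range(n), repeat=3)}
    return len(coverage - domain)

l, n = 8, 3
print(sc_import_volume(l, n), (l + n - 1) ** 3 - l ** 3)   # both 488 for l = 8, n = 3
```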
5.4.3.1 Full-shell method FS is the simplest among shell methods and it produces a bounding force set. FS searches all atoms within the nearest-neighbor cells (i.e. cells of indices within ±1 in the x, y, z directions) surrounding the center cell [69]. In our algebraic notation, FS is equivalent to Ψ (2) FS , and thus is 2-complete according to Lemma 1. However, FS is not optimal in the light of the search cost: |Ψ (2) FS | = 27 (see Figure 5.11(a)). Since FS has reflective redundancy, the extra computational cost is involved for filtering the collapsible pairs. 5.4.3.2 Half-shell method HS reduces the search cost of FS using the symmetric property of reflected pairs to eliminate redundant search (see Figure 5.11(b)) [69]. This amounts to Ψ HS = R- COLLAPSE(Ψ (2) FS ). Thus, the search cost and import volume are reduced by nearly half compared to that of FS, i.e., |Ψ HS | = 14. 5.4.3.3 Eighth-shell method ES improves over HS by relaxing the owner-compute rule, thereby interacting only with the neighbor cells in the upper-corner octant [11]. This amounts to Ψ ES = OC-SHIFT(Ψ HS ) = Ψ (2) SC . Hence, ES is a special case of the SC algorithm for n = 2. OC-shift reduces the cell 119 footprint of ES to |Π(Ψ ES )| = 7, see Figure 5.11 (c). This in turn reduces the import volume for parallel computation, which is Eq. (5.33) for n = 2. Figure 5.11: Schematic of shell methods for pair computation: (a) full-shell, (b) half-shell, and (c) eighth-shell patterns. 5.5 Performance Benchmark In this section, we benchmark the performance of SC-MD and two existing n-tuple computation codes—FS-MD and Hybrid-MD—for a real many-body MD application. Specifically, we consider MD simulation of silica (SiO 2 ), which involves dynamic pair and triplet (n = 2 and 3) computations [56]. In this application, r cut-3 is smaller than r cut-2 , i.e., r cut- 3 /r cut-2 ~ 0.47. FS-MD uses a computation pattern generated from GENERATE-FS in Figure 5.8 without further performing OC-shift and R-collapse. Hybrid-MD is a production code presented in Ref. [12]. In this code, the pair computation is done by constructing a dynamic pair list (called Verlet neighbor list) within UCP for Ψ (2) FS . Then, Hybrid-MD exploits the shorter cutoff of the triplet computation by pruning the triplet search directly from the pair list without using the cell data structure with r cut-3 . Although this hybrid cell/Verlet-neighbor-list approach reduces the triplet search cost in this particular situation, the import volume is not reduced from that of FS- MD. Performance evaluations in this section are performed on two platforms: BlueGene/Q at Argonne National Laboratory and an Intel-Xeon cluster at the Center for High Performance 120 Computing and Communication of the University of Southern California (USC-HPCC) [99]. The BlueGene/Q consists of 49,152 compute nodes, where each node has 16 PowerPC A2 cores. Each core supports 4 hardware threads, which is clocked at 1.6 GHz. The network topology of BlueGene/Q is a 5D torus. Four MPI tasks are spawn on each core of BlueGene/Q to fully take advantage of integer pipeline and cache-latency hiding in the BlueGene/Q architecture. Details of the BlueGene/Q architecture can be found in Ref. [29]. Tests on the USC-HPCC cluster are performed on dual 6-core processor nodes with 2.33 GHz Intel Xeon X5650 processors and 48 GB memory per node (see Table 5.1). Table 5.1: Specifications of testing platforms. 
                     ANL-Vesta            USC-HPCC
Processor            IBM PowerPC A2       Intel Xeon X5650
Clock speed          1.6 GHz              2.66 GHz
Number of nodes      36,864               256
Cores per node       16                   12
Memory per node      32 GB                24 GB
Network              5D torus             Myrinet 10G

5.5.1 Search Cost of SC and FS Algorithms

According to the analysis in section 5.4.1, the search cost of SC is much smaller than that of FS (asymptotically half for large n). To confirm this assertion, we measure the actual number of n-tuples in the force set for SC-MD and FS-MD. Figure 5.12 shows the measured number of triplets per MD step (averaged over 10,000 time steps) as a function of the number of cells, where the average cell density 〈ρ_cell〉 is fixed for each measurement. The plot shows that the triplet count of FS-MD is ~2.13 times that of SC-MD.

Figure 5.12: Average number of triplets as a function of domain size. Error bars are too small to be visible in the plot.

5.5.2 Fine-Grain Parallelism

We compare the runtime of SC-MD with those of FS-MD and Hybrid-MD to characterize the performance as a function of granularity (i.e. the number of atoms per core, N/P) on 48-64 nodes for small grains (N/P = 24-3,000).

Figure 5.13(a) shows the average runtime over 10,000 MD steps of the three codes on 48 nodes of the Intel Xeon cluster. The average grain size is varied from 24 to 3,000 atoms per core. At the smallest grain (N/P = 24), the runtime per MD step of SC-MD is much shorter than those of FS-MD and Hybrid-MD, by factors of 10.5 and 9.7, respectively. This is mainly due to the small import volume of SC-MD. Accordingly, SC-MD is faster than FS-MD at all granularities. However, the performance of Hybrid-MD relative to that of SC-MD improves gradually as the granularity increases. This can be understood as follows. While SC-MD reduces the import volume compared with Hybrid-MD, Hybrid-MD reduces the triplet search cost by taking advantage of the special cutoff condition. For larger granularities (or larger computation/communication ratios), the advantage of the smaller import volume of SC-MD becomes overshadowed by its larger triplet search cost, and hence Hybrid-MD becomes more advantageous. The crossover of performance advantage from SC-MD to Hybrid-MD occurs at a granularity of 2,095 atoms per core.

Figure 5.13(b) shows the average runtime over 10,000 MD steps of the three codes on 64 nodes of BlueGene/Q. In accordance with the result on the Intel Xeon platform, the finest-grain result of SC-MD shows 5.7- and 5.1-fold speedups over FS-MD and Hybrid-MD, respectively. In this test, the crossover of performance advantage from SC-MD to Hybrid-MD on BlueGene/Q is found to be at N/P = 425, which is considerably smaller than that on Intel Xeon. This is likely due to the lower computational power per core of BlueGene/Q compared with that of Intel Xeon. Consequently, the benefit of the smaller triplet search space of Hybrid-MD is emphasized, shifting down the trade-off point between the search cost and the import-volume size.

Figure 5.13: Runtime of SC-MD (red), FS-MD (green), and Hybrid-MD (blue) as a function of granularity on (a) 48 Intel Xeon nodes and (b) 64 BlueGene/Q nodes. The plot shows that SC-MD is the fastest for N/P < 2,095 and 425 on the Intel Xeon and BlueGene/Q platforms, respectively.

5.5.3 Strong-Scaling Benchmark

In this subsection, we perform a strong-scaling benchmark of SC-MD, FS-MD, and Hybrid-MD on 1-512 BlueGene/Q nodes (i.e. 16-8,192 cores) and on 1-64 Intel Xeon nodes (i.e. 12-768 cores).
Strong-scaling speedup is defined as S strong = T reference T parallel , where T reference denotes the time spent on a reference test (e.g. sequential or small parallel runs) and T parallel is the time spent on a larger parallel run to solve the same problem. The corresponding parallel efficiency is η strong = S strong /(P parallel /P reference ), where P parallel and P reference are the numbers of cores used in the parallel and reference runs, respectively. The total number of atoms used in the benchmark is fixed at 0.88 and 0.79 millions atoms on Intel-Xeon and BlueGene/Q platforms, respectively. Atoms in both systems are uniformly distributed. Figure 5.14(a) shows the strong-scaling speedup of SC-MD, FS-MD, and Hybrid-MD as a function of the number of cores on the Intel-Xeon cluster. The plot shows that SC-MD achieves excellent scalability from 1−64 nodes with a 59.3-fold speedup (or 92.6% parallel efficiency) on 768 Intel Xeon cores. On the other hand, the scalability of FS-MD and Hybrid- MD decline after the number of cores exceeds 96, so that S strong becomes 24.5 and 17.1 on 768 cores, respectively. The corresponding parallel efficiencies are 38.3% and 26.8%. The number of cores and system size used in this benchmark are in an affordable range for general scientists using commodity clusters. The second benchmark measures strong scalabilities of the three codes on BlueGene/Q. Similarly to the previous benchmark, the results in Figure 5.14(b) indicate that SC-MD maintains 125 excellent strong scalability with S strong = 465.6 (or η strong = 90.9%) on 8,192 cores. This demonstrates an excellent strong scalability of SC-MD on larger platforms. On the other hand, FS-MD and Hybrid-MD maintain decent parallel efficiency only on small numbers of nodes, where 7.1- and 7.0-fold speedup is observed on 8 nodes. After that, their scalabilities decrease gradually, and only 55.1- and 95.2-fold speedups (10.8% and 18.6% efficiencies) are observed on 8,192 cores (N/P ~100 or ~26 atoms per MPI task). Figure 5.14: Strong scaling speedup of SC-MD, Hybrid-MD, and FS-MD on (a) Intel Xeon cluster and (b) BlueGene/Q. 5.6 Conclusion We have developed a computation-pattern algebraic framework to formalize dynamic n- tuple computation in many-body MD. This new formalism allows us to perform systematically and mathematically proven analysis of dynamic range-limited n-tuple MD, which to the best of our knowledge, has not been done before. The new formulation and analysis have led to the development of the SC algorithm, which generalizes some of the best pair (n = 2) computation algorithms to an arbitrary n. Benchmark tests have shown that SC-MD outperforms our 126 production Hybrid-MD code for fine granularities. Excellent strong scalability has also been observed for SC-MD. The results demonstrate the advantage of SC-MD over Hybrid-MD when the time-to-solution (rather than simulating the largest possible system size) is the major goal. There is an additional benefit of SC-MD compared with Hybrid-MD that combines cell and Verlet-neighbor-list methods. Namely, SC exposes maximal concurrency on heterogeneous architectures such as GPU-accelerated and many-core clusters. Since SC executes different n- tuple computations independently, they can be assigned to different hardware (e.g. multiples GPUs). In contrast, Hybrid-MD has a sequential dependence, i.e., the Verlet-neighbor list must be constructed within the pair computation before any n > 2 computation can be performed. Another issue is the cell size. 
Though we have restricted ourselves to the cell size larger than r cut- n for simplicity, it is straightforward to generalize the SC algorithm to a cell size less than r cut-n as was done, e.g., in the midpoint method [100]. In this case, the SC algorithm improves the midpoint method by further eliminating redundant searches. Relative advantages between ES and midpoint methods have been thoroughly discussed by Hess et al. [12]. 127 Chapter 6 Performance Analysis, Modeling, and Optimization of Cell- Based Molecular Dynamics In this chapter, we present an analysis of a computation cost model for MD simulations within a single node. The analysis and the developed model pave a way to provide understanding about computational behavior of MD, thereby benefiting the development of prediction models for hybrid algorithms. We develop a performance prediction model for short-range interaction computations in MD simulations, thereby predicting the optimal cell dimension in a linked-list cell method. The model expresses computation time in terms of the number and unit computation time of key operations. The model accurately estimates the number of operations during the simulations with the maximum standard error of 10.6% compared with actual measurements. Then, the unit computation times of the operations are obtained by bisquare regression. Analysis of this model reveals that the optimal cell dimension to minimize the computation time is determined by a trade-off between decreasing search space and increasing linked-list cell access for smaller cells. 128 The model predicts the optimal cell dimension correctly for 80% of all tested cases, resulting in an average speedup of 10% and 52% for the cutoff radius of interaction, 6.6 and 10.0 Å, respectively. This chapter is organized as follows. Section 6.1 describes the problem, approach, and opportunity for optimization. Section 6.2 presents the linked-list cell MD and reduced linked-list cell method, to which the proposed performance model is applied. Section 6.3 describes the theoretical analysis of simulation parameters related to the cell dimension selection. Section 6.4 presents the verification of the performance model including the fitting procedure of each estimated parameters. Section 6.5 evaluates the model against measured computation times. Conclusions are drawn in section 6.6. 6.1 Introduction Large-scale MD simulations involving multibillion atoms are beginning to address broad problems. One of the widely used method for improving the performance and scalability of such large-scale MD simulations is the linked-list cell method, which employs a divide-and-conquer strategy to reduce the search space for atomic pairs within the cutoff radius R c of the interatomic interaction. The conventional cell decomposition method uses cubic cell geometry with the dimension ~ R c . However, this incurs considerable amount of unnecessary pair evaluations due to the excessive cell dimension. A commonly used method to decrease the redundant calculations is reducing the cell dimension, see e.g. Mattson et al. [101]. In particular, Yao et al. stated that the cell dimension of R c /2 usually gives the best performance [102]. Our own experience, however, indicates that the optimal cell dimension varies considerably for different parameters 129 such as R c and the number of atoms. It is thus of great interest to systematically study and identify the factors that determine the optimal cell dimension. 
To address this problem, we carry out a theoretical analysis to estimate the amount of non-bonded interaction computations, which dominate the computational time. Then, the platform-dependent computational cost associated with each key operation is determined by regression. Finally, we use the model to predict the performance as a function of the cell dimension, thus obtain the optimal cell dimension to minimize the computation time. 6.2 Cell-Base Molecular Dynamics Simulations In this chapter, the dominant computation of MD simulation is the evaluation of E(r N ), which consists of two-body E 2 (r i ,r j ) and three-body E 3 (r i ,r j ,r k ) terms in the program considered here [6]. Figure 6.1: 2D schematic of the linked-list cell method. (a) Illustration of conventional linked- list cell method with cell dimension Rs ≥ Rc. (b) 2D schematic of the reduced linked-list cell method. Atoms in the shaded region will be excluded from the force computation of center atom i. Only forces exerted by atoms within the cutoff radius (represented by a two-headed arrow) are computed. 130 Figure 6.1(a) shows a schematic of the computational kernel of MD, which employs a linked-list cell method to compute interatomic interactions in O(N) time. Periodic boundary condition is applied to the system in three Cartesian dimensions. Here, a simulation domain is divided into small rectangular cells, and the linked-list data structure is used to organize atomic data (e.g. coordinates, velocities and atom type) in each cell. By traversing the linked list, one retrieves the information of all atoms belonging to a cell, and thereby computes interatomic interactions. Thus, the cutoff radius R c of the interatomic interaction usually determines the dimensions of the cells. In order to explain the computational procedures in the linked-list cell method, let us define S, which is a set of all atomic pairs, (i, j), within the nearest-neighbor cells. The first phase of the computation, i.e. pair search, constructs a subset, S 0 , of pairs (i, j) whose distance r ij satisfies r ij ≤ R c . Subsequently, second phase computes the forces between all atom pairs within S 0 . Figure 6.2 shows the pseudocode for pair search and force computation. Phase 1: Pair search for do if do end end Phase 2: Force computation for do compute interaction force end Figure 6.2: Force computation algorithm. € S 0 =∅ € ∀(i, j)∈ S € r ij ≤ R c € S 0 = S 0 ∪(i, j) € ∀(i, j)∈ S 0 € f ij ( r ij ) 131 The reduced linked-list cell method attempts to reduce the search space S by decreasing the dimension R s of the linked-list cell. To illustrate this, Fig. 6.1(b) shows the situation where R s is reduced by half. The atoms in the shaded region are at least one cutoff distance away from the center atom i such that it can be safely excluded. Hence, less number of pairs needs to be evaluated, potentially resulting in the performance increase. However, this performance gain could be offset by the increased number of linked-list cells that need to be scanned (e.g., from 9 to 25 cells in the 2D example) and the associated computational overhead for processing the increased number of linked lists. The MD program used in this chapter is parallelized based on spatial decomposition and message passing [6]. However, the focus of this chapter is performance optimization within each compute node, and thus the number of atoms N hereafter denotes the number of atoms per node. 
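The two phases of Figure 6.2 can be illustrated with a minimal, self-contained Python sketch of a linked-list (cell-list) pair search. This is not the production silica code: the pair force used here is a simple repulsive placeholder (the actual potential contains two- and three-body terms), the coordinates are random, and the function names are chosen for illustration only.

```python
import numpy as np
from itertools import product

def build_cells(pos, box, r_s):
    """Assign atoms to linked-list cells of dimension >= r_s (conventional decomposition)."""
    n_cells = np.maximum((box // r_s).astype(int), 1)      # cells per direction
    cell_len = box / n_cells
    idx = (pos // cell_len).astype(int) % n_cells           # periodic wrap
    cells = {}
    for a, c in enumerate(map(tuple, idx)):
        cells.setdefault(c, []).append(a)
    return cells, n_cells

def pair_search(pos, box, cells, n_cells, r_c):
    """Phase 1: collect pairs (i, j) with r_ij <= R_c from the 27 neighbor cells."""
    pairs = set()                                           # set guards against double counting
    for c, atoms in cells.items():
        for d in product((-1, 0, 1), repeat=3):
            nb = tuple((c[k] + d[k]) % n_cells[k] for k in range(3))
            for i in atoms:
                for j in cells.get(nb, []):
                    if j <= i:                              # Newton's 3rd law: keep each pair once
                        continue
                    r = pos[i] - pos[j]
                    r -= box * np.round(r / box)            # minimum-image convention
                    if np.dot(r, r) <= r_c * r_c:
                        pairs.add((i, j))
    return pairs

def compute_forces(pos, box, pairs, n_atoms):
    """Phase 2: accumulate a placeholder short-range pair force over the pair list."""
    forces = np.zeros((n_atoms, 3))
    for i, j in pairs:
        r = pos[i] - pos[j]
        r -= box * np.round(r / box)
        f = r / np.dot(r, r) ** 4                           # toy repulsive force, not the silica potential
        forces[i] += f
        forces[j] -= f
    return forces

rng = np.random.default_rng(0)
box = np.array([20.0, 20.0, 20.0])
pos = rng.uniform(0.0, 20.0, size=(200, 3))
cells, n_cells = build_cells(pos, box, r_s=5.5)
pairs = pair_search(pos, box, cells, n_cells, r_c=5.5)
forces = compute_forces(pos, box, pairs, len(pos))
print(len(pairs), np.abs(forces.sum(axis=0)).max())         # net force ~ 0 by Newton's 3rd law
```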
6.3 Performance Analysis Computing non-bonded interactions usually consumes the largest portion of time in MD simulation [23]. In our performance model, we assume that the computational cost consists of three terms: (1) pair search where the number of search operations P s = |S|, i.e. the cardinality of set S (phase 1 in ); (2) force computation for which the number of force computation operations P c = |S 0 | (phase 2 in ); and (3) linked-list cells access overhead—P l . Therefore, the non-bonded computation time in one MD step—T, is T =T s P s +T c P c +T m P l , (6.1) where T s , T c , and T m are the operation cost per P s , P c , and P l , respectively. 132 Generally, the memory access cost (the third term in Eq. (6.1)) in the conventional cell decomposition method is ignored when compared to the pair-search and force-computation costs. However, in the reduced linked-list cell method, this overhead often becomes dominant. In section 6.3.1, we derive P s , P c , and P l for the conventional cell-decomposition method, while the extension of the model to the reduced cell method is derived in section 6.3.2. The operation costs T s , T c , and T m are obtained from regression described in section 6.3.3. 6.3.1 Short-Range Interaction Cost Model of the Conventional Linked-List Cell Method 6.3.1.1 Pair search cost model The search space for short-range force computations is defined as the number of pairs that need to be evaluated in phase 1 of . For N atoms, the search space of an all-pair method is O(N 2 ). In this section, we derive the approximation of the O(N) search space in the linked-list cell method. For the uniform cell decomposition method with cubic cell dimension for all cells, the volume of each cell is V c =R s 3 , (6.2) where R s is the dimension of the linked-list cell such that . If the number density of atoms in the system is ρ atom/Å 3 , the average number of atoms in one linked-list cell is N c =ρR s 3 . (6.3) The number of possible short-range pairs between atoms in two cells is then € R s ≥ R c 133 P cc = 1 2 N c N c −1 ( ) ≈ 1 2 N c 2 = 1 2 ρ 2 R s 6 . (6.4) Let C ijk be a linked-list cell with cell indices i, j, and k in the x, y, and z directions, respectively. To compute forces exerted to all atoms in C ijk , all atom-pair combinations between C ijk and its nearest neighbor cells Cʹ′ iʹ′ jʹ′kʹ′ need to be evaluated, where S neighbor (C ijk )={ ! C ! i ! j ! k (i−1≤ ! i ≤i+1)∧(j−1≤ ! j ≤ j+1)∧(k−1≤ ! k ≤k+1)} , (6.5) is the set of neighbor cells. Since the conventional cell decomposition method evaluates all atomic pairs between C ijk and its 26+1 neighbor cells (including itself), the number of pairs to be evaluated per one linked-list cell is P s C ( ) = 1 2 ρ 2 R s 6 N neighbor , (6.6) where N neighbor = |S neighbor | = 27. Let L x , L y , and L z be the number of cells in the x, y and z directions, respectively, then the total number of atom pairs to be evaluated in one MD step is P s = P s C ijk ( ) k=1 L z ∑ j=1 L y ∑ i=1 L x ∑ = 1 2 ρ 2 R s 6 L x L y L z N neighbor . (6.7) 6.3.1.2 Force computation cost model After the interaction pairs have been evaluated, only the pairs that satisfy the evaluation criterion (i.e., r ij ≤ R c ) undergo force computation. The number of force computations can be estimated as follows. First, we count the number of atoms that are located inside the sphere with radius R c around each atom. The volume of such sphere is 134 V fc = 4 3 πR c 3 , (6.8) and the number of atoms that reside within V fc is N fc = 4 3 πρR c 3 . 
(6.9) Let N denote the number of atoms in the system. From Eq. (6.3), N can be expressed as N =ρR s 3 L x L y L z . (6.10) Combining Eqs. (6.9) and (6.10), the number of force computation operations—P c is P c = N fc N = 4 3 πρ 2 R s 3 R c 3 L x L y L z . (6.11) The number of pair-force computations in Eq. (6.11) can be halved by applying Newton’s 3 rd law (i.e., the forces acting on a pair of atoms are equal and in the opposite directions) such that P c = 2 3 πρ 2 R s 3 R c 3 L x L y L z . (6.12) 6.3.1.3 Linked-list cell access cost model Linked-list cell access cost refers to the overhead introduced by the cell decomposition scheme. To compute the forces on atoms in cell C, N neighbor cells need to be traversed. Therefore, the number of linked-list cell accesses is P l =L x L y L z N neighbor . (6.13) 135 Combining Eqs. (6.7), (6.12) and (6.13), the prediction model for short-range pair computation is given by T =T s 1 2 ρ 2 R s 6 L x L y L z N neighbor ! " # $ % & +T c 2 3 πρ 2 R s 3 R c 3 L x L y L z ! " # $ % & +T m L x L y L z N neighbor ( ) . (6.14) In the next subsection, we derive a corresponding model for the reduced cell method to be used to obtain the optimal cell dimension. 6.3.2 Short-Range Interaction Cost Model of the Reduced Linked-list Cell Method 6.3.2.1 Reduced-cell cost model In this subsection, we derive the number of searched pairs P s ʹ′, that of pair-force computations P c ʹ′, and that of linked-list accesses P l ʹ′, when the cell dimension R s is reduced to R s ! = R s µ , (6.15) where µ ∈ N is a cell dimension reduction factor. The cell dimension reduction also modifies the number of linked-list cells such that L x ! =µL x L y ! =µL y L z ! =µL z . (6.16) 136 In the reduced cell method, the 1 st -nearest neighbor cells are not sufficient for interaction evaluations, and instead up to the µ th -nearest neighbor cells need to be evaluated. Therefore, N neighbor can be calculated as N neighbor = 2µ+1 ( ) 3 . (6.17) Substituting R s ʹ′, L x ʹ′, L y ʹ′, L z ʹ′, and N neighbor to Eqs. (6.7), (6.12), and (6.13), respectively, yield P s ! = 1 2 ρ 2 R s ! 6 L x ! L y ! L z ! N neighbor = 1 2 ρ 2 R s µ " # $ % & ' 6 µ 3 L x L y L z 2µ+1 ( ) 3 = 2µ+1 ( ) 3 2µ 3 ρ 2 R s 6 L x L y L z , (6.18) P c ! = 2 3 πρ 2 R s ! 3 R c 3 L x ! L y ! L z ! = 2 3 πρ 2 R s µ " # $ % & ' 3 R c 3 µ 3 L x L y L z = 2 3 πρ 2 R s 3 R c 3 L x L y L z =P c , (6.19) P l ! =L x ! L y ! L z ! N neighbor =µ 3 2µ+1 ( ) 3 L x L y L z . (6.20) Equation (6.19) indicates that P c ʹ′ is independent of the reduction factor µ. This is consistent with the requirement that the choice of cell dimension should not affect the simulation results as noted by Heinz et al. [103]. 137 6.3.2.2 Asymptotic analysis of the number of key operations In this section, we analyze the relation of the reduction factor µ to P s ʹ′, P c ʹ′, and P l ʹ′. This can be done by the expansion of Eqs. (6.18) and (6.20) in the power of µ: P s ! = 1 2 µ −3 +3µ −2 +6µ −1 +4 # $ % & ' ( ρ 2 R s 6 L x L y L z , (6.21) P l ! = 8µ 6 +12µ 5 +6µ 4 +µ 3 ( ) L x L y L z , (6.22) while P c ʹ′ is independent of µ. Equations (6.21) and (6.22) provide asymptotic behaviors of P s ʹ′ and P l ʹ′ with respect to µ as summarized in . (The constant term of P s ʹ′, which is irrelevant for the determination of optimal µ, is omitted in ) Since P s ʹ′ and P l ʹ′ are descending and ascending functions of µ, respectively, the optimal µ value that minimizes the computation time is determined by their trade-off. 
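A small sketch can make the reduced-cell model concrete. The code below is illustrative only: it evaluates the operation counts of Eqs. (6.18)-(6.20) and the predicted time of Eq. (6.23) over a range of reduction factors and selects the minimizer as in Eq. (6.24). The unit costs T_s, T_c, and T_m are placeholders taken from the bisquare-regression values reported later in Table 6.3, and the density, cutoff, and cell size are those of section 6.4.1 on an assumed 20x20x20 cell grid.

```python
import numpy as np

def operation_counts(mu, rho, R_s, R_c, n_cells):
    """Operation counts of Eqs. (6.18)-(6.20) for cell-dimension reduction factor mu.
    n_cells = L_x * L_y * L_z of the unreduced (R_s-sized) cell decomposition."""
    P_s = (2 * mu + 1) ** 3 / (2.0 * mu ** 3) * rho ** 2 * R_s ** 6 * n_cells
    P_c = (2.0 / 3.0) * np.pi * rho ** 2 * R_s ** 3 * R_c ** 3 * n_cells   # independent of mu
    P_l = mu ** 3 * (2 * mu + 1) ** 3 * n_cells
    return P_s, P_c, P_l

def predicted_time(mu, rho, R_s, R_c, n_cells, T_s, T_c, T_m):
    """Predicted short-range computation time per MD step, Eq. (6.23)."""
    P_s, P_c, P_l = operation_counts(mu, rho, R_s, R_c, n_cells)
    return T_s * P_s + T_c * P_c + T_m * P_l

# Example parameters: rho, R_s, R_c from section 6.4.1 on a 20x20x20 cell grid;
# unit costs (seconds) are the bisquare-fitted values reported later in Table 6.3.
rho, R_s, R_c, n_cells = 0.0654, 5.728, 5.5, 20 ** 3
T_s, T_c, T_m = 6.48e-9, 52.89e-9, 11.80e-9
times = {mu: predicted_time(mu, rho, R_s, R_c, n_cells, T_s, T_c, T_m) for mu in (1, 2, 3, 4)}
mu_star = min(times, key=times.get)                         # Eq. (6.24)
print(mu_star, {mu: round(t, 3) for mu, t in times.items()})  # mu* = 1 for this R_c = 5.5 A case
```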
The asymptotic analysis reveals that the increase of the number of linked-list cell accesses P l ʹ′ with µ is significantly larger than the reduction of search space P s ʹ′, while P c ʹ′ is not affected by µ. This implies that asymptotically, the overhead cost from linked-list access overwhelms the benefit due to the reduced the search space for large µ. This agrees with finding by Mattson et al. [101] and Yao et al. [102] such that the computation time is minimal when µ is small. From Eq. (6.1), the costs of pair search T s and that of linked-list access T m are essential parameters to quantify the trade-off between P s and P l , so that the optimal value of µ can be determined as described in section 6.4. 138 Table 6.1: Asymptotic complexity of parameter P s ʹ′, P c ʹ′, and P l ʹ′ over µ. Operation count Scaling with reduction factor µ P s ʹ′ O(µ -3 ) P c ʹ′ O(1) P l ʹ′ O(µ 6 ) 6.4 Model Validation As derived in section 6.3, the computation time for short-range interactions in the reduced linked-list cell method is given by T =T s P s ! +T c P c ! +T m P l ! =T s 2µ+1 ( ) 3 2µ 3 ρ 2 R s 6 L x L y L z " # $ $ % & ' ' +T c 2 3 πρ 2 R s 3 R c 3 L x L y L z " # $ % & ' +T m µ 3 2µ+1 ( ) 3 L x L y L z ( ) . (6.23) In section 6.4.1, we validate the accuracy of the operation-counts models—P s ʹ′, P c ʹ′, and P l ʹ′—by comparing them with measured values. In section 6.4.2, we fit the P s ʹ′, P c ʹ′, and P l ʹ′ derived from the estimation, Eq. (6.23), to obtain the operation costs—T s , T c , and T m . 6.4.1 Validation of the Operation Counts Model MD simulations of silica at temperature 300 K involving up to 192,000 atoms are performed on an Intel Core-i7 2.66 GHz machine, using the cell reduction factor µ ranging from 1 to 4. The cutoff radius R c of this dataset is set to 5.5 Å with atom density ρ = 0.0654 atom/Å 3 , and the linked-list cell size R s = 5.728 Å. 139 Under this condition, the actual values of P s ʹ′, P c ʹ′, and P l ʹ′ measured during the simulations are compared with those obtained from the model; see Table 6.2. The relative error of P s ʹ′ and P l ʹ′ is less than 4% in all cases, confirming the applicability of our model for these parameters. For P c ʹ′, the relative error is ~ 10% for the largest number of atoms (N = 192,000), while it is ~ 8% for the smallest system (N = 1,536). Nevertheless, the maximum error of 10% is acceptable for the subsequent analysis, especially in the light of the highly dynamic nature of MD simulations. Table 6.2: Accuracy of the operation counts model. Number of atoms Reduction factor, µ Relative error (%) P s ʹ′ P c ʹ′ P l ʹ′ 1,536 1 1.67 8.01 3.60 2 1.31 8.01 0.70 3 0.96 8.01 0.19 4 0.95 8.01 0.04 12,288 1 0.79 -0.51 3.50 2 0.76 -0.51 0.60 3 0.49 -0.51 0.09 4 0.39 -0.51 -0.06 41,472 1 -0.24 -2.28 2.68 2 -0.21 -2.28 -0.20 3 -0.40 -2.28 -0.70 4 -0.31 -2.28 -0.85 98,304 1 -1.43 -3.29 1.67 140 2 -1.35 -3.29 -1.18 3 -1.42 -3.29 -1.67 4 -1.59 -3.29 -1.83 192,000 1 -3.50 -10.59 -0.28 2 -3.29 -10.59 -3.08 3 -3.39 -10.59 -3.57 4 -3.38 -10.59 -3.71 6.4.2 Parameter Fitting In section 6.4.1, we confirm that Eq. (6.23) accurately approximates the operation counts. In this section, we use the prediction from the model and the running time from the execution of actual MD simulation to find the cost parameters T s , T c , and T m . Here, we need to use more datasets in order to get better statistics for the fitting. Since R c is one of the most important variables, we here concentrate on the effect of R c on the performance. 
We run two additional sets of MD simulations using the same setting as in section 6.4.1 except for R c , which is set here to 6.6 and 10 Å. Using the measured timings, we perform least square regression to obtain T s , T c , and T m from Eq. (6.23). However, the results obtained from the least square regression show very large error up to 93% for estimated T c ; see . This may be explained by the two orders-of-magnitude difference between the largest and smallest dataset sizes (1,536 and 192,000 atoms), which causes the fitting residuals of larger datasets to dominate the significance of smaller datasets. To alleviate this artifact, a more robust regression algorithms, Huber regression and bisquare-weighted regression, are used instead of the least square regression [104, 105]. Table 6.3 shows that the estimated operation costs from Huber 141 regression have smaller standard errors (at most 40.8%), while those from bisquare-weighted regression have the least errors (at most 22.5%). Therefore, we use the estimated costs from bisquare-weighted regression to predict the performance and thereby obtain the optimal reduction factor µ in the next section. Table 6.3: Estimated operation costs from various regression algorithms. Regression algorithm Operation cost Estimated cost (ns) Standard error (%) Least square T s 9.76 40.2 T c 34.83 93.1 T m 10.44 7.6 Bisquare c = 1.865 T s 6.48 22.2 T c 52.89 22.5 T m 11.80 2.5 Huber c = 1.345 T s 10.77 14.9 T c 32.56 40.8 T m 10.81 3.0 6.5 Performance Results Here, we use the estimated operations counts and costs in the previous section to predict the performance of the simulation when the cell size is reduced. The optimal reduction factor µ* is then determined as the one that minimizes the predicted computation time: µ * =argmin T µ ( ) ( ) . (6.24) 142 To study the computation time as a function of µ from the prediction, we compare the predicted and measured performances of the system with R c = 5.5, 6.6, 10.0 Å, with the number of atoms ranging from 1,536, 12,288, 41,472, 98,304, to 192,000. Figure 6.3 shows that the prediction captures the essential behavior of the short-range computation time in the actual simulations as a function of µ. This result agrees with the asymptotic characteristics of µ in Table 1, where the linked-list cell access cost dominates for larger µ. Figure 6.4 compares µ* obtained from the prediction, Eq. (6.23), with those obtained from actual measurements. The µ* for R c = 5.5 Å from actual simulation dataset are all 1, similar to the predicted values. Namely, reducing the cell dimension does not improve the performance in this case. On the other hand, the predicted µ* of the system with R c = 6.6 and 10.0 Å is 2 in almost all cases with the average speedup of 1.10 and 1.52, respectively. This result indicates that the reduction of the cell dimension becomes more beneficial for larger R c . In such cases, the pair search space is larger and the associated search cost P s ʹ′ becomes more important relative to the list-processing overhead P l ʹ′. Table 6.4 shows the speedup achieved from our optimization compared to the actual optimal speedup. 143 Figure 6.3: The running time from the prediction and the measurement for R c = 10 Å. The numerals are the number of atoms. Figure 6.4: Comparison of speedup gained from predicted and actual μ* for R c = 6.6 and 10 Å. 144 Table 6.4: The predicted versus measured values of the optimal µ with their corresponding speedup compared to the case of no cell dimension reduction. 
R_c (Å)   N         µ* (predicted)   µ* (actual)   Speedup (predicted µ*)   Speedup (actual µ*)
5.5       1,536     1                1             1.00                     1.00
5.5       12,288    1                1             1.00                     1.00
5.5       41,472    1                1             1.00                     1.00
5.5       98,304    1                1             1.00                     1.00
5.5       192,000   1                1             1.00                     1.00
6.6       1,536     2                2             1.12                     1.12
6.6       12,288    2                2             1.28                     1.28
6.6       41,472    2                1             0.92                     1.00
6.6       98,304    2                2             1.15                     1.15
6.6       192,000   2                2             1.02                     1.02
10.0      1,536     3                3             1.85                     1.85
10.0      12,288    3                2             1.54                     1.74
10.0      41,472    2                2             1.41                     1.41
10.0      98,304    3                2             1.54                     1.70
10.0      192,000   2                2             1.28                     1.28

6.6 Summary

We have developed a performance prediction model for the short-range interaction computations of molecular dynamics simulations, which can be used to obtain the optimal cell dimension in the reduced linked-list cell algorithm. The model has been validated against measurement results. The model accurately predicts the optimal reduction factor, and average speedups of 1.10 and 1.52 are obtained for the R_c = 6.6 and 10.0 Å datasets, respectively. Our analysis reveals that the optimal reduction factor is determined by a trade-off between the decreasing search space and the increasing linked-list cell access for smaller cells.

Chapter 7 Memory-Access Analysis and Optimization via Dynamic Data Reordering

One of the major problems in improving the performance of sophisticated scientific applications such as MD simulation is maintaining data locality. Since the memory-access pattern in MD simulation is highly non-uniform and unpredictable, the locality optimization problem becomes even more challenging. This issue leads to fluctuations in MD simulation performance. Such unstable performance is intolerable for the design of performance models and thus needs to be addressed. In this chapter, we first discuss the causes of performance fluctuation arising from poor memory-access patterns. We then propose a solution based on dynamic data reordering.

This chapter is organized as follows. Section 7.1 presents the parallel MD program in which the proposed memory optimization scheme is implemented. Section 7.2 describes the analysis of simulation parameters related to data fragmentation. Section 7.3 presents the performance measurements and discusses the correlation between fragmentation and performance. Section 7.4 proposes our adaptive memory-access optimization scheme based on the data fragmentation analysis and performance measurements. A summary is given in section 7.5.

7.1 Introduction

A commonly used method to reduce the performance fluctuation caused by poor data locality is data reordering. Data reordering organizes data of irregular memory-access patterns in memory according to a certain locality metric [102, 106-108]. However, a further challenge arises from the dynamic, irregular nature of MD computations. Mellor-Crummey et al. [109, 110] showed that data reordering and computation restructuring enhance data locality, resulting in the reduction of cache and translation-lookaside buffer (TLB) misses and accordingly considerable performance improvement of MD simulations. The study also suggested the necessity of repeated runtime reordering, for which the remaining problem is how often the reordering should be performed in order to achieve the optimal overall performance. In addition, as data reordering and computation restructuring incur computational overhead, such reordering cost should be considered in finding the optimal reordering frequency.

Runtime data behaviors of MD simulations are closely related to both physical properties of the system being simulated (e.g. temperature, diffusion rate, and atomic configuration) and computational parameters (e.g. spatial decomposition granularity in parallel MD).
Therefore, understanding how these quantities affect the deterioration of data locality is a prerequisite for designing an optimal runtime data-reordering schedule.

To address these challenges, we first introduce a data fragmentation ratio metric that quantifies the locality of atom data arrays during MD simulation. Then, we analyze how MD simulations with different physical and computational characteristics (e.g. temperature, diffusion rate, and granularity) affect the data fragmentation ratio and the performance deterioration of the system. Based on the data fragmentation analysis, we finally design and evaluate an adaptive memory optimization scheme that features an optimal runtime data-reordering schedule during MD simulation.

7.2 Data Fragmentation in Parallel Molecular Dynamics

A high memory-access penalty is a major obstacle for irregular memory-accessing applications and needs to be mitigated. Our MD application performs data and computation ordering by organizing atom data contiguously in memory in the same order as the linked lists of the cells access them. Although the spatial locality after data reordering is retained for a while, the data become disordered as the simulation progresses due to atom movement among cells. How the benefit from data reordering deteriorates as the simulation evolves at runtime essentially affects the performance of MD simulations.

To quantify the level of fragmentation, we first define a fragmentation measurement metric. Let C(i, j, k) be a linked-list cell with cell indices i, j, and k in the x, y, and z directions, respectively. Each linked-list cell contains the indices of atoms whose coordinates lie within the cell's dimensions. Before data reordering, atom data are likely scattered in memory as illustrated in Figure 7.1(a), and data reordering improves the data locality as shown in Figure 7.1(b), where the atom data of the same physical cell volume reside contiguously. Fragmentation in the atom data array can occur as follows. Suppose that the a-th atom in linked-list cell C moves to another cell Cʹ. Consequently, the memory block of Cʹ becomes partially fragmented because the a-th atom is not contiguous in memory with the other atoms in Cʹ, so that their distances in memory space exceed the page size. Therefore, any atom moving out of its original linked-list cell introduces fragmentation in memory, which likely causes data translation lookaside buffer (DTLB) misses. To quantify the degree of fragmentation, we thus define the data fragmentation ratio as

f = N_{\mathrm{fragment}} / N,   (7.1)

where N_fragment is the number of atoms that have moved out of their originally ordered cells and N is the total number of atoms. The data fragmentation ratio ranges from 0 (i.e., all atoms in all cells reside contiguously in memory; the fully ordered state, see Figure 7.1(b)) to 1 (i.e., no atom resides in its original cell; the fully disordered state, see Figure 7.1(a)). The data fragmentation ratio is used to quantify fragmentation throughout this chapter. Note that the page size used in all experiments is 4 KB.

Figure 7.1: Schematic of memory layout for atom data in (a) a fully disordered state and (b) a fully ordered state, where C(i,j,k) is the linked list for the cell with indices i, j, and k in the x, y, and z directions, respectively, and r[a] is the data associated with the a-th atom.
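To make the metric concrete, the following minimal sketch (not the production code; the per-atom cell-index arrays and their names are hypothetical) computes f from the cell each atom occupied at the last reordering and the cell it occupies now.

```python
import numpy as np

def fragmentation_ratio(cell_at_last_reorder, cell_now):
    """Eq. (7.1): fraction of atoms that have left the cell they were last ordered into.

    Both arguments are integer arrays of length N holding a (flattened) linked-list
    cell index per atom."""
    n_fragment = np.count_nonzero(cell_at_last_reorder != cell_now)
    return n_fragment / cell_at_last_reorder.size

# Toy example: 8 atoms ordered into cells [0,0,1,1,2,2,3,3]; two atoms have since migrated.
ordered = np.array([0, 0, 1, 1, 2, 2, 3, 3])
current = np.array([0, 1, 1, 1, 2, 2, 3, 2])
print(fragmentation_ratio(ordered, current))   # 0.25
```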
150 Data fragmentation in memory is dictated by the dynamics of atoms, and thus understanding the factors that control the atom dynamics would provide an insight on how to prevent the fragmentation. One of the physical factors that are directly related to the dynamics is the temperature, since high temperature drives atoms to move faster and more freely. Among the computational factors, migration of atom data from one node to another across a subsystem boundary in parallel MD also causes fragmentation. The granularity of spatial decomposition (i.e. the number of atoms per compute node) is related to the subsystem surface/volume ratio, and hence is likely to control the degree of fragmentation via the amount of atom migrations. In the following subsections, we measure and analyze the effects of temperature and granularity on the data fragmentation: Sections 7.2.1 and 7.2.2 deal with fragmentation behavior influenced by temperature and granularity, respectively. 7.2.1 Temperature Induced Fragmentation In this subsection, we study the effect of temperature on the data fragmentation during MD simulations on two systems: Silica material [111] and combustion of aluminum nanoparticles [112]. The silica simulations involving 98,304 atoms and 8,000 linked-list cells (20×20×20 in x, y, and z directions, respectively) are performed for 3,000 MD time steps using a time discretization unit of 2 femtoseconds. The simulations are performed on a dual quadcore Intel Xeon E5410 2.33 GHz (Harpertown) at the High Performance Computing and Communication (USC-HPCC) facility of the University of Southern California. Temperature is a major physical factor that enhances the diffusion of atoms and hence the disordering of data arrays. To examine the effect of atomic diffusions on the data fragmentation ratio, we thus perform a set of simulations with three different initial temperatures, 300 K, 3,000 K, and 6,000 K, starting from the same initial atomic configuration—an amorphous structure with a uniform 151 density across the system. Each dataset represents a distinct phase of silica—solid (300 K), highly viscous liquid (3,000 K) [113], and low-viscosity liquid (6,000 K) (note that the melting temperature of silica is 1,800 K). Here, data ordering is performed only once at the beginning of the simulation, so that the atom data are fully ordered (Figure 7.1(b)) initially, and subsequently we measure the data fragmentation ratio as a function of MD steps without further data reordering. In order to study the influence of temperature on data fragmentation, Figure 7.2 plots the data fragmentation ratio as a function of MD steps. We observe that simulations at higher temperatures exhibit larger fragmentation ratios. The fragmentation of 300 K dataset fluctuates in the first 500 steps, then rises to 0.15 after 1,000 steps, and remains nearly constant throughout the simulation thereafter. At 3,000 K, the fragmentation ratio quickly increases to 0.24 in the first 200 steps, and then increases linearly with time (0.033 per 1,000 steps), reaching 0.36 after 3,000 MD steps. In contrast, the fragmentation ratio at 6,000 K rapidly rises to 0.73 in the first 500 steps, then continues to increase with a slightly slower rate until the data gets almost fully disordered at 0.93 after 3,000 MD steps. 152 Figure 7.2: Time variation of data fragmentation ratio over 3,000 MD steps. Three datasets of different initial temperatures (300 K, 3,000 K, and 6,000 K) are plotted. 
In order to understand the physical origin of the correlation between the fragmentation ratio and temperature in Figure 7.2, we measure the mean squared displacement (MSD) of atoms in each dataset as a function of time. MSD is used to identify the phase (e.g. solid vs. liquid) of the material and to calculate the diffusion rate of atoms in the system, which may be used to quantify the dynamics of atoms during the simulation. Figure 7.3 shows the MSD of 300 K, 3,000 K, and 6,000 K datasets over 6 picoseconds (3,000 MD steps). The result shows that at 300 K, silica remains in solid phase (MSD remains constant). In contrast, 3,000 K silica is melted and becomes a highly viscous liquid with a small diffusion coefficient of 2.31×10 -5 cm 2 /s (the diffusion coefficient is obtained from the linear slope of the MSD curve). Similarly, MSD of 6,000 K dataset shows that the system is completely melted with a much larger diffusion coefficient of 4.17×10 -4 cm 2 /s. Because atoms are not diffusing in 300 K dataset, 83% of the atoms (f = 0.17) remain in the same linked-list cell throughout the simulation. Only atoms close to the cell boundaries move back and forth between the original and the neighbor cells due to 153 thermal vibration. However, atoms in 3,000 K dataset are melted, and their slow diffusion from their original cells causes the gradually increasing fragmentation ratio. For 6,000 K system, atoms diffuse approximately 18 times faster than 3,000 K system, resulting in the rapid rise of fragmentation ratio. Only 3.2% of the atoms remain in the same cell after 3,000 steps. These results clearly show that diffusion is a major physical factor that contributes to data fragmentation in MD simulations. Figure 7.3: Mean squared displacement (MSD) of 300 K, 3,000 K, and 6,000 K datasets. The MSD of 300 K dataset remains constant at 0.23 Å 2 (too small to be seen in the figure). The inset shows the diffusion coefficients at the three temperatures. Although the silica experiment shows that the temperature significantly affects data fragmentation, we also found that its effect on the fragmentation ratio is sensitive to the material and phenomenon being simulated. To demonstrate this point, the second experiment simulates flash heating of an aluminum nanoparticle surrounded by oxygen environment using a reactive interatomic potential [112], which involves 15,101,533 atoms on 1,024 processors of dual core AMD Opteron 270 2.0 GHz (Italy) at the USC-HPCC facility. Similar to the first simulation, the 154 atom data arrays are ordered only once at the beginning of the simulation. Then, the fragmentation ratio of 3,000 K, 6,000 K and 9,000 K datasets are measured at each step. Figure 7.4 shows the data fragmentation ratio as a function of MD steps. We see that the fragmentation ratios of all datasets rapidly rise to above 90% after 1,000 MD steps, and continue to increase to over 98% after 2,000 MD steps. Since the aluminum nanoparticles are at considerably high temperatures (far above the melting temperature of aluminum ~ 930 K) and are surrounded by oxygen atoms in the gas phase, the atoms are highly reactive and move very rapidly. This accelerates the fragmentation to proceed very quickly regardless of the temperature. In such applications, data reordering is indispensable and is required to be performed more often in order to maintain good spatial locality. 
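For reference, the diffusion coefficients quoted above are extracted from the linear slope of the MSD curves; a minimal sketch of that procedure is given below. It assumes the standard three-dimensional Einstein relation MSD = 6Dt, unwrapped atomic coordinates, and a synthetic random-walk trajectory in place of the silica or aluminum data.

```python
import numpy as np

def mean_squared_displacement(traj):
    """traj: array of shape (n_frames, n_atoms, 3) of unwrapped coordinates (Angstrom).
    Returns MSD(t) relative to the first frame, averaged over atoms."""
    disp = traj - traj[0]
    return (disp ** 2).sum(axis=2).mean(axis=1)

def diffusion_coefficient(msd, dt_fs, fit_from=0):
    """Slope of the linear MSD regime via least squares; D = slope / 6 for 3D diffusion
    (Einstein relation), converted from Angstrom^2/fs to cm^2/s."""
    t = np.arange(len(msd)) * dt_fs
    slope = np.polyfit(t[fit_from:], msd[fit_from:], 1)[0]   # Angstrom^2 per fs
    return slope / 6.0 * 1.0e-16 / 1.0e-15                   # 1 A^2/fs = 1e-16 cm^2 per 1e-15 s

# Toy trajectory: a random walk mimicking a diffusive liquid (for illustration only).
rng = np.random.default_rng(1)
traj = np.cumsum(rng.normal(0.0, 0.05, size=(3000, 500, 3)), axis=0)
msd = mean_squared_displacement(traj)
print(diffusion_coefficient(msd, dt_fs=2.0, fit_from=500))   # diffusion coefficient in cm^2/s
```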
These results indicate that fragmentation is highly system dependent, so that data reordering needs to be performed dynamically at runtime, adapting to the system's behavior.

Figure 7.4: Data fragmentation ratio during the aluminum combustion simulation over 2,000 MD steps. The dataset involves 15 million atoms on 1,024 processors.

7.2.2 Granularity-Induced Fragmentation

In addition to fragmentation induced by physical properties, parallel MD using spatial decomposition introduces computational artifacts such as spatial subsystem (or domain) boundaries, which also contribute to the fragmentation of atom data arrays. In this subsection, we study the influence of granularity (i.e. the number of atoms per compute node) on data fragmentation. Atoms near a domain-boundary surface tend to migrate among domains regardless of the physical simulation conditions. This inter-domain atomic migration triggers rearrangements of the data arrays, including the deletion of the migrated atom data from the original array, the compression of the data array after the removal of the migrated atoms, and their appending to the array in the destination domain. These newly migrated atoms in the destination domain cause data fragmentation, since they do not reside contiguously in memory.

To confirm the expected high fragmentation ratio at the domain boundaries explained above, the data fragmentation ratios of the surface cells (i.e. the outermost layer of cells that share at least one facet with the inter-domain surfaces) and the core cells (i.e. non-surface cells deep inside each domain) are measured separately. Here, we consider a 12,288-atom silica dataset initially at 3,000 K, similar to the dataset used in the first experiment of section 7.2.1 but with a reduced domain size (the dimensions are reduced by half in all three directions, so that the domain volume is one-eighth of that in section 7.2.1). This dataset consists of 1,000 cells in total (10×10×10), of which 488 cells (48.8%) are surface cells. Figure 7.5(a) clearly shows that the data fragmentation ratio of the surface cells is larger than that of the core cells. In the first 100 MD steps, the fragmentation ratios of the two groups are almost identical. Then, the fragmentation ratio of the surface cells begins to increase at a higher rate, reaching 0.53 after 3,000 MD steps, whereas the fragmentation ratio of the core cells reaches only 0.39. Thus, the atom data of the surface cells are 14% more fragmented than those of the core cells, yielding a total fragmentation ratio of 0.42 for the entire system. This result confirms that computational artifacts such as domain boundaries indeed induce additional fragmentation.

Figure 7.5: (a) Time variation of the data fragmentation ratio of 12,288-atom silica at 3,000 K over 3,000 MD steps, for surface cells, core cells, and both combined. (b) Time variation of the data fragmentation ratio of 3,000 K silica with varying granularities over 3,000 MD steps.

The domain-boundary induced fragmentation also implies that systems with smaller granularities will have more fragmentation due to their larger surface/volume ratios (i.e., larger portions of the linked-list cells are domain-boundary cells). To test this hypothesis, we perform MD simulations of the 3,000 K silica system with three different granularities (98,304, 12,288, and 1,536 atoms) over 3,000 MD steps. The dimensions of the three systems are summarized in Table 7.1, which shows that the portion of surface cells ranges from 27.1% for N = 98,304 to 78.4% for N = 1,536.
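The surface-cell fractions quoted here follow from simple counting: on a c×c×c cell grid, the outermost layer contains c³ - (c-2)³ cells. The short check below (illustrative, not part of the dissertation's code) reproduces the three percentages listed in Table 7.1.

```cpp
// Count the outer-most layer of cells on a c x c x c grid and report its fraction.
#include <cstdio>

int main() {
  const int dims[] = {5, 10, 20};  // 1,536-, 12,288-, and 98,304-atom datasets
  for (int c : dims) {
    const long total = static_cast<long>(c) * c * c;
    const long core  = static_cast<long>(c - 2) * (c - 2) * (c - 2);
    const double surfacePct = 100.0 * (total - core) / total;
    std::printf("%dx%dx%d: %ld cells, %.1f%% surface\n", c, c, c, total, surfacePct);
  }
  return 0;  // prints 78.4%, 48.8%, and 27.1%, matching Table 7.1
}
```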
Asymptotic analysis of the surface-to-volume ratio shows that it scales as O(N^(-1/3)). Figure 7.5(b) shows that datasets with smaller granularities indeed have larger fragmentation ratios. After 3,000 MD steps, the data array of the smallest-granularity (1,536-atom) dataset is 9.7% more fragmented than that of the largest-granularity (98,304-atom) dataset, and the 12,288-atom dataset is 5.2% more fragmented compared with the largest dataset. The figure also shows a large fluctuation of the fragmentation ratio for the N = 1,536 dataset, due to the poorer statistics inherent in the small system size.

Table 7.1: 3,000 K silica datasets with varying granularities. The number of linked-list cells, the portion of surface cells, and the surface-to-volume ratio of the spatial subsystem are listed for each dataset.

Number of atoms | Number of cells | System dimensions (cells) | Portion of surface cells (%) | Surface-to-volume ratio (Å⁻¹)
1,536  | 125   | 5×5×5    | 78.4 | 0.111
12,288 | 1,000 | 10×10×10 | 48.8 | 0.055
98,304 | 8,000 | 20×20×20 | 27.1 | 0.027

7.3 Performance Measurement

In the previous section, we described how physical and computational parameters of MD simulations affect the data fragmentation behavior in the memory space, using the data fragmentation ratio as a metric. In this section, we first establish a correlation between the data fragmentation ratio and the performance of the program. In a fragmented dataset, atom data are likely to reside in different memory pages, which causes a large number of DTLB misses when they are fetched. The reordering algorithm explained in section 7.2 clusters atom data that need to be fetched and computed at proximate times (e.g. atoms in the same cells) into the same page, thereby reducing DTLB misses tremendously. However, as the simulation progresses, atoms that were once ordered may move out of their original cells. It is thus expected that the increase of DTLB misses caused by data fragmentation is a major source of performance degradation in MD.

To confirm the correlation between the number of DTLB misses and the data fragmentation ratio, we perform a test using the same datasets as in the first experiment of section 7.2.1 (98,304 silica atoms with initial temperatures of 300 K, 3,000 K, and 6,000 K). In addition to the fragmentation ratio measured in section 7.2.1, we here measure the DTLB miss rate as a function of MD steps. We use the Intel VTune Performance Analyzer to monitor DTLB miss events during MD simulation on an Intel Core i7 920 2.67 GHz (Nehalem) processor. We measure the number of DTLB misses after data ordering is performed, then normalize it by the original number of DTLB misses without data ordering. We find that the initial DTLB miss rates at all temperatures are approximately 0.07 (<< 1) right after the reordering is performed; namely, the reordering reduces DTLB misses by 93%. This large improvement highlights the significant role of data reordering in reducing DTLB misses for irregular memory-access applications. We also monitor the number of DTLB misses as a function of MD time steps and observe distinct profiles at different temperatures (Figure 7.6). The DTLB miss ratio at 300 K and 3,000 K saturates at 0.08 and 0.13, respectively. In contrast, the 6,000 K simulation exhibits a continuous increase of the number of DTLB misses; the DTLB miss rate reaches 0.62 at 3,000 MD steps after the initial data reordering.

Figure 7.6: Time variation of the DTLB miss rate in the 300 K, 3,000 K and 6,000 K datasets over 3,000 MD steps.
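The data reordering referred to in this section can be pictured as gathering the per-atom arrays cell by cell into new contiguous arrays, so that atoms fetched together also share memory pages. The sketch below is an assumption-level illustration of that idea (the array names, the set of per-atom arrays, and the cell-list layout are hypothetical), not the dissertation's implementation.

```cpp
// Minimal sketch of cell-ordered data reordering (illustrative).
#include <vector>

struct Vec3 { double x, y, z; };

void reorderByCell(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
                   const std::vector<std::vector<int>>& cellList) { // atom ids per cell
  std::vector<Vec3> newPos, newVel;
  newPos.reserve(pos.size());
  newVel.reserve(vel.size());
  for (const auto& cell : cellList)   // cells visited in the chosen order
    for (int i : cell) {              // (sequential, Morton, or Hilbert)
      newPos.push_back(pos[i]);
      newVel.push_back(vel[i]);
    }
  pos.swap(newPos);                   // atom data are now contiguous per cell;
  vel.swap(newVel);                   // the cell lists must be rebuilt afterwards
}
```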
Figure 7.7 shows the relation between the data fragmentation ratio and the DTLB miss rate at the three temperatures. The DTLB miss rate remains relatively small when the fragmentation ratio is small, while it rapidly increases when the fragmentation ratio rises above 0.8. This result clearly demonstrates a strong correlation between the DTLB miss rate and the data fragmentation ratio: in MD simulations where atoms are moving extensively, the DTLB miss rate increases rapidly as the fragmentation ratio does.

Figure 7.7: Relation between the DTLB miss rate and the data fragmentation ratio in the 300 K, 3,000 K and 6,000 K datasets.

To show that DTLB misses are the source of performance deterioration during the simulation, we measure the running time for the same datasets as in the DTLB miss measurement. Figure 7.8 shows the running time of each MD step over 3,000 MD steps after ordering the data at the first step. At 6,000 K, the running time gradually increases over time, and after 3,000 MD steps the average running time per step has increased by 8%. To study how much more performance degradation occurs after the initial data ordering, we extend the execution of the 6,000 K dataset to 20,000 MD steps; the result shows an increase of the running time per step by 21%. This indicates that without data reordering, the performance continues to degrade and the running time per step continues to increase. On the other hand, at 300 K and 3,000 K the running time rapidly increases in the first 100 MD steps but then remains almost constant. These performance behaviors are akin to the fragmentation-ratio and DTLB-miss profiles in Figs. 7.2 and 7.6, respectively. These results thus confirm that data fragmentation in memory is indeed the source of performance deterioration at runtime, through the increase of DTLB misses.

In Fig. 7.8, we also observe that the 6,000 K simulation initially executes fastest compared with the other simulations (3.1% faster than 300 K and 2.5% faster than 3,000 K), which cannot be explained by the difference in the DTLB miss rate. One possible reason for this discrepancy is that higher-temperature systems have sparser atomic distributions (i.e. lower atomic number densities), so that there are fewer atom interactions within the interaction range r_c, and hence fewer floating-point operations. To test this hypothesis, we measure the atom-distance distribution for all datasets. The results show that the number of atom pairs within distance r_c in the 6,000 K system is approximately 3.0% and 1.4% smaller than in the 300 K and 3,000 K systems, respectively. As a result, the 6,000 K simulation has the least computational load, resulting in the fastest running time at the beginning of the simulation, which however is rapidly offset by the increase in DTLB misses.

Figure 7.8: Running time per MD step as a function of MD steps at temperatures 300 K, 3,000 K, and 6,000 K for 98,304 silica atoms.

In addition to the atom ordering discussed above, the ordering of cells in the memory space also affects the locality and performance of the simulations. To address this issue, we measure the performance of three different cell-ordering methods: (1) sequential ordering; (2) Hilbert-curve ordering; and (3) Morton-curve ordering (Figures 7.9(a)-(c), respectively).
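Of the two space-filling-curve orderings just listed, the Morton curve is the simpler to construct: the bits of the (x, y, z) cell coordinates are interleaved into a single key and cells are visited in ascending key order, whereas a Hilbert key requires an additional recursive coordinate transformation. The sketch below shows one common 32-bit Morton-key construction for up to 10 bits per coordinate; it is an illustration, not the code used for the measurements reported here.

```cpp
// Morton (Z-order) key for 3D cell coordinates (up to 10 bits each).
#include <cstdint>

// Spread the low 10 bits of v so that they occupy every third bit position.
static inline std::uint32_t spreadBits(std::uint32_t v) {
  v &= 0x3FF;
  v = (v | (v << 16)) & 0x030000FF;
  v = (v | (v <<  8)) & 0x0300F00F;
  v = (v | (v <<  4)) & 0x030C30C3;
  v = (v | (v <<  2)) & 0x09249249;
  return v;
}

std::uint32_t mortonKey(std::uint32_t cx, std::uint32_t cy, std::uint32_t cz) {
  return spreadBits(cx) | (spreadBits(cy) << 1) | (spreadBits(cz) << 2);
}
// Sorting cells by mortonKey(cx, cy, cz) yields the Morton-curve visit order.
```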
The performance measurements are performed on 300 K silica systems with grain sizes ranging from 12,288 to 331,776 atoms, using a reordering period of 20 MD steps, on an Intel Core i7 920 2.67 GHz (Nehalem) processor. The performance comparison of all ordering methods is shown in Table 7.2. The results do not exhibit significant improvement from the different ordering methods. For example, the running time of Hilbert-curve ordering is at most 1.76% less than that of the sequential ordering at the largest granularity (331,776 atoms), while the running time of Morton-curve ordering differs by less than 1% for all granularities. One possible reason for the insensitivity of the performance to the cell-ordering method is the small number of references across intra-node cell boundaries. For simplicity, we employ the sequential ordering in the following.

Table 7.2: Runtime results with different ordering methods for silica systems at temperature 300 K.

Number of atoms | System dimensions (cells) | Ordering method | Average running time per MD step (ms)
12,288  | 10×10×10 | Sequential | 78.48
12,288  | 10×10×10 | Morton     | 78.57
12,288  | 10×10×10 | Hilbert    | 78.45
98,304  | 20×20×20 | Sequential | 639.9
98,304  | 20×20×20 | Morton     | 640.1
98,304  | 20×20×20 | Hilbert    | 637.1
331,776 | 30×30×30 | Sequential | 2125.7
331,776 | 30×30×30 | Morton     | 2117.4
331,776 | 30×30×30 | Hilbert    | 2088.8

Figure 7.9: Schematic ordering of (a) sequential ordering, (b) 3rd-order 2D Hilbert-curve ordering, and (c) 3rd-order 2D Morton-curve ordering.

7.4 Reordering Frequency Optimization

The previous section showed a serious performance degradation of MD simulation during runtime. To minimize the performance degradation due to data fragmentation, we propose to repeat the data reordering periodically during the MD simulation. Specifically, we reorder the atom arrays every N_rp MD steps (i.e., the reordering period N_rp is the number of MD steps between successive reordering operations). Though such data reordering is beneficial and often improves overall application performance, ordering the arrays itself introduces an additional computational cost. Therefore, we here develop a dynamic data-reordering schedule based on a performance model and the runtime measurements in section 7.3.

To do so, we first introduce a model that accounts for the reordering overhead. Let t_cost be the data reordering cost (i.e. the time to reorder the arrays), N_total the total number of simulation steps, and t(n) the running time at the n-th step after an ordering. The total running time as a function of the reordering period is then written as

    τ_total(N_rp) = ⌊N_total/N_rp⌋ [ Σ_{n=1}^{N_rp} t(n) + t_cost ] + Σ_{n=1}^{N_total mod N_rp} t(n).    (7.2)

The optimal reordering period is then determined as the one that minimizes the total running time:

    N_rp* = argmin_{N_rp} τ_total(N_rp).    (7.3)

To find the ordering-cost parameters in Eq. (7.2), we measure the reordering cost for different grain sizes. The results are summarized in Table 7.3.

Table 7.3: Reordering cost of 3,000 K silica datasets with varying granularities.

Number of atoms | Number of cells | System dimensions (cells) | Reordering cost (ms)
12,288  | 1,000  | 10×10×10 | 0.66±0.02
98,304  | 8,000  | 20×20×20 | 8.41±0.07
786,432 | 64,000 | 40×40×40 | 65.8±0.10

Figure 7.10 shows the total running time for 3,000 MD steps as a function of the reordering period, estimated from Eq. (7.2) with the measured reordering costs in Table 7.3. We obtain the optimal reordering period (which minimizes the total running time) as 69, 5, and 3 steps for the 300 K, 3,000 K and 6,000 K datasets, respectively (see the arrows in Fig. 7.10).
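Equations (7.2) and (7.3) translate directly into a small search over candidate periods: given the profiled per-step times t(n) measured after a reordering and the measured reordering cost, evaluate the total-time model for each period and keep the minimizer. The sketch below is a straightforward rendering of that procedure under the reconstruction of Eq. (7.2) above; the variable names are illustrative. Applied to profiles and reordering costs such as those in Table 7.3, a search of this form is what yields period choices such as 69, 5, and 3 steps.

```cpp
// Evaluate Eq. (7.2) for a candidate period and pick the minimizer, Eq. (7.3).
#include <limits>
#include <vector>

// t[n] is the profiled running time of the (n+1)-th step after a reordering.
double totalTime(const std::vector<double>& t, double tCost, int nTotal, int nRp) {
  double perPeriod = tCost;
  for (int n = 0; n < nRp; ++n) perPeriod += t[n];
  double total = static_cast<double>(nTotal / nRp) * perPeriod;  // floor(N_total/N_rp) periods
  for (int n = 0; n < nTotal % nRp; ++n) total += t[n];          // leftover steps
  return total;
}

int optimalPeriod(const std::vector<double>& t, double tCost, int nTotal) {
  int best = 1;
  double bestTime = std::numeric_limits<double>::max();
  for (int nRp = 1; nRp <= static_cast<int>(t.size()); ++nRp) {  // candidates limited by profile length
    const double tau = totalTime(t, tCost, nTotal, nRp);
    if (tau < bestTime) { bestTime = tau; best = nRp; }
  }
  return best;
}
```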
With the optimally scheduled reordering thus determined, the overall performance is estimated to improve by factors of 1.35, 1.36 and 1.28 at 300 K, 3,000 K and 6,000 K, respectively. To verify this model prediction, we measure the running time of MD simulations that implement data reordering with the optimal reordering schedule. Here, MD simulations of silica containing 98,304 atoms at temperatures 300 K, 3,000 K, and 6,000 K are executed using periodic reordering every 69, 5, and 3 steps, respectively. The measured speedups in the actual executions are 1.27, 1.25, and 1.17, respectively, with 6.44%, 7.59%, and 9.95% error from the estimated values. The running time of the system without ordering, the optimized running time estimated from the model, and the measured running time with optimally scheduled reordering are compared in Figure 7.11. The figure shows a substantial effect of runtime data reordering on the performance of MD simulations. It should be noted that state-of-the-art MD simulations are run for up to 10¹² time steps [114], for which the 10³-step pre-profiling run for constructing the performance model, Eq. (7.2), is negligible.

Figure 7.10: Total running time after 3,000 MD steps as a function of the reordering period. The optimal periods for the 300 K, 3,000 K and 6,000 K datasets are 69, 5 and 3 steps, respectively.

Figure 7.11: Comparison of the total running time of silica MD over 3,000 steps achieved by the model prediction and by the actual measurement of periodic reordering, compared with that of the original code without ordering. (All tests are conducted with 98,304 atoms per core.)

7.5 Summary

In summary, we have developed an adaptive data-reordering schedule for molecular dynamics simulations on parallel computers. Our quantitative analysis has identified physical/computational conditions, such as a high initial temperature and a small granularity, that considerably accelerate data fragmentation in the memory space, thereby causing continuous performance degradation throughout the simulations at runtime. Our profiling results have revealed that the degree of data fragmentation correlates with the number of DTLB misses and have identified the former as a major cause of performance decrease. Based on the data fragmentation analysis and a simple performance model, we have developed an optimal data-reordering schedule, thereby achieving a speedup of 1.36× for the 3,000 K silica simulation. This study has thus proposed a practical solution to the dynamic data fragmentation problem that plagues many scientific and engineering applications.

Chapter 8
Conclusions

This dissertation research has developed a metascalable hybrid message-passing and multithreading algorithm for n-tuple MD simulation, thereby achieving: (1) a metascalable hybrid parallelization framework on multicore architectures; and (2) a computation-pattern algebraic framework for designing scalable algorithms for general n-tuple computation, along with a proof of its optimality in a systematic and mathematically rigorous manner.

We have developed and thoroughly analyzed algorithms for hybrid MPI/OpenMP parallelization of n-tuple MD simulation, which are scalable on large multicore clusters. The CVAS and BFAS algorithms were proposed, combining the benefits of fine-grain dynamic load balancing and minimal memory-footprint threading. Theoretical study revealed decent asymptotic memory efficiency for both algorithms, thereby reducing memory consumption by 75% compared to a naïve-threading algorithm.
Furthermore, performance benchmarks have confirmed the higher performance of the hybrid MD algorithm over a traditional algorithm on large multicore clusters, where a 2.58× speedup of the hybrid algorithm over the traditional algorithm is observed on 32,768 nodes of BlueGene/P.

We have addressed issues of concurrency control by investigating the performance characteristics of HTM on the BlueGene/Q computer in comparison with conventional concurrency control mechanisms. Benchmark tests, along with overhead-cost and scalability analyses, quantified the relative performance advantages of HTM over other mechanisms. We found that the bookkeeping cost of HTM is high but that the rollback cost is low. We have proposed transaction fusion and spatially compact scheduling techniques to reduce the overhead of HTM with minimal programming. A strong-scalability benchmark showed that the fused HTM has the shortest runtime among various concurrency control mechanisms without extra memory. Based on the performance characterization, we derived a decision tree in the concurrency-control design space for multithreading applications.

We have developed a computation-pattern algebraic framework to mathematically formulate general n-tuple computation for reactive MD. Based on the translation/reflection-invariant properties of computation patterns within this framework, we designed an SC algorithm for cell-based parallel MD. Theoretical analysis quantified the compact n-tuple search space and small communication cost of SC-MD for arbitrary n, which reduce to those of the best pair-computation approaches (e.g. the eighth-shell method) for n = 2. Benchmark tests showed that SC-MD outperforms our production MD code at the finest grain, with 9.7- and 5.1-fold speedups on Intel-Xeon and BlueGene/Q clusters. SC-MD also exhibited excellent strong scalability.

We have also analyzed the computational and data-access patterns of MD, which led to the development of a performance prediction model for short-range pair-wise force computations in MD simulations. The analysis and performance model provide a fundamental understanding of computation patterns and of the optimality of certain parameters in MD simulations, thus allowing scientists to determine the optimal cell dimension in a linked-list cell method. The model accurately estimated the number of operations during the simulations, with a maximum error of 10.6% compared to actual measurements. Analysis and benchmarking of the model revealed that the optimal cell dimension minimizing the computation time is determined by a trade-off between decreasing search space and increasing linked-list cell accesses for smaller cells.

We have developed an optimal data-reordering schedule for runtime memory-access optimization to address the performance deterioration during the execution of MD. Analysis of the memory-access penalty during MD simulations showed that the performance improvement from computation and data reordering degrades gradually as data translation lookaside buffer misses increase. We have also found correlations of the performance degradation with physical properties such as the simulated temperature, as well as with computational parameters such as the spatial-decomposition granularity. Based on a performance model and pre-profiling of data fragmentation behaviors, we have developed an optimal runtime data-reordering schedule, thereby achieving speedups of 1.35, 1.36 and 1.28 for MD simulations of silica at temperatures 300 K, 3,000 K and 6,000 K, respectively.
The contributions of this dissertation research, namely a metascalable hybrid message-passing and multithreading parallel framework and a computation-pattern algebraic framework for general n-tuple computation, are expected to provide a generic foundation for a broad range of scientific applications on future extreme-scale clusters.

References

[1] D. Geer, "Chip makers turn to multicore processors," Computer, vol. 38, pp. 11-13, 2005.
[2] S. H. Fuller and L. I. Millett, "Computing performance: game over or next level?," Computer, vol. 44, pp. 31-38, 2011.
[3] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, "ExaScale computing study: Technology challenges in achieving exascale systems," University of Notre Dame, 2008.
[4] A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov, "The impact of multicore on math software," in International Conference on Applied Parallel Computing: State of the Art in Scientific Computing, Sweden, 2007, pp. 1-10.
[5] J. Dongarra, D. Gannon, G. Fox, and K. Kennedy, "The impact of multicore on computational science software," CTWatch Quarterly, vol. 3, February 2007.
[6] K. Nomura, H. Dursun, R. Seymour, W. Wang, R. K. Kalia, A. Nakano, P. Vashishta, F. Shimojo, and L. H. Yang, "A metascalable computing framework for large spatiotemporal-scale atomistic simulations," in International Parallel and Distributed Processing Symposium, 2009.
[7] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. Yelick, "The landscape of parallel computing research: A view from Berkeley," University of California, Berkeley, 2006.
[8] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids. Oxford: Clarendon Press, 1987.
[9] J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale, "NAMD: Biomolecular simulations on thousands of processors," in Supercomputing, Los Alamitos, CA, 2002.
[10] F. H. Streitz, J. N. Glosli, M. V. Patel, B. Chan, R. K. Yates, B. R. de Supinski, J. Sexton, and J. A. Gunnels, "Simulating solidification in metals at high pressure: The drive to petascale computing," SciDAC 2006: Scientific Discovery Through Advanced Computing, vol. 46, pp. 254-267, 2006.
[11] K. J. Bowers, R. O. Dror, and D. E. Shaw, "Zonal methods for the parallel execution of range-limited N-body simulations," Journal of Computational Physics, vol. 221, pp. 303-329, Jan 20 2007.
[12] B. Hess, C. Kutzner, D. van der Spoel, and E. Lindahl, "GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation," Journal of Chemical Theory and Computation, vol. 4, pp. 435-447, 2008.
[13] A. Kushima, X. Lin, J. Li, J. Eapen, J. C. Mauro, X. Qian, P. Diep, and S. Yip, "Computing the viscosity of supercooled liquids," The Journal of Chemical Physics, vol. 130, p. 224504, Jun 14 2009.
[14] S. Ohmura, F. Shimojo, R. K. Kalia, M. Kunaseth, A. Nakano, and P. Vashishta, "Reaction of aluminum clusters with water," Journal of Chemical Physics, vol. 134, 2011.
[15] H. Dursun, K. Nomura, L. Peng, R. Seymour, W. Wang, R. K. Kalia, A. Nakano, and P. Vashishta, "A multilevel parallelization framework for high-order stencil computations," in Proceedings of the International European Conference on Parallel and Distributed Computing (Euro-Par 2009), 2009.
[16] T. C. Germann, K. Kadau, and P. S. Lomdahl, "25 Tflop/s multibillion-atom molecular dynamics simulations and visualization/analysis on BlueGene/L," Proceedings of Supercomputing (SC05), 2005.
Lomdahl, "25 Tflop/s multibillion-atom molecular dynamics simulations and visualization/analysis on BlueGene/L," Proceedings of Supercomputing (SC05), 2005. [17] D. York and W. Yang, "The fast Fourier Poisson method for calculating Ewald sums," The Journal of Chemical Physics, vol. 101, pp. 3298-3300, 1994. [18] R. Hockney and J. Eastwood, Computer simulation using particles. New York: McGraw- Hill, 1981. [19] T. Darden, D. York, and L. Pedersen, "Particle mesh Ewald: An N log(N) method for Ewald sums in large systems," The Journal of Chemical Physics, vol. 98, pp. 10089- 10092, 1993. [20] L. Greengard and V. Rokhlin, "A fast algorithm for particle simulations," Journal of Computational Physics, vol. 73, pp. 325-348, 1987. [21] P. Gonnet, "A simple algorithm to accelerate the computation of non-bonded interactions in cell-based molecular dynamics simulations," Journal of Computational Chemistry, vol. 28, pp. 570-573, Jan 30 2007. [22] J. N. Glosli, D. F. Richards, K. J. Caspersen, R. E. Rudd, J. A. Gunnels, and F. H. Streitz, "Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability," in Supercomputing, Reno, Nevada, 2007, pp. 1-11. 174 [23] D. E. Shaw, M. M. Deneroff, R. O. Dror, J. S. Kuskin, R. H. Larson, J. K. Salmon, C. Young, B. Batson, K. J. Bowers, J. C. Chao, M. P. Eastwood, J. Gagliardo, J. P. Grossman, C. R. Ho, D. J. Ierardi, I. Kolossvary, J. L. Klepeis, T. Layman, C. Mcleavey, M. A. Moraes, R. Mueller, E. C. Priest, Y. B. Shan, J. Spengler, M. Theobald, B. Towles, and S. C. Wang, "Anton, a special-purpose machine for molecular dynamics simulation," Communications of the Acm, vol. 51, pp. 91-97, Jul 2008. [24] R. Carter and G. N. Denton, "Parallel computation for simulation users and simulation developers," presented at the Pipeline Simulation Interest Group Annual Meeting, Napa, CA, 2011. [25] J. Sanders and E. Kandrot, CUDA by Example: An introduction to general-purpose GPU programming: Addison-Wesley Professional, 2010. [26] J. Hyeran, X. Yinglong, and V. K. Prasanna, "Parallel Exact Inference on a CPU-GPGPU Heterogenous System," in Parallel Processing (ICPP), 2010 39th International Conference on, 2010, pp. 61-70. [27] S. Borkar, "Thousand core chips--A technology perspective," in Design Automation Conference 2007. [28] H. Wei, K. Skadron, S. Gurumurthi, R. J. Ribando, and M. R. Stan, "Exploring the thermal impact on manycore processor performance," in Semiconductor Thermal Measurement and Management Symposium, 2010. SEMI-THERM 2010. 26th Annual IEEE, 2010, pp. 191-197. [29] M. Geveler, D. Ribbrock, D. G√∂ddeke, and S. Turek, "Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore 175 Processors," in Facing the Multicore-Challenge. vol. 6310, R. Keller, D. Kramer, and J.- P. Weiss, Eds., ed: Springer Berlin Heidelberg, 2011, pp. 92-104. [30] A. El-Moursy, A. El-Mahdy, and H. El-Shishiny, "An efficient in-place 3D transpose for multicore processors with software managed memory hierarchy," presented at the Proceedings of the 1st international forum on Next-generation multicore/manycore technologies, Cairo, Egypt, 2008. [31] K. J. Barker, K. Davis, A. Hoisie, D. J. Kerbyson, M. Lang, S. Pakin, and J. C. Sancho, "Entering the petaflop era: the architecture and performance of Roadrunner," presented at the Proceedings of the 2008 ACM/IEEE conference on Supercomputing, Austin, Texas, 2008. [32] F. Broquedis, N. Furmento, B. Goglin, R. Namyst, and P.-A. 
Wacrenier, "Dynamic Task and Data Placement over NUMA Architectures: An OpenMP Runtime Perspective." vol. 5568, M. Muller, B. de Supinski, and B. Chapman, Eds., ed: Springer Berlin / Heidelberg, 2009, pp. 79-92. [33] D. Kang, H. Park, and J. Choi, "A Novel Memory-Aware CPU Allocation Policy for Multicore NUMA Architecture Reliable and Autonomous Computational Science," S. Y. Shin, R. Gantenbein, T.-W. Kuo, and J. Hong, Eds., ed: Springer Basel, 2010, pp. 41-60. [34] S. R. Alam, P. K. Agarwal, S. S. Hampton, H. Ong, and J. S. Vetter, "Impact of multicores on large-scale molecular dynamics simulations," in International Parallel and Distributed Processing Symposium, Miami, Florida USA, 2008. [35] L. Peng, M. Kunaseth, H. Dursun, K. Nomura, W. Wang, R. K. Kalia, A. Nakano, and P. Vashishta, "A scalable hierarchical parallelization framework for molecular dynamics 176 simulation on multicore clusters," in International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA, 2009. [36] M. J. Chorley, D. W. Walker, and M. F. Guest, "Hybrid message-passing and shared- memory programming in a molecular dynamics application on multicore clusters," International Journal of High Performance Computing Applications, vol. 23, pp. 196- 211, Aug 2009. [37] R. Rabenseifner, G. Hager, and G. Jost, "Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes," Proceedings of the Parallel, Distributed and Network-Based Processing, pp. 427-436, 2009. [38] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. Schulten, "Accelerating molecular modeling applications with graphics processors," Journal of Computational Chemistry, vol. 28, pp. 2618-2640, 2007. [39] J. A. Anderson, C. D. Lorenz, and A. Travesset, "General purpose molecular dynamics simulations fully implemented on graphics processing units," Journal of Computational Physics, vol. 227, pp. 5342-5359, 2008. [40] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," SIGARCH Comput. Archit. News, vol. 38, pp. 347-358, 2010. [41] W. M. Brown, P. Wang, S. J. Plimpton, and A. N. Tharrington, "Implementing molecular dynamics on hybrid high performance computers ‚Äì short range forces," Computer Physics Communications, vol. 182, pp. 898-911, 2011. [42] R. Ferrer, J. Planas, P. Bellens, A. Duran, M. Gonzalez, X. Martorell, R. Badia, E. Ayguade, and J. Labarta, "Optimizing the Exploitation of Multicore Processors and 177 GPUs with OpenMP and OpenCL Languages and Compilers for Parallel Computing." vol. 6548, K. Cooper, J. Mellor-Crummey, and V. Sarkar, Eds., ed: Springer Berlin / Heidelberg, 2011, pp. 215-229. [43] L. Peng, M. Kunaseth, H. Dursun, K.-i. Nomura, W. Wang, R. Kalia, A. Nakano, and P. Vashishta, "Exploiting hierarchical parallelisms for molecular dynamics simulation on multicore clusters," The Journal of Supercomputing, vol. 57, pp. 20-33, 2011. [44] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to parallel computing, second edition: Pearson Education Limited, 2003. [45] (1997, October 20). POSIX Thread Specification, Version 2. Available: http://pubs.opengroup.org/onlinepubs/007908799/xsh/threads.html [46] B. Chapman, G. Jost, and R. v. d. Pas, Using OpenMP: Portable Shared Memory Parallel Programming. Cambridge, MA: MIT Press, 2008. [47] J. J. Wu, C. L. Chiang, and N. W. 
Lin, "A hybrid multithreading/message-passing approach for solving irregular problems on SMP clusters," International Conference on Parallel and Distributed Processing Techniques and Applications, Vols I-V, Proceedings, pp. 1826-1832, 1999. [48] F. Wolf and B. Mohr, "Automatic performance analysis of hybrid MPI/OpenMP applications," Journal of Systems Architecture, vol. 49, pp. 421-439, 2003. [49] E. de Araujo Macedo, A. C. Magalhaes Alves de Melo, G. H. Pfitscher, and A. Boukerche, "Hybrid MPI/OpenMP Strategy for Biological Multiple Sequence Alignment with DIALIGN-TX in Heterogeneous Multicore Clusters," in International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011, pp. 418-425. 178 [50] C. Osthoff, P. Grunmann, F. Boito, R. Kassick, L. Pilla, P. Navaux, C. Schepke, J. Panetta, N. Maillard, P. L. Silva Dias, and R. Walko, "Improving Performance on Atmospheric Models through a Hybrid OpenMP/MPI Implementation," in International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2011, pp. 69-74. [51] H. Dursun, K.-i. Nomura, W. Wang, M. Kunaseth, L. Peng, R. Seymour, R. K. Kalia, A. Nakano, and P. Vashishta, "In-core optimization of high-order stencil computations," Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 533-538, 2009. [52] P. Balaji, D. Buntinas, D. Goodell, W. Gropp, and R. Thakur, "Fine-grained multithreading support for hybrid threaded MPI programming," International Journal of High Performance Computing Applications, vol. 24, pp. 49-57, Feb 2010. [53] A. Rahman, "Correlations in the motion of atoms in liquid argon," Physical Review, vol. 136, pp. A405-A411, 1964. [54] J. A. Mccammon, B. R. Gelin, and M. Karplus, "Dynamics of folded proteins," Nature, vol. 267, pp. 585-590, 1977. [55] F. H. Stillinger and T. A. Weber, "Dynamics of structural transitions in liquids," Physical Review A, vol. 28, pp. 2408-2416, 1983. [56] P. Vashishta, R. K. Kalia, J. P. Rino, and I. Ebbsjo, "Interaction potential for SiO 2 - a molecular-dynamics study of structural correlations," Physical Review B, vol. 41, pp. 12197-12209, JUN 15 1990. [57] S. B. Sinnott and D. W. Brenner, "Three decades of many-body potentials in materials research," Mrs Bulletin, vol. 37, pp. 469-473, May 2012. 179 [58] S. J. Plimpton and A. P. Thompson, "Computational aspects of many-body potentials," Mrs Bulletin, vol. 37, pp. 513-521, May 2012. [59] A. C. T. van Duin, S. Dasgupta, F. Lorant, and W. A. Goddard, "ReaxFF: a reactive force field for hydrocarbons," Journal of Physical Chemistry A, vol. 105, pp. 9396-9409, OCT 18 2001. [60] A. Nakano, R. K. Kalia, K. Nomura, A. Sharma, P. Vashishta, F. Shimojo, A. C. T. van Duin, W. A. Goddard, R. Biswas, D. Srivastava, and L. H. Yang, "De novo ultrascale atomistic simulations on high-end parallel supercomputers," International Journal of High Performance Computing Applications, vol. 22, pp. 113-128, 2008. [61] K. Nomura, R. K. Kalia, A. Nakano, and P. Vashishta, "A scalable parallel algorithm for large-scale reactive force-field molecular dynamics simulations," Computer Physics Communications, vol. 178, pp. 73-87, 2008. [62] P. S. Lomdahl, P. Tamayo, N. Gronbech-Jensen, and D. M. Beazley, "50 Gflops molecular dynamics on the CM-5," Proceedings of Supercomputing (SC93), ACM/IEEE, 1993. [63] T. Narumi, R. Susukita, T. Koishi, K. Yasuoka, H. Furusawa, A. Kawai, and T. 
Ebisuzaki, "1.34 Tflops molecular dynamics simulation for NaCl with a special-purpose computer: MDM," Proceedings of Supercomputing (SC00), ACM/IEEE, 2000. [64] A. Nakano, R. K. Kalia, P. Vashishta, T. J. Campbell, S. Ogata, F. Shimojo, and S. Saini, "Scalable atomistic simulation algorithms for materials research," Proceedings of Supercomputing 2001, ACM/IEEE, 2001. [65] J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale, "NAMD: biomolecular simulation on thousands of processors," Proceedings of Supercomputing (SC02), ACM, 2002. 180 [66] T. C. Germann, K. Kadau, and P. S. Lomdahl, "25 Tflop/s multibillion-atom molecular dynamics simulations and visualization/analysis on BlueGene/L," Proceedings of Supercomputing (SC05), ACM/IEEE, 2005. [67] J. N. Glosli, D. F. Richards, K. J. Caspersen, R. E. Rudd, J. A. Gunnels, and F. H. Streitz, "Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability," Proceedings of Supercomputing (SC07), ACM, 2007. [68] D. E. Shaw, R. O. Dror, J. K. Salmon, J. P. Grossman, K. M. Mackenzie, J. A. Bank, C. Young, M. M. Deneroff, B. Batson, K. J. Bowers, E. Chow, M. P. Eastwood, D. J. Ierardi, J. L. Klepeis, J. S. Kuskin, R. H. Larson, K. Lindorff-Larsen, P. Maragakis, M. A. Moraes, S. Piana, Y. Shan, and B. Towles, "Millisecond-scale molecular dynamics simulations on Anton," Proceedings of Supercomputing (SC09), IEEE, 2009. [69] D. C. Rapaport, "Large-scale molecular-dynamics simulation using vector and parallel computers," Computer Physics Reports, vol. 9, pp. 1-53, Dec 1988. [70] A. Nakano, R. K. Kalia, and P. Vashishta, "Multiresolution molecular-dynamics algorithm for realistic materials modeling on parallel computers," Computer Physics Communications, vol. 83, pp. 197-214, DEC 1994. [71] S. Ogata, T. J. Campbell, R. K. Kalia, A. Nakano, P. Vashishta, and S. Vemparala, "Scalable and portable implementation of the fast multipole method on parallel computers," Computer Physics Communications, vol. 153, pp. 445-461, JUL 1 2003. [72] S. Plimpton, "Fast parallel algorithms for short-range molecular dynamics," Journal of Computational Physics, vol. 117, pp. 1-19, Mar 1 1995. [73] L. Kale, R. Skeel, M. Bhandarkar, R. Brunner, A. Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan, and K. Schulten, "NAMD2: greater scalability for parallel 181 molecular dynamics," Journal of Computational Physics, vol. 151, pp. 283-312, MAY 1 1999. [74] D. E. Shaw, "A fast, scalable method for the parallel evaluation of distance-limited pairwise particle interactions," Journal of Computational Chemistry, vol. 26, pp. 1318- 1328, Oct 2005. [75] M. Snir, "A note on N-body computations with cutoffs," Theory of Computing Systems, vol. 37, pp. 295-318, Mar-Apr 2004. [76] M. Kunaseth, D. Richards, J. Glosli, R. Kalia, A. Nakano, and P. Vashishta, "Analysis of scalable data-privatization threading algorithms for hybrid MPI/OpenMP parallelization of molecular dynamics," The Journal of Supercomputing, pp. 1-25, 2013/04/05 2013. [77] D. F. Richards, J. N. Glosli, B. Chan, M. R. Dorr, E. W. Draeger, J.-L. Fattebert, W. D. Krauss, T. Spelce, F. H. Streitz, M. P. Surh, and J. A. Gunnels, "Beyond homogeneous decomposition: scaling long-range forces on massively parallel systems," in Supercomputing, Portland, Oregon, 2009, pp. 1-12. [78] J. L. Fattebert, D. F. Richards, and J. N. Glosli, "Dynamic load balancing algorithm for molecular dynamics based on Voronoi cells domain decompositions," Computer Physics Communications, vol. 183, pp. 
[79] J. Kleinberg and E. Tardos, Algorithm Design, 2 ed.: Pearson Education, Inc., 2005.
[80] M. Kunaseth, D. F. Richards, J. N. Glosli, R. K. Kalia, A. Nakano, and P. Vashishta, "Scalable data-privatization threading for hybrid MPI/OpenMP parallelization of molecular dynamics," in International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, 2011.
[81] U. V. Catalyurek, E. G. Boman, K. D. Devine, D. Bozdag, R. Heaphy, and L. A. Riesen, "Hypergraph-based dynamic load balancing for adaptive scientific computations," in International Parallel and Distributed Processing Symposium, 2007, pp. 1-11.
[82] C. Mei, Y. Sun, G. Zheng, E. J. Bohm, L. V. Kale, J. C. Phillips, and C. Harrison, "Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime," in Proceedings of Supercomputing (SC11), Seattle, Washington, 2011.
[83] A. Wang, M. Gaudet, P. Wu, J. N. Amaral, M. Ohmacht, C. Barton, R. Silvera, and M. Michael, "Evaluation of Blue Gene/Q hardware support for transactional memories," in Proceedings of Conference on Parallel Architectures and Compilation Techniques, Minneapolis, Minnesota, 2012.
[84] M. Schindewolf, B. Bihari, J. Gyllenhaal, M. Schulz, A. Wang, and W. Karl, "What Scientific Applications can Benefit from Hardware Transactional Memory?," in Proceedings of Supercomputing (SC12), Salt Lake City, Utah, 2012.
[85] M. Herlihy and J. E. B. Moss, "Transactional memory - architectural support for lock-free data-structures," Proceedings of International Symposium on Computer Architecture, pp. 289-300, 1993.
[86] C. Cascaval, C. Blundell, M. Michael, H. W. Cain, P. Wu, S. Chiras, and S. Chatterjee, "Software transactional memory: Why is it only a research toy?," Communications of the ACM, vol. 51, pp. 40-46, 2008.
[87] N. Njoroge, J. Casper, S. Wee, Y. Teslyar, D. Ge, C. Kozyrakis, and K. Olukotun, "ATLAS: A chip-multiprocessor with transactional memory support," Design, Automation & Test in Europe Conference & Exhibition, vol. 1-3, pp. 3-8, 2007.
[88] S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer, and M. Tremblay, "Rock: A high-performance SPARC CMT processor," IEEE Micro, vol. 29, pp. 6-16, 2009.
[89] J. T. Wamhoff, T. Riegel, C. Fetzer, and P. Felber, "RobuSTM: A robust software transactional memory," Stabilization, Safety, and Security of Distributed Systems, vol. 6366, pp. 388-404, 2010.
[90] P. Felber, C. Fetzer, P. Marlier, and T. Riegel, "Time-based software transactional memory," IEEE Transactions on Parallel and Distributed Systems, vol. 21, pp. 1793-1807, 2010.
[91] T. Riegel, M. Nowack, C. Fetzer, P. Marlier, and P. Felber, "Optimizing hybrid transactional memory: The importance of nonspeculative operations," Proceedings of Symposium on Parallelism in Algorithms and Architectures, pp. 53-64, 2011.
[92] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, G. L. T. Chiu, P. A. Boyle, N. H. Christ, and C. Kim, "The IBM Blue Gene/Q compute chip," IEEE Micro, vol. 32, pp. 48-60, 2012.
[93] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. v. Eicken, "LogP: towards a realistic model of parallel computation," in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, USA, 1993, pp. 1-12.
[94] L. M. Liebrock and S. P. Goudy, "Methodology for modelling SPMD hybrid parallel computation," Concurrency and Computation: Practice & Experience, vol. 20, pp. 903-940, Jun 10 2008.
[95] X. Wu and V. Taylor, "Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore cluster systems," in IEEE International Conference on Computational Science and Engineering, Dalian, China, 2011.
[96] (2011). Top 500 supercomputing site: June 2011. Available: http://www.top500.org
[97] M. Kunaseth, R. K. Kalia, A. Nakano, and P. Vashishta, "Performance Modeling, Analysis, and Optimization of Cell-List Based Molecular Dynamics," in International Conference on Scientific Computing, 2010.
[98] A. Nakano, R. K. Kalia, P. Vashishta, T. J. Campbell, S. Ogata, F. Shimojo, and S. Saini, "Scalable atomistic simulation algorithms for materials research," Proceedings of Supercomputing (SC01), ACM/IEEE, 2001.
[99] (2011, October 18). High Performance Computing and Communications Facility, University of Southern California. Available: http://www.usc.edu/hpcc
[100] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, G. L. T. Chiu, P. A. Boyle, N. H. Christ, and C. Kim, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, vol. 32, pp. 48-60, 2012.
[101] W. Mattson and B. M. Rice, "Near-neighbor calculations using a modified cell-linked list method," Computer Physics Communications, vol. 119, pp. 135-148, Jun 1999.
[102] Z. H. Yao, H. S. Wang, G. R. Liu, and M. Cheng, "Improved neighbor list algorithm in molecular simulations using cell decomposition and data sorting method," Computer Physics Communications, vol. 161, pp. 27-35, Aug 1 2004.
[103] C. I. Rodrigues, D. J. Hardy, J. E. Stone, K. Schulten, and W.-M. W. Hwu, "GPU acceleration of cutoff pair potentials for molecular modeling applications," Proceedings of the 5th Conference on Computing Frontiers, pp. 273-282, 2008.
[104] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection: John Wiley & Sons, Inc., 1987.
[105] P. J. Huber, "Robust estimation of a location parameter," Annals of Mathematical Statistics, vol. 35, pp. 73-101, 1964.
[106] H. Han and C. W. Tseng, "Exploiting locality for irregular scientific codes," IEEE Transactions on Parallel and Distributed Systems, vol. 17, pp. 606-618, Jul 2006.
[107] Y. C. Hu, A. Cox, and W. Zwaenepoel, "Improving fine-grained irregular shared-memory benchmarks by data reordering," presented at the Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, Dallas, Texas, United States, 2000.
[108] J. P. Singh, J. L. Hennessy, and A. Gupta, "Implications of hierarchical N-body methods for multiprocessor architectures," ACM Trans. Comput. Syst., vol. 13, pp. 141-202, 1995.
[109] J. Mellor-Crummey, D. Whalley, and K. Kennedy, "Improving memory hierarchy performance for irregular applications," in ACM International Conference on Supercomputing, 1999.
[110] J. Mellor-Crummey, D. Whalley, and K. Kennedy, "Improving memory hierarchy performance for irregular applications using data and computation reorderings," International Journal of Parallel Programming, vol. 29, pp. 217-247, Jun 2001.
[111] Y. C. Chen, K. Nomura, R. K. Kalia, A. Nakano, and P. Vashishta, "Void deformation and breakup in shearing silica glass," Physical Review Letters, vol. 103, p. 035501, Jul 17 2009.
[112] W. Q. Wang, R. Clark, A. Nakano, R. K. Kalia, and P. Vashishta, "Fast reaction mechanism of a core(Al)-shell(Al2O3) nanoparticle in oxygen," Applied Physics Letters, vol. 95, Dec 28 2009.
Vashishta, "Fast reaction mechanism of a core(Al)-shell (Al2O3) nanoparticle in oxygen," Applied Physics Letters, vol. 95, Dec 28 2009. [113] A. Kushima, X. Lin, J. Li, J. Eapen, J. C. Mauro, X. F. Qian, P. Diep, and S. Yip, "Computing the viscosity of supercooled liquids," Journal of Chemical Physics, vol. 130, p. 224504, Jun 14 2009. [114] D. E. Shaw, R. O. Dror, J. K. Salmon, J. P. Grossman, K. M. Mackenzie, J. A. Bank, C. Young, M. M. Deneroff, B. Batson, K. J. Bowers, E. Chow, M. P. Eastwood, D. J. Ierardi, J. L. Klepeis, J. S. Kuskin, R. H. Larson, K. Lindorff-Larsen, P. Maragakis, M. A. Moraes, S. Piana, Y. Shan, and B. Towles, "Millisecond-scale molecular dynamics simulations on Anton," in Supercomputing, Portland, Oregon, 2009, pp. 1-11.