SIMULATION AND MACHINE LEARNING AT EXASCALE
by
Kuang Liu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2022
Copyright 2022 Kuang Liu
Table of Contents
List of Tables ...................................................................................................................................v
List of Figures ................................................................................................................................ vi
Abstract ............................................................................................................................................x
Introduction ......................................................................................................................................1
Chapter 1 Acceleration of Dynamic n-tuple Computations in Many-Body Molecular
Dynamics .....................................................................................................................................4
1.1 Introduction ......................................................................................................................4
1.2 Physical Model.................................................................................................................6
1.2.1 Interatomic potential ....................................................................................................6
1.2.2 MD Simulation.............................................................................................................8
1.2.3 Linked-List Cell Method............................................................................................10
1.3 GPGPU Implementation ................................................................................................11
1.3.1 Two-Body Interaction List Generation ......................................................................13
1.3.2 Three-Body Interaction List Generation ....................................................................16
1.4 Benchmark and Validation ............................................................................................19
1.4.1 GPU Implementation versus the Original MD Code .................................................19
1.4.2 GPU Implementation versus the Baseline .................................................................21
1.4.3 Dynamic Quadruplet Computation in ReaxFF ..........................................................25
1.5 Summary ........................................................................................................................28
Chapter 2 Shift-Collapse Acceleration of Generalized Polarizable Reactive Molecular
Dynamics for Machine Learning-Assisted Computational Synthesis of Layered Material ......30
2.1 Introduction ....................................................................................................................30
2.2 Application and Algorithmic Innovations .....................................................................32
2.2.1 Generalized Polarizable Reactive Force Field (ReaxPQ+) ........................................32
2.2.2 Extended-Lagrangian Acceleration of ReaxPQ+.......................................................36
2.2.3 Renormalized Shift-Collapse Acceleration of ReaxPQ+ ...........................................37
2.3 Parallel Implementation and Performance Optimizations .............................................40
2.3.1 Performance Improvement by Algorithm Accelerations ...........................................42
2.3.2 Improving Thread Scalability by Round-Robin Data Privatization ..........................43
2.3.3 Promoting Data Parallelism by Data Alignment .......................................................44
2.3.4 Additional Code Transformations and Performance Improvement ...........................46
2.4 Scalability and Time-to-Solution ...................................................................................47
2.4.1 Experimental Platform ...............................................................................................47
2.4.2 Weak and Strong Scaling on Theta ............................................................................48
2.4.3 Time-to-Solution Improvement .................................................................................49
2.5 Machine-Learning Guided Computational Synthesis of Atomically Thin Layered
Materials ....................................................................................................................................50
2.6 Summary ........................................................................................................................52
Chapter 3 Machine Learning in Material Simulations ...............................................................53
3.1 Introduction ....................................................................................................................53
3.1.1 Dielectric Polymer Genome .......................................................................................53
3.1.2 Layered Material Phases ............................................................................................54
3.2 Method ...........................................................................................................................56
3.2.1 Multiple Layer Perceptron .........................................................................................56
3.2.2 Graph Neural Networks for Learning Layered Material ...........................................60
3.2.3 Dataset Description ....................................................................................................60
3.2.4 Graph Neural Networks .............................................................................................61
3.3 Experiment and Analysis ...............................................................................................63
3.3.1 Input Features and Training Settings .........................................................................64
3.3.2 Predictive Performance for Layered Material Phases ................................................65
3.3.3 Visualization of Hidden Node States .........................................................................67
3.3.4 Performance for Dielectric Polymer Genome............................................................67
3.4 Summary ........................................................................................................................73
Chapter 4 Protein Modeling at Exascale ....................................................................................75
4.1 Introduction ....................................................................................................................75
4.2 Contact Map Prediction with Representation Learning .................................................77
4.3 Results and Discussion ..................................................................................................82
4.4 Future Work ...................................................................................................................85
References ......................................................................................................................................87
List of Tables
Algorithm 1: Body of the main MD simulation loop based on the velocity-Verlet algorithm. .......8
Algorithm 2: GPU implementation of the generation of three-body lists. ....................................17
Table 1.1: Achieved occupancy of kernel execution. ....................................................................24
Table 2.1: Computed dielectric constants. .....................................................................................35
Table 2.2: PQEq get_hsh procedure in RXMD. ............................................................................45
Table 2.3: Sorted polarized neighbor list and stride. .....................................................................45
Table 2.4: Packing atomic-pair distance into neighbor list............................................................46
Table 3.1: Input node and edge features. .......................................................................................64
Table 3.2: F1 scores for different input features. Higher value signifies better classification
accuracy. ................................................................................................................................65
Table 3.3: ROC-AUC for different input features. Value is in the range of (0, 1). Higher
value signifies better classification accuracy. ........................................................................66
Table 3.4: Hyperparameters used in our random forest model. .....................................................71
Table 3.5: Performance metrics of the random forest model.........................................................72
Table 4.1: Short-range contact prediction results. .........................................................................84
Table 4.2: Medium-range contact prediction results. ....................................................................84
Table 4.3: Long-range contact prediction results. .........................................................................84
List of Figures
Figure 1.1: The workflow of the main MD simulation loop: (a) main solver loop, (b)
original pipelining of atomic force computation on CPU, (c) restructured pipeline on
GPU..........................................................................................................................................9
Figure 1.2: Concurrent construction of valid interactions. ............................................................16
Figure 1.3: Wall-clock time of GPU MD and reference CPU MD for varying numbers of atoms. ..........21
Figure 1.4: Speedup of GPU MD over reference CPU MD for varying numbers of atoms. .........21
Figure 1.5: Wall-clock time of the TD approach (orange circles) and that of baseline (PD,
black squares) as a function of the number of atoms. The figure also shows the
performance improvement of the TD approach over the baseline as a function of the
number of atoms. ...................................................................................................................22
Figure 1.6: Wall-clock time on kernels for 2-body potential computation per time step for
the baseline (left) and TD (right) approaches. .......................................................................23
Figure 1.7: Relationship between the number of warps per SM and the number of registers
per thread for kernels. ............................................................................................................25
Figure 1.8: Timing comparison of the TD approach and baseline (PD) for the computation
of quadruplet interactions in ReaxFF. The histogram shows wall-clock times per MD
step in milliseconds averaged over 100 time steps. We use RDX material for this
benchmark. .............................................................................................................................27
Figure 1.9: Strong and weak scaling of the parallel ReaxFF MD code on Blue Gene/Q. .............27
Figure 2.1: Schematic of the response of core (green) and shell (yellow) charges to an
external electric field in the new PQEq+ model. ...................................................................33
Figure 2.2: (a) PQEq energy with 1, 10 and 100 iterations of Newton-Raphson method.
(b) Atomic charges after 1 ps of MD simulation with converged RMD (blue) and
XRMD (red) methods. ...........................................................................................................34
Figure 2.3: Illustration of many-body renormalized n-tuple computation of g(i) based on
(a) conventional full neighbor list (NBR) and (b) partial neighbor lists (PNBR^RES and
PNBR^NT) used in the RSC algorithm. ....................................................................................39
Figure 2.4: (a) Runtime comparison between different configurations of MPI ranks and
OpenMP threads on 1 node to identify the optimal combination. Total number of
processes is 64. (b) Average runtime per step for the original ReaxPQ+ and new
ReaxPQ+SC implementations on 96 and 1,440 cores. ReaxPQ+SC reduces time-to-
solution 5.0- and 5.7-fold below ReaxPQ+ for 96- and 1,440-core benchmarks,
respectively. The numbers denote runtimes...........................................................................42
Figure 2.5: (a, b) Data privatization. All threads write to the global force array using atomic
operations (a), or each thread writes to private array (b). (c) Round-robin reduction
with non-overlapping chunks for improving data privatization. ...........................................43
Figure 2.6: Sorted neighbor-list followed by SIMD operation. .....................................................45
Figure 2.7: Performance improvement after data privatization and other code
transformations. The numbers denote runtimes. ....................................................................47
Figure 2.8: (a) Wall-clock time of the ReaxPQ+ adapted RXMD code, with scaled
workloads — 24,576P-atom MoS2 on P cores (P = 64, ..., 131,072) of Theta. (b)
Strong-scaling speedup of the ReaxPQ+ adapted RXMD code with a fixed problem
size — 50,331,648-atom MoS2 system on P cores (P = 2,048, ..., 32,768) of Theta.
The measured speedup values (blue) are compared with the ideal speedup (black). The
numbers denote speedups. .....................................................................................................49
Figure 2.9: Machine learning-guided computational synthesis of MoS2 monolayer by CVD.
(a) Simulation setup showing a MoO3 monolayer suspended in S2 gas. The atoms are
colored as Mo: blue, S: yellow, and O: red. (b, c) Close-up of MoS2 monolayer before
(b) and after (c) annealing. The local structures are classified into 1T (green), 2H (red)
and disordered (blue) phases. For clarity, gaseous environment is not shown. (d)
Neural network model for defect identification and classification. .......................................51
Figure 3.1: Top views of 2H (left) and 1T (right) crystalline phases of MoS2 monolayer.
Magenta and yellow spheres represent Mo and S atoms, respectively, whereas Mo-S
bonds are represented by cylinders. .......................................................................................56
Figure 3.2: Schematic of 2H vs. 1T phase classification by graph neural network. ......................60
Figure 3.3: Repeating message function and update function T times to learn atom
representations. ......................................................................................................................63
Figure 3.4: Snapshot of computationally-synthesized MoS2 monolayer. Ball-and-stick
representation is used to show the positions of atoms and covalent bonds with
neighboring atoms. Atoms are color-coded as yellow for sulfur (S), blue for
molybdenum (Mo) and red for oxygen (O), respectively. .....................................................64
Figure 3.5: ROC curves for 1T (green) and 2H (red) phases using node and edge features
as input, which correspond to the second row in Table 3.3. The x-axis is the true
positive rate (TPR) and y-axis is the false positive rate (FPR). .............................................66
Figure 3.6: Layer-wise evolution of node states. The x-axis is the dimensions of features
(padded to 20) and y-axis is the number of atoms (zero-padded to 40) in the graph.
Each row represents a feature vector of an atom, and the color of each pixel indicates
the value on the unit. ..............................................................................................................67
Figure 3.7: Dielectric constants (A) and SMILE dimensions (B) of the polymer dataset. ............68
Figure 3.8: Model prediction vs. training data of polymer dielectric constants using the
MLP model with variable activation functions (A) modified RELU (RMSE = 0.053);
(B) RELU (RMSE = 0.242); (C) Tanh (RMSE = 0.267). ......................................................69
Figure 3.9: Performance of MLP prediction model (Elman type). Comparison of prediction
on (A) 662 polymers and (B) 546 polymers training sets. .....................................................70
Figure 3.10: Model prediction vs. training data of polymer dielectric constants using the
random forest model. .............................................................................................................71
Figure 3.11: Convergence of the RNN model performance as a function of training epochs.
................................................................................................................................................72
Figure 4.1: Left: 3D protein structure. Right: residue-residue contact map of a protein. ...............78
Figure 4.2: Procedures of the learning task. AA: amino acid. ........................................................79
Figure 4.3: The proposed multiscale GNN approach on top of the existing RNN-CNN
pipeline. ..................................................................................................................................80
Abstract
Machine learning (ML) is revolutionizing scientific research by utilizing its ever-growing
capability of knowledge discovery from massive simulation data. With the anticipated arrival of
exascale computers, new opportunities and challenges are emerging to maximize the synergy
between exaFLOPS supercomputers and ML-boosted computational science. This thesis addresses
two important and intertwined problems: (i) developing efficient billion-core parallel algorithms
for scientific simulations, and (ii) designing domain-specific deep neural networks for scientific
data analysis. To achieve these goals, I will design a series of algorithms with specific focus on
molecular dynamics (MD) simulations and protein folding as archetypal applications.
The research consists of the following three steps:
1. Analyze the computational pattern of MD simulations to expose its common features for
advanced parallelization such as multithreading on both central processing unit (CPU) and
graphics processing unit (GPU) platforms, and improve the computational performance
accordingly. Specifically, two optimization algorithms are proposed to accelerate computation on
CPUs and GPUs, respectively: (i) GPU acceleration based on tuple-decomposition (TD) that
reorders computations according to dynamically created lists of n-tuples; (ii) CPU acceleration
employing shift-collapse computation of many-body renormalized n-tuples that minimizes MPI
communications and multithreading with round-robin data privatization that improves thread
scalability, thereby achieving weak-scaling parallel efficiency of 0.989 on 131,072 cores, along
with eight-fold reduction of time-to-solution (T2S) compared with the original code, on an Intel
Knights Landing-based computer.
2. Develop specific deep neural networks to learn material properties from MD simulation
data, including: (i) multi-layer perceptron (MLP) and recurrent neural network (RNN) to predict
dielectric constants of polymers; and (ii) graph neural network (GNN) to classify different
crystalline phases inside computationally-synthesized layered material by MD simulations.
3. Extend ML to the domain of protein folding, using deep neural networks to help predict the
residue-residue contact map, which is a critical component for protein structure understanding.
Due to the colossal volume of data in both ML tasks, we propose to adapt our neural networks
to exascale parallelism.
Introduction
The rise of machine learning (ML) in the past decade has greatly expanded the pathways
for scientific research with its ever-growing capability of knowledge discovery from massive
simulation data. As the community is poised to embrace the exascale computing era, new
challenges are emerging to maximize the synergy between exaFLOPS supercomputers and ML-
boosted computational science. Two problems are of great significance: 1) developing efficient
billion-core parallel algorithms for scientific simulations; and 2) designing domain-specific deep
neural networks for scientific data analysis. I propose to address these two problems by designing
a series of algorithms, with specific focus on molecular dynamics (MD) simulations and protein
folding as archetypal applications.
The methodology of this thesis consists of three steps: (i) we analyze the computational
pattern of MD simulation to expose its common features for advanced parallelization such as
multithreading on both CPU and GPU platforms, and improve the computational performance
accordingly; (ii) driven by the existing deep learning architectures, we develop a specific deep
neural network to learn particle properties from our MD simulation data; and (iii) we extend our
focus on the domain of protein folding, and use deep neural network to help predict residue-residue
contact map, which is a critical component for protein structure understanding. Due to the colossal
volume of data in both ML tasks, our neural networks must be adapted to exascale parallelism.
In the first step, two optimization algorithms are proposed to accelerate computation on
CPUs and GPUs, respectively.
1. Our GPU acceleration applies a tuple-decomposition (TD) approach that reorders computations
according to dynamically created lists of n-tuples. We analyze the performance characteristics of
the TD approach on general purpose graphics processing units for MD simulations involving pair
(n = 2) and triplet (n = 3) interactions. The results show superior performance of the TD approach
over the conventional particle-decomposition (PD) approach. Detailed analyses reveal the register
footprint as the key factor that dictates the performance. Furthermore, the TD approach is found
to outperform PD for more intensive computations of quadruplet (n = 4) interactions in first
principles-informed reactive MD simulations based on the reactive force-field (ReaxFF) method.
This work thus demonstrates the viable performance portability of the TD approach across a wide
range of applications.
2. Our CPU acceleration employs shift-collapse computation of many-body renormalized n-tuples
that minimizes MPI communications and multithreading with round-robin data privatization that
improves thread scalability. The new code achieves weak-scaling parallel efficiency of 0.989 for
131,072 cores, and eight-fold reduction of time-to-solution (T2S) compared with the original code,
on an Intel Knights Landing-based computer. The reduced T2S has for the first time allowed purely
computational synthesis of atomically-thin transition metal dichalcogenide layers assisted by
machine learning to discover a novel synthetic pathway.
In the second step, we develop several deep neural networks to learn material properties
from MD simulation data: (i) multi-layer perceptron (MLP) and recurrent neural network (RNN)
to predict dielectric constants of polymers; and (ii) graph neural network (GNN) to classify
different crystalline phases inside computationally-synthesized layered material, molybdenum
disulfide monolayer, by MD simulations. In particular, we apply GNN-based analysis to
automatically classify different crystalline phases inside computationally-synthesized
molybdenum disulfide monolayer by reactive molecular dynamics (RMD) simulations on parallel
computers. We have found that adding edge-based features such as distance increases the model
accuracy to 0.9391. Network analysis by visualizing the feature space of our GNN model
clearly separates 2H and 1T crystalline phases inside the network. This work demonstrates the
power of the GNN model to identify structures inside multimillion-atom RMD simulation data of
complex materials.
In the third step, we shift our attention to protein folding, which is an increasingly active
area in machine learning research. Inferring structural properties of a protein from its amino-acid
sequence is of great importance for understanding protein functions. The process of protein folding
from primary structure to tertiary structure is decomposed into successive steps, where a collection
of computational methods is synthesized to predict the final structure. Here, we
approach one of the key steps, residue-residue contact map prediction, through the lens of GNN.
Training of the proposed model will be conducted in a massively parallel manner to attain the
training efficiency expected on upcoming supercomputers.
The rest of the document is organized as follows. Chapter 1 describes GPU acceleration of
dynamic n-tuple computations in many-body MD simulations. Chapter 2 presents our shift-
collapse algorithm to accelerate generalized polarizable reactive MD simulations for machine
learning-assisted computational synthesis of layered materials on leadership-scale parallel
supercomputers. Chapter 3 deals with machine learning of MD simulation data, using specific
examples of dielectric polymers and layered materials. Finally, the proposed research on exascale
machine learning for protein modeling is presented in Chapter 4.
Chapter 1
Acceleration of Dynamic n-tuple Computations in Many-Body
Molecular Dynamics
1.1 Introduction
Computation on dynamic n-tuples of particles is ubiquitous in scientific computing, with
an archetypal application in many-body molecular dynamics (MD) simulations. MD is the most
widely used simulation method for studying structural and dynamic properties of materials [1]. MD
simulations follow the trajectories of all atoms, while computing the interatomic interaction as a
function of atomic positions. In his pioneering MD simulation in 1964, Aneesur Rahman used a
pair-wise interatomic potential that only depended on relative positions of atomic pairs [2]. More
complex interatomic potentials (or force fields) have been developed later to study a wide variety
of materials. In MD simulations of biomolecules, for example, the connectivity of atoms is fixed
throughout the simulation, and the interatomic potential is a function of the relative positions of
fixed n-tuples (n = 2, 3, 4) [3]. To describe a wider range of materials processes such as structural
transformations [4, 5] and chemical reactions [6, 7], however, the connectivity of atoms needs to be
dynamically updated, resulting in many-body MD simulations based on dynamic n-tuple
interactions. Such is the case for many-body MD simulations of inorganic materials, which
typically involve pair (n = 2) and triplet (n = 3) interactions [8]. Higher-order n-tuple computations
are used in first principles-informed reactive molecular dynamics (RMD) simulations based on the
reactive force-field (ReaxFF) method [6, 9, 10]. ReaxFF describes the formation and breakage of
chemical bonds based on a reactive bond-order concept, and its interatomic forces involve
computations on up to quadruplets (n = 4) explicitly and sextuplets (n = 6) implicitly through the
chain rule of differentiation.
One of the simplest ways to map MD simulations onto parallel computers is spatial
decomposition [11]. Here, the three-dimensional space is subdivided into spatially localized
domains, and the interatomic forces among n-tuples involving the atoms in each domain are
computed by a dedicated processor in a parallel computer [12]. To achieve higher parallelism than
this spatial-decomposition approach, interatomic forces are often decomposed in various force-
decomposition approaches [13-15].
On high-end parallel supercomputers, each networked computing node consists of
many cores and is often augmented with accelerators such as general-purpose graphics processing
units (GPGPUs) [16]. On such platforms with deep hierarchical memory architectures,
metascalable (or “design once, continue to scale on future architectures”) parallel algorithms often
employ globally scalable and locally fast solvers [17-19]. An example of such global-local
separation is the computation of long-range electrostatic potentials, where highly scalable real-
space multigrids for internode computations are combined with fast spectral methods for intranode
computations [17-19]. For MD simulations, the global-local separation insulates the optimization
of intranode computations of dynamic n-tuples from internode parallelization approaches
described above. In this paper, we thus focus on GPGPU acceleration of local dynamic n-tuple
computations. Various schemes have been proposed for the optimization of local MD
computations for pair-wise [20] and more general dynamic n-tuple interactions [21].
Extensive prior work exists on the problem of accelerating CPU-
based MD simulation codes with the GPGPU architecture [22-26]. However, the majority of these
works focused on the mechanical aspects of accelerating and tuning a set of existing codes on
GPGPU. These mechanical aspects include better memory organization to promote coalescing of
global memory-access operations, tuning of register usage, exploiting the GPU shared/cache
memory hierarchy, and minimization of communication between the host and device.
While all of these aspects are crucial for achieving large speedups of scientific codes on
GPU, here, we instead propose an alternative approach — named tuple decomposition (TD) — to
GPU acceleration that restructures the enumeration of interatomic interactions and the calculation
of potential energies, so that they can be performed more efficiently on GPGPU. Our approach
employs two techniques: (1) pipelining and (2) in situ construction of two-body and three-body
atomic interaction lists on GPGPU. An existing MD codebase, which calculates two-body and
three-body interatomic potentials based on the conventional particle decomposition (PD)
approach, serves as a platform to demonstrate our approach.
Despite various GPGPU implementations of dynamic n-tuple computations, fewer studies
have focused on the critical factors that dictate the performance of these approaches. In this work,
we analyze the performance characteristics of the TD and PD approaches. Beyond more
conventional factors such as thread divergence, we have found that the register footprint plays a
critical role in controlling the GPGPU performance. In addition to a many-body (n = 2 and 3) MD
simulation, this finding is shown to hold for higher-order n-tuple computations in ReaxFF-based
RMD simulations.
1.2 Physical Model
1.2.1 Interatomic potential
The MD simulation software used as a basis for this research is described in Ref. [7]. To
make the discussion specific, we first consider an interatomic potential proposed in Ref. [8] to
study structural and dynamic correlations in silica (SiO2) material. This implementation includes
an interatomic potential combining pair and triplet interactions and the linked-list cell method for
reducing the computational complexity of the force calculation to O(N) (where N is the number of
atoms), while the standard velocity Verlet algorithm is used for time-stepping [7, 12, 21, 27].
The potential energy of the system is a sum of atomic pair (or two-body) and triplet (or
three-body) contributions [8]:
V = \sum_{i<j} u_{ij}^{(2)}(r_{ij}) + \sum_{i,\,j<k} u_{ijk}^{(3)}(\mathbf{r}_{ij}, \mathbf{r}_{ik}) ,   (1.1)
where rij = |rij|, rij = ri − rj, and ri is the three-dimensional vector to represent the position of the i-
th atom. In Eq. (1.1), the pair potential is given by
u_{ij}^{(2)}(r) = A \left( \frac{\sigma_i + \sigma_j}{r} \right)^{\eta_{ij}} + \frac{Z_i Z_j}{r} e^{-r/\lambda} - \frac{\alpha_i Z_j^{2} + \alpha_j Z_i^{2}}{2 r^{4}} e^{-r/\xi} .   (1.2)
Here, the three terms represent the steric repulsion, the Coulombic interaction due to charge
transfer, and an induced dipole-charge interaction caused by electronic polarizabilities of atoms,
respectively. The pair potential is truncated at a cutoff distance, rij = rc2. The triplet potential in
Eq. (1.1) is expressed as
u_{ijk}^{(3)}(\mathbf{r}_{ij}, \mathbf{r}_{ik}) = B_{ijk} \exp\!\left( \frac{\xi}{r_{ij} - r_{c3}} + \frac{\xi}{r_{ik} - r_{c3}} \right) \frac{\left( \cos\theta_{jik} - \cos\bar{\theta}_{jik} \right)^{2}}{1 + C_{ijk} \left( \cos\theta_{jik} - \cos\bar{\theta}_{jik} \right)^{2}} \, \Theta(r_{c3} - r_{ij}) \, \Theta(r_{c3} - r_{ik}) ,   (1.3)
where Θ(x) is the step function. In Eq. (1.3), the cutoff radius, rc3, for triplet interactions is much
less than that for pair interactions, rc2. The parameters in Eqs. (1.2) and (1.3) are found in Ref. [8].
These parameters are fitted to reproduce the experimentally measured structural and mechanical
properties of the normal-density silica glass.
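For concreteness, the force that drives the dynamics follows by differentiating Eq. (1.1) with respect to the atomic positions. For example, the two-body contribution to the force on atom i is (a standard derivation shown here for illustration, not a formula reproduced from Ref. [8])

\mathbf{F}_i^{(2)} = -\frac{\partial}{\partial \mathbf{r}_i} \sum_{j<k} u_{jk}^{(2)}(r_{jk}) = -\sum_{j \neq i} \frac{d u_{ij}^{(2)}}{d r_{ij}} \, \hat{\mathbf{r}}_{ij} , \qquad \hat{\mathbf{r}}_{ij} = \frac{\mathbf{r}_i - \mathbf{r}_j}{r_{ij}} ,

and the three-body potential in Eq. (1.3) contributes analogously through its dependence on r_ij, r_ik, and the angle θ_jik.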
1.2.2 MD Simulation
The trajectories of atoms are discretized with a time discretization unit of Δt. In the widely
used velocity Verlet algorithm, the positions ri(t) and velocities vi(t) of atoms (i = 1,..., N) at time
t are updated as
\mathbf{r}_i(t + \Delta t) = \mathbf{r}_i(t) + \mathbf{v}_i(t) \Delta t + \tfrac{1}{2} \mathbf{a}_i(t) \Delta t^{2} + O(\Delta t^{4}) ,   (1.4)

\mathbf{v}_i(t + \Delta t) = \mathbf{v}_i(t) + \frac{\mathbf{a}_i(t) + \mathbf{a}_i(t + \Delta t)}{2} \Delta t + O(\Delta t^{3}) ,   (1.5)
where

\mathbf{a}_i = \frac{\mathbf{F}_i}{m_i} = -\frac{1}{m_i} \frac{\partial V}{\partial \mathbf{r}_i}   (1.6)
is the acceleration of the i-th atom. In Eq. (1.6), Fi is the force acting on the i-th atom and mi is its
mass. The velocity Verlet algorithm repeats the body of the main MD simulation loop as shown in
Algorithm 1. Note that the acceleration at time t, ai(t), has already been computed in the previous
simulation step or before the main MD simulation loop is entered for the first simulation step.
Algorithm 1: Body of the main MD simulation loop based on the velocity-Verlet algorithm.
1. v_i(t + Δt/2) ← v_i(t) + (Δt/2) a_i(t)  for all i
2. r_i(t + Δt) ← r_i(t) + v_i(t + Δt/2) Δt  for all i
3. Compute a_i(t + Δt) as a function of the set of atomic positions {r_i(t + Δt) | i = 1, ..., N}, according to Eq. (1.6), for all i
4. v_i(t + Δt) ← v_i(t + Δt/2) + (Δt/2) a_i(t + Δt)  for all i
On parallel computers, we employ a spatial decomposition approach. The simulated system
is decomposed into spatially localized subsystems, and each processor is assigned the computation
of forces on the atoms within one subsystem [7, 12, 21, 27]. Message passing is used to exchange
necessary data for the computations. Specifically, before computing the forces on atoms in a
subsystem (step 3 in Algorithm 1), atomic positions within the interaction cutoff radius rc2 within
the boundaries of the neighboring subsystems are cached from the corresponding processors.
Figure 1.1: The workflow of the main MD simulation loop: (a) main solver loop, (b) original pipelining of
atomic force computation on CPU, (c) restructured pipeline on GPU.
Fig. 1.1 (a) shows the workflow of the main MD simulation loop on a parallel computer,
where “Atom copy” denotes this interprocessor atom caching. After updating the atomic positions
according to the time stepping procedure (step 2 in Algorithm 1), some atoms may have moved
out of its subsystem. These moved-out atoms are migrated to the proper neighbor processors.
“Atom move” in Fig. 1.1 (a) represents this interprocessor atom migration along with the atomic-
position update.
1.2.3 Linked-List Cell Method
A naive method of computing the forces between atoms in MD simulations is to first
consider an atom i and then loop over all other j atoms to calculate their separations. This approach
imposes an O(N²) computational complexity for pair interactions, and an even worse O(N³)
complexity for triplet interactions. Thus, the naive method becomes untenable when the atom
count, N, becomes large.
The linked-cell list method [28] reduces this computational complexity to O(N) for both
cases by dividing the simulation region into cells whose width is slightly greater than the pair
interaction cutoff distance rc2. At each time step, each atom is classified by cell, and then only
interactions between atoms in the same or adjacent cells are considered in the force calculation.
For dynamic pair and triplet computations, the conventional linked-list cell method works as
follows [12, 27]. On each processor, the spatial subsystem containing both the resident and cached
atoms is divided into cells of equal volume whose edge is slightly larger than rc2. Lists of atoms
residing in these cells are constructed by the linked-list method [28]. Pair forces on the resident
atoms are computed by traversing atomic pairs using the linked lists. An atom in a cell interacts
only with the atoms within the cell and its 26 neighbor cells. At the same time as the computation
of pair forces, a list of primary pair atoms, lspr, is constructed. Here, lspr[i][k] stores the identifier
of the k-th neighbor atom, which is within the shorter cutoff distance, rc3, of the i-th atom. Triplet
forces are computed using the primary pair list lspr, which has been constructed during pair-force
computations. In Eq. (1.1), only the resident atoms within the processor are included in the
summation over index i to avoid over-counting. On the other hand, indices j and k are summed
over both the resident and copied atoms. Partial derivatives of the potential energy with respect to
the positions of j and k atoms therefore produce forces on the cached atoms as well as on the
resident atoms. The reaction terms on the cached atoms are sent back to the processors in charge
of the neighbor spatial subsystems and are added to the forces there.
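To make the cell data structure concrete, a minimal host-side sketch of the linked-list cell construction described above is given below. The function name, the array names (head, lscl), and the assumption that atomic coordinates lie in [0, L) are illustrative choices, not identifiers or conventions taken from the actual code.

#include <algorithm>
#include <vector>

// Build the linked-list cell structure: head[c] holds the first atom in cell c,
// lscl[i] holds the next atom in the same cell as atom i (-1 terminates the list).
void build_cell_lists(int n, const double *x, const double *y, const double *z,
                      double lx, double ly, double lz, double rc2,
                      std::vector<int> &head, std::vector<int> &lscl)
{
    // Choose the number of cells so that each cell edge is slightly larger than rc2.
    int ncx = std::max(1, (int)(lx / rc2));
    int ncy = std::max(1, (int)(ly / rc2));
    int ncz = std::max(1, (int)(lz / rc2));
    head.assign(ncx * ncy * ncz, -1);
    lscl.assign(n, -1);
    for (int i = 0; i < n; ++i) {
        // Classify atom i by cell (coordinates assumed to lie in [0, lx) x [0, ly) x [0, lz)).
        int cx = std::min(ncx - 1, (int)(x[i] / (lx / ncx)));
        int cy = std::min(ncy - 1, (int)(y[i] / (ly / ncy)));
        int cz = std::min(ncz - 1, (int)(z[i] / (lz / ncz)));
        int c  = (cx * ncy + cy) * ncz + cz;   // scalar cell index
        lscl[i] = head[c];                     // prepend atom i to cell c's list
        head[c] = i;
    }
}

Traversing the atoms of a cell then amounts to following head[c], lscl[head[c]], and so on until −1 is reached.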
1.3 GPGPU Implementation
Our focus in accelerating these codes was finding an efficient method of porting the main
MD simulation loop to a graphics processing unit (GPU). To guide that process, we follow two
fundamental principles: (1) minimization of control divergence (conditional branching, non-
uniform iteration counts) within threads of a warp, and (2) minimization of synchronization events
between threads. The first principle is important on a single instruction, multiple threads (SIMT)
architecture such as GPU, since the kernel scheduler launches threads in groups of 32 (called a
warp). Each thread within a warp executes the instructions of a kernel in lockstep. Divergent
execution from conditional branches is allowed, but the threads in a warp suffer large performance
penalties when this occurs since threads will pause and resynchronize at the end of the conditional
branch. A similar situation occurs when a load imbalance is present. For example, if one or more
threads in a warp execute many more iterations of a loop, warp resources will sit idle while the
overloaded threads execute.
With these principles in mind, we now turn to the computations performed within the solver
loop as outlined in Fig. 1.1 (a). The velocity half-step updates, atomic position updates, and atom
copies (for enforcement of the periodic boundary conditions, etc.) are all directly translated to GPU
kernels. These operations are trivially parallelized by unrolling their loops by atom. Thus, each
thread of the corresponding GPU kernels is responsible for updating the state of a single atom.
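As a minimal illustration of this per-atom parallelization, the velocity half-step of Algorithm 1 maps onto a CUDA kernel with one thread per atom; the kernel name, array layout, and launch configuration below are assumptions for illustration, not the dissertation's actual code.

// One thread per atom performs step 1 of Algorithm 1: v_i(t + dt/2) = v_i(t) + (dt/2) a_i(t).
__global__ void half_kick(int n, double dt, const double3 *a, double3 *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;              // threads of the last partially filled block do nothing
    v[i].x += 0.5 * dt * a[i].x;
    v[i].y += 0.5 * dt * a[i].y;
    v[i].z += 0.5 * dt * a[i].z;
}

// Example launch: half_kick<<<(n + 255) / 256, 256>>>(n, dt, d_accel, d_vel);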
However, the vast majority of work performed by the solver occurs within the acceleration-
computation step (“Compute force” in Fig. 1.1 (a)). This is much more difficult to parallelize, and
here, we are forced to dramatically restructure the algorithm for efficient execution on GPU. This
restructuring involves application of a pipelining technique to decompose the computation into a
longer sequence of simple steps (implied, but not specifically discussed in Ref. [24]). Fig. 1.1 (b)
and (c) show the original acceleration computation juxtaposed with the new pipelined approach.
The original MD algorithm fuses the identification of interactions between atoms i, j (pair
or two-body) and i, j, k (triplet or three-body) with the calculation of the potentials from Eqs. (1.2)
and (1.3), respectively. This organization is appropriate for the CPU since having spent the time
searching for the neighbors of a particular atom i, no other overhead is incurred to compute the
potential other than clock cycles in the arithmetic logic unit. On GPU, however, this is suboptimal
as a particular atom i will have a variable number of neighbors in its vicinity. If each thread of a
kernel is assigned to find the neighboring interactions for a specific atom, it is very likely that the
threads in a warp will be executing differing amounts of work, thus violating one of the
optimization principles we committed to follow. In addition, determining the validity of an
interaction between atoms involves numerous checks, such as a comparison of atom types, resident
atom within this spatial subsystem versus cached atom, and the distance cutoff between the atoms.
These checks are conditional branches, which cause control divergence as well.
The GPU implementation alternatively factors out the computation of two-body and three-
body potentials from the identification of the participating interactions (Fig. 1.1 (c)). The
calculation of potentials is an embarrassingly parallel operation, with little control divergence and
no synchronization required other than an atomicAdd() to sum the potential contributions for an
atom i across many threads. Thus, calculating two-body potentials is now a two-step process:
construct a list (or array) of valid interactions within the cutoff distance rc2 for all atom pairs i, j,
and then compute the potential contribution from each interaction in parallel. The same process
can be used for three-body potentials. In other words, GPU parallelizes the potential computation
by interaction rather than atom.
With this factorization, we achieve a significant speedup in performance over a baseline
GPU implementation that mirrored the CPU approach with parallelization by atom (Fig. 1.1 (c)).
Having separated the easily parallelizable computation from the tricky interaction determination,
we can then focus on ways to efficiently accelerate the construction of two-body and three-body
interaction lists on GPU. The approaches we employed to accomplish this are explained in the
following section.
1.3.1 Two-Body Interaction List Generation
The two-body interaction list generator used in our application draws inspiration from the
Verlet neighbor-list algorithm by Lipscomb et al. [24], but has significant differences. The
Lipscomb method is elegant and simple, yet it does exhibit a drawback. The method relies on the
complete enumeration of all possible two-body atomic interactions at the outset (the master list),
before the sorted member list can be generated. With the cell structure of MD, the number of such
interactions is bounded by O(N), however, the hidden constant factor is very large. In addition,
Ref. [24] does not address how to generate the master list; the implication is that it may have been
generated on CPU and transferred to GPU. We instead would like the list generation to occur
entirely on GPU.
To address this problem, we extend the interaction metaphor to its logical conclusion —
we not only consider atomic interactions, but also cell interactions. At the outset, during
application initialization, once the problem space has been decomposed into cells, the program
catalogs all combinations of pairs of cells u, v that are adjacent to one another within the lattice
(including interaction with self, and interaction with the boundary cells). This cell interaction list
Iu,v is saved for future use, and is immutable for the runtime of the application. Within the source
code, the cell interaction list is represented by two arrays, linterci and lintercj.
More concretely, as the solver executes on GPU, the kernel responsible for the array-cell
list phase scans the atoms in the lattice in parallel, classifies each by the cell they are located within,
and counts the number of atoms in each cell. The identifier of each atom is stored in an array Au,
where u is the scalar-index of the cell that the atom is located within. Thus, there is one such array
for each scalar cell in the problem domain. The atom counts per cell are kept in Cu, again where u
is the scalar cell index. Within the source code, Au is implemented as the two-dimensional array
cellatomT, while the cell counts are represented in the array cellcount.
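A sketch of this classification kernel is given below. The flat layout of cellatomT, the max_per_cell capacity, and the kernel signature are illustrative assumptions; cellcount is assumed to be zeroed (e.g., with cudaMemset) before each launch.

// One thread per atom: classify the atom by cell and record it in A_u (cellatomT) and C_u (cellcount).
__global__ void bin_atoms(int n, const double3 *pos, double cell_edge,
                          int ncx, int ncy, int ncz, int max_per_cell,
                          int *cellcount, int *cellatomT)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int cx = min(ncx - 1, (int)(pos[i].x / cell_edge));
    int cy = min(ncy - 1, (int)(pos[i].y / cell_edge));
    int cz = min(ncz - 1, (int)(pos[i].z / cell_edge));
    int u  = (cx * ncy + cy) * ncz + cz;        // scalar cell index
    int slot = atomicAdd(&cellcount[u], 1);     // per-cell atom count C_u
    cellatomT[u * max_per_cell + slot] = i;     // A_u: atom identifiers of cell u
}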
Using the CUDA Thrust software development kit (SDK) [29], the set of cell counts Cu is
scanned to find the maximum value m = maxu(Cu), representing the cell with the most atoms. At
this point, the application has sufficient information to explore the entire space of possible
interactions in parallel. Any pair of interacting cells I_{u,v} will have at most m² combinations of two-
body interactions since the maximum number of atoms in any given cell is m. And since there are
||I_{u,v}|| possible cell interactions, an upper bound on the number of two-body atomic interactions in
the problem domain is

t = m^{2} \, \| I_{u,v} \| .   (1.7)
A kernel is then launched with t threads to execute the “identify 2B interactions” phase in
Fig. 1.1 (c). Inside the kernel, each thread tid extracts the identifiers of one pair of atoms i, j to test
for two-body interaction via the following multi-step process. First, the index of the cell
interaction, ic, and the atomic interaction combination within that cell, ac, are computed from the
thread identifier tid by
i_c = t_{id} \bmod \| I_{u,v} \| ,   (1.8)

a_c = \left\lfloor t_{id} / \| I_{u,v} \| \right\rfloor .   (1.9)
Then, the indices of the interacting cells, u and v, are determined by looking up the i_c entry in I_{u,v}:

\{ u, v \} = I_{u,v}(i_c) .   (1.10)
The index positions of the (possibly) interacting atoms in A_u and A_v are computed by

i_{idx} = \left\lfloor a_c / C_v \right\rfloor ,   (1.11)

j_{idx} = a_c \bmod C_v .   (1.12)
If i_idx ≥ C_u, the atomic interaction combination a_c is not in the set of possible interactions
for cells u and v, and that thread terminates. Otherwise, the atom identifiers i and j are finally
retrieved from A_u and A_v:

i = A_u(i_{idx}) ,   (1.13)

j = A_v(j_{idx}) .   (1.14)
At this point, a final check is performed to remove duplicate i, j interactions when
considering cell interactions with u = v (again, by termination). The remaining running threads
then each submits its i, j pair of atoms to the same cutoff threshold testing as the original MD code,
and appends i, j to the two-body interaction list if it passes the threshold tests. In this manner, the
entire set of possible two-body atomic interactions are tested in parallel, and in situ, on GPU. A
single contiguous list of interactions is produced, which can then be evaluated in parallel within
the “calculate 2B forces” phase in Fig. 1.1 (c). Within the code, the list is represented by the arrays
linteri and linterj. Fig. 1.2 illustrates the procedure of two-body list construction on GPU. Each
thread examines an interaction between atom i and j, and transfers the index pair to linteri and
linterj only if it passes the threshold test.
Figure 1.2: Concurrent construction of valid interactions.
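Putting Eqs. (1.8)-(1.14) together, the list-construction kernel can be sketched as follows. The array names mirror the text (linterci, lintercj, cellcount, cellatomT, linteri, linterj), but the flat memory layouts, the pair_count counter, and the exact signature are assumptions made for illustration.

// Each thread tests one candidate atom pair; valid pairs are appended to linteri/linterj
// through a global counter incremented with atomicAdd(), as discussed in the text.
__global__ void identify_2body(int n_cell_pairs, int max_per_cell, double rc2_sq,
                               const int *linterci, const int *lintercj,
                               const int *cellcount, const int *cellatomT,
                               const double3 *pos,
                               int *pair_count, int *linteri, int *linterj)
{
    long long tid = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    int ic = (int)(tid % n_cell_pairs);            // Eq. (1.8): which cell interaction
    int ac = (int)(tid / n_cell_pairs);            // Eq. (1.9): which atom combination
    int u = linterci[ic], v = lintercj[ic];        // Eq. (1.10): the two interacting cells
    int cv = cellcount[v];
    if (cv == 0) return;                           // empty cell v: nothing to test
    int iidx = ac / cv;                            // Eq. (1.11)
    int jidx = ac % cv;                            // Eq. (1.12)
    if (iidx >= cellcount[u]) return;              // combination not realized (also filters excess threads)
    int i = cellatomT[u * max_per_cell + iidx];    // Eq. (1.13)
    int j = cellatomT[v * max_per_cell + jidx];    // Eq. (1.14)
    if (u == v && j <= i) return;                  // drop duplicates for self-interacting cells
    double dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y, dz = pos[i].z - pos[j].z;
    if (dx*dx + dy*dy + dz*dz >= rc2_sq) return;   // cutoff test against rc2
    int slot = atomicAdd(pair_count, 1);           // global counter for the interaction list
    linteri[slot] = i;
    linterj[slot] = j;
}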
At first glance, this method may seem wasteful of processor resources. It is true that
perhaps even one-half of the threads will terminate without ever testing two atoms for an
interaction, due to imbalances in the distribution of atoms among the cells. However, the
terminating threads are often grouped contiguously in large swathes within the thread space,
allowing the GPU warp scheduler to immediately marshal those released resources for other warps
waiting in the run queue. Another disadvantage that should be noted is this method’s reliance on a
global counter, incremented with atomicAdd() when new i, j interactions are appended to the
interaction list. Despite testing various alternative solutions, which replace that counter with other
data structures, the current implementation provides the best overall runtime performance. We
discuss a possible solution to this problem in the concluding remarks.
1.3.2 Three-Body Interaction List Generation
The three-body interaction list generator utilizes a different algorithm than its two-body
counterpart. This method examines the two-body interaction list produced in the previous section,
and searches for those interactions that fall within the more restrictive three-body distance
threshold, rc3. Those interactions that fall within this cutoff are saved in the lspr array in much the
same way as is done in the original MD code explained in Section 2.3. However, in the GPU
implementation, this array is populated by a kernel that inspects each two-body interaction in
parallel (one thread per interaction). As shown in Algorithm 2, kernel gen_lspr produces array lspr
from the existing two-body interaction lists linteri and linterj.
Algorithm 2: GPU implementation of the generation of three-body lists.
CUDA kernel, gen_lspr
// tid is thread ID
// lnum3b[i] is the number of atoms within r c3 of atom i
// linteri and linterj are two-body interaction pairs
i ← linteri[tid]
j ← linterj[tid]
if r ij < r c3
lspr[i][lnum3b[i]] ← j
lspr[j][lnum3b[j]] ← i
lnum3b[i]++
lnum3b[j]++
end if
CUDA kernel, gen_3body_list
// offset is the prefix-sum from Thrust operation
i ← tid
for j from 1 to lnum3b[i]
for k from j + 1 to lnum3b[i]
linter3i[offset] ← i
linter3j[offset] ← lspr[i][j]
linter3k[offset] ← lspr[i][k]
update offset by computing prefix-sum of combinations
end for
end for
Once the lspr array is populated, Thrust is used to scan the neighbor counts for each atom
to generate an array of prefix sums, with the sums produced after applying a function to the
neighbor counts. More specifically, the number of three-body interactions, n_i, associated with atom
i is computed with the formula

n_i = \frac{c_i (c_i - 1)}{2} ,   (1.15)

where c_i is the number of neighboring atoms for atom i in lspr. Then the array of n_i is used as input
into the Thrust prefix-sum operation to produce a new array of values p_i. Thrust is also employed
to compute the total number of three-body interactions by simply performing a reduction on n_i.
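The Thrust part of this step could be coded along the following lines (a host-side sketch with assumed variable names; whether the original implementation uses an exclusive or an inclusive scan is not specified in the text, and an exclusive scan is assumed here because it directly yields each atom's starting offset p_i).

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>
#include <thrust/reduce.h>

// Functor implementing Eq. (1.15): n_i = c_i (c_i - 1) / 2.
struct pair_combinations {
    __host__ __device__ int operator()(int c) const { return c * (c - 1) / 2; }
};

// lnum3b holds c_i (neighbor counts within rc3); offsets must be pre-sized to lnum3b.size().
// Returns the total number of three-body interactions.
int build_3body_offsets(const thrust::device_vector<int> &lnum3b,
                        thrust::device_vector<int> &offsets)
{
    thrust::device_vector<int> n3(lnum3b.size());
    thrust::transform(lnum3b.begin(), lnum3b.end(), n3.begin(), pair_combinations());  // n_i
    thrust::exclusive_scan(n3.begin(), n3.end(), offsets.begin());                     // p_i
    return thrust::reduce(n3.begin(), n3.end(), 0);                                    // sum of n_i
}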
The prefix-sum array p_i is then used as input by a follow-on kernel that populates the actual
three-body interaction list. Each thread in the kernel is assigned to process an atom i, and it iterates
through the n_i 3-tuple combinations for that atom in lspr. The threads save the resulting i, j, k
interactions into the three-body interaction list starting at offset p_i. The result of this complicated
process is a fully populated i, j, k interaction list, entirely produced on GPU, which can be used in
the “calculate 3B forces” phase in Fig. 1.1 (c). In the source code, this interaction list is represented
by the trio of arrays linter3i, linter3j, and linter3k. In Algorithm 2, kernel gen_3body_list is
implemented to store the three-body interactions into these arrays.
This interaction list generator is remarkably different from its counterpart. Its primary
advantage is the elimination of a global counter for insertion into the three-body interaction list.
The purpose of the prefix-sum array is to assign a dedicated range of entries within the final
interaction list to each thread (atom) so that no explicit coordination is required between threads.
This is a significant advantage, but it comes with a cost. This method does not scale when
there are large numbers of three-body interactions per atom since the memory write into the final
interaction list is strided by the number of combinations. Thus, as the interaction count per atom
increases, the stride correspondingly increases, and memory coalescing worsens. We have found
in practice that the combination counts are typically very small, as well as the total number of
three-body interactions, so the performance degradation from the poor coalescing strategy is
minimal. It should be noted here that rc3 << rc2, as stated before. This comes from a physical
principle that higher-order n-tuple interactions (n ≥ 3) are short-ranged and only a few neighbors
per atom contribute to them, compared to hundreds of neighbors for pair interactions.
1.4 Benchmark and Validation
To evaluate the runtime performance of the GPU implementations, we run a series of
benchmarking tests at a center for high performance computing that operates both a large,
heterogeneous CPU cluster and a GPU-accelerated cluster. The GPU-accelerated cluster is
comprised of 264 compute nodes, each with two Nvidia Kepler K20m GPU accelerators. We
demonstrate the performance improvement of the new TD approach on MD simulations by
analyzing two comparisons: TD’s GPU implementation versus (1) the original MD code and (2) a
baseline GPU implementation that directly mirrors the CPU implementation.
1.4.1 GPU Implementation versus the Original MD Code
For this timing experiment, we have selected three reference CPU chipsets from the
collection of available compute nodes on the CPU cluster at HPC. We compare the original MD
runtime performance on those chipsets versus the GPU implementation executing on the Kepler
K20m for a series of MD simulations. The lattice sizes in the simulations vary from 4×4×4 to
16×16×16 crystalline unit cells (where each unit cell contains 24 atoms), and each simulation runs
for 1,000 time steps. The CPU chipsets used in the comparison are AMD Opteron 2356 with a
clock speed of 2.3 GHz and 2 MB cache, AMD Opteron 2376 (2.3 GHz, 6 MB cache), and Intel
Xeon E5-2665 (2.4 GHz, 20 MB cache). The purpose behind the multi-chipset comparison is to
give a comprehensive view of the GPU implementation’s performance characteristics versus a
range of common CPUs.
Fig. 1.3 plots the wall-clock time versus simulation size (i.e., the number of atoms) for the
three CPU chipsets and the GPU. The corresponding speedups of the GPU implementation over
the three CPU implementations are plotted as a function of the simulation size in Fig. 1.4. As can
be seen in the graph, the speedup varies greatly with the power of the reference CPU, with the
newer Intel Xeon faring better than its counterparts. However, even against the Xeon, the GPU
implementation demonstrates a significant speedup of up to 6-fold, which improves as the lattice
size is increased. This overall trend is expected as the overhead costs of device initialization, PCI
bus communications, and kernel launches are amortized over a larger set of atoms.
An interesting observation in those charts is the large increase in speedup that
the GPU application experiences for the largest simulation versus the two AMD Opteron CPUs.
This is because the Opteron cache size (6 MB) is not large enough to accommodate the entire
98,304 atomic states along with other necessary data, and the cache-miss rate increases
dramatically at that point. If we were to scale up the tests further, we would likely continue to see
their runtime performance drop in comparison to the GPU.
To validate the simulation results generated by the GPU, we have compared the final
atomic positions and velocities with those of reference datasets produced by the original MD code.
We have examined simulations with lattice sizes 4×4×4, 8×8×8, 10×10×10, 12×12×12, and
16×16×16, all executed over a span of 1,000 time steps. Across all configurations, these output
parameters agreed within a tolerance of 10⁻⁹
. However, we did discover a problem with
accumulating numerical error that caused the GPU results to drift from those of the CPU while
testing extended simulations of more than 4,000 time steps. This problem is analyzed further in
the concluding remarks.
Figure 1.3: Wall-clock time of GPU MD and reference CPU MD for varying numbers of atoms.
Figure 1.4: Speedup of GPU MD over reference CPU MD for varying numbers of atoms.
1.4.2 GPU Implementation versus the Baseline
The TD approach is naturally expected to outperform its sequential origin, since
parallelization is employed aggressively. The advantage of the tuple-based restructuring of
computations over a direct GPU port, however, is nontrivial and must be investigated. To address this question, we
introduced a straightforward GPU implementation of the original MD code, which is based on
particle decomposition (PD), as a baseline for comparison. As illustrated in Fig. 1.1, the TD
approach separates the computation and identification of 2-body and 3-body interactions into
different kernels. The baseline approach, on the contrary, fuses the two kernels back into a single
one with each GPU thread scanning the interatomic potentials followed by calculation of the
potentials. Thus, this code is simply a GPU-accelerated implementation of the original MD code
on CPU.
The results are collected on the Tesla K20m, with simulations run at lattice sizes ranging
from 4×4×4 to 32×32×32, or equivalently from 1,536 to 786,432 atoms. As shown
in Fig. 1.5, our TD approach achieves an approximately 20% performance improvement over the
baseline (PD). We observe that the wall-clock time for TD is higher than that of the baseline for
the first two lattice sizes. The likely reason is that our TD implementation launches two more
CUDA kernels than the baseline in each time step, and the overhead of launching these kernels is
relatively high when the number of atoms is too small to fully exploit the computational
capability of the GPU.
Figure 1.5: Wall-clock time of the TD approach (orange circles) and that of baseline (PD, black squares) as a
function of the number of atoms. The figure also shows the performance improvement of the TD approach
over the baseline as a function of the number of atoms.
To better understand the performance characteristics of the TD and PD approaches, we
next carry out performance profiling on the most compute-intensive CUDA kernels that account
for approximately 90% of the total running time for both implementations, i.e., 2body_inter in
baseline, and gen_lists_2body_inter and compute_accel_2body_inter in TD. We use the NVIDIA
profiling tool, nvprof, for the measurements.
Figure 1.6 compares the wall-clock time of the TD approach with that of the baseline for
these most compute-intensive kernel executions. The total number of atoms is 786,432. As their
names suggest, the test case consists of computing 2-body interactions in a single time step. Here,
the running time of each kernel is averaged over all CUDA threads, because a thread may skip
the body of the kernel when the interaction assigned to it falls outside the cutoff
radius. The TD approach (right column) is approximately 23% faster than the baseline (left
column), which in turn echoes the global performance improvement of the former over the latter
in Fig. 1.5. Within the TD approach, generation of the 2-body interaction list and actual
computation of 2-body interactions using the list account for approximately identical computing
times.
Figure 1.6: Wall-clock time on kernels for 2-body potential computation per time step for the baseline (left)
and TD (right) approaches.
In addition, we investigate a major concern of our design mentioned in Section 1.3, i.e.,
unbalanced workloads over threads. We introduce “achieved occupancy” into our analysis as a
metric to quantify the workload balance on GPU. More precisely, it is defined as the ratio of active
warps on a streaming multiprocessor (SM) to the maximum number of active warps supported by
the SM [30]. Table 1.1 shows the achieved occupancy of the kernels measured by nvprof. The
combined achieved occupancy of TD, though subject to statistical variation, is clearly higher than
that of the baseline, indicating that fetching the interatomic pairs from the restructured
tuple list and then calculating the interactions produces more balanced workload distributions
over time than the baseline approach does.
Table 1.1: Achieved occupancy of kernel execution.
Approach   Kernel                        Achieved occupancy
Baseline   2body_inter                   0.386
TD         gen_lists_2body_inter         0.916
TD         compute_accel_2body_inter     0.461
We next perform a hardware-level analysis to explain the achieved
occupancy of the three kernels in Table 1.1. Occupancy can be limited by several factors, such as shared-
memory usage; in this application, register usage per SM is found to be the determining factor. As
Fig. 1.7 shows, the kernel gen_lists_2body_inter uses 32 registers per thread, which lets the
K20m keep 64 warps (two 1,024-thread blocks) active per SM, while the other two kernels take
up 41 registers per thread (41,984 registers for one 1,024-thread block). The Tesla K20m provides
65,536 registers per SM, enough for two blocks at 32 registers per thread. Because the other two
kernels consume 41,984 registers for a single block, each SM can only run 1 block (32 warps, as
the red dot in Fig. 1.7 indicates) in parallel. Thus, the achieved occupancy is limited to an upper bound
of 0.5, which prevents these kernels from fully utilizing the GPU. This analysis demonstrates that the
register footprint plays a key role in dictating the GPU performance.
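As a side note, this register-limited bound can also be checked programmatically. Below is a minimal host-side sketch using the CUDA occupancy API; the kernel name follows the text, while the 1,024-thread block size is an assumption:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute_accel_2body_inter(/* arguments omitted */);  // kernel named in the text

void report_occupancy_bound() {
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of the assumed size this kernel can keep resident
    // per SM, given its register and shared-memory footprint.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  compute_accel_2body_inter,
                                                  1024 /* assumed block size */, 0);
    std::printf("resident blocks per SM: %d\n", maxBlocksPerSM);  // 1 block = 32 warps here
}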
1.4.3 Dynamic Quadruplet Computation in ReaxFF
To test the performance portability of the TD approach to more general dynamic n-tuple
computations, we consider first principles-informed RMD simulations based on the ReaxFF
method [6, 9, 10]. The ReaxFF approach significantly reduces the computational cost of simulating
chemical reactions, while reproducing the energy surfaces and barriers as well as charge
distributions of quantum-mechanical calculations. RMD simulations describe formation and
breakage of chemical bonds using reactive bond orders [6, 31, 32], while a charge-equilibration
(QEq) scheme [33-35] is used to describe charge transfer between atoms. The ReaxFF interatomic
forces involve up to quadruplets (n = 4) explicitly and sextuplets (n = 6) implicitly through the
chain rule of differentiation. In this work, we test the TD approach in the quadruplet-interaction
computations in a production RMD simulation program named XRMD [10].
Figure 1.7: Relationship between the number of warps per SM and the number of registers per thread for
kernels.
The quadruplet interaction is one of the most computationally expensive functions in the
ReaxFF method. It requires traversing deeply nested neighbor-list loops, within which the bond
order of a covalent bond, or a product of bond orders, must be checked against
predefined cutoff values, creating significant code branching. Our TD approach performs
the neighbor-list traversal and the branch-condition evaluation on the CPU and the numerically
intensive computations on the GPU, enabling us to take advantage of both host and GPU
architectures. The GPU implementation of the TD approach for the computation of quadruplet
interactions is shown in Algorithm 3.
Algorithm 3: GPU implementation of the generation of 4-body lists and computation of quadruplet forces in XRMD.
CUDA kernel, gen_4body_list
// tid is the thread ID
// i, j, k, l are atom IDs; BO_ij, BO_jk, BO_kl are bond orders between pairs (i,j), (j,k) and (k,l)
// ie4b is the 4-body interaction list and ne4b is the number of 4-body interactions
j ← tid
for each neighbor index k1 of j:
    k ← nbrlist[j, k1]
    if BO_jk > BO_cutoff
        for each neighbor index i1 of j:
            i ← nbrlist[j, i1]
            if i ≠ k
                if BO_ij > BO_cutoff && BO_ij × BO_jk > BO_cutoff
                    for each neighbor index l1 of k:
                        l ← nbrlist[k, l1]
                        if i ≠ l && j ≠ l
                            if BO_kl > BO_cutoff && BO_jk × BO_kl > BO_cutoff
                                if BO_ij × BO_jk^2 × BO_kl > BO_cutoff2
                                    ne4b++
                                    ie4b[1:4][ne4b] ← (j, i1, k1, l1)
CUDA kernel, compute_4body_force
ne4b ← tid
i ← ie4b[1][ne4b]
j ← ie4b[2][ne4b]
k ← ie4b[3][ne4b]
l ← ie4b[4][ne4b]
call E4b[i, j, k, l]
Figure 1.8 compares the wall-clock time of the quadruplet-interaction calculation using the
TD and baseline (PD) approaches for three different problem sizes, i.e., the number of atoms N =
1,344, 4,536 and 10,752. For this kernel invocation, we employ a one-dimensional thread
assignment with the number of threads per block, Nthreads, set to 32 and the number of blocks set to
the number of atomic quadruplets divided by Nthreads, plus 1. The figure shows that the TD approach
consistently outperforms the baseline, providing a nearly 1.3-fold speedup for all three system sizes.
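A minimal sketch of this launch configuration is given below; the kernel and array names follow Algorithm 3, but the argument list and kernel body are illustrative, not the production implementation:

// One thread per entry of the 4-body list; threads past the end return immediately.
__global__ void compute_4body_force_sketch(const int* ie4b, int ne4b /*, ... */) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= ne4b) return;          // guard for the padded final block
    // fetch (j, i1, k1, l1) for entry t and evaluate E4b ...
}

void launch_4body(const int* d_ie4b, int ne4b) {
    int nThreads = 32;                      // threads per block, as in the text
    int nBlocks  = ne4b / nThreads + 1;     // blocks = number of quadruplets / Nthreads + 1
    compute_4body_force_sketch<<<nBlocks, nThreads>>>(d_ie4b, ne4b);
}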
Figure 1.8: Timing comparison of the TD approach and baseline (PD) for the computation of quadruplet
interactions in ReaxFF. The histogram shows wall-clock times per MD step in milliseconds averaged over 100
time steps. We use RDX material for this benchmark.
As stated in the introduction, our global-local separation scheme completely insulates the
optimization of intranode computations of dynamic n-tuples from internode parallelization. In
order to test the internode scalability of our parallel ReaxFF MD code, Fig. 1.9 shows the
computing time per MD step as a function of the number of IBM Blue Gene/Q cores in both weak-
and strong-scaling settings. The weak-scaling test simulates an 86,016P-atom C3H6N6O6 molecular-
crystal system on P cores, while the strong-scaling test simulates N = 4,227,858,432 atoms
independent of P. The weak-scaling parallel efficiency is 0.977 at P = 786,432 cores (N =
67,645,734,912 atoms). The strong-scaling parallel efficiency is 0.886 at P = 786,432 cores for the much
smaller system of N = 4,227,858,432 atoms.
Figure 1.9: Strong and weak scaling of the parallel ReaxFF MD code on Blue Gene/Q.
1.5 Summary
We have presented a new computational approach for performing many-body MD
simulations on GPGPU. Our tuple-decomposition approach utilizes pipelining and efficient in situ
generation of pair and triplet interaction lists to accelerate the application. Through a wide-ranging
set of benchmarking tests, we have demonstrated that these relatively simple algorithmic changes
provide significant performance speedups for varying simulation sizes and applications.
The algorithms we described for producing lists of pair and triplet interactions can be
improved with further research. One possibility is combining our candidate two-body
interaction generator with the Lipscomb algorithm [24] for Verlet neighbor-list generation.
Our approach addresses the major weakness in their algorithm (the generation of the master list),
while their approach eliminates the need for synchronization through a global counter. We plan to
combine these two methods into a hybrid approach.
We have also examined various techniques for fusing the pair and triplet interaction list
generation into the same overall algorithm, but this proved to be too difficult as their requirements
are slightly different. Further research could yield a way to integrate them in an elegant manner.
This would simplify our computation model immensely.
Although the focus of this work is primarily on MD algorithm development on GPU, there
are various optimizations that could be made to boost the performance even further. Memory
coalescing could be improved in many places by restructuring the atomic state information stored
in device global memory, and utilizing shared memory to coordinate reads among threads. With
the former technique, the goal is to have adjacent threads in a thread block read data from adjacent
addresses in global memory. The latter technique allows the efficient loading of strided array
structures into local memory where the access times are much lower.
An intriguing idea that we plan to explore is imposing a sort order on the atoms, so that
adjacent threads in the kernel tend to access adjacent atoms in the atom list. Theoretically, this
could improve memory coalescing. Determining a proper sort order is difficult, however, and
enforcing it is expensive because it implies that a parallel sort be performed each time step since
the atoms move throughout the simulation. A compromise may be to periodically sort the atoms
and hope that the benefits of temporarily improved coalescing outweigh the sorting cost.
Chapter 2
Shift-Collapse Acceleration of Generalized Polarizable Reactive
Molecular Dynamics for Machine Learning-Assisted
Computational Synthesis of Layered Material
2.1 Introduction
Reactive molecular dynamics (RMD) is a powerful simulation method for describing
material processes involving chemical reactions, with a wide range of applications in physics,
chemistry, biology and materials science [36]. RMD simulation follows time evolution of the
positions, r^N = {r_i | i = 1, …, N}, of N atoms by numerically integrating Newton's equations of
motion, where the atomic force law is mathematically encoded in the interatomic potential energy
E(r^N). Reliable interatomic potentials are key to accurately describing thermomechanical and
chemical properties of materials. The first principles-informed reactive force-field (ReaxFF) model
significantly reduces the computational cost, while reproducing the energy surfaces and barriers as
well as charge distributions of quantum-mechanical (QM) calculations [36].
The most intensive computation in RMD simulation arises from a charge-equilibration
(QEq) scheme [37] to describe charge transfer between atoms, thereby enabling the study of
reduction and oxidation reactions. QEq treats atomic charges as dynamic variables, q^N =
{q_i | i = 1, …, N}. The charges and the resulting force law are determined by minimizing the
potential energy with respect to q^N at every RMD time step. This variable N-charge problem is
commonly solved iteratively, e.g., with the conjugate gradient (CG) method [38]. Though recent
advancements in parallel ReaxFF algorithms [39-41] have enabled large RMD simulations [42]
involving multimillion atoms, QEq computation remains the major bottleneck for studying
long time trajectories of such large RMD simulations.
Despite enormous success of the QEq-based ReaxFF model, one critical issue has remained
unsolved, namely accurate description of electric polarizability. Polarization of the surrounding
medium essentially dictates the rate of reduction and oxidation reactions, as is articulated, e.g., in
the Nobel lecture by Rudolph Marcus [43]. A recently proposed polarizable reactive force-field
(ReaxPQ) model based on a polarizable charge equilibration (PQEq) scheme significantly improves
the accuracy of describing redox reactions by accommodating the reorganization of surrounding
media [44]. When applied to prediction of electronic polarizabilities, however, the ReaxPQ model
alone was found inadequate. This partly arises from the fact that the original ReaxPQ model
determines the polarization of a nucleus core and an electronic shell within each atom by
considering only the internal electric field produced by atomic charges but not an externally applied
electric field. To remedy this deficiency, we here introduce a generalization of the ReaxPQ model
named ReaxPQ+, in which atomic polarizations respond to both internal and external electric fields,
thereby achieving near quantum accuracy for tested cases.
The improved accuracy of the new ReaxPQ+ model is accompanied by significant increase
of the computational cost. Compared to the original QEq scheme, which only deals with atom-
centered charge-charge interactions, PQEq computation in the ReaxPQ+ model is quadrupled, since
it considers core-core, core-shell, shell-core and shell-shell charge interactions for every atomic
pair. In this paper, we accelerate this heavy ReaxPQ+ computation using (1) an extended
Lagrangian approach to eliminate the speed-limiting charge iteration [45], (2) a new extension of
the shift-collapse (SC) algorithm [46] named renormalized SC (RSC) to compute dynamic n-tuples
with provably minimal data transfers, (3) multithreading with round-robin data privatization, and
(4) data reordering to reduce computation and allow vectorization. The accelerated code achieves
(1) weak-scaling parallel efficiency of 0.989 for 131,072 cores, and (2) eight-fold reduction of the
time-to-solution (T2S) compared with the original code, on an Intel Knights Landing (KNL)-based
supercomputer.
The reduced T2S has allowed computational synthesis of atomically-thin transition-metal
dichalcogenide (TMDC) layers with unprecedented fidelity. Functional layered materials (LM) will
dominate materials science in this century [47]. The attractiveness of LMs lies not only in their
outstanding electronic, optical, magnetic and chemical properties, but also in the possibility of
tuning these properties in desired ways by building van der Waals (vdW) heterostructures composed
of unlimited combinations of atomically-thin layers. To rationally guide the synthesis of stacked
LMs by chemical vapor deposition (CVD), exfoliation and intercalation, “computational synthesis”
should encompass large spatiotemporal scales. Such layered materials genome (LMG) has been
chosen as one of the designated applications of the United States’ first exaflop/s computer A21
when it is introduced in 2021 [48]. This paper for the first time demonstrates purely computational
synthesis of TMDC-based LM, which is assisted by machine learning to discover a novel synthetic
pathway. This opens up a possibility to computationally explore new synthetic pathways to novel
TMDC-LMs and vdW heterostructures.
2.2 Application and Algorithmic Innovations
2.2.1 Generalized Polarizable Reactive Force Field (ReaxPQ+)
In the ReaxPQ+ model, the potential energy E({r_ij}, {r_ijk}, {r_ijkl}, {q_i}, {BO_ij}) is a function
of relative positions of atomic pairs r_ij, triplets r_ijk and quadruplets r_ijkl, as well as atomic charges q_i
and bond orders BO_ij between atomic pairs. The potential energy includes the Coulombic energy
E_Coulomb. In the PQEq scheme used in the original ReaxPQ model, electric polarization is described
using a Gaussian-shaped electron density (shell) that can polarize away from the nuclei core in
response to internal electric fields produced by atoms [44]. Here, each atom i is partitioned into two
charged sites (i.e., core and shell). The core (ρic) consists of two parts: ρi with a variable total charge
(qi) and ρiZ with a fixed total charge (Zi). The shell (ρis) is massless and has a fixed total charge of
−Zi. The shell and core of an atom are connected by an isotropic harmonic spring with force constant
Ks (Fig. 2.1).
At each time step, the atomic charges q^N are determined by minimizing E_Coulomb subject to
the conditions that the electrochemical potentials, ∂E_Coulomb/∂q_i, are equal among all atoms. The
Coulombic energy is given by

E_\mathrm{Coulomb}(\{\mathbf{r}_{ic}, \mathbf{r}_{is}, q_i\}) = \sum_{i=1}^{N} \left[ \chi_i^0 q_i + \tfrac{1}{2} J_{ii}^0 q_i^2 + \tfrac{1}{2} K_s r_{ic,is}^2 \right]
    + \sum_{i>j} \left[ T(r_{ic,jc}) C_{ic,jc}(r_{ic,jc}) q_{ic} q_{jc} - T(r_{ic,js}) C_{ic,js}(r_{ic,js}) q_{ic} Z_j
    - T(r_{is,jc}) C_{is,jc}(r_{is,jc}) q_{jc} Z_i + T(r_{is,js}) C_{is,js}(r_{is,js}) Z_i Z_j \right]    (2.1)

where r_ic, r_is, χ_i^0 and J_ii^0 are the core position, shell position, electronegativity and hardness
of the i-th atom. In Eq. (2.1), r_ia,jb (i, j = 1, …, N; a, b = core (c) or shell (s)) are charge-charge
distances. The electrostatic energy between two Gaussian charges is given in terms of the error
function C_ia,jb(r_ia,jb), and the Coulombic interaction is screened using a taper function T(r) [44].
Figure 2.1: Schematic of the response of core (green) and shell (yellow) charges to an external electric field in
the new PQEq+ model.
As was mentioned in the introduction, the core-core, core-shell, shell-core and shell-shell charge
interactions in Eq. (2.1) quadruple the charge computation over the conventional ReaxFF model.
In the original ReaxPQ, the shell position r_is for the i-th atom is obtained by balancing the
effect of the electrostatic field due to all other atoms (i.e., inter-atomic interactions) with intra-
atomic interactions involving only the core and shell:

\mathbf{F}_\mathrm{inter} = -\frac{\partial}{\partial \mathbf{r}_{is}} \left\{ \sum_{ia>jb} T(r_{ia,jb}) C_{ia,jb}(r_{ia,jb}) q_{ia} q_{jb} \right\}    (2.2)

\mathbf{F}_\mathrm{intra} = -\frac{\partial}{\partial \mathbf{r}_{is}} \left( \tfrac{1}{2} K_s r_{ic,is}^2 \right)    (2.3)
We solve F_inter = F_intra for r_is using the Newton-Raphson method. Fig. 2.2(a) compares the
time evolution of the PQEq energy for 1, 10 and 100 iterations of the Newton-Raphson method, where
the previous shell positions are used as the initial guess. Fig. 2.2(a) shows that a single iteration suffices
to obtain an accuracy of 10^-4 kcal·mol^-1/atom. We have also confirmed the accuracy of the solution by
comparing the shell position and charge of each atom. We have applied this ReaxPQ model to
compute the dielectric constants 𝜖 of various materials and found that the model generally
underestimates the values.
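To make the single-iteration behavior concrete, the following sketch performs one Newton-Raphson update of a shell position, approximating the Jacobian by the spring constant alone (i.e., neglecting the gradient of the electrostatic force on the shell); the function and variable names are illustrative, not the RXMD implementation. Here f_es is the electrostatic force on the shell (F_inter in the original ReaxPQ; F_inter plus F_external in the ReaxPQ+ model introduced below).

#include <array>
using Vec3 = std::array<double, 3>;

// One Newton-Raphson step for the shell position r_is of a single atom.
// The net force on the shell is f_es - Ks*(r_is - r_ic); approximating its
// derivative with respect to r_is by -Ks gives the update r_is <- r_is + residual/Ks.
Vec3 newton_shell_step(const Vec3& r_ic, const Vec3& r_is, const Vec3& f_es, double Ks) {
    Vec3 r_new;
    for (int d = 0; d < 3; ++d) {
        double residual = f_es[d] - Ks * (r_is[d] - r_ic[d]);
        r_new[d] = r_is[d] + residual / Ks;
    }
    return r_new;
}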
Figure 2.2: (a) PQEq energy with 1, 10 and 100 iterations of Newton-Raphson method. (b) Atomic charges
after 1 ps of MD simulation with converged RMD (blue) and XRMD (red) methods.
In order to improve the accuracy of describing the dielectric response of materials, we here
introduce a generalized polarizable reactive force-field (ReaxPQ+) model, in which ris is
determined by solving F_inter + F_external = F_intra, i.e., by explicitly including the effect of an external
electric field ℰ,

\mathbf{F}_\mathrm{external} = \sum_{a} q_{ia} \boldsymbol{\mathcal{E}}    (2.4)

In both the original ReaxPQ and new ReaxPQ+ models, the polarization is calculated as

\mathbf{P} = \sum_{ia} q_{ia} (\mathbf{r}_{ia} - \mathbf{r}_{ia}^0)    (2.5)

where r_ia^0 is the charge position in the absence of the external electric field ℰ.
Table 2.1: Computed dielectric constants.
Material                         ReaxPQ   ReaxPQ+   QM
Polyethylene (PE)                1.01     2.25      2.37
C=O defect (PE)                  1.02     2.74      2.78
C-Cl defect (PE)                 1.01     2.40      2.53
MoS2 (in plane)                  1.03     14.3      15.4*
MoS2 (out of plane)              1.03     5.68      7.43*
Polyvinylidene fluoride (PVDF)   1.02     2.56      2.52
Alumina                          1.03     3.17      2.98
* Surface Science Reports, vol. 70, pp. 554-586, Dec 2015.
We have tested the accuracy of the ReaxPQ+ model by computing the dielectric constants
𝜖 of poly-ethylene (PE), molybdenum disulfide (MoS2) and other polymer and ceramic materials.
Table 2.1 compares dielectric constants computed with the original ReaxPQ and new ReaxPQ+
models against those obtained by first-principles QM calculations. Unless noted otherwise, the QM
values have been computed by us using a quantum molecular dynamics (QMD) code [49] based on
density functional theory (DFT), in which dielectric constants are calculated using a Berry-phase
approach [50]. The table shows that the new ReaxPQ+ results agree much better with the QM
results, compared with the original ReaxPQ results.
2.2.2 Extended-Lagrangian Acceleration of ReaxPQ+
As shown above, charge-interaction computation in ReaxPQ+ is quadrupled compared to
that in the conventional ReaxFF model. The increased computational cost necessitates innovative
algorithms to speed up the computation.
First, we adapt the extended-Lagrangian reactive molecular dynamics (XRMD) algorithm
[45], which was originally proposed for the conventional ReaxFF model, to the new ReaxPQ+
model. The problem is that an excessively large number of CG iterations are required to reach
sufficient convergence of charges q
N
to guarantee the conservation of the total energy as a function
of time. Insufficiently converged charges act as an artificial heat sink, and the resulting broken time
reversibility causes the total energy to drift over time. A similar trade-off between the computational
speed and energy conservation is encountered in first-principles QMD simulations, where
insufficient convergence of the iterative refinement of electronic wave functions causes serious
energy drift. Niklasson proposed an extended Lagrangian scheme [51] that achieves excellent long-
time energy conservation with drastically reduced number of iterations. In fact, an extended
Lagrangian scheme with no iteration (i.e., requiring only one evaluation of energy gradient per
QMD time step) has been demonstrated [52]. The key idea is to introduce auxiliary wave functions
as dynamic variables that are numerically integrated by time-reversible, symplectic integration
schemes to address the broken reversibility problem, while the auxiliary wave functions are
constrained to iteratively determined wave functions by a harmonic potential. Successful
elimination of the speed-limiting charge iteration in the ReaxFF model was achieved by Nomura et
al. by adapting the extended-Lagrangian scheme [45]. The XRMD algorithm has drastically
improved energy conservation while substantially reducing the time-to-solution. In addition,
XRMD accurately describes atomic trajectories and charges. The average difference of atomic
positions was 0.08 Å after 1 ps of simulation between XRMD and fully converged RMD methods
[45]. Fig. 2.2(b) compares atomic charges obtained by XRMD algorithm with those by RMD using
extremely high CG tolerance (10
-8
) which show an excellent agreement. In this paper, we adapt the
XRMD algorithm to the new ReaxPQ+ model, where auxiliary charges are applied only to the
variable part, qi, of the charges.
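As a rough illustration of the idea (not the exact XRMD update of [45]), the auxiliary charges can be propagated by a time-reversible, Verlet-like recursion that is harmonically constrained toward the optimized charges; the propagated values then serve as the initial guess for the charge solver. A minimal sketch, with illustrative parameter names:

#include <vector>

// Sketch of a Niklasson-style extended-Lagrangian update for auxiliary charges u.
// dt is the MD time step and omega2 the curvature of the harmonic constraint that
// ties u to the optimized charges q; both are assumed, not the RXMD parameters.
void propagate_aux_charges(std::vector<double>& u, std::vector<double>& u_prev,
                           const std::vector<double>& q, double dt, double omega2) {
    for (std::size_t i = 0; i < u.size(); ++i) {
        double u_next = 2.0 * u[i] - u_prev[i] + dt * dt * omega2 * (q[i] - u[i]);
        u_prev[i] = u[i];
        u[i] = u_next;   // u now provides the initial guess for the next charge solve
    }
}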
2.2.3 Renormalized Shift-Collapse Acceleration of ReaxPQ+
Our second algorithmic innovation is a generalization of the shift-collapse (SC) algorithm
[46], named renormalized SC (RSC). The SC algorithm provably minimizes data transfer for the
computation of dynamic n-tuples in parallel computing based on spatial (or domain) decomposition.
Building on translation and reflection invariance of the set of n-tuple computations, the algorithm
applies shift and collapse algebraic transformations to n-tuple computations so as to completely
eliminate redundant computations among computing nodes while minimizing data transfer. Here,
we apply the SC algorithm to the generalized polarizable charge equilibration (PQEq+) subroutine
that iteratively optimizes atomic charges to minimize the Coulombic potential energy using CG
method. At the beginning of every PQEq+ iteration, information of neighbor atoms near the domain
boundary (including atomic charges and gradients) needs to be imported from the nearest-neighbor
computing nodes for calculating Coulombic potential used in CG iteration. This incurs significant
communication cost. For each PQEq+ iteration, the Coulombic potential of the system is evaluated
using guessed atomic charges. After that, each node computes charge gradients and consecutively
updates the atomic charges of all atoms residing in its domain. The PQEq+ computation is terminated when
either (1) the CG residual is small enough, (2) no energy improvement is gained between successive
iterations, or (3) the maximum number of iterations is reached.
In this work, we employ two approaches to reduce the time spent in PQEq+ subroutine.
First, we apply SC algorithm to minimize communication cost and eliminate redundant pair
evaluations within PQEq+ subroutine. Here, we also develop a new SC approach to handle many-
body renormalized n-tuple computation in PQEq+ due to the interaction with surrounding atoms,
which has never been addressed by SC algorithm before. Second, we store-and-reuse unchanged
coefficients across multiple CG iterations, thereby improving the efficiency of SC computation.
Details of these approaches are discussed below.
The SC algorithm utilizes 3-step communication, thereby significantly reducing both bandwidth
and latency costs compared to the 6-step communication (i.e., halo exchange) of the conventional
full-shell (FS) method [53]. The computation pattern of the SC algorithm also eliminates redundant pair
evaluations. Evaluating the Coulombic potential using the SC algorithm is rather straightforward. The
computation follows the SC computation pattern for two-body interactions, which includes pairs of
resident atoms and pairs of atoms in the neutral territory (NT) [54]. Typically, the contribution from
pairs in the NT region needs to be sent back to their resident nodes. However, this return of contributions is
not required here, since only the overall sum of Coulombic potential contributions from
every node is needed, which can be efficiently obtained using the MPI_Allreduce function of
the message passing interface (MPI) standard. Nevertheless, the subsequent computation after
obtaining the total Coulombic potential requires special treatment for many-body computation within
the CG subroutine. Namely, computation of the charge gradient g(i) for a particular atom i requires a sum
of contributions over all neighbor atoms of i:

g(i) = \sum_{j \in \mathrm{NBR}(i)} H_{ij} q_j    (2.6)

where H_ij denotes the charge Hessian of atom pair i and j, and NBR(i) is the neighbor list of
atom i. Computing g(i) involves many-body renormalization because it requires H_ij contributions
from all neighbor atoms in NBR(i); see Fig. 2.3(a). However, in SC computation, atoms near the
lower domain boundary may not have complete neighbor-atom information in the node where they reside.
In fact, the neighbor-atom information is only completed by combining partial neighbor lists (PNBR)
from multiple computing nodes. As such, the gradient calculation in the SC algorithmic framework, g_SC(i),
is defined as

g_\mathrm{SC}(i) = \sum_{j \in \mathrm{PNBR}^\mathrm{RES}(i)} H_{ij} q_j + \sum_k \left( \sum_{j' \in \mathrm{PNBR}_k^\mathrm{NT}(i)} H_{ij'} q_{j'} \right)    (2.7)

where PNBR^RES(i) denotes the partial neighbor list of atom i in the node where atom i resides, and
PNBR_k^NT(i) denotes the partial neighbor list of atom i in the NT region of node k, which is not the
same node where i resides. Therefore, to complete the g(i) calculation in the SC framework, the partial g(i)
contribution based on PNBR^NT(i) must be sent back to the node that owns i (the resident node); see Fig.
2.3(b). Although this incurs extra communication (an additional 3-step communication for returning
the g_SC(i) contributions), this is still no larger than the 6-step communication of the conventional FS method.
Furthermore, the FS computation pattern incurs substantial computational overhead from redundant
pair evaluations to build a complete NBR on every computing node. In contrast, the SC algorithm
performs only the essential pair evaluations without redundancy. This is expected to significantly
reduce the running time inside the PQEq+ subroutine, while maintaining the same communication cost.
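At its core, Eq. (2.6) is a sparse matrix-vector product over the neighbor list, as the following serial sketch shows (a CSR-like layout is assumed; under RSC the inner sum is simply split between PNBR^RES and PNBR^NT, with the NT partial sums returned to the resident node as described above):

#include <vector>

// Charge-gradient evaluation of Eq. (2.6): g(i) = sum over j in NBR(i) of H_ij * q_j.
// nbr_start[i]..nbr_start[i+1] index into nbr_j (neighbor atom IDs) and hess (H_ij values).
void charge_gradient(const std::vector<int>& nbr_start, const std::vector<int>& nbr_j,
                     const std::vector<double>& hess, const std::vector<double>& q,
                     std::vector<double>& g) {
    for (std::size_t i = 0; i < g.size(); ++i) {
        double gi = 0.0;
        for (int m = nbr_start[i]; m < nbr_start[i + 1]; ++m)
            gi += hess[m] * q[nbr_j[m]];
        g[i] = gi;   // in RSC, the returned PNBR^NT partial sums are added here
    }
}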
Figure 2.3: Illustration of many-body renormalized n-tuple computation of g(i) based on (a) conventional full
neighbor list (NBR) and (b) partial neighbor lists (PNBR^RES and PNBR^NT) used in the RSC algorithm.
For each PQEq+ iteration, the Coulombic potential energy between atoms i and j based on Eq. (2.1)
takes the following form:

E_Coulomb(i,j) = E_core-core(i,j) + E_core-shell(i,j) + E_core-shell(j,i) + E_shell-shell(i,j)
E_core-core(i,j) = T(r_ic,jc) C_ic,jc(r_ic,jc) q_i q_j
E_core-shell(i,j) = -T(r_ic,js) C_ic,js(r_ic,js) q_i Z_j    (2.8)
E_shell-shell(i,j) = T(r_is,js) C_is,js(r_is,js) Z_i Z_j

Here, T(r_ia,jb) and C_ia,jb(r_ia,jb) are computed using a costly table lookup as a function of
the core/shell distance. However, only q_i changes across PQEq+ iterations, while T(r_ia,jb) and
C_ia,jb(r_ia,jb) remain unchanged (i.e., the atomic/shell positions are fixed). Therefore, we save
considerable computation time by storing these four unchanged coefficients T(r_ia,jb) C_ia,jb(r_ia,jb)
for each atomic pair throughout the PQEq+ iterations.
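The store-and-reuse idea can be sketched as follows; taper() and gauss_coulomb() are hypothetical stand-ins for the tabulated T and C lookups, and the struct layout is illustrative rather than the RXMD data structure:

#include <vector>

struct PairCoeff { double cc, cs, sc, ss; };  // T(r)*C(r) for core-core, core-shell, shell-core, shell-shell

// Placeholder lookups standing in for the actual tabulated T(r) and C_{ia,jb}(r).
inline double taper(double r)                         { return 1.0; /* placeholder */ }
inline double gauss_coulomb(double r, int ta, int tb) { return 1.0; /* placeholder */ }

// Evaluated once per pair when positions are fixed (start of the PQEq+ phase).
PairCoeff precompute_pair(double r_cc, double r_cs, double r_sc, double r_ss, int ti, int tj) {
    return { taper(r_cc) * gauss_coulomb(r_cc, ti, tj),
             taper(r_cs) * gauss_coulomb(r_cs, ti, tj),
             taper(r_sc) * gauss_coulomb(r_sc, ti, tj),
             taper(r_ss) * gauss_coulomb(r_ss, ti, tj) };
}

// Reused in every PQEq+/CG iteration, where only the charges change (cf. Eq. (2.8)).
double pair_energy(const PairCoeff& c, double qi, double qj, double Zi, double Zj) {
    return c.cc * qi * qj - c.cs * qi * Zj - c.sc * qj * Zi + c.ss * Zi * Zj;
}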
2.3 Parallel Implementation and Performance Optimizations
We have implemented RMD simulation based on the new algorithmically-accelerated
ReaxPQ+ model in a scalable parallel RMD code named RXMD [39]. In this code, computations
are parallelized using spatial decomposition, where the simulated system is decomposed into
spatially localized subsystems and each processor is assigned computations associated with one
subsystem. Message passing is used to exchange necessary data for the computations between
processors, utilizing the MPI standard. Specifically, before computing the forces on atoms in a
subsystem, atomic positions within the interaction cutoff radius within the boundaries of the
neighboring subsystems are copied from the corresponding processors (i.e., inter-process atom
caching). After updating the atomic positions according to time-stepping, some atoms may have
moved out of their subsystem. These moved-out atoms are migrated to the proper neighbor processors
(i.e., inter-process atom migration). The RXMD code is written in Fortran 90.
For large granularity (the number of atoms per spatial subsystem, N/D > 10^2), spatial
decomposition (i.e., each processor is responsible for the computation of the forces on the atoms
within its subsystem) suffices. For finer granularity (N/D ~ 1), on the other hand, neutral-territory
(NT) [54] or other hybrid decomposition schemes are more efficient. As discussed in section 2.2, we
use the SC scheme [46], which is a generalization of NT for general dynamic n-tuple computation.
Grain size for each MPI rank is limited by the cutoff length of interaction. To further
accelerate the computation within MPI rank, we introduce an additional layer of shared-memory
parallelism using the Open Multi-Processing (OpenMP) application programming interface. This
hierarchical MPI+OpenMP implementation allows RXMD to take advantage of the simultaneous
multithreading support provided by modern processors to achieve better utilization of the
computing resources within each processor. With multithreading, the most computationally
expensive bond-order and force computations within RXMD are greatly accelerated, serving to
reduce the overall runtime. A secondary benefit of multithreading is that it allows MPI ranks to be
exchanged for local threads, thereby reducing the total number of ranks in a MPI job and similarly
reducing the communication and atom-caching overheads at large scales. To obtain the best time-
to-solution (T2S) on each Intel Knights Landing (KNL) node, we performed timed runs for several
possible configurations of MPI ranks and OpenMP threads, as shown in Fig. 2.4(a). In this test, the
product of the number of MPI ranks and that of OpenMP threads is fixed to 64. The best T2S was
observed for the combination of 16 MPI ranks and 4 OpenMP threads on each node, which will be
used in subsequent benchmark tests.
Figure 2.4: (a) Runtime comparison between different configurations of MPI ranks and OpenMP threads on
1 node to identify the optimal combination. Total number of processes is 64. (b) Average runtime per step for
the original ReaxPQ+ and new ReaxPQ+SC implementations on 96 and 1,440 cores. ReaxPQ+SC reduces
time-to-solution 5.0- and 5.7-fold below ReaxPQ+ for 96- and 1,440-core benchmarks, respectively. The
numbers denote runtimes.
2.3.1 Performance Improvement by Algorithm Accelerations
We first test the effect of SC acceleration. Fig. 2.4(b) shows the average runtime per time
step on 96 cores (left) and 1,440 cores (right). In each case, the left and right bars show the total
runtime without (ReaxPQ+) and with (ReaxPQ+SC) SC acceleration, respectively. The figure also
shows partial runtimes for polarizable charge-equilibration (PQEq) calculation, which has become
the speed-limiting step in the new ReaxPQ+ model in order to achieve the improved accuracy
demonstrated in section 2.2.1, as well as for non-bonded (ENbond) and other force calculations. We
observe that the SC-accelerated ReaxPQ+SC is 5.0 and 5.7 times faster than the original
implementation, ReaxPQ+, on 96 and 1,440 cores, respectively (N/P = 672). We should note that
this significant speedup is a result of the following two improvements. First, the SC framework
minimized the computation by removing redundant pair calculations, which is especially beneficial
for fine granularity (i.e., small number of atoms per computing node). The reduced communication
in ReaxPQ+SC is highlighted by the increase in relative performance over ReaxPQ+ from 5.0- to 5.7-
fold when the number of cores increases from 96 to 1,440. Second, the computation was further
reduced by reusing Coulombic coefficients.
2.3.2 Improving Thread Scalability by Round-Robin Data Privatization
In MD simulations like RMD, interatomic forces need to be updated at every time step,
which consumes substantial computing time. With shared-memory programming, the force calculation is
parallelized by distributing computations to threads. Because of Newton’s third law of motion, it is
possible that different threads compute pair interactions involving a common atom and update the
force on that atom simultaneously, thereby causing a memory conflict. To avoid such race conditions,
OpenMP provides the atomic directive to protect such concurrent memory accesses.
In the RXMD code, interatomic forces for all atoms within an MPI process are stacked in
an array f of length n, where n is the total number of atoms per MPI rank. Array f is shared and
atomically accessed by threads in the force-calculation module (Fig. 2.5(a)). However, the overhead of
concurrent memory access to array f can become a performance bottleneck as the simulation size
grows. We observed that the cumulative CPU time of atomic operations accounts for 40% of the
total runtime. Fig. 2.5(b) shows how a thread-privatization scheme addresses this issue [55].
Specifically, the program allocates a private copy, f_private, of array f for each thread and accesses
f_private individually. When the force on a common atom needs to be updated
by more than one thread, these threads write partial results to their own arrays concurrently, thereby
eliminating atomic operations.
Figure 2.5: (a, b) Data privatization. All threads write to the global force array using atomic operations (a), or
each thread writes to private array (b). (c) Round-robin reduction with non-overlapping chunks for
improving data privatization.
Once the force module finishes all of its computation, we need to aggregate the partial
results to the global array, for which we adopt a round-robin approach. As shown in Fig. 2.5(c),
f_private is divided into p chunks of equal length (the last chunk absorbs any remainder), where p
is the number of threads. In every round, each thread is in charge of a unique chunk and transfers
data to the global array at the corresponding location. It then circles through the array chunk by
chunk until the entire f_private is transferred. This data reduction is implemented in a single-
instruction multiple-data (SIMD) manner. The non-overlapping chunk design prevents memory
conflicts, while maintaining thread scalability by enhancing cache locality. Though data
privatization requires additional memory allocation that grows as O(np) (p is the number of
threads), it is justified for the target application of computational synthesis. Here, minimizing T2S
is essential for describing long-time growth process, rather than simulating the largest possible
system to saturate the memory.
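A compact sketch of the privatization and round-robin reduction is given below; the array names follow Fig. 2.5, while the chunking details and function signature are illustrative:

#include <omp.h>
#include <vector>

// Reduce per-thread private force arrays f_private[t] into the shared array f.
// In each round, thread t owns chunk (t + round) % p, so no two threads ever
// write to the same region and no atomic operations are needed.
void round_robin_reduce(std::vector<double>& f,
                        const std::vector<std::vector<double>>& f_private) {
    const int p = static_cast<int>(f_private.size());
    const std::size_t n = f.size();
    const std::size_t chunk = n / p;                 // last chunk absorbs the remainder
    #pragma omp parallel num_threads(p)
    {
        const int tid = omp_get_thread_num();
        for (int round = 0; round < p; ++round) {
            const int c = (tid + round) % p;
            const std::size_t lo = static_cast<std::size_t>(c) * chunk;
            const std::size_t hi = (c == p - 1) ? n : lo + chunk;
            #pragma omp simd
            for (std::size_t k = lo; k < hi; ++k)
                f[k] += f_private[tid][k];
            #pragma omp barrier                      // keep rounds in lock step
        }
    }
}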
2.3.3 Promoting Data Parallelism by Data Alignment
The Coulombic energy is updated in every PQEq+ iteration. In the algorithm in Table 2.2,
the procedure contains a doubly-nested loop over atoms i and j. The outer loop over i is mapped to
parallel threads via OpenMP loop parallelism. The inner loop over j in the original RXMD code
was executed sequentially. To promote data parallelism (SIMD) on the AVX-512 vector lanes (see the
experimental platform in section 2.4.1), we apply SIMD operations to the j loop. However, atoms j in
the neighbor list of atom i are irregularly stored, and accordingly successive data accesses by index
j are non-contiguous in memory, resulting in inefficient cache performance. Here, we notice two
facts: the PQEq+ neighbor list is only updated once per time step (the outermost loop); and the indices
in the neighbor list are closely distributed, albeit randomly stored, in a number of groups. To take
advantage of this property, we use quicksort to rearrange the neighbor list in every PQEq+
initialization phase such that the atoms j are sorted in ascending order, so that irregular access is avoided.
Fig. 2.6 illustrates the modified neighbor list. After sorting, groups of atoms with sequential indices
are placed adjacently in memory, and the subsequent computation can therefore leverage SIMD operations
on our computing platform.
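A sketch of this transformation, with illustrative names: each atom's neighbor list is sorted once at PQEq+ initialization, after which the inner j loop walks memory in ascending index order and can be handed to the compiler's vectorizer. The cached coefficients are assumed to be built after sorting, so coeff[m] pairs with nbrs[m].

#include <algorithm>
#include <vector>

// Sort each atom's PQEq+ neighbor list in ascending atom-index order (done once per step).
void sort_neighbor_lists(std::vector<std::vector<int>>& nbrlist) {
    for (auto& nbrs : nbrlist) std::sort(nbrs.begin(), nbrs.end());
}

// Inner j loop of the get_hsh-style update for one atom i: with sorted indices the
// gathers from q are mostly unit-stride, so the loop vectorizes well on AVX-512.
double coulomb_row(const std::vector<int>& nbrs, const std::vector<double>& coeff,
                   const std::vector<double>& q) {
    double acc = 0.0;
    #pragma omp simd reduction(+:acc)
    for (std::size_t m = 0; m < nbrs.size(); ++m)
        acc += coeff[m] * q[nbrs[m]];   // coeff[m] holds the cached T*C for pair (i, nbrs[m])
    return acc;
}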
Table 2.2: PQEq get_hsh procedure in RXMD.
1. do atoms i
2. do atoms j in i's PQEq+ neighbor list
3. update Coulombic energy
Figure 2.6: Sorted neighbor-list followed by SIMD operation.
Table 2.3 shows a piece of sorted neighbor list taken from an actual simulation and its
corresponding stride. The unit stride (value of 1) indicates successive placement of indices.
Table 2.3: Sorted polarized neighbor list and stride.
Index in Neighbor-list
[888, 889, 890, 891, 892, 893, 894, 895, 896, 954, 962, 978, 998, 1002, 1017, 1018, 1024, 1096, 1112,
1145, 1146, 1160, 1180, 1192, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1232, 1234,
1235, 1236, 1237, 1238, 1239, 1240, 1241, 1242, 1243, 1244]
Stride
[1, 1, 1, 1, 1, 58, 8, 16, 20, 4, 15, 1, 6, 72, 16, 33, 1, 14, 20, 12, 26, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1]
2.3.4 Additional Code Transformations and Performance Improvement
In the original RXMD code, the distance between atom pairs is computed when creating the
neighbor list, but this data is discarded immediately, even though the same distance will be
recalculated frequently in the subsequent force module. To eliminate these redundant computations,
we modify the data structure by packing the atomic-pair distance into the neighbor list. Table 2.4
illustrates the modification of the neighbor list and the later use of the distance data in the force-
computation module. When accessing the index of a neighboring atom, the program can directly
fetch dr_ij from memory without any redundant computation.
Table 2.4: Packing atomic-pair distance into neighbor list.
In Neighbor-list module
1. for each cell
2. for atoms i in this cell
3. for atom j in the supercell
4. compute distance dr ij of i and j
5. if dr ij < cutoff range
6. nbrlist(i) stores (j, i, dr ij)
In Force module, 2-body case
9. for atom i
10. for atom j in nbrlist(i)
11. fetch dr ij from nbrlist
12. force calculation using dr ij
To assess the effect of several performance optimizations discussed in this section, we run
a simulation of MoS2 crystal on 512 Intel Xeon Phi KNL nodes with 10 simulation steps. Fig. 2.7
compares the wall-clock time of the original code (left) with that of our optimized version (right)
over a set of ReaxPQ+ force functions, including Enbond (nonbonded), Elnpr (lone-pair), Ehb
(hydrogen-bond), E3b (3-body) and E4b (4-body) interactions. The most time-consuming function,
E3b, has become 5.57 times faster, and the aggregate time of force-computation module has
achieved 4.38-fold speedup over the original version. Overall, these performance optimizations
have achieved 1.55-fold speedup of the entire simulation runtime.
Figure 2.7: Performance improvement after data privatization and other code transformations. The numbers
denote runtimes.
2.4 Scalability and Time-to-Solution
2.4.1 Experimental Platform
We perform benchmark tests of the RXMD code with the accelerated ReaxPQ+ model on
Theta — an Intel Xeon Phi Knights Landing (KNL) based supercomputer [56]. Each computing
node contains 64 cores. Each core in turn has 32 KB L1 instruction cache, 32 KB L1 data cache,
and two independent floating-point units capable of executing 512-bit wide SIMD instructions.
The peak instruction throughput of the KNL microarchitecture is 2 instructions per clock cycle,
and they can be issued back-to-back from the same thread. Two cores and a shared 1 MB L2 cache
form a tile, and the tiles are connected with a 2D-mesh network-on-chip with 16 GB of high-
bandwidth in-package multichannel DRAM memory (MCDRAM) and 192 GB of regular DRAM.
There are two memory modes of execution. In the cache mode, the MCDRAM is used as a cache
to DRAM; while in the flat mode, both MCDRAM and DRAM form a single linear memory space.
We tested our code in both flat and cache modes but observed no significant difference. The following
benchmarks are performed in cache mode.
2.4.2 Weak and Strong Scaling on Theta
We perform an isogranular-scaling test of the ReaxPQ+ adapted RXMD code with hybrid
MPI+OpenMP implementation on Theta, in which the number of atoms per MPI rank N/P is kept
constant. We measure the wall-clock time per simulation time step with scaled workloads —
24,576-atom MoS2 system on each core. By increasing the number of atoms linearly with the
number of cores, the wall-clock time remains almost constant, indicating excellent scalability. To
quantify the parallel efficiency, we first define the speed of the RXMD code as a product of the
total number of atoms and the number of RMD time steps executed per second. The isogranular (or
weak-scaling) speedup is given by the ratio between the speed on P cores and that on 64 cores as a
reference system. With the granularity of 24,576 atoms per core, the parallel efficiency is 0.989 on
131,072 cores for a 3,221,225,472-atom system, shown in Fig. 2.8(a). This demonstrates a very
high scalability of the ReaxPQ+ adapted RXMD code.
We next perform a strong-scaling test by simulating MoS2 with a total of 50,331,648 atoms.
In this test, the number of cores ranges from P = 2,048 to 32,768, while keeping the total problem
size constant. We measure the wall-clock time per RMD time step as a function of P cores. The
runtime is reduced by a factor of 12.26 on 32,768 cores compared with the 2,048-core run (i.e.,
using a 16-times larger number of cores). This signifies a strong-scaling speedup of 12.26, with the
corresponding strong-scaling parallel efficiency of 0.766. Fig. 2.8(b) shows the measured strong-
scaling speedup as a function of the number of ranks (blue line), while the black line denotes the
ideal speedup. It is more difficult to achieve high strong-scaling parallel efficiency compared with
weak-scaling parallel efficiency, as the comparison of Fig. 2.8, (a) and (b), suggests. This is due to
the surge of communication/computation ratio as the workload per rank shrinks proportionally.
With a system size 64 times smaller than that of the weak-scaling test, the observed strong-scaling parallel
efficiency is considered excellent.
2.4.3 Time-to-Solution Improvement
Our target application is computational synthesis of MoS2 monolayer from MoO3 and S2
reactants. Due to the long time of the reaction process, no previous RMD simulation has been able
to complete the reaction to obtain the MoS2 product. Reduced T2S is the key figure of merit for
this purpose. Overall, the algorithmic acceleration in section 2.2 has resulted in a speedup of factor
5.0 over the original RXMD code for the ReaxPQ+ model. In addition, the series of code
transformations on Theta in section 2.3 has achieved an additional speedup of factor 1.55. Overall, the
algorithmic acceleration and performance optimization have achieved a speedup of factor
5.0 × 1.55 = 7.75; namely, T2S has been reduced to 12.9% of that of the original
ReaxPQ+ adapted RXMD code.
Figure 2.8: (a) Wall-clock time of the ReaxPQ+ adapted RXMD code, with scaled workloads — 24,576P-
atom MoS 2 on P cores (P = 64, ..., 131,072) of Theta. (b) Strong-scaling speedup of the ReaxPQ+ adapted
RXMD code with a fixed problem size — 50,331,648-atom MoS 2 system on P cores (P = 2,048, ..., 32,768)
of Theta. The measured speedup values (blue) are compared with the ideal speedup (black). The
numbers denote speedups.
2.5 Machine-Learning Guided Computational Synthesis of
Atomically Thin Layered Materials
The accurate description of atomic charges and polarization by the new ReaxPQ+ model,
combined with the drastic T2S reduction due to algorithmic acceleration and performance
optimization in the previous section, has opened a new avenue for computational synthesis of novel
materials. This section demonstrates the capability of the resulting parallel RMD code for
computational synthesis of atomically thin layered materials.
We focus on atomically-thin layered materials (LMs) that have unique electronic structures
and mechanical and transport properties not found in their three-dimensional counterparts, which
makes them attractive templates for future functional devices [57]. The primary synthesis technique
for fabrication of LMs is chemical vapor deposition (CVD), where one or more reaction precursors
in the gas phase undergo chemical reactions at elevated temperatures inside a reaction chamber and
the reaction product is deposited on a substrate in a colder region of the chamber [58]. RMD
simulations can provide valuable inputs to rational optimization of CVD growth conditions if
adequate length (1,000 Å) and time (10 ns) scales can be covered.
We simulated CVD synthesis of molybdenum disulfide (MoS2), a prototypical 2D
semiconductor, on 242,144 Intel Xeon Phi cores. MoS2 can be formed by the reaction of MoO3 and
S2 precursors. The initial configuration (Fig. 2.9(a)) consists of ~1.14 million atoms (129,472 Mo,
396,032 O and 620,032 S atoms) in a 1,037 Å × 1,080 Å × 145 Å simulation cell. Fig. 2.9, (b) and
(c), shows computational synthesis of MoS2 monolayer by CVD and subsequent annealing. A pre-
sulfurized MoO3 sample was thermalized at 3,000 K, and then quenched to 1,000 K in 2.2 ns.
During annealing, the system was thermalized at 1,500 K for 2 ns, then quenched to 1,000 K in 1
ns. We repeated the annealing cycle twice.
We used a machine-learning approach to identify key reaction pathways. Fig. 2.9(d) shows
the feed-forward neural network (FNN) model [59, 60] we have developed to identify and classify
atomic structures into 1T-crystalline (green), 2H-crystalline (red) and disordered (blue) structures
in the synthesized MoS2 crystal. In the input layer, the local environment for each atom is
represented by a 60-dimension feature vector consisting of radial and angular symmetry functions
[61]. The first, second and third hidden layers consist of 350, 100 and 50 hidden units, respectively.
The ReLU activation function was used in the first and second layers, while a sigmoid function was used in
the third layer. We trained the model on a 36,000-sample simulation dataset by minimizing a softmax
cross-entropy loss with the Adam optimizer. The results reveal a novel growth mechanism of the 2H crystal
mediated by a metastable 1T crystalline phase (Fig. 2.9(b)). Such atomistic information is
indispensable for guiding experimental CVD synthesis with improved crystallinity.
Figure 2.9: Machine learning-guided computational synthesis of MoS 2 monolayer by CVD. (a) Simulation
setup showing a MoO 3 monolayer suspended in S 2 gas. The atoms are colored as Mo: blue, S: yellow, and O:
red. (b, c) Close-up of MoS 2 monolayer before (b) and after (c) annealing. The local structures are classified
into 1T (green), 2H (red) and disordered (blue) phases. For clarity, gaseous environment is not shown. (d)
Neural network model for defect identification and classification.
2.6 Summary
To perform large RMD simulations incorporating dielectric reorganization of materials, we have
proposed a new generalized polarizable reactive force-field (ReaxPQ+) model. The increased
accuracy of ReaxPQ+, along with the reduced time-to-solution achieved by algorithmic and
computational innovations, has for the first time allowed purely computational synthesis of
atomically-thin transition-metal dichalcogenide layers assisted by machine learning. This new
capability opens up an exciting possibility of future computational synthesis of yet-to-exist layered
materials with desired properties. As such, layered materials genome has been chosen as one of the
10 designated applications of the United States’ first exaflop/s computer named A21 when it is
introduced in 2021 [48]. The computational approaches developed in this paper will likely play an
important role in the exascale materials genome. Since ReaxPQ+ is applicable to a wide variety of
elements in the periodic table, the current approach extends to a much broader range of applications in science
and engineering.
Chapter 3
Machine Learning in Material Simulations
3.1 Introduction
3.1.1 Dielectric Polymer Genome
Development of high-performance dielectric polymers is urgently needed for the
advancement of a wide range of energy technologies, including energy-storage and pulsed-power
technologies. Recent advent of materials genome (i.e., applying informatics to design new
materials significantly faster than the conventional trial-and-error approach) has enabled rational
design of dielectric polymers, thereby heralding an exciting era of polymer genome. A major
performance indicator of dielectric polymer is dielectric constants. While dielectric genome has
previously been proposed for inorganic crystals, high computational costs of quantum-
mechanically evaluating dielectric constants make it still a challenge. The situation is even worse
for polymers, where computing is far more expensive since complex chemical and morphological
features essentially dictate their dielectric constants. As a result, the highly-anticipated “dielectric
polymer genome” still remains elusive.
This computational challenge may be addressed by recent developments in first principles-
informed reactive molecular dynamics (RMD) simulations based on polarizable reactive force
fields (ReaxPQ) [10-12]. The ReaxPQ model, based on a polarizable charge equilibration (PQEq)
scheme, has significantly improved the accuracy for predicting dielectric constants in orders-of-
magnitude shorter computational times compared with quantum-mechanical (QM) calculations. In
this paper, we further improve the predictive power of ReaxPQ by introducing a new valance (or
charge state)-aware ReaxPQ (ReaxPQ-v) model, thereby achieving near quantum accuracy for
predicting dielectric constants of polymers. We incorporate ReaxPQ-v into our high-throughput
computational synthesis framework of polymers [13]. We thereby construct a large dataset of
structurally-diverse 1,276 dielectric polymers with quantum accuracy, which represents a frontier
of dielectric polymer genome research.
In order to explore the large combinatorial design space of dielectric polymers, the above
approach alone is not sufficient. It is essential to further exploit recent advancements in machine
learning (ML) approaches to molecular and materials sciences [14], in particular deep learning
(DL) [15]. For example, various artificial neural networks have been used successfully to identify
structure-property relationships in materials, including graph neural networks (GNN) [16, 17] and
recurrent neural networks (RNN) [18]. Here, it is notable that a wide availability of public ML
tools has now democratized ML tasks that could utilize the above-mentioned state-of-the-art data
set. As a proof of concept of the use of our new dielectric polymer data set, ML models of
progressive complexity—multi-layer perceptron (MLP), random forest and RNN—were applied.
In combination with the ReaxPQ-v based computational framework, MLP and random forest
models have achieved decent accuracy for predicting the dielectric constants of polymers, while
the RNN model is under tuning.
3.1.2 Layered Material Phases
Machine learning (ML) has demonstrated broad applicability in image recognition, natural
language processing and many other areas involving real-world data. Such success has inspired
chemists to apply ML, especially deep learning, to better understand chemical processes [65, 66].
With the tremendous amount of data generated by computer simulations, deep neural networks in
particular have shown significant power in many chemistry tasks, including the prediction of
properties [67], drug discovery [68] and understanding of quantum molecular dynamics [69]. Most
of the existing works deal with small organic molecules, where the datasets are formed by a
collection of molecular graphs with adjacency matrix describing the connectivity of atoms. Such
graph-based datasets require a unique deep-learning architecture to capture the additional
structured information. Graph neural networks (GNN) [62], a variation of the widely used
convolutional neural networks (CNN) [70], process graph data as nodes and edges in a learnable
fashion. Ref. [72] generalized molecular feature extraction methods by a series of differentiable
functions to produce low-dimensional molecular fingerprints. Ref. [70] integrated edge
representations into the learning functions to capture more structural features from the graph. Ref.
[73] applied gated recurrent units to update the hidden state during the learning phase, so that
sequence-based methods (e.g., LSTM) can also be injected in graph models.
More recently, these GNN applications to isolated molecules have been extended by
materials scientists to handle infinitely-repeated crystal structures [69]. However, this crystal-
graph CNN has been limited to periodically-repeated, small crystalline unit cells each involving a
few atoms. In this paper, we extend the applicability of GNN to general material graphs composed
of millions of nodes. This is a challenge since material structure often is a mixture of various
crystalline phases and defects, which are interconnected to each other via bonds, resulting in a
highly-complex, massive graph. Here, we propose a new variant of GNN to identify different
phases inside an atomically-thin molybdenum disulfide (MoS2) monolayer that is computationally
synthesized by reactive molecular dynamics (RMD) simulation [69], mimicking
experimental chemical vapor deposition (CVD). MoS2 is an archetype of atomically-thin layered
materials, for which ML has extensively been applied [68]. Here, our model analyzes the local
graph topology around each atom, and classifies it into crystalline 2H or 1T structures as is shown
in Fig. 3.1.
Figure 3.1: Top views of 2H (left) and 1T (right) crystalline phases of MoS2 monolayer. Magenta and yellow
spheres represent Mo and S atoms, respectively, whereas Mo-S bonds are represented by cylinders.
3.2 Method
3.2.1 Multiple Layer Perceptron
Reactive molecular dynamics (RMD) simulation based on first principles-informed
reactive force fields (ReaxFF) significantly reduces the computational cost of quantum-
mechanically simulating chemical reactions in materials. A recently-proposed polarizable reactive
force fields (ReaxPQ) model extends ReaxFF by accurately describing dielectric properties based
on a polarizable charge model, in which each atom consists of core and shell charges that are
connected by a spring. Unfortunately, advanced dielectric polymers contain atomic species at
various charge states (or valences), making accurate prediction of dielectric constants very hard.
Here, we introduce a valence-aware ReaxPQ (ReaxPQ-v) model to accurately describe the
dielectric response of amorphous polymers based on the quantum-mechanically informed atomic
polarizability. We first determine the polarizability of constituent atomic species quantum
mechanically, using the Hartree-Fock method with def2-svpd basis sets. These polarizability
values for different valences are then used to estimate the spring constants between core and shell
charges. The new ReaxPQ-v model has been implemented in our scalable parallel RMD simulation
engine called RXMD.
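As a hedged illustration (not the actual ReaxPQ-v parameterization, whose details are not reproduced here), a core-shell (Drude-like) model relates the atomic polarizability α, the shell charge q_s and the spring constant k through α = q_s²/k, so the quantum-mechanically computed polarizabilities can be converted to spring constants along the following lines:

```python
# Hedged sketch: estimate a core-shell spring constant from an atomic polarizability,
# assuming the standard core-shell (Drude) relation alpha = q_shell**2 / k.
# The numerical values are placeholders, not actual ReaxPQ-v parameters.
def spring_constant(alpha, q_shell):
    """alpha: atomic polarizability; q_shell: shell charge (consistent units assumed)."""
    return q_shell ** 2 / alpha

k_example = spring_constant(alpha=1.2, q_shell=-2.0)  # placeholder values
```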
We generate amorphous polymer structures using our computational synthesis framework.
The framework first generates a single polymer chain using a Simplified Molecular Input Line
Entry System (SMILES) string and Open Babel software. Multiple polymer chains are then placed
at sufficiently large distances from each other in a simulation box. The RMD simulation system is
subjected to a number of consolidation and relaxation steps, until the system reaches a desired
density. Following this procedure, we constructed a synthetic dataset of structurally-diverse 1,276
polymers. Dielectric constants of these polymers are then computed with the ReaxPQ-v model,
using a high-throughput, parallel RMD simulation workflow. The new data set composed of a
large number of amorphous polymer structures, augmented with quantum-accuracy dielectric
constants, provides an indispensable test ground for developing predictive ML models of polymer
dielectric constants.
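As a minimal sketch of the chain-generation step under stated assumptions (the Open Babel Python bindings, pybel, are used here; the SMILES string and force-field settings are placeholders, and the actual framework may differ in its tools and options), a single repeat unit can be converted from a SMILES string to an initial 3D geometry as follows:

```python
# Hypothetical sketch: SMILES string -> initial 3D geometry via Open Babel (pybel).
# The actual computational-synthesis workflow may use different tools and settings.
from openbabel import pybel

smiles = "C(=O)Oc1ccccc1"                    # placeholder SMILES string for a repeat unit
mol = pybel.readstring("smi", smiles)
mol.addh()                                   # add explicit hydrogens
mol.make3D(forcefield="mmff94", steps=500)   # generate an initial 3D conformation
mol.write("pdb", "monomer.pdb", overwrite=True)  # write coordinates for downstream RMD input
```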
To predict the dielectric-constant values from SMILES inputs, we have thus far tested three
ML models: multi-layer perceptron (MLP), random forest and recurrent neural network (RNN).
The MLP algorithm has played a fundamental role in the evolution of artificial neural
networks and ML at large [26]. Pioneered by Rosenblatt [27], MLP has been transformed from a
binary classifier to an affordable method of solving supervised estimation problems [28]. Here, we
use a two-hidden-layer perceptron estimator, where each layer comprises 1,136 neurons.
Backpropagation is employed for weight adjustment, and a modified rectified linear unit
(RELU) activation function demonstrated the best fit.
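A minimal tf.keras sketch of a perceptron of this size is given below; it is not the authors' exact implementation, the standard ReLU is used in place of the modified activation described above, and the input dimension is a placeholder:

```python
# Minimal sketch of a two-hidden-layer MLP regressor with 1,136 neurons per layer.
# The standard ReLU is used here; the modified-RELU variant described in the text
# (binary-step activation with ReLU gradients) is not reproduced.
import tensorflow as tf

def build_mlp(input_dim):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(1136, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(1136, activation="relu"),
        tf.keras.layers.Dense(1),            # predicted dielectric constant
    ])

model = build_mlp(input_dim=256)             # 256 is a placeholder for the length of the
model.compile(optimizer="adam", loss="mse")  # binarized, zero-padded SMILES representation
```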
To develop a random forest model, we use the Citrination online platform for feature
generation and model construction. Citrination provides auto-generated features that can be used
to develop a model or augment the customized features provided by users. This is particularly
useful for developing a ML model for materials, in which each entry is commonly represented by
chemical formula or SMILES string. Given the training data explained above, we first generate
over 100 features (63 standard and 71 extended features) based on given chemical formula and/or
SMILES strings. The standard set consists of elemental properties, molecule features and analytic
features. The extended set contains elemental properties and molecule features. These features
include electronic configurations of constituent atoms (e.g., the number of valence electrons),
which are readily available from the periodic table. In addition, quantum-mechanical calculations
in the framework of density functional theory (DFT) [29-32] are used to produce additional data
such as the electronic energy. We also generate 26 additional features based on functional groups
in the polymer dataset and compare the model performance.
Given the training data and features as described above, we next select an ML model. Our
choice in Citrination is random forest, which corrects the overfitting tendency of decision trees
[33]. Random forest constructs an ensemble of tree predictors at training time. Individual trees
sample random vector values independently of each other, but with the same distribution for all
trees in the forest.
In addition to MLP and random forest, we also consider RNN, in which the next run input
is dependent on the previous one, while communicating through hidden and context layers [34].
RNN is commonly used for problems involving data sequences such as text or symbol arrays. The
basic organization of RNN involves three interconnected layers: input, output and hidden nodes.
The context units are propagated from the output of the hidden layers and recurrently connected.
Among themselves, nodes communicate via direct synapses represented with the weighing
parameters. Due to dynamic behavior, RNN modules employ commonly-used activation functions
such as sigmoid and RELU. During the training step, the value of the hidden layer h_v^{(t)} \in \mathbb{R}^m can be derived as

h_v^{(t)} = f(W x^{(t)} + U h_v^{(t-1)} + b_h),    (3.1)

where x^{(t)} is an input vector at step t, W are weighing parameters for a hidden layer, U are weighing parameters for a context layer, b_h \in \mathbb{R}^m are bias parameters of the RNN model, and f is an activation function. In the traditional RNN, the message function y_v^{(t)} \in \mathbb{R}^l is

y_v^{(t)} = V h_v^{(t)} + b_c,    (3.2)

where V are the weighing parameters for output, and b_c \in \mathbb{R}^l denotes bias.
Hochreiter and Schmidhuber developed a long short-term memory (LSTM) unit to avoid
the vanishing and exploding gradient issues that accompany traditional RNNs [35]. Subsequently,
Cho et al. developed a new RNN model called gated recurrent unit (GRU), which employs the
same basic principles as LSTM [36]. Both architectures use gating units denoting the further
outcome of the next hidden layer h_v^{(t+1)} and the computation of y_v^{(t)}. However, the GRU architecture uses fewer filters and operations for the inner-cycle h_v^{(t)} calculations. For the GRU implementation, the update gate z_v^{(t)} and reset gate r_v^{(t)} can be written as

z_v^{(t)} = \sigma(W_z^{(n)} x^{(t)} + U_z^{(m)} h^{(t-1)}),    (3.3)

r_v^{(t)} = \sigma(W_r^{(n)} x^{(t)} + U_r^{(m)} h^{(t-1)}),    (3.4)

where n and m are tensor dimensions. The output value h_v^{(t)} is computed on the basis of the interim state \tilde{h}_v^{(t)}. These gate outputs determine which information is kept or excluded during training:

\tilde{h}^{(t)} = \tanh(U_h^{(m)} (r_v^{(t)} \odot h^{(t-1)}) + W_h^{(n)} x^{(t)}),    (3.5)

and, combining the update gate with the interim value,

h^{(t)} = z^{(t)} \odot h^{(t-1)} + (1 - z^{(t)}) \odot \tilde{h}^{(t)}.    (3.6)

For the present polymer problem, the initial states h_v^{(0)} are set to the input features of oligomers, such as structure and dielectric constant (ε) values. Due to their efficiency and robustness, both LSTM and GRU network architectures have become essential computational tools in various areas, including the study of structure-activity relationships in the life and materials sciences [37].
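For concreteness, a single GRU step following Eqs. (3.3)-(3.6) can be sketched in NumPy as follows; array shapes and weights are illustrative, and this is not the production implementation:

```python
# Illustrative NumPy sketch of one GRU step, Eqs. (3.3)-(3.6).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate, Eq. (3.3)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate,  Eq. (3.4)
    h_tilde = np.tanh(Uh @ (r * h_prev) + Wh @ x_t)  # interim state, Eq. (3.5)
    return z * h_prev + (1.0 - z) * h_tilde          # new hidden state, Eq. (3.6)
```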
3.2.2 Graph Neural Networks for Learning Layered Material
Figure 3.2 presents a high-level schematic of our classification task between 2H and 1T
crystalline phases using GNN. The following subsections explain key components of this learning
architecture, including dataset generation and GNN.
Figure 3.2: Schematic of 2H vs. 1T phase classification by graph neural network.
3.2.3 Dataset Description
We have performed RMD simulation to synthesize MoS2 monolayer by sulfidation of
molybdenum trioxide (MoO3) precursor, followed by thermal annealing. RMD simulation follows
the trajectory of all atoms while computing interatomic interaction using first principles-informed
reactive bond-order and charge-equilibration concepts. We have designed scalable algorithms to
perform large RMD simulations on massively parallel computers [71, 72]. Our RMD produces
polycrystalline MoS2 monolayer, where different regions of the monolayer belong to either 2H or
1T phase of MoS2 crystal. Here, each atom inside the synthesized polycrystal is connected to its
nearest-neighbor atoms by forming bonds with them, which in turn are connected to their nearest
neighbors and so on. Such bond formation between atoms makes the entire MoS2 monolayer a
massive graph consisting of multimillion nodes (atoms) and edges (bonds). Hence, identification
of 2H and 1T phases using a conventional graph-based ML model is not feasible for the entire
graph. To circumvent this problem, we have randomly sampled 66,896 atoms from the total MoS2
monolayer, and for each of these atoms, created a local graph using all their neighbors within a
cutoff radius of 0.8 nm. The rationale behind this choice is that, for each atom, graph structure
generated by its first and second nearest neighbors is able to distinguish its phase (2H or 1T). These
66,896 local graph structures serve as the training data for our neural-network model, consisting of 18,650 2H, 32,446 1T and 16,000 disordered structures.
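A hedged sketch of the local-graph extraction is shown below, assuming an in-memory array of atomic coordinates and using a SciPy k-d tree for the 0.8 nm cutoff search; periodic boundary conditions, atom types and the bond-order edge features of the actual dataset are omitted:

```python
# Hypothetical sketch: carve out atom-centered local graphs from an MD snapshot
# using a 0.8 nm cutoff (file formats and periodic boundaries are not handled).
import numpy as np
from scipy.spatial import cKDTree

def local_graphs(positions, centers, cutoff=0.8):
    """positions: (N, 3) atomic coordinates in nm; centers: indices of sampled atoms."""
    tree = cKDTree(positions)
    graphs = []
    for c in centers:
        idx = tree.query_ball_point(positions[c], r=cutoff)   # neighbors within the cutoff
        sub = positions[idx]
        # simple distance-based adjacency for the local cluster (self-loops removed)
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        adj = (d < cutoff).astype(float) - np.eye(len(idx))
        graphs.append((np.array(idx), adj))
    return graphs
```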
3.2.4 Graph Neural Networks
Graph-based data in general can be represented as G = (V, E), where V is the set of nodes and E is the set of edges. Each edge e_uv ∈ E is a connection between nodes u and v. If G is directed, we have e_uv ≢ e_vu; if G is undirected, instead e_uv ≡ e_vu. Unless specified otherwise, the remainder of this chapter deals with undirected graphs, but we will show that it is trivial to modify our model to process directed graph data. It should be pointed out that, in molecular graphs, the nodes are atoms and the edges are atomic bonds; thus, the two pairs of terms are used interchangeably here.
The goal of GNN is to learn low-dimensional representation of graphs from the connectivity
structure and input features of nodes and edges. The forward pass of GNN has two steps, i.e.,
message passing and node-state updating. The architecture is summarized by the following recurrence relations, where t denotes the iteration count:

m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw}),    (3.7)

h_v^{t+1} = U_t(h_v^t, m_v^{t+1}),    (3.8)

where N(v) denotes the neighbors of v in graph G. The message function M_t takes node state h_v^t and edge state e_{vw} as inputs and produces message m_v^{t+1}, which can be considered as a collection of feature information from the neighbors of v. The node states are then updated by function U_t based on the previous state and the message. The initial states h_v^0 are set to be the input features of atoms, which we will discuss in the next section. Here, we use the normalized adjacency matrix \tilde{A} of the graph coupled with some other features (which will be discussed below) as the edge state. As shown in Fig. 3.3, these two steps are repeated for a total of T times in order to gather information from distant neighbors, and the node states are updated accordingly. GNN can be regarded as a layer-wise model that propagates messages over the edges and updates the states of nodes in the previous layer. Thus, T can be considered to be the number of layers in this model.
The exact form of the message function is

m_v^{t+1} = A_v W_t [h_1^t \ldots h_v^t] + b,    (3.9)
where W_t are weights of GNN and b denotes bias. We use gated recurrent units as the update function:

z_v^t = \sigma(W_z m_v^t + U_z h_v^{t-1}),    (3.10)

r_v^t = \sigma(W_r m_v^t + U_r h_v^{t-1}),    (3.11)

\tilde{h}_v^t = \tanh(W m_v^t + U (r_v^t \odot h_v^{t-1})),    (3.12)

h_v^t = (1 - z_v^t) \odot h_v^{t-1} + z_v^t \odot \tilde{h}_v^t,    (3.13)

where ⊙ denotes element-wise matrix multiplication and σ(·) is the sigmoid function for nonlinear activation.
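A compact NumPy sketch of one GNN layer, combining the message passing of Eq. (3.9) with the gated update of Eqs. (3.10)-(3.13), is given below; the matrix shapes and the orientation of the weight multiplications are illustrative rather than the exact implementation:

```python
# Illustrative NumPy sketch of one GNN layer: message passing, Eq. (3.9),
# followed by the GRU-style node update, Eqs. (3.10)-(3.13).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gnn_layer(A_norm, H, W, b, Wz, Uz, Wr, Ur, Wm, Um):
    """A_norm: (n, n) normalized adjacency; H: (n, d) node states; weight matrices are (d, d)."""
    M = A_norm @ (H @ W) + b                  # aggregate neighbor messages, Eq. (3.9)
    Z = sigmoid(M @ Wz + H @ Uz)              # update gate, Eq. (3.10)
    R = sigmoid(M @ Wr + H @ Ur)              # reset gate,  Eq. (3.11)
    H_tilde = np.tanh(M @ Wm + (R * H) @ Um)  # candidate node states, Eq. (3.12)
    return (1.0 - Z) * H + Z * H_tilde        # updated node states, Eq. (3.13)
```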
Figure 3.3: Repeating message function and update function T times to learn atom representations.
Once GNN learns low-dimensional atom representations, we feed them to a generic ML
model, e.g. fully-connect network, to predict 2H vs. 1T phases as a complete classification task. We
argue that the learned atom representations for the molecular graphs can better interpret the
structural uniqueness of 2H and 1T phases, so that the model can achieve higher predictive
performance. In the next section, we show the experiment results to support this assumption.
3.3 Experiment and Analysis
We have studied the structure of CVD-grown MoS2 monolayer using RMD simulation
(Fig. 3.4). The initial system for RMD simulation consists of MoO3 slab surrounded from top by
S2 gas. The dimensions of the RMD simulation cell are 211.0 × 196.3 × 14.5 nm³ in the x-, y- and z-directions, respectively, and it consists of a total of 4,305,600 atoms. The entire system is subjected
to an annealing schedule, where the system temperature is first increased to 3,000 K, and
subsequently its temperature is quenched to 1,000 K, where it is held for 1 nanosecond (ns). This
is followed by two annealing cycles consisting of a heating step from 1,000 K to 1,600 K for 0.4
ns followed by a thermalization step at 1,600 K for 1.5 ns and a cooling step from 1,600 K to 1,000
K for 0.4 ns. This annealing schedule facilitates the reaction of S2 with MoO3 slab, which results
in the formation of polycrystalline MoS2 monolayer where different regions of the synthesized
MoS2 monolayer belong to either the 2H or 1T phase.
Figure 3.4: Snapshot of computationally-synthesized MoS2 monolayer. Ball-and-stick representation is used to show the positions of atoms and covalent bonds with neighboring atoms. Atoms are color-coded as yellow for sulfur (S), blue for molybdenum (Mo) and red for oxygen (O), respectively.
3.3.1 Input Features and Training Settings
The small molecular graphs carved out from the simulation results carry a number of
properties on atoms as well as bonds. See Table 3.1 for details. These properties are transformed
into vector format and embedded with nodes and edges as the initial states of the GNN model.
Because the number of atoms differs across atom-centered graphs, all input features are
zero-padded up to a common dimension, d = 40.
Table 3.1: Input node and edge features.
Feature | Description | Size
Node: atom type | Mo or S, a one-hot vector | 2
Node: charge | Atomic charge | 1
Edge: distance | Distance between atoms in nm | 1
Edge: bond order | Dimensionless chemical bond order | 1
We randomly shuffle and split the dataset as follows: 50,000 graphs in training set; 5,000
graphs in validation set; and another 5,000 in test set. The remaining 6,896 graphs are not used in
our experiments. The number of layers of our GNN model is T = 2, the batch size is set to 20, and we train the model for a maximum of 100 epochs using Adam with a learning rate of 0.01.
3.3.2 Predictive Performance for Layered Material Phases
To investigate how the input features affect the predictive performance, we perform a series
of experiments with different selections of edge features, while node features are kept fixed.
Results are shown in Table 3.2. In the first trial, we use only node features as inputs, providing no spatial information beyond the adjacency matrix for training. The F1 score of 2H in the test set is only 0.5424, meaning that nearly half of the 2H phases are misclassified. Next, we add edge features to the model and observe improvements in both the 1T and 2H classes. It turns out that distance and bond order have an almost equivalent effect on the GNN. The reason why performance does not increase further when distance and bond order are added together as edge features is that both estimate the bonding between two atoms and hence are highly correlated. High bond-order values mean strong bonding between atoms and small values mean weak bonding, whereas the cutoff-based distance feature makes an absolute decision on whether atoms are bonded or not. Since the bond length between atoms is an important feature for distinguishing these phases, the node/distance-based model gives the highest performance.
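The per-class scores reported in Tables 3.2 and 3.3 can be reproduced with standard scikit-learn metrics, as in the following sketch with placeholder labels and probabilities (the actual evaluation code may differ):

```python
# Sketch of the per-class evaluation behind Tables 3.2 and 3.3.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Illustrative placeholders: true phase labels (0 = 1T, 1 = 2H) and predicted
# class probabilities from the classifier head for a handful of test graphs.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8],
                   [0.4, 0.6], [0.8, 0.2], [0.1, 0.9]])

y_pred = y_prob.argmax(axis=1)
f1_1t, f1_2h = f1_score(y_true, y_pred, average=None)  # per-class F1, as in Table 3.2
auc_2h = roc_auc_score(y_true, y_prob[:, 1])            # ROC-AUC for 2H, as in Table 3.3
auc_1t = roc_auc_score(1 - y_true, y_prob[:, 0])        # ROC-AUC for 1T (one-vs-rest)
```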
Table 3.2: F1 scores for different input features. A higher value signifies better classification accuracy.
input features 1T 2H
node only 0.7821 0.5424
node + edge 0.9321 0.8642
node + distance 0.9391 0.8855
node + bond order 0.9305 0.8761
Figure 3.5 plots the ROC (receiver operating characteristic) curve obtained in one of our
experiments using both node and edge features. The curves of 1T and 2H are close to the upper-
left corner, indicating that the model achieves high accuracy on both classes. In addition, we
calculate a more quantitative measure, ROC-AUC (area under the ROC curve) score, as is shown
in Table 3.3, to verify the performance of the model. We observe similar results as the F1 scores,
which confirms the robustness of our conclusion drawn above.
Figure 3.5: ROC curves for 1T (green) and 2H (red) phases using node and edge features as input, which correspond to the second row in Table 3.3. The x-axis is the false positive rate (FPR) and the y-axis is the true positive rate (TPR).
Table 3.3: ROC-AUC for different input features. Value is in the range of (0, 1). Higher value signifies better
classification accuracy.
input features 1T 2H
node only 0.88 0.87
node + edge 0.93 0.92
node + distance 0.95 0.95
node + bond order 0.93 0.93
3.3.3 Visualization of Hidden Node States
It is still an open and active research area to interpret the learning process of ML models. In
this work, we try to provide some insight on how GNN learns from graphs by visualizing the
evolution of the hidden states during the training phase. As is shown in Fig. 3.6, this molecular
graph has 20 atoms with a dimension of 3 (see Table 3.1 for details) for the initial atom features.
There are several patterns shared by the atoms, for example, the first 7 atoms have very similar
feature encodings. In the second and third layers, GNN expands the dimension to 25, which is a
hyperparameter of the model, while the features of individual atoms gradually diverge compared
to the initial state. Such divergence would make it easier to find the separating hyperplane in the feature
space, so that the model can achieve high classification accuracy.
Figure 3.6: Layer-wise evolution of node states. The x-axis is the dimensions of features (padded to 20) and y-
axis is the number of atoms (zero-padded to 40) in the graph. Each row represents a feature vector of an
atom, and the color of each pixel indicates the value on the unit.
3.3.4 Performance for Dielectric Polymer Genome
We first use the MLP model and run prediction for the entire training set of polymer
samples. For efficient training, the SMILES input is transformed to a binary representation with zero padding. Polymers whose dielectric constants are less than 1.5 are removed from the training set (Fig. 3.7A), as these statistical outliers do not follow the general trend. Additionally, the input representations are examined for duplicated SMILES entries, reducing the total training-set size to 828 (Fig. 3.7B).
Figure 3.7: Dielectric constants (A) and SMILES dimensions (B) of the polymer dataset.
Figure 3.8 compares the parity plot (i.e., model prediction vs. input dataset) of three
MLP models developed using different activation functions: (A) a modified RELU activation function, in which the standard RELU is used for gradient calculation within the backpropagation learning rule, whereas a binary step is used for the activation itself; (B) the standard RELU activation function; and (C) the hyperbolic tangent (Tanh) function. Out of the three, the model obtained with the modified RELU activation function demonstrates an excellent agreement between the first principles-derived dielectric constant values and the model-predicted ones. The root mean squared error (RMSE) is 0.045.
The experiment with the variable training sets demonstrates how the enlargement of the
sample pool affects the overall MLP performance. Figure 3.9 shows a parity plot of MLP model
for the 662- and 546-polymer training datasets. In the RMSE metric, the training set of 662 samples demonstrates higher performance than the set of 546 samples. However, the individual RMSE
values indicate the predictive performance within the training pool to be expectedly better for the
low volume set (0.028 for 546 versus 0.032 for 662 polymers).
Figure 3.8: Model prediction vs. training data of polymer dielectric constants using the MLP model with variable activation functions: (A) modified RELU (RMSE = 0.053); (B) RELU (RMSE = 0.242); (C) Tanh (RMSE = 0.267).
Figure 3.9: Performance of MLP prediction model (Elman type). Comparison of prediction on (A) 662
polymers and (B) 546 polymers training sets.
As explained in the previous section, we have built a random forest model using Citrination
online platform. Prior to model development, we have cleaned the polymer dielectric dataset by
removing unphysical records. We obtained 846 data points in total. The random forest model [33]
is implemented using the Lolo library, where a variety of data types and ML functions are
supported, including continuous and categorical features, regression and classification trees,
jackknife variance estimates, hyperparameter optimizations, validation metrics for accuracy and
uncertainty quantification. Jackknife is an uncertainty estimation method based on resampling data, similar to the leave-one-out approach for cross-validation. Suppose X = (X_1, X_2, X_3, ..., X_n) is a set of observed samples. The jackknife i-th set of samples is defined as the original set without the i-th sample,

X_{[i]} = \{X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n\}.    (3.14)

The estimated value \theta_{(i)} by an estimator S is obtained as

\theta_{(i)} = S(X_{[i]}).    (3.15)
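As a small illustrative sketch (independent of the Lolo implementation), the jackknife bias and standard error of a generic estimator referred to below can be computed as:

```python
# Minimal sketch of jackknife resampling, Eqs. (3.14)-(3.15): leave out the i-th
# sample, re-apply the estimator, and summarize the spread of the estimates.
import numpy as np

def jackknife(samples, estimator=np.mean):
    n = len(samples)
    theta_i = np.array([estimator(np.delete(samples, i)) for i in range(n)])
    theta_bar = theta_i.mean()
    bias = (n - 1) * (theta_bar - estimator(samples))
    std_err = np.sqrt((n - 1) / n * np.sum((theta_i - theta_bar) ** 2))
    return bias, std_err
```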
The performance of a model is evaluated by the bias and standard error of the estimator. Table 3.4
shows hyperparameters that are used to build our model.
Table 3.4: Hyperparameters used in our random forest model.
Number of estimators 155
Minimum samples per leaf 1
Maximum tree depth 30
Leaf model Mean
Number of cross-validation folds 3
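For readers without access to Citrination, a rough scikit-learn stand-in for the Lolo random forest with the hyperparameters of Table 3.4 might look as follows; the feature matrix and target values are placeholders, not the actual polymer data:

```python
# Sketch using scikit-learn as a stand-in for the Lolo random forest, with the
# hyperparameters of Table 3.4 (leaf model "Mean" is the regression-tree default).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((846, 134))   # placeholder feature matrix (846 polymers, ~134 features)
y = 2.0 + rng.random(846)    # placeholder dielectric constants

rf = RandomForestRegressor(n_estimators=155, min_samples_leaf=1, max_depth=30)
scores = cross_val_score(rf, X, y, cv=3, scoring="neg_root_mean_squared_error")
```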
Figure 3.10 shows the resulting parity plot with uncertainty in the predicted value. Data
points that are within the input feature space are labeled as “not extrapolating”, otherwise labeled
as “extrapolating.”
Figure 3.10: Model prediction vs. training data of polymer dielectric constants using the random forest
model.
Table 3.5 summarizes various model-performance metrics, such as RMSE, non-
dimensional model error (NDME), standard error in the estimate of RMSE (SE-RMSE), and
standard error in the estimate of NDME (SE-NDME). Here, NDME is the ratio between the RMSE
and standard deviation. The small RMSE and NDME values indicate that the developed model
possesses a good prediction capability. We examine three sets of features: (1) SMILES string; (2)
26 additional features consisting of polymer functional groups; and (3) the SMILES string and the functional groups together. Use of either the SMILES string or the functional-group features alone results in a similar RMSE of around 0.13. However, we found that SE-RMSE substantially improves
by 53% when combining the SMILES and the functional group features together.
While the optimization of RNN model is still in progress, Fig. 3.11 shows that the loss
function (which represents the deviation of model prediction from ground truth) is reduced
monotonically as a function of model training epochs.
Table 3.5: Performance metrics of the random forest model.
Features | RMSE | SE-RMSE | NDME | SE-NDME
SMILES | 0.129 | 0.00713 | 0.327 | 0.0180
Functional groups | 0.174 | 0.00281 | 0.439 | 0.00709
SMILES & Functional groups | 0.130 | 0.00330 | 0.329 | 0.00833
Figure 3.11: Convergence of the RNN model performance as a function of training epochs.
3.4 Summary
We have developed a high-throughput simulation-ML framework for computationally
synthesizing a large dataset of polymers, evaluating their dielectric constants, and learning
structure-property relationships using a new valence-aware polarizable reactive force-field model
and ML models including multi-layer perceptron, random forest and recurrent neural network.
A high level of prediction accuracy was achieved owing to the large size of the simulated training dataset.
Employing the full dataset of structurally-diverse 1,276 polymers, the ML models have predicted
the dielectric constant of polymers with decent accuracy.
We have also shown that graph neural network-based analysis can automatically classify
different phases present in RMD simulation of MoS2 synthesis. Furthermore, we found that
addition of edge based features (especially bond distance) increases the model accuracy
significantly. Network analysis by visualizing the feature space of our GNN model shows clear
separation of the 2H and 1T graph structures inside the network, which helps identify and better
understand these structures.
This is contrary to conventional techniques for structural analysis (e.g., common neighbor
analysis and centro-symmetry parameter calculation). While they work for mono-
atomic FCC, BCC and HCP crystals, these conventional methods do not distinguish 2H and 1T
crystalline phases in transition-metal dichalcogenide layers considered here. Due to the lack of a
readily available order parameter that can identify each structure type, our GNN model serves as
an indispensable analysis tool for general materials.
As an extension of this work, we are currently developing regression models that are being
combined with our active learning framework, so as to predict the optimal polymer structure with
minimal number of first principles-based calculations. This is achieved by incrementally
expanding the training set, while balancing exploration and exploitation based on Bayesian
optimization. Finally, we are implementing all simulation and ML models on scalable parallel
computing platforms, including the United States’ first exaflop/s computer, Aurora A21, to be
installed in 2022, under our Aurora Early Science Program award. Such scalable ML prediction
models, once optimized and validated, will be indispensable for further enlarging the search space
for superior dielectric polymers by orders-of-magnitude.
Chapter 4
Protein Modeling at Exascale
4.1 Introduction
The three-dimensional (3D) structure of a protein reveals crucial information about how it
interacts with other proteins to carry out fundamental biological functions. Proteins are linear
chains of amino acids that fold into specific 3D conformations as a result of the physical properties
of the amino acid sequence. The structure, in turn, determines the wide range of protein functions.
Thus, understanding the complexity of protein folding is vital for studying the mechanisms of these
molecular machines in health and disease, and for development of new drugs. Various machine
learning (ML) techniques have been applied successfully to protein structure analysis in the past
[122, 123]. In previous works, for example, we explored fast atomistic learning based on a 2D
convolutional neural network (CNN), through dimension mapping using space-filling curves
[124]. We have also demonstrated that a novel spatial model built with a graph convolution
network (GCNN) can be used effectively to produce interpretable structural classification [125].
ML models such as neural networks have long been applied to predict 1D structural features such
as backbone torsion angles, secondary structure and solvent accessibility of amino-acid residues.
The focus of ML applications has since shifted to 2D representation of 3D structures such as
residue-residue contact maps [126] and inter-residue distance matrices. Recognizing that contact
maps are similar to 2D images — whose classification and interpretation have been among the
most successful applications of deep learning (DL) approaches — the community has begun to
apply DL to recognize patterns in the sequences and structures of proteins in the protein data bank
(PDB) [127]. CNNs have demonstrated excellent performance in image analysis tasks, making
them a natural choice for the prediction of protein contact maps. The question of how best to
encode information about the target protein for input to the neural network is an active research
topic. By analogy, color images are often encoded as three matrices of real numbers, i.e., the
intensities of red, green and blue color channels for all image pixels. Methods such as DeepContact
[128] and RaptorX-Contact [129] use input features consisting of 𝑁 × 𝑁 (where N is the number
of amino acids in the sequence of the target protein) residue-residue coupling matrices derived
from covariation analyses of the target protein, augmented by predictions of local sequence
features. In the DeepCov [130] and TripletRes [131] approaches, more information in the target
protein multiple-sequence alignment is provided to the network, in the form of 400 different 𝑁 × 𝑁
feature matrices, each corresponding to a defined pair of amino acids, with the value at position
(𝑖 , 𝑗 ) in a given matrix being either the pair frequency or the covariance for the given amino acid
pair at alignment positions i and j. Then, CNN integrates this massive set of features to identify
spatial contacts, training by large sets of proteins of known structure and their associated contact
maps and multiple sequence alignments. The importance of incorporating ML in template-free
modeling has been highlighted by top-performing CASP13 structure prediction methods, all of
which rely on deep convolutional neural networks for predicting residue contacts or distances,
predicting backbone torsion angles and ranking the final models.
Approaches to the protein residue contact map prediction problem include support vector
machine (SVM) [132], CNN [130], recurrent neural network (RNN) + CNN [133], ResNet, VGG
and other proven architectures. Most of these approaches require heavy feature engineering. Apart
from the amino acids sequence, they used additional engineered features such as amino-acid pair
frequency [130], covariation scores [130, 134], and position-specific scoring matrix (PSSM). Such
heavy feature engineering often leads to poor ability of transfer learning across relevant protein
folding tasks. Bepler et al. [133] demonstrated a promise of transfer learning, i.e., ability to transfer
knowledge between structurally related proteins, through representation learning, which we will
follow here. In this work, we introduce graph neural network (GNN) to the conventional RNN-
based pipeline [133] so as to better capture spatial correlations. Furthermore, we propose a
multiscale GNN approach, in which short-, medium- and long-range spatial correlations are
refined by dedicated respective GNNs after coarse learning.
4.2 Contact Map Prediction with Representation Learning
A protein contact map represents pairwise amino acid distances, where each pair of input
amino acids from sequence is mapped to a label ∈ {0,1}, which denotes whether the amino acids
are “in contact” (1, i.e., within a cutoff distance of 8 Å) or not (0); see Fig. 4.1. Accurate contact
maps provide powerful global information, e.g., they facilitate the understanding of the complex
dynamical behavior of proteins or other biomolecules [135] and robust modeling of full 3D protein
structure [136]. Specifically, medium- and long-range contacts, which may be as few as twelve
sequence positions apart, or as many as hundreds apart, are particularly important for 3D structure
modeling. However, existing approaches [130, 132, 133] require heavy feature engineering and
suffer from poor ability of transfer learning across relevant protein folding tasks.
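A minimal sketch of this labeling, assuming an array of per-residue coordinates (e.g., C-alpha positions) in angstroms and the 8 Å cutoff described above, is:

```python
# Minimal sketch: binary residue-residue contact map from 3D coordinates.
import numpy as np

def contact_map(coords, cutoff=8.0):
    """coords: (L, 3) array of residue coordinates in angstroms."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (d < cutoff).astype(np.int8)   # 1 = in contact, 0 = not in contact
```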
Figure 4.1: Left: 3D protein structure. Right: residue-residue contact map of a protein.
Here, we address these problems with the proved strength of representation learning.
Specifically, we use a type of RNN, i.e., bidirectional long short-term memory (LSTM) embedding
model, mapping sequences of amino acids to sequences of vector representations, such that
residues occurring in similar structural contexts will be close in embedding space (Fig. 4.2). How
to induce the residue-residue contact from the vector representations is still a challenge, because
these vectors have few position-level correspondences between residues. We solve this problem
by introducing GNN. The role of GNN is, via its powerful capability of structural learning, to infer
the pair relation of residues from the intermediate vector representations.
Graph-based data in general can be represented as G = (V, E), where V is the set of nodes and E is the set of edges. Each edge e_uv ∈ E is a connection between nodes u and v. If G is directed, we have e_uv ≢ e_vu; if G is undirected, instead e_uv ≡ e_vu. Here, we deal with undirected graphs, but it is trivial to modify such a model to process directed graph data. In protein residue contact graphs, the nodes are amino-acid residues and the edges are their spatial proximity within 8 Å.
Figure 4.2: Procedure of the learning task. AA: amino acid.
The goal of GNN is to learn low-dimensional representation of graphs from the
connectivity structure and input features of nodes and edges. The forward pass of GNN has two
steps, i.e., message passing and node-state updating. The architecture is summarized by the
following recurrence relations, where t denotes the iteration count:
m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw}),    (4.1)

h_v^{t+1} = U_t(h_v^t, m_v^{t+1}),    (4.2)

where N(v) denotes the neighbors of node v in graph G. The message function M_t takes node state h_v^t and edge state e_{vw} as inputs and produces message m_v^{t+1}, which can be considered as a collection of feature information from the neighbors of v. The node states are then updated by function U_t based on the previous state and the message. The initial states h_v^0 are set to be the input features of amino acids. Here, we use the normalized adjacency matrix \tilde{A} of the graph coupled with some other features as the edge state. These two steps are repeated for a total of T times in order to gather information from distant neighbors, and the node states are updated accordingly. GNN can be regarded as a layer-wise model that propagates messages over the edges and updates the states of nodes in the previous layer. Thus, T can be considered to be the number of layers in this model. The exact form of the message function is

m_v^{t+1} = A_v W_t [h_1^t \ldots h_v^t] + b,    (4.3)

where W_t are weights of GNN and b denotes bias. We use gated recurrent units as the update function:

z_v^t = \sigma(W_z m_v^t + U_z h_v^{t-1}),    (4.4)

r_v^t = \sigma(W_r m_v^t + U_r h_v^{t-1}),    (4.5)

\tilde{h}_v^t = \tanh(W m_v^t + U (r_v^t \odot h_v^{t-1})),    (4.6)

h_v^t = (1 - z_v^t) \odot h_v^{t-1} + z_v^t \odot \tilde{h}_v^t,    (4.7)

where ⊙ denotes element-wise matrix multiplication and σ(·) is the sigmoid function for nonlinear activation.
Figure 4.3 shows a schematic of the proposed multiscale RNN+GNN model, which
improves upon existing RNN+CNN models for protein residue contact map prediction [134].
Figure 4.3: The proposed multiscale GNN approach on top of the existing RNN+CNN pipeline.
First, the RNN unit works as an encoder that takes a sequence of amino acids representing a protein and transforms it into vector representations of the same length. To align with the biological process of protein folding, we employ a bidirectional LSTM as the encoder to absorb the representation from the neighboring amino acids, so that the resultant embeddings contain hidden features from both sides.
In order to produce contacts from the sequence of embeddings, we define a pairwise feature tensor,

v_{ij} = [z_i - z_j || z_i \odot z_j],    (4.8)

of size L × L × 2D, where D is the dimension of the embedding vector, L is the length of the protein, || denotes concatenation, and ⊙ is the element-wise product. This featurization is symmetric and has extensive applications in natural language processing (NLP) models [137]. This 3D feature is then transformed through the proposed GNN module. To obtain higher granularity of the contact predictions with regard to the positional ranges, particularly short-, medium- and long-range contacts, we have designed three range-based GNN blocks, where in each layer t the edge features, (E_s^t, E_m^t, E_l^t), and node features, (H_s^t, H_m^t, H_l^t), are updated respectively as

e_{ij}^{t+1} = \alpha[W_1(\alpha(W_2 e_{ij}^t) || \alpha(W_3 h_i^t || h_j^t))],    (4.9)

h_i^{t+1} = \alpha[W_4(\alpha(W_5 h_i^t || \sum_j e_{ij}^{t+1} / N))],    (4.10)

where W_{1...5} are learnable weights and α is the rectified linear unit (ReLU) activation. In Eq. (4.10), N is the number of positional neighbors of an amino acid, which distinguishes the range-based blocks: (i) short-range blocks focus on positional neighbors from 6 to 11, so that there are N = 12 − 6 = 6 neighbors; (ii) medium-range blocks, for which N = 24 − 12 = 12; and (iii) long-range blocks, for which N = L − 24. Thus, the new node features are induced by an average of the neighboring edge features.
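The following NumPy sketch illustrates the pairwise featurization of Eq. (4.8) and the range-based neighbor averaging entering Eq. (4.10); the learnable weights, activations and the actual block wiring are omitted, and the sequence length and embedding dimension are placeholders:

```python
# Illustrative sketch of the pairwise feature tensor, Eq. (4.8), and the
# positional-range masking and averaging used by the range-based blocks.
import numpy as np

def pairwise_features(Z):
    """Z: (L, D) per-residue embeddings -> (L, L, 2D) tensor [z_i - z_j || z_i * z_j]."""
    diff = Z[:, None, :] - Z[None, :, :]
    prod = Z[:, None, :] * Z[None, :, :]
    return np.concatenate([diff, prod], axis=-1)

def range_mask(L, lo, hi):
    """Select residue pairs with lo <= |i - j| < hi (hi=None means no upper bound)."""
    sep = np.abs(np.arange(L)[:, None] - np.arange(L)[None, :])
    return (sep >= lo) & (sep < (hi if hi is not None else L + 1))

E = pairwise_features(np.random.rand(50, 16))   # placeholder: L = 50 residues, D = 16
short = range_mask(50, 6, 12)                   # short-range block
medium = range_mask(50, 12, 24)                 # medium-range block
long_range = range_mask(50, 24, None)           # long-range block
# e.g., average neighboring edge features within the short-range block (N = 12 - 6 = 6):
h_short = (E * short[..., None]).sum(axis=1) / 6.0
```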
The output of the final layer of the GNN blocks is a triplet, 𝑬 𝑇 = (𝑬 𝑠 𝑇 , 𝑬 𝑚 𝑇 , 𝑬 𝑙 𝑇 ), containing
edge features of the corresponding range-based blocks. A fully connected layer will merge and
convert 𝑬 𝑇 into the contact map.
It is worth noting here that our multiscale feature averaging is akin to the additive
hybridization scheme [138] used in the celebrated multiscale quantum-mechanical
(QM)/molecular-mechanical (MM) simulation approach, for which Karplus, Levitt and Warshel
shared the Nobel prize in chemistry in 2013 [139]. In the additive hybridization scheme, energy
contributions from different spatial ranges are described by appropriate approaches in the
respective ranges which are averaged to provide the total energy [138].
4.3 Results and Discussion
To evaluate the proposed model, we use the ProteinNet dataset [140]. ProteinNet is a
standardized dataset for machine learning of protein structure, which builds on the CASP (Critical Assessment of protein Structure Prediction) assessments that carry out blind prediction of recently solved but publicly unavailable protein structures. In our experiment, we specifically chose a
subset from CASP12. We implemented all methods in Tensorflow 2.5 and trained on two NVIDIA
V100 graphics processing units (GPUs). The GNN module consists of two layers with independent
parameters for each short-, medium- and long-range based block.
Figure 4.4 shows the precision of the proposed multiscale RNN+GNN approach as a
function of the training epoch compared with that of the baseline RNN+CNN approach. Here, we
employ a commonly used metric. The term “P@K” signifies precision for the top K contacts,
where all predicted contacts are sorted from highest to lowest confidence. Let L be the length of
the protein, then “P@L/2”, for example, is the precision for the L/2 most likely predicted contacts.
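A short sketch of this metric, assuming square score and label matrices and scoring only the upper triangle of residue pairs, is:

```python
# Sketch of the P@K metric: precision over the K most confident predicted contacts.
import numpy as np

def precision_at_k(pred_scores, true_contacts, k):
    """pred_scores, true_contacts: (L, L) arrays; only the upper triangle is scored."""
    iu = np.triu_indices_from(pred_scores, k=1)
    order = np.argsort(pred_scores[iu])[::-1][:k]   # indices of the top-K most confident pairs
    return true_contacts[iu][order].mean()          # fraction of those that are true contacts

# Usage, e.g., P@L/2 for a protein of length L:
# p_at_L2 = precision_at_k(scores, contacts, k=L // 2)
```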
In Fig. 4.4, we observe improved P@L/2 precision by the addition of multiscale GNNs. As
is well known, contact prediction at increased spatial ranges is progressively more difficult.
The proposed multiscale RNN+GNN approach consistently outperforms the baseline in all ranges,
including the hardest case of long-range contact prediction.
In Tables 4.1 - 4.3, we show the results of contact prediction in terms of P@L, P@L/2 and
P@L/5. We benchmark the proposed method against the CNN-based approach [133] as baseline.
The precisions are collected from the test dataset consisting of 144 protein sequences. The new
multiscale RNN+GNN approach consistently improves the prediction precision over the baseline
RNN+CNN approach for all K top contacts (K = L, L/2, L/5). While the precision decreases as we
move from the short- to medium- to long-ranges as expected, the multiscale RNN+GNN approach
maintains its precision advantage over the baseline.
Figure 4.4: P@L/2 precision of the proposed multiscale RNN+GNN model (green) as a function of the training
epoch compared to that of the conventional RNN+CNN model (cyan) for the prediction of (top) short-,
(middle) medium- and (bottom) long-range protein residue contacts.
Table 4.1: Short-range contact prediction results.
P@L P@L/2 P@L/5
baseline 0.319 0.299 0.344
proposed 0.360 0.355 0.370
Table 4.2: Medium-range contact prediction results.
P@L P@L/2 P@L/5
baseline 0.220 0.223 0.237
proposed 0.254 0.256 0.263
Table 4.3: Long-range contact prediction results.
P@L P@L/2 P@L/5
baseline 0.135 0.139 0.158
proposed 0.145 0.150 0.165
In summary, we have proposed a multiscale GNN-based approach taking a cue from the
celebrated multiscale physics simulation approach, in which a standard pipeline consisting of RNN
and CNN is improved by introducing three GNNs to refine predictive capability for short-,
medium- and long-range contacts, respectively. The results show improved accuracy for contacts
of all ranges, including the most challenging case of long-range contact prediction. The multiscale
GNN approach reflects the inherently multiscale nature of biomolecular and other physical
systems, and as such is expected to find broader applications beyond the protein residue contact
prediction problem.
As deep learning continues to show promise at modeling the relation between primary and
tertiary structure, methods of interpreting learned representations become progressively more
important in providing biologically meaningful insights about how proteins work. Previous work
has tackled approaches for identifying biologically significant model parameters such as saliency
measures along 3D space for volumetric methods like CNNs [125]. However, these approaches
are limited in that saliency maps over a volume do not adequately describe the role of interactions
between residues or the predictive value of higher-order sub-structures. Work by Zamora-Resendiz
et al. [124] demonstrated how GCNN architectures learn more biologically relevant
representations. Parameter attributions were found to localize at meaningful segments (including
secondary structures) for RAS proteins and these sub-structures were found to be characterized in
literature on RAS. As we learn more about how to represent physical systems in deep learning
frameworks, the “data-agnostic” capabilities of deep learning methods will help in discovering
biologically relevant sub-structures given the proper innate priors.
4.4 Future Work
With the continuously growing size of the ProteinNet dataset used in this study, the
proposed ML approach is becoming heavily compute bound. To address this challenge, we are
currently implementing our model on leadership-scale parallel supercomputers at the Argonne
Leadership Computing Facility (ALCF), including the Theta [141] and new Polaris machines.
Each computing node of Polaris consists of one AMD EPYC “Milan” central processing unit
(CPU) and four NVIDIA A100 GPUs. These leadership-scale implementations will be applied to
the largest-available protein datasets. For the massively parallel learning, we employ data
parallelism utilizing a distributed ML framework, Horovod [142]. Here, the global batch of input
data are split across computing nodes, and the model parameters are updated by aggregating the
gradients from the nodes. Up to O(100) nodes, we adopt synchronous training, where the model
and gradients are always in sync among all the nodes. For larger-scale training, hybrid
synchronous-asynchronous approach will be employed instead for higher scalability (though with
slower statistical convergence) [143]. For complex network and hyperparameter tuning, we also
use DeepHyper, a scalable automated ML (AutoML) package for the development of deep neural
networks [144]. We utilize two components of the package: (i) neural architecture search (NAS)
for automatic search of high-performing deep neural network (DNN) architectures; and (ii)
hyperparameter search (HPS) for automatic identification of high-performing DNN
hyperparameters. Running such massive ML workflow on leadership-scale parallel
supercomputers will likely pose runtime challenges such as fault recovery, for which we will
utilize ALCF support for Balsam high performance computing (HPC) workflows and edge
services [145]. The multiscale RNN+GNN model is being scaled up to leadership-scale parallel
supercomputers.
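A hedged sketch of the Horovod data-parallel setup described above is given below, using the TensorFlow/Keras bindings; the model is a trivial placeholder and the sharded dataset loading is omitted, so this is a configuration sketch rather than the actual training code:

```python
# Hedged sketch of Horovod data-parallel training with TensorFlow/Keras.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                              # one process per GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:                                                # pin each process to a single GPU
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])  # placeholder model
opt = hvd.DistributedOptimizer(                         # averages gradients across ranks
    tf.keras.optimizers.Adam(1e-3 * hvd.size()))        # scale learning rate with worker count
model.compile(optimizer=opt, loss="binary_crossentropy")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
# model.fit(per_rank_shard_of_dataset, epochs=100, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```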
References
[1] D. Frenkel and B. Smit. 2001. Understanding Molecular Simulation. Academic Press, San
Diego, CA.
[2] A. Rahman. 1964. Correlations in the motion of atoms in liquid argon. Physical Review,
136 (2A). A405-A411. https://doi.org/10.1103/PhysRev.136.A405
[3] M. Levitt. 2014. Birth and future of multiscale modeling for macromolecular systems
(Nobel lecture). Angewandte Chemie International Edition, 53 (38). 10006-10018.
10.1002/anie.201403691
[4] S. Tsuneyuki, Y. Matsui, H. Aoki and M. Tsukada. 1989. New pressure-induced structural
transformations in silica obtained by computer-simulation. Nature, 339 (6221). 209-211.
10.1038/339209a0
[5] F. Shimojo, I. Ebbsjo, R. K. Kalia, A. Nakano, J. P. Rino and P. Vashishta. 2000.
Molecular dynamics simulation of structural transformation in silicon carbide under
pressure. Physical Review Letters, 84 (15). 3338-3341.
[6] A. C. T. van Duin, S. Dasgupta, F. Lorant and W. A. Goddard. 2001. ReaxFF: a reactive
force field for hydrocarbons. Journal of Physical Chemistry A, 105 (41). 9396-9409.
10.1021/jp004368u
[7] A. Nakano, R. K. Kalia, P. Vashishta, T. J. Campbell, S. Ogata, F. Shimojo and S. Saini.
ACM/IEEE, 2001. Scalable atomistic simulation algorithms for materials research.
Proceedings of Supercomputing, SC01. 10.1145/582034.582035
[8] P. Vashishta, R. K. Kalia, J. P. Rino and I. Ebbsjo. 1990. Interaction potential for SiO2 -
a molecular-dynamics study of structural correlations. Physical Review B, 41 (17). 12197-
12209. 10.1103/PhysRevB.41.12197
[9] A. Nakano, R. K. Kalia, K. Nomura, A. Sharma, P. Vashishta, F. Shimojo, A. C. T. van
Duin, W. A. Goddard, R. Biswas, D. Srivastava and L. H. Yang. 2008. De novo ultrascale
atomistic simulations on high-end parallel supercomputers. International Journal of High
Performance Computing Applications, 22 (1). 113-128. 10.1177/1094342007085015
[10] K. Nomura, P. E. Small, R. K. Kalia, A. Nakano and P. Vashishta. 2015. An extended-
Lagrangian scheme for charge equilibration in reactive molecular dynamics simulations.
Computer Physics Communications, 192. 91-96. 10.1016/j.cpc.2015.02.023
[11] D. C. Rapaport. 1988. Large-scale molecular-dynamics simulation using vector and
parallel computers. Computer Physics Reports, 9 (1). 1-53. 10.1016/0167-7977(88)90014-
7
[12] A. Nakano, P. Vashishta and R. K. Kalia. 1993. Parallel multiple-time-step molecular-
dynamics with 3-body interaction. Computer Physics Communications, 77 (3). 303-312.
10.1016/0010-4655(93)90178-F
[13] S. Plimpton. 1995. Fast parallel algorithms for short-range molecular dynamics. Journal
of Computational Physics, 117 (1). 1-19. 10.1006/jcph.1995.1039
[14] L. Kale, R. Skeel, M. Bhandarkar, R. Brunner, A. Gursoy, N. Krawetz, J. Phillips, A.
Shinozaki, K. Varadarajan and K. Schulten. 1999. NAMD2: greater scalability for parallel
molecular dynamics. Journal of Computational Physics, 151 (1). 283-312.
10.1006/jcph.1999.6201
[15] D. E. Shaw. 2005. A fast, scalable method for the parallel evaluation of distance-limited
pairwise particle interactions. Journal of Computational Chemistry, 26 (13). 1318-1328.
10.1002/jcc.20267
[16] D. A. Reed and J. Dongarra. 2015. Exascale computing and big data. Communications of
the ACM, 58 (7). 56-68. 10.1145/2699414
[17] K. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, K. Shimamura, F. Shimojo, M.
Kunaseth, P. C. Messina and N. A. Romero. IEEE/ACM, 2014. Metascalable quantum
molecular dynamics simulations of hydrogen-on-demand. Proceedings of
Supercomputing, SC14. 661-673. 10.1109/SC.2014.59
[18] F. Shimojo, R. K. Kalia, M. Kunaseth, A. Nakano, K. Nomura, S. Ohmura, K. Shimamura
and P. Vashishta. 2014. A divide-conquer-recombine algorithmic paradigm for multiscale
materials modeling. Journal of Chemical Physics, 140 (18). 18A529. 10.1063/1.4869342
[19] N. A. Romero, A. Nakano, K. Riley, F. Shimojo, R. K. Kalia, P. Vashishta and P. C.
Messina. 2015. Quantum molecular dynamics in the post-petaflops era. IEEE Computer,
48 (11). 33-41.
[20] J. Mellor-Crummey, D. Whalley and K. Kennedy. 2001. Improving memory hierarchy
performance for irregular applications using data and computation reorderings.
International Journal of Parallel Programming, 29 (3). 217-247.
10.1023/A:1011119519789
[21] M. Kunaseth, R. K. Kalia, A. Nakano, K. Nomura and P. Vashishta. ACM/IEEE, 2013. A
scalable parallel algorithm for dynamic range-limited n-tuple computation in many-body
molecular dynamics simulation. Proceedings of Supercomputing, SC13.
[22] J. P. Walters, V. Balu, V. Chaudhary, D. Kofke and A. Schultz. ISCA, 2008. Accelerating
molecular dynamics simulations with GPUs. Proceedings of International Conference on
Parallel and Distributed Computing and Communication Systems (PDCCS 2008). 44-49.
[23] W. M. Brown, P. Wang, S. J. Plimpton and A. N. Tharrington. 2011. Implementing
molecular dynamics on hybrid high performance computers - short range forces. Computer
Physics Communications, 182 (4). 898-911. 10.1016/j.cpc.2010.12.021
[24] T. J. Lipscomb, A. Zou and S. S. Cho. ACM, 2012. Parallel Verlet neighbor list algorithm
for GPU-optimized MD simulations. Proceedings of the ACM Conference on
Bioinformatics, Computational Biology and Biomedicine, (BCB '12). 321-328.
10.1145/2382936.2382977
[25] W. M. Brown and M. Yamada. 2013. Implementing molecular dynamics on hybrid high
performance computers-three-body potentials. Computer Physics Communications, 184
(12). 2785-2793. 10.1016/j.cpc.2013.08.002
[26] S. B. Kylasa, H. M. Aktulga and A. Y. Grama. 2014. PuReMD-GPU: a reactive molecular
dynamics simulation package for GPUs. Journal of Computational Physics, 272. 343-359.
10.1016/j.jcp.2014.04.035
[27] A. Nakano, R. K. Kalia and P. Vashishta. 1994. Multiresolution molecular-dynamics
algorithm for realistic materials modeling on parallel computers. Computer Physics
Communications, 83 (2-3). 197-214.
[28] M. P. Allen and D. J. Tildesley. 1987. Computer Simulation of Liquids. Oxford University
Press, Oxford, UK.
[29] W.-m. W. Hwu. 2011. GPU Computing Gems Jade Edition. Morgan Kaufmann, Waltham,
MA.
[30] http://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm
[31] J. Tersoff. 1989. Modeling solid-state chemistry: interatomic potentials for
multicomponent systems. Physical Review B, 39 (8). 5566-5568.
10.1103/PhysRevB.41.3248.2
[32] D. W. Brenner. 1990. Empirical potential for hydrocarbons for use in simulating the
chemical vapor-deposition of diamond films. Physical Review B, 42 (15). 9458-9471.
[33] A. K. Rappe and W. A. Goddard. 1991. Charge equilibration for molecular-dynamics
simulations. Journal of Physical Chemistry, 95 (8). 3358-3363. 10.1021/j100161a070
[34] F. H. Streitz and J. W. Mintmire. 1994. Electrostatic potentials for metal-oxide surfaces
and interfaces. Physical Review B, 50 (16). 11996-12003. 10.1103/PhysRevB.50.11996
[35] S. W. Rick, S. J. Stuart and B. J. Berne. 1994. Dynamical fluctuating charge force-fields -
application to liquid water. Journal of Chemical Physics, 101 (7). 6141-6156.
10.1063/1.468398
[36] T. P. Senftle, S. Hong, M. M. Islam, S. B. Kylasa, Y. Zheng, Y. K. Shin, et al., "The
ReaxFF reactive force-field: development, applications and future directions," npj
Computational Materials, vol. 2, p. 15011, Mar 4 2016.
[37] A. K. Rappe and W. A. Goddard, "Charge equilibration for molecular-dynamics
simulations," Journal of Physical Chemistry, vol. 95, pp. 3358-3363, Apr 18 1991.
[38] A. Nakano, "Parallel multilevel preconditioned conjugate-gradient approach to variable-
charge molecular dynamics," Computer Physics Communications, vol. 104, pp. 59-69, Aug
1997.
[39] K. Nomura, R. K. Kalia, A. Nakano, and P. Vashishta, "A scalable parallel algorithm for
large-scale reactive force-field molecular dynamics simulations," Computer Physics
Communications, vol. 178, pp. 73-87, Jan 15 2008.
[40] H. M. Aktulga, S. A. Pandit, A. C. T. van Duin, and A. Y. Grama, "Reactive molecular
dynamics: numerical methods and algorithmic techniques," SIAM Journal on Scientific
Computing, vol. 34, pp. C1-C23, Jan 31 2012.
[41] S. B. Kylasa, H. M. Aktulga, and A. Y. Grama, "Reactive molecular dynamics on
massively parallel heterogeneous architectures," IEEE Transactions on Parallel and
Distributed Systems, vol. 28, pp. 202-214, Jan 1 2017.
[42] K. Nomura, R. K. Kalia, Y. Li, A. Nakano, P. Rajak, C. Sheng, et al., "Nanocarbon
synthesis by high-temperature oxidation of nanoparticles," Scientific Reports, vol. 6, p.
24109, Apr 20 2016.
[43] R. A. Marcus, "Electron-transfer reactions in chemistry - theory and experiment," Reviews
of Modern Physics, vol. 65, pp. 599-610, Jul 1993.
[44] S. Naserifar, D. J. Brooks, W. A. Goddard, and V. Cvicek, "Polarizable charge
equilibration model for predicting accurate electrostatic interactions in molecules and
solids," Journal of Chemical Physics, vol. 146, p. 124117, Mar 28 2017.
[45] K. Nomura, P. E. Small, R. K. Kalia, A. Nakano, and P. Vashishta, "An extended-
Lagrangian scheme for charge equilibration in reactive molecular dynamics simulations,"
Computer Physics Communications, vol. 192, pp. 91-96, July 2015.
[46] M. Kunaseth, R. K. Kalia, A. Nakano, K. Nomura, and P. Vashishta, "A scalable parallel
algorithm for dynamic range-limited n-tuple computation in many-body molecular
dynamics simulation," Proceedings of Supercomputing, SC13, ACM/IEEE, 2013.
[47] A. K. Geim and I. V. Grigorieva, "Van der Waals heterostructures," Nature, vol. 499, pp.
419-425, Jul 25 2013.
[48] R. F. Service, "Design for US exascale computer takes shape," Science, vol. 359, pp. 617-
618, Feb 9 2018.
[49] F. Shimojo, R. K. Kalia, M. Kunaseth, A. Nakano, K. Nomura, S. Ohmura, et al., "A
divide-conquer-recombine algorithmic paradigm for multiscale materials modeling,"
Journal of Chemical Physics, vol. 140, p. 18A529, May 14 2014.
[50] P. Umari and A. Pasquarello, "Ab initio molecular dynamics in a finite homogeneous
electric field," Physical Review Letters, vol. 89, p. 157602, Oct 7 2002.
[51] A. M. N. Niklasson, "Extended Born-Oppenheimer molecular dynamics," Physical Review
Letters, vol. 100, p. 123004, Mar 28 2008.
[52] P. Souvatzis and A. M. N. Niklasson, "First principles molecular dynamics without self-
consistent field optimization," Journal of Chemical Physics, vol. 140, p. 044117, Jan 28
2014.
[53] D. C. Rapaport, The Art of Molecular Dynamics Simulation, Second ed. Cambridge, UK:
Cambridge University Press, 2004.
[54] D. E. Shaw, "A fast, scalable method for the parallel evaluation of distance-limited
pairwise particle interactions," Journal of Computational Chemistry, vol. 26, pp. 1318-
1328, Oct 2005.
[55] M. Kunaseth, R. K. Kalia, A. Nakano, P. Vashishta, D. F. Richards, and J. N. Glosli,
"Performance characteristics of hardware transactional memory for molecular dynamics
application on BlueGene/Q: toward efficient multithreading strategies for large-scale
scientific applications," Proceedings of the International Workshop on Parallel and
Distributed Scientific and Engineering Computing, PDSEC-13, Mar 20 IEEE, 2013.
[56] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, et al., "Knights
Landing: second-generation Intel Xeon Phi product," IEEE Micro, vol. 36, pp. 34-46, Mar-
Apr 2016.
[57] K. S. Novoselov, A. Mishchenko, A. Carvalho, and A. H. C. Neto, "2D materials and van
der Waals heterostructures," Science, vol. 353, p. aac9439, Jul 29 2016.
[58] Y. M. Shi, H. N. Li, and L. J. Li, "Recent advances in controlled synthesis of two-
dimensional transition metal dichalcogenides via vapour deposition techniques," Chemical
Society Reviews, vol. 44, pp. 2744-2756, May 7 2015.
[59] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal
approximators," Neural Networks, vol. 2, pp. 359-366, Jan 1989.
[60] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A
Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning
Research, vol. 15, pp. 1929-1958, Jun 2014.
[61] E. D. Cubuk, S. S. Schoenholz, J. M. Rieser, B. D. Malone, J. Rottler, D. J. Durian, et al.,
"Identifying Structural Flow Defects in Disordered Solids Using Machine-Learning
Methods," Physical Review Letters, vol. 114, p. 108001, Mar 2015.
[62] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with
deep convolutional neural networks." Neural Information Processing Systems 25 (2012):
1106–14.
92
[63] Altae-Tran, Han, Bharath Ramsundar, Aneesh S. Pappu, and Vijay Pande. "Low data drug
discovery with one-shot learning." ACS Central Science 3 (2017): 283-93.
[64] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gomez-
Bombarelli, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. "Convolutional
networks on graphs for learning molecular fingerprints." Advances in Neural Information
Processing Systems 28 (2015): 2224–32
[65] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler,
Andreas Ziehe, Alexandre Tkatchenko, O. Anatole von Lilienfeld, and Klaus-Robert
Müller. "Learning invariant representations of molecules for atomization energy
prediction." Advances in Neural Information Processing Systems 25 (2012): 449-57.
[66] Hansen, Katja, Franziska Biegler, Raghunathan Ramakrishnan, Wiktor Pronobis, O.
Anatole von Lilienfeld, Klaus-Robert Müller, and Alexandre Tkatchenko. "Machine
learning predictions of molecular properties: accurate many-body potentials and
nonlocality in chemical space." The Journal of Physical Chemistry Letters 6 (2015): 2326-
31.
[67] Geim, A. K., and I. V. Grigorieva. 2013. "Van der Waals heterostructures." Nature 499
(July 24 2013):419-425. doi: 10.1038/nature12385.
[68] Bassman, L., P. Rajak, R. K. Kalia, A. Nakano, F. Sha, J. Sun, D. J. Singh, M. Aykol, P.
Huck, K. Persson, and P. Vashishta. "Active learning for accelerated design of layered
materials." npj Computational Materials 4 (2018):74.
[69] Hong, S., A. Krishnamoorthy, P. Rajak, S. Tiwari, M. Misawa, F. Shimojo, R. K. Kalia,
A. Nakano, and P. Vashishta. "Computational synthesis of MoS2 layers by reactive
molecular dynamics simulations: initial sulfidation of MoO3 surfaces." Nano Letters 17
(2017): 4866-72.
[70] Kearnes, Steven, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley.
"Molecular graph convolutions: moving beyond fingerprints." Journal of Computer-Aided
Molecular Design 30, (2016): 595-608.
[71] Nomura, K., R.K. Kalia, A. Nakano, and P. Vashishta. "A scalable parallel algorithm for
large-scale reactive force-field molecular dynamics simulations." Computer Physics
Communications 178 (2008): 73-87.
[72] Nomura, K., P. E. Small, R. K. Kalia, A. Nakano, and P. Vashishta. "An extended-
lagrangian scheme for charge equilibration in reactive molecular dynamics simulations."
Computer Physics Communications 192 (2015): 91-96.
[73] Scarselli, F., M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. "The graph neural
network model." IEEE Transactions on Neural Networks 20 (2009): 61-80.
93
[74] Senftle, T. P., S. Hong, M. M. Islam, S. B. Kylasa, Y. Zheng, Y. K. Shin, C. Junkermeier,
et al. "The Reaxff reactive force-field: development, applications and future directions."
npj Computational Materials 2 (2016): 15011.
[75] Tamayo-Mendoza, Teresa, Christoph Kreisbeck, Roland Lindh, and Alán Aspuru-Guzik.
"Automatic differentiation in quantum chemistry with applications to fully variational
Hartree–Fock." ACS Central Science 4 (2018): 559-66.
[76] Wei, Jennifer N., David Duvenaud, and Alan Aspuru-Guzik. "Neural networks for the
prediction of organic chemistry reactions." ACS Central Science 2 (2016): 725-32.
[77] Xie, T., and J. C. Grossman. "Crystal Graph Convolutional Neural networks for an accurate
and interpretable prediction of material properties." Physical Review Letters 120 (2018):
145301.
[78] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. "Gated graph sequence
neural networks." International Conference on Learning Representations (2016).
[79] Chen, J., Ma, T. and Xiao, C., 2018. "FastGCN: fast learning with graph convolutional
networks via importance sampling." arXiv preprint arXiv:1801.10247.
[80] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and
Bengio, Y., 2014. "Learning phrase representations using RNN encoder-decoder for
statistical machine translation." arXiv preprint arXiv:1406.1078.
[81] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S.,
Irving, G., Isard, M. and Kudlur, M., "Tensorflow: a system for large-scale machine
learning." OSDI 16 (2016): 265-83.
[82] Kingma, D.P. and Ba, J., 2014. "Adam: A method for stochastic optimization." arXiv
preprint arXiv:1412.6980.
[83] Q. M. Zhang, V. Bharti, and X. Zhao, "Giant electrostriction and relaxor ferroelectric
behavior in electron-irradiated poly(vinylidene fluoride-trifluoroethylene) copolymer,"
Science, vol. 280, no. 5372, pp. 2101-2104, Jun 26 1998, doi:
10.1126/science.280.5372.2101
[84] B. J. Chu et al., "A dielectric polymer with high electric energy density and fast discharge
speed," Science, vol. 313, no. 5785, pp. 334-336, Jul 21 2006, doi:
10.1126/science.1127798
[85] V. Sharma et al., "Rational design of all organic polymer dielectrics," Nat Commun, vol.
5, p. 4845, Sep 2014, doi: 10.1038/ncomms5845.
[86] A. Jain, K. A. Persson, and G. Ceder, "Research update: the materials genome initiative:
Data sharing and the impact of collaborative ab initio databases," APL Mater, vol. 4, no. 5,
May 2016, doi: 10.1063/1.4944683.
94
[87] C. Kim, A. Chandrasekaran, T. D. Huan, D. Das, and R. Ramprasad, "Polymer genome: a
data-powered polymer informatics platform for property predictions," J Phys Chem C, vol.
122, no. 31, pp. 17575-17585, Aug 9 2018, doi: 10.1021/acs.jpcc.8b02913.
[88] K. Andersen, S. Latini, and K. S. Thygesen, "Dielectric genome of van der Waals
heterostructures," Nano Lett, vol. 15, no. 7, pp. 4616-4621, Jul 2015, doi:
10.1021/acs.nanolett.5b01251.
[89] P. Umari and A. Pasquarello, "Ab initio molecular dynamics in a finite homogeneous
electric field," Phys Rev Lett, vol. 89, no. 15, p. 157602, Oct 7 2002, doi:
10.1103/PhysRevLett.89.157602.
[90] I. Souza, J. Iniguez, and D. Vanderbilt, "First-principles approach to insulators in finite
electric fields," Phys Rev Lett, vol. 89, no. 11, p. 117602, Sep 9 2002, doi:
10.1103/PhysRevLett.89.117602.
[91] S. Fukushima et al., "Effects of chemical defects on anisotropic dielectric response of
polyethylene," AIP Adv, vol. 9, no. 4, p. 045022, Apr 2019, doi: 10.1063/1.5093566
[92] S. Naserifar, D. J. Brooks, W. A. Goddard, and V. Cvicek, "Polarizable charge
equilibration model for predicting accurate electrostatic interactions in molecules and
solids," J Chem Phys, vol. 146, no. 12, p. 124117, 2017/03/28 2017, doi:
10.1063/1.4978891.
[93] K. Liu et al., "Shift-collapse acceleration of generalized polarizable reactive molecular
dynamics for machine learning-assisted computational synthesis of layered materials,"
Proc ScalA18, pp. 41-48, Nov 12 IEEE, 2018, doi: 10.1109/ScalA.2018.00009.
[94] Y. Li et al., "Scalable reactive molecular dynamics simulations for computational
synthesis," Comput Sci Eng, vol. 21, no. 5, pp. 64-75, Sep/Oct 2019, doi:
10.1109/MCSE.2018.110150043.
[95] A. Mishra et al., "Computational framework for polymer synthesis to study dielectric
properties using polarizable reactive molecular dynamics," ACS Central Sci, submitted,
2020.
[96] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh, "Machine learning for
molecular and materials science," Nature, vol. 559, no. 7715, pp. 547-555, Jul 26 2018,
doi: 10.1038/s41586-018-0337-2.
[97] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-
444, May 28 2015, doi: 10.1038/nature14539.
[98] D. Duvenaud et al., "Convolutional networks on graphs for learning molecular
fingerprints," Proc NeurIPS 2015, vol. 28, Dec 2015.
[99] K. Liu, K. Nomura, P. Rajak, R. K. Kalia, A. Nakano, and P. Vashishta, "Graph neural
network analysis of layered material phases," Proc SpringSim-HPC 2019, SCS, 2019.
95
[100] M. H. S. Segler, T. Kogej, C. Tyrchan, and M. P. Waller, "Generating focused molecule
libraries for drug discovery with recurrent neural networks," ACS Central Sci, vol. 4, no.
1, pp. 120-131, Jan 24 2018, doi: 10.1021/acscentsci.7b00512.
[101] A. C. T. van Duin, S. Dasgupta, F. Lorant, and W. A. Goddard, "ReaxFF: a reactive force
field for hydrocarbons," J Phys Chem A, vol. 105, no. 41, pp. 9396-9409, Oct 18 2001, doi:
10.1021/jp004368u.
[102] A. Nakano et al., "De novo ultrascale atomistic simulations on high-end parallel
supercomputers," Int J High Performance Comput Appl, vol. 22, no. 1, pp. 113-128, Feb
2008, doi: 10.1177/1094342007085015.
[103] T. P. Senftle et al., "The ReaxFF reactive force-field: development, applications and future
directions," npj Comput Mater, vol. 2, p. 15011, Mar 4 2016, doi:
10.1038/npjcompumats.2015.11.
[104] A. Hellweg and D. Rappoport, "Development of new auxiliary basis functions of the
Karlsruhe segmented contracted basis sets including diffuse basis functions (def2-SVPD,
def2-TZVPPD, and def2-QVPPD) for RI-MP2 and RI-CC calculations," Phys Chem Chem
Phys, 10.1039/C4CP04286G vol. 17, no. 2, pp. 1010-1017, 2015, doi:
10.1039/C4CP04286G.
[105] K. Nomura, R. K. Kalia, A. Nakano, P. Rajak, and P. Vashishta, "RXMD: a scalable
reactive molecular dynamics simulator for optimized time-to-solution," SoftwareX, vol. 11,
p. 100389, Jan-Jun 2020, doi: 10.1016/j.softx.2019.100389.
[106] D. Weininger, "SMILES, a chemical language and information system. 1. Introduction to
methodology and encoding rules," J Chem Info Comp Sci, vol. 28, pp. 31-36, 1988, doi:
10.1021/ci00057a005.
[107] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R.
Hutchison, "Open Babel: An open chemical toolbox," J Cheminformatics, vol. 3, no. 1, p.
33, 2011/10/07 2011, doi: 10.1186/1758-2946-3-33.
[108] H. Ramchoun, M. A. J. Idrissi, Y. Ghanou, and M. Ettaouil, "Multilayer perceptron:
architecture optimization and training," IJIMAI, vol. 4, no. 1, pp. 26-30, Sep 2016, doi:
10.9781/ijimai.2016.415.
[109] F. Rosenblatt, Principles of Neurodynamics: Perceptions and the Theory of Brain
Mechanisms. Washington, DC: Spartan, 1962.
[110] N. Talebi, A. M. Nasrabadi, and I. Mohammad-Rezazadeh, "Estimation of effective
connectivity using multi-layer perceptron artificial neural network," Cogn Neurodynamics,
vol. 12, no. 1, pp. 21-42, Feb 2018, doi: 10.1007/s11571-017-9453-1.
[111] P. Hohenberg and W. Kohn, "Inhomogeneous electron gas," Phys Rev, vol. 136, no. 3B,
pp. B864-B871, Nov 9 1964, doi: 10.1103/PhysRev.136.B864.
96
[112] R. M. Martin, Electronic Structure: Basic Theory and Practical Methods. Cambridge, UK:
Cambridge University Press, 2008.
[113] F. Shimojo et al., "A divide-conquer-recombine algorithmic paradigm for multiscale
materials modeling," J Chem Phys, vol. 140, no. 18, p. 18A529, May 14 2014, doi:
10.1063/1.4869342.
[114] Liu, Y., Palmedo, P., Ye, Q., Berger, B. & Peng, J. Enhancing evolutionary couplings
with deep convolutional neural networks. Cell Syst. 6, 65–74.e3 (2018).
[115] Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein
contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017)
[116] Jones, D. T. & Kandathil, S. M. High precision in protein contact prediction using fully
convolutional neural networks and minimal sequence features. Bioinformatics 34, 3308–
3315 (2018).
[117] Yang Li, Chengxin Zhang, Eric W Bell, Wei Zheng, Dongjun Yu, Yang Zhang.
Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices
through deep residual convolutional networks. submitted, 2020.
[118] David E. Kim, Frank DiMaio, Ray Yu-Ruei Wang, Yifan Song, and David Baker. One
contact for every twelve residues allows robust and accurate topology-level protein
structure modeling. Proteins: Structure, Function, and Bioinformatics, 82:208–218,
2014.
[119] Zhao, Y., and Karypis, G. 2003. Prediction of contact maps using support vector
machines. BIBE 2003, Bethesda, MD. IEEE Computer Society, pp. 26–36
[120] Jones DT and Kandathil SM (2018). High precision in protein contact prediction using
fully convolutional neural networks and minimal sequence
features. Bioinformatics 34(19): 3308-3315.
[121] Bepler, Tristan, and Bonnie Berger. "Learning protein sequence embeddings using
information from structure." arXiv preprint arXiv:1902.08661 (2019).
[122] Y. Xu, D. Xu, and J. Liang, Computational Methods for Protein Structure Prediction and
Modeling. Springer, 2007.
[123] J. Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature,
vol. 596, no. 7873, pp. 583-589, Aug 26 2021, doi: 10.1038/s41586-021-03819-2.
[124] R. Zamora-Resendiz and S. Crivelli, "Structural learning of proteins using graph
convolutional neural networks," bioRxiv, 10.1101/610444, Apr 16 2019, doi:
10.1101/610444.
97
[125] T. Corcoran, R. Zamora-Resendiz, X. Liu, and S. Crivelli, "A spatial mapping algorithm
with applications in deep learning-based structure classification," arXiv, 1802.02532v2,
Feb 22 2018, doi: 10.48550/arXiv.1802.02532.
[126] X. Yuan and C. Bystroff, "Protein contact map prediction," in Protein Structure Prediction
and Modeling, vol. 1, Y. Xu, D. Xu, and J. Liang Eds., 2007, ch. 8.
[127] H. Berman, K. Henrick, and H. Nakamura, "Announcing the worldwide Protein Data
Bank," Nat Struct Biol, vol. 10, no. 12, pp. 980-980, Dec 2003, doi: 10.1038/nsb1203-980.
[128] Y. Liu, P. Palmedo, Q. Ye, B. Berger, and J. Peng, "Enhancing evolutionary couplings with
deep convolutional neural networks," Cell Syst, vol. 6, no. 1, pp. 65-74, Jan 24 2018, doi:
10.1016/j.cels.2017.11.014.
[129] S. Wang, S. Q. Sun, Z. Li, R. Y. Zhang, and J. B. Xu, "Accurate de novo prediction of
protein contact map by ultra-deep learning model," PLoS Comput Biol, vol. 13, no. 1,
e1005324, Jan 2017, doi: 10.1371/journal.pcbi.1005324.t001.
[130] D. T. Jones and S. M. Kandathil, "High precision in protein contact prediction using fully
convolutional neural networks and minimal sequence features," Bioinformatics, vol. 34,
no. 19, pp. 3308-3315, Oct 1 2018, doi: 10.1093/bioinformatics/bty341.
[131] Y. Li et al., "Deducing high-accuracy protein contact-maps from a triplet of coevolutionary
matrices through deep residual convolutional networks," PLoS Comput Biol, vol. 17, no.
3, e1008865, Mar 2021, doi: 10.1371/journal. pcbi.1008865.
[132] Y. Zhao and G. Karypis, "Prediction of contact maps using support vector machines," Int
J Artif Intell T, vol. 14, no. 5, pp. 849-865, Oct 2005, doi: 10.1142/S0218213005002429.
[133] T. Bepler and B. Berger, "Learning protein sequence embeddings using information from
structure," Proceedings of International Conference on Learning Representations, ICLR,
1078, 2019.
[134] B. Kuhlman and P. Bradley, "Advances in protein structure prediction and design," Nat
Rev Mol Cell Bio, vol. 20, no. 11, pp. 681-697, Nov 2019, doi: 10.1038/ s41580-019-0163-
x.
[135] D. Mercadante, F. Grater, and C. Daday, "CONAN: a tool to decode dynamical information
from molecular interaction maps," Biophys J, vol. 114, no. 6, pp. 1267-1273, Mar 27 2018,
doi: 10.1016/j.bpj.2018.01.033.
[136] D. E. Kim, F. DiMaio, R. Y. R. Wang, Y. F. Song, and D. Baker, "One contact for every
twelve residues allows robust and accurate topology-level protein structure modeling,"
Proteins, vol. 82, pp. 208-218, Feb 2014, doi: 10.1002/prot.24374.
[137] K. S. Tai, R. Socher, and C. Manning, "Improved semantic representations from tree-
structured long short-term memory networks," Proc ACL, pp. 1556-1566, 2015, doi:
10.3115/v1/P15-1150.
98
[138] S. Ogata, E. Lidorikis, F. Shimojo, A. Nakano, P. Vashishta, and R. K. Kalia, "Hybrid
finite-element/molecular-dynamics/electronic-density-functional approach to materials
simulations on parallel computers," Computer Physics Communications, vol. 138, no. 2,
pp. 143-154, Aug 1 2001, doi: 10.1016/S0010-4655(01)00203-X.
[139] A. Warshel, "Multiscale Modeling of Biological Functions: From Enzymes to Molecular
Machines (Nobel Lecture)," Angew Chem, vol. 53, no. 38, pp. 10020-10031, Sep 15 2014,
doi: 10.1002/anie.201403689.
[140] M. AlQuraishi, "ProteinNet: a standardized data set for machine learning of protein
structure," BMC Bioinformatics, vol. 20, 311, Jun 11 2019, doi: 10.1186/s12859-019-2932-
0.
[141] K. Liu et al., "Shift-collapse acceleration of generalized polarizable reactive molecular
dynamics for machine learning-assisted computational synthesis of layered materials,"
Proceedings of Workshop on Latest Advances in Scalable Algorithms for Large-Scale
Systems, ScalA18, pp. 41-48, Nov 12 IEEE, 2018, doi: 10.1109/ScalA.2018.00009.
[142] A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in
TensorFlow," arXiv, 1802.05799v3, Feb 21 2018, doi: 10.48550/arXiv.1802.05799.
[143] T. Kurth, M. Smorkalov, P. Mendygral, S. Sridharan, and A. Mathuriya, "TensorFlow at
scale: performance and productivity analysis of distributed training with Horovod, MLSL,
and Cray PEML," Concurrency, vol. 31, e4989, Oct 21 2019, doi: 10.1002/cpe.4989.
[144] P. Balaprakash, M. Salim, T. D. Uram, V. Vishwanath, and S. M. Wild, "DeepHyper:
asynchronous hyperparameter search for deep neural networks," Proc HiPC, Dec IEEE,
2018, doi: 10.1109/HiPC.2018.00014.
[145] M. A. Salim, T. D. Uram, J. T. Childers, P. Balaprakash, V. Vishwanath, and M. E. Papka,
"Balsam: automated scheduling and execution of dynamic, data-intensive HPC
workflows," arXiv, 1909.08704v1 Sep 18 2019, doi: 10.48550/arXiv.1909.08704.