SIMULATION AND MACHINE LEARNING AT EXASCALE
by
Kuang Liu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2022
Copyright 2022 Kuang Liu
Table of Contents
List of Tables ...................................................................................................................................v
List of Figures ................................................................................................................................ vi
Abstract ............................................................................................................................................x
Introduction ......................................................................................................................................1
Chapter 1 Acceleration of Dynamic n-tuple Computations in Many-Body Molecular
Dynamics .....................................................................................................................................4
1.1 Introduction ......................................................................................................................4
1.2 Physical Model.................................................................................................................6
1.2.1 Interatomic potential ....................................................................................................6
1.2.2 MD Simulation.............................................................................................................8
1.2.3 Linked-List Cell Method............................................................................................10
1.3 GPGPU Implementation ................................................................................................11
1.3.1 Two-Body Interaction List Generation ......................................................................13
1.3.2 Three-Body Interaction List Generation ....................................................................16
1.4 Benchmark and Validation ............................................................................................19
1.4.1 GPU Implementation versus the Original MD Code .................................................19
1.4.2 GPU Implementation versus the Baseline .................................................................21
1.4.3 Dynamic Quadruplet Computation in ReaxFF ..........................................................25
1.5 Summary ........................................................................................................................28
Chapter 2 Shift-Collapse Acceleration of Generalized Polarizable Reactive Molecular
Dynamics for Machine Learning-Assisted Computational Synthesis of Layered Material ......30
2.1 Introduction ....................................................................................................................30
2.2 Application and Algorithmic Innovations .....................................................................32
2.2.1 Generalized Polarizable Reactive Force Field (ReaxPQ+) ........................................32
2.2.2 Extended-Lagrangian Acceleration of ReaxPQ+.......................................................36
2.2.3 Renormalized Shift-Collapse Acceleration of ReaxPQ+ ...........................................37
2.3 Parallel Implementation and Performance Optimizations .............................................40
2.3.1 Performance Improvement by Algorithm Accelerations ...........................................42
2.3.2 Improving Thread Scalability by Round-Robin Data Privatization ..........................43
2.3.3 Promoting Data Parallelism by Data Alignment .......................................................44
2.3.4 Additional Code Transformations and Performance Improvement ...........................46
2.4 Scalability and Time-to-Solution ...................................................................................47
2.4.1 Experimental Platform ...............................................................................................47
2.4.2 Weak and Strong Scaling on Theta ............................................................................48
2.4.3 Time-to-Solution Improvement .................................................................................49
2.5 Machine-Learning Guided Computational Synthesis of Atomically Thin Layered
Materials ....................................................................................................................................50
2.6 Summary ........................................................................................................................52
Chapter 3 Machine Learning in Material Simulations ...............................................................53
3.1 Introduction ....................................................................................................................53
3.1.1 Dielectric Polymer Genome .......................................................................................53
3.1.2 Layered Material Phases ............................................................................................54
3.2 Method ...........................................................................................................................56
3.2.1 Multiple Layer Perceptron .........................................................................................56
3.2.2 Graph Neural Networks for Learning Layered Material ...........................................60
3.2.3 Dataset Description ....................................................................................................60
3.2.4 Graph Neural Networks .............................................................................................61
3.3 Experiment and Analysis ...............................................................................................63
3.3.1 Input Features and Training Settings .........................................................................64
3.3.2 Predictive Performance for Layered Material Phases ................................................65
3.3.3 Visualization of Hidden Node States .........................................................................67
3.3.4 Performance for Dielectric Polymer Genome............................................................67
3.4 Summary ........................................................................................................................73
Chapter 4 Protein Modeling at Exascale ....................................................................................75
4.1 Introduction ....................................................................................................................75
4.2 Contact Map Prediction with Representation Learning .................................................77
4.3 Results and Discussion ..................................................................................................82
4.4 Future Work ...................................................................................................................85
References ......................................................................................................................................87
List of Tables
Algorithm 1: Body of the main MD simulation loop based on the velocity-Verlet algorithm. .......8
Algorithm 2: GPU implementation of the generation of three-body lists. ....................................17
Table 1.1: Achieved occupancy of kernel execution. ....................................................................24
Table 2.1: Computed dielectric constants. .....................................................................................35
Table 2.2: PQEq get_hsh procedure in RXMD. ............................................................................45
Table 2.3: Sorted polarized neighbor list and stride. .....................................................................45
Table 2.4: Packing atomic-pair distance into neighbor list............................................................46
Table 3.1: Input node and edge features. .......................................................................................64
Table 3.2: F1 scores for different input features. Higher value signifies better classification
accuracy. ................................................................................................................................65
Table 3.3: ROC-AUC for different input features. Value is in the range of (0, 1). Higher
value signifies better classification accuracy. ........................................................................66
Table 3.4: Hyperparameters used in our random forest model. .....................................................71
Table 3.5: Performance metrics of the random forest model.........................................................72
Table 4.1: Short-range contact prediction results. .........................................................................84
Table 4.2: Medium-range contact prediction results. ....................................................................84
Table 4.3: Long-range contact prediction results. .........................................................................84
List of Figures
Figure 1.1: The workflow of the main MD simulation loop: (a) main solver loop, (b)
original pipelining of atomic force computation on CPU, (c) restructured pipeline on
GPU..........................................................................................................................................9
Figure 1.2: Concurrent construction of valid interactions. ............................................................16
Figure 1.3: Concurrent construction of valid interactions. ............................................................21
Figure 1.4: Speedup of GPU MD over reference CPU MD for varying numbers of atoms. .........21
Figure 1.5: Wall-clock time of the TD approach (orange circles) and that of the baseline (PD,
black squares) as a function of the number of atoms. The figure also shows the
performance improvement of the TD approach over the baseline as a function of the
number of atoms. ...................................................................................................................22
Figure 1.6: Wall-clock time on kernels for 2-body potential computation per time step for
the baseline (left) and TD (right) approaches. .......................................................................23
Figure 1.7: Relationship between the number of warps per SM and the number of registers
per thread for kernels. ............................................................................................................25
Figure 1.8: Timing comparison of the TD approach and baseline (PD) for the computation
of quadruplet interactions in ReaxFF. The histogram shows wall-clock times per MD
step in milliseconds averaged over 100 time steps. We use RDX material for this
benchmark. .............................................................................................................................27
Figure 1.9: Strong and weak scaling of the parallel ReaxFF MD code on Blue Gene/Q. .............27
Figure 2.1: Schematic of the response of core (green) and shell (yellow) charges to an
external electric field in the new PQEq+ model. ...................................................................33
Figure 2.2: (a) PQEq energy with 1, 10 and 100 iterations of the Newton-Raphson method.
(b) Atomic charges after 1 ps of MD simulation with converged RMD (blue) and
XRMD (red) methods. ...........................................................................................................34
Figure 2.3: Illustration of many-body renormalized n-tuple computation of g(i) based on
(a) conventional full neighbor list (NBR) and (b) partial neighbor lists (PNBR^RES
and PNBR^NT) used in the RSC algorithm. ....................................................................................39
Figure 2.4: (a) Runtime comparison between different configurations of MPI ranks and
OpenMP threads on 1 node to identify the optimal combination. Total number of
processes is 64. (b) Average runtime per step for the original ReaxPQ+ and new
ReaxPQ+SC implementations on 96 and 1,440 cores. ReaxPQ+SC reduces time-to-solution
5.0- and 5.7-fold below ReaxPQ+ for the 96- and 1,440-core benchmarks,
respectively. The numbers denote runtimes...........................................................................42
Figure 2.5: (a, b) Data privatization. All threads write to the global force array using atomic
operations (a), or each thread writes to a private array (b). (c) Round-robin reduction
with non-overlapping chunks for improving data privatization. ...........................................43
Figure 2.6: Sorted neighbor list followed by SIMD operation. .....................................................45
Figure 2.7: Performance improvement after data privatization and other code
transformations. The numbers denote runtimes. ....................................................................47
Figure 2.8: (a) Wall-clock time of the ReaxPQ+ adapted RXMD code, with scaled
workloads — 24,576P-atom MoS2 on P cores (P = 64, ..., 131,072) of Theta. (b)
Strong-scaling speedup of the ReaxPQ+ adapted RXMD code with a fixed problem
size — 50,331,648-atom MoS2 system on P cores (P = 2,048, ..., 32,768) of Theta.
The measured speedup values (blue) are compared with the ideal speedup (black). The
numbers denote speedups. .....................................................................................................49
Figure 2.9: Machine learning-guided computational synthesis of MoS2 monolayer by CVD.
(a) Simulation setup showing a MoO3 monolayer suspended in S2 gas. The atoms are
colored as Mo: blue, S: yellow, and O: red. (b, c) Close-up of MoS2 monolayer before
(b) and after (c) annealing. The local structures are classified into 1T (green), 2H (red)
and disordered (blue) phases. For clarity, gaseous environment is not shown. (d)
Neural network model for defect identification and classification. .......................................51
Figure 3.1: Top views of 2H (left) and 1T (right) crystalline phases of MoS2 monolayer.
Magenta and yellow spheres represent Mo and S atoms, respectively, whereas Mo-S
bonds are represented by cylinders. .......................................................................................56
Figure 3.2: Schematic of 2H vs. 1T phase classification by graph neural network. ......................60
Figure 3.3: Repeating message function and update function T times to learn atom
representations. ......................................................................................................................63
Figure 3.4: Snapshot of computationally synthesized MoS2 monolayer. Ball-and-stick
representation is used to show the positions of atoms and covalent bonds with
neighboring atoms. Atoms are color-coded as yellow for sulfur (S), blue for
molybdenum (Mo), and red for oxygen (O). ........................................................................64
Figure 3.5: ROC curves for 1T (green) and 2H (red) phases using node and edge features
as input, which correspond to the second row in Table 3.3. The x-axis is the true
positive rate (TPR) and the y-axis is the false positive rate (FPR). .......................................66
Figure 3.6: Layer-wise evolution of node states. The x-axis is the dimensions of features
(padded to 20) and the y-axis is the number of atoms (zero-padded to 40) in the graph.
Each row represents the feature vector of an atom, and the color of each pixel indicates
the value on the unit. ..............................................................................................................67
Figure 3.7: Dielectric constants (A) and SMILES dimensions (B) of the polymer dataset. ..........68
Figure 3.8: Model prediction vs. training data of polymer dielectric constants using the
MLP model with different activation functions: (A) modified ReLU (RMSE = 0.053);
(B) ReLU (RMSE = 0.242); (C) Tanh (RMSE = 0.267). ......................................................69
Figure 3.9: Performance of MLP prediction model (Elman type). Comparison of prediction
on (A) 662 polymers and (B) 546 polymers training sets. .....................................................70
Figure 3.10: Model prediction vs. training data of polymer dielectric constants using the
random forest model. .............................................................................................................71
Figure 3.11: Convergence of the RNN model performance as a function of training epochs.
................................................................................................................................................72
Figure 4.1: Left: 3D protein structure. Right: residue-residue contact map of a protein. ..............78
Figure 4.2: Procedures of the learning task. AA: amino acid. ......................................................79
Figure 4.3: The proposed multiscale GNN approach on top of the existing RNN-CNN
pipeline. ..................................................................................................................................80
Abstract
Machine learning (ML) is revolutionizing scientific research by utilizing its ever-growing
capability of knowledge discovery from massive simulation data. With the anticipated arrival of
exascale computers, new opportunities and challenges are emerging to maximize the synergy
between exaFLOPS supercomputers and ML-boosted computational science. This thesis addresses
two important and intertwined problems: (i) developing efficient billion-core parallel algorithms
for scientific simulations, and (ii) designing domain-specific deep neural networks for scientific
data analysis. To achieve these goals, I design a series of algorithms with specific focus on
molecular dynamics (MD) simulations and protein folding as archetypal applications.
The research consists of the following three steps:
1. Analyze the computational pattern of MD simulations to expose their common features for
advanced parallelization, such as multithreading on both central processing unit (CPU) and
graphics processing unit (GPU) platforms, and improve the computational performance
accordingly. Specifically, two optimization algorithms are proposed to accelerate computation on
GPUs and CPUs, respectively: (i) GPU acceleration based on tuple decomposition (TD), which
reorders computations according to dynamically created lists of n-tuples; (ii) CPU acceleration
employing shift-collapse computation of many-body renormalized n-tuples, which minimizes MPI
communications, and multithreading with round-robin data privatization, which improves thread
scalability, thereby achieving a weak-scaling parallel efficiency of 0.989 on 131,072 cores, along
with an eightfold reduction of time-to-solution (T2S) compared with the original code, on an Intel
Knights Landing-based computer.
2. Development of specific deep neural networks to learn material properties from MD simulation
data, including: (i) a multilayer perceptron (MLP) and a recurrent neural network (RNN) to predict
dielectric constants of polymers; and (ii) a graph neural network (GNN) to classify different
crystalline phases inside layered material computationally synthesized by MD simulations.
3. Extension of ML to the domain of protein folding, using a deep neural network to help predict
the residue-residue contact map, which is a critical component for understanding protein structure.
Due to the colossal volume of data in both ML tasks, we propose to adapt our neural networks to
exascale parallelism.
Introduction
The rise of machine learning (ML) in the past decade has extensively augmented the pathway
for scientific research with its ever-growing capability of knowledge discovery from massive
simulation data. As the community is poised to embrace the exascale computing era, new
challenges are emerging to maximize the synergy between exaFLOPS supercomputers and ML-boosted
computational science. Two problems are of great significance: 1) developing efficient
billion-core parallel algorithms for scientific simulations; and 2) designing domain-specific deep
neural networks for scientific data analysis. I propose to address these two problems by designing
a series of algorithms, with specific focus on molecular dynamics (MD) simulations and protein
folding as archetypal applications.
The methodology of this proposal consists of three steps: (i) we analyze the computational
pattern of MD simulations to expose their common features for advanced parallelization, such as
multithreading on both CPU and GPU platforms, and improve the computational performance
accordingly; (ii) informed by existing deep learning architectures, we develop specific deep
neural networks to learn particle properties from our MD simulation data; and (iii) we extend our
focus to the domain of protein folding, using deep neural networks to help predict the residue-residue
contact map, which is a critical component for understanding protein structure. Due to the colossal
volume of data in both ML tasks, our neural networks are required to be adaptive to exascale
parallelism.
In the first step, two optimization algorithms are proposed to accelerate computation on
CPUs and GPUs, respectively.
1. Our GPU acceleration applies a tuple-decomposition (TD) approach that reorders computations
according to dynamically created lists of n-tuples. We analyze the performance characteristics of
the TD approach on general-purpose graphics processing units for MD simulations involving pair
(n = 2) and triplet (n = 3) interactions. The results show superior performance of the TD approach
over the conventional particle-decomposition (PD) approach. Detailed analyses reveal the register
footprint as the key factor that dictates the performance. Furthermore, the TD approach is found
to outperform PD for the more intensive computation of quadruplet (n = 4) interactions in
first-principles-informed reactive MD simulations based on the reactive force-field (ReaxFF) method.
This work thus demonstrates the viable performance portability of the TD approach across a wide
range of applications.
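As a rough sketch of the tuple-decomposition idea (illustrative names, an O(N²) pair enumeration, and a Lennard-Jones-style placeholder force, not the thesis potential or code), the computation is reorganized so that one unit of work maps to one tuple rather than one particle:

```python
import numpy as np

def build_pair_list(pos, rc):
    """Enumerate all i < j pairs within cutoff rc into a flat tuple list.
    An O(N^2) sketch; a production code would use linked-list cells instead."""
    pairs = []
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pos[i] - pos[j]) < rc:
                pairs.append((i, j))
    return pairs

def forces_tuple_decomposition(pos, pairs):
    """TD idea: one unit of work per *tuple* (on GPU, one thread per tuple),
    rather than one per particle as in particle decomposition (PD)."""
    f = np.zeros_like(pos)
    for i, j in pairs:
        r = pos[i] - pos[j]
        d = np.linalg.norm(r)
        # Placeholder Lennard-Jones-style force, NOT the thesis potential
        fij = 24.0 * (2.0 / d**13 - 1.0 / d**7) * (r / d)
        f[i] += fij
        f[j] -= fij  # Newton's third law
    return f
```

Because each tuple is an independent work item, the list maps naturally to GPU threads once it has been built in situ.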
2. Our CPU acceleration employs shift-collapse computation of many-body renormalized n-tuples,
which minimizes MPI communications, and multithreading with round-robin data privatization,
which improves thread scalability. The new code achieves a weak-scaling parallel efficiency of
0.989 on 131,072 cores and an eightfold reduction of time-to-solution (T2S) compared with the
original code, on an Intel Knights Landing-based computer. The reduced T2S has for the first time
allowed purely computational synthesis of atomically thin transition metal dichalcogenide layers,
assisted by machine learning to discover a novel synthetic pathway.
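The round-robin privatization pattern can be modeled serially as follows; names are illustrative, and each loop over threads stands in for a parallel region in the actual multithreaded code:

```python
import numpy as np

def privatized_accumulate(contribs, n_atoms):
    """Each 'thread' t accumulates its force contributions into its own
    private array, so no atomic operations are needed during accumulation."""
    priv = np.zeros((len(contribs), n_atoms))
    for t, items in enumerate(contribs):
        for atom, val in items:
            priv[t, atom] += val
    return priv

def round_robin_reduce(priv):
    """Reduce private arrays into a global one. In each phase, thread t owns
    chunk (t + phase) % n_threads, so chunk ownership rotates round-robin and
    no two threads ever write to the same chunk concurrently."""
    n_threads, n_atoms = priv.shape
    chunk = -(-n_atoms // n_threads)  # ceiling division
    total = np.zeros(n_atoms)
    for phase in range(n_threads):
        for t in range(n_threads):  # conceptually parallel within a phase
            c = (t + phase) % n_threads
            lo, hi = c * chunk, min((c + 1) * chunk, n_atoms)
            total[lo:hi] += priv[t, lo:hi]
    return total
```

The rotation keeps all threads busy in every reduction phase while preserving disjoint write sets, which is the source of the improved thread scalability.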
In the second step, we develop several deep neural networks to learn material properties
from MD simulation data: (i) a multilayer perceptron (MLP) and a recurrent neural network (RNN)
to predict dielectric constants of polymers; and (ii) a graph neural network (GNN) to classify
different crystalline phases inside a computationally synthesized layered material, molybdenum
disulfide monolayer, by MD simulations. In particular, we apply GNN-based analysis to
automatically classify different crystalline phases inside a molybdenum disulfide monolayer
computationally synthesized by reactive molecular dynamics (RMD) simulations on parallel
computers. We have found that the addition of edge-based features such as distance increases the
model accuracy up to 0.9391. Network analysis by visualizing the feature space of our GNN model
clearly separates the 2H and 1T crystalline phases inside the network. This work demonstrates the
power of the GNN model to identify structures inside multimillion-atom RMD simulation data of
complex materials.
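A minimal sketch of one message-passing round with edge features (such as interatomic distance) follows; the shapes, weight matrices, and tanh activation are illustrative assumptions, not the thesis model:

```python
import numpy as np

def message_pass(h, edges, edge_feat, W_msg, W_upd):
    """One round of GNN message passing on an undirected molecular graph.
    h: (n_atoms, d) node states; edges: list of (i, j); edge_feat: per-edge
    feature vectors such as interatomic distance."""
    n, d = h.shape
    m = np.zeros((n, d))
    for (i, j), e in zip(edges, edge_feat):
        # Message uses the neighbor state concatenated with the edge feature
        m[i] += np.tanh(np.concatenate([h[j], e]) @ W_msg)
        m[j] += np.tanh(np.concatenate([h[i], e]) @ W_msg)
    # Update each node state from its previous state and aggregated messages
    return np.tanh(np.concatenate([h, m], axis=1) @ W_upd)
```

Repeating this update T times lets each atom's state absorb information from its T-hop neighborhood, which is what allows per-atom phase classification.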
In the third step, we shift our attention to protein folding, which is an increasingly active
area of machine learning research. Inferring the structural properties of a protein from its amino
acid sequence is of great importance for understanding protein functions. The process of protein
folding from primary structure to tertiary structure is decomposed into successive steps, where a
collection of computational methods is synthesized to predict the final structure. Here, we
approach one of the key steps, residue-residue contact map prediction, through the lens of GNNs.
Training of the proposed model will be conducted in a massively parallel manner to exploit the
training efficiency expected on the upcoming supercomputers.
The rest of the document is organized as follows. Chapter 1 describes GPU acceleration of
dynamic n-tuple computations in many-body MD simulations. Chapter 2 presents our shift-collapse
algorithm to accelerate generalized polarizable reactive MD simulations for machine
learning-assisted computational synthesis of layered materials on leadership-scale parallel
supercomputers. Chapter 3 deals with machine learning of MD simulation data, using the specific
examples of dielectric polymers and layered materials. Finally, the proposed research on exascale
machine learning for protein modeling is presented in Chapter 4.
Chapter 1
Acceleration of Dynamic n-tuple Computations in Many-Body
Molecular Dynamics
1.1 Introduction
Computation on dynamic n-tuples of particles is ubiquitous in scientific computing, with
an archetypal application in many-body molecular dynamics (MD) simulations. MD is the most
widely used simulation method for studying the structural and dynamic properties of materials [1]. MD
simulations follow the trajectories of all atoms, while computing the interatomic interaction as a
function of atomic positions. In his pioneering MD simulation in 1964, Aneesur Rahman used a
pairwise interatomic potential that depended only on the relative positions of atomic pairs [2]. More
complex interatomic potentials (or force fields) have since been developed to study a wide variety
of materials. In MD simulations of biomolecules, for example, the connectivity of atoms is fixed
throughout the simulation, and the interatomic potential is a function of the relative positions of
fixed n-tuples (n = 2, 3, 4) [3]. To describe broader materials processes such as structural
transformations [4, 5] and chemical reactions [6, 7], however, the connectivity of atoms needs to be
dynamically updated, resulting in many-body MD simulations based on dynamic n-tuple
interactions. Such is the case for many-body MD simulations of inorganic materials, which
typically involve pair (n = 2) and triplet (n = 3) interactions [8]. Higher-order n-tuple computations
are used in first-principles-informed reactive molecular dynamics (RMD) simulations based on the
reactive force-field (ReaxFF) method [6, 9, 10]. ReaxFF describes the formation and breakage of
chemical bonds based on a reactive bond-order concept, and its interatomic forces involve
computations on up to quadruplets (n = 4) explicitly and sextuplets (n = 6) implicitly through the
chain rule of differentiation.
One of the simplest ways to map MD simulations onto parallel computers is spatial
decomposition [11]. Here, the three-dimensional space is subdivided into spatially localized
domains, and the interatomic forces among n-tuples involving the atoms in each domain are
computed by a dedicated processor in a parallel computer [12]. To achieve higher parallelism than
this spatial-decomposition approach, interatomic forces are often decomposed in various
force-decomposition approaches [13-15].

On high-end parallel supercomputers, each of the networked computing nodes consists of
many cores and is often augmented with accelerators such as general-purpose graphics processing
units (GPGPUs) [16]. On such platforms with deep hierarchical memory architectures,
metascalable (or "design once, continue to scale on future architectures") parallel algorithms often
employ globally scalable and locally fast solvers [17-19]. An example of such global-local
separation is the computation of long-range electrostatic potentials, where highly scalable
real-space multigrids for internode computations are combined with fast spectral methods for intranode
computations [17-19]. For MD simulations, the global-local separation insulates the optimization
of intranode computations of dynamic n-tuples from the internode parallelization approaches
described above. In this paper, we thus focus on GPGPU acceleration of local dynamic n-tuple
computations. Various schemes have been proposed for the optimization of local MD
computations for pairwise [20] and more general dynamic n-tuple interactions [21].
Extensive research and prior work exist that explore the problem of accelerating CPU-based
MD simulation codes with the GPGPU architecture [22-26]. However, the majority of these
works focused on the mechanical aspects of accelerating and tuning a set of existing codes on
GPGPU. These mechanical aspects include better memory organization to promote coalescing of
global memory-access operations, tuning of register usage, exploiting the GPU shared/cache
memory hierarchy, and minimization of communication between the host and device.

While all of these aspects are crucial for achieving large speedups of scientific codes on
GPU, here we instead propose an alternative approach to GPU acceleration, named tuple
decomposition (TD), that restructures the enumeration of interatomic interactions and the calculation
of potential energies so that they can be performed more efficiently on GPGPU. Our approach
employs two techniques: (1) pipelining, and (2) in situ construction of two-body and three-body
atomic interaction lists on GPGPU. An existing MD codebase, which calculates two-body and
three-body interatomic potentials based on the conventional particle-decomposition (PD)
approach, serves as a platform to demonstrate our approach.
Despite various GPGPU implementations of dynamic n-tuple computations, fewer studies
have focused on the critical factors that dictate the performance of these approaches. In this work,
we analyze the performance characteristics of the TD and PD approaches. Among more
conventional factors such as thread divergence, we have found that the register footprint plays a
critical role in controlling GPGPU performance. In addition to a many-body (n = 2 and 3) MD
simulation, this finding is shown to hold for higher-order n-tuple computations in ReaxFF-based
RMD simulations.
1.2 Physical Model
1.2.1 Interatomic potential
The MD simulation software used as a basis for this research is described in Ref. [7]. To
make the discussion specific, we first consider an interatomic potential proposed in Ref. [8] to
study structural and dynamic correlations in silica (SiO2). This implementation includes
an interatomic potential combining pair and triplet interactions, and the linked-list cell method for
reducing the computational complexity of the force calculation to O(N) (where N is the number of
atoms), while the standard velocity Verlet algorithm is used for time stepping [7, 12, 21, 27].
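The linked-list cell method can be sketched as follows; this is a simplified cubic-box version with illustrative names (`build_linked_cells`, `atoms_in_cell`), not the data layout of the actual code:

```python
import numpy as np

def build_linked_cells(pos, box, rc):
    """Bin atoms of a cubic box into cells of edge >= rc. Each cell keeps the
    head of a singly linked list threaded through lscl, so neighbor search
    scans only the 27 surrounding cells: O(N) work overall."""
    nc = max(1, int(box // rc))               # cells per dimension
    head = -np.ones((nc, nc, nc), dtype=int)  # index of first atom in each cell
    lscl = -np.ones(len(pos), dtype=int)      # index of next atom in same cell
    for i, r in enumerate(pos):
        c = tuple(np.floor(r / box * nc).astype(int) % nc)
        lscl[i] = head[c]  # prepend atom i to the cell's list
        head[c] = i
    return head, lscl

def atoms_in_cell(head, lscl, c):
    """Walk the linked list of one cell."""
    out, i = [], head[c]
    while i != -1:
        out.append(i)
        i = lscl[i]
    return out
```

Since each atom is inserted in O(1) and each force evaluation scans a bounded number of neighboring cells, the total cost is linear in N rather than quadratic.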
The potential energy of the system is a sum of atomic pair (or two-body) and triplet (or
three-body) contributions [8]:
V = \sum_{i<j} u^{(2)}_{ij}(r_{ij}) + \sum_{i,\, j<k} u^{(3)}_{ijk}(\mathbf{r}_{ij}, \mathbf{r}_{ik}),   (1.1)
where r_{ij} = |\mathbf{r}_{ij}|, \mathbf{r}_{ij} = \mathbf{r}_i - \mathbf{r}_j, and \mathbf{r}_i is the three-dimensional position vector of the i-th atom. In Eq. (1.1), the pair potential is given by
u^{(2)}_{ij}(r) = A \left( \frac{\sigma_i + \sigma_j}{r} \right)^{\eta_{ij}} + \frac{Z_i Z_j}{r} e^{-r/\lambda} - \frac{\alpha_i Z_j^2 + \alpha_j Z_i^2}{2 r^4} e^{-r/\xi}.   (1.2)
Here, the three terms represent the steric repulsion, the Coulombic interaction due to charge
transfer, and an induced charge-dipole interaction caused by the electronic polarizabilities of atoms,
respectively. The pair potential is truncated at a cutoff distance, r_{ij} = r_{c2}. The triplet potential in
Eq. (1.1) is expressed as
u^{(3)}_{ijk}(\mathbf{r}_{ij}, \mathbf{r}_{ik}) = B_{ijk} \exp\left( \frac{\xi}{r_{ij} - r_{c3}} + \frac{\xi}{r_{ik} - r_{c3}} \right) \frac{(\cos\theta_{jik} - \cos\bar{\theta}_{jik})^2}{1 + C_{ijk} (\cos\theta_{jik} - \cos\bar{\theta}_{jik})^2} \, \Theta(r_{c3} - r_{ij}) \, \Theta(r_{c3} - r_{ik}),   (1.3)
where \Theta(x) is the step function. In Eq. (1.3), the cutoff radius for triplet interactions, r_{c3}, is much
smaller than that for pair interactions, r_{c2}. The parameters in Eqs. (1.2) and (1.3) are found in Ref. [8].
These parameters are fitted to reproduce the experimentally measured structural and mechanical
properties of normal-density silica glass.
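For concreteness, the pair potential of Eq. (1.2) can be transcribed directly (an illustrative Python sketch; the parameter names mirror the symbols in the equation, and any values supplied to it here are placeholders rather than the fitted silica parameters of Ref. [8]):

```python
import math

def pair_potential(r, A, sigma_i, sigma_j, eta, Z_i, Z_j, lam,
                   alpha_i, alpha_j, xi):
    """Two-body potential of Eq. (1.2): steric repulsion plus screened
    Coulomb interaction minus screened charge-dipole attraction."""
    steric = A * ((sigma_i + sigma_j) / r) ** eta
    coulomb = Z_i * Z_j / r * math.exp(-r / lam)
    dipole = (alpha_i * Z_j**2 + alpha_j * Z_i**2) / (2 * r**4) \
             * math.exp(-r / xi)
    return steric + coulomb - dipole
```

In the production code this function body is what each pair evaluation amounts to, truncated at r = r_{c2} and differentiated analytically for the forces.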
1.2.2 MD Simulation
The trajectories of atoms are discretized with a time discretization unit of \Delta t. In the widely
used velocity Verlet algorithm, the positions \mathbf{r}_i(t) and velocities \mathbf{v}_i(t) of atoms (i = 1, ..., N) at time
t are updated as
\mathbf{r}_i(t + \Delta t) = \mathbf{r}_i(t) + \mathbf{v}_i(t)\,\Delta t + \frac{1}{2}\,\mathbf{a}_i(t)\,\Delta t^2 + O(\Delta t^4),   (1.4)
\mathbf{v}_i(t + \Delta t) = \mathbf{v}_i(t) + \frac{\mathbf{a}_i(t) + \mathbf{a}_i(t + \Delta t)}{2}\,\Delta t + O(\Delta t^3),   (1.5)
where
\mathbf{a}_i = \frac{\mathbf{F}_i}{m_i} = -\frac{1}{m_i} \frac{\partial V}{\partial \mathbf{r}_i}   (1.6)
is the acceleration of the i-th atom. In Eq. (1.6), \mathbf{F}_i is the force acting on the i-th atom and m_i is its
mass. The velocity Verlet algorithm repeats the body of the main MD simulation loop as shown in
Algorithm 1. Note that the acceleration at time t, \mathbf{a}_i(t), has already been computed either in the
previous simulation step or, for the first step, before the main MD simulation loop is entered.
Algorithm 1: Body of the main MD simulation loop based on the velocity-Verlet algorithm.
1. v_i(t + \Delta t/2) \leftarrow v_i(t) + (\Delta t/2) a_i(t) for all i
2. r_i(t + \Delta t) \leftarrow r_i(t) + v_i(t + \Delta t/2) \Delta t for all i
3. Compute a_i(t + \Delta t) as a function of the set of atomic positions {r_i(t + \Delta t) | i = 1, ..., N}, according to Eq. (1.6), for all i
4. v_i(t + \Delta t) \leftarrow v_i(t + \Delta t/2) + (\Delta t/2) a_i(t + \Delta t) for all i
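The four steps of Algorithm 1 can be sketched as a single host-side routine (a minimal Python illustration, not the production implementation; compute_accel stands in for the force calculation of Eq. (1.6)):

```python
import numpy as np

def velocity_verlet_step(r, v, a, dt, compute_accel):
    """One body of the main MD loop (Algorithm 1).

    r, v, a : (N, 3) arrays of positions, velocities, accelerations at time t
    compute_accel : callable returning a(t + dt) from the updated positions
    """
    v_half = v + 0.5 * dt * a            # step 1: half-step velocity update
    r_new = r + v_half * dt              # step 2: full-step position update
    a_new = compute_accel(r_new)         # step 3: forces at t + dt
    v_new = v_half + 0.5 * dt * a_new    # step 4: complete the velocity update
    return r_new, v_new, a_new
```

Note that a(t) is reused from the previous iteration, exactly as the text describes, so the force routine is called only once per time step.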
On parallel computers, we employ a spatial decomposition approach. The simulated system
is decomposed into spatially localized subsystems, and each processor is assigned the computation
of forces on the atoms within one subsystem [7, 12, 21, 27]. Message passing is used to exchange
the data necessary for these computations. Specifically, before computing the forces on atoms in a
subsystem (step 3 in Algorithm 1), atomic positions within the interaction cutoff radius r_{c2} of
the boundaries of the neighboring subsystems are cached from the corresponding processors.
Figure 1.1: The workflow of the main MD simulation loop: (a) main solver loop, (b) original pipelining of
atomic force computation on CPU, (c) restructured pipeline on GPU.
Fig. 1.1 (a) shows the workflow of the main MD simulation loop on a parallel computer,
where "Atom copy" denotes this interprocessor atom caching. After the atomic positions are updated
according to the time-stepping procedure (step 2 in Algorithm 1), some atoms may have moved
out of their subsystems. These moved-out atoms are migrated to the proper neighbor processors.
"Atom move" in Fig. 1.1 (a) represents this interprocessor atom migration along with the atomic
position update.
1.2.3 Linked-List Cell Method
A naive method of computing the forces between atoms in MD simulations is to first
consider an atom i and then loop over all other atoms j to calculate their separations. This approach
incurs O(N^2) computational complexity for pair interactions, and an even worse O(N^3)
complexity for triplet interactions. Thus, the naive method becomes untenable when the atom
count, N, becomes large.
The linked-list cell method [28] reduces this computational complexity to O(N) in both
cases by dividing the simulation region into cells whose width is slightly greater than the pair-interaction
cutoff distance r_{c2}. At each time step, each atom is classified by cell, and then only
interactions between atoms in the same or adjacent cells are considered in the force calculation.
For dynamic pair and triplet computations, the conventional linked-list cell method works as
follows [12, 27]. On each processor, the spatial subsystem containing both the resident and cached
atoms is divided into cells of equal volume whose edge is slightly larger than r_{c2}. Lists of the atoms
residing in these cells are constructed by the linked-list method [28]. Pair forces on the resident
atoms are computed by traversing atomic pairs using the linked lists. An atom in a cell interacts
only with the atoms within the cell and its 26 neighbor cells. Concurrently with the computation
of pair forces, a list of primary pair atoms, lspr, is constructed. Here, lspr[i][k] stores the identifier
of the k-th neighbor atom within the shorter cutoff distance, r_{c3}, of the i-th atom. Triplet
forces are then computed using the primary pair list lspr, which has been constructed during the pair-force
computations. In Eq. (1.1), only the resident atoms within the processor are included in the
summation over index i to avoid overcounting. On the other hand, indices j and k are summed
over both the resident and cached atoms. Partial derivatives of the potential energy with respect to
the positions of the j and k atoms therefore produce forces on the cached atoms as well as on the
resident atoms. The reaction terms on the cached atoms are sent back to the processors in charge
of the neighboring spatial subsystems and are added to the forces there.
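The cell classification and neighbor traversal described above can be sketched serially (illustrative Python with hypothetical names; the production code uses linked lists and spatial decomposition, and the periodic minimum-image correction to the distance is omitted here for brevity):

```python
import numpy as np

def build_cell_lists(pos, box, rc):
    """Assign each atom to a cell of edge >= rc; only atoms in the same
    cell or one of its 26 neighbor cells can lie within rc of each other."""
    ncell = np.maximum((box // rc).astype(int), 1)  # cells per dimension
    edge = box / ncell                              # cell edge, >= rc
    cells = {}
    for i, r in enumerate(pos):
        c = tuple(((r // edge).astype(int)) % ncell)  # periodic wrap
        cells.setdefault(c, []).append(i)
    return cells, ncell

def neighbor_pairs(pos, box, rc):
    """Enumerate pairs (i < j) within rc by scanning the 27 cells around
    each atom's cell: O(N) work for roughly uniform atom density."""
    cells, ncell = build_cell_lists(pos, box, rc)
    pairs = set()
    for c, atoms in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = tuple((np.array(c) + (dx, dy, dz)) % ncell)
                    for i in atoms:
                        for j in cells.get(nb, []):
                            if i < j and np.linalg.norm(pos[i] - pos[j]) < rc:
                                pairs.add((i, j))
    return sorted(pairs)
```

Each pair is discovered from both of its cells, so the i < j test and the set also serve to deduplicate, mirroring the overcounting rule in the text.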
1.3 GPGPU Implementation
Our focus in accelerating these codes was finding an efficient method of porting the main
MD simulation loop to a graphics processing unit (GPU). To guide that process, we follow two
fundamental principles: (1) minimization of control divergence (conditional branching, non-uniform
iteration counts) within the threads of a warp, and (2) minimization of synchronization events
between threads. The first principle is important on a single-instruction, multiple-thread (SIMT)
architecture such as a GPU, since the kernel scheduler launches threads in groups of 32 (called a
warp). Each thread within a warp executes the instructions of a kernel in lockstep. Divergent
execution from conditional branches is allowed, but the threads in a warp suffer large performance
penalties when this occurs, since threads must pause and resynchronize at the end of the conditional
branch. A similar situation occurs when a load imbalance is present. For example, if one or more
threads in a warp execute many more iterations of a loop, warp resources will sit idle while the
overloaded threads execute.
With these principles in mind, we now turn to the computations performed within the solver
loop as outlined in Fig. 1.1 (a). The velocity half-step updates, atomic position updates, and atom
copies (for enforcement of the periodic boundary conditions, etc.) are all directly translated to GPU
kernels. These operations are trivially parallelized by unrolling their loops by atom. Thus, each
thread of the corresponding GPU kernels is responsible for updating the state of a single atom.
However, the vast majority of the work performed by the solver occurs within the acceleration
computation step ("Compute force" in Fig. 1.1 (a)). This is much more difficult to parallelize, and
here we are forced to dramatically restructure the algorithm for efficient execution on GPU. This
restructuring involves the application of a pipelining technique to decompose the computation into a
longer sequence of simple steps (implied, but not specifically discussed, in Ref. [24]). Fig. 1.1 (b)
and (c) show the original acceleration computation juxtaposed with the new pipelined approach.
The original MD algorithm fuses the identification of interactions between atoms i, j (pair
or two-body) and i, j, k (triplet or three-body) with the calculation of the potentials from Eqs. (1.2)
and (1.3), respectively. This organization is appropriate for the CPU, since having spent the time
searching for the neighbors of a particular atom i, no overhead other than clock cycles in the
arithmetic logic unit is incurred to compute the potential. On GPU, however, this is suboptimal,
as a particular atom i will have a variable number of neighbors in its vicinity. If each thread of a
kernel is assigned to find the neighboring interactions for a specific atom, it is very likely that the
threads in a warp will execute differing amounts of work, thus violating one of the
optimization principles we committed to follow. In addition, determining the validity of an
interaction between atoms involves numerous checks, such as a comparison of atom types, whether
an atom is resident within this spatial subsystem or cached, and the distance cutoff between the atoms.
These checks are conditional branches, which cause control divergence as well.
The GPU implementation instead factors out the computation of two-body and three-body
potentials from the identification of the participating interactions (Fig. 1.1 (c)). The
calculation of potentials is an embarrassingly parallel operation, with little control divergence and
no synchronization required other than an atomicAdd() to sum the potential contributions for an
atom i across many threads. Thus, calculating two-body potentials is now a two-step process:
construct a list (or array) of valid interactions within the cutoff distance r_{c2} for all atom pairs i, j,
and then compute the potential contribution from each interaction in parallel. The same process
can be used for three-body potentials. In other words, the GPU parallelizes the potential computation
by interaction rather than by atom.
With this factorization, we achieve a significant speedup in performance over a baseline
GPU implementation that mirrors the CPU approach with parallelization by atom (Fig. 1.1 (c)).
Having separated the easily parallelizable computation from the tricky interaction determination,
we can then focus on ways to efficiently accelerate the construction of two-body and three-body
interaction lists on GPU. The approaches we employed to accomplish this are explained in the
following section.
1.3.1 Two-Body Interaction List Generation
The two-body interaction list generator used in our application draws inspiration from the
Verlet neighbor-list algorithm by Lipscomb et al. [24], but has significant differences. The
Lipscomb method is elegant and simple, yet it exhibits a drawback: it relies on the
complete enumeration of all possible two-body atomic interactions at the outset (the master list)
before the sorted member list can be generated. With the cell structure of MD, the number of such
interactions is bounded by O(N); however, the hidden constant factor is very large. In addition,
Ref. [24] does not address how to generate the master list; the implication is that it may have been
generated on CPU and transferred to GPU. We instead would like the list generation to occur
entirely on GPU.
To address this problem, we extend the interaction metaphor to its logical conclusion:
we consider not only atomic interactions, but also cell interactions. At the outset, during
application initialization, once the problem space has been decomposed into cells, the program
catalogs all pairs of cells u, v that are adjacent to one another within the lattice
(including interaction with self, and interaction with the boundary cells). This cell interaction list
I_{u,v} is saved for future use and is immutable for the runtime of the application. Within the source
code, the cell interaction list is represented by two arrays, linterci and lintercj.
More concretely, as the solver executes on GPU, the kernel responsible for the array-cell
list phase scans the atoms in the lattice in parallel, classifies each by the cell it is located within,
and counts the number of atoms in each cell. The identifier of each atom is stored in an array A_u,
where u is the scalar index of the cell that the atom is located within. Thus, there is one such array
for each scalar cell in the problem domain. The atom counts per cell are kept in C_u, again where u
is the scalar cell index. Within the source code, A_u is implemented as the two-dimensional array
cellatomT, while the cell counts are represented in the array cellcount.
Using the CUDA Thrust software development kit (SDK) [29], the set of cell counts C_u is
scanned to find the maximum value m = max_u(C_u), representing the cell with the most atoms. At
this point, the application has sufficient information to explore the entire space of possible
interactions in parallel. Any pair of interacting cells in I_{u,v} will have at most m^2 combinations of two-body
interactions, since the maximum number of atoms in any given cell is m. And since there are
|I_{u,v}| possible cell interactions, an upper bound on the number of two-body atomic interactions in
the problem domain is
t = m^2 |I_{u,v}|.   (1.7)
A kernel is then launched with t threads to execute the "identify 2B interactions" phase in
Fig. 1.1 (c). Inside the kernel, each thread tid extracts the identifiers of one pair of atoms i, j to test
for two-body interaction via the following multi-step process. First, the index of the cell
interaction, i_c, and the atomic interaction combination within that cell, a_c, are computed from the
thread identifier tid by
i_c = t_{id} \bmod |I_{u,v}|,   (1.8)
a_c = \lfloor t_{id} / |I_{u,v}| \rfloor.   (1.9)
Then, the indices of the interacting cells, u and v, are determined by looking up the i_c entry in I_{u,v}:
\{u, v\} = I_{u,v}(i_c).   (1.10)
The index positions of the (possibly) interacting atoms in A_u and A_v are computed by
i_{idx} = \lfloor a_c / C_v \rfloor,   (1.11)
j_{idx} = a_c \bmod C_v.   (1.12)
If i_{idx} \geq C_u, the atomic interaction combination a_c is not in the set of possible interactions
for cells u and v, and that thread terminates. Otherwise, the atom identifiers i and j are finally
retrieved from A_u and A_v:
i = A_u(i_{idx}),   (1.13)
j = A_v(j_{idx}).   (1.14)
At this point, a final check is performed to remove duplicate i, j interactions when
considering cell interactions with u = v (again, by termination). Each of the remaining running threads
then submits its i, j pair of atoms to the same cutoff-threshold testing as the original MD code,
and appends i, j to the two-body interaction list if it passes the threshold tests. In this manner, the
entire set of possible two-body atomic interactions is tested in parallel, and in situ, on GPU. A
single contiguous list of interactions is produced, which can then be evaluated in parallel within
the "calculate 2B forces" phase in Fig. 1.1 (c). Within the code, the list is represented by the arrays
linteri and linterj. Fig. 1.2 illustrates the procedure of two-body list construction on GPU. Each
thread examines an interaction between atoms i and j, and transfers the index pair to linteri and
linterj only if it passes the threshold test.
Figure 1.2: Concurrent construction of valid interactions.
At first glance, this method may seem wasteful of processor resources. It is true that
perhaps as many as one-half of the threads will terminate without ever testing two atoms for an
interaction, owing to imbalances in the distribution of atoms among the cells. However, the
terminating threads are often grouped contiguously in large swaths within the thread space,
allowing the GPU warp scheduler to immediately marshal those released resources for other warps
waiting in the run queue. Another disadvantage that should be noted is this method's reliance on a
global counter, incremented with atomicAdd() when new i, j interactions are appended to the
interaction list. Despite our testing of various alternatives that replace this counter with other
data structures, the current implementation provides the best overall runtime performance. We
discuss a possible solution to this problem in the concluding remarks.
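The index arithmetic of Eqs. (1.8)-(1.14) can be sketched as a host-side function, with the thread grid emulated by a loop over tid (an illustrative Python sketch; the names A and C follow the text, while decode_pair itself is a hypothetical helper, not the kernel's name):

```python
def decode_pair(tid, cell_pairs, A, C):
    """Map a thread ID onto one candidate atom pair (i, j).

    cell_pairs : list of (u, v) interacting-cell index pairs (the list I_{u,v})
    A : dict mapping cell index -> list of atom IDs in that cell (A_u)
    C : dict mapping cell index -> atom count of that cell (C_u)
    Returns (i, j), or None when the combination is out of range or duplicate.
    """
    n_pairs = len(cell_pairs)
    ic = tid % n_pairs                 # Eq. (1.8): which cell interaction
    ac = tid // n_pairs                # Eq. (1.9): which atom combination
    u, v = cell_pairs[ic]              # Eq. (1.10): look up the cell pair
    i_idx = ac // C[v]                 # Eq. (1.11)
    j_idx = ac % C[v]                  # Eq. (1.12)
    if i_idx >= C[u]:                  # combination invalid for this cell pair
        return None
    i, j = A[u][i_idx], A[v][j_idx]    # Eqs. (1.13), (1.14)
    if u == v and i >= j:              # drop self pairs and duplicates if u = v
        return None
    return (i, j)
```

Looping tid over t = m^2 |I_{u,v}| and collecting the non-None results yields each candidate pair exactly once, ready for the cutoff test.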
1.3.2 Three-Body Interaction List Generation
The three-body interaction list generator utilizes a different algorithm than its two-body
counterpart. This method examines the two-body interaction list produced in the previous section,
and searches for those interactions that fall within the more restrictive three-body distance
threshold, r_{c3}. Those interactions that fall within this cutoff are saved in the lspr array in much the
same way as in the original MD code explained in Section 1.2.3. However, in the GPU
implementation, this array is populated by a kernel that inspects each two-body interaction in
parallel (one thread per interaction). As shown in Algorithm 2, the kernel gen_lspr produces the array lspr
from the existing two-body interaction lists linteri and linterj.
Algorithm 2: GPU implementation of the generation of three-body lists.
CUDA kernel, gen_lspr
// tid is thread ID
// lnum3b[i] is the number of atoms within r_c3 of atom i
// linteri and linterj are two-body interaction pairs
i ← linteri[tid]
j ← linterj[tid]
if r_ij < r_c3
  lspr[i][lnum3b[i]] ← j
  lspr[j][lnum3b[j]] ← i
  lnum3b[i]++
  lnum3b[j]++
end if

CUDA kernel, gen_3body_list
// offset is the prefix sum from the Thrust operation
i ← tid
for j from 1 to lnum3b[i]
  for k from j + 1 to lnum3b[i]
    linter3i[offset] ← i
    linter3j[offset] ← lspr[i][j]
    linter3k[offset] ← lspr[i][k]
    update offset by computing the prefix sum of combinations
  end for
end for
Once the lspr array is populated, Thrust is used to scan the neighbor counts for each atom
to generate an array of prefix sums, with the sums produced after applying a function to the
neighbor counts. More specifically, the number of three-body interactions, n_i, associated with atom
i is computed with the formula
n_i = \frac{c_i (c_i - 1)}{2},   (1.15)
where c_i is the number of neighboring atoms of atom i in lspr. The array of n_i is then used as input
to the Thrust prefix-sum operation to produce a new array of values p_i. Thrust is also employed
to compute the total number of three-body interactions by simply performing a reduction on n_i.
The prefix-sum array p_i is then used as input by a follow-on kernel that populates the actual
three-body interaction list. Each thread in the kernel is assigned to process an atom i, and it iterates
through the n_i 3-tuple combinations for that atom in lspr. The threads save the resulting i, j, k
interactions into the three-body interaction list starting at offset p_i. The result of this complicated
process is a fully populated i, j, k interaction list, entirely produced on GPU, which can be used in
the "calculate 3B forces" phase in Fig. 1.1 (c). In the source code, this interaction list is represented
by the trio of arrays linter3i, linter3j, and linter3k. In Algorithm 2, the kernel gen_3body_list
stores the three-body interactions into these arrays.
This interaction list generator is remarkably different from its two-body counterpart. Its primary
advantage is the elimination of a global counter for insertion into the three-body interaction list.
The purpose of the prefix-sum array is to assign a dedicated range of entries within the final
interaction list to each thread (atom), so that no explicit coordination is required between threads.
This is a significant advantage, but it comes with a cost. This method does not scale when
there are large numbers of three-body interactions per atom, since the memory writes into the final
interaction list are strided by the number of combinations. Thus, as the interaction count per atom
increases, the stride correspondingly increases, and memory coalescing worsens. We have found
in practice that the combination counts are typically very small, as is the total number of
three-body interactions, so the performance degradation from the poor coalescing is
minimal. It should be noted here that r_{c3} << r_{c2}, as stated before. This comes from the physical
principle that higher-order n-tuple interactions (n ≥ 3) are short-ranged, and only a few neighbors
per atom contribute to them, compared with hundreds of neighbors for pair interactions.
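The prefix-sum bookkeeping of this section can be sketched serially (illustrative Python; gen_3body_list here is a hypothetical serial stand-in for the two CUDA kernels, with the Thrust scan and reduction replaced by explicit loops):

```python
from itertools import combinations

def gen_3body_list(lspr):
    """Build a flat (i, j, k) triplet list from per-atom neighbor lists.

    lspr : list of neighbor-ID lists; lspr[i] holds the atoms within rc3 of i.
    Each atom i owns n_i = c_i * (c_i - 1) / 2 slots starting at prefix[i],
    so in the parallel version no global counter is needed when filling them.
    """
    n = [len(nb) * (len(nb) - 1) // 2 for nb in lspr]  # Eq. (1.15) per atom
    prefix = [0]
    for ni in n:
        prefix.append(prefix[-1] + ni)                 # exclusive prefix sum
    total = prefix[-1]                                 # reduction over n_i
    triplets = [None] * total
    for i, nb in enumerate(lspr):                      # one "thread" per atom
        for off, (j, k) in enumerate(combinations(nb, 2)):
            triplets[prefix[i] + off] = (i, j, k)
    return triplets
```

Because each atom writes only into its own slot range [prefix[i], prefix[i] + n_i), the per-atom loops are independent, which is exactly what lets the GPU version run without atomic operations.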
1.4 Benchmark and Validation
To evaluate the runtime performance of the GPU implementations, we run a series of
benchmarking tests at a center for high-performance computing that operates both a large,
heterogeneous CPU cluster and a GPU-accelerated cluster. The GPU-accelerated cluster
comprises 264 compute nodes, each with two Nvidia Kepler K20m GPU accelerators. We
demonstrate the performance improvement of the new TD approach on MD simulations through
two comparisons: TD's GPU implementation versus (1) the original MD code and (2) a
baseline GPU implementation that directly mirrors the CPU implementation.
1.4.1 GPU Implementation versus the Original MD Code
For this timing experiment, we have selected three reference CPU chipsets from the
collection of available compute nodes on the CPU cluster. We compare the original MD
runtime performance on those chipsets versus the GPU implementation executing on the Kepler
K20m for a series of MD simulations. The lattice sizes in the simulations vary from 4×4×4 to
16×16×16 crystalline unit cells (where each unit cell contains 24 atoms), and each simulation runs
for 1,000 time steps. The CPU chipsets used in the comparison are the AMD Opteron 2356 with a
clock speed of 2.3 GHz and 2 MB cache, the AMD Opteron 2376 (2.3 GHz, 6 MB cache), and the Intel
Xeon E5-2665 (2.4 GHz, 20 MB cache). The purpose of the multi-chipset comparison is to
give a comprehensive view of the GPU implementation's performance characteristics versus a
range of common CPUs.
Fig. 1.3 plots the wall-clock time versus simulation size (i.e., the number of atoms) for the
three CPU chipsets and the GPU. The corresponding speedups of the GPU implementation over
the three CPU implementations are plotted as a function of the simulation size in Fig. 1.4. As can
be seen in the graph, the speedup varies greatly with the power of the reference CPU, with the
newer Intel Xeon faring better than its counterparts. However, even against the Xeon, the GPU
implementation demonstrates a significant speedup of up to 6-fold, which improves as the lattice
size is increased. This overall trend is expected, as the overhead costs of device initialization, PCI
bus communications, and kernel launches are amortized over a larger set of atoms.
An interesting observation in these charts is the large increase in speedup that
the GPU application experiences for the largest simulation versus the two AMD Opteron CPUs.
This is because the Opteron cache size (6 MB) is not large enough to accommodate the entire set of
98,304 atomic states along with other necessary data, and the cache-miss rate increases
dramatically at that point. If we were to scale up the tests further, we would likely continue to see
their runtime performance drop in comparison with the GPU.
To validate the simulation results generated by the GPU, we have compared the final
atomic positions and velocities with those of reference datasets produced by the original MD code.
We have examined simulations with lattice sizes 4×4×4, 8×8×8, 10×10×10, 12×12×12, and
16×16×16, all executed over a span of 1,000 time steps. Across all configurations, these output
parameters agreed within a tolerance of 10^{-9}. However, we did discover a problem with
accumulating numerical error that caused the GPU results to drift from those of the CPU while
testing extended simulations of more than 4,000 time steps. This problem is analyzed further in
the concluding remarks.
Figure 1.3: Wall-clock time of the GPU and reference CPU MD implementations for varying numbers of atoms.
Figure 1.4: Speedup of GPU MD over reference CPU MD for varying numbers of atoms.
1.4.2 GPU Implementation versus the Baseline
The TD approach is naturally expected to outperform its sequential origin, since
parallelization is employed aggressively. The advantage of the tuple-based restructuring of
computations, however, is still nontrivial and must be investigated. To address this question, we
introduce a straightforward GPU implementation of the original MD code, based on
particle decomposition (PD), as a baseline for comparison. As illustrated in Fig. 1.1, the TD
approach separates the identification and computation of 2-body and 3-body interactions into
different kernels. The baseline approach, on the contrary, fuses the two back into a single
kernel, with each GPU thread scanning for interatomic interactions and then calculating the
potentials. Thus, this code is simply a GPU-accelerated implementation of the original CPU MD
code.
The results are collected on the Tesla K20m, with simulations run at lattice sizes ranging
from 4×4×4 to 32×32×32, or equivalently numbers of atoms from 1,536 to 786,432. As shown
in Fig. 1.5, our TD approach achieves an approximately 20% performance improvement over the
baseline (PD). We observe that the wall-clock time for TD is higher than that of the baseline for
the first two lattice sizes. The likely reason is that our TD implementation launches two more
CUDA kernels than the baseline in each time step, and the overhead of initializing these kernels is
relatively high for a small number of atoms; the atom count needs to be large enough to leverage the
computational capability of the GPU.
Figure 1.5: Wall-clock time of the TD approach (orange circles) and that of the baseline (PD, black squares) as a
function of the number of atoms. The figure also shows the performance improvement of the TD approach
over the baseline as a function of the number of atoms.
To better understand the performance characteristics of the TD and PD approaches, we
next carry out performance profiling on the most compute-intensive CUDA kernels, which account
for approximately 90% of the total running time for both implementations, i.e., 2body_inter in
baseline, and gen_lists_2body_inter and compute_accel_2body_inter in TD. We use the NVIDIA
profiling tool, nvprof, for the measurements.
Figure 1.6 compares the wall-clock time of the TD approach with that of the baseline for
these most compute-intensive kernel executions. The total number of atoms is 786,432. As their
names suggest, the test case consists of computing 2-body interactions in a unit time step. Here,
the running time of each kernel is averaged over all CUDA threads, because an individual thread
may not execute the kernel body when the interaction it carries falls outside the cutoff
radius. The TD approach (right column) is approximately 23% faster than the baseline (left
column), which echoes the global performance improvement of the former over the latter
in Fig. 1.5. Within the TD approach, generation of the 2-body interaction list and the actual
computation of 2-body interactions using the list account for approximately equal computing
times.
Figure 1.6: Wall-clock time of kernels for 2-body potential computation per time step for the baseline (left)
and TD (right) approaches.
In addition, we investigate a major concern of our design mentioned in Section 1.3, i.e.,
unbalanced workloads over threads. We introduce "achieved occupancy" into our analysis as a
metric to quantify the workload balance on GPU. More precisely, it is defined as the ratio of active
warps on a streaming multiprocessor (SM) to the maximum number of active warps supported by
the SM [30]. Table 1.1 shows the achieved occupancy of the kernels as measured by nvprof. The
compound achieved occupancy of TD, though subject to statistical variation, is clearly higher than
that of the baseline, indicating that the procedure of fetching the interatomic pairs from the restructured
tuple list and then calculating the interactions produces more balanced workload distributions
over time than the baseline approach.
Table 1.1: Achieved occupancy of kernel execution.
Approach   Kernel                       Achieved occupancy
Baseline   2body_inter                  0.386
TD         gen_lists_2body_inter        0.916
TD         compute_accel_2body_inter    0.461
We next perform a hardware-level analysis to unveil the reason behind the achieved
occupancy of the three kernels in Table 1.1. Occupancy is limited by multiple factors, such as shared
memory usage. In this application, register usage per SM is found to be the determining factor. As
Fig. 1.7 shows, the kernel gen_lists_2body_inter uses 32 registers per thread, allowing the
K20m to hold the maximum 64 active warps (2 blocks) per SM, while the other two kernels take
up 41 registers per thread (41,984 for 1 block). The testing device, Tesla K20m, provides up to
65,536 registers per SM, with up to 2 blocks supported on each SM. However, these two
kernels use 41,984 registers for one block, so each SM can only run 1 block (32 warps, as
the red dot suggests) in parallel. Thus, the achieved occupancy is limited to an upper bound
of 0.5, which prevents these kernels from fully utilizing the GPU. This analysis demonstrates that the
register footprint plays a key role in dictating GPU performance.
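The register-imposed occupancy bound can be reproduced with simple arithmetic (a sketch using the K20m limits quoted above; the block size of 1,024 threads is inferred from the 41 × 1,024 = 41,984 register figure in the text):

```python
def max_active_warps(regs_per_thread, threads_per_block,
                     regs_per_sm=65536, max_blocks_per_sm=2,
                     max_warps_per_sm=64):
    """Upper bound on resident warps per SM imposed by register usage."""
    regs_per_block = regs_per_thread * threads_per_block
    blocks = min(regs_per_sm // regs_per_block, max_blocks_per_sm)
    return min(blocks * threads_per_block // 32, max_warps_per_sm)
```

At 32 registers per thread, two 1,024-thread blocks fit in the register file (64 warps); at 41 registers, only one block fits (32 warps), capping achieved occupancy at 32/64 = 0.5, consistent with the profile above.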
1.4.3 Dynamic Quadruplet Computation in ReaxFF
To test the performance portability of the TD approach to more general dynamic n-tuple
computations, we consider first-principles-informed RMD simulations based on the ReaxFF
method [6, 9, 10]. The ReaxFF approach significantly reduces the computational cost of simulating
chemical reactions, while reproducing the energy surfaces and barriers as well as the charge
distributions of quantum-mechanical calculations. RMD simulations describe the formation and
breakage of chemical bonds using reactive bond orders [6, 31, 32], while a charge-equilibration
(QEq) scheme [33-35] is used to describe charge transfer between atoms. The ReaxFF interatomic
forces involve up to quadruplets (n = 4) explicitly and sextuplets (n = 6) implicitly through the
chain rule of differentiation. In this work, we test the TD approach in the quadruplet-interaction
computations in a production RMD simulation program named XRMD [10].
Figure 1.7: Relationship between the number of warps per SM and the number of registers per thread for
kernels.
The quadruplet interaction is one of the most computationally expensive functions in the
ReaxFF method. It requires traversing deeply nested neighbor-list loops, within which the bond
order of a covalent bond, or a combination of bond orders, is checked against
predefined cutoff values, which can create significant code branching. Our TD approach performs
the neighbor-list traversal and the branch-condition evaluation on the CPU and the numerically
intensive computations on the GPU, enabling us to take advantage of both host and GPU
architectures. The GPU implementation of the TD approach for the computation of quadruplet
interactions is shown in Algorithm 3.
Algorithm 3: GPU implementation of the generation of 4-body lists and computation of quadruplet forces in XRMD.
CUDA kernel, gen_4body_list
// tid is thread ID
// i, j, k, l are atom IDs; BO_ij, BO_jk and BO_kl are bond orders between pairs (i,j), (j,k) and (k,l)
// ie4b is the 4-body list and ne4b is the number of 4-body interactions
j ← tid
for each k1: k ← nbrlist[j, k1]
    if BO_jk > BO_cutoff
        for each i1: i ← nbrlist[j, i1]
            if i ≠ k
                if BO_ij > BO_cutoff && BO_ij × BO_jk > BO_cutoff
                    for each l1: l ← nbrlist[k, l1]
                        if i ≠ l && j ≠ l
                            if BO_kl > BO_cutoff && BO_jk × BO_kl > BO_cutoff
                                if BO_ij × BO_jk^2 × BO_kl > BO_cutoff^2
                                    ne4b++
                                    ie4b[1:4][ne4b] ← (j, i1, k1, l1)
CUDA kernel, compute_4body_force
ne4b ← tid
i ← ie4b[1][ne4b]
j ← ie4b[2][ne4b]
k ← ie4b[3][ne4b]
l ← ie4b[4][ne4b]
call E4b(i, j, k, l)
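The control flow of gen_4body_list can be sketched serially in Python. Here nbrlist, bo and bo_cutoff are illustrative stand-ins for XRMD's actual data structures, not its layout:

```python
def gen_4body_list(nbrlist, bo, bo_cutoff):
    """Serial sketch of Algorithm 3's list-generation kernel.

    nbrlist[a] is atom a's list of bonded neighbors; bo[(a, b)] is the
    bond order of pair (a, b), stored symmetrically.
    """
    ie4b = []
    for j in nbrlist:                      # one GPU thread per central atom j
        for k in nbrlist[j]:
            if bo[(j, k)] <= bo_cutoff:
                continue
            for i in nbrlist[j]:
                if i == k:
                    continue
                if bo[(i, j)] <= bo_cutoff or bo[(i, j)] * bo[(j, k)] <= bo_cutoff:
                    continue
                for l in nbrlist[k]:
                    if l == i or l == j:
                        continue
                    if bo[(k, l)] <= bo_cutoff or bo[(j, k)] * bo[(k, l)] <= bo_cutoff:
                        continue
                    # final combined bond-order test of Algorithm 3
                    if bo[(i, j)] * bo[(j, k)] ** 2 * bo[(k, l)] > bo_cutoff ** 2:
                        ie4b.append((i, j, k, l))
    return ie4b
```

Note that each torsion is emitted once per traversal orientation, e.g., both (i, j, k, l) and (l, k, j, i); the production kernel records neighbor-list indices instead and handles such duplication downstream.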
Figure 1.8 compares the wall-clock time of the quadruplet-interaction calculation using the
TD and baseline (PD) approaches for three different problem sizes, i.e., the number of atoms N =
1,344, 4,536 and 10,752. For this kernel invocation, we employ one-dimensional thread
assignment with the number of threads per block, Nthreads, set to 32 and the number of blocks set to
the number of atomic quadruplets divided by Nthreads, plus 1. The figure shows that the TD approach
consistently outperforms the baseline, providing nearly 1.3× speedup for the three system sizes.
Figure 1.8: Timing comparison of the TD approach and baseline (PD) for the computation of quadruplet
interactions in ReaxFF. The histogram shows wall-clock times per MD step in milliseconds averaged over 100
time steps. We use RDX material for this benchmark.
As stated in the introduction, our global-local separation scheme completely insulates the
optimization of intranode computations of dynamic n-tuples from internode parallelization. In
order to test the internode scalability of our parallel ReaxFF MD code, Fig. 1.9 shows the
computing time per MD step as a function of the number of IBM Blue Gene/Q cores in both weak-
and strong-scaling settings. The weak-scaling test simulates an 86,016P-atom C3H6N6O6 molecular-crystal
system on P cores, while the strong-scaling test simulates N = 4,227,858,432 atoms
independent of P. The weak-scaling parallel efficiency is 0.977 for P = 786,432 and N =
67,645,734,912. The strong-scaling parallel efficiency is 0.886 for P = 786,432 for the much smaller
system of N = 4,227,858,432.
Figure 1.9: Strong and weak scaling of the parallel ReaxFF MD code on Blue Gene/Q: wall-clock time (s) per
MD step versus the number of cores (10^5 to 10^6).
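The quoted efficiencies follow the standard definitions below; the timings passed in are hypothetical placeholders, since Fig. 1.9 reports the measured values only graphically:

```python
def weak_scaling_efficiency(t_ref, t_p):
    # Weak scaling: work per core is fixed, so the ideal runtime is constant.
    return t_ref / t_p

def strong_scaling_efficiency(t_ref, p_ref, t_p, p):
    # Strong scaling: total work is fixed, so the ideal speedup is p / p_ref.
    return (t_ref * p_ref) / (t_p * p)

# The quoted weak-scaling system size is consistent with 86,016 atoms per core:
assert 86_016 * 786_432 == 67_645_734_912
```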
1.5 Summary
We have presented a new computational approach for performing many-body MD
simulations on GPGPUs. Our tuple-decomposition approach utilizes pipelining and efficient in situ
generation of pair and triplet interaction lists to accelerate the application. Through a wide-ranging
set of benchmarking tests, we have demonstrated that these relatively simple algorithmic changes
provide significant performance speedups for varying simulation sizes and applications.
The algorithms we described for producing lists of pair and triplet interactions can be
improved with further research. One possibility is combining our candidate two-body
interaction generator with the Lipscomb algorithm [24] for Verlet neighbor-list generation.
Our approach addresses the major weakness in their algorithm (the generation of the master list),
while their approach eliminates the need for synchronization through a global counter. We plan to
synthesize these two methods into a hybrid approach that combines both strengths.
We have also examined various techniques for fusing the pair and triplet interaction list
generation into the same overall algorithm, but this proved to be too difficult as their requirements
are slightly different. Further research could yield a way to integrate them in an elegant manner.
This would simplify our computation model immensely.
Although the focus of this work is primarily on MD algorithm development on GPU, there
are various optimizations that could be made to boost the performance even further. Memory
coalescing could be improved in many places by restructuring the atomic state information stored
in device global memory, and utilizing shared memory to coordinate reads among threads. With
the former technique, the goal is to have adjacent threads in a thread block read data from adjacent
addresses in global memory. The latter technique allows the efficient loading of strided array
structures into local memory where the access times are much lower.
An intriguing idea that we plan to explore is imposing a sort order on the atoms, so that
adjacent threads in the kernel tend to access adjacent atoms in the atom list. Theoretically, this
could improve memory coalescing. Determining a proper sort order is difficult, however, and
enforcing it is expensive: because the atoms move throughout the simulation, a parallel sort would
have to be performed every time step. A compromise may be to periodically sort the atoms, in the
hope that the benefits of temporarily improved coalescing outweigh the sorting cost.
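A periodic sort of this kind can be sketched as follows, assuming a simple uniform cell grid; cell_size, ncells and the sorting interval are illustrative parameters, not values from our implementation:

```python
def cell_index(pos, cell_size, ncells):
    """Map a 3D position to a linearized spatial-cell index."""
    ix, iy, iz = (min(int(c // cell_size), n - 1) for c, n in zip(pos, ncells))
    return (iz * ncells[1] + iy) * ncells[0] + ix

def maybe_sort_atoms(positions, step, sort_every=100, cell_size=3.0, ncells=(8, 8, 8)):
    """Every sort_every steps, return a permutation ordering atoms by spatial
    cell, so nearby atoms (hence adjacent threads) sit at nearby addresses;
    otherwise return the identity order and skip the sorting cost."""
    order = list(range(len(positions)))
    if step % sort_every == 0:
        order.sort(key=lambda a: cell_index(positions[a], cell_size, ncells))
    return order
```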
Chapter 2
Shift-Collapse Acceleration of Generalized Polarizable Reactive
Molecular Dynamics for Machine Learning-Assisted
Computational Synthesis of Layered Material
2.1 Introduction
Reactive molecular dynamics (RMD) is a powerful simulation method for describing
material processes involving chemical reactions, with a wide range of applications in physics,
chemistry, biology and materials science [36]. RMD simulation follows the time evolution of the
positions, r^N = {r_i | i = 1, …, N}, of N atoms by numerically integrating Newton's equations of
motion, where the atomic force law is mathematically encoded in the interatomic potential energy
E(r^N). Reliable interatomic potentials are key to accurately describing thermomechanical and
chemical properties of materials. The first-principles-informed reactive force-field (ReaxFF) model
significantly reduces the computational cost, while reproducing the energy surfaces and barriers as
well as charge distributions of quantum-mechanical (QM) calculations [36].
The most intensive computation in RMD simulation arises from a charge-equilibration
(QEq) scheme [37] to describe charge transfer between atoms, thereby enabling the study of
reduction and oxidation reactions. QEq treats atomic charges as dynamic variables, q^N =
{q_i | i = 1, …, N}. The charges and the resulting force law are determined by minimizing the
potential energy with respect to q^N at every RMD time step. This variable N-charge problem is
commonly solved iteratively, e.g., with the conjugate gradient (CG) method [38]. Though recent
advancements in parallel ReaxFF algorithms [39-41] have enabled large RMD simulations [42]
involving multimillion atoms, QEq computation remains the major bottleneck for studying
long time trajectories of such large RMD simulations.
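As a minimal illustration of this speed-limiting kernel, the sketch below solves a small symmetric positive-definite system H q = b with plain CG; it omits the charge-neutrality constraint and the shielded-Coulomb details of an actual QEq solver:

```python
def conjugate_gradient(matvec, b, x0, tol=1e-10, max_iter=200):
    """Minimal CG solver for H x = b with SPD H, supplied as a matvec closure.
    In QEq, H is the sparse charge Hessian and this matvec, evaluated over
    neighbor lists, dominates the per-iteration cost."""
    x = list(x0)
    r = [bi - hi for bi, hi in zip(b, matvec(x))]
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        hp = matvec(p)
        alpha = rs / sum(pi * hpi for pi, hpi in zip(p, hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```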
Despite the enormous success of the QEq-based ReaxFF model, one critical issue has remained
unsolved, namely accurate description of electric polarizability. Polarization of the surrounding
medium essentially dictates the rate of reduction and oxidation reactions, as is articulated, e.g., in
the Nobel lecture by Rudolph Marcus [43]. A recently proposed polarizable reactive force-field
(ReaxPQ) model based on a polarizable charge-equilibration (PQEq) scheme significantly improves
the accuracy of describing redox reactions by accommodating the reorganization of surrounding
media [44]. When applied to the prediction of electronic polarizabilities, however, the ReaxPQ model
alone was found inadequate. This partly arises from the fact that the original ReaxPQ model
determines the polarization of a nucleus core and an electronic shell within each atom by
considering only the internal electric field produced by atomic charges but not an externally applied
electric field. To remedy this deficiency, we here introduce a generalization of the ReaxPQ model
named ReaxPQ+, in which atomic polarizations respond to both internal and external electric fields,
thereby achieving near quantum accuracy for the tested cases.
The improved accuracy of the new ReaxPQ+ model is accompanied by a significant increase
of the computational cost. Compared to the original QEq scheme, which only deals with atom-centered
charge-charge interactions, PQEq computation in the ReaxPQ+ model is quadrupled, since
it considers core-core, core-shell, shell-core and shell-shell charge interactions for every atomic
pair. In this paper, we accelerate this heavy ReaxPQ+ computation using (1) an extended-Lagrangian
approach to eliminate the speed-limiting charge iteration [45], (2) a new extension of
the shift-collapse (SC) algorithm [46] named renormalized SC (RSC) to compute dynamic n-tuples
with provably minimal data transfers, (3) multithreading with round-robin data privatization, and
(4) data reordering to reduce computation and allow vectorization. The accelerated code achieves
(1) a weak-scaling parallel efficiency of 0.989 on 131,072 cores, and (2) an eightfold reduction of the
time-to-solution (T2S) compared with the original code, on an Intel Knights Landing (KNL)-based
supercomputer.
The reduced T2S has allowed computational synthesis of atomically-thin transition-metal
dichalcogenide (TMDC) layers with unprecedented fidelity. Functional layered materials (LM) will
dominate materials science in this century [47]. The attractiveness of LMs lies not only in their
outstanding electronic, optical, magnetic and chemical properties, but also in the possibility of
tuning these properties in desired ways by building van der Waals (vdW) heterostructures composed
of unlimited combinations of atomically-thin layers. To rationally guide the synthesis of stacked
LMs by chemical vapor deposition (CVD), exfoliation and intercalation, "computational synthesis"
should encompass large spatiotemporal scales. Such a layered materials genome (LMG) has been
chosen as one of the designated applications of the United States' first exaflop/s computer A21
when it is introduced in 2021 [48]. This paper for the first time demonstrates purely computational
synthesis of a TMDC-based LM, which is assisted by machine learning to discover a novel synthetic
pathway. This opens up the possibility to computationally explore new synthetic pathways to novel
TMDC-LMs and vdW heterostructures.
2.2 Application and Algorithmic Innovations
2.2.1 Generalized Polarizable Reactive Force Field (ReaxPQ+)
In the ReaxPQ+ model, the potential energy E({r_ij}, {r_ijk}, {r_ijkl}, {q_i}, {BO_ij}) is a function
of the relative positions of atomic pairs r_ij, triplets r_ijk and quadruplets r_ijkl, as well as atomic charges q_i
and bond orders BO_ij between atomic pairs. The potential energy includes the Coulombic energy
E_Coulomb. In the PQEq scheme used in the original ReaxPQ model, electric polarization is described
using a Gaussian-shaped electron density (shell) that can polarize away from the nucleus core in
response to internal electric fields produced by atoms [44]. Here, each atom i is partitioned into two
charged sites (i.e., core and shell). The core (ρ_ic) consists of two parts: ρ_i with a variable total charge
(q_i) and ρ_iZ with a fixed total charge (Z_i). The shell (ρ_is) is massless and has a fixed total charge of
−Z_i. The shell and core of an atom are connected by an isotropic harmonic spring with force constant
K_s (Fig. 2.1).
At each time step, the atomic charges q^N are determined by minimizing E_Coulomb subject to
the condition that the electrochemical potentials, ∂E_Coulomb/∂q_i, are equal among all atoms. The
Coulombic energy is given by

E_Coulomb({r_ic, r_is, q_i}) = Σ_{i=1}^{N} [ χ_i^0 q_i + (1/2) J_ii^0 q_i^2 + (1/2) K_s r_{ic,is}^2 ]
    + Σ_{i>j} [ T(r_{ic,jc}) C_{ic,jc}(r_{ic,jc}) q_ic q_jc − T(r_{ic,js}) C_{ic,js}(r_{ic,js}) q_ic Z_j
    − T(r_{is,jc}) C_{is,jc}(r_{is,jc}) q_jc Z_i + T(r_{is,js}) C_{is,js}(r_{is,js}) Z_i Z_j ]      (2.1)

where r_ic, r_is, χ_i^0 and J_ii^0 are the core position, shell position, electronegativity and hardness
of the i-th atom. In Eq. (2.1), r_{ia,jb} (i, j = 1, …, N; a, b = core (c) or shell (s)) are charge-charge
distances. The electrostatic energy between two Gaussian charges is given in terms of the error
function C_{ia,jb}(r_{ia,jb}), and the Coulombic interaction is screened using a taper function T(r) [44].
Figure 2.1: Schematic of the response of core (green) and shell (yellow) charges to an external electric field in
the new PQEq+ model.
As was mentioned in the introduction, the core-core, core-shell, shell-core and shell-shell charge
interactions in Eq. (2.1) quadruple the charge computation over the conventional ReaxFF model.
In the original ReaxPQ, the shell position r_is for the i-th atom is obtained by balancing the
effect of the electrostatic field due to all other atoms (i.e., interatomic interactions) with intra-atomic
interactions involving only the core and shell:
F_inter = −∂/∂r_is { Σ_{ia>jb} T(r_{ia,jb}) C_{ia,jb}(r_{ia,jb}) q_ia q_jb }      (2.2)

F_intra = −∂/∂r_is ( (1/2) K_s r_{ic,is}^2 )      (2.3)
We solve F_inter = F_intra to determine r_is using the Newton-Raphson method. Fig. 2.2(a) compares
the time evolution of the PQEq energy for 1, 10 and 100 iterations of the Newton-Raphson method, where
the previous shell positions are used as the initial guess. Fig. 2.2(a) shows that a single iteration suffices
to obtain an accuracy of 10^-4 kcal·mol^-1/atom. We have also confirmed the accuracy of the solution by
comparing the shell position and charge of each atom. We have applied this ReaxPQ model to
compute the dielectric constants ε of various materials and found that the model generally
underestimates the values.
Figure 2.2: (a) PQEq energy with 1, 10 and 100 iterations of the Newton-Raphson method. (b) Atomic charges
after 1 ps of MD simulation with converged RMD (blue) and XRMD (red) methods.
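A one-dimensional analogue of this force balance illustrates why a warm-started Newton-Raphson step converges so quickly; the scalar f_inter and its derivative are illustrative stand-ins for the full vector force on the shell:

```python
def newton_shell_position(f_inter, df_inter, ks, x0, iters=1):
    """Newton-Raphson on g(x) = f_inter(x) - Ks*x = 0: balance the
    interatomic force on the shell against the core-shell spring force.
    Warm-starting x0 from the previous MD step is what lets iters=1 suffice."""
    x = x0
    for _ in range(iters):
        g = f_inter(x) - ks * x
        dg = df_inter(x) - ks
        x -= g / dg
    return x
```

For a field that varies slowly over the shell displacement, g is nearly linear in x, so one iteration already lands on the balance point.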
In order to improve the accuracy of describing the dielectric response of materials, we here
introduce a generalized polarizable reactive force-field (ReaxPQ+) model, in which r_is is
determined by solving F_inter + F_external = F_intra, i.e., by explicitly including the effect of an external
electric field 𝓔,

F_external = Σ_a q_ia 𝓔      (2.4)

In both the original ReaxPQ and new ReaxPQ+ models, the polarization is calculated as

P = Σ_{ia} q_ia (r_ia − r_ia^0)      (2.5)

where r_ia^0 is the charge position in the absence of the external electric field 𝓔.
Table 2.1: Computed dielectric constants.

Material                          ReaxPQ   ReaxPQ+   QM
Polyethylene (PE)                 1.01     2.25      2.37
C=O defect (PE)                   1.02     2.74      2.78
C-Cl defect (PE)                  1.01     2.40      2.53
MoS2 (in plane)                   1.03     14.3      15.4*
MoS2 (out of plane)               1.03     5.68      7.43*
Polyvinylidene fluoride (PVDF)    1.02     2.56      2.52
Alumina                           1.03     3.17      2.98

* Surface Science Reports, vol. 70, pp. 554-586, Dec 2015.
We have tested the accuracy of the ReaxPQ+ model by computing the dielectric constants
ε of polyethylene (PE), molybdenum disulfide (MoS2) and other polymer and ceramic materials.
Table 2.1 compares dielectric constants computed with the original ReaxPQ and new ReaxPQ+
models against those obtained by first-principles QM calculations. Unless noted otherwise, the QM
values have been computed by us using a quantum molecular dynamics (QMD) code [49] based on
density functional theory (DFT), in which dielectric constants are calculated using a Berry-phase
approach [50]. The table shows that the new ReaxPQ+ results agree much better with the QM
results, compared with the original ReaxPQ results.
2.2.2 ExtendedLagrangian Acceleration of ReaxPQ+
As shown above, the charge-interaction computation in ReaxPQ+ is quadrupled compared to
that in the conventional ReaxFF model. The increased computational cost necessitates innovative
algorithms to speed up the computation.
First, we adapt the extended-Lagrangian reactive molecular dynamics (XRMD) algorithm
[45], which was originally proposed for the conventional ReaxFF model, to the new ReaxPQ+
model. The problem is that an excessively large number of CG iterations is required to reach
sufficient convergence of the charges q^N to guarantee the conservation of the total energy as a function
of time. Insufficiently converged charges act as an artificial heat sink, and the resulting broken time
reversibility causes the total energy to drift over time. A similar tradeoff between computational
speed and energy conservation is encountered in first-principles QMD simulations, where
insufficient convergence of the iterative refinement of electronic wave functions causes serious
energy drift. Niklasson proposed an extended Lagrangian scheme [51] that achieves excellent long-time
energy conservation with a drastically reduced number of iterations. In fact, an extended
Lagrangian scheme with no iteration (i.e., requiring only one evaluation of the energy gradient per
QMD time step) has been demonstrated [52]. The key idea is to introduce auxiliary wave functions
as dynamic variables that are numerically integrated by time-reversible, symplectic integration
schemes to address the broken-reversibility problem, while the auxiliary wave functions are
constrained to the iteratively determined wave functions by a harmonic potential. Successful
elimination of the speed-limiting charge iteration in the ReaxFF model was achieved by Nomura et
al. by adapting the extended-Lagrangian scheme [45]. The XRMD algorithm has drastically
improved energy conservation while substantially reducing the time-to-solution. In addition,
XRMD accurately describes atomic trajectories and charges. The average difference of atomic
positions between the XRMD and fully converged RMD methods was 0.08 Å after 1 ps of simulation
[45]. Fig. 2.2(b) compares atomic charges obtained by the XRMD algorithm with those by RMD using
an extremely tight CG tolerance (10^-8), which show excellent agreement. In this paper, we adapt the
XRMD algorithm to the new ReaxPQ+ model, where auxiliary charges are applied only to the
variable part, q_i, of the charges.
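The essence of the scheme can be sketched as the simplest undamped auxiliary-variable recursion; the form below is the basic Niklasson integrator, while XRMD uses a dissipative variant, and kappa is an illustrative coupling constant:

```python
def xl_propagate(u_prev, u_curr, q_curr, kappa=1.0):
    """One step of the time-reversible auxiliary-charge recursion
    u_{n+1} = 2 u_n - u_{n-1} + kappa * (q_n - u_n): the auxiliary charges u
    shadow the minimized charges q through a harmonic constraint, so q at the
    next step can be obtained from u with few (or zero) CG refinements."""
    return [2.0 * uc - up + kappa * (qc - uc)
            for up, uc, qc in zip(u_prev, u_curr, q_curr)]
```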
2.2.3 Renormalized Shift-Collapse Acceleration of ReaxPQ+
Our second algorithmic innovation is a generalization of the shift-collapse (SC) algorithm
[46], named renormalized SC (RSC). The SC algorithm provably minimizes data transfer for the
computation of dynamic n-tuples in parallel computing based on spatial (or domain) decomposition.
Building on the translation and reflection invariance of the set of n-tuple computations, the algorithm
applies shift and collapse algebraic transformations to n-tuple computations so as to completely
eliminate redundant computations among computing nodes while minimizing data transfer. Here,
we apply the SC algorithm to the generalized polarizable charge-equilibration (PQEq+) subroutine
that iteratively optimizes atomic charges to minimize the Coulombic potential energy using the CG
method. At the beginning of every PQEq+ iteration, information on neighbor atoms near the domain
boundary (including atomic charges and gradients) needs to be imported from the nearest-neighbor
computing nodes for calculating the Coulombic potential used in the CG iteration. This incurs significant
communication cost. For each PQEq+ iteration, the Coulombic potential of the system is evaluated
using guessed atomic charges. After that, each node computes charge gradients and consecutively
updates the atomic charges of all atoms residing in its domain. The PQEq+ computation is terminated when
either (1) the CG residue is small enough, (2) no energy improvement is gained between successive
iterations, or (3) the maximum number of iterations is reached.
In this work, we employ two approaches to reduce the time spent in the PQEq+ subroutine.
First, we apply the SC algorithm to minimize communication cost and eliminate redundant pair
evaluations within the PQEq+ subroutine. Here, we also develop a new SC approach to handle many-body
renormalized n-tuple computation in PQEq+ due to the interaction with surrounding atoms,
which has never been addressed by the SC algorithm before. Second, we store and reuse unchanged
coefficients across multiple CG iterations, thereby improving the efficiency of SC computation.
Details of these approaches are discussed below.
The SC algorithm utilizes 3-step communication, thereby significantly reducing both bandwidth
and latency costs when compared to the 6-step communication (i.e., halo exchange) in the conventional
full-shell (FS) method [53]. The computation pattern of the SC algorithm also eliminates redundant pair
evaluations. Evaluating the Coulombic potential using the SC algorithm is rather straightforward. The
computation follows the SC computation pattern for two-body interactions, which includes pairs of
resident atoms and pairs of atoms in the neutral territory (NT) [54]. Typically, contributions from
pairs in the NT region need to be sent back to their resident node. However, the return of contributions is
not required in this situation: only the overall sum of Coulombic potential contributions from
every node is needed, which can be efficiently obtained using the MPI_Allreduce function in
the message passing interface (MPI) standard. Nevertheless, subsequent computation after
obtaining the total Coulombic potential requires special treatment for many-body computation within
the CG subroutine. Namely, computation of the charge gradient g(i) for a particular atom i requires the sum
of contributions over all neighbor atoms of i:

g(i) = Σ_{j∈NBR(i)} H_ij q_j      (2.6)

where H_ij denotes the charge Hessian of atom pair i and j, and NBR(i) is the neighbor list of
atom i. Computing g(i) involves many-body renormalization because it requires H_ij contributions
from all neighbor atoms in NBR(i); see Fig. 2.3(a). However, in SC computation, atoms near the
lower domain boundary may not have complete neighboratom information in the node it resides.
In fact, neighboratom information is completed when combining partial neighbor lists (PNBR)
from multiple computing nodes. As such, gradient calculation in SC algorithmic framework gSC(i)
is defined as
𝑔 SC
(𝑖 ) = ∑ 𝐻 𝑖𝑗
𝑞 𝑗 𝑗 ϵPNBR
RES
(𝑖 )
+ ∑ (∑ 𝐻 𝑖 𝑗 ′𝑞 𝑗 ′
𝑗 ′
ϵPNBR
𝑘 NT
(𝑖 )
)
𝑘 (2.7)
where PNBR
RES
(i) denotes partial neighbor list of atom i in the node that atom i resides.
PNBR
𝑘 NT
(i) denotes partial neighbor list of atom i in NT region from node k, which is not in the
same node that i resides. Therefore, to complete g(i) calculation in SC framework, partial g(i)
contribution based on PNBR
NT
(i) must be sent back to the node that owns i (resident node); see Fig.
2.3(b). Although this incurs extra communication (additional 3way communication for returning
gSC(i) contribution), this is still no larger than the 6way communication in conventional FS method.
Furthermore, FS computation pattern yields substantial computational overhead from redundant
pair evaluations to build complete NBR in every computing node. On the other hand, SC algorithm
performs only essential pair evaluations without redundancy. This is expected to significantly
reduce running time inside PQEq+ subroutine, while maintaining the same communication cost.
Figure 2.3: Illustration of many-body renormalized n-tuple computation of g(i) based on (a) the conventional
full neighbor list (NBR) and (b) the partial neighbor lists (PNBR^RES and PNBR^NT) used in the RSC
algorithm.
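Eq. (2.7) can be illustrated with a small serial sketch in which dictionaries stand in for the per-node partial neighbor lists and for the 3-step return communication:

```python
def gradient_rsc(h, q, pnbr_res, pnbr_nt_by_node):
    """Assemble g(i) = sum_j H_ij q_j from partial neighbor lists as in
    Eq. (2.7). pnbr_res[i] is the resident node's partial list for atom i;
    pnbr_nt_by_node[k][i] is node k's neutral-territory partial list for i,
    whose contribution must be returned to i's resident node."""
    g = {}
    for i, nbrs in pnbr_res.items():
        g[i] = sum(h[(i, j)] * q[j] for j in nbrs)
    # Stands in for the 3-step communication returning NT contributions:
    for partial in pnbr_nt_by_node.values():
        for i, nbrs in partial.items():
            g[i] = g.get(i, 0.0) + sum(h[(i, j)] * q[j] for j in nbrs)
    return g
```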
For each PQEq+ iteration, the Coulombic potential energy between atoms i and j based on Eq. (2.1)
takes the following form:

E_Coulomb(i, j) = E_core-core(i, j) + E_core-shell(i, j) + E_core-shell(j, i) + E_shell-shell(i, j)
E_core-core(i, j) = T(r_{ic,jc}) C_{ic,jc}(r_{ic,jc}) q_i q_j
E_core-shell(i, j) = −T(r_{ic,js}) C_{ic,js}(r_{ic,js}) q_i Z_j      (2.8)
E_shell-shell(i, j) = T(r_{is,js}) C_{is,js}(r_{is,js}) Z_i Z_j
Here, T(r_{ia,jb}) and C_{ia,jb}(r_{ia,jb}) are computed using a costly table lookup as a function of
the core/shell distance. However, only q_i changes across PQEq+ iterations, while T(r_{ia,jb}) and
C_{ia,jb}(r_{ia,jb}) remain unchanged (i.e., the atomic/shell positions are fixed). Therefore, we save
considerable computation time by storing these four unchanged coefficients T(r_{ia,jb}) C_{ia,jb}(r_{ia,jb})
for each atomic pair throughout the PQEq+ iterations.
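The store-and-reuse idea can be sketched as follows; taper, gauss and the (cc, cs, sc, ss) ordering are illustrative stand-ins for RXMD's tabulated T and C functions, not its actual interface:

```python
def build_coeff_cache(pairs, taper, gauss):
    """Precompute T(r)*C(r) once per time step for each pair and each
    core/shell combination; positions (hence every r) are fixed across the
    PQEq+ charge iterations, so these products never change."""
    cache = {}
    for (i, j), dists in pairs.items():   # dists: r for cc, cs, sc, ss
        cache[(i, j)] = tuple(taper(r) * gauss(r) for r in dists)
    return cache

def coulomb_pair_energy(coeff, qi, qj, zi, zj):
    """Eq. (2.8) evaluated from cached coefficients (cc, cs, sc, ss order)."""
    cc, cs, sc, ss = coeff
    return cc * qi * qj - cs * qi * zj - sc * qj * zi + ss * zi * zj
```

Only coulomb_pair_energy runs inside the CG loop; the table lookups are paid once per time step.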
2.3 Parallel Implementation and Performance Optimizations
We have implemented RMD simulation based on the new algorithmically accelerated
ReaxPQ+ model in a scalable parallel RMD code named RXMD [39]. In this code, computations
are parallelized using spatial decomposition, where the simulated system is decomposed into
spatially localized subsystems and each processor is assigned the computations associated with one
subsystem. Message passing is used to exchange the data necessary for the computations between
processors, utilizing the MPI standard. Specifically, before computing the forces on atoms in a
subsystem, atomic positions within the interaction cutoff radius of the boundaries of the
neighboring subsystems are copied from the corresponding processors (i.e., interprocess atom
caching). After the atomic positions are updated according to time stepping, some atoms may have
moved out of their subsystems. These moved-out atoms are migrated to the proper neighbor processors
(i.e., interprocess atom migration). The RXMD code is written in Fortran 90.
For large granularity (the number of atoms per spatial subsystem, N/D > 10^2), spatial
decomposition (i.e., each processor is responsible for the computation of the forces on the atoms
within its subsystem) suffices. For finer granularity (N/D ~ 1), on the other hand, neutral-territory
(NT) [54] or other hybrid decomposition schemes are more efficient. As discussed in section II, we
use the SC scheme [46], which is a generalization of NT for general dynamic n-tuple computation.
The grain size for each MPI rank is limited by the cutoff length of the interaction. To further
accelerate the computation within an MPI rank, we introduce an additional layer of shared-memory
parallelism using the Open Multi-Processing (OpenMP) application programming interface. This
hierarchical MPI+OpenMP implementation allows RXMD to take advantage of the simultaneous
multithreading support provided by modern processors to achieve better utilization of the
computing resources within each processor. With multithreading, the most computationally
expensive bond-order and force computations within RXMD are greatly accelerated, serving to
reduce the overall runtime. A secondary benefit of multithreading is that it allows MPI ranks to be
exchanged for local threads, thereby reducing the total number of ranks in an MPI job and similarly
reducing the communication and atom-caching overheads at large scales. To obtain the best time-to-solution
(T2S) on each Intel Knights Landing (KNL) node, we performed timed runs for several
possible configurations of MPI ranks and OpenMP threads, as shown in Fig. 2.4(a). In this test, the
product of the number of MPI ranks and the number of OpenMP threads is fixed at 64. The best T2S was
observed for the combination of 16 MPI ranks and 4 OpenMP threads on each node, which will be
used in the subsequent benchmark tests.
Figure 2.4: (a) Runtime comparison between different configurations of MPI ranks and OpenMP threads on
1 node to identify the optimal combination; the total number of processes is 64. (b) Average runtime per step
for the original ReaxPQ+ and new ReaxPQ+SC implementations on 96 and 1,440 cores. ReaxPQ+SC reduces
the time-to-solution 5.0- and 5.7-fold below ReaxPQ+ for the 96- and 1,440-core benchmarks, respectively.
The numbers denote runtimes.
2.3.1 Performance Improvement by Algorithm Accelerations
We first test the effect of SC acceleration. Fig. 2.4(b) shows the average runtime per time
step on 96 cores (left) and 1,440 cores (right). In each case, the left and right bars show the total
runtime without (ReaxPQ+) and with (ReaxPQ+SC) SC acceleration, respectively. The figure also
shows partial runtimes for polarizable chargeequilibration (PQEq) calculation, which has become
the speedlimiting step in the new ReaxPQ+ model in order to achieve the improved accuracy
demonstrated in section II.A, as well as for nonbonded (ENbond) and other force calculations. We
observe that the SCaccelerated ReaxPQ+SC is 5.0 and 5.7 times faster than the original
implementation, ReaxPQ+, on 96 and 1,440 cores, respectively (N/P = 672). We should note that
this significant speedup is a result of the following two improvements. First, the SC framework
minimized the computation by removing redundant pair calculations, which is especially beneficial
for fine granularity (i.e., small number of atoms per computing node). The reduced communication
in ReaxPQ+SC is highlighted by the increased relative performance with ReaxPQ+ from 5.0 to 5.7
fold when the number of cores increases from 96 to 1,440. Second, the computation was further
reduced by reusing Coulombic coefficients.
2.3.2 Improving Thread Scalability by RoundRobin Data Privatization
In MD simulations like RMD, interatomic forces need to be updated at every time step,
which consumes substantial computing. With sharedmemory programming, force calculation is
parallelized by distributing computations to threads. Because of Newton’s third law of motion, it is
possible that different threads compute pair interactions involving a common atom and update the
force on that atom simultaneously, thereby causing memory conflict. To avoid such race conditions,
OpenMP provides atomic directive to handle concurrent memory access with a few locks.
In the RXMD code, interatomic forces for all atoms within an MPI process are stacked in
an array f of length n, where n is the total number of atoms per MPI rank. Array f is shared and
atomically accessed by threads in forcecalculation module (Fig. 2.5(a)). However, overhead of
concurrent memory access to array f can become performance bottleneck as the simulation size
grows. We observed that the cumulative CPU time of atomic operations accounts for 40% of the
total runtime. Fig. 2.5(b) shows how a threadprivatization scheme addresses this issue [55].
Specifically, the program allocates a private copy, f_private, of array f for each thread and accesses
f_private individually. When a piece of calculated force data on a common atom needs to be updated
by more than one threads, these threads write partial results to their own arrays concurrently, thereby
eliminating atomic operations.
Figure 2.5: (a, b) Data privatization. All threads write to the global force array using atomic operations (a), or
each thread writes to a private array (b). (c) Round-robin reduction with nonoverlapping chunks for
improving data privatization.
Once the force module finishes all of its computation, we need to aggregate the partial
results into the global array, for which we adopt a round-robin approach. As shown in Fig. 2.5(c),
f_private is divided into p chunks of equal length (except for the last one, which holds the remainder),
where p is the number of threads. In every round, each thread is in charge of a unique chunk and
transfers its data to the global array at the corresponding location. It then circles through the array
chunk by chunk until the entire f_private is transferred. This data reduction is implemented in a
single-instruction multiple-data (SIMD) manner. The nonoverlapping chunk design prevents memory
conflicts, while maintaining thread scalability by enhancing cache locality. Though data
privatization requires additional memory allocation that grows as O(np), where p is the number of
threads, it is justified for the target application of computational synthesis: minimizing T2S
is essential for describing the long-time growth process, rather than simulating the largest possible
system to saturate the memory.
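The round-robin reduction can be sketched serially as follows; the inner loop over threads would run concurrently under OpenMP, and the chunk assignment (t + r) mod p guarantees that no two threads touch the same chunk in the same round:

```python
def round_robin_reduce(privates, n):
    """Reduce p thread-private force arrays of length n into one global
    array without atomics: in round r, thread t owns chunk (t + r) mod p,
    so every round touches p distinct, nonoverlapping chunks."""
    p = len(privates)
    chunk = (n + p - 1) // p
    global_f = [0.0] * n
    for r in range(p):                       # rounds
        for t in range(p):                   # threads (parallel in OpenMP)
            c = (t + r) % p
            for idx in range(c * chunk, min((c + 1) * chunk, n)):
                global_f[idx] += privates[t][idx]
    return global_f
```

Over p rounds, every (thread, chunk) pair is visited exactly once, so the result equals the elementwise sum of all private arrays.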
2.3.3 Promoting Data Parallelism by Data Alignment
The Coulombic energy is updated in every PQEq+ iteration. In the algorithm in Table 2.2,
the procedure contains a doubly nested loop over atoms i and j. The outer loop over i is mapped to
parallel threads via OpenMP loop parallelism. The inner loop over j in the original RXMD code
was executed sequentially. To promote data parallelism (SIMD) on AVX-512 vector lanes (see the
experimental platform in section IV.A), we apply SIMD operations to the j loop. However, the atoms j in
the neighbor list of atom i are irregularly stored, and accordingly successive data accesses by index
j are noncontiguous in memory, resulting in inefficient cache performance. Here, we notice two
facts: the PQEq+ neighbor list is only updated once every time step (the outermost loop); and the indices
in the neighbor list are closely distributed, albeit randomly stored, in a number of groups. To take
advantage of this property, we use quicksort to rearrange the neighbor list in every PQEq+
initialization phase such that the atoms j are sorted in ascending order, so that irregular access is avoided.
Fig. 2.6 illustrates the modified neighbor list. After sorting, groups of atoms with sequential indices
are adjacently placed in memory, so the subsequent computation can leverage SIMD operations
on our computing platform.
Table 2.2: PQEq get_hsh procedure in RXMD.
1. do atoms i
2. do atoms j in i’s PQEq+ neighbor list
3. update Coulombic energy
Figure 2.6: Sorted neighbor list followed by SIMD operation.
Table 2.3 shows a portion of a sorted neighbor list taken from an actual simulation, together with its corresponding stride. A unit stride (value of 1) indicates successive placement of indices.
Table 2.3: Sorted polarized neighbor list and stride.
Index in Neighborlist
[888, 889, 890, 891, 892, 893, 894, 895, 896, 954, 962, 978, 998, 1002, 1017, 1018, 1024, 1096, 1112,
1145, 1146, 1160, 1180, 1192, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1232, 1234,
1235, 1236, 1237, 1238, 1239, 1240, 1241, 1242, 1243, 1244]
Stride
[1, 1, 1, 1, 1, 58, 8, 16, 20, 4, 15, 1, 6, 72, 16, 33, 1, 14, 20, 12, 26, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 2, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1]
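The effect of sorting on the stride pattern can be reproduced in a few lines (an illustrative sketch using a fragment of the indices in Table 2.3; the unsorted ordering is hypothetical, and this is not the RXMD Fortran code):

```python
import numpy as np

# A fragment of a PQEq+ neighbor list.  Sorting makes runs of consecutive
# indices contiguous, so the inner j-loop touches memory with unit stride
# and can be vectorized.
nbr = np.array([962, 896, 954, 888, 890, 893, 889, 891, 895, 892, 894])
nbr_sorted = np.sort(nbr)          # NumPy's default sort kind is quicksort
stride = np.diff(nbr_sorted)       # unit strides mark contiguous runs
print(nbr_sorted)   # 888..896 in order, then 954, 962
print(stride)       # eight 1s, then 58, then 8
```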
2.3.4 Additional Code Transformations and Performance Improvement
In the original RXMD code, the distance between atom pairs is computed when creating a neighbor list, but this data is immediately discarded even though the same distance is frequently recalculated in the subsequent force module. To eliminate these redundant computations, we modify the data structure by packing the atomic-pair distance into the neighbor list. Table 2.4 illustrates the modification of the neighbor list and the subsequent use of the distance data in the force-computation module. When accessing the index of a neighboring atom, the program can directly fetch dr_ij from memory, so no redundant computation impairs performance.
Table 2.4: Packing atomic-pair distance into neighbor list.
In Neighborlist module
1. for each cell
2. for atoms i in this cell
3. for atom j in the supercell
4. compute distance dr_ij of i and j
5. if dr_ij < cutoff range
6. nbrlist(i) stores (j, i, dr_ij)
In Force module, 2-body case
9. for atom i
10. for atom j in nbrlist(i)
11. fetch dr_ij from nbrlist
12. force calculation using dr_ij
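A minimal sketch of the packed neighbor-list idea (an all-pairs search is used here for brevity, whereas RXMD uses the cell-based search of Table 2.4; build_nbrlist, pair_forces and the cutoff value are illustrative names and numbers):

```python
import numpy as np

def build_nbrlist(pos, cutoff):
    """Build a neighbor list that packs the pair distance next to each neighbor
    index, so force routines fetch dr_ij instead of recomputing it.
    (All-pairs search for brevity; the actual code uses a cell-based search.)"""
    n = len(pos)
    nbrlist = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dr = float(np.linalg.norm(pos[i] - pos[j]))
            if dr < cutoff:
                nbrlist[i].append((j, dr))   # distance stored with the index
                nbrlist[j].append((i, dr))
    return nbrlist

def pair_forces(nbrlist, force_of_r):
    """2-body force loop: reuses the stored dr_ij, no recomputation."""
    return [sum(force_of_r(dr) for _, dr in nbrs) for nbrs in nbrlist]

pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
nbr = build_nbrlist(pos, cutoff=2.0)
print(nbr[0])  # [(1, 1.0)]
```

The force loop then consumes the stored distances directly, which is the pattern the modified RXMD force module exploits.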
To assess the effect of the performance optimizations discussed in this section, we run a 10-step simulation of MoS2 crystal on 512 Intel Xeon Phi KNL nodes. Fig. 2.7 compares the wall-clock time of the original code (left) with that of our optimized version (right) over a set of ReaxPQ+ force functions, including Enbond (non-bonded), Elnpr (lone-pair), Ehb (hydrogen-bond), E3b (3-body) and E4b (4-body) interactions. The most time-consuming function, E3b, has become 5.57 times faster, and the aggregate time of the force-computation module has achieved a 4.38-fold speedup over the original version. Overall, these performance optimizations have achieved a 1.55-fold speedup of the entire simulation runtime.
Figure 2.7: Performance improvement after data privatization and other code transformations. The numbers
denote runtimes.
2.4 Scalability and Time-to-Solution
2.4.1 Experimental Platform
We perform benchmark tests of the RXMD code with the accelerated ReaxPQ+ model on Theta, an Intel Xeon Phi Knights Landing (KNL) based supercomputer [56]. Each computing node contains 64 cores. Each core in turn has a 32 KB L1 instruction cache, a 32 KB L1 data cache, and two independent floating-point units capable of executing 512-bit wide SIMD instructions. The peak instruction throughput of the KNL microarchitecture is 2 instructions per clock cycle, which can be issued back-to-back from the same thread. Two cores and a shared 1 MB L2 cache form a tile, and the tiles are connected by a 2D-mesh network-on-chip with 16 GB of high-bandwidth in-package multi-channel DRAM memory (MCDRAM) and 192 GB of regular DRAM. There are two memory modes of execution: in the cache mode, the MCDRAM is used as a cache for DRAM, while in the flat mode, MCDRAM and DRAM form a single linear memory space. We tested our code in both flat and cache modes but observed no significant difference. The following benchmarks are performed in cache mode.
2.4.2 Weak and Strong Scaling on Theta
We perform an isogranular-scaling test of the ReaxPQ+ adapted RXMD code with hybrid MPI+OpenMP implementation on Theta, in which the number of atoms per MPI rank, N/P, is kept constant. We measure the wall-clock time per simulation time step with scaled workloads of a 24,576-atom MoS2 system on each core. By increasing the number of atoms linearly with the number of cores, the wall-clock time remains almost constant, indicating excellent scalability. To quantify the parallel efficiency, we first define the speed of the RXMD code as the product of the total number of atoms and the number of RMD time steps executed per second. The isogranular (or weak-scaling) speedup is given by the ratio between the speed on P cores and that on 64 cores as a reference system. With the granularity of 24,576 atoms per core, the parallel efficiency is 0.989 on 131,072 cores for a 3,221,225,472-atom system, as shown in Fig. 2.8(a). This demonstrates very high scalability of the ReaxPQ+ adapted RXMD code.
We next perform a strong-scaling test by simulating MoS2 with a total of 50,331,648 atoms. In this test, the number of cores ranges from P = 2,048 to 32,768, while the total problem size is kept constant. We measure the wall-clock time per RMD time step as a function of P. The runtime is reduced by a factor of 12.26 on 32,768 cores compared with the 2,048-core run (i.e., using a 16-times larger number of cores). This signifies a strong-scaling speedup of 12.26, with a corresponding strong-scaling parallel efficiency of 0.766. Fig. 2.8(b) shows the measured strong-scaling speedup as a function of the number of ranks (blue line), while the black line denotes the ideal speedup. It is more difficult to achieve high strong-scaling parallel efficiency than weak-scaling parallel efficiency, as the comparison of Figs. 2.8(a) and (b) suggests, due to the surge of the communication/computation ratio as the workload per rank shrinks. Given the 64-times smaller system size than in the weak-scaling test, the observed strong-scaling parallel efficiency is considered excellent.
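The quoted efficiency figures follow directly from these definitions; a small arithmetic sketch (the absolute reference time is arbitrary, since only ratios enter):

```python
# Parallel-efficiency arithmetic behind the scaling figures quoted above.
# Speed = (number of atoms) x (MD steps executed per second).

def weak_scaling_efficiency(speed_p, speed_ref, p, p_ref):
    """Weak scaling: the ideal speed grows linearly with the core count."""
    return (speed_p / speed_ref) / (p / p_ref)

def strong_scaling_efficiency(t_ref, t_p, p, p_ref):
    """Strong scaling: measured speedup t_ref/t_p versus the ideal p/p_ref."""
    return (t_ref / t_p) / (p / p_ref)

# Strong-scaling test: a 12.26x speedup on 16x more cores.
eff = strong_scaling_efficiency(t_ref=16.0, t_p=16.0 / 12.26, p=32768, p_ref=2048)
print(round(eff, 3))  # 0.766
```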
2.4.3 Time-to-Solution Improvement
Our target application is computational synthesis of MoS2 monolayer from MoO3 and S2 reactants. Because of the long duration of the reaction process, no previous RMD simulation has been able to complete the reaction and obtain the MoS2 product. Reduced T2S is the key figure of merit for this purpose. The algorithmic acceleration in section II has resulted in a speedup of factor 5.0 over the original RXMD code for the ReaxPQ+ model. In addition, the series of code transformations on Theta in section III has achieved an additional speedup of factor 1.55. Overall, the algorithmic acceleration and performance optimization have achieved a speedup of factor 5.0 × 1.55 = 7.75; namely, T2S has been reduced to 12.9% of that of the original ReaxPQ+ adapted RXMD code.
Figure 2.8: (a) Wall-clock time of the ReaxPQ+ adapted RXMD code, with scaled workloads of 24,576P-atom MoS2 on P cores (P = 64, ..., 131,072) of Theta. (b) Strong-scaling speedup of the ReaxPQ+ adapted RXMD code with a fixed problem size, a 50,331,648-atom MoS2 system, on P cores (P = 2,048, ..., 32,768) of Theta. The measured speedup values (blue) are compared with the ideal speedup (black). The numbers denote speedups.
2.5 Machine-Learning Guided Computational Synthesis of Atomically-Thin Layered Materials
The accurate description of atomic charges and polarization by the new ReaxPQ+ model,
combined with the drastic T2S reduction due to algorithmic acceleration and performance
optimization in the previous section, has opened a new avenue for computational synthesis of novel
materials. This section demonstrates the capability of the resulting parallel RMD code for
computational synthesis of atomically thin layered materials.
We focus on atomically-thin layered materials (LMs) that have unique electronic structures and mechanical and transport properties not found in their three-dimensional counterparts, which makes them attractive templates for future functional devices [57]. The primary synthesis technique for fabrication of LMs is chemical vapor deposition (CVD), where one or more reaction precursors in the gas phase undergo chemical reactions at elevated temperatures inside a reaction chamber, and the reaction product is deposited on a substrate in a colder region of the chamber [58]. RMD simulations can provide valuable inputs for rational optimization of CVD growth conditions if adequate length (1,000 Å) and time (10 ns) scales can be covered.
We simulated CVD synthesis of molybdenum disulfide (MoS2), a prototypical 2D semiconductor, on 242,144 Intel Xeon Phi cores. MoS2 can be formed by the reaction of MoO3 and S2 precursors. The initial configuration (Fig. 2.9(a)) consists of ~1.14 million atoms (129,472 Mo, 396,032 O and 620,032 S atoms) in a 1,037 Å × 1,080 Å × 145 Å simulation cell. Figs. 2.9(b) and (c) show computational synthesis of MoS2 monolayer by CVD and subsequent annealing. A pre-sulfurized MoO3 sample was thermalized at 3,000 K, and then quenched to 1,000 K in 2.2 ns. During annealing, the system was thermalized at 1,500 K for 2 ns, then quenched to 1,000 K in 1 ns. We repeated the annealing cycle twice.
We used a machine-learning approach to identify key reaction pathways. Fig. 2.9(d) shows the feedforward neural network (FNN) model [59, 60] we have developed to identify and classify atomic structures in the synthesized MoS2 crystal into 1T-crystalline (green), 2H-crystalline (red) and disordered (blue) structures. In the input layer, the local environment of each atom is represented by a 60-dimensional feature vector consisting of radial and angular symmetry functions [61]. The first, second and third hidden layers consist of 350, 100 and 50 hidden units, respectively. The ReLU activation function was used in the first and second layers, and a sigmoid function in the third layer. We trained the model on 36,000 simulation datasets by minimizing the softmax cross-entropy loss using the Adam optimizer. The results reveal a novel growth mechanism of the 2H crystal mediated by a metastable 1T crystalline phase (Fig. 2.9(b)). Such atomistic information is indispensable for guiding experimental CVD synthesis toward improved crystallinity.
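A minimal sketch of the forward pass of this FNN architecture, with the stated layer sizes and activations (the weights here are random placeholders, not the trained model):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fnn_classify(features, p):
    """Forward pass: 60-d symmetry-function input; hidden layers of 350, 100
    and 50 units with ReLU, ReLU and sigmoid activations; 3-way softmax output
    (1T, 2H, disordered)."""
    h1 = relu(p["W1"] @ features + p["b1"])
    h2 = relu(p["W2"] @ h1 + p["b2"])
    h3 = sigmoid(p["W3"] @ h2 + p["b3"])
    return softmax(p["W4"] @ h3 + p["b4"])

rng = np.random.default_rng(0)
dims = [60, 350, 100, 50, 3]
p = {}
for k, (din, dout) in enumerate(zip(dims[:-1], dims[1:]), start=1):
    p[f"W{k}"] = rng.standard_normal((dout, din)) * 0.05
    p[f"b{k}"] = np.zeros(dout)
probs = fnn_classify(rng.standard_normal(60), p)
print(round(probs.sum(), 6))  # 1.0
```

Training would fit W1..W4 and b1..b4 by minimizing the softmax cross-entropy with Adam, as described above.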
Figure 2.9: Machine learning-guided computational synthesis of MoS2 monolayer by CVD. (a) Simulation setup showing a MoO3 monolayer suspended in S2 gas. The atoms are colored as Mo: blue, S: yellow, and O: red. (b, c) Close-up of MoS2 monolayer before (b) and after (c) annealing. The local structures are classified into 1T (green), 2H (red) and disordered (blue) phases. For clarity, the gaseous environment is not shown. (d) Neural-network model for defect identification and classification.
2.6 Summary
To perform large RMD simulations incorporating dielectric reorganization of materials, we have proposed a new generalized polarizable reactive force-field (ReaxPQ+) model. The increased accuracy of ReaxPQ+, along with the reduced time-to-solution achieved by algorithmic and computational innovations, has for the first time allowed purely computational synthesis of atomically-thin transition-metal dichalcogenide layers assisted by machine learning. This new capability opens up an exciting possibility of future computational synthesis of yet-to-exist layered materials with desired properties. As such, the layered materials genome has been chosen as one of the 10 designated applications of the United States’ first exaflop/s computer, named A21, when it is introduced in 2021 [48]. The computational approaches developed in this paper will likely play an important role in the exascale materials genome. Since ReaxPQ+ is applicable to a wide variety of elements in the periodic table, the current approach extends to much broader applications in science and engineering.
Chapter 3
Machine Learning in Material Simulations
3.1 Introduction
3.1.1 Dielectric Polymer Genome
The development of high-performance dielectric polymers is urgently needed for the advancement of a wide range of energy technologies, including energy-storage and pulsed-power technologies. The recent advent of the materials genome (i.e., applying informatics to design new materials significantly faster than the conventional trial-and-error approach) has enabled rational design of dielectric polymers, thereby heralding an exciting era of the polymer genome. A major performance indicator of a dielectric polymer is its dielectric constant. While a dielectric genome has previously been proposed for inorganic crystals, the high computational cost of evaluating dielectric constants quantum mechanically makes it still a challenge. The situation is even worse for polymers, where the computation is far more expensive, since complex chemical and morphological features essentially dictate their dielectric constants. As a result, the highly-anticipated “dielectric polymer genome” remains elusive.
This computational challenge may be addressed by recent developments in first-principles-informed reactive molecular dynamics (RMD) simulations based on polarizable reactive force fields (ReaxPQ) [10-12]. The ReaxPQ model, based on a polarizable charge-equilibration (PQEq) scheme, has significantly improved the accuracy of predicting dielectric constants in orders-of-magnitude shorter computational times compared with quantum-mechanical (QM) calculations. In this paper, we further improve the predictive power of ReaxPQ by introducing a new valence- (or charge-state-) aware ReaxPQ (ReaxPQv) model, thereby achieving near-quantum accuracy for predicting dielectric constants of polymers. We incorporate ReaxPQv into our high-throughput computational synthesis framework for polymers [13]. We thereby construct a large dataset of 1,276 structurally-diverse dielectric polymers with quantum accuracy, which represents a frontier of dielectric polymer genome research.
In order to explore the large combinatorial design space of dielectric polymers, the above approach alone is not sufficient. It is essential to further exploit recent advances in machine learning (ML) approaches to molecular and materials sciences [14], in particular deep learning (DL) [15]. For example, various artificial neural networks, including graph neural networks (GNN) [16, 17] and recurrent neural networks (RNN) [18], have been used successfully to identify structure-property relationships in materials. Notably, the wide availability of public ML tools has now democratized ML tasks that can utilize the above-mentioned state-of-the-art dataset. As a proof of concept for the use of our new dielectric polymer dataset, ML models of progressive complexity (multilayer perceptron (MLP), random forest and RNN) were applied. In combination with the ReaxPQv-based computational framework, the MLP and random forest models have achieved decent accuracy for predicting the dielectric constants of polymers, while the RNN model is still being tuned.
3.1.2 Layered Material Phases
Machine learning (ML) has demonstrated broad applicability in image recognition, natural language processing and many other areas involving real-world data. Such success has inspired chemists to apply ML, especially deep learning, to better understand chemical processes [65, 66]. With the tremendous amount of data generated by computer simulations, deep neural networks in particular have shown significant power in many chemistry tasks, including the prediction of properties [67], drug discovery [68] and the understanding of quantum molecular dynamics [69]. Most of the existing works deal with small organic molecules, where the datasets are formed by a collection of molecular graphs with adjacency matrices describing the connectivity of atoms. Such graph-based datasets require a dedicated deep-learning architecture to capture the additional structured information. Graph neural networks (GNN) [62], a variation of the widely used convolutional neural networks (CNN) [70], process graph data as nodes and edges in a learnable fashion. Ref. [72] generalized molecular feature-extraction methods by a series of differentiable functions to produce low-dimensional molecular fingerprints. Ref. [70] integrated edge representations into the learning functions to capture more structural features from the graph. Ref. [73] applied gated recurrent units to update the hidden state during the learning phase, so that sequence-based methods (e.g., LSTM) can also be injected into graph models.
More recently, these GNN applications to isolated molecules have been extended by materials scientists to handle infinitely-repeated crystal structures [69]. However, this crystal-graph CNN has been limited to periodically-repeated, small crystalline unit cells, each involving a few atoms. In this paper, we extend the applicability of GNN to general material graphs composed of millions of nodes. This is a challenge since a material structure is often a mixture of various crystalline phases and defects, which are interconnected via bonds, resulting in a highly-complex, massive graph. Here, we propose a new variant of GNN to identify different phases inside an atomically-thin molybdenum disulfide (MoS2) monolayer that is computationally synthesized by reactive molecular dynamics (RMD) simulation [69], mimicking experimental chemical vapor deposition (CVD). MoS2 is an archetype of atomically-thin layered materials, for which ML has been extensively applied [68]. Our model analyzes the local graph topology around each atom and classifies it into crystalline 2H or 1T structures, as shown in Fig. 3.1.
Figure 3.1: Top views of 2H (left) and 1T (right) crystalline phases of MoS2 monolayer. Magenta and yellow spheres represent Mo and S atoms, respectively, whereas Mo-S bonds are represented by cylinders.
3.2 Method
3.2.1 Multilayer Perceptron
Reactive molecular dynamics (RMD) simulation based on first-principles-informed reactive force fields (ReaxFF) significantly reduces the computational cost of quantum-mechanically simulating chemical reactions in materials. A recently-proposed polarizable reactive force-field (ReaxPQ) model extends ReaxFF by accurately describing dielectric properties based on a polarizable charge model, in which each atom consists of core and shell charges connected by a spring. Unfortunately, advanced dielectric polymers contain atomic species at various charge states (or valences), making accurate prediction of dielectric constants very hard. Here, we introduce a valence-aware ReaxPQ (ReaxPQv) model to accurately describe the dielectric response of amorphous polymers based on quantum-mechanically informed atomic polarizability. We first determine the polarizability of the constituent atomic species quantum mechanically, using the Hartree-Fock method with def2-SVPD basis sets. These polarizability values for different valences are then used to estimate the spring constants between core and shell charges. The new ReaxPQv model has been implemented in our scalable parallel RMD simulation engine called RXMD.
We generate amorphous polymer structures using our computational synthesis framework. The framework first generates a single polymer chain from a Simplified Molecular Input Line Entry System (SMILES) string using the Open Babel software. Multiple polymer chains are then placed at sufficiently large distances from each other in a simulation box. The RMD simulation system is subjected to a number of consolidation and relaxation steps until the system reaches a desired density. Following this procedure, we constructed a synthetic dataset of 1,276 structurally-diverse polymers. Dielectric constants of these polymers are then computed with the ReaxPQv model, using a high-throughput, parallel RMD simulation workflow. The new dataset, composed of a large number of amorphous polymer structures augmented with quantum-accuracy dielectric constants, provides an indispensable test ground for developing predictive ML models of polymer dielectric constants.
To predict the dielectric-constant values from SMILES inputs, we have thus far tested three ML models: multilayer perceptron (MLP), random forest and recurrent neural network (RNN).
The MLP algorithm has played a fundamental role in the evolution of artificial neural networks and of ML at large [26]. Pioneered by Rosenblatt [27], MLP has been transformed from a binary classifier into an affordable method for solving supervised estimation problems [28]. Here, we use a two-hidden-layer perceptron estimator, where each layer comprises 1,136 neurons. Backpropagation is employed for weight adjustment, and a modified rectified linear unit (ReLU) activation function demonstrated the best fit.
To develop a random forest model, we use the Citrination online platform for feature generation and model construction. Citrination provides auto-generated features that can be used to develop a model or to augment customized features provided by users. This is particularly useful for developing an ML model for materials, in which each entry is commonly represented by a chemical formula or SMILES string. Given the training data explained above, we first generate over 100 features (63 standard and 71 extended features) based on the given chemical formula and/or SMILES strings. The standard set consists of elemental properties, molecule features and analytic features. The extended set contains elemental properties and molecule features. These features include the electronic configurations of constituent atoms (e.g., the number of valence electrons), which are readily available from the periodic table. In addition, quantum-mechanical calculations in the framework of density functional theory (DFT) [29-32] are used to produce additional data such as the electronic energy. We also generate 26 additional features based on functional groups in the polymer dataset and compare the model performance.
Given the training data and features as described above, we next select an ML model. Our
choice in Citrination is random forest, which corrects the overfitting tendency of decision trees
[33]. Random forest constructs an ensemble of tree predictors at training time. Individual trees
sample random vector values independently of each other, but with the same distribution for all
trees in the forest.
In addition to MLP and random forest, we also consider RNN, in which each input depends on the previous one through recurrently connected hidden and context layers [34]. RNN is commonly used for problems involving data sequences such as text or symbol arrays. The basic organization of an RNN involves three interconnected layers of input, output and hidden nodes. The context units are propagated from the output of the hidden layers and are recurrently connected. Nodes communicate via direct synapses represented by weight parameters. Due to their dynamic behavior, RNN modules employ commonly-used activation functions such as sigmoid and ReLU. During the training step, the value of the hidden layer h_v^(t) ∈ R^m is derived as
h_v^(t) = f(W x^(t) + U h_v^(t−1) + b_h), (3.1)
where x^(t) is the input vector at step t, W are the weight parameters of the hidden layer, U are the weight parameters of the context layer, b_h ∈ R^m are the bias parameters of the RNN model, and f is an activation function. In the traditional RNN, the message function y_v^(t) ∈ R^l is
y_v^(t) = V h_v^(t) + b_c, (3.2)
where V are the weight parameters for the output and b_c ∈ R^l denotes the bias.
Hochreiter and Schmidhuber developed the long short-term memory (LSTM) unit to avoid the vanishing- and exploding-gradient issues that accompany traditional RNNs [35]. Subsequently, Cho et al. developed a new RNN model called the gated recurrent unit (GRU), which employs the same basic principles as LSTM [36]. Both architectures use gating units to determine the next hidden state h_v^(t+1) and the computation of y_v^(t); however, the GRU architecture implements a lower number of filters and operations for the inner-cycle computation of h_v^(t). In the GRU implementation, the update gate z_v^(t) and reset gate r_v^(t) are
z_v^(t) = σ(W_z^(n) x^(t) + U_z^(m) h^(t−1)), (3.3)
r_v^(t) = σ(W_r^(n) x^(t) + U_r^(m) h^(t−1)), (3.4)
where n and m are tensor dimensions. The output value h_v^(t) is computed on the basis of the interim state h̃_v^(t); these gate outputs determine which information is kept or excluded during training:
h̃^(t) = tanh(U_h^(m) (r_v^(t) ⊙ h^(t−1)) + W_h^(n) x^(t)), (3.5)
and the new hidden state combines the update gate and the interim value,
h^(t) = z^(t) ⊙ h̃^(t) + (1 − z^(t)) ⊙ h^(t−1). (3.6)
For the present polymer problem, the initial states h_v^(0) are set to the input features of oligomers, such as structure and dielectric constant (ε) values. Due to their efficiency and robustness, both LSTM and GRU network architectures have become essential computational tools in various areas, including the study of structure-activity relationships in the life and material sciences [37].
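One GRU step following Eqs. (3.3)-(3.6) can be sketched as follows (the weight initialization and the toy dimensions are illustrative assumptions, not the trained polymer model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, p):
    """One GRU step following Eqs. (3.3)-(3.6)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)              # update gate, Eq. (3.3)
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)              # reset gate, Eq. (3.4)
    h_tilde = np.tanh(p["Uh"] @ (r * h_prev) + p["Wh"] @ x_t)  # interim state, Eq. (3.5)
    return z * h_tilde + (1.0 - z) * h_prev                    # new state, Eq. (3.6)

# Toy dimensions: n = 3 inputs, m = 4 hidden units, small random weights.
rng = np.random.default_rng(0)
n, m = 3, 4
p = {k: rng.standard_normal((m, n if k.startswith("W") else m)) * 0.1
     for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
h = np.zeros(m)
for x in np.eye(n):          # feed a short sequence of one-hot inputs
    h = gru_cell(x, h, p)
print(h.shape)  # (4,)
```

Because each new state is a convex combination of the previous state and a tanh-bounded interim value, the hidden state stays bounded, which is the stability property that motivates gating.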
3.2.2 Graph Neural Networks for Learning Layered Material
Figure 3.2 presents a high-level schematic of our classification task between the 2H and 1T crystalline phases using GNN. The following subsections explain the key components of this learning architecture, including dataset generation and the GNN itself.
Figure 3.2: Schematic of 2H vs. 1T phase classification by graph neural network.
3.2.3 Dataset Description
We have performed RMD simulation to synthesize MoS2 monolayer by sulfidation of a molybdenum trioxide (MoO3) precursor, followed by thermal annealing. RMD simulation follows the trajectory of all atoms while computing interatomic interactions using first-principles-informed reactive bond-order and charge-equilibration concepts. We have designed scalable algorithms to perform large RMD simulations on massively parallel computers [71, 72]. Our RMD produces a polycrystalline MoS2 monolayer, where different regions of the monolayer belong to either the 2H or the 1T phase of MoS2 crystal. Each atom inside the synthesized polycrystal is connected to its nearest-neighbor atoms by forming bonds with them, which in turn are connected to their nearest neighbors, and so on. Such bond formation makes the entire MoS2 monolayer a massive graph consisting of multi-million nodes (atoms) and edges (bonds). Hence, identification of 2H and 1T phases using a conventional graph-based ML model is not feasible for the entire graph. To circumvent this problem, we have randomly sampled 66,896 atoms from the total MoS2 monolayer and, for each of these atoms, created a local graph using all their neighbors within a cutoff radius of 0.8 nm. The rationale behind this choice is that, for each atom, the graph structure generated by its first and second nearest neighbors is able to distinguish its phase (2H or 1T). These 66,896 local graph structures serve as the training data for our neural-network model, consisting of 18,650 2H, 32,446 1T and 16,000 disordered structures.
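Carving out a local graph within the 0.8 nm environment cutoff can be sketched as follows (the bond cutoff r_bond and the function name are illustrative assumptions, not values from this work):

```python
import numpy as np

def local_graph(positions, center, r_env=0.8, r_bond=0.3):
    """Carve out the local graph around atom `center`: keep all atoms within
    r_env (nm) of it, then connect pairs closer than a bond cutoff r_bond.
    The bond cutoff value here is illustrative only."""
    d_center = np.linalg.norm(positions - positions[center], axis=1)
    members = np.flatnonzero(d_center <= r_env)          # atoms in the environment
    sub = positions[members]
    dist = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
    A = ((dist > 0) & (dist <= r_bond)).astype(float)    # local adjacency matrix
    return members, A

# Toy example: 4 atoms on a line, the last one far outside the environment.
pos = np.array([[0.0, 0, 0], [0.25, 0, 0], [0.5, 0, 0], [1.5, 0, 0]])
members, A = local_graph(pos, center=0)
print(members)  # the far atom at 1.5 nm is excluded
```

Repeating this for each sampled center atom yields the 66,896 local graphs used for training.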
3.2.4 Graph Neural Networks
Graph-based data can in general be represented as G = (V, E), where V is the set of nodes and E is the set of edges. Each edge e_uv ∈ E is a connection between nodes u and v. If G is directed, e_uv ≢ e_vu; if G is undirected, e_uv ≡ e_vu. Unless specified otherwise, the remainder of this paper deals with undirected graphs, but it is trivial to modify our model to process directed graph data. It should be pointed out that, in molecular graphs, the nodes are atoms and the edges are atomic bonds; the two pairs of terms are therefore used interchangeably here.
The goal of GNN is to learn low-dimensional representations of graphs from the connectivity structure and the input features of nodes and edges. The forward pass of GNN has two steps, i.e., message passing and node-state updating. The architecture is summarized by the following recurrence relations, where t denotes the iteration count:
m_v^(t+1) = Σ_{w ∈ N(v)} M_t(h_v^t, h_w^t, e_vw), (3.7)
h_v^(t+1) = U_t(h_v^t, m_v^(t+1)), (3.8)
where N(v) denotes the neighbors of v in graph G. The message function M_t takes the node states h_v^t and h_w^t and the edge state e_vw as inputs and produces the message m_v^(t+1), which can be considered a collection of feature information from the neighbors of v. The node states are then updated by the function U_t based on the previous state and the message. The initial states h_v^0 are set to the input features of atoms, which we discuss in the next section. Here, we use the normalized adjacency matrix Ã of the graph, coupled with some other features (discussed below), as the edge state. As shown in Fig. 3.3, these two steps are repeated a total of T times in order to gather information from distant neighbors, and the node states are updated accordingly. GNN can be regarded as a layer-wise model that propagates messages over the edges and updates the states of nodes in the previous layer; thus, T can be considered the number of layers in this model.
The exact form of the message function is
m_v^(t+1) = A_v W_t [h_1^t ... h_v^t] + b, (3.9)
where W_t are the weights of the GNN and b denotes the bias. We use gated recurrent units as the update function:
z_v^t = σ(W_z m_v^t + U_z h_v^(t−1)), (3.10)
r_v^t = σ(W_r m_v^t + U_r h_v^(t−1)), (3.11)
h̃_v^t = tanh(W m_v^t + U (r_v^t ⊙ h_v^(t−1))), (3.12)
h_v^t = (1 − z_v^t) ⊙ h_v^(t−1) + z_v^t ⊙ h̃_v^t, (3.13)
where ⊙ denotes element-wise multiplication and σ(·) is the sigmoid function for nonlinear activation.
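The T-round message-passing loop of Eqs. (3.7)-(3.13) can be sketched as follows (a simplified dense-matrix variant with hypothetical weight names; the actual model operates on the sampled local graphs with richer edge features):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gnn_forward(A, H0, params, T=2):
    """T rounds of message passing (Eqs. 3.7/3.9) with a GRU-style update
    (Eqs. 3.10-3.13), applied to all nodes at once as dense matrices."""
    # Symmetrically normalized adjacency with self-loops as the edge state.
    A_hat = A + np.eye(A.shape[0])
    d = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d[:, None] * A_hat * d[None, :]
    H = H0
    for _ in range(T):
        M = A_norm @ H @ params["W_msg"] + params["b"]    # gather neighbor messages
        Z = sigmoid(M @ params["Wz"] + H @ params["Uz"])  # update gate
        R = sigmoid(M @ params["Wr"] + H @ params["Ur"])  # reset gate
        H_tilde = np.tanh(M @ params["W"] + (R * H) @ params["U"])
        H = (1.0 - Z) * H + Z * H_tilde                   # Eq. (3.13)
    return H

# Toy graph: 5 atoms on a chain, 8 features per node.
rng = np.random.default_rng(1)
N, d_feat = 5, 8
A = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
params = {k: rng.standard_normal((d_feat, d_feat)) * 0.1
          for k in ("W_msg", "Wz", "Uz", "Wr", "Ur", "W", "U")}
params["b"] = np.zeros(d_feat)
H = gnn_forward(A, rng.standard_normal((N, d_feat)), params, T=2)
print(H.shape)  # (5, 8)
```

With T = 2 rounds, each node's final representation incorporates information from its first and second nearest neighbors, matching the rationale for the 0.8 nm local graphs.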
Figure 3.3: Repeating message function and update function T times to learn atom representations.
Once the GNN learns low-dimensional atom representations, we feed them to a generic ML model, e.g., a fully-connected network, to predict the 2H vs. 1T phase as a complete classification task. We argue that the learned atom representations of the molecular graphs better capture the structural uniqueness of the 2H and 1T phases, so that the model can achieve higher predictive performance. In the next section, we show experimental results that support this assumption.
3.3 Experiment and Analysis
We have studied the structure of CVD-grown MoS2 monolayer using RMD simulation (Fig. 3.4). The initial system for the RMD simulation consists of a MoO3 slab surrounded from the top by S2 gas. The dimensions of the RMD simulation cell are 211.0 × 196.3 × 14.5 nm³ in the x, y and z directions, respectively, and it consists of a total of 4,305,600 atoms. The entire system is subjected to an annealing schedule, where the system temperature is first increased to 3,000 K and subsequently quenched to 1,000 K, where it is held for 1 nanosecond (ns). This is followed by two annealing cycles, each consisting of a heating step from 1,000 K to 1,600 K over 0.4 ns, a thermalization step at 1,600 K for 1.5 ns, and a cooling step from 1,600 K to 1,000 K over 0.4 ns. This annealing schedule facilitates the reaction of S2 with the MoO3 slab, resulting in the formation of a polycrystalline MoS2 monolayer where different regions belong to either the 2H or the 1T phase.
Figure 3.4: Snapshot of computationally-synthesized MoS2 monolayer. A ball-and-stick representation shows the positions of atoms and covalent bonds with neighboring atoms. Atoms are color-coded as yellow for sulfur (S), blue for molybdenum (Mo) and red for oxygen (O).
3.3.1 Input Features and Training Settings
The small molecular graphs carved out from the simulation results carry a number of properties of atoms as well as bonds; see Table 3.1 for details. These properties are transformed into vector format and embedded with nodes and edges as the initial states of the GNN model. Because the number of atoms differs between atom-centered graphs, all input features are zero-padded to a common dimension, d = 40.
Table 3.1: Input node and edge features.
Feature      Description                        Size
Node
  atom type  Mo or S, a one-hot vector          2
  charge     Atomic charge                      1
Edge
  distance   Distance between atoms in nm       1
  bond order Dimensionless chemical bond order  1
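Because the atom-centered graphs differ in size, the per-graph feature arrays are zero-padded to a common size before batching. A minimal sketch of this step (array names are hypothetical):

```python
import numpy as np

def pad_graph(node_feats, max_atoms=40):
    """Zero-pad a (n_atoms, n_feats) node-feature matrix to (max_atoms, n_feats).

    n_atoms varies between atom-centered graphs; padding to a common size
    lets graphs be stacked into fixed-shape training batches.
    """
    n_atoms, n_feats = node_feats.shape
    assert n_atoms <= max_atoms
    padded = np.zeros((max_atoms, n_feats), dtype=node_feats.dtype)
    padded[:n_atoms] = node_feats
    return padded

# Example: a 20-atom graph with 3 features per atom (one-hot atom type + charge)
g = np.random.rand(20, 3)
print(pad_graph(g).shape)  # (40, 3)
```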
We randomly shuffle and split the dataset as follows: 50,000 graphs in the training set,
5,000 graphs in the validation set, and another 5,000 in the test set. The remaining 6,896 graphs are not used in
our experiments. The number of layers of our GNN model is T = 2, the batch size is set to 20, and we
train the model for a maximum of 100 epochs using Adam with a learning rate of 0.01.
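The shuffle-and-split step can be sketched as an index-based split (the counts are from the text; the total of 66,896 graphs follows from them):

```python
import random

def split_dataset(n_total, n_train=50_000, n_val=5_000, n_test=5_000, seed=0):
    """Randomly shuffle graph indices and split into train/val/test subsets;
    any remaining indices are left unused (6,896 graphs in our case)."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:n_train + n_val + n_test]
    unused = idx[n_train + n_val + n_test:]
    return train, val, test, unused

train, val, test, unused = split_dataset(66_896)
print(len(train), len(val), len(test), len(unused))  # 50000 5000 5000 6896
```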
3.3.2 Predictive Performance for Layered Material Phases
To investigate how the input features affect the predictive performance, we perform a series
of experiments with different selections of edge features, while the node features are kept fixed.
Results are shown in Table 3.2. In the first trial, we use only node features as inputs, providing no
spatial information beyond the adjacency matrix for training. The F1 score of 2H on the test set is only
0.5424, meaning that nearly half of the 2H phases are misclassified. Next, we add edge features to
the model and observe improvements in both the 1T and 2H classes. It turns out that distance and
bond order have almost equivalent effects on the GNN. The reason we see no increase in
performance when distance and bond order are added together as edge features is
that both estimate the bonding between two atoms and hence are highly correlated.
High bond-order values mean strong bonding between atoms and small values mean weak
bonding, whereas the cutoff-based distance feature makes an absolute decision on whether atoms are bonded or
not. Since the bond length between atoms is an important feature for distinguishing these phases, the
node/distance-based model gives the highest performance.
Table 3.2: F1 scores for different input features. A higher value signifies better classification accuracy.
input features 1T 2H
node only 0.7821 0.5424
node + edge 0.9321 0.8642
node + distance 0.9391 0.8855
node + bond order 0.9305 0.8761
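The F1 score reported in Table 3.2 is the harmonic mean of precision and recall; for one class it can be computed from the confusion counts:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall for a single class.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A class where half of its true members are missed (recall 0.5) has low F1:
print(round(f1_score(tp=50, fp=10, fn=50), 4))  # 0.625
```

This makes explicit why an F1 of 0.5424 for 2H corresponds to roughly half of the 2H graphs being misclassified.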
Figure 3.5 plots the ROC (receiver operating characteristic) curves obtained in one of our
experiments using both node and edge features. The curves of 1T and 2H are close to the upper
left corner, indicating that the model achieves high accuracy on both classes. In addition, we
calculate a more quantitative measure, the ROC-AUC (area under the ROC curve) score, shown
in Table 3.3, to verify the performance of the model. We observe results similar to the F1 scores,
which confirms the robustness of the conclusions drawn above.
Figure 3.5: ROC curves for the 1T (green) and 2H (red) phases using node and edge features as input, which
correspond to the second row in Table 3.3. The x-axis is the false positive rate (FPR) and the y-axis is the true
positive rate (TPR).
Table 3.3: ROC-AUC for different input features. Values are in the range (0, 1); a higher value signifies better
classification accuracy.
input features 1T 2H
node only 0.88 0.87
node + edge 0.93 0.92
node + distance 0.95 0.95
node + bond order 0.93 0.93
3.3.3 Visualization of Hidden Node States
Interpreting the learning process of ML models is still an open and active research area. In
this work, we try to provide some insight into how the GNN learns from graphs by visualizing the
evolution of the hidden states during the training phase. As shown in Fig. 3.6, this molecular
graph has 20 atoms with a dimension of 3 (see Table 3.1 for details) for the initial atom features.
Several patterns are shared by the atoms; for example, the first 7 atoms have very similar
feature encodings. In the second and third layers, the GNN expands the dimension to 25, which is a
hyperparameter of the model, while the feature vector of each atom gradually diverges from its
initial state. Such divergence makes it easier to find a separating hyperplane in the feature
space, so that the model can achieve high classification accuracy.
Figure 3.6: Layer-wise evolution of node states. The x-axis is the dimension of the features (padded to 20) and the
y-axis is the number of atoms (zero-padded to 40) in the graph. Each row represents the feature vector of an
atom, and the color of each pixel indicates the value of the unit.
3.3.4 Performance for Dielectric Polymer Genome
We first use the MLP model to run predictions for the entire training set of polymer
samples. For efficient training, the SMILES input is transformed into a binary representation with
zero padding. Polymers whose dielectric constants are less than 1.5 are removed from the
training set (Fig. 3.7A), as these statistical outliers do not follow the general trend. Additionally,
the input representations are examined for duplicated SMILES entries, reducing
the training set to a total of 828 samples (Fig. 3.7B).
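The text does not specify the exact binary transformation of the SMILES input; the sketch below shows one plausible scheme (assumed, not the study's actual encoder): each character is mapped to its 8-bit character code and the result is zero-padded to a fixed length.

```python
def smiles_to_binary(smiles, max_len=64):
    """Encode a SMILES string as a flat 0/1 vector (8 bits per character),
    zero-padded to max_len characters. Illustrative only; the encoding used
    in the study is not specified in the text."""
    bits = []
    for ch in smiles[:max_len]:
        code = ord(ch)
        bits.extend((code >> i) & 1 for i in range(7, -1, -1))  # big-endian bits
    bits.extend([0] * (8 * (max_len - min(len(smiles), max_len))))
    return bits

vec = smiles_to_binary("C=C")  # e.g., an ethylene-like repeat unit
print(len(vec))  # 512
```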
Figure 3.7: Dielectric constants (A) and SMILES dimensions (B) of the polymer dataset.
Figure 3.8 compares the parity plots (i.e., model prediction vs. input dataset) of three
MLP models developed using different activation functions: (A) a modified ReLU
activation function, in which the standard ReLU is used for gradient calculation within the
backpropagation learning rule, whereas a binary step is used for the activation itself; (B) the standard ReLU
activation function; and (C) the hyperbolic tangent (Tanh) function. Of the three, the model obtained with
the modified ReLU activation function demonstrates excellent agreement between the first-
principles-derived dielectric constant values and the model-predicted ones. The root mean
squared error (RMSE) is 0.045.
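The modified ReLU described in (A) is effectively a straight-through scheme: a binary step in the forward pass, with the standard ReLU gradient substituted during backpropagation. A minimal NumPy sketch (function names are hypothetical):

```python
import numpy as np

def modified_relu_forward(x):
    """Binary step used for the activation itself: 1 where x > 0, else 0."""
    return (x > 0).astype(x.dtype)

def modified_relu_backward(x, grad_out):
    """Gradient of the standard ReLU (1 where x > 0), used in place of the
    step function's gradient, which is zero almost everywhere."""
    return grad_out * (x > 0).astype(x.dtype)

x = np.array([-1.0, 0.5, 2.0])
print(modified_relu_forward(x))                     # [0. 1. 1.]
print(modified_relu_backward(x, np.ones_like(x)))   # [0. 1. 1.]
```

The substitution keeps a useful gradient signal flowing through units whose forward output is a hard 0/1 decision.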
The experiment with variable training sets demonstrates how enlarging the
sample pool affects the overall MLP performance. Figure 3.9 shows parity plots of the MLP model
for training datasets of 662 and 546 polymers. By the RMSE metric, the training set of 662 samples
demonstrates higher performance than the set of 546 samples. However, the individual RMSE
values indicate that the predictive performance within the training pool is, as expected, better for the
low-volume set (0.028 for 546 versus 0.032 for 662 polymers).
Figure 3.8: Model prediction vs. training data of polymer dielectric constants using the MLP model with
variable activation functions: (A) modified ReLU (RMSE = 0.053); (B) ReLU (RMSE = 0.242); (C) Tanh
(RMSE = 0.267).
Figure 3.9: Performance of MLP prediction model (Elman type). Comparison of prediction on (A) 662
polymers and (B) 546 polymers training sets.
As explained in the previous section, we have built a random forest model using the Citrination
online platform. Prior to model development, we cleaned the polymer dielectric dataset by
removing unphysical records, obtaining 846 data points in total. The random forest model [33]
is implemented using the Lolo library, which supports a variety of data types and ML functions,
including continuous and categorical features, regression and classification trees,
jackknife variance estimates, hyperparameter optimization, and validation metrics for accuracy and
uncertainty quantification. The jackknife is an uncertainty-estimation method based on resampling the data, similar
to the leave-one-out approach to cross-validation. Suppose X = (X_1, X_2, X_3, …, X_n) is a set of observed
samples. The i-th jackknife sample set is defined as the original set without the i-th sample,

X_[i] = {X_1, X_2, …, X_(i−1), X_(i+1), …, X_n}. (3.14)

The estimated value θ_(i) obtained by an estimator S is

θ_(i) = S(X_[i]). (3.15)
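The jackknife resampling of Eqs. (3.14) and (3.15) can be sketched directly; the bias and standard-error expressions below are the standard jackknife estimates:

```python
import math

def jackknife(samples, estimator):
    """Leave-one-out (jackknife) estimates of the bias and standard error
    of an estimator.

    Eq. (3.14): X_[i] drops the i-th sample; Eq. (3.15): theta_(i) = S(X_[i]).
    """
    n = len(samples)
    theta_hat = estimator(samples)
    thetas = [estimator(samples[:i] + samples[i + 1:]) for i in range(n)]
    theta_bar = sum(thetas) / n
    bias = (n - 1) * (theta_bar - theta_hat)
    se = math.sqrt((n - 1) / n * sum((t - theta_bar) ** 2 for t in thetas))
    return bias, se

mean = lambda xs: sum(xs) / len(xs)
bias, se = jackknife([2.0, 4.0, 6.0, 8.0], mean)
# For the sample mean, the jackknife bias is ~0 (the mean is unbiased) and
# the standard error matches the usual s / sqrt(n) formula.
```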
The performance of a model is evaluated by the bias and standard error of the estimator. Table 3.4
shows hyperparameters that are used to build our model.
Table 3.4: Hyperparameters used in our random forest model.
Number of estimators 155
Minimum samples per leaf 1
Maximum tree depth 30
Leaf model Mean
Number of cross-validation folds 3
Figure 3.10 shows the resulting parity plot with uncertainty in the predicted value. Data
points that are within the input feature space are labeled as “not extrapolating”, otherwise labeled
as “extrapolating.”
Figure 3.10: Model prediction vs. training data of polymer dielectric constants using the random forest
model.
Table 3.5 summarizes various model-performance metrics: RMSE, non-dimensional
model error (NDME), the standard error in the estimate of RMSE (SE-RMSE), and the
standard error in the estimate of NDME (SE-NDME). Here, NDME is the ratio between the RMSE
and the standard deviation of the data. The small RMSE and NDME values indicate that the developed model
possesses good predictive capability. We examine three sets of features: (1) the SMILES string; (2)
26 additional features consisting of polymer functional groups; and (3) the SMILES string and the
functional-group features together. The SMILES string alone yields an RMSE of about 0.13, while the
functional-group features alone yield 0.17. However, we find that SE-RMSE improves substantially,
by 53%, when the SMILES and functional-group features are combined.
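The relation between the metrics in Table 3.5 (NDME is the RMSE divided by the standard deviation of the target values) can be made concrete with a small computation on hypothetical data:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def ndme(y_true, y_pred):
    """Non-dimensional model error: RMSE normalized by the standard deviation
    of the true values, so 0 is perfect and 1 is no better than predicting
    the mean."""
    mu = sum(y_true) / len(y_true)
    std = math.sqrt(sum((t - mu) ** 2 for t in y_true) / len(y_true))
    return rmse(y_true, y_pred) / std

y_true = [2.0, 3.0, 4.0, 5.0]   # hypothetical dielectric constants
y_pred = [2.1, 2.9, 4.1, 4.9]   # hypothetical model predictions
print(round(ndme(y_true, y_pred), 3))  # 0.089
```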
While the optimization of the RNN model is still in progress, Fig. 3.11 shows that the loss
function (which represents the deviation of the model prediction from the ground truth) decreases
monotonically as a function of model training epochs.
Table 3.5: Performance metrics of the random forest model.
Features                     RMSE   SE-RMSE   NDME   SE-NDME
SMILES                       0.129  0.00713   0.327  0.0180
Functional groups            0.174  0.00281   0.439  0.00709
SMILES & functional groups   0.130  0.00330   0.329  0.00833
Figure 3.11: Convergence of the RNN model performance as a function of training epochs.
3.4 Summary
We have developed a high-throughput simulation-ML framework for computationally
synthesizing a large dataset of polymers, evaluating their dielectric constants, and learning
structure-property relationships using a new valence-aware polarizable reactive force-field model
and ML models including a multilayer perceptron, random forest and recurrent neural network.
A high level of prediction accuracy was achieved thanks to the large simulated training dataset.
Employing the full dataset of 1,276 structurally diverse polymers, the ML models have predicted
the dielectric constants of polymers with good accuracy.
We have also shown that graph neural network-based analysis can automatically classify
the different phases present in RMD simulations of MoS2 synthesis. Furthermore, we found that the
addition of edge-based features (especially bond distance) increases the model accuracy
significantly. Network analysis by visualizing the feature space of our GNN model shows clear
separation of the 2H and 1T graph structures inside the network, which helps identify and better
understand these structures.
This is in contrast to conventional techniques for structural analysis (e.g., common-neighborhood
analysis and centrosymmetry parameter calculation). While they work for monoatomic
FCC, BCC and HCP crystals, these conventional methods do not distinguish the 2H and 1T
crystalline phases in the transition-metal dichalcogenide layers considered here. Given the lack of a
readily available order parameter that can identify each structure type, our GNN model serves as
an indispensable analysis tool for general materials.
As an extension of this work, we are currently developing regression models that are being
combined with our active learning framework, so as to predict the optimal polymer structure with
a minimal number of first-principles-based calculations. This is achieved by incrementally
expanding the training set while balancing exploration and exploitation based on Bayesian
optimization. Finally, we are implementing all simulation and ML models on scalable parallel
computing platforms, including the United States' first exaflop/s computer, Aurora A21, to be
installed in 2022 under our Aurora Early Science Program award. Such scalable ML prediction
models, once optimized and validated, will be indispensable for further enlarging the search space
for superior dielectric polymers by orders of magnitude.
Chapter 4
Protein Modeling at Exascale
4.1 Introduction
The three-dimensional (3D) structure of a protein reveals crucial information about how it
interacts with other proteins to carry out fundamental biological functions. Proteins are linear
chains of amino acids that fold into specific 3D conformations as a result of the physical properties
of the amino-acid sequence. The structure, in turn, determines the wide range of protein functions.
Thus, understanding the complexity of protein folding is vital for studying the mechanisms of these
molecular machines in health and disease, and for the development of new drugs. Various machine
learning (ML) techniques have been applied successfully to protein structure analysis in the past
[122, 123]. In previous works, for example, we explored fast atomistic learning based on a 2D
convolutional neural network (CNN), through dimension mapping using space-filling curves
[124]. We have also demonstrated that a novel spatial model built with a graph convolutional
neural network (GCNN) can be used effectively to produce interpretable structural classification [125].
ML models such as neural networks have long been applied to predict 1D structural features such
as backbone torsion angles, secondary structure and solvent accessibility of amino-acid residues.
The focus of ML applications has since shifted to 2D representations of 3D structures, such as
residue-residue contact maps [126] and inter-residue distance matrices. Recognizing that contact
maps are similar to 2D images, whose classification and interpretation have been among the
most successful applications of deep learning (DL), the community has begun to
apply DL to recognize patterns in the sequences and structures of proteins in the Protein Data Bank
(PDB) [127]. CNNs have demonstrated excellent performance in image analysis tasks, making
them a natural choice for the prediction of protein contact maps. How best to
encode information about the target protein for input to the neural network is an active research
topic. By analogy, color images are often encoded as three matrices of real numbers, i.e., the
intensities of the red, green and blue color channels for all image pixels. Methods such as DeepContact
[128] and RaptorX-Contact [129] use input features consisting of N × N (where N is the number
of amino acids in the sequence of the target protein) residue-residue coupling matrices derived
from covariation analyses of the target protein, augmented by predictions of local sequence
features. In the DeepCov [130] and TripletRes [131] approaches, more information from the target
protein's multiple-sequence alignment is provided to the network, in the form of 400 different N × N
feature matrices, each corresponding to a defined pair of amino acids, with the value at position
(i, j) in a given matrix being either the pair frequency or the covariance for the given amino-acid
pair at alignment positions i and j. A CNN then integrates this massive set of features to identify
spatial contacts, trained on large sets of proteins of known structure together with their associated contact
maps and multiple-sequence alignments. The importance of incorporating ML in template-free
modeling has been highlighted by the top-performing CASP13 structure prediction methods, all of
which rely on deep convolutional neural networks for predicting residue contacts or distances,
predicting backbone torsion angles and ranking the final models.
Approaches to the protein residue contact map prediction problem include support vector
machines (SVM) [132], CNNs [130], recurrent neural network (RNN) + CNN combinations [133], ResNet, VGG
and other proven architectures. Most of these approaches require heavy feature engineering. Apart
from the amino-acid sequence, they use additional engineered features such as amino-acid pair
frequencies [130], covariation scores [130, 134], and position-specific scoring matrices (PSSM). Such
heavy feature engineering often leads to poor transfer learning across related protein
folding tasks. Bepler et al. [133] demonstrated the promise of transfer learning, i.e., the ability to transfer
knowledge between structurally related proteins, through representation learning, which we
follow here. In this work, we introduce a graph neural network (GNN) into the conventional RNN-based
pipeline [133] so as to better capture spatial correlations. Furthermore, we propose a
multiscale GNN approach, in which short-, medium- and long-range spatial correlations are
refined by dedicated GNNs after coarse learning.
4.2 Contact Map Prediction with Representation Learning
A protein contact map represents pairwise amino-acid distances, where each pair of
amino acids from the input sequence is mapped to a label ∈ {0, 1}, which denotes whether the amino acids
are "in contact" (1, i.e., within a cutoff distance of 8 Å) or not (0); see Fig. 4.1. Accurate contact
maps provide powerful global information; e.g., they facilitate the understanding of the complex
dynamical behavior of proteins and other biomolecules [135] and robust modeling of full 3D protein
structure [136]. In particular, medium- and long-range contacts, which may be as few as twelve
sequence positions apart or as many as hundreds apart, are especially important for 3D structure
modeling. However, existing approaches [130, 132, 133] require heavy feature engineering and
suffer from poor transfer learning across related protein folding tasks.
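Given residue coordinates, the ground-truth contact map follows directly from the 8 Å cutoff. A minimal NumPy sketch (using one reference point per residue, e.g., the C-alpha atom, which is an assumption here):

```python
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary residue-residue contact map: 1 if the pairwise distance is
    within the cutoff (8 angstroms), 0 otherwise.

    coords: (L, 3) array of per-residue reference coordinates.
    """
    diff = coords[:, None, :] - coords[None, :, :]   # (L, L, 3) displacements
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # (L, L) distance matrix
    return (dist <= cutoff).astype(np.int8)

# Three residues on a line, 5 angstroms apart: only the end pair (10 A) is
# beyond the cutoff and hence not in contact.
coords = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
print(contact_map(coords))
```

The resulting matrix is symmetric, which is why contact prediction methods typically symmetrize their output as well.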
Figure 4.1: Left: 3D protein structure. Right: residue-residue contact map of a protein.
Here, we address these problems with the proven strength of representation learning.
Specifically, we use a type of RNN, a bidirectional long short-term memory (LSTM) embedding
model, that maps sequences of amino acids to sequences of vector representations, such that
residues occurring in similar structural contexts are close in embedding space (Fig. 4.2). Inducing
residue-residue contacts from the vector representations remains a challenge, because
these vectors carry few position-level correspondences between residues. We solve this problem
by introducing a GNN. The role of the GNN is, via its powerful capability for structural learning, to infer
the pairwise relations of residues from the intermediate vector representations.
Graph-based data in general can be represented as G = (V, E), where V is the set of nodes
and E is the set of edges. Each edge e_uv ∈ E is a connection between nodes u and v. If G is
directed, we have e_uv ≢ e_vu; if G is undirected, instead e_uv ≡ e_vu. Here, we deal with undirected
graphs, but it is straightforward to modify such a model to process directed graph data. In protein
residue contact graphs, the nodes are amino-acid residues and the edges denote spatial proximity
within 8 Å.
Figure 4.2: Procedure of the learning task. AA: amino acid.
The goal of the GNN is to learn low-dimensional representations of graphs from the
connectivity structure and the input features of nodes and edges. The forward pass of the GNN has two
steps, i.e., message passing and node-state updating. The architecture is summarized by the
following recurrence relations, where t denotes the iteration count:

m_v^(t+1) = Σ_{w∈N(v)} M_t(h_v^t, h_w^t, e_vw), (4.1)

h_v^(t+1) = U_t(h_v^t, m_v^(t+1)), (4.2)

where N(v) denotes the neighbors of node v in graph G. The message function M_t takes
node state h_v^t and edge state e_vw as inputs and produces the message m_v^(t+1), which can be considered
a collection of feature information from the neighbors of v. The node states are then updated by the
function U_t based on the previous state and the message. The initial states h_v^0 are set to the input
features of the amino acids. Here, we use the normalized adjacency matrix Ã of the graph, coupled with
some other features, as the edge state. These two steps are repeated a total of T times in order
to gather information from distant neighbors, and the node states are updated accordingly. The GNN
can be regarded as a layer-wise model that propagates messages over the edges and updates the
states of nodes in the previous layer; thus, T can be considered the number of layers of the
model. The exact form of the message function is

m_v^(t+1) = A_v W_t [h_1^t … h_v^t] + b, (4.3)
where W_t are the weights of the GNN and b denotes the bias. We use gated recurrent units as the update
function:

z_v^t = σ(W_z m_v^t + U_z h_v^(t−1)), (4.4)

r_v^t = σ(W_r m_v^t + U_r h_v^(t−1)), (4.5)

h̃_v^t = tanh(W m_v^t + U (r_v^t ⊙ h_v^(t−1))), (4.6)

h_v^t = (1 − z_v^t) ⊙ h_v^(t−1) + z_v^t ⊙ h̃_v^t, (4.7)

where ⊙ denotes element-wise multiplication and σ(·) is the sigmoid function used for nonlinear
activation.
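One forward iteration of the message-passing and GRU-update recurrences, Eqs. (4.1)-(4.7), can be sketched in NumPy. The weight shapes, random initialization and toy adjacency below are illustrative assumptions, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gnn_layer(H, A, W, b, Wz, Uz, Wr, Ur, Wh, Uh):
    """One message-passing + GRU node-update step.

    H: (n, d) node states h_v^t; A: (n, n) normalized adjacency matrix.
    Message (Eq. 4.3): an adjacency-weighted sum of linearly transformed
    neighbor states plus a bias. Update (Eqs. 4.4-4.7): gated recurrent unit.
    """
    M = A @ (H @ W.T) + b                 # messages m_v^{t+1}, shape (n, d)
    Z = sigmoid(M @ Wz.T + H @ Uz.T)      # update gate z_v^t, Eq. (4.4)
    R = sigmoid(M @ Wr.T + H @ Ur.T)      # reset gate r_v^t, Eq. (4.5)
    H_tilde = np.tanh(M @ Wh.T + (R * H) @ Uh.T)       # Eq. (4.6)
    return (1.0 - Z) * H + Z * H_tilde    # new node states, Eq. (4.7)

rng = np.random.default_rng(0)
n, d = 5, 4                               # 5 residues, 4-dimensional states
H = rng.normal(size=(n, d))
A = np.ones((n, n)) / n                   # toy normalized adjacency
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(7)]
b = np.zeros(d)
H1 = gnn_layer(H, A, Ws[0], b, *Ws[1:])
print(H1.shape)  # (5, 4)
```

Stacking T such calls corresponds to the T-layer view of the GNN described above.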
Figure 4.3 shows a schematic of the proposed multiscale RNN+GNN model, which
improves upon existing RNN+CNN models for protein residue contact map prediction [134].
Figure 4.3: The proposed multiscale GNN approach on top of the existing RNN+CNN pipeline.
First, the RNN unit works as an encoder that takes a sequence of amino acids representing a
protein and transforms it into vector representations of the same length. To align with the biological
process of protein folding, we employ a bidirectional LSTM as the encoder to absorb representations
from the neighboring amino acids, so that the resultant embeddings contain hidden features from both
sides.
In order to produce contacts from the sequence of embeddings, we define a pairwise feature
tensor,

v_ij = [z_i − z_j ‖ z_i ⊙ z_j], (4.8)

of size L × L × 2D, where D is the dimension of the embedding vector, L is the length of the protein,
‖ is concatenation, and ⊙ is the element-wise product. This featurization is symmetric and has
extensive applications in natural language processing (NLP) models [137]. This 3D feature is then
transformed through the proposed GNN module. To achieve higher granularity of the contact
predictions with regard to the positional ranges, in particular short-, medium- and long-range
contacts, we have designed three range-based GNN blocks, in which, at each layer t, the edge features,
(E_s^t, E_m^t, E_l^t), and node features, (H_s^t, H_m^t, H_l^t), are updated respectively as

e_ij^(t+1) = α[W_1 (α(W_2 e_ij^t) ‖ α(W_3 h_i^t h_j^t))], (4.9)

h_i^(t+1) = α[W_4 (α(W_5 h_i^t ‖ Σ_j e_ij^(t+1)/N))], (4.10)

where W_1…5 are learnable weights and α is the rectified linear unit (ReLU) activation. In Eq. (4.10),
N is the number of positional neighbors of an amino acid, which distinguishes the range-based
blocks: (i) short-range blocks focus on positional neighbors from 6 to 11, so that there are N = 12
− 6 = 6 neighbors; (ii) medium-range blocks, for which N = 24 − 12 = 12; and (iii) long-range blocks,
for which N = L − 24. Thus, the new node features are induced by an average of the neighboring
edge features.
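The pairwise featurization of Eq. (4.8) is a small tensor transform; a NumPy sketch follows. Note that, as written, the difference term z_i − z_j is antisymmetric in the pair order while the element-wise product is symmetric (taking the absolute difference is a common fully symmetric variant, though the text does not specify that):

```python
import numpy as np

def pairwise_features(Z):
    """Eq. (4.8): v_ij = [z_i - z_j || z_i * z_j], producing an (L, L, 2D)
    tensor from (L, D) per-residue embeddings via broadcasting."""
    diff = Z[:, None, :] - Z[None, :, :]      # (L, L, D), antisymmetric in (i, j)
    prod = Z[:, None, :] * Z[None, :, :]      # (L, L, D), symmetric in (i, j)
    return np.concatenate([diff, prod], axis=-1)

Z = np.random.default_rng(1).normal(size=(6, 3))   # L = 6 residues, D = 3
V = pairwise_features(Z)
print(V.shape)  # (6, 6, 6)
```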
The output of the final layer of the GNN blocks is a triplet, E^T = (E_s^T, E_m^T, E_l^T), containing
the edge features of the corresponding range-based blocks. A fully connected layer then merges and
converts E^T into the contact map.
It is worth noting here that our multiscale feature averaging is akin to the additive
hybridization scheme [138] used in the celebrated multiscale quantum-mechanical
(QM)/molecular-mechanical (MM) simulation approach, for which Karplus, Levitt and Warshel
shared the Nobel Prize in Chemistry in 2013 [139]. In the additive hybridization scheme, energy
contributions from different spatial ranges are described by appropriate approaches in the
respective ranges, which are averaged to provide the total energy [138].
4.3 Results and Discussion
To evaluate the proposed model, we use the ProteinNet dataset [140]. ProteinNet is a
standardized dataset for machine learning of protein structure, which builds on the CASP (Critical
Assessment of protein Structure Prediction) experiments that carry out blind prediction of recently
solved but not yet publicly available protein structures. In our experiments, we specifically chose a
subset from CASP12. We implemented all methods in TensorFlow 2.5 and trained on two NVIDIA
V100 graphics processing units (GPUs). The GNN module consists of two layers with independent
parameters for each of the short-, medium- and long-range blocks.
Figure 4.4 shows the precision of the proposed multiscale RNN+GNN approach as a
function of the training epoch, compared with that of the baseline RNN+CNN approach. Here, we
employ a commonly used metric: "P@K" signifies the precision for the top K contacts,
where all predicted contacts are sorted from highest to lowest confidence. Let L be the length of
the protein; then "P@L/2", for example, is the precision for the L/2 most likely predicted contacts.
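The P@K metric can be computed by sorting the predicted contacts by confidence and taking the fraction of true contacts among the top K; a minimal sketch with toy scores:

```python
def precision_at_k(scores, labels, k):
    """Precision for the top-k most confident predicted contacts.

    scores: predicted contact confidences; labels: 1 if truly in contact,
    0 otherwise (same order as scores).
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
labels = [1, 0, 1, 1, 0, 0]
L = 6
print(precision_at_k(scores, labels, L // 2))  # P@L/2: top 3 contain 2 true contacts
```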
In Fig. 4.4, we observe improved P@L/2 precision from the addition of the multiscale GNNs. As
is well known, contact prediction at increasing spatial ranges becomes progressively more difficult.
The proposed multiscale RNN+GNN approach consistently outperforms the baseline in all ranges,
including the hardest case of long-range contact prediction.
In Tables 4.1-4.3, we show the results of contact prediction in terms of P@L, P@L/2 and
P@L/5. We benchmark the proposed method against the CNN-based approach [133] as the baseline.
The precisions are collected from the test dataset consisting of 144 protein sequences. The new
multiscale RNN+GNN approach consistently improves the prediction precision over the baseline
RNN+CNN approach for all K top contacts (K = L, L/2, L/5). While the precision decreases as we
move from short to medium to long ranges, as expected, the multiscale RNN+GNN approach
maintains its precision advantage over the baseline.
Figure 4.4: P@L/2 precision of the proposed multiscale RNN+GNN model (green) as a function of the training
epoch, compared to that of the conventional RNN+CNN model (cyan), for the prediction of (top) short-,
(middle) medium- and (bottom) long-range protein residue contacts.
Table 4.1: Short-range contact prediction results.
P@L P@L/2 P@L/5
baseline 0.319 0.299 0.344
proposed 0.360 0.355 0.370
Table 4.2: Medium-range contact prediction results.
P@L P@L/2 P@L/5
baseline 0.220 0.223 0.237
proposed 0.254 0.256 0.263
Table 4.3: Long-range contact prediction results.
P@L P@L/2 P@L/5
baseline 0.135 0.139 0.158
proposed 0.145 0.150 0.165
In summary, we have proposed a multiscale GNN-based approach, taking a cue from the
celebrated multiscale physics simulation approach, in which a standard pipeline consisting of an RNN
and a CNN is improved by introducing three GNNs to refine the predictive capability for short-,
medium- and long-range contacts, respectively. The results show improved accuracy for contacts
of all ranges, including the most challenging case of long-range contact prediction. The multiscale
GNN approach reflects the inherently multiscale nature of biomolecular and other physical
systems, and as such is expected to find broader applications beyond the protein residue contact
prediction problem.
As deep learning continues to show promise for modeling the relation between primary and
tertiary structure, methods of interpreting learned representations become progressively more
important for providing biologically meaningful insights into how proteins work. Previous work
has tackled approaches for identifying biologically significant model parameters, such as saliency
measures over 3D space for volumetric methods like CNNs [125]. However, these approaches
are limited in that saliency maps over a volume do not adequately describe the role of interactions
between residues or the predictive value of higher-order substructures. Work by Zamora-Resendiz
et al. [124] demonstrated how GCNN architectures learn more biologically relevant
representations. Parameter attributions were found to localize at meaningful segments (including
secondary structures) for RAS proteins, and these substructures are characterized in the
literature on RAS. As we learn more about how to represent physical systems in deep learning
frameworks, the "data-agnostic" capabilities of deep learning methods will help in discovering
biologically relevant substructures, given the proper innate priors.
4.4 Future Work
With the continuously growing size of the ProteinNet dataset used in this study, the
proposed ML approach is becoming heavily compute bound. To address this challenge, we are
currently implementing our model on leadership-scale parallel supercomputers at the Argonne
Leadership Computing Facility (ALCF), including the Theta [141] and new Polaris machines.
Each computing node of Polaris consists of one AMD EPYC "Milan" central processing unit
(CPU) and four NVIDIA A100 GPUs. These leadership-scale implementations will be applied to
the largest available protein datasets. For massively parallel learning, we employ data
parallelism utilizing a distributed ML framework, Horovod [142]. Here, the global batch of input
data is split across computing nodes, and the model parameters are updated by aggregating the
gradients from the nodes. Up to O(100) nodes, we adopt synchronous training, where the model
and gradients are always in sync among all the nodes. For larger-scale training, a hybrid
synchronous-asynchronous approach will be employed instead for higher scalability (though with
slower statistical convergence) [143]. For complex network and hyperparameter tuning, we also
use DeepHyper, a scalable automated ML (AutoML) package for the development of deep neural
networks [144]. We utilize two components of the package: (i) neural architecture search (NAS)
for the automatic search of high-performing deep neural network (DNN) architectures; and (ii)
hyperparameter search (HPS) for the automatic identification of high-performing DNN
hyperparameters. Running such a massive ML workflow on leadership-scale parallel
supercomputers will likely pose runtime challenges such as fault recovery, for which we will
utilize ALCF support for Balsam high-performance computing (HPC) workflows and edge
services [145]. The multiscale RNN+GNN model is being scaled up to these leadership-scale parallel
supercomputers.
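The synchronous data-parallel scheme described above can be illustrated with a toy NumPy sketch: each "node" computes gradients on its shard of the global batch, and the update uses the averaged gradient. In practice Horovod performs this averaging via an allreduce across real nodes; the quadratic loss and shapes here are purely illustrative.

```python
import numpy as np

def local_gradient(w, x_shard, y_shard):
    """Gradient of mean squared error 0.5*(x.w - y)^2 on one node's shard."""
    err = x_shard @ w - y_shard
    return x_shard.T @ err / len(y_shard)

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))               # global batch of 8 samples
y = rng.normal(size=8)
w = np.zeros(3)                           # toy linear-model parameters

# Split the global batch across 4 "nodes" and average their gradients,
# mimicking a synchronous allreduce.
shards = np.array_split(np.arange(8), 4)
grads = [local_gradient(w, X[i], y[i]) for i in shards]
avg_grad = np.mean(grads, axis=0)

# With equal shard sizes, the averaged gradient equals the single-node
# gradient on the full batch:
full_grad = local_gradient(w, X, y)
print(np.allclose(avg_grad, full_grad))  # True
```

This equivalence is what makes synchronous training statistically identical to large-batch single-node training, while the hybrid synchronous-asynchronous variant trades some of that exactness for scalability.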
References
[1] D. Frenkel and B. Smit. 2001. Understanding Molecular Simulation. Academic Press, San Diego, CA.
[2] A. Rahman. 1964. Correlations in the motion of atoms in liquid argon. Physical Review, 136 (2A). A405-A411. 10.1103/PhysRev.136.A405
[3] M. Levitt. 2014. Birth and future of multiscale modeling for macromolecular systems (Nobel lecture). Angewandte Chemie International Edition, 53 (38). 10006-10018. 10.1002/anie.201403691
[4] S. Tsuneyuki, Y. Matsui, H. Aoki and M. Tsukada. 1989. New pressure-induced structural transformations in silica obtained by computer simulation. Nature, 339 (6221). 209-211. 10.1038/339209a0
[5] F. Shimojo, I. Ebbsjo, R. K. Kalia, A. Nakano, J. P. Rino and P. Vashishta. 2000. Molecular dynamics simulation of structural transformation in silicon carbide under pressure. Physical Review Letters, 84 (15). 3338-3341.
[6] A. C. T. van Duin, S. Dasgupta, F. Lorant and W. A. Goddard. 2001. ReaxFF: a reactive force field for hydrocarbons. Journal of Physical Chemistry A, 105 (41). 9396-9409. 10.1021/jp004368u
[7] A. Nakano, R. K. Kalia, P. Vashishta, T. J. Campbell, S. Ogata, F. Shimojo and S. Saini. 2001. Scalable atomistic simulation algorithms for materials research. Proceedings of Supercomputing, SC01, ACM/IEEE. 10.1145/582034.582035
[8] P. Vashishta, R. K. Kalia, J. P. Rino and I. Ebbsjo. 1990. Interaction potential for SiO2: a molecular-dynamics study of structural correlations. Physical Review B, 41 (17). 12197-12209. 10.1103/PhysRevB.41.12197
[9] A. Nakano, R. K. Kalia, K. Nomura, A. Sharma, P. Vashishta, F. Shimojo, A. C. T. van Duin, W. A. Goddard, R. Biswas, D. Srivastava and L. H. Yang. 2008. De novo ultrascale atomistic simulations on high-end parallel supercomputers. International Journal of High Performance Computing Applications, 22 (1). 113-128. 10.1177/1094342007085015
[10] K. Nomura, P. E. Small, R. K. Kalia, A. Nakano and P. Vashishta. 2015. An extended Lagrangian scheme for charge equilibration in reactive molecular dynamics simulations. Computer Physics Communications, 192. 91-96. 10.1016/j.cpc.2015.02.023
[11] D. C. Rapaport. 1988. Large-scale molecular-dynamics simulation using vector and parallel computers. Computer Physics Reports, 9 (1). 1-53. 10.1016/0167-7977(88)90014-7
[12] A. Nakano, P. Vashishta and R. K. Kalia. 1993. Parallel multiple-time-step molecular dynamics with 3-body interaction. Computer Physics Communications, 77 (3). 303-312. 10.1016/0010-4655(93)90178-F
[13] S. Plimpton. 1995. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117 (1). 1-19. 10.1006/jcph.1995.1039
[14] L. Kale, R. Skeel, M. Bhandarkar, R. Brunner, A. Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan and K. Schulten. 1999. NAMD2: greater scalability for parallel molecular dynamics. Journal of Computational Physics, 151 (1). 283-312. 10.1006/jcph.1999.6201
[15] D. E. Shaw. 2005. A fast, scalable method for the parallel evaluation of distance-limited pairwise particle interactions. Journal of Computational Chemistry, 26 (13). 1318-1328. 10.1002/jcc.20267
[16] D. A. Reed and J. Dongarra. 2015. Exascale computing and big data. Communications of the ACM, 58 (7). 56-68. 10.1145/2699414
[17] K. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, K. Shimamura, F. Shimojo, M. Kunaseth, P. C. Messina and N. A. Romero. 2014. Metascalable quantum molecular dynamics simulations of hydrogen-on-demand. Proceedings of Supercomputing, SC14, IEEE/ACM. 661-673. 10.1109/SC.2014.59
[18] F. Shimojo, R. K. Kalia, M. Kunaseth, A. Nakano, K. Nomura, S. Ohmura, K. Shimamura and P. Vashishta. 2014. A divide-conquer-recombine algorithmic paradigm for multiscale materials modeling. Journal of Chemical Physics, 140 (18). 18A529. 10.1063/1.4869342
[19] N. A. Romero, A. Nakano, K. Riley, F. Shimojo, R. K. Kalia, P. Vashishta and P. C. Messina. 2015. Quantum molecular dynamics in the post-petaflops era. IEEE Computer, 48 (11). 33-41.
[20] J. Mellor-Crummey, D. Whalley and K. Kennedy. 2001. Improving memory hierarchy performance for irregular applications using data and computation reorderings. International Journal of Parallel Programming, 29 (3). 217-247. 10.1023/A:1011119519789
[21] M. Kunaseth, R. K. Kalia, A. Nakano, K. Nomura and P. Vashishta. 2013. A scalable parallel algorithm for dynamic range-limited n-tuple computation in many-body molecular dynamics simulation. Proceedings of Supercomputing, SC13, ACM/IEEE.
[22] J. P. Walters, V. Balu, V. Chaudhary, D. Kofke and A. Schultz. 2008. Accelerating molecular dynamics simulations with GPUs. Proceedings of International Conference on Parallel and Distributed Computing and Communication Systems (PDCCS 2008), ISCA. 44-49.
[23] W. M. Brown, P. Wang, S. J. Plimpton and A. N. Tharrington. 2011. Implementing molecular dynamics on hybrid high performance computers - short range forces. Computer Physics Communications, 182 (4). 898-911. 10.1016/j.cpc.2010.12.021
[24] T. J. Lipscomb, A. Zou and S. S. Cho. 2012. Parallel Verlet neighbor list algorithm for GPU-optimized MD simulations. Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (BCB '12), ACM. 321-328. 10.1145/2382936.2382977
[25] W. M. Brown and M. Yamada. 2013. Implementing molecular dynamics on hybrid high performance computers - three-body potentials. Computer Physics Communications, 184 (12). 2785-2793. 10.1016/j.cpc.2013.08.002
[26] S. B. Kylasa, H. M. Aktulga and A. Y. Grama. 2014. PuReMD-GPU: a reactive molecular dynamics simulation package for GPUs. Journal of Computational Physics, 272. 343-359. 10.1016/j.jcp.2014.04.035
[27] A. Nakano, R. K. Kalia and P. Vashishta. 1994. Multiresolution molecular-dynamics algorithm for realistic materials modeling on parallel computers. Computer Physics Communications, 83 (2-3). 197-214.
[28] M. P. Allen and D. J. Tildesley. 1987. Computer Simulation of Liquids. Oxford University Press, Oxford, UK.
[29] W.-m. W. Hwu. 2011. GPU Computing Gems Jade Edition. Morgan Kaufmann, Waltham, MA.
[30] http://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm
[31] J. Tersoff. 1989. Modeling solid-state chemistry: interatomic potentials for multicomponent systems. Physical Review B, 39 (8). 5566-5568. 10.1103/PhysRevB.39.5566
[32] D. W. Brenner. 1990. Empirical potential for hydrocarbons for use in simulating the chemical vapor deposition of diamond films. Physical Review B, 42 (15). 9458-9471.
[33] A. K. Rappe and W. A. Goddard. 1991. Charge equilibration for molecular-dynamics simulations. Journal of Physical Chemistry, 95 (8). 3358-3363. 10.1021/j100161a070
[34] F. H. Streitz and J. W. Mintmire. 1994. Electrostatic potentials for metal-oxide surfaces and interfaces. Physical Review B, 50 (16). 11996-12003. 10.1103/PhysRevB.50.11996
[35] S. W. Rick, S. J. Stuart and B. J. Berne. 1994. Dynamical fluctuating charge force fields: application to liquid water. Journal of Chemical Physics, 101 (7). 6141-6156. 10.1063/1.468398
[36] T. P. Senftle, S. Hong, M. M. Islam, S. B. Kylasa, Y. Zheng, Y. K. Shin, et al., "The ReaxFF reactive force-field: development, applications and future directions," npj Computational Materials, vol. 2, p. 15011, Mar 4 2016.
[37] A. K. Rappe and W. A. Goddard, "Charge equilibration for molecular-dynamics simulations," Journal of Physical Chemistry, vol. 95, pp. 3358-3363, Apr 18 1991.
[38] A. Nakano, "Parallel multilevel preconditioned conjugate-gradient approach to variable-charge molecular dynamics," Computer Physics Communications, vol. 104, pp. 59-69, Aug 1997.
[39] K. Nomura, R. K. Kalia, A. Nakano, and P. Vashishta, "A scalable parallel algorithm for large-scale reactive force-field molecular dynamics simulations," Computer Physics Communications, vol. 178, pp. 73-87, Jan 15 2008.
[40] H. M. Aktulga, S. A. Pandit, A. C. T. van Duin, and A. Y. Grama, "Reactive molecular dynamics: numerical methods and algorithmic techniques," SIAM Journal on Scientific Computing, vol. 34, pp. C1-C23, Jan 31 2012.
[41] S. B. Kylasa, H. M. Aktulga, and A. Y. Grama, "Reactive molecular dynamics on massively parallel heterogeneous architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 28, pp. 202-214, Jan 1 2017.
[42] K. Nomura, R. K. Kalia, Y. Li, A. Nakano, P. Rajak, C. Sheng, et al., "Nanocarbon synthesis by high-temperature oxidation of nanoparticles," Scientific Reports, vol. 6, p. 24109, Apr 20 2016.
[43] R. A. Marcus, "Electron-transfer reactions in chemistry: theory and experiment," Reviews of Modern Physics, vol. 65, pp. 599-610, Jul 1993.
[44] S. Naserifar, D. J. Brooks, W. A. Goddard, and V. Cvicek, "Polarizable charge equilibration model for predicting accurate electrostatic interactions in molecules and solids," Journal of Chemical Physics, vol. 146, p. 124117, Mar 28 2017.
[45] K. Nomura, P. E. Small, R. K. Kalia, A. Nakano, and P. Vashishta, "An extended Lagrangian scheme for charge equilibration in reactive molecular dynamics simulations," Computer Physics Communications, vol. 192, pp. 91-96, July 2015.
[46] M. Kunaseth, R. K. Kalia, A. Nakano, K. Nomura, and P. Vashishta, "A scalable parallel algorithm for dynamic range-limited n-tuple computation in many-body molecular dynamics simulation," Proceedings of Supercomputing, SC13, ACM/IEEE, 2013.
[47] A. K. Geim and I. V. Grigorieva, "Van der Waals heterostructures," Nature, vol. 499, pp. 419-425, Jul 25 2013.
[48] R. F. Service, "Design for US exascale computer takes shape," Science, vol. 359, pp. 617-618, Feb 9 2018.
[49] F. Shimojo, R. K. Kalia, M. Kunaseth, A. Nakano, K. Nomura, S. Ohmura, et al., "A divide-conquer-recombine algorithmic paradigm for multiscale materials modeling," Journal of Chemical Physics, vol. 140, p. 18A529, May 14 2014.
[50] P. Umari and A. Pasquarello, "Ab initio molecular dynamics in a finite homogeneous electric field," Physical Review Letters, vol. 89, p. 157602, Oct 7 2002.
[51] A. M. N. Niklasson, "Extended Born-Oppenheimer molecular dynamics," Physical Review Letters, vol. 100, p. 123004, Mar 28 2008.
[52] P. Souvatzis and A. M. N. Niklasson, "First principles molecular dynamics without self-consistent field optimization," Journal of Chemical Physics, vol. 140, p. 044117, Jan 28 2014.
[53] D. C. Rapaport, The Art of Molecular Dynamics Simulation, Second ed. Cambridge, UK: Cambridge University Press, 2004.
[54] D. E. Shaw, "A fast, scalable method for the parallel evaluation of distance-limited pairwise particle interactions," Journal of Computational Chemistry, vol. 26, pp. 1318-1328, Oct 2005.
[55] M. Kunaseth, R. K. Kalia, A. Nakano, P. Vashishta, D. F. Richards, and J. N. Glosli, "Performance characteristics of hardware transactional memory for molecular dynamics application on BlueGene/Q: toward efficient multithreading strategies for large-scale scientific applications," Proceedings of the International Workshop on Parallel and Distributed Scientific and Engineering Computing, PDSEC-13, IEEE, Mar 20 2013.
[56] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, et al., "Knights Landing: second-generation Intel Xeon Phi product," IEEE Micro, vol. 36, pp. 34-46, Mar-Apr 2016.
[57] K. S. Novoselov, A. Mishchenko, A. Carvalho, and A. H. C. Neto, "2D materials and van der Waals heterostructures," Science, vol. 353, p. aac9439, Jul 29 2016.
[58] Y. M. Shi, H. N. Li, and L. J. Li, "Recent advances in controlled synthesis of two-dimensional transition metal dichalcogenides via vapour deposition techniques," Chemical Society Reviews, vol. 44, pp. 2744-2756, May 7 2015.
[59] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, Jan 1989.
[60] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929-1958, Jun 2014.
[61] E. D. Cubuk, S. S. Schoenholz, J. M. Rieser, B. D. Malone, J. Rottler, D. J. Durian, et al., "Identifying structural flow defects in disordered solids using machine-learning methods," Physical Review Letters, vol. 114, p. 108001, Mar 2015.
[62] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Neural Information Processing Systems 25 (2012): 1106-14.
[63] Altae-Tran, Han, Bharath Ramsundar, Aneesh S. Pappu, and Vijay Pande. "Low data drug discovery with one-shot learning." ACS Central Science 3 (2017): 283-93.
[64] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gomez-Bombarelli, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. "Convolutional networks on graphs for learning molecular fingerprints." Advances in Neural Information Processing Systems 28 (2015): 2224-32.
[65] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, O. Anatole von Lilienfeld, and Klaus-Robert Müller. "Learning invariant representations of molecules for atomization energy prediction." Advances in Neural Information Processing Systems 25 (2012): 449-57.
[66] Hansen, Katja, Franziska Biegler, Raghunathan Ramakrishnan, Wiktor Pronobis, O. Anatole von Lilienfeld, Klaus-Robert Müller, and Alexandre Tkatchenko. "Machine learning predictions of molecular properties: accurate many-body potentials and nonlocality in chemical space." The Journal of Physical Chemistry Letters 6 (2015): 2326-31.
[67] Geim, A. K., and I. V. Grigorieva. 2013. "Van der Waals heterostructures." Nature 499 (July 24 2013): 419-425. doi: 10.1038/nature12385.
[68] Bassman, L., P. Rajak, R. K. Kalia, A. Nakano, F. Sha, J. Sun, D. J. Singh, M. Aykol, P. Huck, K. Persson, and P. Vashishta. "Active learning for accelerated design of layered materials." npj Computational Materials 4 (2018): 74.
[69] Hong, S., A. Krishnamoorthy, P. Rajak, S. Tiwari, M. Misawa, F. Shimojo, R. K. Kalia, A. Nakano, and P. Vashishta. "Computational synthesis of MoS2 layers by reactive molecular dynamics simulations: initial sulfidation of MoO3 surfaces." Nano Letters 17 (2017): 4866-72.
[70] Kearnes, Steven, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. "Molecular graph convolutions: moving beyond fingerprints." Journal of Computer-Aided Molecular Design 30 (2016): 595-608.
[71] Nomura, K., R. K. Kalia, A. Nakano, and P. Vashishta. "A scalable parallel algorithm for large-scale reactive force-field molecular dynamics simulations." Computer Physics Communications 178 (2008): 73-87.
[72] Nomura, K., P. E. Small, R. K. Kalia, A. Nakano, and P. Vashishta. "An extended Lagrangian scheme for charge equilibration in reactive molecular dynamics simulations." Computer Physics Communications 192 (2015): 91-96.
[73] Scarselli, F., M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. "The graph neural network model." IEEE Transactions on Neural Networks 20 (2009): 61-80.
[74] Senftle, T. P., S. Hong, M. M. Islam, S. B. Kylasa, Y. Zheng, Y. K. Shin, C. Junkermeier, et al. "The ReaxFF reactive force-field: development, applications and future directions." npj Computational Materials 2 (2016): 15011.
[75] Tamayo-Mendoza, Teresa, Christoph Kreisbeck, Roland Lindh, and Alán Aspuru-Guzik. "Automatic differentiation in quantum chemistry with applications to fully variational Hartree–Fock." ACS Central Science 4 (2018): 559-66.
[76] Wei, Jennifer N., David Duvenaud, and Alan Aspuru-Guzik. "Neural networks for the prediction of organic chemistry reactions." ACS Central Science 2 (2016): 725-32.
[77] Xie, T., and J. C. Grossman. "Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties." Physical Review Letters 120 (2018): 145301.
[78] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. "Gated graph sequence neural networks." International Conference on Learning Representations (2016).
[79] Chen, J., Ma, T. and Xiao, C., 2018. "FastGCN: fast learning with graph convolutional networks via importance sampling." arXiv preprint arXiv:1801.10247.
[80] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y., 2014. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078.
[81] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M. and Kudlur, M., "TensorFlow: a system for large-scale machine learning." OSDI 16 (2016): 265-83.
[82] Kingma, D. P. and Ba, J., 2014. "Adam: a method for stochastic optimization." arXiv preprint arXiv:1412.6980.
[83] Q. M. Zhang, V. Bharti, and X. Zhao, "Giant electrostriction and relaxor ferroelectric behavior in electron-irradiated poly(vinylidene fluoride-trifluoroethylene) copolymer," Science, vol. 280, no. 5372, pp. 2101-2104, Jun 26 1998, doi: 10.1126/science.280.5372.2101.
[84] B. J. Chu et al., "A dielectric polymer with high electric energy density and fast discharge speed," Science, vol. 313, no. 5785, pp. 334-336, Jul 21 2006, doi: 10.1126/science.1127798.
[85] V. Sharma et al., "Rational design of all organic polymer dielectrics," Nat Commun, vol. 5, p. 4845, Sep 2014, doi: 10.1038/ncomms5845.
[86] A. Jain, K. A. Persson, and G. Ceder, "Research update: the materials genome initiative: data sharing and the impact of collaborative ab initio databases," APL Mater, vol. 4, no. 5, May 2016, doi: 10.1063/1.4944683.
[87] C. Kim, A. Chandrasekaran, T. D. Huan, D. Das, and R. Ramprasad, "Polymer genome: a data-powered polymer informatics platform for property predictions," J Phys Chem C, vol. 122, no. 31, pp. 17575-17585, Aug 9 2018, doi: 10.1021/acs.jpcc.8b02913.
[88] K. Andersen, S. Latini, and K. S. Thygesen, "Dielectric genome of van der Waals heterostructures," Nano Lett, vol. 15, no. 7, pp. 4616-4621, Jul 2015, doi: 10.1021/acs.nanolett.5b01251.
[89] P. Umari and A. Pasquarello, "Ab initio molecular dynamics in a finite homogeneous electric field," Phys Rev Lett, vol. 89, no. 15, p. 157602, Oct 7 2002, doi: 10.1103/PhysRevLett.89.157602.
[90] I. Souza, J. Iniguez, and D. Vanderbilt, "First-principles approach to insulators in finite electric fields," Phys Rev Lett, vol. 89, no. 11, p. 117602, Sep 9 2002, doi: 10.1103/PhysRevLett.89.117602.
[91] S. Fukushima et al., "Effects of chemical defects on anisotropic dielectric response of polyethylene," AIP Adv, vol. 9, no. 4, p. 045022, Apr 2019, doi: 10.1063/1.5093566.
[92] S. Naserifar, D. J. Brooks, W. A. Goddard, and V. Cvicek, "Polarizable charge equilibration model for predicting accurate electrostatic interactions in molecules and solids," J Chem Phys, vol. 146, no. 12, p. 124117, Mar 28 2017, doi: 10.1063/1.4978891.
[93] K. Liu et al., "Shift-collapse acceleration of generalized polarizable reactive molecular dynamics for machine learning-assisted computational synthesis of layered materials," Proc ScalA18, pp. 41-48, IEEE, Nov 12 2018, doi: 10.1109/ScalA.2018.00009.
[94] Y. Li et al., "Scalable reactive molecular dynamics simulations for computational synthesis," Comput Sci Eng, vol. 21, no. 5, pp. 64-75, Sep/Oct 2019, doi: 10.1109/MCSE.2018.110150043.
[95] A. Mishra et al., "Computational framework for polymer synthesis to study dielectric properties using polarizable reactive molecular dynamics," ACS Central Sci, submitted, 2020.
[96] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh, "Machine learning for molecular and materials science," Nature, vol. 559, no. 7715, pp. 547-555, Jul 26 2018, doi: 10.1038/s41586-018-0337-2.
[97] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 28 2015, doi: 10.1038/nature14539.
[98] D. Duvenaud et al., "Convolutional networks on graphs for learning molecular fingerprints," Proc NeurIPS 2015, vol. 28, Dec 2015.
[99] K. Liu, K. Nomura, P. Rajak, R. K. Kalia, A. Nakano, and P. Vashishta, "Graph neural network analysis of layered material phases," Proc SpringSim-HPC 2019, SCS, 2019.
[100] M. H. S. Segler, T. Kogej, C. Tyrchan, and M. P. Waller, "Generating focused molecule libraries for drug discovery with recurrent neural networks," ACS Central Sci, vol. 4, no. 1, pp. 120-131, Jan 24 2018, doi: 10.1021/acscentsci.7b00512.
[101] A. C. T. van Duin, S. Dasgupta, F. Lorant, and W. A. Goddard, "ReaxFF: a reactive force field for hydrocarbons," J Phys Chem A, vol. 105, no. 41, pp. 9396-9409, Oct 18 2001, doi: 10.1021/jp004368u.
[102] A. Nakano et al., "De novo ultrascale atomistic simulations on high-end parallel supercomputers," Int J High Performance Comput Appl, vol. 22, no. 1, pp. 113-128, Feb 2008, doi: 10.1177/1094342007085015.
[103] T. P. Senftle et al., "The ReaxFF reactive force-field: development, applications and future directions," npj Comput Mater, vol. 2, p. 15011, Mar 4 2016, doi: 10.1038/npjcompumats.2015.11.
[104] A. Hellweg and D. Rappoport, "Development of new auxiliary basis functions of the Karlsruhe segmented contracted basis sets including diffuse basis functions (def2-SVPD, def2-TZVPPD, and def2-QZVPPD) for RI-MP2 and RI-CC calculations," Phys Chem Chem Phys, vol. 17, no. 2, pp. 1010-1017, 2015, doi: 10.1039/C4CP04286G.
[105] K. Nomura, R. K. Kalia, A. Nakano, P. Rajak, and P. Vashishta, "RXMD: a scalable reactive molecular dynamics simulator for optimized time-to-solution," SoftwareX, vol. 11, p. 100389, Jan-Jun 2020, doi: 10.1016/j.softx.2019.100389.
[106] D. Weininger, "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules," J Chem Info Comp Sci, vol. 28, pp. 31-36, 1988, doi: 10.1021/ci00057a005.
[107] N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison, "Open Babel: an open chemical toolbox," J Cheminformatics, vol. 3, no. 1, p. 33, Oct 7 2011, doi: 10.1186/1758-2946-3-33.
[108] H. Ramchoun, M. A. J. Idrissi, Y. Ghanou, and M. Ettaouil, "Multilayer perceptron: architecture optimization and training," IJIMAI, vol. 4, no. 1, pp. 26-30, Sep 2016, doi: 10.9781/ijimai.2016.415.
[109] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, DC: Spartan, 1962.
[110] N. Talebi, A. M. Nasrabadi, and I. Mohammad-Rezazadeh, "Estimation of effective connectivity using multilayer perceptron artificial neural network," Cogn Neurodynamics, vol. 12, no. 1, pp. 21-42, Feb 2018, doi: 10.1007/s11571-017-9453-1.
[111] P. Hohenberg and W. Kohn, "Inhomogeneous electron gas," Phys Rev, vol. 136, no. 3B, pp. B864-B871, Nov 9 1964, doi: 10.1103/PhysRev.136.B864.
[112] R. M. Martin, Electronic Structure: Basic Theory and Practical Methods. Cambridge, UK: Cambridge University Press, 2008.
[113] F. Shimojo et al., "A divide-conquer-recombine algorithmic paradigm for multiscale materials modeling," J Chem Phys, vol. 140, no. 18, p. 18A529, May 14 2014, doi: 10.1063/1.4869342.
[114] Liu, Y., Palmedo, P., Ye, Q., Berger, B. & Peng, J. Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst. 6, 65–74.e3 (2018).
[115] Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
[116] Jones, D. T. & Kandathil, S. M. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 34, 3308-3315 (2018).
[117] Yang Li, Chengxin Zhang, Eric W Bell, Wei Zheng, Dongjun Yu, Yang Zhang. Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks. submitted, 2020.
[118] David E. Kim, Frank DiMaio, Ray Yu-Ruei Wang, Yifan Song, and David Baker. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins: Structure, Function, and Bioinformatics, 82:208-218, 2014.
[119] Zhao, Y., and Karypis, G. 2003. Prediction of contact maps using support vector machines. BIBE 2003, Bethesda, MD. IEEE Computer Society, pp. 26-36.
[120] Jones DT and Kandathil SM (2018). High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 34(19): 3308-3315.
[121] Bepler, Tristan, and Bonnie Berger. "Learning protein sequence embeddings using information from structure." arXiv preprint arXiv:1902.08661 (2019).
[122] Y. Xu, D. Xu, and J. Liang, Computational Methods for Protein Structure Prediction and Modeling. Springer, 2007.
[123] J. Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature, vol. 596, no. 7873, pp. 583-589, Aug 26 2021, doi: 10.1038/s41586-021-03819-2.
[124] R. Zamora-Resendiz and S. Crivelli, "Structural learning of proteins using graph convolutional neural networks," bioRxiv, 10.1101/610444, Apr 16 2019, doi: 10.1101/610444.
[125] T. Corcoran, R. Zamora-Resendiz, X. Liu, and S. Crivelli, "A spatial mapping algorithm with applications in deep learning-based structure classification," arXiv, 1802.02532v2, Feb 22 2018, doi: 10.48550/arXiv.1802.02532.
[126] X. Yuan and C. Bystroff, "Protein contact map prediction," in Protein Structure Prediction and Modeling, vol. 1, Y. Xu, D. Xu, and J. Liang Eds., 2007, ch. 8.
[127] H. Berman, K. Henrick, and H. Nakamura, "Announcing the worldwide Protein Data Bank," Nat Struct Biol, vol. 10, no. 12, p. 980, Dec 2003, doi: 10.1038/nsb1203-980.
[128] Y. Liu, P. Palmedo, Q. Ye, B. Berger, and J. Peng, "Enhancing evolutionary couplings with deep convolutional neural networks," Cell Syst, vol. 6, no. 1, pp. 65-74, Jan 24 2018, doi: 10.1016/j.cels.2017.11.014.
[129] S. Wang, S. Q. Sun, Z. Li, R. Y. Zhang, and J. B. Xu, "Accurate de novo prediction of protein contact map by ultra-deep learning model," PLoS Comput Biol, vol. 13, no. 1, e1005324, Jan 2017, doi: 10.1371/journal.pcbi.1005324.
[130] D. T. Jones and S. M. Kandathil, "High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features," Bioinformatics, vol. 34, no. 19, pp. 3308-3315, Oct 1 2018, doi: 10.1093/bioinformatics/bty341.
[131] Y. Li et al., "Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks," PLoS Comput Biol, vol. 17, no. 3, e1008865, Mar 2021, doi: 10.1371/journal.pcbi.1008865.
[132] Y. Zhao and G. Karypis, "Prediction of contact maps using support vector machines," Int J Artif Intell T, vol. 14, no. 5, pp. 849-865, Oct 2005, doi: 10.1142/S0218213005002429.
[133] T. Bepler and B. Berger, "Learning protein sequence embeddings using information from structure," Proceedings of International Conference on Learning Representations, ICLR, 1078, 2019.
[134] B. Kuhlman and P. Bradley, "Advances in protein structure prediction and design," Nat Rev Mol Cell Bio, vol. 20, no. 11, pp. 681-697, Nov 2019, doi: 10.1038/s41580-019-0163-x.
[135] D. Mercadante, F. Grater, and C. Daday, "CONAN: a tool to decode dynamical information from molecular interaction maps," Biophys J, vol. 114, no. 6, pp. 1267-1273, Mar 27 2018, doi: 10.1016/j.bpj.2018.01.033.
[136] D. E. Kim, F. DiMaio, R. Y. R. Wang, Y. F. Song, and D. Baker, "One contact for every twelve residues allows robust and accurate topology-level protein structure modeling," Proteins, vol. 82, pp. 208-218, Feb 2014, doi: 10.1002/prot.24374.
[137] K. S. Tai, R. Socher, and C. Manning, "Improved semantic representations from tree-structured long short-term memory networks," Proc ACL, pp. 1556-1566, 2015, doi: 10.3115/v1/P15-1150.
[138] S. Ogata, E. Lidorikis, F. Shimojo, A. Nakano, P. Vashishta, and R. K. Kalia, "Hybrid finite-element/molecular-dynamics/electronic-density-functional approach to materials simulations on parallel computers," Computer Physics Communications, vol. 138, no. 2, pp. 143-154, Aug 1 2001, doi: 10.1016/S0010-4655(01)00203-X.
[139] A. Warshel, "Multiscale Modeling of Biological Functions: From Enzymes to Molecular Machines (Nobel Lecture)," Angew Chem, vol. 53, no. 38, pp. 10020-10031, Sep 15 2014, doi: 10.1002/anie.201403689.
[140] M. AlQuraishi, "ProteinNet: a standardized data set for machine learning of protein structure," BMC Bioinformatics, vol. 20, 311, Jun 11 2019, doi: 10.1186/s12859-019-2932-0.
[141] K. Liu et al., "Shift-collapse acceleration of generalized polarizable reactive molecular dynamics for machine learning-assisted computational synthesis of layered materials," Proceedings of Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA18, pp. 41-48, IEEE, Nov 12 2018, doi: 10.1109/ScalA.2018.00009.
[142] A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," arXiv, 1802.05799v3, Feb 21 2018, doi: 10.48550/arXiv.1802.05799.
[143] T. Kurth, M. Smorkalov, P. Mendygral, S. Sridharan, and A. Mathuriya, "TensorFlow at scale: performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML," Concurrency, vol. 31, e4989, Oct 21 2019, doi: 10.1002/cpe.4989.
[144] P. Balaprakash, M. Salim, T. D. Uram, V. Vishwanath, and S. M. Wild, "DeepHyper: asynchronous hyperparameter search for deep neural networks," Proc HiPC, IEEE, Dec 2018, doi: 10.1109/HiPC.2018.00014.
[145] M. A. Salim, T. D. Uram, J. T. Childers, P. Balaprakash, V. Vishwanath, and M. E. Papka, "Balsam: automated scheduling and execution of dynamic, data-intensive HPC workflows," arXiv, 1909.08704v1, Sep 18 2019, doi: 10.48550/arXiv.1909.08704.