SHARPNESS ANALYSIS OF NEURAL NETWORKS FOR PHYSICS SIMULATIONS

by

Hikaru Ibayashi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2023

Copyright 2023 Hikaru Ibayashi

Acknowledgements

I would like to thank the many people who have made this dissertation possible. First and foremost, I thank my adviser, Aiichiro Nakano, who has supported and helped me throughout the six years of my Ph.D. life at USC. Without his support and help, I could not have completed my work. I especially appreciate his advice on fostering professional skills as a scientist, such as how to overcome research challenges and how to present my ideas in a compelling manner. I also thank my committee members, Yan Liu, Paulo Branicio, Emilio Ferrara, and Satish Thittamaranahalli Ka, for valuable discussions and comments on my work. I am also grateful to Masaaki Imaizumi, who taught me how to crystallize ideas into a rigorous mathematical framework. I also appreciate the help of Ken-ichi Nomura, who was always happy to discuss research ideas with me even when my ideas were not solidified. I have been lucky to have many friends to talk with about research and various other things: Chen-yu Wei, Basileal Imana, Sulagna Mukherjee, He Jiang, Mengxiao Zhang, Chung-Wei Lee, and Tiancheng Jin. I also deeply appreciate the support of my family and my partner, Theresa.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Training Algorithms for Neural Networks
  1.2 Sharpness of Neural Networks
  1.3 Sharpness Regularization for High-Fidelity Physics Simulation
    1.3.1 Fluid Simulation
    1.3.2 Neural-Network Quantum Molecular Dynamics
  1.4 Thesis Statement
  1.5 Main Results
    1.5.1 Minimum Sharpness: Scale-invariant Parameter-robustness of Neural Networks
    1.5.2 SGD Escapes from Sharp Minima Exponentially Fast in Non-stationary Regime
    1.5.3 Splash in a Flash: Sharpness-aware Minimization for Efficient Liquid Splash Simulation
    1.5.4 Allegro-Legato: Scalable, Fast and Robust Neural-Network Quantum Molecular Dynamics via Sharpness-Aware Minimization
  1.6 Organization

Chapter 2: Minimum sharpness: Scale-invariant parameter-robustness of neural networks
  2.1 Introduction
  2.2 Related Works
  2.3 Methodology
    2.3.1 Preliminary
    2.3.2 Proposed Method: Minimum Sharpness
  2.4 Experiments
    2.4.1 Accuracy and Efficiency of Tr[H] Calculation
    2.4.2 Comparison with Previous Sharpness

Chapter 3: SGD escapes from sharp minima exponentially fast in non-stationary regime
  3.1 Introduction
  3.2 Problem Formulation
    3.2.1 Stochastic Gradient Descents
    3.2.2 Mean Exit Time
    3.2.3 Setting and Basic Assumptions for SGD's Escape Problem
  3.3 Large Deviation Theory for SGD
  3.4 Mean Exit Time Analysis for SGD
    3.4.1 Approximate Computation of Quasi-potential
      3.4.1.1 Proximal System with $C(\theta) = I$
      3.4.1.2 Approximation of Quasi-potential $V_0 := \min_{\theta \in \partial D} V(\theta)$
    3.4.2 Main Results: Mean Exit Time Analysis
  3.5 Numerical Validation
  3.6 Comparison with Existing Escape Analyses
  3.7 Related Work
  3.8 Conclusion

Chapter 4: Splash in a Flash: Sharpness-aware minimization for efficient liquid splash simulation
  4.1 Introduction
  4.2 Sharpness-Aware Minimization
    4.2.1 Derivation of SAM
  4.3 MLFLIP with SAM
  4.4 Conclusion

Chapter 5: Allegro-Legato: Scalable, Fast, and Robust Neural-Network Quantum Molecular Dynamics via Sharpness-Aware Minimization
  5.1 Introduction
  5.2 Method Innovation
    5.2.1 Neural-Network Quantum Molecular Dynamics with Allegro
    5.2.2 Fidelity-scalability and Adversarial Robustness
    5.2.3 Key Innovation: Allegro-Legato: SAM-Enhanced Allegro
    5.2.4 RXMD-NN: Scalable Parallel Implementation of Allegro-Legato NNQMD
  5.3 Results
    5.3.1 Experimental Platform
    5.3.2 Fidelity-Scaling Results
    5.3.3 Computational-Scaling Results
  5.4 Discussion
    5.4.1 Simulation Time
    5.4.2 Training Time
    5.4.3 Model Accuracy
    5.4.4 Implicit Sharpness Regularization in Allegro
    5.4.5 Training Details
  5.5 Applications
  5.6 Related Work
  5.7 Conclusion

Chapter 6: Conclusions

Bibliography

Appendices A
  A.1 Proof of Proposition 1
  A.2 Proof of Proposition 2
  A.3 Proof of Theorem 1
  A.4 Experimental Setup for Comparison
  A.5 Experiments using LeNet
  A.6 Experimental Setup for Accuracy and Efficiency Calculation of Tr[H]
  A.7 Approximation for DIAG[H]
  A.8 Exactness and Efficiency Calculation for DIAG[H]

Appendices B
  B.1 Assumptions for [38, Theorem 5.7.11 (a)]
  B.2 Formal Properties of Steepness and Quasi-potential
    B.2.1 Fundamental Lemmas of Steepness
    B.2.2 Preliminary Lemmas of Quasi-potential
  B.3 Proof of Theorem 2
    B.3.1 Main Proof
  B.4 Hamilton-Jacobi Equation for Quasi-potential
  B.5 Proof of Lemma 2

Appendices C: Sharpness-Aware Minimization for Robust Molecular Dynamics Simulations
  C.1 Introduction
  C.2 Neural Network Quantum Molecular Dynamics
  C.3 Experiments
  C.4 Conclusion

Appendices D: Simulating liquids on dynamically warping grids
  D.1 Introduction
  D.2 Related Work
    D.2.1 Arbitrary Lagrangian-Eulerian method
    D.2.2 Unstructured meshes
    D.2.3 Structured grids
    D.2.4 Image warping
    D.2.5 Hybrid approaches
    D.2.6 Mesh-free methods
    D.2.7 Polynomial and spectral adaptivity
  D.3 Method Overview and Definitions
  D.4 Deformation Solver
    D.4.1 Defining the problem
    D.4.2 Sizing Function
    D.4.3 Temporal Deformation Penalty
    D.4.4 Speeding-up the Solver
  D.5 Advection
  D.6 Pressure Solver on Warped Grids
    D.6.1 Kinetic Energy Minimization
    D.6.2 Accurate Boundary Conditions
      D.6.2.1 Generalized Ghost Fluid
      D.6.2.2 Ghost Fluid Applied to our Discretization
  D.7 Visualization
  D.8 Results
    D.8.1 Numerical Verification
    D.8.2 Timings
    D.8.3 Comparisons
  D.9 Discussion
    D.9.1 Momentum Preservation
  D.10 Computation of $M_\phi$
  D.11 Numerical Verification of Pressure Solver
  D.12 Conclusion

List of Tables

3.1 Results of the derived exit time. We only show their order, ignoring constants. $\Xi = A(1+C_0 rk)\lambda_{\min}^{-1} - \lambda_{\max}^{-1}$ is the approximation error, $\bar{\lambda}$ is some value in $[\lambda_{\min}, \lambda_{\max}]$ defined in [203], and $\alpha, \delta$ are parameters related to the tail probability [145].

3.2 Technical differences among analyses. The specific meanings of each column are described in the main passages of Section 3.6.

4.1 Performance comparison of ML-based FLIP (MLFLIP) [191] without and with SAM optimization. Our training scheme drastically improves convergence speed both in number of epochs and in wall-clock time while it simulates plausible liquid splash effects (Fig. 4.1, left).

5.1 SAM strength ρ vs. time-to-failure $t_{\mathrm{failure}}$: We tune ρ by conducting a grid search in the range of 0.001 to 0.05. A model with ρ = 0.005 gives the largest $t_{\mathrm{failure}}$ with a small-scale simulation (N = 432).

5.2 Simulation-time comparison: As SAM only applies to the training stage and does not modify the size of the architecture, the computational cost for simulation is not affected.

5.3 Training-time comparison: Although SAM takes longer per-epoch training time, it converges faster and thus does not significantly affect total training time. Compared to the reference training times of variations of Allegro models, the extra training cost of Allegro-Legato is negligible.

5.4 Implicit sharpness regularization by Allegro: While our Allegro-Legato model has smaller sharpness than Allegro, Allegro models with larger ℓ have progressively smaller sharpness. Here, we measure sharpness, $\max_{\|\epsilon\|_2 \le \rho}\{L(w+\epsilon) - L(w)\}$, by taking the maximum of 1,000 independent random samples around the 0.05-neighborhood of each minimum.

5.5 Detailed training setting: All training setups in this work adopt these parameters unless otherwise noted.

C.1 Out-of-sample errors of the baseline (without SAM) and with SAM (eV/atom).

D.1 Timings of our examples.

D.2 Timings of Figure D.8.

List of Figures

2.1 Comparison between baseline and proposal for calculating Tr[H]. The left table reports averaged computational time, and the right figure shows the calculated trace.

2.2 Numerical experiments showing that our sharpness correlates with the generalization gap. We trained FCNNs on MNIST with partially randomized labels. The ratios are shown above.

3.1 Sharpness dynamics throughout training via SGD (left), where we used the sharpness defined in [100, Metric 2.1]. As shown, sharpness fluctuates during training and becomes small toward the end of training. This means SGD jumps out of sharp minima to find flat and generalizable minima (right). We used VGG [173] fed with CIFAR-10 [110] with cross-entropy loss.

3.2 Visual illustration of steepness (Definition 5). The steepness of φ, $S_T(\varphi)$, is greater than $S_T(\psi)$ because φ moves against the vector field of the gradient $-\nabla L(\theta)$.

3.3 Notions related to the fundamental theorem (Theorem 2): a minimum $\theta^*$, its neighborhood D as well as the boundary ∂D, and $V_0$, which is defined as the smallest quasi-potential on ∂D.

3.4 The whole structure of our results.

3.5 Numerical validation of Theorem 4, where the mean exit time shows exponential dependency on $\lambda_{\max}^{-1}$, $\eta^{-1}$, B, and ΔL. The error bars indicate the standard deviation.

3.6 Reference experiment using the proxy system (3.4). Different from the result of SGD, the exit time has no dependency on $\lambda_{\max}$ while it has exponential dependency on η and ΔL similarly to SGD. This result is aligned with Proposition 3.

3.7 A numerical experiment as in Section 3.5, but with discrete SGD (3.2) rather than the original SGD (3.1). Although they are defined differently, they show similar trends on each parameter. This numerical result shows that (3.2) is a reasonable model for (3.1) for exit time analysis.

4.1 Simulation results with the same number of training epochs. Our training scheme with sharpness-aware minimization quickly learns physical fluid dynamics (left) while the baseline result (right) with the naive training scheme shows numerous unphysical liquid splashes.

4.2 Neural network layout used in this study. The detachment classifier determines whether a splash has been formed or not; for the inputs classified as splash, the velocity modification component of this network predicts the mean and the variance of the probability distribution for changes in the velocity.

5.1 Number of outliers in atomic force inference during NNQMD simulation: As the simulation progresses, the dynamics of atoms become unstable due to an increasing number of unphysically large force values (over 5σ) predicted by the original Allegro model. This resulted in eventual failure after $2.6\times 10^6$ MD steps (red). On the other hand, the proposed model (Allegro-Legato) maintains a nearly constant number of outliers and keeps the simulation stable (blue).

5.2 Fidelity scaling of NNQMD simulation: Here, $t_{\mathrm{failure}}$ is measured using the NVE ensemble with a timestep of 2 fs. Statistically improved $t_{\mathrm{failure}}$ is observed in even the smallest system size, which is further pronounced as the system size increases. The exponent of the power-law fitting shows nearly a factor-of-two reduction using the Allegro-Legato model.

5.3 Wall-clock time of the RXMD-NN code per MD step, with scaled workloads: 6,912P-atom ammonia liquid using P A100 GPUs (P = 1, ..., 1,920).

5.4 GPU acceleration of the NNQMD algorithm: Three system sizes of N = 1,728, 6,912, and 13,824 atoms are examined. The histogram presents the reduction in wall-clock time per MD step over the runtime with 32 CPU cores without GPU as reference. Details of the benchmark platform as well as the GPU and CPU architectures are presented in the main text. We have achieved 7.6× speedup using four GPUs with N = 13,824 atoms.

5.5 Loss surface visualization: One-dimensional visualization of the loss surface of each model. Following the definition of sharpness (Eq. 4.1), we randomly sample a vector, d, that gives the sharpness direction to compute $L(w+pd)$ for $p \in [-1, 1]$.

5.6 Computed vibrational spectra of ammonia: (a) While typical first-principles simulation treats atoms classically and electrons quantum-mechanically, PIMD simulation uses multiple replicas of each atom to mimic the nuclear quantum effect (NQE). (b) The top curve shows the vibrational spectrum computed at zero temperature without NQE, while the bottom is at finite temperature with Allegro-Legato PIMD simulation. With the inclusion of NQE, Allegro-Legato PIMD correctly shows softening of high-energy inter-molecular modes expected at finite temperature and explains high-end neutron-scattering observations.

A.1 Evaluation for approximating DIAG[H], the diagonal elements of Hessian matrices. The left and right figures are results of the arXiv version [190] and the ICML version [189], respectively.

A.2 Experimental results comparing minimum and normalized sharpness using LeNet.

B.1 Domains and trajectory appearing in the proof of the upper bound.

B.2 Domains and trajectory appearing in the proof of the lower bound.

C.1 Snapshots of crystalline silicon carbide (a) and silicon dioxide glass (b), where spheres represent carbon (cyan), silicon (blue), and oxygen (red) atom positions and cylinders indicate chemical bonds. (c)-(e) Pair distribution functions for Si-Si, Si-C, and C-C pairs at 300 K (black) and 900 K (red), respectively.

D.1 Liquid leaking out of a container with holes. Our dynamically warping grids adaptively capture detailed motions without the computational expense of unstructured grids. The left image depicts the surface mesh and the corresponding FLIP particles color-coded by grid cell size, while the right image shows a cross-section. The simulation used $128^3$ cells and ran with 25 seconds per time step and 1.8 minutes per video frame.

D.2 Merging bunny: the top image shows the spatial (physical) coordinate and the bottom the corresponding reference coordinate at the same timings, illustrating the magnification of local features. 128 × 64 × 128 resolution, 10 seconds per time step, and 29 seconds per video frame.

D.3 Combing a liquid: the left side of the zoomed image highlights our warped grids adapting near the solid boundaries, making accurate interaction with the cylinders possible. 128 × 64 × 128 resolution, 12 seconds per time step, and 30 seconds per video frame.

D.4 Timings of our examples.

D.5 Breaking dam creating splashes and thin sheets by our dynamically warping grids adapting to the surfaces: 128 × 64 × 128 resolution, 12 seconds per time step, and 36 seconds per video frame.

D.6 Expanding ripples with high adaptivity. Left: our second-order accurate free-surface boundary conditions capturing the subtle motion of expanding ripples in a nearly open-space static pool. 128 × 64 × 128 resolution, 9 seconds per time step, 15 seconds per video frame, and a maximal volume ratio of 200. Right: first-order accurate boundary conditions with the same setup exhibit grid-aligned artifacts. This example employs a level-set surface tracker to demonstrate the accuracy of our method, and the same scenario using FLIP is provided in the supplemental video.

D.11 Error plots of our Taylor-Green vortex velocity experiment. Pressure error (top), warped grid from our sizing function (bottom left), and the log-log graph of the total error with respect to grid resolutions (bottom right).

D.7 Numerical verification of our second-order accuracy for a Poisson problem: we solve $\nabla^2 p(x) = \delta(x)$ with p = 0 boundary conditions on the lines of solid circles. The colored contour plots show the magnitude of our actual numerical solution. The top table refers to the experiment with the regular setting (top left) and the bottom to our warped setting (top right). Both numerical experiments show second-order convergence.

D.8 Comparison of a drop falling into a pool using three different solvers. Our method successfully captures the thin sheets of a crown splash at $128^3$ resolution, while the MAC method fails to simulate such detail without doubling the resolution. The adaptive method of Ando et al. [5] successfully recovers small-scale details but leads to significantly longer runtimes. Timing details are given in Table D.2.

D.9 Pressure surface plot with a single strong velocity initiated at the center pointing upward. Left: a single-point Gaussian quadrature integration rule applied to integrate Eq. (D.13) exhibits oscillation due to null-space issues. Right: our eight-point (four-point in 2D) Gaussian integration rule removes the artifacts.

D.10 Comparison with the MAC method. The left side shows our method, the right side the MAC solver at the same resolution.

Abstract

Deep learning has attracted significant attention in recent years due to its remarkable achievements in various applications. However, building effective deep neural networks requires making crucial design choices such as the network architecture, regularization, optimization, and hyperparameter tuning. In this dissertation, we focus on the concept of "sharpness" of neural networks, which refers to neural networks' sensitivity against perturbations of weight parameters. We argue that sharpness is not only a theoretical notion but also has practical use cases that can lead to better generalization and robustness of neural models. A major theoretical challenge in defining and measuring sharpness is its scale-sensitivity, i.e., the fact that sharpness can change under the scale transformations of neural networks.
In this thesis, we propose a novel definition of sharpness that overcomes this limitation, with provable scale-invariance and extensive empirical validation. By analyzing the relationship between sharpness and model performance, we show how our definition provides a more objective and accurate characterization of sharpness.

Another open question in sharpness analysis is how training algorithms for machine learning models regularize sharpness. In this dissertation, we answer this question by showing that existing training algorithms regularize sharpness through what can be called "escaping" behavior, where the optimization process avoids sharp regions in the parameter space. This new explanation demystifies the connection between sharpness and training algorithms, paving the way for more effective and principled approaches to machine learning.

Finally, we demonstrate the practical benefits of sharpness regularization for physics simulations. We show that neural networks with small sharpness achieve high-fidelity fluid simulation and molecular dynamics. These findings carry the significant implication that sharpness is not just a mathematical notion but also a practical tool to build physics-informed neural networks.

Chapter 1
Introduction

In recent years, the successes of deep learning have been a major driving force of machine learning development [118]. Owing to its strong generalization capability, deep learning has expanded into a wide range of domains, such as computer vision [111], speech recognition [137], and natural language processing [32]. The high performance of deep learning is underpinned by gradient-based learning algorithms, including stochastic gradient descent (SGD) and its variations [103, 166]. However, at the same time, those unprecedented successes raise a question: why can such complex neural networks be effectively trained?

1.1 Training Algorithms for Neural Networks

Although the optimization problems of neural networks were thought to be difficult to solve [13], simple algorithms, such as stochastic gradient descent (SGD), can find nearly optimal solutions empirically, and further, the obtained solutions generalize well [100, 19]. Analyzing the role of optimization in deep learning is an area of research that is currently attracting strong interest [135, 92].

SGD was first proposed in [163] as a lazy version of gradient descent using random subsets of training data. Thus, SGD was intended as a convenient heuristic rather than a refined algorithm. However, beyond its computational convenience, SGD works as effectively as gradient descent in many optimization problems, and its convergence properties have been established for convex objective functions [15]. The recent success in the field of neural networks is particularly remarkable because SGD has been shown to perform well on various non-convex functions as well. SGD-based training algorithms have achieved state-of-the-art results one after another, e.g., Adagrad [45], Adam [103], and many others [166]. Although there is a study showing that SGD is not a unique approach for generalization [68], SGD remains one of the standard methods because of its computational convenience.

1.2 Sharpness of Neural Networks

The shape of loss surfaces has long been a topic of interest in neural networks.
The claim that the flatness of loss surfaces around local minima improves generalization was first studied by [83, 82], and the observation has recently been reconfirmed in deep neural networks by [100]. Although it is difficult to formally define sharpness [41], many empirical studies have shown its connection to generalization, with various definitions of sharpness studied, such as the volume around a minimum [90], the maximum eigenvalue of the Hessian matrix [72, Remark 2.2], and the highest loss value around a minimum [98]. Our analysis adopts the maximum eigenvalue of the Hessian matrix (Definition 4), as [72] did.

As other investigations of loss surface geometry, [80] discussed the asymmetry of loss surfaces, [44, 67] studied how multiple local minima are internally connected, and [120] developed a random dimensional-reduction method to visualize loss surfaces in low dimensions.

1.3 Sharpness Regularization for High-Fidelity Physics Simulation

As the practical outcome of my sharpness analysis, I demonstrate the effectiveness of utilizing sharpness regularization for achieving high-fidelity physics simulations. Specifically, in fluid simulation we show that sharpness regularization prevents unphysical water splashes, and in molecular dynamics simulation, sharpness regularization avoids unphysical atomic forces. This approach results in more accurate, high-fidelity simulations of physical systems.

1.3.1 Fluid Simulation

Fluid simulation is a traditional research field in physics, which started with Harlow and Welch's pioneering work that brought the Navier-Stokes (NS) equations to the field of numerical computation [78]:
\[
\rho \frac{du}{dt} = -\nabla p + \nabla \cdot s + \frac{\rho}{m} f_{\mathrm{ext}}, \tag{1.1}
\]
where ρ denotes fluid density, t time, u velocity, p pressure, s the viscous stress tensor, m mass, and $f_{\mathrm{ext}}$ external force. Nowadays, fluid simulation has become an essential tool in various fields. For example, air turbulence around airplane wings and water flow around ship hulls are representative application targets of traditional fluid simulation.

As deep neural networks have become increasingly prevalent, there has been a growing trend toward adopting machine learning techniques in the field of fluid simulation [115, 192]. While these approaches offer significant performance gains, they often sacrifice some degree of physical accuracy in the process. The main theme of machine learning based fluid simulation is therefore to utilize neural networks without sacrificing fidelity.

1.3.2 Neural-Network Quantum Molecular Dynamics

Molecular dynamics (MD) simulation follows the time evolution of the positions $\{r_i \mid i = 1, \ldots, N\}$ (i.e., trajectories) of N atoms,
\[
m_i \frac{d^2}{dt^2} r_i = f_i = -\frac{\partial}{\partial r_i} E(\{r_i\}), \tag{1.2}
\]
where $m_i$ and $f_i$ are the mass of the i-th atom and the force acting on it, whereas E is the interatomic potential energy that is dictated by quantum mechanics (QM). In NNQMD, neural networks are trained to reproduce ground-truth QM values, $E(\{r_i\}_t)$, for a set of atomic configurations $\{\{r_i\}_t \mid t = 1, \ldots, N_{\mathrm{training}}\}$ ($N_{\mathrm{training}}$ is the number of training configurations) [12, 109, 124, 95, 146]. Similarly to fluid simulation, one of the biggest themes of NNQMD is to achieve high-fidelity simulations without sacrificing performance.
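As a concrete illustration of how Eq. (1.2) is advanced in practice, the sketch below integrates an MD trajectory with the standard velocity-Verlet scheme. It is not code from this dissertation; `force_fn` is a placeholder for any force model, whether a classical potential or a trained NNQMD model returning $-\partial E/\partial r_i$.

```python
import numpy as np

def velocity_verlet(r, v, m, force_fn, dt, n_steps):
    """Integrate m_i d^2 r_i / dt^2 = f_i with the velocity-Verlet scheme.

    r: (N, 3) positions, v: (N, 3) velocities, m: (N,) masses,
    force_fn: callable returning (N, 3) forces for given positions.
    """
    f = force_fn(r)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / m[:, None]   # half-step velocity update
        r = r + dt * v_half                      # full-step position update
        f = force_fn(r)                          # forces at the new positions
        v = v_half + 0.5 * dt * f / m[:, None]   # complete the velocity update
    return r, v
```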
1.4 Thesis Statement

My thesis statement is as follows: Regularizing the sharpness of neural networks provably leads to higher generalization performance, and in particular enhances the efficiency, scalability, and robustness of physics simulation with neural networks.

To support this thesis, I present two formal results that verify the effectiveness of sharpness: minimum sharpness (a novel definition of sharpness with scale-invariance) and exponential escape efficiency (a framework to understand how sharpness is regularized), followed by two practical applications to physics simulation: efficient fluid simulation with sharpness regularization, and scalable, robust molecular dynamics simulation with sharpness regularization.

1.5 Main Results

1.5.1 Minimum Sharpness: Scale-invariant Parameter-robustness of Neural Networks

In Chapter 2, we propose a novel sharpness measure, Minimum Sharpness. It is known that NNs have a specific scale transformation that constitutes equivalence classes whose functional properties are completely identical, while their sharpness can change without bound. We define our sharpness through a minimization problem over the equivalent NNs, making it invariant to the scale transformation. We also develop an efficient and exact technique to make the sharpness tractable, which reduces the heavy computational costs involved with the Hessian. In the experiments, we observe that our sharpness has a valid correlation with the generalization of NNs and runs with less computational cost than existing sharpness measures.

1.5.2 SGD Escapes from Sharp Minima Exponentially Fast in Non-stationary Regime

We show that stochastic gradient descent (SGD) escapes from sharp minima exponentially fast even before SGD reaches a stationary distribution. SGD has been a de facto standard training algorithm for various machine learning tasks. However, there still exists an open question as to why SGD finds highly generalizable parameters of non-convex target functions, such as the loss functions of neural networks. "Escaping" analysis has been an appealing notion to tackle this question: it measures how quickly SGD escapes from sharp minima, which are likely to have low generalization ability. Despite its importance, the notion has the limitation that it works only when SGD reaches a stationary distribution after sufficient updates. In this work, we develop a new theory to investigate SGD's escaping by introducing Large Deviation Theory. Modeling SGD as a stochastic differential equation (SDE) with Gaussian noise, we prove that SGD escapes from sharp minima exponentially fast even in a non-stationary setting, and that this holds not only for continuous SDE-models but also in the discretized setup. A key notion for the result is a quantity called "steepness," which describes SGD's stochastic behavior throughout its training process. Our experiments are consistent with our theory.

1.5.3 Splash in a Flash: Sharpness-aware Minimization for Efficient Liquid Splash Simulation

We present sharpness-aware minimization (SAM) for fluid dynamics, which can efficiently learn the plausible dynamics of liquid splashes. Due to its ability to achieve robust and generalizing solutions, SAM efficiently converges to a parameter set that predicts plausible dynamics of elusive liquid splashes. Our training scheme requires 6× fewer epochs to converge and 4× shorter wall-clock time. Our result shows that the sharpness of the loss function has a close connection to the plausibility of fluid dynamics and suggests further applicability of SAM to machine learning based fluid simulation.
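The SAM procedure referred to above (derived in Chapter 4) perturbs the weights toward the locally sharpest direction before each descent step. The following is a generic sketch of that two-step update in the style of [60], not the exact training code used in this dissertation; `model`, `loss_fn`, `optimizer`, and `rho` are placeholders.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, optimizer, rho=0.05):
    # Step 1: gradient at the current weights w.
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
    # Climb to the (approximately) worst point w + eps in a rho-ball.
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    # Step 2: gradient at the perturbed weights drives the real update.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)        # restore w before the descent step
    optimizer.step()
    return loss.item()
```

The extra forward and backward pass at the perturbed weights is what makes each SAM epoch more expensive, the trade-off discussed later in Table 5.3.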
1.5.4 Allegro-Legato: Scalable, Fast and Robust Neural-Network Quantum Molecular Dynamics via Sharpness-Aware Minimization

Neural-network quantum molecular dynamics (NNQMD) simulations based on machine learning are revolutionizing atomistic simulations of materials by providing quantum-mechanical accuracy at orders-of-magnitude faster speed, as illustrated by the ACM Gordon Bell prize (2020) [95] and finalist (2021) [146]. The state-of-the-art (SOTA) NNQMD model, founded on group theory and featuring rotational equivariance and local descriptors, has provided much higher accuracy and speed than previous models and is thus named Allegro (meaning fast). On massively parallel supercomputers, however, it suffers from a fidelity-scaling problem, where a growing number of unphysical predictions of interatomic forces prohibits simulations involving larger numbers of atoms for longer times. Here, we solve this problem by combining the Allegro model with sharpness-aware minimization (SAM), which enhances the robustness of the model through improved smoothness of the loss landscape. The resulting Allegro-Legato (meaning fast and "smooth") model is shown to extend the time-to-failure $t_{\mathrm{failure}}$ without sacrificing computational speed or accuracy. Specifically, Allegro-Legato exhibits a much weaker dependence of time-to-failure on the problem size, $t_{\mathrm{failure}} \propto N^{-0.14}$ (N is the number of atoms), compared to the SOTA Allegro model, $t_{\mathrm{failure}} \propto N^{-0.29}$, i.e., a systematically delayed time-to-failure, thus allowing much larger and longer NNQMD simulations without failure. The model also exhibits excellent computational scalability and GPU acceleration on the Polaris supercomputer at Argonne Leadership Computing Facility. Such scalable, accurate, fast, and robust NNQMD models will likely find broad applications in NNQMD simulations on emerging exaflop/s computers, with a specific example of accounting for nuclear quantum effects in the dynamics of ammonia to lay a foundation for green ammonia technology for sustainability.

1.6 Organization

The remainder of this dissertation is organized as follows. The discussion of minimum sharpness, a scale-invariant definition of sharpness, is found in Chapter 2. This is followed by my result on exponential escape efficiency from sharp minima in Chapter 3. We then discuss machine learning based physics simulation with sharpness regularization in Chapter 4. Finally, we present a novel method for molecular dynamics with sharpness regularization, Allegro-Legato, in Chapter 5. Chapter 6 concludes this dissertation by presenting a summary of these works, their contributions, and discussions of future work.

Chapter 2
Minimum sharpness: Scale-invariant parameter-robustness of neural networks

2.1 Introduction

Despite the tremendous success of neural networks (NNs), NNs are known to be vulnerable to adversarial examples, i.e., simple perturbations to input data can mislead models [75, 112, 201]. Much research attempts to resolve this vulnerability and make NNs robust to noisy input data. Similarly, robustness against perturbations of model parameters, or parameter-robustness, is equally essential to make NNs robust against the noise arising in hardware neural networks [119].

Parameter-robustness is a well-studied topic in different lines of research, where it is called "sharpness" [84, 100]. Although sharpness was initially studied in connection with generalization performance, an increasing number of results show that controlling sharpness is an effective way to design robust models.
For example, [60] proposed "Sharpness-Aware Minimization," which minimizes an approximated sharpness measure during training. Similarly, [181] proposed "adversarial corruption-resistant training," which implements adversarial corruption of model parameters in the training process. As a theoretical justification, [198] rigorously proved that sharpness with respect to the activation functions gives a tight upper bound on NN performance.

However, despite its effectiveness being shown in many ways, sharpness leaves a critical unsolved problem: scale-sensitivity. [41] pointed out that traditional sharpness measures are problematic under scale transformations. If NNs utilize a non-negative homogeneous activation function, ϕ(ax) = aϕ(x) for all a > 0, such as ReLU [144] or Maxout [74], then a certain scale transformation of the parameters does not change the functional properties at all; however, naive sharpness measures may change significantly. This poses a problem: a good sharpness measure should be invariant under such scale transformations, but naive sharpness measures are not.

The following are the major previous studies tackling the scale-sensitivity problem. The first study [161] proposed a sharpness measure defined as the spectral norm of the Riemannian Hessian on a quotient manifold. The quotient manifold makes the sharpness measure invariant because it is defined over the equivalence relation induced by the α-scale transformation. The second study [189] developed a sharpness measure through a minimization problem; the minimization selects a sharpness value independent of the scaling parameters. Although those proposed sharpness measures are free from scale-sensitivity, they have the following drawbacks: (i) they require intractable assumptions in their derivation process, and (ii) they suffer from heavy computation to handle Hessian matrices.

In this work, we propose a novel invariant sharpness measure called Minimum Sharpness, which overcomes the aforementioned problems. Our sharpness measure is defined as the minimum trace of the Hessian matrix over the equivalence classes generated by the scale transformation. We show that our sharpness measure is scale-invariant owing to the minimization over an equivalence class. Further, our measure is computationally efficient thanks to our technique that can exactly and efficiently calculate the Hessian trace. With this technique, the minimization can be carried out without costly computations, with just several epochs of forward and backward propagation.

As empirical justification, we carried out an experiment to report the efficiency and accuracy of our technique compared to the ground-truth Hessian trace calculation. We also empirically confirmed that our sharpness validly correlates with the performance of models. The implementation of the experiment is available online.*

* https://github.com/ibayashi-hikaru/minimum-sharpness

2.2 Related Works

Vulnerability of Neural Networks: Existing studies on the robustness or defensiveness of neural networks primarily focus on generating adversarial examples. [182]'s pioneering work first proposed the concept of the adversarial attack and found that neural network classifiers are vulnerable to some adversarial noise on input data. Following this work, a variety of adversarial attack algorithms were developed [75, 140, 112]. In response, some works developed training algorithms to make NNs robust against such adversarial attacks on input data, i.e., adversarial training algorithms [130].
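As a point of reference for the input-space attacks cited above, a single-step attack in the style of the fast gradient sign method can be sketched as below; this is an illustrative example only, not taken from this dissertation.

```python
import torch

def fgsm_example(model, loss_fn, x, y, eps=0.03):
    """Craft an input perturbation that increases the loss (FGSM-style sketch).

    x: input batch, y: labels, eps: perturbation budget in the L-infinity norm.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        # Move each input entry by eps in the direction that increases the loss.
        x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.detach()
```

The parameter-space analogue of this idea (perturbing weights rather than inputs) is precisely what sharpness quantifies.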
In recent years, extending the scope of the adversarial attack to weight parameters, [181] showed that robustness against weight-parameter perturbations also contributes to the robustness of NNs. Beyond their novel results, it is intriguing to see that the notion they introduced, "parameter robustness," is mathematically equivalent to a naively defined sharpness.

Sharpness and Generalization: Several empirical studies have found that the sharpness of loss surfaces correlates with the generalization performance of NNs. [84] first observed that flat minima of shallow NNs generalize well. Some recent results show that deep NNs also exhibit a similar relation [100, 204]. Further, a large-scale experiment by [98] shows that sharpness measures have a stronger correlation with the generalization gap than other measures. Due to this desirable property, sharpness has been implemented in some practical algorithms [23, 205], and [60] has achieved state-of-the-art performance. There are also theoretical works rigorously formalizing the connection between sharpness, weight perturbations, and robustness of neural networks [198, 188].

Sharpness Measure and Scale Sensitivity: The development of scale-invariant sharpness has emerged in response to [41]'s criticism of sharpness. [197] utilized the rigorous PAC-Bayes theory to define a metric called pacGen, achieving negligible scale sensitivity. [123] utilized information geometry to design a scale-invariant measure, the Fisher-Rao metric, which correlates well with the generalization gap under several scenarios. [161] developed a scale-invariant sharpness measure defined over a quotient manifold; however, their measure does not sufficiently correlate with the generalization gap. [189] improved [197]'s work and achieved a scale-invariant sharpness measure. However, since the works above suffer from either heavy computation or unrealistic approximations, they are not as suitable for practical use as the naively defined sharpness [23, 205, 60]. Our minimum sharpness is the first scale-invariant sharpness ready for practical use.

2.3 Methodology

2.3.1 Preliminary

Neural Network: We define fully-connected neural networks (FCNNs) as follows:
\[
f(x \mid \theta) = W_D\, \phi(W_{D-1}\, \phi(\cdots \phi(W_1 x) \cdots)),
\]
where x is an input, D is the number of layers, $\theta = \{W_d\}_{d=1}^{D}$ is the set of weight parameter matrices of the d-th layers (layer parameters), and ϕ is an activation function. For brevity, we restrict the explanation in this study to FCNNs.†

† The same discussion is applicable to various NNs, including convolutional NNs.

α-Scale Transformation: The α-scale transformation [41] is a transformation of the parameters of FCNNs that does not change the function. We use $\alpha = \{\alpha_d\}_{d=1}^{D}$ such that $\alpha_d > 0$ and $\prod_d \alpha_d = 1$ to rescale the parameters of NNs as follows: $\theta' = \{W'_d\}_{d=1}^{D} = \{\alpha_d W_d\}_{d=1}^{D}$. We denote the functionality of this transformation as $\theta' = \alpha(\theta)$. Importantly, the scale transformation does not change the function computed by an FCNN, i.e., $\forall x,\ f(x \mid \theta) = f(x \mid \theta')$, if its activation function ϕ is non-negative homogeneous (NNH), $\forall a > 0,\ \phi(ax) = a\phi(x)$, such as the Rectified Linear Unit (ReLU) activation.
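This invariance is easy to verify numerically. The following is a small sanity-check sketch (not from the dissertation): rescale the layers of a bias-free ReLU network by factors whose product is one and confirm that the outputs are unchanged. The layer widths and scale factors are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
D = 3
widths = [8, 16, 16, 4]
# Weight matrices W_1, ..., W_D of a bias-free ReLU FCNN.
weights = [torch.randn(widths[d + 1], widths[d]) for d in range(D)]

def fcnn(x, ws):
    h = x
    for W in ws[:-1]:
        h = torch.relu(h @ W.T)
    return h @ ws[-1].T

# alpha-scale transformation: alpha_d > 0 with prod_d alpha_d = 1.
alpha = torch.tensor([2.0, 0.25, 2.0])
assert torch.isclose(alpha.prod(), torch.tensor(1.0))
scaled = [a * W for a, W in zip(alpha, weights)]

x = torch.randn(5, widths[0])
print(torch.allclose(fcnn(x, weights), fcnn(x, scaled), atol=1e-5))  # True
```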
2.3.2 Proposed Method: Minimum Sharpness

The previous work [41] pointed out that the α-scale transformation can generate NNs whose functionalities are identical but whose naive sharpness measures differ. Let $L(\theta) = L(f(\cdot \mid \theta))$ be a loss function with a function $f(\cdot \mid \theta)$, and let $H_\theta := \nabla^2_\theta L(\theta)$ be the Hessian matrix, whose trace can serve as a proxy of sharpness. Obviously $L(\theta) = L(\alpha(\theta))$ holds for any α, but the trace of $H_\theta$ can change under varying α. This problem motivates us to develop a sharpness measure that is invariant to the α-scale transformation.

Definition 1. We propose the following novel sharpness, named Minimum Sharpness (MS), at θ:
\[
\mathrm{MS}_\theta = \min_\alpha \mathrm{Tr}[H_{\alpha(\theta)}]. \tag{2.1}
\]

By its definition, the minimum sharpness is invariant to the α-scale transformation.

To enjoy the benefits of minimum sharpness, we need to overcome two difficulties. The first is the high computational cost of calculating Hessian matrices, because a naive computation of the whole Hessian matrix takes an infeasible amount of memory. The second is that the objective of the minimization problem (2.1) does not have an explicit form with respect to α; without one, we would have to solve the minimization using inefficient methods such as grid search.

To make minimum sharpness tractable, we exploit the following two propositions regarding the Hessian matrix. In the following, we consider a classification problem with a softmax loss and K labels. We note that the same discussion is applicable to convolutional NNs and other loss functions. The following results simplify the minimization problem (2.1).

Proposition 1. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of data pairs, $o_l$ the logit for label l, $p_l = \frac{1}{Z}\exp[o_l]$, $Z = \sum_{l} \exp[o_l]$, and $L(\theta) = -\frac{1}{n}\sum_i \ln p(y_i \mid x_i; \theta)$. Then we have the following formulation:
\[
\mathrm{Tr}[H_\theta] = \frac{1}{n}\sum_{i=1}^{n} \mathrm{score}(x_i, y_i), \qquad
\mathrm{score}(x, y) = \sum_{d=1}^{D}\left( \sum_{l=1}^{K} p_l \left\| \frac{\partial o_l}{\partial W_d} \right\|_F^2 - \left\| \frac{\partial \ln Z}{\partial W_d} \right\|_F^2 \right),
\]
where $\|\cdot\|_F$ denotes the Frobenius norm.

Proposition 2. Let $W_d$ be the layer parameters of an NN with an NNH activation function, and let $\alpha = \{\alpha_d\}_d$ be the parameters of an α-scale transformation. Then the following relations hold:
\[
\frac{\partial \ln Z}{\partial W_d} = \frac{1}{\alpha_d}\frac{\partial \ln Z'}{\partial W'_d}, \qquad
\frac{\partial o_l}{\partial W_d} = \frac{1}{\alpha_d}\frac{\partial o'_l}{\partial W'_d},
\]
where $W'_d = \alpha_d W_d$. The values $Z, o_l$ and $Z', o'_l$ are calculated using the NNs f and α(f), respectively.

The proofs are provided in Appendices A.1 and A.2, respectively. The first proposition provides a decomposition to calculate Tr[H] exactly and efficiently: the calculation requires only K+1 epochs of forward and backward propagation for the gradients of Z and $o_l$. To the best of our knowledge, this decomposition is also our contribution.

Combining the two results, we have the following tractable formulation of minimum sharpness:
\[
\mathrm{MS}_\theta = \min_\alpha \sum_{d=1}^{D} \frac{1}{\alpha_d^2}\,\mathrm{Tr}[H_{\theta,d}],
\quad \text{where} \quad
\mathrm{Tr}[H_{\theta,d}] = \frac{1}{n}\sum_{i=1}^{n}\left( \sum_{l=1}^{K} p_{l@i} \left\| \frac{\partial o_{l@i}}{\partial W_d} \right\|_F^2 - \left\| \frac{\partial \ln Z_{@i}}{\partial W_d} \right\|_F^2 \right),
\]
$H_{\theta,d}$ denotes the diagonal block of the Hessian over the d-th layer, and the subscript @i indicates the i-th data point's outputs. Here, applying the inequality of arithmetic and geometric means together with $\prod_d \alpha_d = 1$,
\[
\sum_{d=1}^{D} \frac{1}{\alpha_d^2}\,\mathrm{Tr}[H_{\theta,d}]
\;\ge\; D \prod_{d=1}^{D}\left( \frac{1}{\alpha_d^2}\,\mathrm{Tr}[H_{\theta,d}] \right)^{1/D}
\;=\; D \left( \prod_{d=1}^{D} \mathrm{Tr}[H_{\theta,d}] \right)^{1/D},
\]
we obtain the following result. We note that a more detailed explanation is available in Appendix A.3.

Theorem 1. For the no-bias FCNN f, minimum sharpness can be formulated as
\[
\mathrm{MS}_\theta = D \left( \prod_{d=1}^{D} \mathrm{Tr}[H_{\theta,d}] \right)^{1/D}.
\]

Our sharpness has the following remarkable benefits. Firstly, we can check and interpret the invariance easily because of its simple design, in which all scales $\alpha_d$ cancel out. Secondly, these results can be extended to NNs with biases and to convolutional NNs, e.g., by adding terms $(\prod_{i=1}^{d}\alpha_i)^{-2}\,\|\partial/\partial b_d\|^2$ for the d-th layer's bias $b_d$. Thirdly, as shown in the experiment section, minimum sharpness correlates with generalization gaps at the same level as previous invariant sharpness measures. Lastly, owing to the form in Proposition 1, we can compute the measure very efficiently and without error.
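To make the computation above concrete, the sketch below estimates the per-layer traces $\mathrm{Tr}[H_{\theta,d}]$ with a naive per-sample loop following the Proposition-1-style decomposition and then combines them via Theorem 1. It is a hedged illustration under the assumptions stated in the comments (bias-free ReLU network, softmax loss), not the dissertation's optimized implementation, which batches the K+1 forward-and-backward passes.

```python
import torch

def layer_hessian_traces(model, xs):
    """Per-layer Hessian traces Tr[H_{theta,d}] for a softmax loss, following
    the Proposition-1-style decomposition (naive per-sample sketch; assumes a
    bias-free ReLU network whose parameters() are exactly W_1, ..., W_D)."""
    params = list(model.parameters())
    traces = torch.zeros(len(params))
    n = xs.shape[0]
    for i in range(n):
        logits = model(xs[i:i + 1]).squeeze(0)            # o_1, ..., o_K
        probs = torch.softmax(logits, dim=0).detach()     # p_1, ..., p_K
        log_z = torch.logsumexp(logits, dim=0)            # ln Z
        gz = torch.autograd.grad(log_z, params, retain_graph=True)
        for l in range(logits.shape[0]):
            gl = torch.autograd.grad(logits[l], params, retain_graph=True)
            for d, g in enumerate(gl):
                traces[d] += probs[l] * g.pow(2).sum() / n
        for d, g in enumerate(gz):
            traces[d] -= g.pow(2).sum() / n
    return traces

def minimum_sharpness(traces):
    """Theorem 1: MS = D * (prod_d Tr[H_{theta,d}])^(1/D)."""
    D = traces.numel()
    return D * traces.clamp_min(1e-12).log().mean().exp()
```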
2.4 Experiments

2.4.1 Accuracy and Efficiency of Tr[H] Calculation

We verify the accuracy and efficiency of our developed calculation of the trace of Hessian matrices. We compare the proposed method with a naive calculation (referred to as the "baseline" here), which first computes the gradients of all parameters, naively applies parameter-wise differentiation to the gradients, and then selects the diagonal elements from the outcome.‡ The experimental details are described in Appendix A.6.

‡ Note that we did not calculate the Hessian for all parameter pairs. We tried several implementations within the naive formulation ∇∇L and chose the best-performing one.

In Fig. 2.1, we report the computational time (left) and the calculated trace (right). Our proposed method is significantly more efficient, and its approximation error is negligibly small. These benefits come from our simplification of the Hessian calculation with the NNH activation function.

Figure 2.1: Comparison between baseline and proposal for calculating Tr[H]. The left table reports averaged computational time, and the right figure shows the calculated trace.

num-of-data        n=10       n=100      n=1000
Baseline (FCNNs)   11.11 sec  11.05 sec  12.23 sec
Proposal (FCNNs)   0.018 sec  0.019 sec  0.021 sec
Baseline (CNNs)    10.19 sec  10.73 sec  21.22 sec
Proposal (CNNs)    0.026 sec  0.036 sec  0.334 sec

2.4.2 Comparison with Previous Sharpness

In addition to its scale-invariance and computational efficiency, we conduct an experiment to observe how minimum sharpness is connected to the performance of NNs. Following previous works [189, 123], we use the "training with randomized labels" experiment to check whether our sharpness has a valid correlation with the generalization gap. In this experiment, we use corrupted MNIST as a dataset, whose labels are partially shuffled with ratios {0.0, 0.1, ..., 1.0}. We evaluate generalization gaps using fully-connected NNs (FCNNs), measuring the difference between train accuracy and test accuracy (i.e., $\mathrm{Acc}_{\mathrm{train}} - \mathrm{Acc}_{\mathrm{test}}$) as the generalization gap. For sharpness measures, we investigate the proposed minimum sharpness and the normalized sharpness of [189] as a baseline.§ Other experimental setups are detailed in Appendix A.4.

§ We did not follow the original procedures proposed in [189, 190] because their approximation to the diagonal elements of H is computationally heavy and numerically unstable. See Appendices A.7 and A.8 for more details.

We plot the sharpness measures and the generalization gap in Fig. 2.2. We observe that the gap and both measures are successfully correlated. We also carried out the same experiment using a convolutional model, LeNet, which is shown in Appendix A.5. We claim that our minimum sharpness correlates at the same level as [190]'s sharpness.

Figure 2.2: Numerical experiments showing that our sharpness correlates with the generalization gap. We trained FCNNs on MNIST with partially randomized labels. The ratios are shown above.

Chapter 3
SGD escapes from sharp minima exponentially fast in non-stationary regime

3.1 Introduction

Stochastic gradient descent (SGD) has become the de facto standard optimizer in modern machine learning, especially deep learning.
However, its prevalence has opened up a theoretical question: why can SGD find generalizable solutions in complicated models such as neural networks? The loss landscapes of neural networks are known to be highly non-convex [120], difficult to minimize [13], and full of non-generalizable minima [206]. Answering this open question is an important task for attaining a solid understanding of modern machine learning.

In recent years, "fast escaping from sharp minima" has emerged as a promising narrative for SGD's generalization. "Sharp minima" are model parameters that are local minima of the loss function and are sensitive to perturbations; several empirical and theoretical studies have shown that they deteriorate generalization ability [100, 48, 98]. "Escaping," the counterpart of the narrative, measures how fast SGD moves out of the neighborhood of such minima. In a few words, the narrative claims that SGD can find generalizable minima because it quickly escapes from sharp minima [212, 203]. This narrative is aligned with actual phenomena. The left panel of Fig. 3.1 shows how SGD updates affect the sharpness of parameters throughout the training of a neural network: the sharpness oscillates widely in the early phase of training and then becomes smaller toward the end. This suggests that SGD repeatedly jumps out of sharp minima and eventually reaches flat minima (Fig. 3.1, right).

Figure 3.1: Sharpness dynamics throughout training via SGD (left), where we used the sharpness defined in [100, Metric 2.1]. As shown, sharpness fluctuates during training and becomes small toward the end of training. This means SGD jumps out of sharp minima to find flat and generalizable minima (right). We used VGG [173] fed with CIFAR-10 [110] with cross-entropy loss.

Many theoretical studies have investigated SGD's escaping behavior. Viewing SGD as a stochastic differential equation (SDE) with Gaussian noise, recent studies have identified that SGD's noise plays a key role in escaping. It was shown that fast escaping is realized by the so-called "anisotropic noise" of SGD, i.e., noise whose magnitude varies among directions [212]. [94] formulated the effect of anisotropic noise in the stationary regime, where the SDE-model has reached a stationary distribution after many iterations. By limiting the target to the stationary regime, they took advantage of the theoretical analogy between SGD and thermodynamics. [203] further elaborated this approach and found that escaping can be formulated via the Kramers escape rate, a well-used formula in various fields of science [108]. Their result revealed that the SDE-model escapes exponentially fast from sharp minima, which means that sharpness exponentially accelerates SGD's escaping.

With these progressive refinements, a remaining task is how to go beyond the stationary regime. Although physics and chemistry commonly assume that a system has reached a stationary distribution [54, 77], the stationary regime does not apply to the analysis of SGD for the following reasons. First, it has been shown that SGD forms a stationary distribution only for very limited objective functions [40, 26].
Sec- ondly, even when such a stationary distribution exists, SGD takesO(d) steps to reach it, whered is a number of parameters of a model [157]. Since common neural networks have numerous parameters, the stationary regime may not be directly applicable to the actual SGD dynamics. In this work, we propose a novel formulation of exponential escaping of the SDE-model in the non- stationary regime, by introducing the Large Deviation Theory [38, 63]. We formulate the escaping of the SDE-model through a notion of “steepness,” which intuitively means the hardness to climb up loss surface along a given trajectory (Fig. 3.2). Large Deviation Theory provides that the time to escape is described by the steepness of a trajectory starting from the minimum: Time to escape∼ exp[V 0 ] (V 0 : Smallest steepness from a minimum) Based on this analysis, we show the following main result on SGD’s time to escape: Theorem 3 (informal): SGD’s time to escape∼ exp B η ∆ Lλ − 1 max , 19 whereB is a batch size, η is a learning rate, ∆ L is the depth of the minimum, andλ max is the maximum eigenvalue of Hessian matrix at the minimum, i.e. sharpness of the minimum (Definition 4). We can see that as sharpness of the minimum increases, i.e. λ max increases, the time to escape decreases exponentially. This is the first result showing that SGD escapes from sharp minima exponentially fast, even out of the stationary regime. As a further benefit, our formulation can be easily extended to the discrete update rule of SGD (Theorem 4). The rest of this work is organized as follows. Section 3.2 formulates the SGD’s escape problem. Sec- tion 3.3 presents the traditional Large Deviation Theory and its application to the SGD’s escape problem. Section 3.4 gives the main results on SGD’s escaping. In Section 3.5, we confirm that the main results are consistent with numerical experiments. 3.2 Problem Formulation Notations: For ak× k matrixA,λ j (A) is thej-th largest eigenvalue ofA. We especially writeλ max (A)= λ 1 (A) and λ min (A) = λ k (A). O(·) denotes Landau’s Big-O notation. ∥·∥ denotes the Euclidean norm. Given a time-dependent functionθ t , ˙ θ t denotes the differentiation ofθ t with respect tot. N(µ, Σ) denotes the multivariate Gaussian distribution with the meanµ , and the covarianceΣ . 3.2.1 Stochastic Gradient Descents We consider a learning model parameterized byθ ∈R d , whered is a number of parameters. Given training examples{x i } N i=1 and a loss functionℓ(θ,x i ), we consider a training lossL(θ ) := 1 N P N i=1 ℓ(θ,x i ) and a mini-batch lossL B (θ ) is a mini-batch training loss with sizeB. We consider two types of stochastic gradient descent (SGD) methods; a discrete SGD and a continu- ous SGD. Although the ordinary SGD is discrete, we focus on its continuous variation for mathematical convenience. 20 Discrete SGD Given an initial parameterθ 0 ∈R d and a learning rateη > 0, SGD generates a sequence of parameters{θ k } k∈N by the following update rule: θ k+1 =θ k − η ∇L B (θ k ), k∈N. (3.1) We model SGD as a gradient descent with a Gaussian noise perturbation. We decompose−∇ L B (θ k ) in (3.1) into a gradient term−∇ L(θ k ) and a noise term∇L(θ k )−∇ L B (θ k ), and model the noise as a Gaussian noise. With this setting, the update rule in (3.1) is rewritten as θ k+1 =θ k − η ∇L(θ k )+ r η B W k , (3.2) whereW k ∼ N(0,ηC (θ k )) is a parameter-dependent Gaussian noise with its covariance,C(θ ):=E i∼ Uni({1,...,N}) h (∇L(θ )−∇ ℓ(θ,x i ))(∇L(θ )−∇ ℓ(θ,x i )) ⊤ i . 
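For intuition, the following minimal NumPy sketch simulates the Gaussian-noise update rule (3.2) on a toy quadratic loss; the quadratic loss, the hyperparameter values, and the choice $C(\theta)=H$ (a common simplification near a minimum, cf. Assumption 2 below) are illustrative assumptions only, not the setting used in our analysis or experiments.

```python
import numpy as np

# Minimal sketch of the Gaussian-noise SGD model (3.2),
#   theta_{k+1} = theta_k - eta * grad L(theta_k) + sqrt(eta / B) * W_k,
# with W_k ~ N(0, eta * C(theta_k)). Here L is a toy quadratic loss
# L(theta) = 0.5 * theta^T H theta and we set C(theta) = H for illustration.
rng = np.random.default_rng(0)
d, eta, B = 2, 1e-3, 16
H = np.diag([10.0, 1.0])  # one sharp and one flat direction

def grad_L(theta):
    return H @ theta

theta = np.zeros(d)  # start at the minimum theta* = 0
for k in range(10_000):
    W_k = rng.multivariate_normal(np.zeros(d), eta * H)  # noise with covariance eta * C(theta)
    theta = theta - eta * grad_L(theta) + np.sqrt(eta / B) * W_k
print("parameter after 10,000 noisy updates:", theta)
```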
Continuous SGD We define an SDE-model so that its discretization corresponds to the discrete SGD (3.2) using the Euler scheme (e.g. Definition 5.1.1 of [69]). We call this SDE-model “continuous SGD." Given a time indext≥ 0 and an initial parameterθ 0 ∈R d , the continuous dynamic of continuous SGD is written as follows, ˙ θ t =−∇ L(θ t )+ r η B C(θ t ) 1/2 ˙ w t (3.3) where w t is a d-dimensional Wiener process, extiti.e. an R d -valued stochastic process with t such that w 0 = 0 andw t+u − w t ∼ N(0,uI) for anyt,u > 0. We note that this system can be seen as a Gaussian perturbed dynamical system with a noise magnitude p η/B becauseη andB do not evolve by time. Remark 1 (Gaussianity of W k ). In this work, the Gaussianity of the noise on gradients (W k ) is justified by the following reasons: (i) if the batch size B is sufficiently large, the central limit theorem ensures the 21 noise term becomes Gaussian [212], and (ii) an empirical study shows that the noise term follows Gaussian distribution [203]. We also remark that the properties of SGD’s noise are still actively studied and some studies showed W k is heavy-tailed [174, 145]. We refer readers to Section 3.7 for the an in-depth discussions. 3.2.2 Mean Exit Time We consider the problem of how discrete and continuous SGD’s escape from minima of loss surfaces. This is formally quantified by a notion of mean exit time. We define θ ∗ ∈R d as a local minimum of loss surfaces, and also define its neighborhood D ⊂ R d as an open set which containsθ ∗ . We define the mean exit time as follows: Definition 2 (Mean exit time from D). Consider a continuous SGD (3.3) starting from θ 0 ∈ D. Then, a mean exit time of the continuous SGD fromD is defined as E[τ ]:=E[min{t:θ t / ∈D}]. Intuitively, a continuous SGD with smallE[τ ] easily escapes from the neighborhoodD. The means exit time of continuous SGD (3.3) formally corresponds the SGD’s “time to escape" in this work. Similarly, we define the discrete mean exit time as follows. Here, kη (time step× learning rate) corre- sponds tot, since theη is regarded as a width of the discretization. Definition 3 (Discrete mean exit time fromD). Consider a discrete SGD (3.2) starting fromθ 0 ∈D. Then, a discrete mean exit time of the discrete SGD fromD is defined as E[ν ]:=E[min{kη :θ k / ∈D}]. 22 Remark 2 (Other measures for escaping). In previous work, there exist several terms and definitions to quantify the escaping behaviors. [212] defined “escaping efficiency” as E θ t [L(θ t )− L(θ 0 )]. [203] defined an “escape rate” as a ratio between the probability of coming out from θ ∗ ’s neighborhood and the probability mass aroundθ ∗ . They also defined an “escape time” by the inverse of the escape rate. 3.2.3 Setting and Basic Assumptions for SGD’s Escape Problem We provide basic assumptions that are commonly used in the literature of the escape problem [132, 212, 94, 203]. Assumption 1 (L(θ ) is locally quadratic inD). There exists a matrixH ∗ ∈R d× d such that for anyθ ∈D, the following equality holds: ∀θ ∈D,L(θ )=L(θ ∗ )+∇L(θ ∗ )(θ − θ ∗ )+ 1 2 (θ − θ ∗ ) ⊤ H ∗ (θ − θ ∗ ) Assumption 2 (Covariance matrix atθ ∗ ). C(θ ∗ )=H ∗ . Assumption 3. For allθ inD, all the entries ofC(θ ) are Lipschitz continuous inθ . Assumption 4. For allθ inD, there exists a constantk >0 that bounds the eigenvalues ofC(θ ): k≥ λ 1 ≥ ...≥ λ d ≥ 1 k . It is known that Assumption 2 hold at global minima [94, 212] and [203, Section 2] empirically showed that it is reasonable approximation at local minima as well. 
Although Assumption 4 is strong, it has been shown that under certain settings, even models such as neural networks have parameters that satisfy it [208, 27]. 23 Figure 3.3: Notions related to the fundamental theorem (Theorem 2), a minimumθ ∗ , its neighborhood hood D as well as the boundary∂D, andV 0 , which is defined as the smallest quasi-potential in ∂D. Finally, we adopt the following definition as sharpness in our analysis. Definition 4 (Sharpness of a minimumθ ∗ ). Sharpness ofθ ∗ is the maximum eigenvalue ofH ∗ in Assump- tion 1, that is, λ max =λ max (H ∗ ). This is one of the most common definitions of sharpness [72, 93]. We note, however, that there is on-going controversy over its definition [41]. 3.3 Large Deviation Theory for SGD We introduce the basic notions from the Large Deviation Theory [38, 63]. The Large Deviation Theory is a theoretical framework for stochastic processes and one of its results, mean exit time analysis, can formally quantity escaping in our setup. First, we define steepness of a trajectory on a loss surfaceL(θ ), followed by the continuous SGD (3.3). Letφ ={φ t } t∈[0,T] ⊂ R d be a trajectory in the parameter space over a time interval[0,T] with a terminal 24 timeT , whereφ t ∈R d is a parameter which continuously changes int (see Figure 3.2). Also,φ is regarded as a continuous map from[0,T] toR d , i.e. is an element ofC T (R d ) (a set of continuous trajectories inR d ) which is a support of continuous SGD during[0,T]. Here we define the following quantity. Definition 5 (Steepness ofφ for SGD). Steepness of a trajectoryφ for (3.3) is defined as S T (φ):= 1 2 Z T 0 (˙ φ t +∇L(φ t )) ⊤ C(φ t ) − 1 (˙ φ t +∇L(φ t ))dt Intuitively, steepness S T (φ) means the hardness for the system (3.3) to follow this trajectory φ up the hill on L(θ ), as illustrated in Figure 3.2. Formally, the steepness is to describe a distribution of trajecto- ries generated by continuous SGD. If a trajectory φ has a large steepness S T (φ), the probability that the system takes the trajectory decreases exponentially. We summarize the formally properties of steepness in Appendices B.2.1. Although steepness is tailored definition for our setup, this quantity is a special case of a more general notion in Large Deviation Theory. The corresponding notions are called “rate function” in [38, Section 1.2] and “normalized action functional” in [63, Section 3.2]. We secondly define quasi-potential, which is the smallest steepness from a minimumθ ∗ to a boundary ∂D. It plays an essential role in the mean exit time. Definition 6 (Quasi-potential). Given the system (3.3) whose initial point is a local minima θ ∗ , quasi- potential of a parameterθ ∈D is defined as V(θ ):= inf T>0 inf φ:(φ 0 ,φ T )=(θ ∗ ,θ ) S T (φ). Same as steepness, quasi-potential can be seen as the minimum effort that the system (3.3) needs to climb fromθ ∗ up toθ onL(θ ). Lastly, we impose the following assumptions and Large Deviation Theory provides describes the mean exit time of a continuous SGD (3.3) based on the quasi-potential. 25 Assumption 5. There exists anN <∞ such that, for allρ > 0 small enough and allx,y with|x− z|+ |y− z|≤ ρ for somez∈∂D∪{θ ∗ }, there is a functionu satisfying that∥u∥<N andφ T(ρ ) =y, where ϕ t =x− Z t 0 ∇L(φ s )ds+ Z t 0 C 1/2 (φ s )u s ds andT(ρ )→0 asρ →0. Theorem 2 (Fundamental Theorem of Exit Time). Consider the continuous SGD (3.3) whose initial point is the local minimaθ 0 = θ ∗ . Suppose Assumption 1, 3, 4 and 5 holds. 
Then, the mean exit time (Definition 2) has the following limit: lim η →0 η B lnE[τ ]=V 0 holds, whereV 0 :=min θ ∈∂D V(θ ). We obtain Theorem 2 by adapting a more general theorem [38, Theorem 5.7.11 (a)] to our setting with Assumption 1. Rigorously, we verify that several requirements of the general theorem, such as asymptotic stability and attractiveness, are satisfied with our setup. The precise description of the assumptions can be found in Appendices B.1, and the proof of Theorem 2 under our setup can be found in Appendices B.3. 3.4 Mean Exit Time Analysis for SGD In this section, we give an asymptotic analysis of the mean exit time as our main result. As preparation, we provide an approximate computation of the quasi-potential in our setting, then we give the main theorem. 26 3.4.1 Approximate Computation of Quasi-potential We develop an approximation of the quasi-potential V(θ ), which is necessary to study the mean exit time by the fundamental theorem (Theorem 2). However, the direct calculation with a generalC(θ ) is a difficult problem, and at best we get a necessary condition for the exact formula (Appendices B.4). Instead, we con- sider a proximal system which is a simplified version of the continuous SGD (3.3) with a state-independent noise covariance. 3.4.1.1 Proximal System withC(θ )=I We define the following proximal system which generates a sequence { b θ t }: ˙ b θ t =−∇ L b θ t + r η B ˙ w t (3.4) This system is obtained by replacing the covariance C(θ ) of the continuous SGD (3.3) into an identity I. That is, this proximal system is regarded as a Gaussian gradient descent with isotropic noise. We further define steepness and quasi-potential of the proximal system as follow: Steepness For eachφ∈C T (R d ), b S T (φ):= 1 2 Z T 0 ∥˙ φ t +∇L(φ t )∥ 2 dt (3.5) Quasi-potential For eachθ ∈D, b V(θ ):= inf T>0 inf φ:(φ 0 ,ϕ r)=(θ ∗ ,θ ) b S T (φ) (3.6) Owing to the noise structure of the proximal system, we achieve a simple form of the quasi-potential. For the quasi-potential b V(θ ), the following lemma holds: Lemma 1. Under Assumption 1, b V(θ )=2(L(θ )− L(θ ∗ )). 27 Proof. If the functionφ t fort∈[0,T] does not exit fromD∪∂D, b S T (φ)= 1 2 Z T 0 ∥˙ φ t −∇ L(φ t )∥ 2 dt+2 Z T 0 ˙ φ ⊤ t ∇L(φ t )dt = 1 2 Z T 0 ∥˙ φ t −∇ L(φ t )∥ 2 dt+2(L(φ T )− L(φ 0 )) ≥ 2(L(φ t )− L(φ 0 )) The equality holds when ˙ φ t =∇L(φ t ). Since quasi-potential atθ is the infimum of the steepness from θ ∗ toθ , b V(θ )=2(L(θ )− L(θ ∗ )) is obtained. Lemma 1 shows that the quasi-potential with the proximal system is simply represented as the height ofθ from a minimumθ ∗ . By the quasi-potential, we simply obtain the following result by combining Theorem 2: Proposition 3 (Mean Exit Time of Proximal System). Consider the proxy system (3.4) whose initial point is the local minimaθ 0 = θ ∗ . Suppose that Assumption 1, 2, and 6 hold. Then, the mean exit time of (3.4) from the neighborhoodD,E[b τ ], has the following limit lim η →0 η B lnE[b τ ]=2∆ L. 3.4.1.2 Approximation of Quasi-potentialV 0 :=min θ ∈∂D V(θ ) We approximate the target quasi-potentialV(θ ) usingλ − 1 max b V(θ ) from the proximal system. For this sake, we impose the following assumption: Assumption 6. There existsK > 0 such that for anyθ ∈ D, ifS T (φ) = V(θ ), then∀t∈ [0,T] : ˙ φ t ≤ K holds. This claims that the velocities of trajectories do not become infinitely large. With this mild assumption, we obtain the estimation ofV 0 as follows. 28 Lemma 2. 
Under Assumption 1, 2, and 6, there exists a constantA such that V 0 − λ − 1 max b V 0 ≤ A (1+C 0 rk)λ − 1 min − λ − 1 max whereC 0 is a Lipschitz constant ofC(θ ) andr is the radius ofD. In the statement of Lemma 2, the right-hand side is small when (i)H ∗ is well-conditioned (i.e. λ min ≈ λ max ) and (ii)C(θ ) does not change much aroundθ ∗ (i.e. C 0 is small.) Under such conditions, the quasi- potential of continuous SGD becomes close to λ − 1 max b V 0 . But we note that this approximation can be loose because of the hidden factors inA (Appendices B.5). 3.4.2 Main Results: Mean Exit Time Analysis As our main results, we give inequalities that characterize a limit of the mean escape time. We recall the definition of the depth of a minimum θ ∗ as∆ L:=min θ ∈∂D L(θ )− L(θ ∗ ). Continuous SGD First, we study the case of continuous SGD (3.3). This result is obtained immediately by combining the fundamental theorem (Theorem 2) with the approximated quasi-potential (Lemma 2): Theorem 3 (Mean Exit Time of Continuous SGD). Consider the continuous SGD (3.3) whose initial point is the local minimaθ 0 =θ ∗ . Suppose that Assumption 1, 2, and 6 hold. Then, the mean exit time (Definition 2) from the neighbourhoodD has the following limit: 2 ∆ L λ max − A (1+C 0 rk)λ − 1 min − λ − 1 max ≤ lim η →0 η B lnE[τ ]≤ 2 ∆ L λ max +A (1+C 0 rk)λ − 1 min − λ − 1 max Excluding the effect of the approximationA λ − 1 min − λ − 1 max +C 0 rk 2 , this result indicates that continu- ous SGD needsexp(2 B η λ − 1 max ∆ L) number of steps asymptotically, before escaping from the neighborhood D of the local minima θ ∗ . Compared to the quasi-potential of the proxy system 2∆ L (Proposition 3), the 29 covariance matrix reduces quasi-potential in the factor of λ max . This result endorses the fact that SGD’s noise structure, C(θ ), exponentially accelerates the escaping [203], because quasi-potential exponentially affect mean exit time (Theorem 2). A more rigorous comparison is given in Section 3.6. Discrete SGD Next, we give the mean escape time analysis for discrete SGD (3.2). Our approach is to combine the following discretization error analysis to the continuous SGD results (Theorem 3): Lemma 3 (Discretization Error). For a stochastic system with Gaussian perturbation and its discrete cor- respondence, the discretization error of exit time has the following convergence rate E[ν ]− E[τ ]=O( √ η ). The following lemma can be simply derived as a special case of [70, Theorem 17] by substituting g(·)=0,f(·)=1, andk(·)=0 in their definition. Based on the analysis, we obtain the following result: Theorem 4 (Mean Exit Time of Discrete SGD). Consider the discrete Gaussian SGD (3.2) whose initial point is the local minima θ 0 = θ ∗ . Suppose that Assumption 1, 2, and 6, hold. Then, the mean exit time (Definition 2) from the neighbourhood D has the following limit: 2 ∆ L λ max − A (1+C 0 rk)λ − 1 min − λ − 1 max ≤ lim η →0 η B lnE[ν ]≤ 2 ∆ L λ max +A (1+C 0 rk)λ − 1 min − λ − 1 max This result indicates that continuous and discrete SGD have the identical asymptotic the mean exit times. In other words, the discretization error is asymptotically negligible in this analysis of escape time. Fig 3.4 summarizes the whole structure of our results. Proof. For the continuous SGD, by Theorem 2, we havelim η →0 η B lnE[τ ]=V 0 , whereV 0 =min θ ′ ∈∂D V (θ ′ ). With this result, it remains to evaluate the discretization error of exit time. 
30 Original SGD θ k+1 =θ k − η ∇L B (θ k ) (3.1) Discrete Proxy System ˙ θ k+1 =−∇ L(θ k )+ q η B N(0,ηI ) C(θ )7→I Discrete SGD θ k+1 =θ k − η ∇L(θ k )+ q η B W k (3.2) Exit time: Theorem 4 Proxy System ˙ θ t =−∇ L(θ t )+ q η B ˙ w t (3.4) Exit time: Proposition 3 C(θ )7→I Continuous SGD ˙ θ t =−∇ L(θ t )+ q η B C(θ t ) 1/2 ˙ w t (3.3) Exit time: Theorem 3 Convergence of exit time O( √ η ) (Lemma 3) Convergence of exit time O( √ η ) (Lemma 3) Convergence of coefficient ( B→∞) Figure 3.4: The whole structure of our results. Here, without loss of generality, we assumeE[ν ] > 1. Also, we consider a case withE[ν ]− E[τ ]≥ 0. For the opposite caseE[ν ]− E[τ ] < 0, we can obtain the same result by repeating the following proof. By Lemma 3, for sufficiently small η , there exists a constant c such that 0 ≤ E[ν ]− E[τ ] ≤ c √ η holds. Therefore, the discrete exit time can be lower bounded as η B lnE[ν ]≥ η B lnE[τ ], 31 and also upper bounded as η B lnE[ν ]≤ η B ln(E[τ ]+c √ η ) = η B ln(1+E[τ ]− 1+c √ η ) ≤ η B ln(E[τ ])+ η B ln(1+c √ η ). The last inequality follows thatlog(1+a+b)≤ log(1+a)+log(1+b) for anya,b>0. Using the lower and upper bound, we obtain lim η →0 η B lnE[ν ]= lim η →0 n η B ln(E[τ ])+ η B ln(1+c √ η ) o =V 0 . Combined with Lemma 2 and Theorem 2, we obtain the statement of Theorem 4. 3.5 Numerical Validation We provide numerical experiments to validate our result under practical scenarios. We use a multi-layer per- ceptron with one hidden layer with 5000 units, mean square loss function, fed with the A VILA dataset [36]. To obtain the local minimum θ ∗ , we run the gradient descent network for a sufficiently long time (1000 epochs) to obtain asymptotically stable θ ∗ . The region D is defined as a neighborhood of θ ∗ . With θ ∗ as an initial value, we measure the exit times with SGD for 100 times independently. We measure the average number of steps at which SGD exits from D as the discrete mean exit time. To observe the dependency on the essential hyper-parameters (λ max , η , B, and ∆ L,), we compute the Pearson correlation coefficient, i.e. the linear correlation. The sharpness ofθ ∗ is controlled by mappingL(θ ) toL( √ αθ ) with a parameter α> 0. Since this mapping changesλ max toαλ max with other properties remaining the same, we useα as a 32 surrogate of the sharpnessλ max . In a similar manner,∆ L is controlled by mappingL(θ ) toβL (θ ), where, β is a surrogate of the depth of a minimum∆ L. Fig. 3.5 shows the discrete mean exit time has exponential dependency on λ − 1 max , η − 1 , B, and ∆ L, which is aligned with Theorem 4. As a reference, we provide the same experiment with C(θ ) 7→ I (i.e. (3.4)) Fig. 3.6 shows the discrete mean exit time is independent of sharpness whileη and∆ L show the same trend (Proposition 3). All the codes are available. * 1.0 1.2 1.4 1.6 1.8 2.0 λ max : Sharpness 2.8 3.0 3.2 3.4 3.6 log(E[ν ]) E[ν ]∼ exp(λ − 1 max ) Linear Correlation: -0.929 1.0e-03 1.3e-03 1.5e-03 1.8e-03 2.0e-03 η : Learning rate 2.7 2.8 2.9 3.0 3.1 3.2 3.3 log(E[ν ]) E[ν ]∼ exp(η − 1 ) Linear Correlation: -0.992 10 15 20 25 B: Batch size 3.10 3.15 3.20 3.25 3.30 3.35 log(E[ν ]) E[ν ]∼ exp(B) Linear Correlation: 0.83 0.4 0.6 0.8 1.0 Δ L: Depth 3.2 3.4 3.6 3.8 log(E[ν ]) E[ν ]∼ exp(Δ L) Linear Correlation: 0.985 Figure 3.5: Numerical validation of Theorem 4, where the mean exit time shows exponential dependency onλ − 1 max ,η − 1 ,B, and∆ L. The error bars indicate the standard deviation. 
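To make the measurement procedure concrete, the following NumPy sketch estimates the discrete mean exit time (Definition 3) for a toy quadratic loss, controlling sharpness by the same $\alpha$-rescaling $L(\theta)\mapsto L(\sqrt{\alpha}\,\theta)$ used above; the loss, the radius of $D$, and all hyperparameter values are illustrative stand-ins rather than the actual configuration of Section 3.5 (the released source code below contains the real experiments).

```python
import numpy as np

# Sketch: estimate the discrete mean exit time E[nu] from a ball D of radius r
# around a minimum theta* = 0, for a toy quadratic loss 0.5 * theta^T H theta.
# Sharpness is scaled via L(theta) -> L(sqrt(alpha) * theta), i.e. H -> alpha * H.
rng = np.random.default_rng(0)
d, eta, B, r = 2, 1e-2, 16, 0.05
H = np.diag([10.0, 1.0])

def mean_exit_time(alpha, n_runs=10, max_steps=100_000):
    H_a = alpha * H                        # rescaled Hessian; lambda_max -> alpha * lambda_max
    exit_times = []
    for _ in range(n_runs):
        theta = np.zeros(d)                # start at the minimum
        for k in range(1, max_steps + 1):
            W_k = rng.multivariate_normal(np.zeros(d), eta * H_a)
            theta = theta - eta * (H_a @ theta) + np.sqrt(eta / B) * W_k
            if np.linalg.norm(theta) > r:  # first step outside the neighborhood D
                exit_times.append(k * eta) # Definition 3 records k * eta
                break
    return np.mean(exit_times) if exit_times else float("nan")

for alpha in (1.0, 1.5, 2.0):              # larger alpha = sharper minimum
    print(f"alpha = {alpha}: estimated E[nu] = {mean_exit_time(alpha):.4f}")
```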
* Source code for experiments https://anonymous.4open.science/r/MSML_experiments-A040/. The same files are uploaded as a supplemental material. 33 1.0 1.2 1.4 1.6 1.8 2.0 λ max: Sharpness 47.5 48.0 48.5 49.0 E[ν ] E[ν ]⊥ ⊥λ max Linear Correlation: -0.241 1.0e-03 1.3e-03 1.5e-03 1.8e-03 2.0e-03 η : Learning rate 2.50 2.75 3.00 3.25 3.50 3.75 log(E[ν ]) E[ν ]∼ exp(η − 1 ) Linear Correlation: -0.995 1.00 1.05 1.10 1.15 1.20 Δ L: Depthm 3.85 3.90 3.95 4.00 4.05 log(E[ν ]) E[ν ]∼ exp(Δ L) Linear Correlation: 0.999 Figure 3.6: Reference experiment using the proxy system (3.4). Different from the result of SGD, the exit time has no dependency onλ max while it has exponential dependency onη and∆ L similarly to SGD. This result is aligned with Proposition 3. 3.6 Comparison with Existing Escape Analyses Study Exit Time (Order) [88] exp( 1 η ) [94] exp B η ∆ L+d [212] 1 Tr(C(θ ∗ ) − 1 H ∗ ) [145] 1 η (α − δ )/2 [203] exp B η ∆ L ¯λ − 1 Ours exp B η ∆ Lλ − 1 max ± Ξ Table 3.1: Results of the derived exit time. We only show their order by ignoring constants. Ξ = A (1+C 0 rk)λ − 1 min − λ − 1 max is the approximation error, ¯λ is some value in[λ min ,λ max ] defined in [203], andα,δ are parameters related to the tail probability [145]. We compare our analysis with the closely related existing analyses and discuss the technical differences in detail. As summarized in Table 3.1 and 3.2, we picked as closely related analysis, [88, 94, 212, 145, 203], which analyze how the SGD’s noise affect escape efficiency. Comparison on exit time From Table 3.1, we obtain three implications. (i) In all the results, either or both the learning rate η and H ∗ play an important role. (ii) There are four results where the exit time is expressed as an exponential form, and the sharpness-related valuesλ max and ¯λ appear in the results of [203] 34 Studies Exponential Sharpness No Non- Discreteness escape analysis escape paths stationary [88] √ √ √ [94] √ [212] √ √ √ [145] √ √ √ [203] √ √ Ours √ √ √ √ √ Table 3.2: Technical difference among analyses. The specific meanings of each column are described in the main passages of Section 3.6. and our study. (iii) Our study and [203] have different orders for the parameters for sharpness. This fact will be discussed in the latter half of this section. Escaping path assumption We remark that the assumptions of our theorem have an essential difference from [94] and [203]. Their analyses assume that SGD escapes along a linear path, named “escape path,” where the gradient perpendicular to the path direction is zero. Escaping path is a convenient assumption to reduce the escape analysis to one-dimensional problems. However, the existence of such paths is supported only weakly by [44], and it is unlikely that the stochastic process continuously moves linearly. The fact that we eliminated the escaping path assumptions is a substantial technical improvement. Effect of sharpness The technical significance of our theory is that it can analyze the sharpness effect. Because of its non-linearity, sharpness analyses tend to become non-trivial, thus a limited number of existing works have tackled it. Among the selected results, the sharpness effect appears in [94] and [212] as H ∗ , and in [203] as λ . We note that the results of [94] and [203] include auxiliary sharpness values, such as ¯λ ∈ [λ min ,λ max ] respectively. Those terms appear because of the escaping path assumption and our results show that those terms are not fundamental. 
35 Heavy tailed noise Among the selected works, only [145] use a heavy-tailed noise model, extiti.e. the noise whose distribution has a heavier tail than exponential distribution. Although it is known that the heavy-tailed noise models the empirical behavior of SGD well [174], it is quite difficult to mathematically formulate it. [145] use the Lévy process for their analysis, whereα represents the degree of the heavy tail, and δ ∈ (0,1) includes miscellaneous constants. Analyzing the sharpness under the heavy-tailed setup is still an open problem. 3.7 Related Work Learning Rate and Generalization Although our analysis is based on the asymptotic limit ofη →0, the practical SGD has finite learning rates [76]. There are lines of research investigating particularly on the finite learning rates. [177] have shown that the finite learning rate of SGD works as an implicit regularizer. [122] have investigated the validity of SDE approximation of SGD with finite learning rates. The differences in convergence properties among different learning rates are also an active research area [200, 121]. A shortcoming of our framework is that it cannot deal with the effect of finite learning rates. SGD’s Noise Analyzing SGD’s noise has been an appealing topic in the research community. It is known that the magnitude of the gradient noise in SGD has versatile effects on its dynamics [104] thus it has been closely investigated especially in relation to a learning rate and a batch size. An effect of large batch sizes on the reduction of gradient noise is investigated in [85, 178, 135]. Another area of interest is the shape of a gradient noise distribution. [212, 88, 35] investigated the anisotropic nature of gradient noise and its advantage. [174] discussed the fact that a gradient noise distribution has a heavier tail than Gaussian dis- tributions. [145, 175] showed the benefits of these heavy tails for SGD. [152] rigorously examined gradient noise in deep learning and how close it is to a Gaussian. [203] studied a situation where the distribution is Gaussian, and then analyzes the behavior of SGD in a theoretical way. 36 Discretization of SGD We summarize the approximation we used in Table 3.4. We used the continuous SGD (3.3) as an approximation of the discrete SGD (3.2) because (3.3) is exactly discretized to (3.2). This approximation is commonly used because it is well known that the trajectories of those two system show, so-called, “strong convergence” in the order of O( √ η ), extiti.e. E(sup 0≤ t≤ T |θ discrete k − θ kη |) = O(η 1 2 ) (see e.g. [69, 27]). We note that strong convergence validates the similarity of trajectories, but it does not necessarily guarantee the similarity of escaping behavior. Our work is the first completed argument with Lemma 3 introduced. As a final remark, (3.2) is also an approximated model of the original SGD (the dot arrow in Fig. 3.4). Although this approximation is justified via the central limit theorem [94, 79], it is admittedly heuristic and the quantitative validation for the approximation is assumed to be a (highly non-trivial) open problem. In Fig. 3.7, we provide empirical results to justify this approximation in our setup for completeness. 
1.0 1.2 1.4 1.6 1.8 2.0 λ max: Sharpness 9.0 9.5 10.0 10.5 log(E[ν ]) E[ν ]∼ exp(λ − 1 max ) Linear Correlation: -0.944 1.0e-03 1.3e-03 1.5e-03 1.8e-03 2.0e-03 η : Learning rate 2.7 2.8 2.9 3.0 3.1 3.2 3.3 log(E[ν ]) E[ν ]∼ exp(η − 1 ) Linear Correlation: -0.993 0.4 0.6 0.8 1.0 Δ L: Depth 3.2 3.4 3.6 3.8 log(E[ν ]) E[ν ]∼ exp(Δ L) Linear Correlation: 0.98 Figure 3.7: A numerical experiment as in Section 3.5, but with discrete SGD (3.2) rather than the original SGD (3.1). Although they are defined differently, they show similar trends on each parameter. This numer- ical result shows that (3.2) is a reasonable model for (3.1) for exit time analysis. 3.8 Conclusion In this work, we showed that SGD escapes from sharp minima exponentially fast even in the non-stationary regime. To obtain the result, we used the Large Deviation Theory and identified that steepness plays the key role in the exponential escape in the non-stationary regime. Our results are the novel theoretical clue to explain the mechanics as to why SGD can find generalizing minima. 37 Chapter 4 Splash in a Flash: Sharpness-aware minimization for efficient liquid splash simulation 4.1 Introduction Owing to the prevailing machine learning techniques, fluid simulation has witnessed drastic performance improvements in recent years [115, 192]. However, due to large three-dimensional datasets, the training remains a bottleneck of machine learning based fluid dynamics (e.g. Table 1 in [102]). In this work, we introduce sharpness-aware minimization (SAM)[60] to machine learning based FLIP (MLFLIP) [191] and show that SAM drastically reduces the training cost while capturing plausible behavior of liquid splashes (Fig. 4.1 and Table 4.1). 4.2 Sharpness-Aware Minimization Sharpness-aware minimization (SAM) is one of such sharpness-regularization methods proposed in the computer vision area [60]. The key component of SAM is that it minimizes “sharpness” of the model defined as max ∥ϵ ∥ 2 ≤ ρ {L(w+ϵ )− L(w)}, (4.1) 38 MFLIP + SAM (Ours) MFLIP Figure 4.1: Simulation results with the same number of training epochs. Our training scheme with sharpness-aware minimization quickly learns physical fluid dynamics (left) while the baseline result (right) with naive training scheme shows numerous unphysical liquid splashes. whereρ (the size of neighborhood) is a hyperparameter to define sharpness. While computing the sharpness directly is infeasible, it has been shown that minimizingL(w)+max ∥ϵ ∥ 2 ≤ ρ {L(w+ϵ )− L(w)} (training loss + sharpness) can be achieved through the following update rule: w =w− η ∇ w ′L w ′ | w ′ =w+ρ ∇wL(w) ∥∇wL(w)∥ (η : learning rate), (4.2) which utilizes first-order derivatives, i.e.,∇ w L(w). This allows for the optimization of sharpness without the need for computationally expensive second-order derivatives. 4.2.1 Derivation of SAM As an essential summary of [60], we briefly describe how Eq. 4.2 amounts to ends up minimizing sharpness. Consider the differentiation of sharpness Eq. 4.1, i.e, ∇ max ∥ϵ ∥ 2 ≤ ρ b L(w+ϵ ), 39 or equivalently, the following form, ∇ b L(w+ϵ ∗ ) withϵ ∗ =argmax ∥ϵ ∥ 2 ≤ ρ b L(w+ϵ ) By the first-order Taylor approximation, b L(x,w+ϵ ) can be approximated to b L(w)+ϵ ⊤ ∇ b L(w) and ϵ ∗ is reduced to the solution of the linear minimization problem. ϵ ∗ =arg max ∥ϵ ∥≤ ρ b L(w+ϵ )≈ argmax ∥ϵ ∥≤ ρ b L(w)+ϵ ⊤ ∇ b L(w) =argmax ∥ϵ ∥≤ ρ ϵ ⊤ ∇ b L(w) Thus, the approximatedϵ ∗ can be obtained as follows. 
$$\epsilon^* \approx \rho\,\frac{\nabla \hat{L}(w)}{\|\nabla \hat{L}(w)\|}$$

With this approximated $\epsilon^*$, and omitting second-order derivative terms as below,

$$\nabla \hat{L}(w+\epsilon^*) = \nabla_w(w+\epsilon^*)\,\nabla_{w'} \hat{L}(w')\big|_{w'=w+\epsilon^*} \approx \nabla_{w'} \hat{L}(w')\big|_{w'=w+\rho\,\nabla \hat{L}(w)/\|\nabla \hat{L}(w)\|},$$

Eq. 4.2 is obtained, which in approximation reduces the sharpness (Eq. 4.1) at each step.

4.3 MLFLIP with SAM

MLFLIP is a data-driven splash generation model proposed in [191], which adopts machine learning to predict the locations of fluid splashes instead of using a high-resolution FLIP scheme [211]. The MLFLIP procedure consists of several steps: synthetic data generation; feature engineering; and building a suitable neural network architecture followed by training and evaluation. In the following subsections, we provide a detailed description of each step.

Figure 4.2: Neural network layout used in this study. The detachment classifier determines whether a splash has been formed or not; for the inputs that are classified as splashes, the velocity-modification component of the network predicts the mean and the variance of the probability distribution for changes in the velocity.

Data Generation The training data are generated through multiple high-resolution FLIP simulations. These simulations are initialized with random values for the number of droplets and their positions and velocities to ensure sufficient variance in the generated data. From these simulations, we extract a feature vector $x$ consisting of 108 components: $27\times 3$ velocity values and $27\times 1$ level-set values. In all, we have used $10^6$ such samples from 16 simulations using a grid spacing of 5 mm, with an even distribution of splashing and non-splashing particles for neural network training.

Neural Network Architecture The input to our network is a feature vector $x$, which carries the information about the flow at a particular position. From this, the neural network predicts two components: 1) detachment classification, which determines whether the region will detach and form a splash or not; and 2) velocity modification, which determines the velocity change for a splash with respect to the fluid motion, as shown in Figure 4.2. To estimate the velocity modification, we predict both the mean and the variance of the velocities. The neural network has a separate 64-neuron hidden layer with 10% dropout for each of these prediction values, with tanh as the nonlinear activation function, followed by a batch-normalization layer. These are in turn connected to the single output neuron. The activation for the detachment-classifier output neuron is sigmoid. Our neural network takes the feature vector $x=\{x_1, x_2, \ldots, x_n\}$ and estimates the probability of it being a splash. Hence, we maximize the likelihood, i.e., minimize the negative log-likelihood

$$L_d(\hat{y}\,|\,x) = -\sum_{i=1}^{n} \log P(\hat{y}_i\,|\,x_i) \qquad (4.3)$$

where $\hat{y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n\}$ are the actual splash indicator values. We use the cross-entropy loss for this classification part of our neural network. For the velocity change in the droplet, we assume it follows a normal distribution relative to the mean flow of the fluid,

$$f_v(\Delta v_i\,|\,x_i) \sim \mathcal{N}(\Delta v_i\,|\,\mu_i, \sigma_i^2) \qquad (4.4)$$

where $\mu_i$ and $\sigma_i^2$ denote the mean and the variance, respectively. Thus, we minimize the loss function $L_v$ calculated as

$$L_v(\Delta v\,|\,x) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{d}\left[\frac{(\Delta v_{i,j}-\mu_{i,j})^2}{\sigma_{i,j}^2} + \ln \sigma_{i,j}^2\right] \qquad (4.5)$$

where $j$ is the spatial index. We apply SAM to the overall loss function $L_w := L_d + L_v$, using Eq. 4.2.
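For concreteness, the following PyTorch-style sketch illustrates how a single training step implementing the two-pass update of Eq. 4.2 can be organized around a base optimizer; our actual implementation uses the TensorFlow backend described below, so `sam_step`, `loss_fn`, and the other names here are placeholders rather than our code.

```python
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One SAM update (Eq. 4.2): perturb the weights to w + rho * g / ||g||,
    take the gradient there, then update the original weights w."""
    # First pass: gradient of the loss at the current weights w.
    base_optimizer.zero_grad()
    loss_fn(model, batch).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params)) + 1e-12

    # Climb to the approximate worst-case neighbor w + eps*, eps* = rho * g / ||g||.
    eps_list = []
    with torch.no_grad():
        for p in params:
            eps = rho * p.grad / grad_norm
            p.add_(eps)
            eps_list.append(eps)

    # Second pass: the gradient at the perturbed weights drives the update of w.
    base_optimizer.zero_grad()
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, eps in zip(params, eps_list):
            p.sub_(eps)            # restore w before stepping
    base_optimizer.step()          # w <- w - eta * grad L(w + eps*)
```

In practice, the neighborhood size `rho` is a tunable hyperparameter, and the base optimizer can be any first-order method (Adam in our case, as described next).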
The neural network is trained by minimizing this loss using Adam optimizer [103] with a learning rate of 10 − 4 and exponential decay rates of the first moment ( β 1 ) and second moment (β 2 ) as 0.9 and 0.999 respectively. The models were implemented with Tensorflow (2.5.0) backend [134] having GPU support. 42 Epochs to converge Wall-clock time MLFLIP 320 epochs 2144 sec MLFLIP + SAM (Ours) 50 epochs 569 sec Table 4.1: Performance comparison of ML based FLIP (MLFLIP) [191] without and with SAM optimiza- tion. Our training scheme drastically improves convergences speed both in number of epochs and wall-clock time while it simulates plausible liquid splash effect (Fig. 4.1 left) 4.4 Conclusion We trained two instances of the neural network: 1) MLFLIP, reproduced from [191]; and 2) MLFLIP+SAM that utilizes SAM optimization for MLFLIP. Weights of both neural networks were initialized with normally distributed random values. The neural networks then iteratively learn to capture realistic behavior of the fluids and gradually converge to faithfully represent the underlying physics of droplet formation. The number of epochs and wall-clock time taken by both neural networks are shown in Table 4.1. Optimization with SAM helps the model to rapidly converge, achieving a speedup of 3.76x over MLFLIP while showing visual plausibility (Fig. 4.1 left). These promising results indicate the potential of SAM for fluid simulation and as a future direction, we will explore use of SAM for broader application of fluid dynamics such as pressure solvers and generative fluid models. 43 Chapter 5 Allegro-Legato: Scalable, Fast, and Robust Neural- Network Quantum Molecular Dynamics via Sharpness- Aware Minimization 5.1 Introduction Neural-network quantum molecular dynamics (NNQMD) simulations based on machine learning are revolu- tionizing atomistic modeling of materials by following the trajectories of all atoms with quantum-mechanical accuracy at a drastically reduced computational cost [60]. NNQMD not only predicts accurate interatomic forces but also captures quantum properties such as electronic polarization [109] and electronic excita- tion [124], thus the ‘Q’ in NNQMD. NNQMD represents one of the most scalable scientific applications on the current high-end supercomputers, evidenced by ACM Gordon Bell prize winner in 2020 [95] and finalist in 2021 [146]. A more recent breakthrough in NNQMD is drastically improved accuracy of force prediction [11] over those previous models, which was achieved through rotationally equivariant neural net- works based on a group theoretical formulation of tensor fields [184]. The state-of-the-art (SOTA) accuracy has now been combined with a record speed based on spatially localized descriptors in the latest NNQMD model named Allegro (meaning fast) [142]. Despite its remarkable computational scalability, massively parallel NNQMD simulation faces a major unsolved issue known as fidelity scaling [158]. In large-scale NNQMD simulations, small prediction errors 44 Figure 5.1: Number of outliers in atomic force inference during NNQMD simulation: As the simulation progresses, the dynamic of atoms becomes unstable due to an increasing number of unphysically large force values (over 5σ ) predicted by the original Allegro model. This resulted in the eventual failure after 2.6× 10 6 MD steps (red). On the other hand, the proposed model (Allegro-Legato) maintains a nearly constant number of outliers and the simulation stable (blue). 
can propagate and lead to unphysical atomic forces that degrade the accuracy of atomic trajectories over time. These force outliers can even cause the simulation to terminate unexpectedly (Fig. 5.1). As simulations become spatially larger and temporally longer, the number of unphysical force predictions is expected to scale proportionally, which could severely limit the fidelity of NNQMD simulations on new exascale supercomputing platforms, especially for the most exciting far-from-equilibrium applications [124, 138].

In this work, we solve the fidelity-scaling issue by taking a cue from a recent development in machine learning. Solving the fidelity-scaling issue requires robustness of the NNQMD model, i.e., a reduced number of unphysical force-prediction outliers when simulation trajectories encounter atomic configurations outside the training dataset. It has been observed that the robustness of a neural-network model can be enhanced by sharpness-aware minimization (SAM) [60] — a training algorithm that regularizes the sharpness of the model (i.e., the curvature of the loss surface) along with its training loss. We thus apply SAM to train the fast Allegro model to smoothen its loss landscape, thereby enhancing its robustness. The resulting Allegro-Legato (meaning fast and "smooth") model is shown to increase the time-to-failure $t_{\mathrm{failure}}$, i.e., how many MD steps an NNQMD simulation can run under the microcanonical ensemble, while maintaining the same inference speed and nearly equal accuracy. Specifically, Allegro-Legato exhibits a much weaker dependence of time-to-failure on the problem size, $t_{\mathrm{failure}} \propto N^{-0.14}$ ($N$ is the number of atoms), compared to the SOTA Allegro model, $t_{\mathrm{failure}} \propto N^{-0.29}$, thus allowing much larger and longer NNQMD simulations without failure. Along with this main contribution, we find that the fidelity-scalability of the NNQMD model correlates with the sharpness of the model more than with the number of parameters in the model.*

* Code is available at github.com/ibayashi-hikaru/allegro-legato

The fast and robust Allegro-Legato model has been implemented in our scalable parallel NNQMD code named RXMD-NN. We have achieved a weak-scaling parallel efficiency of 0.91 on 480 computing nodes, each with an AMD EPYC central processing unit (CPU) and four NVIDIA A100 graphics processing units (GPUs), of the Polaris supercomputer at Argonne Leadership Computing Facility (ALCF). The code has also achieved a 7.6-fold single-node performance acceleration using four GPUs over a single 32-core CPU of Polaris.

Allegro-Legato allows much larger spatio-temporal-scale NNQMD simulations than are otherwise possible. Unlike MD simulation with a heat bath, often used for "effective" long-time sampling of molecular configurations (e.g., for protein folding) but which disrupts dynamic trajectories, Allegro-Legato enables "true" long-time Hamiltonian dynamics that can be directly compared with fine vibrational modes observed in high-resolution spectroscopic experiments. Specifically, we can now satisfy the prohibitive computational demand of accounting for subtle nuclear quantum effects in the dynamics of ammonia based on path-integral molecular dynamics, which is essential for resolving a mystery in a recent high-resolution neutron-scattering experimental observation at Oak Ridge National Laboratory. Synergy between the most advanced neutron
5.2 Method Innovation This section first summarizes (1) NNQMD simulation method, along with the SOTA Allegro model, and (2) SAM for robust neural-network model training. We then present the key method innovation of SAM- enhanced Allegro model, Allegro-Legato, followed by its scalable parallel implementation. 5.2.1 Neural-Network Quantum Molecular Dynamics with Allegro In the SOTA Allegro model, the energyE is composed of pairwise embedding energies,E ij , between atomic pairs(i,j) within a finite cutoff distance to preserve data locality [142]. Key to the high accuracy of Allegro is that all energy terms are group-theoretically equivariant with respect to rotation, inversion and translation, i.e., to the Euclidean groupE(3) [11, 184]. This is achieved by representing the energy in terms of tensors up to rank ℓ and tensor products using their irreducible representations. In short, Allegro attains accuracy through group-theoretical equivariance and computational speed through data locality. 5.2.2 Fidelity-scalability and Adversarial Robustness Neural networks are trained by minimizing the loss functionL(w) wherew represents the weight parame- ters of the neural network. Design choice of optimization methods plays a crucial role in machine learning, as it impacts various factors such as convergence speed and generalization performance [165]. In particular, vulnerability to adversarial attacks is a problem unique to neural networks [73], which has actively been studied in various fields such as computer vision [171] and natural language processing [95]. Recent stud- ies suggest that the fidelity-scalability in NNQMD can also be viewed as a robustness against “adversarial 47 attacks” during large-scale simulations [168, 33], where atomic trajectories are “attacked” by the accumu- lated unphysical predictions, i.e., “adversarial perturbations” throughout the long and large-scale simulation. Therefore, it is natural to expect that optimization methods for adversarial attack would enhance the fidelity- scalability in NNQMD. 5.2.3 Key Innovation: Allegro-Legato: SAM-Enhanced Allegro As explained above, our hypothesis is that smoothened loss landscape through SAM enhances fidelity scal- ing of NNQMD. To quantitatively test this hypothesis, we incorporate SAM into the training of the Allegro NNQMD model [142], which entails SOTA accuracy and computational speed. We call the resulting SAM- enhanced Allegro model as Allegro-Legato. † To find an appropriate strength of sharpness regularization, SAM’s hyperparameter ρ is tuned so as to provide the most robust model, from which we found thatρ =0.005 gives the longestt failure in our setup. For the small-scale simulation test, we used the LAMMPS, which is a widely used open-source MD simulation software (lammps.org). See Section 5.4.5 for the detailed training settings. ρ 0.0 0.001 0.0025 0.005 0.01 0.025 0.05 t failure 4020 4030 6420 8480 4760 4210 3780 Table 5.1: SAM strengthρ vs. time-to-failuret failure : We tuneρ by conducting a grid search in the range of 0.001 to 0.05. A model withρ =0.005 gives the largestt failure with a small-scale simulation(N =432). 5.2.4 RXMD-NN: Scalable Parallel Implementation of Allegro-Legato NNQMD For the large-scale testing of computational and fidelity scaling, we implement the proposed Allegro-Legato NNQMD model in our RXMD-NN software [124, 158], which is an extension of our scalable parallel reactive MD software, RXMD [148]. 
RXMD-NN employs a hierarchical divide-and-conquer scheme to realize “globally-scalable and local-fast” (or “globally-sparse and locally-dense”) parallelization [149]: (1) † In music, Legato means smooth without sudden breaking between notes. 48 globally scalable spatial decomposition that is best suited for massively parallel computing platforms; and (2) locally efficient linked-list decomposition and subsequent neighborlist construction to achieve the O(N) computational complexity. Interprocess communication is implemented using non-blocking application pro- gramming interfaces (APIs) of Message Passing Interface (MPI) library, and the communication pattern is designed to be lock-free with minimal internode-data exchange. While it is one of the most widely adapted strategies in large-scale MD applications, this is particularly suitable for NNQMD algorithm to take advan- tage of the modern high-performance computing (HPC) architecture, in which a few very powerful GPU cards do the heavy lifting by accelerating computationally demanding kernels while random memory access and out-of-order data processing are concurrently executed by many-core CPUs. In RXMD-NN, CPU is re- sponsible for the adjacency-list construction in parallel. The constructed adjacency list, together with atom position and type information, is converted to PyTorch tensor object for force inference on GPUs. RXMD- NN allows to control the computational granularity, such as the number of atoms per domain and domains per node, to find an ideal balance between horizontal and vertical scalability to utilize available hardware resources. PyTorch has become a standard Python library in machine learning community due to its APIs for com- plex model architectures that enables highly efficient training and inference on GPU. However, production platforms such as HPC clusters, mobile devices, and edge nodes often demand a set of requirements that Python is not designed for, e.g., multithreading, low latency computing, and massively parallel distributed architectures. GPU Offloading of Allegro model is realized by TorchScript, which is statically typed inter- mediate representation to create serialized and optimizableML models. The serialized model can be loaded from other programming language such as C++ allowing to be deployed in environments that are difficult for python codes to run without sacrificing multithreading and optimization opportunities. 49 5.3 Results We test both fidelity and computational scalability of the proposed Allegro-Legato NNQMD model as imple- mented in the RXMD-NN code on a leadership-scale computing platform, Polaris, at Argonne Leadership Computing Facility (ALCF). 5.3.1 Experimental Platform We conduct numerical experiments on the Polaris supercomputer at ALCF. It is a Hewlett Packard Enterprise (HPE) Apollo 6500 Gen 10+ based system consisting of two computing nodes per chassis, seven chassis per rack, and 40 racks, with a total of 560 nodes. Each Polaris node has one2.8GHz AMD EPYC Milan 7543P 32-core CPU with 512 GB of DDR4 RAM, four NVIDIA A100 GPUs with 40GB HBM2 memory per GPU, two 1.6 TB of SSDs in RAID0 and two Slingshot network endpoints. Polaris uses the NVIDIA A 100 HGX platform to connect all 4 GPUs via NVLink, with a GPU interconnect bandwidth of 600GB/s. Designed by Cray, the Slingshot interconnect is based on high radix 64-port switches arranged in dragonfly topology, offering adaptive routing, congestion control and bandwidth guarantees by assigning traffic classes to appli- cations. 
Polaris is rated at a production peak performance of 44 petaflops with node-wise performance at 78 teraflops for double precision. 5.3.2 Fidelity-Scaling Results For the fidelity-scaling test, we trained Allegro and Allegro-Legato with ℓ=1 and examined their robustness in terms of t failure , i.e. the greater t failure , the more robust. The parameters of MD simulation for the test are carefully chosen so that each MD simulation is expected to fail within a reasonable time but not immediately. While the constant-temperature ensemble method based on Nose-Hoover thermostat (i.e., NVT ensemble) is used to study thermal-equilibrium properties, it could suppress and hidden unphysical model predictions by connecting atoms with an external thermostat. Microcanonical ensemble (NVE) method is the most rigorous 50 test on the model robustness by simply integrating the equations of motion without an external control (also it has broader applicability to non-equilibrium processes). In each simulation instance, the liquid ammonia system is first thermalized at a temperature of 200K using NVT ensemble for 1,000 steps. We subsequently switch the ensemble to NVE and continue the simulation until it fails to determinet failure . The time step∆ t of 2 femto-seconds (fs) is chosen throughout the robustness test. For each system size, over ten independent simulation instances are averaged to measuret failure . Figure 5.2: Fidelity scaling of NNQMD simulation: Here,t failure is measured using NVE ensemble with a timestep of 2fs. Statistically improved t failure is observed in even the smallest system size, which is further pronounced as the system size increases. The exponent of power law fitting shows nearly a factor of two reduction using Allegro-Legato model. To quantify fidelity scaling, we define a fidelity-scaling exponent N − β through the scaling relation, t failure =αN − β , (5.1) where α is a prefactor. A smaller β value (i.e., weaker fidelity scaling) indicates delayed time-to-failure, thus a capability to study larger spatiotemporal-scale processes accurately on massively parallel computers. 51 The Allegro-Legato model has drastically improved fidelity scaling, β Allegro-Legato = 0.14 < β Allegro = 0.29 beyond statistical uncertainty (see the error bars in Fig. 5.2), thus systematically delaying time-to-failure. 5.3.3 Computational-Scaling Results We measure the wall-clock time per MD step with scaled workload, 6,912P -atom ammonia system onP MD domains. In this test, each MD domain consists of 6,912 atoms that are offloaded to single GPU. In addition to the force inference, the execution time includes the adjacency list construction, data transfer between host and GPU memory, and internode communication via network fabric. Fig. 5.3 shows wall- clock time as a function ofP . By scaling the problem size linearly with the number of GPUs, the runtime increases only slightly, indicating an excellent scalability. Here, we quantify the parallel efficiency by defining the speed of NNQMD algorithm as the product of the total number of atoms multiplied by the number of MD steps executed per second. The isogranular speedup is given by the speed on PMD domains relative to the speed of single domain as baseline. The parallel efficiency of weak scalability thus is obtained by the isogranular speedup divided by P . With the granularity of 6,912 atoms per domain, we have obtained an excellent weak-scaling efficiency, 0.91 for up to 13,271,040 atoms on 1,920 A100 GPUs. 
Despite the relatively large granularity of 6,912 atoms per domain, we obtained a fast time-to-solution of 3.46 seconds per MD step, enabling 25,000 MD steps per day for production runs.

Figure 5.3: Wall-clock time of the RXMD-NN code per MD step, with scaled workloads of 6,912P-atom ammonia liquid using P A100 GPUs (P = 1, ..., 1,920).

Fig. 5.4 shows GPU acceleration of the NNQMD algorithm on a single Polaris node. The histogram presents the reduction in wall-clock time per MD step (averaged over 10 MD steps) using the runtime obtained with CPU only (32 cores with 32 threads) as the baseline. Here, we examined: (1) three system sizes of N = 1,728, 6,912, and 13,824 ammonia atoms; and (2) three domain decompositions, i.e., single, double, and quadruple subdomains. Atoms in each domain are assigned to one GPU. With the N = 1,728 system, we observe a marginal GPU acceleration of up to 1.24x speedup, which is substantially improved for greater system sizes. We have achieved a 7.6x speedup using the N = 13,824 atom system with four subdomains.

5.4 Discussion

While the SAM-enhanced Allegro model, Allegro-Legato, achieves improved robustness over the SOTA Allegro model as shown in the previous section, we here discuss the implications of SAM training for other aspects such as accuracy and computational speed.

Figure 5.4: GPU acceleration of the NNQMD algorithm: Three system sizes of N = 1,728, 6,912, and 13,824 atoms are examined. The histogram presents the reduction in wall-clock time per MD step over the runtime with 32 CPU cores without GPU as the reference. Details of the benchmark platform as well as the GPU and CPU architectures are presented in the main text. We have achieved a 7.6x speedup using four GPUs with N = 13,824 atoms.

5.4.1 Simulation Time

First, MD simulation time is not affected by SAM, since SAM only applies to the training stage but not to the inference stage in MD simulation. Table 5.2 compares the simulation time per MD time step for the baseline Allegro model and the proposed Allegro-Legato model. Hereafter, we use the default value, ℓ = 1, for the maximum tensor rank, thus the same number of parameters for the two models. The simulation time is identical for both models within the measurement uncertainty due to nondedicated access to the experimental platform. As a comparison, Table 5.2 also shows the baseline Allegro model with two other tensor ranks, ℓ = 0 and 2. Larger ℓ generates more accurate but larger models (i.e., larger numbers of parameters) and hence incurs longer simulation times. Based on the accuracy/computational-cost trade-off, production NNQMD simulations with the Allegro model typically use ℓ = 1.

Model              # of parameters    Time/step (ms)
Allegro            133,544            916
Allegro-Legato     133,544            898
Reference models
Allegro (ℓ = 0)    95,656             395
Allegro (ℓ = 2)    183,720            2,580

Table 5.2: Simulation-time comparison: As SAM only applies to the training stage and does not modify the size of the architecture, the computational cost for simulation is not affected.

5.4.2 Training Time

SAM's shortcoming is that it requires more computation time than the base optimizer, because each epoch has to compute the first-order gradients twice. However, in our setting, SAM converges faster than the default optimizer, and thus the total training time is not significantly affected (Table 5.3). As references, we also measured the training time of Allegro models with different maximum tensor ranks, ℓ = 0 and 2, and observed that the training cost increases drastically for larger ℓ.
In summary, Allegro-Legato improves the robustness of Allegro without incurring extra training cost. Model Total time (hours) Per-epoch time (seconds) Epochs Allegro 11.1 248 161 Allegro-Legato 13.6 433 113 Reference Models Allegro(ℓ=0) 4.4 127 127 Allegro(ℓ=2) 19.6 636 111 Table 5.3: Training-time comparison: Although SAM takes longer per-epoch training time, it converges faster and thus does not significantly affect total training time. Compared to the reference training times of variations of Allegro models, the extra training cost of Allegro-Legato is negligible. 55 5.4.3 Model Accuracy While faithful reproduction of system energy is necessary to properly guide model training, the most crucial to MD simulations is accurate force prediction. We obtained the validation error in atomic force as 15.9 (root mean square error, RMSE) and 11.6 (mean absolute error, MAE) with Allegro-Legato(ℓ = 1) model, and 14.7 (RMSE) and 10.7 (MAE) with the original Allegro model(ℓ=1), respectively. All error values are in a unit ofmeV/. Chmiela et al. recently provided a guideline that MAE required for reliableMD simulations is 1kcal/mol/, which corresponds to 43.4meV/ [30]. Although Allegro-Legato incurs a slight increase in the force prediction error (about 8% in the liquid ammonia dataset) compared to the original Allegro model, the obtained force error is about a factor four smaller than the guideline for reliably performing MD simulations. Namely, Allegro-Legato improves the robustness without sacrificing accuracy. 5.4.4 Implicit Sharpness Regularization in Allegro While we propose to explicitly control the sharpness of models, we found that one control parameter in the baseline Allegro model (i.e., maximum rank of tensors to represent features) implicitly regulate the sharpness of the model. In Table 5.4, besides our Allegro-Legato model having smaller sharpness, Allegro ℓ = 1,2 models have significantly smaller sharpness and higher t failure compared to Allegro ℓ = 0 model. Namely, Allegro with higher ℓ implicitly regularizes sharpness, resulting in higher robustness (i.e., larger t failure ), but with increasing computational cost. Allegro-Legato (ℓ = 1) model achieves the same level of sharpness as Allegro(ℓ=2) model with much less computing time; see Tables 5.2 and 5.3. Model Allegro(ℓ=0) Allegro(ℓ=1) Allegro(ℓ=2) Allegro-Legato (ℓ=1) Sharpness 5.0× 10 − 4 3.2× 10 − 4 9.8× 10 − 5 1.2× 10 − 4 Table 5.4: Implicit sharpness regularization by Allegro: While our Allegro-Legato model has smaller sharpness than Allegro, Allegro models with largerℓ have progressively smaller sharpness. Here, we mea- sure sharpness,max ∥ϵ ∥ 2 ≤ ρ {L(w+ϵ )− L(w)}, by taking maximum of 1,000 independent random samples around the 0.05-neighborhood of each minimum. 56 Fig. 5.5 visualizes the loss surface of Allegro (ℓ=0,1, and 2) and Allegro-Legato (ℓ = 1) models. The figure confirms: (1) progressive smoothening ( i.e., smaller sharpness) for largerℓ within the Allegro model due to implicit regularization through accuracy but with increasing computational cost; and (2) explicit smoothening of Allegro-Legato through SAM over Allegro with the same ℓ without extra computational cost. Figure 5.5: Loss surface visualization: One dimensional visualization of loss surface of each model. Fol- lowing the definition of sharpness (Eq. 4.1), we randomly sample a vector, d, that gives the sharpness direction to computeL(w+pd) forp∈[− 1,1]. 
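For reference, the following PyTorch-style sketch mirrors the random-sampling sharpness estimate used for Table 5.4 (maximum loss increase over perturbations in a ρ-neighborhood); the helper names, the way the loss is evaluated, and the sampling on the sphere of radius ρ are placeholder assumptions for illustration, not the actual Allegro-Legato measurement code.

```python
import torch

@torch.no_grad()
def estimated_sharpness(model, loss_fn, batch, rho=0.05, n_samples=1000):
    """Estimate max_{||eps||_2 <= rho} [L(w + eps) - L(w)] by random sampling,
    as in Table 5.4 (many random directions in the rho-neighborhood of w)."""
    params = list(model.parameters())
    base_loss = loss_fn(model, batch).item()
    sharpest = 0.0
    for _ in range(n_samples):
        # Draw a random direction and scale it to length rho.
        eps = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((e ** 2).sum() for e in eps)) + 1e-12
        eps = [rho * e / norm for e in eps]
        for p, e in zip(params, eps):      # perturb w -> w + eps
            p.add_(e)
        sharpest = max(sharpest, loss_fn(model, batch).item() - base_loss)
        for p, e in zip(params, eps):      # restore the original weights
            p.sub_(e)
    return sharpest
```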
5.4.5 Training Details

Lastly, we provide the detailed training configuration for completeness (Table 5.5). For a fair comparison, we used the default hyperparameters released with the SOTA model, and SAM training uses the default optimizer as its base optimizer.

Material type                              Liquid NH3
Number of atoms per configuration          432
# of training examples (N_training)        4,500
# of validation examples                   500
r_max for cutoff                           6.0
Maximum tensor rank (ℓ)                    1
Batch size                                 4
Peak learning rate                         2e-3
Learning rate decay                        ReduceLROnPlateau
Learning rate scheduler patience           50
Learning rate scheduler factor             0.5
(Base) optimizer                           Adam
Adam's (β1, β2)                            (0.9, 0.999)
Loss function                              Per-atom MSE
Loss coefficients (force, total energy)    (1.0, 1.0)
Stopping criterion                         ΔL_validation ≤ 3e-3 for 100 epochs

Table 5.5: Detailed training setting: All training setups in this work adopt these parameters unless otherwise noted.

5.5 Applications

The improved robustness of the proposed Allegro-Legato model, while preserving the SOTA accuracy and computational speed of Allegro, enables large spatio-temporal-scale NNQMD simulations on leadership-scale computers. A compelling example is the study of the vibrational properties of ammonia. Development of dynamical models that accurately reproduce the vibrational spectra of molecular crystals and liquids is vital for predicting their thermodynamic behavior, which is critical for their applications in energy, biological, and pharmaceutical systems [86]. In particular, there has been growing development of green ammonia-based fuel technologies for a sustainable society over the past few years. Ammonia (NH3) has a higher energy density than even liquid hydrogen, but ammonia can be stored at a much less energy-intensive −33 °C versus −253 °C, and thanks to a century of ammonia use in agriculture, a vast ammonia infrastructure already exists [24]. Over 180 million metric tons of ammonia are produced annually, and 120 ports are equipped with ammonia terminals [24]. Development of technologies based on ammonia will rely on our ability to understand and model the complex physical and chemical interactions that give rise to its unique properties.

There are multiple complicating factors that require careful consideration, such as nuclear quantum effects (NQEs) and their coupling with vibrational anharmonicity, when developing computational frameworks that accurately describe vibrational properties [86]. Standard first-principles calculations for vibrational properties treat only the electrons quantum mechanically, and vibrational properties can then be determined by Fourier transform and matrix diagonalization of the unit-cell Hessian, which has at most on the order of a few hundred entries [186]. Evaluating the role of NQEs and their coupling with vibrational anharmonicity is done in the so-called path-integral MD (PIMD) approach, which samples the quantum partition function for the entire quantum system [59, 164]. This requires long-time simulations of a large number of replicas of large MD systems that are harmonically coupled to each other as interacting ring-polymers, especially at low temperatures [59, 164]. The background of Fig. 5.6a shows a typical first-principles-based simulation, where the atoms are treated classically and the electron charge density is treated quantum-mechanically to compute atomic forces, which is illustrated as blue iso-surfaces.
In the foreground, we have highlighted one NH3 molecule from a PIMD simulation of the same atomic configuration, where each atom has 32 replicas that are harmonically coupled together. The computation of the replica simulations is embarrassingly parallel, with only fixed nearest-replica communication, and the major cost is computing the energy and forces for the atoms within each replica simulation, which is typically done from first principles. However, our Allegro-Legato model with enhanced robustness allows for stable long-time MD simulations at near-quantum accuracy, and thus can replace the expensive first-principles calculations in the PIMD simulations that would otherwise make accurate evaluation of ammonia's low-energy intermolecular vibrational modes intractable. We have performed massively parallel PIMD simulations with our Allegro-Legato model, computing the energy and forces within each replica simulation to evaluate the phonon spectra for the inter-molecular modes of ammonia. The Allegro-Legato model is found to produce the expected softening of high-energy modes at finite temperature with the inclusion of nuclear quantum effects, in comparison to standard matrix diagonalization within the harmonic approximation, as illustrated in Fig. 5.6b. In particular, the reduction of the energy of the vibrational modes in the 30-90 meV range is consistent with high-end neutron experiments on the vibrational spectrum performed by the authors at Oak Ridge National Laboratory last summer (these results will be published elsewhere).

Figure 5.6: Computed vibrational spectra of ammonia: (a) While a typical first-principles simulation treats atoms classically and electrons quantum-mechanically, a PIMD simulation uses multiple replicas of each atom to mimic nuclear quantum effects (NQEs). (b) The top curve shows the vibrational spectrum computed at zero temperature without NQEs, while the bottom curve is at finite temperature with Allegro-Legato PIMD simulation. With the inclusion of NQEs, Allegro-Legato PIMD correctly shows the softening of high-energy inter-molecular modes expected at finite temperature and explains high-end neutron-scattering observations.

5.6 Related Work

There has been an explosion in the development and application of NNQMD simulations [12, 109, 124, 11, 142] and their scalable parallel implementations [95, 146]. On the other hand, it was only recently that the robustness of NNQMD was quantified in terms of the time-to-failure t_failure [64], and its reduction with increasing problem size (i.e., the fidelity-scaling problem) was pointed out [158]. This work is the first to: (1) formally quantify fidelity scaling by introducing the fidelity-scaling exponent β through t_failure ∝ N^(−β) (N is the number of atoms); and (2) propose a solution to the fidelity-scaling problem using sharpness-aware minimization.

Robustness against adversarial attacks is a central and widely studied issue in machine learning [95, 73, 171]. Compared to typical adversarial attacks, it is nontrivial to generate adversarial perturbations for NNQMD. This is because the attack we consider targets not only the accuracy of the model but also its time to failure (t_failure), which can only be determined through long-time simulations [168, 33]. Generative adversarial networks (GANs) are one possible approach for sampling molecular configurations in a learning-on-the-fly setting [99].
However, we remark that the real strength of MD simulation is its ability to compute dynamic correlations that can directly explain high-resolution spectroscopic experiments, which requires a long, uninterrupted Hamiltonian trajectory, to which adversarial networks are generally not applicable. In this domain, Allegro-Legato thus provides a unique solution.

5.7 Conclusion

We have introduced the proposed SAM-based solution to the fidelity-scaling problem into the Allegro NNQMD model [142], which represents the state of the art in accuracy and speed. The resulting Allegro-Legato model has drastically improved fidelity scaling by exhibiting a significantly lower exponent, β_Allegro-Legato = 0.14 < β_Allegro = 0.29, thus systematically delaying time-to-failure. Such improved fidelity scaling is central to ensuring that meaningful scientific knowledge is extracted from large-scale simulations on leadership-scale parallel computers. Our scalable parallel implementation of Allegro-Legato, with excellent computational scaling and GPU acceleration, combines accuracy, speed, robustness, and scalability, thus allowing practical large spatiotemporal-scale NNQMD simulations for challenging applications on exascale computing platforms.

Chapter 6

Conclusions

This thesis has made several significant contributions to the understanding and application of sharpness in deep learning. By proposing a novel, scale-invariant definition of sharpness, we have addressed a critical limitation in the current state of sharpness analysis, providing a more objective and accurate characterization of this important property. Our extensive empirical validation further strengthens the reliability of the proposed definition.

Our investigation into the relationship between training algorithms and sharpness regularization has shed light on the "escaping" behavior, which demonstrates how existing training methods implicitly regularize sharpness. This discovery demystifies the connection between sharpness and training algorithms, offering valuable insights for the development of more effective and principled approaches to machine learning.

Moreover, we have shown the practical utility of sharpness regularization in the context of physics simulations, specifically in high-fidelity fluid simulation and molecular dynamics. These results underscore the fact that sharpness is not only a theoretical concept but also a useful tool for designing robust and generalizable neural networks.

This dissertation has deepened our understanding of sharpness in neural networks, bridging the gap between theoretical insights and practical applications. Our findings have the potential to inform future research in deep learning and inspire novel techniques for the development of more accurate, efficient, and robust models. As deep learning continues to play an increasingly important role in various fields, the practical and theoretical implications of our work on sharpness will undoubtedly contribute to the ongoing advancement of the field.

Bibliography

[1] Mridul Aanjaneya, Ming Gao, Haixiang Liu, Christopher Batty, and Eftychios Sifakis. “Power Diagrams and Sparse Paged Grids for High Resolution Adaptive Liquids”. In: ACM Trans. Graph. 36.4 (July 2017).

[2] P-A Absil and K Kurdyka. “On the stable equilibrium points of gradient systems”. In: Syst. Control Lett. 55.7 (July 2006), pp. 573–577.

[3] Bart Adams, Mark Pauly, Richard Keiser, and Leonidas J. Guibas. “Adaptively Sampled Particle Fluids”. In: ACM Trans. Graph. 26.3 (July 2007).
ISSN: 0730-0301. DOI: 10.1145/1276377.1276437. [4] Pierre Alliez, David Cohen-Steiner, Mariette Yvinec, and Mathieu Desbrun. “Variational Tetrahedral Meshing”. In: ACM Trans. Graph. 24.3 (July 2005), pp. 617–625. ISSN: 0730-0301. DOI: 10.1145/1073204.1073238. [5] Ryoichi Ando, Nils Thuerey, and Chris Wojtan. “Highly Adaptive Liquid Simulations on Tetrahedral Meshes”. In: ACM Trans. Graph. (Proc. SIGGRAPH 2013) (July 2013). [6] Ryoichi Ando, Nils Thürey, and Chris Wojtan. “A Dimension-reduced Pressure Solver for Liquid Simulations”. In: Computer Graphics Forum. V ol. 34. 2. Wiley Online Library. 2015, pp. 473–480. [7] Vinicius C. Azevedo and Manuel M. Oliveira. “Efficient Smoke Simulation on Curvilinear Grids”. In: Computer Graphics Forum 32.7 (2013), pp. 235–244. ISSN: 1467-8659. DOI: 10.1111/cgf.12231. [8] Christopher Batty. “A Cell-centred Finite V olume Method for the Poisson Problem on Non-graded Quadtrees with Second Order Accurate Gradients”. In: J. Comput. Phys. 331.C (Feb. 2017), pp. 49–72. ISSN: 0021-9991. DOI: 10.1016/j.jcp.2016.11.035. [9] Christopher Batty, Florence Bertails, and Robert Bridson. “A Fast Variational Framework for Accurate Solid-fluid Coupling”. In: ACM Trans. Graph. 26.3 (July 2007). ISSN: 0730-0301. DOI: 10.1145/1276377.1276502. [10] Christopher Batty, Stefan Xenos, and Ben Houston. “Tetrahedral Embedded Boundary Methods for Accurate and Flexible Adaptive Fluids”. In: Proceedings of Eurographics. 2010. 65 [11] S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky. “E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials”. In: Nature Communications 13 (2021), p. 2453. DOI: 10.1038/s41467-022-29939-5. [12] J. Behler. “Constructing high-dimensional neural network potentials: a tutorial review”. In: International Journal of Quantum Chemistry 115.16 (2015), pp. 1032–1050. ISSN: 0020-7608. DOI: 10.1002/qua.24890. [13] Avrim L Blum and Ronald L Rivest. “Training a 3-node neural network is NP-complete”. In: Neural Networks 5.1 (1992), pp. 117–127. [14] Aleksandar Botev, Hippolyt Ritter, and David Barber. “Practical Gauss-Newton Optimisation for Deep Learning”. In: Proceedings of the 34th International Conference on Machine Learning 2017. 2017, pp. 557–565. [15] Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186. [16] J.U. Brackbill and H.M. Ruppel. “FLIP: A method for adaptively zoned, particle-in-cell calculations of fluid flows in two dimensions”. In: Journal of Computational Physics 65.2 (1986), pp. 314–343. ISSN: 0021-9991. DOI: http://dx.doi.org/10.1016/0021-9991(86)90211-1. [17] Robert Bridson. Fluid simulation for computer graphics. Boca Raton: CRC Press, Taylor & Francis Group, CRC Press is an imprint of the Taylor & Francis Group, an informa Business, 2016. ISBN: 1482232839. [18] Tyson Brochu, Christopher Batty, and Robert Bridson. “Matching Fluid Simulation Elements to Surface Geometry and Topology”. In: ACM Trans. Graph. 29.4 (July 2010), 47:1–47:9. ISSN: 0730-0301. DOI: 10.1145/1778765.1778784. [19] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. “SGD learns over-parameterized networks that provably generalize on linearly separable data”. In: arXiv preprint arXiv:1710.10174 (2017). [20] Jeff Budsberg, Michael Losure, Ken Museth, and Matt Baer. “Liquids in The Croods”. In: ACM DigiPro. (2013). [21] Oleksiy Busaryev, Tamal K. 
Dey, Huamin Wang, and Zhong Ren. “Animating Bubble Interactions in a Liquid Foam”. In: ACM Trans. Graph. 31.4 (July 2012), 63:1–63:8. ISSN: 0730-0301. DOI: 10.1145/2185520.2185559. [22] José A. Canabal, David Miraut, Nils Thuerey, Theodore Kim, Javier Portilla, and Miguel A. Otaduy. “Dispersion Kernels for Water Wave Simulation”. In: ACM Trans. Graph. 35.6 (Nov. 2016), 202:1–202:10. ISSN: 0730-0301. DOI: 10.1145/2980179.2982415. 66 [23] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. “Entropy-sgd: Biasing gradient descent into wide valleys”. In: Journal of Statistical Mechanics: Theory and Experiment 2019.12 (2019), p. 124018. [24] G. Chehade and I. Dincer. “Progress in green ammonia production as potential carbon-free fuel”. In: Fuel 299 (2021), p. 120845. ISSN: 0016-2361. DOI: 10.1016/j.fuel.2021.120845. [25] Chi Chen, Weike Ye, Yunxing Zuo, Chen Zheng, and Shyue Ping Ong. “Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals”. In: Chem. Mater. 31.9 (May 2019), pp. 3564–3572. [26] Zaiwei Chen, Shancong Mou, and Siva Theja Maguluri. “Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization”. In: arXiv preprint arXiv:2111.06328 (2021). [27] Xiang Cheng, Dong Yin, Peter Bartlett, and Michael Jordan. “Stochastic gradient and langevin processes”. In: International Conference on Machine Learning. proceedings.mlr.press, 2020, pp. 1810–1819. [28] Nuttapong Chentanez, Bryan E. Feldman, François Labelle, James F. O’Brien, and Jonathan R. Shewchuk. “Liquid Simulation on Lattice-based Tetrahedral Meshes”. In: Proceedings of the 2007 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA ’07. San Diego, California: Eurographics Association, 2007, pp. 219–228. ISBN: 978-1-59593-624-0. [29] Nuttapong Chentanez and Matthias Müller. “Real-time Eulerian Water Simulation Using a Restricted Tall Cell Grid”. In: ACM Trans. Graph. 30.4 (July 2011), 82:1–82:10. ISSN: 0730-0301. DOI: 10.1145/2010324.1964977. [30] S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko. “Towards exact molecular dynamics simulations with machine-learned force fields”. In: Nature Communications 9.1 (2018), p. 3887. ISSN: 2041-1723. DOI: 10.1038/s41467-018-06169-2. [31] Pascal Clausen, Martin Wicke, Jonathan R. Shewchuk, and James F. O’Brien. “Simulating Liquids and Solid-liquid Interactions with Lagrangian Meshes”. In: ACM Trans. Graph. 32.2 (Apr. 2013), 17:1–17:15. ISSN: 0730-0301. DOI: 10.1145/2451236.2451243. [32] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. “Natural language processing (almost) from scratch”. In: J. Mach. Learn. Res. 12.ARTICLE (2011), pp. 2493–2537. [33] E. D. Cubuk and S. S. Schoenholz. “Adversarial forces of physical models”. In: Proceedings of NeurIPS workshop on Machine Learning and the Physical Sciences (2020). [34] Fang Da, Christopher Batty, Chris Wojtan, and Eitan Grinspun. “Double Bubbles Sans Toil and Trouble: Discrete Circulation-preserving V ortex Sheets for Soap Films and Foams”. In: ACM Trans. Graph. 34.4 (July 2015), 149:1–149:9. ISSN: 0730-0301. DOI: 10.1145/2767003. 67 [35] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. “Escaping Saddles with Stochastic Gradients”. In: Proceedings of the 35th International Conference on Machine Learning. V ol. 80. PMLR, 2018, pp. 1155–1164. 
[36] Claudio De Stefano, Francesco Fontanella, Marilena Maniaci, and Alessandra Scotto di Freca. “A method for scribe distinction in medieval manuscripts using page layout features”. In: International Conference on Image Analysis and Processing. Springer. 2011, pp. 393–402. [37] Tyler De Witt, Christian Lessig, and Eugene Fiume. “Fluid Simulation Using Laplacian Eigenfunctions”. In: ACM Trans. Graph. 31.1 (Feb. 2012), 10:1–10:11. ISSN: 0730-0301. DOI: 10.1145/2077341.2077351. [38] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. 2nd. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. [39] Denis Demidov. AMGCL: C++ library for solving large sparse linear systems with algebraic multigrid method (https://github.com/ddemidov/amgcl). 2009. [40] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. “Bridging the gap between constant step size stochastic gradient descent and markov chains”. In: arXiv preprint arXiv:1707.06386 (2017). [41] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. “Sharp minima can generalize for deep nets”. In: International Conference on Machine Learning. 2017, pp. 1019–1028. [42] Yoshinori Dobashi, Yasuhiro Matsuda, Tsuyoshi Yamamoto, and Tomoyuki Nishita. “A Fast Simulation Method Using Overlapping Grids for Interactions between Smoke and Rigid Objects”. In: Computer Graphics Forum 27.2 (2008), pp. 477–486. ISSN: 1467-8659. DOI: 10.1111/j.1467-8659.2008.01145.x. [43] Jean Donea, Antonio Huerta, Jean-Philippe Ponthot, et al. “Arbitrary Lagrangian Eulerian methods”. In: Encyclopedia of Computational Mechanics (2004), Chapter–14. [44] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. “Essentially no barriers in neural network energy landscape”. In: International conference on machine learning. PMLR. 2018, pp. 1309–1318. [45] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradient methods for online learning and stochastic optimization.” In: Journal of machine learning research 12.7 (2011). [46] Adri C T van Duin, Siddharth Dasgupta, Francois Lorant, and William A Goddard. “ReaxFF: A reactive force field for hydrocarbons”. en. In: J. Phys. Chem. A 105.41 (Oct. 2001), pp. 9396–9409. [47] Alexander Dunn, Qi Wang, Alex Ganose, Daniel Dopp, and Anubhav Jain. “Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm”. en. In: npj Comput. Mater. 6.1 (Sept. 2020), pp. 1–10. [48] Gintare Karolina Dziugaite and Daniel M Roy. “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data”. In: (Mar. 2017). arXiv: 1703.11008[cs.LG]. 68 [49] Essex Edwards and Robert Bridson. “Detailed Water with Coarse Grids: Combining Surface Meshes and Adaptive Discontinuous Galerkin”. In: ACM Trans. Graph. 33.4 (July 2014), 136:1–136:9. ISSN: 0730-0301. DOI: 10.1145/2601097.2601167. [50] R. Elliot English, Linhai Qiu, Yue Yu, and Ronald Fedkiw. “Chimera Grids for Water Simulation”. In: Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA ’13. Anaheim, California: ACM, 2013, pp. 85–94. ISBN: 978-1-4503-2132-7. DOI: 10.1145/2485895.2485897. [51] Doug Enright, Duc Nguyen, Frederic Gibou, and Ron Fedkiw. “Using The Particle Level Set Method And A Second Order Accurate Pressure Boundary Condition For Free Surface Flows”. In: In Proc. 4th ASME-JSME Joint Fluids Eng. Conf., number FEDSM2003–45144. ASME. 2003, pp. 2003–45144. [52] Douglas Enright, Frank Losasso, and Ronald Fedkiw. 
“A Fast and Accurate semi-Lagrangian Particle Level Set Method”. In: Comput. Struct. 83.6-7 (Feb. 2005), pp. 479–490. ISSN: 0045-7949. DOI: 10.1016/j.compstruc.2004.04.024. [53] Kenny Erleben, Marek Krzysztof Misztal, and J. Andreas Bærentzen. “Mathematical Foundation of the Optimization-based Fluid Animation Method”. In: Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA ’11. Vancouver, British Columbia, Canada: ACM, 2011, pp. 101–110. ISBN: 978-1-4503-0923-3. DOI: 10.1145/2019406.2019420. [54] Henry Eyring. “The activated complex in chemical reactions”. In: The Journal of Chemical Physics 3.2 (1935), pp. 107–115. [55] Ye Fan, Joshua Litven, David I. W. Levin, and Dinesh K. Pai. “Eulerian-on-lagrangian Simulation”. In: ACM Trans. Graph. 32.3 (July 2013), 22:1–22:9. ISSN: 0730-0301. DOI: 10.1145/2487228.2487230. [56] Bryan E. Feldman, James F. O’Brien, and Bryan M. Klingner. “Animating Gases with Hybrid Meshes”. In: ACM Trans. Graph. 24.3 (July 2005), pp. 904–909. ISSN: 0730-0301. DOI: 10.1145/1073204.1073281. [57] F. Ferstl, R. Westermann, and C. Dick. “Large-Scale Liquid Simulation on Adaptive Hexahedral Grids”. In: Visualization and Computer Graphics, IEEE Transactions on 20.10 (Oct. 2014), pp. 1405–1417. ISSN: 1077-2626. DOI: 10.1109/TVCG.2014.2307873. [58] Florian Ferstl, Ryoichi Ando, Chris Wojtan, Ruediger Westermann, and Nils Thuerey. “Narrow Band FLIP for Liquid Simulations”. In: Eurographics’16 to appear (May 2016), p. 8. [59] R. P. Feynman and A. R. Hibbs. Quantum Mechanics and Path Integrals. New York, NY: McGraw-Hill, 1965. ISBN: 978-0070206502. [60] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. “Sharpness-aware Minimization for Efficiently Improving Generalization”. In: International Conference on Learning Representations. 2020. 69 [61] Nick Foster and Ronald Fedkiw. “Practical Animation of Liquids”. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH ’01. New York, NY , USA: ACM, 2001, pp. 23–30. ISBN: 1-58113-374-X. DOI: 10.1145/383259.383261. [62] Nick Foster and Dimitris Metaxas. “Controlling Fluid Animation”. In: Proceedings of the 1997 Conference on Computer Graphics International. CGI ’97. Washington, DC, USA: IEEE Computer Society, 1997, pp. 178–. ISBN: 0-8186-7825-9. URL: http://dl.acm.org/citation.cfm?id=792756.792862. [63] Mark I Freidlin and Alexander D Wentzell. Random Perturbations of Dynamical Systems 3rd Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. [64] X. Fu, Z. Wu, W. Wang, T. Xie, S. Keten, R. Gomez-Bombarelli, and T. Jaakkola. “Forces are not enough: benchmark and critical evaluation for machine learning force fields with molecular simulations”. In: arXiv (2022), p. 2210.07237. DOI: 10.48550/arXiv.2210.07237. [65] Victor Fung, Jiaxin Zhang, Eric Juarez, and Bobby G Sumpter. “Benchmarking graph neural networks for materials chemistry”. en. In: npj Comput. Mater. 7.1 (June 2021), pp. 1–8. [66] Ran Gal, Olga Sorkine, and Daniel Cohen-Or. “Feature-Aware Texturing”. In: Rendering Techniques 2006 (2006), 17th. [67] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. “Loss surfaces, mode connectivity, and fast ensembling of dnns”. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, pp. 8803–8812. [68] Jonas Geiping, Micah Goldblum, Phil Pope, Michael Moeller, and Tom Goldstein. “Stochastic Training is Not Necessary for Generalization”. 
In: International Conference on Learning Representations. 2022. URL: https://openreview.net/forum?id=ZBESeIUB5k. [69] Emmanuel Gobet. Monte-Carlo Methods and Stochastic Processes: From Linear to Non-Linear. en. CRC Press, Sept. 2016. [70] Emmanuel Gobet and Stéphane Menozzi. “Stopped diffusion processes: Boundary corrections and overshoot”. In: Stochastic Process. Appl. 120.2 (Feb. 2010), pp. 130–162. [71] Fernando de Goes, Corentin Wallez, Jin Huang, Dmitry Pavlov, and Mathieu Desbrun. “Power Particles: An Incompressible Fluid Solver Based on Power Diagrams”. In: ACM Trans. Graph. 34.4 (July 2015), 50:1–50:11. ISSN: 0730-0301. DOI: 10.1145/2766901. [72] Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein. “Truth or backpropaganda? An empirical investigation of deep learning theory”. In: International Conference on Learning Representations. 2020. URL: https://openreview.net/forum?id=HyxyIgHFvr. [73] I. J. Goodfellow, J. Shlens, and C. Szegedy. “Explaining and harnessing adversarial examples”. In: Proceedings of International Conference on Learning Representations, ICLR (2015). DOI: 10.48550/arXiv.1412.6572. 70 [74] Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. “Maxout Networks”. In: Proceedings of the 30th International Conference on Machine Learning. 2013, pp. 1319–1327. [75] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and Harnessing Adversarial Examples”. In: (Dec. 2014). arXiv: 1412.6572[stat.ML]. [76] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. “Accurate, large minibatch sgd: Training imagenet in 1 hour”. In: arXiv preprint arXiv:1706.02677 (2017). [77] Peter Hanggi. “Escape from a metastable state”. In: Journal of Statistical Physics 42.1 (1986), pp. 105–148. [78] Francis H. Harlow and J. Eddie Welch. “Numerical Calculation of Time-Dependent Viscous Incompressible Flow of Fluid with Free Surface”. In: Physics of Fluids 8.12 (1965), pp. 2182–2189. DOI: http://dx.doi.org/10.1063/1.1761178. [79] Fengxiang He, Tongliang Liu, and Dacheng Tao. “Control batch size and learning rate to generalize well: Theoretical and empirical evidence”. In: Advances in Neural Information Processing Systems 32 (2019), pp. 1143–1152. [80] Haowei He, Gao Huang, and Yang Yuan. “Asymmetric Valleys: Beyond Sharp and Flat Local Minima”. In: arXiv preprint arXiv:1902. 00744 (Feb. 2019). arXiv: 1902.00744[cs.LG]. [81] CW Hirt, Anthony A Amsden, and JL Cook. “An arbitrary Lagrangian-Eulerian computing method for all flow speeds”. In: Journal of computational physics 14.3 (1974), pp. 227–253. [82] Sepp Hochreiter and Jürgen Schmidhuber. “Flat minima”. In: Neural computation 9.1 (1997), pp. 1–42. [83] Sepp Hochreiter and Jürgen Schmidhuber. “Simplifying neural nets by discovering flat minima”. In: Advances in neural information processing systems. 1995, pp. 529–536. [84] Sepp Hochreiter and Jurgen Schmidhuber. “Flat Minima”. In: Neural computation 9 (Feb. 1997), pp. 1–42. [85] Elad Hoffer, Itay Hubara, and Daniel Soudry. “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, pp. 1729–1739. [86] J. Hoja, A. M. Reilly, and A. Tkatchenko. “First-principles modeling of molecular crystals: structures and stabilities, temperature and pressure”. 
In: WIREs Computational Molecular Science 7.1 (2017), e1294. ISSN: 1759-0876. DOI: 10.1002/wcms.1294. [87] Jeong-Mo Hong, Ho-Young Lee, Jong-Chul Yoon, and Chang-Hun Kim. “Bubbles Alive”. In: ACM Trans. Graph. 27.3 (Aug. 2008), 48:1–48:4. ISSN: 0730-0301. DOI: 10.1145/1360612.1360647. 71 [88] Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. “On the diffusion approximation of nonconvex stochastic gradient descent”. In: arXiv preprint arXiv:1705. 07562 (2017). [89] Wenqing Hu, Zhanxing Zhu, Haoyi Xiong, and Jun Huan. “Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent”. In: arXiv preprint arXiv:1901.06054 (2019). [90] W. Ronny Huang, Zeyad Emam, Micah Goldblum, Liam Fowl, Justin K. Terry, Furong Huang, and Tom Goldstein. “Understanding Generalization Through Visualizations”. In: Proceedings on "I Can’t Believe It’s Not Better!" at NeurIPS Workshops. Ed. by Jessica Zosa Forde, Francisco Ruiz, Melanie F. Pradier, and Aaron Schein. V ol. 137. Proceedings of Machine Learning Research. PMLR, Dec. 2020, pp. 87–97. URL: https://proceedings.mlr.press/v137/huang20a.html. [91] Geoffrey Irving, Eran Guendelman, Frank Losasso, and Ronald Fedkiw. “Efficient Simulation of Large Bodies of Water by Coupling Two and Three Dimensional Techniques”. In: ACM Trans. Graph. 25.3 (July 2006), pp. 805–811. ISSN: 0730-0301. DOI: 10.1145/1141911.1141959. [92] Stanislaw Jastrzebski, Devansh Arpit, Oliver Astrand, Giancarlo B Kerg, Huan Wang, Caiming Xiong, Richard Socher, Kyunghyun Cho, and Krzysztof J Geras. “Catastrophic fisher explosion: Early phase fisher matrix impacts generalization”. In: International Conference on Machine Learning. PMLR. 2021, pp. 4772–4784. [93] Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, and Krzysztof Geras. “The break-even point on optimization trajectories of deep neural networks”. In: arXiv preprint arXiv:2002.09572 (2020). [94] Stanisław Jastrz˛ ebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. “Three factors influencing minima in sgd”. In: arXiv preprint arXiv:1711.04623 (2017). [95] W. Jia, H. Wang, M. Chen, D. Lu, L. Lin, R. Car, W. E, and L. Zhang. “Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning”. In: Proceedings of Supercomputing (ACM/IEEE, 2020), p. 5. DOI: 10.5555/3433701.3433707. [96] Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Lin Lin, Roberto Car, E Weinan, and Linfeng Zhang. “Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning”. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. ieeexplore.ieee.org, Nov. 2020, pp. 1–14. [97] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. “Fantastic Generalization Measures and Where to Find Them”. In: International Conference on Learning Representations. 2019. [98] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. “Fantastic generalization measures and where to find them”. In: arXiv preprint arXiv:1912.02178 (2019). [99] R. Jinnouchi, K. Miwa, F. Karsai, G. Kresse, and R. Asahi. “On-the-Fly Active Learning of Interatomic Potentials for Large-Scale Atomistic Simulations”. In: Journal of Physical Chemistry Letters 11.17 (2020), pp. 6946–6955. DOI: 10.1021/acs.jpclett.0c01061. 
72 [100] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. “On large-batch training for deep learning: Generalization gap and sharp minima”. In: arXiv preprint arXiv:1609.04836 (2016). [101] Byungmoon Kim, Yingjie Liu, Ignacio Llamas, Xiangmin Jiao, and Jarek Rossignac. “Simulation of Bubbles in Foam with the V olume Control Method”. In: ACM Trans. Graph. 26.3 (July 2007). ISSN: 0730-0301. DOI: 10.1145/1276377.1276500. [102] Byungsoo Kim, Vinicius C Azevedo, Nils Thuerey, Theodore Kim, Markus Gross, and Barbara Solenthaler. “Deep fluids: A generative network for parameterized fluid simulations”. In: Computer Graphics Forum. V ol. 38. 2. Wiley Online Library. 2019, pp. 59–70. [103] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014). [104] Bobby Kleinberg, Yuanzhi Li, and Yang Yuan. “An alternative view: When does SGD escape local minima?” In: International Conference on Machine Learning. PMLR. 2018, pp. 2698–2707. [105] Bryan M Klingner, Bryan E Feldman, Nuttapong Chentanez, and James F O’brien. “Fluid animation with dynamic meshes”. In: ACM Transactions on Graphics (TOG). V ol. 25. 3. ACM. 2006, pp. 820–825. [106] Bryan M. Klingner, Bryan E. Feldman, Nuttapong Chentanez, and James F. O’Brien. “Fluid Animation with Dynamic Meshes”. In: ACM Trans. Graph. 25.3 (July 2006), pp. 820–825. ISSN: 0730-0301. DOI: 10.1145/1141911.1141961. [107] Philipp Krähenbühl, Manuel Lang, Alexander Hornung, and Markus Gross. “A system for retargeting of streaming video”. In: ACM Transactions on Graphics (TOG). V ol. 28. 5. ACM. 2009, p. 126. [108] H A Kramers. “Brownian motion in a field of force and the diffusion model of chemical reactions”. In: Physica 7.4 (Apr. 1940), pp. 284–304. [109] A. Krishnamoorthy, K. Nomura, N. Baradwaj, K. Shimamura, P. Rajak, A. Mishra, S. Fukushima, F. Shimojo, R. Kalia, A. Nakano, and P. Vashishta. “Dielectric constant of liquid water determined with neural network quantum molecular dynamics”. In: Physical Review Letters 126.21 (2021), p. 216403. ISSN: 0031-9007. DOI: 10.1103/PhysRevLett.126.216403. [110] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”. In: (2009). [111] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Adv. Neural Inf. Process. Syst. 25 (2012), pp. 1097–1105. [112] Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world. 2016. 73 [113] Akihiro Kushima, Xi Lin, Ju Li, Xiaofeng Qian, Jacob Eapen, John C Mauro, Phong Diep, and Sidney Yip. “Computing the viscosity of supercooled liquids. II. Silica and strong-fragile crossover behavior”. en. In: J. Chem. Phys. 131.16 (Oct. 2009), p. 164505. [114] François Labelle and Jonathan Richard Shewchuk. “Isosurface Stuffing: Fast Tetrahedral Meshes with Good Dihedral Angles”. In: ACM Trans. Graph. 26.3 (July 2007). ISSN: 0730-0301. DOI: 10.1145/1276377.1276448. [115] L’ubor Ladick` y, SoHyeon Jeong, Barbara Solenthaler, Marc Pollefeys, and Markus Gross. “Data-driven fluid simulations using regression forests”. In: ACM Transactions on Graphics (TOG) 34.6 (2015), pp. 1–9. [116] Manuel Lang, Alexander Hornung, Oliver Wang, Steven Poulakos, Aljoscha Smolic, and Markus Gross. “Nonlinear disparity mapping for stereoscopic 3D”. In: ACM Transactions on Graphics (TOG) 29.4 (2010), p. 75. [117] Martin Lastiwka, Nathan Quinlan, and Mihai Basa. 
“Adaptive particle distribution for smoothed particle hydrodynamics”. In: International Journal for Numerical Methods in Fluids 47.10-11 (2005), pp. 1403–1409. ISSN: 1097-0363. DOI: 10.1002/fld.891. [118] Yann LeCun. “1.1 Deep Learning Hardware: Past, Present, and Future”. In: 2019 IEEE International Solid- State Circuits Conference - (ISSCC). ieeexplore.ieee.org, Feb. 2019, pp. 12–19. [119] Yann LeCun. “1.1 Deep learning hardware: Past, present, and future”. In: 2019 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE. 2019, pp. 12–19. [120] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. “Visualizing the Loss Landscape of Neural Nets”. In: Advances in Neural Information Processing Systems 31 (2018). [121] Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. “Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate”. In: NeurIPS. 2020. [122] Zhiyuan Li, Sadhika Malladi, and Sanjeev Arora. “On the validity of modeling sgd with stochastic differential equations (sdes)”. In: Advances in Neural Information Processing Systems 34 (2021). [123] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. “Fisher-Rao metric, geometry, and complexity of neural networks”. In: The 22nd International Conference on Artificial Intelligence and Statistics. 2019, pp. 888–896. [124] T. Linker, K. Nomura, A. Aditya, S. Fukshima, R. K. Kalia, A. Krishnamoorthy, A. Nakano, P. Rajak, K. Shimmura, F. Shimojo, and P. Vashishta. “Exploring far-from-equilibrium ultrafast polarization control in ferroelectric oxides with excited-state neural network quantum molecular dynamics”. In: Science Advances 8.12 (2022), eabk2625. DOI: 10.1126/sciadv.abk2625. [125] Haixiang Liu, Nathan Mitchell, Mridul Aanjaneya, and Eftychios Sifakis. “A Scalable Schur-complement Fluids Solver for Heterogeneous Compute Platforms”. In: ACM Trans. Graph. 35.6 (Nov. 2016), 201:1–201:12. ISSN: 0730-0301. DOI: 10.1145/2980179.2982430. 74 [126] Frank Losasso, Frédéric Gibou, and Ron Fedkiw. “Simulating Water and Smoke with an Octree Data Structure”. In: ACM Trans. Graph. 23.3 (Aug. 2004), pp. 457–462. ISSN: 0730-0301. DOI: 10.1145/1015706.1015745. [127] Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In: arXiv preprint arXiv:1711. 05101 (Nov. 2017). arXiv: 1711.05101[cs.LG]. [128] Steph-Yves Louis, Yong Zhao, Alireza Nasiri, Xiran Wang, Yuqi Song, Fei Liu, and Jianjun Hu. “Graph convolutional neural networks with global attention for improved materials property prediction”. en. In: Phys. Chem. Chem. Phys. 22.32 (Aug. 2020), pp. 18141–18148. [129] Colin B. Macdonald and Steven J. Ruuth. “Level Set Equations on Surfaces via the Closest Point Method”. In: Journal of Scientific Computing 35.2 (2008), pp. 219–240. ISSN: 1573-7691. DOI: 10.1007/s10915-008-9196-6. [130] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. “Towards Deep Learning Models Resistant to Adversarial Attacks”. In: International Conference on Learning Representations (ICLR). 2018. [131] Jonathan P Mailoa, Mordechai Kornbluth, Simon Batzner, Georgy Samsonidze, Stephen T Lam, Jonathan Vandermause, Chris Ablitt, Nicola Molinari, and Boris Kozinsky. “A fast neural network approach for direct covariant forces prediction in complex multi-element extended systems”. en. In: Nature Machine Intelligence 1.10 (Sept. 2019), pp. 471–479. [132] Stephan Mandt, Matthew Hoffman, and David Blei. “A Variational Analysis of Stochastic Gradient Algorithms”. 
In: Proceedings of The 33rd International Conference on Machine Learning. Ed. by Maria Florina Balcan and Kilian Q Weinberger. V ol. 48. Proceedings of Machine Learning Research. New York, New York, USA: PMLR, 2016, pp. 354–363. [133] Albert W Marshall and Ingram Olkin. “Norms and inequalities for condition numbers, III”. In: Linear Algebra and its Applications 7.4 (1973), pp. 291–300. [134] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/. Software available from tensorflow.org. 2015. [135] Dominic Masters and Carlo Luschi. “Revisiting small batch training for deep neural networks”. In: arXiv preprint arXiv:1804.07612 (2018). [136] A. McAdams, E. Sifakis, and J. Teran. “A Parallel Multigrid Poisson Solver for Fluids Simulation on Large Grids”. In: Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA ’10. Madrid, Spain: Eurographics Association, 2010, pp. 65–74. URL: http://dl.acm.org/citation.cfm?id=1921427.1921438. 75 [137] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan ˇ Cernocký. “Strategies for training large scale neural network language models”. In: 2011 IEEE Workshop on Automatic Speech Recognition Understanding. ieeexplore.ieee.org, Dec. 2011, pp. 196–201. [138] M. Misawa, S. Fukushima, A. Koura, K. Shimamura, F. Shimojo, S. C. Tiwari, K. Nomura, R. K. Kalia, A. Nakano, and P. Vashishta. “Application of first-principles-based artificial neural network potentials to multiscale-shock dynamics simulations on solid materials”. In: Journal of Physical Chemistry Letters 11 (2020), pp. 4536–4541. DOI: 10.1021/acs.jpclett.0c00637. [139] M. K. Misztal, K. Erleben, A. Bargteil, J. Fursund, B. Bunch Christensen, J. A. Bærentzen, and R. Bridson. “Multiphase Flow of Immiscible Fluids on Unstructured Moving Meshes”. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA ’12. Lausanne, Switzerland: Eurographics Association, 2012, pp. 97–106. ISBN: 978-3-905674-37-8. [140] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. “Deepfool: a simple and accurate method to fool deep neural networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. cv-foundation.org, 2016, pp. 2574–2582. [141] Matthias Müller. “Fast and Robust Tracking of Fluid Surfaces”. In: Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA ’09. New Orleans, Louisiana: ACM, 2009, pp. 237–245. ISBN: 978-1-60558-610-6. DOI: 10.1145/1599470.1599501. [142] A. Musaelian, S. Batzner, A. ohansson, L. Sun, C. J. Owen, M. Kornbluth, and B. Kozinsky. “Learning local equivariant representations for large-scale atomistic dynamics”. In: arXiv (2022), p. 2204.05249. DOI: 10.48550/arXiv.2204.05249. [143] Ken Museth, Jeff Lait, John Johanson, Jeff Budsberg, Ron Henderson, Mihai Alden, Peter Cucka, David Hill, and Andrew Pearce. 
“OpenVDB: An Open-source Data Structure and Toolkit for High-resolution V olumes”. In: ACM SIGGRAPH 2013 Courses. SIGGRAPH ’13. Anaheim, California: ACM, 2013, 19:1–19:1. ISBN: 978-1-4503-2339-0. DOI: 10.1145/2504435.2504454. [144] Vinod Nair and Geoffrey E. Hinton. “Rectified Linear Units Improve Restricted Boltzmann Machines”. In: Proceedings of the 27th International Conference on Machine Learning. 2010, pp. 807–814. [145] Thanh Huy Nguyen, Umut ¸ Sim¸ sekli, Mert Gürbüzbalaban, and Gaël Richard. “First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise”. In: arXiv preprint arXiv:1906.09069 (2019). [146] K. Nguyen-Cong, J. T. Willman, S. G. Moore, A. B. Belonoshko, R. Gayatri, E. Weinberg, M. A. Wood, A. P. Thompson, and I. I. Oleynik. “Billion atom molecular dynamics simulations of carbon at extreme conditions and experimental time and length scales”. In: Proceedings of Supercomputing (IEEE/ACM, 2021), p. 4. DOI: 10.1145/3458817.3487400. [147] Michael B. Nielsen and Robert Bridson. “Spatially Adaptive FLIP Fluid Simulations in Bifrost”. In: ACM SIGGRAPH 2016 Talks. SIGGRAPH ’16. Anaheim, California: ACM, 2016, 41:1–41:2. ISBN: 978-1-4503-4282-7. DOI: 10.1145/2897839.2927399. 76 [148] K. Nomura, R. K. Kalia, A. Nakano, P. Rajak, and P. Vashishta. “RXMD: a scalable reactive molecular dynamics simulator for optimized time-to-solution”. In: SoftwareX 11 (2020), p. 100389. DOI: 10.1016/j.softx.2019.100389. [149] K. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, K. Shimamura, F. Shimojo, M. Kunaseth, P. C. Messina, and N. A. Romero. “Metascalable quantum molecular dynamics simulations of hydrogen-on-demand”. In: Proceedings of Supercomputing, SC14 (IEEE/ACM, 2014), pp. 661–673. DOI: 10.1109/SC.2014.59. [150] Ken-Ichi Nomura, Rajiv K Kalia, Aiichiro Nakano, Pankaj Rajak, and Priya Vashishta. “RXMD: A scalable reactive molecular dynamics simulator for optimized time-to-solution”. In: SoftwareX 11 (Jan. 2020), p. 100389. [151] J Michael Owen, Jens V Villumsen, Paul R Shapiro, and Hugo Martel. “Adaptive Smoothed Particle Hydrodynamics: Methodology. II.” In: The Astrophysical Journal Supplement Series 116.2 (June 1998), pp. 155–209. [152] Abhishek Panigrahi, Raghav Somani, Navin Goyal, and Praneeth Netrapalli. “Non-Gaussianity of stochastic gradient noise”. In: arXiv preprint arXiv:1910.09626 (2019). [153] Cheol Woo Park, Mordechai Kornbluth, Jonathan Vandermause, Chris Wolverton, Boris Kozinsky, and Jonathan P Mailoa. “Accurate and scalable graph neural network force field and molecular dynamics with direct force architecture”. en. In: npj Comput. Mater. 7.1 (May 2021), pp. 1–9. [154] Cheol Woo Park and Chris Wolverton. “Developing an improved crystal graph convolutional neural network framework for accelerated materials discovery”. In: Phys. Rev. Materials 4.6 (June 2020), p. 063801. [155] Saket Patkar, Mridul Aanjaneya, Dmitriy Karpman, and Ronald Fedkiw. “A Hybrid Lagrangian-Eulerian Formulation for Bubble Generation and Dynamics”. In: Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA ’13. Anaheim, California: ACM, 2013, pp. 105–114. ISBN: 978-1-4503-2132-7. DOI: 10.1145/2485895.2485912. [156] GP Purja Pun, R Batra, R Ramprasad, and Y Mishin. “Physically informed artificial neural networks for atomistic modeling of materials”. In: Nat. Commun. 10.1 (2019), pp. 1–10. [157] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. “Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis”. 
In: Proceedings of the 2017 Conference on Learning Theory. Ed. by Satyen Kale and Ohad Shamir. V ol. 65. Proceedings of Machine Learning Research. PMLR, 2017, pp. 1674–1703. [158] P. Rajak, A. Aditya, S. Fukushima, R. K. Kalia, T. T. Linker, K. Liu, Y . Luo, A. Nakano, K. Nomura, K. Shimamura, F. Shimojo, and P. Vashishta. “Ex-NNQMD: extreme-scale neural network quantum molecular dynamics”. In: Proceedings of International Parallel and Distributed Processing Symposium Workshops, IPDPSW (IEEE, 2021), pp. 943–946. DOI: 10.1109/IPDPSW52791.2021.00145. 77 [159] Pankaj Rajak, Anikeya Aditya, Shogo Fukushima, Rajiv K Kalia, Thomas Linker, Kuang Liu, Ye Luo, Aiichiro Nakano, Ken-Ichi Nomura, Kohei Shimamura, Fuyuki Shimojo, and Priya Vashishta. “Ex-NNQMD: Extreme-Scale Neural Network Quantum Molecular Dynamics”. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). ieeexplore.ieee.org, June 2021, pp. 943–946. [160] Pankaj Rajak, Kuang Liu, Aravind Krishnamoorthy, Rajiv K Kalia, Aiichiro Nakano, Ken-Ichi Nomura, Subodh C Tiwari, and Priya Vashishta. “Neural Network Molecular Dynamics at Scale”. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). ieeexplore.ieee.org, May 2020, pp. 991–994. [161] Akshay Rangamani, Nam H Nguyen, Abhishek Kumar, Dzung Phan, Sang H Chin, and Trac D Tran. “A scale invariant flatness measure for deep network minima”. In: arXiv preprint arXiv:1902.02434 (2019). [162] Nick Rasmussen, Douglas Enright, Duc Nguyen, Sebastian Marino, Nigel Sumner, Willi Geiger, Samir Hoon, and Ronald Fedkiw. “Directable photorealistic liquids”. In: Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation. Eurographics Association. 2004, pp. 193–202. [163] Herbert Robbins and Sutton Monro. “A stochastic approximation method”. In: The annals of mathematical statistics (1951), pp. 400–407. [164] M. Rossi, M. Ceriotti, and D. E. Manolopoulos. “How to remove the spurious resonances from ring polymer molecular dynamics”. In: Journal of Chemical Physics 140.23 (2014), p. 234116. ISSN: 0021-9606. DOI: 10.1063/1.4883861. [165] R. M. Schmidt, F. Schneider, and P. Hennig. “Descending through a crowded valley - benchmarking deep learning optimizers”. In: Proceedings of International Conference on Machine Learning, ICML 139 (2021), pp. 9367–9376. DOI: 10.48550/arXiv.2007.01547. [166] Robin M Schmidt, Frank Schneider, and Philipp Hennig. “Descending through a crowded valley-benchmarking deep learning optimizers”. In: International Conference on Machine Learning. PMLR. 2021, pp. 9367–9376. [167] Kristof T Schütt, Pieter-Jan Kindermans, Huziel E Sauceda, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. “SchNet: A continuous-filter convolutional neural network for modeling quantum interactions”. In: arXiv preprint arXiv:1706. 08566 (June 2017). arXiv: 1706.08566[stat.ML]. [168] D. Schwalbe-Koda, A. R. Tan, and R. Gómez-Bombarelli. “Differentiable sampling of molecular geometries with uncertainty-based adversarial attacks”. In: Nature Communications 12.1 (2021), p. 5104. ISSN: 2041-1723. DOI: 10.1038/s41467-021-25342-8. [169] Andrew Selle, Ronald Fedkiw, Byungmoon Kim, Yingjie Liu, and Jarek Rossignac. “An Unconditionally Stable MacCormack Method”. In: J. Sci. Comput. 35.2-3 (June 2008), pp. 350–371. ISSN: 0885-7474. DOI: 10.1007/s10915-007-9166-4. 78 [170] Rajsekhar Setaluri, Mridul Aanjaneya, Sean Bauer, and Eftychios Sifakis. 
“SPGrid: A Sparse Paged Grid Structure Applied to Adaptive Smoke Simulation”. In: ACM Trans. Graph. 33.6 (Nov. 2014), 205:1–205:12. ISSN: 0730-0301. DOI: 10.1145/2661229.2661269. [171] A. Shafahi, M. Najibi, M. A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein. “Adversarial training for free!” In: Proceedings of Advances in Neural Information Processing Systems, NeurIPS 32 (2019). DOI: 10.48550/arXiv.1904.12843. [172] Maurya Shah, Jonathan M Cohen, Sanjit Patel, Penne Lee, and Frédéric Pighin. “Extended galilean invariance for adaptive fluid simulation”. In: Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation. Eurographics Association. 2004, pp. 213–221. [173] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014). [174] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. “A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks”. In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. V ol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 5827–5837. [175] Umut ¸ Sim¸ sekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, and Levent Sagun. “On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks”. In: arXiv preprint arXiv:1912.00018 (2019). [176] Funshing Sin, Adam W. Bargteil, and Jessica K. Hodgins. “A Point-based Method for Animating Incompressible Flow”. In: Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. SCA ’09. New Orleans, Louisiana: ACM, 2009, pp. 247–255. ISBN: 978-1-60558-610-6. DOI: 10.1145/1599470.1599502. [177] Samuel L Smith, Benoit Dherin, David GT Barrett, and Soham De. “On the origin of implicit regularization in stochastic gradient descent”. In: arXiv preprint arXiv:2101.12176 (2021). [178] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. “Don’t Decay the Learning Rate, Increase the Batch Size”. In: International Conference on Learning Representations. 2018. [179] Barbara Solenthaler and Markus Gross. “Two-scale Particle Simulation”. In: ACM Trans. Graph. 30.4 (July 2011), 81:1–81:8. ISSN: 0730-0301. DOI: 10.1145/2010324.1964976. [180] Jos Stam. “Stable Fluids”. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH ’99. New York, NY , USA: ACM Press/Addison-Wesley Publishing Co., 1999, pp. 121–128. ISBN: 0-201-48560-5. DOI: 10.1145/311535.311548. [181] Xu Sun, Zhiyuan Zhang, Xuancheng Ren, Ruixuan Luo, and Liangyou Li. “Exploring the Vulnerability of Deep Neural Networks: A Study of Parameter Corruption”. In: arXiv preprint arXiv:2006. 05620 (June 2020). arXiv: 2006.05620[cs.LG]. 79 [182] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. “Intriguing properties of neural networks”. In: (Dec. 2013). arXiv: 1312.6199[cs.CV]. [183] Gerald Teschl. “Ordinary differential equations and dynamical systems”. In: Grad. Stud. Math. 140 (2000), pp. 08854–08019. [184] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. “Tensor field networks: rotation- and translation-equivariant neural networks for 3D point clouds”. In: arXiv (2018), p. 1802.08219. DOI: 10.48550/arXiv.1802.08219. [185] Nils Thuerey, Chris Wojtan, Markus Gross, and Greg Turk. “A Multiscale Approach to Mesh-based Surface Tension Flows”. 
In: ACM Trans. Graph. 29.4 (July 2010), 48:1–48:10. ISSN: 0730-0301. DOI: 10.1145/1778765.1778785. [186] A. Togo and I. Tanaka. “First principles phonon calculations in materials science”. In: Scripta Materialia 108 (2015), pp. 1–5. ISSN: 1359-6462. DOI: 10.1016/j.scriptamat.2015.07.021. [187] Adrien Treuille, Andrew Lewis, and Zoran Popovi´ c. “Model Reduction for Real-time Fluids”. In: ACM Trans. Graph. 25.3 (July 2006), pp. 826–834. ISSN: 0730-0301. DOI: 10.1145/1141911.1141962. [188] Yu-Lin Tsai, Chia-Yi Hsu, Chia-Mu Yu, and Pin-Yu Chen. “Formalizing Generalization and Robustness of Neural Networks to Weight Perturbations”. In: arXiv preprint arXiv:2103.02200 (2021). [189] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. “Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis”. In: Proceedings of the 37th International Conference on Machine Learning (2020), pp. 6262–6273. [190] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. “Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks using PAC-Bayesian Analysis”. In: arXiv preprint arXiv:1901.04653 (2019). [191] Kiwon Um, Xiangyu Hu, and Nils Thuerey. “Liquid splash modeling with neural networks”. In: Computer Graphics Forum. V ol. 37. 8. Wiley Online Library. 2018, pp. 171–182. [192] Nobuyuki Umetani and Bernd Bickel. “Learning three-dimensional flow for interactive aerodynamic design”. In: ACM Transactions on Graphics (TOG) 37.4 (2018), pp. 1–10. [193] Baptiste Van Opstal, Lucas Janin, Ken Museth, and Mihai Aldén. “Large Scale Simulation and Surfacing of Water and Ice Effects in Dragons 2”. In: ACM SIGGRAPH 2014 Talks. SIGGRAPH ’14. Vancouver, Canada: ACM, 2014, 11:1–11:1. ISBN: 978-1-4503-2960-6. DOI: 10.1145/2614106.2614156. [194] Jonathan Vandermause, Steven B Torrisi, Simon Batzner, Yu Xie, Lixin Sun, Alexie M Kolpak, and Boris Kozinsky. “On-the-fly active learning of interpretable Bayesian force fields for atomistic rare events”. en. In: npj Comput. Mater. 6.1 (Mar. 2020), pp. 1–11. 80 [195] P Vashishta, R K Kalia, J P Rino, and I Ebbsjö I. “Interaction potential for SiO2: A molecular-dynamics study of structural correlations”. en. In: Phys. Rev. B Condens. Matter 41.17 (June 1990), pp. 12197–12209. [196] Huamin Wang, Peter J. Mucha, and Greg Turk. “Water Drops on Surfaces”. In: ACM Trans. Graph. 24.3 (July 2005), pp. 921–929. ISSN: 0730-0301. DOI: 10.1145/1073204.1073284. [197] Huan Wang, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. “Identifying Generalization Properties in Neural Networks”. In: arXiv preprint arXiv:1809.07402 (2018). [198] Colin Wei and Tengyu Ma. “Improved sample complexities for deep neural networks and robust classification via an all-layer margin”. In: International Conference on Learning Representations. 2019. [199] Chris Wojtan, Nils Thuerey, Markus Gross, and Greg Turk. “Physics-inspired topology changes for thin fluid features”. In: ACM Trans. Graph. 29.4 (2010), pp. 1–8. ISSN: 0730-0301. DOI: http://doi.acm.org/10.1145/1778765.1778787. [200] Lei Wu, Chao Ma, et al. “How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective”. In: Advances in Neural Information Processing Systems 31 (2018), pp. 8279–8288. [201] Lei Wu, Zhanxing Zhu, et al. “Towards understanding generalization of deep learning: Perspective of loss landscapes”. In: arXiv preprint arXiv:1706.10239 (2017). [202] Tian Xie and Jeffrey C Grossman. 
“Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties”. en. In: Phys. Rev. Lett. 120.14 (Apr. 2018), p. 145301.

[203] Zeke Xie, Issei Sato, and Masashi Sugiyama. “A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima”. In: International Conference on Learning Representations. 2020.

[204] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. “Hessian-based Analysis of Large Batch Training and Robustness to Adversaries”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. 2018, pp. 4949–4959.

[205] Mingyang Yi, Huishuai Zhang, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. “BN-invariant Sharpness Regularizes the Training Model to Better Generalization”. In: IJCAI. researchgate.net, 2019, pp. 4164–4170.

[206] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Understanding deep learning requires rethinking generalization (2016)”. In: arXiv preprint arXiv:1611.03530 (2017).

[207] Yanci Zhang, Barbara Solenthaler, and Renato Pajarola. “Adaptive Sampling and Rendering of Fluids on the GPU”. In: Proceedings of the Fifth Eurographics / IEEE VGTC Conference on Point-Based Graphics. SPBG’08. Los Angeles, CA: Eurographics Association, 2008, pp. 137–146. ISBN: 978-3-905674-12-5. DOI: 10.2312/VG/VG-PBG08/137-146.

[208] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. “Recovery Guarantees for One-hidden-layer Neural Networks”. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. PMLR, June 2017, pp. 4140–4149. URL: https://proceedings.mlr.press/v70/zhong17a.html.

[209] Bo Zhu, Wenlong Lu, Matthew Cong, Byungmoon Kim, and Ronald Fedkiw. “A New Grid Structure for Domain Extension”. In: ACM Trans. Graph. 32.4 (July 2013), 63:1–63:12. ISSN: 0730-0301. DOI: 10.1145/2461912.2461999.

[210] Bo Zhu, Ed Quigley, Matthew Cong, Justin Solomon, and Ronald Fedkiw. “Codimensional Surface Tension Flow on Simplicial Complexes”. In: ACM Trans. Graph. 33.4 (July 2014), 111:1–111:11. ISSN: 0730-0301. DOI: 10.1145/2601097.2601201.

[211] Yongning Zhu and Robert Bridson. “Animating sand as a fluid”. In: ACM Transactions on Graphics (TOG) 24.3 (2005), pp. 965–972.

[212] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. “The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects”. In: arXiv preprint arXiv:1803.00195 (2018).

Appendices

Appendix A

A.1 Proof of Proposition 1

We provide a proof of Proposition 1 using bias-free fully-connected neural networks (FCNNs) with an NNH activation function for simplicity. The argument can be extended to other models, including NNs with biases and convolutional NNs. Let f be a neural network and {(x_i, y_i)}_{i=1}^{n} be a set of data pairs. For a given input (x, y), we denote by o_l = ⟨w_l, f(x)⟩ the logit for label l, by p(l|x) = p_l = exp[o_l]/Z the softmax probability, and by Z = Σ_l exp[o_l] the normalization term. We also set the loss L = (1/n) Σ_{i=1}^{n} L_i with L_i = −ln p(y_i|x_i), and write Δ = ∇² for the Laplace operator. Our goal in this section is to show the following relation:

Tr[H] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{score}(x_i, y_i),
\qquad
\mathrm{score}(x, y) = \sum_{d} \left( \sum_{l} p_l \left\| \frac{\partial o_l}{\partial W_d} \right\|_F^2 - \left\| \frac{\partial \ln Z}{\partial W_d} \right\|_F^2 \right),

where ‖·‖_F denotes the Frobenius norm.
Firstly, we can decomposeTr[H]=Tr[∆ L]=Tr[∆ P i L i ]= P i Tr[∆ L i ] thanks to the property of trace operation,Tr[A+B]=Tr[A]+Tr[B], and linearity of∆ acting on the loss. This decomposition has a 83 critical role because it allows us to reduce requirements for computational resources via mini-batch calculation. Secondly, we also decomposeTr[∆ L i ] as follows: Let{W d } D d be a set of layer parameters in FCNNs. Because trace operation sums up only diagonal elements, we haveTr[∆ L i ]= P d Tr[H d ] whereH d is diagonal block of Hessian matrix∆ L i ford-th layer. Due to this decomposition, all we need to calculate Tr[∆ L] is computation of data-wiseTr[H d ]. From now on, we omit subscripti when it has no ambiguity. Here, we introduce important relation for NNs with NNH activation function: H d =(J d W D PW ⊤ D J ⊤ d )⊗ (x d x ⊤ d ) whereW D is a matrix packingw l ,P is a matrixdiag[p]− pp ⊤ , wherep is a vector containingp(l|x), J d is a Jacobian matrix, andx d input vector ford-th layer. The operationdiag[·] means embedding elements of a vector into diagonal position in matrix. This equation originating from second deviation of NNH function will vanish. The detailed explanation are available in previous work [14]. Exploiting this equation and properties ofTr[·],Tr[A⊗ B]=Tr[A]Tr[B] andTr[AB]=Tr[BA], we have the following results: Tr[H d ]=Tr[(J d W D PW ⊤ D J ⊤ d )⊗ (x d x ⊤ d )] =Tr[J d W D PW ⊤ D J ⊤ d ] ∥x d ∥ 2 2 . 84 Next,we reformulate the complicated term,Tr[J d WPW ⊤ J ⊤ d ], as follow: Tr[J d W D PW ⊤ D J ⊤ d ] =Tr[J d W D (diag[p]− pp ⊤ )W ⊤ D J ⊤ d ] =Tr[J d W D diag[p]W ⊤ D J ⊤ d ]−∥ J d W D p∥ 2 2 = P l p l ∥J d w l ∥ 2 2 −∥ J d W D p∥ 2 2 Applying this refomulation to theTr[H d ], we have the following relation Tr[H d ] =Tr[J d W D PW ⊤ D J ⊤ d ] ∥x d ∥ 2 2 = P l p l ∥J d w l ∥ 2 2 −∥ J d W D p∥ 2 2 ∥x d ∥ 2 2 = P l p l ∥J d w l ∥ 2 2 ∥x d ∥ 2 2 −∥ J d W D p∥ 2 2 ∥x d ∥ 2 2 = P l p l ∥(J d w l )x ⊤ d ∥ F −∥ (J d W D p)x ⊤ d ∥ F . Here, we use the relation∥x∥ 2 2 ∥y∥ 2 2 =Tr[x ⊤ xy ⊤ y]=Tr[xy ⊤ yx ⊤ ]=Tr[xy ⊤ (xy ⊤ ) ⊤ ]=∥xy ⊤ ∥ F . Finally, applying the back-propagation relations ∂lnZ ∂W d =(J d W D p)x ⊤ d and ∂o l ∂W d =(J d w l )x ⊤ d to the terms inside the Frobenius norm, we have the formulation we want to state. A.2 Proof of Proposition 2 In this section, we provide proof of the proposition 2. LetW d be a layer parameter in no-bias NNs,ϕ be a NNH activation function, andα ={α d |d=1,...,D} be parameters of theα -scale transformation. 85 We introduce the forward-propagation as the following repetition:x d+1 =ϕ (W d x d ). We also denote o l =⟨w l ,x D ⟩ as logit for labell andZ = P I exp[o l ]. We can describe back-propagation as follow: ∂lnZ ∂W d =(J d W D p)x ⊤ d ∂o l ∂W d =(J d w l )x ⊤ d whereJ d =diag[x ′ d+1 ]W ⊤ d+1 (··· (diag[x ′ D ]W ⊤ D )··· ). The operationdiag[·] means embedding elements of a vector into diagonal position in matrix andx ′ d indicates element-wise deviation. Theα -Scale transformation acting on{W d } modifies this back-propagation as follows J d =diag[x ′ d+1 ]α d+1 W ⊤ d+1 (··· (diag[x ′ D ]α D W ⊤ D )··· ) = D Y k=d+1 α d ! diag[x ′ d+1 ]W ⊤ d+1 (··· (diag[x ′ D ]W ⊤ D )··· ). Exploiting the non-negative homogeneous property, we also havex d obtained from theα -transformed NNs as follow: x d =ϕ (α d− 1 W d− 1 ··· (ϕ (α 1 W 1 x)··· )) =( d− 1 Y k=1 α k )ϕ (W d− 1 ··· (ϕ (W 1 x)··· )). 86 Combining these two results, we have ∂lnZ ′ ∂W d = D Y k=d+1 α d J d Wp ! d− 1 Y k=1 α k x ⊤ d ! = D Y k=1,k̸=d α k (J d w l )x ⊤ d = 1 α d (J d w l )x ⊤ d . 
Here we exploit the constraint, Q D d=1 α d =1. We can prove the result for ∂o l ∂W d in the same way. Figure A.1: Evaluation for approximating DIAG[H], diagonal element of Hessian matrices. The left and right figures are results of arXiv-version [190] and ICML-version [189] respectively. A.3 Proof of Theorem 1 In this section, we provide proof of the following theorem 1: MS θ =min α D X d=1 1 α 2 d Tr[H θ,d ]=D D Y d=1 Tr[H θ,d ] ! 1/D whereH θ,d denotes diagonal blocked Hessian overd-th layer. 87 Figure A.2: Experimental results comparing minimum and normalized sharpness using LeNet. Firstly, we have the following relation enjoying proposition 1 and proposition 2 acting on MS θ , Tr[H θ ] = 1 n n X i=1 D X d=1 K X l=1 p l ∂o @il ∂W d F − ∂lnZ @i ∂W d F ! = D X d=1 1 n n X i=1 K X l=1 p l ∂o @il ∂W d F − ∂lnZ @i ∂W d F ! = D X d=1 1 n n X i=1 K X l=1 p l 1 α 2 d ∂o @il ∂W d F − 1 α 2 d ∂lnZ @i ∂W d F ! = D X d=1 1 α 2 d 1 n n X i=1 K X l=1 p l ∂o @il ∂W d F − ∂lnZ @i ∂W d F ! = D X d=1 1 α 2 d Tr[H θ,d ]. For the simplicity, we denoteH d =Tr[H θ,d ] shortly. We replaceα − 2 d withα ′ d . Theα ′ d satisfies the condition of α in scale transformation, i.e. α ′ d >0 and Q d α ′ d = Q d α − 2 d =( Q d α d ) − 2 =1. Hereinafter, we tackle the following minimization MS θ =min α D X d=1 α d H d . 88 Thanks to the inequality of arithmetic and geometric means, we have the following inequality: D X d=1 α d H d ≥ D( D Y d=1 α d H d ) 1 D =D D Y d=1 H 1 D d where we use the relation Q d α d =1. The remained problem is the existence ofα d achieving the lower bound. Here, we introduce the following α : α ∗ d = 1 H d D Y d=1 H 1 D d . We can check achievement of the lower bound and satisfaction of the condition asα as follow: X d α ∗ d H d = X d (H − 1 d Y k H 1 D k )H d = X d Y k H 1 D k =D Y d H 1 D d , α ∗ d =H − 1 d D Y d=1 H 1 D d >0, Y d α ∗ d = Y d (H − 1 d Y k H 1 D k ) =( Y d H d ) − 1 ( Y d H d ) =1. 89 A.4 Experimental Setup for Comparison The FCNN has three FC layers with128 or512 hidden dimensions, and LeNet has three convolution layers with6,16 or120 channels with5 kernel size and one FC layer with84 hidden dimensions. The optimizer is vanilla SGD with1024 batch-size,10 − 5 weight decay, and0.9 momentum. For a learning rate and the number of epochs, we set10 − 1 and3000 for FCNNs, and10 − 2 and1000 for LeNet. We use the latest parameters for NNs for evaluation. In the same manner, generalization gaps are calculated using the last epoch. To calculate the [189]’s score, we exploit the exact calculation for diagonal elements of Hessian using our proposed calculation. A.5 Experiments using LeNet In this section, we report the results using LeNet in Fig. A.2. These experiments are based on the same setting of comparison experiments 2.4.2. Even though the results of the minimum and the normalized sharpness are perfectly consistent as in the FCNN’s case, both show roughly the same shape. Thus, we achieve the same result in the case of CNN’s. A.6 Experimental Setup for Accuracy and Efficiency Calculation Tr[H] We propose an exact and efficient calculation for the trace of Hessian matrices. In these experiments, we verify the exactness and efficiency compared with proposed calculation and baseline calculation. Because the baseline calculation requires heavy computation, we use a limited number of examples{10,100,1000} in MNIST, a small FCNN, and a small CNN. The small FNNCs consists of three linear layers with20,20 hidden dimensions. 
The small CNN has two convolution layers20,20 channel with5 kernel size two 90 max-pooling layers with2 kernel size and2 stride, and one linear layer mapping to logits. We carried out experiments with10 different random seeds. The results are shown in Fig.2.1. We compare the correctTr[H] obtained from baseline calculation with ones from our proposal. To evaluate efficiency, we compare the consumed time to calculate the values in seconds. Also, we plot the times in the same manner as the previous experiment. Note that the right figures use log-scale to show our results. As these results show, our proposal accelerates the calculation significantly. A.7 Approximation for DIAG[H] First of all, we define an operation DIAG [A], which extracts diagonal elements from a given matrixA. For simplicity, we roughly use this operation without paying attention to the shapes. We note thatdiag and DIAG are not identical operations. Original normalized sharpness [189, 190] (NS) is defined as a summation of layer-wise sharpness measures: NS= P d NS d where NS d = min σ 1 ,σ 2 ⟨σ 1 , DIAG[H d ]σ 2 ⟩+⟨σ − 1 1 ,W 2 d σ − 1 2 ⟩. σ 1 ,σ 2 are vectors such that Q i σ 1[i] =1 and Q j σ 2[j] =1, similar toα inα scale transformation. σ − 1 and A 2 indicate element-wise inverse and square operations respectively. To minimize NS, we need to calculate DIAG[H]. The previous work [190] approximate this calculation as follows DIAG[H d ] ≈ E ϵ ∼ (0,1) ϵ ⊙ ∇ θ +rϵ L(θ +rϵ )−∇ θ − rϵ L(θ − rϵ ) 2r . 91 Their updated study [189] proposed modified version of this approximation, aiming to prevent convexity of the NS d minimization from violation of this approximation. We evaluated these approximations with the same experiments as in Appendix. A.6. The results are shown in Fig. A.1. The errors indicate the L2 distance between correct diagonal Hessian DIAG[H] and the approximationApp∥DIAG[H]− App∥. The circles indicate the start points of each iteration. Other settings are the same as ones used in Appendix. A.6, e.g.,n indicates the size of data. The left and right figures show that approximation defined in arXiv-version [190] and ICML-version [189] respectively. The results imply that these approximations require more computational cost than carried in this experiment to be close to DIAG[·]. A.8 Exactness and Efficiency Calculation for DIAG [H] We extend our proposed calculation ofTr[H] to DIAG[HJ]. This calculation can get rid of approximations to compute normalized sharpness [189, 190]. Due to the comparison between precision and computational cost, we use our calculation for normalized sharpness in our experiments. The extension is trivial becauseTr[·] and DIAG[·] have similar properties. That is, Tr[A+B]=Tr[A]+Tr[B] and DIAG[A+B]= DIAG[A]+ DIAG[B],Tr[H]= P d Tr[H d ] and DIAG[H]=⊕ d DIAG[H d ] where⊕ denotes direct product , andTr[A⊗ B]=Tr[A]Tr[B] and DIAG[A⊗ B]= DIAG[A]⊗ DIAG[B]. We can compare the performance of the approximations with our extended calculation. Our results in the experiment 2.4.1 are obtained by a summation of this extended DIAG[H] calculation: first calculate DIAG[H] and then sumTr[H] up. Per our observations, we realized accurate and significantly faster calculation. 92 Appendices B B.1 Assumptions for [38, Theorem 5.7.11 (a)] Although there are miscellaneous assumptions (Assumption 7, 8, and 9) in [38, Theorem 5.7.11 (a)], they can be simply derived from the assumptions in SGD’s escape problem. We provide the following brief justification. 
Stability ofθ ∗ is an essential assumption for [38, Theorem 5.7.11 (a)] and especially commonly assumed in dynamical system [88, 201]. Assumption 7 (θ ∗ is asymptotically stable). For any neighborhoodU that containsθ ∗ , there exists a small neighborhoodV ofθ ∗ such that gradient flow with any initial value θ 0 ∈V does not leaveU fort≥ 0 and lim t→∞ θ t =θ ∗ . Assumption 8 (D is attracted toθ ∗ ). ∀θ 0 ∈ D, a system ˙ θ t =−∇ L(θ t ) with initial valueθ 0 converges to θ ∗ without leavingD ast→∞. But it does not explicitly appear in SGD’s escaping analysis [212, 94, 203], because they usually adopt stronger assumptions. Assumption 7 is known to be equivalent to the local minimality ofθ ∗ under the condition thatL(θ ) is real analytic aroundθ ∗ [2]. Also, by definition of asymptotic stability in 93 Assumption 7, we can always find a region D that satisfies Assumption 8. The more detailed properties of stability can be found, such as in Section 6.5 of [183]. For the rigorous asymptotic analysis, the quasi-potential needs to be finite. Assumption 9 (FiniteV 0 ). V 0 ≜inf θ ′ ∈∂D V(θ ′ )<∞. But this is naturally satisfied in our setup. Because of Assumption 1 and 3, we can choose φ t =∇L(φ t ) to obtain finite steepness. This implies the quasi-potential is always finite for any θ . B.2 Formal properties of Steepness and Quasi-potential B.2.1 Fundamental Lemmas of Steepness Formally, steepness (Definition 5) is a useful measure because it satisfies the following Lemmas 4 and 5. To state the lemmas, readers shall viewθ as a probability measure onR d , extiti.e. θ :=(R d ,F), whereF is the Borelσ -field on R d . Lemma 4. If all the entries of∇L(·) andC(·) are bounded, uniformly Lipschitz continuous functions, then given{θ t }, the solution of (3.3), for allΓ ∈F liminf ε→0 εlnθ (Γ) ≥− inf φ∈Γ o S T (φ) whereΓ o denotes the interior ofΓ . Lemma 5. If all the entries of∇L(·) andC(·) are bounded, uniformly Lipschitz continuous functions, then given{θ t }, the solution of (3.3), for allΓ ∈F limsup ε→0 εlnθ (Γ) ≤− inf φ∈ ¯Γ S T (φ) where ¯Γ denotes the closure ofΓ . 94 Proof. Definition 5 corresponds to the rate function defined in [38, (5.6.6)] in our setup (see the [38, Remark on p214] as well). By [38, Theorem 5.6.7], our steepness satisfies Large Deviation Principle, which is stated in [38, (1.2.4)]. This statement immediately proves Lemma 4 and 5. B.2.2 Preliminary Lemmas of Quasi-potential Derived from Lemma 4 and 5, the following lemmas play the essential roles in bounding the mean exit time. We introduce the statements in our notations and briefly describe their implications. In the lemmas below, we useΘ t to denote gradient flow, extiti.e. ˙ Θ t =−∇ L(Θ t ) and forµ> 0,B µ denotes anµ -neighbourhood ofθ ∗ , that is,B µ :={θ ′ ∈R d |∥θ ′ − θ ∗ ∥≤ µ }. We also define π µ ≜inf{t:t≥ 0,θ t ∈B µ ∪∂D}. Lemma 6 (Lemma 5.7.18 in [38]). Under Assumption 5, for anyξ > 0 and anyµ> 0 small enough, there existsT 0 <∞ such that liminf ε→0 εln inf θ 0 ∈Bµ P(τ ≤ T 0 )>− (V 0 +ξ ) Lemma 6 means ifε is small enough, there exists a trajectory that starts fromB µ , exits with finite time durationT 0 , and has its steepness less thanV 0 +ξ . Lemma 7 (Lemma 5.7.19 in [38]). Under Assumption 1, 2, and 4, for any µ > 0 small enough such that B µ ⊂ D, we have lim t→∞ limsup ε→0 εln sup θ 0 ∈D P(π µ >t)=−∞ . Lemma 7 means ifε is small enough andt is large enough,θ will either fall intoB µ or each∂D at some point. 95 Lemma 8 (Lemma 5.7.21 in [38]). 
Under Assumption 1 and 3, for any µ > 0 small enough such that B µ ⊂ D, for any closed setN ⊂ ∂D lim µ →0 limsup ε→0 εln sup θ 0 ∈Bµ P θ π µ ∈N ≤− inf θ ′ ∈N V(θ ′ ), Lemma 8 means ifε is small enough andB µ is a small enough neighborhood ofθ ∗ , there exists a trajectory fromB µ toN such that its steepness at leastinf θ ′ ∈N V (θ ′ ). Lemma 9 (Lemma 5.7.22 in [38]). For anyµ> 0 small enough such thatB µ ⊂ D and allθ 0 ∈D, we have lim ε→0 P θ π µ ∈B µ =1 Lemma 9 means ifε is small enough,θ eventually falls intoB µ . Lemma 10 (Lemma 5.7.23 in [38]). For everyδ > 0 and everyξ > 0, there exists a constantT(ξ,δ )<∞ such that limsup ε→0 εln sup θ 0 ∈D P sup t∈[0,T(ξ,δ )] |θ t − θ 0 |≥ δ ! <− ξ Lemma 10 means over short time intervalsθ t can get far from its starting point with exponentially small probability. B.3 Proof of Theorem 2 B.3.1 Main proof For simplicity, we useε to denoteη/B . To prove this result, we provide the proof for an upper bound (Lemma 11) and a lower bound (Lemma 12). Preliminary Lemmas that support the proofs are summarized in Appendices B (Lemmas 6, 7, 8, 9, and 10). 96 Throughout the proofs, we sometimes useP θ 0 orP θ ′ 0 to clearly indicate which trajectory we are referring to. First, we develop the upper bound on the mean exit time. Lemma 11. For any c > 0, there exists an ε 0 such that for ε < ε 0 , εlnE[τ ] < V 0 + c holds, where V 0 :=min θ ′ ∈∂D V(θ ′ ). θ 0 θ T 0 +T 1 θ T 1 Figure B.1: Domains and trajectory right appearing in the proof of upper bound. Proof. Given Lemma 6 withξ := c 2 andµ :=µ 0 , there exists aT 0 such that liminf ε→0 εln inf θ 0 ∈Bµ 0 P(τ ≤ T 0 )>− (V 0 + c 2 ) Also, given Lemma 7 withµ =µ 0 , there exists aT 1 such that limsup ε→0 εln sup θ 0 ∈D P(τ µ 0 >T 1 )<0. LetT =T 0 +T 1 . Then there exists someε 0 >0 such that for allε≤ ε 0 , q≜ inf θ 0 ∈D P(τ ≤ T)≥ inf θ 0 ∈D P(τ µ 0 ≤ T 1 ) inf θ 0 ∈Bµ 0 P(τ ≤ T 0 ) ≥ e − (V 0 + c 2 )/ε 97 Considering the events{τ >kT } fork =1,2,... yields P(τ > (k+1)T)=[1− P(τ ≤ (k+1)T |τ >kT )]P(τ >kT ) ≤ (1− q)P(τ >kT ) Iterating overk =1,2,... gives sup θ 0 ∈D P(τ >kT )≤ (1− q) k Therefore, sup θ 0 ∈D E[τ ]≤ T " 1+ ∞ X k=1 sup θ 0 ∈D P(τ >kT ) # ≤ T ∞ X k=0 (1− q) k = T q and sinceq≥ e − (V 0 +c)/ε , sup θ 0 ∈D E[τ ]≤ Te (V 0 + c 2 )/ε If we takeε 0 small enough, εlnE[τ ]≤ V 0 + c 2 <V 0 +c holds for allε≤ ε 0 andθ 0 ∈D. Next, we develop the lower bound on the exit time. Lemma 12. For any c > 0, there exists an ε 0 such that for ε < ε 0 , εlnE[τ ] > V 0 − c holds, where V 0 :=min θ ′ ∈∂D V(θ ′ ). Proof. To prove the lower bound on τ , let µ 0 > 0 be small enough thatB µ 0 ⊂ D. Let σ 0 = 0 and for n=0,1,... define the following series of time stamps τ n =inf t:t≥ σ n ,θ t ∈B µ 0 /2 ∪∂D , σ n+1 =inf{t:t>τ n ,θ t ∈B µ 0 } 98 with the convention that σ n+1 = ∞ if θ τ n ∈ ∂D. Note that necessarily τ = τ n for some integer n. Moreover, since τ n are exit times and θ t is a strong Markov process, the process z n ≜ θ τ n is a Markov chain. Note that∂D is a closed set and chooseµ 0 >0 small enough as needed by Lemma 8 for θ 0 = θ σ 0 θ σ 1 θ τ 1 := z 1 θ τ = θ τ n := z n θ σ 2 θ τ 0 := z 0 Figure B.2: Domains and trajectory right appearing in the proof of lower bound. limsup ε→0 εln sup θ ′ 0 ∈Bµ 0 P θ ′ 0 θ π µ 0 ∈∂D <− V 0 + c 2 . Now, letξ := V 0 andδ := µ 0 , and letT 0 = T(V 0 ,µ 0 ) be as determined by Lemma 10. 
Then there exists ε 0 >0 such that for allε≤ ε 0 and alln≥ 1, sup θ 0 ∈D P(τ =τ n )≤ sup θ ′ 0 ∈Bµ 0 P θ ′ 0 θ π µ 0 ∈∂D ≤ e − (V 0 − c/2)/ε and sup θ 0 ∈D P(σ n − τ n− 1 ≤ T 0 )≤ sup θ 0 ∈D P sup t∈[0,T 0 ] |θ t − Θ t |≥ µ 0 ! ≤ e − (V 0 − c/2)/ε 99 The event{τ ≤ kT 0 } implies that either one of the first k+1 among the mutually exclusive events{τ =τ n } occurs, or else that at least one of the first k excursions [τ n ,τ n+1 ] offB µ 0 /2 is of length at mostT 0 . Thus, by the union of events bound, utilizing the preceding worst-case estimates, for allθ 0 ∈D and any integerk, P(τ ≤ kT 0 )≤ k X n=0 P(τ =τ n )+P min 1≤ n≤ k {σ n − τ n− 1 }≤ T 0 ≤ P(τ =τ 0 )+2ke − (V 0 − c 2 )/ε Recall the identity{τ =τ 0 }≡ θ π µ 0 / ∈B ρ and apply the preceding inequality withk = h T − 1 0 e (V 0 − c 2 )/ε i + 1 to obtain (for small enoughε ) P τ ≤ e (V 0 − c 2 )/ε ≤ P(τ ≤ kT 0 )≤ P θ π µ 0 / ∈B µ 0 +4T − 1 0 e − c 2ε By Lemma 9, the left side of this inequality approaches zero asε→0; hence, the following holds. lim ε→0 P e (V 0 − c 2 )/ε ≤ τ =1 By Chebycheff’s bound, if we takeε 0 small enough, εlnE[τ ]>V 0 − c holds for allε≤ ε 0 . 100 B.4 Hamilton-Jacobi Equation for Quasi-potential While we use a proximal system to approximate the quasi-potential, there have been attempts to directly analyzeV(θ ). A prominent result is the theorem by [89], which showed thatV(θ ) satisfies the following Hamilton–Jacobi equation. Theorem 5. For allθ ∈D,V(θ ) satisfies the following Jacobi equation, 1 2 ∇V(θ ) ⊤ C(θ )∇V(θ )−∇ L(θ ) ⊤ ∇V(θ )=0 Although it does not give us a closed solution ofV(θ ), it reflects the role of C(θ ) to makeV(θ ) smaller. Below, we provide the proof in our notation for the completeness. Proof. For u,v ∈ R d , we introduce an inner product and a norm regarding a point θ ∈ D as⟨u,v⟩ θ := u ⊤ C(θ ) − 1/2 v and∥u∥ θ := p ⟨u,u⟩ θ . With these definitions, the S T (φ) is written as follows: S T (φ)= 1 2 Z T 0 ∥˙ φ t +∇L(φ t )∥ 2 φt dt. (B.1) Note thatφ t ∈D holds for anyt∈[0,T] by the definition of trajectories. We rewrite the integrand of (B.1) as follows: ∥˙ φ t +∇L(φ t )∥ 2 φt =∥˙ φ t ∥ 2 φt +∥∇L(φ t )∥ 2 φt +2⟨˙ φ t ,∇L(φ t )⟩ φt = ∥˙ φ t ∥ φt −∥∇ L(φ t )∥ φt 2 +2∥˙ φ t ∥ φt ∥∇L(φ t )∥ φt +2⟨˙ φ t ,∇L(φ t )⟩ φt ≥ 2∥˙ φ t ∥ φt ∥∇L(φ t )∥ φt +2⟨˙ φ t ,∇L(φ t )⟩ φt . (B.2) 101 We develop a parameterization for the term in (B.2). For a trajectory φ, we select an bijective function f :[0,1]→[0,T] as satisfying the follows: for eacht∈[0,T] andt ∗ ∈[0,1] ast=f(t ∗ ), a parameterized trajectoryφ ∗ t ∗ :=φ f(t ∗ ) satisfies ∥˙ φ ∗ t ∗ ∥ φ ∗ t ∗ =∥∇L(˙ φ ∗ t ∗ )∥ φ ∗ t ∗ . (B.3) This parameterization reduces the quasi-potential to the minimum of the following quantity: (S T (φ)≥ S f(T) (φ ∗ )=) Z f(T) 0 ∥˙ φ ∗ t ∗ ∥ φ ∗ t ∗ ∥∇L(φ ∗ t ∗ )∥ φ ∗ t ∗ +⟨˙ φ ∗ t ∗ ,∇L(φ ∗ t ∗ )⟩ φ ∗ t ∗ dt ∗ (B.4) subject to the constraint (B.3). Since the integrand of (B.4) includes the first order derivative regarding t ∗ , (B.4) holds over different parameterizationsf. For convenience, we use another bijective parameterization functiong :[0,R]→[0,1] ast ∗ =g(r) withR>0 andr∈[0,R] such thatb φ r :=φ ∗ g(r) satisfies ∥ ˙ b φ r ∥ b φr =1. (B.5) Then, the quasi-potential is reduced to the following formula, * V(θ )= inf r∈[0,R]:∥ ˙ b φ r ∥ b φr =1 Z R 0 ∥ ˙ b φ r ∥ b φr ∥∇L(b φ r )∥ b φr + D ˙ b φ r ,∇L(b φ r ) E b φr dr, (B.6) * One might think that if we parametrize as above (B.5), the equality condition for (B.2) is violated. Indeed ST(b φ)̸= Z T 0 ∥˙ φt∥ b φ t ∥∇L(b φt)∥ b φ t +⟨˙ φt,∇L(b φt)⟩ b φ t dt forb φ. 
However,b φ is introduced just for the simple calculation ofST(φ ∗ ). Althoughb φ frequently appears in the proof, our attention is still onφ ∗ andST(φ ∗ ), not onST(b φ). 102 where b φ R = θ . By the Bellman equation-type optimality, we expand the right hand side of (B.6) into the following form: V(θ )= inf r∈[0,R]:∥ ˙ b φ r ∥ b φr =1 Z R R− δ ∥ ˙ b φ r ∥ b φr ∥∇L(b φ r )∥ b φr + D ˙ b φ r ,∇L(b φ r ) E b φr dr+V(b φ R− δ ) , (B.7) with a width valueδ > 0. The Taylor expansion aroundr =R gives Z R R− δ ∥ ˙ b φ r ∥ b φr ∥∇L(b φ r )∥ b φr + D ˙ b φ r ,∇L(b φ r ) E b φr dr+V(b φ R− δ ) =δ ∥∇L(b φ R )∥ b φ R + D ∇L(b φ R ), ˙ b φ R E b φ R − ˙ b φ ⊤ R ∇V (b φ R ) +V (b φ R )+O δ 2 Takingδ →0 and noticingb φ R =θ , (B.7) can be simplified to the following equation: 0= inf r∈[0,R]:∥ ˙ b φ r ∥ b φr =1 ∥∇L(θ )∥ θ + D ∇L(x) ⊤ , ˙ b φ R E θ − ˙ b φ R ∇V (θ ) (B.8) It remains to selectb φ which solves the minimization problem (B.8). Since the following equality holds, D ∇L(θ ) ⊤ , ˙ b φ R E θ − ˙ b φ R ∇V (θ )= D ∇L(θ ) ⊤ −∇ V (θ ) ⊤ C(θ ) − 1/2 , ˙ b φ R E θ , (B.9) it is easy to see that a trajectoryb φ ∗ such that ˙ b φ ∗ R =− ∇L(x) ⊤ −∇ V (θ )C(θ ) 1/2 ∥∇L(θ ) ⊤ −∇ V (θ )C(θ ) 1/2 ∥ θ minimizes (B.9). With thisb φ ∗ , (B.8) simplifies to ∥∇L(θ )∥ θ =∥∇L(θ ) ⊤ −∇ V (θ )C(θ ) 1/2 ∥ θ . 103 Taking the square of both sides, we get the statement. B.5 Proof of Lemma 2 First, we useλ − 1 max b S T (φ) as a “proxy steepness” to estimateS(φ) andV(θ ). For any trajectoryφ that defines quasi-potential, the following bound holds. S T (φ)− λ − 1 max b S T (φ) (B.10) = 1 2 Z T 0 (˙ φ t +∇L(φ t )) ⊤ C(φ t ) − 1 (˙ φ t +∇L(φ t ))dt− 1 2 λ − 1 max Z T 0 ∥˙ φ t +∇L(φ t )∥ 2 dt (B.11) = 1 2 Z T 0 (˙ φ t +∇L(φ t )) ⊤ C(φ t ) − 1 − λ − 1 max I (˙ φ t +∇L(φ t ))dt (B.12) ≤ 1 2 Z T 0 (˙ φ t +∇L(φ t )) ⊤ C(φ t ) − 1 − λ − 1 max I (˙ φ t +∇L(φ t )) dt (B.13) SinceC(φ t ) − 1 − λ − 1 max I is positive semi-definite, (B.13)≤ 1 2 Z T 0 ∥˙ φ t +∇L(φ t )∥ 2 λ max C(φ t ) − 1 − λ − 1 max I dt (B.14) SinceD is a finite set and L(θ ) is a locally quadratic function (Assumption 1), there exists a constant M >0 that satisfies ∀θ ∈D :∥∇L(θ )∥≤ M. we can further obtain the following bound. (B.14)≤ T 2 (K +M) 2 sup 0≤ t≤ T n λ max C(φ t ) − 1 − λ − 1 max I o (B.15) (∵∥∇L(θ )∥≤ M and Assumption 6) = T 2 (K +M) 2 sup 0≤ t≤ T n λ max C(φ t ) − 1 − λ − 1 max o (B.16) = T 2 (K +M) 2 ( sup 0≤ t≤ T λ max C(φ t ) − 1 − λ − 1 max ) (B.17) 104 The following inequalities show thatsup 0≤ t≤ T λ max C(φ t ) − 1 is close toλ − 1 min (=λ max C(θ ∗ ) − 1 by Assumption 2). For anyθ ∈{φ t } 0≤ t≤ T λ max C(θ ) − 1 − λ max C(θ ∗ ) − 1 = C(θ ) − 1 op − C(θ ∗ ) − 1 op (B.18) ≤ C(θ ) − 1 − C(θ ∗ ) − 1 op (B.19) = C(θ ) − 1 (C(θ ∗ )− C(θ ))C(θ ∗ ) − 1 op (B.20) ≤ C(θ ) − 1 op ∥C(θ ∗ )− C(θ )∥ op C(θ ∗ ) − 1 op (B.21) ≤ C(θ ) − 1 op ∥C(θ ∗ )− C(θ )∥ F C(θ ∗ ) − 1 op (B.22) ≤ C(θ ) − 1 op C 0 |θ ∗ − θ | ∞ C(θ ∗ ) − 1 op (B.23) (∵ There existsC 0 >0 by Assumption 3) ≤ C(θ ) − 1 op C 0 r C(θ ∗ ) − 1 op (B.24) (There exists a constant radius,r, becauseD is finite.) ≤ C 0 rkλ − 1 min (∵ Assumption4) (B.25) Thus (B.17)≤ T 2 (K +M) 2 (1+C 0 rk)λ − 1 min − λ − 1 max (B.26) With this upper bound,V 0 can also be bounded in the followings. By definition, V 0 = inf θ ∈∂D inf T>0 inf φ:(φ 0 ,φ T )=(θ ∗ ,θ ) S T (φ) (B.27) b V 0 = inf θ ∈∂D inf T>0 inf φ:(φ 0 ,φ T )=(θ ∗ ,θ ) b S T (φ) (B.28) 105 From here below, we denoteinf φ:(φ 0 ,φ T )=(θ ∗ ,θ ) byinf φ(θ,T ) for brevity. Since∂D is a continuous finite boundary, we have the following θ † andθ ∗ . 
θ † :=arginf θ ∈∂D inf T>0 inf φ(θ,T ) S T (φ) (B.29) θ ∗ :=arginf θ ∈∂D inf T>0 inf φ(θ,T ) b S T (φ). (B.30) The followings hold. inf θ ∈∂D inf T>0 inf φ(θ,T ) S T (φ)− inf θ ∈∂D inf T>0 inf φ(θ,T ) b S T (φ)≤ inf T>0 inf φ(θ ∗ ,T) S T (φ)− inf T>0 inf φ(θ ∗ ,T) b S T (φ) (B.31) inf θ ∈∂D inf T>0 inf φ(θ,T ) b S T (φ)− inf θ ∈∂D inf T>0 inf φ(θ,T ) S T (φ)≤ inf T>0 inf φ(θ † ,T) b S T (φ)− inf T>0 inf φ(θ † ,T) S T (φ) (B.32) Similarly, we can the following finite T † andT ∗ for eachθ . T † (θ ):=arginf T>0 inf φ(θ,T ) S T (φ) (B.33) T ∗ (θ ):=arginf T>0 inf φ(θ,T ) b S T (φ), (B.34) and the followings hold inf T>0 inf φ(θ ∗ ,T) S T (φ)− inf T>0 inf φ(θ ∗ ,T) b S T (φ)≤ inf φ(θ ∗ ,T ∗ (θ ∗ )) S T ∗ (θ ∗ ) (φ)− inf φ(θ ∗ ,T ∗ (θ ∗ )) b S T ∗ (θ ∗ ) (φ) (B.35) inf T>0 inf φ(θ † ,T) b S T (φ)− inf T>0 inf φ(θ † ,T) S T (φ)≤ inf φ(θ † ,T † (θ † )) b S T † (θ † ) (φ)− inf φ(θ † ,T † (θ † )) S T † (θ † ) (φ). (B.36) 106 Similarly, sinceL(θ ) andC(θ ) are continuous, for eachθ andT , we have the followings φ † (θ,T ):=arginf φ(θ,T ) S T (φ) (B.37) φ ∗ (θ,T ):=arginf φ(θ,T ) b S T (φ). (B.38) and we get inf φ(θ ∗ ,T ∗ (θ ∗ )) S T ∗ (θ ∗ ) (φ)− inf φ(θ ∗ ,T ∗ (θ ∗ )) b S T ∗ (φ)≤ S T ∗ (θ ∗ ) (φ ∗ (θ ∗ ,T ∗ (θ ∗ )))− b S T ∗ (φ ∗ (θ ∗ ,T ∗ (θ ∗ ))) (B.39) ≤ T ∗ (θ ∗ ) 2 (K +M) 2 (1+C 0 rk)λ − 1 min − λ − 1 max (B.40) inf φ(θ † ,T † (θ † )) b S T † (θ † ) (φ)− inf φ(θ † ,T † (θ † )) S T † (θ † ) (φ)≤ b S T † (θ † ) (φ † (θ † ,T † (θ † )))− S T † (θ † ) (φ † (θ † ,T † (θ † ))) (B.41) ≤ T † (θ † ) 2 (K +M) 2 (1+C 0 rk)λ − 1 min − λ − 1 max (B.42) Thus we get V 0 − λ − 1 max b V 0 ≤ max{T † (θ † ),T ∗ (θ ∗ )} 2 (K +M) 2 (1+C 0 rk)λ − 1 min − λ − 1 max (B.43) Defining A:= max{T † (θ † ),T ∗ (θ ∗ )} 2 (K +M) 2 finishes the proof. 107 Appendices C Sharpness-Aware Minimization for Robust Molecular Dynamics Simulations C.1 Introduction Molecular dynamics (MD) simulations are widely used to computationally study material properties by following the trajectories of constituent atoms. Accurate prediction of potential energy is essential for reliable MD simulations, but first-principle quantum mechanical (QM) calculations to obtain ground-truth potential energy are computationally prohibitive. Neural network quantum molecular dynamics (NNQMD) is revolutionizing MD simulations by predicting potential energy with QM accuracy using neural networks at a fraction of cost, thus allowing large spatiotemporal MD simulations [156, 131]. Despite numerous successes, most NNQMD applications are limited to materials under gentle, near-equilibrium conditions [96], and NNQMD has rarely been applied to far-from-equilibrium processes to date. Specifically, at high temperature, the prediction error in atomic forces due to large thermal noise accumulates to produce unphysical behavior, which makes the simulation numerically unstable for larger systems at longer times, i.e. fidelity-scaling problem [159]. While [160] attempted to alleviate the instability using active learning, their practical applicability is limited in terms of scalability and system dependence . It is an urgent task to find an alternative approach that is model- and system-agnostic without sacrificing the algorithmic scalability, to improve model generalizability for far-from-equilibrium NNQMD simulations. 108 We claim that sharpness, a recent notion in machine learning community, could provide a clue to solve the NNQMD instability problem. 
Sharpness of a neural network is defined as the sensitivity of the training loss to perturbations of the weight parameters. Sharpness has gained particular attention in recent years because an algorithm with sharpness regularization achieved state-of-the-art (SOTA) performance on image classification tasks [60]. More importantly for QMD applications, neural networks with regularized sharpness have been shown to exhibit strong robustness against noise in addition to high generalization ability [181]. Given that the simulation becomes unstable through a combination of physical inaccuracy and numerical fragility, it is natural to apply a regularizer, SAM, that improves both physical fidelity (i.e., model generalizability) and robustness. In this work, we examine the effect of sharpness regularization on potential energy prediction performance using SOTA graph neural networks (GNNs). We start with a technical summary of MD simulation and sharpness regularization, followed by the main result that sharpness regularization significantly improves prediction performance on multiple material datasets.* Our results also indicate a correlation between the sharpness of the weight parameters and the nature of the potential energy landscape.

* https://github.com/ibayashi-hikaru/Sharpness_MD

C.2 Neural Network Quantum Molecular Dynamics

Molecular dynamics (MD) is an atomistic modeling method widely used in the materials and chemical sciences. By numerically solving Newton's equations of motion, one obtains the complete history of atomic positions, $r^N = \{r_i \mid i = 1, \dots, N\}$ ($N$ is the number of atoms), as well as the material properties of interest. NNQMD is a recently emerged approximation scheme for MD simulation that statistically predicts the potential energy and thereby provides approximate trajectories of atoms [156, 131].

Figure C.1: Snapshots of crystalline silicon carbide (a) and silicon dioxide glass (b), where spheres represent carbon (cyan), silicon (blue) and oxygen (red) atom positions and cylinders indicate chemical bonds. (c)-(e) Pair distribution functions for Si-Si, Si-C and C-C pairs at 300K (black) and 900K (red), respectively.

Quantum Mechanical Simulation
To generate training and test datasets, we use a reactive molecular dynamics (RMD) method based on the ReaxFF inter-atomic potential [46] and the RXMD software [150]. ReaxFF significantly reduces the computational cost compared with other QM-based approaches while reproducing chemical reactions and charge transfer at QM-level accuracy. We use two datasets with distinct structural characteristics, i.e., crystalline silicon carbide (SiC) and amorphous silicon dioxide glass (a-SiO2), shown in Figs. C.1 (a) and (b), respectively. The two systems are thermalized at a temperature of 300K for model training with and without SAM, and at 900K for the evaluation of out-of-sample error and of model generalizability against thermal noise. The glass structure of the a-SiO2 system was obtained by the melt-quench method [195], resulting in a disordered network of chemical bonds. To characterize the temperature effect on the atomic configuration, Figs. C.1 (c)-(e) compare the pair distribution functions of SiC thermalized at 300K and 900K. Due to the enhanced thermal motion at the elevated temperature of 900K, the atomic positions deviate from the training dataset, increasing the likelihood that a trained model will face unseen atomic configurations.
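As an aside, the pair distribution functions in Figs. C.1 (c)-(e) can be estimated from a single snapshot of atomic positions by histogramming interatomic distances and normalizing by the ideal-gas expectation. The following NumPy sketch assumes an orthorhombic periodic box and a single species pair; it is a generic textbook estimator, not the exact post-processing used for the figures.

import numpy as np

def pair_distribution(positions, box, r_max, n_bins=200):
    # g(r) from one snapshot under the minimum-image convention.
    # positions: (N, 3) array; box: (3,) edge lengths of the periodic cell.
    n = len(positions)
    edges = np.linspace(0.0, r_max, n_bins + 1)
    hist = np.zeros(n_bins)
    for i in range(n - 1):
        d = positions[i + 1:] - positions[i]
        d -= box * np.round(d / box)                   # periodic wrap
        r = np.linalg.norm(d, axis=1)
        hist += np.histogram(r[r < r_max], bins=edges)[0]
    rho = n / np.prod(box)                             # number density
    shells = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    ideal = 0.5 * n * rho * shells                     # expected pair counts
    r_mid = 0.5 * (edges[1:] + edges[:-1])
    return r_mid, hist / ideal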
Representation of Atoms
Toward a natural representation of atomic data, graph representations learned by GNNs have been attracting great attention in the materials and chemical science domains. In practice, GNNs have shown higher performance than the naive approach based on static descriptors [65]. Several GNN models have been proposed for chemistry-related problems, mostly focusing on molecular systems [65, 25, 167, 202]. GNNs have also been used for material predictions involving periodic crystals and surfaces [47, 128, 154, 202], as well as for MD simulations [153].

Instability Issue and Related Works
One of the key issues of NNQMD is its instability during long-time simulations, where the accumulated error often causes a breakdown of the simulation. One way to alleviate the instability is the active-learning approach, in which a model is adaptively retrained on-the-fly on newly generated atomic configurations whenever the model prediction uncertainty exceeds a prescribed threshold [194]. However, this approach is not scalable because the repeated retraining incurs a heavy cost. As an alternative approach, [159] proposed a physics-based regularization founded on statistical mechanics. While they successfully performed a billion-atom NNQMD simulation of light-induced polarization dynamics in lead titanate crystals, the proposed inductive bias is highly system-dependent, requiring delicate tuning of the bias strength to prevent catastrophic failure while preserving QM-level accuracy. In contrast, the SAM approach proposed here is scalable and agnostic to the model and the system. More importantly, SAM minimizes sharpness, which, by definition, increases model stability, as we discuss in the next section.

C.3 Experiments

To investigate the applicability of SAM to the MD problem, we have trained several GNN models (CGCNN [202], SchNet [167], MEGNet [25]) recently developed for materials applications using the MatDeepLearn framework [65]. Model training is done with the 300K dataset, and generalizability is tested with the higher-temperature dataset at 900K. We examine the effect of SAM with several values of the hyperparameter ρ in the loss function; a minimal sketch of the SAM update step is given below. As the baseline optimizer, we train each model with AdamW (decoupled weight decay) [127] for 1000 epochs. Hyperparameter tuning is done with the Ray Tune library. We have used NVIDIA GPU (V100) cluster nodes to perform model training, hyperparameter optimization, and test error evaluation.

Table C.1: Out-of-sample errors without SAM (baseline) and with SAM (eV/atom)

                 SiC                                    a-SiO2
           baseline   with SAM   weight (ρ)       baseline   with SAM   weight (ρ)
CGCNN      1.957      0.008      0.01             0.065      0.204      0.01
SchNet     0.025      0.024      0.02             0.071      0.011      0.02
MEGNet     19.17      0.447      0.05             0.055      0.169      0.01

We observe improved out-of-sample error for all models on the crystalline SiC dataset, with reductions ranging from 5% to 99.6% relative to the baseline. Many of the optimal test values were obtained with a SAM weight ρ around 0.02, which is consistent with the value reported by [60]. Unlike the SiC case, SAM does not consistently improve test performance on the a-SiO2 dataset. The energy landscape of glasses is known to be rough and to consist of intrinsically sharp minima [113]. It is remarkable that flat minima do not yield high performance for some materials data, whereas they consistently show high performance on images [97].
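The SAM update referenced above follows the two-step ascent/descent procedure of [60]: perturb the weights toward the worst case within an L2 ball of radius ρ, then apply the base-optimizer step using the gradient evaluated at the perturbed weights. The following is a minimal PyTorch-style sketch of one such update; it is illustrative only, not the exact MatDeepLearn training loop, and loss_fn and the base optimizer are assumed to be provided by the caller.

import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.02):
    # 1) gradient at the current weights
    loss = loss_fn(model, batch)
    loss.backward()

    # 2) ascend: move each weight along the normalized gradient direction
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads])) + 1e-12
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # 3) descend: gradient at the perturbed weights, restore, then update
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return float(loss.detach())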
Given the accumulated knowledge of material properties in materials science, it is an interesting research question to reveal the relation between the properties of the energy landscape and those of the loss landscape.

C.4 Conclusion

We have studied the effect of the novel SAM technique on materials datasets using SOTA GNN models. Two systems, SiC and a-SiO2, have been examined, considering their distinct characteristics in terms of chemical bond networks and the associated potential energy landscape. The GNN models were optimized and trained using an ambient-condition dataset. The effect of SAM on model generalizability was tested using a high-temperature dataset characterized by enhanced thermal noise. Our study suggests a promising avenue toward materials models with enhanced generalizability, as well as insight into future SAM algorithm design that takes advantage of the energy landscape of materials. SAM is a lightweight, model- and system-agnostic approach to improving generalization performance in the physical sciences. In particular, SAM will be beneficial for large-scale NNQMD simulations of far-from-equilibrium material processes, where even a carefully handcrafted model suffers from the fidelity-scaling problem. Our results encourage further application of SAM to materials datasets and, at the same time, suggest a possible route toward future SAM algorithm design.

Appendices D
Simulating liquids on dynamically warping grids

D.1 Introduction

Liquid simulations for computer graphics have long dazzled audiences with their ability to reproduce visually intricate physical features like vibrant water splashes and subtle ripples. Previous researchers have endeavored to reproduce these effects by designing algorithms specific to each phenomenon. Examples of this strategy include thin sheets [199, 141], water droplets [196], foams [34, 21, 101], underwater bubbles [155, 87] and capillary waves [185, 22]. An alternative to developing algorithms for specific phenomena is to use spatially adaptive simulation. Unstructured grids have been the main ingredient to achieve this goal. Examples of methods employing unstructured grids include octrees [126, 57], tetrahedral grids [106, 56, 31, 28], and Voronoi diagrams [18, 71, 176]. A central benefit of adaptive simulations is the capability of simulating large-scale phenomena at a feasible computational cost [5, 57]. The main idea of our work is to enhance existing regular-grid algorithms to deliver dynamic spatial adaptivity without excessive computational overhead. Common strategies for simulating with unstructured meshes come with significant complications, namely inefficient memory access and implementation complexity due to non-trivial adaptive mesh refinement. Meanwhile, current adaptive methods based on structured grids offer only limited types of deformation.

Figure D.1: Liquid leaking out of a container with holes. Our dynamically warping grids adaptively capture detailed motions without the computational expense of unstructured grids. The left image depicts the surface mesh and the corresponding FLIP particles color-coded by the grid cell size, while the right image shows a cross-section. The simulation used 128³ cells and ran at 25 seconds per time step and 1.8 minutes per video frame.

In this context,
we propose a novel adaptive method with dynamically warping grids, which includes four technical contributions: • Deformationsolver We present a new method to adaptively warp a Cartesian grid by iteratively solving a non-linear system. Our deformation solver quickly adapts the grids to the specific region of interest while taking into account temporal coherence (Section D.4). • Advection Solver Compatible with existing algorithms Our advection method is able to reuse off-the-shelf algorithms like a narrow-band FLIP solver [58], the MacCormack method [169], and sixth- order Weighted Essentially Non-Oscillatory (WENO) interpolation [129] to advect the velocity and the surface geometry (Section D.5). 115 • Pressure solver on deforming grids We introduce a new pressure solver specifically designed for simulating liquid with a deforming regular grid. Our solver is straightforward to implement and avoids the null-space issues typically associated with hexahedral mesh solvers. (Section D.6.1) • Generalized Ghost Fluid method We propose a new formulation to accurately treat free surface boundary conditions extending the ghost fluid method [51]. We show that our approach delivers second- order accuracy on our warped grids and the technique generalizes to arbitrary discretizations (Section D.6.2). D.2 Related Work Our work is based on Eulerian fluid simulations. Eulerian fluid animation has flourished since the introduction of unconditionally stable semi-Lagrangian advection by Stam [180]. Afterward, Foster and Fedkiw [61] enforced free-surface boundary conditions to simulate liquids. Since then, Marker-And-Cell (MAC) grids combined with the level-set method have become the standard method for grid-based liquid simulation. MAC grids were originally introduced to graphics by Harlow and Welch [78] and to computer graphics by Foster and Metaxas [62]. Since then, researchers have continually extended them to support second-order accurate boundary conditions for free surfaces [51], and to handle two-way coupled rigid bodies [9]. Interested readers are referred to the book by Bridson [17] for an overview of the state-of-the-art in fluid animation. Because we target adaptivity, the remainder of this section focuses on methods related to adaptive simulation. D.2.1 Arbitrary Lagrangian-Eulerian method Our method is similar to the arbitrary Lagrangian-Eulerian (ALE) method [81]. ALE method combines a Lagrangian method with an Eulerian method in the sense that it uses grids like Eulerian methods, but those grids move with liquid like Lagrangian methods. Several works in computer graphics also make use of 116 ALE methods, such as translating grids [172, 162] or adaptively refining grids [105]. We refer interested readers to Donea et al. [43] for a comprehensive overview of ALE methods. To the best of our knowledge, our work is the first to computer graphics that applies the ALE method for deforming uniform grids. D.2.2 Unstructured meshes Tetrahedral meshes have been widely used for adaptive simulations [31, 10, 28, 106, 5]. Working with tetrahedral meshes requires high-quality mesh generation, an active area of research [4, 114]. The cost of accessing an element typically requires a tree traversal or hash table lookup, which substantially lowers the cache hit rate. Researchers have also used octree grids. Losasso et al. [126] and Setaluri et al. [170] employed the Finite V olume Method on an octree grid to discretize pressure and velocity similarly to the MAC discretization. 
Nielsen and Bridson [147] and Ferstl et al. [57] discretized them with the Finite Element Method, where the pressure and the velocity are collocated. Some have also investigated adaptive simplicial complexes [139, 53] and V oronoi-based approaches with a particle-based pressure solver [71, 176]. Sparse grid methods [143, 170] can overcome many of these problems with slow access times. Recently, Aanjaneya et al. [1] demonstrated that liquid simulations are possible with a sparse grid discretization based on power-diagrams. However, the necessity of such specialized discretizations at the free surface illustrates the difficulties that commonly arise for adaptive methods. Our generalized boundary conditions in conjunction with the warped regular grids circumvent such problems. D.2.3 Structured grids Curvilinear coordinates also bring adaptivity to structured grids. The use of curvilinear coordinates dates back to the original work of FLIP [16] and was introduced to graphics by Azevedo and Oliveira [7]. Aside from curvilinear grids, Zhu et al. [209] proposed to use a new extended grid structure where grids were stretched horizontally or vertically. Both of the methods used the Finite V olume Method to discretize the 117 pressure. Our work is related to their methods in the sense that we also use stretched regular grids. However, we further allow such grids to dynamically warp over time, which gives more flexible adaptivity. In the context of structured grids, Irving et al. [91] proposed tall cells for liquid simulations, which vertically coarsened the grids away from the free surface, and Chentanez and Müller [29] extended them to real-time applications. D.2.4 Image warping Image and video processing algorithms have similarly investigated deforming regular grids to adapt the image content to various constraints. For still images, researchers have demonstrated that feature-aware deformations can help to preserve salient regions of an image [66]. Non-linear deformations of video streams were likewise proposed to robustly re-scale video content to different devices [107], or to change the disparity of stereoscopic videos [116]. While these approaches share our goal to achieve flexible deformations with regular grids, the target applications impose very different constraints. Image deformation algorithms usually assign a desired shape and size to each pixel. Because the target deformation is fixed to each pixel, the problem can often be reduced to a linear system. In contrast, adaptive refinement algorithms are usually given a desired volume as a function of space, so moving an element will, in fact, change its desired volume. Because this is inherently a moving-target problem, it inevitably results in a non-linear system. To cope with this added difficulty, we propose a solution which reduces the solution space in a problem-specific manner and propose problem-specific speed-ups. D.2.5 Hybrid approaches Several researchers focused on combining Cartesian grids with unstructured grids. For example, the methods of Dobashi et al. [42] and Azevedo and Oliveira [7] developed overlapping grids to cover complex regions. Both methods used Cartesian grids of different resolutions that are translated and rotated. English 118 et al. [50] introduced Chimera grids, which decompose the simulation domain into multiple regions of interest. These regions are individually discretized with different Cartesian grids and later combined into a single linear system. 
Some researchers exploited irregular meshes to capture detailed boundaries. Feldman et al. [56] adapted irregular tetrahedra to the curved boundaries of solids, while Brochu et al. [18] devised V oronoi-based meshes to accurately capture the geometry and the topology of free surfaces. These hybrid approaches improve the average cache hit rate, and our method likewise benefits from fast memory accesses provided by regular grids. However, in contrast to many of these approaches, we do not employ any cells with irregular connectivity. Our grids purely consist of warped cubes, so many tasks like the implementation and stability analyses are simpler. D.2.6 Mesh-free methods Adaptive Smoothed-Particle Hydrodynamics (SPH) methods are also studied in the engineering literature [151, 117] as well as interactive applications [207]. Adams et al. [3] proposed to dynamically subdivide SPH particles, while Solenthaler and Gross [179] used a hierarchy model to circumvent the numerical issues introduced by particle splitting and merging. Particle representations are especially well suited for splashes, and our FLIP-based solver also employs particles to represent small structures. D.2.7 Polynomial and spectral adaptivity Some researchers designed specific basis functions for increasing efficiency. Treuille et al. [187] extracted a representative basis by performing Principal Component Analysis (PCA) on a set of training examples. De Witt et al. [37] used a Laplacian eigenfunction basis and derived how the basis coefficients change over time. Edwards and Bridson [49] used the Discontinuous Galerkin method to simulate detailed liquids on very coarse grids by representing pressure with a high-order polynomial basis. This class of methods is 119 Figure D.2: Merging bunny: the top image shows the spatial (physical) coordinate and the bottom corresponding reference coordinate at the same timings, illustrating the magnification of local feature. 128× 64× 128 resolution,10 seconds per time step and29 seconds per video frame. Algorithm 1: Simulation Loop input : Velocity field and the level-set u t andϕ t . output: New velocity field and the level-set u t+∆ t andϕ t+∆ t . 1 Solve the grid deformation and get the mesh velocityu D (Section D.4) 2 Convertu t andu D to the reference coordinate ¯u t and ¯u D 3 Advectu ∗ t ← Adv(u t ) through ¯u t − ¯u D on the reference coordinate (Section D.5) 4 Advectϕ ∗ t ← Adv(ϕ t ) through ¯u t − ¯u D on the reference coordinate (Section D.5) 5 Add force tou ∗ t 6 Projectu t+∆ t ← Proj(u ∗ t ) (Section D.6) 7 Re-distanceϕ t+∆ t ← Redist(ϕ t ) known as p-adaptivity, and our warping grids framework could benefit from combining it with such high-order bases. D.3 Method Overview and Definitions We solve the Navier-Stokes (NS) equations by an operator splitting approach [17]. Our method consists of three independent building blocks: a grid deformation solver for spatial adaptivity (Section D.4), the 120 advection of velocity and surface geometry (Section D.5), and a pressure solver for warped grids (Section D.6). We note that the majority of our results use a narrowband FLIP solver [58] for additional efficiency. We outline our approach in Algorithm 1, and we describe each of the components in more detail in the following sections. Before we start discussing our algorithm, we would like to define our terminology, which follows ALE practice [43]. We refer to coordinates in the physical space as ”spatial coordinates”. 
Quantities defined in this coordinate can be directly visualized (Figure D.2 top). “Reference coordinate”, on the other hand, denotes the coordinate on uniform grids (Figure D.2 bottom). We exploit reference coordinates for the convenience of calculation, but quantities in this coordinate have no direct physical correspondences. Deformation solver is the operation that produces a vector field that we use to deform a uniform grid. “Mesh velocity” denotes the change of this vector field over time. D.4 Deformation Solver The problem of deforming a grid has been revisited many times throughout the image- and geometry-processing literature. The most efficient of these algorithms address the problem by assigning a desired deformation to each element and then deforming each element to optimally match the target. Because the target deformation is fixed to each element , the problems can often be reduced to a linear system. However, in our application of adaptive refinement, the desired deformation depends on interesting physical behaviors, which will inevitably be a function of physical space. Thus, the target deformations of the elements will change as the grid deforms, and the system is fundamentally non-linear. To cope with this difficulty, we exploit the structure of our particular problem in order to both reduce the dimension of our solution space (Section D.4.1) and devise numerical acceleration strategies (Section D.4.4). 121 Figure D.3: Combing a liquid: the left side of the zoomed image highlights our warped grids adapting near the solid boundaries, making the accurate interaction with cylinders possible.128× 64× 128 resolution,12 seconds per time step and30 seconds per video frame. D.4.1 Defining the problem We first note that our deformation aims to control element volumes instead of completely general deformations. We can exploit this to reduce the solution space of our deformations from a 3-dimensional displacement vector field down to a scalar field. To do this, we first note that the volume of an element can be computed as the cell’s rest volume added to divergence of the deformation field integrated over the cell, according to the divergence theorem. Then we note that the divergent part of a vector field is uniquely described by the gradient of a scalar field, according to the Helmholtz decomposition. Thus, we can remove all deformations that are irrelevant to our sizing function by restricting each element’s deformation tox(ξ,φ )=ξ +∇ ξ φ, whereξ is the reference coordinate andφ is a scalar field containing all of the degrees of freedom in the deformation. In addition, to encode the region of interest, i.e. the region a user wants to emphasize, our solver is fed a user-defined scalar field, which we call sizing function orf size (x). Sizing function can be interpreted as a field of sinks in the spatial coordinate. By putting a large sink at a 122 specific area, a user can make nodes concentrated and obtain a fine grids there. Therefore, we obtain φ via the following non-linear Poisson equation: ∇ 2 ξ φ(ξ )=f size x(ξ ,φ) . (D.1) Each time step, we solve for the scalar field φ and then deform the grid according tox(ξ,φ ). We storeφ at cell centers, and discretize the Laplacian operator by the seven-point Laplacian matrix. We evaluate∇ ξ φ on vertices and use it to displace the vertices. This nodal staggered discretization naturally avoids the null-space problems that can occur on collocated discretizations. We solve Eq. 
(D.1) by iteratively solving a linearized version of the equation withφ on the right hand side fixed to the value produced by the previous iteration. Each iteration imposes Neumann boundary conditionsn Ω ·∇ ξ φ=0, wheren Ω is the boundary of our grid (not the boundary of our simulation domain). To make sure a solution exists in each iteration, we re-normalize the right hand side by first offsetting it with a constant γ such that R Ω (f size (x(ξ ,φ))+γ )dV ξ =0. Then, to ensure a numerically stable scaling, we divide the vector on the right hand side by its minimum value. D.4.2 Sizing Function As we described in Section D.4.1, we propose a sizing function that is defined by users to emphasize the region of interest. We would like to note that although it gives us the intuition of the sink field, our sizing function is heuristically designed, and is not intended to mimic any particular physical quantity in the real world. We adopted a criteria of using the liquid surface and the region of large velocity as the region of interest. There could be various ways to define a sizing function but we found that it becomes quite controllable when formulated as a sum of exponential functions: f size (x)=− exp − α ϕ b ϕ +exp − β u ||u|| , (D.2) 123 where b ϕ denotes the distance from the free surface and||u|| denotes the magnitude of the velocity at point x. The relative importance of these two terms is weighted by user-weighted parametersα ϕ andβ u . All of our examples use the valuesα ϕ =7 andβ u =0.5, except for Figure D.6 where we have used a static warped grid generated from the initial fluid condition with the same parameters. In case a user wishes to emphasize another region of interest, any additional term can be similarly added to Eq. (D.2). For example, for a scene with detailed boundary interactions (Figure D.3), we add an additional term− exp − γ ψ b ψ , where b ψ is the distance from the solid boundary, and we setγ ψ =7. We evaluate the sizing function at cell centers and apply a small amount of blurring to ensure smoothness. D.4.3 Temporal Deformation Penalty Depending on the sizing functionf size , the solution to Eq. (D.1) can produce temporally flickering deformations. We introduce the following energy function to penalize the drastic motion or acceleration of our displacement field: P temporal = Z Ω 1 2 ρ 1 ∂∇ ξ φ ∂t 2 + 1 2 ρ 2 ∂ 2 ∇ ξ φ ∂t 2 2 dV ξ . (D.3) whereρ 1 andρ 2 are tunable control parameters. Discretizing the time derivatives with implicit Euler and minimizing Eq. (D.3) with respect toφ t+1 gives us the following equation: ρ 1 ∆ t 2 + ρ 2 ∆ t 4 ∇ 2 ξ φ t+1 = ρ 1 ∆ t 2 + 2ρ 2 ∆ t 4 ∇ 2 ξ φ t − ρ 2 ∆ t 4 ∇ 2 ξ φ t− 1 , (D.4) 124 Figure D.4: Timings of our examples whereφ t− 1 ,φ t , andφ t+1 denote the value ofφ at the previous, current, and next time step. Finally, we add Eq. (D.4) to Eq. (D.1) to get an equation forφ that trades off between temporal coherence and respecting the sizing function: 1+ ρ 1 ∆ t 2 + ρ 2 ∆ t 4 ∇ 2 ξ φ=f size x(ξ ,φ) + ρ 1 ∆ t 2 + 2ρ 2 ∆ t 4 ∇ 2 ξ φ t − ρ 2 ∆ t 4 ∇ 2 ξ φ t− 1 . (D.5) We setρ 1 =2∆ t 2 andρ 2 =∆ t 4 throughout our examples. Once again, we re-normalize the right-hand side of the equation to ensure stability. 
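To make the procedure concrete, the following is a minimal 2D sketch of the fixed-point iteration for Eq. (D.1); the temporal penalty terms of Eq. (D.5) and the multigrid machinery are omitted for brevity. It builds a standard 5-point Laplacian with homogeneous Neumann boundaries once, then alternates between re-evaluating the user-supplied sizing function at the warped positions x = ξ + ∇_ξ φ and solving the linearized system; the zero-mean shift of the right-hand side plays the role of the constant γ above. All names are illustrative, and SciPy's conjugate-gradient solver stands in for the preconditioned solver of the actual implementation.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def neumann_laplacian(nx, ny, h):
    # Negative Laplacian (positive semi-definite) with homogeneous Neumann boundaries.
    def idx(i, j):
        return i * ny + j
    rows, cols, vals = [], [], []
    for i in range(nx):
        for j in range(ny):
            deg = 0.0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < nx and 0 <= nj < ny:
                    rows.append(idx(i, j)); cols.append(idx(ni, nj)); vals.append(-1.0)
                    deg += 1.0
            rows.append(idx(i, j)); cols.append(idx(i, j)); vals.append(deg)
    return sp.csr_matrix((vals, (rows, cols)), shape=(nx * ny, nx * ny)) / h ** 2

def solve_deformation(f_size, nx, ny, h, n_iter=10):
    # Fixed-point iteration for laplacian(phi) = f_size(x(xi, phi)).
    A = neumann_laplacian(nx, ny, h)                        # fixed left-hand side
    xi = np.stack(np.meshgrid(np.arange(nx) * h, np.arange(ny) * h,
                              indexing="ij"), axis=-1)      # reference positions
    phi = np.zeros(nx * ny)
    for _ in range(n_iter):
        gx, gy = np.gradient(phi.reshape(nx, ny), h)        # grad(phi)
        x = xi + np.stack([gx, gy], axis=-1)                # warped positions
        rhs = f_size(x).ravel()
        rhs -= rhs.mean()                                   # solvability under Neumann BCs
        phi, _ = spla.cg(A, -rhs, x0=phi, maxiter=200)      # A approximates -laplacian
    return phi.reshape(nx, ny)

In the actual solver, this linear system is instead preconditioned, damped, and run on a multi-resolution pyramid as described in Section D.4.4.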
Performance Profile for Leaking Scene (Average)
Deformation Solver             2.4 sec
Entire Pressure Projection     5.5 sec
Matrix Assembly                2.4 sec
PCG Solver                     2.5 sec
PCG Iteration Count            31
FLIP Operations                6.7 sec
FLIP Particle Count            1.1 × 10^6
Time Per Time Step             25 sec
Time Per Video Frame           110 sec
Maximal Volume Change Ratio    17
Table D.1: Timings of our examples

D.4.4 Speeding-up the Solver

Repeatedly solving the linear system in Eq. (D.5) can be expensive. We accelerate our solver by exploiting the fact that the left-hand side of the linear system remains the same throughout the simulation, allowing us to pre-compute a preconditioner at the beginning of a simulation. In our work, we have employed an Algebraic Multigrid (AMG) method provided by Demidov [39] as a preconditioner. For the coarsening and smoothing operators of the AMG method, we used the Smoothed Aggregation technique and a Gauss-Seidel smoother provided with the library. Our method benefits from such a multigrid method since the domain of our deformation grids does not involve any irregular boundaries.

Unlike the pressure solver, our deformation solver does not need to reach an accurate solution in practice. In our case, we stop the iteration when the maximal change per iteration is less than half of a grid cell size. Furthermore, at each iteration we perform only three steps of the Biconjugate Gradient Stabilized method (BiCGSTAB) and re-evaluate the right-hand side of Eq. (D.5), which we found to converge quickly and stably. We solve the deformation by first constructing a multi-resolution pyramid by repeatedly down-sampling the grids by a factor of two. We start by performing our deformation solve at the coarsest resolution, then upsample the solution and continue to the next higher resolution. In our examples, we solve only up to one level coarser than the finest resolution. The deformation at the finest resolution is computed simply by subdividing the grids from the previous level. To prevent each solver iteration from overshooting the solution, and thus leading to a poor convergence rate, we damp each update by fifty percent.

D.5 Advection

We follow an ALE formulation of the form [43]:
\[ \frac{\partial u(\xi)}{\partial t} + \bar{c}(\xi) \cdot \nabla_\xi u(\xi) = 0, \qquad (D.6) \]
where $\bar{c} = \bar{u}(\xi_i) - \bar{u}_D(\xi_i)$, and we denote a vector quantity $a$ given in the spatial coordinates as $\bar{a}$ when expressed in the reference coordinates. This can be interpreted as advecting the material velocity $u$ along the convective velocity $\bar{c}$, which is defined as the difference in motion between the fluid and the background mesh. The mesh velocity $u_D$ is defined as:
\[ u_D = \frac{\partial \nabla_\xi \phi}{\partial t}. \qquad (D.7) \]
One may approximate Eq. (D.6) with semi-Lagrangian advection in the spatial coordinates, but in our method, we perform semi-Lagrangian advection in the reference coordinates. Finally, the convective velocity is obtained as follows:
\[ \bar{c}(\xi_i) = J^{-1} u(x_i) - J^{-1} u_D(x_i), \qquad (D.8) \]
where $\xi_i$ and $x_i$ denote the corresponding cell centers in the respective coordinates, and $J$ is the grid Jacobian, which will be precisely defined in Section D.6.1. One remarkable benefit of our advection scheme is that it automatically incorporates the effect of the dynamically warping grids while using standard advection schemes. For example, we have employed the MacCormack method [169] combined with WENO interpolation [129] to advect both the level-set and the velocity, as well as a second-order Runge-Kutta scheme for advecting the FLIP particles. In addition, unlike conventional unstructured adaptive methods, we do not need to access a grid element by an $O(\log N)$ tree traversal.
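As an illustration of Eqs. (D.6)-(D.8), the following 2D sketch performs one semi-Lagrangian step for a cell-centered scalar field directly in the reference coordinates. A first-order backtrace with linear interpolation stands in for the MacCormack/WENO scheme used in our solver, and the per-cell inverse Jacobians, material velocities, and mesh velocities are assumed to be given as arrays; all names are illustrative.

import numpy as np
from scipy.ndimage import map_coordinates

def advect_reference(q, u, u_mesh, J_inv, dt, h):
    # Semi-Lagrangian advection of a cell-centered field q on the reference grid.
    # q: (nx, ny); u, u_mesh: (nx, ny, 2) spatial velocities at cell centers;
    # J_inv: (nx, ny, 2, 2) inverse grid Jacobians; h: reference cell spacing.
    c = np.einsum("...ab,...b->...a", J_inv, u - u_mesh)   # convective velocity, Eq. (D.8)
    nx, ny = q.shape
    I, J = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
    # backtrace the cell centers in reference coordinates (grid-index units)
    src_i = I - dt * c[..., 0] / h
    src_j = J - dt * c[..., 1] / h
    return map_coordinates(q, [src_i, src_j], order=1, mode="nearest")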
D.6 Pressure Solver on Warped Grids

At the heart of our method's adaptivity is a pressure solver that runs on a warped grid while being numerically robust and straightforward to implement. Our pressure solver also includes a novel technique to realize second-order accurate Dirichlet boundary conditions on a warped grid, formulated as a generalization of the well-known ghost fluid technique [51].

D.6.1 Kinetic Energy Minimization

Our discretization uses piecewise-constant velocity vectors stored at cell centers and piecewise-trilinear pressure samples stored on vertices. We derive our pressure solver through kinetic energy minimization [9]:

    \text{minimize}_p \int_\Omega \frac{1}{2}\rho \left\| u^* - \frac{\Delta t}{\rho} \nabla p \right\|^2 dV,    (D.9)

where Δt, u*, Ω, and ρ denote the time step size, the intermediate velocity after advection, the liquid domain, and the fluid density, respectively. We solve Eq. (D.9) with the Finite Element Method (FEM). We refer to the reference coordinates with ξ = (ξ, η, τ) and the spatial (physical) coordinates with x = (x, y, z). We express the trilinear pressure function as a sum of shape functions weighted by the pressure samples on the vertices:

    p(\xi) = \sum_{i=1}^{8} N_i(\xi) \, p_i,    (D.10)

where N_i is the trilinear shape function defined on vertex i. Taking the gradient of both sides of Eq. (D.10) with respect to x (using the chain rule), and then substituting the result into Eq. (D.9) gives:

    \text{minimize}_p \sum_{i=1}^{n} \int_{\Omega_i} \frac{1}{2}\rho \left\| u^* - \frac{\Delta t}{\rho} J^{-1} \nabla_\xi \sum_{j=1}^{8} N_j p_j \right\|^2 |J| \, dV_\xi,    (D.11)

where n and Ω_i denote the total cell count and a cubic cell domain in the reference coordinates, ∇_ξ = (∂/∂ξ, ∂/∂η, ∂/∂τ)^T, and dV_ξ = dξ dη dτ. |J| is the determinant of the Jacobian matrix J ∈ R^{3×3} with respect to ξ, whose entries are given by:

    J_{ij} = \frac{\partial x_j}{\partial \xi_i} = \frac{\partial}{\partial \xi_i} \left\{ \sum_{k=1}^{8} N_k(\xi) \, v_k \right\}_j,    (D.12)

where {·}_j denotes the j-th component of a vector, and v_k is the position vector of vertex k of the warped cube element in spatial coordinates. This Jacobian determinant encodes the relative volume change induced by the change of coordinates, |J| = dV_x / dV_ξ, where dV_x = dx dy dz. Taking the derivative of Eq. (D.11) with respect to the pressure and setting it to zero yields a symmetric positive-definite linear system:

    \sum_{i=1}^{n} \int_{\Omega_i} (\nabla)^T D (\nabla) [p] \, dV_\xi = \sum_{i=1}^{n} \int_{\Omega_i} (\nabla)^T |J| \, u^* \, dV_\xi,    (D.13)

where [p] ∈ R^{8×1} is the discretized pressure on the eight vertices of a grid cell (we reserve [·] for discretized variables). (∇) ∈ R^{3×8} is the matrix encoding the gradient, which maps nodal values to a cell-centered vector. Each entry is given by:

    (\nabla)_{ij} = \left\{ J^{-1} \nabla_\xi N_j(\xi) \right\}_i.    (D.14)

D ∈ diag(R³) is the diagonal matrix:

    D = \frac{|J| \, \Delta t}{\rho} I,    (D.15)

where I is the identity matrix. In our results, we set ρ = 1 for convenience. We discretize Eq. (D.13) with an eight-point Gaussian quadrature integration scheme (single-point Gaussian quadrature is provably unstable, as we discuss in Section D.9). The boundary conditions of the system in the absence of surface tension are p = 0 on free surfaces and n · ∇p = 0 on static solid boundaries [17], where n is the normal vector of a solid. First-order accurate boundary conditions are obtained by setting p_i = 0 for vacuum vertices and skipping the evaluation of Eq. (D.11) for solid cells. We solve the linear system using a preconditioned Conjugate Gradient method (PCG) with the modified incomplete Cholesky factorization provided by Bridson [17] (with MIC(0) parameters τ = 0.97 and γ = 1.0). Once we have solved for the pressure, we update the velocity at the cell centers by:

    u_{new} = u^* - \frac{\Delta t}{\rho} J^{-1} \nabla_\xi \sum_{k=1}^{8} N_k(\xi_c) \, p_k,    (D.16)

where ξ_c denotes the cell center.
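To make the element-level assembly concrete, the following sketch (Python/NumPy; a simplified rendering with hypothetical names, not the code behind this appendix) evaluates the trilinear shape-function gradients and the Jacobian of Eq. (D.12) at each Gauss point and accumulates the 8×8 element matrix of Eq. (D.13) with eight-point Gaussian quadrature; the reference cell is parameterized over [-1, 1]^3 here.

    import itertools
    import numpy as np

    # Vertex signs of the reference cube [-1, 1]^3, one row per trilinear node.
    S = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)  # (8, 3)

    def shape_grad(xi):
        # Gradients dN_k/dxi of the trilinear shape functions at xi in [-1, 1]^3.
        g = np.empty((8, 3))
        for k, s in enumerate(S):
            for d in range(3):
                terms = [(1 + s[j] * xi[j]) / 2 for j in range(3)]
                terms[d] = s[d] / 2
                g[k, d] = terms[0] * terms[1] * terms[2]
        return g                                   # (8, 3)

    def element_matrix(verts, dt, rho=1.0):
        # verts: (8, 3) spatial positions v_k of the cell vertices, ordered like S.
        A = np.zeros((8, 8))
        gp = 1.0 / np.sqrt(3.0)                    # 2x2x2 Gauss points, weights 1
        for xi in itertools.product([-gp, gp], repeat=3):
            dN = shape_grad(np.array(xi))          # (8, 3)
            Jm = dN.T @ verts                      # J_ij = dx_j / dxi_i, Eq. (D.12)
            detJ = np.linalg.det(Jm)
            G = np.linalg.solve(Jm, dN.T)          # (3, 8) gradient matrix, Eq. (D.14)
            D = (detJ * dt / rho) * np.eye(3)      # Eq. (D.15)
            A += G.T @ D @ G                       # contribution to Eq. (D.13)
        return A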
Like other fluid simulation methods, we extrapolate the new velocity outside of the fluid domain after the velocity update.

D.6.2 Accurate Boundary Conditions

Accurate boundary conditions are essential for producing visually pleasing liquid simulations [17]. For solid boundaries, we employ the method of Batty et al. [9] and scale the diagonal entries of D and the intermediate velocity u* by the fraction of a cell not occupied by solid. Free surface boundary conditions are less straightforward. Unfortunately, first-order accurate Dirichlet boundary conditions exhibit grid-aligned artifacts on the free surface. Enright et al. [51] introduced the ghost fluid method to achieve second-order accuracy for Dirichlet boundary conditions on a regular grid. Afterward, the method was extended to irregular meshes where the pressure was discretized by the Finite Volume Method (FVM) [18, 10]. Inspired by these methods, we generalize the ghost fluid technique to accurately handle boundary conditions for any grid cell shape. Thus, our new approach works not only for regular grids and tetrahedral meshes, but also for our warped grid discretization. Furthermore, wherever our proposed method is applicable, there is no longer a need to independently develop free surface boundary conditions for each discretization.

We start by reviewing the ghost fluid method for the standard MAC discretization on a regular grid. Consider two nodes across the interface in one dimension. The velocity update at the midpoint of the nodes is:

    u_{new} = u^* - \frac{\Delta t}{\rho} \frac{p_G - p_L}{\Delta x},    (D.17)

which is derived from the discretization of the pressure gradient:

    (\nabla)[p] = \frac{p_G - p_L}{\Delta x},    (D.18)

where p_L denotes the pressure on the "liquid" side of the interface, p_G denotes the ghost pressure on the "air" side of the interface, and Δx denotes the distance between the nodes. According to Enright et al. [51], the ghost pressure is given by p_G = (φ_G / φ_L) p_L, where φ_G and φ_L are the level-set values at the nodes where p_G and p_L are evaluated. Substituting this relation into Eq. (D.18) gives:

    (\nabla)[p] = \overbrace{\frac{\phi_L - \phi_G}{\phi_L}}^{\triangle} \; \overbrace{\frac{0 - p_L}{\Delta x}}^{\odot}.    (D.19)

Note that the ⊙ term is exactly the first-order accurate free-surface pressure boundary condition, and the △ term is a carefully chosen multiplier that is used in the pressure solver. Our insight here is that the ghost fluid method can be regarded as an "upgrade" operation: the △ term explicitly converts the first-order accurate pressure gradient discretization into a second-order accurate one.
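In one dimension this "upgrade" takes only a few lines; the following sketch (Python, hypothetical names, not the original code) computes the △ multiplier of Eq. (D.19) and the resulting velocity update of Eq. (D.17), falling back to the plain first-order condition when the liquid-side level-set value is (nearly) zero.

    def ghost_fluid_multiplier(phi_L, phi_G, eps=1e-12):
        # The triangle term of Eq. (D.19): converts the first-order free-surface
        # gradient into a second-order one.  phi_L is the level-set value at the
        # liquid node, phi_G at the air node (opposite signs across the interface).
        if abs(phi_L) < eps:             # interface sits on the node:
            return 1.0                   # fall back to the first-order condition
        return (phi_L - phi_G) / phi_L

    def velocity_update_1d(u_star, p_L, phi_L, phi_G, dt, dx, rho=1.0):
        # Eq. (D.17) with the ghost pressure p_G = (phi_G / phi_L) * p_L.
        grad_p_first_order = (0.0 - p_L) / dx      # the "circle" term of Eq. (D.19)
        grad_p = ghost_fluid_multiplier(phi_L, phi_G) * grad_p_first_order
        return u_star - (dt / rho) * grad_p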
D.6.2.1 Generalized Ghost Fluid

We generalize this idea to higher dimensions with a novel linear conversion operator M_φ:

    (\nabla)_2 [p] = \overbrace{M_\phi}^{\triangle} \; \overbrace{(\nabla)_1 [p]}^{\odot}.    (D.20)

Here, M_φ is a 3×3 matrix that takes a first-order-accurate discretization of the pressure gradient vector, (∇)_1 [p], and converts it into a second-order-accurate one, (∇)_2 [p]. M_φ is then integrated into the larger pressure solver matrix. The exact forms of (∇)_1 and (∇)_2 depend on the particular mesh discretization (regular grid, tetrahedral mesh, etc.), and our approach requires that the pressure gradient is piecewise constant with unknown magnitude and a direction determined by the free surface geometry. We explain how to compute (∇)_1 and (∇)_2 for our particular warped grid discretization in Section D.6.2.2.

We view the derivation of M_φ as a constraint-satisfaction problem. We begin with a 3×3 matrix, consisting of nine degrees of freedom. Mapping an arbitrary 3-dimensional input vector to the given output vector (∇)_2 [p] only pins down three of these degrees of freedom. Because we will eventually integrate M_φ into our symmetric positive-definite linear system, we also impose symmetry and positive-definiteness on M_φ. We utilize the remaining degrees of freedom to optimize the numerical stability of M_φ. We work out these details and provide the analytical form of M_φ in Appendix D.10. Once M_φ is computed, we obtain second-order accurate free surface boundary conditions by simply replacing D in Eq. (D.13) with D M_φ in all cells that overlap the free surface boundary. We also use the second-order pressure gradient of Eq. (D.20) to update the velocity in Eq. (D.16). We emphasize that our new approach not only enforces second-order accurate boundary conditions, but also preserves the symmetry and positive-definiteness of the system matrix and exhibits provably optimum numerical conditioning. The resulting linear solver is remarkably stable and converges quickly.

Figure D.5: Breaking dam creating splashes and thin sheets with our dynamically warping grids adapting to the surfaces: 128×64×128 resolution, 12 seconds per time step and 36 seconds per video frame.

D.6.2.2 Ghost Fluid Applied to our Discretization

Here, we derive the forms of (∇)_1 and (∇)_2 for our particular discretization, which stores both pressure and level-set samples at grid vertices. We can analytically enforce an accurate p = 0 condition at the free surface by constraining the pressure to be a multiple of the level set function, [p] = [φ]P, as pointed out by Ando et al. [6]. To compute the first-order pressure gradient (∇)_1 [p], instead of satisfying the p = 0 condition exactly at the interface, we satisfy it at each node outside of the liquid. We do this by again expressing the pressure as a multiple of the distance function, but this time we replace all level-set values outside of the liquid by zero: (∇)_1 [p] = (∇)[φ̊]P, where [φ̊] = [max(0, φ)] is the vector of level-set values with all entries outside of the liquid set to zero. By expressing the pressure in this way, we only allow one degree of freedom per cell and introduce trilinear error terms (O(ΔxΔyΔz) in three dimensions), which conveniently does not affect the first-order accuracy of this pressure gradient discretization. Plugging these pressure gradient discretizations into Eq. (D.20) gives:

    (\nabla)[\phi] P = \overbrace{M_\phi}^{\triangle} \; \overbrace{(\nabla)[\mathring{\phi}] P}^{\odot},    (D.21)

and we can divide by the unknown P to prescribe a relationship for M_φ for each boundary cell:

    (\nabla)[\phi] = M_\phi (\nabla)[\mathring{\phi}].    (D.22)

Thus, for the purposes of solving for M_φ in Section D.6.2.1, we can use (∇)_2 = (∇)[φ] and (∇)_1 = (∇)[φ̊]. Note that the level-set value φ must be the actual signed distance in spatial coordinates. To calculate φ, we first compute the level set φ̄ in reference coordinates, and convert it to φ in spatial coordinates by:

    \phi = \frac{\bar{\phi}}{\|\nabla \bar{\phi}\|} = \frac{\bar{\phi}}{\|J^{-1} \nabla_\xi \bar{\phi}\|}.    (D.23)

This conversion is only valid for cells near the boundary, but accurate distances in spatial coordinates are only needed for the boundary conditions anyway. Thus, we perform this conversion only near the free surfaces before the projection step.

D.7 Visualization

To visualize our simulation, we first construct a surface mesh from the FLIP particles on the Cartesian grids. Next, we displace the vertices of the mesh, as well as the ballistic FLIP particles, through the deformation field. In our implementation, we use OpenVDB [143] to create this mesh.
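A minimal sketch of this visualization step is shown below (Python/NumPy, hypothetical names); it assumes that the warp takes the form x = ξ + d(ξ), with the displacement d = ∇_ξ φ sampled on the reference grid, and pushes each surface-mesh vertex through that map by trilinear interpolation.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def warp_vertices(verts_ref, disp, dx):
        # verts_ref : (n, 3) vertex positions in reference coordinates
        # disp      : (3, nx, ny, nz) displacement field (e.g. grad_xi phi) on the grid
        # dx        : grid spacing of the reference grid
        # Assumes the warp x = xi + d(xi); d is interpolated at the vertices.
        coords = (verts_ref / dx).T                      # (3, n) in grid-index units
        d = np.stack([map_coordinates(disp[a], coords, order=1, mode='nearest')
                      for a in range(3)], axis=1)        # (n, 3)
        return verts_ref + d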
Figure D.6: Expanding ripples with high adaptivity. Left: our second-order accurate free surface boundary conditions capture the subtle motion of expanding ripples in a nearly open-space static pool; 128×64×128 resolution, 9 seconds per time step, 15 seconds per video frame, and a maximal volume ratio of 200. Right: first-order accurate boundary conditions with the same setup exhibit grid-aligned artifacts. This example employs a level-set surface tracker to demonstrate the accuracy of our method; the same scenario using FLIP is provided in the supplemental video.

D.8 Results

We ran all of our examples on a Linux machine with a 2.7 GHz Intel Xeon E5-2697, using a target L∞ norm of 10^{-3} for the relative residual of the pressure solver.

The breaking dam setup of Figure D.5 and the example of a bunny falling into a basin in Figure D.2 illustrate that our method simulates detailed liquids. Our dynamically warping grids act to exaggerate the visually important geometric features, such as splashes and thin liquid sheets. The cross-sections shown in each figure illustrate the warped grid cells and are color-coded by the local volume change induced by our warping. A reference coordinate view of the latter example is shown in the second row of Figure D.2. It highlights that our solver elegantly handles refining and coarsening obstacles in the flow. Each time step took 12.5 and 10.5 seconds for these examples, of which the entire projection required 3.3 seconds and 1.6 seconds. The grid deformation solver took 1.1 seconds and 1.0 seconds, and the maximal volume differences were 7.4 and 15, respectively.

The liquid combing setup of Figure D.3 highlights our ability to handle complex interactions with solid boundaries more accurately than a MAC solver at the same resolution. A direct comparison with a MAC solver is shown in Figure D.10. In this example, the MAC solver is not able to capture the small-scale splashes due to the lack of resolution between the cylinders. In this case, we increased the effective resolution around the obstacles by using the distance to the cylinders as a region of interest in Eq. (D.2). Each time step of this simulation took 12 seconds on average. Our dynamic grid deformation solver took 1.2 seconds, whereas the pressure projection took 2.6 seconds. The maximal volume ratio was 18 on average.

The example of Figure D.6 verifies that our method yields the expected gain in visual quality when using second-order accurate boundary conditions. We set up the scene with four small droplets on a static pool with a high degree of spatial adaptivity, with a maximal volume ratio of 200. In this example, our second-order accurate boundary conditions, shown on the left, capture both sharper splashes and propagating ripples more accurately than the first-order accurate boundary conditions shown on the right. Each time step of this simulation took 8.9 seconds, and the pressure projection took 5.7 seconds on average. In this specific example, we used a static warped grid (instead of deforming it over time), and we used the level-set method for tracking liquid surfaces to demonstrate our visual accuracy. For clarification, we also provide an example of the same setup using FLIP in the supplemental video. We note that when FLIP is employed, a subtle noise on the surface persists once the particles are perturbed. However, these artifacts are orthogonal to our boundary conditions and could be alleviated by post-processing the liquid surface.
The simulation of liquid leaking from a container with holes in the bottom is shown in Figure D.1. Our method reproduces the visually interesting streams of turbulent water. A MAC solver, on the other hand, needs to double the resolution to reproduce similar apparent detail. A comparison with the MAC solver at the same resolution for this example is shown in Figure D.10, as well as in the supplemental video. Each time step took 25 seconds, the grid deformation 2.4 seconds, and the pressure projection 5.5 seconds on average. The maximal volume ratio of this example was 17 on average.

D.8.1 Numerical Verification

We additionally performed a numerical verification of our free surface boundary conditions, and we observe that they yield second-order convergence to the analytical solution. We set up our numerical test in 2D with a single point source of divergence at the center, and solve ∇²p(x) = δ(x) with the p = 0 boundary condition enforced on the surface of a circular domain. The analytical solution of this configuration is given by p(x) = G(‖x‖) − G(r), where G and r denote the Green's function solution of Laplace's equation and the radius of the circle. We measured the average of the L1 norm of the error on the surface, and examined the convergence factor by doubling the resolution, similar to Enright et al. [52] and Batty [8]. Our results in Figure D.7 show that the convergence rate stays second order both with and without the grid deformation. Interestingly, we also observed that the convergence rate improves even further when our sizing function is applied. In this case, our method (with parameters α_φ = 7, β_u = 0, and γ_ψ = 0) further improves the solution by increasing the effective resolution around the surface.
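To illustrate how such convergence orders can be computed, the sketch below (Python/NumPy, hypothetical names; the radius and error-measurement details are simplified relative to the text) evaluates the analytical solution p(x) = G(‖x‖) − G(r) with the 2-D Green's function G(s) = ln(s)/(2π), and estimates the observed order from errors at successive resolution doublings, using the "Ours" column of the deformed setting in Figure D.7 as example data.

    import numpy as np

    def analytic_pressure(points, r):
        # Analytical solution of laplacian(p) = delta with p = 0 on a circle of
        # radius r: p(x) = G(||x||) - G(r), with G(s) = ln(s) / (2 pi) in 2-D.
        s = np.linalg.norm(points, axis=-1)
        return (np.log(s) - np.log(r)) / (2.0 * np.pi)

    def convergence_orders(errors):
        # Observed order between successive resolution doublings:
        # order = log2(e_coarse / e_fine).
        e = np.asarray(errors, dtype=float)
        return np.log2(e[:-1] / e[1:])

    # Example with the tabulated L1 errors of the deformed setting ("Ours"):
    errs = [1.3e-4, 2.8e-5, 5.6e-6, 1.3e-6, 3.3e-7, 9.8e-8]
    print(convergence_orders(errs))
    # roughly [2.2, 2.3, 2.1, 2.0, 1.8]; small deviations from the Order column
    # of Figure D.7 come from the rounding of the tabulated errors.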
D.8.2 Timings

A detailed performance breakdown for our method is shown in Figure D.4. In all of our examples, the cost of the grid deformation solver was less than 10 percent of the whole simulation time, and the timings for the PCG solver were close to the timings for assembling the linear system. For the leaking scene example, the average total cost spent on these two operations (the deformation solver and the matrix assembly) was 4.8 seconds. Hence, the computational overhead of our method compared to a standard liquid solver is approximately 19 percent. Note that since we depend heavily on the performance of a linear system solver for the grid deformation as well as for the pressure solve, we expect that our solver can be sped up with more sophisticated methods, such as a Schur complement solver [125]. Note that the timings for the grid deformation of Figure D.6 are omitted since we did not dynamically warp the grid in this example.

    Method            Resolution    Time per step [s]
    MAC               128^3         4.28
    Ours              128^3         10.2
    MAC               256^3         26.4
    Ando et al. [5]   256^3         29.0

Table D.2: Timings of Figure D.8.

D.8.3 Comparisons

Figure D.8 shows a comparison of a drop falling into a pool of liquid using three different solvers: our method, a regular MAC solver, and the adaptive method of Ando et al. [5]. Animations for all solvers are provided in the supplemental video. For this setup, our method at 128³ resolution can capture fine details, e.g., the crown splash after the drop impact. Such features are noticeably less detailed using a regular MAC grid at the same resolution. Doubling the resolution of the MAC grid would resolve the features, but leads to very significant increases in runtime; in this case, the MAC grid runtime is 2.58 times higher than that of our 128³ warped grid. The method of Ando et al. [5] enables large-scale simulations with aggressive adaptivity, but the overhead arising from the unstructured BCC meshes can counteract the gains in performance at moderate resolutions. The resulting simulation recovers small-scale details but leads to an almost three times higher runtime than our method. The timings of this comparison can be found in Table D.2.

D.9 Discussion

We observe ringing artifacts when a naive single-point Gaussian quadrature integration rule is employed to integrate Eq. (D.13). This issue arises from a null-space in the system. An example of our null-space is the pressure gradient evaluated at the cell center with pressure values of −1, 1, −1, 1 assigned to the four vertices in counter-clockwise order; in this specific configuration, the pressure gradient unphysically evaluates to zero. Such a null-space is known to induce numerical instabilities when solving a linear system, and researchers have taken care to eliminate such instabilities [136, 210]. We circumvent this issue by switching to the eight-point Gaussian quadrature integration rule. Figure D.9 illustrates the effect of our integration scheme.

Like other fluid simulation frameworks, our method also undergoes slight volume changes depending on the accuracy of the employed velocity advection scheme and surface tracker. Hence, we include the method introduced by Kim et al. [101] in our solver. We distribute the correction term to the right-hand side of our pressure solver, weighted by the Jacobian of the cells. We found that this simple technique effectively recovers the original volume. We compute the volume in the Cartesian coordinates by summing up the Jacobians weighted by the cell fraction of fluid.

We note that an alternative to our discretization, aside from the boundary conditions, is the method by Azevedo and Oliveira [7], where the FVM is employed. We prefer our FEM-based discretization since the FVM is known to introduce accuracy errors when computing pressure gradients unless the circumcenter of an element is chosen as the sampling location of the pressure. As pointed out by Batty et al. [10], the pressure differentiation degrades to first-order accuracy if it is not evaluated at the circumcenter. We also note that the method of Ferstl et al. [57] could be extended to use our pressure solver. Our work introduces novel generalized second-order accurate boundary conditions, which could also be beneficial for these other discretizations.

Using a FLIP surface tracker adds persistent surface noise to any discretization, and subtle surface bumps are visible in our FLIP results as well. We create our surfaces by extracting the surface in reference coordinates and then warping it to spatial coordinates, so our warped grids warp the existing FLIP artifacts along with them. This warping essentially makes the isotropic FLIP artifacts smaller and more anisotropic, depending on the sizing function. In the end, these artifacts are orthogonal to our boundary conditions, and they can be alleviated by post-processing the liquid surface.

Currently, our deformation tends to produce thin elements on liquid surfaces, which has the side effect of robustly capturing thin sheets of liquid. On the other hand, these thin features can introduce visible artifacts along the elongated directions. In future work, we would like to utilize methods developed for image warping [107, 116] to address the issue.
Such vector-valued warping solvers could also enable additional rotational displacements, allowing us to increase the expressiveness of the achievable deformation at the expense of more computational degrees of freedom. Similar to Zhu et al. [209], our warped grid could be used to produce an effect similar to non-reflecting boundary conditions.

Our current implementation limits the scale of adaptivity, primarily because of the increased cost of the deformation solver and the limited CFL number. This issue is common among spatially adaptive simulation methods, where the time step size must be adjusted according to the ratio of the maximum velocity and the minimum grid cell size. This restriction could be reduced by advecting grid points through the velocity field, as demonstrated by Fan et al. [55] for solids.

The types of grids generated by our method are naturally limited to those with the connectivity of a regular grid. Consequently, our method cannot create some grid configurations, such as uniformly tiny cells all around the boundary of a closed region but uniformly large cells in its interior. On the other hand, we do not restrict the boundary of our grid to the boundary of the simulation, so our method can certainly create a region of fine cells surrounded by coarse cells. We note that this limitation only affects the calculation of the deformation itself, and does not influence our generalized boundary conditions, pressure solver, or advection strategy. The numerical conditioning depends on element quality in the same way as in other curvilinear grid methods, and the warping often results in an increased number of active cells compared to a standard MAC solver at the same resolution. Although our method benefits from the memory access patterns of regular grids, our implementation is not yet optimized to its full extent. Techniques such as efficient sparse grid methods [170, 143, 193, 20] were shown to yield very high performance, but we point out that the fundamental ideas we introduce are orthogonal to those works. Our method for warping grids could be used in conjunction with these techniques to further increase the amount of resolved detail.

D.9.1 Momentum Preservation

We want to point out that our velocity field is not exactly momentum preserving because of the moving mesh. Suppose that we define the momentum of an element i as:

    \int_{\Omega_i} u(x) \, dV,    (D.24)

where u(x) and Ω_i denote the continuous velocity function and the volumetric region of the element, respectively. For simplicity, we set ρ = 1 in this exposition, because the density is constant in our case. The change of momentum over time is given as:

    \frac{d}{dt} \int_{\Omega_i} u(x) \, dV = \int_{\Omega_i} \frac{\partial u(x)}{\partial t} \, dV + \oint_{\partial \Omega_i} u(x) \left( u_D(x) \cdot n \right) dS,    (D.25)

where ∂Ω_i denotes the boundary of the region Ω_i. Notice that the second term on the right of Eq. (D.25) accounts for the flux of momentum across the region boundary. Our advection scheme, although it takes the mesh velocity into account, does not compute this momentum flux on element boundaries. Although this violation does not appear to be a serious cause of visual artifacts so far, further investigation is needed to achieve more accurate dynamics.

D.10 Computation of M_φ

We can view Eq. (D.20) as a conversion between an arbitrary source vector b and the target vector a:

    a = M_\phi \, b.    (D.26)

We introduce a rotation matrix R such that:

    R a = \hat{a} = (\hat{a}_x, \hat{a}_y, 0)^T, \qquad R b = \hat{b} = (\hat{b}_x, 0, 0)^T.    (D.27)

We then apply this change of variables to simplify the problem:

    \hat{a} = M_R \, \hat{b},    (D.28)

such that M_R = R M_φ R^T and M_φ = R^T M_R R.
We ensure that M_R is symmetric and positive-definite by factoring it with a Cholesky decomposition, M_R = L_R L_R^T, where L_R is the lower triangular matrix

    L_R = \begin{pmatrix} c_{11} & 0 & 0 \\ c_{21} & c_{22} & 0 \\ c_{31} & c_{32} & c_{33} \end{pmatrix}.    (D.29)

This decomposition constrains M_φ to be symmetric too, because

    M_\phi = R^T M_R R = R^T L_R L_R^T R = \left( R^T L_R \right)\left( R^T L_R \right)^T.    (D.30)

It also makes M_φ positive definite, because for all nonzero vectors x and y = R^T x,

    y^T M_\phi \, y = x^T R M_\phi R^T x = x^T M_R \, x > 0,    (D.31)

by the positive-definiteness of M_R. The positive-definite property is extremely helpful for numerical algorithms, but it imposes a constraint on the input vectors a and b, namely that \hat{b}^T M_R \hat{b} = \hat{b}^T \hat{a} > 0. Physically, this means that the angle between the first- and second-order pressure gradients must be less than π/2 (they must not point in opposite directions). In the rare event that this condition is violated, we resort back to first-order accuracy for safety. This is analogous to the strategy of clamping small level-set values to prevent instability in the traditional ghost fluid method. Using the Cholesky decomposition and the zero entries of Eq. (D.27) to constrain M_R gives us

    M_R = \begin{pmatrix} \hat{a}_x / \hat{b}_x & \hat{a}_y / \hat{b}_x & 0 \\ \hat{a}_y / \hat{b}_x & c_{22}^2 + \hat{a}_y^2 / (\hat{a}_x \hat{b}_x) & c_{22} c_{32} \\ 0 & c_{22} c_{32} & c_{32}^2 + c_{33}^2 \end{pmatrix}.    (D.32)

This matrix still has three degrees of freedom, c_{22}, c_{32}, and c_{33}, which we use to optimize the numerical stability of the boundary conditions by minimizing the condition number

    \kappa(M_R) = \frac{|\lambda_{max}|}{|\lambda_{min}|},    (D.33)

where λ_max and λ_min are the maximum and minimum eigenvalues of M_R. According to Marshall and Olkin [133], the following inequality holds for any positive-definite matrix:

    \kappa \begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix} \geq \kappa(U_{11}).    (D.34)

Applying this to our case,

    \kappa(M_R) \geq \kappa \begin{pmatrix} \hat{a}_x / \hat{b}_x & \hat{a}_y / \hat{b}_x \\ \hat{a}_y / \hat{b}_x & c_{22}^2 + \hat{a}_y^2 / (\hat{a}_x \hat{b}_x) \end{pmatrix}.    (D.35)

Our strategy is to first manipulate the degree of freedom c_{22} on the right-hand side to minimize the lower bound on κ(M_R). Then, we optimize M_R's eigenvalues such that κ(M_R) is exactly equal to that lowest possible condition number. We first analytically minimize the right-hand side of Eq. (D.35) by setting c_{22}^2 = (\hat{a}_x^2 + \hat{a}_y^2) / (\hat{a}_x \hat{b}_x). Next, we set c_{32} = 0, which conveniently factors the characteristic polynomial of M_R into a scalar times the upper-left 2×2 sub-problem that we already minimized. Finally, we ensure that this third eigenvalue is neither a minimum nor a maximum (so that it does not affect the condition number) by setting it to the average of the other two eigenvalues, which are guaranteed to be positive due to the positive-definiteness property. At this point, we have constrained all the degrees of freedom in M_R, and its condition number is equal to that of its optimized upper-left sub-matrix (the inequality in Eq. (D.35) becomes an equality). Therefore, it has the minimum possible condition number. The final matrix is given by:

    M_R = \begin{pmatrix} \hat{a}_x / \hat{b}_x & \hat{a}_y / \hat{b}_x & 0 \\ \hat{a}_y / \hat{b}_x & (\hat{a}_x^2 + 2\hat{a}_y^2) / (\hat{a}_x \hat{b}_x) & 0 \\ 0 & 0 & (\hat{a}_x^2 + \hat{a}_y^2) / (\hat{a}_x \hat{b}_x) \end{pmatrix}.    (D.36)

This matrix M_R achieves second-order accurate boundary conditions, symmetric positive-definiteness, and provably optimum numerical conditioning. The resulting linear solver is remarkably stable and converges quickly in practice.
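The construction above can be summarized in a short routine; the sketch below (Python/NumPy, hypothetical names, not the original implementation) rotates the input pair into the frame of Eq. (D.27), fills in the optimally conditioned M_R of Eq. (D.36), and rotates back to obtain M_φ, falling back to the identity (i.e., first-order accuracy) when the positivity condition \hat{b}^T \hat{a} > 0 is violated.

    import numpy as np

    def conversion_matrix(a, b, eps=1e-12):
        # Build M_phi with a = M_phi b (Eq. (D.26)) following Appendix D.10.
        # a : second-order pressure-gradient direction, (nabla)[phi]
        # b : first-order pressure-gradient direction,  (nabla)[phi_ring]
        if np.dot(a, b) <= eps:            # gradients opposed or degenerate:
            return np.eye(3)               # fall back to first-order accuracy
        e1 = b / np.linalg.norm(b)
        a_perp = a - np.dot(a, e1) * e1
        if np.linalg.norm(a_perp) < eps:   # a parallel to b: any orthogonal axis works
            a_perp = np.cross(e1, [1.0, 0.0, 0.0])
            if np.linalg.norm(a_perp) < eps:
                a_perp = np.cross(e1, [0.0, 1.0, 0.0])
        e2 = a_perp / np.linalg.norm(a_perp)
        R = np.vstack([e1, e2, np.cross(e1, e2)])     # rows span the rotated frame of Eq. (D.27)

        ax, ay = np.dot(a, e1), np.dot(a, e2)         # a_hat = (ax, ay, 0)
        bx = np.linalg.norm(b)                        # b_hat = (bx, 0, 0)
        M_R = np.array([[ax / bx, ay / bx, 0.0],
                        [ay / bx, (ax**2 + 2 * ay**2) / (ax * bx), 0.0],
                        [0.0, 0.0, (ax**2 + ay**2) / (ax * bx)]])   # Eq. (D.36)
        return R.T @ M_R @ R                          # M_phi = R^T M_R R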
D.11 Numerical Verification of the Pressure Solver

We ran our pressure solver on a Taylor-Green vortex velocity field, for which the analytical solution is known. This experiment has no free surfaces, but it allows us to accurately evaluate the behavior of the pressure solver under deformation. We measure the L1 error of the pressure and plot the convergence as we double the grid resolution while keeping the sizing function constant (Figure D.11, bottom left). Figure D.11 shows the results for resolutions ranging from 32×32 to 1024×1024. The error measurements clearly show that our pressure solver properly converges to the analytical solution under refinement. As a consequence, our method likewise yields the correct pressure gradient, which is crucial to enforce incompressibility for the FLIP simulations.

Figure D.11: Error plots of our Taylor-Green vortex experiment (panels at 32×32, 128×128, and 1024×1024 resolution). Pressure error (top), warped grid from our sizing function (bottom left), and the log-log graph of the total error with respect to the grid resolution (bottom right).

D.12 Conclusion

This work presented dynamically warping grids for liquid simulations. We devised a method that allows flexible spatial adaptivity on regular grids. Consequently, our method can capture complex and diverse liquid motions while retaining the advantages of structured grids. We demonstrated that our method runs in harmony with off-the-shelf algorithms for velocity advection and surface tracking. We also combined novel generalized second-order accurate boundary conditions with our FEM-based pressure solver to enable subtle liquid surface dynamics. We proved that this scheme has desirable numerical properties (symmetry, positive definiteness, and an optimal condition number; see Appendix D.10), and we verified its accuracy through numerical experiments. We also presented a new grid deformation solver with temporal coherence, which requires less than 10% of the overall computation. Altogether, we demonstrated that our method improves the visual quality of liquid simulations through the use of adaptivity, without excessive computational overhead or complicated implementations. In the future, we would like to extend our method to handle moving solid boundaries, two-way coupled rigid bodies, and surface tension forces. We would also like to combine our approach with sparse grid methods [143, 170, 1] to merge the fast lookup times of sparse grids with our method's ability to maintain temporal coherence and adapt to subtle changes in boundary conditions.

Regular setting:

    Resolution   First-order accuracy          Ours
                 L1 error       Order          L1 error       Order
    32           4.9 × 10^-3    N/A            2.3 × 10^-4    N/A
    64           3.0 × 10^-3    0.73           5.2 × 10^-5    2.0
    128          1.5 × 10^-3    0.95           1.3 × 10^-5    2.0
    256          7.1 × 10^-4    0.95           3.4 × 10^-6    2.0
    512          3.6 × 10^-4    1.1            9.1 × 10^-7    1.9
    1024         1.9 × 10^-4    1.0            2.4 × 10^-7    1.9

Deformed setting:

    Resolution   First-order accuracy          Ours
                 L1 error       Order          L1 error       Order
    32           4.4 × 10^-3    N/A            1.3 × 10^-4    N/A
    64           1.5 × 10^-3    1.6            2.8 × 10^-5    2.1
    128          4.5 × 10^-4    1.7            5.6 × 10^-6    2.4
    256          1.7 × 10^-4    1.4            1.3 × 10^-6    2.1
    512          6.0 × 10^-5    1.5            3.3 × 10^-7    1.9
    1024         2.3 × 10^-5    1.4            9.8 × 10^-8    1.8

Figure D.7: Numerical verification of our second-order accuracy for a Poisson problem: we solve ∇²p(x) = δ(x) with p = 0 boundary conditions on the lines of solid circles. The colored contour plots show the magnitude of our actual numerical solution. The top table refers to the experiment with the regular setting (top left) and the bottom table to our warped setting (top right). Both numerical experiments show second-order convergence.

Figure D.8: Comparison of a drop falling into a pool using three different solvers.
Our method successfully captures the thin sheets of a crown splash at 128³ resolution, while the MAC method fails to simulate such detail without doubling the resolution. The adaptive method of Ando et al. [5] successfully recovers small-scale details but leads to significantly longer runtimes. Timing details are given in Table D.2.

Figure D.9: Pressure surface plot with a single strong velocity initiated at the center pointing upward. Left: a single-point Gaussian quadrature integration rule applied to integrate Eq. (D.13) exhibits oscillations due to null-space issues. Right: our eight-point (four-point in 2D) Gaussian quadrature integration rule removes the artifacts.

Figure D.10: Comparison with the MAC method. The left side shows our method, the right side the MAC solver at the same resolution.
Abstract
Deep learning has attracted significant attention in recent years due to its remarkable achievements in various applications. However, building effective deep neural networks requires making crucial design choices such as the network architecture, regularization, optimization, and hyperparameter tuning. In this dissertation, we focus on the concept of "sharpness" of neural networks, which refers to a neural network's sensitivity to perturbations of its weight parameters. We argue that sharpness is not only a theoretical notion but also has practical use cases that can lead to better generalization and robustness of neural models.
A major theoretical challenge in defining and measuring sharpness is its scale-sensitivity, i.e., the fact that sharpness can change under scale transformations of neural networks. In this thesis, we propose a novel definition of sharpness that overcomes this limitation, with provable scale-invariance and extensive empirical validation. By analyzing the relationship between sharpness and model performance, we show how our definition provides a more objective and accurate characterization of sharpness.
Another open question in sharpness analysis is how training algorithms for machine learning models regularize sharpness. In this dissertation, we answer this question by showing that existing training algorithms regularize sharpness through what can be called "escaping" behavior, where the optimization process avoids sharp regions in the parameter space. This new explanation demystifies the connection between sharpness and training algorithms, paving the way for more effective and principled approaches to machine learning.
Finally, we demonstrate the practical benefits of sharpness regularization for physics simulations. We show that neural networks with small sharpness achieve high-fidelity fluid simulations and molecular dynamics. These findings carry the significant implication that sharpness is not just a mathematical notion but also a practical tool for building physics-informed neural networks.