PHYSICS-AWARE GRAPH NETWORKS FOR SPATIOTEMPORAL PHYSICAL SYSTEMS

by Sungyong Seo

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Science), December 2021

Copyright 2021 Sungyong Seo

Dedication

To my beloved Jaewon. Your love and sacrifice enabled me to complete the long journey.

Acknowledgements

Foremost, I would like to express my sincere gratitude to the many supporters during my long PhD journey. It has been truly challenging for many years, and this dissertation would not have been possible without your help.

I would like to thank my thesis advisor, Prof. Yan Liu, first. Talking with you during my PhD admission interview remains an unforgettable moment. Although my machine learning background was shallow at that time, you saw my research potential and gave me the opportunity to begin a PhD program. I am very grateful that we proved it was the right decision, as we achieved strong collaboration across various research topics over many years. Your balance between flexibility in choosing research problems and detailed guidance was ideal for my research personality. I still remember that you guided me to explore machine learning with physics when I was struggling to determine my thesis topic. It was great timing to enter the emerging field, and your guidance enabled me to solve timely problems. Furthermore, it was also very helpful to learn how to write papers that lead to publications. This lesson will last throughout my career.

Besides my advisor, I would like to thank the rest of my guidance committee: Prof. Xiang Ren, Prof. Antonio Ortega, Prof. Joseph Lim, Prof. Cyrus Shahabi, and Prof. George Ban-Weiss. Your encouragement, insightful comments, and hard questions have been invaluable.

I also thank my fellow labmates in our Melady group. Thank you to my co-authors Chuizheng Meng, Karishma Sharma, Dr. Natali Ruchansky, Dr. Sirisha Rambhatla, and Dr. Xinran He. Thank you to my colleagues and friends Dr. Michael Tsang, Dr. Nitin Kamra, Dr. Yaguang Li, Dr. Dehua Cheng, Dr. Rose Yu, Dr. Zhengping Che, Hanpeng Liu, Prof. Sanjay Purushotham, Guangyu Li, Aastha Dua, Umang Gupta, Tanachat Nilanon, Wilka Carvalho, Nan Xu, Yizhou Zhang, Loc Trinh, and James Enouen. I want to extend special thanks to Michael Tsang; the times we talked and hung out refreshed me and kept me moving forward. Beyond my labmates, it was a great opportunity to collaborate with Prof. Hau Chan, Dr. Jiachen Zhang, Dr. Arash Mohegh, Jiageng Zhu, Prof. P. Jeffrey Brantingham, and Prof. Phebe Vayanos. The USC administrative and IT staff, Lizsl De Leon, Tracy Charles, Jennifer Gerson, and Jack Li, have always been supportive.

My PhD internships have also allowed me to meet great colleagues. Thank you, Dr. Jing Huang (Visa Research); you helped me greatly in publishing my first paper and presenting it at a conference. Thank you, Dr. Changwei Hu, Dr. Yifan Hu (Yahoo! Research), and Dr. Sercan O. Arik (Google Cloud AI).

I would like to thank Prof. Jinwoo Choi, who encouraged me to complete the PhD program successfully, and my wife Jaewon Shin, for letting me spend family time on research that never seemed to end. I would like to thank my parents-in-law, who supported my decision to change my academic career and believed in my effort. Finally, to my parents: I deeply appreciate your unconditional love and support.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Background
    1.1.1 Modeling Physical Systems
    1.1.2 Challenges in Physical System Modeling
  1.2 Thesis Statement
  1.3 Contributions of Research
2 Related Work
  2.1 Physics-informed Learning
    2.1.1 Surrogate Solutions of Existing Approaches
    2.1.2 Learning and Discovering Knowledge
    2.1.3 Deep Learning Applications
  2.2 Applications
    2.2.1 Simulation of Physical Systems
    2.2.2 Learning Unknown Physics
    2.2.3 Interpretability
3 Notations, Preliminary and Datasets
  3.1 Preliminary
  3.2 Local Variation and Data Quality
  3.3 Calculus on Graphs
  3.4 Graph Networks
  3.5 Datasets
    3.5.1 Weather Underground and WeatherBug in Los Angeles County
    3.5.2 Simulated Climate Observations in Southern California Region
    3.5.3 Weather Stations in the United States
    3.5.4 Air Quality and Extreme Weather Datasets
    3.5.5 Global Climate Network
4 Data Quality Inference based on Physical Rule for Spatiotemporal Temperature Forecasting
  4.1 Model
    4.1.1 Data Quality Network
  4.2 Experiments
    4.2.1 Graph Generation
    4.2.2 Baselines
    4.2.3 Experimental Setting
  4.3 Results and Discussion
    4.3.1 Forecasting Experiment
    4.3.2 Node Embedding and Low-Quality Detection
    4.3.3 Data Quality Analysis
5 Graph Networks with Physics-aware Knowledge Informed in Latent Space
  5.1 Physics-aware Graph Networks
    5.1.1 Static Physics
    5.1.2 Dynamic Physics
    5.1.3 Physics in Latent Space
  5.2 Experiment
    5.2.1 PaGN Architecture
    5.2.2 Experimental Settings
    5.2.3 One-step Prediction
    5.2.4 Multistep Prediction
    5.2.5 Effectiveness of Physics Constraint
    5.2.6 Importance of Physics Constraint
6 Physics-aware Difference Graph Networks for Sparsely-Observed Dynamics
  6.1 Physics-aware Difference Graph Network
    6.1.1 Difference Operators on Graph
    6.1.2 Spatial Difference Layer
    6.1.3 Recurrent Graph Networks
  6.2 Effectiveness of Spatial Difference Layer
    6.2.1 Approximation of Directional Derivatives
    6.2.2 Graph Signal Prediction
  6.3 Prediction: Graph Signals on Land-based Weather Sensors
    6.3.1 Experimental Set-up
    6.3.2 Graph Signal Predictions
    6.3.3 Contribution of Spatial Derivatives
    6.3.4 Effect of Different Graph Structures
    6.3.5 Distribution of Prediction Error Across Nodes
  6.4 Evaluation on NEMO Sea Surface Temperature (SST) Dataset
7 Physics-aware Spatiotemporal Modules with Auxiliary Tasks for Meta-Learning
  7.1 Motivation
    7.1.1 Decomposability of Variants of a Continuity Equation
    7.1.2 Spatial Derivative Modules: PDE-independent Modules
    7.1.3 Time Derivative Module: PDE-specific Module
    7.1.4 Meta-Learning with PDE-independent/-specific Modules
  7.2 Physics-aware Meta-Learning with Auxiliary Tasks
    7.2.1 Spatial Derivative Module
    7.2.2 Time Derivative Module
    7.2.3 Meta-Learning with Auxiliary Objective
  7.3 Spatial Derivative Modules: Reusable Modules
  7.4 Experimental Evaluation
    7.4.1 Preliminary: Which Synthetic Dynamics Need to be Generated?
    7.4.2 Meta-train
    7.4.3 Meta-test
    7.4.4 Multi-step Graph Signal Generation
    7.4.5 Graph Signal Regression
8 Spatiotemporal Modeling via Physics-aware Causality
  8.1 Motivation
  8.2 Problem Formulation
  8.3 Proposed Model
  8.4 Experimental Results
  8.5 Experimental Settings
    8.5.1 Synthetic Time Series Generation
    8.5.2 Causality Classification
    8.5.3 Graph Signal Prediction
    8.5.4 Interpretation of Learned Causality
9 Summary, Discussion and Future Work
Bibliography

List of Tables

3.1 Land features
3.2 Meteorological observations
3.3 Description of climate data
3.4 Information on sensor networks
4.1 Forecasting mean absolute error (MAE) (°C)
4.2 Observations and inferred DQL
5.1 Examples of static equations in graph networks
5.2 Examples of dynamic equations in graph networks
5.3 One-step prediction error (MSE)
5.4 Multistep prediction error (MSE)
5.5 One-step prediction error with different constraints (MSE)
6.1 Mean squared error (×10⁻²) for approximation of directional derivatives
6.2 Mean absolute error (×10⁻²) for graph signal prediction with different sparsity
6.3 Mean squared error (×10⁻²) for approximations of directional derivatives of the function f_2(x, y) = sin(x) + cos(y) with different sparsity
6.4 Mean absolute error (×10⁻²) for graph signal prediction
6.5 Numbers of learnable parameters
6.6 Graph signal prediction results (MAE) on multistep predictions; each row reports the average with standard deviations over all baselines and PA-DGN (one step is a 1-hour interval)
6.7 Mean absolute error (×10⁻²) for graph signal prediction on the synthetic dataset
6.8 Graph signal prediction results (MAE) on multistep predictions; each row reports the average with standard deviations over all baselines and PA-DGN (one step is a 1-hour interval)
6.9 Mean absolute error (×10⁻²) for SST graph signal prediction
7.1 Parameters for the synthetic dataset
7.2 Prediction error (MAE) of the first (top) and second (bottom) order spatial derivatives
7.3 Multi-step prediction results (MSE) and standard deviations on the two real-world datasets
7.4 Graph signal regression results (MSE, ×10⁻³) and standard deviations on the two regions of weather stations
8.1 Inter-causality classification
8.2 Inter-causality classification with additional noise
8.3 Recall for causal discovery methods
8.4 Intra-causality retrieval (AUC) from non-linear causal time series with N(0, 1²)
8.5 Summary of prediction error (mean squared error, MSE) with standard deviations for the two regions

List of Figures

1.1 Thesis statement
3.1 Each bar represents the signal value at the vertex where the bar originates. Blue and red indicate positive and negative values, respectively. Local variations at the green node are (a) 1.67, (b) 25, and (c) 15, respectively.
3.2 Scalar/vector fields on Euclidean space and vertex/edge functions on a graph
3.3 Personal weather stations distributed over the Los Angeles area
3.4 Sampled regions in the Southern California area: (left) Los Angeles (274 nodes) and (right) San Diego (282 nodes)
3.5 Weather stations in the (left) western and (right) southeastern states of the United States, and the k-NN graph
3.6 Sensor locations in the AQI-CO dataset. Sensors/stations are shown as blue nodes, edges of the k-NN graphs as red lines, and borders of provinces/states in grey.
3.7 Visualization of the first 5 frames of one extended sequence in the ExtremeWeather dataset. Dots represent the sampled points; areas with high surface temperature (TS) are colored green and areas with low TS are colored purple.
3.8 Working sensors for TMAX and SNOW located in western (a, b)/eastern (c, d) states in the USA
4.1 Architecture of DQ-LSTM, which consists of a GCN followed by the data quality network (DQN) and an LSTM. The GCN extracts localized features from given graph signals (different colors correspond to different signals) and the DQN computes the data quality of each vertex. N is the number of vertices, and each loss function on each vertex is weighted by the quality level s_i; the dot-patterned circles denote the current vertex i.
4.2 t-SNE visualization of the outputs of the GCN in DQ-LSTM. The red dot denotes the reference node and the green dots are its adjacent nodes. (a), (b), and (c) illustrate how the embeddings of spatiotemporal signals change. At t = 4, node v_25 is relatively far from the other green nodes because it is connected to node v_4, which is not a neighbor of the red dot.
5.1 Recurrent architecture for incorporating a physics equation into a GN. The blue blocks have learnable parameters, the orange blocks are objective functions, ⊕ denotes concatenation, and the middle core block can be repeated for as many time steps (T) as required.
5.2 (a) The MSEs of PaGN are almost as good as GN-only (grey lines) despite less training data; (b) how the prediction performance depends on the physics term
6.1 Examples of difference operators applied to a graph signal. The filters used are (b) Σ_j (f_i - f_j), (c) Σ_j (1.1 f_i - f_j), and (d) f_j - 0.5 f_i.
6.2 Physics-aware Difference Graph Networks for graph signal prediction. Blue boxes have learnable parameters and all parameters are trained through end-to-end learning; the nodes/edges can be multidimensional.
6.3 Directional derivative on a graph
6.4 Gradients and graph structure of sampled points. Left: the synthetic function is f_1(x, y) = 0.1x² + 0.5y². Right: the synthetic function is f_2(x, y) = sin(x) + cos(y).
6.5 Synthetic dynamics and graph structure of sampled points
6.6 MAE across the nodes
7.1 Schematic overview of the physics-aware modular meta-learning (PiMetaL)
7.2 Examples of generated spatial function values and graph signals. Node and edge features (function value and relative displacement, respectively) are used to approximate spatial derivatives (arrows). The number of nodes (spatial resolution), the number of edges (discretization), and the degree of fluctuation (scale of derivatives) can be adjusted to differentiate meta-train tasks.
8.1 Heat dissipation over 2D space and time. The nodes in a graph structure correspond to sensors, and the observations at each sensor are time-varying. Given the heat equation (u̇ = DΔu), spatial (blue) and temporal (green) causal relations can be identified from previous nodes to a current target node (white).
8.2 Proposed model: causality-aware spatiotemporal graph networks (PA-DGN). A sequence of graph signals is fed into two modules: (1) spatiotemporal graph networks for causality (STGC) and (2) spatiotemporal graph networks for values (STGV). Guiding PDEs from physical principles provide explicit (partially available) causal labels; the red arrows denote how the supervised objectives are defined.
8.3 Generated multivariate time series from given causal relations
8.4 Average causal probability curves vs. the number of hops over all sensors in each region
8.5 Average causal probability curves vs. the number of hops from particular sensors in each region (TMAX)

Abstract

While deep neural networks have been successful in a number of applications, it is still challenging to achieve a robust model for physical systems, since data-driven learning does not explicitly consider physical knowledge, which should be beneficial for modeling. To leverage domain knowledge for robust learning, this thesis proposes various novel methods to incorporate physical knowledge for modeling spatiotemporal observations from physical systems. First, we quantify data quality, inspired by physical properties of fluids, to identify abnormal observations and improve forecasting performance. The second work proposes a regularizer to explicitly impose partial differential equations (PDEs) associated with physical laws to provide an inductive bias in the latent space. The third method is developed to approximate spatial derivatives, which are among the fundamental components of spatiotemporal PDEs and play a prominent role in physics-aware modeling. Then, we demonstrate a meta-learning framework to show that physics-related quantities are beneficial for fast adaptation of learnable models on few observations.
Finally, we propose spatiotemporal modeling via physics-aware causality, which leverages additional causal information described in PDEs for physical systems. All methods share a common goal: integrating physical knowledge with graph networks to model sensor-based physical systems by providing a strong inductive bias.

Chapter 1
Introduction

1.1 Background

Machine learning and deep learning models, which have already achieved tremendous success in a number of domains such as computer vision (LeCun et al., 1998; Krizhevsky et al., 2012; He et al., 2016; Redmon et al., 2016; He et al., 2017) and natural language processing (Mikolov et al., 2013; Socher et al., 2013; Sutskever et al., 2014; Kim, 2014; Bahdanau et al., 2014; Kumar et al., 2016; Vaswani et al., 2017; Peters et al., 2018; Devlin et al., 2019), are beginning to play an important role in advancing scientific discovery in domains traditionally dominated by analytical modeling (Hsieh, 2009; Ivezić et al., 2014; Karpatne et al., 2017, 2018; Kutz, 2017; Reichstein et al., 2019; Wang et al., 2018). This thesis aims to advance physics-aware machine learning by developing new methodologies to incorporate domain-specific or general knowledge into deep models.

1.1.1 Modeling Physical Systems

Modeling real-world phenomena, such as climate observations, traffic flow, and physics and chemistry simulation (Gilmer et al., 2017; Li et al., 2018b; Long et al., 2018b; de Bezenac et al., 2018; Sanchez-Gonzalez et al., 2018; Geng et al., 2019), has been considered an important and practical task. While conventional approaches to modeling these physical systems (Virieux, 1986; Kim et al., 2001; Moukalled et al., 2016) have been based on deriving equations and finding numerical solutions of those equations, the recent success of deep learning introduces a new set of data-driven solutions for modeling real-world systems. Given a set of input and output pairs, deep neural networks are able to extract the complicated relations between input and output efficiently, and this automatic data-driven learning is emerging even in physical modeling (Raissi et al., 2017a,b), since it does not require any specific domain knowledge behind the dataset and thus makes modeling easier. On top of the purely data-driven approaches, there are also many works integrating scientific principles with deep models to enhance model quality in many scientific fields: earth systems (Reichstein et al., 2019), climate science (Faghmous & Kumar, 2014; Krasnopolsky & Fox-Rabinovitz, 2006; O'Gorman & Dwyer, 2018), turbulence modeling (Mohan & Gaitonde, 2018; Wang et al., 2020b), material discovery (Cang et al., 2018; Raccuglia et al., 2016; Schleder et al., 2019), quantum chemistry (Sadowski et al., 2016; Schütt et al., 2017), biological sciences (Yazdani et al., 2020), and hydrology (Xu & Valocchi, 2015).
1.1.2 Challenges in Physical System Modeling

While deep learning has achieved remarkable successes in prediction tasks by learning latent representations from data-rich applications such as image recognition (Krizhevsky et al., 2012; Howard et al., 2017, 2019; Du et al., 2020), text understanding (Dai & Le, 2015; Wu et al., 2016; Zoph & Le, 2017; Clark et al., 2020; Shazeer et al., 2020), speech recognition (Hinton et al., 2012; Amodei et al., 2016; Conneau et al., 2020; Wang et al., 2020a; Baevski et al., 2020), and hybrid applications (Akbari et al., 2021), we confront many challenging scenarios when modeling natural phenomena with deep neural networks from only a limited number of observations. One example is air quality monitoring (Berman, 2017), in which the sensors are irregularly distributed over space: many sensors are located in urban areas whereas there are very few sensors in vast rural areas. Another example is extreme weather modeling and forecasting, i.e., temporally short events (e.g., tropical cyclones (Racah et al., 2017b)) without sufficient observations over time. Moreover, inevitable missing values from sensors (Cao et al., 2018; Tang et al., 2019) further reduce the number of operating sensors and shorten the length of fully observed sequences. Thus, achieving robust performance by quickly learning from a few spatiotemporal observations remains an essential but challenging problem.

Another challenge in physical system modeling is that it is hard to derive (or know) the exact equation governing the spatiotemporal observations. Although we have specific knowledge of some physical systems, such as climate, such knowledge is hardly available for newly observed types of systems, such as air quality. Since there is still a substantial gap between the unknown exact equation and an existing simple equation that is only partially correct, this gap must be filled to understand the physical systems.

Thirdly, most data-driven methods for physical systems are designed and optimized for particular systems, and thus knowledge transfer is rarely considered across different systems, even though they share common properties. More specifically, while temperature and air quality have significantly different dynamics, both are fluids, and some common spatiotemporal equations, such as the continuity equation, are involved in describing their dynamics. Thus, if such shareable features are extracted, they are applicable to different systems and helpful for fast adaptation from few data points.

Finally, even when data-driven models perform well, interpreting their behavior remains a challenge. For example, we can train a sequence-based deep model to predict temperature dynamics; however, without any understandable inductive biases, another method is required to interpret the trained model. Additional domain knowledge (e.g., physical rules) can help scientists and non-machine-learning experts interpret deep models.

1.2 Thesis Statement

Incorporating physics-based knowledge is a way to construct robust and interpretable deep models and provide an efficient learning process for modeling physical systems.
[Figure 1.1: Thesis statement. Sensor observations of physical systems (weather, air quality, sea surface dynamics, extreme weather events) are combined with physical knowledge (PDEs, physical properties, geometrical properties) through physics-aware learning. Physical knowledge provides a task-specific inductive bias (prediction improvement: ICLR'18, ICLR'20, AAAI-MLPS'21; robust learning: AAAI-MLPS'21, IJCAI'21) and a general, transferrable inductive bias (learning transferrable knowledge: ICLR'20, IJCAI'21; utilizing causal relations: in submission).]

1.3 Contributions of Research

In this thesis, we aim to address these challenges in modeling natural phenomena, especially sensor-based, spatially discrete observations, and we provide novel methodologies that mitigate the challenges and improve spatiotemporal prediction performance by incorporating domain-specific inductive biases, such as physical rules, into data-driven models.

Previous works have handled spatiotemporally continuous observations to integrate physical knowledge: adding explicit partial differential equations (PDEs) as constraints (Raissi et al., 2017a,b; de Bezenac et al., 2018), modifying neural network architectures (Greydanus et al., 2019; Lutter et al., 2019; Wang et al., 2019a), and providing an inductive bias for classical dynamics (Battaglia et al., 2016, 2018; Chang et al., 2016; Kipf et al., 2018). Overall, while leveraging physical knowledge in data-driven models is emerging as a way to obtain robust models, many works focus only on continuous spatiotemporal data streams rather than the more realistic setting: sensor-based, spatially discrete observations. Since we can only observe physical systems as sets of spatially and temporally discrete observations, it is particularly important to provide an inductive bias for discrete data.

We propose a general framework for incorporating physical knowledge when modeling sensor-based spatiotemporal observations from physical systems. The proposed work contributes a pioneering line of research that incorporates physics-related knowledge, such as equations, qualitative properties, and features common across PDEs, into data-driven black-box models to efficiently model sensor-based natural phenomena governed by unknown physical rules. Figure 1.1 describes the contributions of the thesis. By leveraging the physical knowledge behind sensor-based observations from physical systems, a data-driven model benefits not only from the actual observations but also from an inductive bias that is verified by domain knowledge. Overall, the proposed work makes the following fourfold contributions.

Performance improvement. In many physical systems, if a governing equation is known, the spatiotemporal dynamics can be modeled by the equation directly. In practice, real-world dynamics are mostly not fully described by a single analytical equation, as there are a number of unknown factors that should have been considered in the equation. However, knowing even part of the whole equation is still beneficial for efficient learning, since the data-driven module is only required to learn the gap between the partially given equation and the actual dynamics. Thus, if we can incorporate the equation into a data-driven module, modeling physical systems becomes easier than modeling from scratch, and improved task performance can be expected. We first show how physical properties, such as the fluidic property of being spatiotemporally continuous and slowly varying, are imposed on data-driven models by introducing a notion of data quality for observations based on the local variation of graph signals (a toy computation of this quantity is sketched below).
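As a concrete illustration, the following minimal sketch computes one common form of local variation, the neighbor-weighted squared difference of a graph signal. The toy graph, weights, and readings here are invented for illustration; the exact definition used in Chapter 3 may differ in weighting and normalization.

```python
import numpy as np

# Toy sensor graph: adjacency weights W and one scalar observation per node.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
f = np.array([24.1, 24.3, 24.0, 31.7])  # e.g., temperature; node 3 reads oddly

# Local variation at node i: sum_j W[i, j] * (f[j] - f[i])^2.
local_var = (W * (f[None, :] - f[:, None]) ** 2).sum(axis=1)
print(local_var.round(2))  # the anomalous node 3 and its neighbor stand out
```

A node whose reading departs sharply from its neighbors produces a large local variation, which is the intuition behind using this quantity as a data-quality signal.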
An important result of introducing this quantity is that it not only improves spatiotemporal forecasting performance but also provides a human-understandable metric that can be used to identify whether a sensor is malfunctioning. Second, we show how PDEs governing physical systems can be adapted with learnable parameters to improve prediction performance.

Robust learning. While "knowledge-free" data-driven models have outperformed knowledge-based traditional methods (Agrawal et al., 2019), domain-specific constraints, rules, and specifications play a crucial role in making data-driven models robust. For example, Szegedy et al. (2013) show how purely data-driven models can be vulnerable to small perturbations (known as adversarial perturbations) of the input. To build models robust to unknown perturbations, Qin et al. (2019) provide verification algorithms able to certify richer properties of neural networks. Similarly, we first show how the proposed methods guarantee stable behavior, especially when the amount of training data is limited. When part of the whole (unknown) equation is given, the model with the partial equation shows more stable predictions (e.g., lower standard deviations) than a purely data-driven model. Furthermore, the methods provide a way to verify whether or not the models' behaviors are qualified, in other words, physically feasible.

Extracting transferrable knowledge. Thirdly, physical knowledge can provide generally applicable prior knowledge. Many natural phenomena are described by spatiotemporal equations such as spatiotemporal PDEs. Although each equation is defined for its own purpose (e.g., the Navier-Stokes equations describe the flow of incompressible fluids, and the wave equation describes vibrating objects), many PDEs in physics share common features, such as describing relations between spatial and time derivatives. Thus, these common features can be transferrable regardless of the source of observations. We propose a framework describing how the common features in PDEs are extracted and how features extracted from one data source are transferred to another. Specifically, the proposed framework is physics-aware meta-learning with auxiliary tasks, whose spatial modules incorporate PDE-independent knowledge and whose temporal modules adapt rapidly to limited data. Without requiring the exact form of the governing equations of the observed spatiotemporal data, it mitigates the need for a large number of real-world tasks for meta-learning by leveraging simulated data, and it provides a generally applicable inductive bias for spatiotemporal prediction.
Utilizing underlying causal relations in PDEs. Last but not least, we propose a framework for utilizing the causal relations underlying PDEs in a physical system. Deep neural networks are highly expressive and able to learn unspecified representations by minimizing a specified objective on a given task. Despite this efficient data-driven learning, interpretable inductive biases can benefit both the construction of robust models and the learning process. We propose physics-aware graph-based spatiotemporal networks with a causal module, which leverage additional causal information described in the partial differential equations (PDEs) of physical systems. With partially provided causality labels, we specify causal weights from spatially close and temporally past observations to current observations via semi-supervised learning, and we differentiate the importance of each relation without requiring costly computation. Extensive experiments on simulated time series based on causal relations and on real-world graph signals show that the proposed model improves prediction performance by utilizing physics-based domain knowledge.

Chapter 2
Related Work

Prior knowledge has a critical role in traditional approaches to modeling natural phenomena. It provides a guideline for designing analytical models; however, it cannot capture unspecified dynamics in observations from real-world physical systems. As an alternative, a data-driven approach is very effective, since it does not require complicated descriptions or knowledge to model the targeted observations. Instead, it only requires enough data to adjust learnable parameters on the provided dataset. While this approach can achieve reasonable performance, its behavior is significantly affected by the quality and quantity of the data. To address this practical challenge, prior knowledge can be incorporated into a data-driven approach to construct reliable, interpretable, and robust models from even limited data points. We first provide an overview of learning techniques with prior knowledge. We then discuss how the techniques can be applied to real-world applications.

2.1 Physics-informed Learning

Prior knowledge is helpful for deep models in various respects: improvements on targeted tasks, efficient learning processes, and interpretable models. We summarize existing methods concerning how physics-based knowledge is incorporated into the design of deep neural networks to improve targeted tasks, how prior knowledge is used for efficient learning from limited data points, and how prior knowledge can make the behavior of deep neural networks interpretable. Physics is one of the established fields where domain knowledge is rigorously studied and experimentally verified. Recently, many scientists have leveraged physics-based knowledge to guide data-driven deep models in a desired way. Here, we summarize the excellent survey by Willard et al. (2020) and discuss how physics-aware models are broadly incorporated with deep neural networks.

2.1.1 Surrogate Solutions of Existing Approaches

In this section, we discuss how physics-informed learning is beneficial for handling conventional tasks in physics via data-driven learning. The works in this section show that data-driven modeling can serve as a surrogate for numerical/analytical solutions.

Data-driven PDE Solver. Many real-world problems in physical systems amount to describing observations with partial differential equations and solving those equations numerically. There are many traditional methods for numerically solving PDEs, such as the finite-difference method (FDM), the finite element method (FEM), and the finite volume method (FVM). All of these methods are numerical, and they require a proper discretization and a finite number of steps to approximate a continuous solution (a minimal example of such a discretization follows).
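For illustration, the sketch below advances the 1D heat equation u_t = α u_xx with an explicit finite-difference scheme; the grid sizes, diffusivity, and initial condition are arbitrary choices.

```python
import numpy as np

# Explicit finite differences for the 1D heat equation u_t = alpha * u_xx.
alpha, L, nx, nt = 0.1, 1.0, 51, 500
dx = L / (nx - 1)
dt = 0.4 * dx**2 / alpha          # respects the stability limit dt <= dx^2/(2*alpha)
u = np.exp(-100 * (np.linspace(0, L, nx) - 0.5) ** 2)  # initial heat bump

for _ in range(nt):
    u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])  # u_xx stencil
    u[0] = u[-1] = 0.0            # Dirichlet boundary conditions

print(u.round(3))                 # diffused profile after nt steps
```

The stability restriction on dt visible here is exactly the kind of discretization detail that the learned approaches discussed below aim to relax or automate.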
Combining machine learning techniques with PDE models has a long history in machine learning. Crutchfield & McNamara (1987) introduce a method to reconstruct the deterministic portion of the equations of motion directly from a data series. This approach employs an informational measure of model optimality to guide a search through the space of dynamical systems. Kevrekidis et al. (2003) present a framework for computer-aided multiscale analysis, which enables models at a fine (microscopic/stochastic) level of description to perform modeling tasks at a coarse (macroscopic/systems) level. These macroscopic modeling tasks, yielding information over long times and large scales, are accomplished through appropriately initialized calls to the microscopic simulator for only short times and small spatial domains.

More recently, Raissi et al. (2017a,b) introduce a data-driven solution for PDEs. Rather than analytically solving a given equation, they infer solutions to the targeted PDE via supervised learning. Given an input tuple (x, t), they compute spatial and time derivatives of a black-box model's output and connect them according to the form of the PDE to update all learnable parameters in the black-box model. This method does not require any discretization and is fully data-driven in finding a surrogate model. Raissi (2018) approximates the unknown solution as well as the nonlinear dynamics with two deep neural networks. The first network acts as a prior on the unknown solution and essentially enables us to avoid numerical differentiations, which are inherently ill-conditioned and unstable. The second network represents the nonlinear dynamics and helps us distill the mechanisms that govern the evolution of a given spatiotemporal dataset.
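The core of this approach is a residual loss built by differentiating the network output with respect to its inputs via automatic differentiation. The sketch below illustrates the idea for the 1D heat equation u_t = α u_xx; it is a schematic rendering with an arbitrary small network, not the reference implementation of Raissi et al.

```python
import torch

# PINN-style residual for u_t = alpha * u_xx: differentiate the network
# output u(x, t) with respect to its inputs and penalize the PDE mismatch.
alpha = 0.1
net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)

def pde_residual(x, t):
    x.requires_grad_(True); t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    grad = lambda out, inp: torch.autograd.grad(
        out, inp, torch.ones_like(out), create_graph=True)[0]
    u_t, u_x = grad(u, t), grad(u, x)
    u_xx = grad(u_x, x)
    return u_t - alpha * u_xx          # zero wherever the network satisfies the PDE

x, t = torch.rand(64, 1), torch.rand(64, 1)   # collocation points
loss = pde_residual(x, t).pow(2).mean()       # added to the usual data-fitting loss
loss.backward()
```

In a full training loop, this residual term is combined with a supervised loss on initial/boundary data, so the network is pulled toward solutions that both fit observations and satisfy the equation.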
Magill et al. (2018) introduce a technique based on singular vector canonical correlation analysis (SVCCA) for measuring the generality of neural network layers across a continuously parameterized set of tasks. They illustrate this method by studying generality in neural networks trained to solve parameterized boundary value problems based on the Poisson partial differential equation. In many physical systems, the governing equations are known with high confidence, but direct numerical solution is prohibitively expensive. Often this situation is alleviated by writing effective equations to approximate dynamics below the grid scale. This process is often impossible to perform analytically and is often ad hoc. Bar-Sinai et al. (2019) propose data-driven discretization, a method that uses machine learning to systematically derive discretizations for continuous physical systems. Um et al. (2020) target the problem of reducing the numerical errors of iterative PDE solvers and compare different learning approaches for finding complex correction functions. They integrate the PDE solver into the training loop and thereby allow the model to interact with the PDE during training.

Downscaling. Directly solving PDEs requires spatial and temporal discretization, and finer resolutions are desirable for capturing physically reliable solutions. However, finer resolution increases the computational cost and modeling complexity. Downscaling techniques have been widely used as a solution for capturing physical variables, which need to be modeled at a finer resolution, from a coarser resolution. Recently, artificial neural networks have shown much promise for this problem, given their ability to model nonlinear relationships. Sharifi et al. (2019) present a downscaling algorithm using neural networks to leverage the relationships between satellite precipitation estimates (SPEs) and cloud optical and microphysical properties in northeast Austria. Vandal et al. (2017) present DeepSD (Statistical Downscaling), a generalized stacked super-resolution convolutional neural network (SRCNN) framework for statistical downscaling of climate variables. DeepSD augments SRCNN with multi-scale input channels to maximize predictability in statistical downscaling. Lee et al. (2020) introduce a data-driven framework for the identification of unavailable coarse-scale PDEs from microscopic observations via machine-learning algorithms. Specifically, using Gaussian processes, artificial neural networks, and/or diffusion maps, the proposed framework uncovers the relation between the relevant macroscopic space fields and their time evolution (the right-hand side of the explicitly unavailable macroscopic PDE).

Parameterization. Complex physics-based models (e.g., for simulating phenomena in climate, weather, turbulence modeling, and astrophysics) often use an approach known as parameterization to account for missing physics. In parameterization, specific complex dynamical processes are replaced by simplified physical approximations whose associated parameter values are estimated from data, often through a procedure referred to as parameter calibration. In geology, Chan & Elsheikh (2017) study the application of the Wasserstein GAN (Arjovsky et al., 2017) to the parametrization of geological models. The effectiveness of the method is assessed on uncertainty propagation tasks using several test cases involving different permeability patterns and subsurface flow problems. Goldstein et al. (2014) develop a new predictor for near-bed suspended sediment reference concentration under unbroken waves using genetic programming, a machine learning technique.

In meteorological science, Brenowitz & Bretherton (2018) show that a neural-network-based parameterization can be successfully trained using a near-global aquaplanet simulation with 4-km resolution (NG-Aqua). The neural network predicts the apparent sources of heat and moisture averaged onto (160 km)² grid boxes. Gentine et al. (2018) present a novel approach to convective parameterization based on machine learning, using an aquaplanet with prescribed sea surface temperatures as a proof of concept. A deep neural network is trained with a super-parameterized version of a climate model in which convection is resolved by thousands of embedded 2-D cloud-resolving models.

In chemistry, supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Gilmer et al. (2017) reformulate existing models into a single common framework, called Message Passing Neural Networks (MPNNs), and explore additional novel variations within this framework. Using MPNNs, they demonstrate state-of-the-art results on an important molecular property prediction benchmark. Wang et al. (2019b) utilize molecular graph data for property prediction based on spatial graph convolutional neural networks.
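The message-passing abstraction is compact enough to state in a few lines: each edge computes a message from its endpoint states, messages are summed per target node, and a learned update produces new node states. The following single step is a generic sketch of that pattern; the toy graph, layer sizes, and choice of message/update functions are illustrative rather than those of any specific paper.

```python
import torch

# One generic message-passing step: per-edge messages, per-node aggregation,
# then a learned node update.
n_nodes, d = 4, 8
h = torch.randn(n_nodes, d)                              # node states
edges = torch.tensor([[0, 1], [1, 2], [2, 0], [2, 3]])   # (source, target) pairs

msg_fn = torch.nn.Linear(2 * d, d)                       # message function M
upd_fn = torch.nn.GRUCell(d, d)                          # update function U

src, dst = edges[:, 0], edges[:, 1]
messages = msg_fn(torch.cat([h[src], h[dst]], dim=1))    # one message per edge
agg = torch.zeros_like(h).index_add_(0, dst, messages)   # sum messages per target
h_new = upd_fn(agg, h)                                   # updated node states
```

Stacking several such steps lets information propagate over multi-hop neighborhoods, which is what makes the framework a natural fit for molecules and, later in this thesis, sensor graphs.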
Reduced-Order Models. Reduced-order models (ROMs) are computationally inexpensive representations of more complex models. Usually, constructing ROMs involves dimensionality reduction that attempts to capture the most important dynamical characteristics of often large, high-fidelity simulations and models of physical systems. ML is beginning to assist in constructing ROMs for increased accuracy and reduced computational cost in several ways. One approach is to build an ML-based surrogate model for full-order models (Chen et al., 2012). Xiao et al. (2019) develop a dimensionality reduction method called the Non-Intrusive Reduced Order Model (NIROM) for predicting the turbulent air flows found within an urban environment. In fluid dynamics, ROMs are studied extensively because of the unprecedented physical insight into turbulence offered by high-fidelity computational fluid dynamics (CFD). Mohan & Gaitonde (2018) demonstrate a deep-learning-based approach to building a ROM using the POD basis of canonical DNS datasets for turbulent flow control applications. They find that a type of recurrent neural network, the long short-term memory (LSTM), which has primarily been utilized for problems like speech modeling and language translation, shows attractive potential for modeling the temporal dynamics of turbulence.
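The reduction step in such ROMs, proper orthogonal decomposition (POD), is a truncated SVD of a snapshot matrix. The sketch below builds a low-dimensional basis for synthetic traveling-wave snapshots; the data and mode count are invented for illustration.

```python
import numpy as np

# POD: collect solution snapshots as columns, take the SVD, and keep the few
# most energetic modes. A learned model (e.g., an LSTM, as in ROM work) then
# forecasts only the reduced coefficients.
x = np.linspace(0, 2 * np.pi, 512)
t = np.linspace(0, 10, 200)
snapshots = np.sin(x[:, None] - t[None, :]) + 0.5 * np.cos(2 * x[:, None] + t[None, :])

U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
basis = U[:, :4]                  # 4 POD modes capture this field exactly
coeffs = basis.T @ snapshots      # reduced coordinates: 4 x 200 instead of 512 x 200

err = np.linalg.norm(snapshots - basis @ coeffs) / np.linalg.norm(snapshots)
print(f"relative reconstruction error: {err:.2e}")
```

Forecasting a handful of coefficients instead of the full state is what makes the reduced model cheap to evaluate.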
2.1.2 Learning and Discovering Knowledge

In this section, we discuss how additional inductive biases are beneficial for extracting interpretable knowledge and representations from data. Unlike the previous works, the works described in this section are more related to representation learning, an emerging topic in deep learning research.

Inverse Modeling. The forward modeling of a physical system uses the physical parameters of the system (e.g., mass, temperature, charge, physical dimensions, or structure) to predict the next state of the system or its effects (outputs). In contrast, inverse modeling uses the (possibly noisy) output of a system to infer the intrinsic physical parameters. Inverse problems often stand out as important in physics-based modeling communities because they can potentially shed light on valuable information that cannot be observed directly. Often, the solution of an inverse problem can be computationally expensive due to the potentially millions of forward model evaluations needed for estimator evaluation or characterization of the posterior distributions of physical parameters (Gagne et al., 2017). ML-based surrogate models (in addition to other methods such as reduced-order models) are becoming a realistic choice, since they can model high-dimensional phenomena with lots of data and execute much faster than most physical simulations.

Inverse problems are traditionally solved using regularized regression techniques (see the sketch at the end of this subsection). Earlier work (Dawson et al., 1992) shows that a neural network is able to infer surface scattering parameters from simulated datasets based on a surface scattering model. Pilozzi et al. (2018) introduce a machine-learning approach applicable in general to numerous topological problems that are important in photonics. Inverse problems in medical imaging and computer vision are traditionally solved using purely model-based methods, among which variational regularization models are one of the most popular approaches. Lunz et al. (2018) propose a new framework for applying data-driven approaches to inverse problems, using a neural network as a regularization functional. Recently, novel algorithms using deep learning and neural networks have been applied to inverse problems. Chen et al. (2017) combine the autoencoder, the deconvolution network, and shortcut connections into a residual encoder-decoder convolutional neural network (RED-CNN) for low-dose CT imaging, which is desirable given the potential risk of X-ray radiation to the patient. McCann et al. (2017) review recent uses of convolutional neural networks (CNNs) to solve inverse problems in imaging. Vamaraju & Sen (2019) develop a novel framework for combining physics-based forward models and neural networks to advance seismic processing and inversion algorithms.

There is also increasing interest in the inverse design of materials using ML, where desired target properties of materials are used as input to the model to identify atomic structures that exhibit such properties. Kumar et al. (2020) introduce an efficient and robust machine learning technique for the inverse design of (meta-)materials which, when applied to spinodoid topologies, enables the generation of uniform and functionally graded cellular mechanical metamaterials with tailored direction-dependent (anisotropic) stiffness and density. Raccuglia et al. (2016) demonstrate an alternative approach that uses machine-learning algorithms trained on reaction data to predict reaction outcomes for the crystallization of templated vanadium selenites. Schleder et al. (2019) review how data-driven strategies apply to computational materials science, and Liao & Li (2020) survey and summarize previous works on metaheuristic-based inverse design of various materials.
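As a baseline for the learned approaches above, the classical regularized-regression treatment of a linear inverse problem y = Ax + noise has a closed form: Tikhonov (ridge) regression minimizes ||Ax - y||^2 + lam * ||x||^2. A minimal sketch, with arbitrary problem sizes and regularization weight:

```python
import numpy as np

# Tikhonov-regularized solution of an under-determined linear inverse problem.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 80))          # forward operator: fewer data than unknowns
x_true = np.zeros(80); x_true[::10] = 1.0  # sparse ground-truth parameters
y = A @ x_true + 0.01 * rng.standard_normal(50)

lam = 0.1
x_hat = np.linalg.solve(A.T @ A + lam * np.eye(80), A.T @ y)  # closed-form ridge
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # stable but biased
```

Neural approaches such as the regularization functionals of Lunz et al. (2018) can be read as learning a data-adaptive replacement for the fixed ||x||^2 penalty here.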
Discovering Governing Equations. When the governing equations of a dynamical system are known explicitly, they allow for more robust forecasting and control and, through increased interpretability, the opportunity to analyze system stability and bifurcations. Furthermore, if the learned mathematical model accurately describes the processes governing the observed data, it can generalize to data outside of the training domain. However, in many disciplines (e.g., neuroscience, cell biology, finance, epidemiology, meteorology) dynamical systems have no formal analytic descriptions. Often in these cases, data is abundant, but the underlying governing equations remain elusive. In this section, we discuss equation discovery systems that do not assume the structure of the desired equation but rather explore a large space of possibly nonlinear mathematical terms.

Advances in ML for the discovery of these governing equations have become an active research area, with rich potential to integrate principles from applied mathematics and physics with modern ML methods. Early works on the data-driven discovery of physical laws relied on heuristics and expert guidance and focused on rediscovering known, non-differential laws in different scientific disciplines from artificial data (Gerwin, 1974; Langley, 1981; Langley et al., 1983; Lenat, 1983). Recently, general and robust data-driven discovery of potentially unknown governing equations has been studied (Bongard & Lipson, 2007; Schmidt & Lipson, 2009). Long et al. (2018b) present an attempt to learn evolution PDEs from data. Inspired by the latest developments in neural network design, they propose a new feed-forward deep network, called PDE-Net, to fulfill two objectives at the same time: to accurately predict the dynamics of complex systems and to uncover the underlying hidden PDE models. Subsequently, a new deep neural network, PDE-Net 2.0, which discovers (time-dependent) PDEs from observed dynamic data with minor prior knowledge of the underlying mechanisms driving the dynamics, was also studied (Long et al., 2018a). Brunton et al. (2016) develop a novel framework to discover the governing equations underlying a dynamical system simply from data measurements, leveraging advances in sparsity techniques and machine learning. The resulting models are parsimonious, balancing model complexity with descriptive ability while avoiding overfitting. Quade et al. (2018) propose a conceptual framework to recover parsimonious models of a system in response to abrupt changes in the low-data limit. Rudy et al. (2017) propose a sparse regression method capable of discovering the governing partial differential equation(s) of a given system from time series measurements in the spatial domain. The regression framework relies on sparsity-promoting techniques to select the nonlinear and partial derivative terms of the governing equations that most accurately represent the data, bypassing a combinatorially large search through all possible candidate models.
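The sparse-regression recipe behind several of these methods fits a numerically estimated derivative with a library of candidate terms and prunes small coefficients. The sketch below recovers the logistic equation x' = x - x^2 from simulated data; the library, threshold, and dynamics are chosen for illustration, in the spirit of Brunton et al. (2016) rather than reproducing their implementation.

```python
import numpy as np

# Sparse identification of an ODE: regress x' onto a library of candidate
# terms, then repeatedly zero out small coefficients and refit.
dt = 0.01
t = np.arange(0, 5, dt)
x = 0.1 * np.exp(t) / (1 + 0.1 * (np.exp(t) - 1))       # logistic solution, x(0)=0.1
x_dot = np.gradient(x, dt)                               # numerical derivative

theta = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)  # candidate library
xi = np.linalg.lstsq(theta, x_dot, rcond=None)[0]

for _ in range(10):                                      # sequential thresholding
    xi[np.abs(xi) < 0.05] = 0.0
    big = np.abs(xi) > 0
    xi[big] = np.linalg.lstsq(theta[:, big], x_dot, rcond=None)[0]

print(dict(zip(["1", "x", "x^2", "x^3"], xi.round(2))))  # roughly {x: 1, x^2: -1}
```

The same idea extends to PDE discovery (Rudy et al., 2017) by adding numerically estimated spatial derivative terms to the library.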
Since the concept of neural ordinary differential equations was introduced (Chen et al., 2018), a new family of deep neural network models has developed accordingly. Instead of specifying a discrete sequence of hidden layers, these models parameterize the derivative of the hidden state using a neural network, and the output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed.

Another approach to discovering the underlying physics from ample data is equation-free. Rather than specifying a form of equation whose unknown coefficients/parameters are to be inferred, physics-based knowledge is used only to construct the deep models. Battaglia et al. (2016) introduce the interaction network, a model which can reason about how objects in complex systems interact, supporting dynamical predictions as well as inferences about the abstract properties of the system. Chang et al. (2016) present the Neural Physics Engine (NPE), a framework for learning simulators of intuitive physics that naturally generalize across variable object counts and different scene configurations. Watters et al. (2017) introduce the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations; the model consists of a perceptual front-end based on convolutional neural networks and a dynamics predictor based on interaction networks. Cranmer et al. (2020b) develop a general approach to distill symbolic representations of a learned deep model by introducing strong inductive biases: sparse latent representations are extracted from graph neural networks (GNNs) trained in a supervised setting, and symbolic regression is then applied to components of the learned model to extract explicit physical relations.

Data Generation. Data generation approaches are useful for creating virtual simulations of scientific data under specific conditions. Traditional physics-based approaches to generating data often rely on running physical simulations or conducting physical experiments, which tend to be very time consuming. Also, these approaches are restricted by what can be produced by physics-based models. Hence, there is increasing interest in generative ML approaches that learn data distributions in unsupervised settings and thus have the potential to generate novel data beyond what traditional approaches can produce. Generative machine learning models have found tremendous success in areas such as speech recognition and generation (Oord et al., 2016), image generation (Denton et al., 2015), and natural language processing (Denton et al., 2015). In the scientific domain, generative adversarial networks (GANs) can generate data resembling the data generated by physics-based models. Using GANs often brings certain benefits, including reduced computation time and better reproduction of complex phenomena, given the ability of GANs to represent nonlinear relationships. Farimani et al. (2017) use conditional generative adversarial networks (cGANs) to train models for the direct generation of solutions to steady-state heat conduction and incompressible fluid flow purely from observation, without knowledge of the underlying governing equations. Such approaches using generative models have been shown to significantly accelerate the process of generating new data samples.

A well-known issue of GANs is that they incur dramatically high sample complexity. Therefore, a growing area of research is to engineer GANs that can leverage prior knowledge of physics in terms of physical laws and invariance properties. For example, GAN-based models for simulating turbulent flows can be further improved by incorporating physical constraints into the loss function, e.g., conservation laws (Yang et al., 2019) and the energy spectrum (Wu et al., 2020). Cang et al. (2018) propose a generative machine learning model that creates an arbitrary amount of artificial material samples at negligible computation cost when trained on only a limited number of authentic samples. The key contribution of this work is the introduction of a morphology constraint to the training of the generative model, which enforces that the resulting artificial material samples have the same morphology distribution as the authentic ones.

Uncertainty Quantification. Uncertainty quantification (UQ) is of great importance in many areas of computational science (e.g., climate modeling (Deser et al., 2012), fluid flow (Christie et al., 2006), and systems engineering (Pettit, 2004), among many others). UQ requires an accurate characterization of the entire distribution p(y|x), where y is the response and x is the covariate of interest, rather than just a point prediction y = f(x). This makes it possible to characterize all quantiles and skews in the distribution, which allows for analyses such as examining how close predictions are to being unacceptable, or sensitivity analysis of input features. Applying UQ tasks to physics-based models using traditional methods such as Monte Carlo (MC) is usually infeasible due to the thousands or millions of forward model evaluations needed to obtain convergent statistics. In the physics-based modeling community, a common technique is to perform model reduction or create an ML surrogate model in order to increase model evaluation speed, since ML models often execute much faster (Galbally et al., 2010; Manzoni et al., 2016; Tripathy & Bilionis, 2018). With a similar goal, the ML community has often employed Gaussian processes as the main technique for quantifying uncertainty in simulating physical processes (Bilionis & Zabaras, 2012; Rajabi & Ketabchi, 2017), but neither Gaussian processes nor reduced models scale well to higher dimensions or larger datasets (Gaussian processes scale as O(N³) with N data points). Consequently, there is an effort to fit deep learning models, which have exhibited countless successes across disciplines, as surrogates for numerical simulations in order to achieve faster model evaluations for UQ. Since artificial neural networks do not have UQ naturally built into them, variations have been developed. Galbally et al. (2010) develop a probabilistic dropout strategy in which neurons are periodically turned off as a type of Bayesian approximation for estimating uncertainty.
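Mechanically, such dropout-based UQ amounts to keeping dropout active at prediction time and reading the spread of repeated stochastic forward passes as predictive uncertainty. A minimal sketch, with arbitrary network size, dropout rate, and sample count:

```python
import torch

# Monte Carlo dropout: keep dropout stochastic at inference and treat the
# spread of repeated forward passes as an approximate Bayesian uncertainty.
net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.ReLU(),
    torch.nn.Dropout(p=0.2),
    torch.nn.Linear(64, 1),
)

x = torch.linspace(-1, 1, 10).unsqueeze(1)
net.train()                               # .train() keeps dropout masks active
with torch.no_grad():
    samples = torch.stack([net(x) for _ in range(100)])  # 100 stochastic passes

mean, std = samples.mean(dim=0), samples.std(dim=0)      # prediction and spread
print(std.squeeze())                      # larger std = less confident prediction
```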
There are also other Bayesian variants. MacKay (1992) quantifies estimates of the uncertainty on network parameters and on network output, and Bayesian model structures have been applied for uncertainty estimation of streamflow simulation in two U.S. Department of Agriculture Agricultural Research Service watersheds (Zhang et al., 2009). The integration of prior physics knowledge into ML for UQ has the potential to allow for a better characterization of uncertainty. For example, ML surrogate models run the risk of producing physically inconsistent predictions, and incorporating elements of physics could help with this issue. Also, note that the reduced data needs of ML due to constraints for adherence to known physical laws could alleviate some of the high computational cost of Bayesian neural networks for UQ.

Causal Discovery in Time Series
How to discover the underlying causal structure is a fundamental problem that is still actively studied. Rubin (1974), Pearl (2009), and Imbens & Rubin (2015) introduced the problem and provided a mathematical framework for causal reasoning and inference under causal graphical models (also known as Bayesian networks (BN) (Koller & Friedman, 2009)). Granger (1969) formalized a concept of quantifiable causality in time series, called Granger causality. Following these pioneering works, learning causal associations from time series has become an emerging topic in the machine learning and deep learning communities as well. Runge (2018) proposes a method to distinguish direct from indirect dependencies and common drivers among multiple time series to reconstruct a causal network. Runge et al. (2019b) quantify causal associations in nonlinear time series, and Runge et al. (2019a) and Nauta et al. (2019) provide promising applications of causal discovery in time series. Pamfil et al. (2020) introduce a smooth acyclicity constraint for multivariate time series, inspired by Zheng et al. (2018), who treat causal discovery as a purely continuous optimization problem. Although many works succeed in discovering unknown causal structure directly from observational data, there are few works that leverage explicit causal relations from domain knowledge to improve data-driven models.

2.1.3 Deep Learning Applications
Finally, we summarize what kinds of deep learning applications can be improved via physics-informed learning.

Transfer Learning and Fast Adaptation
The aim of meta-learning is to enable learning parameters which can be used for new tasks unknown at the time of learning, leading to agile models which adapt to a new task utilizing only a few samples (Schmidhuber, 1987; Naik & Mammone, 1992; Thrun & Pratt, 1998). Based on how the knowledge from related tasks is used, meta-learning methods have been classified as optimization-based (Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Duan et al., 2017; Finn et al., 2017; Nichol et al., 2018; Antoniou et al., 2018; Rusu et al., 2018; Grant et al., 2018), model-based (Santoro et al., 2016; Munkhdalai & Yu, 2017; Duan et al., 2017; Mishra et al., 2018), and metric-based (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017). Recently, another branch of meta-learning has been introduced that focuses more on finding a set of reusable modules as components of a solution to a new task. Alet et al. (2018, 2019) provide a framework, structured modular meta-learning, where a finite number of task-independent modules is introduced and an optimal structure combining the modules is found from a limited amount of data.
Chen et al. (2019) introduce techniques to automatically discover task-independent/dependent modules based on Bayesian shrinkage to find which modules are more adaptable. To our knowledge, none of the above works provide a solution for using meta-learning to model physics-related spatiotemporal dynamics. The integration of physics-based knowledge, properties, and rules into deep neural networks has the potential to allow for faster and more efficient learning and modeling of physical systems in the real world.

Graph Networks for Spatiotemporal Prediction
The topic of reasoning about physical dynamics over discrete objects has been actively studied (Battaglia et al., 2016; Chang et al., 2016; Sanchez-Gonzalez et al., 2018; Kipf et al., 2018) following the appearance of graph-based neural networks (Kipf & Welling, 2017b; Santoro et al., 2017; Gilmer et al., 2017). Battaglia et al. (2016) introduce the interaction network, a model which can reason about how objects in complex systems interact, supporting dynamical predictions as well as inferences about the abstract properties of the system. Chang et al. (2016) propose a factorization of a physical scene into composable object-based representations and a neural network architecture whose compositional structure factorizes object dynamics into pairwise interactions. Sanchez-Gonzalez et al. (2018) introduce a new class of learnable models, based on graph networks, which implement an inductive bias for object- and relation-centric representations of complex dynamical systems. Kipf et al. (2018) introduce the neural relational inference (NRI) model: an unsupervised model that learns to infer interactions while simultaneously learning the dynamics purely from observational data. Although these models can handle sparsely located data points without explicitly given physics equations, they are purely data-driven, so the physics-inspired inductive bias for exploiting finite differences is not considered at all. Another application of graph networks is urban management. There is a rich body of work on this topic, including arrival time estimation (Li et al., 2018b; Geng et al., 2019; Shi et al., 2020; Ye et al., 2020) and demand-supply forecasting (Davis et al., 2020). While these works commonly focus on discrete dynamics, our method consists of physics-aware modules that efficiently leverage the inductive bias to learn spatiotemporal data from physical systems.

2.2 Applications
We overview several applications in which physics-aware learning can be beneficial. As deep learning has been successful through its data-driven philosophy, it has also been applied to modeling physical systems, for which theory-based approaches were conventionally considered the only way to simulate them. Here, we discuss applications related to physics-aware data-driven learning.

2.2.1 Simulation of Physical Systems
Physics-based knowledge is directly helpful for data-driven approaches, which try to mimic or learn (unknown) physical equations from data, even if the knowledge is only partially given. Greydanus et al. (2019) leverage the concept of the Hamiltonian to provide an invariant quantity, the total energy, for optimizing learnable parameters. Specifically, they compute the partial derivatives of the Hamiltonian with respect to the coordinates (q, p), which denote position and momentum, respectively. They then use Hamilton's equations (Equation 2.1) to find the time derivatives of the system and integrate the time derivatives to predict the state of the system at some time in the future:

dq/dt = ∂H/∂p,   dp/dt = −∂H/∂q.    (2.1)
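A minimal PyTorch sketch of this idea follows: a scalar network models H(q, p), and automatic differentiation supplies the partial derivatives in Equation 2.1. The architecture and names are illustrative assumptions, not the reference implementation of Greydanus et al. (2019).

```python
import torch
import torch.nn as nn

class HamiltonianNet(nn.Module):
    """Scalar H_theta(q, p); dynamics follow Hamilton's equations (Eq. 2.1)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def time_derivatives(self, q, p):
        qp = torch.stack([q, p], dim=-1).requires_grad_(True)
        H = self.net(qp).sum()
        dH = torch.autograd.grad(H, qp, create_graph=True)[0]
        dH_dq, dH_dp = dH[..., 0], dH[..., 1]
        return dH_dp, -dH_dq  # dq/dt = dH/dp, dp/dt = -dH/dq

model = HamiltonianNet()
q, p = torch.randn(16), torch.randn(16)
dq_dt, dp_dt = model.time_derivatives(q, p)
# Training would regress (dq_dt, dp_dt) onto observed finite-difference
# derivatives, so the learned H is conserved along predicted trajectories.
```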
As a complementary class to Hamiltonian neural networks (HNNs), Cranmer et al. (2020a) introduce Lagrangian Neural Networks (LNNs), which are able to learn Lagrangian functions straight from data without canonical coordinates. The proposed models are able to learn large physical systems under the incorporated rules of energy and momentum conservation; these rules guide the behavior of the learnable parameters and lead to more robust models. de Bezenac et al. (2018) show how general background knowledge gained from physics can be used as a guideline for designing efficient deep learning models, particularly for sea surface temperature (SST) prediction. The classical approach to forecasting SST consists in using numerical models representing prior knowledge of the conservation laws and physical principles, which take the form of PDEs. Instead, they leverage an inductive bias underlying fluids: transport occurs through the combination of two principles, advection and diffusion. By incorporating this physical knowledge, the data-driven model is able to predict temperature changes more accurately and much more quickly than the conventional approach.

In weather and climate modeling via physics-aware learning, Rasp et al. (2018), Gentine et al. (2018), Chattopadhyay et al. (2020c), and Bolton & Zanna (2019) emulate complex physical processes in weather and climate systems that are either poorly understood or not sufficiently well represented by existing models. Groenke et al. (2020), Vandal et al. (2017, 2018), Stengel et al. (2020), and Baño-Medina et al. (2020) downscale coarse data to produce high-fidelity high-resolution data. Chattopadhyay et al. (2020a,b) forecast the spatiotemporal dynamics of the atmosphere and ocean. To mimic physical simulators, King et al. (2018) show how GANs can be used to generate new solutions of PDE-governed systems by training on simulation datasets, capturing several desirable physical and statistical properties of turbulent flows. Furthermore, Gagne et al. (2020) use GANs with temporal coherence for stochastic emulation of subgrid-scale dynamics, Xie et al. (2018) incorporate temporal coherence into GANs to generate realizations of turbulent flows, Yang et al. (2018) encode the governing physical laws, in the form of stochastic differential equations, into the architecture of GANs, and Stinis et al. (2019) incorporate constraints to enhance the interpolation and extrapolation capabilities of GANs.

2.2.2 Learning Unknown Physics
Extracting or discovering unknown physics from data is another field of application for physics-aware learning. Cranmer et al. (2020b) are motivated by symbolic regression, a supervised machine learning technique that assembles analytic functions to model a dataset. They propose a way to combine symbolic regression with deep learning, which is extraordinarily efficient at learning in high-dimensional spaces but suffers from poor generalization and interpretability. The proposed model is applied to dark matter in cosmology, and it enables finding a new analytic equation that describes the overdensity of dark matter given its environment. Long et al. (2018b, 2019) focus on a similar topic: discovering PDEs from observed dynamic data with minor prior knowledge of the underlying mechanism that drives the dynamics. Implicit physical principles such as relation learning can also be beneficial for video prediction tasks.
Battaglia et al. (2016) introduce the interaction network, a model which can reason about how objects in complex systems interact, supporting dynamical predictions as well as inferences about the abstract properties of the system. Chang et al. (2016) present the Neural Physics Engine (NPE), a framework for learning simulators of intuitive physics that naturally generalize across variable object counts and different scene configurations. Watters et al. (2017) introduce the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations.

2.2.3 Interpretability
Interpretable models provide transparency; however, it is difficult for deep neural networks to become interpretable models purely from data, as deep learning is more efficient at interpolating data than at extracting human-level understandable knowledge. This research direction is still in its infancy, and only a few works show initial progress. McGovern et al. (2019) and Ebert-Uphoff & Hilburn (2020) show ways to interpret, visualize, and evaluate machine learning models in meteorological applications. Gagne II et al. (2019) use feature importance and feature optimization to interpret their CNN model for predicting the probability of severe hailstorms and find that the model synthesizes information about the environment and storm morphology that is consistent with our current understanding of the physics of hailstorms. Toms et al. (2020a) develop interpretable neural networks for the geosciences and show their usefulness and reliability in improving our understanding of the Madden-Julian oscillation (Toms et al., 2020b). Brenowitz et al. (2020) develop an interpretability framework specialized for analyzing the relationship between offline skill and online coupled prognostic performance for ML parameterizations of convection.

Chapter 3
Notations, Preliminary and Datasets

We first show how graph signals and their basic properties are defined. Then, we provide advanced concepts such as the local variation at a vertex based on graph signals, the data quality level, calculus on graphs, and graph networks.

3.1 Preliminary
Given a graph G = (V, E), where V = {1, ..., n} is a set of vertices and E ⊆ V² is a set of edges, two types of real-valued functions can be defined on the graph: on the vertices, f_i : V → R, and on the edges, F_ij : E → R, where i and j are vertex indices. It is also possible to define multiple functions on the vertices or edges, just as a pixel in a CNN can have multiple feature maps. Since f and F can be viewed as scalar and vector fields in differential geometry (Figure 3.2), discrete operators on graphs can be defined following Bronstein et al. (2017).

3.2 Local Variation and Data Quality
We focus on graph signals defined on an undirected, weighted graph G = (V, E, W), where V is a set of vertices with |V| = N and E is a set of edges. W ∈ R^{N×N} is a random-walk normalized weighted adjacency matrix which encodes how relatively close two vertices are. When the elements W_ij are not explicitly provided by the dataset, the graph connectivity can be constructed with various distance metrics, such as the Euclidean distance, cosine similarity, or a Gaussian kernel (Belkin & Niyogi, 2002), on the vertex features V ∈ R^{N×d}, where d is the number of features. Once the structural connectivity is given, the local variation can be defined by the edge derivative of a graph signal x ∈ R^N defined on every vertex (Zhou & Schölkopf, 2004):
∂x/∂e |_{e=(i,j)} = √(W_ij) (x_j − x_i),    (3.1)

where e = (i, j) defines the direction of the derivative and x_i is the signal value on vertex i. The graph gradient of x at vertex i can be defined through Equation 3.1 over all edges joining vertex i:

∇_i x = ( ∂x/∂e |_{e=(i,j)} : j ∈ N_i ),    (3.2)

where N_i is the set of neighbor vertices of vertex i. While the dimension of the graph gradient varies with the number of neighbors of each vertex, the local variation at vertex i can be defined by a norm of the graph gradient:

‖∇_i x‖² = Σ_{j∈N_i} W_ij (x_j − x_i)².    (3.3)

Equation 3.3 provides a measure of the local variation of x at vertex i. As it indicates, if all neighboring signals of i are close to the signal at i, the local variation at i is small, meaning less fluctuated signals around the vertex. As Equation 3.3 shows, the local variation is a function of both the structural connectivity W and the graph signal x. Figure 3.1 illustrates how the two factors affect the local variation at the same vertex.

Figure 3.1: (a) Original graph, (b) Different adjacency, (c) Different graph signal. Each bar represents the signal value at the vertex where the bar originates. Blue and red colors indicate positive and negative values, respectively. Local variations at the green node are (a) 1.67, (b) 25, and (c) 15, respectively.

The concept of the local variation is easily generalized to multivariate graph signals of M different measures by repeatedly computing Equation 3.3 over all measures:

L_i = (‖∇_i x_1‖², ..., ‖∇_i x_m‖², ..., ‖∇_i x_M‖²),    (3.4)

where x_m ∈ R^N corresponds to the m-th signal from multiple sensors. As Equation 3.4 indicates, L_i is an M-dimensional vector describing the local variations at vertex i with respect to the M different measures. Finally, it is desirable to represent Equation 3.4 in a matrix form so that it can later be combined with graph convolutional networks:

L = (D + W)(X ∘ X) − 2(X ∘ (WX)),    (3.5)

where D is a degree matrix defined as D_ii = Σ_j W_ij and ∘ is an element-wise product operator. X is an N×M matrix describing multivariate graph signals on N vertices, and x_m is the m-th column of X. L ∈ R^{N×M} is the local variation matrix, and L_im is the local variation at vertex i with respect to the m-th signal.

Data Quality Level
While the term data quality has been used in various ways, it generally means "fitness of use" for intended purposes (Juran & Godfrey, 1999). In this section, we define the term under the data property we are interested in and propose how to exploit the data quality level in a general framework. Given a multivariate graph signal X ∈ R^{N×M} on vertices represented by a feature matrix V ∈ R^{N×d}, we assume that a signal value at a certain vertex i is desired not to be significantly different from the signals of its neighboring vertices j ∈ N_i. This is a valid assumption if the signal value at vertex i is dependent on (or a function of) the features of vertex i, and an edge weight is defined by a distance in the feature space between two vertices. In other words, if two vertices have similar features, they are connected to form the graph structure, and the signal values observed at those vertices are highly likely to be similar. Many domains follow this assumption, for instance, geographical features (vertex features) and meteorological observations (graph signal), or sensory nervous features and received signals. Under the assumption, we define the data quality level (score) at a vertex i as a function of the local variations at i:

s_i = q(L_i).    (3.6)

It is flexible to choose the function q.
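The computation up to this point can be sketched in a few lines of Python; the snippet below implements Equation 3.5 directly, and the simple averaging-based quality function is one illustrative choice of q (revisited next), not the only one.

```python
import numpy as np

def local_variation(W, X):
    """Eq. (3.5): L = (D + W)(X o X) - 2(X o WX); row i holds the local
    variations at vertex i for each of the M measures (equals Eq. (3.3))."""
    D = np.diag(W.sum(axis=1))
    return (D + W) @ (X * X) - 2.0 * (X * (W @ X))

def data_quality(L_mat):
    """One illustrative choice of q in Eq. (3.6): average the per-measure
    variations, then map larger variation to lower quality."""
    return 1.0 / (1.0 + L_mat.mean(axis=1))

rng = np.random.default_rng(0)
N, M = 5, 3
W = rng.random((N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
X = rng.standard_normal((N, M))   # multivariate graph signal
L_mat = local_variation(W, X)     # (N, M) local variation matrix
s = data_quality(L_mat)           # (N,) quality scores in (0, 1]
```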
For example, s_i can be defined as an average of L_i. If so, all measures are equally considered when computing the data quality at vertex i. In a more general sense, we can introduce a parameterized function q(L_i; θ) and learn the parameters from data. Kipf & Welling (2017a) propose a method that learns parameters for graph-based features via layer-wise graph convolutional networks (GCN) with arbitrary activation functions. For a single-layer GCN on a graph, the latent representation of each node can be represented as:

Z = σ(ÂXθ),    (3.7)

where Â = (D + I_N)^{−1/2}(A + I_N)(D + I_N)^{−1/2} provides the structural connectivity of a graph, θ is a trainable parameter matrix, and σ is an activation function. By stacking σ(ÂXθ), it is possible to achieve larger receptive fields with a multi-layer GCN.

Figure 3.2: (a) Scalar field, (b) Vector field, (c) Vertex function, (d) Edge function. Scalar/vector fields on Euclidean space and vertex/edge functions on a graph.

We can replace ÂX with L, which is also a function of the weighted adjacency W and the graph signal X. Note that the values in row i of ÂX and L are functions of the values at i as well as at the neighbors of i. Although L only exploits the nearest neighbors (i.e., 1-hop neighbors), it is possible to consider K-hop neighbors when computing the local variations by stacking GCN layers before applying Equation 3.3. The generalized formula for the data quality level can be represented as:

Z = σ_K(Â σ_{K−1}(⋯ σ_1(ÂXθ_1) ⋯ θ_{K−1}) θ_K),    (3.8)
s = σ_L(L(W, Z) θ),    (3.9)

where K is the number of GCN layers and s = (s_1, s_2, ..., s_N) is the data quality level of each vertex incorporating K-hop neighbors. We propose a constraint to ensure that a higher s_i corresponds to less fluctuation around i. First, we constrain θ to be positive to guarantee that larger elements in L cause larger Lθ, which are the inputs of σ_L. Next, we use an activation function that is inversely proportional to its input, e.g., σ_L(x) = 1/(1+x), to meet the relation between the data quality s_i and the local variations L_i. Once s is obtained, it is combined with an objective function to assign a penalty to each vertex's loss function.

3.3 Calculus on Graphs
Vector calculus is concerned with differentiation of vector/scalar fields, primarily in 3-dimensional Euclidean space R³. It plays an essential role in partial differential equations (PDEs) in physics and engineering.

Gradient on Graphs
The gradient on a graph is the linear operator defined by

∇ : L²(V) → L²(E),   (∇f)_ij = (f_j − f_i) if {i, j} ∈ E and 0 otherwise,

where L²(V) and L²(E) denote Hilbert spaces of vertex and edge functions, respectively; thus f ∈ L²(V) and F ∈ L²(E). As the gradient in Euclidean space measures the rate and direction of change in a scalar field, the gradient on a graph computes differences of the values between two adjacent vertices, and the differences are defined along the directions of the corresponding edges.

Divergence on Graphs
The divergence in Euclidean space maps vector fields to scalar fields. Similarly, the divergence on a graph is the linear operator defined by

div : L²(E) → L²(V),   (div F)_i = Σ_{j:(i,j)∈E} W_ij F_ij   ∀i ∈ V,

where W_ij is the weight on the edge (i, j). It denotes a weighted sum of the edge functions incident to a vertex i, which is interpreted as the net flow at vertex i.
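Both operators are easy to realize directly on an edge list; the minimal sketch below (with an arbitrary toy graph) shows them in action, and composing them recovers the Laplacian discussed next.

```python
import numpy as np

def graph_gradient(edges, f):
    """(grad f)_ij = f_j - f_i for each directed edge (i, j)."""
    return np.array([f[j] - f[i] for i, j in edges])

def graph_divergence(edges, W, F, n):
    """(div F)_i = sum_j W_ij F_ij: net flow at vertex i."""
    div = np.zeros(n)
    for (i, j), Fij in zip(edges, F):
        div[i] += W[i, j] * Fij
    return div

n = 4
W = np.array([[0, 1, 0, 1], [1, 0, 1, 0],
              [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
edges = [(i, j) for i in range(n) for j in range(n) if W[i, j] > 0]
f = np.array([0.0, 1.0, 2.0, 3.0])
grad_f = graph_gradient(edges, f)
# Composing the two operators gives -(Lf) with L = D - W,
# the graph Laplacian introduced next.
lap_f = -graph_divergence(edges, W, graph_gradient(edges, f), n)
```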
Laplacian on Graphs
The Laplacian (Δ = ∇²) in Euclidean space measures the difference between the value of a scalar field and its average on infinitesimal balls. Similarly, the graph Laplacian is defined as

Δ : L²(V) → L²(V),   (Δf)_i = Σ_{j:(i,j)∈E} W_ij (f_i − f_j)   ∀i ∈ V.

The graph Laplacian can be represented in matrix form as L = D − W, where D = diag(Σ_{j:j≠i} W_ij) is a degree matrix and W denotes a weighted adjacency matrix. Note that L = Δ = −div∘∇, and the minus sign is required to make L positive semi-definite.

Difference Operators on a Triangulated Mesh
According to Crane (2018), the gradient and Laplacian operators on a triangulated mesh can be discretized by incorporating the coordinates of the nodes. To obtain the gradient operator, the per-face gradient of each triangular face is calculated first. Then, the gradient at each node is the area-weighted average of all its neighboring faces, and the gradient on edge (i, j) is defined as the dot product between the per-node gradient and the direction vector e_ij. The Laplacian operator can be discretized with the Finite Element Method (FEM):

(Δf)_i = (1/2) Σ_{j:(i,j)∈E} (cot α_j + cot β_j)(f_j − f_i),

where node j belongs to node i's immediate neighbors (j ∈ N_i) and (α_j, β_j) are the two angles opposing the edge (i, j). Based on these core differential operators on a graph, we can re-write differentiable physics equations (e.g., the diffusion equation or the wave equation) on a graph.

3.4 Graph Networks
Battaglia et al. (2018) proposed the graph networks framework, which generalizes relations among vertices, edges, and a whole graph. Graph Networks (GN) describe how edge, node, and global attributes are updated by propagating information among themselves. Given a set of node (v), edge (e), and global (u) attributes, the steps of computation in a graph networks block are as follows:

1. e'_ij ← φ^e(e_ij, v_i, v_j, u) for all {i, j} ∈ E pairs.
2. v'_i ← φ^v(v_i, ē'_i, u) for all i ∈ V, where ē'_i is an aggregated edge attribute related to node i.
3. u' ← φ^u(u, ē', v̄'), where ē' and v̄' are aggregated attributes of all edges and all nodes in a graph, respectively.

Here φ^e, φ^v, and φ^u are edge, node, and global update functions, respectively, and they can be implemented by learnable neural networks. Note that the computation order is flexible, and the aggregators can be chosen freely as long as they are invariant to permutations of their inputs. As φ^e is a mapping function from vertices to edges, it can be replaced by the graph gradient operator to describe a known relation explicitly. Similarly, φ^v can learn divergence-like (edge-to-node) mapping functions. In other words, graph networks have highly flexible modules which are able to imitate the differential operators on a graph explicitly or implicitly.
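A minimal PyTorch sketch of one GN block follows; the MLP update functions, mean aggregators, and attribute sizes are illustrative assumptions within the framework of Battaglia et al. (2018), not a specific published implementation.

```python
import torch
import torch.nn as nn

class GNBlock(nn.Module):
    """One graph-networks step: edge, node, then global update."""
    def __init__(self, dv, de, du, hidden=32):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(de + 2 * dv + du, hidden),
                                   nn.ReLU(), nn.Linear(hidden, de))
        self.phi_v = nn.Sequential(nn.Linear(dv + de + du, hidden),
                                   nn.ReLU(), nn.Linear(hidden, dv))
        self.phi_u = nn.Sequential(nn.Linear(du + de + dv, hidden),
                                   nn.ReLU(), nn.Linear(hidden, du))

    def forward(self, v, e, u, senders, receivers):
        n = v.size(0)
        # 1. e'_ij <- phi_e(e_ij, v_i, v_j, u)
        e2 = self.phi_e(torch.cat([e, v[senders], v[receivers],
                                   u.expand(e.size(0), -1)], dim=-1))
        # 2. v'_i <- phi_v(v_i, mean of incoming e'_ij, u)
        agg = torch.zeros(n, e2.size(1)).index_add_(0, receivers, e2)
        deg = torch.zeros(n).index_add_(0, receivers,
                                        torch.ones(len(receivers))).clamp(min=1)
        v2 = self.phi_v(torch.cat([v, agg / deg.unsqueeze(-1),
                                   u.expand(n, -1)], dim=-1))
        # 3. u' <- phi_u(u, mean e', mean v')
        u2 = self.phi_u(torch.cat([u, e2.mean(0, keepdim=True),
                                   v2.mean(0, keepdim=True)], dim=-1))
        return v2, e2, u2

v = torch.randn(5, 8); e = torch.randn(10, 4); u = torch.randn(1, 6)
senders = torch.randint(0, 5, (10,)); receivers = torch.randint(0, 5, (10,))
v2, e2, u2 = GNBlock(8, 4, 6)(v, e, u, senders, receivers)
```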
3.5 Datasets
Overall, we have used sensor-based datasets from real-world physical systems. Specifically, we mostly evaluate our models on climate datasets measured by spatially distributed weather sensors. In this section, we provide an overview of the datasets used in our evaluation.

3.5.1 Weather Underground and WeatherBug in Los Angeles County
We use real-world meteorological measurements from two commercial weather services providing real-time weather information, Weather Underground (WU)¹ and WeatherBug (WB)² (Figure 3.3). Both datasets provide real-time weather information from personal weather stations (PWS). In the datasets, all stations are distributed around Los Angeles County, and the geographical characteristics of each station are also provided. These characteristics are used as a set of input features to build a graph structure. The list of the 11 static characteristics is: Latitude, Longitude, Elevation, Tree fraction, Vegetation fraction, Albedo, Distance from coast, Impervious fraction, Canopy width, Building height, and Canopy direction. Meteorological signals at each station are observed through the installed instruments. The types of measurements are Temperature, Pressure, Relative humidity, Solar radiation, Precipitation, Wind speed, and Wind direction. Since each weather station observes the measurements at its own frequency (e.g., every 5 minutes or every 30 minutes), we fix the temporal granularity at 1 hour and aggregate the observations in each hour by averaging them.

We want to ensure that the model can be verified and explained physically within one meteorological regime before applying it to the entire year with many other regimes. Since it is more challenging to predict temperatures in the summer season of Los Angeles due to the large fluctuation of daytime temperatures (summer: 36 °F (19 °C) and winter: 6 °F (3.3 °C) between inland areas and the coastal Los Angeles Basin), we use 2 months of observations from each service, July 2015 and August 2015, for our experiments. We use this dataset for the data quality inference work described in Chapter 4.

¹ https://www.wunderground.com/
² http://weather.weatherbug.com/

Figure 3.3: Personal weather stations distributed over the Los Angeles area: (a) Weather Underground, (b) WeatherBug.

3.5.2 Simulated Climate Observations in the Southern California Region
We use hourly simulated climate observations for 16 days on the Southern California region from Zhang et al. (2018) (Latitude: 32.226044 to 35.140636; Longitude: -119.593155 to -116.29416). In this dataset, we sampled small regions randomly from two areas (Los Angeles and San Diego, Figure 3.4) encompassing urban and rural meteorological features to generate spatially discrete observations. To build a graph, we connected pairs of the sampled regions using the k-nearest neighbors algorithm (k = 3), as sketched below. This data preprocessing is required to verify the proposed idea as well as to evaluate PaGN in the spatiotemporally sparse setting, which is more common for sensor-based datasets. The vertex attributes consist of 5 static features: Latitude (XLAT), Longitude (XLONG), Impervious fraction (FRC_URB2D), Vegetation fraction (VEGFRA), and Land usage class (LU_INDEX). On each vertex there are 10 time-varying climate observations: Air temperature (T2), Albedo (ALBEDO), Precipitation (RAINNC), Soil moisture (SMOIS), Relative humidity (RH2), Specific humidity (Q2), Surface pressure (PSFC), Planetary boundary layer height (PBLH), and Wind vector (U and V for 2 directions). Table 3.3 shows the description of each feature.
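The k-NN graph construction mentioned above can be sketched as follows, assuming plain Euclidean distance on the region coordinates; the names and the toy input are illustrative.

```python
import numpy as np

def knn_edges(coords, k=3):
    """Connect each sampled region to its k nearest neighbors (symmetrized)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]
    edges = {(i, int(j)) for i in range(len(coords)) for j in nbrs[i]}
    return edges | {(j, i) for i, j in edges}  # undirected graph

coords = np.random.rand(20, 2)  # stand-ins for (latitude, longitude) pairs
edges = knn_edges(coords, k=3)
```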
While the edge attributes are not given explicitly, we can specify the type of each edge using the types of the connected regions. There are 13 different land-usage types, and each type summarizes how the corresponding land is used. Based on the types of the connected regions, we assigned different embedding vectors to the edges.

Table 3.1: Land features
Name / Unit / Description
Latitude / degree (°) / An angle which ranges from 0° at the Equator to 90° at the poles
Longitude / degree (°) / An angle which ranges from 0° at the Prime Meridian to +180° eastward and -180° westward
Elevation / ft / Elevation from average sea level
Tree fraction / dimensionless / Fraction covered by trees in that neighborhood
Vegetation fraction / dimensionless / Fraction of the neighborhood covered by vegetation
Albedo / dimensionless / Reflected amount of incoming shortwave
Distance from coast / m / Distance from the nearest coastal point
Impervious fraction / dimensionless / Fraction of the neighborhood covered by impervious material
Canopy width / ft / Width of the buildings to the centerline of streets
Building height / ft / Average height of buildings in the neighborhood
Canopy direction / degree (°) / The direction of the canopy in degrees from 0-90

Table 3.2: Meteorological observations
Measurement / Unit
Temperature / °C
Pressure / mbar
Relative Humidity / %
Solar Radiation / W/m²
Precipitation / mm
Wind Speed / km/h
Wind Direction / degree

3.5.3 Weather Stations in the United States
We sample the weather stations located in the United States from the Online Climate Data Directory of the National Oceanic and Atmospheric Administration (NOAA) and choose the stations which actively measured meteorological observations during 2015. We choose two geographically close but meteorologically diverse groups of stations: the Western and Southeastern states. Figure 3.5 shows the distributions of the land-based weather stations and their connectivity. Since the stations are not synchronized and have different timestamps for their observations, we aggregate the time series hourly.

Figure 3.4: Sampled regions in the Southern California area. (Left) Los Angeles (274 nodes) and (Right) San Diego (282 nodes).
Figure 3.5: Weather stations in (left) western and (right) southeastern states in the United States and the k-NN graph.

Table 3.3: Description of climate data (* denotes time-varying features)
Feature / Description
Timestamp / Every 60 minutes
T2 (3D)* / Near-surface (2 meter) air temperature (unit: K)
XLAT (3D) / Latitude (unit: degree north)
XLONG (3D) / Longitude (unit: degree east)
ALBEDO (3D)* / Albedo (unit: -)
FRC_URB2D (3D) / Impervious fraction, urban fraction (unit: -)
VEGFRA (3D) / Vegetation fraction (unit: -)
LU_INDEX (3D) / Land use classification (unit: -) (time invariant); paved indices: 31: Low-intensity residential, 32: High-intensity residential, 33: Commercial/Industrial
U (4D)* / Wind vector, x-wind component (unit: m s⁻¹) (west to east)
V (4D)* / Wind vector, y-wind component (unit: m s⁻¹) (south to north)
RAINNC (3D)* / Accumulated total grid-scale precipitation (unit: mm)
SMOIS (4D)* / Soil moisture (unit: m³ m⁻³)
PBLH (3D)* / Boundary layer height (unit: m)
RH2 (3D)* / 2-meter relative humidity (unit: -)
Q2 (3D)* / 2-meter specific humidity (unit: kg kg⁻¹)
PSFC (3D)* / Surface pressure (unit: Pa)

3.5.4 Air Quality and Extreme Weather Datasets
• AQI-CO (Berman, 2017): There are multiple pollutants in the dataset, and we choose carbon monoxide (CO) ppm as the target pollutant in this paper. We select sensors located between latitude (26, 33) and longitude (115, 125) (the eastern region of China).
In this region, we sample multiple multivariate time series whose length is larger than 12 steps (12 hours) for multiple meta-tasks. There are around 60 working sensors, and the exact number of working sensors varies over different tasks. Figure 3.6 shows the locations of the selected AQI sensors.
• ExtremeWeather: We select the data in the year 1984 from the extreme weather dataset in Racah et al. (2017a). The data is an array of shape (1460, 16, 768, 1152), containing 1460 frames (4 per day, 365 days in the year). The 16 channels in each frame correspond to 16 spatiotemporal variables. Each channel has a size of 768×1152, corresponding to one measurement per 25 square km on earth. For each frame, the dataset provides fewer than or equal to 15 bounding boxes, each of which labels the region affected by an extreme weather event and one of the four types of extreme weather: (1) tropical depression, (2) tropical cyclone, (3) extratropical cyclone, (4) atmospheric river. In the single-feature setting, we only utilize the channel of surface temperature (TS). Figure 3.7 shows the dynamics of extreme weather and the sampled nodes.

Figure 3.6: Sensor locations in the AQI-CO dataset. We show sensors/stations as blue nodes and edges of k-NN graphs as red lines. Borders of provinces/states are shown in grey.
Figure 3.7: Visualization of the first 5 frames (time 0 to 4) of one extended sequence in the ExtremeWeather dataset. Dots represent the sampled points. Areas with high surface temperature (TS) are colored green and areas with low TS are colored purple.

3.5.5 Global Climate Network
We finally evaluate our model on graph signal prediction tasks with real-world observations from the climatology network³ (Defferrard et al., 2020). We first sample time series from the entire set of sensors to construct more localized graph signals. The number of available sensors depends on the type of measurement, and we construct a distance-based graph structure with a kNN algorithm where k = 2. Each sensor has 4 different daily measurements over 5 years from 2010 to 2014 (series length 1826), and we use them for our experiments (Table 3.4).
• TMAX: Maximum temperature (tenths of degrees C)
• TMIN: Minimum temperature (tenths of degrees C)
• SNOW: Snowfall (mm)
• PRCP: Precipitation (tenths of mm)
It is worth noting that the number of working sensors for each measurement is highly different. Figure 3.8 visualizes the sensors across the western and eastern states in the USA for TMAX and SNOW. While daily temperature observations are spatially densely available, the snowfall observations are highly sparse.

Table 3.4: Information on sensor networks.
Western / TMAX / TMIN / SNOW / PRCP
# of sensors / 434 / 423 / 31 / 319
# of edges / 1142 / 1110 / 76 / 862
Eastern / TMAX / TMIN / SNOW / PRCP
# of sensors / 244 / 248 / 114 / 323
# of edges / 632 / 636 / 298 / 844

Figure 3.8: Working sensors for TMAX and SNOW located in (a, b) western and (c, d) eastern states in the USA.

³ Global Historical Climatology Network (GHCN) provided by the National Oceanic and Atmospheric Administration (NOAA).

Chapter 4
Data Quality Inference based on a Physical Rule for Spatiotemporal Temperature Forecasting

In this work, we propose a novel solution that can automatically infer the data quality levels of different sources based on physics-inspired local variations of spatiotemporal signals, without explicit labels.
Furthermore, we integrate the estimated data quality levels with graph convolutional networks to predict temperature at weather sensors located in Los Angeles.

4.1 Model
In this section, we give the details of the proposed model, which is able to exploit the data quality defined in Section 3.2 for practical tasks. First, we demonstrate how the data quality network (DQN) is combined with recurrent neural networks (LSTM) to handle time-varying signals. We then describe how this model is trained over the graph signals from all vertices.

4.1.1 Data Quality Network
In Section 3.2, we found that the local variations around all vertices can be computed once the graph signals X are given. Using the local variation matrix L, the data quality level at each vertex, s_i, can be represented as a function of L_i with parameters θ (see Equation 3.6).

Figure 4.1: Architecture of DQ-LSTM, which consists of GCN followed by the data quality network (DQN) and LSTM. GCN are able to extract localized features from the given graph signals (different colors correspond to different signals), and DQN computes the data quality of each vertex. N is the number of vertices; the loss function on each vertex is weighted by its quality level s_i. Note that the dot-patterned circles denote the current vertex i.

While the function q is not explicitly provided, we can parameterize it and learn the parameters θ ∈ R^M from a given dataset. One straightforward parameterization is based on fully connected neural networks. Given a set of features, neural networks are efficient at finding nonlinear relations among the features. Furthermore, the parameters in the neural networks can easily be learned by optimizing a loss function defined for one's own purpose. Thus, we use a single-layer neural network followed by an activation function σ_L(·) to transfer the local variations to the data quality level. Note that a multi-layer GCN can be used between the graph signals and DQN to extract convolved signals as well as to increase the size of the convolution filters (see Equations 3.8 and 3.9). These multi-layer neural networks are efficient at learning nonlinear interactions of K-hop nodes which are not easily learnable by the graph polynomial filters of Sandryhaila & Moura (2013).

Long Short-Term Memory
Recurrent neural networks (RNNs) are especially powerful for extracting latent patterns in time series, which have inherent dependencies between not only adjacent but also distant data points. Among the various existing RNNs, we use long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) to handle the temporal signals on vertices for a regression task. We feed finite lengths k of sequential signals into LSTM as an input series to predict the observation at the next time step. The predicted value is compared to the true value, and all parameters in LSTM are updated via backpropagation through time.

DQ-LSTM
Figure 4.1 illustrates how the data quality network with LSTM (DQ-LSTM) is composed of submodules. Time-varying graph signals on N vertices can be represented as a tensor, X ∈ R^{N×M×T}, where T is the total length of the signals. First, the time-varying graph signals for each vertex are segmented to be fed into LSTM. For example, X(i, :, h:h+k−1) is one of the segmented signals on vertex i starting at t = h. Second, the graph signals for all vertices at the last time stamp, X(:, :, h+k−1), are used as an input of GCN followed by DQN.
Hence, we consider the data quality level by looking at the local variations of the last signals, and the estimated quality level s_i is used to assign a weight on the loss function defined on vertex i. We use the mean squared error loss function. For each vertex i, DQ-LSTM repeatedly reads inputs, predicts outputs, and updates parameters as many times as the number of segmented length-k time series:

L_i = (1/n_i) Σ_{j=1}^{n_i} s_i ‖X̂(i, :, k+j−1) − X(i, :, k+j−1)‖² + λ‖θ‖²,    (4.1)

where n_i is the number of segmented series on vertex i and X̂(i, :, k+j−1) is the predicted value from a fully connected layer (FC) which reduces the dimension of the output vector from LSTM. L2 regularization is used to prevent overfitting. The total loss function over all vertices is then

L = (1/N) Σ_{i=1}^{N} L_i.    (4.2)

4.2 Experiments
In this section, we evaluate DQ-LSTM on real-world climate datasets. In the main set of experiments, we evaluate the mean absolute error (MAE) of the predictions produced by DQ-LSTM over all weather stations. In addition, we analyze the data quality levels estimated by DQ-LSTM. We use real-world meteorological measurements from two commercial weather services, Weather Underground (WU) and WeatherBug (WB). For more details, see Section 3.5.

4.2.1 Graph Generation
Since structural information between pairs of stations is not directly known, we need to construct a graph of the weather stations. In general graphs, two nodes can be interpreted as similar if they are connected. Thus, as mentioned in Section 3.2, we can compute a distance between two nodes in the feature space. A naive approach to defining the distance is to use only the geolocation features, Latitude and Longitude. However, this might be inappropriate because other features can be significantly different even if two stations are fairly close. For example, the distance between stations in Elysian Park and Downtown LA is less than 2 miles; however, their territorial characteristics are significantly different. Furthermore, the differing characteristics (e.g., Tree fraction or Impervious fraction) can affect weather observations (especially temperature, due to the urban heat island effect). Thus, considering only the physical distance may improperly approximate the meteorological similarity between two nodes.

To alleviate this issue, we assume that all static features are equally important. This is a reasonable assumption because we do not know which feature is more important, since each feature can affect the weather measurements. Thus, we normalize all spatial features. In this experiment, we use the Gaussian kernel e^{−γ‖V_i − V_j‖²} with γ = 0.2 and 0.6 for WU and WB, respectively, and set weights less than 0.9 to zero (i.e., disconnected) such that the average number of node neighbors is around 10.
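This construction amounts to a few lines of Python; the snippet below is a sketch under the stated settings (normalized features, thresholded Gaussian kernel), with a toy feature matrix standing in for the real stations.

```python
import numpy as np

def build_adjacency(V, gamma, threshold=0.9):
    """W_ij = exp(-gamma * ||V_i - V_j||^2), pruned below the threshold."""
    sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    W = np.exp(-gamma * sq)
    W[W < threshold] = 0.0        # disconnect dissimilar stations
    np.fill_diagonal(W, 0.0)      # no self-loops
    return W

V = np.random.rand(30, 11)         # 30 stations, 11 normalized static features
W = build_adjacency(V, gamma=0.2)  # gamma = 0.2 (WU) or 0.6 (WB)
```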
4.2.2 Baselines
We compare our approach to well-studied baselines for time-series forecasting. First, we compare against a stochastic process, the autoregressive (AR) model, which estimates future values based on past values. Second, we compare against a simple LSTM. This model is expected to infer mixed dependencies among the input multivariate signals and provide a reference error for neural network based models. Lastly, we use graph convolutional networks (Kipf & Welling, 2017a), which are also able to infer the data quality level from a given dataset. We test a single-layer GCN (K = 1) and a two-layer GCN (K = 2).

4.2.3 Experimental Setting
Since DQ-LSTM and our baselines depend on previous observations, we set a common lag length of k = 10. For the deep recurrent models, the k previous observations are sequentially input to predict the next values. All deep recurrent models have the same 50 hidden units and one fully connected layer (R^{50×1}) that provides the target output. For GCN-LSTM and DQ-LSTM, we evaluate different numbers of GCN layers (K). We set the dimensions of the first (K = 1) and second (K = 2) hidden layers of GCN to 10 and 5, respectively, based on cross-validation. The final layer always provides a set of scalars for every vertex, and we set λ = 0.05 for the L2 regularization of the final layer. We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.001 and a mean squared error objective.

We split each dataset into three subsets: training, validation, and testing sets. The first 60% of observations are used for training, the next 20% is used to tune hyperparameters (validation), and the remaining 20% is used to report error results (test). Among the measurements provided, Temperature is used as the target measurement, i.e., the output of LSTM, and previous time-step observations, including Temperature, are used as input signals. We report average scores over 20 trials with random initializations.

Table 4.1: Forecasting mean absolute error (MAE) (°C); standard deviations in parentheses
Dataset / AR / LSTM / GCN-LSTM (K=1) / GCN-LSTM (K=2) / DQ-LSTM (K=0) / DQ-LSTM (K=1)
WU7 / 0.5342 / 0.5823 (0.0656) / 0.5152 (0.0081) / 0.5073 (0.0261) / 0.5096 (0.0152) / 0.4788 (0.0111)
WU8 / 0.5862 / 0.5911 (0.0221) / 0.5356 (0.0398) / 0.5151 (0.0272) / 0.5087 (0.0117) / 0.4856 (0.0086)
WB7 / 0.4812 / 0.4725 (0.0277) / 0.4687 (0.0348) / 0.4411 (0.0321) / 0.4588 (0.0148) / 0.4108 (0.0129)
WB8 / 0.5133 / 0.5435 (0.0376) / 0.5412 (0.0483) / 0.5296 (0.0164) / 0.4602 (0.0440) / 0.4574 (0.0178)

4.3 Results and Discussion
4.3.1 Forecasting Experiment
Experimental results are summarized in Table 4.1. We report the temperature forecasting mean absolute error (MAE) of our DQ-LSTM model with standard deviations. Meteorological measurements for July and August are denoted by 7 and 8, and K indicates the number of GCN layers. Overall, the models that account for graph structures outperform AR and LSTM. While the node connectivities and weights depend on our distance function (Section 4.2.1), Table 4.1 clearly shows that knowing the neighboring signals of a given node helps predict its next value.

Although GCN are able to transfer a given signal of a vertex to a latent representation that is more compact and expressive, GCN have difficulty learning a mapping from neighboring signals to the data quality level directly, unlike DQ-LSTM, which pre-transfers the signals to local variations explicitly.
4.3.2 Node Embedding and Low-Quality Detection As DQ-LSTM can be combined with GCN, it is possible to represent each node as an embedding obtained from an output of GCN. Embeddings from deep neural networks are especially interesting since they can capture distances between nodes. These distances are not explicitly provided but inherently present in data. Once the embeddings are extracted, they can be used for further tasks, such as classication and clustering (Grover & Leskovec, 2016; Kipf & Welling, 2017a). Moreover, since the embeddings have low dimensional rep- resentations, it is more ecient to visualize the nodes by manifold learning methods, such as t-SNE (Maaten & Hinton, 2008). Visualization with spatiotemporal signals is especially eective to show how similarities between nodes change. Figure 4.2 shows varying embedding distributions over time. The green dots are neighbors of a red dot, and they are closely distributed to form a cluster. There are two factors that aect embeddings: temporal signals and spatial structure. Ideally, connected nodes have observed similar signals and thus, they are mapped closely in the embedding space. However, if one nodev i measures a fairly dierent value compared to other connected values, the node's embedding will also be far from its neighbors. Furthermore, if one node v i is connected to a subset of a group of nodesfgg as well as an additional node v j = 2fgg, v i would be aected 47 by the subset and v j = 2fgg simultaneously. For example, if the signals observed at v i are similar to the signals atfgg, the embedding of v i is still close to those offgg. However, if the signals ofv i are close to that ofv j , or the weight ofe(i;j) is signicantly high,v i will be far away fromfgg in the embedding space. Such intuition of the embedding distribution can be used to nd potentially low-quality nodes, which we analyze next. Figure 4.2b shows that a nodev 25 is aected by its neighboring green nodes and v 4 that is not included in the cluster (green dots). The red dot v 22 is connected with the green dots (v 19 ;v 20 ;v 21 ;v 23 ;v 25 ;v 29 ). Since these nodes have similar spatial features and are connected, the nodes are expected to have similar observations. At t = 0, the distribution of the nodes seems like a cluster. However, v 25 is far away from other green nodes and the red node at t = 4. There are two possible reasons. First, observations of v 25 at t = 4 may be too dierent with those of other green nodes and the red node. Second, observations at v 4 , which is only connected to v 25 (not to other green nodes and the red node), might be too noisy. The rst case violates our assumption (See Section 3.2, such that the observation at v 25 should be similar to those of other green nodes.); therefore, the observations of v 25 at t = 4 might be noisy or not reliable. In the second case, the observations of v 4 at t = 4 might be noisy. Thus, v 25 and v 4 are candidates of low-quality nodes. 4.3.3 Data Quality Analysis Since we do not have explicit labels for the data quality levels, it is not straightforward to directly evaluate the data quality inferred from DQ-LSTM. Instead, we can verify the inferred data quality by studying high and low quality examples from embedding distributions and associated meterological observations. Table 4.2 shows meterological observations associated with the previously discussed embedding distribution att = 4 (Figure 4.2b). 
The values x_25 at v_25 are the same as x_20, x_21, and x_23; however, x_25 is quite different from x_19, x_22, and x_29. Moreover, the edge weights between v_25 and the other green nodes are not as large as the weights between the other green nodes (v_25 is much closer to the ocean than the other green nodes). As a result, it is not easy for v_25 to be close to the green nodes (v_20, v_21, and v_23 pull v_25, while v_19, v_22, and v_29 push v_25). On the other hand, since x_25 is similar to x_4 and W_{25,4} is large, v_25 is more likely to be close to v_4, as in Figure 4.2b. Note that v_4 has very different geological features compared to the features of the green nodes; thus, v_4 is not connected to v_22 or the other green nodes, except v_25. Consequently, v_25 is the bridge node between v_4 and the cluster of v_22.

Figure 4.2: t-SNE visualization of the outputs of GCN in DQ-LSTM at (a) t = 0, (b) t = 4, and (c) t = 8. The red dot denotes the reference node, and the green dots are the adjacent nodes of the red dot. (a), (b), and (c) illustrate how the embeddings of the spatiotemporal signals change. At t = 4, the node v_25 is relatively far from the other green nodes because it is connected with the node v_4, which is not a neighbor of the red dot.

Table 4.2: Observations and inferred data quality level (DQL)
Node / Temp / Press / Humid / DQL
4 / 28.4 / 1013.1 / 57.7 / 0.022
19 / 34.5 / 1012.2 / 35.8 / 0.038
20 / 28.1 / 1011.5 / 58.7 / 0.039
21 / 28.1 / 1011.5 / 58.7 / 0.039
22 / 38.1 / 1006.2 / 32.8 / 0.039
23 / 28.1 / 1011.5 / 58.7 / 0.038
25 / 28.1 / 1011.5 / 58.7 / 0.030
29 / 35.3 / 1014.3 / 40.0 / 0.039

Since a bridge node is affected by two (or more) different groups of nodes simultaneously, the quality level at the bridge node is more susceptible than those of other nodes. However, this does not directly mean that a bridge node must have a lower data quality level. As Table 4.2 shows, s_4 has the lowest data quality level, which comes from the discrepancy between its neighboring signals and x_4. Since v_4 is connected to v_25, v_4 pulls v_25, and s_4 lowers s_25 relative to the data quality levels of the other green nodes, which are correctly inferred.

Chapter 5
Graph Networks with Physics-aware Knowledge Informed in Latent Space

While physics conveys knowledge of nature built from an interplay between observations and theory, it has been considered less important for modeling deep neural networks. Despite the usefulness of physical rules, it is particularly challenging to leverage this knowledge for sparse data, since most physics equations are well defined on continuous and dense spaces. In addition, it is even harder to inform a model with the equations if the observations are not fully governed by the given physical knowledge. In this work, we present a novel architecture to incorporate physics or domain knowledge, given in the form of partial differential equations (PDEs), on sparse observations by utilizing graph structure. Moreover, we leverage the representation power of deep learning by informing the knowledge in latent space. We demonstrate that climate prediction tasks are significantly improved, validating the effectiveness and importance of the proposed model.

5.1 Physics-aware Graph Networks
As deep learning models are successful at modeling complex behaviors and extracting abstract features from data, it is natural to focus on how data-driven modeling can solve practical problems in physics and engineering fields. In this section, we describe how domain knowledge expressed in physics can be incorporated into the graph networks framework.
Table 5.1: Examples of static equations in graph networks
Mapping / Equation / Physics example
node → edge / e_ij = φ^e(v_i, v_j) / −∇φ = E (Electric field)
edge → node / v_i = φ^v(ē_i) = (div e)_i / ∇·E = ρ/ε₀ (Maxwell's eqn.)
node → node / v_i = φ^v(v_i, {v_j:(i,j)∈E}) = (Δv)_i = 0 / Δφ = 0 (Laplace's eqn.)

5.1.1 Static Physics
Many fields in physics dealing with static properties, such as electrostatics, magnetostatics, or hydrostatics, describe physical phenomena at rest. Among the various phenomena, it is easy to express differentiable physics rules in discrete forms on a graph with the operators from Section 3.3. For instance, the Poisson equation (∇²φ = −ρ/ε₀) in electrostatics is realized as a simple matrix multiplication of the graph Laplacian with a vertex function. Table 5.1 provides some differential formulas in electrostatics and how the corresponding updating functions are defined in graph networks.

5.1.2 Dynamic Physics
More practical equations are written in dynamic forms, which describe how a given physical quantity changes in a given region over time. GN can be regarded as a module that updates a graph state, including the attributes of nodes, edges, and a whole graph:

G' = GN(G),    (5.1)

where G' is the updated graph state. Dynamic physics formulas are written as a function of time and spatial derivatives:

f(∂u/∂t, ..., ∂^M u/∂t^M, ∂u/∂x, ..., ∂^N u/∂x^N) = 0,    (5.2)

where u is a spatiotemporally varying physical quantity and x is the direction along which u is defined. M and N denote the highest orders of the time and spatial derivatives, respectively. Under the state-updating view of Equation 5.1, any type of PDE written as in Equation 5.2 can be represented in the form of finite differences. Table 5.2 provides examples of dynamic physics; u̇ and ü are the first- and second-order time derivatives, respectively.

Table 5.2: Examples of dynamic equations in graph networks
Equation / Physics example
v'_i = v_i + αφ^v(v_i, {v_j:(i,j)∈E}) = v_i + α(Δv)_i / q̇ = αΔq (Diffusion eqn.)
v''_i = 2v'_i − v_i + c²φ^v(v'_i, {v'_j:(i,j)∈E}) = 2v'_i − v_i + c²(Δv')_i / q̈ = c²Δq (Wave eqn.)
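The finite-difference updates in Table 5.2 can be realized directly with the graph Laplacian; the sketch below assumes the spatial Laplacian is discretized as the negative graph Laplacian, Δ ≈ −(D − W), so that the diffusion update smooths the signal, and the values of α and c are illustrative.

```python
import torch

def laplacian(W):
    """Graph Laplacian L = D - W; -L plays the role of the spatial Laplacian."""
    return torch.diag(W.sum(1)) - W

def diffusion_step(v, W, alpha=0.1):
    """One explicit Euler step of dq/dt = alpha * Delta q on a graph."""
    return v - alpha * laplacian(W) @ v

def wave_step(v_prev, v_curr, W, c=0.5):
    """One leapfrog step of d2q/dt2 = c^2 * Delta q on a graph."""
    return 2 * v_curr - v_prev - (c ** 2) * laplacian(W) @ v_curr

W = torch.rand(6, 6); W = (W + W.T) / 2; W.fill_diagonal_(0)
v0 = torch.randn(6)
v1 = diffusion_step(v0, W)     # diffusion row of Table 5.2
v2 = wave_step(v0, v1, W)      # wave row of Table 5.2
```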
5.1.3 Physics in Latent Space
In Section 3.3, we described how the differential operators are implemented in a GN module. However, modeling complicated real-world problems with the differential operators alone is hardly practical, because it is only possible when all the physics equations governing the observed phenomena are explicitly known. For example, although we are aware that a number of physics equations are involved in climate observations, it is almost infeasible to include all the equations required to model the observations. Thus, it is necessary to utilize the learnable parameters in GN to fill in the missing dynamics that are not described by the given equations.

There is another advantage to utilizing learnable parameters. A number of unknown parameters need to be pre-defined to specify the physics equations, and these parameters can be inferred by the learnable ones. For example, while we may know that an input signal has a wave property, the speed of the waves (c in Table 5.2) must be given to fully describe the wave equation. The situation becomes even worse when multiple input signals are involved, since each signal is governed by different parameters in the same kind of equation. While both temperature and surface pressure are continuous and diffusive, they should have different diffusion coefficients (α in Table 5.2) in the same diffusion equation. To address this issue, we can transform the input signals to a latent space and use one equation in the latent space, instead of imposing multiple equations on the input signals separately. The parameters in the encoder then make the different signals follow the equation differently. We formalize how this idea is implemented as follows.

Figure 5.1: Recurrent architecture to incorporate a physics equation into GN. The blue blocks have learnable parameters and the orange blocks are objective functions. ⊙ is a concatenation operator, and the middle core block can be repeated as many times as the required number of time steps (T).

Forward/Recurrent computation
Figure 5.1 shows how the desired physics knowledge is integrated with the graph networks. A set of graph attributes {v, e, u} is fed into an encoder which transforms the attributes of nodes (v), edges (e), and a whole graph (u) into latent spaces:

ṽ, ẽ, ũ = Encoder(v, e, u).    (5.3)

After the encoder, the encoded graph H = {ṽ, ẽ, ũ} is repeatedly updated within the core block as many times as the required number of time steps T. At each step, H is updated to H', which denotes the next state of the encoded graph:

H' = GN(H).    (5.4)

Finally, the sequentially updated attributes are transformed back to the original spaces by a decoder:

v', e', u' = Decoder(ṽ', ẽ', ũ').    (5.5)

Objective functions
There are two types of objective functions in this architecture: the physics knowledge and the supervised objective. First, we define physics-informed constraints, which take the form of the equations in Tables 5.1 and 5.2 depending on the given physics knowledge (and may even be mixed):

f^s_phy(H'_t),   f^d_phy(H'_t, ..., H'_{t+M}),    (5.6)
L_phy = Σ_t f^s_phy(H'_t) + f^d_phy(H'_t, ..., H'_{t+M}),    (5.7)

where f^s_phy(H'_t) and f^d_phy(H'_t, ..., H'_{t+M}) are the static and dynamic physics-informed quantities, respectively. For example, we can impose a gradient constraint or the diffusion equation on the node/edge latent representations as follows:

f^s_phy(H'_t) = ‖ẽ'_t − ∇ṽ'_t‖²,
f^d_phy(H'_t, H'_{t+1}) = ‖ṽ'_{t+1} − ṽ'_t − α∇²ṽ'_t‖².

Second, there is the supervised loss function between the predicted graph, Ĝ', and the target graph, G'. This loss function is constructed based on the task, e.g., the cross-entropy or the mean squared error (MSE). Finally, the total objective function is a sum of the two terms:

L = L_sup + λL_phy,    (5.8)

where λ controls the importance of the physics term.

5.2 Experiment
In this section, we evaluate PaGN on real-world data: hourly simulated climate observations for 16 days on the Southern California region (Zhang et al., 2018). See Section 3.5.2 for more details.

5.2.1 PaGN Architecture
As explained in Section 5.1, PaGN consists of three modules: a graph encoder, a GN block, and a graph decoder (Figure 5.1). The encoder contains two feed-forward networks applied to the node and edge features, respectively. By passing through the encoder, the features are transformed to the latent space (H) where we impose the physics equations. In the GN block, the node/edge/graph features are updated by the GN algorithm described in Section 3.4. The latent graph states, H and H', indicate the hidden states of the current and next observations. For the physics constraint, we informed the model with the diffusion and wave equations in Table 5.2, which describe the behavior of continuous physical quantities.
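A schematic sketch of this encode, process, decode loop with the latent diffusion penalty of Equation 5.7 follows; a GRU cell with graph aggregation is an illustrative stand-in for the full GN block, and all sizes and names are assumptions rather than the exact PaGN implementation.

```python
import torch
import torch.nn as nn

class PaGNSketch(nn.Module):
    """Encoder -> recurrent core -> decoder, following Eqs. (5.3)-(5.5)."""
    def __init__(self, d_in, hidden=32):
        super().__init__()
        self.encoder = nn.Linear(d_in, hidden)
        self.core = nn.GRUCell(hidden, hidden)   # stand-in for the GN block
        self.decoder = nn.Linear(hidden, 1)

    def forward(self, x, W, T):
        h = torch.tanh(self.encoder(x))          # Eq. (5.3): lift to latent space
        latents, preds = [h], []
        for _ in range(T):                       # Eq. (5.4): H' = GN(H), T times
            h = self.core(W @ h, h)              # aggregate neighbors, update state
            latents.append(h)
            preds.append(self.decoder(h))        # Eq. (5.5): back to observations
        return torch.stack(preds), latents

def diffusion_penalty(latents, W, alpha=0.1):
    """Latent diffusion residual, with the Laplacian discretized as -(D - W)."""
    L = torch.diag(W.sum(1)) - W
    return sum(((h1 - h0 + alpha * L @ h0) ** 2).sum()
               for h0, h1 in zip(latents[:-1], latents[1:]))

N, d_in, T = 12, 9, 10
W = torch.rand(N, N); x = torch.randn(N, d_in)
preds, latents = PaGNSketch(d_in)(x, W, T)
targets = torch.randn(T, N, 1)
loss = ((preds - targets) ** 2).mean() + 0.01 * diffusion_penalty(latents, W)
```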
5.2 Experiment

In this section, we evaluate PaGN on real-world data: hourly simulated climate observations over 16 days for the Southern California region (Zhang et al., 2018). See Section 3.5.2 for details.

5.2.1 PaGN Architecture

As explained in Section 5.1, PaGN consists of three modules: a graph encoder, a GN block, and a graph decoder (Figure 5.1). The encoder contains two feed-forward networks, $\phi_v$ and $\phi_e$, applied to the node and edge features, respectively. By passing through the encoder, the features are transformed to the latent space ($\mathcal{H}$) where we impose the physics equations.

In the GN block, the node/edge/graph features are updated by the GN algorithm described in Section 3.4. The latent graph states $\mathcal{H}$ and $\mathcal{H}'$ represent the hidden states of the current and next observations. For the physics constraint, we imposed the diffusion and wave equations from Table 5.2, which describe the behavior of continuous physical quantities. As most climate observations vary continuously, the diffusion equation, as a part of the continuity equation, is one inductive bias that should be considered for modeling. In addition, the wave equation is useful for describing atmospheric phenomena, especially one-solar-day harmonics (e.g., atmospheric tides). Note that the physics equations are applied not to the input observations directly but to the latent representations.

The state-updating process is repeated at least as many times as the order of the equation to provide the finite-difference equation. For multistep predictions, the recurrent module is repeated as many times as the number of predictions, and the physics equation is likewise applied multiple times. Finally, the decoder takes $\mathcal{H}'$ as input and returns the next predictions. The total loss function of PaGN with the diffusion equation is

$$\mathcal{L} = \sum_{i=1}^{T} \|\hat y'_i - y'_i\|^2 + \lambda \sum_{i=1}^{T} \|\tilde v'_i - \tilde v_{i-1} - \alpha \nabla^2 \tilde v_{i-1}\|^2, \qquad (5.9)$$

where $y'$ is a vector of the target observations (i.e., node vectors) and $\alpha$ adjusts the diffusivity of the latent representations, found through cross-validation. The equation term can be replaced by other equations as appropriate.

5.2.2 Experimental Settings

In our experiments, we used air temperature as the target observation, and the other 9 observations were used as input. We first evaluated our model on one-step and multistep prediction tasks over two different areas with the mean squared error metric. For both regions, we trained the model with input observations over 10 timesteps ($t-10:t-1$) and predicted targets from $t-9$ to $t$. The first 65% of the total length was used as a training set, and the remaining series was split into validation (10%) and test (25%) sets. We explored several baselines: MLP, LSTM, and GN-only, which ignores the physics constraint in PaGN. We also compared GN-skip, which connects $\mathcal{H}$ and $\mathcal{H}'$ with a skip connection (He et al., 2016) without the physics constraint.

5.2.3 One-step Prediction

Table 5.3: One-step prediction error (MSE).

| Model | LA area | SD area |
| MLP | 0.8140±0.0651 | 0.7735±0.0539 |
| LSTM | 0.7855±0.0644 | 0.8123±0.0875 |
| GN-only | 0.5951±0.0517 | 0.6947±0.1859 |
| GN-skip | 0.5906±0.0620 | 0.6456±0.1499 |
| PaGN (wave) | 0.5366±0.0631 | 0.6413±0.1549 |
| PaGN (diff) | 0.5289±0.0405 | 0.5746±0.0471 |

Table 5.3 shows the prediction error of the baselines and PaGN on the two areas. MLP and LSTM are shared over all stations, and they are outperformed by the models that leverage the given graph structure. This implies that neighboring information is significantly helpful for inferring a station's own state, which is intuitive since climate behavior is spatiotemporally continuous. Among the graph-based models, PaGN (diff) gives the lowest MSEs, validating that the diffusive property provides a strong inductive bias for latent representation learning. Note that the standard deviations of PaGN (diff) are significantly smaller than those of the other baselines, implying that the integrated physics knowledge stabilizes the optimization process by introducing an additional objective.

5.2.4 Multistep Prediction

To evaluate the effectiveness of the state-wise regularization more carefully, we conducted a multistep prediction task (10-step forecast horizon).

Table 5.4: Multistep prediction error (MSE).

| Model | LA area | SD area |
| LSTM | 1.9022±0.2078 | 1.2489±0.2295 |
| GN-only | 1.6137±0.1128 | 1.5532±0.2023 |
| GN-skip | 1.5429±0.0932 | 1.4423±0.1622 |
| PaGN (diff) | 1.4656±0.0474 | 1.0999±0.0435 |
For this task, the recurrent modules are modified to predict the input observations as well, and the predictions are fed back into the model for future timesteps. While the models with a recurrent module are able to predict a few more steps reasonably, two observations deserve attention. First, the results imply that utilizing neighboring information is important: the GN-only model shows similar or better MSEs than LSTM on the multistep tasks, even though its recurrent module is simpler than LSTM's. Second, we found that the diffusion equation in PaGN yields stable state transitions, producing slowly varying latent states that are particularly desirable for climate forecasting. Note that the skip connection in GN-skip can also restrict rapid changes of $\mathcal{H}$; however, its parameters must be optimized more carefully to learn the residual term in $\mathcal{H}' = \mathcal{H} + \text{GN}(\mathcal{H})$ properly.

5.2.5 Effectiveness of the Physics Constraint

One of the benefits of physics-aware learning is data efficiency. We explore how much the physics constraint helps by testing whether PaGN can be trained well when the amount of data for the supervised objective is limited, using the one-step prediction task. We randomly sampled training steps that were used to optimize the total loss function (Equation 5.9), while the remaining unsampled steps were used only to minimize the physics constraint:

$$\mathcal{L} = \mathcal{L}^i_{\text{sup}} + \lambda \mathcal{L}^i_{\text{phy}} \ \text{if $i$ is a sampled step}, \qquad \mathcal{L} = \lambda \mathcal{L}^i_{\text{phy}} \ \text{otherwise}.$$

Figure 5.2: (a) MSE on sampled data: MSEs of PaGN are almost as good as those of GN-only (gray lines) despite fewer training examples. (b) MSE with different $\lambda$: how the prediction performance depends on the weight of the physics term.

We found that the diffusion equation benefits PaGN optimization even when the target observations are only partially available (Figure 5.2a). Although the overall performance of PaGN degrades as fewer sampled data are used, the errors do not deviate far from those of GN-only. GN-only is even outperformed by PaGN when only 70% of the training data are used with the state-wise constraint.
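A sketch of the semi-supervised objective above, reusing the `diffusion_residual` helper from the earlier snippet; `labeled_mask[t]` marks the randomly sampled steps that contribute a supervised term (all names are illustrative):

```python
def semisupervised_loss(preds, targets, latents, L, alpha, lam, labeled_mask):
    """Supervised term only on sampled (labeled) steps; the physics
    constraint is applied on every step, labeled or not."""
    l_sup = sum(((preds[t] - targets[t]) ** 2).mean()
                for t in range(len(preds)) if labeled_mask[t])
    l_phy = sum(diffusion_residual(latents[t], latents[t + 1], L, alpha)
                for t in range(len(latents) - 1))
    return l_sup + lam * l_phy
```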
5.2.6 Importance of the Physics Constraint

To study the importance of the physics term, we trained PaGN with different values of $\lambda$ controlling its weight. While Tables 5.3 and 5.4 show that the physics term is substantially helpful, the term should not be dominant (see Figure 5.2b) but tuned properly. This is intuitive, since the term only provides partial knowledge (diffusive input signals) that reshapes the loss surface to make the parameters more stable for predicting the next signals, instead of governing the dynamics explicitly. Scaling down the physics term is similar to what Sabour et al. (2017) did to keep the reconstruction error from dominating the margin loss while still helping the optimization process.

Table 5.5: One-step prediction error with different constraints (MSE).

| Model | LA area | SD area |
| PaGN (rand) | 1.1406 | 0.7073 |
| PaGN (diff+wave) | 0.5624 | 0.6724 |

We also report MSEs for PaGN (rand), defined by randomly sampling $(\alpha, \beta) \in [-2.5, 2.5]$ in the constraint $\|\ddot v + \alpha \dot v + \beta v - c\Delta v\|^2$, and for PaGN (diff+wave), which superposes the two equations. Table 5.5 shows that the random equation significantly degrades the overall prediction quality. Note that a simple superposition of two equations does not always guarantee lower error, even if each equation is helpful separately. When the two equations are non-linearly connected in the unknown fully governing equation, the superposition cannot provide a meaningful inductive bias. The results demonstrate that the physics term is not merely ad-hoc regularization but a useful inductive bias when properly defined.

Chapter 6
Physics-aware Difference Graph Networks for Sparsely-Observed Dynamics

In the previous chapter, we derived how an explicit equation can be incorporated into graph networks. This chapter focuses on a property common to physics-related partial differential equations: spatial derivatives. We propose a novel architecture, Physics-aware Difference Graph Networks (PA-DGN), which exploits neighboring information to learn finite differences inspired by physics equations. PA-DGN leverages data-driven end-to-end learning to discover the underlying dynamical relations between the spatial and temporal differences in given sequential observations. We demonstrate the superiority of PA-DGN in the approximation of directional derivatives and the prediction of graph signals on synthetic data and on real-world climate observations from weather stations.

6.1 Physics-aware Difference Graph Network

In this section, we introduce the building module used to learn spatial differences of graph signals and describe how the module is used to predict signals in a physical system.

6.1.1 Difference Operators on Graphs

As approximations of derivatives in a continuous domain, difference operators play a core role in computing numerical solutions of (continuous) differential equations. Since it is hard to derive closed-form expressions of derivatives from real-world data, difference operators have been used as alternative tools to describe and solve PDEs in practice. These operators are especially important for physics-related data (e.g., meteorological observations) because the governing rules behind the observations are mostly differential equations.

Figure 6.1: Examples of difference operators applied to a graph signal: (a) original graph signals, (b) detected edges, (c) sharpened signals, (d) modulated gradients. The filters used for the processing are (b) $\sum_j (f_i - f_j)$, (c) $\sum_j (1.1 f_i - f_j)$, and (d) $f_j - 0.5 f_i$.

6.1.2 Spatial Difference Layer

While the difference operators have been generalized to Riemannian manifolds (Lai et al., 2013; Lim, 2015), they incur numerical error relative to their continuous counterparts, and the error can grow when nodes are spatially far from their neighbors, because the connected nodes ($j \in \mathcal{N}_i$) of the $i$-th node fail to represent local features around the node. Furthermore, the error is even larger when the available data points are sparsely distributed (e.g., sensor-based observations). In other words, the difference operators are unlikely to discover meaningful spatial variations behind sparse observations, since they are limited to immediate neighboring information only. To mitigate this limitation, we propose the spatial difference layer (SDL), which consists of a set of parameters defining learnable difference operators in gradient and Laplacian forms that fully utilize neighboring information:

$$(\nabla_w f)_{ij} = w^{(g_1)}_{ij}\left(f_j - w^{(g_2)}_{ij} f_i\right), \qquad (\Delta_w f)_i = \sum_{j:(i,j)\in\mathcal{E}} w^{(l_1)}_{ij}\left(f_i - w^{(l_2)}_{ij} f_j\right) \qquad (6.1)$$

where the $w_{ij}$ are parameters tuning the difference operators along the corresponding edge direction $e_{ij}$. The two forms (Equation 6.1) are associated with edge and node features, respectively. The subscripts in $\nabla_w$ and $\Delta_w$ denote that the difference operators are functions of the learnable parameters $w$.
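A minimal PyTorch sketch of the two forms in Equation 6.1, assuming the modulating weights (`w_g1`, `w_g2`, `w_l1`, `w_l2`, one value per edge) have already been produced by the graph network of Equation 6.2 below:

```python
import torch

def sdl_gradient(f, edge_index, w_g1, w_g2):
    """Learnable graph gradient of Eq. (6.1), one value per edge:
    (grad_w f)_ij = w_g1_ij * (f_j - w_g2_ij * f_i)."""
    src, dst = edge_index          # edge (i, j): src = i, dst = j
    return w_g1 * (f[dst] - w_g2 * f[src])

def sdl_laplacian(f, edge_index, w_l1, w_l2, num_nodes):
    """Learnable graph Laplacian of Eq. (6.1), one value per node:
    (lap_w f)_i = sum_{j:(i,j) in E} w_l1_ij * (f_i - w_l2_ij * f_j)."""
    src, dst = edge_index
    contrib = w_l1 * (f[src] - w_l2 * f[dst])
    return torch.zeros(num_nodes, dtype=f.dtype).scatter_add_(0, src, contrib)
```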
$w^{(g)}_{ij}$ and $w^{(l)}_{ij}$ are obtained by integrating local information as follows:

$$w_{ij} = g(\{f_k, F_{mn} \mid k, (m,n) \in h\text{-hop neighborhood of edge } (i,j)\}) \qquad (6.2)$$

While the standard difference operators consider only the two connected nodes ($i$ and $j$) of each edge $(i,j)$, Equation 6.2 uses a larger ($h$-hop) view to represent the differences between nodes $i$ and $j$. Since graph networks (GN) (Battaglia et al., 2018) efficiently aggregate neighboring information, we use a GN for the function $g(\cdot)$, and the $w_{ij}$ are the edge-feature outputs of the GN. Equation 6.2 can be viewed as a higher-order difference equation, because nodes/edges that are multiple hops apart are considered.

$w_{ij}$ plays a role similar to the parameters in the convolution kernels of CNNs. For example, while the standard gradient operator can be regarded as a simple edge-detecting filter, the operator becomes a sharpening filter if $w^{(g_1)}_{ij} = 1$ and $w^{(g_2)}_{ij} = \frac{|\mathcal{N}_i|+1}{|\mathcal{N}_i|}$ for node $i$ and the operators over each edge are summed. In other words, by modulating $w_{ij}$, the layer readily extends to conventional kernels, including edge-detection and sharpening filters, and even to more complicated kernels. On top of $w_{ij}$, the difference forms in Equation 6.1 deliberately make the optimization of the learnable parameters depend on differences instead of raw values. Equation 6.1 thus naturally provides a physics-inspired inductive bias that is particularly effective for modeling physics-related observations. Furthermore, the number of channels for $w^{(g)}_{ij}$ and $w^{(l)}_{ij}$ can easily be increased for more expressiveness. Figure 6.1 illustrates how the exemplary filters convolve the given graph signals.

Figure 6.2: Physics-aware Difference Graph Networks for graph signal prediction. Blue boxes have learnable parameters, and all parameters are trained through end-to-end learning. The nodes/edges can be multidimensional.

6.1.3 Recurrent Graph Networks

Difference graph. Once the modulated spatial differences $(\nabla_w f(t), \Delta_w f(t))$ are obtained, they are concatenated with the current signals $f(t)$ to construct node-wise ($z_i$) and edge-wise ($z_{ij}$) features; the resulting graph is called a difference graph. The difference graph includes all the information needed to describe spatial variations.

Recurrent graph networks. Given a snapshot $(f(t), F(t))$ of a sequence of graph signals, one difference graph is obtained and used to predict the next graph signals. While a non-linear layer could combine the learned spatial differences to predict the next signals, it would only discover spatial relations among the features in the difference graph. Since many equations describing physics-related phenomena are non-static (e.g., the Navier-Stokes equations), we adopt recurrent graph networks (RGN) (Sanchez-Gonzalez et al., 2018) with a graph state $\mathcal{G}_h$ as input to combine the spatial differences with temporal variations. The RGN returns a graph state $\mathcal{G}_h = (h^{(v)}, h^{(e)})$ and the next graph signals $z'_i$ and $z'_{ij}$. The update rule is:

1. $(z'_{ij}, h^{(e)}) \leftarrow \phi^e(z_{ij}, z_i, z_j, h^{(e)})$ for all $(i,j) \in \mathcal{E}$,
2. $(z'_i, h^{(v)}) \leftarrow \phi^v(z_i, \bar z'_i, h^{(v)})$ for all $i \in \mathcal{V}$,

where $\bar z'_i$ is an aggregated edge attribute related to node $i$, and $\phi^e, \phi^v$ are the edge and node update functions, respectively, which can be any recurrent unit. Finally, the prediction is made through a decoder fed with the graph signals $z'_i$ and $z'_{ij}$.
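The following is a minimal sketch of the recurrent update above, with GRU cells standing in for the update functions $\phi^e$ and $\phi^v$; the hidden sizes and the sum aggregation are illustrative choices, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class RecurrentGN(nn.Module):
    """Sketch of the RGN in Section 6.1.3: edge update, aggregation, node update."""
    def __init__(self, node_dim, edge_dim, hidden):
        super().__init__()
        self.phi_e = nn.GRUCell(edge_dim + 2 * node_dim, hidden)
        self.phi_v = nn.GRUCell(node_dim + hidden, hidden)
        self.dec = nn.Linear(hidden, node_dim)  # decoder for the next signals

    def forward(self, z_v, z_e, edge_index, h_e, h_v):
        src, dst = edge_index
        # 1. edge update: (z'_ij, h_e) <- phi_e(z_ij, z_i, z_j, h_e)
        h_e = self.phi_e(torch.cat([z_e, z_v[src], z_v[dst]], dim=-1), h_e)
        # 2. sum the updated edge states over the edges incident to each node
        agg = torch.zeros_like(h_v).index_add_(0, dst, h_e)
        # 3. node update: (z'_i, h_v) <- phi_v(z_i, aggregated, h_v)
        h_v = self.phi_v(torch.cat([z_v, agg], dim=-1), h_v)
        return self.dec(h_v), h_e, h_v
```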
Learning objective. Let $\hat f$ and $\hat F$ denote the predictions of the target node/edge signals. PA-DGN is trained by minimizing the objective

$$\mathcal{L} = \sum_{i\in\mathcal{V}} \|f_i - \hat f_i\|^2 + \sum_{(i,j)\in\mathcal{E}} \|F_{ij} - \hat F_{ij}\|^2. \qquad (6.3)$$

For multistep predictions, $\mathcal{L}$ is summed over all prediction steps. If only one type of signal (node or edge) is given, the corresponding term in Equation 6.3 is used to optimize the parameters in the SDL and RGN simultaneously.

6.2 Effectiveness of the Spatial Difference Layer

To investigate whether the proposed spatial difference forms (Equation 6.1) are beneficial for learning physics-related patterns, we use the SDL on two tasks: (1) approximating directional derivatives and (2) predicting synthetic graph signals.

6.2.1 Approximation of Directional Derivatives

As claimed in Section 6.1.2, the standard difference forms (gradient and Laplacian) on a graph can easily cause significant numerical error because they are sensitive to the distance between two points and to variations of the given function. To evaluate the applicability of the proposed SDL, we train it to approximate directional derivatives on a graph. First, we define a synthetic function and its gradients in 2D space and sample 200 points $(x_i, y_i)$. Then, we construct a graph on the sampled points using the $k$-NN algorithm ($k = 4$). With the known gradient $\nabla f = (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})$ at each point (a node in the graph), we compute directional derivatives by projecting $\nabla f$ onto each connected edge $e_{ij}$ (see Figure 6.3).

Figure 6.3: Directional derivative on a graph.

Figure 6.4: Gradients and graph structure of sampled points. Left: the synthetic function is $f_1(x,y) = 0.1 x^2 + 0.5 y^2$. Right: the synthetic function is $f_2(x,y) = \sin(x) + \cos(y)$.

We compare against four baselines: (1) the finite gradient (FinGrad), (2) a multilayer perceptron (MLP), (3) graph networks (GN), and (4) a modified form of Equation 6.1 (One-w). For the finite gradient ($(f_j - f_i)/\|x_j - x_i\|$), there are no learnable parameters and only the two endpoints are used. For the MLP, we feed $(f_i, f_j, x_i, x_j)$ as input to test whether learnable parameters benefit the approximation. For the GN, we use the distances between connected points as edge features and the function values at the points as node features; the edge-feature output of the GN is the prediction of the directional derivative on that edge. Finally, we modify the proposed form to $(\nabla_w f)_{ij} = w_{ij} f_j - f_i$. The GN and the modified form are used to verify the effectiveness of Equation 6.1. Note that the two synthetic functions (Figure 6.4) have different properties: (1) monotonically increasing from a center and (2) periodically varying.
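For reference, a small sketch of how the ground-truth labels of this task can be produced: the analytic gradient at node $i$ is projected onto the unit vector along each edge. The `grad_f1` closure encodes $\nabla f_1 = (0.2x, 1.0y)$, and the two listed edges are placeholders for the 4-NN edges.

```python
import numpy as np

def directional_derivatives(coords, grad_fn, edges):
    """Ground-truth directional derivative on each edge (Figure 6.3):
    project the analytic gradient at node i onto the unit edge direction e_ij."""
    out = []
    for i, j in edges:
        d = coords[j] - coords[i]
        e_ij = d / np.linalg.norm(d)
        out.append(np.dot(grad_fn(*coords[i]), e_ij))
    return np.array(out)

# Example with f1(x, y) = 0.1 x^2 + 0.5 y^2, so grad f1 = (0.2 x, 1.0 y).
grad_f1 = lambda x, y: np.array([0.2 * x, 1.0 * y])
coords = np.random.rand(200, 2) * 2 * np.pi
edges = [(0, 1), (1, 2)]  # in practice, the edges come from a 4-NN graph
print(directional_derivatives(coords, grad_f1, edges))
```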
Table 6.1: Mean squared error ($\times 10^{-2}$) for the approximation of directional derivatives.

| Function | FinGrad | MLP | GN | One-w | SDL |
| $f_1 = 0.1x^2 + 0.5y^2$ | 6.42±0.47 | 2.12±0.32 | 1.05±0.42 | 1.41±0.44 | 0.97±0.39 |
| $f_2 = \sin(x) + \cos(y)$ | 5.90±0.04 | 2.29±0.77 | 2.17±0.34 | 6.73±1.17 | 1.26±0.05 |

Approximation accuracy. As shown in Table 6.1, the proposed spatial difference layer outperforms the others by a large margin. As expected, FinGrad yields the largest error, since it considers only two points with no learnable parameters. Learnable parameters clearly benefit the approximation of directional derivatives even when the inputs are the same (FinGrad vs. MLP). Note that utilizing neighboring information (GN, One-w, SDL) is generally helpful for learning spatial variations properly. However, simply training the parameters of a GN is not sufficient; explicitly defining differences, which is important for understanding spatial variations, provides a more robust inductive bias. One notable finding is that One-w is not as effective as GN and can be even worse than FinGrad, because of its limited degrees of freedom. As implied by the form $(\nabla_w f)_{ij} = w_{ij} f_j - f_i$, a single $w_{ij}$ adjusts the relative difference between $f_i$ and $f_j$, which is not enough to learn all possible linear combinations of $f_i$ and $f_j$. The unstable performance supports that the form of the SDL is not ad-hoc but deliberately designed.

Evaluation on datasets with different sparsity. We varied the number of nodes to control the sparsity of the data. As shown in Table 6.2, our proposed model outperforms the others under various sparsity settings in the synthetic experiment of Section 6.2.2.

Table 6.2: Mean absolute error ($\times 10^{-2}$) for graph signal prediction with different sparsity.

| #Nodes | VAR | MLP | StandardOP | MeshOP | SDL |
| 250 | 0.1730 | 0.1627 | 0.1200 | 0.1287 | 0.1104 |
| 150 | 0.1868 | 0.1729 | 0.1495 | 0.1576 | 0.1482 |
| 100 | 0.1723 | 0.1589 | 0.1629 | 0.1696 | 0.1465 |

Furthermore, we sampled 400 points, trained the SDL as described in Section 6.2.1, and then resampled fewer points (350, 300, 250, 200) to evaluate whether the SDL generalizes to sparser settings. As Table 6.3 shows, MSE increases as fewer sample points are used. However, the SDL still provides much more accurate gradients even when evaluated on a new graph with different properties. The results thus support that the SDL generalizes to the changed, sparser settings.

Table 6.3: Mean squared error ($\times 10^{-2}$) for approximations of directional derivatives of $f_2(x,y) = \sin(x) + \cos(y)$ with different sparsity.

| Method | 350 Nodes | 300 Nodes | 250 Nodes | 200 Nodes |
| FinGrad | 2.88±0.11 | 3.42±0.14 | 3.96±0.17 | 4.99±0.31 |
| SDL | 1.03±0.09 | 1.14±0.12 | 1.40±0.10 | 1.76±0.10 |

6.2.2 Graph Signal Prediction

We evaluate PA-DGN on synthetic data sampled from the simulation of a specific convection-diffusion equation, to verify whether the proposed model can predict the next signals of the simulated dynamics from observations on discrete nodes only.

Simulated data for sparse observations. For the simulated dynamics, we discretize a partial differential equation similar to the one in Long et al. (2018b) to simulate the corresponding linear variable-coefficient convection-diffusion equation on graphs. In continuous space, we define the equation as

$$\begin{cases} \dfrac{\partial f}{\partial t} = a(x,y) f_x + b(x,y) f_y + c(x,y) \Delta f \\ f|_{t=0} = f_0(x,y) \end{cases} \qquad (6.4)$$

with $\Omega = [0, 2\pi] \times [0, 2\pi]$, $(t, x, y) \in [0, 0.2] \times \Omega$, $a(x,y) = 0.5(\cos(y) + x(2\pi - x)\sin(x)) + 0.6$, $b(x,y) = 2(\cos(y) + \sin(x)) + 0.8$, and $c(x,y) = 0.5\left(1 - \dfrac{\sqrt{(x-\pi)^2 + (y-\pi)^2}}{\sqrt{2}\,\pi}\right)$.

We follow the initialization setting of Long et al. (2018b):

$$f_0(x,y) = \sum_{|k|,|l| \le F} \lambda_{k,l} \cos(kx + ly) + \gamma_{k,l} \sin(kx + ly) \qquad (6.5)$$

where $F = 9$, $\lambda_{k,l}, \gamma_{k,l} \sim \mathcal{N}\left(0, \frac{1}{50}\right)$, and $k$ and $l$ are chosen randomly. We use spatial difference operators to approximate the spatial derivatives:

$$\begin{aligned} f_x(x_i, y_i) &= \tfrac{1}{2s}\left(f(x_i+s, y_i) - f(x_i-s, y_i)\right) \\ f_y(x_i, y_i) &= \tfrac{1}{2s}\left(f(x_i, y_i+s) - f(x_i, y_i-s)\right) \\ f_{xx}(x_i, y_i) &= \tfrac{1}{s^2}\left(f(x_i+s, y_i) - 2f(x_i, y_i) + f(x_i-s, y_i)\right) \\ f_{yy}(x_i, y_i) &= \tfrac{1}{s^2}\left(f(x_i, y_i+s) - 2f(x_i, y_i) + f(x_i, y_i-s)\right) \end{aligned} \qquad (6.6)$$

where $s$ is the spatial grid size of the discretization.
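A compact sketch of the central differences in Equation 6.6 on a regular grid, assuming periodic boundaries on $[0, 2\pi]^2$ (which `np.roll` implements):

```python
import numpy as np

def central_diffs(f, s):
    """Second-order central differences of Eq. (6.6) on a periodic 2D grid f
    with spacing s; axis 0 is x and axis 1 is y."""
    f_x  = (np.roll(f, -1, 0) - np.roll(f, 1, 0)) / (2 * s)
    f_y  = (np.roll(f, -1, 1) - np.roll(f, 1, 1)) / (2 * s)
    f_xx = (np.roll(f, -1, 0) - 2 * f + np.roll(f, 1, 0)) / s ** 2
    f_yy = (np.roll(f, -1, 1) - 2 * f + np.roll(f, 1, 1)) / s ** 2
    return f_x, f_y, f_xx, f_yy
```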
We then rewrite (6.4) with difference operators defined on graphs:

$$\begin{cases} \dfrac{\partial f}{\partial t} = a(i)(\nabla f)_{\hat x} + b(i)(\nabla f)_{\hat y} + c(i)\left((\Delta f)_{\hat x\hat x} + (\Delta f)_{\hat y\hat y}\right) \\ f_i(0) = f_0(i) \end{cases} \qquad (6.7)$$

where

$$a(i)(x_j, y_j) = \begin{cases} -\dfrac{a(x_i, y_i)}{2s} & \text{if } x_i = x_j + s,\ y_i = y_j \\ \dfrac{a(x_i, y_i)}{2s} & \text{if } x_i = x_j - s,\ y_i = y_j \end{cases} \qquad (6.8)$$

$$b(i)(x_j, y_j) = \begin{cases} -\dfrac{b(x_i, y_i)}{2s} & \text{if } x_i = x_j,\ y_i = y_j + s \\ \dfrac{b(x_i, y_i)}{2s} & \text{if } x_i = x_j,\ y_i = y_j - s \end{cases} \qquad (6.9)$$

$$c(i)(x_j, y_j) = \dfrac{c}{s^2}. \qquad (6.10)$$

We then replace the time derivative in (6.7) with a temporal discretization:

$$\begin{cases} f(t+1) = \Delta t\left(a(i)(\nabla f)_{\hat x} + b(i)(\nabla f)_{\hat y} + c(i)\left((\Delta f)_{\hat x\hat x} + (\Delta f)_{\hat y\hat y}\right)\right) + f(t) \\ f_i(0) = f_0(i) \end{cases} \qquad (6.11)$$

where $\Delta t$ is the time step of the temporal discretization. Equation (6.11) is used to simulate the dynamics described by Equation (6.4). We generate 1000 sessions on a 50×50 regular mesh with time step size $\Delta t = 0.01$; 700 sessions are used for training, 150 for validation, and 150 for testing. We uniformly sample 250 points in the 2D space and use their corresponding time series as the dataset in our synthetic experiments. The task is to predict the signal values of all points over the future $M$ steps, given the observed values of the first $N$ steps; we choose $N = 5$ and $M = 15$. Since there is no a priori graph structure on the sampled points, we construct a graph with the $k$-NN algorithm ($k = 4$) using Euclidean distance. Figure 6.5 shows the dynamics and the graph structure of the sampled points.

To evaluate the effect of the proposed SDL on this prediction task, we cascade the SDL and a linear regression model as our prediction model, since the dynamics follow a linear partial differential equation. We compare its performance with four baselines: (1) a vector auto-regressor (VAR); (2) a multilayer perceptron (MLP); (3) StandardOP: the standard approximation of differential operators in Section 6.1.1 followed by a linear regressor; and (4) MeshOP: similar to StandardOP, but using the discretization on a triangulated mesh from Section 3.3 for the differential operators.

Table 6.4: Mean absolute error ($\times 10^{-2}$) for graph signal prediction.

| VAR | MLP | StandardOP | MeshOP | SDL |
| 16.84±0.41 | 15.75±0.53 | 11.99±0.29 | 12.82±0.06 | 10.87±0.98 |

Figure 6.5: Synthetic dynamics and graph structure of sampled points (snapshots at t = 1, 5, 10, 15, 20).

Prediction performance. Table 6.4 shows the prediction performance of the different models measured in mean absolute error. The prediction model with our proposed spatial difference layer outperforms the other baselines. All models incorporating any form of spatial difference operators (StandardOP, MeshOP, and SDL) outperform those without them (VAR and MLP), showing that introducing spatial difference information inspired by the intrinsic dynamics helps prediction. However, when the points with observable signals are sparse in space, spatial difference operators derived from fixed rules can be inaccurate and sub-optimal for prediction, since the locally linear assumption they rely on no longer holds. Our proposed SDL, by contrast, bridges the gap between approximated and accurate difference operators by introducing learnable coefficients that utilize neighboring information, and thus improves prediction performance.

6.3 Prediction: Graph Signals on Land-based Weather Sensors

We evaluate the proposed model on the task of predicting climate observations (temperature) from land-based weather stations located in the United States.

6.3.1 Experimental Set-up

Dataset and task. We sample the weather stations located in the United States from the Online Climate Data Directory of the National Oceanic and Atmospheric Administration (NOAA) and choose the stations that actively measured meteorological observations during 2015; see Section 3.5.3 for details. We use the $k$-nearest-neighbor ($k$-NN) algorithm ($k = 4$) to generate graph structures, and the final adjacency matrix is $A = (A_k + A_k^\top)/2$, a symmetrized version of the adjacency matrix $A_k$ output by the $k$-NN algorithm.
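A two-line sketch of this graph construction, assuming scikit-learn's `kneighbors_graph` for the directed k-NN adjacency:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_adjacency(coords, k=4):
    """k-NN adjacency of Section 6.3.1, symmetrized as A = (A_k + A_k^T) / 2."""
    A_k = kneighbors_graph(coords, n_neighbors=k, mode="connectivity").toarray()
    return (A_k + A_k.T) / 2
```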
Our main task is to predict the next graph signals based on the current and past graph signals. All methods we evaluate are trained with the objective in Equation 6.3 using the Adam optimizer, and we use scheduled sampling (Bengio et al., 2015) for the models with recurrent modules. We evaluate PA-DGN and the other baselines on two prediction tasks: (1) 1-step and (2) multistep-ahead prediction. Furthermore, we present an ablation study showing how important the spatial derivatives from our proposed SDL are as signals for predicting the graph dynamics. Below we provide additional details for the models used in this work, including architecture and hyperparameter settings.

Model settings. Unless mentioned otherwise, all models use a hidden dimension of size 64.

• VAR: a vector autoregression model with 2 lags. The input is the concatenated features of the previous 2 frames. The weights are shared among all nodes in the graph.
• MLP: a multilayer perceptron with 2 hidden layers. The input is the concatenated features of the previous 2 frames. The weights are shared among all nodes.
• GRU: a gated recurrent unit network with 2 hidden layers. The input is the concatenated features of the previous 2 frames. The weights are shared among all nodes.
• RGN: a recurrent graph neural network with 2 GN blocks. Each GN block has an edge update block and a node update block, both of which use a 2-layer GRU cell as the update function. We set its hidden dimension to 73 so that it has the same number of learnable parameters as our proposed model, PA-DGN.
• RGN(StandardOP): similar to RGN, but uses the outputs of the difference operators in Section 6.1.1 as extra input features. Hidden dimension 73.
• RGN(MeshOP): similar to RGN(StandardOP), but the extra input features are calculated using the operators in Section 3.3. Hidden dimension 73.
• PA-DGN: our proposed model. The spatial difference layer uses a message passing neural network (MPNN) with 2 GN blocks using 2-layer MLPs as update functions. The forward-network part uses a recurrent graph neural network with 2 recurrent GN blocks using 2-layer GRU cells as update functions.

The numbers of learnable parameters of all models are listed as follows:

Table 6.5: Numbers of learnable parameters.

| Model | VAR | MLP | GRU | RGN | RGN(StandardOP) | RGN(MeshOP) | PA-DGN |
| # Params | 3 | 4,417 | 37,889 | 345,876 | 341,057 | 342,152 | 340,001 |

Training settings.

Number of evaluation runs: we performed every experiment in this chapter 3 times and report means and standard deviations.

Length of prediction: for experiments on synthetic data, all models take the first 5 frames as input and predict the following 15 frames. For experiments on the NOAA datasets, all models take the first 12 frames as input and predict the following 12 frames.

Training hyperparameters: we use the Adam optimizer with learning rate 1e-3, batch size 8, and weight decay 5e-4. All experiments are trained for a maximum of 2000 epochs with early stopping, using inverse sigmoid scheduled sampling with the coefficient $k = 107$.
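For completeness, the inverse-sigmoid decay of Bengio et al. (2015) that the coefficient $k = 107$ refers to can be sketched as follows; the probability of feeding ground truth instead of the model's own prediction at epoch $i$ is $\epsilon_i = k/(k + \exp(i/k))$.

```python
import math

def teacher_forcing_prob(epoch, k=107.0):
    """Inverse-sigmoid scheduled-sampling probability from Bengio et al. (2015)."""
    return k / (k + math.exp(epoch / k))
```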
Environments: all experiments are implemented with Python 3.6 and PyTorch 1.1.0 and run on NVIDIA GTX 1080 Ti GPUs.

6.3.2 Graph Signal Predictions

We compare against widely used baselines (VAR, MLP, and GRU) for 1-step and multistep prediction. We then use RGN (Sanchez-Gonzalez et al., 2018) to examine how much the graph structure is beneficial. Finally, we evaluate PA-DGN to verify whether the proposed architecture (Equation 6.1) reduces the prediction loss. The results are summarized in Table 6.6.

Overall, RGN and PA-DGN are better than the other baselines, implying that the graph structure provides a useful inductive bias for the task. This is intuitive, as the meteorological observations change continuously over space and time, and the observations at the $i$-th station are therefore strongly related to those of its neighboring stations.

PA-DGN outperforms RGN, and the discrepancy comes from the fact that the spatial derivatives (Equation 6.1) fed into PA-DGN are beneficial. This is expected, because the meteorological signal at a given point is a function not only of its previous value but also of the relative differences between neighboring signals and itself. Knowing the relative differences among local observations is essential for understanding physics-related dynamics. For example, the diffusion equation, which describes how physical quantities (e.g., heat) are transported through space over time, is also a function of the relative differences of the quantities ($\frac{df}{dt} = D\Delta f$) rather than of the neighboring values themselves. In other words, spatial differences are physics-aware features, and it is desirable to use them as input for learning dynamics related to physical phenomena.

Table 6.6: Graph signal prediction results (MAE) on multistep predictions. In each row, we report the average with standard deviations for all baselines and PA-DGN. One step is a 1-hour time interval.

West region:

| Method | 1-step | 6-step | 12-step |
| VAR | 0.1241±0.0234 | 0.4295±0.1004 | 0.4820±0.1298 |
| MLP | 0.1040±0.0003 | 0.3742±0.0238 | 0.4998±0.0637 |
| GRU | 0.0913±0.0047 | 0.1871±0.0102 | 0.2707±0.0006 |
| RGN | 0.0871±0.0033 | 0.1708±0.0024 | 0.2666±0.0252 |
| RGN(StandardOP) | 0.0860±0.0018 | 0.1674±0.0019 | 0.2504±0.0107 |
| RGN(MeshOP) | 0.0840±0.0015 | 0.2119±0.0018 | 0.4305±0.0177 |
| PA-DGN | 0.0840±0.0004 | 0.1614±0.0042 | 0.2439±0.0163 |

South East region:

| Method | 1-step | 6-step | 12-step |
| VAR | 0.0889±0.0025 | 0.2250±0.0013 | 0.3062±0.0032 |
| MLP | 0.0722±0.0012 | 0.1797±0.0086 | 0.2514±0.0154 |
| GRU | 0.0751±0.0037 | 0.1724±0.0130 | 0.2446±0.0241 |
| RGN | 0.0790±0.0113 | 0.1815±0.0239 | 0.2548±0.0210 |
| RGN(StandardOP) | 0.0942±0.0121 | 0.2135±0.0187 | 0.2902±0.0348 |
| RGN(MeshOP) | 0.0905±0.0012 | 0.2052±0.0012 | 0.2602±0.0062 |
| PA-DGN | 0.0721±0.0002 | 0.1664±0.0011 | 0.2408±0.0056 |

6.3.3 Contribution of Spatial Derivatives

We further investigate whether the modulated spatial derivatives (Equation 6.1) are more advantageous than the spatial derivatives defined on Riemannian manifolds. First, RGN without any spatial derivatives is assessed on the prediction tasks for the Western and Southeastern station graphs. Note that this model uses no extra features beyond the graph signal, $f(t)$.
Second, we add (1) StandardOP, the discrete spatial differences (gradient and Laplacian) from Section 6.1.1, and (2) MeshOP, the triangular-mesh approximation of the differential operators from Section 3.3, separately as additional signals to RGN. Finally, we incorporate our proposed spatial difference layer into RGN.

Table 6.6 shows the contribution of each component. As expected, PA-DGN provides much larger drops in MAE (3.56%, 5.50%, 8.51% and 8.73%, 8.32%, 5.49% on the two datasets, respectively) compared to RGN without derivatives, and the results demonstrate that the derivatives, namely the relative differences from neighboring signals, are effectively useful. However, neither RGN with StandardOP nor RGN with MeshOP consistently outperforms RGN. We also found that PA-DGN consistently improves the prediction error compared to the fixed derivatives. This finding is evidence that the parameters modulating the spatial derivatives in our proposed spatial difference layer are properly inferred to optimize the networks and improve prediction performance.

6.3.4 Effect of Different Graph Structures

In this section, we evaluate the effect of two different graph structures on the baselines and our models: (1) k-NN, a graph constructed with the $k$-NN algorithm ($k = 4$), and (2) TriMesh, a graph generated by Delaunay triangulation. Both graphs use Euclidean distance.

Table 6.7: Mean absolute error ($\times 10^{-2}$) for graph signal prediction on the synthetic dataset.

| VAR | MLP | StandardOP (k-NN) | StandardOP (TriMesh) | MeshOP (k-NN) | MeshOP (TriMesh) | SDL (k-NN) | SDL (TriMesh) |
| 17.30 | 16.27 | 12.00 | 12.29 | 12.87 | 12.82 | 11.04 | 12.40 |

Table 6.7 and Table 6.8 show the effect of different graph structures on the synthetic dataset used in Section 6.2.2 and on the real-world dataset in Section 6.3.2, respectively. We find that the effect of graph structure is not homogeneous across models. For RGN and PA-DGN, the k-NN graph is more beneficial to prediction performance than the TriMesh graph, because these two models rely more on neighboring information, and a k-NN graph incorporates it better than a Delaunay triangulation graph. However, switching from the TriMesh graph to the k-NN graph harms the prediction accuracy of RGN(MeshOP), since Delaunay triangulation is a well-defined method for generating triangulated meshes, in contrast to k-NN graphs. Given these varied effects of graph structure on different models, our proposed PA-DGN under k-NN graphs always outperforms the other baselines using any graph structure.

Figure 6.6: MAE distribution across the nodes for 1-step, 6-step, and 12-step prediction.

6.3.5 Distribution of Prediction Error Across Nodes

Figure 6.6 shows the distribution of MAEs across the nodes of PA-DGN applied to the graph signal prediction task on the west-coast region of the real-world dataset in Section 6.3.2. As shown in the figure, the nodes with the highest short-term prediction error are gathered in the inner part, where observable nodes are sparse, whereas for long-term prediction the nodes in areas with few observable points no longer have the largest MAE. This implies that PA-DGN utilizes neighboring information efficiently even under the limitation of sparsely observable points.
6.4 Evaluation on the NEMO Sea Surface Temperature (SST) Dataset

We tested our proposed method and the baselines on the NEMO sea surface temperature (SST) dataset.¹ We first download the data in the area between 50N-65N and 75W-10W from 2016-01-01 to 2017-12-31, then crop the [0, 550] × [100, 650] square from the area and sample 250 points from the square as our chosen dataset. We divide the data into 24 sequences, each lasting 30 days, and truncate the tail. All models use the first 5-day SST as input and predict the SST over the following 15 and 25 days. We use the data from 2016 for training all models and the rest for testing. For StandardOP, MeshOP, and SDL, we test both options of using linear regression and using RGN for the prediction part and report the best result. The results in Table 6.9 show that all methods incorporating spatial differences gain improvements in prediction, and that our proposed learnable SDL outperforms all other baselines.

¹ Available at http://marine.copernicus.eu/services-portfolio/access-to-products/?option=com_csw&view=details&product_id=GLOBAL_ANALYSIS_FORECAST_PHY_001_024.

Table 6.8: Graph signal prediction results (MAE) on multistep predictions with different graph structures. In each row, we report the average with standard deviations for all baselines and PA-DGN. One step is a 1-hour time interval.

West region:

| Method | Graph | 1-step | 6-step | 12-step |
| VAR | - | 0.1241±0.0234 | 0.4295±0.1004 | 0.4820±0.1298 |
| MLP | - | 0.1040±0.0003 | 0.3742±0.0238 | 0.4998±0.0637 |
| GRU | - | 0.0913±0.0047 | 0.1871±0.0102 | 0.2707±0.0006 |
| RGN | k-NN | 0.0871±0.0033 | 0.1708±0.0024 | 0.2666±0.0252 |
| RGN | TriMesh | 0.0897±0.0030 | 0.1723±0.0116 | 0.2800±0.0414 |
| RGN(StandardOP) | k-NN | 0.0860±0.0018 | 0.1674±0.0019 | 0.2504±0.0107 |
| RGN(StandardOP) | TriMesh | 0.0842±0.0011 | 0.1715±0.0027 | 0.2517±0.0369 |
| RGN(MeshOP) | k-NN | 0.0840±0.0015 | 0.2119±0.0018 | 0.4305±0.0177 |
| RGN(MeshOP) | TriMesh | 0.0846±0.0017 | 0.2090±0.0077 | 0.4051±0.0457 |
| PA-DGN | k-NN | 0.0840±0.0004 | 0.1614±0.0042 | 0.2439±0.0163 |
| PA-DGN | TriMesh | 0.0849±0.0012 | 0.1610±0.0029 | 0.2473±0.0162 |

South East region:

| Method | Graph | 1-step | 6-step | 12-step |
| VAR | - | 0.0889±0.0025 | 0.2250±0.0013 | 0.3062±0.0032 |
| MLP | - | 0.0722±0.0012 | 0.1797±0.0086 | 0.2514±0.0154 |
| GRU | - | 0.0751±0.0037 | 0.1724±0.0130 | 0.2446±0.0241 |
| RGN | k-NN | 0.0790±0.0113 | 0.1815±0.0239 | 0.2548±0.0210 |
| RGN | TriMesh | 0.0932±0.0105 | 0.2076±0.0200 | 0.2854±0.0211 |
| RGN(StandardOP) | k-NN | 0.0942±0.0121 | 0.2135±0.0187 | 0.2902±0.0348 |
| RGN(StandardOP) | TriMesh | 0.0868±0.0132 | 0.1885±0.0305 | 0.2568±0.0328 |
| RGN(MeshOP) | k-NN | 0.0913±0.0016 | 0.2069±0.0031 | 0.2649±0.0092 |
| RGN(MeshOP) | TriMesh | 0.0877±0.0020 | 0.2043±0.0026 | 0.2579±0.0057 |
| PA-DGN | k-NN | 0.0721±0.0002 | 0.1664±0.0011 | 0.2408±0.0056 |
| PA-DGN | TriMesh | 0.0876±0.0096 | 0.2002±0.0163 | 0.2623±0.0180 |

Table 6.9: Mean absolute error ($\times 10^{-2}$) for SST graph signal prediction.

| Horizon | VAR | MLP | GRU | RGN | StandardOP | MeshOP | SDL |
| 15-step | 15.123 | 15.058 | 15.101 | 15.172 | 14.756 | 14.607 | 14.382 |
| 25-step | 19.533 | 19.473 | 19.522 | 19.705 | 18.983 | 18.977 | 18.434 |

Chapter 7
Physics-aware Spatiotemporal Modules with Auxiliary Tasks for Meta-Learning

In previous chapters, we focused on utilizing physics-associated knowledge to improve a single learnable model. This chapter shows how such knowledge can provide a strong inductive bias to general models of physical observations. We propose a framework, physics-aware modular meta-learning with auxiliary tasks (PiMetaL), whose spatial modules incorporate PDE-independent knowledge and whose temporal modules adapt rapidly to limited data. The framework does not require the exact form of the governing equations to model the observed spatiotemporal data.
Furthermore, it mitigates the need for a large number of real-world meta-learning tasks by leveraging simulated data. We apply the proposed framework to both synthetic and real-world spatiotemporal prediction tasks and demonstrate its superior performance with limited observations.

7.1 Motivation

In this section, we describe how the physics equations for conserved quantities decompose into two parts and how the meta-learning approach tackles the limited-data setting by utilizing synthetic data.

7.1.1 Decomposability of Variants of a Continuity Equation

In physics, a continuity equation (Equation 7.1) describes how a locally conserved quantity, such as temperature, fluid density, heat, or energy, is transported across space and time:

$$\frac{\partial \rho}{\partial t} + \nabla \cdot \mathbf{J} = \sigma. \qquad (7.1)$$

This equation underlies many specific equations, such as the convection-diffusion equation and the Navier-Stokes equations:

$$\dot u = \nabla \cdot (D \nabla u) - \nabla \cdot (\mathbf{v} u) + R,$$
$$\dot{\mathbf{u}} = -(\mathbf{u} \cdot \nabla)\mathbf{u} + \nu \nabla^2 \mathbf{u} - \nabla \omega + \mathbf{g},$$

where the scalar $u$ and the vector field $\mathbf{u}$ are the variables of interest (e.g., temperature, flow velocity), and a dot over a variable denotes a time derivative. The common feature of these equations is that their form can be digested as (Bar-Sinai et al., 2019):

$$\dot u = F(u_x, u_y, u_{xx}, u_{yy}, \dots), \qquad (7.2)$$

where the right-hand side is a function of spatial derivatives. As the time derivative can be seen as an Euler discretization (Chen et al., 2018), it is notable that the next state is a function of the current state and its spatial derivatives. Thus, knowing the spatial derivatives at time $t$ is a key step for spatiotemporal prediction at time $t+1$ for locally conserved quantities. According to Equation 7.2, the spatial derivatives are used universally in variants of Equation 7.1, and only the updating function $F(\cdot)$ is specific to a particular equation. This property implies that PDEs for physical quantities can be decomposed into two modules: spatial and temporal derivative modules.

7.1.2 Spatial Derivative Modules: PDE-independent Modules

The finite difference method (FDM) is widely used to discretize a $d$-th-order derivative as a linear combination of function values on an $n$-point stencil:

$$\frac{\partial^d u}{\partial x^d} \approx \sum_{i=1}^{n} \alpha_i u(x_i), \qquad (7.3)$$

where $n > d$. Under FDM, computing spatial derivatives, which are the input components of $F(\cdot)$ in Equation 7.2, is independent of the form of the PDE. Thus, we can modularize spatial derivatives as PDE-independent modules. Modules that learn the coefficients $\alpha_i$ in a data-driven manner have been proposed recently (Bar-Sinai et al., 2019; Seo et al., 2020a). Data-driven coefficients are particularly useful when the discretization of the $n$-point stencil is irregular and low-resolution, where fixed coefficients cause substantial numerical error.

7.1.3 Time Derivative Module: PDE-specific Module

Once derivatives up to order $d$ are modularized by learnable parameters, the modules are assembled by an additional module that learns the function $F(\cdot)$ in Equation 7.2. This module is PDE-specific, as the function $F$ describes how the spatiotemporal observations change. The exact form of the ground-truth PDE is not given; instead, the time derivative module is data-driven and is adapted to the observations.
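As a concrete example of Equation 7.3, the classical (non-learned) coefficients $\alpha_i$ on an arbitrary 1D stencil can be obtained by solving the Taylor-expansion moment conditions; the sketch below recovers the familiar $[1, -2, 1]/h^2$ stencil for the second derivative (unit spacing assumed).

```python
import numpy as np
from math import factorial

def fdm_coefficients(offsets, d):
    """Coefficients alpha_i of Eq. (7.3) for the d-th derivative on an
    arbitrary 1D stencil: solve the moment conditions
    sum_i alpha_i * x_i^m / m! = [m == d] for m = 0..n-1."""
    n = len(offsets)
    V = np.array([[x ** m / factorial(m) for x in offsets] for m in range(n)])
    rhs = np.zeros(n); rhs[d] = 1.0
    return np.linalg.solve(V, rhs)

# Classic 3-point stencil for the 2nd derivative: [1, -2, 1] / h^2 (here h=1).
print(fdm_coefficients([-1.0, 0.0, 1.0], d=2))  # -> [ 1. -2.  1.]
```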
7.1.4 Meta-Learning with PDE-independent/-specific Modules

Recently, Raghu et al. (2020) investigated the effectiveness of model-agnostic meta-learning (MAML; Finn et al. (2017)) and found that the outer loop of MAML is more likely to learn parameters for reusable features than for rapid adaptation. The finding that feature reuse is the predominant reason for MAML's efficient learning allows us to use additional information that is beneficial for learning better representations. Previously, the meta-training objective was expected to match the meta-test objective, since the purpose of meta-learning is to learn good initial parameters applicable across similar tasks (e.g., image classification to image classification). We can now incorporate auxiliary tasks in a meta-learning setting to reinforce reusable features for a main task. As described in Section 7.1.1, the spatial modules are reusable across different observations; thus, we can first meta-initialize the spatial modules with spatial derivatives provided by synthetic datasets. We then integrate the spatial modules with the task-specific temporal module during meta-test to help the adaptation of the TDM on few observations. Since the spatial modules are trained on readily available synthetic datasets, a large number of real-world datasets for meta-training is not required.

Figure 7.1: Schematic overview of the physics-aware modular meta-learning (PiMetaL): task-independent spatial modules are meta-trained on auxiliary tasks and combined with a task-specific temporal module at meta-test time.

7.2 Physics-aware Meta-Learning with Auxiliary Tasks

In this section, we develop a physics-aware modular meta-learning method for the modularized PDEs and design neural network architectures for the modules. Figure 7.1 describes the proposed framework and its computational process.

Algorithm 1: Spatial derivative module (SDM)
Input: graph signals $u_i$ and edge features $e_{ij} = x_j - x_i$ on $\mathcal{G}$, where $x_i$ is the coordinate of node $i$.
Output: spatial derivatives $\{\hat u_{k,i} \mid i \in \mathcal{V},\ k \in \mathcal{K}\}$, where $\mathcal{K} = \{\nabla_x, \nabla_y, \nabla^2_x, \nabla^2_y\}$.
Require: spatial derivative modules $\{\phi_k \mid k \in \mathcal{K}\}$.
1: for $k \in \mathcal{K}$ do
2:   $\{a_{k,i}, b_{k,(i,j)} \mid i \in \mathcal{V},\ (i,j) \in \mathcal{E}\} = \phi_k(\{u\}, \{e\}, \mathcal{G})$
3:   for $i \in \mathcal{V}$ do
4:     $\hat u_{k,i} = a_{k,i} u_i + \sum_{(j,i) \in \mathcal{E}} b_{k,(j,i)} u_j$
5:   end for
6: end for

7.2.1 Spatial Derivative Module

As we focus on modeling and predicting sensor-based observations, where the available data points lie on a spatially sparse, irregular grid, we use graph networks for each module $\phi_k$ to learn the finite-difference coefficients. Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ where $\mathcal{V} = \{1, \dots, N\}$ and $\mathcal{E} = \{(i,j) : i,j \in \mathcal{V}\}$, a node $i$ denotes a physical location $x_i = (x_i, y_i)$ where a function value $u_i = u(x_i, y_i)$ is observed. The graph signals, with positional relative displacements as edge features, are fed into the spatial modules to approximate spatial derivatives via Algorithm 1. The coefficients $(a_i, b_{(i,j)})$ on each node $i$ and edge $(i,j)$ are outputs of $\phi$, and they are linearly combined with the function values $u_i$ and $u_j$. $\mathcal{K}$ denotes a set of finite-difference operators. For example, if we set $\mathcal{K} = \{\nabla_x, \nabla_y, \nabla^2_x, \nabla^2_y\}$, we have 4 modules approximating the first and second orders of the spatial derivatives in 2 dimensions, respectively.
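A minimal sketch of Algorithm 1 for a single operator $k$, where `coeff_net` is a hypothetical graph network that emits the node coefficients $a_i$ and edge coefficients $b_{(j,i)}$:

```python
import torch

def spatial_derivative_module(u, edge_index, coeff_net):
    """One operator of Algorithm 1 (SDM):
    u_hat_i = a_i * u_i + sum_{(j,i) in E} b_{(j,i)} * u_j."""
    src, dst = edge_index
    a, b = coeff_net(u, edge_index)          # a: [N], b: [E] (learned coefficients)
    return (a * u).index_add(0, dst, b * u[src])
```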
7.2.2 Time Derivative Module

Once the spatial derivatives are approximated, another learnable module is required to combine them for a target task. The form of line 2 in Algorithm 2 comes from Equation 7.2, and the TDM is adapted to learn the unknown function $F(\cdot)$ in that equation. As our target task is the regression of graph signals, we use a recurrent graph network for the TDM.

Algorithm 2: Time derivative module (TDM)
Input: graph signals $u$ and approximated spatial derivatives $\hat u_k$, where $k \in \mathcal{K}$, on $\mathcal{G}$; time interval $\Delta t$.
Output: prediction of the signals at the next time step, $\hat u(t)$.
Require: time derivative module.
1: $\hat u_t = \text{TDM}\left(\{u_i, \hat u_{k,i} \mid i \in \mathcal{V},\ k \in \mathcal{K}\}\right)$
2: $\hat u(t) = u(t-1) + \hat u_{t-1} \cdot \Delta t$

7.2.3 Meta-Learning with an Auxiliary Objective

As discussed in Section 7.1.1, knowing the spatial derivatives at time $t$ is important for predicting the next signals at $t+1$ for locally conserved physical quantities. However, it is impractical to access the spatial derivatives in sensor-based observations, as they are highly discretized over space. Meta-initialization with auxiliary tasks from synthetic datasets is particularly important for addressing this challenge. First, the spatial modules can serve as universal feature extractors for modeling observations that follow unknown physics-based PDEs. Unlike other domains such as computer vision, it has been considered that there is no particular shareable architecture for learning the spatiotemporal dynamics of physical systems. We propose that the PDE-independent spatial modules are applicable as feature extractors across different dynamics, as long as the dynamics follow a local form of conservation laws. Second, we can utilize synthetic data to meta-train the spatial modules, since they are PDE-agnostic. This property allows us to use a large amount of synthetic datasets readily generated by numerical methods, regardless of the exact form of the PDE governing the target observations. Finally, we can provide a stronger inductive bias that is beneficial for modeling real-world observations but not explicitly available in the observations themselves.

Algorithm 3: Meta-initialization with auxiliary tasks (supervision of spatial derivatives)
Input: a set of meta-train task datasets $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_B\}$ where $\mathcal{D}_b = (\mathcal{D}^{tr}_b, \mathcal{D}^{te}_b)$ and $\mathcal{D}_b = \{(u^b_i, e^b_{ij}, y^{(a_1,b)}_i, \dots, y^{(a_K,b)}_i) : i \in \mathcal{V}_b,\ (i,j) \in \mathcal{E}_b\}$, with $y^{(a_k,\cdot)}_i$ the $k$-th auxiliary-task label for the $i$-th node; learning rates $\alpha$ and $\beta$.
Output: meta-initialized spatial modules $\theta$.
1: Initialize auxiliary modules $\theta = (\theta_1, \dots, \theta_K)$
2: while not converged do
3:   for $\mathcal{D}_b$ in $\mathcal{D}$ do
4:     $\theta'_b = \theta - \alpha \nabla_\theta \sum_{k=1}^{K} \mathcal{L}^{aux}_k(\mathcal{D}^{tr}_b; \theta_k)$
5:   end for
6:   $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{b=1}^{B} \sum_{k=1}^{K} \mathcal{L}^{aux}_k(\mathcal{D}^{te}_b; \theta'_{b,k})$
7: end while

Algorithm 3 describes how the spatial modules are meta-initialized by MAML under the supervision of $K$ different spatial derivatives. First, we generate function values and spatial derivatives on a 2D regular grid from an analytical function. Then, we sample a finite number of points from the regular grid to represent discretized nodes and build a graph on the sampled nodes. Each graph signal and its discretization become the input features of a meta-train task, and the corresponding spatial derivatives are the auxiliary-task labels. Figure 7.2 visualizes graph signals and spatial derivatives used for meta-initialization. Once the spatial modules are initialized through meta-training, we reuse them at meta-test time, where the temporal module (the head of the network) is adapted on few observations from real-world sensors (Algorithm 4). Although standard MAML also updates the body of the network (the spatial modules), we adapt only the head layer ($\phi$), as in the almost-no-inner-loop method of Raghu et al. (2020). The meta-test task is graph signal prediction: the temporal modules ($\phi$) are adapted with a regression loss $\mathcal{L} = \sum_{t=1}^{T} \|u(t) - \hat u(t)\|^2$ on a length-$T$ sequence ($\mathcal{D}^{tr}_m$) and evaluated on the held-out ($t > T$) sequence ($\mathcal{D}^{te}_m$) with the adapted parameters.
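A first-order sketch of the meta-initialization loop in Algorithm 3 (one inner step, one task per outer update; `aux_loss` is a hypothetical function computing $\sum_k \mathcal{L}^{aux}_k$ for a dataset split given a parameter list):

```python
import torch

def meta_init_spatial_modules(theta, tasks, inner_lr, outer_lr, aux_loss):
    """First-order sketch of Algorithm 3: adapt the spatial-module parameters
    on each task's support split, then update the meta-parameters from the
    query-split auxiliary loss (supervision of spatial derivatives)."""
    meta_opt = torch.optim.Adam(theta, lr=outer_lr)
    for D_tr, D_te in tasks:
        # inner step: theta'_b = theta - alpha * grad L_aux(D_tr; theta)
        grads = torch.autograd.grad(aux_loss(D_tr, theta), theta)
        adapted = [p - inner_lr * g for p, g in zip(theta, grads)]
        # outer step: the query loss of the adapted parameters updates theta
        # (first-order approximation: inner gradients treated as constants)
        meta_opt.zero_grad()
        aux_loss(D_te, adapted).backward()
        meta_opt.step()
    return theta
```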
Algorithm 4: Adaptation on meta-test tasks
Input: a set of meta-test task datasets $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_M\}$ where $\mathcal{D}_m = (\mathcal{D}^{tr}_m, \mathcal{D}^{te}_m)$; meta-initialized SDM ($\theta$); learning rate $\gamma$.
Output: adapted TDM $\phi'_m$ for the $m$-th task.
1: Initialize temporal modules $(\phi_1, \dots, \phi_M)$
2: for $\mathcal{D}_m$ in $\mathcal{D}$ do
3:   $\phi'_m = \phi_m - \gamma \nabla_{\phi_m} \mathcal{L}(\mathcal{D}^{tr}_m; \theta, \phi_m)$
4: end for

Figure 7.2: Examples of generated spatial function values and graph signals. Node and edge features (function values and relative displacements, respectively) are used to approximate spatial derivatives (arrows). We can adjust the number of nodes (spatial resolution), the number of edges (discretization), and the degree of fluctuation (scale of derivatives) to differentiate meta-train tasks.

7.3 Spatial Derivative Modules: Reusable Modules

We have claimed that the spatial modules provide reusable features associated with spatial derivatives such as $\nabla_x u$, $\nabla_y u$, and $\nabla^2_x u$ across different dynamics or PDEs. While data-driven approximation of spatial derivatives has been shown to be more precise than the finite difference method (Seo et al., 2020a; Bar-Sinai et al., 2019), it is not guaranteed that the modules effectively provide transferable parameters across different spatial resolutions, discretizations, and fluctuations of function values. We explore whether the proposed graph-network-based spatial derivative modules can serve as feature providers for different spatial functions and discretizations.

Table 7.1: Parameters for the synthetic dataset.

| Parameter | Meta-train | Meta-test |
| # nodes ($N$) | {256, 625} | {450, 800} |
| # edges per node ($E$) | {4, 8} | {3, 6, 10} |
| Initial frequency ($F$) | {2, 5} | {3, 7} |
The prediction results for spatial derivatives are shown in Table 7.2. The results show that the proposed module (SDM) is eciently adaptable to dierent conguration on few samples from meta-initialized parameters compared to learning from scratch. The nding implies that the parameters for spatial derivatives can be generally applicable across dierent spatial resolution, discretization, and function uctuation. 7.4 Experimental Evaluation 7.4.1 Preliminary: Which Synthetic Dynamics Need to be Gen- erated? While Table 7.2 demonstrates that the PDE-independent representations are reusable across dierent congurations, it is still an open question: which topological conguration needs to be used to construct the synthetic dynamics? According to Table 7.2, the most important factor aecting error is an initial frequency (F ), which determines min/max scales and uctuation of function values, and it implies that the synthetic dynamics should be similarly scaled to a target dynamics. We use the same topological conguration in Table 7.1 to generate synthetic datasets for a task in Sec. 7.4.4 and adapted conguration for a task in Sec. 7.4.5. We describe more details in Appendix. 7.4.2 Meta-train Data For all experiments, we generate the data of meta-train tasks from the 2D convection- diusion equation (Equation 7.4). We set = 1;D = 0:2 for meta-train data. We rst solve the equation on a 100 100 grid with the spectral method under the time resolution 5e-3, then uniformly sample 250 locations from all grid points as observed nodes to simulate the case where the observations are irregularly distributed in space. We then 89 construct a 4-Nearest Neighbor (NN) graph based on the Euclidean distance as the input of Graph Neural Networks. Tasks We construct 100 sequences, each with the initial condition generated from Equa- tion 7.4 using a unique random seed. Each sequence lasts 20 frames with the timestep size 0.01. We set up 1 k-shot meta-train task on each sequence: predicting the values and 1st/2nd-order spatial derivatives on all nodes for all frames with an auto-regressive model given the rst frame as the input. The rstk frames are used for training and the rest 20k frames for test. We select k = 5; 10 as two experiment settings. 7.4.3 Meta-test Synthetic Data and Task We generate the synthetic meta-test data from Equation 7.4 but set = 0:8;D = 0:1 to simulate the realistic scenario where meta-train tasks and meta-test tasks do not share the same distribution. We reuse the method in 7.4.2 to generate data and construct 10 meta-test tasks for the synthetic meta-test experiment. Real-world dataset Air Quality and Extreme Weather Data We use air quality index and extreme weather observations as meta-test (target) datasets. See Section 3.5.4 for details. Tasks : • AQI-CO: We select the rst sequence of carbon monoxide (CO) ppm records from each month in the year 2015 at land-based stations, and set up the meta-test task on each sequence as the prediction of CO ppm. We construct a 6-NN graph based on the geodesic distances among stations. 90 • ExtremeWeather: First, we aggregate all bounding boxes into multiple sequences. In each sequence, all bounding boxes (1) are in consecutive time steps, (2) are aected by the same type of extreme weather, and (3) have an intersection over union (IoU) ratio above 0.25 with the rst bounding box in the sequence. Then we select the top-10 longest sequences. 
For each sequence, we consider its rst bounding box A as the region aected by an extreme weather event, and extend it to a new sequence of 20 frames by cropping and appending the same region A from successive frames. For each region we uniformly sample 10% of available pixels as observed nodes to simulate irregularly spaced weather stations and build a 4-NN graph based on the Euclidean distance. Figure 3.7 visualizes the rst 5 frames of one extended sequence. In the single feature experiment, we set up a meta-test task on each extended sequence as the prediction of the surface temperature (TS) on all observed nodes with the initial TS given only. Training Settings Training Hyperparameters : For all meta-train and meta-test tasks, we use the Adam optimizer with the learning rate 1e-3. In each training epoch, we sample 1 task from all available tasks. Environments : All experiments are implemented with Python3.6 and PyTorch 1.3.0, and are conducted with NVIDIA GTX 1080 Ti GPUs. 7.4.4 Multi-step Graph Signal Generation Task: We adopt a set of multi-step spatiotemporal sequence generation tasks to evaluate our proposed framework. In each task, the data is a sequence ofL frames, where each frame is a set of observations on N nodes in space. Then, we train an auto-regressive model with the rstT frames (T -shot) and generate the followingLT frames repeatedly from a given 91 T -shot Method AQI-CO ExtremeWeather 5-shot FDM+RGN (scratch) 0.0290.004 0.9880.557 PA-DGN (scratch) 0.0360.009 0.9650.138 PiMetaL (meta-init) 0.0250.006 0.9170.075 7-shot FDM+RGN (scratch) 0.0260.002 0.7630.060 PA-DGN (scratch) 0.0230.002 0.7480.020 PiMetaL (meta-init) 0.0180.002 0.7270.009 10-shot FDM+RGN (scratch) 0.0210.001 0.7090.003 PA-DGN (scratch) 0.0150.001 0.4160.015 PiMetaL (meta-init) 0.0120.001 0.4070.025 Table 7.3: Multi-step prediction results (MSE) and standard deviations on the two real-world datasets. initial input (T -th frame) to evaluate its performance. Datasets: For all experiments, we generate meta-train tasks with the parameters de- scribed in Table 7.1 and the target observations are 2 real-world datasets: (1) AQI-CO: national air quality index (AQI) observations Berman (2017); (2) ExtremeWeather: the extreme weather dataset Racah et al. (2017b). For the AQI-CO dataset, we construct 12 meta-test tasks with the carbon monoxide (CO) ppm records from the rst week of each month in 2015 at land-based stations. For the extreme weather dataset, we select the top-10 extreme weather events with the longest lasting time from the year 1984 and construct a meta-test task from each event with the observed surface temperatures at randomly sampled locations. Since each event lasts fewer than 20 frames, each task has a very limited amount of available data. In both datasets, graph signals are univariate. Note that all quantities have uidic properties such as diusive and convection. More details are in the supplementary material. Baselines: We evaluate the performance of a physics-aware architecture (PA-DGN) Seo et al. (2020a), which also consists of spatial derivative modules and recurrent graph networks (RGN), to see how the additional spatial information aects prediction performance for same architecture. 
Discussion: Table 7.3 shows the multi-step prediction performance of our proposed framework against the baselines on the real-world datasets. Overall, PA-DGN and PiMetaL show a similar trend: the prediction error decreases as longer series become available for few-shot adaptation. There are two important findings. First, with similar expressive power in terms of the number of learnable parameters, the meta-initialized spatial modules provide high-quality representations that are easily adaptable across different spatiotemporal dynamics in the real world. This performance gap demonstrates that we can obtain a stronger inductive bias from synthetic datasets without knowing PDE-specific information. Second, the contribution of the meta-initialization is more significant when the length of the available sequence is shorter (T = 5), which demonstrates when the meta-initialization is particularly effective. Finally, the finite difference method provides proxies of exact spatial derivatives and these representations are useful particularly when T = 5, but its performance saturates rapidly; this comes from the gap between the learnable spatial modules and fixed numerical coefficients. The results provide a new point of view on how to utilize synthetic or simulated datasets to handle challenges caused by limited datasets.

7.4.5 Graph Signal Regression

Task, datasets, and baselines: Defferrard et al. (2020) conducted a graph signal regression task: predict the temperature $x_t$ from the temperature on the previous 5 days ($x_{t-5}, \ldots, x_{t-1}$). We split the GHCN dataset spatially into two regions: (1) the USA (1,705 stations) and (2) Europe (EU) (703 stations), where many weather stations are fully functioning. In this task, the number of shots is defined as the number of input/output pairs used to train a model. As the input length is fixed, more variants of graph neural networks are considered as baselines. We concatenate the 5-step signals and feed them into graph convolutional networks (GCN) (Kipf & Welling, 2017a), graph attention networks (GAT) (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), and graph networks (GN) (Sanchez-Gonzalez et al., 2018) to predict the next signals across all nodes. A sketch of this input construction is given below.
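The following minimal sketch illustrates the regression setup: per-station histories are concatenated and passed through one graph convolution. A simple symmetrically normalized adjacency stands in for the GCN/GAT/GraphSAGE/GN layers; the toy adjacency and sizes are assumptions, not the thesis configuration.

```python
# A minimal sketch: concatenate the previous 5-day signals per station and
# apply one GCN-style layer (D^-1/2 (A+I) D^-1/2 X W) to regress x_t.
import torch
import torch.nn as nn

N, P = 100, 5                                   # stations, input days
A = (torch.rand(N, N) < 0.05)                   # toy random adjacency
A = (A | A.t()).float()                         # symmetrize
A.fill_diagonal_(1.0)                           # add self-loops
deg = A.sum(1)
A_hat = A / (deg.sqrt()[:, None] * deg.sqrt()[None, :])

x_hist = torch.randn(N, P)                      # x_{t-5}, ..., x_{t-1} per node
W = nn.Linear(P, 1)                             # shared feature transform
x_pred = A_hat @ W(x_hist)                      # one graph convolution -> x_t
print(x_pred.shape)                             # torch.Size([100, 1])
```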
Table 7.4: Graph signal regression results (MSE, ×10⁻³) and standard deviations on the two regions of weather stations.

T-shot (Region)   GCN           GAT           GraphSAGE
5-shot (USA)      2.742±0.120   2.549±0.115   2.128±0.146
10-shot (USA)     2.371±0.095   2.178±0.066   1.848±0.206

T-shot (Region)   GN            PA-DGN        PiMetaL
5-shot (USA)      2.252±0.131   1.950±0.152   1.794±0.130
10-shot (USA)     1.949±0.115   1.687±0.104   1.567±0.103

T-shot (Region)   GCN           GAT           GraphSAGE
5-shot (EU)       1.218±0.218   1.161±0.234   1.165±0.248
10-shot (EU)      1.186±0.076   1.142±0.070   1.044±0.210

T-shot (Region)   GN            PA-DGN        PiMetaL
5-shot (EU)       1.181±0.210   0.914±0.167   0.781±0.019
10-shot (EU)      1.116±0.147   0.831±0.058   0.773±0.014

Discussion: Table 7.4 shows the results of the graph signal regression task across the different baselines and the proposed method. There are two patterns in the results. First, although in general we observe an improvement in performance for all methods when we move from the 5-shot setting to the 10-shot setting, PiMetaL yields the smallest error. Second, for the EU dataset, while 5-shot already seems enough to achieve stable performance, the PDE-independent representations make the regression error converge to a lower level. Overall, the experimental results show that the spatial representations learned from simulated dynamics are beneficial for learning on limited data.

Chapter 8
Spatiotemporal Modeling via Physics-aware Causality

Deep neural networks are highly expressive and able to learn unspecified representations by minimizing a specified objective for a given task. Despite this efficient data-driven learning, interpretable inductive biases can be beneficial for constructing robust models as well as for the learning process. In this work, we propose physics-aware graph-based spatiotemporal networks with a causal module, which leverage additional causal information described in partial differential equations (PDEs) in physical systems. With partially provided causality labels, we are able to specify causal weights from spatially close and temporally past observations to current observations via semi-supervised learning, and to differentiate the importance of each relation without requiring costly computation. Extensive experiments on simulated time series based on causal relations and on real-world graph signals show that the proposed model improves prediction performance by utilizing physics-based domain knowledge.

8.1 Motivation

Recently, many works have been conducted under the topic of physics-informed machine learning. Wang et al. (2020b) introduce physics-informed models inspired by a particular form of partial differential equations (PDEs), which are used to numerically solve turbulent dynamics. Greydanus et al. (2019) explicitly impose Hamiltonian mechanics as a constraint for modeling classical dynamics. Raissi et al. (2019) propose physics-informed neural networks which directly solve particular PDEs. While various kinds of physics-based knowledge, such as equations, laws, and physical principles, have been incorporated into deep neural networks, the concept of causality is barely used in this context. Among the prior knowledge in physics, causality is a fundamental concept, as physically meaningful solutions should follow the principle that an effect cannot occur before its cause. To this end, we propose a novel framework which consists of a causal module and spatiotemporal graph networks. The causal module learns causal relations directly from explicitly given labels, and the pairwise representations are combined with corresponding causes to predict effects through the spatiotemporal graph networks.

One fundamental question naturally arises: how are causal relations available from spatiotemporal datasets? Theoretical knowledge about a certain type of spatiotemporal dataset can provide guiding (not governing) equations. It is common that a partially guiding equation is known for a particular dataset, although the exact governing equation is unspecified. For example, de Bezenac et al. (2018) leveraged a transport equation called the advection-diffusion equation to predict sea surface temperature, and Seo et al. (2020b) utilized the form of the continuity equation for a temperature prediction task.
Once we have a general equation, which is partially beneficial for understanding the targeted dynamics, the equation can be decomposed into causes and effects analytically. For example, if the heat equation ($\frac{\partial u}{\partial t} = D\Delta u$, where $D$ is a diffusivity constant) is considered as a guideline for the observations, we know that a temporally first-order and a spatially second-order derivative are involved. Therefore, we can specify causes and effects from the causal perspective on a discrete domain:

$$u_i(t+1) = u_i(t) + \Delta t\, D\, \Delta u, \qquad (8.1)$$
$$\phantom{u_i(t+1)} = u_i(t) + \Delta t\, D \sum_{j \in \mathcal{N}_i} \left(u_i(t) - u_j(t)\right), \qquad (8.2)$$

where $\Delta$ is the Laplace operator and $\mathcal{N}_i$ is the set of nodes adjacent to the $i$-th node. Equation 8.2 shows the discrete Laplace operator. Since our target value is $u_i(t+1)$ in Equation 8.2, the variables on the right-hand side are regarded as known causes from the heat equation. As a result, we can assign explicit causal labels between the subset of nodes associated with the equation (Figure 8.1), and the physics-aware causality is incorporated into spatiotemporal graph networks. It is worth noting that only a subset of the possible relations is labeled by the equation-based causality, and the other unlabeled relations are still used to extract node-wise embeddings (semi-supervised learning).

Figure 8.1: Heat dissipation over 2D space and time. The nodes in a graph structure correspond to sensors and the observations at each sensor are time-varying. Given the heat equation ($\dot{u} = D\Delta u$), we can provide spatial (blue) and temporal (green) causal relations from previous nodes to a current target node (white).

8.2 Problem Formulation

We first formalize the learning problem associated with spatiotemporal observations from physical systems with a graph structure, such as climate observations from a weather sensor network. We assume that a (static) graph structure $\mathcal{G}_s = (\mathcal{V}_s, \mathcal{E}_s)$ shared across different timestamps is given (or can be constructed from features of each node) and that observational data $X_1, \ldots, X_T$, where $X_t \in \mathbb{R}^N$, are defined on the nodes in the graph. As there are $N$ different nodes, the observations ($X \in \mathbb{R}^{T \times N}$) can be regarded as multivariate time series.

Besides the multivariate time series, we have prior knowledge from physical principles for the targeted observations. If the observations come from weather sensors such as an automatic weather station (AWS), domain-specific knowledge or equations related to weather phenomena should be beneficial for understanding the dynamics. For example, one guiding equation for turbulent dynamics is the Navier-Stokes equation Wang et al. (2020b). These equations can be commonly represented as a function of spatial and time derivatives:

$$F(\dot{u}, \ddot{u}, \ldots, \nabla u, \nabla^2 u, \ldots) = 0, \qquad (8.3)$$

where $\dot{u}$ and $\ddot{u}$ are first- and second-order time derivatives, respectively, and $\nabla$ is an operator for spatial derivatives. As the continuous operators can be numerically decomposed on a discrete domain such as a graph structure, we can explicitly define causes of a targeted observation at time $t$, and the causal relations are discovered accordingly. Note that the causal relations based on a certain equation are not fully but only partially complete because the true governing equation is usually unknown.

Given the physics-aware causal relations, we can partially assign explicit labels between $NK$ variables, where $K$ is the maximum time lag for causality. In the length-$K$ observations $X_{t-K+1}, \ldots, X_t$, there are $NK$ observations, which are mutually correlated, and we define $N_c$ causal relations out of the $NK \times NK$ possible relations. In Figure 8.1, we have $N = 5$ nodes in $\mathcal{G}_s$ and the number of variables in the length $K = 2$ sequence is 10. Thus, there are 100 possible relations between the 10 variables, and the heat equation (Equation 8.2) elucidates $N_c = 13$ (5 temporal and 8 spatial) causal relations. We denote the causal graph by $\mathcal{G}_c = (\mathcal{V}_c, \mathcal{E}_c)$, where $|\mathcal{V}_c| = NK$ and $|\mathcal{E}_c| = N_c$.
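For concreteness, the label counting in this example can be reproduced programmatically. The following is a minimal sketch that assumes a 5-node path graph, one possible 4-edge topology consistent with the $N_c = 13$ count; the figure's actual topology may differ.

```python
# A minimal sketch: enumerate heat-equation causal labels on a toy graph.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]          # undirected spatial edges
N, K = 5, 2                                       # nodes, causal time window

neighbors = {i: set() for i in range(N)}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Heat-equation causes of u_i(t): u_i(t-1) itself plus u_j(t-1) for j in N_i.
causal_labels = []
for i in range(N):
    causal_labels.append(((i, "t-1"), (i, "t")))          # temporal self-link
    for j in neighbors[i]:
        causal_labels.append(((j, "t-1"), (i, "t")))      # spatial link

print(len(causal_labels))   # 13 = 5 temporal + 8 spatial, out of (N*K)**2 = 100
```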
Once the spatiotemporal data ($X$) and the (partially available) causal relations ($\mathcal{G}_c$) are given, our task is to find a model:

$$\hat{X}_{t+1} = F(X_{t-K+1}, \ldots, X_t; \mathcal{G}_s, \mathcal{G}_c; \Theta), \qquad (8.4)$$

where $\Theta$ is the set of learnable parameters of the model $F(\cdot)$.

Figure 8.2: Proposed model: causality-aware spatiotemporal graph networks (C-STGN). A sequence of graph signals is fed into two modules: (1) spatiotemporal graph networks for causality (STGC) and (2) spatiotemporal graph networks for values (STGV). Guiding PDEs from physical principles (e.g., the diffusion equation or the continuity equation) provide explicit (partially available) causal labels. The red arrows denote how the supervised objectives are defined.

8.3 Proposed Model

In this section, we describe how the proposed model, called Causality-aware Spatiotemporal Graph Networks (C-STGN), is designed. Figure 8.2 shows a high-level view of C-STGN. As illustrated in Figure 8.2, C-STGN consists of two main parts: (1) spatiotemporal graph networks for causality (STGC) and (2) spatiotemporal graph networks for values (STGV). Note that both networks are designed to learn node representations from spatially and temporally correlated observations.

Spatiotemporal graph networks for causality (STGC): We first learn node-wise latent representations with two modules: a spatial encoder (SE) and a temporal encoder (TE). The spatial encoder is designed to learn spatial dependencies at each timestamp via the static graph structure $\mathcal{G}_s$. Graph convolutional networks such as GCN Kipf & Welling (2017b) or GraphSAGE Hamilton et al. (2017) are used to aggregate spatially neighboring information in a permutation-invariant manner. Then, we have $K$ different snapshots from the spatial encoder, and the snapshots are fed into the temporal encoder:

$$\{S_{t'} = SE(X_{t'}; \mathcal{G}_s) \mid t' = t-K+1, \ldots, t\}, \qquad (8.5)$$
$$\{Z_{t'} = TE(\{S_{t'-P}, \ldots, S_{t'}\}) \mid t' = t-K+1, \ldots, t\}, \qquad (8.6)$$

where $Z_{t'} \in \mathbb{R}^{N \times D_c}$ is a set of node representations (of dimension $D_c$) at time $t'$. $P$ is an aggregation order, and $TE$ merges a current embedding $S_{t'}$ and the past $P$ embeddings $S_{t'-1}, \ldots, S_{t'-P}$ into spatiotemporal node embeddings at $t'$. Note that the temporal encoder does not consider the graph structure. Once node embeddings are obtained, two $D_c$-dimensional vectors are fed into a causal module (CM), which computes the probability of causality between the two corresponding nodes:

$$p^{t_j,t_i}_{j,i} = CM(Z_{t_j,j}, Z_{t_i,i}), \qquad (8.7)$$

where $CM$ is a fully connected network that computes causality and $Z_{t_j,j}$ is the $j$-th node representation at time $t_j$. There are $N$ different nodes at each $t$ and $K$ different timestamps; thus, there are $N^2K^2$ different $p$ values. Equation 8.7 is similar to the key and query matching mechanism in the transformer Vaswani et al. (2017). A sketch of the causal module follows.
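The following is a minimal sketch of the causal module, assuming CM is an MLP on the concatenated pair of embeddings with a sigmoid output; the layer sizes are illustrative, not the thesis configuration.

```python
# A minimal sketch of CM: a pairwise MLP over (cause, effect) embeddings.
import torch
import torch.nn as nn

class CausalModule(nn.Module):
    def __init__(self, d_c: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_c, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, z_cause: torch.Tensor, z_effect: torch.Tensor):
        # z_cause, z_effect: (..., d_c) embeddings of a candidate cause/effect.
        return self.mlp(torch.cat([z_cause, z_effect], dim=-1)).squeeze(-1)

# All (N*K) x (N*K) pairwise causal probabilities (Equation 8.7).
N, K, d_c = 5, 2, 16
Z = torch.randn(N * K, d_c)                      # flattened (node, time) embeddings
cm = CausalModule(d_c)
p = cm(Z.unsqueeze(1).expand(-1, N * K, -1),     # candidate cause index
       Z.unsqueeze(0).expand(N * K, -1, -1))     # candidate effect index
print(p.shape)                                   # torch.Size([10, 10])
```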
If the observations are stationary and the causal relations are independent of the absolute timestamps $(t_j, t_i)$ but depend only on the relative time interval $\tau = t_i - t_j$, Equation 8.7 can be reduced to:

$$p^{\tau}_{j,i} = CM(Z_{t_j,j}, Z_{t_i,i}). \qquad (8.8)$$

Spatiotemporal graph networks for values (STGV): Similarly, the spatiotemporal graph networks for values are also used to learn node representations from the spatiotemporal observations. However, their purpose is to predict the next signals instead of inferring causality. Thus, we introduce another module, called the value module (VM), to transform the spatiotemporal representations ($Z$) into task-specific representations. As CM learns causality-associated representations, VM is adapted to learn prediction-associated representations. The separation is inspired by the architecture of the transformer.

$$\{H_{t'} = VM(Z_{t'}) \mid t' = t-K+1, \ldots, t\}, \qquad (8.9)$$

where $H_{t'} \in \mathbb{R}^{N \times D_v}$. Since the causal relations from the $NK$ past variables to the $N$ current variables are inferred by STGC, the causal probabilities $p^{t_j,t_i}_{j,i}$ (Equation 8.7) are combined with $H$ to predict the next variables. Specifically, the output ($H$) from VM in STGV and $p$ from CM in STGC are used to predict the next variable at node $i$ and time $t+1$:

$$\hat{X}_{t+1,i} = \sum_{t'=t-K+1}^{t-1} \sum_{j \in \mathcal{N}_i} p^{t',t}_{j,i}\, H_{t',j}. \qquad (8.10)$$

It is worth noting that we use the causal probabilities between $t' \in [t-K+1, t-1]$ and $t$ instead of between $t' \in [t-K+1, t]$ and $t+1$. There are two reasons. First, since $X_{t+1}$ is not available, it is impossible to compute $p^{t',t+1}$, which is a function of $X_{t+1}$, in advance. Second, we assume that the causality is stationary; thus, the causality between $t'$ and $t$ is invariant if $\tau = t - t'$ is unchanged. The second assumption is particularly valid for spatiotemporal observations in physical systems, as most physics-based phenomena do not depend on the absolute time but on relative time intervals. A sketch of this prediction step is given at the end of this section.

Additional causality labels from physics principles: In Section 8.2, we assume that the causal relations are given as explicit labels based on the guiding equation (Equation 8.3). However, there is a challenge in directly using the labels to update the causal module and the backbone STG. The PDE exclusively provides information about which past and neighboring variables can be considered possible causes of a current variable. Thus, there are only positive labels, and opposite labels, i.e., which variables should not be causes, are not provided. Since the partially available labels are highly imbalanced, CM can easily overfit to the positive-only labels. We address this challenge by introducing not-causal labels based on the physical principle that an effect cannot occur before its cause. The not-causal labels are described as:

$$\{n^{t_j,t_i}_{j,i} = 0 \mid t_i - t_j < 0\}. \qquad (8.11)$$

Equation 8.11 is the set of relations where the timestamp ($t_j$) of a candidate cause ($X_{t_j,j}$) is later than that of a candidate effect ($X_{t_i,i}$). Despite the availability of the not-causal labels, the imbalance issue still exists, as the cardinality of $\{n^{t_j,t_i}_{j,i}\}$ is much larger than that of $\{p^{t_j,t_i}_{j,i}\}$. We mitigate this by subsampling as many not-causal labels as there are available causal labels.
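The following is a minimal sketch of the aggregation in Equation 8.10, assuming dense tensors for $p$ and $H$ and a random spatial mask; the final linear read-out from the aggregated $D_v$-dimensional features is an illustrative assumption.

```python
# A minimal sketch of Equation 8.10: causality-weighted value aggregation.
import torch

N, K, d_v = 5, 3, 8
H = torch.randn(K - 1, N, d_v)        # H_{t'} for t' = t-K+1, ..., t-1 (VM output)
p = torch.rand(K - 1, N, N)           # p[t', j, i]: causality from (j, t') to (i, t)
adj = torch.eye(N).bool() | torch.rand(N, N).gt(0.5)   # toy spatial mask (j in N_i)

p = p * adj.float()                   # keep only spatially adjacent causes
# X_hat[i] = sum_{t'} sum_{j in N_i} p[t', j, i] * H[t', j], then a read-out.
agg = torch.einsum("kji,kjd->id", p, H)               # (N, d_v)
x_hat = agg @ torch.randn(d_v, 1)                     # (N, 1) next-step values
print(x_hat.squeeze(-1).shape)        # torch.Size([5])
```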
8.4 Experimental Results

We evaluate the proposed method on two datasets: (1) synthetic time series with known causal relations and (2) daily observations of meteorological elements. We first examine the causal module (STGC) on the synthetic time series and C-STGN on the graph signal prediction task with real-world observational data.

8.5 Experimental Settings

8.5.1 Synthetic Time Series Generation

We first generate multivariate time series $X \in \mathbb{R}^{T \times N}$ from known causal relations. Consider $N$ different stationary time series where each series influences the others in a time-lagged manner. At time $t$, a variable in the $i$-th time series, $X_{t,i} \in \mathbb{R}$, is defined as a function of variables at $t' < t$, as described in Runge et al. (2019b) (implementation: https://github.com/jakobrunge/tigramite):

$$X_{t,i} = \sum_{t'=t-P}^{t-1} \sum_{j=1}^{N} f^{t',t}_{j,i}(X_{t',j}) + \epsilon. \qquad (8.12)$$

We generate two series based on linear and nonlinear causality.

Linear causality:

$X_{t,0} = 0.7 X_{t-1,0} + \epsilon$
$X_{t,1} = 0.8 X_{t-1,1} + 0.8 X_{t-1,3} + \epsilon$
$X_{t,2} = 0.5 X_{t-1,2} + 0.5 X_{t-2,1} + 0.6 X_{t-3,3} + \epsilon$
$X_{t,3} = 0.4 X_{t-1,3} + \epsilon$
$X_{t,4} = 0.9 X_{t-2,2} + 0.1 X_{t-3,6} + \epsilon$
$X_{t,5} = 0.2 X_{t-1,0} + 0.2 X_{t-2,0} + 0.2 X_{t-3,0} + \epsilon$
$X_{t,6} = \epsilon$

where $\epsilon \sim \mathcal{N}(0, 1)$. In the linear causal series, there are 12 causal relations between the $N = 7$ series and the maximum time lag in the causal relations is 3.

Nonlinear causality:

$X_{t,0} = \epsilon$
$X_{t,1} = 0.2 (X_{t-1,1})^2 + 0.7 X_{t-2,2} + \epsilon$
$X_{t,2} = 0.3 (X_{t-2,0})^3 + 0.05 X_{t-1,3} + \epsilon$
$X_{t,3} = -0.09 (X_{t-3,2})^2 + 0.4 X_{t-1,5} + \epsilon$
$X_{t,4} = 0.2 (X_{t-1,0})^2 + 0.01 X_{t-3,1} - 0.2 (X_{t-1,5})^2 + \epsilon$
$X_{t,5} = \epsilon$
$X_{t,6} = 0.3 X_{t-1,5} + 0.3 X_{t-2,4} - 0.3 X_{t-3,3} + \epsilon$
$X_{t,7} = -0.2 (X_{t-1,0})^2 + 0.7 X_{t-2,8} + \epsilon$
$X_{t,8} = -0.3 (X_{t-1,0})^3 + 0.05 X_{t-2,0} + \epsilon$
$X_{t,9} = 0.9 X_{t-3,1} + \epsilon$
$X_{t,10} = -0.02 X_{t-1,0} + 0.1 X_{t-3,6} - 0.2 (X_{t-1,4})^2 + \epsilon$
$X_{t,11} = -0.3 X_{t-4,0} + \epsilon$
$X_{t,12} = -0.3 X_{t-1,11} + \epsilon$

where $\epsilon \sim \mathcal{N}(0, 1)$. In the nonlinear causal series, there are 22 causal relations between the $N = 13$ series and the maximum time lag in the causal relations is 4 (see $X_{t,11}$). When we conduct the intra-causality retrieval experiment, we feed length-4 series from $X_{t-3}$ to $X_t$ to the classifier. Thus, the causality at time lag 4 in $X_{t,11}$ is not labelled.

Figure 8.3: Generated multivariate time series from the given causal relations. (a) Time series from linear causality. (b) Time series from nonlinear causality.

8.5.2 Causality Classification

Synthetic time series generation: We first generate multivariate time series $X \in \mathbb{R}^{T \times N}$ from known causal relations. Consider $N$ different stationary time series where each series influences the others in a time-lagged manner. At time $t$, a variable in the $i$-th time series $X_{t,i} \in \mathbb{R}$ is defined as a function of variables at $t' < t$, as described in Runge et al. (2019b):

$$X_{t,i} = \sum_{t'=t-P}^{t-1} \sum_{j=1}^{N} f^{t',t}_{j,i}(X_{t',j}) + \epsilon, \qquad (8.13)$$

where $P$ is the auto-regressive order across time and $\epsilon$ is a noise term that is independent of any other variables. Note that $f^{t',t}_{j,i}(\cdot)$ is regarded as a causal function from a previous variable at $(j, t')$ to a current variable at $(i, t)$. Since the time series are stationary, the function $f^{t',t}_{j,i}(\cdot)$ in Equation 8.13 can be relaxed to $f^{t-t'}_{j,i}(\cdot)$. We define the causal function in two different ways: (1) linear and (2) non-linear conditional independence. For both settings, we generate length-$T = 1000$ time series across $N = 7$ (linear) and $N = 13$ (non-linear) nodes, as in the sketch below. More detailed information can be found in the Appendix.
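The following is a minimal sketch that simulates the linear causal system above; the burn-in length and zero initialization are illustrative assumptions.

```python
# A minimal sketch: simulate the 7-series linear causal system (T = 1000).
import numpy as np

def simulate_linear(T=1000, burn_in=100, seed=0):
    rng = np.random.default_rng(seed)
    N, max_lag = 7, 3
    X = np.zeros((T + burn_in, N))
    for t in range(max_lag, T + burn_in):
        e = rng.standard_normal(N)               # epsilon ~ N(0, 1) per series
        X[t, 0] = 0.7 * X[t-1, 0] + e[0]
        X[t, 1] = 0.8 * X[t-1, 1] + 0.8 * X[t-1, 3] + e[1]
        X[t, 2] = 0.5 * X[t-1, 2] + 0.5 * X[t-2, 1] + 0.6 * X[t-3, 3] + e[2]
        X[t, 3] = 0.4 * X[t-1, 3] + e[3]
        X[t, 4] = 0.9 * X[t-2, 2] + 0.1 * X[t-3, 6] + e[4]
        X[t, 5] = 0.2 * (X[t-1, 0] + X[t-2, 0] + X[t-3, 0]) + e[5]
        X[t, 6] = e[6]
    return X[burn_in:]                            # (T, N) stationary series

X = simulate_linear()
print(X.shape)                                    # (1000, 7)
```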
Task formulation: Given $N$ different stationary series (or nodes), we train a model to predict whether there is significant causality between two variables $X_{t',j}$ and $X_{t,i}$. Since the auto-regressive order is $P$, there are $NP \times N$ potential causal relations from the $N$ variables $X_{t'}$, where $t' \in [t-P, t-1]$, to the $N$ variables at time $t$. The true causal relations are explicitly given as labels during training, and a model is evaluated in two different aspects: (1) inter-causality classification and (2) intra-causality retrieval. For the inter-causality classification, we split a simulated multivariate time series into two parts along the time axis: $\{X_t \mid t = 1, \ldots, T_{train}\}$ and $\{X_t \mid t = T_{train}, \ldots, T\}$. A model is trained on the first series (a training set) and evaluated on the second series (a testing set). For the intra-causality retrieval, we only use a subset of the known labels to train a model and evaluate whether the model can retrieve the unseen labels correctly.

Model and baselines: The task can be considered as learning directional edge representations from a variable at $t' \in [t-P, t-1]$ to another variable at $t$, so we consider three kinds of baselines as follows. First, we feed two node values into an MLP to predict the causality. The other baselines utilize a spatial or a temporal module, respectively, to aggregate neighboring spatial/temporal values, and then the two aggregated node features are fed into an MLP that returns the causal probability. For the spatial encoder (SE), we use GCN Kipf & Welling (2017b), Chebyshev graph convolutional networks (CHEB) Defferrard et al. (2016), and GraphSAGE Hamilton et al. (2017), and the temporal encoder (TE) concatenates node variables over the auto-regressive order. STGC combines the two encoders spatiotemporally, and the resulting node representations are fed into an MLP.

Inter-causality classification: For the inter-causality classification task, we report the mean recall, AUC, and cross-entropy error (CE, with standard deviations) on the test series.

Table 8.1: Inter-causality classification.

Linear causality
Model      Recall        AUC           CE
MLP        0.579±0.124   0.670±0.012   0.611±0.015
GCN+MLP    0.193±0.126   0.508±0.008   0.669±0.004
CHEB+MLP   0.577±0.055   0.677±0.010   0.585±0.017
SAGE+MLP   0.554±0.161   0.668±0.035   0.583±0.014
TE+MLP     0.756±0.038   0.858±0.020   0.435±0.026
STGC       0.767±0.023   0.885±0.011   0.340±0.035

Non-linear causality
Model      Recall        AUC           CE
MLP        0.365±0.211   0.533±0.023   0.658±0.013
GCN+MLP    0.241±0.194   0.511±0.002   0.677±0.013
CHEB+MLP   0.416±0.124   0.551±0.013   0.650±0.011
SAGE+MLP   0.367±0.101   0.554±0.006   0.637±0.015
TE+MLP     0.438±0.107   0.611±0.051   0.625±0.017
STGC       0.503±0.041   0.689±0.013   0.522±0.015
In this task, the causality among the $NP \times N$ potential relations is sparse, so recall, which tells how many actual causal relations are retrieved, is particularly important. We evaluate the proposed model in two different settings, (1) linear and (2) non-linear causality, and the results in Table 8.1 demonstrate that the proposed model successfully outperforms the other baselines in both settings. More specifically, according to AUC, all models are able to distinguish non-causal and causal relations in the linear setting. However, the temporal change is particularly important for understanding the causality among the variables. For the non-linear setting, the results show that all metrics are degraded significantly compared to the linear setting. Still, the temporal information is more important, but the spatial information can also be helpful (STGC vs. TE+MLP).

To evaluate the robustness of the proposed model, we intentionally add i.i.d. noise to the generated time series. Since the time series are contaminated by random noise after being causally generated, it becomes much harder to discover the underlying causality. Table 8.2 shows the AUC of the models in the linear and non-linear settings. While the AUCs decrease across the board, STGC is still able to learn meaningful representations from the spatiotemporal series, unlike the other methods. Note that when the scale of the noise is increased ($\mathcal{N}(0, 5^2)$), MLP and the spatial encoders followed by MLP are almost unable to distinguish between causal and non-causal relations (AUC is close to 0.5), and the same happens to TE+MLP for the non-linear series.

Table 8.2: Inter-causality classification with additional noise.

Linear causality (AUC)
Model      N(0, 1²)      N(0, 5²)
MLP        0.611±0.029   0.506±0.010
GCN+MLP    0.507±0.004   0.500±0.001
CHEB+MLP   0.627±0.010   0.513±0.008
SAGE+MLP   0.621±0.021   0.516±0.006
TE+MLP     0.827±0.021   0.697±0.012
STGC       0.849±0.020   0.712±0.013

Non-linear causality (AUC)
Model      N(0, 1²)      N(0, 5²)
MLP        0.517±0.013   0.499±0.004
GCN+MLP    0.502±0.002   0.500±0.004
CHEB+MLP   0.526±0.009   0.500±0.004
SAGE+MLP   0.527±0.007   0.502±0.003
TE+MLP     0.562±0.033   0.511±0.009
STGC       0.640±0.012   0.582±0.007

It is worth noting that the additive noise is a significant bottleneck for existing methods that discover causality in multivariate time series without explicit labels, such as PCMCI Runge et al. (2019b) and DYNOTEARS Pamfil et al. (2020). Without the additional noise, both methods are able to perfectly discover the causal directions; however, they largely lose this capability once the noise is included. Table 8.3 shows recall for existing causal discovery methods on the non-linear series (PCMCI based on partial correlations (PARC) and on Gaussian process regression with a distance correlation (GPDC)). It shows that STGC is able to learn robust representations for causal discovery from noisy series by utilizing the explicitly given labels.

Table 8.3: Recall for causal discovery methods.

Noise      PARC   GPDC   DYNOTEARS   STGC
N(0, 1²)   0.48   0.48   0.29        0.66
N(0, 5²)   0.00   0.00   0.00        0.48

Intra-causality retrieval: For the intra-causality retrieval task, we used the time series generated from non-linear causality with added noise. Note that there are 21 labelled causal relations in the series, and we split them into two parts for training and testing, respectively. By adjusting the number of causal relations shown in the training series, we can evaluate how robust the proposed model is even when the majority of the causal relations are not given during training. Table 8.4 shows the average behavior of the two classifiers trained on a subset of the causal relations in the time series.

Table 8.4: Intra-causality retrieval (AUC) from non-linear causal time series with N(0, 1²).

# of Train/Test causality labels   TE+MLP        STGC
16 / 5                             0.550±0.031   0.636±0.024
11 / 10                            0.546±0.023   0.620±0.010
6 / 15                             0.539±0.028   0.596±0.014
1 / 20                             0.501±0.011   0.585±0.018
While TE+MLP detects some unseen causal relations when the number of labels shown during training is large (16), its performance degrades quickly as the number of available labels decreases. STGC outperforms TE+MLP by a large margin, which supports that STGC can extract more informative spatiotemporal representations. Interestingly, even if only a single causal relation is given as a known label (1/20), STGC still retrieves unseen causal relations acceptably well.

8.5.3 Graph Signal Prediction

We evaluated C-STGN against baselines on graph signal prediction tasks with real-world observations from the climatology network (the Global Historical Climatology Network (GHCN) provided by the National Oceanic and Atmospheric Administration (NOAA)) Defferrard et al. (2020). See Section 3.5.5 for details.

The main task is the prediction of the next signals $X_{t+1}$ given the length-$P = 10$ past spatiotemporal series $X_{t-9}, \ldots, X_t$ under the graph structure. We split the series into training (60%), validation (10%), and testing (30%) series, and each measurement is normalized.

Table 8.5: Summary of prediction error (mean squared error, MSE) with standard deviations for the two regions.

          TMAX                              TMIN
Model     Western          Eastern          Western          Eastern
DCRNN     0.1324±0.0024    0.1585±0.0033    0.0707±0.0017    0.1317±0.0028
GCRN      0.1336±0.0082    0.1588±0.0027    0.0701±0.0004    0.1302±0.0009
STGV      0.1134±0.0003    0.1393±0.0011    0.0759±0.0024    0.1304±0.0038
C-STGN    0.1111±0.0014    0.1355±0.0034    0.0731±0.0009    0.1262±0.0036

          SNOW                              PRCP
Model     Western          Eastern          Western          Eastern
DCRNN     0.6757±0.0011    0.0406±0.0002    0.4703±0.0020    0.7588±0.0013
GCRN      0.6683±0.0012    0.0406±0.0001    0.4703±0.0009    0.7595±0.0001
STGV      0.6720±0.0070    0.0391±0.0008    0.4619±0.0047    0.6770±0.0042
C-STGN    0.6613±0.0035    0.0386±0.0007    0.4589±0.0033    0.6658±0.0025

Baselines: For this task, we compare against baselines recently introduced for similar tasks. DCRNN Li et al. (2018a) (diffusion convolutional recurrent neural network) is designed to capture spatial and temporal dependencies with random walks on a graph. GCRN Seo et al. (2018) (graph convolutional recurrent network) is a variant of LSTM in which the internal fully connected layers are replaced by graph convolution layers. Besides the external baselines, we also evaluate the internal baseline STGV, which is the proposed model without the causality module.

Causality labels from PDEs: For C-STGN, we use the causal relations in PDEs belonging to the family of the continuity equation, such as the diffusion, convection, and Navier-Stokes equations. These equations commonly describe how target observations vary spatiotemporally with respect to their second-order spatial derivatives and first-order time derivative. It is reasonable to consider these equations as an inductive bias providing additional information, because they are crucial for modeling climate-related observations Wang et al. (2020b), which are our target observations. Thus, in the graph structure, spatially 1-hop neighboring nodes ($j \in \mathcal{N}_i$) are considered adjacent causes of the observation at the $i$-th node, and observations at $t-1$ are potential causes of the observations at $t$ autoregressively. The existing causal labels can be described as follows:

$$\{p^{t_j,t_i}_{j,i} = 1 \mid t_i - t_j = 1 \text{ and } j \in \mathcal{N}_i\}. \qquad (8.14)$$

As discussed in Section 8.3, we also provide a set of not-causal labels (Equation 8.11) to balance the ratio of positive and negative labels; a sketch of this label construction follows.
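The following is a minimal sketch of building the positive labels of Equation 8.14 together with subsampled not-causal labels (Equation 8.11), assuming the autoregressive self-link is included (i.e., $j \in \mathcal{N}_i \cup \{i\}$); the toy graph and the uniform subsampling strategy are illustrative.

```python
# A minimal sketch: balanced causal / not-causal label sets from the PDE rule.
import random

def build_labels(neighbors, N, K, seed=0):
    pos, neg = [], []
    for t_i in range(K):                          # relative timestamps 0..K-1
        for t_j in range(K):
            for i in range(N):
                for j in range(N):
                    if t_i - t_j == 1 and j in neighbors[i] | {i}:
                        pos.append(((j, t_j), (i, t_i)))      # causal (p = 1)
                    elif t_i - t_j < 0:
                        neg.append(((j, t_j), (i, t_i)))      # not-causal (n = 0)
    random.Random(seed).shuffle(neg)
    return pos, neg[:len(pos)]                    # balance positives/negatives

neighbors = {0: {1}, 1: {0, 2}, 2: {1}}           # toy 3-node path graph
pos, neg = build_labels(neighbors, N=3, K=3)
print(len(pos), len(neg))                         # equal counts
```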
Results: We use the mean squared error (MSE) as the metric to compare C-STGN against the baselines. Table 8.5 shows that C-STGN mostly outperforms the other baselines across the different regions and measurements. Both DCRNN and GCRN replace the fully connected layers in a variant of RNNs (GRU and LSTM, respectively) with diffusion convolution and Chebyshev convolution layers. Thus, they aggregate spatiotemporal correlations similarly, which is reflected in the close prediction errors of the two baselines. Compared to STGV, we can clearly see how the additional causality-associated information is beneficial for modeling spatiotemporal data. Specifically, the MSE difference between STGV and C-STGN on PRCP in the Eastern states is the largest, which implies that the PDE-based causal labels are the most informative there. Note that all prediction errors on SNOW are similar, which is due to the spatially sparse sensors: if sensors are located far apart, the mutual causality is negligible or longer time lags need to be considered.

8.5.4 Interpretation of Learned Causality

Once the causal module is trained based on a guiding PDE, we can use the module to examine how the potential causes vary over space and time. Figure 8.4 shows how the causal probability changes over the two regions. C-STGN extracts causality-associated information from the spatiotemporal series. In Figure 8.4a, we can see that variables spatially close to the current observations have higher causality for PRCP and TMAX. A similar pattern appears in the other region (Figure 8.4b). On the other hand, sensors for SNOW are more related to farther sensors.

Figure 8.4: Average causal probability curves vs. the number of hops over all sensors in each region. (a) Western region. (b) Eastern region.

We can also extract which neighboring sensors have stronger or weaker causal relations to a particular sensor. In Figure 8.5, 4 sensors are sampled from each region to visualize how strongly their K-hop neighboring variables are causally related. We can see that the daily max temperatures at sensor 42 in the western region are strongly affected by spatially close (2-hop) sensors; however, the max temperature at sensor 25 is more likely dependent on sensors somewhat farther away (6 or 7 hops). On the other hand, sensor 26 is more dependent on mid-range sensors (4 or 5 hops). In the eastern states, sensors 2 and 3 are associated with close sensors; however, sensors 0 and 1 do not have distinct causal relations with their neighboring sensors. We find that the physics-aware causality is not only directly informative for spatiotemporal modeling but also enables the discovery of unspecified causal relations. The hop-wise averaging behind these figures is sketched below.

Figure 8.5: Average causal probability curves vs. the number of hops from particular sensors in each region (TMAX). (a) Western sensors. (b) Eastern sensors.
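The following is a minimal sketch of the hop-wise averaging behind Figures 8.4 and 8.5, assuming a dense matrix of learned causal probabilities p[j][i] and BFS hop distances on the sensor graph; all values here are illustrative.

```python
# A minimal sketch: average learned causal probabilities by hop distance.
from collections import defaultdict, deque

def hop_distances(neighbors, src):
    dist, queue = {src: 0}, deque([src])
    while queue:                                  # breadth-first search
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def mean_prob_by_hop(p, neighbors, target):
    dist = hop_distances(neighbors, target)
    buckets = defaultdict(list)
    for j, d in dist.items():
        buckets[d].append(p[j][target])           # causality from j to target
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}

neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # toy 4-node path graph
p = [[0.9, 0.8, 0.3, 0.1],                            # p[j][i], illustrative
     [0.7, 0.9, 0.6, 0.2],
     [0.2, 0.5, 0.9, 0.7],
     [0.1, 0.2, 0.6, 0.9]]
print(mean_prob_by_hop(p, neighbors, target=0))       # {0: 0.9, 1: 0.7, 2: 0.2, 3: 0.1}
```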
Chapter 9
Summary, Discussion and Future Work

In this thesis, we present a variety of approaches to model physical systems observed by spatially located sensors. While purely data-driven approaches are successful in a number of domains, it is still necessary to incorporate domain knowledge (physical equations, rules, and knowledge) to achieve performance improvement, robust modeling, extraction of transferable knowledge, and efficient learning.

This thesis presented a variety of approaches, DQ-LSTM, PaGN, PA-DGN, PiMetaL, and C-STGN, each of which provides an individual solution for incorporating physics-related knowledge into spatiotemporal observations from physical systems. DQ-LSTM is the first work to estimate the data quality of sensor-based observations by considering the physical properties of fluidic quantities, and the overall prediction performance is also improved. To leverage explicit equations rather than qualitative properties, we introduce a new architecture, PaGN, based on graph networks, which incorporates prior knowledge given in the form of PDEs over time and space. While existing works focus more on how to discover equations in data generated by explicit physics rules, we propose a method to leverage weakly given inductive biases describing the data. To propose a more general methodology for spatiotemporal observations than models requiring specific equations, PA-DGN is developed to approximate spatial derivatives, which are one of the essential components of PDEs and have a prominent role in physics-aware modeling. A meta-learning framework, PiMetaL, is introduced to prove that physics-related knowledge with auxiliary tasks effectively provides a strong inductive bias, which is generally applicable for fast adaptation of learnable models on fewer observations. Finally, we propose a graph-based network called C-STGN, which leverages the physics-aware causality from guiding equations as explicit labels for the downstream task of graph signal prediction. All approaches either improve spatiotemporal prediction performance or establish new ways to incorporate physical knowledge into neural networks.

In summary, this is the first physics-aware learning framework for spatiotemporal observations from physical systems. This thesis reveals the benefits of incorporating domain knowledge, specifically physics-based knowledge, with real-world observations, and demonstrates the effectiveness of interpretable inductive biases for modeling sparsely available spatiotemporal observations.

While the works in this thesis effectively leverage physics-based inductive biases, there are still many interesting research directions in several respects. First, we mostly assume spatially sparse datasets (e.g., sensor-based observations). In fact, some visual data (e.g., radar images) describe spatially continuous (or dense) observations, and different techniques need to be developed to handle such formats. Second, we focus on regression and prediction tasks from given historical spatiotemporal series. It would be interesting to consider other tasks, such as classification or detection in physical systems, to find anomalous patterns. Furthermore, we mostly use the prior knowledge as an assistant to data-driven models to improve their performance. Instead, if the inductive bias can be used to understand the behavior of black-box models, it enables the domain knowledge to serve as a main player in interpreting deep neural networks.

Based on the above discussion, we describe several extensions to the current framework:

• Build interpretable deep models by incorporating prior knowledge. As the prior or domain knowledge is analytical or empirical, it is human-understandable. Thus, incorporating the knowledge into data-driven (implicit) models is directly connected with the interpretability of the models. In this direction, discovering underlying equations or physical rules from ample data is going to be a promising research topic.
• Develop controllable deep models with prior knowledge. In particular, it is crucial to identify how important the physical contribution is compared to the supervised objective. Moreover, the identified knowledge weight should be controllable during evaluation, as different systems have different rule-dependency. Thus, automatically estimating, and enabling control of, the weight of physics-aware terms is required as a future topic. The controllable knowledge framework will be a general framework to incorporate physical knowledge with data-driven models, and it is the ultimate goal of this proposal.

• Improve deep neural networks for physical systems to be robust to random noise and even adversarial inputs. As observations from sensors inherently have unexpected noise, building robust spatiotemporal models is a long-lasting topic. This could require new incorporation techniques that can minimize the effect of unnecessary noise, or novel architectures that can extract important representations more efficiently.

• Finally, the human-understandable knowledge can be used to quantify the uncertainty of deep neural networks or learnable models. It is challenging to measure how reliable deep models are and to quantify whether the models' expected and/or unexpected behaviors are predictable. The inductive bias can reveal the uncertainty level of deep models through comparison of the models' output and the incorporated knowledge.

Bibliography

Agrawal, S., Barrington, L., Bromberg, C., Burge, J., Gazen, C., and Hickey, J. Machine learning for precipitation nowcasting from radar images. arXiv preprint arXiv:1912.12132, 2019.

Akbari, H., Yuan, L., Qian, R., Chuang, W.-H., Chang, S.-F., Cui, Y., and Gong, B. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. arXiv preprint arXiv:2104.11178, 2021.

Alet, F., Lozano-Pérez, T., and Kaelbling, L. P. Modular meta-learning. arXiv preprint arXiv:1806.10166, 2018.

Alet, F., Weng, E., Lozano-Pérez, T., and Kaelbling, L. P. Neural relational inference with fast modular meta-learning. In Advances in Neural Information Processing Systems, pp. 11804–11815, 2019.

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pp. 173–182. PMLR, 2016.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.

Antoniou, A., Edwards, H., and Storkey, A. How to train your maml. arXiv preprint arXiv:1810.09502, 2018.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. PMLR, 2017.

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Baño-Medina, J., Manzanas, R., and Gutiérrez, J. M. Configuration and intercomparison of deep learning neural models for statistical downscaling. Geoscientific Model Development, 13(4):2109–2124, 2020.

Bar-Sinai, Y., Hoyer, S., Hickey, J., and Brenner, M. P.
Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019.

Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502–4510, 2016.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pp. 585–591, 2002.

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.

Berman, L. National aqi observations (2014-05 to 2016-12). Harvard Dataverse, 2017. doi: 10.7910/DVN/QDX6L8.

Bilionis, I. and Zabaras, N. Multi-output local gaussian process regression: Applications to uncertainty quantification. Journal of Computational Physics, 231(17):5718–5746, 2012.

Bolton, T. and Zanna, L. Applications of deep learning to ocean data inference and subgrid parameterization. Journal of Advances in Modeling Earth Systems, 11(1):376–399, 2019.

Bongard, J. and Lipson, H. Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 104(24):9943–9948, 2007.

Brenowitz, N. D. and Bretherton, C. S. Prognostic validation of a neural network unified physics parameterization. Geophysical Research Letters, 45(12):6289–6298, 2018.

Brenowitz, N. D., Beucler, T., Pritchard, M., and Bretherton, C. S. Interpreting and stabilizing machine-learning parametrizations of convection. Journal of the Atmospheric Sciences, 77(12):4357–4375, 2020.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Brunton, S. L., Proctor, J. L., and Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.

Cang, R., Li, H., Yao, H., Jiao, Y., and Ren, Y. Improving direct physical properties prediction of heterogeneous materials from imaging data via convolutional neural network and a morphology-aware generative model. Computational Materials Science, 150:212–221, 2018.

Cao, W., Wang, D., Li, J., Zhou, H., Li, L., and Li, Y. Brits: Bidirectional recurrent imputation for time series. In Advances in Neural Information Processing Systems, pp. 6775–6785, 2018.

Chan, S. and Elsheikh, A. H. Parametrization and generation of geological models with generative adversarial networks. arXiv preprint arXiv:1708.01810, 2017.

Chang, M. B., Ullman, T., Torralba, A., and Tenenbaum, J. B. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.

Chattopadhyay, A., Mustafa, M., Hassanzadeh, P., and Kashinath, K. Deep spatial transformers for autoregressive data-driven forecasting of geophysical turbulence. In Proceedings of the 10th International Conference on Climate Informatics, pp. 106–112, 2020a.

Chattopadhyay, A., Nabizadeh, E., and Hassanzadeh, P.
Analog forecasting of extreme-causing weather patterns using deep learning. Journal of Advances in Modeling Earth Systems, 12(2):e2019MS001958, 2020b.

Chattopadhyay, A., Subel, A., and Hassanzadeh, P. Data-driven super-parameterization using deep learning: Experimentation with multiscale lorenz 96 systems and transfer learning. Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002084, 2020c.

Chen, G., Zuo, Y., Sun, J., and Li, Y. Support-vector-machine-based reduced-order model for limit cycle oscillation prediction of nonlinear aeroelastic system. Mathematical Problems in Engineering, 2012, 2012.

Chen, H., Zhang, Y., Kalra, M. K., Lin, F., Chen, Y., Liao, P., Zhou, J., and Wang, G. Low-dose ct with a residual encoder-decoder convolutional neural network. IEEE Transactions on Medical Imaging, 36(12):2524–2535, 2017.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.

Chen, Y., Friesen, A. L., Behbahani, F., Budden, D., Hoffman, M. W., Doucet, A., and de Freitas, N. Modular meta-learning with shrinkage. arXiv preprint arXiv:1909.05557, 2019.

Christie, M., Demyanov, V., and Erbas, D. Uncertainty quantification for porous media flows. Journal of Computational Physics, 217(1):143–158, 2006.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1xMH1BtvB.

Conneau, A., Baevski, A., Collobert, R., Mohamed, A., and Auli, M. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979, 2020.

Crane, K. Discrete differential geometry: An applied introduction. Notices of the AMS, Communication, 2018.

Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D., and Ho, S. Lagrangian neural networks. arXiv preprint arXiv:2003.04630, 2020a.

Cranmer, M., Sanchez-Gonzalez, A., Battaglia, P., Xu, R., Cranmer, K., Spergel, D., and Ho, S. Discovering symbolic models from deep learning with inductive biases. arXiv preprint arXiv:2006.11287, 2020b.

Crutchfield, J. P. and McNamara, B. Equations of motion from a data series. Complex Systems, 1(417-452):121, 1987.

Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf.

Davis, N., Raina, G., and Jagannathan, K. Grids versus graphs: Partitioning space for improved taxi demand-supply forecasts. IEEE Transactions on Intelligent Transportation Systems, 2020.

Dawson, M., Olvera, J., Fung, A., and Manry, M. Inversion of surface parameters using fast learning neural networks. 1992.

de Bezenac, E., Pajot, A., and Gallinari, P. Deep learning for physical processes: Incorporating prior scientific knowledge. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=By4HsfWAZ.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29, pp. 3844–3852. Curran Associates, Inc., 2016.
URL https://proceedings.neurips.cc/paper/2016/file/04df4d434d481c5bb723be1b6df1ee65-Paper.pdf.

Defferrard, M., Milani, M., Gusset, F., and Perraudin, N. Deepsphere: a graph-based spherical cnn. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1e3OlStPB.

Denton, E., Chintala, S., Szlam, A., and Fergus, R. Deep generative image models using a laplacian pyramid of adversarial networks. arXiv preprint arXiv:1506.05751, 2015.

Deser, C., Phillips, A., Bourdette, V., and Teng, H. Uncertainty in climate change projections: the role of internal variability. Climate Dynamics, 38(3):527–546, 2012.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.

Du, X., Lin, T.-Y., Jin, P., Ghiasi, G., Tan, M., Cui, Y., Le, Q. V., and Song, X. Spinenet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11592–11601, 2020.

Duan, Y., Andrychowicz, M., Stadie, B., Ho, O. J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098, 2017.

Ebert-Uphoff, I. and Hilburn, K. Evaluation, tuning, and interpretation of neural networks for working with images in meteorological applications. Bulletin of the American Meteorological Society, 101(12):E2149–E2170, 2020.

Faghmous, J. H. and Kumar, V. A big data guide to understanding climate change: The case for theory-guided data science. Big Data, 2(3):155–163, 2014.

Farimani, A. B., Gomes, J., and Pande, V. S. Deep learning the physics of transport phenomena. arXiv preprint arXiv:1709.02432, 2017.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR.org, 2017.

Gagne, D. J., McGovern, A., Haupt, S. E., Sobash, R. A., Williams, J. K., and Xue, M. Storm-based probabilistic hail forecasting with machine learning applied to convection-allowing ensembles. Weather and Forecasting, 32(5):1819–1840, 2017.

Gagne, D. J., Christensen, H. M., Subramanian, A. C., and Monahan, A. H. Machine learning for stochastic parameterization: Generative adversarial networks in the lorenz'96 model. Journal of Advances in Modeling Earth Systems, 12(3):e2019MS001896, 2020.

Gagne II, D. J., Haupt, S. E., Nychka, D. W., and Thompson, G. Interpretable deep learning for spatial analysis of severe hailstorms. Monthly Weather Review, 147(8):2827–2845, 2019.

Galbally, D., Fidkowski, K., Willcox, K., and Ghattas, O. Non-linear model reduction for uncertainty quantification in large-scale inverse problems. International Journal for Numerical Methods in Engineering, 81(12):1581–1608, 2010.

Geng, X., Li, Y., Wang, L., Zhang, L., Yang, Q., Ye, J., and Liu, Y. Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In 2019 AAAI Conference on Artificial Intelligence (AAAI'19), 2019.

Gentine, P., Pritchard, M., Rasp, S., Reinaudi, G., and Yacalis, G.
Could machine learning break the convection parameterization deadlock? Geophysical Research Letters, 45(11):5742–5751, 2018.

Gerwin, D. Information processing, data inferences, and scientific generalization. Behavioral Science, 19(5):314–325, 1974.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. JMLR.org, 2017.

Goldstein, E., Coco, G., Murray, A., and Green, M. Data-driven components in a model of inner-shelf sorted bedforms: a new hybrid model. Earth Surface Dynamics, 2(1):67–82, 2014.

Granger, C. W. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pp. 424–438, 1969.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.

Greydanus, S., Dzamba, M., and Yosinski, J. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, pp. 15353–15363, 2019.

Groenke, B., Madaus, L., and Monteleoni, C. Climalign: Unsupervised statistical downscaling of climate variables via normalizing flows. In Proceedings of the 10th International Conference on Climate Informatics, pp. 60–66, 2020.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. ACM, 2016.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324, 2019.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Hsieh, W. W. Machine learning methods in the environmental sciences: Neural networks and kernels. Cambridge University Press, 2009.

Imbens, G. W. and Rubin, D. B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

Ivezić, Ž., Connolly, A. J., VanderPlas, J. T., and Gray, A. Statistics, data mining, and machine learning in astronomy: a practical Python guide for the analysis of survey data, volume 1. Princeton University Press, 2014.

Juran, J. and Godfrey, A. B. Quality handbook. Republished McGraw-Hill, pp. 173–178, 1999.
Karpatne, A., Atluri, G., Faghmous, J. H., Steinbach, M., Banerjee, A., Ganguly, A., Shekhar, S., Samatova, N., and Kumar, V. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10):2318–2331, 2017.

Karpatne, A., Ebert-Uphoff, I., Ravela, S., Babaie, H. A., and Kumar, V. Machine learning for the geosciences: Challenges and opportunities. IEEE Transactions on Knowledge and Data Engineering, 31(8):1544–1554, 2018.

Kevrekidis, I. G., Gear, C. W., Hyman, J. M., Kevrekidid, P. G., Runborg, O., Theodoropoulos, C., et al. Equation-free, coarse-grained multiscale computation: Enabling microscopic simulators to perform system-level analysis. Communications in Mathematical Sciences, 1(4):715–762, 2003.

Kim, J., Kim, D., and Choi, H. An immersed-boundary finite-volume method for simulations of flow in complex geometries. Journal of Computational Physics, 171(1):132–150, 2001.

Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1181. URL https://www.aclweb.org/anthology/D14-1181.

King, R., Hennigh, O., Mohan, A., and Chertkov, M. From deep to physics-informed learning of turbulence: Diagnostics. arXiv preprint arXiv:1810.07785, 2018.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. International Conference on Machine Learning, 2018.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017a.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017b.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille, 2015.

Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

Krasnopolsky, V. M. and Fox-Rabinovitz, M. S. Complex hybrid models combining deterministic and machine learning components for numerical climate modeling and weather prediction. Neural Networks, 19(2):122–134, 2006.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Kumar, A., Irsoy, O., Ondruska, P., Iyyer, M., Bradbury, J., Gulrajani, I., Zhong, V., Paulus, R., and Socher, R. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pp. 1378–1387. PMLR, 2016.

Kumar, S., Tan, S., Zheng, L., and Kochmann, D. M. Inverse-designed spinodoid metamaterials. npj Computational Materials, 6(1):1–10, 2020.

Kutz, J. N. Deep learning in fluid dynamics. Journal of Fluid Mechanics, 814:1–4, 2017.

Lai, R., Liang, J., and Zhao, H. A local mesh method for solving pdes on point clouds. Inverse Problems & Imaging, 7(3), 2013.

Langley, P. Data-driven discovery of physical laws. Cognitive Science, 5(1):31–54, 1981.

Langley, P., Bradshaw, G. L., and Simon, H. A.
Rediscovering chemistry with the bacon system. In Machine learning, pp. 307{329. Springer, 1983. LeCun, Y., Bottou, L., Bengio, Y., and Haner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278{2324, 1998. 123 Lee, S., Kooshkbaghi, M., Spiliotis, K., Siettos, C. I., and Kevrekidis, I. G. Coarse-scale pdes from ne-scale observations via machine learning. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(1):013141, 2020. Lenat, D. B. The role of heuristics in learning by discovery: Three case studies. In Machine learning, pp. 243{306. Springer, 1983. Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diusion convolutional recurrent neural network: Data- driven trac forecasting. In International Conference on Learning Representations (ICLR '18), 2018a. Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diusion convolutional recurrent neural network: Data- driven trac forecasting. In International Conference on Learning Representations, 2018b. URL https://openreview.net/forum?id=SJiHXGWAZ. Liao, T. W. and Li, G. Metaheuristic-based inverse design of materials{a survey. Journal of Materiomics, 6(2):414{430, 2020. Lim, L.-H. Hodge laplacians on graphs. arXiv preprint arXiv:1507.05379, 2015. Long, Z., Lu, Y., Ma, X., and Dong, B. PDE-net: Learning PDEs from data. In Proceedings of the 35th International Conference on Machine Learning, 2018a. URL http://proceedings.mlr. press/v80/long18a.html. Long, Z., Lu, Y., Ma, X., and Dong, B. Pde-net: Learning pdes from data. International Conference on Machine Learning, 2018b. Long, Z., Lu, Y., and Dong, B. Pde-net 2.0: Learning pdes from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, 399:108925, 2019. Lunz, S., Oktem, O., and Sch onlieb, C.-B. Adversarial regularizers in inverse problems. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Gar- nett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Cur- ran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/ d903e9608cfbf08910611e4346a0ba44-Paper.pdf. Lutter, M., Ritter, C., and Peters, J. Deep lagrangian networks: Using physics as model prior for deep learning. In International Conference on Learning Representations, 2019. URL https: //openreview.net/forum?id=BklHpjCqKm. Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579{2605, 2008. MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448{472, 1992. Magill, M., Qureshi, F., and de Haan, H. W. Neural networks trained to solve dierential equations learn general representations. Advances in Neural Information Processing Systems, 2018. Manzoni, A., Pagani, S., and Lassila, T. Accurate solution of bayesian inverse uncertainty quan- tication problems combining reduced basis methods and reduction error models. SIAM/ASA Journal on Uncertainty Quantication, 4(1):380{412, 2016. 124 McCann, M. T., Jin, K. H., and Unser, M. A review of convolutional neural networks for inverse problems in imaging. arXiv preprint arXiv:1710.04011, 2017. McGovern, A., Lagerquist, R., Gagne, D. J., Jergensen, G. E., Elmore, K. L., Homeyer, C. R., and Smith, T. Making the black box more transparent: Understanding the physical implications of machine learning. Bulletin of the American Meteorological Society, 100(11):2175{2199, 2019. Mikolov, T., Chen, K., Corrado, G., and Dean, J. 
Ecient estimation of word representations in vector space. International Conference on Learning Representations, 2013. Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018. URL https://openreview.net/ forum?id=B1DmUzWAW. Mohan, A. T. and Gaitonde, D. V. A deep learning based approach to reduced order modeling for turbulent ow control using lstm neural networks. arXiv preprint arXiv:1804.09269, 2018. Moukalled, F., Mangani, L., Darwish, M., et al. The nite volume method in computational uid dynamics, volume 6. Springer, 2016. Munkhdalai, T. and Yu, H. Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2554{2563. JMLR. org, 2017. Naik, D. K. and Mammone, R. J. Meta-neural networks that learn by learning. In [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, volume 1, pp. 437{442. IEEE, 1992. Nauta, M., Bucur, D., and Seifert, C. Causal discovery with attention-based convolutional neural networks. Machine Learning and Knowledge Extraction, 1(1):312{340, 2019. Nichol, A., Achiam, J., and Schulman, J. On rst-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018. O'Gorman, P. A. and Dwyer, J. G. Using machine learning to parameterize moist convection: Potential for modeling of climate, climate change, and extreme events. Journal of Advances in Modeling Earth Systems, 10(10):2548{2563, 2018. Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. Paml, R., Sriwattanaworachai, N., Desai, S., Pilgerstorfer, P., Georgatzis, K., Beaumont, P., and Aragam, B. Dynotears: Structure learning from time-series data. In International Conference on Articial Intelligence and Statistics, pp. 1595{1605. PMLR, 2020. Pearl, J. Causality. Cambridge university press, 2009. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018. Pettit, C. L. Uncertainty quantication in aeroelasticity: recent results and research challenges. Journal of Aircraft, 41(5):1217{1229, 2004. 125 Pilozzi, L., Farrelly, F. A., Marcucci, G., and Conti, C. Machine learning inverse problem for topological photonics. Communications Physics, 1(1):1{7, 2018. Qin, C., O'Donoghue, B., Bunel, R., Stanforth, R., Gowal, S., Uesato, J., Swirszcz, G., Kohli, P., et al. Verication of non-linear specications for neural networks. arXiv preprint arXiv:1902.09592, 2019. Quade, M., Abel, M., Nathan Kutz, J., and Brunton, S. L. Sparse identication of nonlinear dynamics for rapid model recovery. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(6):063116, 2018. Racah, E., Beckham, C., Maharaj, T., Kahou, S., Prabhat, M., and Pal, C. Extremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 3405{3416. Curran Associates, Inc., 2017a. URL http://papers.nips.cc/paper/ 6932-extremeweather-a-large-scale-climate-dataset-for-semi-supervised-detection-localization-and-understanding-of-extreme-weather-events. pdf. Racah, E., Beckham, C., Maharaj, T., Kahou, S. 
E., Prabhat, M., and Pal, C. Extremeweather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In Advances in Neural Information Processing Systems, pp. 3402{3413, 2017b. Raccuglia, P., Elbert, K. C., Adler, P. D., Falk, C., Wenny, M. B., Mollo, A., Zeller, M., Friedler, S. A., Schrier, J., and Norquist, A. J. Machine-learning-assisted materials discovery using failed experiments. Nature, 533(7601):73{76, 2016. Raghu, A., Raghu, M., Bengio, S., and Vinyals, O. Rapid learning or feature reuse? towards un- derstanding the eectiveness of maml. In International Conference on Learning Representations, 2020. Raissi, M. Deep hidden physics models: Deep learning of nonlinear partial dierential equations. arXiv preprint arXiv:1801.06637, 2018. Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics informed deep learning (part i): Data- driven solutions of nonlinear partial dierential equations. arXiv preprint arXiv:1711.10561, 2017a. Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics informed deep learning (part ii): Data- driven discovery of nonlinear partial dierential equations. arXiv preprint arXiv:1711.10566, 2017b. Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics-informed neural networks: A deep learn- ing framework for solving forward and inverse problems involving nonlinear partial dierential equations. Journal of Computational Physics, 378:686{707, 2019. Rajabi, M. M. and Ketabchi, H. Uncertainty-based simulation-optimization using gaussian process emulation: application to coastal groundwater management. Journal of hydrology, 555:518{534, 2017. 126 Rasp, S., Pritchard, M. S., and Gentine, P. Deep learning to represent subgrid processes in climate models. Proceedings of the National Academy of Sciences, 115(39):9684{9689, 2018. Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017. Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unied, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779{788, 2016. Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., et al. Deep learning and process understanding for data-driven earth system science. Nature, 566(7743): 195{204, 2019. Rubin, D. B. Estimating causal eects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688, 1974. Rudy, S. H., Brunton, S. L., Proctor, J. L., and Kutz, J. N. Data-driven discovery of partial dierential equations. Science Advances, 3(4):e1602614, 2017. Runge, J. Causal network reconstruction from time series: From theoretical assumptions to prac- tical estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7):075310, 2018. Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Mu~ noz-Mar , J., et al. Inferring causation from time series in earth system sciences. Nature communications, 10(1):1{13, 2019a. Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., and Sejdinovic, D. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019b. Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018. 
Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in neural information processing systems, pp. 3856{3866, 2017. Sadowski, P., Fooshee, D., Subrahmanya, N., and Baldi, P. Synergies between quantum mechanics and machine learning in reaction prediction. Journal of chemical information and modeling, 56 (11):2125{2128, 2016. Sanchez-Gonzalez, A., Heess, N., Springenberg, J. T., Merel, J., Riedmiller, M., Hadsell, R., and Battaglia, P. Graph networks as learnable physics engines for inference and control. International Conference on Machine Learning, 2018. Sandryhaila, A. and Moura, J. M. Discrete signal processing on graphs. IEEE transactions on signal processing, 61(7):1644{1656, 2013. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1842{1850, 2016. 127 Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967{4976, 2017. Schleder, G. R., Padilha, A. C., Acosta, C. M., Costa, M., and Fazzio, A. From dft to machine learning: recent approaches to materials science{a review. Journal of Physics: Materials, 2(3): 032001, 2019. Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Diplomarbeit, Technische Universit at M unchen, M unchen, 1987. Schmidt, M. and Lipson, H. Distilling free-form natural laws from experimental data. science, 324 (5923):81{85, 2009. Sch utt, K. T., Kindermans, P.-J., Sauceda, H. E., Chmiela, S., Tkatchenko, A., and M uller, K.- R. Schnet: A continuous-lter convolutional neural network for modeling quantum interactions. arXiv preprint arXiv:1706.08566, 2017. Seo, S., Meng, C., and Liu, Y. Physics-aware dierence graph networks for sparsely-observed dynamics. In International Conference on Learning Representations, 2020a. URL https:// openreview.net/forum?id=r1gelyrtwH. Seo, S., Meng, C., and Liu, Y. Physics-aware dierence graph networks for sparsely-observed dynamics. In International Conference on Learning Representations, 2020b. URL https:// openreview.net/forum?id=r1gelyrtwH. Seo, Y., Deerrard, M., Vandergheynst, P., and Bresson, X. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pp. 362{373. Springer, 2018. Shari, E., Saghaan, B., and Steinacker, R. Downscaling satellite precipitation estimates with multiple linear regression, articial neural networks, and spline interpolation techniques. Journal of Geophysical Research: Atmospheres, 124(2):789{805, 2019. Shazeer, N., Lan, Z., Cheng, Y., Ding, N., and Hou, L. Talking-heads attention. arXiv preprint arXiv:2003.02436, 2020. Shi, X., Qi, H., Shen, Y., Wu, G., and Yin, B. A spatial-temporal attention approach for trac prediction. IEEE Transactions on Intelligent Transportation Systems, 2020. Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077{4087, 2017. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 
1631{1642, 2013. Stengel, K., Glaws, A., Hettinger, D., and King, R. N. Adversarial super-resolution of climatological wind and solar data. Proceedings of the National Academy of Sciences, 117(29):16805{16815, 2020. 128 Stinis, P., Hagge, T., Tartakovsky, A. M., and Yeung, E. Enforcing constraints for interpolation and extrapolation in generative adversarial networks. Journal of Computational Physics, 397: 108844, 2019. Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 2014. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. Tang, X., Yao, H., Sun, Y., Aggarwal, C., Mitra, P., and Wang, S. Joint modeling of local and global temporal dynamics for multivariate time series forecasting with missing values. arXiv preprint arXiv:1911.10273, 2019. Thrun, S. and Pratt, L. Learning to learn: Introduction and overview. In Learning to learn, pp. 3{17. Springer, 1998. Toms, B. A., Barnes, E. A., and Ebert-Upho, I. Physically interpretable neural networks for the geosciences: Applications to earth system variability. Journal of Advances in Modeling Earth Systems, 12(9):e2019MS002002, 2020a. Toms, B. A., Kashinath, K., Yang, D., et al. Testing the reliability of interpretable neural networks in geoscience using the madden-julian oscillation. Geoscientic Model Development Discussions, pp. 1{22, 2020b. Tripathy, R. K. and Bilionis, I. Deep uq: Learning deep neural network surrogate models for high dimensional uncertainty quantication. Journal of computational physics, 375:565{588, 2018. Um, K., Brand, R., Fei, Y., Holl, P., and Thuerey, N. Solver-in-the-Loop: Learning from Dieren- tiable Physics to Interact with Iterative PDE-Solvers. Advances in Neural Information Processing Systems, 2020. Vamaraju, J. and Sen, M. K. Unsupervised physics-based neural networks for seismic migration. Interpretation, 7(3):SE189{SE200, 2019. Vandal, T., Kodra, E., Ganguly, S., Michaelis, A., Nemani, R., and Ganguly, A. R. Deepsd: Generating high resolution climate change projections through single image super-resolution. In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pp. 1663{1672, 2017. Vandal, T., Kodra, E., Dy, J., Ganguly, S., Nemani, R., and Ganguly, A. R. Quantifying uncertainty in discrete-continuous and skewed data with bayesian deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2377{ 2386, 2018. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30, pp. 5998{6008. Curran Associates, Inc., 2017. URL https://proceedings. neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. 129 Veli ckovi c, P., Cucurull, G., Casanova, A., Romero, A., Li o, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630{3638, 2016. Virieux, J. 
P-sv wave propagation in heterogeneous media: Velocity-stress nite-dierence method. Geophysics, 51(4):889{901, 1986. Wang, C., Tang, Y., Ma, X., Wu, A., Okhonko, D., and Pino, J. fairseq s2t: Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171, 2020a. Wang, R., Kashinath, K., Mustafa, M., Albert, A., and Yu, R. Towards physics-informed deep learning for turbulent ow prediction. arXiv preprint arXiv:1911.08655, 2019a. Wang, R., Kashinath, K., Mustafa, M., Albert, A., and Yu, R. Towards physics-informed deep learning for turbulent ow prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1457{1466, 2020b. Wang, X., Li, Z., Jiang, M., Wang, S., Zhang, S., and Wei, Z. Molecule property prediction based on spatial graph embedding. Journal of chemical information and modeling, 59(9):3817{3828, 2019b. Wang, Z., Di, H., Shaq, M. A., Alaudah, Y., and AlRegib, G. Successful leveraging of image processing and machine learning in seismic structural interpretation: A review. The Leading Edge, 37(6):451{461, 2018. Watters, N., Tacchetti, A., Weber, T., Pascanu, R., Battaglia, P., and Zoran, D. Visual interaction networks. NIPS, 2017. Willard, J., Jia, X., Xu, S., Steinbach, M., and Kumar, V. Integrating physics-based modeling with machine learning: A survey. arXiv preprint arXiv:2003.04919, 2020. Wu, J.-L., Kashinath, K., Albert, A., Chirila, D., Xiao, H., et al. Enforcing statistical constraints in generative adversarial networks for modeling chaotic dynamical systems. Journal of Compu- tational Physics, 406:109209, 2020. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. Xiao, D., Heaney, C., Mottet, L., Fang, F., Lin, W., Navon, I., Guo, Y., Matar, O., Robins, A., and Pain, C. A reduced order model for turbulent ows in the urban environment using machine learning. Building and Environment, 148:323{337, 2019. Xie, Y., Franz, E., Chu, M., and Thuerey, N. tempogan: A temporally coherent, volumetric gan for super-resolution uid ow. ACM Transactions on Graphics (TOG), 37(4):1{15, 2018. Xu, T. and Valocchi, A. J. Data-driven methods to improve base ow prediction of a regional groundwater model. Computers & Geosciences, 85:124{136, 2015. 130 Yang, L., Zhang, D., and Karniadakis, G. E. Physics-informed generative adversarial networks for stochastic dierential equations. arXiv preprint arXiv:1811.02033, 2018. Yang, Z., Wu, J.-L., and Xiao, H. Enforcing deterministic constraints on generative adversarial networks for emulating physical systems. arXiv preprint arXiv:1911.06671, 2019. Yazdani, A., Lu, L., Raissi, M., and Karniadakis, G. E. Systems biology informed deep learning for inferring parameters and hidden dynamics. PLoS computational biology, 16(11):e1007575, 2020. Ye, J., Zhao, J., Ye, K., and Xu, C. How to build a graph-based deep learning architecture in trac domain: A survey. arXiv preprint arXiv:2005.11691, 2020. Zhang, J., Mohegh, A., Li, Y., Levinson, R., and Ban-Weiss, G. Systematic comparison of the in uence of cool wall versus cool roof adoption on urban climate in the los angeles basin. Envi- ronmental science & technology, 52(19):11188{11197, 2018. Zhang, X., Liang, F., Srinivasan, R., and Van Liew, M. Estimating uncertainty of stream ow simulation using bayesian neural networks. 
Water Resources Research, 45(2), 2009. Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. Dags with no tears: Continuous optimization for structure learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31, pp. 9472{9483. Curran Associates, Inc., 2018. URL https://proceedings.neurips. cc/paper/2018/file/e347c51419ffb23ca3fd5050202f9c3d-Paper.pdf. Zhou, D. and Sch olkopf, B. A regularization framework for learning from graph data. In ICML workshop on statistical relational learning and Its connections to other elds, volume 15, pp. 67{68, 2004. Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017. 131
Abstract
While deep neural networks have been successful across a number of applications, achieving robust models of physical systems remains challenging because purely data-driven learning does not explicitly incorporate physical knowledge, which should benefit modeling. To leverage domain knowledge for robust learning, this thesis proposes several novel methods that incorporate physical knowledge when modeling spatiotemporal observations from physical systems. First, we quantify data quality, inspired by the physical properties of fluids, to identify abnormal observations and improve forecasting performance. Second, we propose a regularizer that explicitly imposes partial differential equations (PDEs) associated with physical laws, providing an inductive bias in the latent space. Third, we develop a method to approximate spatial derivatives, which are fundamental components of spatiotemporal PDEs and play a prominent role in physics-aware modeling. Fourth, we demonstrate a meta-learning framework showing that physics-related quantities enable fast adaptation of learnable models from few observations. Finally, we propose spatiotemporal modeling via physics-aware causality, which leverages the additional causal information described by PDEs for physical systems. All methods share a common goal: integrating physical knowledge with graph networks to model sensor-based physical systems by providing a strong inductive bias.
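To make the third contribution concrete, the following is a minimal sketch of approximating spatial derivatives on a graph of sensor nodes. It is illustrative only, not code from the thesis; the edge list and node values are hypothetical. Differences along edges act as a discrete gradient, and the graph Laplacian L = D - A acts as a discrete second derivative, the diffusion term in PDEs such as du/dt = c * Laplacian(u).

import numpy as np

# Hypothetical sensor graph: 4 nodes on a line, edges between neighbors.
edges = [(0, 1), (1, 2), (2, 3)]      # undirected edge list
u = np.array([0.0, 1.0, 4.0, 9.0])    # scalar field sampled at nodes (u = x^2)
n = len(u)

# Discrete gradient: one finite difference u_j - u_i per edge.
grad = np.array([u[j] - u[i] for i, j in edges])

# Graph Laplacian L = D - A; L @ u approximates the negated second
# spatial derivative at each node, up to a scale set by node spacing.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lap_u = L @ u

print(grad)    # [1. 3. 5.]  forward differences of x^2
print(-lap_u)  # [ 1.  2.  2. -5.]  interior values recover u'' = 2

The sketch shows only the fixed, textbook operators, which degrade at boundaries and on irregular node layouts; the thesis builds on discrete operators of this kind so that derivative estimates remain reliable for sparsely and irregularly placed sensors.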