Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Hybrid physics-based and data-driven computational approaches for multi-scale hydrologic systems under heterogeneity
(USC Thesis Other)
Hybrid physics-based and data-driven computational approaches for multi-scale hydrologic systems under heterogeneity
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Hybrid Physics-Based and Data-Driven Computational Approaches for
Multi-Scale Hydrologic Systems under Heterogeneity
by
Jinwoo Im
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ENVIRONMENTAL ENGINEERING)
August 2022
Copyright 2022 Jinwoo Im
Acknowledgements
I am grateful for the support from many people who helped me to complete this dissertation. First
and foremost, I would like to thank my advisor, Professor Felipe de Barros. He enlightened me in
every possible way. He has provided me with invaluable advice and persistent support throughout
the Ph.D. program. Under his guidance, I had enough time to grow as both a researcher and a
person. My special thanks go to my committee members, Professor Sami Masri and Professor
Mohammad Sahimi. They provided in-depth knowledge and considerations that led me to explore
new possibilities rooted in solid scientific foundations. As well, I wish to thank Professor Kyoung-
phile Nam for encouraging me to study abroad and achieve what I have dreamed of for a long
time. I thank my colleagues, Mahsa Moslehi, Arianna Libera, Calogero Rizzo, Maria Morvillo,
and Alessandra Bonazzi, for the time we spent talking, drinking, and laughing together. My five
years were truly colorful because of those moments. Lastly, I would like to thank my family. I
was able to dream big without fear and reach my potential thanks to the great love of my parents,
Seung-ik Im and Seong-sun Kim. My sister, Jina Im, considerably took care of the family when I
was apart from. My love, Boyoung Jeong, when you are with me, everything becomes meaningful.
I deeply appreciate you being with me.
ii
Table of Contents
Acknowledgements ii
List of Tables v
List of Figures vii
Abstract xi
Chapter 1: Introduction 1
1.1 Publications and Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2: Analysis of the Uncertainty Propagation of Conductivity Heterogeneity on
the Aquifer Resilience Loss with Different Health Response Models 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Aquifer Resilience Loss and Reliability . . . . . . . . . . . . . . . . . . . 10
2.2.2 Flow and Transport Model . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Numerical Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Computational Results & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Effects of Level of Aquifer Heterogeneity on Aquifer Resilience . . . . . . 20
2.4.1.1 Average Daily Dose and Aquifer Reliability . . . . . . . . . . . 20
2.4.1.2 Loss of Aquifer Resilience . . . . . . . . . . . . . . . . . . . . 24
2.4.2 Relationship between Aquifer Resilience and Hydrogeological Descriptors 26
2.4.2.1 Residence Time of BPA Plumes at the Control Plane . . . . . . . 26
2.4.2.2 Impacts of Flow Rate at Source Zone on Aquifer Resilience . . . 28
2.4.3 Risk Management Strategies for Groundwater Contamination Sites with BPA 34
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 3: Data-driven Identification of Ordinary Differential Equations for Multi-Physics
Systems Utilizing Stochastic Optimization 38
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Genetic programming for system identification . . . . . . . . . . . . . . . 44
3.3.1.1 Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
iii
3.3.1.2 Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.1.3 Fitness test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1.4 Evolutionary processes . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1.5 Bloat control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1.6 Parameter optimization . . . . . . . . . . . . . . . . . . . . . . 53
3.3.2 Demonstration of the identification scheme utilizing GP . . . . . . . . . . 55
3.4 Illustrative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1 Bouc-Wen model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.2 soil-moisture model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Chapter 4: Hybrid Physics-Based and Data-Driven Discovery of Partial Differential Equa-
tions for Multi-Scale Hydrologic Systems by Stochastic Optimization 74
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.1 Genetic programming for system identification . . . . . . . . . . . . . . . 79
4.3.2 Fitness test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.3 Candidate PDEs and their evolution . . . . . . . . . . . . . . . . . . . . . 85
4.4 Illustrative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.1 Burgers’ equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.2 Advection-dispersion equation . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.5.1 GPSI for the class of PDE problems . . . . . . . . . . . . . . . . . . . . . 100
4.5.2 Advantages and limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Chapter 5: Deriving the Governing Equations for Transport in Berea Sandstone Cores
by Stochastic Optimization 104
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.3 Results & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Chapter 6: Conclusion and ongoing work 114
References 116
Appendices 125
A Non-monotonic Dose Response Model for Endocrine-related Adverse Health Ef-
fects by Bisphenol A (BPA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
iv
List of Tables
2.1 Two types of adverse health effects induced by BPA . . . . . . . . . . . . . . . . . 12
2.2 Exposure parameters for the health risk of pregnant/lactating women . . . . . . . . 13
2.3 Simulation parameters for the groundwater flow and BPA transport . . . . . . . . . 17
3.1 Characteristics in each module for genetic programming . . . . . . . . . . . . . . 54
3.2 Parameter specification for the SDOF Bouc-Wen model and input data for training
and validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3 GP-identified system models and corresponding fitness values (Eq. 3.5) as a func-
tion of the noise level in the training data . . . . . . . . . . . . . . . . . . . . . . . 61
3.4 Normalized RMSEs of the output between the GP-identified system models and
the reference system model with reference and validation data . . . . . . . . . . . 61
3.5 Parameter specification of the soil-moisture model and input data for training and
validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6 GP-identified system models and corresponding fitness values (Eq. 3.5) as a func-
tion of the noise level in the training data . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 Normalized RMSEs of output between responses from the GP-identified system
models and from the reference system model with reference and validation data. . 68
4.1 GPSI-identified PDEs and their ε
res
from the training and validation data depend-
ing on different Re and ε; ˆ u is the fluid longitudinal velocity estimated from the
identified model; all coefficients are rounded to three decimal places . . . . . . . . 91
4.2 GPSI-identified PDEs and their ε
res
from the training and validation data depend-
ing on different Pe andε; ˆ c is the solute concentration estimated from the identified
model; all coefficients are rounded to three decimal places . . . . . . . . . . . . . 97
5.1 Hyperparameter specification used in this Letter for the stochastic optimization via
GPSI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
v
5.2 PDEs discovered from the column experimental data through the stochastic opti-
mization; c is the solute concentration at the location x and time t; all coefficients
are rounded to three decimal places . . . . . . . . . . . . . . . . . . . . . . . . . . 111
vi
List of Figures
2.1 Schematic overview of the groundwater BPA contamination problem under con-
sideration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Computational model set-up for numerical solutions for groundwater flow and
BPA transport. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Monte Carlo convergence of relative means and variances of R
L
(Equation 2.1)
with different log-conductivity variances (σ
2
Y
) and dose-response (DR) models. . . 17
2.4 The flow chart of the systematic methodology developed for this study; The method-
ology integrates stochastic hydrogeological modeling, human health risks, and
aquifer resilience; MDR stands for the monotonic dose-response model and NMDR
for the non-monotonic dose-response model; The hydraulic conductivity field K
i
(x)
is randomly generated for two levels of heterogeneity, σ
2
Y
= 0.25 and 2.5, in the
Monte Carlo framework; The velocity field u
i
(x) and BPA concentration c
i
(x,t)
are computed from Equations 2.5-2.7; The average daily dose ADD
i
(t) is com-
puted according to Equation 2.4 andξ
i
(t) is computed from Equations 2.2-2.3 with
MDR and NMDR models; The resilience loss R
L,i
is calculated from Equation 2.1. 19
2.5 Time profiles of the ADD(t) and the statistics of ξ(t) for the case of σ
2
Y
= 0.25;
(a) the ADD(t) as a function of the exposure starting time t at the CP; (b) the mean
and variance ofξ(t) depending on the DR models. . . . . . . . . . . . . . . . . . 22
2.6 Time profiles of the ADD(t) and the statistics ofξ(t) for the case ofσ
2
Y
= 2.5; (a)
the ADD(t) as a function of the exposure starting time t at the CP; (b) the mean
and variance ofξ(t) depending on the DR models. . . . . . . . . . . . . . . . . . 23
2.7 Distributions of R
L
depending onσ
2
Y
and the DR models. . . . . . . . . . . . . . . 25
2.8 Breakthrough curves of BPA at the CP with differentσ
2
Y
; (a)σ
2
Y
= 0.25 (b)σ
2
Y
= 2.5. 27
2.9 R
L
and∆t
r
from all realizations depending onσ
2
Y
and the DR models; (a)σ
2
Y
= 0.25
(b)σ
2
Y
= 2.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
vii
2.10 Resident BPA concentration c
R
fields averaged along the z-axis at t = 7 mo and
breakthrough curves for the flux-averaged concentration c
F
of BPA at the CP for
six extreme cases of∆t
r
withσ
2
Y
= 2.5; (a), (c), and (e) cases for low∆t
r
(b), (d),
and (f) cases for high∆t
r
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 R
L
and η for all realizations depending on σ
2
Y
and the DR models; (a) σ
2
Y
= 0.25
(b)σ
2
Y
= 2.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 The schematic diagram for a general modeling problem. . . . . . . . . . . . . . . 43
3.2 The illustration diagram of an evolutionary loop of GP. As an analogy, programs
like chromosomes contain genetic information to perform a specific task. Accord-
ing to the capability for the task, each program has the level of fitness. Similarly
to Darwin’s theory of evolution, population of programs undergoes evolutionary
processes. Some highly fitted ones survive intact ( Representation), most of them
exchange their genetic information each other (Crossover), and some of them may
experience an exchange with random genetic information (Mutation). Finally, up-
dated programs constitute the next generation. This evolutionary loop continues
until a stop criterion is satisfied. . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 One example of (a) a candidate model
ˆ
M as the program of GP, (b) its target func-
tion ˆ g(t), and (c) the corresponding expression tree. f(t) is the external force. x, ˙ x,
and ¨ x are state variables: displacement, velocity, and acceleration, respectively. In
this work H represents the Heaviside step function. Under the assumption where
a system allows the mass to be measured, the inertial term 2 ¨ x is prearranged in the
candidate model
ˆ
M . The rest of the expression is defined as the target function
ˆ g(t), and aimed to be identified through evolutionary processes of expression trees. 47
3.4 One example of the initialization of expression trees. From the initial node at
the tree level 1, the type of mathematical components is determined as to their
probabilities. A specific component is assigned by random sampling from each
component type set. The expression tree keeps growing when an operation or
a function is assigned and, in turn, initializes two other nodes at the next tree
level. The expression tree stops growing when a variable or a number is assigned,
changing the assigned node into the terminal. . . . . . . . . . . . . . . . . . . . . 49
3.5 Examples of evolutionary operations; (a) crossover randomly selects nodes of two
expression trees and changes each other; (b) mutation randomly chooses a node of
an expression tree and replaces it with a new random expression tree. . . . . . . . . 51
3.6 The workflow chart demonstrating the identification scheme utilizing GP . . . . . . 57
3.7 The schematic diagram of a SDOF Bouc-Wen model and its restoring force along
the displacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
viii
3.8 Time profiles of output and phase diagrams from the reference and GP-identified
system models for training and validation; the time profiles between 40 T and 50T
for excitation force (a) and (g), displacement (b) and (h), and acceleration (c) and
(i) with the training and validation data, respectively; phase diagrams for restoring
force (d)-(f) and (j)-(l) with the training and validation data, respectively. . . . . . . 63
3.9 The schematic diagram of the soil-moisture model [126] . . . . . . . . . . . . . . 65
3.10 Time profiles of the output and phase diagrams obtained from the reference and
GP-identified system models for training and validation; the time profiles between
90 and 270 days for rainfall (a) and (e), soil-moisture content (b) and (f), and its
derivative (c) and (g) with the training and validation data, respectively; phase dia-
grams between soil-moisture content and its derivative (d) and (h) with the training
and validation data, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1 Schematic illustration of the general modeling problem; (a) multi-physics systems
encountered in various science and engineering domains; (b) modeling approach
representing system dynamics with a partial differential equation (PDE); (c) ex-
ternal excitation utilized as an initial condition of a PDE; and (d) evaluation of a
quantity of interest over time by numerically solving the PDE. . . . . . . . . . . . 78
4.2 Workflow chart for the Genetic Programming for System Identification (GPSI) ap-
proach proposed; In the mathematical expression library box, H(·),|·| and erf(·)
represent the Heaviside step function, absolute value and error function, respec-
tively; Details related to the evolutionary processes are exhibited in Fig 4.3. . . . . 81
4.3 Representative snapshot of the flow chart to implement the evolutionary processes
of candidate PDEs utilizing their expression trees. . . . . . . . . . . . . . . . . . . 88
4.4 GPSI simulation for the Burgers’ equation with Re= 100 andε = 5%; (a) the loss
function composition and the validationε
res
as a function of the generation; (b)-(e)
the PDE response surfaces at the initial generation (i.e., (b)) and eachL phase end
(i.e., (c)-(e)); and (f)-(i) the corresponding error surfaces; note that, for clarity, the
amplitude levels displayed in Fig. 4f-i are not the same. . . . . . . . . . . . . . . . 92
4.5 The identified PDE responses with the (a) training and (b) validation data from one
identification case for the Burgers’ equation with Re = 100 andε = 5% (see Case
5 in Table 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.6 The contribution of each term in the identified PDE (see Case 5 in Table 1) from
one identification case for the Burgers’ equation using Re = 100 andε = 5%; the
advection and diffusion terms are 1.000 ˆ u
∂ ˆ u
∂x
and 0.010
∂
2
ˆ u
∂x
2
, respectively; (a) the
values of the components in the model under the given validation excitation u
0
;
(b) the responses from the components for a short time domain (0< t < 0.15); the
incremental shades of the colors represent the responses at different time moments
(t= 0.03,0.06,...,0.15). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
ix
4.7 The average performance of GPSI in identifying the reference PDE (i.e, the Burg-
ers’ equation) depending on Re andε; (a) the phase diagram of the identified PDEs
on the probability of discovering the reference model; (b) the average validation
ε
res
; and (c) the average convergence time. . . . . . . . . . . . . . . . . . . . . . . 95
4.8 The identified PDE responses with the (a) training and (b) validation data from one
identification case for the ADE with Pe = 100 andε = 5% (see Case 5 in Table 2). 98
4.9 The contribution of each term in the identified PDE (see Case 5 in Table 2) from
one identification case for the ADE using Re = 100 and ε = 5%; the advection
and dispersion terms are 1.000
∂ ˆ c
∂x
and 0.010
∂
2
ˆ c
∂x
2
, respectively; (a) the values of
the components in the model under the given validation excitation c
0
; (b) the
responses from the components for a short time domain (0 < t < 0.15); the in-
cremental shades of the colors represent the responses at different time moments
(t= 0.03,0.06,...,0.15). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.10 The average performance of GPSI in identifying the reference PDE (i.e, the ADE)
as to Pe andε; (a) the phase diagram of the identified PDEs on the probability of
discovering the reference model; (b) the average validationε
res
; and (c) the average
convergence time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 The stochastic optimization process discovering the nonlinear ADE (see Table 2)
from Scheidegger’s soil column study [138]; (a) the loss function values and their
composition as a function of the generation; (b)-(e) the solute breakthrough curves
estimated from the PDE models at each reference generation as compared to the
experimental data (i.e., (b)-(e)); and (f)-(i) the corresponding squared error; c
p
is
the concentration predicted from the PDEs and c
d
is the concentration observed
from the column experiment; the best PDE among candidates is presented for each
reference generations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2 The solute concentrations (i.e., (a)) and the squared errors (i.e., (b)) at the outlet
of the column from the PDEs with the experimental data reported in Scheidegger
[138]; c
p
is the concentration predicted from the PDEs and c
d
is the concentration
observed from the column experiment; See Table 5.2 for the PDEs’ details. . . . . 111
6.1 Relative health responses for multiple health effects along BPA doses and their
NMDR models [57]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
x
Abstract
Accurate and robust computational modeling of multi-scale hydrologic systems is a key in water
resources management. The main challenge here is uncertainty in characterizing heterogeneous
hydrologic systems under limited resources. Inevitably, the uncertainty occurs in models and their
parameters, leading to unreliable predictions. As described in the first chapter , this study aims to
establish novel computational approaches to overcome the uncertainty associated with parameters
and models and to provide accurate and robust outcomes for water resources management.
Firstly, the parameter uncertainty is addressed in the context of groundwater contamination in
the second chapter. Human health risks against Bisphenol A (BPA) in groundwater are investi-
gated under the uncertainty in the hydraulic conductivity, focusing on the impact of different health
models (i.e., dose-response models for systemic vs. endocrine-related adverse health effects). The
results of the Monte Carlo approach show the non-trivial behavior of the aquifer resilience loss and
the uncertainty against BPA under the interplay health response models and conductivity hetero-
geneity. This chapter provides insights for the risk management of groundwater systems against
emerging contaminants that have endocrine-related concerns.
Next, the model uncertainty is addressed by developing a hybrid physics-based and data-driven
modeling approach utilizing stochastic optimization. This approach aims to discover governing
equations, described by ordinary differential equations (ODEs) or partial differential equations
(PDEs), from observed data. The proposed approach is successfully demonstrated by identifying a
variety of ODEs/PDEs from simulated data sets for multi-scale hydrologic systems (e.g., soil water
balance, flow and transport in porous media) in the third and fourth chapters.
xi
Moreover, the proposed hybrid approach is applied into a more challenging problem for the
model uncertainty: discovering governing equations from limited experimental data. This case in
the fifth chapter emphasizes the novelty of the proposed approach that is capable of providing
physical insights about the underlying physics, even from the limited data set. The capabilities of
the proposed approach in capturing anomalous behavior and providing physical insights are further
enriched with incorporating more sophisticated mathematical expressions (e.g., the fractional time
derivative). The series of these chapters (Chapter 3-5) indicate that the proposed approach is a
powerful gray-box modeling tool.
Consequently, novel computational approaches are developed and applied to address model and
parameter uncertainties in modeling multi-scale hydrologic systems under heterogeneity. As de-
scribed in the last chapter, the research carried out in this study and its continuation contribute to
establishing accurate and robust modeling of hydrologic systems under uncertainty and supporting
the decision-making process in water resource management.
xii
Chapter 1
Introduction
Modeling has the utmost importance in water resources management by supporting the decision-
making process. The relevant questions are: what is a potential risk in a hydrologic system?
and where and how many sampling points do we need for remediation or future management.
Despite this importance, it is challenging to establish accurate and robust models for hydrologic
systems. Especially, when a target hydrologic system is associated with the subsurface system, as
encountered in most cases, it is unfeasible to access to the full characterization. This is because
the subsurface system is naturally heterogeneous in space, but there are limited resources, e.g.,
time and money, to investigate. As such, the uncertainty is unavoidable in modeling hydrologic
systems. Under the limited characterization, the uncertainty occurs in models themselves and
their parameters as well. This study proposes novel computational approaches to overcome these
uncertainties (i.e., parameter uncertainty and model uncertainty) and build an accurate and robust
model for the purpose of supporting water resources management.
For the first, the parameter uncertainty in hydrologic models is considered. A key parame-
ter in groundwater systems is the hydraulic conductivity which represents how fast groundwater
flows through an aquifer. Under the natural heterogeneity of the subsurface system, the hydraulic
conductivity varies several orders of magnitude in space [1, 2]. To capture the subsequent uncer-
tainty in the hydraulic conductivity, many researchers employed a stochastic approach, rooted from
stochastic hydrogeology [2–4]. In this approach, the hydraulic conductivity is deemed as a random
1
variable, described by statistical properties, and the space random function is adopted to repre-
sent or generate a random hydraulic conductivity field. Through this approach, the researchers
investigated how the uncertainty in the hydraulic conductivity propagates into flow and transport
properties [2–6]. Some of the researchers made a further step to the uncertainty in environmen-
tal performance metrics, such as human health risks, to support the decision-making process in
groundwater management [7–11].
However, the human health risk assessments performed in the abovementioned studies rely
solely on a conventional health response model. This model is based on a monotonic relationship
between an exposure level (i.e., a dose) and adverse health responses. The monotonic health re-
sponse model is limited in capturing an emerging threat in groundwater systems. Some of the
emerging contaminants, such as plasticizers, pharmaceuticals, and personal care products, are
known to have potential adverse effects to endocrine systems [12–15]. Since endocrine-related
health effects could occur in a specific low dose range, a non-monotonic does-response model is
needed to be incorporated in the human health risk assessment [16–19]. This is especially true un-
der the groundwater systems where a heterogeneous conductivity field could shape a low exposure
level for a long period of time by catching a contaminant plume and releasing it slowly.
Another type of the uncertainty in modeling hydrologic systems that needs attention is the
model uncertainty. Considering hydrologic systems mostly involved in a large-scale domain with
limited information, it is obvious that user’s assumptions and simplifications are insufficient in es-
tablishing a model. In order to improve the reliability of the model, there is a need to incorporate
observed data into the model. Along with the progress in Artificial Intelligence (AI) and Machine
Leaning (ML), lots of data-driven modeling approaches are suggested [20–24]. On top of that,
these approaches are further improved by incorporating system physics (i.e., user’s prior knowl-
edge) into the model [25–32]. This hybrid physics-based and data-driven modeling has made a
progress in enhancing the reliability of the AI/ML-based models, e.g., physics-informed neural
networks.
2
Although the reliability of the AI/ML-based models has been improved by respecting funda-
mental physical laws, it is difficult to agree that an important decision in water resources man-
agement is made solely on the basis of, so-called, black-box models. At this point, the physical
interpretability of the AI/ML-based models is critical to their reliability as well, which is not yet
guaranteed. In order to secure the interpretability, the models need to be described by accessi-
ble mathematical expressions, such as differential equations [33]. The interpretable AI/ML-based
models are then able to reveal system physics embedded in a data set as well as to predict system
responses.
The objective of this study is to address two types of the uncertainty (i.e., parameter uncertainty
and model uncertainty) in modeling hydrologic systems:
1. Investigated the propagation of the uncertainty in the hydraulic conductivity into the aquifer
resilience loss by incorporating endocrine-related health effects in human risk assessments.
Based on the results, several suggestions are made for risk managers against groundwater
contamination with emerging contaminants.
2. Develop a hybrid physics-based and data-driven modeling approach utilizing stochastic op-
timization. This modeling approach aims to discover governing equations, described by
ordinary or partial differential equations (ODEs or PDEs), from a given data set. By cap-
turing system physics with the accessible mathematical expressions, the underlying physics
embedded in the data set is revealed, as well as accurate and robust predictions are made.
1.1 Publications and Outline of the Dissertation
The chapters of this dissertation are based on the following journal papers:
1. Im, J., Rizzo, C. B., & de Barros, F. P. (2020). Resilience of groundwater systems in the
presence of Bisphenol A under uncertainty. Science of The Total Environment, 727, 138363.
3
2. Im, J., Rizzo, C. B., de Barros, F. P., & Masri, S. F. (2021). Application of genetic program-
ming for model-free identification of nonlinear multi-physics systems. Nonlinear Dynamics,
104(2), 1781-1800.
3. Im, J., de Barros, F. P., & Masri, S. F. (2022). Data-driven identification of partial differential
equations for multi-physics systems using stochastic optimization. Computational Mechan-
ics (under review)
4. Im, J., de Barros, F. P., Masri, S. F., & Sahimi M., (2022). Discovery of the governing
equations for transport in Berea sandstone cores by stochastic optimization. (in preparation)
In Chapter 2, the study on how the uncertainty due to conductivity heterogeneity propagates
into the aquifer resilience loss depending on different health response models is presented. In
Chapter 3, a hybrid physics-based and data-driven modeling approach is developed by utilizing
stochastic optimization to discover ODEs from data. In Chapter 4, the proposed hybrid approach
is expanded to discover PDEs from data for hydrologic systems. In Chapter 5, governing equations
for transport in Berea sandstone cores are discovered from an experimental data set by the proposed
hybrid approach. Finally, conclusions and ongoing work are presented in Chapter 6.
4
Chapter 2
Analysis of the Uncertainty Propagation of Conductivity
Heterogeneity on the Aquifer Resilience Loss with Different
Health Response Models
2.1 Introduction
Challenges associated with water scarcity around the world have brought attention to indirect
potable reuse. Such practice makes use of groundwater being recharged with advanced treated
wastewater as a potable water source. It has been proven effective and is currently being employed
in many countries including the United States, the United Kingdom, Belgium, Singapore, and Aus-
tralia [34]. This groundwater recharge practice with reclaimed wastewater raises health concerns
regarding emerging contaminants such as plasticizers, pharmaceuticals, and personal care prod-
ucts [12, 13]. This is because the toxicity of emerging contaminants has not been fully understood.
Critical and demanding issues lie in characterizing the potential human health risks associated
with the presence of emerging contaminants in groundwater systems and in estimating the aquifer
resilience that represents the capability of the aquifer to recover its reliability as a potable water
source against the contamination [35, 36]. The uncertainty in the toxicity of emerging contam-
inants, coupled with the ubiquitous heterogeneity of the subsurface environment, can lead to a
5
complex temporal profile of health risks at an environmentally sensitive target (EST) by induc-
ing adverse health effects in multiple exposure dose ranges for different time periods. Thus, the
combined toxicological and hydrogeological factors pose difficulties to the assessment of human
health risks and aquifer resilience involved with emerging contaminants in groundwater.
Among emerging contaminants, Bisphenol A (BPA) is chosen for this study on the grounds
that BPA is one of the most frequently detected emerging contaminants in groundwater and a well-
known endocrine disruptor [12, 13, 15, 17]. BPA is a plasticizer being used all over the world
for plastic and epoxy production. Estimating the health risks and the aquifer resilience associated
with the presence of BPA in groundwater is a non-trivial task for multiple reasons. The first reason
is its unconventional toxicity to human health. Since BPA is weakly estrogenic, it is known to
cause reproductive and developmental adverse effects to humans. The United States Environmen-
tal Protection Agency (USEPA) has set the Reference Dose (RfD) as a threshold dose to regulate
BPA based on the monotonic dose-response (DR) model of systemic health effects in the general
population [37]. However, recent toxicological and epidemiological studies report that BPA can
cause endocrine-related health effects within certain low dose ranges below the current regula-
tory level, which displays non-monotonic relationships between the exposure dose and the health
response [14, 18, 19]. Although these health effects have not come to the consensus yet, the Na-
tional Toxicology Program (NTP) showed concern for the effects related to neural and behavior
alterations and lesions in the prostate glands in fetuses, infants, and children in low dose ranges of
BPA which are from 1 to 1000 times lower than the RfD due to the vulnerability of their endocrine
systems [38, 39]. This unconventional toxicity of BPA, having different health responses in differ-
ent dose ranges depending on population groups, renders the estimation of health risks and aquifer
resilience non-trivial.
Another difficulty in the risk and resilience estimation associated with BPA contamination in
groundwater systems arises from incomplete subsurface site characterization. Hydrogeological
properties such as the hydraulic conductivity vary by several orders of magnitude in space. This
6
multi-scale variability leads to a complex structural network of flow pathways that control the dis-
persive behavior of the contaminant plume and its concentration value at an EST ([1]; Ch. 9 and
10 of [2]). Given the impracticability to fully characterize subsurface heterogeneity due to re-
source constraints, flow and transport model predictions within heterogeneous aquifers are subject
to uncertainty. Many works have extensively investigated this issue with a stochastic framework
[2, 4]. As such, these works have provided the tools to develop probabilistic models for flow and
transport while incorporating the effects of aquifer heterogeneity. These stochastic tools have been
employed to conduct uncertainty quantification of the health risks associated with specific contam-
inants present in groundwater, such as radionuclides [40], organic solvents [41], carbon dioxide
[42], heavy metals [43], and pathogens [44]. Several works on probabilistic risk analysis have
been used to improve risk-based decision making strategies and better allocate resources towards
site investigation, management, and remediation [5, 7–11]. For instance, Henri et al. [45] and
Libera et al. [46] explored the health risks associated with chlorinated solvent mixtures. Henri
et al. [45] indicate that preferential channels within aquifers could be beneficial or detrimental in
health risks depending on locations of ESTs.
The above-mentioned works focus on assessing the health risk for specific contaminants in
groundwater systems. In particular, most of these works mainly focus on lifetime cancer risks
of contaminants with some threshold levels such as maximum contaminant levels or acceptable
risk levels. BPA, unlike other contaminants, has concerns on particular non-cancer health risks
associated with certain low exposure ranges in specific time windows such as prenatal or infancy
periods [47]. Nevertheless, existing studies on the risks associated with BPA in groundwater rely
on the regulatory level estimated from the monotonic dose-response (MDR) model [39, 48–50].
The emergence of recent toxicological and epidemiological studies for BPA indicates certain low
dose ranges where endocrine-related health effects in vulnerable subpopulations were observed
and modeled by non-monotonic dose-response (NMDR) models [17]. Since the health risks char-
acterized from NMDR models with low dose ranges can be counterintuitive as compared to those
7
from MDR models with threshold levels, the recent findings urge the scientific community to as-
sess the health risks of BPA associated with the NMDR model by targeting vulnerable population
[16].
In addition to the risk characterization, it is important to understand time-dependent changes
of health risks at an EST, which leads to the discussion about resilience of groundwater systems
against contamination. Resilience was first conceptualized as the magnitude of the persistence of
ecosystem functions against change and disturbance [51]. Since then, several studies have adapted
the resilience concept to represent system reliability against external perturbations in a concerned
period of time. The American Society of Civil Engineers (ASCE) defined the resilience of critical
infrastructure as “the capability to mitigate against significant all-hazards risks and incidents, and
to expeditiously recover and reconstitute critical services with minimum damage to public safety
and health, the economy, and national security” in ASCE Policy Statement 518 [52]. In terms
of the critical infrastructure that provides services such as ecosystems and multipurpose wells,
the resilience of groundwater systems has been studied [53–55]. Among services groundwater
systems can offer, a drinking water supply is critical to humans. In this sense, the resilience
of groundwater systems against contamination can be investigated by examining its reliability as
a potable water source based on health risks over time [35, 56]. When the concept of aquifer
resilience is based on health risks, it allows contaminated site managers to interpret the health risks
of pollutants with respect to time by presenting the time the aquifer needs to recover its reliability
as a potable water source. In the case of BPA, it is important to recognize that different types of
DR models (i.e., monotonic and non-monotonic) can lead to different results for health risks and,
subsequently, aquifer resilience in heterogeneous aquifers. One of our primary goals is to show that
the uncertainty of aquifer resilience will depend on the DR models, which leads to implications for
risk management strategies.
In the context of growing health concerns regarding emerging contaminants present in ground-
water, this study chooses one of the emerging contaminants possibly causing endocrine-related
8
health effects, BPA, and investigates the aquifer resilience against contamination under hydrogeo-
logical uncertainty. Especially, an unconventional way of characterizing health risks with NMDR
relationships is adopted to address endocrine-related health effects from BPA. Specific research
questions are as follows: (1) How does aquifer heterogeneity control the aquifer resilience and its
uncertainty as a function of BPA DR models? (2) How does the interplay between DR models
and natural heterogeneity affect risk management strategies for groundwater contamination sites
with BPA? Our work presents a systematic stochastic methodology that links risk characteriza-
tion and aquifer resilience. We illustrate how the combined effects of aquifer heterogeneity and
dose-response models impact the stochastic characterization of aquifer resilience in a non-trivial
manner. Our analysis also shows how resilience loss is controlled by the residence time of the BPA
plume at the control plane and the volumetric flow rate at the source zone. Last, suggestions for
the risk management involved with BPA contamination in groundwater are made.
2.2 Problem Statement
We consider a contaminant source that releases BPA into an aquifer (Fig. 2.1). The BPA plume
is assumed to move under an anaerobic condition without degradation until it reaches an area
where exposure occurs through the ingestion pathway. The control plane (CP) perpendicular to
the x-axis is located in an environmentally sensitive location. The CP can represent, for example,
an area located in the vicinity of a battery of wells supplying water to a local community. As
for the target population, we assume that pregnant/lactating women are exposed to BPA during
their gestation and lactation periods. BPA is evaluated as a noncarcinogenic substance in the
human health risk assessment. The potential health effects under consideration are as follows:
(1) systemic health effects in pregnant/lactating women as a cohort of the general population [37]
and (2) endocrine-related health effects in their male offspring [38, 57]. We determine the aquifer
reliability depending on whether groundwater is considered potable on the basis of health risks and
9
estimate the loss of aquifer resilience by examining the aquifer reliability over a concerned time
period.
BPA release
Vulnerable
population
Control plane Contamination source
Figure 2.1: Schematic overview of the groundwater BPA contamination problem under considera-
tion.
2.2.1 Aquifer Resilience Loss and Reliability
Resilience is here defined as the capability of the aquifer to recover the state where it can be used
as a potable water source with negligible health risks (i.e., the capability of the aquifer to recover
its reliability as a potable water source). A high resilience value means that an aquifer is possible
to restore a condition that groundwater is considered potable in a short period of time. In order to
quantify the impact of contamination on the aquifer resilience, we utilize the concept of the loss of
aquifer resilience R
L
(d). The mathematical expression for R
L
is:
R
L
=
Z
t
0
+t
f
t
0
[1− ξ(t)]dt, (2.1)
whereξ(t) is the aquifer reliability (-) at the CP when the exposure starts at time t, t
0
is the initial
time (d) of the contamination, and t
f
is the time horizon (d) for the analysis. R
L
is calculated from
the time profile of ξ(t) and represents the time period when the aquifer is not reliable as a potable
water source due to a potential health risk.
10
The aquifer reliability ξ(t) is set to 0 when the aquifer poses a potential health risk to hu-
mans. On the other hand, ξ(t) is assigned to 1 when the aquifer has a negligible health risk, and
therefore is reliable as a potable water source. For this study, two types of health effects caused
from BPA are considered: systemic health effects and endocrine-related health effects as shown
in Table 2.1. The way of characterizing health risks of BPA depends on types of adverse health
effects because the health effects are modeled by different DR models. First, for systemic health
effects of BPA in pregnant/lactating women for the chronic exposure scenario, the Reference Dose
(RfD) recommended by the USEPA [37] is used as a threshold level to characterize health risks
based on monotonic DR models. Second, for endocrine-related health effects of BPA in vulnerable
population such as fetuses, infants, and children, low dose ranges reported by recent studies can be
considered to determine health risks based on non-monotonic DR models [38, 58, 59]. Among the
recent studies, we adopt one study to estimate one low dose range potentially causing endocrine-
related health effects in male offspring as the outcome of the prenatal/postnatal exposure to BPA
[57] [Appendix A]. As a result,ξ(t) can be expressed as follows:
The monotonic dose-response (MDR) model considering systemic health effects in preg-
nant/lactating women [37]
For this case, the aquifer reliability function is defined as:
ξ(t)=
1, ADD(t)≤ RfD;
0, otherwise,
(2.2)
where ADD(t) is the average daily dose (µg/kg/d) when exposure starts at time t and the RfD is
the reference dose of BPA (= 50µg/kg/d) recommended by the USEPA.
11
The non-monotonic dose-response (NMDR) model considering endocrine-related health ef-
fects in male offspring [57]
If endocrine-related health effects are to be considered, the reliability function becomes:
ξ(t)=
1, ADD(t)≤ D
L
or ADD(t)≥ D
U
;
0, otherwise,
(2.3)
where D
U
and D
L
are upper and lower bounds (equal to 0.68 and 80.69µg/kg/d, respectively) of
the low dose range.
Table 2.1: Two types of adverse health effects induced by BPA
Adverse health effects Systemic effects
a
Endocrine-related effects
b
End points Reduced body weight, re-
duced organ weight, etc.
Neural and behavioral alter-
ations, lesions in the prostate,
etc.
Target population General population Vulnerable population (e.g., fe-
tuses, infants, and children)
Exposure duration Chronic toxicity (> 1 yr)
c
Specific time window (e.g.,
pregnancy and infancy)
Dose-response relationship Monotonic model Non-monotonic model
Benchmark dose type Threshold level Low dose range
a
USEPA [37],
b
Shelby [38]; Jones et al. [57]; Vandenberg [18],
c
USEPA [60]
The ADD(t) is the average daily dose value having the total exposure duration ED as a running
time window, in which exposure starts at time t and maintains throughout the ED. Since we have
concerns on pregnant/lactating women as the exposed target group, the ADD(t) is computed from
the integration of daily doses over each period (i.e., pregnant/lactating periods). Therefore, the
ADD(t) can be divided in terms of the ED as follows:
ADD(t)=
1
AT
Z
t+ED
1
t
c
F
(t;x
cp
)IR
1
BW
1
dt+
Z
t+ED
1
+ED
2
t+ED
1
c
F
(t;x
cp
)IR
2
BW
2
dt
, (2.4)
12
where c
F
(t;x
cp
) is the flux-averaged concentration ( µg/l) at the CP, x
cp
is the longitudinal location
(m) of the CP, IR
i
is the ingestion rate (l/d) of drinking water for each period (IR
1
and IR
2
for preg-
nant and lactating periods, respectively), BW
i
is the body weight (kg) for the exposed individual in
each period, ED
i
is the exposure duration (d) for each period, and AT is the averaging time (d) that
is equal to the total exposure duration (i.e., ED= ED
1
+ED
2
). The flux-averaged concentration is
defined as the ratio of the mass discharge to the volumetric water flux at the CP [61]. The exposure
parameters used in the estimation of the ADD(t) are aligned with risk assessment guidelines from
the USEPA [62–64] (Table 2.2).
Table 2.2: Exposure parameters for the health risk of pregnant/lactating women
Parameter Gestation period
a
Lactation period
a
Ingestion rate, IR (l/d) 2.589 3.588
Body weight, BW (kg) 75 80
Exposure duration, ED (d) 280 180
a
USEPA [63, 64]
2.2.2 Flow and Transport Model
In order to estimate the ADD(t) at the CP (see Equation 2.4), groundwater flow and BPA transport
simulations are carried out. We consider a three-dimensional (3D) confined aquifer in the Cartesian
coordinate system given by x=(x,y,z) (see Fig. 2.1). The aquifer under consideration has length
L
x
= 720 m, width L
y
= 520 m, and thickness L
z
= 104 m. Groundwater flow is assumed under
steady-state conditions. As boundary conditions, constant hydraulic heads at x= 0 and 720 m are
assigned to maintain a constant hydraulic gradient of J
x
= 0.002 along the x-direction. The other
boundaries have no-flow conditions. This set-up ensures that flow is uniform-in-the-mean along the
x-direction. The hydraulic conductivity is assumed locally isotropic and spatially heterogeneous.
Groundwater flow is governed by the following equations:
∇· u(x)= 0, (2.5)
13
with:
u(x)=− K(x)
φ
∇h(x), (2.6)
where u(x) is the linear groundwater velocity (m/d),φ is the porosity (-) (assumed to be constant),
K(x) is the hydraulic conductivity (m/d), and h(x) is the hydraulic head (m). We assume that the
conductivity field is the only source of uncertainty in the flow model.
For the BPA transport, the advection-dispersion equation with the retardation factor is em-
ployed:
R
f
∂c(x,t)
∂t
+∇· [u(x)c(x,t)]− ∇· [D(x)∇c(x,t)]= 0, (2.7)
where R
f
is the retardation factor (-), D(x) is the local scale dispersion tensor (m
2
/d), and c(x,t)
is the resident concentration (µg/m
3
) in groundwater. Equations 2.5-2.7 allow us to compute the
flux-averaged concentration ( c
F
(t)) at the CP.
The retardation is assumed to be originated from the partitioning effect of BPA between ground-
water and the porous medium, which is illustrated by a linear relationship. Since we assume that
the BPA transport occurs under an anaerobic condition, the degradation is neglected according to
studies which report that BPA displayed the resistance to biodegradation under anaerobic condi-
tions [65, 66]. The local scale dispersion tensor consists of the mechanical dispersion and the
molecular diffusion. The mechanical dispersion considers longitudinal, horizontal transverse, and
vertical transverse dispersivity coefficients with respect to the principal flow (i.e., α
L
, α
T H
, and
α
TV
(m), respectively), as proposed by Burnett and Frind [67].
The aquifer is subject to a continuous mass release of BPA for a certain time period in a volu-
metric source zone. The size of the source zone is 120× 120× 24 m
3
and its center is located at
the coordinate x
c
s
=(260,260,52) m.
14
2.3 Numerical Implementation
We adopt a stochastic framework to address the randomness associated with the heterogeneity
of the hydraulic conductivity field, which subsequently leads to the randomness of BPA plume
behavior and aquifer resilience. By randomly generating hydraulic conductivity fields (each field
having the same underlying spatial statistics), we can estimate the loss of the aquifer resilience, R
L
,
in a probabilistic way. The effects of aquifer heterogeneity on R
L
are then computed in a statistical
sense. To compute statistical properties of R
L
(e.g., mean and variance), we perform a Monte Carlo
simulation as follows:
(1) Generate heterogeneous hydraulic conductivity fields from a given random space function
model.
(2) Solve the groundwater flow and simulate BPA transport for each conductivity field.
(3) Characterize human health risks and estimate R
L
for each field.
(4) Compute the ensemble statistics of R
L
from all possible conductivity fields. For this work
we mainly focus on the mean and variance of R
L
The geostatistical software SGeMS is used to generate random realizations of the heteroge-
neous hydraulic conductivity fields [68]. The computational domain is discretized into 90 × 65× 65
cells (Fig. 2.2). The log hydraulic conductivity (Y = lnK) fields are generated using sequential
Gaussian simulation with zero mean (⟨Y⟩= 0) and two variances (i.e.,σ
2
Y
= 0.25 and 2.5). These
variances are chosen to represent distinct levels of subsurface heterogeneity with reference to Cape
Cod (i.e., low heterogeneity) and North Bay (i.e., mild heterogeneity) sites [3, 69]. We adopt an
anisotropic exponential semi-variogram model for Y fields. The variagram model is characterized
by a horizontal correlation scale λ ≡ λ
x
=λ
y
= 40 m and a vertical correlation scale λ
z
= 8 m.
The numerical block scale used in the flow solver is set to one fifth of the correlation scale for
all directions to capture heterogeneity (i.e.,∆
x
= 8 m,∆
y
= 8 m, and∆
z
= 1.6 m). All simulation
parameters are reported in Table 2.3. For each value ofσ
2
Y
, we generate 500 random realizations of
15
the hydraulic conductivity field. The convergence of the mean and variance of R
L
with differentσ
2
Y
and DR models in the Monte Carlo simulation is displayed in Fig. 2.3. This convergence indicates
that the number of conductivity realizations used in our work is sufficient to reflect the mean and
variance of R
L
at the CP. These realizations account for many potential migration patterns of the
contaminant plume in the aquifer (which is reflected in the solute breakthrough curve at the CP). In
other words, the low-order moments of R
L
accounts for many possible solute breakthrough curves
at the CP.
7
#
Source zone
3(
&
×
(
×
&
)
5
#
5
(
5
(
3
#
5
&
5
&
Control plane
Grid block
0.2(
&
×
(
×
&
)
Figure 2.2: Computational model set-up for numerical solutions for groundwater flow and BPA
transport.
Groundwater flow is solved with the USGS finite difference code MODFLOW [70]. The con-
taminant transport is simulated with the 3D transport model MT3DMS [71]. Transport param-
eters are summarized in Table 2.3. The molecular diffusion coefficient of BPA is 0 .508× 10
− 9
m
2
/s [72]. Dispersivity coefficients are set to reflect the subgrid heterogeneity: α
L
< 0.1∆
x
,
α
T H
< 0.1α
L
, and α
TV
< 0.1α
T H
[73]. For the retardation of BPA transport, the relationship
between concentrations in the porous media and groundwater is assumed to be linear, and the
following retardation coefficient R
f
is 3.8 [74].
In terms of correlation scales, the source zone has dimensions of 3λ
x
× 3λ
y
× 3λ
z
, and its center
is located at(6.5λ
x
, 6.5λ
y
, 6.5λ
z
) (see Fig. 2.2). The source zone is 5λ
i
(i.e., i= x,y,z) away from
16
Table 2.3: Simulation parameters for the groundwater flow and BPA transport
Parameter Value
Numerical domain
Domain dimensions, L
x
× L
y
× L
z
(m× m× m) 720× 520× 104
Number of cells, n
x
× n
y
× n
z
90× 65× 65
Cell dimensions,∆
x
× ∆
y
× ∆
z
(m× m× m) 8× 8× 1.6
Heterogeneous field
Mean of Y ,⟨Y⟩ (Y = lnK and[K]= cm/s) 0
Variance of Y ,σ
2
Y
0.25 and 2.5
Variogram model exponential model
Correlation scales,λ
x
× λ
y
× λ
z
(m× m× m) 40× 40× 8
Flow problem
Average hydraulic gradient, J
x
(-) 0.002
Porosity,φ (-) 0.4
Transport problem
Molecular diffusion coefficient, D
m
(m
2
/s) 0.508× 10
− 9
Longitudinal dispersivity,α
L
(m) 0.04
Horizontal transverse dispersivity,α
T H
(m) 0.004
Vertical transverse dispersivity,α
TV
(m) 0.0004
Bulk density,ρ
b
(g/cm
3
) 1.5
Retardation factor, R
f
(-) 3.8
0 100 200 300 400 500
Number of realizations [EA]
0.00
0.25
0.50
0.75
1.00
1.25
1.50
1.75
2.00
Computed relative value to the final value [-]
R
L
w/
2
Y
= 0.25 & MDR
VAR(R
L
) w/
2
Y
= 0.25 & MDR
R
L
w/
2
Y
= 2.5 & MDR
VAR(R
L
) w/
2
Y
=2.5 & MDR
R
L
w/
2
Y
= 0.25 & NMDR
VAR(R
L
) w/
2
Y
=0.25 & NMDR
R
L
w/
2
Y
= 2.5 & NMDR
VAR(R
L
) w/
2
Y
=2.5 & NMDR
Figure 2.3: Monte Carlo convergence of relative means and variances of R
L
(Equation 2.1) with
different log-conductivity variances (σ
2
Y
) and dose-response (DR) models.
17
each boundary to avoid boundary effects. The mass loading rate is chosen 5.3 kg/mo in accordance
with the study referred for the retardation coefficient [74]. The loading rate maintained constant
for the first 5 mo after the initial time of the contamination t
0
= 0 mo. The CP is located at x= 15λ
x
(i.e., x= 7λ
x
away from the source zone) and the data are collected by 3 d for the total simulation
time t
f
= 60 mo. Fig. 2.4 summarizes the methodological framework developed in this work.
18
Human Health Risk Assessment
Aquifer Resilience Estimation
Stochastic Hydrogeological Modeling
Conductivity fields
!
( = 1,2,…,)
Contamination Scenario
Flow fields
!
()
Resident concentration
fields
!
(,)
Flux-averaged
concentration at the CP
",!
()
Average daily dose
!
()
Aquifer reliability
!
()
Aquifer resilience loss
,!
Site Management Strategy
Ensemble statistics of
depending on
&
'
and
DR models
7.5
-7.5
0.0
2.5
5.0
-2.5
-5.0
Y=ln()
!
"
= 0.25
!
"
= 2.5
!
"
= 0.25
!
"
= 2.5
Time
ADD(t)
Reference dose from MDR
Low dose range from NMDR
R
L
0.25
MDR NMDR
2.5 0.25 2.5
!
"
Figure 2.4: The flow chart of the systematic methodology developed for this study; The method-
ology integrates stochastic hydrogeological modeling, human health risks, and aquifer resilience;
MDR stands for the monotonic dose-response model and NMDR for the non-monotonic dose-
response model; The hydraulic conductivity field K
i
(x) is randomly generated for two levels of
heterogeneity,σ
2
Y
= 0.25 and 2.5, in the Monte Carlo framework; The velocity field u
i
(x) and BPA
concentration c
i
(x,t) are computed from Equations 2.5-2.7; The average daily dose ADD
i
(t) is
computed according to Equation 2.4 andξ
i
(t) is computed from Equations 2.2-2.3 with MDR and
NMDR models; The resilience loss R
L,i
is calculated from Equation 2.1.
19
2.4 Computational Results & Discussion
Computational results are organized into three parts. The first part focuses on the effects of the
level of aquifer heterogeneity on the aquifer resilience (Section 2.4.1). The second part analyses
the relationship between the aquifer resilience and key hydrogeological descriptors (Section 2.4.2).
Finally, we discuss the implications of the results in the context of risk management strategies for
groundwater contamination sites with BPA (Section 2.4.3).
2.4.1 Effects of Level of Aquifer Heterogeneity on Aquifer Resilience
2.4.1.1 Average Daily Dose and Aquifer Reliability
We display the temporal evolution of the ADD(t) and the following statistics of ξ(t) for the case
of σ
2
Y
= 0.25 in Fig. 2.5. In Fig. 2.5a, we show the ADD(t) behavior for all conductivity real-
izations, including the median and 10th and 90th percentiles. For the reference, the benchmark
doses evaluated from different DR models are displayed together (i.e., the reference dose RfD and
the upper/lower bounds D
U
and D
L
). The median of the ADD(t) slightly rises at the initial period
(t < 2 mo) up to nearly 90 µg/kg/d and rapidly decreases, making a concave shape reaching to
0.001 µg/kg/d around t≈ 22 mo. In terms of benchmarks doses from different DR models, all
ADD(t) curves with σ
2
Y
= 0.25 are higher than the RfD based on the MDR model at the early
period (0< t < 15 mo) and fall in the low dose range computed from the NMDR model at the time
period from 10 to 25 mo. This indicates that in the case of σ
2
Y
= 0.25 and for our model set-up,
health risks from both systemic and endocrine-related health effects exist in different time periods.
Following the time-dependent changes of the ADD(t) with respect to benchmark doses, ξ(t)
is determined for each realization on the basis of whether there is a potential risk for each adverse
health effect (see Equations 2.2 and 2.3). The ensemble statistics ofξ(t) (i.e., the mean and vari-
ance, namely⟨ξ(t)⟩ and V AR[ξ(t)]) are computed depending on the DR models and displayed for
the case ofσ
2
Y
= 0.25 in Fig. 2.5b. Sinceξ(t) describes the reliability of the aquifer as a potable
water source with binary numbers 1 and 0,⟨ξ(t)⟩ represents the probability of the aquifer being
20
reliable when exposure starts at time t. V AR[ξ(t)] accounts for the uncertainty in the aquifer reli-
ability. Behaviors of⟨ξ(t)⟩ computed from both DR models are similar although they have a time
gap (see Fig. 2.5b). The mean value,⟨ξ(t)⟩, computed with the MDR and NMDR models have
severe damages and sharp recovery within the early period (t < 25 mo). They initially decrease
to 0 at t = 3 and 8 mo for MDR and NMDR models, respectively, and sharply bounce back to 1
at t = 15 and 25 mo, which are well matched with time moments when all ADD(t) curves cross
the RfD, D
U
, and D
L
(compare Fig. 2.5b with Fig. 2.5a). The dynamics of V AR[ξ(t)] over time
display peaks when the median of the ADD(t) meets the benchmark doses (i.e.,⟨ξ(t)⟩ is 0.5).
As the level of aquifer heterogeneity increases from σ
2
Y
= 0.25 to 2.5, the asymmetry of the
ADD(t) curves and the mean and variance ofξ(t) change as illustrated in Fig. 2.6. The median of
the ADD(t) for σ
2
Y
= 2.5 decreases significantly slower after its peak than the one for σ
2
Y
= 0.25
(compare Fig. 2.6a with Fig. 2.5a). The gap between 10th and 90th percentiles for σ
2
Y
= 2.5
becomes larger over time than the one for σ
2
Y
= 0.25, which indicates that the uncertainty of the
ADD(t) increases as aquifers become more heterogeneous. These results demonstrate that the
macro-dispersion effect arising from the increased level of the heterogeneity in aquifers rendered
the ADD(t) curves stretched and flattened with increasing uncertainty. Since the ADD(t) curves
for σ
2
Y
= 2.5 have lower peak values and longer tails (with larger variability among realizations)
than those forσ
2
Y
= 0.25, the chance for the ADD(t) values to be above the RfD decreases, while
the period of the ADD(t) remaining in the low dose range extends with variability (compare Fig.
2.6a with Fig. 2.5a). This result shows that aquifer heterogeneity changes the interaction between
the ADD(t) and benchmark doses with increasing uncertainty.
Figs. 2.5b and 2.6b report the temporal evolution of the mean and variance ofξ(t). As observed
in Fig. 2.5b and 2.6b, the temporal dynamics of theξ(t) statistics strongly depend on bothσ
2
Y
and
the DR models. The behavior of the first two moments of ξ(t) is dictated by the magnitude of the
ADD(t) and its distribution in time while considering the benchmark doses.
21
Figure 2.5: Time profiles of the ADD(t) and the statistics ofξ(t) for the case ofσ
2
Y
= 0.25; (a) the
ADD(t) as a function of the exposure starting time t at the CP; (b) the mean and variance ofξ(t)
depending on the DR models.
22
Figure 2.6: Time profiles of the ADD(t) and the statistics ofξ(t) for the case ofσ
2
Y
= 2.5; (a) the
ADD(t) as a function of the exposure starting time t at the CP; (b) the mean and variance ofξ(t)
depending on the DR models.
23
2.4.1.2 Loss of Aquifer Resilience
Next, we evaluate R
L
due to the BPA contamination (see Equation 2.1) as a function of the DR
model and degree of conductivity heterogeneity. R
L
indicates how long the aquifer poses risks
to human health, i.e., the aquifer is not reliable as a potable water source. In other words, it
represents how long the aquifer takes to recover its state as a potable water source where adverse
health effects from the BPA contamination are not likely to occur. Thus, as R
L
is lower, the aquifer
is more resilient.
Statistical distributions (in the form of box plots) of R
L
for eachσ
2
Y
and DR model are displayed
in Fig. 2.7 (i.e., the mean, quartiles, and 5th and 95th percentiles). Fig. 2.7 shows that the
interquartile range for R
L
varies according toσ
2
Y
and DR models. Note that the mean behavior of
R
L
exhibits opposite trends with σ
2
Y
for each DR model (compare red and blue box plots in Fig.
2.7 for differentσ
2
Y
). When the MDR model is considered, the mean of R
L
decreases sightly from
12 to 8 mo whenσ
2
Y
changes from 0.25 to 2.5. On the other hand, as shown in the left-half of Fig.
2.7, the spread of the R
L
distribution increases with an increase ofσ
2
Y
. As for the cases when the
NMDR model is considered (i.e., blue box plots located in the right-half of Fig. 2.7), the increase
ofσ
2
Y
brings an increase in the mean of R
L
from 16 to 39 mo (compare results forσ
2
Y
= 0.25 and
0.5). Moreover, the spread of the R
L
distribution significantly increases in contrast to the case of
the MDR model, e.g., the inter-quartile range of R
L
is augmented from 14-17 to 29-53 mo. This
indicates that when the NMDR model is considered, the increased level of aquifer heterogeneity
renders the recovery time of the aquifer reliability longer (in an average sense) and more uncertain.
The results in Fig. 2.7 clearly show the interplay of aquifer heterogeneity and dose-response
models in controlling the mean behavior and uncertainty estimates of R
L
. For our model set-up,
our results demonstrate that the aquifer heterogeneity could be beneficial to the aquifer resilience
when the MDR model is considered, while it would be detrimental to the aquifer resilience by
increasing the mean of R
L
and by posing additional uncertainty in R
L
when the NMDR model
is employed. This is because, for our model set-up, the increase of σ
2
Y
renders peak values of
the ADD(t) decreased. Subsequently, the probability that the aquifer will be reliable when the
24
MDR model is considered increases (compare the solid red line in Fig. 2.6b with the one in Fig.
2.5b). On the other hand, the increased level of heterogeneity makes the tails of the ADD(t) longer
and uncertain. This change decreases the probability that the aquifer will be reliable and also
augments the uncertainty when the NMDR model is used (compare the solid and dashed blue lines
in Fig. 2.6b with those in Fig. 2.5b). We point out that these results are limited to the numerical
set-up adopted in this study. The conclusions can vary and depend on contaminant release rates,
distance from source to receptor, chemical reactions, the random space function model for Y and
P´ eclet numbers. Notwithstanding these constraints, our results reveal that the NMDR model could
bring distinct results of the aquifer resilience uncertainty estimation as opposed to the MDR model
currently employed for health risks of BPA.
2
Y
=0.25 &
MDR
2
Y
=2.5 &
MDR
2
Y
=0.25 &
NMDR
2
Y
=2.5 &
NMDR
0
10
20
30
40
50
60
R
L
[mo]
5/95 percentile for whisker
1st/3rd quartile for box
median
mean
Figure 2.7: Distributions of R
L
depending onσ
2
Y
and the DR models.
25
2.4.2 Relationship between Aquifer Resilience and Hydrogeological Descriptors
Next, we investigate the relationship between R
L
and two hydrogeological descriptors. These
descriptors are the residence time of the BPA plume at the CP and the flow rate at the source
zone. Our previous results showed how macro-dispersion controlled the temporal dynamics of
the exposure dose and reliability. The residence time of the plume at the CP is a metric that
reflects the effects of macro-dispersion. On the other hand, previous studies [6, 75–77] showed
that the volumetric flow rate through the source zone plays a pivotal role in controlling both plume
spreading and plume lengths.
2.4.2.1 Residence Time of BPA Plumes at the Control Plane
To comprehend the behaviors of BPA plumes at the CP due to heterogeneous conductivity struc-
ture, breakthrough curves (BTCs) for the BPA mass at the CP are investigated as a function of the
two heterogeneity levels of the aquifer. Figure 2.8 displays the mass BTCs of BPA for all real-
izations as well as the median and 10th and 90th percentiles. For the purpose of references, lines
for 5, 50, and 95% of the total BPA mass are presented together. As the level of the heterogeneity
increases, the median of the BTC ensemble becomes stretched (compare Fig. 2.8a with Fig. 2.8b).
The early arrival time t
5%
(i.e., the time when 5% of the total mass arrives at the CP) decreases
from 9 to 7 mo, while the late arrival time t
95%
(i.e., the time when 95% of the total mass arrives at
the CP) increases from 17 to 28 mo. Figure 2.8b demonstrates that the increased level of aquifer
heterogeneity disperses BPA plumes passing through the CP with variability. The degree of disper-
sivity of the BPA plume can be represented by the difference between early and late arrival times
(i.e., t
95%
− t
5%
) which indicates an estimate of the plume’s residence time at the CP.
Given that the ADD(t) is evaluated from the total exposure duration ED, the amount of BPA
which is exposed during the ED determines the magnitude of the health risks. For such reasons,
we normalize the residence time normalized with the ED. The normalized residence time, denoted
by∆t
r
(-), is written as:
26
0 10 20 30 40 50 60
t [mo]
0.0
0.2
0.4
0.6
0.8
1.0
M/M
T
[-]
(a)
2
Y
=0.25
median
90% percentile
10% percentile
5% of the total mass
50% of the total mass
95% of the total mass
0 10 20 30 40 50 60
t [mo]
0.0
0.2
0.4
0.6
0.8
1.0
M/M
T
[-]
(b)
2
Y
=2.5
median
90% percentile
10% percentile
5% of the total mass
50% of the total mass
95% of the total mass
Figure 2.8: Breakthrough curves of BPA at the CP with differentσ
2
Y
; (a)σ
2
Y
= 0.25 (b)σ
2
Y
= 2.5.
27
∆t
r
=
t
95%
− t
5%
ED
, (2.8)
where t
5%
is the early arrival time (d), t
95%
is the late arrival time (d), and ED the total exposure
duration (d).
The relationship between R
L
and∆t
r
is examined by illustrating a scatter plot (Fig. 2.9). Each
pair of values, i.e. (∆t
r
,R
L
), displayed in Fig. 2.9 originates from the simulation for each conduc-
tivity realization from the ensemble. Results are shown for twoσ
2
Y
values and DR models (see Fig.
2.9). As shown in Fig. 2.9a, for the case of σ
2
Y
= 0.25, R
L
increases with the plume’s residence
time∆t
r
. This increase is observed for both DR models and the data clearly displays two distinct
slopes. Fig. 2.9a reveals that the trend of R
L
with∆t
r
becomes more pronounced when adopting
the NMDR model. For the results obtained from the NMDR, the resilience loss increases more
than two times from 13 to 29 mo. On the other hand, R
L
values computed with the MDR model
does not show a significant change with ∆t
r
and are on average approximately equal to 12 mo.
Similar analysis is carried out for σ
2
Y
= 2.5 in Fig. 2.9b. Results in Fig. 2.9b shows a different
relationship between∆t
r
and R
L
when the MDR model is adopted. We first observe an increase of
R
L
with∆t
r
followed by a reduction from 13 to 0 mo. When a NMDR model is considered, we
observe a steep rise of R
L
with∆t
r
, i.e. R
L
ranges from 14 to 60 mo.
As shown in Fig. 2.9, the relationship between R
L
and∆t
r
depends on the choice DR models
and the level of heterogeneity. Aquifer heterogeneity controls plume spreading which in turn
impact the magnitude of the BPA concentration at the CP and therefore the exposure level (ADD).
The randomness of the hydraulic conductivity will also control the likelihood that ADD is greater
than the RfD estimated from the MDR or if it falls within the low dose limit established by the
NMDR.
2.4.2.2 Impacts of Flow Rate at Source Zone on Aquifer Resilience
We further inspect what hydrogeological feature is involved with∆t
r
, subsequently having impacts
on the aquifer resilience. Fig. 2.10 exhibits six extreme scenarios (see Fig. 2.10a-f) which have
28
0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2
t
r
[-]
5
10
15
20
25
30
35
R
L
[mo]
(a)
2
Y
=0.25 w/ MDR
w/ NMDR
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
t
r
[-]
0
10
20
30
40
50
60
R
L
[mo]
(b)
2
Y
=2.5
w/ MDR
w/ NMDR
Figure 2.9: R
L
and∆t
r
from all realizations depending onσ
2
Y
and the DR models; (a)σ
2
Y
= 0.25 (b)
σ
2
Y
= 2.5.
29
drastic impacts on∆t
r
. Fig. 2.10 illustrates the resident concentration field of the BPA averaged
along the z-axis at t = 7 mo. It also shows the corresponding concentration BTCs (c
F
vs. time)
at the CP for each of the six scenarios. As shown in Fig. 2.10, the cases displayed in panels (a),
(c), and (e) have smaller∆t
r
values then compared to cases (b), (d), and (f). Close inspection of
the contour plots displayed in Fig. 2.10 reveals that BPA plumes characterized by small ∆t
r
are
more likely to encounter a well connected path (e.g., preferential flow channel) that conveys the
BPA plume efficiently. The presence of these connect paths results in narrow BTCs having high
concentration peaks (see Fig. 2.10, left column). On the other hand, BPA plumes characterized by
large∆t
r
tend to linger in source zones (see Fig. 2.10, right column). In this case, source zones
release BPA into aquifers at a slower rate, which leads to stretched and “flattened” concentration
BTCs. This observation reveals the importance of the flow rate at the source zone in determining
the residence time of BPA plumes at the CP, which in turn will impact aquifer resilience estimates.
The work of de Barros and Nowak [75] defined the source efficiency, η (-), by normalizing
the hydraulic flux passing through the geometric area of the source with its ensemble mean. The
dimensionless variableη is denoted as the source efficiency since it provides information regarding
the mass flux through the source zone. The variable η captures the ratio between the flow rates
crossing the source zone to the average volumetric flow rate in the aquifer. In the Monte Carlo
framework, η will be distinct for each random realization of the conductivity field. The higher
η is, the faster the contaminant mass is expected to be flushed out from the source zone (see
Fig. 2.10, left column). On the other hand, lower values of η are responsible for retaining the
contaminant plume for a longer time in the vicinity of the source zone (see Fig. 2.10, right column).
Previous studies reported thatη could be an indicator that predicts the contaminant plume length
for continuous release sources as well as spreading and mixing [6, 75, 76]. Moreover, Henri et al.
[77] shows the relationship betweenη and the health risks from chlorinated solvent mixtures. For
additional details related to the variableη, we refer the reader to de Barros and Nowak [75].
Next, we examine relationships of the flow rate at the source zone with the aquifer resilience.
As previously mentioned, the variable η is defined as the ratio of the volumetric flow rate at the
30
0 120 240 360 480 600 720
x [m]
0
130
260
390
520
y [m]
(a) t
r
= 0.53 & = 1.10 at t = 7 mo
CP
source
0
20
40
60
80
100
c R
along z-axis [mg/L]
0 120 240 360 480 600 720
x [m]
0
130
260
390
520
y [m]
(b) t
r
= 2.97 & = 0.18 at t = 7 mo
CP
source
0
20
40
60
80
100
c R
along z-axis [mg/L]
0 120 240 360 480 600 720
x [m]
0
130
260
390
520
y [m]
(c) t
r
= 0.70 & = 1.52 at t = 7 mo
CP
source
0
20
40
60
80
100
c R
along z-axis [mg/L]
0 120 240 360 480 600 720
x [m]
0
130
260
390
520
y [m]
(d) t
r
= 3.16 & = 0.20 at t = 7 mo
CP
source
0
20
40
60
80
100
c R
along z-axis [mg/L]
0 120 240 360 480 600 720
x [m]
0
130
260
390
520
y [m]
(e) t
r
= 0.90 & = 1.04 at t = 7 mo
CP
source
0
20
40
60
80
100
c R
along z-axis [mg/L]
0 120 240 360 480 600 720
x [m]
0
130
260
390
520
y [m]
(f) t
r
= 3.33 & = 0.18 at t = 7 mo
CP
source
0
20
40
60
80
100
c R
along z-axis [mg/L]
0 10 20 30 40 50 60
t [mo]
0
1
2
3
4
5
6
c
F
at CP [mg/L]
t = 7 mo
Case (a)
Case (c)
Case (e)
ED from t = 7 mo
0 10 20 30 40 50 60
t [mo]
0
1
2
3
4
5
6
c
F
at CP [mg/L]
t = 7 mo
Case (b)
Case (d)
Case (f)
ED from t = 7 mo
Figure 2.10: Resident BPA concentration c
R
fields averaged along the z-axis at t = 7 mo and
breakthrough curves for the flux-averaged concentration c
F
of BPA at the CP for six extreme cases
of∆t
r
withσ
2
Y
= 2.5; (a), (c), and (e) cases for low∆t
r
(b), (d), and (f) cases for high∆t
r
.
31
front side of the contamination source to the averaged volumetric flow rate at the background
aquifer:
η =
Q
sz
Q
b
, (2.9)
where Q
sz
is the volumetric flow rate (m
3
/d) at the front side of the source zone with respect to the
x-direction and Q
b
is the averaged volumetric flow rate (m
3
/d) at the background aquifer:
Q
sz
=
ZZ
w
sz
q
x
(x)dydz, (2.10)
where q
x
is the x-directional specific discharge (m /d) and w
sz
is the geometric area (m
2
) (i.e., the
front side area) of the source zone. The total averaged volumetric flow rate is given by:
Q
b
= K
G
J
x
A, (2.11)
where A is the area (m
2
) of the aquifer normal to the x-axis. Note that Q
b
is the same over all
realizations given that our numerical domain ensures that we have ergodic conditions.
For the six cases shown in Fig. 2.10, η exhibits high values for the cases (a), (c), and (e) and
lower values for the cases (b), (d), and (f). From Fig. 2.10, we see an inverse relationship between
η and∆t
r
. Whenη is low, we observe larger values for∆t
r
(and vice-versa). In Fig. 2.11, R
L
andη
are plotted for all realizations of the conductivity ensemble as a function ofσ
2
Y
and the DR models.
The relationship between R
L
andη is the reverse of the one observed between R
L
and∆t
r
(compare
Fig. 2.11 with Fig. 2.9). As depicted in Fig. 2.11, lower values ofη leads to larger values of R
L
(on average). For larger values ofη, the bulk of the BPA plume is not lingering in the source zone
and is carried out by the preferential flow paths. As a consequence, the BPA plume crosses the CP
in a relatively short amount of time. The opposite occurs for smallerη values (i.e. higher values of
R
L
on average). This behavior is observed for both levels of heterogeneity examined in this paper.
Fig. 2.11 also shows that the DR model impacts the concavity of the data.
32
0.2 0.3 0.4 0.5 0.6 0.7
[-]
5
10
15
20
25
30
35
R
L
[mo]
(a)
2
Y
=0.25 w/ MDR
w/ NMDR
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6
[-]
0
10
20
30
40
50
60
R
L
[mo]
(b)
2
Y
=2.5
w/ MDR
w/ NMDR
Figure 2.11: R
L
andη for all realizations depending onσ
2
Y
and the DR models; (a)σ
2
Y
= 0.25 (b)
σ
2
Y
= 2.5.
33
2.4.3 Risk Management Strategies for Groundwater Contamination Sites
with BPA
On the basis of effects of the interplay between aquifer heterogeneity and the DR models on the
aquifer resilience, risk management strategies can be suggested for groundwater contamination
sites with BPA. Critical factors to establish the risk management strategy for BPA contamination
in groundwater are land uses and subsurface heterogeneity. Land uses and corresponding demo-
graphics determine whether the NMDR model of BPA should be considered or not. This choice
will depend on the exposure population in a contamination site. For instance, when a contamina-
tion site is used as an industrial area that does not contain a significant number of pregnant/lactating
women, decision makers may evaluate the site based on the aquifer resilience estimated solely from
the MDR model. On the other hand, when a contamination site is a residential area where the like-
lihood of finding pregnant/lactating women is larger (e.g., due to parental leave), decision makers
should consider allocating more weight on the NMDR model (in addition to the MDR model)
when they estimate the risks and resilience of the aquifer. For the numerical set-up adopted in our
work, we note that the uncertainty regarding estimates of R
L
is higher when a NMDR model is
used (see blue box plots, Fig. 2.7). In this situation, if the contamination site is an residential area
(i.e., larger probability of having a population cohort that requires the use of the NMDR model),
the source efficiency, η, needs to be prioritized in the site investigation to decrease the uncertainty
of R
L
. Once the uncertainty of R
L
estimates is reduced, decision makers can compare the envi-
ronmental cost caused by R
L
estimates with the cost of possible groundwater clean up strategies
such as monitored natural attenuation, remediation, or treatment of groundwater before using it as
potable water.
It is important to emphasize that our results are limited to the conditions examined. For the
numerical set-up adopted in work, we observed that the aquifer resilience estimates (and corre-
sponding uncertainties) are sensitive to the choice of the DR model (i.e., NMDR vs. MDR) and
the degree of hydrogeological heterogeneity. Results reported in this paper also highlight the im-
portance ofη on the resilience and that is can be used as a risk indicator. Resources can be allocated
34
to increase characterization of the source zone and its hydraulic fluxes. The work of de Barros and
Nowak [75] showed howη can be used to reduce the uncertainty in transport predictions. Henri et
al. [77] used theη concept developed by de Barros and Nowak [75] to investigate the sensitivity
of increased lifetime cancer risks (due to chlorinated solvents) to source zone conditions. Here,
we show thatη could also be a potential indicator of the resilience of an aquifer. Since the aquifer
resilience can have severe damages when the source zone retains BPA plumes for a long period of
time, the investigation for η should be prioritized. If η is low, decision makers should deem the
contamination situation serious and take active measures for groundwater such as remediation or
treatment before using groundwater for potable purposes.
2.5 Conclusions
To address growing concerns regarding emerging contaminants in groundwater, the aquifer re-
silience based on human health risks is evaluated under a contamination scenario from BPA. In
this process, two types of DR models of BPA are considered to characterize health risks; the MDR
model is used for systemic health effects in the general population and the NMDR model is adopted
for endocrine-related health effects in vulnerable population (i.e., male offspring when their moth-
ers are exposed to BPA through the direct ingestion of groundwater within pregnant/lactating pe-
riods). Based on the set-up adopted in this study (as well as the parameter range investigated), the
results exhibit the following points:
(1) An integrated stochastic modeling framework that links together hydrogeological models
with risk and resilience is systematically established. Through this framework, human health
risks and the aquifer resilience are investigated in a probabilistic way under a groundwater
contamination scenario from BPA.
(2) Depending on the DR models, aquifer heterogeneity has distinct impacts on the aquifer re-
silience and its uncertainty. The increased level of the aquifer heterogeneity is detrimental to
the aquifer resilience evaluated from the NMDR model by increasing the mean of R
L
, while
35
it is beneficial to the aquifer resilience evaluated from the MDR model with decreasing the
mean value of R
L
. In all cases, augmenting the level of aquifer heterogeneity leads to an in-
creased uncertainty in R
L
estimates. The uncertainty of R
L
is increased when employing the
NMDR model when compared to the uncertainty estimates obtained with the MDR model.
(3) The variability of R
L
is controlled by∆t
r
andη. The parameterη is an indicator representing
the retention effect of the source zone on BPA plumes. Our results display that η has a
positive relationship with R
L
evaluated from MDR model and the negative relationship with
R
L
from NMDR model.
(4) In the context of risk management associated with BPA contamination, we highlight impor-
tant factors such as land use and aquifer heterogeneity. When the vulnerable population con-
sists of a significant number of exposed pregnant/lactating women, NMDR models should
be considered. In this case, the site investigation is needed to decrease the uncertainty of R
L
resulting from using the NMDR model.
(5) We also highlight the potential of η in reducing the uncertainty associated with R
L
. This
implies the characterizing the hydraulic fluxes at the source zone could potentially reduce
the uncertainty estimates of groundwater resilience and risks.
We note that other factors such as contaminant source architecture, contaminant mass loading
history, distance between source and receptor, chemical reactions, and the presence of sinks/sources
can impact the results and the uncertainty of aquifer resilience. The results of this work are limited
to our model set-up and parameter range. Nevertheless, our results indicate that the choice of the
dose-response model, coupled with the level of aquifer heterogeneity, can impact the uncertainty
of aquifer resilience. In our case, the uncertainty in R
L
is larger when the NMDR model is adopted
as compared to the result when the MDR model is employed. The impact on the uncertainty of
aquifer resilience emphasizes the importance of considering the NMDR model of BPA in risk
management.
36
In light of new evidence regarding the dose-response relationship of BPA [57, 78], our work
shows the impacts of the interplay betweenσ
2
Y
and BPA DR models on the uncertainty of R
L
. To
reduce the uncertainty of R
L
arising from this interplay, we suggestη as an indicator of R
L
in risk
management. Our work offers an insight as well into the risk and resilience management for other
emerging contaminants possibly causing endocrine-related health effects.
37
Chapter 3
Data-driven Identification of Ordinary Differential Equations
for Multi-Physics Systems Utilizing Stochastic Optimization
3.1 Introduction
System identification is a key component in engineering and scientific analysis [79]. Many engi-
neered and natural systems concerned in modern society are complex. Examples of engineered
systems consist of long-span bridge and satellite structures [80]. Examples in natural systems con-
sist of oxygen dynamics in a soil-plant continuum under stochastic rainfall events [81, 82], hydro-
logical processes in river networks [83], and predicting contaminant concentrations and migration
paths in multi-scale heterogeneous subsurface environments [4, 84]. These complex systems are
not limited to physical dynamic systems. They appear throughout many other disciplines: disease
transmission during epidemics [85, 86] or health risks of groundwater systems under climate and
demographic stresses [87–89]. Furthermore, these systems are subject to uncertainty. In this con-
text, system identification is even more crucial. According to increasing complex systems across
modern society, there is an ongoing endeavor to develop improved and robust state-of-the-art iden-
tification methods.
Conventionally, dynamic systems are represented by mathematical models through deductive
reasoning of the underlying phenomena [90]. Mathematical models derived in this way, known as
white box modeling, are powerful to predict the future behavior of systems, as well as to understand
38
system dynamics itself. However, it is challenging to obtain a reliable mathematical model when
multiple (and more complex) nonlinear processes are involved [91]. This is because mathematical
models are usually subject to assumptions and simplifications. As a result, the reliability of math-
ematical models heavily depends on the user’s understanding of phenomena underlying a system.
On top of that, even though a mathematical model explicitly represents the underlying phenomena
from first principles such as mass balance and energy balance, it is sometimes computationally too
expensive to solve the model due to its complexity [92–94]. As a consequence, complex nonlinear
systems are susceptible to have an inaccurate model due to user’s lack of understanding of a system
or to have a model that could be accurate but computationally intractable.
A variety of data processing methods have been investigated to overcome the limitations of con-
ventional mathematical modeling in identifying complex nonlinear systems [20, 79]. For instance,
parametric modeling methods (e.g., gradient-based methods [95] and Genetic Algorithm (GA)
[96]) are performed by setting up physical or phenomenological models based on prior knowl-
edge and by optimizing their parameter values in accordance with measured data. Non-parametric
modeling methods, such as Artificial Neural Networks (ANN) [26, 97, 98], Gaussian process
[99], Genetic Programming (GP) [100], sparse approximation [22, 24], and state-space [21, 101]
approaches were developed to establish alternative models without relying on a prior model struc-
ture. Along with the recent advances in computing power and sensor technology, these methods
are receiving more attention as data-driven identification methods. Data-driven methods depend
solely on a large data set, sampled from a target system, to construct a model. Since these non-
parametric data-driven methods do not rely solely on user’s prior knowledge when establishing a
system model, they are being studied as easily accessible tools for system identification in many
research fields [102–104]. Recent studies have focused on improving these methods with the goal
of identifying complex nonlinear systems and made noticeable advances. Not just increasing accu-
racy of an identified model, but have robustness improved by incorporating the underlying physics
of a target system. Many efforts that benefit by knowledge of the underlying physics have been
made; the state-space is formulated [21, 101] or the conformity to first principles is evaluated [26,
39
97, 98, 105]. Despite this significant progress, issues related to robustness and overfitting still re-
main in their application. Non-parametric methods result in black-box models that are incapable
to be physically interpreted [20, 106, 107].
Among the non-parametric data-driven methods, GP is a system identification method that
retains the advantages of conventional deductive modeling [20]. It allows mathematical interpreta-
tion of a complex nonlinear system by discovering its symbolic expressions on the basis of a large
data set. The program that is the basic unit of GP is defined in accordance with specific tasks,
e.g., a vector of parameters to optimize an objective function, an expression tree to find a proper
symbolic expression for a given data set, or a decision tree structure to produce the best strategy.
Inspired by Darwin’s theory of evolution, GP randomly generates programs and evolves them in
consecutive generations to fit them into a particular task [100]. Ultimately, GP obtains the best
suitable program for the task. Because of the adaptability of the program, GP has been employed
in many fields from sciences and engineering to economics and social sciences [108–110]. For the
identification of nonlinear dynamic systems, the expression tree (that is the tree structure having
mathematical components within its nodes and terminals) is adopted as the program of GP [20].
Previous works, especially in applied mechanics [93, 111–114], developed GP-based system iden-
tification methods for nonlinear systems by incorporating advanced mathematical components in
the expression tree, such as derivatives of state variables or discontinuous functions. This progress
has changed the GP-based identification method from a simple symbolic regression with algebraic
expressions into the discover of governing equations expressed by differential equations for non-
linear systems. Furthermore, by favoring expression trees having a more concise structure in the
process of identification, these studies represented complex nonlinear systems with parsimonious
governing equations [93, 115]. The potential of GP as an identification method for nonlinear dy-
namic systems can be found in the field of applied mechanics. For instance, Bolourchi et al. [114–
116] developed a GP-based identification approach characterizing target systems with ordinary dif-
ferential equations (ODEs) and displayed its application to identify multiple reference structural
systems, such as the noisy Duffing-Van der Pol oscillator and the Bouc-Wen model of hysteresis.
40
La Cava et al. [93] applied a GP-based identification approach for an industrial-scale wind turbine.
Their approach shows how succinct models identified through GP outperformed those based on
first-principles. For a review of GP-based system identification approaches in applied mechanics,
we refer the reader to Quaranta et al. [20].
On the contrary to the advances of GP-based identification in the field of applied mechanics,
the implementation of GP to identify governing equations in environmental systems is still at an
early stage [117–119]. As an example, Fallah-Mehdipour et al. [117] show that GP was able to
successfully establish the time-series of water-table elevation. Considering that GP can discover
parsimonious nonlinear differential equations solely on the basis of a large data set with minimal
prior knowledge, the identification method utilizing GP is a powerful grey box modeling tool. In
this work, we expand the application of GP to other types of nonlinear systems. To achieve this, it
is essential to provide a more general GP-based identification framework that unravels governing
equations for nonlinear multi-physics systems. However, this expansion inevitably exacerbates the
drawback of GP that is a long computational time [114, 115]. This is because the generalization
of GP-based identification involves more variables or basis functions in expression trees, which
rapidly increases the computational time needed to converge into the best candidate model (i.e.,
the best expression tree). To expedite the convergence of the GP-based identification process, this
study employs the output error of the candidate models into the fitness evaluation. When evaluating
the fitness of a candidate model, previous studies employed the equation error that is computed by
substituting the measured data (of the input variable) into the candidate model and then comparing
these values to the ones obtained from the reference model [93, 112–114]. In our work, the output
error is computed by solving a candidate model that is a single or a set of ordinary differential
equations (ODEs) with measured input data and comparing it to measured output data. This direct
comparison to a given data set as differential equations is able to facilitate the convergence of GP
process. The subsequent increased computational cost is managed with parallel processing on high
performance computing systems [120].
41
In this manner, this study develops and demonstrates a general framework of GP-based identi-
fication for nonlinear multi-physics systems. The proposed general GP-based identification frame-
work enables to apply this powerful grey box modeling to nonlinear multi-physics systems by
constructing parsimonious governing equations from an input and output data set with little prior
knowledge. For the demonstration of the identification method under discussion, two reference
systems (i.e., one structural mechanics and one environmental systems) are identified: 1) a Bouc-
Wen model and 2) a soil-moisture model. The Bouc-Wen model is a nonlinear single-degree-of-
freedom (SDOF) model describing a structural mechanics system under the presence of external
force. The nonlinearity of this model comes from its hysteretic behavior where the current sys-
tem responses (i.e., displacement, velocity, and acceleration) depend on their history. Due to its
complex nonlinearity, the Bouc-Wen model has been identified in numerous studies to exhibit per-
formance of identification approaches [21, 24, 97, 101, 115, 121]. In this work we expand the
GP application to environmental sciences, agricultural systems and water resources engineering.
To illustrate the use of GP in such systems, we consider a soil-moisture model. Among many
soil-moisture models suggested, one SDOF model describing soil moisture context under a series
of rainfall events is employed [122]. Major contributing mechanisms determining the soil mois-
ture context (e.g., rainfall, infiltration, and moisture losses) are lumped into this model, leading
to its nonlinear behavior. The application of the GP-based identification method to this model is
appealing since environmental problems are considered difficult to be fully characterized, prior
knowledge is usually limited, or in some cases, it is simply too complicated to model with first
principles [123].
This paper is organized as follows. First, the general problem statement is given in Section 3.2.
Details pertaining the GP-based identification method are presented in Section 3.3. Illustrative
examples are displayed in Section 3.4, and the following discussion is presented in Section 3.5.
Lastly, implications and future research are discussed in Section 3.6.
42
3.2 Problem Statement
LetM denote a model for a nonlinear multi-physics system. It is assumed thatM is unknown
due to limited knowledge about the system. For instance, an oscillator model or a soil-moisture
model can be considered. This nonlinear multi-physics system is subject to the external excitation
over time such as excitation force for the oscillator or random rainfall for the soil. The external
excitation at a given location x and time t is considered as the input θ(x,t). Given θ(x,t), the
system produces its corresponding response denoted by the output ψ(x,t) at the same location x
and time t. The modelM explains the relationship betweenθ(x,t) andψ(x,t) within the system
(Fig. 3.1). This relationship can be expressed by:
M(θ(x,t),ψ(x,t))= 0. (3.1)
Input , Output ,
System model
ℳ
Figure 3.1: The schematic diagram for a general modeling problem.
Given that the system modelM is unknown due to limited knowledge, an approximate mathe-
matical model
ˆ
M can be identified as the governing equation that is expressed by a single or a set
of ordinary differential equations (ODEs) or partial differential equations (PDEs). Our work will
focus on the governing equation characterized by ODEs with time-dependent terms. Therefore, the
43
input data and the corresponding output data fromM are denoted asθ(t) andψ(t), respectively.
The approximate model
ˆ
M yields its own output ˆ ψ(t) due to the same inputθ(t). Therefore
ˆ
M is
identified from a set of candidate models M by minimizing the L
2
norm of the difference between
ˆ ψ(t) andψ(t):
min
ˆ
M
n
∥ ˆ ψ(t)− ψ(t)∥
2
|
ˆ
M(θ(t), ˆ ψ(t))= 0
o
s.t.
ˆ
M ⊂ M. (3.2)
The objective of this work is to identify the approximate mathematical model
ˆ
M for a nonlinear
multi-physics system solely with input and output data of the system, namely θ(t) and ψ(t). In
order to establish a general framework of system identification utilizing GP, the GP algorithm
needs to be adapted according to the general problem stated in this section. Details of the proposed
framework are described in the following Section 3.3.
3.3 Methodology
3.3.1 Genetic programming for system identification
Genetic Programming (GP) is one of the computational algorithms using artificial intelligence
[100]. This algorithm evolves programs over consecutive generations in a stochastic manner to fit
them into a particular task. Evolutionary concepts adopted from Darwin’s theory of evolution, such
as selection, representation, crossover, and mutation, are applied into the population of programs
to evolve them for the next generation. An overview of GP is presented in Fig. 3.2.
In order to utilize GP for the system identification of nonlinear multi-physics systems expressed
by ODEs, an identification scheme employing GP is firstly established. This GP-based identifica-
tion scheme discovers the governing equation of a system solely on the basis of its input and output
data as the following steps:
Step 1: Determine an identification target multi-physics system, and sample associated input and
output data (i.e.,θ(t) andψ(t)).
44
Update
Selection
Population
Best
Programs Fitness for a task
Evolutionary
Loop
Crossover
Mutation
Representation
Figure 3.2: The illustration diagram of an evolutionary loop of GP. As an analogy, programs like
chromosomes contain genetic information to perform a specific task. According to the capability
for the task, each program has the level of fitness. Similarly to Darwin’s theory of evolution,
population of programs undergoes evolutionary processes. Some highly fitted ones survive intact
(Representation), most of them exchange their genetic information each other (Crossover), and
some of them may experience an exchange with random genetic information (Mutation). Finally,
updated programs constitute the next generation. This evolutionary loop continues until a stop
criterion is satisfied.
45
Step 2: Randomly generate candidate models
ˆ
M
i
(i= 1,2,...,n) from the expression trees. The
candidate model
ˆ
M
i
is characterized by a single or a set of ODEs for the target system.
ˆ
M
i
is the program of GP and its setM is the population.
Step 3: Perform the fitness test by solving
ˆ
M
i
due to θ(t) and comparing its output ˆ ψ
i
(t) with
ψ(t) (see Eq. 4.2) to assign a fitness value to each
ˆ
M
i
.
Step 4: Apply the evolutionary processes depicted in Fig. 3.2 to enhance the overall fitness of M
and update the population for the next generation.
Step 5: Repeat the evolutionary processes (Step 4) until the model
ˆ
M
∗ that has the best fitness for
the system model in the population can satisfy a stop criterion.
Step 6: Validate
ˆ
M
∗ as the target system model by performing the fitness test (Step 3) with another
set ofθ(t) andψ(t).
Step 7: Interpret the mathematical expression of
ˆ
M
∗ to understand nonlinear characteristics of the
target system.
The above identification scheme utilizing GP requires some modification of the GP algorithm.
The following subsections describe the components of GP and their adaptation into the identifica-
tion scheme. The GP code is written in Python (version 3.7.4). The parameters associated with the
scheme are tabulated at the end of this section (Section 3.3.1) in Table 3.1.
3.3.1.1 Program
A program of GP is defined as a candidate model
ˆ
M characterized by a system of ODEs for a target
multi-physics system. Depending on the prior knowledge of the system, a part or a whole math-
ematical expression (i.e., a target function g(t)) comprisingM is aimed to be identified through
evolutionary processes. To decompose this target function into genetic information, the expression
tree is employed. The expression tree has the tree structure containing operations and functions in
its nodes and variables and numbers in its terminals as shown in Fig. 3.3. This expression tree can
46
be converted back-and-forth into the corresponding target function. Algebraic calculations associ-
ated with this conversion are computed through the Python library Sympy (version 1.5.1). Thus, a
system of ODEs that include the expression tree is considered as
ˆ
M , i.e., the program of GP used
in this study. Fig. 3.3 illustrates one example of
ˆ
M for a mechanics system. The candidate model
ˆ
M is defined as the governing equation characterized by a single ODE for the system. This ODE
represents the relationship between the external force f(t) and system responses (i.e., displace-
ment x, velocity ˙ x, and acceleration ¨ x). Depending on prior knowledge of the system (i.e., user’s
engineering judgment), a part of the ODE is selected as a target function ˆ g(t) to be identified and
decomposed into the expression tree. By applying evolutionary processes into the expression tree,
the model
ˆ
M becomes more suitable to represent the governing equation of the system.
(a) Candidate model
!
ℳ
(Program of GP)
(c) Expression tree
+
−
×
̇ 0.8
0.3 & = 0.8 ̇ + −0.3
(b) Target function &
(Target mathematical expression)
= 2 ̈ +0.8 ̇ + −0.3
Figure 3.3: One example of (a) a candidate model
ˆ
M as the program of GP, (b) its target function
ˆ g(t), and (c) the corresponding expression tree. f(t) is the external force. x, ˙ x, and ¨ x are state
variables: displacement, velocity, and acceleration, respectively. In this work H represents the
Heaviside step function. Under the assumption where a system allows the mass to be measured,
the inertial term 2 ¨ x is prearranged in the candidate model
ˆ
M . The rest of the expression is defined
as the target function ˆ g(t), and aimed to be identified through evolutionary processes of expression
trees.
47
3.3.1.2 Population
A set of candidate models
ˆ
M
i
(i= 1,2,...,n) is the population of GP where n is the number of
candidate models. For the initial population of
ˆ
M
i
, expression trees are randomly generated. As
depicted in Fig. 3.4, from the initial node at the tree level 1, the mathematical component types
among operations, functions, variables, and numbers are assigned with their own probabilities in
order. Once a component type is defined, a specific component is assigned by random sampling
from each component type set. If an operation or a function is assigned into a node, this node
keeps growing by initializing two other nodes at the next tree level. If a variable or a number is
assigned into the following node, this node becomes a terminal and stops growing. Components
of the expression tree consist of operations (i.e., plus, minus, and times), functions (e.g., Heaviside
step, absolute value, and exponential functions), state variables (such as displacement, temperature,
pollutant concentration, soil-moisture, etc.), and numbers (i.e., a random rational number between
0 and 1 which will be optimized afterwards). The probabilities for their assignment are set to
guarantee expression trees to grow more than the tree level 2. All parameters employed to produce
the initial population of GP are specified at the end of this section (Section 3.3.1) in Table 3.1. To
introduce a variety of expression trees in the initial population, the ramped half-and-half method
[100] is adopted. In this method, half of expression trees in the initial population grow as to the
probabilities of the components, and the remaining half fully grow until the maximum tree level.
For the same purpose of securing diverse expression trees, the maximum tree level is set to 6,
which allows 31 nodes when the expression tree fully grows.
For a candidate model capturing the systems’ nonlinearity, the inclusion of discontinuous func-
tions (e.g., a Heaviside step function or an absolute value function) in the expression tree compo-
nents is essential. Bolourchi et al. [114] presents several examples where such discontinuous
functions were employed in the identified models to efficiently represent nonlinear systems. If dis-
continuous functions are not available, complex polynomial approximations would appear with a
potential Gibbs phenomenon. However, the more basis functions are considered, the longer conver-
gence time is expected. In this context, the minimal discontinuous functions (i.e., a Heaviside step
48
function and an absolute value function) are included in the basis function library in consideration
of convergence time.
Tree level = 1
+
−
0.3
×
̇ 0.8
Tree level = 2
Tree level = 3
Tree level = 4
P(operations) = 0.5
random sampling from { +, −, × }
P(functions) = 0.1
random sampling from { (), }
P(operations) = 0.5
random sampling from { +, −, × }
P(variables) = 0.2
random sampling from { , ̇ }
P(numbers) = 0.2
random sampling from [ 0, 1 )
Figure 3.4: One example of the initialization of expression trees. From the initial node at the tree
level 1, the type of mathematical components is determined as to their probabilities. A specific
component is assigned by random sampling from each component type set. The expression tree
keeps growing when an operation or a function is assigned and, in turn, initializes two other nodes
at the next tree level. The expression tree stops growing when a variable or a number is assigned,
changing the assigned node into the terminal.
3.3.1.3 Fitness test
In order to evaluate the fitness of
ˆ
M for a target multi-physics system, a fitness test is performed.
Since the objective of the system identification is to find an approximate model for the system
modelM solely with θ(t) and ψ(t), the fitness value for each candidate model can be evaluated
from the L
2
norm of the difference between ˆ ψ(t) (obtained from
ˆ
M ) andψ(t) (obtained fromM )
due to the same input (θ(t)) (see Eq. 4.2). While θ(t) and ψ(t) are measured fromM , ˆ ψ(t) is
computed by solving ODEs (i.e.,
ˆ
M ) underθ(t). The final L
2
norm is then normalized by the L
2
norm ofψ(t) to compute the normalized errorε expressed as:
ε =
∥ ˆ ψ(t)− ψ(t)∥
2
∥ψ(t)∥
2
. (3.3)
49
Since the time domain of θ(t) and ψ(t) is discretized by the sampling time step∆t (i.e., t =
t
1
,t
2
,...,t
l
where l is the total time step), the normalized errorε in Eq. 3.3 is transformed into the
root-mean-square error (RMSE) normalized with the root-mean-square (RMS):
ε =
RMSE[ ˆ ψ(t)]
RMS[ψ(t)]
where RMSE[ ˆ ψ(t)]=
v
u
u
u
t
l
∑
i
{ ˆ ψ(t
i
)− ψ(t
i
)}
2
l
and RMS[ψ(t)]=
v
u
u
u
t
l
∑
i
{ψ(t
i
)}
2
l
.
(3.4)
In addition to the normalized RMSEε, to favor the simplicity of the mathematical expression
of
ˆ
M , the penalty on the number of nodes in the expression tree is counted into a fitness value.
Bolourchi et al. [115] introduced this penalty of complexity to prevent overfitting and encourage
parsimonious models. Therefore, the final fitness value ζ can be expressed as:
ζ =ε+ pa, (3.5)
where p is the weight of the complexity penalty that is 0.1 and a is the number of nodes in the
expression tree. The weight p is obtained from experience to balance between the normalized
RMSE and the complexity penalty of model. Since the final fitness value ζ is comprised of the
normalized RMSE and the complexity penalty, a lower ζ means that
ˆ
M is better suitable to the
target multi-physics system.
3.3.1.4 Evolutionary processes
Candidate models
ˆ
M
i
(i = 1,2,...,n) in the population of GP randomly evolve through evolu-
tionary processes, resulting in the next generation. By repeating the evolutionary loop (see Fig.
3.2) over consecutive generations, the likelihood that the population has improved the estimate of
ˆ
M
i
as a target system model increases. Evolutionary processes include selection, representation,
crossover, and mutation [100]. Firstly, according to the fitness values of
ˆ
M
i
, some of highly fitted
ˆ
M
i
are reserved for the next generation. This reservation of strong candidate models is referred as
50
representation. Next,
ˆ
M
i
are collected from the population for other evolutionary operations by
using the tournament selection method. This method randomly picks two
ˆ
M
i
in the population and
chooses the one having better fitness [100]. This method allows a chance of survival to a candidate
model that has a weaker fitness. The selected
ˆ
M
i
is, in turn, modified through evolutionary oper-
ations (i.e., crossover and mutation) to construct a new
ˆ
M
i
as shown in Fig. 3.5. The crossover
operation transforms expression trees of two
ˆ
M
i
by randomly choosing nodes of the expression
trees and changing each other. The mutation operation alters the expression tree of one
ˆ
M
i
by
randomly choosing a node of the expression tree and changing it with a new random expression
tree. For typical values of the relevant parameters (such as operation rates for the evolutionary
processes), we refer to previous studies [20, 100].
(a) Crossover
+
×
0.6
×
+
0.8
0.3
+
+
0.3
(b) Mutation
×
0.8
×
0.6
+
+
0.2
+
+ ×
0.9 0.2
Figure 3.5: Examples of evolutionary operations; (a) crossover randomly selects nodes of two
expression trees and changes each other; (b) mutation randomly chooses a node of an expression
tree and replaces it with a new random expression tree.
51
3.3.1.5 Bloat control
The expression tree structure employed for
ˆ
M
i
entails inherent issues that can slow down the
identification process and even cause a memory problem in the computational engine (computer).
Several bloat control techniques are implemented to address these issues every generation after the
evolutionary processes. Most importantly, it is possible that one expression tree has too many nodes
due to overfitting. To circumvent this issue, the maximum number of nodes in an expression tree
is set. When the number of nodes exceeds the maximum number, the expression tree is replaced
with a new random expression tree. The maximum node number used in this study is 100, which
is greater than the number of nodes contained in a fully-grown expression tree with the tree level
7. It prevents expression trees from becoming too complicated.
Next, there are two issues that potentially can bring inefficiency in the identification process.
Firstly, since the expression tree does not exercise any calculation, it can contain a redundant struc-
ture for the same expression. For example, the mathematical expression 5x can be converted from
the expressions tree 5× x and x+x+x+x+x that have 1 and 4 nodes, respectively. To prevent this
type of redundancy within the expression tree, a simplifying operation is implemented. It trans-
forms the expression tree into the simplified mathematical expression, and turns this expression
into the simplified expression tree again.
Secondly, there can be redundancy in the population because many expression trees can pro-
duce the same or similar mathematical expressions. For instance, expression trees 5× x, x+ x+
x+ x+ x, and 2x+ x are simplified into 5 x, 5x, and 3x, respectively. Regardless of the coefficients,
expression trees resulting in repeated mathematical expressions need to be removed to guarantee
a variety of mathematical expressions in the population. This type of redundancy can increase the
possibility to fall into local minima by narrowing down diverse expression trees into few strong
expression trees. Therefore, a removing operation is carried out to reduce the redundancy in the
population. It selects one of the expression tree reserved by representation and check the rest of
expression trees in the population whether they have the same mathematical expressions with the
52
selected one regardless of coefficients. If the repetition is found, the expression tree having weaker
fitness is replaced with a new random expression tree.
3.3.1.6 Parameter optimization
Coupled with the evolution processes of GP, parameter optimization enhances the suitability of
ˆ
M
i
forM . Among many optimization methods such as gradient-based methods and Simulated
Annealing [95], Genetic Algorithm (GA) [96] is employed because it is known as a proper opti-
mization algorithm for a discontinuous objective function.
ˆ
M
i
can have Heaviside step or absolute
value functions that bring discontinuity into the model. GA is the optimization algorithm that
utilizes the same evolutionary processes as GP. Furthermore, GA randomly initialize candidate
coefficient vectors c of
ˆ
M and evolves them through evolutionary processes, which leads to pa-
rameter optimization. For GA, the Python library Pyeasyga (version 0.3.1) is implemented with
the parameters used in the study from Bolourchi et al. [115] (Table 3.1).
53
Table 3.1: Characteristics in each module for genetic programming
Parameters Values
Program
Program Candidate model including an expression tree
Candidate model System of ordinary differential equations (ODEs)
Population
Population size 100
Initialization method Ramped half-and-half method
Components [ operations, functions, variables, numbers ]
Component probability [ 0.5, 0.1, 0.2, 0.2 ]
Operations Plus, minus, times
Functions E.g., Heaviside step, absolute value, exponential
Variables State variables of a target system
e.g., displacement, temperature, or soil-moisture
Initial numbers Random rational number in the interval [0, 1)
Initialization tree level 6
Python library Sympy (version 1.5.1)
Fitnesstest
Fitness Normalized root-mean-square-error (RMSE) and
the complexity penalty
Complexity penalty coefficient 0.1
Weighting of output Uniform
Stopping criterion Fitness improvement less than 1% in 1,000 consecutive
generations or at 10,000 generations
Evolutionaryoperators
Representation rate 10%
Selection method Tournament selection
Crossover rate 80%
Mutation rate 20%
Optimization
Optimization method Genetic Algorithm (GA)
Optimization frequency 50 generations
Population size 30
Initial population range [0.05c, 20c] where c is the initial coefficient vector
Representation rate 10%
Selection method Tournament selection
Crossover rate 80%
Mutation rate 20%
Stopping criterion Fitness improvement less than 1% in 30 consecutive
generations or at 300 generations
Python library Pyeasyga (version 0.3.1)
54
3.3.2 Demonstration of the identification scheme utilizing GP
The general identification scheme utilizing GP is developed herein to identify a nonlinear multi-
physics system model. To validate this identification scheme, its demonstration is performed by
identifying an already known reference system denoted here byM
ref
. A reference system is char-
acterized by a set of ODEs. The demonstration process is illustrated in Fig. 3.6. The detailed
procedure is described as follows:
(1) Data generation
(a) Determine a reference modelM
ref
, and select a target function g(t) according to prior
knowledge.
(b) Generate the input data θ(t) from the random excitation function that is targeted for
M
ref
.
(c) Compute the output data ψ(t) by solvingM
ref
(i.e., the ODEs) with the Runge-Kutta
4th order method underθ(t).
(d) Add 0, 1, or 5% error characterized by a zero-mean Gaussian to ψ(t) to yield the
training output data, namelyψ
training
(t).
(e) Confirm the training data set that consists of θ(t) andψ
training
(t).
(2) Training: Genetic programming
(a) Initialize candidate models
ˆ
M
i
(i= 1,2,...,n) by generating random expression trees
for the first population.
(b) Solve ODEs from
ˆ
M
i
with the Runge-Kutta 4th order method due to θ(t) to produce
the candidate model output ˆ ψ
i
(t).
(c) Assign the fitness value by calculating the normalized RMSE between ˆ ψ
i
(t) andψ
training
(t)
and computing the complexity penalty for the expression tree of
ˆ
M
i
.
(d) Apply the evolutionary processes to the current population of
ˆ
M
i
to yield the next
population.
55
(e) Implement bloat control operations to prevent errors and increase efficiency of the iden-
ficiation process.
(f) Perform the parameter optimization (i.e., GA) in accordance with its frequency.
(g) Repeat Training steps (b)-(f) until the stopping criterion of GP is satisfied to reserve
the model that has the best fitness in the population at the last generation as the GP-
identified model
ˆ
M
∗ .
(h) Compare the output of
ˆ
M
∗ withψ(t) instead ofψ
training
(t) to confirm
ˆ
M
∗ .
(3) Validation
(a) Produce other three sets of θ
j
(t) and ψ
j
(t) ( j= v
1
,v
2
, and v
3
) from Data generation
steps (b) and (c) with different random seeds and different standard deviations of input
σ
j
.
(b) Select smaller, the same, and larger standard deviation values (i.e., σ
v
1
, σ
v
2
, and σ
v
3
)
for input data than the one used for the training session and generate the input data,
notatingθ
v
1
(t),θ
v
2
(t), andθ
v
3
(t), respectively. Their corresponding output fromM
ref
are labeled asψ
v
1
(t),ψ
v
2
(t), andψ
v
3
(t).
(c) Solve
ˆ
M
∗ under differentθ
j
(t) ( j= v
1
,v
2
, and v
3
) and compare its output withψ
j
(t),
respectively.
3.4 Illustrative Examples
Next, we illustrate our GP-based methodology using two nonlinear multi-physics systems in dif-
ferent disciplines. The first system is a structural mechanics system (i.e., Bouc-Wen model). This
is a well-known example for system identification in applied mechanics. Through the identifica-
tion of this system, the performance of our methodology is demonstrated. The second system of
56
Randomly generate
input data
Add Gaussian white
noise to
Training dataset
&
!"#$%$%&
Validation dataset
'
,
'
( =
1
,
2
,&
3
)
Output
)
(
Fitness test
Bloat control
Parameter
optimization:
genetic algorithm
Output data
)
'
()
( =
1
,
2
,&
3
)
Apply evolutionary
processes
Comparison
End
Reference system
model ℳ
")*
Output data
Randomly initialize
expression trees
Population of
candidate system
models
-
ℳ
(
( = 1,2,…,)
Identified system
model
-
ℳ
∗
Start
Validation
Training:
Genetic
Programming
Data Generation
Figure 3.6: The workflow chart demonstrating the identification scheme utilizing GP
57
interest consists of a soil-moisture model. This system is identified as an example in environmen-
tal, agricultural and water resources engineering. This indicates that our GP-based methodology
has generality to be applied into nonlinear multi-physics systems across disciplines. Details for the
application are presented in the following subsections.
3.4.1 Bouc-Wen model
We start by illustrating our approach using a structural mechanics system, denoted as the Bouc-
Wen model [124]. This single-degree-of-freedom (SDOF) model is chosen as a reference system
modelM
ref
. This model displays the hysteretic behavior where the history of the system responses
(i.e., displacement, velocity, and acceleration) affects the current responses, which leads to highly
nonlinear responses of the system under the presence of the external force. To capture the hysteresis
of the system, the reference Bouc-Wen modelM
ref
is expressed with coupled ODEs:
f(t)= m ¨ x+ r(x, ˙ x),
˙ r(x, ˙ x)=
1
η
A ˙ x− ν
B| ˙ x||r(x, ˙ x)|
κ− 1
r(x, ˙ x)− γ ˙ x|r(x, ˙ x)|
κ
,
(3.6)
where f(t) is the excitation force, m is the mass of the system, r(x, ˙ x) is the restoring force, ˙ r(x, ˙ x)
is its derivative, andη, A, ν, B, γ, andκ are property coefficients for the Bouc-Wen model [124]
(e.g., parameters that control the size and shape of the hysteresis loop), and x, ˙ x, and ¨ x are the
displacement, velocity, and acceleration, respectively.
In line with the Bouc-Wen model, the input data θ(t) is described as the external excitation
force f(t). The output data ψ(t) corresponds to the system responses (i.e., x, ˙ x, and ¨ x). In the
identification scheme, the external force and system responses (i.e., f(t), x, ˙ x, and ¨ x) are assumed
to be available from measurements. The assumption includes the prior knowledge that the target
system is a hysteretic mechanics system, which leads to measuring m and setting the identification
target to the derivative of the restoring force ˙ r(x, ˙ x) in the governing equation (Eq. 3.6). Fig. 3.7
provides a schematic illustration of the Bouc-Wen model.
58
(, ̇ )
Bouc-Wen
Output data
Response states
= , ̇ ,& ̈
Input data
External force
()= () Hysteresis loop
Figure 3.7: The schematic diagram of a SDOF Bouc-Wen model and its restoring force along the
displacement
To generate the time profile of the random excitation force f(t) without abrupt changes, the
Fourier sine series with the random translation is adopted as follows:
f(t)= b
q
∑
i=1
sin
2πi
D
t+ w
i
, (3.7)
where b is the coefficient of the sine series, q is the number of sine terms, D is the duration of
the excitation force, and w
i
is the random translation for each sine term obtained from the uniform
distribution density function bounded by[0,2π).
The excitation force f(t) is randomly generated for the duration D= 80T to guarantee enough
data points to reflect the sustained (stationary) behavior of the system. Note that T is the period
of the Bouc-Wen model. Parameter values used in the Bouc-Wen model are reported in Table 3.2.
For training and validation, we compute f(t) for different standard deviations. The mean of all
f(t) is 0 given its Fourier sine series representation. The standard deviation for the training data
is set to 1 by adjusting the value of b (see Eq. 3.7). For the validation data sets, three sets of f(t)
is prepared for different standard deviations with different random seeds to simulate f(t) within or
beyond the training data. The values of standard deviation (σ
v
1
, σ
v
2
, and σ
v
3
) are 0.9, 1, and 1.1,
59
respectively. The reference Bouc-Wen model (see Eq. 3.6) is solved by using the Runge-Kutta 4th
order method with∆t= 0.01T to yield the reference system output x, ˙ x, and ¨ x under f(t). The final
output data for training (i.e.,ψ
training
(t)) is prepared by augmenting 0, 1, or 5% error characterized
by zero-mean Gaussian noise to the reference responses.
Table 3.2: Parameter specification for the SDOF Bouc-Wen model and input data for training and
validation
Parameters Values Parameters Values
Referencesystemmodel Inputdata
m 2.000 Input type Fourier sine series with
the random translation
η 1.000 Number of sine terms q 800
A 1.000 Meanµ 0
ν 1.000 Standard deviation
B 2.000 for trainingσ 1
γ -0.500 for validationσ
j
( j= v
1
, v
2
, and v
3
) 0.9, 1, and 1.1
κ 1.000 Duration 80T
Sampling frequency∆t 0.01T
Functions Heaviside step and absolute
value
The identification scheme developed in Section 3.3 results in the GP-identified system models
and corresponding fitness values (see Eq. 3.5) for the reference Bouc-Wen model as exhibited
in Table 3.3. The derivative of the restoring force estimated through the GP-based methodology
displays similar terms in all noise cases. These GP-identified system models are similar to the
actual governing equation of the reference Bouc-Wen model. The derivative of restoring force
ˆ
˙ r(x, ˙ x) identified through GP show similar terms regardless of the degree of errors in the training
data.
The output from the GP-identified system models are examined with their normalized RMSEs
of the output under training and validation data in Table 3.4. The normalized RMSEs from the
reference output data are similar within the range 1.2 to 5.3% for any degree of noise in the training
data. In the similar way, the normalized RMSEs under the validation data ranges from 1.6 to 7.9%
for all noise levels.
60
Table 3.3: GP-identified system models and corresponding fitness values (Eq. 3.5) as a function of
the noise level in the training data
Noise level Derivative of restoring force ˙ r(x, ˙ x) Fitness value
ref. 1.000 ˙ x− 2.000 r(x, ˙ x)| ˙ x|− 0.5 ˙ x|r(x, ˙ x)|
0 % 1.067 ˙ x− 5.549 r(x, ˙ x)
2
˙ x− 0.249 r(x, ˙ x)| ˙ x|− 0.236 r(x, ˙ x) 4.572
1 % 0.821 ˙ x− 2.034 r(x, ˙ x)| ˙ x| 2.780
5 % 1.000 ˙ x− 2.506 r(x, ˙ x)| ˙ x| 5.632
Table 3.4: Normalized RMSEs of the output between the GP-identified system models and the
reference system model with reference and validation data
States displacement x velocity ˙ x acceleration ¨ x
Referencedata(σ
t
= 1)
0 % Noise 1.676% 5.323% 3.417%
1 % Noise 1.186% 3.067% 2.121%
5 % Noise 1.254% 2.439% 1.816%
Validationdata(σ
v
1
= 0.9)
0 % Noise 7.877% 6.374% 4.061%
1 % Noise 3.143% 3.196% 2.610%
5 % Noise 1.671% 2.443% 2.141%
Validationdata(σ
v
2
= 1)
0 % Noise 6.315% 4.871% 3.385%
1 % Noise 1.851% 2.543% 2.130%
5 % Noise 2.016% 2.144% 1.936%
Validationdata(σ
v
3
= 1.1)
0 % Noise 4.879% 4.195% 3.382%
1 % Noise 1.641% 2.137% 1.827%
5 % Noise 2.358% 1.834% 1.690%
61
The time-varying output and phase diagram from both the reference and GP-identified system
models are displayed together with the reference and validation data in Fig. 3.8. Fig. 3.8(a)-(c)
show that the GP-identified system models trained with any level of noise (0, 1, and 5%) generate
output and phase diagrams close to those from obtained from the reference system model in both
cases. The phase diagrams of the restoring force vs. velocity (see Fig. 3.8(f) and (l)) clearly show
that the identification scheme utilizing GP can capture the hysteresis behavior of the reference
Bouc-Wen model.
62
(a) (b) (c)
(d) (e) (f)
Responses due to training data ( = )
Responses due to validation data (
= )
(g) (h) (i)
(j) (k) (l)
Figure 3.8: Time profiles of output and phase diagrams from the reference and GP-identified sys-
tem models for training and validation; the time profiles between 40 T and 50T for excitation force
(a) and (g), displacement (b) and (h), and acceleration (c) and (i) with the training and validation
data, respectively; phase diagrams for restoring force (d)-(f) and (j)-(l) with the training and vali-
dation data, respectively.
63
3.4.2 soil-moisture model
In the following, we investigate a soil-moisture model which is relevant for environmental, agri-
cultural and water resources engineering. The pioneering work of Rodr´ ıguez-Iturbe and Porporato
[122] shows the development a lumped soil-moisture model at a given location by considering
major contributing factors (e.g., rainfall, infiltration, and moisture losses) and by assuming that the
horizontal interactions between locations are minor, see also [81, 125, 126]. This model has been
employed to understand hydrological controls on the emission of greenhouse gasses [82, 127]. The
reference modelM
ref
considered in this section is characterized by (see [122] for details):
φZ
r
˙ s(t)=ϕ(s;t)− χ(s);
ϕ(s;t)= R(t)− I(t)− Q(s;t);
χ(s;t)= E(s;t)+ L(s;t),
(3.8)
where φ is the porosity of the soil, Z
r
is the depth of active soil [mm], s is the relative soil-
moisture content (0≤ s≤ 1), ˙ s is its derivative in time, ϕ(s;t) is the rate of infiltration from
rainfall [mm/d], R(t) is the rainfall rate [mm/d], I(t) is the interception rate [mm/d], Q(t) is the
runoff rate [mm/d],χ(s;t) is the rate of soil-moisture losses from the active soil [mm/d], E(s;t) is
the rate of evapotranspiration [mm/d], and L(s;t) is the rate of leakage [mm/d]. The relationship
amongst all the components is schematically illustrated in Fig. 3.9.
64
Output data:
Relative soil
moisture content
= & ̇
Input data:
Rainfall
= ()
Runoff
;
Leakage
Evapo-
transpiration
Interception
()
Figure 3.9: The schematic diagram of the soil-moisture model [126]
For this study, the soil-moisture model (see Eq. 3.8) is simplified into the following consecutive
logical expressions [81, 122, 125, 126]:
R(t)− I(t)=
0 if R(t)≤ ∆
R(t)− ∆ if R(t)>∆
, (3.9)
ϕ(s;t)=
R(t)− I(t) [R(t)− I(t)]/nZ
r
≤ 1− s
1− s [R(t)− I(t)]/nZ
r
> 1− s
, (3.10)
χ(s)=
0 0< s≤ s
h
E
w
s− s
h
s
w
− s
h
s
h
< s≤ s
w
E
w
+(E
max
− E
w
)
s− s
w
s
∗ − s
w
s
w
< s≤ s
∗ E
max
s
∗ < s≤ s
f c
E
max
+
K
s
e
β(1− s
f c
)
− 1
[e
β(s− s
f c
)
− 1] s
f c
< s≤ 1
, (3.11)
where∆ is the intercept rate [mm/d], E
w
is the evapotraspiration rate [mm/d] at the wilting point,
E
max
is the maximum evapotraspiration rate [mm/d], s
h
is the hygroscopic point, s
w
is the wilting
point, s
∗ is the soil-moisture when the evapotraspiration rate reaches at the maximum, s
f c
is the
65
field capacity, K
s
is the saturated hydraulic conductivity of the soil [mm/d], andβ is the coefficient
for leakage losses [126].
In the identification scheme, it is assumed that R(t), s(t), and ˙ s(t) are accessible from measure-
ments. The target function is the derivative of the relative soil-moisture content ˙ s(t) in equation
(3.8). Following the description of the model in Fig. 3.9, the input data θ(t) is represented by
the rainfall R(t), and the outputψ(t) is described as the relative soil-moisture content s(t) and its
derivative ˙ s(t). The point location is assumed to be subjected to a series of rainfall events that
follows the marked Poisson process [126, 128]. The rainfall R(t) for training and validation is
randomly generated from the exponential distribution as follows:
f(t
inter
;λ)=
λe
− λt
inter
t
inter
≥ 0
0 t
inter
< 0
(3.12)
where t
inter
is the inter-arrival time [d],λ is the inverse of the mean inter-arrival time [storms/d].
The parameters of the reference soil-moisture model are referred to the application study [128].
The mean inter-arrival time,µ
t
inter
, is 6 d and the total period of R(t) is 730 d. The rainfall depth is
randomly produced by the exponential distribution with 1.5 mm of the mean depth denoted byµ
R
d
.
The depth of rainfall is assume to be constant for 1 d. The reference output s(t) and ˙ s(t) under R(t)
is calculated by solving coupled ODEs (see Eqs. 3.9-3.11) with the Runge-Kutta 4th order method
with time step∆t = 0.1 d. Considering additional external factors such as measuring errors, the
training data (i.e., ψ
training
(t)) is finalized by adding 0, 1, or 5% of the zero-mean Gaussian error
to the reference responses s(t) and ˙ s(t). Parameter values used in the soil-moisture model and the
rainfall are exhibited in Table 3.5.
Table 3.6 summarises the results from the identification scheme. As shown in Table 3.6, the
methodology produces the GP-identified system models for the soil-moisture model as a function
of the noise level embedded in the training data. The soil-moisture models identified from GP share
the same terms with different coefficients. Fitness values (see Eq. 3.5) (i.e., normalized RMSEs
and the complexity penalty of the model) are similar regardless of the degree of noise reduces.
66
Table 3.5: Parameter specification of the soil-moisture model and input data for training and vali-
dation
Parameters Values Parameters Values
Referencesystem Inputdata
n 0.42 Input type Marked Poisson process
Z
r
100 mm Mean inter-arrival timeµ
t
inter
6 d
∆ 0.20 mm/d Mean depth
E
w
0.011 mm/d for trainingµ
R
d
1.5 mm
E
max
0.475 mm/d for validationµ
R
d
, j
( j=v
1
, v
2
, and v
3
) 1.35, 1.5, and 1.65 mm
s
h
0.048 Duration 730 d
s
w
0.06 Sampling frequency∆t 0.1 d
s
∗ 0.105 Functions Heaviside step, absolute
s
f c
0.29 value, and exponential
K
s
109.8 mm/d
β 7.5
Table 3.6: GP-identified system models and corresponding fitness values (Eq. 3.5) as a function of
the noise level in the training data
Noise level Derivative of soil-moisture content ˙ s Fitness value
ref. Eqs. 3.8-3.11
0 % 15.380 s
4
+ 1.000 R(t) s
3
− 1.000 s
2
+ 0.023 R(t)+ 0.003 9.814
1 % [− 1.000 exp(3.080 s
3
R(t)
3
)+ 2.884 R(t)] s
2
+[1.017 exp(s
2
− 11.875
0.189 R(t)
3
)− 1.000] s− 0.078 exp(− 0.209 R(t)
3
) + 0.080
5 % 7.486 s
4
− 0.151 s+ 0.026 R(t)+ 0.008 9.618
67
The normalized RMSEs between the output obtained from the GP-identified systems and ones
from the reference system are presented in Table 3.7. They range from 7.2 to 8.5% with all levels
of noise in the training data. The normalized RMSEs with validation data indicate that they are
within 11.9% whenµ
R
d
is the same or a smaller value than the one used for reference data. When
µ
R
d
increases, the normalized RMSEs rapidly increase in the cases with 0 and 1% noise. This is
due to the fact that the soil-moisture content did not exceed s
f c
under the current training data. To
identify a soil-moisture model representing the entire range of the soil-moisture content, training
data obtained for higher values ofµ
R
d
is necessary.
Table 3.7: Normalized RMSEs of output between responses from the GP-identified system models
and from the reference system model with reference and validation data.
States soil-moisture content s
Referencedata(µ
R
d
= 1.5)
0 % Noise 8.514%
1 % Noise 8.523%
5 % Noise 7.204%
Validationdata(µ
R
d
= 1.35)
0 % Noise 7.165%
1 % Noise 9.308%
5 % Noise 5.673%
Validationdata(µ
R
d
= 1.5)
0 % Noise 7.676%
1 % Noise 10.853%
5 % Noise 6.079%
Validationdata(µ
R
d
= 1.65)
0 % Noise >100.0%
1 % Noise 14.346%
5 % Noise 7.937%
Fig. 3.10 shows the temporal evolution of the soil-moisture content and its phase diagrams
obtained from the GP-identified models are compared to those from the reference model. For the
phase diagrams, the relationships between s(t) and ˙ s(t) are examined. While the GP-identified
models show good performance to predict the relative soil-moisture for the training the validation
data (Fig. 3.10(b), (c), (f), and (g)), our results show the challenges in capturing characteristics of
68
the reference system expressed by the phase diagrams (Fig. 3.10(d) and (h)). This difficulty may
be attributed to the complexity of overlapped logical expressions in the reference model (see Eqs.
3.9-3.11).
69
Responses due to training data (
= .)
(d) (c)
(a) (b)
Responses due to validation data (
,
= .)
(h) (g)
(e) (f)
Figure 3.10: Time profiles of the output and phase diagrams obtained from the reference and GP-
identified system models for training and validation; the time profiles between 90 and 270 days for
rainfall (a) and (e), soil-moisture content (b) and (f), and its derivative (c) and (g) with the training
and validation data, respectively; phase diagrams between soil-moisture content and its derivative
(d) and (h) with the training and validation data, respectively.
70
3.5 Discussion
Given that the scope of this paper lies in expanding the application of the GP-based identification
method to different nonlinear multi-physics systems, every hyperparameter is not tuned carefully.
For most hyperparameters, such as population size and evolutionary operation rates, typical values
reported in previous studies are adopted [114, 115]. The hyperparameters under consideration
fall within three categories. Firstly, hyperparameters for expression trees (such as the choice of
basis functions, the probabilities of component assignment, and the maximum node number of
expression tree) are set to test various mathematical expressions in a candidate model. However,
since these parameters entail computational cost, they are adjusted to avoid convergence time from
being significantly delayed. Secondly, the most influential hyperparameter to the accuracy and
simplicity of an identified model is the complexity penalty coefficient. On one hand, as the value
of this coefficient increases, the accuracy of an identified model is compromised by the model’s
simplicity. On the other hand, when this coefficient decreases, the model accuracy increases.
Noticeably, at the certain level of complexity penalty, the identified model shows higher accuracy
in training and validation data sets than the model trained with lower complexity penalty. This is
because the overfitting into a given data set has a negative impact on capturing the nonlinearity of
systems. In this context, the range from 0.001 to 1 is tested for the complexity penalty coefficient,
and for this study, 0.1 is selected to balance between the accuracy and the simplicity of an identified
model. Lastly, by adopting the output error in the fitness function, we achieved earlier convergence,
compared to previous studies [115]. Thus, stopping criteria used in GP identification are able to be
shortened, hence decreasing the total convergence time. After setting up the hyperparameters, only
two hyperparameters (i.e., the choice of basis functions and the complexity penalty coefficient) are
carefully considered for each reference model. In terms of concerning only two hyperparameters
when applying GP identification into a different system, the developed method has a very powerful
advantage in addressing the class of nonlinear system identification problems.
The developed GP identification method results in providing systems of ODEs for two differ-
ent nonlinear models. Since these identified models are discovered through output errors computed
71
from solutions of ODEs, the proposed GP approach allows interpretation of the models as govern-
ing equations revealing the underlying physics of the systems as well as improves model accuracy
and robustness. The following computation cost associated with the repeated computational solu-
tions of the ODEs is well managed by making the technical improvements for earlier convergence,
such as adding more bloat control techniques (i.e., removing redundancy both in a candidate model
and a population) and implementing the GP approach on parallel computing. As a result, this study
renders the GP method into an effective grey box modeling tool for nonlinear multi-physics sys-
tems in many disciplines.
Although the proposed GP approach has many similarities with other non-parametric data-
driven identification methods, it still contains a distinctive feature. First of all, in terms of pro-
viding possible mathematical expressions in the population of candidate models, this approach is
similar with a sparse approximation method [22, 24]. However, since GP constructs candidate
models with random combination of mathematical components and mixes them with evolutionary
processes, such as crossover and mutation, much more diverse expressions are able to appear and
be tested. With respect to capturing the nonlinearity of multi-physics systems, many other ap-
proaches have their own progress. For instance, the Gaussian process approach shows low errors
from training and test data by yielding stochastic estimation [99]. V olterra/Wiener neural network
also successfully captured a nonlinear hysteretic behavior of the Bouc-Wen model, presenting low
errors from training and test data [97]. However, these approaches have the limitation in offering
an easily interpretable model that reveals the underlying physics. In the case of GP, an identified
model is a set of ODEs which consist of variables, basis functions, and coefficients that repre-
sent nonlinear properties. Obviously, discovering a governing equation of a target system as an
identification result is the objective of other methodologies. For instance, there are such attempts
utilizing neural networks and showed some viable results [26, 27, 98, 105]. With activation func-
tions such as a Heaviside function, this neural network has a potential to produce parsimonious
models capturing the nonlinearity of a system. Nevertheless, one major difference between neural
network and GP approaches is the way of training models performed (i.e., evolutionary processes
72
vs. stochastic gradient descent). The evolutionary process has more randomness in its training
process such as crossover and mutation, which is considered suitable for a search within an infi-
nite dimensional function space. Nevertheless, the computational speedup associated with GP still
needs to be improved, and this will be addressed in a future research.
3.6 Conclusions
A general GP-based identification scheme is developed for nonlinear multi-physics systems. This
identification scheme is employed to identify a governing equation characterized by a set of ODEs
for a nonlinear, multi-physics system, solely on the basis of its input and output data set, and little
prior knowledge. The study uses the output error in the fitness evaluation that is computed from
solving a candidate model (i.e., ODEs) and comparing it to the measured data. Along with this
type of the output error, utilizing parallel processing enables the generalization of the GP-based
method, without a rapid increase of computational time. Two illustrative examples originating
from different fields of science and engineering (i.e., the Bouc-Wen model and the soil-moisture
model) demonstrate that the GP-based methodology successfully estimated the governing equa-
tions (i.e., coupled ODEs) for the systems. GP shows a good performance to predict responses of
the systems with the validation data sets, by capturing the dominant characteristics of the systems
that are exhibited in phase diagrams. In addition, thorough inspection of the GP-identified system
models provides a better understanding of the phenomena associated with systems. In summary,
a presented identification scheme employing GP is a powerful grey box modeling tool in terms
of providing some physical insights by constructing governing differential equations for nonlinear
multi-physics systems.
73
Chapter 4
Hybrid Physics-Based and Data-Driven Discovery of Partial
Differential Equations for Multi-Scale Hydrologic Systems by
Stochastic Optimization
4.1 Introduction
System identification methods aim to represent a physical system of concern with a mathematical
model on the basis of its input/output data [129]. It plays a crucial role in predicting the systems’
behavior and enhancing user’s comprehension of underlying physics. Recent advances of sensing
technology and computing power enable system identification methods to capture more complex
multi-physics systems [20]. By virtue of their effectiveness and versatile applicability, state-of-
the-art identification approaches are actively being developed [130].
Modeling approaches can be divided into three broad categories: white, gray, and black box
modeling [20]. Firstly, the “white box” modeling utilizes underlying physics of systems to es-
tablish models. Despite its high accuracy of prediction, this type of modeling has limitations in
term of solely relying on the user’s understanding of the underlying system. Next, the “black
box” modeling is purely driven by data. By employing distinctive mathematical forms (e.g., neural
networks [97] and Gaussian processes [99]), this modeling approach is effectively able to predict
74
system responses without any prior assumptions about the underlying physics. However, since the
black box modeling is purely data-driven, it is neither able to reveal the physics behind systems
nor provide reliable estimates beyond the range of training data. Lastly, the “gray box” modeling
attempts to leverage both user’s prior physical knowledge about systems and data to build models.
This mathematical modeling approach provides a way to overcome the limitations of the white and
black box modeling; the prior knowledge of systems needed to develop models is minimized, and
the models embedded with system dynamics increase reliability of extrapolation. Given these ad-
vantages, many gray box approaches are being proposed. These consist of state-space approaches
[21, 101], sparse approximation [22–24], Gaussian processes [25], and neural networks [26–32].
Genetic Programming (GP) is one of the available gray box tools for modeling physical sys-
tems. Inspired by Darwin’s theory of evolution, GP exercises evolutionary processes (e.g., crossover
and mutation) to optimize programs in a stochastic manner for a specific task [100]. Programs of
GP are designed to perform a specific task and to be decomposed into genetic information [131]:
a vector of parameters for function optimization, an expression tree for symbolic regression, or a
decision tree for decision making. In the GP process, a population of programs is randomly initial-
ized as the first generation, and their fitness is tested according to the performance at a certain task.
Stochastic evolutionary processes are then applied to the programs in accordance with their fitness
in consecutive generations. As a result, the programs become optimized to perform the task.
GP-based identification approaches have been considered effective in identifying symbolic re-
gressions, ordinary differential equations (ODEs) and reduced-order models for nonlinear systems
in mechanics (see Ref. [20] and references therein). In the context of applied mechanics, a few
works [114, 115] have improved GP-based identification methods by expanding the component
library for expression trees (e.g., derivative variables and discontinuous functions). These devel-
opments offer insight for system dynamics and are fairly accurate in extrapolation. Despite this
progress, current GP applications are still limited due to two main issues. Firstly, GP-based meth-
ods have a large amount of hyperparameters that needs to be tuned, which makes the generalization
of the methods for a set of identification problems difficult. Secondly, an attempt to generalize the
75
method for a variety of identification problems has a potential to delay convergence time consider-
ably. These shortcomings were addressed by Im et al. [132] in the context of mechanical systems
described by nonlinear ODEs.
Given that many multi-physics systems are governed by partial differential equations (PDEs),
there is need to further expand GP-based approaches to address physical systems that vary in space
and time. Our work aims to fill this gap by developing a generalized GP-based stochastic opti-
mization method to identify PDEs. The novel features of the proposed framework are as follows:
(1) the incorporation of a multi-purpose loss function and stochastic sampling in the parallel fit-
ness test and (2) the inclusion of bloat control techniques within the evolutionary processes. As
it will be shown, these significant modifications prevent potential drawbacks of GP (i.e., issues
of many hyperparameters and slow convergence rate) and lead to computationally efficient and
highly reproducible identification results. We demonstrate the performance of the methodology
in two canonical PDEs: Burgers’ equation (nonlinear PDE) and the advection-dispersion equation
(linear PDE). The accuracy and robustness of the identification procedure are investigated under
different noise levels and system characteristics. Since the method provides a variety of candi-
date PDEs for a target system in the final result, the identified PDEs serve as possible descriptions
of complex system dynamics to improve user’s understanding, while the best one is used as the
system model to predict (or extrapolate) future responses.
This paper is organized as follows. First, the general problem statement is given in Section 4.2.
Details pertaining to the novel identification method are presented in Section 4.3. This section will
provide a road map for users who wish to tailor this approach for their own application. Illustrative
examples are given in Section 4.4, and their discussion is followed in Section 4.5. Lastly, the
implications of this study and future research are described in Section 4.6.
76
4.2 Problem Statement
A multi-physics system is considered under limited prior knowledge. A multi-dimensional quan-
tity of interest in the system is denoted by r(x,t) where x and t correspond to the d-dimensional
Cartesian space and time. Such a quantity can be the speed of a wave in shallow water, solute con-
centration in an environmental medium, the temperature distribution in a heterogeneous material,
or atomic dispersion in a metallic glass (see more examples in Fig.4.1a). For the purpose of this
work, we consider a one-dimensional quantity, r(x,t), in a one-dimensional spatial domain x. We
represent an external excitation by f(x,t) which will impact the spatiotemporal dynamics of r(x,t)
(see Fig. 4.1b). We assume that the measurements are available at discrete points of space and time
domains, f(x
i
,t
j
) and r(x
i
,t
j
), (i= 1,2,...,n
x
and j= 1,2,...,n
t
where n
x
and n
t
are the numbers
of data points for space and time domains, respectively).
On the basis of the measurements, the system model, namelyM , can be established to describe
the system dynamics of r(x
i
,t
j
) subject to f(x
i
,t
j
), in the form of a PDE (see Fig. 4.1b-d). Since
the model response is an approximation of the (mathematical) system response under the same
excitation f(x
i
,t
j
), it is denoted by ˆ r(x
i
,t
j
) in the system model:
M
ˆ r,
∂ ˆ r
∂t
,
∂ ˆ r
∂x
,
∂
2
ˆ r
∂x
2
,...; f(x
i
,t
j
)
= 0. (4.1)
The main goal of this work is to identify the system modelM , in the infinite function space
M, which minimizes the error norm (i.e., ε
res
) of a suitable model fidelity model, e.g., the nor-
malized root-mean-square error between the values of the system response measurements and the
corresponding model response, namely r(x
i
,t
j
) and ˆ r(x
i
,t
j
):
min
M
ε
res
M
ˆ r,
∂ ˆ r
∂t
,
∂ ˆ r
∂x
,
∂
2
ˆ r
∂x
2
,...; f(x
i
,t
j
)
= 0
withM ⊂ M, (4.2)
77
System response
,
Model response
̂ ,
Excitation
,
Multi-Physics System
System Model
ℳ ̂,
̂
,
̂
,
!
̂
!
,…; =
(a)
(b)
̂
=
!
̂
!
(c) (d)
Fluid flow through pipes Dynamics of oscillators Infiltration in soils
,
Atmospheric flow
,
Contaminant transport in aquifers
,
,
Figure 4.1: Schematic illustration of the general modeling problem; (a) multi-physics systems
encountered in various science and engineering domains; (b) modeling approach representing sys-
tem dynamics with a partial differential equation (PDE); (c) external excitation utilized as an initial
condition of a PDE; and (d) evaluation of a quantity of interest over time by numerically solving
the PDE.
78
whereε
res
is defined as:
ε
res
=
v
u
u
t
∑
i
∑
j
ˆ r(x
i
,t
j
)− r(x
i
,t
j
)
2
∑
i
∑
j
r(x
i
,t
j
)
2
. (4.3)
Thus, the normalized error is the ratio of the Euclidean L
2
-norm of the deviation between the
reference and estimated responses,
ˆ r(x
i
,t
j
)− r(x
i
,t
j
)
2
, divided by the norm of the ensemble of
measurements,
r(x
i
,t
j
)
2
.
In order to “discover” a PDE, as the system modelM , from a data set of f(x
i
,t
j
) and r(x
i
,t
j
)
under limited prior knowledge of the system, a general framework of system identification is estab-
lished by utilizing stochastic optimization (e.g., GP). The proposed methodology is implemented
by modifying the GP algorithm reported in Im et al. [132]. Details follow in the Section 4.3.
4.3 Methodology
4.3.1 Genetic programming for system identification
A general system identification approach utilizing GP is developed to discover a PDE from a
given data set as a model for a multi-physics system. Details of the GP method are found in
Koza [100]. Its code package, named Genetic Programming for System Identification (GPSI), is
written in Python (version 3.7.4) with the SymPy library (version 1.7.1) and is available in Github
(https://github.com/Jinwoousc/GPSI). In the following, we provide a series of steps along
with a brief synopsis of the methodology employed in GPSI, in line with Fig. 4.2:
1. Determine a quantity of interest (i.e., r(x,t)) in a target multi-physics system, and measure
(or sample) both the excitation f(x
i
,t
j
) (i= 1,2,...,n
x
and j = 1,2,...,n
t
) and the corre-
sponding system response r(x
i
,t
j
).
2. According to the user’s prior knowledge of the physical system, select mathematical com-
ponents, e.g., basis functions such as the Heaviside step function and error function (see the
79
mathematical expression library box in Fig. 4.2), which will postulate the expression “li-
brary” to be used for candidate PDEsM
k
(k= 1,2,...,n
M
where n
M
is the number of the
PDEs).
3. Prepare the GP simulation sets by selecting the loss function components (i.e., the model
response error, the equation residual error, or the model complexity penalty) for the fitness
test depending on the data quality of f(x
i
,t
j
) and r(x
i
,t
j
) (refer to Section 4.3.2 for details)
and by arranging the rest of the hyperparameters of GPSI simulation, mainly focusing on the
complex penalty coefficient.
4. Randomly generate initial PDEs by employing the form of a binary expression tree (see the
example of a candidate PDE in Fig. 4.2).
5. Run the GPSI simulation that repeats the fitness test and the evolutionary processes (see the
corresponding boxes in Fig. 4.2) to update the population of PDEs in consecutive genera-
tions. Details related to the evolutionary processes are described in Sections 4.3.3.
6. Collect the PDEs from the last population with which the stop criterion of the simulation
is satisfied, and determine the system model M among the identified PDEs, by considering
theirε
res
(see Eq. 4.3), model simplicity, and interpretation revealing the system dynamics.
7. Test other simulation sets with different settings from Step 2 on the basis of the updated prior
knowledge of the target multi-physics system.
The proposed approach firstly arranges the mathematical components for candidate PDEs and
the other hyperparameters by utilizing the user’s prior knowledge of the target system (Steps 1-
3). Next, stochastic optimization in the infinite function space is performed through GP for the
randomly generated PDEs to minimizeε
res
(see Eq. 4.2) (Steps 4-5). Finally, the system modelM
is selected among the optimized PDEs, which gives users an acceptable estimate (or prediction)
of the system responses under different excitations, and the understanding of the system dynamics
80
PDE Loss
̂
−0.3
̂
+0.8
!
̂
!
=0 0.591
̂
−0.7
̂
=0 0.732
̂
−0.8
!
̂
!
=0
0.948
… …
̂
+ −0.3
̂
+0.8
!
̂
!
=0 →
̂
+
̂
=0
̂
−0.7
̂
=0 →
̂
−0.7 −0.3
̂
+0.8
!
̂
!
=0
̂
−0.8
!
̂
!
=0 →
̂
−0.8
!
̂
!
+ −0.7
̂
=0
…
Fitness Test for PDEs
e.g.,
GPSI-identified PDE
as a System Model
Random Generation of PDEs
utilizing expression trees
Stop Criteria
checked
(Example of a candidate PDE)
- A randomly generated expression tree
- The corresponding PDE
̂
− 0.3
̂
+0.8
!
̂
!
= 0
operations = 0.5
basisfunctions = 0.1
variables = 0.2
numbers = 0.2
−
̂
(⋅)
+
× ×
0.3
̂
0.8
!
̂
!
Multi-Physics System
Data Set from Measurements
2
,
3
&
2
,
3
( =1,…,
4
& =1,…,
5
)
Mathematical Expression Library
for PDEs from Prior Knowledge
e.g., +,−,×, ⋅ , ⋅ ,erf ⋅ , ̂ ,
6 ̂ 7
65
,
6 ̂ 7
64
,
6
/
̂ 7
64
/
,…
GPfSI Simulation Hyperparameters
(e.g., loss function & complexity penalty)
Evolutionary Processes
Figure 4.2: Workflow chart for the Genetic Programming for System Identification (GPSI) ap-
proach proposed; In the mathematical expression library box, H(·), |·| and erf(·) represent the
Heaviside step function, absolute value and error function, respectively; Details related to the evo-
lutionary processes are exhibited in Fig 4.3.
81
(Step 6-7). To implement this method, we develop our own algorithm, GPSI, by enhancing the GP-
based identification algorithm for ODE problems in the previous study [132]. We refer to this study
[132] for the detailed description of each component in the algorithm and its default parameters.
In this work, additional explanation for the method is categorized and presented mainly for two
specific questions: How do we evaluate the plausibility of candidate PDEs for the system? How
do we improve the fidelity of the PDEs for the system?
4.3.2 Fitness test
A fitness test is carried out in parallel to evaluate the suitability of each candidate PDE to a given
data set, at every generation. The stochastic optimization utilized here (i.e., the evolutionary pro-
cesses) is performed on the basis of fitness assigned to the candidate PDEs. In this sense, adopting
a proper evaluation function for the fitness test is a key element of the proposed method. GPSI
significantly improves its effectiveness and robustness in identifying a system model and its com-
putational efficiency by employing the multi-objective loss function L , defined by:
L =λ
1
ε
res
+λ
2
ε
eq
+λ
3
p, (4.4)
whereλ
w
(w= 1,2, and 3) is the weight of each term bound to the range [0,1],ε
res
is given in Eq.
4.3, and p is the number of nodes in the binary expression tree to which a PDE is converted. Here,
ε
eq
is the normalized root-mean-square error of the equation residual whenM is rearranged with
respect to
∂ ˆ r
∂t
and substituted with the system measurements. It is defined as follows:
ε
eq
=
v
u
u
u
u
t
∑
i
∑
j
n
∂ ˆ r
∂t
− ∂r
∂t
o
2
∑
i
∑
j
n
∂r
∂t
o
2
. (4.5)
The objective of GPSI is to identify the PDE that has the minimumL , which results in pro-
viding the system model. The loss functionL (Eq. 4.4) consists of the model response error (i.e.,
82
ε
res
), the equation residual error (i.e.,ε
eq
), and the complexity penalty of a PDE (i.e., p) with their
own weights (λ
1
,λ
2
, andλ
3
). In the following, we provide details regarding the equation forL .
These details will assist users to select proper components and weights inL .
• The model response error, ε
res
(see Eq. 4.3), is the root-mean-square error between the
model response ˆ r(x
i
,t
j
) and the system response r(x
i
,t
j
), which is then normalized by the
root mean square of r(x
i
,t
j
). Users can determine whether it is considered inL or not,
by assigning 0 or 1 to λ
1
. Here, the model response is computed by solving a candidate
PDEM
k
(k= 1,2,...,n
M
) with a suitable numerical method, such as the Runge-Kutta 4th
order method, under the measured excitation f(x
i
,t
j
). In the integration process, initial and
boundary conditions need to be posed. They come from physical constraints of a target
system, or f(x
i
,t
j
) is incorporated into initial or boundary conditions. If a PDE does not
reflect the system dynamics, ε
res
is being rapidly accumulated from the initial state during
the integration process. Thus, through this type of error, it is more likely to identify a PDE
revealing the system dynamics than the case when only the equation residual error, ε
eq
, is
used inL . Regarding the data quality needed, f(x
i
,t
j
) and well-posed initial and boundary
conditions are the only requirement to generate ˆ r(x
i
,t
j
) and then to computeε
res
. Thus, even
if the spatial or temporal resolution of r(x
i
,t
j
) is limited,ε
res
is able to be evaluated. For the
data size of r(x
i
,t
j
), it is desirable for r(x
i
,t
j
) to have a relatively long time domain for the
accumulation of the response error during the integration process.
• The equation residual error,ε
eq
(see Eq. 4.5), is the root-mean-square error between the time
derivatives of the model response
∂ ˆ r
∂t
and the system response
∂r
∂t
, which is then normalized by
the root mean square of
∂r
∂t
. The time derivative values of the model response are computed
by rearranging a candidate PDE with respect to
∂ ˆ r
∂t
and substituting the system measurements
in it, while those of the system response are from the measurements or differentiation. This
type of error requires a higher data quality to be computed than the model response error,ε
res
,
because r(x
i
,t
j
) and its all differential values, such as
∂r
∂t
,
∂r
∂x
,
∂
2
r
∂x
2
, etc. need to be substituted
into a candidate PDE. Thus, ε
eq
is considered inL (i.e., λ
2
= 1) only if all the data (i.e.,
83
r(x
i
,t
j
) and its all differential values) are directly obtained from the measurements, or the
data quality of r(x
i
,t
j
) is appropriate to compute its differential terms with regard to data
noise and resolution. In such a case, the associated computational cost is much cheaper than
the one whenε
res
is considered inL .
• The model complexity penalty, p, is included inL (see Eq. 4.4) to prevent an overfitting
issue. Since a candidate PDE is constructed from a binary expression tree, the number of
nodes in the expression tree is an effective indicator representing the model complexity.
Since there is a noticeable trade-off between the model complexity and the errors (i.e., ε
res
and ε
eq
) as a function of λ
3
, it is the important hyperparameter in GPSI which needs to be
investigated with preliminary trials. Besides from the model complexity and the errors, the
reproducibility of GPSI reduces and its convergence time is extended asλ
3
decreases. Thus,
it is recommended that users try a high value (e.g., 1× 10
− 1
) first, and then decrease its
magnitude of the order to find a proper value of λ
3
.
• In this work, the computational efficiency of L has been significantly enhanced by incor-
porating a random sampling method [133, 134] for ε
res
and ε
eq
(see Eqs. 4.3 and 4.5). For
the fitness test at each generation, a certain number of small batches in the same size are
randomly sampled from the system measurements, r(x
i
,t
j
), and used to calculate the errors.
The number and size of random small batches are denoted by n
b
and n
s
, respectively. One
random small batch, namely b
l
(x
i
,t
m
) (l= 1,2,...,n
b
), has the same spatial data points with
r(x
i
,t
j
) (i= 1,2,...,n
x
) but has a subsequence (m= 1,2,...,n
s
) of the original time sequence
( j= 1,2,...,n
t
). This subsequence is sampled by selecting a random starting moment from
the original time sequence, and extracting the following subsequence with the size of n
s
. By
utilizing this random sampling method, the data use for the fitness test at each generation is
considerably reduced (e.g., 10 or 20%), which directly leads to a lower computational cost
forL .
84
The computational efficiency of GPSI is further improved by utilizing advantages of both error
types (i.e., ε
res
and ε
eq
) forL . The GPSI simulation is carried out with three different phases of
L in series. The first two phases only use the model response and the model complexity penalty
(i.e., (λ
1
,λ
2
,λ
3
)=(1,0,0.01)) to capture PDEs that are likely to reflect system dynamics in the
early generation. The complexity penalty coefficient is determined by preliminary trials with the
range of [0.1, 0.0001]. In the first phase, the GPSI simulation runs with one random small batch
having 10% of the data, and stops when the change of the minimum loss function value is less than
1× 10
− 3
in 100 consecutive generations. Candidate PDEs are thereby sorted out to have a similar
response with the system response within a short time domain at a very early generation. In the
second phase, by increasing the size of the random small batch to 20% of the data and the stop
criterion to 200 consecutive generations, the candidate PDEs have more generations to evolve into
the models predicting the system responses in a longer time domain. In the last phase, the equation
residual error is adopted (i.e., (λ
1
,λ
2
,λ
3
)=(0,1,0.01)) to expedite the convergence. While the
model response error needs the maximum data size in one small batch for the error accumulation
during the integration process, the equation residual error does not. Thus, ten random small batches
with 2% of the data each are used to reduce data biases, which results in using 20% of the data in
total forL . The stop criterion of the last phase is that the minimum loss function value decreases
less than 1× 10
− 3
in 300 consecutive generations, or the GPSI simulation stops regardless of the
phases when the number of generations reaches 1000.
4.3.3 Candidate PDEs and their evolution
GP is an algorithm that utilizes programs to perform a specific task, converts them into genetic
information, and applies stochastic evolutionary processes to enhance the programs in consecutive
generations [100]. In GPSI, the programs are candidate PDEs. In order to efficiently address
candidate PDEs, binary expression trees are employed. Converting the mathematical components
85
into nodes and terminals of the binary expression trees, candidate PDEs can be converted back-
and-forth to the corresponding binary expression trees (see together the example of a candidate
PDE in Fig. 4.2 and the conversion of a PDE into an expression tree in Fig. 4.3).
To initialize candidate PDEs through binary expression trees, several properties for binary ex-
pression trees need to be arranged on the basis of the user’s prior knowledge. Above all, it is
most important to list mathematical components including operations (e.g., addition, subtraction,
and multiplication), basis functions (e.g., Heaviside step function, absolute value function, sign
function, exponential function, logarithmic function, error function, sine function and cosine func-
tion), variables (e.g., ˆ r,
∂ ˆ r
∂t
,
∂ ˆ r
∂x
,
∂
2
ˆ r
∂x
2
), and numbers (i.e., real numbers randomly chosen between 0
and 1 which will be optimized afterwards). Expression trees are randomly generated by assigning
the arranged mathematical components into their nodes and terminals, as depicted in the example
of a candidate PDE in Fig. 4.2. This random generation is controlled by the selected maximum
tree level (i.e., the maximum number counting from the initial node to the last terminal, e.g., 5),
the probabilistic weights for the component types (i.e., {operations, functions, variables, num-
bers} ={0.5, 0.1, 0.2, 0.2}), and the expression tree growing method (i.e., Ramped half-and-half
method). Once these properties are set up, a large ensemble of 100 candidate PDEs (M
k
where
k= 1,2,...,n
M
;n
M
= 100) are generated in a stochastic manner for the first population. All the
hyperparameters relevant to the expression trees were tested and determined in the previous study
[132], which allow testing a variety of expression trees while maintaining a viable computational
cost.
The randomly initialized candidate PDEs,M
k
, are then optimized with evolutionary processes
to produce the system model. Since the evolutionary processes (i.e., crossover and mutation) in-
volve single or multiple expression trees, they are applied to the population of the candidate PDEs
and yield a new population for the next generation. By repeating the evolutionary processes in con-
secutive generations until the stop criteria (refer to Section 4.3.2) are satisfied, the system model
can be identified from the best suitable PDE in the last population. Additionally, in order to prevent
a memory problem in the computational engine (computer) and enhance computational efficiency,
86
several techniques for bloat control (i.e., simplifying expression trees, limiting the number of tree
nodes, and removing repetition in the population) are incorporated at the end of the evolutionary
processes. By integrating the evolutionary processes and the bloat control techniques into one evo-
lution module, the framework for GPSI becomes simplified with a higher computational efficiency.
Specific description of the evolutionary processes is presented in Fig. 4.3. Firstly, loss function
values (Eq. 4.4) are assigned to the candidate PDEs,M
k
, in the fitness test beforehand. The
evolution starts by the representation in which the PDEs having a top 10% of the loss function
values are kept intact for the next generation (see the box of current generation in Fig. 4.3). The
rest of the PDEs for the next generation are produced by the crossover or the mutation. For these
evolutionary processes, one PDE is firstly selected from the tournament selection method which
randomly selects two PDEs in the current population and keeps the one having a better loss function
value [100]. This selected PDE, in the form of the expression tree, goes through the crossover and
the mutation (see the arrow of evolutionary processes in Fig. 4.3), based on their probabilities, i.e.,
0.8 and 0.2, respectively. The crossover randomly picks the node of the expression tree converted
from the selected PDE and exchanges with another randomly picked node of the expression tree
from another selected PDE. The mutation randomly picks the node of the expression tree from the
selected PDE and exchanges it with a randomly generated expression tree.
At the end of the evolutionary processes, several bloat control techniques are carried out (see
the arrow of bloat control in Fig. 4.3). Firstly, the modified expression tree is rearranged by per-
forming mathematical operations and simplifying expressions. Next, two conditions are checked.
The first one is whether or not the number of nodes in the expression tree is over the maximum
node number, e.g., 31. This maximum number of nodes occurs in a fully-grown expression tree
with the tree level 6. The second condition is whether the PDE from the expression tree has the
same structure of the PDEs in the current population. When the expression tree satisfies any of
these two conditions, the evolutionary processes repeat from the selection step to guarantee all the
expression trees for PDEs are simplified, not too complicated, and unique (see the box of next
generation in Fig. 4.3).
87
Evolutionary
processes
Crossover (
!"#$$#%&"
=0.8)
Mutation
(
'()*)+#,
= 0.2)
+
̂
−
̂
0.4
The randomly chosen node is
exchanged with a randomly chosen
node from another selected model
+
̂
−
̂
+
0.3
!
̂
!
erf()
0.8 −
The randomly chosen
node is exchanged with
a randomly generated
expression tree
Bloat
control
+
̂
−
̂
1.1
!
̂
!
erf()
−
The expression tree is
mathematically simplified
Simplification
+
̂
−
̂
1.1
!
̂
!
erf()
−
Complexity check
The number of nodes
should be less than the
maximum node number
Otherwise, back
to tournament
selection
1
2
3
4
̂
+
̂
−erf 1.1−
"
̂
"
=0
Next
generation
Repetition check
Otherwise, back
to tournament
selection
The modified PDE model should be unique
in the population when all coefficients of
PDE models are set to 1.
(Coefficient
optimization)
utilizing GA
every 50 generations
e.g.,
̂
+0.5
"
̂
"
=0
Tournament
selection
PDE model
Expression tree
+
̂
×
0.5
!
̂
!
Current
generation
Keep a top 10% of PDE
models based on fitness
test for next generation
Figure 4.3: Representative snapshot of the flow chart to implement the evolutionary processes of
candidate PDEs utilizing their expression trees.
88
In addition to the evolution of the structure of the candidate PDEs, the optimization of their
coefficients are performed separately once in every 50 generations (see the last arrow in Fig. 4.3).
A genetic algorithm [96] is employed instead of gradient-based optimization methods because the
library of basis functions for PDEs includes discontinuous functions such as Heaviside step func-
tion. The genetic algorithm is applied to each PDE separately. It utilizes the coefficient vector
extracted from the PDE. Candidate coefficient vectors are generated by multiplying random real
numbers from the uniform distribution with lower and upper bounds 0.05 and 20 into each ele-
ment of the original vector, respectively, to thoroughly search optimum coefficient values around
the original coefficient values. The range of the uniform distribution was tested and determined
in the previous study [132]. The population of the candidate coefficient vectors then evolves in
consecutive generations in similar fashion as the evolution of the crossover and mutation in GP. In
consideration of the computational cost, the coefficient optimization is applied to the PDEs having
a top 30% of the loss function values. The user-selectable stop criteria for the coefficient optimiza-
tion are: the change of the minimum loss function value is selected, e.g., as less than 1× 10
− 3
in
30 consecutive generations, or the number of generations reached is 300.
4.4 Illustrative Examples
The proposed stochastic identification method, GPSI, is now demonstrated in three applications.
Two cases employ canonical PDEs, i.e., the Burgers’ equation and the advection-dispersion equa-
tion (ADE). Training data for these cases are prepared by solving theoretical reference models and
adding Gaussian noises (to simulate unavoidable measurement noise pollution). The results show
that the identification performance relies on noise levels and system characteristic numbers. The
system characteristic numbers represent the contribution of each term in the reference models. All
GPSI simulations were performed with the use of parallel processing on high-performance com-
puting systems [120]. The processing unit has Intel Xeon 4116 dodeca-core whose CPU clock is
2.10 GHz, and 94 GB RAM.
89
4.4.1 Burgers’ equation
Burgers’ equation is a nonlinear PDE which is expressed in the following non-dimensional form:
∂u
∂t
=− u
∂u
∂x
+
1
Re
∂
2
u
∂x
2
, (4.6)
where u is the dimensionless fluid longitudinal velocity and Re is the Reynolds number, which
describes the system characteristics through the ratio of inertial forces to viscous forces within the
fluid. Here, x denotes the dimensionless coordinate system and t is the dimensionless time.
In order to test different system characteristics, three different regimes of Re (i.e., 20, 100, and
500) are used. The excitation for the reference system model (see Eq. 4.6) is stated as the initial
condition characterized by a Gaussian pulse:
f(x,t)=
Aexp
− 1
2
(x− µ)
2
σ
2
, if t= 0;
0, if t > 0,
(4.7)
where µ is the central location of the Gaussian pulse, σ is the standard deviation of the pulse
from the central location, and A is the coefficient to normalize the peak of the pulse to 1. Periodic
boundary conditions are employed to exclude any boundary effects of the finite space domain.
For the training and validation data, the space and time domains are defined with proper step
sizes, i.e., x∈[0,1] with ∆x = 1/128 and t∈[0,1] with ∆t = 1× 10
− 4
. The system response
u(x
i
,t
j
) (i= 1,2,...,n
x
and j= 1,2,...,n
t
; n
x
= 128 and n
t
= 1× 10
4
) and its derivatives (i.e.,
∂u
∂t
,
∂u
∂x
, and
∂
2
u
∂x
2
) are computed by solving Eq. 4.6 under f(x
i
,t
j
). The values of µ and σ for f(x
i
,t
j
)
are suitably selected to exhibit system characteristics within the space and time domains: 0.2 and
0.05 for the training data, and 0.4 and 0.1 for the validation data, respectively. For the data noise,
a zero-mean Gaussian noise is embedded only into the training data with different levels (i.e., ε
= 1, 5, and 10%). Accordingly, nine identification cases are prepared in total with the Burgers’
equation as functions of Re andε. For the hyperparameters of GPSI, the default values mentioned
90
Table 4.1: GPSI-identified PDEs and their ε
res
from the training and validation data depending
on different Re andε; ˆ u is the fluid longitudinal velocity estimated from the identified model; all
coefficients are rounded to three decimal places
Case Re ε (%) Identified PDE ε
res
(%)
Training Validation
Ref. - -
∂u
∂t
=− u
∂u
∂x
+
1
Re
∂
2
u
∂x
2
- -
1 20 1
∂ ˆ u
∂t
=− 1.000 ˆ u
∂ ˆ u
∂x
+ 0.050
∂
2
ˆ u
∂x
2
1.0 0.0
2 20 5
∂ ˆ u
∂t
=− 0.999 ˆ u
∂ ˆ u
∂x
+ 0.050
∂
2
ˆ u
∂x
2
5.0 0.2
3 20 10
∂ ˆ u
∂t
=− 1.000 ˆ u
∂ ˆ u
∂x
+ 0.049
∂
2
ˆ u
∂x
2
10.0 0.3
4 100 1
∂ ˆ u
∂t
=− 1.000 ˆ u
∂ ˆ u
∂x
+ 0.010
∂
2
ˆ u
∂x
2
1.0 0.0
5 100 5
∂ ˆ u
∂t
=− 1.000 ˆ u
∂ ˆ u
∂x
+ 0.010
∂
2
ˆ u
∂x
2
5.0 0.0
6 100 10
∂ ˆ u
∂t
=− 1.000 ˆ u
∂ ˆ u
∂x
+ 0.010
∂
2
ˆ u
∂x
2
10.0 0.1
7 500 1
∂ ˆ u
∂t
=− 1.000 ˆ u
∂ ˆ u
∂x
+ 0.002
∂
2
ˆ u
∂x
2
1.0 0.0
8 500 5
∂ ˆ u
∂t
=− 0.999 ˆ u
∂ ˆ u
∂x
+ 0.002
∂
2
ˆ u
∂x
2
5.1 0.2
9 500 10
∂ ˆ u
∂t
=− 1.006 ˆ u
∂ ˆ u
∂x
+ 0.002
∂
2
ˆ u
∂x
2
10.4 1.4
in Sections 4.3.2 and 4.3.3 are used. Under the consideration that GPSI performs a stochastic
optimization, each identification case has 10 trials with different random seeds, and one PDE is
selected as the system model among the PDEs in all the trials in the consideration ofε
res
with the
validation data (see Eq. 4.2) and model complexity.
The identified PDEs and their ε
res
with training and validation data (namely training ε
res
and
validation ε
res
, respectively, hereafter) are presented in Table 4.1 for different Re and ε. In all
the identification cases, the true form of the Burgers’ equation is discovered with low training and
validationε
res
. As an example, one identification case with Re = 100 andε = 5% is shown in Fig.
4.4. The loss function values and their composition are exhibited over generations in Fig. 4.4a. The
values of the validationε
res
(Eq. 4.3) are plotted together as the reference. On the bottom side of
Fig. 4.4, model response surfaces (Fig. 4.4b-e) and error surfaces (see Fig. 4.4f-i) are displayed for
the reference generations, such as the first generation and each session end. Over the generation,
the validationε
res
decreases although it has some fluctuations due to the stochastic sampling. While
the model response surface converges to the reference model responses, the biased error surface
becomes reduced and homogeneous in both space and time.
91
Figure 4.4: GPSI simulation for the Burgers’ equation with Re= 100 and ε = 5%; (a) the loss
function composition and the validation ε
res
as a function of the generation; (b)-(e) the PDE re-
sponse surfaces at the initial generation (i.e., (b)) and eachL phase end (i.e., (c)-(e)); and (f)-(i)
the corresponding error surfaces; note that, for clarity, the amplitude levels displayed in Fig. 4f-i
are not the same.
92
(a)
0.0 0.2 0.4 0.6 0.8 1.0
x [-]
0.0
0.3
0.6
0.9
1.2
1.5
u [-]
Identified model response
Training data
at t = 0, 0.25, 0.5, 0.75, 1
(b)
0.0 0.2 0.4 0.6 0.8 1.0
x [-]
0.0
0.3
0.6
0.9
1.2
1.5
u [-]
Identified model response
Validation data
at t = 0, 0.25, 0.5, 0.75, 1
Figure 4.5: The identified PDE responses with the (a) training and (b) validation data from one
identification case for the Burgers’ equation with Re = 100 andε = 5% (see Case 5 in Table 1).
Table 1 reveals that the example case discovers the true Burgers’ equation (compare Case 5
with the reference case in Table 1). The estimated Reynolds number,
c
Re, extracted from the model
(i.e., 100) is the same as the true ratio of inertial forces to viscous forces within the fluid (i.e.,
100). Since the identified PDE has the same structure as the reference model with the accurate
coefficients, its responses are well matched with the training and validation data as exhibited in
Fig. 4.5. Its training and validationε
res
are 5.0 and 0.0%, respectively.
In addition to predicting the behavior of the target system (i.e., the validation test), the identified
PDE allows users to enhance understanding of the system dynamics. The example model (see Case
5 in Table 1) is further analyzed by superposing the contributions of its terms (i.e., 1.000 ˆ u
∂ ˆ u
∂x
and
0.010
∂
2
ˆ u
∂x
2
) and their respective responses in Fig. 4.6. On the left side (Fig. 4.6a), the values of each
term in the identified model are presented over the space domain when the validation excitation
(i.e., the black line; see Eq. 4.7) is given. Fig. 4.6b shows the response simulated with each
term in the identified model separately under the same excitation for the short period of time (i.e.,
0 < t < 0.15), which reveals the dynamics of each term respectively. The advection term (i.e.,
1.000 ˆ u
∂ ˆ u
∂x
) has a major contribution to the temporal change of ˆ u (see the blue and red lines in Fig.
4.6a), and it moves the pulse forward with the ratio of ˆ u, leading to the nonlinear advection (see
the blue line in Fig. 4.6b). The diffusion term (i.e., 0.010
∂
2
ˆ u
∂x
2
) describes the smooth dispersion
93
(a)
0.0 0.2 0.4 0.6 0.8 1.0
x [-]
5.0
2.5
0.0
2.5
5.0
Value [-]
u
0
u/ t
Advection term
Diffusion term
(b)
0.0 0.2 0.4 0.6 0.8 1.0
x [-]
0.0
0.2
0.4
0.6
0.8
1.0
u [-]
Figure 4.6: The contribution of each term in the identified PDE (see Case 5 in Table 1) from
one identification case for the Burgers’ equation using Re = 100 and ε = 5%; the advection and
diffusion terms are 1.000 ˆ u
∂ ˆ u
∂x
and 0.010
∂
2
ˆ u
∂x
2
, respectively; (a) the values of the components in the
model under the given validation excitation u
0
; (b) the responses from the components for a short
time domain (0< t< 0.15); the incremental shades of the colors represent the responses at different
time moments (t= 0.03,0.06,...,0.15).
as shown in the green lines in Fig. 4.6a and b. Its contribution is less than the advection term,
resulting in the final model response (see the red line in Fig. 4.6b). As a result, the inspection
of the identified PDE reveals that the system dynamics consists of nonlinear advection and linear
dispersion.
Although all the identification cases discover the reference model within 10 trials, each trial in
one identification case shows a different result in identifying PDEs. In order to evaluate the repro-
ducibility of GPSI, the average performance of 10 trials is investigated for each identification case
by calculating the probability of discovering the reference PDE (i.e., the Burgers’ equation), the
average validationε
res
, and the average convergence time. The average performance as functions
of Re andε are summarized via the contours depicted in Fig. 4.7. On the left side (Fig. 4.7a), the
contour exhibits the probability of identifying the reference PDE, namely the model probability.
This contour represents the diagram of the identified PDE, presenting its structure together. The
model probability stays similar as to ε and decreases as the higher Re is used. When Re= 500,
the chance to lose the dispersion term in the identified PDEs becomes high, leading to the models
94
(a)
∂ˆ u
∂t
=− ˆ u
∂ˆ u
∂x
+
1
Re
∂
2
ˆ u
∂x
2
∂ˆ u
∂t
=− ˆ u
or
∂ˆ u
∂t
=− ∂ˆ u
∂x
20 100 500
Re [-]
1
5
10
[%]
0.4
0.0
0.2
0.4
0.6
0.8
1.0
Model probability [-]
(b)
20 100 500
Re [-]
1
5
10
[%]
0.20
0.36
0.00
0.09
0.18
0.27
0.36
0.45
Validation
res.
[-]
(c)
20 100 500
Re [-]
1
5
10
[%]
3.0
4.0
2.0
2.7
3.4
4.1
4.8
5.5
Convergence time [h]
Figure 4.7: The average performance of GPSI in identifying the reference PDE (i.e, the Burgers’
equation) depending on Re andε; (a) the phase diagram of the identified PDEs on the probability
of discovering the reference model; (b) the average validationε
res
; and (c) the average convergence
time.
having linear advection or degradation. On the right side (see Fig. 4.7b and c, respectively), the
average validationε
res
and the average convergence time are displayed. They are as well affected
more by Re than by ε. As Re is higher, the validation ε
res
and the convergence time increase.
Consequently, the performance of GPSI is robust to the data noise (i.e., ε) and shows a good re-
producibility as long as the difference of the contributions that the terms in the model make is less
than 500 times.
95
4.4.2 Advection-dispersion equation
The advection-dispersion equation (ADE), which shares the same mathematical form as the convection-
diffusion equation, is used as the second example of multi-physics system models. The ADE de-
scribes solute transport with linear advection and dispersion terms. In an one-dimensional space,
the dimensionless form of the ADE is written as:
∂c
∂t
=− ∂c
∂x
+
1
Pe
∂
2
c
∂x
2
, (4.8)
where c is the solute concentration at dimensionless location x and time t and Pe is the P´ eclet num-
ber which represents the ratio of the advection to the dispersion time scales. Pe dictates whether
the system dynamics is advection-dominated or dispersion-dominated.
Three different values of Pe are selected (i.e., 20, 100, and 500). The initial and boundary
conditions, i.e., the excitation (see Eq. 4.7) and the periodic boundary conditions, and the GPSI
setting are the same as those for the Burgers’ equation discussed above (refer to Section 4.4.1).
Given the similarity in the set-up, GPSI is able to address a class of PDE problems without tailoring
many hyperparameters for each target system (that demonstrating the robustness of the procedure
under discussion). The training and validation data are generated in a similar manner. Therefore,
nine identification cases, from 3 Pe values and 3 noise levels (i.e., ε = 0, 5, and 10%), are used in
total.
The identified PDEs and their training and validation ε
res
are tabulated as functions of Pe and
ε in Table 4.2. The results of GPSI for the ADE are very similar to the Burgers equation cases.
In all the identification cases, the PDE responses converge to the training data, and their error
surfaces are reduced and homogeneous over generations in the same way as the Burgers equation
cases (refer to Fig. 4.4). At the end, the true form of the ADE is discovered, regardless of Pe
andε with sightly different coefficients, presenting low training and validation ε
res
. For instance,
one identification case with Pe = 100 andε = 5% shows that the identified model’s responses are
in agreement with the training and validation data (Fig. 4.8). Its training and validation ε
res
are
96
Table 4.2: GPSI-identified PDEs and their ε
res
from the training and validation data depending on
different Pe andε; ˆ c is the solute concentration estimated from the identified model; all coefficients
are rounded to three decimal places
Case Pe ε (%) Identified PDE ε
res
(%)
Training Validation
Ref. - -
∂c
∂t
=− ∂c
∂x
+
1
Pe
∂
2
c
∂x
2
- -
1 20 1
∂ ˆ c
∂t
=− 1.000
∂ ˆ c
∂x
+ 0.050
∂
2
ˆ c
∂x
2
1.0 0.0
2 20 5
∂ ˆ c
∂t
=− 1.000
∂ ˆ c
∂x
+ 0.050
∂
2
ˆ c
∂x
2
5.0 0.2
3 20 10
∂ ˆ c
∂t
=− 1.000
∂ ˆ c
∂x
+ 0.049
∂
2
ˆ c
∂x
2
10.0 0.0
4 100 1
∂ ˆ c
∂t
=− 1.000
∂ ˆ c
∂x
+ 0.010
∂
2
ˆ c
∂x
2
1.0 0.0
5 100 5
∂ ˆ c
∂t
=− 1.000
∂ ˆ c
∂x
+ 0.010
∂
2
ˆ c
∂x
2
5.0 0.0
6 100 10
∂ ˆ c
∂t
=− 1.000
∂ ˆ c
∂x
+ 0.010
∂
2
ˆ c
∂x
2
10.0 0.0
7 500 1
∂ ˆ c
∂t
=− 1.000
∂ ˆ c
∂x
+ 0.002
∂
2
ˆ c
∂x
2
1.0 0.0
8 500 5
∂ ˆ c
∂t
=− 1.000
∂ ˆ c
∂x
+ 0.002
∂
2
ˆ c
∂x
2
5.1 0.1
9 500 10
∂ ˆ c
∂t
=− 1.001
∂ ˆ c
∂x
+ 0.002
∂
2
ˆ c
∂x
2
10.2 0.1
5.0 and 0.0%, respectively. The estimated P´ eclet number (i.e., 100) extracted from the model is
identical to the one in the reference system (i.e., 100).
Close inspection of the PDE, identified with Pe = 100 and ε = 5% (see Case 5 in Table 2),
reveals that there are both linear advection and dispersion in the system dynamics. The contribution
of each term is compared by superposing all the terms (i.e., 1.000
∂ ˆ c
∂x
and 0.010
∂
2
ˆ c
∂x
2
) in Fig. 4.9. On
the left side (see Fig. 4.9a), given the validation excitation (i.e., the black line), the contribution of
the advection term (i.e., the blue line) is larger than the one of the dispersion term (i.e., the green
line) to the temporal evolution of ˆ c (i.e., the red line). On the right side (see Fig. 4.9b), the short-
term (0 < t < 0.15) responses simulated from the advection and dispersion terms, respectively,
are presented. As a result, the final response of the identified PDE (i.e., the red line) exhibits the
advection-dominated transport with relatively small dispersion, which corresponds with the system
dynamics.
The performance of GPSI in discovering the governing physical equation is summarized through
the contour plots in Fig. 4.10. The probability of discovering the reference PDE, the average vali-
dationε
res
, and the average convergence time are displayed as to different Pe andε in Figs. 4.10a,
97
(a)
0.0 0.2 0.4 0.6 0.8 1.0
x [-]
0.0
0.3
0.6
0.9
1.2
1.5
c [-]
Identified model response
Training data
at t = 0, 0.25, 0.5, 0.75, 1
(b)
0.0 0.2 0.4 0.6 0.8 1.0
x [-]
0.0
0.3
0.6
0.9
1.2
1.5
c [-]
Identified model response
Validation data
at t = 0, 0.25, 0.5, 0.75, 1
Figure 4.8: The identified PDE responses with the (a) training and (b) validation data from one
identification case for the ADE with Pe = 100 andε = 5% (see Case 5 in Table 2).
Figure 4.9: The contribution of each term in the identified PDE (see Case 5 in Table 2) from one
identification case for the ADE using Re = 100 and ε = 5%; the advection and dispersion terms
are 1.000
∂ ˆ c
∂x
and 0.010
∂
2
ˆ c
∂x
2
, respectively; (a) the values of the components in the model under the
given validation excitation c
0
; (b) the responses from the components for a short time domain (0<
t < 0.15); the incremental shades of the colors represent the responses at different time moments
(t= 0.03,0.06,...,0.15).
98
̂
= −
̂
+
1
Pe
!
̂
!
̂
=−
̂
Figure 4.10: The average performance of GPSI in identifying the reference PDE (i.e, the ADE)
as to Pe andε; (a) the phase diagram of the identified PDEs on the probability of discovering the
reference model; (b) the average validationε
res
; and (c) the average convergence time.
4.10b, and 4.10c, respectively. In most of the cases, the model probability remains high, and the
average validation ε
res
and the average convergence time stay low. When both Pe and ε increase
to 500 and 10%, respectively, the model probability decreases, and the average validationε
res
and
the average convergence time increase. Compared to the performance for the Burgers’ equation,
the overall performance is better in all the cases (see Fig. 4.10 in comparison with Fig. 4.7).
99
4.5 Discussion
4.5.1 GPSI for the class of PDE problems
The illustrative examples show that GPSI is an effective and robust stochastic method to discover
PDEs from a given data set. In the first two examples (Sections 4.4.1 and 4.4.2), there are high
chances that the identified PDEs are the same as the reference models (i.e., 45 and 98% for the
Burgers’ equation and the ADE, respectively) within the range of the noise levels (i.e., 1-10%)
and characteristic numbers (i.e., Re and Pe) considered. Thus, the identified models remain highly
accurate in predicting system responses in the validation test. Since all the results were obtained
without adjusting the hyperparameters of GPSI to each example, the approach is hereby shown to
be appropriate for the class of PDE identification problems under discussion.
Although GPSI is a model-free identification method, and its hyperparameters do not require
repetitive tuning for each new identification problem, there is a room to incorporate user’s prior
knowledge about a target system to adjust the tuning parameters. Since candidate PDEs are con-
structed on the basis of expression trees, it is important to select proper mathematical components
in the expression library as building blocks. This can be achieved by relying on prior knowledge of
simpler (yet similar) physical systems. For instance, some components such as Heaviside step and
absolute functions can be employed or removed in the library according to the prior domain exper-
tise. Since the size of the expression library determines the size of the candidate PDE space needed
to be explored, it is desirable that the expression library does not include too many components.
As the expression library is restricted to necessary components, it is more likely to identify the
system model with a short convergence time. The complexity penalty coefficient, λ
3
, determines
how much complexity users allow for PDEs. This is the most important hyperparameter. Whenλ
3
is too high, a limited number of PDEs would be tested, generating poor validation results. As λ
3
decreases, the candidate PDE space inflates rapidly, which is prone to overfitting and local minima
issues. Thus, it is recommended that users try a high value of λ
3
first, and later decrease it by
examining the results as the identification process is evolving.
100
4.5.2 Advantages and limitations
GPSI has many advantages in system identification. Most of all, an identified system PDE is likely
to exhibit a low validation error, even if validation data are “moderately” out of the range of the
training data. This is because there is a high chance that an identified PDE captures the current
system dynamics. In the cases of the Burgers’ equation and the ADE, GPSI identified the exact
linear and nonlinear terms with their accurate estimates for Re and Pe. The performance of GPSI
is robust as well. The likelihood of discovering the reference models (i.e., the Burgers’ equation
and the ADE) remains high up to 10% of the noise level in training data. Therefore, GPSI is a
powerful system identification method to capture the optimum PDE to model the observed data.
In efforts to capture the dynamics of multi-physics systems, many identification approaches
have been developed and investigated. For instance, Gaussian process approaches [25, 99] and
neural network approaches [26, 29–31, 97] show that they successfully built their models yielding
low errors from training and test data by capturing system dynamics. However, these approaches
have limitation in offering an easily interpretable mathematical model without the prior knowledge
of model structure. In this respect, the interpretability of an identified PDE through GPSI is a
considerable advantage, which enhances the understanding of the system dynamics by providing
its various plausible descriptions. This could be especially useful in the identification cases where
the data measured from an unknown system are the only available resource, i.e., users have no
reference models.
In terms of discovering an interpretable model, e.g., a PDE, that governs system dynamics,
there is considerable progress being made in sparse approximation methods [22, 23] and neural
network methods [27, 28, 32]. However, to capture the complex dynamics of multi-physics sys-
tems, the strategy of GPSI offers advantages compared to these methods. Firstly, by constructing
candidate models on the basis of random expression trees and mixing them with the stochastic
evolutionary processes, infinite combinations of mathematical expressions are able to appear and
be tested. Next, the optimization process is fundamentally discrete because the candidate models
basically keep being “newly” generated through the evolutionary processes. This discrete process
101
is likely to allow more randomness in the training process and more properly addressing discontin-
uous basis functions, such as Heaviside step function, than gradient-based optimization methods.
For these reasons, the strategy of GPSI seems to be considered more suitable for optimization in
an infinite function space.
A distinctive advantage of GPSI is that it remains applicable in identifying a PDE with limited
data, although it is fundamentally a data-driven identification method. This advantage is especially
significant when measurements are limited. The excitation data and the system description are used
to pose initial and boundary conditions. These conditions are the only requirement to compute
model responses, some of which are compared with limited measurements. Thus, compared to
other data-driven methods, GPSI is an attractive identification method when dealing with limited
data.
It is commonly known that GP-based methods have two major drawbacks. Firstly, a large
amount of hyperparameters need to be tuned for each problem. GPSI resolves this issue by improv-
ing its reproducibility, which persists in the changes of the hyperparameters. This reproducibility is
attributed to the efficient handling of expression trees during the evolutionary processes with bloat
control techniques. Consequently, GPSI is able to address the class of PDE identification problems
with the default hyperparameters established in the previous study [132], and further refined in this
study. Secondly, the convergence time in a standard GP application can be excessively long to
be used. In order to overcome this issue, GPSI utilizes the effective multi-purpose loss function,
and employs stochastic sampling. Additionally, parallel computing is implemented to run GPSI
simulation for the fitness test. As a result, for the examples investigated in this work, GPSI was
performed with a relatively short convergence time, even though the basis function library expands
to include many functions (i.e., Heaviside step function, absolute value function, sign function,
exponential function, logarithmic function, error function, sine function and cosine function).
102
4.6 Conclusions
This paper presents a novel system identification approach (GPSI) that utilizes stochastic opti-
mization, in conjunction with available data sets, to discover the optimum PDE candidates that
matche the data. GPSI aims to discover the system PDEs by utilizing little prior knowledge about
a target system and its measurements (i.e., input/output data). Computational speedup is achieved
by incorporating a series of novel steps, such as: (1) a multi-purpose loss function and stochastic
sampling into the parallel fitness test as well as (2) bloat control techniques into the evolutionary
processes. The results reported in this work show that the algorithm is computationally feasible
and allows to identify both linear and nonlinear PDEs.
We illustrate the accuracy and robustness of the proposed approach in two canonical PDEs. Il-
lustrative examples demonstrate that GPSI successfully identified the nonlinear Burgers’ equation
and a linear advection-dispersion equation with significant data noise (i.e., up to 10%). The results
demonstrate that new technique is effective and robust to discover PDEs from data without the need
for the user to select a parametric phenomenological model. The framework provides a promis-
ing new approach for identifying governing equations as well as physical laws in parameterized
spatiotemporal systems.
103
Chapter 5
Deriving the Governing Equations for Transport in Berea
Sandstone Cores by Stochastic Optimization
5.1 Introduction
A governing equation for transport in heterogeneous media is challenging to establish. This is be-
cause a variable flow field caused by the heterogeneity leads solute transport non-Fickian. A linear
advection dispersion equation (ADE), assuming Fickian transport, results in discrepancy between
the model estimation and field observation. The solute transport shows a lag when compared to
the prediction of the linear ADE, which leads to a tailing effect in a solute breakthrough curve.
Many different types of governing equations (e.g., continuous time random walk and fractional
differential equations) are proposed theoretically to capture this anomalous tailing effect observed
in the solute breakthrough data [135–137]. They are on the basis of assumptions in which transport
has a long-term memory effect due to the heterogeneity.
In contrast to the theoretical derivation of governing equations, a stochastic optimization method
proposed by [132] is able to “discover” relevant partial differential equations (PDEs) from the data.
Aligned with recent trend of Artificial Intelligence and Machine Learning [22, 26, 31], this method
104
utilizes prior knowledge about system physics and observed data together to enhance model ac-
curacy and reliability. In addition, the identified PDEs, as the description of spatial and temporal
dynamics of an attribute of interest, reveal the underlying physics of a target system.
In this study, we discover governing equations, including integer-order and fractional PDEs,
from the experimental data. The data set is a concentration breakthrough curve retrieved from the
column experiment with Berea sandstone core, performed by Scheidegger [138]. The identified
PDEs are compared with the models suggested by previous studies to improve the understanding
of the underlying physics.
5.2 Methodology
The stochastic optimization method, dubbed Genetic Programming System Identification (GPSI)
[132], is employed to discover partial differential equations (PDEs) from the experimental data. It
is a Python package that is available in the Github repository (https://github.com/Jinwoousc/GPSI).
This approach aims to optimize the structure and coefficients of PDEs to a given data set through
evolutionary processes under minimal prior knowledge about system physics.
In the workflow of the method, a key feature is to utilize the binary expression tree, consisting of
two-branch nodes and terminals having mathematical expressions, to represent the corresponding
PDE. The mathematical expression library is prepared in advance under user’s prior knowledge
about system physics. PDEs are then randomly generated by randomly initiating binary expression
trees and assigning mathematical expressions therein. Moreover, they are stochastically improved
by mixing each other (i.e., crossover) or adding a new random expression tree to an existing one
(i.e., mutation) according to their model fidelity. As such, a variety of PDEs having different
structure and coefficients are tested to explore the infinite PDE space.
105
When evaluating the model fidelity, a loss function is used to consider model accuracy and
simplicity as follows:
L =σ
2
+ np, (5.1)
whereσ
2
is the model estimation error, n is the number of the nodes of the binary expression tree
converted from the PDE, and p is the complexity penalty coefficient that is tested with a range of
values (e.g., 1× 10
− 2
− 5× 10
− 4
). Here, σ
2
is normalized with squared sum of the given data,
when q(x,t) is a quantity of interest where x is the spatial position at time t:
σ
2
=
∑
i
∑
j
[q
p
(x
i
,t
j
)− q
d
(x
i
,t
j
)]
2
∑
i
∑
j
[q
d
(x
i
,t
j
)]
2
, (5.2)
where q
p
(x,t) is the model prediction of the quantity at the discrete position x
i
(i= 1,2,..,n
x
)
and time t
j
( j= 1,2,..,n
t
) and q
d
(x,t) is the given data. n
x
and n
t
are the discretization numbers
of space and time domains. Once the minimum loss function value of the PDEs is less than a
certain threshold (i.e., 0.001), the optimization process stops so that the several PDEs are selected
as governing equations, by imposing an additional physical constraint (e.g., the mass balance).
Further details regarding GPSI are referred to Im et al. [132]. The associated hyperparameters are
presented in Table 5.1.
In order to discover governing equations for the transport in Berea sandstone cores, the column
experimental study carried out by Scheidegger [138] is considered. He reported the experiment
of a one-dimensional miscible displacement process in Berea sandstone cores. In this experiment,
a core sample was initially fully saturated with one solution, and this solution was displaced by
constantly injecting another solution at the inlet of the column. During the experiment, the con-
centration of the displaced solution was measured at the end of the column. Scheidegger [138]
provides one concentration breakthrough curve over time and the corresponding regression model
(i.e., a linear ADE).
106
We formulate the physical problem in terms of dimensionless variables. Space, time, and
concentration are normalized by the length of the soil column, the time necessary to fill the pore
volume with the injection rate, and the initial concentration. The solute concentration along the
column length and time is denoted by c(x,t) as a quantity of interest. The experimental data set
contains solely the solute concentration at the outlet (x= 1) for the limited time range (0.7< t <
2.1). Under this limitation of the data, the experimental description is additionally employed to
establish a numerical scheme for the PDEs to be evaluated. Firstly, the space and time domains are
defined with proper step sizes, i.e., x∈[0,1] with∆x= 1/128 and t∈[0,2.1] with∆t = 1× 10
− 4
.
Next, the numerical integration methods (e.g., Runge-Kutta 4
th
order method and the predictor-
corrector method suggested by [139]) are implemented for integer-order or fractional PDEs). Last,
the initial and boundary conditions are described as:
c(x,0)= 1 for 0< x< 1, (5.3)
c(0,t)= 0 for t > 0, (5.4)
∂c
∂x
x=1
= 0 for t > 0. (5.5)
At the initial state, the solute concentration was fully saturated (see Eq. 5.3). Once the experiment
started, the solute concentration at the inlet of the column remained at zero (see Eq. 5.4), given
that another solution was injected to displace. Eq. 5.5 enables the use of the resident solute
concentration at the outlet, c(1,t), as the flux-averaged concentration which represents the effluent
concentration [140]. After the numerical computation of a PDE, the estimated breakthrough curve
is compared to the experimental data to calculate estimation error. In the estimation error, one
physical constraint (Eq. 5.4) is additionally employed to postulate the physical constraint and
augment the experimental data.
The stochastic optimization method used in this Letter is a heuristic approach, where the prior
knowledge that is utilized in the optimization process is improved through the examination of
results. Thus, optimization sets were run in series while updating the prior knowledge. Most
107
importantly, the investigation of a long-term memory effect due to the heterogeneity of the Berea
sandstone cores is phased in the three simulation sets. The first set uses solely inter-order variables.
In the second set, the fractional time derivative is added to the variables. In the last set, the first 80%
of the experimental data set in time is used. Under the consideration of the stochastic approach,
each optimization set has 10 trials. The rest of the hyperparameters for the optimization method
assume their default values [132], as described in Table 5.1.
5.3 Results & Discussion
Several partial differential equations (PDEs) discovered from the experimental data are tabulated in
Table 5.2. One example case of the stochastic optimization process, discovering a nonlinear ADE,
is exhibited in Fig. 5.1 We highlight that all these models satisfy the mass balance constraint. Their
estimations of the breakthrough curve are superposed on the data in Fig. 5.2 Firstly, when integer-
order derivatives are solely used (i.e., the first optimization set), a linear advection dispersion
equation (ADE) is identified (see the linear ADE in Table 5.2). This result indicates that the
proposed methodology is capable of identifying the reference model adopted in the original column
study [138] even from the limited experimental data (i.e., one breakthrough curve at the end of the
column). Moreover, an extra nonlinear term is suggested to the linear ADE to further minimize
the estimation error, which leads to the nonlinear ADE (see the nonlinear ADE in Table 5.2). The
solute concentration predicted from this model (see the green line in Fig. 5.2) shows a better match
against the experimental data, when compared to the one from the linear ADE (see the red line in
Fig. 5.2). In particular, we highlight that the nonlinear ADE successfully captures the tailing effect
where the solute concentration decreases slowly at late times (not captured by the linear, classical,
ADE). This result shows the capabilities of the proposed method to capture anomalies associated
with the physical processes (in this case, non-Fickian behavior). It also demonstrates the potential
of the method to identify candidate closure models that are capable to reproduce the tailing effect
while respecting physical constraints (i.e., mass balance).
108
Table 5.1: Hyperparameter specification used in this Letter for the stochastic optimization via GPSI
Parameters Values
Candidate models
Population size 100
Equation form A fractional or integer-order PDE
Genetic Programming form A binary expression tree
Initialization method The ramped half-and-half method
Maximum tree level 5
Expression library
Components {operations, functions, variables, numbers}
Component probability [ 0.5, 0.1, 0.2, 0.2 ]
Operations Plus, minus, times, and division
Functions Heaviside step, exponential, logarithmic, and error functions
Variables c,
∂c
∂x
,
∂
2
c
∂x
2
,
∂c
∂t
, and
∂
α
c
∂t
α
whereα is the fractional order
Numbers Random rational number in the interval [0, 1) and
optimized later
Fitness test
Loss function Normalized prediction error and
the complexity penalty
Complexity penalty coefficient 1 × 10
− 2
,5× 10
− 3
,1× 10
− 3
, 5× 10
− 4
, and 1× 10
− 4
Fitness session None
Random sampling batch None
Stopping criterion The loss function value be less than 0.01, or reach at
5,000 generations
Evolutionary processes
Representation rate 10%
Selection method Tournament selection
Crossover rate 80%
Mutation rate 20%
Coefficient optimization
Optimization method Genetic Algorithm (GA)
Optimization frequency 50 generations
Population size 20
Initial population range [0.1c, 10c] where c is the initial coefficient vector
Representation rate 10%
Selection method Tournament selection
Crossover rate 80%
Mutation rate 20%
Stopping criterion The loss function value be less than 0.01, or reach at
30 generations
109
Final (649)
= −1.046
+0.005
+0.052erf 7.681
Initial (generation=1)
= −
−+0.703
Intermediate 1 (50)
= −1.046
+0.005
Intermediate 2 (150)
= −1.046
+0.005
+0.048erf 13.461
Figure 5.1: The stochastic optimization process discovering the nonlinear ADE (see Table 2) from
Scheidegger’s soil column study [138]; (a) the loss function values and their composition as a
function of the generation; (b)-(e) the solute breakthrough curves estimated from the PDE models
at each reference generation as compared to the experimental data (i.e., (b)-(e)); and (f)-(i) the
corresponding squared error; c
p
is the concentration predicted from the PDEs and c
d
is the con-
centration observed from the column experiment; the best PDE among candidates is presented for
each reference generations.
110
Table 5.2: PDEs discovered from the column experimental data through the stochastic optimiza-
tion; c is the solute concentration at the location x and time t; all coefficients are rounded to three
decimal places
Equation Type Identified PDE σ
2
(%)
Linear ADE
∂c
∂t
=− 1.030
∂c
∂x
+ 0.005
∂
2
c
∂x
2
3.879
Nonlinear ADE
∂c
∂t
=− 1.046
∂c
∂x
+ 0.005
∂
2
c
∂x
2
+ 0.056erf(8.236
∂c
∂x
) 2.446
Fractional ADE
∂
0.988
c
∂t
0.988
=− 1.015
∂c
∂x
+ 0.003
∂
2
c
∂x
2
2.223
(a)
0.9 1.2 1.5 1.8 2.1
t
10
−3
10
−2
10
−1
10
0
c (1,t)
Linear ADE
Nonlinear ADE
Fractional ADE
Scheidegger, 1959
(b)
0.9 1.2 1.5 1.8 2.1
t
0.000
0.001
0.002
0.003
0.004
(c
p
− c
d
)
2
Linear ADE
Nonlinear ADE
Fractional ADE
Figure 5.2: The solute concentrations (i.e., (a)) and the squared errors (i.e., (b)) at the outlet of
the column from the PDEs with the experimental data reported in Scheidegger [138]; c
p
is the
concentration predicted from the PDEs and c
d
is the concentration observed from the column
experiment; See Table 5.2 for the PDEs’ details.
111
The nonlinear ADE (i.e., the nonlinear ADE in Table 5.2) is analyzed by interpreting the extra
nonlinear term. Since the extra term contains the first spatial derivative of the concentration within
the error function, it represents a slow nonlinear advection (compared the extra term with the lin-
ear advective term in the nonlinear ADE in Table 5.2). When this extra term is combined with the
linear advective term, its impact on the model response can be interpreted as the small amount of
advection delay. This interpretation is matched with the previous studies that attribute the tailing
effect in the solute breakthrough curve to small dead pores that retains the solute [141, 142]. The
extra term is further analyzed by being rewritten with the definition of the error function. This
attempt renders the nonlinear ADE into the integro-differential equation that usually involves in
a memory function. This further analysis indicates that the experimental data call for the math-
ematical expressions to capture the memory effect due to the heterogeneity in Berea sandstone
cores.
Under the prior knowledge updated from the results of the first optimization set, the fractional
ADE (i.e., the fractional ADE in Table 5.2) is discovered by adding the fractional time derivative
into the expression library. Its estimation error is even lower than the one evaluated from the
nonlinear ADE. This accuracy comes from capturing the tailing effect at the moment when it
begins (compare green and blue dots for 0.9 < t < 1.2 in Fig. 5.2(b)). The best match of the
fractional ADE with the data confirms that the Berea sandstone core used in the experiment has
the microscopically heterogeneous, leading to the memory effect to the transport therein. However,
the fractional ADE seems having limitation describing a sudden drop of the concentration observed
in the data (see the last data point in Fig. 5.2(a) at t≈ 2.1). This discrepancy indicates that the
interpretation of the nonlinear ADE is still valid in terms of what mechanism in heterogeneous
Berea sandstone cores causing the memory effect.
112
5.4 Conclusion
Governing equations for the transport in Berea sandstone cores, e.g., integer-order and fractional
PDEs, are discovered from the column experimental data. The stochastic optimization method uti-
lizes the prior knowledge about system physics and the limited observed data in the identification
process. As the prior knowledge is improved in series through the optimization sets, the expres-
sion library is expanded to capture the memory effect due to microscale heterogeneity of the Berea
sandstone core. Finally, the identified fractional PDE describes the non-Fickian transport embed-
ded in the data set, leading to only 2.2% estimation error, while the linear ADE presents 3.9%. The
proposed method is an effective and robust gray-box modeling tool to discover PDE models from
data, for a broad category of problems encountered in various science and engineering domains,
without the need for the user to select a parametric phenomenological model.
113
Chapter 6
Conclusion and ongoing work
Novel computational approaches are proposed to overcome the uncertainties associated with mod-
eling of multi-scale hydrologic systems. Firstly, a numerical Monte Carlo approach is established
to address the uncertainty in hydraulic conductivity due to natural heterogeneity of the subsurface
systems. Through this approach, a potential threat to human health against Bisphenol A (BPA)
in groundwater is investigated. Computational results indicate that the propagation of the uncer-
tainty in the hydraulic conductivity into the aquifer resilience loss could be opposite depending on
which health response models (e.g., systemic vs. endocrine-related health effects) are employed.
These results emphasize the consideration of endocrine-related health effects, associated with site-
specific conditions (e.g., having vulnerable subpopulations or not), in risk management against
BPA in groundwater. This stochastic approach is continued to be improved for the investigation of
the interplay between natural heterogeneity and biogeochemical reactions in contaminant transport
and to be developed as an accessible toolbox for general users.
Next, a hybrid physics-based and data-driven modeling approach is developed to address the
model uncertainty for multi-scale hydrologic systems. The hybrid approach, dubbed GPSI, is
established on the basis of stochastic optimization. Through the proposed approach, many ordi-
nary and partial differential equations (ODEs/PDEs) are discovered from minimal prior knowledge
about system physics and given data sets across disciplines (e.g., structural mechanics, environ-
mental science and engineering, and chemical engineering). The demonstration cases indicate that
114
the proposed method is a powerful and versatile hybrid modeling tool that is capable of provid-
ing insights about the underlying physics of a target system. This hybrid approach is continued
to be expanded to capture more complex nonlinearity of multi-physics systems and to be applied
into various data sets to reveal the underlying physics. In conclusion, the novel computational
approaches, proposed in this study, and their continuations contribute to establishing accurate and
robust modeling of multi-scale hydrologic systems under uncertainty and to providing reliable
environmental evaluation for the decision-making process in water resources management.
115
References
1. Freeze, R. A. & Cherry, J. A. Groundwater (Englewood Cliffs, NJ, Prentice-Hall, 1979).
2. Rubin, Y . Applied Stochastic Hydrogeology (Oxford University Press, 2003).
3. Sudicky, E. A., Illman, W. A., Goltz, I. K., Adams, J. J. & McLaren, R. G. Heterogeneity in
hydraulic conductivity and its role on the macroscale transport of a solute plume: From mea-
surements to a practical application of stochastic flow and transport theory. Water Resources
Research 46. doi:https://doi.org/10.1029/2008WR007558 (2010).
4. Fiori, A., Bellin, A., Cvetkovic, V ., de Barros, F. P. J. & Dagan, G. Stochastic modeling
of solute transport in aquifers: From heterogeneity characterization to risk analysis. Water
Resources Research 51, 6622–6648 (2015).
5. De Barros, F. P. J. & Rubin, Y . A risk-driven approach for subsurface site characterization.
Water Resources Research 44. doi:https://doi.org/10.1029/2007WR006081 (2008).
6. Cirpka, O. A., Rolle, M., Chiogna, G., De Barros, F. P. J. & Nowak, W. Stochastic eval-
uation of mixing-controlled steady-state plume lengths in two-dimensional heterogeneous
domains. Journal of Contaminant Hydrology 138, 22–39. doi:https://doi.org/10.
1016/j.jconhyd.2012.05.007 (2012).
7. Maxwell, R. M., Kastenberg, W. E. & Rubin, Y . A methodology to integrate site char-
acterization information into groundwater-driven health risk assessment. Water Resources
Research 35, 2841–2855. doi:https://doi.org/10.1029/1999WR900103 (1999).
8. Bolster, D., Barahona, M., Dentz, M., Fernandez-Garcia, D., S´ anchez-Vila, X., Trinchero,
P., et al. Probabilistic risk analysis of groundwater remediation strategies. Water Resources
Research 45 (2009).
9. De Barros, F. P. J., Rubin, Y . & Maxwell, R. M. The concept of comparative information
yield curves and its application to risk-based site characterization. Water Resources Re-
search 45. doi:https://doi.org/10.1029/2008WR007324 (2009).
10. De Barros, F. P. J., Bolster, D., S´ anchez-Vila, X. & Nowak, W. A divide and conquer ap-
proach to cope with uncertainty, human health risk, and decision making in contaminant hy-
drology. Water Resources Research 47. doi:https://doi.org/10.1029/2010WR009954
(2011).
11. De Barros, F. P. J., Ezzedine, S. & Rubin, Y . Impact of hydrogeological data on measures
of uncertainty, site characterization and environmental performance metrics. Advances in
Water Resources 36, 51–63. doi:https://doi.org/10.1016/j.advwatres.2011.05.
004 (2012).
12. Jurado, A., V` azquez-Su˜ n´ e, E., Carrera, J., de Alda, M. L., Pujades, E. & Barcel´ o, D. Emerg-
ing organic contaminants in groundwater in Spain: a review of sources, recent occurrence
and fate in a European context. Science of the Total Environment 440, 82–94 (2012).
116
13. Lapworth, D. J., Baran, N., Stuart, M. E. & Ward, R. S. Emerging organic contaminants in
groundwater: a review of sources, fate and occurrence. Environmental pollution 163, 287–
303. doi:https://doi.org/10.1016/j.envpol.2011.12.034 (2012).
14. Vandenberg, L., Colborn, T., Hayes, T. B., Heindel, J. J., Jacobs Jr, D. R., Lee, D.-H., et al.
Hormones and endocrine-disrupting chemicals: low-dose effects and nonmonotonic dose
responses. Endocrine Reviews 33, 378–455. doi:https://doi.org/10.1210/er.2011-
1050 (2012).
15. Manamsa, K., Crane, E., Stuart, M., Talbot, J., Lapworth, D. & Hart, A. A national-scale
assessment of micro-organic contaminants in groundwater of England and Wales. Science
of the Total Environment 568, 712–726 (2016).
16. V om Saal, F. & Hughes, C. An extensive new literature concerning low-dose effects of
bisphenol A shows the need for a new risk assessment. Environmental Health Perspectives
113, 926. doi:https://doi.org/10.1289/ehp.7713 (2005).
17. Vandenberg, L., Ehrlich, S., Belcher, S., Ben-Jonathan, N., Dolinoy, D., Hugo, E., et al.
Low dose effects of bisphenol A: An integrated review of in vitro, laboratory animal, and
epidemiology studies. Endocrine Disruptors 1, e26490. doi:https://doi.org/10.4161/
endo.26490 (2013).
18. Vandenberg, L. Non-monotonic dose responses in studies of endocrine disrupting chemi-
cals: bisphenol A as a case study. Dose Response 12, Dose–Response. doi:https://doi.
org/10.2203/dose-response.13-020.Vandenberg (2014).
19. Rochester, J. Bisphenol A and human health: a review of the literature. Reproductive Tox-
icology 42, 132–155. doi:https://doi.org/10.1016/j.reprotox.2013.08.008
(2013).
20. Quaranta, G., Lacarbonara, W. & Masri, S. F. A review on computational intelligence for
identification of nonlinear dynamical systems. Nonlinear Dynamics, 1–53 (2020).
21. No¨ el, J.-P., Esfahani, A. F., Kerschen, G. & Schoukens, J. A nonlinear state-space approach
to hysteresis identification. Mechanical Systems and Signal Processing 84, 171–184 (2017).
22. Rudy, S. H., Brunton, S. L., Proctor, J. L. & Kutz, J. N. Data-driven discovery of partial
differential equations. Science Advances 3, e1602614 (2017).
23. Rudy, S., Alla, A., Brunton, S. L. & Kutz, J. N. Data-driven identification of parametric
partial differential equations. SIAM Journal on Applied Dynamical Systems 18, 643–660
(2019).
24. Lai, Z. & Nagarajaiah, S. Sparse structural system identification method for nonlinear dy-
namic systems with hysteresis/inelastic behavior. Mechanical Systems and Signal Process-
ing 117, 813–842 (2019).
25. Raissi, M. & Karniadakis, G. E. Hidden physics models: Machine learning of nonlinear
partial differential equations. Journal of Computational Physics 357, 125–141 (2018).
26. Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics-informed neural networks: A deep
learning framework for solving forward and inverse problems involving nonlinear partial
differential equations. Journal of Computational Physics 378, 686–707 (2019).
27. Sahoo, S. S., Lampert, C. H. & Martius, G. Learning equations for extrapolation and control.
arXiv preprint arXiv:1806.07259 (2018).
28. Long, Z., Lu, Y . & Dong, B. PDE-Net 2.0: Learning PDEs from data with a numeric-
symbolic hybrid deep network. Journal of Computational Physics 399, 108925 (2019).
117
29. Dal Santo, N., Deparis, S. & Pegolotti, L. Data driven approximation of parametrized PDEs
by reduced basis and neural networks. Journal of Computational Physics 416, 109550
(2020).
30. Jagtap, A. D., Kharazmi, E. & Karniadakis, G. E. Conservative physics-informed neural
networks on discrete domains for conservation laws: Applications to forward and inverse
problems. Computer Methods in Applied Mechanics and Engineering 365, 113028 (2020).
31. Tartakovsky, A. M., Marrero, C. O., Perdikaris, P., Tartakovsky, G. D. & Barajas-Solano,
D. Physics-Informed Deep Neural Networks for Learning Parameters and Constitutive Re-
lationships in Subsurface Flow Problems. Water Resources Research 56, e2019WR026731
(2020).
32. Cranmer, M., Sanchez-Gonzalez, A., Battaglia, P., Xu, R., Cranmer, K., Spergel, D., et
al. Discovering symbolic models from deep learning with inductive biases. arXiv preprint
arXiv:2006.11287 (2020).
33. Mattheij, R. M., Rienstra, S. W. & Boonkkamp, J. T. T. Partial differential equations: mod-
eling, analysis, computation (SIAM, 2005).
34. USEPA. 2017 Potable Reuse Compendium. EPA/810/R-17/002 U.S. Environmental Protec-
tion Agency (Washington, DC, 2017).
35. Rodak, C., Silliman, S. E. & Bolster, D. Time-Dependent Health Risk from Contaminated
Groundwater Including Use of Reliability, Resilience, and Vulnerability as Measures. Jour-
nal of the American Water Resources Association 50, 14–28. doi:https://doi.org/10.
1111/jawr.12103 (2014).
36. S´ anchez-Vila, X., V` azquez-Su˜ n´ e, E., Rodr´ ıguez-Escales, P., Jurado, A., Folch, A., Carles-
Brangar´ ı, A., et al. in Emerging Contaminants in River Ecosystems 47–75 (Springer, Cham,
2015). doi:https://doi.org/10.1007/698\_2015\_5010.
37. USEPA. Bisphenol A Action Plan U.S. Environmental Protection Agency (Washington, DC,
2010).
38. Shelby, M. D. NTP-CERHR monograph on the potential human reproductive and develop-
mental effects of bisphenol A The National Toxicology Program (2008).
39. Beronius, A., Rud´ en, C., H˚ akansson, H. & Hanberg, A. Risk to all or none?: A compara-
tive analysis of controversies in the health risk assessment of Bisphenol A. Reproductive
Toxicology 29, 132–146. doi:https://doi.org/10.1016/j.reprotox.2009.11.007
(2010).
40. Andriˇ cevi´ c, R. & Cvetkovi´ c, V . Evaluation of risk from contaminants migrating by ground-
water. Water Resources Research 32, 611–621. doi:https://doi.org/10.1029/95WR03530
(1996).
41. Maxwell, R. M. & Kastenberg, W. E. Stochastic environmental risk analysis: an integrated
methodology for predicting cancer risk from contaminated groundwater. Stochastic Envi-
ronmental Research and Risk Assessment 13, 27–47. doi:https://doi.org/10.1007/
s004770050030 (1999).
42. Siirila, E. R., Navarre-Sitchler, A. K., Maxwell, R. M. & McCray, J. E. A quantitative
methodology to assess the risks to human health from CO2 leakage into groundwater. Ad-
vances in Water Resources 36, 146–164. doi:https://doi.org/10.1016/j.advwatres.
2010.11.005 (2012).
118
43. Rodr´ ıguez-Lado, L., Sun, G., Berg, M., Zhang, Q., Xue, H., Zheng, Q., et al. Groundwater
arsenic contamination throughout China. Science 341, 866–868. doi:https://doi.org/
10.1126/science.1237484 (2013).
44. Molin, S., Cvetkovic, V . & Stenstr¨ om, T. Microbial risk assessment in heterogeneous aquifers:
2. Infection risk sensitivity. Water Resources Research 46. doi:https://doi.org/10.
1029/2009WR008039 (2010).
45. Henri, C., Fern` andez-Garcia, D. & de Barros, F. P. J. Probabilistic human health risk assess-
ment of degradation-related chemical mixtures in heterogeneous aquifers: Risk statistics,
hot spots, and preferential channels. Water Resources Research 51, 4086–4108. doi:https:
//doi.org/10.1002/2014WR016717 (2015).
46. Libera, A., Henri, C. V . & De Barros, F. P. J. Hydraulic conductivity and porosity hetero-
geneity controls on environmental performance metrics: Implications in probabilistic risk
analysis. Advances in Water Resources 127, 1–12. doi:https://doi.org/10.1016/j.
advwatres.2019.03.002 (2019).
47. Fan, A. M., Chou, W.-C. & Lin, P. in. Second Edition, 765–795 (Academic Press, 2017).
doi:https://doi.org/10.1016/B978-0-12-804239-7.00041-X.
48. Tsai, W.-T. Human health risk on environmental exposure to Bisphenol-A: a review. Journal
of Environmental Science and Health Part C 24, 225–255. doi:https://doi.org/10.
1080/10590500600936482 (2006).
49. Stuart, M., Lapworth, D., Crane, E. & Hart, A. Review of risk from potential emerging
contaminants in UK groundwater. Science of the Total Environment 416, 1–21. doi:https:
//doi.org/10.1016/j.scitotenv.2011.11.072 (2012).
50. Fu, K.-Y ., Cheng, Y .-H., Chio, C.-P. & Liao, C.-M. Probabilistic integrated risk assessment
of human exposure risk to environmental bisphenol A pollution sources. Environmental
Science and Pollution Research 23, 19897–19910. doi:https://doi.org/10.1007/
s11356-016-7207-y (2016).
51. Holling, C. S. Resilience and stability of ecological systems. Annual Review of Ecology and
Systematics 4, 1–23. doi:https://doi.org/10.1146/annurev.es.04.110173.000245
(1973).
52. ASCE. Policy Statement 518 - Unified Definitions for Critical Infrastructure Resilience
2013.
53. Fowler, H. J., Kilsby, C. G. & O’Connell, P. E. Modeling the impacts of climatic change and
variability on the reliability, resilience, and vulnerability of a water resource system. Water
Resources Research 39. doi:https://doi.org/10.1029/2002WR001778 (2003).
54. Kjeldsen, T. R. & Rosbjerg, D. Choice of reliability, resilience and vulnerability estima-
tors for risk assessments of water resources systems. Hydrological Sciences Journal 49.
doi:https://doi.org/10.1623/hysj.49.5.755.55136 (2004).
55. Yagi, J. M., Neuhauser, E. F., Ripp, J. A., Mauro, D. M. & Madsen, E. L. Subsurface ecosys-
tem resilience: long-term attenuation of subsurface contaminants supports a dynamic micro-
bial community. The ISME Journal 4, 131–143. doi:https://doi.org/10.1038/ismej.
2009.101 (2010).
56. Siirila, E. R. & Maxwell, R. M. A new perspective on human health risk assessment: De-
velopment of a time dependent methodology and the effect of varying exposure durations.
Science of the Total Environment 431, 221–232. doi:https://doi.org/10.1016/j.
scitotenv.2012.05.030 (2012).
119
57. Jones, B. A., Shimell, J. J. & Watson, N. V . Pre-and postnatal bisphenol A treatment results
in persistent deficits in the sexual behavior of male rats, but not female rats, in adulthood.
Hormones and Behavior 59, 246–251. doi:https://doi.org/10.1016/j.yhbeh.2010.
12.006 (2011).
58. Beronius, A., Johansson, N., Rud´ en, C. & Hanberg, A. The influence of study design and
sex-differences on results from developmental neurotoxicity studies of bisphenol A, impli-
cations for toxicity testing. Toxicology 311, 13–26. doi:https://doi.org/10.1016/j.
tox.2013.02.012s (2013).
59. Kundakovic, M. in Epigenetics, the Environment, and Children’s Health Across Lifespans
67–92 (Springer, Cham, 2016). doi:https://doi.org/10.1007/978-3-319-25325-
1\_3.
60. USEPA. A Review of the Reference Dose and Reference Concentration Processes. EPA/630/P-
02/002F U.S. Environmental Protection Agency (Washington, DC, 2002).
61. Kreft, A. & Zuber, A. On the physical meaning of the dispersion equation and its solutions
for different initial and boundary conditions. Chemical Engineering Science 33, 1471–1480.
doi:https://doi.org/10.1016/0009-2509(78)85196-3 (1978).
62. USEPA. Guidelines for Exposure Assessment. EPA/600/Z-92/001 U.S. Environmental Pro-
tection Agency (Washington, DC, 1992).
63. USEPA. Exposure Factors Handbook 2011 Edition (Final Report). EPA/600/R-09/052F
U.S. Environmental Protection Agency (Washington, DC, 2011).
64. USEPA. Exposure Factors Handbook Chapter 3 (Update): Ingestion of Water and Other
Select Liquids. EPA/600/R-18/259F (U.S. EPA Office of Research and Development, Wash-
ington, DC, 2019).
65. Environment Canada. Screening Assessment for the Challenge Phenol, 4, 4’-(1-methylethylidene)
bis-(Bisphenol A). Chemical Abstracts Service Registry Number 80-05-7 Environment and
Climate Change Canada (2008).
66. Ying, G.-G., Toze, S., Hanna, J., Yu, X.-Y ., Dillon, P. J. & Kookana, R. S. Decay of endocrine-
disrupting chemicals in aerobic and anoxic groundwater. Water Research 42, 1133–1141.
doi:https://doi.org/10.1016/j.watres.2007.08.029 (2008).
67. Burnett, R. D. & Frind, E. O. Simulation of contaminant transport in three dimensions: 2.
Dimensionality effects. Water Resources Research 23, 695–705. doi:https://doi.org/
10.1029/WR023i004p00695 (1987).
68. Remy, N., Boucher, A. & Wu, J. Applied Geostatistics with SGeMS: A User’s Guide doi:https:
//doi.org/10.1017/CBO9781139150019 (Cambridge University Press, 2009).
69. Hess, K. M., Wolf, S. H. & Celia, M. A. Large-scale natural gradient tracer test in sand
and gravel, Cape Cod, Massachusetts: 3. Hydraulic conductivity variability and calculated
macrodispersivities. Water Resources Research 28, 2011–2027. doi:https://doi.org/
10.1029/92WR00668 (1992).
70. Harbaugh, A. W. MODFLOW-2005: the US Geological Survey modular ground-water model:
the ground-water flow process The United States Geological Survey (2005).
71. Zheng, C. & Wang, P. P. MT3DMS: a modular three-dimensional multispecies transport
model for simulation of advection, dispersion, and chemical reactions of contaminants in
groundwater systems; documentation and user’s guide Alabama University (1999).
120
72. Niesner, R. & Heintz, A. Diffusion coefficients of aromatics in aqueous solution. Journal
of Chemical & Engineering Data 45, 1121–1124. doi:https://doi.org/10.1021/
je0000569 (2000).
73. Gelhar, L. W., Welty, C. & Rehfeldt, K. R. A critical review of data on field-scale dispersion
in aquifers. Water Resources Research 28, 1955–1974. doi:https://doi.org/10.1029/
92WR00607 (1992).
74. Zakari, S., Liu, H., Tong, L., Wang, Y . & Liu, J. Transport of bisphenol-A in sandy aquifer
sediment: column experiment. Chemosphere 144, 1807–1814. doi:https://doi.org/10.
1016/j.chemosphere.2015.10.081 (2016).
75. De Barros, F. P. J. & Nowak, W. On the link between contaminant source release con-
ditions and plume prediction uncertainty. Journal of Contaminant Hydrology 116, 24–34.
doi:https://doi.org/10.1016/j.jconhyd.2010.05.004 (2010).
76. Gueting, N. & Englert, A. Hydraulic conditions at the source zone and their impact on plume
behavior. Hydrogeology Journal 21, 829–844. doi:https://doi.org/10.1007/s10040-
013-0962-7 (2013).
77. Henri, C. V ., Fern` andez-Garcia, D. & De Barros, F. P. Assessing the joint impact of DNAPL
source-zone behavior and degradation products on the probabilistic characterization of hu-
man health risk. Advances in Water Resources 88, 124–138. doi:https://doi.org/10.
1016/j.advwatres.2015.12.012 (2016).
78. Lagarde, F., Beausoleil, C., Belcher, S. M., Belzunces, L. P., Emond, C., Guerbet, M., et al.
Non-monotonic dose-response relationships and endocrine disruptors: a qualitative method
of assessment. Environmental Health 14, 13. doi:https://doi.org/10.1186/1476-
069X-14-13 (2015).
79. Tangirala, A. K. Principles of System Identification: Theory and Practice (CRC Press,
2018).
80. No¨ el, J.-P. & Kerschen, G. Nonlinear system identification in structural dynamics: 10 more
years of progress. Mechanical Systems and Signal Processing 83, 2–35 (2017).
81. Rodriguez-Iturbe, I., Porporato, A., Laio, F. & Ridolfi, L. Plants in water-controlled ecosys-
tems: active role in hydrologic processes and response to water stress: I. Scope and general
outline. Advances in Water Resources 24, 695–705 (2001).
82. Rubol, S., Manzoni, S., Bellin, A. & Porporato, A. Modeling soil moisture and oxygen ef-
fects on soil biogeochemical cycles including dissimilatory nitrate reduction to ammonium
(DNRA). Advances in Water Resources 62, 106–124 (2013).
83. Rodr´ ıguez-Iturbe, I. & Rinaldo, A. Fractal river basins: chance and self-organization (Cam-
bridge University Press, 2001).
84. Rizzo, C. B. & de Barros, F. P. J. Minimum hydraulic resistance and least resistance path in
heterogeneous porous media. Water Resources Research 53, 8596–8613 (2017).
85. Mari, L., Bertuzzo, E., Righetto, L., Casagrandi, R., Gatto, M., Rodriguez-Iturbe, I., et al.
Modelling cholera epidemics: the role of waterways, human mobility and sanitation. Jour-
nal of the Royal Society Interface 9, 376–388 (2012).
86. Li, R., Pei, S., Chen, B., Song, Y ., Zhang, T., Yang, W., et al. Substantial undocumented
infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2). Science
368, 489–493 (2020).
121
87. Libera, A., de Barros, F. P. J. & Guadagnini, A. Influence of pumping operational schedule
on solute concentrations at a well in randomly heterogeneous aquifers. Journal of Hydrology
546, 490–502 (2017).
88. Libera, A., de Barros, F. P. J., Faybishenko, B., Eddy-Dilek, C., Denham, M., Lipnikov,
K., et al. Climate change impact on residual contaminants under sustainable remediation.
Journal of Contaminant Hydrology 226, 103518 (2019).
89. Im, J., Rizzo, C. B. & de Barros, F. P. J. Resilience of groundwater systems in the presence
of Bisphenol A under uncertainty. Science of the Total Environment, 138363 (2020).
90. Ljung, L. & Glad, T. Modeling of Dynamic Systems (PTR Prentice Hall, 1994).
91. Council, N. R. et al. Assessing the Reliability of Complex Models: Mathematical and Sta-
tistical Foundations of Verification, Validation, and Uncertainty Quantification (National
Academies Press, 2012).
92. Celia, M. A., Rajaram, H. & Ferrand, L. A. A multi-scale computational model for multi-
phase flow in porous media. Advances in Water Resources 16, 81–92 (1993).
93. La Cava, W., Danai, K., Spector, L., Fleming, P., Wright, A. & Lackner, M. Automatic
identification of wind turbine models using evolutionary multiobjective optimization. Re-
newable Energy 87, 892–902 (2016).
94. Geers, M. G., Kouznetsova, V . G. & Brekelmans, W. Multi-scale computational homoge-
nization: Trends and challenges. Journal of Computational and Applied Mathematics 234,
2175–2182 (2010).
95. Tekin, E. & Sabuncuoglu, I. Simulation optimization: A comprehensive review on theory
and applications. IIE transactions 36, 1067–1081 (2004).
96. Holland, J. H. et al. Adaptation in Natural and Artificial Systems: An Introductory Analysis
with Applications to Biology, Control, and Artificial Intelligence (MIT press, 1992).
97. Brewick, P. T. & Masri, S. F. An evaluation of data-driven identification strategies for com-
plex nonlinear dynamic systems. Nonlinear Dynamics 85, 1297–1318 (2016).
98. Kadeethum, T., Jørgensen, T. M. & Nick, H. M. Physics-informed neural networks for solv-
ing nonlinear diffusivity and Biot’s equations. PLOS ONE 15, e0232683 (2020).
99. Kocijan, J., Girard, A., Banko, B. & Murray-Smith, R. Dynamic systems identification with
Gaussian processes. Mathematical and Computer Modelling of Dynamical Systems 11, 411–
424 (2005).
100. Koza, J. R. & Koza, J. R. Genetic Programming: On the Programming of Computers by
Means of Natural Selection (MIT press, 1992).
101. Chatzi, E. N., Smyth, A. W. & Masri, S. F. Experimental application of on-line parametric
identification for nonlinear hysteretic systems with model uncertainty. Structural Safety 32,
326–337 (2010).
102. Miranda, M. J. & Fackler, P. L. Applied Computational Economics and Finance (MIT press,
2004).
103. Murray-Smith, D. in Computing in Medicine 333–338 (Springer, 1982).
104. Wang, L. & Garnier, H. System Identification, Environmental Modelling, and Control Sys-
tem Design (Springer, 2011).
105. Tartakovsky, A., Marrero, C. O., Perdikaris, P., Tartakovsky, G. & Barajas-Solano, D. Physics-
informed deep neural networks for learning parameters and constitutive relationships in
subsurface flow problems. Water Resources Research (2020).
122
106. Ghasemi, F., Mehridehnavi, A., Perez-Garrido, A. & Perez-Sanchez, H. Neural network and
deep-learning algorithms used in QSAR studies: merits and drawbacks. Drug Discovery
Today 23, 1784–1790 (2018).
107. Prieto, A., Prieto, B., Ortigosa, E. M., Ros, E., Pelayo, F., Ortega, J., et al. Neural networks:
An overview of early research, current frameworks and new challenges. Neurocomputing
214, 242–268 (2016).
108. Poli, R., Langdon, W. B., McPhee, N. F. & Koza, J. R. A field guide to genetic programming
(Lulu Enterprises, 2008).
109. Gandomi, A. H., Alavi, A. H. & Ryan, C. Handbook of Genetic Programming Applications
(Springer, 2015).
110. Banzhaf, W., Goodman, E., Sheneman, L., Trujillo, L. & Worzel, B. Genetic Programming
Theory and Practice XVII (Springer, 2020).
111. Gray, G. J., Li, Y ., Murray-Smith, D. & Sharman, K. Structural system identification using
genetic programming and a block diagram oriented simulation tool. Electronics Letters 32,
1422–1424 (1996).
112. Gray, G. J., Murray-Smith, D. J., Li, Y ., Sharman, K. C. & Weinbrenner, T. Nonlinear model
structure identification using genetic programming. Control Engineering Practice 6, 1341–
1352 (1998).
113. Rodriguez-Vazquez, K., Fonseca, C. M. & Fleming, P. J. Identifying the structure of non-
linear dynamic systems using multiobjective genetic programming. IEEE Transactions on
Systems, Man, and Cybernetics-Part A: Systems and Humans 34, 531–545 (2004).
114. Bolourchi, A., Masri, S. F. & Aldraihem, O. J. Development and application of computa-
tional intelligence approaches for the identification of complex nonlinear systems. Nonlin-
ear Dynamics. 79, 765–786 (2015).
115. Bolourchi, A., Masri, S. F. & Aldraihem, O. J. Studies into computational intelligence
and evolutionary approaches for model-free identification of hysteretic systems. Computer-
Aided Civil and Infrastructure Engineering 30, 330–346 (2015).
116. Bolourchi, A. & Masri, S. F. Evolutionary computational approaches for data-driven mod-
eling of multi-dimensional memory-dependent systems. Smart Structures and Systems 15,
897–911 (2015).
117. Fallah-Mehdipour, E., Haddad, O. B. & Marino, M. A. Genetic programming in groundwa-
ter modeling. Journal of Hydrologic Engineering 19, 04014031 (2014).
118. Mehr, A. D., Nourani, V ., Kahya, E., Hrnjica, B., Sattar, A. M. & Yaseen, Z. M. Genetic pro-
gramming in water resources engineering: a state-of-the-art review. Journal of Hydrology
566, 643–667 (2018).
119. Mohammad-Azari, S., Bozorg-Haddad, O. & Lo´ aiciga, H. A. State-of-art of genetic pro-
gramming applications in water-resources systems analysis. Environmental Monitoring and
Assessment 192, 73 (2020).
120. Computation for the work described in this paper was supported by the University of South-
ern California’s Center for High-Performance Computing (hpcc.usc.edu). ().
121. Carboni, B., Lacarbonara, W., Brewick, P. T. & Masri, S. F. Dynamical response identifica-
tion of a class of nonlinear hysteretic systems. Journal of Intelligent Material Systems and
Structures 29, 2795–2810 (2018).
122. Rodr´ ıguez-Iturbe, I. & Porporato, A. Ecohydrology of water-controlled ecosystems: soil
moisture and plant dynamics (Cambridge University Press, 2007).
123
123. Oreskes, N., Shrader-Frechette, K. & Belitz, K. Verification, validation, and confirmation of
numerical models in the earth sciences. Science 263, 641–646 (1994).
124. Smyth, A. W., Masri, S. F., Kosmatopoulos, E. B., Chassiakos, A. G. & Caughey, T. K. De-
velopment of adaptive modeling techniques for non-linear hysteretic systems. International
Journal of Non-Linear Mechanics 37, 1435–1451 (2002).
125. Rodriguez-Iturbe, I., Porporato, A., Ridolfi, L., Isham, V . & Coxi, D. Probabilistic modelling
of water balance at a point: the role of climate, soil and vegetation. Proceedings of the Royal
Society of London. Series A: Mathematical, Physical and Engineering Sciences 455, 3789–
3805 (1999).
126. Laio, F., Porporato, A., Ridolfi, L. & Rodriguez-Iturbe, I. Plants in water-controlled ecosys-
tems: active role in hydrologic processes and response to water stress: II. Probabilistic soil
moisture dynamics. Advances in Water Resources 24, 707–723 (2001).
127. Porporato, A., D’odorico, P., Laio, F. & Rodriguez-Iturbe, I. Hydrologic controls on soil
carbon and nitrogen cycles. I. Modeling scheme. Advances in Water Resources 26, 45–58
(2003).
128. Laio, F., Porporato, A., Fernandez-Illescas, C. & Rodriguez-Iturbe, I. Plants in water-controlled
ecosystems: active role in hydrologic processes and response to water stress: IV . Discussion
of real cases. Advances in Water Resources 24, 745–762 (2001).
129. Ljung, L. System identification. Wiley Encyclopedia of Electrical and Electronics Engineer-
ing, 1–19 (1999).
130. Liu, W. K., Karniadakis, G., Tang, S. & Yvonnet, J. A computational mechanics special
issue on: data-driven modeling and simulation—theory, methods, and applications 2019.
131. Ribeiro, B., Albrecht, R. F., Dobnikar, A., Pearson, D. W. & Steele, N. C. Adaptive and
Natural Computing Algorithms in Proceedings of the International Conference in Coimbra,
Portugal 20059 (2005).
132. Im, J., Rizzo, C. B., de Barros, F. P. J. & Masri, S. F. Application of genetic programming
for model-free identification of nonlinear multi-physics systems. Nonlinear Dynamics 104,
1781–1800 (2021).
133. Schneider, J. & Kirkpatrick, S. Stochastic optimization (Springer Science & Business Me-
dia, 2007).
134. Panteleev, A. V . & Lobanov, A. V . Mini-batch adaptive random search method for the para-
metric identification of dynamic systems. Automation and Remote Control 81, 2026–2045
(2020).
135. Neuman, S. P. & Tartakovsky, D. M. Perspective on theories of non-Fickian transport in
heterogeneous media. Advances in Water Resources 32, 670–680 (2009).
136. Bolster, D., Benson, D. A., Le Borgne, T. & Dentz, M. Anomalous mixing and reaction
induced by superdiffusive nonlocal transport. Physical Review E 82, 021119 (2010).
137. Dentz, M., Le Borgne, T., Englert, A. & Bijeljic, B. Mixing, spreading and reaction in
heterogeneous media: A brief review. Journal of contaminant hydrology 120, 1–17 (2011).
138. Scheidegger, A. E. An evaluation of the accuracy of the diffusivity equation for describing
miscible displacement in porous media in Proceedings of the Theory of Fluid Flow in Porous
Media Conference (1959).
139. Diethelm, K., Ford, N. J. & Freed, A. D. A predictor-corrector approach for the numerical
solution of fractional differential equations. Nonlinear Dynamics 29, 3–22 (2002).
124
140. Kreft, A. & Zuber, A. On the use of the dispersion model of fluid flow. The International
Journal of Applied Radiation and Isotopes 30, 705–708 (1979).
141. Berkowitz, B., Scher, H. & Silliman, S. E. Anomalous transport in laboratory-scale, hetero-
geneous porous media. Water Resources Research 36, 149–158 (2000).
142. Bijeljic, B., Mostaghimi, P. & Blunt, M. J. Signature of non-Fickian solute transport in
complex heterogeneous porous media. Physical review letters 107, 204502 (2011).
143. USEPA. Benchmark Dose Technical Guidance. EPA/100/R-12/001 U.S. Environmental Pro-
tection Agency (Washington, DC, 2012).
144. Di Veroli, G. Y ., Fornari, C., Goldlust, I., Mills, G., Koh, S. B., Bramhall, J. L., et al. An au-
tomated fitting procedure and software for dose-response curves with multiphasic features.
Scientific Reports 5, 14701. doi:https://doi.org/10.1038/srep14701 (2015).
145. FDA. Estimating the Maximum Safe Starting Dose in Initial Clinical Trials for Therapeutics
in Adult Healthy Volunteers U.S. Department of Health and Human Services (2005).
125
Appendices
A Non-monotonic Dose Response Model for Endocrine-related
Adverse Health Effects by Bisphenol A (BPA)
Recent toxicological studies report that BPA has endocrine-related health effects in low dose ranges
[17, 18]. Especially, concerns lie in effects on brain, behavior, and prostate gland of fetuses,
infants, and children because their endocrine systems are vulnerable before the full development
[38]. Since these effects occur in specific low dose ranges, their dose-response relationships follow
non-monotonic models [18]. To characterize health risks from endocrine-related health effects of
BPA, one potential low dose range and associated benchmark doses are determined with the data
from Jones et al. [57] study. This study is chosen from Lagarde et al. [78] review on NMDR
relationships of BPA. Aligned with Jones et al. [57] study, the concerned population is determined
as male offspring when their mothers are exposed to BPA for their pregnant/lactating periods. To
obtain benchmark doses from the data, we followed the benchmark dose method from the USEPA
[143]. First, models are fitted from the non-monotonic data through the method of non-linear
least squares as shown in Fig. 6.1. Since the mechanism behind endocrine-related health effects
is assumed as the induction of opposing effects (e.g., agonist versus antagonist), the multiphasic
model for the non-monotonic data is adopted to reflect this mechanism as follows Lagarde et al.
and Di Veroli et al. [78, 144]:
126
E(C)=
1+
E
∞,1
− 1
1+(EC
50,1
/C)
H
1
1+
E
∞,2
− 1
1+(EC
50,2
/C)
H
2
(A.1)
where E(C) is the effect (i.e., the relative health response) obtained at a given concentration C (i.e.,
a dose), E
∞,i
is the maximum effect for each phase (E
∞,1
for the agonistic effect and E
∞,2
for the
antagonistic effect), EC
50,i
is the relative 50% effective concentration for each phase, and H
i
is the
hill exponent for each phase.
10
0
10
1
10
2
10
3
10
4
Dose [ g/kg/day]
0
1
2
3
4
5
Relative health responses [-]
Intromissions
Intromission latency
Ejaculation Latency
Post-Ejaculatory Interval
Figure 6.1: Relative health responses for multiple health effects along BPA doses and their NMDR
models [57].
Next, upper and lower bounds for each low dose range for each health effect are assessed with
one control standard deviation from the control mean, corresponding to 10% extra risk [143]. Last,
by multiplying these bounds with conversion factors for interspecies [145], final benchmark doses
for each health effect are calculated. Among benchmark doses for health effects, the maximum
upper bound and the minimum lower bound are selected for one conservative low dose range. As
a result, the final low dose range (i.e., the upper and lower bounds, D
U
and D
L
) potentially causing
sexual behavior alterations of male offspring when their mothers are exposed to BPA for their
127
gestational and lactational periods are 0.68 and 80.69 µg/kg/d, respectively. This final low dose
range is overlapped with current human exposures the NTP has concern on (e.g., 0.3-13 µg/kg/d
in infants and 0.043-14.7µg/kg/d in children) [38].
128
Abstract (if available)
Abstract
Accurate and robust computational modeling of multi-scale hydrologic systems is a key in water resources management. The main challenge here is uncertainty in characterizing heterogeneous hydrologic systems under limited resources. Inevitably, the uncertainty occurs in models and their parameters, leading to unreliable predictions. As described in the first chapter, this study aims to establish novel computational approaches to overcome the uncertainty associated with parameters and models and to provide accurate and robust outcomes for water resources management. Firstly, the parameter uncertainty is addressed in the context of groundwater contamination in the second chapter. Human health risks against Bisphenol A (BPA) in groundwater are investigated under the uncertainty in the hydraulic conductivity, focusing on the impact of different health models (i.e., dose-response models for systemic vs. endocrine-related adverse health effects). The results of the Monte Carlo approach show the non-trivial behavior of the aquifer resilience loss and the uncertainty against BPA under the interplay health response models and conductivity heterogeneity. This chapter provides insights for the risk management of groundwater systems against emerging contaminants that have endocrine-related concerns. Next, the model uncertainty is addressed by developing a hybrid physics-based and data-driven modeling approach utilizing stochastic optimization. This approach aims to discover governing equations, described by ordinary differential equations (ODEs) or partial differential equations (PDEs), from observed data. The proposed approach is successfully demonstrated by identifying a variety of ODEs/PDEs from simulated data sets for multi-scale hydrologic systems (e.g., soil water balance, flow and transport in porous media) in the third and fourth chapters. Moreover, the proposed hybrid approach is applied into a more challenging problem for the model uncertainty: discovering governing equations from limited experimental data. This case in the fifth chapter emphasizes the novelty of the proposed approach that is capable of providing physical insights about the underlying physics, even from the limited data set. The capabilities of the proposed approach in capturing anomalous behavior and providing physical insights are further enriched with incorporating more sophisticated mathematical expressions (e.g., the fractional time derivative). The series of these chapters (Chapter 3-5) indicate that the proposed approach is a powerful gray-box modeling tool. Consequently, novel computational approaches are developed and applied to address model and parameter uncertainties in modeling multi-scale hydrologic systems under heterogeneity. As described in the last chapter, the research carried out in this study and its continuation contribute to establishing accurate and robust modeling of hydrologic systems under uncertainty and supporting the decision-making process in water resource management.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Efficient stochastic simulations of hydrogeological systems: from model complexity to data assimilation
PDF
Reproducible and rapid computational approaches for assessing contamination in natural aquifers
PDF
Studies into computational intelligence approaches for the identification of complex nonlinear systems
PDF
Model, identification & analysis of complex stochastic systems: applications in stochastic partial differential equations and multiscale mechanics
PDF
Analytical and experimental studies in modeling and monitoring of uncertain nonlinear systems using data-driven reduced‐order models
PDF
A polynomial chaos formalism for uncertainty budget assessment
PDF
Vision-based and data-driven analytical and experimental studies into condition assessment and change detection of evolving civil, mechanical and aerospace infrastructures
PDF
Algorithms for stochastic Galerkin projections: solvers, basis adaptation and multiscale modeling and reduction
PDF
Efficient inverse analysis with dynamic and stochastic reductions for large-scale models of multi-component systems
PDF
Design optimization under uncertainty for rotor blades of horizontal axis wind turbines
PDF
Machine-learning approaches for modeling of complex materials and media
PDF
Data worth analysis in geostatistics and spatial prediction
PDF
Physics-based data-driven inference
PDF
Stochastic and multiscale models for urban and natural ecology
PDF
Novel multi-stage and CTLS-based model updating methods and real-time neural network-based semiactive model predictive control algorithms
PDF
Computationally efficient design of optimal strategies for passive and semiactive damping devices in smart structures
PDF
Inverse modeling and uncertainty quantification of nonlinear flow in porous media models
PDF
Studies into data-driven approaches for nonlinear system identification, condition assessment, and health monitoring
PDF
Studies into vibration-signature-based methods for system identification, damage detection and health monitoring of civil infrastructures
PDF
Model-driven situational awareness in large-scale, complex systems
Asset Metadata
Creator
Im, Jinwoo
(author)
Core Title
Hybrid physics-based and data-driven computational approaches for multi-scale hydrologic systems under heterogeneity
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Environmental Engineering
Degree Conferral Date
2022-08
Publication Date
07/27/2023
Defense Date
05/04/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
aquifer resilience,emerging contaminants,governing equations,hybrid modeling,OAI-PMH Harvest,stochastic hydrogeology,uncertainty quantification
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
de Barros, Felipe P. J. (
committee chair
), Masri, Sami F. (
committee member
), Sahimi, Muhammad (
committee member
)
Creator Email
imjw24@gmail.com,jinwooim@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111375334
Unique identifier
UC111375334
Legacy Identifier
etd-ImJinwoo-11028
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Im, Jinwoo
Type
texts
Source
20220728-usctheses-batch-962
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
aquifer resilience
emerging contaminants
governing equations
hybrid modeling
stochastic hydrogeology
uncertainty quantification