Deep Learning For Subsurface Characterization and Forecasting
by
Anyue Jiang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CHEMICAL ENGINEERING)
December 2022
Copyright 2022 Anyue Jiang
To this long journey
Acknowledgements
First, I would like to express my sincere gratitude to my advisor Professor Behnam Jafarpour for
his continuous support of my research.
I am also grateful to the rest of my committee members Professor Birendra Jha, Professor Doug
Hammond, Professor Iraj Ershaghi, and Professor Robert Young.
Words cannot express my gratitude to all of my family, my partner, and my friends for their
love and support.
Finally, I would like to thank my computers for their great service.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Geologic Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.4 Deep Learning in Subsurface Model Calibration . . . . . . . . . . . . . . . 5
1.2 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Geothermal Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Geothermal Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Deep Learning in Geothermal Forecasting . . . . . . . . . . . . . . . . . . 9
Chapter 2: Inverting Subsurface Flow Data for Geologic Scenarios Selection with
Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Feature Extraction Using Inversion with Leading PCA Elements . . . . . . . . . . 13
2.2 Feature Recognition with CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Iterative Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1 Pumping Test with Non-Gaussian 2D Model . . . . . . . . . . . . . . . . . 26
2.5.2 Pumping Test with 3D Model . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Chapter 3: Deep Convolutional Autoencoders for Robust Flow Model Calibration under
Uncertainty in Geologic Continuity . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1 Subsurface Flow Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Parameterization with Variational Autoencoders . . . . . . . . . . . . . . . . . . . 44
3.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1 Fluvial Channel Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1.1 Pumping Test in Single-Phase Flow . . . . . . . . . . . . . . . . 61
3.3.1.2 Two-phase Flow with Limited Measurements . . . . . . . . . . . 62
3.3.2 3D Example with Multiple Facies . . . . . . . . . . . . . . . . . . . . . . 66
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Chapter 4: Recurrent Neural Networks for Short-term and Long-term Prediction of
Geothermal Reservoirs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1 Geothermal Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2 CNN-RNN Model with Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.1 RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.2 CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.3 Labeling Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Overall Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3.2 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.1 Test Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.2 Field Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.1 Interpolation and Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.2 Importance of Honoring the Expected Patterns . . . . . . . . . . . . . . . . 106
4.5.3 Computational Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Chapter 5: A Multiscale Recurrent Neural Network Model for Long-Term Prediction of
Geothermal Reservoir Energy Production . . . . . . . . . . . . . . . . . . . . . 110
5.1 RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2 Multiscale RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4.1.1 Fault Zone 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.1.2 Fault Zone 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4.2 Field Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Chapter 6: Transfer Learning for Geothermal Prediction with Single-Well Model and
Residual Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.1 Single-Well Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2 Residual Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3 Preliminary Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Chapter 7: Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
A Stochastic Gradient Descent (SGD) for Minimization . . . . . . . . . . . . . . . . 153
List of Tables
2.1 Details of each layer of the CNN model used in the 2D pumping test where the
filter size is height × width and the output dimension is height × width × channel. . 29
2.2 Details of each layer of the CNN model used in the 3D pumping test where
the filter size is depth × height × width and the output dimensions are depth ×
height × width × channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Details of the VAE architecture for examples with fluvial channels. . . . . . . . . . 56
3.2 Details of the VAE architecture for the 3D example. . . . . . . . . . . . . . . . . . 69
3.3 Data RMSE performance in different numerical experiments. . . . . . . . . . . . . 74
3.4 Average Computational Costs in the Experiments with the Fluvial Systems . . . . . 80
4.1 Details of connectivity for enthalpy and BHP. . . . . . . . . . . . . . . . . . . . . 91
4.2 Details of computational cost of each gradient for different optimization method. . 109
List of Figures
1.1 An illustration of the problems defined based on the availability of information
and the stationarity of the system. Multiscale RNN targets the problems in Q2
where the predicting horizon is long and the system is relatively stable. . . . . . . 12
2.1 (a) Samples from circle and bar images (left) and the corresponding (sorted) PCA
basis images (right). The first few leading elements clearly show the information
about the corresponding objects in the training data; (b) a sample circle image
(Column 1) with its approximations using the first 8 leading PCA elements of
circle objects (Column 2) and bar objects (Columns 3), and by combining the first
8 leading PCA of the two objects (Column 4). Results show that the leading PCA
elements of the bar objects contribute very little to the reconstruction of circle
objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 A Neural network (NN) sketch with fully connected structure (left) and the basic
operations applied to the input connected neurons (right). . . . . . . . . . . . . . . 20
2.3 Top: Overall architecture of convolutional neural network (CNN); (Bottom):
Illustration of a CNN layer containing Convolution-ReLU-Pooling (left), with a
simple example depicting the operation involved (right). . . . . . . . . . . . . . . 21
2.4 Training images of meandering, intersecting and straight fluvial channels with
corresponding realizations that have different channel orientations, which leads to
5 different geologic scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Generation of training data for CNN based on PCA approximation of the original
models with correlated noise added to represent expected inversion errors and
artifacts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 A successful scenario identification example for the 2D fluvial system: (a) The
configuration of the pumping test with the 2D fluvial systems, (b) reference model
from Scenario 1, (c) final likelihoods assigned to each scenario, (d) solution
iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 A failed test case from Scenario 1 in the pumping test with the 2D fluvial systems:
(a) pumping test domain and well configurations, (b) reference log-hydraulic
conductivity model from Scenario 1, (c) final likelihoods assigned to different
scenarios, (d) solution iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8 The confusion matrix based on 250 test cases showing the probabilities of
dominating (left) and surviving (right) for the correct scenario in the pumping test
with fluvial realizations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.9 The pumping test field configuration (a) and an example of a multi-facies model
for the 3D experiments (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.10 The eight geologic scenarios for the multi-facies examples with different
combinations of aggradation angle and paleo-azimuth. . . . . . . . . . . . . . . . 35
2.11 (a) The results from a successful case in the 3D pumping test, with the PCA
composition before each iteration (left), the extracted spatial features from Step 1
(middle), and the inferred geologic scenarios from CNN (right). . . . . . . . . . . 37
2.12 The confusion matrix based on 400 test cases showing the probabilities of
dominating (a) and surviving (b) for the correct scenario in the 3D pumping test. . . 37
2.13 A test case in which the reference model does not belong to any of the proposed
scenarios: (a) reference log-hydraulic conductivity model, (b) solution iterations
with the PCA composition before each iteration (left), the extracted spatial
features from Step 1 (middle), and the inferred geologic scenarios from CNN
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 A typical symmetric VAE architecture (top), and illustration of the convolu-
tion+ReLU and pooling operations in one layer (bottom). . . . . . . . . . . . . . . 46
3.2 Illustration of the convolution (a) and transposed convolution (b) operations that
are used in the encoder and decoder networks, respectively. The convolution
operation scans the input images with the trainable filters of a chosen size
(3-by-3 here) to compute the inner product between the filter weights and the
corresponding image segments (as shown in the colored illustrations), and store
the results in the respective pixel of the output. The transposed convolution
upsamples an input feature map to a desired output feature map (as shown in the
colored illustration) using learnable weights. . . . . . . . . . . . . . . . . . . . . . 47
3.3 Schematic representation of VAE as a stochastic mapping between data and latent
spaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Two training images with fluvial channels that are used to generate six alternative
geologic connectivity scenarios; four random realizations of facies models from
each scenario are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 The projection performance of PCA and VAE when a single scenario is used: (a)
sample model realizations, (b) PCA projection results with different number of
PCs, (c) VAE projection with the same number of latent variables as in (b). For
$n_z$ = 4, 8, 16, the variances explained by the PCA bases are 27%, 40%, and 58%,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 The projection performance of PCA and VAE when multiple scenarios are present:
(a) sample model realizations, (b) PCA projection results with different number
of PCs, (c) VAE projection with the same number of latent variables as in (b). For
$n_z$ = 4, 8, 16, the variances explained by the PCA bases are 35%, 50%, and 66%,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 The learned distribution of latent variables Z1 and Z2 in VAE (a), the t-SNE
visualization of the training realizations and 100 newly generated models (b). The
results are shown for Scenario 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.8 Localized nature of the gradient information in VAE: (a) the spatial image x
corresponding to a sample latent vector z; (b) the gradient of the spatial image x
in (a) with respect to each latent variable in z. . . . . . . . . . . . . . . . . . . . . 60
3.9 The domain and well configuration for numerical experiments with fluvial
channels: (a) pumping test example; (b) two-phase flow example. . . . . . . . . . 61
3.10 Summary of the results for the pumping test, showing the projection and
calibration results with PCA and VAE (top), iterations of the model calibration
minimization algorithm (middle), comparison of the normalized predicted and
measured flow data for the single scenario (a) and multiple scenarios (b). The
horizontal dashed line in the middle panels shows the value of the objective
function based on the final PCA solution. . . . . . . . . . . . . . . . . . . . . . . 63
3.11 Summary of the results for the two-phase flow example, showing the projection
and calibration results with PCA and VAE (top), iterations of the model
calibration minimization algorithm (middle), comparison of normalized predicted
and measured flow data for the single scenario (a) and multiple scenarios (b).
The bottom row shows the normalized flow response data from all wells stacked
together. The horizontal dashed line in the middle panels shows the value of
the objective function based on the final PCA solution. The order of the nine
properties is the BHP of Well I1, the nonwetting phase flow rates of P1, P2, P3,
P4, and the wetting phase flow rates of P1, P2, P3, P4. . . . . . . . . . . . . . . . . 65
3.12 Summary of the results for two additional two-phase flow examples when
multiple scenarios are present. The results are shown for a simple case (a) and
a complex case that does not belong to any of the scenarios. In each case, the
projection and calibration results with PCA and VAE (top), iterations of the model
calibration minimization algorithm (middle), comparison of normalized predicted
and measured flow data are shown. The horizontal dashed line in the middle
panels shows the value of the objective function based on the final PCA solution.
The order of the nine properties is the BHP of Well I1, the nonwetting phase flow
rates of P1, P2, P3, P4, and the wetting phase flow rates of P1, P2, P3, P4. . . . . . 67
3.13 Sample models showing the geological facies in the eight scenarios considered for
the 3D examples with four facies types. . . . . . . . . . . . . . . . . . . . . . . . 68
3.14 Approximate representation of sample models (a) from each scenario in the 3D
example using the PCA (b) and VAE (c) methods. . . . . . . . . . . . . . . . . . . 69
3.15 Model domain and well configuration in the 3D example (a), and the reference
model used for the numerical experiment (b). . . . . . . . . . . . . . . . . . . . . 71
3.16 Summary of model calibration results for the 3D experiment: (a)-(b) show the
PCA and VAE projection, respectively; (c)-(d) display the calibration results
with PCA and VAE, respectively; (e) and (f) show the iterations of the model
calibration objective function and the final data match, respectively. . . . . . . . . 72
3.17 Layer-by-layer illustration of the reference model (a), calibration using PCA (b)
and VAE (c) parameterization methods. . . . . . . . . . . . . . . . . . . . . . . . 73
3.18 Average activation outputs of all the filters (after normalization) for the cases with
a single scenario (a) and multiple scenarios (b) under a weak sparsity constraint,
showing that more filters are active to represent the diverse features when multiple
scenarios are present. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.19 The effect of the number of layers and nonlinearity on the approximation quality of
the VAE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.20 Approximation results for scenarios with different channel orientations. The VAE
and PCA are trained only with 0° and 90°. Models with channel orientations along
75° and 45° are not used in the training, but presented to evaluate the performance
of the two methods for approximation of images that are not similar to those used
in the training data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.21 Ability of VAE (a) versus PCA (b) to reproduce the learned spatial patterns: the
figures show nine spatial models (x) generated by randomly sampled $z \sim \mathcal{N}(0, I)$
using VAE and PCA, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1 The hourly data of one of the producers from the field data after normalization. . . 83
4.2 The architecture of CNN-RNN model with labeling scheme. y is the target
variable, x is the control, t is the time, and ŷ is the final prediction from the model. 83
4.3 The architecture of LSTM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4 A demonstration of 1D convolution where x, w, and y are the inputs, learnable
weights, and outputs, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5 The sensitivity analysis for enthalpy and BHP with truncation for feature selection. 90
4.6 A demonstration of the weak extrapolation of neural networks. . . . . . . . . . . . 92
4.7 The five toy tests and the short-term predictions from the proposed model. . . . . 94
4.8 A toy example where the model suffered from extrapolation when doing long-term
prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.9 The long-term prediction becomes more accurate after transforming the problem
from extrapolation to interpolation. . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.10 The configuration of the field example where four faults are present. The color
indicates the initial temperature of the field . . . . . . . . . . . . . . . . . . . . . 96
4.11 The short-term predictions of the producer enthalpies in the field experiment. The
training is on the left side of the vertical dashed line and the testing is on its right. . 98
4.12 The short-term predictions of the producer BHPs in the field experiment. . . . . . . 98
4.13 The long-term predictions of producer enthalpies in the field experiment using
RNN trained on historical data. . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.14 The long-term predictions of producer BHPs in the field experiment using RNN
trained on historical data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.15 The sensitivity analysis on testing RMSE loss of enthalpy prediction as a function
of the number of training datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.16 The sensitivity analysis on testing RMSE loss of BHP prediction as a function of
the number of training datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.17 The long-term predictions of enthalpy in the field experiment using RNN trained
on 10 full-sequence datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.18 The long-term predictions of BHP in the field experiment using RNN trained on
10 full-sequence datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.19 The long-term predictions of enthalpy in pseudo field data using RNN trained on
10 full-sequence datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.20 The long-term predictions of BHP in pseudo field data using RNN trained on 10
full-sequence datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.21 Enthalpy change under constant controls, which shows the change of trend when
production starts from the initial equilibrium state. . . . . . . . . . . . . . . . . . 106
4.22 The effect of label variables on the prediction performance for producer BHPs in
Fault 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.23 The effect of shut-in period on the prediction performance for producer BHPs in
Middle Fault and Fault 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.1 A demonstration of the extrapolation issue where all the solution curves match
the training data but are very different in the test part. The data on the left of the
vertical line are used for training, while the right of it is the predicting part. . . . . 111
5.2 The structure of LSTM where t, x, C, and h are the current time step, the input
control, the cell state, and the hidden state, respectively. . . . . . . . . . . . . . . 112
5.3 The architecture of multiscale RNN. The RNN is the short-term model and the
FCNN is the long-term model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4 The configuration of the geothermal field. The color indicates the normalized
initial temperature. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5 One of the synthetic realizations of fault zone 1. . . . . . . . . . . . . . . . . . . 119
5.6 Predictions from multiscale RNN, RNN, and ARX of the synthetic example of
fault zone 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.7 The two intermediate predictions, long- and short-term, of multiscale RNN in the
synthetic example of fault zone 1. . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.8 The statistical results of fault zone 1 over 10 production realizations. . . . . . . . 121
5.9 One of the synthetic realizations of fault zone 2. . . . . . . . . . . . . . . . . . . 121
5.10 Predictions from multiscale RNN, RNN, and ARX of the synthetic example of
fault zone 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.11 The statistical results of fault zone 2 over 10 production realizations. . . . . . . . 123
5.12 The field data of fault zone 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.13 The field data of fault zone 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.14 Predictions from multiscale RNN, RNN, and ARX of the field data of fault zone
1. Zero enthalpy values are removed and filled by linear interpolation. . . . . . . . 125
5.15 Predictions from multiscale RNN, RNN, and ARX of the field data of fault zone
2. Zero enthalpy values are removed and filled by linear interpolation. . . . . . . . 126
5.16 Comparison of models trained simultaneously and sequentially for the synthetic
example of fault zone 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.1 A demonstration of the (a) single-well model and (b) multi-well model. . . . . . . 130
6.2 A demonstration of residual learning. . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3 A comparison between single-well models and multi-well model. . . . . . . . . . 132
Abstract
Accurate prediction of flow and transport behavior in complex geologic formations is critical for
the efficient development of the underlying resources. Physics-based simulation models can pro-
vide reliable forecasts about the response of subsurface flow systems to various development strate-
gies. However, the cost and technical difficulties associated with measuring the rock properties,
such as permeability and porosity, at a high enough resolution and coverage have necessitated us-
ing indirect and scattered flow response data for their characterization. Since flow response data
are measured at sparse locations, inference of high-resolution model parameters from the limited
data available often leads to underdetermined inverse problems. Besides the conventional physics-
based models that offer a comprehensive prediction tool, data-driven models provide an efficient
alternative to building predictive models by extracting and using the statistical patterns in data to
make predictions.
In this work, we applied deep learning models to handle the geologic uncertainty in the model
calibration process. We present a novel framework whereby dynamic response data is used to
screen a set of proposed geologic scenarios by combining gradient-based inversion for feature ex-
traction and convolutional neural networks for feature recognition. We also propose a special type
of deep learning architecture, known as variational auto-encoder, for robust dimension-reduced
parameterization of spatially distributed properties, in solving model calibration problems under
uncertain geostatistical models. We show that convolutional autoencoders offer the versatility and
robustness required for nonlinear parameterization of complex subsurface flow property distribu-
tions when multiple distinct geologic scenarios are present.
In addition, we introduce a variant of recurrent neural networks (RNN) that also utilizes the efficiency of convolutional neural networks (CNN) for the prediction of energy production from geothermal reservoirs. Specifically, a CNN-RNN architecture is developed that takes historical production data and future well controls to predict the well quantities that describe the production performance of the reservoir. The model is paired with a labeling scheme to handle real field disturbances such as data gaps.
RNN primarily exploits statistical relations in training data to generate predictions. Thus it can
be challenging to apply RNN in applications where extrapolation beyond the training data range
is needed. In addition, geothermal data usually involves features on multiple scales, which makes
complex deep learning models more vulnerable to overfitting. To overcome these issues, we have
developed a multiscale architecture for RNN to improve its generalization power for the long-term
prediction of geothermal data.
Chapter 1
Introduction
1.1 Model Calibration
Calibration of subsurface flow models against the flow response data is used to improve the predic-
tive capability of simulation models as a tool for decision making. The resulting model calibration
formulation often leads to inverse problems, where limited data are used to estimate too many
parameters that represent heterogeneous rock property distributions at high resolution. Parameter-
ization and regularization are the typical approaches to solve underdetermined inverse problems.
Parameterization is performed by finding a compact representation of the original parameters either
in the space domain, e.g., the zonation method [63] and the pilot point method [139], or in a trans-
form domain. The transform domain can either be learned from data, e.g., Principal Component
Analysis [187, 150], or it can be data-independent, e.g., the Discrete Cosine Transform [4]. Recently,
deep learning approaches such as GAN and VAE have also been introduced as parameterization
techniques [97, 96, 22, 73, 124]. Regularization, on the other hand, constrains the solution of an
inverse problem by adding a penalty term that promotes user-specified properties in the solution.
For example, the Tikhonov regularization [171] stabilizes the inversion process by requiring the
solution to be smooth and the total variation regularization [146] allows for detection of local dis-
continuity (e.g., facies boundaries) in the obtained solution. Parameterization and regularization
techniques often serve two purposes, making the inverse problem better-posed mathematically and
improving the plausibility of the obtained solution by incorporating the expected behavior of the
solution (e.g., preserving the geologic continuity).
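To make these two ingredients concrete, a generic regularized calibration objective can be stated as follows (a standard form shown only for illustration, written with the data-mismatch notation used later in Equation 2.3; the specific penalty depends on the prior assumptions):
$$\hat{u} = \arg\min_{u} \; \left\| C_d^{-1/2}\left(d_{obs} - g(u)\right) \right\|_2^2 + \lambda\, \mathcal{R}(u), \qquad \mathcal{R}_{\mathrm{Tikhonov}}(u) = \left\| L u \right\|_2^2,$$
where $g(\cdot)$ is the forward (flow simulation) model, $C_d$ is the observation noise covariance, $L$ is a spatial derivative (roughness) operator, and $\lambda$ balances the data fit against the penalty. Parameterization instead replaces $u$ with a low-dimensional representation, e.g., $u \approx \Phi v$, so that the minimization is carried out over the reduced parameters $v$.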
1.1.1 Geologic Uncertainty
Traditionally, a prior geologic scenario is used to regularize the problem by specifying the ex-
pected form of spatial variability in subsurface properties [167, 20]. The prior geologic scenario is
usually derived from integrating different sources of knowledge [89], such as quantitative measure-
ments (e.g., hard data, well logs, seismic) and qualitative data (e.g., outcrops), as well as geologic
modeling techniques and expert knowledge [11]. Conventional model calibration methods assume
that a prior conceptual model of geologic scenario is known and must be respected while updating
model properties to match the observed response data. However, the assumption that the geologic
scenario is known with certainty is not realistic as several sources of uncertainty and subjectivity
are present in building such models [14, 123].
Acknowledging and incorporating the uncertainty in conceptual geologic scenarios is, there-
fore, important for developing reliable workflows for forecasting, model calibration, optimization,
and uncertainty quantification [28, 108, 143, 113]. On the other hand, disregarding the uncertainty
in the geologic scenario can result in models that can match the available flow response data but
provide incorrect (biased) predictions. Therefore, it is important to account for the uncertainty in
the prior knowledge about the geologic scenario and to use dynamic response data, in addition to
other sources of information, to reduce the underlying uncertainty [80, 46]. Screening of geologic
scenarios based on flow response data is motivated by the need to, on the one hand, account for
the uncertainty and subjectivity in developing conceptual geologic models and, on the other hand,
reduce the uncertainty space by eliminating scenarios that cannot be supported by data.
One approach to represent the uncertainty in the geologic scenarios is to consider multiple
plausible conceptual models, either by having alternative interpretations by different geologists or
by stochastic treatment of process-based models that are used to generate the scenarios. When
confronted with multiple scenarios in model calibration, a basic approach is to perform individual
calibration with each prior scenario to cover a wide range of variability and evaluate the resulting
solutions. The Bayesian framework has also been used to quantify the uncertainty [8, 32]. How-
ever, encapsulating all the uncertainty can be computationally demanding, especially when many
initial scenarios are proposed. An alternative approach is to reduce the uncertainty through devel-
oping formulations that use the available data to eliminate irrelevant or highly unlikely scenarios
prior to detailed model calibration and uncertainty quantification. This approach is consistent with
the Popper-Bayes philosophy that observations can only be used to falsify, and not deduce, models
or theories [171], or in our case, geologic scenarios that are not supported by data [169, 133, 152].
However, in practical applications, the computational cost of a rigorous and exhaustive falsifica-
tion process can be prohibitive and more efficient and practical methods may become necessary.
A computationally less demanding approach is to formulate efficient inverse problems that use the
available data to support or reject a set of proposed conceptual geological models [80, 46] or to
extend model calibration formulations to include multiple plausible scenarios as prior knowledge
[81, 80]. Regardless of the formulation used, it is important to note that the models supported or
selected based on available data may be rejected upon the availability of additional data.
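As one concrete illustration of such a formulation (a generic Bayesian statement rather than the specific algorithm of any single cited study), the support that the data lend to each proposed scenario $S_k$, $k = 1, \ldots, S$, can be expressed through its posterior probability
$$P(S_k \mid d_{obs}) = \frac{P(d_{obs} \mid S_k)\, P(S_k)}{\sum_{l=1}^{S} P(d_{obs} \mid S_l)\, P(S_l)},$$
where the likelihood $P(d_{obs} \mid S_k)$ is typically approximated by forward simulating realizations drawn from scenario $S_k$. Scenarios with negligible posterior probability can then be falsified and removed before detailed model calibration and uncertainty quantification.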
1.1.2 Regularization
Regularization techniques are mathematical formulations that stabilize the inverse problem by con-
straining the solution space to honor certain expected attributes, usually by penalizing the solu-
tions that deviate from the desired attributes [177, 30, 55, 178, 9]. For instance, the Tikhonov
regularization [177] is used to promote spatially smooth solutions by minimizing the roughness
(spatial derivatives) of the unknown parameters while matching the observed data. Depending on
the expected properties of the parameters, a number of different regularization forms have been
developed and applied to flow model calibration problems, including Tikhonov [178] and total
variation [105]. Sparsity has also been exploited in different forms as a regularization approach
[107, 109, 81, 82, 47]. Regardless of the choice of a regularization function, regularization meth-
ods typically impart a global or generic attribute on the solution (e.g., smoothness), which may not
be sufficient for describing and detecting more specific and distinct spatial patterns. A prime ex-
ample is when the dominant variability in a geologic model is described by well-defined and crisp
geologic objects (e.g., fluvial channels), where regularization alone is not effective for describing
and constraining the expected patterns. A more effective approach, in this case, is to learn the
expected spatial patterns through parameterization.
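The contrast between a generic smoothness penalty and a pattern-tolerant one can be illustrated with a short numerical sketch (assumed grid size and example fields, written in numpy; not code from this work). The quadratic Tikhonov term penalizes the sharp facies boundaries of a channel-like field far more heavily, relative to a smooth ramp, than the total variation term does:

```python
import numpy as np

def tikhonov_penalty(u):
    """Sum of squared first differences: penalizes roughness and favors smooth fields."""
    return np.sum(np.diff(u, axis=0) ** 2) + np.sum(np.diff(u, axis=1) ** 2)

def total_variation_penalty(u):
    """Sum of absolute first differences: tolerates sharp (facies-like) boundaries."""
    return np.sum(np.abs(np.diff(u, axis=0))) + np.sum(np.abs(np.diff(u, axis=1)))

# A piecewise-constant "channel" field and a smooth ramp of the same size.
u_channel = np.zeros((50, 50))
u_channel[20:30, :] = 1.0
u_smooth = np.outer(np.linspace(0.0, 1.0, 50), np.ones(50))

for name, u in [("channel", u_channel), ("smooth ramp", u_smooth)]:
    print(f"{name:12s}  Tikhonov={tikhonov_penalty(u):8.2f}  TV={total_variation_penalty(u):8.2f}")
```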
1.1.3 Parameterization
Parameterization methods attempt to alleviate the ill-posedness of inverse problems by reducing the
number of unknown parameters [44, 65, 35]. For instance, zonation methods [64, 23, 52, 1] achieve
this goal by aggregating model parameters within a region of an aquifer with (expected) similar
values into a single parameter. Parameterization via linear expansion is an alternative approach in
which a finite set of appropriate expansion functions is used to describe high-dimensional spatial
parameters. These methods gain their efficiency by embedding the expected spatial patterns in the
basis functions and adjusting the weights (coefficients) assigned to them to update the solution.
The resulting weights form the new parameters that are estimated in the inverse problem. These
methods typically provide a lossy approximation of the original high-resolution models, where the
salient information is preserved while less important details are discarded. The effectiveness of this
approximation depends on the properties of the original parameters and the choice of the expansion
functions. When geologic information is not available, predetermined basis functions that are used
in image processing such as discrete cosine transform [67, 68], grid connectivity transform [13],
and discrete wavelet transform [121, 66, 148] can be adopted as effective dimensionality reduction
methods. For problems with complex spatial patterns, the expansion functions are more effective
if they are constructed from prior models with relevant and specific features. A conventional
example is the Principal Component Analysis (PCA) [44, 157, 78], which is a standard approach
of parameterization. Similar learning approaches have also been proposed for complex problems,
including the kernel form of PCA or K-PCA [149] and sparse dictionaries such as K-SVD [3,
81]. Other approaches have also been introduced to deal with situations in which the conceptual
geologic model is uncertain, e.g., [119, 74].
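As a brief sketch of this linear-expansion idea (with illustrative shapes and variable names; a simplified stand-in for the PCA construction used in later chapters), the leading basis can be computed from an ensemble of prior realizations and then used to project and reconstruct a model:

```python
import numpy as np

def pca_basis(U, k):
    """U: (n, L) matrix of L flattened realizations; returns ensemble mean and n x k leading basis."""
    u_mean = U.mean(axis=1, keepdims=True)
    # Thin SVD of the centered ensemble; the left singular vectors are the principal components.
    Phi, _, _ = np.linalg.svd(U - u_mean, full_matrices=False)
    return u_mean, Phi[:, :k]

rng = np.random.default_rng(0)
U = rng.normal(size=(2500, 200))    # e.g., 200 realizations on a 50 x 50 grid, flattened
u_mean, Phi = pca_basis(U, k=8)

v = Phi.T @ (U[:, [0]] - u_mean)    # low-dimensional coefficients of one realization
u_approx = u_mean + Phi @ v         # lossy k-term approximation of that realization
```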
1.1.4 Deep Learning in Subsurface Model Calibration
In recent years, deep learning has enjoyed great success in different fields, such as computer vi-
sion [91], natural language processing [83], and speech recognition [50]. Much of this success is
attributed to the efficacy of convolutional neural networks (CNN) in detecting and extracting com-
plex and diverse patterns in natural images [170]. By using several hidden layers in the architecture
of the CNN, each including multiple convolution filters, the learning process leads to different lev-
els of abstraction, where the filters in each layer learn distinct spatial patterns at different scales
[159, 131]. The power of deep learning can also be exploited in many scientific applications [160].
In geosciences, including groundwater systems, the natural variability in geologic formations and
their important flow properties tend to exhibit complex and multi-scale spatial heterogeneity that
controls the evolution of the underlying physical processes. With the versatility afforded by many
convolutional filters, as well as the nonlinearity embedded in their architectures, deep learning
models offer great potential as a tool to represent complex spatial features. To this end, a training
dataset must be available to empower deep learning algorithms. In practice, the training data can
be generated in the form of several realizations from a stochastic model that is used to describe the
uncertainty in the heterogeneous subsurface flow properties. When the heterogeneity represents
itself as a smoothly varying function in space, a simple mathematical function such as a variogram
model may be used as an approximate representation. Traditional geostatistical simulation meth-
ods, like Kriging methods, can then be used to generate the training dataset. In formations with
complex variability, often characterized by abrupt changes and well-defined shapes, more sophis-
ticated pattern-based geostatistical simulation methods such as SNESIM with training images are
available to generate several realizations that represent the underlying variability in these systems
[167, 20]. When multiple geologic scenarios are present, for instance, when different geologists
provide alternative conceptual models based on subjective interpretations of limited data, an exhaustive training dataset that includes model realizations from the different scenarios is needed.
A commonly used deep learning architecture for dimensionality reduction is an autoencoder
(AE) [173]. Autoencoders are simple neural network architectures that include an encoder and
a decoder part. The encoder performs data compression by mapping the input data onto low-
dimensional latent variables. The decoder implements inverse mapping (synthesis) by reconstruct-
ing an instance of the input data from its corresponding low-dimensional latent variables. In ad-
dition to dimensionality reduction, autoencoders have also been used for other image processing
applications such as denoising [181, 182]. A popular variant of AE is the variational auto-encoder
(VAE) [85], in which variational inference is used to approximate the posterior distribution of latent
variables given the training data (the encoder). Variational autoencoders utilize autoencoder-like
structures [90] but learn a probabilistic mapping between the latent and the original data domains
by minimizing a measure of data mismatch between the input and reconstructed image. A genera-
tive network (the decoder) is also trained to take samples from the latent distribution and generate
new realizations in the original data space. The main applications of VAEs include dimension-
ality reduction [184], where simple low-dimensional distributions in the latent space are used to
describe the variability in high-dimensional spatial images (training data), and image generation
[142], where new images are sampled from the distribution of the latent space. Variational autoen-
coders are relatively easy to train and can learn the transformation in both directions, data domain
to latent domain and vice-versa. A more recent deep learning architecture that has been used for
dimensionality reduction with generative properties is the Generative Adversarial Network (GAN) [48, 7].
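To make the encoder/decoder structure and the reconstruction-plus-KL training objective concrete, the following is a minimal convolutional VAE sketch in PyTorch; the 64×64 single-channel input, the layer sizes, and the latent dimension are illustrative assumptions rather than the architectures used later in this dissertation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        # Encoder: convolution + ReLU blocks that downsample a 1 x 64 x 64 image.
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),    # -> 16 x 32 x 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # -> 32 x 16 x 16
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(32 * 16 * 16, latent_dim)
        # Decoder: transposed convolutions that upsample latent samples back to images.
        self.fc_dec = nn.Linear(latent_dim, 32 * 16 * 16)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # -> 16 x 32 x 32
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid()  # -> 1 x 64 x 64
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps so gradients flow through the sampling step.
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def decode(self, z):
        h = self.fc_dec(z).view(-1, 32, 16, 16)
        return self.dec(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction (data mismatch) term plus KL divergence to the N(0, I) prior.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

After training, new realizations can be generated by sampling z from N(0, I) and calling decode(z), while encode provides the mapping from the data domain to the latent domain noted above.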
In geoscience applications, generative models such as VAE and GAN were recently introduced for parameterization of spatially distributed flow properties, including non-Gaussian patterns in fluvial systems [97, 21, 145, 98, 115]. While GAN has been shown to generate more realistic-looking images [130, 127], the adversarial training of GANs is known to suffer from stability and mode collapse issues [17]. In addition, GANs learn the mapping from the latent domain to the data domain, without providing the inverse transformation to the latent domain. Unlike GANs, VAEs are easier to train and learn the transformation in both directions, from the data domain to the latent space and its inverse. On the other hand, dimensionality reduction with VAE involves approximation; therefore, the models generated by VAE may not have the same level of detail or sharpness
as in the original images in the training dataset. Based on the known general properties of these
deep learning models, each of them offers different trade-offs that can be (un)attractive depend-
ing on the objectives and priorities of a study. Given the recent introduction of these methods to
the geosciences literature, further research and investigation are needed to reveal the potential and
suitability of each method for application to subsurface flow modeling problems.
The application of VAE to inverse problems has been considered in recent literature. Laloy et al. (2017) introduce VAE for parameterization of binary geologic media in model calibration [97]. The authors show how the dimensionality reduction power of VAE can facilitate probabilistic data conditioning with the Markov Chain Monte Carlo method [145], while preserving the complex spatial patterns in the training data. Canchumuni et al. (2019) apply variational autoencoders to parameterize fluvial facies for application with the ES-MDA [40]. They report a promising performance for VAE in capturing well-defined channelized facies while highlighting the need for more research before these methods can be deployed in practical applications, including their application to three-dimensional models. In another study, [115] use VAE along with ES-MDA for the integration of time-lapse seismic data into geologic models. Laloy et al. (2019) and Lopez-Alvis (2021) explored combining deep learning generative models with gradient-based inversion but found that the inversion is hampered by the nonlinearity introduced by the generative models, and thus extra tuning effort is necessary [100]. Lopez-Alvis et al. (2021) later suggested that VAE is
more suitable for gradient-based inversion due to its weaker nonlinearity [116]. While these early
studies underscore the promise that deep learning architectures offer to advance inverse modeling
in geosciences and subsurface flow, they also highlight the need for additional research to better
understand the advantages and limitations of these methods as they relate to the subsurface flow
modeling application.
An underlying theme in the previous studies is the use of a single geologic continuity model/scenario
(e.g., a training image) with predefined connectivity patterns that are learned during training with
V AE. A powerful property of deep learning architectures, unlike traditional dimensionality reduc-
tion such as PCA, is their capacity to learn a very diverse set of features with minimal loss in
performance, which is supported by their state-of-the-art performance in learning tasks (e.g., Im-
ageNet [33]) and feature visualization research [131, 126]. This robustness against diversity in
the training data offers a significant advantage that can be leveraged to develop a single parameter-
ization model when the geologic continuity model is highly uncertain. In this study, we investigate
this important property of convolutional deep learning models (using VAE). The robustness of VAE
stems from the embedded CNN architectures where distinct patterns in natural images are simulta-
neously learned by a large number of convolutional filters that are embedded in the network layers.
Conventional linear parameterization methods (such as PCA-based approaches) tend to lose their
dimensionality reduction power with increasing diversity in the prior models. For instance, the
PCA parameterization basis is constructed by aggregating the distinct features in the prior models.
Because PCA only extracts global features, when more distinct global features are present in the prior models, an increased number of leading basis elements (dimensions) is required to achieve the same level of accuracy. However, having more distinct global patterns does not necessarily mean having more distinct local patterns, which are what the convolutional filters learn. Therefore, as we discuss in this dissertation, VAE-type architectures are nonlinear and can conveniently accommodate the increase in diversity of the prior models by adapting the activation levels of the large number of trainable convolutional filters that are embedded in them.
1.2 Forecasting
1.2.1 Geothermal Energy
As a renewable source of energy, geothermal resources are becoming increasingly important in
meeting future energy demands [117]. Geothermal energy development involves the extraction
of hot fluid or steam from hot rocks in subsurface reservoirs. Simulation models for predicting
energy production from geothermal reservoirs provide an important tool for the prediction and
optimization of the energy production process. Traditional physics-based simulation models are
well-developed and offer comprehensive predictions about the reservoir states at different times.
However, this approach requires the construction of a reliable reservoir model, which involves ex-
tensive data acquisition [136, 128], integration of multiple sources of data [87, 34, 39, 79, 31, 72],
and simulation of complex multi-physics processes [188]. Other difficulties associated with the
deployment of physics-based models include their computational burden and expertise require-
ment. These limitations complicate the use of physics-based simulation models for managing and
developing geothermal energy.
1.2.2 Geothermal Operation
Geothermal energy development involves the extraction of hot fluid or steam from subsurface
reservoirs to spin a turbine that is connected to a generator to produce electricity. In a geothermal
field, the hot geothermal fluids are brought to the surface through producers. The produced fluid
is then sent to surface facilities (e.g., power plants or direct-use applications) and goes through
heat exchangers. The geothermal wastewater is then reinjected into the reservoir through injectors,
which helps to maintain the reservoir pressure. For the producers that are not self-discharging, they
are equipped with well pumps to drive the production. The operation of the producers is controlled
by adjusting the speed of the electric motor (i.e., the motor speed) of well pumps. The injectors
are controlled by changing the wellhead pressure through well pumps. Additionally, the bubbler
tube is widely installed in pumped wells to monitor the reservoir pressure (water level) and avoid
pump cavitation.
1.2.3 Deep Learning in Geothermal Forecasting
Data-driven models serve as efficient alternatives to physics-based models [125, 5]. Unlike physics-
based models that are computationally demanding to simulate, especially when used in standard
workflows such as model calibration [69, 70, 43, 136], data-driven models extract statistical pat-
terns and dependencies from historical data and use them to generate forecasts. In recent years,
deep learning has enjoyed great success in many applications, including subsurface modeling and
geothermal applications [97, 54, 140, 141, 179, 75, 163, 129]. Recurrent neural networks (RNN)
present an effective tool for modeling time-series data, which are commonly encountered in dy-
namical systems. RNN architectures have been widely adopted in many different applications
[19, 25, 185, 110], and have been proven to be effective universal approximators [151]. The re-
peated units in RNN form a temporal sequence that enables it to exhibit dynamic behaviors. An
important advantage of RNN is its internal state (memory), which propagates (in time) important
information about the underlying dynamics. Commonly used RNN models, including the Long
Short-Term Memory (LSTM) [58], and Gated Recurrent Units (GRU) [26], have implemented dif-
ferent gate mechanisms to control the flow of information through the internal states. RNN has also been used in geothermal
applications in recent years. Tian et al. [176] apply the RNN to the analysis of downhole gauge
data. Gudmundsdottir et al. [53] used RNN to model tracer tests in geothermal reservoirs and
compare its accuracy and performance with those of multi-layer perceptron (MLP). Khosravi et
al. applied MLP that was optimized with the imperialist competitive algorithm (ICA) to predict
the power generation and efficiency of a hybrid geothermal system. Shi et al. [162] developed
an LSTM and MLP combined network and demonstrated its precision through comparisons with
the LSTM and MLP under different hyperparameters. Although there have been several research
studies on applying RNN to the prediction of geothermal data, the development and application of
RNN models and the related workflow (e.g., feature selection, training, etc.) that are tailored to
geothermal data remain active fields of research.
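To sketch how such an RNN-based proxy can be organized (an illustrative PyTorch example with assumed feature counts and layer sizes; the CNN-RNN model and labeling scheme actually used are developed in Chapter 4), an LSTM can encode the production history into its internal state and then roll forward over the planned future controls:

```python
import torch
import torch.nn as nn

class ProductionForecaster(nn.Module):
    """Encode the production history, then roll forward over the planned future controls."""
    def __init__(self, n_history_features, n_control_features, n_targets, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_history_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_control_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, history, future_controls):
        # history: (batch, T_past, n_history_features), e.g., past enthalpy/BHP and past controls
        # future_controls: (batch, T_future, n_control_features), e.g., planned motor speeds
        _, state = self.encoder(history)             # summarize the history in the LSTM state
        out, _ = self.decoder(future_controls, state)
        return self.head(out)                        # (batch, T_future, n_targets)

# Example: 24 past steps with 6 features, forecasting 12 steps of 2 targets from 4 controls.
model = ProductionForecaster(n_history_features=6, n_control_features=4, n_targets=2)
y_hat = model(torch.randn(8, 24, 6), torch.randn(8, 12, 4))
print(y_hat.shape)  # torch.Size([8, 12, 2])
```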
Despite the success in many challenging applications, neural networks are known for being
great at interpolating but weak at extrapolating, where interpolation means predicting at a position between existing data points and extrapolation means predicting beyond them [88, 186]. RNN also suffers from this extrapolation issue, especially when applied to dynamic systems. Studies have shown that RNN prediction errors can increase rapidly [53] or become completely off [95] when an RNN trained purely on historical data is used to make a long-term prediction. In Figure 1.1, we divide the applications of RNN to geothermal problems based on how stationary the system is and how much data/information is available. The first and fourth quadrants represent the problems
where there are sufficient data available for training. It can be problems where either only short-
term prediction is required (e.g., Gudmundsdottir et al. [53]) or long-term prediction is desired but
additional simulated data of alternative production scenarios are available for training (e.g., Shi et
al. [162]). Most of the current studies that have applied RNN in geothermal problems belong to
these two quadrants. The third quadrant represents applications where the system is not stable, and
there is not enough data to guide the RNN. RNN is probably not a good fit for these applications.
Although it has been shown that RNN can make long-term predictions when additional simulation data are available, the requirement of simulation data makes RNN less valuable, because the benefit of replacing the simulation model with an RNN proxy is then limited to the reduced computational cost in optimization. Therefore, improving the long-term predictive power of RNN and eliminating the need for simulation data would make RNN more attractive as a prediction tool in geothermal applications. However, to the best of our knowledge, there has been little research effort in this direction.
Figure 1.1: An illustration of the problems defined based on the availability of information and
the stationarity of the system. Multiscale RNN targets the problems in Q2 where the predicting
horizon is long and the system is relatively stable.
Chapter 2
Inverting Subsurface Flow Data for Geologic Scenarios
Selection with Convolutional Neural Networks
2.1 Feature Extraction Using Inversion with Leading PCA Elements
In the feature extraction step, realizations from each geologic scenario are first used to construct
very few leading elements of the Principal Component Analysis (PCA) that can be used to com-
pactly represent the salient spatial patterns in the corresponding scenario. These principal compo-
nents (PCs) for different scenarios are then put into a combined linear expansion (hybrid basis) to
approximately represent a model from any geologic scenario. By choosing very few leading PCA
elements, it is ensured that the main spatial patterns from a given scenario can only be useful for
approximating the models pertaining to that scenario, without the ability to represent realizations
from other scenarios. Consequently, the constructed parameterization with the combined PCA
elements possesses a strong discriminating property that is exploited in the formulation. The re-
sulting parameterization is used in an inversion formulation to allow for the scenarios to compete
against each other in matching the observed data. The inversion solution is obtained via an effi-
cient gradient-based iterative algorithm. To formulate the inverse problem for feature extraction
with flow response data, we define the forward model that relates the data $d \in \mathbb{R}^{m \times 1}$ to the subsurface flow
property distribution $u \in \mathbb{R}^{n \times 1}$ (e.g., rock permeability at each of the $n$ grid cells in the model), as
$$d_{obs} = g(u) \qquad (2.1)$$
where g is the nonlinear mapping from the parameter to the data domain (for example, governing
partial differential equations (PDEs) for multi-phase flow in porous media). Note that a geologic
scenario refers to a conceptual model that can be used to generate many realizations (samples)
and u represents one such sample. Our objective is to use flow data to identify geologic scenarios
that are pertinent. To this end, a parameterization is needed to compactly represent the main
geological features (i.e., spatial patterns) in each scenario. That is, instead of dealing with many
samples pertaining to a given scenario, a parametrization method is adopted to summarize the main
features in the corresponding realizations. A classical parameterization method to accomplish this
objective is based on the PCA representation. Denoting a set of $L$ permeability model realizations
from the geologic scenario $s$ as $U^S = [u_1; u_2; \ldots; u_L]^S$, and the corresponding $k$-dimensional PCA
basis as $\Phi^S \in \mathbb{R}^{n \times k}$ ($k$ is the number of leading PCA basis elements), a sample $u_j^S$ from this geologic
scenario can have the following $k$-term PCA approximation:
$$u_j^S \approx \sum_{i} \phi_i^S \, v_{j,i}^S = \Phi^S v_j^S \qquad (2.2)$$
where $v_j^S$ is the corresponding $k$-dimensional vector of projected PCA coefficients. A similar pa-
rameterization can be obtained for each of the geologic scenarios. Assuming S proposed geologic
scenarios, one can obtain a set of S truncated PCA bases, i.e.,[F
1
;F
2
;:::;F
S
], each providing ap-
proximate representation of a given sample from their respective scenario,[u
1
j
;u
2
j
;:::;u
S
j
], with the
corresponding coefficients [v
1
j
;v
2
j
;:::;v
S
j
]
T
. By stacking the PCA bases from different scenarios,
we obtainF=[F
1
;F
2
;:::;F
s
], which can be used to compactly represent a sample from any of the
14
scenarios, i.e., u
j
=Fv
j
. With this parameterization, the feature extraction inversion solution can
be obtained by minimizing the mismatch between the observed and predicted data:
ˆ v= argmin
v
jjC
1=2
d
(d
obs
g(Fv))jj
2
2
(2.3)
where C
d
is the (diagonal) observation noise covariance [10]. Some important remarks regarding
this formulation are in order.
First, the basis corresponding to each geologic scenario can include a different number of
leading modes ($k$), depending on the complexity of the features in each scenario. However, if the
basis for each geologic scenario has dimension $n \times k$, the combined version $\Phi$ will have dimension
$n \times Sk$.
Second, the formulation assumes that the geologic scenarios are reasonably far from each other,
and that the number of leading PCs ($k$) is sufficiently small. With these assumptions, a model $u^S_j$ (i.e.,
from geologic scenario $s$) is expected to have a similar representation in $\Phi^S$ and in the combined
version $\Phi$. That is, the hypothesis is that in approximating a sample from a given geologic scenario, large-scale features that are contained in the leading PCs from other scenarios have little
contribution. In general, this behavior diminishes with an increasing number of leading PCs ($k$),
as discussed in the next section.
Third, by allowing the basis elements from all scenarios to contribute to the reconstruction of
a solution, the geologic scenarios compete to provide a solution that matches the flow response
data. As a result, it is possible for basis elements corresponding to more than one scenario to
contribute to the solution. This is an advantage of the algorithm as it allows multiple scenarios to
be selected (in practice, the correct geologic scenarios are unknown and one single scenario may
not be sufficient for describing the spatial features in the true field).
Fourth, in general, the flow response data tend to show sensitivity to large-scale variability in
the permeability field. Therefore, basis elements that preserve the large-scale features are used
to map the field connectivity. Additionally, using fewer parameters in the inversion leads to a
better-posed problem.
Fifth, although $\Phi$ is expected to form a basis with linearly independent columns (given the
distinct nature of the scenarios), the orthogonality of the individual PCA bases is lost after
combining them. This, however, only affects the inverse mapping (finding $v$ for a given $u$), which
is not needed in the formulation.
The nonlinear least-squares problem in Equation 2.3 can be solved to find $\hat{v}$, and hence the
corresponding $\hat{u} = \Phi\hat{v}$, using a standard gradient-based algorithm. The gradients of the objective
function with respect to the original property distribution $u$ can be computed using efficient adjoint
implementations. Given the linear form of the parameterization, the chain rule of differentiation
can be invoked to obtain the gradients with respect to the inversion parameters $v$. The next step is
to use the inversion solution $\hat{v}$ or $\hat{u}$ to identify which geologic scenarios are used in constructing
the solution. Although this may appear to be a simple task, it can prove to be quite challenging,
considering that the resulting inversion solution is expected to have low quality/resolution and can
include complex combined features, as discussed in the next section.
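To make this construction concrete, a minimal NumPy sketch of building the leading PCA elements for each scenario and stacking them into the hybrid basis is shown below. The array names, the realization matrices, and the number of retained elements are illustrative assumptions, not the exact implementation used in this dissertation.

```python
import numpy as np

def leading_pca_basis(U, k):
    """U: (n, L) matrix of L vectorized realizations from one scenario.
    Returns the n x k basis of the k leading principal components and the mean."""
    mean = U.mean(axis=1, keepdims=True)
    # SVD of the centered realizations; the columns of Phi are the sorted PCs
    Phi, _, _ = np.linalg.svd(U - mean, full_matrices=False)
    return Phi[:, :k], mean

# Hypothetical example: a list of per-scenario realization matrices and PC counts,
# e.g. realizations = [U1, U2, U3] and k_per_scenario = [8, 8, 8]
def build_hybrid_basis(realizations, k_per_scenario):
    bases = [leading_pca_basis(U, k)[0] for U, k in zip(realizations, k_per_scenario)]
    return np.hstack(bases)   # combined basis used in Equation 2.3
```

A gradient-based solver would then minimize the data mismatch in Equation 2.3 over the stacked coefficient vector $v$.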
To demonstrate the discriminating power of the leading PCA basis, Figure 2.1a shows an ex-
ample with two simple scenarios of circle and bar objects. The corresponding PCA elements for
each case are shown on the right side of Figure 2.1a. The PCA basis images in each case clearly
show an order from dominant large-scale features corresponding to the shape of the objects in their
respective prior models. It is also obvious that the non-leading basis images (e.g., No. 100) do
not carry any significant information about the underlying objects. Figure 2.1b illustrates the best
approximation (in RMSE sense) of a sample circle image with eight leading basis images from
each prior scenario and a combination of 8 circle and 8 bar basis images. It is evident from these
results that a circle image cannot be approximated by the leading PCA basis of the bar images. It
is also clear that adding the leading bar basis images to those of the circle basis images does not
provide a noticeable improvement in approximating the sample circle image. This implies that if
very few leading basis images for each prior scenario are included in the combined representa-
tion, only the basis images corresponding to the objects present in the target sample will have a
meaningful contribution to its approximation. It is important to emphasize that this discriminating
property diminishes as more non-leading elements of the prior scenarios are included. Therefore,
in the developed formulation, only very few leading PCA basis elements from each prior geologic
scenario are included in the combined parameterization to maintain the discriminating power of
the feature extraction approach.
Figure 2.1: (a) Samples from circle and bar images (left) and the corresponding (sorted) PCA basis
images (right). The first few leading elements clearly show the information about the correspond-
ing objects in the training data; (b) a sample circle image (Column 1) with its approximations using
the first 8 leading PCA elements of circle objects (Column 2) and bar objects (Column 3), and by
combining the first 8 leading PCA of the two objects (Column 4). Results show that the leading
PCA elements of the bar objects contribute very little to the reconstruction of circle objects.
2.2 Feature Recognition with CNN
With the salient features extracted, the next question is how to identify which scenarios the ex-
tracted features belong to. In general, the extracted features from the inversion problem are ex-
pected to contain inversion errors (artifacts) due to the limited data used. Moreover, the solution
may involve spatial patterns from more than one geologic scenario, which can complicate the fea-
ture recognition (geologic scenario selection) problem. Many of the existing quantitative measures
that are used to evaluate the similarity between images are less sensitive to spatial patterns and
may not provide a robust metric for comparing the spatial features in images. A seemingly natural
choice for deciding about the relevant geologic scenarios after the inversion process can be based
on the norm of the PCA coefficients (loadings) of each scenario in the parameterized solution (i.e.,
coefficients in $\hat{v}$). The expectation is that if a given scenario has a significant contribution, the
corresponding weights must be large. However, we found that this is not necessarily the case because
two or more scenarios can have large coefficients while their net contribution is minimal (that is,
they can cancel each other’s effects on the spatial representation). This issue can be avoided by
applying a regularization penalty function such as group-sparsity [46]. An alternative approach is
to use the spatial solution $\hat{u}$ for feature recognition. One way to accomplish this is by projecting
the inversion solution $\hat{u}$ (obtained using the basis for all possible scenarios) onto the individual PCA
basis of each geologic scenario and quantifying the projection (approximation) error (e.g., using the
$\ell_2$-norm of the projection residuals). In this case, the bases of geologic scenarios that have relevant
features are expected to provide smaller projection errors. While this method solves the issue encountered in the first approach, we observed that the projection errors of different scenarios can be
quite similar when quantified through the second norm ($\ell_2$) of the residuals. This is attributed to
the fact that the $\ell_2$-norm over pixel-based differences between two images is not sensitive to structural similarity (spatial patterns) in them. A pattern-based method is, therefore, more appropriate
for distinguishing images with different structural (connectivity) patterns. In this dissertation, we
use a convolutional neural network (CNN) as a powerful image classification method for this task.
A neural network (NN) consists of artificial neurons that use the input to change their inter-
nal state (activation) and generate a corresponding output, which makes the network effective in recognizing patterns.
Neural Networks are formed by connecting the output of certain neurons to the input of other
neurons, resulting in a directed weighted graph. Figure 2.2 (left) shows a simple NN consisting
of an input layer, two hidden layers, and one output layer. The mathematical operations asso-
ciated with NNs are simple and typically include weighted combinations followed by applying
nonlinear activation functions as shown in Figure 2.2 (right). The use of NN involves two parts,
training and prediction. In training mode, the training data (a set of inputs and the corresponding
output labels) are fed into the network, and the loss is calculated from the difference between the
network-predicted and true outputs. Then the weights in the network are adjusted to minimize the
network prediction errors. The minimization is implemented via gradient-based backpropagation
algorithms [147]. For classification problems, the weight assigned to each category is typically
calculated by a softmax (or normalized exponential function) of the form
$$\sigma_i(z) = \frac{e^{z_i}}{\sum_{k} e^{z_k}} \qquad (2.4)$$
The softmax function converts an arbitrary $K$-dimensional vector $z$ to a new $K$-dimensional
vector $\sigma(z)$ with entries in the range $(0, 1)$ that sum to 1. This conversion assigns probability values
to each output class [38]. With the output of the softmax function representing the probability
assigned to each class, a compatible loss function for training a classification model is the cross-
entropy function [38], defined as
$$H(p, q) = -\sum_{x} p(x)\,\log(q(x)) \qquad (2.5)$$
where p and q are the true and predicted labels, respectively.
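As a quick numerical illustration (not code from this dissertation), the following NumPy snippet evaluates the softmax of Equation 2.4 and the cross-entropy of Equation 2.5 for a hypothetical three-class output.

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the entries sum to 1
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p_true, q_pred, eps=1e-12):
    # H(p, q) = -sum_x p(x) log q(x)
    return -np.sum(p_true * np.log(q_pred + eps))

logits = np.array([2.0, 0.5, -1.0])    # hypothetical network outputs for 3 classes
probs = softmax(logits)                # class probabilities
label = np.array([1.0, 0.0, 0.0])      # one-hot true label
loss = cross_entropy(label, probs)
```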
The power of NN comes from stacking several layers and training the resulting network with
abundant relevant data (to learn the weights). In fact, NN can be viewed as a nonlinear regression
(or parameter estimation) problem where the observed (training) data are used to estimate the
parameters (weights) of the model (network). While stacking layers increases the complexity and
power of NN, it also results in a substantial increase in the number of parameters (weights) and
nonlinearity, especially in fully connected networks, which can lead to vanishing gradients and
overfitting issues.
Figure 2.2: A Neural network (NN) sketch with fully connected structure (left) and the basic
operations applied to the input connected neurons (right).
A convolutional neural network contains convolutional layers in which several filters or ker-
nels (with learnable weights) scan the input image and perform local convolution operations [102,
103, 101, 104], as shown in Figure 2.3 (Top). Several filters are used to capture different spatial at-
tributes of the input image. Each convolutional layer is typically followed by an activation function
that operates on the convolution results and provides an output that indicates whether the node is
active for the given input. In this dissertation, we use Rectified Linear Unit (ReLU) as the activation
function. The activation layer is followed by a pooling or down-sampling layer, which is designed
to reduce the size of its input. Typical pooling operations are average pooling (replacing a subset
of inputs with their average value) and max-pooling (replacing a subset of inputs with their max-
imum value), of which we use the latter in this dissertation. Figure 2.3 (Bottom) illustrates these
operations for a combined layer of CNN that includes the sequence of Convolution-ReLU-Pooling
operations, with a simple example showing the typical input-output of each operation (right). To
generate the output of the CNN as a classification model (Figure 2.3, Top), the output of the final
pooling step is flattened and the resulting outputs are fully connected to each of the S classes, followed by
a softmax transformation.
As a classification model with parameters $\theta$, the CNN takes as input a spatial image $x$ (extracted
geologic features via inversion) and calculates the probability assigned to the corresponding plausible scenarios, that is $y_\theta(x)$. If the number of geologic scenarios is $S$, the output is represented as a
Figure 2.3: Top: Overall architecture of convolutional neural network (CNN); (Bottom): Illustra-
tion of a CNN layer containing Convolution-ReLU-Pooling (left), with a simple example depicting
the operation involved (right).
one-hot vector encoding $y_\theta(x) \in \mathbb{R}^S$, where each entry represents the likelihood that the sample
belongs to the corresponding scenario. During training, the following cross-entropy loss function is minimized by
adjusting the network parameters $\theta$:

$$\mathcal{L}(z, y_\theta) = -\sum_{i=0}^{S} \big[\, y_i \log \sigma(z_i) + (1 - y_i)\,\log(1 - \sigma(z_i)) \,\big]$$

where $\sigma$ is the sigmoid function and $z$ is the data label. The cross-entropy loss corresponds to the
Kullback-Leibler divergence between the empirical distribution of the input data and the predicted
distribution, which is minimized during training.
Compared to a fully connected layer, a convolutional layer performs local operations with
small filter sizes, which leads to the shift-invariance property of CNN, as well as a significant re-
duction in the number of parameters (weights). Unlike classical image processing methods where
convolution filters are pre-computed, weights in the convolutional layers are learned from train-
ing data. The learned filters are then applied to an input image in forward (prediction) mode for
image analysis (for example, classification). It is important to note that CNN usually contains
both convolutional layers and fully connected layers. In general, the convolutional layers extract
spatial features, while the fully connected layers, usually the last few layers, serve as classifica-
tion layers. The convolutional layer is often followed by a pooling or down-sampling layer to
compactly represent the salient information in each kernel by replacing the convolution outputs
with either its maximum value (max pooling) or mean value (average pooling). CNN architectures
have achieved excellent performance in pattern recognition [92, 159, 164]. The success of CNN
can be partly attributed to the fact that the local connectivity in its architecture is well suited for
capturing important information in natural images, which tends to be local. The drawbacks of
CNN include the ad-hoc nature of its design, the number of tuning parameters involved, and its
black-box nature, which makes it difficult to explain its behavior and learning mechanism. Despite
these limitations, CNN has enjoyed great success in many applications and remains a growing
field of research. Given the complexities associated with assigning geological scenarios based on
extracted features in our workflow, and the success of CNN for feature detection, we adopt CNN
as an effective pattern-based tool for feature recognition to predict the relevant geologic scenarios
for a given input image.
Construction of a CNN architecture requires tuning through a trial-and-error process, where
different design parameters are adjusted until acceptable performance in training, validation, and
test data is achieved. An important consideration in the design and training of neural networks, including
CNN, is overfitting. Without regularization, redundant weights or limited training data can result
in the networks showing great performance during the training stage, but very poor performance
on test cases (prediction). Some common regularization techniques that are used to prevent over-
fitting include weight decay [93], dropout [166], and batch normalization [61]. Weight decay is
equivalent to having a quadratic penalty term on the weights. Dropout improves overfitting by
randomly dropping out neurons and fitting the training data with a subset of weights. Hence, it
imparts robustness on the trained network and can be viewed as an approximate ensemble learning
framework. Although it was initially applied to fully connected layers, more recently, its applica-
tion with CNN has been shown to be effective [134]. Batch normalization re-centers and re-scales
data between layers. Although batch normalization and dropout are two of the most powerful
regularization techniques, recent results in the literature suggest that combining them can lead to
performance degradation [111]. In this study, we use the combination of dropout and weight decay
as the regularization approach for training the CNN.
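As an illustration of this regularization choice, a minimal Keras sketch of a convolutional block with an $\ell_2$ weight-decay penalty and dropout placed before the activation is shown below; the filter count, kernel size, and rates are placeholder values rather than the settings used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Hypothetical layer sizes and rates, for illustration only
regularized_block = tf.keras.Sequential([
    layers.Conv2D(
        filters=10, kernel_size=(5, 5), padding="same",
        kernel_regularizer=regularizers.l2(1e-4)),   # weight decay on the filters
    layers.Dropout(0.25),                            # dropout before the activation
    layers.ReLU(),
    layers.AveragePooling2D(pool_size=(2, 2)),
])
```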
Although CNN has shown great performance for pattern learning in spatial images, the inter-
pretability of CNN is important for real-world applications and design purposes. The main aspects
of the interpretability are the features learned in different parts of CNN and the features used for
prediction (i.e., attribution) [131]. Visualization tools such as Grad-CAM can be used to under-
stand the importance of different regions of the input image in predicting the output [156]. Another
feature visualization tool is the activation maximization algorithm, in which the objective is to find
an input image that maximizes the activation of a specific neuron or filter in the network. These vi-
sualization tools offer useful interpretations of the learned representations in CNN. We used these
methods in our work to gain insight into the design and tuning of the CNN architecture.
2.3 Training Data
In this work, the input image for the feature recognition (classification) task is the solution of the
corresponding feature extraction inverse problem, which is estimated from incomplete flow data
and involves inversion errors. This implies that the input data for training CNN should also rep-
resent solutions of similar inverse problems, while the output remains the correct label associated
with each input. However, generating input data that represent the solution of an inverse problem
is not practical, as it requires solving an inverse problem for each training sample. As an approxi-
mation, one may simply use, as training data, the exact representation of prior models and their
corresponding scenario labels. A better alternative is to use PCA approximation of the original
images. While the latter approach is consistent with the PCA parameterization that is used in the
feature extraction step, it does not account for the errors and artifacts that are introduced during
inversion. To account for the inversion errors in the training data, we added spatially correlated
errors to the PCA approximations of the original models. For each scenario, this was done by in-
troducing leading PCs from other scenarios with random coefficients. This approach is consistent
with the intended application of CNN as the expected inversion solution artifacts are spanned by
the hybrid PCA parameterization that is used in feature extraction. Tensorflow [2] was used to
build and train the CNN in our study. For the examples presented in the next section, the specific
parameters of the CNN model are summarized in the corresponding tables.
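The following NumPy sketch illustrates one way to generate such training inputs from the per-scenario PCA bases described earlier; the noise magnitude and the random-coefficient scheme are illustrative assumptions, not the exact settings used in this work.

```python
import numpy as np

def make_training_input(u, Phi_own, Phi_others, noise_scale=0.1, rng=None):
    """Project a realization u onto its own leading PCA basis, then add
    spatially correlated 'inversion-like' artifacts spanned by the leading
    PCs of the other scenarios, with random coefficients."""
    rng = np.random.default_rng() if rng is None else rng
    u_pca = Phi_own @ (Phi_own.T @ u)                 # low-rank PCA approximation
    artifact = np.zeros_like(u_pca)
    for Phi_s in Phi_others:                          # leading PCs of other scenarios
        coeffs = noise_scale * rng.standard_normal(Phi_s.shape[1])
        artifact += Phi_s @ coeffs                    # correlated error component
    return u_pca + artifact                           # CNN training input; label unchanged
```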
2.4 Iterative Workflow
In the proposed workflow, we assume that multiple realizations from each geologic scenario are
available. A CNN model is first designed and trained with these realizations. The scenarios are then
parameterized into PCA components using their corresponding realizations. Initially, the number
of PCA elements from each scenario in the hybrid basis, $k_0 = [k^1_0; k^2_0; k^3_0; \dots; k^S_0]$, is selected to retain
the same amount of cumulative energy (variance) for each scenario. This approach accounts for the
variability in the level of complexity across the scenarios (e.g., straight and meandering channels)
to ensure that initially each scenario is equally represented. We note that the retained level of
energy is one of the hyper-parameters that depends on the complexity of the scenarios, similarity
between scenarios, and the quality and quantity of the flow response data. When the scenarios are
distinct, only a few leading PCs can provide significant distinguishing power. The total number of
PCs in the hybrid basis, $\sum k$, is then fixed throughout the workflow. With the initial hybrid basis,
the feature extraction step is performed and the solution is passed to CNN for geologic scenario
selection. The presence of multiple (potentially irrelevant) scenarios in the extracted features,
however, may have an impact on the quality of the solution. Therefore, in order to improve the
accuracy of the algorithm, we implement an adaptive algorithm by iteratively performing the two
steps.
In the iterative formulation, the number of PCs that are included from each scenario is scaled
according to the weights assigned to that scenario during feature recognition with CNN. At iter-
ation i, the spatial solution from the feature extraction step is based on a hybrid basis that has a
composition $k_i$. With the pattern-based probabilities $P_i = [P^1_i; P^2_i; P^3_i; \dots; P^S_i]$ assigned by CNN,
the new composition of the hybrid basis $k_{i+1}$ is updated by linearly adjusting the number of cor-
responding PCs for each scenario. We performed an extensive set of numerical experiments to
evaluate the performance of this iterative approach. In general, two situations are commonly en-
countered: (a) the flow response is sensitive to the significant features of each scenario and the
hybrid basis is dominated by the correct scenario within one or two iterations; and (b) data is in-
sufficient to discriminate between the scenarios, or a mixture of features from different scenarios
remain in the solution after several iterations. To deal with the latter situation, a maximum number
of iterations is set to terminate the algorithm. Overall, the formula for updating $k_i$ is:

$$k_{i+1} = k_i + \alpha_i \left( \frac{\sum k_0}{\sum P_i\, k_0}\, P_i\, k_0 \;-\; k_i \right) \qquad (2.6)$$
where $k_0$ accounts for the complexity difference between scenarios, the products of $P_i$ and $k_0$ are taken elementwise (per scenario), and $\alpha_i$ is the adaptive step
size that controls the updating.
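A minimal NumPy sketch of this update rule (Equation 2.6) is given below; rounding the result to whole PC counts is an illustrative choice that is not specified in the text.

```python
import numpy as np

def update_composition(k_i, k_0, P_i, alpha_i):
    """One iteration of Equation 2.6: redistribute the fixed PC budget among
    scenarios according to the CNN probabilities P_i, weighted by the initial
    (complexity-based) composition k_0. All arguments are length-S arrays/scalars."""
    target = np.sum(k_0) * (P_i * k_0) / np.sum(P_i * k_0)
    k_next = k_i + alpha_i * (target - k_i)
    return np.rint(k_next).astype(int)   # illustrative: round to whole PC counts
```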
2.5 Numerical Experiments
To test the performance of the developed workflow with nonlinear response data, we apply it to
several pumping test experiments involving single-phase flow in two-dimensional and three-dimensional
aquifers [122, 60]. We evaluate the performance of the method by computing a dominating rate as
well as a surviving rate of the correct scenario. The dominating rate refers to the fraction of solu-
tions in which the correct scenario is considered as the most likely scenario by the workflow. The
surviving rate refers to the fraction of solutions in which the correct scenario is promoted by the
workflow (having a likelihood that is larger than $\frac{1}{N_{scenario}}$, where $N_{scenario}$ is the total number of
scenarios). The maximum number of iterations is set to 3 and the step size is defined as $\alpha_{i+1} = \frac{1}{2}\alpha_i$ accordingly.
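For clarity, the two metrics can be computed from the per-case scenario likelihoods as in the short hypothetical helper below (not code from this work).

```python
import numpy as np

def dominating_and_surviving_rates(probs, true_labels):
    """probs: (n_cases, n_scenarios) likelihoods from the workflow;
    true_labels: (n_cases,) indices of the correct scenarios."""
    n_cases, n_scenarios = probs.shape
    dominating = np.mean(np.argmax(probs, axis=1) == true_labels)
    surviving = np.mean(probs[np.arange(n_cases), true_labels] > 1.0 / n_scenarios)
    return dominating, surviving
```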
2.5.1 Pumping Test with Non-Gaussian 2D Model
For the pumping test, we consider non-Gaussian models with distinct alluvial channel patterns
(Figure 2.4). The models have a dimension of 100 × 100. The geologic scenarios are defined
based on the orientation and connectivity of the alluvial channel features. These scenarios are
derived from three training images with meandering, intersecting, and straight channels, by chang-
ing the channel orientations. Two diagonal directions are used for the intersecting and straight
channels. Two-facies realizations are generated using the SNESIM algorithm [20] implemented in
S-GeMS [144]. Within-facies heterogeneity is modeled using multi-Gaussian random fields. The
logarithm of hydraulic conductivity inside and outside the channels has a mean value of 0.6 and -
2.4, respectively. For each scenario, a total of 500 realizations are generated, of which 400 are used
to construct the PCA bases and to train the CNN. The remaining 100 models from each scenario
are divided into two sets of 50, one used for validation to tune the hyper-parameters and the other
for testing the performance of the algorithm. General information about the architecture of the
CNN model is listed in Table 2.1. Rectified Linear Unit (ReLU) is used as the activation layer for
the convolutional and fully connected layers. Dropout is added before ReLU in the convolutional
layers. In training, the spatial models are first approximated with a PCA basis that contains 80%
energy, and both correlated and uncorrelated noise are then added, as shown in Figure 2.5.
We set up an aquifer pumping test where a pumping well is extracting water at a constant rate
and the distribution of the logarithm of hydraulic conductivity is unknown. The unit of hydraulic
conductivity is cm/s while the non-Gaussian models are used as the distribution of log hydraulic
conductivity. The configuration of the pumping test is shown in Figure 2.6(a). The aquifer has a
dimension of 1000 m × 1000 m × 10 m. The top and the bottom of the domain are no-flow bound-
aries. The pressures at the left and right boundary are 20 m and 10 m, respectively. The porosity in
the domain has a constant value of 0.36. The background water flow due to the pressure difference
Figure 2.4: Training images of meandering, intersecting and straight fluvial channels with corre-
sponding realizations that have different channel orientations, which leads to 5 different geologic
scenarios.
in the field is from left to right. The pumping rate is 3000 m³/day. A total of 36 monitoring wells
are uniformly placed in the aquifer, where the log hydraulic conductivity and pressure head are
observed. Neglecting diffusion and dispersion [24], the governing equations, mass conservation
and Darcy’s law, are expressed as:
$$\frac{\partial(\phi\rho)}{\partial t} = -\nabla\cdot(\rho v) + q \qquad (2.7)$$

$$v = -\frac{1}{\mu}\, K\,(\nabla p - \rho g \nabla z) \qquad (2.8)$$
Figure 2.5: Generation of training data for CNN based on PCA approximation of the original
models with correlated noise added to represent expected inversion errors and artifacts.
Table 2.1: Details of each layer of the CNN model used in the 2D pumping test, where the filter
size is height × width and the output dimension is height × width × channel.

Layer Number   Layer Type   Filter Size   Output Dimension
1              Input        -             100 × 100 × 1
2              Conv2D       5 × 5         100 × 100 × 10
3              AvgPool2D    2 × 2         50 × 50 × 10
4              Conv2D       5 × 5         25 × 25 × 20
5              AvgPool2D    2 × 2         25 × 25 × 20
6              Conv2D       3 × 3         25 × 25 × 40
7              AvgPool2D    2 × 2         12 × 12 × 40
8              Conv2D       3 × 3         12 × 12 × 80
9              AvgPool2D    2 × 2         6 × 6 × 80
10             Flatten      -             2880
11             FC           -             5
where $\phi$, $\rho$, $t$, $v$, $q$, $\mu$, $K$, and $p$ are porosity, density, time, velocity, source term, viscosity,
hydraulic conductivity, and pressure, respectively. The feature extraction is implemented using gradient-based inversion.
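A rough Keras sketch of a classifier along the lines of Table 2.1 is shown below. It is an approximate reconstruction for illustration only: the layer widths follow the table, but the padding, strides, dropout rate, weight decay, and training settings are assumptions rather than the exact model used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_scenario_classifier(n_scenarios=5, dropout_rate=0.25, weight_decay=1e-4):
    reg = tf.keras.regularizers.l2(weight_decay)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(100, 100, 1)),
        layers.Conv2D(10, 5, padding="same", kernel_regularizer=reg),
        layers.Dropout(dropout_rate), layers.ReLU(),
        layers.AveragePooling2D(2),
        layers.Conv2D(20, 5, padding="same", kernel_regularizer=reg),
        layers.Dropout(dropout_rate), layers.ReLU(),
        layers.AveragePooling2D(2),
        layers.Conv2D(40, 3, padding="same", kernel_regularizer=reg),
        layers.Dropout(dropout_rate), layers.ReLU(),
        layers.AveragePooling2D(2),
        layers.Conv2D(80, 3, padding="same", kernel_regularizer=reg),
        layers.Dropout(dropout_rate), layers.ReLU(),
        layers.AveragePooling2D(2),
        layers.Flatten(),                               # 6 x 6 x 80 = 2880 features
        layers.Dense(n_scenarios, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```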
One of the test cases from Scenario 1 is shown in Figure 2.6(b). Since Scenario 1 (meandering
channel) has significantly more variability than the other scenarios, the initial hybrid basis has more
elements from this scenario (initially, the basis elements represent an equal amount of variance or
energy from each scenario). Figure 2.6(d) shows the details of the solution iterations. In the first
iteration, the recovered feature shows disconnected channel features because PCs from several
geologic scenarios are included. However, CNN is able to recognize the meandering feature and
increases the weight assigned to the correct scenario. The next iterations only confirm the relevance
of the scenario with meandering channels and identify it as the dominant scenario. It is important to
note that the main objective of the PCA-based parameterization is to screen the geologic scenarios
using extremely low-rank (approximate) representations. Once the geologic scenario is identified,
existing model calibration methods that can be applied to non-Gaussian problems can be adopted
to obtain more detailed calibrated models [71].
In Figure 2.7, a failed test case from Scenario 1 is illustrated. In this case, the reference model
(Figure 2.7(b)) has both meandering and intersecting channel features. The center plots in Figure
2.7(d) show that the recovered feature in the solution also contains features that can be identified as
Figure 2.6: A successful scenario identification example for the 2D fluvial system: (a) The config-
uration of the pumping test with the 2D fluvial systems, (b) reference model from Scenario 1, (c)
final likelihoods assigned to each scenario, (d) solution iterations.
intersecting channels, with some meandering patterns. Therefore, the intersecting scenario (Sce-
nario 3) is assigned the highest probability and the correct scenario (Scenario 1) is identified with
a lower likelihood. Given the South-East to North-West orientation of the identified features, CNN
also assigns a small probability to Scenario 5. The results from several experiments with different
reference scenarios (discussed below) indicate that CNN is successful in identifying the scenarios
that are likely to contribute to the identified patterns in the feature extraction problem.
The above pumping experiment was repeated for 250 test cases (50 per scenario), and the
results are summarized in Figure 2.8. On average, in approximately 72% of the test cases the
highest probability is assigned to the correct scenario (dominating rate) while in 92.8% of the cases
the correct scenario was included in the solution (surviving rate). The dominating and surviving
rates in the first iteration are 64% and 90%, showing the additional improvement achieved through
the iterative scheme. However, for scenarios with overlapping features (straight and intersecting
scenarios that have the same directionality), these probabilities are lower, indicating that they are
harder to distinguish based on the extracted features in the flow data inversion problem. In those
cases, the algorithm tends to promote and maintain the scenarios with overlapping features if they
are supported by the data. As a result, the surviving rate is a more reliable measure of performance
when scenarios have overlap.
2.5.2 Pumping Test with 3D Model
To further evaluate the performance of the developed approach, we consider 3D models that are
adapted from the SAIGUP geological formation [60]. The configuration of the pumping test is
shown in Figure 2.9(a). The area under investigation has a dimension of 750 m × 2250 m × 20 m, which
is divided into a 3D mesh with 40 × 120 × 40 cells. A total of 12 evenly spaced observation wells
are distributed throughout the aquifer. The logarithm of hydraulic conductivity and pressure in the
observation wells are measured (with 10% measurement noise) and used for screening the geologic
scenarios. The pumping rate is fixed at 9000 m³/day. An example of the facies distribution, with
four distinct facies types, is shown in Figure 2.9(b). To account for the uncertainty in the conceptual
Figure 2.7: A failed test case from Scenario 1 in the pumping test with the 2D fluvial systems:
(a) pumping test domain and well configurations, (b) reference log-hydraulic conductivity model
from Scenario 1, (c) final likelihoods assigned to different scenarios, (d) solution iterations.
Figure 2.8: The confusion matrix based on 250 test cases showing the probabilities of dominating
(left) and survival (right) for the correct scenario in the pumping test with fluvial realizations.
Table 2.2: Details of each layer of the CNN model used in the 3D pumping test, where the filter
size is depth × height × width and the output dimension is depth × height × width × channel.

Layer Number   Layer Type   Filter Size    Output Dimension
1              Input        -              40 × 120 × 40 × 1
2              Conv3D       3 × 3 × 3      40 × 120 × 40 × 2
3              AvgPool3D    2 × 2 × 2      20 × 60 × 20 × 2
4              Conv3D       3 × 3 × 3      20 × 60 × 20 × 4
5              AvgPool3D    2 × 2 × 2      10 × 30 × 10 × 4
6              Conv3D       2 × 2 × 2      10 × 30 × 10 × 8
7              AvgPool3D    2 × 2 × 2      5 × 15 × 5 × 8
8              Conv3D       2 × 2 × 2      5 × 15 × 5 × 16
9              AvgPool3D    2 × 2 × 2      2 × 7 × 2 × 16
10             Flatten      -              448
11             FC           -              8
geologic model, eight scenarios with different aggradation angle and azimuth are considered, as
shown in Figure 2.10. Scenarios 1, 2, 3, and 4 have a low aggradation angle while the aggradation
angles of Scenarios 5, 6, 7, and 8 are high. The azimuths of the different scenarios are as follows:
Scenarios 1 and 5 = 0°, Scenarios 2 and 6 = 45°, Scenarios 3 and 7 = 90°, and Scenarios 4 and 8 =
135°. The parasequence is 2 for all the scenarios. A total of 500 realizations from each scenario
were generated, with 400 realizations used for training, 50 for validation, and 50 as test cases. A
CNN model with 3D convolutional and pooling layers is built for feature recognition. The details
of the CNN model are summarized in Table 2.2.
Figure 2.9: The pumping test field configuration (a) and an example of a multi-facies model for
the 3D experiments (b).
Figure 2.10: The eight geologic scenarios for the multi-facies examples with different combina-
tions of aggradation angle and paleo-azimuth.
The geological features are recovered through gradient-based inversion as discussed in the
Methodology section. Since the scenarios share a similar level of variability, the first 8 leading
PCs from each scenario are included initially. For the test case from Scenario 8 (shown in Figure
2.9(b)), Figure 2.11 shows that the reconstruction with the initial basis (Iteration 1) is far from
the reference case. However, after one iteration, CNN is able to identify the scenarios that have
the correct azimuth (Scenarios 4 and 8 with azimuth 135°). Once the scenarios with incorrect
azimuth are eliminated, the existing spatial patterns are recovered more accurately, enabling CNN
to identify the correct scenario (Scenario 8) in later iterations. The results from 400 test cases
(50 test cases per scenario) are shown in Figure 2.12. The overall dominating and surviving rates
across all scenarios are 82.5% and 86.5%, showing the effectiveness of the workflow in recovering
the geologic scenario from flow data in this case study.
To test the method, we also considered a reference case that does not belong to any of the
scenarios. In this case, the reference model has a high aggradation angle and 0° azimuth (same as
Scenario 5) but a parasequence of 6 (Figure 2.13(a)). Figure 2.13(b) shows that the parasequence
isn’t correctly extracted (yz surface is very noisy) because this feature isn’t included in the training
data. However, the azimuth and aggradation angle, features that are included in the training dataset, are
extracted. Hence, the two relevant scenarios, Scenarios 5 and 1, are promoted, with the closest
scenario to the true scenario (Scenario 5) being assigned the highest likelihood.
2.6 Discussion
We present a workflow for screening geologic scenarios based on flow response data to reduce
geologic uncertainty. The proposed method is computationally efficient since only a small number
of forward simulations are involved. The presented workflow consists of two main components:
feature extraction and feature recognition. The feature extraction step takes advantage of the dis-
criminating power of leading PCs to effectively extract the main features that are supported by the data.
The feature recognition step leverages the pattern-based classification power of CNN to provide
Figure 2.11: (a) The results from a successful case in the 3D pumping test, with the PCA com-
position before each iteration (left), the extracted spatial features from Step 1 (middle), and the
inferred geologic scenarios from CNN (right).
Figure 2.12: The confusion matrix based on 400 test cases showing the probabilities of dominating
(a) and surviving (b) for the correct scenario in the 3D pumping test.
Figure 2.13: A test case in which the reference model does not belong to any of the proposed
scenarios: (a) reference log-hydraulic conductivity model, (b) solution iterations with the PCA
composition before each iteration (left), the extracted spatial features from Step 1 (middle), and
the inferred geologic scenarios from CNN (right).
feature-based distances that can be used to distinguish between different geologic scenarios. An
iterative scheme is proposed to improve the accuracy of the method by gradually adapting the so-
lution to the identified scenarios. The proposed method is evaluated using aquifer pumping tests
in non-Gaussian two-dimensional models and a three-dimensional case study. The results over
several hundred test cases, involving two-dimensional and three-dimensional aquifers, are used to
show the effectiveness of inversion-based feature extraction and CNN-based feature recognition
for screening geologic scenarios based on flow response data. In most cases, our workflow is able
to extract the correct salient features from flow data and yields reliable predictions of relevant ge-
ologic scenarios. In particular, the robustness of CNN was evident in consistently identifying the
geologic scenarios that could potentially contribute to the feature extraction solution, even when
such solutions did not uniquely capture the geologic features.
In conclusion, the presented geologic screening method offers an efficient and flexible approach
for reducing the uncertainty in geologic scenarios prior to performing model calibration. The
method takes advantage of the discriminating power of the leading PCA elements from different
geologic scenarios and uses the flow data in a gradient-based workflow to construct a spatial map
of aquifer hydraulic conductivity. The constructed map does not include small scale details and
is intended for scenario identification. The extracted features indicate which spatial connectivity
pattern can be used to reproduce the flow data. The resulting map is then used in a trained CNN to
identify the geologic scenarios that are consistent with the extracted spatial patterns. It is important
to note that the workflow presented in this dissertation serves as a tool for geologists to include
the information in pressure and flow data to reduce their uncertainty about the plausible geologic
scenarios. In some cases, the results from our method may be used to revise the proposed geologic
scenarios or generate new geologic scenarios that were not included in the originally proposed
scenarios. Finally, the presented algorithm and the assumptions used in this dissertation are not
intended for model calibration. More elaborate probabilistic and deterministic model calibration
methods exist in the literature that can use the selected geologic scenarios from our method to
perform detailed model calibration.
Chapter 3
Deep Convolutional Autoencoders for Robust Flow Model
Calibration under Uncertainty in Geologic Continuity
3.1 Subsurface Flow Model Calibration
We consider inverse problems in single-phase and two-phase flow through porous media. The
governing equations for the single-phase flow consist of mass conservation and Darcy’s law and
can be expressed as
$$\frac{\partial(\phi\rho)}{\partial t} = \nabla\cdot\left(\frac{\rho}{\mu}\, k\,(\nabla p - \rho g \nabla z)\right) + q \qquad (3.1)$$
where $\phi$, $\rho$, $t$, $v$, $q$, $\mu$, $k$, and $p$ are porosity, density, time, velocity, sink/source term, viscosity,
permeability, and pressure, respectively [24]. With proper initial and boundary conditions, the
above equation can be discretized and solved numerically, to obtain the pressure distribution in an
aquifer for different source and sink disturbances. A typical groundwater application is a pumping
test, where water is extracted from a well and results in pressure drawdown in the aquifer. The
pressure drop at different locations in the aquifer is typically measured using observation wells.
The resulting flow-related measurements are used to infer the spatial distribution of permeability
or hydraulic conductivity ($K = k\,\frac{\rho g}{\mu}$).
The governing equations of a two-phase flow system, under incompressible and immiscible
fluids assumption, can be written as
$$\nabla\cdot\left(\frac{\lambda_\alpha}{B_\alpha}\, k\,(\nabla p_\alpha - \gamma_\alpha \nabla z)\right) = \frac{\partial}{\partial t}\left(\frac{\phi S_\alpha}{B_\alpha}\right) + q_\alpha\,, \qquad \alpha = w, n \qquad (3.2)$$
where $\lambda$, $B$, $k$, $p$, $\gamma$, $z$, $S$, and $q$ are the mobility, formation volume factor, intrinsic permeability,
pressure, phase density, gravity potential, phase saturation, and flux, respectively [24]. The subscript $\alpha$ represents the wetting ($w$) and non-wetting ($n$) phases. While in two-phase flow problems,
additional parameters are included to describe the rock-fluid and fluid-fluid interactions, one of
the key parameters of interest that control the fluid flow remains the permeability distribution.
Therefore, in what follows, we will focus on model calibration problems that are used to reconstruct
the grid-based description of heterogeneous permeability distributions using monitoring
pressure head and flowrate measurements. The unknown parameters of interest are the vectorized
map of the permeability field, denoted as $x$, while the measurements used to infer the unknowns are
denoted as $d$. In real applications where the permeability is heterogeneous, the dimension of $x$
is significantly higher than that of the measurement vector $d$, which leads to many non-unique
solutions. Therefore, parameterization and/or regularization techniques are adopted to constrain
the solution search space and alleviate the nonuniqueness issue.
In formulating the inverse problem, the forward model that relates the unknown permeability
map as parameters ($x$) to the observable pressure head measurements as the response of the system
($d$) can be expressed as

$$d = g(x) \qquad (3.3)$$

where $g$ is the nonlinear mapping that is implicitly defined by the underlying governing flow equations (e.g., Equations 3.1 or 3.2). Using the observed measurements from the field, $d_{obs}$, the following
data mismatch function can be minimized to estimate the unknown parameters:

$$J(x) = \big\| C_d^{-1/2}\,(d_{obs} - g(x)) \big\|_2^2 \qquad (3.4)$$
The notation $C_d$ refers to the covariance matrix of the measurement noise [9]. In general, geologic
formations exhibit heterogeneous rock property distributions whose detailed descriptions lead to
high-dimensional parameters. For instance, for a grid size of n, a grid-based description of the
permeability distribution leads to n parameters. On the other hand, very few m(<< n) observa-
tions (pressure head) are typically available through drilling observation wells at very scattered
locations. As discussed in the Introduction, a common approach to address the discrepancy in the
number of available measurements and unknown model parameters is through low-dimensional
representation of unknown parameters (a.k.a, parameterization). Assuming the mapping
$$x_{n\times 1} \approx P(z_{r\times 1}) \qquad (3.5)$$
that transforms the unknown n-dimensional parameter vector x to a corresponding r(<< n)-dimensional
parameter vector z, where the approximation quality is acceptable, the inverse problem for infer-
ence of x can be solved in terms of z, with far fewer parameters. In this case, the spatial distribution
of the permeability map is controlled by a small number of new parameters that will be used in
solving the inverse problem. After substituting this parameterization form into Equation 3.4, the
resulting minimization objective function can be expressed as:
$$J(z) = \big\| C_d^{-1/2}\,(d_{obs} - g(P(z))) \big\|_2^2 \qquad (3.6)$$
In the case of PCA, the mapping function $P$ is linear, that is

$$x_{n\times 1} \approx P_{n\times r}\, z_{r\times 1} \qquad (3.7)$$
in which $P$ is defined by the first $r$ leading eigenvectors of the covariance of $x$. Since
the eigenvectors are orthogonal, the inverse transform is conveniently calculated as $z = P^T x$. The
gradient of the data mismatch objective function in this case is

$$G = \frac{\partial J(z)}{\partial z} = \frac{\partial J(x)}{\partial x}\, P \qquad (3.8)$$
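As a minimal NumPy illustration of this linear parameterization (hypothetical arrays, not code from this dissertation), the leading-eigenvector basis, the projection, and the chain-rule gradient can be written as follows.

```python
import numpy as np

def pca_basis(X, r):
    """X: (n, N) matrix of N training permeability maps. Returns the n x r basis P
    of the r leading eigenvectors of the sample covariance, plus the sample mean."""
    mean = X.mean(axis=1, keepdims=True)
    P, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    return P[:, :r], mean

# x ~ mean + P z  and  z = P^T (x - mean), since the columns of P are orthonormal.
# Chain rule: dJ/dz = dJ/dx @ P, where dJ/dx comes from an adjoint or perturbation run.
def latent_gradient(dJ_dx, P):
    return dJ_dx @ P
```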
When VAE is used for parameterization, the mapping function is not linear and is represented
by the decoder part of the trained encoder/decoder pair, that is

$$P(z) = \mathrm{Dec}_\psi(z)\,; \qquad z = \mathrm{Enc}_\theta(x) \qquad (3.9)$$
the details of which are discussed in the next section. As discussed there, since $z$
assumes a normal distribution with zero mean and unit variance, this prior information can be
used to regularize the inverse problem in Equation 3.6. Specifically, the resulting inverse modeling
objective function can be expressed as
$$J_{reg}(z) = \big\| C_d^{-1/2}\,(d_{obs} - g(P(z))) \big\|_2^2 + \gamma\, \| z \|_2^2 \qquad (3.10)$$
where the second term is a regularization term that is used to promote solutions z that follow a
standard normal distribution. In this dissertation, we apply a gradient-based approach to minimize
the above objective function. The gradient of the objective function with respect to the latent
variables is computed by invoking the chain rule of differentiation as follows
$$G = \frac{\partial J_{reg}(z)}{\partial z} = \frac{\partial J(z)}{\partial P(z)}\,\frac{\partial P(z)}{\partial z} + \gamma z \qquad (3.11)$$
where the two main components are (i) the gradient of the data mismatch objective function with
respect to the spatial distribution of permeability, $\frac{\partial J(z)}{\partial P(z)}$, which can be computed using either numerical approximation or an efficient adjoint model, and (ii) the gradient of the permeability distribution with respect to the latent variables, $\frac{\partial P(z)}{\partial z}$, which is derived from the trained neural network.
Adjoint models are used to analytically derive the sensitivities of model predictions to input param-
eters. In our application, adjoint models are derived by invoking the Lagrangian approach to solv-
ing constrained minimization problems, where the constraints are the governing flow equations and
the minimization cost function is the model calibration objective function (data mismatch). The
solution of the resulting constrained minimization leads to a recursive expression (adjoint model)
that is solved backward in time to yield the required gradients [132]. Alternatively, given the low
dimension of the latent variables, a finite-difference approximation may be used to compute the
gradients of the flow response with respect to the parameters of the inversion (latent variables).
An important advantage of low-dimensional parameterization is the efficiency in numerically
approximating the gradient of the objective function directly with respect to the latent variables,
that is $\frac{\partial J_{reg}(z)}{\partial z}$, using perturbation schemes. As is the case in regularized nonlinear inverse problems, the optimal value for the regularization parameter $\gamma$ is not trivial to obtain, and sensitivity
analysis on the validation set is used to tune this parameter. Next, we discuss the VAE as a nonlinear operator $P$ that can be used to parameterize a high-dimensional grid-based permeability
distribution $x$ and yield low-dimensional latent variables $z$, that is $x \approx P(z)$. This parameterization form will be used in Equations 3.10 and 3.11 to solve the formulated inverse problem.
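A schematic sketch of this latent-space calibration loop is given below, using a simple gradient descent with forward-difference gradients with respect to $z$. The decoder and simulator calls, the step sizes, and the learning rate are illustrative assumptions; the actual workflow uses the Adam algorithm and, where available, adjoint gradients.

```python
import numpy as np

def calibrate_latent(z0, decode, simulate, d_obs, Cd_inv_sqrt, gamma,
                     lr=0.05, n_iter=100, eps=1e-3):
    """Minimize J_reg(z) = ||Cd^{-1/2}(d_obs - g(P(z)))||^2 + gamma*||z||^2
    with plain gradient descent and finite-difference gradients.
    decode(z) and simulate(x) are hypothetical decoder and flow-simulator calls."""
    def J(z):
        resid = Cd_inv_sqrt @ (d_obs - simulate(decode(z)))
        return float(resid @ resid + gamma * z @ z)

    z = z0.copy()
    for _ in range(n_iter):
        base = J(z)
        grad = np.zeros_like(z)
        for i in range(z.size):                 # one perturbation per latent variable
            zp = z.copy(); zp[i] += eps
            grad[i] = (J(zp) - base) / eps
        z -= lr * grad
    return z
```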
3.2 Parameterization with Variational Autoencoders
Autoencoders: Autoencoders are a special class of feed-forward neural networks that compress
the input data into a lower-dimensional latent-space representation and reconstruct the output from
the resulting latent variables [90, 85]. For parameterization of subsurface flow model calibration,
the input is a grid-based rock property distribution (permeability), and the low-dimensional latent
space variables are the parameters that are solved for. An autoencoder consists of an encoder network
that maps the original data $x$ to a low-dimensional latent space $\mathcal{Z}$, i.e., ($\theta: \mathcal{X} \rightarrow \mathcal{Z}$), and a
decoder network that maps the corresponding latent variables $z$ back to the original data
space $\mathcal{X}$, i.e., ($\psi: \mathcal{Z} \rightarrow \mathcal{X}$). For image data, multiple layers in the encoder part of the network
are used to condense the input data (images) into a low-dimensional latent space representation.
The operations within each layer typically consist of convolutions with multiple (usually small
scale) filters, activation, and pooling (lower panel in Figure 3.1).
The use of multiple layers and multiple filters enable the encoding of several features, which
is important for detecting distinct patterns belonging to different geologic scenarios. Following
the activation function, pooling or down-sampling is applied to reduce the size of the output. The
pooling process enables the network to detect and encode features at different scales, from com-
mon high-resolution features (e.g., channel boundaries) at the initial layers to more large-scale
patterns (e.g., geologic objects) in the final layers. The decoder maps an instance of the latent
variables back onto the original image through transposed convolution and up-sampling, as shown
in Figure 3.2. Once the network is trained, the decoder provides the mapping needed to update the
spatial map of permeability field after each update to the low-dimensional latent variables. The
convolution and activation function can be expressed as
$$z = \sigma(w^T x + b) \quad \& \quad x' = \sigma'(w'^T z + b') \qquad (3.12)$$
where w and b denote the convolution kernel (weights) and bias terms, respectively, and the same
notations with prime represent similar parameters for the decoder network, and $\sigma$ represents the
activation function. A commonly used activation function that is also adopted in this dissertation
is the ReLU function, which is an identity function for positive inputs and zero for negative inputs
and can be expressed as $\sigma(x) = \max\{0, x\}$, where $x$ is an arbitrary input. Note that as in other
low-dimensional parameterization techniques, the reconstruction is lossy and $x'$ provides an ap-
proximation of the input x. For compactness, we collect the filter and bias weights in all the layers
of the encoder and decoder networks into parameter vectors $\theta$ and $\psi$, respectively, and denote the
corresponding mapping functions as $\mathrm{Enc}_\theta(\cdot)$ and $\mathrm{Dec}_\psi(\cdot)$, in the same order. Given a collection of
Figure 3.1: A typical symmetric VAE architecture (top), and illustration of the convolution+ReLU
and pooling operations in one layer (bottom).
Figure 3.2: Illustration of the convolution (a) and transposed convolution (b) operations that are
used in the encoder and decoder networks, respectively. The convolution operation scans the in-
put images with the trainable filters of a chosen size (3-by-3 here) to compute the inner product
between the filter weights and the corresponding image segments (as shown in the colored illus-
trations), and store the results in the respective pixel of the output. The transposed convolution
upsamples an input feature map to a desired output feature map (as shown in the colored illustra-
tion) using learnable weights.
$N$ input (training) data $X = \{x_1, x_2, \dots, x_N\}$, the encoder and decoder networks are trained together
to learn the weights $(\theta, \psi)$ by minimizing the following loss function:

$$\mathcal{L}_{AE}(\theta, \psi) = \big\| X - \mathrm{Dec}_\psi(\mathrm{Enc}_\theta(X)) \big\|^2\,; \qquad \theta: \mathcal{X} \rightarrow \mathcal{Z}\,,\; \psi: \mathcal{Z} \rightarrow \mathcal{X} \qquad (3.13)$$
Once $(\theta, \psi)$ are learned, the latent variables are obtained as $z = \mathrm{Enc}_\theta(x)$ and the corresponding
reconstruction is given by $x = \mathrm{Dec}_\psi(z)$.
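A minimal convolutional autoencoder sketch in Keras is shown below for a single-channel 2D property map; the input size, layer widths, and latent dimension are placeholder choices for illustration only.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 50                                   # illustrative latent dimension

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 100, 1)),
    layers.Conv2D(16, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(latent_dim),                     # z = Enc_theta(x)
])

decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    layers.Dense(25 * 25 * 32, activation="relu"),
    layers.Reshape((25, 25, 32)),
    layers.Conv2DTranspose(16, 3, strides=2, activation="relu", padding="same"),
    layers.Conv2DTranspose(1, 3, strides=2, padding="same"),   # x' = Dec_psi(z)
])

autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")   # reconstruction loss as in Equation 3.13
```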
Autoencoders were originally developed for dimensionality reduction and have been applied
to many diverse fields from image processing and face recognition (hinton2011) to more scientific
applications such as drug discovery [190], biomedical imaging [165], geophysics [160] and more
recently subsurface inverse modeling [99, 21]. With linear activation or a single sigmoid layer, the
optimal solution of an autoencoder is strongly related to the PCA [16]. Among the main advan-
tages of convolutional autoencoders are nonlinearity, versatility and robustness in processing di-
verse spatial images, and effectiveness in reducing high-dimensional image data. These attributes
make autoencoders quite powerful for low-dimensional parameterization in imaging problems.
Autoencoders can be developed as deep learning architectures with the goal of forcing the learned
representations of the input to assume desirable properties [49]. Some examples include regular-
ized autoencoders, such as sparse and denoising autoencoders [183] and variational autoencoders
(VAE) as generative models [86].
Variational Autoencoders: Variational autoencoders use similar architectures to the standard
auto-encoders to learn stochastic mappings between an input data space with a typically com-
plex empirical distribution $q_D(x)$ and a latent space with a simple distribution, as shown in Figure 3.3 [85, 37]. In the context of model calibration, the variational form of the autoencoder offers
the following main advantages. First, it constrains the distribution and variability of the latent
variables, which can be exploited in formulating and solving inverse problems (e.g., using reg-
ularization terms). Second, it specifies a simple (typically Gaussian) distribution over the latent
variables, which is desirable in inversion and data assimilation problems that make Gaussian as-
sumptions about the parameters. Third, the stochastic mapping of the variational form generates
more plausible spatial maps from random samples of the latent variables, which is quite
important for preserving the geologic continuity of the training data. Finally, the method allows
for low-dimensional sampling of the original high-dimensional variables, which is important for
the probabilistic formulation and uncertainty quantification.
In VAE, the generative model (decoder) learns a joint distribution $p_\psi(x, z) = p_\psi(z)\, p_\psi(x|z)$,
where $p_\psi(z)$ is a prior distribution over the latent space and $p_\psi(x|z)$ is a stochastic decoder.
The corresponding true posterior distribution of the generative model is $p_\psi(z|x)$, which is
Figure 3.3: Schematic representation of VAE as a stochastic mapping between data and latent
spaces.
intractable and is approximated by the stochastic encoder or inference model $q_\theta(z|x)$. The variational parameters $\theta$ of this inference model are optimized such that $q_\theta(z|x) \approx p_\psi(z|x)$. The distribution $q_\theta(z|x)$ is parameterized using deep (convolutional) neural networks, where $\theta$ includes all
the weights and biases of the neural network. Typically, strong assumptions are enforced on the
distribution $q_\theta(z|x)$; a commonly used distribution is Gaussian, i.e., $\mathcal{N}(z; \mu, \mathrm{diag}(\sigma))$.
The mathematical formulation of VAE is based on variational Bayesian inference in directed
graphical networks. During training, the parameters of the stochastic encoder and decoder net-
works are optimized by maximizing the Evidence Lower Bound (ELBO) or Variational Lower
Bound of the data log-likelihood function, that is
$$\mathcal{L}_{\psi,\theta}(x) = \mathbb{E}_{q_\theta(z|x)}\left[\log \frac{p_\psi(x, z)}{q_\theta(z|x)}\right] = \mathbb{E}_{q_\theta(z|x)}\left[\log \frac{p_\psi(z|x)\, p_\psi(x)}{q_\theta(z|x)}\right] \qquad (3.14)$$
or
$$\mathcal{L}_{\psi,\theta}(x) = \mathbb{E}_{q_\theta(z|x)}\left[\log p_\psi(z|x)\right] - \mathbb{E}_{q_\theta(z|x)}\left[\log q_\theta(z|x)\right] + \log p_\psi(x) \qquad (3.15)$$
where the last term on the right-hand side is the log-likelihood of the data. Combining the above
equation with the definition of the Kullback-Leibler (KL) divergence/distance [94] between $q_\theta(z|x)$ and $p_\psi(z|x)$,
$$D_{KL}\left[q_\theta(z|x)\,\|\,p_\psi(z|x)\right] = \mathbb{E}_{q_\theta(z|x)}\left[\log q_\theta(z|x)\right] - \mathbb{E}_{q_\theta(z|x)}\left[\log p_\psi(z|x)\right] \qquad (3.16)$$
results in the following form of the ELBO
$$\mathcal{L}_{\psi,\theta}(x) = -D_{KL}\left[q_\theta(z|x)\,\|\,p_\psi(z|x)\right] + \log p_\psi(x) \qquad (3.17)$$
Since the KL divergence is non-negative, the ELBO is the lower bound of the log-likelihood of
the data, that is $\mathcal{L}_{\psi,\theta}(x) \le \log p_\psi(x)$. Therefore, in addition to characterizing the distance between the approximated and true posterior of the data, the KL divergence represents the tightness
of the ELBO. The VAE model is trained to optimize the parameters of the encoder and decoder
networks by maximizing the ELBO $\mathcal{L}_{\psi,\theta}(x)$. By maximizing the ELBO during training, the VAE
accomplishes two goals simultaneously: (i) it improves the generative model (decoder) by maximizing the marginal likelihood $\log p_\psi(x)$, and (ii) it improves the inference model (encoder) by
minimizing the KL divergence (distance) between $q_\theta(z|x)$ and $p_\psi(z|x)$. For convenience, in practice $q_\theta(z|x)$ is assigned a standard normal distribution $\mathcal{N}(0, I)$, as it facilitates the computation of
$D_{KL}(q_\theta(z|x^{(i)})\,\|\,p_\psi(z))$.
(z)).
An important property of the ELBO, is that it allows joint optimization of all parameters(q;y)
using stochastic gradient descent (SGD) algorithms, which are the dominant optimization methods
in deep learning. The SGD algorithm can be viewed as a stochastic approximation of the gradient
descent optimization with desirable smoothness properties. It replaces the actual gradient (from the
entire data set) by an estimate obtained from a randomly selected subset of the data. Consequently,
it leverages the trade-off between faster iterations and a lower convergence rate, thereby reduc-
ing the computational burden of high-dimensional optimization problems [15]. To improve the
performance of the steepest descent algorithm, a host of gradient descent methods have been de-
veloped in recent years for computationally efficient training of neural networks in deep learning
applications, including the Momentum method [137], Adaptive Subgradient Method (AdaGrad)
[36], Root Mean-Square Propagation (RMSProp) [57], and the "Adam" algorithm [84]. Among
the different existing methods, we use the Adam algorithm both for training the VAE and for minimizing the objective function of our model calibration problem (Equations 3.17 and 3.10, respectively).
Appendix A presents an overview of the Momentum and RMSProp algorithms that are combined
to develop the Adam algorithm.
While autoencoder models can be trained with single-layer encoders and decoders, the use of deep
convolutional architectures offers several advantages for image (spatially correlated) data, includ-
ing an exponential reduction in computational cost and in the amount of training data needed to learn
a function, as well as a better compression rate. For parameterization of inverse problems, the
decoder part of a trained VAE ($\mathbf{x} = \mathrm{Dec}_\theta(\mathbf{z})$) is used to formulate the problem in terms of the low-
dimensional latent parameters $\mathbf{z}$. In addition to dimensionality reduction, VAE offers several de-
sirable properties for application to inverse problems. In particular, constraining the distribution of
the latent variables to be Gaussian implies that the $\ell_2$-norm is a suitable regularization term to be ap-
plied when solving for the latent variables, as their range is constrained (zero mean
and unit variance). As such, in the absence of a prior model, a random draw from the latent space
can be used to initialize the optimization solution. If a prior model is available and must be used
to initialize and/or regularize the calibration process, the corresponding latent-space representation
can be found and used. Moreover, the normally distributed latent space of the VAE is better behaved
than that of a regular AE (which could follow any distribution), an important property for gradient-
based optimization. In the next section, we use numerical experiments to demonstrate and discuss
the robustness of VAE-type deep learning architectures for model calibration under uncertainty in
the geologic continuity model.
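As a minimal sketch of how the trained decoder is used in the calibration step, the snippet below minimizes the normalized data mismatch plus an $\ell_2$ penalty on the latent variables with Adam, starting from a zero latent vector. The names decoder, forward_model, and noise_std are placeholders: in our experiments the forward model is a flow simulator, so the corresponding gradients are obtained through simulator sensitivities rather than automatic differentiation.

```python
import tensorflow as tf

def calibrate_latent(decoder, forward_model, d_obs, noise_std,
                     n_latent=8, n_iter=500, lr=1e-2, reg_weight=1.0):
    """Gradient-based model calibration in the latent space of a trained VAE.

    decoder       : trained VAE decoder, x = decoder(z)
    forward_model : differentiable forward (or surrogate) map from x to predicted data
    d_obs         : observed measurements (1D tensor)
    noise_std     : per-measurement noise standard deviations (diagonal C_d)
    """
    z = tf.Variable(tf.zeros([1, n_latent]))            # zero initialization of z
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    for _ in range(n_iter):
        with tf.GradientTape() as tape:
            x = decoder(z)                               # latent variables -> property map
            residual = (forward_model(x) - d_obs) / noise_std
            data_misfit = tf.reduce_sum(tf.square(residual))
            reg = reg_weight * tf.reduce_sum(tf.square(z))   # l2 penalty on latent variables
            loss = data_misfit + reg
        grads = tape.gradient(loss, [z])
        opt.apply_gradients(zip(grads, [z]))
    return decoder(z), z
```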
3.3 Numerical Experiments
To evaluate the performance of the VAE parameterization under geologic uncertainty, we con-
sider model calibration problems in 2D and 3D cases with complex channelized examples that
include two lithofacies types (fluvial systems), as well as 3D examples with multiple lithofacies
types (SAIGUP models). In these problems, the spatial distributions of the unknown target phys-
ical properties (permeability) are calibrated against pressure measurements observed in scattered
monitoring wells. Two types of experiments, a pumping test and two-phase flow, are simulated using
a commercial flow simulation software [153] and the Matlab Reservoir Simulation Toolbox (MRST)
[112], respectively.
The VAE models are implemented using TensorFlow [2]. The Adam algorithm [84] with a tuned
learning rate decay schedule is used to train the VAEs and to perform model calibration using
the formulation discussed in the previous section. In both the training and the model calibration,
$\beta_1$, $\beta_2$, and $\epsilon$ are set to their suggested values in the literature, which are 0.9, 0.999, and 1e-8,
respectively [84]. The hyperparameters are tuned through trial and error using the validation
datasets. The learning rate $\alpha$ in the training of the VAE typically ranges from 1e-2 to 1e-4. We use a
slightly higher range (1e-1 to 1e-3) for the learning rate in the model calibration process. An
$\ell_2$-norm regularization term is added to both the weights and biases of the VAE during training, and
is used to constrain the latent variables in the model calibration. Mini-batching with a default batch size
of 64 [15] and early stopping [135] are applied in the training of the VAE. In the training of the VAE,
a weight of 1 is given to the reconstruction loss while a weight of 3 is given to the latent loss,
because a structured latent space [56] is more desirable for easier inversion. As pointed
out in [100], the quality of gradient-based inversion solutions using deep learning generative
models is sensitive to the choice of initial solution and the number of iterations due to their nonlinear
nature. In our model calibration processes, zero latent variables are used as the initial solution
since the latent variables follow close-to-zero-mean normal distributions and this choice does not bias
the initialization toward any individual scenario. We do not set a maximum number of iterations
for model calibration and stop the iterations when the improvement in the objective function
becomes insignificant. The observed quantities are normalized before calculating the value of the
objective function. The data-noise covariance matrix $\mathbf{C}_d$ in the calibration objective function is specified as a diagonal matrix according to the noise
level added to the measurements.
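For reference, the training configuration described above can be written compactly as in the sketch below; vae, x_train, and x_val are assumed placeholder names, and the decay schedule values are illustrative rather than the exact tuned settings.

```python
import tensorflow as tf

# Adam with the literature-suggested defaults and a decaying learning rate.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.95)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=lr_schedule, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

# Stop training when the validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True)

# vae is assumed to be a Keras model whose loss combines the reconstruction
# term (weight 1) and the latent/KL term (weight 3), as described above.
# vae.compile(optimizer=optimizer)
# vae.fit(x_train, x_train, batch_size=64, epochs=500,
#         validation_data=(x_val, x_val), callbacks=[early_stop])
```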
3.3.1 Fluvial Channel Examples
In this example, we evaluate the performance of the VAE parameterization in subsurface flow inverse
problems under two prior assumptions about the conceptual geologic continuity model: (i) a single
known geologic scenario is available and can be used to constrain the solution search space; and
(ii) the geologic scenario is uncertain and multiple cases are proposed as plausible prior models,
resulting in a much larger space to search for solutions. The results are presented for a pumping test
experiment based on single-phase flow in groundwater aquifers and a model calibration problem
involving a two-phase flow system with increased nonlinearity and fewer observations. The same
examples are considered for the case with multiple geologic continuity scenarios to compare the
performance of VAE and PCA under the two assumptions.
For these examples, we use two training images with straight and anastomosing fluvial patterns
that include two facies types: channel (sandstone) and non-channel (shale) facies. These training
images are used to generate a total of six geologic connectivity scenarios, as shown in Figure 3.4,
using the SNESIM algorithm [20] implemented in S-GeMS [144]. For each scenario, 500 real-
izations are generated, of which 400 are used to train the VAE and the remaining 100 are used as
validation and test data. The choice of 500 realizations was made based on a sensitivity analy-
sis, increasing the number of realizations until the improvement in the prediction quality became
marginal. The reference cases in our experiments are not used in the training process and belong to
one of the training images, unless otherwise noted. It is important to note the difference between
the validation and test datasets in constructing the VAE model. While both datasets are unseen by the
VAE model during training, the validation dataset is used to monitor the progress of the training
process to verify that images that are not seen by the VAE can also be reconstructed with acceptable
accuracy. The test data, on the other hand, is used to evaluate the performance of the VAE once the
training is completed. The number of training samples was determined by gradually adding new
samples until the improvement in the reconstruction quality became insignificant. The fluvial mod-
els are generated on a mesh of size 100 × 100. Details about the architecture of the VAE network
for this model are summarized in Table 3.1. The VAE architecture is determined by starting with a
simple structure and then gradually increasing its complexity until no significant improvement, in
terms of reconstruction quality, is achieved. In the first set of experiments, Scenario 3 is assumed
to be the correct scenario. Figure 3.5 compares the approximation power of PCA and VAE at
different levels of parameterization (dimensionality reduction). The models labeled as projections
are the approximated maps when the reference model is known (no inversion).
Choosing the dimensionality of the latent space for the VAE parameterization is not as straight-
forward as in standard methods such as PCA, where the dimensionality is directly related to the
approximation quality. In the case of the VAE, the dimensionality of the latent space is not the only
Figure 3.4: Two training images with fluvial channels that are used to generate six alternative
geologic connectivity scenarios; four random realizations of facies models from each scenario are
shown.
Table 3.1: Details of the VAE architecture for examples with fluvial channels.

Layer     Layer Type        Kernel Size    Output Dimension
Input     Input             -              100 × 100 × 1
Conv1     Conv2D            3 × 3          50 × 50 × 8
Conv2     Conv2D            3 × 3          25 × 25 × 16
Conv3     Conv2D            3 × 3          12 × 12 × 32
Conv4     Conv2D            3 × 3          6 × 6 × 64
Conv5     Conv2D            3 × 3          3 × 3 × 128
Flatten   Reshape           -              1152
FC1a      FC                -              N_latent
FC1b      FC                -              N_latent
Latent    Latent            -              N_latent
DFC1      FC                -              1152
Reshape   Reshape           -              3 × 3 × 128
DConv1    Conv2DTranspose   3 × 3          6 × 6 × 64
DConv2    Conv2DTranspose   3 × 3          12 × 12 × 32
DConv3    Conv2DTranspose   3 × 3          25 × 25 × 16
DConv4    Conv2DTranspose   3 × 3          50 × 50 × 8
DConv5    Conv2DTranspose   3 × 3          100 × 100 × 1
factor that controls the approximation fidelity. Several other hyperparameters (number of layers,
convolution filters, activation functions, etc.) control the reconstruction quality of the network
and must be tuned to optimize its performance. In the presented examples, the
dimension of the latent space is selected to be equal to the number of retained eigenvectors in the PCA
parameterization. With this choice, the hyperparameters of the networks are tuned as part of the
training process. In the PCA method, the optimal level of parameterization is also decided based on
the resolution and support of the available measurements for inversion. Importantly, in approxi-
mating a known image (projection), continuously increasing the parameterization level results in
improved reconstruction quality because of perfect knowledge about the image. However,
in inverse problems with very limited data that do not support the reconstruction of a detailed
image, increasing the PCA dimension does not necessarily lead to improved reconstruction qual-
ity. In the projection examples shown in Figure 3.5, increasing the PCA dimensionality from 4
to 8 and 16 leads to improvement in approximating the salient features in the permeability maps,
although the boundaries remain blurry (which is expected, as detailed PCs are needed to reconstruct
the sharp edge information). In inverse problems, the reconstruction quality of PCA depends on
Figure 3.5: The projection performance of PCA and VAE when a single scenario is used: (a)
sample model realizations, (b) PCA projection results with different numbers of PCs, (c) VAE pro-
jection with the same number of latent variables as in (b). For $n_z$ = 4, 8, 16, the variances explained
by the PCA bases are 27%, 40%, and 58%, respectively.
the resolution and information content of the available measurements. Hence, the parameterization
level must be tuned to the resolution of available measurements.
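The PCA projections used for these comparisons can be sketched as follows, where X (flattened training realizations) and x_ref (the flattened reference map) are placeholder arrays; the returned explained-variance ratio corresponds to the percentages quoted in the figure captions.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_projection(X, x_ref, n_pc=8, shape=(100, 100)):
    """Project a reference model onto the leading n_pc principal components.

    X     : training realizations flattened to vectors, shape (n_samples, n_cells)
    x_ref : flattened reference model, shape (n_cells,)
    """
    pca = PCA(n_components=n_pc)
    pca.fit(X)
    # Map the reference model to PC coefficients and back to the spatial domain.
    x_proj = pca.inverse_transform(pca.transform(x_ref.reshape(1, -1)))
    explained = pca.explained_variance_ratio_.sum()
    return x_proj.reshape(shape), explained
```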
Unlike PCA, the VAE is able to maintain the reconstruction quality, especially the channel bound-
aries, as the number of latent variables decreases. This behavior is in stark contrast with PCA,
and is one of the main reasons for the superior quality of VAE parameterization and reconstruc-
tion in inverse problems. The VAE parameters that are learned during training provide significant
flexibility and capacity for capturing complex spatial patterns with high fidelity. With many other
learnable parameters, the dimensionality of the latent space is not the only factor that controls
the reconstruction quality. The advantage of this behavior is that the VAE does not require
high-resolution inversion measurements to reproduce high-fidelity reconstructions, as the
detailed features are encoded during the training process. Hence, unlike PCA, very few parame-
ters can be included in the inversion without compromising the reconstruction quality. For the PCA
parameterization, increasing the number of parameters is needed to improve the solution quality.
However, the number of parameters can only be increased when high-resolution data are available to enable
improved reconstruction quality, which is not the case in inverse problems with limited measure-
ments. The high fidelity of the VAE parameterization with low-dimensional latent variables can be
attributed to the effectiveness of CNNs in detecting local patterns with higher accuracy using sev-
eral filters and to the nonlinearity of the mapping. In our examples, for inverse problems using a
single geologic scenario, the dimension of the latent variables (and the number of PCs) is set to 8.
When the geologic continuity model becomes uncertain, the performances of PCA and VAE
are affected differently. Figure 3.6a shows a sample from each of the six channel connectivity sce-
narios. In this case, the training data used for PCA and VAE are the collection of all 2400 samples
from the six scenarios, that is, 400 from each scenario. The PCA and VAE approximation results
corresponding to each of the samples in Figure 3.6a are shown in Figures 3.6b and 3.6c. Given
the diversity of the models in this case, the approximations are shown for 8, 16, and 32 dimensions.
Interestingly, the VAE is able to capture distinct channel patterns even when 8 latent variables are
used. On the other hand, PCA's performance is degraded significantly. The main reason for this
behavior is the linear nature of PCA; that is, the PCs are formed as a linear combination of the
input images. Since the input images are diverse, the leading PCs, which are obtained as the linear
combination of the inputs, aggregate the main spatial patterns in different scenarios, thereby los-
ing the specific connectivity features in each scenario. On the other hand, the VAE learns a nonlinear
mapping and uses several convolutional filters to capture diverse features that exist in different
scenarios. These projection results, however, only show the representation quality when the true
model is known. In the next section, we evaluate the performance of each method when limited
flow measurements are used to calibrate the models. In the model calibration examples of the next
section, 16 PCs/latent variables are used when multiple geologic continuity models are present.
We evaluate the generalization power of the trained VAE in Figure 3.7. Figure 3.7(a) depicts the
distributions of two randomly sampled latent variables in training and testing, which are similar
and close to a normal distribution. The distributional form of the latent variables
also justifies the use of $\ell_2$-norm regularization and shows that the distributions are centered around
zero. In order to check the similarity between the training models and newly generated models,
Figure 3.6: The projection performance of PCA and VAE when multiple scenarios are present: (a)
sample model realizations, (b) PCA projection results with different numbers of PCs, (c) VAE pro-
jection with the same number of latent variables as in (b). For $n_z$ = 4, 8, 16, the variances explained
by the PCA bases are 35%, 50%, and 66%, respectively.
a number of randomly generated models and training models are mapped onto a two-dimensional
transformed domain using t-SNE [120], a commonly used visualization technique for high-
dimensional data. The results in Figure 3.7(b) suggest that the generated models tend to have
connectivity structures that are close, but not identical, to those in the training models. Hence,
the VAE model does not merely memorize the training models but learns to generate new models
using the features extracted from the training data.
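A sketch of this comparison is given below, where X_train and X_gen are placeholder arrays holding the flattened training realizations and the decoded random samples, respectively; the perplexity value is an illustrative choice.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_compare(X_train, X_gen, perplexity=30, seed=0):
    """Embed training and newly generated models in 2D with t-SNE.

    X_train : flattened training realizations, shape (n_train, n_cells)
    X_gen   : flattened generated models, shape (n_gen, n_cells)
    """
    X_all = np.vstack([X_train, X_gen])
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=seed).fit_transform(X_all)
    # The first len(X_train) rows correspond to training models,
    # the remaining rows to the newly generated models.
    return emb[:len(X_train)], emb[len(X_train):]
```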
For a random sample of the latent vector $\mathbf{z}$, the corresponding spatial model from the VAE
decoder is shown in Figure 3.8(a). Figure 3.8(b) illustrates the gradients of the reconstructed
model $P(\mathbf{z})$ with respect to each of the eight latent variables, that is $\partial P(\mathbf{z})/\partial \mathbf{z}$. Interestingly, the
gradient maps are localized around the channel edges, indicating that the latent variables are only
sensitive to the facies boundaries, which are the main unknowns in facies reconstruction problems.
This sensitivity has important implications for gradient-based model calibration formulations, as
it suggests that the solution is found by primarily updating the facies boundaries, which is an
interesting and effective strategy for calibrating facies models.
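When the decoder is available as a differentiable TensorFlow model, these sensitivity maps can be obtained directly with automatic differentiation, as in the sketch below; the decoder output is assumed to be a single-channel 2D map of shape (1, H, W, 1).

```python
import tensorflow as tf

def latent_sensitivity(decoder, z):
    """Gradient of the decoded map with respect to each latent variable.

    decoder : trained VAE decoder (assumed differentiable Keras model)
    z       : latent vector of length n_latent
    Returns a tensor of shape (n_latent, H, W), one sensitivity map per latent variable.
    """
    z = tf.convert_to_tensor(z, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(z)
        x = decoder(tf.reshape(z, [1, -1]))   # x = P(z), shape (1, H, W, 1)
    jac = tape.jacobian(x, z)                 # shape (1, H, W, 1, n_latent)
    return tf.transpose(tf.squeeze(jac), perm=[2, 0, 1])
```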
Figure 3.7: The learned distributions of latent variables Z1 and Z2 in the VAE (a), and the t-SNE visual-
ization of the training realizations and 100 newly generated models (b). The results are shown for
Scenario 3.
Figure 3.8: Localized nature of the gradient information in the VAE: (a) the spatial image x corre-
sponding to a sample latent vector z; (b) the gradient of the spatial image x in (a) with respect to
each latent variable in z.
Figure 3.9: The domain and well configuration for numerical experiments with fluvial channels:
(a) pumping test example; (b) two-phase flow example.
3.3.1.1 Pumping Test in Single-Phase Flow
The first model calibration example is based on a groundwater pumping test. In this example,
the pressure at monitoring wells is available and used to infer the spatial distribution of hydraulic
conductivity. The governing single-phase flow equations for this case were presented earlier. The well config-
uration of this experiment is shown in Figure 3.9(a), where a pumping well, located in the center
of the domain, extracts water from the aquifer at a constant rate of 3000 m³/day. A total of
24 evenly spaced monitoring wells are placed around the pumping well to measure the pressure
drawdown. The top and bottom boundaries are set to no-flow boundary conditions, while the left
and right boundaries are at constant pressure heads of 20 m and 10 m, respectively. The loga-
rithm of hydraulic conductivity for the two facies types is set to 0.6 (channel) and -2.4
(background) ln(cm/s). The latent variables of the VAE and the coefficients of PCA are initialized as
zero vectors, as mentioned. The experiment is performed using both PCA and VAE under the single
geologic scenario and multiple geologic scenario assumptions.
Figure 3.10 summarizes the model calibration results for this example. The top panel in Figure
3.10a displays the reconstructed spatial maps, where the projection solutions are also shown for
comparison. In these examples, projection refers to the best achievable approximation of the ref-
erence model, which reflects the error that is introduced due to the approximation, not inversion.
While both the PCA and VAE methods show good performance in terms of matching observations,
the VAE results are slightly better. The behavior of the calibration objective function with the VAE
parameterization is shown in the middle panel of Figure 3.10a, while the data match results are
depicted in the lower panel of this figure. Both the value of the final objective function and the
corresponding data match for the VAE are better than the values obtained using PCA.
The results of model calibration when the training data includes samples from multiple ge-
ologic continuity scenarios are depicted in Figure 3.10b. While the VAE solutions show distinct
channel features that are consistent with the reference model, the PCA solutions do not contain dis-
tinctive channel features and the identified patterns are not crisp, especially in the upper part of the
domain. Nonetheless, the data match quality is similar in both cases, despite the clear differences
in the identified spatial patterns. This behavior can be explained by noting the limited number of
wells (measurements) that provide little constraining power and result in non-unique solutions that
match the observed measurements but provide different predictions.
It can be seen from Figure 3.10 that the inversion with the VAE achieves a similar data match
for both the multi-scenario and single-scenario cases. In general, the inverted solutions from the VAE
can have lower data mismatch than the VAE projection solution, since the former directly
minimizes the inversion data mismatch while the latter provides the best approximation in the
spatial domain. In the next example, we consider a more challenging problem, where the data-
model relationship has stronger nonlinearity and measurements from fewer wells (with less spatial
coverage) are available.
3.3.1.2 Two-phase Flow with Limited Measurements
In the second experiment, we consider a two-phase flow system for incompressible and immis-
cible fluids, where the goal is to reconstruct the facies map. The resulting inverse problem is
more ill-posed than the pumping test example in the previous section as fewer (five) wells are
present. The pressure measurements at the injection wells and the flow rates at the production
wells are recorded every 90 days and used as the observed measurements for model calibration.
Figure 3.10: Summary of the results for the pumping test, showing the projection and calibration
results with PCA and VAE (top), iterations of the model calibration minimization algorithm (mid-
dle), and comparison of the normalized predicted and measured flow data for the single scenario (a)
and multiple scenarios (b). The horizontal dashed line in the middle panels shows the value of the
objective function based on the final PCA solution.
The governing two-phase flow equations for this example were presented earlier. The well configuration of
our experiment is shown in Figure 3.9b. A no-flow boundary condition is imposed on all four
sides of the domain. The simulation is run for a total of 1080 days. The injection wells operate at
a constant injection rate for the wetting fluid, with a total of one pore volume injected. The
initial pressure in the domain is constant and the pressure at the extraction wells is maintained at a
constant producing pressure. The porosity and the connate water saturation are set to 0.2 and 0.1,
respectively. The permeability values for the two facies types are set to 5000 mD (channel) and
50 mD (non-channel). The latent variables of the VAE and the coefficients of PCA are again initialized as
zero vectors.
Figures 3.11(a) and 3.11(b) show the model calibration solutions under the single and multiple ge-
ologic continuity model assumptions, respectively. Given the well configuration in this example, it
is highly unlikely for any method to capture the channel feature in the upper middle section of the
domain, as the flow measurements do not contain any information about that feature. This can be
verified by examining the results from both methods in the top panels of Figure 3.11. Despite this
measurement limitation, the calibration based on the VAE parameterization provides clear channel
features while the solution from the PCA parameterization cannot detect the channel features. This
is in part due to the generative nature of the VAE solution, as during training the VAE model has
learned to produce channel features from Gaussian samples of the latent variables. Comparing the
results in Figures 3.11(a) and 3.11(b) shows that the VAE solution is able to recover the vertical
channel feature that connects to the right side of the domain, while the PCA parameterization cannot
identify this feature. As a result, the data match from the VAE solution is superior. The data mis-
match of the VAE projection is still higher than that of the inverted solution, as expected, but the differences
are larger than in the pumping test since the flow rates are more sensitive to the discontinuity of the
channels. It can also be seen from Figure 3.11 that the VAE again achieves a similar data match under
the single and multiple scenario cases.
To further evaluate the performance of the two methods, we consider two additional examples.
In the first example, very simple straight channels are used as a reference case, while in the second
Figure 3.11: Summary of the results for the two-phase flow example, showing the projection
and calibration results with PCA and VAE (top), iterations of the model calibration minimization
algorithm (middle), and comparison of normalized predicted and measured flow data for the single
scenario (a) and multiple scenarios (b). The bottom row shows the normalized flow response data
from all wells stacked together. The horizontal dashed line in the middle panels shows the value of
the objective function based on the final PCA solution. The order of the nine properties is the BHP
of Well I1, the nonwetting phase flow rates of P1, P2, P3, P4, and the wetting phase flow rates of
P1, P2, P3, P4.
case, the reference model does not belong to any of the scenarios considered. The results from
these two experiments are presented in Figures 3.12(a)
and 3.12(b). When a simple channel structure is present and the wells intersect the main channel
features, both methods are able to reconstruct the channels in the solution, with the VAE method
providing sharper channel boundaries. On the other hand, when the reference model is outside the
expected connectivity patterns used for training and the well measurements do not provide the
necessary information about the channel connectivity, Figure 3.12(b) shows that the VAE is not able
to identify the exact connectivity patterns in the solution. While the VAE provides a solution with
crisp channel structures that clearly outperforms the solution from the PCA approach, it is not able
to identify the correct connectivity patterns. Interestingly, the connectivity between the injection
and extraction wells in the lower part of the domain is detected, although the channel locations are
slightly displaced. The upper part of the VAE solution correctly shows no connection between
the injection well and the two wells located at the top. However, the VAE solution has a vertical
channel feature in the middle section of the domain without any well measurements indicating the
presence of such a feature. Given that the data match for the solution is good, the existence of this
feature in the solution suggests that the measurements do not have any sensitivity to it. This feature
illustrates the generative nature of the VAE, as it produces channel features that are consistent with the
training data without the flow data requiring the presence of such features.
3.3.2 3D Example with Multiple Facies
As our last example, we evaluate the performance of the VAE in a more complex 3D example with four
facies types. In this example, eight geologic scenarios are considered, as visualized in Figure 3.13
using Paraview [6]. These examples are generated using truncated Gaussian simulation and
object-based geostatistical modeling [60]. The eight scenarios represent different combinations of
aggradation angle and paleo-azimuth. Scenarios 1, 2, 3, and 4 have a low aggradation angle while
Scenarios 5, 6, 7, and 8 have a high aggradation angle. The paleo-azimuths are 0°, 45°, 90°, and 135°
for Scenarios 1 and 5, Scenarios 2 and 6, Scenarios 3 and 7, and Scenarios 4 and 8, respectively. All
Figure 3.12: Summary of the results for two additional two-phase flow examples when multiple
scenarios are present. The results are shown for a simple case (a) and a complex case that does
not belong to any of the scenarios. In each case, the projection and calibration results with PCA
and VAE (top), iterations of the model calibration minimization algorithm (middle), and comparison of
normalized predicted and measured flow data are shown. The horizontal dashed line in the middle
panels shows the value of the objective function based on the final PCA solution. The order of the
nine properties is the BHP of Well I1, the nonwetting phase flow rates of P1, P2, P3, P4, and the
wetting phase flow rates of P1, P2, P3, P4.
Figure 3.13: Sample models showing the geological facies in the eight scenarios considered for
the 3D examples with four facies types.
the scenarios have a parasequence of 2 and parallel facies lines. The dimensions of these models
are 750 m × 2250 m × 20 m, divided into a 40 × 120 × 40 grid mesh. A total of 500 model
realizations are generated for each scenario, of which 400 realizations are used for training, 50 for
validation, and 50 as test cases. Details about the architecture of the VAE model are summarized
in Table 3.2, which involves 3D convolutional filters. Figure 3.14 shows the approximation quality of
PCA and VAE for the top layer of two sample realizations from different scenarios using a
32-dimensional latent space. Note that the approximation is performed for the 3D models and only
the top layers are shown in the figure. While the approximated images capture the general patterns
in the models, the approximation with the VAE is better able to capture the shape of the facies boundaries.
Table 3.2: Details of the VAE architecture for the 3D example.

Layer     Layer Type        Kernel Size    Output Dimension
Input     Input             -              40 × 120 × 40 × 1
Conv1     Conv3D            3 × 3 × 3      20 × 60 × 20 × 8
Conv2     Conv3D            3 × 3 × 3      10 × 30 × 10 × 16
Conv3     Conv3D            3 × 3 × 3      5 × 15 × 5 × 32
Conv4     Conv3D            3 × 3 × 3      2 × 7 × 2 × 64
Flatten   Reshape           -              1792
FC1a      FC                -              N_latent
FC1b      FC                -              N_latent
Latent    Latent            -              N_latent
DFC1      FC                -              1792
Reshape   Reshape           -              2 × 7 × 2 × 64
DConv1    Conv3DTranspose   3 × 3 × 3      5 × 15 × 5 × 32
DConv2    Conv3DTranspose   3 × 3 × 3      10 × 30 × 10 × 16
DConv3    Conv3DTranspose   3 × 3 × 3      20 × 60 × 20 × 8
DConv4    Conv3DTranspose   3 × 3 × 3      40 × 120 × 40 × 1
Figure 3.14: Approximate representation of sample models (a) from each scenario in the 3D
example using the PCA (b) and VAE (c) methods.
The flow experiment in this example involves a pumping test with the well configuration shown
in Figure 3.15, which consists of a pumping well (at a constant rate of 9000 m³/day) in the cen-
ter of the domain and 20 observation wells distributed evenly throughout the formation. No-flow
boundary conditions are set for the left and right boundaries, while constant pressure head bound-
ary conditions are specified for the inflow (50 m) and outflow (10 m) boundaries of the aquifer. 1%
random measurement noise is added to the observed measurements. Zero initialization for the la-
tent variables and PCA coefficients is used. Figure 3.15(b) shows the reference case, which is
a random realization that is not used for training. The results of model calibration are summarized
in Figure 3.16, which clearly show the superior performance of the VAE. While the calibrated models
based on the PCA and VAE parameterizations both provide equally good data matches, the reconstructed
map using the VAE method is evidently better able to capture the geologic continuity features in
the reference case. Figure 3.17 displays all 40 layers of the reference model (a), the PCA-based cal-
ibrated model (b), and the VAE-based calibrated model (c). Note that the top layer of the model is
Layer 40. Both methods approximate the discrete facies with continuous solutions. However, the so-
lution with the VAE is more consistent with the spatial distribution of the facies. Overall, this example
confirms the conclusion from the 2D channel examples that the VAE provides a more powerful and
robust parameterization approach, especially when the geologic continuity model is uncertain and
very diverse geologic patterns are proposed as plausible prior models.
3.4 Discussion
The examples from the previous section show the superior performance of the VAE over PCA for param-
eterization of inverse problems under a single scenario, in agreement with previous works [99]. In
addition, we have shown that the performance difference becomes more significant when multiple
prior geologic models with distinct spatial patterns are involved. Quantitatively, Table 3.3 sum-
marizes the RMSE as a performance measure for the examples presented in the previous section.
Figure 3.15: Model domain and well configuration in the 3D example (a), and the reference model
used for the numerical experiment (b).
Figure 3.16: Summary of model calibration results for the 3D experiment: (a)-(b) show the PCA
and VAE projections, respectively; (c)-(d) display the calibration results with PCA and VAE, re-
spectively; (e) and (f) show the iterations of the model calibration objective function and the final
data match, respectively.
Figure 3.17: Layer-by-layer illustration of the reference model (a), and calibration using the PCA (b) and
VAE (c) parameterization methods.
Table 3.3: Data RMSE performance in different numerical experiments.

Test                   Scenario    PCA       VAE
Pumping test           Single      4349.2    2183.8
Pumping test           Multiple    2882.1    892.9
Two-phase fluvial 1    Single      1871.4    711.7
Two-phase fluvial 1    Multiple    875.0     155.4
Two-phase fluvial 2    Multiple    1301.7    185.7
Two-phase fluvial 3    Multiple    767.2     307.5
3D Pumping Test        Multiple    3.1       2.1
The superior performance of deep learning models over PCA-type parameterization methods in cap-
turing diverse spatial patterns is attributed to the versatility offered by the many filters embedded
in their architecture to capture local patterns, as well as to the nonlinearity of the mapping. On the
other hand, PCA parameterization is a linear transformation in which the PCs are constructed as lin-
ear combinations of the input images, which is not ideal for capturing diverse image features at
low dimensions. Moreover, the PCA parameterization, unlike the VAE, does not take advantage of the
efficiency offered by local image analysis and feature learning. The dimensionality reduc-
tion mechanisms of the two methods have important contrasts as well. The PCA parameterization
decomposes the input data based on their spectral content and ranks the resulting orthogonal PCs
according to their contribution to the total variance of the training data. When the features in the
input data become diverse, the leading PCs are constructed to maximally contribute to the recon-
struction of all the features in the input data, which is not efficient. On the other hand, the VAE uses a
large number of filters to learn the local patterns in the input data without any specific ranking of
the filters. This provides a more efficient mechanism whereby the diversity of the features can be
easily handled by different subsets of filters working together to capture different patterns.
To demonstrate this important advantage of the VAE over PCA, Figure 3.18 compares the nor-
malized mean of the activation outputs of each filter used in the VAE for the fluvial example for the
single scenario (Figure 3.18a) and multiple scenarios (Figure 3.18b). Except for the input data, the
two cases are identical in all aspects, including the architecture, initialization, tuning parameters,
and regularization form and parameters. For these plots, a weak sparsity regularization ($\ell_1$-norm
Figure 3.18: Average activation outputs of all the filters (after normalization) for the cases with a
single scenario (a) and multiple scenarios (b) under a weak sparsity constraint, showing that more
filters are active to represent the diverse features when multiple scenarios are present.
penalty) is used during training to evaluate the redundancy of each network. In the single-scenario case
(Figure 3.18a), a large number of the filters are removed (black bars), indicating that those filters
are not essential for capturing the features in the input scenario. The results in Figure 3.18b show
that when multiple scenarios are present in the input data, very few filters are inactive, suggesting
that the redundancy is significantly reduced. The versatility and robustness of the VAE are attributed,
in part, to the use of several convolutional layers with many filters that provide significant capacity
for learning diverse features by adapting the activation level of the filters to the diversity of the fea-
tures present during training. This property is critical for the parameterization of subsurface flow
models under uncertain prior geologic continuity assumptions and, together with other important
properties such as nonlinearity and supervised learning, explains the superior performance of deep
convolutional networks over traditional approaches.
Another important difference between PCA and the VAE is the design requirement of the latter.
Unlike the PCA parameterization, which is standard and does not require tuning or design parameters,
VAE models involve several design aspects and tuning parameters, including the number of layers,
the number and size of filters, the type of activation function, the rate and form of pooling, and the
learning rate, to name a few. For the most part, the design aspects of the VAE are acquired through
experience and trial and error, as very few guidelines exist to inform the design parameters of deep
learning models. However, the effect of a few of the design parameters is somewhat predictable.
For instance, the nonlinearity of the network increases with the number of layers (that involve
nonlinear activation functions). In general, each layer of the VAE may be viewed as a different level
of abstraction. The initial layers tend to capture generic small-scale features, such as edge infor-
mation, while later layers learn increasingly larger-scale and more specific patterns. Hence, the
VAE architecture involves multi-scale feature learning for the effective representation of complex
multi-resolution images. On the other hand, PCA decomposes the spectral content of the input
data by generating ranked features (PCs) according to their variance contributions (typically from
coarse to fine scale). Therefore, to increase the approximation resolution with PCA, additional
non-leading PCs must be added to the representation, with an increase in dimensionality.
Figure 3.19 shows a comparison between the reconstruction quality with different numbers of
convolutional layers in the encoder/decoder. These models have approximately the same number
of weights, so as to mainly study the effect of network depth and nonlinearity. In these models, each
additional layer adds to the nonlinearity of the mapping, as each convolutional layer is followed by a
nonlinear activation function. The number of layers is increased from 1 to 5 and the reconstruction
results for a sample from each geologic scenario are depicted. With the increasing number of layers
(nonlinearity), the channel boundaries become more pronounced and the approximation quality
increases. When fewer layers are used, the filters have to capture large-scale patterns through
small convolutional filters. As the number of layers increases, because of the pooling operation,
the filters in each added layer capture increasingly larger-scale features, thereby improving the
continuity of the detected channels.
Figure 3.20 shows the behavior of the VAE and PCA in reconstructing images that are not similar
to those used in the training process (or to generate the PCA basis). In this case, the images
used in the training process have channels with only 0° or 90° orientations. The methods are then
used to reconstruct images with channel orientations along 45° (large deviation) and 75° (small
deviation). Two methods are used for reconstruction. The first method (labeled as projection in
Figure 3.20) passes the images to the encoder and generates the approximated image as the decoder
Figure 3.19: The effect of the number of layers and nonlinearity on the approximation quality of the
VAE.
output. In this case, the latent variables generated by the encoder are passed to the decoder and
used to produce the approximation. In the second method (labeled as inversion in Figure 3.20),
images are reconstructed by solving a least-squares problem to find latent variables that provide the
best match (in RMSE) to the spatial representation of the test images. The second method is similar
to how the VAE is used in model calibration problems, except that in this case the measurements are
all the pixel values of the spatial map. The results show that while a small deviation can be handled
by the two methods, having features that are largely different from those in the training data can
degrade the approximation quality for both methods. Interestingly, the behavior of the projection
and inversion approaches with the VAE is quite different. When the encoder is used to map the input
images to the latent space, followed by a reverse mapping by the decoder back to the original
space, the reconstructed images tend to have features that were used in training, even when the
input images are largely different from those used for training the VAE. On the other hand, in the
second approach, when the latent variables are estimated by minimizing the mismatch between the
reference and approximated images, the solution is forced to match the features in the reference
images. However, since the required features are not present in the trained VAE, the solution does
not represent the unseen features with the expected quality. Nevertheless, the resulting solution hints
Figure 3.20: Approximation results for scenarios with different channel orientations. The VAE
and PCA are trained only with 0° and 90° orientations. Models with channel orientations along 75° and 45° are
not used in the training, but are presented to evaluate the performance of the two methods for approx-
imation of images that are not similar to those used in the training data.
at the need to revise the training data (geologic scenarios) and repeat the training process,
which is an advantage of using the VAE for parameterization with multiple geologic scenarios.
Another important property of the VAE for inverse modeling under geologic constraints is its abil-
ity to reproduce complex spatial models using random samples from a simple distribution of the
latent variables ($\mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$). Figure 3.21 shows nine randomly generated samples for the fluvial
examples using the VAE (left) and PCA (right), which clearly show that the samples generated by the VAE
are better able to synthesize channel patterns similar to those in the input data, whereas the
patterns generated by PCA are not as crisp. The generated models with PCA appear to be smooth
Figure 3.21: Ability of the VAE (a) versus PCA (b) to reproduce the learned spatial patterns: the
figures show nine spatial models (x) generated from randomly sampled $\mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ using the VAE and
PCA, respectively.
and Gaussian-looking, which is expected since the method can only preserve up to second-order
statistics (i.e., the sample covariance) of the images. The implication of this property is important for
inversion; the updates applied to the latent variables are more likely to maintain the expected spa-
tial connectivity patterns of the prior (training) models. Deep adversarial networks such as GANs
aim to synthesize natural-looking images that cannot be distinguished from the true images in the
training data. Applications of VAE and GAN to subsurface modeling have been reported in the recent
literature [99, 21, 145, 98, 115]. Although GAN is known to generate more realistic images, even
for complex natural images [130, 127], its training is computationally more demanding and less
stable [17].
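Generating such samples only requires decoding random draws of the latent vector, as in the following sketch, where decoder is the trained VAE decoder and the latent dimension is an assumed value.

```python
import numpy as np
import tensorflow as tf

def sample_models(decoder, n_samples=9, n_latent=16, seed=0):
    """Generate new facies models by decoding random draws z ~ N(0, I).

    decoder : trained VAE decoder (assumed); the output shape follows whatever
              map size the decoder was trained on.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, n_latent)).astype("float32")
    return decoder(tf.constant(z)).numpy()
```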
In this study, we use the Adam algorithm for gradient-based optimization of the inverse prob-
lem objective function and compare the results with the BFGS quasi-Newton algorithm [41, 45,
158, 18]. The results show that while both methods converge to similar solutions, the BFGS re-
quired fewer iterations to converge but had a significantly higher per-iteration computational cost, making
the Adam algorithm computationally more efficient. Another important observation in our study
Table 3.4: Average Computational Costs in the Experiments with the Fluvial Systems

Process                                   Time       Unit
Training                                  0.782      second/epoch
VAE Encoding                              0.004      second/model
VAE Decoding                              0.049      second/model
Calculating ∂P(z)/∂z                      0.484      second/model
Forward simulation for pumping test       1.492      second
Forward simulation for two-phase flow     104.482    second
(not shown) was that inversion with the VAE resulted in fewer iterations compared to regular au-
toencoders. This is mainly attributed to the constrained nature of the latent variables in the VAE (i.e.,
standard normal distribution) compared to the wider range of variability in the latent variables of an AE.
The main limitation of the VAE in comparison to conventional parameterization methods, such as
PCA, is the higher computational burden associated with training the network and tuning its
hyperparameters. For example, Table 3.4 summarizes the computation times of different com-
ponents in the fluvial example, using an Intel i7-7700K CPU and an Nvidia RTX 1080 GPU. In general, the
computational cost of training a VAE is higher than that of PCA, especially considering the
hyperparameter tuning requirement of the VAE, which can be quite time-consuming. In addition,
the number of iterations used in the VAE training process also depends on the complexity of the in-
put data/images, as well as on the adopted architecture and the regularization techniques used. However,
the computational complexity of the training process may not present a major bottleneck for model
calibration, especially considering the computational demand of the forward flow simulation runs.
Given the variable factors involved in the design, training, and data requirements of the
VAE, presenting a concrete computational complexity analysis is not straightforward. However,
these techniques are generally known to have a much higher computational demand than conven-
tional linear parameterization methods such as PCA. Another important consideration for imple-
menting the VAE is its data need. In general, deep learning models require large training datasets
(model realizations), which are typically generated using geostatistical simulation methods. As is
the case in other rigorous model calibration and uncertainty quantification methods, the need to
generate many model realizations may still be viewed as a practical limitation.
Our focus in this dissertation was to illustrate the effectiveness and robustness of the VAE
for low-dimensional parameterization in subsurface flow model calibration. An important aspect
of model calibration that was not discussed in this dissertation is uncertainty quantification. Al-
though the VAE learns a probabilistic mapping between the original parameter space and a low-
dimensional latent space, we only used the generator network (decoder) to parameterize a deter-
ministic model calibration problem. Extension to a probabilistic inverse modeling formulation can
be conveniently accomplished by using a Monte Carlo-based sampling framework, in a similar
manner to conventional parameterization methods [99]. In fact, the VAE model offers a stochas-
tic mapping from the original (complex) parameter distributions to a much simpler latent-space
distribution (e.g., Gaussian), which better lends itself to the probabilistic formulation of the inverse
problem. In this case, instead of sampling the original parameters with complex distributions, one
can sample the latent variables with simpler distributions and use them for data assimilation or
uncertainty quantification analysis.
Chapter 4
Recurrent Neural Networks for Short-term and Long-term
Prediction of Geothermal Reservoirs
4.1 Geothermal Operation
Next, we will introduce the main components of the proposed method (shown in Figure 4.2),
followed by the overall workflow. To motivate the method development, Figure 4.1 shows the
hourly data for one producer from a field dataset. Among the variables shown in Figure 4.1, the
motor speed is the control variable while all other variables are output measurements. It can be
seen that the data is generally smooth, and the variables are correlated, except for many sudden
jumps and drops. Close examination of the data reveals that when sudden changes occur, the
motor speed typically goes to zero because of maintenance issues or equipment failure incidents.
It is important to note that some of the values shown in Figure 4.1 do not drop all the way to zero
because the data shows hourly averaged values of minute-by-minute data. A labeling scheme is
added to the model to exclude those intervals of sudden changes during training. The main data
trends in Figure 4.1 show that the model must be able to capture the dominant dynamics as well
as the changes introduced by variations in the control variables and disturbances. The proposed
method in this dissertation is designed with these data features/behavior in mind.
Figure 4.1: The hourly data of one of the producers from the field data after normalization.
Figure 4.2: The architecture of the CNN-RNN model with labeling scheme. y is the target variable, x
is the control, t is the time, and ŷ is the final prediction from the model.
4.2 CNN-RNN Model with Labeling
The overall architecture of the CNN-RNN model consists of a CNN encoder that extracts important
historical information from a fixed-length lookback sequence before the prediction time steps, an
RNN encoder that compresses the sequence of future controls, and an RNN decoder that predicts
the future responses. The predictions are initialized by the final hidden state and the input sequence
from the CNN encoder and the RNN encoder, respectively. We will discuss this model in more
detail next.
4.2.1 RNN
RNN is a family of neural networks where a sequence of repeated units maps an input sequence
to an output sequence. Each unit in the RNN has its own hidden state, which represents the
context information at the current step [12]. When moving forward to the next unit, the hidden
state from the previous unit is also included as an additional input which makes the units aware
of information in the previous steps. Since the units share the same weights, RNN can process or
predict arbitrarily long sequences by simply extending or shortening their length. However, the
sequential architecture makes the training process difficult as the backpropagation step involves
every unit with an activation function in the sequence. Thus, when the sequence is long, the
gradient can easily become very small or very large. Advanced structures have been invented
to mitigate this vanishing gradient issue. The LSTM [58] and GRU [26] are among the most
commonly used RNN models. Although they are different in terms of architecture, they both
implement gated mechanisms to control the storage and update of the context information (hidden
state). Studies [42, 62, 161] have shown that LSTM and GRU have comparable accuracy, and both
are capable of learning the underlying dynamics in data. In this work, we use LSTM as it is more
widely adopted in various applications. However, we note that the LSTM units in our model can
be replaced by GRU or other RNNs.
The architecture of an LSTM unit at step t is shown in Figure 4.3, where x, C, and h are the input,
the cell state, and the hidden state, respectively. The subscripts $t$ and $t-1$ represent the current
and previous time steps, respectively. The forget gate $f_t$, which decides how much information from
the past cell state should be forgotten, is first computed as:
$$f_t = \sigma(w_{fh} h_{t-1} + w_{fx} x_t + b_f)$$
where $w_{fh}$, $w_{fx}$, and $b_f$ are the two weights and the bias of this gate. The notation $\sigma$ represents the
sigmoid function. The input gate $i_t$, which determines how much new information should be kept, is
computed as:
$$i_t = \sigma(w_{ih} h_{t-1} + w_{ix} x_t + b_i)$$
where $w_{ih}$, $w_{ix}$, and $b_i$ are the two weights and the bias of the input gate. The new candidate value
$\hat{C}_t$ is:
$$\hat{C}_t = \tanh(w_{Ch} h_{t-1} + w_{Cx} x_t + b_C)$$
where $w_{Ch}$, $w_{Cx}$, and $b_C$ are the two weights and the bias terms. The cell state is updated according to
the following equation:
$$C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t$$
The output gate $o_t$ controls what is being output:
$$o_t = \sigma(w_{oh} h_{t-1} + w_{ox} x_t + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$
where $w_{oh}$, $w_{ox}$, and $b_o$ are again the corresponding weights and bias term.
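For completeness, the gate equations above can be transcribed directly into code; the sketch below is a plain NumPy implementation of a single LSTM step, with the weight matrices and bias vectors grouped in dictionaries whose keys are illustrative names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.

    W and b are dictionaries of weight matrices and bias vectors, e.g.
    W['fh'], W['fx'], b['f'] for the forget gate (names are illustrative).
    """
    f_t = sigmoid(W['fh'] @ h_prev + W['fx'] @ x_t + b['f'])    # forget gate
    i_t = sigmoid(W['ih'] @ h_prev + W['ix'] @ x_t + b['i'])    # input gate
    c_hat = np.tanh(W['Ch'] @ h_prev + W['Cx'] @ x_t + b['C'])  # candidate value
    c_t = f_t * c_prev + i_t * c_hat                            # cell state update
    o_t = sigmoid(W['oh'] @ h_prev + W['ox'] @ x_t + b['o'])    # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t
```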
RNNs are usually implemented in an encoder-decoder architecture where an encoder RNN
encodes the source sequence, and the decoder RNN generates the predictions with the encoded
Figure 4.3: The architecture of LSTM.
compressed feature vector (the final hidden state in the encoder) being used as the initial hidden
state [168]. However, unlike language translation, where the source sentence and the target
sentence could have different lengths, the control variables and the model responses have the same
length and the variables at the same time step are strongly correlated. Therefore, in the proposed
model structure that is shown in Figure 4.2, the sequence of hidden states from the RNN encoder
is fed into the RNN decoder as input. Another reason for this change stems from the difference
in the causality of the two applications. That is, the response data from a geothermal reservoir at
a given time depends on the results from previous steps, while the translation of a word depends
on the words before and after it. The unidirectional correlation in the geothermal applications is
also the reason why we do not consider bi-directional RNNs [51], which are popular in natural
language processing (NLP) applications. Another important component in using RNN models is
the attention mechanism. Attention mechanisms have greatly improved the accuracy of RNNs in
many challenging NLP applications [118, 27, 180, 138]. But because of the strong one-to-one
correlation (time axis) and the simplicity of geothermal data, in our examples, adding attention
to our architecture did not result in a significant improvement. Therefore, we did not include the
attention mechanism in the presented model in this dissertation.
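A minimal sketch of the resulting architecture (Figure 4.2) in Keras is given below; the lookback length, prediction horizon, channel counts, layer sizes, and filter settings are illustrative assumptions rather than the tuned values used in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_rnn(lookback=168, horizon=720, n_obs=2, n_ctrl=3, n_hidden=64):
    """Sketch of the CNN-RNN model: a CNN summarizes the lookback window,
    an LSTM encoder compresses the future controls, and an LSTM decoder
    predicts the future responses. All sizes are illustrative."""
    # CNN encoder over the lookback history (controls + observations).
    hist_in = layers.Input(shape=(lookback, n_obs + n_ctrl))
    h = layers.Conv1D(32, 3, activation="relu", padding="same")(hist_in)
    h = layers.GlobalAveragePooling1D()(h)
    init_state = [layers.Dense(n_hidden)(h), layers.Dense(n_hidden)(h)]

    # LSTM encoder over the future control sequence.
    ctrl_in = layers.Input(shape=(horizon, n_ctrl))
    enc_seq = layers.LSTM(n_hidden, return_sequences=True)(ctrl_in)

    # LSTM decoder initialized by the CNN summary, fed the encoded control sequence.
    dec_seq = layers.LSTM(n_hidden, return_sequences=True)(
        enc_seq, initial_state=init_state)
    y_hat = layers.TimeDistributed(layers.Dense(n_obs))(dec_seq)

    # Multiply by the binary label so predictions during data gaps are zero.
    label_in = layers.Input(shape=(horizon, 1))
    y_out = y_hat * label_in
    return tf.keras.Model([hist_in, ctrl_in, label_in], y_out)
```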
4.2.2 CNN
CNNs are neural networks that utilize convolutional layers where learnable filters are convolved
with the input [101]. Figure 4.4 shows the operation of 1D convolution that is implemented in
the proposed model. Compared to other neural networks, CNNs have fewer weights and, thus, are
easier to train. Since convolution is a local operation, CNNs perform especially well in applications
Figure 4.4: A demonstration of 1D convolution where x, w, and y are the inputs, learnable weights,
and outputs, respectively.
where local correlations dominate (e.g., image classification) [92, 159]. Although there have been
debates about the performance difference between CNN and RNN, it is generally understood
that CNNs perform better at extracting salient information from locally correlated data [189],
which is also why CNNs are widely used in classification problems. In the proposed model, we
utilize the history within a lookback period right before the prediction step to help initialize the
hidden state of the RNN decoder. In this case, CNN is used because this is a summarization task,
where the output is a single state. In addition, the lookback period is short and has a fixed length,
which makes CNN a more reasonable choice.
4.2.3 Labeling Scheme
Equipment failure and maintenance are common in geothermal operations [174, 175]. Measure-
ments during these periods are either missing or do not reflect the state of the reservoir under
normal operation. Therefore, it is necessary to remove the data from these periods to improve the
accuracy of the training process. One way to handle the resulting data gaps due to these shut-in
periods is to fill them with interpolated values. However, the filled data could potentially inject
bias and mislead the model during training. A more straightforward approach, proposed by
Lipton et al. [114], is to make the RNN model aware of these data gaps through labeling, thereby
letting the model learn how to make predictions in the presence of data gaps. We have imple-
mented a labeling scheme based on this idea. As shown in Figure 4.1, the measured values outside
those sudden jumps and drops are within a small range. Therefore, the labeling of the data is done by
first identifying these problematic steps in each well based on the normal ranges of the measured
values. In field cases, whenever available, operation logs that include records of maintenance and
faults could be helpful. With this information, the measured values during the shut-in periods are
set to zero using a binary label as an indicator. For instance, the label is set to 0 for gaps and to
1 for normal operation. During training, the label variable is included as part of the control. In
addition, the prediction from the RNN model is multiplied by the label as shown in Figure 4.2,
so that the predictions during these gaps are always zero. This scheme will also exclude the time
steps of data gaps from the training because the predictions for these periods will always match the
data.
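A simple sketch of this labeling step is shown below, where the normal operating range [low, high] for each measured variable is an assumed input identified from the data or from operation logs.

```python
import numpy as np

def label_gaps(values, low, high):
    """Build the binary label described above: 1 for normal operation,
    0 for shut-in/maintenance steps whose values fall outside the normal
    range [low, high]. `values` is a 1D array of hourly measurements."""
    label = ((values >= low) & (values <= high)).astype(np.float32)
    # Zero out the measurements during gaps so that (prediction * label)
    # trivially matches the data there and those steps carry no training signal.
    cleaned = values * label
    return cleaned, label
```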
4.3 Overall Workflow
After introducing the model and its architecture, we now describe the workflow of how to apply
the model to geothermal data.
4.3.1 Feature Selection
Feature selection is the process of finding the optimal subset of variables for the prediction of the
target variables [154]. There are two major gains from applying feature selection: including a
relevant subset of the feature variables simplifies the structure of the model and accelerates the
training and tuning process. It can also be used to eliminate noisy feature variables to improve the
accuracy of the model. In our workflow, we first identify the control variables and observations
(target variables) and use domain insights to select the relevant features for each predicted quantity.
Instead of training one model that takes all control variables and predicts all target variables, we
train multiple individual models using the respective (relevant) well data. For the wells, the pump
rates are used as controls. When the pump rates are not available, they can be replaced by the
measured flow rates. For the producers, the injected brine temperature is added as an additional
control to the system, although it is not directly controlled. The outputs of the system are the bottom-hole pressure (BHP) and the specific enthalpy of the production wells.
For geothermal operations, feature selection can be made based on two aspects, variables and
wells. For the two output variables (enthalpy and BHP), we train one model for each target variable
since they are not strongly correlated. Having two separate models for each property also reduces
the dimension of trainable parameters and helps improve the performance. In our field case, tracer
tests [29] have confirmed that there is limited communication between the fault zones in the
geothermal reservoir. These fault zones can be viewed as individual blocks of rocks, where the
variables in one fault zone are not correlated with those in another fault zone. By removing ir-
relevant input variables from each fault zone, a significant reduction in the number of weights in
the model is achieved. With this implementation, we train a separate model for each fault zone.
Although the actual amount of weight reduction depends on the number of hidden states and the
number of wells in each fault zone, roughly one-third to one-fourth of the original number of
weights are used after incorporating the fault zone settings. In our experiments, the additional fault
zone information leads to a significant improvement in prediction accuracy. To assess the impact of
control variables on the observable variables of the subsurface system, we have performed a sensi-
tivity analysis based on a simulation model of the geothermal reservoir under study. The sensitivity
index is calculated as the ratio of the corresponding change in the output to the perturbation (10%
of control value) in the control variables [106]. The rate controls of each well are fixed over ten
years of simulation and are perturbed one by one. The changes of enthalpy or BHP are collected at
the end of the 10-year simulation for each of the producers. Sensitivity indices are then ranked into
different classes for feature selection, where the class with the lowest sensitivity denotes inputs
that can be removed.
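A sketch of how such a one-at-a-time sensitivity index could be computed is given below; run_simulation is a hypothetical wrapper around the reservoir simulator, and the 10% perturbation follows the description above.

import numpy as np

def sensitivity_indices(base_controls, run_simulation, perturbation=0.10):
    # base_controls:  1D array of fixed rate controls (one per well), assumed nonzero.
    # run_simulation: hypothetical function mapping a control vector to the
    #                 outputs of interest (e.g., enthalpy or BHP of each
    #                 producer) at the end of the 10-year simulation.
    base_output = np.asarray(run_simulation(base_controls))
    indices = np.zeros((len(base_controls), len(base_output)))
    for i, value in enumerate(base_controls):
        perturbed = np.array(base_controls, dtype=float)
        perturbed[i] = value * (1.0 + perturbation)  # perturb one control by 10%
        # Sensitivity index: change in output divided by the perturbation size.
        indices[i] = (np.asarray(run_simulation(perturbed)) - base_output) / (perturbation * value)
    return indices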
As shown in Figure 4.5, the sensitivities below the set threshold (0.003 for enthalpy and 0.05
for BHP) are set to be ”NaN” values in the heatmap. Based on the sorted sensitivity indices, 19
(out of 84) variables are retained for the prediction of enthalpy, and 34 (out of 84) control variables
for BHP. The truncated sensitivity information indicates the connectivity of the subsurface system,
Figure 4.5: The sensitivity analysis for enthalpy and BHP with truncation for feature selection.
where both enthalpy and BHP are mainly sensitive to the wells from the same fault (Table 4.1). In
general, the obtained connectivity by sensitivity analysis is consistent with the results determined
from the tracer test in [29]. It can also be seen that the production enthalpy is sensitive to the rate
control from the proximal injection well. In contrast, the production BHP is more sensitive to the
flow rate at the same well. Another observation is that the overall sensitivity measures for enthalpy
are around ten times lower than those for BHP, indicating that enthalpy is less dependent on the
flow rate than BHP and mainly dominated by the latent dynamics (i.e., conductive or convective
heat source, cooling by the injected cold water). In general, the enthalpy exhibits a gradual decline
with time during the life of the geothermal reservoir. On the other hand, BHP variations show more
sensitivity to control inputs without a particular trend in time. Hence, the prediction of enthalpy
with a decreasing data trend becomes an extrapolation problem, where future predictions tend to fall outside the range of values seen during the training phase (with past data).
Table 4.1: Details of connectivity for enthalpy and BHP.
Name of Fault  | Production Well | Injection Well
Fault 1        | P5              | -
Fault 2        | P3, P4, P6      | I2, I3, I4, I5, I8
Fault 3        | P1              | I1, I6
Middle Fault   | P2              | -
4.3.2 Data Pre-processing
Data pre-processing plays an important role when training neural networks. There are usually
human errors in field data, as well as missing data and duplicate steps. In our study, duplicate
steps are removed, and missing steps are labeled. The data is downsampled to a lower frequency to
reduce the computation cost. For instance, weekly data is used to implement long-term predictions.
Data normalization is usually performed for neural network models since it simplifies the tuning
of the hyperparameters and the initialization of the weights. In our workflow, we normalize the
data using min-max scaling separately for each individual model. The zeros in the data gaps are
excluded when normalizing the data.
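A minimal sketch of this normalization, assuming the data gaps are stored as zeros as described above, is shown below; the min and max are computed only over the non-gap entries.

import numpy as np

def minmax_scale_excluding_gaps(series):
    # Min-max scale a 1D series to [0, 1], ignoring zero-valued gap entries.
    series = np.asarray(series, dtype=float)
    valid = series != 0                       # zeros mark data gaps (labeling scheme)
    lo, hi = series[valid].min(), series[valid].max()
    scaled = np.zeros_like(series)
    scaled[valid] = (series[valid] - lo) / (hi - lo)
    return scaled, lo, hi                     # keep lo/hi to undo the scaling later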
4.3.3 Training
Shorter training series are sampled from the original time series data for improved efficiency. The
window sizes of both the lookback period (fixed) and prediction steps (shorter in training) are tuned
during training. Shortening the window size accelerates the training process but may reduce the
accuracy of the model. The window size of the lookback period cannot be too short. Otherwise,
the model may be misled by the existing data noises. The prediction window size of the training
samples has to be long enough for the model to learn meaningful trends in the data during training.
A well-known limitation of neural network models is their poor ability to extrapolate. Studies have also shown that the accuracy of RNN tends to deteriorate rapidly when making long-term predictions in time-series problems [95, 53, 76]. A simple demonstration of the reason behind this
limitation is shown in Figure 4.6, where the data to the left of the dashed vertical line is used for
training, while the part to the right of this line is used for prediction. Although all the prediction
Figure 4.6: A demonstration of the weak extrapolation of neural networks.
scenarios match the training data almost perfectly and remain relatively accurate for short-range
prediction, they depart from the true data as the prediction horizon becomes longer. In theory,
without including additional constraints, the RNN model can pick any of these prediction models
during the training phase. This non-uniqueness issue becomes more severe in problems where
there is a constant change of scale of the target variable, e.g., the declining trend in Figure 4.6.
This is, in part, because the nonlinear nature of neural networks makes their output insensitive to scale changes in their input [95]. In general, neural networks tend to predict values within the
range of values that they have seen during training. For short-term prediction, the model can be
trained with just historical data as the level of departure from data is expected to be small. However,
for long-term prediction, the model has to be guided by additional knowledge about the expected
long-term behavior of the data. In this work, we use multiple instances of simulated data to train
models for long-term prediction. Although this method would require a physics-based model, the
RNN can still dramatically accelerate the optimization problem that is performed afterward. This
is because generating training samples takes fewer simulations than the number of forward runs
needed for optimization. Once the RNN model is trained, it can predict flow responses almost
instantly.
4.4 Numerical Experiments
In this section, we present simple examples to evaluate the performance of the proposed model.
These examples are followed by field cases to illustrate the application of the model to real data.
4.4.1 Test Cases
We use a set of toy examples to investigate the characteristics of the proposed model. As shown
in Figure 4.7, five datasets are generated using periodic cosine functions with different frequencies
and amplitudes. Test 1 is the base case, where constant frequency and amplitude are used. Test
2 has a constant trend where the amplitude and frequency keep increasing. Test 3 has varying
amplitude and frequency, increasing in the first half and decreasing in the second half. In Test 4,
random noise is added to evaluate the robustness of the RNN model. In Test 5, 5% of the data at
random times are set to zero to represent data gaps. The labels associated with these missing data
points are given to the model. All these series share a length of 1014 time steps. The last 192 time
steps are reserved for prediction, while the rest are used for training. A random 5% of the training
data is used for validation. The window sizes of the lookback period and prediction during training
are 6 and 48, respectively. The normalized time step number is fed to the model as the control
variable.
It can be seen in Figure 4.7 that for short-term prediction, the RNN model can predict the
periodic changes with a fixed frequency. The RNN model also works well for the cosine function
with increasing amplitude in Test 2 and the data with noise or gap in Test 4 and Test 5. In Test 3,
however, the RNN model fails to capture the decreasing trend in both amplitude and frequency.
For long-term prediction, the length of the prediction part of the data in Test 1 is increased to
1200. The prediction results, in this case, are shown in Figure 4.8, where it is evident that the
accuracy keeps decreasing as the prediction horizon increases beyond the training data, despite
the constant amplitude and frequency. A possible effect that can contribute to this behavior is the
monotonically increasing control variable, as shown in Figure 4.8. During prediction, the control
Figure 4.7: The five toy tests and the short-term predictions from the proposed model.
Figure 4.8: A toy example where the model suffered from extrapolation when doing long-term
prediction.
Figure 4.9: The long-term prediction becomes more accurate after transforming the problem from
extrapolation to interpolation.
variable increases away from the values used during training. To check the contribution of this
effect, we convert the step number into a periodic function. The new control variable and results
are shown in Figure 4.9. Now the long-term prediction is more accurate. This result shows that
converting the problem from extrapolation to interpolation can improve the performance of the
RNN model for long-term prediction.
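One simple way to perform this conversion is to replace the monotonically increasing step index with bounded periodic features, as sketched below; the period is an assumption that would be chosen to match the data.

import numpy as np

def periodic_time_features(step_index, period=52):
    # Encode a monotone step counter as sine/cosine features. Unlike the raw
    # index, these features stay within the range seen during training, which
    # turns long-term prediction into an interpolation problem.
    phase = 2.0 * np.pi * np.asarray(step_index, dtype=float) / period
    return np.stack([np.sin(phase), np.cos(phase)], axis=-1)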
4.4.2 Field Case Studies
In this section, we apply the model to a geothermal field case that has an existing numerical simula-
tion model. The simulation model of the geothermal reservoir was developed by Cyrq Energy Inc.,
Figure 4.10: The configuration of the field example where four faults are present. The color
indicates the initial temperature of the field
using the reservoir simulation software TETRAD. The simulation model is a three-dimensional
single-phase dual-porosity model that consists of roughly 35,000 grid blocks. The numerical model spans 16 km in the Easting direction by 18 km in the Northing direction and extends from +1500 to -3000 mRSL. There are six production wells and eight injection wells in the reservoir. The model
contains four fault zones that have high permeability and convective thermal gradients (Figure
4.10). The well names and their connectivity information are summarized in Table 4.1. Due to the
proprietary nature of the data, the normalized versions of the well data are shown.
In these examples, the simulated dataset is sampled weekly. While hourly and daily data fre-
quency may also be used, the changes in reservoir conditions are expected to occur at a longer
time scale. A lower data frequency is also more appropriate for detecting long-term features and
trends. The use of high-frequency data can also increase the computational cost of training RNN
as a longer length of RNN sequence has to be processed. In these examples, the length of the
simulation period is 20 years, starting on 2013/11/28 and ending on 2033/12/29. The control vari-
ables are the mass flow rate of the production and injection wells, which are changed annually. A
2% random noise is added to the control variables to represent the effect of noise in the control
implementation. The observations (target variables) of the simulation model are the enthalpy and
BHP of the production wells.
The control variables are changed annually, with their values generated randomly within a
specific range while honoring two linear constraints for each stage. The first constraint is the
mass balance constraint to equate the total rate of production and injection. The second constraint
imposes the power demand (or capacity) of the power plant by setting the total production/injection
rates to be fixed (1850 ton/hr). The injection temperature is not controlled but is part of the input to
the simulator. The injection temperature at each well is sampled based on realistic data that shows
seasonal variations.
In the following sections, four numerical experiments will be presented to demonstrate the
short-term and long-term prediction power of RNN after training based on simulated data. When
short-term predictions are used, the window sizes for the lookback and prediction periods are 6 and
48 weeks, respectively. In the experiments for long-term prediction, the lookback period is still six
weeks, while the prediction window size in training is increased to 104 (two years).
Case 1: Short-Term Prediction
We first investigate the accuracy of the short-term prediction of the RNN model using the simu-
lated dataset. In this experiment, the total length of the time series data is 422 (around eight years).
The model is trained using the first 371 steps (around seven years) and predicts the last 51 steps
(around one year). As shown in Figure 4.11, the predicted enthalpies match the observed simulated
data in all producers. The left side of the vertical dashed line shows the data used for training and the predictions during the training stage (one sample is shown after every 50 samples so that one prediction is shown for each step). The right side of the dashed line shows the test data and the predictions from the RNN models. For the wells in Fault 2 and Fault 3, the enthalpy prediction captures the decreasing trend after the 371st step (dashed line). The results indicate good performance for RNN in predicting the short-term trends (extrapolation). When the declining trend is not present (Fault 1 and the middle fault), the RNN model can predict the behavior observed in the data (interpolation).
Figure 4.12 shows a good match between the reference and predicted BHP data. It can be seen
that the BHP responses are different from the enthalpy behavior and stay within the training data
range. This is expected as BHP is more sensitive to the rate control than the enthalpy is.
Figure 4.11: The short-term predictions of the producer enthalpies in the field experiment. The
training is on the left side of the vertical dashed line and the testing is on its right.
Figure 4.12: The short-term predictions of the producer BHPs in the field experiment.
Figure 4.13: The long-term predictions of producer enthalpies in the field experiment using RNN
trained on historical data.
Case 2: Long-Term Prediction Using Historical Data
We evaluate the long-term prediction power of RNN in this part. For this example, the total length
of data is 1048 time steps (around 20 years). The RNN models are trained using the first 371 steps
(7 years). Instead of predicting the next 51 steps, the trained models are used to predict the next
677 steps (around 13 years).
Figure 4.13 shows the prediction of producer enthalpies. Fault zone 1 and the middle zone both
have one producer and no injector. These two zones have stable heat flux from the underburden
rock. Therefore, the enthalpies of these two fault zones are more stable than the others. From
Fault Zones 2 and 3, it can be observed that the discrepancies between the prediction and true data increase as the prediction step moves further away from the training dataset. This can be, in
part, due to the extrapolation issue that was mentioned in the previous example. As the enthalpy
keeps decreasing to the values that the model hasn’t seen in the training stage, the accuracy of the
prediction continues to deteriorate.
Figure 4.14: The long-term predictions of producer BHPs in the field experiment using RNN
trained on historical data.
As can be seen from Figure 4.14, the long-term prediction of BHP is more accurate than en-
thalpy. This is, in part, because the total injection rate approximately equals the amount of brine
produced from the subsurface, which makes the subsurface pressure more stable. Compared to
the enthalpy, producer BHP is more dependent on the control variables and less on the dynamics.
Therefore, prediction on producer BHP is closer to interpolation than extrapolation. However, dis-
crepancies are still observed in the producer BHP in Fault Zone 2, where there are three producers
and five injectors. This indicates that there might be long-term dynamics in Fault Zone 2 caused
by the complex connections between the producers and injectors.
Case 3: Long-term Prediction Trained Using Full Data
As observed in the previous experiment, training RNN with only the historical data from the field
may not be sufficient for long-term prediction, especially when the model is used to extrapolate
beyond the training data range. The main contributor to this problem is the lack of adequate data
with diverse input-output scenarios to train the RNN model in a way that all future inputs are within
a similar interval to those used during training. When the RNN model is trained with only one
instance of historical data (which is the situation in all practical settings), the variability in the input
features is unlikely to be exhaustive. As a result, the model predictions for unseen input variables
are likely to be inaccurate. Another important issue that limits the long-range prediction with
neural network models is the statistical nature of these methods and the lack of causal relationships,
unlike physics-based simulation models. One way to improve the performance of the RNN is to
use several instances of simulated data to train these models. In this case, multiple instances of
simulated data that cover both the data range and the prediction range are generated and used for
training. With the simulated data exhibiting a similar range of variables during both training and
prediction, the predictions are more likely to be consistent with the expected behavior of the field.
In these experiments, the total length of the data time series remains 1048 steps (around 20 years). A sensitivity analysis of the number of simulated instances is performed to assess the performance of the RNN model as a function of the number of instances of the simulated data. In the sensitivity analysis, the RNN models are trained on 1, 5, 10, 25, 50, 100, 200, and 400 simulated datasets (all with a length of 1048 steps). A total of 50 additional simulated instances are also used as the validation set to tune the hyperparameters. The final prediction accuracy is evaluated based on the RMSE loss on a test set of 50 instances. Figures 4.15 and 4.16 visualize the RMSE
losses of enthalpy and BHP predictions on the test set as a function of the number of used training
series. These series are generated with random controls. It can be observed from the figures that
the losses quickly converge as more training samples are added, which indicates a better chance of
getting more accurate predictions. Examination of the results shows that even with a small number
of instances of training data, the result appears to be promising. For example, Fault Zone 3 only
needs five training samples as adding more data instances does not seem to improve the model
prediction accuracy. Based on the results from this sensitivity analysis of the number of training
samples, we only use ten instances of simulated data for training. Figure 4.17 and 4.18 show the
predictions of one of the test datasets using the RNN models trained with ten instances of simulated
Figure 4.15: The sensitivity analysis on testing RMSE loss of enthalpy prediction as a function of
the number of training datasets.
Figure 4.16: The sensitivity analysis on testing RMSE loss of BHP prediction as a function of the
number of training datasets.
data. The predictions match both the short-term control effects and the long-term gradual trend. These, and several other examples we have performed, show that even a few offline simulated datasets can substantially improve the prediction performance of the model. That is, once trained, RNN is able to
provide accurate long-term predictions.
Case 4: Long-Term Prediction using Simulated Field Data
To validate the prediction power of RNN, we apply it to a set of simulated data for a field case. The
simulated data is generated from the calibrated simulation model described above. For this data,
the weekly input controls are sampled from the field data between 2013/10/17 and 2020/12/31. The
simulated data has been verified against the actual measurements from the field. The field controls
have a more complex pattern than the randomly generated controls in the previous example. The
controls contain several sudden changes and long shut-in periods, which makes the problem more
challenging. In order to deal with the new patterns, new training samples are generated with long
shut-in periods to mimic the behavior of the control variables in the field. The number of training
samples is set to 10, and the window sizes are the same as in Case 3. The predictions of enthalpy
Figure 4.17: The long-term predictions of enthalpy in the field experiment using RNN trained on
10 full-sequence datasets.
Figure 4.18: The long-term predictions of BHP in the field experiment using RNN trained on 10
full-sequence datasets.
Figure 4.19: The long-term predictions of enthalpy in pseudo field data using RNN trained on 10
full-sequence datasets.
Figure 4.20: The long-term predictions of BHP in pseudo field data using RNN trained on 10 full-
sequence datasets.
and BHP are shown in Figures 4.19 and 4.20, respectively. From the long-term enthalpy shown in
Figure 4.19, it can be seen that the RNN model with the labeling scheme is not misled by the long
shut-in periods. Both the control effects (the spikes) and the gradual long-term trends are captured, though the error appears to accumulate as the prediction extends to longer time steps. The predictions of BHP (Figure 4.20) are again more accurate, and the accumulated error is minimal
because the short-term control effects dominate the change of BHP. The results suggest that RNN
has learned the correlation between control variables and BHP.
4.5 Discussion
We have presented the workflow for developing a CNN-RNN model with the labeling scheme and
evaluated its short-term and long-term prediction performance using a suite of numerical experi-
ments. Below, we discuss some of the important aspects of the proposed workflow and the model
structure.
4.5.1 Interpolation and Extrapolation
The experiments in the previous section suggest that the RNN performance for interpolation prob-
lems is more reliable than when used for extrapolation. For short-term applications, the RNN
model is trained using only historical data to predict the near future. Training based on historical
data is ideal because it does not require building reservoir simulation models, and it saves the costs
associated with running physics-based models to generate simulated training data. The one-year
short-term prediction might be long enough for certain practical applications (e.g., monitoring and
adjusting operations). The underlying model can be retrained and updated whenever new data
becomes available.
For long-term prediction, RNN had an acceptable performance for predicting the BHP, which
is dominated by the control effects. However, the model could not provide a reliable prediction
for the enthalpy. The issue in predicting the enthalpy could be related to the dominant reduction
trend in enthalpy data, which leads to an extrapolation problem, for which neural networks are
known to have difficulty. In this dissertation, we show that the extrapolation difficulty can be
addressed by adding a small number of simulated data about possible future trends and predicted
reservoir performance. However, other solution strategies that do not require the use of a physics-
based simulation model may be possible and can be investigated in the future. One approach to
improve the extrapolation power of RNN is to use regularization by incorporating the physics or
the expected long-term trends into the model architecture.
Figure 4.21: Enthalpy change under constant controls, which shows the change of trend when production starts from the initial equilibrium state.
Another challenge is how to handle time-variant trends in the data. For instance, the underlying physics may show different behaviors or regimes over different time intervals. An example
is shown in Figure 4.21 where the production enthalpy under constant control setting is in an
unsteady state at the beginning but reaches a pseudo-steady state after around 410 steps. Future
studies can focus on making the model more robust to the changes in the data trends, for example,
by introducing change points [172, 59] that are derived from the physics of the problems or the
amount of available data.
4.5.2 Importance of Honoring the Expected Patterns
As a data-driven model, RNN learns the dynamics during the training process. Therefore, it is
important to make sure the training data covers similar patterns and variability to those observed
in the field. We have tested the long-term prediction accuracy of the RNN model in field examples
where the labels are different from those in Experiment 3. Figure 4.22 demonstrates the effect
of a more realistic set of controls (with patterns shown in the field controls) on the prediction of
producers’ BHP in Fault Zone 2. In this example, two label variables with different patterns are
generated for the same training set, named ”unrealistic label” (Figure 4.22a) and ”realistic label”
(Figure 4.22b), respectively. The unrealistic label is generated by randomly setting the label to zero
and ignoring the length of these shut-in periods. In contrast, the realistic label has longer shut-in
Figure 4.22: The effect of label variables on the prediction performance for producer BHPs in
Fault 2.
periods closer to the test data pattern (field controls). For the two cases, the RNN models are
trained using the same training series but with different labels and then tested with the available
data set. For the model trained with the unrealistic label, significant discrepancies between the
true data and the prediction are observed after every long shut-in period. Meanwhile, the model
trained with realistic labels can predict the behavior of the wells after the long shut-in periods
because it has seen similar patterns/behaviors during training. This example shows the importance
of honoring the expected patterns in the field when generating simulated training data.
Figure 4.23 is another case that shows the importance of honoring the expected patterns in the
field data. Here we investigate the effect of injection shut-in periods on the prediction of producers’
BHP for Middle Fault and Fault 3 in Experiment 4. The injectors were not active at the beginning of the field data; they only came online after the producers had operated for around 24 weeks. The inactivity or shut-in period
for the injection well causes drops in reservoir pressure and affects the prediction of the field BHP
data. In this example, two different sets of training samples are generated without the injection
shut-in period (Figure 4.23a) and with the shut-in period of 24 weeks at the beginning of operation
(Figure 4.23b), respectively. It can be observed that the BHP predictions show discrepancies for
the case ”without shut-in period” (Figure 4.23c). The injection shut-in period is expected to cause
Figure 4.23: The effect of shut-in period on the prediction performance for producer BHPs in
Middle Fault and Fault 3.
a larger drop in BHP than the injection operation without a shut-in period. Therefore, the overall
BHP responses from training samples without including the shut-in period are biased and show
values higher than the field BHP. As shown in Figure 4.23d, the BHP responses from the Middle
Fault and Fault 3 match the observed data after including the same shut-in period as in the testing
set, indicating that the training set is more representative of the real behavior.
4.5.3 Computational Cost
Although the training strategy we introduced for the long-term prediction requires a physics-based
model, the RNN model can still dramatically reduce the computation cost compared to physics-
based approaches, especially in applications such as optimization, where many predictions under
various operating scenarios are needed. We can reduce the computational cost by limiting the num-
ber of forward runs and by parallelizing the computation. Additionally, the computational cost of
training the RNN model is insignificant compared with the cost related to running full reservoir
simulation. The RNN’s ability to be massively parallelized also makes global optimization tech-
niques such as the Genetic Algorithm practical. For gradient-based control optimization problems
using the simulation model, the computation of each gradient requires different numbers of simu-
lation runs, depending on the dimension of the control variable. In contrast, the computational cost
of gradient calculation with RNN is constant and negligible. A comparison of computation under
different dimensions of control is listed in Table 4.2. It is important to note that the tuning process
for the RNN can increase its computational cost.
Table 4.2: Details of the computational cost of each gradient for different optimization methods.
Dimension of Control | RNN Proxy-based (sec) | Finite-Difference (simulations)
14                   | 0.806                 | 15
42                   | 0.864                 | 43
84                   | 0.862                 | 85
168                  | 0.865                 | 169
The examples of this dissertation are run on an Intel Core i7-9700K CPU @ 3.60GHz with
eight cores, with an installed memory (RAM) of 64.0 GB. The simulation model developed for the
simulation-based optimization has a maximum time step of 180 days for the Newton iterations
and takes on average 4-5 minutes. In contrast, computing each gradient using the deep-learning
proxy model takes less than a second. The simulation developed for generating a weekly training
dataset has a maximum time step of 7 days and takes around 70 minutes for a 20-year simulation
and around 28 minutes for an 8-year simulation. For the first three field examples, 500 simulations
over 20 years are run to generate the training, validation, and testing sets. The total of 500 sim-
ulations (jobs) can be conveniently parallelized using the eight cores, taking approximately three
and a half days. For the last field example, an additional ten simulations over eight years are run to
generate the training sets, which are completed within one hour.
Chapter 5
A Multiscale Recurrent Neural Network Model for Long-Term
Prediction of Geothermal Reservoir Energy Production
5.1 RNN
RNN maps a sequential input to a sequential output through a sequence of repeated units. The
units share the same structure and the same set of weights. In most RNN, there is also a hidden
state for each unit to store useful information. Each unit takes the hidden state of its previous unit
as an additional input and passes its own to the next unit. Such design gives these units access to
information from their past (time). LSTM and GRU are the two most popular RNN architectures. Both of them have been proven to be effective for time-series problems. We have chosen LSTM as the RNN in this study. The structure of the LSTM unit is shown in Figure 5.2, where t, x, h, and C are the current time, the input control, the hidden state, and the cell state, respectively. The hidden state is the output of the current unit, while the cell state is the internal memory. Both states are passed to the next unit as inputs.

Figure 5.1: A demonstration of the extrapolation issue where all the solution curves match the training data but are very different in the test part. The data on the left of the vertical line are used for training, while the data on the right are used for prediction.

LSTM has three gates that control the propagation of information: the forget gate f_t, the input gate i_t, and the output gate o_t. The gates are computed as

f_t = \sigma(w_{fh} h_{t-1} + w_{fx} x_t + b_f)

i_t = \sigma(w_{ih} h_{t-1} + w_{ix} x_t + b_i)

o_t = \sigma(w_{oh} h_{t-1} + w_{ox} x_t + b_o)
where w and b are the weights and biases, and \sigma is the sigmoid function. The forget gate f_t decides how much information in the cell state should be forgotten. The input gate i_t decides how much information in the candidate cell state should be memorized. The output gate o_t controls the output. The candidate cell state \hat{C}_t is computed as

\hat{C}_t = \tanh(w_{Ch} h_{t-1} + w_{Cx} x_t + b_C)

where w, b, and \tanh are the weight, bias, and hyperbolic tangent function, respectively. Then the new cell state is

C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t
Figure 5.2: The structure of LSTM where t, x, C, and h are the current time step, the input control,
the cell state, and the hidden state, respectively.
The hidden state h_t, which is also the output of the unit, can be computed as

h_t = o_t \odot \tanh(C_t)

where \odot denotes the element-wise product and \tanh is again the hyperbolic tangent function.
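For reference, a direct NumPy transcription of these update equations (a single LSTM step with pre-initialized weights, written here with each weight matrix acting on the concatenation of h_{t-1} and x_t, which is equivalent to the separate w_{*h} and w_{*x} matrices above) is sketched below.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # W and b are dictionaries of weight matrices and bias vectors keyed by
    # 'f', 'i', 'o', and 'C'; each W[k] multiplies the concatenation [h_prev, x_t].
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    C_hat = np.tanh(W["C"] @ z + b["C"])      # candidate cell state
    C_t = f_t * C_prev + i_t * C_hat          # new cell state
    h_t = o_t * np.tanh(C_t)                  # hidden state / unit output
    return h_t, C_t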
5.2 Multiscale RNN
Extrapolation is a well-known issue of neural networks. Figure 5.1 is a simple synthetic example
that demonstrates this issue. The black dots are the true data, while the lines are the possible
solutions. The vertical dashed line divides the data into training (left) and test (right). All the
possible solutions fit the training data well, and they stay close to the true data during the test when it is still close to the training section (close to the vertical line). But as the prediction moves further from the data points reserved for training, the wrong solutions deviate from the true data. Neural networks are powerful approximators; they can learn any of the possible solutions during training. This is fine if we only want short-term predictions, since all the solutions are acceptable, but it becomes a problem if we need predictions at later time steps (e.g., time step 1000 in Figure 5.1). What makes it worse is that for problems where the historical and future data have different ranges of values, like the one in Figure 5.1, the network is very unlikely to pick the true solution. Because all the data the model has seen in training fall within a certain range (e.g., 0.6 to 1.0 in Figure 5.1), the model tends to predict within the same range for future time steps, so it usually ends up underestimating the changes. This range issue is also pointed out by Lai et al. [95]. However, it is very common for
data in dynamic systems to have a general increasing or decreasing trend, and the brine enthalpy is
one of them.
One way to avoid such an issue is to acquire an extensive amount of data. Then the network
only needs to interpolate between data points. This is essentially what happens if we train the
network using simulation data in addition to historical data [77]. When there is simulation data
available, the task for the network becomes interpolating between different production scenarios.
Another way is to use knowledge about the data to guide the design of the network. For example, if we know the trend of the data in Figure 5.1 is linear and we limit the complexity of the network accordingly, then it will not be able to learn the other possible solutions in the plot. Fortunately, the brine enthalpy data has some characteristics that we can incorporate into the predictive model.
From our observation, the brine enthalpy from geothermal reservoirs typically contains two
types of features. First, there is a long-term gradual decline trend which is a reflection of the
temperature decline in the reservoir. Then, there are also sudden changes caused by the well
control (pump speed). An RNN that can fit those nonlinear sudden changes is obviously over-
complicated for the smooth and linear decline. Therefore, applying a single RNN to do long-term
prediction for brine enthalpy will probably get into the trouble of overfitting. Lai et al. [95] faced
similar challenge. In their study, their data also contains features in two distinct scales. To improve
RNN’s long-term predicting power, they have developed a long- and short-term model, where an
autoregressive model (AR) is expected to capture the long-term smooth features, while an RNN
model learns the high-frequency features. Inspired by this idea, we have developed a multiscale
RNN model, whose structure is shown in Figure 5.3.

Figure 5.3: The architecture of multiscale RNN. The RNN is the short-term model and the FCNN is the long-term model.

It has a long-term model and a short-term model, which are an FCNN and an RNN, respectively. The inputs to the RNN are the recent history and the controls in the future. The purpose of feeding the recent history is to initialize the hidden state of the RNN. As introduced before, RNN has several sets of weights and biases for the gate mechanisms, which makes it more complex and prone to overfitting simple linear trends. Therefore, we added the FCNN to capture the smooth patterns and have the RNN learn the remaining nonlinear features. The FCNN only has access
to the cumulative controls (e.g., the cumulative mass flow rate of the producers and the injectors)
and its complexity is limited. These designs along with a modified loss function force it to focus on
the smooth trends and not to pick up any of the nonlinear features. Lastly, the short-term prediction
from the RNN and the long-term prediction from the FCNN are added together to form the final
prediction of multiscale RNN.
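A minimal Keras sketch of this architecture is given below. The layer sizes, input names, and the way the cumulative controls are supplied are illustrative assumptions; the states initialized from the recent history are assumed to be produced by a separate encoder.

import tensorflow as tf

STEPS, N_CTRL, N_CUM, HIDDEN = 48, 4, 2, 32   # illustrative dimensions

controls = tf.keras.Input(shape=(STEPS, N_CTRL), name="future_controls")
cum_controls = tf.keras.Input(shape=(STEPS, N_CUM), name="cumulative_controls")
init_h = tf.keras.Input(shape=(HIDDEN,), name="hidden_from_history")
init_c = tf.keras.Input(shape=(HIDDEN,), name="cell_from_history")

# Short-term model: an LSTM over the future controls, initialized from the
# recent history, captures the nonlinear control-driven changes.
short_term = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(
    controls, initial_state=[init_h, init_c])
short_term = tf.keras.layers.Dense(1, name="short_term_prediction")(short_term)

# Long-term model: a small FCNN that only sees the cumulative controls, so
# its limited complexity forces it to capture the smooth trend.
long_term = tf.keras.layers.Dense(8, activation="tanh")(cum_controls)
long_term = tf.keras.layers.Dense(1, name="long_term_prediction")(long_term)

# The final prediction is the sum of the two scales.
prediction = tf.keras.layers.Add(name="multiscale_prediction")([short_term, long_term])
model = tf.keras.Model([controls, cum_controls, init_h, init_c], prediction)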
5.3 Training
Multiscale RNN is trained by backpropagation with a modified loss function. The FCNN is ex-
pected only to capture the long-term smooth trend. We have chosen the mean absolute error (MAE)
of the long-term prediction (difference between the long-term prediction and the true data) as the
training loss of the FCNN. This is because we want the model to capture only the main trend instead of achieving a good overall match. MAE is more tolerant of errors and outliers, which means the model is more likely to not fit the nonlinear features and leave them to the short-term model. The
objective function for the long-term FCNN model is:
\min_{\Theta_L} L_L(\Theta_L) = \min_{\Theta_L} \frac{1}{N} \sum_{t \in \text{training}} \left| Y_t - \hat{Y}_{L,t} \right|

where \Theta_L, L_L, N, Y_t, and \hat{Y}_{L,t} are the parameters in the long-term model (FCNN), the long-term loss, the total number of data points, the true data, and the long-term prediction, respectively.
The short-term RNN model is expected to pick up all the remaining features and make the final
prediction match the true data. Here we want the final prediction to have a good overall match to
the training data. Instead of MAE, the parameters in the RNN are tuned by minimizing the mean
squared error (MSE) of the final prediction (difference between the final prediction and the true
data). The objective function for the short-term RNN model is:
\min_{\Theta_S} L_S(\Theta_S) = \min_{\Theta_S} \frac{1}{N} \sum_{t \in \text{training}} \left\| Y_t - \hat{Y}_t \right\|_2^2

where \Theta_S, L_S, N, Y_t, and \hat{Y}_t are the parameters in the short-term model (RNN), the short-term loss, the total number of data points, the true data, and the final prediction, respectively.
The overall loss function L(\Theta) we use in our implementation becomes

L(\Theta) = c_1 L_L(\Theta_L) + c_2 L_S(\Theta_S) + c_3 \|\Theta\|_2^2 + c_4 |\Theta|

where c_1, c_2, c_3, and c_4 are the weights for each of the terms, \Theta represents all the parameters in multiscale RNN, and \|\Theta\|_2^2 and |\Theta| are the L2 and L1 regularization terms of the model parameters.
To avoid the short-term loss affecting the parameters in the long-term model, we do not compute gradients with respect to the FCNN parameters during the back-propagation of the short-term loss. The optimization of the model parameters can be done using any of the stochastic gradient descent methods (e.g., Adam [84]).
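One way to implement this selective back-propagation in TensorFlow is to stop the gradient of the short-term (MSE) loss at the long-term branch, as sketched below; the loss weights and variable names are illustrative placeholders.

import tensorflow as tf

def multiscale_loss(y_true, y_long, y_short, params,
                    c1=1.0, c2=1.0, c3=1e-4, c4=1e-4):
    # y_long:  prediction of the long-term FCNN model.
    # y_short: prediction of the short-term RNN model.
    # params:  list of all trainable variables (for the L2/L1 terms).

    # Long-term loss: MAE between the long-term prediction and the data.
    loss_long = tf.reduce_mean(tf.abs(y_true - y_long))

    # Final prediction used in the short-term loss; stop_gradient keeps the
    # MSE term from updating the FCNN parameters through the sum.
    y_final = tf.stop_gradient(y_long) + y_short
    loss_short = tf.reduce_mean(tf.square(y_true - y_final))

    l2 = tf.add_n([tf.reduce_sum(tf.square(p)) for p in params])
    l1 = tf.add_n([tf.reduce_sum(tf.abs(p)) for p in params])
    return c1 * loss_long + c2 * loss_short + c3 * l2 + c4 * l1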
5.4 Numerical Experiments
In this section, we present the results of experiments where we apply multiscale RNN to a geother-
mal field case to evaluate the performance of the proposed model. There is an existing simulation
model of this geothermal field provided by our partner. So we first introduce a set of experiments
Figure 5.4: The configuration of the geothermal field. The color indicates the normalized initial
temperature.
using synthetic data that is generated by this simulation model under different control scenarios.
Lastly, we show the results of applying multiscale RNN to the actual field data. In this geothermal
reservoir, there are 14 wells that are distributed in four fault zones. The configuration of the field is
shown in Figure 5.4. These fault zones have high permeability and convective thermal gradients.
It has been validated by tracer tests and field data that there is no connectivity between wells in
different fault zones [29]; thus, the fault zones can be modeled independently. In the experiments,
we use data from the simplest and the most complicated fault zone, fault zone 1 and fault zone 2.
Fault zone 1 has only one producer but no injector. Fault zone 2 has three producers and five in-
jectors. Although the pump speed is the actual control variable in the field, it is not available in the
simulation model. So we use the mass flow rates of the wells as the control for both the synthetic
data and field data. As mentioned, our target variable is the brine enthalpy of the producers.
To better assess the accuracy of multiscale RNN, we also train an RNN without the multiscale structure and AR models for these datasets and compare their accuracy to that of the proposed model. Multiscale RNN and RNN are implemented using TensorFlow [2]. AR models are implemented using Statsmodels [155]. The RNN without the multiscale structure has the same structure
as the short-term RNN model in multiscale RNN. It has a very similar total number of parame-
ters to multiscale RNN since the FCNN model in the multiscale architecture has a small number
of parameters. For the training of RNN and multiscale RNN, training samples are sampled from
the training data using a sliding window with a fixed length. 10% of the training samples are ex-
cluded from training and used for validation. When making long-term prediction in the testing,
the length of RNN units is extended to match the length of the test data. So, RNN can predict
all the future steps simultaneously. For the AR models, the controls are added as the Exogenous
Variables. For fault zone 1 where there is only one target variable (one producer), we apply Au-
toregressive with Exogenous Variables (ARX). For fault zone 2, where there are multiple target variables (three producers), we use Vector Autoregressive with Exogenous Variables (VARX). To make a fair comparison, the AR models have a lag length equal to the length of the history window in the RNN models. To evaluate the long-term predicting ability of the models, we set the prediction horizon to be relatively long in the experiments. About the last 40% of the time steps are reserved for testing. The simulation model and the field data are obtained from our partner Cyrq Energy Inc.
Due to data confidentiality, we only present the normalized data.
5.4.1 Synthetic Data
The three-dimensional simulation model is built in TETRAD by Cyrq Energy Inc. It is single-
phase, dual-porosity and has about 35,000 grid blocks in total. The configuration of the simulation
model is shown in Figure 5.4. The model has a length of 18km, a width of 16km, and extends
vertically from -3000 mRSL to +1500 mRSL. In the synthetic dataset, the properties of the wells
are sampled weekly and have a total length of 1048 data points (about 20 years). The last 434 weeks (roughly eight years) are used for testing. The remaining data points are used for training. Each training sample consists of six steps as the history and 48 steps for the model to fit. We apply a mini-batch with a size of 10 and shuffle the samples randomly during training. The control trajectories are generated by changing the controls randomly every year. In addition, a 2% random
noise is added to the controls. For both fault zone 1 and 2, we first show the results of one of the
production realizations and then present the statistical results over ten realizations.
5.4.1.1 Fault Zone 1
Fault zone 1 has the simplest well configuration, which has only one producer and no re-injection.
The data of one of the production realizations in the synthetic dataset is shown in Figure 5.5. The cumulative flow rate is calculated from the flow rate. We can clearly see from the data that the
enthalpy has a steady and smooth decline trend, but there are also sudden changes whenever the
mass flow rate of the producer changes. The enthalpy predictions of this production realization are
shown in Figure 5.6. For the training part on the left of the vertical dashed line, each time step is
included in multiple training samples. Thus, here we only show some of the training samples that
are consecutive. It can be seen that there is some discontinuity in the prediction of training data,
which happens when moving from one training sample to another. Overall, all three models fit the
training data well. For the test part of the data, all three models stay close to the reference curve
at the beginning. But as it gets further from the training data, multiscale RNN still follows the
reference curve, while both ARX and RNN diverge from the true data. The results suggest that multiscale RNN has better captured the underlying trend in the data. Figure 5.7 shows the two intermediate predictions of multiscale RNN, the long- and short-term predictions. It can be observed that the long-term model has learned the smooth decline and the short-term model has captured the nonlinear features, which is exactly what they are designed to do.
After examining the detailed results of one of the production realizations, we then test the proposed model on ten realizations and obtain the statistical results, which are shown in Figure 5.8.
As a result of the extrapolation issue, the RNN has the worst accuracy (highest error) among all
three models. Multiscale RNN has achieved a slightly better overall accuracy than the ARX model,
which becomes much more significant when we look at the last part of the test data (last year/54
time steps). The statistical results match the impression we get from the example, suggesting that
Figure 5.5: One of the synthetic realizations of fault zone 1.
Figure 5.6: Predictions from multiscale RNN, RNN, and ARX of the synthetic example of fault
zone 1.
Figure 5.7: The two intermediate predictions, long- and short-term, of multiscale RNN in the
synthetic example of fault zone 1.
the multiscale structure can greatly improve the predicting power of RNN and make it much better
at long-term prediction.
5.4.1.2 Fault Zone 2
We then move from fault zone 1 to fault zone 2. Fault zone 2 has the most complicated well
configuration in this field. It has three producers and five injectors. Thus, there are three target variables, the brine enthalpies of the three producers. The flow rates of the injectors are also included as part of the control. The injection temperatures are not included because they are stable and have little correlation with the enthalpy of the producers. The normalized data of this fault zone is shown in Figure 5.9. We can observe from the data that the enthalpy of all three producers is generally smooth, but it also contains signatures of the control signal, which is the same as the data of fault zone 1. The predictions are shown in Figure 5.10. Again, all three models match the training part well. Multiscale RNN outperforms the RNN and VARX models as the prediction moves further from the historical data, except for producer 1, where VARX achieves a better match. Looking more closely at producer 1, although there is a discrepancy between the prediction from multiscale RNN and the true data, the prediction has a gradient very similar to that of the reference curve at the end of the data.
Figure 5.8: The statistical results of fault zone 1 over 10 production realizations.
Figure 5.9: One of the synthetic realizations of fault zone 2.
Therefore, the discrepancy would probably not become more significant if more data were available. The prediction of VARX, however, is flatter than the reference curve at the end, so its accuracy could get worse with more data.
The statistical results are obtained for ten realizations. The average root mean square error
(RMSE) shown in Figure 5.11 again suggests that RNN has the worst long-term predicting ability
among the three models. Multiscale RNN has the lowest overall error, and the difference between
it and VARX is larger when we look at the last year of the test data. The statistical results show the
Figure 5.10: Predictions from multiscale RNN, RNN, and ARX of the synthetic example of fault
zone 2.
Figure 5.11: The statistical results of fault zone 2 over 10 production realizations.
superior long-term predicting power of multiscale RNN over the other two models, which validates
our observation in fault zone 1.
5.4.2 Field Data
We then evaluate the proposed model using the actual field data. The weekly field data has a total
length of 380 time steps (roughly 7 years). The models are trained with the first 231 time steps and
predict the last 149 steps. The length of the training samples is the same as in the synthetic experiments.
The field data of fault zone 1 and 2 are shown in Figure 5.12 and Figure 5.13, respectively. As
can be seen from the figures, the field data is much noisier than the synthetic data. There are a
lot of sudden changes and outliers, especially in the flow rates. Besides data noise, there are a lot
of zero values in the data. Many of the zeros are actually missing data in the dataset. Possible
reasons include well shut-in, human error (the data are recorded manually), and equipment failure.
However, it is hard to figure out whether the zero values are missing values or true zeros. It is also impossible to find the actual reason behind every missing value. In our experiments, we keep the zero values in the control variables because we don't know if they are true zeros or missing data. For
the target variable enthalpy, since enthalpy cannot be zero, we are sure that the zeros are actually
missing data. So we remove the zeros in enthalpy and fill the gaps through linear interpolation.
Figure 5.12: The field data of fault zone 1.
Besides, we don’t remove the outliers before the linear interpolation. Although this might not be
the best way for training and might create some strange patterns after linear interpolation, we want
to honor the data and also test the models’ resistance to data noise.
The enthalpy after linear interpolation and the predictions are shown in Figure 5.14 and Figure
5.15. It can be seen that the filled values seem appropriate in Producer 1 and Producer 2 in fault zone 2, whereas the linear interpolation creates artificial error around time step 140 in the producer of fault zone 1 and producer 3 of fault zone 2. For Producer 1 and Producer 2 in fault
zone 2 where the filled values are appropriate, multiscale RNN significantly outperforms RNN
and has slightly better accuracy than AR models. For the cases where there is artificial noise, the
training shows that all three models try to fit the noise. It can be seen from the results that fitting the noise has a negative impact on the accuracy of the AR models. Interestingly, the predictions of multiscale RNN are not affected by the noise at all. This is because the noise is taken care of by the short-term model, and thus it does not affect the learning of the general decline trend by the long-term model. The results of the field data indicate that multiscale RNN not only has better long-term predicting power but is also superior at dealing with data noise.
Figure 5.13: The field data of fault zone 2.
Figure 5.14: Predictions from multiscale RNN, RNN, and ARX of the field data of fault zone 1.
Zero enthalpy values are removed and filled by linear interpolation.
Figure 5.15: Predictions from multiscale RNN, RNN, and ARX of the field data of fault zone 2.
Zero enthalpy values are removed and filled by linear interpolation.
5.5 Discussion
We can conclude from the experiments that the multiscale structure can greatly improve the long-
term predicting ability of RNN. One of the reasons that RNN is more vulnerable to the extrapolation issue is the additional parameters from its gate mechanisms. As introduced in the methodology section,
LSTM has four sets of weight and bias, three for different gates and one for the candidate cell state.
As a result, the number of parameters in LSTM is four times the number of parameters in an FCNN
that has similar dimensions (input, output, and hidden layer). The gate mechanisms make them
great at learning dynamics but also give them more freedom in the fitting. The extra approximating
power limits the generalizing ability of RNN when learning simple linear trends.
We have also observed that though AR models have worse long-term accuracy than multiscale
RNN, their accuracy is not too bad. We think this is because the long-term trend in our data is
more dominant compared to the short-term changes. Therefore, if the data is very smooth and
doesn’t contain much noise, AR models are good alternatives since they are easier to train and
don’t require a lot of tuning. But for a dataset where there are more nonlinear features, we think
multiscale RNN is the better candidate and expect the performance difference between multiscale
RNN and AR models to be greater than what we have seen in our experiments.
Besides the better long-term predicting ability, noise resistance is another great advantage
of multiscale RNN when applying it to geothermal applications. Missing data is ubiquitous in
geothermal data due to regular maintenance and equipment failure. In addition, the wells are shut
down regularly because many of the fields don’t operate during certain seasons. We have observed
there are dramatic shifts in the well property values whenever wells are shut down or reopened.
It is good to clean the data before using it to train models. But identifying the outliers and filling
the data gaps might inject bias and mislead the models if they are done inappropriately. The noise
resistance of multiscale RNN makes the pre-processing of geothermal data much easier. Even
if some errors are introduced during data pre-processing, as in Figure 5.14 and Figure 5.15, the
short-term model will take care of it and let the long-term model focus on the general trend.
Figure 5.16: Comparison of models trained simultaneously and sequentially for the synthetic ex-
ample of fault zone 1.
One of the important aspects of multiscale RNN is the training approach. An alternative to the current simultaneous training is sequential training. In sequential training, we first train the long-term model and then train the short-term model after subtracting the long-term prediction from the data. We have compared the two training approaches and found that the results are very similar. We show one example in Figure 5.16, where we train two models for the synthetic example of fault zone 1, one simultaneously and one sequentially. It can be seen from the figure that the predictions of the test data are almost identical, which shows that the loss function we designed for the simultaneous training is effective. We have chosen simultaneous training over sequential training because training the long- and short-term models together is more efficient.
The multiscale RNN in this dissertation has two models, short-term and long-term, to cope with the features at two distinct scales in geothermal data. The long-term model, the FCNN, can be replaced by other simple models, e.g., AR models. We pick FCNN since it can be trained together
with RNN under deep learning frameworks, which makes the training process more efficient. For
problems where there are more than two types of patterns, we can add additional models with
different levels of complexity to the proposed model.
Chapter 6
Transfer Learning for Geothermal Prediction with Single-Well
Model and Residual Learning
In Chapters 4 and 5, we investigated the use of RNN for the prediction of geothermal flow re-
sponses. Although the multiscale structure we introduced has eliminated the need for simulation
data for long-term prediction, RNN still needs data to train. The problem is that by the time we have enough data for training, there isn't much left to optimize, because most of the wells are drilled at the beginning of production. The only way to have a predictive model at the beginning stage of
production (limited data) is to somehow use existing data or knowledge from other fields. This
chapter will discuss our investigation of transfer learning for geothermal prediction.
Making use of data from existing fields is very challenging. First, there are not many geothermal
fields, and many of them do not maintain their data well; some of the datasets we examined contain
so many human errors that they are unusable, which is especially common for fields with a long
production history. Second, fields can differ greatly from one another in well configuration, boundary
conditions, reservoir characteristics, etc. As a result, the knowledge that remains valid in the new
field can be very limited. Transfer is also challenging from the model's perspective: RNN requires
fixed input and output dimensions, but the number of wells differs from field to field; it is hard to
transfer knowledge because the meaning of each parameter in the model cannot be interpreted; and
it is difficult to verify the transfer because little data is available from the new field.
Figure 6.1: A demonstration of the (a) single-well model and (b) multi-well model.
6.1 Single-Well Model
To deal with the data shortage and the dimension issue, we have developed a single-well model,
whose structure is shown in Figure 6.1. Here we refer to the model used in Chapters 4 and 5 as a
multi-well model: it takes the controls of each of the wells in the same fault as input and predicts
the target variable for all the wells in that fault. The proposed single-well model instead makes
predictions for a single well. It takes the control of the target well as one input; the other input is
the total control (e.g., total flow rates) of the fault. With this structure, we avoid the dimension
issue because the well configuration no longer affects the input/output dimension, and we also
increase the number of training samples because most geothermal fields have multiple producers.
A minimal sketch of how the inputs could be assembled is given below.
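As an illustration only, the snippet below sketches how training samples for such a single-well model could be assembled from a field with several producers; the array names, shapes, and the use of flow rate as the control variable are assumptions rather than the exact pre-processing used in this work.

# Illustrative assembly of single-well training samples: for each producer, the inputs are
# (i) the well's own control and (ii) the summed control of all wells in the same fault.
import numpy as np

n_wells, n_steps = 3, 1000                   # hypothetical well count and history length
controls = np.random.rand(n_wells, n_steps)  # e.g., per-well flow rates
targets = np.random.rand(n_wells, n_steps)   # e.g., per-well temperature or pressure

total_control = controls.sum(axis=0)         # total control of the fault, shared by all samples

X, y = [], []
for w in range(n_wells):                     # every producer contributes its own training samples
    X.append(np.stack([controls[w], total_control], axis=-1))  # per-step input of size 2
    y.append(targets[w])
X = np.concatenate(X, axis=0)                # input size no longer depends on the number of wells
y = np.concatenate(y, axis=0)

Because every producer is mapped to the same two-channel input, the same network can be reused across faults and fields with different numbers of wells.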
6.2 Residual Learning
To make sure useful knowledge is transferred, we have adopted the residual learning scheme shown in
Figure 6.2. In this process, we first train a model A using the data from an existing field A. Then,
we use model A to make a prediction for the target field B. This prediction is likely to be off, since
model A has never seen data from field B during its training. We then fit a new model B to the
residual between model A's prediction and the field-B data, and the sum of the outputs of model A
and model B becomes the final prediction of the future of field B. In doing so, we make sure that
the knowledge from field A is preserved (in model A).
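A minimal sketch of this residual-learning step is shown below, assuming both models expose a scikit-learn/Keras-style fit/predict interface and that the early field-B data are NumPy arrays; the function names and interfaces are illustrative, not the exact implementation used here.

# Illustrative residual learning: model_a is pre-trained on field A and kept frozen,
# while model_b is fit only to the misfit between model_a's prediction and the field-B data.
def fit_residual_model(model_a, model_b, X_b, y_b):
    base_pred = model_a.predict(X_b)       # knowledge transferred from field A, left unchanged
    model_b.fit(X_b, y_b - base_pred)      # model_b learns only the field-B correction
    return model_b

def residual_predict(model_a, model_b, X_new):
    # The final field-B forecast is the sum of the transferred model and the learned correction.
    return model_a.predict(X_new) + model_b.predict(X_new)

Because model A is never retrained, whatever it learned from field A is preserved exactly, and model B can be kept small when only limited field-B data are available.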
Figure 6.2: A demonstration of residual learning.
6.3 Preliminary Result
We have evaluated the single-well model and residual learning using one of the simulation realizations.
The results for fault zone 2, which contains three different producers, are shown in Figure 6.3.
In this example, two of the producers are treated as existing wells, while the last one is the target.
This is an oversimplified example intended only to investigate the characteristics of the proposed
method. For comparison purposes, we trained four different models. The green dot in Figure 6.3
is the proposed method, a single-well model with residual learning; the red dot is a single-well
model with RNN; the yellow dot is a single-well model with multiscale RNN; and the blue dot is
the model introduced in Chapter 5, a multiscale RNN with access to the control information of each
of the wells. Comparing the three single-well models, the one with residual learning achieved better
accuracy, which indicates that the residual learning approach is effective. However, all three
single-well models sometimes mispredict the control effects (e.g., the control change around step
650). We believe this error comes from summarizing the controls of the other wells into a single
total, because the multi-well multiscale RNN model captured the direction of every change correctly,
even though its prediction had a larger error than the proposed method.
The preliminary results indicate that the well configuration and the detailed controls of the other
wells in the same fault play an important role in predicting the target well, and that the total
controls cannot fully represent the information needed for an accurate prediction. We need an
approach that not only makes the input/output dimensions uniform but also gives the model access
to the well configuration and detailed control information.
Figure 6.3: A comparison between single-well models and multi-well model.
Chapter 7
Conclusion and Future Work
In this chapter, we first briefly summarize the main topics and research elements that this dissertation
covers. We then outline the main conclusions obtained in this work. Finally, we suggest future work
and potential lines of research that follow from the obtained conclusions or from tasks left incomplete
in this dissertation.
7.1 Summary
In this dissertation, we focused on the applications of deep learning models to the characterization
and forecasting of the subsurface environment. Chapter 1 introduced the research background of the
work. Chapters 2 and 3 addressed characterization problems. In Chapter 2, we presented a workflow
that takes advantage of CNN for screening geologic scenarios based on flow response data to reduce
geologic uncertainty. In Chapter 3, we proposed VAE for robust dimension-reduced parameterization
of spatially distributed aquifer properties in solving model calibration problems under uncertain
geostatistical models. We then switched to the forecasting of geothermal reservoirs. In Chapter 4, we
developed a CNN-RNN architecture and a detailed workflow to predict geothermal flow responses,
together with a labeling scheme that excludes the affected time steps to handle real-field disturbances,
such as shut-in periods due to equipment failure and maintenance. In Chapter 5, we introduced a
multiscale architecture to improve the RNN's long-term prediction ability for geothermal data.
7.2 Conclusions
The key conclusions obtained from this work can be summarized as follows:
• Chapter 2
– The assumption of a prior conceptual model of a geologic scenario involves uncertainty
and subjectivity. We propose a selective workflow that calibrates the geologic scenario
against dynamic data.
– The presented workflow consists of two main components: feature extraction and fea-
ture recognition. The feature extraction step takes advantage of the discriminating
power of leading PCs to extract the main features. The feature recognition step uses
CNN to provide feature-based distances. An iterative scheme is proposed to improve
the accuracy of the method by gradually adapting the solution to the identified scenar-
ios.
– The workflow takes advantage of the discriminating power of leading PCs and the
pattern-based classification power of CNN.
– The proposed method is evaluated using aquifer pumping tests in non-Gaussian two-
dimensional models and a three-dimensional case study. The results over several hun-
dred test cases show that our workflow is able to extract the correct salient features
from flow data and yields reliable predictions of relevant geologic scenarios.
• Chapter 3
– We aimed to develop a parameterization method that can include all plausible geologic
scenarios and diverse patterns, which would eliminate the need for assuming one prior
geologic scenario.
– We exploited this advantage of VAE for parameterization of distributed flow properties
under uncertainty in the geologic scenario and illustrated its superior performance relative
to a conventional method such as PCA.
– The superiority of VAE is largely attributed to a number of its inherent properties,
including the local pattern-learning power of convolutional neural networks, multiscale
representation learning across the layers, and the nonlinearity introduced via activation
functions.
– The advantages of VAE come with higher computational demand and the effort needed
for hyper-parameter tuning, which can be time-consuming. We have also observed a
higher level of difficulty in the inversion process using VAE than PCA, in agreement with
previous works.
• Chapter 4
– We investigated the use of a deep learning model for the prediction of the flow re-
sponses in geothermal reservoirs.
– We developed a CNN-RNN architecture and a detailed workflow that includes data
pre-processing, feature selection, and training. We demonstrated both the short- and
long-term prediction capability of the developed CNN-RNN model through a set of
experiments with increasing complexity, including a field example.
– The RNN performance for interpolation problems is more reliable than when used for
extrapolation. Thus, simulation data is needed for accurate long-term prediction though
historical data is enough for short-term prediction.
– It is important to make sure the training data covers similar patterns and variability to
those observed in the field.
– Although the training strategy we introduced for the long-term prediction requires a
physics-based model, the RNN model can still dramatically reduce the computation
cost of optimization.
• Chapter 5
– Our goal was to enhance the long-term prediction ability of RNN so it doesn’t need
simulation data.
– The proposed model consists of an RNN and a light-weighted FCNN. With a modi-
fied loss function and training process, the RNN is forced to learn only the nonlinear
features, while the FCNN only focuses on the general trend.
– We have evaluated the proposed model against RNN and AR models in a series of
experiments based on a geothermal field. The experiments have proven that the training
process is effective and the multiscale architecture works as designed.
– Noise resistance is another great advantage of multiscale RNN when applying it to
geothermal applications. Missing data is ubiquitous in geothermal data due to regular
maintenance and equipment failure.
– We have chosen simultaneous training over sequential training because training the
long- and short-term models together is more efficient and achieves the same accuracy
as sequential training.
• Chapter 6
– We explored ways to make use of existing data when training a model for a new field
so that it is possible to have a predictive model in the early stage of production.
– We have developed a single-well model that makes the dimension of input/output uni-
form. It also increases the number of training samples.
– We have adopted a residual learning workflow that keeps the existing knowledge by
having a pre-trained model.
– The results showed that the single-well model and residual learning are effective, but
summarizing controls of wells into one total control was not a perfect approximation.
7.3 Future Work
There are many directions that could be pursued in areas relevant to the elements and objective of
this dissertation. Our suggestions for future research are as follows:
• All the models we introduced in the dissertation are deterministic. There are also probabilis-
tic deep learning models that capture noise and uncertainty, which would be more practical
in real-world applications.
• Although the multiscale structure introduced in Chapter 5 eliminates the need for simulation
data for long-term prediction, we still need historical data to train the models, and for
applications like geothermal, accurate predictions are most valuable in the planning phase.
Therefore, one potential research direction is transferring data or models from another field
to a new field.
• We haven’t explicitly introduce physics into the models. Regularizing the models using
physics (e.g., penalize unrealistic properties in the loss function) would reduce the amount
of data required and accelerate the training.
• Another interesting question is if we can learn physics (e.g., well connections) from a reliable
deep learning model.
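As one illustration of what such a physics-based penalty could look like, the following is a minimal sketch (not an implementation used in this dissertation) that adds a term penalizing physically implausible, here negative, predicted quantities to an ordinary data-misfit loss; the constraint, its weight lam, and the function name are hypothetical.

# Hypothetical physics-regularized loss: data misfit plus a penalty on negative predictions
# (e.g., a flow rate or temperature that should never drop below zero).
import tensorflow as tf

def physics_regularized_mse(lam=0.1):
    def loss(y_true, y_pred):
        data_misfit = tf.reduce_mean(tf.square(y_true - y_pred))
        physics_penalty = tf.reduce_mean(tf.nn.relu(-y_pred))  # nonzero only for negative predictions
        return data_misfit + lam * physics_penalty
    return loss

# Example usage: model.compile(optimizer="adam", loss=physics_regularized_mse(lam=0.1))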
References
[1] S. Aanonsen. Efficient history matching using a multiscale technique. SPE Reservoir Eval-
uation & Engineering, 11(1):154–164, 2008.
[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Good-
fellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz,
Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry
Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya
Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda
Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiao-
qiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.
[3] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing
overcomplete dictionaries for sparse representation. IEEE Transactions on signal process-
ing, 54(11):4311–4322, 2006.
[4] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE trans-
actions on Computers, 100(1):90–93, 1974.
[5] Bulbul Ahmmed, Maruti K Mudunuru, Satish Karra, Scott C James, and Velimir V Vesseli-
nov. A comparative study of machine learning models for predicting the state of reactive
mixing. Journal of Computational Physics, 432:110147, 2021.
[6] James Ahrens, Berk Geveci, and Charles Law. Paraview: An end-user tool for large data
visualization. The visualization handbook, 717, 2005.
[7] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN, 2017.
[8] Dan Arnold, Vasily Demyanov, Temistocles Rojas, and Mike Christie. Uncertainty quantifi-
cation in reservoir prediction: Part 1—model realism in history matching using geological
prior definitions. Mathematical Geosciences, 51(2):209–240, 2019.
[9] Richard C. Aster, Brian Borchers, and Clifford H. Thurber. Parameter estimation and inverse
problems. 2005.
[10] Richard C Aster, Brian Borchers, and Clifford H Thurber. Parameter estimation and inverse
problems. Elsevier, 2018.
[11] Michelle C Baddeley, Andrew Curtis, and Rachel Wood. An introduction to prior infor-
mation derived from probabilistic judgements: elicitation of knowledge, cognitive bias and
herding. Geological Society, London, Special Publications, 239(1):15–27, 2004.
[12] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with
gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
[13] Eric W Bhark, Behnam Jafarpour, and Akhil Datta-Gupta. A generalized grid connectivity–
based parameterization for subsurface flow model calibration. Water Resources Research,
47(6), 2011.
[14] Clare Elizabeth Bond, Alan D Gibbs, Zoe K Shipton, Serena Jones, et al. What do you
think this is? "Conceptual uncertainty" in geoscience interpretation. GSA Today, 17(11):4,
2007.
[15] Léon Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade,
pages 421–436. Springer, 2012.
[16] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value
decomposition. Biological Cybernetics, 59:291–294, 1988.
[17] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high
fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[18] C.G. Broyden. The Convergence of a Class of Double-rank Minimization Algorithms 1.
General Considerations. IMA Journal of Applied Mathematics, 6(1):76–90, 03 1970.
[19] Ayaz Hussain Bukhari, Muhammad Asif Zahoor Raja, Muhammad Sulaiman, Saeed Islam,
Muhammad Shoaib, and Poom Kumam. Fractional neuro-sequential arfima-lstm for finan-
cial market forecasting. IEEE Access, 8:71326–71338, 2020.
[20] Jef Caers and Tuanfeng Zhang. Multiple-point geostatistics: a quantitative vehicle for inte-
grating geologic analogs into multiple reservoir models. 2004.
[21] Smith WA Canchumuni, Alexandre A Emerick, and Marco Aurélio C Pacheco. Towards a
robust parameterization for conditioning facies models using deep variational autoencoders
and ensemble smoother. Computers & Geosciences, 128:87–102, 2019.
[22] Smith W.A. Canchumuni, Alexandre A. Emerick, and Marco Aurélio C. Pacheco. Towards a
robust parameterization for conditioning facies models using deep variational autoencoders
and ensemble smoother. Computers & Geosciences, 128:87–102, 2019.
[23] G. Chavent and R. Bissell. Indicators for the refinement of parameterization. Inverse Prob-
lems in Engineering Mechanics, M. Tanaka & G.S. Dulikravich, editor, Elsevier, pages 309–
314, 1998.
[24] Zhangxin Chen, Guanren Huan, and Yuanle Ma. Computational methods for multiphase
flows in porous media, volume 2. Siam, 2006.
[25] Vinay Kumar Reddy Chimmula and Lei Zhang. Time series forecasting of covid-19 trans-
mission in canada using lstm networks. Chaos, Solitons & Fractals, 135:109864, 2020.
[26] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using
rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078,
2014.
[27] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-based models for speech recognition. arXiv preprint arXiv:1506.07503, 2015.
[28] Michael A Christie, James Glimm, John W Grove, David M Higdon, David H Sharp, and
Merri M Wood-Schultz. Error analysis and simulations of complex phenomena. Los Alamos
Science, 29(6), 2005.
[29] Trenton T Cladouhos, Matt W Uddenberg, Michael W Swyer, Yini Nordin, and Geoff H
Garrison. Patua geothermal geologic conceptual model. Trans.-Geotherm. Resour. Counc,
41:1057–75, 2017.
[30] S. Constable, R. Parker, and C. Constable. Occam’s inversion: a practical algorithm for
generating smooth models from electromagnetic sounding data. Geophysics, 52:289–300,
1987.
[31] William Cumming and Randall Mackie. Resistivity imaging of geothermal resources using
1d, 2d and 3d mt inversion and tdem static shift correction illustrated by a glass mountain
case history. In Proceedings world geothermal congress, pages 25–29. Bali, Indonesia,
2010.
[32] Vasily Demyanov, Dan Arnold, Temistocles Rojas, and Mike Christie. Uncertainty quan-
tification in reservoir prediction: part 2—handling uncertainty in the geological scenario.
Mathematical Geosciences, 51(2):241–264, 2019.
[33] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern
recognition, pages 248–255. Ieee, 2009.
[34] Melvin B Diaz, Kwang Yeom Kim, Tae-Ho Kang, and Hyu-Soung Shin. Drilling data from
an enhanced geothermal project and its pre-processing for rop forecasting improvement.
Geothermics, 72:348–357, 2018.
[35] J. Doherty. Groundwater model calibration using pilot points and regularisation. Ground
Water, 41 (2):170–177, 2003.
[36] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learn-
ing and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.
[37] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning,
2016.
[38] Rob A Dunne and Norm A Campbell. On the pairing of the softmax activation and cross-
entropy penalty functions and the derivation of the softmax activation function. In Proc. 8th
Aust. Conf. on the Neural Networks, Melbourne, volume 181, page 185. Citeseer, 1997.
[39] Gad El-Qady. Exploration of a geothermal reservoir using geoelectrical resistivity inver-
sion: case study at hammam mousa, sinai, egypt. Journal of Geophysics and Engineering,
3(2):114–121, 2006.
[40] Alexandre A Emerick and Albert C Reynolds. Ensemble smoother with multiple data as-
similation. Computers & Geosciences, 55:3–15, 2013.
[41] R. Fletcher. A new approach to variable metric algorithms. The Computer Journal,
13(3):317–322, 01 1970.
[42] Rui Fu, Zuo Zhang, and Li Li. Using lstm and gru neural network methods for traffic flow
prediction. In 2016 31st Youth Academic Annual Conference of Chinese Association of
Automation (YAC), pages 324–328. IEEE, 2016.
[43] El-Qady Gad, Keisuke Ushijima, and El-Sayed Ahmad. Delineation of a geothermal reser-
voir by 2d inversion of resistivity data at hamam faraun area, sinai, egypt. In World geother-
mal congress, pages 1103–1108, 2000.
[44] GR Gavalas, PC Shah, and John H Seinfeld. Reservoir history matching by bayesian esti-
mation. Society of Petroleum Engineers Journal, 16(06):337–350, 1976.
[45] D. Goldfarb. A Family of Variable Metric Methods Derived by Variational Means. Mathe-
matics of Computation, 24(109):23–26, 1970.
[46] Azarang Golmohammadi and Behnam Jafarpour. Simultaneous geologic scenario identi-
fication and flow model calibration with group-sparsity formulations. Advances in Water
Resources, 92:208–227, 2016.
[47] Azarang Golmohammadi and Behnam Jafarpour. Simultaneous geologic scenario identi-
fication and flow model calibration with group-sparsity formulations. Advances in Water
Resources, 92:208–227, 2016.
[48] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani,
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural
Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[49] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, ISBN
978-0262035613, 2016.
[50] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural net-
works. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing,
pages 6645–6649, May 2013.
[51] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition
with deep bidirectional lstm. In 2013 IEEE workshop on automatic speech recognition and
understanding, pages 273–278. IEEE, 2013.
[52] A.A. Grimstad, T. Mannseth, G. Naevdal, and H. Urkedal. Adaptive multiscale permeability
estimation. Computational Geosciences, 7:1–25, 2003.
[53] Halldora Gudmundsdottir and Roland N Horne. Prediction modeling for geothermal reser-
voirs using deep learning.
[54] Fusun S Tut Haklidir and Mehmet Haklidir. The fluid temperature prediction with hydro-
geochemical indicators using a deep learning model: A case study western anatolia (turkey).
In 43rd Workshop on Geothermal Reservoir Engineering, 2019.
[55] Per Christian Hansen. Rank-deficient and discrete ill-posed problems: numerical aspects of
linear inversion. SIAM, 1998.
[56] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew
Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual
concepts with a constrained variational framework. 2016.
[57] G. Hinton. Lecture 6e rmsprop: Divide the gradient by a running average of its recent
magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[58] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[59] Reza Hosseini, Kaixu Yang, Albert Chen, and Sayan Patra. A flexible forecasting model for
production systems. arXiv preprint arXiv:2105.01098, 2021.
[60] John A Howell, Arne Skorstad, Alister MacDonald, Alex Fordham, Stephen Flint, Bjørn
Fjellvoll, and Tom Manzocchi. Sedimentological parameterization of shallow-marine reser-
voirs. Petroleum Geoscience, 14(1):17–34, 2008.
[61] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[62] Kazuki Irie, Zoltán Tüske, Tamer Alkhouli, Ralf Schlüter, Hermann Ney, et al. Lstm, gru,
highway and a bit of attention: an empirical overview for language modeling in speech
recognition. In Interspeech, pages 3519–3523, 2016.
[63] P Jacquard. Permeability distribution from field pressure data. Society of Petroleum Engi-
neers Journal, 5(04):281–294, 1965.
[64] P. Jacquard. Permeability distribution from field pressure data. Society of Petroleum
Engineers Journal, 5(04), 1965.
[65] P. Jacquard and C. Jain. Permeability distribution from field pressure data. Soc. Pet. Eng.
Journal, pages 281–294, 1965.
[66] B. Jafarpour, V .K. Goyal, D.B. McLaughlin, and W.T. Freeman. Compressed history match-
ing: Exploiting transform-domain sparsity for regularization of nonlinear dynamic data inte-
gration problems. Mathematical Geosciences, 42:1–27, 2010. 10.1007/s11004-009-9247-z.
[67] B. Jafarpour and D.B. McLaughlin. Reservoir characterization with discrete cosine trans-
form. part-1: parameterization. Soc. Pet. Eng. Journal, SPE-106453-PA, 14(1):182–188,
2009.
[68] B. Jafarpour and D.B. McLaughlin. Reservoir characterization with discrete cosine trans-
form. part-2: history matching. Soc. Pet. Eng. Journal, SPE-106453-PA, 14(1):188–201,
2009.
[69] Behnam Jafarpour and Dennis B McLaughlin. History matching with an ensemble kalman
filter and discrete cosine parameterization. Computational Geosciences, 12(2):227–244,
2008.
[70] Behnam Jafarpour and Dennis B McLaughlin. Reservoir characterization with the discrete
cosine transform. SPE Journal, 14(01):182–201, 2009.
[71] Behnam Jafarpour and Morteza Khodabakhshi. A Bayesian mixture-modeling approach for
flow-conditioned multiple-point statistical facies simulation from uncertain training images.
Mathematical Geosciences, 43:133–164, 2011.
[72] Atefeh Jahandideh, Siavash Hakim-Elahi, and Behnam Jafarpour. Inference of rock flow
and mechanical properties from injection-induced microseismic events during geologic co2
storage. International Journal of Greenhouse Gas Control, 105:103206, 2021.
[73] A Jiang and B Jafarpour. History matching under uncertain geologic scenarios with varia-
tional autoencoders. In ECMOR XVII, volume 2020, pages 1–14. European Association of
Geoscientists & Engineers, 2020.
[74] Anyue Jiang and Behnam Jafarpour. Combining regularized convolutional neural networks
with production data integration for geologic scenario selection. In SPE Annual Technical
Conference and Exhibition. Society of Petroleum Engineers, 2019.
[75] Anyue Jiang and Behnam Jafarpour. Deep convolutional autoencoders for robust flow
model calibration under uncertainty in geologic continuity. Water Resources Research,
57(11):e2021WR029754, 2021.
[76] Anyue Jiang, Zhen Qin, Trenton T Cladouhos, Dave Faulder, and Behnam Jafarpour. A
multiscale recurrent neural network model for long-term prediction of geothermal energy
production.
[77] Anyue Jiang, Zhen Qin, Dave Faulder, Trenton T Cladouhos, and Behnam Jafarpour. Re-
current neural networks for short-term and long-term prediction of geothermal reservoirs.
Geothermics, 104:102439, 2022.
[78] Ian T Jolliffe. Principal component analysis and factor analysis. In Principal component
analysis, pages 115–128. Springer, 1986.
[79] BR Julian, A Ross, GR Foulger, and JR Evans. Three-dimensional seismic image of a
geothermal reservoir: The geysers, california. Geophysical Research Letters, 23(6):685–
688, 1996.
[80] Mohammadreza M Khaninezhad and Behnam Jafarpour. Prior model identification during
subsurface flow data integration with adaptive sparse representation techniques. Computa-
tional Geosciences, 18(1):3–16, 2014.
[81] M.R. Khaninezhad, B. Jafarpour, and L. Li. Sparse geologic dictionaries for subsurface
flow model calibration: Part i. inversion formulation. Advances in Water Resources, 39:106–
121, 2012.
[82] M.R. Khaninezhad, B. Jafarpour, and L. Li. Sparse geologic dictionaries for subsurface
flow model calibration: Part ii. robustness to uncertainty. Advances in water resources,
39:122–136, 2012.
[83] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics.
[84] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
[85] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013.
[86] Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. Foun-
dations and Trends® in Machine Learning, 12(4):307–392, 2019.
[87] Alexey V Kiryukhin, Natalia P Asaulova, and Stefan Finsterle. Inverse modeling and fore-
casting for the exploitation of the pauzhetsky geothermal field, kamchatka, russia. Geother-
mics, 37(5):540–562, 2008.
[88] A Kolmogoroff. Interpolation und extrapolation von stationaren zufalligen folgen. Izvestiya
Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 5(1):3–14, 1941.
[89] Christine E Koltermann and Steven M Gorelick. Heterogeneity in sedimentary deposits:
A review of structure-imitating, process-imitating, and descriptive approaches. Water Re-
sources Research, 32(9):2617–2658, 1996.
[90] Mark A. Kramer. Nonlinear principal component analysis using autoassociative neural net-
works. AIChE Journal, 37(2):233–243, 1991.
[91] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Wein-
berger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105.
Curran Associates, Inc., 2012.
[92] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems, pages
1097–1105, 2012.
[93] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In
Advances in neural information processing systems, pages 950–957, 1992.
[94] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist.,
22(1):79–86, 03 1951.
[95] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-
term temporal patterns with deep neural networks. In The 41st International ACM SIGIR
Conference on Research & Development in Information Retrieval, pages 95–104, 2018.
[96] Eric Laloy, Romain Hérault, Diederik Jacques, and Niklas Linde. Training-image based
geostatistical inversion using a spatial generative adversarial neural network. Water Re-
sources Research, 54(1):381–406, 2018.
[97] Eric Laloy, Romain Hérault, John Lee, Diederik Jacques, and Niklas Linde. Inversion using
a new low-dimensional representation of complex binary geological media based on a deep
neural network. Advances in water resources, 110:387–405, 2017.
[98] Eric Laloy, Romain Hérault, Diederik Jacques, and Niklas Linde. Training-image based
geostatistical inversion using a spatial generative adversarial neural network. Water Re-
sources Research, 54(1):381–406, 2018.
[99] Eric Laloy, Romain Hérault, John Lee, Diederik Jacques, and Niklas Linde. Inversion using
a new low-dimensional representation of complex binary geological media based on a deep
neural network. Advances in Water Resources, 110:387–405, Dec 2017.
[100] Eric Laloy, Niklas Linde, Cyprien Ruffino, Romain Hérault, Gilles Gasso, and Diederik
Jacques. Gradient-based deterministic inversion of geophysical data with generative adver-
sarial networks: Is it feasible? Computers & Geosciences, 133:104333, 2019.
[101] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time
series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[102] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard,
Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code
recognition. Neural computation, 1(4):541–551, 1989.
[103] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard,
Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-
propagation network. In Advances in neural information processing systems, pages 396–
404, 1990.
[104] Yann LeCun, L´ eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[105] J. Lee and P.K. Kitanidis. Bayesian inversion with total variation prior for discrete geologic
structure identification. Water Resources Research, 49(11):7658–7669, 2013.
[106] T Lenhart, K Eckhardt, N Fohrer, and H-G Frede. Comparison of two different approaches
of sensitivity analysis. Physics and Chemistry of the Earth, Parts A/B/C, 27(9-10):645–654,
2002.
[107] L. Li and B. Jafarpour. Effective solution of nonlinear subsurface flow inverse problems in
sparse bases. Inverse Problems, 26(10),105016:24 PP, 2010.
[108] Lianlin Li and Behnam Jafarpour. A sparse bayesian framework for conditioning un-
certain geologic models to nonlinear flow measurements. Advances in Water Resources,
33(9):1024–1042, 2010.
[109] Lianlin Li and Behnam Jafarpour. A sparse bayesian framework for conditioning un-
certain geologic models to nonlinear flow measurements. Advances in Water Resources,
33(9):1024–1042, 2010.
[110] Pei Li, Mohamed Abdel-Aty, and Jinghui Yuan. Real-time crash risk prediction on arterials
based on lstm-cnn. Accident Analysis & Prevention, 135:105371, 2020.
[111] Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between
dropout and batch normalization by variance shift. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 2682–2690, 2019.
[112] Knut-Andreas Lie. An introduction to reservoir simulation using MATLAB/GNU Octave:
User guide for the MATLAB Reservoir Simulation Toolbox (MRST). Cambridge University
Press, 2019.
[113] Niklas Linde, Philippe Renard, Tapan Mukerji, and Jef Caers. Geological realism in hy-
drogeological and geophysical inverse modeling: A review. Advances in Water Resources,
86:86–101, 2015.
[114] Zachary C Lipton, David C Kale, Randall Wetzel, et al. Modeling missing data in clinical
time series with rnns. Machine Learning for Healthcare, 56, 2016.
[115] Mingliang Liu and Dario Grana. Time-lapse seismic history matching with an iterative en-
semble smoother and deep convolutional autoencoder. Geophysics, 85(1):M15–M31, 2020.
[116] Jorge Lopez-Alvis, Eric Laloy, Frédéric Nguyen, and Thomas Hermans. Deep generative
models in inversion: The impact of the generator’s nonlinearity and development of a new
approach based on a variational autoencoder. Computers & Geosciences, page 104762,
2021.
[117] Shyi-Min Lu. A global review of enhanced geothermal system (egs). Renewable and Sus-
tainable Energy Reviews, 81:2902–2921, 2018.
[118] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to
attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[119] Mohammadreza M Khaninezhad and Behnam Jafarpour. Hybrid parameterization for robust
history matching. SPE Journal, 19(03):487–499, 2014.
[120] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of
machine learning research, 9(Nov):2579–2605, 2008.
[121] S. Mallat. A wavelet tour of signal processing: The sparse way, 3rd edition. Academic
Press, 2008.
[122] Tom Manzocchi, Jonathan N Carter, Arne Skorstad, Bjørn Fjellvoll, Karl D Stephen,
JA Howell, John D Matthews, John J Walsh, Manuel Nepveu, Christian Bos, et al. Sen-
sitivity of the impact of geological uncertainty on production from faulted and unfaulted
shallow-marine oil reservoirs: objectives and methods. Petroleum Geoscience, 14(1):3–15,
2008.
[123] S Mohd Razak and B Jafarpour. Convolutional neural networks (cnn) for feature-
based model calibration under uncertain geologic scenarios. Comput Geosci. https://doi.
org/10.1007/s10596-020-09971-4, 2020.
[124] S Mohd Razak and B Jafarpour. History matching with generative adversarial networks. In
ECMOR XVII, volume 2020, pages 1–17. European Association of Geoscientists & Engi-
neers, 2020.
[125] Syamil Mohd Razak, Behnam Jafarpour, et al. Rapid production forecasting with
geologically-informed auto-regressive models: Application to volve benchmark model. In
SPE Annual Technical Conference and Exhibition. Society of Petroleum Engineers, 2020.
[126] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper
into neural networks. 2015.
[127] Lukas Mosser, Olivier Dubrule, and Martin J. Blunt. Deepflow: History matching in the
space of deep generative models, 2019.
[128] Darius Mottaghy, Renate Pechnig, and Christian Vogt. The geothermal project Den Haag:
3d numerical models for temperature prediction and reservoir simulation. Geothermics,
40(3):199–210, 2011.
[129] Temoor Muther, Fahad Iqbal Syed, Austin T Lancaster, Farah D Salsabila, Amirma-
soud Kalantari Dahaghi, and Shahin Negahban. Geothermal 4.0: Ai-enabled geothermal
reservoir development-current status, potentials, limitations, and ways forward. Geother-
mics, 100:102348, 2022.
[130] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Syn-
thesizing the preferred inputs for neurons in neural networks via deep generator networks,
2016.
[131] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill,
2(11):e7, 2017.
[132] Dean S Oliver, Albert C Reynolds, and Ning Liu. Inverse theory for petroleum reservoir
characterization and history matching. Cambridge University Press, 2008.
[133] Hyucksoo Park, C´ eline Scheidt, Darryl Fenwick, Alexandre Boucher, and Jef Caers. History
matching and uncertainty quantification of facies models with multiple geological interpre-
tations. Computational Geosciences, 17(4):609–621, 2013.
[134] Sungheon Park and Nojun Kwak. Analysis on the dropout effect in convolutional neural
networks. In Asian conference on computer vision, pages 189–204. Springer, 2016.
[135] Lutz Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade, pages
55–69. Springer, 1998.
[136] K Pruess, GS Bodvarsson, V Stefansson, and ET Eliasson. The krafla geothermal field,
iceland: 4. history match and prediction of individual well performance. Water Resources
Research, 20(11):1561–1584, 1984.
[137] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural net-
works, 12(1):145–151, 1999.
[138] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell.
A dual-stage attention-based recurrent neural network for time series prediction. arXiv
preprint arXiv:1704.02971, 2017.
[139] Banda S RamaRao, A Marsh LaVenue, Ghislain De Marsily, and Melvin G Marietta. Pilot
point methodology for automated calibration of an ensemble of conditionally simulated
transmissivity fields: 1. theory and computational experiments. Water Resources Research,
31(3):475–493, 1995.
[140] Syamil Mohd Razak and Behnam Jafarpour. Convolutional neural networks (cnn) for
feature-based model calibration under uncertain geologic scenarios. Computational Geo-
sciences, 24(4):1625–1649, 2020.
[141] Syamil Mohd Razak, Anyue Jiang, and Behnam Jafarpour. Latent-space inversion (lsi):
a deep learning framework for inverse mapping of subsurface flow data. Computational
Geosciences, pages 1–29, 2021.
[142] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images
with vq-vae-2. In Advances in Neural Information Processing Systems, pages 14837–14847,
2019.
[143] Jens Christian Refsgaard, Steen Christensen, Torben O Sonnenborg, Dorte Seifert,
Anker Lajer Højberg, and Lars Troldborg. Review of strategies for handling geological
uncertainty in groundwater flow and transport modeling. Advances in Water Resources,
36:36–50, 2012.
[144] Nicolas Remy. S-gems: the stanford geostatistical modeling software: a tool for new algo-
rithms development. In Geostatistics banff 2004, pages 865–871. Springer, 2005.
[145] Christian P. Robert and George Casella. Monte Carlo Statistical Methods (Springer Texts in
Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.
[146] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise
removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
[147] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations
by back-propagating errors. nature, 323(6088):533–536, 1986.
[148] I. Sahni and R.N. Horne. Multiresolution wavelet analysis for improved reservoir descrip-
tion. SPEREE, 8,(1):53–69, 2005.
[149] Pallav Sarma, Louis J. Durlofsky, and Khalid Aziz. Kernel principal component analy-
sis for efficient, differentiable parameterization of multipoint geostatistics. Mathematical
Geosciences, 40(1):3–32, Jan 2008.
[150] Pallav Sarma, Louis J Durlofsky, Khalid Aziz, Wen H Chen, et al. A new approach to
automatic history matching using kernel pca. In SPE Reservoir Simulation Symposium.
Society of Petroleum Engineers, 2007.
[151] Anton Maximilian Schäfer and Hans Georg Zimmermann. Recurrent neural networks are
universal approximators. In International Conference on Artificial Neural Networks, pages
632–640. Springer, 2006.
[152] C´ eline Scheidt, Pejman Tahmasebi, Marco Pontiggia, Andrea Da Pra, and Jef Caers. Updat-
ing joint uncertainty in trend and depositional scenario for reservoir exploration and early
appraisal. Computational Geosciences, 19(4):805–820, 2015.
[153] Schlumberger. Eclipse.
[154] Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. Introduction to infor-
mation retrieval, volume 39. Cambridge University Press Cambridge, 2008.
[155] Skipper Seabold and Josef Perktold. statsmodels: Econometric and statistical modeling with
python. In 9th Python in Science Conference, 2010.
[156] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi
Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-
based localization. In Proceedings of the IEEE international conference on computer vision,
pages 618–626, 2017.
[157] Shah Shah, GR Gavalas, and JH Seinfeld. Error analysis in history matching: The optimum
level of parameterization. Society of Petroleum Engineers Journal, 18(03):219–228, 1978.
[158] David F. Shanno. Conditioning of quasi-newton methods for function minimization. 1970.
[159] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn fea-
tures off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition workshops, pages 806–813, 2014.
[160] Chaopeng Shen. A transdisciplinary review of deep learning research and its relevance for
water resources scientists. Water Resources Research, 54(11):8558–8593.
[161] Apeksha Nagesh Shewalkar. Comparison of rnn, lstm and gru on speech recognition data.
2018.
[162] Yu Shi, Xianzhi Song, and Guofeng Song. Productivity prediction of a multilateral-well
geothermal system based on a long short-term memory and multi-layer perceptron combi-
national neural network. Applied Energy, 282:116046, 2021.
[163] Drew L Siler, Jeff D Pepin, Velimir V Vesselinov, Maruti K Mudunuru, and Bulbul Ah-
mmed. Machine learning to identify geologic factors associated with production in geother-
mal fields: a case-study using 3d geologic data, brady geothermal field, nevada. Geothermal
Energy, 9(1):1–17, 2021.
[164] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[165] Tzu-Hsi Song, Victor Sanchez, Hesham Eldaly, and Nasir Rajpoot. Hybrid deep autoen-
coder with curvature gaussian for detection of various types of cells in bone marrow trephine
biopsy images. pages 1040–1043, 04 2017.
[166] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut-
dinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of
machine learning research, 15(1):1929–1958, 2014.
[167] Sebastien Strebelle. Conditional simulation of complex geological structures using multiple-
point statistics. Mathematical Geology, 34(1):1–21, 2002.
[168] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. arXiv preprint arXiv:1409.3215, 2014.
[169] Satomi Suzuki and Jef Karel Caers. History matching with an uncertain geological scenario.
In SPE Annual Technical Conference and Exhibition. Society of Petroleum Engineers, 2006.
[170] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 1–9, 2015.
[171] Albert Tarantola. Popper, bayes and the inverse problem. Nature physics, 2(8):492–494,
2006.
[172] Sean J Taylor and Benjamin Letham. Forecasting at scale. The American Statistician,
72(1):37–45, 2018.
[173] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compres-
sion with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.
[174] Sverrir Thorhallsson. Geothermal well operation and maintenance. Geothermal training
program IGC short course, 2003.
[175] Geir Thorolfsson. Maintenance history of a geothermal plant: Svartsengi iceland. In Pro-
ceedings of the World Geothermal Congress 2005, Antalya, Turkey, 2005.
[176] Chuan Tian, Roland N Horne, et al. Recurrent neural networks for permanent down-
hole gauge data analysis. In SPE Annual Technical Conference and Exhibition. Society
of Petroleum Engineers, 2017.
[177] A.N. Tikhonov and V.I. Arsenin. Solution of ill-posed problems. Winston, PA, 1977.
[178] M. Tonkin and J. Doherty. A hybrid regularized inversion methodology for highly parame-
terized models. Water Resources Research, 41,W10412:170–177, 2005.
[179] Fusun S Tut Haklidir and Mehmet Haklidir. Prediction of reservoir temperatures using
hydrogeochemical data, western anatolia geothermal systems (turkey): a machine learning
approach. Natural Resources Research, 29(4):2333–2346, 2020.
[180] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint
arXiv:1706.03762, 2017.
[181] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting
and composing robust features with denoising autoencoders. In Proceedings of the 25th
international conference on Machine learning, pages 1096–1103, 2008.
[182] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Man-
zagol. Stacked denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408,
2010.
[183] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Man-
zagol. Stacked denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, December 2010.
[184] Dongfang Wang and Jin Gu. Vasc: dimension reduction and visualization of single-cell
rna-seq data by deep variational autoencoder. Genomics, proteomics & bioinformatics,
16(5):320–331, 2018.
[185] Fei Wang, Zhiming Xuan, Zhao Zhen, Kangping Li, Tieqiang Wang, and Min Shi. A day-
ahead pv power forecasting method based on lstm-rnn model and time correlation modifica-
tion under partial daily pattern prediction framework. Energy Conversion and Management,
212:112766, 2020.
[186] Norbert Wiener et al. Extrapolation, interpolation, and smoothing of stationary time series:
with engineering applications, volume 8. MIT press Cambridge, MA:, 1964.
[187] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics
and intelligent laboratory systems, 2(1-3):37–52, 1987.
[188] Xingru Wu, Gary A Pope, G Michael Shook, and Sanjay Srinivasan. Prediction of enthalpy
production from fractured geothermal reservoirs using partitioning tracers. International
Journal of Heat and Mass Transfer, 51(5-6):1453–1466, 2008.
[189] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of cnn and
rnn for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
[190] A. Zhavoronkov, Y.A. Ivanenkov, A. Aliper, et al. Deep learning enables rapid identification
of potent DDR1 kinase inhibitors. Nature Biotechnology, 37:1038–1040, 2019.
Appendices
A Stochastic Gradient Descent (SGD) for Minimization
In this Appendix, we briefly present the Momentum and RMSProp algorithms to motivate the use
of the Adam algorithm in this work.
The Momentum Algorithm
The Momentum algorithm for solving SGD combines the gradient of the current step with gradi-
ents from the past steps to determine the search direction. To this end, an exponentially weighted
moving average operation is used, where greater weight and significance are placed on more recent
gradients. Mathematically, the Momentum algorithm can be expressed as
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, G_t \tag{A.1} \]

\[ z_t = z_{t-1} - \alpha\, m_t \tag{A.2} \]

where $\beta_1$ is the constant of the exponential decay rate, and $G_t$ is the gradient at step $t$. It can be
shown that the estimate $m_t$ is biased (toward the zero value used for initialization), and that the
bias-corrected form can be expressed as

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}} \tag{A.3} \]

Note that the coefficient $\beta_1$ is bounded in the range $[0, 1]$. The closer $\beta_1$ is to 1, the longer the
memory of the moving average operation. A suggested value for $\beta_1$ in the literature is 0.90.
The Momentum algorithm improves the gradient estimates derived from the SGD and results in
less oscillatory behavior of the gradient descent minimization algorithms in ravines (where the
slopes are much steeper in one dimension than in another, due to ill-conditioned objective function
Hessian) [137].
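As a plain NumPy sketch (not tied to any particular library implementation), one Momentum iteration can be written as follows; the step size, decay rate, and the gradient function grad_fn are placeholders, and the update here uses the bias-corrected estimate from Eq. (A.3).

# Sketch of one bias-corrected Momentum iteration (Eqs. A.1-A.3).
import numpy as np

def momentum_step(z, m, grad_fn, t, alpha=0.01, beta1=0.9):
    g = grad_fn(z)                        # current gradient G_t
    m = beta1 * m + (1.0 - beta1) * g     # exponentially weighted moving average, Eq. (A.1)
    m_hat = m / (1.0 - beta1 ** t)        # bias correction, Eq. (A.3)
    z = z - alpha * m_hat                 # parameter update as in Eq. (A.2), using the corrected estimate
    return z, m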
The RMSProp Algorithm
The RMSprop algorithm is an adaptive learning rate method proposed by Hinton [57], which is
designed to address the fast diminishing learning rates. This goal is achieved by dividing the
learning rate by an exponentially decaying average of squared gradients.
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, G_t^2 \tag{A.4} \]

\[ z_t = z_{t-1} - \alpha \frac{G_t}{\sqrt{v_t} + \epsilon} \tag{A.5} \]
RMSProp adapts the learning rate by dividing it by the square root of the running average of squared
gradients. Hence, the components with larger gradient projection magnitudes are assigned lower
learning rates.
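A corresponding NumPy sketch of one RMSProp iteration (Eqs. A.4 and A.5) is shown below; the hyper-parameter values are only illustrative.

# Sketch of one RMSProp iteration (Eqs. A.4-A.5).
import numpy as np

def rmsprop_step(z, v, grad_fn, alpha=0.001, beta2=0.9, eps=1e-8):
    g = grad_fn(z)                          # current gradient G_t
    v = beta2 * v + (1.0 - beta2) * g**2    # running average of squared gradients, Eq. (A.4)
    z = z - alpha * g / (np.sqrt(v) + eps)  # per-component scaled update, Eq. (A.5)
    return z, v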
The Adam Algorithm
The Adam algorithm gains its efficiency by combining the Momentum and RMSProp algorithms.
It maintains a per-parameter learning rate that is calculated based on the estimates of the first
and second (non-centered) moments of the gradients. The initial first moment $m_0$ and the second
moment $v_0$ are set to zero. At time step $t$, the first moment $m_t$ and the second moment $v_t$ are
estimated using exponentially weighted moving averages computed on the current gradients, i.e.,

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, G_t \tag{A.6} \]

\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, [G_t]^2 \tag{A.7} \]

where $\beta_1$ and $\beta_2$ are the constants of exponential decay rates for the first and second moment
estimates, respectively. The bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$ are computed as

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}} \tag{A.8} \]

\[ \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}} \tag{A.9} \]

The update equation for $z_t$ in Adam is expressed as

\[ z_t = z_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{A.10} \]

where $\alpha$ is the learning rate, and $\epsilon$ is a small constant to prevent division by zero.
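Combining the two ideas, a NumPy sketch of one Adam iteration (Eqs. A.6–A.10) is given below, together with a toy usage example on a simple quadratic; the hyper-parameter defaults are the values commonly suggested in the literature, and the gradient function is a placeholder.

# Sketch of one Adam iteration (Eqs. A.6-A.10) and a toy usage example.
import numpy as np

def adam_step(z, m, v, grad_fn, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(z)                                   # current gradient G_t
    m = beta1 * m + (1.0 - beta1) * g                # first-moment estimate, Eq. (A.6)
    v = beta2 * v + (1.0 - beta2) * g**2             # second-moment estimate, Eq. (A.7)
    m_hat = m / (1.0 - beta1 ** t)                   # bias corrections, Eqs. (A.8)-(A.9)
    v_hat = v / (1.0 - beta2 ** t)
    z = z - alpha * m_hat / (np.sqrt(v_hat) + eps)   # parameter update, Eq. (A.10)
    return z, m, v

# Toy example: minimize f(z) = ||z||^2, whose gradient is 2z
z, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    z, m, v = adam_step(z, m, v, lambda x: 2.0 * x, t)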
Abstract
Accurate prediction of flow and transport behavior in complex geologic formations is critical for the efficient development of the underlying resources. Physics-based simulation models can provide reliable forecasts about the response of subsurface flow systems to various development strategies. However, the cost and technical difficulties associated with measuring the rock properties, such as permeability and porosity, at a high enough resolution and coverage have necessitated using indirect and scattered flow response data for their characterization. Since flow response data are measured at sparse locations, inference of high-resolution model parameters from the limited data available often leads to underdetermined inverse problems. Besides the conventional physics-based models that offer a comprehensive prediction tool, data-driven models provide an efficient alternative to building predictive models by extracting and using the statistical patterns in data to make predictions. In this work, we applied deep learning models to handle the geologic uncertainty in the model calibration process. We present a novel framework whereby dynamic response data is used to screen a set of proposed geologic scenarios by combining gradient-based inversion for feature extraction and convolutional neural networks for feature recognition. We also propose a special type of deep learning architecture, known as variational auto-encoder, for robust dimension-reduced parameterization of spatially distributed properties, in solving model calibration problems under uncertain geostatistical models. We show that convolutional autoencoders offer the versatility and robustness required for nonlinear parameterization of complex subsurface flow property distributions when multiple distinct geologic scenarios are present. In addition, we introduce a variant of RNN that also utilizes the efficiency of convolutional neural networks (CNN) for the prediction of energy production from geothermal reservoirs. Specifically, a CNN-RNN architecture is developed that takes historical production data and future well controls to predict the well quantities that describe the production performance of the reservoir. The model is paired with a labeling scheme to handle real field disturbances such as data gaps. RNN primarily exploits statistical relations in training data to generate predictions. Thus it can be challenging to apply RNN in applications where extrapolation beyond the training data range is needed. In addition, geothermal data usually involves features on multiple scales, which makes complex deep learning models more vulnerable to overfitting. To overcome these issues, we have developed a multiscale architecture for RNN to improve its generalization power for the long-term prediction of geothermal data.