Deep Learning Architectures for Characterization and Forecasting of Fluid Flow in Subsurface Systems

by

Syamil Mohd Razak

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (PETROLEUM ENGINEERING)

May 2023

Copyright 2023 Syamil Mohd Razak

Dedication

I dedicate this dissertation to my family, and to the giants whose shoulders I stand on.

Acknowledgements

I am deeply grateful to my dissertation advisor, Dr. Behnam Jafarpour, for their exceptional guidance, patience, and unwavering support throughout this long and challenging process. Their insights, feedback, and encouragement have been invaluable to me, and I am truly thankful for their mentorship. I would also like to express my sincere appreciation to the members of my dissertation committee, Dr. Iraj Ershaghi and Dr. Felipe de Barros, for their expert advice, stimulating conversations, constructive criticism, and insightful suggestions. Their expertise and perspectives have helped me refine my research and make my dissertation more impactful.

I am incredibly fortunate to have the support of my partner, family, and friends, whose unwavering love, encouragement, and belief in me have been a constant source of inspiration. Thank you for being there for me, even during the toughest moments of this journey. I would also like to extend my gratitude to my past and present colleagues and industry partners, whose fruitful discussions and collaborations have sparked many ideas behind this dissertation. I look forward to continuing to work together to further advance the digital transformation of the energy industry.

Last but not least, I would like to acknowledge my two feline friends, Chaton and Panchita, for keeping my laps warm and for their unconditional love, especially when they're hungry. Your presence has made this journey all the more enjoyable.

Thank you to everyone who has played a part in this journey. Let's hope this dissertation doesn't end up collecting dust on a shelf somewhere, but rather serves as a launching pad for future success.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Production Forecasting
    1.1.1 Production Forecasting in Conventional Resources
      1.1.1.1 Challenges in Automated Model Calibration
    1.1.2 Production Forecasting in Unconventional Resources
      1.1.2.1 Challenges in Production Prediction
  1.2 Neural Network Architectures for Latent Space Representations
    1.2.1 Principal Component Analysis (PCA)
    1.2.2 Fully-Connected Neural Network (FCNN)
    1.2.3 Convolutional Neural Networks (CNN)
    1.2.4 Autoencoders (AE)
    1.2.5 Generative Adversarial Networks (GAN)
    1.2.6 Recurrent Neural Networks (RNN)
  1.3 Latent Space Representations for Production Forecasting
    1.3.1 Inverse Problem Formulation
    1.3.2 Production Prediction Formulation
  1.4 Motivations for Deep Learning Latent Space Representations
    1.4.1 Handling Uncertainty in Prior Geologic Scenarios
    1.4.2 Model Inversion
    1.4.3 Proxy Model for Simulator
    1.4.4 Uncertainty Quantification
    1.4.5 Long-Term Production Prediction
    1.4.6 Physics-Guided Production Prediction
  1.5 Scope of Work and Dissertation Outline

Chapter 2: Feature-based Model Calibration under Uncertain Geologic Scenarios
  2.1 Convolutional Neural Networks (CNN) for Feature Extraction
    2.1.1 Classification Model
    2.1.2 Regression Model
  2.2 PCA for Model Space Compression
  2.3 Numerical Experiments and Results
    2.3.1 Example 1: Synthetic 2D Gaussian Model
    2.3.2 Example 2: Synthetic 2D Fluvial Model
    2.3.3 Example 3: Synthetic 3D Fluvial Model
    2.3.4 Example 4: Large-scale Model (Based on Volve Field)
  2.4 Summary and Discussion

Chapter 3: Latent Space Inversion (LSI) for Inverse Mapping of Subsurface Flow Data
  3.1 Parameterization with Deep Learning Techniques
    3.1.1 Autoencoder for Model Space Compression
    3.1.2 Autoencoder for Data Space Compression
    3.1.3 Inverse Mapping
  3.2 Latent Space Inversion (LSI) Workflow
    3.2.1 Decoupled LSI
    3.2.2 Coupled LSI
  3.3 Numerical Experiments and Results
    3.3.1 Example 1: Synthetic 2D Gaussian Model
    3.3.2 Example 2: Synthetic 2D Fluvial Model
    3.3.3 Example 3: Large-scale Model (Based on Volve Field)
  3.4 Summary and Discussion

Chapter 4: Latent Space Data Assimilation (LSDA) in Subsurface Flow Systems
  4.1 Parameterization and Forward Mapping with Deep Learning Techniques
    4.1.1 Dimension Reduction and Prediction with Autoencoders
    4.1.2 Constraining the Latent Spaces with Variational Autoencoders
  4.2 Latent Space Data Assimilation (LSDA) Workflow
  4.3 Numerical Experiments and Results
    4.3.1 Example 1: Synthetic 2D Gaussian Model
    4.3.2 Example 2: Synthetic 2D Fluvial Model
    4.3.3 Example 3: Large-scale Model (Based on Volve Field)
  4.4 Summary and Discussion

Chapter 5: Conditioning Generative Adversarial Networks on Nonlinear Data for Model Calibration and Uncertainty Quantification
  5.1 Generative Adversarial Networks for Model Space and Data Space Compression
    5.1.1 Method 1: Conditional GAN
    5.1.2 Method 2: GAN with Neighborhood Selection
  5.2 Numerical Experiments and Results
    5.2.1 CGAN Conditioned on Model Space Label
    5.2.2 CGAN Conditioned on Data Space Label
      5.2.2.1 Conditioning on Travel-time Tomography Data
      5.2.2.2 Conditioning on Two-phase Flow Data
    5.2.3 GAN with Neighborhood Selection
  5.3 Summary and Discussion

Chapter 6: Recurrent Neural Networks for Long-term Production Forecasting in Unconventional Reservoirs
  6.1 Dynamic Latent Space Representations with Recurrent Neural Networks
    6.1.1 Sequence-to-Sequence Forecast Model
    6.1.2 Transfer Learning for Reducing Data Requirement
  6.2 Numerical Experiments and Results
    6.2.1 Example 1: Toy Data
    6.2.2 Example 2: Synthetic Bakken Data
    6.2.3 Example 3: Field Data from Bakken Shale Play
  6.3 Summary and Discussion

Chapter 7: Physics-Guided Deep Learning (PGDL) for Improved Production Forecasting in Unconventional Reservoirs
  7.1 Hybrid Techniques for Production Prediction
    7.1.1 Physics-Constrained Neural Networks
      7.1.1.1 Statistical Approach
      7.1.1.2 Explicit Approach
    7.1.2 Physics-Guided Deep Learning (PGDL) Model
  7.2 Numerical Experiments and Results
    7.2.1 Example 1: Toy Data
    7.2.2 Example 2: Synthetic Bakken Data
    7.2.3 Example 3: Field Data from Bakken Shale Play
  7.3 Summary and Discussion

Chapter 8: Summary, Conclusions and Future Works
  8.1 Summary
  8.2 Conclusions
  8.3 Future Works

Nomenclature
References
Appendices
  A Description of the CNN Architectures
    A.1 Network Architecture
  B Description of the LSI Architectures
    B.1 Network Architecture
  C Description of the LSDA Architectures
    C.1 Network Architecture
  D Description of the CGAN Architectures
    D.1 Network Architecture and Training Progression
  E Description of the RNN Architectures
    E.1 Network Architecture
    E.2 Hyperparameter Tuning
  F Description of the PGDL Architectures
    F.1 Network Architecture
    F.2 Synthetic Data Generation

List of Tables

2.1 Proposed two-stage workflow
2.2 Uncertain geologic parameters as input to SGS algorithm
2.3 Uncertain geologic parameters as input to 2D training image for MPS algorithm
2.4 Root-mean-square error (RMSE) for examples shown in this section
2.5 Uncertain geologic parameters as input to 3D object-based modelling algorithm
2.6 Uncertain geologic parameters for Volve field-like case
3.1 Pseudocode for decoupled LSI
3.2 Pseudocode for coupled LSI
4.1 Pseudocode for data assimilation with LSDA
5.1 Workflow for method (1)
5.2 Workflow for method (2)
8.1 Keras functions corresponding to Figure 2.2 and parameters used in this study (for Examples 1 and 2)
8.2 Keras functions and parameters for model autoencoder Enc_θ, Dec_θ in Figure 3.2 and Figure 8.1
8.3 Keras functions and parameters for data autoencoder Enc_ψ, Dec_ψ in Figure 3.3 and Figure 8.1
8.4 Keras functions and parameters for regressor Reg_γ^dm in Figure 3.4 and Figure 8.1
8.5 Keras functions and parameters for model autoencoder Enc_θ, Dec_θ in section 3.3.3 (Volve) and Figure 8.2
8.6 Keras functions and parameters for data autoencoder Enc_ψ, Dec_ψ in section 3.3.3 (Volve) and Figure 8.2
8.7 Keras functions for Enc_θ, Dec_θ in Figure 4.3 and parameters used for the 2D examples
8.8 Keras functions for Enc_ψ, Dec_ψ in Figure 4.4 and parameters used for the 2D examples
8.9 Keras functions for Reg_γ^md(·) in Figure 4.5 and parameters used for the 2D examples
8.10 Tensorflow functions and hyperparameters used in section 5.2
8.11 Detailed description of components in the proposed forecast model
8.12 Detailed description of components in the statistical and explicit PGDL models in Experiment 1
8.13 Detailed description of components in the statistical and explicit PGDL models in Experiment 2
8.14 Detailed description of components in the statistical and explicit PGDL models in Experiment 3
8.15 Parameters used in setting up the model in Experiment 2
8.16 Data sources used for sampling input features in Experiment 2

List of Figures

1.1 Classical workflow that considers multiple geologic scenarios and other sources of uncertainty in constructing prior models for history matching to generate reliable dynamic models
1.2 (a) Convolution operation (of a trainable filter) on an input results in activations. (b) Non-linear transformation on activations to generate non-linear activations. (c) Pooling or down-sampling layer reduces the number of parameters. (d) In a supervised learning setting, these non-linear activations are compared to the expected corresponding output variable (classification for a discrete variable or regression for a continuous variable) and the filter is updated to minimize the difference. Images not drawn to scale
1.3 Comparison of the generated latent spaces from AE, VAE and PCA
1.4 Comparison of generated images using GAN and its variants (e.g., WGAN and WGAN-GP)
1.5 Forward and inverse problem
1.6 Windows of multivariate time-series data for a well
1.7 Schematic of model inversion with deep learning latent space representations
2.1 Proposed two-step workflow that predicts a model realization for any given historical data d_obs
2.2 (Top) Architecture and dimensions of the CNNs. (Bottom) Activations of the 5×2 filters in the first convolution layer
2.3 Reference field set-up and samples of conditioned Gaussian realizations
2.4 Confusion matrix of the classifier on the testing data set (Gaussian realizations)
2.5 Energy contribution of principal components (PCs) for production data and models (Gaussian realizations)
2.6 Production data and models from five scenarios (Gaussian realizations) visualized in the first three principal components (PCs)
2.7 (Top) Geologic scenarios are only distinguishable in the leading principal components (PCs) of production data and model realizations. (Bottom) Strong correlation exists between leading PCs of production data and model realizations
2.8 (a) Reference realization as denoted in Figure 2.6 (b) Predicted realization with the workflow (c) Predicted realization with no geologic scenario selection step (d) Likelihood of each scenario with D(d_obs) (e) Nearest neighbors to the reference realization show that it does not exist in the training set (f) Samples from M_reduced
2.9 (Top) Reference field set-up and samples of conditioned fluvial realizations. (Bottom) Training image 1 used for scenario 1 and training image 2 used (by rotation) for scenarios 2-5
2.10 (a) Reference realization (b) Predicted realization with the workflow (c) Predicted realization with no geologic scenario selection step (d) Likelihood of each scenario with D(d_obs) (e) Discretized prediction by taking a threshold (determined by the mid-point of facies code of channel and non-channel facies) (f) Discretized prediction (g) Nearest neighbors to the reference realization show that it does not exist in the training set (h) Samples from M_reduced
2.11 Lowest mismatch (RMSE tabulated in Table 2.4) is observed between reference (d_obs) and simulated data from the realization predicted by the workflow
2.12 Performance of the proposed workflow on four cases of 2D fluvial realizations with varying degree of confidence in the predicted scenarios of D(·) (RMSE tabulated in Table 2.4)
2.13 Samples and mean of conditioned 3D realizations shown as isochore maps with contour interval of 16 m (measured along true vertical thickness between the top and base horizons of this synthetic field). The thickness of channel facies at each well location is varied
2.14 (Top) (i) Reference realization (ii) Predicted realization with no geologic scenario selection step (iii) Predicted realization with the workflow (iv) Prediction using models from Scenario 2 (v) Prediction using models from Scenario 3 (vi) Prediction using models from Scenario 4 (Bottom) Time steps of oil-in-place grid of the reference realization
2.15 Comparison of production data match (RMSE tabulated in Table 2.4) between predictions from Figure 2.14
2.16 (Top) (i) Reference realization (ii) Predicted realization with no geologic scenario selection step (iii) Predicted realization using models from Scenario 1 (iv) Prediction with the workflow (v) Prediction using models from Scenario 3 (vi) Prediction using models from Scenario 4 (Bottom) Time steps of oil-in-place grid of the reference realization
2.17 Comparison of production data match (RMSE tabulated in Table 2.4) between predictions from Figure 2.16
2.18 Field set-up (2D map view), porosity logs (true vertical depth) for select wells and samples of Volve conditioned Gaussian realizations from three scenarios
2.19 Sensitivity analysis on testing accuracy as a function of number of realizations per scenario used for training and validation of D(·)
2.20 (a) Cross-section (x to x') of reference porosity grid (b) Cross-section (x to x') of predicted porosity grid (c) Initial and final oil-in-place grids for reference and predicted cases (map view of 11th layer)
2.21 Production data match (RMSE tabulated in Table 2.4) for select producers (oil rate and corresponding water cut) and injectors (pressure)
3.1 Data-informed parameterization
3.2 Actual autoencoder architecture for model realizations
3.3 Actual autoencoder architecture for production response data
3.4 Architecture of regressors
3.5 Diagram of decoupled LSI architecture for model inversion
3.6 Training and validation losses for decoupled LSI and histograms (normalized) of predictions for testing dataset
3.7 Diagram of coupled LSI architecture for model inversion
3.8 Training and validation losses for coupled LSI and histograms (normalized) of predictions for testing dataset
3.9 (Left) Gaussian dataset with sample model realizations (map view) from five (5) distinct scenarios. (Right) Field set-up for the Gaussian dataset
3.10 Comparison of inversion solutions (Gaussian realizations) from coupled and decoupled approach
3.11 RMSE of simulated data match from inversion solutions of Gaussian and fluvial testing dataset. Approach C (coupled) and D (decoupled)
3.12 Field analogs (map view of satellite image) of a meandering and an anastomosing fluvial environment with corresponding training image (map view)
3.13 (Left) Fluvial dataset with sample model realizations (map view) from five (5) distinct scenarios. (Right) Field set-up for the fluvial dataset
3.14 Comparison of inversion solutions (fluvial realizations) from coupled and decoupled approach
3.15 Comparison of simulated data match from inversion solutions (for row 2 in Figure 3.14)
3.16 4 nearest-neighbors (NN) to m_ref in model space M and 4 nearest-neighbors (NN) to d_obs in data space D (corresponding models are shown) for row 2 in Figure 3.14
3.17 (Left) Match between predicted production profiles (by Dec_ψ(Enc_ψ(·))) and normalized profiles for test dataset. (Right) z_d visualized by dimension (6 shown out of 10), for case row 3 in Figure 3.14
3.18 Simulated data from inversion solutions M̂ = Dec_θ(Reg_γ^dm(z_d_obs + E)) compared to reconstructions Dec_ψ(z_d_obs + E) (for case row 3 in Figure 3.14)
3.19 Predicted model realizations Dec_θ(Reg_γ^dm(z_d_obs + E)) for d_obs (row 1 in Figure 3.14)
3.20 Predicted model realizations Dec_θ(Reg_γ^dm(z_d_obs + E)) for d_obs (row 2 in Figure 3.14)
3.21 Predicted model realizations Dec_θ(Reg_γ^dm(z_d_obs + E)) for d_obs (row 3 in Figure 3.14)
3.22 (Top) z_m for testing dataset visualized by dimension, 5 shown out of 64 (z_m_ref for case row 3 in Figure 3.14). (Bottom) Histograms showing latent variables for set of inversion solutions
3.23 Cross-correlation matrix for dimensions of z_m and z_d for testing dataset
3.24 Volve field set-up (2D map view), training volume used in MPS algorithm, and facies and porosity logs (in true vertical depth) for wells
3.25 Volve porosity-permeability transform functions for sand and shale facies
3.26 (Top) The model reference case m_ref for d_obs and model reference case filtered to show only sand facies. (Middle) Reconstruction of reference model, Dec_θ(Enc_θ(m_ref)), and inversion from coupled LSI, Dec_θ(Reg_γ^dm(Enc_ψ(d_obs))). (Bottom) Samples from M̂ = Dec_θ(Reg_γ^dm(z_d_obs + E))
3.27 Wetting-phase saturation grid (layer 13 is shown) of each example in Figure 3.26 for initial, year-4 and final timesteps
3.28 Simulated data (non-wetting phase rate of producers, wetting phase cut of producers and bottom-hole pressure (BHP) of injectors) from inversion solutions M̂ = Dec_θ(Reg_γ^dm(z_d_obs + E)) compared to reconstructions Dec_ψ(z_d_obs + E) (for examples in Figure 3.26 and Figure 3.27)
3.29 Sensitivity analysis on the number of (m, d) data pairs for training coupled LSI architecture
4.1 Latent-Space Data Assimilation (LSDA) framework
4.2 Schematic of neural network architecture for LSDA
4.3 Autoencoder architecture for model realizations
4.4 Autoencoder architecture for simulated data
4.5 Regression model architecture for model and data latent variables
4.6 Histograms and scatter plots comparing model and data with the reconstructions and predictions for Gaussian training and testing dataset
4.7 Samples of model reconstruction for Gaussian training and testing dataset
4.8 Samples of data reconstruction and prediction (denoted as P) from select wells for Gaussian training (top row) and testing (bottom row) dataset
4.9 Gaussian model latent variables (first 4 shown out of 64) for the first and last iteration
4.10 Data latent variables (first 4 shown out of 20) for the first and last iteration of the Gaussian dataset
4.11 (Left panel) Reference model, its reconstruction and mean and variance of prior ensemble and posterior ensemble. (Right panel) Samples of prior and posterior realizations
4.12 Data match for posterior ensemble for the Gaussian dataset
4.13 Histograms and scatter plots comparing model and data with the reconstructions and predictions for fluvial training and testing dataset
4.14 Two samples of reconstruction from the testing dataset and generative samples (uniformly sampled between the two test cases) from the latent space with their corresponding nearest neighbor in the training dataset
4.15 Fluvial model latent variables (first 4 shown out of 20) for the first and last iteration
4.16 (Left panel) Reference model, its reconstruction and mean and variance of prior ensemble and posterior ensemble. (Right panel) Samples of prior, iterations and posterior realizations
4.17 Data match for posterior ensemble for the fluvial dataset
4.18 Facies at well location and data match of P10/mean/P90 profiles within forecast period for the fluvial dataset
4.19 (Left panel) Reference model and mean and variance of prior ensemble and posterior ensemble. (Right panel) Samples of prior and posterior realizations
4.20 Data match for posterior ensemble for the Volve dataset
4.21 (Top) MeanSSIM comparison between LSDA, LSDA-b and ESMDA for different sizes of ensemble for 2D Gaussian examples and (Bottom) 2D fluvial examples (with ESMDA-b)
5.1 Method (1) for generating conditional models using CGAN
5.2 Architecture of CGAN/GAN used in this study. Components connected using stippled lines are associated with CGAN
5.3 Method (2) for generating conditional models using neighbourhood selection algorithm and GAN
5.4 Dataset A of binary facies realizations where training image 1 (for Scenario 1) and training image 2 (for Scenarios 2-5) are derived from conceptual geologic model. Four samples are shown for each scenario
5.5 Dataset B of multi-facies realizations where training image 1 (for Scenarios 1 and 2) and training image 2 (for Scenarios 3, 4 and 5) are derived from conceptual geologic model. Four samples are shown for each scenario
5.6 (Center) Scatter plot of 500 realizations (in low-dimensional leading PCA spaces) as the training dataset and 64 generated model realizations from each label denoted by color. (Tiles) Comparison of mean and variance of priors and generated realizations for each label
5.7 (a) Configuration of transmitters and receivers in travel-time tomography experiment for dataset A. (b) Configuration of injectors and producers in two-phase flow experiment for dataset A and (c) for dataset B
5.8 (a) Arrival time data in leading PCA components where × denotes D, ■ denotes cluster centroid, ▲ denotes d_obs and ◦ represents D̃_2. (b) Realizations in leading PCA components where × denotes M and ◦ represents 64 generated realizations M̃_2. (c) Samples of M*_2. (d) Samples of M̃_2. (e) Comparison of mean and variance of relevant priors M*_2 to M̃_2 and the reference case m_ref used to generate d_obs
5.9 Comparison of the clustering outcomes of D (where assigned labels are projected on M and color-coded) and average percentage error ε in each cluster based on different choices of K. ■ denotes cluster centroid
5.10 (a) The most relevant label to d_obs (▲) is determined as label 8. (b) Comparison of M*_8 to M̃_8. (c) Samples of M*_8. (d) Samples of M̃_8. (e) Comparison of mean and variance of relevant priors M*_8 to M̃_8 and the reference case m_ref used to generate d_obs
5.11 Sensitivity analysis on the number of clusters and its impact on variance within M* (closest to d_obs). Mean and variance maps are computed from 64 generated realizations in M̃
5.12 (a) Colormap represents data mismatch and ■ denote D* where the mismatch is within d_obs ± ε. (b) ■ denote M* that correspond to D*. (c) Comparison of mean and variance of relevant priors M* to M̃ and the reference case m_ref used to generate d_obs. (d) Samples of M*. (e) Samples of M̃
5.13 (a) Distribution of M* and M̃. (b) M̃ color-labeled according to geologic scenario label of M*. (c) Geologic scenario proportion of M* and M̃. (d) Latent space interpolation between two generated samples from M̃
5.14 Profiles of D* and D̃ simulated from M* and M̃ respectively
5.15 (a) Colormap represents data mismatch and ■ denote D* where the mismatch is within d_obs ± ε. (b) ■ denote M* that correspond to D*. (c) Comparison of mean and variance of relevant priors M* to M̃ and the reference case m_ref used to generate d_obs. (d) Samples of M*. (e) Samples of M̃
5.16 Profiles of D* and D̃ simulated from M* and M̃ respectively. Forecast period starts at month-40 (when P3 is drilled) and conditioning period is from month-0 to 40
6.1 Long-Short Term Memory (LSTM) cell
6.2 Forecast model
6.3 Transfer learning workflow
6.4 Spatial interpolation with coordinate information for three scenarios of transfer learning
6.5 Error histogram and samples of forecast versus simulated reference
6.6 RMSE of the training set, testing set, set of all coordinates in the reference map, and set of training data combined with a portion of the testing set for three scenarios of transfer learning
6.7 Sampled formation and fluid properties for the training set and testing set
6.8 Simulated production responses in percentiles (P90/P50/P10) for the training set and testing set
6.9 Predicted production responses (normalized) versus reference (for all windows) for a test well
6.10 Prediction statistics for the training set and testing set
6.11 Predicted production responses versus reference of sample test wells for transfer learning
6.12 Prediction statistics of the testing set for transfer learning
6.13 Histograms of well parameters for the Bakken dataset
6.14 Distributions of training and testing datasets for three scenarios of transfer learning
6.15 RMSE of the training set, testing set, testing set for recursive predictions, and set of training data combined with a portion of the testing set for recursive predictions, for three scenarios of transfer learning
6.16 Predicted production responses (normalized) versus reference of sample test wells for transfer learning
6.17 Prediction statistics (random scenario) of the testing set for transfer learning
7.1 Schematic of the Physics-Guided Deep Learning (PGDL) model
7.2 Diagram of the statistical PGDL model architecture
7.3 Workflow of the Physics-Guided Deep Learning (PGDL) model
7.4 Performance comparison of the physics-constrained and black-box models for the toy dataset
7.5 Performance comparison of the PGDL models for the toy dataset
7.6 Samples of prediction from the physics-constrained and PGDL models for the toy dataset
7.7 Performance statistics of the physics-constrained and PGDL models for the toy dataset
7.8 Normalized distribution of the formation, fluid, and completion properties for the synthetic dataset
7.9 Multivariate production profiles for the synthetic dataset (Experiment 2a)
7.10 Samples of prediction from the PGDL model for the synthetic dataset (Experiment 2a)
7.11 Sensitivity analysis of the PGDL model performance for the synthetic dataset (Experiment 2a)
7.12 Multivariate production profiles for the synthetic dataset (Experiment 2b)
7.13 Performance statistics of the PGDL model for the synthetic dataset (Experiment 2b)
7.14 Samples of prediction from the PGDL model for the synthetic dataset (Experiment 2b)
7.15 Sensitivity analysis of the PGDL model performance for the synthetic dataset (Experiment 2b)
7.16 Multivariate production profiles for the synthetic dataset (Experiment 2c)
7.17 Samples of prediction from the PGDL model for the synthetic dataset (Experiment 2c)
7.18 Normalized distribution of the well properties for the field dataset
7.19 Performance comparison of the PGDL and black-box models for the field dataset
7.20 Samples of prediction from the PGDL model for the field dataset
8.1 Weight distribution and dimension of input and output of encoders, decoders and regressor for examples in sections 3.3.1 and 3.3.2
8.2 Weight distribution and dimension of input and output of encoders, decoders and regressor for examples in section 3.3.3
8.3 Schematic of the architecture used in this study. Refer to the shorthand notations in Table 8.10 for a description of the functions used within each layer
8.4 (a) Total losses (normalized) of C_φ, D_ψ and G_θ of CGAN using dataset A with geologic scenario as the label. (b) 5 samples of realizations per geologic scenario (row), generated by G_θ at selected iterations
8.5 Heatmaps of normalized testing RMSE from grid search for size of windows

Abstract

Reliable production forecasting is necessary for optimizing the development and management of subsurface flow systems and for assisting asset teams in making sound business decisions. The flow systems in subsurface reservoirs that govern production behavior range from the well-understood flow mechanism of porous media such as in clastic/carbonate petroleum reservoirs, to the highly complex flow mechanism in naturally/induced fractured reservoirs such as in tight oil formations. Standard forecasting workflows involve the development of mathematical models that capture the complex relationship between well parameters, reservoir properties, well operating controls and observed production data to predict future production. Practical applications involve highly nonlinear processes, non-Gaussian description of properties and non-stationary flow behavior that classical covariance-based forecasting workflows cannot handle.

This body of work presents new deep learning forecasting workflows that leverage state-of-the-art neural network architectures to efficiently extract and compactly represent spatial and temporal information, as well as learn complex multimodal input-output mappings for improving classical workflows. For conventional reservoirs with well-established flow-physics models, latent space representations of geologic models preserve the geologic consistency of history-matched solutions through the manipulation of latent spaces that translate to feature-based calibration in the full dimensional spaces, offering improvement in forecasting reliability when compared to standard workflows that use classical covariance-based techniques. We demonstrate that the discovered salient dynamical features in flow responses can be used to eliminate inconsistent geologic scenarios that are not supported by observed measurements.
Additionally, we develop a direct model calibration method to simultaneously parameterize and invert complex geologic models in efficient latent spaces that not only exploit the redundancy of large-scale geologic features but also retain features that are sensitive to flow response data. For unconventional reservoirs, latent space representations that capture the temporal dynamics in the flow response data and their relationship with well parameters and well operating controls are used to obtain long-term predictions using minimal initial production data. To benefit from existing physics-based models based on limited flow-physics understanding, we develop latent space representations of well parameters, well operating controls and production data fused with first-principle physics-based models to form hybrid forecasting workflows that result in improved, physically consistent predictions when compared to standard data-driven models. The developed workflows combine the benefits of data-driven and physics-based models and exploit the salient spatial and temporal features for improved efficiency and higher generalization ability. The workflows can be integrated into closed-loop reservoir management tools for robust production optimization and can be applied for history matching and production forecasting of other subsurface systems such as geothermal reservoirs, carbon capture and storage reservoirs and groundwater reservoirs.

Chapter 1: Introduction

Simulation models are developed to represent complex subsurface systems and multiphase flow and transport processes in heterogeneous porous media such as petroleum reservoirs [87] and groundwater systems [204]. These models are used for waste management, carbon sequestration, remediation of soil and groundwater contamination, and extraction of hydrocarbon and geothermal energy. The primary objective of building these models is to give improved predictions of key variables, such as the degree and scale of contamination, to design remediation schemes for sustainable groundwater management [87]. In the context of energy systems, a reliable subsurface model allows optimal extraction schemes (e.g., waterflooding) for improved recovery of resources. The model development process involves subjective interpretation and integration of sparse multiscale data across multiple disciplines that inherently introduces uncertainty [34]. To build a representative subsurface model with robust predictive power, the main source of uncertainty that needs to be accounted for is typically the geologic description of reservoir parameters that govern fluid flow and solute transport. Conceptual geologic understanding of reservoir architecture is derived by geologists who interpret available data and develop multiple plausible scenarios of prior models.

Developing a computer simulation model that can be used to simulate and predict the performance of subsurface flow systems often constitutes a complex multi-stage process that involves acquisition, processing, integration, and interpretation of various types of data. The model development process is composed of hierarchical sequential steps, each with its own sources of uncertainty and subjectivity that must be captured and propagated along the chain. Some of the major sources of uncertainty tend to be present at early stages, where very limited data is used to develop the reservoir structure and the conceptual geologic model for the reservoir.
Faced with limited data, geoscientists often have to resort to subjective assumptions in developing a conceptual model that can be used by subsurface flow modelers.

Figure 1.1 shows a typical model development process for a conventional subsurface system, in which at the early stages geoscientists include structural uncertainties in defining the top and base of a reservoir due to uncertainties associated with seismic time-depth conversion. This uncertainty arises because of the limited number of well penetration points that are available to obtain the conversion coefficients. Efforts to consider structural and fault placement uncertainties require complex parameterization techniques [42]. As the process continues down the chain, other sources of uncertainty present themselves; for example, in the case of a fluvial environment, channel thickness, width and azimuth are some of the common sources of uncertainty. In practice, stochastic representation of these parameters is informed by similar fluvial systems and through qualitative expert elicitation [21, 200].

Classical workflows typically do not include dynamic response data in reducing the uncertainty and subjectivity that the geologists deal with. In most cases, only well testing data from the exploration and appraisal wells are available at early stages. Hence, pressure transient analysis is performed to provide coarse insight on the shape and size of the reservoir, which is used by the geologist to ascertain reservoir parameters. As production and injection wells are drilled, the subsurface team will have more direct samples from the formation and inferred (from production data) information about the connectivity between wells [54]. As a green field matures to a brown field, fine-scale reservoir models become highly relevant as a tool for guiding future infill well placement and developing reservoir management strategies. Traditionally, integration of dynamic data into reservoir models is performed by reservoir engineers to update the initial models while honoring a single conceptual geologic model that is provided by the geologists, a task that is performed either manually or in an assisted way using automatic model calibration workflows [70, 93, 99].

Figure 1.1: Classical workflow that considers multiple geologic scenarios and other sources of uncertainty in constructing prior models for history matching to generate reliable dynamic models.

Reservoir simulation is traditionally used to obtain production forecasts to facilitate field development and management decisions. The underlying theories and mathematical models for the flow and transport phenomena in conventional reservoirs are extensively studied and understood [14, 250]. Such numerical modeling requires full geologic reservoir descriptions and a history matching process to integrate historical data into the geologic parameters to obtain reliable forecasts. In contrast, unconventional resources such as shale oil and shale gas are relatively new sources of petroleum and are yet to be strictly defined. Therefore, physical models derived from the current understanding of conventional wells do not fully capture the physical processes that take place during production from extremely tight formations with complex fracture networks. While an inverse problem formulation may seem like a logical approach to production forecasting, it is not always ideal due to the limitations of the reservoir simulator.
In fact, the reliability of the forward model may be called into question during the calibration process. As a result, data-driven forecasting methods are becoming increasingly popular, particularly in the case of fractured tight formations with abundant drilling and continuous data acquisition. Data-driven methods can also complement existing reservoir simulators to account for any imperfections of the physics-based models. Traditional workflows tend to rely solely on either data-driven or physics-based methods, without a convenient way to take advantage of both. However, by combining the strengths of both approaches, it is possible to create a more comprehensive and accurate production forecast.

1.1 Production Forecasting

1.1.1 Production Forecasting in Conventional Resources

1.1.1.1 Challenges in Automated Model Calibration

Subsurface systems involve ubiquitous nonlinear processes and non-Gaussian descriptions of prior models that could exhibit complex geologic features [33]. The available static data used to build the flow models come with uncertainty that is propagated into a set of prior model realizations. Conditioning prior model realizations from multiple geologic scenarios on observed data is done to reduce prior uncertainty and is an ill-posed inverse problem where multiple priors can sufficiently reproduce the observed data [12]. The calibration of the prior models involves the integration of observed dynamic data into the set of prior realizations, to form a set of calibrated model realizations that can reproduce the field observations [71]. The data integration process improves the prediction performance of the flow models, where reliable forecasts of production responses can be obtained when the calibrated models are used to simulate physical flow processes [15].

The model calibration process is typically formalized as an ill-posed inverse problem, as observed field data provides only limited information to resolve high-dimensional subsurface models [166]. Available observed measurements used to constrain solutions are sparse, limited and can only explain large-scale features of a subsurface reservoir that is typically modeled at much higher resolution. As such, inverse problems can give many non-unique solutions (i.e., various model realizations from multiple scenarios that reproduce the field observations while generating distinctive predictions) or there may be no solution at all due to inconsistent data. As dynamic data provides only aggregate conditioning information, the uncertainty in (geologic) reservoir description is typically assumed to be more dominant than the errors associated with the governing flow equations and measurements [244]. Therefore, in general, the uncertainty in reservoir description is the main focus of model calibration workflows. The conditioning of prior models to their observed dynamic response can be viewed as a Bayesian inverse problem [242], where the state of uncertainty associated with the model parameters and the corresponding observations is described using probability density functions (PDFs).

Early automatic model development workflows [70, 93, 99] are formulated as deterministic and probabilistic least-squares regression problems where the mismatch between observed and simulated data from a set of prior models is minimized. In probabilistic inverse modeling, an ensemble approximation is often used to represent the uncertainty in model parameters.
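For concreteness, the Bayesian and least-squares formulations referred to above can be summarized in the following generic form; the notation here is illustrative and is not taken from the dissertation. With a forward simulation operator g(·), observed data d_obs with error covariance C_D, and a Gaussian prior with mean m_pr and covariance C_M:

```latex
% Bayesian view: posterior over model parameters m given observed dynamic data
p(\mathbf{m} \mid \mathbf{d}_{\mathrm{obs}}) \;\propto\; p(\mathbf{d}_{\mathrm{obs}} \mid \mathbf{m})\, p(\mathbf{m})

% Equivalent least-squares (MAP) objective under Gaussian assumptions
J(\mathbf{m}) = \tfrac{1}{2}\big(g(\mathbf{m}) - \mathbf{d}_{\mathrm{obs}}\big)^{\top} C_D^{-1}\big(g(\mathbf{m}) - \mathbf{d}_{\mathrm{obs}}\big)
              + \tfrac{1}{2}\big(\mathbf{m} - \mathbf{m}_{\mathrm{pr}}\big)^{\top} C_M^{-1}\big(\mathbf{m} - \mathbf{m}_{\mathrm{pr}}\big)
```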
Recently proposed history matching methods [184] introduced various parameterization techniques of prior models 5 for a better-posed problem (i.e. dimensionality reduction of spatially distributed reservoir param- eters to alleviate underdeterminedness) and to maintain solution plausibility (i.e., to preserve the expected geologic continuity). Principal Component Analysis (PCA), Singular Value Decomposi- tion (SVD) [60, 84] and non-geologic bases such as Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT) [94, 96] have been used for parameterization although preservation of geologic realism is limited to two-point statistics. To reproduce complex fluvial features, kernel PCA (KPCA) [218] has been proposed to preserve higher-order statistics of prior models where predetermined higher-order polynomial kernels are specified to define a high dimensional feature space. Closed-form solution exists for cases where Gaussian model parameters have a linear rela- tionship with data. In practice however, the linear-Gaussian assumption typically does not hold. For nonlinear data conditioning on non-Gaussian priors where closed-form solutions is not avail- able [206], numerous multi-Gaussian approximation methods have been proposed (e.g. Levelset method [38, 189], Truncated Pluri-Gaussian method [13, 40], Distance Transform [81], Normal- Score Transform [275]). In these methods, the parameterization transforms higher-order statistics of reservoir parameters to multi-Gaussian distribution that is amenable to traditional model cali- bration techniques. Ensemble-based methods using Ensemble Kalman Filter (EnKF) [1, 62, 96] and Ensemble Smoother with Multiple Data Assimilation (ES-MDA) [61, 254] have also been pro- posed where probability density function (pdf) of reservoir parameters is represented as a covari- ance matrix that is sequentially (or altogether) updated (with observed data) and computed from an ensemble of priors. Promising results of EnKF [86, 145, 227] and ES-MDA [65, 136, 249] on large-scale field application have been reported despite known issues of mode-collapse and un- derestimation of uncertainty. In ensemble-based methods, the ensemble mean of the posteriors represents the best estimate, while the ensemble empirically represents the PDF of the assimilated parameters. While forward sampling procedure such as Markov-chain Monte Carlo (MCMC) [75, 142] pre- serves prior statistics, it is computationally intractable for high dimensional problems in practical 6 applications. Sampling techniques to generate calibrated models such as the Probability Perturba- tion Method (PPM) [28], Probability Conditioning Method (PCM) [95] and pilot-points method [148] combine the simulation process of geologic models with the data conditioning step. In gen- eral, for complex nonlinear problems, computational efficiency of model calibration methods often comes at a cost of preserving geologic realism in the set of calibrated models. A difficulty in using the existing parameterization methods in model calibration workflows, which is not widely studied in the literature, is the treatment of uncertainty in the geological con- cepts (or geologic scenarios). Classical model calibration methods work with the assumption that the geologic continuity model (e.g., variogram model or training image (TI)) is known and model calibration is performed to determine the local distribution (within the field of interest) of the spa- tial patterns represented by the given model. 
When the geologic scenario is uncertain or several plausible scenarios are provided, the diversity of the spatial features can increase dramatically. Linear parameterization methods (such as the PCA) show strong sensitivity to such diversity in ge- ological features as they tend to linearly combine the features, making the preservation of geologic realism a challenge. Multiple recent studies [104, 163] suggest that, even the notion of geologic scenario is subjective as boundaries in geology are highly amorphous with no clear demarcation between geologic scenarios. In general, the performance of proposed history matching methods are dependent on three interrelated factors; handling of uncertain prior geologic scenarios, the effectiveness of parameter- ization and the performance of data conditioning technique. Progress in any one factor (out of the three) often results in an improvement of the total performance of any automatic history match- ing algorithms. The recent dramatic rise in computing abilities and machine learning advances have propelled inception of innovative data-driven approaches to model subsurface systems. Such statistical learning methods offer an advantageous alternative to classical workflows that typically involve incomplete physical description of poorly understood systems. When used in the context of inverse problems, these nonparametric methods extract salient information from large amounts 7 of data and may be used for geologic scenario selection, parameterization, learning inverse and forward mapping, and data conditioning. 1.1.2 Production Forecasting in Unconventional Resources 1.1.2.1 Challenges in Production Prediction One of the major challenges for petroleum engineers working with unconventional reservoirs is a lack of models that accurately represent a physical relationship between formation, completion and fluid properties, and production responses. The wide range of uncertainty in model param- eters, such as the interaction between hydraulic fractures and natural fracture networks, coupled with complexities in modeling geomechanical interactions can adversely affect simulation-based prediction of the production behavior. Additionally, the heterogeneity in unconventional reservoirs consists of natural and induced fractures of various scales and is not trivial to model. Conven- tional reservoir simulation models do not represent the complexities associated with unconven- tional reservoirs and are not suitable for predicting their production behavior [5]. Alternative forecast methods to reservoir simulation include variants of Decline Curve Analy- sis (DCA) [138] that are widely used in the industry as a quick forecasting tool with well-known limitations. These empirical methods make simplifying assumptions about the reservoir hetero- geneity, operating conditions, and flow boundaries and yet, are popular for both conventional and unconventional reservoirs. On the other hand, data-driven forecast methods involve learning sta- tistical trends from a large collection of historical production data from relevant wells. Recent papers [205, 265] have widely discussed applications of data-driven methods (i.e., machine learn- ing and deep learning) to the problem of earth system science. The abundance of wells drilled in fractured tight formations and continuous data acquisition effort motivate the use of data-driven methods. 
Furthermore, most unconventional wells show minimal cross-communication and can be taken as independent data points, as most data-driven methods require training data points to be independent and identically distributed [100]. Auto-regressive models have also been applied for production forecasting [166] despite their shortcomings in representing complex nonlinear trends. Since the flow physics-based mathematical models (i.e., first-principle derivations) for unconventional reservoirs will likely take many more years to mature, data-driven methods provide a flexible way to discover the flow behavior from acquired field data. These methods offer an attractive and practical alternative for such complex problems [8].

Alternative forecast methods (i.e., empirical and data-driven methods) rely on high-dimensional (i.e., multivariate) field measurements that are typically complex and can be incomplete, noisy, and erroneous [160]. To obtain reliable forecasts, they require sufficient training data from an extended period of production for any target well and may have limited practical use for newly drilled wells, especially when multiple flow regimes are involved. Moreover, future production trends are often influenced by factors such as formation, completion, and fluid properties, and by changes in well controls or operating conditions, with a complex nonlinear relationship that may not be easily captured by empirical and data-driven methods. Several works have focused on expanding empirical forecast methods to account for additional influencing factors. For example, Ma and Liu [149] developed the Nonlinear Extension of Arps (NEA) decline model, which combines the kernel method with traditional DCA to model nonlinear multivariate functions. The authors acknowledge that the performance of the algorithm is sensitive to several hyperparameters that need to be tuned. Xi and Morgan [266] apply the Kriging method to DCA parameter values to forecast gas production at new well locations in the Marcellus shale.

As statistical prediction tools, data-driven methods have limitations in extrapolating beyond training data and have limited use when sufficient training data is unavailable. For instance, training a large under-determined data-driven predictive model (e.g., a neural network) with insufficient data can produce an overfitted model that does not generalize beyond the training data. Moreover, conclusions drawn from a well-fitted statistical model are strongly shaped by the structure of the model and do not imply much about the mechanism of fluid flow in the data set. While the "black-box" nature of statistical predictive models allows them to be used by practitioners without in-depth knowledge of statistical learning, their lack of interpretability and auditability often impedes industry-wide adoption [210].

Unlike statistical models, physics-based models impose causal relations that can provide reliable predictions over a wide range of inputs. While a detailed physics-based description of multiphase fluid flow in unconventional hydrocarbon reservoirs is not yet available, approximate physical flow functions have been proposed to capture the general production behavior of unconventional wells. As a simple example, variants of the Decline Curve Analysis (DCA) method [11, 66, 138] that are based on hyperbolic functions have been proposed.
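As an illustration of this empirical decline-curve family, the following minimal sketch evaluates the hyperbolic Arps decline relation, q(t) = q_i (1 + b D_i t)^(-1/b). The parameter values (q_i, D_i, b) are hypothetical and would normally be fitted to early production data; the sketch is not tied to any specific variant cited above.

```python
import numpy as np

def arps_hyperbolic_rate(t, qi, di, b):
    """Hyperbolic Arps decline: q(t) = qi / (1 + b * di * t)**(1 / b).

    t  : time since the start of decline (e.g., months)
    qi : initial production rate
    di : initial (nominal) decline rate, 1/time
    b  : hyperbolic exponent, 0 < b < 1 (b -> 0 recovers exponential decline)
    """
    return qi / np.power(1.0 + b * di * t, 1.0 / b)

# Illustrative forecast for a single well (hypothetical parameter values).
t = np.arange(0, 60)                                  # 60 months
q = arps_hyperbolic_rate(t, qi=500.0, di=0.15, b=0.8)  # monthly rates
cumulative = np.cumsum(q)                              # simple cumulative production
```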
Despite their well-known limitations and simplifying assumptions about the reservoir heterogeneity, operating conditions, and flow boundaries, these DCA variants are often used for conventional and unconventional reservoirs [172]. [216] propose another example of an approximate flow function that leverages an analytical model for well production under a constant bottomhole pressure from a dry gas unconventional reservoir, capturing the relationship between cumulative production and the square root of time.

Given the benefits and limitations of data-driven and physics-based models, a hybrid method that combines their strengths (and weaknesses) may prove valuable and flexible for production forecasting in unconventional reservoirs with limited physical understanding and data. Such hybrid methods are described by [140] as a gray-box system that blends a clear-box system (purely physics-based) and a black-box system (purely data-driven). When a spatial or temporal context is provided to deep learning models as contextual cues, they can automatically extract spatio-temporal features more efficiently [205]. Developing a hybrid predictive model for production forecasting in unconventional reservoirs requires novel formulations that can effectively combine the strengths of different techniques to provide accurate predictions and optimize production efficiency. Such a model can have a significant impact on the energy industry by improving production planning, reducing costs, and increasing profitability.

1.2 Neural Network Architectures for Latent Space Representations

Recently, machine learning and predictive analytic methods have gained significant traction in scientific and engineering domains, including geosciences. A simple interpretation of machine learning methods is detecting patterns in data and using them to generate predictions. In many cases, the patterns in high-dimensional data are learned through low-dimensional representations in a latent space. Pattern learning through the dimensionality reduction application of machine learning is closely related to the parameterization problem and has been studied by the subsurface modeling community in recent years. In addition to parameterization, machine learning methods have also been used to perform data conditioning and uncertainty quantification without solving a classical inverse modeling problem. In machine learning, neural networks are abstract mathematical models that consist of interconnected nodes/units that attempt to mimic the information processing behavior of neurons. Variants of neural networks have been applied in the petroleum engineering domain for reservoir history matching and uncertainty quantification of conventional reservoirs [163, 164], as well as for time-series forecasting.

Dimensionality reduction for parameterization remains an active research area today due to its importance across many disciplines that deal with high-dimensional data. Early linear techniques such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) (where PCA corresponds to SVD applied to the mean-centered data matrix) are not adequate for complex nonlinear data. To that end, various nonlinear techniques [49, 253] such as Kernel Principal Component Analysis (KPCA) [226] and Locally Linear Embedding (LLE) [219] have been developed.
Recent state-of-the-art nonparametric and nonconvex techniques such as the Generative Adversarial Network (GAN), the convolutional autoencoder (AE) and its variants (e.g., the VAE [121]) have shown encouraging results [261]. When a fully-connected AE has a single layer with a linear activation function, the subspace spanned by the learned low-rank representation is equivalent to that of PCA.

For production forecasting of unconventional reservoirs, nonlinear multivariate time-series forecasting remains an area that merits continued research [48]. Production forecasts are often biased and poorly calibrated, thus making a statistical model with unbiased input (i.e., automatic feature extraction) an attractive choice of forecasting method [24]. Chithra Chakra et al. [43] use Higher-Order Neural Networks (HONN) that compute products of the inputs to account for nonlinearities in short-term forecasts (i.e., 6-18 months). Aizenberg et al. [2] apply a multi-layer neural network with multi-valued neurons to perform long-term multivariate time-series forecasts within the context of supervised machine learning. A special type of neural network, the Recurrent Neural Network (RNN), such as the Long Short-Term Memory (LSTM) network [89], is specifically designed to automatically extract complex dynamical relations and patterns from data and use them to generate predictions.

While initial studies show promising results from these methods, many properties of these methods and their applicability to subsurface flow parameterization, inverse problems and production forecasting are current research topics that require further investigation. In the next sections, we introduce several machine learning techniques, from classical PCA to deep learning techniques capable of advanced nonlinear data compression, feature identification and learning complex input-output mappings.

1.2.1 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a popular dimensionality reduction technique that is useful in a wide range of applications. The learned basis functions from PCA capture the structures and variance within high-dimensional datasets and can be useful for generating low-rank representations with minimum loss of information. A geologic realization (continuous Gaussian or discrete facies property field), represented as $u \in \mathbb{R}^n$, can have a linear expansion in a sub-space defined by specialized basis functions $\phi_i$, with $0 \leqslant i < n$, as:

$u = \sum_{i=0}^{n-1} v_i \phi_i$   (1.1)

The expansion coefficients $v = [v_0, v_1, \dots, v_{n-1}]^T$ represent $u$ in $\{\phi_i\}_{i=0,1,2,\dots,n-1}$ and can be sparsely approximated by $s$ terms, where $s \ll n$ and $(n-s)$ of the coefficients are approximately zero. The approximation quality depends on the complexity of the geologic realization and the information in $\phi_j$ for $0 \leqslant j < s$. The linear approximation of $u$ in $\{\phi_j\}_{j=0,1,2,\dots,s-1}$ is expressed as:

$u = \sum_{i=0}^{n-1} v_i \phi_i \cong \sum_{j=0}^{s-1} v_j \phi_j$   (1.2)

where Equations (1.1) and (1.2) can be written in matrix notation as

$u_{n\times 1} = \phi'_{n\times n} v_{n\times 1} = \phi'_n v \cong \phi'_{n\times s} v_{s\times 1} = \phi'_s v$   (1.3)

Here, the columns of $\phi'_s$ contain the significant basis elements $\{\phi_j\}_{j=0,1,2,\dots,s-1}$. Using an orthonormal basis such as PCA, $s$ is identified as the leading $\phi_j$ basis elements with the largest eigenvalues. The low-rank vector of expansion coefficients is used to approximately represent each geologic realization. PCA can also be conveniently used to represent temporal variations in multivariate dynamic data as a low-rank vector of expansion coefficients.
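A minimal sketch of the PCA parameterization in Equations (1.1)-(1.3) is given below, assuming an ensemble of flattened prior realizations and using NumPy's SVD to obtain the basis; the array sizes are illustrative only.

```python
import numpy as np

# Ensemble of N prior realizations, each flattened to a vector of length n
# (hypothetical sizes; realizations stacked as columns of U).
N, n, s = 200, 10000, 20
U = np.random.rand(n, N)                  # stand-in for prior model realizations

u_mean = U.mean(axis=1, keepdims=True)
Phi, svals, _ = np.linalg.svd(U - u_mean, full_matrices=False)

Phi_s = Phi[:, :s]                        # leading s basis elements (Eq. 1.3)
u = U[:, [0]]                             # one realization to compress
v_s = Phi_s.T @ (u - u_mean)              # low-rank expansion coefficients
u_approx = u_mean + Phi_s @ v_s           # rank-s approximation of u
```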
1.2.2 Fully-Connected Neural Network (FCNN)

A fully-connected neural network (FCNN), also known as a densely connected neural network, is a type of artificial neural network (ANN) in which each neuron in one layer is connected to every neuron in the adjacent layer. This means that each neuron in layer i is connected to every neuron in layer i+1, with no connections between adjacent layers omitted. The FCNN architecture is widely used in many fields of machine learning, including computer vision, natural language processing, and reinforcement learning, due to its ability to learn complex patterns in high-dimensional data. It is also a fundamental building block of deep neural networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). In this work, the FCNN architecture is used to complement other deep learning architectures to extract salient features and the relationships between those features.

A typical FCNN consists of multiple layers of neurons, with the input layer taking in the input data and the output layer producing the final output. The layers in between are called hidden layers, and they are responsible for transforming the input data into a form that can be used by the output layer. Each neuron in a hidden layer takes as input the outputs of all the neurons in the previous layer, applies a nonlinear activation function, and produces a scalar output that is passed on to the next layer. The weights of the connections between neurons in an FCNN are learned during the training phase using a form of gradient descent optimization, such as backpropagation [211]. The goal of the training process is to minimize a loss function that measures the difference between the predicted output and the true output for a given input. One of the key advantages of FCNNs is their ability to learn complex nonlinear relationships between inputs and outputs. However, they are also prone to overfitting if the number of neurons or layers is too large or if the training data is too small.

1.2.3 Convolutional Neural Networks (CNN)

Since the seminal work of LeCun et al. [130] in using CNNs to classify hand-written digits, the CNN has become increasingly popular due to the growing amount of available data and advances in computational hardware. The architecture of the CNN has also progressed significantly in terms of capacity (here referring to the ability to learn complex relationships) and computational efficiency [122, 155, 239]. Within the subsurface community, variations of convolutional neural network models have been applied to problems such as seismic inversion to predict hydrocarbon prospectivity [9, 133, 208], geologic feature extraction for preserving realism in data assimilation [114], well placement optimization [153], parameterization of geologic models [128] and as a proxy or surrogate model to forecast field production or net present value (NPV) [18, 68, 101, 152, 159, 225]. In this section, a brief overview of the CNN fundamentals is presented along with a review of its key components, leaving detailed discussions to relevant references in the computer science literature (e.g., [131, 179]). The CNN is a class of supervised learning algorithms inspired by the behavior of the visual cortex in the brain, where individual cortical neurons respond to stimuli only in a restricted region of the visual field.
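Returning to the FCNN of Section 1.2.2, a minimal sketch (assuming PyTorch) of a small fully-connected network trained with a regression loss is shown below; the layer sizes and batch size are arbitrary placeholders.

```python
import torch
from torch import nn

# A small FCNN: every neuron in one layer connects to every neuron in the next.
fcnn = nn.Sequential(
    nn.Linear(in_features=64, out_features=128),  # input layer -> hidden layer 1
    nn.ReLU(),                                    # nonlinear activation
    nn.Linear(128, 128),                          # hidden layer 2
    nn.ReLU(),
    nn.Linear(128, 1),                            # output layer (regression)
)

x = torch.randn(32, 64)                           # batch of 32 input vectors
y_true = torch.randn(32, 1)
loss = nn.MSELoss()(fcnn(x), y_true)              # loss minimized during training
loss.backward()                                   # gradients for a gradient-descent update
```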
The CNN architecture mimics this behavior of the visual cortex by using many filters in several layers to detect different features in the input signal that are important in describing the output. Each layer of a CNN architecture consists of a convolution layer, followed by an activation layer and a pooling layer. The main distinguishing component of the CNN is the convolution layer, which performs a mathematical convolution (filtering) operation using a large number of filters, each responsible for detecting specific features of the input data. During training of a CNN, the parameters (weights) that describe each convolution filter are adapted to the specific training data that is provided. The choice of filter dimensionality in a CNN depends on the nature of the input data, with 1D filters used for processing time-series data, 2D filters for image data, and 3D filters for volumetric data. Understanding the appropriate filter dimensionality for a given task is critical for developing accurate and efficient deep learning models.

Figure 1.2 (a) illustrates a convolution operation for extracting salient (spatial/temporal) information from an input matrix. The output of the convolution layer is used as input into the activation layer. Activation layers are an abstraction representing the rate of potential firing action in a neuron. In a CNN, activation layers (Figure 1.2 (b)) apply nonlinear transformations to their input signals to generate the corresponding outputs, which determine whether individual neurons will fire or not (given their input). The transformed outputs of the activation layer are then transferred as input signals to the next layer (pooling). Pooling (or down-sampling) layers (Figure 1.2 (c)) are designed to reduce the number of parameters and the computation in the network. The main functionality of the pooling layers in a CNN is to replace high-dimensional inputs (from the activation layer) with low-dimensional approximations that will be transferred to the next layer in the network. Compared to dense artificial neural networks and other machine learning algorithms, the CNN represents complex nonlinear systems using a reduced number of parameters through weight sharing (filters) and by taking advantage of local spatial coherence and distributed representation.

Figure 1.2: (a) Convolution operation (of a trainable filter) on an input results in activations. (b) Nonlinear transformation on activations to generate nonlinear activations. (c) Pooling or down-sampling layer reduces the number of parameters. (d) In a supervised learning setting, these nonlinear activations are compared to the expected corresponding output variable (classification for a discrete variable or regression for a continuous variable) and the filter is updated to minimize the difference. Images not drawn to scale.

1.2.4 Autoencoders (AE)

A recently proposed form of neural network, the autoencoder (AE) [116], is utilized for its compression performance and ability to represent nonlinear input data. The autoencoder is a neural network architecture for nonlinear dimensionality reduction [116] that consists of two components, an encoder and a decoder. The encoder is composed of several main layers, each consisting of a convolutional or dense function and a nonlinear activation function. The convolutional operation extracts local spatial or temporal features in the input data, while the nonlinear activation function allows the representation of nonlinear features within the autoencoder.
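The convolution-activation-pooling sequence of Section 1.2.3, which also forms each convolutional encoder layer just described, can be sketched as follows (assuming PyTorch; the filter count and grid size are illustrative).

```python
import torch
from torch import nn

# One CNN layer as described above: convolution (trainable filters),
# nonlinear activation, then pooling (down-sampling).
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),  # 16 filters
    nn.ReLU(),                    # nonlinear activation on the convolution output
    nn.MaxPool2d(kernel_size=2),  # halves the spatial resolution
)

grid = torch.randn(8, 1, 64, 64)  # batch of 8 single-channel 64x64 property maps
features = conv_block(grid)       # shape (8, 16, 32, 32)
```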
Within the encoder, successive layers are used to gradually reduce (i.e., downsample) the dimensionality of the input to obtain the desired compact representation as a latent variable. The decoder takes the output of the encoder (i.e., the latent variable) as its input and does the opposite of the encoder: the latent variable is gradually upsampled to obtain a reconstruction. Similar to the encoder, the decoder is also composed of several main layers, except that the downsampling operation is replaced with an upsampling operation. The encoder and decoder are connected and trained together using regression loss functions. Autoencoder-like neural networks have been applied for high-fidelity image generation [201], image compression [245] and data denoising [256]. In hydrogeology and hydrocarbon systems, they have been used as a parameterization technique for spatial parameters [103, 114, 127, 134], seismic inference [195], anomaly detection [267], identification of water bodies from remote sensing data [268] and estimation of formation properties from borehole logs [252].

Artificial neural networks (ANN) that mimic the complex structure and biological functions of the human brain [3] have inspired the development of various neural network architectures [199] aimed at solving different problems [228]. Autoencoders can be implemented with convolutional layers (i.e., CNN) and/or dense neural network layers (i.e., ANN). As universal function approximators, these neural networks have the capacity to adapt to the nature of the input parameters (i.e., nonlinear and non-Gaussian) and a predefined learning objective function. The neuron in a neural network is represented as an abstract function (node) $y = f_w(x)$ such that, given training data pairs $(x, y)$ in a supervised learning setting, the weight $w$ in the function $f_w$ is obtained by minimizing an objective function $J(w) = \sum \| f_w(x) - y \|_2^2$. To account for nonlinearity in the input parameters, the function $f_w$ can include the element-wise application of a nonlinear function (e.g., the sigmoid function). Neural network architectures (e.g., autoencoders) are modularly formed by assembling many nodes across many layers.

Figure 1.3 compares the latent spaces generated by a vanilla autoencoder (AE), a variational autoencoder (VAE) and PCA, using the digit-MNIST dataset [130] at a similar level of compression (i.e., dimension of the latent variables). The reconstruction quality of AE and VAE is superior to PCA (i.e., exhibiting rich and complex spatial features) as they can account for highly nonlinear features and allow for a higher level of compression.

Figure 1.3: Comparison of the generated latent spaces from AE, VAE and PCA.

1.2.5 Generative Adversarial Networks (GAN)

Generative Adversarial Networks (GAN) have been applied in the reconstruction of three-dimensional models of porous media [176, 177], parameterization of geologic models as an alternative to MPS algorithms [36], seismic inversion [178], generating realizations that honor hard data as well as sedimentary architecture [271] and in geostatistical inversion problems [128]. For model calibration and history matching applications, several authors have utilized GANs for low-dimensional representation of complex geologic models [36] and subsequently proceed to integrate observed flow response data into the low-dimensional representations using gradient-based [128, 178] or ensemble-based methods [31, 32].
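Before describing the GAN architecture in more detail, a minimal sketch of the encoder-decoder pair of Section 1.2.4 is given below (assuming PyTorch); the latent dimension, layer sizes and input size are illustrative, and the two networks are trained jointly with a regression (reconstruction) loss.

```python
import torch
from torch import nn

# Encoder: stacked convolutional blocks downsample the input to a compact latent variable.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 64),                                       # latent dimension 64
)

# Decoder: mirrors the encoder and upsamples the latent variable to a reconstruction.
decoder = nn.Sequential(
    nn.Linear(64, 32 * 16 * 16), nn.ReLU(),
    nn.Unflatten(1, (32, 16, 16)),
    nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
    nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),              # 32 -> 64
)

m = torch.randn(8, 1, 64, 64)            # batch of property maps
z_m = encoder(m)                         # latent representation
m_hat = decoder(z_m)                     # reconstruction
recon_loss = nn.MSELoss()(m_hat, m)      # regression loss used to train both networks
```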
The Generative Adversarial Network (GAN) is a class of deep neural network architectures that consists of two stacked fully differentiable network models, called the generative model and the discriminative model. These models are simultaneously trained to estimate the distribution of the data and the likelihood that a sample belongs to the training dataset [78]. The two models are optimized in a two-step alternating procedure with opposing objectives, where the goal of the generator is to create fake samples that look realistic while the goal of the discriminator is to correctly classify fake and real samples. The loss function used to optimize the trainable weights within the generator and the discriminator is a mini-max game between the two models, where convergence is ideally reached at a Nash equilibrium between them. Recent works have shown that GAN outperforms traditional parameterization methods when complex spatial patterns are present [31, 128]. Figure 1.4 compares the generated images from GAN and its variants, the Wasserstein GAN and the Wasserstein GAN with gradient penalty, trained using the digit-MNIST dataset [130] at a similar level of compression (i.e., dimension of the latent variables). The adversarial training used in GAN allows for superior reconstruction and a higher level of compression when compared to autoencoders trained with regression loss functions, resulting in crisper images than those generated in Figure 1.3.

Figure 1.4: Comparison of generated images using GAN and its variants (e.g., WGAN and WGAN-GP).

1.2.6 Recurrent Neural Networks (RNN)

Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to process sequential data such as time-series, text, speech, music, and video. Unlike traditional feedforward neural networks that process data in a single pass, RNNs can maintain an internal memory or state, allowing them to capture temporal dependencies in the input data. The key feature of RNNs is that they have feedback connections, which allow the output of the network at a given time step to be fed back as input at the next time step. This feedback mechanism enables RNNs to maintain an internal memory, which can be used to capture long-term dependencies in sequential data. RNNs can be used for a variety of tasks, including language modeling, speech recognition, machine translation, image captioning, and sentiment analysis. However, standard RNNs suffer from the vanishing gradient problem, which can make it difficult to capture long-term dependencies in the data. To address this issue, several variants of RNNs have been proposed, including Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which are designed to preserve information over longer periods of time.

LSTM is a popular variant of the RNN as it circumvents the issue of decaying information (i.e., the vanishing gradient) over extended time intervals [89], thus making it suitable for production forecasting involving the multiple flow regimes that are typically observed over extended production periods. The advantage of LSTM over the standard RNN is that it can learn long-term temporal dependence effectively by handling the vanishing gradient problem. When employed for dimensionality reduction, LSTM can effectively learn the latent space representation or encodings of time-series data by capturing the essential temporal dynamical features, which can then be utilized for accurate production forecasting.
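A minimal sketch of using an LSTM to encode a multivariate production time-series into a compact latent vector is shown below (assuming PyTorch); the number of phases, time steps and latent dimension are illustrative.

```python
import torch
from torch import nn

# LSTM encoder that compresses a multivariate production time-series into a
# fixed-length latent vector capturing its temporal dynamics.
lstm = nn.LSTM(input_size=3, hidden_size=32, num_layers=1, batch_first=True)
to_latent = nn.Linear(32, 8)              # latent dimension 8

series = torch.randn(16, 24, 3)           # 16 wells, 24 time steps, 3 phases
outputs, (h_n, c_n) = lstm(series)        # h_n: final hidden state, shape (1, 16, 32)
z = to_latent(h_n[-1])                    # latent encoding of each well's dynamics
```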
1.3 Latent Space Representations for Production Forecasting

1.3.1 Inverse Problem Formulation

The general scheme of the inverse problem is depicted in Figure 1.5. The model space $M$ represents the spatially correlated high-dimensional physical properties of the subsurface system (populated by prior model realizations, $\{m_i\}_{i=1:N}$), while the data space $D$ contains the simulated production responses for each of the prior model realizations (denoted as $\{d_i\}_{i=1:N}$). A forward model $g(\cdot)$ (e.g., a numerical simulator) provides the production response data $d$ based on a given model $m$ (subscript and averaging factor $1/N$ dropped for brevity) and is defined as:

$d = g(m)$   (1.4)

Figure 1.5: Forward and inverse problem.

A linear forward model is defined as $d = Gm$. The sparse and minimal access to rock properties in a subsurface system motivates the use of observed production response data, $d_{obs}$, to reduce the size of the set $M$ to include only prior model realizations that can be supported by the observed data. Traditionally, the inverse problem is formulated as an online (after $d_{obs}$ is collected) iterative update to $m$ in the minimization of the data mismatch between simulated and observed data such that:

$L(m) = \sum^{N} \| g(m) - d_{obs} \|_2^2$   (1.5)

This formulation is typically ill-posed due to the large discrepancy in the resolution of $d_{obs}$ and $m$. A classical way to promote solution uniqueness in such an ill-posed formulation is to reduce the dimension of $m$, which is done by mapping it to an intermediate low-dimensional latent space $z_m$. In a data-driven formulation, the inverse problem is defined as $m = g^{-1}(d)$, where the complex function $g^{-1}(\cdot)$ is learned from the pairs of $(M, D)$. Supposing there exist functions $f_m(\cdot)$ and $f_d(\cdot)$ that respectively reduce the dimension of $m$ and $d$ to $z_m \in Z_m$ and $z_d \in Z_d$, the data-driven formulation of the inverse mapping in low-dimensional latent spaces is now defined as

$\hat{z}_m = g^{-1}_{\omega}(z_d)$   (1.6)

where backtransformation to the high-dimensional spaces is achieved with $f^{-1}_m(\cdot)$ and $f^{-1}_d(\cdot)$. The trainable weights $\omega$ represent sufficient combined neurons to approximate the complex function $g^{-1}(\cdot)$. The objective function for the data-driven inversion in the latent space is now defined as

$L(\omega) = \sum^{N} \| g^{-1}_{\omega}(z_d) - z_m \|_2^2$.   (1.7)

1.3.2 Production Prediction Formulation

Long-term production prediction is a complex task that involves forecasting future production based on historical data. This can be treated as an auto-regressive problem, where past sequences are used to predict future sequences. In this approach, the model takes as input a sequence of historical production data and then generates a sequence of future production values, one step at a time. At each step, the model uses the previously predicted value as input to predict the next value, effectively capturing the temporal dependencies in the data. Given historical production data from a set of wells as single-phase (univariate) or multiphase (multivariate) time-series data, we define $N_f$ as a parameter that denotes the dimension of the time-series data, where $N_f = 1$ (univariate) and $N_f > 1$ (multivariate); the extension of the multivariate formulation from the univariate formulation is mathematically straightforward. The problem of production forecasting can be formulated as $Y = f(X)$, where $f(\cdot)$ is a forecast function (i.e., model) that takes an input tuple $X$ to output $Y$.
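Before continuing with the production prediction formulation, the latent-space inversion of Equations (1.6)-(1.7) can be illustrated with a minimal sketch (assuming PyTorch); the latent pairs are assumed to come from separately trained encoders f_m and f_d (not shown), and the dimensions are illustrative.

```python
import torch
from torch import nn

# Latent-space inversion (Eq. 1.6-1.7): a regression network g_inv maps the data
# latent variables z_d to the model latent variables z_m.
z_d = torch.randn(500, 16)        # latent production responses for 500 realizations
z_m = torch.randn(500, 8)         # corresponding latent model representations

g_inv = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
optimizer = torch.optim.Adam(g_inv.parameters(), lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    loss = ((g_inv(z_d) - z_m) ** 2).sum(dim=1).mean()   # objective of Eq. (1.7)
    loss.backward()
    optimizer.step()

# At prediction time, observed data would be encoded with f_d and the predicted
# latent model decoded with the inverse of f_m:  m_hat = f_m_inverse(g_inv(f_d(d_obs)))
```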
In the production forecasting formulation, the time-series data for any given well can be treated as segments, where each segment is composed of a pair of a lag window $y_w$ and a forecast window $y_{w+1}$, as illustrated in Figure 1.6. In Figure 1.6, for each fluid phase (e.g., the oil, water and gas phases), the size of the lag window $N_l$ and the forecast window $N_s$ dictate the number of piecewise segments that can be obtained. Within each segment, the forecast window $y_{w+1}$ (univariate or multivariate) has a corresponding control trajectory denoted as $u_{w+1}$. When there are multiple wells in a dataset (i.e., producers in a region), each with its own production behavior, we denote each vector of well properties (e.g., formation, fluid and completion parameters) as $p$, which is used to tag the segments. Therefore, each well results in tuples of $u_{w+1}$, $p$, $y_w$ and $y_{w+1}$, where the input tuple $X$ for the formulation consists of $X = \{p, y_w, u_{w+1}\}$ and the output $Y$ consists of the corresponding $Y = \{y_{w+1}\}$.

Figure 1.6: Windows of multivariate time-series data for a well.

The forecast model $f(\cdot)$ is tasked with learning (i) the temporal trends in the time-series data, (ii) the mapping between the temporal trends and well properties, and (iii) the mapping between the temporal trends and the control trajectory. With a trained $f(\cdot)$, for any given test input tuple (representing a newly drilled well with limited observed initial production responses), a forecast window is obtained by computing $\{y_{w+1}\} = f(\{p, y_w, u_{w+1}\})$. For the newly drilled well, a recursive computation for long-term production forecasting (representing successive forecast windows) can be expressed as $f(\{p, f(\{p, f(\{p, y_w, u_{w+1}\}), u_{w+2}\}), u_{w+3}\})\dots$. In the context of long-term production prediction, the auto-regressive formulation provides a powerful approach for accurately forecasting future production values based on historical data. Various deep learning models such as Fully-Connected Neural Networks (FCNNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs) can be used as the building blocks to create the forecast model. Additionally, physics-based models can also be embedded into the forecast model to exploit the properties of both a physics-based model and a data-driven model as a hybrid predictive model.

1.4 Motivations for Deep Learning Latent Space Representations

1.4.1 Handling Uncertainty in Prior Geologic Scenarios

Many of the promising history matching methods that have been developed through the years [184] made an underlying assumption that the conceptual geologic model is known and is used to generate realizations with alternative spatial distributions of the corresponding geologic features (spatial patterns). However, since each scenario is characterized by distinct geologic features that define the connectivity in the reservoir [194, 255], disregarding the uncertainty and subjectivity in the geologic scenario can result in significant underestimation of the uncertainty and can introduce bias in predicting the flow performance [190]. As an example, a meandering fluvial environment tends to have sinuous curves and point bars, while an anastomosing fluvial environment will have straight channels meeting at acute angles. The resulting models based on these two conceptual environments can lead to very different fluid flow and displacement patterns that play a critical role in planning infill drilling and overall field development strategies.
To enable the feedback from dynamic data to rank conceptual geologic models, novel data integration methods that do not assume a single geologic scenario are needed. A common approach is to adopt the Popper-Bayes philosophy that observations can only be used to falsify, and not deduce, models or theories [243]. In the context of subsurface flow models, this approach leads to falsification of geologic scenarios that are not supported by additional data. While dynamic flow response data are known to contain low-resolution (aggregate) information that is sensitive only to large-scale features in the reservoir, they can provide important information to support or reject some of the proposed geologic scenarios. Two classes of methods have been proposed in the literature to implement model calibration under uncertainty in the geologic scenario. The first class performs scenario falsification before model calibration [50, 88, 104, 187, 191, 220]. The second class of methods performs scenario falsification after (and in some methods, simultaneously with) model calibration [25, 73, 102, 110, 112].

In the first class of methods, some authors have used kernel density estimation on simulated data (from realizations of multiple scenarios), where the distance of the observed data to the distributions of labeled data is used to infer its likelihood (an online method) [50, 88, 187, 191]. In Scheidt et al. [220], decomposition via the wavelet transform is used to analyze differences in patterns for measuring global similarity between simulated and observed data. The classification and regression tree (CART) method has also been utilized to find the probability of each outcome (e.g., water breakthrough trend) given uncertain geologic parameters [104]. Pirot et al. [191] compare multiple global spatial statistics of simulated data from multiple scenarios to the observed data to identify plausible scenarios. In [187] and [50], kernel density estimation (KDE) and support vector machines (SVM) are respectively utilized to determine relevant geologic scenarios before the posterior models are generated using the probability perturbation method (PPM) [27, 91], where the ratio of multi-scenario conditional models honors the relevance (weight) of each geologic scenario. These methods can be used to reduce the diversity of geologic features to improve the parameterization performance.

In the second class of methods, Brunetti et al. [25] perform scenario selection after posterior sampling and acknowledge that the method is computationally expensive for large models. In Khodabakhshi and Jafarpour [112], mixture modeling is combined with ensemble-based data assimilation to include multiple plausible scenarios (that are simultaneously screened) as prior knowledge for model calibration. In [73, 110, 111], inverse modeling with sparsity-promoting regularization techniques and sparse geologic dictionaries is used to identify relevant geologic scenarios and obtain inversion solutions. In [74], the approach to the geologic scenario selection problem is to perform parameterization within each geologic scenario and formulate the inverse problem such that the parameterized description for each geologic scenario is included with an associated weight. In this case, the geologic scenarios are combined as groups or classes of models with distinct geologic features.
In the probability conditioning method (PCM) [112], dynamic data is integrated using EnKF to construct a facies probability map as a soft input to MPS simulation for generating conditional models from multiple TIs. Simulated data from the generated model realizations are compared to the observed data to assess the relevance of each TI. The two classes of methods suggest that salient dynamical features in the observed data can be utilized to identify relevant prior geologic scenarios for robust model calibration. Deep learning methods can be employed to identify salient distinguishing features for handling the uncertainty in prior geologic scenarios in model calibration workflows.

1.4.2 Model Inversion

Latent space representations can be utilized for parameterization and calibration of complex high-dimensional flow models using machine-learning formulations. The primary objective of deep learning techniques here is to develop a direct nonlinear mapping from the data space to the model space to bypass classical inverse modeling formulations. Figure 1.7 illustrates the concept of model inversion with deep learning latent space representations, where a set of high-dimensional prior model realizations $M$ and their corresponding set of simulated flow responses $D$ (as training data pairs) are compactly represented in the model and data spaces (as $Z_m$ and $Z_d$, respectively) on a learned manifold. The forward model $g(\cdot)$ can be represented by a linear (e.g., travel-time tomography) or nonlinear flow model (e.g., multiphase flow, solute transport model). Direct inversion of data-driven models is a challenging task that requires the development of new neural network architectures capable of identifying not only salient spatial features but also features that are important for data integration. Deep learning techniques have shown promise in this area, as they can handle disparate data types and be trained with multiple objective functions, including dimension reduction and inversion.

Figure 1.7: Schematic of model inversion with deep learning latent space representations.

1.4.3 Proxy Model for Simulator

Recent advances in machine learning research have introduced new applications of deep learning architectures for improving data assimilation workflows in the subsurface domain, for learning complex nonlinear input-output mappings, and for parameterization of complex input parameters. Neural network architectures have been used as proxy models or surrogates to replace full-fidelity numerical simulators that can be computationally expensive, especially when the simulation models are large [92, 163]. Neural networks have been used as proxy models to predict well performance in water-flooded reservoirs [7, 274], to understand fluid flow at the pore scale [215], for well placement optimization [113], and to predict the performance of unconventional wells [146], among many others. Neural network proxy models provide several advantages over traditional proxy models based on polynomial chaos expansions [59], response surface methods [16] and support vector machines [80]. They are able to implicitly find and extract salient information in the training data without requiring pre-processing steps such as manual feature engineering that may introduce bias.

Some authors [30, 31, 127] have focused their efforts on applying new parameterization techniques to develop compact model representations that are more amenable to traditional ensemble methods.
For example, Canchumuni et al. [30] developed a convolutional VAE to represent the model realizations as Gaussian latent variables that are then calibrated with ES-MDA. Their approach involves a physical reservoir simulator to collect production responses from the set of assimilated realizations and can incur a high computational cost when many ES-MDA iterations are performed. Recent works from Canchumuni et al. [31] presented further developments in combining various deep-learning-based parameterization methods with ES-MDA for history matching.

Some authors [151, 156, 157, 236, 260, 276, 277] have focused on developing efficient proxy models to replace costly flow simulation runs. Zhu and Zabaras [276] proposed an end-to-end model-to-parameter surrogate model based on a deep convolutional encoder-decoder network and performed approximate Bayesian inference on millions of uncertain network parameters for uncertainty quantification. In their follow-up work, Zhu et al. [277] proposed a method to incorporate the governing equations of the flow model in the loss functions of proxy models without requiring labeled data (i.e., input-output pairs) and observed comparable performance to purely data-driven models. Mo et al. [157] applied the proxy model proposed in Zhu and Zabaras [276] to develop a surrogate model for a geological carbon storage process-based multiphase flow model. Mo et al. [156] proposed a deep autoregressive neural network-based surrogate method that is combined with an iterative local updating ensemble smoother algorithm for a small synthetic contaminant source identification problem. Wang et al. [260] compared the performance of the Theory-guided Neural Network (TgNN), which incorporates physical laws and other constraints, using various inversion methods, and of a TgNN constrained with geostatistical information (named TgNN-geo) that directly predicts inversion solutions. Recent works [151, 236, 260, 277] involving the incorporation of governing physical equations into proxy models are motivated by the high data requirement imposed by the need to run expensive numerical simulations for training data collection.

Some recent works [158, 240, 241] have also combined parameterization techniques with proxy models in data assimilation workflows. Mo et al. [158] developed a Convolutional Adversarial Autoencoder (CAAE) for parameterization and a Deep Residual Dense Convolutional Network (DRDCN) as a proxy model (that takes an ensemble of models generated by the CAAE as input) for integration with an iterative ensemble smoother. Tang et al. [240] developed a surrogate model based on deep convolutional and recurrent neural network architectures that predicts global pressure and saturation maps, which are then used for calculating well production rates in a data assimilation framework, and further tested the approach on a 3D channelized subsurface flow problem [241]. In Tang et al. [240] and Tang et al. [241], the geologic models are first parameterized using a CNN-PCA procedure that involves convolutional neural network post-processing of principal component analysis [144]. Jin et al. [106] proposed a deep-learning-based Reduced-Order Modeling (ROM) control optimization framework consisting of an autoencoder for dimensionality reduction and a linear transition model that approximates the evolution of the system states in low dimension, and tested the framework on 2D oil-water reservoir simulation problems.
Some authors [141, 147, 231] have focused on developing parameterization methods for observations to increase computational efficiency (i.e., by reducing space and time complexities). Luo et al. [147] proposed an ensemble 4D-seismic history-matching framework that adopts wavelet-based sparse representation for dimensionality reduction of the observations and reported that adopting a sparse representation of the observations results in improved history-matching performance. Soares et al. [231] applied Dictionary Learning as a sparse representation method in 4D-seismic history matching to reduce storage and computational cost and observed improved results when compared to conditioning on the full dataset projected on a smaller subspace. Liu and Grana [141] developed a deep convolutional autoencoder to sparsely represent seismic data in a time-lapse seismic history matching framework and observed that prior models can be accurately calibrated on production data and parameterized time-lapse seismic data.

In a somewhat related domain, Amendola et al. [6] utilized a single autoencoder that transforms indoor air quality measurements from sensors and simulated measurements made by Computational Fluid Dynamics (CFD) software into a latent space and employed a Long Short-Term Memory (LSTM) network to train a function that emulates the dynamic system in the latent space. Peyron et al. [188] proposed an algorithm called the Ensemble Transform Kalman Filter with model error in the latent space (ETKF-Q-L), where a single autoencoder is used for dimensionality reduction and a neural network is used as a surrogate model, and tested it on a tailored instructional version of the Lorenz 96 equations, named the augmented Lorenz 96 system. Peyron et al. [188] observed that training both networks together gives better results than training them subsequently. Applications of data assimilation in the latent space for subsurface flow problems require customized neural network architectures to deal with very high-dimensional disparate data (e.g., model parameters and observational data). The findings from these recent works suggest that much remains to be done and further investigations are necessary to develop practical deep learning proxy models and parameterization techniques for improving data assimilation workflows in subsurface flow problems.

1.4.4 Uncertainty Quantification

Uncertainty quantification in subsurface flow problems is performed to obtain forecasts of fluid production or contaminant transport. Since the prior geologic models contain a large degree of uncertainty (from the static data used to build the flow models) and cannot be reliably used for forecasting, they first need to be calibrated or conditioned to the available observed data. The process of obtaining calibrated models for uncertainty quantification can be largely divided into four groups: forward sampling methods, optimization-based methods, ensemble-based methods and data-driven methods. While closed-form solutions exist for linear-Gaussian problems, we focus on practical problems concerning nonlinear forward models and non-Gaussian reservoir descriptions where the linear-Gaussian assumption typically does not hold, thus making uncertainty quantification challenging.
Forward sampling procedures such as Markov chain Monte Carlo (MCMC) [142] and rejection sampling [170], as exact sampling methods for nonlinear and non-Gaussian cases, can maintain the geologic consistency of the calibrated models (i.e., posteriors) by preserving the prior statistics. However, they are computationally intractable for most practical problems involving high-dimensional geologic models and a large set of priors with a high degree of uncertainty. In ensemble-based methods, the ensemble of posteriors empirically represents the pdf of the assimilated parameters and can provide robust uncertainty quantification for linear-Gaussian cases. For most practical problems involving channelized reservoirs and nonlinear flow models, the quality of posteriors obtained from ensemble-based methods is not adequate for reliable forecasting [55]. Optimization-based methods such as randomized maximum likelihood (RML) [119] and the Gaussian mixture model [69] can similarly underperform for nonlinear problems. Sun and Durlofsky [237] proposed a data-driven method, the Data-Space Inversion (DSI) procedure, to perform data conditioning and uncertainty quantification without solving a classical inverse modeling problem. DSI relates past production trends to future production behavior in a Bayesian framework to capture the uncertainty in forecasts without the need to sample posterior models. An extension of DSI [105] uses a recurrent autoencoder for dimensionality reduction of production responses and ES-MDA for posterior data sample generation. DSI and related procedures are data-based techniques that differ from model-based techniques as they do not provide posterior models, but rather only posterior predictions of quantities of interest (i.e., time series). In field development and management, however, the availability of posterior models can provide an understanding of spatial (geologic) uncertainty that may be useful in optimizing water-flooding strategies and for well placement.

Uncertainty quantification in subsurface systems can be improved by using deep learning latent space representations that are more amenable to existing automated history matching algorithms. For example, deep learning parameterization methods can be utilized to create latent space representations that can be combined with classical ensemble-based methods to obtain a more reliable set of posteriors. Additionally, conditional deep learning parameterization methods can be used as a data-driven method to create a set of calibrated models (i.e., posteriors).

1.4.5 Long-Term Production Prediction

Deep learning models that learn dynamical latent space representations offer a powerful technique for analyzing complex and high-dimensional data, making them particularly useful for accurate long-term production forecasting in unconventional reservoirs. By capturing the underlying temporal dynamics and nonlinear relationships in the data, these models can provide a more robust and reliable forecast compared to traditional methods. Moreover, the use of deep learning models in this context allows for the incorporation of additional data sources, such as geological and engineering data, to further enhance the accuracy of the forecast. Overall, the ability of these models to extract complex patterns and trends from the data makes them an invaluable tool for reservoir engineers and operators seeking to optimize production in unconventional reservoirs.
Recent works in petroleum engineering involving production forecasting have utilized variants of the RNN. Wiewel et al. [264] utilize a convolutional autoencoder to perform dimensionality reduction of pressure fields (i.e., Navier-Stokes problems) to latent variables and proceed to model the evolution of the encoded pressure fields over time using a sequence-to-sequence LSTM model. The authors obtain a significant speed-up for modeling the temporal evolution of high-dimensional systems when compared to numerically solving the physical functions and require 100 training timesteps to predict 4 to 14 steps forward. Lee et al. [132] develop an LSTM forecast model for shale-gas production prediction that accepts a shut-in (SI) control indicator as an additional input feature for production optimization. In this work, the authors use training wells with approximately 20 to 70 months of production history and omit wells with a short (i.e., less than 6 months) production history. Sagheer and Kotb [212] model the dependency between past and future values using a deep LSTM network on a univariate dataset (i.e., oil-phase production only). The authors claim that while shallow LSTM networks can learn long-range dependencies, they fare poorly for highly nonlinear and multivariate time-series. Further, the authors use between 70-80% of the production history to train the forecast model to obtain reliable forecasts. Hector and Horacio [85] propose learning the dynamic past-future trends in unconventional wells using extended dynamic mode decomposition (EDMD). In their work, the eigenfunctions for the Koopman operator are learned using a deep convolutional autoencoder and separate EDMD models are built for each production phase. In one of their examples, the authors utilize 18 months of flow dynamics to predict 6 months forward. Liu et al. [143] combine ensemble empirical mode decomposition (EEMD) bases with LSTM networks to construct multiple intrinsic mode functions (IMF) for oil production forecasting and use 80% of the available production history to train their LSTM networks. Bai and Tahmasebi [17] develop a deep LSTM model to forecast oil and water production in a small water-flooded conventional reservoir, using past information on calculated water saturation, resistivity, and oil and water rates. In their examples, 90% of the generated production responses are used to train the deep LSTM model. Al-Shabandar et al. [4] compare multiple RNN architectures for production forecasting and suggest using a stacked Gated Recurrent Unit (GRU) as opposed to LSTM, as it has a smaller memory footprint. The authors use 40% of the available time-series data for training the stacked GRU model.

The findings from these recent works suggest that much remains to be done and further investigations are necessary to develop a practical RNN forecast model. A common drawback in these works is the high data requirement for training RNN models (where a sufficient volume of past data is required for training, in terms of both the number of wells and the length of the historical production period) that can be alleviated through transfer learning. This is done by transferring the knowledge of the mapping between well properties and production dynamics learned from other localities.
As the model is trained with full historical production data across multiple flow regimes, control settings and the corresponding well properties from multiple shale plays or regions, it has improved generalization capability to provide robust long-term forecasts to guide field development in a target region with very limited data (i.e., a small number of wells with short production histories of 3-6 months). The transfer of knowledge across tasks can be achieved using compact, dynamic latent space representations. These representations capture the underlying structure of the data and can be used as input to models trained on related tasks, resulting in improved performance, reduced training time, and the ability to learn from limited data.

1.4.6 Physics-Guided Production Prediction

Robust production forecasting can help guide well operations, intervention plans, and field development. For conventional hydrocarbon fields, reservoir simulators founded on sound mathematical models describing the flow physics in porous media [14] can be used in history-matching workflows to facilitate production forecasting. The predominant source of uncertainty in conventional resources is the geologic reservoir description, which is typically represented as prior geologic models that need to be calibrated with observed production responses [163]. For unconventional resources, however, the flow physics in hydraulically fractured low-permeability formations is not well understood, thus posing a further challenge in applying history matching and optimization workflows. Additionally, unlike conventional reservoirs, the complexity arises from the wide range of formation and completion parameters, imperfect descriptions of natural fracture networks and hydraulic fractures, and intricate time-varying geomechanical interactions that affect the production behavior.

To improve the accuracy of production prediction in unconventional reservoirs, researchers have proposed the use of hybrid predictive models. These models combine the strengths of both physics-based and data-driven approaches to provide more accurate and reliable predictions. By incorporating physics-based models into data-driven models, the resulting hybrid models can capture the complex behavior of unconventional reservoirs and make accurate predictions even in the presence of incomplete or noisy data. The development of hybrid predictive models has the potential to revolutionize production prediction in unconventional reservoirs, leading to more efficient and cost-effective extraction of resources. The advantages of combining data-driven and physics-based models are (i) the input-output physical functions can be used to augment the available data when necessary to enhance the extrapolation power of data-driven methods, (ii) the output can be constrained to adhere to the general production trends, and (iii) the general applicability of predictive models can be expanded for when the training dataset is small (i.e., reduced model variance).

In this section, we briefly present relevant works on building fast neural network predictive models that attempt to incorporate physical information and constraints in their training process. The primary objective of building these predictive models is to improve the generalization ability of data-driven methods by leveraging domain knowledge for physically-consistent results.
Specifically, when working with spatiotemporal data, the resulting predictions should be spatially coherent and temporally smooth, and these properties are not guaranteed when using purely data-driven models. Combining machine learning and physical modeling can result in predictive models that are more interpretable. For more comprehensive reviews, we refer readers to [23] and [108] for general scientific applications, [265] for applications in environmental systems and [140] for specific applications in petroleum engineering.

These hybrid predictive models come in many flavors. In their most rudimentary form, some researchers [166, 202, 224, 235] used simulated training datasets from a physically consistent system or expert knowledge as priors to initialize learning models before retraining with field datasets. For example, [224] impute missing values in trait-trait correlation matrices in biodiversity and ecological research with physically consistent values to incorporate a current scientific understanding of the domain. In [166], matrices representing lumped connectivity information between injectors and producers are initialized with prior geologic knowledge, where each element in the matrix dictates the existence and strength of connectivity between any injector-producer pair. The matrices are then continuously recalibrated with dynamic data as they become available. Simple matrices cannot represent complex problems with many features as they require significant annotation effort by domain experts. Additionally, complex relationships within a dataset may not be sufficiently represented by covariance matrices and may require learning models with higher degrees of freedom.

While neural network models can approximate the input-output response of very complex systems [90], such statistical models with high learning capacity may require many data points that may be unavailable. Moreover, data scarcity can potentially lead to learning spurious correlations that will significantly affect prediction quality. To make the training process more well-behaved, some authors [202, 270, 277] have infused physical understanding into neural network models by including additional loss terms and regularization terms in the objective function. For example, [270] proposes a regularized proxy model by imposing a sparsity-promoting constraint on the trainable weight matrix to avoid learning spurious correlations. This approach effectively combines the implicit physical information from simulated data with reservoir engineering insight to identify inter-well connectivity and to predict well production trends. In addition to employing model pretraining to initialize their network with synthetic data, [202] also introduces a penalty for energy conservation violations as a regularization term in the objective function. [277] incorporates the governing equations of the physical model in the objective function of a physics-constrained surrogate model.

Some authors [29, 83, 120, 125, 197, 230, 278] use data-driven methods (e.g., neural networks) to explicitly perform physical computations to solve, or rather approximate solutions of, partial differential equations (PDEs). For example, [125] presents a method to solve initial and boundary value problems using an artificial feedforward neural network that learns parameters to satisfy the differential equations.
[83] propose a deep learning-based approach to solve nonlinear PDEs in very high dimensions by using subnetworks to calculate components of the backward stochastic dif- ferential equations for parabolic PDEs. [230] employs a mesh-free algorithm using a deep neural network trained to satisfy the differential operator, initial condition, and boundary conditions for tractable solutions for very high-dimensional PDEs. [197] propose the Physics-Informed Neural Networks (PINN) as a physics-informed surrogate model where neural networks are trained to solve supervised learning tasks (using simulated data or observations) while respecting the PDEs as the prior knowledge. [278] uses physics-informed surrogate models to solve conductive heat transfer partial differential equation (PDE). The convective heat transfer PDEs are used as bound- ary conditions (BCs) where the loss function is defined based on errors to satisfy the PDE, BCs, and initial condition. 36 Some authors [26, 82, 137, 139, 258, 259] incorporate domain knowledge by coupling physical functions and neural networks. In the loose-coupling method from [258] and [137], DCA input parameters are estimated with available observed production data first. Then, the machine learning models (e.g., support vector machines, random forest, and neural networks) are used to learn the mapping from well properties to the learned DCA input parameters. [82] use neural networks to predict the cumulative production at various production times and use DCA to connect all predicted points in time. [259] uses a deep autoencoder to learn the temporal correlations between system states and then subsequently uses the latent temporal variables as input into physical functions representing power flow equations. [26] explicitly coupled physical function and neural network by introducing multi-segment Arp’s equation with piecewise constant parameters (for transient regimes) to automate DCA calculation using the PINN approach. When the physical functions or governing equations are unknown, such coupling methods can also be used to discover the physics [22, 37, 126, 197]. The methods are not intended to replace first-principle derivations, nor are they used to explain the physical causality between variables. However, they are used to help identify patterns and falsify proposed theories. Early methods, such as in [22], apply symbolic regression to determine underlying dynamical systems. As a more recent example, [37] learn the coefficients for a library of terms representing the governing equation of a dynamical system using sparse regression method to discover the PDEs. Similarly, the PINN method proposed by [197] can also be utilized to discover hidden physics from predetermined (based on domain knowledge) term candidates for the form of the differential equation. From the works presented in this section, it is evident that more investigations are called for, not only in the scientific and engineering communities but also in the industry, to develop practical and fit-for-purpose predictive models incorporating physical understanding. Deep learning models can learn latent space representations from data that can be fused with first principle derivations to constitute a hybrid predictive model for robust long-term production forecasting. 
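As a concrete illustration of the loose-coupling idea discussed earlier in this section (decline-curve parameters fitted to observed rates, then a learning model mapping well properties to those parameters), a short sketch is given below. The use of SciPy and tf.keras, the parameter bounds, and the helper names are illustrative assumptions, not the implementations of the cited works.

```python
# Hedged sketch of loose coupling: fit Arps parameters per well, then learn a
# mapping from well properties (completion/formation descriptors) to them.
import numpy as np
from scipy.optimize import curve_fit
from tensorflow import keras

def arps(t, qi, Di, b):
    return qi / np.power(1.0 + b * Di * t, 1.0 / b)

def fit_dca(t, q):
    """Step 1: estimate (qi, Di, b) from observed rates of one well."""
    p0 = (q[0], 0.1, 1.0)
    bounds = ([1e-3, 1e-4, 0.1], [1e5, 5.0, 2.0])      # assumed physical ranges
    popt, _ = curve_fit(arps, t, q, p0=p0, bounds=bounds)
    return popt

def build_param_regressor(n_features):
    """Step 2: regress the fitted DCA parameters on well properties."""
    model = keras.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(3),                          # predicts (qi, Di, b)
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Forecast for a new well: predicted parameters pushed through the physics function,
# e.g.  q_forecast = arps(t_future, *param_model.predict(x_new[None, :])[0])
```

The physics function carries the long-term extrapolation while the learned mapping carries the dependence on well properties, which is what makes this class of couplings attractive when long production histories are unavailable.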
37 1.5 Scope of Work and Dissertation Outline In this work, for conventional reservoirs, we focus on the following important challenges in en- abling automatic feature-based model calibration workflows for robust uncertainty quantification in subsurface flow problems, (i) handling of uncertain prior geologic scenarios, (ii) the effective- ness of parameterization, and (iii) the performance of data conditioning technique. In general, the three challenges are intricately linked in resulting a more feasible set of calibrated models where an improvement in at least any one factor out of the three will yield an overall improvement in uncertainty quantification. Classical automatic pixel-based model calibration workflows tend to provide biased estimates of uncertainty quantification for complex practical problems. Existing feature-based model calibration workflows using classical dimensionality reduction (e.g., PCA) and data integration (e.g., least-square solution and covariance-based methods) techniques tend to underperform when complex spatial features are present, especially under uncertain prior geologic scenarios. Motivated by the recent advances in deep learning, one of the main objectives of this work is to utilize deep learning latent space representations to address the three aforementioned challenges. We view automatic feature-based model calibration workflows as consisting of three subsequent or simultaneous steps: (i) prior geologic scenario selection, (ii) parameterization of model and data spaces, and (iii) data conditioning, where the final outcome of the workflows is the set of calibrated models for uncertainty quantification. Additionally, another important topic that we cover in this work is the development of new data-driven inversion techniques using deep learning latent space representations, where we leverage the capability of deep learning to simultaneously solve the subsequent steps in automatic feature-based model calibration workflows. The performance of the developed approaches is measured in terms of their ability to generate valid and statistically consistent calibrated models for uncertainty quantification. In Chapter 2, we focus on the development of a two-stage feature-based data-driven model calibration workflow under uncertain geologic scenarios to simultaneously address the three afore- mentioned challenges. We utilize deep convolutional neural networks (CNN) to extract salient 38 dynamical features from production responses to first identify relevant geologic scenarios. Using a cross-entropy loss function with prior scenario labels, the CNN identify distinguishing temporal features for each scenario such that the likelihood of each scenario can be predicted given observed field data. In the second step, the selected scenarios are used in another CNN to learn a direct inverse mapping from the dynamical features to feature-based PCA representations of the high- dimensional geologic models. The second CNN is trained with regression loss function to predict the corresponding PCA coefficients for given observed field data, thereby bypassing classical in- verse modeling formulations. The work in this chapter has been presented at the 81st European Association of Geoscientists and Engineers (EAGE) Conference and Exhibition 2019 conference [162] and published in the Computational Geosciences journal [163]. In Chapter 3, we expand on the findings from Chapter 2 and developed a systematic data-driven inverse mapping framework called Latent Space Inversion (LSI). 
In this work, the focus is given to addressing the challenges of parameterization of model and data spaces, and data conditioning where we propose LSI as a new data-informed direct inversion and parameterization framework where the dimensionality reduction is tailored to the governing flow physics. We utilize deep convolutional autoencoders that are superior to PCA (as used in Chapter 2) for dimensionality reduction to jointly extract spatial geologic features in subsurface models and temporal trends in flow data, for the creation of deep learning latent space representations before learning an inverse mapping in the compact latent spaces. We show the benefit of coupled training (versus decoupled training) where the process of dimensionality reduction is combined with learning the inverse map- ping. In this work, we explore the meaningful latent spaces to generate an ensemble of calibrated model realizations around the inversion solution. The work in this chapter has been presented at the American Geophysical Union (AGU) Fall Meeting 2020 conference [165] and published in the Computational Geosciences journal [171]. In Chapter 4, the coupled LSI framework is modified to become Latent Space Data Assimi- lation (LSDA) framework, to perform simultaneous parameterization of model and data to enable the forward mapping between the model and data latent spaces, serving as a latent space proxy 39 model. In this work, we focus on addressing the challenges of parameterization of model and data spaces, and improving existing ensemble covariance-based data conditioning technique for statis- tically consistent feature-based updates. To make the latent spaces amenable to ensemble-based technique, the model and data latent spaces are constrained to conform to a Gaussian distribu- tion by embedding a variational regularization term in the loss function, resulting in an efficient reduced-order implementation of ensemble data assimilation. The work in this chapter has been presented at the American Geophysical Union (AGU) Fall Meeting 2020 conference [98], the So- ciety of Petroleum Engineers Reservoir Simulation Conference (SPE RSC) 2021 conference [170], and published in the Society of Petroleum Engineers (SPE) Journal journal [170]. In Chapter 5, we address the three aforementioned challenges simultaneously, similar to Chap- ter 2, however in this work we utilize deep convolutional conditional generative adversarial net- works for simultaneous low-dimensional parameterization and data label conditioning under un- certain geologic scenarios (i.e., presence of diverse spatial features). The major difference between the work in this chapter and the preceding chapters is the use of adversarial loss function (versus regression loss function) for training the neural network, where adversarial loss has been shown to result in better reconstructions. We perform nonlinear data label conditioning to create a deep learning latent space representation that honor the dynamical variations observed in the data, that allows the sampling of realistic calibrated models for uncertainty quantification. The work in this chapter has been presented at the American Geophysical Union (AGU) Fall Meeting 2019 confer- ence [161], the European Conference on the Mathematics of Oil Recovery (ECMOR XVII) 2020 conference [164], and published in the Computational Geosciences journal [167]. 
For unconventional reservoirs, where a dependable simulator that can accurately capture flow behavior is lacking, we concentrate on the following crucial obstacles for achieving reliable long- term production forecasting: (i) developing dynamic latent space representations that effectively capture significant production trends for data-driven production forecasting, (ii) overcoming the 40 challenge of training data requirements, and (iii) utilizing these dynamic latent space represen- tations to enhance current simulators that rely on imperfect modeling assumptions for hybrid physics-guided production forecasting. In Chapter 6, we address the first two challenges by developing a recurrent neural network model that takes completion parameters, formation and fluid properties, operating controls, and early (i.e., 3-6 months) production response data, and extract their salient dynamical features and the interaction between those features, to predict future production response. To overcome the issue of high training data requirements, the model is trained on a collection of historical produc- tion data across multiple flow regimes, control settings and the corresponding well properties from multiple shale plays. Unlike other applications of recurrent neural network that require a long his- tory of production data for training, the developed model employs transfer learning by combining early production data from the target well with the long-term dynamics captured from historical production data in other wells. The work in this chapter has been presented at the Unconventional Resources Technology Conference (URTeC) 2021 conference [168] and published in the Society of Petroleum Engineers Journal (SPEJ) [172]. In Chapter 7, we address the three challenges simultaneously, by developing a physics-guided deep learning predictive model for long-term production forecasting. The developed approach directly embeds physical functions as prior knowledge of physical dynamics into the neural net- work as custom computation layers. Algorithmic implementation details for statistical and explicit methods to embed physical flow functions into neural networks are presented with comparative performance analysis. Additionally, residual learning integration with physics-constrained neural networks is presented to further compensate for any imperfection in the embedded physics-based model. The developed hybrid predictive model is designed to be modular and can work with a wide variety of physics-based models. It combines the advantages of both physics-based and data- driven methods to provide accurate predictions. The work in this chapter has been presented at the Unconventional Resources Technology Conference (URTeC) 2022 conference [173], the SPE 2021 41 Middle East Oil and Gas Show and Conference [169] and published in the Society of Petroleum Engineers Journal (SPEJ) [174]. 42 Chapter 2 Feature-based Model Calibration under Uncertain Geologic Scenarios In this chapter, we present convolutional neural network architectures for integration of dynamic flow response data to reduce the uncertainty in geologic scenarios and calibrate subsurface flow models. We introduce a machine learning-based workflow to perform model calibration under un- certainty in the geologic scenario. The workflow consists of two steps, where in the first step the solution search space is reduced by eliminating unlikely geologic scenarios using distinguishing salient flow data trends. 
The first step serves as a pre-screening to remove unsupported scenarios from the full model calibration process in the second step. For this purpose, a convolutional neural network (CNN) with a cross-entropy loss function is designed to act as a classifier in predicting the likelihood of each scenario based on the observed flow responses. In the second step, the selected geologic scenarios are used in another CNN with anℓ 2 -loss function (as a regression model) to per- form model calibration. The regression CNN model (Step 2) learns the inverse mapping from the production data space to the low-rank (i.e., feature-based) representation of the model realizations within the feasible set. This approach is illustrated in Figure 2.1 where given K possible geologic scenarios, each scenario is used to simulate (N) realizations of the reservoir model properties (e.g., permeability distribution) to represent within-scenario variabilities. The resulting realizations summarize the geologic features that exist in each geologic scenario and can be used to generate the corresponding 43 dynamic flow responses (as training data labels). The presented approach consists of two steps, where the first step is used to reduce the number of geologic scenarios and the second step performs model calibration based on the selected scenarios in Step 1. The main motivation behind geologic scenario selection is to prune the search space prior to performing detailed model calibration. For the first step, a convolutional neural network (CNN) [129] is trained as a classifier for screening geologic scenarios based on their global flow responses. The classification CNN is trained using a cross-entropy loss function to estimate the likelihood that a given historical data (d obs ) belongs to any of the possible geologic scenarios. The resulting likelihoods are used to prepare the realizations that are used in Step 2, to train another CNN that learns the inverse mapping between production data and model realizations. In Step 2 of our workflow, a CNN architecture with an ℓ 2 -loss function is used as a regres- sion model. The CNN learns non-linear low-dimensional manifolds between realizations and the associated production data. The hypothesis is that the probability distribution of data is highly concentrated along the manifold [180]. Each realization (within the reduced space of relevant sce- narios) and corresponding production data is connected to each other and can be reached along the manifold. Under this hypothesis, if the data is sampled well enough to cover the structure of the learned manifold, then its regression property can be exploited to predict the realization for a given historical production data. Furthermore, CNN can be viewed as a non-linear generalization of the standard PCA [76, 223]. In this case, by combining the model realizations with their simulated response data, CNN is trained to learn the correspondence between flow response data and the resulting geologic realizations. Once trained off-line, the resulting model can be used in real-time to provide a geologic realization that corresponds to a given observed data. The presented approach offers an opportunity to utilize flow data in identifying plausible geo- logic scenarios, results in an off-line implementation that is conveniently parallellizable, and can generate calibrated models in real-time, i.e., upon availability of data and without in-depth techni- cal expertise about model calibration. 
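The overall flow of the two-step procedure can be summarized in a short orchestration sketch. The function and variable names below are placeholders (any scikit-learn-style wrappers around the two CNNs would do); they are not routines from this dissertation.

```python
# Pseudocode-level sketch of the two-step workflow in Figure 2.1 / Table 2.1.
import numpy as np

def two_step_calibration(D, M, S, d_obs, classifier, regressor, n_subset=500):
    """D: simulated data, M: realizations, S: scenario labels, d_obs: field data."""
    # Step 1: scenario screening with a classification CNN (cross-entropy loss).
    classifier.fit(D, S)
    g = classifier.predict_proba(d_obs[None, ...])[0]        # scenario likelihoods

    # Keep realizations from each scenario in proportion to its likelihood.
    idx = []
    for k, p_k in enumerate(g):
        members = np.flatnonzero(S == k)
        take = int(round(p_k * n_subset))
        idx.extend(np.random.choice(members, size=min(take, len(members)),
                                    replace=False))
    D_red, M_red = D[idx], M[idx]

    # Step 2: regression CNN learns the inverse map from data to model features.
    regressor.fit(D_red, M_red)
    return regressor.predict(d_obs[None, ...])[0]             # calibrated realization
```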
Several synthetic Gaussian and non-Gaussian examples are used to evaluate the performance of the method. 44 Figure 2.1: Proposed two-step workflow that predicts a model realization for any given historical data d obs . 45 2.1 Convolutional Neural Networks (CNN) for Feature Extraction The application of CNN to the problem of handling uncertain prior geologic scenarios is motivated by its superior pattern-learning performance and its convenient adaptability to scenario selection and inverse mapping, simply by modifying the loss function that allows the convolutional filter to learn salient information (i.e., feature extraction). For scenario falsification, this is an effi- cient direct approach that does not require pre-processing step such as multi-dimensional scaling (MDS) and applying kernel density estimation on the metric space to approximate the probabil- ity of each scenario. Multiple non-linear convolutional filters automatically extract distinguishing salient trends between each scenario from the dynamic data through the cross-entropy loss function without any need for a pre-defined distance metric. The probability of each scenario is readily given by the trained model and information loss is minimized as the trained model can be tuned to prevent over-fitting. For model calibration, CNN allows direct inversion by extracting salient production data features and learning the complex manifold between these features and geologic model realizations. During training, CNN implicitly learns the existing trends in Gaussian and non-Gaussian models without the need to specify any pre-defined probability density function (pdf) of the realizations (whether two-point or multi-point statistics). CNN is a convenient and versatile tool that does not require the intervention of domain experts. In this work, we assume that production data contains information that can falsify irrele- vant scenarios and there is sufficient support between the production data and model realizations. The power of CNN for pattern recognition and classification with complex datasets is well established in the literature. In this work, we take advantage of this strength of CNN to classify ge- ologic scenarios based on their dynamic flow responses (classification with CNN) and to associate the flow responses with the spatial distribution of important rock flow properties (through a regres- sion CNN). Details about the CNN implementations are provided in the Appendix. In this work, the training is performed using the Adam stochastic optimization method [115] combined with the dropout regularization method [233] on the dense layer (to prevent overfitting) with dropout rate of 0.2. 46 2.1.1 Classification Model In this work, the focus is on using production response data as input to either identify the geologic scenario as output (classification model, D(·)) or the corresponding geologic realization as output (regression model,H (·)). Figure 2.2 (Top) shows the actual architecture used for both models (Gaussian and non-Gaussian 2D examples). Each box represents the output of a mathematical operation on the preceding box. In this section, we explain each of these operations. To keep our notations general for any depth of CNN architecture, we consider for the l− th layer with input x l ∈R H l × W l × D l as a 3-dimensional tensor where for the first convolution layer ( l = 1), H 1 is the production time interval and W 1 is the features considered (i.e. water-cut, oil production rate and bottom-hole pressure, see Section 2.3.1). 
As the depth of the input is 1 (D 1 = 1), the input tensor is reduced to a matrix. In our implementation, the input to the architecture is provided in mini- batches of size b. Hence, the input to the first convolutional layer becomes a 4-dimensional tensor of x 1 ∈R H 1 × W 1 × D 1 × b . For simplicity, we assume that b= 1. With this definition, any element in x l has indices (i l , j l ,d l ) such that 0⩽ i l < H l , 0⩽ j l < W l , 0⩽ d l < D l . The output of layer l is denoted as x l+1 and indices(i l+1 , j l+1 ,d l+1 ) such that 0⩽ i l+1 < H l+1 , 0⩽ j l+1 < W l+1 , 0⩽ d l+1 < D l+1 , where H l+1 × W l+1 × D l+1 , are its dimensions. The first convolution layer has D kernels each of size H× W× D 1 . A collection of kernels are denoted as filters of f H× W× D 1 × D and each element is accessed with(i, j,d l ,d) such that 0⩽ i< H, 0⩽ j< W, 0⩽ d l < D l , 0⩽ d< D. For clarity, we omit any bias term and assume a filter of stride 1 with no padding, the convolution operation for all d is mathematically defined as: a i l+1 , j l+1 ,d = ∑ H i=0 ∑ W j=0 ∑ D l d l =0 f i, j,d l ,d × x l i l+1 +i, j l+1 + j, d l (2.1) In Figure 2.2 (Bottom), sample activation of kernels/filters of dimension 5 × 2× 1 is shown. The activation a∈R (H l − H+1)× (W l − W+1)× D is scaled through a channel-wise batch-normalization layer to increase stability in training before passing it through a Rectified Linear Unit (ReLU) as a 47 Figure 2.2: (Top) Architecture and dimensions of the CNNs. (Bottom) Activations of the 5× 2 filters in the first convolution layer. 48 state-of-the-art non-linear layer. The ReLU operation is element-wise and mathematically defined as: h l+1 i, j,d = max n 0, a l i, j,d o (2.2) A hidden layer is defined as a series of convolution, batch-normalization and ReLU layers. The combined mechanism (represented as Conv l in Figure 2.2 (Top)) is such that each convolution filter extracts selective spatial and temporal information. For example, in convolving the filters on a geologic realization of a meandering fluvial channel, filters that detect curved edges will result in positive regions of activation while other filters that detect sharp acute edges will have non-positive regions of activation. When this activation is passed through a ReLU layer, only positive regions remain active. When we convolve these filters on a matrix representation of time-series data within a fixed period of time, the filters extract salient features such as water breakthrough time, pressure trend and cumulative oil production. Next, we utilize a max-pooling (a form of subsampling) layer to reduce a sub-region of an activation block and keep only its maximum value (Pooling in Figure 2.2 (Top)). The max-pooling operation takes the form: p i l+1 , j l+1 ,d = max {0⩽i<H, 0⩽ j<W} h l i l+1 × H+i, j l+1 × W+ j, d (2.3) In this case, p ∈R H l+1 × W l+1 × D l+1 where the pooling filter of size H× W evenly divides the input. The hidden layer can be repeated depending on the complexity of the dataset. In our implementation, two hidden layers were used to capture the complex non-linear relationship within the data set. At this stage, tensor p has distributed representations of our input and is reshaped into a tensor of order one (vector) and is passed through a dense layer (Dense in Figure 2.2 (Top)) to enforce the global structure of the data. We denote the output vector of this dense layer as z ∈ R H l+1 · W l+1 · D l+1 . 
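Putting the operations described so far together, a Keras-style sketch of the classification network is given below. The (5, 2) kernels, two hidden layers, dropout rate of 0.2, and Adam optimizer follow the text and Figure 2.2; the filter counts and pooling sizes are assumptions chosen only for illustration.

```python
# Hedged sketch of the classification CNN: conv -> batch-norm -> ReLU -> max-pool,
# repeated twice, then a dense layer over the flattened features.
from tensorflow import keras
from tensorflow.keras import layers

def build_scenario_classifier(n_timesteps, n_features, n_scenarios):
    inputs = keras.Input(shape=(n_timesteps, n_features, 1))   # production-data "image"
    x = inputs
    for n_filters in (16, 32):                                 # two hidden layers (assumed widths)
        x = layers.Conv2D(n_filters, kernel_size=(5, 2), padding="valid")(x)
        x = layers.BatchNormalization()(x)                     # channel-wise batch normalization
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(pool_size=(2, 1))(x)           # temporal subsampling
    x = layers.Flatten()(x)
    x = layers.Dropout(0.2)(x)                                 # dropout on the dense layer
    outputs = layers.Dense(n_scenarios, activation="sigmoid")(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",                  # per-scenario sigmoid cross-entropy, Eq. (2.4)
                  metrics=["accuracy"])
    return model

# e.g., clf = build_scenario_classifier(n_timesteps=24, n_features=12, n_scenarios=5)
# (the argument values here are illustrative, not the experiment settings)
```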
If there are C classes of geologic scenarios to be considered (where C = H_{l+1} · W_{l+1} · D_{l+1}), the training output is represented as a one-hot encoded vector y_d ∈ R^C, where each entry represents the likelihood that a sample belongs to the corresponding scenario. The cross-entropy loss function is computed as:

\mathcal{L}(z, y_d) = \sum_{i=0}^{C} \left[ -y_{d_i} \log \sigma(z_i) - (1 - y_{d_i}) \log\big(1 - \sigma(z_i)\big) \right]    (2.4)

For ease of discussion, the optimized weights of D(·) are collectively denoted as θ. The feature extraction step (from production data, D) is denoted as f_θ and the loss function as L(f_θ(D), S), where S is the provided data label. For each scenario label, the cross-entropy loss minimizes the Kullback-Leibler divergence between the empirical distribution of the input data and the predicted distribution to learn decision boundaries between scenario labels.

2.1.2 Regression Model

If the objective is to use the flow response data to estimate the corresponding geologic realization for any given historical data d_obs, the last layer of the network discussed for classification is set up to solve a regression problem. In this case, a subset of geologic realizations (M_reduced) and their corresponding simulated production data (D_reduced) are used for training, where a geologic realization m ∈ R^{N×M} is flattened to y_h ∈ R^{N·M}. With the output of the dense layer denoted as z ∈ R^{N·M}, the resulting loss function takes the form

\mathcal{L}(z, y_h) = \sum_{i=0}^{N \cdot M} (y_{h_i} - z_i)^2    (2.5)

In this case, the optimized weights of H(·) are denoted as ψ, the feature extraction operation is represented by f_ψ, and the loss function is L(f_ψ(D_reduced), M_reduced). To improve the effectiveness of the approach, a low-rank feature-based representation is used instead of a grid representation of the geologic realizations.

2.2 PCA for Model Space Compression

Using the low-rank representation of expansion coefficients to approximate the model realizations in a compact model space, Equation (2.5) can be expressed as

\mathcal{L}(z, v) = \sum_{i=0}^{s} (v_i - z_i)^2    (2.6)

where v ∈ R^s and z ∈ R^s. As an example, the total number of parameters (ψ) for H(·) in Figure 2.2 is 24,011,384, of which 24,010,000 are within the dense layer. With a low-rank representation of s = 7 that covers ∼75% of the variance in a Gaussian dataset, the number of parameters to be learned in the dense layer is reduced to 16,807.

2.3 Numerical Experiments and Results

In this section, the performance of the proposed workflow (Table 2.1) is demonstrated with four numerical experiments. The first experiment is based on synthetic multi-Gaussian realizations with uncertain variogram parameters that represent the uncertainty in the continuity model. In the second and third experiments, five 2D and four 3D scenarios of fluvial reservoirs are considered. For the fluvial channels, the uncertainty in the geologic scenario is reflected through channel azimuth, thickness-to-width ratio, and connectivity patterns. In the fourth experiment, we test the workflow on a field-like example based on the Volve field in the North Sea. In these experiments, the reference cases are assumed to be known and different modeling methods are used to represent the uncertain scenarios.

2.3.1 Example 1: Synthetic 2D Gaussian Model

In this example, a two-dimensional reservoir of dimension 1000 m × 1000 m, discretized into a 100 × 100 domain, is considered. The experiment is based on a two-phase flow system with four wells, each located at a corner of the reservoir.
Three of the wells are producers and one is an injector located at the southeastern corner.

Table 2.1: Proposed two-stage workflow

Data preparation
1. Identify geologic uncertainties and partition into K scenarios
2. For each scenario, generate N realizations
3. Run N × K forward simulations to get model-data pairs (M, D)
4. Construct scenario labels S for D

Stage 1: Geologic Scenario Selection D(·)
5. Train a classifier model D(·) with the cross-entropy loss function, L(f_θ(D), S)
6. Predict the probability of each geologic scenario, g = D(d_obs)
7. Form the subset (M_reduced, D_reduced) according to proportion g

Stage 2: Model Calibration H(·)
8. Train the regression model H(·) with the ℓ2-loss function, L(f_ψ(D_reduced), M_reduced)
9. Predict the realization, m̂ = H(d_obs)

The uncertain parameters of the variogram model for this example are listed in Table 2.2. For each of the five scenarios (K = 5), 500 conditioned realizations (N = 500) are generated as in Figure 2.3. For each realization, a Gaussian field z′ is simulated using the Sequential Gaussian Simulation (SGS) algorithm in Petrel and transformed to obtain the corresponding porosity model as φ = 0.05 z′ + 0.25 and the permeability model as k = 5 + 10^7 φ^10 [181] (a small numerical sketch of this transform is shown after Figure 2.4 below).

Figure 2.3: Reference field set-up and samples of conditioned Gaussian realizations.

The porosity values range from 0.1 to 0.4, while the permeability values range from 5 mD to 1054 mD. Approximately one pore volume is injected into the reservoir over 6 years of simulation time, during which production data is collected every 3 months. Since the amount of production data collected for each realization is the same, we represent the data as a matrix (Input in Figure 2.2 (Top)) and assign a label that corresponds to the related scenario. If data is missing for some time steps, those time steps are omitted from the matrix representation. The labeled data set is shuffled and split into training and validation sets with a ratio of 4:1. Our testing data set has the same number of elements as the validation data set. The CNN classifier model D(·) is trained with the labeled data set until the accuracies of both the training and validation data sets converge to a similar value, as a model validation step to prevent over-fitting.

Table 2.2: Uncertain geologic parameters as input to the SGS algorithm

Scenario   Variogram Azimuth (°)   Variogram (major, minor) Ranges (metres)
1          Isotropic               500 ± 100
2          0 ± 5                   (500 ± 100, 200 ± 50)
3          45 ± 5                  (500 ± 100, 200 ± 50)
4          90 ± 5                  (500 ± 100, 200 ± 50)
5          135 ± 5                 (500 ± 100, 200 ± 50)

The trained classifier is evaluated using the class-balanced testing data set to verify its robustness in extracting salient information from unseen samples. In Figure 2.4, we show that the model is capable of achieving 97.6% accuracy (sum of true positive values along the matrix diagonal) on unseen data, suggesting that it has learned to identify, for each scenario, unique features in the production data. We observe that Scenario 1 has the highest false positives (sum of Row 1 less the true positives, 10.0%), where the incorrectly classified samples are perceived to be from Scenario 1. This precision (90.0%) is attributed to the relatively high similarity in the simulated production responses of realizations from the other four scenarios.

Figure 2.4: Confusion matrix of the classifier on the testing data set (Gaussian realizations).
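Returning to the property transform used to populate the Gaussian realizations above, the following small sketch applies it numerically. A smoothed random field stands in for an SGS realization from Petrel; only the transform itself follows the text.

```python
# Numerical sketch of the porosity/permeability transform for the Gaussian example.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(42)
z = gaussian_filter(rng.standard_normal((100, 100)), sigma=10)  # stand-in for an SGS field
z = (z - z.mean()) / z.std()                                    # normalized, roughly N(0, 1)

porosity = 0.05 * z + 0.25                  # phi = 0.05 z' + 0.25
porosity = np.clip(porosity, 0.1, 0.4)      # porosity range reported for this example
permeability = 5.0 + 1.0e7 * porosity**10   # k = 5 + 10^7 phi^10  (mD)

print(porosity.min(), porosity.max())         # ~0.10 to 0.40
print(permeability.min(), permeability.max()) # ~5 mD up to ~1054 mD at phi = 0.4
```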
From Equation (2.4), once the classifier is trained, feeding any data to the classifier in a feed- forward operation results in the vector g that represents the likelihood of it belonging to different scenarios. In the second stageH (·), we select random samples of realizations from each scenario according to the proportion in g and perform inverse mapping. This consequent data set is denoted as(D reduced ,M reduced ) and has 500 elements. In Figure 2.5, significant amount of variance for this particular synthetic data set is observed in the first 7 leading components. InH (·), the goal is to learn the non-linear inverse mapping between production data and the geologic realizations in a complex solution space as shown in Figure 2.6, and to determine data support for the geologic features. Similar toD(·), the regression model is trained with the reduced training data set until convergence and validated on a validation data set. The reference case comes from the testing set and is not included in any of these two sets. Due to the ill-posed nature of the problem, it is impossible to achieve near zero loss when trainingH (·). In Figure 2.7, realizations from multiple scenarios are distinguishable in the leading principal components. Realizations from the test data set and their predicted realizations (when the corresponding simulated production data are fed toH (·)) are compared in the same orthogonal 54 Figure 2.5: Energy contribution of principal components (PCs) for production data and models (Gaussian realizations). Figure 2.6: Production data and models from five scenarios (Gaussian realizations) visualized in the first three principal components (PCs). 55 space. We can observe strong correlation between the first few leading principal components of the production data and model realizations. This strong correlation fades when increasingly non-leading principal components are analyzed. These distinguishing features and correlation are implicitly extracted and learned by f θ and f ψ from the training data set. When low-rank representation of model realizations is used, an important question is the choice of s, i.e., the number of leading elements to be included. In general, s can be determined by first analyzing the contribution of each eigenvalue to the total variance (Figure 2.5 ( right)), and by examining the correlation between the PCs of model realizations and production data. It is important to note that for geologic scenario identification, smaller s values will more effective as the first few leading elements tend to have strong discriminating power. In Figure 2.7 ( bottom), the correlation is significant for the 7 leading PCs, implying that production data has little sensitivity to non-leading PCs. For highly non-linear relations, where small changes in geologic features can lead to entirely different production data behavior, it is possible to have significant correlation between non-leading PCs of model realizations and leading PCs of production data. Figure 2.8 illustrates the outcome of the workflow for a multi-Gaussian case. The best predicted realization is obtained after the likelihood of each scenario is considered. Without the scenario se- lection step, accuracy of the prediction reduces significantly and artifact from irrelevant scenarios is introduced in the prediction. The nearest neighbors (in the training data set) to the reference realization is shown to prove that the reference case has not been seen by the model in the train- ing process. 
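The low-rank representation used as the regression target can be summarized in a short sketch: the flattened realizations are projected onto their s leading principal components, the resulting coefficients v serve as the outputs of H(·), and predicted coefficients are mapped back to the grid. The value s = 7 follows the variance analysis above; the SVD-based helper below is a generic illustration rather than the specific implementation used in this work.

```python
# Sketch of the PCA (low-rank) parameterization of model realizations.
import numpy as np

def pca_compress(M_flat, s=7):
    """M_flat: (n_realizations, n_cells) flattened realizations."""
    mean = M_flat.mean(axis=0)
    U, S, Vt = np.linalg.svd(M_flat - mean, full_matrices=False)
    basis = Vt[:s]                           # s leading principal components
    coeffs = (M_flat - mean) @ basis.T       # low-rank coefficients v (regression targets)
    explained = (S[:s] ** 2).sum() / (S ** 2).sum()
    return coeffs, basis, mean, explained

def pca_reconstruct(coeffs, basis, mean):
    """Map predicted coefficients back to a full-grid realization."""
    return coeffs @ basis + mean

# e.g., coeffs, basis, mean, frac = pca_compress(M_reduced.reshape(len(M_reduced), -1))
#       m_hat = pca_reconstruct(predicted_coeffs, basis, mean).reshape(100, 100)
```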
The reduced training data set used to trainH (·) honors the proportion of scenario identified by D(·). 2.3.2 Example 2: Synthetic 2D Fluvial Model In this example, we demonstrate the workflow using a non-Gaussian two-dimensional reservoir of size 1000 m× 1000 m, which is discretized into a 100× 100 domain. A two-phase flow system with uncertainties in fluvial features, listed in Table 2.3, is also considered for this example. For each of the five scenarios ( K = 5), training images that capture the uncertainties are generated 56 Figure 2.7: (Top) Geologic scenarios are only distinguishable in the leading principal components (PCs) of production data and model realizations. (Bottom) Strong correlation exists between lead- ing PCs of production data and model realizations. 57 Figure 2.8: (a) Reference realization as denoted in Figure 2.6 (b) Predicted realization with the workflow (c) Predicted realization with no geologic scenario selection step (d) Likelihood of each scenario withD(d obs ) (e) Nearest neighbors to the reference realization show that it does not exist in the training set. (f) Samples from M reduced 58 using object-based simulation. With these training images, 500 conditioned realizations (N= 500) are simulated using Multi-point Statistics (MPS) algorithm in Petrel. Table 2.3: Uncertain geologic parameters as input to 2D training image for MPS algorithm Scenario Channel Type Channel Azimuth ( ◦ ) Channel Width (metres) 1 Meandering 0 [70, 400] 2 Anastomosing 0± 5 [70, 400] 3 Anastomosing 45± 5 [70, 400] 4 Anastomosing 90± 5 [70, 400] 5 Anastomosing 135± 5 [70, 400] There are two producers and two injectors in the reservoir as seen in Figure 2.9. The fluvial field is composed of binary facies, where the sand and non-sand facies are assigned ( φ, k) pairs of (0.23, 500 mD) and (0.03, 3 mD), respectively. As in previous example, approximately one pore volume is injected into the reservoir over 6 years of simulation time and production data is collected every 3 months. The labeled data set is shuffled and split into training and validation data set with a ratio of 4 : 1 and the training ofD(·) andH (·) proceeds as described earlier. Figure 2.9 shows the realizations generated for each scenario, capturing the uncertainty in chan- nel geometry (sinuous or straight), azimuth, thickness-to-width ratio, and connectivity (isolated or intersecting). From Figure 2.10, the predicted realization b m with scenario selection step suffi- ciently represents the reference case. As can be seen, the finer geologic features are absent mainly because they are not supported by the production data. From the likelihood predicted byD(·), the irrelevant scenarios are pruned and the solution space is constrained. WithoutD(·), the predicted realization contains artifacts from other scenarios that when compared to historical production data in Figure 2.11, degrades the quality of the production data match (compared to the case where only relevant scenarios are included in trainingH (·)). Table 2.4 tabulates the root-mean-square error (RMSE) of data match for the examples shown in this chapter. 59 Figure 2.9: (Top) Reference field set-up and samples of conditioned fluvial realizations. ( Bottom) Training image 1 used for scenario 1 and training image 2 used (by rotation) for scenario 2-5. 
60 Figure 2.10: (a) Reference realization (b) Predicted realization with the workflow (c) Predicted realization with no geologic scenario selection step (d) Likelihood of each scenario withD(d obs ) (e) Discretized prediction by taking a threshold (determined by the mid-point of facies code of channel and non-channel facies) (f) Discretized prediction (g) Nearest neighbors to the reference realization show that it does not exist in the training set. (h) Samples from M reduced 61 Figure 2.11: Lowest mismatch (RMSE tabulated in Table 2.4) is observed between reference (d obs ) and simulated data from the realization predicted by the workflow. 62 Table 2.4: Root-mean-square error (RMSE) for examples shown in this section RMSE Example Case Two-stage workflow No scenario selection Prediction with incorrect scenario Example 2 (Figure 2.11) 0.1349 0.1463 - Example 2 (Figure 2.11) A 0.0217 0.0429 - B 0.0496 0.0731 - C 0.0728 0.0874 - D 0.0564 0.0936 - Example 3 (Figure 17) ii - 0.2016 - iii 0.1673 - - iv - - 0.3293 v - - 0.2619 vi - - 0.3829 Example 3 (Figure 19) ii - 0.2039 - iii - - 0.2813 iv 0.0568 - - v - - 0.2219 vi - - 0.3857 Example 4 (Figure 23) 0.0924 0.1731 - 63 Four additional cases (case A, B, C and D) are shown in Figure 2.12 to demonstrate the impact of confidence in the prediction of D(·) on the solution. In general, the designedD(·) for this specific example achieves 89.0% accuracy on the testing data set. In case A, a wrong label (1) is predicted with very high confidence which suggests that the behaviour in d obs is more prevalent in scenario 1 rather than its assigned scenario (label 5). As such, the inverse mapping inH (·) is learned between geologic features and production behaviour in the selected scenarios, resulting in a solution that visually appears to belong to scenario 1 and has a reasonable data match. This case occurs with very low frequency and can be attributed to the ill-posed nature of the problem. In case B, scenario 1 is predicted instead of its correct scenario 2 albeit with very low confidence such that the likelihood is almost evenly spread across 4 different scenarios. In this situation, D(d obs ) suggests learning inverse mapping from 4 scenarios to increase the chance of finding geologic features that can explain the production behaviour where two outcomes are possible. In the first outcome, relevant features are found and the solution preserves geologic realism where in this example, curvilinear features are present due to the relatively higher contribution from scenario 1. In the second possible outcome, relevant features are found but when combined throughH (·), realism is not preserved and production data match is compromised. When this happens, the best course of action is to use the selected scenarios to help construct new scenarios (by consulting domain experts) that may be able to explain d obs . In case C, the correct scenario 4 is selected with very high confidence resulting in a solution with very low data misfit. In our data set, this particular scenario has relatively low variability compared to other scenarios, hence very distinguishable production behaviour. In case D, the correct scenario 3 is selected with medium confidence between two scenarios. Therefore H (·) learns from the two most likely scenarios resulting a solution that includes features from the two selected scenarios and reasonably low data mismatch. As is expected, features with no sensitivity to production data are not constructed. 
64 Figure 2.12: Performance of the proposed workflow on four cases of 2D fluvial realizations with varying degree of confidence in the predicted scenarios of D(·) (RMSE tabulated in Table 2.4). 65 2.3.3 Example 3: Synthetic 3D Fluvial Model To further test the workflow on a more realistic geological setting, a three-dimensional reservoir of dimension 3000 m× 9000 m× 80 m that is discretized into a 40× 120× 16 domain is considered. As before, a two-phase flow system is considered and the uncertain parameters for the geologic scenarios are listed in Table 2.5. For each of the four scenarios (K= 4), 500 realizations (N= 500) are generated using object-based modeling algorithm in Petrel. The realizations are conditioned to three producers and two injectors that are shown in Figure 2.13. The petrophysical properties in this example are similar to those in Example 2. Additional uncertainty about channel thickness within each scenario is introduced. For this example, approximately one pore volume is injected over 8 years of simulation time, during which production data is collected every 3 months. The training ofD(·) andH (·) proceeds as described earlier. Table 2.5: Uncertain geologic parameters as input to 3D object-based modelling algorithm Scenario Channel Type Channel Azimuth ( ◦ ) Thickness-width ratio (ratio) 1 Meandering 0± 5 0.1± 0.05 2 Meandering 30± 5 0.1± 0.05 3 Meandering 330± 5 0.1± 0.05 4 Braided 0± 15 0.2± 0.05 We observe in Figure 2.14 that the best prediction is obtained when irrelevant scenarios have been removed. Figure 2.15 shows satisfactory production data match using the best prediction, although finer geologic features (i.e. high sinuosity meander east of I8) are smoothed out due to lack of sensitivity of production data to those feature. In Figure 2.16, the area southwest of P3 is not drained by the producer nor entirely swept by the injector I9 - resulting in almost no data sensitivity. Hence, the network predicts the mean map as predicted map of this area. The production data match in Figure 2.17 suggests that the best prediction has integrated the relevant geologic features according to the degree of sensitivities learned byH (·). The nega- tive impact of not selecting relevant scenarios before performing inverse mapping is observed in 66 Figure 2.13: Samples and mean of conditioned 3D realizations shown as isochore maps with contour interval of 16 m (measured along true vertical thickness between the top and base horizons of this synthetic field). The thickness of channel facies at each well location is varied. 67 Figure 2.14: (Top) (i) Reference realization (ii) Predicted realization with no geologic scenario selection step (iii) Predicted realization with the workflow (iv) Prediction using models from Sce- nario 2 (v) Prediction using models from Scenario 3 (vi) Prediction using models from Scenario 4 (Bottom) Time steps of oil-in-place grid of the reference realization. 68 Figure 2.15: Comparison of production data match (RMSE tabulated in Table 2.4) between pre- dictions from Figure 2.14. 69 Figure 2.16: (Top) (i) Reference realization (ii) Predicted realization with no geologic scenario selection step (iii) Predicted realization using models from Scenario 1 (iv) Prediction with the workflow (v) Prediction using models from Scenario 3 (vi) Prediction using models from Scenario 4 (Bottom) Time steps of oil-in-place grid of the reference realization. 70 Figure 2.14, where the area between I8 and P2 is not predicted accurately as seen in (ii). 
When relevant scenarios are selected according to coarse-scale temporal features in production data (ex- tracted by trainingD(·)), the distinguishing information typically corresponds only to coarse-scale geologic features. Since the production data has little sensitivity to the area between I8 and P2, the predicted solution presents an averaged solution of the relevant prior realizations. Figure 2.17: Comparison of production data match (RMSE tabulated in Table 2.4) between pre- dictions from Figure 2.16 We visualize the oil-in-place grid over simulation period where the area between I8 and P2 is partially flooded at the end suggesting low data sensitivity to this particular region. Due to this, when higher-order information is extracted usingD(·) from high sensitivity regions and used to select relevant scenarios (in this particular case, mostly from north-south fluvial scenario), the same geologic features would be observed in low-sensitivity regions as abrupt changes in geology 71 is not expected. As mentioned earlier, in region with low data sensitivity, the prediction would be close to the mean map of the priors considered. 2.3.4 Example 4: Large-scale Model (Based on Volve Field) The workflow is applied to a large-scale example based on V olve field in the North sea. This three-dimensional reservoir of approximate dimension 5000 m× 4000 m× 80 m is discretized into a 78× 87× 15 domain. A complex fault system in the field creates varying degrees of sand juxtaposition that divides the reservoir into hydraulically-separated producing regions. Figure 2.18 shows the field set-up (2D map view) with 10 producers and 4 water injectors (marked in red) that penetrate several of the fault blocks. The porosity field is populated (with Petrel) using Sequential Gaussian Simulation (SGS) algorithm conditioned to the wells (porosity logs for select wells are shown) with thicker sand accumulation in the northern area. The uncertain scenarios (K= 3) considered in this example are defined by variogram azimuth (Table 2.6) and for each scenario, 500 realizations (N = 500) are generated. We assume that the structural and fault framework are certain. The permeability field is calculated using a simple transform function k= 5+10000000φ 8.7 to mimic the ranges seen in the actual field. The porosity values range from 0.05 to 0.35 while the permeability values range from 5 mD to 1085 mD. Table 2.6: Uncertain geologic parameters for V olve field-like case Scenario Variogram Azimuth( ◦ ) Variogram (major, minor, ver- tical) Ranges (metres) 1 0± 5 (1000± 100, 300± 50, 50) 2 120± 5 (1000± 100, 300± 50, 50) 3 240± 5 (1000± 100, 300± 50, 50) A water-flooding system is considered for a total simulation time of 13 years, with approxi- mately 35% oil recovery factor at the end of simulation. Using production data collected every month,D(·) achieves a testing accuracy of 89.7% when trained and validated with N = 500. Our 72 Figure 2.18: Field set-up (2D map view), porosity logs (true vertical depth) for select wells and samples of V olve conditioned Gaussian realizations from three scenarios. 73 sensitivity analysis shown in Figure 2.19 supports that N = 100 is able to yield reasonable test- ing accuracy of 83.0% and increasing the number of N for the scenario selection step introduces marginal improvement in classification accuracy on the testing set. Figure 2.19: Sensitivity analysis on testing accuracy as a function of number of realizations per scenario used for training and validation ofD(·). 
Similar to the previous examples, we randomly divide the data forD(·) in a 4:1 ratio for the training and validation data sets. The class-balanced testing data set for the sensitivity analysis (Figure 2.19) is kept constant (with 300 elements) for a fair comparison. With this observation, to screen unsupported scenarios the total simulation run (computational cost) can be reduced to N· K = 300. For the inversion step, 500 realizations are sampled from the relevant scenarios specified in g for trainingH (·) to provide more information on the complex manifold of inverse mapping. In Figure 2.20, a cross section (x to x’) across key wells and a 2D map view (of the 11th layer) of the reference porosity grid (a) are compared to the predicted porosity grid (b). While spatial continuity patterns are predicted well for all layers, vertical features (fining-upwards trend) appear smooth especially in areas that are not water-flooded and away from wells. This is expected as 74 spatial patterns can be informed by inter-well water cut and oil rate trends, whereas there is not enough sensitivity to vertical heterogeneity within the dynamic data, since all the wells have single- string completion. The upscaled well porosity logs (hard data) in the predicted realization (b m) are also reproduced, while the extreme values are not predicted accurately (as shown on the histograms of porosity grid). Figure 2.20 (c) shows the initial and final oil-in-place for the reference case and predictions, where some peripheral fault blocks are filtered (defined as non-oil-bearing). The water-flooded areas are visible around the injectors, where production data has the highest sensitivity to geologic features. The northern area of the reservoir (compartment penetrated by P4) with thicker and higher porosity sand shows good prediction performance as more production from this area results in more sensitivity of production data to geologic features. For areas away from the wells with no data sensitivity, the prediction would be close to the mean values of the priors considered. Figure 2.21 shows good data match between the reference production data (d obs ) and data simulated from b m even when predicted spatial features are not identical to those in the reference case, which is attributed to the non-uniqueness of the inverse mapping captured byH (·). For this multi-Gaussian example, geologic realism ofb m across hydraulic compartments are preserved when inverse mapping is done with production data and model realizations from relevant scenarios. The information in d obs is used to identify supported scenarios and allows reliable prediction of geologic features in virgin compartments with no well penetration. 2.4 Summary and Discussion In this work, we demonstrate the feature learning capability of CNN as a classifier to distinguish between geologic scenarios and as a regression model to perform inverse mapping from flow re- sponse data to salient features of reservoir property distribution. Combined together, the two CNNs can be used as a two-stage workflow for geologic scenario selection and history matching, with promising outcomes. The first step of the workflow applies the CNN as a classifier to by extracting 75 Figure 2.20: (a) Cross-section (x to x’) of reference porosity grid (b) Cross-section (x to x’) of predicted porosity grid (c) Initial and final oil-in-place grids for reference and predicted cases (map view of 11th layer). 
76 Figure 2.21: Production data match (RMSE tabulated in Table 2.4) for select producers (oil rate and corresponding water cut) and injectors (pressure). 77 salient information in production data to distinguish relevant geologic scenarios. Geologic scenario selection increases the efficiency of the inverse mapping step by reducing the number of possible geologic scenarios. The second step uses CNN as a multi-dimensional non-linear regression model to learn the inverse mapping between production data and important features of reservoir property distribution (permeability in this work) in the reduced set from the previous step. Once the model is trained, it can be used to estimate model parameters for any set of production data if the training data is reliable. The results show that the geologic scenario selection step is an important step to remove possible artifact (due to irrelevant scenarios) from the solution. Despite no formal mechanism to ensure a good data match for the predicted models by this approach, the resulting solutions could reproduce the observed data and contained geologically consistent features. The main computation for the developed data-driven workflow in this work is associated with generating the data sets and training the CNN, which are performed off-line. Once a trained model is constructed, it can be used in real time to map flow response data to reservoir property dis- tribution. Unlike conventional methods that use field data to perform model calibration, through gradient-based or gradient-free methods, the developed approach in this chapter does not require the field data to construct a model for inverse mapping. As such, history matching of field data with the resulting trained model is fast and straightforward, making the framework easy to deploy and implement. Therefore, the workflow can be seamlessly integrated into current practices in the industry, where it can be used as ”black-box” inversion proxy model. The presented workflow, however, has its limitations too. As in many other machine-learning workflows, the off-line computation for training requires several model realizations and forward simulation runs. For the examples presented, the computational cost of training the CNN models was insignificant compared to the computational burden associated with running reservoir simula- tion. In general, the workflow requires a minimum number of training data to effectively represent the manifold of production data and geologic realization. Computational efficiency can be gained by incrementally increasing the number of training data and evaluating the performance of the 78 CNN classifier for falsification. Additionally, no formal guarantees exist for the trained model to honor the field data. However, in the experiments presented in this chapter, the predicted data from the solution was reasonably close to the observed production data. Similar to other methods of automated history matching, the prediction or reconstruction ability depends on several factors such as well density, injector-producer configuration, length of production period and the contrast in petrophysical parameters. The presented workflow exploits the feature-learning property of CNN to perform geologic scenario selection and history matching in a two-stage approach. In the first step, CNN is trained to learn the complex relationship between the general data trend and the geologic scenarios to elim- inate scenarios that cannot be supported by dynamic flow behavior in the field. 
The second step uses the flow response data from a reduced set of reservoir model realizations to train a CNN and construct an inverse mapping from the response data to salient geologic features (spatial patterns). The resulting workflow showed promising performance in a series of 2D and 3D numerical exper- iments including a realistic field-scale example. Furthermore, the robustness of the workflow in more complex setting where multi-facies geologic models and other major sources of uncertainty such as structural variation in horizons and fault systems are present in the reservoir need to be evaluated. Another important extension is related to quantification of uncertainty by developing proba- bilistic formulations that provide multiple calibrated models as the output of the inverse mapping. All in all, end-to-end automated geologic scenario selection and history matching efforts is an important area of investigation that require novel formulation. Advances in machine learning are also introducing new tools and approaches that may prove effective in solving some of the com- plex problems in reservoir modeling and geosciences in general. Given the complex nature of oil and gas reservoirs, challenges in inferring the intricate interactions between production data and geologic features, and limitations in available data and computational resources, physical insight may become necessary to effectively tailor data science tools for application in subsurface flow problems. 79 Chapter 3 Latent Space Inversion (LSI) for Inverse Mapping of Subsurface Flow Data This chapter presents Latent-Space Inversion (LSI) as a new data-informed inversion and param- eterization framework where dimensionality reduction is tailored to flow physics that governs the behavior of subsurface systems. Inverse modeling in hydrogeology and petroleum engineering involves minimizing the mismatch between observed and simulated data from a set of prior mod- els. A myriad of approaches has been developed to accomplish this goal over the years, and their performance is dependent on the effectiveness of parameterization and the capability of data con- ditioning technique. We demonstrate LSI as a more robust and efficient approach for calibration of subsurface model over traditional approaches where dimensionality reduction of model parameters is done independently (decoupled) of flow data integration. LSI provides a compact description of the parameters in a latent space that does not only exploit the redundancy of large-scale geologic features but also retain features that are sensitive to flow data. Motivated by recent advances in machine learning research, LSI architecture involves a pair of deep convolutional autoencoders that are coupled to jointly extract spatial geologic features in sub- surface models and temporal trends in flow data. To take advantage of these recent developments, we develop a scheme that combines parameterization and data conditioning as per Figure 3.1 where the construction of latent space for model is aware of the information in flow response data. The 80 LSI architecture is trained offline using prior model realizations and their corresponding simu- lated flow responses (as training data) to effectively represent the model and data and to learn the complex nonlinear inverse mapping between data and model. Once field data becomes available, calibrated models can be rapidly obtained using the trained LSI architecture. 
The resulting data- informed model latent space can be explored to allow the generation of an ensemble of calibrated model realizations around the inversion solution. This is especially useful when observed data is noisy and multiple inversion solutions can be accepted within the noise range. Figure 3.1: Data-informed parameterization. The main advantage of LSI as a data-driven approach over conventional methods of history matching is that nonlinearity in the forward model and non-Gaussianity in the prior model realiza- tions are implicitly managed. LSI approaches we proposed are nonparametric as feature selection happens in a supervised learning setting and are not prone to bias introduced by predetermined kernel transformation or distance measure. Learning inverse mapping in a compact latent space allows tractable computation for high-dimensional model realizations encountered in practical set- tings. The low dimensionality of latent variables also stabilizes the training process of deep neural networks with unstable loss function. By operating in low dimension, LSI increases the inter- pretability of neural network architectures as the latent variables are more amenable to manipu- lation and mathematical operations. In the next sections, we present several inversion examples to illustrate the performance of coupled and decoupled LSI approaches and to discuss the advan- tages and limitations of data-driven inversion approaches in comparison to conventional inverse modelling formulations. 81 3.1 Parameterization with Deep Learning Techniques 3.1.1 Autoencoder for Model Space Compression Autoencoder (AE) is a neural network architecture for nonlinear dimensionality reduction [116] that consists of two components, an encoder and a decoder. The actual architecture of the deep convolutional autoencoder used in this work for the prior model realizations m∈R M× M , is depicted in Figure 3.2. Detailed description of layers (i.e. size of input, size of output, number of weights) within each component is given in the Appendix. The autoencoder is implemented with the deep learning library Keras (version 2.2.4) [45] and the actual functions used and associated parameters are tabulated in the Appendix. For more details on the mechanism of each function, we refer readers to relevant literature in computer science (e.g. Chollet et al. [45], Ramsundar and Zadeh [199]). Figure 3.2: Actual autoencoder architecture for model realizations. The encoder Enc θ (·) is composed of several main layers, each consisting of a convolutional function (denoted as conv2D and color-coded in Figure 3.2), leaky-ReLU (Rectified Linear Unit) nonlinear activation function (lrelu) and a pooling (down-sampling) function (pool). The con- volutional operation extracts local spatial features in the model realizations while the nonlinear 82 activation function allows the representation of nonlinear features within the autoencoder. Suc- cessive pooling functions are used to gradually reduce the dimensionality of input to obtain the desired compact representation as model latent variable, z m ∈R K× 1 . For the examples used in section 3.3.1 and 3.3.2, M= 100 and K= 64. The decoder Dec θ (·) takes z m as the input and does the opposite of Enc θ (·) where z m is grad- ually upsampled to obtain a reconstruction ˆ m of m. Similar to Enc θ (·), Dec θ (·) is also composed of several main layers except pool is now replaced with upsample. 
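As an illustration, a minimal sketch of this convolutional encoder-decoder pair is given below. It assumes the tf.keras functional API (the dissertation implementation uses standalone Keras 2.2.4, whose API is analogous), and the filter counts and layer depths are illustrative rather than the exact architecture tabulated in the Appendix.

from tensorflow.keras import layers, models

M, K = 100, 64  # grid size and model latent dimension used in Sections 3.3.1-3.3.2

# Encoder Enc_theta: conv2D -> leaky-ReLU -> pool, halving each spatial dimension per block
inp = layers.Input(shape=(M, M, 1), name="m")
x = inp
for n_filters in (8, 16):                        # illustrative filter counts
    x = layers.Conv2D(n_filters, 3, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.MaxPooling2D(2)(x)                # 100 -> 50 -> 25
x = layers.Flatten()(x)
z_m = layers.Dense(K, name="z_m")(x)             # model latent variables, z_m in R^K
encoder_m = models.Model(inp, z_m, name="Enc_theta")

# Decoder Dec_theta: mirror of the encoder, with pool replaced by upsample
z_in = layers.Input(shape=(K,), name="z_m_in")
y = layers.Dense(25 * 25 * 16)(z_in)
y = layers.Reshape((25, 25, 16))(y)
for n_filters in (16, 8):
    y = layers.UpSampling2D(2)(y)                # 25 -> 50 -> 100
    y = layers.Conv2D(n_filters, 3, padding="same")(y)
    y = layers.LeakyReLU(0.2)(y)
m_hat = layers.Conv2D(1, 3, padding="same", name="m_hat")(y)
decoder_m = models.Model(z_in, m_hat, name="Dec_theta")

# Connected autoencoder, trained to reconstruct m from z_m with a mean-squared error loss
autoencoder_m = models.Model(inp, decoder_m(encoder_m(inp)), name="model_AE")
autoencoder_m.compile(optimizer="adam", loss="mse")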
To reduce kernel artifacts from the downsampling and upsampling operations, we recommend decreasing and increasing the di- mension by no more than a factor of 2 between each layer. The encoder and decoder are connected together and trained with the following loss function L(θ)= N ∑ ∥m− Dec θ (Enc θ (m))∥ 2 2 (3.1) where once θ is learned, the model latent variable is obtained with z m = Enc θ (m) and the corresponding reconstruction is given by ˆ m= Dec θ (z m ). 3.1.2 Autoencoder for Data Space Compression The actual architecture of the autoencoder for production response data d∈R T× F , is depicted in Figure 3.3. T and F represent the data timesteps and production data features, respectively. Similar to the model autoencoder, details of layers, actual functions and associated parameters within each component are given in the Appendix. The main layers within the data encoder Enc ψ (·) are each composed of a one-dimensional convolutional function (denoted as conv1D and color-coded in Figure 3.3), leaky-ReLU nonlinear activation function (lrelu) and a one-dimensional pooling (down-sampling) function (pool). For the time series, one-dimensional convolutional operation extracts local temporal features in the production response data. Successive temporal pooling functions are used to gradually reduce the dimensionality of input to obtain the desired compact 83 representation as data latent variable, z d ∈R H× 1 . For the examples used in section 3.3.1 and 3.3.2, T = 25, F = 20 and H = 10. Figure 3.3: Actual autoencoder architecture for production response data. The input z d to the data decoder Dec ψ (·) is gradually upsampled to produce a reconstruction ˆ d of d. Similar to Enc ψ (·), Dec ψ (·) is also composed of several main layers except pool is now replaced with upsample. To make Enc ψ (·) robust against noisy data with random fluctuations, the kernel size of conv1D is initially set as 3 (in months) and later increased (to 6) in successive layer. Similar to the model autoencoder, we recommend decreasing (pool) and increasing (upsample) the dimension by no more than a factor of 2 between each layer to reduce kernel artifacts. The encoder and decoder are connected together and trained with the following loss function L(ψ)= N ∑ ∥d− Dec ψ (Enc ψ (d))∥ 2 2 (3.2) where onceψ is learned, the data latent variable is obtained with z d = Enc ψ (d) and the corre- sponding reconstruction is given by ˆ d= Dec ψ (z d ). 84 3.1.3 Inverse Mapping The inverse mapping (g − 1 ω (·) in Equation 1.7) between z d and z m is approximated with a neural network denoted as Reg dm γ (·). The architecture of this regressor is shown in Figure 3.4 and details of layers, actual Keras functions used and associated parameters are tabulated in the Appendix. In Reg dm γ (·), the nonlinearity between z d and z m is captured by multiple fully-connected (dense) layers with nonlinear activations. The loss function to train Reg dm γ (·) is L(γ)= N ∑ ||Reg dm γ (z d )− z m || 2 2 (3.3) where onceγ is learned, the predicted model latent variable is given by ˆ z m = Reg dm γ (z d ). With this formulation, the construction of a forward proxy model Reg md γ (·) is trivial (but is not the focus of this chapter) by the manipulation of output unit size of the dense layers. Figure 3.4: Architecture of regressors. 
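The data autoencoder and the inverse-mapping regressor can be sketched in the same way. Again, the layer sizes below are illustrative (the exact layers and weight counts are tabulated in the Appendix), tf.keras is assumed in place of Keras 2.2.4, and the zero-padding step at the end of the decoder is one possible way of recovering the odd sequence length T = 25 after upsampling.

from tensorflow.keras import layers, models

T, F, H, K = 25, 20, 10, 64   # timesteps, data features, data and model latent dimensions

# Data encoder Enc_psi: conv1D -> leaky-ReLU -> pool along the time axis,
# with the kernel size increased from 3 to 6 in the later layer as described above
d_in = layers.Input(shape=(T, F), name="d")
x = layers.Conv1D(16, 3, padding="same")(d_in)
x = layers.LeakyReLU(0.2)(x)
x = layers.MaxPooling1D(2)(x)                  # 25 -> 12
x = layers.Conv1D(32, 6, padding="same")(x)
x = layers.LeakyReLU(0.2)(x)
x = layers.MaxPooling1D(2)(x)                  # 12 -> 6
x = layers.Flatten()(x)
z_d = layers.Dense(H, name="z_d")(x)           # data latent variable, z_d in R^H
encoder_d = models.Model(d_in, z_d, name="Enc_psi")

# Data decoder Dec_psi: upsample back toward (T, F)
z_in = layers.Input(shape=(H,))
y = layers.Dense(6 * 32)(z_in)
y = layers.Reshape((6, 32))(y)
y = layers.UpSampling1D(2)(y)                  # 6 -> 12
y = layers.Conv1D(16, 6, padding="same")(y)
y = layers.LeakyReLU(0.2)(y)
y = layers.UpSampling1D(2)(y)                  # 12 -> 24
y = layers.Conv1D(F, 3, padding="same")(y)
y = layers.ZeroPadding1D((0, 1))(y)            # 24 -> 25 to match T
decoder_d = models.Model(z_in, y, name="Dec_psi")

autoencoder_d = models.Model(d_in, decoder_d(encoder_d(d_in)), name="data_AE")
autoencoder_d.compile(optimizer="adam", loss="mse")   # reconstruction loss of Equation 3.2

# Inverse-mapping regressor Reg_dm: dense layers from z_d to z_m, trained on Equation 3.3
reg_in = layers.Input(shape=(H,))
r = layers.Dense(64, activation="relu")(reg_in)
r = layers.Dense(64, activation="relu")(r)
z_m_hat = layers.Dense(K)(r)
reg_dm = models.Model(reg_in, z_m_hat, name="Reg_dm")
reg_dm.compile(optimizer="adam", loss="mse")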
85 3.2 Latent Space Inversion (LSI) Workflow In this work, we focus on (i) parameterization of high dimensional model and flow response data as latent variables in low dimensional manifolds and (ii) learning direct nonlinear inverse mapping from the latent data space to the latent model space using the identified manifold. In LSI, a decou- pled approach involves dimensionality reduction step that exploits the redundancy of input data for a compact latent representation followed by an inverse mapping step in the latent space. A coupled approach involves simultaneous learning of dimensionality reduction and inverse mapping; where parameterization (in this case, is more aptly defined as feature selection) of M and D are informed by the objective function of inverse mapping and seek to retain only the most relevant features to construct Z m and Z d . In the context of decoupled LSI, the effectiveness of the first step ( z m = f m (m) and z d = f d (d)) that involves transforming high dimensional data pairs (m, d) into low dimensional representation (z m ,z d ) determine the accuracy of the second step (learning the inverse mapping). We present and discuss coupled and decoupled approaches for implementing LSI, using advanced neural network architectures to define the latent spaces and the inverse mapping. 3.2.1 Decoupled LSI In conventional automated history matching workflows, parameterization of prior model realiza- tions (M to Z m ) is typically done separately from the inversion process [30, 96, 257]. Once z m is acquired, data mismatch is minimized (as per Equation 1.5) using gradient-based or gradient-free methods. In decoupled LSI, a data-driven approach is taken (as per Equation 1.6 and Equation 1.7) and the pseudocode is outlined in Table 3.1. As shown in Figure 3.5, the model autoencoder and data autoencoder are both trained indepen- dently to obtain z m and z d before training the regressor Reg dm γ (·). After all three architectures (with total weights of 100,475, see the Appendix) are trained optimally, given observed field data d obs the model prediction ˆ m is obtained by 86 Table 3.1: Pseudocode for decoupled LSI 1 function DecoupledLSI(M, D, N, d obs ,E ): 2 initialize Enc θ ,Dec θ ,Enc ψ ,Dec ψ ,Reg dm γ 3 epoch← 1500 4 batchSize← 32 5 numBatch← ⌈N/batchSize⌉ 6 for i= 1 to epoch 7 for j= 1 to numBatch 8 M j ,D j ← batchify(M, D) 9 computeL(θ)=∑ m∈M j ∥m− Dec θ (Enc θ (m))∥ 2 2 10 computeL(ψ)=∑ d∈D j ∥d− Dec ψ (Enc ψ (d))∥ 2 2 11 update Enc θ ,Dec θ ,Enc ψ ,Dec ψ 12 Z m ← Enc θ (M) 13 Z d ← Enc ψ (D) 14 for i= 1 to epoch 15 for j= 1 to numBatch 16 Z m, j ,Z d, j ← batchify(Z m , Z d ) 17 computeL(γ)=∑ z m ∈Z m, j ,z d ∈Z d, j ∥z m − Reg dm γ (z d )∥ 2 2 18 update Reg dm γ 19 z d obs ← Enc ψ (d obs ) 20 Z d obs ← z d obs +E 21 return Dec θ (Reg dm γ (Z d obs )) 22 end function ˆ m= Dec θ (Reg dm γ (Enc ψ (d obs ))) (3.4) The architectures are trained with the Adam optimizer [115] on a high-performance computing cluster with an NVIDIA Tesla P100 GPU node for approximately 33 minutes (spanning 1500 epochs) and are checkpointed every 10 epoch. The training and validation process (for one instance of training for section 3.3.1) is demonstrated in Figure 3.6 where the optimal checkpoint (without overfitting) for each architecture is identified when validation losses for L(ψ),L(γ) andL(θ) do not show any further reduction. In this specific training instance, those checkpoints are at epoch 50, 800 and 300 respectively. 
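In terms of Keras calls, the decoupled procedure of Table 3.1 reduces to three training stages followed by the deployment step of Equation 3.4. The sketch below assumes the illustrative models from the Section 3.1 sketches (autoencoder_m, autoencoder_d, encoder_m, encoder_d, decoder_m, reg_dm) and NumPy arrays M_train, D_train, M_val, D_val and d_obs of appropriate shape; it is a simplification of the batched loop in the pseudocode.

# Step 1: train the two autoencoders independently (Equations 3.1 and 3.2);
# in practice a ModelCheckpoint callback is used to save weights every 10 epochs
autoencoder_m.fit(M_train, M_train, epochs=1500, batch_size=32,
                  validation_data=(M_val, M_val))
autoencoder_d.fit(D_train, D_train, epochs=1500, batch_size=32,
                  validation_data=(D_val, D_val))

# Step 2: encode the training pairs and learn the latent inverse mapping (Equation 3.3)
Z_m = encoder_m.predict(M_train)
Z_d = encoder_d.predict(D_train)
reg_dm.fit(Z_d, Z_m, epochs=1500, batch_size=32)

# Deployment (Equation 3.4): map observed data directly to a calibrated model
m_hat = decoder_m.predict(reg_dm.predict(encoder_d.predict(d_obs[None, ...])))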
The dimension of each latent variable (H and K) is determined by performing a sensitivity analysis where the dimension is successively increased to a point where further decrease in validation loss is not observed. Generally, the data autoencoder converges faster 87 than the model autoencoder as temporal variation in the timeseries is less complex than the spatial variation in the model realizations. Figure 3.5: Diagram of decoupled LSI architecture for model inversion. The histograms in Figure 3.6 show the comparison of d, m and z m (from the testing dataset) with the corresponding predictions ˆ d, ˆ m and ˆ z m obtained from the trained architectures. The good reconstruction performance indicates that the model and data autoencoders have the ability to rep- resent and generalize (for reconstruction purpose) high dimensional nonlinear input. The training, validation and testing datasets are explained in Section 3.3. 3.2.2 Coupled LSI In coupled LSI, the model autoencoder, data autoencoder and regressor are simultaneously trained as described in Table 3.2. The rationale of this approach is to inform the model and data pa- rameterization step of the final objective function, that is to learn the correspondence between M and D. When presented with unseen testing data, flow data-informed parameterization improves the reconstruction quality of ˆ m (relevant and plausible geologic features are recovered) that subse- quently reduces the mismatch between d obs and simulated data (from g( ˆ m)). In the implementation of coupled LSI, the components of both autoencoders and the regressor are combined as depicted 88 Figure 3.6: Training and validation losses for decoupled LSI and histograms (normalized) of pre- dictions for testing dataset. 89 in Figure 3.7. The loss functions for the autoencoders remain the same as in decoupled LSI, how- ever the weights of Enc ψ (·), Dec θ (·) and Reg dm γ (·) are jointly optimized by minimizing the loss function L(ψ,γ,θ)= N ∑ ∥m− Dec θ (Reg dm γ (Enc ψ (d)))∥ 2 2 (3.5) Table 3.2: Pseudocode for coupled LSI 1 function CoupledLSI(M, D, N, d obs ,E ): 2 initialize Enc θ ,Dec θ ,Enc ψ ,Dec ψ ,Reg dm γ 3 epoch← 1500 4 batchSize← 32 5 numBatch← ⌈N/batchSize⌉ 6 for i= 1 to epoch 7 for j= 1 to numBatch 8 M j ,D j ← batchify(M, D) 9 computeL(ψ)=∑ d∈D j ∥d− Dec ψ (Enc ψ (d))∥ 2 2 10 update Enc ψ ,Dec ψ 11 computeL(ψ,γ,θ)=∑ m∈M j ,d∈D j ∥m− Dec θ (Reg dm γ (Enc ψ (d)))∥ 2 2 12 update Dec θ ,Reg dm γ ,Enc ψ 13 computeL(θ)=∑ m∈M j ∥m− Dec θ (Enc θ (m))∥ 2 2 14 update Enc θ ,Dec θ 15 z d obs ← Enc ψ (d obs ) 16 Z d obs ← z d obs +E 17 return Dec θ (Reg dm γ (Z d obs )) 18 end function Once the combined architectures are optimally trained (with total trainable weights of 100,475 that is equivalent to the number of weights in decoupled LSI), given observed field data d obs the model prediction ˆ m is obtained from Equation 3.4. The data latent variable z d obs from Enc ψ (d obs ) resides in a meaningful latent space where data reconstruction is obtained from ˆ d obs = Dec ψ (Enc ψ (d obs )) (3.6) and the manifold is randomly explored within a noise level ε by feeding Equation 3.4 with z d obs +E to give multiple inversion solutions ˆ M that can be accepted. E ∈R G× H represents G noise samples around z d obs . 90 Figure 3.7: Diagram of coupled LSI architecture for model inversion. The architectures are trained with the same optimizer, computing resources and strategy as the decoupled approach and took approximately 18 minutes for 1500 epochs. 
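The alternating updates in Table 3.2 can also be written as explicit gradient steps. The sketch below assumes TensorFlow 2-style eager training (an implementation choice; the dissertation uses Keras 2.2.4), the illustrative component models from the Section 3.1 sketches, and training arrays M_train of shape (N, 100, 100, 1) and D_train of shape (N, 25, 20).

import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
opt_d, opt_inv, opt_m = (tf.keras.optimizers.Adam() for _ in range(3))
dataset = tf.data.Dataset.from_tensor_slices((M_train, D_train)).shuffle(512).batch(32)

for epoch in range(1500):
    for m_batch, d_batch in dataset:
        # (1) data-autoencoder loss L(psi): update Enc_psi and Dec_psi
        with tf.GradientTape() as tape:
            loss_d = mse(d_batch, decoder_d(encoder_d(d_batch)))
        v = encoder_d.trainable_variables + decoder_d.trainable_variables
        opt_d.apply_gradients(zip(tape.gradient(loss_d, v), v))

        # (2) inversion loss L(psi, gamma, theta) of Equation 3.5:
        #     update Dec_theta, Reg_dm and Enc_psi; the model encoder is excluded,
        #     as in lines 11-12 of Table 3.2
        with tf.GradientTape() as tape:
            loss_inv = mse(m_batch, decoder_m(reg_dm(encoder_d(d_batch))))
        v = (decoder_m.trainable_variables + reg_dm.trainable_variables
             + encoder_d.trainable_variables)
        opt_inv.apply_gradients(zip(tape.gradient(loss_inv, v), v))

        # (3) model-autoencoder loss L(theta): update Enc_theta and Dec_theta
        with tf.GradientTape() as tape:
            loss_m = mse(m_batch, decoder_m(encoder_m(m_batch)))
        v = encoder_m.trainable_variables + decoder_m.trainable_variables
        opt_m.apply_gradients(zip(tape.gradient(loss_m, v), v))

Using three separate optimizers, one per loss, keeps the Adam moment estimates of each update independent; this is a design choice of the sketch rather than a requirement of the method.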
Figure 3.8 demonstrates the training and validation process (for one instance of training with non-Gaussian dataset) where the optimal checkpoint is at epoch 100. The flow physics-informed model reconstruction (i.e. inversion) lossL(ψ,γ,θ) is generally higher when compared to model autoencoder reconstruction lossL(θ) as the former is dependent on the sensitivity of D to the geologic features in M. Note that as per Table 3.2, the gradient ofL(ψ,γ,θ) (line 11,12) is not backpropagated to the model encoder Enc θ as when coupled LSI is deployed in prediction (forward) mode, only d obs is available as an input. This is marked as a green stippled line in Figure 3.7. Therefore, in each epoch during training, the same instance of Dec θ (line 11,12 and 13,14) is alternatingly updated with two loss functions (Equation 3.5 and Equation 3.1, in this order). As can be seen in Figure 3.8, any update step in the direction of minimizing Equation 3.1 (for reconstruction) agrees with the update direction of Equation 3.5 (for inversion). This implementation ensures that the latent space of Z m covers all range of geologic features that are present in the set of M (akin to Z d that fully represent D). The histograms in Figure 3.8 show the comparison of d and m (for inversion and reconstruction) with the corresponding predictions ˆ d and ˆ m for the testing dataset. We observe from the second 91 Figure 3.8: Training and validation losses for coupled LSI and histograms (normalized) of predic- tions for testing dataset. 92 and third (from top) histograms that there are more nonbinary values (continous values bounded by 0 and 1) in the inversion solutions (compared to reconstructions) as coupled LSI regresses learned geologic features (from the binary domain in M) to generate model predictions guided by the data support of D. 3.3 Numerical Experiments and Results 3.3.1 Example 1: Synthetic 2D Gaussian Model To demonstrate LSI, we first consider a two-dimensional reservoir of size 1000 m × 1000 m that is discretized into a 100× 100 domain. The forward model is a two-phase (wetting and non-wetting) flow simulator (Eclipse [221]) and the reservoir is penetrated by two producers ( P1 and P2) and two injectors (I8 and I9). We consider prior model realizations from different scenarios as shown in Figure 3.9 to test both coupled and decoupled LSI approaches with varying variogram continuity models. For each of the five scenarios, 500 conditioned realizations ( N= 500) are generated. Note that the uncertainty in geologic scenarios is not within the scope of this chapter (i.e. the geologic scenario of the reference cases is always known). Each prior model realization m is a multi-Gaussian field that is simulated using the Sequential Gaussian Simulation (SGS) algorithm [196] with Petrel [222]. A complete description of vari- ogram parameters used can be found in Mohd Razak and Jafarpour [163]. To collect flow response data d, each m is transformed to get the corresponding porosity model as φ = 0.05m+ 0.25, and the permeability model as k= 5+ 10 7 φ 10 [181]. The porosity values range from 0.1 to 0.4 while the permeability values range from 5 mD to 1054 mD to mimic typical reservoir properties. Ap- proximately one pore volume is injected into the reservoir over 6 years of simulation time, during which production data is collected every 3 months. This results in 25 timesteps (T = 25) and 20 production response features (F = 20, i.e. production rates of wetting phase and non-wetting phase, bottomhole pressure). 
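The property transforms quoted above are simple to verify. The small NumPy check below reproduces the stated porosity and permeability ranges; the assumed range of roughly -3 to 3 for the standard multi-Gaussian field values of m is an illustration, not a value taken from the text.

import numpy as np

def porosity_from_m(m):
    return 0.05 * m + 0.25                 # phi = 0.05 m + 0.25

def permeability_from_phi(phi):
    return 5.0 + 1e7 * phi**10             # k = 5 + 10^7 phi^10, in mD

phi = porosity_from_m(np.array([-3.0, 3.0]))   # assumed extremes of the Gaussian field
print(phi)                                     # [0.1  0.4]
print(permeability_from_phi(phi))              # approximately [5 mD, 1054 mD], the quoted ranges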
5% Gaussian noise is added to each of the production data features. 93 Figure 3.9: (Left) Gaussian dataset with sample model realizations (map view) from five (5) distinct scenarios. (Right) Field set-up for the Gaussian dataset. 94 To train the LSI architectures, the size of the training, validation and testing datasets is 300, 100 and 100 respectively. Figure 3.10: Comparison of inversion solutions (Gaussian realizations) from coupled and decou- pled approach. Figure 3.10 shows the comparison of inversion solutions (Equation 3.4) from coupled and decoupled LSI approaches. The reference cases (denoted as m re f , from the testing dataset) are selected for demonstration according to the distance of each respective d obs to the mean of D (from the training dataset, of respective scenario) where the distance increases by row. In general, we observe that the inversion solutions generated by coupled LSI are more plausible than decoupled LSI especially when the geologic features in the reference cases are not common in the training dataset. The solutions appear to be smooth as d obs has no sensitivity to small scale variogram continuity features. 95 Figure 3.11: RMSE of simulated data match from inversion solutions of Gaussian and fluvial testing dataset. Approach C (coupled) and D (decoupled). Figure 3.11 (top row) shows the average root-mean-square error (RMSE) of simulated data g( ˆ m) from each inversion solution ˆ m of the entire testing dataset when compared to respective d obs . The experiment is repeated 10 times for each scenario and each approach to consider minor variations in the inversion solutions caused by the stochastic nature of neural network optimiza- tion. The RMSE standard deviation of the experiments is reported as an error bar around the mean RMSE of the experiments. Across all scenarios, the RMSE error bar is typically wider for decou- pled LSI as the architectures are trained independently and are more susceptible to fluctuations. The mean RMSE is consistently lower for coupled LSI compared to decoupled LSI as the correspondence between production response data D and variogram continuity features in M is taken into account in the construction of latent spaces Z m and Z d . Furthermore, the components Dec ψ and Enc θ (with reconstruction loss functions) enforce that the meaningful latent spaces cover the full subspaces of M and D consequently allowing coupled LSI approach to generalize better for more plausible predictions. The average RMSE (5− 10%) indicates that the inversion solutions from both coupled LSI and decoupled LSI may be accepted as solutions, however the robustness of coupled LSI becomes evident with underrepresented reference cases. 96 3.3.2 Example 2: Synthetic 2D Fluvial Model In this example, we consider a non-Gaussian two-dimensional reservoir of similar dimension as in Section 3.3.1. Two training images (TI) simulated with object-based modelling algorithm [196] in Petrel [222], as shown in Figure 3.12 are derived from field analogs of a meandering and an anastomosing fluvial systems. The TIs are used in Multi-point Statistics (MPS) algorithm [196] to simulate 500 conditioned realizations (N= 500) for each geologic scenarios shown in Figure 3.13. A complete description of geostatistical parameters used can be found in Mohd Razak and Jafar- pour [163]. In practice, sedimentary systems are composed of complex geologic features resulting from continuous cycle of detritus deposition, lithification and erosion whereas present-day areal analogs (i.e. 
satellite images) provide only a morphological snapshot of this intricate process [194]. Therefore, the simulated prior model realizations are to capture the uncertainty in chan- nel geometry (sinuous or straight), azimuth, thickness-to-width ratio, and connectivity (isolated or intersecting). Figure 3.12: Field analogs (map view of satellite image) of a meandering and an anastomosing fluvial environment with corresponding training image (map view). 97 The fluvial field is composed of binary facies, where the non-sand and sand facies are assigned (φ, k) pairs of (0.03, 5 mD) and (0.23, 420 mD) respectively along φ− k transform function k= 5+10 9 φ 10 . Production response data is then collected as in the previous example. Figure 3.14 shows the comparison of inversion solutions (Equation 3.4) from coupled and decoupled LSI. The reference cases are selected according to the same criteria as described in Section 3.3.1. The second column in Figure 3.14 shows the reconstruction ( ˆ m= Dec θ (Enc θ (m re f ))) of each reference case m re f where the subspaces spanned by the model autoencoder are capable of representing the complex geologic features that are present. The reconstructions from decoupled (not shown) and coupled LSI are similar, suggesting comparable performance of model autoencoders in the two approaches (note the similar reconstruction RMSE on testing dataset as shown in Figure 3.8 and Figure 3.6). Figure 3.13: (Left) Fluvial dataset with sample model realizations (map view) from five (5) distinct scenarios. (Right) Field set-up for the fluvial dataset. 98 In general, the geologic features that are recovered in the inversion solutions depend on the amount of information in d obs . This is observed in the first reference case (top row in Figure 3.14) where curvilinear features of the meandering channel north of I9 and P2 are not reproduced by both coupled and decoupled LSI, albeit present in the reconstruction. For this reference case, the inversions and simulated data mismatch from the two approaches are comparable. Coupled LSI results in a better inversion outcome for the second reference case (a more challenging test case) where the channel meander amplitude and radius of curvature are sufficiently recovered. The decoupled LSI approach is not able to accurately regress between the geologic features encoded within Z m resulting in a less plausible prediction. Figure 3.14: Comparison of inversion solutions (fluvial realizations) from coupled and decoupled approach. The simulated data (g( ˆ m)) from both inversions of this test case are plotted in Figure 3.15. The relative distance of d obs scatter points to the entire testing dataset (represented by the gray points distributed in the backdrop) indicates the closeness of d obs to the mean of D. Although both inversions result in reasonable data match, the inversion from coupled LSI fits more tightly to d obs . 99 The nearest model realizations to m re f in model space M and the nearest production data response to d obs in data space D (corresponding models are shown) are displayed in Figure 3.16. Figure 3.15: Comparison of simulated data match from inversion solutions (for row 2 in Fig- ure 3.14) Figure 3.11 (bottom row) shows that coupled LSI consistently results in lower RMSE compared to decoupled LSI. Higher overall RMSE (compared to Gaussian testing dataset) of simulated data match for fluvial testing dataset is attributed to the complexity in the relationship between M and D where a small variation in geologic features (i.e. 
varying degrees of channel confluence) can result in a profound change in the production behaviour. As a further test, we artificially introduce 100 Figure 3.16: 4 nearest-neighbors (NN) to m re f in model space M and 4 nearest-neighbors (NN) to d obs in data space D (corresponding models are shown) for row 2 in Figure 3.14. an east-west trending channel connecting I9 and P2 in the third reference case (third row in Fig- ure 3.14). For this test case, the training dataset for coupled and decoupled LSI is composed of M and D from scenario 3 with predominantly northeast-southwest trending fluvial features. The inversion solution from coupled LSI honors the injector-producer connection and contains fluvial features that belong to scenario 3. Decoupled LSI however fails to predict relevant channel features that can reproduce d obs . We have demonstrated the capability of coupled LSI for the purpose of inversion. When di- mensionality reduction is done simultaneously with learning the inverse mapping between M and D, correspondence between the latent spaces Z m and Z d is also consequently learned, allowing for exploration in the low dimensional domain. Figure 3.17 (left) shows the good match between predicted production profiles ( ˆ d= Dec ψ (Enc ψ (d)) and normalized production response data from the testing dataset. This suggests that the data autoencoder has learned a meaningful latent space representation, Z d for D. Histograms (normalized) of z d for the entire testing dataset are visual- ized by dimension in Figure 3.17 (right). The d obs of the third reference case in Figure 3.14 is mapped to its latent variable (z d obs = Enc ψ (d obs )) and shown as red lines. We specifyε = 5% and randomly sample 32 noise vectors (E ∈R 32× 10 ) to explore the manifold around z d obs (depicted as step histograms in Figure 3.17). Each element of Z d obs (equivalently z d obs +E ) is decoded to be a 101 Figure 3.17: (Left) Match between predicted production profiles (by Dec ψ (Enc ψ (·))) and normal- ized profiles for test dataset. ( Right) z d visualized by dimension (6 shown out of 10), for case row 3 in Figure 3.14. 102 meaningful variation (e.g. delay in arrival of wetting phase at the producers) around d obs , shown as Dec ψ (z d obs +E) in Figure 3.18. Figure 3.18: Simulated data from inversion solutions ˆ M = Dec θ (Reg dm γ (z d obs +E)) compared to reconstructions Dec ψ (z d obs +E) (for case row 3 in Figure 3.14). The exploratory points around z d obs are fed to the regressor and model decoder to obtain a set of inversion solutions ˆ M= Dec θ (Reg dm γ (Z d obs )) that contains variations in geologic features corre- sponding to the variations in production response data. Figure 3.19 shows ˆ M for the first reference case in Figure 3.14 where the level of lateral accretion of the upper meander (penetrated by I9 and P2) and the lower meander (penetrated by I8 and P1) affects the production response. Figure 3.20 103 shows ˆ M for the second reference case in Figure 3.14 where the variations in production response correspond to variations in channel width and curvature. Figure 3.21 shows ˆ M for the third refer- ence case in Figure 3.14 where coupled LSI uses various geologic features from scenario 3 to form a connected channel body between I9 and P2. Figure 3.19: Predicted model realizations Dec θ (Reg dm γ (z d obs +E)) for d obs (row 1 in Figure 3.14). Figure 3.20: Predicted model realizations Dec θ (Reg dm γ (z d obs +E)) for d obs (row 2 in Figure 3.14). 
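Once the coupled architecture is trained, the latent-space exploration described above reduces to a few array operations. The sketch below assumes the trained components encoder_d, reg_dm, decoder_m and decoder_d from the earlier sketches, and treats the 5% perturbation as additive Gaussian noise scaled by the magnitude of z_d_obs; the exact noise model is an assumption, as it is not prescribed in the text.

import numpy as np

G, H, eps = 32, 10, 0.05                             # 32 samples, latent dimension 10, 5% noise

z_d_obs = encoder_d.predict(d_obs[None, ...])        # shape (1, H)
E = eps * np.abs(z_d_obs) * np.random.randn(G, H)    # assumed noise model around z_d_obs
Z_d_obs = z_d_obs + E                                # exploratory points in the data latent space

M_hat = decoder_m.predict(reg_dm.predict(Z_d_obs))   # ensemble of inversion solutions
D_hat = decoder_d.predict(Z_d_obs)                   # data reconstructions Dec_psi(z_d_obs + E)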
In Figure 3.18, comparison of simulated data g( ˆ M) obtained from ˆ M (of the third reference case) to the data reconstructions Dec ψ (z d obs +E) shows that the neigbouring points around z d obs in the manifold of Z d map to neigbouring points in the manifold of Z m . We report that as the distance of d obs to the mean of D increases, a higher percentage of g( ˆ M) is not able to reproduce d obs within a reasonable noise range. For the three test cases in Figure 3.14, out of the 32 sampled 104 Figure 3.21: Predicted model realizations Dec θ (Reg dm γ (z d obs +E)) for d obs (row 3 in Figure 3.14). neighbors, 1,3 and 4 (respectively) outliers are observed. This is attributed to the imperfect map- ping between, as well as the imperfect construction of the manifolds for Z m and Z d as data points become increasingly sparse. The histograms (normalized) in Figure 3.22 (Top) show model realizations from the testing dataset visualized as latent variables (Z recon m = Enc θ (M)). The latent variables obtained from the inversion (Z inv m = Reg dm γ (Enc ψ (D))) of testing dataset are represented as unfilled step histograms. For the third reference case m re f in Figure 3.14, red lines in Figure 3.22 (Bottom) represent the latent variables for model reconstruction (z recon m re f = Enc θ (m re f )). The blue lines represent the latent variables from inversion (z inv m re f = Reg dm γ (Enc ψ (d obs ))). In Figure 3.22 (Bottom), the unfilled step histograms represent the latent variables for ˆ M in Figure 3.21 obtained from Z m re f = Reg dm γ (Z d obs ). Note that for any reference case in Figure 3.14, the second column corresponds to the decoded model latent variables Dec θ (z recon m re f ) and the third column corresponds to Dec θ (z inv m re f ). The subspaces spanned by Z recon m cover the diverse fluvial features that are present in M where each dimension of the latent variables represents different mode of geologic variation (i.e. channel curvature and width). The narrower range of variability observed in Z inv m indicates the sensitiv- ity of D on fluvial features that exist in M, as observed in the examples we have presented. In other words, for any m re f , z inv m re f will approach z recon m re f when d obs can completely recover all geologic features. This behavior is also observed in our earlier works when dimensionality reduction is 105 Figure 3.22: (Top) z m for testing dataset visualized by dimension, 5 shown out of 64 (z m re f for case row 3 in Figure 3.14). (Bottom) Histograms showing latent variables for set of inversion solutions. performed with PCA [163, 164]. The cross-correlation matrix for dimensions of z m and z d (of the testing dataset) in Figure 3.23 illustrates the complex relationship between the latent spaces that is captured by coupled LSI. The behavior of coupled LSI as a data-driven inversion method is conve- niently observed when the dimensionality reduction process performed by the model autoencoder is informed of the flow physics within production response data. Figure 3.23: Cross-correlation matrix for dimensions of z m and z d for testing dataset. 3.3.3 Example 3: Large-scale Model (Based on Volve Field) Coupled LSI is applied to a large-scale example based on V olve field in the North sea. This three- dimensional reservoir of approximate dimension 5000 m× 4000 m× 80 m is discretized into a 106 78× 87× 15 domain (m∈R 78× 87× 15 ). 
A complex fault system in the field creates varying degrees of sand juxtaposition that divides the reservoir into hydraulically-separated producing regions. We assume that the structural and fault framework are certain. Figure 3.24 shows the field set-up (2D map view) with 4 producers and 3 water injectors (marked in red) that penetrate several of the fault blocks. In Figure 3.24, the upscaled (i.e blocked) facies and porosity logs for the wells used as conditioning data are shown in a well section window. The facies grid is populated using MPS algorithm [196] in Petrel [222] guided by the training volume shown in Figure 3.24 and is conditioned to the upscaled facies logs from each well. Figure 3.24: V olve field set-up (2D map view), training volume used in MPS algorithm, and facies and porosity logs (in true vertical depth) for wells. 107 The porosity field is then populated using Sequential Gaussian Simulation (SGS) algorithm [196] conditioned to the upscaled porosity logs. The simulation of porosity grid is done separately for each facies type. The statistic of porosity field for the sand facies honors the distribution seen in the upscaled logs (Figure 3.25). Due to sampling bias of the shale facies where all wells penetrate only the channel body, a porosity distribution for the shale facies is assumed. Next, the permeability field is calculated using simple transform functions k = 5∗ 10 7 φ 7.2 (for sand) and k= 1+ 10 8 φ 8.1 (for shale) to mimic the ranges seen in the actual field, as plotted in Figure 3.25. Figure 3.25: V olve porosity-permeability transform functions for sand and shale facies. For the experiment in this section, 500 conditional realizations (N= 500) are generated where the size of the training, validation and testing datasets is 300, 100 and 100 respectively. The average Net-To-Gross (NTG) for the 500 model realizations is 22.8%. A water-flooding system is considered for a total simulation time of 13 years where production data is collected every 108 month (i.e. T = 157, F = 35, d∈R 157× 35 ), with approximately 40% oil recovery factor at the end of simulation. Further details of the LSI architectures used for this example are provided in the Appendix. Note that for the model autoencoder, the dimension of z m is 64 (K= 64) and three-dimensional convolutional filters are used to capture the lateral and vertical heterogeneity. In this example, the dimension of z d is doubled to 20 (H = 20) and the number of one-dimensional convolutional layers is also doubled to better capture the complex flow data response from time-varying control trajectories. For this large-scale example, the total run time to parallel simulate 500 flow responses (D) is approximately 12 hours with 1 compute node that has 20 cores (Intel Xeon 64 GB, 2.4 GHz). The coupled LSI architectures are trained using previously mentioned GPU and strategy for approximately 3.7 hours (spanning 1500 epochs). Figure 3.26 shows the model reference case m re f for d obs and m re f filtered to display only the sand facies (i.e. channel body) with vertical heterogeneity. The reconstruction of m re f from the model autoencoder (Dec θ (Enc θ (m re f ))) shows that fine scale vertical Gaussian features are smoothed while large scale lateral fluvial features are reproduced well. 
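For reference, the facies-dependent porosity-permeability transforms used above to populate the permeability grid can be written as a small vectorized function. The coefficients below are those quoted in the text; the facies coding (1 for sand, 0 for shale) is an assumption made for illustration.

import numpy as np

def volve_permeability(phi, facies):
    """Facies-dependent phi-k transforms; permeability returned in mD."""
    k_sand  = 5e7 * phi**7.2           # k = 5 * 10^7 * phi^7.2 for sand
    k_shale = 1.0 + 1e8 * phi**8.1     # k = 1 + 10^8 * phi^8.1 for shale
    return np.where(facies == 1, k_sand, k_shale)

# Example: a sand cell with phi = 0.20 and a shale cell with phi = 0.05
print(volve_permeability(np.array([0.20, 0.05]), np.array([1, 0])))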
The inversion outcome from coupled LSI (Dec θ (Reg dm γ (Enc ψ (d obs )))) contains geologically consistent fluvial features although the channel body southeast of P4 is not reproduced as it lies in another fault block that d obs is not hydraulically sensitive to. This can be observed from the wetting-phase saturation grid in Figure 3.27 (top row) where the area southeast of P4 remains unswept at the end of simulation time as the throw of the fault blocks create hydraulic compartments. We also observe that some cells in the uppermost layers of the channel body that is away from wells are not recovered as injection fluid tends to sweep the lower layers due to gravity effect. Two samples (out of 32) of inversion solutions ˆ M = Dec θ (Reg dm γ (z d obs +E)) for data points around d obs (with 5% noise) are shown in the bottom row of Figure 3.26. The models in ˆ M contain variations in channel width and connectivity patterns that dominate the flow behavior within d obs . Vertical variogram heterogeneity appears smoothened as d obs has no sensitivity to these features since all producers and injectors are vertical wells and completed across the entire pay zone. There 109 Figure 3.26: (Top) The model reference case m re f for d obs and model reference case filtered to show only sand facies. (Middle) Reconstruction of reference model, Dec θ (Enc θ (m re f )) and inversion from coupled LSI, Dec θ (Reg dm γ (Enc ψ (d obs ))) (Bottom) Samples from ˆ M = Dec θ (Reg dm γ (z d obs + E)) 110 Figure 3.27: Wetting-phase saturation grid (layer 13 is shown) of each example in Figure 3.26 for initial, year-4 and final timesteps. 111 are varying channel geometry and orientations (especially between I9 and P4) that could reproduce d obs . As illustrated in Figure 3.27, the saturation grids (at initial, year-4 and final timesteps) for each example in Figure 3.26 show similar wetting-phase arrival times at the producers. In Figure 3.28, we compare simulated data g( ˆ M) obtained from ˆ M to the data reconstructions Dec ψ (z d obs +E). For this particular reference case, well P1 produces at a very small rate from a small compartment with little variation in D and is not shown. As the data latent space Z d preserves the distance between points in D, the neigbouring points around z d obs (i.e. z d obs +E ) in the manifold of Z d translate to structured variations (i.e. initial rate, water breakthrough time) in D. Subsequently, z d obs +E map to neigbouring points around the optimal solution in the manifold of Z m to yield ˆ M. For this example, 4 outliers are observed out of the 32 sampled neighbors where g( ˆ M) is not able to reproduce d obs within a reasonable noise range. As previously mentioned, this is due to the imperfect construction and mapping of Z m and Z d . In contrast to our simpler examples in section 3.3.1 and 3.3.2, for this realistic example, D contains peaks and throughs in response to trajectorial changes in pressure drawdown (for produc- ers) and injection rate (for injectors) that could be mistaken for noise by coupled LSI. An example of this can be viewed in Figure 3.28 for well P2 where ˆ d obs does not reproduce the increase in production rate seen around month 30 in d obs . Nonetheless, ˆ M contains relevant geologic features that are able to reproduce the said increase in rate as shown by g( ˆ M). 
Given the complexity of this dataset, the data match between d obs (and Dec ψ (z d obs +E)) and g( ˆ M) is acceptable as considerable reduction in uncertainty by the ensemble of calibrated model realizations ˆ M around the inversion solution is evident when compared against D. In Figure 3.29, our sensitivity analysis on the number of (m,d) pair used for training the coupled LSI architectures shows that 300 training data points are able to give reasonable performance where additional number of data points only introduce marginal improvements that may not offset the computational cost arising from additional forward simulation runs to obtain more (m,d) pairs. For this analysis, the model inversion error refers to Equation 3.5 while data mismatch error refers to the difference between d obs and g( ˆ m) for the entire testing dataset (size kept constant at 100 for 112 Figure 3.28: Simulated data (non-wetting phase rate of producers, wetting phase cut of producers and bottom-hole pressure (BHP) of injectors) from inversion solutions ˆ M = Dec θ (Reg dm γ (z d obs + E)) compared to reconstructions Dec ψ (z d obs +E) (for examples in Figure 3.26 and Figure 3.27). 113 Figure 3.29: Sensitivity analysis on the number of (m,d) data pair for training coupled LSI archi- tecture. a fair comparison). In general, a larger training dataset provides a richer representation of geologic features and corresponding flow response trends that results in improved generalization power of Z m and Z d . 3.4 Summary and Discussion In this chapter, we demonstrate two LSI approaches to develop a direct nonlinear mapping from the data space to the model space as an alternative to conventional inverse modeling formulations. We introduce decoupled LSI where low dimensional manifolds are constructed to represent high dimensional model realizations and production response data before learning the correspondence between the manifolds. We also introduce coupled LSI where construction of the low dimensional latent spaces is informed of the final objective that is to learn the inverse mapping between pro- duction response data and geologic features present in the prior model realizations. With examples from non-Gaussian priors and a nonlinear forward model, we show that combining the process of parameterization with data conditioning has an advantage over traditional use-case of autoencoder in subsurface modelling where dimensionality reduction of model realizations is done prior to data 114 conditioning. The reason behind this is that while model autoencoder can be trained to optimality to generalize for diverse geologic features, it is not trained to prepare combination of features that can be used to reproduce observed historical production data. Coupled LSI approach tailors the process of dimensionality reduction to governing flow physics to result in a data-informed parame- terization that does not only exploit the redundancy of large-scale geologic features but also retain features that are sensitive to flow data. The decoupled and coupled LSI architectures we presented are equivalent in terms of their learning capacity and each individual component has sufficient capability (trainable weights) for either reconstruction or inversion. Therefore the improvement in performance is attributed to the joint learning of model and data compression that is informed of the inversion loss function. 
We demonstrate the convenience of coupled LSI where simultaneous dimensionality reduction and in- version result in a decrease in training time and reduced information loss. An added advantage of working in low dimensional spaces is the increase in interpretability of neural network architec- tures. Meaningful latent spaces with robust nonlinear mapping allow the exploration of data and model spaces that is useful when observed data is noisy and multiple inversion solutions can be accepted. Additionally, the distributions of latent variables in the data-informed model latent space provide an insight on the amount of support that production response data has on geologic features for a non-unique inversion problem. We have also considered a variant of autoencoder, the V AE where a prior Gaussian distribution is imposed on the multi-modal model and data latent variables. As a regularizer, the added con- straint causes a decrease in model and data reconstruction quality where a minimization step in the direction of honoring the prior Gaussian distribution does not agree with an update direction for minimizing reconstruction loss [51]. Therefore in this work, LSI does not include such constraint as our main objective is not random generation of model realizations (from the imposed Gaussian prior distribution) but rather prediction of plausible model realizations conditioned on observed historical production data. 115 The experiments we have performed also do not consider uncertainty in geologic scenario where prior model realizations from multiple scenarios may form a more complex latent space with clusters (representing distinct scenarios) that are far apart. As such, future works will consider uncertainty in geologic scenario where the inversion problem is further complicated by an even diverse set of geologic features that can reproduce the observed data. LSI is a general framework that promotes joint learning of multiple loss functions to create meaningful latent spaces. There exist multiple ways of designing neural networks for dimensionality reduction and learning inverse mapping. For this chapter, we have considered the most optimal and most lean design given the dataset we have used. Further works on coupled LSI architectures include a hybrid of online and offline inversion methods where the resulting data-informed model latent space (trained with data up to current time) is used in a gradient-based inverse problem formulation (to integrate future data as they become available) for faster convergence and to avoid costly retraining. LSI can potentially be integrated with a control optimization workflow for closed-loop field management and development. Addi- tionally, LSI attempts to preserve the geologic realism/consistency of inversion solutions through the manipulation of the latent spaces that translates to feature-based calibration in the full dimen- sional space (versus pixel-based calibration). However, we note that due to the complexity of the inversion process, the resulting inversion solutions are still somewhat blurry (with continuous pa- rameter values) when compared to the original prior models (with discrete parameter values). This is an important research problem that will be addressed in future works. As a data-driven method, LSI requires a simulation run for each of the prior model realization to generate the training data. 
In our examples, the training and validation datasets include a total of 400 model realizations that require 400 simulation runs, based on the typical number of prior models considered in practice. As we have demonstrated in section 3.3.3, the computational com- plexity involving numerous forward simulation runs is alleviated by parallel computations on a high-performance computing cluster with multiple compute nodes. Given the increasing ubiquity of high-performance computing cluster and access to cloud computing, data-driven methods are 116 undeniably gaining more traction. Additionally, data collection and the training of LSI architec- tures can be performed offline before observed data becomes available. In summary, we have presented decoupled and coupled LSI approaches for parameterization and inversion of subsurface flow models using deep convolutional autoencoders. While both ap- proaches demonstrate encouraging outcomes, coupled LSI as a data-informed autoencoding ap- proach shows improved performance for underrepresented reference cases. The resulting data- informed model latent space opens up many interesting avenues for future applications in hydro- geology and petroleum engineering. Recent advances in machine learning research combined with domain expertise in hydrogeology and petroleum engineering will continue to drive innovations for dimensionality reduction and robust pattern learning in characterizing subsurface flow models. 117 Chapter 4 Latent Space Data Assimilation (LSDA) in Subsurface Flow Systems In this chapter, we present a new deep learning architecture for efficient reduced-order imple- mentation of ensemble data assimilation. Motivated by recent advancements in deep learning, we propose the Latent-Space Data Assimilation (LSDA) framework as illustrated in Figure 4.1, that combines a novel deep learning neural network architecture for efficient reduced-order implemen- tation of ensemble data assimilation. Specifically, deep learning is used to improve two important aspects of data assimilation workflows: (i) low-rank representation of complex reservoir property distributions for geologically consistent feature-based model updating, and (ii) efficient prediction of the statistical information between model parameters and data to serve as a proxy model, thereby eliminating the need for a computationally prohibitive physical simulators that are required for model updating. The proposed method uses deep convolutional autoencoders to nonlinearly map the original complex and high-dimensional parameters onto a low-dimensional parameter latent space that compactly represents the original parameters. In addition, a low-dimensional data latent space is constructed to predict the observable response of each model parameter realization, which can be used to compute the statistical information needed for the data assimilation step. The two mappings are developed as a joint deep learning architecture with two autoencoders that are connected and trained together. The training procedure uses an ensemble of model pa- rameters and their corresponding production response predictions as needed in implementing the 118 Figure 4.1: Latent-Space Data Assimilation (LSDA) framework. standard ensemble-based data assimilation frameworks. Simultaneous training of the two map- pings leads to a joint data-parameter manifold that captures the most salient information in the two spaces for a more effective data assimilation, where only relevant data and parameter features are included. 
Moreover, the parameter-to-data mapping provides a fast forecast model that can be used to increase the ensemble size for a more accurate data assimilation, without a major computational overhead. The proposed architecture is illustrated in Figure 4.2, where two deep convolutional Variational Autoencoders (V AE) are used to compress the model realizations and corresponding production response data into Gaussian model and data latent variables respectively. A neural net- work regression model links the two V AEs to form a single deep learning architecture that is jointly trained, and act as a proxy model that learns the complex relationship between the model and data latent variables. For data assimilation methods that involve iterative schemes such as ESMDA, the proposed approach offers a computationally competitive alternative. The proposed LSDA framework is demonstrated using 2D synthetic Gaussian and non-Gaussian fields, as well as a 3D field-like example based on the V olve field in the North Sea. Specifically, a set of model-data pair is used to train the proposed architecture. Once the architecture is trained, the model encoder is used to obtain the latent space representations for the ensemble of prior re- alizations. The set of model latent variables representing the priors are assimilated with ESMDA 119 Figure 4.2: Schematic of neural network architecture for LSDA. where the physical correlations between the model latent variables and data latent variables are given by the latent space proxy model. Once the assimilation step is complete, the final pos- terior models are obtained by feeding the assimilated model latent variables into the previously trained model decoder. A fully low-dimensional implementation of ESMDA using the proposed deep learning architecture offers several advantages compared to standard algorithms, including joint data-parameter reduction that respects the salient features in each space, geologically consis- tent feature-based updates, increased ensemble sizes to improve the accuracy and computational efficiency acquired from the use of an efficient proxy model. 120 Figure 4.3: Autoencoder architecture for model realizations. 4.1 Parameterization and Forward Mapping with Deep Learning Techniques 4.1.1 Dimension Reduction and Prediction with Autoencoders Autoencoder (AE) is a neural network architecture for nonlinear dimensionality reduction [116] that consists of two components, an encoder and a decoder. The actual architecture of the deep two-dimensional convolutional autoencoder used in this work for the prior model realizations m∈ R M× M , is depicted in Figure 4.3 as blocks representing the output of each layer. The autoencoder is implemented with the deep learning library Keras (version 2.2.4) [45]. For more details on the mechanism of each function, we refer readers to relevant literature in computer science (e.g. [45, 199]). The encoder Enc θ (·) is composed of several main layers, each consisting of a convolutional function (denoted as conv2D and whose output is color-coded in Figure 4.3), leaky-ReLU (Rec- tified Linear Unit) non-linear activation function ( lrelu) and a pooling (down-sampling) function (pool). The convolutional operation extracts local spatial features in the model realizations while 121 the nonlinear activation function allows the representation of nonlinear features within the autoen- coder. 
Successive pooling functions are used to gradually reduce the dimensionality of input to obtain the desired compact representation as model latent variables, z m ∈R K× 1 . The decoder Dec θ (·) takes z m as the input and does the opposite of Enc θ (·) where z m is grad- ually upsampled to obtain a reconstruction ˆ m of m. Similar to Enc θ (·), Dec θ (·) is also composed of several main layers except pool is now replaced with upsample. To reduce kernel artifacts from the downsampling and upsampling operations, we recommend decreasing and increasing the di- mension by no more than a factor of 2 between each layer. The encoder and decoder are connected together and trained with the following loss function L(θ)= N ∑ ∥m− Dec θ (Enc θ (m))∥ 2 2 (4.1) where once θ is learned, the model latent variables is obtained with z m = Enc θ (m) and the corresponding reconstruction is given by ˆ m= Dec θ (z m ). Note that even when the prior model realizations m are conditioned to static data (e.g., well data), the reconstructed models ˆ m may show a very small deviation (i.e., less than 5%) in values at the well locations. An additional weighted term can be added to the loss function in Equation 4.1 to ensure that the value at the well locations are reproduced. Alternatively, a simpler post-processing step of reassigning the known value at the well locations can be performed. The actual architecture of the autoencoder for production response data d∈R T× F , is depicted in Figure 4.4 as blocks representing the output of each layer. T and F represent the data timesteps and production data features, respectively. The main layers within the data encoder Enc ψ (·) are each composed of a one-dimensional convolutional function (denoted as conv1D and whose out- put is color-coded in Figure 4.4), leaky-ReLU non-linear activation function (lrelu) and a one- dimensional pooling (down-sampling) function (pool). For the time series, one-dimensional con- volutional operation extracts local temporal features in the production response data. Successive temporal pooling functions are used to gradually reduce the dimensionality of input to obtain the desired compact representation as data latent variables, z d ∈R H× 1 . 122 Figure 4.4: Autoencoder architecture for simulated data. Figure 4.5: Regression model architecture for model and data latent variables. 123 The input z d to the data decoder Dec ψ (·) is gradually upsampled to produce a reconstruction ˆ d of d. Similar to Enc ψ (·), Dec ψ (·) is also composed of several main layers except pool is now replaced with upsample. To make Enc ψ (·) robust against noisy data with random fluctuations, the kernel size of conv1D is initially set as 3 (in months) and later increased (to 6) in successive layer. Similar to the model autoencoder, we recommend decreasing (pool) and increasing (upsample) the dimension by no more than a factor of 2 between each layer to reduce kernel artifacts. The encoder and decoder are connected together and trained with the following loss function L(ψ)= N ∑ ∥d− Dec ψ (Enc ψ (d))∥ 2 2 (4.2) where once ψ is learned, the data latent variables is obtained with z d = Enc ψ (d) and the cor- responding reconstruction is given by ˆ d= Dec ψ (z d ). The forward mapping between z m and z d is approximated with a neural network denoted as Reg md γ (·). The architecture of this regressor is shown in Figure 4.5 as blocks representing the output of each layer. 
In $\mathrm{Reg}^{md}_{\gamma}(\cdot)$, the nonlinearity between $z_m$ and $z_d$ is captured by multiple fully-connected (dense) layers with nonlinear activations. The loss function used to train $\mathrm{Reg}^{md}_{\gamma}(\cdot)$ is

$\mathcal{L}(\gamma)=\sum^{N}\lVert \mathrm{Reg}^{md}_{\gamma}(z_m)-z_d\rVert_2^2$   (4.3)

where, once $\gamma$ is learned, the predicted data latent variables are given by $\hat{z}_d=\mathrm{Reg}^{md}_{\gamma}(z_m)$.

4.1.2 Constraining the Latent Spaces with Variational Autoencoders

The trained neural network architecture described in the previous section can simultaneously perform dimensionality reduction and prediction of the production response behaviour (or its latent space representation) for any model realization given as input. To constrain the model and data latent spaces to conform to a specific distribution, we formulate the autoencoders in the original neural network architecture as Variational Autoencoders (VAE) [117]. A VAE learns stochastic mappings from any model space $p_M$ or data space $p_D$, typically described by a complex empirical distribution, to a model latent space $p_{z_M}$ or data latent space $p_{z_D}$ approximated by a simpler distribution (typically Gaussian). In this section, we describe the VAE as a dimension-reduction neural network architecture for the set of model realizations $M$, yielding the set of latent space representations $z_M$ where $p_{z_M}\approx\mathcal{N}(\mu,\mathrm{diag}(\sigma))$. The same description applies to the dimension reduction of the set of simulated production response data $D$, yielding the latent representations $z_D$ where $p_{z_D}\approx\mathcal{N}(\mu,\mathrm{diag}(\sigma))$. For clarity, in this section the encoder for the model realizations is denoted $\mathrm{Enc}_{\theta}(\cdot)$ and the model decoder is denoted $\mathrm{Dec}_{\theta'}(\cdot)$.

The mathematical formulation of the VAE is based on variational Bayesian inference in directed graphical networks and can be posed as

$p(z_m|m)=\dfrac{p(m|z_m)\,p(z_m)}{p(m)}$   (4.4)

where $p(z_m|m)$ is the conditional distribution of the model latent variables given the model realizations. Equation 4.4 is hard to evaluate because $p(m)$ is not available in closed form and is intractable to compute (i.e., it requires exponential time) due to the multiple integrals involved when integrating over the latent vector $z_m$. Therefore, we approximate $p(z_m|m)$ by another distribution $q(z_m|m)$ that admits a tractable solution (such as a Gaussian) using variational inference, where the Kullback-Leibler (KL) divergence [124] between the two distributions is defined as

$D_{KL}[q(z_m|m)\,\|\,p(z_m|m)]=\sum_{z_m} q(z_m|m)\log\dfrac{q(z_m|m)}{p(z_m|m)}=\mathbb{E}_{z_m\sim q(z_m|m)}[\log q(z_m|m)-\log p(z_m|m)]$.   (4.5)

Substituting Equation 4.4 into Equation 4.5 gives

$D_{KL}[q(z_m|m)\,\|\,p(z_m|m)]=\mathbb{E}_{z_m\sim q(z_m|m)}[\log q(z_m|m)-\log p(m|z_m)-\log p(z_m)+\log p(m)]$   (4.6)

where the term $\log p(m)$ can be moved to the left-hand side because the expectation is taken over $z_m$. Equation 4.6 can therefore be rewritten as

$\log p(m)-D_{KL}[q(z_m|m)\,\|\,p(z_m|m)]=\mathbb{E}_{z_m\sim q(z_m|m)}[\log p(m|z_m)]-\mathbb{E}_{z_m\sim q(z_m|m)}[\log q(z_m|m)-\log p(z_m)]$   (4.7)

where the first log-likelihood term on the right-hand side of Equation 4.7 becomes the squared reconstruction error when $p(m|z_m)$ is Gaussian, and is then equivalent to Equation 4.1. The second expectation term in Equation 4.7 is the divergence between the imposed prior distribution of $z_m$ (defined as a Gaussian distribution) and the learned distribution $q(z_m|m)$.
The loss function of the VAE is then described as

$$\mathcal{L}(\theta, \theta') = \mathbb{E}_{z_m \sim q_{\theta}(z_m \mid m)}\left[\log p_{\theta'}(m \mid z_m)\right] - \lambda_{\theta}\, D_{KL}\left[q_{\theta}(z_m \mid m) \,\|\, p(z_m)\right] \tag{4.8}$$

where θ represents the lumped parameters in the model encoder Enc_θ(·) and θ′ represents the lumped parameters in the model decoder Dec_θ′(·). Equation 4.8 is known as the Evidence Lower Bound (ELBO) of the log-likelihood of the model realizations, since the KL divergence is non-negative. By maximizing the ELBO during the training of both the model and data VAEs, the distributions of z_m = Enc_θ(m) and z_d = Enc_ψ(d) both conform to the defined Gaussian distributions p_{z_M} ∼ N(0, I) and p_{z_D} ∼ N(0, I). In this chapter, the Adam optimizer [115] is used to train the proposed neural network architecture.

4.2 Latent Space Data Assimilation (LSDA) Workflow

In this section, we introduce LSDA as a fully low-dimensional implementation of ESMDA that consists of dual autoencoders and a regression model that are jointly trained offline as a single unified architecture. The iterative assimilation steps are performed in the model and data latent spaces and involve only a single decoding step of the posterior model latent variables to obtain the ensemble of high-dimensional posterior models, for computational efficiency. Additionally, the fully low-dimensional approach represents the salient information in the data as data latent variables and can be conveniently extended to include other types of data that need to be assimilated, through a simple concatenation of data latent variables (from multiple sources of disparate data), such as high-dimensional 4D seismic data. The latent space proxy model exploits the inherent low-dimensional representations of the model parameters and simulated data to compute only the statistical information needed for the data assimilation step, and does not involve predicting global pressure and saturation fields.

The main objective of data integration is to improve the predictive capability of reservoir models, based on the presumption that calibrated models should be able to provide reasonable matches to observed historical data. For the model calibration step, a popular and computationally efficient choice is the Ensemble Kalman Filter (EnKF) [1, 64], in which incoming dynamic data are integrated into an ensemble of reservoir models. The core principle behind the EnKF is to use physical correlations (computed via reservoir simulation) between input model parameters and their corresponding predicted responses to update uncertain model parameters based on observed data from the field.

In conventional ensemble-based history-matching workflows, data assimilation with the EnKF is typically performed with a set of full-dimensional prior model realizations M and the corresponding set of simulated data D [62]. In this work, data assimilation is instead performed with the latent space representations z_M and z_D and is referred to as Latent Space Data Assimilation (LSDA). Specifically, each model realization m is represented by latent variables z_m that are updated to generate geological model realizations that are consistent with static data and can reproduce the observed dynamic data d_obs when used in flow simulation.
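Before turning to the ensemble update, a minimal sketch of how the KL-regularized objective in Equation 4.8 can be implemented is shown below for illustration. It assumes that the encoder outputs a mean and log-variance pair (mu, logvar) describing a diagonal-Gaussian posterior, that the prior is the standard normal N(0, I), and that the weight lam plays the role of λ_θ; the closed-form KL expression follows from these Gaussian assumptions.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps the sampling step differentiable
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def neg_elbo(m, m_hat, mu, logvar, lam=1e-3):
    # Negative of Equation 4.8: squared reconstruction error plus the closed-form
    # KL divergence between N(mu, diag(sigma^2)) and the N(0, I) prior
    recon = torch.sum((m - m_hat) ** 2)
    kl = -0.5 * torch.sum(1.0 + logvar - mu ** 2 - torch.exp(logvar))
    return recon + lam * kl

# mu, logvar = encoder_heads(m); z_m = reparameterize(mu, logvar)
# loss = neg_elbo(m, decoder(z_m), mu, logvar)
```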
The original formulation of the EnKF in Evensen [64] is modified such that the physical correlations are approximated by (i) generating prior models M with different (uncertain) input parameters (the ensemble), (ii) using a reservoir simulator to forecast each realization's response D, (iii) obtaining the latent space representations z_M and z_D, and (iv) estimating the sample cross-covariance matrix from the ensemble of latent variables for the model parameters and predicted responses. The EnKF update equation for N realizations of the initial model parameters (i.e., priors), represented as latent variables {z_{m_1}, z_{m_2}, ..., z_{m_N}}, can be expressed as

$$z^{i}_{m_k} = z^{i}_{m_{k-1}} + C_{z_m z_d}\left(C_{z_d z_d} + C_{z_D}\right)^{-1}\left[z^{i}_{d_{obs},k} - z^{i}_{d_k}\right] \tag{4.9}$$

where z^i_{d_obs,k} denotes the latent space representation of the i-th realization of the perturbed measurements at time t_k, and C_{z_m z_d}(C_{z_d z_d} + C_{z_D})^{-1} is the Kalman gain matrix. Here, C_{z_m z_d} is the cross-covariance between the latent space representations of the model parameters and the predicted data, C_{z_d z_d} is the auto-covariance of the predicted data latent variables, and C_{z_D} denotes the covariance matrix of the measurement errors of the observed data latent variables. C_{z_D} is calculated by feeding both the observations and the perturbed observations into the data encoder to obtain their respective latent variables; the errors are then computed as the difference between the two before the covariance is calculated.

Unlike the EnKF, which uses recursive updates in time, the Ensemble Smoother (ES) is an alternative data assimilation technique that computes one global update in the space-time domain. Therefore, ES does not require the modification of restart files for sequential data assimilation steps when a physical reservoir simulator is used, making its application more convenient than the EnKF. For linear models and measurements, ES and the EnKF provide identical solutions, as shown in Evensen [63]. This equivalence, however, does not hold for nonlinear cases, because ES is equivalent to a single, potentially large, Gauss-Newton iteration with a full step and an average sensitivity estimated from the prior ensemble [207]. To improve the performance of ES, Emerick and Reynolds [61] proposed an iterative form of ES called the Ensemble Smoother with Multiple Data Assimilation (ESMDA). In the ESMDA update equation, the measurement error covariance matrix C_{z_D} is multiplied by an inflation factor at each iteration to apply gradual, small corrections to the ensemble. Sufficient damping of the changes in the realizations (or their latent space representations) of the reservoir model at each iteration prevents the overshooting and undershooting that can lead to overly rough estimates of the reservoir models. As proposed by Emerick and Reynolds [61], the inflation factors {α_k}_{k=1}^{N_a} are chosen to satisfy the condition

$$\sum_{k=1}^{N_a} \frac{1}{\alpha_k} = 1. \tag{4.10}$$

In LSDA, the physical reservoir simulator is replaced by a latent space proxy model ẑ_d = Reg^md_γ(z_m) that gives the predicted data latent variables ẑ_d for any model latent variables z_m. The latent space proxy model that maps z_m to z_d removes the computational burden of running the physical reservoir simulator in the N_a iteration steps. Additionally, the model and data latent spaces follow a Gaussian distribution that is amenable to ESMDA. The LSDA workflow is outlined in Table 4.1.
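For illustration, a minimal numpy sketch of the latent-space ESMDA loop (Equations 4.9 and 4.10) is given below. It assumes that the latent ensembles are stored column-wise, that proxy wraps the trained regressor Reg^md_γ and maps a (K, Ne) array of model latent variables to an (H, Ne) array of predicted data latent variables, and that uniform inflation factors α_k = N_a are used; these choices are assumptions, not the exact implementation.

```python
import numpy as np

def lsda_esmda(Zm, z_dobs, C_zD, proxy, n_assim=4, seed=0):
    """One LSDA loop in the latent spaces (Equations 4.9-4.10).
    Zm: (K, Ne) prior model latent ensemble, z_dobs: (H,) observed-data latent vector,
    C_zD: (H, H) latent measurement-error covariance, proxy: maps (K, Ne) -> (H, Ne)."""
    rng = np.random.default_rng(seed)
    alphas = [float(n_assim)] * n_assim            # uniform factors satisfy sum(1/alpha) = 1
    K, Ne = Zm.shape
    for alpha in alphas:
        Zd = proxy(Zm)                             # predicted data latent variables
        dZm = Zm - Zm.mean(axis=1, keepdims=True)
        dZd = Zd - Zd.mean(axis=1, keepdims=True)
        C_md = dZm @ dZd.T / (Ne - 1)              # cross-covariance C_{z_m z_d}
        C_dd = dZd @ dZd.T / (Ne - 1)              # auto-covariance  C_{z_d z_d}
        gain = C_md @ np.linalg.inv(C_dd + alpha * C_zD)
        pert = rng.multivariate_normal(z_dobs, alpha * C_zD, size=Ne).T   # perturbed obs
        Zm = Zm + gain @ (pert - Zd)               # Equation 4.9 with inflated C_zD
    return Zm                                      # posterior latent ensemble

# Zm_post = lsda_esmda(Zm_prior, z_dobs, C_zD, proxy)   # decode once with the model decoder
```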
The rationale of this approach is to inform the model and data parameteriza- tion step of the final objective function, that is to learn the correspondence between m and d. The training process of LSDA is well-behaved such that, any update step in the direction of minimizing model and data reconstruction losses agree with the update direction of parameter-to-data mapping loss. 4.3 Numerical Experiments and Results 4.3.1 Example 1: Synthetic 2D Gaussian Model To demonstrate LSDA, we first consider a two-dimensional reservoir of size 1000 m × 1000 m that is discretized into a 100× 100 domain. The forward model is a two-phase (wetting and non- wetting) flow simulator (Eclipse) and the reservoir is penetrated by four producers ( P1, P2, P3, P4) at each corner of the reservoir and one injector (I8) at the center of the reservoir. 500 conditioned realizations are generated where 300 realizations are used to train the proposed neural network architecture while 99 realizations are used as the prior ensemble for LSDA with 1 hidden reference case (chosen from 101 hidden reference cases). From the training dataset, a 10% validation split is randomly taken at each training epoch. 129 Table 4.1: Pseudocode for data assimilation with LSDA 1 function LatentSpaceDataAssimilation(M, D, N, d obs ): 2 M train ,M test ,D train ,D test = split(M, D) 3 initialize Enc θ ,Dec θ ,Enc ψ ,Dec ψ ,Reg md γ 4 epoch← 1500 5 batchSize← 32 6 numBatch← ⌈N/batchSize⌉ 7 for i= 1 to epoch 8 for j= 1 to numBatch 9 M j ,D j ← batchify(M train , D train ) 10 computeL(ψ,ψ ′ ) as per Equation 4.8 11 update Enc ψ ,Dec ψ 12 computeL(ψ,γ,θ)=∑ m∈M j ,d∈D j ∥d− Dec ψ (Reg md γ (Enc θ (m)))∥ 2 2 13 update Dec ψ ,Reg md γ ,Enc θ 14 computeL(θ,θ ′ ) as per Equation 4.8 15 update Enc θ ,Dec θ 16 z M ← Enc θ (M test ) 17 z d obs ← Enc ψ (d obs ) 18 ˆ z M ← ESMDA(z M , z d obs , Reg md γ ) 19 return Dec θ (ˆ z M ) 20 end function 130 Each prior model realization m is a multi-Gaussian log(k) field that is simulated using the Sequential Gaussian Simulation (SGS) algorithm with SGEMS. To collect flow response data d, the permeability field is obtained by computing the anti-log of m and the porosity is assumed to be constant at 0.2. Approximately one pore volume is injected into the reservoir over 32 months of simulation time. This results in 32 timesteps (T = 32) and 25 production response features (F = 25, i.e. production rates of wetting phase and non-wetting phase, bottomhole pressure for 5 wells). Note that when data is missing for some timesteps in d obs , the missing timesteps can be omitted from the training and testing datasets for consistency. For this experiment, the total run time to parallel simulate 300 flow responses is approximately 2 hours with 1 compute node that has 20 cores (Intel Xeon 64GB, 2.4GHz). The LSDA proxy model architectures are trained until convergence using the ”early-stopping” strategy (when both the training and validation losses converge to a similar value) on a high-performance computing cluster with an NVIDIA Tesla P100 GPU node for approximately 20 minutes for 1000 epochs. In Figure 4.6, the first column shows the histograms of model realizations versus their recon- structions for the training and testing datasets. The distributions of the model realizations are reproduced well by the model autoencoder and the salient spatial features are also reconstructed well as shown in Figure 4.7. The reconstructed model realizations appear smooth due to the 2D convolutional operations on the input model realizations. 
This however does not impact the per- formance of LSDA as the fidelity of the reconstructed models is higher than what can be resolved by the aggregate information provided by the observed historical data. The second column in Figure 4.6 shows the scatter plots of the normalized (i.e. min-max scaling) simulated production response data and their reconstructions for the training and testing datasets. Most of the data points fall on the unit slope suggesting that the data autoencoder is able to reproduce the production responses, and the comparisons between the input responses and reconstructed responses are shown in Figure 4.8. The third column in Figure 4.6 shows the scatter plots of the normalized responses and the predicted responses when the trained neural network architecture is presented with the corresponding model realizations (i.e. Dec ψ (Reg md γ (Enc θ (m)))). 131 Figure 4.6: Histograms and scatter plots comparing model and data with the reconstructions and predictions for Gaussian training and testing dataset. 132 Figure 4.7: Samples of model reconstruction for Gaussian training and testing dataset. Two samples of predicted responses (representing the P50 RMSE from the testing dataset) are plotted in Figure 4.8. Note that generally, the prediction error for parameter-to-data mapping is higher than the model/data reconstruction errors as there is more complexity associated with learning the correspondence between spatial features and production responses. The dual autoencoders in the proposed neural network architecture are able to efficiently repre- sent the high-dimensional model realizations and production responses as low-dimensional model and data Gaussian latent variables (i.e. z m and z d ). Additionally, the regression model (i.e. Reg md γ (·)) that maps z m to z d can be used as a latent space proxy model to replace a full fi- delity physical reservoir simulator in the LSDA workflow outlined in Table 4.1. The histograms in Figure 4.9 illustrate the first and last updates on the ensemble of model latent variables. The histograms in Figure 4.10 show the resulting data latent variables for the first and last updates, ob- tained via the proxy model. The ensembles of model and data latent variables converge towards the solution (denoted as the red lines representing z m re f and z d obs ) as the ESMDA updates are performed using the latent space proxy model. 133 Figure 4.8: Samples of data reconstruction and prediction (denoted as P) from select wells for Gaussian training (top row) and testing (bottom row) dataset. Figure 4.9: Gaussian model latent variables (first 4 shown out of 64) for the first and last iteration. 134 Figure 4.10: Data latent variables (first 4 shown out of 20) for the first and last iteration of the Gaussian dataset. In Figure 4.11, the reference case and its reconstruction are shown on the top left panel. The bottom left panel shows the mean and variance for the prior ensemble and posterior ensemble, reconstructed by feeding the final posterior model latent variables into the model decoder. The right panel shows samples of prior realizations and reconstructed posterior realizations. We observe that the assimilation of z d obs into the model latent variables introduces important spatial features that could reproduce d obs . Notably, the high permeability connection between I8 and P2 is reproduced, as well as the low permeability region between I8 and P4. 
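Only a single decoding pass is then required to recover the posterior ensemble in the original model space. A short sketch of producing the posterior mean and variance maps of Figure 4.11 is given below, re-using illustrative names: Zm_post denotes the posterior latent ensemble returned by the ESMDA sketch above, and model_dec denotes the trained model decoder.

```python
import torch

# Decode the posterior latent ensemble and summarize it as mean and variance maps
with torch.no_grad():
    Zm_post_t = torch.as_tensor(Zm_post.T, dtype=torch.float32)    # (Ne, K)
    M_post = model_dec(Zm_post_t).squeeze(1).numpy()               # (Ne, ny, nx)
post_mean, post_var = M_post.mean(axis=0), M_post.var(axis=0)
```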
In Figure 4.12, the line plots compare the production response data from the prior ensemble (represented by the gray lines), to the reconstructed posterior production responses (represented by the green/blue/purple lines) obtained from the latent space proxy model. The cyan lines represent the mean of production responses obtained by running the physical reservoir simulator on the set of posterior model realizations. This is done as a validation step to ensure that the posterior models are not only geologically consistent but can also reasonably reproduce d obs . The good match shows 135 Figure 4.11: (Left panel) Reference model, its reconstruction and mean and variance of prior ensemble and posterior ensemble. (Right panel) Samples of prior and posterior realizations. that LSDA can perform effective latent space assimilation update using a latent space proxy model, without introducing significant errors. 4.3.2 Example 2: Synthetic 2D Fluvial Model In this example, we consider a non-Gaussian two-dimensional reservoir of similar dimension as in the previous section. A training image (TI) simulated with object-based modelling algorithm in Petrel is derived from field analogs of an anastomosing fluvial system. The TI is used in Multi- point Statistics (MPS) algorithm to simulate 1000 conditioned realizations, where 300 realizations are used to train the proposed neural network architecture while 130 realizations are used as the prior ensemble for LSDA. Each reference case is selected from 570 hidden reference cases. From the training dataset, a 10% validation split is randomly taken at each training epoch. The fluvial field is composed of binary facies, where the non-sand and sand facies are assigned ( φ, k) pairs of (0.03, 5 mD) and (0.23, 420 mD) respectively alongφ− k transform function k= 5+ 10 9 φ 10 . 136 Figure 4.12: Data match for posterior ensemble for the Gaussian dataset. The reservoir is penetrated by two producers (P1 and P2) and two injectors (I8 and I9). Ap- proximately one pore volume is injected into the reservoir over 6 years of simulation time, during which production data is collected every 3 months. This results in 25 timesteps (T = 25) and 20 production response features (F = 20). For this experiment, the total run time to parallel simulate 300 flow responses is approximately 2 .2 hours with 1 compute node that has 20 cores (Intel Xeon 64GB, 2.4GHz). The LSDA architectures are trained until convergence on a high-performance computing cluster with an NVIDIA Tesla P100 GPU node for approximately 25 minutes for 1200 epochs. The first column in Figure 4.13 shows the histograms of model realizations versus their re- constructions for the training and testing datasets. The model reconstructions produced by the model autoencoder can represent the binary facies in the model realizations. The second and third columns show the scatter plots of production response data versus their reconstructions and pre- dictions, respectively for the training and testing datasets. In Figure 4.14, two test samples of reconstructed model are shown in the first and last columns. We uniformly sampled the model latent space between the two test samples to generate a succession of generative samples that 137 Figure 4.13: Histograms and scatter plots comparing model and data with the reconstructions and predictions for fluvial training and testing dataset. 
138 Figure 4.14: Two samples of reconstruction from the testing dataset and generative samples (uni- formly sampled between the two test cases) from the latent space with their corresponding nearest neighbor in the training dataset. demonstrate the rich fluvial features represented by the latent space. The nearest model realization in the training dataset, to each of the generated samples is shown to prove that the autoencoder is able to generalize for unseen test cases. The histograms in Figure 4.15 illustrate the first and last updates on the ensemble of model latent variables. The distribution of the model latent variables follow a Gaussian distribution that is amenable to ESMDA and the updated ensemble converges to the latent variables of the reference case represented by the red lines. In the right panel of Figure 4.16, samples from the prior ensemble are shown with the intermediate and final updated realizations. We observe that updating the model latent variables results in reconstructed posterior realizations that can preserve the geologic continuity that describes the prior ensemble. In general, the geologic features that are recovered in the posteriors depend on the amount of information in d obs . In the left panel of Figure 4.16, the mean of the posterior models agree with the reference model, where the well-connected fluvial features between producers P1, P2 and I9 are recovered. In Figure 4.17, the line plots compare the production response data from the prior ensemble (represented by the gray lines), to the reconstructed posterior production responses (represented by the green/blue/purple lines) obtained from the latent space proxy model. The mean of simulated production responses from the posterior models agree with d obs , as shown by the cyan lines in Figure 4.17 and shows that the posterior models are not only geologically consistent but 139 Figure 4.15: Fluvial model latent variables (first 4 shown out of 20) for the first and last iteration. can also reasonably reproduce d obs . The predicted production responses from the latent space proxy model provide sufficient physical correlations that can be used to update the ensemble of prior models, without running computationally expensive forward simulations for each iteration in ESMDA. For further validation, we simulate a common development scenario after the assimilation pe- riod of 6 years, where a new producer (P3 as shown in Figure 4.16) is to be drilled northeast of the existing producer P2. In this scenario, the existing producers will be shut-in and only one of the injectors (I9) remains active. An additional simulation period of 3 years (12 timesteps) is considered as the forecast period. We compare production response data from the following set of models against the reference case: posterior models from LSDA, prior models, and a set of 130 models obtained through rejection sampling (as the only exact sampling method for nonlinear and non-Gaussian cases, denoted as RS in Figure 4.18) of production response data simulated from 5631 additionally generated models. For the rejection sampling operation, a model is accepted if the RMSE between its simulated data and d obs is within the noise threshold. 140 Figure 4.16: (Left panel) Reference model, its reconstruction and mean and variance of prior ensemble and posterior ensemble. (Right panel) Samples of prior, iterations and posterior realiza- tions. 141 Figure 4.17: Data match for posterior ensemble for the fluvial dataset. 
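The rejection-sampling baseline accepts a candidate model only when the misfit between its simulated responses and d_obs falls within the noise threshold. A minimal numpy sketch is given below; the array shapes and the threshold variable are assumptions.

```python
import numpy as np

def rejection_sample(D_sim, d_obs, noise_threshold):
    """Accept candidate models whose simulated responses match d_obs within the noise.
    D_sim: (N, T, F) simulated responses, d_obs: (T, F) observed data."""
    rmse = np.sqrt(((D_sim - d_obs[None, ...]) ** 2).mean(axis=(1, 2)))
    return np.where(rmse <= noise_threshold)[0]      # indices of accepted models

# accepted = rejection_sample(D_candidates, d_obs, noise_threshold)
```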
The bar plot in Figure 4.18 shows that the posteriors can reliably predict the facies at the lo- cation of P3, when compared to the proportion of facies at the location of P3 seen in the set of rejection sampling models. In Figure 4.18, the P10/mean/P90 oil and water profiles of P3 from the posterior models show good agreement with the profiles from the rejection sampling mod- els. These observations indicate that the assimilation process has integrated the past observations into the models for more reliable performance prediction to guide future development plans while incorporating the unresolved uncertainty in reservoir description. 4.3.3 Example 3: Large-scale Model (Based on Volve Field) LSDA is applied to a large-scale example based on V olve field in the North sea. This three- dimensional reservoir of approximate dimension 5000 m× 4000 m× 80 m is discretized into a 78× 87× 15 domain (m∈R 78× 87× 15 ). A complex fault system in the field creates varying degrees of sand juxtaposition that divides the reservoir into hydraulically-separated producing regions. We assume that the structural and fault framework are certain. Figure 3.24 shows the field set-up (2D map view) with 4 producers (marked as P) and 3 water injectors (marked as I) that penetrate several 142 Figure 4.18: Facies at well location and data match of P10/mean/P90 profiles within forecast period for the fluvial dataset. of the fault blocks. In Figure 3.24, the upscaled (i.e blocked) facies and porosity logs for the wells used as conditioning data are shown in a well section window. The facies grid is populated using MPS algorithm in Petrel guided by the training volume shown in Figure 3.24 and is conditioned to the upscaled facies logs from each well. The porosity field is then populated using Sequential Gaussian Simulation (SGS) algorithm conditioned to the upscaled porosity logs. The simulation of porosity grid is done separately for each facies type. The statistic of porosity field for the sand facies honors the distribution seen in the upscaled logs (Figure 3.25). Due to sampling bias of the shale facies where all wells penetrate only the channel body, a porosity distribution for the shale facies is assumed. Next, the permeability field is calculated using simple transform functions k= 5∗ 10 7 φ 7.2 (for sand) and k= 1+ 10 8 φ 8.1 (for shale) to mimic the ranges seen in the actual field, as plotted in Figure 3.25. In this example, 500 conditional realizations (N= 500) are generated where the size of the training, validation and testing (also used as the prior ensemble) datasets is 300, 100 and 100 respectively. The average Net- To-Gross (NTG) for the 500 model realizations is 22.8%. A water-flooding system is considered for a total simulation time of 13 years where production data is collected every month (i.e. T = 157, F = 35, d∈R 157× 35 ), with approximately 40% oil recovery factor at the end of simulation. Note that for the model autoencoder, the dimension of z m is 64 (K= 64) and three-dimensional convolutional filters are used to capture the lateral and vertical heterogeneity. In this example, the dimension of z d is 20 (H= 20) and the number of one-dimensional convolutional filters is doubled 143 to better capture the complex flow data responses from time-varying control trajectories. For this large-scale example, the total run time to parallel simulate 500 flow responses ( D) is approximately 12 hours with 1 compute node that has 20 cores (Intel Xeon 64 GB, 2.4 GHz). 
The LSDA proxy model architectures are trained on a high-performance computing cluster with an NVIDIA Tesla P100 GPU node for approximately 2.3 hours (spanning 1500 epochs) and are checkpointed every 10 epoch. The optimal checkpoint (without overfitting) for each architecture is identified when val- idation losses do not show any further reduction. The dimension of each latent variable (H and K) is determined by performing a sensitivity analysis where the dimension is successively increased to a point where further decrease in validation loss is not observed. Generally, the data autoencoder converges faster than the model autoencoder as temporal (1D) variation in the timeseries is less complex than the volumetric (3D) variation in the model realizations. Figure 4.19 shows the model reference case m re f for d obs . Four (4) update steps are performed using LSDA. The mean of the posterior models agrees with the reference model, where key fluvial features between wells are recovered. The variance of the posterior models shows significant reduction in uncertainty when compared to the variance of the prior models. The right panel in Figure 4.19 shows samples from the prior and posterior ensembles. From the posterior models, we observe that fine scale vertical Gaussian features are smoothened as d obs has no sensitivity to these features since all producers and injectors are vertical wells and completed across the entire pay zone. The posterior models contain geologically consistent large scale lateral fluvial features (variations in channel width and connectivity patterns that dominate the flow behavior) that can reproduce d obs . We observe that some cells in the uppermost layers of the channel body that is away from wells are not recovered as injection fluid tends to sweep the lower layers due to gravity effect. In Figure 4.20, the gray lines represent production response data from the prior ensemble and the lines in green/blue/purple color represent the production responses obtained from the latent space proxy model. The cyan lines represent the mean of production responses obtained by run- ning the physical reservoir simulator on the set of posterior model realizations. This is done as 144 Figure 4.19: (Left panel) Reference model and mean and variance of prior ensemble and posterior ensemble. (Right panel) Samples of prior and posterior realizations. 145 Figure 4.20: Data match for posterior ensemble for the V olve dataset. 146 a validation step to ensure that the posterior models are not only geologically consistent but can also reasonably reproduce d obs . In contrast to our simpler 2D examples, for this realistic exam- ple D contains peaks and troughs in response to trajectorial changes in pressure drawdown (for producers) and injection rate (for injectors) that could be mistaken for noise by LSDA. Given the complexity of this dataset, the data match is acceptable as considerable reduction in uncertainty by the ensemble of calibrated model realizations around d obs is evident when compared against D of the prior ensemble. 4.4 Summary and Discussion In this chapter, we introduce the Latent-Space Data Assimilation (LSDA) framework for geologi- cally consistent and computationally efficient integration of dynamic response data into subsurface flow models in low-dimensional feature spaces. Numerical simulation of subsurface flow systems leads to high-dimensional parameter space and states dynamics that are known to have inherent low-dimensional representations. 
Using labeled simulated data and new deep learning architec- tures, we construct low-dimensional latent feature spaces that can be used to facilitate dynamic data integration. In LSDA, complex model realizations and simulated production responses are both transformed to Gaussian latent variables that are amenable to existing data assimilation al- gorithms. The transformations to compact model and data feature spaces are done using a pair of deep convolutional V AEs that extract salient spatial and temporal features. Data assimilation is then efficiently performed in the latent spaces using a latent space proxy model that maps the model latent variables to data latent variables, to alleviate the computational burden of running a full physical reservoir simulator in the iterative workflow. The neural network architecture for LSDA consists of two convolutional V AEs and a regression model (as the latent space proxy model) that are simultaneously trained to learn joint parameter-to-data mappings. We compare the performance (i.e. geologic consistency of posteriors) of LSDA and ESMDA with a physical simulator, for different ensemble sizes from 5 to 300. For this experiment, the 147 only computational burden of training the LSDA architecture is associated with running forward simulations to obtain 300 pairs of model and simulated response data, as explained for the 2D examples. For ESMDA with a simulator, the computational cost is the product of number of assimilation steps (N a = 4) and size of ensemble (N e ) as shown in Figure 4.21. To obtain the average performance for each ensemble size, 10 different reference maps (m re f ) are selected and random models are selected from the hidden testing set to form the set of priors. We use the Structural Similarity Index Measure (SSIM) as implemented in Python scikit-image library [262] as the performance metric MeanSSIM= 1 N e N e ∑ j=1 SSIM(m j ,m re f ) (4.11) where N e is the size of ensemble and m j is a posterior model in the ensemble. The top plot in Figure 4.21 shows that for the 2D Gaussian examples, LSDA results in com- parable performance to ESMDA, with a smaller computational cost. The performance does not increase significantly beyond N e = 150 as the additional realizations added to the ensemble con- tain redundant information. At N e = 150, LSDA incurs 300 simulation runs and ESMDA costs 600 simulations runs. For the 2D Gaussian examples, note that the MeanSSIM score for LSDA is consistently slightly lower than ESMDA due to the smoothing effect caused by using 2D convolu- tional filters in the architecture. The bottom plot in Figure 4.21 shows similar observations for the 2D fluvial examples. However, the performance of LSDA is notably higher than ESMDA beyond N e = 100 due to the geologically consistent feature-based updates. Additionally, the performance of LSDA is also compared with LSDA implementation (here called LSDA-b) that involves only model latent space and a proxy model that maps the model latent variables directly to the production response data. In LSDA-b, the encoder Enc ψ (·) is omitted and the training process does not involve data reconstruction. In Figure 4.21, the performance of LSDA and LSDA-b are comparable for the 2D Gaussian examples. 
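The MeanSSIM metric of Equation 4.11 can be evaluated directly with the scikit-image implementation cited above. A short sketch is given below, assuming 2D property fields that share a common value range.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssim(posterior_models, m_ref, data_range=1.0):
    """Equation 4.11: posterior_models is an (Ne, ny, nx) array, m_ref is (ny, nx)."""
    scores = [structural_similarity(m_j, m_ref, data_range=data_range)
              for m_j in posterior_models]
    return float(np.mean(scores))
```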
For the 2D fluvial examples however, LSDA performs slightly better than LSDA-b and we attribute this to the fact that the data autoencoder extracts salient information and allows the latent-space proxy model to perform 148 Figure 4.21: (Top) MeanSSIM comparison between LSDA, LSDA-b and ESMDA for different sizes of ensemble for 2D Gaussian examples and (Bottom) 2D fluvial examples (with ESMDA-b). 149 optimally. This suggests that the data autoencoder offers the opportunity to reduce the dimension of observations with negligible loss of information and as a result, simplifying the data assimilation step between the model and data latent spaces. Our observation is consistent with earlier works that have applied dimensionality reduction techniques on observations [141, 147, 231]. We also consider an implementation where ESMDA is used in conjunction with a proxy model that directly maps model parameters to data, called ESMDA-b. To investigate the performance of ESMDA-b, we perform a sensitivity analysis on the size of ensemble for the 2D fluvial examples. The results shown in Figure 4.21 indicate that ESMDA-b performs poorly when pixel-based up- dates (versus feature-based updates in LSDA) are done on non-Gaussian examples. Additionally, we observe that the ESMDA-b proxy model is not able to give reliable prediction when presented with intermediate updates as they were not seen during training. For a fair comparison and opti- mal performance, the architecture for ESMDA-b has comparable number of parameters to LSDA, where z m = 676, z d = 64, and the regression model layers are expanded (i.e., with successive di- mensions of 676,128, and 64) to compensate for the exclusion of model decoder Dec θ (·) and data encoder Enc ψ (·). Given that the computational cost incurred for training both LSDA and ESMDA- b is the same, the results in Figure 4.21 motivate the use of LSDA and illustrate the advantages of LSDA where dimension reduction of both model and data is performed. The major computational burden in LSDA workflow is associated with the number of model- data pairs used to train the neural network architecture, as it involves a simulation run for each of the model realization and can be performed offline. The optimal number of pairs can be determined by performing a sensitivity analysis on the prediction performance (i.e. predicted production re- sponse for any given model realization). Generally, the prediction RMSE decreases with increasing number of training data, up to a point where additional training data only introduce a marginal im- provement. The optimal number of training data needed depends on the complexity of the geologic features, the number, configuration and type of wells present and the length of production period. Given that the same LSDA proxy model is used in the sensitivity analysis shown in Figure 4.21, we observe that the benefits brought by a larger ensemble size is considerably more significant when 150 compared to the potential error introduced by the proxy model. Nonetheless, our experiments show that the proxy model is able to provide enough statistical information needed for model calibration purposes. As new observations are collected from the field, certain components of LSDA will need to be retrained, specifically the data encoder Enc ψ (·), data decoder Dec ψ (·), as well as the latent space proxy component Reg md γ (·). 
The trainable weights within these components generally constitute about 10% of the total weights that need to be trained and the LSDA architecture can be conve- niently reinitialized with the previous trained weights when retraining becomes necessary, thereby easing the retraining process. This is an advantage when compared to a proxy model that directly maps model parameters to data that will require a complete retraining when newly observed data is available. Future works may consider the application of sequence-to-sequence Recurrent Neu- ral Network (RNN) models for the data encoder and decoder where newly obtained data can be included as additional data points for retraining. While the V AE allows the latent variables to follow a certain predefined distribution (i.e. Gaus- sian), the constraint is embedded in the loss function as a regularization term that can affect the model reconstruction performance. In other words, it is possible to obtain a perfectly Gaussian latent variables at the cost of reconstruction performance, which is not ideal for our application. To improve the reconstruction performance of the V AE, the dimension of the latent variables can be increased, although this may introduce over-fitting and an unnecessarily large number of pa- rameters in the neural network architecture. The proposed LSDA workflow integrates a novel deep learning architecture that performs si- multaneous dimensionality reduction and prediction, with a traditional data assimilation algorithm for improved efficiency. The proposed neural network architecture includes the joint parameter- to-data mapping that functions as a latent space proxy model, to remove the need for a time- consuming physical simulator during the update steps of ESMDA. With LSDA, the size of the prior ensemble can be increased without significant computational overhead, to improve the prediction 151 accuracy of the posterior ensemble. The latent space representation of complex non-Gaussian reservoir property distributions also leads to geologically consistent feature-based model updating. From the experiments that we have performed, LSDA demonstrates improvements over the traditional data assimilation methods. The proposed LSDA workflow can potentially be used in a closed-loop field optimization workflow and is an interesting avenue of research, yet to be investigated. Further experiments involving the assimilation of complex field cases with high- dimensional observations (i.e., 4D seismic) and multiple disparate observations using LSDA are important future works. Recent developments in deep learning continues to offer new approaches to address existing challenges in subsurface flow modeling, for applications in hydrogeology and petroleum engineering. 152 Chapter 5 Conditioning Generative Adversarial Networks on Nonlinear Data for Model Calibration and Uncertainty Quantification Conditioning complex subsurface flow models on nonlinear data is complicated by the need to preserve the expected geological connectivity patterns to maintain solution plausibility. Gen- erative adversarial networks (GANs) have recently been proposed as a promising approach for low-dimensional representation of complex high-dimensional images. The method has also been adopted for low-rank parameterization of complex geologic models to facilitate uncertainty quan- tification workflows. A difficulty in adopting these methods for subsurface flow modeling is the complexity associated with nonlinear flow data conditioning. 
While conditional GAN (CGAN) can condition simulated images on labels, application to subsurface problems requires efficient conditioning workflows for nonlinear data, which is far more complex. Since neural networks are capable of approximating very complex functions (e.g., flow simu- lator) [163] and GAN is a form of neural network, we investigate if GAN can simultaneously be used for parameterization and data conditioning (i.e., approximating the inverse function of a flow simulator). Two important topics that we address in this chapter are the data conditioning property of generative adversarial networks (GAN) as well as its performance when diverse spatial features are provided as training data (i.e., when multiple geologic scenarios are present). In this work, we perform direct data integration through simultaneous low-dimensional parameterization and label conditioning, which is something that has not been done yet for subsurface flow problems. To that 153 end, we propose two methods to generate flow-conditioned models with complex spatial patterns using GAN. In the first method, we perform simultaneous parameterization and data integration using con- ditional GAN (CGAN) by providing a production response label as an auxiliary input during the training stage of CGAN. The production response label is obtained from clustering of the simu- lated flow responses of the prior model realizations. In this method, CGAN learns (through offline training) the correspondence between the spatial features to the production responses within each cluster. The underlying assumption of this approach is that CGAN can learn the association be- tween the spatial features corresponding to the production responses within each cluster. When the observed data becomes available, a distance metric is applied to identify relevant labels that represent a set of similar dynamic data behaviour. Given the similarity of the observed data to the labeled set, the relation between spatial patterns and data labels learned by CGAN is used to generate model realizations with appropriate spatial patterns. In the second method, a subset of samples from the training data that are within a certain dis- tance from the observed flow response data is used within GAN to generate new model realizations. In this method, GAN is not required to learn the nonlinear relation between production responses and spatial patterns. Instead, it learns the spatial patterns in the selected realizations that provide a close match to the observed data. The conditional low-dimensional parameterization for complex geologic models with diverse spatial features (i.e., when multiple geologic scenarios are plausible) performed by GAN allows for exploring the spatial variability in the conditional geologic real- izations. The spatial variations in the conditional models honor the variations in the production responses and can be explored through interpolation and operations in the low-dimensional rep- resentation space. These variations can help practitioners understand how the geologic features morph between the generated conditional models, which can be critical for decision-making. For well placement and waterflooding optimization in areas of high uncertainty, the realistic spatial features in the generated conditional models provide insights into critical spatial elements for pro- duction, such as channel continuity, presence of flow baffles and sinuosity of fluvial bodies. 
154 We demonstrate the performance of these methods using complex single and multi-scenario fluvial reservoirs of binary and multi-facies cases conditioned to linear and non-linear dynamic data. 5.1 Generative Adversarial Networks for Model Space and Data Space Compression The generative adversarial network is a class of deep neural network architectures that consist of two stacked fully differentiable network models, called the generative modelG θ and discriminative modelD ψ , whereθ andψ respectively represent the trainable weights in the models. These models are simultaneously trained to estimate the distribution of data and likelihood that a sample belongs to the training dataset [78]. The two models are optimized in a two-step alternating procedure with opposing objectives where the goal ofG θ is to create fake samples that look realistic while the goal ofD ψ is to correctly classify fake and real samples. The loss function to optimize weights θ and ψ is a mini-max game between the two models, where convergence is ideally reached at a Nash equilibrium betweenG θ andD ψ . Given a set of model realizations M as the training dataset, the loss function used in the training process is mathematically defined as min θ max ψ L(D ψ ,G θ )=E m∼ p M [logD ψ (m)]+E z∼ p Z [log(1− D ψ (G θ (z)))] (5.1) where p Z is a known distribution, typically the standard Gaussian distribution, z is a latent vec- tor sampled from p Z where z∼ p Z and p M is the distribution of the training dataset M (represented by the model realizations within M where m∼ p M ) which is not known. More specifically, the generator G θ accepts the latent vector z as an input to output a generated model realization ˜ m whereG θ (z)→ ˜ m. A set of generated model realizations is defined as ˜ M. The discriminatorD ψ accepts a model realization as an input and assigns a value between 0 (if a model realization is determined to be fake) and 1 (if a model realization is determined to be real) 155 whereD ψ ( ˜ m)→[0,1]. In the alternating training process, the weightsψ inD ψ are first updated by maximizing the first term in Equation 5.1 while keeping the weights θ inG θ fixed. Subsequently, the weights θ inG θ are updated by minimizing the second term in Equation 5.1 while keeping the weights ψ inD ψ fixed. Through the training process, GAN implicitly learns the distribution of the training dataset p M where the generated model realizations ˜ M (fromG θ (z)→ ˜ m) have a distribution represented as p ˜ M and the Jensen-Shannon divergence between the two distributions (cross-entropy) is minimized. At convergence, p ˜ M approximates p M such that p ˜ M ≃ p M . The process of training a GAN is fraught with instability and mode collapse issues as reaching a Nash equilibrium is not trivial. Multi-modal distribution of the training dataset p M can be poorly approximated with a uni-modal distribution in a phenomenon called mode collapse [77]. This results in generated samples that are tightly distributed around the mean behavior of the training samples. Numerous methods to address these issues have been proposed. For example, Wasserstein GAN (WGAN) seeks to promote stability by using the Wasserstein distance as a different metric [79] and Cramer distance was later proposed as another distance metric [20]. 
To increase the diversity of the generated samples, (Energy-based GAN) EBGAN [272] views the training process as an energy-based system where points near the data manifold are assigned low energy (high priority) and points from low data density are assigned high energy. Regardless of these standing issues, given the growing developments in research involving GAN, this generative model may offer potential solutions to standing problems in the geoscience domain. Two main properties of GAN that are deemed attractive to geoscience applications are its abil- ity to allow dimensionality reduction in multiple orders of magnitude (from a model realization m to a latent vector z) and its capacity to implicitly learn the lower-order and higher-order statis- tics of a (prior or posterior) probability distribution function (pdf) of the training data (e.g., model realizations M) that is otherwise too complex to be explicitly defined. The dimensionality reduc- tion serves to alleviate under-determinedness, to increase efficiency by reducing the number of parameters that need to be calibrated and to maintain solution plausibility by allowing feature- based updates in the data integration step. Variants of GAN have also been developed to serve 156 an intended purpose in different applications. Examples are Auxiliary Classifier GAN (ACGAN), which allows for the generation of samples conditioned to class labels [182] and InfoGAN, which has the ability to learn disentangled representation in a completely unsupervised manner [41]. In this chapter, we adopted the Wasserstein loss function with gradient penalty [10, 79] to mitigate the instability issues of training GAN, in which the quality of generated samples may not necessarily improve as training progresses [78, 213, 232]. The first term in Equation 5.1 tend to saturate to zero whenD ψ is optimally trained and when this happens, the vanishing gradient provides no feedback to optimizeG θ . This occurs asD ψ is easier to train thanG θ where assigning a fake or real label (i.e.,D ψ ( ˜ m)→[0,1]) to a model realization is a straightforward operation to learn compared to an upsampling operation of generating a high-dimensional complex model realization from a low-dimensional latent vector sampled from a simple distribution (i.e.,G θ (z)→ ˜ m). In WGAN, the loss function is generalized to a metric that measures the divergence between two distributions p M and p ˜ M where the metric used is the Earth-mover (EM) or Wasserstein-1 distance such that min θ max ψ L(D ψ ,G θ )=E m∼ p M [D ψ (m)]− E z∼ p Z [D ψ (G θ (z))]− λE ˆ m∼ p ˆ M [( ∇ ˆ m D ψ ( ˆ m) 2 − 1) 2 ]. (5.2) where ˆ m is a model realization that is sampled uniformly between a pair of real training sample m and generated sample ˜ m whileλ is a tuning hyperparameter for the gradient penalty term to en- force 1-Lipschitz condition (for continuous differentiation) onD ψ . The third term in Equation 5.2 provides a continuous metric for improvement as it measures the distance between real and gener- ated samples. For a detailed formulation, readers are referred to Gulrajani et al. [79]. Since the log function for the first two terms in Equation 5.2 has been removed (effectively solving the vanishing gradient problem),D ψ is no longer a discriminator that is trained to distinguish between real and fake model realizations, but rather to critic as the loss function outputs a scalar that allows p ˜ M to fit p M more tightly. 157 Figure 5.1: Method (1) for generating conditional models using CGAN. 
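For illustration, a minimal PyTorch sketch of the critic and generator losses of Equation 5.2, including the gradient-penalty term, is given below. The critic D, the real and generated batches, and the 4D tensor layout (batch, channel, ny, nx) are assumptions.

```python
import torch

def critic_loss_wgan_gp(D, m_real, m_fake, lam=10.0):
    """Wasserstein critic loss with gradient penalty (Equation 5.2)."""
    # interpolate uniformly between paired real and generated realizations
    eps = torch.rand(m_real.size(0), 1, 1, 1, device=m_real.device)
    m_hat = (eps * m_real + (1.0 - eps) * m_fake).requires_grad_(True)
    grads = torch.autograd.grad(outputs=D(m_hat).sum(), inputs=m_hat,
                                create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    # critic maximizes D(real) - D(fake); minimize the negative plus the penalty
    return -(D(m_real).mean() - D(m_fake).mean()) + lam * gp

def generator_loss_wgan(D, m_fake):
    return -D(m_fake).mean()
```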
5.1.1 Method 1: Conditional GAN The first method generates conditional models through conditional GAN (CGAN), whereby a flow response label c i (for each model m i in training dataset M) is used as an auxiliary input during the training stage of CGAN. In Figure 5.1, given M as a set of priors {m i } i=1,2,3,...N from sin- gle/multiple geologic scenarios, we collect the corresponding set of linear or non-linear dynamic data D (consisting of{d i } i=1,2,3,...N ) using historical controls and proceed to cluster the data (into K clusters) to derive the set of flow labels C, defined by the centroids of each clusters where {c i } i=1,2,3,...N; c=1,2,3,...K . In Figure 5.1, the centroids in the data space D are represented by gray- filled circles and the data points around each centroid (i.e., label) c form the set{D ∗ c } c=1,2,3,...K and are similar in the production response behaviour. The dataset is now composed of the tu- ples{m i ,d i ,c i } i=1,2,3,...N where{M ∗ c } c=1,2,3,...K defines a set of model realizations belonging to the same cluster (i.e., label) c that contains variations in geologic features that can recreate the variations in the dynamic data within{D ∗ c } c=1,2,3,...K . 158 We describe a modification to Equation 5.2 to allow conditioning to class labels. Several meth- ods have been developed [154, 182, 203] to include additional information when generating sam- ples, a topic that is currently at the research state. In our implementation, we adopt the approach by Odena et al. [182] where an auxiliary classifier C φ (where φ represents the trainable weights in the model) is appended toD ψ to provide gradient information toG θ on class boundaries. The training process is now composed of three loss functions, and they are optimized to give a proba- bility distribution over training samples (the two loss functions responsible for this are denoted as L D andL G ) and class labels (cross-entropy loss function formulation denoted asL C ) such that L D = − E m∼ p M [D ψ (m)]+E z∼ p Z ,c∼ p C [D ψ (G θ (z,c))]+λE ˆ m∼ p ˆ M [( ∇ ˆ m D ψ ( ˆ m) 2 − 1) 2 ] (5.3) L C =E m∼ p M ,c∼ p C [logC φ (c|m)]+E ˜ m∼ p ˜ M ,c∼ p C [logC φ (c| ˜ m)] (5.4) L G =− E z∼ p Z ,c∼ p C [D ψ (G θ (z,c))] (5.5) CGAN is trained with the tuples{m i ,c i } i=1,2,3,...N using Equation 5.3-5.5 as the loss functions until convergence. The Appendix provides a detailed description of the two-step alternating train- ing procedure. Once CGAN is trained, the distance of cluster centroids to the observed data d obs is used as a measure to select relevant labels that represent a set of similar dynamic data behaviour D ∗ (that is associated with the corresponding set of model realizations M ∗ ). To generate a set of geologically realistic flow-conditioned models ˜ M where the distribution of the conditional models approximate the distribution of relevant models used in the training process (i.e., p ˜ M ≃ p M ∗ ), the identified labels and a set of latent vectors (where each latent vector z is sampled from a known distribution p Z such that z∼ p Z ) are fed into the trained generatorG θ . This operation is called a feed-forward sampling operation and can be mathematically defined as ˜ m=G θ (z,c) where ˜ m represents a generated sample. The underlying assumption of this method 159 is that CGAN learns the association between spatial features corresponding to the dynamic data within each cluster. 
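For illustration, the label-construction and label-selection steps of this method (summarized later in Table 5.1) might look as follows, using K-means as the clustering step. The number of clusters, the flattening of the response arrays and the number of retained labels are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# D_flat: (N, T*F) flattened simulated responses; d_obs_flat: (T*F,) observed data
K = 16                                              # assumed number of clusters
km = KMeans(n_clusters=K, random_state=0).fit(D_flat)
labels = km.labels_                                 # label c_i for each prior model m_i

# After CGAN training, keep the labels whose centroids lie closest to the observed data
dist = np.linalg.norm(km.cluster_centers_ - d_obs_flat[None, :], axis=1)
relevant = np.argsort(dist)[:3]                     # e.g. retain the 3 nearest clusters

# Conditional sampling with the trained generator: m_tilde = G(z, c),
# with z drawn from N(0, I) and c drawn from the `relevant` labels
```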
In this method, the unknown and intractable distribution of posterior model realizations p M ∗ is parameterized and represented by a simpler distribution p Z (i.e., Gaussian) and cluster labels c using a trained CGAN with an auxiliary classifier that can be mathematically defined as G θ ◦ D ψ ◦ C φ (i.e., the composition of componentsG θ ,D ψ andC φ ). Table 5.1 summarizes the proposed methodology. The number of clusters K (which determines the number of class labels) is a problem-specific design decision and is dependent on the spread within the set of simulated data D and the uncertainty within the observed data d obs . We perform K-means clustering in the data space D and assign to each model realization m a label c based on the centroid (mean) it belongs to. This operation can be done in low or high-dimensional space and if the dimension is reduced, then d obs needs to be considered in the same low-dimensional space so the Euclidean distance to the centroids can be computed for identification of relevant labels in step 6. Table 5.1: Workflow for method (1) Datapreparation 1. Run N forward simulations for priors{m i } i=1,2,3,...N ∈ M to get model-data pairs {m i ,d i } i=1,2,3,...N where{d i } i=1,2,3,...N ∈ D 2. Determine appropriate number of clusters K 3. Run K-means clustering on D and construct model-label pairs{m i ,c i } i=1,2,3,...N as training dataset CGANdesignandtraining 4. Determine CGAN architecture 5. TrainG θ ◦ D ψ ◦ C φ with{m i ,c i } i=1,2,3,...N to convergence with Equation 5.3-5.5 Generationofconditionalmodels 6. Identify relevant labels c∈ C 7. Generate conditional models ˜ m=G θ (z,c) Figure 5.2 illustrates the architecture used in our work. The Appendix presents a more detailed description of the dimensions and functions of each layer in the architecture. The main operation 160 Figure 5.2: Architecture of CGAN/GAN used in this study. Components connected using stippled lines are associated with CGAN. done within the generatorG θ is an upsampling of low-dimensional latent vectors to construct high- dimensional complex model realizations and is performed using deconvolution layers denoted as deconv. The discriminator/critic performs a downsampling operation of model realizations using convolution layers denoted as conv. Note thatC φ andD ψ branch in parallel at the dense layer where the output for each branch is transformed to the shapes required by the loss functions. Convergence is determined when the visual quality of generated realizations for each class label do not improve any further (i.e., visually similar to the training realizations) and exhibit the correct geologic features belonging to the specified class label (see samples of generated realizations for selected iterations during the training process in Figure 8.4(b) in the Appendix). 5.1.2 Method 2: GAN with Neighborhood Selection The alternative method for data conditioning involves two steps as illustrated in Figure 5.3. In the first step, a neighbourhood selection algorithm identifies the subset of relevant flow responses D ∗ (simulated from the set of prior model realizations M) around the observed data d obs such that D ∗ is a subset of the data space D (i.e., D ∗ ⊆ D) and d obs lies within the set D ∗ (i.e., d obs ∼ p D ∗ ). 161 Figure 5.3: Method (2) for generating conditional models using neighbourhood selection algorithm and GAN. 
The second step employs GAN to generate conditional models based on the set of relevant model realizations M ∗ selected in the first step where each model realization m within M ∗ corresponds to a simulated dynamic response data d within D ∗ . In this case, GAN is not required to learn the relation between flow responses and spatial patterns; instead, it is tasked to learn the patterns that are involved in the selected models M ∗ such that M ∗ is a subset of the model space M (i.e., M ∗ ⊆ M). The reference case m re f used to simulate the observed data d obs is assumed to be known in all of our experiments and lies within the set of relevant model realizations M ∗ (i.e., m re f ∼ p M ∗ ). The size of the selected neighborhood determines the size of D ∗ and translates to the uncertainty in physical measurementε around d obs within the dynamic data domain D. Since the prior model realizations M can originate from multiple geologic scenarios, the set of flow-conditioned models M ∗ could also potentially come from multiple geologic scenarios with diverse spatial patterns. The set of generated conditional models ˜ M by CGAN and GAN appear to honor the original geologic scenario proportion of M ∗ . Additionally, the spatial variability in the conditional models can be explored through interpolation and operations in the latent space (i.e., the space of all latent vectors 162 Table 5.2: Workflow for method (2) Datapreparation 1. Run N forward simulations for priors{m i } i=1,2,3,...N ∈ M to get model-data pairs {m i ,d i } i=1,2,3,...N where{d i } i=1,2,3,...N ∈ D 2. Determine appropriate size of neighborhood d obs ± ε to select D ∗ in D domain 3. Construct training dataset M ∗ GANdesignandtraining 4. Determine GAN architecture 5. TrainG θ ◦ D ψ with M ∗ to convergence with Equation 5.2 Generationofconditionalmodels 6. Generate conditional models ˜ m=G θ (z) zs), which can be critical for decision-making where an understanding of critical spatial elements for production is needed. Table 5.2 summarizes the second proposed methodology. Unlike the first method where CGAN is trained with the entire set of prior models M, we define a neighborhood D ∗ whose size is deter- mined by the measurement uncertaintyε around d obs . The corresponding set of model realizations M ∗ is then used as the training dataset for GAN (as described in Section 5.1). The underlying assumption in this approach is that models that provide an acceptable match to data are expected to have similar local spatial connectivity patterns. Therefore, once such models are identified using a metric of similarity, they can be passed to GAN for learning the underlying patterns and using them to generate models with similar patterns that will provide an acceptable match to d obs . A major difference between this and the previous method is that the current method is not ”offline” in the sense that d obs is introduced before GAN is trained; that is, the observed data has to become available before training can proceed. The actual architecture is shown in Figure 5.2 where label conditioning andL C are not involved (i.e., components connected using stippled lines are associated with the previous method). Similar to the first method, convergence is reached when the visual quality of generated realizations do not show any further improvement. The input to 163 the generator is a Gaussian latent vector z∼ p Z where a feed-forward operation on a trainedG θ generates a conditional model such that ˜ m=G θ (z). 
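The learned distribution can be explored by decoding points along a straight line between two latent vectors, following the interpolation scheme formalized in Equation 5.6 below. A minimal sketch, assuming a trained generator G and two anchor latent vectors, is given here.

```python
import torch

def interpolate_latent(G, z_start, z_end, n_steps=8):
    """Decode conditional models along the line between two latent vectors."""
    alphas = torch.linspace(0.0, 1.0, n_steps)
    with torch.no_grad():
        return [G(a * z_start + (1.0 - a) * z_end) for a in alphas]

# models = interpolate_latent(G_trained, z1, z2)     # succession of generated models
```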
The set of generated conditional models is represented by ˜ M where the distribution of the conditional models approximate the distribution of relevant models M ∗ used in the training process where p ˜ M ≃ p M ∗ . Forward simulation on ˜ M yields the set of simulated flow responses ˜ D with a distribution that is similar to the distribution of the set of relevant flow responses D ∗ where p ˜ D ≃ p D ∗ . The learned distribution p ˜ M can be explored through latent space interpolation where successive conditional modelsG θ (ˆ z) between two generated models (i.e.,G θ (z start ) andG θ (z end )) can be computed where ˆ z=αz start +(1− α)z end , α∈(0,1) (5.6) such thatG θ (ˆ z)∼ p ˜ M for all interpolated latent vectors (i.e.,∀ˆ z). 5.2 Numerical Experiments and Results In this section, the capability of both methodologies is demonstrated with synthetic datasets of binary and multi-facies fluvial realizations with uncertainty in channel azimuth, thickness-to-width ratio, facies distribution and connectivity patterns. We first demonstrate the conditioning ability of CGAN using direct linear labels (derived from the K-means clustering of model realizations within the model space in section 5.2.1) and linear simulated travel-time tomography data labels (derived from the clustering of simulated data within the data space in section 5.2.2.1). Then we show the performance of CGAN for conditioning on non-linear dynamic data labels (derived from the clustering of simulated production data within the data space in section 5.2.2.2). In section 5.2.3, we demonstrate the performance of GAN as the second method described in section 5.1.2. Figure 5.4 shows two training images (TIs) that are generated using object-based simulation to represent the uncertainty in the geologic connectivity patterns in a simplified fluvial depositional 164 Figure 5.4: Dataset A of binary facies realizations where training image 1 (for Scenario 1) and training image 2 (for Scenario 2-5) are derived from conceptual geologic model. Four samples are shown for each scenario. environment. The first TI represents conceptual geologic scenario of a meandering fluvial envi- ronment where the expected uncertainty is associated with the thickness, amplitude and sinuosity of fluvial bodies. The second TI represents an anastomosing fluvial environment where the ex- pected uncertainty is associated with the azimuth and thickness of fluvial bodies as well as the confluence between individual channels. The first TI is used to generate model realizations that are conditioned to hard data from wells, for Scenario 1 (four sample realizations are shown in Figure 5.4). The second TI is used to generate model realizations for Scenario 2, 3, 4 and 5 (four sample realizations are shown in Figure 5.4 for each scenario) where the TI is rotated to generate fluvial bodies with an azimuth of 0, 45, 90, and 145 respectively. The realizations are simulated using the Multi-point Statistics (MPS) algorithm in Petrel [222]. The binary fluvial reservoir has a size of 1000m × 1000m and is discretized into a 32× 32 domain. A complete description of the geostatistical parameters used can be found in Mohd Razak and Jafarpour [163]. In Figure 5.5, two different multi-facies TIs are generated using object-based simulation to represent the uncertainty in a fluvial depositional environment where levee and overbank deposits flank the fluvial bodies. 
The first TI captures the uncertainty in channel geometry (sinuous or 165 Figure 5.5: Dataset B of multi-facies realizations where training image 1 (for Scenario 1 and 2) and training image 2 (for Scenario 3, 4 and 5) are derived from conceptual geologic model. Four samples are shown for each scenario. straight), azimuth, thickness-to-width ratio, and connectivity patterns (isolated or intersecting). The first TI is used to simulate model realizations for Scenario 1 and 2 where the TI is rotated to generate fluvial bodies with an azimuth of 0 and 90 respectively. The second TI represents the uncertainty in azimuth, thickness-to-width ratio, and connectivity patterns and is used for Scenario 3, 4 and 5 where the TI is rotated with an azimuth of 0, 135 and 45 respectively. The reservoir has a dimension of 1000m × 1000m and is discretized into a 64× 64 domain to capture finer geologic features. In both datasets, 500 realizations are generated for each geologic scenario. 5.2.1 CGAN Conditioned on Model Space Label In this section, we demonstrate the ability of CGAN to generate model realizations conditioned on direct linear labels (derived from the clustering of model realizations within the model space). Model realizations M (in original dimension) from only Scenario 5 (single-scenario case of 500 model realizations) of dataset A are clustered into 4 groups of priors{M ∗ c } c=1,2,3,4 to obtain 4 direct 166 conditioning labels. The size of each cluster{M ∗ c } c=1,2,3,4 is 122, 126, 128 and 124 respectively. The objective of this experiment is to demonstrate that CGAN can reproduce the spatial variations within each cluster and the number of clusters is arbitrarily chosen. The center scatter plot in Figure 5.6 shows (in low-dimensional leading PCA spaces) 500 real- izations as the labeled training dataset and sets of 64 generated model realizations ˜ M c from each label c. From the distribution of the data points, we observe that CGAN does not merely memo- rize the training dataset and is able to produce geologically realistic conditional realizations that reasonably reproduce the mean and variance of the priors for each label. Note that the number of generated models (64 in this experiment) is arbitrarily chosen and any number of generated models can be quickly produced using the trained CGAN model. It should also be noted that the generated realizations are continuous in nature (see samples in Figure 8.4(b) in the Appendix) and a discretization method with thresholding (i.e., mid-point cutoff) can be applied without affecting the geologic realism. 5.2.2 CGAN Conditioned on Data Space Label To demonstrate the conditioning ability of CGAN on linear and nonlinear data labels, for each prior model we simulate linear dynamic data with travel-time tomography (d= Gm) and non-linear dynamic data using a two-phase flow Eclipse [221] simulator ( d = g(m)). In Section 5.2.2.1 and Section 5.2.2.2, the class labels used for conditioning are respectively derived from the clustering of simulated linear and non-linear dynamic data within the data space. 5.2.2.1 Conditioning on Travel-time Tomography Data In this travel-time tomography experiment, acoustic waves are transmitted from an array of 4 transmitters to an array of 4 receivers where the recorded data of arrival times is determined by the slowness of the medium. 
This function can be expressed as t= Z s(x) dx (5.7) 167 Figure 5.6: (Center) Scatter plot of 500 realizations (in low-dimensional leading PCA spaces) as the training dataset and 64 generated model realizations from each label denoted by color. (Tiles) Comparison of mean and variance of priors and generated realizations for each label. 168 Figure 5.7: (a) Configuration of transmitters and receivers in travel-time tomography experiment for dataset A. (b) Configuration of injectors and producers in two-phase flow experiment for dataset A and (c) for dataset B. where t is the travel time of each ray and s(x) is the slowness distribution as a function of the spatial coordinate x within any given model realization m. Figure 5.7(a) shows the configuration of transmitters and receivers for dataset A (single-scenario case of 400 model realizations from only Scenario 3) used in this section, resulting in a total of 16 arrival times (i.e., 16 values of t that constitute any single d). The transmitters and receivers are placed in monitoring wells at the eastern and western edge of the field. The fluvial field is composed of binary facies, where the sand and non-sand facies are assigned slowness values of log(50) and log(5000) respectively. Figure 5.8(a) shows D transformed to low-dimensional space and clustered into 4 groups where the number of clusters (or labels) is determined from the estimated uncertainty ε within the mea- sured field observations d obs (simulated from the reference case m re f shown in Figure 5.8(e)). The variance within each cluster (i.e., spread of data points) approximates ε where in the physical domain, it is reflected as the spread around the mean (centroid in low-dimensional space) arrival times. The size of each cluster{D ∗ c } c=1,2,3,4 is 91, 85, 115 and 109 respectively. The label as- signment for each d and the corresponding m illustrates the linear mapping function d= Gm. The most relevant label is identified as label 2 and is determined from the smallest Euclidean distance between d obs and each of the cluster centroids (numbered accordingly in Figure 5.8(a)). 169 Figure 5.8: (a) Arrival time data in leading PCA components where× denotes D,■ denotes cluster centroid,▲ denotes d obs and◦ represents ˜ D 2 . (b) Realizations in leading PCA components where× denotes M and◦ represents 64 generated realizations ˜ M 2 . (c) Samples of M ∗ 2 . (d) Samples of ˜ M 2 . (e) Comparison of mean and variance of relevant priors M ∗ 2 to ˜ M 2 and the reference case m re f used to generate d obs . 170 A set of 64 conditional models ˜ M 2 is generated by 64 feed-forward computations ˜ m=G θ (z,2) and the distribution (in low-dimensional space) is compared to the distribution of 85 relevant priors M ∗ 2 (Figure 5.8(b)). The generated samples (d) are geologically realistic, diverse and capture the spatial uncertainty in the northwestern section of the reservoir. The mean and variance maps be- tween M ∗ 2 and ˜ M 2 suggest thatG θ has learned the distribution of D and the decision boundary sepa- rating each clusters reasonably well. We report that the generated realizations may contain a small amount of isolated pixels (see samples in Figure 8.4(b) in the Appendix) and a post-processing smoothing algorithm with a small kernel or a discretization method with thresholding can be ap- plied without affecting the geologic realism or their ability to reproduce dynamic data. 
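To make the linear forward model d = Gm of Equation 5.7 concrete, the sketch below approximates the 16 arrival times for straight rays on the 32 × 32 binary model. The discretized ray tracing, the assumption that facies code 1 denotes sand, and the transmitter/receiver depths are illustrative simplifications rather than the exact setup used to generate the data.

```python
import numpy as np

def travel_times(facies, n_samples=200):
    """Approximate arrival times for straight rays between 4 transmitters (western edge)
    and 4 receivers (eastern edge) over a 32x32 binary facies model (Equation 5.7)."""
    nx = ny = 32
    cell = 1000.0 / nx                                    # 1000 m x 1000 m domain
    slowness = np.where(facies == 1, np.log(50.0), np.log(5000.0))   # sand vs non-sand
    depths = np.linspace(125.0, 875.0, 4)                 # assumed source/receiver positions
    times = []
    for ys in depths:                                     # transmitters at x = 0
        for yr in depths:                                 # receivers at x = 1000
            xs = np.linspace(0.0, 1000.0, n_samples)
            ys_line = np.linspace(ys, yr, n_samples)
            seg = np.hypot(1000.0, yr - ys) / n_samples   # ray length per sample point
            ix = np.clip((xs / cell).astype(int), 0, nx - 1)
            iy = np.clip((ys_line / cell).astype(int), 0, ny - 1)
            times.append(slowness[iy, ix].sum() * seg)    # discretized line integral of s(x)
    return np.array(times)                                # 16 arrival times -> one vector d

m = np.random.default_rng(2).integers(0, 2, size=(32, 32))   # placeholder realization
d = travel_times(m)
```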
5.2.2.2 Conditioning on Two-phase Flow Data A two-phase flow system is considered with two producers and two injectors in the reservoir as shown in Figure 5.7(b). The fluvial field (400 model realizations from Scenario 3 of dataset A) is composed of binary facies, where the channel and shale facies are assigned (φ, k) pairs of (0.23, 500 mD) and (0.03, 3 mD), respectively. Approximately one pore volume of wetting phase (i.e., water) is injected into the reservoir over 6 years of simulation time and the flow response data is collected every 3 months. For an estimated level of uncertaintyε, a larger variability in D requires more clusters to obtain approximately the same spread of data within each{D ∗ c } c=1,...,K (note that the number of d in each D ∗ c is not equal and depends on the variance of each d with respect to its centroid). This is reflected in the physical domain as the spread around the mean (centroid in low-dimensional space) injection and production curves of the wetting and non-wetting phase. In Figure 5.9, given the fixed D, we demonstrate our approach in estimating the number of clusters that will determine the number of labels to be used in the training dataset. Lower uncer- tainty in d obs implies a smaller percentage error ε (measured in normalized physical domain as the average absolute error of all d from the mean of the individually assigned cluster) that needs to be considered. Labeled scatter plots of M illustrate the non-linearity and non-uniqueness of the 171 Figure 5.9: Comparison of the clustering outcomes of D (where assigned labels are projected on M and color-coded) and average percentage errorε in each cluster based on different choices of K. ■ denotes cluster centroid. 172 mapping d= g(m) and the appropriate K is chosen to be 9. The size of each cluster{D ∗ c } c=1:9 is 49, 34, 45, 46, 39, 47, 50, 42 and 48 respectively. Once CGAN is trained, label number 8 is identified as the most relevant label according to the distance of d obs (simulated from the reference case m re f shown in Figure 5.10(e)) to each cluster centroid (numbered accordingly in Figure 5.10(a)). More than one relevant label (i.e., when d obs is approximately equidistant to more than one centroid) can be accepted granted that the cumulative spread does not exceed ε. A total of 64 conditional realizations ˜ M 8 are generated and the distribution is compared to M ∗ 8 (42 relevant priors) in Figure 5.10(e). The mean and variance are both reproduced reasonably well although it is noticeable that less variance exists around well I9 in ˜ M 8 compared to M ∗ 8 . Additionally, we report that there are simulated data ( ˜ d) from a small number of samples in ˜ M 8 that do not fall within the distribution of D ∗ 8 . This is attributed to the imperfect fit between p ˜ M 8 and p M ∗ 8 combined with the complexity arising from the non-uniqueness and non-linearity between d and m. In this particular experiment, we reject any ˜ d (and its corresponding ˜ m) that exceeds ε around d obs and any ˜ m that does not honor the well data. The frequency of these occurrences for our example was less than 5%. Figure 5.11 shows that as we consider smaller ε, the number of clusters increases and less variance (uncertainty in geologic features) is present within M ∗ . Note that the size of each cluster{M ∗ c } c=1:9 is optimal in this experiment where we have observed that increasing the number of priors (i.e., size of M) does not significantly increase the quality of the generated models. 
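The procedure for selecting K in Figure 5.9 can be sketched as a simple loop: cluster D for increasing K, report the average spread within the clusters, and keep the smallest K whose spread is at or below the estimated uncertainty ε. The data array and the details of the error metric below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def average_cluster_spread(D, K):
    """Average absolute deviation of each normalized response from its cluster mean,
    i.e., the epsilon reported for a given K in Figure 5.9."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(D)
    spreads = [np.mean(np.abs(D[km.labels_ == c] - D[km.labels_ == c].mean(axis=0)))
               for c in range(K)]
    return float(np.mean(spreads))

D = np.random.default_rng(3).random((400, 72))   # placeholder normalized production responses
for K in range(2, 13):
    print(K, round(average_cluster_spread(D, K), 4))
# choose the smallest K whose average spread does not exceed the estimated uncertainty in d_obs
```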
5.2.3 GAN with Neighborhood Selection In this section, we demonstrate the performance of our second method with two experiments using all geologic scenarios in dataset A and B, respectively. We assume that the azimuth and channel type are both uncertain and use multiple TIs to generate realizations for each scenario. In the first experiment, the field configuration and reservoir properties are as mentioned in Section 5.2.2. 173 Figure 5.10: (a) The most relevant label to d obs (▲) is determined as label 8. (b) Comparison of M ∗ 8 to ˜ M 8 . (c) Samples of M ∗ 8 . (d) Samples of ˜ M 8 . (e) Comparison of mean and variance of relevant priors M ∗ 8 to ˜ M 8 and the reference case m re f used to generate d obs . 174 Figure 5.11: Sensitivity analysis on the number of clusters and its impact on variance within M ∗ (closest to d obs ). Mean and variance maps are computed from 64 generated realizations in ˜ M. In Figure 5.12(a)-(b), D and M are displayed in leading PCA space, where the colormap rep- resents data mismatch with respect to d obs (simulated from the reference case m re f shown in Fig- ure 5.12(c)) and square markers represent the D ∗ and M ∗ (60 relevant priors) where data mismatch is within d obs ± ε. We observe in Figure 5.12(c) that the mean and variance of M ∗ (as the training data for GAN) are reproduced well in ˜ M. In Figure 5.12(e), the generated realizations by GAN ( ˜ M) can hardly be differentiated from M ∗ . Figure 5.12(a) shows similar distribution of M ∗ and ˜ M (estimated with Gaussian kernel only for visualization purpose). In (b) and (c), each generated realization in ˜ M is assigned a geologic scenario label according to the (first) nearest-neighbor in M ∗ (whose realizations are assigned a geologic scenario label depending on the TI used at time of generation). The set of generated conditional realizations honors the ratio in M ∗ , where only two geologic scenarios (1 and 3) are supported by d obs ± ε. In practice, it is more efficient to identify and retain relevant geologic scenarios to avoid costly forward simulations for unsupported scenarios. In this example, using the method in [163], Sce- narios 2, 4 and 5 are eliminated as they are not supported by the data. In (d), two latent variables (z start and z end ) are sampled uniformly (as per Equation 5.6) to generate successive conditional realizations that smoothly transcend two supported geologic scenarios. We observe that GAN has 175 Figure 5.12: (a) Colormap represents data mismatch and■ denote D ∗ where the mismatch is within d obs ± ε. (b)■ denote M ∗ that correspond to D ∗ . (c) Comparison of mean and variance of relevant priors M ∗ to ˜ M and the reference case m re f used to generate d obs . (d) Samples of M ∗ . (e) Samples of ˜ M. 176 Figure 5.13: (a) Distribution of M ∗ and ˜ M. (b) ˜ M color-labeled according to geologic scenario label of M ∗ . (c) Geologic scenario proportion of M ∗ and ˜ M. (d) Latent space interpolation between two generated samples from ˜ M. parameterized the remaining geologic uncertainties (i.e. channel curvature, width and connectiv- ity) and these uncertain geologic features (to which the data d obs ± ε are not sensitive) are reflected in the generated conditional realizations. In this example, ˜ M captures all possible connectivity patterns between Wells P1, P2 and I8 while Well I9 remains isolated in another compartment. The response data ˜ D is collected by running the forward model on ˜ M. 
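The latent-space interpolation of Equation 5.6 used to produce Figure 5.13(d) amounts to a few lines; G_theta below stands for any trained generator (e.g., a Keras model) and is only referenced, not defined, and the latent dimension of 64 is an illustrative assumption.

```python
import numpy as np

def interpolate_latents(z_start, z_end, n_steps=8):
    """Equation 5.6: z_hat = alpha * z_start + (1 - alpha) * z_end for alpha in (0, 1)."""
    alphas = np.linspace(1.0, 0.0, n_steps)
    return np.stack([a * z_start + (1.0 - a) * z_end for a in alphas])

rng = np.random.default_rng(4)
z_start, z_end = rng.standard_normal(64), rng.standard_normal(64)   # assumed latent size of 64
z_path = interpolate_latents(z_start, z_end)
# realizations = G_theta.predict(z_path)   # successive conditional models between the two samples
```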
Figure 5.14 illustrates the production and pressure profiles of the wells where the mean, P10 and P90 curves show a good match. The models in ˜ M represent the uncertainty in the flow model and historical data such that ˜ D is within d obs ± ε, thus the generated conditional realizations are conditioned and can be used for forecasting purposes. To further test this approach, we repeat the experiment with dataset B, where conditioning is done using dynamic data from the first 40 months. This is followed by 35 months of forecast period when another producer (P3) is drilled. In Figure 5.15(a)-(b), D and M are displayed in leading PCA space. The colormap represents data mismatch with respect to d obs (simulated from 177 Figure 5.14: Profiles of D ∗ and ˜ D simulated from M ∗ and ˜ M respectively. 178 Figure 5.15: (a) Colormap represents data mismatch and■ denote D ∗ where the mismatch is within d obs ± ε. (b)■ denote M ∗ that correspond to D ∗ . (c) Comparison of mean and variance of relevant priors M ∗ to ˜ M and the reference case m re f used to generate d obs . (d) Samples of M ∗ . (e) Samples of ˜ M. 179 the reference case m re f shown in Figure 5.15(c)) and square markers depict D ∗ and M ∗ (64 relevant priors), where the data mismatch is within d obs ± ε. In contrast to Figure 5.12(a)-(b), M ∗ is more scattered as dynamic data from the first 40 months has less sensitivity to geologic features (thus increasing the complexity represented by p M ∗ ). This is also portrayed in Figure 5.15(c), where the mean map of M ∗ does not show strong features and the corresponding variance map shows high variability within M ∗ (except at the well locations). In this example, the generated conditional models ˜ M (Figure 5.15(e)) lack granularity and appear to be smooth compared to M ∗ (Figure 5.15(d)) as the quality of ˜ M depends on how well GAN approximates p M ∗ . The mean and variance map of ˜ M in Figure 5.15(c) are consistent with those of M ∗ , although the variability is slightly underestimated. Figure 5.16 shows the mean, P10 and P90 curves of ˜ D and D ∗ where the gray vertical line marks the conditioning and forecast period. The forecast profiles of ˜ D are very similar to D ∗ for Producer P3 that is introduced at the beginning of forecast period and other existing wells (only P1 is shown in the figure). This shows promising performance for GAN’s ability to capture the uncertainty within d obs ± ε and the unresolved uncertainty within model realizations. We have observed that across several multiple multi-facies reference cases, on average, less than 10% of conditional models generated by GAN have to be rejected as the production behavior (in conditioning period only) is not within d obs ± ε. The amount of conditioning data that is avail- able translates to the uncertainty in geologic features that can be resolved. In our experiments, as less conditioning data is available (i.e., at the early lifecycle of a field), more geologic uncertainty in model realizations is present and the performance of GAN (given the same architecture and capacity) deteriorates as the complexity in p M ∗ increases. 5.3 Summary and Discussion In this chapter, we demonstrate two simple approaches for generating flow-conditioned models under non-Gaussian prior models and nonlinear forward models. In the first approach, simulated 180 Figure 5.16: Profiles of D ∗ and ˜ D simulated from M ∗ and ˜ M respectively. Forecast period starts at month-40 (when P3 is drilled) and conditioning period is from month-0 to 40. 
181 flow data from prior models are clustered using K-means algorithm and each flow data is assigned a label based on the cluster it belongs to. The number of clusters is determined such that the variance of data around each centroid represents the uncertainty in the observed data. CGAN is then trained offline (using the labels and associated prior realizations) to parameterize the complex multi-modal distribution of prior models for each label. Each group of prior models is then repre- sented by a label and a simpler Gaussian distribution. As field data becomes available, the relevant label (closest centroid to field data) is identified according to similarity in production behaviour and is provided to the trained CGAN (along with random vectors of Gaussian noise) to generate conditional models. The second approach is different as it requires field data to be available before relevant model realizations are selected based on the similarity of dynamic data (simulated from each realization) to the observed data. In this case, the selected models are to train GAN to generate additional conditional models through fast feed-forward sampling. The generated conditional models tend to preserve geologic realism and are able to reproduce the field data within the specified error. In both of these approaches, we observe that as the distribution (that needs to be learned by CGAN/GAN) becomes more complex, the quality of the generated models deteriorates. This results in a small fraction (5-10%) of models that are rejected as they either do not reproduce the field data within the specified error or do not have the right facies at the well locations. Since GAN only approximates the posterior distribution, a small fraction of the generated mod- els do not have the right facies at the well locations and this issue becomes more prevalent when the hard data is at the boundary of facies (i.e. cell at the edge of a channel). The proposed ap- proaches work well when the number of prior models is large enough for training CGAN/GAN. Since forward model run is required for each prior realization, computational cost may become prohibitive when models are large and forward solvers are complex. This computational complex- ity can be alleviated by using fast proxy models for flow prediction (data generation). Moreover, since the flow responses are converted to conditioning labels, approximate predictions that capture the general trend in the response data may prove to be sufficient for the proposed method to work. 182 A larger number of possible geologic scenarios translates to a larger prior model space that can potentially increase the size of the data space. In the first and second method, the clustering and neighborhood selection are respectively done in the data space and the variations of the data points in the data space are dependent on the design of prior geologic scenarios. For the first method, poorly chosen (i.e., irrelevant to observed data) and poorly sampled (i.e., diverse spatial patterns inadequately represented) prior geologic scenarios may result in a data space with highly sparse or concentrated data points that can affect the clustering performance and subsequently, the number of data points within each class. Similarly for the second method, the number of data points within the relevant neighborhood may be affected. In such situations, the design of the prior geologic scenarios needs to be revisited. 
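The rejection criterion discussed above (simulated response outside d_obs ± ε, or wrong facies at the hard-data locations) can be expressed as a simple acceptance test; the mismatch metric and the hard-data container used below are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def accept_model(m_gen, d_gen, d_obs, eps, hard_data):
    """Accept a generated realization only if its simulated response stays within d_obs +/- eps
    and it honors the facies observed at the well cells."""
    mismatch = np.mean(np.abs(d_gen - d_obs)) / (np.mean(np.abs(d_obs)) + 1e-12)
    honors_wells = all(m_gen[i, j] == facies for (i, j), facies in hard_data.items())
    return (mismatch <= eps) and honors_wells

# Example: hard_data maps well cell indices to observed facies codes (placeholder values).
hard_data = {(5, 12): 1, (20, 27): 0}
# keep = [accept_model(m, d, d_obs, 0.1, hard_data)
#         for m, d in zip(generated_models, simulated_responses)]
```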
In our experiments, the number of relevant model realizations used for training (for each cluster in the first method and within the relevant neighborhood in the second method) is sufficient, as we have observed that increasing the number of samples does not significantly increase the quality of the generated models. Another important consideration in training CGAN/GAN is the data requirement and the expected quality of the generated images, which depend on the application. The results from our sensitivity analysis suggest that more complex datasets with high geologic variability (i.e., when multiple geologic scenarios are considered) require more training data to represent the underlying complex distributions. While the quality of the generated models and the stability of training depend on the number of training data and the complexity of geologic features, the expected quality of the solution of an inverse problem (and the standard used to measure it) is vastly different from the standard used in applications that involve natural images. Therefore, the data requirements of CGAN/GAN in the two applications are not directly comparable. Note that recent works on CGAN/GAN research in the computer vision domain mention the need for thousands of training data points when natural images are used. However, such observations are not directly applicable to subsurface data conditioning workflows. In subsurface flow modeling applications, geological features tend to have much less complexity compared to natural images. In addition, model realizations that are used for training are typically conditioned to well data, which further reduces the geological variability. Therefore, the training data requirements of CGAN/GAN that are anticipated in other applications may not necessarily apply to subsurface flow modeling. For any given field, a longer production period and higher well density typically indicate less geologic uncertainty to be resolved. Existing wells constrain the prior models, and the integration of data from longer production periods reduces the uncertainty in the forecast. The proposed methods yield a significant reduction in prior uncertainty, and if a perfect fit to observed data is desired, the generated conditional models can be used as initial solutions in a least-squares history matching formulation by optimizing the latent variables. Additional work is underway to test the proposed approaches in complex three-dimensional models with irregular geometry and complex heterogeneity. Other issues related to working with GAN and its variants, such as training stability and possible underestimation of uncertainty, are current research topics in relevant fields of computer science. Furthermore, multiple data augmentation schemes and transfer learning approaches have been proposed for training CGAN/GAN with limited datasets [109] to avoid overfitting and are important avenues to explore in future works. Additionally, the performance of the proposed methods on a continuous geologic field (described by variogram continuity models) and when within-facies uncertainty in the reservoir properties is present needs to be evaluated. In this work, CGAN/GAN show promising performance for compact description of complex non-Gaussian prior models and, with a robust data conditioning mechanism, can be used to generate flow-conditioned models for forecasting purposes.
Generating flow-conditioned models in complex subsurface system that are characterized by non-Gaussian variability is important for forecasting the production performance and the associated uncertainty and for guiding future de- velopment plans. As additional new developments in deep learning research are integrated with domain expertise in subsurface flow modeling these techniques promise to offer more advanced capabilities for handling complex model calibration problems. 184 Chapter 6 Recurrent Neural Networks for Long-term Production Forecasting in Unconventional Reservoirs Robust production forecasting allows for optimal resource recovery through efficient field manage- ment strategies. In hydraulically fractured unconventional reservoirs, the physics of fluid flow and transport processes is not well understood and the presence of, and transitions between multiple flow regimes further complicate forecasting. An important goal for field operators is to obtain a fast and reliable forecast with minimal historical production data. The abundance of wells drilled in fractured tight formations and continuous data acquisition effort motivate the use of data-driven forecast methods. However, traditional data-driven forecast methods require sufficient training data from an extended period of production for any target well and may have limited practical use. Therefore, we propose a deep learning statistical forecast model based on Recurrent Neural Networks (RNN) that employs transfer learning [183] by combining early production data from the target well with the dynamics captured from historical production data in other relevant wells. Transfer learning is a machine learning concept that focuses on applying the knowledge learned from solving a problem, to a different but related problem [263]. Motivated by a limited supply of target training data, transfer learning has been applied to image classification problems [123], natural language processing tasks [209], seismic phase picking with discrepancy in scales between the target and source data [35], production optimization [269], and multi-scale flow simulation [273], among many others. In general, transfer learning involves two stages: (i) the pretraining 185 stage using source data for a source task and (ii) the adaptation stage where the learned knowledge is applied on target data for a target task. Depending on the nature of the source data and target data, the second stage may involve fine-tuning of the learned weights (of a neural network), aug- mentation of the original neural network with additional layers, and various retraining schemes for feature transformation and adaptation [185, 263]. Shimodaira [229] demonstrated that a predictive model trained using a given source domain will not show optimal performance on a target domain when the source and target domain have different marginal distribution. In this work, the forecast model is trained on a collection of historical production data across multiple flow regimes, control settings, and the corresponding well properties from multiple shale plays. Through transfer learning, the correspondence between well properties and dynamical trends learned from other localities are exploited to improve generalization in another locality. Once trained, the forecast model takes the completion, formation and fluid properties, as well as operating controls and early (i.e., 3-6 months) production response data from a target well to pre- dict oil, water, and gas production as multivariate time-series. 
The RNN forecast model consists of LSTM cells to extract temporal trends from the multivariate (i.e., multi-phase production and controls) time-series, which are divided into pairs of a lag window and a forecast window, in a sequence-to-sequence RNN formulation. Each pair of windows represents temporal trends (dy- namic features) and is tagged with the corresponding well properties. The architecture includes an auxiliary dense neural network that jointly learns the mapping between well properties and dynamic features. The forecast model is initially trained and tested using only data from select fields in Bakken. Next, additional data points from a new field are included in the training process, with the model already initialized with weights representing the data from the initially selected fields. In traditional machine learning, the training data and testing data typically have the same input feature space with similar data distribution. This assumption may not hold in practice when the testing data have a different data distribution caused by spatial variability and a different input feature space (i.e., heterogeneous transfer learning [263]). Therefore, for the selection of initial fields (i.e., initial 186 wells) used to train the model, we consider several transfer scenarios to mimic practical infill (i.e., random scenario) and step-out (i.e., latitude and longitude scenarios) development drilling scenarios. With transfer learning, the forecast model gives improved long-term prediction for a set of unseen test data belonging to the new field (i.e., target data), when information from other relevant plays is transferred into the model (i.e., source data). Additionally, the forecast model responds to changes in controls and can accurately predict successive multiple flow regimes, using only a short period of initial production. Our results indicate that transfer learning becomes relevant to adapt a source (i.e., initial) pre- dictive model for use on a target dataset, specifically when the marginal distributions of the source and target datasets are not similar [263]. In this work, we demonstrate three scenarios; the first is the random scenario where the marginal distributions of the source and target datasets are sim- ilar. In this scenario, transfer learning is not needed as the distributions are already similar and introducing additional data points does not lead to improvement in performance. The other two scenarios are the latitude and longitude scenarios, where the marginal distributions of the source and target datasets are not similar and transfer learning is needed. For the latitude and longitude scenarios, when additional data points from the target locality are introduced to adapt the source predictive model, we observe improved performance on the target dataset. The proposed forecast model leverages the properties of the sequence-to-sequence RNN for- mulation [238]. RNN is a non-parametric learning method that captures the complex temporal relationship between the input and output dataset within the trainable weights, that can be con- veniently updated using transfer learning approaches for model adaptation. RNN is known to be able to automatically capture (by remembering or forgetting) temporal trends, offering more flex- ibility than a parametric method like DCA, and can be scaled to multivariate time-series to also capture the relationship between fluid phases in the time-series data. 
Additionally, the sequence- to-sequence formulation allows the forecast model to use time-series data with varying length by breaking the time-series down into windows (i.e., sequences). This further enables us to provide predictions for new wells when there is a discrepancy in the volume of time-series data available. 187 Without the sequence-to-sequence RNN formulation, the output of the predictive model is fixed in length and it is not possible to incorporate newly available time-series data if they do not agree with the specified fixed length. Consequently, the formulation allows us to provide recursive long- term predictions and continuously update the forecast model with newly available timesteps by converting those timesteps to sequences. The main contributions of this chapter are (i) a new sequence-to-sequence encoder-decoder LSTM forecast model with fully-connected regression layers that considers well properties and control trajectories for multivariate time-series prediction in unconventional reservoirs, (ii) the us- age of transfer learning concept to reduce training data requirement to enable long-term prediction for a target locality with limited data and, (iii) the demonstration on the impact of data distribu- tion (for the source and target datasets) on the performance of the proposed forecast model using synthetic and field datasets of unconventional reservoirs. 6.1 Dynamic Latent Space Representations with Recurrent Neural Networks Recurrent Neural Network (RNN) is a special type of neural network for sequential data with temporal trends. An RNN cell has recurrent edges and memory to process arbitrary temporal sequences. In this work, we utilize a variant of RNN called the Long-Short Term Memory (LSTM) cell that can handle the vanishing gradient problem to effectively represent long-term dependencies [89]. Figure 6.1 shows the unrolled representation of a single LSTM cell. For each element y t in the input sequence y, the internal state of the cell stores past information (C t− 1 ) that is used to selectively pass current input y t in the context of the internal state, to generate current hidden output h t [183]. Specifically, at time t, an LSTM cell receives the internal cell state C t− 1 and the hidden state h t− 1 via the recurrent edges, as well as the current input y t to generate the updated internal cell state C t and the updated hidden output state h t . 188 Figure 6.1: Long-Short Term Memory (LSTM) cell. The internal mechanism of an LSTM cell that regulates the flow of information consists of three gates, i.e., forget gate, input gate, and output gate. The forget gate decides the relevance of past information and is mathematically defined as f t =σ(U f h t− 1 + W f y t + b f ) (6.1) where σ(·) represents an element-wise sigmoid activation function that scales the vector f t to be between[0,1]. Let N h and N f be the length of the hidden output h t and the number of features (univariate or multivariate) respectively. The time-invariant weights and bias term in Equation 6.1 have the following dimension, U f ∈R N h × N h , W f ∈R N h × N f and b f ∈R N h × 1 . The input gate selects the elements of the hidden state that are to be updated and is defined as i t =σ(U i h t− 1 + W i y t + b i ) (6.2) where U i ∈R N h × N h , W i ∈R N h × N f and b i ∈R N h × 1 . The new information for updating the states is determined by a tanh layer defined as ˜ C t = tanh(U c h t− 1 + W c y t + b c ) (6.3) 189 where U c ∈R N h × N h , W c ∈R N h × N f and b c ∈R N h × 1 . 
tanh(·) is an element-wise hyperbolic tangent activation function that scales the vector i t to be between[− 1,1]. The internal cell state is then modified (i.e. element-wise multiplication) as C t = f t ∗ C t− 1 + i t ∗ ˜ C t (6.4) The output gate selects the elements of the past hidden states to combine with the modified cell state (as per Equation 6.4) that is passed through a tanh function as o t =σ(U o h t− 1 + W o y t + b o ) (6.5) h t = o t ∗ tanh(C t ) (6.6) where U o ∈R N h × N h , W o ∈R N h × N f and b o ∈R N h × 1 . The weights and bias terms are collectively defined as W ={W f ,W i ,W c ,W o }, U={U f ,U i ,U c ,U o } and b={b f ,b i ,b c ,b o }. The set of input weights W is associated with the input y t while the set of hidden weights U is associated with the previous hidden state h t− 1 . During the training process, the gates operate in tandem to identify relevant information within the long chain of sequences. The sequence-to-sequence encoder-decoder RNN family of models [238] is widely adopted in natural language processing for applications in machine translation, question answering systems, and video/image captioning. As an example, in language translation, an encoder model encodes the input sentence into a latent vector and a decoder model converts the encoded latent vector into a sentence of a target language [234]. While many variations exist, in general, it turns a sequence into another sequence using an encoder that performs temporal feature extraction and a decoder that decodes any latent vector representation. These encoder and decoder are each composed of the LSTM cell illustrated in Figure 6.1. In Figure 6.2, the LSTM encoder is denoted as Enc θ and the LSTM decoder is denoted as Dec γ . As illustrated in Figure 6.2, the encoder Enc θ processes the input sequence (i.e., a window of univariate/multivariate time-series data) by discovering salient temporal dependence within the timesteps and returns a temporal encoding. The decoder Dec γ is 190 trained to predict the successive values within a sequence given the temporal encoding generated from the input sequence. 6.1.1 Sequence-to-Sequence Forecast Model The neural network architecture of the proposed forecast model is composed of a pair of LSTM cells denoted as Enc θ and Dec γ and several fully-connected regression layers denoted as Reg ω and Reg ζ . The subscripts θ and γ represent the lumped trainable parameters (i.e., collectively defined as W,U and b) within the LSTM encoder Enc θ and LSTM decoder Dec γ respectively. The subscripts ω and ζ represent the lumped trainable parameters within the regression layers Reg ω and Reg ζ respectively. A fully-connected (i.e., dense) layer with an activation function f(·) is simply defined as Z= f(W d A+ b d ) (6.7) where A∈R N b × N i denotes the input into the layer, N b is the batch size of the input, N i is the size of the input, W d ∈R N d × N i represents the weights to be learned, N d is the number of hidden nodes of the dense layer, and b d ∈R N d × 1 represents the bias term. Leaky-ReLU is used as the element-wise activation function and for any arbitrary variable z is defined as f(z)= max(z,αz), α∈(0,1) (6.8) whereα is set to be 0.3 (as a typical default value). Stacked fully-connected layers can approx- imate complex functions and allow the model to learn an intricate nonlinear mapping between the input and output. 
In a sequence-to-sequence RNN formulation, the multivariate time-series used as input are segmented into pairs of lag window y w ∈R N l × N f and forecast window y w+1 ∈R N s × N f , where N l and N s are the lag size and forecast length respectively. Each window pair is tagged with a vector of well properties p∈R N p (where N p denotes the number of formation, completion, and fluid 191 properties used as input) and future control u w+1 ∈R N s (that dictates how the well is operated in the forecast window y w+1 ). Each tuple of data consists of u w+1 , p, y w and y w+1 . As illustrated in Figure 6.2, the recurrent operation Enc θ (y w ) yields a dynamic encoding of relevant information from the past data y w , the operation Reg ζ (p) yields an encoding of well prop- erties p and the operation Reg ω (u w+1 ) yields an encoding of future control u w+1 . These encodings are concatenated and fed into the LSTM decoder Dec γ to generate univariate or multivariate time- series prediction y w+1 for the subsequent time window. Long-term forecast is obtained by either recursively feeding the model with the prediction from the preceding timesteps or by training the model for multi-step ahead predictions (i.e., by specifying a larger N s ). In this work, the forecast model is trained to provide multi-step ahead predictions that are recursively fed into the forecast model for long-term predictions. Figure 6.2: Forecast model. Specifically, the loss function of the forecast model for a batch of input of size N b can be formalized as 192 L(ω,ζ,θ,γ)= N b ∑ ∥y w+1 − Dec γ ([Reg ω (u w+1 ),Reg ζ (p),Enc θ (y w )])∥ 2 2 (6.9) and is optimized through the backpropagation algorithm using the Adam optimizer [115] with a learning rate of 1× 10 − 4 . The architecture is trained and validated with the early-stopping strategy to prevent over-fitting. Detailed description of the layers (i.e., size of input, size of output, number of weights) within each component is given in the Appendix. As shown in Table 8.11 in the Appendix, the proposed forecast model contains 27559 trainable weights that need to be calibrated. When the target dataset is small (i.e., limited in terms of both the number of available wells and the length of observed production) in quantity, these weights can not be optimally calibrated and this necessitates the adoption of transfer learning approaches that can help address the issue of data paucity. 6.1.2 Transfer Learning for Reducing Data Requirement In this section, we describe a two-stage transfer learning method for training the forecast model, where the two stages are the initial training stage and the adaptation stage. The notations used in this section is consistent with Weiss et al. [263]. Given a source taskT S in a source domainD S and a target taskT T in a target domainD T , transfer learning is the process of improving the target predictive model f T (·) by using the related information fromT S andD S , whereD S ̸=D T and/or T S ̸=T T [263]. Let X S , P(X S ), Y S , P(Y S ) be the source input feature set, the marginal probability distribution of the source domain feature set, the source output set and the marginal probability distribution of the source domain output set, respectively. Similarly, let X T , P(X T ), Y T , P(Y T ) be the target input feature set, the marginal probability distribution of the target domain feature set, the target output set and the marginal probability distribution of the target domain output set, respectively. 
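Pulling the components of Section 6.1.1 together before moving on, one plausible realization of the forecast model (Enc_θ, Dec_γ, Reg_ω, Reg_ζ and the loss of Equation 6.9) is sketched below with Keras. The layer widths are illustrative and do not reproduce the exact dimensions listed in Table 8.11.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_l, N_s, N_f, N_p, N_h = 3, 6, 3, 8, 32          # illustrative dimensions

y_w  = layers.Input(shape=(N_l, N_f), name="lag_window")       # past production y_w
p    = layers.Input(shape=(N_p,), name="well_properties")      # properties p
u_w1 = layers.Input(shape=(N_s,), name="future_controls")      # controls u_{w+1}

enc   = layers.LSTM(N_h, name="Enc_theta")(y_w)                             # temporal encoding
reg_p = layers.LeakyReLU(0.3)(layers.Dense(16, name="Reg_zeta")(p))         # property encoding
reg_u = layers.LeakyReLU(0.3)(layers.Dense(16, name="Reg_omega")(u_w1))     # control encoding

context = layers.Concatenate()([reg_u, reg_p, enc])
context = layers.RepeatVector(N_s)(context)                     # one copy per forecast step
dec  = layers.LSTM(N_h, return_sequences=True, name="Dec_gamma")(context)
y_w1 = layers.TimeDistributed(layers.Dense(N_f))(dec)           # forecast window y_{w+1}

model = Model([u_w1, p, y_w], y_w1)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")   # Equation 6.9
```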
In the initial training stage, the source predictive model maps the source input feature set to the source output set, where Y S = f S (X S ). Equivalently, for the target set, this mapping is defined as Y T = f T (X T ). In our application, the source and target domain input set (X S and X T ) are 193 comprised of the well properties p, the future control trajectory u w+1 , and the lag window y w while the source and target domain output set (Y S and Y T ) are comprised of the corresponding forecast window y w+1 , as depicted in Figure 6.3. The number of wells in the source and target datasets is denoted as N source and N target respectively, where generally N source >> N target . Figure 6.3: Transfer learning workflow. The marginal distribution of the source input and output may not be the same as the marginal distribution of the target input and output (i.e., P(X S )̸= P(X T ) and P(Y S )̸= P(Y T )). In this work, we assume that the source input and target input have the same set of features (i.e., homogeneous transfer learning) and the source task and target task are the same, whereT S =T T . Additionally, in the adaptation stage, the choice to include the source dataset when adapting the pretrained source predictive model f S (·) for the target task is domain-dependent and depends on data availability and the size of the source dataset and the target dataset. A large size discrepancy between the source dataset and the target dataset may induce frequency bias. For our application, the source dataset 194 is typically sourced from publicly available data repository and may be included in the adaptation stage (i.e.,[Y S Y T ]= f T ([X S X T ]) and the inclusion is denoted by the green arrows in Figure 6.3). In other applications, when only the pretrained source predictive model is available, or retraining with the source dataset is unfeasible due to resource limitation, the learned knowledge within f S (·) can be transferred to f T (·) through parameter fine-tuning, network expansion and various retraining schemes [185, 263], such as selectively retraining parts of the neural network or training another neural network to map the output of the source model to the output of the target model. We have observed that including the source dataset (when available) in the adaptation stage yields a slightly more accurate target forecast model [46] as the source dataset ensures that the original information learnt within the original source model is not lost. In this work, as illustrated in Figure 6.3, the source forecast model f S (·) is trained with a large collection of relevant source dataset (represented by the larger polygons) to capture the complex dynamical relationship between the source input X S (i.e., p, u w+1 , and y w ) and the source output Y S (i.e., y w+1 ). In our application, the source dataset typically includes multiple wells with well properties and production behavior over a long period of time consisting of multiple phases of production across multiple flow regimes. Note that since the production profiles for each well have been segmented into windows and tagged with the corresponding well properties, they can be taken as independent data points and have been illustrated as such. The equivalent time-series plot representation can be viewed in Figure 1.6 where a longer period of production simply translates to more pairs of windows that can be used for training. 
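The two-stage workflow of Figure 6.3 (initial training on the source dataset, then adaptation on the target dataset, optionally concatenated with the source dataset) might be sketched as follows. Here build_forecast_model() is a stand-in constructor for the architecture of Section 6.1.1, and the data arrays are random placeholders with the expected shapes; this is an illustrative sketch, not the exact training script.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

N_l, N_s, N_f, N_p = 3, 6, 3, 8

def build_forecast_model():
    """Stand-in for the encoder-decoder forecast model of Section 6.1.1."""
    y_w, p, u = layers.Input((N_l, N_f)), layers.Input((N_p,)), layers.Input((N_s,))
    ctx = layers.Concatenate()([layers.Dense(16)(u), layers.Dense(16)(p), layers.LSTM(32)(y_w)])
    seq = layers.LSTM(32, return_sequences=True)(layers.RepeatVector(N_s)(ctx))
    out = layers.TimeDistributed(layers.Dense(N_f))(seq)
    m = Model([u, p, y_w], out)
    m.compile(tf.keras.optimizers.Adam(1e-4), "mse")
    return m

rng = np.random.default_rng(6)
def placeholder_tuples(n):
    X = [rng.random((n, N_s)), rng.random((n, N_p)), rng.random((n, N_l, N_f))]
    return X, rng.random((n, N_s, N_f))

X_source, Y_source = placeholder_tuples(2000)   # many wells, long production histories
X_target, Y_target = placeholder_tuples(200)    # few wells, short production histories

# Initial training stage: fit the source model f_S on the source dataset.
source_model = build_forecast_model()
source_model.fit(X_source, Y_source, epochs=3, batch_size=64, verbose=0)

# Adaptation stage: initialize f_T with the source weights, then retrain on target (+ source) data.
target_model = build_forecast_model()
target_model.set_weights(source_model.get_weights())
X_joint = [np.concatenate([s, t]) for s, t in zip(X_source, X_target)]
Y_joint = np.concatenate([Y_source, Y_target])
target_model.fit(X_joint, Y_joint, epochs=3, batch_size=64, verbose=0)
```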
A trained f S (·) from the initial training stage will not perform optimally for a target dataset with different marginal distributions, thus necessitating the subsequent adaptation stage. In the adaptation stage, the trained weights from the source forecast model f S (·) is used to initialize the target model f T (·) before f T (·) is trained with the target dataset (as well as the source dataset, if available). In contrast to the source dataset, the target dataset typically is not only smaller in size (denoted by the relative size of polygons in Figure 6.3) but also the wells within the target dataset have only produced for a shorter period of time (denoted by the smaller number 195 of sequences in Figure 6.3). Due to this, building a forecast model using only the target dataset is impractical and unfeasible as the model will not be able to predict the long-term production behavior in the prediction stage, as the model has not seen sufficient long-term production behavior during training. Additionally, an underdetermined deep learning model with a large number of trainable weights that is trained using only limited data points within the target dataset will result in physically unreliable forecasts. We can benefit from the knowledge learned by f S (·) by adapting it through transfer learning approach to become the target model f T (·) by retraining f S (·) with data points from the target dataset as they become available (e.g., newly drilled wells, additional observed production responses), enabling the construction of a reliable target forecast model even when the target dataset is limited in volume. 6.2 Numerical Experiments and Results 6.2.1 Example 1: Toy Data In this section, we explore the impact of including coordinate information as a feature in predicting well production behavior using a synthetic dataset. In Xi and Morgan [266], the Kriging interpo- lation method is applied on DCA parameters learned from production data of existing wells. The resulting parameter maps are used to forecast gas production at new well locations in the Marcel- lus shale. Using a neural network architecture, the same objective can be achieved by considering coordinate information (i.e., longitude and latitude) as the well input properties p and the pro- duction data as the pairs of lag and forecast windows (i.e., y w and y w+1 ). For this experiment, we assume a constant well operating control and omit the component Reg ω . The neural network forecast model simultaneously learns the weights associated with the production responses and the correspondence between input coordinate values and the learned weights, to perform implicit spatial interpolation when presented with an unseen test coordinate. To generate the synthetic production data, consider the following hyperbolic function as the forward model (i.e., physical simulator) 196 q t = q i (1+ bd i t) 1/b (6.10) where q i is the initial production rate, b is the hyperbolic decline constant, d i is the nominal decline rate, t is the timestep and q t as the production rate at timestep t. To represent the well locations, 150 coordinates (as input parameter p where N p = 2) are randomly selected, and the corresponding tuples of (q i , b, d i ) are extracted from the reference maps in Figure 6.4. A set of 150 production data (with 12 timesteps) is obtained using Equation 6.10 and the extracted parameter tuples, and is represented as y w and y w+1 where N f = 1 (i.e., univariate), N s = 1 and N l = 1. 
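Generating the synthetic toy dataset from Equation 6.10 is straightforward; the parameter trends below are placeholders standing in for the reference maps of Figure 6.4.

```python
import numpy as np

def hyperbolic_rate(q_i, b, d_i, t):
    """Arps hyperbolic decline of Equation 6.10: q_t = q_i / (1 + b * d_i * t)^(1/b)."""
    return q_i / (1.0 + b * d_i * t) ** (1.0 / b)

rng = np.random.default_rng(7)
coords = rng.random((150, 2))                 # 150 normalized (longitude, latitude) well locations
q_i = 800.0 + 400.0 * coords[:, 0]            # placeholder spatial trends for (q_i, b, d_i);
b   = 0.5 + 0.5 * coords[:, 1]                # in the experiment these are read from Figure 6.4
d_i = 0.05 + 0.10 * coords[:, 0] * coords[:, 1]

t = np.arange(12)                             # 12 monthly timesteps
rates = hyperbolic_rate(q_i[:, None], b[:, None], d_i[:, None], t[None, :])   # shape (150, 12)
# each row is then split into lag/forecast window pairs (y_w, y_{w+1}) with N_l = N_s = 1
```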
We first consider a transfer learning scenario (i.e., random) to mimic an infill drilling development project where the forecast model is initially trained using 120 randomly selected data points (i.e., as the existing wells) and then tested on the remaining 30 data points (i.e., as the infill wells). The forecast for each coordinate in the reference map is obtained using the trained forecast model. The error histogram (between each forecast and simulated data using tuples extracted from the reference maps) and representative samples of forecast are shown in Figure 6.5. For the random scenario, the trained forecast model can give reliable predictions for the test dataset and all coordinates in the reference map as plotted in Figure 6.6. We further consider two additional transfer learning scenarios (i.e., latitude and longitude) to mimic a step-out drilling development project. For the latitude scenario, the forecast model is trained using 87 wells located in the northern part of the field and tested with 63 coordinates of the southern region, as shown by the markers in Figure 6.4. Similarly, for the longitude scenario, 85 wells located in the western part of the field are used for training, and 65 coordinates of the eastern region are used for testing. The error histogram in Figure 6.5 shows that, for the step-out scenarios, while the forecast model is still able to give reliable forecasts, the errors are larger when compared to the infill scenario. As the forward model remains the same regardless of well location, the larger errors are attributed to the complexity of predicting the spatial parameters in the step-out regions. To further investigate the implicit spatial interpolation performed by the forecast model, we next omit the components Enc θ and Dec γ and train Reg ζ to predict the tuples of (q i , b, d i ). We 197 Figure 6.4: Spatial interpolation with coordinate information for three scenarios of transfer learn- ing. consider the same three transfer learning scenarios and use the trained model to predict the param- eter tuple for each coordinate in the reference map. The predicted parameter maps in Figure 6.4 show useful results for infill and step-out scenarios. The forecast errors and parameter prediction 198 Figure 6.5: Error histogram and samples of forecast versus simulated reference. errors for all scenarios are shown in Figure 6.6 where the error bar represents one standard devi- ation of Root Mean Square Error (RMSE) values for ten repeated runs. The RMSE between an arbitrary vector v of length N v and its predicted vector of values ˆ v is defined as: RMSE = r 1 N v Σ N v i=1 ˆ v i − v i 2 (6.11) Note that in practice, additional data collected from new wells drilled in the step-out regions can be used to update the forecast model (i.e., the adaptation stage of transfer learning). The bars labeled transfer in Figure 6.6 reflect the improved generalization capability when data points from the test dataset (33 and 35 randomly selected data points for the latitude and longitude scenarios respectively) are used for retraining the forecast model. Across the three scenarios, the random scenario does not need an adaptation stage as the source and target datasets typically share the same distribution while the latitude and longitude scenarios need an adaptation stage as the source and target datasets may have different distributions. 
This is illustrated by the marked difference between the RMSE for the training and test datasets for the latitude and longitude scenarios and much reduced RMSE when transfer learning is applied. For the random scenario, the RMSE for the training and test datasets are comparable, suggesting that the training data is relevant and can be directly used for test prediction. 199 Figure 6.6: RMSE of the training set, testing set, set of all coordinates in the reference map, and set of training data combined with a portion of the testing set for three scenarios of transfer learning. 6.2.2 Example 2: Synthetic Bakken Data In this section, we demonstrate the capability of the proposed forecast model to provide long-term multivariate (i.e., N f > 1) time-series prediction under variable controls. The synthetic dataset used in this experiment is generated based on the distributions of the formation, completion and fluid properties for the Bakken shale, obtained from the literature [44, 53, 56, 57, 72, 107, 135, 192, 198, 217, 246, 251]. For this experiment, all parameters are assumed to follow a normal distribution. The porosity correlation with permeability and depth is assumed to be positive and negative respectively [44, 192, 198]. Only formation and fluid properties are considered while the completion properties are kept constant. Note that in this experiment, we utilize a synthetic dataset so we can factor out common issues related to field data such as incomplete data, inaccuracies, and low signal-to-noise ratio, to simply demonstrate the mechanism of the proposed forecast model. The MATLAB Reservoir Simulation Toolbox (MRST) is used to simulate 5-year (i.e., 60 timesteps) multi-phase (i.e., oil, water and gas) monthly production responses of a typical hydraulically- fractured well in an unconventional reservoir with homogeneous formation properties within the drainage area. Specifically, 4200 samples of input properties p where N p = 6 (as illustrated in Figure 6.7) and variable control trajectories are used as input into the simulator to yield 4200 mul- tivariate production responses (as shown in Figure 6.8). Gaussian noise is added to the simulated production responses, and 2100 data points are used as the training set while the remaining data 200 points are reserved for testing. The production responses are represented as a sequence of windows y w and y w+1 where N f = 3, N s = 6 and N l = 3 while the corresponding control trajectories are rep- resented as u w+1 . In this experiment, since the training and testing sets are randomly sampled from the same initial set, they are statistically similar (as observed from the overlaps in Figure 6.7 and Figure 6.8) and represent the random transfer learning scenario. Figure 6.7: Sampled formation and fluid properties for the training set and testing set. Figure 6.8: Simulated production responses in percentiles (P90/P50/P10) for the training set and testing set. 201 The forecast model is trained using input tuples of (u w+1 , p, y w ) to predict the multivariate output y w+1 . The proposed forecast model is trained on a high-performance computing cluster with an NVIDIA Tesla P100 GPU node for approximately 45 minutes (spanning 100 epochs) and are checkpointed every 5 epoch. Figure 6.9 compares the normalized predicted fluid phases (of length N s for each window ˆ y w+1 ) to the actual reference values y w+1 , for all windows of a representative test case. 
The trained forecast model can generate multivariate predictions that respond to the specified control trajectories, given a set of well properties. Figure 6.10 shows the statistical performance of the forecast model (a single training instance) for the training and testing sets, where the RMSE and Mean Relative Error (MRE) values are consistent with the error introduced by the added Gaussian noise. The MRE between an arbitrary vector v of length N_v and its predicted counterpart v̂ is defined as

\mathrm{MRE} = \frac{1}{N_v}\sum_{i=1}^{N_v}\frac{\hat{v}_i - v_i}{v_i}    (6.12)

However, using the forecast model in this mode requires that the past (i.e., observed) data y_w become available before any prediction of the succeeding (i.e., future) window y_{w+1} can be obtained.

Figure 6.9: Predicted production responses (normalized) versus reference (for all windows) for a test well.
Figure 6.10: Prediction statistics for the training set and testing set.

For more practical usage, the forecast model can be recursively fed with forecasts from the preceding timesteps to generate multi-step-ahead predictions. While the value of N_s can be increased to enable a longer prediction window, our experiments show that N_s values that are too large reduce performance because the number of data tuples available for training decreases. In transfer learning applications, the trained forecast model is supplied with well properties from a newly drilled infill or step-out well, early production response data (as the first y_w of length N_l), and operating controls (for future timesteps) to obtain the forecast. Since this experiment uses a synthetic dataset with no spatial reference (i.e., no coordinate information) and common distributional statistics (i.e., the random scenario) for both the training and testing sets, a retraining (i.e., adaptation) step of the forecast model with additional data from the testing set is not necessary.

Figure 6.11 shows the recursive multi-step predictions of three representative test wells (representing good, average, and poor predictions) for transfer learning. Collected production responses for the first time window y_w of length N_l (denoted as colored scatter points) are used to seed the recursive computation, while the well properties and control trajectories (denoted as purple lines) dictate the future production behaviour (i.e., decline type). For the three wells in Figure 6.11, the forecasts are made for 9 windows of length N_s each, resulting in predictions of 54 timesteps. Since N_s > N_l, only the last N_l timesteps of every ŷ_{w+1} are utilized in each recursive computation. While the forecasts can be generated for any number of recursive windows, note that the RMSE is calculated between the forecasts and the available (i.e., simulated) reference values. The long-term multivariate predictions in Figure 6.11 show that the forecast model can simultaneously predict multiple phases of production across multiple flow regimes using minimal historical production data.

Figure 6.11: Predicted production responses versus reference of sample test wells for transfer learning.
Figure 6.12: Prediction statistics of the testing set for transfer learning.

Figure 6.12 shows the statistical performance of the forecast model (a single training instance) for the testing set when used in random transfer learning mode. The scatter plot (first column) compares the recursive predictions (i.e., successive ŷ_{w+1} windows) to the corresponding simulated reference values y_{w+1}.
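To make the recursive rollout concrete, the sketch below shows one way the window-by-window recursion described above might be implemented, assuming a trained multi-input Keras-style model forecast_model that maps a tuple (u_{w+1}, p, y_w) to ŷ_{w+1}; the function and variable names, shapes, and input layout are illustrative placeholders rather than the exact implementation used in this work.

import numpy as np

# Illustrative dimensions from the text: N_f = 3 phases, N_s = 6-step output
# window, N_l = 3-step seed window.
N_f, N_s, N_l = 3, 6, 3

def recursive_forecast(forecast_model, p, u_future, y_seed, n_windows=9):
    """Roll the trained forecast model forward for n_windows windows of length N_s.

    p        : (N_p,)               static well properties
    u_future : (n_windows, N_s, .)  future control trajectory for each window
    y_seed   : (N_l, N_f)           observed early production used to seed the recursion
    """
    seed = y_seed.copy()
    forecasts = []
    for w in range(n_windows):
        x_in = [u_future[w][None, ...], p[None, :], seed[None, ...]]
        y_hat = forecast_model.predict(x_in)[0]        # predicted window, (N_s, N_f)
        forecasts.append(y_hat)
        seed = y_hat[-N_l:]        # since N_s > N_l, only the last N_l timesteps
                                   # of the predicted window seed the next pass
    return np.concatenate(forecasts, axis=0)           # (n_windows * N_s, N_f)

With n_windows = 9 and N_s = 6 this yields the 54-timestep forecasts discussed above; the compounding of errors through the repeated use of predicted windows as seeds is what drives the behavior examined next.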
The RMSE value is slightly higher than what is observed in Figure 6.10 due to the compounded errors in recursive long-term predictions when newly observed data is not available to calibrate the predictions. This is evident in the bar plot (second column of Figure 6.12), where the error grows with increasing temporal distance from the initial y w window. We also observe that the error tends to be relatively higher for the timesteps with frequent changes in control trajectories. The histogram (third column of Figure 6.12) shows the error distribution of the test cases where the mean error is consistent with the mean error introduced by the added Gaussian noise. Note that the three sources of error are (i) compounded error from the recursive predictions, (ii) error in predicting the effect of changes in well control to the production rate of each fluid phase and (iii) error in discerning the declining trend from the well properties. When the forecast model is used for the step-out transfer learning scenarios (i.e., latitude or longitude scenario as explained in the previous section), the complexity in extrapolating spatial trends may introduce additional error. Nonetheless, the introduction of additional data (as training data) from the new regions can help calibrate the forecast model to be more effective for new localities. 6.2.3 Example 3: Field Data from Bakken Shale Play In this section, we demonstrate the performance of the forecast model using field data comprising 886 wells across 199 unique fields collected from the Bakken shale in North Dakota. They are downloaded from the North Dakota Department of Mineral Resources web page 1 . It is assumed that there is no inter-well communications and no distinction is made between wells that are pro- ducing from the Middle Bakken formation and the Three Forks formations. The collected well 1 https://www.dmr.nd.gov/ 205 properties p where N p = 8 (i.e., the volume of fracturing fluid, weight of proppant, treatment pres- sure, treatment rate, number of fracture stages, well lateral length, latitude and longitude) and their corresponding observed multivariate (i.e., N f > 1) production responses (i.e., oil, water and gas phases) and historical control trajectories (shut-in indicator where a value of 0 represents a shut-in period and a value of 1 represents a flowing period), both of variable length for each well, are rep- resented as y w , y w+1 and u w+1 where N f = 3, N s = 6 and N l = 3. Figure 6.13 shows the histograms of well parameters for the Bakken dataset. We consider the three transfer learning scenarios as de- scribed in the previous sections. The proposed forecast model is trained on a high-performance computing cluster with an NVIDIA Tesla P100 GPU node for approximately 52 minutes (spanning 130 epochs) and are checkpointed every 5 epoch. Figure 6.13: Histograms of well parameters for the Bakken dataset. The first column in Figure 6.14 shows the spatial distribution of the training and testing wells for each scenario. The number of training and testing wells used (as a tuple of (train,test)) is (664,222), (544,342), and (554,332) for the random, latitude and longitude scenarios, respec- tively. For brevity, the well properties (excluding coordinates), production data, and well controls 206 are represented in the first two principal components (PC) using Principal Component Analysis (PCA), and the contour density maps are shown in the second to fourth column in Figure 6.14, for each scenario. 
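As a rough illustration of how this two-component projection can be produced, the following scikit-learn sketch fits PCA on the training features and projects both sets for a density or contour plot; the array names are placeholders, and fitting the scaler and PCA on the training data only is one reasonable choice rather than the exact procedure used here.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def first_two_pcs(train_features, test_features):
    """Project train/test feature matrices onto the first two principal
    components (fitted on the training data) to compare their distributions."""
    scaler = StandardScaler().fit(train_features)
    pca = PCA(n_components=2).fit(scaler.transform(train_features))
    z_train = pca.transform(scaler.transform(train_features))
    z_test = pca.transform(scaler.transform(test_features))
    return z_train, z_test   # each (n_samples, 2), e.g. for KDE contour maps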
As expected, the distributions of the training and testing datasets show more overlap in the random scenario when compared to the two step-out scenarios where large spatial variations may be present between the training and testing regions. Figure 6.14: Distributions of training and testing datasets for three scenarios of transfer learning. The forecast model is initially trained using input tuples of (u w+1 , p, y w ) from the training dataset to predict the multivariate output y w+1 , in the initial training stage of the two-step transfer learning process. The forecast errors for all scenarios are shown in Figure 6.15, where the error bar represents one standard deviation of RMSE values for five repeated runs. The error for long-term recursive predictions (bars labeled as recursive) is consistently higher when compared to short- term predictions (bars labeled as test). The bars labeled transfer in Figure 6.15 reflect the improved 207 generalization capability for long-term predictions when data points from the test dataset (120 and 110 randomly selected data points for the latitude and longitude scenarios, respectively) are used for retraining the forecast model, in the adaptation stage of the two-step transfer learning process. Additionally, when this experiment is performed without including coordinate information as the features in p, we observe larger prediction errors for the latitude and longitude scenarios. This can be observed by comparing the bars labeled transfer to the bars labeled transfer-b (without coordinate information) in Figure 6.15. Figure 6.15: RMSE of the training set, testing set, testing set for recursive predictions, and set of training data combined with a portion of the testing set for recursive predictions, for three scenarios of transfer learning. Figure 6.16 shows the recursive multi-step predictions of three representative test wells for transfer learning. The first observed time window y w of length N l (denoted as colored scatter points) are used to seed the recursive computation, where the well properties and shut-in indicator (denoted as purple lines) dictate the future production behaviour. For all wells in Figure 6.16, the forecasts are made for 18 windows of length N s each, resulting in predictions of 108 timesteps. The forecast model can remove the noise in the production data during training to provide reliable long- term predictions. In most cases, the forecast model not only responds to the changes in control but is also able to predict the jump in production rates caused by pressure build-up after every shut-in period. Note that for this field dataset, a well may be shut-in for various well stimulation operations or simply due to other factors that may or may not be reported. As such, we can expect that if a well 208 Figure 6.16: Predicted production responses (normalized) versus reference of sample test wells for transfer learning. is shut-in for a stimulation operation, a successful stimulation job would increase well productivity and the initial vector of well properties p may not be as relevant anymore, which is a point of caution for practitioners. In the second row in Figure 6.16, we observe that the proposed forecast model underestimates the production rates after each shut-in period. This is potentially attributed to well stimulation operations and introduces an added level of complexity. 
Figure 6.17 shows the statistical performance of the forecast model (a single training instance) for the testing set when used in random transfer learning mode. For the latitude and longitude scenarios, the performance can vary depending on the selection of additional wells introduced to 209 Figure 6.17: Prediction statistics (random scenario) of the testing set for transfer learning. retrain the forecast model (in the adaptation stage of the two-step transfer learning process), there- fore we have provided statistical results averaged over 5 runs in Figure 6.15. The scatter plot (first column) compares the recursive predictions (i.e., successive ˆ y w+1 windows) to the corresponding observed reference values y w+1 . Compared to the synthetic case in Experiment 2, there is consid- erable noise or production behavior that is not accounted for by the shut-in indicator and collected well properties. Additionally, the normalized production rates are dominated by smaller values that are typically seen in later stages of production. The bar plot (second column of Figure 6.17) shows that the error grows with increasing tem- poral distance from the initial y w window and more error fluctuations are observed for the early timesteps potentially due to the higher fluctuations in early production rates as depicted in Figure 6.16. The histogram (third column of Figure 6.17) shows the error distribution of the test cases, used to select the sample test wells in Figure 6.16. Our observations in this experiment suggest that there are complexities when working with field data that cannot be completely resolved and may introduce large errors, especially when the collected field data is incomplete, inconsistent, and noisy. Additionally, simplifying assumptions (e.g., the absence of inter-well communications and lumping production from different stratigraphic zones) can introduce additional errors. While this experiment with field data has shown promising results, practitioners need to be aware of these complexities inherent within any field dataset. Given the complexity and paucity of the collected 210 field data, the performance of the forecast model is deemed as acceptable and can be continuously improved through the transfer learning workflow when updated with newly available data. 6.3 Summary and Discussion In this chapter, a unified deep learning architecture based on recurrent neural networks and fully- connected neural network layers is presented as a forecast model. The proposed model uses a sequence-to-sequence LSTM autoencoder to extract dynamical trends from historical univariate or multivariate production data across multiple flow regimes. The model learns a mapping from both input well properties and control trajectories to the temporal dynamical trends in predicting the future production behaviour. To alleviate the high data requirement of training a deep learning architecture for a new target locality with limited available historical data, we employ transfer learning where only minimal target data (i.e., early production data, well properties, and user- specified controls for the target well) is required to generate forecasts. The transfer learning process transfers and adapts useful information from a large source dataset (with a large number of wells that have produced for a long period of time) into the target forecast model. Additionally, transfer learning allows the forecast model to be continuously fine-tuned with recently available dynamic data for optimal performance. 
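A minimal sketch of the adaptation stage referred to above is shown below, assuming a Keras model that has already been trained on the source dataset and a small batch of newly available target-well windows; all names are hypothetical, and retraining on the combined source-plus-target data (as in the bars labeled transfer) is only one of several possible adaptation strategies.

import numpy as np
from tensorflow import keras

def adapt_forecast_model(model, x_source, y_source, x_target, y_target,
                         epochs=50, lr=1e-4):
    """Adaptation stage of transfer learning: retrain a source-trained forecast
    model on the source data combined with newly collected target-well windows."""
    x_all = np.concatenate([x_source, x_target], axis=0)
    y_all = np.concatenate([y_source, y_target], axis=0)
    model.compile(optimizer=keras.optimizers.Adam(lr), loss="mse")
    model.fit(x_all, y_all, epochs=epochs, batch_size=32,
              validation_split=0.2, verbose=0)
    return model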
Several scenarios of transfer learning are investigated to mimic practical infill and step-out drilling development projects. While there can be many other development scenarios involving complex geology and constraints related to surface facilities, acreage boundaries and local regu- lations, our observations are general and indicate that the correspondence between well properties and dynamical trends learned from other localities can be exploited to improve generalization in another locality. The experiments we conducted show that location information helps interpolate and extrapolate complex implicit functions that relate the input to the output data, where less error is typically observed with interpolation (i.e., infill drilling) compared to extrapolation (i.e., step- out drilling). This insight is useful as spatially variable properties may be unavailable, unknown 211 or expensive to collect, and coordinate information can serve as a proxy by learning a mapping between the coordinates and observed production responses. However, we note that the forecast model may also learn spatial mappings for properties that do not naturally vary in space (e.g., com- pletion parameters such as number-of-stages and volume of proppant) where caution needs to be exercised by practitioners for field applications. The proposed forecast model is also tested with field data from Bakken shale and the results indicate that the model can give an improved long-term prediction for a set of unseen test data belonging to a new target locality (i.e., field) when information from other relevant source regions is transferred and adapted into the model using the two-step transfer learning process. Additionally, the forecast model responds to changes in controls and can accurately predict a succession of multiple flow regimes for multiple fluid phases, using only a short period of initial production. Given the abundance of wells drilled in unconventional reservoirs and continuous data acquisition effort, the formulation of the forecast model combined with transfer learning offers an opportunity to develop a more viable and practical deep learning architecture to address standing issues in production forecasting. The main insight discernible from the set of experiments presented in this work is that the sequence-to-sequence encoder-decoder LSTM forecast model can be combined with transfer learning approaches, to adapt a source predictive model to become a target predictive model that can be optimally utilized on the limited target dataset for when there is a discrepancy in the distributions of the source and target datasets. Additionally, the sequence-to-sequence formulation of the forecast model allows practitioners to conveniently use time-series data with varying length. When newly available time-series data from the target locality is used to adapt the source predictive model, considerable reduction in pre- diction error is observed for the target dataset. In general, the longer is the period of production (covering multiple flow regimes) and the higher is the number of target wells used to adapt the source model (providing more information on the relationship between well properties and pro- duction behavior), the higher is the reduction in prediction error. 
It should be noted that, if the prediction period (say 6 years) significantly exceeds the available production period used in the 212 training and adaptation stages (say 3 years), the forecast model may give physically inconsistent values for the ”out-of-sample” period (i.e., the last 3 years). While the formulation allows the forecast model to incorporate newly available data (time-series data that are segmented into pairs of windows), in the prediction mode, the proposed method also requires users to wait until the production data for the first 3-6 months to become available (i.e., to be used to seed the recursive computation for long-term predictions). In contrast, forecast models that take only well proper- ties as input to give a prediction of fixed length [46] do not require such wait. More research is needed to thoroughly compare these two approaches in terms of their performance, limitations and benefits. The set of experiments performed in this work show that the forecast model and transfer learn- ing approach we have adopted are effective in providing reliable long-term predictions. Specif- ically, while we show that statistical information from different fields can be useful in inferring production behavior in another field, it begs the question of how our observations can be gener- alized to fields where the fluid phases are different or stratigraphic properties differ widely. As an example, a dataset consisting of oil and water production from a calcareous shale play may or may not contain transferable statistical information for another dataset consisting of gas and water production from a clay-rich shale play. How generalizable is our proposed approach, or transfer learning for that matter, is an important subject for future works. While our results are promising, more research is needed to develop effective transfer learning strategies that are tailored specifically to subsurface problems, especially between different shale plays located in different geographical regions where the geologic and fluid properties may show significant variations (i.e., large discrepancy in the distributions). In specific, a robust metric to determine if a source dataset is relevant for a target task needs to be developed to avoid a negative transfer (i.e., when transfer learning degrades the performance of the target predictive model). It is not always certain that a source predictive model that is adapted to become a target predictive model using transfer learning approaches will perform better than a model that is trained using only the target dataset, especially when using source models from plays with very different distributional 213 properties. In fact, one may argue that a target dataset large enough to optimally train a predictive model does not warrant the use of transfer learning. The notion of relevance when deciding if a source dataset can be used to improve the target predictive model is also rather subjective and may introduce bias, especially when multiple source datasets and pretrained models are available. Moreover, most current research in transfer learning are focused on classification problems and more attention towards regression problems is called for. Additionally, input and output features for the source dataset and target dataset may be different and a workflow to handle such heterogeneous transfer learning problem is useful as the type of data collected across producing geographical regions can be different. 
A guideline or best practices on how to best transfer learned knowledge from one neural network architecture to another will also be helpful for practitioners, especially when there is a large size discrepancy between the source dataset and the target dataset. At present, it appears that building a machine learning model and using transfer learning to solve subsurface flow problems is half an art and half a science. 214 Chapter 7 Physics-Guided Deep Learning (PGDL) for Improved Production Forecasting in Unconventional Reservoirs This chapter proposes two new methods to embed physical flow functions into deep neural network models for improved production forecasting based on physics-constrained formulation [139, 140]. The first method embeds physical functions into a neural network by selectively training segments of the network using data generated by physical functions (i.e., statistical approach). Subsequently, the entire neural network model is trained using field data, while the weights initially trained using data from the physical functions are fixed. Following the two training steps, the neural network represents the combined statistical information from the physical functions and field data. The second method explicitly embeds the physical functions into the neural network architecture. In this case, using a backpropagation algorithm [19], the gradient information flows from the objec- tive function of the training process through the augmented physical functions by invoking the chain rule. The gradient can then be calculated using a closed-form solution when available or approximated through finite-difference numerical approximation methods. Since a reliable physical description of fluid flow in unconventional reservoirs (e.g., [66, 216]) is not yet available, the model performance is largely dependent on the type of embedded physics in a physics-constrained formulation. For instance, DCA methods produce a poor result where abrupt changes in the production rate of a well are observed due to reservoir depletion and fluctuations in bottomhole operating pressure from complex cross-communication between nearby producers 215 or pay zones [150]. Physics-constrained models can learn to transform a wide variety of input features into implicit or intermediate variables as input parameters for the physical function. The statistical aspect of the model can approximate unmeasurable parameters, including the geometry of hydraulic fractures and induced fractures in the stimulated reservoir volume. However, such a physics-agnostic task also may result in a prediction (for any given input) with a large residual error. For improved generalization and reduced systematic bias in physics-constrained models (es- pecially when the embedded physical functions are not fully relevant to the observed data), we further propose the integration of residual learning and physics-constrained models as the new Physics-Guided Deep Learning (PGDL) formulation. Residual learning or residual modeling is a common simple approach to address the imperfection of physics-based models directly [47, 67, 169, 214, 247] where a statistical model learns to predict the residuals made by a physics-based model. 
In our PGDL formulation, an auxiliary neural network component is introduced to compen- sate for the imperfect description or uncaptured physical components of the constraining physics by learning the complex spatial and temporal correspondence between the well properties such as formation and completion parameters and control trajectories to the expected residuals. The pro- posed method combines the physics-constrained neural network with the predicted residual from the auxiliary neural network component, improving overall estimation. The developed PGDL formulation is tested using several synthetic datasets of increasing com- plexity and a field dataset from Bakken, where the hybrid predictive models are embedded with relevant physical functions. The proposed PGDL formulation can improve predictive performance by augmenting relevant physical functions into a neural network and further compensate for the systematic residual errors when the embedded physics cannot fully describe the complex physical phenomena in the data. We evaluate the performance of PGDL models under variable dataset size to show that such hybrid formulation can improve model performance by reducing the necessary number of trainable weights of neural networks. Additionally, for any test well, the methods can estimate the input parameters of the physical equations to enable long-term forecasting beyond the 216 time available in the training data. Performance comparison between PGDL methods and con- ventional predictive models without embedded physical functions and residual learning shows that PGDL methods offer a flexible hybrid way to benefit from both data-driven and physics-based models in improving production forecasting. 7.1 Hybrid Techniques for Production Prediction We define the general problem formulation, including its input/output and relevant notations used throughout the chapter. Given an observed field dataset of N f ield producers from an unconventional reservoir, we define the well properties (i.e., formation, fluid, and completion parameters) as x, the historical production data as d, and the corresponding control trajectories as u. Let N x be the length of feature vector x and N t be the length of time-series d and u. We define N f as a parameter that denotes the dimension of the time-series production data, where N f = 1 denotes univariate time- series (one phase) and N f > 1 denotes multivariate time-series (multiphase) where the extension of multivariate formulation from the univariate formulation is mathematically straightforward. The problem of data-driven production forecasting can be formulated as d= f(x,u), where f(·) is a forecast function (i.e., model) that takes an input tuple(x,u) to output d. In a black-box data-driven production forecasting method, the forecast model f(·) is tasked with learning (i) the temporal trends in the time-series data, (ii) the mapping between the tempo- ral trends and well properties, and (iii) the mapping between the temporal trends and the control trajectory. With a trained f(·), for any given test input tuple (representing a newly drilled well with limited or no observed initial production responses), the production forecast is obtained by computing ˆ d= f(x,u). In the following subsections, we describe the physics-constrained neural network formulation and introduce two implementation approaches (i.e., statistical and explicit). 
Subsequently, we introduce the new Physics-Guided Deep Learning (PGDL) model and elaborate on how the residual learning approach can be combined with physics-constrained models for improved production prediction.

Neural network predictive models for time-series prediction can be divided into static and dynamic formulations. A static formulation considers timesteps along the temporal dimension as independent data points and performs a static mapping between inputs and outputs of fixed dimensions [39]. A dynamic formulation predicts along the temporal dimension by considering the temporal dependencies between timesteps; it can be used in an autoregressive manner to generate long-term predictions and is more amenable to field data that typically have time-series of varying lengths [58]. In this work, to introduce the new Physics-Guided Deep Learning (PGDL) model, we adopt a static formulation and assume a dataset with time-series of fixed and equal lengths to remove the additional complexities associated with a dynamic formulation [172].

7.1.1 Physics-Constrained Neural Networks

In a gray-box production forecasting method, we assume the existence of a physics-based model f^2 that is directly embedded into the neural network as custom computation layers to serve as prior knowledge of the physical dynamics. A statistical proxy representation of the physics-based model is denoted as f^2_ω, where ω represents the lumped trainable parameters. Specifically, as illustrated in Figure 7.1, a physics-constrained neural network model is a composition of f^1_θ (where θ represents the lumped trainable parameters) and f^2 (the explicit approach, f^1_θ ∘ f^2) or f^2_ω (the statistical approach, f^1_θ ∘ f^2_ω). For a physics-constrained neural network model, the problem formulation becomes d_c = f^2(f^1_θ(x,u)) or d_c = f^2_ω(f^1_θ(x,u)), where d_c represents the physically constrained univariate or multivariate time-series output (i.e., production rates versus time). With a trained f^1_θ, for any given test input tuple, the operation p̂ = f^1_θ(x,u) yields intermediate variables p̂ that serve as input to the physics-based model (i.e., d̂_c = f^2(p̂) or d̂_c = f^2_ω(p̂)). The component f^1_θ enables the transformation of a wide variety of input features into vectorized input parameters that the physics-based model can accept.

Figure 7.1: Schematic of the Physics-Guided Deep Learning (PGDL) model.

The neural network architecture of the proposed physics-constrained model is composed of several fully-connected regression layers and one-dimensional (1D) convolutional layers. A fully-connected layer (denoted as dense and whose output is color-coded in Figure 7.2) with an activation function f_a(·) is defined as

Z = f_a(W_d A + b_d)    (7.1)

where A ∈ R^{N_b × N_i} denotes the input to the layer, N_b is the batch size, N_i is the size of the input, W_d ∈ R^{N_d × N_i} represents the weights to be learned, N_d is the number of hidden nodes of the dense layer, and b_d ∈ R^{N_d × 1} is the bias term. Leaky-ReLU (denoted as lrelu) is used as the element-wise activation function and, for any arbitrary variable z, is defined as

f_a(z) = \max(z, \alpha z),  \alpha \in (0,1)    (7.2)

where α is set to 0.3 (a typical default value). Stacked fully-connected layers can approximate complex functions and allow the model to learn a detailed nonlinear mapping between the input and output.
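To illustrate Equations 7.1 and 7.2 in code, a stack of dense layers with leaky-ReLU activations (α = 0.3) of the kind used for f^1_θ might look as follows in Keras; the function name, layer widths, and output size are placeholders rather than the exact values tabulated for the models in this work.

from tensorflow import keras
from tensorflow.keras import layers

def build_f1(n_input, n_p, hidden=(64, 64, 32)):
    """Stack of dense layers mapping the vectorized (x, u) tuple of length
    n_input to N_p intermediate physical parameters p (cf. Eqs. 7.1-7.2).
    A sigmoid on the last layer bounds the outputs to (0, 1)."""
    inp = keras.Input(shape=(n_input,))
    z = inp
    for width in hidden:
        z = layers.Dense(width)(z)        # Z = W_d A + b_d
        z = layers.LeakyReLU(0.3)(z)      # f_a(z) = max(z, 0.3 z)
    out = layers.Dense(n_p, activation="sigmoid")(z)
    return keras.Model(inp, out)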
As tabulated in Table 8.12, Table 8.13 and Table 8.14, several dense layers are used for 219 f 1 θ where N i corresponds to the total length (i.e., N x + N t ) of the vectorized input tuple of x and u. Additionally, a sigmoid activation function can be applied on the last layer of f 1 θ to bound (i.e., scale) its output values between zero and one to agree with the ranges of input values into the subsequent component f 2 ω or f 2 . For f 2 ω and f 2 with a wider range of input values, the last layer of f 1 θ can be adapted using other activation functions (e.g., Leaky-ReLU) to allow f 1 θ to generate a wider range of values. Figure 7.2: Diagram of the statistical PGDL model architecture. The neural network model’s learning capacity (i.e., amount of trainable parameters) is primar- ily affected by the number of hidden nodes within each layer N d and the number of layers to stack (i.e., depth of the neural network). The selection of optimal model hyperparameters is made using 220 the grid search tuning technique that begins with small potential values and incrementally increases the values until no further increase in training and validation performance is observed. This pro- cess will ensure that the model can effectively fit the training data without inducing any form of underfitting or overfitting. The stacked dense layers architecture is the best for f 1 θ . Every node in each layer is connected to every node in another layer in this architecture (enabling any one layer to learn features from all the combinations of output activations of the previous layer), while the input tuple(x,u) does not have any local structures and inherent redundancies. 7.1.1.1 Statistical Approach The statistical approach to embed a physics-based model involves the trainable component f 2 ω as a proxy model that is trained using a simulated dataset (denoted with the subscript sim) generated from the physics-based model f 2 . Specifically, letting N sim be the number of simulated data points, N sim tuples of(x sim ,u sim ) are sampled from relevant physical and operating ranges and are fed into the physics-based model f 2 to yield the time-series d sim by computing d sim = f 2 (x sim ,u sim ). Note that while the statistical approach offers more flexibility, especially when f 2 is complex, the com- putational overhead associated with running forward simulations may be significant, particularly when the tuples(x sim ,u sim ) cover a broad range of values. Since the simulated data d sim ∈R N t × N f can be univariate or multivariate time-series of length N t with local temporal structures and tem- porally invariant features, we construct f 2 ω using one-dimensional (1D) convolutional layers, as tabulated in Table 8.12 and Table 8.13. A decoder-style architecture composed of several main layers is adopted for f 2 ω where each layer consists of convolutional function (denoted as conv1D and whose output is color-coded in Figure 7.2), leaky-ReLU (Rectified Linear Unit) non-linear activation function (lrelu) and an upsampling function (upsample). The input parameters p∈R N p (representing tuples of(x sim ,u sim )) are gradually upsampled (by repeating each temporal step along the time axis) to obtain a reconstruction of d sim . To reduce kernel artifacts from the upsampling operations, we recommend increasing the dimension by 2 between each layer. 
Note that while a fully-connected layer can be used instead of a convolutional layer, the latter represents complex nonlinear systems with significantly fewer parameters through weight sharing (kernels) and by taking advantage of local spatial coherence and distributed representation. The striding operation of the kernels in the convolutional layers exploits the temporal redundancy in the time-series data, making the convolutional layers more robust to noise. Additionally, the multivariate time-series of each production phase can be treated as separate convolutional channels. To describe the one-dimensional convolution operation, let Y ∈ R^{N_b × N_n × N_k} be the output of any 1D convolutional layer, where N_n is the length of the output along the time axis and N_k is the number of kernels or filters. Let h denote a kernel of length N_h. Let V ∈ R^{N_b × N_m × N_c} be the input of a 1D convolutional layer, where N_m is the length of the input along the time axis and N_c is the number of channels. For simplicity, assuming N_b = 1 and N_c = 1 (i.e., a single input data point v ∈ R^{1 × N_m × 1}, or simply a vector v ∈ R^{N_m}), the convolution operation for any single kernel or filter h (out of the N_k kernels) can be defined as

y_j = \begin{cases} \sum_{i=0}^{N_h} v_{j+i}\, h_i, & j = 0 \\ \sum_{i=0}^{N_h} v_{j+i+(s-1)}\, h_i, & j = 1 : N_m \end{cases}    (7.3)

where the kernel h is shifted s positions (i.e., the stride) after each convolution operation. The resulting output is a single data point y ∈ R^{1 × N_n × 1}, or simply a vector y ∈ R^{N_n}, and the outputs of the N_k kernels are stacked along the last axis. The length of the output N_n can be calculated as

N_n = \left\lfloor \frac{N_m - N_h}{s} \right\rfloor + 1    (7.4)

The input can be padded to make the output length of a convolutional layer equal to the input length. Letting p be the amount of padding added to the input along the time axis, the output length N_n becomes

N_n = \left\lfloor \frac{N_m + 2p - N_h}{s} \right\rfloor + 1    (7.5)

The input to each convolutional layer is padded to make N_m and N_n equal. Moreover, the kernel length N_h is set in increments of 3 months to capture temporal trends while remaining resilient to noise. For more details on the mechanism of each function, we refer readers to the relevant computer science literature (e.g., [45, 118, 199]).

The component f^2_ω is then trained using the simulated dataset with the following loss function

L(\omega) = \sum_{N_{sim}} \left\| d_{sim} - f^2_\omega(x_{sim}, u_{sim}) \right\|_2^2    (7.6)

Once ω is learned, the prediction of the proxy model is obtained by computing d̂_sim = f^2_ω(x_sim, u_sim). In the subsequent training step, the physics-constrained model f^1_θ ∘ f^2_ω is trained using the field dataset with the following loss function

L(\theta) = \sum_{N_{field}} \left\| d - f^2_\omega\left(f^1_\theta(x, u)\right) \right\|_2^2    (7.7)

where the parameters ω that were initially trained using data from the physical functions are kept fixed, while the parameters θ are calibrated. The physics-constrained model thus represents the statistical information from both the physical functions and the field dataset. With the trained model, the physics-constrained output prediction can be computed as d̂_c = f^2_ω(f^1_θ(x,u)). The statistical approach embeds the physical information from a vast dataset generated by the physical functions into the neural network model, improving predictive performance by reducing the under-determinedness of the neural network.
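A hedged sketch of a decoder-style f^2_ω of this kind, together with the two training steps of Equations 7.6 and 7.7 (train the proxy on simulator-generated data, then freeze it and train f^1_θ on field data), is given below, reusing the build_f1 sketch shown earlier; the filter counts, kernel sizes, upsampling factors, and data arrays are illustrative assumptions, not the tabulated architecture.

from tensorflow import keras
from tensorflow.keras import layers

def build_f2_proxy(n_p, n_t=60, n_phases=3):
    """Decoder-style proxy f2_omega: upsample the physical parameters p along a
    time axis with 1D convolutions to reconstruct a rate profile of length n_t."""
    inp = keras.Input(shape=(n_p,))
    z = layers.RepeatVector(n_t // 4)(inp)                 # (n_t/4, n_p)
    for filters in (32, 16):
        z = layers.Conv1D(filters, kernel_size=3, padding="same")(z)
        z = layers.LeakyReLU(0.3)(z)
        z = layers.UpSampling1D(size=2)(z)                 # doubles the time axis
    out = layers.Conv1D(n_phases, kernel_size=3, padding="same")(z)   # (n_t, n_phases)
    return keras.Model(inp, out)

# Step 1 (Eq. 7.6): fit the proxy on simulator-generated pairs (p_sim, d_sim).
f1 = build_f1(n_input=67, n_p=9)     # N_x + N_t entries of (x, u); 9 = three (q_i, b, d_i) sets
f2w = build_f2_proxy(n_p=9, n_t=60, n_phases=3)
f2w.compile(optimizer="adam", loss="mse")
# f2w.fit(p_sim, d_sim, ...)

# Step 2 (Eq. 7.7): freeze omega and train f1_theta end-to-end on the field data.
f2w.trainable = False
pc_model = keras.Model(f1.input, f2w(f1.output))
pc_model.compile(optimizer="adam", loss="mse")
# pc_model.fit(xu_field, d_field, ...)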
The diagram shown in Figure 7.2 illustrates the implementation of the statistical physics-constrained model used for a multivariate synthetic dataset in this study. The component f^2_ω in the diagram is composed of three branches of successive convolutional layers, one for each flow phase, and the predictions for all phases are concatenated before the loss function is calculated. Note that other architecture variants, such as using only multiple fully-connected layers, can achieve the same objective of representing a physics-based model. Another example of an architecture for f^2_ω is a single branch of successive convolutional layers that directly produces a multivariate output (similar to the illustrated component f^3_ζ).

7.1.1.2 Explicit Approach

In the explicit approach, a physics-based model f^2 is directly embedded into the neural network by allowing the output of the preceding f^1_θ (as illustrated in Figure 7.1) to serve as the input parameters p of the physical function. Specifically, the trainable component f^2_ω in Figure 7.2 is replaced with the physics-based model f^2. Since the physics model represents causal relations between the input and output and is embedded directly (i.e., it has no trainable parameters), the physics-constrained model f^1_θ ∘ f^2 can be trained in a single step using the field dataset with the following loss function

L(\theta) = \sum_{N_{field}} \left\| d - f^2\left(f^1_\theta(x, u)\right) \right\|_2^2    (7.8)

In the training phase, only the parameters θ within f^1_θ are calibrated, and the physics-constrained output prediction can be computed as d̂_c = f^2(f^1_θ(x,u)). Similar to the statistical approach, the parameters of the physics-constrained model are trained using the backpropagation algorithm [19], where the sensitivity information flows from the loss function through the embedded physical function by invoking the chain rule. The gradient of f^2 can be calculated using a closed-form solution when available or approximated through finite-difference methods. Using the gradient-descent algorithm as an example, the update to θ at any iteration i with learning rate α is θ_{i+1} = θ_i − α ∂L/∂θ_i. The derivative ∂L/∂θ is obtained using the chain rule as

\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial f^2} \, \frac{\partial f^2}{\partial f^1_\theta} \, \frac{\partial f^1_\theta}{\partial \theta}    (7.9)

The computation of sensitivity information or derivatives can be computationally expensive for a complex physics-based model f^2 that takes high-dimensional input parameters. As such, analytical models with relatively low-dimensional input parameters are typically preferred for practical applications and are sufficient to capture the general production behavior. There are three main advantages of the explicit approach over the statistical approach: (i) the physics-constrained model learns to transform the field input data x into p, and the discovered input parameters of the physical function can be used to forecast production responses beyond the length of time available in the training data; (ii) there are significantly fewer trainable parameters in f^1_θ ∘ f^2 because the physics-based model is embedded directly, leading to faster convergence and more stable results, especially when the training data is limited; and (iii) the computed predictions d̂_c are guaranteed to be physically consistent because they are calculated by the actual physics-based model (instead of a proxy model).
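One way to realize the explicit approach in a differentiable framework is to wrap the analytical physics function as a custom layer, so that backpropagation applies the chain rule of Equation 7.9 automatically; in the hedged sketch below, a hyperbolic decline function stands in for f^2, automatic differentiation replaces a closed-form or finite-difference gradient, and build_f1 refers to the earlier dense-stack sketch (all names and sizes are illustrative).

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class HyperbolicDecline(layers.Layer):
    """Explicitly embedded physics f2: q_t = q_i / (1 + b*d_i*t)^(1/b).
    Takes p = (q_i, b, d_i) from f1_theta and returns an N_t-step rate profile."""
    def __init__(self, n_t=36, **kwargs):
        super().__init__(**kwargs)
        self.t = tf.reshape(tf.range(1, n_t + 1, dtype=tf.float32), (1, -1))

    def call(self, p):
        q_i = p[:, 0:1]
        b   = p[:, 1:2] + 1e-6          # guard against division by zero in 1/b
        d_i = p[:, 2:3]
        return q_i / tf.pow(1.0 + b * d_i * self.t, 1.0 / b)

# Physics-constrained model f1_theta o f2: only theta is trainable (Eq. 7.8).
f1 = build_f1(n_input=3, n_p=3)
pc_explicit = keras.Model(f1.input, HyperbolicDecline(n_t=36)(f1.output))
pc_explicit.compile(optimizer="adam", loss="mse")
# pc_explicit.fit(x_field, d_field, ...)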
7.1.2 Physics-Guided Deep Learning (PGDL) Model

Physics-constrained models (f^1_θ ∘ f^2_ω and f^1_θ ∘ f^2) contain embedded physical functions that constrain the predictions. These models may yield poor performance when the embedded physical functions do not fully represent the relationship between the input and output data due to a lack of relevance. In such cases, the predictions d̂_c from a physics-constrained model will exhibit residual errors when compared to the ground truth d. For training purposes, the residuals d_r can be calculated through subtraction as d_r = d − d̂_c. To compensate for the imperfect description or uncaptured physical components of the constraining physics, we introduce an auxiliary neural network component f^3_ζ with trainable parameters ζ (as illustrated in Figure 7.2) to learn the complex spatial and temporal correspondence between the well properties (such as formation and completion parameters) and control trajectories (as tuples of x and u) and the expected residuals d_r. The component f^3_ζ is trained with the following loss function

L(\zeta) = \sum_{N_{field}} \left\| d_r - f^3_\zeta(x, u) \right\|_2^2    (7.10)

Once ζ is learned, the residual for any given test input parameters is obtained by computing d̂_r = f^3_ζ(x,u). The auxiliary component f^3_ζ can be appended to the statistical or explicit implementation of the physics-constrained model (as f^1_θ ∘ f^2_ω + f^3_ζ or f^1_θ ∘ f^2 + f^3_ζ), and the result is formalized as the Physics-Guided Deep Learning (PGDL) model illustrated in Figure 7.1. The final prediction from the PGDL model, d̂, is obtained by adding the prediction of the residual model d̂_r to the prediction of the physics-constrained model d̂_c, i.e., d̂ = d̂_r + d̂_c, which significantly reduces under- and over-estimation and yields a more robust production prediction. The neural network architecture of f^3_ζ consists of several fully-connected layers (similar to f^1_θ) followed by decoder-style convolutional layers (similar to f^2_ω), as tabulated in Table 8.12 and Table 8.13.

Figure 7.3: Workflow of the Physics-Guided Deep Learning (PGDL) model.

The PGDL workflow, with the two choices of physics-constrained implementation for the training and prediction phases, is illustrated as a flowchart in Figure 7.3. Specifically, using the statistical approach as an example, in the prediction phase the output of the physics-constrained neural network d̂_c is obtained (through the upsampling convolutional operations of f^2_ω) by feeding the input (x,u) into the dense fully-connected layers of f^1_θ; the output of f^1_θ then serves as the input into f^2_ω. The output of the residual model d̂_r is obtained by feeding the input (x,u) into the combination of dense fully-connected and convolutional layers of f^3_ζ. Standard min-max scaling is performed on the input parameters. The PGDL models are implemented with the deep learning library Keras (version 2.2.4) [45]. Each component is trained and checkpointed using the early-stopping method, and the optimal checkpoint (without overfitting) for each component is identified when the validation loss shows no further reduction.

7.2 Numerical Experiments and Results

7.2.1 Example 1: Toy Data

In this section, we demonstrate the mechanism of the statistical and explicit implementations of the PGDL model using a toy dataset.
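Throughout the examples below, the residual-learning step of Equation 7.10 and the final combination d̂ = d̂_c + d̂_r follow the same simple pattern, sketched here once with hypothetical names (pc_model for a trained physics-constrained model and f3 for the auxiliary residual network) before the toy experiment is set up.

def train_pgdl_residual(pc_model, f3, x_field, d_field, epochs=200):
    """Fit the auxiliary residual network f3_zeta on d_r = d - d_c_hat (Eq. 7.10)."""
    d_c_hat = pc_model.predict(x_field)
    residuals = d_field - d_c_hat
    f3.compile(optimizer="adam", loss="mse")
    f3.fit(x_field, residuals, epochs=epochs, validation_split=0.2, verbose=0)
    return f3

def pgdl_predict(pc_model, f3, x_new):
    """PGDL prediction: physics-constrained output plus predicted residual."""
    return pc_model.predict(x_new) + f3.predict(x_new)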
Consider the following scaled combination of hyperbolic and cosine functions as the forward model (i.e., the physical simulator)

q_t = 0.75\,\frac{q_i}{\left(1 + b\,d_i\,t\right)^{1/b}} + 0.25\cos\left(\frac{5t\,b\,d_i}{q_i}\right) + 0.25    (7.11)

where, setting N_t = 36 and N_f = 1, univariate production data d ∈ R^{36×1} is obtained. To generate the toy field dataset, we sample 150 values (N_field = 150) of (q_i, b, d_i) in the range [0,1] as the intermediate input parameters p ∈ R^3 and apply an arbitrary 3×3 linear transformation matrix to p to obtain a hypothetical set of input well parameters x ∈ R^3. For this experiment, we assume a constant well-operating control and omit the input component u. The architecture of the statistical and explicit implementations of the PGDL model is given in Table 8.12, where the residual learning model f^3_ζ has the same structure as the f^1_θ and f^2_ω components combined. The explicit PGDL model does not include the component f^2_ω because the physics-based model f^2 is embedded directly; it therefore contains far fewer parameters than the statistical PGDL model. The component f^1_θ outputs the vector p ∈ R^3 consisting of the (q_i, b, d_i) values that serve as the input into the embedded physics model (either f^2_ω or f^2). The exact form of f^2 is the hyperbolic decline function given in Equation 6.10 and restated in Equation 7.12 below, and f^2_ω is a statistical approximation of f^2. For this experiment, a black-box approach without an embedded physical function, using a model with the architecture of f^3_ζ, serves as a performance benchmark.

q_t = \frac{q_i}{\left(1 + b\,d_i\,t\right)^{1/b}}    (7.12)

We first demonstrate the mechanism of the physics-constrained models where the production data is generated using only the first term in Equation 7.11 (i.e., Experiment 1a). A hyperbolic form of the decline curve function is the analytical physics-based model embedded in the physics-constrained models. In this demonstration, the residual learning model becomes unnecessary since the embedded physical function can fully represent the production data. The models are trained with 80% of the dataset and tested on the remaining 20%; in addition, 20% of the training data in each epoch is randomly split off for cross-validation. Note that in this static formulation, the data from each well (e.g., well properties as the input and production time-series as the output) is treated as an independent trial. As previously described, the Adam optimizer [115] is used to train the neural network architectures.

The first column in Figure 7.4 compares (using the test dataset) the actual p used to generate the production data d_c against the predicted values p̂ from a trained f^1_θ. The statistical and explicit physics-constrained models can discover the relevant input parameters of the physics-based model, enabling long-term prediction. However, some discrepancies are observed because multiple combinations of (q_i, b, d_i) values can produce close matches over the N_t timesteps used in the test dataset. The scatter plots in the second column of Figure 7.4 show that the statistically embedded physics model introduces minor errors compared to the explicitly embedded physics model, due to the nature of statistical learning. In the third column of Figure 7.4, we observe that embedding relevant physical information improves performance (for predicted d̂_c versus test data d_c) compared to a purely black-box approach.
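For concreteness, the toy field dataset described at the start of this example could be generated along the following lines; this is a NumPy sketch in which the cosine term follows the reconstruction of Equation 7.11 above, the 3×3 mixing matrix is arbitrary (as in the text), and the small lower bound on the sampled parameters is an assumption added here to avoid division by very small q_i.

import numpy as np

rng = np.random.default_rng(0)
N_field, N_t = 150, 36
t = np.arange(1, N_t + 1)

# Intermediate physical parameters p = (q_i, b, d_i); the text samples in [0, 1].
p = rng.uniform(0.05, 1.0, size=(N_field, 3))
q_i, b, d_i = p[:, 0:1], p[:, 1:2], p[:, 2:3]

# Forward model of Eq. 7.11: scaled hyperbolic decline plus a cosine component.
d = (0.75 * q_i / (1.0 + b * d_i * t) ** (1.0 / b)
     + 0.25 * np.cos(5.0 * t * b * d_i / q_i) + 0.25)        # shape (N_field, N_t)

# Hypothetical well properties x: an arbitrary 3x3 linear mixing of p.
A = rng.normal(size=(3, 3))
x = p @ A.T                                                   # shape (N_field, 3)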
Additionally, the components f 1 θ and f 2 ω in a black-box predictive model independently do not offer any useful or physically meaning- ful transformations. Figure 7.5: Performance comparison of the PGDL models for the toy dataset. Next, we consider production data d generated using Equation 7.11 (i.e., Experiment 1b) and we embed the same decline curve function as the analytical physics-based model. In this demon- stration, the residual model f 3 ζ is tasked with learning the cosine component of d that the embed- ded physical function cannot capture. The first column in Figure 7.5 shows that when the embed- ded physics model lacks relevance, f 1 θ predicts the intermediate values ˆ p that can best minimize the mismatch between ˆ d c and d. Since the embedded physics model is not able to fully represent the cosine component in Equation 7.11, the scatter plots in the second column show a large mis- match between ˆ d c and d. When the predicted residuals ˆ d r are added to the physics-constrained 230 predictions ˆ d c , a notable improvement in prediction performance is observed as shown in the third column. Each row in Figure 7.6 shows three samples of prediction profiles for the statistical or Figure 7.6: Samples of prediction from the physics-constrained and PGDL models for the toy dataset. explicit PGDL models. The red lines represent ˆ d c , the predictions from the physics-constrained models, the green lines represent ˆ d, the predictions from the PGDL model, and the scatter points represent d, the reference data. We observe that the predicted time-series ˆ d c from f 2 are smoother than the predicted time-series from f 2 ω . The imperfect physics-based description embedded in the physics-constrained models caused ˆ d c to fit the ground truth d in such a way that the residuals are minimized. The predictions from the PGDL models, ˆ d consider the predicted residuals ˆ d r that are largely caused by the missing cosine component in the embedded physics model, resulting in time-series that better fit the ground truth d. The violin plots in Figure 7.7 show the distribution of Root Mean Square Error (RMSE) val- ues of ten repeated runs for each of the cases considered in this section. Each violin (i.e., density curve) contains a blox plot, with the ends of the rectangle showing the first and third quartiles and 231 Figure 7.7: Performance statistics of the physics-constrained and PGDL models for the toy dataset. the white dot showing the median. For Experiment 1a, the performance of the statistical and ex- plicit physics-constrained models (denoted as stats-PC and expl-PC respectively) are comparable and are both better and more consistent than the black-box model (denoted as BB). For experiment 1b, the predictions from the physics-constrained models show high RMSE values. However, the predictions from the statistical and explicit PGDL models (denoted as stats-PG and expl-PG re- spectively) are in better agreement with the ground truth as the residual errors are considered in the final prediction. The proposed PGDL models provide more consistent and more stable (i.e., less elongated distribution) predictions due to having less trainable and unconstrained parameters than purely black-box models. The data-driven components of PGDL (i.e., f 1 θ and f 3 ζ ) augment imperfect physics-based /simulation-based models captured by f 2 or f 2 ω . 
A purely physics-based approach may not be able to accommodate a wide variety of input parameters (i.e., well properties and controls) and PGDL enables this at the cost of marginal computational burden associated with the training of the physics-constrained neural network model. When the physics-based model is imperfect, the biased physics-constrained predictions can be corrected by the residual learning model at the expense of additional computational burden associated with the training of the residual model. The training process of f 1 θ and f 3 ζ can be done offline and the total computational time incurred is the sum 232 of training time for the physics-constrained neural network and the training time for the residual neural network. For Experiment 1b, the total computational time incurred is approximately 14 minutes on an NVIDIA GeForce RTX 2080 Ti GPU. 7.2.2 Example 2: Synthetic Bakken Data In this section, we demonstrate the capability of the proposed PGDL models to provide robust production prediction under different geological complexities. For this work, we use a commercial simulator (MRST) to simulate 60 months (N t = 60) of a typical multiphase production (i.e., oil, water, and gas, N f = 3) from a hydraulically fractured horizontal well. To generate the synthetic field dataset, we sample 800 values ( N f ield = 800) of formation, fluid and completion properties x∈R 7 between the ranges plotted in the histograms in Figure 7.8 and ran forward simulations to obtain multivariate production response data d∈R 60× 3 . The MRST simulator is based on a single porosity model that considers planar fractures with high conductivity as the main flow conduits for the hydrocarbons from the matrix to the production well. Most of the hydrocarbon is assumed to be present in the low permeability matrix, and formation properties are assumed to be homoge- neous within the drainage area. The reservoir remains above the bubble point pressure. Additional assumptions, like the decrease of fracture conductivity during production either due to propant crushing or increased net effective stress, fracture permeability anisotropy and fracture conductiv- ity increasing as a result of shear fracturing are not considered. The distribution and data ranges for each input feature are based on relevant literature for the Bakken shale play [53, 56, 57, 72, 107, 135, 217, 246, 251]. We consider matrix porosity, forma- tion thickness, initial formation pressure, and initial water saturation as the formation properties, oil density as the fluid property, and the number of fracture stages and well lateral length as the completion properties (as shown by the histograms in Figure 7.8). The porosity correlation with permeability and depth is assumed to be positive and negative, respectively [44, 192, 198]. Note that in this experiment, we utilize a synthetic dataset to factor out common issues related to field 233 data, such as incomplete data, inaccuracies, and low signal-to-noise ratio, to demonstrate the mech- anism of the proposed PGDL model. Additional descriptions are provided in the Appendix. Figure 7.8: Normalized distribution of the formation, fluid, and completion properties for the synthetic dataset. In Experiment 2a, we assume a constant well-operating control and omit the input component u. We consider a reservoir with natural fractures to further mimic a field dataset. The simulated multivariate production response data for Experiment 2a are shown in Figure 7.9. 
The general trend of the time series can be captured by a hyperbolic form of the decline curve function that we embed as f 2 ω and f 2 in the statistical and explicit PGDL models, respectively. Specifically, for each fluid phase, the exact form of f 2 is given in Equation 6.10 and f 2 ω is a statistical approximation of f 2 . Gaussian noise with a relevant mean and standard deviation is added to the simulated produc- tion responses to replicate the signal-to-noise ratio observed in collected field measurements. The architecture of the components in the PGDL models is given in Table 8.13 and illustrated in Figure 7.2. The component f 1 θ outputs three sets of(q i ,b,d i ) values as the intermediate input parameters p∈R 9 , where each set represents a production phase. The physics of multiphase flow 234 Figure 7.9: Multivariate production profiles for the synthetic dataset (Experiment 2a). is captured in the simultaneously-discovered intermediate parameters p. In the statistical imple- mentation, f 2 ω contains three decoder-style components representing the three production phases, resulting in 20935 total trainable parameters. In contrast, the explicit implementation of the PGDL model only has 8404 total parameters. Similar to Experiment 1, a black-box approach without embedded physical function involving a model with the architecture of f 3 ζ is also used as a per- formance benchmark. Note that since we have demonstrated in Section 1 that the statistical and explicit PGDL models show comparable predictive performance, we will only present results for the explicit PGDL model in this section for brevity. The statistical PGDL model incurs much higher computational cost related to the simulations needed to generate the training dataset for f 2 ω , and the associated additional training step needed. The models are trained with 80% of the dataset and 20% for testing. Moreover, 20% of the dataset in each training epoch is randomly split for validation. Using the Adam optimizer [116], the explicit PGDL model is trained as previously described on an NVIDIA GeForce RTX 2080 Ti GPU for approximately 20 minutes (spanning 1000 epochs) and is checkpointed every 50 epoch. Two test samples of multivariate prediction profiles from the explicit PGDL model are shown in Figure 7.10. For each of the fluid phases, the scatter points represent d, the reference data, the stippled lines represent ˆ d c , the predictions from the physics-constrained model. The bold lines represent ˆ d, the predictions from the PGDL model. In this experiment, the predicted time-series ˆ d c and ˆ d fits the ground truth d as the embedded physics-based description f 2 can optimally represent 235 Figure 7.10: Samples of prediction from the PGDL model for the synthetic dataset (Experiment 2a). Figure 7.11: Sensitivity analysis of the PGDL model performance for the synthetic dataset (Exper- iment 2a). the dataset with the residual errors learned by the residual model f 3 ζ are not significant. 236 We perform a sensitivity analysis to investigate how different sizes of training datasets affect the predictive performance of the PGDL model. In this analysis, the test dataset is fixed (in terms of size and data points) while the size of the training dataset is varied (i.e., data points sampled randomly) as labeled in Figure 7.11. The RMSE values represent the mismatch between d and ˆ d c (labeled as PC), d and ˆ d (labeled as PG), and d and predictions from a black-box model (labeled as BB). 
The experiment is repeated 10 times for each case in the sensitivity analysis to account for model variance. We observe that the benefits of embedding a physics-based model f^2 become more prominent when the training dataset size is small. As an example, when the amount of available training data is 50, the RMSE values for PC-50 and PG-50 are considerably smaller and more consistent than those for BB-50. As the size of the available training data increases, the performance of the physics-constrained model, the PGDL model, and the black-box model becomes similar. Additionally, as the embedded f^2 is highly relevant to the dataset, the difference between the RMSE values for PC and PG is not significant. Note that the observations obtained from the sensitivity analysis remain the same with or without synthetic noise added to the simulated field data. Figure 7.12: Multivariate production profiles for the synthetic dataset (Experiment 2b). In Experiment 2b, we assume a constant well-operating control and introduce geomechanical complexities, including decreased fracture conductivity (due to proppant crushing, increased net stress, or anisotropy in fracture permeability) and increasing fracture conductivity (due to shear fracturing). These additional complexities are reflected in the simulated multivariate production response data shown in Figure 7.12, where multiple flow regimes are observed and may not be sufficiently represented by the embedded f^2 (with the exact form given in Equation 6.10). Gaussian noise with a relevant mean and standard deviation is added to the simulated production responses. The explicit PGDL model is built as given in Table 8.13 and trained as described earlier. Using the Adam optimizer [115], the explicit PGDL model is trained on an NVIDIA GeForce RTX 2080 Ti GPU for approximately 32 minutes (spanning 1000 epochs) and is checkpointed every 50 epochs. Figure 7.13: Performance statistics of the PGDL model for the synthetic dataset (Experiment 2b). The first scatter plot in the top row of Figure 7.13 compares d and d̂_c, where suboptimal predictions are observed because of the geomechanical complexities not captured by the embedded f^2. The calculated multivariate residual time series d_r are shown in the bottom row of Figure 7.13, and the second scatter plot in the top row compares d_r and d̂_r as predicted by the residual model f^3_ζ. The third scatter plot in the top row shows that when d̂_r is added to d̂_c, more accurate predictions are obtained. Figure 7.14: Samples of prediction from the PGDL model for the synthetic dataset (Experiment 2b). Figure 7.15: Sensitivity analysis of the PGDL model performance for the synthetic dataset (Experiment 2b). Two test samples of multivariate prediction profiles from the explicit PGDL model are shown in Figure 7.14. In this experiment, the predicted time series d̂ fit the ground truth d better than d̂_c because the flow behavior not captured by the embedded physics-based description f^2 is represented by f^3_ζ. We perform another sensitivity analysis for the dataset in Experiment 2b, similar to the sensitivity analysis performed in Experiment 2a. In this experiment, because the embedded physics model is less relevant to the dataset compared to the scenario in Experiment 2a, a significant difference is observed in Figure 7.15 between the RMSE values labeled as PC and PG.
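The residual-correction step illustrated in Figure 7.13 can be summarized in a few lines. The sketch below assumes the physics-constrained model from the previous code example has already been fit, and that training and test arrays (X_train, D_train, X_test, D_test) are available; the residual-network layer sizes are illustrative rather than those in Table 8.13.

import numpy as np
import tensorflow as tf

# 1) Residuals left unexplained by the embedded physics: d_r = d - d_c_hat
D_c_train = physics_constrained.predict(X_train)
D_r_train = D_train - D_c_train

# 2) Residual model f3_zeta maps the same well properties x to the expected residual.
x_in = tf.keras.Input(shape=(7,))
h = tf.keras.layers.Dense(128, activation="relu")(x_in)
h = tf.keras.layers.Dense(60 * 3)(h)
d_r = tf.keras.layers.Reshape((60, 3))(h)
residual_model = tf.keras.Model(x_in, d_r)
residual_model.compile(optimizer="adam", loss="mse")
residual_model.fit(X_train, D_r_train, epochs=200, verbose=0)

# 3) Final PGDL prediction combines the two components: d_hat = d_c_hat + d_r_hat
D_hat_test = physics_constrained.predict(X_test) + residual_model.predict(X_test)
rmse = np.sqrt(np.mean((D_test - D_hat_test) ** 2))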
The incorporation of the residual model as part of the PGDL model (where the mismatch of its prediction is labeled as PG) consistently leads to a reduction in the RMSE values for the test dataset, suggesting that f^3_ζ can compensate for the suboptimal f^2. Additionally, the RMSE values represented by the violins labeled as PG are consistently smaller than the RMSE values represented by the violins labeled as BB, suggesting that the PGDL model is useful, especially when the available training data is limited. When the available training data is very limited (i.e., 50-100 data points), as is the case for underdeveloped or new acreage areas, the physics-constrained model results in lower RMSE than a black-box approach. However, as the training data availability increases (i.e., 200-600 data points), a black-box model outperforms the physics-constrained model due to the limitations imposed by the embedded constraining physics f^2. As the size of the available training data increases, the performance of the physics-constrained model remains suboptimal, whereas the performance of the PGDL and black-box models becomes more comparable. We also observe that predictions from a PGDL model result in a tighter distribution of RMSE values, as the predicted time series are more consistent when compared to predicted time series from a purely data-driven approach. In Experiment 2c, we consider variable well-operating controls denoted as u ∈ R^60, together with the other assumptions made in Experiment 2b. The explicit physics-constrained model is now formulated as d̂_c = f^2(f^1_θ(x, u)) and the residual model is formulated as d̂_r = f^3_ζ(x, u). Specifically, for each fluid phase, f^2 takes the hyperbolic form of the decline curve function shown in Equation 6.10. The architecture of the PGDL model can be derived from Table 8.13 by simply modifying the input layers of f^1_θ and f^3_ζ to accept a vectorized input tuple (x, u) ∈ R^67, as sketched in the code example that follows. Figure 7.16: Multivariate production profiles for the synthetic dataset (Experiment 2c). The simulated multivariate production response data, shown without the added noise in Figure 7.16, illustrate the complexities arising from the variable control trajectories and geomechanical interactions that may not be sufficiently represented by the embedded f^2. Using the Adam optimizer [115], the PGDL model is trained on an NVIDIA GeForce RTX 2080 Ti GPU for approximately 38 minutes (spanning 1000 epochs) and is checkpointed every 50 epochs. Figure 7.17 shows three test samples of multivariate prediction profiles from the explicit PGDL model. The purple lines represent the normalized operating pressure u as the control trajectory for each well. In this experiment, the predicted time series d̂ match the ground truth d better than d̂_c, as the residual model f^3_ζ relates changes in flow behavior and well control to the production rate of each fluid phase. Even with such a simple analytical model as the constraining physics in f^2, our observations in this experiment are consistent with the insights gained from the sensitivity analysis in Experiment 2b. The embedded physics model may result in large residuals d_r, especially when it is not formulated to represent complex flow behavior and variable operating controls. Nevertheless, the residual model f^3_ζ learns the trends and relationships that cannot be discerned by f^2, and the final predictions that combine d̂_c and d̂_r are more consistent and accurate compared to black-box predictions.
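A minimal sketch of the Experiment 2c input modification is shown below: the well properties x ∈ R^7 and the control trajectory u ∈ R^60 are concatenated into a single 67-dimensional vector that feeds both f^1_θ and f^3_ζ. The arps_hyperbolic helper from the earlier sketch is reused as the embedded f^2, the layer sizes are illustrative, and the two branches are still trained sequentially (the residual network is fit to d - d̂_c after the physics-constrained branch is trained), not jointly.

import tensorflow as tf

# Shared inputs for both components of the PGDL model
x_in = tf.keras.Input(shape=(7,), name="well_properties")
u_in = tf.keras.Input(shape=(60,), name="control_trajectory")
xu = tf.keras.layers.Concatenate()([x_in, u_in])          # (batch, 67)

# Physics-constrained branch: f1_theta(x, u) -> p, followed by the fixed Arps form
h = tf.keras.layers.Dense(64, activation="relu")(xu)
p = tf.keras.layers.Dense(9, activation="softplus")(h)
d_c = tf.keras.layers.Lambda(arps_hyperbolic)(p)
physics_constrained_2c = tf.keras.Model([x_in, u_in], d_c)

# Residual branch: f3_zeta(x, u) -> d_r, trained afterwards on the residuals
h_r = tf.keras.layers.Dense(128, activation="relu")(xu)
d_r = tf.keras.layers.Reshape((60, 3))(tf.keras.layers.Dense(60 * 3)(h_r))
residual_model_2c = tf.keras.Model([x_in, u_in], d_r)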
Figure 7.17: Samples of prediction from the PGDL model for the synthetic dataset (Experiment 2c). 7.2.3 Example 3: Field Data from Bakken Shale Play In this section, we demonstrate the performance of the forecast model using field data comprising 886 wells across 199 unique fields collected from the Bakken shale in North Dakota. The data are downloaded from the North Dakota Department of Mineral Resources web page (https://www.dmr.nd.gov/). It is assumed that there are no inter-well communications and no distinction between wells produced from the Middle Bakken formation and the Three Forks formations. The well properties x, where N_x = 8 (i.e., the volume of fracturing fluid, weight of proppant, treatment pressure, treatment rate, number of fracture stages, well lateral length, latitude, and longitude), and their corresponding observed multivariate (i.e., N_f = 3) production responses (i.e., oil, water, and gas phases) d ∈ R^(60×3) are collected. Figure 7.18 shows the histograms of the well parameters for the Bakken dataset, where missing or unreported entries for any feature are filled using the global mean imputation method. Figure 7.18: Normalized distribution of the well properties for the field dataset. We consider the PGDL model and the training strategies outlined in the previous sections. The architecture of the PGDL model is given in Table 8.14 and can also be derived from Table 8.13 by simply modifying the input layers of f^1_θ and f^3_ζ to accept a vectorized input tuple x ∈ R^8. We convert the multiphase production data to cumulative production data to minimize the negative impacts of noisy field measurements and rate fluctuations. In this experiment, the PGDL model does not learn the correspondence between operating controls and production time series, as the operating history of each well is not reported. Nonetheless, we believe that the results can be improved if the operating history is available. Inspired by [216], we consider an approximate flow function f^2 with an exponential form to describe the relationship between cumulative production and time. Specifically, f^2 takes the form q_t = a − be^(−ct), with three sets of arbitrary intermediate input parameters (a, b, c), one for each production phase, forming the vector p ∈ R^9. A total of 600 data points are used for training, and the rest are reserved as the test dataset. The proposed explicit PGDL model is trained for approximately 35 minutes (spanning 1200 epochs) and is checkpointed every 50 epochs. Figure 7.19: Performance comparison of the PGDL and black-box models for the field dataset. The first and second scatter plots in Figure 7.19 show that the performance of the PGDL model is higher than that of the physics-constrained model. In this practical case, the uncaptured trends and input-output relationships cannot be physically discerned, but they are present within the set of residuals and can be statistically represented by f^3_ζ. The third scatter plot shows that the predictions from a black-box model tend to be noisy and erroneous. This poor result is further shown in Figure 7.19, where the distribution of the RMSE values from 10 repeated experiments for the BB case has a higher mean RMSE. Additionally, the distribution of the RMSE values for the PG case is characterized by a large variance, potentially due to the data-driven nature of f^3_ζ and the complexity of the field dataset. Three test samples of scaled multivariate prediction profiles from the explicit PGDL model are shown in Figure 7.20.
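For the field example, the embedded function changes from a rate-based hyperbolic decline to the exponential cumulative-production form q_t = a − be^(−ct). The sketch below, with hypothetical shapes and no claim to reproduce Table 8.14, shows how this form can replace the Arps layer in the earlier sketches, together with the global mean imputation step mentioned above.

import tensorflow as tf

N_t, N_f = 60, 3

def exp_cumulative(p):
    """Map p (batch, 9) to cumulative-production profiles (batch, N_t, N_f).

    p holds one (a, b, c) triplet per phase and the embedded form is
    q_t = a - b * exp(-c * t), as used for the field example.
    """
    t = tf.range(1, N_t + 1, dtype=tf.float32)
    p = tf.reshape(p, (-1, N_f, 3))
    a, b, c = p[..., 0:1], p[..., 1:2], p[..., 2:3]
    q = a - b * tf.exp(-c * t)                     # (batch, N_f, N_t)
    return tf.transpose(q, (0, 2, 1))

# Global mean imputation of missing well-property entries, assuming a
# hypothetical pandas DataFrame `wells` holding the eight features:
#   wells = wells.fillna(wells.mean(numeric_only=True))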
In general, while the physics-constrained predictions d̂_c tend to be more realistic (i.e., monotonically increasing values representing cumulative production), the PGDL predictions d̂ have a closer fit to the ground truth d. Figure 7.20: Samples of prediction from the PGDL model for the field dataset. Unlike the synthetic case in Experiment 2, the operating history for each well in the field dataset is not available (not reported), which would otherwise help explain the considerable noise and fluctuations in the rates. Note that for this field dataset, a well may be shut in for various well stimulation operations or simply due to other factors that may or may not be reported. We can expect that if a well is shut in for a stimulation operation, a successful stimulation job will increase well productivity, and the initial vector of well properties x may no longer be as relevant, which is a point of caution for practitioners. Additionally, the final predictions from PGDL exhibit jitters that may not be deemed acceptable by practitioners, as time series of cumulative production should be monotonically increasing. While we have demonstrated that PGDL can be implemented with a wide variety of physics-based models, this field example also shows a standing limitation introduced by the data-driven residual model. To circumvent this issue, the field data can be treated as time series of production rate and a decline curve function can be used as the embedded physics-based function, as we did in the previous experiments, rather than using time series of cumulative production and embedding a physics-based function with an exponential form. One may argue that a simple least-squares regression formulation with f^2 as the estimation function and the observed production data d can produce a set of intermediate input parameters p that results in much smaller residuals than the p discovered by the PGDL model. This is evident in Figure 7.20, where some of the time series from the test dataset can easily be represented by f^2, yet the PGDL predictions d̂ are inaccurate. We attribute this behavior to inconsistency in the training dataset. Specifically, when the mapping of x to p (or x to d for the PGDL implementation) is inconsistent, the component f^1_θ cannot learn a robust input transformation from x to p, which subsequently affects the predictions. In this case, the calculated residuals will also propagate the inconsistency and negatively impact the performance of f^3_ζ, as the residual model is tasked with learning an inconsistent mapping from x to d_r. This issue can be alleviated by collecting more features (predictor variables) to be included in x. Additionally, data drift (i.e., distributional shift) between the training set and the testing set can also negatively impact the performance of the PGDL model, especially in the field case where field-wide heterogeneity exists and is often unknown. Our observations in this experiment suggest complexities when working with field data that cannot be completely resolved and may introduce large errors, especially when the collected field data is incomplete, inconsistent, and noisy. Moreover, using time-series data with a higher sampling rate, as seen in [248] and [186], may provide PGDL with more valuable information that can enhance its predictive accuracy. Additionally, simplifying assumptions (e.g., the absence of inter-well communications and lumping production from different stratigraphic zones) can introduce additional errors.
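The per-well least-squares comparison raised above can be made explicit with a short fit of f^2 to an individual observed series. The sketch below assumes a single well's cumulative oil series d_oil of length 60 is available; the initial guess and the post-hoc monotonicity projection are illustrative choices, not part of the method described in this chapter.

import numpy as np
from scipy.optimize import curve_fit

def f2_exp(t, a, b, c):
    """Embedded exponential form for cumulative production, q_t = a - b*exp(-c*t)."""
    return a - b * np.exp(-c * t)

t = np.arange(1, 61, dtype=float)
# d_oil: observed cumulative oil for one well, shape (60,), assumed available
p_ls, _ = curve_fit(f2_exp, t, d_oil, p0=(d_oil.max(), d_oil.max(), 0.05), maxfev=10000)
residual_ls = d_oil - f2_exp(t, *p_ls)

# One pragmatic way to remove the non-monotonic jitter in the combined PGDL
# prediction of cumulative production is a running-maximum projection:
# d_hat_monotone = np.maximum.accumulate(d_hat, axis=0)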
While this experiment with field data has shown promising results, practitioners need to be aware of these complexities inherent within any field dataset. Given the complexity and paucity of the collected field data, the performance of the PGDL model is deemed acceptable and can be improved with additional data. 7.3 Summary and Discussion This chapter presents statistical and explicit approaches to embedding physics-based models into the physics-constrained neural network model. The statistical approach is a two-step training pro- cess where several layers of the physics-constrained model are first trained using simulated data from the physics-based model to act as a proxy. Subsequently, the full neural network model is trained using the field dataset while keeping the parameters within the layers that represent the physics-based model unchanged. On the other hand, the explicit approach directly embeds the physical function into the neural network by allowing the gradient to flow through the function during the training process. The physics-constrained model learns a mapping from well properties to the production responses and outputs physically consistent time series. Moreover, the model learns the transformation functions from a wide variety of field input parameters to intermediate variables. These variables associated with the physical function are particularly useful when an embedded model cannot directly accept the field parameters as its input. We also show how the relevance of the embedded physics-based model (for the field data) may impact the accuracy of the predictions from the physics-constrained model. Since the embedded physics-based model may not sufficiently represent trends and relationships within the field dataset, we augment the physics-constrained model with a residual deep-learning model. The residual model learns a mapping from well properties to the expected residuals. These predicted residuals 247 can compensate for discrepancies introduced by the physics-constrained models due to imperfect knowledge about flow physics (missing or inaccurate physical phenomena) and are added to the physics-constrained prediction to become the final prediction. Integrating the physics-constrained model and residual learning is formalized by the Physics-Guided Deep Learning (PGDL) model as a hybrid predictive model that combines the advantages of physics-based and data-driven methods. Additionally, we present a convenient and practical modular implementation of the PGDL model using fully-connected and 1D convolutional neural network layers. Several synthetic and field datasets are used to demonstrate the mechanism of the PGDL model. The results indicate that the PGDL model can give an improved production prediction for a set of unseen test data compared to a purely black-box model prediction. The embedded prior physics knowledge helps reduce the under-determinedness of the neural network model while allowing for more stable predictions, especially when the amount of available training data is limited. This approach is especially useful for new or underdeveloped fields with a limited number of wells and observations. It is established that while black-box models can fit field observations very well, the predictions may be physically inconsistent or implausible owing to extrapolation or observational biases. 
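The two-step statistical embedding summarized above can be sketched as follows. The proxy f^2_ω is first pre-trained on simulated (p, d) pairs obtained by evaluating the physics-based model, then frozen while f^1_θ is trained on the field dataset. The snippet is a minimal sketch: layer sizes, the p_sim/d_sim arrays, and the field arrays x_field/d_field are assumptions standing in for the setup described in the chapter.

import tensorflow as tf

# Step 1: pre-train a small proxy f2_omega on (p_sim, d_sim) pairs generated by
# evaluating the analytical/physics-based model over sampled parameters p.
p_in = tf.keras.Input(shape=(9,))
h = tf.keras.layers.Dense(64, activation="relu")(p_in)
d_out = tf.keras.layers.Reshape((60, 3))(tf.keras.layers.Dense(60 * 3)(h))
f2_proxy = tf.keras.Model(p_in, d_out, name="f2_proxy")
f2_proxy.compile(optimizer="adam", loss="mse")
# f2_proxy.fit(p_sim, d_sim, epochs=500, verbose=0)

# Step 2: freeze the proxy and train only f1_theta on the field dataset (x, d),
# so the physics captured by the proxy layers is left unchanged.
f2_proxy.trainable = False
x_in = tf.keras.Input(shape=(7,))
p_hat = tf.keras.layers.Dense(9, activation="softplus")(
    tf.keras.layers.Dense(64, activation="relu")(x_in))
d_c = f2_proxy(p_hat)
statistical_pc = tf.keras.Model(x_in, d_c)
statistical_pc.compile(optimizer="adam", loss="mse")
# statistical_pc.fit(x_field, d_field, epochs=1000, verbose=0)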
The embedded physics-based model ensures physical consistency by integrating domain knowledge, governing physical rules, and system dynamics that can provide additional theoretical and observational constraints (typically incomplete and corrupted with several noise sources). Additionally, the method offers a framework for addressing the limitation of purely statisti- cal data-driven models that cannot predict beyond the range of the training data. By combining physics-based and deep learning models, the developed framework provides an opportunity to enhance the extrapolation capability of data-driven models by automatically discovering the in- termediate input parameters into the embedded physics-based model that can be used to simulate long-term forecasts. Specifically, compared to a black-box model where the activations of its neu- ral network layers do not have any physical meaning, the activations from a physics-constrained 248 model correspond to meaningful variables that are used to characterize the flow dynamics. This property indirectly increases the interpretability of the predictive model as the embedded physical function are based on physical principles and conservation laws. Further improvement in inter- pretability is an important future work to be pursued, especially on trying to understand how the input or parts of the input (e.g., well properties) into the PGDL model affect the output (e.g., pro- duction behavior). The examples presented in this study have used empirical physics-based models with an an- alytical form as the embedded prior knowledge. While more complex numerical physics-based models can be utilized, such models require high computational demand due to a large number of permutations of the input parameters with similar production responses. Embedding a complex numerical physics-based model into a physics-constrained model poses several challenges, such as 1. the intractable training process due to computationally heavy backpropagation algorithm, 2. the non-unique inverse mapping from the production response to the input parameters, and 3. the physics-based model that the observed field data cannot resolve. To develop a practical predictive model for complex real-world problems, we take on a balanced perspective that incorporates just as much physics as needed without compromising flexibility and speed. More research is needed to thoroughly assess the benefits and drawbacks of incorporating more complex physics-based mod- els. As an example, physics-based models that consider additional constraints (e.g, constraints on gas-oil ratio) can be embedded in the physics-constrained neural network. Our neural network predictive models are based on a static formulation. In other words, the predictive models take well properties as input to give a time-series prediction of fixed length. The main limitation of such formulation is that each training data point needs to be produced for the same length of time. There can be scenarios when this requirement makes the static formulation impractical; for example, when wells are drilled in sequence and the length of the production pe- riod that can be used as training data is limited by the latest well produced for the shortest period. 249 Also note that while the discovered intermediate input parameters can be used in the physics model for long-term prediction, the residual model is still a data-driven method and predicts a time series of fixed lengths. 
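As a concrete illustration of this extrapolation property, the sketch below, which builds on the explicit model from the earlier Experiment 2 sketch and is therefore hypothetical in its details, recovers the discovered intermediate parameters p from f^1_θ and evaluates the embedded Arps form over a horizon longer than the 60 months used in training. The residual correction, being data-driven and of fixed length, is deliberately left out of the extended forecast, consistent with the limitation noted above.

import numpy as np
import tensorflow as tf

# Recover f1_theta as a sub-model that stops at the intermediate-parameter layer
# (in the earlier sketch, the penultimate layer emits p with shape (batch, 9)).
f1 = tf.keras.Model(physics_constrained.input,
                    physics_constrained.layers[-2].output)

p_hat = f1.predict(X_test)                       # discovered (q_i, b, d_i) per phase

# Because p_hat feeds an analytical decline model, the forecast horizon can be
# extended well beyond the training window, e.g., to 180 months:
t_long = np.arange(1, 181, dtype=float)
p3 = p_hat.reshape(-1, 3, 3)
q_i, b, d_i = p3[..., 0:1], p3[..., 1:2] + 1e-6, p3[..., 2:3]
q_long = q_i / np.power(1.0 + b * d_i * t_long, 1.0 / b)   # (n_wells, 3, 180)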
Alternatively, a dynamic formulation that can allow practitioners to use time- series data with varying lengths is an important future work to be pursued. This formulation can also allow for dynamic residual correction and be useful for long-term prediction. Regardless of the type of formulation, in general, the longer the period of production (covering multiple flow regimes) and the higher the number of training wells (providing more information on the relation- ship between well properties and production behavior), the higher the reduction in prediction error. Another interesting avenue of future works is expanding the capability of the PGDL model to forecast any jump in production rate that is typically observed when wells are shut-in for a pe- riod of time and then reopened to flow. To be able to model this increase in rate due to pressure build-up and/or stimulation operation, an annotated dataset with indicator labels that distinguish the cause of any jump in production is needed. The increase in production rates can be caused by (i) solely pressure build-up, (ii) stimulation operations, or (iii) pressure build-up and stimulation operations. Although the expanded capability will be useful, obtaining such annotated dataset is not trivial and may require significant data processing overhead. Additionally, the expanded model should also consider (i) the dynamical dependency of the length and amount of past production and length of the shut-in period, as they affect the amount of pressure build up and (ii) new well properties associated with the stimulation operations. Furthermore, the performance of the PGDL model under additional flow complexities such as when the reservoir drops below the bubble point pressure during production (causing an increase in gas production) is an interesting future work that may involve further customized neural network formulation. Moreover, in this study, we combined the physics-constrained and residual models’ predic- tions through a simple addition operation. A dynamic formulation of the proposed workflow may warrant a more sophisticated method to combine the physics-constrained and residual models’ 250 predictions. Additionally, the residuals have also been calculated through a simple subtraction op- eration. A workflow that simultaneously trains the physics-constrained and residual models may uncover some useful properties. This work assumes that unconventional fields are typically de- veloped as individual wells with limited (if any) interference. The performance of the proposed workflow needs to be tested for scenarios where inter-well communications may exist and when a physics-based model that considers well interactions is to be embedded. Developing a robust physics-based model to simulate the flow behavior in unconventional reservoirs will likely take years to mature. Long-term fundamental research is needed to advance our understanding of the flow and transport processes in tight formations with complex fracture networks. In the meantime, hybrid methods can be utilized to address the need for a reliable fit- for-purpose forecasting method before such advances are made. Given the abundance of wells drilled in unconventional reservoirs and continuous data acquisition effort, such hybrid gray-box formulation offers an opportunity to develop a more viable and practical deep learning architecture to address standing issues in production forecasting. 
As we gain a deeper theoretical understanding of the causal relations of flow processes in unconventional reservoirs, we can wean practitioners off data-driven methods and employ theoretically sound clear-box methods for increased inter- pretability and faster industry-wide adoption. 251 Chapter 8 Summary, Conclusions and Future Works In this chapter, we first summarize the main research topics covered in this dissertation. Then, we outline the conclusions from each of the research topics. Finally, we discuss potential future works that can be pursued based on the current conclusions and observations. 8.1 Summary The research topics presented in this dissertation focused on the development of automatic feature- based model calibration workflows for conventional subsurface systems using deep learning latent space representations to address three main challenges: (i) handling of uncertain prior geologic scenarios, (ii) the effectiveness of parameterization, and (iii) the performance of data conditioning technique. The developed automatic feature-based model calibration workflows consist of three subsequent or simultaneous steps: (i) prior geologic scenario selection, (ii) parameterization of model and data spaces, and (iii) data conditioning, where the final outcome of the workflows is the set of geologically-consistent calibrated models for robust uncertainty quantification. In Chapter 1, we introduced subsurface inverse problems and the challenges in developing automatic model calibration workflows. We also introduced challenges in developing production forecasting methods for subsurface systems for which we have limited understanding of the flow behavior. Additionally, we briefly introduced several deep learning parameterization techniques 252 and outlined the applications of deep learning latent space representations for addressing the stand- ing challenges. Further, we introduced the general research scope of the dissertation. In Chapter 2, we introduced a two-step feature-based model calibration workflow under uncertain geologic scenarios. We explored the feature extraction property of CNN under classification and regression loss functions for the purpose of geologic scenario selection and feature-based inverse mapping. In Chapter 3, we improvised the workflow in Chapter 2 as the Latent Space Inversion (LSI) framework, by employing deep convolutional autoencoders to generate model and data latent space representations. In this chapter, we performed the inverse mapping between the meaningful latent spaces to generate an ensemble of calibrated model realizations around the inversion solution. In Chapter 4, we extended the LSI framework for forward mapping of model latent space to data latent space, and combined the latent space proxy model with ensemble-based algorithms to become a fully low-dimensional data assimilation workflow called Latent Space Data Assimilation (LSDA). We further applied a variational regularization constraint on the latent spaces for creation of latent representations that are amenable to covariance-based data assimilation methods. In Chapter 5, we applied conditional generative adversarial networks for simultaneous low-dimensional parameter- ization and data label conditioning under uncertain geologic scenarios. We explored the capability of adversarial networks to provide conditional models that contain relevant spatial features that can reproduce the production response behavior for each data class label. 
When observed data becomes available, the relevant class label is identified and the calibrated models are obtained by sampling from the model latent space. The work in Chapter 5 focused on generating geologically realistic conditional models, to promote not only solution discreteness but also the connectivity patterns, by leveraging the adversarial loss function. In this dissertation, deep learning latent space representations are used to address three main challenges in production forecasting of unconventional subsurface systems: (i) effectively captur- ing significant production trends for data-driven long-term production forecasting, (ii) overcoming the challenge of training data requirements, and (iii) enhancing current simulators that rely on imperfect modeling assumptions for hybrid physics-guided production forecasting. 253 Chapter 6 presents a novel approach for long-term production forecasting by learning dynamic latent space representations of completion parameters, formation and fluid properties, operating controls, and early production response data using a recurrent neural network model. The model uses auto-regression to generate accurate forecasts over extended periods. To reduce the training data requirement, the transfer learning concept is applied, leveraging relevant information from related fields. This work focuses on generating purely data-driven long-term forecasts by exploit- ing useful dynamical features within the training data and other relevant fields. Chapter 7 further improves data-driven forecasting models by directly embedding physics-based model to form a modular hybrid predictive model that combines the advantages of both physics-based and data- driven methods to provide accurate predictions. The proposed model addresses the limitations and imperfections of physics-based models and enables physically-consistent long-term predictions. In the current chapter, (i.e., Chapter 8), we present the summary, conclusions, and future works for this dissertation. 8.2 Conclusions The main conclusions obtained from the research topics covered in this dissertation are summa- rized below: Chapter 2: • Feature learning capability of convolutional neural networks (CNN) when used with cross- entropy loss function as a classification model can be exploited to extract salient dynamical trends for distinguishing between geologic scenarios and identifying the likelihood of each scenario. • CNN withℓ 2 regression loss function can perform inverse mapping from flow response data to salient features of reservoir property distribution represented by PCA coefficients for data- driven feature-based model calibration. 254 • Geologic scenario selection step can remove possible artifact (due to irrelevant scenarios) from the calibrated solutions by pruning the search space and further reducing computational burden by considering only relevant scenarios in the inverse mapping step. Chapter 3: • Convolutional autoencoders are effective deep learning parameterization method for extract- ing salient spatial and temporal features from models and data respectively, to generate com- pact deep learning latent space representations and outperform classical PCA-based param- eterization in terms of compression ability and preservation of nonlinear features. 
• Coupled Latent Space Inversion (LSI) framework that combines the process of parameteri- zation and inversion show improved performance over decoupled method, as construction of the low dimensional latent spaces is informed of the final objective that is to learn the inverse mapping between production response data and geologic features present in the prior model realizations. • Direct nonlinear mapping from the data latent space to the model latent space serves as a feature-based data conditioning alternative to conventional inverse modeling formulations. • Meaningful latent spaces with robust nonlinear mapping allow the exploration of data and model spaces for the generation of an ensemble of non-unique inversion solutions with spa- tial features that can reproduce the observed data. Chapter 4: • The Latent-Space Data Assimilation (LSDA) framework utilized convolutional variational autoencoders to simultaneously parameterize models and data and learn the forward mapping between the model and data latent spaces to function as an effective and compact latent space proxy model. 255 • Embedding a variational regularization term in the loss function results in Gaussian latent variables that are amenable to covariance-based data assimilation algorithms for reduced- order feature-based implementation of ensemble data assimilation. • The latent space proxy model allows geologically consistent feature-based updates, outper- forming conventional proxy model that maps model to data for pixel-based updates and further alleviate the computational burden of running a full physical reservoir simulator in the iterative workflow. • With the latent space proxy model, the size of the prior ensemble can be increased as compu- tational burden does not grow linearly with the number of iterations, to improve the geologic consistency and prediction accuracy of the posterior ensemble. Chapter 5: • We developed a new offline direct probabilistic inversion framework using deep convolu- tional conditional generative adversarial networks for simultaneous low-dimensional param- eterization and data label conditioning under uncertain geologic scenarios. • Adversarial loss function results in improved geologic consistency in the calibrated mod- els, both in terms of solution discreteness and connectivity patterns, when compared toℓ 2 regression loss function (as utilized in Chapters 2 to 5). • Deep learning latent space representation that is conditioned to nonlinear data label can be sampled to provide realistic calibrated models for uncertainty quantification, where the generated models contain spatial features that honor the dynamical variations observed in the data class label. • A second online approach is developed where a neighborhood selection algorithm is used in the data space to select relevant training model realizations based on the similarity of 256 simulated dynamic data to the observed data, and they are used to train a convolutional generative adversarial networks for generation of inversion solutions. • Generative adversarial networks possess superior compression ability where multiple prior geologic scenarios can be combined in the workflow without introducing artifact in the solu- tions (as observed in Chapter 2), although a geologic scenario selection step can reduce the computational burden of training data collection by first removing the irrelevant scenarios. 
• The performance of the developed approaches improve with longer production period and higher well density for deriving the data class labels, when there is less existing geologic uncertainty to be resolved that translates to less spatial patterns that need to be parameterized by the generative adversarial networks. Chapter 6: • We developed a new unified deep learning architecture based on recurrent neural networks (i.e., LSTM autoencoder) and fully-connected neural network layers in a sequence-to-sequence formulation that learns the dynamical trend for long-term data-driven production forecasting. • Abundant data from repeated factory-style drilling and completion of unconventional re- sources, enabled the development and application of data-driven statistical models for pre- dicting hydrocarbon production performance. • Transfer learning allows relevant dynamical trends from other fields to be used for production forecasting of a target field with relatively smaller dataset and allows continuous fine-tuning with newly available dynamic data. • The long-term prediction performance of the developed approach is consistent across several scenarios of transfer learning where the conditional distributions of the source data and target data differ. 257 Chapter 7: • The new Physics-Guided Deep Learning (PGDL) model leverages the strengths of a data- driven model and a physics-based model to provide physically-consistent long-term pro- duction forecasts, with two choices of implementation to directly embed the physics-based model (i.e., statistical approach and explicit approach). • The augmented residual learning model compensates for errors in the description of inputs or any missing physical phenomena by predicting the expected bias, which is used to correct the predictions from the physics-constrained neural network model. • Embedding a physics-based model reduces number of trainable parameters and eases the training process and data requirements, leading to more stable predictions and higher accu- racy when compared to purely data-driven methods and purely physics-based methods. • The hybrid model utilizes the available data to compensate for current limitations of physics- based models. 8.3 Future Works New developments in deep learning continue to offer new approaches to address existing chal- lenges in subsurface flow modeling, for more advanced capabilities in handling complex model calibration and production forecasting problems. There are many interesting avenues that can be pursued based on the obtained conclusions and observations. Here are our main suggestions for future directions: • Adversarial training strategies for improving geologic consistency: The latent space rep- resentations used in this work show that adversarial networks in Chapter 5 possess supe- rior compression and representation abilities (versus PCA in Chapter 2 and autoencoders in Chapter 3 and Chapter 4) in preserving the geologic consistency in the calibrated models, despite known issues of training instability as they learn to map a random distribution to 258 a complex multi-modal distribution. To ease the training process, the adversarial networks can be trained to map a predefined latent space (i.e., from PCA or autoencoder) to the model space. The adversarial loss function estimates the probability of error between in-sample and out-of-sample models and its minimization allows the neural network architecture to learn, and thus, to generate consistent in-sample (i.e., feasible) models. 
The modified adversarial network can be combined with ensemble-based algorithms in a feature-based model cali- bration workflow. Additionally, to make the mapping robust against inconsistent geologic features that may be introduced by covariance-based updates, the architecture can be trained to recognize out-of-sample models generated by introducing structural noise to the original in-sample models. Our early observations are published in [52]. • Dynamic neural network for data compression in model calibration workflows: In this work, the data space in the model calibration workflows is represented as static fixed-length vector (for each data point) where the latent space representations are learned using one- dimensional convolutional operations to extract the salient temporal features. A more prac- tical implementation is to enhance the presented approaches by considering the application of sequence-to-sequence Recurrent Neural Network (RNN) models [168] for the data en- coder and decoder where newly obtained data can be conveniently included as additional data points for retraining. • Performance for complex field cases: The developed workflows in this dissertation incor- porate observed production response data to reduce the uncertainty in prior geologic sce- narios and for model calibration. The robustness of the workflows in more complex settings where multi-facies geologic models and other major sources of uncertainty such as structural variation in horizons and fault systems are present in the reservoir need to be evaluated and is an interesting line of research to explore. Additionally, the development of a systematic approach to handle the assimilation of complex field cases with high-dimensional observa- tions (i.e., 4D seismic) and multiple disparate observations using LSI (Chapter 3), LSDA (Chapter 4), and GAN (Chapter 5) are important future works. 259 • Probabilistic formulation: While the developed workflows in Chapter 2 to Chapter 4 are useful for model calibration, they are based on regression loss functions and covariance- based techniques for parameterization and data conditioning. The observations from the methods developed in Chapter 5 suggest that probabilistic methods are more robust for un- certainty quantification. An extension in the form of probabilistic formulation for the meth- ods in Chapter 2 to Chapter 4 may prove to increase the robustness of the methods. • Closed-loop optimization workflow: The developed workflows can be combined with con- trol optimization methods or well placement optimization methods for closed-loop reservoir management or closed-loop field development, respectively. When implemented with the earlier proposed dynamic neural network, the enhanced neural network architectures can be updated with newly available data and the developed workflows can be run periodically to recalibrate the models with newly available data as time progresses, to allow for closed-loop optimization workflows. The set of recalibrated geologic models can be used for robust optimization in uncertainty quantification workflows by considering the remaining geologic uncertainties in the geologic models that can not be resolved by the available dynamic data. • Dynamic formulation of Physics-Guided Deep Learning (DPGDL): In Chapter 7, we developed a PGDL model that treats production responses as static fixed-length vectors. However, a model that can handle time-series data with variable length would be more re- alistic for field data. 
To embed the physics-based model in its original form, we can train the physics-constrained neural network with a masked loss function, which enables learning from wells with varying production lengths. To do this, we can use an indicator vector to label the partially observed time-series, and exclude gradient information from the timesteps that are not available during backpropagation. The residual model can be implemented with a masked loss function or using a sequence-to-sequence formulation as outlined in Chapter 6. Our early observations on this approach are published in [175]. 260 • Improving interpretability and explainability: In Chapters 6 and 7, we developed work- flows that incorporate data-driven components for production forecasting. Because the fore- casts may be used for high-stakes decision-making, engineers need to be able to understand how the predictive models arrive at their outputs. Improving interpretability enables users to analyze and understand the internal workings and decision-making process of the model to a certain extent, even if it doesn’t provide an explanation for why a particular decision was made. Explainability, on the other hand, refers to the ability to justify or explain why a model made a particular decision in a human-understandable way. Models that have both of these characteristics are more likely to be adopted quickly. 261 Nomenclature Chapter 1 d obs = observed data f(·)= input-output mapping function f m (·)/ f − 1 m (·)= abstract forward/inverse dimension reduction function for m f d (·)/ f − 1 d (·)= abstract forward/inverse dimension reduction function for d g(·)= nonlinear forward function m/d= model/data M/D= set of models/data n= number of elements in vector representation of a geologic realization N f = dimension of time-series data N l = size of lag window N s = size of forecast window p= well properties s= number of basis functions retained for approximation u= vector representation of a geologic realization u w+1 = control trajectory of forecast window φ ′ = matrix of basis functions retained for approximation v= coefficients to matrix of basis functions retained for approximation v i = coefficient to a basis function ϕ i = a basis function X = input tuple y w = lag window y w+1 = forecast window Y = output tuple z m /z d = latent variables of model/data Z m /Z d = set of latent variables of model/data 262 Chapter 2 a= output/activation of a convolution layer b= batch size of training data C= number of scenarios (class labels) to be classified d obs = historical/observed dynamic data D(·)= classification model f = set of filters/kernels f θ = trainable kernels/filters of the classifier f ψ = trainable kernels/filters of the regression model g= likelihood of scenarios h= output of a non-linear function H (·)= regression model H l × W l × D l = height, width and depth of input of any l layer in a neural network H l+1 × W l+1 × D l+1 = height, width and depth of output of any l layer in a neural network H× W× D l = height, width and depth of each filter/kernel of any l layer i l , j l ,d l = index for any element in input matrix for l layer i l+1 , j l+1 ,d l+1 = index for any element in output matrix for l layer k= a permeability model K= number of plausible scenarios m= matrix representation of a model realization b m= predicted realization with proposed workflow M/D= set of model/data from all scenarios M reduced /D reduced = set of model/data from relevant scenarios N= number of model realizations per scenario p= output of a 
max-pooling (sub-sampling) layer ψ = weights of the regression model φ = a porosity model σ = sigmoid function S= matrix of one-hot encoding of production data labels θ = weights of the classifier x l = input to any l layer in a neural network y d = one-hot vector encoding of class labels for each input for classifier y h = vector representation of each model realization for regression model z= output vector of a dense (matrix-multiplication) layer z ′ = a Gaussian field 263 Chapter 3 ε = perturbation around z d obs E = set of perturbation vectors around z d obs Enc θ (·)/Dec θ (·)= encoder/decoder for models, with weightsθ Enc ψ (·)/Dec ψ (·)= encoder/decoder for data, with weightsψ f m (·)/ f − 1 m (·)= abstract forward/inverse dimension reduction function for m f d (·)/ f − 1 d (·)= abstract forward/inverse dimension reduction function for d F = number of features in d g(·)/g − 1 (·)= forward/inverse model G= number of noise samples H = dimension of z d K= dimension of z m M= grid dimension of m m/d= model/data ˆ m/ ˆ d= predicted or reconstructed model/reconstructed data m re f /d obs = model reference case/observed data ˆ m re f / ˆ d obs = reconstructed model reference case/observed data M/D= set of models/data ˆ M/ ˆ D= set of predicted or reconstructed models/reconstructed data N= number of elements in M/D φ,k= porosity/permeability model Reg dm γ (·)= regressor from z d to z m , with weightsγ T = number of timesteps in d z m /z d = latent variables of model/data ˆ z m /ˆ z d = predicted latent variables of model/data z m re f /z d obs = latent variables of reference model/observed data Z m /Z d = set of latent variables of model/data Z m re f /Z d obs = set of latent variables of inversion solutions/perturbed observed data 264 Chapter 4 α = inflation factor C z D = covariance matrix of observed data errors latent variables C z m z d = covariance matrix of model and data latent variables C z d z d = covariance matrix of data latent variables Enc θ (·)/Dec θ (·)= model encoder/decoder with weightsθ Enc ψ (·)/Dec ψ (·)= data encoder/decoder with weightsψ F = number of features in data H = dimension of data latent variables K= dimension of model latent variables m/d= model/data ˆ m/ ˆ d= reconstructed model/reconstructed or predicted data m re f /d obs = reference model/observed data M/D= set of models/set of data M= dimension of model N a = number of assimilation steps N e = size of ensemble p M /p D = distribution of models/data p z M /p z D = distribution of models latent space/data latent space Reg md γ (·)= regressor from z m to z d with weightsγ T = number of timesteps in data z m /z d = model latent variables/data latent variables z m re f /z d obs = reference model latent variables/observed data latent variables z M /z D = set of model latent variables/set of data latent variables z i d obs,k = latent variables for the i-th realization of the perturbed data at time t k 265 Chapter 5 α = interpolation factor c= class label C= set of class labels C φ = classifier with weights φ D ψ = discriminator/critic with weightsψ ε = error around d obs g(·)= nonlinear forward function G= linear forward function G θ = generator with weightsθ K= number of class labels λ = weight for gradient penalty term m/d= model/data ˜ m/ ˜ d= generated model/data ˆ m= interpolated model m re f /d obs = model reference case/observed data M/D= set of models/data ˜ M/ ˜ D= set of generated models/data M ∗ /D ∗ = set of relevant models/data ˜ M c / ˜ D c = set of generated models/data belonging to class c M ∗ 
c /D ∗ c = set of relevant models/data belonging to class c N= number of elements in M/D N b = batch size p X = distribution of any setX φ,k= porosity/permeability model s(x)= slowness distribution as a function of the spatial coordinate x t= arrival time z= latent vector ˆ z= interpolated latent vector Z= set of latent vectors 266 Chapter 6 α = slope coefficient A= input into a fully-connected layer b= degree of curvature of decline curve function b c /b f /b i /b o = bias term of the cell/forget/input/output gate b d = bias term of a dense layer C t = current internal state of the cell C t− 1 = past internal state of the cell ˜ C t = update information of the cell gate d i = initial decline rate for decline curve function Dec γ = decoder with trainable parameters D S /D T = source/target domain Enc θ = encoder with trainable parameters f T (·)/ f S (·)= target/source predictive model f(·)= activation function f t = output of the forget gate h t = current hidden state of the cell h t− 1 = past hidden state of the cell i t = output of the input gate N b = batch size N d = number of hidden nodes of the dense layer N f = number of features (univariate or multivariate) N h = length of the hidden output N i = input size into a dense layer N l = size of lag window N p = size of well parameters N s = size of forecast window N source /N target = size of source/target dataset o t = output of the output gate p= well parameters P(X S )/P(X T )= marginal probability distribution of the source/target domain feature set P(Y S )/P(Y T )= marginal probability distribution of the source/target domain output set q i = initial rate Reg ω = regressor with trainable parameters Reg ζ = regressor with trainable parameters σ(·)= sigmoid activation function T S /T T = source/target task u w+1 = control window U c /U f /U i /U o = hidden weight of the cell/forget/input/output gate W c /W f /W i /W o = weight of the cell/forget/input/output gate W d = weight of a dense layer X S /X T = source/target input feature set y= vector of input sequence y w = lag window y w+1 = forecast window Y S /Y T = source/target output feature set Z= output of a dense layer 267 Chapter 7 α = slope coefficient A= input into a fully-connected layer b= degree of curvature of decline curve function b d = bias term d i = initial decline rate for decline curve function d= production data ˆ d= prediction of production data d c = physics-constrained production data ˆ d c = prediction of physics-constrained production data d r = residual of production data ˆ d r = prediction of residual of production data d sim = simulated production data ˆ d sim = prediction of simulated production data f(·)= forecast function f a (·)= activation function f 2 = physics-based model f 2ω = proxy representation of f 2 with trainable parametersω f 1 θ = function that maps x to p with trainable parametersθ f 1 θ ◦ f 2 = explicit physics-constrained neural network f 1 θ ◦ f 2ω = statistical physics-constrained neural network f 3 ζ = residual neural network model f 1 θ ◦ f 2ω + f 3 ζ = statistical Physics-Guided Deep Learning model f 1 θ ◦ f 2 + f 3 ζ = explicit Physics-Guided Deep Learning model h= kernel N f ield = number of field dataset N sim = number of simulated data points N b = batch size N c = number of channels N d = number of hidden nodes N f = dimension of the time-series production data N h = length of kernel N i = size of the input N k = number of kernels N m = length of the input along the time axis N n = length of the output along the time axis N p = 
dimension of the input for physics-based model N t = length of time-series d and u N x = length of feature vector x p= intermediate variables or input for f 2 ˆ p= prediction of intermediate variables or input for f 2 p= amount of padding q i = initial rate for decline curve function s= convolutional stride u= control trajectories u sim = simulated control trajectories x= well properties x sim = simulated well properties V= arbitrary input of any 1D convolutional layer W d = weights to be learned Y= arbitrary output of any 1D convolutional layer Z= arbitrary output of a fully-connected layer 268 Bibliography [1] S. I. Aanonsen, G. Nævdal, D. S. Oliver, A. C. Reynolds, and B. Vall` es. Ensemble kalman filter in reservoir engineering—a review. SPE J, 14(3):393–412, 2009. [2] I. Aizenberg, L. Sheremetov, L. Villa-Vargas, and J. Martinez-Mu˜ noz. Multilayer neural network with multi-valued neurons in time series forecasting of oil production. Neurocomputing, 175:980– 989, 2016. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2015.06.092. [3] N. I. Al-Bulushi, P. R. King, M. J. Blunt, and M. Kraaijveld. Artificial neural networks workflow and its application in the petroleum industry. Neural Computing and Applications, 409-421, 2012. doi: https://doi.org/10.1007/s00521-010-0501-6. [4] R. Al-Shabandar, A. Jaddoa, P. Liatsis, and A. J. Hussain. A deep gated recurrent neural network for petroleum production forecasting. Machine Learning with Applications, 3:100013, 2021. ISSN 2666-8270. doi: https://doi.org/10.1016/j.mlwa.2020.100013. [5] R. Altman, R. Tineo, A. Viswanathan, and N. Gurmen. Applied learnings in reservoir simulation of unconventional plays. Society of Petroleum Engineers, July 2020. [6] M. Amendola, R. Arcucci, L. Mottet, C. Q. Casas, S. Fan, C. Pain, P. Linden, and Y . Guo. Data assimilation in the latent space of a convolutional autoencoder. In M. Paszynski, D. Kranzlm¨ uller, V . V . Krzhizhanovskaya, J. J. Dongarra, and P. M. A. Sloot, editors, Computational Science – ICCS 2021, pages 373–386, Cham, 2021. Springer International Publishing. ISBN 978-3-030-77977-1. [7] S. Amini and S. Mohaghegh. Application of machine learning and artificial intelligence in proxy modeling for fluid flow in porous media. Fluids, 4(3):126, 2019. [8] F. J. Anscombe. Frederick Mosteller and John W. Tukey: A Conversation, pages 647–660. Springer New York, New York, NY , 2006. ISBN 978-0-387-44956-2. doi: 10.1007/978-0-387-44956-2 42. URLhttps://doi.org/10.1007/978-0-387-44956-2_42. [9] M. Araya-Polo, J. Jennings, A. Adler, and T. Dahlke. Deep-learning tomography. The Leading Edge, 37(1):58–66, 2018. [10] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. CoRR, 2017. URL https://arxiv. org/abs/1701.07875. [11] J. J. Arps. Analysis of decline curves. Transactions of the AIME, 160:228–247, 1945. [12] R. C. Aster, B. Borchers, and C. H. Thurber. Parameter estimation and inverse problems. Elsevier, 2018. [13] A. Astrakova and D. S. Oliver. Conditioning Truncated Pluri-Gaussian Models to Facies Obser- vations in Ensemble-Kalman-Based Data Assimilation. Math Geosci, 47:345–367, 2015. doi: 10.1007/s11004-014-9532-3. [14] K. Aziz and A. Settari. Petroleum Reservoir Simulation. Applied Science Publishers, 1979. [15] K. Aziz et al. Reservoir simulation grids: opportunities and problems. Journal of Petroleum Tech- nology, 45(07):658–663, 1993. 269 [16] M. Babaei, A. Alkhatib, and I. Pan. Robust optimization of subsurface flow using polynomial chaos and response surface surrogates. 
Computational Geosciences, 19(5):979–998, 2015. [17] T. Bai and P. Tahmasebi. Efficient and data-driven prediction of water breakthrough in subsurface systems using deep long short-term memory machine learning. Computational Geosciences, 25(1): 285–297, 02 2021. ISSN 1573-1499. doi: 10.1007/s10596-020-10005-2. [18] N. Bassamzadeh and R. Ghanem. Probabilistic Data-Driven Prediction of Wellbore Signatures in High-Dimensional Data Using Bayesian Networks. Society of Petroleum Engineers, 2018. [19] A. Baydin, B. Pearlmutter, and A. Radul. Automatic differentiation in machine learning: a survey. CoRR, abs/1502.05767, 2015. URLhttp://arxiv.org/abs/1502.05767. [20] M.G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos. The cramer distance as a solution to biased wasserstein gradients. CoRR, abs/1705.10743, 2017. [21] C. E. Bond, A. D. Gibbs, Z. K. Shipton, and S. Jones. What do you think this is?: “Conceptual uncertainty” in geoscience interpretation. GSA Today, 17:4–10, 2007. [22] J. Bongard and H. Lipson. Automated reverse engineering of nonlinear dynamical systems. Pro- ceedings of the National Academy of Sciences, 104(24):9943–9948, 2007. doi: 10.1073/pnas. 0609476104. URLhttps://www.pnas.org/doi/abs/10.1073/pnas.0609476104. [23] J. Boulicaut and B. Jeudy. Constraint-Based Data Mining, pages 399–416. Springer US, Boston, MA, 2005. ISBN 978-0-387-25465-4. doi: 10.1007/0-387-25465-X 18. URLhttps://doi.org/ 10.1007/0-387-25465-X_18. [24] R. B. Bratvold, E. Mohus, D. Petutschnig, and E. Bickel. Production Forecasting: Optimistic and Overconfident Over and Over Again. SPE Reservoir Evaluation & Engineering, 23(03):0799–0810, 08 2020. ISSN 1094-6470. doi: 10.2118/195914-PA. [25] C. Brunetti, M. Bianchi, G. Pirot, and N. Linde. Hydrogeological Model Selection Among Complex Spatial Priors. Water Resources Research, 55(8):6729–6753, 2019. doi: 10.1029/2019WR024840. [26] D. Busby. Deep-dca a new approach for well hydrocarbon production forecasting. ECMOR XVII, 2020(1):1–10, 2020. ISSN 2214-4609. doi: https://doi.org/10.3997/2214-4609.202035124. URL https://www.earthdoc.org/content/papers/10.3997/2214-4609.202035124. [27] J. Caers. Efficient gradual deformation using a streamline-based proxy method. Journal of Petroleum Science and Engineering, 39(1):57 – 83, 2003. ISSN 0920-4105. doi: 10.1016/S0920-4105(03) 00040-8. [28] J. Caers. Comparing the gradual deformation with the probability perturbation method for solving inverse problems. Mathematical Geology, 39(1):27–52, 2007. [29] G. Camps-Valls, L. Martino, D. Svendsen, M. Campos-Taberner, J. Mu˜ noz-Mar´ ı, V . Laparra, D. Lu- engo, and F. Garc´ ıa-Haro. Physics-aware gaussian processes in remote sensing. Applied Soft Com- puting, 68:69–82, 2018. ISSN 1568-4946. doi: https://doi.org/10.1016/j.asoc.2018.03.021. URL https://www.sciencedirect.com/science/article/pii/S1568494618301431. [30] S. W. A. Canchumuni, A. A. Emerick, and M. A. C. Pacheco. Towards a robust parameterization for conditioning facies models using deep variational autoencoders and ensemble smoother. Computers & Geosciences, 128:87 – 102, 2019. ISSN 0098-3004. doi: https://doi.org/10.1016/j.cageo.2019.04. 006. [31] S. W. A. Canchumuni, J. D. B. Castro, J. Potratz, A. A. Emerick, and M. A. C. Pacheco. Recent de- velopments combining ensemble smoother and deep generative networks for facies history matching. Computational Geosciences, 25:433–466, 2021. doi: https://doi.org/10.1007/s10596-020-10015-0. [32] S.A. Canchumuni, A.A. Emerick, and M.A. 
Pacheco. Integration of Ensemble Data Assimilation and Deep Learning for History Matching Facies Models. Offshore Technology Conference, 2017. doi: 10.4043/28015-MS. 270 [33] J. Carrera, A. Alcolea, A. Medina, J. Hidalgo, and L. J. Slooten. Inverse problem in hydrogeology. Hydrogeology Journal, 13:206–222, 2005. doi: https://doi.org/10.1007/s10040-004-0404-7. [34] J. C. Castilla-Rho, G. Mariethoz, B. F. J. Kelly, and M. S. Andersen. Stochastic reconstruction of paleovalley bedrock morphology from sparse datasets. Environmental Modelling & Software, 53: 35–52, 2014. ISSN 1364-8152. doi: https://doi.org/10.1016/j.envsoft.2013.10.025. [35] C. Chai, M. Maceira, H. J. Santos-Villalobos, S. V . Venkatakrishnan, M. Schoenball, W. Zhu, G. C. Beroza, C. Thurber, and EGS Collab Team. Using a deep neural network and transfer learning to bridge scales for seismic phase picking. Geophysical Research Letters, 47(16), 2020. doi: 10.1029/ 2020GL088651. [36] S. Chan and A.H. Elsheikh. Parametric generation of conditional geological realizations using gen- erative neural networks. Computational Geosciences, Jul 2019. doi: 10.1007/s10596-019-09850-7. [37] H. Chang and D. Zhang. Machine learning subsurface flow equations from data. Computational Geosciences, 23:895–910, 2019. ISSN 1573-1499. doi: 10.1007/s10596-019-09847-2. URL https://doi.org/10.1007/s10596-019-09847-2. [38] H. Chang, D. Zhang, and Z. Lu. History matching of facies distribution with the EnKF and level set parameterization. Journal of Computational Physics, 229(20):8011 – 8030, 2010. ISSN 0021-9991. doi: 10.1016/j.jcp.2010.07.005. [39] D. K. Chaturvedi. Soft Computing: Techniques and Its Applications in Electrical Engineering. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 9783540774808. [40] C. Chen, G. Gao, B. A. Ramirez, J. C. Vink, and A. M. Girardi. Assisted History Matching of Channelized Models by Use of Pluri-Principal-Component Analysis. Society of Petroleum Engineers, 21, 2016. doi: 10.2118/173192-PA. [41] X. Chen, Y . Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, 2016. [42] N. Cherpeau, G. Caumon, J. Caers, and B. L´ evy. Method for Stochastic Inverse Modeling of Fault Geometry and Connectivity Using Flow Data. Mathematical Geosciences, 44(2):147–168, Feb 2012. [43] N. Chithra Chakra, K. Song, M. M. Gupta, and D. N. Saraf. An innovative neural forecast of cumulative oil production from a petroleum reservoir employing higher-order neural networks (honns). Journal of Petroleum Science and Engineering, 106:18–33, 2013. ISSN 0920-4105. doi: https://doi.org/10.1016/j.petrol.2013.03.004. [44] Y . Cho, E. Eker, I. Uzun, X. Yin, and H. Kazemi. Rock Characterization in Unconventional Reser- voirs: A Comparative Study of Bakken, Eagle Ford, and Niobrara Formations. In -, SPE Rocky Mountain Petroleum Technology Conference / Low Permeability Reservoirs Symposium, 05 2016. doi: 10.2118/180239-MS. URLhttps://doi.org/10.2118/180239-MS. SPE-180239-MS. [45] F. Chollet et al. Keras. https://keras.io, 2015. [46] J. Cornelio, S. Mohd Razak, A. Jahandideh, Y . Cho, H. Liu, R. Vaidya, and B. Jafarpour. Physics-assisted transfer learning for production prediction in unconventional reservoirs. Uncon- ventional Resources Technology Conference, Houston, Texas, 26–28 July 2021, pages 3669–3682, 2021. doi: 10.15530/urtec-2021-5688. URL https://library.seg.org/doi/abs/10.15530/ urtec-2021-5688. [47] J. Cornelio, S. 
Mohd Razak, Y . Cho, H. Liu, R. Vaidya, and B. Jafarpour. Residual Learning to Integrate Neural Network and Physics-Based Models for Improved Production Prediction in Un- conventional Reservoirs. SPE Journal, 27(06):3328–3350, 12 2022. ISSN 1086-055X. doi: 10.2118/210559-PA. URLhttps://doi.org/10.2118/210559-PA. [48] J. G. De Gooijer and R. J. Hyndman. 25 years of time series forecasting. International Journal of Forecasting, 22(3):443–473, 2006. ISSN 0169-2070. doi: https://doi.org/10.1016/j.ijforecast.2006. 01.001. Twenty five years of forecasting. 271 [49] D. DeMers and G. W. Cottrell. Non-Linear Dimensionality Reduction. In Advances in Neural Information Processing Systems 5, [NIPS Conference], page 580–587, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc. ISBN 1558602747. [50] V . Demyanov, D. Arnold, T. Rojas, and M. Christie. Uncertainty Quantification in Reservoir Predic- tion: Part 2—Handling Uncertainty in the Geological Scenario. Mathematical Geosciences, 51(2): 241–264, 2019. [51] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders, 2016. [52] U. Djuraev, S. Mohd Razak, and B. Jafarpour. Adversarial strategies for improved geologic consis- tency in feature-based parameterization. ECMOR, 2022(1):1–19, 2022. ISSN 2214-4609. doi: https://doi.org/10.3997/2214-4609.202244057. URL https://www.earthdoc.org/content/ papers/10.3997/2214-4609.202244057. [53] Z. Dong, S. A. Holditch, and D. A. McVay. Resource Evaluation for Shale Gas Reservoirs. SPE Economics & Management, 5(01):5–16, 01 2013. ISSN 2150-1173. doi: 10.2118/152066-PA. URL https://doi.org/10.2118/152066-PA. [54] P. Dromgoole and R. Speers. Managing uncertainty in oilfield reserves. Middle East Well Evaluation Review, 12:30–41, 1992. [55] M. Ehrendorfer. A review of issues in ensemble-based Kalman filtering. Meteorologische Zeitschrift, 16(6):795–818, December 2007. doi: 10.1127/0941-2948/2007/0256. [56] US Energy Information Administration EIA. EIA updates Eagle Ford maps to provide greater geo- logic detail. https://www.eia.gov/todayinenergy/detail.php?id=19651, 2015. [57] US Energy Information Administration EIA. The API gravity of crude oil produced in the U.S. varies widely across states. https://www.eia.gov/todayinenergy/detail.php?id=30852, 2017. [58] A. El-Shafie, A. Noureldin, M. Taha, A. Hussain, and M. Mukhlisin. Dynamic versus static neural network model for rainfall forecasting at klang river basin, malaysia. Hydrology and Earth Sys- tem Sciences, 16(4):1151–1169, 2012. doi: 10.5194/hess-16-1151-2012. URL https://hess. copernicus.org/articles/16/1151/2012/. [59] A. H. Elsheikh, I. Hoteit, and M. F. Wheeler. Efficient bayesian inference of subsurface flow mod- els using nested sampling and sparse polynomial chaos surrogates. Computer Methods in Applied Mechanics and Engineering, 269:515–537, 2014. [60] A. A. Emerick. Investigation on Principal Component Analysis Parameterizations for History Match- ing Channelized Facies Models with Ensemble-Based Data Assimilation. Math Geosci, 49:85–120, 2017. doi: 10.1007/s11004-016-9659-5. [61] A. A. Emerick and A. C. Reynolds. Ensemble smoother with multiple data assimilation. Computers & Geosciences, 55:3–15, 2013. ISSN 0098-3004. doi: https://doi.org/10.1016/j.cageo.2012.03.011. Ensemble Kalman filter for data assimilation. [62] G. Evensen. The ensemble kalman filter: theoretical formulation and practical implementation. 
Ocean Dynamics, 53:343–367, 2003. doi: 10.1007/s10236-003-0036-9. [63] G. Evensen. Sampling strategies and square root analysis schemes for the enkf. Ocean Dynamics, 54:539—-560, 2004. doi: https://doi.org/10.1007/s10236-004-0099-2. [64] G. Evensen. Data assimilation: the ensemble Kalman filter . Springer Science & Business Media, 2009. [65] G. Evensen. Analysis of iterative ensemble smoothers for solving inverse problems. Comput Geosci, 22:885–908, 2018. doi: https://doi.org/10.1007/s10596-018-9731-y. 272 [66] M. J. Fetkovich, E. J. Fetkovich, and M. D. Fetkovich. Useful Concepts for Decline-Curve Forecast- ing, Reserve Estimation, and Analysis. SPE Reservoir Engineering, 11(01):13–22, 02 1996. ISSN 0885-9248. doi: 10.2118/28628-PA. URLhttps://doi.org/10.2118/28628-PA. [67] U. Forssell and P. Lindskog. Combining semi-physical and neural network modeling: An example ofits usefulness. IFAC Proceedings Volumes, 30(11):767–770, 1997. ISSN 1474-6670. doi: https: //doi.org/10.1016/S1474-6670(17)42938-7. URLhttps://www.sciencedirect.com/science/ article/pii/S1474667017429387. IFAC Symposium on System Identification (SYSID’97), Ki- takyushu, Fukuoka, Japan, 8-11 July 1997. [68] F. Friedmann, A. Chawathe, and D. K. Larue. Assessing Uncertainty in Channelized Reservoirs Using Experimental Designs. Society of Petroleum Engineers, 2001. [69] G. Gao, H. Jiang, J. C. Vink, C. Chen, Y . El-Khamra, and J. J. Ita. Gaussian mixture model fit- ting method for uncertainty quantification by conditioning to production data. Computational Geo- sciences, 24:663–681, 2020. ISSN 1573-1499. doi: 10.1007/s10596-019-9823-3. [70] G. R. Gavalas, P. C. Shah, and J. H. Seinfeld. Reservoir History Matching by Bayesian Estimation. Society of Petroleum Engineers, 1976. [71] M. G. Gerritsen and L. J. Durlofsky. Modeling fluid flow in oil reservoirs. Annu. Rev. Fluid Mech., 37:211–238, 2005. [72] S. A. Gherabati, J. Browning, F. Male, S. A. Ikonnikova, and G. McDaid. The impact of pres- sure and fluid property variation on well performance of liquid-rich eagle ford shale. Jour- nal of Natural Gas Science and Engineering, 33:1056 – 1068, 2016. ISSN 1875-5100. doi: https://doi.org/10.1016/j.jngse.2016.06.019. URL http://www.sciencedirect.com/science/ article/pii/S1875510016304024. [73] A. Golmohammadi and B. Jafarpour. Simultaneous geologic scenario identification and flow model calibration with group-sparsity formulations. Advances in Water Resources, 92:208 – 227, 2016. [74] A. Golmohammadi, M.M. Khaninezhad, and B. Jafarpour. Group-sparsity regularization for ill- posed subsurface flow inverse problems. Water Resources Research, 51(10):8607–8626, 2015. doi: 10.1002/2014WR016430. [75] J. J. G´ omez-Hern´ andez and J. Fu. Blocking Markov Chain Monte Carlo Schemes for Inverse Stochastic Hydrogeological Modeling, pages 121–126. Springer Netherlands, Dordrecht, 2010. doi: https://doi.org/10.1007/978-90-481-2322-3 11. [76] I. Goodfellow, Y . Bengio, and A. Courville. Deep Learning. MIT Press, 2016. [77] I.J. Goodfellow. Nips 2016 tutorial: Generative adversarial networks. CoRR, 2017. [78] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014. [79] I. Gulrajani, F. Ahmed, M. Arjovsky, V . Dumoulin, and A.C. Courville. Improved training of wasser- stein gans. CoRR, 2017. [80] Z. Guo, C. Chen, G. Gao, and J. Vink. 
Applying support vector regression to reduce the effect of nu- merical noise and enhance the performance of history matching. SPE Annual Technical Conference and Exhibition, 2017. [81] S. Hakim-Elahi and B. Jafarpour. A distance transform for continuous parameterization of discrete geologic facies for subsurface flow model calibration. Water Resources Research, 53(10):8226–8249, 2017. doi: 10.1002/2016WR019853. [82] D. Han, J. Jung, and S. Kwon. Comparative study on supervised learning models for productivity forecasting of shale reservoirs based on a data-driven approach. Applied Sciences, 10(4), 2020. ISSN 2076-3417. URLhttps://www.mdpi.com/2076-3417/10/4/1267. 273 [83] J. Han, A. Jentzen, and E. Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018. doi: 10.1073/ pnas.1718942115. URLhttps://www.pnas.org/doi/abs/10.1073/pnas.1718942115. [84] J. He, P. Sarma, and L. J. Durlofsky. Reduced-order flow modeling and geological parameterization for ensemble-based data assimilation. Computers & Geosciences, 55:54 – 69, 2013. ISSN 0098- 3004. doi: 10.1016/j.cageo.2012.03.027. [85] K. Hector and F. Horacio. Data-Driven Prediction of Unconventional Shale-Reservoir Dynamics. SPE Journal, 25(05):2564–2581, 10 2020. ISSN 1086-055X. doi: 10.2118/193904-PA. [86] H. J. Hendricks Franssen and W. Kinzelbach. Ensemble Kalman filtering versus sequential self- calibration for inverse modelling of dynamic groundwater flow systems. Journal of Hydrology, 365 (3):261–274, 2009. ISSN 0022-1694. doi: https://doi.org/10.1016/j.jhydrol.2008.11.033. [87] H. J. Hendricks Franssen, A. Alcolea, M. Riva, M. Bakr, N. van der Wiel, F. Stauffer, and A. Guadagnini. A comparison of seven methods for the inverse modelling of groundwater flow. application to the characterisation of well catchments. Advances in Water Resources, 32(6):851 – 872, 2009. [88] T. Hermans, F. Nguyen, and J. Caers. Uncertainty in training image-based inversion of hydraulic head data constrained to ERT data: Workflow and case study. Water Resour. Res., 51:5332– 5352, 2015. [89] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735. [90] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2): 251–257, 1991. ISSN 0893-6080. doi: https://doi.org/10.1016/0893-6080(91)90009-T. [91] L.Y . Hu. Extended Probability Perturbation Method for Calibrating Stochastic Reservoir Models. Math Geosci, 40:875–885, 2008. doi: 10.1007/s11004-008-9158-4. [92] A. K. Jaber, S. N. Al-Jawad, and A. K. Alhuraishawy. A review of proxy modeling applica- tions in numerical reservoir simulation. Arab J Geosci, 12, 2019. doi: https://doi.org/10.1007/ s12517-019-4891-1. [93] P. Jacquard. Permeability Distribution From Field Pressure Data. Society of Petroleum Engineers, 1965. [94] B. Jafarpour. Wavelet Reconstruction of Geologic Facies From Nonlinear Dynamic Flow Mea- surements. IEEE Transactions on Geoscience and Remote Sensing, 49(5):1520–1535, 2011. ISSN 1558-0644. doi: 10.1109/TGRS.2010.2089464. [95] B. Jafarpour and M. Khodabakhshi. A probability conditioning method (pcm) for nonlinear flow data integration into multipoint statistical facies simulation. Mathematical Geosciences, 43(2):133–164, 2011. [96] B. Jafarpour and D. B. McLaughlin. History matching with an ensemble kalman filter and discrete cosine parameterization. 
Computational Geosciences, 12(2):227–244, Jun 2008. doi: 10.1007/ s10596-008-9080-3. [97] A. Jahandideh and B. Jafarpour. Optimization of hydraulic fracturing design under spatially variable shale fracability. Journal of Petroleum Science and Engineering, 138:174–188, 2016. ISSN 0920- 4105. doi: https://doi.org/10.1016/j.petrol.2015.11.032. URL https://www.sciencedirect. com/science/article/pii/S0920410515301984. [98] A. Jahandideh, S. Mohd Razak, U. Djuraev, and B. Jafarpour. Efficient Data Assimilation with Latent-Space Representations for Subsurface Flow Systems. In AGU Fall Meeting Abstracts, volume 2020, pages H108–0002, December 2020. [99] H. O. Jahns. A Rapid Method for Obtaining a Two-Dimensional Reservoir Description From Well Pressure Response Data. Society of Petroleum Engineers, 1966. 274 [100] G. James, D. Witten, T. Hastie, and R. Tibshirani. An introduction to statistical learning. Springer, 2013. [101] H. Jeong, A. Y . Sun, J. Lee, and B. Min. A learning-based data-driven forecast approach for pre- dicting future reservoir performance. Advances in Water Resources, 118:95 – 109, 2018. ISSN 0309-1708. doi: 10.1016/j.advwatres.2018.05.015. [102] A. Jiang and B. Jafarpour. Inverting subsurface flow data for geologic scenarios selection with convolutional neural networks. Advances in Water Resources, 149:103840, 2021. ISSN 0309-1708. doi: https://doi.org/10.1016/j.advwatres.2020.103840. URLhttps://www.sciencedirect.com/ science/article/pii/S0309170820311982. [103] A. Jiang and B. Jafarpour. Deep convolutional autoencoders for robust flow model calibration under uncertainty in geologic continuity. Water Resources Research, 57(11):e2021WR029754, 2021. doi: https://doi.org/10.1029/2021WR029754. URL https://agupubs.onlinelibrary.wiley.com/ doi/abs/10.1029/2021WR029754. e2021WR029754 2021WR029754. [104] R. Jiang, D. Stern, T. Halsey, and T. Manzocchi. Scenario Discovery Workflow for Robust Petroleum Reservoir Development under Uncertainty. International Journal for Uncertainty Quantification , 6, 2016. [105] S. Jiang and L. J. Durlofsky. Data-space inversion using a recurrent autoencoder for time-series parameterization. Comput Geosci, 25:411—-432, 2021. doi: 10.1007/s10596-020-10014-1. [106] Z. L. Jin, Y . Liu, and L. J. Durlofsky. Deep-learning-based surrogate model for reservoir simulation with time-varying well controls. Journal of Petroleum Science and Engineering, 192:107273, 2020. ISSN 0920-4105. doi: https://doi.org/10.1016/j.petrol.2020.107273. [107] S. Karla, W. Tian, and X. Wu. A numerical simulation study of co2 injection for enhancing hydro- carbon recovery and sequestration in liquid-rich shales. Petroleum Science, 15:103–115, 2018. doi: 10.1007/s12182-017-0199-5S. [108] A. Karpatne, G. Atluri, J.. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Sam- atova, and V . Kumar. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10):2318–2331, 2017. doi: 10.1109/TKDE.2017.2720168. [109] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training generative adver- sarial networks with limited data. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12104– 12114. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/ file/8d30aa96e72440759f74bd2306c1fa3d-Paper.pdf. [110] M. R. Khaninezhad and B. Jafarpour. 
Prior model identification during subsurface flow data integra- tion with adaptive sparse representation techniques. Computational Geosciences, 18(1):3–16, Feb 2014. [111] M. R. Khaninezhad, B. Jafarpour, and L. Li. Sparse geologic dictionaries for subsurface flow model calibration: Part I. Inversion formulation. Advances in Water Resources, 39:106 – 121, 2012. [112] M. Khodabakhshi and B. Jafarpour. A bayesian mixture-modeling approach for flow-conditioned multiple-point statistical facies simulation from uncertain training images. Water Resources Re- search, 49(1):328–342, 2013. [113] J. Kim, H. Yang, and J. Choe. Robust optimization of the locations and types of multiple wells using cnn based proxy models. Journal of Petroleum Science and Engineering, 2020. [114] S. Kim, B. Min, S. Kwon, and M. Chu. History Matching of a Channelized Reservoir Using a Serial Denoising Autoencoder Integrated with ES-MDA. Geofluids , 2019. doi: 10.1155/2019/3280961. [115] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv.org, 2014. [116] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes, 2013. 275 [117] D. P. Kingma and M. Welling. An introduction to variational autoencoders. CoRR, abs/1906.02691, 2019. URLhttp://arxiv.org/abs/1906.02691. [118] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. Inman. 1d convolutional neural networks and applications: A survey. Mechanical Systems and Signal Processing, 151:107398, 2021. ISSN 0888-3270. doi: https://doi.org/10.1016/j.ymssp.2020.107398. URL https://www. sciencedirect.com/science/article/pii/S0888327020307846. [119] P. K. Kitanidis. Quasi-linear geostatistical theory for inversing. Water Resources Research, 31(10): 2411–2419, 1995. doi: https://doi.org/10.1029/95WR01945. [120] H. Klie, A. Klie, and B. Yan. Data Connectivity Inference and Physics-AI Models for Field Op- timization. SPE/AAPG/SEG Latin America Unconventional Resources Technology Conference, Day 1 Mon, November 16, 2020, 11 2020. doi: 10.15530/urtec-2020-1098. URL https: //doi.org/10.15530/urtec-2020-1098. [121] M. A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233–243, 1991. doi: 10.1002/aic.690370209. URL https://aiche. onlinelibrary.wiley.com/doi/abs/10.1002/aic.690370209. [122] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012. [123] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR 2011, pages 1785–1792, 2011. doi: 10.1109/CVPR.2011. 5995702. [124] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist., 22:79—-86, 1951. doi: https://doi.org/10.1214/aoms/1177729694. [125] I.E. Lagaris, A. Likas, and D.I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998. doi: 10.1109/ 72.712178. [126] J. Lagergren, J. Nardini, G. Lavigne, E. Rutter, and K. Flores. Learning partial differential equations for biological transport models from noisy spatio-temporal data. Proceedings of the Royal Society A, 476, 2020. [127] E. Laloy, R. H´ erault, J. Lee, D. Jacques, and N. Linde. Inversion using a new low-dimensional rep- resentation of complex binary geological media based on a deep neural network. Advances in Water Resources, 110:387–405, Dec 2017. 
ISSN 0309-1708. doi: https://doi.org/10.1016/j.advwatres. 2017.09.029. [128] E. Laloy, R. H´ erault, D. Jacques, and N. Linde. Training-Image Based Geostatistical Inversion Using a Spatial Generative Adversarial Neural Network. Water Resources Research, 54(1):381–406, 2018. doi: 10.1002/2017WR022148. [129] Y . LeCun and Y . Bengio. The Handbook of Brain Theory and Neural Networks. MIT Press, 1998. [130] Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recog- nition. In Proceedings of the IEEE, pages 2278–2324, 1998. [131] Y . LeCun, Y . Bengio, and G. E. Hinton. Deep learning. Nature, 521:436–444, 2015. [132] K. Lee, J. Lim, D. Yoon, and H. Jung. Prediction of Shale-Gas Production at Duvernay Formation Using Deep-Learning Algorithm. SPE Journal, 24(06):2423–2437, 07 2019. ISSN 1086-055X. doi: 10.2118/195698-PA. [133] E. P. Leite and A. C. Vidal. 3D porosity prediction from seismic inversion and neural networks. Computers & Geosciences, 37(8):1174–1180, 2011. 276 [134] H. Li and S. Misra. Prediction of subsurface NMR T2 distributions in a shale petroleum system using variational autoencoder-based neural networks. IEEE Geoscience and Remote Sensing Letters, 14 (12):2395–2397, 2017. [135] H. Li, B. Hart, M. Dawson, and E. Radjef. Characterizing the Middle Bakken: Laboratory Mea- surement and Rock Typing of the Middle Bakken Formation. URTeC, 07 2015. doi: 10.15530/ URTEC-2015-2172485. URL https://doi.org/10.15530/URTEC-2015-2172485. URTEC- 2172485-MS. [136] L. Li, R. Puzel, and A. Davis. Data assimilation in groundwater modelling: ensemble Kalman filter versus ensemble smoothers. Hydrological Processes, 32(13):2020–2029, 2018. doi: https: //doi.org/10.1002/hyp.13127. [137] Y . Li and Y . Han. Decline Curve Analysis for Production Forecasting Based on Machine Learning. SPE Symposium: Production Enhancement and Cost Optimisation, Day 1 Tue, November 07, 2017, 11 2017. URLhttps://doi.org/10.2118/189205-MS. [138] H. Liang, L. Zhang, Y . Zhao, B. Zhang, C. Chang, M. Chen, and M. Bai. Empirical methods of decline-curve analysis for shale gas reservoirs: Review, evaluation, and application. Journal of Natural Gas Science and Engineering, 83:103531, 2020. ISSN 1875-5100. doi: https://doi.org/10. 1016/j.jngse.2020.103531. URL https://www.sciencedirect.com/science/article/pii/ S1875510020303851. [139] H. Liu, M. Boudjatit, M. Basri, and R. Mesdour. Determination Of Hydrocarbon Production Rates For An Unconventional Hydrocarbon Reservoir. US Patent App 17/076,599, 2021. [140] H. Liu, J. Zhang, F. Liang, C. Temizel, M. Basri, and R. Mesdour. Incorporation of Physics into Machine Learning for Production Prediction from Unconventional Reservoirs: A Brief Review of the Gray-Box Approach. SPE Reservoir Evaluation & Engineering, 24(04):847–858, 11 2021. ISSN 1094-6470. doi: 10.2118/205520-PA. URLhttps://doi.org/10.2118/205520-PA. [141] M. Liu and D. Grana. Time-lapse seismic history matching with an iterative ensemble smoother and deep convolutional autoencoder. Geophysics, 85(1):M15–M31, 12 2019. ISSN 0016-8033. doi: 10.1190/geo2019-0019.1. [142] N. Liu and D. S. Oliver. Evaluation of monte carlo methods for assessing uncertainty. Society of Petroleum Engineers, pages 149–162, 6 2003. doi: 10.2118/84936-PA. [143] W. Liu, W. D. Liu, and J. Gu. Forecasting oil production using ensemble empirical model decompo- sition based long short-term memory neural network. Journal of Petroleum Science and Engineering, 189:107013, 2020. ISSN 0920-4105. 
doi: https://doi.org/10.1016/j.petrol.2020.107013. [144] Y . Liu and L. J. Durlofsky. 3d cnn-pca: A deep-learning-based parameterization for complex geo- models. Computers & Geosciences, 148:104676, 2021. ISSN 0098-3004. doi: https://doi.org/10. 1016/j.cageo.2020.104676. [145] R. J. Lorentzen, K. M. Flornes, and G. Nævdal. History matching channelized reservoirs using the ensemble kalman filter. Society of Petroleum Engineers, 17(1):137–151, 2012. doi: 10.2118/ 143188-PA. [146] G. Luo, Y . Tian, M. Bychina, C. Ehlig-Economides, et al. Production-strategy insights using machine learning: Application for bakken shale. SPE Reservoir Evaluation & Engineering, 22(03):800–816, 2019. [147] X. Luo, T. Bhakta, M. Jakobsen, and G. Nævdal. An Ensemble 4D-Seismic History-Matching Framework With Sparse Representation Based On Wavelet Multiresolution Analysis. SPE Journal, 22(03):985–1010, 11 2016. ISSN 1086-055X. doi: 10.2118/180025-PA. [148] W. Ma and B. Jafarpour. Pilot points method for conditioning multiple-point statistical facies simu- lation on flow data. Advances in Water Resources, 115:219–233, 2018. 277 [149] X. Ma and Z. Liu. Predicting the oil production using the novel multivariate nonlinear model based on arps decline model and kernel method. Neural Computing and Applications, 29:579–591, 2018. ISSN 1433-3058. doi: https://doi.org/10.1007/s00521-016-2721-x. [150] R. W. Mannon. Oil Production Forecasting By Decline Curve Analysis. SPE Annual Technical Conference and Exhibition, All Days, 10 1965. doi: 10.2118/1254-MS. URL https://doi.org/ 10.2118/1254-MS. [151] Z. Mao, A. D. Jagtap, and G. E. Karniadakis. Physics-informed neural networks for high-speed flows. Computer Methods in Applied Mechanics and Engineering, 360:112789, 2020. ISSN 0045- 7825. doi: https://doi.org/10.1016/j.cma.2019.112789. [152] C. Maschio and D. S. Schiozer. Bayesian history matching using artificial neural network and Markov Chain Monte Carlo. Journal of Petroleum Science and Engineering, 123:62–71, 2014. [153] B. Min, C. Park, J. Kang, H. Park, and I. S. Jang. Optimal Well Placement Based on Artificial Neural Network Incorporating the Productivity Potential. Energy Sources Part A Recovery Utilization and Environmental Effects, 33:1726–1738, 07 2011. doi: 10.1080/15567030903468569. [154] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014. URLhttp://arxiv.org/abs/1411.1784. [155] D. Mishkin, N. Sergievskiy, and J. Matas. Systematic evaluation of CNN advances on the ImageNet. CoRR, 2016. [156] S. Mo, N. Zabaras, X. Shi, and J. Wu. Deep autoregressive neural networks for high-dimensional inverse problems in groundwater contaminant source identification. Water Resources Research, 55 (5):3856–3881, 2019. doi: https://doi.org/10.1029/2018WR024638. [157] S. Mo, Y . Zhu, N. Zabaras, X. Shi, and J. Wu. Deep convolutional encoder-decoder networks for uncertainty quantification of dynamic multiphase flow in heterogeneous media. Water Resources Research, 55(1):703–728, 2019. doi: https://doi.org/10.1029/2018WR023528. [158] S. Mo, N. Zabaras, X. Shi, and J. Wu. Integration of adversarial autoencoders with residual dense convolutional networks for estimation of non-gaussian hydraulic conductivities. Water Resources Research, 56(2), 2020. doi: https://doi.org/10.1029/2019WR026082. [159] S. D. Mohaghegh. Reservoir simulation and modeling based on artificial intelligence and data mining (AI&DM). Journal of Natural Gas Science and Engineering, 3(6):697–705, 2011. [160] S. D. Mohaghegh. 
Shale Analytics: Data-Driven Analytics in Unconventional Resources. Springer, 2017. [161] S. Mohd Razak and B. Jafarpour. Generative Adversarial Networks for Calibration and Uncertainty Quantification of Complex Subsurface Flow Models. In AGU Fall Meeting Abstracts, volume 2019, pages H43F–2036, December 2019. [162] S. Mohd Razak and B. Jafarpour. Supervised machine-learning for history matching: Learning the inverse mapping from low-rank data and model representations. EAGE Annual 2019, 2019(1):1–5, 2019. ISSN 2214-4609. doi: https://doi.org/10.3997/2214-4609.201900950. [163] S. Mohd Razak and B. Jafarpour. Convolutional neural networks (cnn) for feature-based model calibration under uncertain geologic scenarios. Computational Geosciences, 24:1625–1649, 2020. ISSN 1573-1499. doi: 10.1007/s10596-020-09971-4. [164] S. Mohd Razak and B. Jafarpour. History matching with generative adversarial networks. In ECMOR XVII, volume 2020, pages 1–17. European Association of Geoscientists & Engineers, 2020. [165] S. Mohd Razak and B. Jafarpour. Latent-Space Inversion (LSI) for Subsurface Flow Model Cali- bration with Physics-Informed Autoencoding. In AGU Fall Meeting Abstracts, volume 2020, pages H052–08, December 2020. 278 [166] S. Mohd Razak and B. Jafarpour. Rapid Production Forecasting with Geologically-Informed Auto- Regressive Models: Application to V olve Benchmark Model. In SPE ATCE 2020, volume Day 4 Thu, October 29, 2020 of SPE Annual Technical Conference and Exhibition, 10 2020. doi: 10.2118/ 201356-MS. [167] S. Mohd Razak and B. Jafarpour. Conditioning generative adversarial networks on nonlinear data for subsurface flow model calibration and uncertainty quantification. Computational Geosciences, 2021. ISSN 1573-1499. doi: 10.1007/s10596-021-10112-8. [168] S. Mohd Razak, J. Cornelio, Y . Cho, H. Liu, R. Vaidya, and B. Jafarpour. Transfer Learning with Recurrent Neural Networks for Long-term Production Forecasting in Unconventional Reservoirs. SPE/AAPG/SEG Unconventional Resources Technology Conference, Day 1 Mon, July 26, 2021, 07 2021. doi: 10.15530/urtec-2021-5687. URLhttps://doi.org/10.15530/urtec-2021-5687. [169] S. Mohd Razak, J. Cornelio, A. Jahandideh, B. Jafarpour, Y . Cho, H. Liu, and R. Vaidya. Integrating Deep Learning and Physics-Based Models for Improved Production Prediction in Unconventional Reservoirs. SPE Middle East Oil and Gas Show and Conference, Day 3 Tue, November 30, 2021, 11 2021. doi: 10.2118/204864-MS. URLhttps://doi.org/10.2118/204864-MS. [170] S. Mohd Razak, A. Jahandideh, U. Djuraev, and B. Jafarpour. Deep Learning for Latent Space Data Assimilation LSDA in Subsurface Flow Systems. In SPE RSC 2021, volume Day 1 Tue, October 26, 2021 of SPE Reservoir Simulation Conference, 10 2021. doi: 10.2118/203997-MS. URL https: //doi.org/10.2118/203997-MS. [171] S. Mohd Razak, A. Jiang, and B. Jafarpour. Latent-space inversion (lsi): a deep learning framework for inverse mapping of subsurface flow data. Computational Geosciences, 2021. ISSN 1573-1499. doi: 10.1007/s10596-021-10104-8. [172] S. Mohd Razak, J. Cornelio, Y . Cho, H. Liu, R. Vaidya, and B. Jafarpour. Transfer Learning with Recurrent Neural Networks for Long-Term Production Forecasting in Unconventional Reservoirs. SPE Journal, 27(04):2425–2442, 08 2022. ISSN 1086-055X. doi: 10.2118/209594-PA. URL https://doi.org/10.2118/209594-PA. [173] S. Mohd Razak, J. Cornelio, Y . Cho, H. Liu, R. Vaidya, and B. Jafarpour. 
Embedding Phys- ical Flow Functions into Deep Learning Predictive Models for Improved Production Forecast- ing. SPE/AAPG/SEG Unconventional Resources Technology Conference, Day 2 Tue, June 21, 2022, 06 2022. doi: 10.15530/urtec-2022-3702606. URL https://doi.org/10.15530/ urtec-2022-3702606. [174] S. Mohd Razak, J. Cornelio, Y . Cho, H. Liu, R. Vaidya, and B. Jafarpour. Physics-Guided Deep Learning (PGDL) for Improved Production Forecasting in Unconventional Reservoirs. SPE Journal, 2023. [175] S. Mohd Razak, J. Cornelio, Y . Cho, H. Liu, R. Vaidya, and B. Jafarpour. A Dynamic Residual Learning Approach to Improve Physics-Constrained Neural Network Predictions in Unconventional Reservoirs. SPE Middle East Oil and Gas Show and Conference, Day 2 Mon, February 20, 2023, 02 2023. doi: 10.2118/213289-MS. URLhttps://doi.org/10.2118/213289-MS. [176] L. Mosser, O. Dubrule, and M.J. Blunt. Reconstruction of three-dimensional porous media using generative adversarial neural networks. CoRR, 2017. [177] L. Mosser, O. Dubrule, and M.J. Blunt. Stochastic reconstruction of an oolitic limestone by genera- tive adversarial networks. Transport in Porous Media, 125(1):81–103, Oct 2018. [178] L. Mosser, O. Dubrule, and M.J. Blunt. Stochastic seismic waveform inversion using generative adversarial networks as a geological prior. First EAGE/PESGB Workshop Machine Learning, 2018. [179] V . Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 807–814, USA, 2010. Omnipress. 279 [180] H. Narayanan and S. Mitter. Sample Complexity of Testing the Manifold Hypothesis. Advances in Neural Information Processing Systems 23, pages 1786–1794, 2010. [181] P. H. Nelson. Permeability-porosity Relationships In Sedimentary Rocks. The Log Analyst - Society of Petrophysicists and Well-Log Analysts, pages 38 – 62, 1994. [182] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. CoRR, 2016. [183] E. S. Olivas, J. D. M. Guerrero, M. M. Sober, J. R. M. Benedito, and A. J. S. Lopez. Handbook Of Research On Machine Learning Applications and Trends: Algorithms, Methods and Techniques - 2 Volumes. Information Science Reference - Imprint of: IGI Publishing, Hershey, PA, 2009. ISBN 1605667668. [184] D. S. Oliver and Y . Chen. Recent progress on reservoir history matching: a review. Computational Geosciences, 15(1):185–221, 2011. [185] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010. doi: 10.1109/TKDE.2009.191. [186] Y . Pan, L. Deng, P. Zhou, and W. J. Lee. Laplacian echo-state networks for production analysis and forecasting in unconventional reservoirs. Journal of Petroleum Science and Engineering, 207: 109068, 2021. ISSN 0920-4105. doi: https://doi.org/10.1016/j.petrol.2021.109068. URL https: //www.sciencedirect.com/science/article/pii/S0920410521007257. [187] H. Park, C. Scheidt, D. Fenwick, A. Boucher, and J. Caers. History matching and uncertainty quan- tification of facies models with multiple geological interpretations. Computational Geosciences, 17 (4):609–621, 2013. [188] M. Peyron, A. Fillion, S. G¨ urol, V . Marchais, S. Gratton, P. Boudier, and G. Goret. Latent space data assimilation by using deep learning. Quarterly Journal of the Royal Meteorological Society, 2021. doi: https://doi.org/10.1002/qj.4153. [189] J. Ping and D. Zhang. 
History matching of fracture distributions by ensemble Kalman filter combined with vector based level set parameterization. Journal of Petroleum Science and Engineering, 108: 288 – 303, 2013. ISSN 0920-4105. doi: 10.1016/j.petrol.2013.04.018. [190] G. Pirot, P. Renard, E. Huber, J. Straubhaar, and P. Huggenberger. Influence of conceptual model uncertainty on contaminant transport forecasting in braided river aquifers. Journal of Hydrology, 531:124 – 141, 2015. ISSN 0022-1694. doi: 10.1016/j.jhydrol.2015.07.036. [191] G. Pirot, E. Huber, J. Irving, and N. Linde. Reduction of conceptual model uncertainty using ground- penetrating radar profiles: Field-demonstration for a braided-river aquifer. Journal of Hydrology, 571:254 – 264, 2019. ISSN 0022-1694. doi: 10.1016/j.jhydrol.2019.01.047. [192] J. K. Pitman, L. C. Price, and J. A. LeFever. Diagenesis and fracture development in the bakken formation, williston basin: Implications for reservoir quality in the middle member. U.S. Geological Survey Professional Paper 1653, 2001. [193] F. Pontes, G. de Amorim, P. Balestrassi, A. de Paiva, and J. Ferreira. Design of experiments and focused grid search for neural network parameter optimization. Neurocomputing, 186:22– 34, 2016. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2015.12.061. URL https: //www.sciencedirect.com/science/article/pii/S0925231215020184. [194] H. W. Posamentier and G. P. Allen. Overview: Siliciclastic Sequence Stratigraphy, Concepts and Applications. SEPM Society for Sedimentary Geology, 1999. [195] A. Pradhan and T. Mukerji. Seismic Bayesian evidential learning: estimation and uncertainty quan- tification of sub-resolution reservoir properties. Computational Geosciences, 24(1573-1499), 2020. doi: https://doi.org/10.1007/s10596-019-09929-1. [196] M. J. Pyrcz and C. V . Deutsch. Geostatistical Reservoir Modeling. Oxford University Press, 2014. 280 [197] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learn- ing framework for solving forward and inverse problems involving nonlinear partial differen- tial equations. Journal of Computational Physics, 378:686–707, 2019. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.10.045. URL https://www.sciencedirect.com/science/ article/pii/S0021999118307125. [198] S. Ramiro-Ramirez. Petrographic and petrophysical characterization of the Eagle Ford Shale in La Salle and Gonzales counties, Gulf Coast Region, Texas. PhD thesis, Colorado School of Mines, 05 2016. [199] B. Ramsundar and R. B. Zadeh. TensorFlow for deep learning: from linear regression to reinforce- ment learning. O’Reilly Media, Inc., 2018. [200] C. H. Randle, C. E. Bond, R. M. Lark, and A. A. Monaghan. Uncertainty in geological interpreta- tions: Effectiveness of expert elicitations. Geosphere, 15(1):108–118, 2019. [201] A. Razavi, A. van den Oord, and O. Vinyals. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, pages 14837–14847, 2019. [202] J. S. Read, X. Jia, J. Willard, A. P. Appling, J. A. Zwart, S. K. Oliver, A. Karpatne, G. J. A. Hansen, P. C. Hanson, W. Watkins, M. Steinbach, and V . Kumar. Process-guided deep learning predictions of lake water temperature. Water Resources Research, 55(11):9173–9190, 2019. doi: https://doi.org/ 10.1029/2019WR024922. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10. 1029/2019WR024922. [203] S.E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. 
CoRR, abs/1605.05396, 2016. URLhttp://arxiv.org/abs/1605.05396. [204] R. H. Reichle, D. B. McLaughlin, and D. Entekhabi. Hydrologic data assimilation with the ensemble kalman filter. Monthly Weather Review, 130(1):103–114, 2002. [205] M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, and Prabhat. Deep learning and process understanding for data-driven earth system science. Nature, 566:195–204, 2019. [206] A. C. Reynolds, N. He, and D. S. Oliver. Reducing uncertainty in geostatistical description with well testing pressure data. Reservoir Characterization Recent Advances, American Association of Petroleum Geologists, pages 149–162, 1999. [207] A. C. Reynolds, M. Zafari, and G. Li. Iterative forms of the ensemble kalman filter. ECMOR X-10th European conference on the mathematics of oil recovery, 2006. [208] G. Roth and A. Tarantola. Neural networks and inversion of seismic data. Journal of Geophysical Research: Solid Earth, 99(B4):6753–6768, 1994. [209] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the As- sociation for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-5004. URL https: //www.aclweb.org/anthology/N19-5004. [210] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019. URL https:// doi.org/10.1038/s42256-019-0048-x. [211] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986. [212] A. Sagheer and M. Kotb. Time series forecasting of petroleum production using deep lstm recurrent networks. Neurocomputing, 323:203–213, 2019. ISSN 0925-2312. doi: https://doi.org/10.1016/j. neucom.2018.09.082. 281 [213] T. Salimans, I.J. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and Xi Chen. Improved tech- niques for training gans. CoRR, 2016. URLhttp://arxiv.org/abs/1606.03498. [214] O. San and R. Maulik. Neural network closures for nonlinear model order reduction. Advances in Computational Mathematics, 44:1717–1750, 2018. [215] J. E. Santos, D. Xu, H. Jo, C. J. Landry, M. Prodanovi´ c, and M. J. Pyrcz. Poreflow-net: A 3d con- volutional neural network to predict fluid flow through porous media. Advances in Water Resources, 138:103539, 2020. [216] W. Saputra, W. Kirati, and T. Patzek. Physical scaling of oil production rates and ultimate recovery from all horizontal wells in the bakken shale. Energies, 13(8), 2020. ISSN 1996-1073. doi: 10.3390/ en13082052. URLhttps://www.mdpi.com/1996-1073/13/8/2052. [217] J. F. Sarg. The bakken - an unconventional petroleum and reservoir system. Colorado School of Mines, 2011. URLhttps://www.osti.gov/servlets/purl/1084030. [218] P. Sarma, L. J. Durlofsky, and K. Aziz. Kernel principal component analysis for efficient, differ- entiable parameterization of multipoint geostatistics. Mathematical Geosciences, 40(1):3–32, Jan 2008. doi: 10.1007/s11004-007-9131-7. [219] L. K. Saul and S. T. Roweis. An introduction to locally linear embedding. Technical report, AT&T Labs – Research, 2000. [220] C. Scheidt, C. Jeong, T. Mukerji, and J. Caers. Probabilistic falsification of prior geologic uncertainty with seismic amplitude data: Application to a turbidite reservoir case. GEOPHYSICS, 80:89–100, 2015. [221] Schlumberger. 
Eclipse e100 industry-reference reservoir simulator, 2014. URL https://www. software.slb.com/products/eclipse. [222] Schlumberger. Petrel e&p software platform, 2016. URL https://www.software.slb.com/ products/petrel. [223] M. Scholz, M. Fraunholz, and J. Selbig. Nonlinear Principal Component Analysis: Neural Network Models and Applications. In Principal Manifolds for Data Visualization and Dimension Reduction, pages 44–67, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. [224] F. Schrodt, J. Kattge, H. Shan, F. Fazayeli, J. Joswig, A. Banerjee, M. Reichstein, G. B¨ onisch, S. D´ ıaz, J. Dickie, A. Gillison, A. Karpatne, S. Lavorel, P. Leadley, C. Wirth, I. Wright, S. Wright, and P. Reich. Bhpmf – a hierarchical bayesian approach to gap-filling and trait prediction for macroe- cology and functional biogeography. Global Ecology and Biogeography, 24(12):1510–1521, 2015. doi: https://doi.org/10.1111/geb.12335. URL https://onlinelibrary.wiley.com/doi/abs/ 10.1111/geb.12335. [225] J. Schuetter, S. Mishra, M. Zhong, and R. LaFollette. A Data-Analytics Tutorial: Building Predictive Models for Oil Production in an Unconventional Shale Reservoir. Society of Petroleum Engineers, 2018. [226] B. Sch¨ olkopf, A. Smola, and K. R. M¨ uller. Kernel principal component analysis. In W. Gerstner, A. Germond, M. Hasler, and J. D. Nicoud, editors, Artificial Neural Networks — ICANN’97 , volume 1327, chapter 10. Springer, Berlin, Heidelberg, 1997. [227] B. Sebacher, A. S. Stordal, and R. Hanea. Bridging multipoint statistics and truncated Gaussian fields for improved estimation of channelized reservoirs with ensemble methods. Comput Geosci, 19:341–369, 2015. doi: 10.1007/s10596-014-9466-3. [228] C. Shen. A trans-disciplinary review of deep learning research for water resources scientists. Water Resources Research, 12 2017. doi: https://doi.org/10.1029/2018WR022643. [229] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000. ISSN 0378-3758. doi: https://doi.org/10.1016/S0378-3758(00)00115-4. 282 [230] J. Sirignano and K. Spiliopoulos. Dgm: A deep learning algorithm for solving partial differen- tial equations. Journal of Computational Physics, 375:1339–1364, 2018. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.08.029. URL https://www.sciencedirect.com/science/ article/pii/S0021999118305527. [231] R. V . Soares, X. Luo, G. Evensen, and T. Bhakta. 4d seismic history matching: Assessing the use of a dictionary learning based sparse representation method. Journal of Petroleum Science and Engi- neering, 195:107763, 2020. ISSN 0920-4105. doi: https://doi.org/10.1016/j.petrol.2020.107763. [232] C.K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Husz´ ar. Amortised MAP inference for image super-resolution. CoRR, abs/1610.04490, 2016. URLhttp://arxiv.org/abs/1610.04490. [233] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15: 1929–1958, 2014. [234] J. Su, J. Zeng, D. Xiong, Y . Liu, M. Wang, and J. Xie. A hierarchy-to-sequence attentional neural machine translation model. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 26(3):623–632, mar 2018. ISSN 2329-9290. doi: 10.1109/TASLP.2018.2789721. URL https://doi.org/10.1109/ TASLP.2018.2789721. [235] J. Sun, Z. Niu, K. Innanen, J. Li, and D. Trad. 
A theory-guided deep-learning formulation and optimization of seismic waveform inversion. Geophysics, 85(2):R87–R99, 01 2020. ISSN 0016- 8033. doi: 10.1190/geo2019-0138.1. URLhttps://doi.org/10.1190/geo2019-0138.1. [236] L. Sun, H. Gao, S. Pan, and J. Wang. Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data. Computer Methods in Applied Mechanics and Engineering, 361:112732, 2020. ISSN 0045-7825. doi: https://doi.org/10.1016/j.cma.2019.112732. [237] W. Sun and L. J. Durlofsky. A new data-space inversion procedure for efficient uncertainty quantification in subsurface flow problems. Math Geosci, 49:679—-715, 2017. doi: 10.1007/ s11004-016-9672-8. [238] I. Sutskever, O. Vinyals, and Q.V . Le. Sequence to sequence learning with neural networks. In Pro- ceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, page 3104–3112, Cambridge, MA, USA, 2014. MIT Press. [239] C. Szegedy, V . Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. CoRR, 2015. [240] M. Tang, Y . Liu, and L. J. Durlofsky. A deep-learning-based surrogate model for data assimilation in dynamic subsurface flow problems. Journal of Computational Physics, 413:109456, 2020. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2020.109456. [241] M. Tang, Y . Liu, and L. J. Durlofsky. Deep-learning-based surrogate flow modeling and geological parameterization for data assimilation in 3d subsurface flow. Computer Methods in Applied Mechan- ics and Engineering, 376:113636, 2021. ISSN 0045-7825. doi: https://doi.org/10.1016/j.cma.2020. 113636. [242] A. Tarantola. Inverse problem theory and methods for model parameter estimation. SIAM, 2005. [243] A. Tarantola. Popper, Bayes and the inverse problem. Nature Physics, 2:492–494, 2006. [244] Z. Tavassoli, J. N. Carter, and P. R. King. Errors in History Matching. SPE Journal, 9(03):352– 361, 09 2004. ISSN 1086-055X. doi: 10.2118/86883-PA. URL https://doi.org/10.2118/ 86883-PA. [245] L. Theis, W. Shi, A. Cunningham, and F. Husz´ ar. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017. [246] C. Theloy and S. A. Sonnenberg. Integrating Geology and Engineering: Implications for Production in the Bakken Play, Williston Basin. URTeC, 08 2013. doi: 10.1190/urtec2013-100. URL https: //doi.org/10.1190/urtec2013-100. URTEC-1596247-MS. 283 [247] M. L. Thompson and M. A. Kramer. Modeling chemical processes using prior knowledge and neural networks. AIChE Journal, 40(8):1328–1340, 1994. doi: https://doi.org/10.1002/aic.690400806. URLhttps://aiche.onlinelibrary.wiley.com/doi/abs/10.1002/aic.690400806. [248] C. Tian and R. N. Horne. Applying Machine-Learning Techniques To Interpret Flow-Rate, Pres- sure, and Temperature Data From Permanent Downhole Gauges. SPE Reservoir Evaluation & Engineering, 22(02):386–401, 01 2019. ISSN 1094-6470. doi: 10.2118/174034-PA. URL https://doi.org/10.2118/174034-PA. [249] V . Todaro, M. D’Oria, M. G. Tandam, and J. J. G´ omez-Hern´ andez. Ensemble smoother with multiple data assimilation for reverse flow routing. Computers & Geosciences, 131:32–40, 2019. ISSN 0098- 3004. doi: https://doi.org/10.1016/j.cageo.2019.06.002. [250] J. A. Trangenstein and J. B. Bell. Mathematical structure of the black-oil model for petroleum reservoir simulation. SIAM Journal on Applied Mathematics, 49:749–783, 1989. [251] Department of Interior USGS. Selected physical properties of the bakken formation. U.S. 
Geological Survey, 1985. [252] M. B. Valentin, C. R. Bom, A. L. Martins Compan, M. D. Correia, C. Menezes de Jesus, A. de Lima Souza, M. P. de Albuquerque, M. P. de Albuquerque, and E. L. Faria. Estimation of permeability and effective porosity logs using deep autoencoders in borehole image logs from the brazilian pre-salt carbonate. Journal of Petroleum Science and Engineering, 170:315–330, 2018. ISSN 0920-4105. doi: https://doi.org/10.1016/j.petrol.2018.06.038. [253] L. Van Der Maaten, E. Postma, and J. Van den Herik. Dimensionality reduction: A Comparative Review. J Mach Learn Res, 10:66–71, 2009. [254] P. J. van Leeuwen and G. Evensen. Data Assimilation and Inverse Methods in Terms of a Probabilistic Formulation. Monthly Weather Review, 124(12):2898–2913, 1996. [255] J. C. Van Wagoner, H. W. Posamentier, R. M. Mitchum, P. R. Vail, J. F. Sarg, T. S. Loutit, and J. Hardenbol. An Overview of the Fundamentals of Sequence Stratigraphy and Key Definitions. In Sea-Level Changes: An Integrated Approach. SEPM Society for Sedimentary Geology, 1988. [256] P. Vincent, H. Larochelle, I. Lajoie, Y . Bengio, and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of ma- chine learning research, 11:3371–3408, 2010. [257] H. X. V o and L. J. Durlofsky. Data assimilation and uncertainty assessment for complex geological models using a new PCA-based parameterization. Computational Geosciences, 19(4):747–767, Aug 2015. [258] A. Vyas, A. Datta-Gupta, and S. Mishra. Modeling Early Time Rate Decline in Unconventional Reservoirs Using Machine Learning Techniques. Abu Dhabi International Petroleum Exhibition and Conference, Day 4 Thu, November 16, 2017, 11 2017. doi: 10.2118/188231-MS. URL https: //doi.org/10.2118/188231-MS. [259] L. Wang, Q. Zhou, and S. Jin. Physics-guided deep learning for power system state estimation. Journal of Modern Power Systems and Clean Energy, 8(4):607–615, 2020. doi: 10.35833/MPCE. 2019.000565. [260] N. Wang, H. Chang, and D. Zhang. Deep-learning-based inverse modeling approaches: A subsurface flow example. Journal of Geophysical Research: Solid Earth, 126(2), 2021. doi: https://doi.org/10. 1029/2020JB020549. [261] Y . Wang, H. Yao, and S. Zhao. Auto-Encoder Based Dimensionality Reduction. Neurocomput., 184 (C):232–242, April 2016. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2015.08.104. [262] Z. Wang and A. C. Bovik. Mean squared error: Love it or leave it? a new look at signal fidelity measures. IEEE Signal Processing Magazine, 26(1):98–117, 2009. doi: 10.1109/MSP.2008.930649. 284 [263] K. Weiss, T. M. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3: 1056 – 1068, 2016. ISSN 2196-1115. doi: 10.1186/s40537-016-0043-6. [264] S. Wiewel, M. Becher, and N. Thuerey. Latent space physics: Towards learning the temporal evolu- tion of fluid flow. Computer Graphics Forum, 38(2):71–82, 2019. doi: https://doi.org/10.1111/cgf. 13620. [265] J. Willard, X. Jia, S. Xu, M. Steinbach, and V . Kumar. Integrating scientific knowledge with machine learning for engineering and environmental systems. CoRR, 2020. URL https://arxiv.org/ abs/2003.04919. [266] Z. Xi and E. Morgan. Combining Decline-Curve Analysis and Geostatistics To Forecast Gas Produc- tion in the Marcellus Shale. SPE Reservoir Evaluation & Engineering, 22(04):1562–1574, 05 2019. ISSN 1094-6470. doi: 10.2118/197055-PA. URLhttps://doi.org/10.2118/197055-PA. [267] Y . Xiong and R. Zuo. 
Recognition of geochemical anomalies using a deep autoencoder network. Comput. Geosci., 86:75–82, 2016.
[268] L. Yang, S. Tian, L. Yu, F. Ye, J. Qian, and Y. Qian. Deep learning for extracting water body from landsat imagery. International Journal of Innovative Computing, Information and Control, 11:1913–1929, 01 2015.
[269] F. Yin, X. Xue, C. Zhang, K. Zhang, J. Han, B. Liu, J. Wang, and J. Yao. Multifidelity genetic transfer: An efficient framework for production optimization. SPE Journal, 3:1–22, 2021. ISSN 1086-055X. doi: 10.2118/205013-PA.
[270] J. Yu, A. Jahandideh, S. Hakim-Elahi, and B. Jafarpour. Sparse Neural Networks for Inference of Interwell Connectivity and Production Prediction. SPE Journal, 26(06):4067–4088, 12 2021. ISSN 1086-055X. doi: 10.2118/205498-PA. URL https://doi.org/10.2118/205498-PA.
[271] T. Zhang, P. Tilke, E. Dupont, L. Zhu, L. Liang, and W. Bailey. Generating geologically realistic 3d reservoir facies models using deep learning of sedimentary architecture with generative adversarial networks. Petroleum Science, 16(3):541–549, Jun 2019. ISSN 1995-8226. doi: 10.1007/s12182-019-0328-4.
[272] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. CoRR, 2016.
[273] L. Zhao, Z. Li, Z. Wang, B. Caswell, J. Ouyang, and G. E. Karniadakis. Active- and transfer-learning applied to microscale-macroscale coupling to simulate viscoelastic flows. Journal of Computational Physics, 427:110069, 2021. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2020.110069. URL https://www.sciencedirect.com/science/article/pii/S0021999120308433.
[274] Z. Zhong, A. Y. Sun, Y. Wang, and B. Ren. Predicting field production rates for waterflooding using a machine learning-based proxy model. Journal of Petroleum Science and Engineering, 194:107574, 2020.
[275] H. Zhou, L. Li, and J. J. Gómez-Hernández. Characterizing Curvilinear Features Using the Localized Normal-Score Ensemble Kalman Filter. Abstract and Applied Analysis, 2012, 2012. doi: 10.1155/2012/805707.
[276] Y. Zhu and N. Zabaras. Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics, 366:415–447, 2018. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.04.018.
[277] Y. Zhu, N. Zabaras, P. Koutsourelakis, and P. Perdikaris. Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics, 394:56–81, 2019. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2019.05.024.
[278] N. Zobeiry and K. Humfeld. A physics-informed machine learning approach for solving heat transfer equation in advanced manufacturing and engineering applications. Engineering Applications of Artificial Intelligence, 101:104232, 2021. ISSN 0952-1976. doi: https://doi.org/10.1016/j.engappai.2021.104232. URL https://www.sciencedirect.com/science/article/pii/S0952197621000798.

Appendices

A Description of the CNN Architectures

A.1 Network Architecture

The CNN implementations are based on the open-source machine learning framework Keras (version 2.2.4) with the Tensorflow library as the backend (version 1.15). Table 8.1 lists the actual Keras functions and hyperparameters used for Example 1 and 2 in Section 2.3. For Example 3 and 4, simply modify the sizes of the input and output according to the general formulation provided below.
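To make the tabulated settings easier to follow, a minimal Keras sketch of the classification network f_θ is included below, assembled from the entries of Table 8.1. The input shape and the number of geologic-scenario classes used here are placeholders (the actual input and output dimensions depend on the example in Section 2.3), the softmax output activation is implied by the categorical cross-entropy loss rather than stated in the table, and the regression branch f_ψ (mean-squared-error loss with 10000 output units) is omitted for brevity. The sketch is illustrative and is not the exact implementation used in Chapter 2.

# Minimal sketch of the classification branch f_theta from Table 8.1.
# The input shape (100 x 40 x 1) and the number of scenario classes (5)
# are placeholder assumptions for illustration only.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense
from keras.optimizers import Adam

def build_f_theta(input_shape=(100, 40, 1), n_scenarios=5):
    model = Sequential()
    model.add(Conv2D(8, kernel_size=(5, 2), activation='relu',
                     input_shape=input_shape))            # Conv 1
    model.add(MaxPooling2D(pool_size=2, padding='same'))  # Pooling
    model.add(Conv2D(16, kernel_size=(5, 2), activation='relu'))  # Conv 2
    model.add(MaxPooling2D(pool_size=2, padding='same'))  # Pooling
    model.add(Flatten())
    model.add(Dropout(rate=0.2))                          # Dropout, rate = 0.2
    model.add(Dense(n_scenarios, activation='softmax'))   # Dense, units = 5
    model.compile(optimizer=Adam(lr=0.001, beta_1=0.9, beta_2=0.9),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

Calling build_f_theta().summary() prints the layer dimensions that result from the assumed input size.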
B Description of the LSI Architectures

B.1 Network Architecture

A complete description of the architectures used in our study is provided in this section. The networks are implemented with the open-source deep learning library Keras (version 2.2.4) [45]. For the 2D examples used in section 3.3.1 and 3.3.2, a detailed description of each layer, comprising the size of input and output as well as the number of weights within each component, is given in Figure 8.1. The actual Keras functions used and associated parameters are tabulated in Table 8.2 (for the model autoencoder), Table 8.3 (for the data autoencoder) and Table 8.4 (for the regression model).

Table 8.1: Keras functions corresponding to Figure 2.2 and parameters used in this study (for Example 1 and 2)
Shorthand notation | Keras function (version 2.2.4) | Parameters
Conv 1/2 | keras.layers.Conv2D | filters = 8/16, kernel_size = 5 × 2, activation = 'relu'
Pooling | keras.layers.MaxPooling2D | pool_size = 2, padding = 'same'
Dense | keras.layers.Dense; keras.layers.Dropout | units = 5/10000; rate = 0.2
L(f_θ(·), ·) | keras.losses.categorical_crossentropy |
L(f_ψ(·), ·) | keras.losses.mean_squared_error |
Optimizer | keras.optimizers.Adam | α = 0.001, β_1 = 0.9, β_2 = 0.9

Table 8.2: Keras functions and parameters for model autoencoder Enc_θ, Dec_θ in Figure 3.2 and Figure 8.1.
Shorthand notation | Keras function (version 2.2.4) | Parameters
conv2D | keras.layers.Conv2D | (16, 3 × 3); (8, 4 × 4); (4, 5 × 5); (4, 5 × 5); (8, 4 × 4); (16, 3 × 3); (1, 3 × 3) as (filters, kernel_size) pairs in the order of appearance, padding='same'
upsample | keras.layers.UpSampling2D | size=(2, 2)
lrelu | keras.layers.LeakyReLU | alpha=0.3
pool | keras.layers.MaxPooling2D | pool_size=(2, 2), padding='same'
reshape | keras.layers.Reshape | target_shape=(676); target_shape=(13, 13, 4)
dense | keras.layers.Dense | units=64; units=676
L(θ) | keras.optimizers.Adam | lr=1e-3

Figure 8.1: Weight distribution and dimension of input and output of encoders, decoders and regressor for examples in section 3.3.1 and 3.3.2.

Table 8.3: Keras functions and parameters for data autoencoder Enc_ψ, Dec_ψ in Figure 3.3 and Figure 8.1.
Shorthand notation | Keras function (version 2.2.4) | Parameters
conv1D | keras.layers.Conv1D | (16, 3); (8, 6); (8, 6); (16, 3); (20, 4) as (filters, kernel_size) pairs in the order of appearance, padding='same'
upsample | keras.layers.UpSampling1D | size=2
lrelu | keras.layers.LeakyReLU | alpha=0.3
pool | keras.layers.MaxPooling1D | pool_size=2, padding='same'
reshape | keras.layers.Reshape | target_shape=(56); target_shape=(8, 8)
dense | keras.layers.Dense | units=10; units=64
L(ψ) | keras.optimizers.Adam | lr=1e-3

Table 8.4: Keras functions and parameters for regressor Reg^dm_γ in Figure 3.4 and Figure 8.1.
Shorthand notation | Keras function (version 2.2.4) | Parameters
lrelu | keras.layers.LeakyReLU | alpha=0.3
dense | keras.layers.Dense | units=16; units=32; units=64 for Reg^dm_γ(·)
L(γ) | keras.optimizers.Adam | lr=1e-3
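To illustrate how the entries of Table 8.2 assemble into the model autoencoder Enc_θ / Dec_θ, a minimal Keras sketch is given below. The 104 × 104 × 1 input is an assumed size, chosen only because three pooling stages then yield the 13 × 13 × 4 bottleneck feature map (676 values) listed in the reshape and dense rows; the actual model dimensions for the examples in section 3.3.1 and 3.3.2 are shown in Figure 8.1. The standalone mean-squared-error compilation is also a simplification, since the full LSI workflow additionally trains the regressor of Table 8.4 that links the model and data latent spaces.

# Minimal sketch of the model autoencoder Enc_theta / Dec_theta in Table 8.2.
# The 104 x 104 x 1 input is an assumed grid size that is consistent with the
# 13 x 13 x 4 bottleneck (676 values) listed in the table.
from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, UpSampling2D,
                          LeakyReLU, Reshape, Dense)
from keras.optimizers import Adam

def conv_block(x, filters, kernel):
    # conv2D followed by lrelu, as listed in Table 8.2
    x = Conv2D(filters, kernel, padding='same')(x)
    return LeakyReLU(alpha=0.3)(x)

# encoder Enc_theta: model realization -> 64-dimensional latent code
m_in = Input(shape=(104, 104, 1))
x = conv_block(m_in, 16, (3, 3))
x = MaxPooling2D(pool_size=(2, 2), padding='same')(x)   # 52 x 52
x = conv_block(x, 8, (4, 4))
x = MaxPooling2D(pool_size=(2, 2), padding='same')(x)   # 26 x 26
x = conv_block(x, 4, (5, 5))
x = MaxPooling2D(pool_size=(2, 2), padding='same')(x)   # 13 x 13 x 4
x = Reshape((676,))(x)
z_m = Dense(64)(x)                                        # latent model code

# decoder Dec_theta: latent code -> reconstructed realization
y = Dense(676)(z_m)
y = Reshape((13, 13, 4))(y)
y = conv_block(y, 4, (5, 5))
y = UpSampling2D(size=(2, 2))(y)                          # 26 x 26
y = conv_block(y, 8, (4, 4))
y = UpSampling2D(size=(2, 2))(y)                          # 52 x 52
y = conv_block(y, 16, (3, 3))
y = UpSampling2D(size=(2, 2))(y)                          # 104 x 104
m_hat = Conv2D(1, (3, 3), padding='same')(y)

autoencoder = Model(m_in, m_hat)
autoencoder.compile(optimizer=Adam(lr=1e-3), loss='mean_squared_error')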
For the 3D examples used in Section 3.3.3, a detailed description of the LSI architectures is given in Figure 8.2. The actual Keras functions used and the associated parameters are tabulated in Table 8.5 and Table 8.6 (any functions that are not listed are the same as in Table 8.2, Table 8.3 and Table 8.4).

Table 8.5: Keras functions and parameters for the model autoencoder Enc_θ, Dec_θ in Section 3.3.3 (Volve) and Figure 8.2.
Shorthand notation | Keras function (version 2.2.4) | Parameters
conv3D | keras.layers.Conv3D | (16, 3×3×3); (8, 4×4×4); (4, 5×5×5); (4, 5×5×5); (8, 4×4×4); (16, 3×3×3); (1, 7×6×6) as (filters, kernel size) pairs in order of appearance, padding='same'
upsample | keras.layers.UpSampling3D | size=(2, 2, 2)
pool | keras.layers.MaxPooling3D | pool_size=(2, 2, 2), padding='same'
reshape | keras.layers.Reshape | target_shape=(10*11*2*4); target_shape=(11, 12, 3, 4)
dense | keras.layers.Dense | units=64; units=11*12*3*4

Table 8.6: Keras functions and parameters for the data autoencoder Enc_ψ, Dec_ψ in Section 3.3.3 (Volve) and Figure 8.2.
Shorthand notation | Keras function (version 2.2.4) | Parameters
conv1D | keras.layers.Conv1D | (16, 3); (8, 6); (4, 9); (2, 12); (2, 12); (4, 9); (8, 6); (16, 3); (35, 4) as (filters, kernel size) pairs in order of appearance, padding='same'
reshape | keras.layers.Reshape | target_shape=(10*2); target_shape=(10, 2)
dense | keras.layers.Dense | units=10*2; units=10*2

Figure 8.2: Weight distribution and dimensions of the input and output of the encoders, decoders and regressor for the examples in Section 3.3.3.

C Description of the LSDA Architectures

C.1 Network Architecture

Parameters used in the neural network layers of the LSDA architecture for the 2D examples (Sections 4.3.1 and 4.3.2) are provided in Table 8.7, Table 8.8 and Table 8.9.

Table 8.7: Keras functions for Enc_θ, Dec_θ in Figure 4.3 and parameters used for the 2D examples.
Shorthand notation | Keras function (version 2.2.4) | Parameters
conv2D | keras.layers.Conv2D | (16, 3×3); (8, 4×4); (4, 5×5); (4, 5×5); (8, 4×4); (16, 3×3); (1, 3×3) as (filters, kernel size) pairs in order of appearance, padding='same'
upsample | keras.layers.UpSampling2D | size=(2, 2)
lrelu | keras.layers.LeakyReLU | alpha=0.3
pool | keras.layers.MaxPooling2D | pool_size=(2, 2), padding='same'
dense | keras.layers.Dense | units=64; units=64; units=64; units=676
L(θ, θ′) | keras.optimizers.Adam | lr=1e-3

Table 8.8: Keras functions for Enc_ψ, Dec_ψ in Figure 4.4 and parameters used for the 2D examples.
Shorthand notation | Keras function (version 2.2.4) | Parameters
conv1D | keras.layers.Conv1D | (32, 3); (16, 6); (16, 6); (32, 3); (20, 4) as (filters, kernel size) pairs in order of appearance, padding='same'
upsample | keras.layers.UpSampling1D | size=2
lrelu | keras.layers.LeakyReLU | alpha=0.3
pool | keras.layers.MaxPooling1D | pool_size=2, padding='same'
dense | keras.layers.Dense | units=20; units=20; units=20; units=128
L(ψ, ψ′) | keras.optimizers.Adam | lr=1e-3
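The data autoencoder in Table 8.8 follows the same pattern in one dimension. The sketch below is an assumed arrangement with 32 timesteps and 20 data channels, and it simplifies the Dense stack to a single 20-dimensional bottleneck; it should be read as an illustration of the pattern rather than the exact network.

# Illustrative Conv1D data autoencoder in the spirit of Table 8.8 (Keras 2.2.4 style).
# Input length (32 timesteps) and channel count (20) are assumed placeholders.
from keras import layers, models, optimizers

inp = layers.Input(shape=(32, 20))

# Encoder Enc_psi: two conv/pool stages and a dense bottleneck.
x = layers.Conv1D(32, 3, padding='same')(inp)
x = layers.LeakyReLU(alpha=0.3)(x)
x = layers.MaxPooling1D(2, padding='same')(x)      # 32 -> 16
x = layers.Conv1D(16, 6, padding='same')(x)
x = layers.LeakyReLU(alpha=0.3)(x)
x = layers.MaxPooling1D(2, padding='same')(x)      # 16 -> 8
x = layers.Flatten()(x)                            # 8 * 16 = 128
z_d = layers.Dense(20)(x)                          # low-dimensional data encoding (assumed size)

# Decoder Dec_psi: dense expansion and two conv/upsample stages.
x = layers.Dense(128)(z_d)
x = layers.Reshape((8, 16))(x)
x = layers.Conv1D(16, 6, padding='same')(x)
x = layers.LeakyReLU(alpha=0.3)(x)
x = layers.UpSampling1D(2)(x)                      # 8 -> 16
x = layers.Conv1D(32, 3, padding='same')(x)
x = layers.LeakyReLU(alpha=0.3)(x)
x = layers.UpSampling1D(2)(x)                      # 16 -> 32
out = layers.Conv1D(20, 4, padding='same')(x)      # reconstructed 32 x 20 data block

data_autoencoder = models.Model(inp, out)
data_autoencoder.compile(optimizer=optimizers.Adam(1e-3), loss='mse')   # loss assumed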
Table 8.9: Keras functions for Reg^md_γ(·) in Figure 4.5 and parameters used for the 2D examples.
Shorthand notation | Keras function (version 2.2.4) | Parameters
lrelu | keras.layers.LeakyReLU | alpha=0.3
dense | keras.layers.Dense | units=32; units=16; units=20 for Reg^md_γ(·)
L(γ) | keras.optimizers.Adam | lr=1e-3

D Description of the CGAN Architectures

D.1 Network Architecture and Training Progression

We provide a complete description of the architecture used in our study. The networks are implemented with the open-source machine learning framework Tensorflow (version 1.12). For this particular example in the appendix, each label is defined as one of the five geologic scenarios and is assigned according to the TI used to generate the 32×32 realizations. CGAN (Method 1) is tasked to parameterize 500 model realizations within each geologic scenario and can be used to generate realizations from the respective geologic scenario when provided with a latent vector z from a Gaussian distribution and a geologic scenario label c (as a one-hot vector encoding). Figure 8.3 shows the dimensions of the input, output and weights (parameters) associated with each layer. A layer refers to a sequence of dense/convolution/deconvolution operations, followed by an optional batch normalization operation and finally a nonlinear operation. Note that the batch normalization operation and the nonlinear operation do not change the dimension of the input. Table 8.10 lists the actual Tensorflow functions and hyperparameters used within each layer. The shorthand notation for each Tensorflow function in Table 8.10 is consistent with the shorthand notation used in Figure 8.3. For all the examples used in this chapter, the weight of the gradient penalty term, λ (in Equations 5.2 and 5.3), is set to 10. The three loss functions (Equations 5.3 to 5.5) for training CGAN are minimized using tf.train.AdamOptimizer(α = 5×10^-4, β1 = 0.5, β2 = 0.9) with a batch size (denoted as N_b) of 32. For the second method, the classifier C_φ loss function (L_C) is simply omitted from the training process and the input to the generator G_θ only includes the latent vector.

Table 8.10: Tensorflow functions and hyperparameters used in Section 5.2.
Shorthand notation | Tensorflow function (version 1.12) | Parameters
bn | tf.contrib.layers.batch_norm | decay=0.9, epsilon=10^-5
conv2D | tf.nn.conv2d | stride=2, kernel=4×4, padding='SAME'
deconv | tf.nn.deconv2d | stride=2, kernel=4×4
dense | tf.matmul |
lrelu | tf.nn.leaky_relu | alpha=0.2
softmax | tf.nn.softmax |
sigmoid | tf.sigmoid |
init | tf.random_normal_initializer |

Figure 8.3: Schematic of the architecture used in this study. Refer to the shorthand notations in Table 8.10 for a description of the functions used within each layer.

Figure 8.3 also illustrates the flow of tensors when the components (G_θ, D_ψ, C_φ) in CGAN are optimized in an alternating manner. When D_ψ and C_φ are updated, the weights in G_θ are fixed and the flow of tensors is represented by the red bold path for generated (fake) realizations and the red stippled path for training (real) realizations. In this update step, gradient information is backpropagated to D_ψ (calculated using L_D via the red bold and stippled paths) to train D_ψ to distinguish between fake and real realizations. Additionally, gradient information is backpropagated to C_φ and D_ψ (calculated using L_C via the red stippled path) to train C_φ and D_ψ to learn the geologic features associated with each class label. When G_θ is updated, the weights in D_ψ and C_φ are fixed and the flow of tensors is represented by the green bold path for generated (fake) realizations. In this update step, gradient information is backpropagated to G_θ (calculated using L_C and L_D via the green bold path) to teach the generator how to reproduce the geologic features associated with each class label.

Figure 8.4: (a) Total losses (normalized) of C_φ, D_ψ and G_θ of CGAN using dataset A with the geologic scenario as the label. (b) Five samples of realizations per geologic scenario (row), generated by G_θ at selected iterations.

Figure 8.4(a) shows the total losses of the components (G_θ, D_ψ, C_φ) in CGAN when trained with model realizations labeled by geologic scenario.
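As a small illustration of the generator inputs described above, the snippet below draws Gaussian latent vectors and builds one-hot scenario labels. The latent dimension and the simple concatenation of z and c are assumptions for illustration; the actual conditioning wiring is the one shown in Figure 8.3.

# Hypothetical illustration of the CGAN generator inputs (latent dimension assumed).
import numpy as np

n_batch, latent_dim, n_scenarios = 32, 100, 5

z = np.random.normal(size=(n_batch, latent_dim))            # latent vectors from N(0, 1)
scenario = np.random.randint(0, n_scenarios, size=n_batch)  # scenario index per sample
c = np.eye(n_scenarios)[scenario]                           # one-hot labels, shape (32, 5)

generator_input = np.concatenate([z, c], axis=1)            # conditioned input for G_theta
print(generator_input.shape)                                # (32, 105)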
The network is trained for 8000 iterations, where D_ψ is updated 5 times for every 1 iteration, as recommended by Gulrajani et al. [79]. A single iteration refers to loss computation on a batch; in this case, since there are 2500 realizations in total, 78 iterations are needed to process the entire dataset once. Figure 8.4(b) shows samples of realizations, grouped by geologic scenario, generated at selected iterations. To monitor convergence, the input latent vector for each generated sample in Figure 8.4(b) is kept fixed across iterations. It is observed that C_φ converges rather easily, and the generated realizations at iteration 1000 already exhibit the correct features (azimuth and channel geometry) for each geologic scenario. G_θ generates continuous realizations (as shown, with satisfactory quality at iteration 8000) that can be discretized by taking a mid-point threshold in the case of binary facies. In this case, using a thresholding method with a mid-point cutoff value of 0.5 (i.e., the mid-point between 0 and 1), any pixel with a value less than 0.5 is assigned a discrete value of 0 and any pixel with a value greater than or equal to 0.5 is assigned a discrete value of 1. Beyond iteration 8000, the generated realizations remain consistent and only the continuous-valued pixels show minor variations in location and value. The same behavior is observed in Method 2 when the network is trained without C_φ.

E Description of the RNN Architectures

E.1 Network Architecture

A complete description of the architecture used in our study is provided in this section. The networks are implemented with the open-source deep learning library Keras (version 2.2.4) [45]. Hyperparameters used in the neural network layers, the sizes of the input and output, the actual Keras functions used, as well as the number of weights within each component of the proposed forecast model architecture for the synthetic dataset used in Experiment 2 (synthetic Bakken data) are provided in Table 8.11. The architectures for Experiment 1 (synthetic data) and Experiment 3 (field data) can be derived from this description by simply modifying the input and output sizes.

In Experiment 2, the production responses are represented as a sequence of windows y_w and y_{w+1}, where N_f = 3, N_s = 6 and N_l = 3. Consequently, as per Table 8.11 and Figure 6.2, the LSTM encoder Enc_θ takes an input of dimension N_b × N_l × N_f, where N_l = 3 and N_f = 3 (i.e., multivariate), to generate a dynamic encoding (for past data) of dimension 1×50. For simplicity, assume N_b = 1. The corresponding control trajectory u_{w+1} has a dimension of N_s = 6 (one value for each timestep in window w+1), and Reg_ω takes this input of dimension N_b × N_s to generate a control encoding of dimension 1×3. The corresponding vector of well properties p becomes an input of dimension N_b × N_p for Reg_ζ, which generates a well property encoding of dimension 1×3. The encodings for past data, control trajectories and well properties (with dimensions 1×50, 1×3 and 1×3, respectively) are concatenated to form an input of dimension 1×56 for the LSTM decoder Dec_γ. The LSTM decoder then uses the information present in the encodings to predict the flow behavior for the next window w+1 with dimension 1×6×3 (i.e., N_b × N_s × N_f). Note that the size of each encoding is arbitrary and depends on the complexity of the dataset.
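To make the tensor flow above concrete, the following is a minimal Keras sketch of the forecast model for the Experiment 2 dimensions (N_l = 3, N_s = 6, N_f = 3, and an assumed N_p = 6). Layer sizes follow Table 8.11; the mean-squared-error loss is an assumption.

# Minimal sketch of the proposed forecast model (Keras 2.2.4 style).
from keras import layers, models, optimizers

def encode(x, units):
    # Small Dense/LeakyReLU stack used for Reg_omega and Reg_zeta.
    for u in units[:-1]:
        x = layers.Dense(u)(x)
        x = layers.LeakyReLU(alpha=0.3)(x)
    return layers.Dense(units[-1])(x)

y_past = layers.Input(shape=(3, 3), name='y_w')        # past window, N_l x N_f
u_next = layers.Input(shape=(6,), name='u_w1')         # controls over the next window
p_well = layers.Input(shape=(6,), name='p')            # static well properties (N_p assumed)

h_dyn = layers.LSTM(50)(y_past)                        # Enc_theta: dynamic encoding (1 x 50)
h_ctl = encode(u_next, [5, 4, 3])                      # Reg_omega: control encoding (1 x 3)
h_prp = encode(p_well, [5, 4, 3])                      # Reg_zeta: property encoding (1 x 3)

# Dec_gamma: expand the concatenated encodings into the next production window.
h = layers.Concatenate()([h_dyn, h_ctl, h_prp])        # 1 x 56
h = layers.Dense(54)(h)
h = layers.Dense(36)(h)
h = layers.Reshape((6, 6))(h)
h = layers.LSTM(50, return_sequences=True)(h)
y_next = layers.TimeDistributed(layers.Dense(3))(h)    # 1 x 6 x 3 prediction

forecast = models.Model([y_past, u_next, p_well], y_next)
forecast.compile(optimizer=optimizers.Adam(1e-3), loss='mse')   # loss assumed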
Table 8.11: Detailed description of components in the proposed forecast model.
Component | Input shape | Output shape | Weights | Name of function | Hyperparameters
Reg_ω | 1×6 | 1×5 | (6×5)+5 | Dense → LeakyReLU | α = 0.3
 | 1×5 | 1×4 | (5×4)+4 | Dense → LeakyReLU | α = 0.3
 | 1×4 | 1×3 | (4×3)+3 | Dense |
Reg_ζ | 1×6 | 1×5 | (6×5)+5 | Dense → LeakyReLU | α = 0.3
 | 1×5 | 1×4 | (5×4)+4 | Dense → LeakyReLU | α = 0.3
 | 1×4 | 1×3 | (4×3)+3 | Dense |
Enc_θ | 1×3×3 | 1×50 | 4[(50×50)+(50×3)+(50×1)] | LSTM |
Dec_γ | [(1×3), (1×3), (1×50)] | 1×56 | | Concatenate |
 | 1×56 | 1×54 | (56×54)+54 | Dense |
 | 1×54 | 1×36 | (54×36)+36 | Dense |
 | 1×36 | 1×6×6 | | Reshape |
 | 1×6×6 | 1×6×50 | 4[(50×50)+(50×6)+(50×1)] | LSTM |
 | 1×6×50 | 1×6×3 | (50×3)+3 | TimeDistributed |
Total weights: 27,559

E.2 Hyperparameter Tuning

Selection of optimal model hyperparameters is an important step to ensure that the forecast model can effectively fit the training data without underfitting or overfitting. For the proposed forecast model, the relevant hyperparameters to tune are the number of units in a Dense layer, the parameter α within the LeakyReLU activation function, the learning rate, the number of training epochs, the dimensions of the encodings for p, u_{w+1} and y_w, and the size of the windows (i.e., N_l and N_s) for the time-series data. The forecast model is not sensitive to the α value within the LeakyReLU activation functions, as long as it is larger than zero to prevent the "dying ReLU" problem, where the gradient becomes zero when the activation value is negative. Therefore, the default value of α is used, and we attribute the lack of sensitivity to the self-correcting nature of the optimization problem.

The learning capacity of the proposed forecast model is primarily affected by the number of units in the Dense layers and the dimensions of the encodings for p, u_{w+1} and y_w, which directly control the number of trainable weights within the forecast model. These hyperparameters depend on the complexity of the dataset and are tuned using the grid search technique [193], where we begin with small candidate values and incrementally increase them until no further increase in performance is observed. The learning rate used in this work is based on common values observed in the literature [234, 238, 264]. Using a simple early-stopping technique with model checkpointing, the number of training epochs is determined at the point where no further improvement on the validation split is observed. From our observations, the performance of the proposed forecast model is relatively sensitive to the size of the windows (i.e., N_l and N_s). As such, a systematic approach to determine the optimal hyperparameters is devised based on the grid search technique using relevant candidate values. Figure 8.5 shows two heatmaps of normalized RMSE on the same testing dataset from the grid search for the optimal N_l and N_s for Experiment 2 and Experiment 3 (random scenario). The heatmaps indicate that values of N_l and N_s that are too small result in higher errors, as the forecast model becomes too sensitive to the noise within the dataset. Conversely, values of N_l and N_s that are too large result in a reduction in performance, as the number of data tuples available for training decreases. Additionally, when N_l and N_s are too large, the windows may cover different flow regimes, making it more challenging for the forecast model to learn the piecewise behaviors in the time-series data.

Figure 8.5: Heatmaps of normalized testing RMSE from the grid search for the size of the windows.
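A minimal sketch of the window-size grid search described above follows. The candidate values and the evaluate_forecast_model stand-in are placeholders; in the actual workflow this call would train the forecast model for a given (N_l, N_s) pair and return the normalized RMSE on a test split.

# Sketch of the grid search over window sizes (N_l, N_s); candidate values assumed.
import itertools
import numpy as np

def evaluate_forecast_model(n_l, n_s):
    # Placeholder: substitute with model training and normalized test RMSE.
    return np.random.rand()

candidates_nl = [1, 2, 3, 4, 6]       # assumed candidate past-window lengths
candidates_ns = [2, 3, 6, 9, 12]      # assumed candidate future-window lengths

rmse = np.zeros((len(candidates_nl), len(candidates_ns)))
for (i, n_l), (j, n_s) in itertools.product(enumerate(candidates_nl),
                                            enumerate(candidates_ns)):
    rmse[i, j] = evaluate_forecast_model(n_l, n_s)

best_i, best_j = np.unravel_index(np.argmin(rmse), rmse.shape)
print('best N_l =', candidates_nl[best_i], ', best N_s =', candidates_ns[best_j])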
F Description of the PGDL Architectures

F.1 Network Architecture

This section provides supplementary descriptions of the architecture used in our study. The networks are implemented with the open-source deep learning library Keras (version 2.2.4) [45]. Hyperparameters used in the neural network layers, the sizes of the input and output, the actual Keras functions used, as well as the number of weights within each component of the proposed model architecture, are provided in Table 8.12 for the toy dataset used in Experiment 1 and in Table 8.13 for the synthetic dataset used in Experiment 2 (synthetic Bakken data). The architecture for Experiment 3 (field data) is given in Table 8.14 and can be derived from this description by simply modifying the input and output sizes. The input and output shapes of certain layers in Table 8.13 and Table 8.14 are denoted with a dot to indicate multiple connections between layers. Layer 3 of f^1_θ in Table 8.13 and Table 8.14 forms an outgoing branching connection to Layers 4, 5 and 6. The first five layers of f^2_ω in Table 8.13 are replicated for each production phase, and the output of each replicate forms an outgoing connection, branching into the same Layer 6 where the outputs are concatenated for further computation. Note that f^2_ω is a statistical approximation of f^2; if the exact form of f^2 is embedded as is, the component f^2_ω with trainable weights ω becomes unnecessary.

Table 8.12: Detailed description of components in the statistical and explicit PGDL models in Experiment 1.
Component | Input shape | Output shape | Weights | Name of function | Hyperparameters | Total weights
f^1_θ / f^3_ζ | 1×3 | 1×4 | (3×4)+4 | Dense → LeakyReLU | α = 0.3 | 83
 | 1×4 | 1×8 | (4×8)+8 | Dense → LeakyReLU | α = 0.3 |
 | 1×8 | 1×3 | (8×3)+3 | Dense → Sigmoid | |
f^2_ω / f^3_ζ | 1×3 | 1×36 | (3×36)+36 | Dense → LeakyReLU | α = 0.3 | 2209
 | 1×36 | 1×9×4 | | Reshape | |
 | 1×9×4 | 1×18×16 | (16×6×4)+16 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×18×16 | 1×36×32 | (32×3×16)+32 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×36×32 | 1×36×1 | (1×3×32)+1 | Conv1D | padding='same' |

Table 8.13: Detailed description of components in the statistical and explicit PGDL models in Experiment 2.
Component | Input shape | Output shape | Weights | Name of function | Hyperparameters | Total weights
f^1_θ | 1×7 | 1×8 | (7×8)+8 | Dense → LeakyReLU | α = 0.3 | 533
 | 1×8 | 1×12 | (8×12)+12 | Dense → LeakyReLU | α = 0.3 |
 | 1×12 | 1×16· | (12×16)+16 | Dense → LeakyReLU | α = 0.3 |
 | 1×16· | 1×3 | (16×3)+3 | Dense → Sigmoid | |
 | 1×16· | 1×3 | (16×3)+3 | Dense → Sigmoid | |
 | 1×16· | 1×3 | (16×3)+3 | Dense → Sigmoid | |
f^2_ω | 1×3 | 1×240 | (3×240)+240 | Dense → LeakyReLU | α = 0.3 | 4177×3
 | 1×240 | 1×15×16 | | Reshape | |
 | 1×15×16 | 1×30×16 | (16×6×16)+16 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×30×16 | 1×60×32 | (32×3×16)+32 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×60×32 | 1×60×1· | (1×3×32)+1 | Conv1D | padding='same' |
 | 1×60×1· | 1×60×3 | | Concatenate | |
f^3_ζ | 1×7 | 1×8 | (7×8)+8 | Dense → LeakyReLU | α = 0.3 | 7871
 | 1×8 | 1×12 | (8×12)+12 | Dense → LeakyReLU | α = 0.3 |
 | 1×12 | 1×16 | (12×16)+16 | Dense → LeakyReLU | α = 0.3 |
 | 1×16 | 1×240 | (16×240)+240 | Dense | |
 | 1×240 | 1×15×16 | | Reshape | |
 | 1×15×16 | 1×30×16 | (16×6×16)+16 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×30×16 | 1×60×32 | (32×3×16)+32 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×60×32 | 1×60×3 | (3×3×32)+3 | Conv1D | padding='same' |
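To illustrate how the components in Table 8.13 connect, the sketch below assembles the statistical PGDL forward model: f^1_θ maps the 7 input features to three 3-dimensional sigmoid heads (one per production phase), and a decoder f^2_ω (replicated per phase) expands each head into a 60-step production profile before the outputs are concatenated. Layer sizes follow the table; the mean-squared-error loss is an assumption.

# Minimal sketch of the statistical PGDL forward model in Table 8.13 (Keras 2.2.4 style).
from keras import layers, models, optimizers

def f2_decoder(name):
    # f2_omega replicate for one production phase: 3 parameters -> 60 x 1 profile.
    z = layers.Input(shape=(3,), name=name + '_params')
    x = layers.Dense(240)(z)
    x = layers.LeakyReLU(alpha=0.3)(x)
    x = layers.Reshape((15, 16))(x)
    x = layers.Conv1D(16, 6, padding='same')(x)
    x = layers.LeakyReLU(alpha=0.3)(x)
    x = layers.UpSampling1D(2)(x)                    # 15 -> 30
    x = layers.Conv1D(32, 3, padding='same')(x)
    x = layers.LeakyReLU(alpha=0.3)(x)
    x = layers.UpSampling1D(2)(x)                    # 30 -> 60
    x = layers.Conv1D(1, 3, padding='same')(x)
    return models.Model(z, x, name=name)

inp = layers.Input(shape=(7,), name='well_inputs')

# f1_theta: shared trunk with one sigmoid head per production phase.
h = layers.Dense(8)(inp);  h = layers.LeakyReLU(alpha=0.3)(h)
h = layers.Dense(12)(h);   h = layers.LeakyReLU(alpha=0.3)(h)
h = layers.Dense(16)(h);   h = layers.LeakyReLU(alpha=0.3)(h)
heads = [layers.Dense(3, activation='sigmoid')(h) for _ in range(3)]

# f2_omega: one decoder replicate per phase, outputs concatenated to 60 x 3.
profiles = [f2_decoder('f2_phase%d' % i)(p) for i, p in enumerate(heads)]
out = layers.Concatenate(axis=-1)(profiles)

pgdl = models.Model(inp, out)
pgdl.compile(optimizer=optimizers.Adam(1e-3), loss='mse')   # loss assumed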
Table 8.14: Detailed description of components in the statistical and explicit PGDL models in Experiment 3.
Component | Input shape | Output shape | Weights | Name of function | Hyperparameters | Total weights
f^1_θ | 1×8 | 1×8 | (8×8)+8 | Dense → LeakyReLU | α = 0.3 | 541
 | 1×8 | 1×12 | (8×12)+12 | Dense → LeakyReLU | α = 0.3 |
 | 1×12 | 1×16· | (12×16)+16 | Dense → LeakyReLU | α = 0.3 |
 | 1×16· | 1×3 | (16×3)+3 | Dense → Sigmoid | |
 | 1×16· | 1×3 | (16×3)+3 | Dense → Sigmoid | |
 | 1×16· | 1×3 | (16×3)+3 | Dense → Sigmoid | |
f^2_ω | 1×3 | 1×240 | (3×240)+240 | Dense → LeakyReLU | α = 0.3 | 4177×3
 | 1×240 | 1×15×16 | | Reshape | |
 | 1×15×16 | 1×30×16 | (16×6×16)+16 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×30×16 | 1×60×32 | (32×3×16)+32 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×60×32 | 1×60×1· | (1×3×32)+1 | Conv1D | padding='same' |
 | 1×60×1· | 1×60×3 | | Concatenate | |
f^3_ζ | 1×8 | 1×8 | (8×8)+8 | Dense → LeakyReLU | α = 0.3 | 7879
 | 1×8 | 1×12 | (8×12)+12 | Dense → LeakyReLU | α = 0.3 |
 | 1×12 | 1×16 | (12×16)+16 | Dense → LeakyReLU | α = 0.3 |
 | 1×16 | 1×240 | (16×240)+240 | Dense | |
 | 1×240 | 1×15×16 | | Reshape | |
 | 1×15×16 | 1×30×16 | (16×6×16)+16 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×30×16 | 1×60×32 | (32×3×16)+32 | Conv1D → LeakyReLU → UpSampling1D | α = 0.3, padding='same' |
 | 1×60×32 | 1×60×3 | (3×3×32)+3 | Conv1D | padding='same' |

F.2 Synthetic Data Generation

This section provides additional descriptions of the synthetic data used in Experiment 2. The reservoir model is constructed based on available data [97], and the parameters used in setting up the model are listed in Table 8.15. Hydraulic fracture spacing is calculated as the sampled lateral length of the well divided by the sampled number of stages. The fracture height and natural fracture height are equal to the sampled thickness of the pay zone. The volume of the reservoir is equal to the area of the reservoir multiplied by the sampled thickness of the pay zone. The main data sources used for sampling the input features in Experiment 2 are listed in Table 8.16.

Table 8.15: Parameters used in setting up the model in Experiment 2.
Parameter | Value
Bottomhole pressure | 725 psi
Hydraulic fracture length | 1000 ft
Fracture width | 0.005 ft
Fracture conductivity | 5 mD-ft
Natural fracture length | 150 ft
Natural fracture conductivity | 4 mD-ft
Area of reservoir | 4000 m × 1000 m

Table 8.16: Data sources used for sampling input features in Experiment 2.
Input Feature | Data Source
Porosity (ratio) | Map provided by Gherabati et al. (2016)
Water Saturation (ratio) | Map provided by Gherabati et al. (2016)
Formation Thickness (m) | Map provided by Theloy and Sonnenberg (2013)
Reservoir Pressure (psia) | Map provided by Gherabati et al. (2016)
Oil Density (kg/m3) | Map provided by Gherabati et al. (2016)
Lateral Length (m) | Well data collected from the North Dakota Department of Mineral Resources
No. of Stages | Well data collected from the North Dakota Department of Mineral Resources
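The derived quantities described above can be computed directly from the sampled features. The snippet below is a small illustration with assumed sample values; the unit handling is simplified, and the fracture spacing is written as lateral length per stage.

# Illustration of the derived model-setup quantities in Experiment 2 (sample values assumed).
AREA_M2 = 4000.0 * 1000.0          # fixed reservoir area (m^2), from Table 8.15

lateral_length_m = 2900.0          # assumed sampled lateral length
n_stages = 30                      # assumed sampled number of stages
pay_thickness_m = 20.0             # assumed sampled formation thickness

frac_spacing_m = lateral_length_m / n_stages        # spacing between hydraulic fractures
frac_height_m = pay_thickness_m                     # fracture height equals pay thickness
nat_frac_height_m = pay_thickness_m                 # natural fracture height as well
reservoir_volume_m3 = AREA_M2 * pay_thickness_m     # reservoir area times sampled thickness

print(frac_spacing_m, frac_height_m, reservoir_volume_m3)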
Abstract
Reliable production forecasting is necessary for optimizing the development and management of subsurface flow systems and for assisting asset teams in making sound business decisions. The flow systems in subsurface reservoirs that govern production behavior range from the well-understood flow mechanisms of porous media, as in clastic and carbonate petroleum reservoirs, to the highly complex flow mechanisms of naturally fractured and induced-fracture reservoirs, as in tight oil formations. Standard forecasting workflows involve the development of mathematical models that capture the complex relationships between well parameters, reservoir properties, well operating controls and observed production data to predict future production. Practical applications involve highly nonlinear processes, non-Gaussian descriptions of properties and non-stationary flow behavior that classical covariance-based forecasting workflows cannot handle.
This body of work presents new deep learning forecasting workflows that leverage state-of-the-art neural network architectures to efficiently extract and compactly represent spatial and temporal information, and to learn complex multimodal input-output mappings that improve classical workflows. For conventional reservoirs with well-established flow-physics models, latent space representations of geologic models preserve the geologic consistency of history-matched solutions; manipulations of the latent space translate to feature-based calibration in the full-dimensional space, offering improved forecasting reliability compared to standard workflows that use classical covariance-based techniques. We demonstrate that the discovered salient dynamical features in flow responses can be used to eliminate inconsistent geologic scenarios that are not supported by observed measurements. Additionally, we develop a direct model calibration method to simultaneously parameterize and invert complex geologic models in efficient latent spaces that not only exploit the redundancy of large-scale geologic features but also retain features that are sensitive to flow response data.
For unconventional reservoirs, latent space representations that capture the temporal dynamics in the flow response data and their relationship with well parameters and well operating controls are used to obtain long-term predictions from minimal initial production data. To benefit from existing physics-based models that encode the limited available flow-physics understanding, we develop latent space representations of well parameters, well operating controls and production data fused with first-principles physics-based models to form hybrid forecasting workflows that yield improved, physically consistent predictions compared to standard data-driven models.
The developed workflows combine the benefits of data-driven and physics-based models and exploit the salient spatial and temporal features for improved efficiency and higher generalization ability. The workflows can be integrated into closed-loop reservoir management tools for robust production optimization and can be applied for history matching and production forecasting of other subsurface systems such as geothermal reservoirs, carbon capture and storage reservoirs and groundwater reservoirs.