PROBABILISTIC DATA-DRIVEN PREDICTIVE MODELS FOR ENERGY APPLICATIONS

by

Nastaran Bassamzadeh

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CIVIL AND ENVIRONMENTAL ENGINEERING)

May 2016

Copyright 2016 Nastaran Bassamzadeh

To my parents, Shirin and Mansour.

Acknowledgments

First and above all, I praise God, the Almighty, for everything He has granted me. I wish to offer my sincere thanks to several people who have assisted and guided me throughout my journey as a PhD student.

I would like to express my special appreciation and thanks to my supervisor, Professor Roger Ghanem, for his continuous support, patience, motivation, and immense knowledge. Dr. Ghanem gave me the freedom to develop my own research agenda and inspired me to work on a broad range of projects. In particular, I would like to mention his unconstrained support during the time I had a little baby at home. I would also like to extend my appreciation to my committee members, Dr. Najmedin Meshkati and Dr. James Moore, for their insightful comments and encouragement.

My time at USC was made enjoyable in large part due to the many friends who became a part of my life. I would like to thank my labmates Hadi Meidani, Vahid Keshavarzzadeh, Hamed Haddadzadegan, and Maud Comboul for the friendly environment they provided at the lab. In particular, I am sincerely thankful to Charan Raj Thimmisetty for his invaluable comments and for providing part of the data I needed for one of my projects. I am also grateful to Loujaine Mehrez for the motivating discussions and for her pure friendship.

I would like to thank my family for all their love and encouragement: my parents, Shirin and Mansour, who raised me and supported me in all stages of my life. Words cannot express how grateful I am to them for all the sacrifices they have made for me. Your prayers for me are what sustained me thus far. I thank my sister Narges and my brother Mehdi for being my best friends. I also extend my love to my parents-in-law, Razie and Jamal, for all the kindness they have offered me.

Finally, the deepest thanks go to my loving, encouraging, and patient husband, Jalal, whose faithful support during my PhD is so appreciated. He was always my support in the moments when there was no one else beside me. Thank you for always being there. Lastly, I would like to mention my lovely sweet son, Ali, and thank him for bringing pure joy to our family.

Contents

Acknowledgments
Contents
List of Figures
List of Tables
Abstract

1 Introduction
1.1 Summary of Thesis Work
1.2 Thesis Outline

2 Demand Management in Smart Grids
2.1 Introduction
2.2 Related Work
2.3 Robust Optimization
2.3.1 Non-Probabilistic Robust Optimization
2.3.2 Probabilistic Robust Optimization
2.4 Chance-Constrained Optimization Model for Optimal Load Scheduling
2.4.1 Objective Function
2.4.2 Chance-Constraint
2.4.3 Physical Constraints
2.4.4 Customer Preferences
2.4.5 The Optimization Problem
2.5 Experiments
2.5.1 Data Set Description
2.5.2 Uncertainty Sets
2.6 Robust Scheduling in a Heterogeneous Population
2.6.1 Sensitivity Analysis
2.6.2 Heterogeneous Population
2.7 The Effect of Time Scale on Optimal Solution

3 Stochastic Load Forecasting in Smart Grids
3.1 Introduction
3.2 Bayesian Networks
3.2.1 Learning Discrete Bayesian Networks
3.2.2 Exact Inference in Bayesian Networks
3.2.3 D-Separation and Conditional Independence Properties of a Bayesian Network
3.3 Building Bayesian Network Models for the Demand Response Problem
3.3.1 Data Set Description
3.3.2 Bayesian Networks for Modeling Aggregate Demand
3.3.3 Bayesian Networks for Modeling Each Household's Demand

4 Data-Driven Characterization of Reservoirs in the Gulf of Mexico
4.1 Introduction
4.2 Problem Definition
4.3 Gaussian Bayesian Networks
4.3.1 Learning Gaussian Bayesian Networks - the Bayesian Gaussian equivalent score
4.3.2 Approximate Inference in Bayesian Networks
4.4 Application to the Gulf of Mexico - BOEM Data Set
Appendix A: Description of Attributes

5 Effect of Dimensionality Reduction on Stochastic Prediction Performance
5.1 Introduction
5.2 Manifold Learning
5.2.1 Locally Linear Embedding
5.2.2 Isomap
5.2.3 Diffusion Maps
5.3 Application to the Gulf of Mexico - BOEM Data Set

6 Conclusion

Reference List

List of Figures

2.1 Mean of electricity price in different months.
2.2 Mean of electricity price in different seasons.
2.3 Percentage of savings in cost with respect to the deterministic case for all uncertainty sets.
2.4 Percentage of utility loss with respect to the deterministic case for all uncertainty sets.
2.5 Optimal scheduled consumption for Gaussian and 3σ uncertainty sets and their comparison with the deterministic case.
2.6 Sensitivity of model outputs with respect to maximum budget (γ).
2.7 Sensitivity of model outputs with respect to the probability of satisfying the chance-constraint (1 − ε).
2.8 A typical utility function for a stay-at-home customer.
2.9 Comparison between the results of the deterministic and robust methods for a heterogeneous population.
2.10 Average percentage of changes in quantities of interest for a heterogeneous population when using the robust method.
2.11 Model outputs for various categories of customers.
2.12 Savings in cost for employed and stay-at-home customers.
2.13 Average daily consumption profiles for the deterministic and robust approaches in a heterogeneous population.
2.14 Mean of electricity price with different time resolutions.
2.15 Daily robust consumption for various time resolutions.
3.1 An example BN [41].
3.2 A simple Bayesian network [58].
3.3 The domain graph for Fig. 3.2 [58].
3.4 The domain graph obtained from Fig. 3.3 by marginalizing out $A_3$ [58].
3.5 Message passing in a junction tree. The cliques $V_1$, $V_2$ and $V_4$ have sent messages to their separators [58].
3.6 Serial (a), diverging (b) and converging (c) network structures [41].
3.7 Effect of discretization scheme on predicting hourly aggregate demand.
3.8 Learned Bayesian network structure for predicting aggregate demand - hourly time scale.
3.9 Learned Bayesian network structure for predicting aggregate demand - 15-minute time scale.
3.10 Uncertainty interval in predicting aggregate demand at the 15-minute time scale using the discrete BN model - red squares represent the observed demand, the black dots are the predicted mean and the blue bars correspond to the 2σ interval around the mean.
3.11 Learned discrete Bayesian network structure for predicting hourly aggregate demand.
3.12 Learned Gaussian Bayesian network structure for predicting hourly aggregate demand.
3.13 Comparison of the conditional probability distribution of hourly aggregate demand given price, ρ(demand_t | price > 48), using Gaussian and discrete BNs.
3.14 Learned Bayesian network structure for predicting each household's hourly demand.
3.15 Uncertainty interval in predicting each household's hourly demand using the BN model - red squares represent the observed demand, the black line is the predicted mean demand and the shaded area corresponds to the 2σ interval around the mean.
3.16 Probability of demand for customer number 671, at time index 10, for three different price ranges.
3.17 Probability of demand for customers number 707, 855, and 867, at time index 10, given a price range between 51 and 59 $/MWh.
4.1 An example network to perform logic sampling [58].
4.2 Spatial distribution of boreholes over the Gulf of Mexico; source of data [24].
4.3 Structure of the learned GBN to predict oil production rate at day 1 using 36 attributes - labels shown in Appendix A.
4.4 95% confidence interval in predicting oil production rate for 45 instances from the test set. The solid line represents the predicted mean and the shaded area corresponds to the 2σ deviation from the mean. The red dots show the actual observed values.
4.5 Inferring the PDF of the QoI at two test locations using the GBN.
5.1 The distribution of the data points in the high-dimensional space may be inherently embedded on a lower-dimensional space. The original data in the well-known Swiss roll example lie in a 3-dimensional space while the manifold can be unfolded as a 2-dimensional space.
5.2 A schematic summary of the LLE algorithm. First, a number of neighbors are chosen for each data point. Then, the weights are computed so that each point can be linearly reconstructed from its neighbors. Finally, using the same weights, the embedded coordinates are obtained by reconstructing each point from its neighbors [65].
5.3 The structure of the learned BN using the Markov blanket variables of the QoI as well as the X and Y variables.
5.4 Choosing the appropriate number of neighbors for the LLE algorithm when applied to the wellbores data set to predict oil production rate at day 1.
5.5 Finding the intrinsic manifold dimension for the LLE algorithm when applied to the wellbores data set to predict oil production rate at day 1.
5.6 Dependency structure between the LLE coordinates, excluding the QoI.
5.7 Structure of the learned GBN to predict oil production rate using the LLE coordinates.
5.8 Predicted mean of the QoI versus the measured QoI for 45 samples from the test data set - GBNs are learned by Markov blanket variables and LLE coordinates respectively.
5.9 Inferring the PDF of oil production rate at a test location using the LLE coordinates.

List of Tables

2.1 Appliance categories.
2.2 Utility function values.
2.3 Categories of consumers in a heterogeneous population.
2.4 Effect of time resolution on daily cost and utility.
2.5 Effect of time resolution on optimization problem.
3.1 RMSE in predicting aggregate demand using a discrete Bayesian network.
3.2 RMSE in predicting hourly aggregate demand by discrete and Gaussian Bayesian networks.
4.1 RMSE in predicting the Box-Cox transformed QoI.
5.1 Model prediction performance using the original data, Markov blanket variables, LLE coordinates, Isomap coordinates and Diffusion map coordinates respectively.

Abstract

The abundance of data collected from physical systems holds the promise of transforming conventional systems into smarter infrastructures that permit the development of credible decisions.
These perspectives, taken in the context of recent algorithmic developments for big data, begin to address many of the challenges encountered in complex systems. On the other hand, the uncertainty intertwined with these systems imposes an additional layer of complexity that needs to be addressed properly. Stochastic predictive modeling seeks to quantify the effect of uncertainty on the overall system behavior in order to improve the process of decision making. While fulfilling this promise remains fraught with conceptual, technological, and mathematical challenges, success holds many promises that go to the heart of societal and environmental sustainability.

In this study, we aim to enhance our stochastic modeling capability in complex systems by adopting a data-driven approach. More specifically, we focus on the future Smart Grid and Smart Oilfields as two vivid examples of the energy sector where analyzing data assists in quantifying the confidence in the predictions. The Smart Grid, which represents the future generation of power systems, is composed of multiple interacting systems such as the renewable supply network, distribution and storage systems, and the communications system. An important enabler of the Smart Grid, the information network, ensures the viability of data-driven and data-aware perspectives, both on the side of utilities and of consumers. This enables both consumers and utilities to share in the responsibility and benefits of access to advanced technology. Smart Oilfields were initiated by a similar core idea, i.e., using advanced information technology as an enabler for informed decisions. The data-driven predictive approach can serve as a prior model for characterizing hydrocarbon reservoirs, where the conventional analysis methods impose high computational and monetary expenses. This thesis examines some uncertainty-induced challenges faced in predictive modeling of Smart Grids and Smart Oilfields and seeks to address these challenges from a data-driven perspective.

Chapter 1

Introduction

The energy industry is overwhelmingly fed with huge information streams from varying sources. Effectively analyzing this data has the potential to address key challenges that businesses and researchers encounter. Generally speaking, the data can be used in multiple facets:

• Discovering new energy sources.
• Optimizing decision making by using predictive analytics.
• Boosting the overall system performance and productivity.
• Predicting possible failures and risks before they happen.
• Managing and shaping consumption patterns.
• Matching the supply and demand.
• Providing better maintenance and repair plans.
• Increasing revenues.

The key energy players, realizing the significance of data analysis, are moving towards building smart and digitized infrastructures which are able to collect field data. However, extracting useful knowledge out of this array of data still remains a challenge.

Smart Grid and Smart Oilfields are the newly emerged smart versions of traditional power grids and oil fields. The underlying idea which has driven this transition is the growing need for optimized energy consumption and management. This goal is sought with the aid of digitized systems and sensors which are capable of monitoring the system and collecting relevant data at various scales. This transition owes to the astonishing developments in the supporting technologies in recent decades.
Smart Grid emerged as an opportune response to the challenges of reliable decision-making in increasingly complex power networks. The confluence of information technology and advanced measurements (synchrophasors, smart meters, etc.) enables the conceptualization of the power network at a granularity where both generation stations and individual appliances are simultaneously visible. These resources to "think fast and big" provide the Smart Grid with the capacity to engage load as a resource to manage intermittency in power supply and demand, and to find the most optimal solutions in real time for operational efficiency and resilience to disturbances. This has heralded the Smart Grid as the enabler for energy-related challenges such as energy sustainability, fossil fuel emissions into the environment, demand growth, aging asset bases, and increased complexity in the power grid [55].

Various approaches have been proposed among stakeholders to envision and categorize the features of the smart grid. Accordingly, the smart grid is characterized as a power grid with five major features, namely reliability, renewable resources, demand response, electric storage, and electric transportation [55]:

1. Distributed energy resources (DER)
Great penetration of renewable energy resources such as wind power and solar power in order to mitigate the dependency on fossil fuels from strategic, environmental, and economic points of view. The introduction of solar and wind energy as a major share of the supply presents significant challenges to the power grid. Since we need to ensure a stable and reliable power supply, the inherent uncertainty and rapid fluctuations in these sources make prediction and estimation tasks in the power grid a major challenge. A great bulk of the literature on the smart grid is devoted to developing techniques to tackle this problem.

2. Demand Response (DR)
Currently, 20% of the generation capacity exists to meet peak demand only (i.e., only for 5% of the time), which incurs high operational costs on the power network [17]. By reducing the amount of spinning reserve that electric utilities have to keep on stand-by, we can save billions of dollars. Demand side management (DSM) techniques refer to the methods implemented by the utility companies to manage and shape the consumption profile at the consumer side in order to have a smoother usage trend [51]. This is called peak curtailment or peak leveling. Demand response reflects the automated two-way communication between the generation side and consumers so that the consumers can respond and coordinate their behaviors according to the conditions imposed by the power grid. Electricity price serves as the most powerful and practical incentive for the customers to change their consumption behaviors. For instance, the customers will tend to shift their deferrable energy usages from high-price peak periods to lower-price off-peak periods.

3. Distributed Storage (DS)
Considering the presence of demand response and distributed generation throughout the network, it is natural to think of a distributed storage system within the grid. In fact, distributed storage is an ancillary tool to provide global stability and reliability in the network. The excess energy generated at off-peak hours can be effectively stored to be used later in high-demand periods. Massive battery storage, which is technically being enabled by recent developments in the industry, appears to be the most promising storage tool.
Moreover, in some contexts demand response is considered a virtual storage capability of the system. In other words, DR is a tool to save the energy produced at peak hours so that it can be used later in off-peak periods.

4. Electric transportation (PHEV)
Plug-in hybrid electric vehicles are considered a major component of the future generation of the transportation system. These vehicles interact in two ways with the power grid. First, they are a major source of electricity consumption. It is estimated that they will consume 600 TWh/year, assuming 30 kWh for a 100-mile trip [55] and 10,000 miles per year for 200 million vehicles in the US [55]. Second, they can be viewed as distributed storage devices which can be charged during off-peak hours and discharged in high-demand peak hours. Thus, modeling the interactions between PHEVs and the smart grid is a challenging field of research which has attracted a great share of the research on the smart grid.

5. Reliability
Cohesive integration of all the above capabilities into the grid can yield a flatter total demand, which will further augment the reliability of the whole system. In this sense, the smart grid can be envisioned as a complex system with multiple components that interact in a complicated manner to produce the overall system properties. As such, investigating reliability issues in the smart grid is crucial in order for the global system goals to be accomplished.

Smart oilfields, also known as digital oilfields, combine sensor information with real-time data analytics to improve the production process. Currently, big oil companies use seismic data along with geologic data to discover new fields and to estimate the residual oil or gas in place. Moreover, the stream of data from field equipment back to the headquarters improves safety and production. Sensors can warn workers about wells that need repair, staff can be notified of possibly overloading compressors, and sensor data (such as pressure, temperature and volume) help in designing optimized production and maintenance plans.

Another extremely important use of data aims toward accident prevention. The hope is that data science can help in preventing disasters, such as the Deepwater Horizon oil catastrophe, before they happen. In this case, sensor data are used to monitor equipment and detect unusual patterns, such as high pressure or temperature. Even after an accident, data analysis can be used to rapidly analyze the outcomes and provide the decision makers with estimates and analyses to be used in resolving the issue.

Smart grids and smart oilfields are two instances of so-called complex systems. A complex system is characterized by several features, namely: incorporating many components and interactions; an aggregated system behavior that differs from the sum of the behaviors of its individual parts (emergent/holistic property); and significant changes in the system output caused by small variations in the input (chaotic) [81]. One of the challenges in analyzing complex systems, in addition to the complexity itself, is the appropriate approach to dealing with uncertainty in the system. Many contributing components in a complex system exhibit random behavior, for which a systematic approach must be undertaken.

1.1 Summary of Thesis Work

This research hinges upon quantifying uncertainty from a data-driven perspective in two energy-related domains, namely smart grids and smart oilfields.
In this regard, we identify some of the challenges encountered in these systems and address them from a data-driven point of view while accounting for the sources of uncertainty.

The first problem discussed in this thesis is shaping the consumption profile of customers in order to reduce the peak electricity consumption in the context of the smart grid. Although peak curtailment has been the subject of much research in traditional power systems, the recent functionalities provided by installing smart sensors at the household level have recast this problem in a new format. Using the two-way communication systems enabled by the smart grid transformation, the utility companies are able to actively affect the consumption profile of customers by sending them real-time price signals and collecting their responses. This makes demand management an achievable and practical goal. Here, we propose a scheduling algorithm for electric appliances in residential buildings that receive real-time and uncertain electricity prices. We consider various appliance categories as well as a heterogeneous consumer population to recommend an automatic schedule for programmable devices in a house. Moreover, we capture and include the effect of uncertainties in electricity price estimates.

Another challenging problem faced in the context of the smart grid is load forecasting. The challenge, other than the intertwined uncertainties, is due to the high dimension of the parameters that shape demand behavior. Historical consumption, weather variables, time parameters, and price signals are among the contributing factors. Traditional approaches, such as stochastic time series models, need a prohibitively large number of parameters to be calibrated as well as many prior assumptions. Additionally, these models fail in treating the fine-resolution data provided by the smart sensors at the household level. In this thesis, we propose using Bayesian networks as a density estimation tool in complex systems to forecast aggregate and individualized consumption patterns using a real data set provided by Pacific Northwest National Laboratory (PNNL). Bayesian networks provide a single framework for density estimation, probability propagation, and inference in complex systems.

In this thesis, we also study a prediction problem in a smart oilfield using a data-driven approach. Traditionally, hydrocarbon reservoirs are characterized by inverse modeling, also known as history matching in this discipline. This procedure is computationally intensive and requires numerical simulations of the fluid flow in the reservoir. Here, we build a purely data-driven model which is capable of predicting quantities of interest at the wellbores (known as wellbore signatures). This rapid prediction mechanism provides valuable prior estimates of wellbore signatures and can be used as a risk assessment tool in case of unforeseen accidents. We also study the problem of high-dimensional statistics involved in hydrocarbon reservoirs and strive to project the problem onto a lower-dimensional setting.

1.2 Thesis Outline

In chapter 2, we provide a scheduling framework based on a chance-constrained optimization approach to manage demand in the context of smart grids. We investigate the effect of price uncertainty on optimal behavior and study the model performance for various classes of customers.
In chapter 3, we use Bayesian networks to propose a single framework for forecasting the aggregate and individualized electricity consumption of customers who receive price incentives from the utility companies. Taking into account many contributing factors, we are able to quantify the uncertainty in the prediction.

In chapter 4, we propose a purely data-driven model which can be used to make rapid estimates of desired quantities of interest in hydrocarbon reservoirs, specifically in the Gulf of Mexico. The method builds upon Bayesian networks that are trained from observed data at some wellbores and tries to estimate wellbore signatures at sites with limited data. This approach can serve as a quick forecasting framework in lieu of the conventional inverse modeling approach.

In chapter 5, we strive to discover the embedded structure in the high-dimensional data and use it to project the data into a lower-dimensional space. We study the same problem as discussed in chapter 4 in a low-dimensional space and compare the performance of the two approaches in predicting wellbore signatures.

In chapter 6, we summarize our contributions and discuss their advantages and potential drawbacks. We also discuss potential future work to extend the contributions of this thesis.

Chapter 2

Demand Management in Smart Grids

2.1 Introduction

Currently, 20% of power generation capacity is latently available just to meet peak demand, incurring high operational costs on the power network [17]. As peak demand occurs about 5% of the time, significant savings can be achieved by "peak curtailment" or "peak leveling", i.e., using demand-side management (DSM) or demand response techniques to manage and shape consumption profiles at the consumer side in order to achieve smoother usage trends [51]. One type of demand response uses automated two-way communication between power utilities and consumers so that consumers can respond and coordinate their behaviors according to conditions imposed by the power grid. Electricity prices serve as an efficient and practical incentive for customers to change their consumption behaviors. For instance, with proper incentives, customers could likely shift their deferrable energy usage from high-price peak periods to lower-price off-peak periods.

A reasonable assumption that could underlie predictive models of demand response is that consumers will optimize their respective consumption in response to dynamic electricity prices. To do so, they each need to solve an optimization problem to find a schedule for appliances that is optimal with respect to their preferences. In practice, these algorithms can be solved by energy consumption scheduling (ECS) units which are embedded in the smart meter of every house. Although real consumer behavior will deviate from the optimal scheme, this approach provides an appropriate benchmark for decision making purposes, both for individual customers at the household level and for power system operators in managing generation and power delivery systems (e.g., projecting electricity prices and the amount of demand response based on the optimal scheme with consideration of deviations, and preparing resources accordingly). Finding the optimal schedule of operation for appliances in a household subject to real-time and uncertain prices remains a challenge in the smart grid context, and constitutes the main contribution of the present chapter.
A challenge in addressing this problem consists of accounting for the various uncertainties in characterizing the underlying model. Clearly, properly incorporating these uncertainties into the model-based predictions can significantly increase the effectiveness of ensuing decisions. The major source of uncertainty in the present context is the price of electricity. The ECS scheduler requires some estimate of the future electricity price, say for the next 24 hours, to be able to find the optimal decision at the current point in time.

In this chapter, we propose a model to find the optimal schedule of operation for smart appliances in a household on one typical day, taking into account uncertainty in future electricity prices. We define a chance-constrained optimization problem and use robust optimization techniques to tackle it. Robust optimization is a rather new technique in modeling uncertainty, which is most suitable for situations in which no exact probability distribution of the uncertain parameter is available [5]. Instead, we require the solution to be feasible for all instances of the uncertain parameter that lie in an uncertainty set. This yields an optimal solution that is robust with respect to the probability distribution of future electricity prices. As part of our model, we propose a new categorization scheme for appliances in a given house. This detailed appliance model aims to more realistically capture the properties of the devices in a house.

Given the above model, we will discuss finding the most appropriate uncertainty set associated with our specific problem. We then solve this task for a single customer as the first step. Investigating the model performance in a sample heterogeneous population is another challenging and relevant topic that we address in this study. From the utility company's perspective it is crucial to have an estimate of the aggregate consumption behavior in a region assuming that all customers are behaving optimally. This can be a useful benchmark for decision making purposes. To conduct this analysis, we initially perform sensitivity analyses to find the most crucial parameters of the model that can trigger different behaviors among different people. Based on a sensitivity analysis, we categorize people according to their socio-economic characteristics. We discover a much more significant dependence of consumption patterns on employment status than on the income level of customers.

Generally speaking, various time scales can be used for the measurement, aggregation and decision-making phases of a problem. In this work, we seek to know whether the aggregation time scale can change the outcomes of the optimal scheduling problem or not. We will use multiple time scales to answer this question. Answering it helps the utility companies find appropriate time intervals for communicating with the customers in terms of sending price signals. It will also help customers understand the required frequency for updating their behaviors. We use real electricity price data from the Pacific Northwest National Laboratory (PNNL) pilot project called "Olympic Peninsula" to conduct our simulations [9].

This chapter is organized as follows: section 2.2 reviews the literature on intelligent scheduling in smart homes. The robust optimization method and the robust counterpart approximation for chance-constraints are discussed in section 2.3.
Our proposed model to find the optimal load schedule is presented in section 2.4, followed by the subsequent simulation results and discussions in section 2.5. Then, the problem is extended to the case of a heterogeneous population in section 2.6. Section 2.7 discusses the significance of the aggregation time scale in the model outcomes.

2.2 Related Work

Smart scheduling is traditionally formulated as an optimization problem in which the consumers wish to minimize their electricity costs [53], [50], [35], [44], [73] or maximize a weighted sum of their utility function (i.e., level of satisfaction) minus the cost [45], [68], [67], [62]. This objective is combined with some constraints on the operating condition of appliances, reflecting customer preferences [53], [45]. In some cases, the authors introduce a so-called "inconvenience term" or "waiting cost" into their objective functions to account for the discomfort caused by the delay in satisfying customers' electricity needs [38].

We categorize technical contributions to this problem into two general classes. The first class consists of those which formulate a deterministic scheduling problem [50], [35], [44], [68], [8], [51], [52], [25], [73], [45], whereas the papers falling into the second category integrate some kind of uncertainty into their models [62], [12], [59], [37], [77].

In [50] and [44], the energy consumption of appliances is optimally scheduled based on a time-varying curve of electricity price and available power. A three-phase strategy including prediction, scheduling and real-time control of appliances within a building is studied by Kang et al. in [35]. Mohsenian-Rad et al. [51] find the optimal schedule for one customer and then extend it to a group of customers borrowing concepts from game theory. In [52], the authors combine the problem of finding the optimal consumption schedule with future price prediction in a real-time pricing environment. They use a weighted average price prediction filter to estimate the unknown future prices. A similar work by Samadi et al. [68] simultaneously solves the scheduling problem on the consumer side and the pricing problem on the power supplier side. [8] proposes an automatic coordination system to schedule major household appliances in order to achieve peak reduction. The authors in Ha et al. [25] use a mixed integer linear program to concurrently control the electrical energy consumption and production in a single dwelling.

In an attempt to model the appliances in more detail for energy management purposes, the authors in [73] incorporate constraints to account for the uninterruptible and sequential modes of operation in a mixed-integer linear programming framework. On the same thread of detailing the appliance model, [45] introduces four general types of appliances, each with a particular utility function.

All the previously mentioned papers include no source of uncertainty in their models. However, it is undeniable that uncertainty is dramatically intertwined with this particular problem and ignoring it is likely to yield sub-optimal results. The subsequent parts examine the literature accounting for uncertainty in power scheduling problems. One of the earliest works to consider price uncertainty was put forward by Conejo et al. in [12]. The authors develop an ARIMA model to predict confidence intervals for future energy prices; based on that, they formulate a robust schedule optimization problem.
Kishore et al. [38] propose a dynamic programming algorithm in which a transition matrix models the evolution of appliance modes over time. A more general case is studied in [59], in which both price and customer demand are modeled as Markov chains with unknown transition matrices, which are estimated with a Q-learning algorithm. In another approach, the authors in [37] assume a known density function for the future price and subsequently formulate a Markov decision process (MDP) to find decision thresholds for interruptible and uninterruptible loads. Likewise, Turitsyn et al. [77] introduce an MDP focusing primarily on detailed models of appliances. However, no learning algorithms are suggested to estimate the transition probabilities. In an entirely different approach [62], building occupancies, which in turn shape the demand profile, are assumed to be the source of uncertainty. Stochastic optimization techniques using optimal scenario trees are used to solve the problem. Each of the above papers tackles the uncertainty issue in a different manner; nevertheless, it is argued in all of them that incorporation of uncertainty is crucial since it improves the system's overall performance.

Generally, in the papers in which a detailed appliance model is proposed, either a discrete or a continuous variable characterizes the consumption mode/amount of all kinds of appliances. However, we argue that in reality both kinds of appliances are present at the same time. For some appliances, such as dimmable lights, it is more appropriate to have a continuous variable. Conversely, for some other appliances such as dishwashers, the operation mode is best modeled by a discrete variable. Assigning a continuous variable to these kinds of appliances is unreasonable since they can only consume some pre-specified and discrete amount of electricity in each period of time. In order to have a more realistic model, we propose a mixed set of continuous and discrete appliances to be present in the home at the same time.

2.3 Robust Optimization

The parameters that characterize a physical system are usually subject to uncertainty, mostly due to the following reasons:

• The value of some of the parameters (future demands, returns) is not known in advance and thus the modeler replaces them with their forecasts. This kind of uncertainty is referred to as prediction error.
• For some parameters (industrial devices/processes) the exact value perturbs slightly around the "nominal" measured value. These parameters are subject to measurement errors.
• In practice it is almost impossible to implement some of the decision variables exactly as recommended by the design stage results. This leads to a so-called implementation error, which can be effectively modeled through some artificial data uncertainties.

In many applications, even small perturbations in the data can make the nominal optimal solution meaningless [5]. To deal with uncertainty in optimization problems, two major approaches are commonly used, namely stochastic optimization and robust optimization. In stochastic optimization (SO), we assume that the uncertain data follow some known probability distribution. This assumption might be challenging in some applications for three reasons:

1. Conceptually, many parameters do not have a stochastic nature while still having some fluctuations in a particular domain.
2. Even in many applications with data of a stochastic nature, it is still problematic to find the correct probability distribution of the uncertain parameters.
3. In some large-scale problems, finding an efficient solution to the stochastic problem might be computationally expensive.
Robust optimization is a rather new approach to handling data uncertainties in optimization problems. The method is able to address some of the challenges mentioned above. In this approach the problem is formulated in such a way as to accommodate incomplete knowledge about the probability model of the data. We then seek an optimal solution that is robust to this lack of knowledge. Robust optimization provides a specific mathematical structure for these imprecise data models, so as to permit the development of efficient and convergent algorithms.

In a broad classification, we can distinguish between probabilistic and non-probabilistic robust optimization. The non-probabilistic methods mostly rely on worst-case analyses and Wald's maximin model [79]. Accordingly, an optimal solution is sought that corresponds to the worst-case state of the parameters in their respective domains [47], [66], [6]. In this section we initially discuss the non-probabilistic approach to give a general idea of the method; then we elaborate on probabilistic robust optimization, specifically chance-constrained linear programs, which we pursue in this chapter [5].

2.3.1 Non-Probabilistic Robust Optimization

Consider a linear optimization (LO) problem of the form

$$\min_{X} \; C^T X + D \quad \text{s.t.} \quad AX \le B \qquad (2.1)$$

where $X \in \mathbb{R}^n$ is the vector of decision variables, $C \in \mathbb{R}^n$, $D \in \mathbb{R}$, $B \in \mathbb{R}^m$ and $A$ is an $m \times n$ matrix. We can arrange the data of this problem into an $(m+1) \times (n+1)$ data matrix $\begin{bmatrix} C^T & D \\ A & B \end{bmatrix}$. In an uncertain setting, every entry in this data matrix is subject to fluctuations. The "uncertainty set" $\mathcal{U} \subset \mathbb{R}^{(m+1)\times(n+1)}$ refers to the set containing all possible variations of the data matrix; it is a subset of $\mathbb{R}^{(m+1)\times(n+1)}$ which encompasses all realizations of the uncertain data.

Assume $S$ is a deterministic upper bound on the objective function of (2.1); thus we can rewrite (2.1) as

$$\min_{S,X} \; S \quad \text{s.t.} \quad C^T X - S \le -D, \quad AX \le B \qquad (2.2)$$

where the objective is certain and the only uncertainties appear in the constraints. In fact, a general uncertain LO problem can always be reformulated as an uncertain LO problem with a certain objective. Thus, without loss of generality, we focus on a single uncertain linear constraint and try to find its feasible solutions while its parameters fluctuate in an uncertainty set. Consider the constraint $a^T X \le b$, or equivalently,

$$\sum_{i=1}^{n} a_i x_i \le b \qquad (2.3)$$

The underlying idea is to replace this uncertainty-affected constraint with a so-called deterministic "robust counterpart" to find the feasible $x$. To this end, assume each uncertain entry can be modeled by an affine parameterization [46],

$$a_i = \bar{a}_i + \zeta_i \hat{a}_i, \quad i = 1, \ldots, I-1, \qquad b = \bar{b} + \zeta_I \hat{b} \qquad (2.4)$$

in which $\bar{a}_i$ is the nominal value of the parameter, $\hat{a}_i$ represents a positive constant perturbation (the same applies to $\bar{b}$ and $\hat{b}$), $\zeta_i$ is a random variable modeling the perturbations, and $I$ is the number of uncertain coefficients in (2.3), assuming that $b$ is also subject to uncertainty. The above model maps the uncertainty from the uncertainty set to the space of random variables $\zeta$.
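As a concrete illustration of the parameterization (2.4), with hypothetical numbers that are not taken from the price data: if the nominal value of a price coefficient is 50 $/MWh and its largest plausible deviation is 15 $/MWh, then

$$a_1 = \bar{a}_1 + \zeta_1\,\hat{a}_1 = 50 + 15\,\zeta_1, \qquad \zeta_1 \in [-1, 1],$$

so that $a_1$ sweeps the interval $[35, 65]$ $/MWh as $\zeta_1$ ranges over its unit box.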
We can rewrite (2.3) based on (2.4) as follows:

$$\sum_{i=1}^{I-1} \hat{a}_i x_i \zeta_i - \hat{b}\,\zeta_I \le \bar{b} - \sum_{i=1}^{n} \bar{a}_i x_i \qquad (2.5)$$

Let us define the deterministic vector $K$ as

$$k_i = \begin{cases} \hat{a}_i x_i, & 1 \le i \le I-1 \\ -\hat{b}, & i = I \end{cases} \qquad (2.6)$$

Thus, inequality (2.5) is written as

$$\sum_{i=1}^{I} k_i \zeta_i \le \bar{b} - \sum_{i=1}^{n} \bar{a}_i x_i \qquad (2.7)$$

To find the robust counterpart of the above inequality, consider a simple case in which the $\zeta_i$ belong to a unit box,

$$\text{Box}_1 \equiv \{\zeta \in \mathbb{R}^I : \|\zeta\|_\infty \le 1\} \qquad (2.8)$$

In this case, by adopting a worst-case approach, (2.7) reduces to

$$\max_{-1 \le \zeta_i \le 1} \; \sum_{i=1}^{I} k_i \zeta_i \le \bar{b} - \sum_{i=1}^{n} \bar{a}_i x_i \qquad (2.9)$$

Thus, we arrive at the following robust counterpart:

$$\sum_{i=1}^{I} |k_i| + \sum_{i=1}^{n} \bar{a}_i x_i \le \bar{b} \qquad (2.10)$$

It is obvious that every feasible solution to this deterministic robust counterpart satisfies the original uncertain inequality (2.3) for all realizations of the uncertain parameters.

2.3.2 Probabilistic Robust Optimization

By using the worst-case approach, we over-emphasize the worst-case scenario and could therefore get overly conservative solutions. Instead of the worst-case scenario, we can adopt a probabilistic approach in which constraints are enforced with sufficiently high probability, say $1 - \epsilon$. Here, $\epsilon \in [0, 1]$ is a pre-specified small tolerance. The feasible values of $x$ are calculated from the deterministic robust counterpart, which we find as follows. First, denote the left-hand side of inequality (2.7) by a new random variable $\eta$,

$$\eta = \sum_{i=1}^{I} k_i \zeta_i \qquad (2.11)$$

Note that by introducing $\eta$ we map all the uncertainties from the space of random variables $\zeta$ to a single random variable $\eta$. In other words, the whole uncertainty in the original inequality is now represented by $\eta$, and (2.7) can be rewritten as

$$\eta \le \bar{b} - \sum_{i=1}^{n} \bar{a}_i x_i \qquad (2.12)$$

We can now introduce a chance-constrained (probabilistic) version of (2.12) as follows:

$$P_\eta\!\left\{\eta > \bar{b} - \sum_{i=1}^{n} \bar{a}_i x_i\right\} < \epsilon \qquad (2.13)$$

where $P_\eta$ is the probability distribution of $\eta$. This probabilistic constraint is computationally intractable [5]. So, we need to find a tractable robust counterpart for this problem in such a way that every feasible solution to it satisfies our original intractable problem. Note that (2.12) imposes a constraint on a random variable, while (2.13) defines a constraint on a probability measure of that random variable.

As discussed earlier, robust optimization methods seek to find the optimal solution when only partial information on the uncertain parameter is available. Consider the case when we know only the mean and support of the uncertain coefficients. Moreover, assume that the mean is located at the midpoint of the support. According to (2.4), this piece of information can be equivalently translated in terms of the $\zeta_i$ as follows:

$$E\{\zeta_i\} = 0, \qquad |\zeta_i| \le 1, \qquad i = 1, \ldots, I, \qquad \text{with } \{\zeta_i\}_{i=1}^{I} \text{ statistically independent.} \qquad (2.14)$$

Lemma. Consider some deterministic coefficients $z_i$, $i = 1, \cdots, I$, and assume that (2.14) holds; then for every $\Omega \ge 0$ it can be shown [5] that

$$P_\zeta\!\left\{\sum_{i=1}^{I} z_i \zeta_i > \Omega \sqrt{\sum_{i=1}^{I} z_i^2}\right\} \le \exp(-\Omega^2/2) \qquad (2.15)$$

By setting $z_i = k_i$ in the above lemma we get

$$P_\eta\!\left\{\eta > \Omega \sqrt{\sum_{i=1}^{I-1} (\hat{a}_i x_i)^2 + \hat{b}^2}\right\} \le \exp(-\Omega^2/2) \qquad \forall\, \Omega \ge 0 \qquad (2.16)$$

Now we can compare (2.16) with (2.13). We know that (2.16) always holds ($\forall\, \Omega \ge 0$), but we are looking for solutions to (2.13). If we choose $\Omega$ such that $\exp(-\Omega^2/2) < \epsilon$ and set

$$\Omega \sqrt{\sum_{i=1}^{I-1} (\hat{a}_i x_i)^2 + \hat{b}^2} \le \bar{b} - \sum_{i=1}^{n} \bar{a}_i x_i \qquad (2.17)$$

it is guaranteed that every feasible solution to (2.17) is also a feasible answer to (2.13).
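The two counterparts derived above can be checked numerically for a fixed consumption vector. The following sketch uses hypothetical prices and consumptions (not the thesis data or code) and treats the budget $b$ as certain, so the $\hat{b}$ terms drop out of (2.10) and (2.17).

```python
import numpy as np

# Hypothetical nominal prices (a_bar), perturbation half-widths (a_hat),
# a fixed hourly consumption vector x, and a budget b_bar; b is treated as
# certain here, so the hat-b terms in (2.10) and (2.17) vanish.
a_bar = np.array([0.05, 0.08, 0.12, 0.07])   # $/kWh nominal prices
a_hat = np.array([0.01, 0.02, 0.03, 0.02])   # price perturbation bounds
x     = np.array([2.0, 1.0, 0.5, 1.5])       # kWh consumed in each period
b_bar = 0.60                                  # budget in dollars

nominal_cost = a_bar @ x

# Box (worst-case) robust counterpart (2.10):
#   sum_i |a_hat_i * x_i| + sum_i a_bar_i * x_i <= b_bar
box_lhs = np.abs(a_hat * x).sum() + nominal_cost
print("box counterpart satisfied:   ", box_lhs <= b_bar)

# Safe approximation of the chance constraint (2.17):
#   Omega * sqrt(sum_i (a_hat_i * x_i)^2) <= b_bar - sum_i a_bar_i * x_i,
# with Omega chosen so that exp(-Omega^2 / 2) <= epsilon.
epsilon = 0.05
omega = np.sqrt(2.0 * np.log(1.0 / epsilon))
ball_lhs = omega * np.sqrt(((a_hat * x) ** 2).sum())
print("chance counterpart satisfied:", ball_lhs <= b_bar - nominal_cost)
```

As expected, the ball-shaped term in (2.17) is never larger than the box term in (2.10), which is why the probabilistic counterpart is less conservative than the worst-case one.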
Thus, we can claim that (2.17) is a robust counterpart of the chance-constrained problem. In other words, every feasible solution to this robust counterpart satisfies the inequality in (2.3) with probability at least $1 - \epsilon$. We should note that there might be some values of $x$ that satisfy the chance-constraint (2.13) but not the robust counterpart (2.17). This means that using this approach might lead to losing some regions of the feasible space for $x$.

In this section we derived the robust counterpart for the case in which only the mean and support of the uncertain parameters are known (and the mean is in the middle of the support). This partial information was represented by the assumptions (2.14) on the space of random variables $\zeta$. However, we may have other partial information on the distribution of the uncertain parameters, such as knowing only the support, or knowing that they are Gaussian with some bounds on their means and variances. By defining appropriate properties of $\zeta$ for each case, the corresponding robust counterpart can be derived analytically. The reader is referred to [5] for detailed discussions.

2.4 Chance-Constrained Optimization Model for Optimal Load Scheduling

In this section, we present the problem setup to find the optimal load schedule for appliances in a house in an uncertain price environment. Our approach is based on defining a chance-constrained optimization problem, for which we will use the robust optimization method to find the optimal solution. In order to find the most appropriate uncertainty set for the uncertain parameter, we will carry out a series of sensitivity analyses. This will in fact provide a mechanism to assess the robustness of the model. In section 2.6, we will extend this problem to a heterogeneous population of electricity consumers and investigate the aggregate behavior.

Let $A$ denote the set of appliances in a given arbitrary house. We classify the appliances into two main groups, namely continuous and discrete. By a continuous appliance, we mean an appliance that can consume an arbitrary non-negative amount of power in any given time period. This amount can be adjusted by the automatic scheduler unit. Rechargeable electronic appliances and dimmable lights are two examples of this class. On the other hand, the discrete term refers to those appliances that have only a finite number of operating modes. For instance, TVs or clothes washers consume approximately some pre-known amount of electricity when they are on. In this work, we assume binary discrete appliances, which take on and off modes.

From another perspective, we assume that appliances can have time-variant or time-invariant utility functions. For appliances such as electric vehicles or dishwashers, it is only required that the job is completed within a specified time frame. The customers do not necessarily care whether the device is on or off at a particular time as long as the job is completed. Accordingly, we can assume a constant utility function for finishing the task. These are examples of time-invariant utility functions. On the other hand, customers care highly about the time of use for some other appliances such as the TV or light. The utility in this case would be a function of time. These are called time-variant utility functions. Based on the discussions above, we propose four categories for appliances in a house, as shown in Table 2.1.

Table 2.1: Appliance categories.

|            | Time-invariant Utility                 | Time-variant Utility |
|------------|----------------------------------------|----------------------|
| Continuous | Plug-in hybrid electric vehicle (PHEV) | Light                |
| Discrete   | Dishwasher                             | TV                   |
To make it more tangible, the table gives an example for each category. Let us represent the sets of discrete appliances, $A_d$, having time-variant and time-invariant utilities by $A_{d,v}$ and $A_{d,I}$, respectively. Likewise, let $A_{c,v}$ and $A_{c,I}$ denote the sets of continuous appliances with time-variant and time-invariant utility functions, and let $A_c$ stand for the set of all continuous appliances.

By solving the scheduling problem, we seek to find the consumption vector for all appliances over a given time horizon. Let $\tau = \{1, \ldots, T\}$ be the desired scheduling horizon, indexed by $t \in \tau$. For each discrete appliance $a \in A_d$, we need to find the optimal scheduling vector $Y_a = (y_{a,1}, \ldots, y_{a,T})$, in which $y_{a,t} \in \{0, 1\}$ is a binary variable. We assume that discrete appliance $a$ consumes a pre-known amount $e_a$ of electricity when it is "on". In the same fashion, for each continuous appliance $a' \in A_c$, we need to find the optimal scheduling vector $X_{a'} = (x_{a',1}, \ldots, x_{a',T})$, in which $x_{a',t} \in \mathbb{Z}_{\ge 0}$.

The uncertain parameter in our problem, the electricity price, is denoted by $p_t$. We will discuss finding the best uncertainty set for the price in the following sections.

The utility function is an abstract concept which reflects the customer's satisfaction level. In fact, it quantifies the monetary value a consumer hypothetically allocates to consuming one kWh ($/kWh) to reflect the relative task priorities. The utility function at each time step for discrete and continuous appliances is represented by $u_{a,t}$ and $u_{a',t}$ respectively. As discussed earlier, some appliances have constant utility functions. For a continuous appliance with constant utility ($a' \in A_{c,I}$), we have $u_{a',t} = \alpha_{a'}$. Likewise, we represent the constant utility for discrete time-invariant appliances ($a \in A_{d,I}$) by $u_{a,t} = \beta_a$. We will use the concept of a utility function as well as some other timing constraints to incorporate customer preferences into the model.

At this juncture, we can introduce our scheduling problem. The underlying idea is that we assume consumers wish to consume as much as possible as long as they do not have to pay more than a pre-specified budget. Thus, they aim to maximize their utility function subject to their electricity bill not exceeding a target value. We complement this problem with some other constraints to reflect customer preferences and physical restrictions on devices. This formulation provides customers with direct control over their daily costs, as opposed to current methodologies in which the cost is entangled with utility and thus cannot be controlled directly. As such, the parameters introduced in the chance-constraint enable customers to easily adjust the trade-off between comfort and electricity cost. Moreover, the present formulation enables us to efficiently incorporate and quantify the uncertainty in the model.

2.4.1 Objective Function

We assume every customer wishes to obtain the most possible benefit through electricity consumption. This can be stated as maximizing the utility function over all required services,

$$\max_{X_{a'},\, Y_a} \; \sum_{a' \in A_c} \sum_{t \in \tau} U_{a',t}\, x_{a',t} + \sum_{a \in A_d} \sum_{t \in \tau} U_{a,t}\, e_a\, y_{a,t} \qquad (2.18)$$

We should note that $U_{a',t}$ and $U_{a,t}$ take the constant values $\alpha_{a'}$ and $\beta_a$, respectively, when dealing with time-invariant appliances. In these cases the utility is assigned for completing the task. For all other appliances we assume a piece-wise linear utility function, which is determined based on customer preferences.
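To illustrate how (2.18) is assembled, the following sketch evaluates the objective for candidate schedules of the four example appliances. The utility profiles, consumption levels, and schedules are hypothetical placeholders for whatever the customer specifies; this is not the thesis implementation.

```python
import numpy as np

T = 24  # hourly scheduling horizon

# Hypothetical utility profiles ($/kWh); in practice these encode the
# customer's stated priorities and preferred time windows.
u_light = np.zeros(T); u_light[18:23] = 1.0                  # continuous, time-variant
u_tv    = np.zeros(T); u_tv[18:20] = 1.0; u_tv[20:23] = 2.0  # discrete, time-variant
u_phev  = np.full(T, 1.0)                                    # continuous, time-invariant (alpha)
u_dish  = np.full(T, 0.5)                                    # discrete, time-invariant (beta)

e_tv, e_dish = 0.2, 1.0  # kWh drawn per "on" period by the discrete appliances

def objective(x_light, x_phev, y_tv, y_dish):
    """Utility objective (2.18) for two continuous and two discrete appliances;
    the x_* arguments are kWh vectors and the y_* arguments are 0/1 vectors."""
    continuous = u_light @ x_light + u_phev @ x_phev
    discrete = u_tv @ (e_tv * y_tv) + u_dish @ (e_dish * y_dish)
    return continuous + discrete

# Example: evening light and TV, overnight PHEV charging, dishwasher at noon.
x_light = np.zeros(T); x_light[18:23] = 1.0
x_phev  = np.zeros(T); x_phev[0:5] = 3.0
y_tv    = np.zeros(T); y_tv[18:23] = 1.0
y_dish  = np.zeros(T); y_dish[12] = 1.0
print("objective value:", objective(x_light, x_phev, y_tv, y_dish))
```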
2.4.2 Chance-Constraint

Although the customers would like to consume the most, there is always some constraint on the available budget. We argue that imposing a deterministic budget constraint leads to very restrictive, worst-case answers. However, if we relax the constraint by defining a chance-constraint, we can avoid worst-case answers and still obtain reasonable solutions. The chance-constraint then assumes the following form:

$$\text{Prob}\!\left\{\sum_{t \in \tau} P_t \sum_{a' \in A_c} x_{a',t} + \sum_{t \in \tau} P_t \sum_{a \in A_d} e_a\, y_{a,t} \le \gamma\right\} \ge 1 - \epsilon \qquad (2.19)$$

This constraint probabilistically requires that the daily total electricity cost be less than or equal to the customer's acceptable value. We denote this customer-specific threshold by $\gamma$ and will refer to it hereafter as the "maximum budget". $P_t$, the electricity price, is the only uncertain parameter in the above constraint. We will elaborate on how to model this parameter in the subsequent sections.

Let us elaborate a little more on our chance-constraint, which forms the core of our robust optimization approach. We are requiring that the probability of the total daily bill exceeding $\gamma$ be less than or equal to $\epsilon$. For example, we may require that the probability of the total bill being greater than $2 be less than 5%. In other words, the constraint should be satisfied for at least 95% of the realizations drawn from the specified distribution of the random parameter (price). Chance-constraints are suitable tools for accommodating the uncertain parameter, price in our work, into the model. We declare that price takes random values from a general distribution with given characteristics. The robustness of our approach arises from the fact that we do not assume any particular probability distribution for the price. Instead, knowing some general characteristics of the distribution, such as its first or second moment or its support, suffices.

2.4.3 Physical Constraints

We further need to bound the consumption of continuous appliances at each time step based on their physical specifications:

$$x_{a'}^{\min} \le x_{a',t} \le x_{a'}^{\max} \qquad \forall a' \in A_c, \; \forall t \in \tau \qquad (2.20)$$

2.4.4 Customer Preferences

As mentioned earlier, for some appliances such as dishwashers or PHEVs the customer cares only about finishing the task within a specified time frame. To ensure task completion we require the total consumption to lie between two bounds. The relevant constraints for discrete and continuous appliances are defined in (2.21) and (2.23). We assume that the acceptable time frames for finishing the task, $\tau_1$ and $\tau_2$, are set by the customer. Moreover, we need to make sure that no consumption is allocated to the remaining time periods. These constraints are expressed mathematically as follows:

$$b_a \le \sum_{t \in \tau_1} e_a\, y_{a,t} \le c_a \qquad \forall a \in A_{d,I} \qquad (2.21)$$

$$y_{a,t} = 0 \qquad \forall a \in A_{d,I}, \; \forall t \in \tau - \tau_1 \qquad (2.22)$$

$$b_{a'} \le \sum_{t \in \tau_2} x_{a',t} \le c_{a'} \qquad \forall a' \in A_{c,I} \qquad (2.23)$$

$$x_{a',t} = 0 \qquad \forall a' \in A_{c,I}, \; \forall t \in \tau - \tau_2 \qquad (2.24)$$

As another constraint imposed by the customer, we assume that for some appliances, such as the TV, the customer cares only about the number of time periods the device is on throughout the desired scheduling interval $\tau_3$. Accordingly, she prefers this number to lie between lower and upper bounds represented by $d_a^{\min}$ and $d_a^{\max}$ respectively. We also make sure that the device is turned off in the remaining time slots.
This yields the following equations, d min a ≤ X t∈τ 3 y a,t ≤d max a ∀a∈A d,v (2.25) 29 y a,t = 0 ∀a∈A d,v ,∀t∈τ−τ 3 (2.26) 2.4.5 The Optimization Problem To sum, our proposed optimization problem is as follows, max X 0 a ,Ya X a 0 ∈Ac X t∈τ U a 0 ,t x a 0 ,t + X a∈A d X t∈τ U a,t e a y a,t (2.27) Subject to, Prob X t∈τ P t X a 0 ∈Ac x a 0 ,t + X t∈τ P t X a∈A d e a y a,t ≤γ ≥ 1− (2.28) and Constraints (2.20) , ... , (2.26) (2.29) In the next section, we present the simulation results for our optimization problem and propose the optimal consumption plan for devices in a single house. Moreover, we will consider various uncertainty sets for the uncertain parameter, price, and discuss their effects on the problem results. 2.5 Experiments To solve the proposed model in the previous section we first need to decide about some elements of the model. We first focus on finding the most appropriate uncer- tainty set for the uncertain parameter, price. As discussed earlier, in the robust optimization methodology we are still able to solve the optimization problem with- out knowing the exact probability distribution of the random parameter. Instead, 30 we require the uncertain parameter to lie within a so-called uncertainty set. Now, the question is how to find the appropriate uncertainty set based on the available data for the uncertain parameter. We will examine various uncertainty sets and decide on the best possible one. Before looking into this problem, we will elaborate on the input parameters of the model. 2.5.1 Data Set Description Electricity price is the crucial input of our robust optimization problem. The elec- tricity price data are collected from a pilot project carried out by PNNL called "Olympic Peninsula" between April 1 st 2006 and March 31 st 2007 with a resolution of 5 minutes [9]. For the purpose of this work we average data over one hour periods, resulting in the setτ given by{1, 2,··· , 24}. We noticed that there is no significant deviation between week day and weekend prices, however, the prices show dissimilar trends in different months. This is illustrated in Figure 2.1. Subsequently, we find the best classification to be based on season with the winter starting at January. As illustrated by Figure 2.2 prices in fall and winter follow the same trend while spring and summer diagrams look alike. We solve the scheduling problem only for winter prices as a representative. We consider four appliances to be scheduled in a typical house. We select these four devices so that they can represent the four categories discussed earlier. PHEV and light represent continuous appliances with time-invariant and time-variant util- ity functions. Moreover, dishwasher and TV are instances of discrete appliances with time-invariant and time-variant utility functions. The utility values based on task priorities are shown in table 2.2. These values are quite arbitrary and since they just reflect the relative importance of the tasks, their absolute value does not 31 Figure 2.1: Mean of electricity price in different months. Figure 2.2: Mean of electricity price in different seasons. Table 2.2: Utility function values. Importance Level Utility Value ($/kwh) Low 0.25 Medium 0.5 High 1 Very High 2 32 change the results. We consider a typical employed customer who wakes up at 8am, leaves home at 9am, returns home at 6pm and goes to bed at 11pm. This customer needs to communicate her preferences to the central scheduler unit. 
She will do this by assigning some utility functions for the tasks as well as defining her preferred scheduling time periods. Since the customer is out of home during the day, her desired time interval for using light and TV would be when she returns home i.e. τ 3 = 18, 19,..., 22 . We assume that a typical customer assigns high priority (u a 0 ,t = 1) for using light for all t∈ τ 3 . Moreover, she needs to make sure that she can watch TV for all t∈τ 3 , especially from 20− 22pm, when there might be a popular TV show. Accordingly, we assume u a,t = 1 for t ∈ 18, 19 and u a,t = 2 for t∈ 20, 21, 22 . Likewise, for PHEV and dishwasher the customer sets some deadlines for finishing the task. For instance, if she wants the dishes to be washed until she comes back home, the corresponding planning interval would be τ 1 = 1, 2,..., 17 . Moreover, in order to have a fully charged PHEV in the morning she can set τ 2 = 18, 19,..., 24 ∪ 1,..., 8 . We also assign α a 0 and β a to be 1 and 0.5 for PHEV and dishwasher respectively indicating high and medium levels of importance for completing the tasks. 2.5.2 Uncertainty Sets At this time, we consider multiple uncertainty sets for the price and examine how they can affect the optimization results. We use AIMMS 3.11, an optimization soft- ware library, to solve the robust optimization problem [64]. The deterministic case, in which the price assumes deterministic values equal to the mean at each time period, is selected as the benchmark against which we evaluate the effectiveness of 33 Figure 2.3: Percentage of savings in cost with respect to deterministic case for all uncertainty sets. our approach at improving the automatic scheduling problem. Mostly, our compar- isons are based on the differences in cost and utility with respect to the deterministic case. Weconsiderfiveuncertaintysetsfortheuncertainparameter. First, weassume that the price has a bounded (m,s) distribution. This means that at each time step given the mean of price, m, the price lies in (m−s, m +s) interval. We assume s to be equal to 3σ, three times standard deviation of the available real data. The second set we consider is identical to the first set, with s =σ in order to assess the sensitivity of the results with respect to the length of the interval. We use mean and standard deviation of the real price data from the Peninsula project to estimate m and s. According to this distribution, we are assuming that the random parameter lies in a known interval for which the mean happens to be in the middle. As the third uncertainty set, we consider (l,u) distribution. In this case, we onlyknowthesupportoftheuncertainparameterwithoutanyinformationaboutthe 34 Figure 2.4: Percentage of utility loss with respect to deterministic case for all uncer- tainty sets. mean. We can easily find the lower and upper bounds of the support interval based on available price data. As the fourth category of uncertainty sets, we assume the parameter has a symmetric and unimodal distribution around an arbitrary point c. This is shown by symmetric (c,s), unimodal. In this case, the range of the uncertain parameter would be (c−s, c +s). Without loss of generality, we assume that the distribution is symmetric around its mean with the ranges equal to 3σ. In this way, we can directly compare the results for bounded (m, 3σ) and symmetric (m, 3σ), unimodal to assess the effect of symmetry and unimodality on the results. 
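Before turning to the last candidate set, it is worth sketching how the parameters of the sets introduced so far (the per-hour mean m, the deviations σ and 3σ, and the support bounds l and u) could be estimated from the hourly price records. The sketch below assumes the winter prices are arranged as a days-by-hours array; this layout, and the synthetic numbers in the example, are assumptions of the illustration rather than properties of the Olympic Peninsula data set.

```python
# Sketch of estimating uncertainty-set parameters from hourly price data.
# `winter_prices` is assumed to be a (n_days, 24) array of $/kWh values.
import numpy as np

def uncertainty_sets(winter_prices: np.ndarray) -> dict:
    m = winter_prices.mean(axis=0)        # per-hour mean price, m_t
    sigma = winter_prices.std(axis=0)     # per-hour standard deviation
    lower = winter_prices.min(axis=0)     # support lower bound, l_t
    upper = winter_prices.max(axis=0)     # support upper bound, u_t
    return {
        "bounded_m_3s": (m - 3 * sigma, m + 3 * sigma),        # bounded (m, 3*sigma)
        "bounded_m_1s": (m - sigma, m + sigma),                # bounded (m, sigma)
        "support_l_u": (lower, upper),                         # (l, u) support set
        "symmetric_unimodal": (m - 3 * sigma, m + 3 * sigma),  # symmetric around m
    }

# Example with synthetic data shaped like the real series (90 winter days):
rng = np.random.default_rng(0)
fake_prices = 0.05 + 0.02 * rng.random((90, 24))
sets = uncertainty_sets(fake_prices)
print(sets["bounded_m_3s"][0][:3])   # first three hourly lower bounds
```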
As the last possibility for uncertainty set, we consider Gaussian distribution of the form Gaussian (m l ,m u ,v) wherem l andm u represent lower and upper bounds for the mean. In this case it is thus assumed that the mean is not precisely known. Additionally, the variance is only known to be bounded above by v. Therefore, we have a Gaussian distribution for which mean and variance are known only partially. In our problem, we consider the true mean to be in a (m− 0.01, m + 0.01) interval 35 Figure 2.5: Optimal scheduled consumption for Gaussian and 3σ uncertainty sets and their comparison with deterministic case. around the sample mean, m, obtained from data. Moreover, we assume that the true variance is bounded by the variance of the data σ 2 . The simulation results for these five categories of uncertainty sets are depicted in Figure 2.3 and Figure 2.4. They respectively show the percentage of savings in daily electricity bill and percentage of daily loss of utility compared to the determin- istic case. We can see that the bounded model with (m, 3σ) achieves the highest savings in cost (33.42%) while incurring a relatively high loss of utility (19.05%). On the other hand the bounded model with (m,σ) and Gaussian distributions give similar cost savings (15.95%) yet a negligible loss of utility. Although the saving in symmetric distribution case is very good, the very high loss of utility makes this option unacceptable. As the last possibility we observe that the support distribution has no gains compared to the deterministic case and so it is rejected as well. Given the above discussion, it seems that bounded (m, 3σ), bounded (m,σ) and Gaussian distributions can be possible candidates for the best uncertainty set. Here, we prefer to put more emphasize on savings in cost and accept the associated loss of utility as 36 a trade-off. Accordingly, we choose bounded (m, 3σ) as the uncertainty set for our future analyses in this study. Figure 2.5 compares the daily consumption profiles for bounded (m, 3σ), Gaus- sian and deterministic cases. We can see that the early morning consumptions in the deterministic case have shifted to midnight in case of bounded and Gaussian uncertainty sets. 2.6 Robust Scheduling in a Heterogeneous Popu- lation The simulation results in the previous section were for one employed customer who is assumed to leave home at 9am and return at 6pm. We discussed a typical utility function as well as the timing preferences for this type of customer. Now, the question is what happens if we have a heterogeneous population in which each individual is using a robust optimization approach to automatically schedule the operation of appliances in her home. We will tackle this problem in this section and examine how the aggregate behavior of a group of customers can be predicted by our methodology. In this section we use bounded (mean, 3σ) to conduct all the analyses. 2.6.1 Sensitivity Analysis Before introducing our heterogeneous population model, we perform a sensitivity analysis on the parameters of the model for a single customer to determine the most crucial parameters. By doing this analysis we can identify the parameters that can 37 Figure 2.6: Sensitivity of model outputs with respect to maximum budget (γ). be used in defining various categories in a heterogeneous environment. Figure 2.6 illustrates both the percentage of changes in daily bill and percentage of loss of utility for a typical employed customer in a day. In this case we assume = 0.05. 
This graph reveals that the percentage of savings is inversely proportional to the maximum budget γ, meaning that a higher γ results in lower savings. In other words, the robust methodology is more beneficial for customers who allocate a smaller budget for their maximum daily bill. Based on the graph, we can see the trade-off between savings in cost and loss of utility. The smaller the maximum budget γ, the greater the cost savings and the greater the loss of utility. This means that the customer will lose more of her preferences and comfort level when she wants to save more. While this trade-off is intuitive, the proposed analysis permits its quantification for decision making purposes. Accordingly, we identify that γ can be an appropriate parameter for categorizing diverse classes of customers. We can also investigate the effect of ε, which appears in the chance-constraint, on our results. In fact, 1−ε reflects the level to which customers are determined to satisfy their budget constraint.

Figure 2.7: Sensitivity of model outputs with respect to the probability of satisfying the chance-constraint (1−ε).

The higher 1−ε, the stricter the consumer is about not exceeding her budget limit γ, and so she will save more on her bill by using a robust method. Consistent with our previous results, this corresponds to a higher reduction in the customer's utility. The sensitivity of the model output with respect to 1−ε is illustrated in Figure 2.7, which corresponds to a value of γ = 1. Thus, we can see that ε can also be used as a model parameter to define and classify a heterogeneous population.

2.6.2 Heterogeneous Population

In order to find the aggregate consumption for a group of heterogeneous consumers, we need to categorize the population based on some appropriate metrics. We demonstrated earlier that the model results highly depend on γ and ε. In fact, these two parameters reflect the customer characteristics which form their consumption behavior. We argue that customers with higher incomes are willing to pay higher amounts for their electricity bills. Therefore, we can assume that this class of customers assigns higher values to their daily maximum budget γ. On the other hand, each customer can choose between a high or low comfort level. In fact, this choice reflects the customer's priority for satisfying her comfort requirements. A customer may allocate a high amount to γ, but still be strict that this maximum daily budget should not be exceeded. In this case, we use the parameter ε to reflect the customer's inclination towards violating her convenience preferences. We saw earlier that ε represents the allowable level of violation of the chance-constraint. In this study, we consider two ranges for ε to model customers with high or low comfort priority.

Figure 2.8: A typical utility function for a stay at home customer.

Another perspective on classifying the population is based on their employment status. We assume that people are either employed or stay at home all day. We discussed earlier the utility function and timing preferences of employed people. Likewise, for a stay at home individual we assume the utility function to be as depicted in Figure 2.8 for light and TV. We assign to α_a' and β_a the values 0.5 and 1 for PHEV and dishwasher, respectively, indicating medium and high importance levels for completing the tasks. Moreover, we assume that the preferred time intervals for washing the dishes and charging the PHEV are τ1 = {1, 2, ..., 11} ∪ {20, 21, ..., 24} and τ2 = {1, 2, ..., 24}, respectively. Based on these properties, we define 8 categories of customers as shown in Table 2.3.

Table 2.3: Categories of consumers in a heterogeneous population.

Category  Income   Comfort  Employment
1         High     High     Employed
2         High     High     Stay at home
3         Medium   High     Employed
4         Medium   High     Stay at home
5         Medium   Low      Employed
6         Medium   Low      Stay at home
7         Low      Low      Employed
8         Low      Low      Stay at home

We assign γ = 1.1−1.5 to high income customers, γ = 0.9−1.1 to medium income families and γ = 0.7−0.9 to low income groups. In addition, we assume ε = 0.01−0.05 for the customers having a low priority for their comfort level. Customers with a high priority for their comfort consider an ε = 0.06−0.10. Finally, the employed customers are assumed to have the utility function discussed earlier while the stay at home customers follow Figure 2.8. We consider a heterogeneous population consisting of 20 households. We randomly select the categories of people in this area. Furthermore, we randomly select γ and ε values for each customer from the corresponding γ and ε ranges defined for each category. Given these points, we can solve the robust optimization problem for each customer to find the optimal schedule as well as other quantities of interest. A comparison between the deterministic and robust values of average daily bill, average daily consumption and average daily utility for this sample population is illustrated in Figure 2.9. The corresponding percentages of change in these three quantities when utilizing the probabilistic approach as opposed to the deterministic approach are reported in Figure 2.10.

Figure 2.9: Comparison between the results of deterministic and robust methods for a heterogeneous population.

Figure 2.10: Average percentage of changes in quantities of interest for a heterogeneous population when using the robust method.

At this point, the question is whether our model performs better for particular classes of people or not. We choose two features of the population, namely income level and employment status, to judge the performance of our model. We already know that the defined categories correspond to income level. As an approximate rule of thumb, people in higher categories have lower incomes. Figure 2.11 demonstrates the percentage of changes in daily electricity cost and utility versus the categories of consumers. We can see that there is no specific correlation between savings and income level. However, based on Figure 2.12, we can observe a correlation between savings and employment status. People who stay at home can save more on their bills by using our proposed model as opposed to the employed customers. This result is somewhat natural: stay at home individuals have more flexibility in their scheduling, and can therefore use the low-priced hours more often. Finally, we compare the average daily consumption profiles for the deterministic and robust cases. Figure 2.13 shows the average daily consumption profiles for the deterministic and robust optimization cases. It is noteworthy that the average daily consumption is reduced by about 37%. This consumption reduction is associated with a 41.4% savings in daily cost (Figure 2.10). So, here again the anticipated trade-off between comfort and cost is revealed, and quantified.

Figure 2.11: Model outputs for various categories of customers.

Figure 2.12: Savings in cost for employed and stay at home customers.
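The population sampling step described above can be summarized in a short sketch. The category table and the γ and ε ranges follow Table 2.3 and the values quoted in the text; solve_household is a hypothetical wrapper around the single-customer robust scheduling problem of Section 2.4, not an existing routine.

```python
# Sketch of sampling the heterogeneous population of Section 2.6.2.
# Category -> (gamma range, epsilon range, employment) follows Table 2.3;
# solve_household() is a hypothetical per-customer scheduling solver.
import random

CATEGORIES = {
    1: ((1.1, 1.5), (0.06, 0.10), "employed"),
    2: ((1.1, 1.5), (0.06, 0.10), "stay_at_home"),
    3: ((0.9, 1.1), (0.06, 0.10), "employed"),
    4: ((0.9, 1.1), (0.06, 0.10), "stay_at_home"),
    5: ((0.9, 1.1), (0.01, 0.05), "employed"),
    6: ((0.9, 1.1), (0.01, 0.05), "stay_at_home"),
    7: ((0.7, 0.9), (0.01, 0.05), "employed"),
    8: ((0.7, 0.9), (0.01, 0.05), "stay_at_home"),
}

def sample_population(n_households: int = 20, seed: int = 1):
    rng = random.Random(seed)
    households = []
    for _ in range(n_households):
        cat = rng.randint(1, 8)
        (g_lo, g_hi), (e_lo, e_hi), employment = CATEGORIES[cat]
        households.append({
            "category": cat,
            "gamma": rng.uniform(g_lo, g_hi),      # maximum daily budget
            "epsilon": rng.uniform(e_lo, e_hi),    # chance-constraint violation level
            "employment": employment,
        })
    return households

# Each sampled household is then scheduled individually, e.g.
# results = [solve_household(h) for h in sample_population()]
```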
2.7 The Effect of Time Scale on Optimal Solution Inthissection,weseektoassesstheeffectoftimescaleonthesimulationresults. The original time resolution for real-time prices in the Peninsula project was 5 minutes. However, we used the average electricity price in one hour periods in order to solve the problem as described in the previous sections. By averaging over one hour 44 Figure 2.13: Average daily consumption profiles for deterministic and robust approaches in a heterogeneous population. periods, we have lost some resolution in the data in order to gain simplicity. In this sectionwediscussandquantifythecompromisesassociatedwiththisapproximation. Daily mean of winter electricity prices with multiple resolutions is depicted in Figure 2.14. We consider 15min, 30min and 1hr time resolutions in order to perform the relevant analyses. Throughout this section, we use a bounded (mean, 3σ) distribution for the uncertain parameter, price. Moreover, we consider a typical stay at home customer with the utility function and time preferences described earlier. We consider the values = 0.05 and γ = 1 for this customer. The first outcomes of the model to be investigated are the daily cost and daily utility. Table 2.4 shows the obtained values of these two parameters for 15min, 30min and 1hr time resolutions. It is observed that the lowest daily cost corresponds to the lowest time resolution, i.e. 1hr. Moreover, by increasing the time resolution the daily cost will increase as well. However, we don’t observe a monotone 45 Figure 2.14: Mean of electricity price with different time resolutions. Figure 2.15: Daily robust consumption for various time resolutions. correlation between the time resolution and utility. While 60min resolution has a lower utility compared to the 30min resolution, the 30min resolution gives a higher utility compared to the 15min. However, these discrepancies in cost and utility are almost negligible given the present data and we cannot draw a general conclusion based on them. The daily consumption profile for the three time resolutions for a typical customer, as mentioned above, is illustrated in Figure 2.15. We can see 46 Table 2.4: Effect of time resolution on daily cost and utility. Time Resolution Daily Cost Daily Utility 15 Min 0.68 10.44 30 Min 0.64 11.04 60 Min 0.57 10.11 Table 2.5: Effect of time resolution on optimization problem. Time Scale No. of Constraints No. of Variables Time (s) 15 Min 652 676 2683 30 Min 330 340 634 60 Min 169 172 1.95 that a finer resolution results a more even consumption profile. This can decrease the peak to average ratio in a power grid. Thus, from this point of view having a finer time scale results in a more favorable schedule for the operation of appliances. Finally, the comparison between optimization problem characteristics is shown in table 2.5. We can see that the number of constraints and variables is almost propor- tional with the time resolution. This means that when changing the time scale from 1hr to 30min, the number of constraints and variables almost doubles. However, this is not the case when considering the time. It takes the solver (CPLEX 12.5) 2683s to solve the corresponding 15min problem, while in solving the 1hr problem takes only 1.95s. Thus, while a finer time scale improves system behavior, it adds a significant computational burden. 47 Chapter 3 Stochastic Load Forecasting in Smart Grids 48 3.1 Introduction Building predictive models in the context of smart grid is challenging for several reasons. 
First, quantifying the effect of real time prices on the consumption profile as well as including other influencing variables (e.g., weather, time of day, previous consumption) is a challenging aspect. Moreover, the availability of data at varying spatial and temporal granularities introduces new challenges in modeling in terms of multiscale models and the effect of model resolution on the results. Furthermore, given the detailed customers’ consumption data, utility companies are interested in having models at the customers’ scale which are able to recommend personalized policies. Finally, quantifying the uncertainty associated with involved components and formalizing their effect on the prediction result in not a trivial task in this context. Load forecasting problem has been extensively addressed in the literature. Generally speaking there have been two main lines of research. The first direction of research is based on "parametric" statistical models such as time-series and lin- ear regression (e.g, [74], [26],[33]). The second trend encompasses "nonparametric" models such as support vector regression, neural networks and fuzzy logic models ([16],[80]). A comprehensive literature survey of load forecasting techniques is pre- sented in [2]. Despite being intuitive, parametric methods often suffer from the limitations posed by the underlying assumptions in their construction. The form of the parametric model is set in advance and thus it cannot incorporate unspeci- fied relationships between the target variable and the predictors. Moreover, these models are not generally suited for high dimensional problems, when the number of predictors grow. 49 In this chapter, we propose Bayesian network framework as a probabilistic tool to model the dependencies between various contributing factors in demand forecasting in a smart grid context. It is argued that Bayesian networks are an appropriate tool for probabilistic modeling in complex systems when the number of variables is high. In this regard, we seek to probabilistically predict demand subject to real time prices (as well as other factors) at multiple spatial and temporal scales. The demand prediction problem is studied for 15 minutes and hourly time resolutions. Besides, we develop a model to predict the demand at the households’ scale which can be used as an aid for personalized policy making. The literature on probabilistic multiscale modeling in sensor-based demand management is not well developed. [70] and [34] are among the few works that have studied the effect of spatial and temporal granularities on load forecasting problem. Moreover, use of Bayesian networks as a probabilistic modeling scheme is very limited in this context [4], [78], [27] and [54]. We argue that Bayesian networks are efficient tools that can address some of the previously mentioned challenges. The remainder of the chapter is organized as follows: in section 3.2, we intro- duce the concept of Bayesian networks and provide necessary notations to learn and use the model as a probabilistic predictive model. Section 3.3 discusses the problem and develops Bayesian network model to forecast demand at multiple spatial and temporal resolutions. 3.2 Bayesian Networks A Bayesian network (BN), is a directed acyclic (i.e. having no cycles) graphical (DAG) model that encodes the joint probability distribution of a set of random 50 X 1 X 3 X 2 X 4 X 5 Figure 3.1: An example BN [41]. variables by making conditional independence assumptions [61]. 
Bayesian networks have become popular representations for extracting knowledge from data in com- plex uncertain systems in recent years. According to [29], BNs have at least four remarkable advantages over the state-of-the-art data analysis techniques for density estimation, regression, classifications and clustering. First, since BNs are able to encode the dependencies between the input variables, they can easily handle incom- pletedatasets. Second, inaBNmodelthecausalrelationshipsbetweenthevariables can be learned. This provides valuable insights regarding the specific domain under study. Third, Bayesian networks provide an intuitive way to incorporate the prior expert knowledge intothe model. Thisis achievedby usingBayesianstatistical tech- niques in modeling. Finally, BNs provide a principled framework to avoid overfitting of data. A typical BN is shown in Fig.3.1. In this chapter, we consider a Bayesian network over a finite set of discrete random variables, X ={X 1 ,··· ,X n }. A BN represents a joint PDF over X by encoding conditional independence assertions as well as a collection of PDFs. The assertions of conditional independence are modeled through a directed acyclic graph (DAG) structure. Thus, we use the pair B =hS, Θi to define a BN where S is the BN structure (a DAG) in which nodes represent random variables X 1 ,··· ,X n and 51 edges express direct dependencies between the variables. Moreover, Θ is a set of parameters corresponding to that structure. The chain rule of probability implies that a joint distribution can always be represented as follows, using any ordering of the variables, ρ (X 1 ,··· ,X n ) =ρ(X 1 )ρ(X 2 |X 1 )ρ(X 3 |X 2 ,X 1 )···ρ(X n |X 1:n−1 ) (3.1) As n gets large, evaluation of ρ(X i |X 1:i−1 becomes more and more complicated. In a multinomial setting where each variable has K states, ρ(X 1 ) is represented as a table of O(K) numbers (In fact, there are only K− 1 free parameters, but we use O(K) for simplicity). Similarly ρ(X 2 |X 1 ) is a table with O(K 2 ) parameters. Thus, there will be O(K n ) parameters in the model, learning which requires lot of data. The key solution to characterizing large joint distributions is to make some assumptions about conditional independence (CI). We say thatX andY are condi- tionally independent given Z, denoted by X⊥Y|Z if and only if, X⊥Y|Z⇔ρ(X,Y|Z) =ρ(X|Y )ρ(Y|Z) (3.2) Assume that the structure of the directed acyclic graph is known. There exists efficient algorithms to find the ordering of the nodes such that parents come before children [14]. The parent and children notion in a directed graph coincides with their common sense meaning, i.e. parent of a node is the set of all nodes that feed into it and the children of a node is the set of all nodes that feed out of it. Given this ordering, which is known as topological ordering, ordered Markov Property 52 is defined as the assumption that a node only depends on its immediate parents, not all predecessors in the ordering [56], so, X s ⊥X pred(s)\pa(s) |X pa(s) (3.3) wherepa(s) are the parents of nodes, andpred(s) are the predecessors of nodes in the ordering. This is in fact a generalization of the first order Markov assumption where we assume that variable at time t + 1 is independent of all variables up to time t− 1 given that variable at time t is observed. This assumptions generalizes Markov property from chains to DAGs. 
Accordingly, the chain rule (3.1) for the topological ordering is reduced to a factorized expression given by, ρ (X 1 ,··· ,X n ) = n Y i=1 ρ (X i |pa(X i )) (3.4) where each termρ (X i |pa(X i )) is a conditional probability distribution (CPD). Here, we have used the relation ρ (X i |X 1 ,··· ,X i−1 ) = ρ (X i |Pa(X i )) based on ordered Markov assumption to reduce the model. This factorized joint distribu- tion only holds if the CI assumptions encoded in the associated DAG are correct. In case each node has O(F ) parents and K states, there will be O(nK F ) parameters in the model which is much less than the O(K n ) needed by a model which makes no CI assumptions. In the discrete setting, we associate with each node in the graph a conditional probability table (CPT) which tabulates ρ (X i |Pa(X i )). By convention, each row in the table corresponds to a specific configuration of the parent variables and has a separate multinomial distribution. Assume variable X i and its parentsPa X i have r i and q i mutually exclusive configurations respectively. Thus, the table has r i ×q i 53 entries covering all possible combinations [58]. Each entry in the table is denoted by θ ijk =p(X i =k|Pa(X i ) =j) which is the probability that node X i is in state k given that its parents are in state j. Similarly the set of parameters for each row is denoted by Θ ij and the set of all the parameters of the BN model is represented by Θ. 3.2.1 Learning Discrete Bayesian Networks Learning a Bayesian network consists of learning the graph structure and the cor- responding CPT parameters that best fit the available data. In general, Bayesian network structure learning has two main applications: knowledge discovery and den- sity estimation. By knowledge discovery we mean that a BN structure can efficiently reveal the conditional independencies between variables. On the other hand, the BN structure can be used in conjunction with the CPT parameters to uniquely specify a joint density function over the random variables. These two tasks cannot be carried out independently. First, in order to learn the structure, we need some estimation of the parameters to quantify the goodness of fit. On the other hand, to estimate the conditional probabilities, we must know the graphical structure in advance. Generally, there are two categories of BN structure learning. The first one is search-and-score which assigns a score to each possible BN structure based on its match to the observed data and then searches for the best network. The other one is called constraint-based algorithm, which runs conditional independence tests to learn the structure. Here we are only interested in the search-and-score approach. In this setting, learning a Bayesian network consists of a search conducted over some search space to optimize a scoring metric. The scoring functions are based on differentprinciples, suchasinformationtheoryandminimumdescriptionlength([40], 54 [19]), or Bayesian methods ([13], [30]). As far as the search algorithm is concerned, since finding the globally optimal graph takes exponential time in most cases, some heuristic search method is commonly used. This can be greedy search (such as hill-climbing or tabu search), simulated annealing or genetic algorithms to name the most popular ones. 
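Before continuing, the factorization (3.4) and the CPT parameters θ_ijk can be made concrete with a small example. Consider a hypothetical three-node chain X1 → X2 → X3 with binary variables; all CPT entries below are made-up numbers used only for illustration.

```python
# Concrete illustration of Eq. (3.4) for a hypothetical chain X1 -> X2 -> X3
# with binary variables; all CPT entries are invented for the example.
import itertools
import numpy as np

p_x1 = np.array([0.6, 0.4])                 # rho(X1)
p_x2_given_x1 = np.array([[0.7, 0.3],       # rho(X2 | X1=0)
                          [0.2, 0.8]])      # rho(X2 | X1=1)
p_x3_given_x2 = np.array([[0.9, 0.1],       # rho(X3 | X2=0)
                          [0.4, 0.6]])      # rho(X3 | X2=1)

def joint(x1: int, x2: int, x3: int) -> float:
    # rho(X1, X2, X3) = rho(X1) * rho(X2 | X1) * rho(X3 | X2), i.e. Eq. (3.4)
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x2[x2, x3]

# The factorized joint is a proper distribution: it sums to one.
total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=3))
print(round(total, 10))   # 1.0

# Parameter count: 1 + 2 + 2 = 5 free parameters here, versus 2**3 - 1 = 7
# for an unrestricted joint; the gap widens rapidly as n grows.
```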
At last, the search space is dominantly the DAG space, however, it can be either equivalence classes of DAGs or orderings over the network variables (an equivalence class refers to a set of DAGs in which all the members imply the same set of conditional independence assertions). In this chapter, we focus on the Bayesian score for discrete Bayesian networks and use tabu search to find the best DAG structure. Let D ={C 1 ,··· ,C m } be the set of m observed instances of C (where each case C i assigns values to one or more variables in X). Our goal is finding a network S that best matches the data set D. The derivations in this section are mostly based on [30]. Posterior probability of a network structure may be used as a Bayesian measure to score a structure, ρ(S|D) = ρ(D|S)ρ(S) P S ρ(D|S)ρ(S) (3.5) in which the denominator is a normalization constant. Calculating this constant requires summing over all possible structures which is computationally expensive even for small domains. Thus, we resort to computing the numerator as the network score. Thus, the Bayesian score is defined as ρ(D,S) = ρ(D|S)ρ(S). We start deriving the Bayesian score by calculating ρ(D|S). Applying the chain rule, we obtain, ρ(D|S) = m Y l=1 ρ (C l |C 1 ,··· ,C l−1 ,S) = m Y l=1 Z θ ρ (C l |Θ,S)ρ (Θ|C 1 ,··· ,C l−1 ,S) (3.6) 55 Here, Θ is dependent on structureS, but we skip this notation for brevity. Thus, to calculateρ(D|S) we sum over all possible values of model parameters. The first term in the above integral is the probability of observing a particular case given that the model is fully specified (known structure and known parameters). Here, we assume that each observed case is a multinomial sample from some BN with some structure and parameters. Moreover, if we assume that the data set is complete (no missing or hidden variables), then using the factorized expression for joint distribution in BNs as in (3.4), we have, ρ (C l |Θ,S) = Y i Y j Y k θ α lijk ijk (3.7) where α lijk is 1 if and only if X i = k and Pa(X i ) = j in case C l and 0 otherwise. To simplify the second term in the integral, researchers assume parameter inde- pendence which implies that the parameters associated with each node in the BN (each variable) can be computed independent of the other nodes. This is called global independence. Moreover, in each node the parameters of each row of the CPT table (associated with each particular configuration of the node parents) can be determined independent of other rows. This is referred to as local independence. These assumptions lead to the following, ρ (Θ|S) = Y i Y j ρ (Θ ij |S) (3.8) where Θ ij is the set of parameters of node i when its parents are in configuration j. Plugging (3.7) and (3.8) into the integral in (3.6) and doing some manipulations yields [30], ρ(D|S) = Y i Y j Y k Y l E (θ ijk |C 1 ,··· ,C l−1 ,S) α lijk (3.9) 56 where E denotes the expectation of θ ijk with respect to ρ(Θ ij ). To calculate the expectation in (3.9) we need to know ρ(Θ ij |D,S) which is the parameter posterior. On the other hand, we already assumed thatρ(D|Θ ij ,S) has a multinomial distribu- tion. Since the Dirichlet distribution is the conjugate for multinomial distribution, by assuming a Dirichlet prior for parameters the parameter posterior would also fol- low a Dirichlet distribution. Accordingly, we assume that ρ(Θ ij |S) has a Dirichlet distribution and thus the posterior distribution, ρ(Θ ij |D,S), is also Dirichlet. 
It is well known that the mean of the Dirichlet distribution for each of the variables is, E (θ ijk |D,S) = N 0 ijk +N ijk N 0 ij +N ij (3.10) where N ijk is the number of cases in data set where X i = k and Pa(X i ) = j and N ij = P k N ijk . Similarly,N 0 ijk isthepriorDirichlethyperparameterwhichservesasa pseudo count. In other words, we can assume that there is an imaginary (equivalent) sample with size N 0 which reflects our belief in the prior distribution and N 0 ijk is the number of cases in the imaginary data set where X i = k and Pa(X i ) = j. By plugging in (3.10) into equation (3.9) and summing over l, we get an expression for ρ(D|S). As discussed earlier we pick ρ(D,S) = ρ(D|S)ρ(S) as the network score, thus, ρ(D,S) =ρ(S) Y i Y j γ(N 0 ij ) γ(N 0 ij +N ij ) Y k γ(N 0 ijk +N ijk ) γ(N 0 ijk ) (3.11) where γ is the Gamma function [30]. This is called BD metric which stands for Bayesian metric with Dirichlet priors. Here ρ(S) refers to our prior belief regarding the network structure which is commonly assumed to follow a uniform distribution. TheBDmetricisnotscoreequivalent,i.e. itdoesnotgivethesamescorefordifferent networks in the same equivalence class. As discussed earlier, an equivalence class is a set of networks that imply the same assertions of conditional independency. 57 Thus, it is desirable that our metric assigns the same score to all the members of the same class. [30] shows that the only Dirichlet hyperparameters that give a score equivalent metric are„ N 0 ijk =N 0 ρ 0 (X i =k,pa(X i ) =j) (3.12) where ρ 0 is some prior joint probability distribution. Substituting (3.12) into the BD metric, we obtain a new metric called BDe metric. This represents a score equivalent Bayesian metric with Dirichlet priors. Typically the prior distributionρ 0 is assumed to be uniform, i.e. ρ 0 (X i = k,Pa(X i ) = j) = 1 r i q i . Consequently, the hyperparameters would be equal to, N 0 ijk = N 0 r i q i (3.13) where r i is the number of different categories for variable X i and q i is the number of different parent configurations for the same variable. Hence if we sum the pseudo counts over all r i ×q i entries in the CPT, we get a total equivalent sample size of N 0 . This prior is called the BDeu prior, where the "u" stands for uniform. 3.2.2 Exact Inference in Bayesian Networks Bayesian networks provide full representation of the joint probability distribution over their random variables. Thus, we can use them to answer any conditional probabilistic queries about the domain. This is called probabilistic inference and commonly deals with inferring the state of a set of query variables given the state of some evidence (or observation) variables. Probability propagation and belief updating are also popular terms that are used in the literature for addressing this problem. 58 We can distinguish between three different types of inferential tasks in BNs. When a "cause" variable that is upstream in the graph is observed, we can reason aboutthedownstream"effect"variable. Thisiscalledpredictivereasoning or"top down"reasoning. Here, theflowofinformationisinthedirectionofthenetworkarcs. Ontheotherhand, wemighthavetheevidenceofaneffectandwouldliketoinferthe mostlikelycause. Thisisknownasdiagnosticreasoningor"bottomup"approach. This reasoning occurs in the opposite direction of the network arcs. A further form of reasoning involves reasoning about the mutual causes of a common effect. 
When the common effect is observed, the causes become conditionally dependent, although they are marginally independent. In fact, the causes would "compete" to explain the effect. Now if one of the causes is also observed, the probability of other causes is reduced i.e. p(cause1|effect, cause2)<p(cause1|effect). This is called explaining away or "intercausal reasoning". Sometimes, we face with reasoning that does not fit into any of the three categories discussed above. In fact, we can do any arbitrary reasoning in the BN by combining the above types of reasoning in any way [39]. Another type of query in Bayesian networks is maximum a posteriori (MAP) query which seeks to find the configuration q ∗ of the variables in Q that has the highest posterior probability, MAP (Q|E) =q ∗ = arg max q p(Q =q|E) (3.14) Aswecansee, toansweralltheabovequeriesweneedtocomputeaconditional probability distribution. We denote this by p(Q|E) where Q and E are query and 59 evidence subsets of variables respectively. We can use Bayes theorem to compute this distribution, p(Q|E) = p(Q,E) p(E) (3.15) Here, the joint distribution learned by the Bayesian network, which has a fac- torizedexpressionasin (3.4), canbeusedtocomputethejointdistributions. Infact, we can use marginalization (summing out over unwanted variables) to compute any desired probability. This is known as variable elimination. However, the joint prob- ability table increases exponentially with the number of variables and we need to use more efficient methods. In this thesis we use junction tree algorithm to make infer- ence in discrete Bayesian networks. Junction tree algorithm is an improved version of variable elimination method which tries to find the best ordering in marginalizing out the unwanted variables so that the computational complexity of the algorithm is reduced. In the following sections we briefly overview these two algorithms. Variable Elimination Algorithm Inthissection,forsakeofnotationalconvenience,weusetheframeworkofpotentials. A conditional probability table ρ(A|Pa(A)) is a function φ : pa(A)∪A→ [0 : 1], and we call it a potential and use functional notation to carry out the algebra of probability tables. Besides, the domain of the potential is defined as thepa(A)∪A. This section is based on chapter 4 of [58]. Consider a simple Bayesian network as in Fig.3.2. From the chain rule we have, ρ(U) =φ 1 φ 2 φ 3 φ 4 φ 5 φ 6 (3.16) 60 A 1 A 2 A 3 A 4 A 5 A 6 Figure 3.2: A simple Bayesian network [58]. where U is the universe and φ 1 = ρ(A 1 ), φ 2 = ρ(A 2 |A 1 ), φ 3 = ρ(A 3 |A 1 ), φ 4 = ρ(A 4 |A 2 ), φ 5 = ρ(A 5 |A 2 ,A 3 ) and φ 6 = ρ(A 6 |A 3 ). The marginal distribution of a single node, say A 4 can be computed by marginalizing out unwanted variables, ρ(A 4 ) = X A 1 ,A 2 ,A 3 ,A 5 ,A 6 φ 1 φ 2 φ 3 φ 4 φ 5 φ 6 (3.17) Calculating the above expression is computationally expensive in case of large num- ber of variables. A smart idea is to push sums inside products as, ρ(A 4 ) = X A 1 φ 1 (A 1 ) X A 2 φ 2 (A 2 ,A 1 )φ 4 (A 4 ,A 2 ) X A 3 φ 3 (A 3 ,A 1 ) X A 5 φ 5 (A 5 ,A 2 ,A 3 ) X A 6 φ 6 (A 6 ,A 3 ) (3.18) We now evaluate this expression working right to left. At each step a variable is summed over and marginalized out, thus a new potential is formed with a new domain. The resulting potential is fed into the next summation and the product of potentials is calculated. Then, the summation is conducted over the resulting new potential and this procedure continues until we get to ρ(A 4 ). 
Mathemati- cally speaking, we first calculate φ 0 6 (A 3 ) = P A 6 φ 6 (A 6 ,A 3 ), then multiplyφ 0 6 (A 3 ) by φ 5 (A 5 ,A 2 ,A 3 ) and calculate φ 0 5 (A 2 ,A 3 ) = P A 5 φ 5 (A 5 ,A 2 ,A 3 )φ 0 6 (A 3 ) and so forth. 61 The initial joint distribution, ρ(U) needs to incorporate all six variables at the same time, while the largest potential obtained through the above procedure con- tains three variables. This is the idea of variable elimination algorithm that tries to reduce the computational cost of direct evaluation of potentials. Until now, we did not consider any evidence on the variables. The effect of introducing evidence is that the corresponding variables are instantiated to their respective values. For instance, if it is observed that A 5 = a 5 and A 6 = a 6 then we will have φ 0 5 = ρ(A 5 = a 5 |A 2 ,A 3 ) and φ 0 6 = ρ(A 6 = a 6 |A 3 ). Thus, the above procedure shall be used to compute any marginal or conditional distribution of interest. One drawback of variable elimination algorithm is that it is only capable to compute one desired marginal distribution at a time. The setup of the algorithm at the very first step is geared towards the desired variables, i.e. the set of variables to keep. Thus in order to compute multiple queries we should run the algorithm multiple times. Ideally one would like to be able to compute multiple queries at the same time from the same model. The size of the factors (their domain size) that are created in the process of summing out is a good measure of complexity. The order of marginalizing, a.k.a. elimination order, has a large impact on the size of the intermediate factors and thus on the computational complexity. In the preceding example we used the particular order, namely A 6 ,A 5 ,A 3 ,A 2 ,A 1 . We could use any other elimination order at the expenseofvaryingcomputationalcosts. Now, thequestionishowtofindtheoptimal elimination order so that the computational complexity is minimized. The answer is given in the junction tree algorithms. 62 Junction Tree Algorithm In this section, we deal systematically with belief updating in Bayesian networks which leads to the junction tree algorithm. The theorems and derivations in this section are primarily based on chapter 4 of [58]. Assume we have a set of real-valued potentials Φ ={φ 1 ,··· ,φ m } over finite variablesU ={A 1 ,··· ,A n }. Marginalizing out a variable from a potential means to sum over its possible values, and we can similarly marginalize out a set of variables by summing over all the variables in the set. We may also use projection notation, φ ↓X (X,Y,··· ,Z) to denote marginalizing out all variables except X. Given this notation a variable X can be eliminated from Φ by following these steps, 1. Call the set of all potentials with X in their domains as Φ X . 2. Calculate φ −X = P X Q Φ X . 3. Define Φ −X = n Φ\ Φ X ,φ −X o . Here, we do not calculate the product and keep it in a bucket until we need it. Now, we get back to the fundamental question of elimination order. First, let’s graphically elaborate the consequences of various elimination orders. By definition the domain of each potential, consists of all the variables in the potential and is denoted by dom(φ i ) = D i . Moreover, the domain graph also known as moral graph for Φ is the undirected graph with the variables of U as nodes and with a link between each pair of variables that are members of the same domain D i . For instance, the domain graph for Fig.3.2 is illustrated in Fig.3.3. 
Compared to the initial BN, we can see that the directions have been dropped and a so-called moral link connects the two nodes with a common child. 63 A 1 A 2 A 3 A 4 A 5 A 6 Figure 3.3: The domain graph for Fig.3.2 [58]. A 1 A 2 A 4 A 5 A 6 Figure 3.4: The domain graph obtained from Fig.3.3 by marginalizing out A 3 [58]. Let’s assume that we want to marginalize outA 3 . Thus we need to collect the potentials having A 3 in their domains and calculate φ −A 3 . The resulting potential has a domain consisting of all the neighbors ofA 3 in the domain graph. This means that in the domain graph for Φ −X all neighbors of A 3 are pairwise linked (Fig.3.4). The newly created links in Fig.3.4 are called fill-ins. The introduction of fill-ins is equivalent to building new potentials with new domains. The perfect elimination order corresponds to the one that doesn’t introduce any fill-ins. This order requires less space and is computationally cheaper. In other words, the complexity of using a particular elimination order corresponds to the set of domains created meanwhile. 64 Up to now, we have argued that if we can eliminate the variables such that no fill-in links is created this will give the perfect elimination sequence ending with the desired variableA. There is an equivalent purely graph-theoretic concept called triangulated graphs. A triangulated graph is an undirected graph in which each variableA has a perfect elimination sequence (no fill-ins built) ending withA. Con- sequently, it can be shown that we can make a triangulated graph out of our domain graph by adding a chord to every cycle in the graph that has a length greater than 3 (the new chord connects all non-adjacent nodes in the cycle). Having this trian- gulated graph it is straightforward to find the optimal elimination sequence. Belief Propagation in Junction Trees Belief propagation (BP) algorithm is a message passing algorithm that calculates marginal distribution for each unobserved node conditional on the observed nodes in a tree [60]. To make inference in general graphical models, such as Bayesian networks, we would like to use the BP algorithm but since the algorithm is defined for trees we first need to define an appropriate tree corresponding to our Bayesian network. This is exactly what junction tree is doing. It turns out that the cliques of a triangulated graph can be organized into a special kind of tree known as junction tree. A clique in a graph is the set of nodes that are pairwise connected (complete subgraph) and the maximal clique is the clique that is not a subset of another clique. To build the junction tree we first need to build a clique graph. The nodes of clique graph are defined by the maximal cliques of the triangulated graph and the edges are labeled with the intersection of the nodes at the two ends of the edge. Next, we define the edge weight on the clique graph to be the size of the intersection set on that edge and apply maximum weight spanning tree algorithm to this graph to find the so-called junction tree. 65 At this point we are ready to apply the belief propagation algorithm on the junction tree. First, we attach each initial potential φ to a node in the junction tree so that the potential domain is a subset of the clique domain in that node. Moreover, each edge has a so-called separator which contains the intersection of the variables on the two side of the edge along with two more mailboxes, one for each direction. We describe the belief propagation algorithm through an example [58]. 
Con- sider the Bayesian network in Figure 3.2 where the potentials are φ 1 = P (A 1 ), φ 2 = P (A 2 |A 1 ), φ 3 = P (A 3 |A 1 ), φ 4 = P (A 4 |A 2 ), φ 5 = P (A 5 |A 2 ,A 3 ) and φ 6 = P (A 6 |A 3 ) and with the domain graph in Figure 3.3. The maximal cliques of the respective triangulated graph are V 1 ={A 3 ,A 6 }, V 2 ={A 2 ,A 3 ,A 5 },V 4 = {A 1 ,A 2 ,A 3 },V 6 = {A 2 ,A 4 } and the separators are S 1 = {A 3 }, S 2 = {A 2 ,A 3 }, S 4 ={A 2 }. The corresponding junction tree is shown in Figure 3.5. Assume we want to compute P (A 4 ), so we first find a clique that contains A 4 which is V 6 . We consider it as a temporary root and send messages from the leaves towards the root. The message that each clique sends via an edge is the projection of its potentials on the link separator. For instance V 1 sends the message ψ 1 =φ ↓S 1 6 and V 2 sends the message ψ 2 = φ ↓S 2 5 . Next, V 4 collects all received messages and updates its own potential as Φ 4 =φ 1 ,φ 2 ,φ 3 ,ψ 1 ,ψ 2 and sends the messageψ 4 = Φ ↓S 4 4 to V 6 . Now, V 6 collects this message, multiply it by φ 4 and marginalize out A 2 to get P (A 4 ). This procedure is called collecting evidence. In order to prepare the junction tree for the calculation of all marginals, we need to send messages from V 6 to all other nodes as well, this is called distributing evidence. So a complete run consists of a collecting and distributing phase and at the end of that we have performed a full propagation and to calculate a desired 66 φ 1 ,φ 2 ,φ 3 V 4 : A 1 ,A 2 ,A 3 ψ 4 S 4 :A 2 φ 4 V 6 : A 2 ,A 4 ψ 2 =φ ↓S 2 5 S 2 :A 2 ,A 3 φ 5 V 2 : A 2 ,A 3 ,A 5 ψ 1 =φ ↓S 1 6 S 1 :A 3 φ 6 V 1 : A 3 ,A 6 Figure 3.5: Message passing in a junction tree. The cliques V 1 , V 2 andV 4 have sent messages to their separators [58]. marginal P (X) we find a clique that contains X. For a detailed discussion of this topics readers are referred to [72], [42] and [58]. 3.2.3 D-Separation and Conditional Independence Proper- ties of A Bayesian Network The pure structure of a Bayesian network can be used to infer the conditional inde- pendencies between random variables. On the other hand, by inferring the condi- tional independencies from data we can build the network structure as conducted in constraint-based methods. The building blocks of this assertion are three classes of connectivity network fragments shown in Figure 3.6. The serial connection indicates 67 Figure 3.6: Serial (a), diverging (b) and converging (c) network structures [41]. that{X 1 ⊥ X 3 |X 2 } while X 1 and X 3 are marginally dependent. This is also the case in the diverging graph structure in which{Y 1 ⊥ Y 3 |Y 2 } while Y 1 and Y 3 are marginally dependent. However, the converging graph structure in part (c) encodes a different statement. It dictates that Z 1 and Z 3 are conditionally dependent given Z 2 while Z 1 and Z 3 are marginally independent. To conclude, we can claim that there is mutual relation between graph structure and independency statements over a set of random variables. While knowing the independecies (based on data for instance) can help us to build the structure of the network, having a Bayesian net- work structure, we can deduct and infer independency relations in a probability distribution function. By definition X and Y are conditionally independent given Z if and only if, X⊥Y|Z⇐⇒p(X,Y|Z) =p(X|Z) p(Y|Z) (3.19) This means that if we observe Z, knowing Y will not provide us with any additional information about X. 
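The "explaining away" behavior of the converging structure in Figure 3.6(c) can be verified numerically. In the sketch below, Z1 and Z3 are binary causes of a common effect Z2; the CPT values are hypothetical and chosen only to make the effect visible.

```python
# Numeric check of explaining away for the converging structure of Fig. 3.6(c):
# causes Z1 and Z3, common effect Z2. All probabilities are made up.
import itertools
import numpy as np

p_z1 = np.array([0.9, 0.1])          # prior on cause Z1
p_z3 = np.array([0.8, 0.2])          # prior on cause Z3
# rho(Z2 = 1 | Z1, Z3): either cause makes the effect likely
p_z2_1 = np.array([[0.05, 0.80],     # Z1 = 0, Z3 in {0, 1}
                   [0.70, 0.95]])    # Z1 = 1, Z3 in {0, 1}

def joint(z1, z2, z3):
    p2 = p_z2_1[z1, z3] if z2 == 1 else 1.0 - p_z2_1[z1, z3]
    return p_z1[z1] * p_z3[z3] * p2

def posterior_z1(evidence):
    # p(Z1 = 1 | evidence) by brute-force marginalization of the joint
    num = den = 0.0
    for z1, z2, z3 in itertools.product([0, 1], repeat=3):
        values = {"z1": z1, "z2": z2, "z3": z3}
        if all(values[k] == v for k, v in evidence.items()):
            p = joint(z1, z2, z3)
            den += p
            num += p if z1 == 1 else 0.0
    return num / den

print(posterior_z1({"z2": 1}))            # effect observed: ~0.29
print(posterior_z1({"z2": 1, "z3": 1}))   # second cause also observed: ~0.12, lower
```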
Inferring conditional independencies from a BN is based on the underlying idea that "dependence" is associated with "connectedness" in the graph and "independence" with "separation". In other words, two nodes X and Y are d-separated if all the paths between them are blocked. In finding a path, the 68 direction of the arcs is not important. There are two main rules that determine if an existing undirected path between two nodes is blocked or not, 1. A so-called collider or v-structure, s→m←t, blocks a path unless m or one of its descendants is in Z. 2. A collider-free path is blocked by Z if any member of Z is present in the path. The extension of the above definition to sets of variables is straightforward (i.e., two sets are separated if and only if each element in one set is separated from every element in the other). Using the d-separation rules we can obtain useful insights regarding the conditional independencies between variables. D-separation criteria has an important implication in the Bayesian networks. Each node is independent of all the other nodes given its Markov blanket (By definition, the Markov blanket of a node in a DAG consists of its parents, its children and other co-parents of its children). These expressions prove to be highly informative when we want to extractsomeknowledgeabouttheconditionalindependenciesencodedbyaBayesian network. 3.3 Building Bayesian Network Models For Demand Response Problem It is desirable for decision makers to discover the structure and dependencies among multiple factors which form the behavior of a complex system and use them to fore- cast the future behavior of the system. Building such a model enhances the general insight about the system and facilitates decision making tasks. In this section, we address the demand response issue in the context of smart grid technology. This 69 is a complex system which comprises of multiple random components. Our goal is to discover the dependencies between the contributing factors and probabilisti- cally quantify their impact on the electricity demand. We seek to develop predictive models at the local (each customer) and global (aggregate customers) scales. In this regard, we use Bayesian networks to learn the dependencies as well as quantifying the joint probability distribution. 3.3.1 Data set Description We use the electricity consumption data collected by Pacific Northwest national lab (PNNL) in a project called "Olympic Peninsula". The data consists of smart meter readings, i.e. electricity consumption (Watts), for 82 customers in three zip codes in the state of Washington. The project was carried out between April 1 st 2006 and March 31 st 2007 with a resolution of 5 minutes in recording the meter readings [9]. The customers were divided into three categories based on the pricing policy that was assigned to them: real-time price (RTP), time-of-use price (TOU) and fixed price groups. One of the primary goals of this research is to model the effect of price variations on the demand pattern, thus we only study the behavior of RTP customers (25 households) who received price signals every 15 minutes. We process the data to produce three data sets, namely, 1. All RTP customers with 15 minutes time resolution. 2. All RTP customers with hourly time resolution. 3. Each RTP household with hourly time resolution. 70 Resolution refers to the timing scale that we pick for aggregating price and con- sumption data. 
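A sketch of the aggregation step that produces these three data sets from the raw 5-minute readings is given below. The column names (timestamp, customer_id, watts, price) are assumptions made for the example, and whether within-interval readings should be summed or averaged depends on whether they record energy per interval or average power.

```python
# Sketch of building the three data sets from raw 5-minute smart meter records.
# Assumed layout: one row per (timestamp, customer_id) with columns
# "watts" (consumption reading) and "price" ($/kWh at that time).
import pandas as pd

def build_datasets(raw: pd.DataFrame):
    raw = raw.set_index("timestamp")

    # (1) all RTP customers together, 15-minute resolution
    agg_15min = (raw.groupby(pd.Grouper(freq="15min"))
                    .agg(demand=("watts", "sum"), price=("price", "mean")))

    # (2) all RTP customers together, hourly resolution
    agg_hourly = (raw.groupby(pd.Grouper(freq="1h"))
                     .agg(demand=("watts", "sum"), price=("price", "mean")))

    # (3) each household separately, hourly resolution
    per_household = (raw.groupby(["customer_id", pd.Grouper(freq="1h")])
                        .agg(demand=("watts", "sum"), price=("price", "mean")))

    return agg_15min, agg_hourly, per_household
```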
In this setting, we consider varying granularities in both temporal and spatial dimensions and we seek to comment on their effect on the prediction accuracy. In case of aggregated customers, we extract the following attributes from data: month of year, time index of day (based on 15 minutes or hourly time res- olutions), outside temperature, electricity price, day of week, is it weekend or not, demand at the current time step, demand at previous time step and demand at two previous time steps. Additionally, when dealing with the problem at the households level we also include customer ID (which uniquely distinguishes each household), aggregate demand of all customers, usage base and usage variations. The last two attributes represent the mean and standard deviation of consumption data for a par- ticular household. The underlying idea is that they are supposed to characterize the household’s consumption behavior. For the sake of space saving, these variables are denoted as follows in the associated graphs: month, timeindex, temperature, price, dayofweek, isweekend, demand t , demand t−1 , demand t−2 , C ID , agg-demand, use-avg and use-var. It is evident that some of the involved variables in our current problem take continuous values and some other take discrete values. Since we would like to use a discrete Bayesian network as our predictive model, we need to discretize continuous variables. To do so we use Fayyad and Irani [18] discretization method which is a widely used technique in the machine learning community. This method partitions the intervals according to a criterion that is based on minimum description length principle. Since, this is a supervised method we need to introduce the class variable tothealgorithm. Theclassvariableneedstobediscretebeforehand. Inourproblem, we take the demand at current time as the class variable and discretize it to k bins. 71 Then we use Fayyad and Irani technique to discretize other continuous variables. We use three different discretization criteria, namely equal width intervals, equal frequency intervals and k-means clustering for discretizing the demand variable. In general, increasing the discretiztaion granularity introduces extra param- eters to the model. When the sample size is not big enough, this increase in the number of parameters will degrade the accuracy of the model. Therefore, we should always be mindful to keep a good balance between number of parameters and the sample size. The greater the number of observations, the finer granularity of the variables permitted. 3.3.2 Bayesian Networks For Modeling Aggregate Demand In order to learn a Bayesian network structure we need to choose a scoring metric as well as a search strategy. In this regard, we pick BDeu score and use tabu search algorithm in searching for the optimal network. We initially start with some arbitrary network structure and find its score. Then the score change for all legal arc operations, one at a time, is calculated. By arc operations we mean arc addition, arc deletion and arc reversal. Next, we pick the arc operation with the highest score gain and apply it. We continue this procedure until a desired convergence rate in score is satisfied. This algorithm is not guaranteed to find a global optimum but only a local optimal structure. To avoid this problem, we can repeat the procedure for multiple initial structures and pick the best one. We use bnlearn package from R repository to conduct the simulations ([69] and [57]). 
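The BDeu score that the tabu search maximizes is computed internally by bnlearn; the standalone sketch below of the local score for a single node, i.e. the logarithm of that node's factor in (3.11) under the uniform prior (3.13), is included only to make explicit what is being optimized. Function and argument names are illustrative.

```python
# Standalone sketch of the BDeu local score for one node given a candidate
# parent set (log of the node's factor in Eq. (3.11), uniform prior Eq. (3.13)).
import numpy as np
import pandas as pd
from scipy.special import gammaln

def bdeu_local_score(data: pd.DataFrame, node: str, parents: list, n0: float = 10.0) -> float:
    r_i = data[node].nunique()                    # number of states of the node
    if parents:
        groups = data.groupby(parents)[node]      # one group per parent configuration j
        q_i = int(np.prod([data[p].nunique() for p in parents]))
    else:
        groups = [(None, data[node])]
        q_i = 1
    a_ijk = n0 / (r_i * q_i)                      # prior pseudo-count N'_ijk
    a_ij = n0 / q_i                               # prior pseudo-count N'_ij
    score = 0.0
    for _, column in groups:
        n_ijk = column.value_counts().to_numpy()  # observed counts N_ijk
        n_ij = n_ijk.sum()
        score += (gammaln(a_ij) - gammaln(a_ij + n_ij)
                  + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk)))
    return score   # unobserved parent configurations contribute zero

# The network score is the sum of local scores over all nodes; the tabu search
# proposes arc additions, deletions and reversals and keeps the best-scoring DAG.
```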
At this point, after learning the Bayesian network structure and parameters, we have fully specified the joint probability distribution. To evaluate the quality of the learned model we need to assess its goodness of fit on a training data set. Moreover, we can use a test data set to quantify the generalization power of the learned model when applied to new data. We consider 90% of the data as the training set and 10% as the test set. In our problem, we pick the root mean squared error (RMSE) to evaluate the performance of the model in predicting demand. RMSE is defined as

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}    (3.20)

where y_i denotes the observed value of the variable y, \hat{y}_i represents the value predicted by the model and n is the number of observed samples. In our problem, we take \hat{y}_i to be the mean of the learned probability distribution for the node demand_t. Since we discretize the domain of demand_t into bins, the BN model yields a probability mass function over the bins; we use the bin midpoints to compute the distribution mean and standard deviation. It should be noted that RMSE is a scale-dependent metric, meaning that its value depends on the range of the variable. Thus, we also report the variable range with our results.

We take two different approaches to the problem of predicting the aggregate demand. First, we consider only discrete Bayesian networks and investigate the effect of the discretization method and the number of bins on the results. This approach is applied to demand prediction at hourly and 15-minute time scales and the results are discussed. While discretization is one way to handle continuous variables, the other possible approach is to fit a continuous distribution to the variables. Accordingly, we compare prediction results using a Gaussian Bayesian network (GBN) versus a discrete BN for hourly demand prediction.

Table 3.1: RMSE in predicting aggregate demand using a discrete Bayesian network. Column groups correspond to the discretization method; subcolumns give the number of bins.

          Cluster               Equal width           Equal frequency
          5      30     50      5      30     50      5      30     50
Hourly    9.29   8.48   8.50    8.56   8.58   8.54    11.03  8.60   8.77
15-min    2.09   1.73   1.75    1.84   1.73   1.74    3.33   1.86   1.80

Effect of the Discretization Approach on Modeling Demand at 15-Minute and Hourly Time Scales

Table 3.1 illustrates the effect of the discretization method and the number of bins on the model error for the 15-minute and hourly time resolutions. Here, the ranges of demand for the hourly and 15-minute time resolutions are (7.19−125.6) and (2.45−37.66), respectively. Given these ranges and the RMSEs, we can claim that the BN models give an acceptable error rate and are thus considered good models for prediction. When the number of bins is small, fewer parameters are required to calibrate the model, but the prediction can only be performed at a coarser level. By increasing the number of bins, we are able to predict at a finer scale, at the cost of more parameters to be estimated. In general, the size of the data set and the desired granularity for prediction influence the choice of the discretization scheme. The discretization scheme may also influence the shape and smoothness of the predicted probability distribution. We show this effect in Figure 3.7, where three BNs with different discretization schemes are used to predict ρ(demand_t | price > 48). We use the junction tree algorithm from the gRain R package to infer any desired (conditional) probability distribution from the learned BN model [32]. The resulting distribution is a histogram over the bins.
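For illustration, the junction-tree query and the subsequent smoothing used for figures such as Figure 3.7 can be sketched as follows; the gRain calls are indicative and the evidence bin label is hypothetical (it depends on how price was discretized).

```r
library(gRain)

# Compile the fitted discrete BN from the previous listing into a gRain object,
# enter hard evidence on the price node, and read off the pmf of demand_t.
jt    <- compile(as.grain(fitted))
jt.ev <- setEvidence(jt, nodes = "price", states = "(48.2,56.1]")   # hypothetical bin
pmf   <- querygrain(jt.ev, nodes = "demand_t")$demand_t             # histogram over bins

# Smooth the histogram with a Gaussian kernel density estimate over the bin
# midpoints, weighted by the inferred probabilities.
mids <- sapply(strsplit(gsub("[][()]", "", names(pmf)), ","),
               function(v) mean(as.numeric(v)))
plot(density(mids, weights = pmf / sum(pmf)), main = "p(demand_t | price bin)")
```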
We have used kernel density estimation with Gaussian kernels to assign a continuous distribution to the learned histograms. Fig.3.7 shows that the equal width discretization gives the smoothest PDF. It should 74 Figure 3.7: Effect of discretization scheme on predicting hourly aggregate demand. be mentioned that data outliers should be removed before using equal width dis- cretization method since they can dramatically influence the results. Hereafter, we pick the BN obtained from discretizing the demand to 30 equal width intervals for further analysis while we emphasize that this is a modeling choice. The learned BN structures for hourly and 15 minutes time scales using 30 equal width intervals to discretize demand t are depicted in Figures 3.8 and 3.9 respectively. As discussed in section 3.2.3, the pure structure of the learned BN can be used to infer condi- tional dependencies and independencies among the variables. The Markov blanket 75 Figure 3.8: Learned Bayesian network structure for predicting aggregate demand- hourly time scale. ofdemand t for both models isdemand t−1 , which means that demand at the previous time step can be used to fully specify the demand at a desired time step. More- over, the model can be used to answer any "top down" or "bottom up" probabilistic query. For instance using inference algorithms such as junction tree algorithm we can quantify the uncertainty for ρ(demand t |price,timeindex) or ρ(price|demand t ) where the second type of query is useful for diagnostic purposes. We can use the learned model to predict the system’s behavior for the case of unobserved situations. We use the test datset to demonstrate the generalization power of the model. Figure 3.10 illustrates the model results for predicting demand 76 Figure 3.9: Learned Bayesian network structure for predicting aggregate demand-15 minutes time scale. using 50 random samples from the test data set for 15 minutes time scale. Here, the black points show the predicted mean and the shaded area represents 2σ uncertainty bound around the mean. The red squares represent the actual observed value for the demand. We can see that almost all of the observed values fall within the predicted range. 77 Figure 3.10: Uncertainty interval in predicting aggregate demand at 15 minutes time scale using discrete BN model-red squares represent the observed demand, the black dots are the predicted mean and the blue bars correspond to 2σ interval around the mean. Gaussian vs. Discrete Bayesian Networks Discretization is one way to deal with continuous variables in BNs. The other most widely used model to deal with continuous variables is to use the so-called Gaus- sian Bayesian networks (GBN). Here, we assume that each variable has a Gaussian distribution where its mean is a linear combination of its parents’ values and its standard deviation is constant. It is shown [21] that given these local distributions 78 Table 3.2: RMSE in predicting hourly aggregate demand by discrete and Gaussian Bayesian networks. Gaussian BN discrete BN-6 variables Hourly 8.21 8.58 15-min 1.68 1.73 the global joint PDF over all random variables would be a multivariate normal dis- tribution. Similar to discrete Bayesian networks a Bayesian scoring metric can be used in searching for the best BN model when the structure is unknown. This score is called BGe which represents a Bayesian and score equivalent metric for Gaussian networks. For a detailed discussion about GBNs and the Gaussian score refer to [21]. 
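A Gaussian counterpart of the model, learned with the BGe score just introduced, can be sketched as follows; the data frame and column names are illustrative, and bnlearn's mb() is used to read off the Markov blankets compared below.

```r
# Drop the three discrete variables, learn the structure with the BGe score, and
# fit linear-Gaussian local distributions ('hourly' is a hypothetical data frame
# of the continuous hourly variables on their original, undiscretized scale).
cont.vars <- c("timeindex", "temperature", "price",
               "demand_t", "demand_t1", "demand_t2")
gbn.dag <- tabu(hourly[, cont.vars], score = "bge")
gbn.fit <- bn.fit(gbn.dag, hourly[, cont.vars])   # linear-Gaussian CPDs

mb(gbn.dag, "demand_t")   # Markov blanket implied by the Gaussian model
mb(best, "demand_t")      # compare with the discrete model learned earlier
```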
In this research we have also considered the Gaussian BN model. The idea is that discretizing continuous variables (even with a fine resolution) leads to some information loss, while using continuous distributions eliminates this problem. However, by discretizing the variables in the BN we are not imposing any particular distribution on the data and the approach is totally data-driven. On the contrary, in Gaussian BNs we impose a strong assumption on the variables, namely that they follow a Gaussian distribution. In practical applications this assumption almost always deviates from reality, but it might still be a good choice depending on the specific data.

In this work, the original data consist of both discrete and continuous variables (a hybrid problem). In order to apply the Gaussian BN we disregard three discrete variables, namely month, dayofweek and isweekend, since assuming a Gaussian distribution for them is meaningless. Then, we learn a Gaussian and also a discrete BN for the remaining 6 variables. As before, we use equal width discretization with 30 bins for learning the discrete BN. The RMSE for these two models is tabulated in Table 3.2. Based on Table 3.2 we observe that the error in predicting demand is lower when using the Gaussian BN, however the improvement is not significant. Figures 3.11 and 3.12 show the structures of the two learned BNs for the hourly time scale when using discrete and Gaussian distributions respectively. We can see that these two graphs imply different conditional independence assertions. For instance, the Markov blanket of demand_t in the discrete BN model consists of {demand_{t−1}, temperature} while in the Gaussian BN it is {demand_{t−1}, demand_{t−2}, price, temperature}. Besides, price and demand_t are directly dependent in the Gaussian BN, while in the discrete BN they are dependent through temperature or demand_{t−1}.

Figure 3.11: Learned discrete Bayesian network structure for predicting hourly aggregate demand.

Figure 3.12: Learned Gaussian Bayesian network structure for predicting hourly aggregate demand.

To get more insight regarding the differences between the two models, we compute the probability distribution of an arbitrary conditional query using both of them. Figure 3.13 illustrates the unconditional distribution of demand_t as well as the conditional distribution ρ(demand_t | price > 48) using the discrete and Gaussian BNs. It is observed that the general shape of the conditional PDF obtained from the discrete model is more similar to the unconditional distribution. This result is somewhat expected, since the discrete BN is a nonparametric data-driven model that tries to capture the true distribution while the Gaussian BN always assigns a Gaussian PDF to the variable.

Figure 3.13: Comparison of the conditional probability distribution of hourly aggregate demand given price, ρ(demand_t | price > 48), using Gaussian and discrete BNs.

3.3.3 Bayesian Networks for Modeling Each Household's Demand

The advent of smart meters enables decision makers to set individualized policies for each particular customer in the electricity market. By studying the consumption behavior of each household, strategists are able to target certain groups of customers and enforce appropriate policies on them in order to shape the market as needed.

Figure 3.14: Learned Bayesian network structure for predicting each household's hourly demand.
The first step in designing individualized policies is to have a predic- tive model that can envision the expected behavior of the customers in response to the implemented policies. The traditional regression models used in predicting the demand commonly are not successful in predicting at the households scale. They will need too may parameters to tune and often get too complicated to handle easily. Here, we propose the use of Bayesian networks to predict demand at the households level and also to capture the dependencies between the involved variables. 83 Figure 3.15: Uncertainty interval in predicting each household’s hourly demand using BN model-red squares represent the observed demand, the black line is pre- dicted mean demand and the shaded area corresponds to 2σ interval around the mean. The data set used to train the model has 139277 samples which consists of the meter readings for 25 customers over the course of a year as discussed earlier. We use hourly scale to aggregate the price and consumption data. Thus each sample represents the consumption of one particular customer at some specific time subject to some price and temperature values. We use 30 equal width intervals to discretize the demand and then Fayyad and Irani algorithm [18] is applied to the rest of the 84 Figure 3.16: Probability of demand for customer number 671, at time index 10, for three different price ranges. continuous variables. The learned BN is depicted in Figure 3.14. The RMSE in predicting demand using this model is 1.18 while the demand range is (0.02− 10). This error is relatively higher than the error of predicting the aggregate demand for all customers. This was expected, since the demand has much more variations at the household’s scale while in general aggregating smooths the behavior. The Markov blanket of demand t in this problem is consisted of demand t−1 and use-avg. The average of usage is a variable that reflects the customer’s consumption pattern. 85 Figure 3.17: Probability of demand for customers number 707, 855, and 867, at time index 10 given a price range between 51 and 59 $/Mwh. Figure 3.15 illustrates the predicted mean and an error bound in predicting demand t for 50samplespickedrandomlyfromthetestdatasetusingthelearnedBN. The black line represents the mean of the predicted distribution and the shaded area corresponds to 2σ deviation from the mean. Additionally, the red squares denote the actual observed value of demand t for each sample of the test data set. The figure shows that the majority of the red squares fall within the error bound limits and many of them are close to the mean. This suggests that the model is pretty successful in predicting the demand for unseen cases. 86 TheC ID variable in the network represents a unique ID that has been assigned to each customer, we treat this ID as a random variable, so that we can conduct probabilistic queries for each particular household. For instance we can compute ρ(demand t |timeindex = 10,C ID = 671,price) using the junction tree algorithm on the learned BN to quantify the effect of price on demand distribution at 10 am for a specific customer withC ID = 671. The result is depicted in Figure 3.16. A strong trend is observed in this figure as the unit price is raised with a marked decrease in low-level demand, matched by an increase in high-level demand. This may be attributed to a strong correlation between external effects (such as temperature) and demand paired with a pricing mechanism that anticipates this correlation. 
Thus the low-level demand is not associated with external effects and was thus reduced as price was increased. The high level demand, on the other hand, could be associated with external effects and its rise coincides with an increase in price to mitigate peak usage (eg. the high-level rise might have been more pronounced without such a price increase). We may run any desired query on the learned BN and get probability distri- bution of the target variables as we need. As another application, we demonstrate the capability of the model in predicting demand given a particular price value and a specific time for different customers. The demand probability distribution for 3 different customers is depicted in Figure 3.17. It is observed that the demand of individual customers may deviate significantly from aggregate demand, and the ability of the present BN model to anticipate both types of demand demonstrates its value for a wide range of potential users. 87 Chapter 4 Data-Driven Characterization of Reservoirs in Gulf of Mexico 88 4.1 Introduction Predictive modeling of fluid flow in hydrocarbon reservoirs is traditionally carried out using a high resolution representation of the reservoir. The detailed reservoir model is commonly learned through an iterative inverse modeling procedure known as history matching. In this regard, the reservoir parameters are updated in an iterativemannerusingdynamicdata, suchasproductiondataorseismicdata. These approaches entail numerically solving a set of governing partial differential equations whilesimultaneouslyexplaininghighdimensionalparameterspace. Modelingerrors, lack of sufficient spatial data and the nonlinearity between model parameters and observed quantities add another challenging aspect to these problems [36]. An alternative to the computationally expensive full scale numerical simula- tion, istousesurrogatemodels. Inthiscontext, asurrogatemodelisanapproximate model that can estimate the desired quantity given a set of input parameters [28]. The surrogate model can serve as a rapid risk assessment tool in case of anthro- pogenic events such as Deepwater Horizon oil spill [76]. In this incident, it took almost two months to conduct reliable model simulations to estimate the dam- age and contamination level [49]. The modeling cost of the associated prediction problems was further augmented by the scarcity of observed data as well as the uncertainty in the leakage mechanisms. Inthischapter, weproposeadata-drivenapproachbasedonBayesiannetworks as a surrogate model to predict wellbore signatures over the Gulf of Mexico. The representation entails exploring the probabilistic dependencies between the input parameters space that characterize the reservoir properties and the desired quantity of interest (QoI). In this regard, the goal is to conduct conditional density regression 89 of a stochastic process over the Gulf scale. Thus, we are interested in estimating the conditional density of QoI at a new location with limited data on the parameters space. Bayesian networks are probabilistic graphical models that are highly suited to probabilistic predictions in high dimensional settings. The high dimensionality of the parameters space is another challenging aspect of the modeling process. However, despite their apparent high dimensionality, the data might be efficiently characterized by a low dimensional representation which captures the major fluctuations in the data. 
In this setting, we seek to explore the inherent structure of the data in order to present a more compact, yet efficient, model for predicting the reservoir behavior. In this research, we use locally linear embeddingmethodasatooltofindthenonlinearmappingofthedataonaninherent manifold (if any). The low dimensional representation of the data is considered as new parameters space which can be alternatively used for prediction purposes. We compare the predictive power of our Bayesian network model for high dimensional and low dimensional settings. The chapter is organized as follows: section 4.3 presents an introduction to Bayesian networks which constitute the predictive data-driven model used in this research. The experimental results of applying the proposed model to real data set in Gulf of Mexico is elaborated in section 4.4. 4.2 Problem Definition Deepwater Horizon oil spill in Gulf of Mexico (GoM) has triggered many questions regarding the efficiency and validity of traditional methods in predicting wellbore signatures due to unanticipated events. During this event it took almost two months 90 to identify reliable wellbore signatures (such as flow rates) to be used in estimating the degree of contamination in GoM [49]. A particularly challenging aspect of this prediction task was the scarcity of data as well as measurement uncertainties. This event made the researchers to think about other procedures to be used for rapid risk assessment in case of similar events. The dominant approach in oil and gas industry towards prediction and fore- casting is based on physical modeling. For instance, in history matching the pro- duction data is used to build a physical representation of the reservoir that can be subsequently used to predict desired quantities such as the future production rate in a well. However, in some cases we encounter such a complex system that the physi- cal laws are not capable to efficiently capture the inherent phenomenon. Moreover, as discussed earlier, the traditional approaches are so computationally intense that lose their usefulness in unanticipated accidents. A newly emergent approach in the industry is based on data analytics that starts with the data and tries to deduce the physics by analyzing the data. While this approach is not a substitute for domain expertise, it can serve as a rapid analysis tool when the physics are too complex. Data mining tools have been used in the oil and gas industry for a long time. However, they have mostly been applied in pilot projects and case studies rather than being used as a standard industry solution [7]. Also, the extent of the artificial intelligence (AI) techniques has been often limited to some traditional tools such as neural networks, fuzzy logic and genetic algorithms. It is evident that application of AI tools in the oil and gas industry has not evolved with the advancements in the artificial intelligence literature itself. The authors in [7] have conducted a survey study on a group of professionals in the exploration and production industry regarding the application artificial intelligence (AI) tools in their fields. The results 91 revealed that while many of respondents declared they were using some level of data mining in their jobs, the general perception of current AI literature was very low. This suggests that there is a big room for improvement in terms of applying machine learning and data analytics in the oil and gas industry. 
Uncertainty quantification and probabilistic modeling are features that have rarely been addressed in the data analysis approaches currently adopted in the oil and gas industry. This is partly due to the limitations imposed by the data mining tools that are used. For instance, the widely used neural networks act like a "black box"; they do not reveal any insight regarding the semantics of the problem and they fail to account for uncertainty as well. In this research, we propose using Bayesian networks for data analysis and uncertainty quantification in hydrocarbon reservoir modeling. Very few studies have used Bayesian networks to solve problems in the context of oil and gas reservoirs. Bayesian networks are used in [48] to recommend an exploration strategy under uncertainty. [1] suggests using Bayesian networks to quantify the probability distribution of promising solutions in the context of history matching. The authors in [22] propose using Bayesian networks to build an assistant tool for making operational decisions based on real-time sensor readings in typical scenarios.

In this chapter, we propose a purely data-driven approach to estimate wellbore signatures in the GoM using Bayesian networks. The proposed procedure is able to rapidly estimate the QoI in the presence of uncertainty and scarce data. Additionally, we explore the internal structure of the parameter space that describes the reservoir properties. The structure inferred from the data can be used to introduce new variables that adequately represent the initial space. This is known as dimension reduction. We study the reduced parameter space and investigate its applicability in predicting the desired quantities of interest at the Gulf scale.

4.3 Gaussian Bayesian Networks

In the previous chapter we described a Bayesian network over a discrete set of random variables. In this chapter we use the same notion of graphical models but consider a continuous domain instead of a discrete one. Here the conditional probability distribution (CPD) of each node given its parents is a continuous PDF. We assume that this PDF is Gaussian and elaborate on how to learn its structure and conduct probabilistic queries on it. Suppose that the joint probability density function over X is a multivariate Gaussian distribution. Thus,

\rho(X) = \mathcal{N}(X \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)\right)    (4.1)

where μ is the n-dimensional mean vector and Σ is the n×n covariance matrix of X. According to [21], this distribution can be written as a product of conditional normal distributions,

\rho(X) = \prod_{i=1}^{n} \rho(X_i \mid X_1, \cdots, X_{i-1})    (4.2)

\rho(X_i \mid X_1, \cdots, X_{i-1}) = \mathcal{N}\left(\mu_i + \sum_{j=1}^{i-1} b_{ij}(X_j - \mu_j),\; 1/\nu_i\right)    (4.3)

where μ_i is the unconditional mean of X_i and ν_i is the conditional variance of X_i given X_1, ··· , X_{i−1}. Moreover, b_{ij} is a coefficient that captures the linear relationship between X_i and X_j [21]. Accordingly, we may consider a Bayesian network equivalent to this multivariate normal distribution in which the local PDF associated with each node is defined based on (4.3). Additionally, b_{ij} = 0 (j < i) implies that X_j is not a parent of X_i. We call this form of Bayesian network a Gaussian Bayesian network (GBN). To sum up, a Gaussian Bayesian network is defined as a pair (S, Θ) where,

• S is the network structure in which the variables X_1, ··· , X_n define the nodes and there is no arc from X_j to X_i whenever b_{ij} = 0 (j < i).
• The model is parameterized through μ = (μ 1 ,··· ,μ n ),ν = (ν 1 ,··· ,ν n ) and {b ij |j <i} and Θ represents the collections of these parameters. • The conditional probability distribution (CPD) for each node is defined as a linear Gaussian distribution as in (4.3) and thus the GBN encodes the multi- variate Gaussian distribution as in (4.1). Recursive methods have been developed [71] to calculate the multivariate nor- mal density parameters (i.e. μ, Σ) using the equivalent Gaussian Bayesian network parameters (i.e. μ, ν,{b ij |j <i}) and vice versa. It has also been argued that GBN representations are more appropriate than standard multivariate normal dis- tributions in terms of modeling suitability. To fully specify a Gaussian Bayesian network, we need to specify the the unconditional mean of each variable (μ i ), the relative significance of each parent in determining the values of its child (b ij ) and a conditional variance for each variable given that its parents are fixed (v i ). On the other hand, in characterizing a normal distribution directly, we need to guarantee that the covariance matrix is positive-definite, which is usually addressed by some ad hoc manner [21]. 94 4.3.1 Learning Gaussian Bayesian networks-Bayesian Gaus- sian equivalent score Learning a Bayesian network consists of learning the graph structure and the cor- responding CPD parameters that best fit the available data. In general, Bayesian network structure learning has two main applications: knowledge discovery and den- sity estimation. By knowledge discovery we mean that a BN structure can efficiently reveal the conditional independencies between variables. On the other hand, the BN structure can be used in conjunction with the CPD parameters to uniquely specify a joint density function over the random variables. These two tasks cannot be carried out completely independently. First, in order to learn the structure, we need some estimation of the parameters to quantify the goodness of fit. On the other hand, to estimate the conditional probabilities, we must know the graphical structure in advance. Generally, there are two categories of BN structure learning. The first one is search-and-score which assigns a score to each possible BN structure based on its match to the observed data and then searches for the best network. The other one is called constraint-based algorithm, which runs conditional independence tests to learn the structure [29]. Inthischapter, weareonlyinterestedinthesearch-and-scoreapproach. Inthis setting, learning a Bayesian network consists of a search conducted over some search space to optimize a scoring metric. The scoring functions are based on different principles, such as information theory and minimum description length([10], [40]), or Bayesian metrics ([21]). As far as the search algorithm is concerned, since finding the globally optimal graph takes exponential time in most cases, some heuristic 95 searchmethodiscommonlyused. Thiscanbegreedysearch(suchashill-climbingor Tabu search), simulated annealing or genetic algorithms to name the most popular ones. At last, the search space is dominantly the DAG space, however, it can be either equivalence classes of DAGs or orderings over the network variables (an equivalence class refers to a set of DAGs in which all the members imply the same set of conditional independence assertions). In this chapter, we focus on the Bayesian score for Gaussian Bayesian networks [21]. 
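As a small illustration of score equivalence, two DAGs in the same equivalence class (same skeleton, no differing v-structures) receive identical BGe scores, so a search over DAGs effectively ranks equivalence classes. The data frame and the variable names below are placeholders borrowed from the reservoir attributes used later in this chapter.

```r
library(bnlearn)

# Two Markov-equivalent chains: both encode PERMEABILITY _||_ POROSITY | OPR.D1.
chain1 <- model2network("[PERMEABILITY][OPR.D1|PERMEABILITY][POROSITY|OPR.D1]")
chain2 <- model2network("[POROSITY][OPR.D1|POROSITY][PERMEABILITY|OPR.D1]")

vars <- c("PERMEABILITY", "POROSITY", "OPR.D1")
score(chain1, gom.data[, vars], type = "bge")   # gom.data: hypothetical data frame
score(chain2, gom.data[, vars], type = "bge")   # identical value to the line above
```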
Let D ={U 1 ,··· ,U m } be the set of m observed instances of U (where each U i assigns values to all variables in X). Our goal is finding a network S that best matches the data set D. The derivations in this section are mostly based on [21]. Posterior probability of a network structure may be used as a Bayesian metric to score a structure, ρ(S|D) = ρ(D|S)ρ(S) P S ρ(D|S)ρ(S) (4.4) in which the denominator is a normalization constant. Calculating this constant requires summing over all possible structures which is computationally expensive even for small domains. Thus, we resort to computing the numerator as the network score. Thus, the Bayesian score is defined asρ(D,S) =ρ(D|S)ρ(S). We first discuss the derivation of ρ(D|S c ) assuming that S c represents the structure of a complete GBN (i.e. with no missing edges). Geiger and Heckerman make some assumptions and compute ρ(D|S) for a general GBN using the ρ(D|S) for complete GBNs [21]. 96 As discussed earlier, assume we havem observed samples of data where each sample represents an instantiation over all n random variables. Therefore, ρ(D|S c ) = m Y k=1 ρ (U k |U 1 ,··· ,U k−1 ,S c ) = m Y k=1 Z ρ (U k |μ,W,S c )ρ (μ,W|U 1 ,··· ,U k−1 ,S c )dμdW (4.5) Here, we assume that the data set is a random sample drawn from a multivari- ate normal distribution with unknown mean μ and unknown precision matrix W. By definition, the precision matrix is the inverse of the covariance matrixW = Σ −1 . Thus, the first term in the integral has a multivariate normal distribution. If the second term is the conjugate for multivariate normal distribution then the integral can be solved analytically. We know that the conjugate for multivariate normal is the normal-Wishart distribution. Therefore, we would like the second term in the integral to follow normal-Wishart distribution. By making appropriate assumptions we can force this term to have normal-Wishart distribution. In fact, there is a the- orem which provides our desired goal [15]. Consider the following expansion of the posterior distribution over μ and W, ρ (μ,W|C 1 ,··· ,C L ) =cρ (C 1 ,··· ,C L |μ,W )ρ (μ,W ) (4.6) According to this theorem, if C 1 ,··· ,C L is a random sample from a multivariate normal distribution with unknown μ and W and ρ (μ,W ) has a normal-Wishart prior distribution then the posterior would also have a normal-Wishart distribution. Now, we can reexamine (4.5). We already assumed that U 1 ,··· ,U k−1 is a random sample from a multivariate normal distribution with unknown μ andW. If we further assume that ρ (μ,W|S c ) has a normal-Wishart distribution, then based 97 on the above theorem we can claim that ρ (μ,W|U 1 ,··· ,U k−1 ,S c ) has a normal- Wishart distribution. At this point (using the discussed assumptions) we can find a closed form solution for the integral in (4.5). It is well known that the integral of the product of a multivariate normal by the appropriate normal-Wishart distribution is ann dimensional multivariatet distribution. For more details you may refer to [21]. Combining these facts, (4.5) is computed as, ρ (D|S c ) = (2π) −nm/2 ( λ λ +m ) n/2 c(n,α) ) c(n,α +m)|T 0 | α/2 |T m | −(α+m)/2 (4.7) where λ, α and T 0 are the parameters of the prior normal-Wishart distribution. 
Suppose that the prior joint distribution of μ and W is the normal-Wishart distri- bution: the conditional distribution ofμ givenW isN (μ 0 ,λW ) such thatλ> 0 and the marginal distribution of W is a Wishart distribution with α>n− 1 degrees of freedom and precision matrix T 0 , denoted by w(α,T 0 ). Moreover, c(n,α) is defined as, c(n,α) = " 2 αn/2 π n(n−1)/4 n Y i=1 Γ( α + 1−i 2 ) # −1 (4.8) Additionally,T m is the precision matrix of the posterior normal-Wishart distribution for which the derivation can be found in [21]. Atthisjuncture, wehaveananalyticexpressionforρ(D|S c )asin(4.7). Intheir paper [21], Geiger and Heckerman derive the Bayesian score for a general Gaussian Bayesian network based on ρ(D|S c ) as follow, ρ(D|S) = n Y i=1 ρ(D X i Pa(X i ) |S c ) ρ(D Pa(X i ) |S c ) (4.9) Here, D X i Pa(X i ) refers to data set D restricted to variables in X i ∪Pa(X i ). BY multiplying (4.9) byρ(S), a metric for an arbitrary Gaussian BN is obtained where 98 ρ(S) is commonly assumed to have uniform distribution. This score is called BGe which stands for Bayesian metric for Gaussian networks having score equivalence. Score equivalency is a property that declares BN structures that imply the same set of conditional independencies should have the same score. 4.3.2 Approximate Inference in Bayesian Networks Wediscussedvariableeliminationandjunctiontreealgorithmsastoolstomakeexact inference in Bayesian networks in previous chapter. Although these exact inference algorithms improve the computation process, they are still very slow for problems withmanyvariables. Thereforeresearchersresorttoapproximationtechniques. Var- ious approaches exist to make approximate inference in Bayesian networks such as variational inference, loopy belief propagation and stochastic simulation techniques. Stochastic simulation, a.k.a sampling, is done by drawing a large number of random configurations over the random variables using the distribution encoded by Bayesiannetwork. Then, anyprobabilisticquantitycanbeapproximatedusingthese samples. Different algorithms have been proposed to perform this type of sampling. Probabilistic logic sampling [31], likelihood weighting [20] and Gibbs sampling [23] are among the most popular methods in this context. Logic sampling is a form of rejection sampling in which we start with a topological order of variables and sample from them one by one. After one round of sampling from all variables we may keep the samples if they comply with the evidence, if not we may throw the sample away. This procedure is continued until we gather enough samples. The samples are then used to answer any arbitrary query about the domain variables. Here, we use logic sampling to conduct approximate inference on Bayesian networks. We elaborate logic sampling algorithm through an example. Consider the Bayesian 99 A B C D E Figure 4.1: An example network to perform logic sampling [58]. network illustrated in Figure 4.1 and for now assume that there is no observed variables. We can iteratively sample each of the variables to get a configuration over all the variables. First we sample from variable A which has a Gaussian distribution. The conditional distribution of B and C is linearly Gaussian and their mean depends on the value of their parent i.e. A. Knowing the sampled value of A we can sample from the corresponding Gaussian distributions of B and C to get a sample for them. This procedure continues until we also have a state for D and E. 
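A minimal sketch of this forward (ancestral) sampling pass for the network of Figure 4.1 is given below; the arc set A→B, A→C, {B, C}→D, D→E and the linear-Gaussian parameters are made up, since the figure only names the nodes.

```r
# Each node is sampled after its parents, following the topological order A,B,C,D,E.
N <- 10000
A <- rnorm(N, mean = 0,                 sd = 1.0)
B <- rnorm(N, mean = 1 + 0.8 * A,       sd = 0.5)
C <- rnorm(N, mean = 2 - 0.5 * A,       sd = 0.7)
D <- rnorm(N, mean = 0.3 * B + 0.6 * C, sd = 0.4)
E <- rnorm(N, mean = 1 + 0.9 * D,       sd = 0.3)

# Logic sampling handles evidence by rejection: keep only consistent samples.
keep <- E > 2
hist(D[keep], breaks = 40, main = "p(D | E > 2) by logic sampling")
```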
It should be noted that the sampling order is based on the topological ordering of the nodes in the network, we start with nodes with no parents and proceed towards the nodes with no children. At each certain node we sample it based on the its associated probability distribution conditioned on the configuration of the parent variables that have already been samples. This procedure is repeated N times, where N is the sufficient number of samples. Given the simulated samples we are able to compute any desired distribution by counting in the sample set. To handle the evidence we can simply discard the samples that do not con- form to it. Logic sampling algorithm to estimate ρ(X k |e) using N samples can be summarized as follow [58], 100 1. Assume (X 1 ,··· ,X N ) is a topological ordering of the variables and e is the evidence. 2. For j = 1 to N, • For i = 1 to n, Using the conditional distribution ρ(X i |pa(X i ) = π, sample a statex i forX i whereπ is the configuration already sampled for pa(X i ). • If x = (x 1 ,··· ,x n ) is consistent with e then, N(X k =x k ) :=N(X k =x k ) + 1. where x k is the samples state for X k . 3. Return, N(X k =x k |e)≈ N(X k =x k ) P x N(X k =x) . the advantage of the preceding method is that we don’t need to store all the samples configurations. Instead we can only update the counts of all variables when a new sample has been determined. Thus, the method saves much space. However, the drawback of using this algorithm is that when the probability of evidence is low, we need to generate too many samples since majority of them will be discarded. 4.4 Application to the Gulf of Mexico-BOEM Data Set In this research we use a data set containing wellbore data from the Gulf of Mexico provided by the Bureau of Ocean and Energy Management (BOEM) which can be 101 Figure 4.2: Spatial distribution of boreholes over the Gulf of Mexico; source of data [24]. queried online (http : //www.boem.gov). The spatial distribution of one of such data sets is shown in Figure 4.2. The wellbores include both currently producing and formerly producing wells. Our goal is to develop a reduced order model to estimate a QoI at sites with limited data. Here, we pick the oil production rate at day one as our QoI, however, the developed model will be applicable to any other QoI. This QoI was simulated using a physical simulator software for 496 wellbores in GoM [76]. The simulator is fed with a subset of the data with local influence, such as permeability. The authors in [76] assume a stochastic process to model the permeability field and thus they obtain random realizations of production rate at each wellbore. We take the mean of these realizations as our QoI and treat it as the observed value for the oil production rate at day 1 (OPR-D1). We pick 36 relevant attributes, as described in the appendix, to learn the predictive model. 102 In Gaussian Bayesian networks it is assumed that the local density function of each variable follows a Gaussian distribution. This assumption doesn’t always coincide with reality, but we can mitigate its effect by applying a suitable trans- formation to the variables. This transformation is supposed to result in a more symmetric and normal like distribution. Box-Cox transformation and log trans- formation are commonly used to attain this goal. The Box-Cox transformation is defined as, y (λ) i = y λ i −1 λ if λ6= 0 ln(y i ) λ = 0 (4.10) which holds for y i > 0 and λ is the model parameter. 
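A short sketch of (4.10) follows, with λ chosen by maximizing the usual Box-Cox profile log-likelihood; this likelihood-based selection is an assumption, as the text only requires a λ yielding a more symmetric, normal-like distribution.

```r
# Box-Cox transform of a positive variable y ('qoi' stands for the OPR-D1 column).
boxcox.transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
# Profile log-likelihood: -(n/2) log(sigma^2(lambda)) + (lambda - 1) * sum(log y).
boxcox.loglik <- function(y, lambda) {
  z <- boxcox.transform(y, lambda)
  -length(y) / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}
lambdas    <- seq(-2, 2, by = 0.05)
lambda.qoi <- lambdas[which.max(sapply(lambdas, boxcox.loglik, y = qoi))]
qoi.bc     <- boxcox.transform(qoi, lambda.qoi)
```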
We pick Box-Cox transformation for our QoI since it yields a more normal- liked distribution. Then, we apply Box-Cox and log transformations to all the other variables and assess the performance of the predictive model using the GBN. To learn the GBN we use the BGe score as discussed before and apply tabu search to conduct a discrete search over the space of BN structures and the highest scoring BN is chosen as the predictive model. To conduct the discrete search, we start with some arbitrary initial structure and then calculate the score change for all legal arc operations one at a time. By arc operations we mean arc addition, arc deletion and arc reversal. Next, we pick the arc operation with the highest score gain and apply it. We continue this procedure until a desired convergence rate is satisfied. This algorithm is not guaranteed to find a global optimum but only a local optimal structure. To avoid this problem, we can repeat the procedure for multiple initial structures and pick the best one. We use bnlearn package from R to learn the structure of the Bayesian network and conduct inferences [57] and [69]. 103 Table 4.1: RMSE in predicting Box-Cox transformed QoI. Model Train RMSE Test RMSE No transformation 0.183 1.213 Log transformation 0.123 0.143 Box-Cox transformation 0.127 0.141 We divide the data set to training and test data sets and learn the model based on the training set. Then, the generalization power of the model is tested based on the test data set. We consider 90% of the data as training and 10% as the test set. In our problem, we pick root mean squared error (RMSE) to evaluate the performance of the model in predicting QoI. RMSE is defined as, RMSE = v u u t 1 n n X i=1 (y i − ˆ y i ) 2 (4.11) wherey i denotes the observed value for variablesy, ˆ y i represents the predicted value by the model and n is the number of observed samples. In our problem, we take ˆ y i to be the mean of the learned Gaussian distribution for QoI. It should be noted that RMSE is a scale dependent metric meaning that its value depends on the range and scale of the variable. Thus, we also report the variable range in our model. Table 4.1 summarizes the performance of the learned GBNs. Here, we have used Box-Cox transformation for the QoI in all three cases while all other variables are subject to the respective transformations (or no transformation as in first case). RMSE is judged in conjunction with the variable range which is (0.7602-2.9320) in this case (for the Box-Cox transformed QoI). It is evident that a suitable transfor- mation highly affects the generalization power of the model. Moreover, it is observed that both log and Box-Cox transformations when applied to predictor variables give 104 Figure 4.3: Structure of the learned GBN to predict oil production rate at day 1 using 36 attributes-labels shown in appendix A. similar prediction errors. In this chapter, we choose to work with the log transfor- mation of the attributes hereafter (while the Box-Cox transformation is still applied to the QoI). The learned GBN is illustrated in Figure 4.3. Here, the Markov blanket of the OPR-D1 (QoI) consists of the following variables: permeability, porosity, Z, SW, THK, TAREA, OAREA, IOFVF, RECO-AF, WDEP, TI, ORECG, GOR, SDPG, P-CUMBOE). The target node is independent of all other variables given 105 that the Markov blanket variables are observed. However, when they are not all observed the target variable (QoI) would still be dependent on other variables. 
To demonstrate the generalization power of the model to new data points, we randomly pick 45 samples from the test data set and compare the predicted value for QoI using the GBN model with the actual observed value of QoI. Figure 4.4 illustrates the 95% confidence interval around the mean for 45 samples drawn from the test data. The solid line represents the predicted mean value of the QoI and the shaded area corresponds to 2σ bound around the mean which characterizes the 95% confidence interval. Moreover, the dots represent the observed value for the QoI. It can be seen that most of the observed values fall within the 95% confidence interval. This suggests that the proposed GBN has a good performance in predicting the QoI at new locations. The predictions shown in Figure 4.4 are obtained by assuming that all 36 attributes are observed at the unknown location. For instance, it assumes that permeability is measured (or somehow known) at the new location. Nevertheless, one of the advantages of using Bayesian networks is that it equips us with algorithms to conduct conditional inference query given any desired subset of variables. Thus, we can assume, for instance, that the only available data about a new location are its spatial coordinates X and Y. Using sampling methods, we are still able to get realizations of the QoI given values for X and Y. We use logic sampling method in this chapter. Figure 4.5 shows the probability density function of the QoI at two sample locations from the test data set when the only available data is the spatial coordinates of that location. The PDF is obtained by taking many samples from the GBN and keeping only the ones that match the evidence. It should be noted that although the model is learned with the Box-Cox transformed QoI, we have 106 Figure 4.4: 95% confidence interval in predicting oil production rate for 45 instances from test set. The solid line represents the predicted mean and the shaded area corresponds to 2σ deviation from the mean. The red dots show the actual observed values. transformed them back to their original scale in reporting the density distributions to give a better understanding of the physics of the problem. This justifies the non-normality of the illustrated distributions in Figure 4.5. We can see that the GBN proves to be a successful tool in predicting wellbore signatures. It can be used to infer any probabilistic query in a complex domain such 107 Figure 4.5: Inferring the PDF of QoI at two test locations using GBN. as the one studied in this chapter. This enables modelers to investigate the uncer- tainty propagation through the complex system easily. Moreover it can be readily applied for anomaly detection and diagnostic purposes. The most probable cause of an observed outcome (bottom up query) can be computed using the conventional inference algorithms for BN. These unique advantages makes Bayesian networks a unified framework that can be effectively used in predictive modeling as well as diagnostic tasks. Furthermore, BN structure reveals the conditional independence assertions implied by the data. This hands in a valuable tool in the data collection phase. If we know that our desired QoI is independent of variable A given variable B and we have already collected data for variable B, then there is no need to measure and save values of variable A. 
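The conditional inference given only the spatial coordinates can be sketched with bnlearn's logic-sampling routine cpdist(); object and column names are hypothetical, and because the evidence variables are continuous, samples are accepted within a small window around the observed coordinates rather than at an exact match.

```r
library(bnlearn)

# GBN learned on the (transformed) BOEM table; 'gom.data' and 'OPR.D1' are placeholders.
gom.dag <- tabu(gom.data, score = "bge")
gom.fit <- bn.fit(gom.dag, gom.data)

x0 <- gom.data$X[1];              y0 <- gom.data$Y[1]
eps.x <- 0.02 * sd(gom.data$X);   eps.y <- 0.02 * sd(gom.data$Y)
sim <- cpdist(gom.fit, nodes = "OPR.D1",
              evidence = (X > x0 - eps.x) & (X < x0 + eps.x) &
                         (Y > y0 - eps.y) & (Y < y0 + eps.y),
              method = "ls", n = 10^6)

# Back-transform from the Box-Cox scale before inspecting the density (cf. Fig. 4.5).
inv.boxcox <- function(z, l) if (l == 0) exp(z) else (l * z + 1)^(1 / l)
plot(density(inv.boxcox(sim$OPR.D1, lambda.qoi)), main = "p(QoI | X, Y)")
```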
108 Appendix A Description of Attributes Attribute Acronym X Spatial coordinate X Y Spatial coordinate Y Z Subsea depth (feet) WDEP Water depth (feet) SW Water saturation POROSITY Average porosity PERMEABILITY Average permeability THK Total average net thickness (feet) TAREA Total area (acres) TVOL Total volume (acre-feet) OTHK Oil average net thickness (feet) OAREA Oil total area (acres) OVOL Oil total volume (acre-feet) PI Initial pressure (psi) TI Initial temperature (degrees F) RSI Initial solution gas-oil ratio (scf/stb) GOR Gas-oil ratio (Mcf/bbl) ORP Produced GOR for oil reservoirs (Mcf/stb) API Oil API gravity (API units) OIP Oil in place (bbl) ORF Oil recovery factor (decimal) IOFVF Initial Oil Formation Volume Factor (bbl/stb) ORECO Oil reservoirs’ recoverable oil (bbl) ORECG Oil reservoirs’ recoverable gas (Mcf) RECO-AF Recoverable oil per acre-foot (bbl/acre-foot) SDPG Sand pressure gradient (psi per foot) SDTG Sand temperature gradient (degrees F per 100 feet) P-RECOIL Proved recoverable oil (bbl) P-RECGAS Proved recoverable gas (Mcf) P-RECBOE Proved recoverable BOE (bbl) P-CUMOIL Cumulative oil produced (bbl) P-CUMGAS Cumulative gas produced (Mcf) P-CUMBOE Cumulative BOE produced (bbl) DISCOIL Discovered oil (bbl) DISCGAS Discovered gas (Mcf) DISCBOE Discovered BOE (bbl) OPR-D1 Oil production rate at day 1 (bbl/day) 109 Chapter 5 Effect of Dimensionality Reduction on Stochastic Prediction Performance 110 5.1 Introduction High-dimensional data arise in various fields such as, • Sensor arrays: The collected signals by several identical sensors at different locations in seismology or weather forecasting applications. • Image processing: Ann×m pixels photo is a data point withn×m variables and we have a collection of these images (data points). • Multivariate data analysis: Different types of variables are collected for each data point. On the other hand, analysis of high-dimensional data raises the fundamental problem of "curse of dimensionality". These data are hard to visualize and interpret in both spatial and temporal aspects. However, in many applications the dimen- sionality of data is only artificially high and the data can be adequately described in termsoffewervariables. Potentiallythedatacontainsfeaturesthatareeitherredun- dant or irrelevant and thus it is safe to remove them. The redundant variables are those that are correlated with the response variable but are also strongly correlated with other relevant variables and thus they are unnecessary. Dimension reduction results in three potential benefits, namely facilitating interpretation of the physical phenomenon, cheaper computational costs and improved prediction performance by reducing overfitting. Consequently, there are cases where regression in the reduced space is more accurate than the original space. There are two main approaches for dimensionality reduction: feature selection and feature extraction. In feature selection we try to select a subset of the original 111 variables that are most relevant for prediction. However, there might be two vari- ables that are both relevant to output and are also highly dependent in a complex way that we cannot simply drop one of them because each one of them alone cannot capture all the information content they both carry. Feature extraction methods seek to find a new set of transformed variables that are informative but not redun- dant. 
The basic idea here is that the target variable has been literally produced by some latent variables that are not directly accessible and are thus measured in several different but redundant ways. So the goal is to recover those latent variables [43]. In this chapter, we put our focus on feature extraction and specifically its nonlinear version which is known as nonlinear dimensionality reduction or manifold learning. We apply three manifold learning techniques namely, locally linear embed- ding (LLE) [65], Isomap [75] and diffusion maps [11]. We apply these methods on the GoM wellbores data set that was discussed in the previous chapter. The final goal is to identify whether the data points inherently lie on a lower dimensional space such that the predictive performance in the new space is better than or equal to the original space. Dimensionality reduction helps us to explore the underlying structure in the data set and improves the interpretability and visualization of the data. Moreover, Bayesian networks provide a straightforward mechanism for select- ing the most relevant features in predicting a target variable (feature selection). We will briefly discuss this topic in this chapter. This chapter is organized as follows: Section 5.2 discusses three manifold learn- ing methods that we study in this research. Section 5.3 presents the results of apply- ingmanifoldlearningmethodsonGoMwellboresdatasetandtheircomparisonwith 112 the full high dimensional model. Moreover, we briefly discuss the results of using Bayesian networks as a feature selection tool. 5.2 Manifold Learning In general, if we have a collection of dependent variables, the support of their joint distribution does not span the whole space. Thus, dependence results in some structure in the distribution of the variables in space. Accordingly, the data points can form a geometric object in space [43]. The goal of dimensionality reduction is to give a new representation of this object while preserving its structure. Topology studies properties of objects that are preserved through deforma- tions, twisting and stretching. Therefore, it is used to abstract the intrinsic con- nectivity of objects while ignoring their detailed form. The objects of topology are defined as "topological spaces" which can be sufficiently defined in terms of neigh- borhoods. Manifolds are topological spaces that are locally Euclidean (nearly flat on small scales) [43]. In dimensionality reduction context, we assume that the data lie on an embed- ded low-dimensional manifold within the higher-dimensional space. In other words, we assume that there is a hidden structure in the data points that we would like to discover. Figure 5.1 gives a very simple intuition about the concept of manifold learning. To recover the structure of the data, we assume that the observed data points have been generated from latent variables according to a model. This model can be linearornonlinearanddiscreteorcontinuous. Weneedtofindthelowerdimensional 113 Figure 5.1: The distribution of the data points in the high dimensional space may be inherently embedded on a lower dimensional space. The original data in the well-known Swiss roll example lie in a 3 dimensional space while the manifold can be unfolded as a 2 dimensional space. representation of data in such a way that its topological properties (i.e. neighbor- hoods in geometrical view) are preserved. 
Various criteria can be optimized in order to preserve the neighborhood properties such as minimizing reconstruction error, maximizing variance preservation or maximizing distance preservation [43]. Prin- cipal component analysis (PCA) is the well-known linear dimensionality reduction method which assumes the data lies on a linear hyperplane in space and tries to project data into that hyperplane by preserving the variance. One of the initial questions to address is how to estimate the number of latent variables (i.e the dimension of the low dimensional space). Some of the dimension reduction methods are able to estimate this dimension within their algorithm while for others this number needs to be known a prior. This low dimension is known as intrinsic dimension. In this thesis, we focus on nonlinear dimensionality reduction (manifold learn- ing) which is not bound by the restrictive assumption of linearity. We discuss three manifold learning algorithms namely locally linear embedding, Isomap and diffusion 114 maps. These methods try to find the low-dimensional embedding in such a way that the main geometric properties of data (such as global shape or local neighborhood relationships) are preserved. The final goal is to use the new low-dimensional coor- dinates as predictor variables to probabilistically estimate the QoI using Bayesian networks in the GoM wellbores data set. In case the data inherently lies on a low-dimensional manifold and an appropriate dimensionality reduction algorithm is used, we expect the prediction results to be quite satisfactory. 5.2.1 Locally Linear Embedding The central intuition behind the locally linear embedding algorithm is that each data point and its neighbors lie on or close to a locally linear patch of the manifold. Thus, each point − → X i can be reconstructed by linearly combining its neighbors − → X j 0 s where weight coefficient (W ij ) are obtained by minimizing the reconstruction error [65], (W ) = X i | − → X i − X j W ij − → X j | 2 (5.1) This optimization problem is subject toW ij = 0 ifX j is not one of the neighbors of X i . Besides, P j W ij = 1 implying that the rows of weight matrix sum to one. The authors in [65] argue that the weight coefficients are invariant to rotation, rescaling and translation of data point and its neighbors, thus characterizing the intrinsic geometry of each neighborhood. Therefore, if we can characterize the local geometry in high-dimensional space using W ij s, we should be able to use the same W ij s to characterize local patches on the manifold. Thus, we use the same weights 115 Figure 5.2: The schematic summary of LLE algorithm. First, a number of neighbors are chosen for each data point. Then, the weights are computed so that each point canbelinearlyreconstructedfromitsneighbors. Finally, usingthesameweights, the embedded coordinates are obtained by reconstructing each point from its neighbors [65]. to compute the low-dimensional coordinates for all the data points by minimizing the reconstruction error in the low-dimension space, Φ( − → Y ) = X i | − → Y i − X j W ij − → Y j | 2 (5.2) where − → Y i representsthelowdimensionalcoordinatescorrespondingtopoint − → X i . LLE 116 is a topology (neighborhood) preserving mapping. Besides, both the above mini- mization problems have global optima. The number of neighbors and the intrinsic dimension of the manifold are two external parameters that need to be fed into the algorithm. We will later discuss how we find these parameters for our specific problem. 
LLE algorithm is schematically depicted in Figure 5.2. 5.2.2 Isomap Isomap is a nonlinear dimensionality reduction method that tries to preserve the distance between data points such that close points stay close and far points stay far in the high-dimensional and low-dimensional spaces. It is obvious that Euclidean distance is not a good metric to measure distance in this context, since two points that might be close by Euclidean distance in the original space can be quite far from each other on the manifold and thus this metric is not appropriate for preserving the neighborhood properties. On the contrary, geodesic distance is defined as the minimum length between two points along the manifold. So, when two points are close to each other by this metric, this implies that they would also be close on the unfolded manifold in the low dimensional space. Finding the minimum length between two points on the manifold in analyti- cally intractable in continuous case. Instead, we can consider the path between two points as the sum of many discrete small steps on the manifold. We need to define a rule for these small jumps from one point to its neighbors. We can either jump from a point to one of its K closet neighbors (K-rule) or jump to one of the points inside a ball with radius centered at the point (-rule). Accordingly, a graph can be defined in which nodes represent data points and each node is only connected to another one if it is in its neighborhood as defined by the rule. The weight of each 117 edge in the graph is set to the Euclidean distance between the two nodes. Therefore, the geodesic distance between two data points can be approximated by the length of shortest path between two nodes on the graph, which is known as graph distance. Dijkstra algorithm is used to find the shortest path in a weighted graph [43]. The Isomap algorithm uses the same algorithm as in Multidimensional scaling method (MDS) [43] but using geodesic distance (instead of Euclidean distance) makes it nonlinear. Accordingly, the generative model to recover the latent variables is, Y =WX (5.3) in whichX P×N is the matrix of centered latent variables,Y D×N is the matrix of cen- tered observed variables and W D×P is the orthonormal axis transformation matrix (such thatW T W =I P ). Here, we haveN data points which are initially embedded in R D and we seek to find their low-dimensional coordinates in R P . TheIsomapmethodisbasedonpreservingthepairwisescalarproductbetween datapointsinthehigh-dimensionalandlow-dimensionalspaces. TheGrammatrix quantifies the pairwise scalar products as, S =Y T Y = (X T W T )(WX) =X T X (5.4) The Gram matrix can be directly calculated based on the pairwise distance matrix (as defined by graph distances) [43]. On the other hand, the eigenvalue decomposi- tion of Gram matrix gives, S =UΛU T = (Λ 1 2 U T ) T (Λ 1 2 U T ) (5.5) 118 comparing (5.4) and (5.5) we can find X, ˆ X =I P×N Λ 1 2 U T (5.6) here, we sort the eigenvalues in Λ in descending order and pick the first largest P, thus the latent coordinates lie in R P . The Isomap algorithm is very slow since it requires finding the shortest paths in the graph of the nodes. Moreover, the method is valid only when the pairwise geodesic distances between the points of the manifold can be mapped to pairwise Euclidean distance in a P dimensional Euclidean space. Such a manifold is called a "developable P-manifold". Intuitively, in R 3 a developable manifold resembles a curved sheet of paper with no holes. 
5.2.3 Diffusion Maps

Kernel functions take two data points as input and give a non-negative real number as output; they are used to quantify the similarity (closeness) between two data points. Given a kernel function, an associated symmetric kernel matrix, K, can be defined. We can then build the graph of the data points, in which nodes represent data points and the kernel values are used to attach weights to the edges. This graph can be normalized so that the weights attached to each node sum to one. To do so, each element of the kernel matrix is divided by the sum of the elements in the corresponding row,

A = D^{-1} K        (5.7)

Here, the degree matrix D is a diagonal matrix whose i-th entry is the sum of the i-th row of K. All the rows of the matrix A add up to one, so it can be considered a stochastic transition matrix: it describes a Markov chain, or random walk, on the graph of the data, and A_ij denotes the probability of going from node i to node j. This random walk tends to make transitions between similar points and is thus able to identify the geometry of the data. The diffusion map technique uses this notion to explore the underlying manifold in a data set.

Assume for simplicity that the manifold is one dimensional, so that we can describe each point in the low-dimensional space with a single coordinate. Then f(i) is the coordinate of point i and f is an n x 1 vector of the new coordinates. We can preserve similarities by minimizing

\Phi(f) = \sum_{i,j} A_{ij} (f_i - f_j)^2        (5.8)

This objective function forces similar points (A_ij close to 1) to stay close in the low-dimensional space, so that f_i is close to f_j, while dissimilar points (A_ij close to 0) can vary freely. The objective function can be rewritten as

\sum_{i,j} A_{ij} (f_i - f_j)^2
  = \sum_{i,j} A_{ij} (f_i^2 - 2 f_i f_j + f_j^2)
  = \sum_i f_i^2 \sum_j A_{ij} - 2 \sum_i f_i \sum_j A_{ij} f_j + \sum_j f_j^2 \sum_i A_{ij}
  = 2 f^T f - 2 f^T A f        (5.9)

where the last equality uses the fact that the rows of A sum to one (and assumes the same for its columns). Thus, we can instead minimize

f^T (I - A) f = f^T L f        (5.10)

where L = I - A is called the Laplacian of the normalized graph. By imposing the constraint f^T f = 1, to avoid the undesired minimum at f = 0, we obtain the following eigenvalue problem as the solution,

L f = \lambda f        (5.11)

It turns out that the eigenvectors of A and L are the same, and the eigenvalues are related by \mu = 1 - \lambda, where \mu is an eigenvalue of A and \lambda the corresponding eigenvalue of L; \mu is called the eigen-multiplier of the diffusion map. Moreover, the first eigenvector of L (corresponding to \lambda = 0) is the vector of all ones, which carries no information and is discarded. Extending the solution to the case where the low-dimensional coordinates lie in R^q, we take the bottom q + 1 eigenvectors of L, f^(n), ..., f^(n-q), with corresponding eigenvalues \lambda_n = 0, \lambda_{n-1}, ..., \lambda_{n-q}, and discard the bottom one. The diffusion map coordinates of a data point v are then

\Psi(v) = ( \mu_1 f_v^{(n-1)}, ..., \mu_q f_v^{(n-q)} )        (5.12)

The diffusion maps algorithm has an internal procedure to estimate the intrinsic dimensionality of the data: we can choose q, the dimension of the target space, as the smallest number of eigenvalues that captures a certain portion of their total, such that

\sum_{i=1}^{q} \mu_i / \sum_{j=1}^{n} \mu_j = Threshold        (5.13)

There is an interesting concept associated with diffusion maps known as the diffusion distance, which reflects the connectivity between two points in the graph of the data. It turns out that the Euclidean distance in the embedded space is equal to the diffusion distance in the original high-dimensional space. Thus, just as the Isomap algorithm tries to preserve the geodesic distance (distance on the manifold), the diffusion maps algorithm strives to preserve the diffusion distance (connectivity) between the data points.
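A compact base-R sketch of this construction is given below: a Gaussian kernel matrix is row-normalized as in Eq. (5.7), and the leading nontrivial eigenvectors of the resulting transition matrix, scaled by their eigenvalues, give the diffusion coordinates of Eq. (5.12). The kernel bandwidth sigma2 and the target dimension q are free parameters here, and the sketch is an illustration rather than the diffusionMap-package implementation used later in this chapter.

# Minimal diffusion-map sketch: kernel, row normalization, spectral embedding
diffusion_sketch <- function(X, sigma2, q = 3) {
  D <- as.matrix(dist(X))
  K <- exp(-D^2 / (2 * sigma2))        # Gaussian kernel matrix
  A <- K / rowSums(K)                  # row-stochastic transition matrix, Eq. (5.7)
  e <- eigen(A)                        # leading eigenvalue is 1 (constant eigenvector)
  mu  <- Re(e$values)[2:(q + 1)]       # eigen-multipliers of the next q eigenvectors
  Psi <- Re(e$vectors)[, 2:(q + 1)]
  sweep(Psi, 2, mu, "*")               # diffusion coordinates, Eq. (5.12)
}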
5.3 Application to the Gulf of Mexico-BOEM Data Set

In this section we consider the same problem discussed in the previous chapter from a different perspective. The problem is to develop a data-driven mechanism for predicting a desired QoI at a new location from spatially distributed wellbore data points. We proposed using Bayesian networks to model the joint probability distribution between the variables in a high-dimensional space and to use the learned distribution to make probabilistic inferences for unobserved scenarios. Here, we focus on the high dimensionality of the data and would like to explore any possible structure in it, so that the data can be sufficiently described with fewer dimensions. First, we investigate whether we can find a subset of variables that are most relevant for predicting the QoI; this is called feature selection in general, but we study it in the framework of Bayesian networks. Moreover, we also use the nonlinear manifold learning approaches discussed in this chapter to see whether the data can be described in a lower-dimensional space with some latent variables, and we study the predictive performance of the Bayesian network in predicting the QoI using the low-dimensional coordinates.

The analyses in this section use the BOEM data set as described in the previous chapter. We have 36 predictor variables and would like to predict the production rate at day 1 at new locations. Bayesian networks provide an elegant framework for selecting the most informative subset of variables: as discussed before, in a BN each node is independent of all the other nodes given its Markov blanket (MB). This provides a natural feature selection method in the context of Bayesian networks. Here, the Markov blanket of OPR-D1 (the QoI) consists of the following variables: permeability, porosity, Z, SW, THK, TAREA, OAREA, IOFVF, RECO-AF, WDEP, TI, ORECG, GOR, SDPG and P-CUMBOE. We also add the X and Y variables to the MB to make probabilistic inference based on spatial location possible. The structure of the BN learned from these selected variables is shown in Figure 5.3. The prediction RMSE on the training and test data sets is 0.121 and 0.128 respectively, which is very close to the results obtained with the full high-dimensional data, namely 0.127 and 0.141. This suggests that the selected subset of variables can efficiently capture the dependencies between the predictor variables and the dependent variable.

Figure 5.3: The structure of the learned BN using the Markov blanket variables of the QoI as well as the X and Y variables.
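A hedged sketch of this Markov-blanket workflow, written against the bnlearn package [69], is shown below. The data frame name wells, the test set test and the column name OPR_D1 are placeholders for the BOEM variables, and the structure-learning algorithm (hill climbing) and prediction options shown here are reasonable defaults rather than necessarily the exact settings behind the numbers reported above.

library(bnlearn)

dag  <- hc(wells)                            # learn the GBN structure on all 36 predictors plus the QoI
mbv  <- mb(dag, "OPR_D1")                    # Markov blanket of the QoI
keep <- unique(c(mbv, "X", "Y", "OPR_D1"))   # add the spatial coordinates

dag.red <- hc(wells[, keep])                 # relearn the network on the reduced variable set
fit     <- bn.fit(dag.red, wells[, keep])

pred <- predict(fit, node = "OPR_D1", data = test[, keep])   # point predictions at test locations
rmse <- sqrt(mean((pred - test$OPR_D1)^2))

# Posterior sample of the QoI at one unobserved location via likelihood weighting,
# from which a predictive density such as the one in Figure 5.9 can be estimated
ev   <- as.list(test[1, setdiff(keep, "OPR_D1")])
samp <- cpdist(fit, nodes = "OPR_D1", evidence = ev, method = "lw")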
To explore the structure of the data, we first apply the locally linear embedding method. We use a log transformation to scale the data (except for the QoI, for which we keep the Box-Cox transformed variable to make the comparisons consistent). We pick the RMSE in predicting the QoI as the metric for choosing the LLE parameters, namely the number of neighbors and the intrinsic dimension. Figure 5.4 shows the training set error for various embedded dimensions as a function of the number of neighbors. As expected, the general trend shows that the error is minimized for some optimal number of neighbors: intuitively, if the number of neighbors is too small, the model cannot capture the structure of the data, while including too many neighbors loses the locality of the data. Figure 5.4 suggests that 20 neighbors is a good choice for our problem.

Figure 5.4: Choosing the appropriate number of neighbors for the LLE algorithm when applied to the wellbores data set to predict oil production rate at day 1.

Figure 5.5 illustrates the training and test set errors for various embedded dimensions when the number of neighbors is 20. As expected, increasing the complexity of the model (higher embedded dimensions) decreases the training set error, while the test set error starts to increase at some point. This suggests that picking a high embedded dimension results in overfitting. Accordingly, we pick 10 as the embedded dimension of the manifold. The RMSE in predicting oil production rate for k = 20 neighbors and d = 10 dimensions is 0.143 on the training data and 0.160 on the test data. This performance is quite comparable with the high-dimensional model with 36 variables, where the RMSE on the training and test data sets is 0.127 and 0.141 respectively. We have used the RDRToolbox package in R to find the LLE coordinates [3].

Figure 5.5: Finding the intrinsic manifold dimension for the LLE algorithm when applied to the wellbores data set to predict oil production rate at day 1.

The majority of manifold learning methods, including LLE, try to transform the variables to a low-dimensional space such that none of the new coordinates is redundant in describing the data structure, regardless of the ultimate prediction task. However, they do not guarantee statistical independence between the new variables (which is of course a desirable property). Thus we see in Figure 5.6 that, while the majority of the LLE coordinates are statistically independent, X2, X3 and X4 are not.

Figure 5.6: Dependency structure between the LLE coordinates, excluding the QoI.

Moreover, we would ideally like the QoI to be directly dependent on all the LLE coordinates, which would mean that the new coordinates perfectly convey information regarding the QoI. This is not guaranteed either: in the process of learning the low-dimensional coordinates we never inform the algorithm that the ultimate goal is to predict a particular QoI (LLE is an unsupervised learning method). Thus, when using the LLE coordinates to predict the QoI, some of them might not show any dependency on the target variable, which is to be expected. Figure 5.7 illustrates the structure of the Bayesian network learned from the LLE coordinates (k = 20, d = 10). It is observed that the QoI is directly dependent on all of the new coordinates except one of them, which is explained by the argument above.

Figure 5.7: Structure of the learned GBN to predict oil production rate using the LLE coordinates.

There exists a class of supervised dimensionality reduction methods which seek the reduced subspace in such a way that the probabilistic relation between the dependent variable and the predictors is preserved; we do not investigate supervised dimension reduction methods in this thesis.
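The neighbor and dimension sweeps behind Figures 5.4 and 5.5 can be organized as in the sketch below, written against the LLE() function of the RDRToolbox package [3]. Xlog stands for the log-scaled predictor matrix, qoi for the Box-Cox transformed production rate, and gbn_rmse() for a scoring routine (for instance, the Bayesian network workflow sketched earlier); all three names, and the grids themselves, are placeholders rather than the exact settings of the runs reported in this chapter.

library(RDRToolbox)

ks   <- seq(5, 40, by = 5)
dims <- 2:15
err  <- matrix(NA, length(ks), length(dims), dimnames = list(ks, dims))

for (i in seq_along(ks)) {
  for (j in seq_along(dims)) {
    Ylle <- LLE(data = Xlog, dim = dims[j], k = ks[i])   # embedded coordinates
    err[i, j] <- gbn_rmse(Ylle, qoi)                     # RMSE of the GBN on these coordinates
  }
}
# The sweep in this chapter settles on k = 20 neighbors and d = 10 dimensions.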
To apply the Isomap algorithm we use the same number of neighbors and manifold dimension as in the LLE method, namely 20 and 10, which gives a consistent framework for comparing the performance of the methods. Given these parameters, the Isomap coordinates are found and then used to train a Gaussian Bayesian network. The RMSE of the trained GBN in predicting the QoI on the training and test data sets is 0.218 and 0.260 respectively. The Isomap coordinates therefore give higher error rates than the LLE coordinates, which might be due to the underlying assumption in the Isomap algorithm that the manifold is a developable P-manifold. We have used the RDRToolbox package in R to find the Isomap coordinates [3].

Finally, we apply the diffusion maps algorithm to the data set and find the diffusion coordinates. The diffusion maps algorithm finds the manifold dimension to be 3 (this corresponds to a 95% drop-off in the eigenvalue multipliers). Here, we have used a Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) with sigma^2 = 8.9. A good approximation for sigma^2 is to set it to 2 alpha^2, where alpha is the median distance to the k-th nearest neighbor and k is chosen as 1%-2% of the size of the data set, n [63]. Using these coordinates, the RMSE in predicting the QoI with the GBN is 0.359 and 0.478 on the training and test data sets respectively. We have used the diffusionMap package in R to find the diffusion coordinates [63].

Table 5.1: Model prediction performance using the original data, the Markov blanket variables, the LLE coordinates, the Isomap coordinates and the diffusion map coordinates, respectively.

Model                        No. of variables   Train RMSE   Test RMSE
Original data                       36             0.127       0.141
Markov blanket variables            17             0.121       0.128
LLE coordinates                     10             0.143       0.160
Isomap coordinates                  10             0.218       0.260
Diffusion map coordinates            3             0.359       0.478

Table 5.1 compares the predictive performance of the GBN model for five different cases. The first case corresponds to the original model when no dimension reduction is applied and we keep all 36 predictors. The second case is when we keep only the subset of variables picked by the Markov blanket of the QoI in the GBN and use them to predict the QoI; here we have 17 variables and the error is almost the same as in the original case. The third case corresponds to using the LLE coordinates when the intrinsic dimension is set to 10; the errors here are also very close to those of the first two cases. On the contrary, the prediction error when using the Isomap and diffusion map coordinates is quite high. Moreover, the benefit of dimension reduction algorithms is even more pronounced in real field applications, where the dimension of the original variables is much larger than just 36. The results suggest that LLE is quite successful in reducing the dimension from 36 to 10, while the two other manifold learning methods are not as good. Due to the manifold geometry, not all manifold learning techniques are able to recover the structure: non-convexity (holes), non-uniform densities, sparsity, corners, noise and clusters in the data are among the reasons that prevent some algorithms from discovering the inherent structure of the manifold.

Figure 5.8: Predicted mean of the QoI versus the measured QoI for 45 samples from the test data set. The GBNs are learned from the Markov blanket variables and the LLE coordinates respectively.

Figure 5.8 shows the predicted mean of the QoI versus the observed value of the QoI for 45 samples from the test data set, where the Bayesian networks are learned from the Markov blanket variables and from the LLE coordinates respectively. The general trend of the graphs is close to the Y = X line and is also similar to the case where we use the original 36 coordinates. We can also use the LLE coordinates to make probabilistic inferences about the QoI, which shows that these low-dimensional coordinates can efficiently represent the data.
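The kernel-bandwidth heuristic quoted above for the diffusion map (sigma^2 of about 2 alpha^2, with alpha the median distance to the k-th nearest neighbor and k about 1-2% of n) can be sketched as follows, together with the corresponding call to the diffusionMap package [63]. The fraction 0.01 and the object names are illustrative, and the eps.val line assumes that diffuse() parameterizes the Gaussian kernel as exp(-d^2 / eps.val); the exact parameterization should be checked against the package documentation.

library(diffusionMap)

D   <- as.matrix(dist(Xlog))
k   <- ceiling(0.01 * nrow(Xlog))                  # roughly 1% of the data set size
akn <- apply(D, 1, function(d) sort(d)[k + 1])     # distance to the k-th nearest neighbor
sigma2 <- 2 * median(akn)^2                        # sigma^2 = 2 * alpha^2

dm <- diffuse(dist(Xlog), eps.val = 2 * sigma2)    # assumed kernel exp(-d^2 / eps.val)
head(dm$X)                                         # diffusion coordinates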
Figure 5.9: Inferring the PDF of the oil production rate at a test location using the LLE coordinates.

Figure 5.9 illustrates the probability distribution function of the oil production rate at two locations. These locations are the same as in Figure 4.5, making the comparison consistent. Here we use the LLE coordinates to make probabilistic inferences about the QoI at a new location using the learned Bayesian network. Comparing Figure 4.5 and Figure 5.9 suggests that the reduced model is quite capable of predicting the QoI.

Chapter 6

Conclusion

In this thesis, we attempt to provide solutions to some challenges in the energy-affiliated industries from a stochastic and data-driven point of view. We study smart grids and smart oilfields as two application areas that produce a large share of energy. This research aims to provide tools that help decision makers in these fields make more informed decisions. Specifically, we focus on three problems.

First, we address demand response in smart grids and recommend a robust scheduling framework that is able to control the timing and usage of programmable electric devices in a residential building. The optimal schedule is regulated based on the forecast of electricity prices in future time slots. We formulate the problem as a chance-constrained optimization problem, which is able to effectively accommodate the uncertainty in price prediction. Moreover, we study various categories of consumers based on their socio-economic status and investigate the savings in electricity bills that each household can obtain by using the recommended scheduling system.

We also address the problem of residential load forecasting in the context of the smart grid. The multiple stochastic variables that shape customer consumption behavior pose a challenge to this problem. We propose using Bayesian networks, which are able to estimate the joint probability distribution in a complex system, as the framework for capturing the uncertainties and conducting probabilistic inferences. We also study the forecasting problem at the aggregated and individualized scales. The availability of usage data at the household level provides a unique opportunity for modeling household consumption behavior, which can lead to targeted pricing policies for each household.

Finally, we developed a data-driven procedure to predict wellbore signatures over the Gulf of Mexico. The construction is based upon the Bayesian network modeling approach as an elegant tool to simultaneously address the uncertainty and complexity issues. The model proves to serve as an efficient predictive model in high-dimensional settings, as often encountered in reservoir modeling applications. Furthermore, the structure of the data has been explored by a nonlinear manifold learning technique, and the low-dimensional coordinates demonstrated an equivalently efficient representation of the data in predicting the quantity of interest. The proposed approach holds great promise for simplifying the prediction task in complex domains. Relaxing the Gaussian assumption and considering a more generic distribution is a possible direction for extending this work; for instance, the joint distribution may be modeled through a Gaussian mixture model. Furthermore, the low-dimensional coordinates of the data can be learned in a supervised manner, which might enhance the accuracy of the prediction in the low-dimensional space.

In this thesis, we assessed some challenges that arise in complex systems, namely smart grids and smart oilfields, from a stochastic modeling point of view.
We investigated the effect of uncertainties on the system behavior by adopting a data-driven perspective. The boom of data streams in the energy industry requires system modelers to utilize this data in order to improve system performance. In this research, we used real data to improve demand management in smart grids, both by controlling demand and by forecasting demand in an uncertain environment. Furthermore, we used the hydrocarbon reservoir data at the Gulf of Mexico to provide rapid system characterization.

Reference List

[1] A. Abdollahzadeh, A. Reynolds, M. Christie, D. W. Corne, G. J. Williams, B. J. Davies, et al. Estimation of distribution algorithms applied to history matching. SPE Journal, 18(03):508-517, 2013.
[2] H. K. Alfares and M. Nazeeruddin. Electric load forecasting: literature survey and classification of methods. International Journal of Systems Science, 33(1):23-34, 2002.
[3] C. Bartenhagen. RDRToolbox: A package for nonlinear dimension reduction with Isomap and LLE, 2014. R package version 1.18.0.
[4] K. Basu, L. Hawarah, N. Arghira, H. Joumaa, and S. Ploix. A prediction system for home appliance usage. Energy and Buildings, 67:668-679, 2013.
[5] A. Ben-Tal, L. Ghaoui, and A. Nemirovski. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press, 2009.
[6] A. Ben-Tal and A. Nemirovski. Robust solutions of linear programming problems contaminated with uncertain data. Mathematical Programming, 88:411-424, 2000.
[7] C. E. Bravo, L. Saputelli, F. Rivas, A. G. Pérez, M. Nickolaou, G. Zangl, N. De Guzmán, S. D. Mohaghegh, G. Nunez, et al. State of the art of artificial intelligence and predictive analytics in the E&P industry: A technology survey. SPE Journal, 19(04):547-563, 2014.
[8] D. Caprino, M. L. Della Vedova, and T. Facchinetti. Peak shaving through real-time scheduling of household appliances. Energy and Buildings, 75:133-148, 2014.
[9] D. P. Chassin. Olympic peninsula demonstration testbed results. Technical report, Pacific Northwest National Laboratory, 2010.
[10] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. Information Theory, IEEE Transactions on, 14(3):462-467, 1968.
[11] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5-30, 2006.
[12] A. Conejo, J. Morales, and L. Baringo. Real-time demand response model. Smart Grid, IEEE Transactions on, 1(3):236-242, Dec. 2010.
[13] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309-347, 1992.
[14] T. H. Cormen. Introduction to Algorithms. 2009.
[15] M. H. DeGroot. Optimal Statistical Decisions, volume 82. John Wiley & Sons, 2005.
[16] R. E. Edwards, J. New, and L. E. Parker. Predicting future hourly residential electrical consumption: A machine learning case study. Energy and Buildings, 49:591-603, 2012.
[17] H. Farhangi. The path of the smart grid. Power and Energy Magazine, IEEE, 8(1):18-28, January-February 2010.
[18] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1029. Morgan Kaufmann Publishers Inc., 1993.
[19] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In Learning in Graphical Models, pages 421-459. Springer, 1998.
[20] R. Fung and K.-C. Chang. Weighing and integrating evidence for stochastic simulation in Bayesian networks. arXiv preprint arXiv:1304.1504, 2013.
[21] D. Geiger and D. Heckerman. Learning Gaussian networks. In Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, pages 235-243. Morgan Kaufmann Publishers Inc., 1994.
[22] M. Giese, R. B. Bratvold, et al. Probabilistic modeling for decision support in integrated operations. SPE Economics & Management, 3(03):173-185, 2011.
[23] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for complex Bayesian modelling. The Statistician, pages 169-177, 1994.
[24] J. Graham, K. Rose, J. Bauer, C. Disenhof, C. Jenkins, C. Ringo, L. Sim, and K. Vanackeren. Integration of spatial data to support risk and impact assessments for deep and ultra-deepwater hydrocarbon activities in the Gulf of Mexico, 2012.
[25] D. L. Ha, H. Joumaa, S. Ploix, and M. Jacomino. An optimal approach for electrical management problem in dwellings. Energy and Buildings, 45:1-14, 2012.
[26] M. T. Hagan and S. M. Behr. The time series approach to short term load forecasting. Power Systems, IEEE Transactions on, 2(3):785-791, 1987.
[27] L. Hawarah, S. Ploix, and M. Jacomino. User behavior prediction in energy consumption in housing using Bayesian networks. In Artificial Intelligence and Soft Computing, pages 372-379. Springer, 2010.
[28] J. He, L. J. Durlofsky, et al. Reduced-order modeling for compositional simulation by use of trajectory piecewise linearization. SPE Journal, 19(05):858-872, 2014.
[29] D. Heckerman. A Tutorial on Learning with Bayesian Networks. Springer, 1998.
[30] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243, 1995.
[31] M. Henrion. Propagation of uncertainty by probabilistic logic sampling in Bayes' networks. In Uncertainty in Artificial Intelligence, volume 2, pages 149-164, 1988.
[32] S. Højsgaard. Graphical independence networks with the gRain package for R. Journal of Statistical Software, 46(10):1-26, 2012.
[33] J. Hosking, R. Natarajan, S. Ghosh, S. Subramanian, and X. Zhang. Short-term forecasting of the daily load curve for residential electricity usage in the smart grid. Applied Stochastic Models in Business and Industry, 29(6):604-620, 2013.
[34] R. K. Jain, K. M. Smith, P. J. Culligan, and J. E. Taylor. Forecasting energy consumption of multi-family residential buildings using support vector regression: Investigating the impact of temporal and spatial monitoring granularity on performance accuracy. Applied Energy, 123:168-178, 2014.
[35] S. J. Kang, J. Park, K.-Y. Oh, J. G. Noh, and H. Park. Scheduling-based real time energy flow control strategy for building energy management system. Energy and Buildings, 75:239-248, 2014.
[36] M. Khaninezhad, B. Jafarpour, et al. Bayesian history matching and uncertainty quantification under sparse priors: a randomized maximum likelihood approach. In SPE Reservoir Simulation Symposium. Society of Petroleum Engineers, 2013.
[37] T. Kim and H. Poor. Scheduling power consumption with price uncertainty. Smart Grid, IEEE Transactions on, 2(3):519-527, Sept. 2011.
[38] S. Kishore and L. Snyder. Control mechanisms for residential electricity demand in smart grids. In Smart Grid Communications (SmartGridComm), 2010 First IEEE International Conference on, pages 443-448, Oct. 2010.
[39] K. B. Korb and A. E. Nicholson. Bayesian Artificial Intelligence. CRC Press, 2010.
[40] W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10(3):269-293, 1994.
[41] H. Langseth and L. Portinale. Bayesian networks in reliability. Reliability Engineering & System Safety, 92(1):92-108, 2007.
[42] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 157-224, 1988.
[43] J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer Publishing Company, Incorporated, 1st edition, 2007.
[44] A. Lefort, R. Bourdais, G. Ansanay-Alex, and H. Guéguen. Hierarchical control method applied to energy management of a residential house. Energy and Buildings, 64:53-61, 2013.
[45] N. Li, L. Chen, and S. Low. Optimal demand response based on utility maximization in power networks. In Power and Energy Society General Meeting, 2011 IEEE, pages 1-8, July 2011.
[46] Z. Li and C. A. Floudas. Robust counterpart optimization: Uncertainty sets, formulations and probabilistic guarantees.
[47] X. Lin, S. L. Janak, and C. A. Floudas. A new robust optimization approach for scheduling under uncertainty: I. Bounded uncertainty. Computers & Chemical Engineering, 28(6-7):1069-1085, 2004. FOCAPO 2003 special issue.
[48] G. Martinelli, J. Eidsvik, K. Hokstad, R. Hauge, et al. Strategies for petroleum exploration on the basis of Bayesian networks: A case study. SPE Journal, 19(04):564-575, 2014.
[49] M. K. McNutt, R. Camilli, T. J. Crone, G. D. Guthrie, P. A. Hsieh, T. B. Ryerson, O. Savas, and F. Shaffer. Review of flow rate estimates of the Deepwater Horizon oil spill. Proceedings of the National Academy of Sciences, 109(50):20260-20267, 2012.
[50] R. Missaoui, H. Joumaa, S. Ploix, and S. Bacha. Managing energy smart homes according to energy prices: analysis of a building energy management system. Energy and Buildings, 71:155-167, 2014.
[51] A. Mohsenian-Rad, V. Wong, J. Jatskevich, R. Schober, and A. Leon-Garcia. Autonomous demand-side management based on game-theoretic energy consumption scheduling for the future smart grid. Smart Grid, IEEE Transactions on, 1(3):320-331, Dec. 2010.
[52] A.-H. Mohsenian-Rad and A. Leon-Garcia. Optimal residential load control with price prediction in real-time electricity pricing environments. Smart Grid, IEEE Transactions on, 1(2):120-133, Sept. 2010.
[53] A.-H. Mohsenian-Rad, V. Wong, J. Jatskevich, and R. Schober. Optimal and autonomous incentive-based energy consumption scheduling algorithm for smart grid. In Innovative Smart Grid Technologies (ISGT), 2010, pages 1-6, Jan. 2010.
[54] P. Morris, D. Vine, and L. Buys. Application of a Bayesian network complex system model to a successful community electricity demand reduction program. Energy, 84:63-74, 2015.
[55] K. Moslehi and R. Kumar. A reliability perspective of the smart grid. Smart Grid, IEEE Transactions on, 1(1):57-64, 2010.
[56] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[57] R. Nagarajan, M. Scutari, and S. Lèbre. Bayesian Networks in R with Applications in Systems Biology, volume 48. Springer, 2013.
[58] T. D. Nielsen and F. V. Jensen. Bayesian Networks and Decision Graphs. Springer Science & Business Media, 2009.
[59] D. O'Neill, M. Levorato, A. Goldsmith, and U. Mitra. Residential demand response using reinforcement learning. In Smart Grid Communications (SmartGridComm), 2010 First IEEE International Conference on, pages 409-414, Oct. 2010.
[60] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach.
[61] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2014.
[62] M. Pedrasa, E. Spooner, and I. MacGill. Robust scheduling of residential distributed energy resources using a novel energy service decision-support tool. In Innovative Smart Grid Technologies (ISGT), 2011 IEEE PES, pages 1-8, Jan. 2011.
[63] J. Richards. diffusionMap: Diffusion map, 2014. R package version 1.1-0.
[64] M. Roelofs and J. Bisschop. AIMMS 3.9 Language Reference. Lulu.com, 2009.
[65] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
[66] N. V. Sahinidis. Optimization under uncertainty: state-of-the-art and opportunities. Computers & Chemical Engineering, 28(6-7):971-983, 2004. FOCAPO 2003 special issue.
[67] S. Salinas, M. Li, and P. Li. Multi-objective optimal energy consumption scheduling in smart grids. Smart Grid, IEEE Transactions on, 4(1):341-348, 2013.
[68] P. Samadi, A. Mohsenian-Rad, R. Schober, V. Wong, and J. Jatskevich. Optimal real-time pricing algorithm based on utility maximization for smart grid. In Smart Grid Communications (SmartGridComm), 2010 First IEEE International Conference on, pages 415-420, Oct. 2010.
[69] M. Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3):1-22, 2010.
[70] R. Sevlian and R. Rajagopal. Short term electricity load forecasting on varying levels of aggregation. arXiv preprint arXiv:1404.0058, 2014.
[71] R. D. Shachter and C. R. Kenley. Gaussian influence diagrams. Management Science, 35(5):527-550, 1989.
[72] G. R. Shafer and P. P. Shenoy. Probability propagation. Annals of Mathematics and Artificial Intelligence, 2(1-4):327-351, 1990.
[73] K. C. Sou, J. Weimer, H. Sandberg, and K. Johansson. Scheduling smart home appliances using mixed integer linear programming. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 5144-5149, Dec. 2011.
[74] J. W. Taylor, L. M. De Menezes, and P. E. McSharry. A comparison of univariate methods for forecasting electricity demand up to a day ahead. International Journal of Forecasting, 22(1):1-16, 2006.
[75] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.
[76] C. Thimmisetty, A. Khodabakhshnejad, N. Jabbari, F. Aminzadeh, R. Ghanem, K. Rose, J. Bauer, and C. Disenhof. Multiscale stochastic representation in high-dimensional data using Gaussian processes with implicit diffusion metrics. In Dynamic Data-Driven Environmental Systems Science, pages 157-166. Springer, 2015.
[77] K. Turitsyn, S. Backhaus, M. Ananyev, and M. Chertkov. Smart finite state devices: A modeling framework for demand response technologies. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 7-14, Dec. 2011.
[78] M. Vlachopoulou, G. Chin, J. C. Fuller, S. Lu, and K. Kalsi. Model for aggregated water heater load using dynamic Bayesian networks.
[79] A. Wald. Statistical decision functions which minimize the maximum risk. The Annals of Mathematics, 46(2):265-280, 1945.
[80] Z. Yun, Z. Quan, S. Caixin, L. Shaolan, L. Yuming, and S. Yang. RBF neural network and ANFIS-based short-term load forecasting approach in real-time price environment. Power Systems, IEEE Transactions on, 23(3):853-858, 2008.
[81] E. Zio. Reliability analysis of complex network systems: Research and practice in need.
Abstract
The abundance of data collected from physical systems holds the promise of transforming conventional systems into smarter infrastructures that permit the development of credible decisions. These perspectives, taken in the context of recent algorithmic developments for big data, start to address many of the challenges encountered in complex systems. On the other hand, the uncertainty intertwined with these systems imposes an additional layer of complexity that needs to be addressed properly. Stochastic predictive modeling seeks to quantify the effect of uncertainty on the overall system behavior in order to improve the process of decision making. While fulfilling this promise remains fraught with conceptual, technological, and mathematical challenges, success holds many promises that go to the heart of societal and environmental sustainability.

In this study, we aim to enhance our stochastic modeling capability in complex systems by adopting a data-driven approach. More specifically, we focus on the future Smart Grid and Smart Oilfields as two vivid examples of the energy sector where analyzing data assists in quantifying the confidence in the predictions. The Smart Grid, which represents the future generation of power systems, is composed of multiple interacting systems such as the renewable supply network, distribution and storage systems, and the communications system. An important enabler of the Smart Grid, the information network, ensures the viability of data-driven and data-aware perspectives, on the side of both utilities and consumers. This enables both consumers and utilities to share in the responsibility and benefits of access to advanced technology. Smart Oilfields were initiated by a similar core idea, i.e., using advanced information technology as an enabler of informed decisions. The data-driven predictive approach can serve as a prior model for characterizing hydrocarbon reservoirs, where conventional analysis methods impose high computational and monetary expenses. This thesis examines some uncertainty-induced challenges faced in predictive modeling of Smart Grids and Smart Oilfields and seeks to address these challenges from a data-driven perspective.