Failure Prediction for Rod Pump Artificial Lift Systems

by

Yintao Liu

Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2013

Copyright 2013 Yintao Liu

Dedication

To my greatest parents Zicai Liu and Xiaolan Jiang, my elder brother Jintao Liu, and especially to my wife, without whom none of this would ever have happened.

Acknowledgements

I want to thank my PhD advisors, Professor Cauligi S. Raghavendra and Dr. Ke-Thia Yao. Without their supervision, this dissertation would have been impossible. While giving me great freedom in conducting my research, their rich experience and expertise in both computer science theory and application helped me greatly in choosing my topic. I enjoyed all five of these years working with them.

I also want to thank the members of my qualifying and dissertation committee: Professor Aiichiro Nakano, Professor Yan Liu, Professor Viktor Prasanna and Professor Iraj Ershaghi. Their feedback on my thesis topic was critical and instructive. I also want to thank Shuping Liu, Dinesh Babu, Anqi Wu, Dong Guo, Sanaz Norouzi and Sanaz Seddighrad for our numerous discussions, and Lanre Olabinjo, Oluwafemi Balogun, Tracy Lenz, Burcu Seren, Jared Ivanhoe and Paul Miller for their rich petroleum engineering knowledge and feedback while we were conducting our experiments. Last but not least, I want to thank Juli Legat and Sorin Marghitoiu for their kind and professional support in many other aspects.

All work in this dissertation was supported by Chevron Corp. under the joint project, Center for Interactive Smart Oilfield Technologies (CiSoft), at the University of Southern California.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Objectives
  1.4 Major Contributions
  1.5 Outline
  1.6 Conclusion
Chapter 2 Related Work
  2.1 Industrial Approach
  2.2 Academic Approach
    2.2.1 Anomaly Detection
    2.2.2 Failure Prediction
  2.3 Multitask Learning
  2.4 Conclusion
Chapter 3 Smart Engineering Apprentice
  3.1 Background
  3.2 System Design
  3.3 Data Collection and Preparation
    3.3.1 Data Extraction
    3.3.2 Data Cleansing
    3.3.3 Noise Handling
  3.4 Feature Extraction
    3.4.1 Data Analysis
    3.4.2 Features
  3.5 Machine Learning
  3.6 Evaluation
  3.7 Alert Visualization
    3.7.1 Day-level Visualization
    3.7.2 Watchlist generalization and visualization
  3.8 Knowledge Management
  3.9 Conclusion
Chapter 4 Semi-supervised Failure Prediction
  4.1 Background
  4.2 Algorithm
    4.2.1 Labeling
    4.2.2 Training Selection
    4.2.3 Semi-supervised Classification using Random Peek
  4.3 Evaluation
  4.4 Overfitting
  4.5 Conclusion
Chapter 5 Global Model for Failure Prediction
  5.1 Background
  5.2 Attributes and Features
  5.3 Labeling and Model Construction
  5.4 Experiments
  5.5 Conclusion
Chapter 6 Weighted Task Regularized Multitask Learning
  6.1 Background
  6.2 Notation
  6.3 Methods Formulation
    6.3.1 Problem Definition
    6.3.2 Problem Equivalent to Multitask Learning
    6.3.3 Kernels
      6.3.3.1 Linear case
      6.3.3.2 Nonlinear case
    6.3.4 Generalization of uniform regularized multitask learning
  6.4 Experiments
    6.4.1 Synthetic Data
    6.4.2 Rod Pump Failure Prediction
  6.5 Discussion
  6.6 Conclusion
Chapter 7 Prediction Confidence Level Model
  7.1 Background
  7.2 Modeling
  7.3 Logistic Regression
  7.4 Experiments
  7.5 Utilization
  7.6 Updated SEA System
  7.7 Conclusion
Chapter 8 Conclusions and Future Work
Bibliography

List of Tables

3.1 Error rates at various delays based on cross-validation using ADTree.
3.2 Error rates of different classifiers using 10-fold cross-validations.
3.3 Confusion matrix and prediction-related terminologies.
3.4 Failure prediction on 33 wells.
3.5 Watch list in a collaborative format.
4.1 10-fold cross-validation accuracy.
4.2 Confusion matrix for testing on a single field using 8 failure wells.
4.3 Comparison between well-level and sample-level cross-validation accuracy using SVM.
5.1 Overall confusion matrix for testing the global model on 5 oil fields.
5.2 Field-specific confusion matrix for testing the global model on 5 oil fields.

List of Figures

2.1 Four different dynamometer cards and their corresponding symptoms (Rowlan et al., 2007).
2.2 Cough syrup and liquid decongestant sales from (Goldenberg et al., 2002) for a disease outbreak detection study.
3.1 The hierarchy of the oil industry with respect to number of rod pumps.
3.2 Instrumented rod pump system (courtesy Weatherford, EP Systems (Rowlan et al., 2007)).
3.3 Daily measured multivariate time series out of the POC for a specific rod pump.
3.4 An ideal failure example that reveals the progress of a failure.
3.5 Smart Engineering Apprentice system flow diagram.
3.6 An example showing how outliers impact the feature extraction process. By handling the outliers, a smoother curve can be revealed.
3.7 Correlation analysis between different attributes using all available data, missing data excluded.
3.8 Failure accuracy evaluation logic.
3.9 SEA day-level visualization.
3.10 SEA watch list in a geographic representation.
3.11 An example illustrating how case-based reasoning from the SEA KM module is used.
4.1 Accurate labeling under clustering.
4.2 Random peek illustration.
5.1 CardUnchangedDays and RuntimeRatio correlate to sudden failures.
5.2 Training set construction flowchart.
5.3 Good example of a successful prediction leading to a tubing leak: a signature indicating a tubing leak began to occur in late October 2011, recurred twice more, and was then discovered as a tubing failure in mid-February 2012.
5.4 Another good example of a successful prediction leading to a tubing leak: early, less frequent predictions indicate a tubing failure; then a significant failure event happened.
5.5 Good example of timely prediction. No clear pre-failure patterns were recognized, but as soon as the data trended toward the failure pattern we were able to identify it and trigger alerts more than half a month earlier than the failure was found by the field specialist.
5.6 Sudden failure: hard to predict by trends.
5.7 After March 2012, the data began to decline, matching the tubing failure pattern; however, the data then returned to the normal level. Later the data showed the load dropping significantly again, and this time it held that value consistently. This is exactly the signature we used to train our prediction model, but for this case, because we had no further knowledge by the time our experiment was held and the date exceeded 100 days, these predictions are considered false alerts.
5.8 Sudden failure: hard to predict by trends.
5.9 Sudden failure: hard to predict by trends.
6.1 Synthetic dataset: Task 1 and Task 2 are two similar tasks, while Task 3 is very different in its optimal classification boundary (red for Task 1, blue for Task 2 and pink for Task 3).
6.2 Overall performance for separated SVM, one SVM, regularized MTL and weighted-task RMTL.
6.3 Data distributions, training samples and hyperplanes for three tasks: optimal (top row), independent sSVM (second row), unified oSVM (third row), regularized RMTL (fourth row) and weighted regularized SVM, WRMTL (bottom row), under the same sampled training data while the original data contains a high noise level.
6.4 Class distribution for the rod pump failure prediction dataset.
6.5 Precision and recall for cross-validation of anomalous classes using different algorithms on the petroleum dataset.
7.1 ROC curve for the confidence-level model from 5 fields.
7.2 Parameter weights for different factors of the confidence-level model.
7.3 Updated SEA system flowchart.

Abstract

Failure prediction, a subset of anomaly detection that aims at the precursory events potentially triggering failures, is of great value in maintaining reliable complex engineering systems. Given a massive amount of historical data in multivariate time series for a complex system, data mining and machine learning techniques can play an important role by learning from historical failures, and the learned models can then be integrated into real-world monitoring and fault-prevention applications. In this dissertation, such data mining and machine learning techniques are applied to failure prediction in digital oil fields. However, there are two major categories of challenge in applying them to oil fields with wells in multiple assets: 1) within a single domain/asset, and 2) across multiple domains/assets. For a single domain/asset, the data collected is a high-dimensional multivariate time series containing uncertainties such as noise, missing data and inconsistent data.
Across multiple domains/assets, because of the rarity of failure events relative to the heterogeneity of thousands of assets and the diversity of failure patterns, as well as sparse labels and limited resources, it is unrealistic to build an individual predictive model for each asset. This thesis addresses two problems of failure prediction on multiple multivariate time series: 1) how to systematically learn from historical failures and train an effective model that is applicable in a failure prediction application; 2) how to train a generalized model from the labeled dataset that is efficient at predicting failures from thousands of multivariate time series that exhibit clustered heterogeneity. The thesis emphasizes, but is not limited to, down-hole mechanical failures of rod pump artificial lift systems (a.k.a. rod pumps), the most common type of oil producer system.

The first part of the thesis addresses the first challenge category by presenting the Smart Engineering Apprentice (SEA) system, which involves data extraction, data preparation, feature extraction, machine learning, alert generation and knowledge management. The data extraction stage extracts the needed data, including the time series data and event logs, from the enterprise database. Noise and missing values are partly handled in the data preparation stage. The denoised data is then fed into the feature extraction stage to obtain features. Given a labeled dataset, general supervised learning algorithms can be applied to train, test and evaluate the results in the machine learning stage. In the alerting stage, the system visually depicts alerts to provide warning of impending failures. Finally, in the knowledge management stage, a wide range of factors is combined to train a confidence-level model for ranking across multiple assets from multiple categories. SEA has been successfully applied to failure prediction for rod pump artificial lift systems.

In the second part, this thesis addresses the second challenge by presenting an in-depth model that generalizes the learning algorithm so that a unified model can be applied to multiple heterogeneous fields while maintaining comparable precision and recall. Our objective is to build a generalized model that: 1) automatically recognizes examples based on limited knowledge from subject matter experts (SMEs); 2) takes advantage of the larger number of recognizable examples from all historical data, so that the learned model is statistically more robust; 3) customizes better, so that different fields are capable of exhibiting variations arising from other important uncertainties that were difficult to consider in previous algorithms. We propose an unsupervised rule-enhanced labeling with support vector machines (SVM) that enables the SEA system to learn from much larger historical data from multiple fields. We then further improve this algorithm by proposing a multitask learning algorithm that combines multiple decision-relevant factors to yield a better generalized global model.

As a pure data-driven system, SEA is evaluated using real-world data from thousands of rod pump artificial lift systems in multiple heterogeneous oil fields. Experiments show that SEA produces good results and significant economic value for use in oil fields.

Chapter 1
Introduction

1.1 Background

Artificial lift techniques are widely used in the oil industry to enhance production for reservoirs with energy levels too low to directly lift fluids to the surface.
Among the various artificial lift techniques in the industry (such as gas lift, hydraulic pumping units, electric submersible pumps, progressive cavity pumps and sucker rod pumps), the sucker rod pump technique is the most commonly used; it places a sucker rod pump at the bottom of the tubing string, with a surface pumping unit and a sucker-rod string running down the well to connect them (Hyne, 2001). Sucker rod pump systems are the most commonly applied artificial lift method worldwide, accounting for 59% of all artificial lift in North America and 71% of the 832,000 wells in the rest of the world; in the U.S., there are about 350,000 such wells (Rowlan et al., 2007). A typical oil field contains hundreds of wells, and each field has its unique geological formation.

Well failures in oil field assets lead to production loss and can greatly increase OPEX (operational expense). Experts with rich experience are capable of identifying various types of anomalies by combining different types of information, such as a well's recent performance, its event log and the performance of neighboring wells. Such anomalies, once identified, have a high probability of being followed by a failure in the future, and may already be causing economic losses due to loss of production or inefficient operation. Thus either proactive maintenance or a repair task can be scheduled. However, due to the limited number of trained personnel and resources available for managing large fields, such proactive operation is difficult to deploy manually in real-world petroleum operations. Therefore, automating the monitoring is important for achieving higher economic benefits in petroleum field operations. Within the last decades, the petroleum industry has begun a series of efforts in instrumenting well equipment and archiving useful event logs. Massive amounts of data containing historical event logs and time series have been accumulated by the enterprise. This opens the potential for data mining methodologies to learn from experts how they predict well failures. Successful failure predictions can dramatically improve performance, for example by adjusting operating parameters to forestall failures, or by scheduling maintenance to reduce unplanned repairs and minimize downtime. In a smart oil field, decision support center (DSC) specialists use measurements from the fields for efficient and integrated management and operation.

The reasons for rod pump failures can be broadly classified into two main categories: mechanical and chemical. Mechanical failures are caused by improper design, improper manufacturing, or wear and tear during operations. Well conditions may contribute to excessive wear and tear, such as sand intrusions, gas pounding, rod cutting and asphalting. Chemical failures are caused by the corrosive nature of the fluid being pumped through such systems; for example, the fluid may contain H2S or bacteria. For rod pumps, one of the major mechanical failures is the tubing failure, in which the tubing starts leaking because of accumulated mechanical friction and cutting events. A tubing leak does not cause a rod pump to shut down, but will lead to failure in the near future. It happens down-hole, which makes it difficult for a field specialist to identify the anomalous status via visual or sound inspection. If not discovered in time, such a leak causes continuous loss of production and significantly reduces the rod pump's efficiency. The attribute trends exhibited by a tubing failure can in principle be detected by SMEs.
However, because of massive number of rod pumps that typical 2 SMEs are responsible for day-to-day monitoring, it is dicult for SMEs to notice specic tubing leaks when they rst occur (Liu et al., 2010). Currently pump o controllers (POCs) play a major role in monitoring the operation of well systems. POCs can be programmed to automatically shut down units if the values of torque and load deviate beyond cuto boundaries. Also, by analyzing the dynocard patterns collected by the POCs one can understand the behavior of these units. These POCs reduce the amount of work required by production and maintenance personnel operating a eld. However, POCs by themselves are not sucient. In the elds studied in this dissertation the sheer number of these units compared to the number of personnel still demands a great deal of time and eort in terms of monitoring each and every unit operating. This mere ratio of number of operating units to that of personnel involved in maintaining them suggests a need for more automated system, such as an articial intelligent system that can dynamically keep track of certain parameters in each and every unit from every eld, give early indications or warnings for failures, and provide suggestions on types of maintenance work required based on knowledge acquired from best practices. Such system would be a great asset to industry personnel by allowing them to be more proactive and to make better maintenance decisions. Such a proactive system will increase the eciency of the pumping units to bring down OPEX and to make the pumping operations more economical. 1.2 Motivation One of the key challenges in predicting failures is how to model the predictive func- tional conditions using data-driven approach given log history with multiple attributes and failure events. Traditionally, following manage-by-exception (MBE) practice, an SME veries a failure once a rod pump stops functioning. By checking the previous dynamometer card and other logs such as run time and production history, the failure can be veried which leads to a decision and/or solution. In order to document a failure, a \look-back" over the attributes recorded by POC is usually performed to identify the 3 failure causes. With sucient experience, SMEs can visually identify the date range when anomalies began and recognize the exact failure date. But for a machine to auto- matically follow such practice for real-time alerts remains dicult. The solution of this problem can be generic to other MBE surveillance systems. For predictive analytics, machine learning techniques have been successfully applied to a variety of problems. In machine learning this problem can be described as how to learn a reliable and generalized multi-variate time series classier which can then be scaled up for real-time failure predictions for thousands of rod pumps in multiple oil elds. As an important subject in machine learning, supervised learning has been used in modern data mining approaches on time series classication, such as handwriting recognition (Rath & Manmatha, 2003), membrane protein sequences (Xing et al., 2008), real time system diagnosis (Bottone et al., 2008). In supervised learning the data mining algorithm is given positive and negative training examples of the concept the algorithm is supposed to learn. However, because the data is dicult to be segmented into xed- length time series, it is dicult to utilize such algorithms for our problem. 
Instead, rather than directly applying learning algorithms, more sophisticated features need to be extracted to model the conditions.

Since the data is collected via sensors, a common problem faced by real-world operations is that many uncertainties influence the data, such as weather, geological activity, operational interruptions, and sensor and communication errors. How can a reliable model be learned despite the existence of heavy noise? Furthermore, since each rod pump is a single functional unit, and there are hundreds of them, how can such a reliable model be generalized so that it can be applied to so many rod pumps? Last but not least, the rod pumps are usually managed by fields that are geographically separated; how can the model be further generalized so that it performs well across many fields?

1.3 Objectives

Our goal is to develop a model that is capable of predicting down-hole failures of rod pumps on a daily basis across heterogeneous fields using data mining and machine learning approaches. By "prediction" we mean detecting early signals, such as mechanical friction and rod cut events, that potentially lead to a tubing leak or pump failure; and once a failure happens, also detecting it in time. Rather than using a specific model per well, our objective is to achieve a generalized global failure prediction model that works across multiple rod pumps located in multiple fields.

The techniques and methods that we develop to address this failure prediction problem will be applicable to other similar problems, including other critical event predictions in the oil industry, as well as other enterprise-level collections of multiple time series that require anomaly detection capability but suffer from sparse labels. In computer science, one such problem is predicting slower machines in cloud computing environments, to optimize time and resource consumption for large computational tasks (Ananthanarayanan et al., 2010). In smart grids, detecting anomalous energy usage of different units is an important step toward reducing energy consumption (Mo et al., 2012).

1.4 Major Contributions

In this thesis, we formulate the rod pump failure prediction problem as a classification problem over noisy multiple multivariate time series with sparse labels. First, we present our success in designing and implementing an automated data mining system that addresses all key challenges, with baseline solutions using state-of-the-art machine learning techniques. Next, we present our semi-supervised support vector machine algorithm, which is capable of learning and predicting for a single oil field. Then we extend the algorithm by modeling it as a multitask learning method that considers the interrelations between different wells and different fields, so that appropriate bias can be learned to guide the hypothesis space search; as a result, this both extends our prediction algorithm to be broadly applicable and improves our prediction performance. Our major contributions are:

1. Design of the Smart Engineering Apprentice system, a fully automated system that is able to predict down-hole failures of rod pumps;

2. Design of a semi-supervised learning algorithm that integrates domain knowledge and clustering with support vector machines, producing multiple reliable and generalized predictive models for specific fields;
3. Development of a new multitask learning formulation, called weighted task regularized multitask learning, that provides added bias flexibility for dealing with heterogeneous tasks. This formulation is capable of producing a single global predictive model that outperforms field-specific models by 10% on a synthetic dataset and by 5-10% in precision and recall on a rod pump failure prediction dataset spanning multiple oil fields.

1.5 Outline

The remainder of the dissertation is organized as follows:

1. Chapter 2 reviews relevant literature on real-world applications and academic approaches for anomaly detection and failure prediction, covering rod pumps, biosurveillance and computer system reliability. We then discuss the applicability of these approaches to rod pump failure prediction, and introduce multitask learning, an emerging area of machine learning with the potential to achieve better generalized prediction models.

2. Chapter 3 presents the Smart Engineering Apprentice system. Given various instrumented parameters from the POCs installed on each rod pump, together with a few years of past data, SEA enables a data-driven process that can learn from the historical data and make failure predictions. SEA has been deployed in a real-world application that effectively predicts down-hole failures. We present all the necessary modules of a fully automated failure prediction system, including data collection, feature extraction, the prediction engine, evaluation, alert generation and visualization, and knowledge management, while addressing the challenges in each module.

3. Chapter 4 explores the key component of SEA: the prediction engine. Given a massive amount of unlabeled data, rather than supervised learning we propose a semi-supervised learning algorithm that can learn from historical data, under which the classifier is adjusted prior to prediction on each rod pump, following a process that learns the normal behavior of that specific pump. This provides a field-specific model that works well on the field the data comes from. It has been successfully used in real-world applications.

4. Chapter 5 describes the development of a new version of the learning model, the global model, which is applicable to multiple oil fields. The field-specific model generalizes poorly when applied to other fields; moreover, it requires labor-intensive labeling, since SMEs must validate each label that our algorithm proposes during the training set construction stage. The global model overcomes this by applying automated rule-enhanced clustering-based labeling on multiple oil fields, which yields a much larger training set while maintaining comparable precision and recall. This grants SEA scalability up to multiple oil fields, and only a single model needs to be maintained.

5. Chapter 6 describes a novel machine learning algorithm called weighted task regularized multitask learning (WRMTL). The global model's limitation is that it smooths over all training data, whereas in reality wells from different oil fields exhibit different behaviors due to many unconsidered factors such as geological formation, operations, etc. WRMTL accounts for such differences among fields, generalizing the global model by allowing different levels of bias in heterogeneous fields. Our experiments using both a synthetic dataset and the rod pump dataset show that WRMTL is both theoretically and practically better than the global model.
6. Chapter 7 explains the confidence-level model, which enables better decision support by considering other high-level factors. Via logistic regression, multiple decision-related factors are integrated over all historical failure predictions. This model also enables a ranking mechanism across the failure predictions for different rod pumps, with explanations of which key factors drive a failure prediction. The confidence-level model serves as a complementary model for knowledge management. At the end of the chapter, an updated SEA system module diagram is given.

7. Chapter 8 concludes this dissertation with a summary of the topics covered. In future work, additional attributes, such as corrosion, can be considered in the SEA system. Additional studies can also be done on correlations of other factors with well failures; for example, treating wells from gaseous reservoirs as a separate task from wells that mostly produce oil, to make the multitask learning finer-grained.

1.6 Conclusion

In this chapter, we introduced the background of why failure prediction for rod pump artificial lift systems is important, as well as why it is a challenging computer science problem for pure data-driven predictive modeling. With data instrumented via the pump-off controllers installed on each rod pump, together with past failure logs, we described the two major challenges for developing a data-driven failure prediction system. Our objective is to develop a model that can predict down-hole rod pump failures on a daily basis across heterogeneous fields using data-driven approaches. In the following chapters, we present our Smart Engineering Apprentice (SEA) system and how these challenges are handled.

Chapter 2
Related Work

There are several approaches and methodologies for failure prediction in rod pump systems; we categorize the existing approaches into two broad classes, "industrial" and "academic", and briefly review them.

2.1 Industrial Approach

In the petroleum industry, the dynamometer card (a.k.a. dynocard or dynagraph) is a standard tool derived from POCs for tracking a well's current performance. Each dynocard measures a whole upstroke and downstroke cycle of a rod pump; its horizontal axis is the position along the stroke length, and its vertical axis records the load corresponding to that position (Eickmeier, 1967). As Figure 2.1 illustrates, different shapes of dynocards represent different functional conditions. Some of them might lead to failure (fluid pound), while some might just drag a pump's efficiency down (gas interference, asphaltenes). The whole process requires manual dynocard reading and visual interpretation; neither is automated, and thus it is not efficient.

Figure 2.1: Four different dynamometer cards and their corresponding symptoms: (a) normal dynocard, (b) fluid pound, (c) gas interference, (d) asphaltenes in pump (Rowlan et al., 2007).

Existing automated methodologies such as SmartSignal (Snyder, 2009), now part of GE Intelligent Platforms, attempt to model each rod pump's data behavior using a non-parametric approach. As the state-of-the-art single-point solution for rod pumped systems, SmartSignal relies on a non-parametric state matrix to evaluate every new record. It requires careful tuning of the model by a sophisticated user, after which it can be applied to raise an anomaly signal whenever a new observation is "invalid". However, in order to apply SmartSignal to the many wells in a field, one has to develop as many well-specific models, which require their individual historical records.
Then experts have to manage each model so that it captures updates whenever there is a change in the status of any equipment. To the best of our knowledge, as of the start of our research, SmartSignal did not consider inter-well relationships: each model has to be maintained on a regular basis using only information from that specific well. Clearly it is infeasible to apply SmartSignal to hundreds of wells, let alone thousands. For this domain, it brings more benefit if a single model performs the failure prediction task on all wells.

2.2 Academic Approach

2.2.1 Anomaly Detection

Anomaly detection is an important problem addressed by data mining researchers. It refers to the problem of finding patterns in data that do not conform to expected behavior (Chandola et al., 2009). This non-conformity can be represented in many different ways, including statistical outliers and change points, and it has many different application areas, such as fraud detection for credit cards (Phua et al., 2005), intrusion detection for Internet traffic (Sabahi & Movaghar, 2008), and disease outbreaks in biosurveillance (Damle et al., 2011). For different problems, the solutions can be quite different.

SmartSignal can be categorized as an anomaly detection solution via non-parametric methods, which aims at building a model of normal data and then attempting to detect deviations from this model in newly observed data. Such an approach is useful for anomaly detection problems such as anomalous scene detection in video surveillance (Seidenari & Bertini, 2010), traffic anomaly detection (Lenser & Veloso, 2005), and disease outbreaks (Zhang et al., 2003).

In early epidemic outbreak detection (Goldenberg et al., 2002), the disease caused by an inhalational anthrax attack always develops in two stages: 1) the first, lasting from several hours to several days, causes influenza-like symptoms such as low-grade fever, non-productive cough, fatigue and chest discomfort; 2) the second stage involves the abrupt onset of high fever, severe respiratory distress and shock. In this case, it is very important to track the trend of increases in over-the-counter (OTC) sales of influenza medicines for early anthrax detection. However, relying simply on the OTC counts is not enough, due to the potential overlap with flu season and anomalous weather conditions, as displayed in Figure 2.2. In (Wong, 2004) and (Cooper et al., 2004), additional information is used, such as season, weather, date, gender, age, and so on. Given multiple factors and relational predicates combined with OTC drug sales, a Bayesian network is constructed to capture the relationships among the different factors; this yields a generative model that is able to calculate the p-value of observations, and a threshold on the p-value is then used to determine whether an alert should be raised.

Figure 2.2: Cough syrup and liquid decongestant sales from (Goldenberg et al., 2002) for a disease outbreak detection study.

Artificial neural network models have been applied to detecting well failures based merely on the dynamometer card (L. Alegre & da Rocha, 1993) (Ocanto & Rojas, 2001). Although much information can be gained from visual interpretation of surface dynamometer cards, success is directly linked to the skill and experience of the analyst, and even the most experienced analysts can be misled into an incorrect diagnosis (Ocanto & Rojas, 2001).
(Xu et al., 2007) uses an artificial neural network to solve the dynamometer card auto-recognition problem by constructing a self-organizing competitive neural network model. Because of the limitations of our data instrumentation, we do not have direct access to daily dynamometer cards with which we could conduct a similar study.

There is also similar research that uses parameters other than the dynocard. (Tian et al., 2007a) uses a fuzzy neural network to model observed parameters, including "various status messages" and "different kinds of information", to build a three-layered network. Five different fuzzy rules are encoded as five smaller neural networks to represent different functioning statuses, from "normal working" to "pumping rods serious eccentric wear". The Levenberg-Marquardt optimization algorithm is then used to train the fuzzy neural network; the whole experiment was conducted on a maximum of 170 samples. A similar approach using support vector machines was also tried (Tian et al., 2007b). However, the data for these two studies was never made public.

2.2.2 Failure Prediction

As a subset of anomaly detection, the goal of failure prediction is to detect consistent precursory events that may lead to a critical failure in the future. Rather than the short-term outcome of common reactive anomaly detection, failure prediction emphasizes long-term value. Data mining algorithms play an increasingly important role as an approach to predicting computer system failures (Salfner et al., 2010) because of their capability to encode domain constraints and combine machine learning algorithms for prediction.

In (Hamerly & Elkan, 2001) (Hughes et al., 2002), hard drives' S.M.A.R.T. (Self-Monitoring and Reporting Technology) logs are the major source for predicting disk drive failures. The S.M.A.R.T. log comprises drive aging, drive power cycles, errors corrected by the inner ECC code, mechanical shock, and so on. S.M.A.R.T. signal attributes are transformed into probabilities based on their frequencies, and discretization is applied to continuous numeric attributes. In order to avoid zero probabilities, which impact the Expectation Maximization algorithm, smoothing is used by assigning such cases the lowest probabilities. For hard drive failure prediction, the authors categorize parameter learning algorithms into two classes: 1) fully unsupervised, learning from both normal and anomalous data, under the assumption that the anomalous data is rare enough not to affect the model parameters significantly; 2) semi-supervised, learning a model from only the normal data, with the anomalous instances removed from training. They note that, in general, semi-supervised learning is preferable since it should learn the most accurate model for the normal class; unsupervised learning is useful if the training data is not labeled at all and the assumption holds that anomalous data is indeed rare in the training set. In their dataset, 34 snapshots out of 94,022 represent 9 will-fail drives. They use a variant of the Naive Bayes model trained with EM.

For failure prediction in computer system reliability studies, three general machine-learning-based categories were summarized in (Salfner et al., 2010):

1. outlier/change-point based;
2. pattern based;
3. model based.

Type 1 is widely used in wireless sensor networks for detecting anomalous behavior by optimizing a threshold on each individual parameter.
For example, in (Wang & Yu, 2005) (Wang et al., 2008) (Bahrepour et al., 2009), an outlier or change point is indicated by jointly considering the behavior of neighboring sensors. By carefully setting the thresholds, combined with the neighboring sensors' behaviors, this method provides clearly traceable causes for each event. It is difficult to extend this approach to our failure prediction problem because of the difficulty of picking appropriate thresholds. Moreover, in our case each rod pump needs its own thresholds and exhibits multiple attributes characterizing different aspects of well behavior, so the cost of both developing and maintaining so many individual models is too high. Another concern is that the actual data is always noisy and sometimes exhibits false records, which would cause many false alerts.

Pattern-based methods assume that the event patterns are known or already listed for the time series. For example, (Vilalta & Ma, 2002) extracted patterns based on event frequencies and used the a priori algorithm combined with rules to predict rare events: given the observations, if they differ from the predefined patterns, then an event can be concluded. A similar approach was taken by (Zhang & Sivasubramaniam, 2008), which focused on IBM BlueGene/L event logs; the failure patterns are learned by linking statistical patterns with the event log, forming a classification problem. Such an approach requires an accurate understanding of the domain, as well as the right model that is able to learn from the training patterns. Once the patterns are identified, a wide range of pattern recognition methodologies can be applied. However, if the labels are sparse, it becomes difficult to apply this approach directly. Thus adjustments have to be made for our problem, including determining what the patterns are, how to identify the patterns of interest and how to learn from them.

The last category is model-based event prediction/detection. This approach is domain driven, and different models have been built according to the data characteristics: Dynamic Bayesian Networks (DBN) (Wang et al., 2008), Markov Random Fields (MRF) (Wang & Yu, 2005), and stochastic processes combined with Hidden Markov Models (Ihler et al., 2006) (Au et al., 2010) have been explored. In (Ihler et al., 2006), the method links Poisson processes to form a Hidden Markov Model to model traffic patterns. This approach provides a reasonable framework for estimating the dynamics of traffic patterns and the corresponding event probabilities. The input to the system is univariate sensor data with strong periodic patterns; frequency analysis of the data shows clear periodicity by hour, day and week. The events also occur repeatedly, so a Poisson process model linked with static transition probabilities is a reasonable choice. However, because of the particular structure of the graphical model under a Poisson distribution of events, it is difficult to apply it to our dataset: most rod pumps have zero or one failure throughout our dataset's time range, making the event distribution assumption a poor fit. Moreover, due to the lack of existing domain studies, we also find it difficult to fit our data into a generative model.

The next chapter presents the way we convert the rod pump failure prediction problem into a hybrid of pattern-based and model-based methods, involving a systematic combination of data mining techniques and machine learning algorithms applied in each module of the system.
2.3 Multitask Learning

Recently, multitask learning (MTL) has attracted a lot of research attention and has been shown to outperform traditional supervised learning algorithms, because it provides more flexibility for including high-level relations as constraints between different learning tasks. MTL is an approach to learning multiple tasks in parallel while using a shared representation, under the assumption that the tasks are related (Caruana, 1997). The intrinsic task relatedness is critical for generalizing single-task learning into a global multitask solution. Various regularization metrics can be applied to handle different assumptions about how the tasks are related: tasks can be fully related (Evgeniou & Pontil, 2004) (Evgeniou et al., 2005), contain outliers (Gong et al., 2012), or be organized in groups (Zhou et al., 2011), graphs (Chen et al., 2009) or trees (Kim & Xing, 2010). (Zhou et al., 2011) (Zhou et al., 2012) apply MTL to predict disease progression for different patients using multiple ridge regressions with a temporally smoothing probabilistic prior during the multitask learning process.

For rod pump failure prediction, the ultimate goal is to develop a global model that can be applied to multiple fields, because global models are easier to update and maintain. Because rod pumps come from multiple fields, different oil fields may affect the data behavior of each rod pump differently, through factors such as oil recovery methods (CO2 flood vs. water flood), geological formation (heavy oil vs. light oil, underground fractures), and field operations. Each field may have only a limited number of historical failures that can be used for training, albeit inexact ones, while rod pump failures are expected to exhibit similar patterns across fields. To be statistically robust we need to generate models that recognize training examples from past failures, but we should allow the flexibility for each field's model to have some differences, to improve robustness. If we only consider adding more failures collected from different fields into one model, the resulting model, smoothed over all fields, might not function well because it ignores field-level differences. The alternative of developing multiple field-based models, each with few training examples, is expensive to maintain. Therefore, we need to generalize the methodology to form a global model that performs well at predicting failures across multiple fields; multitask learning is a good way to achieve this goal.

2.4 Conclusion

Failure prediction is an important type of anomaly detection task that aims at identifying the precursory events that may lead to real failures. In order to perform such a task, domain knowledge has to be carefully combined with data mining and machine learning algorithms. For the failure prediction domain studied here, there is much to explore in order to achieve effective predictive models. Existing industrial and academic approaches have limitations that prevent them from being directly applied to our problem. In our work, we propose a systematic data mining methodology called Smart Engineering Apprentice for rod pump failure prediction that is a hybrid of pattern-based and model-based approaches. As the core of the SEA system, the classification algorithm, we propose a semi-supervised learning algorithm, and then an improved multitask learning algorithm that is generalized over multiple fields.
Chapter 3
Smart Engineering Apprentice

3.1 Background

A well failure not only reduces the efficiency of the pumping operation, but also causes reactive well workovers or repairs. Compared with proactive maintenance, which can help reduce such risks because the actual failure may never happen, reactive repairs may create hazardous working environments, such as gas or oil spillovers. The repairs require shutting down the well, which causes production loss and increases the operating expense (OPEX).

As the most used oil recovery technique, the rod pumping method is used in about 80-85% of oil wells all over the world (Tian et al., 2007a). A single rod pump can function in excess of 30 years (Rowlan et al., 2007), and each may experience multiple failures during its lifetime under daily data monitoring. Figure 3.1 shows the scale of our failure prediction problem: a single oil field can have over 100 rod pumps, while an oil company may have more than 10,000. Monitoring performance and being certain that most wells are functioning properly is critical.

Figure 3.1: The hierarchy of the oil industry with respect to number of rod pumps.

Figure 3.2: Instrumented rod pump system (courtesy Weatherford, EP Systems (Rowlan et al., 2007)).

Pump-off controllers (POCs) play a major role in monitoring the operation of rod pumping well systems. The POCs can be programmed to automatically handle exceptional conditions, such as shutting down wells if the values of torque and load deviate beyond cutoff boundaries. As Figure 3.2 shows, the POC is a device installed at the surface (polished rod) of each rod pump. POCs are also designed to collect data from well sensors, such as the load on the rod pump, its runtime, etc. By analyzing the dynamometer card patterns collected by the POCs, experts can diagnose the behavior of these wells. These POCs reduce the amount of work required of production and maintenance field personnel. However, the POCs by themselves are not sufficient: field personnel are still required to spend a great deal of time and effort monitoring the operation of each and every well. As Figure 2.1 illustrates, different shapes of dynocards represent different functional conditions; some of them might lead to failure (fluid pound), while some might just drag a pump's efficiency down.

Ideally, the down-hole dynamometer card reflects the exact condition of a well. However, due to the data instrumentation problem, POCs can only measure information from above the surface, at a position called the polished rod above the wellhead. The depth of each well varies from several hundred feet to thousands of feet, so some information can be inaccurate. The area, peak load and minimum load are extracted from each dynocard. Besides the dynocard, each well's run time, cycles (number of ons and offs), stroke length, strokes per minute, production and so on are also collected. From a data-driven perspective, 14 different parameters are collected on a daily basis.

A common problem with real-world data is its quality. As Figure 3.3 shows, even for a normal well there is a great deal of missing data across its production period, and there are noisy outliers as well. Ideally, companies should keep logs of all the downtimes, but in reality they do not; when downtime happens, the data is simply missing. Additionally, some missing data is caused by loss of communication with the POCs. The data, as in most anomaly detection or event prediction systems, is mostly stable, but outliers do exist.
Before any feature analysis and prediction can be performed, the data has to be processed to ensure quality. In our previous work (Liu et al., 2010) (Liu et al., 2011), all the missing data is ignored or interpolated, and data outliers are either smoothed or handled in the feature extraction step.

POCs gather and record periodic well sensor measurements indicating production and well status through load cells, motor sensors, pressure transducers and relays. Some attributes recorded by these sensors are relevant to a potential downhole failure. All these attributes together define a labeled multivariate time series dataset for artificial lift.

Figure 3.3: Daily measured multivariate time series out of the POC for a specific rod pump.

With such monitored attributes, Figure 3.4 shows an example of the progression of a tubing failure. The chart displays several selected attributes that were collected through the POC equipment. This well failure was detected by field personnel on March 31, 2010. After pulling all the pumping systems out of the ground, they discovered holes in the tubing that caused the leaking problem, which in turn reduced the fluid load on the rod pump carrying the fluid to the surface. Also, through a "look back" process, they found that the actual leak started around February 24, 2010. Even before that, the SME confirmed that there were "rod cut" events starting around November 25, 2009, when the rod started rubbing against the tubing. The problem grew worse over time, resulting in large holes in the tubing. The holes caused the fluid to leak back down into the well, although the rod pump itself was still functioning. This is an ideal failure example because it clearly progresses from normal to pre-failure (rod cut) to failure (tubing failure).

Figure 3.4: An ideal failure example that reveals the progress of a failure.

However, in most of our cases, the failures develop in a much messier way: the data switches between normal and abnormal back and forth multiple times, data ranges differ due to different pump configurations, failures develop at varying speeds, and so on. And although each well generates years of data, the true problem we face is not a single well but many, very many wells from many fields. As rod pumps are the major type of oil well bringing oil from underground to the surface, an oil field may have hundreds of rod pumps, and a big oil enterprise has up to tens to hundreds of oil fields all over the world.

As reviewed in the previous chapter, machine learning is a widely used technique in a variety of modern time series anomaly detection problems. Given predictive features as a representation of future events (functional status), the goal is to learn such a function (a.k.a. model) that is then used in real-time prediction. Common types of machine learning are supervised and unsupervised learning. In supervised learning, the algorithm is given positive and negative training examples of the concept it is supposed to learn; this contrasts with unsupervised learning, where training examples are not given. In this chapter, we propose a data mining system called Smart Engineering Apprentice (SEA) that takes as input multivariate time series data from LOWIS (Life of Well Information Software: http://www.ep-weatherford.com/solutions/Software/LOWIS.htm) and generates a machine learning model that can predict future rod pump failures that happen downhole. We also describe the challenges in each module of the system.
3.2 System Design

Figure 3.5 shows the flow of the Smart Engineering Apprentice system. In order to take the monitored attributes from LOWIS as input and produce prediction alerts as output, SEA comprises a minimal but practical set of modules.

Data collection and preparation instruments and collects useful attributes directly from the system that can be further used to analyze its running status, then performs basic noise handling to stage the data for the next step.

Feature extraction extracts features that best represent the functional status. The features should be robust and predictive, i.e. they should not be impacted significantly by noise or non-failure events.

Machine learning applies machine learning algorithms to learn the patterns of normal and failure events, enabling the resulting classifier to predict events of interest when new data arrives.

Alerting generates alerts by applying the learned classifier to incoming data and presents the predictions as watchlist alerts via visualization.

Review alerts evaluates the performance of the predicted results and makes adjustments to the classifier if necessary.

Knowledge management encodes input from SMEs and other domain knowledge so that the feedback can be used to improve the system appropriately by adjusting the classifier. Knowledge management also estimates each alert's confidence level.

Figure 3.5: Smart Engineering Apprentice system flow diagram.

In each module, there are challenges to be carefully handled, which the following sections cover.

3.3 Data Collection and Preparation

Determining what to collect is a nontrivial and domain-dependent question, as it is crucial to rank the importance of different parameters with respect to data quality and behavior. This phase involves intense collaboration with SMEs in order to confirm the usefulness of the parameters that can be instrumented and collected. In general, data collected from the field will exhibit noise, duplicated data and missing data; therefore, methodologies including data cleansing are needed. The challenges at this stage are to understand the problem itself, to determine the mechanism of data instrumentation, and to decide how to process the "raw" data so that it is ready for the next step.

3.3.1 Data Extraction

Data collection always comes first in any data mining problem. It should provide software connectors capable of extracting data from artificial lift data systems and feeding it to the prediction system. In the SEA system this is achieved by running SQL queries on LOWIS's backend database, or directly on the data mart, to extract the necessary attributes for each well in the form of time series. Unique identifiers and record dates are needed in order to evaluate the correlation of all attributes with future events.

Some attributes recorded by POCs are direct extractions from the dynamometer card, including card area, peak surface load and minimum surface load, along with other working status parameters, namely strokes per minute, surface stroke length, flow line pressure, pump fillage, yesterday cycles, and daily run time. Additionally, gearbox torque, polished rod horsepower, and net downhole pump efficiency are calculated. These attributes are measured daily, sent over a wireless network, and recorded in a LOWIS database. Besides the dynamic attributes, PumpID is also included to differentiate between wells.
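To make the extraction step concrete, the sketch below shows what such a pull might look like. The table and column names are hypothetical stand-ins, since the actual LOWIS schema is proprietary and not reproduced in this document; only the attribute list mirrors the one above.

```python
# Hypothetical sketch of the daily-attribute extraction step. The table
# and column names are illustrative stand-ins, not the real LOWIS schema.
import sqlite3  # stand-in for the site's actual database driver

import pandas as pd

DAILY_QUERY = """
SELECT well_id,                -- unique well identifier
       record_date,            -- one row per well per day
       card_area, peak_surface_load, min_surface_load,
       yesterday_cycles, daily_run_time,
       strokes_per_minute, surface_stroke_length, flow_line_pressure,
       pump_fillage, gearbox_torque, polished_rod_hp,
       pump_efficiency, pump_id
FROM daily_well_measurements
ORDER BY well_id, record_date;
"""

def extract_daily_series(conn) -> pd.DataFrame:
    """Pull the per-well daily multivariate time series."""
    return pd.read_sql(DAILY_QUERY, conn, parse_dates=["record_date"])

if __name__ == "__main__":
    with sqlite3.connect("lowis_extract.db") as conn:
        series = extract_daily_series(conn)
    # One multivariate time series per well id.
    print(series.groupby("well_id").size().describe())
```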
In the LOWIS database these attribute values are indexed by a well identifier and a date. In addition to these daily measurements, field specialists perform intermittent tests and enter the test results into the LOWIS database. These attributes include last approved oil, last approved water, and fluid level. Since these attributes are not measured daily, LOWIS fills in the missing daily values with the previous measurement, i.e. these attribute values are assumed to be piecewise constant. Finally, we add a special attribute called class. The class attribute label indicates the daily status of the well, i.e. either it has failed or there are no recorded events 2. We merge the status column from the well work table to initially fill up the class attribute. All these attributes together define a coarsely labeled multivariate time series data set for each rod pump.

Based on consultation with SMEs, we partitioned the attributes into three groups, and ranked the groups according to a metric that combines relevancy to failure prediction and data quality. The three groups are labeled A, B and C, with group A being the most relevant and having the highest data quality.

A Card area, peak surface load, minimum surface load, yesterday cycles, daily run time

B Strokes per minute, pump fillage, calculated gearbox torque, polished rod horsepower, net downhole pump efficiency, gross fluid rate (sum of last approved oil and water), flow line pressure

C Surface stroke length, PumpID

2 Here "no recorded events" does not necessarily mean that no failure is happening. We will discuss this later.

3.3.2 Data Cleansing

Usually there should be one record per day. However, due to implementation limitations and communication errors, there exist duplicated records and missing records in the database. In order to keep the dates continuous, which predictive analytics requires, it is necessary to remove such duplications and fill such gaps. Our cleaning process includes removal of duplicate data and padding of missing records. The main logic is to scan the records sorted by well id and date in ascending order, removing the duplicated records from the database one by one. Once all duplicates are removed, we fill the date "gaps" that break continuity: for such gaps, we pad dummy records with all numeric attributes set to 0.

3.3.3 Noise Handling

The next challenge is how to deal with statistical noise. Raw artificial lift time series data almost always contains noise and faults, which can be attributed to multiple factors. Severe weather conditions, such as lightning strikes, can disrupt communication, causing data to be dropped. Transcription errors may occur if data is manually entered into the system. Noisy and faulty data can significantly degrade the performance of data mining algorithms (Han et al., 2011). The purpose of noise handling is to reduce the noise as much as possible. We applied two noise reduction techniques: Grubbs's test is used to detect statistical outliers, and then the locally weighted scatter plot smoothing (a.k.a. Lowess) algorithm (Cleveland & Devlin, 1988) is used to smooth out the impact of the outliers. A minimal sketch of this two-step procedure is given below.
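The sketch assumes daily values arrive as NumPy arrays; the significance level alpha and the Lowess bandwidth frac are illustrative choices, not the values used in the deployed system.

    import numpy as np
    from scipy import stats
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def grubbs_outlier(x, alpha=0.05):
        """Return the index of the most extreme point if Grubbs's test rejects it, else None."""
        n = len(x)
        if n < 3:
            return None
        mean, sd = np.mean(x), np.std(x, ddof=1)
        i = int(np.argmax(np.abs(x - mean)))
        g = abs(x[i] - mean) / sd
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
        return i if g > g_crit else None

    def denoise(days, values, frac=0.3):
        """Iteratively drop Grubbs outliers, then Lowess-smooth the remainder."""
        days, values = np.asarray(days, float), np.asarray(values, float)
        while True:
            i = grubbs_outlier(values)
            if i is None:
                break
            days, values = np.delete(days, i), np.delete(values, i)
        smoothed = lowess(values, days, frac=frac, return_sorted=False)
        return days, smoothed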
Figure 3.6 shows the impact of outliers before and after the smoothing process, using linear regression on artificial data points to which we added random Gaussian noise and two outliers. We can easily observe that the two outliers biased the curve by introducing two local peaks that do not in fact exist. After we removed the outliers, the same regression algorithm was able to recover the original shape of the curve.

Figure 3.6: An example to show how outliers impact the feature extraction process. By handling the outliers, a smoother curve can be revealed.

3.4 Feature Extraction

The key challenge for feature extraction is to understand what type of features can be used to represent failure, as well as normal status. A good set of features should be able to capture the patterns that SMEs usually use to identify rod pump anomalies. Roughly speaking, for our problem the features that SMEs use are the "trends". This section contains our analysis of the data and our solution with respect to feature extraction.

3.4.1 Data Analysis

In order to analyze which features are useful for failure prediction, we begin by analyzing the correlation among the collected attributes. Each well is characterized by multiple attributes, where each attribute by itself is a temporal sequence. This type of data set is called a multivariate time series. In this section, we follow the formal description given in (Wei & Keogh, 2006).

Definition 1. Multivariate time series: A multivariate time series T = (t_1, t_2, ..., t_m) is an ordered set of m variables, where each variable t_i = (t_{i1}, t_{i2}, ..., t_{ik}) is a k-tuple containing k real values.

In our problem, a multivariate time series always refers to the data of a specific well. Data miners are typically not interested in the global properties of a whole multivariate time series. Instead, we are more interested in deciding which subsection is abnormal. Therefore, given a long multivariate time series per well, we prefer to convert every well's records into a set of multivariate subsequences.

Definition 2. Multivariate subsequence: Given a multivariate time series T of length m, a multivariate subsequence C_p is a sampling of length w, w < m, of contiguous positions from T, that is, C_p = (t_p, t_{p+1}, ..., t_{p+w-1}) for 1 ≤ p ≤ m - w + 1.

We also define sliding window subsequences:

Definition 3. Sliding window subsequences: Given a multivariate time series T of length m, and a user-defined multivariate subsequence length w, the sliding window subsequences are all possible subsequences that can be extracted by sliding a window of size w across T and extracting each subsequence.

Next, we need to estimate the subsequence sampling length w. If w is too small, the subsequences may fail to capture enough trend information to aid in failure prediction. If w is too large, the subsequences may contain extraneous data that hinders the performance of data mining algorithms. High dimensional data are well known to be difficult to work with; this is usually called the curse of dimensionality. In addition, high dimensional data may incur large computational penalties. To help us estimate the correct value of w, we ask two questions: 1. what is the dependency between attributes across time? 2. what is the dependency between an attribute's current value and its prior values? For the first question cross-correlation analysis was applied, and autocorrelation for the second.

Definition 4. Cross-correlation: Given a multivariate time series T of k attributes, the cross-correlation is a measure of similarity of two attributes' sequences as a function of a time-lag applied to one of them.
Definition 5. Autocorrelation: Given a single time series T, the autocorrelation is the cross-correlation of T with itself.

Figure 3.7: Correlation analysis between different attributes using all available data - missing data excluded.

Figure 3.7 describes the correlation analysis among a subset of four attributes from our dataset: card area, daily run time, yesterday cycles and last approved oil. The x-axis is the time-lag, and the y-axis is the correlation. The subplots along the diagonal are autocorrelations, and the other subplots show the cross-correlations. These plots indicate that pairwise attributes rapidly become uncorrelated as a function of time lag.

3.4.2 Features

Even with a fixed w, these subsequences still have high dimensionality - w × k. We further reduce the dimensionality of the subsequences by performing feature extraction.

Definition 6. Feature extraction: Given a multivariate time series subsequence C_p of length w, the feature f_p of C_p is obtained by combining the high dimensional w × k space into an n-dimensional feature vector, where n < w × k, while still preserving its relevant characteristics.

There are many different approaches for feature extraction under the sliding window and high dimensional scheme, such as Principal Component Analysis, Isomap (Tenenbaum et al., 2000), locally linear embedding (Roweis & Saul, 2000), wavelets, as well as simple linear combinations, e.g. statistical mean, median, variance, etc. There are also domain-specific approaches in time series feature extraction, such as event-related potential (ERP) in neuroscience and the Discrete Fourier Transform (DFT) in signal processing. Here we list three important criteria for a good feature set:

1. Reflect the nature of the data, e.g. it is robust, reliable and normalized

2. Capture critical relevancy in order to perform the desired task, e.g. it reflects the trending information to be used to detect and predict failures

3. Reduce dimensionality.

We tested various feature extraction approaches including statistical metrics (variation, standard deviation, mean), PCA, polynomial curve fitting, moving average, moving median, etc. Experimentally we found the feature extraction technique described below to give the best results for failure prediction. This technique satisfies the three criteria by combining domain knowledge, correlation analysis, robust regression, and normalization.

Based on domain knowledge, when a tubing failure happens, i.e. a tubing leak occurs, it causes a significant drop in the load of fluid pumped to the surface, forming the pattern described in Figure 3.4. Other types of failures follow different trends. SMEs refer to dynamometer cards, a commonly used technique that shows the dynamic relationship between load and stroke length. The experts also check the short-term and long-term performance of a well, including its daily runtime, pumping cycles and so on, to analyze the trend of the well's performance. In our approach, we do not directly use the card; instead, we extract its area, peak value and minimum value. The domain system records one dynamometer card per day per well, so we have one such set of values per well per day, assuming that it is a fairly good representation of the performance for that entire day.

After collecting these raw daily data, we apply our feature extraction algorithm to extract the trending information that best represents a failure. The raw data changes frequently and does not follow any obvious stochastic process patterns.
We designed our own criteria for representing the trends using medians. For our problem, a global trend and a local trend are most useful for deciding how much the trend changes. Therefore, in order to capture both long-term and short-term trends, we allow two subsequences within a single sliding window - one bigger subsequence for preserving the global trend, and one smaller subsequence for capturing the local trend, as mentioned in the correlation analysis earlier in this section.

Algorithm 1 Extract Feature
Input: Time series T_i for U_i ∈ U, dual step sizes [step_1, step_2], global normalization size D
Return: Features {F_i}
Initial: F_i = ∅
Segment T_i according to its failure events
for each segment do
  for each attribute a_j within this segment do
    Slide from the beginning with fixed window size step_1 + step_2
    Linearly interpolate the missing data in each window
    Extract trends from each window by
      m = median({a : previous D records})
      m_1 = median({a : initial step_1 records})
      m_2 = median({a : next step_2 records})
      f_ij = [m_1 + ε, m_2 + ε] / (m + ε)
      (ε is a small constant guarding against division by zero)
    F_ij = F_ij ∪ f_ij
  end for
  F_i = F_i ∪ F_ij
end for
return F_i

Algorithm 1 describes the feature extraction logic. One thing to mention is that because wells might have their configuration changed after each failure event, it is unreasonable to compute correlations across two different configurations, which may result in erroneous inferences. Therefore, we initially segment each well's records by all workover events; if there is an event, the extracted features do not cross between two configurations. We use the robust statistic median for performing the dimension reduction task, so that it is not biased by data spikes. A sketch of this extraction follows.
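Below is a minimal sketch of the dual-window median-trend extraction of Algorithm 1 for a single attribute, assuming the per-well series arrives as a pandas Series indexed by date; the epsilon guard and the linear-interpolation call are illustrative choices.

    import numpy as np
    import pandas as pd

    def median_trends(series: pd.Series, step1=7, step2=7, D=30, eps=1e-6):
        """Slide a (step1 + step2)-day window over one attribute and emit
        normalized long-term (m1) and short-term (m2) median trends."""
        # Linear interpolation stands in for the missing-data handling in Algorithm 1.
        x = series.interpolate(limit_direction="both")
        features = []
        for start in range(D, len(x) - step1 - step2 + 1):
            m = np.median(x.iloc[start - D:start])                        # previous D records
            m1 = np.median(x.iloc[start:start + step1])                   # initial step1 records
            m2 = np.median(x.iloc[start + step1:start + step1 + step2])   # next step2 records
            features.append(((m1 + eps) / (m + eps), (m2 + eps) / (m + eps)))
        return features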
3.5 Machine Learning

The challenges for this module come in two major parts:

1. Training set construction: how to take advantage of sparsely given failure logs, combined with domain knowledge and machine learning algorithms, to construct a training dataset that generalizes to all future data.

2. Modeling and optimizing the classifier: how to model the relationship between features and labels mathematically, and how to optimize it.

Here we face a multiclass classification problem, rather than the binary classification used in most anomaly detection, because there are different failure modes and SMEs believe that they can and should be distinguished. We use general supervised learning algorithms to predict failures and evaluate our results using real datasets from actual fields. In the training and testing process, we initially use w = 14 days as our subsequence length, D = 1 as the polynomial coefficient, and [7, 7] days for the step sizes. For each well failure, we extract one subsequence sample that immediately precedes the failure date, and label it as failure. For each normal well, we extract all possible subsequences, and label them as normal. In our case, the number of normal samples is substantially higher than the number of failure samples, so an overall error rate over both failure and normal samples can be misleading.

We employ two ways to evaluate failure prediction accuracy: k-fold cross-validation and a separate test set. The classification algorithms that we used were the alternating decision tree (Freund, 1999), Bayesian network (Buntine, 1996), and support vector machine (Chang & Lin, 2011). However, we can technically use any multiclass classification algorithm that can accommodate a numerical dataset.

Our first classification algorithm is the Alternating Decision Tree (ADTree). An alternating decision tree consists of decision nodes and prediction nodes. Decision nodes specify a predicate condition; prediction nodes contain a single number. ADTrees always have prediction nodes as both root and leaves. An instance is classified by an ADTree by following all paths for which all decision nodes are true and summing the prediction nodes that are traversed. In our case, a negative weighted sum indicates a potential failure happening in the future. ADTree is unlike a traditional classification decision tree such as C4.5 (Quinlan, 1993), in which the prediction follows a single path to a final decision. We use ADTree to help us study the difference between subsequence lengths, as well as between the A, B and C ranked attributes. From Table 3.1 we can observe that the shorter the subsequence, the more accurate the cross-validation. We can also see that it does not make much difference if we remove the lower-ranked attributes - B and C. In other words, the results support the experts' ranking.

Table 3.1: Error rates at various delays based on cross-validation using ADTree.

          Monthly       Biweekly      Weekly
          All    AB     All    AB     All    AB
Failure   0.45   0.54   0.45   0.45   0.09   0.09
Normal    0.16   0.16   0.21   0.21   0.10   0.10
Overall   0.27   0.30   0.30   0.30   0.10   0.10

ADTree has the advantage of visualizing the outcome as a set of rules encoded in a decision tree format; the tree can be effectively validated by SMEs. Similar visual output can also be achieved with a Bayesian network that involves a structural learning process, which, instead of rules, presents the hidden dependency relationships among different features.

The Support Vector Machine (SVM) is one of the most popular classification methods in machine learning. A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class, known as the functional margin, since in general the larger the margin, the lower the generalization error of the classifier. For the dataset using a half-month subsequence, we observe that SVM performs the best for both failure and overall predictions.

We also used a Bayesian network on our data - in particular the simplest Bayesian model, the naive Bayesian network, in which all variables depend only on the class variable and not on each other. Bayesian networks are represented by directed acyclic graphs. The graph nodes represent random variables in the Bayesian sense, i.e. they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected by edges are conditionally independent of each other. Each node is associated with a probability function that takes as input a particular set of values of the node's parent variables and gives the probability of the variable represented by the node. This offers great potential to unveil the inter-relationships between attributes if further models are developed. Table 3.2 shows the 10-fold cross validation results with the three classification algorithms.

Table 3.2: Error rates of different classifiers using 10-fold cross-validation.

          ADTree   SVM     Naive Bayes
Failure   0.060    0.017   0.128
Normal    0.001    0.002   0.003
Overall   0.029    0.026   0.008

So far we have introduced potential ways to construct the prediction model; a sketch of how such a classifier comparison might be run is shown below.
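As an illustration, the following sketch compares classifier families by 10-fold cross-validation with scikit-learn. Since ADTree is not available there, a standard decision tree stands in for it; the feature matrix X and label vector y are assumed to come from the extraction step above.

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB

    def compare_classifiers(X, y):
        """10-fold cross-validation over three classifier families."""
        models = {
            "decision tree": DecisionTreeClassifier(),   # stand-in for ADTree
            "SVM (RBF)": SVC(kernel="rbf"),              # handles multiclass via one-vs-one
            "naive Bayes": GaussianNB(),
        }
        for name, model in models.items():
            scores = cross_val_score(model, X, y, cv=10)
            print(f"{name}: mean accuracy {scores.mean():.3f}")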
But as we learned more about the problem, we discovered that those "prepared" algorithms can be further improved in multiple ways - from training set construction to model training. There are several factors that affect the performance of an artificial lift system, e.g., fields, formations, well types, etc. Therefore, even though the training process shows excellent accuracy, it is not always true that the prediction process is also accurate. Moreover, the labeling work depends heavily on domain experts' efforts; in the dataset used for this section, only a very small portion of the samples is labeled. In Chapters 4 and 6 we will present work that provides both deeper understanding and solutions on a much larger dataset.

3.6 Evaluation

All the results presented so far are cross-validation results. That is, we know the labels and simply try to find classifiers that best fit the training data. Given only a very small portion of labeled data, the real challenge is to apply the trained model to the real, daily-updated data to see how many failures we are able to predict ahead of time, as well as how many false alerts our model generates. Because real-time failures may not have timely failure confirmation, a rational way to evaluate how well a model performs is to rely on a "future" dataset. That is, we separate the dataset into "training", "validation" and "prediction" segments.

The training segment is the time range in which all failures have happened and were confirmed. The validation segment is the time range in which all failures have also been confirmed, but which is strictly later than the training segment. The prediction segment is the most recent time range, in which most failures, even if they happened, are not confirmed yet.

The evaluation module uses the validation segment. And because we are targeting failures in the future, our definitions of precision and recall follow the "predictive" interpretation.

Generally, the most commonly used metric for evaluating an event prediction system is the confusion matrix, shown in Table 3.3. Each failure prediction belongs to one of four cases. If the prediction algorithm predicts an upcoming event, which is called a positive, it raises an alert; this decision can be right or wrong. If the failure is confirmed in reality, the prediction is a true positive (TP); if not, a false positive (FP). If the prediction indicates that a system is running well but a failure happened, the prediction is a false negative (FN). If nothing happens - no prediction and no true failure - it is a true negative (TN) (Salfner et al., 2010).

Table 3.3: Confusion matrix and prediction-related terminologies.

                   True event                           True normal
Predicted failure  True positive (TP, correct alert)    False positive (FP, false alert)
No alerts          False negative (FN, missing alert)   True negative (TN)

The evaluation of our system differs significantly from most previous methodologies in the literature because of the uniqueness of our time-oriented failure prediction problem. We have to handle failure dates that were not accurately recorded. Moreover, even false positive predictions are difficult to establish, because we cannot be certain that a prediction is truly false - it could be a failure in the future, which is precisely the value of our methodology. Therefore, we mainly aim at predicting failures rather than normal operation. Maintaining a low false alert rate, that is, high precision and recall for failures, is favorable.
In cases where the system is not able to identify a failure pattern, there is nothing we can do unless a more useful parameter is introduced in the future. Our evaluation is described in Figure 3.8. For a failing well, its recorded failure is marked by the red box; this is when the field specialist detected a failure and recorded it in the database. The black box shows the day the true failure begins, and the grey boxes are the pre-signals that happen prior to the failure. The white boxes are normal time, where no failure or pre-failure signals existed. A failure prediction is true only if it is within D days of the actual recorded failure. This process is carried out on each testing well: failure wells whose failures are successfully predicted are considered true positives; normal wells with failure alerts are false positives; wells whose failures are not predicted ahead of time, or not predicted at all, are considered false negatives; and a normal well with no failure predicted is considered a true negative.

Figure 3.8: Failure accuracy evaluation logic.

It is always unwise to produce an overwhelming number of alarms. Moreover, because failures are rare compared with the massive amount of normal data, a good prediction model should predict as many failures as possible while not raising too many false positives. At the same time, among all predicted alarms, we want as many of them as possible to actually happen in the future. We therefore use the following evaluation criteria:

Precision: the ratio of truly predicted events over all predicted events.

    precision = TP / (TP + FP)

Recall: the ratio of correctly predicted events to the number of true events.

    recall = TP / (TP + FN)

These two metrics serve as the major evaluation criteria for prediction models. For this dataset, our prediction results using ADTree and SVM are shown in Table 3.4.

Table 3.4: Failure prediction on 33 wells.

          ADTree                         SVM
          Failure predicted  No alerts   Failure predicted  No alerts
Failure   1                  5           5                  1
Normal    5                  22          10                 17

So for the basic prediction engine, SVM works better, though it also produces 5 more false positives. A sketch of this well-level evaluation logic is given below.
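The following is a minimal sketch of the well-level evaluation just described, assuming alert dates and recorded failure dates are available per well as Python date objects; the D-day matching window mirrors Figure 3.8, and the default of 100 days is illustrative.

    def evaluate_wells(alerts, failures, D=100):
        """Well-level precision/recall: alerts and failures map well_id -> list of dates.
        An alert counts as correct if it falls within D days before a recorded failure."""
        tp = fp = fn = tn = 0
        for well in set(alerts) | set(failures):
            a, f = alerts.get(well, []), failures.get(well, [])
            if f:  # failure well: predicted in time, or missed
                hit = any(0 <= (fail - d).days <= D for fail in f for d in a)
                tp, fn = tp + hit, fn + (not hit)
            else:  # normal well: any alert is a false positive
                fp, tn = fp + bool(a), tn + (not a)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall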
3.7 Alert Visualization

3.7.1 Day-level Visualization

We developed our own visualization of all parameters, implemented on top of JFreeChart 3. It includes original values and extracted trends together with well ID and dates, as displayed in Figure 3.9. This is done per field, and various initialization conditions are used to filter the well list displayed in the tool, e.g. all wells that we put in the watch list, or all wells for which we predict failures within a specific date range. Figure 3.9 shows CardArea, MinSurfLoad and DailyRunTime records between Jan 2012 and Aug 2012. The bottom chart shows multiple statuses, including workovers (top row), downtime (middle row), and our predictions (bottom row). Attributes can be customized in the Attributes menu.

Figure 3.9: SEA day-level visualization.

3 www.jfree.org/jfreechart

3.7.2 Watch List Generation and Visualization

This section discusses how to handle the alarms in the testing phase, when ground truth is unknown but a good prediction model has been established. Given the prediction results of a well, we trace back over a given number of days d to find the earliest date on which the alarm begins. Then the statistics of the failure prediction types, as well as the features, are returned as in Table 3.5.

Table 3.5: Watch list in a collaborative format.

Well ID  Alert Date  Failure Type   Description                                               SME Confirm Date
1        7/02/2012   100% leaking   CardArea is down; MinSurfLoad is up; DailyRunTime is up;

These alerts can be provided on a daily basis for real-world applications if combined with a geographical representation, as shown in Figure 3.10 4.

4 Google Earth: earth.google.com

Figure 3.10: SEA watch list in a geographic representation.

3.8 Knowledge Management

The goal of SEA is to assist real-world decision-making rather than to completely replace human experts. Knowledge management is not only about managing gained knowledge in a smart way, but also about using knowledge for better system performance and decision-making assistance. All other modules should be heavily involved with knowledge so that the data is not utilized "blindly".

One of the most important long-term tasks for failure prediction is preventive maintenance. However, due to uncertainty, data quality, model capability and lack of ground truth, each decision is still risky even if the prediction engine is well trained and able to predict events accurately under the evaluation mechanism. In the petroleum industry, where an existing model is able to predict downhole failures of rod pumping units, SMEs still have to be involved to verify the correctness of each predicted alert. Once a true failure decision is made, the rod pump has to be shut down and a detailed diagnosis is scheduled. This procedure usually takes more than two weeks. If it turns out to be a false alarm, the total cost of lost production from downtime and well diagnosis is expensive. The decision process includes empirical knowledge from many factors that are difficult to consider during the prediction phase, including the mechanical configuration, failure history, scheduled maintenance not caused by failures, spatial-temporal state transitions beyond the predictions, etc. Taking advantage of such knowledge requires a more flexible model, capable of learning from the empirical relations and inferring confidence levels.

We have proposed SEA Knowledge Management (SEA KM), which follows the case-based reasoning process of retrieve, reuse, revise and retain (Plaza et al., 2005). Under this process, interpretable knowledge, especially the well-work logs that are relevant to failures, is managed for reuse. SEA KM brings the relevant knowledge from the knowledge base, where historical knowledge is well formed and stored. Each time SEA predicts or detects an abnormal event, e.g. rod cut, SEA KM retrieves the top k nearest cases from the knowledge base according to the features used in the model and the type of event, combined with static parameters such as rod pump system configuration, field name, type of reservoir, etc. Then SEA KM applies its self-defined criteria to choose one of the k options, and extracts its relevant historical date range, key attributes, root cause analysis and best-practice response actions. If the retrieved knowledge is an exact match and only requires trivial changes, there is no need to revise it. However, if a different or more generalized solution arises from the current case, it is necessary to revise it so that it better suits the current problem. The revised case is then filtered and formatted through the retain process and stored in the knowledge base. Figure 3.11 illustrates the application of this routine to a well's prediction output.
A ranked list is produced for each alert, according to its features compared with the existing training dataset. In Chapter 7, we will present a confidence level model for knowledge management.

Figure 3.11: An example that illustrates how case-based reasoning from the SEA KM module is used.

3.9 Conclusion

In this chapter, we presented our Smart Engineering Apprentice system for failure prediction for rod pump artificial lift systems. We divided the SEA system into several modules, namely Data Collection and Preparation, Feature Extraction, Machine Learning, Evaluation, Alert Generation and Visualization, and Knowledge Management. For each module, we described the challenges and provided basic ways to solve them. We claim that the SEA system would not fully function with any module excluded. Further, during the implementation process, domain knowledge should be carefully considered. In the following chapters, we take a deeper dive into the core of the SEA system - the prediction engine. The key focus is on identifying the shortcomings of the basic supervised learning approach and on how to improve existing machine learning algorithms so that they can make better predictions.

Chapter 4

Semi-supervised Failure Prediction

4.1 Background

Failure prediction for rod pump artificial lift systems poses difficult challenges to data mining with respect to high dimensionality, noise, inadequate labeling, and heterogeneity between wells and between fields.

High Dimensionality: The data is inherently high dimensional. In this dataset we have 14 attributes, and each attribute is measured daily; for a hundred days of data, the dimension is 1400. The POC controllers gather and record periodic well sensor measurements indicating production and well status through load cells, motor sensors, pressure transducers and relays.

Noisy Data: Well measurement artificial lift datasets tend to be very noisy. The noise comes from multiple sources, both natural and human-caused. The wells operate in rough physical environments, which tend to cause equipment to break down. For example, lightning strikes can sometimes disrupt the wireless communication networks, so that the data collected by POC sensors is not sent to the centralized logging database, resulting in missing values. Also, petroleum engineering field workers sometimes perform maintenance and make calibration adjustments to the equipment. These maintenance activities and adjustments cause the sensor measurements to change. Workers usually diligently log their work in the downtime and workover database tables; occasionally a log entry is delayed, or not logged at all. For example, workers occasionally recalibrate the POC sensors. Such recalibration may introduce drastic changes to sensor readings; nevertheless, it is not standard practice to record recalibration actions. Another source of data noise is the variation caused by the force drive mechanisms. In oil fields with insufficient formation pressure, injection wells are sometimes used to inject water (or steam or carbon dioxide) to drive the oil toward the oil production wells. The injection rate can affect the POC sensor measurements. However, based on the advice of petroleum engineering SMEs, we decided to ignore injection parameters to avoid making the problem too complex.

Inadequate Labeling: The data set is not explicitly labeled. Manually labeling a data set is tedious, and our access to busy SMEs is limited.
Fully automatic labeling is problematic. Although the well failure events are recorded in the well database, they are not suitable for direct use because of semantic differences in the interpretation of well failure dates. Well failure dates in the database do not correspond to the actual failure dates, or even to the dates when SMEs first noticed these failures. The recorded failure dates correspond to the dates when the workers shut down a well to begin repairs. Because of the backlog of well repair jobs, there can be several months between the actual failure date and the recorded failure date. Moreover, even if the exact failure dates are known, we still need to differentiate among normal, pre-failure and failure signals.

Well and field heterogeneity: Given the various factors that affect the performance of an artificial lift system, e.g., fields, formations, well types, etc., even though the training process shows excellent accuracy, it is not always true that the prediction process is also accurate. Furthermore, only a very small portion of the data is labeled by SMEs.

This is an important problem in the petroleum engineering domain. To the best of our knowledge, this petroleum engineering problem had not been previously explored by data mining researchers. The existing SEA system that we presented in Chapter 3 provides basic solutions for data collection and preparation, feature extraction, the prediction engine, alert generation and visualization, and knowledge management. The heart of the SEA system is its prediction engine. Via the initial data preparation and feature extraction, the data noise and dimensionality are reduced significantly. Thus our target is to improve the prediction engine. In this chapter we present a semi-supervised failure prediction algorithm that both improves labeling efficiency and makes the prediction engine more systematic and robust: 1) it applies unsupervised learning to assist labeling, and 2) it presents a random peek semi-supervised learning algorithm to deal with inadequate labeling.

4.2 Algorithm

In this section, we focus on a methodology to predict failures on multiple noisy multivariate time series. Given very sparse label information, such as the failure date, we wish to learn the failure patterns as automatically and optimally as possible. Note that the recorded failure event dates may be far from accurate, so existing failure prediction or event detection frameworks are difficult to fit to this problem. The assumption throughout this section is that most of the time a rod pump is functioning normally.

4.2.1 Labeling

Our dataset is not explicitly labeled. Manual labeling as in our basic approach can be problematic because of the limited availability of SMEs, as well as imprecise empirical labels that might introduce more noise in the process. Based on our previous knowledge, several "clear" failure examples showed clear descending trends before the failure happened. Thus, given such a failure, we may take advantage of unsupervised learning, a.k.a. clustering, to statistically separate the time series. Our clustering is applied to individual wells, not across them (e.g. clustering among two wells), because the variation across wells can be large, and clustering across wells tends to generate uninteresting clusters that do not relate to failures. Several clustering techniques could be applied to label the multivariate time series data. We cluster by considering all the attributes as relevant, using the Expectation Maximization (EM) algorithm.
The EM algorithm assumes that the data is generated from a Gaussian mixture - here we assume that each Gaussian distribution represents a failure phase - normal, pre-failure, failure - the development progression of all predictable failures. Here, the observed data is F_i, a complete failure case from normal up to its specific failure date, with log-likelihood ℓ(θ; F_i, Z_i) depending on parameters θ = {θ_normal, θ_prefailure, θ_failure}, which reflect the parameters of three unknown joint Gaussian distributions. In the log-likelihood, Z_i represents the latent data or missing values, namely the assignment of each record in F_i to one of the three distributions. Thus our labeling process can be formulated as a maximum likelihood estimation problem, solved using the EM procedure:

E step: compute Q(θ' | θ^(i)) = E[ℓ(θ'; F_i, Z_i)] as a function of the dummy argument θ'.

M step: determine the new estimate θ^(i+1) = argmax_θ Q(θ | θ^(i)).

We correlate the clustering results with timing information, as in Figure 4.1. Because the actual cluster assignment might not reflect the real failure process, we need to be cautious about depending entirely on clustering results for the final labeling. We choose to consult SMEs in order to confirm and adjust our analysis by tuning the labels accordingly. The results can be used to identify the failure range with clustering, which combines all the trends to distinguish among normal, pre-failure and failure signals. The clustering results are plotted with respect to the phases that develop all the way to failure - the level differences of the phase curve. This allows an SME to visually inspect the labeling proposed by the clustering, and to confirm the true labeling.

Figure 4.1: Accurate labeling under clustering.

A minimal sketch of this per-well clustering step follows.
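The sketch approximates the EM labeling step with scikit-learn's Gaussian mixture implementation; the three components stand for the normal, pre-failure and failure phases, and the ordering heuristic (dominant early cluster = normal, cluster holding the failure date = failure) is an illustrative simplification of the SME review described above.

    from sklearn.mixture import GaussianMixture

    def propose_phase_labels(X):
        """X: (n_days, n_features) array for ONE well, ordered by date and ending
        at the failure date. Returns a per-day cluster assignment plus a guess at
        which cluster is 'normal' and which is 'failure'."""
        gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
        assignments = gmm.fit_predict(X)
        early = assignments[:len(X) // 4].tolist()
        normal_cluster = max(set(early), key=early.count)  # dominant early cluster
        failure_cluster = assignments[-1]                  # cluster at the failure date
        return assignments, normal_cluster, failure_cluster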
4.2.2 Training Selection

Our objective for training set selection is to focus on the labeling of a few "good" wells. Although our assisted-labeling methodology greatly reduces the time required, we still want to maximize the value provided by the SMEs. By "good" we mean a failure well with clear trending signals leading the well from normal to pre-failure signal modes, and then to failure signal modes. The duration of these trending signals may sometimes last for more than half a year.

Once the training wells are suggested by the SMEs and labeled under our assisted-labeling process, we start constructing the training set. We developed a methodology to iteratively enhance the training set. Due to the time-consuming process of interacting with SMEs, we pick the training set carefully following a bootstrapping process: start with a small set of failure cases that have "good" trend signals. Then we iteratively add false negative samples into the training set until the failure recall converges. Once the maximum number of failures can be predicted, we begin to introduce false positives into the training set until the failure precision is maximized, while still maintaining the failure recall within an acceptable threshold. Algorithm 2 shows our training selection mechanism.

Algorithm 2 Training Selection
Input: Failure set X, normal set X', thresholds θ, ε
Return: Training set L
Initial: L = ∅
while (TP/(TP + FN) < θ or Δ(TP/(TP + FN)) < ε) do
  X_i = argmax_{X_i ∈ X} f(X_i), where f(X_i) = {TP/(TP + FN) | TrainAndTest(L ∪ X_i)}
  L = L ∪ X_i
end while
while (TP/(TP + FN) > θ and Δ(TP/(TP + FP)) < ε) do
  X'_i = argmax_{X'_i ∈ X'} f(X'_i), where f(X'_i) = {TP/(TP + FP) | TrainAndTest(L ∪ X'_i)}
  L = L ∪ X'_i
end while
return L

4.2.3 Semi-supervised Classification using Random Peek

It is critical to keep the number of false positives as low as possible, because for each alert, if a failure decision is made, well work has to be issued to stop the well for a full inspection; this whole process requires costly labor and downtime. In our dataset the small number of labeled samples causes high false positive rates. This is because each well behaves differently enough from the others that the space of normal behaviors is not fully covered by our training examples. In order to capture the individual knowledge of the testing wells, we propose an approach called random peek.

Random peek is inspired by semi-supervised learning (Zhu & Goldberg, 2009). Typically, semi-supervised learning algorithms assume some prior knowledge about the distribution of the data set that is able to help increase accuracy. In our case, failure is always a rare event - fewer than 70 failures within 480 days among over 350 wells. Therefore we are confident in assuming that a well is functioning in normal condition most of the time, so most of the unlabeled samples should be normal. Under this assumption, by splitting the data into two clusters in its feature space, the bigger cluster is most likely the one containing normal subsequences - we cannot guarantee that the smaller cluster represents the failure cases, because not all wells have failures, as illustrated in Figure 4.2.

Figure 4.2: Random peek illustration.

Before testing on an individual well, its random peek helps tune the classification boundaries by learning its normal behavior, according to Algorithm 3.

Algorithm 3 SSL using Random Peek
Input: Training set L, testing set X
Return: Classification result Y
Initial: Y = ∅
for each testing well X_i ∈ X do
  Cluster X_i into C_1, C_2, where |C_1| > |C_2|
  Take the centroid Center_1 of C_1 and label it as normal
  Update the training set L_i = L ∪ {Center_1} and train classifier f_i
  Test on X_i to get Y_i = f_i(X_i)
  Y = Y ∪ Y_i
end for
return Y
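A minimal sketch of Algorithm 3 follows, using k-means for the two-way split and an RBF-kernel SVM as the base classifier; treating the larger cluster's centroid as one extra "normal" example mirrors the random peek assumption stated above.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def random_peek_predict(X_train, y_train, X_well):
        """Classify one well's subsequences after 'peeking' at its dominant cluster."""
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_well)
        sizes = np.bincount(km.labels_, minlength=2)
        big = int(np.argmax(sizes))                 # larger cluster, assumed normal
        centroid = km.cluster_centers_[big]
        X_aug = np.vstack([X_train, centroid])
        y_aug = np.append(y_train, "normal")        # assumes string class labels
        clf = SVC(kernel="rbf").fit(X_aug, y_aug)
        return clf.predict(X_well)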
4.3 Evaluation

Our SSL failure prediction algorithm is evaluated using a dataset collected from a specific oil field, comprising 391 rod pump wells with one and a half years of records (09/2009 - 02/2011) per well. A total of 65 rod pump failures occurred in 62 wells. We considered 12 attributes relevant to failure signatures, based on the features extracted from dynamometer cards listed in the introduction. Through the data preparation process we guaranteed that the dates form a consecutive sequence for each well. Since some of the events were recorded after the well was down, in order to better evaluate our prediction algorithm, we shifted such events to the most recent working date - the exact day the well failed. On the SMEs' suggestion, we picked 8 failure wells with "good" data for training - clear failure trends. We also picked some normal wells with no previously known failures, for failure precision correction purposes.

Table 4.1 contains the cross-validation results using different classification algorithms. The cross-validation is done at the sample level, not at the well level. In the model selection process, we used 10-fold cross-validation and picked the parameter configuration with the highest accuracy.

Table 4.1: 10-fold cross validation accuracy.

Accuracy   Decision Tree   SVM     Bayes Net
Failure    0.916           0.943   0.939
Normal     0.990           1.000   0.973
Overall    0.970           0.985   0.964

The results demonstrate that the support vector machine is the best option, providing the highest cross-validation accuracy for both failure and normal examples. Therefore we used an SVM with a radial basis kernel as our final classifier, and fixed the same parameters for the later semi-supervised learning tasks. Once the model was fixed, we tested it on all 391 wells over all time periods. Table 4.2 lists the confusion matrix for the prediction results, based on the evaluation algorithm illustrated in Figure 3.8.

Table 4.2: Confusion matrix for testing on a single field using 8 failure wells.

                 True failure   True normal
Predict failure  52             72
Predict normal   13             254

In the confusion matrix, the recall for failure is 80.0% while the precision for failure is 41.9%. This means that even though we capture 80% of the actual failures, over 50% of the alerts are probably false predictions. Another explanation is that the 72 false positives might contain some issues that showed failure patterns but were not discovered by the SMEs. From another perspective, regarding normal predictions, if our algorithm claims that a well is normal, then with 95.1% confidence the well is functioning normally.

4.4 Overfitting

Overfitting occurs when the model specializes on noise in the data instead of on the underlying concept. To assess the possibility of overfitting, we initially applied standard 10-fold cross validation on the training set. However, we consistently noticed
We never place examples from the same well in both the training set and the testing set. See Table 4.3 for the dierence between the two validation approaches. As seen, the cross-validation by leave-one-well out gets much lower accuracy than leave-10% samples-out. This proves our assumption. From another perspective, Table 4.3 also proves that our training wells are exclusive - representing dierent failure patterns. 54 4.5 Conclusion In this chapter, we presented our semi-supervised failure prediction algorithm based on random peek and support vector machine under the assumption that each rod pump would be normal most of the time. An assisted labeling mechanism was also discussed that displays the clustering results in a temporal order so that it would better help SMEs eciently identify the dierent labels, which speeds up the labeling process signicantly. We tested this approach on a specic eld with 391 rod pumps. Albeit overtting exists, 80% recall and 41.9% precision were achieved. SEA with the algorithms described in this chapter has been successfully deployed in real-world applications and achieved great value to the stakeholder. 55 Chapter 5 Global Model for Failure Prediction 5.1 Background Our prior work described in Chapter 4 used machine learning techniques to generate high quality failure prediction models with good accuracy. However, it suers from two major drawbacks. 1. First, it uses traditional machine learning techniques that require labeled datasets for training the models. Generating these labeled datasets is labor-intensive and time-consuming. 2. Second, this model is eld-specic which is only applicable to the specic eld from which the labeled dataset is derived. Field-specic models generally per- form poorly when applied to other elds because of the dierences in the data characteristic caused by eld geology, operational procedure, etc. Moreover, these models have to be maintained independently, which accordingly raises nontrivial maintenance costs. In this chapter, we present a generalized global model for failure prediction that works across multiple rod pumps located across multiple elds. Machine learning based labeling approach that involves clustering and rule-based ltering is used to automate the labeling process. Integration of training sets across multiple elds showed that this 56 labeling approach is eective. Our experimental results show that the precision and recall for failure predictions are good. Furthermore, one global model can be employed for predicting failures for rod pumps in multiple elds while reducing maintenance cost. 5.2 Attributes and Features Based on our previous study we discovered that the trends of the 4 most reliable at- tributes are the key indicators for potential anomalies, plus two calculated features: 1. Existing attributes from LOWIS for trends extraction: card area, peak surface load, minimum surface load, daily runtime; 2. Calculated attributes: (a) card unchanged days: If all card measures (area, peak and min load) reject to change, this integer eld keeps increasing with missing dates considered as unchanged. (b) daily runtime ratio: percentage of runtime with respect to a full 24 hour running. Combining the trends and calculated attributes, we can formulate failure prediction as a machine learning problem that given a new feature set of a new date from a well, the output is whether the well is staying normal, having potential failure, or is failing. 
We discovered that there is correlation between unchanging features with potential failures. The observation is based on seeing many sudden failures after episodes of unchanging features. It is natural for daily data to uctuate because of real-world uncertainties that govern complex systems. According to reliability studies of many elds (Scheer et al., 2009), because models of complex systems are usually not accurate enough to predict reliably where critical thresholds may occur, it would be useful to build a set of reliable statistical procedure to test whether the environment is in a stable state, or at a critical tipping point in the future. This \statistical procedure" 57 Figure 5.1: CardUnchangedDays and RuntimeRatio correlate to sudden failures. for our problem, that exhibits the statistics for measuring system reliability, is the number of days that reliable critical attributes do not change (CardArea, PeakSurfLoad, MinSurfLoad). Here, by unchanging we mean values not changing by a single digit. Figure 5.1 is an example of this situation. The system has been shown experiencing \unreliable" states when the CardUnchangeDays accumulates between March, 2012 and April, 2012. This is followed by a critical downfall of CardArea which ultimately led to a Tubing Failure. In many cases, this \unchanged days" marks either POC problems that fail to read the parameters correctly or the actual situation when the rod pump is problematic. RunTimeRatio is another parameter that we added for prediction. In our previous version, trends of DailyRunTime are used. However the ratio of the runtime with regards to 24 hours/day is also relevant to a failure. From Figure 5.1, because of the sliding window process, we can observe that the trend \Daily Runtime Short-term Trend" has not reached the tipping point around early May, 2012 when the system has already almost stopped functioning. RunTimeRatio captures this information in its 58 earliest stage. So by including this feature we are expecting our prediction to be slightly \earlier" compared with previous predictions. 5.3 Labeling and Model Construction Labeling for training data is one of the key components of failure predictions. In sta- tistical learning, the assumption is that there is one or more \generative distributions" that generates each type of failure and normal examples. The data reveals only the portion of the \observations". Under this theory, we can rely on roughly labeled training data to achieve a reliable model that would work for multiple elds. The eld-specic model's biggest drawback is its expensive training process which prevents it from being adapted in multiple elds. What we have learned is that we are condent with labeling many tubing leak failures that exhibit consistent failure trends, but less condent for the failures without such trends. For such types of failures, we can label using a rule-enhanced statistical cluster- ing algorithm. For rod pump failures, because of its various factors of its root causes, we will rely on similar process but with less rule constraints. For prediction together with analyzing it based on a timeline, we are able to identify some historical failures which show a clear normal! pre-failure! failure transition process. In this process, normal should be considered as a major observation. However when it gets closer to failure, pre-failure signatures would show up that statistically dier from normal examples' distribution. However a range of signatures mixed with normal and pre-failure are allowed. 
All signatures nally converge to failures when data is close to the real failure events. Rules are used to constrain the clustering results to follow this process. For failures that fall out of this rule, we will not use them as training set candidates. This training set construction process is done by scanning all existing failure and normal wells, for which the output clusters are labeled as the corresponding pre-failure type, failure type and normal examples as illustrated in the owchart in 59 Figure 5.2. The labeled set, which we call it the training set, is then used to train a multi-class SVM so that it is capable of making prediction given future features. Figure 5.2: Training set construction owchart. The rules used for ltering clustering output are empirical. We assume that if there are less than 100 non-missing signatures, then sample size is too small. When clustering is done, the clusters have to be \discriminative" by their priors so that normal cluster is dominating by taking over 50% of the time, and the failure cluster has to touch the real failure example, while leaving the remaining cluster for pre-failure signatures. This process can be done in parallel, and sampling rules of training wells can be adjusted but here we use pure random order. For normal examples, similar process is applied but instead of 3-class clustering, we use 2-class clustering so that the major class is normal with over 70% distribution, while the smaller one is then discarded as noise. The details of the algorithm are described in Algorithm 4. With the labeled training set, a multi-class support vector machine (SVM) is then trained to learn from all these labeled cases. After the training process, the prediction model can then be evaluated. 60 Algorithm 4 Rule-enhanced clustering labeling algorithm for failure prediction 1: Input: failure vs. normal rate r, number of failure training example n. 2: Output: labeled training setS with r+1 r n training wells' signatures 3: Initial:S =; 4: Collect all failures of the same type from all available elds in a sampling poolP 5: Randomly sample P 2P, [P;f type ] =fX t g n t=1 , where X n is the only non-missing signature before this failure date. 6: if n<D then 7: Goto Step 24 8: else 9: [priors,idx]=EM(P,#clusters=3) 10: Cluster normal =idx max i ;priors(i) ;Cluster failure =idx(n) 11: if f type is Tubing Failure then 12: CardArea = 1 count(idx=Cluster failure ) P i;idx(i)=Cluster failure X i (CardArea2) 13: if CardArea > then 14: Goto Step 24 15: end if 16: end if 17: if Cluster normal 6=Cluster failure and Priors(idx normal )> 0:5 then 18: Cluster prefailure idx that remain assigned 19: S =S[f(X i ;normal)jidx(i) =Cluster normal g 20: S =S[f(X i ;f type )jidx(i) =Cluster failure g 21: S =S[f(X i ;prefail type )jidx(i) =Cluster prefailure g 22: end if 23: end if 24: P =PfPg 25: Repeat Step 5 until failure training reaches n orP =;. 26: Collect all normal wells signatures from all available elds in a sampling poolQ 27: Randomly sample Q2Q;Q =fX t g n t=1 , where X n is the most recent non-missing signature before of the well's valid training range 28: if n<D then 29: Goto Step 37 30: else 31: [priors;idx] =EM(P; #clusters = 2) 32: Cluster normal =idx max i priors(i) 33: if Priors(idx normal )> 0:5 then 34: S =S[f(X i ;normal)jidx(i) =Cluster normal g 35: end if 36: end if 37: Q =QfQg 38: Repeat Step 27 until failure training reachesb n r c or Q =;. 61 Table 5.1: Overall confusion matrix for testing global model on 5 oil elds. 
5.4 Experiments

To evaluate our global model, we used a dataset collected from five actual fields with 1,947 rod pump artificial lift systems. The training data lie between 2010-01-01 and 2011-01-01, and the validation data range from 2011-01-01 to 2012-01-01.

We set the correct prediction threshold to 100 days, which means that if we have a prediction within 100 days of an actual failure, it is a true positive. If a failure prediction is made but the actual failure happens beyond 100 days, it is considered a false positive. For a normal well, any failure prediction alert is also considered a false positive. During our experiments, because some operations other than workovers, such as electrical maintenance, may turn down the rod pump or change the values of the trends, we filtered out alerts produced within 15 days of downtime. For the training set construction, we set the ratio r to 0.8 with 50 failures, which means the training set contains samples from 50 failure wells as well as from 350 normal wells. We used a radial basis kernel for the SVM.

Table 5.1 shows our results for evaluating the global model, from which we can infer that the precision is 65.7% and the recall is 66.4%.

Table 5.1: Overall confusion matrix for testing the global model on 5 oil fields.

                 True failure   True normal
Predict failure  278            145
No alerts        141            1383

We also recorded the field-specific results, including field-specific precision and recall, in Table 5.2. Field 1 has the greatest recall, 88.5%, but it also has a lower precision of 53.5% compared with fields 2, 3, and 4. Field 5 has the lowest precision and recall. We discovered that, rather than failures which exhibit trends before failing, Field 5 has more sudden failures - failures caused by sudden events like rod parted or joint split - than the other fields.

Table 5.2: Field-specific confusion matrix for testing the global model on 5 oil fields.

Field   Prediction   True Failure   No Failure   Precision (%)   Recall (%)
1†      Failure      54             47           53.5            88.5
        No alert     7              392
2       Failure      73             18           80.2            69.5
        No alert     32             190
3       Failure      72             25           74.2            69.9
        No alert     31             271
4       Failure      39             15           72.2            60.0
        No alert     26             193
5       Failure      40             40           50.0            47.1
        No alert     45             337
†: same field used for the field-specific semi-supervised learning model in Chapter 4.

Compared with the field-specific model, which achieved 87% recall and 42% precision, the global model produces even better results: 1.5% higher recall and 11.5% higher precision. In general, because of the generalization that involves training samples from multiple fields, the global model tends to be less "aggressive" than a field-specific model when predicting failures. However, the global model learns from more failures across multiple fields, which makes it adaptive to more failure signatures. This cross-field learning also prevents the global model from generating as many false alerts as the field-specific model. Most importantly, the global model is scalable and can be generalized to more fields.

Figure 5.3 shows a good example of a successful tubing leak prediction. In this figure, we visualize four major attributes as time series aligned by date; the bottom chart shows the downtime records in teal in the middle line, the recorded failure in red for tubing failure in the top line, and the failure predictions for tubing leak in grey in the bottom line. In this example our model successfully began to predict a tubing leak because it recognized the failure trend beginning in mid October 2011, repeated twice more in January 2012. The well then truly failed after the fourth predicted failure because of a tubing failure caused by tubing holes.
Figures 5.4 and 5.5 are also examples of early failure predictions.

Figure 5.3: Good example of a successful prediction that leads to a tubing leak: a signature indicating a tubing leak began to occur in late October 2011, happened another two times, and was then discovered as a tubing failure in mid-February 2012.

Because prediction relies heavily on the dynamics of the data through their trends, if there are no significant trends, the global model or the field-specific model will find it difficult to predict ahead of time. Figure 5.6 is an example of a sudden failure for which no clear trend can be identified by our algorithms. Even the SMEs considered this an impossible prediction task based on these attributes, because they were in a perfectly normal range before failure. Such failures can only be detected after the failure rather than predicted. There are also factors that lead to false predictions, as shown in Figures 5.7, 5.8 and 5.9.

Figure 5.4: Another good example of a successful prediction that leads to a tubing leak: early, less frequent predictions indicate a tubing failure, and then the significant failure event happened.

5.5 Conclusion

In this chapter, we presented a global model for failure prediction for rod pump artificial lift systems. Unlike our prior work, which is expensive for model building and maintenance because of the field-specific constraint, we extend our methodology to learn across multiple fields via an automated labeling algorithm that uses clustering and rule-based filtering. Our results show that the global model produces acceptable results compared to the field-specific model, with higher precision. Rather than developing models for specific fields, which involves a labor-intensive and time-consuming process to label failures, a single global model that automatically builds up the training set with our proposed algorithm can be easily scaled to more fields with significantly lower maintenance cost.

Figure 5.5: Good example of a timely prediction. No clear pre-failure patterns were recognized, but as soon as the data trended towards the failure pattern we were able to identify it and trigger alerts more than half a month earlier than it was found by the field specialist.

Figure 5.6: Sudden failure: hard to predict by trends.

Figure 5.7: After March 2012, the data began to decline, which matches the tubing failure pattern; however, the data went back to a normal level. Then the data again showed that the load dropped significantly, and this time it held that value consistently. This is exactly the signature that we used to train our prediction model, but for this case, because we had no further knowledge by the time our experiment was held, and the date exceeded 100 days, these predictions are considered false alerts.

Figure 5.8: Sudden failure: hard to predict by trends.

Figure 5.9: Sudden failure: hard to predict by trends.

Chapter 6

Weighted Task Regularized Multitask Learning

We have introduced how we construct a support vector machine as our failure prediction engine. Yet we found that it may fall into an overfitting problem, or an over-smoothing problem, that prevents the trained model from generalizing so that it works well for all fields. The regularized multitask learning algorithm (RMTL) (Evgeniou & Pontil, 2004) is a potential approach, but it lacks the flexibility to handle the situation when observations are rare and noisy while task relationships are known.
In this chapter we introduce our multitask learning algorithm, called the weighted task regularized multitask learning algorithm (WRMTL), which provides enough flexibility to solve this problem.

6.1 Background

For problems that can be divided into separate, yet related, tasks, where each task has its own feature distribution, multitask learning has been proven to perform better than traditional single-task learning. Traditional single-task learning approaches tend to fail when the number of training examples for each task is small. The small training sample size under-constrains the search in hypothesis space for appropriate models, resulting in high generalization errors. Under the setting of multitask learning, related tasks can be used to introduce biases to fine-tune the model search. In order to capture this bias information, many approaches in multitask learning have been proposed, including sharing a distance metric, sharing a common feature set, sharing a common latent representation, etc. Regularization has been introduced to solve multitask learning problems, producing better performance than traditional solutions using single-task learning. The approach described in these works takes full advantage of support vector machines by constructing novel kernels that reveal task relations.

Multitask learning is a learning framework which falls into the transfer learning category and requires the same feature representation for all tasks. It tries to learn multiple tasks simultaneously even when they are different. As an active field in machine learning, multitask learning has been adapted to discover the common (latent) features that benefit each individual task.

Back to our original problem - rod pump failure prediction - so far we have come up with a global model that is capable of predicting failures among multiple fields, although all fields use the same predictive model. Due to the heterogeneity problem that we discussed in Chapter 4, smoothing over all fields would yield a mediocre model that does not perform as well as a field-specific model on its specific field. In this chapter, we present our method that properly generalizes the formulation of the regularized multitask learning model so that each field can have an individual model with respect to its heterogeneity relative to the other fields. Specifically, we construct each learning task as a support vector machine classification with different weights that represent its distance to the other tasks. We then prove that the combined form of multiple SVMs still satisfies the multitask learning property, which enables smoother and more flexible task relation assumptions. This MTL can be transformed to a dual form for optimization, similar to how a traditional SVM works. Besides our failure prediction task, we also generated a synthetic dataset to illustrate that our weighted MTL algorithm works better when observations are rare and noisy while task distance relationships are known. Our methodology outperforms existing single-task learning and regularized multitask learning.

6.2 Notation

Given $T$ binary classification tasks, for each task $t$ there are $m_t$ examples $\{(x_{it}, y_{it}) : i \in \mathbb{N}\}$ sampled from a distribution $P_t$ on $X_t \times Y_t$, where $X \subseteq \mathbb{R}^d$ and $Y = \{+1, -1\}$.
The total labeled data is:
$$\{(x_{11}, y_{11}), \ldots, (x_{m_1 1}, y_{m_1 1}), \ldots, (x_{m_T T}, y_{m_T T})\}$$
The goal is to learn $T$ functions $f_1, f_2, \ldots, f_T$ such that $f_t(x_{it}) \approx y_{it}$.

For support vector machines, the functions can be described by a set of hyperplanes so that for each task, the hyperplane separates the positively-labeled points from the negatively-labeled points by maximum margin, either in linear (dot product) space or in a Reproducing Kernel Hilbert Space (RKHS) via the kernel trick.

6.3 Methods Formulation

6.3.1 Problem Definition

For simplicity, we assume that each task contains the same number of examples, and that each function $f_t$ is a hyperplane in linear space, i.e. $f_t(x) = w_t \cdot x$ for $t \in \{1, 2, \ldots, T\}$. $w_t$ represents the hyperplane, and in linear space $w_t \cdot x$ is the dot product of the two vectors. In a support vector machine classifier, $w_t$ is the maximum-margin hyperplane.

In (Evgeniou & Pontil, 2004), the task relation assumption is Gaussian, so that the parameters for each task are independent and identically distributed (i.i.d.). There are two assumptions: 1) i.i.d., and 2) the number of tasks is large, under which the central limit theorem can be applied to the task parameters as random variables to estimate the distribution. But in practice the number of tasks may be small, and the tasks may not have equal weights, e.g. some tasks may be prioritized. When used to analyze the observed correlations, this lack of examples and lack of i.i.d. characteristics may introduce the effect called Simpson's Paradox, or the Yule-Simpson effect (Wagner, 1982). One common approach to handle Simpson's Paradox is to weight each task differently. Thus, rather than the uniform mean used for estimating the Gaussian distribution parameters, we use a weighted mean:
$$w_t = w_0 + \lambda_t v_t \quad (6.1)$$
where $w_t$ is the parameter for task $t$, $v_t$ is the bias of the task, $\lambda_t$ is the weight for task $t$'s bias, and $w_0$ is the weighted mean of all task parameters. If the $\lambda_t$ are all 1.0, this assumption reduces to the Gaussian one. Because each task may not have the same level of distance from the mean, we allow a weight assignment for each task, indicated via prior knowledge, and form the following optimization problem:

Problem 1.
$$\min_{w_0, v_t, \xi_{it}} J(w_0, v_t, \xi_{it}) = \sum_{t=1}^{T}\sum_{i=1}^{m}\xi_{it} + \frac{1}{T}\sum_{t=1}^{T}\lambda_t\|\lambda_t v_t\|^2 + \lambda_0\|w_0\|^2$$
subject to
$$\xi_{it} \geq 0, \qquad y_{it}\, w_t \cdot x_{it} \geq 1 - \xi_{it}$$
where $i \in \{1, 2, \ldots, m\}$, $t \in \{1, 2, \ldots, T\}$.

Here $\lambda_0$ is the regularization parameter for the mean parameter, $\lambda_t$ is the regularization parameter for each biased parameter, and $\xi_{it}$ is the slack variable for task $t$'s $i$th example, which measures the hinge loss of the model on this data point. The regularization parameters are held fixed outside the optimization. From this formulation, the larger $\lambda_t$ is, the closer the corresponding regularization term tends to zero. Thus, if $\lambda_0 \to \infty$, then each task becomes independent because $w_0 \to 0$. If task $t$'s parameters diverge from the other tasks, we can expect its $\lambda_t$ to be small, because it requires a greater bias to deviate from the other tasks. Similar tasks are expected to have greater $\lambda_t$ because they are likely to be used to reveal the average parameters.

6.3.2 Problem Equivalence to Multitask Learning

Lemma 1. The optimal solution for $w_0$ in Problem 1 is:
$$w_0 = \frac{\sum_{t=1}^{T}\lambda_t w_t}{T\lambda_0 + \sum_{t=1}^{T}\lambda_t} \quad (6.2)$$
Proof.
Using Lagrange multipliers, we define the Lagrangian as:
$$\mathcal{L}(w_0, v_t, \xi_{it}, \alpha_{it}) = J(w_0, v_t, \xi_{it}) - \sum_{t=1}^{T}\sum_{i=1}^{m}\alpha_{it}\left(y_{it}(w_0 + \lambda_t v_t)\cdot x_{it} - 1 + \xi_{it}\right) - \sum_{t=1}^{T}\sum_{i=1}^{m}\eta_{it}\xi_{it} \quad (6.3)$$
where $\alpha_{it}$, $\eta_{it}$ are the Lagrange multipliers for Problem 1. Taking the partial derivative with respect to $w_0$, we have:
$$w_0 = \frac{1}{2\lambda_0}\sum_{t=1}^{T}\sum_{i=1}^{m}\alpha_{it} y_{it} x_{it} \quad (6.4)$$
Similarly, for $\lambda_t v_t$:
$$\lambda_t v_t = \frac{T}{2\lambda_t}\sum_{i=1}^{m}\alpha_{it} y_{it} x_{it} \quad (6.5)$$
Therefore
$$w_0 = \frac{1}{\lambda_0 T}\sum_{t=1}^{T}\lambda_t(\lambda_t v_t)$$
Given Equation 6.1, we have $w_t = w_0 + \lambda_t v_t$. Replacing $\lambda_t v_t$, we have:
$$w_0 = \frac{\sum_{t=1}^{T}\lambda_t w_t}{T\lambda_0 + \sum_{t=1}^{T}\lambda_t}$$
From this relation between $w_0$ and $w_t$, we can conclude that $w_0$ is the weighted mean of the $w_t$ only when $\lambda_0 = 0$; otherwise it is not.

Now we can prove that optimizing the single formulation of Problem 1 is equivalent to a multitask learning problem.

Lemma 2. Solving the optimization problem defined in Problem 1 is equivalent to solving the following multitask learning problem:

Problem 2.
$$\min_{w_t, \xi_{it}} \sum_{t=1}^{T}\sum_{i=1}^{m}\xi_{it} + \sum_{t=1}^{T}\rho_{1,t}\lambda_t\|w_t\|^2 + \rho_2\sum_{t=1}^{T}\left\|\lambda_t w_t - \frac{1}{T}\sum_{s=1}^{T}\lambda_s w_s\right\|^2 \quad (6.6)$$
such that for $i \in \{1, 2, \ldots, m\}$, $t \in \{1, 2, \ldots, T\}$:
$$\xi_{it} \geq 0, \qquad y_{it}\, w_t \cdot x_{it} \geq 1 - \xi_{it}.$$
The values for $\rho_{1,t}$ and $\rho_2$ are
$$\rho_{1,t} = \frac{1}{T}\left(1 - \frac{\lambda_t T}{\lambda_0 T + \sum_{t=1}^{T}\lambda_t}\right) \quad (6.7)$$
$$\rho_2 = \frac{1}{\lambda_0 T + \sum_{t=1}^{T}\lambda_t} \quad (6.8)$$

Proof. Based on the form of Problem 1, substituting Equations 6.1, 6.2 and 6.4, we have:
$$\frac{1}{T}\sum_{t=1}^{T}\lambda_t\|\lambda_t v_t\|^2 + \lambda_0\|w_0\|^2 = \frac{1}{T}\sum_{t=1}^{T}\lambda_t\|w_t\|^2 - \left(\frac{\sum_{t=1}^{T}\lambda_t}{T} + \lambda_0\right)\|w_0\|^2 = \frac{1}{T}\sum_{t=1}^{T}\lambda_t\|w_t\|^2 - \frac{1}{\lambda_0 T + \sum_{t=1}^{T}\lambda_t}\cdot\frac{1}{T}\left\|\sum_{t=1}^{T}\lambda_t w_t\right\|^2$$
For the form of Problem 2:
$$\sum_{t=1}^{T}\rho_{1,t}\lambda_t\|w_t\|^2 + \rho_2\sum_{t=1}^{T}\left\|\lambda_t w_t - \frac{1}{T}\sum_{s=1}^{T}\lambda_s w_s\right\|^2 = \sum_{t=1}^{T}(\rho_{1,t}\lambda_t + \rho_2\lambda_t^2)\|w_t\|^2 - \frac{\rho_2}{T}\left\|\sum_{t=1}^{T}\lambda_t w_t\right\|^2$$
Setting the two forms equal term by term, we obtain:
$$\rho_2 = \frac{1}{\lambda_0 T + \sum_{t=1}^{T}\lambda_t}, \qquad \rho_{1,t} = \frac{1}{T}\left(1 - \frac{\lambda_t T}{\lambda_0 T + \sum_{t=1}^{T}\lambda_t}\right)$$
Now the forms of $\rho_{1,t}$ and $\rho_2$ are found, and the equivalence is proved.

6.3.3 Kernels

6.3.3.1 Linear case

In this section, we introduce the dual optimization used to solve the problem and derive the kernel for Problem 1. The dual optimization is standard and is similar to solving common SVM problems. Based on the Lagrangian described in Equation 6.3, we take derivatives with respect to the different variables, substitute the resulting terms back into the function, and finally obtain:
$$\mathcal{L}(w_0, v_t, \xi_{it}, \alpha_{it}) = \sum_{t=1}^{T}\sum_{i=1}^{m}\alpha_{it} - \frac{1}{2}\sum_{t=1}^{T}\sum_{i=1}^{m}\sum_{s=1}^{T}\sum_{j=1}^{m}\left(\frac{T}{2\lambda_t}\delta_{st} + \frac{1}{2\lambda_0}\right)\alpha_{it} y_{it}\,\alpha_{js} y_{js}\; x_{it}\cdot x_{js}$$
Thus if we consider $C = \frac{1}{2\lambda_0}$ and define $\mu_t = \frac{T\lambda_0}{\lambda_t}$, we can then define the kernel as:
$$K_{st}(x_{it}, x_{js}) = (\delta_{st}\mu_t + 1)\, x_{it}\cdot x_{js} \quad (6.9)$$
Once the kernel is defined, the optimization can be done by fitting it into any existing SVM solver.

6.3.3.2 Nonlinear case

We can easily generalize the linear kernel into the nonlinear case using the kernel trick, so that the original features are mapped into a higher-dimensional Hilbert space $\mathcal{H}$:
$$\Phi: X \to \mathcal{H}$$
Then the new kernel function is the inner product of the mapped features:
$$K'_{st}(x_{it}, x_{js}) = (\delta_{st}\mu_t + 1)\langle\Phi(x_{it}), \Phi(x_{js})\rangle \quad (6.10)$$
If we map into a Hilbert space with infinite dimensions, we can replace the inner product in Equation 6.10 with an RBF function, and then apply a common SVM in practice.
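As a concrete illustration of Equation 6.9, the following is a minimal sketch that assembles the WRMTL kernel as a precomputed Gram matrix, which any stock SVM solver can consume (e.g. scikit-learn's SVC with kernel='precomputed'); the function name and interface are assumptions, not the LibSVM-based implementation used later in this chapter.

import numpy as np

def wrmtl_gram(Xa, ta, Xb, tb, mu):
    """Linear WRMTL kernel of Equation 6.9:
    K[i, j] = (delta(ta_i, tb_j) * mu_{ta_i} + 1) * (xa_i . xb_j).
    Xa: (na, d), Xb: (nb, d); ta, tb: integer task indices per example;
    mu: (T,) per-task weights mu_t = T * lambda_0 / lambda_t."""
    base = Xa @ Xb.T                          # plain linear kernel
    same = ta[:, None] == tb[None, :]         # delta_st indicator
    return np.where(same, mu[ta][:, None] + 1.0, 1.0) * base

# Usage sketch with an off-the-shelf solver:
# from sklearn.svm import SVC
# clf = SVC(kernel="precomputed").fit(wrmtl_gram(Xtr, ttr, Xtr, ttr, mu), ytr)
# pred = clf.predict(wrmtl_gram(Xte, tte, Xtr, ttr, mu))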
6.3.4 Generalization of uniform regularized multitask learning

For the uniform regularized multitask learning kernel (RMTL), each task is assigned identical weight, so the ratio between $\lambda_2$ (for $w_0$) and $\lambda_1$ (for $w_t$) determines the task similarities, which can be represented as $\mu = \frac{T\lambda_2}{\lambda_1}$:
$$K_{rmtl}(x_{it}, x_{js}) = (\delta_{st}\mu + 1)\, x_{it}\cdot x_{js} \propto \left(\frac{\delta_{st}}{\lambda_1} + \frac{1}{T\lambda_2}\right) x_{it}\cdot x_{js}$$
For the generalized regularized multitask learning kernel (WRMTL), the task parameters are:
$$K_{wrmtl}(x_{it}, x_{js}) = (\delta_{st}\mu_t + 1)\, x_{it}\cdot x_{js} \propto \left(\frac{\delta_{st}}{\lambda_t} + \frac{1}{T\lambda_0}\right) x_{it}\cdot x_{js}$$
It is clear that our derivation is a superset representation of the uniform version: by setting each $\mu_t$ to the same value, we expect the same output. Rather than $\mu$ in RMTL representing a general task similarity, each $\mu_t$ in WRMTL represents how much each task diverges from the other tasks. A larger $\mu_t$ introduces higher penalties if any of its examples is misclassified during the training stage. A task that exhibits small divergence from the other tasks melts into the optimization process so that every task can learn from it. On the contrary, if we set a $\mu_t$ to be infinite, then the task will learn nothing from the other tasks. Therefore this framework has a very robust characteristic that lessens the effect of outlier tasks.

6.4 Experiments

We divide this section into two parts. We first demonstrate the effectiveness of our algorithm using synthetic data. Then we test it using our rod pump failure prediction data. Our implementation is based on LibSVM (Chang & Lin, 2011).

6.4.1 Synthetic Data

We tested the proposed method using data with task affinity and dissimilarity. We generated bivariate Gaussian distributions for three tasks; each task consists of two Gaussian-shaped classes, and each dimension corresponds to one attribute. We enforce two of the tasks (i.e. $T_1$ and $T_2$) to have similar distributions, while the other one (i.e. $T_3$) possesses different inner relations and class distributions, and could be deemed an outlier. In such a setting, we can show how our weighted method effectively utilizes the affinity among tasks to adjust the hyperplanes so as to improve classification accuracy.

We use the same Gaussian to generate the hyperplanes for $T_1$ and $T_2$, which are straight lines in two-dimensional space, and use another Gaussian for $T_3$. Given each hyperplane (weight vector), we generate two Gaussian classes, one on each side, with large variance so that the classes mix heavily. The number of samples for each class and each task is 300, thus 1800 in total. Figure 6.1 shows the synthetic dataset and the optimal classification hyperplanes for all three tasks.

Figure 6.1: Synthetic dataset: Task 1 and Task 2 are two similar tasks while Task 3 is very different, shown with their optimal classification boundaries - red for Task 1, blue for Task 2 and pink for Task 3.

Given the basic synthetic dataset, we randomly pick an extremely limited number of labeled observations as the training set, say 3 out of 300, and the rest is used for testing. Our rod pump dataset has the characteristic of being very noisy and of having a limited number of labeled data points. A normal SVM classifier is not able to learn an optimal hyperplane from the small number of training points for each task; the resulting hyperplane tends to deviate badly or even be orthogonal to the optimal hyperplane.

We compared our method with the following three approaches: a separate SVM for each task (sSVM), one SVM for all tasks but without a task relatedness indicator (oSVM), and Regularized Multitask SVM (Evgeniou & Pontil, 2004) (RMTL). Our method is abbreviated as WRMTL.
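For reference, a minimal sketch of this synthetic setup follows; the means, variances and the random seed are illustrative choices under the stated assumptions, not the values used to produce Figure 6.1.

import numpy as np

rng = np.random.default_rng(0)

def make_task(w, n=300, spread=2.0):
    """Two heavily overlapping Gaussian classes, one on each side of the
    hyperplane w . x = 0 in two dimensions (n samples per class)."""
    X, y = [], []
    for label in (+1, -1):
        center = label * w / np.linalg.norm(w)   # class mean offset along the normal
        X.append(rng.normal(loc=center, scale=spread, size=(n, 2)))
        y.append(np.full(n, label))
    return np.vstack(X), np.concatenate(y)

w_base = rng.normal(size=2)
w1 = w_base + 0.1 * rng.normal(size=2)   # T1 and T2: similar hyperplanes
w2 = w_base + 0.1 * rng.normal(size=2)
w3 = rng.normal(size=2)                  # T3: an unrelated hyperplane
tasks = [make_task(w1), make_task(w2), make_task(w3)]   # 3 x 600 = 1800 points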
Since all approaches are based on the original SVM with a linear kernel, the SVM parameter C is set to 0.1, on the consideration that a small C stresses the optimization with regard to the weight vectors and avoids similar results across all methods due to over-emphasis on the loss function. As a result, variations in the weights, especially for multitask learning, are able to make a big difference. For RMTL and WRMTL, we form the kernel as follows:
$$K(x_{it}, x_{js}) = (\delta_{st}\mu_t + 1)\, x_{it}\cdot x_{js}$$
where $\mu_t = \frac{T\lambda_0}{\lambda_t}$, and $\frac{1}{2\lambda_0}$ can be considered part of $C$ and is thus ignored here. We are only concerned with the ratio of $\lambda_0$ to $\lambda_t$, namely $\mu_t$. The smaller $\mu_t$ is, the less independent the task is and the more affinity it has towards the other tasks, which also implies that it is more inclined to incorporate the other tasks' training sets into its own training procedure; and vice versa with a larger $\mu_t$. $\mu_t$ is constant across the three tasks for RMTL, but varies according to task similarity and dissimilarity for WRMTL. We tested RMTL with $\mu \in \{0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000\}$ and tested WRMTL with $\mu_t = (0.001, 0.001, 1000)$, which means that for each scale of $\mu$ we no longer retain the same value for every task; instead, we add a weight for each task. Because of the similarity between $T_1$ and $T_2$, we predict that sharing each other's training set will improve the result, while $T_3$ is relatively independent and dissimilar, so using the training points of $T_1$ and $T_2$ should not lead to better performance. $\mu_t$ is changeable within some range: as long as $\mu_t$ is kept small enough for similar tasks and large for outliers, we can derive satisfying results.

For each of the four methods, Figure 6.2 shows the classification accuracy averaged over 1000 iterations. For each iteration, we randomly pick 3 training points for each class in each task; the rest of the points are reserved for the test set. Within each iteration, sSVM and oSVM are run once. RMTL and WRMTL are run 9 times by varying $\mu$ and $\mu_t$, picking the highest accuracy to represent the iteration.

In Figure 6.3, the top row represents the optimal classification boundaries for the three tasks, one per column. The second row shows the results for sSVM, the third row shows the results for oSVM, and the bottom two rows show the results for RMTL and WRMTL respectively. All observed data used for training are circled. We can observe that WRMTL outperforms all the others with the smallest deviation, followed by RMTL, then sSVM, and finally oSVM.

The reason WRMTL outperforms the other three algorithms can be illustrated by considering the task affinity between $T_1$ and $T_2$: we enlarge both training sets by combining them, which leads to an adjustment of both hyperplanes. The second row shows that sSVM's hyperplanes are completely influenced by the sparse training set; the sSVM hyperplanes for tasks $T_1$ and $T_2$ deviate substantially from the optimal hyperplanes, while the hyperplane for $T_3$ seems reasonable. The third row shows the oSVM hyperplanes, which are significantly influenced by $T_1$ and $T_2$, making the hyperplane for $T_3$ almost orthogonal to its optimal one. The oSVM accuracy is the worst of all; the importance of task relations is thus highlighted here. Using RMTL, the best result happens when $\mu = 0.1$, which implies that task relations play an important role for classification.
But because of the limitation of RMTL's assumption that each task shares equal weight during estimating the task parameter distribution, the fourth row shows the hyperplanes for all three tasks that are skewed towards to wrong way whose performance is as poor as oSVM. The fth row shows that the skew is successfully prevented by WRMTL that enforces a task weight for Task 3 as large as 100, which diverges itse from the rst two tasks. At the same time Task 1 and Task 2, as similar tasks, are sharing their samples properly so that the nal results almost reveal the true classication boundaries for all three tasks. WRMTL gives the highest accuracy, especially when t = (1; 1; 100). In this setting, even though T 3 is deemed as the outlier task, due to the random selection of training set, it could be slightly related with other tasks, but usually remain independence of it could be a wise choice. Figure 6.2: Overall performance for separated SVM, one SVM, regularized MTL and weighted tasks RMTL. 6.4.2 Rod Pump Failure Prediction We also tested our algorithm on rod pump failure prediction problem using a real world dataset. As we discussed in Chapter 4, one of the main challenges for this dataset is 84 Figure 6.3: Data distributions, training samples and hyperplanes for three tasks for optimal - top row, independent (sSVM) - second row, unied (oSVM) - third row, regularized (RMTL) - fourth row and weighted regularized SVM (WRMTL) - bottom row, under the same sampled training data while the original data contains high noise level. heterogeneity. Depending on the geology and geometry of the oil reservoir, as well as on the dierence in articial lift equipment, the rod pump failure signals dier. This makes this data set ideal for multitask learning. The goal is to predict four failure cases of sucker rod pump production wells on a daily basis across heterogeneous oil elds. Figure 3.3 shows an example of what the data looks like for a tubing failure. A dataset has been constructed by the SMEs who picked 50 well studied failure cases across 4 elds and extracted the trends from dierent attributes monitored from sensors. The attributes include the pump load, runtime, 85 Figure 6.4: Class distribution for the rod pump failure prediction dataset. cycles and several relevant metrics which are commonly used to identify failures in petroleum industry. The extracted trends serve as features - both long-term trends and short-term trends for each attribute (Liu et al., 2011). Besides failures, there are 300 normal wells in the dataset which may contain noise that looks like failures. Each well has at least 100 records. For each well failure, it is labeled according to the progress of the development of the failure - such as rod cutting events happening consistently which lead to tubing leaks, or pump friction events that lead to rod pump failures. Failure wells also contains normal records. All sudden failures, which happens because of accumulated mechanical strain that would not exhibit any anomalous signals before failure events, are excluded from this dataset. Figure 6.4 shows the distribution of each class in this dataset. Obviously normal samples take absolutely the majority, albeit the emphasis for our failure prediction is upon the 4 failure classes. The characteristics of oil elds bias the distribution of the dataset. The character- istics include geographic location, well depth, geological formation, production. 
Even failure handling actions by different field management teams can bias the data distribution. Therefore the 4 fields in the dataset can be naturally divided into 4 different but related failure prediction tasks.

Because this dataset comes from the real world, it contains a high volume of noise. However, failures are generally rare events. Similar failure patterns, even if they are in different fields, are very likely to mean the same failure type; e.g. a tubing leak always follows a significant decline in pump load. It is also intuitive that similar failure patterns can be shared among different fields. With the rarity of events in a noisy dataset and the tasks given, WRMTL, which we demonstrated using the synthetic data in Figure 6.1, can be useful. This is more of an anomaly detection problem, so we pay more attention to the precision and recall of the failure cases.

The state-of-the-art algorithm on this dataset is based on semi-supervised support vector machines (Liu et al., 2011). We tested our algorithm against the state-of-the-art algorithm and other algorithms using cross-validation. To demonstrate how the fields differ from each other, we employ leave-one-field-out cross-validation. Figure 6.5 shows the precision and recall for all failure classes using leave-one-field-out, the original semi-supervised support vector machine, regularized multitask support vector machines, and weighted regularized multitask support vector machines. An RBF kernel and the same $C = 1$, $\gamma = 0.4$ values are used for all algorithms. For the two multitask learning algorithms, we set $\mu = 1.0$, while for WRMTL we set its $\mu_t$ vector using the symmetric Kullback-Leibler divergence. We consider each task as a single domain and calculate its distribution for each feature; then, based on the sum of the KL-divergences of the corresponding features, we collect the $\mu_t$ values. We take $\lambda_0 = 1.0$. To calculate the KL-divergence, we first discretized all continuous-valued features, then summed over each feature against all other fields as follows:
$$\mu_t = \sum_{j=1, j\neq t}^{T}\sum_{f\in F}\ln\left(\frac{p_t(f)}{p_j(f)}\right) p_t(f)$$
where $F$ is the feature set, $t$ is the field index, and $p_t(f)$ represents feature $f$'s distribution in field $t$.

From the results, we can clearly observe that leave-one-field-out yields the worst performance, which supports our hypothesis that different fields are very different for the learning task. The state-of-the-art solution yields one global model using a single SVM, but because it generalizes over all fields, the results are "over-smoothed". When we enable the multitask learning setting, RMTL and WRMTL show consistently better precision and recall for each failure class. Since WRMTL adapts the KL-divergence to fine-tune the task regularization parameters, it produces results 1-5% better than RMTL on both evaluation metrics.
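A minimal sketch of this task-weight computation follows, implementing the one-directional form displayed above; the discretization into shared histogram bins and the smoothing epsilon are assumptions rather than the exact settings of our experiments.

import numpy as np

def feature_hist(x, edges, eps=1e-6):
    """Discretize one feature and return a smoothed probability vector."""
    h, _ = np.histogram(x, bins=edges)
    p = h.astype(float) + eps        # avoid log(0) on empty bins
    return p / p.sum()

def task_weights(fields, edges):
    """fields: list of (n_t, d) feature matrices, one per field/task;
    edges: list of d bin-edge arrays shared across fields.
    Returns mu_t = sum over other fields j and features f of the
    KL-divergence between field t's and field j's feature distributions."""
    T, d = len(fields), fields[0].shape[1]
    hists = [[feature_hist(F[:, f], edges[f]) for f in range(d)] for F in fields]
    mu = np.zeros(T)
    for t in range(T):
        for j in range(T):
            if j == t:
                continue
            for f in range(d):
                p, q = hists[t][f], hists[j][f]
                mu[t] += float(np.sum(p * np.log(p / q)))
    return mu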
6.5 Discussion

We have presented our weighted regularized multitask learning framework, which is a generalization of RMTL. Like its predecessor, this framework reuses single-task learning machinery, namely a common support vector machine with self-defined kernels. It is able to learn from multiple tasks simultaneously and transfer knowledge among tasks, which can alleviate the lack of observations in noisy environments. We tested our algorithm on both synthetic data and a real-world dataset from the petroleum industry. The result on the real-world data is better than the state-of-the-art solution, as well as regularized multitask learning.

Our algorithm shows a robust learning capability that handles outlier tasks without complex metrics, using only the KL-divergence as the task weight indicator. This is effective especially when there are few tasks and when some prior knowledge is available. Besides KL-divergence, other distance metrics may also be used; we leave that as future work.

6.6 Conclusion

We introduced a generalized regularized multitask learning method that encodes finer-grained task relationships. We proved that the multitask learning property holds for this method, and that the updated kernels of the dual form exhibit the generalized form. Through experiments on both synthetic data and the rod pump dataset, we showed that our methodology is superior to traditional regularized multitask learning, as well as to the global model (single-task learning) approaches.

Figure 6.5: Precision (a) and recall (b) for cross-validation of anomalous classes using different algorithms on the petroleum dataset.

Chapter 7

Prediction Confidence Level Model

7.1 Background

Due to the limitations of our observations, there are always uncertain factors that are not considered during prediction. Historical predictions, together with critical decision-related factors, are assembled to build a confidence-level model, which evaluates the failure probability from a wider perspective and ranks same-date predictions between different rod pumps. Under the confidence-level model, a threshold can be set to filter out low-confidence predictions, and a prioritization mechanism can be implemented.

In a data-driven process, a signature may fall between two different types of failures; therefore providing a confidence level with each prediction is important. Given a trained support vector machine, we can use (Huang et al., 2006) to calculate the distribution of a sample with respect to all possible events. However, given the many uncertainties involving data noise, inconsistency, lack of trend, and other related factors, it is unwise to directly convert each alert into a field action. Furthermore, not all failures are as ideal as the ones we use to train the model. Additionally, if many alerts accumulate across different rod pumps, the model lacks the capability to prioritize among them. An example approach is to prioritize the larger producers, but what if they are not as likely to fail as a producer with slightly lower production? This is where the confidence-level model can be applied.

7.2 Modeling

According to the SMEs, we have listed multiple factors that are relevant to the decision-making process, including:

1. Field: the oil field this rod pump belongs to. Different fields may have different operational standards and different geological formations, which makes rod pumps similar if they are from the same field, but not as similar if they are from different fields.
2. Components: the component configuration has always been an important factor when comparing rod pumps.
3. Annual failure rates by field.
4. Recent failure prediction confidence: the probabilistic output from our prediction model.
5. Past rod-pump-specific failure statistics: the failure history of each specific rod pump implies a correlation with future failure, under the impression that a rod pump that has failed has a higher probability of failing again.
6. Production: higher producers are prioritized over low producers.

The way we construct the confidence-level model is to build an updated model that learns from all past failures together with their relevant decision factors.
The task can be described as follows: given random variables that represent each feature as $X_i$, $i = 1, \ldots, N$, build a model of the conditional probability $P(Y \mid X_1, X_2, \ldots, X_N)$ such that $Y \in \{0, 1\}$, where 0 represents normal and 1 represents failure, and the model parameters are estimated by maximum likelihood.

7.3 Logistic Regression

A logistic regression problem can be considered a supervised learning task where we are given $M$ training instances $\{(x_i, y_i), i = 1, \ldots, M\}$. It models the probability distribution of the class label $y$ given the feature vector $x$ as follows:
$$p(y = 1 \mid x; \theta) = \sigma(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)}$$
Here $\theta \in \mathbb{R}^N$ are the parameters of the logistic regression model, and $\sigma(\cdot)$ is the sigmoid function (Cessie & Houwelingen, 1992) (Hall et al., 2009). Once all the past failure and normal records have been used to train such a model, it is ready to evaluate the confidence level of real-time alerts.

7.4 Experiments

Based on historical predictions from the validation process, we can construct the confidence-level model by collecting all related factors. To measure how good the confidence level is, we use the receiver operating characteristic curve (ROC curve). An ROC curve plots the proportion of positive (failure) points correctly classified as positive against the proportion of negative (normal) points incorrectly classified as positive, as the classification threshold is varied (Adams & Hand, 1999). It uses the cumulative distributions of the true positive rate (TPR) against the false positive rate (FPR). The formulation can also be represented using the terminology defined in Table 3.3:
$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$$
The curve is formed by plotting the TPR and FPR, considering the highest-confidence samples first and then gradually adding less confident samples. The area below the curve, called the area under the curve (AUC), is popularly used to evaluate how well the model performs. In most cases, the closer the AUC value is to 1.0, the better the model.

Figure 7.1: ROC curve for the confidence-level model from 5 fields.

For the failure prediction statistics, we backtrack each failure and retrieve its average failure distribution with respect to each of the gradual/sudden failure categories 3 days before the rod pump stops sending any data. For each category, missing data are ignored. To evaluate the model, unlike the failure prediction stage, we rely on 10-fold cross-validation to generate the ROC curve, which is displayed in Figure 7.1. Its AUC value is 0.779.

7.5 Utilization

Besides the output of a confidence value, the outcome of logistic regression is versatile. From the trained model parameters $\theta$, we can gain an intuition of which parameters are comparatively more important than the others. Furthermore, given the actual factors, the model can also estimate which set of parameters plays the most important role in a high or low confidence. According to Figure 7.2, past failure history and the predicted failures are clearly critical during the decision-making process. However, another reason why these weights are high is that these factors are small numbers between 0 and 1, while most of the other parameters are categorical, taking the value 1 or otherwise 0.

Figure 7.2: Parameter weights for different factors in the confidence-level model.

With each rod pump having a single failure probability, a cutoff threshold can be set so that only highly probable rod pump failures are picked up for timely treatment.
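Putting Sections 7.3-7.5 together, the following is a minimal sketch of fitting the confidence-level model, scoring it with the ROC AUC, and ranking same-date alerts, assuming scikit-learn; the factor-matrix layout and the 0.5 cutoff are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fit_confidence_model(X, y):
    """X: (M, N) decision factors per historical alert (field indicator,
    annual failure rate, prediction confidence, past failure statistics,
    production); y: (M,) outcomes, 1 = failure, 0 = normal.
    Maximum-likelihood fit of P(y=1|x) = sigmoid(theta^T x)."""
    return LogisticRegression(max_iter=1000).fit(X, y)

# Evaluation sketch (the chapter uses 10-fold cross-validation):
# auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
# model.coef_ holds the per-factor weights discussed in Section 7.5.

def rank_alerts(model, X_today, threshold=0.5):
    """Rank same-date alerts by failure probability and keep those above
    a cutoff threshold; the 0.5 value is an illustrative choice."""
    probs = model.predict_proba(X_today)[:, 1]
    order = np.argsort(-probs)               # most confident first
    return [(int(i), float(probs[i])) for i in order if probs[i] >= threshold]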
7.6 Updated SEA System

With the updates that involve the prediction engine and knowledge management, we have the updated SEA system module diagram in Figure 7.3.

Figure 7.3: Updated SEA system flowchart.

7.7 Conclusion

In this chapter, we presented our confidence-level model, including the confidence factors and the model itself. We also finished an initial experiment on how to evaluate and use this confidence-level model in the real world. With the confidence-level model, there is now a mechanism that can rank different rod pump failures and provide a finer-grained explanation of which factors are most important when an alert is made. Facing many natural and human-involved uncertainties, this is a complementary element of the SEA system that can be used to assist decision-making. In the future, this model can be extended following the approach we adopted in this chapter; an incremental confidence-level model can be more robust and reliable.

Chapter 8

Conclusions and Future Work

Failure prediction is important in modern industry, where active asset surveillance produces multivariate time series with historical data storage and failure events. Especially when many assets have to be managed together, e.g. in a decision support center, it is impractical to develop and maintain individual models for each asset. Data mining and machine learning are emerging fields that have been studied and proven effective in multiple event prediction applications such as biosurveillance and hard drive and computer system failure prediction. A key challenge in performing such a task is that domain knowledge has to be carefully combined with data mining and machine learning algorithms. Moreover, when facing hundreds or even thousands of assets, how to properly generalize the model so that it adapts to all assets remains a challenging problem in computer science. This can be represented as two problems: 1) how to systematically learn from historical failures and train an effective model that is applicable in a failure prediction application; 2) how to train a generalized model from the labeled dataset that is efficient in predicting failures across thousands of multivariate time series that exhibit heterogeneity.

We presented the Smart Engineering Apprentice (SEA) system, which covers data extraction, data preparation, feature extraction, data mining, alert generation and knowledge management. We successfully applied SEA to failure prediction for rod pump artificial lift systems, one of the most important assets in the petroleum industry, using the logs instrumented by POCs. Due to many uncertainties, we also presented an in-depth semi-supervised algorithm for better generalization. We then further improved the algorithm with a novel multitask learning algorithm that combines multiple decision-relevant factors to yield a better generalized global model, which can be optimized via existing support vector machine solvers. Our algorithms achieved better performance than state-of-the-art algorithms for our failure prediction problem.

The data attributes instrumented by POCs do not measure several important factors that may be useful for failure prediction. Chemicals and the corrosion status of tubings or rods are also relevant to potential failures; how to integrate such knowledge into failure prediction algorithms remains unexplored. It is commonly known among SMEs that wells in a gaseous reservoir exhibit different behaviors compared to wells in a reservoir that is low in gas.
For our multitask learning, a reasonable extension would be grouping wells according to gas production, which may provide better heuristics during optimization. For knowledge management, the existing confidence-level model only covers the small set of decision factors that we learned from the SMEs; more factors should be considered once more feedback is received.

In the future, evaluating field deployment performance in real-world applications requires more data from oil fields, so that better task weights can be derived from observed relevant factors. Furthermore, though the system was proven effective for rod pumps, it will be interesting to experiment with other assets that exhibit similar requirements, such as electrical submersible pumps (ESPs).

Bibliography

Adams, N., & Hand, D. (1999). Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32, 1139-1147.

Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., & Harris, E. (2010). Reining in the outliers in map-reduce clusters using Mantri. Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (pp. 1-16). Berkeley, CA, USA: USENIX Association.

Au, T. S., Duan, R., Kim, H., & Ma, G.-Q. (2010). Spatiotemporal event detection in mobility network. Proceedings of the 2010 IEEE International Conference on Data Mining (pp. 28-37). Washington, DC, USA: IEEE Computer Society.

Bahrepour, M., Zhang, Y., Meratnia, N., & Havinga, P. J. M. (2009). Use of event detection approaches for outlier detection in wireless sensor networks. Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2009 5th International Conference on (pp. 439-444).

Bottone, S., Lee, D., O'Sullivan, M., & Spivack, M. (2008). Failure prediction and diagnosis for satellite monitoring systems using Bayesian networks. Military Communications Conference, 2008. MILCOM 2008. IEEE (pp. 1-7).

Buntine, W. (1996). A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowl. and Data Eng., 8, 195-210.

Caruana, R. (1997). Multitask learning. Mach. Learn., 28, 41-75.

Cessie, S. L., & Houwelingen, J. C. V. (1992). Ridge estimators in logistic regression. Journal of the Royal Statistical Society. Series C (Applied Statistics), 41, 191-201.

Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Comput. Surv., 41, 15:1-15:58.

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1-27:27.

Chen, J., Tang, L., Liu, J., & Ye, J. (2009). A convex formulation for learning shared structures from multiple tasks. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 137-144). New York, NY, USA: ACM.

Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.

Cooper, G. F., Dash, D. H., Levander, J. D., Wong, W.-K., Hogan, W. R., & Wagner, M. M. (2004). Bayesian biosurveillance of disease outbreaks. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 94-103). Arlington, Virginia, United States: AUAI Press.

Damle, A., Deshmukh, D., Dixit, J., & Patil, S. (2011). Epidemiological investigation of an outbreak of acute diarrheal disease: A shoe leather epidemiology. Journal of Global Infectious Diseases, 3, 361-365.
Eickmeier, J. (1967). Diagnostic analysis of dynamometer cards. Journal of Petroleum Technology, 19, 97-106.

Evgeniou, T., Micchelli, C. A., & Pontil, M. (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6, 615-637.

Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 109-117). New York, NY, USA: ACM.

Freund, Y. (1999). The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference (pp. 124-133). Morgan Kaufmann.

Goldenberg, A., Shmueli, G., Caruana, R. A., & Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proc. Natl. Acad. Sci. U.S.A., 99, 5237-5240.

Gong, P., Ye, J., & Zhang, C. (2012). Robust multi-task feature learning. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 895-903). New York, NY, USA: ACM.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11, 10-18.

Hamerly, G., & Elkan, C. (2001). Bayesian approaches to failure prediction for disk drives. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 202-209). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Han, J., Kamber, M., & Pei, J. (2011). Data mining: Concepts and techniques (3rd ed.). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Huang, T.-K., Weng, R. C., & Lin, C.-J. (2006). Generalized Bradley-Terry models and multi-class probability estimates. J. Mach. Learn. Res., 7, 85-115.

Hughes, G. F., Murray, J. F., Kreutz-Delgado, K., & Elkan, C. (2002). Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51.

Hyne, N. (2001). Nontechnical guide to petroleum geology, exploration, drilling, and production. PennWell Nontechnical Series. PennWell Corporation.

Ihler, A., Hutchins, J., & Smyth, P. (2006). Adaptive event detection with time-varying Poisson processes. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 207-216). New York, NY, USA: ACM.

Kim, S., & Xing, E. P. (2010). Tree-guided group lasso for multi-task regression with structured sparsity. ICML (pp. 543-550).

L. Alegre, C. K. M., & da Rocha, A. F. (1993). Intelligent diagnosis of rod pumping problems. Houston, Texas: Society of Petroleum Engineers.

Lenser, S., & Veloso, M. (2005). Non-parametric time series classification. Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on (pp. 3918-3923).

Liu, Y., Yao, K.-T., Liu, S., Raghavendra, C., Balogun, O., & Olabinjo, L. (2011). Semi-supervised failure prediction for oil production wells. Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on (pp. 434-441).

Liu, Y., Yao, K.-T., Liu, S., Raghavendra, C. S., Lenz, T., Olabinjo, L., Seren, B., Seddighrad, S., & Babu, D. (2010). Failure prediction for rod pump artificial lift systems. Anaheim, California, USA: Society of Petroleum Engineers.

Mo, Y., Kim, T.-H., Brancik, K., Dickinson, D., Lee, H., Perrig, A., & Sinopoli, B. (2012). Cyber-physical security of a smart grid infrastructure. Proceedings of the IEEE, 100, 195-209.

Ocanto, L., & Rojas, A. (2001). Artificial-lift systems pattern recognition using neural networks.
Buenos Aires, Argentina: Society of Petroleum Engineers.

Phua, C., Lee, V., Smith-Miles, K., & Gayler, R. (2005). A comprehensive survey of data mining-based fraud detection research. Artificial Intelligence Review.

Plaza, E., Armengol, E., & Ontañón, S. (2005). The explanatory power of symbolic similarity in case-based reasoning. Artificial Intelligence Review, 24, 145-161.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Rath, T., & Manmatha, R. (2003). Word image matching using dynamic time warping. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 521-527).

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323-2326.

Rowlan, O. L., Lea, J. F., & McCoy, J. N. (2007). Overview of beam pump operations. Anaheim, California, USA: Society of Petroleum Engineers.

Sabahi, F., & Movaghar, A. (2008). Intrusion detection: A survey. Systems and Networks Communications, 2008. ICSNC '08. 3rd International Conference on (pp. 23-26).

Salfner, F., Lenk, M., & Malek, M. (2010). A survey of online failure prediction methods. ACM Comput. Surv., 42, 10:1-10:42.

Scheffer, M., Bascompte, J., Brock, W. A., Brovkin, V., Carpenter, S. R., Dakos, V., Held, H., van Nes, E. H., Rietkerk, M., & Sugihara, G. (2009). Early-warning signals for critical transitions. Nature, 461, 53-59.

Seidenari, L., & Bertini, M. (2010). Non-parametric anomaly detection exploiting space-time features. Proceedings of the International Conference on Multimedia (pp. 1139-1142). New York, NY, USA: ACM.

Snyder, T. (2009). Taking condition monitoring to the next level with predictive analytics. E&P Magazine.

Tenenbaum, J. B., Silva, V. d., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323.

Tian, J., Gao, M., Cao, L., & Li, K. (2007a). Fault detection of oil pump based on fuzzy neural network. Natural Computation, 2007. ICNC 2007. Third International Conference on (pp. 636-640).

Tian, J., Gao, M., Li, K., & Zhou, H. (2007b). Fault detection of oil pump based on classify support vector machine. Control and Automation, 2007. ICCA 2007. IEEE International Conference on (pp. 549-553).

Vilalta, R., & Ma, S. (2002). Predicting rare events in temporal domains. Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on (pp. 474-481).

Wagner, C. H. (1982). Simpson's paradox in real life. The American Statistician, 46-48.

Wang, T.-Y., & Yu, C.-T. (2005). Collaborative event region detection in wireless sensor networks using Markov random fields. Wireless Communication Systems, 2005. 2nd International Symposium on (pp. 493-497).

Wang, X. R., Lizier, J. T., Obst, O., Prokopenko, M., & Wang, P. (2008). Spatiotemporal anomaly detection in gas monitoring sensor networks. Proceedings of the 5th European Conference on Wireless Sensor Networks (pp. 90-105). Berlin, Heidelberg: Springer-Verlag.

Wei, L., & Keogh, E. (2006). Semi-supervised time series classification. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 748-753). New York, NY, USA: ACM.

Wong, W.-K. (2004). Data mining for early disease outbreak detection. Doctoral dissertation, Pittsburgh, PA, USA. AAI3121061.

Xing, Z., Pei, J., Dong, G., & Yu, P. S. (2008). Mining sequence classifiers for early prediction. SIAM International Conference on Data Mining (pp. 644-655).
Xu, P., Xu, S., & Yin, H. (2007). Application of self-organizing competitive neural network in fault diagnosis of suck rod pumping system. Journal of Petroleum Science and Engineering, 58, 43-48.

Zhang, J., Tsui, F.-C., Wagner, M. M., & Hogan, W. R. (2003). Detection of outbreaks from time series data using wavelet transform. AMIA Fall Symposium (pp. 748-752). Omni Press.

Zhang, Y., & Sivasubramaniam, A. (2008). Failure prediction in IBM BlueGene/L event logs. Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on (pp. 1-5).

Zhou, J., Chen, J., & Ye, J. (2011). Clustered multi-task learning via alternating structure optimization. NIPS (pp. 702-710).

Zhou, J., Liu, J., Narayan, V. A., & Ye, J. (2012). Modeling disease progression via fused sparse group lasso. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1095-1103). New York, NY, USA: ACM.

Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3, 1-130.