PREDICTION MODELING AND STATISTICAL ANALYSIS FOR AMINO ACID SUBSTITUTIONS by Hua Yang A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) December 2006 Copyright 2006 Hua Yang Dedication To my beloved parents Meiji Zhou and Huchen Yang ii Acknowledgments I would like to thank all the people who have helped me during my time at University of Southern California (USC). My most earnest acknowledgment must go to my academic advisors, Dr. Ting Chen and Dr. C.-C. Jay Kuo, who gave me their support and guid- ance throughout this research. Their insight, technical expertise, and high standards in research make this work fulfilled. I am grateful to my academic guidance committee members, Dr. Fengzhu Sun, Dr. Richard M. Leahy, Dr. Shrikanth S. Narayanan, for their valuable advices at my qualifying and defense examinations. Especially, I would like to thank Dr. Michael Waterman, Dr. Lei Li, Dr. Simon Tavare and Dr. Xianghong Zhou for their insightful lectures and generous discussions that became a great source of knowledge in this research. I would like to express my special appreciation to Dr. Rui Jiang, who assisted this research in designing rule extraction strategy, for his constant support. Also, my col- leagues, Dr. Lei Zhuge, Ms. Linqi Zhou, Dr. Xiaotu Ma, Dr. Chao Cheng, Dr. Debo- jyoti Dutta, Dr. Yunhu Wan, Dr. Bingwen Lu and Dr. Shipra Metah, deserve special recognition for their technical discussions. I also owe a huge gratitude to my friends, Dr. Xiting Yan, Dr. Hyun-Ju Lee, Dr. Zhidong Tu, Dr. Weihong Xu, Ms. Li Wang, Ms. Lijuan Mo, Mr. Huangying Ge, Mr. Kangyu Zhang and Mr. Min Xu who helped my life in USC much easier. iii My most heartfelt acknowledgment should go to my parents, Meiji Zhou and Huchen Yang, and my brother Song Yang, for their constant patience, encouragement, and sup- port during my study. I dedicate this thesis to my parents Meiji Zhou and Huchen Yang. iv Table of Contents Dedication ii Acknowledgments iii List of Figures viii List of Tables xii Abstract xiv Chapter 1: Introduction 1 1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . 1 1.2 Summary of Main Research Results . . . . . . . . . . . . . . . . . 2 1.2.1 Features and Classification for Amino Acid Substitutions . . 2 1.2.2 Rule Extraction Based on Simulate Annealing Bump Hunting 5 1.2.3 Predictive Model Design and Prioritization Analysis . . . . 6 1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . 7 1.4 Outline of the Proposal . . . . . . . . . . . . . . . . . . . . . . . . 9 Chapter 2: Background Review 10 2.1 Fundamentals of Molecular Biology . . . . . . . . . . . . . . . . . 10 2.2 Feature Set for Amino Acid Substitutions . . . . . . . . . . . . . . 19 2.3 Pattern Recognition Methods in Machine Learning . . . . . . . . . 23 2.3.1 Decision Tree (DT) Method . . . . . . . . . . . . . . . . . 24 2.3.2 Support Vector Machines . . . . . . . . . . . . . . . . . . . 26 2.3.3 Mixture Models and Gaussian Mixture Model (GMM) . . . 30 2.3.4 Random forests . . . . . . . . . . . . . . . . . . . . . . . . 35 2.4 Simulated Annealing Bump Hunting and Rule Extraction . . . . . . 38 2.4.1 Bump Hunting Method . . . . . . . . . . . . . . . . . . . . 38 2.4.2 Simulated Annealing Bump Hunting Method . . . . . . . . 40 2.4.3 Rule extraction and interpretation . . . . . . . . . . . . . . 
41 2.5 Prediction Model Design for Prioritization Analysis . . . . . . . . . 42 v Chapter 3: Novel Feature Sets and Classification of Amino Acid Substitu- tions 44 3.1 Feature Extraction for Amino Acid Substitutions . . . . . . . . . . 44 3.1.1 Amino acid properties in feature set . . . . . . . . . . . . . 44 3.1.2 Feature set calculation . . . . . . . . . . . . . . . . . . . . 47 3.2 Data Sets for Classifier Learning . . . . . . . . . . . . . . . . . . . 49 3.2.1 Unbiased experimental amino acid substitutions . . . . . . . 50 3.2.2 Human disease-associated amino acid substitutions . . . . . 51 3.3 Decision Trees for Classifying Amino Acid Substitutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.1 Decision tree learning . . . . . . . . . . . . . . . . . . . . 52 3.3.2 Implementation of decision trees . . . . . . . . . . . . . . . 54 3.3.3 Training and test with decision trees . . . . . . . . . . . . . 54 3.4 SVMs for Classifying Amino Acid Substitutions . . . . . . . . . . 55 3.4.1 Support vector machine learning . . . . . . . . . . . . . . . 56 3.4.2 LIBSVM training and test . . . . . . . . . . . . . . . . . . 58 3.4.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . 59 3.5 Gaussian Mixture Models for Predicting Effects of Amino Acid Sub- stitutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5.1 GMM training by EM algorithm and classification . . . . . 64 3.5.2 Simulation results . . . . . . . . . . . . . . . . . . . . . . 67 3.6 Random Forests for Classifying Amino Acid Substitutions . . . . . 69 3.6.1 Random forests learning . . . . . . . . . . . . . . . . . . . 69 3.6.2 Simulation results . . . . . . . . . . . . . . . . . . . . . . 71 3.7 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . 72 Chapter 4: Simulated Annealing Bump Hunting Rule Extraction for Amino Acid Substitutions 74 4.1 The Patient Rule Induction Method . . . . . . . . . . . . . . . . . . 75 4.1.1 Rule induction as function optimization . . . . . . . . . . . 76 4.1.2 Top-down peeling and bottom-up pasting . . . . . . . . . . 79 4.1.3 Trajectory plot . . . . . . . . . . . . . . . . . . . . . . . . 83 4.2 Simulated Annealing Bump Hunting Strategy . . . . . . . . . . . . 83 4.2.1 Feature sets for rule extraction . . . . . . . . . . . . . . . . 84 4.2.2 Comparison with exhaustive search and decision tree method 85 4.2.3 Automatic feature selection in simulated annealing bump hunting . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.2.4 Simulated annealing strategy . . . . . . . . . . . . . . . . . 89 4.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.3.1 Selection of hyperparameters . . . . . . . . . . . . . . . . . 92 4.3.2 Box training and comparison . . . . . . . . . . . . . . . . . 92 4.3.3 Interpretation of the rules . . . . . . . . . . . . . . . . . . . 97 vi 4.4 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . 100 Chapter 5: Prediction Model Design and Prioritization Analysis for Human Disease-related Mutations 102 5.1 Multiple selection and rule voting (MS-RV) prediction model . . . . 104 5.1.1 Module training on partitioned data sets . . . . . . . . . . . 105 5.1.2 Sequential forward selection of optimal feature set . . . . . 105 5.1.3 Prediction with rule voting strategy . . . . . . . . . . . . . 106 5.2 Performance evaluation criteria . . . . . . . . . . . . . . . . . . . . 107 5.3 Effectiveness of the MS-RV model on the proposed feature set . . . 
109 5.4 Differentiation of mutations related to human diseases . . . . . . . . 110 5.4.1 Mitochondrial mutations . . . . . . . . . . . . . . . . . . . 110 5.4.2 Mutation data sets related to specific human diseases . . . . 111 5.4.3 Mutation data on specific gene . . . . . . . . . . . . . . . . 111 5.4.4 Mutation data in chromosomal regions . . . . . . . . . . . . 112 5.5 Prioritization of human disease-associated amino acid substitutions . 113 5.6 Analysis of unclassified mutation data . . . . . . . . . . . . . . . . 116 5.6.1 Mutations related toHBB HUMAN gene . . . . . . . . . 117 5.6.2 Mutations related toCD36 HUMAN gene . . . . . . . . . 118 5.6.3 Mutations related toCASR HUMAN gene . . . . . . . . 119 5.6.4 Mutations related toCD22 HUMAN gene . . . . . . . . . 119 Chapter 6: Conclusion and Future Work 121 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 References 126 vii List of Figures 2.1 The central dogma of molecular biology. . . . . . . . . . . . . . . . 10 2.2 DNA code and codons for amino acid. . . . . . . . . . . . . . . . . 12 2.3 RNA code and codons for amino acid. . . . . . . . . . . . . . . . . 13 2.4 1-letter and 3-letter abbreviations of amino acid. . . . . . . . . . . . 14 2.5 Qualitative classification of20 amino acids. . . . . . . . . . . . . . 15 2.6 The order of molecular weights of20 amino acids. . . . . . . . . . 15 2.7 The order of pI values of20 amino acids. . . . . . . . . . . . . . . 16 2.8 The order of the hydrophobicity scale of20 amino acids. . . . . . . 17 2.9 Increased preference for helices of20 amino acids. . . . . . . . . . 18 2.10 Increased preference for strands of20 amino acids. . . . . . . . . . 18 2.11 Increased preference for turns of20 amino acids. . . . . . . . . . . 18 2.12 Summary of feature sets in amino acid substitution study, where fea- tures are categorized into sequence-based, physicochemical, evolu- tionary information, statistical score and structure-based five groups. 21 2.13 The block diagram of a statistical pattern recognition system. . . . . 23 2.14 A taxonomy of statistical pattern recognition techniques. . . . . . . 24 3.1 Calculation of feature set from multiple sequence alignment. The segment of lac repressor’s MSA. A window centered at L with size of 9 residues is highlighted on the query sequence. The column at relative position66 is also highlighted. . . . . . . . . . . . . . . . 49 viii 3.2 Decision tree learning example. On the right side, decision tree expresses the boolean function (A xor B) shown in the left truth table. Nodes A and B specifies the attributes, branches from each node are deter- mined by its attribute value. The leaves represent the outputs. Each path from the root node to a leaf corresponds a row in the truth table. 53 3.3 Decision tree method 10-fold cross validation accuracy comparison. For each case of lac repressor, T4 lysozyme and mixed samples, the left bar represents the published data, and the right bar represents the result with our feature set. . . . . . . . . . . . . . . . . . . . . . . 55 3.4 Training a support vector machine. It is to find the optimal hyper- plane, which has the maximum distance from the nearest training pat- terns. The support vectors are the points on the two lines with margin distance from the hyperplane. . . . . . . . . . . . . . . . . . . . . . 57 3.5 SVMs 10-fold cross validation accuracy comparison. 
For each case of lac repressor, T4 lysozyme and mixed samples, the left bar represents the published data, and the right bar represents the result with our feature set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.6 Evolutionary quantity and predicted free energy change plot of lac repressor. The x-axis denotes the evolutionary quantity, and the y- axis denotes the predicted free energy change. Red points are for intolerant amino acid substitutions, and green points are for tolerant ones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.7 Evolutionary quantity and predicted free energy change plot of T4 lysozyme. The x-axis denotes the evolutionary quantity, and the y- axis denotes the predicted free energy change. Red points are for intolerant amino acid substitutions, and green points are for tolerant ones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.8 Random forests 10-fold cross validation accuracies for experimental amino acid substitutions on lac repressor, T4 lysozyme and mixed samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.1 Illustration of the PRIM algorithm. There are two classes, indicated by the circles(class0) and diamonds(class1) points respectively. The procedure starts with a rectangle surrounding all of the data, and then peels away points along one edge by a prespecified amount in order to maximize the mean of the points remaining in the box. Starting at the top left panel, the sequence of peeling is shown, until a pure diamond region is isolated in the bottom right panel. . . . . . . . . . . . . . 82 ix 4.2 Illustration of the trajectory plot. Mean versus support for boxes induced by the top-down peeling algorithm. Most appropriate trade- off is chosen by the user. . . . . . . . . . . . . . . . . . . . . . . . 84 4.3 Determination of the rewarding factor¸. . . . . . . . . . . . . . . . 93 4.4 Boxes trained by the simulated annealing bump hunting method. ”Num- ber of Features” represents the number of features included in the corresponding boxes. ”Number of Samples” represents the number of data samples inside the box for the corresponding training or test set. The ”Coverage Rate” represents the proportion of data samples inside the box for the corresponding training or test set. . . . . . . . 94 4.5 Comparison of the boxes obtained by the simulated annealing bump hunting strategy and the original bump hunting method (the SuperGEM software) for the intolerant model. Red round points are boxes trained by simulated annealing bump hunting method and green triangle points are boxes trained by SuperGEM. The numbers beside each point denotes the number of features included in the corresponding box. . . . . . . 95 4.6 Comparison of the boxes obtained by the simulated annealing bump hunting strategy and the original bump hunting method (the SuperGEM software) for the tolerant model. Red round points are boxes trained by simulated annealing bump hunting method and green triangle points are boxes trained by SuperGEM. The numbers beside each point denotes the number of features included in the corresponding box. . . . . . 96 4.7 Example rule of intolerant model, for 27-D samples: This rule can cover 2.21% data samples with an accuracy of 90.2%. . . . . . . . . 97 4.8 Example rule of intolerant model, for 27-D samples: This rule can cover 2.09% data samples with an accuracy of 94.7%. . . . . . . . . 
97 4.9 Example rule of intolerant model, for 45-D samples: This rule can cover 2% data samples with an accuracy of 94.5%. . . . . . . . . . 98 4.10 Example rule of tolerant model, for 27-D samples: This rule can cover 4.47% data samples with an accuracy of 97.1%. . . . . . . . . 99 4.11 Example rule of tolerant model, for 27-D samples: This rule can cover 8.16% data samples with an accuracy of 95.6%. . . . . . . . . 99 4.12 Example rule of tolerant model, for 45-D samples: This rule can cover 5.03% data samples with an accuracy of 93.0%. . . . . . . . . 100 x 5.1 The training of 20 modules in the MS-RV model. Each module is trained on the substitution data with one of the 20 original amino acids. Tree ensemble is built up on bootstrapped data sets from the optimal feature set. . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.2 The generation of optimal feature set for each module using sequen- tial forward feature selection. Starting from empty, the optimal fea- ture set is updated by adding one more feature which furthest increases certain criteria. This procedure continues until the criteria decreases by adding any more features or all features have been included. . . . 106 5.3 Testing mutation data by MS-RV model using rule voting strategy. . . . . 107 5.4 Prediction performance comparison on amino acid substitutions related to human diseases. ROC curves are drawn from the prediction results from support vector machine, random forest and MS-RV model. . . 109 5.5 ROC curve for prioritizing disease-associated substitutions from 30 neigh- boring gene regions. Data sets are collected from the Swiss-Prot database. 115 5.6 ROC curve for prioritizing disease-associated substitutions from 30 neigh- boring gene regions. Data sets are collected from the Swiss-Prot database. 116 xi List of Tables 2.1 Parameterizations and geometric interpretation of the covariance matrix in the Gaussian mixture model. . . . . . . . . . . . . . . . . . . . . 33 2.2 The number of estimated parameters of covariance matrices in the Gaussian mixture model. . . . . . . . . . . . . . . . . . . . . . . . 33 3.1 Properties of amino acids included in our feature set. . . . . . . . . 46 3.2 Amino acid property values for feature set calculation. ‘MolWgt’ denotes ‘Molecular weight’; ‘pI’ denotes ‘pI value’; ‘Hyd’ denotes ‘Hydrophobicity’; ‘RelFreq-h’ denotes amino acid’s relative frequency in helices; ‘RelFreq-s’ denotes amino acid’s relative frequency in strands; ‘RelFreq-t’ denotes amino acid’s relative frequency in turns. 46 3.3 Groups of amino acids favoring different globular protein secondary structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4 Details of the proposed 27-D feature set. . . . . . . . . . . . . . . 48 3.5 Experimental amino acid substitutions. . . . . . . . . . . . . . . . . 50 3.6 Selected Feature Sets for GMM Training and Testing. . . . . . . . 68 3.7 GMM Heterogeneous Validation Accuracy Comparison. . . . . . . 69 3.8 GMM Mixture Sample Cross-Validation Accuracy Comparison. . . 69 4.1 Details of the proposed 45-D feature set. . . . . . . . . . . . . . . 86 5.1 Details of the 26-D feature set. Each of the 6 amino acid properties is cal- culated in4 situations, formingX 1 » X 24 , and the conservation scores for the original and the substituted amino acids areX 25 andX 26 , respectively. 103 5.2 p-value of rank sum test for mutations related to10 human diseases. . . . 112 xii 5.3 Substitution data sets in 30 neighboring gene regions. 
Each region centers at a gene whose amino acid substitutions associate with a human disease. Mutation data are collected from heterogenous data sources, and numbers of substitutions are listed for each case. . . . . . . . . . . . . . . . . . 114 xiii Abstract Classifying and predicting amino acid substitutions are important in pharmaceutical and pathological research. We proposed a novel feature set from amino acids’ physicochem- ical properties, evolutionary profile of proteins, and protein sequence information. Large scale size of human disease-associated data were collected and processed, together with the unbiased experimental amino acid substitutions. Machine learning methods of deci- sion tree, support vector machine, Gaussian mixture model, and random forests were used to classify neutral and deleterious substitutions, and the comparison of classifica- tion accuracy with published results showed that our feature set is superior to the existing ones. We designed a simulated annealing bump hunting method to automatically extract interpretable rules for amino acid substitutions. Rules are consistent with current bio- logical knowledge or provide new insights for understanding substitutions. We also designed a Multiple Selection and Rule V oting (MS-RV) model, which inte- grates data partition and feature selection to predict and prioritize disease-associated mutations. For mutation data in SwissProt database, the 10-fold cross validation accu- racy outperforms the support vector machine and random forests. We prioritized the substitutions inside thirty 10-Mb chromosomal regions which are related to monogenic diseases, and analyzed the normalized ranks. The overall area under ROC curve (AUC) scores is 86.6%. For the polygenic disease-associated amino acid substitutions, we xiv analyzed the mutations that cause the Alzheimer disease. Our method prioritized the disease-associated substitutions on top ranks. The results indicate that MS-RV model effectively prioritizes disease-associated amino acid substitutions. We also studied the unclassified mutations with high prediction scores, and found evidences to support our conclusions. xv Chapter 1 Introduction 1.1 Significance of the Research On DNA sequences, the single base variants are called single nucleotide polymorphisms (SNPs). Amino acid substitution is the replacement of one an amino acid in a protein with another, if the SNPs occur in protein coding regions and change the encoded amino acid. This process may occur naturally or be induced experimentally. Tolerant (also known as neutral) amino acid substitutions do not change protein’s wild-type activity while intolerant (also known as deleterious) amino acid substitutions impact the struc- ture of a protein so that the protein function is diminished or eliminated. Predicting potential effects of amino acid substitutions on protein functions is of cen- tral importance in modern pathological and pharmaceutical studies [TCC + 02]. This is especially true with increasing amounts of amino acid substitutions being detected and collected in various databases such as the Human Gene Mutation Database (HGMD) [KBF + 00], the Swiss-Prot database [BA00] and the Online Mendelian Inheritance in Man + (OMIM) database [McK98]. Stand alone data sets such as the unbiased lab- oratory mutagenesis data derived from experiments on the lac repressor [MKC + 94], [SMK + 96] and the T4 lysozyme [RBHP91] are also available. 
Functional amino acid polymorphisms are common genetic variants, which affect the structure, function and expression of a protein. Typically, prediction of effects of amino acid substitutions is based on features derived from sequence/structural properties as well as the phylogenetic information of 1 proteins that contain substitutions. Some widely-used features and methods will be reviewed in Chapter 2. Although some of them are claimed to be accurate, these feature sets are complicated, or derived from other prediction tools, which make their gener- alization difficult. Furthermore, prediction accuracy itself cannot explain the mutation principles. Thus, the main objectives of this research include: 1. to find a set of features to facilitate the classification and prediction of amino acid substitutions; 2. to extract interpretable rules for amino acid substitutions; 3. to design predictive model and prioritize amino acid substitutions. They will be detailed in the next section. 1.2 Summary of Main Research Results The main research results to be presented in Chapters 3, 4 and 5 are summarized below. 1.2.1 Features and Classification for Amino Acid Substitutions The problem of intolerant and tolerant amino acid substitution detection can be treated as a binary classification problem, and machine learning methods are implemented on extracted feature vectors. Models are trained from substitution samples with a known target value, e.g. intolerant or tolerant (deleterious or neutral). In Chapter 3, we propose a new set of features using multiple sequence alignment (MSA) and amino acids’ physicochemical properties. Six amino acid’s properties are considered, namely, the molecular weight, the pI value, hydrophobicity, residue’s rela- tive frequency in helices, residue’s relative frequency in strands and residue’s relative frequency in turn. The following feature groups are generated. 2 ² The window-sized group Features are the averaged properties of a 9-residue window centered at the studied amino acid. ² The original amino acid group Features are properties of the wild-type residues. ² The substitution group Features are properties of the substituted residues. ² The column-weighted group Features are the weighted average properties, where the weights are the frequen- cies of 20 possible amino acids in a column of the MSA. ² The conservation group This group contains the original amino acid and substitution’s relative frequencies in MSA, and the entropy of the column. For conservation group, entropy [CT91] is defined as¡ P 20 i=1 p i log 2 p i , in whichp i is the frequency of each possible amino acid in the column of the MSA. Four classification methods are investigated in this work. ² Decision Tree (DT) Algorithm A non-parametric learning algorithm, known as the decision tree, is used to parti- tions amino acid substitutions according to their attribute values. Classification is derived if test samples match some rules, which are a path from the root node to a leaf in the decision tree. ² Support Vector Machine (SVM) The support vector machine (SVM) offers a kernel-based parametric method. It 3 maps substitution data to a high dimensional feature space. The hyperplane deci- sion function is defined by support vectors, which are around decision boundaries. The margin between boundaries of intolerant and tolerant substitutions is maxi- mized. Test samples are classified according to the output (0 or1) of the SVM. 
² The Gaussian mixture model (GMM) The Gaussian mixture model (GMM) provides another commonly used classi- fier, where the mixture components of each class are trained by the Expectation- Maximization (EM) algorithm. The Gaussian mixture model is initialized by the K-means algorithm to form an initial partition of the data and assign the model parameters from this partition. For high dimensional data, large number of param- eters need to be estimated, while the available sample size, computational and storage requirements limit the performance. The Branch & Bound algorithm selects ‘optimal’ feature sets before training and testing mixture models. ² Random Forests A random forest [Bre01a] is a classifier consisting of a collection of tree- structured classifiersh(x;µ k );k=1;::: where theµ k are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at inputx. Random forests [BDF + 05] are a type of high-dimensional non- parametric predictive model made up of a collection of classification or regression trees. Random forests have some attractive features such as their top performance of prediction, built-in estimate of prediction error, and feature ranking according to the measures of predictive importance of variables. Validations of the DT, SVM, GMM and Random Forests classifiers using our feature sets are conducted and compared with published data. 4 1.2.2 Rule Extraction Based on Simulate Annealing Bump Hunting With the set of simple and interpretable features available, we can further explore the chemical and physical principles behind the amino acid substitutions and associate sub- stitutions with physicochemical and biological knowledge. To achieve this objective, we propose simulated annealing bump hunting method to induce a set of rules for amino acid substitutions from experimental data samples automatically. These rules consist of only a small set of features so that they can be easily interpreted using chemical and physical terms. In addition, these rules have high quality (with very few exceptions) and, at the same time, they are general enough to cover a large number of substitutions. We view rules as sub-regions (boxes) in the feature space composed of experimental amino acid substitutions. Specifically, boxes are defined in terms of feature intervals. Finding boxes in the feature space, called the Patient Rule Induction Method (PRIM) or the “bump hunting” method, was proposed in [FF99]. PRIM searches for optimal boxes using a steepest-ascent approach. It has been used in several studies such as microarray analysis [LMH + 04] and the gene expression analysis [KU03], etc. There are however several limitations with the bump hunting method. First, since the steepest- ascent strategy is adopted, it may lead to the local optimum. Second, it may generate complicated rules which could be difficult to explain. Finally, it cannot select features automatically. In Chapter 4, we propose a bump hunting method based on simulated annealing. That is, it utilizes the simulated annealing strategy to find “boxes” that have a high aver- age target value of samples inside. For intolerant substitutions, the target value is1 and 0 for the intolerant and tolerant models, respectively. For tolerant substitutions, the tar- get value is defined in the opposite way. For intolerant (or tolerant) models, the original box contains all samples, and the original box-mean is the proportion of the intolerant (or tolerant) substitutions. 
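As a minimal illustration of the two quantities just defined, the sketch below computes the support (proportion of samples covered by a box) and the box mean for a candidate box expressed as a set of per-feature intervals. The function and variable names are illustrative placeholders, not the thesis implementation.

    import numpy as np

    def box_support_and_mean(X, y, box):
        """Support (coverage) and box mean of a candidate rule.

        X   : (n_samples, n_features) feature matrix
        y   : binary target values (1 = intolerant under the intolerant model)
        box : dict mapping feature index -> (low, high) interval;
              features absent from the dict are left unconstrained.
        """
        inside = np.ones(len(X), dtype=bool)
        for j, (low, high) in box.items():
            inside &= (X[:, j] >= low) & (X[:, j] <= high)
        support = inside.mean()                       # fraction of samples inside the box
        box_mean = y[inside].mean() if inside.any() else 0.0   # target proportion inside
        return support, box_mean

    # Toy usage: two features, a box constraining only feature 0.
    rng = np.random.default_rng(0)
    X = rng.random((100, 2))
    y = (X[:, 0] > 0.7).astype(int)   # pretend large feature-0 values are intolerant
    print(box_support_and_mean(X, y, {0: (0.7, 1.0)}))

A larger box mean indicates a purer rule and a larger support indicates a more general one; the simulated annealing search described next trades these two quantities off against the number of features defining the box.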
Boundaries of the box can be shrunk or extended, which 5 will remove or add samples for the box. Each boundary corresponds to a selected fea- ture. Thus, the simulated annealing bump hunting method automatically selects features to generate simple rules, where rules can be interpreted with biological and chemical meanings. Simply speaking, it is an integrated method of rule extraction and feature selection. By defining an energy function in terms of the box mean and the feature num- ber, the Metropolis algorithm finds candidate boxes with a high box mean and a large box size, which is the proportion of samples covered by the box. 1.2.3 Predictive Model Design and Prioritization Analysis Predicting and prioritizing disease-associated amino acid substitutions are critical for studying the human diseases. We designed a Multiple Selection and Rule V oting (MS-RV) model, which integrates data partition and feature selection, and uses rule voting strategy to make prediction. MS-RV is made up of 20 modules that corre- spond to mutation data sets with 20 types of original amino acids. Each module is an ensemble of trees, which are trained on bootstrapped data sets. Trees are grown using CART [BFSO84] methodology to maximize tree size without pruning. At each node, a small number of features are randomly selected from the optimal feature set to split on. A module’s optimal feature set is generated through sequential forward selec- tion (SFS)[PNK94, OLM04], based on a given optimization criterion, which is the area under the ROC curve (AUC) from 10-fold cross validation using MS-RV model. First, a single best feature corresponding to the highest AUC is selected, then add one more feature with largest increase on AUC. Iterate the updating procedure until the AUC stops increasing or all features are included. For a testing data, each tree gives its probabilities to be disease-associated or polymorphism, and the prediction is made by averaging the probabilities from all trees in the module. 6 For the unbiased experimental amino acid substitutions on the lac repressor and the T4 lysozyme, the10-fold cross-validation accuracies of MS-RV model are96% and93% respectively. Large scale size of amino acid substitutions (4;581 polymorphism ones, 9;640 disease-associated ones, and 1;517 unclassified ones) occurring in a wide range of human proteins were collected from the Swiss-Prot database. We studied thirty 10- Mb chromosomal regions, each of them centers at a gene whose amino acid substitutions cause a monogenetic human disease and has comparable numbers of disease-associated substitutions (37 in each region on average) and polymorphism ones (35 in each region on average). Other non-synonymous single nucleotide polymorphisms and the coded amino acid substitutions were collected from the HapMap and Ensembl databases, alto- gether there are37 disease-associated substitutions and111 polymorphism ones in each regions. We prioritized the substitutions and analyzed the normalized ranks. The over- all area under ROC curve (AUC) scores are86:6% and82:3% for30 data sets collected from the Swiss-Prot database and all three databases, respectively. The results indicate that MS-RV model effectively prioritizes disease-associated amino acid substitutions with the proposed feature set. 1.3 Contributions of the Research There are several major contributions made in this thesis proposal. 1. 
Novel feature set definition We introduce a novel set of features, which can be calculated in a very simple way, where intermediate prediction parameters from other programs are not needed. Only multiple sequence alignment and amino acids’ physicochemical properties are used in the feature selection process. 2. Comparable validation accuracies 7 Machine learning methods such as decision tree, support vector machine, Gaus- sian mixture model and random forests are trained and tested on sample vectors with the proposed feature sets, and validation accuracies are compared with pub- lished data. Our proposed feature set offers excellent classification results. 3. GMM for amino acid substitution classification The GMM is first used for amino acid substitution study in our study. By con- sidering the limited number of training samples, computation and storage com- plications, we develop a feature selection method to train complicated mixture component parameters. 4. Automatic feature selection in bump hunting The number of features is embedded in the optimization function of bump hunting. The proposed simulated annealing bump hunting method searches the optimal solution and selects features automatically. As compared with wrapper or filter feature selection algorithms, the integrated feature selection and rule extraction approach is more robust. It achieves automatic learning. 5. Simulated annealing strategy The simulated annealing strategy overcomes the local optimum problem of the steepest-ascent strategy in bump hunting. Free energy is defined in terms of the box mean and the feature number function. The probability of accepting a new state is determined by energy change and the pseudo temperature. This accelerates the search procedure and will not be trapped to a local maximum. 6. Interpretable prediction rules 8 Each box corresponds to a rule. Rule defined by a small number of features can be easily explained in biological and physicochemical meanings. Interpretable rules are more suitable to direct biological research. 7. Large scale size of data processing Large scale size of mutation data are collected from multiple databases. Defined on the proposed feature set, the amino acid substitutions are organized into various groups, such as those related to specific diseases, those from specific genes, those in neighboring chromosomal regions. 8. MS-RV model design for effective prediction and prioritization MS-RV model integrates data partition and feature selection, and makes predic- tion using rule voting strategy. Its effectiveness of prediction and prioritization is verified on unbiased experimental amino acid substitutions and human disease- associated substitutions. 1.4 Outline of the Proposal Our research motivations, objectives, methods and contributions were stated in this introduction chapter. The remainder of this proposal is organized as follows. In Chapter 2, we briefly review the research background, that includes machine learning methods applied to our feature sets and the simulated annealing bump hunting algorithm. Then, we introduce a novel feature set and compare validation accuracies of several machine learning methods in Chapter 3. The simulated annealing bump hunting algorithm and interpretable rules are described in Chapter 4. The MS-RV model design and the anal- ysis of prediction and prioritization are given in Chapter 5. Concluding remarks and future research directions are given in Chapter 6. 
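Before turning to the background review, the sequential forward selection loop at the core of each MS-RV module (Section 1.2.3) can be summarized in a short sketch. The cv_auc callable below stands in for the 10-fold cross-validation AUC of a module trained on a given feature subset; all names are illustrative, and this is a sketch of the selection strategy rather than the MS-RV code itself.

    def sequential_forward_selection(all_features, cv_auc):
        """Greedy forward selection driven by cross-validated AUC.

        all_features : iterable of candidate feature indices
        cv_auc       : callable taking a feature subset and returning its
                       cross-validation AUC (stand-in for the MS-RV scorer)
        """
        selected, best_auc = [], 0.0
        remaining = list(all_features)
        while remaining:
            # Score every one-feature extension of the current subset.
            scores = {f: cv_auc(selected + [f]) for f in remaining}
            f_best, auc_best = max(scores.items(), key=lambda kv: kv[1])
            if auc_best <= best_auc:      # stop once no feature improves the AUC
                break
            selected.append(f_best)
            remaining.remove(f_best)
            best_auc = auc_best
        return selected, best_auc

    # Illustrative use with a dummy scorer that favors features 2 and 5.
    toy_auc = lambda subset: 0.5 + 0.1 * len({2, 5} & set(subset)) - 0.01 * len(subset)
    print(sequential_forward_selection(range(10), toy_auc))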
9 Chapter 2 Background Review 2.1 Fundamentals of Molecular Biology The central dogma of molecular biology is illustrated in Fig. 2.1, which describes the flow of genetic information in a cell, from DNA through RNA to protein. This process consists of three steps: transcription, translation, and replication. Some recent studies of RNA processing indicates that a fourth step called splicing could be included. DNA DNA DNA DNA RNA RNA RNA RNA Protein Protein Protein Protein Replication: DNA duplicates Transcription: RNA synthesis Translation: Protein synthesis Figure 2.1: The central dogma of molecular biology. In DNA [TMBW00], there are four heterocyclic bases: adenine (A) and guanine (G) are purines while cytosine (C) and thymine (T) are pyrimidines. In RNA, thymine is replaced by the structurally similar pyrimidine, uracil (U). The nucleic acid sequence is a sequence of bases A, C, G, T/U in DNA or RNA chains. The sequence is conventionally 10 written from the free5 0 -end (the end of a polynucleotide with a free, or phosphorylated, or capped5 0 -hydroxyl group) to the free3 0 -end (the end of a polynucleotide with a free, or phosphorylated 3 0 -hydroxyl group) of the molecule. DNA often occurs in form of a double helix. Two separate and antiparallel chains of DNA are wound around each other in a right-handed helical (coiled) path, with the sugar-phosphate backbones on the out- side and the bases, paired by hydrogen bonding and stacked on each other, on the inside. Adenine (A) pairs with thymine (T); guanine (G) pairs with cytosine (C). The two chains are complementary; one specifies the sequence of the other. Most RNA molecules occur as a single strand, which may be folded into a complex conformation, involving local regions of intramolecular base pairing and other hydrogen bonding interactions. Transcription is the synthesis of a single-stranded RNA from a double-stranded DNA template. The information contained in a section of DNA is transferred to a newly assembled piece of messenger RNA (mRNA). It is facilitated by RNA polymerase and transcription factors. RNA synthesis occurs in the 5 0 ! 3 0 direction and its sequence corresponds to that of the DNA strand, which is known as the sense strand. Message in RNA is translated into a protein sequence according to the genetic code. There are three stages of protein synthesis: 1. initiation: the assembly of a ribosome on an mRNA; 2. elongation: repeated cycles of amino acid delivery, peptide bond formation and movement along the mRNA; 3. termination: the release of the polypeptide chain. During the transmission of the genetic information between parents and progeny, DAN must be replicated. Thus, the cycle “DNA ! RNA ! protein” can repeat in a new generation of cells or organisms. 11 The genetic code is the way in which the nucleotide sequence in nucleic acids spec- ifies the amino acid sequence in proteins. It is a triplet code, where the codons are adjacent and are not separated by punctuation. There are20 amino acids and64 codons, some of them specify the same amino acid, which means that the genetic code has redun- dancy. Fig. 2.2 shows the DNA code and amino acids coded by several different codons. Fig. 2.3 shows those from the RNA code. Each amino acid can be denoted by 1-letter or 3-letter codon. These symbols are shown in Fig. 2.4. 
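As a small illustration of the triplet code and its redundancy, the fragment below maps a handful of DNA codons (taken from the standard table in Fig. 2.2) to amino acids. A single-base change such as GAA to GTA switches the encoded residue from Glu to Val, which is precisely the kind of event that produces an amino acid substitution. The snippet is a deliberately incomplete sketch, not a full 64-codon translator.

    # A fragment of the standard genetic code (DNA codons -> amino acids).
    # Note the redundancy: GAA and GAG both encode Glu; GTA and GTG both encode Val.
    CODON_TABLE = {
        "GAA": "Glu", "GAG": "Glu",
        "GTA": "Val", "GTG": "Val",
        "TTT": "Phe", "TTC": "Phe",
        "TGG": "Trp",                # Trp is encoded by a single codon
    }

    def translate(codon):
        return CODON_TABLE.get(codon, "unknown")

    # A single-base change in the second position turns Glu into Val,
    # i.e., an amino acid substitution at the protein level.
    print(translate("GAA"), "->", translate("GTA"))   # Glu -> Val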
TTT Phe TCT Ser TAT Tyr TGT Cys T TTC Phe TCC Ser TAC Tyr TGC Cys C TTA Leu TCA Ser TAA STOP TGA STOP A TTG Leu TCG Ser TAG STOP TGG Trp G CTT Leu CCT Pro CAT His CGT Arg T CTC Leu CCC Pro CAC His CGC Arg C CTA Leu CCA Pro CAA Gln CGA Arg A CTG Leu CCG Pro CAG Gln CGG Arg G ATT Ile ACT Thr AAT Asn AGT Ser T ATC Ile ACC Thr AAC Asn AGC Ser C ATA Ile ACA Thr AAA Lys AGA Arg A ATG Met/S TR ACG Thr AAG Lys AGG Arg G GTT Val GCT Ala GAT Asp GGT Gly T GTC Val GCC Ala GAC Asp GGC Gly C GTA Val GCA Ala GAA Glu GGA Gly A GTG Val GCG Ala GAG Glu GGG Gly G Code Acid Code Acid Code Acid Code Acid T C A G DNA Code T C A G Figure 2.2: DNA code and codons for amino acid. A typical amino acid has a primary amino group, a carboxyl group, a hydrogen atom and a side-chain (R group) attached to a central ®-carbon atom (C ® ). Proline (Pro) is the exception to the rule in that it has a secondary amino group. The side-chain confer different physical and chemical properties. 20 amino acids can be categorized into different groups according to their physicochemical properties. Fig. 2.5 illustrates qualitative classification of20 standard amino acids. 12 UUU Phe UCU Ser UAU Tyr UGU Cys U UUC Phe UCC Ser UAC Tyr UGC Cys C UUA Leu UCA Ser UAA STOP UGA STOP A UUG Leu UCG Ser UAG STOP UGG Trp G CUU Leu CCU Pro CAU His CGU Arg U CUC Leu CCC Pro CAC His CGC Arg C CUA Leu CCA Pro CAA Gln CGA Arg A CUG Leu CCG Pro CAG Gln CGG Arg G AUU Ile ACU Thr AAU Asn AGU Ser U AUC Ile ACC Thr AAC Asn AGC Ser C AUA Ile ACA Thr AAA Lys AGA Arg A AUG Met/S TR ACG Thr AAG Lys AGG Arg G GUU Val GCU Ala GAU Asp GGU Gly U GUC Val GCC Ala GAC Asp GGC Gly C GUA Val GCA Ala GAA Glu GGA Gly A GUG Val GCG Ala GAG Glu GGG Gly G Code Acid Code Acid Code Acid Code Acid RNA Code U C A G U C A G Figure 2.3: RNA code and codons for amino acid. We use amino acid’s six physicochemical properties in our study. They are the molecular weight, the pI value, hydrophobicity, amino acid’s relative frequencies in protein’s secondary structures of helices, strands and turns as detailed below. 1. Molecular weight. Amino acid’s molecular weight is the mass of one molecule of that substance. It relates to the volume, a quantification of how much space the object occupies. Fig. 2.6 shows amino acids in the order of increased molecular weights along the arrow direction. 2. ThepI value. In chemistry and biochemistry, a dissociation constant or an ionization constant is a specific type of equilibrium constant used for reversible reactions or processes. It refers to the extent to which a complex, molecule, or salt separates or splits 13 Amino Acid Name 3-letter Abbr. 1-letter Abbr. Alanine Ala A Arginine Arg R Asparagine Asn N Aspartic acid Asp D Cysteine Cys C Glutamic acid Glu E Glutamine Gln Q Glycine Gly G Histidine His H Isoleucine Ile I Leucine Leu L Lysine Lys K Methionine Met M Phenylalanine Phe F Proline Pro P Serine Ser S Threonine Thr T Tryptophan Trp W Tyrosine Tyr Y Valine Val V Figure 2.4: 1-letter and 3-letter abbreviations of amino acid. into smaller molecules, ions, or radicals in a reversible manner. The dissociation constant is represented by symbolK d . Given reaction A x B y ,xA+yB; parameterK d can be calculated as K d = [A] x [B] y A x B y ; where [A], [B], and [A x B y ] indicate concentrations of A, B, and A x B y , respec- tively. ThepK value is defined as pK =¡log 10 K d ; 14 Figure 2.5: Qualitative classification of20 amino acids. 
G A S P V T C I L N D Q K E M H F R Y W Less molecular weight More molecular weight Figure 2.6: The order of molecular weights of20 amino acids. whereK d is the dissociation constant. Acid dissociation constant, the acidity constant, or the acid-ionization constant, denoted byK a , is a specific type of equilibrium constant that indicates the extent of dissociation of hydrogen ions from an acid. Its common logarithm can be written as pK a =¡log 10 K a : 15 The isoelectric point pI is the pH at which a molecule carries no net electrical charge. For an amino acid with only one amine and one carboxyl group, the pI can be calculated from thepK a ’s of this molecule as pI = P pK a 2 : For amino acids with more than two ionizable groups such as lysine, the same formula is used. However, the two pKa’s used are those of the two groups that lose and gain a charge from the neutral form of the amino acid this time. Proteins can be separated according to their isoelectric point in a process known as isoelectric focusing. Fig. 2.7 shows amino acids in the order of increased pI values in the direction of the arrow. D E C N F T Q Y S M W V G L I A P H K R Less pI-Value More pI-Value Figure 2.7: The order of pI values of20 amino acids. 3. Hydrophobicity. In chemistry, hydrophobicity refers to the physical property of a molecule that is repelled by water. Hydrophobic molecules in water often cluster together. On the other hand, hydrophilicity refers to a physical property of a molecule that can transiently bond with water (H 2 O) through hydrogen bonding. This is thermo- dynamically favorable, and makes these molecules soluble not only in water, but also in other polar solvents. Fig. 2.8 shows amino acids in the order of increased hydrophobicity scale (the kyte and Doolittle scale) values in the direction of the arrow. 16 R K D E N Q H P Y W S T G A M C F L V I Most Hydrophilic Most Hydrophobic Figure 2.8: The order of the hydrophobicity scale of20 amino acids. 4. Amino acid’s propensity in secondary structures. A protein is a linear sequence of amino acids linked together by peptide bonds [HH00]. The peptide bond is a covalent bond between the ®-amino group of one amino acid and the ®-carboxyl group of another. The linear sequence of amino acids joined together by peptide bonds is termed the primary structure of the protein. The secondary structure in a protein refers to the regular folding of regions of the polypeptide chain. Three common types of secondary structures are the®-helix, the¯-pleated sheet (strand), turns and loops [Lev78]. The ®-helix is a cylindrical, rod-like helical arrangement of the amino acids in the polypeptide chain, which is maintained by hydrogen bonds parallel to the helix axis. The right-handed ®-helix has 3:6 amino acids per turn and is stabilized by hydrogen bonds between peptide N- H and C=O groups three residues apart. In a ¯-pleated sheet, hydrogen bonds form between adjacent sections of polypeptides that are either running in the same direction (parallel ¯-pleated sheet) or in the opposite direction (antiparallel ¯- pleated sheet). Turns and loops invariably lie on the surfaces of proteins and thus often participate in interactions between proteins and other molecules. Different sections of the secondary structure and connecting regions fold into a well-defined tertiary structure, which is the three-dimensional arrangement of all amino acids in the polypeptide chain, with hydrophilic amino acids mostly on 17 the surface and hydrophobic ones in the interior. 
This biologically active, native conformation is stabilized by noncovalent interactions and sometimes disulfide bonds. Denaturation leads to loss of secondary and tertiary structures. If a pro- tein is made up of more than one polypeptide chain, it is said to have quaternary structure. This refers to the spatial arrangement of the polypeptide subunits and the nature of the interactions between them. Indices of amino acid’s propensity indicate the relative frequencies of a particu- lar kind of amino acid residue in a certain secondary structure (e.g. ® helices, ¯ strands and turns, etc.). Fig. 2.9, Fig. 2.10 and Fig. 2.11 show amino acids’ increased preference for helices, strands and turns, respectively, in the direction of the arrows. P G Y S T N V R I W D F C H K Q A L E M Do not prefer α-helices Prefer α-helices Figure 2.9: Increased preference for helices of20 amino acids. P D C E N K Q A G S M R L H W T Y F I V Do not prefer β-strands Prefer β-strands Figure 2.10: Increased preference for strands of20 amino acids. M V I F L H W A C R K Q E T Y N S D G P Do not prefer turns Prefer turns Figure 2.11: Increased preference for turns of20 amino acids. 18 2.2 Feature Set for Amino Acid Substitutions The replacement of the original amino acid of a protein with another residue may pre- serve or change the protein’s function. There are pathological substitutions, functional polymorphisms, and neutral polymorphisms for amino acid substitutions. Missense mutation that leads to intolerant amino acid polymorphism is the type of mutation that is often related to human diseases [YSD + 04]. Amino acid variants may have an impact on the folding of polypeptide chains, the structure and form of interaction sites, and the overall solubility or stability of the protein. An intolerant amino acid substitution may result in the following damaging effects [SRK + 01]. 1. It may occur in an annotated active or binding site. 2. It may affect interaction with ligands present in the crystallographic structure. 3. It may lead to hydrophobicity or electrostatic charge change in a buried site. 4. It may destroy a disulphide bond. 5. It may affect the protein’s solubility. 6. It may inserts proline in an®-helix. 7. It may be incompatible with the profile of amino acid substitutions observed at this site in the set of homologous proteins. Functionally important residues (e.g. in ligand binding and protein-protein interactions) [BGR + 04] are often evolutionarily conserved and most likely to be solvent-accessible. Conserved residues within the protein core most probably have an important role in maintaining protein’s fold. 19 Detection and prediction of amino acid substitutions play an important role in under- standing protein functions and directing clinical/pharmaceutical research. Based on biological and chemical knowledge, computational methods complement experimen- tal determinations. For instance, Chasman and Adams [CA01] derived sequence and structure-based features from a structural model and the phylogenetic information. Sun- yaev et al. [SRK + 01], [RBS02] analyzed amino acid substitutions based on the protein three-dimensional structure and multiple alignments of homologous sequences. Ferrer- Costa et al. [FCOdlC02], [FCOdlC04] characterized disease-associated substitutions in terms of the substitution matrix, the secondary structure, accessibility, free energies of transfer from water to octanol, etc. 
Saunders and Baker [SB02] created mutation models by means of a variety of structural and evolutionary features. Krishnan and Westhead [KW03] used physicochemical classes of residues, the sequence conservation score, the secondary structure, solvent accessibility, and buried charge, among others. Wang and Moult [WM01] studied the structural and functional effects of missense single nucleotide polymorphisms (SNPs) in human proteins. Martin et al. [MFC+02] reported a systematic automated analysis of the effects of p53 mutations on the structure of the core domain of the protein. Ng and Henikoff [NH01] utilized sequence conservation and the BLOSUM amino acid substitution matrices. Lau and Chasman [LC04] gave an operational definition of biological function that provides an explicit link between the functional classification of proteins and the effects of genetic variation or mutation on protein function. With a feature set ready, prediction is conventionally performed either with standard machine learning methods such as the support vector machine [Vap98] and the decision tree [Mit97], with statistical and classification models [CA01], [SRK+01], [SB02], or with specifically designed methods such as SIFT (sorts intolerant from tolerant substitutions) [NH01]. Fig. 2.12 summarizes the five groups of amino acid substitution features used in these studies.

Figure 2.12: Summary of feature sets in amino acid substitution studies, where features are categorized into five groups: sequence-based, physicochemical, evolutionary information, statistical score, and structure-based. (The figure tabulates the individual features used by each of the studies cited above, including Chasman, Sunyaev/Ramensky, Ferrer-Costa, Terp, Krishnan, Saunders, Wang, Martin, Ng, and Lau, and marks which study uses which feature.)

Whatever classification method is used, the quality of the features plays an important role in predicting the potential effect (either intolerant or tolerant) of a given substitution. To construct these features, amino acid substitutions were mapped to protein 3D structures [CA01], [SRK+01], [RBS02], [SB02]; evolutionary properties were measured from statistical models [CA01], [SRK+01]; the secondary structure and accessibility were computed from various prediction programs [FCOdlC02], [SB02]; and database annotations were also included [FCOdlC04]. However, the availability of protein or homologous-protein structures limits the scope of application of these methods. In addition, most of these prediction methods are complicated and their results are difficult to interpret, since a large number of complicated features are used and some of them rely heavily on other computational models. Although simple features were used in some methods with specifically designed statistical models [NH01], the prediction accuracy is not as high as that of methods using combined multiple features [TCC+02], [CA01], [SRK+01], [RBS02], [FCOdlC02], [FCOdlC04], [SB02], [KW03].

A good feature set should contain as few features as possible, while each feature should have a clear physicochemical meaning that can be easily interpreted in biological terms. To design such a feature set, we make use of three physicochemical characteristics of the amino acids (the molecular weight, the pI value, and the hydrophobicity scale) and three relative frequencies of the amino acids in the secondary structures of proteins (helices, strands, and turns). We compare the quality of these features with other more complicated features by applying two machine learning methods (the support vector machine [Vap98] and the decision tree [Mit97]) to experimental substitution data sets of the lac repressor [MKC+94], [SMK+96] and the T4 lysozyme [RBHP91]. The results show that our simple feature set is comparable with other, more complicated feature sets.

2.3 Pattern Recognition Methods in Machine Learning

Prediction of amino acid substitutions is conducted by inductive machine learning methods, which extract rules and patterns out of massive data sets. Supervised learning can be modeled as a pattern recognition system, which operates in two modes: the training mode and the classification mode. A statistical pattern recognition system is shown in Fig. 2.13, which is borrowed from [JDM00]. The preprocessing module defines a compact representation of the pattern. In the training mode, the feature extraction/selection module finds the appropriate features and a classifier is trained to partition the feature space. The feedback path allows the optimization of the preprocessing and feature extraction/selection strategies. In the classification mode, the trained classifier assigns the input pattern to one of the pattern classes under consideration, based on the measured features.

Figure 2.13: The block diagram of a statistical pattern recognition system.

The decision boundaries of a classifier can be obtained directly (i.e., by the geometric approach) or indirectly (i.e., by the probabilistic density-based approach) [JDM00].
Various statistical pattern recognition approaches are shown in Fig. 2.14. Four machine 23 learning methods will be used in our following research and, therefore, reviewed in this section. They are the decision tree method, the support vector machine method, Gaus- sian mixture model and random forests. Figure 2.14: A taxonomy of statistical pattern recognition techniques. 2.3.1 Decision Tree (DT) Method The decision tree (DT) method [HTF01] partitions a feature space into a set of rect- angles, each of them is fitted by a simple model. They are conceptually simple yet powerful. In a two-dimensional feature space, recursive binary partitions generate M 24 regions, each of them,R m with response meanY . The regression model predictsY with constantc m inR m . That is, ^ f(X)= M X m=1 c m If(X 1 ;X 2 )2R m g: The same model can be represented by a binary tree. Each node corresponds to an attribute, observations satisfying the condition at this junction are assigned to the left branch, and otherwise to the right branch. The leaves of the tree correspond to regions. This recursive binary tree is interpretable. The method needs to automatically decide the splitting variables and split points, and also the topology of the tree. The criterion of minimizing the sum of squares P (y i ¡f(x i )) 2 estimates the best ^ c m to be the average ofy i in regionR m : ^ c m =ave(y i jx i 2R m ): The best binary partition, the determination of splitting variables and split points, in terms of the minimum sum of squares can be solved by a greedy algorithm. Each time, data are partitioned into two resulting regions, and the splitting process is repeated on each of the two regions. The tree complexity is governed by a tunable parameter, i.e. the tree size. A very large tree might overfit the data while a small tree might not capture the important struc- ture. One naive approach is to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold. The preferred strategy is to grow a large tree T 0 , stopping the splitting process only when the minimum node size is reached. This large tree is pruned using a cost-complexity pruning technique. There are some limitations of the decision tree method. One major problem is the high variance of the resulting tree. Often a small change in the data can result in a very different series of splits, the instability comes from the hierarchical nature of the 25 process. The effect of an error in the top split propagates down to all splits below it. Another limitation is the lack of smoothness of the prediction surface, which can degrade the performance in the regression setting, whose underlying function is expected to be smooth. Trees have difficulty in modeling the additive structure. The decision tree is trained by an iterative selection of individual features that are most salient at each node of the tree [JDM00]. The criteria for feature selection and tree generation include the information content, the node purity, or Fisher’s criterion. Dur- ing classification, just those features that are needed for the test pattern are considered. Thus, feature selection is implicitly built-in in the training mode. The most commonly used decision tree classifiers are binary in nature. A single feature is used at each node, which results in a decision boundary that is perpendicular to the feature axis. Conse- quently, such decision trees are intrinsically suboptimal for most applications. 
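As a concrete illustration of the ideas above, the following minimal sketch grows a small binary tree and prints its root-to-leaf paths as if-then rules. It is only a sketch: it uses scikit-learn's DecisionTreeClassifier rather than the C4.5 software employed later in this work, and the feature matrix X, labels y and feature names are hypothetical stand-ins for a substitution feature table.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical numeric features for 200 substitutions; label 1 = intolerant, 0 = tolerant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Each internal node tests one feature against a threshold, so every decision
# boundary is perpendicular to a feature axis; max_depth bounds the tree size.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10).fit(X, y)

# Each root-to-leaf path is an interpretable conjunction of attribute tests.
print(export_text(tree, feature_names=["hydrophobicity_change", "column_entropy"]))

Limiting the depth (or pruning a large tree afterwards) plays the role of the complexity control discussed above.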
However, besides its speed, the main advantage of the tree classifier is the possibility to interpret the decision rule in terms of individual features. This makes decision trees attractive for interactive use by experts. Decision tree classification systems such as CART and C4.5 are available in the public domain and often used as a benchmark. 2.3.2 Support Vector Machines Support vector machines (SVMs) [CST00a] can train linear learning machines in a kernel-induced feature space efficiently. SVMs utilize generalization theory and opti- mization theory. They produce “sparse” dual representations of the hypothesis with efficient algorithms. The Mercer’s conditions on kernels (continuous symmetric kernel of a positive integral operator) overcome the local minim, and the Bayesian learning strategy is used. A learning algorithm should output consistent hypotheses, which provide an accu- rate fit to the training data. The generalization of a hypothesis is to classify data not 26 in the training set correctly. In supervised learning, linear functions are the simplest hypotheses, which form a linear combination of input variables. General applications require more abstract features of data. Kernel functions map the data into a high dimen- sional feature space in a non-linear manner. Kernel representations can be addressed in a general and self-contained way, and used with different theory. For specific applications, the design of learning algorithms can be transformed to the design of an appropriate kernel function. The use of kernels overcomes the curse of dimensionality in both computation and generalization. The increased flexibility of the kernel-induced feature space should be controlled to guarantee good generalization. The theory of Vapnik and Chervonenkis (VC) appropriately describes SVMs. It places reliable bounds on the generalization of linear classifiers and indicates how to control the complexity of linear functions in the kernel space. For independent and identically distributed (i.i.d.) training and test data from an unknown but fixed distributionD, the target function isy = t(x) and the classification function isy =h(x). We useS to denote a randomly generated training set, S =((x 1 ;y 1 );:::;(x l ;y l )): The error function ofh inD is err D (h)=Df(x;y):h(x)6=yg: The generalization error of selected hypothesis h s is bounded by probably approxi- mately correct (pac): err D (h s )·²(l;H;±): 27 For hypothesish in the function classH, a set of pointsfx 1 ;:::; x l g , the set f(h(x 1 );:::;h(x l )):h2Hg=f¡1;1g l is shattered by H. Suppose a finite d is the largest size of the shattered set and ± is a small number. Then, the Vapnik and Chervonenkis (VC) theorem can be stated as follows. Theorem 1 Let H be a hypothesis space having VC dimension d. For any probability distribution D on X£f¡1;1g, with probability 1¡± over l random samples S, any hypothesish2H that is consistent withS has error no more than err D (h)·²(l;H;±)= 2 l (dlog 2el d +log 2 ± ); providedd·l andl > 2 ² . VC theory implies that, for an infinite set of hypotheses, the problem of overfitting is avoidable and the measure of complexity that should be used is exactly the VC dimen- sion. The size of the training set required to ensure good generalization scales linearly with this quantity in the case of a consistent hypothesis. 
For functionf that models the data y =f(x)=hw:xi+b; the penalized loss function is L(w;b)=¸hw:wi+ X i=1 l(hw:x i i+b¡y i ) 2 : 28 The margin of an example(x i ;y i )2X£f¡1;1g with respect tof is ° i =y i f(x i ): ° i > 0 implies correct classification of f(x i ). The hypothesis function is chosen to minimize (or maximize) a certain function such as the cost function in a linear learning machine. To train SVMs, the cost function can be a convex quadratic function, and the corresponding optimization is a convex quadratic programme. A real-valued function f(w) is convex for w2R n , if8w; u2R n , and for anyµ2(0;1), f(µw+(1¡µ)u)·µf(w)+(1¡µ)f(u): Besides providing mathematic techniques, optimization theory also defines the nec- essary and sufficient conditions of a given function to be a solution. Consider the train- ing sample S =((x 1 ;y 1 );:::;(x l ;y l )) the primal optimization problem for the maxima margin case can be written as minimize w;bhw:wi; subject toy i (hw:x i i+b)·1; i=1;:::;l: To optimize the margin slack vector, we obtain subject toy i (hw:x i i+b)·1¡» i ; i=1;:::;l; where » i · 0, i = 1;:::;l. The SVM system design focuses on the choice of kernels and the optimization criteria. 29 2.3.3 Mixture Models and Gaussian Mixture Model (GMM) A mixture model [HTF01] provides an useful tool for density estimation. It is also a kind of kernel method. Fork-component mixture models [FJ02], the probability density function can be written as p(yjµ)= k X m=1 ® m p(yjµ m ) with the mixing proportions® m , P m ® m =1, and eachµ m is a set of parameters defin- ing the mth component. Then, µ ´ fµ 1 ;:::;µ k ;® 1 ;:::;® k g is the complete set of parameters. Under the assumption that all the components have the same functional form, each one is characterized by the parameter vectorµ m . For a set ofn i.i.d. samples Y =fy (1) ;:::;y (n) g, the log-likelihood of ak-component mixture is logp(Yjµ)=log n Y i=1 p(y (i) jµ)= n X i=1 log k X m=1 ® m p(y (i) jµ m ): The maximum likelihood (ML) estimate can be written as ^ µ ML =argmax µ flogp(Yjµ)g; which may not be found analytically. Given priorp(µ) on the parameters, the Bayesian maximum a posterior (MAP) estimate can be written as ^ µ MAP =argmax µ flogp(Yjµ)+logp(µ)g; which may not be obtained analytically, either. The expectation-maximization (EM) algorithm is an iterative procedure. It obtains ML or MAP estimates of mixture parameters through the search of local maxima of 30 logp(Yjµ) orlogp(Yjµ)+logp(µ). The EM algorithm interpretsY as incomplete data. For a finite mixture model, the missing part is a set ofn labels Z =fz (1) ;:::; z (n) g that are associated with the n samples, indicating which components produced each sample. Each label is a binary vector z (i) =[z i 1 ;:::;z i k ]; where z i m = 1 and z i p = 0, for p 6= m. This means that sample y (i) was produced by the mth component. If the complete data X = fY;Zg were observed, the complete log-likelihood for estimatingµ becomes logp(Y;Zjµ)= n X i=1 k X m=1 z i m log[® m p(y (i) jµ m )] A sequence of estimatesf ^ µ(t);t = 0;1;2;:::g are produced by alternatingly applying the E-step and the M-step, until some convergence criterion is met. ² The E-step: GivenY and current estimate ^ µ(t), we compute the conditional expectation of the complete log-likelihood. Since logp(Y;Zjµ) is linear with respect to missing Z, we can compute the conditional expectation W ´ E[ZjY; ^ µ(t)], and plug it into logp(Y;Zjµ). 
The resultingQ-function is: Q(µ; ^ µ(t))´E[logp(Y;Zjµ)jY; ^ µ(t)]=logp(Y;Wjµ): 31 The elements ofZ are binary, and their conditional expectations are given by w (i) m ´E[z (i) m jY; ^ µ(t)]=Pr[z (i) m =1jy (i) ; ^ µ(t)]= ^ ® m (t)p(y (i) j ^ µ m (t)) P k j=1 ^ ® j (t)p(y (i) j ^ µ j (t)) ; where® m is the a prior probability ofz (i) m = 1, andw (i) m is the a posterior proba- bility ofz (i) m =1 after observingy (i) . ² The M-step: In the case of MAP estimation, the parameter estimates are updated according to ^ µ(t+1)=argmax µ fQ(µ; ^ µ(t))+logp(µ)g: In the case of ML criterion, the parameter estimates are updated according to ^ µ(t+1)=argmax µ fQ(µ; ^ µ(t))g: In both cases, the following constraints must be satisfied: ® m ·0; m=1;:::;k; and k X m=1 ® m =1: For p-dimensional samples in the Gaussian mixture model [Fra98], each compo- nent’s covariance matrix,§ k , can be parameterized in terms of its eigenvalue decompo- sition § k =¸ k O k A k O T k ; where¸ k is a scalar,O k is the orthogonal matrix of eigenvectors,A k is a diagonal matrix whose elements are proportional to eigenvalues of § k . The orientation of the principal components of § k is determined by O k , while A k determines the shape of the density 32 contours. Then, ¸ p k jA k j is proportional to the volume of the corresponding ellipsoid. Table 2.1 shows the relationships between the orientation, the volume, and the shape for various multivariate models of Gaussian covariance matrices. Table 2.1: Parameterizations and geometric interpretation of the covariance matrix in the Gaussian mixture model. § k Distribution V olume Shape Orientation ¸I Spherical Fixed Fixed NA ¸ k I Spherical Variable Fixed NA ¸A Diagonal Fixed Fixed coordinate axes ¸ k A Diagonal Variable Fixed coordinate axes ¸A k Diagonal Fixed Variable coordinate axes ¸ k A k Diagonal Variable Variable coordinate axes ¸OAO Elliptical Fixed Fixed Fixed ¸ k O k A k O k Elliptical Variable Variable Variable ¸O k AO k Elliptical Fixed Fixed Variable ¸ k O k AO k Elliptical Variable Fixed Variable For each component of the Gaussian mixture model [Toi04], the numbers of param- eters (V ) of covariance matrices are listed in Table 2.2, whereD is the dimension of the feature vectors. Table 2.2: The number of estimated parameters of covariance matrices in the Gaussian mixture model. Type of data and matrix Number of parameters Real valued data and arbitrary covariance matrices V =D+ 1 2 D 2 + 1 2 D Complex valued data and arbitrary covariance matrices V =2D+D 2 For high-dimensional data in GMM, there are many parameters to be estimated. Sometimes, the number of available data is not enough, and the computational and stor- age requirements become bottlenecks. Feature selection is pre-processed before utiliz- ing the mixture model. Consider the problem of selectingd features from an initial set 33 ofD measurements using objective functionJ as a criterion of effectiveness. Searching the feature subset makes use of the monotonicity property of a certain selection criterion function. Let ¹ X j be the set of features obtained by removingj featuresy 1 ;:::;y j from the setY of allD features, i.e., ¹ x j =Yjfy 1 ;:::;y j g For feature subsets ¹ x 1 ;¹ x 2 ;:::;¹ x j , where ¹ x 1 ¾ ¹ x 2 ¾:::¾ ¹ x j ; the monotonicity condition assumes that J(¹ x 1 )·J(¹ x 2 )·:::·J(¹ x j ): The Branch and Bound (B&B) algorithm [SPK04] selects the optimal feature subset in a fast way. 
The algorithm constructs a search tree where the root represents the set of allD features and leaves represent target subsets ofd features. While traversing the tree down from the root to leaves, the algorithm successively removes single features from the current set of “candidates” (X k in thekth level). The algorithm keeps the information about both the currently best subsetX and the criterion valueX ¤ (the bound) it yields. Anytime the criterion value in some internal node is found to be lower than the current bound, the whole subtree may be cut off and many computations may be omitted. In the “Improved” B & B algorithm, the criterion value decrease is defined to be the difference between the current criterion value and the value after removal of one feature. Bad features are those features whose removal from the current candidate set causes only a slight criterion value decrease while good features are those whose removal causes a 34 significant criterion value decrease. This algorithm aims to position bad features to the right, which is the less dense part of the tree, and good features to its left, which is the more dense part. Faster bound increase may be obtained by such an ordering since the preferential removal of bad features maintains the candidate criterion value at a high level. Removing good features from the candidate set at later stages of search improves the chance of the criterion value dropping below the current bound and, there- fore, allowing more effective subtree cut-offs. This effective algorithm uses the in-level node ordering heuristics. In our substitution samples, 15-D and 20-D features are selected by the B&B algo- rithm, Gaussian mixture models are trained for intolerant and tolerant classes, and test samples are classified by the maximum likelihood rule. 2.3.4 Random forests Random forests [Bre01b, Bre01a] are a combination of tree predictors. The kth tree depends on an independently sampled random vector, µ k , which is independent of the past random vectorsµ 1 ;:::µ k¡1 all with same distribution. The best split at each node is selected from among a random subset of the predictor variable. The training set used to grow each tree is a bootstrap sample of the observations. The accuracy of a random forest depends on the strength of the individual tree classifiers and a measure of the dependence between them. Random forests have been used to classify the hyperspectral data [JBS05], the multisource remote sensing and geographic data [GBS04], to predict the fault-proneness [GMCS04], to identify SNPs predictive of phenotype [BDF + 05]. For a random forest, given the ensemble of classifiersh 1 (x;h 2 (x;:::;h k (x), and the training set drawn at random from the distribution of random vector Y;x, the margin function is mg(x;Y) = av k I(h k (x = Y)¡ max j6=Y av k I(h k (x) = j). where I(:) is the indicator function. The margin measures the average vote for the correct class 35 exceeds the average vote for any other class. The larger the margin, the more confidence the classification. The generalization error is given by PE ¤ = P X;Y (mg(X;Y) < 0), where the probability function is over the(X;Y) space. The random forests have an upper bound for the generalization error, which is defined in terms of the strength of the set of classifiersh(x;µ) and the mean value of the correlation. If we define the margin function as mr(X;Y)=P £ (h(X;£)=Y)¡max j6=Y P £ (h(X;£)=j), and the strength of the set of classifiersh(x;£) is s=E X;Y mr(X;Y). 
Further define the raw margin function as rmg(£;X;Y)=I(h(X;£)=Y)¡I(h(X;£)= ^ j(X;Y)), where ^ j(X;Y)=argmax j6=Y P £ (h(X;£)=j). If £;£ 0 are independent with the same distribution, let ½(£;£ 0 ) be the correlation betweenrmg(£;X;Y) andrmg(£ 0 ;X;Y) holding£;£ 0 fixed, andsd(£) be the stan- dard deviation ofrmg(£;X;Y) holding£ fixed, and ¹ ½ be the mean value of the corre- lation, that is, ¹ ½=E £;£ 0(½(£;£ 0 )sd(£)sd(£ 0 ))=E £;£ 0(sd(£)sd(£ 0 )). And an upper bound for the generalization error is given by PE ¤ · ¹ ½(1¡s 2 )=s 2 . It shows that the suggestive function for random forests satisfies the VC-type bounds, which are the same as other types of classifiers. Intuitively, the smaller the correlation between trees and the larger the strength of each tree, the smaller the generalization error. To improve accuracy, random forests use randomly selected inputs or combination of inputs at each node to grow each tree. It has been proved that random forests give accuracy than compare favorably with Adaboost [FS96], which adaptively reweights the training set. 36 A tree is trained on a bootstrapped training set using random feature selection. The trees are grown without pruning. Bagging is used in tandem with random feature selec- tion. Bagging enhances accuracy when random features are used and gives ongoing estimates of the generalization error, the strength and correlation of combined ensemble of trees. For a standard training setD, on average, each of itsL bootstrapped sets will have 63:2% of the examples of D, the rest being duplicated. The L models are fitted using the aboveL bootstrap samples and combined by averaging the output (in case of regression) or voting (in case of classification). In random forests, the generalization error is estimated by out-of-bag error rate, which is calculated using the out-of-bag classifier on the training set. That is, for a specific training set T and the bootstrap training sets T k , construct classifiers h(x;T k ), for each(y;x) in training setT , aggregate the votes only over those classifiers for which T k does not contain(y;x). In each bootstrap training set, about one-third of the instances are left out, therefore, the out-of-bag estimates are based on combining only about one- third of many classifiers as in the ongoing main combination. The out-of-bag estimates tend to overestimate the current error rate, the unbiased out-of-bag estimates can be obtained by running past the point where the test error converges. Random forests are generated using random features, which are formed by two methods. The first one is to select at random, at each node, a small group of input variables to split on. Then grow the tree using CART methodology [?] to maximize size and do not prune. This procedure is called Forest-RI. The other approach consists of defining more features by taking random linear combinations of a number of the input variables. Denote this procedure as Forest-RC. Random forests can also be used for regression, wherein they are formed by growing trees depending on a random vector£ such that the predictorh(x;£) takes on numerical values instead of class labels in classification. 37 According to the Law of Large Numbers, random forests do not overfit. Injecting the right kind of randomness makes random forests accurate classifiers and regressors. Without progressively changing the training set, random forests give results competitive with boosting and adaptive bagging, and reduce bias. 
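The bagging and out-of-bag machinery described above can be exercised directly with an off-the-shelf implementation. The sketch below is illustrative only: it uses scikit-learn's RandomForestClassifier (a Forest-RI-style forest) rather than Breiman's original code, and the data are synthetic stand-ins for the substitution feature vectors.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for substitution feature vectors and intolerant/tolerant labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Each tree is grown without pruning on a bootstrap sample; at each node a random
# subset of the features (here sqrt of the total) is considered for the split.
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    bootstrap=True,
    oob_score=True,   # estimate the generalization error from out-of-bag votes
    random_state=0,
)
forest.fit(X, y)

# Each sample is voted on only by the roughly one third of trees whose bootstrap
# sample did not contain it, giving the out-of-bag accuracy estimate.
print("out-of-bag accuracy estimate:", forest.oob_score_)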
The only types of randomness used in original random forests study are bagging and random features. Other randomness may give better results, which is an ongoing topic. 2.4 Simulated Annealing Bump Hunting and Rule Extraction 2.4.1 Bump Hunting Method Bump hunting is a type of greedy optimization algorithm equipped with patience, which stresses interpretability of the resulting model. It optimizes the average of the output variable while choosing a series of input variables and the corresponding cut-points in the following way. ² During the model construction process, just as in regression trees, bump hunting looks for rectangular regions called boxes, but not by minimizing the sum of the averages of the (two) new regions into which the current space is split. Bump hunting ”peels off” a certain percentage of the data while optimizing the response average of the elements left in the box. ² At each peeling step, a variable and a peel-off value are chosen. Besides, a border is defined so that data points left in the region have the largest mean of the output variable. The top-down-peeling process stops when a minimum number of ele- ments in the box is reached. Since peeling is a greedy process, the average of the 38 response variable in the box can often be improved by ”pasting” back some of the data to the box. ² The bottom-up-pasting process stops when the average in the box can no longer be improved. In general, when pasting is possible, the new box does have a larger mean of the response variable, but that rarely has a dramatic effect. ² Two parameters have to be specified in the box construction algorithm: the peeling quantile ® and the minimal box support. The peeling quantile determines the percentage of data points excluded (peeled away) from the current box at each peeling step. Usually® is between 0:05 and 0:1, which results in the removal of 5% to10% of the data at each step. The minimal support is a threshold parameter, which determines the minimal size of the final box. The choice of the minimal box support involves statistical decision and the domain knowledge of the target application. The development of a box mean (i.e., the mean of the target variable for data points in the box) with respect to support can be observed with the help of the box construction trajectory. The trajectory allows one to visually choose an optimal minimum box support. The bump hunting method is also called the Patient Rule Induction Method (PRIM) and has been used in bioinformatics for several years. Segal et al. [SCH01] analyzed the relationship of genotypes (amino acid sequences) and phenotypes (outcomes) by applying bump hunting to the prediction of the peptide binding from the amino acid sequence position and between-position interactions. Kehl and Ulm [KU03] designed a bootstrapping bump hunting procedure to identify positive and negative responders to a new treatment, which was compared with a classical treatment (placebo) in a ran- domized clinical trial. Wang et al. [WKPT04] implemented boosted PRIM to solve the problem of “searching for oncogenic pathways” based on array-CGH (Comparative 39 Genomic Hybridization) data. Liu et al. [LMH + 04] analyzed tissue microarray (TMA) data to detect effects of higher level biomarker interaction. 2.4.2 Simulated Annealing Bump Hunting Method When applied to the problem of our concern, the original bump hunting method has drawbacks in that some redundant features in the boxes have to be manually removed. 
Otherwise, the quality of the resulting boxes would significantly decrease. To overcome these shortcomings, we develop a novel simulated annealing bump hunting strategy that makes use of a simulated annealing process to discard redundant features automatically while extracting high quality rules. We implement the method and test it on experimental data. A set of rules are successfully identified and each of them is either consistent with the current knowledge or providing new insights into the understanding of amino acid substitutions. In physics, annealing gradually lowers the temperature of a system towards zero with no randomness, to relax into a low-energy configuration [DHS01]. As the temperature is lowered, the system is more likely to find the optimum configuration. The probability the system is in a configuration° having energyE ° is given by P(°)= e ¡E°=T Z(T) ; where Z is a normalization constant. The numerator is the Boltzmann factor and the denominator is the partition function, the sum over all possible configurations becomes Z(T)= X ° 0 e ¡E ° 0=T : 40 The probability is distributed roughly evenly among all configurations at high T while it is concentrated at the lowest-energy configurations at lowT . The simulated annealing method finds the optimum configuration of a general opti- mization problem. For two states, from state a to state b, the system energy is E a and E b , respectively. If ¢E ab < 0, accept the change in state. If ¢E ab > 0, accept the change with a probability equal to e ¡¢E ab =T : The distinction of simulated annealing from the traditional gradient descent search is its occasional acceptance of a state that is energetically less favorable. It allows the system to jump out of an unacceptable local energy minimum. For intolerant (or tolerant) amino acid substitution model, the box-mean is the pro- portion of intolerant (or tolerant) samples inside the box. The number of box boundaries corresponds to the number of selected features. Free energy is defined in terms of the box-mean and selected features. New boxes with an updated box-mean and feature number are generated by one of six meta-operations: shrink a boundary on the left side or the right side, extend a boundary on the left side or the right side, exclude a feature, or include a feature. The simulated annealing bump hunting method simultaneously finds the candidate box and selects features. 2.4.3 Rule extraction and interpretation Rule extraction is related to the fields of machine learning, knowledge discovery, expert systems and artificial intelligence. It generates useful if-then rules from data based on statistical significance. In tree-based methods, the tree-shaped structure represents sets 41 of rules for the classification of a dataset. Specific algorithms include the Classifica- tion and Regression Trees (CART), the Chi Square Automatic Interaction Detection (CHAID), etc. CART segments a dataset by creating 2-way splits while CHAID seg- ments a data set using chi square tests to create multi-way splits. A rule is evaluated by its accuracy and coverage. Accuracy tells how often the rule is correct while coverage tells how often the rule applies. In our research, each candidate box defines a subregion, which is related to a rule. The box mean is the average target value while the box size is the proportion of samples inside it. Boxes with a high box mean and a large box size correspond to rules with high accuracy and large coverage. 
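To make the search procedure above concrete, the following sketch shows the Metropolis-style acceptance step used to move from one candidate box to another. The box representation, the energy function and the propose_move routine (one of the six meta-operations) are deliberately left abstract; this is a simplified outline of the strategy described above, not the exact implementation used in this work.

import math
import random

def accept(delta_e, temperature):
    # Always accept moves that lower the energy; accept energy-raising moves with
    # probability exp(-delta_e / T), which lets the search escape poor local minima.
    if delta_e < 0:
        return True
    return random.random() < math.exp(-delta_e / temperature)

def anneal(box, energy, propose_move, t_start=1.0, t_end=1e-3, cooling=0.95):
    # 'box' is a set of feature intervals; 'energy' combines the box mean and the
    # number of selected features; 'propose_move' applies one meta-operation
    # (shrink or extend a boundary, drop or add a feature).
    current, current_e = box, energy(box)
    temperature = t_start
    while temperature > t_end:
        candidate = propose_move(current)
        delta_e = energy(candidate) - current_e
        if accept(delta_e, temperature):
            current, current_e = candidate, current_e + delta_e
        temperature *= cooling
    return current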
Boxes with a small number of features can be interpreted in chemical and physical terms, which associate amino acid substitutions with biological knowledge. Each boundary of a box sets a feature interval, and the combination of all these intervals defines the condition term in the “if-then” rule. Rules can predict substitutions with feature intervals locating within box boundaries. Rules can also direct the selection of candidate substitutions in clinical or pharmaceutical research. 2.5 Prediction Model Design for Prioritization Analysis We proposed a multiple selection and rule voting (MS-RV) model to predict the effects of amino acid substitutions. This model integrates data partition and feature selection, and makes prediction using rule voting strategy. There are 20 modules in the MS-RV model, each of them is a tree ensemble trained on bootstrapped mutation data sets with one of 20 original amino acids. Trees are built up without pruning. At each node, a small number of features are randomly selected from that module’s optimal feature set, which was generated by sequential forward feature selection. A testing data fits one rule 42 in a tree, which gives the probabilities to predict it to be tolerant and intolerant. The average of probabilities from all trees give the prediction result. A large number of amino acid substitutions occurring in human proteins have been collected in the Swiss-Prot database [BAW + 05], the HapMap database [Con03] and the Ensembl database [BAC + 06]. MS-RV model predicts and prioritizes amino acid sub- stitutions from organized mutation data groups, such as those related to specific human diseases, specific genes and neighboring chromosomal regions. The rank sum test and prioritization analysis verify the effectiveness of MS-RV model. Furthermore, we stud- ied the unclassified substitutions which were predicted by MS-RV model with high pre- diction scores, and found references to support our conclusion. 43 Chapter 3 Novel Feature Sets and Classification of Amino Acid Substitutions In this chapter, we collect intolerant and tolerant substitution samples from experimen- tal data. Features are extracted based on residue’s physicochemical and structural prop- erties. Our novel feature sets are defined in five groups: the window-sized features, the column-weighted features, wild-type amino acid and substitution’s features, and the evolutionary information. They integrate the sequential, structural and evolution- ary information, and can be calculated in a very simple way. Two state-of-art machine learning methods, the decision trees and support vector machine (SVM) are trained and validated, the accuracies are compared with published results with other complicated feature sets. Gaussian mixture model (GMM) is first used for amino acid substitution study. Expectation maximization (EM) algorithm trains GMM, and validation accura- cies are compared for different numbers of selected features. 3.1 Feature Extraction for Amino Acid Substitutions 3.1.1 Amino acid properties in feature set Most contemporary studies aiming at identifying potential effects (tolerant or intolerant) of amino acid substitutions on protein functions make use of a large number of features, including some complex ones such as the three-dimensional structural information and evolutionary properties of the proteins. These complex features are often obtained by 44 complicated calculation and in many times are not generally available. 
Considering these facts, we design a set of simple features integrate the sequential, structural and evolutionary information. They are general enough for most known proteins and are easy to be obtained by simple calculations. Our feature set makes use of3 physicochemical properties (the molecular weight, the pI value, and the hydrophobicity scale) of the amino acids, and 3 relative frequencies for the occurrences of the amino acids in the secondary structures (helices, strands, and turns) of the proteins. Here, the unit of molecular weight is Dalton. The pI (isoelectric point) is the pH at which a molecule carries no net electrical charge. The hydrophobicity scale of Kyte and Doolitle is derived from the physicochemical properties of amino acid side chains [KD82]. All these six properties can either be obtained from the literature [DEKM98][KD82]. Table 3.1 lists all the six properties of amino acids. The feature values for each amino acid are shown in Table 3.2, inside which ‘RelFreq-h’ denotes ‘Relative frequency in helices’, ‘RelFreq-s’ denotes ‘Relative frequency in strands’ and ‘RelFreq-t’ denotes ‘Relative frequency in turns’. For the “Hydrophobicity”column in Table 3.2, positive values denote hydrophobic character, the larger value, the more hydrophobic; and the negative values denote hydrophillic character, the larger absolute value, the more hydrophillic. The values of the right three columns in Table 3.2 are obtained by analysis of the known structures of a number of proteins. Table 3.3 shows three groups of amino acids (in their one-letter codons) favoring different secondary structures. Evolutionary information includes the relative frequencies of the original amino acid and substituted residue on a column in the multiple sequence alignment (MSA), and the entropy corresponding to this column. They can be calculated using only the sequential information of the proteins [Lev78]. All of these features are derived from biological and physicochemical knowledge, and describe amino acids in property groups. 45 Table 3.1: Properties of amino acids included in our feature set. Physicochemical properties Molecular weight pI value Hydrophobicity scale Relative frequencies (in) Helices Strands Turns Table 3.2: Amino acid property values for feature set calculation. ‘MolWgt’ denotes ‘Molecular weight’; ‘pI’ denotes ‘pI value’; ‘Hyd’ denotes ‘Hydrophobicity’; ‘RelFreq- h’ denotes amino acid’s relative frequency in helices; ‘RelFreq-s’ denotes amino acid’s relative frequency in strands; ‘RelFreq-t’ denotes amino acid’s relative frequency in turns. 
Amino Acid MolWgt pI Hyd RelFreq-h RelFreq-s RelFreq-t Alanine (A) 89.09 6.1 1.8 1.29 0.90 0.78 Arginine (R) 174.2 10.76 -4.5 0.96 0.99 0.88 Asparagine (N) 132.12 5.41 -3.5 0.90 0.76 1.28 Aspartic acid (D) 133.1 2.97 -3.5 1.04 0.72 1.41 Cysteine (C) 121.15 5.02 2.5 1.11 0.74 0.80 Glutamine (Q) 146.15 5.65 -3.5 1.27 0.80 0.97 Glutamic acid (E) 147.13 3.22 -3.5 1.44 0.75 1.00 Glycine (G) 75.07 5.97 -0.4 0.56 0.92 1.64 Histidine (H) 155.16 7.59 -3.2 1.22 1.08 0.69 Isoleucine (I) 131.17 6.02 4.5 0.97 1.45 0.51 Leucine (L) 131.17 5.98 3.8 1.30 1.02 0.59 Lysine (K) 146.19 9.59 -3.9 1.23 0.77 0.96 Methionine (M) 149.21 5.74 1.9 1.47 0.97 0.39 Phenylalanine (F) 165.19 5.48 2.8 1.07 1.32 0.58 Proline (P) 115.13 6.3 -1.6 0.52 0.64 1.91 Serine (S) 105.09 5.68 -0.8 0.82 0.95 1.33 Threonine (T) 119.12 5.64 -0.7 0.82 1.21 1.03 Tryptophan (W) 204.23 5.89 -0.9 0.99 1.14 0.75 Tyrosine (Y) 181.19 5.66 -1.3 0.72 1.25 1.05 Valine (V) 117.15 5.96 4.2 0.91 1.49 0.47 Table 3.3: Groups of amino acids favoring different globular protein secondary struc- tures. Favor® helices Favor¯ strands Favor turns Amino acids fA;C;L;M;E;Q;H;Kg fV;I;F;Y;W;Tg fG;S;D;N;P;Rg 46 3.1.2 Feature set calculation For a given protein, the above 6 properties for each amino acid (residue) in the pro- tein sequence can be calculated for the original and the substituted amino acid, as well as in a window-sized situation that includes the neighbors of the original amino acids in the query protein sequence, and in a column-weighted circumstance to which the original amino acid is aligned with multiple homologous proteins. The calculations of the properties for the original and the substituted amino acid are straightforward. The window-sized properties (with sizew) are calculated as the average of the corresponding properties for the amino acid and itsw¡1 neighbors in the query protein sequence. We set the window size w to 9, centered at the original amino acid, with 4 left side neigh- boring residues, and4 right side neighboring residues. The column-weighted properties for an amino acid are calculated as follows. Multiple sequence alignments are extracted by pfam [BCD + 04] for each domain of the query protein. Assuming that the amino acid appears in the c-th column of the alignment, the column-weighted properties for the amino acid are then calculated as the weighted average of the corresponding properties for all the 20 possible amino acids, where the weight of a certain kind of amino acid is the frequency of its occurrence in the c-th column of the alignment. For evolutionary information group, the relative frequencies of original amino acid and its substitution are obtained from their frequencies in the column of the MSA; the entropy [CT91] is calculated by¡ P 20 i=1 p i log 2 p i , wherep i is the frequency of each possible amino acid in the column of the MSA. Therefore, our feature set has 27 features, including 6 simple yet meaningful prop- erties of amino acids with each property calculated in4 different situations, and3 evo- lutionary information measurements. For summarization, Table 3.4 shows all of the features, with features labeled byx i fori = 1;:::;27. x 1 ;:::;x 24 are for the first four 47 groups, each of them contains the above6 properties. x 25 ;:::;x 27 are for the evolution- ary information group, where ‘rf-ori’ means the relative frequency of the original amino acid; ‘rf-sub’ means the relative frequency of the substitution; ‘entropy’ is calculated for the column. 
This labeling method for our features will be used throughout this proposal.

Table 3.4: Details of the proposed 27-D feature set.

  Property set               Molecular   pI      Hydro-      Rel. freq.   Rel. freq.   Rel. freq.
                             weight      value   phobicity   in helices   in strands   in turns
  Window-sized               x1          x2      x3          x4           x5           x6
  Original amino acid        x7          x8      x9          x10          x11          x12
  Column-weighted            x13         x14     x15         x16          x17          x18
  Substitution               x19         x20     x21         x22          x23          x24
  Evolutionary information   rf-ori: x25      rf-sub: x26      entropy: x27

Figure 3.1 illustrates how the feature set is calculated from a segment of the multiple sequence alignment (MSA) of the lac repressor. The MSA contains 1296 sequences, indexed from 1 to 1296. The query sequence on the first row is ungapped, and its relative positions on this segment run from 50 to 90. At position 81 the wild-type residue is L; its 9-residue window contains this L together with the highlighted neighboring residues P, A, L, F on one side and D, V, S, D on the other. Each window-sized feature, x1,...,x6 in Table 3.4, is the average of the corresponding property values of these 9 amino acids. For the column-weighted features, we first count the frequency of each residue in the highlighted column at relative position 66 of the query sequence. Each feature x13,...,x18 in Table 3.4 is then the weighted average of the residues' property values, where the weight of each occurring residue is its frequency in this column. The original amino acid's features, x7,...,x12, are the six property values of amino acid A, and for a substitution of A at relative position 66 the substitution features x19,...,x24 are the corresponding property values of the substituted residue. x25 and x26 are the frequencies of amino acid A and of its substitution in the column at position 66 (x26 may be zero if the substitution does not occur in this column), and x27 is calculated from the frequencies of all possible amino acids in this column. A minimal computational sketch of these column-based calculations is given after the figure.

[Figure 3.1 appears here: a segment of the lac repressor MSA, showing aligned sequences indexed 1 to 1296 over query positions 50 to 90.]

Figure 3.1: Calculation of the feature set from a multiple sequence alignment. The segment of the lac repressor's MSA. A window centered at L with a size of 9 residues is highlighted on the query sequence. The column at relative position 66 is also highlighted.
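The sketch below shows, under simplified assumptions, how the column-based quantities above can be computed from a single MSA column: the residue frequencies, the column-weighted value of one property, the relative frequencies of the wild-type residue and the substitution, and the column entropy. The property table and the example column are small hypothetical stand-ins, not the actual Pfam alignment or the full six-property table used in this work.

import math
from collections import Counter

# Hypothetical Kyte-Doolittle-style hydrophobicity values for a few residues.
hydrophobicity = {"A": 1.8, "I": 4.5, "V": 4.2, "S": -0.8, "T": -0.7, "L": 3.8}

def column_features(column, wild_type, substitution, prop):
    # Frequencies of the residues observed in this alignment column.
    counts = Counter(column)
    n = len(column)
    freqs = {aa: c / n for aa, c in counts.items()}
    # Column-weighted property: average of the property values weighted by frequency.
    weighted = sum(f * prop.get(aa, 0.0) for aa, f in freqs.items())
    rf_ori = freqs.get(wild_type, 0.0)
    rf_sub = freqs.get(substitution, 0.0)   # zero if the substitution never occurs here
    # Column entropy: -sum p_i log2 p_i over the observed residues.
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    return weighted, rf_ori, rf_sub, entropy

# Toy column standing in for the alignment column at relative position 66.
column = list("AAAIVSTAAL")
print(column_features(column, wild_type="A", substitution="V", prop=hydrophobicity))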
3.2 Data Sets for Classifier Learning

Here we show that our simple yet meaningful feature set is as good as other published, more complicated feature sets in predicting the potential effects of amino acid substitutions on protein function. We use machine learning methods to predict tolerant or intolerant amino acid substitutions, and compare the cross-validation accuracies of the methods adopting our feature set with those of the same methods using other published feature sets. The philosophy behind this comparison is that, for the same learning method, the better feature set will very likely yield better prediction accuracy.

3.2.1 Unbiased experimental amino acid substitutions

Two sets of extensive, unbiased amino acid substitution data, for the lac repressor [MKC+94][SMK+96] and the T4 lysozyme [RBHP91], are adopted for the comparison. In these data sets, the effect of each amino acid substitution on the function of the corresponding protein is rated and assigned to one of four categories: + (no effect), +- (slight effect), -+ (larger effect), and - (complete absence). Following the definitions in [CA01], substitutions falling into the "+" category are treated as tolerant, and substitutions in the remaining categories are regarded as intolerant. For the lac repressor, 3,105 mutations at amino acid positions 72 to 323 are analyzed. For the T4 lysozyme, 1,444 mutations at amino acid positions 28 to 144 are analyzed. The details of the data sets are listed in Table 3.5.

Table 3.5: Experimental amino acid substitutions.

                               Intolerant                        Tolerant
  Protein         Position     (-)     (+-, -+)    All           (+)
  lac repressor   [72, 323]    781     399         1,180         1,925
  T4 lysozyme     [28, 144]    208     232         440           1,004
  Combined        --           989     631         1,620         2,929
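As a small illustration of how these experimental ratings become binary class labels for training, the sketch below maps the four categories onto the tolerant/intolerant classes defined above; the record layout is hypothetical, not the exact format of the published data files.

from collections import Counter

def to_label(category):
    # Only "+" is tolerant; "+-", "-+" and "-" are all treated as intolerant.
    return "tolerant" if category == "+" else "intolerant"

# Hypothetical records: (protein, position, wild_type, substitution, category).
records = [
    ("lac repressor", 81, "L", "P", "-"),
    ("lac repressor", 90, "S", "T", "+"),
    ("T4 lysozyme", 44, "A", "G", "+-"),
]

labels = [to_label(category) for *_, category in records]
print(Counter(labels))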
Besides the Swiss-Prot database [BAW + 05], additional coding non-synonymous single nucleotide polymorphisms (nsSNPs) and the coded amino acid substitution data were collected from the HapMap database [Con03] and the Ensembl database [BAC + 06]. For The testing data sets are from thirty 10-Mb neighboring gene regions, each of them centers at a gene whose mutations are associated with a human disease, and has comparable numbers of disease-associated and polymorphism substitutions. 3.3 Decision Trees for Classifying Amino Acid Substitutions Machine learning techniques create programs that will learn from experience and mod- ify their activity by the analysis of data sets. They have abilities to model non-linear rela- tionships, construct symbolic and interpretable models, etc. Common algorithm types 51 include supervised learning and unsupervised learning. Supervised learning generates a function that maps inputs to desired outputs. One standard formulation of the supervised learning task is the classification problem: the learner is required to learn (to approxi- mate the behavior of) a function which maps a vectorfx 1 :::; x n g into one of several classes by looking at several input-output examples of the function. Unsupervised learn- ing models a set of inputs: labeled examples are not available. Two supervised learning methods, the decision trees and support vector machines, are used to classify intoler- ant and tolerant amino acid substitutions. They are discussed in this and next sections, respectively. 3.3.1 Decision tree learning Decision tree learning [Mit97] is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Trees can also be represented as sets of if-then rules. Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node inside a tree specifies a test of some attribute (the feature) of the instance,and each branch descending from that node corresponds to one of the possible values for this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node. In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions. Figure 3.2 shows a simple decision tree expressing theXOR boolean function. Branches from nodeA orB are determined by 52 its attribute value: False (F) or True (T). The boolean function’s output is represented by a leaf. Each path from the root node to a leaf corresponds to a row in the truth table. A B B F T T F F F F T T T F F F T T F T F T F T T A xor B B A Figure 3.2: Decision tree learning example. On the right side, decision tree expresses the boolean function (A xor B) shown in the left truth table. Nodes A and B specifies the attributes, branches from each node are determined by its attribute value. The leaves represent the outputs. Each path from the root node to a leaf corresponds a row in the truth table. Decision trees are learnt by a top-down, greedy search through the space of possible decision trees. 
Attributej (one of the features listed in Table 3.4) is constructed by start- ing with all samples in the training set in a single root node, and then recursively splitting each nodeN by testing on the attributej for which the information gain for attributej is maximal, until it reaches a classifying leaf node (‘intolerant’ or ‘tolerant’). The informa- tion gain measures how well a given attribute separates the training examples according to their target classification. Continuous-valued decision attributes can be incorporated into the learned tree, by dynamically defining new discrete-valued attributes that par- tition the continuous attribute value into a discrete set of intervals. Missing attribute values can also be handled in the evaluation of certain attribute according to the infor- mation gain maximization. Learn from an input set of examples, decision tree algorithm works as followings: (1) Select a test on a feature from feature space, and split the example set according 53 to the possible values of the selected feature. (2) Check the partition induced by step 1. If every subset contains examples of only one class, then stop. Label each leaf with the name of the class. (3) Recursively split any subset that contains examples from different classes. When the decision tree is built, its internal node’s outcome determines the branch along which to do the next test, and the label of arriving leaf determines the predicted class. Each branch through a decision tree can be transformed into a rule. 3.3.2 Implementation of decision trees C4.5 decision tree software is derived from Hunt’s method [HMS66] for constructing a decision tree. The software can be downloaded freely (http://www.rulequest.com/Personal/). The program c4.5 induces classification rules in the form of decision trees from the training data. It constructs trees in two ways: in batch mode, it generates a single tree using all the available data; in iterative mode, it starts with a randomly-selected subset of the data (the window), generate a trial decision tree, adds some misclassified objects, and continues until the trial decision tree correctly classifies all objects not in the window or until it appears that no progress is made. The program c4.5rules forms production rules from unpruned decision trees. For each tree that it finds, c4.5rules generates a set of pruned rules, and then sifts this set in an attempt to find the most useful subset of them. If more than one tree was found, all subsets are then merged and the resulting composite set of rules is then sifted. 3.3.3 Training and test with decision trees We apply a ten-fold cross-validation to each of the data sets presented in Table 3.5. The data were randomized and split into 10 equal parts. One part was used as test set and the reminder as training set. This procedure was repeated 10 times so that each substitution 54 sample was used exactly once for testing. In order to test whether our feature set is general enough for different types of proteins, we also perform a mixed cross-validation on the combined data set which contains both the lac repressor and the T4 lysozyme data sets. The accuracy comparison bar plot is shown in Figure 3.3 The overall cross- validation accuracies obtained by using our feature set are higher than or equal to those obtained by using the published feature set. 
76% 77% 78% 79% 80% 81% 82% 83% 84% 85% Published data 84% 80% 79% Our feature set 84% 81% 81% LAC repressor T4 lysozyme Mixed Figure 3.3: Decision tree method 10-fold cross validation accuracy comparison. For each case of lac repressor, T4 lysozyme and mixed samples, the left bar represents the published data, and the right bar represents the result with our feature set. 3.4 SVMs for Classifying Amino Acid Substitutions Support Vector Machines (SVM) [CST00b][DHS01] are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. With an appropriate nonlinear mapping'(:), SVMs map data to a sufficiently high dimension, where the data can be separated by a hyperplane. 55 In this study, SVMs use a kernel function to map the input vectors of instances, the amino acid substitutions, to a high dimensional feature space; a hyperplane is con- structed to optimally divided the instances into two classes, ‘intolerant’ or ‘tolerant’. LIBSVM (http://www.csie.ntu.edu.tw/ cjlin/libsvm/), a library for SVM and regres- sion, is used to train and test the sample sets; and accuracies are compared with pub- lished data. 3.4.1 Support vector machine learning SVM learning[MMR + 01] finds separating hyperplanes (as computed by a perceptron) y =sign((w¢x)+b) which can be upper bounded in terms of the margin. The goal of learning is to findw and such that the expected risk is minimized. For each ofn patterns, x k (k =1;2;:::;n) has been transformed toy k = '(x k ). Letz k =§1, according to whether patternk belongs to class! 1 or! 2 . A linear discriminant in an augmentedy space is g(y)= a t y where both the weight vector and the transformed pattern vector are augmented. If a t y i ¸ b > 0, for all i, the positive constant b is the margin. Thus a separating hyper- plane ensures z k g(y k )¸1;k =1;:::;n The goal in training a SVM is to find the separating hyperplane with the largest margin. The larger the margin, the better generalization of the classifier. The distance from any 56 hyperplane to a (transformed) pattern y isjg(y)j= k a k, and assuming that a positive marginb exists, then z k g(y k ) k ak ¸b The goal is to find the weight vector a that maximizesb. In Figure 3.4, the red 2D patterns are from one class, and the blue 2D patterns are from the other class. The support vectors are the (transformed) training patterns, they are (equally) close to the hyperplane, as the data points on the two lines. The support vectors are the training samples that define the optimal separating hyperplane and are the most difficult patterns to classify. Informally speaking, they are the patterns most informative for the classification task. Optimal hyperplane y1 y2 2 1 margin 2 1 margin Figure 3.4: Training a support vector machine. It is to find the optimal hyperplane, which has the maximum distance from the nearest training patterns. The support vectors are the points on the two lines with margin distance from the hyperplane. Training a support vector machine is based on some modification to the Perceptron training rule, which updates the weight vector by an amount proportional to any ran- domly chosen misclassified pattern. A SVM is trained by choosing the current worst- classified pattern. During most of the training period, such a pattern is one on the wrong 57 side of the current decision boundary, farthest from that boundary. 
This pattern will be one of the support vectors. The training procedure of SVM is as followings: 1. Choose the nonlinear '-functions that map the input to a higher-dimensional space. Some basis functions are linear, polynomial, radial basis function (RBF), sigmoid, etc. 2. Recast the minimizing of the magnitude of the weight vector constrained by the separation into an unconstrained problem by the method of Lagrange undeter- mined multipliers. To minimizek ak, we seek to minimize L(a;®)= 1 2 k ak 2 ¡ n X k=1 ® k [z k a t y k ¡1] subject to the constraints n X k=1 z k ® k given the training data. These equations can be solved using quadratic program- ming and other schemes. 3.4.2 LIBSVM training and test For training and testing SVMs in classification problem, LIBSVM performs in the fol- lowing procedures: 1. Transform data to the format of LIBSVM. Each instance starts with the target (0 or 1, for different classes), following feature index (1;2;:::;n) and feature value (x 1 ;x 2 ;:::;x n ); 2. Conduct scaling on the data, to make the feature in certain interval, i.e.,[0;1]; 58 3. Select a kernel function; 4. Use cross-validation to find the best parameters to define the SVMs; 5. Use the best parameters to train the training set; 6. Test the test set. 3.4.3 Simulation results Substitution instances of lac repressor and T4 lysozyme are transformed into the for- mat of LIBSVM. 10-fold cross-validation accuracies are listed in Figure 3.5, and are compared with published data. Our feature set achieves better overall cross-validation accuracies for all the three sample sets (the lac repressor, the T4 lysozyme, and the combined). 65% 70% 75% 80% 85% Published data 73% 71% 72% Our feature set 83% 82% 82% LAC repressor T4 lysozyme Mixed Figure 3.5: SVMs 10-fold cross validation accuracy comparison. For each case of lac repressor, T4 lysozyme and mixed samples, the left bar represents the published data, and the right bar represents the result with our feature set. 59 3.5 Gaussian Mixture Models for Predicting Effects of Amino Acid Substitutions Gaussian Mixtures (GM) can model data distributions. When the data are unlabeled, this modeling leads to the identification of groupings (clusters) among the data, a task, which is called clustering. On the other hand, estimating labeled data distributions empowers us to perform classification of previously unseen data. The key approach in both cases is the approximation of data distributions using GM, where the data are either unlabeled or are of the same class label. In a mixture model [Nab01], a probability density function is expressed as a linear combination of basis functions. A model withM components is written in the form p(x)= M X j=1 P(j)p(xjj) where the parameters P(j) are the mixing coefficients and the parameters of the com- ponent density functionsp(x) typically vary withj. The mixing coefficients satisfy M X j=1 P(j)=1 0·P(j)·1 and choosing the normalized density functions Z +1 ¡1 p(xjj)dx=1 guarantees that the model represents a density function. 60 The mixture model is a generative model, each component is chosen with probability P(j), which is the prior probability of thejth component. A data point is generated from the corresponding density p(xjj). 
Using Bayes' theorem, the corresponding posterior probabilities can be written in the form
$$P(j \mid x) = \frac{p(x \mid j)\, P(j)}{p(x)}.$$
These posterior probabilities satisfy the constraints
$$\sum_{j=1}^{M} P(j \mid x) = 1, \qquad 0 \le P(j \mid x) \le 1.$$

In a Gaussian mixture model, each component density is a Gaussian distribution with a mean vector $\mu_j$ of dimension $d$, if the samples lie in the $\mathbb{R}^d$ (feature) space, and a covariance matrix of a certain form. Several covariance matrix forms, of increasing generality and complexity, are as follows:

1. Spherical: each Gaussian component is isotropic, and the covariance matrix is a scalar multiple of the identity matrix, $\Sigma_j = \sigma_j^2 I$, so that
$$p(x \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left\{-\frac{\|x - \mu_j\|^2}{2\sigma_j^2}\right\}.$$

2. Diagonal: the covariance matrix is diagonal, $\Sigma_j = \mathrm{diag}(\sigma_{j,1}^2, \ldots, \sigma_{j,d}^2)$, and the density function is
$$p(x \mid j) = \frac{1}{(2\pi)^{d/2} \big(\prod_{i=1}^{d} \sigma_{j,i}^2\big)^{1/2}} \exp\!\left\{-\sum_{i=1}^{d} \frac{(x_i - \mu_{j,i})^2}{2\sigma_{j,i}^2}\right\}.$$

3. Full: the covariance matrix is any positive definite $d \times d$ matrix $\Sigma_j$, and the density function is
$$p(x \mid j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\!\left\{-\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)\right\}.$$

4. PPCA-type: the covariance matrix is the sum of two terms, one diagonal in a $q$-dimensional subspace spanned by the first $q$ principal components and the other spherical.

Applying the Gaussian mixture model (GMM) to the amino acid substitution study is motivated by the analysis of an evolutionary quantity (an evolutionary information measurement) and the predicted free energy change (a structural information measurement) related to substitutions. For each column in the MSA, we define the evolutionary quantity in terms of entropy and mutual information:
$$\text{Evolutionary Quantity} = \exp\{-(\text{Position-Specific Entropy})\} \cdot \exp\{\text{Mutual Information}\}.$$
Free energy change is a measurement of protein functional stability. If an amino acid substitution is deleterious, the free energy change is large and the stability change is obvious; on the other hand, the free energy change corresponding to a neutral substitution is small. Free energy change is predicted by the DFIRE software (http://phyyz4.med.buffalo.edu/hzhou/dmutation.html).

For substitutions of lac repressor and T4 lysozyme, the plots of evolutionary quantity and predicted free energy change are shown in Figure 3.6 and Figure 3.7. For the two types of unbiased experimental substitutions, the distributions defined on this feature space cannot be separated by single curves. Instead, the intolerant and tolerant types each cluster in different regions of the feature space. This motivates the use of a mixture model, which models each type of substitution by mixed components.

Figure 3.6: Evolutionary quantity and predicted free energy change plot of lac repressor. The x-axis denotes the evolutionary quantity, and the y-axis denotes the predicted free energy change. Red points are intolerant amino acid substitutions, and green points are tolerant ones.

Figure 3.7: Evolutionary quantity and predicted free energy change plot of T4 lysozyme. The x-axis denotes the evolutionary quantity, and the y-axis denotes the predicted free energy change. Red points are intolerant amino acid substitutions, and green points are tolerant ones.

3.5.1 GMM training by EM algorithm and classification

Determining the parameters of a Gaussian mixture model from a data set is based on maximizing the data likelihood, which can be recast as minimizing the negative log likelihood of the data set,
$$E = -L = -\sum_{n=1}^{N} \log p(x_n).$$
For a single Gaussian pdf, the parameters have an analytical solution (the sample mean and variance estimates), but in general the analytical approach is intractable, especially for Gaussian mixture distributions with arbitrary covariance matrices. The alternative is the expectation-maximization (EM) algorithm, an iterative method for computing maximum likelihood parameter estimates. The EM algorithm usually converges faster than general-purpose algorithms and can deal with data points with missing values.

An amino acid substitution is represented by "input" variables $x = (x_1, \ldots, x_D)^T$ in the $\mathbb{R}^D$ (feature) space, and an output $y$ with value 1 or 0, categorizing intolerant or tolerant substitutions. For each data point $x_n$ there is a corresponding random variable $z_n$, the component label, which takes an integer value from $\{1, \ldots, M\}$. Let $y_n$ be the hypothetical complete data point $(x_n, z_n)$ and $w$ the parameters of the mixture model, i.e., the mean and variance parameters of the components' density functions. The EM algorithm generates a sequence of estimates $w^{(m)}$ starting from the initial parameter set $w^{(0)}$.

The likelihood of a complete data point with $z = j$ is
$$p((x, z = j) \mid w) = p(x \mid z = j, w)\, P(z = j \mid w) = p(x \mid \theta_j)\, P(z = j \mid w),$$
where $\theta_j$ are the density function parameters (mean and variance) for component $j$. Marginalizing over $z$, the likelihood of $x$ is
$$p(x \mid w) = \sum_{j=1}^{M} P(z = j \mid w)\, p(x \mid \theta_j).$$

Given a set of parameters $w^{(m)}$, the next set of parameters $w^{(m+1)}$ can be estimated from the current parameters and the component labels. The class labels are uncertain, but their probability distribution is known, and their expected values can be obtained given the current parameters. The function $Q(w \mid w^{(m)})$ is formed as follows:
$$Q(w \mid w^{(m)}) = \mathbb{E}\big[\log p(y \mid w)\big]_{P(z_n \mid x_n, w^{(m)})}
= \sum_{j=1}^{M} \sum_{n=1}^{N} \big[\log p(x_n, z_n \mid w)\big]\, P(z_n = j \mid x_n, w^{(m)})
= \sum_{j=1}^{M} \sum_{n=1}^{N} \big[\log P(j) + \log p(x_n \mid \theta_j)\big]\, P^{(m)}(j \mid x_n),$$
where
$$P^{(m)}(j \mid x_n) := P(z_n = j \mid x_n, w^{(m)}) = \frac{P^{(m)}(j)\, p(x_n \mid \theta_j^{(m)})}{\sum_{j'=1}^{M} P^{(m)}(j')\, p(x_n \mid \theta_{j'}^{(m)})}$$
is the expected posterior distribution of the class labels given the observed data. $Q(w \mid w^{(m)})$ is a function of $P(j)$ and $\theta_j$, while $P^{(m)}(j)$ and $\theta_j^{(m)}$ are fixed values. The calculation of $Q$ is the E-step of the algorithm. The M-step optimizes $Q(w \mid w^{(m)})$, that is,
$$w^{(m+1)} = \arg\max_{w} Q(w \mid w^{(m)}).$$

The mixing coefficients $P(j)$ satisfy $\sum_j P(j) = 1$. Differentiating
$$Q + \lambda\Big(\sum_{j=1}^{M} P(j) - 1\Big)$$
with respect to $P(j)$ and setting the result to zero gives the update equation
$$P^{(m+1)}(j) = \frac{1}{N} \sum_{n=1}^{N} P^{(m)}(j \mid x_n).$$
Differentiating $Q$ with respect to the $j$-th mean $\mu_j$ and setting the result to zero gives the update equation
$$\mu_j^{(m+1)} = \frac{\sum_{n=1}^{N} P^{(m)}(j \mid x_n)\, x_n}{\sum_{n=1}^{N} P^{(m)}(j \mid x_n)}.$$
To estimate the covariance parameters, each data point is weighted by $P^{(m)}(j \mid x_n)$. Differentiating $Q$ with respect to the covariance matrix parameters and setting the result to zero, the updates for the different types of covariance matrices are:

1. Spherical covariance matrices:
$$(\sigma_j^{(m+1)})^2 = \frac{1}{d} \cdot \frac{\sum_{n=1}^{N} P^{(m)}(j \mid x_n)\, \|x_n - \mu_j^{(m+1)}\|^2}{\sum_{n=1}^{N} P^{(m)}(j \mid x_n)}.$$

2. Diagonal covariance matrices:
$$(\sigma_{j,i}^{(m+1)})^2 = \frac{\sum_{n=1}^{N} P^{(m)}(j \mid x_n)\, (x_{n,i} - \mu_{j,i}^{(m+1)})^2}{\sum_{n=1}^{N} P^{(m)}(j \mid x_n)}.$$

3. Full covariance matrices:
$$\Sigma_j = \frac{\sum_{n=1}^{N} P^{(m)}(j \mid x_n)\, (x_n - \mu_j^{(m+1)})(x_n - \mu_j^{(m+1)})^T}{\sum_{n=1}^{N} P^{(m)}(j \mid x_n)}.$$
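The following is a minimal sketch (illustrative only, not the implementation used in this work) of the EM updates above for a 1-D Gaussian mixture with $M$ components; the synthetic data are placeholders.

```python
# EM for a 1-D Gaussian mixture: E-step computes P(j|x_n), M-step re-estimates
# the mixing coefficients, means and variances exactly as in the update equations.
import numpy as np

def em_gmm_1d(x, M=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    P = np.full(M, 1.0 / M)                        # mixing coefficients P(j)
    mu = rng.choice(x, M)                          # initial means
    var = np.full(M, np.var(x))                    # initial variances
    for _ in range(iters):
        # E-step: posterior responsibilities P(j | x_n).
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens * P
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of P(j), means and variances.
        Nj = resp.sum(axis=0)
        P = Nj / N
        mu = (resp * x[:, None]).sum(axis=0) / Nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return P, mu, var

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gmm_1d(data))
```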
Test samples from different classes $C_1, \ldots, C_c$ can be classified by means of the posterior probabilities of class membership $P(C_k \mid x)$, where each class is defined by GMM parameters trained with the EM algorithm. For each class, its density function $p(x \mid C_k)$ is estimated from the data points in that class, and $P(C_k)$ is the prior probability. From Bayes' theorem,
$$P(C_k \mid x) = \frac{p(x \mid C_k)\, P(C_k)}{p(x)},$$
which gives the probability of sample $x$ belonging to class $C_k$.

3.5.2 Simulation results

For data points in a high-dimensional feature space, the number of parameters needed to define the Gaussian components is large, and the application is limited by the available number of training samples. To deal with this problem, we perform feature selection before GMM training. The Branch and Bound algorithm [SPK04] [JDM00] is used to select optimal feature sets. For a class of monotonic criterion functions, the Branch and Bound algorithm avoids an exhaustive search by using intermediate results to obtain bounds on the final criterion value.

Table 3.6 lists the features selected by the Branch and Bound algorithm for heterogeneous validation and mixture validation. The original feature set is 27-D; the selected 15-D and 20-D feature sets are used in GMM training. The variables are from Table 3.4. For each case, the selected features are listed in order of decreasing importance.

Table 3.6: Selected feature sets for GMM training and testing.
Train lac repressor, test T4 lysozyme:
  Selected 20-D: x6, x13, x2, x18, x14, x25, x15, x16, x17, x4, x1, x3, x27, x5, x24, x23, x20, x9, x8, x7
  Selected 15-D: x6, x13, x2, x18, x14, x25, x15, x16, x1, x3, x27, x24, x9, x8, x7
Train T4 lysozyme, test lac repressor:
  Selected 20-D: x26, x15, x25, x18, x17, x14, x3, x5, x16, x4, x2, x6, x24, x23, x22, x21, x20, x19, x12, x7
  Selected 15-D: x26, x15, x25, x18, x17, x14, x3, x5, x16, x4, x2, x6, x24, x23, x7
Mixture samples:
  Selected 20-D: x16, x27, x13, x17, x18, x15, x14, x25, x1, x2, x26, x6, x3, x4, x5, x24, x23, x22, x21, x20
  Selected 15-D: x27, x15, x13, x16, x14, x17, x18, x25, x1, x2, x26, x6, x3, x4, x5

For heterogeneous validation, the GMM training and test accuracies and comparison data are listed in Table 3.7, which shows that feature selection improves performance. For mixture samples, the GMM validation results are shown in Table 3.8. Our result is better than the published SVM result, and GMM overcomes the overfitting problem of the decision tree method.

Table 3.7: GMM heterogeneous validation accuracy comparison.
  GMM (selected 20-D from MSA 27-D): train lac, test T4: 63%; train T4, test lac: 63%
  GMM (selected 15-D from MSA 27-D): train lac, test T4: 67%; train T4, test lac: 65%

Table 3.8: GMM mixture sample cross-validation accuracy comparison.
  GMM (selected 15-D feature set): 74%
  SVM (published data): 72%
  C4.5 (published data): 79%

3.6 Random Forests for Classifying Amino Acid Substitutions

3.6.1 Random forests learning

Random forests [Bre01a, Bre01b] are collections of classification trees grown on bootstrap samples of the observations, using a random subset of predictors to define the best split at each node. They are high-dimensional nonparametric methods that thrive on a large number of predictors, and are an extension of decision tree learning. There are also other ensemble classification methods such as boosting and bagging [Bre96]. Bagging (bootstrap aggregating) trains many classifiers on bootstrapped samples from the training set; it reduces the variance of classification. Boosting uses iterative re-training, where the incorrectly classified samples are given more weight in each successive iteration.
Although boosting generally reduces both the variance and the bias of classification, its training is very slow. Random forests have been shown to be comparable to boosting in terms of accuracy, and are much less computationally intensive than boosting.

There are two types of trees grown in random forests [JBS05]: CART-like trees and Binary Hierarchy Classifiers (BHC). CART (Classification and Regression Tree) is a binary tree where splits are made on the variable (feature) resulting in the greatest change in impurity, i.e., the minimum resulting impurity. The impurity $\hat{i}(t)$ at node $t$ in a tree is defined as [DHS01]
$$\hat{i}(t) = \sum_{i \ne j} \hat{P}(x_i \mid t)\, \hat{P}(x_j \mid t),$$
where $\hat{P}(x_i \mid t)$ is the estimated probability that a sample $x_i \in$ class $i$ is in the subset of samples at node $t$. An actual classification occurs during training. The growing of the tree is stopped when every possible split yields only one node or when the impurity becomes zero. It is easy to overtrain a decision tree like CART; to generalize a classifier based on a single tree, pruning is usually necessary. For a multiclassifier ensemble like random forests, however, each tree is trained to its fullest on a subset of the training data in order to diversify the individual trees.

A Binary Hierarchy Classifier (BHC) system is a binary tree of classifiers where the split at a node is based on classes (labels/outputs) rather than variables (features). Compared to a CART-like tree, the splits in a BHC-like tree are pure. For a Binary Hierarchy Classifier, the best split at each node is based on class separability, or rather meta-class separability. Training begins with a single meta-class which is split into two meta-classes and so on, until the true classes are realized in the leaves, while simultaneously computing the Fisher discriminant and projection [KGC02].

In the random forests system (http://www.stat.berkeley.edu/users/breiman/RandomForests), CART-like trees are generated. There are two types of random forests: random input (RI) forests and random combination (RC) forests. In RI forests, at each node a small group of input variables is randomly selected to split on. Trees are grown using the CART methodology to maximum size without pruning. The number of selected variables can be a single randomly selected variable, or the first integer less than $\log_2 M + 1$, where $M$ is the number of input variables. In RC forests, additional features are created by taking random linear combinations of a number of input variables. These features are defined as
$$A_f = \sum_{i=1}^{L} b_{f,i}\, x_{\mathrm{ind}(f,i)}, \qquad f = 1, \ldots, F,$$
where the coefficients $b_{f,i}$ are uniform random numbers on $[-1, 1]$. At each node, samples are split on the best of these linear boundaries.

Random forests can assess the importance of features by using internal out-of-bag (OOB) estimates, with verification by reruns using only the selected variables. Suppose we study the $m$-th feature: after each tree is constructed, the values of the $m$-th variable in the OOB examples are randomly permuted, and the permuted OOB data are put down the corresponding tree to get a label by voting. Subtract the number of votes for the correct class in the $m$-th-variable-permuted OOB data from the number of votes for the correct class in the untouched OOB data. The average of this number over all trees in the forest is the raw importance score for the $m$-th variable.
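Below is a minimal sketch (not the dissertation's code) of permutation-based feature importance with a random forest. Note that scikit-learn's helper computes the permutation on a held-out set rather than on the per-tree OOB samples described above, so this only approximates the OOB procedure; the data are synthetic placeholders.

```python
# Permutation importance: shuffle one feature at a time and measure how much the
# forest's accuracy drops, averaged over several repeats.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                    # six made-up features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

imp = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
for m, score in enumerate(imp.importances_mean):
    print(f"feature {m}: importance {score:.3f}")
```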
3.6.2 Simulation results

Random forests were first applied to the substitution instances of lac repressor and T4 lysozyme. The 10-fold cross-validation accuracies are listed in Figure 3.8 and are high for all three sample sets (the lac repressor, the T4 lysozyme, and the combined).

[Figure: random forests accuracies with our feature set are 94%, 92% and 91% for lac repressor, T4 lysozyme and mixed samples.]
Figure 3.8: Random forests 10-fold cross-validation accuracies for experimental amino acid substitutions on lac repressor, T4 lysozyme and mixed samples.

3.7 Conclusion and Future Works

We implemented several state-of-the-art machine learning methods to classify the unbiased experimental amino acid substitutions. Compared with other published results, the decision tree and SVM achieved better 10-fold cross-validation accuracies on data samples defined by our proposed feature set. GMM has seldom been applied to amino acid substitution study; when we model substitution prediction as a statistical pattern recognition problem, GMM handles it well. The random forests methodology can be integrated into our prediction model design.

From the above simulations, GMM achieves comparable results in some cases of heterogeneous validation, which is the most difficult prediction problem on the available samples, and feature selection improves the accuracy. In mixture validation, GMM (with the selected 15-D feature vector) is better than SVM, and there is no overfitting (a problem of the decision tree method).

In [FJ02] and [LFJ04], the authors introduce an EM algorithm to estimate salient features and the number of clusters in a mixture model. In our GMM study, the targets of the training amino acid substitution samples are known, and feature reduction is performed before model training and classification. The priors of the parameters and feature selection may be handled through Bayesian learning; this is an interesting topic for future work.

Chapter 4
Simulated Annealing Bump Hunting Rule Extraction for Amino Acid Substitutions

With the set of simple and interpretable features available, we can further explore the chemical and physical principles behind amino acid substitutions and associate the substitutions with physicochemical and biological knowledge. To achieve this objective, we use the simulated annealing bump hunting method to automatically induce a set of rules for amino acid substitutions from experimental data samples. Good rules should have the following qualities: (1) they consist of a small number of features; (2) they can be interpreted in chemical and physical terms; (3) they contain a certain class of substitutions with very few exceptions; (4) they cover a significant number of substitutions. Rules can be treated as subregions (boxes) in the feature space, where each dimension corresponds to a feature in Table 3.4, and each boundary of a box is defined by a feature interval.

A previous method for finding subregions in the feature space is the Patient Rule Induction Method (PRIM) [FF99], which uses a steepest-ascent approach and is referred to as the "bump hunting" method. For our 27-D substitution samples, the original bump hunting method finds high-dimensional boxes whose redundant features must be removed manually, which significantly decreases the quality of the resulting boxes. We develop a novel simulated annealing bump hunting strategy which uses the simulated annealing method to automatically select a small number of features and extract high-quality rules. The method is implemented and tested on the experimental data.
Interpretable rules are identified which are consistent with biological and physicochemical knowledge; some of them provide new insight for understanding amino acid substitutions.

4.1 The Patient Rule Induction Method

The objective of bump hunting is to find regions in the input (feature) space with relatively high or low target values. There are many problems where finding such regions is of considerable practical interest. Several applications are as follows. In [KU03], the positive and negative responders to a new treatment were identified by the bump hunting method. In [LMH+04], the method related tissue microarray (TMA) data to censored time-to-event data. In [WKPT04], boosted PRIM solved the problem of "searching for oncogenic pathways" based on array-CGH (Comparative Genomic Hybridization) data.

The decision maker aims to choose the values of the input variables to optimize the value of the target variable, which may be numeric or a binary class label. In the case of multiple classes, the task can be reformulated as a collection of binary problems: one class is encoded with label 1 and the rest with label 0, and this is done for each class in turn. The box construction (rule extraction) strategy of PRIM consists of two phases:

1. patient successive top-down refinement, followed by
2. bottom-up recursive expansion.

Boxes with high target mean values are candidates. To guard against overfitting, optimal boxes are those with a high target mean on the test data.

From the point of view of rule induction, the amino acid substitution database contains repeated observations of a designated "output" variable $y$ along with simultaneous values of additional "input" variables $x = (x_1, \ldots, x_D)^T$ in the $\mathbb{R}^D$ (feature) space. The value of the output $y$ is a discrete value 1 or 0, indicating the potential effect of the substitution on the protein function. When we induce rules for intolerant substitutions, $y = 1$ if the substitution is intolerant and 0 otherwise; when we induce rules for tolerant substitutions, $y$ has the opposite meaning. The goal is to use these data $\{(y_k, x_k)\}_{k=1}^{N}$ to "extract" a set of rules which can "predict" likely values of $y$ for future specified values of the inputs $x$.

4.1.1 Rule induction as function optimization

Supposing that the data $\{(y_k, x_k)\}_{k=1}^{N}$ are a random sample from some (unknown) joint distribution with probability density $p(y, x)$, the goal of prediction can be characterized as obtaining the probability density of $y$ at each $x$,
$$p(y \mid x) = \frac{p(y, x)}{\int p(y, x)\, dy},$$
which in most cases is simply described by its first moment,
$$f(x) = E[y \mid x] = \int y\, p(y \mid x)\, dy.$$
Expressing the output $y$ as
$$y = E[y \mid x] + (y - E[y \mid x]) = f(x) + \varepsilon,$$
we see that the prediction problem can be cast as one of function estimation, where the goal is to approximate a deterministic target function $f(x)$ from a set of observations with noise $\varepsilon$. The noise represents the random distribution of the output $y$ about its mean value $f(x)$ at each $x$, and reflects the fact that specifying a simultaneous set of input values does not specify a unique $y$; other factors, not captured by the set of measured inputs, influence the output value.
Specifically, we can seek a subregion of the space of the input values within which the average value of target is much larger (smaller) than its average over the entire input space. Without loss of generality, letS d be the set of all possible values for the input variablex d (in thed-th dimension), i.e., fx d k 2S d g N k=1 : The entire input domain S can then be represented by D-dimensional outer product space S =S 1 £¢¢¢£S D : The goal is to find a subregionR of the input domainS,R½S, for which ¹ ° R = R x2R f(x)p ( x)dx R x2R p(x)dx À R x2S f(x)p ( x)dx R x2S p(x)dx = ¹ ° S ; where ¹ ° R is the average target over the subregion and ¹ ° S is the average target over the entire input space. An important property of any such subregion is the size of the subregion, ¯ R = Z x2R p(x)dx: Typically, the targets f(x k ) (k = 1;:::;N) are unknown and thus y k (k = 1;:::;N) is used instead. In this case, the support and the average output over the subregion is calculated by ^ ¯ R = 1 N X x I(x2R); 77 and ^ ° R = ¹ y R = 1 N ^ ¯ R X x yI(x2R); whereI(x) is an indicating function which is equal to 1 if thex is true and 0 otherwise. The subregion in the above problem can be described and interpreted in terms of important characteristics of the domain knowledge. In most cases, this implies that the solution regionR be specified by simple statements (logical conditions) involving the values of the individual input variablesfx d g D d=1 . Such “rules” take the form R= J [ j=1 B j : That is, the solution region is taken to be the union of a set of simple defined subregions fB j g J j=1 . Let s d j be the subset of the possible values of the input variable x d ; that is fs d j µ S d g D d=1 where eachS d represents all possible x d . Then eachB j is taken to be a “box” B j =s 1 j £¢¢¢£s d j within the entire input space. Particularly, each box is described by the intersection of subsets of values of each input x2B j = D \ d=1 (x d 2s d j ): For inputs that are real valued, the subsets can be represented by contiguous subintervals s d j =[b d j¡ ;b d j+ ]: Thus, the projection of a box B j on the subspace of real valued inputs is a hyper- rectangle. 78 4.1.2 Top-down peeling and bottom-up pasting This problem of finding a subregion with average target significantly greater than the average target in the whole domain space has been explored using the Patient Rule Induction Method (PRIM) [FF99], which can be thought of as a steepest-ascent method aiming at searching for a set of optimal values of a given target in high-dimensional space. The PRIM treats the searching space as a large “box” in theR d space and repeat- edly shrink the box in such a manner that the average value of y in the box increases until a (not very) small box contains optimal values is obtained. The final box is then the rule extracted by PRIM. For our problem of inducting rules for amino acid substitution, the PRIM method can be mathematically described as follows. A boxB in theR d space is described as B = D \ d=1 B d ; where D = 27, interval B d = [b d¡ ;b d+ ] is the boundary in the d-th dimension for the box. The location of the k-th data point in the d-th dimension is described by an indicator ± d k = 8 < : 1; x d k 2[b d¡ ;b d+ ]; 0; otherwise: Another indicator is further introduced as ± k = D Y d=1 ± d k ; 79 to describes whether the k-th data point is located within the box or not. 
The size of a box $B$ is quantified by the (normalized) number of data points falling into the box,
$$\beta_B = \frac{1}{N} \sum_{k=1}^{N} \delta_k.$$
The average value of the output $y$ for data points located in the box $B$ is referred to as the box-mean and is calculated by
$$\gamma_B = \frac{1}{N \beta_B} \sum_{k=1}^{N} \delta_k y_k.$$
These definitions make both the box-size and the box-mean take values in the interval $[0, 1]$, independent of the number of data samples.

PRIM then intends to find a box (marked by a set of intervals $\{[b_{d-}, b_{d+}]\}_{d=1}^{D}$) such that the box-mean $\gamma_B$ is as large as possible while the size of the box $\beta_B$ is not too small (larger than a given threshold $\beta_0$). This objective can be viewed as an optimization problem of searching the box space for a maximal box-mean $\gamma_B$ under the constraint $\beta_B > \beta_0$:
$$\max \gamma_B, \qquad \text{s.t. } \beta_B > \beta_0,$$
where $\beta_0$ is a predefined threshold (minimum size, e.g., 0.05) of the box.

PRIM deals with this problem by a "top-down peeling" algorithm and a "bottom-up pasting" algorithm. The top-down peeling algorithm starts from the whole search space (the initial box) and repeatedly tries to shrink the box by removing some data points within it. In detail, at each iteration step a small subbox $b$ within the current box $B$ is removed. The particular subbox $b^{\ast}$ chosen for removal, to produce the next (smaller) box in the sequence, is the one that yields the largest output mean within the next box $B - b^{\ast}$:
$$b^{\ast} = \arg\max_{b \in C(b)} \gamma_{(B - b)},$$
where $C(b)$ represents the class of potential subboxes eligible for removal. The current box is then updated, $B \leftarrow B - b^{\ast}$, and the procedure is repeated on this new smaller box. This "peeling" away of small subboxes continues until the support of the current box $\beta_B$ falls below the threshold $\beta_0$. Without loss of generality, the set of eligible boxes $C(b)$ can be generated by setting
$$b_{d-} = \{x \mid x^d < x_{(\alpha)}^d\} \quad \text{or} \quad b_{d+} = \{x \mid x^d > x_{(1-\alpha)}^d\}$$
for each dimension. Here $x_{(\alpha)}^d$ and $x_{(1-\alpha)}^d$ are the $\alpha$-quantile and $(1-\alpha)$-quantile of the $d$-th dimension for data within the current box, respectively. $\alpha$ is a meta-parameter usually taken to be quite small (e.g., 0.05). As an example, Figure 4.1 illustrates the peeling procedure.

Figure 4.1: Illustration of the PRIM algorithm. There are two classes, indicated by the circle (class 0) and diamond (class 1) points, respectively. The procedure starts with a rectangle surrounding all of the data, and then peels away points along one edge by a prespecified amount in order to maximize the mean of the points remaining in the box. Starting at the top left panel, the sequence of peeling is shown, until a pure diamond region is isolated in the bottom right panel.
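Below is a minimal sketch (illustrative only, not the SuperGEM implementation) of the top-down peeling loop just described, assuming numeric features and the $\alpha$, $\beta_0$ notation above; the toy data are placeholders.

```python
# Top-down peeling: repeatedly remove the alpha-quantile slice (on one side of
# one dimension) that leaves the largest box-mean, until the support drops to beta0.
import numpy as np

def top_down_peeling(X, y, alpha=0.05, beta0=0.05):
    lo, hi = X.min(axis=0).copy(), X.max(axis=0).copy()   # initial box = whole space
    N, D = X.shape
    while True:
        inside = np.all((X >= lo) & (X <= hi), axis=1)
        if inside.mean() <= beta0:
            break
        best = None                                        # (mean, dim, side, new_bound)
        Xin = X[inside]
        for d in range(D):
            for side, q in (("lo", alpha), ("hi", 1 - alpha)):
                cut = np.quantile(Xin[:, d], q)
                keep = inside & ((X[:, d] >= cut) if side == "lo" else (X[:, d] <= cut))
                if keep.sum() == 0:
                    continue
                m = y[keep].mean()
                if best is None or m > best[0]:
                    best = (m, d, side, cut)
        if best is None:
            break
        _, d, side, cut = best
        if side == "lo":
            lo[d] = cut
        else:
            hi[d] = cut
    return lo, hi

# Example with made-up data: the box should tighten around the high-y region.
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 2))
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.6)).astype(float)
print(top_down_peeling(X, y))
```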
The goal of the top-down peeling algorithm is to find a box covering a subregion of the input variable space within which the target mean is relatively large. The boundaries of this box are determined by the particular values of the variables that defined the subboxes chosen for peeling at the various stages of the top-down refinement procedure. Except for the last one, these final box boundaries were determined at earlier steps of the peeling sequence, without knowledge of the later peels that further refined the boundaries on other input variables. It is therefore possible that the final box can be improved by readjusting some of its boundaries. This is done through a bottom-up "pasting" strategy, which is basically the inverse of the peeling procedure. Starting with the peeling solution, the current box $B$ is iteratively "enlarged" by pasting onto it a small subbox $b^{\ast}$. The subbox $b^{\ast}$ chosen is the one from an eligible class $b \in C(b)$ that maximizes the output mean in the new (larger) box. The eligible class of pasting subboxes is defined analogously to those used for peeling. Without loss of generality, for real-valued input variables, eligible subboxes are represented by small intervals extending the upper and lower boundaries of the current box $B$ on that variable. The width of each of these intervals is chosen so as to contain $\alpha N_B$ data samples, where $\alpha$ is the peeling fraction and $N_B$ is the number of data points in the current box $B$.

4.1.3 Trajectory plot

Often, application domain considerations place constraints on the support of the induced box. The support of the cross-validated estimate of the optimal box may be too small to be useful. For example, it may identify characteristics of intolerant amino acid substitutions, but the support may be too small to allow sufficient observation of that kind of substitution. Therefore, boxes with larger support but slightly lower mean can often provide more meaningful understanding of amino acid substitutions. In PRIM, application domain judgements, such as the trade-off between box mean and support, are made by the user via a "trajectory plot", as illustrated in Figure 4.2. In detail, the set of candidate boxes obtained in the (nested) peeling procedure is plotted in a trajectory plot, in which the x-axis is the support of the box (ranging from 0.0 to 1.0) and the y-axis is the mean of the box (also ranging from 0.0 to 1.0). The user can then trade off the quality of the induced rules (box mean) against the coverage of the rules (box support). More generally, boxes obtained using different peeling fractions ($\alpha$) can be plotted on the same figure, forming a multiple trajectory plot and enabling the user to make a (possibly) better trade-off.

Figure 4.2: Illustration of the trajectory plot. Mean versus support for boxes induced by the top-down peeling algorithm. The most appropriate trade-off is chosen by the user.

4.2 Simulated Annealing Bump Hunting Strategy

PRIM can be thought of as a steepest-ascent search method in the box space; the final box is a (local) optimum, with no guarantee of being the global optimum. Also, the top-down peeling removes data points permanently at each iteration. Although some of the good data points can be put back by the bottom-up pasting, the repair to the box may be very limited. Moreover, it is doubtful that removing some features from the rule after the final box is obtained is an effective way to generate an optimal rule. These considerations motivate us to explore an automated feature selection methodology which can discard redundant features while extracting the rule. The basic idea is to use the simulated annealing algorithm, or peer optimization algorithms (such as the genetic algorithm), instead of steepest ascent, while incorporating the feature selection process into the algorithm. The proposed method is able to overcome the above shortcomings of PRIM.

4.2.1 Feature sets for rule extraction

Classical pattern classification methods such as the support vector machine (SVM) and the neural network (NN) give each feature a weight and combine the features either linearly (SVM) or nonlinearly (NN) to seek more discriminating power between the data samples.
For our amino acid substitution problem, the 27-D features used in the support vector machine, the decision tree, and the Gaussian mixture model are defined in Table 3.4. Rule induction methods such as PRIM and the simulated annealing bump hunting, however, cannot natively consider the weights of features or the relationships between features, although the combination of features sometimes does yield better discriminating power. Considering this, we introduce 3 extra feature groups which describe the differences between the substitution's properties and the other feature groups, extending the total number of features to 45, as shown in Table 4.1 with features labeled $x_i$ for $i = 1, \ldots, 45$. The first five feature groups are the same as those in Table 3.4. The sixth group, $x_{28}, \ldots, x_{33}$, contains the 'window-sized differences', which are the differences of the corresponding properties between the substitution and the window-sized features. Similarly, the seventh group, $x_{34}, \ldots, x_{39}$, contains the 'original amino acid differences', the differences in properties between the substitution and the original amino acid. Finally, the eighth group, $x_{40}, \ldots, x_{45}$, contains the 'column-weighted differences', the differences in properties between the substitution and the column-weighted features. We will use both the 27-D and the 45-D feature sets for rule extraction with PRIM and the simulated annealing bump hunting.

Table 4.1: Details of the proposed 45-D feature set. The columns are the three physicochemical properties (molecular weight, pI value, hydrophobicity) and the relative frequencies in the three secondary structure property sets (helices, strands, turns).
  Window-sized: x1, x2, x3, x4, x5, x6
  Original amino acid: x7, x8, x9, x10, x11, x12
  Column-weighted: x13, x14, x15, x16, x17, x18
  Substitution: x19, x20, x21, x22, x23, x24
  Evolutionary information: rf-ori x25, rf-sub x26, entropy x27
  Window-sized difference: x28, x29, x30, x31, x32, x33
  Original amino acid difference: x34, x35, x36, x37, x38, x39
  Column-weighted difference: x40, x41, x42, x43, x44, x45

4.2.2 Comparison with exhaustive search and decision tree method

For discrete problems, exhaustive search can be used to test each possible solution sequentially to determine the optimal one. Such exhaustive examination of all possibilities is also known as direct search, or the "brute force" method. When the total number of possible solutions is small, the exhaustive search method works well. When the total number of possible solutions is huge, direct (exhaustive) search is prohibited by limitations of time and/or storage space. For example, for the 45-D amino acid substitution data samples, if we want to find boxes with a small number ($\le 4$) of features, the exhaustive search method may be feasible by enumerating all possible feature combinations and applying the classical PRIM method to each combination. When the number of features in the box goes up, however, the exhaustive search is no longer possible, for the following reasons.

1. Computation time. PRIM has to run on data sets of each possible feature combination, whose number is very large. For example, the number of all possible combinations of 5 features out of the 45 features would be $\binom{45}{5} = 1{,}221{,}759$, and a rough estimate of the time to do the exhaustive search on these combinations would be more than one month on a main-stream computer.

2. Box and feature selection. Each PRIM training generates a trajectory plot, which is composed of candidate boxes.
Users need to choose boxes according to the trade-off between box-size and box-mean, and select features manually, because PRIM is not an automatic learning procedure. This step may be possible for several hundred trajectory plots but is not practical for thousands of them.

3. Quality of the rules. PRIM searches for boxes by a steepest-ascent method; this greedy algorithm easily settles into local optima, which are not guaranteed to be the globally optimal solution.

The decision tree method can also generate rules. In a decision tree, however, samples are split into two separate groups according to the feature value at each splitting node, and samples removed at higher levels cannot be recovered at lower levels. This uses up the limited training samples quickly, and the resulting rules may be complicated or may not cover a significant part of the data set.

4.2.3 Automatic feature selection in simulated annealing bump hunting

The first shortcoming we target is that PRIM generally results in boxes of high dimension, even after manual selection of the features. For example, when applying PRIM to data samples containing the 27-dimensional inputs (as listed in Table 3.4), the resulting boxes often contain more than 15 dimensions even after manual selection, making the rules hard to interpret. In general, the dimensionality of a box represents the complexity of a rule, and thus boxes with fewer dimensions are preferred. This suggests penalizing boxes with more dimensions (rewarding boxes with fewer dimensions). At the same time, the presence of a feature in a rule corresponds to the presence of a boundary in a box, so we can embed an automated feature selection mechanism in the rule induction algorithm and perform feature selection while inducing rules.

Mathematically, the presence of a feature in a rule can be described as the presence of a boundary in a box and represented by an indicator
$$\xi_d = \begin{cases} 1, & \text{the boundary of the } d\text{-th dimension is included in the box}, \\ 0, & \text{otherwise}. \end{cases}$$
The indicator $\delta_k$ with $\xi_d$ included then becomes
$$\delta_k = \prod_{d=1}^{D} (\delta_k^d)^{\xi_d},$$
where a possible $0^0$ is set to 1 for simplicity. The formulas for the box-size $\beta_B$ and box-mean $\gamma_B$ remain unchanged. For boxes with comparable box-sizes and box-means, we prefer boxes having fewer dimensions. This is done by rewarding boxes with fewer features using an exponential distribution,
$$\tau_B = \exp\Big(-\lambda \sum_{d=1}^{D} \xi_d\Big),$$
where $\lambda$ is a hyper-parameter. $\lambda = 0$ means that we do not take the number of features (the complexity of the rule) into consideration, while positive $\lambda$ values give preference to a smaller number of features. The modified function optimization problem then requires maximizing the box-mean $\gamma_B$ using as simple a box as possible, under the constraint that the box-size $\beta_B > \beta_0$. We write this maximization problem as
$$\max \frac{1}{2}(\gamma_B + \tau_B), \qquad \text{s.t. } \beta_B > \beta_0,$$
where $\beta_0$ is a predefined threshold (minimum size) of the box, e.g., 0.05.
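Below is a minimal sketch (illustrative only) of the penalized box score just defined: given box bounds, feature-inclusion indicators $\xi$ and $\lambda$, it computes the box-size $\beta_B$, box-mean $\gamma_B$, reward $\tau_B$ and the objective $\frac{1}{2}(\gamma_B + \tau_B)$ on made-up data.

```python
# Penalized box score: only dimensions with xi[d] == 1 constrain membership
# (matching the convention 0^0 := 1 above).
import numpy as np

def box_score(X, y, lo, hi, xi, lam=0.5):
    active = xi.astype(bool)
    inside = np.all((X[:, active] >= lo[active]) & (X[:, active] <= hi[active]), axis=1)
    beta = inside.mean()                            # box-size
    gamma = y[inside].mean() if beta > 0 else 0.0   # box-mean
    tau = np.exp(-lam * xi.sum())                   # reward for few features
    return beta, gamma, 0.5 * (gamma + tau)

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = (X[:, 0] > 0.7).astype(float)
lo, hi = np.zeros(5), np.ones(5)
lo[0] = 0.7                                         # a one-feature box on x_1
xi = np.array([1, 0, 0, 0, 0])
print(box_score(X, y, lo, hi, xi))
```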
4.2.4 Simulated annealing strategy

The second shortcoming we target is that PRIM uses a steepest-ascent strategy to search for optimal boxes. In general, the steepest-ascent strategy leads to a local optimum in the search space, even with "patient" peeling fractions (very small $\alpha$). Our proposed remedy is to use a simulated annealing strategy. The simulated annealing algorithm and its statistical properties are discussed in numerous publications, such as the book by Otten and van Ginneken [OG89]. Below, we briefly outline the basics of simulated annealing and describe how we use it to find optimal boxes.

The annealing algorithm is defined on a state space $S$, a collection of individual states. Each state represents a configuration of the problem under investigation (i.e., a box). The states are related by a neighborhood system, and the set of neighbor pairs in $S$ defines a substructure $M$ in $S \times S$. The elements of $M$ are called moves. Two states $s$, $s'$ are called adjacent if they can be reached by a single move (i.e., $(s, s') \in M$). Similarly, $(s, s') \in M^k$ are said to be connected via a set of $k$ moves. In our applications, the state space is finite.

The basic idea of the annealing algorithm is the following: given a certain state, pick a move according to a selection scheme from the set of permissible moves, which leads to a new state. Compare the scores (free energies) of the old and the new state. If the score of the new state is better than the score of the old state, accept the move. If the score of the new state is not better than the score of the old state, accept the move with a certain probability. The acceptance probability depends on the scores of the two states under consideration and on a parameter indicating how far along the annealing chain is (this parameter is usually referred to as the pseudo-temperature). For any pair of scores, the further ahead we are in the annealing schedule, the lower the acceptance probability if the proposed state has a worse score than the old state. This algorithm generally leads to good-scoring states.

For our application, we define the energy function as
$$E = 1 - \frac{1}{2}(\gamma_B + \tau_B).$$
The simulated annealing repeatedly generates new boxes using the operations described below and seeks to decrease the energy. Let
$$\Delta E = E_{\text{new}} - E_{\text{old}}.$$
If a tentative new box decreases the energy $E$ ($\Delta E < 0$), it is accepted; otherwise ($\Delta E > 0$), it is accepted with probability
$$\pi_B = \exp\Big(\frac{-\Delta E}{T}\Big),$$
where $T$ is the pseudo-temperature. The calculation of $\Delta E$ should be performed in a computationally economical manner by making use of techniques such as indexing, hashing and caching. The process of accepting tentative boxes according to their energy is the Metropolis algorithm, which, along with the Gibbs sampling strategy, constitutes one of the two major branches of Monte Carlo simulation. The key trick in the simulated annealing strategy is that the pseudo-temperature keeps decreasing while the Metropolis algorithm runs, embodying the concept of the annealing process. From the point of view of optimization, a high temperature allows long-distance jumps in the search space, while a low one restricts the search to a local area.

The meta-operations for generating a new box include:

1. Left-side peeling: $r_{d-} \leftarrow r_{d-} + \alpha |r_{d+} - r_{d-}|$.
2. Right-side peeling: $r_{d+} \leftarrow r_{d+} - \alpha |r_{d+} - r_{d-}|$.
3. Left-side pasting: $r_{d-} \leftarrow r_{d-} - \alpha |r_{d+} - r_{d-}|$.
4. Right-side pasting: $r_{d+} \leftarrow r_{d+} + \alpha |r_{d+} - r_{d-}|$.
5. Feature inclusion: $\xi_d \leftarrow 1$.
6. Feature exclusion: $\xi_d \leftarrow 0$.

$\alpha$ is a hyper-parameter and is robust for values in the interval $[0.05, 0.1]$. A smaller $\alpha$ encourages more "patience", which is strongly emphasized in PRIM. The factor $|r_{d+} - r_{d-}|$ in the operations makes the shrinkage or growth depend on the size of the current box and further encourages patience. In each iteration of the simulated annealing, we randomly pick one of these meta-operations, calculate the resulting $\Delta E$, and decide whether the generated box should be accepted or rejected.
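The following is a minimal, self-contained sketch (not the dissertation's implementation) of the Metropolis acceptance rule and geometric cooling described above; a toy 1-D energy function stands in for $E = 1 - \frac{1}{2}(\gamma_B + \tau_B)$, and a random perturbation stands in for the peel/paste/feature meta-operations.

```python
# Simulated annealing skeleton: propose a random move, accept if the energy
# decreases, otherwise accept with probability exp(-dE / T), then cool T.
import math
import random

def energy(x):
    # Toy energy with many local minima; the global minimum is near x = 0.
    return (x ** 2) / 10.0 + math.sin(3.0 * x) + 1.0

random.seed(0)
state, T, cooling = 5.0, 1.0, 0.9999
best = state
for _ in range(50_000):
    candidate = state + random.uniform(-0.5, 0.5)     # a random "move"
    dE = energy(candidate) - energy(state)
    if dE < 0 or random.random() < math.exp(-dE / T): # Metropolis acceptance
        state = candidate
        if energy(state) < energy(best):
            best = state
    T *= cooling                                      # T <- T * dT
print("best state:", round(best, 3), "energy:", round(energy(best), 3))
```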
In order to extract more than one rule (finding more than one box), we can run the above simulated annealing algorithm more than once, while penalizing boxes that have already been found. This is done by introducing the concept of decay for the points included in a box; a decay ratio $\epsilon$ is used for this purpose. Suppose we repeatedly apply the simulated annealing bump hunting strategy to the database, each run giving us a box. Let $n_k$ be the number of times that the $k$-th data point is located in any of these identified boxes. The updated formula for calculating the box-mean is then
$$\gamma_B = \frac{1}{N \beta_B} \sum_{k=1}^{N} \delta_k y_k \epsilon^{n_k}.$$

In summary, the whole simulated annealing bump hunting strategy is as follows:

1. Initialization. Set the initial box $B = \bigcap_{d=1}^{D} [b_{d-}, b_{d+}]$ with $b_{d-} = \min_{k=1}^{N} \{x_k^d\}$ and $b_{d+} = \max_{k=1}^{N} \{x_k^d\}$. Set the peeling fraction $\alpha = 0.05$. Set the initial pseudo-temperature $T = 1.0$. Select a suitable threshold $\beta_0$.

2. Random move. Select a meta-operation at random. Calculate $\Delta E$. If $\Delta E < 0$, accept the random move; otherwise, accept the random move with probability $\exp(-\Delta E / T)$.

3. $T \leftarrow T \times \Delta T$ (e.g., $\Delta T = 0.9999$).

4. Repeat step (2) and step (3) until convergence.

4.3 Simulation Results

4.3.1 Selection of hyperparameters

The factor $\lambda$ controls the reward for less complex rules. In order to determine a suitable $\lambda$, we plotted the reward $\tau$ versus the number of features for different values of $\lambda$ in Figure 4.3. Generally, a large $\lambda$ allows only simple rules, while a small $\lambda$ allows both simple and complex rules. In our application, we choose $\lambda = 0.5$ so that the simulated annealing strategy is very likely to find rules with 1 to 8 features.

[Figure: reward $\tau$ versus the number of features for $\lambda$ = 1.1, 0.8, 0.5 and 0.2.]
Figure 4.3: Determination of the rewarding factor $\lambda$.

4.3.2 Box training and comparison

The sample space contains 4,549 substitutions (as shown in Table 3.5), each defined by the 27 features (inputs) and one output. For the intolerant model, the outputs for intolerant substitutions are assigned the value 1 and the outputs for tolerant substitutions the value 0; for the tolerant model, the outputs have the opposite values. Data sets are randomly divided into a training set (containing 2/3 of the data) and a test set (containing the remaining 1/3 of the data). The simulated annealing bump hunting method is then applied to the training set, and the resulting rules are evaluated on the test set. As an example, Figure 4.4 lists five boxes obtained by the simulated annealing bump hunting for the intolerant and tolerant models, respectively.

The results are then compared with the PRIM method, as implemented in the SuperGEM software (http://www-stat.stanford.edu/~jhf/SuperGEM.html). Candidate boxes for the intolerant and tolerant models are shown in Figure 4.5 and Figure 4.6, where the numbers beside the points are the numbers of features included in the boxes. Figure 4.5 and Figure 4.6 show that the simulated annealing bump hunting strategy can identify boxes with a smaller number of features.
Figure 4.4: Boxes trained by the simulated annealing bump hunting method. "Number of features" is the number of features included in the corresponding box; "number of samples" is the number of data samples inside the box for the corresponding training or test set; "coverage rate" is the proportion of data samples inside the box for the corresponding training or test set.
  Intolerant model (training set / test set):
    Box 1: mean 0.92, 4 features, 36 samples, coverage 0.010 / mean 1.00, 11 samples, coverage 0.0064
    Box 2: mean 0.94, 3 features, 51 samples, coverage 0.015 / mean 1.00, 15 samples, coverage 0.0087
    Box 3: mean 0.96, 3 features, 36 samples, coverage 0.010 / mean 0.90, 10 samples, coverage 0.0058
    Box 4: mean 0.93, 5 features, 39 samples, coverage 0.011 / mean 1.00, 11 samples, coverage 0.0064
    Box 5: mean 0.96, 3 features, 36 samples, coverage 0.010 / mean 0.91, 11 samples, coverage 0.0064
  Tolerant model (training set / test set):
    Box 1: mean 0.94, 3 features, 217 samples, coverage 0.062 / mean 0.90, 71 samples, coverage 0.041
    Box 2: mean 0.97, 2 features, 156 samples, coverage 0.045 / mean 0.96, 52 samples, coverage 0.030
    Box 3: mean 0.94, 4 features, 253 samples, coverage 0.072 / mean 0.89, 71 samples, coverage 0.042
    Box 4: mean 0.92, 1 feature, 228 samples, coverage 0.066 / mean 0.90, 72 samples, coverage 0.041
    Box 5: mean 0.91, 2 features, 242 samples, coverage 0.070 / mean 0.90, 89 samples, coverage 0.051

[Figure: box-mean versus box-size for the intolerant model; our boxes use 1-5 features, while the SuperGEM boxes use 14-20 features.]
Figure 4.5: Comparison of the boxes obtained by the simulated annealing bump hunting strategy and the original bump hunting method (the SuperGEM software) for the intolerant model. Red round points are boxes trained by the simulated annealing bump hunting method and green triangle points are boxes trained by SuperGEM. The number beside each point denotes the number of features included in the corresponding box.

For example, for the tolerant substitutions, our method trains a box with a box-mean of around 0.97 using two features. By comparison, SuperGEM trains a box with a box-mean of around 0.97 but with seventeen features. In SuperGEM, the removal of redundant features from the final box must be done manually; in contrast, our simulated annealing bump hunting method automatically selects important features during the training process.

[Figure: box-mean versus box-size for the tolerant model; our boxes use 2-5 features, while the SuperGEM boxes use 12-17 features.]
Figure 4.6: Comparison of the boxes obtained by the simulated annealing bump hunting strategy and the original bump hunting method (the SuperGEM software) for the tolerant model. Red round points are boxes trained by the simulated annealing bump hunting method and green triangle points are boxes trained by SuperGEM. The number beside each point denotes the number of features included in the corresponding box.

4.3.3 Interpretation of the rules

For the rules (boxes) with only a few features identified by our simulated annealing bump hunting method, interpreting these rules in chemical and physical terms is very helpful for associating the amino acid substitutions with physicochemical and biological knowledge. As a demonstration, we list six rules (three for intolerant substitutions and three for tolerant ones) and explain them in chemical and physical terms. For the intolerant model, the rule examples for the 27-D substitution samples are shown in Figure 4.7 and Figure 4.8, and the rule example for the 45-D substitution samples is shown in Figure 4.9.

IF (0.822 ≤ the window-sized relative frequency in helices ≤ 1.109) AND (0.403 ≤ the column-weighted relative frequency in turns ≤ 0.712) AND (the substitution is K) THEN (the amino acid substitution is intolerant)
Figure 4.7: Example rule of the intolerant model, for 27-D samples. This rule covers 2.21% of the data samples with an accuracy of 90.2%.
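Rules like the one in Figure 4.7 are axis-aligned boxes, so applying them is just interval checking. Below is a minimal sketch (illustrative only; feature names and the sample values are made up) of representing such a rule as a mapping from features to intervals and testing a substitution against it.

```python
# Apply a rule (a set of feature intervals plus optional equality conditions)
# to a sample represented as a dict of feature values.
def rule_matches(sample, intervals, equals=None):
    """Return True if every interval condition (and optional equality) holds."""
    equals = equals or {}
    in_intervals = all(lo <= sample[f] <= hi for f, (lo, hi) in intervals.items())
    in_equals = all(sample[f] == v for f, v in equals.items())
    return in_intervals and in_equals

intolerant_rule = {
    "window_rel_freq_helices": (0.822, 1.109),
    "column_rel_freq_turns": (0.403, 0.712),
}
sample = {"window_rel_freq_helices": 0.95, "column_rel_freq_turns": 0.55, "substitution": "K"}
print(rule_matches(sample, intolerant_rule, equals={"substitution": "K"}))  # True
```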
IF (-1.172 ≤ the window-sized hydrophobicity scale ≤ 1.6222) AND (-1.79 ≤ the original amino acid's hydrophobicity scale ≤ 3.7758) AND (-0.5322 ≤ the column-weighted hydrophobicity scale ≤ 3.1075) AND (-4.0639 ≤ the substitution's hydrophobicity scale ≤ -1.4088) AND (the original amino acid's relative frequency in the MSA is high [0.3677, 0.8594]) THEN (the amino acid substitution is intolerant)
Figure 4.8: Example rule of the intolerant model, for 27-D samples. This rule covers 2.09% of the data samples with an accuracy of 94.7%.

IF (118.4887 ≤ the window-sized molecular weight ≤ 135.6061) AND (83.1369 ≤ the column-weighted molecular weight ≤ 107.4207) AND (35.8616 ≤ the molecular weight difference between the original amino acid and the substitution ≤ 91.8048) AND (the original amino acid's relative frequency in the MSA is high [0.3363, 0.9425]) THEN (the amino acid substitution is intolerant)
Figure 4.9: Example rule of the intolerant model, for 45-D samples. This rule covers 2% of the data samples with an accuracy of 94.5%.

The intuitive interpretations of these rules in chemical and physical terms are as follows:

- Rule 1 (Figure 4.7): If the original amino acid stays in an environment where the neighboring amino acids intermediately favor to be in strands, and a majority of the amino acids in the same column of the multiple sequence alignment intermediately favor to be in turns, an intolerant substitution will happen if the original amino acid is substituted by the amino acid K. This rule covers 2.21% of the data samples with an accuracy of 90.2%.

- Rule 2 (Figure 4.8): If the original amino acid is hydrophobic and is in a hydrophobic environment (with high window-sized and column-weighted hydrophobicity scales), and at the same time it is conserved in its column of the MSA, then an intolerant substitution will happen if the original amino acid is substituted by a hydrophilic residue. This rule covers 2.09% of the data samples with an accuracy of 94.7%.

- Rule 3 (Figure 4.9): If the original amino acid is in a small-molecule environment (with small window-sized and column-weighted molecular weights), and at the same time it is conserved in its column of the MSA, then an intolerant substitution will happen if the original amino acid is substituted by a bigger residue (with larger molecular weight). This rule covers 2% of the data samples with an accuracy of 94.5%.

Similarly, for the tolerant model, the rule examples for the 27-D substitution samples are shown in Figure 4.10 and Figure 4.11, and the rule example for the 45-D substitution samples is shown in Figure 4.12.

IF (the original amino acid is K) AND (-1.144 ≤ the window-sized hydrophobicity scale ≤ -0.253) THEN (the amino acid substitution is tolerant)
Figure 4.10: Example rule of the tolerant model, for 27-D samples. This rule covers 4.47% of the data samples with an accuracy of 97.1%.

IF (5.517 ≤ the window-sized pI value ≤ 7.1458) AND (3.1947 ≤ the original amino acid's pI value ≤ 9.6732) AND (5.9637 ≤ the column-weighted pI value ≤ 8.6219) AND (3.8609 ≤ the substitution's pI value ≤ 6.1957) AND (the original amino acid's relative frequency in the MSA is low [0.0034, 0.1001]) THEN (the amino acid substitution is tolerant)
Figure 4.11: Example rule of the tolerant model, for 27-D samples. This rule covers 8.16% of the data samples with an accuracy of 95.6%.

We give intuitive interpretations of these rules in chemical and physical terms as follows:

- Rule 4 (Figure 4.10): If the original amino acid is K and it stays in an environment where the neighboring amino acids are more hydrophilic, its substitutions will be tolerant.
This rule covers 4.47% of the data samples with an accuracy of 97.1%.

- Rule 5 (Figure 4.11): If the original amino acid does not prefer a specific pI value and it is not conserved in its column of the MSA, then a tolerant substitution will happen if the original amino acid is substituted by an acidic amino acid. This rule covers 8.16% of the data samples with an accuracy of 91.6%.

- Rule 6 (Figure 4.12): If the original amino acid is not conserved in its column of the MSA and is substituted by an amino acid that less prefers turns (when compared with the original amino acid, its neighbors, and the residues in the same column of the MSA), then the substitution is tolerant. This rule covers 5.03% of the data samples with an accuracy of 93.0%.

IF (-0.4441 ≤ the difference between the window-sized propensity in turns and that of the substitution ≤ -0.0257) AND (-1.2896 ≤ the difference between the original amino acid's propensity in turns and that of the substitution ≤ -0.0464) AND (-1.0081 ≤ the difference between the column-weighted propensity in turns and that of the substitution ≤ -0.2875) AND (the original amino acid's relative frequency in the MSA is low [0.0195, 0.0957]) THEN (the amino acid substitution is tolerant)
Figure 4.12: Example rule of the tolerant model, for 45-D samples. This rule covers 5.03% of the data samples with an accuracy of 93.0%.

4.4 Conclusion and Future Works

We have designed a set of simple features using three general physicochemical characteristics of the amino acids and three relative frequencies of the amino acids in protein secondary structures, together with the multiple sequence alignment of the query protein sequence. Five groups of features are listed in Table 3.4. Amino acid substitution samples of the lac repressor and the T4 lysozyme are defined by these interpretable features. We propose a novel rule induction method (the simulated annealing bump hunting strategy) to automatically select features and extract meaningful rules for intolerant and tolerant substitutions. These rules are either consistent with current knowledge or provide new insight for understanding the chemical and physical principles of amino acid substitutions.

One limitation of the simulated annealing method is that there are free parameters ($\lambda$ and $\beta_0$) to set. From the point of view of optimization, free parameters give the method more flexibility. For example, a smaller $\lambda$ enables the method to extract rules with a smaller number of features, and thus simpler rules, while a larger $\lambda$ can yield more complex rules but with generally better coverage than the simple ones. How to make the trade-off between the complexity and the coverage of the rules remains an ongoing study. Another limitation is that our method is currently applied to only two unbiased experimental data sets. Our future work will concentrate on extracting general rules for amino acid substitutions from combined data sets of several proteins with different functions and various origins. With more substitution data available, more insightful rules can be obtained by our method, and our understanding of the principles of amino acid substitutions can go deeper.

Chapter 5
Prediction Model Design and Prioritization Analysis for Human Disease-related Mutations

We designed a Multiple Selection and Rule Voting (MS-RV) model to predict and prioritize human disease-associated amino acid substitutions. This model integrates data partition and feature selection, and uses an ensemble rule voting strategy to make predictions.
The MS-RV model is made up of 20 modules, each of which is a tree ensemble trained on bootstrapped training data sets for one of the 20 original amino acids. Trees are built using the CART methodology [BFSO84] to maximum tree size without pruning. At each node, a small number of features are randomly selected from the module's optimal feature set to split on. A module's optimal feature set is generated through sequential forward selection (SFS) [PNK94, OLM04], based on a given optimization criterion, namely the area under the ROC curve (AUC) from 10-fold cross-validation using the MS-RV model. First, the single best feature corresponding to the highest AUC is selected; then one more feature is added that gives the largest increase in AUC. The updating procedure is iterated until the AUC stops increasing or all features are included. For a test data point, each tree gives its probabilities of being disease-associated or polymorphism, and the prediction is made by averaging the probabilities from all trees in the module.

A wide range of amino acid substitutions is collected from the Swiss-Prot database [BAW+05], the HapMap database [Con03] and the Ensembl database [BAC+06]. The MS-RV model predicts and prioritizes the following groups of mutations: mitochondrial mutations, specific human disease-related mutations, mutations on specific genes, and mutations from neighboring chromosomal regions.

For the purpose of interpretation, we adopted a 26-D feature set from those listed in Chapter 3. This feature set includes 24 physicochemical or relative frequency properties (each of the six amino acid properties being calculated in four different situations) and 2 conservation scores (for the original and the substituted amino acids). Table 5.1 summarizes this feature set, with features labeled $X_i$ for $i = 1, \ldots, 26$.

Table 5.1: Details of the 26-D feature set. Each of the 6 amino acid properties (molecular weight, pI value, hydrophobicity, and relative frequency in helices, strands and turns) is calculated in 4 situations, forming X1-X24, and the conservation scores (relative frequency in the MSA) for the original and the substituted amino acids are X25 and X26, respectively.
  Original: X1, X2, X3, X4, X5, X6; conservation X25
  Substitution: X7, X8, X9, X10, X11, X12; conservation X26
  Window-sized: X13, X14, X15, X16, X17, X18
  Column-weighted: X19, X20, X21, X22, X23, X24

We collected thirty 10-Mb neighboring chromosomal regions, each centered at a gene whose amino acid substitutions are associated with a monogenic human disease and within which the numbers of disease-associated and polymorphism substitutions are comparable. With the MS-RV model, a rank-sum test shows that the prediction scores of the two types of substitutions are statistically significantly different, and the normalized average ranking of the disease-associated substitutions is relatively high. We further calculated sensitivity and specificity values [ALM+06]. Sensitivity refers to the normalized proportion of disease-associated substitutions that are ranked above a particular rank threshold; specificity refers to the normalized proportion of polymorphism substitutions that are ranked below this threshold. To explore the MS-RV model's ability to prioritize disease-associated substitutions, we plotted rank receiver operating characteristic (ROC) curves and calculated the area under the curve (AUC). AUC is a standard measure of performance: an AUC value of 100% indicates that every disease-associated substitution is ranked first, whereas a value of 50% means that disease-associated substitutions are ranked randomly.
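Below is a minimal sketch (illustrative only) relating the rank-based comparison above to the AUC: the AUC equals the probability that a randomly chosen disease-associated substitution receives a higher prediction score than a randomly chosen polymorphism. The scores are made-up placeholders.

```python
# Rank AUC from two groups of prediction scores; ties are counted as 0.5.
import numpy as np

def rank_auc(disease_scores, poly_scores):
    d = np.asarray(disease_scores)[:, None]
    p = np.asarray(poly_scores)[None, :]
    return ((d > p).sum() + 0.5 * (d == p).sum()) / (d.size * p.size)

rng = np.random.default_rng(0)
disease = rng.normal(0.7, 0.15, size=50)   # hypothetical prediction scores
poly = rng.normal(0.4, 0.15, size=200)
print("AUC =", round(rank_auc(disease, poly), 3))
```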
Over these thirty 10-Mb regions, the MS-RV model achieves an overall AUC score of 86.6% for substitutions collected from the Swiss-Prot database [BAW+05], and 82.8% for substitutions collected from all three databases. For polygenic diseases such as familial Alzheimer disease-1 (AD1), Alzheimer disease-3 (AD3) and Alzheimer disease-4 (AD4), mutations associated with the diseases were prioritized among the top rankings.

For the 1,517 substitutions annotated as "Unclassified" in the Swiss-Prot database, those with high prediction scores are predicted to be disease-associated. Ten of them, from 4 human genes, are studied in detail, and published references provide evidence to support our conclusions.

5.1 Multiple selection and rule voting (MS-RV) prediction model

We propose a multiple selection and rule voting (MS-RV) model to predict the effects of amino acid substitutions. There are 20 modules in the MS-RV model; they integrate data partition and feature selection, and make predictions using a rule voting strategy.

5.1.1 Module training on partitioned data sets

According to the original amino acid, the substitution data are partitioned into 20 sets, which correspond to the 20 modules of the MS-RV model, as shown in Figure 5.1. Each module is a tree ensemble trained on bootstrapped data sets. Trees are grown to maximum size without pruning. At each node, a small number of features are randomly selected from that module's optimal feature set to split on. The next subsection explains how the optimal feature set of a module is selected.

Figure 5.1: The training of the 20 modules in the MS-RV model. Each module is trained on the substitution data with one of the 20 original amino acids; its tree ensemble is built on bootstrapped data sets using the optimal feature set.

5.1.2 Sequential forward selection of the optimal feature set

Each module has its own optimal feature set, which is generated through sequential forward feature selection as illustrated in Figure 5.2. We initialize a module's feature pool to contain all features and its optimal feature set to be empty. At each step, the feature that most increases a specific criterion is added to the optimal feature set. In this work, the criterion is the AUC of a 10-fold cross validation with the MS-RV model. The procedure continues until the criterion stops increasing with the addition of any further feature, or until all features have been included. The optimal feature set is then used to build the tree ensemble.

Figure 5.2: The generation of the optimal feature set for each module using sequential forward feature selection. Starting from empty, the optimal feature set is updated by adding the feature that most increases the criterion; the procedure continues until the criterion no longer increases or all features have been included.
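A minimal sketch of this forward-selection loop is given below. The evaluate_auc callback stands in for the 10-fold cross validation of an MS-RV module and is a hypothetical placeholder, not code from this work.

```python
def sequential_forward_selection(all_features, evaluate_auc):
    """Greedy forward feature selection driven by a validation criterion.

    all_features : candidate feature labels, e.g. ["X1", ..., "X26"]
    evaluate_auc : callable mapping a feature subset to the cross-validated
                   AUC of the module trained on that subset (placeholder for
                   the 10-fold cross validation used as the criterion here)
    """
    pool = list(all_features)     # features not yet selected
    selected = []                 # optimal feature set, initially empty
    best_auc = 0.0
    while pool:
        # try each remaining feature and keep the one giving the highest AUC
        auc, feature = max((evaluate_auc(selected + [f]), f) for f in pool)
        if auc <= best_auc:       # criterion no longer increases: stop
            break
        best_auc = auc
        selected.append(feature)
        pool.remove(feature)
    return selected, best_auc
```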
5.1.3 Prediction with rule voting strategy

A test mutation enters the module determined by its original amino acid. Each tree in this module gives the probabilities of the mutation being a polymorphism or disease-associated, and the averages of the probabilities from all trees give the prediction result. The testing procedure is shown in Figure 5.3.

Figure 5.3: Testing mutation data with the MS-RV model using the rule voting strategy.

5.2 Performance evaluation criteria

The validity of feature sets and the effectiveness of prediction models are evaluated using 10-fold cross-validation experiments, with a fair measurement obtained by averaging the results of 10 independent experiments. Two criteria are used for comparison. The first is the AUC, the Area Under the ROC (Receiver Operating Characteristic) Curve, which provides a comprehensive view of the prediction power of a given method. The other criterion is the balanced error rate (BER). Given the 10-fold cross-validation results and a decision threshold, we can calculate the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Considering that the numbers of positive (disease-associated) and negative (polymorphism) samples are very different from each other, we use the balanced error rate defined as

    BER = (1/2) * ( FP / (TN + FP) + FN / (TP + FN) ).

To test whether a prediction model can differentiate the two types of amino acid substitutions, the rank-sum test is used to compare the prediction scores of polymorphism and disease-associated substitutions. It is a nonparametric alternative to the two-sample t-test, and is based solely on the order in which the observations from the two samples fall. The p-value indicates whether the difference between the two samples is statistically significant.

Prioritization is performed for the amino acid substitutions in a data sample. We ranked the substitutions according to the prediction scores from the MS-RV model and calculated the average ranking of the disease-associated ones. We also studied the sensitivity and specificity [ALM+06]. Sensitivity is defined as the normalized proportion of disease-associated substitutions that are ranked above a threshold rank. Specificity is defined as the normalized proportion of polymorphisms ranked below this threshold. For example, a sensitivity/specificity value of 80/90 means that the disease-associated substitutions were ranked among the best-ranking 10% of the substitutions in 80% of the prioritizations. The rank receiver operating characteristic (ROC) curves are plotted from these sensitivity/specificity values, and the area under the ROC curve (AUC) measures the performance of a prediction method. The higher the AUC value, the more likely a disease-associated substitution is to be ranked near the top, and the better the performance.

5.3 Effectiveness of the MS-RV model on the proposed feature set

The amino acid substitutions related to human diseases are predicted using various methods. By varying the decision cutoff, a receiver operating characteristic (ROC) curve plots the true positive rate [TP/(TP+FN)] against the false positive rate [FP/(TN+FP)] to show the performance graphically. Figure 5.4 shows the ROC curves of 10-fold cross-validation for the support vector machine (AUC = 0.8058), random forests (AUC = 0.8220) and the MS-RV model (AUC = 0.8484). The MS-RV model produces the largest AUC value, which indicates that it is superior to the other two state-of-the-art methods in terms of overall prediction power.
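The two evaluation criteria defined in Section 5.2 can be computed with standard tools once cross-validated prediction scores are available. The sketch below uses scikit-learn and made-up labels and scores purely for illustration; it is not the evaluation code of this work.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Illustrative labels and scores only (1 = disease-associated, 0 = polymorphism).
y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])

auc = roc_auc_score(y_true, y_score)            # area under the ROC curve

# Balanced error rate at a 0.5 decision threshold, following the BER definition above.
y_pred = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ber = 0.5 * (fp / (tn + fp) + fn / (tp + fn))

print(f"AUC = {auc:.3f}, BER = {ber:.3f}")
```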
Figure 5.4: Prediction performance comparison on amino acid substitutions related to human diseases. ROC curves (true positive rate versus false positive rate) are drawn from the prediction results of the support vector machine (AUC = 0.8058), the random forest (AUC = 0.8220) and the MS-RV model (AUC = 0.8484).

5.4 Differentiation of mutations related to human diseases

An effective prediction method should differentiate the disease-associated substitutions from the polymorphism ones. The following amino acid substitution data sets were predicted by the MS-RV model, and the results were analyzed with the rank-sum test.

5.4.1 Mitochondrial mutations

Mutations in mitochondrial DNA (mtDNA) affect the function of the mitochondria and cause mitochondrial diseases such as Leber's Hereditary Optic Neuropathy (LHON), Mitochondrial Encephalopathy Lactic Acidosis Stroke (MELAS) and Leigh syndrome. LHON, also known as Leber Optic Atrophy, is an optic nerve disease which causes a loss of central vision within two to eight weeks and usually affects both eyes [MTKPB03]. Three primary mutations (G3460A, T14484C and G11778A) account for 90% of the worldwide cases of LHON [BSW97]. MELAS [SN06] is a neurodegenerative disorder and a rare form of dementia. Its symptoms are brain dysfunction (encephalopathy) with seizures and headaches, as well as muscle disease with a build-up of lactic acid in the blood (a condition called lactic acidosis), temporary local paralysis (stroke-like episodes), and abnormal thinking (dementia). Leigh syndrome [MPM+87] is also a disorder characterized by degeneration of the central nervous system (brain, spinal cord, and optic nerve). In this work, mitochondrial disease-associated mutations are those with reported disease associations, and polymorphism mutations are those not causing any of these mitochondrial diseases.

From the Swiss-Prot database [BAW+05], there are 33 amino acid substitutions associated with the above three mitochondrial diseases, together with 91 polymorphism ones. The MS-RV model makes predictions on these 124 mitochondrial mutations. The AUC value is 0.8408, and the p-value of the rank-sum test is 4.74 × 10^-5. Among the top 10 prediction scores, 7 are disease-associated. Thus, under MS-RV testing, the difference between disease-associated and polymorphism mitochondrial mutations is statistically significant, and the top-ranking predictions correspond to disease-associated substitutions.

5.4.2 Mutation data sets related to specific human diseases

Ten mutation data sets were collected from the Swiss-Prot database [BAW+05]; each has comparable numbers of disease-associated and polymorphism substitutions related to a specific human disease: polycystic kidney disease type I (ADPKD), Alport syndrome (AS), colorectal cancer, congenital bilateral absence of the vas deferens (CBAVD), cystic fibrosis (CF), osteogenesis imperfecta type II (OI-II), osteogenesis imperfecta type III (OI-III), osteogenesis imperfecta type IV (OI-IV), retinitis pigmentosa (RP) and Wilson disease (WD).

Table 5.2 lists the p-values of the rank-sum test on these substitution data sets. With the MS-RV model, for each set the difference between its disease-associated and polymorphism mutations is statistically significant.
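The rank-sum comparisons reported in this section can be reproduced with a standard Wilcoxon/Mann-Whitney implementation. The scores below are hypothetical and only illustrate the call; they are not the data analyzed above.

```python
from scipy.stats import mannwhitneyu

# Hypothetical MS-RV prediction scores for the two groups of substitutions.
disease_scores      = [0.93, 0.88, 0.81, 0.77, 0.74, 0.66]
polymorphism_scores = [0.58, 0.49, 0.47, 0.35, 0.30, 0.22, 0.18]

# Two-sided rank-sum test: do the two score samples differ significantly?
stat, p_value = mannwhitneyu(disease_scores, polymorphism_scores,
                             alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3g}")
```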
5.4.3 Mutation data on a specific gene

For the complement factor H gene (CFH), the H-to-Y substitution at position 402 (H402Y) has been studied in detail [KZC+05, HHS+05, ERA+05] in connection with age-related macular degeneration (AMD). We collected 26 disease-associated substitutions and 9 polymorphism ones on the CFH gene from the Swiss-Prot database.

The p-value of the rank-sum test on the 26 disease-associated substitutions and 9 polymorphism ones on the CFH gene is 3.62 × 10^-2, a statistically significant difference. The MS-RV model correctly predicts the widely studied H402Y mutation as disease-associated.

Table 5.2: p-values of the rank-sum test for mutations related to 10 human diseases.

    Disease                                                     p-value
    Polycystic kidney disease type I (ADPKD)                    1.56 × 10^-2
    Alport syndrome (AS)                                        4.79 × 10^-3
    Colorectal cancer                                           2.43 × 10^-2
    Congenital bilateral absence of the vas deferens (CBAVD)    3.28 × 10^-2
    Cystic fibrosis (CF)                                        1.35 × 10^-4
    Osteogenesis imperfecta type II (OI-II)                     3.67 × 10^-9
    Osteogenesis imperfecta type III (OI-III)                   6.26 × 10^-7
    Osteogenesis imperfecta type IV (OI-IV)                     2.54 × 10^-6
    Retinitis pigmentosa (RP)                                   3.69 × 10^-3
    Wilson disease (WD)                                         9.33 × 10^-4

5.4.4 Mutation data in chromosomal regions

On chromosome 7, two 10-Mb chromosomal regions were selected. One starts at position 89,678,936 and is centered at the STEA2_HUMAN gene; we denote it the STEA2-region. The other starts at position 128,166,653 and is centered at the CALU_HUMAN gene; we denote it the CALU-region. According to the Swiss-Prot and Pfam databases, the first region contains 48 disease-associated amino acid substitutions and 50 polymorphism ones on 22 coded proteins; the second region contains 10 disease-associated substitutions and 33 polymorphism ones on 7 coded proteins.

For mutations in the STEA2-region, the AUC is 0.84 and the p-value of the rank-sum test is 2.22 × 10^-16, which indicates a statistically significant difference between disease-associated and polymorphism mutations. All of the top 10 prediction scores correspond to disease-associated substitutions. For mutations in the CALU-region, the AUC is 0.84 and the p-value of the rank-sum test is 1.33 × 10^-5, which also indicates a statistically significant difference between the two kinds of mutations. Nine of the top 10 prediction scores correspond to disease-associated substitutions.
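The "top 10" figures quoted in the last two subsections are simply the number of disease-associated substitutions among the highest prediction scores; a small sketch of that count, on illustrative data, is:

```python
def disease_in_top_k(scores, labels, k=10):
    """Count how many of the k highest-scoring substitutions are
    disease-associated (label 1) rather than polymorphisms (label 0)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k])

# Toy data: 11 substitutions with hypothetical scores and labels.
scores = [0.91, 0.85, 0.80, 0.72, 0.64, 0.51, 0.44, 0.31, 0.22, 0.12, 0.08]
labels = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0,    1]
print(disease_in_top_k(scores, labels, k=5))   # -> 4
```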
5.5 Prioritization of human disease-associated amino acid substitutions

The identification of mutations associated with human diseases remains a challenge. The MS-RV model prioritizes amino acid substitutions from neighboring chromosomal regions and integrates the individual prioritizations into a global ranking using order statistics. The effectiveness is evaluated by the overall AUC score.

For the monogenic diseases, thirty 10-Mb neighboring chromosomal regions were selected from the Swiss-Prot database [BAW+05]. Each region is centered at a gene whose amino acid substitutions are associated with a human disease, and has comparable numbers of disease-associated and polymorphism substitutions. To include heterogeneous data sources, we also collected non-synonymous single nucleotide polymorphisms (nsSNPs) and the encoded amino acid substitutions from the HapMap database [Con03] and the Ensembl database [BAC+06]. Table 5.3 lists the numbers of substitutions in each region for the different data sources.

Table 5.3: Substitution data sets in the 30 neighboring gene regions. Each region is centered at a gene whose amino acid substitutions are associated with a human disease. Mutation data are collected from heterogeneous data sources; for each case the numbers of tolerant/polymorphism (Tol) and intolerant/disease-associated (Int) substitutions in the test set are listed.

    Disease caused by mutations at the central gene                          Swiss-Prot    Swiss-Prot + Ensembl
                                                                             Tol   Int     Tol   Int
    A colon tumor                                                             40    21     132    21
    Antithrombin III deficiency (AT-III deficiency)                           75    44     110    44
    Autosomal dominant neurohypophyseal diabetes insipidus (ADNDI)            53    32     118    32
    Autosomal dominant retinitis pigmentosa (ADRP)                            21    40      53    40
    Best macular dystrophy (BMD)                                              44    86     152    86
    Cerebral autosomal dominant arteriopathy with subcortical infarcts
      and leukoencephalopathy (CADASIL)                                       51    24     160    24
    Chronic nonspherocytic hemolytic anemia (CNSHA)                           38    90     196    90
    Citrullinemia type 1 (CTLN1)                                              44    37     127    37
    Crouzon syndrome (CS)                                                     23    26      50    26
    Dystrophic epidermolysis bullosa (DEB)                                    32    59     108    59
    Epidermolysis bullosa simplex, Weber-Cockayne type (WC-EBS)               87    23     309    23
    Epidermolytic hyperkeratosis (EHK)                                        88    23     313    23
    Familial multiple endocrine neoplasia type I (MEN1)                       45    33     135    33
    Familial porphyria cutanea tarda (FPCT)                                   20    27      73    27
    Glycogen storage disease II (GSD-II)                                      13    32      93    32
    Glycogen storage disease Ib (GSD-Ib)                                      18    26      69    26
    Hereditary nonspherocytic hemolytic anemia (HA)                           10    21      45    21
    Long QT syndrome type 1 (LQT1)                                            29    54     150    54
    Maturity onset diabetes of the young type III (MODY3)                     20    34      56    34
    Mucopolysaccharidosis type IIIB (MPS-IIIB)                                46    24     131    24
    Nail-patella syndrome (NPS) [MIM:161200]                                  15    30      91    30
    Osteogenesis imperfecta type III (OI-III) [MIM:259420]                    28    22      82    22
    Sjoegren-Larsson syndrome (SLS) [MIM:270200]                              19    25      75    25
    Smith-Lemli-Opitz syndrome (SLOS) [MIM:270400]                            22    44      72    44
    Tay-Sachs disease (TSD) [MIM:272800]                                      48    38      77    38
    Usher syndrome type 1B (USH1B)                                            13    24      36    24
    Very long chain acyl-CoA dehydrogenase deficiency (VLCAD deficiency)      40    25     132    25
    X-linked chronic granulomatous disease (X-CGD)                            22    36      31    36
    X-linked nephrogenic diabetes insipidus, type I (NDI)                     29    72      71    72
    X-linked recessive myotubular myopathy (XLMTM)                            29    23      71    23

For each neighboring region, the prioritization of the substitutions was used to derive the sensitivity and specificity. Sensitivity is the proportion of disease-associated substitutions ranked above a particular threshold rank, and specificity is the proportion of polymorphism substitutions ranked below this threshold. To minimize the ranking variability, we integrated the 30 individual prioritizations into an overall rank and evaluated the performance based on order statistics. With this method, the sensitivity and specificity are calculated from the total normalized numbers of substitutions in the 30 neighboring gene regions, and AUC scores are obtained from the ROC curves as described above. For a disease-associated substitution, we define its relative ranking coefficient (RRC) as the percentage of polymorphism substitutions that ranked below it; the overall RRC is calculated from all disease-associated substitutions, normalized over all regions. The smaller the RRC, the higher the ranking of the disease-associated substitutions.

For the 30 data sets collected from the Swiss-Prot database, the ROC curve is plotted in Figure 5.5; the AUC score was 0.8657 and the overall RRC was 0.159.
Figure 5.5: ROC curve (normalized cumulative sensitivity versus 1 - specificity) for prioritizing disease-associated substitutions from the 30 neighboring gene regions. Data sets are collected from the Swiss-Prot database.

For the 30 data sets collected from the Swiss-Prot, HapMap [Con03] and Ensembl [BAC+06] databases, the ROC curve is plotted in Figure 5.6; the AUC score was 0.8228 and the overall RRC was 0.185.

Figure 5.6: ROC curve (normalized cumulative sensitivity versus 1 - specificity) for prioritizing disease-associated substitutions from the 30 neighboring gene regions. Data sets are collected from the Swiss-Prot, HapMap and Ensembl databases.

For the polygenic diseases, we collected three groups of mutation data related to familial Alzheimer disease-1 (AD1), Alzheimer disease-3 (AD3) and Alzheimer disease-4 (AD4). The testing data come from different patients; for example, the mutations on the PSEN2 gene (AD4) are as follows:

• N141I: 3 of 4 pedigrees of Volga German ancestry
• D439A: a 58-year-old patient who belongs to a family from Barcelona, Spain
• T430M: a 49-year-old subject who displayed progressive dementia beginning at age 45 years
• T122P: a female patient whose age at onset of disease was 46 years
• M239I: an Italian family in which 4 members had early-onset AD
• T122R: a female patient whose age at onset of disease was 55 years

For AD1, there are 3 disease-associated substitutions and 17 polymorphism ones; the MS-RV model prioritizes 67% of the disease-associated mutations among the top 10% of the ranking. For AD3, there are 70 disease-associated substitutions and 42 polymorphism ones; the MS-RV model prioritizes all disease-associated mutations among the top 10% of the ranking. For AD4, there are 5 disease-associated substitutions and 48 polymorphism ones; the MS-RV model prioritizes 40% of the disease-associated mutations among the top 10% of the ranking. These results indicate that the MS-RV model is effective at predicting and prioritizing amino acid substitutions related to both monogenic and polygenic human diseases.

5.6 Analysis of unclassified mutation data

The MS-RV model predicted the 1,517 unclassified amino acid substitutions in the Swiss-Prot database. We studied those with high prediction scores in detail and list some examples with supporting references. In this part, "+" means that a substitution was predicted to be disease-associated, and the prediction scores are listed in parentheses.

5.6.1 Mutations related to the HBB_HUMAN gene

Hemoglobin (abbreviated Hb) is the iron-containing oxygen-transport protein in the red blood cells of mammals and other animals. The hemoglobin molecule is assembled from four globular protein subunits, each with a heme group containing an iron atom. This iron atom is the site of oxygen binding, and the binding of oxygen is a cooperative process. The Bohr effect is the control of hemoglobin's affinity for oxygen by the binding and release of carbon dioxide and acid. Since all human cells depend on oxygen for survival, it is very important to ensure adequate oxygenation of all body tissues and organs.

The MS-RV model predicted four unclassified mutations on the HBB_HUMAN gene with high prediction scores: A129V (0.9882), A70D (0.9818), D21G (0.9798) and H117P (0.9773).

A129V (+): The unstable A129V mutation was found in two unrelated black families in association with Hb S, Hb C or β0 thalassemia. It was named Hb La Desirade and exhibits decreased oxygen affinity.
Evidence can be found in the paper of Merault et al. [MKG+86].

A70D (+): Hemoglobin Seattle was shown to be associated with a considerable decrease in oxygen affinity by Stamatoyannopoulos et al. [SPF69]. It caused mild to moderate chronic hemolytic anemia in a mother and her sons from a Caucasian family in the USA (Huehns et al. [HHY+70]), and it was also found, with mild anemia, in a Ukrainian family (Chow et al. [CHKW94]). However, Huehns et al. [HHY+70] originally misidentified the mutation as beta76 (E20), Ala to Glu; this was corrected by the studies of Anderson et al. [AP73].

D21G (+): The change of Asp to Gly at position 21 of the β chain was found in members of a family of Polish origin living in the United States and was named Hb Connecticut. Several individuals with this mutation exhibited mild anemia. The variant shows decreased oxygen affinity and a slightly decreased effect of allosteric effectors on its oxygen equilibrium properties, as explained in the paper of Moo-Penn et al. [MPMB+81].

H117P (+): Hb Saitama (beta117 (G19), His to Pro) was identified in a 23-year-old Japanese female suffering from hemolytic anemia and jaundice. This mutation showed decreased oxygen affinity and an increased 2,3-DPG level, as reported in the paper of Ohba et al. [OHA+83].

5.6.2 Mutations related to the CD36_HUMAN gene

On the CD36_HUMAN gene, three unclassified mutations are predicted with high scores: I170T (0.9717), E122K (0.8421) and I270T (0.9782).

I170T, E122K, I270T (+): The study was based on the hypothesis that variants of the CD36_HUMAN gene might influence susceptibility to malaria and that, in an endemic area, spleen size determined by ultrasound might to some extent reflect an individual's susceptibility to malaria [GSB+01]. Twelve individuals from a malaria-endemic area in West Africa were selected, among whom eight had the smallest and four the largest spleens. Three single-nucleotide substitutions causing the non-conservative amino acid exchanges E122K, T173A and I270T were identified. Our prediction scores for these three mutations are 0.8421 (E122K), 0.9717 (T173A) and 0.9782 (I270T).

5.6.3 Mutations related to the CASR_HUMAN gene

Two unclassified mutations on the CASR_HUMAN gene have high prediction scores: E767K (0.9785) and E127A (0.9761).

E767K (+): Defects in the calcium-sensing receptor (CASR) gene have been shown to cause familial hypocalciuric hypercalcaemia and neonatal severe hyperparathyroidism. The mutation E767K in CASR was reported in a family with autosomal dominant hypocalcemia, and it was proposed that three acidic residues (767, 758 and 759) in the second extracellular loop of CASR help maintain the receptor in its inactive state, as described in the paper of Uckun-Kitapci et al. [UKUZMS05].

E127A (+): A missense mutation (Glu to Ala) in CASR was found to cause familial hypocalcaemia in the affected members of one family, who are heterozygous for the mutation, while it was not found in the normocalcaemic individuals tested, as described in the paper of Pollak et al. [PBE+94].

5.6.4 Mutations related to the CD22_HUMAN gene

On the CD22_HUMAN gene, one unclassified mutation was predicted with a high prediction score: Q152E (0.9789).

Q152E (+): Genomic DNA from 207 healthy Japanese individuals, 68 patients with systemic lupus erythematosus (SLE) and 119 patients with rheumatoid arthritis (RA) was analyzed.
The substitution Q152E in the CD22 gene, a member of the immunoglobulin superfamily, was observed with a marginally higher frequency in the patients with SLE (3/68, 4.4%) than in healthy individuals (1/207, 0.5%) (P = 0.048, SLE vs. healthy individuals). CD22 was therefore considered a candidate susceptibility gene for autoimmune diseases, as reported in the paper of Hatta et al. [HTM+99].

Chapter 6
Conclusion and Future Work

6.1 Conclusion

In Chapter 2, we reviewed the background of our study. In general, the problem of predicting the potential effects of amino acid substitutions on the function of the proteins containing them is of great importance in both pathological and pharmaceutical studies.

Typically, the potential effect of a given substitution is categorized into one of two classes, intolerant or tolerant. The prediction problem can then be described as: given the original amino acid, the substitution, and the protein containing the substitution, predict the potential effect (intolerant or tolerant) of the substitution on the function of the protein. This problem can be modeled as a classification problem and treated within the framework of statistical pattern recognition.

Following traditional pattern recognition procedures, we proposed a newly designed feature set in Chapter 3. Information about the protein, the original amino acid and the substitution was extracted to form a vector of input variables x = (x1, ..., xd)^T, where d is the number of features in the feature set and also the dimension of the input. We used two groups of features, composed of three physicochemical properties of amino acids and three relative frequencies indicating the preference of the amino acids for protein secondary structures. Each feature was further calculated in four different situations: for the original amino acid, for the substitution, in a window-sized situation, and in a column-weighted situation. Altogether, a 27-dimensional feature set is used to characterize the properties of amino acid substitutions in our study.

The two categories of substitutions (intolerant/tolerant) are labeled 0 and 1 and treated as the output variable y. When studying intolerant substitutions, we label intolerant substitutions 1 and tolerant ones 0; when interested in tolerant substitutions, we label the data the opposite way. With the inputs and output ready, the prediction task for a given substitution is to assign an appropriate label to the output y based on the inputs x, which is a typical supervised learning problem. Two discriminative supervised learning methods (the support vector machine and the decision tree) and one generative supervised learning method (the Gaussian mixture model) were used to solve the problem. Following the statistical pattern recognition framework, we use a set of known substitution samples, D = {(x_k, y_k), k = 1, ..., N}, to train the corresponding model (i.e., to determine the parameters of the model), and then use the trained model to predict future substitutions. The correctness of prediction is evaluated by cross-validation (typically 10-fold).

In the decision tree method, a top-down greedy search through the space of possible decision trees recursively splits each decision node to seek maximal information gain. In our study, we adopted the C4.5 algorithm to construct the decision tree.
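As a reminder of what "maximal information gain" means for such a split, the sketch below computes the entropy-based information gain of a binary split. The data and function names are illustrative, and the actual system used the C4.5 implementation rather than this code.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy example: 1 = intolerant, 0 = tolerant, split on some feature threshold.
parent = [1, 1, 1, 0, 0, 0, 1, 0]
left, right = [1, 1, 1, 1], [0, 0, 0, 0]        # a perfect split
print(information_gain(parent, left, right))    # 1.0 bit
```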
In the support vector machine method, the inputs x_k are mapped from the d-dimensional input space to a kernel feature space (of infinite dimension) using a Gaussian kernel. A discriminant hyperplane is then determined in the feature space to maximize the margin between intolerant and tolerant data samples using a quadratic programming technique. In our study, we used the software package LibSVM to train the support vector machine. The application of the Gaussian mixture model to the amino acid substitution classification problem is a pioneering piece of work: two different sets of Gaussian distributions with different weights were used to approximate the probability distributions of the intolerant and the tolerant classes, and prediction is then performed by calculating the posterior probability of a given substitution belonging to each class.

No matter which learning method is used, the quality of the features plays the most important role in supervised learning. In Chapter 3, we compared the quality of our selected feature set with other published feature sets. With the same machine learning method, a better feature set should yield better prediction results in cross-validation. It was shown that the proposed feature set outperforms the published feature sets when used with the decision tree method and the support vector machine method. In other words, the proposed feature set better embodies the characteristics of amino acid substitutions. The feature set proposed in this study has other advantages. First, it is simple and easy to calculate: all that is needed are basic physicochemical properties of the 20 amino acids, which are widely available. Second, all features have direct physical, chemical, or biological meanings, so they are easy to interpret in biological terms.

Although machine learning methods can predict the potential effects of given substitutions, they act like "black boxes" in the sense that it is difficult to extract the physical and chemical meanings of amino acid substitutions from the estimated parameters. To overcome this problem, we proposed a simulated annealing bump-hunting method to extract rules for amino acid substitutions in Chapter 4. In particular, we focused on the discovery of chemical and biological rules for amino acid substitutions from experimental data instead of making predictions. These rules are:
1. interpretable (consisting of a small set of simple chemical and physical features);
2. high-quality (with very few exceptions); and
3. general (capable of explaining a significant number of substitutions).
At each node, a small number of features are selected from an optimal feature set to split on. The optimal feature set is generated to maximize certain criterion through sequential forward selection. A testing data fits one rule in one tree, which gives the probabilities of it to be intolerant or tolerant. The average of the probabilities from all trees give the prediction results. A large number of amino acid substitutions occurring in highly heterogenous human proteins were collected and tested on various groups, such as those for specific mono- genic and polygenic diseases, specific genes and neighboring chromosomal regions. MS-RV model predicts and prioritizes the mutations from these groups. The analysis based on the rank statistics showed MS-RV model effectively predicts and prioritizes human disease-related mutations. We further analyzed those unclassified amino acid substitutions with high prediction scores from MS-RV model, and found evidences to support our conclusions. 124 6.2 Future Work One limitation of our feature set is that we currently use the Pfam multiple sequence alignment to extract evolutionary information for the query protein sequence. As a result, we are limited to deal with amino acid substitutions occurring in known pro- tein domains. This limitation can be overcome by using some other multiple sequence alignment method such as the PSI-BLAST and ClustalW instead of Pfam. In MS-RV model, each module’s optimal feature set is obtained with the sequential forward feature selection (SFFS). We used the AUC score of the10-fold cross validation of MS-RV model as the criterion. SFFS can’t guarantee to find the global optimal feature set, and the updating needs more time and memory for large scale size of data. Other feature selection method can be explored. We will have a deeper understanding on the polygenic disease-associated mutations, if more data are available. Data collection and processing is an ongoing work. 125 References [ALM + 06] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L. C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y . Moreau. Gene prioritization through genomic data fusion. Nat Biotechnol., 24(5):537–44, 2006. [ALR + 04] B. Alex, C. Lachlan, D. Richard, D. F. Robert, H. V olker, G. Sam, K. Ajay, M. Mhairi, M Simon, L. L. Erik, D. J. S. Sonnhammer, Y . Corin, and R. E. Sean. The pfam protein families database. Nucleic Acids Research, Database Issue 32:D138–D141, 2004. [AP73] N. L. Anderson and M. F. Perutz. Site of the amino-acid substitution in haemoglobin seattle (2 A270 Asp ). Nat New Biol., 243(130):274–5, 1973. [BA00] A. Bairoch and R. Apweiler. The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28:45– 48, 2000. [BAC + 06] E. Birney, D. Andrews, M. Caccamo, Y . Chen, L. Clarke, G. Coates, T. Cox, F. Cunningham, V . Curwen, T. Cutts, T. Down, and et al. Ensembl 2006. Nucleic Acids Research, 34 Database issue:D556–D561, 2006. [BAW + 05] A. Bairoch, R. Apweiler, C.H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O’Donovan, N. Redaschi, and L.S. Yeh. The univer- sal protein resource (UniProt). Nucleic Acids Research, 33:D154–159, 2005. [BCD + 04] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V . Hollich, S. Griffiths- Jones, A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme, C. Yeats, and S. R. Eddy. The Pfam protein families database. 
Nucleic Acids Res., 32(D138-141), 2004. 126 [BDF + 05] A. Bureau, J. Dupuis, K. Falls, K. L. Lunetta, B. Hayward, T. P. Keith, and P. Van Eerdewegh. Identifying snps predictive of phenotype using random forests. Genet Epidemiol, 28(2)(171-82), 2005. [BFSO84] L. Breiman, J. Friedman, C. J. Stone, and R.A. Olshen. Classification and Regression Trees. Chapman and Hall, 1984. [BGR + 04] C. Berezin, F. Glaser, J. Rosenberg, I. Paz, T. Pupko, P. Fariselli, R. Casadio, and N. Ben-Tal. Conseq: the identification of functionally and structurally important residues in protein sequences. Bioinformat- ics, 20(1322-1324), 2004. [Bre96] L. Breiman. Bagging predictors. Machine Learning, 24(123-140), 1996. [Bre01a] L. Breiman. Random forests. University of California, Berkeley Tech- nical Report, (1–33), 2001. [Bre01b] L. Breiman. Random forests. Machine Learning, 45(5–32), 2001. [BSW97] M. D. Brown, FZ. Sun, and D. C. Wallace. Clustering of caucasian leber hereditary optic neuropathy patients containing the11778 or14484 mutations on an mtdna lineage. AM. J. Hum. Genet., 60:381–387, 1997. [CA01] D. Chasman and R. M. Adams. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: Structure-based assessment of amino acid variation. Journal of Molecular Biology, 307:683–706, 2001. [CHKW94] E. Y . Chow, L. P. Haley, S. H. Krikler, and L. D. Wadsworth. Hb seattle [beta 70(E14) ala! asp]: a report of a second kindred in a ukrainian family. Hemoglobin, 18(3):231–4, 1994. [Con03] The International HapMap Consortium. The international hapmap project. Nature, 426:789–796, 2003. [CST00a] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge Uni- versity Press, 2000. [CST00b] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge Uni- versity Press, 2000. [CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley- Interscience, 1991. 127 [DEKM98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University press, 1998. [DHS01] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley Sons, Inc., 2001. [ERA + 05] A. O. Edwards, R. 3rd Ritter, K. J. Abel, A. Manning, C. Panhuysen, and L. A. Farrer. Complement factor h polymorphism and age-related macular degeneration. Science, 308(5720):421–4, 2005. [FCOdlC02] C. Ferrer-Costa, M. Orozco, and X. de la Cruz. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. Journal of Molecular Biology, 315:771–786, 2002. [FCOdlC04] C. Ferrer-Costa, M. Orozco, and X. de la Cruz. Sequence-based pre- diction of pathological mutations. Proteins: Structure, Function, and Bioinformatics, 57:811–819, 2004. [FF99] J. H. Friedman and N. I. Fisher. Bump hunting in high-dimensional data. Statistics and Computing, 9(2):123–143, 1999. [FJ02] M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mix- ture models. Pattern Analysis and Machine Intelligence, IEEE Transac- tions on, 24:381–396, 2002. [Fra98] C. Fraley. Algorithms for model-based gaussian hierarchical clustering. SIAM Journal on Scientific Computing, 20(270-281), 1998. [FS96] Y . Freund and R. E. Schapire. Experiments with a new boosting algo- rithm. Machine Learning: Proceedings of the Thirteenth International- Conference, (148-156), 1996. [GBS04] P. O. 
Gislason, J. A. Benediktsson, and J. R. Sveinsson. Random forest classification of multisource remote sensing and geographic data. Proc. Geoscience and Remote Sensing Symposium, IGARSS ’04(1049-1052), 2004. [GMCS04] L. Guo, Y . Ma, B. Cukic, and H. Singh. Robust prediction of fault- proneness by random forests. Proceedings of the 15th International Symposium on Software Reliability Engineering, ISSRE’04(417 - 428), 2004. 128 [GSB + 01] A. Gelhaus, A. Scheding, E. Browne, G. D. Burchard, and R. D. Horstmann. Variability of the cd36 gene in west africa. Hum Mutat., 18(5):444–50, 2001. [HH00] B. D. Hames and N. M. Hooper. Instant Notes in Biochemistry (Instant Notes S.). Bios Scientific Publishers Ltd, 2000. [HHS + 05] J. L. Haines, M. A. Hauser, S. Schmidt, W. K. Scott, L. M. Olson, P. Gallins, K. L. Spencer, S. Y . Kwan, M. Noureddine, J. R. Gilbert, N. Schnetz-Boutaud, A. Agarwal, E. A. Postel, and M. A. Pericak- Vance. Complement factor h variant increases the risk of age-related macular degeneration. Science, 308(5720):419–21, 2005. [HHY + 70] E. R. Huehns, F. Hecht, A. Yoshida, G. Stamatoyannopoulos, J. Hart- man, and A. G. Motulsky. Hemoglobin-seattle (alpha-2 beta-276-Glu): an unstable hemoglobin causing chronic hemolytic anemia. Blood, 36(2):209–18, 1970. [HMS66] E. B. Hunt, J. Matin, and P. J. Stone. Experiments in Induction. Aca- demic Press, New York, 1966. [HTF01] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer-Verlag., 2001. [HTM + 99] Y . Hatta, N. Tsuchiya, M. Matsushita, M. Shiota, K. Hagiwara, and K. Tokunaga. Identification of the gene variations in human CD22. Immunogenetics., 49(4):280–6, 1999. [JBS05] S. R. Joelsson, J. A. Benediktsson, and J. R. Sveinsson. Random forest classifiers for hyperspectral data. Proc. Geoscience and Remote Sensing Symposium, IGARSS ’05(160-163), 2005. [JDM00] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4- 37), 2000. [KBF + 00] M. Krawczak, E. V . Ball, I. Fenton, P. D. Stenson, S. Abeysinghe, N. Thomas, and D. N. Cooper. Human gene mutation database: a biomedical information and re-search resource. Human Mutation, 15:45–51, 2000. [KD82] J. Kyte and R. F. Doolittle. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Journal of Molecular Biology, 157:105–132, 1982. 129 [KGC02] S. Kumar, J. Ghosh, and M. M. Crawford. Hierarchical fusion of mul- tiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, 5(2)(210–220), 2002. [KU03] V . Kehl and K. Ulm. Responder identification in clinical trials with censored data. Research Paper, (1-29), 2003. [KW03] V . G. Krishnan and D. R. Westhead. A comparative study of machine- learning methods to predict the effects of single nucleotide polymor- phisms on protein function. Bioinformatics, 19(17):2199–2209, 2003. [KZC + 05] R. J. Klein, C. Zeiss, E. Y . Chew, J. Y . Tsai, R. S. Sackler, C. Haynes, A. K. Henning, J. P. SanGiovanni, S. M. Mane, S. T. Mayne, M. B. Bracken, F. L. Ferris, J. Ott, C. Barnstable, and J. Hoh. Mitochon- drial genome mutations in hypertensive individuals. Am J Hypertens, 308(5720):385–9, 2005. [LC04] A. Y . Lau and D. I. Chasman. Functional classification of proteins and protein variants. PNAS, 101(6576-6581), 2004. [Lev78] M. Levitte. Biochemistry. Biochemistry, 17:4277–4285, 1978. [LFJ04] M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. 
Simultaneous feature selection and clustering using mixture models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26:1154–1166, 2004. [LMH + 04] X. Liu, V . Minin, Y . Huang, D. B. Seligson, and Horvath S. Statistical methods for analyzing tissue microarray data. J Biopharm Stat., 14(671- 685), 2004. [McK98] V . A. McKusick. Mendelian Inheritance in Man. Johns Hopkins Uni- versity Press, Baltimore, 12 edition, 1998. [MFC + 02] A. C. Martin, A. M. Facchiano, A. L. Cuff, T. Hernandez-Boussard, M. Olivier, P. Hainaut, and J. M. Thornton. Integrating mutation data and structural analysis of the TP53 tumor-suppressor protein. Hum Mutat., 19(149-164), 2002. [Mit97] M. T. Mitchell. Machine Learning. McGraw-Hill, U.S.A., 1997. [MKC + 94] P. Markiewicz, L. G. Kleina, C. Cruz, S. Ehret, and J. H. Miller. Genetic studies of the lac repressor XIV: Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as “spacers” which do not require a specific sequence. Journal of Molec- ular Biology, 240:421–433, 1994. 130 [MKG + 86] G. Merault, L. Keclard, J. Garin, C. Poyart, Y . Blouquit, N. Arous, F. Galacteros, J. Feingold, and J. Rosa. Hemoglobin la desirade alpha A2 beta2129 (H7) Ala!Val: a new unstable hemoglobin. Hemoglobin, 10(6):593–605, 1986. [MMR + 01] K. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An intro- duction to kernel-based learning algorithms. IEEE Transactions on Neu- ral Networks, 12(181–201), 2001. [MPM + 87] Van Erven P. M., Cillessen J. P., Eekhoff E. M., Gabreels F. J., Does- burg W. H., Lemmens W. A., Slooff J. L., Renier W. O., and Ruitenbeek W. Leigh syndrome, a mitochondrial encephalo(myo)pathy. a review of the literature. Clin Neurol Neurosurg., 89(4):217–30, 1987. [MPMB + 81] W. F. Moo-Penn, P. McPhedran, S. Bobrow, M. H. Johnson, D. L. Jue, and K. W. Olsen. Hemoglobin connecticut (beta 21 (B3) Asp leads to Gly): a hemoglobin variant with low oxygen affinity. Am J Hematol., 11(2):137–45, 1981. [MTKPB03] K. Mroczek-Tonska, B. Kisiel, J. Piechota, and E. Bartnik. Leber hered- itary optic neuropathy–a disease with a known molecular basis but a mysterious mechanism of pathology. J Appl Genet., 44(4):529–38, 2003. [Nab01] I. Nabney. Netlab: Algorithms for Pattern Recognition. Springer-Verlag UK, 2001. [NH01] P. C. Ng and S. Henikoff. Predicting deleterious amino acid substitu- tions. Genome Research, 11:863–874, 2001. [OG89] R. H. Otten and L. P. Ginneken. The Annealing Algorithm. Kluwer Academic Publishers, Boston, 1989. [OHA + 83] Y . Ohba, Y . Hasegawa, H. Amino, S. Miwa, T. Nakatsuji, Y . Hattori, and T. Miyaji. Hemoglobin saitama or beta 117 (G19) His leads to Pro, a new variant causing hemolytic disease. Hemoglobin, 7(1):47–56, 1983. [OLM04] I. S. Oh, J. S. Lee, and B. R. Moon. Hybrid genetic algorithms for feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 26, no. 11:1424–1437, 2004. [PBE + 94] M. R. Pollak, E. M. Brown, H. L. Estep, P. N. McLaine, O. Kifor, J. Park, S. C. Hebert, C. E. Seidman, and J. G. Seidman. Autoso- mal dominant hypocalcaemia caused by a Ca(2+)-sensing receptor gene mutation. Nat Genet., 8(3):303–7, 1994. 131 [PNK94] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in fea- ture selection. Pattern Recognition Letters, vol 15:1119–1125, 1994. [RBHP91] D. Renell, S. E. Bouvier, L. W. Hardy, and A. R Poteete. Systematic mutation of bacteriophage t4 lysozyme. Journal of Molecular Biology, 222:67–87, 1991. [RBS02] V . 
Ramensky, P. Bork, and S. Sunyaev. Human non-synonymous SNPs: server and survey. Nucleic Acids Research, 30(17):3894–3900, 2002. [SB02] C. T. Saunders and D. Barker. Evaluation of structural and evolutionary contributions to deleterious mutation prediction. Journal of Molecular Biology, 322:891–901, 2002. [SCH01] M. R. Segal, M. P Cummings, and A. E. Hubbard. Relating amino acid sequence to phenotype: Analysis of peptide-binding data. Biometrics, 57(632-642), 2001. [SMK + 96] Y . J. Suckow, P. Markiewicz, L. G. Kleina, J. Miller, B. Kisters-Woike, and B. Muller-Hill. Genetic studies of the lac repressor XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. Journal of Molecular Biology, 261:509– 523, 1996. [SN06] F. Scaglia and J. L. Northrop. The mitochondrial myopathy encephalopathy, lactic acidosis with stroke-like episodes (melas) syn- drome: a review of treatment options. CNS Drugs., 20(6):443–64, 2006. [SPF69] G. Stamatoyannopoulos, J. T. Parer, and C. A. Finch. Physiologic impli- cations of a hemoglobin with decreased oxygen affinity (hemoglobin seattle). New Eng. J. Med., 281:915–919, 1969. [SPK04] P. Somol, P. Pudil, and J. Kittler. Fast branch and bound algorithms for optimal feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(900-912), 2004. [SRK + 01] S. Sunyaev, V . Ramensky, I. Koch, W. Lathe III, A. S. Kondrashov, and P. Bork. Prediction of deleterious human alleles. Human Molecular Genetics, 10(6):591–597, 2001. [TCC + 02] B. N. Terp, D. N. Cooper, I. T. Christensen, F. S. Jorgensen, P. Bross, N. Gregersen, and M. Krawczak. Assessing the relative importance of the biophysical properties of amino acid substitutions associated with human genetic disease. Human Mutation, 20:98–109, 2002. 132 [TMBW00] P. C. Turner, A. G. McLennan, A. D. Bates, and M. R. H. White. Instant Notes in Molecular Biology (Instant Notes S.). Bios Scientific Publish- ers Ltd, 2000. [Toi04] P. Toivanen. Bayesian classification using gaussian mixture model and em estimation:implementations and comparisons. Lappeenranta Uni- versity Information Technology Project Report, (1-37), 2004. [UKUZMS05] A. Uckun-Kitapci, L. E. Underwood, J. Zhang, and B. Moats-Staats. A novel mutation (e767k) in the second extracellular loop of the calcium sensing receptor in a family with autosomal dominant hypocalcemia. Am J Med Genet A., 132(2):125–9, 2005. [Vap98] V . Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1998. [WKPT04] P. Wang, Y . Kim, J. Pollack, and R. Tibshirani. Boosted prim with appli- cation to searching for oncogenic pathway of lung cancer. 2004 IEEE Computational Systems Bioinformatics Conference, (604-609), 2004. [WM01] Z. Wang and J. Moult. Snps, protein structure, and disease. Hum Mutat., 17(263-270), 2001. [YSD + 04] Y . L. Yip, H. Scheib, A. V . Diemand, A. Gattiker, L. M. Famiglietti, E. Gasteiger, and A. Bairoch. The swiss-prot variant page and the modsnp database: a resource for sequence and structure information on human protein variants. Hum Mutat., 23(464-470), 2004. 133
Abstract
Classifying and predicting amino acid substitutions is important in pharmaceutical and pathological research. We proposed a novel feature set built from the physicochemical properties of amino acids, the evolutionary profile of proteins, and protein sequence information. Large-scale human disease-associated data were collected and processed, together with unbiased experimental amino acid substitution data. Machine learning methods (decision tree, support vector machine, Gaussian mixture model, and random forests) were used to classify neutral and deleterious substitutions, and a comparison of classification accuracy with published results showed that our feature set is superior to the existing ones.