NOVEL ALGORITHMS FOR LARGE SCALE SUPERVISED AND ONE CLASS LEARNING

by

Lingyan Sheng

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2013

Copyright 2013 Lingyan Sheng

Acknowledgements

Foremost, I would like to express my sincere gratitude to my advisor, Prof. Antonio Ortega, for his continuous support of my Ph.D. study and research, and for his patience, motivation, enthusiasm, and knowledge. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my Ph.D. study.

Besides my advisor, I would like to thank the rest of my thesis committee, Prof. C.-C. Jay Kuo and Prof. Yan Liu, as well as my qualifying exam committee, Prof. B. Keith Jenkins and Prof. Fei Sha, for their encouragement, insightful comments, and hard questions.

My sincere thanks also go to Prof. Shahab Asgharzadeh and Dr. Roger Pique-Regi for their essential support and help throughout my project with Children's Hospital Los Angeles.

I thank my fellow labmates in the Compression Group for the stimulating discussions.

Last but not least, I would like to thank my family: my parents Zhiren Sheng and Yaqin Wu, my husband Tao Chen and my daughter Eileen Chen. Without their encouragement and understanding it would have been impossible for me to finish this work.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Supervised Learning of High Dimensional Data
  1.2 One Class Learning and Extension to Large Data
Chapter 2: Supervised Learning of High Dimensional Data
  2.1 Introduction
  2.2 BDLDA Algorithm Description
    2.2.1 Model Selection Metric
    2.2.2 Model Construction and Feature Selection
    2.2.3 Algorithm Improvements
    2.2.4 Treelets
    2.2.5 Algorithm Description
  2.3 Experimental Results
    2.3.1 Simulated Data
    2.3.2 Real Data
    2.3.3 Discussion
  2.4 Conclusions
Chapter 3: Graph Based One Class Learning
  3.1 Introduction
  3.2 Brief Overview of LGC
    3.2.1 LGC Algorithm
    3.2.2 Interpretation of LGC
    3.2.3 Choice of α
  3.3 LGC Classification
  3.4 Analysis of LGC for the Ranking Problem
    3.4.1 Analysis for the Case α → 1
    3.4.2 Analysis of LGC Ranking
      3.4.2.1 Ranking Decomposition
      3.4.2.2 Difference Term
      3.4.2.3 Cofactor Term
      3.4.2.4 Approximation of Ranking
    3.4.3 Choice of the Bound
    3.4.4 Extension to Multi-class Ranking
    3.4.5 Evaluation of α in LGC Ranking
    3.4.6 Summary
  3.5 Application
    3.5.1 Identifying Negative Samples
    3.5.2 Selecting the Number of Reliable Negative Samples
    3.5.3 Classification by TSVM
    3.5.4 Graph-OCL Algorithm
    3.5.5 Experimental Results
  3.6 Conclusions
Chapter 4: A Novel Adaptive Nyström Method
  4.1 Introduction
  4.2 Nyström Approximation
    4.2.1 Standard Nyström Method
    4.2.2 Novel Perspective
  4.3 Proposed Method
    4.3.1 Ensemble Nyström Method
    4.3.2 BoostNyström Method
    4.3.3 Complexity Analysis
    4.3.4 Error Bound
  4.4 Experiments
  4.5 Conclusions and Future Work
Chapter 5: General Conclusions
Chapter 6: Publications
References

List of Tables

2.1 Average Error Rate (Standard Deviation) for Simulated Data
2.2 Average Error Rate (Standard Deviation) for Real Data
3.1 Accuracy of USPS with Respect to α (%)
3.2 Accuracy of Mnist with Respect to α (%)
3.3 Average Accuracy and Standard Deviation of USPS for α > B and α < B (mean % (std))
3.4 Average Accuracy and Standard Deviation of Mnist for α > B and α < B (mean % (std))
3.5 Classification Accuracy of Documents with Different Kernels
3.6 Classification Accuracy of Documents (Linear Kernel)
4.1 Properties of data sets; n and dim are the number of data points and features, respectively

List of Figures

2.1 Sequential generation of candidate covariance matrix models for BDLDA. Starting with an empty list, we add one feature at a time (namely, the one that maximizes J in (2.5)). A feature can be added as an independent block (solid line), or combined with an existing block (dotted line). The best of all these models is selected using \hat{P}_e.
2.2 Block concatenation procedure
3.1 Classification accuracy of subsets of USPS data. The ratio of labeled data in the positive class to the negative class in each subgraph is 1:1 and 2:1. (a) USPS digits 4 and 7; (b) USPS digits 0 and 1; (c) USPS digits 9 and 5; (d) USPS digits 8 and 2.
3.2 Classification accuracy of subsets of Mnist data. The ratio of labeled data in the positive class to the negative class in each subgraph is 1:1 and 4:1. (a) Mnist digits 4 and 7; (b) Mnist digits 1 and 3; (c) Mnist digits 9 and 5; (d) Mnist digits 8 and 2.
3.3 Two Moons data and the identified negative data in the lower right moon. The blue circles are unlabeled positive data points and the green circles are labeled ones. The cyan triangles and black stars are retrieved data points. The black stars represent the retrieved data points that do not appear in the next subfigure.
3.4 20 Newsgroup data subjects
4.1 BoostNyström algorithm
4.2 Percent error of Scerevisae, Isolet and USPS
4.3 Percent error in Frobenius norm for the base Nyström method, the Ensemble Nyström method, BoostNyström, K-means based Nyström approximation, the adaptive Nyström approximation and the deterministic method

Abstract

Supervised learning is the machine learning task of inferring a function from labeled training data. Numerous algorithms have been proposed for supervised learning, such as linear discriminant analysis (LDA), support vector machines (SVM), and decision trees. However, most of them are not able to handle an increasingly common type of data, high dimensional data, such as gene expression data, text documents, and MRI images; this difficulty is often referred to as the curse of dimensionality. Our solution to this problem is an improvement to LDA that imposes a regularized structure on the covariance matrix, so that it becomes block diagonal while feature reduction is performed. The improved method, which we call block diagonal linear discriminant analysis (BDLDA), effectively exploits the off-diagonal information in the covariance matrix without huge computation and memory requirements. BDLDA is further improved by using treelets as a preprocessing tool. Treelets, by transforming the original data through successive local PCA, concentrate more energy near the diagonal entries of the covariance matrix, and thus achieve even better accuracy than BDLDA.

Supervised learning requires labeled information for all classes. However, since labeled data is often more difficult to obtain than unlabeled data, there is increasing interest in a special form of learning, namely, one class learning. In one class learning, the training set only has samples of one class, and the goal is to distinguish that class from all other samples. We propose a one class learning algorithm, Graph-One Class Learning (Graph-OCL). Graph-OCL is a two-step strategy, where we first identify reliable negative samples, and then classify the samples based on the labeled data and the negative samples identified in the first step. The main novelty is the first step, in which graph-based ranking by learning with local and global consistency (LGC) is used. Graph-based ranking is particularly accurate if the samples and their similarities are well represented by a graph. We also prove theoretically that there is a simple method to select a constant parameter α for LGC, thus eliminating the need for model selection by time-consuming validation.

Graph-based methods usually scale badly as a function of the sample size. This can be addressed by using the Nyström approximation, which samples a few columns to represent the affinity matrix.
We propose a new method, BoostNystr om, which adaptively samples a subset of columns at each iterative step and updates the sampling probability in the next iterative step. This algorithm is based on a novel perspective, which relates the quality of Nystr om approximation with the subspace spanned by the sampled columns. BoostNystr om can be potentially applied to Graph-OCL to solve the problem of large data size. viii Chapter 1 Introduction This thesis focuses on two machine learning problems: supervised learning and one class learning. They are both classication problems, but they dier in one impor- tant aspect, namely how the training samples are labeled. A supervised learning algorithm analyzes the training data, which includes labeled data in all classes, and produces an inferred classication function. The classication function is then applied to the test data to nd their classes. Supervised learning nds applications in many real life problems. One example that many people might have encountered is the classication of spam emails and normal emails. Supervised learning is also used in many academic elds other than computer science. For example, it is used to classify the gene expression data, which can help medical doctors to predict the survival rate and identify possibly responsible genes for certain diseases. We will discuss this application in more detail in Chapter 2. One class learning, on the other hand, has labeled data that belongs to only one of the classes. Only unlabeled data is available in all the other classes. There are two reasons for the increasing popularity of one class learning. One reason is that usually labeled data is more dicult to obtain than unlabeled data. Labeling data may require special skills and equipment, which can be expensive to obtain. [9] described 1 one situation that only one class has labeled data. In industrial production, there is usually enough data available to characterize the state of the system when it is operating correctly. However, there might not be enough information about how the system malfunctions, because it certainly is not a good idea to disrupt the system in order to obtain the corresponding negative data. If we want to classify future samples, we have to build a model based on one class learning. The other motivation for one class learning is that only one class has characteristic samples, and other classes might not have typical samples. For instance, an e-business website wants to classify the customers into valued customers and normal customers. They may have well dened characteristics for valued customer, such as high shopping frequency, large number of purchases or a large amount of money spent in certain period of time. On the other hand, normal customers do not have typical characteristics to describe their shopping behavior. Thus, we may need to do a one class learning in which only the class of valued customers has labeled data. There have been various algorithms proposed to solve supervised learning and one class learning problems. Many times, the one class learning algorithm is derived from a supervised learning algorithm. For example, [37] designed an adaptive SVM method for one class learning of text data, and SVM is traditionally used in super- vised learning. Similarly, [33] extended the naive bayesian method, a traditional supervised learning method, to one class learning. With the advent of large data in dierent elds, there emerge new challenges in applying the algorithms. 
These challenges can be due to either high dimension or large sample size, that often arises in big data problems. Each situation leads to challenges in many traditional algo- rithms. This thesis will focus on large data challenges in supervised learning and one class learning. We will rst discuss a challenge in supervised learning, which 2 analyzes high dimensional data sets. Then we discuss a novel one class learning algorithm, and the challenge of scaling it up to large sample size. 1.1 Supervised Learning of High Dimensional Data The goal of supervised learning is to build a model that captures the intrinsic associations between the class type and the features, so the class associated to a sample can be accurately predicted from its feature values. For this purpose, the data is usually divided into a training set and a test set, where the training set is used to build the classier which is validated by the test set. There are several models developed for supervised learning, e.g., Naive Bayesian, neural networks, decision trees [36], SVM [53], LDA [42] [21], and so on. One of the challenges of supervised learning is to build a classication model on high dimensional data sets. The emergence of various new application domains, such as bioinformatics and e-commerce, underscores the need for analyzing high dimensional data. For example, in a gene expression microarray data set, there could be thousands or even millions of dimensions, each of which corresponds to a specic segment of genome. [54] described two challenges for analyzing high dimensional data. The rst one is the curse of dimensionality. Many machine learning algorithms have com- plexity that is at least quadratic or even exponential with respect to the number of dimensions. With a xed number of training samples, the predictive power of a learning algorithm decreases as the dimensionality increases. Sometimes, even the prediction is impractical. Secondly, when a measure such as a Euclidean distance 3 is dened on high dimension, there is little dierence in the distances between dif- ferent pairs of samples [6]. This phenomenon may cause many machine learning algorithms to be vulnerable to the presence of noise. Among the state-of-art supervised learning algorithms, SVM is one of the most successful classication models. An SVM model represents the samples as points in space, and derives a hyperplane as a boundary between separate categories to make the margin as wide as possible. SVM achieves the nonlinear classication by using the kernel trick, which transforms the input space into another higher dimensional feature space. Without having to express the data explicitly in higher space, only the kernel matrix is necessary to obtain the solution. SVM is a natural t for high dimensional data, because through the dual problem in convex optimization, its complexity only depends on the sample size. However, the complexity of training an SVM is O(N 2 ) where N is the number of samples in the training data set. This could be a big disadvantage when the training data set is large. The other disadvantage is the diculty to identify the top features that are responsible for the classication if the data is transformed to higher dimensions, since only the kernel matrix is involved in the solution. 
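The distance-concentration effect cited from [6] above can be illustrated with a short numerical experiment (not part of the original thesis): for i.i.d. Gaussian data, the relative gap between the largest and smallest pairwise Euclidean distances shrinks as the dimension grows, which is what makes purely distance-based rules fragile in high dimension. The sketch below is a minimal illustration; the sample size and the dimensions tested are arbitrary choices.

```python
import numpy as np

def relative_distance_contrast(n=200, dim=10, seed=0):
    """(d_max - d_min) / d_min over all pairwise Euclidean distances of an
    i.i.d. Gaussian sample; small values mean the distances concentrate."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    d = np.sqrt(np.maximum(d2, 0.0))
    d = d[np.triu_indices(n, k=1)]                   # unique pairs only
    return (d.max() - d.min()) / d.min()

# The contrast shrinks steadily as dim grows (compare, e.g., dim=2 and dim=10000).
for dim in (2, 10, 100, 1000, 10000):
    print(dim, relative_distance_contrast(dim=dim))
```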
Unlike SVM, LDA is a generative learning algorithm, which assumes that sam- ples of a class are linearly separable from samples of other classes, and each of them has a conditional probability distribution that is normally distributed with common covariance matrix. Normal distribution is not a necessary condition to apply the LDA model. Actually many real life data sets do not have a strictly normal distribution. However, normal distribution guarantees optimality of the LDA solution. LDA has simple closed form solution and is eective in many ap- plications [21]. However given insucient sample size, LDA becomes impractical because of the singularity of the estimated covariance matrix. A few modications 4 to solve the overtting problem have been proposed. One solution is to regularize the covariance matrix by imposing a diagonal structure on it. Examples of tak- ing the assumption of diagonal covariance matrix include [13] [39]. There are also dierent ways of selecting the features involved in the classication. The features can be selected in a batch mode, or selected sequentially by a certain criterion. The regularization form and how the features are selected are in uential factors to decide the classication accuracy. Chapter 2 will focus on our novel method for the regularization form and feature selection. We have proposed and improved a novel supervised learning method, block di- agonal linear discriminant analysis (BDLDA), for high dimensional data, especially gene expression data. BDLDA is based on LDA, but does greedy feature selection in order to maintain the most informative elements in the covariance matrix. Learn- ing all elements in the full covariance matrix is impossible with limited sample size. On the other hand, much useful information is lost if a diagonal covariance matrix is assumed. BDLDA achieves a tradeo between a full covariance matrix and a diagonal one. The features are selected based on a simple metric, the discriminant function, such that the most discriminant features found are added into the feature set. Though BDLDA is more accurate as compared with some state-of-art algo- rithms, there is still much information in the covariance matrix not considered in building the model. One solution is to increase the size and the number of blocks used in BDLDA, but this comes at the cost of increasing complexity. We propose to use treelets as a preprocessing feature reduction method. Treelets provides a tool for feature transformation that imitates successive PCA. It localizes the energy of the covariance matrix near the diagonal. As a preprocessing method, it allows BDLDA to extract more informative features while maintaining the same complexity with 5 regular BDLDA, and thus leads to better classication accuracy. The classier combining BDLDA and treelets is called Treelets-BDLDA (T-BDLDA). The details of the algorithm and experimental results are described in Chapter 2. 1.2 One Class Learning and Extension to Large Data Supervised learning tries to distinguish between two or more classes with the train- ing set containing samples from all the classes. But as already mentioned, usually labeled data is more dicult to obtain than unlabeled data. This leads to the situation where labeled data may be missing in certain classes. Therefore, nowa- days, people are becoming more interested in one special case of learning: one class learning. There is one other term, outlier detection, which is often used for this problem. 
The term originates from a different application of one class learning, in which the objective is to identify as many negative samples as possible. The one class classification problem differs from supervised learning in one important aspect: only one of the classes has labeled information that can be used in the training step, while the other classes only have unlabeled data. We call the class with labeled data the positive class, and all other classes the negative class. The task is to learn a model that defines a boundary between the positive and negative class such that classification accuracy is optimized. The task is similar to semi-supervised learning in the sense that both labeled and unlabeled data are used in the training step to build a model. Unlike in semi-supervised learning, however, the boundary is very likely to be biased, because the amount of unlabeled data is generally much larger than that of labeled data.

In one class learning one encounters the same problem as in supervised learning, namely that the labeled set of samples may not be characteristic enough. A limited number of labeled samples, or noise in the measurements, may cause the sampling distribution to deviate significantly from the real distribution. For example, the labeled data may be concentrated near the boundary, which complicates the learning process. This problem may be more prominent in one class learning, because we have even less labeled information than in supervised learning. Though the problem is important, we will not discuss it in this thesis. We assume that there is enough training data to determine the characteristics of the labeled class.

Several algorithms have been proposed for one class learning. Koch [26] used a neural network for the detection of positive samples, but the resulting algorithm does not perform well in high dimensions. Other algorithms are based on the Bayesian approach, such as Naive Bayesian [35] and S-EM [33]. Naive Bayesian is simple to implement, but it requires knowledge of the prior distributions of the parameters and is not very accurate in deriving the classification boundary, as shown in Chapter 3. S-EM is a more advanced Bayesian method that uses the EM algorithm to estimate the classification parameters, but because EM is an iterative method, S-EM is computationally complex, which restricts its wide application. An adaptive SVM algorithm for one class learning was proposed in [37]. The algorithm iteratively refines the SVM boundary until convergence. It has high accuracy, but it is also computationally expensive and cannot scale up to large sample sizes.

Recently, there has been a lot of research on spectral graph methods applied to clustering and semi-supervised learning, such as Mincut [7], graph random walks [51], Gaussian Random Fields [63], LGC [58], the spectral graph transducer [24], manifold regularization [5] [49], and many other variants. In particular, [5] generalizes graph-based learning to the manifold setting. In these graph based algorithms, samples are represented as graph nodes, and their similarities are represented by edge weights. Different algorithms may define different regularization forms on the graph. The common goal is to infer a classification function that conforms to the initial labeling and is also smooth on the graph. Graph based methods are generally considered to have high accuracy and low complexity.
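As a concrete illustration of the graph representation used by these methods, the sketch below builds a k-nearest-neighbor affinity matrix with Gaussian edge weights and its symmetric normalization S = D^{-1/2} W D^{-1/2}, the form used by LGC. It is a minimal sketch; the choices of k and the kernel width sigma are illustrative, not values prescribed in this thesis.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_affinity(X, k=10, sigma=1.0):
    """Symmetric k-NN affinity matrix W with Gaussian weights and its
    symmetric normalization S = D^{-1/2} W D^{-1/2}."""
    n = X.shape[0]
    dist = cdist(X, X)                             # pairwise Euclidean distances
    W = np.exp(-dist**2 / (2.0 * sigma**2))        # Gaussian similarities
    np.fill_diagonal(W, 0.0)                       # no self-loops
    keep = np.argsort(-W, axis=1)[:, :k]           # k strongest edges per node
    mask = np.zeros_like(W, dtype=bool)
    mask[np.repeat(np.arange(n), k), keep.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)            # symmetrize the edge set
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    return W, S
```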
To the best of our knowledge, graph based learning algorithms have never been applied to one class learning. Due to their success in semi-supervised learning, graph based algorithms are well suited for learning tasks including both labeled and unlabeled data. However, how graph based algorithms can handle missing la- beled information in negative class is not obvious. We propose a novel one class learning algorithm, Graph-OCL, which uses a graph based method, LGC, com- bined with Transductive-SVM (TSVM). One class learning is usually solved in a two step strategy. The rst step is to detect reliable negative samples, which are the most distant samples from the labeled ones. In our proposed algorithm, the reliable negative samples are selected as the lowest ranked samples by LGC rank- ing. This procedure can be explained by the class probability propagation among the graph, so that the negative samples have the lowest probability belonging to the labeled class. Since graph based methods explore the manifold structure of the data, LGC ranking provides more accurate identication than other methods, such as Naive Bayesian and Rocchio classier. In order to decide the number of negative samples selected, we use the spy sample technique rst proposed in [33]. A few spy samples from the positive labeled set are placed into the unlabeled set. The spy samples provide a calibration so that the samples with the classication probability signicantly higher than spies are considered reliable negative. 8 LGC has one parameter that controls the balance between graph energy and labeled data t. Usually the parameters of a learning algorithm are selected by validation methods, such as cross validation. However, it can be proved that LGC ranking is stable when is smaller than a bound, which is determined by the similarity matrix of the graph. We also show by extensive experiments that LGC has higher accuracy when is below the bound than when it is above the bound. Thus, can be chosen as any value below the bound. The biggest problem with graph based algorithms is that they are unable to scale up to large sample size, although they are eective in small size data. However, for large size problems (e.g. data size>10,000) both storing the similarity matrix, which is a Gram matrix, and solving the associated machine learning problem are prohibitive. This limits usage of the graph-based algorithms in many applications involving large data size, such as sensor networks, social networks, Internet text, etc. In many cases the Gram matrix can be approximated by a sparse matrix, and the sparse matrix computation algorithms can be used. In many other cases the Gram matrix may be dense. In this case, may calculations such as the matrix inversion in LDA, the quadratic programming in SVM, and the computation of the eigendecomposition will take space complexity O(n 2 ) and time complexity O(n 3 ). Thus, complexity may be prohibitive when the number of samples n is large. Recently, many methods have been proposed to approximate the Gram matrix using a low rank structure. For example, [1] used randomized methods to speed up kernel PCA by replacing the kernel matrix (a Gram matrix) by a randomized kernel which approximates the original one in expectation. [55] proposed to use uniform sampling without replacement to choose a small set of columns, from which an approximation to the Gram matrix is built. 
This method is the so called Nystr om approximation method, which derives its name from the Nystr om method in integral 9 equation theory. There have been various random sampling methods proposed for Nystr om approximation methods other than uniform sampling, such as [50], [17], [55], [2]. All Nystr om approximation methods based on random sampling have one com- mon problem as they may lead to high variance in the approximation errors. Dif- ferent sampling may lead to dierent approximation error, some of which may be large. Since in some applications the approximation is run only once, there is no guarantee that a good approximation can be achieved. To solve this problem, we propose an iterative algorithm, BoostNystr om. In each iteration, a small number of columns are added into the column set. The added columns have high accuracy, which guarantees a good approximation at each step. The sampled columns in the current iteration are also used to update the sampling probability distribution in the next iteration. BoostNystr om has the advantage of reducing the variance in ap- proximation performances by dividing the sampling into smaller subsets of columns that have low approximation error. BoostNyst om is based on a novel perspective on the Nystr om approximation, which states that a good Nystr om approximation can be achieved when the space spanned by the sampled columns has a large overlap with the ideal space, where the ideal space is the one spanned by the eigenvectors of top eigenvalues of the Gram matrix. This property guarantees that the columns added in each iterative step which have good approximation constitute a column set with good approxi- mation power. The algorithm to a large extent reduces the randomness often seen in other random sampling Nystr om methods, and provides a stable and accurate approximation. The details of Graph-OCL and experimental results are described in Chapter 3. The details of BoostNystr om and experimental results are described in Chapter 4. 10 Chapter 2 Supervised Learning of High Dimensional Data 2.1 Introduction In this chapter, supervised learning on high dimensional data is discussed. Nowa- days, there are many application domains where the data is of considerable dimen- sions, such as geographic information system, computer vision, text documents, etc. For example, in text document classication, each word corresponds to one feature. A long text document can contain tens of thousands of dierent words, which compose a long feature vector. In this chapter, we will focus on one example of high dimensional data, namely, DNA microarray data. DNA microarray measures mRNA levels of many genes in the cells or tissues at once. The principle behind microarrays is hybridization between two DNA strands. Single strands of complementary DNA for the genes of interest are attached to a solid surface and arranged in an array. The mRNA of a sample of interest, e.g. a tumor biospsy, is extracted and hybridized to the array. The intensity value at each spot in the array is correlated to the abundance of its RNA transcript in the sample. DNA microarrays are used to measure changes in ex- pression levels, or to genotype or targeted resequencing. The output of a biological 11 experiment using microarray is an intensity image, which represents the RNA tran- script values of genes of interest. Usually each spot on the solid space corresponds to many pixels in the intensity image. 
In order to obtain a single intensity value for each spot, the corresponding pixels need to be rst identied, which is called segmentation, and then summarized, which is called quantication. These steps are important in the process of microarray experiments and analysis. However, they are out of the scope of this thesis. We will only discuss the algorithms applied to the microarray gene expression data after segmentation and quantication. DNA microarray technology allows researchers to analyze patterns of gene ex- pression simultaneously recorded in a single experiment. Gene expression analysis is important in many medical applications. For example, it is generally believed that gene mutations can lead to cancer. Dierent gene expression patterns among patients or dierent tissues can be used for diagnosis or prognosis in cancer re- search. These data sets have a large number of gene expression values per sample (several thousands to tens of thousands, even millions), and a relatively small num- ber of samples (a few dozens), especially in the case of rare diseases, for which gene expression data for only a few patients is available. This situation requires improvements to traditional learning algorithms. Traditional linear discriminant analysis (LDA) [21,42] cannot be applied to gene expression data, because of the singularity of the within-class scatter matrix due to the small sample size. Thus, for these data sets some form of feature selection will always be needed. A number of solutions based on LDA have been proposed to tackle this challenge. One solution is to assume a diagonal covariance matrix, which essentially ignores potential correlation between dierent features. Examples include diagonal linear discriminant analysis (DLDA) [13] or nearest shrunken cen- troid (NSC) [52], as well as sequential DLDA (SeqDLDA) [39], a modied DLDA 12 technique that incorporates embedded feature selection. Alternative solutions use regularization methods to impose a structure on the covariance matrix, e.g., in shrunken centroid regularized discriminant analysis (SCRDA) [19], a diagonal reg- ularization matrix is employed. But SCRDA has the same problem as DLDA in that it does not perform well in data with correlations (as will be illustrated by our experiments). While it would be possible to consider more complex classication tools (e.g., SVM [53], neural networks [41] and random forests [12]), these tend to not perform as well as simpler LDA-based approaches, e.g., SCRDA, when applied to gene expression data [19]. One likely reason is that these more complex mod- els cannot be accurately learned from limited data. Thus we have a bias-variance trade-o problem: lower variance seen in simple classication methods compensates for the additional bias they introduce [21]. In order to improve performance in the presence of feature correlation (while staying within the general LDA framework), we focus on block diagonal linear dis- criminant analysis (BDLDA), rst proposed in [38]. BDLDA restricts the maximum number of features to be selected in the model. However, even with limited num- ber of features, reliably estimating all correlations is dicult with small sample size. In order to reduce the parameters to be estimated while keeping important correlations between features, BDLDA imposes a block diagonal structure on the covariance matrix. A greedy algorithm is applied to nd features to add into can- didate models with dierent block diagonal structures. 
Cross validation is used to select the best model among all candidate models. Unlike DLDA or NSC, BDLDA performs classication with embedded feature selection, while considering corre- lations between features. In [38], BDLDA was shown to outperform DLDA on simulated data with sparse covariance structure (e.g., Toeplitz or block diagonal). 13 While these results were promising, model selection using cross-validation made it impractical for large datasets, e.g., gene expression data. In this thesis, we improve feature selection in BDLDA by using an estimated error rate to select the best model among all candidate models. The estimated error rate is derived from LDA and can be obtained for each candidate block diagonal covariance structure. Within BDLDA, direct computation of these error rates is possible even when using a very small number of training samples, because the block diagonal structure is limited to use only small blocks. This error rate metric allows us to avoid cross validation for model selection, and enables BDLDA to be compu- tationally practical even when working on large data sets. We apply BDLDA to real gene expression data for the rst time, with very competitive results [48]. Other improvements with respect to the original BDLDA approach include a repeated fea- ture subset selection (RFSS) technique and a prescreening procedure. With RFSS, that is repeating model construction with previously selected features removed, the algorithm chooses more discriminating features that are independent from previous models. This is useful for gene expression data, because genes belonging to the same pathway tend to have sparse correlations. The prescreening procedure eliminates features that are not signicantly dierent between two classes, which accelerates model search and improves performance by removing noise. In Section 2.3, test re- sults are presented, that show our improved version of BDLDA works particularly well on simulated data with correlated features and outperforms the other three algorithms in real data. Though BDLDA has high classication accuracy as compared to some state- of-art algorithms, it still can neglect a signicant number of correlations in the covariance matrix, because the assumed block is small. Taking into consideration 14 both the algorithm complexity and feature correlations, we assume that the maxi- mum block size is 3, because increasing the block size increases BDLDA complexity. A better solution is to transform the covariance matrix so that more of its energy is concentrated on its diagonal elements. For this purpose, we rst need to identify the correlated features and perform some form of transformation to decorrelate them. Decorrelating makes it unnecessary to use larger blocks, but we cannot decorrelate globally, so the idea is to perform some local decorrelation, so that smaller blocks can be used. A popular example of feature transformation and reduction is principal com- ponent analysis (PCA) [25]. While PCA can exploit correlation between features to achieve a more compact representation of the feature vectors (thus providing a tool for feature selection), the resulting transformation is done independently of the classication task. Thus, there is no guarantee that projecting samples onto the highest magnitude principal components will lead to better classication per- formance. Moreover, PCA is a global feature transformation (each projection may be obtained as a linear combination of all components of a feature vector). 
Thus, if a classier is designed based on PCA data, it will no longer have an easy interpre- tation in terms of a sparse set of genes, as would be desirable. Other approaches, e.g., those based on wavelets, have been proposed to reduce the impact of noise on classication and, unlike PCA, provide a local analysis [34]. However, wavelets are unlikely to be useful for classication in a DNA context, where there is no reason to expect that adjacent genes would need to have similar levels of expression. As an alternative, we propose to use treelets as a preprocessing feature reduc- tion tool. Treelets [31] have been proposed as a new adaptive multiscale basis for sparse unordered data. Rather than performing PCA on the whole data (which, for small training set means that only a few global eigenvalues will be identied), in 15 treelets a tree based local (pair-wise) PCA is applied. Based on the correlation be- tween pairs of features a series of Jacobi rotations [18] are selected and successively applied. Each step in this process involves two features that have high correlation and are diagonalized. This leads to a covariance matrix in which more energy is concentrated on the diagonal than in the original covariance matrix. Thus, treelets can be seen as a feature transformation tool that approximates the decorrelating properties of PCA, but where tree-structured pair-wise processing makes it possible to work with small sample sets. A key advantage of treelets over PCA is that the feature transformation they produce is local in nature. That is, one can identify (by tracing back the successive pair-wise PCA operations) a sparse set of features (at most 2 l features in a depth l tree) that were transformed to obtain a given treelet coecient. However, as for PCA, high energy treelet coecients (analogous to eigenvectors corresponding to high magnitude eigenvalues) are not necessarily guaranteed to provide the best classication performance. Thus, we propose to use treelet coecients generated from the data as an input to BDLDA [47]. A rst advantage of applying BDLDA is that it can take advantage of the energy compaction/denoising properties of treelets, so that feature selection in BDLDA is more reliable. Second, and more important, since each treelet coecient is obtained by transforming a set of features in the original representation, the blocks are eectively larger, since they operate on transformed features. This allows us to take into consideration the correlation across a larger number of features, without increasing signicantly the BDLDA complexity. For publicly available two-class cancer data, T-BDLDA outperforms state of the art techniques, including BDLDA [48], SeqDLDA [39], NSC [52] and SCRDA [19]. 16 This chapter is organized as follows. Section 2.2 describes our proposed BDLDA algorithm. Section 2.2.4 describes the treelets algorithm and its properties. Sec- tion 2.2.5 introduces the algorithm combining BDLDA and treelets. Section 2.3 provides the experimental results of both BDLDA and TBDLDA. Section 2.4 con- cludes the chapter. 2.2 BDLDA Algorithm Description 2.2.1 Model Selection Metric We start by deriving the estimated error rate of LDA, which will be used as a model selection metric in Section 2.2.2. LDA assumes that both class A and class B have multivariate Gaussian distribution with meansm A andm B and a common covariance matrixK,f A (x)N(m A ;K),f B (x)N(m B ;K). 
The discriminant function is

    g(x) = w^t x - b, \qquad \begin{cases} g(x) \ge 0 \;\Rightarrow\; \text{class A} \\ g(x) < 0 \;\Rightarrow\; \text{class B} \end{cases}    (2.1)

where x is the feature vector of the sample to classify, w is a vector orthogonal to the decision hyperplane, and b defines the decision boundary g(x) = 0. The optimal w lies in the direction that maximizes the between/within variance ratio J_K(w), where

    J_K(w) = \frac{(d^t w)^2}{w^t K w}    (2.2)

The solution to the optimization is then

    \hat{w} = \arg\max_w J_K(w) = K^{-1} d, \qquad d = m_A - m_B    (2.3)

with

    b = \log \pi_B - \log \pi_A    (2.4)

In practice, the mean vectors m_A, m_B, the covariance matrix K, and the prior class probabilities \pi_A, \pi_B are usually replaced by the maximum likelihood estimates \hat{m}_A, \hat{m}_B, \hat{K}, \hat{\pi}_A and \hat{\pi}_B. The criterion then becomes

    J_{\hat{K}}(w) = \frac{(\hat{d}^t w)^2}{w^t \hat{K} w}    (2.5)

with solution

    \hat{w} = \arg\max_w J_{\hat{K}}(w) = \hat{K}^{-1} \hat{d}, \qquad \hat{d} = \hat{m}_A - \hat{m}_B    (2.6)

and

    b = \log \hat{\pi}_B - \log \hat{\pi}_A    (2.7)

In the general LDA case, if the data set has more features than samples, \hat{K} is not invertible. In BDLDA, we restrict the number of features to be smaller than the sample size, so that \hat{K} is more likely to be invertible. Different models in BDLDA correspond to different sizes and structures of \hat{K} and \hat{d}. Given the training data, from which the estimated means and covariance matrix are derived, the estimated probability of error [21] of a model in BDLDA is

    \hat{P}_{e|T} = \hat{\pi}_A \, \Phi\!\left( \frac{-\tfrac{1}{2}\hat{d}^t \hat{K}^{-1}\hat{d} + \ln(\hat{\pi}_A/\hat{\pi}_B)}{\sqrt{\hat{d}^t \hat{K}^{-1}\hat{d}}} \right) + \hat{\pi}_B \, \Phi\!\left( \frac{-\tfrac{1}{2}\hat{d}^t \hat{K}^{-1}\hat{d} - \ln(\hat{\pi}_A/\hat{\pi}_B)}{\sqrt{\hat{d}^t \hat{K}^{-1}\hat{d}}} \right)    (2.8)

where \Phi is the cdf of the standard normal distribution and T denotes the training data set. Both (2.5) and (2.8) are used as criteria in feature and model selection.

2.2.2 Model Construction and Feature Selection

There are numerous possible selections of features and covariance structures, such as a full covariance matrix, a diagonal covariance matrix with a certain selection of features, or a block diagonal covariance matrix. Enumerating models over all possible selections of features and structures of \hat{K} and \hat{d} is obviously impractical due to computation and memory limitations. In [38], a block diagonal structure is imposed on the covariance matrix, with the dimensions of both the subblocks and the resulting covariance matrix kept small. An example of the resulting candidate models is shown in Figure 2.1, where each arrow denotes adding a feature to the model and forming a new model. A feature can be used to start a new subblock (solid line in Figure 2.1) or it can be combined with the current subblock (dashed line in Figure 2.1). The feature that generates the largest J_{\hat{K}}(w) in (2.5) among all candidate features is selected. For simplicity, it is assumed that the subblock sizes are nonincreasing, e.g., in Figure 2.1 the block size on the top left of the covariance matrix is larger than that on the bottom right. This is based on the assumption that, through the exhaustive search over features, the features added first have better classification power than those selected later, and thus their correlations are considered more important. As a consequence, if small blocks are selected at the start of the process, subsequent block sizes will be restricted to be small, thus leading to complexity savings.

J_{\hat{K}}(w) in (2.5) is the maximized projected class mean difference divided by the projected variance in the feature space. J_{\hat{K}}(w) increases with the number of features and has no upper limit. Such an increase is sometimes due only to the growing number of features and does not necessarily improve performance.
Thus using J ^ K (w) by itself for model selection could lead to undesirable results, with a large number of features being chosen. In order to compare all models with dierent number of features, [38] uses cross validation, an unbiased method, which does not make any assumptions on the data. Despite its advantages, it is time consuming, making it impractical to select a model in large data cases. We propose to use the estimated error rate in (2.8) for each covariance matrix ^ K as a way to compare dierent candidate structures. The covariance structure with smallest ^ P e will be selected at each step in the sequential search shown in Figure 2.1. Unlike J ^ K (w) in (2.2), ^ P e is in the range of 0 to 1 for all models. If the data has a Gaussian distribution and means and covariance matrix can be estimated, ^ P e is a good measure of each model's performance. In Section 2.3, experiments on simulated data are used to demonstrate this advantage. Moreover, in experiments on real gene expression data, which is not strictly Gaussian distributed, the measure still generates better results than other algorithms. 2.2.3 Algorithm Improvements Repeated feature subset selection (RFSS) is applied to reduce the impact of greedy search. RFSS repeats the model construction and feature selectionN times. 20 Figure 2.1: Sequential generation of candidate covariance matrix models for BDLDA. Starting with an empty list, we add one feature at a time (namely, the one that maximizes J (2.5)). A feature can be added as an independent block (solid line), or combined with an existing block (dotted line). The best of all these models is selected using ^ P e . At the start, a model with a predened maximum number of features (MaxFeature in Algorithm 1) is selected. Then the model construction and feature selection is performed again, with the features selected during the rst iteration removed from the set of candidate features. This procedure is repeated N times, and each time a feature selection iteration does not consider features already selected in previous iterations. Then theN models are combined by vector concatenatingN means and block diagonally concatenatingN covariance matrices as shown in Figure 2.2. The feature sets in allN models are dierent and uncorrelated. The model construction is performedN times or stops when there are not enough candidate features. This improvement enables the algorithm to nd more discriminating features without being in uenced by previously selected models. The complete algorithm is described in Algorithm 1. 21 Figure 2.2: Block concatenation procedure Algorithm 1 Model construction and feature selection S =;,T = all candidate features,M =;, F = 0, L = 0 1. Construct the rst model (Model 1) by adding feature i, i = arg max i2T d i i S =S +fig,T =T fig,M =M+fModel 1g, F = 1, L = 1 2. For models with feature size F, to create Model j+1, do either of the following (1) Add a feature as an independent subblock. The new feature is selected by i = arg max i2T J given in Eq. (2.2)S =S +fig,T =Tfig,M =M+fModel jg, F =F + 1, L = 1. (2) Add a feature to the last subblock if F < MaxGrow and F + 1 does not exceed any previous subblocks. The new feature is selected by i = arg max i2T J given in Eq. (2.2).S =S +fig,T =Tfig,M =M+fModel jg,F =F + 1, L =L + 1. 3. Repeat Step 2 until the MaxFeature is reached. 4. Select amongM the model with minimum P e given in Eq. (2.8) 5. RemoveS and repeat steps 1-4 N times 6. Combine N selected models S is the set of selected features. 
T is the set of candidate features. M is the set of candidate models. F is the number of features in the the model. L is the number of features in the last subblock. MaxGrow is the largest size of a subblock. MaxFeature is the largest number of features in the models. Prescreening is based on the observation that features with the same means and variances are not discriminating in BDLDA. Some of them may correspond to noise and interfere with classication. Removing these features can improve performance and reduce computation time. A prescreening of all the features is 22 applied before the model construction in Algorithm 1. The separation of two classes on featurei is represented byj d i i j, withd i =m Ai m Bi and 2 i = 1 Ks ( P k2ClassA (x ki m Ai ) 2 + P k2ClassB (x ki m Bi ) 2 ) +c, wherex ki is theith feature of sample k,K s is the total number of samples, m Ai is the mean of feature i that belongs to class A and similarly for m Bi , and c is a regularization value. Only features withj d i i j above a threshold will be used in Algorithm 1. In the experiments, we use 1 3 max i (j d i i j) as the threshold. To avoid the impact of outliers in real data, instead of using the top rankingj d i i j, the average of 10 largestj d i i j is used, that is, the threshold is one third of the average of 10 largestj d i i j. The prescreening procedure can also be applied to other classication tools. 2.2.4 Treelets Though BDLDA has high classication accuracy as compared to some state-of- art algorithms, it can neglect useful correlations in the covariance matrix, because we limit the block size to be not larger than 3. However increasing the block size increases the complexity of BDLDA. As mentioned in the introduction, if the covariance matrix is transformed to be diagonal by PCA, an ideal low dimensional approximation in terms of variance can be achieved. However, this cannot be done because of the large feature size. For gene expression data analysis, many methods have been proposed to iden- tify groups of highly correlated features and select a few representative features for each group. Treelets [31] can be used to identify subsets of genes that exhibit cor- relation in their expression patterns, but will replace each such localized group by a linear combination that encodes the information from all features in that group. 23 At each level of the tree, the two most correlated features, say feature and fea- ture , are transformed and replaced by two transformed coecients, representing their projection into their principal component and the vector orthogonal to the principal component respectively. The correlation between two features is dened as the Pearson's correlation, which is derived by dividing the covariance of the two features by the product of their standard deviations. Treelets is an unsupervised feature transformation measure that does not consider class information but can successively exploit correlation between features The treelets algorithm is described in Algorithm 2. Algorithm 2 Treelets Algorithm 1. Compute the sample covariance and correlation matrix ^ (0) and ^ M (0) from original data. Initialize the set of indices =f1; 2; ;pg. Each index represents a tree branch that can be combined into a higher level. Initially, contains every original feature. DeneB (0) as an identity matrix. Repeat 2-5 for l = 1; ;L, 2. Find the two most similar features from according to the correlation matrix ^ M (l1) . Let (;) = arg max i;j2 ^ M (l1) , where i<j. 3. Perform a local PCA on (;). 
Find a Jacobi rotation matrix J(α, β, θ_l), with |θ_l| ≤ π/4, such that the rotated covariance satisfies \hat{\Sigma}^{(l)}_{αβ} = \hat{\Sigma}^{(l)}_{βα} = 0.
4. Update the matrices \hat{\Sigma}^{(l)} = J^T \hat{\Sigma}^{(l-1)} J, B^{(l)} = B^{(l-1)} J, and x^{(l)} = J^T x^{(l-1)}. Update \hat{M}^{(l)} accordingly.
5. Retain the principal component α in the index set to represent the branch and remove β from it.

At level l, an original sample x can be written as

    x = B^{(l)} x^{(l)}    (2.9)

where the column vectors of B^{(l)} form the treelet basis at level l and x^{(l)} collects the projections of x onto those basis vectors. The objective is to reduce the amount of off-diagonal energy in the covariance matrix obtained from x^{(l)}, i.e., in the transformed space. Each feature in x^{(l)} is a linear combination of several features in x. At a fixed level L, x^{(L)} is the sample used as input to BDLDA. Note that in general x^{(L)} will contain features that have not been modified, i.e., that are the same as in x, while other features may be the result of applying a transformation to multiple inputs of the original feature vector. For example, one feature in x^{(L)} may contain information from at most 2^L features in x when an L-level treelet is used. In this case B^{(L)} would include a column with 2^L non-zero values.

2.2.5 Algorithm Description

T-BDLDA combines treelets and BDLDA. First, we build a treelet basis matrix B^{(L)} at a certain level L using the training data. Each training sample x is transformed into x^{(L)} as in (2.9), and the x^{(L)} are taken as inputs to BDLDA to build a classification model. Each test sample y is also projected as in (2.9) onto the basis B^{(L)} obtained from the training samples, and the treelet coefficients y^{(L)} of the test samples are used for testing in BDLDA.

Note that even if we use the same number of blocks of the same size as when applying BDLDA to the original expression data, we would essentially be selecting more actual features and thus larger effective blocks. The resulting improvements in classification could alternatively be attained by using more, and potentially larger, blocks in BDLDA without the treelet transformation. However, increasing the number and size of the blocks is computationally expensive in BDLDA. Thus T-BDLDA provides a method to effectively find more discriminating features, and their structure, in affordable time. The complete T-BDLDA algorithm is summarized in Algorithm 3.

Algorithm 3 Complete algorithm combining treelets and BDLDA
1. Threshold the data by t-score.
2. Obtain the treelet basis on the training data at level L.
3. Transform both training and test data with the treelet basis as in (2.9). Run BDLDA on the treelet coefficients of the training data.
4. Build candidate blocks of size MaxFeature by adding features according to (2.2). Each subblock inside a candidate block has maximum size MaxGrow.
5. Select the block using (2.8).
6. Remove previously selected blocks and repeat steps 4 and 5 N times.
7. Combine the N selected blocks.
8. Test the treelet coefficients of the test samples with the selected BDLDA model.
L is the treelet level from which coefficients are extracted. MaxGrow is the largest size of a subblock. MaxFeature is the largest number of features in a block. N is the number of blocks.

2.3 Experimental Results

Both BDLDA and T-BDLDA are tested on simulated and real data, and the results are compared with SeqDLDA [39], NSC [52] and SCRDA [19]. The test results (both error rates and standard deviations) shown in Table 2.1 and Table 2.2 are obtained by performing 10-fold cross validation 50 times. The error rates of SCRDA in [19] are presented as a matrix indexed by two tuning parameters; the smallest error rate among all parameter pairs is shown.
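The evaluation protocol just described (10-fold cross validation repeated 50 times, reporting the mean error rate and its standard deviation) can be sketched as follows. The functions fit_fn and predict_fn are hypothetical stand-ins for whichever classifier is being compared, and the use of stratified folds is an assumption made for the sketch rather than something stated in the text.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def repeated_cv_error(X, y, fit_fn, predict_fn, n_splits=10, n_repeats=50, seed=0):
    """Mean error rate and standard deviation over repeated k-fold CV.
    fit_fn(X_tr, y_tr) returns a model; predict_fn(model, X_te) returns labels."""
    rkf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                  random_state=seed)
    errors = []
    for tr, te in rkf.split(X, y):
        model = fit_fn(X[tr], y[tr])
        errors.append(np.mean(predict_fn(model, X[te]) != y[te]))
    errors = np.asarray(errors)
    return errors.mean(), errors.std()
```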
2.3.1 Simulated Data

Block diagonal covariance matrix. The distributions of the two classes are N(μ_A, K) and N(μ_B, K) with a total number of features P = 10000, with μ_A = (0, 0, ..., 0) (10000 zeros) and μ_B = (0.5, ..., 0.5, 0, ..., 0) (200 entries equal to 0.5 followed by 9800 zeros). The block diagonal structure of K is shown in (2.10). Each subblock has an autoregressive structure, i.e., it is a symmetric Toeplitz matrix with first row (1, ρ, ρ^2, ..., ρ^98, ρ^99). The subblock size is 100 × 100 and there are a total of 100 subblocks. The autocorrelation within each subblock is |ρ| = 0.9, and the sign of ρ alternates from one subblock to the next. 220 samples are generated. The average error rates and standard deviations are shown in Table 2.1.

    K = \begin{pmatrix}
          K_{\rho}  &            &           &        \\
                    & K_{-\rho}  &           &        \\
                    &            & K_{\rho}  &        \\
                    &            &           & \ddots
        \end{pmatrix}_{10000 \times 10000}    (2.10)

Diagonal covariance matrix. The distributions of the two classes are N(μ_A, K) and N(μ_B, K) with a total number of features P = 10000, μ_A = (0, 0, ..., 0) (10000 zeros), and μ_B = (0.5, ..., 0.5, 0, ..., 0) (100 entries equal to 0.5 followed by 9900 zeros). We assume that the features are independent, so that the covariance can be written K = I_P, where I_P is the P × P identity matrix. 220 samples are generated.

Toeplitz covariance matrix. The distributions of the two classes are N(μ_A, K) and N(μ_B, K) with a total number of features P = 1000. The difference of the means is assumed to fade exponentially: μ_A = (0, 0, ..., 0) and μ_{Bj} = e^{-λ j} (j = 1, 2, ..., 1000), with λ = 0.05. K is assumed to be the Toeplitz matrix with first row (1, ρ_1, ρ_2, ..., ρ_5, 0, ..., 0). 120 samples are generated.

2.3.2 Real Data

We test our algorithm on two-class cancer data, using two publicly available data sets for colon cancer (62 samples, 2000 features) [15] and prostate cancer (102 samples, 6033 features) [14]. 10-fold cross validation is performed 50 times on each data set.

Table 2.1: Average Error Rate (Standard Deviation) for Simulated Data

             Block diagonal       Diagonal            Toeplitz
             covariance           covariance          covariance
  T-BDLDA    <0.45%  (0%)         3.41%  (1.14%)      3.17%  (0.95%)
  BDLDA       0.36%  (0.19%)      3.57%  (1.09%)      4.51%  (1.02%)
  SeqDLDA    19.64%  (1.5%)       2.57%  (0.83%)      8.97%  (1.71%)
  NSC        18.15%  (1.34%)      6.89%  (1.12%)     10.82%  (1.74%)
  SCRDA       9.45%  (1.23%)      1.97%  (0.62%)     10.2%   (1.37%)

  L = 5, MaxGrow = 3, MaxFeature = 20, N = 5

Table 2.2: Average Error Rate (Standard Deviation) for Real Data

             Colon Cancer         Prostate Cancer
  T-BDLDA     9.84%  (0.51%)      5.1%   (0.62%)
  BDLDA      10.06%  (1.15%)      5.21%  (0.85%)
  SeqDLDA    12.06%  (1.87%)      5.53%  (0.9%)
  NSC        10.31%  (1.02%)      7.65%  (0.42%)
  SCRDA      11.41%  (1.69%)      5.41%  (0.89%)

  L = 5, MaxGrow = 3, MaxFeature = 20, N = 5

2.3.3 Discussion

On the simulated data with a block diagonal covariance matrix and with a Toeplitz covariance matrix, T-BDLDA performs the best and BDLDA is the second best. For the block diagonal covariance matrix, the simulation error of T-BDLDA is 0 (reported as < 0.45%, i.e., < 1/220, in Table 2.1). For the diagonal covariance matrix, our algorithms do not show much advantage over SeqDLDA and SCRDA, which are DLDA based methods. T-BDLDA outperforms BDLDA on all simulated data, which demonstrates the advantage of using treelet coefficients instead of raw data. The margin observed on the data with a diagonal covariance matrix may come from selecting more actual features through the treelet coefficients in the selected blocks. The gain on the other two data sets, which contain correlations, may come from the energy concentrating on the diagonal terms, which benefits a block diagonal structure in classification. On the two real data sets, T-BDLDA has the lowest error rates. The promising results of our work on
The promising results of our work on T-BDLDA show that it can be a competitive algorithm for classification of gene expression data.

2.4 Conclusions

We have proposed and improved a supervised learning algorithm, BDLDA, for high dimensional data, typically RNA gene expression data. Although BDLDA has superior classification accuracy compared with state-of-the-art algorithms, it still misses much of the information in the off-diagonal positions of the covariance matrix. To further improve classification performance, the treelet transform is proposed as a preprocessing step applied to the gene expression data, leading to a covariance matrix with less off-diagonal energy. The treelet coefficients are then input into BDLDA, and a block diagonal structure is imposed on the covariance matrix with a predefined number of blocks and block size. RFSS and prescreening are used to improve the algorithm. T-BDLDA outperforms BDLDA, SeqDLDA, NSC and SCRDA on most of the simulated data and on both real data sets used in our tests. T-BDLDA is thus a promising approach for data with a small number of training samples, a very large number of features and an unknown correlation structure.

Chapter 3
Graph Based One Class Learning

3.1 Introduction

In recent years, the graph Laplacian has become an important tool in manifold related machine learning algorithms [5] [58] [63] [4] [62] [8]. In these algorithms, data points are represented by graph nodes and similarity between data points by edge weights. The goal is to infer labels on nodes for which no label data is known. The algorithms can be viewed as estimating a classification function F on the graph, where F needs to satisfy two criteria simultaneously: (1) it should take a value close to that corresponding to the initial labels, and (2) it should be smooth, i.e., neighboring nodes in the graph should have similar values. This can be expressed in a regularization framework in which the first term is a loss function of the labeled data and the second term is a regularizer of the graph structure and the label function. The various manifold learning algorithms differ in the particular choice of the loss function and the regularizer, but have in common some parameters that control the balance between the two terms. However, only a limited amount of work has been devoted to discussing how these regularization parameters should be chosen.

Among the manifold learning algorithms, learning with local and global consistency (LGC) [58] has a closed form solution. It has been widely used for semi-supervised learning (SSL). A key aspect of LGC is to let every data point iteratively spread its label information to its neighbors until a global stable state is achieved. LGC can be explained from the perspective of lazy label propagation. Each row in F is the probability of a data point belonging to each class. The probability is propagated through the graph according to the edge weights, while a certain probability of keeping the initial labels remains. The balance between label propagation and initial label fit is controlled by a parameter, which we denote here as alpha. This will be formally introduced in Section 3.2. One application of LGC is classification, which is performed by examining each row of the classification function at convergence. This framework leads to two questions:
1. Is the algorithm sensitive to the choice of alpha?
2. How does the ratio of labeled data in different classes affect the classification accuracy?
We show, in both simulated and real data, that classication results, to a large extent, depend on the choice of, especially when the labeled data are unbalanced. Thus, in LGC classication, an optimal has to be selected by validation. Validation requires setting aside a separate validation data set, which is not a problem when large amount of data is available. When the amount of data is limited, however, validation has to be done using time consuming methods, such as cross validation or boostrapping. Another major application of LGC is the universal ranking problem [60]. The LGC ranking problem has the same graph structure and label spreading process as LGC classication. When the spreading process is repeated until convergence, the samples are ranked according to a particular column j of F at convergence, which corresponds to samples' probability of belonging to class j. In other words, LGC 32 ranks the samples by their class preference. In this thesis, we prove that it is possible to select the parameter for LGC ranking without validation. Since validation either requires extra number of samples, or much more time to run validation methods, such as cross validation, it is desirable to select a parameter value that theoretically guarantees a stable and optimal or suboptimal ranking. For a particular ranking problem, there is a bound on the parameter. If is below the bound, the ranking is theoretically stable and experimentally close to optimal. Thus we can choose any below the bound. Time consuming validation methods, such as cross validation or bootstrapping are unnecessary in LGC ranking. In this work, we dene the pairwise ranking of two samples to be stable if the ranking remains the same when changes. We prove that the ranking is stable if is below a bound. The pairwise ranking of two samples, which is also the dierence between two rows of the classication function F at convergence, can be expressed in terms of and the entries of the similarity matrix. It is hard to decompose and the entries of the similarity matrix, so that the in uence of changing on the ranking is unclear. When is small enough, the pairwise ranking can be approximated by a product of a function of and a function of the entries of the similarity matrix. By doing this, we have separated and the similarity matrix, which is independent of . When changes only below this bound, the pairwise rankings do not change. In order for the approximations to hold, needs to be below a bound, which depends on the similarity matrix. In simulations of LGC ranking on various data sets, including the two-moon simulated data and eight real data sets, the ranking accuracy of smaller is consistently higher than that of larger , and the standard deviation of accuracy is consistently lower. This demonstrates that LGC ranking becomes more stable and accurate when is below the bound. 33 As an application of LGC ranking, we propose a one class learning algorithm, which we call, Graph-OCL. One class learning is a type of binary classication in which only one class has labeled data, so most of state-of-art supervised classiers do not apply. One example of one class learning is a situation where we have training samples of research papers and we want to retrieve similar papers from the web. We solve the one class learning problem in two steps. Step one is identication of reliable negative samples. In this step, we use LGC ranking and select the samples with highest ranking probability of belonging to negative class. 
Step two is classication with the initially labeled data and the identied negative samples. In this step, we use transductive SVM (TSVM) [23] as a classier. We test Graph- OCL on Newsgroup 20 data set. The proposed algorithm is demonstrated to be more eective than Naive Bayesian (NB) and S-EM [33], an improved version of NB. This chapter is organized as follows. Section 3.2 provides the denitions and algorithm description of LGC. Section 3.3 provides experiments on how the ratio of labeled data in dierent classes and the choice of in uence LGC classication accuracy. Section 3.4 provides theorems proving that ranking is stable if is below a bound. We also discuss how the bound is calculated. Section 3.5 provides an application of LGC ranking to one class learning. Section 3.6 concludes the chapter. 3.2 Brief Overview of LGC We rst introduce LGC algorithm [58] [60], then explain the algorithm from the perspective of label information propagation. 34 3.2.1 LGC Algorithm Consider a sample setX =fx 1 ; ; x l ; x l+1 ; ; x n g N m , where each sample is a vector of features. Without loss of generality, let the rstl samples x i (1il) be labeled and the remaining ones x u (l + 1 u n) be unlabeled. We build a graph, in which the nodes are labeled and unlabeled samples, and the anity matrix W quanties the similarity between the nodes, in terms of distance in the feature space. Let F(t) be a nc matrix, that corresponds to a classication ofX at time t(t = 0; 1; ; +1). c is the number of classes. F ij (t) represents the probability that at time t, instance i has label j: F ij (t) =P (instance i has label j; t) j = 1; 2; ;c (3.1) Each sample x i can then be assigned a labely i = maxfF i1 (t);F i2 (t); ;F ic (t)g ( P c k=1 F ik (t) = 1, 0F ik (t) 1, 1kc). At the start of the iteration, assumeF ik (0) = 1 if the initial label of x i isk and F ik (0) = 0 otherwise. Assume F ik (0) = 1=c if x i is unlabeled. With this, we can dene Y = F(0). F(t) represents the labeling matrix at each step of iteration while Y represents the initial labels. Construct matrix S = D 1 W in which D is the diagonal degree matrix with its (i;i)-entry equal to the sum of the i-th row of W. We then iterate F(t + 1) =SF(t) + (1)Y (3.2) 35 until convergence, where is a predened parameter with2 (0; 1). [58] shows that F(t) converges as follows. Since F(0) = Y and given (3.2), we have F(t) = (S) t1 Y + (1) t1 X i=0 (S) i Y (3.3) Since 0<< 1 and the eigenvalues of S are in [1; 1], lim t!1 (S) t1 = 0 and lim t!1 t1 X i=0 (S) i = (IS) 1 : (3.4) Let F denote the limit of the sequence F(t) as t!1. Then we can derive a closed form expression: F = (1)(IS) 1 Y (3.5) If LGC is applied to a classication problem, for sample i, we then select class k which has the maximum likelihood, i.e., max k F ik ; (1kc). The other application of LGC is ranking, in which the order of a series of test samples is learned, and the order is consistent with the known ranking in the training data. Usually the LGC ranking could be with respect to one of the classes, e.g., class k. We rank the kth column of F , which is the relevance score corresponding to the query points. The LGC classication and ranking algorithm are described in Algorithm 4. Algorithm 4 LGC classication/ranking 1. Form the similarity matrix W. 2. Construct matrix S = D 1 W in which D is a diagonal matrix with its (i;i)- entry equal to the sum of the i-th row of W. 3. Calculate F = (1)(IS) 1 Y. 4. (classication). Classify sample as class k which has max k F ik . 4. (ranking). 
Rank the kth column of F . 36 3.2.2 Interpretation of LGC We interpret Algorithm 4 in terms of a lazy information transfer network. The idea is similar to the lazy random walk introduced in [59], but has a probability interpretation of the label functions. Section 3.2.1 denes the initial state F(0) as F ik (0) = 1 if the initial label of x i is k and F ik (0) = 0 otherwise, while if x i is unlabeled, we assume F ik (0) = 1=c. F ik (0) = 1 means sample x i has probability 1 to belong to class k. For those unlabeled, F i1 (0) = = F ic (0) = 1=c represents our ignorance of its probability at the start of the algorithm. The probability of belonging to c classes propagates over time based on the lazy information transfer network. From (3.2), F ij (t + 1) = n X z=1 S iz F zj (t) + (1)Y ij (3.6) S is a row stochastic matrix, which can be seen as the information transfer probability matrix. We interpretS iz as the proportion of information thatF ij (t+1) obtains from F zj (t) (1 z n and j = 1; 2; ;c). It is similar to the random walk, because both interpretations of S iz represent the similarity between x i and x z . However, in the random walk S iz is the probability of state change from x i to x z , while in information transfer network, S iz is the probability of information transfer from x z to x i . This information transfer network is called lazy, because F(t) also depends on the initial state Y. This means that at each iteration, each sample in the graph has probability 1 of retaining its initial label. With the probability, a sample changes label with the probability proportional to the weight of the link. We assume that information transfer happens within each label and not across labels, which 37 means that F ij (t) depends on F zj (t 1), where 1 z n, and F ij (t) does not depend on F zv (t 1), where v6=j. F ij (j = 1; 2; ;c) is the probability of sample x i having label j when the algorithm converges. In LGC classication, we compare the probability of belonging to each class, and choose the class that has the highest probability. In LGC ranking, we sort the probability of samples according to thekth column of F, i.e.,F k , which represents each data's relevance to the labeled data. Sometimes, S is normalized as a symmetric matrix as D 1 2 WD 1 2 . The LGC classication function F can be rewritten using the symmetric graph Laplacian L s = I D 1 2 WD 1 2 . L s was used for the purpose of classication in [58]. However, D 1 2 WD 1 2 is not a stochastic matrix, so the sum of each row of F is no longer 1. Thus F does not have a probability interpretation corresponding to each of its rows. Though we can still perform classication according to the sample's class preference in the corresponding row of F, ranking is dicult since we do not have a general interpretation corresponding to each column of F. Thus, in the following analysis and applications, S is dened as D 1 W. 3.2.3 Choice of The classication function at convergence F is the solution to the following opti- mization function: F = arg min F 1 2 ( n X i;j=1 S ij kF i F j k 2 (1) n X i=1 kF i Y i k 2 ) (3.7) Let Q(F) = 1 2 ( P n i;j=1 S ij k F i F j k 2 (1) P n i=1 k F i Y i k 2 ), which is the function we want to optimize. Q(F) can be written in the matrix form: 38 Q(F) =(F T F F T SF) + (1)(F Y) T (F Y) (3.8) Taking the derivative of Q(F) with respect to F and making it equal to 0, we can derive the solution as F = (1)(IS) 1 Y. is the only parameter used in LGC. [61] discusses 's impact on the ranking when 1. 
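For concreteness, Algorithm 4 with the closed form (3.5) amounts to a single dense linear solve once alpha is fixed. The following is a minimal sketch for moderate n; the RBF similarity, its width sigma and all names are our own choices, not the thesis implementation.

import numpy as np

def lgc(X, labels, alpha=0.5, sigma=1.0):
    # X: (n, m) feature matrix; labels: length-n integer array with class
    # k in {0, ..., c-1} for labeled points and -1 for unlabeled points.
    # Returns F* of (3.5); row i is the class-probability vector of sample i.
    labels = np.asarray(labels)
    n = X.shape[0]
    c = int(labels.max()) + 1

    # RBF affinity (any similarity matrix W can be substituted), row-normalized
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    S = W / W.sum(axis=1, keepdims=True)        # S = D^{-1} W, row stochastic

    # initial state Y: one-hot rows for labeled points, uniform 1/c otherwise
    Y = np.full((n, c), 1.0 / c)
    lab = labels >= 0
    Y[lab] = np.eye(c)[labels[lab]]

    # closed form F* = (1 - alpha) (I - alpha S)^{-1} Y
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)

# classification: F.argmax(axis=1); ranking with respect to class k:
# np.argsort(-F[:, k]).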
However, 's impact on classication or ranking when 2 (0:1) has not been fully studied so far. In (3.7), controls the balance between the graph energy (similar graph nodes should have similar labels) and initial label t (F should be close to Y). Consider two extreme cases, i.e., = 1 and = 0. = 1 leads to F(t) = S t F(0). The quantity lim t!1 S t (i;j) is proportional to the degree of the node, which means every row of lim t!1 S t is the stationary distribution of the information transfer network. Thus, through iteration, the initial labels cannot propagate and are not considered as generating the nal class probabilities. The other extreme case is that of = 0, which implies that the probability distribution of labels does not change along time. Therefore, the rst term SF(t) in the iteration function (3.2) represents propagating information along the data structure, and the second term (1)Y imposes a regularization term, so that the classication conforms to the initial labels. However, it is not obvious how to choose in the range 0<< 1. In the following sections, we analyze's in uence on LGC classication and ranking. 3.3 LGC Classication As mentioned in the introduction, when applied to classication, there are two unsolved questions with respect to LGC: (1) Given the similarity matrix, is the algorithm sensitive to the choice of? (2) How does the ratio of labeled data among 39 dierent classes aect the classication accuracy? We test the LGC classication on two real data sets, the USPS digit recognition data and the Mnist data, in order to empirically answer the two questions. Each data set contains 10 subsets, which correspond to 10 digits. For each data set, we use one subset as positive data and one other subset as negative data, and let vary between 0 and 1 (0:1; 0:2; ; 0:9). We use the RBF kernel as similarity matrix in both data sets. Figures 3.1 and 3.2 show the classication accuracy of the subsets of the USPS and the Mnist data, from which we have two observations. (a) (b) (c) (d) Figure 3.1: Classication accuracy of subsets of USPS data. The ratio of labeled data in positive class to negative class in each subgraph is 1:1 and 2:1. (a) USPS Digit 4 and 7; (b) USPS Digit 0 and 1; (c) USPS Digit 9 and 5; (d) USPS Digit 8 and 2; 40 (a) (b) (c) (d) Figure 3.2: The classication accuracy of subsets of Mnist data. The ratio of labeled data in positive class to negative class in each subgraph is 1:1 and 4:1. (a) Mnist Digit 4 and 7; (b) Mnist Digit 1 and 3; (c) Mnist Digit 9 and 5; (d) Mnist Digit 8 and 2; Observation 1: The classication accuracy is sensitive to the ratio of labeled data among dierent classes. The classication is obtained by propagating the label information through time according to (3.6). In case of unbalanced labeled data, which means the ratio of labeled data is far from 1 among dierent classes, the information of the class with majority of labeled data propagates more label information than the other class. Thus, the unlabeled data has a preference for the class with more labeled data. 41 Observation 2: has an impact on the classication accuracy, and the impact is greater when the labeled data is unbalanced. controls the trade-o between the graph energy and initial label consistency. In case of balanced labeled data, the initial labels are consistent with graph structure, thus changing does not change the accuracy very much. However, when the labeled data is unbalanced, giving more weight to the graph energy helps propagating label information according to (3.6). 
More labeled information of the class with a higher number of labeled data is propagated. This will reduce classication accuracy if a constant threshold is used. The above observations show that LGC is sensitive to, especially when labeled data is unbalanced among classes. Thus, when LGC is applied to classication, especially the labeled data in dierent classes are unbalanced, must be carefully chosen by using validation methods. 3.4 Analysis of LGC for the Ranking Problem Usually unbalanced labeled data among dierent classes is not a problem in LGC ranking, because the orders of probability functions of data points, rather than their absolute values, are important, and there is no threshold to classify the data. [60] and [45] have shown the promise of using LGC as a ranking algorithm if only one class has initially labeled data. [45] also shows experimentally that the ranking is insensitive to when is between 0:1 and 0:9, when used as rst step in one class learning. In this section, we rst present previous work, which shows that ranking based on Laplacian matrix is sensitive to , when is close to 1. Then we present our analysis of the general case 0<< 1 in LGC ranking. 42 3.4.1 Analysis for the case 1 Zhou et al. [61] show that the LGC ranking function (when dening S as S = D 1 2 WD 1 2 ) is sensitive to if 1 by using Green's function of a Laplace operator. They also show that the ranking function is only in uenced by the way the graph is constructed in case of close to 1. We prove that LGC ranking has a similar property if 1 when S is dened as a stochastic matrix S = D 1 W. When used in ranking, the constant (1) term in (3.5) can be dropped, then the ranking function becomes F = (IS) 1 Y. Let L s = IS and = 1 , then F = 1 (I + L s ) 1 Y. Denote G s = (I + L s ) 1 . By eigenvector decomposition, G s = n X k=1 1 k + v k v 1 k (3.9) where k is the kth eigenvalue of L s , andj 1 jj 2 jj n j. By Perron- Frobenius Theorem,j k j 2; (1 k n). The eigendecomposition of L s can be written as L s = VV 1 , where v k is the eigenvector associated with k , and the kth columns of V. v 1 k is the kth row of V 1 . Since 1 = 0, G s can be written as: G s = 1 v 1 v 1 1 + n X k=2 1 k + v k v 1 k (3.10) If is small enough such thatj 1 jj 1 i + j; (2 i n), the behavior of the ranking function is determined by v 1 alone, where F 1 v 1 v 1 1 (3.11) and v 1 = (1; 1; ; 1) T and v 1 1 = ( 1 n ; 1 n ; ; 1 n ) T (3.12) 43 This implies that for small, the ranking function is only in uenced by the size of the graph, and does not depend on the initial label. This is consistent with our analysis of the eect of on the tradeo of graph energy and initial label t. It is also worth noticing that each row in F is approximately the same if 1, which means that there is no eective ranking. It is still unclear how the ranking function is generally impacted by , if is not close to 1. In the following, we discuss the general case of ; (0< < 1) and its impact on the ranking function. 3.4.2 Analysis of LGC Ranking 3.4.2.1 Ranking Decomposition To simplify the analysis, we assume the number of classes is c = 2. The analysis can be extended to multiple classes. We also assume the queries (labeled data) are in class 1, without loss of generality. We want to prove that the pairwise ranking of two samples x i and x j is stable with if is lower than a bound. The pairwise ranking can be represented as F i2 F j2 . We start by showing that F i2 F j2 is a function of entries of the normalized similarity matrix S and the initial state Y. 
Let K = (IS) 1 and M = IS. Remember that S = D 1 W. F can be expressed as: F = (1)KY (3.13) For samplex i , its corresponding entry in F isF i2 . SinceF i1 +F i2 = 1, ranking F i1 is equivalent to ranking F i2 . We consider only F i2 . Since we only consider the dierence of F i2 and F j2 , 1 can be omitted. Y i2 = 0; (1il) for the rst l labeled samples and Y i2 = 1 2 ; (l + 1in) for rest of the data. For sample i: 44 F i2 = n X z=1 K iz Y z2 = 1 2 n X z=l+1 K iz (3.14) Theorem 1. K iz can be expressed as: K iz = 1 jMj [ n X x=1;x6=z M jx M ij;xz U(j;x)](1) i+z (3.15) whereM ij;xz is the determinant of M when rowsi;j and columnsx;z are crossed out, and U(j;x) = 8 > < > : (1) j+x1 if x<z (1) j+x if x>z (3.16) Proof. K iz = (M 1 ) iz = 1 jMj M iz (1) i+z (3.17) whereM iz is the determinant of M when rowi and columnz are crossed out, and M iz (1) i+z is the matrix cofactor. By Laplace's formula, we express M iz along thejth row in terms of its minors. Then M iz can be expressed as: M iz = n X x=1;x6=z M jx M ij;xz 8 > < > : (1) j+x1 if x<z (1) j+x if x>z (3.18) 45 Plugging (3.18) into (3.17), we obtain (3.15). Similarly, if we expressM jz along the ith row in terms of its minors, a similar expression for K jz can be derived. Thus, Theorem 1 is proved. Given the expression for K iz in (3.15), The dierence between K iz and K jz is: K jz K iz = 8 > > > > > > > > > > > > > > < > > > > > > > > > > > > > > : 1 jMj [ P n x=1;x6=z M ij;xz (M ix U(i;x)M jx U(j;x))](1) j+z if ji is even 1 jMj [ P n x=1;x6=z M ij;xz (M ix U(i;x) +M jx U(j;x))](1) j+z if ji is odd (3.19) Therefore, the dierence of F j2 and F i2 is: F j2 F i2 = 1 2 n X z=l+1 (K jz K iz ) = 8 > > > > > > > > > > > > > > < > > > > > > > > > > > > > > : P n z=l+1 1 2jMj [ P n x=1;x6=z M ij;xz (M ix U(i;x)M jx U(j;x))](1) j+z if ji is even P n z=l+1 1 2jMj [ P n x=1;x6=z M ij;xz (M ix U(i;x) +M jx U(j;x))](1) j+z if ji is odd (3.20) 46 Thus,F j2 F i2 can be expressed as a function of entries in M and its cofactors. Then in our analysis we separate the terms in (3.20) into two terms. First we introduce the dierence term: 8 > > > > < > > > > : M ix U(i;x)M jx U(j;x) ji is even or M ix U(i;x) +M jx U(j;x); ji is odd (3.21) and separately we consider the cofactor term: M ij;xz jMj . Each of these terms are analyzed separately. 3.4.2.2 Dierence Term Since M = IS, we can express M ix in terms of S. 8 > < > : M ix =S ix if i6=x M ix = 1S ix if i =x (3.22) There are four cases of (3.21): Case 1: i6=x and j6=x: 8 > > > > > > > > > > < > > > > > > > > > > : M ix U(i;x)M jx U(j;x)) =(S jx S ix ); ji is even or M ix U(i;x) +M jx U(j;x)) =(S jx +S ix ); ji is odd (3.23) The comes from U(i;x) and U(j;x). 47 Case 2: i6=x and j =x: 8 > > > > > > > > > > < > > > > > > > > > > : M ix U(i;x)M jx U(j;x)) =[(S jx S ix ) 1]; ji is even or M ix U(i;x) +M jx U(j;x)) =[1(S jx +S ix )]; ji is odd (3.24) In case of small 0, 8 > > > > < > > > > : M ix U(i;x)M jx U(j;x))1; ji is even or M ix U(i;x) +M jx U(j;x))1; ji is odd (3.25) Case 3: i =x and j6=x: 8 > > > > > > > > > > < > > > > > > > > > > : M ix U(i;x)M jx U(j;x)) =[(S jx S ix ) 1]; ji is even or M ix U(i;x) +M jx U(j;x)) =[1(S jx +S ix )]; ji is odd (3.26) The approximation of case 3 is the same as in (3.25). 
48 Case 4: i =x and j =x: 8 > > > > > > > > > > < > > > > > > > > > > : M ix U(i;x)M jx U(j;x)) =(S jx S ix ); ji is even or M ix U(i;x) +M jx U(j;x)) =[2(S jx +S ix )]; ji is odd (3.27) In case of small 0, 8 > > > > > > > > > > < > > > > > > > > > > : M ix U(i;x)M jx U(j;x)) =(S jx S ix ); ji is even or M ix U(i;x) +M jx U(j;x))2; ji is odd (3.28) With the above approximations, we conclude that the dierence term can be either a constant value or a product of a term and a term which is a function of S. 3.4.2.3 Cofactor Term We have dened L s = I S, = 1 and M = (I + L s ). Since we are only concerned about the sign of M, the constant parameter is dropped. The eigenvalue decomposition of M is M = V( +I)V 1 , where the columns of V are the eigenvectors of L s , and the diagonal of are the eigenvalues of L s . V 1 49 is the inverse of V. Since M is a symmetric matrix, V is invertible. Note that V, V 1 and do not depend on . We can also write M as follows: M = n X k=1 ( k +)v k v 1 k (3.29) where v k is the kth column of V, and v 1 k is kth row of V 1 . By Perron- Frobenius Theorem,j k j 2; (1kn). If is large enough ( is small enough), M is approximated by: M n X k=1 v k v 1 k (3.30) Therefore, the determinant of M can be approximated byjMjj P n k=1 v k v 1 k j. Similarly, M ij;xz can be expressed as: M ij;xz =j n X k=1 ( k +)v k(ij) v 1 k(xz) j (3.31) where v k(ij) iskth column of V withith andjth rows crossed out, and v 1 k(xz) is kth row of V 1 with xth and zth columns crossed out. If is large enough, M ij;xz is approximated by: M ij;xz = j n X k=1 ( k +)v k(ij) v 1 k(xz) j j n X k=1 v k(ij) v 1 k(xz) j (3.32) Therefore, the cofactor term M ij;xz jMj can be approximated by: M ij;xz jMj j P n k=1 v k(ij) v 1 k(xz) j j P n k=1 v k v 1 k j (3.33) 50 3.4.2.4 Approximation of Ranking If we plug the approximation of both dierence term and cofactor term into (3.19), K jz K iz can be approximated as follows. Case 1: i6=x and j6=x. We plug (3.23) and (3.33) into (3.19): K jz K iz 8 > > > > > > > > > > > > > > < > > > > > > > > > > > > > > : (1) j+z [ P n x=1;x6=z j P n k=1 v k(ij) v 1 k(xz) j j P n k=1 v k v 1 k j ((S jx S ix ))] if ji is even (1) j+z [ P n x=1;x6=z j P n k=1 v k(ij) v 1 k(xz) j j P n k=1 v k v 1 k j ((S jx +S ix ))] if ji is odd (3.34) Case 2: i6=x and j =x. We plug (3.24) and (3.33) into (3.19): K jz K iz 8 > > > > > > > < > > > > > > > : (1) j+z [ P n x=1;x6=z j P n k=1 v k(ij) v 1 k(xz) j j P n k=1 v k v 1 k j (1)] if ji is even (1) j+z [ P n x=1;x6=z j P n k=1 v k(ij) v 1 k(xz) j j P n k=1 v k v 1 k j (1)] if ji is odd (3.35) Case 3: i =x and j6=x. The approximation of case 3 is the same as in (3.35). Case 4: i =x and j =x. We plug (3.27) and (3.33) into (3.19): K jz K iz 8 > > > > > > > > > > > < > > > > > > > > > > > : (1) j+z [ P n x=1;x6=z j P n k=1 v k(ij) v 1 k(xz) j j P n k=1 v k v 1 k j ((S jx S ix ))] if ji is even 2(1) j+z [ P n x=1;x6=z j P n k=1 v k(ij) v 1 k(xz) j j P n k=1 v k v 1 k j (1)] if ji is odd (3.36) 51 The above equations show that the approximation of K jz K iz is separated into two parts. The rst part is either or a constant value. The second part is a complex term which is a function of the terms in S. The complex term is indepen- dent of . Since F j2 F i2 = 1 2 P n z=l+1 (K jz K iz ), F j2 F i2 can be approximated in a similar way, separated in to a term dependent on (or a constant value) and a function of the terms in S. Therefore, if is small enough (below a bound), the pairwise dierence remains stable with changes of (changes below the bound). 
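This prediction is easy to check numerically: compute F* for several values of alpha and compare the orderings induced by its second column. A small self-contained sketch follows; the toy graph, the single labeled point and all names are our own choices.

import numpy as np

def ranking_is_stable(S, Y, alphas):
    # True if the ordering of column 2 of F* is identical for every alpha in
    # `alphas` (S row stochastic, Y the initial state of Section 3.2.1).
    n = S.shape[0]
    orders = []
    for a in alphas:
        F = (1.0 - a) * np.linalg.solve(np.eye(n) - a * S, Y)
        orders.append(tuple(np.argsort(-F[:, 1], kind="stable")))
    return len(set(orders)) == 1

# toy check: random similarity graph, one labeled point in class 1
rng = np.random.default_rng(0)
W = rng.random((60, 60)); W = (W + W.T) / 2.0; np.fill_diagonal(W, 0.0)
S = W / W.sum(axis=1, keepdims=True)
Y = np.full((60, 2), 0.5); Y[0] = [1.0, 0.0]
print(ranking_is_stable(S, Y, [1e-5, 1e-4, 1e-3]))   # small alpha: stable ranking expected
print(ranking_is_stable(S, Y, [0.3, 0.6, 0.9]))      # large alpha: pairs may reorder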
3.4.3 Choice of the Bound To derive the approximated pairwise ranking, two approximations are used in the cofactor term, in (3.30) and (3.32), and another two approximations are used in the dierence term, in (3.25) and (3.28). In order for the approximation in (3.30) and (3.32) to hold, must be large enough (corresponding to small ). D 1 S is a stochastic matrix. By the Perron- Frobenius theorem, the absolute values of the eigenvalues k ; (1kn) of L s = I D 1 S are bounded by 2. For (3.30) and (3.32) to hold, it is required is substantially greater thanj k j; (1 k n). For example choosing bound to be < 0:005 will be suciently good bound. In order for the approximation in (3.25) and (3.28) to hold, we assumej(S jx S ix )j< 0:01. In this case, < 0:01 jS jx S ix j . The bound depends on the values in the similarity matrix. If the edge weights from a node x dier drastically, a smaller bound on is obtained. In summary, the boundB is B = min i;j;x (0:005; 0:01 jS jx S ix j ) (3.37) 52 3.4.4 Extension to Multi-class Ranking If case of multi-class rankingc> 2, assume the labeled class is class 1. F i1 andF j1 are: F i1 = l X z=1 K iz + 1 c n X z=l+1 K iz F j1 = l X z=1 K jz + 1 c n X z=l+1 K jz (3.38) K jz K iz is the same as in (3.19). The dierence of F j1 and F i1 the becomes: F j1 F i1 = l X z=1 (K jz K iz ) + 1 c n X z=l+1 (K jz K iz ) (3.39) 1 c P n z=l+1 (K jz K iz ) is the same as (3.20) except for a constant 1 c . P l z=1 (K jz K iz ) is l X z=1 (K jz K iz ) = 8 > > > > > > > > > > > > > > < > > > > > > > > > > > > > > : P l z=1 (1) j+z jMj [ P n x=1;x6=z M ij;xz (M ix U(i;x)M jx U(j;x))] if ji is even P l z=1 (1) j+z jMj [ P n x=1;x6=z M ij;xz (M ix U(i;x) +M jx U(j;x))] if ji is odd (3.40) The only dierence between (3.40) and (3.20) is that the summation is from 1 to l. We can also decompose (3.40) into the product of a term dependent on (or a constant value) and a function of items in the similarity matrix as in 53 (3.31),(3.33),(3.34). In multi-class case, the pairwise ranking is also stable with small . 3.4.5 Evaluation of in LGC Ranking If the data points are ranked by LGC, there are two ways to retrieve the data according to thekth column of F : (a) Retrieve the top ranking data points which correspond to the data points most relevant to the labeled data. (b) Retrieve the bottom ranking data points which correspond to the data points most irrelevant to the labeled data (i.e., most similar to classes other than the labeled one). In the following experiments in order to test LGC ranking, we take the second view of retrieving ranked data. Dene N R as the number of samples retrieved that belong to classes other than the labeled class. We use the following metric accuracyA to measure the eectiveness of LGC ranking: A = N R R (3.41) where R is the number of samples retrieved. We rst run LGC ranking on the two moon data. 20% of the left upper moon are selected as labeled data. The RBF kernel is used in calculating the similarity matrix. The boundB for two moon data is calculated by (3.37), and is found to beB = 0:005. Let take values in the set 10 5 ; 10 4 ; 10 3 ; 10 2 ; 0:1; 0:2;:::; 0:9. Then we selectR data points that are least relevant to the labeled data. In this experiment with two moon data, R is equal to the number of labeled samples. If labeled data belongs to class 1, then the data points least relevant to the labeled data are the ones with the smallest F i1 (highest F i2 ). Figure 3.3 shows the R retrieved data points 54 for = 0:9; 0:5; 0:1; 10 2 ; 10 4 ; 10 5 . 
The blue circles are unlabeled positive data points and the green circles are labeled ones. The cyan triangles and black stars are retrieved data points. The black stars represent the retrieved data points that do not appear in the next subgure. For example, the black stars in Figure 3.3(b) are retrieved data points that are not retrieved in Figure 3.3(c). It can be observed that when changes from 0:9 to 0:01, some of the retrieved data points change, as manifested by the black stars. The number of black stars is decreasing as is decreasing, which means that the pairwise rankings of data points have fewer and fewer changes. When is small, and changes from 0:1 to 10 5 , there are no black stars and the retrieved data remain the same. In all cases of ,A remains 1, which means the accuracy of retrieving data is not sensitive to . However, the retrieved data points are not the same when is greater than 0:1, but remain unchanged when is lower than 0:1. The experimental results are consistent with the theoretical analysis that as long as is small, the pairwise rankings are stable as changes. We also test LGC ranking on the real data sets of USPS and Mnist. Each experiment is performed on the two subsets of the data. One of the subsets is denoted as positive and other is negative. 20% of the positive class are labeled. We set R, the total number of retrieved nodes, as the total number of negative data. A is a measure of how many true negative are retrieved. The accuracyA of LGC ranking in USPS and Mnist data sets are shown in Table 3.1 and Table 3.2. RBF kernel is also used in the similarity matrix of USPS and Mnist data sets. The bound of for the two data sets isB = 0:005. We calculate the average retrieval accuracy of >B and <B and their standard deviations. They are shown in Table 3.3 and Table 3.4. 55 (a) Original Image (b) = 0:9 (c) = 0:5 (d) = 0:2 (e) = 0:1 (f) = 0:01 (g) = 0:0001 (h) = 0:00001 Figure 3.3: Two Moons data and the identied negative data in the right lower moon. The blue circles are unlabeled positive data points and the green circles are labeled ones. The cyan triangles and black stars are retrieved data points. The black stars represent the retrieved data points that do not appear in the next subgure. 56 We observe that the accuracy of <B is consistently higher than >B , and the standard deviation of<B is consistently lower than>B in all data sets. This is also consistent with the theoretical analysis. Table 3.1: Accuracy of USPS with respect to (%) A A A A (Class 4 and 7) (Class 0 and 1) (Class 5 and 9) (Class 2 and 8) 0.9 94.00 91.27 97.45 98.36 0.7 93.00 93.55 96.91 98.55 0.5 91.73 93.64 96.91 98.91 0.3 96.00 92.18 96.91 98.73 0.1 96.36 90.09 97.00 98.55 0.01 94.91 95.36 98.27 98.91 10 3 95.55 95.82 97.27 98.45 10 4 97.00 93.09 98.45 99.81 10 5 95.55 97.82 97.82 99.18 10 6 95.55 95.55 97.82 99.27 10 7 95.82 97.18 97.91 99.45 10 8 95.73 96.64 98.00 99.00 3.4.6 Summary As the solution to the LGC algorithm, F is a function of parameter; (0<< 1). When LGC is applied to ranking, we have shown that the ranking of two data points can be decomposed into two parts. By approximations of both parts, we conclude that in case of small , the pairwise ranking can be approximated by a product of and a function of the similarity matrix. Thus, LGC ranking is stable when is small. But notice that small is only a sucient condition that LGC ranking is independent of , which means that in some cases a larger may be chosen and also have little in uence on LGC ranking. 
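For completeness, the bound B of (3.37) reduces to a single column-spread computation on S, since the minimum of 0.01/|S_jx - S_ix| over all i, j, x is 0.01 divided by the largest within-column spread of S. A short sketch, with the constants as in Section 3.4.3 and the function name our own:

import numpy as np

def alpha_bound(S, c1=0.005, c2=0.01):
    # B of (3.37): any alpha below B keeps the pairwise rankings stable.
    spread = (S.max(axis=0) - S.min(axis=0)).max()   # max over x of max_{i,j} |S_jx - S_ix|
    return c1 if spread == 0 else min(c1, c2 / spread)

# example: with S = D^{-1} W as in Algorithm 4, alpha = 0.5 * alpha_bound(S)
# is a safe choice that needs no validation.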
57 Table 3.2: Accuracy of Mnist with respect to (%) A A A A (Class 4 and 7) (Class 0 and 1) (Class 5 and 9) (Class 2 and 8) 0.9 97.49 100 97.54 97.67 0.7 97.81 100 97.43 97.86 0.5 98.02 100 97.64 98.15 0.3 98.02 100 97.54 97.67 0.1 97.91 100 97.64 98.05 0.01 98.23 100 96.71 96.69 10 3 97.91 100 97.64 97.57 10 4 98.02 100 97.33 98.35 10 5 98.12 100 97.74 98.05 10 6 98.23 100 97.33 98.35 10 7 98.12 100 97.64 97.86 10 8 98.12 100 97.43 97.76 Table 3.3: Average Accuracy and Standard Deviation of USPS >B and <B (mean % (std)) >B <B A (Class 4 and 7) 94.33 (1.78) 95.86 (0.57) A (Class 0 and 1) 92.69 (1.89) 96.02 (1.66) A (Class 5 and 9) 97.24 (0.55) 97.88 (0.38) A (Class 2 and 8) 98.67 (0.22) 99.26 (0.18) Table 3.4: Average Accuracy and Standard Deviation of Mnist >B and <B (mean % (std)) >B <B A (Class 4 and 7) 97.91 (0.25) 98.09 (0.11) A (Class 0 and 1) 100 (0) 100 (0) A (Class 5 and 9) 97.42 (0.35) 97.52 (0.18) A (Class 2 and 8) 97.71 (0.49) 97.88 (0.28) 58 3.5 Application In this section, we apply LGC ranking to binary one class learning. Normally, binary supervised learning problems are based on training using both labeled positive and negative data. One class learning is a kind of information retrieval using only labeled positive data. The key feature of this problem is that there are no labeled negative samples. In recent years, there have been some algorithms proposed to solve the one class learning problem in a two step strategy, namely, Identication, i.e., identifying a number of reliable negative samples from the unlabeled set. Classication, i.e., building a classier with the positive labeled samples and the selected negative ones. The two step strategy has the advantage that there are many available conven- tional supervised learning tools. As long as the negative samples are accurately identied, good classication can be achieved by a state-of-the-art classier. The S-EM technique [33], PEBL [56] and Roc-SVM [32] are all two-step algo- rithms. S-EM uses Naive Bayesian, PEBL uses a positive feature set, and Roc-SVM uses a Rocchio classier to identify negative samples. Compared with the above methods, the graph-based method can explore the connection between the sam- ples through the paths in the graph, so the results take into account the geometric structure of the data. For the second step, S-EM uses Expectation-Maximization (EM) algorithm with Naive Bayesian (NB) classier. PEBL and Roc-SVM both use SVM. We propose a new two step algorithm for one class learning, which we call Graph- OCL [45]. The rst step, identifying R reliable negative samples is based on LGC 59 ranking. We rst build a graph with each node as a sample vector and edge weight as pairwise similarity. Then we select reliable negative samples by LGC ranking. Based on our analysis of the LGC ranking in the previous sections, the ranking is accurate and insensitive to the parameter as long as is below a certain bound (<B ). LGC provides a simple approach to identify negative samples, since we do not have to tune any parameters by time consuming validation methods. An important issue in Identication is to determine the number of reliable negative samples R. We employ the spy sample method introduced in [33] to determine R. For classication, we use TSVM [23] to classify the data. As we are given the unlabeled samples at the time of training and are only interested in the classication of the observed data (no out-of-sample problem), we use TSVM instead of regular SVM to build a classier. 
The eectiveness of TSVM depends on the choice of kernel function. We show that in order for the data representation to be consistent in two steps, the similarity matrix used in the identication step should be a normalized kernel matrix in TSVM. We rst determine the kernel matrix used in TSVM and then derive the similarity matrix from it. We apply the complete algorithm to the 20 Newsgroup data and compare it to the Naive Bayesian (NB) [35] and S-EM methods [33]. We test the algorithm on several popular kernel functions (linear, RBF, polynomial, etc.). We found in our experiments that linear kernel matrix, corresponding to the cosine similarity matrix in Identication, has the best average classication accuracy. The improved classication accuracy show the eectiveness of Graph-OCL. 60 3.5.1 Identify Negative Samples In binary one class learning, F(t); (t 0), F and Y are alln2 matrices. Assume class 1 has labeled data, and class 2 does not have any labeled data. Y i1 = 1 and Y i2 = 0 if x i is labeled. Y i1 =Y i2 = 1=2 if x i is unlabeled. The approach for identifying negative samples by LGC ranking is shown in Algorithm 5: Algorithm 5 Identify negative samples 1. Form the similarity matrix W. 2. Construct matrix S = D 1 W in which D is a diagonal matrix with its (i;i)- element equal to the sum of the i-th row of W. 3. Calculate F = (1)(IS) 1 Y 4. Select R samples with hightest F i2 as the reliable negative samples. F ij (j = 1; 2) is the probability of sample x i having label j after innite time from the initial state. The second column of F is the probability of samples belonging to the negative set given the labeled information. Graph-OCL chooses R unlabeled samples with highest F i2 as the most reliable negative samples. 3.5.2 Selecting the Number of Reliable Negative Samples The number of negative samples R in Algorithm 5 is yet to be determined. We modied the spy document technique in S-EM [33] to decide the number of reliable negative samples. We rst randomly select a subsetS R of positive samples from the labeled set and put them in the unlabeled set. Samples in S R acts as 'spies'. The spy samples act similarly to the unknown positive samples. Hence, they allow the algorithm to infer the behavior of the unknown positive samples in the mixed set. We run Algorithm 4, then select those samples whoseF i2 is greater than the highest F i2 of the spy samples. This implies that the reliable negative samples should have 61 a probability of belonging to negative class higher than the positive spies. However, if the spy samples happen to be close to the labeled ones in the feature space, this method will lead to more than enough negative samples, sometimes introduce false ones. So we restrict the numberR to be no greater than the number of the positive labeled ones, which is denoted as l. In summary, we rst selectR unlabeled samples whoseF i2 are greater than the highest F i2 of the spy samples. If R is less than the number of positive labeled samples, R reliable negative samples are kept. Otherwise, only the l samples with the highest F i2 values are kept as reliable negative. In the experiments of both toy example and real data, we set the ratio of spy samples in the labeled set to be 10%. 3.5.3 Classication by TSVM Unlike S-EM, which uses NB based methods for both Identication and Classica- tion, we use dierent methods for two steps. As mentioned in Section 3.3, LGC is not an accurate classier, especially when the labeled data in positive and negative classes are unbalanced. 
Since we cannot guarantee the balance of labeled data in both classes, we use SVM based method, which is not so sensitive to the unbal- ance problem. We use Transductive SVM (TSVM) as a classier, which takes into consideration of both labeled and unlabeled data during training. Kernel Selection Though we use dierent methods in Identication and Classication, these two methods are not completely separate. They are related through the similarity matrix in Identication and the kernel matrix of TSVM in Classication. The 62 choice of both matrices are important, because they re ect the structure of the data. The kernel function K(x i ;x j ) is the inner product of the higher dimensional mapping of the original samples x i and x j . If no mapping to higher dimension is performed, the kernel is the inner product of original vectors, which corresponds to linear kernel. In order to exploit the data structure in Identication and Classication in a consistent way, we apply the same kernel to both steps. We use a normalized kernel matrix K n as the similarity matrix W when identifying negative samples, and the original kernel matrix K in TSVM. The kernel matrix is transformed to a normalized one as: K n (x i ;x j ) = K(x i ;x j ) p K(x i ;x i )K(x j ;x j ) (3.42) In other words, this is the normalized inner product of sample data vectors. If the similarity matrix is dened as the cosine similarity, the corresponding kernel in TSVM is linear kernel. Dening either the similarity matrix or the kernel matrix rst automatically decides the other matrix. To show the eectiveness of the above idea, we present the classication of using cosine similarity in Identication and RBF and polynomial kernel in Classication. We show in Table (3.5) that linear kernel provides the best classication results. 3.5.4 Graph-OCL Algorithm The complete algorithm of Graph-OCL is described in Algorithm 6. 63 Algorithm 6 Graph-OCL Algorithm 1. Select S R samples from the labeled positive set to form the spy samples and move them to the unlabeled set. 2. Select a kernel matrix K for TSVM and derive the similarity matrix W from K. 3. Run Algorithm 5 to determine R. 4. Run Algorithm 5 again and select R most reliable negative samples. 5. Run TSVM based on the labeled positive samples and selected negative sam- ples. 3.5.5 Experimental Results We test the algorithm on 20 Newsgroups [30]. This is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 dierent news- groups, which are also categorized into 4 main categories, computer, recreation, science and talk, as shown in Figure 3.4. Figure 3.4: 20 Newsgroup Data Subjects The rst step in document classication is to transform the documents, which are typically strings of characters into a representation suitable for learning algo- rithm. We use vector space model as a representation of documents. Suppose D is the corpus of documents. Each document d inD is represented as a feature vector [w 1 ;w 2 ; ;w m ], where m is the total number of words. Each distinct word corre- sponds to a feature, with the number of times the word occurs in the document as its value, i.e., w j corresponds to the number that word j occurs in the document. 64 This leads to extremely long feature vectors. To avoid unnecessary words that are not discriminative, we rst remove the 'stop-words' (like 'and', 'or', etc.). Then words are considered as features if they occur at least 3 times in the training data. 
For simplicity, we refer to m as the number of features after feature reduction. Finally, we scale the feature vector by their TF-IDF. Term frequency (tf) is a measure of the importance of word j within the par- ticular document d i . tf ij =n ij = X k n ik (3.43) where n ij is the number of occurrences of the considered word j in document d i , and the denominator is the sum of number of occurrences of all words in document d i . The inverse document frequency (idf) is a measure of the general importance of a word. idf j = log jDj jd : word j2dj (3.44) jDj is the total number of documents in the corpus.jd : word j2dj is the number of documents where word j appears. tf-idf ij =tf ij idf j (3.45) Withtf-idf ij as feature, document d i is represented as a vector withjth column astf-idf ij . We build a weighted graph, in which each node represents a document. The cosine similarity of two documents denes edge weight between two nodes. W ij = d i d j k d i kk d j k (3.46) 65 We choose dierent groups in Newsgroup 20 data as positive class, then using various individual groups as unlabeled samples. For each experiment, we divide the positive class into two subsets, labeled and unlabeled. In the experiments, we set the labeled positive set to be 20% of the positive class and 10% of the labeled positive set as spy documents. For TSVM, we use the SVMlight package [22]. The boundB for the 20 Newsgroup data is 0.005, and is set to be 0.001 in the experiments. We measure the algorithm by both classication accuracy and an information retrieval measure F score, dened as: F = 2pr=(p +r), where p is the precision and r is the recall. F score measures the performance of a system on a particular class [44], and re ects an average eect of both precision and recall. Table 3.5 presents the classication accuracy of using dierent kernels in TSVM with corresponding similarity matrix in Identication. It shows that linear kernel has the best performance in most data sets, so linear kernel is used in Graph-OCL to compare with other algorithms. Table 3.6 shows that Graph-OCL has better classication accuracy than NB and S-EM. Graph-OCL is also more stable, since all accuracies are above 85%, but NB and S-EM have some accuracy lower than 80%. The F-score of Graph-OCL is much higher than NB and S-EM, which means both precision and recall are good. Table 3.5: Classication Accuracy of Documents with Dierent Kernels kernel os-win/ graphic/ pol.guns/ hockey/ type wind.x hardmac pol.misc baseball linear 89.1% 93.3% 90.7% 96.8% RBF 88.5% 65.5% 59.4% 96.0% polynomial 88.8% 93.1% 90.8% 96.3% 66 Table 3.6: Classication Accuracy of Documents (Linear Kernel) Positive Negative Graph- Graph- NB NB S-EM S-EM OCL OCL F score Accuracy F score Accuracy F score Accuracy os-win wind.x 88.0% 89.1% 45.6% 79.9% 92.2% 95.8% graphic hardmac 92.8% 93.3% 14.9% 73.6% 41.1% 78.5% rel.misc pol.misc 87.0% 89.0% 30.8% 75.7% 65.4% 82.7% hockey baseball 96.4% 96.8% 89.6% 95.4% 96.8% 98.4% pol.guns pol.misc 90.0% 90.7% 48.3% 79.9% 75.4% 87.3% politics rec 83.9% 86.2% 28.8% 72.6% 29.7% 73.1% hardmac os-win 88.9% 90.1% 74.2% 88.2% 94.4% 96.8% hardpc hardmac 83.6% 85.0% 52.8% 90.8% 82.7% 95.4% average 88.8% 90.0% 48.1% 82.1% 72.2% 88.5% 3.6 Conclusions We analyze the LGC algorithm for both classication and ranking. We propose an information diusion interpretation of LGC, then analyze the parameter 's impact on both classication and ranking. 
It is found that alpha has a large impact on the classification results, especially when the labeled data across classes are unbalanced. For LGC ranking, we theoretically prove that the pairwise ranking remains unchanged when alpha is smaller than a bound; in other words, alpha does not have to be tuned by validation as long as it is small enough. We apply LGC ranking to one class learning as a first step to identify negative samples, and propose a new one class learning algorithm, Graph-OCL. Experimental results have demonstrated the effectiveness of both LGC ranking and Graph-OCL. However, graph-based methods have a scalability problem: for large problems (e.g., data size > 10,000), both storing the similarity matrix and solving the associated linear systems are prohibitive. Approximation methods are therefore necessary to scale graph-based algorithms.

Chapter 4
A Novel Adaptive Nyström Method

4.1 Introduction

Kernel methods play an important role in machine learning and have been used successfully in algorithms such as support vector machines (SVMs) [11], kernel principal component analysis (KPCA) [43] and manifold learning [40]. One example is the use of the kernel matrix in the Graph-OCL algorithm introduced in Chapter 3. A major challenge in applying kernel methods to modern learning problems comes from the fact that these usually involve large data sets of tens of thousands to millions of data points, leading to significant complexity in terms of both space (quadratic) and time (usually cubic). A similar problem may occur in graph based algorithms, in which using the similarity matrix of the graph also involves significant complexity. One solution for dealing with such large data sets is to use an approximation of the kernel matrix. The Nyström method [55] is an efficient technique for obtaining a low-rank approximation of a large kernel matrix using only a subset of its columns. A key problem in the Nyström method, and its extensions, is that of determining which subset of the columns is to be used; the complexity and quality of the approximation depend heavily on the selected subset.

While column selection based on random sampling with a uniform probability distribution has been the most popular approach, a considerable amount of research has explored other sampling methods. [3] proved that if the number of independent sampled columns is equal to the rank of the kernel matrix, a perfect reconstruction can be achieved. However, finding these independent columns is difficult given the large data size. Our work develops a novel sampling strategy taking as a starting point two recently proposed methods [46]. The first approach selects columns iteratively using an adaptive sampling method, where at each step the probabilities associated with each column are updated based on the quality of the intermediate approximations to the kernel matrix that are obtained [28]. A second approach makes use of greedy sampling, finding the best rank-1 Nyström approximation at each iteration and then adding the corresponding approximations to obtain an aggregate one [16]. Both algorithms are iterative; they are based on building intermediate approximations and then obtaining the final result using the set of all the selected columns.

Our proposed work is motivated by a novel interpretation of the Nyström approximation method. Assume the data is a matrix in which rows represent samples and columns represent features.
We show that sampling the columns of the kernel matrix is equivalent to projecting the data onto the subspace spanned by the corre- sponding columns. Based on this, a good Nystr om approximation can be achieved if the space spanned by the sampled columns has a large intersection with the space spanned by the data mapped to the eigenvectors corresponding to the top eigenval- ues of the kernel matrix. We verify this property using a simple experiment with a randomly generated kernel matrix. We also note that if subsets of columns exhibit this property, then it is likely that their union will also have the same property. Obviously the property cannot be used in sampling columns since it would require 69 knowing the top eigenvectors of the kernel matrix. However, as will be shown, it is possible to make use of alternative metrics to quantify the suitability of a column or a set of columns. In this chapter, we proposed a novel technique for iterative, low complexity col- umn selection. It extends the Ensemble Nystr om approximation method [27] by including greedy selection of columns and adaptive probability update. Unlike [16] we do not perform a greedy column-by-column selection, which is inecient and requires quadratic complexity. Instead, we compare the approximation properties of several subsets of columns, typically using the Frobenius error. The underlying assumption is that the subsets with lower error will also provide better approxima- tion. This approach is faster than column-wise selection, since a smaller number of alternatives needs to be evaluated, i.e., the number of subsets is much smaller than the number of columns. As in [16], the approach is greedy, that is, the best subset is selected and will be included in the computation of the nal estimate. However, unlike [16], each iteration is based on random sampling based on estimated prob- abilities for the remaining columns, which are obtained from previous iterations. We adopt the name BoostNystr om, because each iteration is used to update the probability of the remaining columns. The adaptive probability update gives higher probability to potentially good columns. Note that our adaptive probability esti- mation diers from that of [28]. BoostNystr om identies potentially good columns as those with low approximation error in the current iteration, and thus it does not impose any requirement on the rank as is the case in [28]. In the nal iterative step, Ensemble Nystr om approximation is performed on the set of columns selected through the iterations. Standard Ensemble Nystr om has high variance because of its use of randomly sampled columns. BoostNystr om reduces the high variance by incrementally adding a subset of columns with low 70 approximation error at each iteration. We derive an error bound for BoostNystr om, which guarantees a better convergence rate than both the standard Nystr om and the Ensemble Nystr om methods. Experimental results show the eectiveness of BoostNystr om. 4.2 Nystr om Approximation In many machine learning algorithms, we need a matrix, each item of which rep- resents the pairwise relationship between the two samples. The matrix can be a similarity matrix in graph based algorithms, a distance matrix or a kernel matrix in some kernel based algorithms. The similarity matrix and the kernel matrix used in Graph-OCL are two examples. As the number of samples grows, it is increas- ing dicult to store and calculate this type of matrix. 
Nystr om approximation is a way to approximate the matrix by using only a few columns, such that the computational and memory complexity can be greatly reduced. 4.2.1 Standard Nystr om Method Let K be a symmetric positive semidenite (SPSD) matrix. Any kernel matrix, inner product matrix or graph similarity matrix is a SPSD matrix, so we will discuss the Nystr om approximation in terms of a general SPSD matrix. The Nystr om approximation of K is obtained by sampling mn columns of K. Let E denote thenm matrix consisting of these sampled columns. Let W be themm matrix formed by the intersection of the m sampled columns with the corresponding m rows of K. We write K and E in terms of their sampled submatrices as follows: 71 K = 2 6 4 W K T 21 K 21 K 22 3 7 5 and E = 2 6 4 W K 21 3 7 5 (4.1) The Nystr om method generates a rank-k approximation ~ K of K for k m dened by: ~ K = EW + k E T K (4.2) where W k is the best rank-k approximation of W for the Frobenius norm, and W + k denotes the pseudo-inverse of W k . Note that W is also SPSD since K is SPSD. W + k can be derived from the singular value decomposition (SVD) of W. W = ~ V ~ ~ V T , where ~ V = [~ v 1 ; ~ v 2 ; ; ~ v m ] is orthonormal and ~ = diag( 1 ; ; m ) is a real diagonal matrix with 1 m 0. For k rank(W), it is given by W + k = P k i=1 1 i ~ v i ~ v T i where ~ v i denotes theith column of ~ V. Since the running time complexity of SVD isO(m 3 ) andO(nmk) is required for multiplication with E, the total complexity of the Nystr om approximation computation is O(m 3 +nmk). 4.2.2 Novel Perspective Assume that a linear kernel is used (the same proof applies to other kernel types, as any kernel matrix implicitly maps data vectors to a high-dimensional linear space.) The kernel matrix is calculated as K = X T X, where X is a dimn data matrix, and dim is the number of features in each data vector. We simplify the Nystr om approximation by xing k = m and assuming W is full rank. Then the Nystr om approximation of (4.2) based on a setS ofm randomly sampled columns becomes: ~ K = EW 1 E T (4.3) 72 Let X m be the matrix containing the m sampled columns of X corresponding to the columns sampled in K, then ~ K can be written as: ~ K = X T X m (X T m X m ) 1 X T m X (4.4) Since X T m X m is also an SPSD matrix, it can be decomposed in terms of its eigenvalues and eigenvectors: X T m X m = QQ T (4.5) where Q = [q 1 ; q 2 ; ; q m ] is orthonormal and q i (i = 1; 2; ;m) is the i- th eigenvector of X T m X m . is the diagonal matrix containing the eigenvalues 1 ; 2 ; ; m as diagonal elements and 1 2 m . The approximated kernel matrix ~ K can then be expressed as: ~ K = X T X m (QQ T ) 1 X T m X = X T X m (Q 1 2 ) ( 1 2 Q T )X T m X (4.6) where Q 1 2 = [ 1 p 1 q 1 ; 1 p 2 q 2 ; ; 1 p m q m ] are the eigenvectors normalized by the square root of their corresponding eigenvalues. X T X m is the original data mapped onto the m sampled columns in X m . Then X T X m (Q 1 2 ) represents the mapped data X T X m expressed in terms of q i (i = 1; 2; ;m) and normalized by 1 p i ; p i is the magnitude of X m along the vector q i . Thus, X T X m (Q 1 2 ) maps columns of X onto the space, denoted bySP m , spanned by the bases [ 1 p 1 X m q 1 ; 1 p 2 X m q 2 ; ; 1 p m X m q m ]. Dene U m = X m (Q 1 2 ). The ith column of U m is [ 1 p i X m q i ~ K is the linear kernel of X T U m . Note that SP m 73 is the same space spanned by the columns of X m , and the columns of U m are the orthonormalized bases of SP m . 
4.2.2 Novel Perspective

Assume that a linear kernel is used (the same argument applies to other kernel types, since any kernel matrix implicitly maps the data vectors to a high-dimensional linear space). The kernel matrix is K = X^T X, where X is a dim × n data matrix and dim is the number of features in each data vector. We simplify the Nyström approximation by fixing k = m and assuming that W is full rank. The Nyström approximation (4.2), based on a set S of m randomly sampled columns, then becomes:

\tilde{K} = E W^{-1} E^T    (4.3)

Let X_m be the matrix containing the m columns of X corresponding to the columns sampled in K; then K̃ can be written as:

\tilde{K} = X^T X_m (X_m^T X_m)^{-1} X_m^T X    (4.4)

Since X_m^T X_m is also SPSD, it can be decomposed in terms of its eigenvalues and eigenvectors:

X_m^T X_m = Q \Lambda Q^T    (4.5)

where Q = [q_1, q_2, ..., q_m] is orthonormal, q_i (i = 1, 2, ..., m) is the ith eigenvector of X_m^T X_m, and Λ is the diagonal matrix containing the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_m. The approximated kernel matrix K̃ can then be expressed as:

\tilde{K} = X^T X_m (Q \Lambda Q^T)^{-1} X_m^T X = X^T X_m (Q \Lambda^{-1/2})(\Lambda^{-1/2} Q^T) X_m^T X    (4.6)

where Q Λ^{-1/2} = [q_1/\sqrt{\lambda_1}, q_2/\sqrt{\lambda_2}, ..., q_m/\sqrt{\lambda_m}] contains the eigenvectors normalized by the square roots of their corresponding eigenvalues. X^T X_m is the original data mapped onto the m sampled columns in X_m, so X^T X_m (Q Λ^{-1/2}) represents the mapped data X^T X_m expressed in terms of the q_i and normalized by 1/\sqrt{\lambda_i}, where \sqrt{\lambda_i} is the magnitude of X_m along the vector q_i. Thus, X^T X_m (Q Λ^{-1/2}) maps the columns of X onto the space, denoted SP_m, spanned by the bases [X_m q_1/\sqrt{\lambda_1}, X_m q_2/\sqrt{\lambda_2}, ..., X_m q_m/\sqrt{\lambda_m}]. Define U_m = X_m (Q Λ^{-1/2}); the ith column of U_m is X_m q_i/\sqrt{\lambda_i}, and K̃ is the linear kernel of the projected data U_m^T X. Note that SP_m is the same space spanned by the columns of X_m, and the columns of U_m form an orthonormal basis of SP_m.

Thus, K̃ is the linear kernel of the original data projected onto the space spanned by the sampled columns, where the projection matrix is formed from the orthonormal basis obtained by the eigendecomposition of X_m^T X_m. On the other hand, the eigendecomposition of K is:

K = V \Sigma V^T    (4.7)

where V = [v_1, v_2, ..., v_n] contains the eigenvectors of K as columns and the diagonal of Σ contains the eigenvalues σ_1 ≥ σ_2 ≥ ... ≥ σ_n. Since X^T X v_i = σ_i v_i (i = 1, 2, ..., n), K can be written as:

K = X^T X \Big( \tfrac{v_1}{\sqrt{\sigma_1}}, \ldots, \tfrac{v_n}{\sqrt{\sigma_n}} \Big) \Big( \tfrac{v_1}{\sqrt{\sigma_1}}, \ldots, \tfrac{v_n}{\sqrt{\sigma_n}} \Big)^T X^T X = X^T X (V \Sigma^{-1/2})(\Sigma^{-1/2} V^T) X^T X    (4.8)

Equation (4.8) has the same structure as (4.6), and U = X(V Σ^{-1/2}) is an orthonormal basis of the n-dimensional space. Both K and K̃ can therefore be written in linear kernel form:

K = X^T U U^T X, \qquad \tilde{K} = X^T U_m U_m^T X    (4.9)

The approximation error of K̃ is:

\| \tilde{K} - K \| = \| X^T (U_m U_m^T - U U^T) X \| = \| V \Sigma^{1/2} U^T (U_m U_m^T - U U^T) U \Sigma^{1/2} V^T \| = \| V \Sigma^{1/2} (U^T U_m U_m^T U - I) \Sigma^{1/2} V^T \|    (4.10)

According to (4.10), a sampling scheme U_m has zero approximation error if U^T U_m U_m^T U = I. Sampling all columns of K certainly leads to zero error, but when the rank of U_m is lower than the rank of X, U^T U_m U_m^T U = I cannot hold. We now describe how the corresponding error depends on properties of the chosen columns. U_m^T U can be expressed as:

U_m^T U = [U_m^T U_1, U_m^T U_2, \ldots, U_m^T U_n]    (4.11)

where U_i is the ith column of U. For a given rank m, in order for (4.10) to be small, U^T U_m U_m^T U should be close to I. Since (U^T U_m U_m^T U - I) is weighted by Σ in (4.10), two properties are desirable for U_m^T U:

(a) The energy of U_m^T U should be concentrated on the left part of the matrix, in other words, ‖U_m^T U_i‖ ≥ ‖U_m^T U_j‖ for i ≤ j. Intuitively, this means that the subspace SP_m has a larger intersection with the subspaces spanned by {X v_i/\sqrt{\sigma_i}} (1 ≤ i ≤ n) corresponding to large σ_i.

(b) The rows of U_m^T U should be orthogonal, which means that v_i^T K̃ v_j = 0 (1 ≤ i, j ≤ n, i ≠ j).

In order to sample columns that satisfy Properties (a) and (b), we would need to know the eigenvectors and eigenvalues of K in advance, which is not feasible for large data sizes. Instead, we use an alternative metric, the approximation error on a validation set, to measure how good the approximate kernel matrix is. A low approximation error is equivalent to having Properties (a) and (b).

In various iterative algorithms for column sampling, including [28], [16] and our proposed BoostNyström, a subset of columns considered to provide a good approximation is selected at each iterative step, and the final set of selected columns is the union of these subsets. It is not obvious why a union of good subsets should still have good approximation properties, but this can be examined through Properties (a) and (b): since a low approximation error is equivalent to having Properties (a) and (b), the iterative algorithm is justified if these properties of individual subsets carry over to their union.

Property (a) states that, in order to have low approximation error, the subspace spanned by the selected columns should have a large overlap with the subspace obtained from the eigenvectors corresponding to the top eigenvalues of K. A union of subsets of columns can be viewed as a sum of subspaces, and a sum of subspaces that each have Property (a) clearly has Property (a) as well. This shows that Property (a) extends to unions of subsets.
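This last point, together with the projection view of (4.9), can be checked numerically with a small sketch. It is an added illustration (not the experiment of this section, which follows below), and the sizes and index sets are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 30, 20
X = rng.uniform(-10, 10, size=(dim, n))     # columns are data points
K = X.T @ X                                  # linear kernel matrix

def proj_approx(cols):
    """Nystrom approximation written as a projection, K_tilde = X^T P X,
    with P the orthogonal projector onto span(X[:, cols])."""
    U_m, _ = np.linalg.qr(X[:, cols])        # orthonormal basis of the sampled subspace
    return X.T @ U_m @ U_m.T @ X

def nystrom_approx(cols):
    """The same approximation in its usual form E W^+ E^T, as in (4.3)."""
    E, W = K[:, cols], K[np.ix_(cols, cols)]
    return E @ np.linalg.pinv(W) @ E.T

A, B = [0, 1, 2], [7, 8, 9]
union = sorted(set(A) | set(B))
err = lambda cols: np.linalg.norm(K - proj_approx(cols), 'fro')

# 1) Sampling columns is equivalent to projecting onto their span.
assert np.allclose(proj_approx(A), nystrom_approx(A))
# 2) The union spans a larger subspace, so its error never exceeds that of
#    either subset: Property (a) carries over to unions.
assert err(union) <= min(err(A), err(B)) + 1e-9
print(err(A), err(B), err(union))
```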
However, Property (b) does not hold in general when extended to a union of subsets. We can nevertheless show empirically that, compared to Property (b), Property (a) is the dominant factor in determining the approximation error. We propose a metric M to measure Property (a):

M = \big\| (X_m (Q \Lambda^{-1/2}))^T X (V \Sigma^{-1/2}) \Sigma^{1/2} \big\|_F    (4.12)

We perform an experiment on a 20 × 20 matrix X whose entries are drawn independently from a uniform distribution on the interval [-10, 10]. We randomly sample 3 columns, use them to compute a Nyström approximation, and repeat the sampling process 10000 times. For each of the 10000 approximations, we calculate its error ε and its value of M. For two Nyström approximations i and j, if ε_i < ε_j and M_i > M_j, or ε_i > ε_j and M_i < M_j, we say that M is effective for that pair. In our experiment, we observe that over 95% of the pairs are effective.

Thus, we have empirically shown that Property (a) is the dominant factor in determining the approximation error. Since Property (a) can be extended to a union of subsets, the iterative procedure of selecting subsets of columns with low approximation errors and combining them is justified. In the next section, we introduce BoostNyström based on this result.

4.3 Proposed Method

We first introduce the Ensemble Nyström approximation method, which is used in the final iteration of BoostNyström, and then introduce the proposed BoostNyström algorithm.

4.3.1 Ensemble Nyström Method

The Ensemble Nyström algorithm [27] builds an improved approximation by sampling, uniformly and without replacement, a set S of mp (p > 1) columns from the matrix K. S is decomposed into p subsets S_1, ..., S_p. Each subset S_r (r ∈ [1, p]) contains m columns and is used to obtain a rank-k Nyström approximation K̃_r, which can be written as K̃_r = E_r W_r^+ E_r^T, where E_r and W_r denote the matrices formed by the columns of S_r. The general form of the approximation of K generated by the Ensemble Nyström algorithm is:

\tilde{K}^{ens} = \sum_{r=1}^{p} \mu_r \tilde{K}_r    (4.13)

The mixture weights μ_r can be defined in different ways. One choice is to assign equal weight to each base approximation, μ_r = 1/p (1 ≤ r ≤ p). Another choice is the exponential weight method, which first measures the approximation error ε̂_r of the rth base approximation and then computes the mixture weight as μ_r = exp(-η ε̂_r)/Z, where Z normalizes the weights so that they sum to 1.

4.3.2 BoostNyström Method

BoostNyström is an iterative algorithm that extends the Ensemble Nyström method by greedily selecting the best sampled subsets and adaptively updating the sampling distribution of the remaining columns at each iteration. Let T be the total number of iterations. First, a set of sampled columns S is built up over the T iterations; then the Ensemble Nyström method is applied to S.

At iteration t (1 ≤ t ≤ T), m(p - t + 1) columns are sampled according to the probability distribution P_t. Initially P_1 is uniform and a set of mp columns is sampled. The sampled columns are decomposed into mp/d subsets of size d, and a standard Nyström approximation is computed on each. The mp/d subsets are then sorted in ascending order of their approximation errors on the validation set V. The m columns contained in the m/d subsets with the lowest approximation errors are added to S. The next m/d subsets, which have the second lowest approximation errors, are given higher probability in the next iteration, and the probability distribution over the remaining columns is then renormalized so that it sums to 1.
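The iteration just described can be sketched as follows; subsequent iterations reuse the same routine with fewer sampled columns, as described next. This is a simplified illustration rather than the reference implementation: it reuses the hypothetical `nystrom` helper sketched in Section 4.2.1, it assumes d divides both m and the number of sampled columns, and it measures the validation error as the Frobenius error restricted to the validation columns, which is one possible choice.

```python
import numpy as np

def boost_iteration(K, P, S, m, p, t, d, k, V_idx, nystrom):
    """One adaptive iteration of BoostNystrom's column selection (sketch).
    P     : current sampling probabilities over the n columns (sums to 1)
    S     : list of column indices selected in previous iterations
    V_idx : validation columns; a subset's error is the Frobenius error of its
            approximation restricted to these columns (an assumed choice).
    Returns the updated (P, S)."""
    n = K.shape[0]
    cand = np.random.choice(n, size=m * (p - t + 1), replace=False, p=P)
    subsets = cand.reshape(-1, d)                         # subsets of size d
    KV = K[np.ix_(V_idx, V_idx)]
    errs = np.array([np.linalg.norm(KV - nystrom(K, sub, k)[np.ix_(V_idx, V_idx)], 'fro')
                     for sub in subsets])
    order = np.argsort(errs)
    best = subsets[order[: m // d]].ravel()               # m best columns go into S
    second = subsets[order[m // d: 2 * m // d]].ravel()   # next-best columns get boosted
    S = S + best.tolist()

    # Adaptive probability update (cf. Algorithm 8 below): boost the next-best
    # columns, zero out the selected ones, then renormalize.
    ratio = errs.mean() / errs[order[: m // d]].mean()
    P = P.copy()
    P[second] *= ratio
    P[best] = 0.0
    return P / P.sum(), S
```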
At the second iteration, m fewer columns are sampled than in the first, and the same process is repeated until iteration T. Figure 4.1 illustrates how S is generated through the T iterations.

After T iterations, an Ensemble Nyström approximation is run on S, which is decomposed into p subsets of size m, using exponential mixture weights. Let μ_j (1 ≤ j ≤ p) be the mixture weights in (4.13) at iteration T, and let η > 0 be the parameter used to compute the mixture weights. In the algorithm, we set p = T.

The complete algorithm is described in Algorithm 7. BoostNyström uses two procedures, update probability and build mixture weight, described in Algorithm 8 and Algorithm 9 respectively.

Algorithm 7 BoostNyström
1. Set t = 1 and let the initial probability distribution P_1 be uniform.
2. Sample a set V of m columns as the validation set.
3. Sample a set of m(p - t + 1) columns from K, without replacement, according to the probability distribution P_t.
4. Decompose the sampled columns into m(p - t + 1)/d subsets S_1, S_2, ..., S_{m(p-t+1)/d}, each containing d columns.
5. Obtain the rank-k Nyström approximation K̃_r on each S_r and calculate its approximation error ε_r on V.
6. Sort the errors ε_r (r = 1, 2, ..., m(p - t + 1)/d), and the subsets S_(r) accordingly, in ascending order, and add the subsets S_(r) with 1 ≤ r ≤ m/d to S.
7. If t < T, construct P_{t+1} from the errors ε_r using Algorithm 8.
8. If t = T, build an Ensemble Nyström approximation on the samples in S, with subsets of size m and mixture weights μ_j (j = 1, 2, ..., p) constructed using Algorithm 9.
9. Set t = t + 1. If t > T, stop; otherwise, go to Step 3.

Algorithm 8 update probability
1. Assume the errors ε_(r) (r = 1, 2, ..., m(p - t + 1)/d) are sorted in ascending order. Compute the average approximation error ε_avg = (d / (m(p - t + 1))) \sum_{r=1}^{m(p-t+1)/d} ε_(r).
2. Find the m/d subsets S_(r) (r = 1, 2, ..., m/d) with the smallest approximation errors and calculate ε_min = (d/m) \sum_{r=1}^{m/d} ε_(r).
3. Calculate ratio = ε_avg / ε_min. Multiply the probabilities of the columns in the subsets S_(r) (r = m/d + 1, ..., 2m/d) by this ratio.
4. Set the probabilities of the columns in the subsets S_(r) (r = 1, 2, ..., m/d) to 0.
5. Normalize P_{t+1} so that \sum_{i=1}^{n} P_{t+1,i} = 1.

Algorithm 9 build mixture weight
1. Calculate ε̂_j (j = 1, 2, ..., p) as the approximation error on V of the jth subset in S.
2. Define μ_j = exp(-η ε̂_j) (j = 1, 2, ..., p).
3. Compute Z = \sum_{j=1}^{p} μ_j and normalize μ_j = μ_j / Z.

Figure 4.1: The BoostNyström algorithm.
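The final combination step (Step 8 of Algorithm 7, using Algorithm 9) can be sketched in the same style as the previous snippets. Again this is an illustration under the same assumptions (the hypothetical `nystrom` helper from Section 4.2.1 and a validation error measured on the validation columns), not code from the thesis:

```python
import numpy as np

def ensemble_combine(K, S, m, k, V_idx, eta, nystrom):
    """Ensemble Nystrom approximation over the selected columns S (Eq. (4.13))
    with exponential mixture weights (Algorithm 9).  len(S) = m * p."""
    subsets = np.array(S).reshape(-1, m)            # p subsets of size m
    KV = K[np.ix_(V_idx, V_idx)]
    approxs, errs = [], []
    for sub in subsets:
        K_r = nystrom(K, sub, k)
        approxs.append(K_r)
        errs.append(np.linalg.norm(KV - K_r[np.ix_(V_idx, V_idx)], 'fro'))
    mu = np.exp(-eta * np.array(errs))
    mu /= mu.sum()                                  # mixture weights sum to 1
    return sum(w * K_r for w, K_r in zip(mu, approxs))
```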
4.3.3 Complexity Analysis

The time complexity of performing a standard Nyström approximation on a subset of size d is O(d^3 + nd^2). Since there are at most mp/d such subsets in each iteration, the total complexity of the standard Nyström approximations is O(mpd^2 + nmpd). Additional costs include sorting the approximation errors, C_s = O((mp/d) log(mp/d)), updating the probability distribution, C_u = O(n), and computing the Ensemble Nyström approximation at the final iteration, C_e = O(pm^3 + pm^2 n). The time complexity of the iterations of BoostNyström is therefore O(T(mpd^2 + nmpd + C_s + C_u) + C_e), whose dominant term is O(Tnmpd), so the total time complexity of BoostNyström is O(Tnmpd + pm^2 n). Since usually Td ≪ n and mp ≪ n, this is much smaller than the O(n^3) complexity typically encountered in kernel methods.

Moreover, since the approximation errors of the subsets S_r (r = 1, 2, ..., m(p - t + 1)/d) are computed independently within each iteration, they can be computed in parallel, in which case the time complexity of BoostNyström becomes O(Tnd^2 + pm^2 n).

The space requirement is the same as that of a standard Nyström approximation on m columns, O(nm + m^2), which is linear in the data size.

4.3.4 Error Bound

To derive an error bound for BoostNyström, we consider a simplified model. Assume that among the mp columns of S there exist p̂ (1 ≤ p̂ < p) and ε̂ (ε̂ > 0) such that ‖K - K̃_{r_1}‖ < ε̂ < ‖K - K̃_{r_2}‖ for all r_1, r_2 with 1 ≤ r_1 ≤ p̂ and p̂ + 1 ≤ r_2 ≤ p, where ε̂ is the approximation error ‖K - K̃_{ot}‖_2 and K̃_{ot} is the standard Nyström approximation obtained from m sampled columns S_{ot} other than the mp columns in S. Because the adaptive step of BoostNyström selects the best subsets of the sampled columns at each iteration, with high probability there exist sampled columns S_{ot} that satisfy this condition.

The proof of the error bound for BoostNyström uses the generalization of McDiarmid's concentration bound to sampling without replacement [29] [10] and the error bounds for the Ensemble Nyström method first introduced in [27]. We denote the Ensemble Nyström error bounds by B^{ens}_2 for the spectral norm and B^{ens}_F for the Frobenius norm. We denote by K_max the maximum diagonal entry of K, K_max = max_i K_{ii}, and by d^{max}_K the distance max_{ij} \sqrt{K_{ii} + K_{jj} - 2K_{ij}}. We define the selection matrix corresponding to a sample of mp columns as the matrix L ∈ R^{n×mp} with L_{ij} = 1 if the ith column of K is among those sampled and L_{ij} = 0 otherwise.

Theorem. Let S be a sample of pm columns decomposed into p subsets of size m, S_1, ..., S_p, at iteration T. For r ∈ [1, p], let K̃_r denote the rank-k Nyström approximation of K based on the sample S_r, and let K_{(k)} denote the best rank-k approximation of K. Assume there exist p̂ (1 ≤ p̂ < p) and ε̂ (ε̂ > 0) such that ‖K - K̃_{r_1}‖ < ε̂ < ‖K - K̃_{r_2}‖ for 1 ≤ r_1 ≤ p̂ and p̂ + 1 ≤ r_2 ≤ p, where ε̂ is the approximation error ‖K - K̃_{ot}‖_2 and K̃_{ot} is a standard Nyström approximation obtained from S_{ot}, a set of m sampled columns other than the mp columns in S. Then, with probability at least 1 - δ, the following inequalities hold for any set of samples S of size mp and S_{ot} of size m satisfying the above assumption, for any μ in the simplex Δ = {μ ∈ R^p : μ ≥ 0, \sum_{r=1}^{p} μ_r = 1} and for K̃_{boost} = \sum_{r=1}^{p} μ_r K̃_r:

\| K - \tilde{K}_{boost} \|_2 < B^{ens}_2    (4.14)
\| K - \tilde{K}_{boost} \|_F < B^{ens}_F    (4.15)

Proof: We replace the subset S_1 with S_{ot}, the subset of m sampled columns other than S_1, S_2, ..., S_p. By the definition of K̃_{boost}, the following holds:

\| K - \tilde{K}_{boost} \|_2 = \Big\| \sum_{r=1}^{p} \mu_r (K - \tilde{K}_r) \Big\|_2 \le \sum_{r=1}^{p} \mu_r \| K - \tilde{K}_r \|_2
< \mu_1 \| K - \tilde{K}_{ot} \|_2 + \sum_{r=2}^{p} \mu_r \| K - \tilde{K}_r \|_2
\le \mu_1 \big( \| K - K_{(k)} \|_2 + 2 \| XX^T - Z_{ot} Z_{ot}^T \|_2 \big) + \sum_{r=2}^{p} \mu_r \big( \| K - K_{(k)} \|_2 + 2 \| XX^T - Z_r Z_r^T \|_2 \big)
= \| K - K_{(k)} \|_2 + 2 \mu_1 \| XX^T - Z_{ot} Z_{ot}^T \|_2 + \sum_{r=2}^{p} 2 \mu_r \| XX^T - Z_r Z_r^T \|_2    (4.16)

where Z_r = \sqrt{n/m}\, X L_r and L_r is the selection matrix for S_r, and Z_{ot} = \sqrt{n/m}\, X L_{ot} with L_{ot} the selection matrix for S_{ot}. Let S̄ denote the sampled set S_{ot}, S_2, ..., S_p, and let S̄' be a sampled set that differs from S̄ in one column. Let Φ(S̄) = μ_1 ‖XX^T - Z_{ot} Z_{ot}^T‖_2 + \sum_{r=2}^{p} μ_r ‖XX^T - Z_r Z_r^T‖_2. Then the following inequality holds:

| \Phi(\bar{S}') - \Phi(\bar{S}) | \le \frac{2n}{m} \mu_{max} d^{max}_K K_{max}^{1/2}    (4.17)

where μ_max = max_{i=1,...,p} μ_i. The expectation of Φ(S̄) is bounded by:

E[\Phi(\bar{S})] \le \frac{n}{\sqrt{m}} K_{max}    (4.18)

Inequalities (4.17) and (4.18) are derived in the same way as in [27].
Plugging (4.17) and (4.18) into the generalization of McDiarmid's concentration bound gives B^{ens}_2 as an upper bound for (4.16), so ‖K - K̃_{boost}‖_2 is strictly less than B^{ens}_2. In the same fashion, we obtain the Frobenius error bound: ‖K - K̃_{boost}‖_F is strictly less than B^{ens}_F.

[27] shows that the error bound of the Ensemble Nyström method is lower than that of the standard Nyström approximation, and we have shown that the error bound of BoostNyström is strictly lower than that of the Ensemble Nyström method. Thus, BoostNyström is a better kernel matrix approximation tool than these state-of-the-art algorithms. Its superior performance is demonstrated experimentally in the following section.

4.4 Experiments

In this section, we present experimental results that illustrate the performance of the BoostNyström method. We work with the data sets listed in Table 4.1. Throughout the experiments, we measure the accuracy of a low-rank approximation K̃ by the relative error:

\%error = \frac{\| K - \tilde{K} \|_F}{\| K \|_F} \times 100    (4.19)

For each data set, we set the number of sampled columns in the standard Nyström approximation to m = 0.03n, where n is the total number of data points; the values of d are shown in Figure 4.3. The rank k is set to the rank of W in (4.1). The number of base approximations p was varied from 2 to 20, and T = p. We sampled an additional set of m columns as the validation set V, on which the approximation error of each subset is computed; V is kept fixed as p varies.

Table 4.1: Properties of the data sets; n and dim are the number of data points and features, respectively.

  Dataset            Type of data            n      dim
  MNIST-10K [a]      Handwritten digits      10000  784
  S. Cerevisiae [b]  Proteins                4728   16
  Abalone [c]        Physical measurements   4177   8
  Dexter [c]         Documents               2000   20000
  Isolet [c]         Voice                   6238   617
  USPS [d]           Handwritten digits      7291   256

  [a] http://yann.lecun.com/exdb/mnist
  [b] S. Cerevisiae is from [20]
  [c] http://archive.ics.uci.edu/ml/index.html
  [d] USPS is from [21]

We run BoostNyström on each data set, using the columns sampled by the Ensemble Nyström approximation as its initial samples. The average (Frobenius) error of BoostNyström is shown in Figure 4.3. It is compared to the following well-known Nyström methods: (a) the standard Nyström approximation with uniform sampling of m columns without replacement; (b) the Ensemble Nyström approximation with uniform sampling of mp columns without replacement; (c) the K-means based Nyström method [57]; (d) the adaptive sampling method of [28]; and (e) the deterministic method of [3]. The average errors over p approximations are reported for (a), (c) and (d).

Figure 4.3 shows that BoostNyström with large p outperforms all other algorithms on every data set except Abalone, where the K-means based and adaptive methods are better. However, the K-means based method is unstable: it achieves a good error rate on Abalone but is much worse than BoostNyström on Isolet and S. Cerevisiae. The adaptive algorithm is also unstable, with the best error rate on Abalone but poor performance on S. Cerevisiae and USPS. We do not plot the error curve of the K-means based method in Figures 4.3(b) and 4.3(e), nor that of the deterministic method in Figure 4.3(f), because their errors are so large that they would make the other curves indistinguishable; the error rate of the deterministic method on USPS is 9.93%. All curves for the S. Cerevisiae and Isolet data sets are shown in Figure 4.2.
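For illustration only, the sketches from the previous sections can be chained into a toy end-to-end run that reports the relative error of (4.19). The data, the parameter values and the helpers (nystrom, boost_iteration, ensemble_combine) are the assumed ones from those sketches, not the data sets or settings used in the experiments above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 1000))          # toy dim x n data matrix
K = X.T @ X
n = K.shape[0]
m, p, d, k, eta = 30, 4, 10, 10, 1.0         # illustrative values only
T = p
V_idx = rng.choice(n, size=m, replace=False) # validation columns

P, S = np.full(n, 1.0 / n), []
for t in range(1, T + 1):
    P, S = boost_iteration(K, P, S, m, p, t, d, k, V_idx, nystrom)
K_boost = ensemble_combine(K, S, m, k, V_idx, eta, nystrom)
print(100 * np.linalg.norm(K - K_boost, 'fro') / np.linalg.norm(K, 'fro'))
```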
Figure 4.2: Percent error for the S. Cerevisiae and Isolet data sets ((a) S. Cerevisiae, (b) Isolet).

Figure 4.3: Percent error in Frobenius norm for the base Nyström method, the Ensemble Nyström method, BoostNyström, the K-means based Nyström approximation, the adaptive Nyström approximation and the deterministic method ((a) MNIST, (b) S. Cerevisiae, (c) Abalone, (d) Dexter, (e) Isolet, (f) USPS).

4.5 Conclusions and Future Work

We have proposed a new perspective on the Nyström approximation, which relates the quality of the approximation to the subspace spanned by the sampled columns. Based on this perspective, we proposed BoostNyström, a novel adaptive Nyström approximation method. The BoostNyström algorithm is justified by this perspective and has an error bound guaranteed to be lower than that of the Ensemble Nyström method. The experimental results show the effectiveness of BoostNyström. In the future, BoostNyström can be applied to graph-based learning algorithms, such as Graph-OCL.

Chapter 5 General Conclusions

This dissertation has focused on three areas: (1) supervised learning of high dimensional data; (2) graph-based one class learning; and (3) the Nyström method for large data sizes. Each of these areas has extensive practical applications, and we have developed state-of-the-art algorithms for each problem.

For supervised learning of high dimensional data, the novel BDLDA algorithm is improved and applied to gene expression data. BDLDA balances the bias-variance tradeoff and provides superior performance compared with other classifiers. Treelets is then used to transform the data before BDLDA, as a step that concentrates more energy near the diagonal of the covariance matrix and thus further improves the performance of BDLDA.

For the one class learning problem, we have used a graph-based ranking method as the first step (identifying outliers) of the algorithm. As far as we know, this is the first time a graph-based method has been used in one class learning. One important advantage of the proposed algorithm is its selection of the constant parameter, which frees us from time-consuming model selection procedures such as cross validation. This dissertation has provided a theoretical proof supporting this parameter selection.

The novel adaptive Nyström method is proposed mainly to address the scalability issues of graph-based algorithms. We have proposed a new perspective on the Nyström approximation, which relates the quality of the approximation to the subspace spanned by the sampled columns, and based on it we have proposed BoostNyström, a novel adaptive Nyström approximation method whose effectiveness is demonstrated experimentally.

In the future, efforts could be made to reduce the complexity of BoostNyström so that it is more practical in real use, and the BoostNyström method could be applied to the graph-based one class learning algorithm to make it scalable to large sample sizes.
Chapter 6 Publications

Lingyan Sheng, Antonio Ortega, Roger Pique-Regi, and Shahab Asgharzadeh, "Treelets as feature transformation tool for block diagonal linear discrimination," in Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics, 2009.

Lingyan Sheng and Antonio Ortega, "Graph-based partially supervised learning of documents," in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, 2011.

Lingyan Sheng and Antonio Ortega, "Pixel prediction by context based regression," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.

Lingyan Sheng and Antonio Ortega, "A novel adaptive Nyström approximation," in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, 2012.

Lingyan Sheng and Antonio Ortega, "Analysis of graph-based ranking and application to one class learning," in preparation for IEEE Transactions on Pattern Analysis and Machine Intelligence.

References

[1] Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling techniques for kernel methods. In Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference, pages 335–342. MIT Press, 2001.

[2] Yossi Azar, Amos Fiat, Anna Karlin, Frank McSherry, and Jared Saia. Spectral analysis of data. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, STOC '01, pages 619–626, New York, NY, USA, 2001. ACM.

[3] Mohamed-Ali Belabbas and Patrick J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences of the USA, 106(2):369–374, January 2009.

[4] Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning on large graphs. In COLT, volume 3120, pages 624–638, 2004.

[5] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, 2006.

[6] Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In Proceedings of the 7th International Conference on Database Theory, ICDT '99, pages 217–235, London, UK, 1999. Springer-Verlag.

[7] Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 19–26, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[8] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, pages 92–100, New York, NY, USA, 1998. ACM.

[9] Anthony Brew, Marco Grimaldi, and Pádraig Cunningham. An evaluation of one-class classification techniques for speaker verification. Artificial Intelligence Review, 27:295–307, 2007.

[10] Corinna Cortes, Mehryar Mohri, Dmitry Pechyony, and Ashish Rastogi. Stability of transductive regression algorithms. In Proceedings of the 25th International Conference on Machine Learning, pages 176–183, 2008.

[11] Corinna Cortes and Vladimir Vapnik. Support vector networks. Machine Learning, 20(3):273–297, 1995.

[12] Ramon Diaz-Uriarte and Sara Alvarez de Andres. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7:3, 2006.

[13] S. Dudoit, J. Fridlyand, and T. P. Speed.
Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87, 2002.

[14] Dinesh Singh et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, March 2002.

[15] U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A, 96(12):6745–6750, June 1999.

[16] Ahmed Farahat, Ali Ghodsi, and Mohamed Kamel. A novel greedy algorithm for Nyström approximation. Journal of Machine Learning Research, 15:269–277, 2011.

[17] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2:243–264, 2002.

[18] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.

[19] Y. Guo, T. Hastie, and R. Tibshirani. Regularized linear discriminant analysis and its application in microarrays. Biostatistics, pages 86–100, January 2007.

[20] Adam M. Gustafson, Evan S. Snitkin, Stephen C. J. Parker, Charles DeLisi, and Simon Kasif. Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics, 7:265, 2006.

[21] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[22] Thorsten Joachims. Making large-scale support vector machine learning practical, pages 169–184. MIT Press, Cambridge, MA, USA, 1999.

[23] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 200–209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[24] Thorsten Joachims. Transductive learning via spectral graph partitioning. In ICML, pages 290–297, 2003.

[25] I. T. Jolliffe. Principal Component Analysis. Springer, 2002.

[26] Mark W. Koch, Mary M. Moya, Larry D. Hostetler, and R. Joseph Fogler. Cueing, feature discovery, and one-class learning for synthetic aperture radar automatic target recognition. Neural Netw., 8(7-8):1081–1102, 1995.

[27] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Ensemble Nyström method. Advances in Neural Information Processing Systems, 5:1–9, 2009.

[28] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate spectral decomposition. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 553–560, 2009.

[29] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the Nyström method. Journal of Machine Learning Research - Proceedings Track, 5:304–311, 2009.

[30] Ken Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Machine Learning Conference (ICML 95), 1995.

[31] Ann B. Lee, Boaz Nadler, and Larry Wasserman. Treelets - an adaptive multi-scale basis for sparse unordered data. Annals of Applied Statistics, 2(2), 2008.

[32] Xiaoli Li and Bing Liu. Learning to classify texts using positive and unlabeled data. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 587–592, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc.

[33] Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. Partially supervised classification of text documents. In ICML '02, pages 387–394, 2002.

[34] Yvette Mallet, Danny Coomans, Jerry Kautsky, and Olivier De Vel.
Classification using adaptive wavelets for feature extraction. IEEE Trans. Pattern Anal. Mach. Intell., 19:1058–1066, 1997.

[35] Andrew McCallum and Kamal Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press, 1998.

[36] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 1997.

[37] Tao Peng, Wanli Zuo, and Fengling He. SVM based adaptive learning method for text classification from positive and unlabeled documents. Knowl. Inf. Syst., 16:281–301, 2008.

[38] R. Pique-Regi and A. Ortega. Block diagonal linear discriminant analysis with sequential embedded feature selection. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2006, volume 5, pages V–V, 2006.

[39] R. Pique-Regi, A. Ortega, and S. Asgharzadeh. Sequential diagonal linear discriminant analysis (SeqDLDA) for microarray classification and gene identification. In CSBW, pages 112–116, 2005.

[40] John C. Platt. Fast embedding of sparse music similarity graphs. In Advances in Neural Information Processing Systems. MIT Press, 2003.

[41] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[42] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000.

[43] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10:1299–1319, 1998.

[44] W. M. Shaw, Jr. On the foundation of evaluation. Journal of the American Society for Information Science, 37(5):346–348, 1986.

[45] Lingyan Sheng and Antonio Ortega. Graph based partially supervised learning of documents. In IEEE International Workshop on Machine Learning for Signal Processing, 2011.

[46] Lingyan Sheng and Antonio Ortega. A novel adaptive Nyström approximation. In IEEE International Workshop on Machine Learning for Signal Processing, 2012.

[47] Lingyan Sheng, Antonio Ortega, Roger Pique-Regi, and Shahab Asgharzadeh. Treelets as feature transformation tool for block diagonal linear discrimination. In IEEE International Workshop on Genomic Signal Processing and Statistics, 2009.

[48] Lingyan Sheng, Roger Pique-Regi, Shahab Asgharzadeh, and Antonio Ortega. Microarray classification using block diagonal linear discriminant analysis with embedded feature selection. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.

[49] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 824–831, New York, NY, USA, 2005. ACM.

[50] Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 911–918, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[51] Martin Szummer and Tommi Jaakkola. Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems, pages 945–952. MIT Press, 2002.

[52] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A, 99(10):6567–6572, May 2002.

[53] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 2nd edition, 2000.

[54] Wei Wang and Jiong Yang. Mining high-dimensional data.
In Data Mining and Knowledge Discovery Handbook, pages 793–799. Springer US, 2005.

[55] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

[56] Hwanjo Yu, Jiawei Han, and Kevin Chen-Chuan Chang. PEBL: Web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16:70–81, 2004.

[57] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning, pages 1232–1239, 2008.

[58] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 321–328. MIT Press, 2004.

[59] Dengyong Zhou and Bernhard Schölkopf. Learning from labeled and unlabeled data using random walks. In Pattern Recognition, Proceedings of the 26th DAGM Symposium, pages 237–244. Springer, 2004.

[60] Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf. Ranking on data manifolds. In Advances in Neural Information Processing Systems 16. MIT Press, 2003.

[61] Xueyuan Zhou, Mikhail Belkin, and Nathan Srebro. An iterated graph Laplacian approach for ranking on manifolds. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 877–885, New York, NY, USA, 2011. ACM.

[62] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.

[63] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.
Abstract
Supervised learning is the machine learning task of inferring a function from labeled training data. Numerous algorithms have been proposed for supervised learning, such as linear discriminant analysis (LDA), support vector machines (SVM), and decision trees. However, most of them are not able to handle an increasingly common type of data, high dimensional data, such as gene expression data, text documents, and MRI images; this difficulty is often called the curse of dimensionality. Our solution to this problem is an improvement to LDA that imposes a regularized structure on the covariance matrix, so that it becomes block diagonal while feature reduction is performed. The improved method, which we call block diagonal linear discriminant analysis (BDLDA), effectively exploits the off-diagonal information in the covariance matrix without huge computation and memory requirements. BDLDA is further improved by using treelets as a preprocessing tool. Treelets, by transforming the original data through successive local PCA, concentrates more energy near the diagonal entries of the covariance matrix, and thus achieves even better accuracy than BDLDA alone.

Supervised learning requires labeled information for all classes. However, since labeled data is often more difficult to obtain than unlabeled data, there is increasing interest in a special form of learning, namely one class learning. In one class learning, the training set has samples of only one class, and the goal is to distinguish that class from all other samples. We propose a one class learning algorithm, Graph-One Class Learning (Graph-OCL). Graph-OCL is a two step strategy: we first identify reliable negative samples, and then classify the samples based on the labeled data and the negative samples identified in the first step. The main novelty is the first step, in which graph-based ranking by learning with local and global consistency (LGC) is used. Graph-based ranking is particularly accurate if the samples and their similarities are well represented by a graph. We also theoretically prove that there is a simple method to select a constant parameter α for LGC, thus eliminating the need for model selection by time-consuming validation.

Graph-based methods usually scale badly as a function of the sample size. This can be addressed by the Nyström approximation, which samples a few columns to represent the affinity matrix. We propose a new method, BoostNyström, which adaptively samples a subset of columns at each iterative step and updates the sampling probability for the next iteration. The algorithm is based on a novel perspective that relates the quality of the Nyström approximation to the subspace spanned by the sampled columns. BoostNyström can potentially be applied to Graph-OCL to address the problem of large data size.