AUTOMATED CLASSIFICATION ALGORITHMS FOR HIGH DIMENSIONAL DATA

Copyright 2000 by David Weidemann

A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE (APPLIED MATHEMATICS)

December 2000

UNIVERSITY OF SOUTHERN CALIFORNIA
The Graduate School
University Park
Los Angeles, California 90089-1695

This thesis, written by David Weidemann under the direction of his Thesis Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of MASTER OF SCIENCE (Applied Mathematics).

Dean of Graduate Studies
Date: December 18, 2000

For my family.

Acknowledgements

I would like to thank my advisor, Dr. Chunming Wang, for assisting me in developing this thesis. His directions and suggestions helped me very much, and I could always rely on his full interest and support. I also want to thank my two other thesis committee members, Dr. Gary Rosen and Dr. Herbert Dawid, for their time and help during the last months. Finally, I am grateful to Dr. Philip Swain from Purdue University, who gave me great inspiration by providing me with a copy of his paper "Statistical Methods and Neural Network Approaches for Classification of Data from Multiple Sources".

Abstract

This thesis deals with the mathematical problem of automated classification. The goal is to identify classes in a data set on which no a priori knowledge is available. The work focuses on the evaluation of five automated classification algorithms. Their performances are analyzed on specifically designed data sets. The results for these algorithms are benchmarked against each other, and strengths and weaknesses are explained. When a priori information about the existing classes in the data is available, algorithms using learning samples are an interesting alternative to automated classification procedures. Two algorithms that classify data by the use of learning samples are presented and their performances are compared to the previous test results. Further, the effect of using different distance measures on the automatic classification algorithms is discussed. These alternative distance functions allow the classification algorithms to overlook trivial differences in the data objects such as constant offsets or uniform shifts.
Contents

Dedication
Acknowledgements
Abstract
List of Tables
List of Figures

1 Introduction
  1.1 The initial problem
  1.2 One concrete research application
2 Literature overview
3 The data sets
  3.1 Data Set 1
  3.2 Data Set 2
  3.3 Data Set 3
  3.4 Data Set 4
  3.5 Data Set 5
4 The different classification algorithms
  4.1 General ideas of the testing procedures
  4.2 Restrictions on the tests
  4.3 The standard algorithms
    4.3.1 Algorithm 1
      4.3.1.1 General description
      4.3.1.2 Strengths and weaknesses
    4.3.2 Algorithm 2
      4.3.2.1 General description
      4.3.2.2 Strengths and weaknesses
    4.3.3 Algorithm 3
      4.3.3.1 General description
      4.3.3.2 Strengths and weaknesses
    4.3.4 Algorithm 4
      4.3.4.1 General description
      4.3.4.2 Strengths and weaknesses
    4.3.5 Algorithm 5
      4.3.5.1 General description
      4.3.5.2 Strengths and weaknesses
  4.4 The learning sample algorithms
    4.4.1 Reasons for using learning samples
    4.4.2 Algorithm 1
      4.4.2.1 General description
      4.4.2.2 Strengths and weaknesses
    4.4.3 Algorithm 2
      4.4.3.1 General description
      4.4.3.2 Strengths and weaknesses
5 More general cases
  5.1 Non-uniform class sizes
    5.1.1 One small class and two large classes
    5.1.2 Two small classes and one large class
  5.2 Other distance measures
    5.2.1 Reasons for using other distance measures
    5.2.2 Discussion of possible approaches
      5.2.2.1 Vertical shifts
      5.2.2.2 Horizontal shifts
      5.2.2.3 Scalings
    5.2.3 Test results
  5.3 Varying noise levels
6 Final comments
References
Appendix
  A MATLAB code of standard algorithm 1
  B MATLAB code of standard algorithm 4
  C MATLAB code of standard algorithm 5
  D MATLAB code of learning sample algorithm 1
  E MATLAB code of learning sample algorithm 2
  F Collection of test results

List of Tables

5.1 Results for the analysis of 2 large classes vs. 1 small class
5.2 Results for the analysis of 2 small classes vs. 1 large class
5.3 Test results for the analysis of different distance measures

List of Figures

3.1 Class Reference Functions of Data Set 1
3.2 Class Reference Functions of Data Set 2
3.3 Class Reference Functions of Data Set 3
3.4 Class Reference Functions of Data Set 4
3.5 Class Reference Functions of Data Set 5
4.1 Time Analysis for Algorithm 1
4.2 Accuracy Analysis for Algorithm 1
4.3 Accuracy Analysis for Algorithm 3
4.4 Time Analysis for Algorithm 3
4.5 Accuracy Analysis for Algorithm 4
4.6 Time Analysis for Algorithm 4
4.7 Accuracy Analysis for Algorithm 5
4.8 Time Analysis for Algorithm 5
4.9 Average Accuracy Analysis
5.1 Accuracy for different Noise Levels (based on Data Set 2)
6.1 Overall Evaluation of the Standard Algorithms

1 Introduction

1.1 The initial problem

Do you know what the main job of a dating agency is? Such a company continuously receives data from applicants; most likely they have filled out a survey with personal information. By gathering all this data, the agency builds up a database of its clients.
The database includes many pieces of information about each person - things like age, height, weight, eye color, hair color, home town, hobbies and many more. The agency's business is to match two people together as a couple. This choice is largely based on the database information. For smaller agencies with a few hundred clients, it can easily be made by hand. But what if the company works nationwide and has collected several thousand surveys? In this case, it is surely desirable for the agents to have tools that can assist them in doing some sort of initial sorting. The outcome of this process should be a small list of people with similar characteristics, and from this list couples can be matched together.

This is one example of a practical situation where automated classification is important. If the agency's database could be classified into a certain number of small groups that consist of people with similar attributes, the employees would save a lot of time because they would not have to search the whole database manually. In this context, it is clear that the number of classes is not known in advance and might actually vary from database to database.

Another example could be a problem about stock investments. Besides its price, there are many other characteristics available for each stock and the respective company. Examples are market capitalization, volatility, price-earnings (P/E) ratio, trading ranges, ROI and current ratio - to name only a few. The number of such attributes is large enough that it is really hard to check all of them manually if an investor wants to pick an investment. So what is his concrete problem? He might have certain preferences for his portfolio, and the task is to find or generate a list of companies which fulfill his criteria so that he can intelligently pick out his investments. But how can he generate this list? If he had access to a database that includes up-to-date information on a very large number of stocks, he could simply divide these companies into classes that share specific characteristics. But he neither knows the number of classes in advance nor has an idea of the underlying data structure. Once more, an automated classification of the database is a very effective tool to provide the results he is looking for.

In a more mathematical sense, the corresponding problem can be described in the following way: Assume that a certain number of data vectors (say $x_1, \dots, x_m$) is given. Each single vector $x_i$ is n-dimensional, i.e. $x_i \in \mathbb{R}^n$. Therefore n stands for the number of dimensions and m represents the size of the data set. For the general case, we also assume that there is no a priori knowledge available on the data, so we neither know the number of classes nor their structure in advance. The data itself could come from any source; the only restriction is that all vectors have the same dimension.

The problem is to classify the given data set by using mathematical models and approaches. In this context, a complete classification includes identifying the number of data classes as well as assigning each data vector to an appropriate class.

In the research for this thesis, sample data sets were constructed by evaluating functions on a variable, uniformly spaced grid that covers the interval [-100,100]. In order to obtain different classes, the data vectors were based on different reference functions. In particular, let $f_k$ be the underlying reference function for the k-th data class. Then each data vector $x_i = (x_{i,1}, \dots, x_{i,n})$ for this class is constructed by

$x_{i,j} = f_k(g_j) + \varepsilon_{i,j} \qquad (j = 1, \dots, n)$

In this definition, $g_j$ represents the j-th grid point and $\varepsilon_{i,j}$ stands for the random noise that is added (the $\varepsilon_{i,j}$ are i.i.d.). A short MATLAB sketch of this construction is given below.
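As a small illustration, the following MATLAB sketch generates one such data set. It is not the code from the appendix; the grid size, class size, noise level and the three reference functions used here are placeholders in the spirit of the data sets described in chapter 3.

    % Minimal sketch of the data construction x_ij = f_k(g_j) + eps_ij.
    % All concrete values below (grid size, class size, noise level and
    % the reference functions) are illustrative placeholders.
    n = 50;                         % grid size = number of dimensions
    s = 100;                        % class size (vectors per class)
    sigma = 40;                     % noise level
    g = linspace(-100, 100, n);     % uniformly spaced grid on [-100, 100]

    f = {@(x) 50*ones(size(x)), ... % a constant reference function
         @(x) 100*cos(x/20), ...    % a periodic reference function
         @(x) x.^2/100 + 100};      % a parabola

    X = [];                         % data matrix, one data vector per row
    labels = [];                    % true class index of each vector
    for k = 1:numel(f)
        Xk = repmat(f{k}(g), s, 1) + sigma*randn(s, n);  % add N(0,sigma) noise
        X = [X; Xk];
        labels = [labels; k*ones(s, 1)];
    end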
1.2 One concrete research application

One research application for the approach presented above is the automated classification of errors that occur in radio occultation measurements. In recent years, scientists from the Jet Propulsion Laboratory (situated in Pasadena, California) and a research team from the University of Southern California have worked together on models that involve the use of occultation data. The data is continuously received by 24 GPS satellites that circle the earth in different orbital planes. GPS is an abbreviation for "Global Positioning System" and was originally designed as a navigation aid for the U.S. Air Force. Nowadays this technology is more and more used for civil purposes, for example the measurement of atmospheric parameters. Examples of such parameters are air pressure, temperature and humidity.

One of the goals that the scientists have is to develop a mathematical model to simulate and predict the atmospheric parameters based on available observations. One key factor in the prediction process is the analysis of errors that are obtained from the comparison of past predictions with the respective GPS measurements. These errors can be divided into three major groups:

• measurement errors
• calibration errors
• retrieval errors

Naturally, the amount of data that is used during the prediction process is huge; therefore the model operates with high complexity and it is very desirable to stabilize it as much as possible. In this context, the additional knowledge that can be obtained from error analyses plays an important role. Therefore it is necessary to classify the errors, and obviously the classification algorithms have to be fast, reliable and accurate. The key benefit of an effective error analysis would be an improved model that is able to learn from the classification results. In the end, the quality of the predictions would increase and lead to a higher degree of stability. Regarding concrete applications of the scientists' research, the most important fields to mention would be weather prediction and climate control.

2 Literature overview

Unfortunately, there is only a limited amount of literature available on the topic of automated classification in higher dimensions. However, there are some papers which could be used as a basis for the concepts that are presented here.

The first interesting document was Haennsler's Master's thesis [1]. Its main topic is the use of singular vector theory to classify radio occultation data. Haennsler's main algorithm was applied to the underlying classification problem of this thesis; in fact it is one of the standard algorithms that are going to be presented in chapter 4. Principally, the algorithm uses singular value decompositions to divide the data objects into different groups.
It is set up in two stages; the first part returns a preliminary class structure and the second part checks whether the classification can be improved by joining some of these classes.

Another very good paper is the one by Cormack [2]. It gives a very nice and general introduction to the topic of classification. He divides classification problems into three groups:

• hierarchical classifications (where the classes themselves underlie a certain hierarchy)
• partitioning (where the classes are mutually exclusive, hence the classification gives a clear partition of the data set)
• clumping (where the classes are not necessarily mutually exclusive and may have overlaps)

Further, the author gives an overview of different measures of dissimilarity that could be useful for individual classification problems. He mostly quotes findings from other authors and summarizes the measures and their specific properties. In his explanations, Cormack mentions that the sample data need not necessarily be quantitative. He introduces ways to measure similarities between qualitative variables as well. In particular, he mentions that the only structure that has to be available is some order (e.g. a partial or a complete order) among the objects. This is usually sufficient information to classify the given data set.

Finally, Gordon's book [3] was a very comprehensive resource. It focuses on classification problems in lower dimensions and presents various approaches to constructing efficient clustering algorithms. The author starts by introducing different dissimilarity measures, just like Cormack does in his paper. Further, he covers hierarchical classification models in a later part of his book. A very interesting chapter is the one in which he presents possible approaches for classifying data using iterative relocation methods. The key idea of such an algorithm is to check whether a given classification is optimal. If it is not, the algorithm tries to transform the classes in a way that increases the degree of optimality. In this thesis, that methodology was used to design "two stage" algorithms in which the second stage consists of an iterative relocation procedure. In addition, Gordon gives some insight into the basics of statistical models that can also be used for classification problems. For instance, he presents techniques that use Bayesian or maximum likelihood approaches.

3 The data sets

In order to test the classification algorithms, it was important to design appropriate sample data first. This chapter describes the type and the structure of the test data that was used for evaluating the algorithms. Basically, every individual data vector is constructed according to its class reference function. There are 3 classes for each data set; overall there are 5 sets to test the algorithms. The reference functions were chosen so that the algorithms were challenged on various scenarios throughout the testing procedures. The intention was to identify their strengths and possible weaknesses. Regarding the structure of the individual data sets, it must also be mentioned that there are differences in the underlying noise levels.
Basically, the data noise is generated using a pseudo random number generator which provides sequences of numbers following the normal distribution $N(0,\sigma)$ with a specific standard deviation σ. The noise level is chosen differently for the data sets so that, besides varying the type of reference function and its range, there is another factor that influences the data structure. But inside each data set, this level is fixed for all data vectors.

The sample plots have to be interpreted in the following way: each graph represents one reference function (it is plotted without any noise), and the crosses stand for one concrete realization that includes random noise. In this way, the class structure can easily be understood and visualized.

3.1 Data Set 1

The reference functions are defined as follows:

$f_1(x) = 50$ (a constant function)
$f_2(x) = 100\cos(x/20)$ (a periodic function)
$f_3(x) = x^2/100 + 100$ (a parabola)

[Figure 3.1: Class Reference Functions of Data Set 1]

The noise level is uniformly set to σ = 40 for all data vectors.

3.2 Data Set 2

The reference functions are defined as follows:

$f_1(x) = 50$ (a constant function)
$f_2(x) = 25\cos(x/15) + 50$ (a periodic function)
$f_3(x) = -x^2/200 + 75$ (a parabola)

[Figure 3.2: Class Reference Functions of Data Set 2]

The noise level is set to σ = 30 for all data vectors, so it is a bit smaller than in data set 1. But the reader should note that the range of the reference functions has decreased as well.

3.3 Data Set 3

The reference functions are defined as follows:

$f_1(x) = 30\sin(\dots) + 50$ (a high-periodic function with range 30)
$f_2(x) = 25\cos(x/15) + 50$ (a low-periodic function with range 25)
$f_3(x) = 50\sin(x/15) + 50$ (a low-periodic function with range 50)

[Figure 3.3: Class Reference Functions of Data Set 3]

The noise level is defined by σ = 25 in this case and was chosen moderately. The key factors that distinguish the individual reference functions are their periodicity and their range.

3.4 Data Set 4

The reference functions are defined as follows:

$f_1(x) = 10\arctan(x/10) + 50$ (a non-periodic function)
$f_2(x) = 20\sin(x) + 50$ (a sine function)
$f_3(x) = 20\cos(x) + 50$ (the cosine function corresponding to $f_2$)

[Figure 3.4: Class Reference Functions of Data Set 4]

For this data set, the noise level was chosen as σ = 20. The idea here was to design two very closely related sine and cosine functions and a third function that is substantially different from the other two.
3.5 Data Set 5

The reference functions are defined as follows:

$f_1(x) = 7\arctan(x/10) + 50$ (a non-periodic function)
$f_2(x) = 10\sin(x) + 50$ (a sine function)
$f_3(x) = 10\cos(x) + 50$ (the cosine function corresponding to $f_2$)

[Figure 3.5: Class Reference Functions of Data Set 5]

The noise level was set to σ = 10. The transition from data set 4 to this one was to reduce the functions' range, thereby accumulating the points on a smaller interval.

4 The different classification algorithms

4.1 General ideas of the testing procedures

In this chapter, the classification algorithms will be introduced. They were implemented in MATLAB 5.3; the reader will find some listings in the appendix. In addition to a description of the algorithms, the main results of the performance tests are also included. These tests are based on the data sets which were presented in chapter 3. The two key qualities that were measured and on which the algorithms were benchmarked are:

• classification accuracy
• computation time

The accuracy of an algorithm is the proportion of data vectors that are assigned correctly to a class. For example, an accuracy of 77% implies that 77% of all data vectors were grouped correctly into their respective classes. Possible sources of error for getting a low number on this measure are either the return of more classes than actually needed or incorrect class assignments.

In terms of computation time, the results should not be read as an absolute measure. Naturally the implementation of the classification algorithms in MATLAB is not perfect, and it is clearly possible to program them much more efficiently. What is interesting are the algorithms' performances on a relative scale, since it makes sense to analyze whether one of them works faster than the others.

The test procedures were designed consistently for all algorithms. In order to create various test scenarios, the class size and the grid size were changed in each trial. Because there are 4 algorithms that were tested on 5 data sets, the whole test series for the standard algorithms consisted of 20 individual trials. In order to present a valuable summary of the test results, they were aggregated as well and as informatively as possible. However, if the reader is interested in analyzing the plain findings, he can look them up in the appendix. A short sketch of how the two performance measures can be recorded is given below.
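The following is a hedged MATLAB sketch, not the thesis's actual test harness. "classify" is a hypothetical function name standing for any of the algorithms below, and matching every returned class to the majority true label is an assumption about how the accuracy figure could be obtained; X and labels are the data matrix and true class indices as in the sketch in chapter 1.

    % Hedged sketch of how the two performance measures could be recorded
    % for one trial (not the thesis's test code).
    tic;
    pred = classify(X);                  % class index returned for each vector
    elapsed = toc;                       % computation time in seconds

    correct = 0;
    for c = unique(pred)'                % every class the algorithm returned
        members = labels(pred == c);     % true labels of its members
        best = 0;
        for t = unique(members)'
            best = max(best, sum(members == t));
        end
        correct = correct + best;        % credit the majority label
    end
    accuracy = correct / numel(labels);  % proportion of correctly grouped vectors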
4.2 Restrictions on the tests

Due to the limited time frame of the study, some restrictions on the test scenario had to be made. In general, these restrictions should not affect an overall evaluation of the algorithms. Chapter 5 will also analyze special cases that cannot be discussed in this setup. In particular, the most important restrictions are:

• A fixed number of classes for each data set
In chapter 3, it was already mentioned that every data set consists of three classes. This is usually not the case in concrete applications. But since none of the algorithms has any a priori knowledge about the number of data classes anyway, fixing this parameter should not be a problem.

• Uniform class sizes
The class size was chosen equally for each data set. Of course this is a very rigorous assumption about the test data, but on the other hand it is essential to keep the cases as general as possible. Concretely, class sizes of 50, 100, 200 and 300 were analyzed in the individual trials (in combination with appropriate choices for the number of dimensions). Since there are 3 data classes, the total number of objects was 150, 300, 600 and 900, respectively.

• Euclidean distance
In general, it is convenient to use the Euclidean distance to measure the difference between two data vectors. However, there might be applications for which it makes sense to consider different measures. For example, it could be important to disregard any scalings between single data vectors or to ignore vertical shifts. That is why the algorithms should be designed flexibly enough to take this into consideration.

• Gaussian noise
In all sample data sets, the error noise is normally distributed, i.e. $\varepsilon \sim N(0,\sigma)$ with σ > 0 representing the noise level. This is perhaps the most convenient assumption for most practical applications.

• Fixed noise levels
For these tests, σ remained fixed within each data set. This decision might be a bit restrictive since some applications assume flexible noise levels for their data, but on the other hand it once more ensures that the test scenarios stay as general as possible.

4.3 The standard algorithms

4.3.1 Algorithm 1

4.3.1.1 General description

Algorithm 1 was taken from Haennsler's Master's thesis [1]. His approach to classifying data is to use the data points' singular vectors and their structure. He introduces an algorithm that consists of two parts. In the first stage, he does a preliminary classification of the data vectors, allowing more classes to be created than there actually are. In the second stage, he tries to unify classes that were set up separately in the first stage, thereby finalizing the classification process.

Haennsler's starting point for developing this algorithm was the observation that the singular vectors of linearly dependent data vectors point in the same direction, regardless of their length. Since the data points in each class are close to being linearly dependent (they are not exactly dependent because random noise is added), he uses this property to develop a classification method. In particular, his proposed algorithm works as follows:

(1) Create a new class consisting of the first data vector $x_1$.
Hence, if a represents the angle between the two vectors, then either a « 0 or a ~ n and > |= ||u, |J j|x j I cos(a) * fx, | since cos(a)« 1 and lu j = 1. Therefore put x, into the same class as x, if \<Ux>Xi >-||xi|||^A where A is a previously specified parameter. If this condition is not fulfilled, x; stays in the set of unclassified data points. (4) Repeat steps 1-3 as long as there are still elements left in the set of remaining data vectors. This concludes the first stage of the algorithm. (5) Now fix the first class that was created before (call it C ,) and try to unify it with each one of the remaining classes (call them C ). After performing a singular value decomposition of the two matrices C, and C , denote the left singular vectors that correspond to the largest singular values in these matrices by u ,,u ; e IR”. (6) Join C, and C; if |< u,,Uj > | > Q where Q is another parameter that has to be specified by the user. Naturally Q e (0,l) 17 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. since the value of |< u,, u, >| should be close to 1 if the two classes are based on the same reference function. (7) Repeat steps 5 and 6 until all classes have been checked pair wise for being unified. After this step, the classification is complete. For any additional details, the reader can be referred to Haennsler's thesis in which he explains the mathematical foundation of his algorithm much more precise. 4.3.1.2 Strengths and weaknesses The following two charts compare the algorithm's time performance on the sample data sets with the average performance of all 4 algorithms. Keep Class Size fixed to 100 members “ 50 I 40 -••€>------------- 200 50 100 150 Algorithm 1 •Average Keep Grid Size fixed to 50 900 | 450 t 300 j 150 200 300 100 Algorithm 1 •Average Figure 4.1: Time Analysis for Algorithm 1 As the left figure clearly shows, algorithm 1 is very fast. Especially for larger data sets (which means having a higher number of class members), it outperforms the other algorithms significantly. Also, as the grid size gets larger, the second graph shows that the relative advantage of algorithm 1 does not decrease. 18 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% S et 1 Set 3 Set 4 Set 2 S et 5 Average Accuracy □ A lgorithm 1 ■ A v erag e Figure 4.2: Accuracy Analysis for Algorithm 1 Figure 4.2 shows the accuracy performance of algorithm 1 on each data set as well as on average. It can be observed that its accuracy level is slightly below the other algorithms'. The most significant difference is coming from data set 1. On this data set, algorithm 1 highly underperformed in comparison to the others. During the tests, it could also be seen that the accuracy level of this algorithm highly depends on the specific choice of parameters. Already small changes in A and Q can cause completely different classification results. In the given test scenario, the algorithm was tested on various parameter constellations and it was made sure that the final selection is the best and most solid one. This also counts for the process of choosing the respective parameter values for all other algorithms. Finally, a general disadvantage of algorithm 1 is that it is not possible to apply different distance measures to it. 
4.3.2 Algorithm 2

4.3.2.1 General description

In contrast to the first algorithm, algorithm 2 tries to classify the objects by identifying clusters in the data structure. Further, it is only designed as a one-stage procedure. The principal steps of algorithm 2 are the following:

(1) Calculate all pairwise distances between the data objects and store the values in a matrix D. In particular,

$D_{ij} = \mathrm{dist}(x_i, x_j), \quad 1 \le i, j \le m$

(2) Identify the maximum entry in D. It represents the maximum distance between two vectors in the data set:

$d_{\max} = \max_{1 \le i, j \le m} \{ D_{ij} \}$

(3) The algorithm needs a certain cutoff distance. It is set according to a previously specified parameter ν ∈ (0,1):

$\theta = \nu \, d_{\max}$

(4) Introduce D* as a symbol for the distance matrix consisting of all unclassified data points and define D* = D initially.

(5) Now pick the first data vector $x_1^*$ in D* and assign it to a new class. Its entry in D* is deleted.

(6) Go over all elements $x_i^*$ that have not been classified yet and determine whether they lie inside a θ-ball around $x_1^*$, i.e.

$\mathrm{dist}(x_1^*, x_i^*) \le \theta$

If this condition is true, then $x_i^*$ is assigned to the same class as $x_1^*$ and its entry in D* is deleted; if not, it stays in the set of unclassified vectors.

(7) Repeat steps 5 and 6 as long as there are still elements left in D* that have not been classified yet.

For this algorithm, ν is the only parameter that has to be chosen in advance. In this context, it is important to mention that once θ is computed, it stays fixed during the whole algorithm.

4.3.2.2 Strengths and weaknesses

This algorithm was not included in the performance tests because it is very similar to algorithm 3; in fact the two algorithms only differ by one small procedure. Nevertheless, it can be observed that the algorithm is flexible in the context of distance measures; the user can choose which measure he wants to apply. The only adjustment he has to make is switching the distance computation used to create D.

Algorithm 2 is very suitable for identifying clear cluster structures in the test data. However, there is one point of concern: since the choice of the centerpoint of the θ-balls is arbitrary (it is always the first element of D*), the results can become inaccurate. For example, this can happen if the centerpoint is situated between two different clusters and the θ-ball is large enough to overlap with both of them. In that case the classification will not be very accurate. This is the reason for introducing algorithm 3.

4.3.3 Algorithm 3

4.3.3.1 General description

Algorithm 3 is an extension of algorithm 2. There is one new procedure added to steps 5 and 6 (see chapter 4.3.2.1). First, the algorithm determines the index of a maximum element in D*. Then the corresponding data vector is chosen as centerpoint for the θ-ball (instead of simply taking the first element of D* as in algorithm 2). This procedure is repeated every time before setting up a new class. Note that this step is always applied to D*, the distance matrix including all vectors that have not been classified yet. Hence it is independent of the choice of θ, which refers to D, the distance matrix including all data vectors.

The advantage of this additional step is quite obvious. By always taking one of the outlying vectors as centerpoint, the algorithm does not produce unwanted overlaps and guarantees a higher degree of accuracy than the previous version. Algorithm 3 also requires the user to choose one parameter ν ∈ (0,1), and it has the same meaning as in algorithm 2. A short MATLAB sketch covering both algorithms is given below.
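The following minimal MATLAB sketch illustrates the θ-ball clustering; it is not the appendix listing. Setting useMaxCenter = false reproduces the centerpoint choice of algorithm 2, true that of algorithm 3; the Euclidean distance is used to fill D, but any other measure could be substituted there.

    % Minimal sketch of the theta-ball clustering from 4.3.2 / 4.3.3.
    function classes = thetaball_classify(X, nu, useMaxCenter)
    m = size(X, 1);
    D = zeros(m, m);
    for i = 1:m                               % pairwise distance matrix D
        for j = 1:m
            D(i, j) = norm(X(i, :) - X(j, :));
        end
    end
    theta = nu * max(D(:));                   % cutoff distance, fixed once

    classes = zeros(m, 1);                    % 0 = unclassified
    nextClass = 0;
    while any(classes == 0)
        unclassified = find(classes == 0);
        if useMaxCenter
            % algorithm 3: use an outlying vector (max entry of D*) as center
            Dstar = D(unclassified, unclassified);
            [maxval, idx] = max(Dstar(:));
            [row, col] = ind2sub(size(Dstar), idx);
            center = unclassified(row);
        else
            center = unclassified(1);         % algorithm 2: first unclassified
        end
        nextClass = nextClass + 1;
        classes(center) = nextClass;
        for i = unclassified'                 % assign everything in the theta-ball
            if classes(i) == 0 && D(center, i) <= theta
                classes(i) = nextClass;
            end
        end
    end
    end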
Note that the step is always applied to D*, the distance matrix including all vectors that have not been classified yet. Hence it is independent from the choice of 0 - a procedure that refers to D, the distance matrix including all data vectors. The advantage of this additional step is quite obvious. By always taking one of the outlying vectors as centerpoint, the algorithm does not produce unwanted overlaps and guarantees a higher degree of accuracy than in the last version. Algorithm 3 also requires the user to choose one parameter v e (0,l), and it has the same meaning as in algorithm 2. 22 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.3.3.2 Strengths and weaknesses The tests returned the following accuracy results: 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% S e t 3 S e t 4 Average Accuracy S et 1 S et 2 H A Igorithm 3 B A v e ra g e Figure 4.3: Accuracy Analysis for Algorithm 3 It can be observed that the algorithm performs quite well. In fact, there is no data set for which its accuracy is deviating significantly from the average. Keep Class Size fixed to 100 members _ 60 g ■ £ ■ 5 0 20 200 100 150 -HO-— Algorithm 3 •Average Keep Grid Size fixed to 50 900 750 300 150 200 300 100 Algorithm 3 •Average Figure 4.4: Time Analysis for Algorithm 3 Figure 4.4 shows maybe the greatest disadvantage of algorithm 3, namely the amount of computation time that is needed for the classification. When the data sets get larger, the algorithm 23 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. gets slower and slower (see left chart). For varying grid sizes (see right chart), it seems that there is no case for which the algorithm really works much better or worse than the average. Algorithm 3 is flexible in terms of the underlying distance measure. The user has the freedom of choice to apply the measure he finds most appropriate for his application. One disadvantage could come up for specific data structures. If a data set includes significant outliers, they will be used to determine 0. In addition, since those values will also be chosen as centerpoints for the 0-balls, it could be that some of the clusters only consist of a small number of outliers, especially if the data structure is very scattered. That is why the user should be careful with the choice of the parameter v e (0,l) for such data sets. A reasonable strategy could be to choose v small - risking that there are actually more classes returned than desired, but at least the number of misclassifications stays small. This leads to another negative feature of algorithm 3, namely that is only goes over the data once in the sorting process. So it might happen that the classification comes up with two different classes for a group of data that originally belonged together, and there is no second step in which these classes can be joined. That is why it would surely be desirable to add a second stage to this algorithm that validates the initial classification. Exactly this was the idea for creating algorithm 4. 24 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.3.4 Algorithm 4 4.3.4.1 General description In contrast to the two previous algorithms, algorithm 4 is a two stage procedure. First, the methodology of algorithm 3 is used for an initial classification. 
Then, the classes are checked for possible modifications with an iterative relocation algorithm that was taken from Gordon's book [3]. He proposes a rule to decide whether two previously different classes should be joined. Mathematically, he uses in-class sums of squared distances as the basis for the decision rule. This also guarantees that the algorithm stays very flexible with respect to using various distance measures. In particular, the algorithm works as follows:

(1) Use algorithm 3 to obtain a preliminary classification of the data set.

(2) Let $C_1$ and $C_2$ be two classes resulting from step 1. Then calculate

$W_1 = \sum_{x_k, x_l \in C_1 \cup C_2} \mathrm{dist}(x_k, x_l)^2 \quad \text{and} \quad W_2 = \sum_{x_k, x_l \in C_1} \mathrm{dist}(x_k, x_l)^2 + \sum_{x_k, x_l \in C_2} \mathrm{dist}(x_k, x_l)^2$

(3) Obviously $W_1 \ge W_2$. Following Gordon's idea, the two classes should be joined if $W_1$ does not exceed $W_2$ significantly; the threshold of this test depends on $c = |C_1 \cup C_2|$, the number of objects in the combined class, on the dimension n, and on z > 0, a previously chosen standard normal deviate specifying the significance level. Note that z is the second parameter, besides ν, that has to be specified by the user before the algorithm is started.

(4) Now apply steps 2 and 3 to all pairwise combinations of classes. After all these possibilities have been checked, the classification is finished. A brief sketch of the quantities used in this test is given at the end of this section.

4.3.4.2 Strengths and weaknesses

From the chart below, it can be observed that the accuracy level of algorithm 4 is clearly the best so far. Its performance is significantly higher than average; especially on data sets 1 and 2 it does a remarkable job.

[Figure 4.5: Accuracy Analysis for Algorithm 4 - accuracy per data set and on average, compared with the average of all algorithms]

An important point for the applicability of this algorithm is a good strategy for choosing the two parameters ν and z. First, ν should be picked a little smaller than in algorithm 3, so that more classes might be created but misclassifications are avoided. Then z is set so that those classes can be joined afterwards and the final classification becomes accurate. This strategy really paid off in the tests, because the average accuracy level of algorithm 4 is the highest among all algorithms.

[Figure 4.6: Time Analysis for Algorithm 4 - left panel: class size fixed to 100 members; right panel: grid size fixed to 50]

The only real disadvantage of algorithm 4 is the time it needs for its calculations. It can be observed that, besides delivering the best accuracy level, algorithm 4 is also the slowest in terms of computation time. Especially when the number of class members grows (left figure), this disadvantage becomes quite apparent on a relative scale. Fortunately, there is no significant deterioration in performance once the grid size gets larger (right figure); in fact the relative disadvantage compared to the average stays almost the same.

Finally, one practical advantage of algorithm 4 over algorithm 3 is that it is less dependent on the choice of the parameter ν. If more classes are created than actually needed after the first stage, they can be joined in the second part. This feature is not available for algorithm 3, and that is why algorithm 4 is much more flexible and easier to use.
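The sketch below shows the in-class sums W1 and W2 used in the joining test of the second stage. The exact significance threshold from Gordon's rule is not reproduced in this section, so the helper join_threshold is only an illustrative stand-in (as are the function names themselves); a reader implementing the test would replace it with the actual expression from [3].

    % Sketch of the quantities used in the joining test of stage two.
    % C1 and C2 are matrices whose rows are the members of the two classes;
    % z is the chosen standard normal deviate.
    function joined = should_join(C1, C2, z)
    W1 = sum_sq_dist([C1; C2]);                   % merged class
    W2 = sum_sq_dist(C1) + sum_sq_dist(C2);       % separate classes
    [c, n] = size([C1; C2]);
    joined = (W1 / W2) <= join_threshold(c, n, z);
    end

    function W = sum_sq_dist(C)
    % sum of squared pairwise Euclidean distances within one class
    W = 0;
    for k = 1:size(C, 1)
        for l = k+1:size(C, 1)
            W = W + norm(C(k, :) - C(l, :))^2;
        end
    end
    end

    function t = join_threshold(c, n, z)
    % Illustrative stand-in only -- NOT the rule from the thesis or from [3].
    t = 1 + z * sqrt(2 / (c * n));
    end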
4.3.5 Algorithm 5

4.3.5.1 General description

The reason for the transition from algorithm 4 to algorithm 5 was to change the first step of the classification process. This algorithm is also implemented as a two-stage procedure, and the second step (iterative relocation) is identical to that of algorithm 4. However, the first stage is completely new and has nothing in common with the previous algorithms. The goal was to find clusters of data vectors faster than before while keeping the accuracy level as high as possible. The algorithm works as follows:

(1) Define a classification parameter η ∈ (0,1) in relation to s*, the uniform class size in the data set. Also choose z as a standard normal deviate for the second stage.

(2) For each data vector $x_1, \dots, x_m$, calculate its $\lfloor \eta s^* \rfloor$ neighbors; these are the points with the smallest distance to the respective vector. The result of this procedure is a "neighbor matrix" N of size $m \times \lfloor \eta s^* \rfloor$. For example, the i-th row of N contains the neighbors of the i-th data vector.

(3) Now start the classification by assigning the first vector $x_1$ to a new class.

(4) For the data point $x_2$, analyze its neighbors that are listed in N. If $x_1$ is one of the neighbors of $x_2$ and $x_2$ is one of the neighbors of $x_1$, then $x_2$ is put into the same class as $x_1$. If the two vectors are not neighbors of each other, then $x_2$ is assigned to a new class.

(5) Now consider $x_3$ and its neighbors in N. If one of the previously classified elements ($x_1$ and $x_2$ in this case) is a neighbor of $x_3$ and vice versa, then the object is assigned to the respective class. It is assigned to a new class if no neighbor correspondences could be identified.

(6) The pattern should now be clear: the same procedure is repeated for all data vectors $x_4, \dots, x_m$. It is important to observe that the classification is straightforward. First $x_1$ is classified, then $x_2$, then $x_3$ and so on until $x_m$.

(7) Finally, the same iterative relocation procedure as in algorithm 4 is applied to check whether some of the classes can be joined. After this step, the classification of the data set is complete.

4.3.5.2 Strengths and weaknesses

[Figure 4.7: Accuracy Analysis for Algorithm 5 - accuracy per data set and on average, compared with the average of all algorithms]

Analyzing the accuracy level, figure 4.7 shows that algorithm 5 performs well on every data set except number 2. On average, its accuracy level is slightly behind algorithms 3 and 4, but it is still ahead of algorithm 1.

[Figure 4.8: Time Analysis for Algorithm 5 - left panel: class size fixed to 100 members; right panel: grid size fixed to 50]

In terms of computation time, algorithm 5 has a clear advantage over algorithms 3 and 4. It outperforms them significantly and returns a lower than average computation time (left figure). The reason for this is surely that the kernel of the first classification stage contains far fewer loops than that of the two other algorithms. However, algorithm 5 is still much slower than algorithm 1. Concerning the time performance for larger grid sizes (right figure), it can be seen that the relative performance of algorithm 5 compared to the average gets a little worse, but this deterioration is not very significant.

As a general advantage of algorithm 5, it can be mentioned that it is flexible in terms of using different distance measures - just like algorithms 2, 3 and 4. Once more, the user can choose which distance measure is most convenient for his experiment. A short MATLAB sketch of the first classification stage follows.
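The sketch below is a minimal reconstruction of the first stage only (the second, iterative relocation stage is omitted); neighbor_classify, sStar and eta are names introduced purely for this sketch, and it is not the listing from appendix C.

    % Minimal sketch of the first stage of algorithm 5 (neighbor matrix).
    % X is the m-by-n data matrix, sStar the uniform class size and eta the
    % classification parameter in (0,1).
    function classes = neighbor_classify(X, sStar, eta)
    m = size(X, 1);
    K = floor(eta * sStar);                  % number of neighbors per vector

    % neighbor matrix N: row i holds the indices of the K nearest vectors to x_i
    N = zeros(m, K);
    for i = 1:m
        d = sqrt(sum((X - repmat(X(i, :), m, 1)).^2, 2));
        d(i) = inf;                          % a vector is not its own neighbor
        [dsorted, order] = sort(d);
        N(i, :) = order(1:K)';
    end

    classes = zeros(m, 1);
    classes(1) = 1;                          % x_1 opens the first class
    nextClass = 1;
    for i = 2:m
        assigned = false;
        for j = 1:i-1                        % previously classified vectors
            if any(N(i, :) == j) && any(N(j, :) == i)   % mutual neighbors
                classes(i) = classes(j);
                assigned = true;
                break;
            end
        end
        if ~assigned
            nextClass = nextClass + 1;       % no correspondence: new class
            classes(i) = nextClass;
        end
    end
    end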
Concerning the time performance for larger grid sizes (right figure), it can be seen that the relative performance of algorithm 5 compared to the average gets a little worse, but this deterioration is not very significant. As a general advantage of algorithm 5, it can be mentioned that it is flexible in terms of using different distance measures - just like algorithms 2, 3 and 4. Once more, the user can choose which distance measure is most convenient for his experiment. Due to its structure, algorithm 5 can identify outliers in the underlying data set quite effectively, simply because they are usually not neighbors of any other points. Hence for data sets that 30 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. include many outlying values, it will return a number of single element classes. These classes could easily be joined to a "class of outliers" in another step, thereby separating them from the rest of the data. In many research applications that is surely a desired feature. 4.4 The learning sample algorithms 4.4.1 Reasons for using learning samples All algorithms that were presented so far do not assume that there is any a priori knowledge available on the data. However, it is surely possible that the user has some information, or he simply makes assumptions that are based on his experience with the data. For instance, he might know that there is a small subset of data values that already has a class structure. This subset of data points is called a learning sample. Having a learning sample is an essential advantage. Not only does the user know the number of classes in advance, he also knows something about the classes themselves - simply because some representants are available. With this additional knowledge, it is much easier to solve the initial classification problem. Two algorithms for classifying data sets that provide learning samples are presented in this chapter. Principally, both of them consist of one simple loop that goes through all unclassified elements and assigns them to one o f the classes. 31 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. These algorithms were tested on the same data sets and in exactly the same scenarios as the standard algorithms. The size of the learning samples was assumed to be in a range of 3-10% o f the overall sample size. In particular, the choices for the tests were learning sample sizes of 5 for classes of 50 and 100 elements as well as sizes of 10 for classes of 200 and 300 elements. 4.4.2 Algorithm 1 4.4.2.1 General description The classification principle on which this algorithm is based includes the use of "centroids" for the data classes. A centroid is nothing else but an average element o f the corresponding learning sample. In particular, let the learning sample for class i be denoted by y |,..., y‘ k (assuming that all learning samples are of size k). Then the centroid for this class is determined by Note that cent1 e IRn, hence it is a vector with the same dimensions as the data points. The classification principle is very simple. (1) Calculate the class centroids cent' for all classes. (2) Go linearly over the data set and assign each vector (j= l,...,n) to one class by finding the centroid that it lies closest to. 32 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
In particular, this means identifying a class c* such that

$\mathrm{dist}(\mathrm{cent}^{c^*}, x_j) = \min_{c} \{ \mathrm{dist}(\mathrm{cent}^{c}, x_j) \}$

$x_j$ is then assigned to class c*.

4.4.2.2 Strengths and weaknesses

Clearly this algorithm is very fast, since its kernel only consists of one loop. Each individual class assignment only requires finding the minimum over a rather small set. In addition, the user does not have to specify any parameters in advance.

The accuracy level that the algorithm returns is remarkable. As the reader can see from the test data in the appendix, its average accuracy is 99.2%, a result that is substantially higher than the 85.4% average of all standard algorithms. It even handles well those data sets that the other algorithms had much trouble with.

A possible weakness could show up if the learning sample's structure is not as nice as in the tests. Assume, for instance, that the sample data includes a very high noise level. In this case, the learning sample elements will deviate a lot from each other. If the centroid is understood as a statistic of $y_1, \dots, y_k$, the consequence is (statistically speaking) that it will have a high standard deviation. That can be problematic, because then it is not necessarily true that the centroid is positioned in the center of the data class. Therefore it might happen that some elements are assigned to a wrong centroid, especially if several clusters lie close together or if there are small overlaps. This was the reason for designing a second algorithm that takes this fact more into consideration.

4.4.3 Algorithm 2

4.4.3.1 General description

As indicated before, the structure of this algorithm is very similar to algorithm 1. The only difference lies in the method that is used to assign data objects to their classes. In particular, the algorithm works as follows:

(1) Determine the class centroids $\mathrm{cent}^i \in \mathbb{R}^n$ for all classes.

(2) For each learning sample class, calculate the standard deviation of the learning sample in every dimension; then take the average over all dimensions to receive a single value $\sigma^i \in \mathbb{R}$.

(3) Now go over the data set in a linear loop. For each vector $x_j$, find a class index c* for which the distance to the centroid, scaled by the corresponding $\sigma^{c}$, is minimal. Finally assign $x_j$ to class c*.

4.4.3.2 Strengths and weaknesses

The key advantage over the first learning sample algorithm is that this one should be better suited to handling data sets with higher noise levels. Outliers in the learning sample will have a smaller influence, since they automatically increase the standard deviation and thereby the value of $\sigma^i$. In addition, this algorithm will also be more flexible in dealing with data sets that include different noise levels.

In the performance tests, this algorithm returned an average accuracy of 98.6%. This result is only slightly lower than the one for the first algorithm and still much better than all standard algorithms. Figure 4.9 summarizes this observation:

[Figure 4.9: Average Accuracy Analysis - learning sample algorithm 1: 99.2%, learning sample algorithm 2: 98.6%, average of the algorithms without learning samples: 85.4%]

In terms of computation time, the results of the two algorithms are very similar as well (see the appendix for details). Therefore algorithm 2 should be preferred over the previous one, since it is more flexible at low additional cost in terms of time and accuracy. A short sketch covering both learning sample algorithms is given below.
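A combined MATLAB sketch of the two learning sample classifiers follows; it is not the listing from appendices D and E. Y{i} holds the learning sample of class i as a k-by-n matrix, and setting useSigma = false gives the plain centroid rule of algorithm 1, true the standard-deviation-weighted rule of algorithm 2. The σ-scaled assignment is an assumption based on the description above, not a verbatim formula from the thesis.

    % Sketch of the two learning sample classifiers (4.4.2 and 4.4.3).
    function classes = learning_classify(X, Y, useSigma)
    nClasses = numel(Y);
    n = size(X, 2);
    cent = zeros(nClasses, n);
    sigma = ones(nClasses, 1);
    for i = 1:nClasses
        cent(i, :) = mean(Y{i}, 1);               % class centroid
        if useSigma
            sigma(i) = mean(std(Y{i}, 0, 1));     % per-dimension std, averaged
        end
    end

    m = size(X, 1);
    classes = zeros(m, 1);
    for j = 1:m
        d = zeros(nClasses, 1);
        for c = 1:nClasses
            d(c) = norm(X(j, :) - cent(c, :)) / sigma(c);   % (scaled) distance
        end
        [dmin, cstar] = min(d);
        classes(j) = cstar;                       % assign to the closest centroid
    end
    end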
5 More general cases

As it was pointed out in chapter 4.2, the discussion so far is based on specific assumptions on the data sets which are usually not found in concrete applications. In this chapter, the goal is to take a closer look at how the algorithms deal with more general scenarios by relaxing some of the restrictions that were described before.

5.1 Non-uniform class sizes

The first experiment is to test algorithm 1 and algorithm 5 on data sets with non-uniform class sizes. Two different types of sample data were set up, namely

• one data set including one small class and two large classes (chapter 5.1.1),
• another data set including two small classes and one large class (chapter 5.1.2).

In both cases, the analysis is based on data set 1 and data set 4. They have a very different structure and therefore this choice should give a broad insight into the algorithms' capabilities.

5.1.1 One small class and two large classes

As table 5.1 shows, algorithm 5 does an excellent job in identifying the correct number of classes. Algorithm 1 seems to have some trouble with grouping vectors from data set 1; the concrete observations are explained below the table. It is important to add that classes with very few elements are not considered as "main classes", simply because they most likely consist of outliers. Only the number of main classes was counted for this analysis.

data set | grid size | size of class 1 | size of class 2 | size of class 3 | Algorithm 1 (# of main classes / accuracy) | Algorithm 5 (# of main classes / accuracy)
1 | 150 | 100 | 100 | 15  | 4* / 65%  | 3 / 93%
1 | 150 | 150 | 20  | 150 | 5** / 68% | 3 / 98%
4 | 150 | 100 | 100 | 15  | 3 / 98%   | 3 / 98%
4 | 150 | 150 | 20  | 150 | 3 / 95%   | 3 / 99%

* The algorithm was not able to separate the elements from class 3.
** The algorithm scattered the classification of class 1.

Table 5.1: Results for the analysis of 2 large classes vs. 1 small class

However, the average accuracies (82% for algorithm 1 and 97% for algorithm 5) show that the classification itself has been done quite accurately by both algorithms.

5.1.2 Two small classes and one large class

data set | grid size | size of class 1 | size of class 2 | size of class 3 | Algorithm 1 (# of main classes / accuracy) | Algorithm 5 (# of main classes / accuracy)
1 | 150 | 100 | 15 | 15  | 4* / 42%  | 3 / 85%
1 | 150 | 20  | 20 | 150 | 6** / 88% | 3 / 95%
4 | 150 | 100 | 15 | 15  | 3 / 98%   | 3 / 95%
4 | 150 | 20  | 20 | 150 | 3 / 97%   | 3 / 96%

* The algorithm mixed together elements from class 1 with elements from classes 2 and 3.
** The algorithm scattered the classification of class 1.

Table 5.2: Results for the analysis of 2 small classes vs. 1 large class

The observations for this scenario are very similar to the previous case. Once more algorithm 5 identifies the number of classes correctly and algorithm 1 has some problems with the first data set. The average accuracies (81% for algorithm 1 and 93% for algorithm 5) show that the results are also quite solid, but in this case algorithm 1 obviously had some difficulties with classifying data set 1.

Summarizing the results, the general advice should be to prefer algorithm 5 since its performance was the best on average.
However, the reader should keep in mind all advantages and drawbacks of both algorithms that were explained in chapter 4. It may well be that the application of algorithm 1 is favored for some reason, but in this case the disadvantages pointed out above should be taken into consideration.

5.2 Other distance measures

5.2.1 Reasons for using other distance measures

The next investigation involved analyzing the use of other distance measures than the Euclidean distance. As was pointed out earlier, some applications might require measures other than the Euclidean distance, and therefore it is certainly important to add a discussion on this subject. But first, let us observe that the Euclidean distance between two vectors $x$ and $y$ is defined by

$$ d_{eucl}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2 } . $$

One application where it makes sense to apply different distance measures is the classification of radio occultation data. This model was presented in chapter 1.2. In that scenario, the question arises how sensitive an algorithm should be towards scalings. For example, it could be desirable to introduce a distance measure that ignores eventual scalings between the data objects. In this case, the algorithm should combine classes for which the reference functions only differ by a scaling factor.

The following chapter develops customized distance measures for 3 special cases, namely

• vertical shiftings
• horizontal shiftings
• scalings

These measures were tested on algorithm 5.

5.2.2 Discussion of possible approaches

5.2.2.1 Vertical shiftings

The first task was to design a customized distance measure that minimizes the influence of vertical shiftings between data points. In particular, let

$$ \mu = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i) $$

be the mean difference between two data vectors $x$ and $y$. Then define the new distance measure by

$$ d_{vert}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i - \mu)^2 } . $$
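As an illustration, this measure can be written as a small MATLAB function. The following is only a sketch of the reconstructed formula above; the function name dist_vert is chosen for illustration and does not appear in the appendix code.

function d = dist_vert(x, y)
% Vertical-shift-invariant distance: remove the mean offset between
% the two column vectors before computing the Euclidean distance.
mu = mean(x - y);        % mean (vertical) shift between x and y
d = norm(x - y - mu);    % Euclidean distance of the offset-corrected difference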
5.2.2.2 Horizontal shiftings

The analysis of horizontal shiftings is a bit more difficult. A horizontal shift between data vectors decreases the number of vector components that can be used to determine the distance, simply because the grid overlap gets smaller. Therefore, the measure should include an additive term to neutralize this effect. In other words: the fewer grid points come into consideration for determining the distance, the more its actual value should be penalized.

More specifically, assume that for two vectors $x$ and $y$ there is a shift of $k$ steps on the grid of size $n$. Hence the number of grid points used for the calculation of the distance is reduced to $n - k$. Therefore, introduce a penalizing function by

$$ P(k) = \lambda \, d_{eucl}(x, y) \left( \frac{k}{2n/3} \right)^{1/2} $$

with $\lambda > 0$. By requiring $k \in \{0, \dots, \lfloor 2n/3 \rfloor\}$, it is guaranteed that at least a third of the grid points overlap. Note that $P(0) = 0$ and that the penalty grows up to $\lambda \, d_{eucl}(x, y)$ at the largest admissible shift. Therefore this definition of a penalty function makes sense. The user can influence the degree of penalty that is involved in the distance calculation by choosing an appropriate value for $\lambda$:

• $\lambda \in (0,1)$ results in a low penalty
• $\lambda = 1$ results in a medium-sized penalty
• $\lambda > 1$ results in a high penalty

Note that in the presented test cases, this factor was chosen to $\lambda =$ .

Now a distance between $x$ and $y$ can be defined as

$$ d_{hor}(x, y) = \min_{k} \left\{ d_{eucl}(x^*, y^*) + P(k) \right\} , $$

where $x^* = (x_k, \dots, x_n)$ and $y^* = (y_1, \dots, y_{n-k+1})$. Obviously it takes some computation time to calculate all values that are necessary to determine this distance as a minimum over $k$, especially if the number of dimensions gets larger.

5.2.2.3 Scalings

The case of scalings is quite similar to the case of vertical shiftings and is less complicated than the last one. The basic assumption here is that if two data vectors $x$ and $y$ just differ by scale, there exists a factor $\lambda \in \mathbb{R}$ such that

$$ x_i = \lambda \, y_i \qquad \text{for all } i = 1, \dots, n . $$

Hence define

$$ \lambda = \frac{\sum_{i=1}^{n} |x_i|}{\sum_{i=1}^{n} |y_i|} . $$

By using absolute values in the formula, it is guaranteed that $\lambda$ only takes a reasonable value if there really exists a scaling; otherwise the factor becomes large and the distance is not going to decrease. Finally, define the distance between the vectors by

$$ d_{scal}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - \lambda y_i)^2 } . $$

5.2.3 Test results

In order to test the 3 distance measures that were presented above, suitable data sets were designed and algorithm 5 was used to classify the data. For all data sets, the noise level was set to $\sigma = 20$.

For the test of the vertical shift correction ("data set 1"), the following reference functions were used:

$f_1(x) = 25 \cos(x) + 50$
$f_2(x) = -x^2/200 + 75$
$f_3(x) = -x^2/200 + 25$

Clearly, there is a vertical shift of 50 between $f_2$ and $f_3$.

For the test of the horizontal shift correction ("data set 2"), the corresponding reference functions were:

$f_1(x) = 30 \arctan(x/10) + 50$
$f_2(x) = 30 \sin(x/10) + 50$
$f_3(x) = 30 \cos(x/10) + 50$

Since $\sin(u) = \cos(u - \pi/2)$, the reference functions $f_2$ and $f_3$ differ by a horizontal shift of $10 \cdot \pi/2 = 5\pi$.

For the test of the scaling correction ("data set 3"), the reference functions were chosen as:

$f_1(x) = 50 \arctan(x/10)$
$f_2(x) = 200 \arctan(x/10)$
$f_3(x) = 100 \cos(x/10)$

Obviously there is a scaling factor of $\lambda = 4$ between $f_1$ and $f_2$.

[Table 5.3: Test results for the analysis of different distance measures. For each of the three data sets, classes of 100 and 150 members were classified twice, once with the standard Euclidean distance and once with the corresponding customized measure (vertical shift, horizontal shift, scaling correction); the table reports the number of main classes and the accuracy, with accuracies between 94% and 99%. Most individual values are not legible in the scanned copy.]

Table 5.3 shows that all three distance measures work fine on the test data. In all cases, the algorithm identified the correct number of classes for both the Euclidean and the customized measure. In addition, it also returned high accuracy values. One possible issue might be the computation time needed for the classification, especially in the horizontal shifting approach. But in general, the results are good enough to recommend applying these distance measures to concrete application problems.
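As an illustration of how such a measure plugs into the algorithms, the scaling correction from section 5.2.2.3 can be written as a short MATLAB function. This is only a sketch based on the formulas reconstructed above; the function name dist_scal is chosen for illustration and does not appear in the appendix code.

function d = dist_scal(x, y)
% Scaling-corrected distance: estimate the scaling factor lambda from
% the absolute values and remove it before measuring the distance.
lambda = sum(abs(x)) / sum(abs(y));   % estimated scaling factor between x and y
d = norm(x - lambda * y);             % Euclidean distance after rescaling y

In the routines of Appendices B and C, such a function could simply be called in place of norm(sm(:,i)-sm(:,j)) when the pairwise distance matrix is built.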
5.3 Varying Noise Levels

The last investigation in this chapter was made on the issue of variable noise levels. In most experiments, the user cannot expect the level of data noise to be constant, especially if the data comes from different sources. Hence it makes sense to add a short discussion on how some of the algorithms behave for changing noise levels.

For the concrete tests, data set 2 was chosen as a basis. The grid size and the number of elements in each class were both fixed to 150, and the noise level was then varied between $\sigma = 10$ and $\sigma = 50$. The following results were obtained:

[Figure 5.1: Accuracy for different noise levels (based on data set 2), shown for algorithms 1, 3, 4 and 5.]

As expected, the figure shows that the accuracy level decreases for all algorithms once the noise level increases. From these observations, it seems that algorithm 3 and algorithm 4 have an advantage over algorithm 1 and algorithm 5 because their level of accuracy is consistently higher. The relative difference gets larger for increasing values of $\sigma$. Therefore algorithm 4 seems to be the best choice for this scenario.

On the other hand, it is important to mention that all these results must not only be seen relative to each other. The accuracy levels are fairly low; in particular, the average for $\sigma = 50$ is around 40%. Hence the user should be careful in applying the algorithms to high-noise data, because an accuracy of 40% for the overall classification is not very satisfying.

6 Final comments

The goal of the last chapter of this thesis is to summarize the key observations and to draw some conclusions. The results of the previous chapters gave the reader a good impression of the capabilities of the classification algorithms. Possible issues that might arise in concrete research applications were also explained. To summarize the performance of the standard algorithms, the following charts are useful:

[Figure 6.1: Overall evaluation of the standard algorithms. Left panel: accuracy analysis for algorithms 1, 3, 4 and 5; right panel: time analysis, computation time in seconds versus the number of class members (50 to 300) for grid size 50.]

Based on this data, the key conclusions are that

• algorithm 4 returns the best results in terms of accuracy, but it is relatively slow;
• algorithm 1 is by far the fastest algorithm, but its accuracy level is the lowest of all algorithms;
• algorithm 5 is the best all-rounder: it provides very good accuracy results while not consuming too much computation time.

Further, it was shown in chapter 5.1 that this algorithm is more flexible than algorithm 1 in handling non-uniform class sizes. In chapter 5.2, it could be observed that it is possible to use it with different distance measures, and it returned very solid results there as well.

With these findings and explanations, the reader should be able to decide which algorithm is most convenient for his individual classification problem. According to the last chapters, algorithm 5 is perhaps the best choice in general, but it may well be that algorithm 1 or algorithm 4 fits much better in a specific scenario.

Further, the reader should have recognized how great the advantage of having a priori knowledge on the data set is. The investigations on learning sample algorithms in chapter 4.4 show that these methods significantly outperform the standard algorithms both in accuracy and computation time. However, a learning sample might not be available for many classification problems.
Finally, there is one very useful strategy for some classification problems, especially if the amount of data is very large:

• Take out a sample of data points. Make sure that the sample size is large enough so that it is representative of the whole data set.
• Let one of the standard algorithms classify this sample. If it was chosen large enough, the classification results will deliver a realistic picture of the data structure. In addition, the algorithm should have identified possible outliers.
• Extract the set of outliers and declare the remaining classes a learning sample for the whole data set.
• Apply one of the learning sample algorithms in order to receive a complete classification of the data.

This strategy should work very well for huge data sets, assuming that the sample was chosen carefully. In contrast to only applying a standard algorithm, this combination of standard and learning sample algorithms will save a substantial amount of computation time while not necessarily reducing the accuracy of the classification.

References

[1] M. Haennsler, Sorting Algorithms applied to Radio Occultation Data based on Singular Value Decomposition, Master's Thesis, University of Southern California, May 2000.

[2] R. M. Cormack, A Review of Classification, Journal of the Royal Statistical Society, Series A, Volume 134, Issue 3 (1971), p. 321-367.

[3] A. D. Gordon, Classification (2nd edition), Chapman & Hall/CRC, 1999.

[4] E. R. Kursinski, G. A. Hajj, J. T. Schofield, R. P. Linfield, K. R. Hardy, Observing Earth's atmosphere with radio occultation measurements using the Global Positioning System, Journal of Geophysical Research, Volume 102, No. D19, 10/20/1997, p. 23,429-23,465.

[5] J. A. Benediktsson, P. H. Swain, Statistical Methods and Neural Network Approaches for Classification of Data from multiple Sources, Purdue University, December 1990.

Appendix

A MATLAB code of standard algorithm 1

function alg1;
delta=50;
omega=0.95;
load smatrix;
category=0;
remain=sm;
t2=1;
catsize=[];
result=[];
Matrix=[];
tic;
% first stage: split the data into preliminary classes using the SVD criterion
while t2>0
  category=category+1;
  sm=remain;
  remain=[];
  [r,c]=size(sm);
  [u,v,d]=svd(sm(:,1));
  result(:,1,category)=sm(:,1);
  columncat=1;
  columnremain=0;
  for i=2:c
    if abs(abs(u(:,1)'*sm(:,i))-norm(sm(:,i))) < delta
      columncat=columncat+1;
      result(:,columncat,category)=sm(:,i);
    else
      columnremain=columnremain+1;
      remain(:,columnremain)=sm(:,i);
    end;
  end;
  catsize(category)=columncat;
  [t1,t2]=size(remain);
end;
catsize
time1=toc;
save catsize.mat catsize;
save result.mat result;
tic;
% second stage: merge preliminary classes whose principal directions are similar
for i=1:category
  [u,v,d]=svd(result(:,1:catsize(i),i));
  Vector(:,i)=u(:,1);
end;
Matrix=result;
[r1,r2,c]=size(Matrix);
cat=0;
while c>0
  cat=cat+1;
  finalresult(:,1:catsize(1),cat)=Matrix(:,1:catsize(1),1);
  catsizenew(cat)=catsize(1);
  test=Vector(:,1);
  Vector=Vector(:,2:end);
  Matrix=Matrix(:,:,2:end);
  catsize=catsize(2:end);
  c=c-1;
  counter=0;
  for i=1:c
    counter=counter+1;
    if abs(test'*Vector(:,counter)) > omega
      finalresult(:,1:catsizenew(cat)+catsize(counter),cat)=[finalresult(:,1:catsizenew(cat),cat) Matrix(:,1:catsize(counter),counter)];
      catsizenew(cat)=catsizenew(cat)+catsize(counter);
      Vector(:,counter:end-1)=Vector(:,counter+1:end);
      Vector(:,end)=[];
      Matrix(:,:,counter:end-1)=Matrix(:,:,counter+1:end);
      Matrix(:,:,end)=[];
      catsize(counter:end-1)=catsize(counter+1:end);
      catsize(end)=[];
      counter=counter-1;
    end;
  end;
  [t1,t2,c]=size(Matrix);
end;
time2=toc;
catsizenew
time1+time2
save finalresult.mat finalresult;
save catsizenew.mat catsizenew;

B MATLAB code of standard algorithm 4

function alg4;
fraction=0.75;
conf_level=2.5;
load smatrix;
[m n]=size(sm);
matrix=sm;
result=[];
catsize=[];
temp=[];
tic;
distance=zeros(n,n);
for i=1:n
  for j=i+1:n
    distance(i,j)=norm(sm(:,i)-sm(:,j));
  end;
end;
save distance.mat distance;
cutoff=fraction*max(max(distance))
remain=n;
class=0;
% first stage
while remain>0
  class=class+1;
  % determine i and switch lines 1 and i
  [c i]=max(max(distance(1:remain,1:remain)'));
  temp=distance(1,2:i-1);
  distance(1,2:i-1)=distance(2:i-1,i)';
  distance(2:i-1,i)=temp';
  temp=distance(1,i+1:remain);
  distance(1,i+1:remain)=distance(i,i+1:remain);
  distance(i,i+1:remain)=temp;
  temp=matrix(:,1);
  matrix(:,1)=matrix(:,i);
  matrix(:,i)=temp;
  dist=distance(1,2:end);
  distance=distance(2:end,2:end);
  result(:,1,class)=matrix(:,1);
  matrix=matrix(:,2:end);
  remain=remain-1;
  catsize(class)=1;
  % collect all remaining elements that lie within the cutoff distance
  j=1;
  while j<=remain
    if dist(j)<cutoff
      dist(j:end-1)=dist(j+1:end);
      distance(:,j:end-1)=distance(:,j+1:end);
      distance(j:end-1,:)=distance(j+1:end,:);
      result(:,catsize(class)+1,class)=matrix(:,j);
      matrix(:,j:end-1)=matrix(:,j+1:end);
      remain=remain-1;
      catsize(class)=catsize(class)+1;
    else
      j=j+1;
    end;
  end;
end;
catsize
% second stage
newclass=0;
finalresult=[];
catsizenew=[];
for i=class:-1:1
  j=1;
  not_joined=1;
  w2_i=0;
  for c1=1:catsize(i)
    for c2=c1+1:catsize(i)
      w2_i=w2_i+norm(result(:,c1,i)-result(:,c2,i))^2;
    end;
  end;
  if w2_i~=0
    w2_i=w2_i/catsize(i);
  end;
  while (j<=newclass & not_joined)
    w2_j=0;
    for c1=1:catsizenew(j)
      for c2=c1+1:catsizenew(j)
        w2_j=w2_j+norm(finalresult(:,c1,j)-finalresult(:,c2,j))^2;
      end;
    end;
    if w2_j~=0
      w2_j=w2_j/catsizenew(j);
    end;
    w2=w2_i+w2_j;
    w1=0;
    for c1=1:catsize(i)+catsizenew(j)
      for c2=c1+1:catsize(i)+catsizenew(j)
        if c1<=catsize(i)
          if c2<=catsize(i)
            w1=w1+norm(result(:,c1,i)-result(:,c2,i))^2;
          else
            w1=w1+norm(result(:,c1,i)-finalresult(:,c2-catsize(i),j))^2;
          end;
        else
          w1=w1+norm(finalresult(:,c1-catsize(i),j)-finalresult(:,c2-catsize(i),j))^2;
        end;
      end;
    end;
    w1=w1/(catsize(i)+catsizenew(j));
    level=1-2/(pi*m)-conf_level*sqrt((2-16/(pi^2*m))/((catsize(i)+catsizenew(j))*m));
    if (w2/w1 < level)
      j=j+1;
    else
      not_joined=0;
    end;
  end;
  if not_joined
    newclass=newclass+1;
    finalresult(:,1:catsize(i),newclass)=result(:,1:catsize(i),i);
    catsizenew(newclass)=catsize(i);
  else
    finalresult(:,catsizenew(j)+1:catsizenew(j)+catsize(i),j)=result(:,1:catsize(i),i);
    catsizenew(j)=catsizenew(j)+catsize(i);
  end;
end;
toc
catsizenew
save finalresult.mat finalresult;
save catsizenew.mat catsizenew;

C MATLAB code of standard algorithm 5

function alg5;
conf_level=1;
load smatrix;
[m n]=size(sm);
% neighborsize depends on n, so it is computed after the data is loaded
neighborsize=floor((n/3)/28);
matrix=sm;
result=[];
catsize=[];
neighbor=[];
neigh_ref=[];
temp=[];
tic;
distance=zeros(n,n);
for i=1:n
  for j=i+1:n
    distance(i,j)=norm(sm(:,i)-sm(:,j));
  end;
end;
save distance.mat distance;
remain=n;
class=0;
% first stage
for i=1:n
  [val pos]=sort([distance(1:i,i)' distance(i,i+1:n)]);
  neighbor(i,1)=i;
  neighbor(i,2:neighborsize+2)=pos(1:neighborsize+1);
  neighbor(i,2)=val(2:neighborsize+1)*ones(neighborsize,1);
end;
[minval ind]=sort(neighbor(:,2));
temp=neighbor;
neighbor=[];
for i=1:n
  neighbor(i,:)=temp(ind(i),:);
end;
for i=1:n
  added=0;
  class=1;
  while added==0
    if class>size(catsize,2)
      catsize(class)=1;
      result(:,1,class)=matrix(:,neighbor(i,1));
      neigh_ref(class,1,:)=[neighbor(i,1) neighbor(i,3:neighborsize+2)];
      added=1;
    elseif isempty(find(neigh_ref(class,:,2:neighborsize+1)==neighbor(i,1)))==0
      result(:,catsize(class)+1,class)=matrix(:,neighbor(i,1));
      neigh_ref(class,catsize(class)+1,:)=[neighbor(i,1) neighbor(i,3:neighborsize+2)];
      catsize(class)=catsize(class)+1;
      added=1;
    else
      class=class+1;
    end;
  end;
end;
catsize
% second stage
newclass=0;
finalresult=[];
catsizenew=[];
for i=size(catsize,2):-1:1
  j=1;
  not_joined=1;
  w2_i=0;
  for c1=1:catsize(i)
    for c2=c1+1:catsize(i)
      w2_i=w2_i+norm(result(:,c1,i)-result(:,c2,i))^2;
    end;
  end;
  if w2_i~=0
    w2_i=w2_i/catsize(i);
  end;
  while (j<=newclass & not_joined)
    w2_j=0;
    for c1=1:catsizenew(j)
      for c2=c1+1:catsizenew(j)
        w2_j=w2_j+norm(finalresult(:,c1,j)-finalresult(:,c2,j))^2;
      end;
    end;
    if w2_j~=0
      w2_j=w2_j/catsizenew(j);
    end;
    w2=w2_i+w2_j;
    w1=0;
    for c1=1:catsize(i)+catsizenew(j)
      for c2=c1+1:catsize(i)+catsizenew(j)
        if c1<=catsize(i)
          if c2<=catsize(i)
            w1=w1+norm(result(:,c1,i)-result(:,c2,i))^2;
          else
            w1=w1+norm(result(:,c1,i)-finalresult(:,c2-catsize(i),j))^2;
          end;
        else
          w1=w1+norm(finalresult(:,c1-catsize(i),j)-finalresult(:,c2-catsize(i),j))^2;
        end;
      end;
    end;
    w1=w1/(catsize(i)+catsizenew(j));
    level=1-2/(pi*m)-conf_level*sqrt((2-16/(pi^2*m))/((catsize(i)+catsizenew(j))*m));
    if (w2/w1 < level)
      j=j+1;
    else
      not_joined=0;
    end;
  end;
  if not_joined
    newclass=newclass+1;
    finalresult(:,1:catsize(i),newclass)=result(:,1:catsize(i),i);
    catsizenew(newclass)=catsize(i);
  else
    finalresult(:,catsizenew(j)+1:catsizenew(j)+catsize(i),j)=result(:,1:catsize(i),i);
    catsizenew(j)=catsizenew(j)+catsize(i);
  end;
end;
toc
catsizenew
save finalresult.mat finalresult;
save catsizenew.mat catsizenew;

D MATLAB code of learning sample algorithm 1

function ls_alg1;
% No parameters are needed here!
load smatrix;
[m lsize classes]=size(lsample);
[m n]=size(sm);
centroid=zeros(m,classes);
dist=[];
result=[];
catsize=zeros(1,classes);
tic;
% create centroids for each class
for c=1:classes
  centroid(:,c)=mean(lsample(:,:,c),2);
end;
% group all data in linear loop
for i=1:n
  for c=1:classes
    dist(c)=norm(sm(:,i)-centroid(:,c));
  end;
  % note: m is reused here as the index of the closest centroid
  [d m]=min(dist);
  result(:,catsize(m)+1,m)=sm(:,i);
  catsize(m)=catsize(m)+1;
end;
toc
finalresult=result;
catsizenew=catsize
save finalresult.mat finalresult;
save catsizenew.mat catsizenew;

E MATLAB code of learning sample algorithm 2

function ls_alg2;
% No parameters are needed here!
load smatrix;
[m lsize classes]=size(lsample);
[m n]=size(sm);
centroid=zeros(m,classes);
sdev=zeros(1,classes);
sd_fact=[];
result=[];
catsize=zeros(1,classes);
tic;
% create centroids and standard deviations for each class
for c=1:classes
  centroid(:,c)=mean(lsample(:,:,c),2);
  sdev(c)=mean(std(lsample(:,:,c),0,2));
end;
% group all data in linear loop
for i=1:n
  for c=1:classes
    sd_fact(c)=norm(sm(:,i)-centroid(:,c))/sdev(c);
  end;
  % note: m is reused here as the index of the closest class
  [d m]=min(sd_fact);
  result(:,catsize(m)+1,m)=sm(:,i);
  catsize(m)=catsize(m)+1;
end;
toc
finalresult=result;
catsizenew=catsize
save finalresult.mat finalresult;
save catsizenew.mat catsizenew;

F Collection of test results

[Table: Analysis of Data Set 1. The scanned table is not legible; its layout corresponds to the tables for data sets 2 to 5 below, i.e. number of class members, grid size, and parameters, computation time and accuracy for algorithms 1, 3, 4 and 5 as well as the two learning sample algorithms.]
comp, time accuracy 50 50 18. 1/4 5.8 96% 5 0.11 100% 5 0.11 100% 100 2 .1 6 22.03 92% 5 0.33 100% 5 0.38 100% 200 18. 1/8 82.11 98% 10 1.1 100% 10 1.15 99% 300 18. 1/14 20729 95% 10 2.36 100% 10 2.42 100% 50 100 2.1/3 20.66 97% 5 0.22 100% 5 0.17 100% 100 2 .1 fi 68.44 97% 5 0.6 100% 5 0.6 100% 200 2.1/7 330.11 98% 10 2.14 100% 10 2.2 100% 50 150 2.1/2 18.4 98% 5 0.27 100% 5 0.28 100% 100 2 .1 # 66.51 99% 5 0.88 100% 5 0.94 100% 200 2,1.6 31692 98% 10 3.07 100% 10 3.07 100% 50 200 2,1/2 13.3 100% 5 0.33 100% 5 0.33 100% 100 2.1/3 54.05 99% 5 1.1 100% 5 1.1 100% Reproduced w ith permission o f th e copyright owner. Further reproduction prohibited without permission. Analysis of Data Set 4 Euclidean Distance N oise N(0,20) - # of class grid Algorithm 1 Algorithm 3 Algorithm 4 members elem. parameters comp, time accuracy parameter comp, time accuracy parameters comp, time accuracy so 50 75. 0.97 0.38 67% 0.77 3.9 85% 0.76,25 6.92 86% 100 60. 0.98 1.37 87% 0.77 24.39 77% 0.77.25 34.6 85% 200 60. 0.98 4.83 90% 0.7 165.43 70% 0.7. 2.5 227.44 89% 300 50. 0.98 32.96 82% 0.72 65065 70% 0.7. 2.5 81559 78% 50 100 85. 0.97 0.83 87% 0.79 5.88 89% 0 .7 9 ,2 5 10.76 95% 100 85. 0.96 1.98 86% 0.78 27.02 93% 0 .7 8 .2 5 41.96 93% 200 80. 0.96 9.73 90% 0.76 19608 83% 0.75.2 33609 85% 50 150 95. 0.95 1.82 93% 0.77 5.77 96% 0.77.2 13.41 98% 100 95. 0.95 5.76 95% 0.78 29.6 90% 0.78.2 55.59 99% 200 95. 0.95 18.79 98% 0.76 21224 86% 0.76.2 31401 92% 50 200 115.0.95 1.92 98% 0.77 6.21 93% 0.77.2 13.3 99% 100 115, 0.95 5 9 9 % 0.77 32.69 96% 0.77.2 65.2 97% # of class grid Algorithm 5 LS Algorithm 1 LS Algorithm 2 members elem. parameters comp, time accuracy size of 1 . s. comp, time accuracy size of I. s. comp, time accuracy 50 50 15. 1/5 6.15 79% 5 0.22 100% 5 0.11 100% 100 2 5 . 1/10 23.56 8 2 % 5 0.39 99 % 5 0.44 99% 200 2 5 . 1/18 106.5 73% 10 1.1 100% 10 1.16 100% 300 2.1/22 256 9 4 60% 10 2.31 100% 10 2.36 100% 50 100 2 .1 # 9.61 98% 5 0.17 100% 5 0.17 100% 100 2 .1 6 27.96 94% 5 0.6 100% 5 0.6 99% 200 2 .1 # 161.42 9 4 % 10 1.98 100% 10 2.09 100% 50 150 2.1/3 9.89 99% 5 0.28 100% 5 0.22 100% 100 2 . 1/5 35.81 97% 5 0.82 100% 5 0.88 100% 230 2. 1j 6 16631 97% 10 3.03 100% 10 3.07 100% 50 200 2.1/2 9.72 99% 5 0.33 100% 5 0.33 100% 100 2 .1 # 37.9 99% 5 1.05 100% 5 1.05 100% O S Reproduced w ith permission o f th e copyright owner. Further reproduction prohibited without permission. O s Analysis of Data Set 5 Euclidean Distance N oise N(0,10) # of class grid Algorithm 1 Algorithm 3 Algorithm 4 members elem. parameters comp, time accuracy parameter comp, time accuracy parameters comp, time accuracy 50 50 15, 0.99 0.82 78% 0.7 3.52 84% 0.7. 2.2 8.78 93% 100 15. 0.99 2.26 92% 0.72 24.05 90% 0.7, 2.2 37.73 90% 200 13. 0.99 14.11 89% 0.71 18301 88% 0.7. 2.2 25557 93% 300 15. 0.99 11.15 91% 0.71 69739 88% 0.71,23 841.4 92% 50 100 23, 0.99 0.6 95% 0.75 4.83 96% 0.75.25 13.51 96% 100 23, 0.99 2.15 97% 0.75 29.36 98% 0.75,25 54.1 98% 200 20. 0.99 24.55 95% 0.75 20757 95% 0 .7 5 .2 5 323J02 98% 50 150 23. 0.99 1.37 99% 0.76 5.66 99% 0 .7 6 ,2 5 10.93 99% 100 28, 0.99 3.3 98% 0.76 29.94 98% 0 .76,25 56.52 99% 200 27. 0.99 14.05 98% 0.76 19806 99% 0.76,25 31863 99% 50 200 32. 0.99 1.54 98% 0.77 5.82 100% 0.76,25 11.92 100% 100 33. 0.98 4.73 99% 0.77 31.75 99% 0.77.25 66.65 99% # of class grid Algorithms LS Algorithm 1 LS Algorithm 2 members elem. parameters comp, time accuracy size of 1 . s comp, time accuracy size ofl. s. comp, time accuracy 50 50 2 5 . 1/6 5.55 77% 5 0.17 100% 5 0.17 99% 100 2 5 . 
1/7 21.31 91% 5 0.33 100% 5 0.38 99% 200 2.4, 1/14 10892 30% 10 1.21 100% 10 1.15 100% 300 2 5 , 1/22 23228 86% 10 2.47 100% 10 2.41 99% 50 100 2 5 . 1/4 11.42 95% 5 0.16 100% 5 0.17 1CO% 100 2 5 . 1/6 30.76 97% 5 0.55 100% 5 0.6 100% 200 2 5 , 1/10 17735 95% 10 2.14 100% 10 2.15 100% 50 150 2 5 . 1/2 10.27 98% 5 0.27 100% 5 0.33 100% 100 2 5 . 1/4 33.94 99% 5 0.88 100% 5 0.87 100% 200 2 5 . 1/8 16752 96% 10 3.35 100% 10 3.13 100% 50 200 2 5 . 1/2 9.4 100% 5 0.27 100% 5 0.33 100% 100 2 5 . 1/4 44.6 97% 5 1.09 100% 5 1.15 100%