SPAM E-MAIL FILTERING VIA GLOBAL AND USER-LEVEL DYNAMIC ONTOLOGIES

by Seongwook Youn

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2009

Copyright 2009 Seongwook Youn

Dedication

To My Family.

Acknowledgments

First and foremost, I would like to thank my advisor and chair of my committee, Dr. Dennis McLeod, for his guidance over the past 5 years. I am also very thankful to the other members of my committee: Prof. Aiichiro Nakano, Prof. Ellis Horowitz, Prof. Ulrich Neumann, and Prof. Larry Pryor. I was fortunate to be surrounded by many gifted and smart colleagues like Boyoon Jung, Chansook Lim, Dongwoo Won, Bo Mi Song, Jongeun Jun, Seokkyung Chung, Jongwoo Lim, Sang Su Lee, and Jinwoo Kim. I am also very thankful for the help and support I received from the members of the Semantic Information Research Lab. It made my stay in Los Angeles over the past 9 years truly memorable. Finally, I would like to thank my parents and my sister, who were a constant source of inspiration and always helped me to believe in myself.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
List of Algorithms
Abstract
Chapter 1: Introduction
  1.1 Motivation
  1.2 Hypothesis
  1.3 Understanding of an Ontology
  1.4 Objective and Approach
  1.5 Major Contributions
  1.6 Thesis Outline
Chapter 2: Related Work
  2.1 Ontology Development
  2.2 Spam Filtering
Chapter 3: Email Classification (Text Mining)
  3.1 Introduction
  3.2 Text Mining
  3.3 Feature Selection
  3.4 E-mail Classification (Text Mining)
    3.4.1 Neural Network (NN)
    3.4.2 Support Vector Machine (SVM)
    3.4.3 Naive Bayesian Classifier (NB)
    3.4.4 J48 Classifier (J48)
  3.5 Result Evaluation
    3.5.1 Effect of datasets on performance
    3.5.2 Effect of feature size on performance
  3.6 Conclusion
Chapter 4: SPONGY (SPam ONtoloGY) System
  4.1 Introduction
  4.2 SPONGY System
    4.2.1 Approach
    4.2.2 SPONGY Architecture
    4.2.3 SPONGY Implementation
      4.2.3.1 Global ontology creation procedure
      4.2.3.2 User customized ontology creation procedure
  4.3 Evaluation
    4.3.1 Global Ontology with 2108 training dataset
    4.3.2 User Customized Ontology with specific user dataset
    4.3.3 SPONGY system results using global ontology and user customized ontology
  4.4 Conclusion
Chapter 5: Improved Spam Filtering by Extraction of Information from Text Embedded Image E-mail
  5.1 Introduction
  5.2 Related Work
  5.3 Extraction of Information from Text Embedded Images using OCR
  5.4 Image Spam Filtering
  5.5 Experimental Results
  5.6 Comparison with Commercial Filters
    5.6.1 Introduction to Each E-mail System
    5.6.2 Evaluation and Comparison
  5.7 Conclusion
Chapter 6: Spam Decisions on Gray E-mail using a Personalized Ontology
  6.1 Introduction
  6.2 Motivation and Spam E-mail Trends
    6.2.1 Decision on Gray E-mail
    6.2.2 Spam E-mail Trends
      6.2.2.1 PDF Spam
      6.2.2.2 Malware
      6.2.2.3 Blended Spam with Malware Website Links
      6.2.2.4 Address Validation
      6.2.2.5 Discussion
  6.3 Filtering System for Gray E-mail
    6.3.1 Personalized Ontology Filter Implementation
    6.3.2 Experimental Results
  6.4 Conclusion
Chapter 7: Contribution and Conclusion
  7.1 Anticipated Contribution
  7.2 Conclusion
References

List of Tables

3.1 Classification results based on data size (with 55 features)
3.2 Classification results based on feature size
4.1 Classification result of training datasets
4.2 Classification result of 200 user dataset
4.3 Precision and Recall of 200 user dataset
4.4 Classification result of 400 user dataset
4.5 Precision and Recall of 400 user dataset
4.6 Classification result of 600 user dataset
4.7 Precision and Recall of 600 user dataset
4.8 Precision and Recall of whole SPONGY system
4.9 Result of SPONGY system
5.1 Experimental Results of Global filter without OCR
5.2 Experimental Results of SPONGY without OCR
5.3 Experimental Results of Global filter with OCR
5.4 Experimental Results of SPONGY with OCR
5.5 Spam Filter Review 2007 [92]
5.6 Comparison Result with the Same E-mail Data Set
5.7 Comparison Result with the Different E-mail Data Set

List of Figures

3.1 Support Vector Machine (SVM)
3.2 Spam precision based on data size
3.3 Legitimate precision based on data size
3.4 Spam recall based on data size
3.5 Legitimate recall based on data size
3.6 Spam precision based on feature size
3.7 Legitimate precision based on feature size
3.8 Spam recall based on feature size
3.9 Legitimate recall based on feature size
4.1 SPONGY Architecture
4.2 Part of J48 classification result
4.3 Summary of classification result
4.4 Tree of J48 classification result
4.5 Converted RDF file of J48 classification result
4.6 W3C RDF Validation Service
4.7 Triplets of RDF data model
4.8 RDF data model (Ontology)
5.1 Snapshot of OCR Implementation
5.2 Spam Filtering System (Text and Image E-mails)
6.1 Personalized ontology
6.2 PDF Spam with Malware
6.3 Zombie IPs
6.4 Website Redirection using Image Spam
6.5 Address Validation Spam
6.6 Topics of Spam E-mail
6.7 Global Spam Filter
6.8 Personalized Spam Filter
6.9 Experimental Results (ROC)
6.10 Experimental Results

List of Algorithms

1 Global ontology filter pseudo code
2 User-customized ontology filter pseudo code
3 Personalized ontology filter pseudo code

Abstract

E-mail is clearly a very important communication method between people on the Internet. However, the constant increase of e-mail misuse and abuse has resulted in a huge volume of spam e-mail over recent years. As spammers always try to find ways to evade existing filters, new filters need to be developed to catch spam.
In my research to date, e-mail data was classified using four different classifiers: Neural Network, SVM, Naive Bayesian, and C4.5 Decision Tree (J48) classifiers. Experiments were performed with different data sizes and different feature sizes. A feature set is a set of words that characterizes the domain dataset. The final classification result should be '1' if an e-mail is actually spam and '0' otherwise. This work shows that a simple C4.5 Decision Tree classifier, which builds a binary tree, is efficient for datasets whose decisions can be viewed as binary.

We present a new approach to filtering spam e-mail using semantic information represented in ontologies. Ontologies allow for machine-understandable semantics of data [99]. Traditional keyword-based filters rely on manually constructed pattern-matching rules, but spam e-mail varies from user to user and also changes over time. Hence, an adaptive learning filtering technique is deployed in our system. An experimental system has been designed and implemented with the hypothesis that this method would outperform existing techniques; experimental results showed that the proposed ontology-based approach indeed improves spam filtering accuracy significantly. We also deploy an image e-mail handling capability by extracting information from text-embedded image e-mail using OCR. Additionally, we improve the spam filter by using a personalized ontology for spam decisions on gray e-mail. In the proposed SPONGY (SPam ONtoloGY) system, two levels of ontology spam filters were implemented: a first-level global ontology filter and a second-level user-customized ontology filter. The global ontology filter filtered about 91% of spam, which is comparable with other methods. The user-customized ontology filter was created based on the specific user's background as well as the filtering mechanism used in the global ontology filter creation. Using the user-customized ontology filter, we measured the performance improvement in terms of precision, recall, and accuracy of classification. Through a set of experiments, it was shown that better classification performance (about 95%) can be achieved using the user-customized ontology filter, which is adaptive and scalable. The main contributions of this thesis are 1) to introduce an ontology-based multi-level filtering technique that uses both a global ontology and an individual filter for each user to increase spam filtering accuracy, and 2) to create a spam filter in the form of an ontology, which is user-customized, scalable, and modularized, so that it can be embedded within other systems for better performance.

Chapter 1: Introduction

1.1 Motivation

E-mail has been an efficient and popular communication mechanism as the number of Internet users has increased. Consequently, e-mail management has become an important and growing problem for individuals and organizations, because e-mail is prone to misuse. The blind posting of unsolicited e-mail messages, known as spam, is an example of such misuse. Spam is commonly defined as the sending of unsolicited bulk e-mail, that is, e-mail that was not asked for, to multiple recipients. A further common definition restricts spam to unsolicited commercial e-mail, a definition that does not consider non-commercial solicitations such as political or religious pitches, even if unsolicited, as spam. E-mail is by far the most common form of spamming on the Internet.
According to data estimated by Ferris Research [36], spam accounts for 15% to 20% of e-mail at U.S.-based corporate organizations. Half of e-mail users receive 10 or more spam e-mails per day, while some receive up to several hundred unsolicited e-mails. International Data Group [49] expected that global e-mail traffic would surge to 60 billion messages daily by 2006. Spamming involves sending identical or nearly identical unsolicited messages to a large number of recipients. Unlike legitimate commercial e-mail, spam is generally sent without the explicit permission of the recipients, and frequently contains various tricks to bypass e-mail filters. Modern computers generally come with some capacity to send spam; the only necessary ingredient is a list of addresses to target. Spammers obtain e-mail addresses by a number of means: harvesting addresses from Usenet postings, DNS listings, or Web pages; guessing common names at known domains (known as a dictionary attack); and "e-pending", or searching for e-mail addresses corresponding to specific persons, such as residents in an area. Many spammers utilize programs called Web spiders to find e-mail addresses on web pages, although it is possible to fool a Web spider by substituting the "@" symbol with another symbol, for example "#", when posting an e-mail address. As a result, users have to waste their valuable time deleting spam e-mails. Moreover, because spam e-mails can fill up the storage space of a file server quickly, they can cause a very severe problem for many websites with thousands of users. Currently, much work on spam e-mail filtering has been done using techniques such as decision trees, Naive Bayesian classifiers, neural networks, etc. To address the problem of growing volumes of unsolicited e-mails, many different methods for e-mail filtering are being deployed in commercial products. We constructed a framework for efficient e-mail filtering using an ontology. Ontologies allow for machine-understandable semantics of data, so they can be used in any system [93] [99]. It is important to share this information for more effective spam filtering. Thus, it is necessary to build an ontology and a framework for efficient e-mail filtering. Using an ontology that is specially designed to filter spam, a significant amount of unsolicited bulk e-mail could be filtered out by the system. We propose to find an efficient spam e-mail filtering method using an ontology. We used the Waikato Environment for Knowledge Analysis (Weka) explorer and Jena to build an ontology based on sample datasets. Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL, and SPARQL, and includes a rule-based inference engine. E-mails can be classified using different methods. Different people or e-mail agents may maintain their own personal e-mail classifiers and rules. The problem of spam filtering is not a new one, and there are already dozens of different approaches that have been implemented. The problem has mostly been studied in areas like artificial intelligence and machine learning. The various implementations had different trade-offs, different performance metrics, and different classification efficiencies. Techniques such as decision trees, Naive Bayesian classifiers, Support Vector Machines, and Neural Networks achieved varying classification efficiencies.

1.2 Hypothesis

We use a multilevel ontology filter to filter out spam.
The first level is a global ontology filter that is similar to other popular spam filters. The second level is a user-customized ontology filter with semantic metadata that is adaptive. A user-customized ontology filter is created depending on a specific user's background, so it can increase the accuracy of spam filtering. Current spam filters are static or adaptive only globally; a user-customized ontology filter, however, can evolve differently according to users' different backgrounds. Hence, more spam will be filtered out by the second-level user-customized ontology filter even when some spam is not caught by the first-level global ontology filter. In this thesis, the user-customized ontology filter is shown to improve performance in terms of precision, recall, and accuracy of classification.

1.3 Understanding of an Ontology

An ontology is an explicit specification of a conceptualization. Ontologies can be taxonomic hierarchies of classes, class definitions, or subsumption relations, but need not be limited to these forms. Also, ontologies are not limited to conservative definitions. To specify a conceptualization, one needs to state axioms that constrain the possible interpretations of the defined terms [43]. Ontologies play a key role in capturing domain knowledge and providing a common understanding. Generally, ontologies consist of a taxonomy, a class hierarchy, a domain knowledge base, and relationships between classes and instances. An ontology has different relationships depending on the schema or taxonomy builder, and it has different restrictions depending on the language used. Also, the domain, range, and cardinality differ based on the ontology builder. Ontologies allow for machine-understandable semantics of data, and facilitate the search, exchange, and integration of knowledge for business-to-business (B2B) and business-to-consumer (B2C) e-commerce. By using semantic data, the usability of e-technology can be improved. There are several languages, such as the extensible markup language (XML), the resource description framework (RDF), RDF schema (RDFS), DAML+OIL, and OWL, and many tools have been developed for implementing ontology metadata using these languages. However, current tools have problems with interoperation and collaboration. Ontology tools can be applied to all stages of the ontology life cycle, including creation, population, implementation, and maintenance [78]. An ontology can be used to support various types of knowledge management, including knowledge retrieval, storage, and sharing [80]. In one of the most popular definitions, an ontology is the specification of shared knowledge [94]. For a knowledge-management system, an ontology can be regarded as the classification of knowledge. Ontologies are different from traditional keyword-based search engines in that they are metadata, able to provide the search engine with the functionality of semantic matching. Hence, ontologies can support more efficient search than traditional methods. An ontology consists of hierarchical descriptions of important concepts in a domain and descriptions of the properties of each concept. Traditionally, ontologies are built by both highly trained knowledge engineers and domain specialists who may not be familiar with computer software. Ontology construction is a time-consuming and laborious task. XML is not enough to describe machine-understandable documents and the interrelationships of resources in an ontology [44].
Therefore, the W3C has recommended the use of RDF, RDFS, DAML+OIL, and OWL. Since then, many tools have been developed for implementing the metadata of ontologies through the use of these languages.

1.4 Objective and Approach

The training datasets are the set of e-mails that gives us a classification result. The test data is the e-mail that is run through our system, which we test to see whether it is classified correctly as spam or not. This is an ongoing test process, so the test data is not finite because of the learning procedure, and it will sometimes merge with the training data. The training datasets were used as input to J48 classification. To do that, the training datasets have to be modified into a compatible input format. The SPONGY system gives us the classification result using J48. To query the test e-mail in Jena, an ontology has to be created based on the classification result. To create the ontology, an ontology language is required; the Resource Description Framework (RDF) was used. The classification result in RDF format was input to Jena, the RDF was deployed through Jena, and finally an ontology was created. The ontology generated in the form of an RDF data model is the base against which incoming mail is checked for its legitimacy. Depending upon the assertions that we can conclude from the outputs of Jena, the e-mail can be classified as spam or legitimate. The test e-mail is in the format that Jena will take in (i.e., a CSV format) and is run through the ontology, which results in a spam or legitimate decision. The SPONGY system periodically updates the datasets with the e-mails classified as spam when a user spam report is requested. Then, the enlarged training datasets are input to Weka to get a new classification result. Based on that classification result, we can get a new ontology, which can be used as a second spam filter (that is, the user-customized ontology). Through this procedure, the number of ontologies increases. Finally, this spam-filtering ontology will be customized for each user. User-customized ontology filters differ from each other depending on each user's background, preference, hobby, etc. That means one e-mail might be spam for person A, but not for person B. The user-customized ontology evolves periodically and adaptively. The SPONGY system provides an evolving spam filter based on users' preferences, so users can get better spam filtering results. The input to the system is mainly the training datasets and then the test e-mail. The test e-mail is the first set of e-mails that the system will classify and learn from; after a certain time, the system will take a variety of e-mails as input to be filtered as spam or legitimate. For the training datasets which we used, several classification algorithms, including Naive Bayesian, Neural Network, SVM, and J48, were tested; J48 and the Naive Bayesian classifier showed good performance on the training e-mail datasets [100]. The classification results from Weka need to be converted to an ontology. The classification result obtained through the J48 decision tree was mapped into RDF format. This was given as an input to Jena, which then built the ontology for us. This ontology enabled us to decide how the different headers and the data inside the e-mail are linked, based upon the word frequencies of the words or characters in the datasets. The mapping also enabled us to obtain assertions about the legitimacy and non-legitimacy of the e-mails.
The next part was using this ontology to decide whether a new e-mail is spam or legitimate. Queries against the obtained ontology were processed again through Jena. The output obtained after querying was the decision that the new e-mail is spam or legitimate. In summary, a test e-mail is checked for spam or legitimacy against the global ontology created from the training datasets, and incorrectly filtered e-mails are checked again against the user-customized ontology created from the user's spam reports. With the help of the adaptive user-customized ontology, the total spam filtering rate will be increased [101].

1.5 Major Contributions

Current e-mail filters are static or adaptive only globally; in my research, however, the user-customized ontology for each user evolves differently according to the user's background. Hence, more spam will be filtered out by the second-level user-customized ontology filter even when some spam is not caught by the first-level global ontology filter. The primary way for the user to let the system know would be through a GUI or a command-line input with a simple 'yes' or 'no'. This would all be part of a full-fledged working system, as opposed to our prototype, which is a basic research model.

1.6 Thesis Outline

The remaining chapters are organized as follows:
- Chapter 2 discusses related work and how our work differs from others.
- Chapter 3 discusses feature selection and the text mining methods used in the research.
- Chapter 4 introduces the proposed spam e-mail filtering system using an ontology.
- Chapter 5 introduces and discusses an improved spam filtering mechanism based on extraction of information from text-embedded image e-mail.
- Chapter 6 discusses spam decisions on gray e-mail using a personalized ontology.
- Chapter 7 presents the contributions of the research and concludes the thesis.

Chapter 2: Related Work

2.1 Ontology Development

Ontology tools can be applied to all stages of the ontology lifecycle, including the creation, population, implementation, and maintenance of ontologies [78]. An ontology can be used to support various types of knowledge management, including knowledge retrieval, storage, and sharing [80]. In one of the most popular definitions, an ontology is "the specification of shared knowledge" [94]. For a knowledge management system, an ontology can be regarded as the classification of knowledge. Ontologies are different from traditional keyword-based search engines in that they are metadata, able to provide the search engine with the functionality of semantic matching. Ontologies can therefore support more efficient search than traditional methods. Typically, an ontology consists of a hierarchical description of important concepts in a domain and descriptions of the properties of each concept. Traditionally, ontologies are built by both highly trained knowledge engineers and domain specialists who may not be familiar with computer software. Ontology construction is a time-consuming and laborious task. Ontology tools also require users to be trained in knowledge representation and predicate logic. XML is not suited to describing machine-understandable documents and the interrelationships of resources in an ontology [44]. Therefore, the W3C has recommended the use of the resource description framework (RDF), RDF schema (RDFS), DAML+OIL, and OWL. Since then, many tools have been developed for implementing the metadata of ontologies using RDF, RDFS, DAML+OIL, and OWL.
Ontology tools have to support more expressive power and scalability with a large knowledge base, as well as reasoning for querying and matching. They also need to support the use of a high-level language, modularity, visualization, etc. There is also research on, and there are applications of, dynamic web pages consisting of database reports. Research on ontology integration tasks in B2B e-commerce is also ongoing: the infrastructure of business documentation from the integration perspective and the identification of the integration subtasks were suggested in [69]. There is research on a generic e-Business model ontology for the development of tools for e-business management and IS requirements engineering; based on an extensive literature review, the e-Business Model Ontology describes the logic of a business system [71].

2.2 Spam Filtering

[88] and [97] developed an algorithm to reduce the feature space without sacrificing remarkable classification accuracy, but its effectiveness depended on the quality of the training dataset. [96] demonstrated the feasibility of finding the best learning algorithm and the metadata to be used, which is a very significant contribution in e-mail classification using the Rainbow system. [5] proposed a graph-based mining approach for e-mail classification in which structures/patterns can be extracted from a pre-classified e-mail folder and used effectively for classifying incoming e-mail messages. Approaches to filtering junk e-mail are considered in [24][28][86]. [34] and [48] showed approaches to filtering e-mails that involve the deployment of data mining techniques. [26] proposed a model based on the Neural Network (NN) to classify personal e-mails and the use of Principal Component Analysis (PCA) as a preprocessor of the NN to reduce the data in terms of both dimensionality and size. [6] compared the performance of the Naive Bayesian filter to an alternative memory-based learning approach on spam filtering. [64] addressed the problem by proposing a Word Sense Disambiguation (WSD) approach based on the intuition that word proximity in the document implies proximity also in the Hierarchical Thesauri (HT) graph. Bringing in other kinds of features, which are spam-specific features in their work, could improve the classification results [86]. A good performance was obtained by reducing the classification error through discovering temporal relations in an e-mail sequence, in the form of temporal sequence patterns, and embedding the discovered information into content-based learning methods [55]. [66] presented work on spam filtering using feature selection based on heuristics. [60] presented a technique to help various classifiers improve the mining of category profiles: upon receiving a document, the technique helps to create dynamic category profiles with respect to the document, and accordingly helps to make proper filtering and classification decisions. [96][97] compared a cross-experiment between 14 classification methods, including decision tree, Naive Bayesian, Neural Network, linear least squares fit, and Rocchio; KNN is one of the top performers, and it scales well to very large and noisy classification problems. There is also spam filtering research using social networks, in which social networks were used for judging the trustworthiness of outsiders.
They exploited the properties of social networks to distinguish between unsolicited commercial e-mail and messages associated with people the user knows [14][56][57]. In contrast to previous approaches, an ontology is used in our approach. In addition, J48 is used to classify the training dataset. The ontology created by the implementation is modular, so it can be used in another system. In our previous classification experiment, J48 showed better results than the Naive Bayesian, Neural Network, or Support Vector Machine (SVM) classifiers.

Chapter 3: Email Classification (Text Mining)

3.1 Introduction

I broadly classify the previous work related to my research into three categories:
1. Text Mining
2. Feature Selection
3. E-mail Classification

3.2 Text Mining

Text mining is a part of data mining. Data mining is defined as "the non-trivial process of identifying valid, novel, potentially useful, previously unknown and ultimately understandable patterns in large databases" [35]. Most data mining work has been based on structured data or databases. Adibi motivated data mining with "We are drowning in data but starving for knowledge" [1]. According to [45], text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources; a key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation. Berland and Charniak [11] used techniques similar to Hearst [46]. Mann [63] proposed the use of lexico-POS based patterns for constructing an ontology of is-a relations for proper nouns. He used the manually crafted pattern CN PN and reported generating 200,000 unique descriptions of proper nouns with 60% accuracy. Moldovan and Girju [68] gave a good overview of the various text mining techniques. In [75], Pasca used a pattern-based technique to extract is-a relations from a corpus of 500 web pages. Pasca [74] suggested a clustering technique to group together nuggets that have a high overlap. He also proposed a technique for extracting glosses for nodes present in WordNet from a corpus of 500 million web pages; his paper showed that 60% of the WordNet nodes had at least one gloss extracted from the web corpus. Etzioni et al. [30][31][32] extract instance-concept relations from a huge web corpus; to perform the task, a combination of pattern learning, subclass extraction, and list extraction was used. Ciaramita and Johnson [23] use a fixed set of 26 semantic labels to perform is-a supersense tagging. Caraballo [19] used a clustering technique to extract hyponym relations from a newspaper corpus. A similar method was also used by Pantel and Ravichandran [73], who used the Clustering by Committee (CBC) algorithm [72] to extract clusters of nouns belonging to the same class. Cederberg and Widdows [20] use Latent Semantic Analysis and a noun co-ordination (co-occurrence) technique to extract hyponyms from a newspaper corpus, reporting 58% accuracy with their approach. Snow et al. [90] exploited rich co-occurrence and pattern-based features to extract is-a relations from text, but the technique was not web-scalable.

3.3 Feature Selection

Nowadays, extraction of domain-specific semantic knowledge from given texts is becoming very important. Such semantic knowledge, in the form of ontologies, can facilitate integration of information from various sources.
Additionally, ontologies enable the maintenance and visualization of knowledge. However, in most cases, ontologies are still created manually, which is error-prone, time-consuming, and labor-intensive. Moreover, manual ontology creation has a critical weakness in that the ontology usually reflects the inherent knowledge and biases of its creator, which may not be shared across people. Such weakness could be significantly reduced through automatic ontology creation. Therefore, it would be very desirable to have a (semi- or fully) automatic method for acquiring a domain ontology. Faure and Nedellec [33] proposed applying two techniques from the field of Natural Language Processing (NLP), namely verb subcategorization and noun clustering, for ontology learning; this was one of the early attempts at ontology learning. Kietz et al. [53] developed a method for semi-automatic ontology acquisition for a corporate intranet (e.g., an insurance company). A number of heuristics were used to organize a concept hierarchy for the target ontology. While creating an ontology, a human domain expert was expected to be on hand to intervene in the process by comparing the resulting ontology with a reference ontology. Navigli et al. [70] adapted techniques from Information Retrieval and Machine Learning to resolve ambiguity in the meaning of words and their semantic relationships, which is crucial to creating a domain ontology; the performance of their method was evaluated with respect to a number of web pages on travel. Other techniques from Machine Learning and Information Retrieval for creating ontologies have been outlined in [62]. However, the majority of this work has tried to learn ontologies for relatively constrained domains. To date, there has been relatively little work on constructing ontologies for a public domain. Furthermore, tf-idf is typically used to determine words for the domain ontology concepts. Since tf-idf purely reflects the frequency-based importance of words, it cannot capture dependencies, such as those between a concept in the domain and the words that correspond to that concept. Text learning techniques, such as statistical feature selection methods, have proven to be useful in extracting more informative words from a given text for a given text learning task. However, there have been few studies that empirically examine the value of text learning techniques for extracting a set of candidate words for concept words in an ontology. To make use of existing feature selection methods for the extraction of a set of good candidate words for concept words in an ontology, we use a number of existing feature selection methods to identify sets of candidate concept words; these sets are then evaluated with respect to manually created domain ontologies [7]. Generally, feature selection refers to selecting a set of features which is more informative for executing a given machine learning task while removing irrelevant or redundant features. This process ultimately reduces the dimensionality of the original feature space, but the selected feature set should contain sufficient and reliable information about the original data set. For the text domain, this is formulated as the problem of identifying the most informative word features within a set of documents for a given text learning task. Feature selection methods have relied heavily on the analysis of the characteristics of a given data set through statistical or information-theoretical measures. For text learning tasks, for example, they primarily count on the vocabulary-specific characteristics of the given textual data set to identify good word features (more informative terms which can characterize the data set). Although the statistics themselves do not take the meaning of the text into account, these methods have proved to be efficient for text learning tasks (e.g., classification and clustering). In our study, tf-idf was considered as the feature selection mechanism.
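For reference, a common formulation of the tf-idf weight is shown below. This is a standard textbook form added here for clarity; the thesis does not spell out the exact variant it uses, so the notation is ours:

$\mathrm{tfidf}(t,d) = tf(t,d) \cdot \log \frac{N}{df(t)}$

where $tf(t,d)$ is the number of occurrences of term $t$ in document (e-mail) $d$, $N$ is the number of documents in the corpus, and $df(t)$ is the number of documents containing $t$. Terms with the highest weights are taken as candidate features.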
Feature selection methods have relied heavily on the analysis of the characteristics of a given data set through statistical or information-theoretical measures. For text learning tasks, for example, they primarily count on the vocabulary-specific characteristics of given textual data set to identify good word features (more informative term which can characterize the data set). Although the statistics itself does not care about the meaning of text, these methods have been 19 proved to be efficient for text learning tasks (e.g., classification and clustering). In our study, tf idf was considered as a feature selection mechanism. 3.4 E-mail Classification (Text Mining) Generally, the main tool for e-mail management is text classification. A classifier is a system that classifies texts into the discrete sets of predefined categories. For the e- mail classification, incoming messages will be classified as spam or legitimate using classification methods. 3.4.1 Neural Network (NN) Classification method using a NN was used for e-mail filtering long time ago. Gener- ally, the classification procedure using the NN consists of three steps, data preprocess- ing, data training, and testing. The data preprocessing refers to the feature selection. Feature selection is the way of selecting a set of features which is more informative in the task while removing irrelevant or redundant features. For the text domain, feature selection process will be formulated into the problem of identifying the most relevant word features within a set of text documents for a given text learning task. For the data training, the selected features from the data preprocessing step were fed into the NN, and an e-mail classifier was generated through the NN. For the testing, the e-mail 20 classifier was used to verify the efficiency of NN. In the experiment, an error BP (Back Propagation) algorithm was used. 3.4.2 Support Vector Machine (SVM) SVMs are a relatively new learning process influenced highly by advances in statis- tical learning theory. SVMs have led to a growing number of applications in image classification and handwriting recognition. Before the discovery of SVMs, machines were not very successful in learning and generalization tasks, with many problems be- ing impossible to solve. SVMs are very effective in a wide range of bioinformatic problems. SVMs learn by example. Each example consists of a m number of data points(x 1 ;:::;x m ) followed by a label, which in the two class classification we will con- sider later, will be +1 or -1. -1 representing one state and 1 representing another. The two classes are then separated by an optimum hyperplane, illustrated in figure 1, mini- mizing the distance between the closest +1 and -1 points, which are known as support vectors. The right hand side of the separating hyperplane represents the +1 class and the left hand side represents the -1 class. This classification divides two separate classes, which are generated from training examples. The overall aim is to generalize well to test data. 
This is obtained by introducing a separating hyperplane, which must maxi- mize the margin () between the two classes, this is known as the optimum separating hyperplane 21 Figure 3.1: Support Vector Machine (SVM) Lets consider the above classification task with data pointsx i , i=1,...,m, with corre- sponding labelsy i =§1, with the following decision function: f(x)=sign(w¢x+b) By considering the support vectors x1 and x2, defining a canonical hyperplane, maximizing the margin, adding Lagrange multipliers, which are maximized with re- spect to : W (®)= m X i=1 ® i ¡ m X i;j=1 ® i ® j y i y j (x i ¢x j ) 22 Ã m X i=1 ® i y i =0;® i ¸0 ! 3.4.3 Naive Bayesian Classifier (NB) Naive Bayesian classifier is based on Bayes’ theorem and the theorem of total prob- ability. The probability that a document d with vector ~ x =< x i ;:::;x n > belongs to category c is P ³ C =cj ~ X =~ x ´ = P (C =c)¢P ³ ~ X =~ xjC =c ´ Q k2fspam;legitg P (C =k)¢P ³ ~ X =~ xjC =k ´ However, the possible values of ~ Xare too many and there are also data sparseness problems. Hence, Naive Bayesian classifier assumes that X 1 ;:::;X n are conditionally independent given the category C. Therefore, in practice, the probability that a docu- ment d with vector~ x=<x 1 ;:::;x n > belongs to category c is P ³ C =cj ~ X =~ x ´ = P (C =c)¢ Q n i=1 P ³ ~ X =~ xjC =c ´ Q k2fspam;legitg P (C =k)¢ Q n i=1 P ³ ~ X =~ xjC =k ´ 23 P (X i jC) and P(C) are easy to obtain from the frequencies of the training datasets. So far, a lot of researches showed that the Naive Bayesian classifier is surprisingly effective. 3.4.4 J48 Classifier (J48) J48 classifier is a simple C4.5 decision tree for classification. It creates a binary tree. 3.5 Result Evaluation In this section, four classification methods (Neural Network, Support Vector Machine classifier, Naive Bayesian classifier, and J48 classifier) were evaluated the effects based on different datasets and different features. Finally, the best classification method was obtained from the training datasets. 4500 e-mails were used as a training datasets. 38.1% of datasets were spam ad 61.9% were legitimate e-mail. To evaluate the classi- fiers on training datasets, we defined an accuracy measure as follows. Accuracy(%)= CorrectlyClassifiedE¡mails TotalE¡mails ¢100 Also, Precision and Recall were used as the metrics for evaluating the performance of each e-mail classification approach. 24 Recall= N ii N ;Precision= N ii N i N =#OfTotalInterestingE¡mails N i =#OfE¡mailsClassifiedAsInteresting N ii =#OfInterestingE¡mailsClassifiedAsInteresting 3.5.1 Effect of datasets on performance An experiment measuring the performance against the size of datasets was conducted using datasets of different sizes listed in Table 3.1. The experiment was performed with 55 features from tfidf. For example, in case of 1000 datasets, Accuracy was 95.80% using J48 classifier. A few observations can be made from this experiment. As shown on Table 3.1, the average of correct classification rate for both J48 and NB was over 95%. Size of datasets was not an important factor in measuring precision and recall. The results show that the performance of classification was not stable. For four different classification methods, precision of spam mail was shown in Fig- ure 3.2, likewise, precision of legitimate mail was shown in Figure 3.3. As shown on 25 Table 3.1: Classification results based on data size. 
3.5.1 Effect of datasets on performance

An experiment measuring performance against the size of the datasets was conducted using datasets of different sizes, listed in Table 3.1. The experiment was performed with 55 features from tf-idf. For example, in the case of the 1000-e-mail dataset, accuracy was 95.80% using the J48 classifier. A few observations can be made from this experiment. As shown in Table 3.1, the average correct classification rate for both J48 and NB was over 95%. The size of the datasets was not an important factor in measuring precision and recall. The results show that the performance of classification was not stable. For the four classification methods, the precision for spam mail is shown in Figure 3.2 and the precision for legitimate mail in Figure 3.3.

Table 3.1: Classification results based on data size (with 55 features)

Data Size   NN       SVM      Naive Bayesian   J48
1000        93.50%   92.70%   97.20%           95.80%
2000        97.15%   95.00%   98.15%           98.25%
3000        94.17%   92.40%   97.83%           97.27%
4000        89.60%   91.93%   97.75%           97.63%
4500        93.40%   90.87%   96.47%           97.56%

As shown in Figures 3.2, 3.3, 3.4, and 3.5, the precision and recall curves of the J48 and NB classifiers were better than those of NN and SVM. Also, the average precision and recall for both J48 and NB were over 95%. In Figure 3.5, the legitimate recall values decreased sharply at data size 2000; the increase of spam mail in the training datasets between 1000 and 2000 results in a sharp decrease of legitimate recall values for all classifiers.

Figure 3.2: Spam precision based on data size
Figure 3.3: Legitimate precision based on data size
Figure 3.4: Spam recall based on data size
Figure 3.5: Legitimate recall based on data size

3.5.2 Effect of feature size on performance

The other experiment, measuring performance against feature size, was conducted using the different feature sets listed in Table 3.2. The 4500 e-mail dataset was used for this experiment. For example, in the case of 10 features, accuracy was 94.84% using the J48 classifier. The most frequent words in spam mail were selected as features. Generally, the classification results improved for all classification methods as the feature size increased.

Table 3.2: Classification results based on feature size

Feature Size   NN       SVM      Naive Bayesian   J48
10             83.60%   81.91%   92.42%           94.84%
20             89.87%   85.73%   95.60%           96.91%
30             93.31%   88.87%   95.64%           97.56%
40             92.13%   89.93%   97.49%           97.13%
50             93.18%   90.27%   96.84%           97.67%
55             93.10%   90.84%   97.64%           97.56%

As shown in Figures 3.6, 3.7, 3.8, and 3.9, the order of classification quality in the experiment was J48, NB, NN, and SVM for all cases (spam precision, legitimate precision, spam recall, and legitimate recall). The overall precision and recall for e-mail classification increase and become stable as the number of features increases; accuracy gradually increases and finally saturates with increasing feature size. As shown in Figures 3.6 and 3.7, the J48 classifier provided precision over 95% for every feature size, irrespective of spam or legitimate. Also, the J48 classifier achieved over 97% classification accuracy for feature sizes of 30 or more. For recall, J48 and NB showed better results than NN and SVM for both spam and legitimate mail, but J48 was a little better than NB.

Figure 3.6: Spam precision based on feature size
Figure 3.7: Legitimate precision based on feature size
Figure 3.8: Spam recall based on feature size
Figure 3.9: Legitimate recall based on feature size

3.6 Conclusion

In this work, four classifiers, including Neural Network, SVM, Naive Bayesian, and J48, were tested to filter spam from the datasets of e-mails. All the e-mails were classified as spam (1) or not (0); that is the characteristic of the e-mail datasets for spam filtering. J48 is a very simple classifier that builds a decision tree, but it gave efficient results in the experiment. The Naive Bayesian classifier also showed good results, but Neural Network and SVM did not show good results compared with J48 or the Naive Bayesian classifier; they were not appropriate for datasets that require a binary decision. From this experiment, we find that a simple J48 classifier can provide better classification results for spam mail filtering.
In the near future, we plan to incorporate other techniques, such as different ways of feature selection and classification using an ontology. The classification result could also be used in the Semantic Web by creating a modularized ontology based on the classified result. There are many different mining and classification algorithms, and many parameter settings for each algorithm; the experimental results here are based on the default settings. Extensive experiments with different settings are possible in Weka. Moreover, different algorithms which are not included in Weka can be tested, and experiments with various feature selection techniques should be compared. Furthermore, we plan to create an adaptive ontology as a spam filter based on the classification result. This ontology will then be evolved and customized based on the user's reports when a user requests a spam report. By creating a spam filter in the form of an ontology, the filter will be user-customized, scalable, and modularized, so it can be embedded in many other systems. This ontology may also be used to block porn web sites or filter out spam e-mails on the Semantic Web.

Chapter 4: SPONGY (SPam ONtoloGY) System

4.1 Introduction

I broadly classify the work into three categories:
1. Approach
2. SPONGY Architecture
3. SPONGY Implementation

4.2 SPONGY System

We named the spam filtering system that uses an adaptive ontology the SPONGY (SPam ONtoloGY) system.

4.2.1 Approach

An initial assumption was that decision trees would provide the intelligence behind the classification, but this was not enough, because a decision tree ultimately is not a true ontology and querying a decision tree is also not easy. Once we had narrowed down the type of decision tree we were going to use, the next step was to create an ontology based on the classification result from J48. The Resource Description Framework (RDF), which takes the form of "Subject - Predicate - Object", was used to create the ontology. Hence, our second main assumption was that we would need to map the decision tree into a formal ontology and query this ontology using our test e-mail to be classified as spam or not. The test e-mail is another thing we needed to consider, because it is very difficult to deploy our system in such a way that it could read incoming mail on a mail server, and doing so would require a lot of extra work that would make the system unnecessarily complicated. The initial step was to gather a good dataset on which the decision tree would be based. This data should reflect the characteristics of spam e-mail as well as non-spam e-mail. Also, the attributes and the values for each type of e-mail must be such that the decision tree based on the training data is not biased. We evaluated a number of decision tree implementations and decided to use the Weka explorer's implementation of the J48 decision tree. The J48 tree is an implementation of the C4.5 decision tree. The tree accepts input in Attribute-Relation File Format (ARFF). ARFF files have two distinct sections: the first section is the header information, which is followed by the data information.

@relation <relation-name>
@attribute <attribute-name> <datatype>
@attribute <classifier> {class1, class2, ...}
@data

The header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. Each data instance is represented on a single line, with a carriage return denoting the end of the instance.
Attribute values for each instance are delimited by commas. The order declared in the header section must be maintained (i.e., the data corresponding to the nth @attribute declaration is always the nth field of the instance). Missing values are represented by a single question mark. The training dataset was converted to ARFF format, and a decision tree was formed from it. This decision tree is a type of ontology.

@relation spamchar
@attribute word_freq_make real
@attribute word_freq_address real
@attribute word_freq_all real
@attribute word_freq_3d real
@attribute word_freq_our real
@attribute word_freq_over real
@attribute word_freq_remove real
@attribute word_freq_internet real
@attribute word_freq_order real
@attribute word_freq_mail real
@attribute ifspam {1,0}
@data
0,0.64,0.64,0,0.32,0,0,0,0,0,0
0,0.67,0.23,0,0.17,0.6,1.6,0,1,0.9,1

The above file is a sample ARFF file. The word next to @relation is just a name; it could be the name of the file, and it simply signifies a header. The word next to each @attribute is a feature element on the basis of which the classification is done and the tree is built, and the value after it is its type. The last attribute in this list must be the final classifier we are looking for: in this case, the final classification result should be '1' if the e-mail is spam and '0' if it is not. All the leaf nodes in the classification result should be '1' or '0'. It is a rule of the ARFF file that the last attribute be the final classification result needed. After @data, sets of attribute values are placed. The number of values equals the number of attributes, and the order is such that the first value in each data row corresponds to the first attribute. For example, for the first mail, word_freq_make is 0 and word_freq_all is 0.64; similarly, for the second mail, word_freq_make is 0 and word_freq_all is 0.23. These values are calculated as follows:

100 * (number of occurrences of the attribute's word or character) / (total number of words in the e-mail)

In both data rows, the last value is either 0 or 1, which means that the mail should be classified as spam if it is 1 and as not spam if it is 0.
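To make the classification step concrete, the following is a minimal sketch of how an ARFF file like the one above can be loaded and a J48 tree trained with the Weka Java API. This is our own illustration, not code from the SPONGY implementation; the file name is hypothetical.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainSpamTree {
        public static void main(String[] args) throws Exception {
            // Load the training e-mails from an ARFF file (hypothetical path).
            Instances data = DataSource.read("spamchar.arff");
            // The last attribute (ifspam) is the class label.
            data.setClassIndex(data.numAttributes() - 1);

            // Build the C4.5-style decision tree used as the basis of the filter ontology.
            J48 tree = new J48();
            tree.buildClassifier(data);

            // Print the learned tree; this textual form is what gets mapped to RDF.
            System.out.println(tree);
        }
    }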
4.2.2 SPONGY Architecture

Figure 4.1 shows our framework to filter spam, named the SPONGY (SPam ONtoloGY) system. The training dataset is the set of e-mail that gives us a classification result. The test data is the e-mail that is run through our system, which we test to see whether it is classified correctly as spam or not. This is an ongoing test process, so the test data is not finite because of the learning procedure, and it will sometimes merge with the training data. The training dataset was used as input to J48 classification. To do that, the training dataset has to be modified into a compatible input format. The SPONGY system gives us the classification result using J48. To query the test e-mail in Jena, an ontology has to be created based on the classification result. To create the ontology, an ontology language was required; the Resource Description Framework (RDF) was used. The classification result in RDF format was input to Jena, the RDF was deployed through Jena, and finally an ontology was created. Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL, and SPARQL, and includes a rule-based inference engine. The ontology generated in the form of an RDF data model is the base against which incoming mail is checked for its legitimacy. Depending upon the assertions that we can conclude from the outputs of Jena, the e-mail can be classified as spam or legitimate. The test e-mail is in the format that Jena will take in (i.e., a CSV format) and is run through the ontology, which results in a spam or legitimate decision.

Figure 4.1: SPONGY Architecture

The SPONGY system periodically updates the dataset with the e-mails classified as spam when a user spam report is requested. Then, the modified training dataset is input to Weka to get a new classification result. Based on that classification result, we can get a new ontology, which can be used as a second spam filter (that is, the user-customized ontology). Through this procedure, the number of ontologies increases. Finally, this spam-filtering ontology will be customized for each user. User-customized ontology filters differ from each other depending on each user's background, preference, hobby, etc. That means one e-mail might be spam for person A, but not for person B. The user-customized ontology evolves periodically and adaptively. The SPONGY system provides an evolving spam filter based on users' preferences, so users can get better spam filtering results. The input to the system is mainly the training dataset and then the test e-mail. The test e-mail is the first set of e-mails that the system will classify and learn from; after a certain time, the system will take a variety of e-mails as input to be filtered as spam or legitimate. For the training dataset which we used, several classification algorithms, including Naive Bayesian, Neural Network, SVM, and J48, were tested; J48 and the Naive Bayesian classifier showed good performance on the training e-mail dataset [26]. The classification results from Weka need to be converted to an ontology. The classification result obtained through the J48 decision tree was mapped into RDF format. This was given as an input to Jena, which then built the ontology for us. This ontology enabled us to decide how the different headers and the data inside the e-mail are linked, based upon the word frequencies of the words or characters in the dataset. The mapping also enabled us to obtain assertions about the legitimacy and non-legitimacy of the e-mails. The next part was using this ontology to decide whether a new e-mail is spam or legitimate. This required querying the obtained ontology, which was again done through Jena. The output obtained after querying was the decision that the new e-mail is spam or legitimate. In summary, a test e-mail is checked for spam or legitimacy against the global ontology created with the training dataset, and mis-filtered e-mails are checked again against the user-customized ontology created with the user's spam reports. With the help of the adaptive user-customized ontology, the total spam filtering rate will be increased. The primary way for the user to let the system know would be through a GUI or a command-line input with a simple 'yes' or 'no'. This would all be part of a full-fledged working system, as opposed to our prototype, which is a basic research model.
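As an illustration of the Jena step described above, the sketch below loads an RDF file produced from a classification result into a Jena model and lists its subject-predicate-object triples. It is a minimal example under our own assumptions: the file name is hypothetical, it does not reproduce the SPONGY code, and the package names are those of current Apache Jena (thesis-era releases used the com.hp.hpl.jena namespace).

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.Statement;
    import org.apache.jena.rdf.model.StmtIterator;
    import org.apache.jena.riot.RDFDataMgr;

    public class LoadFilterOntology {
        public static void main(String[] args) {
            // Read the RDF file converted from the J48 classification result (hypothetical file name).
            Model model = RDFDataMgr.loadModel("j48-result.rdf");

            // Walk the subject-predicate-object triples that make up the filter ontology.
            StmtIterator it = model.listStatements();
            while (it.hasNext()) {
                Statement stmt = it.nextStatement();
                System.out.println(stmt.getSubject() + "  " + stmt.getPredicate() + "  " + stmt.getObject());
            }
        }
    }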
For the classification method, the J48 decision tree is used because it showed good performance compared with the Neural Network, SVM, and Naive Bayesian classifiers, as we showed in our previous paper [100].

4.2.3.1 Global ontology creation procedure

Algorithm 1 gives the pseudo code for global ontology filter creation in the SPONGY system.

4.2.3.2 User customized ontology creation procedure

For the user-customized ontology, a user profile ontology is used, as shown in Algorithm 2; this differs from the global ontology creation procedure. Using the user profile ontology, the preferences of a specific user are incorporated into the feature selection procedure. The user profile ontology holds a list of people whose e-mail should be blocked and a list of words used to block e-mails on topics the user dislikes. These blacklisted addresses and words are combined with the words selected by tfidf.

Algorithm 1 Global ontology filter pseudo code
 1: // Initialize variables
 2: set training dataset d to d_1, ..., d_n
 3: set test dataset t to t_1, ..., t_p
 4: set normalized values v to v_1, ..., v_m
 5:
 6: Feature (f: f_1, ..., f_m) <- tfidf(d)
 7:
 8: for each (f: f_1, ..., f_m) {
 9:   for each (d: d_1, ..., d_n) {
10:     (n: n_1, ..., n_m) <- Normalize(f, d)
11:   }
12: }
13: for each (n: n_1, ..., n_m) {
14:   result <- C4.5(n, d)
15: }
16:
17: Ontology() <- Jena(RdfConversion(result))
18:
19: for each (t: t_1, ..., t_p) {
20:   if (Ontology(t_i) == 1) then
21:     decision = SPAM
22:   else
23:     decision = LEGITIMATE
24: }

Algorithm 2 User-customized ontology filter pseudo code
 1: // Initialize variables
 2: set training dataset d to d_1, ..., d_n
 3: set test dataset t to t_1, ..., t_p
 4: set normalized values v to v_1, ..., v_m
 5:
 6: Feature (f: f_1, ..., f_k) <- tfidf(d)
 7: Feature (f: f_k+1, ..., f_m) <- UserProfileOntology(u)
 8:
 9: for each (f: f_1, ..., f_m) {
10:   for each (d: d_1, ..., d_n) {
11:     (n: n_1, ..., n_m) <- Normalize(f, d)
12:   }
13: }
14: for each (n: n_1, ..., n_m) {
15:   result <- C4.5(n, d)
16: }
17:
18: Ontology() <- Jena(RdfConversion(result))
19:
20: for each (t: t_1, ..., t_p) {
21:   if (Ontology(t_i) == 1) then
22:     decision = SPAM
23:   else
24:     decision = LEGITIMATE
25: }

4.3 Evaluation

In the initial SPONGY system, the global ontology was created with a 2108 e-mail dataset (42.82% spam and 57.18% legitimate e-mail). The tfidf mechanism was used as the feature selection algorithm. In Weka, the J48 decision tree algorithm was used for e-mail classification because J48 showed the best result compared with Neural Network, Naive Bayesian, and SVM. The classification result is converted to RDF file format semi-automatically in the SPONGY system. The resulting RDF file is loaded into Jena, which creates the ontology used for spam filtering. Finally, test e-mail can be checked by the SPONGY system to determine whether it is spam or not. After the SPONGY system is initialized, the ontology spam filter evolves adaptively based on user spam reports.

4.3.1 Global Ontology with 2108 training dataset

Figure 4.2 shows how we chose the J48 classification filter, which uses the standard C4.5 decision tree for classification; in this result the word "remove" was selected as the root node. Figure 4.3 shows the classification result, including precision, recall, and the confusion matrix, which gives the number of elements classified correctly and incorrectly as a percentage of the classification.
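To make the Weka step concrete, here is a minimal sketch of training J48 (Weka's C4.5 implementation) on an ARFF file built from the normalized features and printing the evaluation summary and confusion matrix. The file name and attribute layout are assumptions for illustration, not the dissertation's actual artifacts.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainGlobalFilter {
    public static void main(String[] args) throws Exception {
        // Assumed file name: one numeric attribute per selected feature,
        // plus a nominal class attribute {legitimate, spam} in the last column.
        Instances train = DataSource.read("train_tfidf.arff");
        train.setClassIndex(train.numAttributes() - 1);

        J48 tree = new J48();          // Weka's C4.5 decision tree
        tree.buildClassifier(train);

        // 10-fold cross-validation, as Weka's Explorer does by default.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(tree, train, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());   // confusion matrix

        // The learned tree printed here is what gets mapped to RDF.
        System.out.println(tree);
    }
}
```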
45 Figure 4.2: Part of J48 classification result Figure 4.3: Summary of classification result 46 Figure 4.4: Tree of J48 classification result Figure 4.4 shows the classification result using J48. Whole result is so big, so figure 4.4 is just a part of it. In the leaf node, 1 means spam and 0 means legitimate. According to the figure 4.5, if the normalized value of word ”people” is greater than 0.18, e-mail is classified as legitimate, otherwise, the system will check the normalized value of word ”our”. Finally, if the normalized value of word ”mail” is greater than 0.24, then the e-mail is classified as spam. Ontology using RDF was created based on the classification result. Figure 4.5 shows the RDF file created based on J48 classifica- tion result. The RDF file was used as an input to Jena to create an ontology which will be used to check if the test e-mail is spam or not. 47 Figure 4.5: Converted RDF file of J48 classification result Figure 4.6 shows RDF validation services. W3C RDF validation services help us to check whether the RDF schema which we are going to give as input to Jena is syn- tactically correct or not. Because the RDF file based on the classification result using J48 was created by us, and should be compatible with Jena, the validation procedure for syntax validation was required. Figure 4.7 also shows the database of Subject- Predicate-Object model we got after inputting the RDF file into Jena. This ontology model is also produced in Jena. Figure 4.8 shows the RDF data model or ontology model. This model is obtained from the W3C validation schema. This ontology is obtained in Jena in memory and not displayed directly. But it can be showed using the graphics property of the Jena. 48 Figure 4.6: W3C RDF Validation Service Figure 4.7: Triplets of RDF data model 49 Figure 4.8: RDF data model (Ontology) Table 4.1: Classification result of training datasets Class Precision Recall Spam 0.872 0.963 Legitimate 0.962 0.871 As shown on Table 4.1,total 2108 e-mails were used as training datasets. 47.8% of datasets were spam and 52.2% were legitimate e-mail. J48 was used to classify the datasets in Weka explorer. 91.51% of e-mails were classified correctly and 8.49% were classified incorrectly. In the case of spam, precision was 0.872, recall was 0.963. In the case of legitimate, precision was 0.962, recall was 0.871. Like the above, based on J48 classification result, ontology was created in RDF format using Jena. The ontology created using the RDF file was used to check input 50 e-mail through Jena. The result was generated after we consider the word frequencies of various words inside the e-mail and then querying our ontology data model for these word frequencies. If the value we get after comparing all the word frequencies of the e-mail words is ’0’ then the result was that the e-mail was not spam and if the value is ’1’ then the result is that the e-mail is spam. The result may have False Positives (A legitimate mail termed as not spam) or False Negatives (spam e-mail termed as not spam). 4.3.2 User Customized Ontology with specific user dataset In 4.3.1, standard ontology filter was created based sample e-mail dataset. Now, the SPONGY system creates a user-customized adaptive ontology filter based on received e-mails of specific user. Every user has different background and preference, so a user- customized adaptive ontology is different to person. 
At first, e-mails of user go through the standard ontology, the e-mails that were not classified correctly are entered into the user-customized adaptive ontology. The user-customized adaptive ontology will filter out the e-mails that were not classified correctly again. The user-customized adaptive ontology created by SPONGY system gave us good performance improvement because the ontology filter was made based on specific user’s e-mail set. The initial global ontology already showed good performance. 51 Table 4.2: Classification result of 200 user dataset Number of leaves 12 Size of tree 23 Correct classification 195 (97.5%) Table 4.3: Precision and Recall of 200 user dataset Class Precision Recall Spam 0.990 0.960 Legitimate 0.961 0.990 At first,the experiment was performed with 200 e-mails (100 spam and 100 legit- mate). The experimental result through the Weka is showed in the Table 4.2 and Table 4.3. Then, we increased user dataset to 400 and 600. The experiment was performed with 400 e-mails (200 spam and 200 legitmate). The experimental result through the Weka is showed in the Figure 4.4 and 4.5. Finally, the experiment was performed with 600 e-mails (300 spam and 300 legiti- mate). Most of procedures to create user customized ontology are similar to the case of global ontology creation except for use of user profile ontology. Table 4.4: Classification result of 400 user dataset Number of leaves 19 Size of tree 37 Correct Classification 372 (93%) 52 Table 4.5: Precision and Recall of 400 user dataset Class Precision Recall Spam 0.948 0.910 Legitimate 0.913 0.950 Table 4.6: Classification result of 600 user dataset Number of leaves 27 Size of tree 53 Correct Classification 553 (92.17%) As shown on Table 4.6 and 4.7,total 600 e-mails were used as an specific user dataset. 50% of dataset were spam and 50% were legitimate e-mail. J48 was used to classify the dataset in Weka explorer. 92.17% of e-mails were classified correctly and 7.83% were classified incorrectly. In the case of spam, precision was 0.882, recall was 0.973. In the case of legitimate, precision was 0.970, recall was 0.870. 4.3.3 SPONGY system results using global ontology and user customized ontology Through the global ontology, 179 from 2108 e-mails were classified incorrectly. In detail, 37 of 1008 spam e-mails and 142 of 1100 legitimate e-mails were classified Table 4.7: Precision and Recall of 600 user dataset Class Precision Recall Spam 0.882 0.973 Legitimate 0.970 0.870 53 incorrectly. These 179 (37 spam + 142 legitimate) e-mails were inputted into the user customized ontology for the filtering. By user customized ontology created using 200 user e-mail, 10 of incorrectly clas- sified 37 spam e-mails and 55 of incorrectly classified 142 legitimate e-mails were filtered out. Totally, 114 of 2108 e-mails were classified incorrectly, but 179 of 2108 e-mails were classified incorrectly when we used only global ontology. Rate of correct classification is from 91.5085% to 94.5920%. By user customized ontology created using 400 user e-mail, 13 of incorrectly clas- sified 37 spam e-mails and 57 of incorrectly classified 142 legitimate e-mails were filtered out. Totally, 109 of 2108 e-mails were classified incorrectly, but 179 of 2108 e-mails were classified incorrectly when we used only global ontology. Rate of correct classification is from 91.5085% to 94.8292%. 
By user customized ontology created using 600 user e-mail, 14 of incorrectly clas- sified 37 spam e-mails and 69 of incorrectly classified 142 legitimate e-mails were filtered out. Totally, 96 of 2108 e-mails were classified incorrectly, but 179 of 2108 e-mails were classified incorrectly when we used only global ontology. Rate of correct classification is from 91.5085% to 95.4459%. All the experimental results are shown on Table 4.9. It is much improvement. As shown on Table 4.8, for the whole SPONGY system, in the case of spam, precision was 0.931, recall was 0.977 and, in the case of legitimate, precision was 0.978, recall was 0.934. 54 Table 4.8: Precision and Recall of whole SPONGY system Class Precision Recall Spam 0.931 0.977 Legitimate 0.978 0.934 Table 4.9: Result of SPONGY system Global Ontol- ogy user ontology w/200 user ontology w/400 user ontology w/600 Spam (1008) 37 27 24 23 Legitimate (1100) 142 87 85 73 Correct Classi- fication 91.5085% 94.5920% 94.8292% 95.4459% 4.4 Conclusion Our experiment here is still at an inception phase where the model is still learning. The accuracy of the decision tree was approximately 91.51% which was quite good at this stage. Our system gave an accuracy of 95.45%, so we can conclude not a large loss from the work which is an idea and an attempt at aiding ontology based classification and filtering. 55 Chapter 5 Improved Spam Filtering by Extraction of Information from Text Embedded Image E-mail 5.1 Introduction The increase of image spam, a kind of spam in which the text message is embedded into an attached image to defeat spam filtering techniques, is becoming an increasingly major problem.. For nearly a decade, content based filtering using text classification or machine learning has been a major trend of anti-spam filtering systems. A Key technique being used by spammers is to embed text into image(s) in spam e-mail. In [102], we proposed two levels of ontology spam filters: a first level global ontology filter and a second level user-customized ontology filter. However, that previous system handles only text e-mail and the percentage of attached images is increasing sharply. 56 The contribution of the paper is that we add an image e-mail handling capability to the previous anti-spam filtering system, enhancing the effectiveness of spam filtering. 1. Related Work 2. Extraction of Information from Text Embedded Images using OCR 3. Image Spam Filtering 4. Experimental Results 5. Comparison with Commercial Filters 6. Conclusion 5.2 Related Work Dredze et al. tried to automatically classify an image as being spam or legitimate e- mail. They presented features that focus on simple properties of the image, making classification as fast as possible. Their evaluation showed that they accurately classify spam images in excess of 90% and up to 99% on real world data [29]. Attempts to use OCR (Optical Character Recognition) techniques to convert spam images back to text for processing by text-based filters have been foiled. The goal of OCR is to classify optical patterns corresponding to alphanumeric or other characters. The process of OCR involves several steps including segmentation, feature extraction, 57 and classification. An effective response by spammers is the application of CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) [18] techniques, which are designed to preserve readability by humans but capable of effec- tively confusing the OCR algorithms [17]. Sahami et al. [86], Graham et al. 
[42], and Zhang et al. [103] investigated the use of text categorization techniques based on the machine learning and pattern recognition approaches for e-mail semantic content analysis. With respect to manually encoded rules, using these techniques, categorization rules are automatically created and the system is generalized potentially. One of the most popular anti-spam solutions is SpamAssasin [91] and there are several sites hosting plug-in rule modules. The SpamAssasin was created to make a general purpose system compatible with a variety of anti-spam filters. Towards this end, they created a new binary prediction problem: Is this image spam or legitimate e-mail? The classification can then be fed into existing content filters as a feature. Oth- ers have followed this approach [8] and it had several advantages. First, it separates image classification from spam e-mail classification, which is a difficult and well stud- ied problem. Second, e-mails can contain multiple images and it is not clear how to combine them towards a single prediction. One approach have been done by Aradhye et al. [8]. They treated each image separately, avoiding this difficulty. Finally, they did 58 not commit to a specific content filtering system. Rather, they provided a single feature that can be integrated with any learning based anti-spam system. Spammers are embedding the e-mail’s message into images sent as attachments, which are displayed by most e-mail clients. This can make all content based filtering techniques based on the analysis of plain text in the subject and body fields of e-mails ineffective. This trick is often used in phishing e-mails, which are one of the harmful spam e-mails. Among commercial and open source spam filters currently available, only a plug-in of the SpamAssasin spam filter can analyze text embedded images, but it provides only a Boolean attribute indicating whether more than one word among a given set is detected in the text extracted by an OCR from attached images. Texts extracted through the OCR are used as training data set in the text based spam e-mail filtering system. 5.3 Extraction of Information from Text Embedded Images using OCR OCR (Optical Character Recognition) translates images of text, such as scanned doc- uments, into actual text characters. Also known as text recognition, OCR makes it possible to edit and reuse the text that is normally locked inside scanned images. OCR works using a form of artificial intelligence known as pattern recognition to identify 59 individual text characters on a page, including punctuation marks, spaces, and ends of lines. OCR SDK was an important component in our project because its efficiency can critically affect our project. For this we initially selected three OCR Software Develop- ment Kits. ² JOCR [52] ² Simple OCR [89] ² Asprise OCR [9] By running a sample of 200 image e-mails on these software’s we determined that Asprise OCR was performing with an accuracy of 95%. It had the best detection rate among the three softwares and hence we decided to go with Asprise OCR for this project. The components of Asprise OCR for Java Asprise OCR comprise two essential components: ² A native library: AspriseOCR.dll [on Windows] ² one Java package com.asprise.util.ocr [Cross platforms] main package; contains essential classes to perform OCR But the use of OCR tool is not cost effective with the large amounts of e-mails been handled daily by server-side filters. 
To address the issue, we suggest that computational complexity could be reduced by using a hierarchical architecture for the spam filter. 60 Figure 5.1: Snapshot of OCR Implementation Text extraction using OCR tool should be carried only if the previous less complex modules were not able to reliably detect whether an e-mail is legitimate or not. As a training image data set, we prepared 1000 e-mails (819 image spam + 181 le- gitimate images). The Asprise OCR cannot handle 676 image e-mails because of image obscuring techniques like wave, animate, deform, rotate, etc. 12 image e-mails out of 143 image spam and 4 image e-mails out of 181 legitimate image e-mails are addition- ally missed by error. Finally, 131 text messages out of 143 and 177 text messages out of 181 are retrieved correctly. In the experiment, we used only image e-mails not using obscuring techniques. Figure 5.1 is a snapshot of the implemented program. ² Source folder 61 – Browse and select all image and text files ² Destination folder – All retrieved text through OCR and text files goes to destination folder ² File (for the third Browse tab) – Specify a feature selection file containing feature list ² File (for the fourth Browse tab) – Specify a path where the .arff (WEKA [95] input) file would be generated 5.4 Image Spam Filtering Figure 5.2 shows our framework to filter spam. The training data set is the set of e-mail that gives us a classification result. It is composed of both text e-mail and image e-mail. The test data is actually the e-mail will run through our system which we test to see if classified correctly as spam or not. This will be an ongoing test process and so, the test data is not finite because of the learning procedure. Image e-mail among the training data set is entered into OCR, and then text infor- mation is retrieved from text embedded image e-mail. Training data set is selected. Training data set is a collection of text-oriented e-mail data. Features from the data set are selected using tfidf. Weka input file is created based on the selected features and 62 the data set. Weka is a toolkit of machine learning algorithms written in Java for data mining tasks. Through Weka, classification results are generated. The classified results are converted to RDF [83] file. The converted RDF file is fed into Jena, which is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL, and SPARQL, and includes a rule-based inference engine. Using Jena [51], ontologies are created, and we can give a query to Jena. Jena will give an output for the query using ontologies created in Jena. Through these procedures, global and user-customized ontology filters are created. Incorrectly classified e-mails through global ontology filter are inserted into the user- customized ontology filter. The ontologies created by the implementation are modular, so those could be used in another system. Major trend in spam filtering area is a global filter. Generally, globally-trained filters outperform personally-trained filters for both small and large collections of users under a real environment. However, globally-trained filters sometimes ignore personal data. Globally-trained filters cannot retain personal preferences and contexts as to whether a feature should be treated as an indicator of legitimate e-mail or spam. Hence, we suggested two-level filters. 
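The hierarchical idea described in this section (run the cheap text-only filter first and invoke OCR only when it is actually needed) could look roughly like the following sketch. OcrEngine, TextFilter, and Verdict are hypothetical placeholders standing in for the Asprise OCR wrapper and the ontology filters; they are not the SPONGY interfaces themselves.

```java
import java.awt.image.BufferedImage;
import java.util.List;

// Hypothetical interfaces; the real OCR SDK and ontology filter are not shown here.
interface OcrEngine { String recognize(BufferedImage image); }
interface TextFilter { Verdict classify(String text); }
enum Verdict { SPAM, LEGITIMATE, UNDECIDED }

public class HierarchicalFilter {
    private final TextFilter textFilter;
    private final OcrEngine ocr;

    public HierarchicalFilter(TextFilter textFilter, OcrEngine ocr) {
        this.textFilter = textFilter;
        this.ocr = ocr;
    }

    public Verdict classify(String bodyText, List<BufferedImage> attachedImages) {
        // Cheap pass first: subject/body text only.
        Verdict v = textFilter.classify(bodyText);
        if (v != Verdict.UNDECIDED || attachedImages.isEmpty()) {
            return v;
        }
        // Expensive pass: OCR the attached images and re-run the text filter
        // on the body text plus the extracted text.
        StringBuilder extracted = new StringBuilder(bodyText);
        for (BufferedImage img : attachedImages) {
            extracted.append(' ').append(ocr.recognize(img));
        }
        return textFilter.classify(extracted.toString());
    }
}
```

Gating the OCR call this way keeps the per-message cost low on a server-side filter while still recovering text from image-only spam when the cheaper modules cannot decide.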
Our goal is to combine the advantages of both the globally-trained filter and the personally-trained filter for better spam filtering performance. That is the SPONGY (SPam ONtoloGY) system.

Figure 5.2: Spam Filtering System (Text and Image E-mails)

Spam e-mails vary from user to user and change over time, so learning and adaptive filtering is desirable. An ontology defines a common vocabulary for sharing information in a domain. It includes definitions of the basic concepts in the domain and the relations among them. Hence, an ontology can be developed to share a common understanding of the structure of information among people or software agents, to increase reusability of domain knowledge, and to analyze domain knowledge. Several approaches adopt machine learning techniques for learning and adaptation, but an ontology-based filter also meets these needs, so an ontology is used in our implementation. Ontologies allow for machine-understandable semantics of data, so an ontology used as a filter can be embedded within other systems for better performance.

5.5 Experimental Results

In the experiment, we used both text e-mail and image e-mail. The data set was divided as follows:

TS (Text Spam) - 1008
TL (Text Legitimate) - 1100
OCR IS (Text retrieved from image spam using OCR) - 131
OCR IL (Text retrieved from legitimate image e-mail using OCR) - 177

The experimental results are shown in Tables 5.1 to 5.4. We measured the false negative rate and the false positive rate. The false negative rate is the proportion of positive instances that were erroneously reported as negative; it is equal to 1 minus the sensitivity (power) of the test. The false positive rate is the proportion of negative instances that were erroneously reported as positive; it is equal to 1 minus the specificity of the test, which is to say the false positive rate equals the significance level.

False Negative rate = (# of false negatives) / (total # of positive instances)
False Positive rate = (# of false positives) / (total # of negative instances)

Table 5.1: Experimental Results of Global Filter without OCR
Data set   False Negative   False Positive   Correct Classification
TS + TL    12.91%           3.67%            91.5085%

Table 5.2: Experimental Results of SPONGY without OCR
Data set   False Negative   False Positive   Correct Classification
TS + TL    6.34%            2.28%            95.4459%

We actually used a larger image data set, but many of the images could not be handled by Asprise OCR. Hence, we considered only the image e-mails that Asprise OCR could handle. As shown in Tables 5.1, 5.2, 5.3, and 5.4, the SPONGY spam filter still produced good results without much performance degradation. "Global filter" in the tables refers to the spam filter excluding the personalized filter of the SPONGY system; "SPONGY" refers to the spam filter created through the procedures in Figure 5.2. Overall, in the SPONGY system with OCR functionality, the false negative rate increased from 6.34% to 7.36%, the false positive rate increased from 2.28% to 3.16%, and the accuracy (correct classification rate) decreased slightly. In exchange, the SPONGY system gained image e-mail handling capability.
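The two rates just defined are straightforward to compute from the confusion-matrix counts; the sketch below is a trivial illustration with made-up counts, not a reproduction of the tables in this section.

```java
public class ErrorRates {
    // falseNegatives: spam messages delivered as legitimate.
    // totalPositives: all spam messages in the test set.
    static double falseNegativeRate(int falseNegatives, int totalPositives) {
        return (double) falseNegatives / totalPositives;
    }

    // falsePositives: legitimate messages flagged as spam.
    // totalNegatives: all legitimate messages in the test set.
    static double falsePositiveRate(int falsePositives, int totalNegatives) {
        return (double) falsePositives / totalNegatives;
    }

    public static void main(String[] args) {
        // Made-up counts, purely for illustration.
        System.out.printf("FN rate: %.2f%%%n", 100 * falseNegativeRate(13, 100));
        System.out.printf("FP rate: %.2f%%%n", 100 * falsePositiveRate(4, 100));
    }
}
```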
Table 5.3: Experimental Results of Global filter with OCR Global Filter with OCR functinality False Negative False positive Correct Classification TS + TL + OCR IS + OCR IL 13.55% 4.74% 90.6043% 66 Table 5.4: Experimental Results of SPONGY with OCR SPONGY with OCR functinality False Negative False positive Correct Classification TS + TL + OCR IS + OCR IL 7.36% 3.16% 94.6192% 5.6 Comparison with Commercial Filters Spam e-mail filter is trying to block spam e-mail efficiently, but many spammers find new methods or techniques to try to break into the inbox of e-mail account of user. Most spam consists of an unwanted advertise, also some can transmit viruses, spyware on to your computer and cause problems. It is extremely annoying to go to user’s inbox and have a look through a whole list of e-mail to find one legitimate e-mail. We did some experiment with Gmail, Yahoo! mail, the e-mail system of University of Southern California (USC), and the SPONGY system we proposed. Also, we did a survey about some commercial spam filter programs. 5.6.1 Introduction to Each E-mail System Gmail has been known one of the best spam filters that prevent many spam e-mail to user’s inbox. Spammers are finding it harder to send e-mail and evade the innovative spam block technology used by Gmail. In the Gmail, there is a ”Report Spam” button. By clicking some message as ”Report Spam”, the Gmail will identify this type of mes- sage as spam the next time and not only block it at your e-mail account, everyone else’s 67 e-mail accounts will also block that message. In case of image spam, it is difficult for Gmail to block, but with Optical Character Recognition (OCR), it can read what the image content says and block the messages. Gmail supports multiple authentication system like Sender Policy Framework (SPF), DomainKeys, which domain the message originates from, and DomainKeys Identified Mail (KIM), to verify the sender and help recognize whether it is real or forged messages. The sender cannot through the third party using multiple authentication system, which is different from many other webmail services support a single authentication system. Yahoo! mail also supports many similar features used in Gmail. Yahoo! mail servers are going to need to separately check DomainKeys, SPF, e-mail Caller ID, some giant useless ”do not spam” database. When user logs in his/her account, user can create user’s own filter using filtering features provided by Yahoo! mail. User can specify whether or not the match should be case-sensitive and where the target string should appear in text that you are trying to match (ex. Contains, Does not contain, Begins with, and Ends with). The USC e-mail system uses a centralized spam detection system from Symantec called Brightmail AntiSpam [15] that scans all incoming e-mail before the messages are delivered to the user’s inbox. If the e-mail meets the specific criteria defined by the antispam filters, it tags the message header as potential spam. According to the 68 Table 5.5: Spam Filter Review 2007 [92] SpamEater Pro CA Anti-Spam ChoiceMail One Spam Killer SPONGY Block IP Addr o o o Block Server o o o o Block E-mail Addr o o o o o Blacklist Support o o Allow IP Addr o o o Allow Server o o o o Allow E-mail Addr o o o o o Individual User Profile o o o o Reporting Capability o o o announcement of the Brightmail AntiSpam, the false positive rate of the program is an extremely low 0.001%. 
The SPONGY system uses two-level filter using dynamic ontologies: a first level global ontology filter and a second level user profile ontology filter. The user profile ontology filter is created based on the specific user’s background as well as the filtering mechanism used in the global ontology filter creation. 5.6.2 Evaluation and Comparison We surveyed some commercial spam filters. We tested the filters using their 30 days trial versions. Commercial spam filters supports many features as you can see in Table 5.5. Pre- set categories are provided by the program vendor freely. It contains content such as financial, adult content, health, etc. Rule customization option will allow you to add, remove, or modify the filtering rules. A rule is a set of criteria for determining whether or not an e-mail is spam or legitimate. 69 However, most of commercial filters are too complicated and difficult for the end users. As you can see, most of filters can allow or block IP address, server, and e-mail address. There are many other known filters in the world. We also compared Gmail, Yahoo! mail, the USC e-mail and SPONGY . In the ex- periment, e-mail addresses and messages of our research group members are used. We cannot specify sender’s e-mail address because most of e-mail systems support authen- tication system, hence I cannot test with other e-mail address. We did two experiments. The first experiment is performed with the same e-mail data set (e-mails used for the SPONGY system are forwarded to my account of Gmail, Yahoo! mail, and the USC e-mail). E-mail addresses in hotmail.com and hanmail.net are used as a sender e-mail. In the second experiment, we just checked each e-mail system’s filtering accuracy with false negative and false positive in my e-mail account of each e-mail system. The sec- ond experiment is done under more realistic environment. The experiment is done with own filters of each e-mail system (Gmail, Yahoo! mail, the USC e-mail, and SPONGY) with the default setting. As you can see in Table 5.6, the experiment was performed with the same data set. E-mail data used in the SPONGY system experiment were sent to Gmail, Yahoo! mail, and the USC e-mail system. We used Hotmail and Hanmail as a sender e-mail account, and my e-mail accounts in Gmail, Yahoo! mail, and the USC e-mail system were used as a receiver. In the experiment, we cannot use other person’s e-mail account because 70 of privacy, and send bulk of e-mail because of authentication policy of each e-mail system. The SPONGY system is scalable, learning, multi-level filter. Although we consider experimental environment like the several difficulties, the SPONGY system showed better performance than Gmail, Yahoo! mail, and the USC e-mail system. In here, we insist that the experimental results of the SPONGY system in at least our experimental environments are efficient. False positive values of all the e-mail systems are reasonable, but false negative values of Gmail, Yahoo! mail, and the USC e-mail system are not good. Probably, it happened because the sender of e-mail is me, and most of the e-mail system considers the sender when their filtering policy is used. Another experiment was performed on the real setting with different e-mail data set. In this case, the SPONGY system showed good performance in both false negative rate and false positive rate. In the SPONGY system, most balanced false negative and false positive rate values were obtained. False negative rate was 7.3610% and false positive rate was 3.1607%. 
Three other commercial mail systems showed low experimental results. False negative rate of the Yahoo! mail was extremely bad for us. Brightmail AntiSpam of the Symantec used in the USC e-mail system showed very low false positive rate. Generally, performance order is SPONGY , the USC e-mail system, Google Gmail, and Yahoo! mail. Experimental results are shown in Table 5.7. With the same e-mail data set, the SPONGY showed the best false negative rate. With some of test e-mail data set, SPONGY showed better performance at least under my experiment. 71 Table 5.6: Comparison Result with the Same E-mail Data Set Google Gmail YahooMail USC E-mail SPONGY False Negative 81.2705% 83.0357% 60.3603% 7.3610% False Positive 0.7563% 2.2388% 1.5152% 3.1607% Table 5.7: Comparison Result with the Different E-mail Data Set Google Gmail YahooMail USC E-mail SPONGY False Negative 11.2436% 19.3319% 9.6611% 7.3610% False Positive 4.6358% 6.3830% 2.8169% 3.1607% By increasing image e-mail handling capability, we possibly increase the performance of the spam filtering system. We know our experiment is somewhat restricted, but it demonstrates the potential capability of the spam filtering system we proposed. 5.7 Conclusion The proposed framework is a hybrid two-level filter, which combines global filter and personalized filter. Also, it is user-customized, scalable, and modularized, so that it can be embedded to many other systems for better performance. We added image spam handling capability using OCR into the text-based anti-spam filtering system. By han- dling of text embedded image e-mail, the proposed system can be used partially for both text e-mail and image e-mail. The experiment was somewhat restricted, but it demonstrates the potential capability of the proposed system under restricted environ- ment. However, to cope with the image e-mail thoroughly, we need to adopt advanced 72 image processing techniques. Then, we can face image obscuring techniques like wave, animate, deform, and rotate. In the future, we will experiment with the combination of the general corpus data set and our data set for generality. As we explained, new spamming technique appears continuously and traditional spamming technique is also prevailing. Spamming technique is evolutionary, hence the spam filtering technique must catch up with the new spamming technique. 73 Chapter 6 Spam Decisions on Gray E-mail using a Personalized Ontology 6.1 Introduction E-mail is an efficient and popular communication mechanism, and the amount of e- mail traffic is now huge. E-mail management has become an important and growing problem for individuals and organizations because it is prone to misuse. The blind posting of unsolicited e-mail messages, known as spam, is an example of misuse. Spam is commonly defined as the sending of unsolicited bulk e-mail. A further common definition of spam is restricted to unsolicited commercial e-mail, a definition that does not include non-commercial solicitations such as political or religious pitches, scams, etc., as spam. We take the broader definition of spam as e-mail that was not requested or appropriate for a user, given his/her requests and privacy context. 74 Statistical text classification is at the core of many commercial and open-source anti-spam solutions. Statistical classifiers can either be trained globally with one clas- sifier learned for all users, or personally where a separate classifier is learned for each user. 
Personally-trained classifiers have the advantage of allowing each user to pro- vide their own personal definition of spam. A user actively refinancing his home can train a personal filter to delete unsolicited stock advice as spam but deliver unsolicited refinancing offers to his inbox. Another user might train a personal filter to block all unsolicited offers. Personal classifiers can quickly identify terms that are unique to an individual and use them as strong indicators of legitimate e-mail. Spam filters are faced with the challenge of distinguishing messages that users wish to receive from those they do not. At first glance this seems like a clear objective, but in practice this is not straightforward. For example, it has been estimated that two- thirds of e-mail users prefer to receive unsolicited commercial e-mail from senders with whom they have already done business, while one-third consider it spam. Spam not only causes loss of time and computational resources, leading to financial losses, but it is also often used to advertise illegal goods and services or to promote online fraud. As suggested in recent reports by Spamhaus [16], spam is increasingly being used to distribute viruses, worms, spyware, links to phishing web sites, etc. The problem of spam is not only an annoyance, but is also becoming a security threat. 75 A great number of learning-based spam filters are proposed in the literature. Some of them use the knowledge about the structure of the message header, retrieving par- ticular kinds of technical information and classifying messages according to it, for ex- ample the method based on SMTP path analysis [60]. Other methods use human lan- guage technologies to analyze the message content, for example, the approach based on smooth n-gram language modeling [65]. However, there is a large group of learning- based filters that observe a message just as a set of tokens. The most popular method in this group is Naive Bayes [86]. Much work on spam e-mail filtering has been done using techniques such as deci- sion trees, Naive Bayesian classifiers, neural networks, etc. To address the problem of growing volumes of unsolicited e-mail, many different methods for e-mail filtering are being deployed in many commercial products. We have experimentally constructed a framework for efficient e-mail filtering using ontologies. Ontologies allow for machine- understandable semantics of data, so they can be used in any system [93], [100], [102]. It is important to share the information with each other for more effective spam filtering. Thus, it is necessary to build ontologies and a framework for efficient e-mail filtering. Using ontologies that are specially designed to filter spam, most of unsolicited bulk e-mail can be filtered out on the system. This paper proposes an efficient spam e-mail filtering method using ontologies. As a part of spam filtering, we need to do some re- search about how to decide gray e-mail as spam or legitimate. It is very important to 76 improve the performance of spam filtering. In our initial experiments, we used Waikato Environment for Knowledge Analysis (Weka) explorer, and Jena to make ontologies based on a sample data set. 1. Motivation and Spam E-mail Trends 2. Filtering System for Gray E-mail 3. Conclusion 6.2 Motivation and Spam E-mail Trends Gray e-mail is a message that could reasonably be considered either legitimate or spam. 
However, we are getting many e-mails that cannot be decided clearly every day, so han- dling of gray e-mail is very important issue in spam filtering system. Our approach is to resolve the problem by considering both advantages of global filter and personalized filter. Also, to cope with the potential new spam e-mail, we introduce the recent spam e-mail trends here. 6.2.1 Decision on Gray E-mail Gray e-mail is a message that could reasonably be considered either legitimate or spam. For example, unsolicited commercial e-mail, or newsletters that do not respect unsub- scribe requests, could sometimes be useful. Message users prefer e-mail for personal 77 communications or business transactions. Unsolicited e-mail like messages advertising illegal products, or phishing message, is spam. E-mail that we cannot agree on unani- mously is gray e-mail. Gray e-mail can be considered as good e-mail for some people, or as bad e-mail for some other people. Hence, a personalized filter is required to han- dle the gray e-mail decisions by considering different user preferences or providing learning filters combining different training/testing policies. For example, if a customer buys one pair of speakers in the last month, then adver- tising e-mail about amplifiers, home theater receivers, or speaker cables, can be good e-mail. However, if after buying the speakers, they receive advertising e-mail about speakers of another brand name, then the advertising e-mail is spam to this user. The gray mail problem can be treated as a special kind of label noise. Instead of accidentally flipping the label from spam to good or vice versa by mistake, different users may simply have different e-mail contexts and preferences. Another reason is that individual users change their own preferences over time. For example, it is common for a user who tires of a particular newsletter to begin reporting it as spam rather than unsubscribing [87], [98]. Some companies also do not respect unsubscribe requests and continue sending mail that some users then consider spam. In all cases the effect is the same. Senders send mail that is not clearly spam or good and spam filters are faced with the challenge of determining which subset of this mail should be delivered. There are two major 78 problems in global anti-spam systems because of the presence of gray e-mail. First, when personalization or user preference and context are ignored, because gray e-mail is not clearly good or spam by definition, it makes accurate evaluation of a filter perfor- mance a challenge. Thus it is important that we are able to detect this mail and handle it appropriately in the context of anti-spam systems In the research, a user profile ontology is created for each user or class of users to handle gray e-mail. Figure 1 shows the user profile ontology creation procedure. A structured taxonomy is created to serve as a global ontology filter, and user profile on- tologies are created based on users’ preferences and contexts. A user profile ontology creates a blacklist of contacts and topic words. If a user wants to block some contact persons, they add their e-mail addresses to the blacklist. Then e-mail from those ad- dresses will be classified as spam by the filter. Also, if a user wants to repel a certain topic, then they can add the terms related to the topic. In this case, added terms are included into a feature set to classify the data set, so although a term is added to the topic list, not all e-mail including that term is classified as spam. 
However, in case of ”do not receive” blacklist, e-mail with addresses in that list must be classified as spam. 6.2.2 Spam E-mail Trends New spam techniques appear again and again. We will look at several new spam tech- niques that have appeared since 2007. 79 Figure 6.1: Personalized ontology 6.2.2.1 PDF Spam A new technique of attaching popular PDF (Portable Document Format) files is de- signed to bypass many traditional spam filters. This new spam technique using the PDF format prevents the adoption of blocking policies, such as those often enacted against .exe files. The PDF spam has a randomized content, similar to the image-based stock spam, and randomly altered to fetter OCR (Optical Character Recognition) tech- nology. Sometimes, the PDF spam is combined with a Malware URL. An example of PDF spam is shown on Figure 6.2. 80 Figure 6.2: PDF Spam with Malware 6.2.2.2 Malware Malware is software designed to infiltrate or damage a computer system without the owner’s informed consent. It is a portmanteau of the words ”malicious” and ”software”. Generally, Malware means a variety of forms of hostile, intrusive, or annoying software or program code. Sometimes, the PDF spam is combined with a Malware URL. By clicking the Malware URL, an unauthorized program can be downloaded and installed automatically on user’s machine. 81 Figure 6.3: Zombie IPs A Zombie (Zombie Computer) is a computer attached to the Internet that has been compromised by a hacker, a computer virus, or a trojan horse. Generally, a compro- mised computer is only one of many in a ”botnet”, and will be used to perform mali- cious tasks of one sort of another under remote direction. Most of zombie computers don’t know that their machine is being used in this way. Botnet is a jargon term for a collection of software robots, or bots, which run autonomously and automatically. They run on groups of ”zombie” computers controlled remotely. This can refer to the network of computers using distributed computing software [25]. Figure 6.3 shows a global distribution of Zombie IPs in 2007. 82 Figure 6.4: Website Redirection using Image Spam 6.2.2.3 Blended Spam with Malware Website Links Zombies link text of a URL that hyperlinks to a website containing malicious software or commercial product instead of sending messages with the usual virus attachments. Malware distributor needs to hack into the legitimate site’s web server to place the mal- ware page there. Generally, e-mail filter often will assume that the URL within the site is also legitimate if the e-mail filter identifies the site as legitimate. Sometimes, spam- mers still use image spam to redirect to enhancement website. If user clicks the image, then it will redirect to the commercial website. An example of website redirection using image spam is shown on Figure 6.4. 83 Figure 6.5: Address Validation Spam 6.2.2.4 Address Validation Address validation spam usually appears as harmless nonsense or an empty e-mail mes- sage sent from an unfamiliar address. The message in address validation spam looks like an innocent mistake, and may not even be considered as spam by users because there is no message or link inside the mail. Actually, these messages are botnet owner’s trial to test which e-mail addresses on their distribution lists are legitimate and which are not in use. If the message is bounced back, then the e-mail addresses are considered invalid and removed from the 84 Figure 6.6: Topics of Spam E-mail distribution lists. 
Example of website redirection using image spam is shown on Figure 6.5. 6.2.2.5 Discussion Topics of spam of 2007 are shown on Figure 6.6. As we explained here, spamming tech- niques are also evolving to evade existing spam filtering techniques. New spamming technique appears continuously and traditional spamming technique is also prevailing. Topics of spam e-mail are also changing continuously. Spam filters are destined to modify and evolve to face various spamming techniques [25]. 85 6.3 Filtering System for Gray E-mail We have built the global spam e-mail filtering system in the previous paper [102]. Based on the previous spam filtering system, we created a personalized ontology filter. Specif- ically, the system works as follows: 1. A training data set is selected; this is a collection of text-oriented e-mail data. 2. Features from the data set are selected using the both tfidf and user profile ontol- ogy. 3. A Weka input file is created based on the selected features and the data set. (Weka is a toolkit of machine learning algorithms written in Java for data mining tasks.) [95]. 4. Through Weka, classification results are generated. 5. The classified results are converted to an RDF file. 6. The converted RDF file is fed into Jena, which is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL, and SPARQL, and includes a rule-based inference engine [50]. 7. Using Jena, ontologies are created, and we can give a query to Jena. Jena will give an output for the query using ontologies created in Jena. 86 Figure 6.7: Global Spam Filter Through these procedures, the personalized ontology filter is created. Details of the personalized ontology filter are shown in Figure 6.8. In contrast to other approaches, ontologies were used in our approach. In addition, the C4.5 algorithm was used to classify the training data set [81]. The ontologies created by the implementation are modular, so those could be used in another system. In our previous classification ex- periments, the C4.5 showed better results than Naive Bayesian, Neural Network, or Support Vector Machine (SVM) classifiers [100]. Figure 6.7 shows the architecture of the global spam e-mail system provided in the previous paper [102]. The training data set is the set of e-mail which gives us a classification result. The test data is actually the e-mail we will run through our system 87 Figure 6.8: Personalized Spam Filter which we test to see if it is classified correctly as spam or not. This will be an ongoing test process and since the test data is not finite because of the learning procedure, the test data will sometimes merge with the training data. The training data set was used as input to the C4.5 classification. To do that, the training data set should be modified to a compatible input format. The proposed spam filtering system gives us the classification result using the C4.5 classifier. To query the test e-mail in Jena, an ontology is created based on the classification result. To create the ontology, an ontology language was required. RDF was used to create an ontology. The classification result of the RDF format was input to Jena, and input RDF was deployed through Jena; finally, an ontology was created. An ontology 88 generated in the form of an RDF data model is the base on which the incoming mail is checked for its legitimacy. Depending upon the assertions that we can conclude from the outputs of Jena, the e-mail can be defined as either spam or legitimate. 
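As an illustration of this Jena step, the hedged sketch below builds a tiny RDF model encoding a single J48-style split and reads the resulting assertions back out as subject-predicate-object triples. The namespace and property names are invented for this example (the dissertation's actual RDF vocabulary is not reproduced here), and the imports use current Apache Jena package names; the 2009-era Jena shipped under com.hp.hpl.jena.*.

```java
import org.apache.jena.rdf.model.*;

public class DecisionNodeToRdf {
    public static void main(String[] args) {
        // Hypothetical vocabulary; only for illustration.
        String ns = "http://example.org/spamfilter#";
        Model model = ModelFactory.createDefaultModel();

        Property onWord    = model.createProperty(ns, "splitsOnWord");
        Property threshold = model.createProperty(ns, "threshold");
        Property ifGreater = model.createProperty(ns, "ifGreaterClass");
        Property otherwise = model.createProperty(ns, "elseClass");

        // Encode one split, e.g. tf-idf("remove") > 0.10 -> spam.
        Resource node = model.createResource(ns + "node1")
                .addProperty(onWord, "remove")
                .addLiteral(threshold, 0.10)
                .addProperty(ifGreater, "spam")
                .addProperty(otherwise, "legitimate");

        // List the resulting subject-predicate-object triples.
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            System.out.println(it.nextStatement());
        }

        // Reading an assertion back out, as the filter would when deciding.
        Statement s = node.getProperty(ifGreater);
        System.out.println("If tf-idf(\"remove\") > 0.10 the class is: " + s.getString());

        // Serialize to RDF/XML, which can be checked with the W3C RDF validator.
        model.write(System.out, "RDF/XML");
    }
}
```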
The e-mail is actually the e-mail in the format that Jena will take in (i.e. in a CSV format) and will run through the ontology that will result in spam or legitimate. Both spam filtering systems (global spam filter and personalized ontology spam filter) periodically update the data set with the e-mails classified as spam when user spam report is requested. Then, a modified training data set is input to Weka to get a new classification result. Based on the classification result, we can get a new ontology, which can be used as a second spam filter (that is, a user profile ontology). Through this procedure, the number of ontologies will be increased. Finally, these spam filter- ing ontologies will be customized for each user. User profile ontology filter would be different from the other depending on each user’s background, preference, hobby, etc. That means one e-mail might be spam for person A, but not for person B. User profile ontologies evolve dynamically. The personalized ontology spam filter system provides an evolving spam filter based on users’ preferences, so users can get a better spam filtering result. The input to the system is mainly the training data set and then the test e-mail. The test e-mail is the first set of e-mail that the system will classify and learn and after a certain time, the system will take a variety of e-mail as input to be filtered as a spam or legitimate. The 89 classification results through Weka need to be converted to an ontology. The classifi- cation result which we obtained through the C4.5 decision tree was mapped into the RDF format. This was given as an input to Jena which then mapped the ontology for us. This ontology enabled us to decide the way different headers and the data inside the e-mail are linked based upon the word frequencies of each word or character in the data set. The mapping also enabled us to obtain assertions about the legitimacy and non-legitimacy of the e-mail. The next part was using this ontology to decide whether a new e-mail is a spam or legitimate. This required querying of the obtained ontology which was again done through Jena. The output obtained after querying was the deci- sion whether the new e-mail is a spam or legitimate. In summary, test e-mail is checked whether it is spam or legitimate based on global ontology created with training data set. In the personalized ontology spam filter, most of the procedure of spam filtering are the same as the global ontology spam filter. Additionally, it uses a user profile ontology created with user’s spam report. With the help of adaptive user profile ontology, total spam filtering rate (the correct classification percentage) will be increased. The primary way in which a user can provide the necessary feedback to the system would be through a GUI or a command line input with a simple ’yes’ or ’no’. This would all be a part of a full-fledged working system as opposed to our prototype, which is a basic research, experimental system. 90 6.3.1 Personalized Ontology Filter Implementation Personalized ontology filter is created as shown on Algorithm 3. Using a user profile ontology, the context and preferences of a specific user would be adapted in the feature selection procedure. A user profile ontology includes a list of people to block their e-mail and a list of words to block the e-mails related with some topic that is disliked by user. These blacklists and words will be combined with the words that were selected from the tfidf as shown in the algorithm. 
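Before the pseudo code (Algorithm 3, reproduced just after this sketch), here is a minimal Java illustration of how such a user profile, a sender blacklist plus disliked topic terms, might be merged with the tfidf-selected features and applied to incoming mail. The class and method names are illustrative assumptions, not the dissertation's implementation.

```java
import java.util.*;

// Illustrative user profile: blocked senders (stored lower-case) and disliked topic terms.
class UserProfile {
    Set<String> blockedSenders = new HashSet<>();
    Set<String> blockedTopicTerms = new HashSet<>();
}

public class PersonalizedFeatures {

    // Merge the top-k tfidf terms with the user's topic terms, preserving order
    // and removing duplicates, to form the feature set used for classification.
    static List<String> selectFeatures(List<String> tfidfTerms, UserProfile profile) {
        LinkedHashSet<String> features = new LinkedHashSet<>(tfidfTerms);
        features.addAll(profile.blockedTopicTerms);
        return new ArrayList<>(features);
    }

    // Senders on the "do not receive" blacklist are spam outright;
    // everything else goes on to the ontology filter as usual.
    static boolean isBlacklisted(String sender, UserProfile profile) {
        return profile.blockedSenders.contains(sender.toLowerCase());
    }
}
```

Note that, as described above, topic terms only enter the feature set (so a matching e-mail is not automatically spam), whereas blacklisted addresses short-circuit the decision.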
Algorithm 3 Personalized ontology filter pseudo code
 1: // Initialize variables
 2: set training dataset d to d_1, ..., d_n
 3: set test dataset t to t_1, ..., t_p
 4: set normalized values v to v_1, ..., v_m
 5:
 6: Feature (f: f_1, ..., f_k) <- tfidf(d)
 7: Feature (f: f_k+1, ..., f_m) <- UserProfileOntology(u)
 8: BlockedContactList (b: b_1, ..., b_r) <- UserProfileOntology(u)
 9:
10: for each (f: f_1, ..., f_m) {
11:   for each (d: d_1, ..., d_n) {
12:     (n: n_1, ..., n_m) <- Normalize(f, d)
13:   }
14: }
15: for each (n: n_1, ..., n_m) {
16:   result <- C4.5(n, d)
17: }
18:
19: Ontology() <- Jena(RdfConversion(result))
20:
21: for each (t: t_1, ..., t_p) {
22:   if (Ontology(t_i) == 1) then
23:     decision = SPAM
24:   else
25:     decision = LEGITIMATE
26: }

6.3.2 Experimental Results

In our experiment, the traditional C4.5 decision tree filter with features selected by tfidf (called the global ontology spam filter here) was compared with the personalized ontology spam filter created using 200, 400, and 600 user data sets. A test set of 2108 e-mails (1008 spam and 1100 legitimate) was used.

All the experimental results are summarized in Figure 6.10. Figure 6.9 shows the ROC curves of the personalized ontology spam filter with the 200, 400, and 600 user data sets, respectively. A Receiver Operating Characteristic (ROC) curve is a graphical plot of the sensitivity vs. (1 - specificity) of a binary classifier as its discrimination threshold is varied. As we expected, a personalized filter created with a user profile ontology learns as the training data set grows; the personalized filter trained with 600 e-mails shows the best performance. By using a personalized filter, we can improve the performance of spam filtering, as can be seen in Figures 6.9 and 6.10. We measured precision, recall, and correct classification rate throughout the experiment. The ROC curves in Figure 6.9 show that the filter learns with more training data, so the result with the 600 e-mail training set is better than those with 200 or 400. We compared the personalized ontology spam filter with the traditional C4.5 decision tree filter (the global ontology spam filter). As shown in Figure 6.10, the personalized ontology spam filter gives better results than the global ontology spam filter on spam recall, spam precision, legitimate recall, legitimate precision, and correct classification rate. From this result, adding the gray e-mail decision mechanism increased the spam filtering performance. When we increased the user data set for the personalized ontology spam filter, the improvement leveled off at some point. We can also see that the system evolves with the personalized ontology filter created from each user's training data. The following abbreviations are used in Figures 6.9 and 6.10:

• POSF 200 - Personalized Ontology Spam Filter with 200 user training e-mails
• POSF 400 - Personalized Ontology Spam Filter with 400 user training e-mails
• POSF 600 - Personalized Ontology Spam Filter with 600 user training e-mails
• Global - C4.5 decision tree filter (global ontology spam filter)

Figure 6.9: Experimental Results (ROC)
Figure 6.10: Experimental Results

Precision and Recall were used as the metrics for evaluating the performance of each e-mail classification approach.
Recall = N_ii / N,    Precision = N_ii / N_i

where
N    = total number of interesting e-mails
N_i  = number of e-mails classified as interesting
N_ii = number of interesting e-mails classified as interesting

6.4 Conclusion

In this chapter, we presented a method for making spam decisions on gray e-mail using personalized spam ontology filters. Global classifiers learned for a large population of users can leverage the data provided by each individual user across thousands of users. Proponents of personalized classifiers argue that statistical text learning is effective because it can consider the unique aspects of each individual's e-mail. There is a trade-off between globally- and personally-trained anti-spam classifiers. It is believed that globally-trained filters outperform personally-trained filters for both small and large collections of users in a real environment. However, globally-trained filters sometimes ignore personal data: they cannot retain personal preferences and contexts as to whether a feature should be treated as an indicator of legitimate e-mail or of spam. Hence, we used a personalized filter to make decisions based on personal preferences and context.

Gray e-mail can be considered good e-mail by some people and bad e-mail by others. Hence, a personalized filter is required to handle gray e-mail decisions, by considering different user preferences or by providing learning filters that combine different training/testing policies. In the experiment, the personalized ontology spam filter improved spam filtering performance by including the gray e-mail decision mechanism. As we explained, new spamming techniques appear continuously while traditional spamming techniques remain prevalent; spamming techniques keep advancing, so spam filtering techniques must catch up with them. Even if some spam e-mails are classified as legitimate, we must avoid the situation in which valuable legitimate e-mail is classified as spam, and considering personal data helps prevent this.

In the current system, tfidf was implemented as the feature selection algorithm and the C4.5 decision tree as the classifier. Various feature selection and classification algorithms should be compared in the future to find the best system configuration. We also need to find more convenient ways to build a user profile ontology. In the future, we will experiment with the combination of a general corpus data set and our data set for generality.

Chapter 7 Contribution and Conclusion

7.1 Anticipated Contribution

In this work, two levels of ontology spam filters were implemented: the global ontology filter and the user-customized ontology filter. The global ontology filter, created using tfidf, the C4.5 decision tree (J48) algorithm, etc., showed a spam filtering rate of about 91%, which is comparable with other authors' similar work. Additionally, a user-customized ontology filter was utilized as a second-level filter. The user-customized ontology filter was created based on the specific user's background as well as the filtering mechanism used in the global ontology filter creation.
The main contributions of the proposed system are to introduce an ontology-based multi-level filtering technique that uses both a global ontology and an individual filter for each user to increase spam filtering accuracy, and to create an adaptive, learning spam filter in the form of an ontology, which is user-customized, scalable, and modularized, so that it can be embedded in many other systems for better performance. The proposed system also handles image spam by integrating OCR into the text-based anti-spam filtering system. By handling e-mail containing text embedded in images, the proposed system can be applied, at least in part, to both text e-mail and image e-mail. The experiment was somewhat restricted, but it demonstrates the potential of the proposed system in that restricted environment.

The proposed system also considers gray e-mail. Because gray e-mail may be good e-mail for some people and bad e-mail for others, a personalized filter is required to handle gray e-mail decisions, either by considering different user preferences or by providing learning filters that combine different training/testing policies. In the experiment, the personalized ontology spam filter improved spam filtering performance by including a gray e-mail detection mechanism.

Through a set of experiments, we showed that a better spam filtering rate can be achieved using the user-customized ontology filter, which is adaptive and scalable. The same idea was applied here to text-oriented e-mail data sets, but it can also be used for other classification or clustering tasks.

7.2 Conclusion

The central objective of this thesis was to use an ontology to help classify e-mails, and this was implemented successfully. The motivating observation was that this approach opens up a whole new aspect of e-mail classification on the semantic web. Because the approach is generic in nature, it fits into many systems and should benefit future systems as well.

Classification accuracy can be further increased by pruning the tree and by using better classification algorithms, more or better classifiers, and better feature elements. These are primarily machine learning and artificial intelligence issues; they were not the primary concern of this work, but they did help improve classification. The work is still a research model, and its accuracy can be improved later.

Ontologies play a key role here, since e-mail is classified through the ontology we created, and more work can be done on creating intelligent ontologies and ontologies that can be used in particular areas of decision making. The ontologies were created with Jena, which is only one approach to ontology creation. Other, possibly better, techniques could create ontologies without Jena or in a format that is more flexible and more open to intelligence.

This dissertation, as mentioned earlier, is research-oriented: it involved testing a particular interfacing approach and checking the feasibility of classifying e-mail through ontologies. The main challenge was converting decision tree classification outputs to RDF and handing them to Jena, that is, interfacing two independent systems and creating a prototype that actually uses the information flowing from one system to the other to obtain the desired result. In our case, that result was the classification of e-mail.
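To make the interfacing step more concrete, the following is a minimal sketch, assuming the Jena 2.x API (com.hp.hpl.jena.rdf.model); the namespace, the property names, and the example rule are placeholders, not the vocabulary actually produced by the system.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;

    // Sketch: turning one decision tree rule ("word_free > 2 => spam") into RDF
    // statements and handing them to Jena. Namespace and property names are
    // illustrative placeholders.
    public final class RuleToRdf {
        private static final String NS = "http://example.org/spam-ontology#";

        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();

            Property onFeature  = model.createProperty(NS, "onFeature");
            Property threshold  = model.createProperty(NS, "threshold");
            Property classifies = model.createProperty(NS, "classifiesAs");

            Resource rule = model.createResource(NS + "rule1")
                    .addProperty(onFeature, "word_free")
                    .addProperty(threshold, "2")
                    .addProperty(classifies, "spam");

            model.write(System.out, "RDF/XML");   // serialized RDF handed to the ontology layer
        }
    }

A full conversion would emit one such resource per decision tree rule and persist the model instead of printing it.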
One aspect of this work that can be improved in the future is that the e-mail we use must be provided in a particular Comma Separated Values (CSV) format, which is a requirement of our Jena-based pipeline. As stated above, using ontologies to help e-mail classification was the central objective of this work, and those ontologies were successfully implemented; because they are generic in nature, the approach fits into many systems and opens up a new aspect of e-mail classification on the semantic web, and the technique introduced here should benefit future systems. As also mentioned, classification accuracy can be increased further by pruning the tree and by using better classification algorithms, classifiers, and feature elements; these are larger machine learning and artificial intelligence issues that lie beyond the primary scope of this work, although they helped improve classification.

In this research, two levels of ontology spam filters were implemented: a first-level global ontology filter and a second-level user-customized ontology filter. The global ontology filter filtered about 91% of spam, which is comparable with other methods. The user-customized ontology filter was created from the specific user's background as well as the filtering mechanism used to create the global ontology filter.

We added image spam handling capability by integrating OCR into the text-based anti-spam filtering system. By handling e-mail containing text embedded in images, the proposed system can be applied, at least in part, to both text e-mail and image e-mail. The experiment was somewhat restricted, but it demonstrates the potential of the proposed system in that restricted environment. However, to cope with image e-mail thoroughly, we need to adopt advanced image processing techniques, so that we can confront image obscuring techniques such as waving, animation, deformation, and rotation.

A personalized filter is required to handle gray e-mail decisions by considering different user preferences or by providing learning filters that combine different training/testing policies. In the experiment, the personalized ontology spam filter improved spam filtering performance by including a gray e-mail detection mechanism.

All experiments were performed with the default settings in Weka. More extensive experiments with different settings are possible in Weka, algorithms not included in Weka can be tested, and various feature selection techniques should also be compared. We implemented the dynamic ontologies as a spam filter based on the classification result; the ontology then evolves and is customized based on the user's report whenever a user submits a spam report.
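A rough sketch of what such an update could look like is shown below; the namespace, the property name, and the helper class are hypothetical and only illustrate the idea of recording a user's spam report in the profile ontology (again assuming the Jena 2.x API).

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;

    // Sketch: evolving a user profile ontology when the user reports a message
    // as spam. The profile model, namespace, and property name are illustrative.
    public final class ProfileUpdater {
        private static final String NS = "http://example.org/user-profile#";

        // Records the reported sender in the user's blocked-contact list, so
        // later filtering decisions can reflect the report.
        public static void reportSpam(Model profile, String userId, String sender) {
            Property blocks = profile.createProperty(NS, "blocksContact");
            Resource user = profile.createResource(NS + userId);
            user.addProperty(blocks, profile.createResource("mailto:" + sender));
        }
    }

In a full system the updated model would also have to be persisted and the personalized filter rebuilt from it.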
By creating the spam filter in the form of an ontology, the filter becomes user-customized, scalable, and modularized, so it can be embedded in many other systems for better performance. We need to focus more on misclassified legitimate e-mail than on misclassified spam e-mail, so that we do not lose legitimate e-mail even if some spam reaches the mailbox. Also, because the filter keeps learning, over-fitting can occur, and this problem needs to be considered. Finally, many current filters are tuned to particular data sets, so we need to develop a more general filter that performs well on most data sets.
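As one routine guard against over-fitting, a cross-validation run with Weka's standard API could be used; the sketch below assumes an ARFF export of the e-mail feature vectors (the file name is a placeholder) and the J48 implementation of C4.5 with default settings.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    // Sketch: 10-fold cross-validation of the C4.5 (J48) classifier with Weka
    // default settings, as a rough over-fitting check. "emails.arff" is a
    // placeholder for an ARFF export of the e-mail feature vectors.
    public final class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("emails.arff")));
            data.setClassIndex(data.numAttributes() - 1);    // class label is the last attribute

            J48 tree = new J48();                            // C4.5 with default settings
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));

            System.out.println(eval.toSummaryString());      // includes correct classification rate
        }
    }

A large gap between training accuracy and cross-validated accuracy would be one signal that the learned filter is over-fitting its training data.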