CONTEXT-BASED INFORMATION AND TRUST ANALYSIS by Vesile Evrim A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) December 2009 Copyright 2009 Vesile Evrim ii Acknowledgements This thesis would not have been possible without the immense support that I received during the course of my PhD. First of all I would like to thank to my parents Hicran and Erol Evrim and my brothers Duygu and Utku Evrim for their unconditional love and support during my education. I must express my deep gratitude to my supervisor Dennis McLeod for his encouragement and support during this research. I would also like to thank to my committee members Barry Boehm and Larry Pryor whose comments and advice helped to improve the quality of this thesis. I also would like to acknowledge that I have been fortunate to have friends who helped keep me going through the peaks and the valleys that inevitably occur during any Ph.D. iii Table of Contents Acknowledgements ii List of Tables vi List of Figures vii Abstract viii Chapter 1: Introduction 1 1.1 Introduction 1 1.2 Proposed Solution 2 1.3 Thesis Outline 3 Chapter 2: Information Retrieval 5 2.1 Introduction to Information Retrieval 5 2.1.1 Indexing 6 2.1.2 Information Retrieval Methods 8 2.1.3 Relevance Feedback 13 2.1.4 Query Expansion 14 2.1.5 Evaluation 16 2.1.6 Text Retrieval Conference 21 2.2 IR on the Web 22 2.2.1 Web Search Tasks 23 2.2.2. Web IR Approaches 25 2.2.2.1 Link Analysis 25 2.2.2.2 Textual Representation 27 2.2.3 Enhancing IR with Context 27 Chapter 3: Ontologies 30 3.1 What is Ontology? 30 3.1.1 Types of Ontologies 31 3.2 Building Ontologies 33 3.2.1 Ontology Languages 33 3.2.1.1 The OWL Ontology Language 34 3.2.2 Ontology Development Tools 35 3.2.2.1 OntoEdit 35 3.2.2.2 Protégé 36 iv 3.2.2.3 Hazo 36 3.3 WordNet Ontology 37 3.3.1 Structural Aspects of WordNet 37 3.3.2 Semantic Relatedness in WordNet 38 3.3.3 WordNet Usage 40 3.4 Ontologies and Information Retrieval 40 Chapter 4: Trust 43 4.1 What is Trust? 
43 4.2 Trust in Web 44 4.3 Trust and Information Retrieval 46 4.3.1 Document Credibility Measures 48 Chapter 5: Context-Based Information and Trust Analysis 51 5.1 Introduction 51 5.2 System Architecture 53 5.3 Query Expansion by using WordNet 55 5.3.1 Phrasal Map 56 5.3.2 Selecting Sense from Synsnet 59 5.4 Customizing User’s Information Request 60 5.4.1 Interest Ontology 61 5.4.2 Domain Ontologies 62 5.4.2.1 Health Domain Ontology 62 5.4.2.2 Terrorism Domain Ontology 66 5.5 Mapping form WordNet to Domain Ontology 70 5.5.1 Syntactic and Semantic Map from WordNet 71 5.6 Ontology Categories and Context 77 5.7 Document Set 78 5.7.1 Advantage/Disadvantage of using Web Documents 79 5.8 Ontology to Document Map 81 5.8.1 Capturing Concepts in the Document 81 5.9 Finding Content Relevant Documents 82 5.9.1 Proximity of Mapped Terms 83 5.9.1.1 Span-based Proximity Measure 84 5.9.1.2 Pair-wise Proximity Measure 85 5.9.2 Content Relevance Value of the Document 87 5.9.3 Content extraction from HTML Documents 89 5.10 Trust Relevancy of Documents 89 5.11 Document Relevance 96 v Chapter 6: Experimental Results 97 6.1 Experiments by Using CONITA 97 6.2 Setup of the Experiment 98 6.3 Metadata Evaluation of Surveys 101 6.3.1 Survey Statistics in Health Domain 102 6.3.2 Survey Statistics in Terrorism Domain 104 6.3.3 Summary of the Survey Statistics 106 6.4 Evaluations of Queries 107 6.5 Summary of Evaluation Results 115 Chapter 7: Conclusion and Future Work 117 7.1 Conclusion and Contributions 117 7.2 Future Work 120 References 122 vi List of Tables 1 Noun relations in WordNet 38 2 Algorithm for mapping query to WordNet 58 3 Algorithm for calculating content relevancy 82 4 Health domain queries and IR systems they are tested 103 5 User and query statistics for Health domain surveys 104 6 Terrorism domain queries and IR systems they are tested 105 7 User and query statistics for Terrorism domain surveys 105 8 Ranking Performance of CONITA in Health domain 109 9 Success of CONITA in Health domain is compared to 110 (Google/Yahoo) by using NDCG evaluation measure 10 Ranking Performance of CONITA in Terrorism domain 112 11 Success of CONITA in Terrorism domain is compared to 113 (Google/Yahoo) by using NDCG evaluation measure 12 Success of CONITA is compared to Google and Yahoo on the 114 same data set shared by all. Success of CONITA is compared to Yahoo and Google through all the surveys vii List of Figures 1 Information Retrieval System (Baeza-Yates) 5 2 Ontology kinds, according to their level of dependence on a 31 Particular task or point of view [Gruber, 1997] 3 Context-Based Information Trust Analysis (CONITA) Framework 55 4 a) Customized NCI ontology, b) Integrated Interest and NCI 65 Ontologies 5 a) Terrorism ontology, b)Integrated Interest and Terrorism 69 ontologies 6 a)Two sibling qmt subclasses, b)qmt subclass is ancestor 72 of the other, c) qmt term and qmt subclass 7 a)Direct associative relation, b)Associative relation to a parent 75 c)Associative relation from a parent, d)Associative relation between the parents. 8 Trust Relevancy (Credibility) of the documents based on the first, 94 Second and third hand trust information viii Abstract Finding the relevant set of information that satisfies an information request of a Web user in the availability of today’s vast amount of digital data is becoming a challenging problem. Currently available Information Retrieval (IR) Systems are designed to return long lists of results only a few of which are relevant for a specific user. 
This thesis presents an IR method called CONITA that investigates the context information of the user and user’s information request to provide relevant results for the given domain users. In this thesis relevance is measured by the semantics and credibility of the information provided in the documents. The information extracted from lexical and domain ontologies are integrated by the user’s interest information to expand the terms entered in the request. The obtained set of terms is categorized by a novel approach and the relations between the categories are obtained from the ontologies. This categorization is used to improve the quality of the document selection by going beyond checking the availability of the words in the document but by analyzing the semantic composition of the mapped terms. We also introduced a novel approach to measure the credibility of the Web documents retrieved for the domain users. The proposed approach combines user’s feedback with provenance, reputation and recommendation information of the Web sources to calculate the credibility score of the documents. 1 Chapter 1 Introduction 1.1 Introduction As the amount of available digital information grows, the ability to search for information becomes increasingly critical. Document retrieval on the World Wide Web (Web), with over 29 billion pages [WebPages, 2009] is a challenging task for more than 1.5 billion users [InternetStat, 2009]. The information providers on the Web follow few formal protocols, often remain anonymous and publish in a wide variety of formats. There is no central registry or repository of the Web’s contents and the documents often misrepresent their content as some Web authors seek to unbalance ranking algorithms in their favor for personal gain [Henzinger et al., 2002]. In order to retrieve the information, Web users use search engines. Currently the most popular search engines used by the Web users to retrieve the information are Google and Yahoo [InternetStat, 2009]. A widely accepted de-facto standard for web search is a simple query form that accepts keywords and returns document locators (URLs). In general, keyword-based query articulation is difficult. Therefore typical 2 queries are short, comprising an average of two to three terms [Spink et al, 2001; Gulla et al, 2002] per query. Given the user query, the key goal of an Information Retrieval (IR) system is to retrieve information which might be useful or relevant to the user. However, inferring user preferences from a few keywords is a difficult task. In fact, most state of the art retrieval models ignore this problem altogether, and simply treat queries and documents as a bag of words. For example, the user who is searching for a word “java” might be looking for information about Java Island, java coffee bean or java programming language but the search engine returns the same set of documents regardless of the interest of the information requester. The quality of the returned documents becomes even more important when the user is a domain expert and the information is going to be used for critical decisions. For example, the experts of domains such as Terrorism and Health are often interested in finding the most up-to-date information to collect the new data from Web. For these users finding the most relevant set of documents to satisfy their request is crucial. Therefore there is a need of an IR system to analyze the user and the context of user’s request to return the best relevant set of documents. 
1.2 Proposed Solution Given a domain user with an information need, the objective of this thesis is to analyze the domain and the information requests of the user to retrieve the most relevant set of documents to satisfy a user’s request. The relevancy of the documents in this thesis is measured in two fold: Content relevance and Credibility. 3 In order to satisfy the user’s information need, the semantics behind the user’s request need to be observed. We used lexical and domain ontologies with user’s interest to extract the semantic information of the query terms entered by the user. However, semantically expanding the query terms by itself does not help to overcome all the problems of search engines. Currently available search engines do not consider the contextual proximity of the words while processing the documents. It is common for Google/Yahoo users to receive high ranked documents that contain query words in multiple unrelated contexts. Therefore, once the semantic information about the user and the request are collected, we combined the semantically expanded words into categories to keep the contextual structure of the extracted information in mapping process. This thesis also introduces a trust evaluation method to measure the credibility of documents. Although trust is a subjective term, we have used Stanford et al, (2002) experiment results to reduce the dimensions of the subjectivity. The proposed approach combines user feedback with provenance, reputation and recommendation of the sources in the Web environment. 1.3 Thesis Outline This thesis is organized into seven chapters followed by the references: • Chapter 2 introduces the concepts from IR that this thesis relies on. In particular, concepts from classical IR such as indexing, and retrieval are introduced, and approaches for relevance feedback and query evaluation are explained. We describe how IR systems are evaluated, before moving on to describe how the advent of the Web has brought new concepts, and problems. 4 • Chapter 3 introduces the concepts relating to Ontologies that is used in this thesis. In this section, types of ontologies are introduced and followed by the languages, and the development tools used in building ontologies. The structure of lexical ontology WordNet is explained and followed by the research on ontologies and information retrieval. • Chapter 4 defines the concept of trust that is used to measure the credibility of the documents in this thesis. The role of trust in the web and information retrieval is introduced and followed by the document credibility measures. • Chapter 5 presents the research methodology, Context-Based Information and Trust Analysis (CONITA) framework, and the novel techniques proposed in this research. The expansion of the user’s query by using WordNet, health and terrorism domain ontologies and the mapping techniques between these ontologies are explained. The document set used as a source is discussed and the mapping between the expanded query terms and the document set are explained by the introduction of the proximity algorithm used in this research. Finally, content and trust relevance of the document and how they are combined in document relevance are explained in detail. • Chapter 6 presents the analysis and comparison of the experimental results. In this chapter the setup of the surveys provided for the user evaluation are explained. 
CONITA’s success is evaluated based on the survey results in both health and terrorism domains and compared to Google and Yahoo search engines. • Chapter 7 forms the conclusion and the contributions of this research followed by the future work directions. 5 Chapter 2 Information Retrieval 2.1 Introduction to Information Retrieval Information Retrieval (IR) deals with the structure, analysis, organization, storage, searching, and retrieval of information [Baeza-Yates & Ribeiro-Neto, 1999]. The representation and organization of the information items returned by the IR system should provide the user an easy access to the information in which he is interested. A typical Information Retrieval system takes the set of documents, a user information request (query) and retrieves a set of ranked items (e.g., documents) that satisfy the user’s information need [Figure 1] Figure 1: Information Retrieval System (Baeza-Yates) 6 Through the decades, much of experimentation has been done to investigate effective means of ranking and matching items with queries. In late 1950’s the strategy of using the terms in the document collection to create the index was explored [Luhn, 1957]. This method based on the user’s guess on one word out of many used in the document and often retrieved too many documents that were irrelevant [Witten, Moffat, & Bell, 1994]. The main challenges of IR research lie in effectively assessing the information need and the document contents as well as the relationship between them. Thus the suitable representation of information need, and the document content must be created such that the collection can be effectively searched. The advent of the Web turns the focus on IR even more than before. Web users must search for information among billions of pages (Apert & Hajaj, 2008). As the number of current users of the Web exceeds a billion [InternetStat, 2009], each with their own information need, the necessity for an effective means of ranking documents becomes more urgent. The user’s satisfaction depends on the relevancy of the returned items to his request and the time he needs to find the relevant items. Thus, the effective means of matching items with queries can be evaluated by measuring the success of ranking the relevant documents higher than the non-relevant ones. 2.1.1 Indexing The issue of representation is a crucial aspect of IR in that it affects not only the correct assessments of information need and document contents but also the effective matching of the two, all of which are essential for successful IR [Yang, 2002].The process of 7 effectively determining which documents from a corpus match to a given query is called indexing .In IR the representation of the documents based on the index is called an inverted index [Van Rijsbergen, 1979]. An index to a document acts as a tag by means of which the information content of the document in question may be identified. The index may be a single term or a set of terms which together tag or identify the content of each document. The terms that constitute the allowable vocabulary for indexing documents bridges the gap between the information in documents and the information requirements. As the selection of the proper terms representing the document becomes crucial, studies focused on the term occurrence characteristics (i.e. position, frequency). For example, a term can occur in the text once, can be a stem or might appear in different parts of the text (title, footnote, abstract). 
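As an illustration of the inverted index just described, the short sketch below maps each term to the set of documents that contain it. The toy documents and helper name are invented for the example; a real system would also record positions and frequencies and apply the stopword and stemming steps discussed next.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical miniature collection.
docs = {
    "d1": "Trust analysis of web documents",
    "d2": "Context based trust and information analysis",
}
index = build_inverted_index(docs)
print(sorted(index["trust"]))       # documents containing the term "trust"
print(sorted(index["analysis"]))
```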
The decision of what constitutes a term greatly influences the retrieval outcome. The simplest way to construct a term index is to identify all the unique terms in the document set. However, that does not provide much information about which terms better represent the document. Luhn (1958) excludes the high-frequency words (e.g., the, a, has, ...) called "stopwords" from the collection of words. The elimination of stopwords reduces the size of the resultant index structures. Frequently, the words a user provides in the query do not exactly match the words in the documents, since only morphological variants (e.g., analysis, analyzed, analyzing) of these words are present in the relevant documents. In order to combine the morphological variations of a word under a single term, most IR systems adopt a strategy called stemming to conflate word variations before creating a term index. The first stemming algorithm [Lovins, 1968] had an influence on later work, especially Porter's stemming algorithm (Porter, 1980), which became one of the best-known and most commonly used stemmers. In order to combine morphological variants, stemmers reduce words to their root form and index them under the same entry. The remaining set of terms is called a "bag-of-words", as a term can occur more than once in a given document. 2.1.2 Information Retrieval Models Once the documents matching the query are identified by the indexing, there is a need to rank the documents in response to the user query. A successful IR system should be able to rank the documents in the collection in decreasing order of relevance. How the "relevant" documents are identified, as well as how terms are weighted, depends largely on the underlying assumptions of the IR model. Various IR models exist for ranking documents with respect to a query, and they differ in the way they generate weightings. Several classical models exist, such as the Boolean, Vector Space, and Probabilistic models. The Boolean model is often referred to as the "Exact Match" model, and the Vector Space and Probabilistic models as "Best Match" models [Belkin, 1992]. One of the first methods implemented in IR systems is the Boolean model [Cooper, 1988]. In this model, query terms are expressed with Boolean operators (e.g., AND, OR) and compared with the inverted index of terms that represent the documents. The ability to combine abstract concepts and the fast response time quickly made the Boolean model popular among scientists. However, querying with logic never gained popularity with the general public. In addition, the rigid structure of Boolean systems returns either too many or too few documents based on the presence or absence of a single term [Bookstein, 1985; Cooper, 1988]. In the late 1960s Salton approached IR with a different search strategy called the Vector Space model [Salton, 1968]. In this model, both queries and documents are represented as vectors in an n-dimensional space where each dimension corresponds to an entry in the inverted index. In contrast to the Boolean model, where the retrieval outcome is based on a binary judgment of the match between the query and the inverted index of a document, the Vector Space model ranks the documents based on their similarity to the query, taking the angle between the document vector and the query vector as a measure of similarity.
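The angle comparison in the Vector Space model is usually realized as the cosine of the angle between the two vectors, introduced in the next paragraph as cosine similarity. The following is a minimal, purely illustrative sketch using raw term counts and invented toy documents, not the weighting actually used by any particular system.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection; each document becomes a bag-of-words vector.
docs = {
    "d1": "java island coffee bean",
    "d2": "java programming language tutorial",
}
doc_vectors = {d: Counter(text.split()) for d, text in docs.items()}
query_vector = Counter("java programming".split())

for d, vec in doc_vectors.items():
    print(d, round(cosine(query_vector, vec), 3))
```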
As in the case of term weighting, various models have been proposed to calculate the query-term similarity [McGill & Huitgeldt, 1979], and the choice of measure has been shown to have a significant effect on the performance of IR [Harman, 1986]. One of the most widely used similarity measures is the cosine similarity, which is the inner product between vector elements normalized for vector lengths. The Vector Space model follows the Best-Match methodology: it does not require all query terms to exist in a document, and is instead able to rank documents according to a user's query. Although that is an improvement over the Exact-Match Boolean model, there is no definitive solution as to how the similarity should be calculated or how the term weights should be assigned. The orthogonality of the vector space implies the lack of relations between the terms, which is not necessarily correct since terms can be dependent via synonymy and polysemy. The Probabilistic model (PR) is similar to the Vector Space model in its representation of documents and queries as vectors. However, the Probabilistic model retrieves documents based on their probability of relevance to the queries instead of their similarity to the queries [Maron & Kuhns, 1960]. The PR model calculates the term weights, which define the probability of relevance of documents, based on data about the distribution of query terms in documents that have been assessed for relevance, and ranks the documents in order of decreasing probability of relevance to a user's information request. Each IR model can generate different weighting models for the documents. Given the premise that the more frequently a term occurs in a document, the more important that term is within the document, almost all weighting models take term frequency (tf) into consideration as a basic feature for document ranking. TF-IDF is the most well-known weighting model among the systems using the Best-Match strategy [Salton, 1971]. The score of each document in TF-IDF is calculated by the following formula:

\mathrm{Score}(d, Q) = \sum_{t \in Q} tf \cdot \log_2 \frac{N}{N_t}    (1)

In this formula tf is the term frequency in document d of term t of query Q. The \log_2(N / N_t) part of the formula represents the Inverse Document Frequency (IDF), in which N is the number of documents in the collection and N_t is the number of documents in which term t appears. Essentially, TF-IDF works by determining the relative frequency of a word in a specific document compared to the inverse proportion of that word over the entire document corpus. Intuitively, this calculation determines how relevant a given word is in a particular document. Words that are common in a single document or a small group of documents tend to have higher TF-IDF scores than common words such as articles and prepositions. The tf component of formula (1) is often replaced by a normalized term frequency (tfn). This is based on the observation that the same term is usually used many times in long documents and that long documents usually have a large vocabulary [Singhal, 1996]. Therefore, state-of-the-art weighting models integrate a normalization component into the TF-IDF model to overcome the length bias problem. Robertson (1977) assumed that the probability of relevance of a document to a query is independent of other documents and posed the probability ranking principle (PRP). By application of Bayes' theorem, and the assumption that the occurrences of terms within a document are independent, it is possible to derive a term weighting model similar to formula (1).
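Formula (1) can be transcribed directly. The following sketch is an illustration of the formula rather than the system described later in this thesis; the three-document collection, the query, and the function names are invented for the example.

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_terms, doc_freq, n_docs):
    """Score(d, Q) = sum over t in Q of tf(t, d) * log2(N / N_t), as in formula (1)."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        n_t = doc_freq.get(t, 0)
        if n_t:                       # terms absent from the collection contribute nothing
            score += tf[t] * math.log2(n_docs / n_t)
    return score

# Hypothetical three-document collection.
collection = {
    "d1": "java coffee bean java island".split(),
    "d2": "java programming language".split(),
    "d3": "island travel guide".split(),
}
doc_freq = Counter(t for terms in collection.values() for t in set(terms))
query = "java programming".split()

for doc_id, terms in collection.items():
    print(doc_id, round(tfidf_score(query, terms, doc_freq, len(collection)), 3))
```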
(There are Probabilistic models derived from the PRP, but they are beyond the scope of this thesis.) The Probabilistic model suffers from the same limitation as the Vector Space model, namely the assumption that the terms in a document are independent of each other. This assumption is introduced for the sake of computational simplicity but does not reflect reality. Efforts to accommodate relationships among the terms in a PR model have yet to yield significant improvements in retrieval performance [Van Rijsbergen, 1977; Salton, Buckley & Yu, 1983]. In the early 1990s, a group of scientists suggested performing query retrieval using a matrix algorithm called Latent Semantic Indexing (LSI) [Berry, Dumais & O'Brien, 1994]. LSI uses a statistical method called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. Accordingly, it considers pages that have many words in common to be close in meaning (semantically close) and pages with few words in common to be semantically distant. The LSI algorithm does not understand anything about what the words mean, but queries and documents that have undergone LSI return results that are conceptually similar in meaning to the search criteria even when the results do not share specific words with the search criteria. By accounting for term dependencies, LSI addresses the two severe constraints of term-matching methods, polysemy and synonymy, which are often the cause of mismatches between the vocabulary used by the authors of documents and that used by the users of information retrieval systems [Furnas, 1987]. LSI is also used to perform automated document categorization. Document categorization is the assignment of documents to one or more predefined categories based on their similarity to the conceptual content of the categories [Dumais, 1998]. Similarly, clustering is a way to group documents based on their conceptual similarity to each other without using example documents, and can also be achieved using LSI. Clustering is especially useful when dealing with an unknown collection of unstructured text, which is what mostly constitutes the corpus of a document collection in IR. 2.1.3 Relevance Feedback Relevance feedback (RF) is not really a retrieval method but rather a method for enhancing the query process [Rocchio, 1966; Salton, 1971]. The process starts with presenting the user a set of ranked results, which the user is asked to evaluate for relevance. Once the user identifies which documents are relevant to his information need, the query is reformulated and submitted for another iteration of retrieval and relevance judgment. There are three forms of relevance feedback: • Explicit Relevance Feedback: This type uses the set of documents selected by the user to identify the important terms attached to the relevant documents, and similarly the negative evidence behind the irrelevant documents, in the reformulation of the new query. Based on the feedback, the query is expanded and/or the weights of the terms in the query are readjusted [Baeza-Yates & Ribeiro-Neto, 1999]. • Implicit Relevance Feedback: In this form, the user does not explicitly judge the relevance of the documents; instead, the duration spent viewing a document and page browsing or scrolling actions are used for the reformulation of the query [Kelly & Teevan, 2003].
• Blind Feedback: Blind feedback is also called “pseudo” relevance feedback. In this form, relevance feedback is obtained assuming that the top k ranked documents are relevant and performance is increased by learning from these top k documents [Kwok, 1984; Robertson, 1990; Xu & Croft, 2000]. The Blind Feedback has an advantage compared to explicit and implicit feedback by 14 automating the manual part of relevance feedback and not requiring any human interaction. Two common IR models used with RF method are the Vector Space Model and the Probabilistic model. In the VS model, the static index term weights are used to reformulate the query but in the PR model, dynamic Term Relevance (TR) weights are used. TR weights are defined by the probability that a given term will appear in the relevant document compared to the one that will appear in a non-relevant document [Robertson & Sparck Jones, 1976]. 2.1.4 Query Expansion Under the “bag-of-words” model, if a relevant document does not contain the terms that are in the query, then that document will not be retrieved. The aim of query expansion is to reduce this query/document mismatch by expanding the query using words or phrases with a similar meaning or some other statistical relation to the set of relevant documents. This can be done with user interaction or automatically without user interaction. • Global Expansion Method: Global extension techniques use the whole document set when calculating the set of expansion term and does not require any interaction with the user for the expansion. Thesauri are used in a global extension techniques, which have frequently been incorporated in information retrieval systems as a device for the recognition of synonymous expressions and linguistic entities that are semantically similar but superficially distinct. Many thesauri have been constructed manually [Voorhees, 1994; Smeaton & Berrut, 1996], and can only succeed if a domain-specific thesauri is used which 15 corresponds closely to a domain-specific document collection [Fox, 1980]. A typical entry in a manually built thesaurus contains the desired word and sets of related words grouped into different senses. However selecting an appropriate sense of the word is not easy work for machines [word-sense disambiguation]. A collection-dependent thesaurus is one that is automatically built using the term frequencies found with a document collection [Crouch, 1990; Chen & Schatz 1995]. Since the thesaurus is document collection dependent, any word relationships found will be based on documents that can be retrieved. The thesaurus can be constructed by the probabilistic methods to be used in the query expansion. Probabilistic query expansion usually based on calculating co-occurrences of terms in documents, and selecting terms that are most related to ones entered in the query. Some of the well known analysis techniques are Latent Semantic Indexing and Term Clustering. In these methods expansion terms are selected from clusters which contain query terms or from similarity matrix of terms that are most related to the query terms in that matrix. • Local Expansion Method: Local techniques extract their statistics from the top n documents returned by an initial query or by the documents selected by the users (i.e., uses only a subset of the document set to calculate the set of expansion terms). 
They might use some corpus-wide statistics such as the inverse document frequency but the calculations need to be fast since they are required to provide the new results to the user immediately. This category of query expansion is based on relevance feedback, as explained in the previous section. 16 Both probabilistic and local query expansion have the advantage of expanding the query based on all the words in the query. This contrast with a thesaurus-based approach, where individual words and phrases in the query are expanded and word ambiguity is a problem. Global analysis is inherently more expensive than local analysis since it is resource intensive. On the other hand, probabilistic expansion feeds a thesaurus construction that can be used for browsing without searching. In general expansion techniques use a co-occurrence relationship with original query terms and the appended terms are not scenario specific. The uncertainty of the query context, results by the addition of not only relevant but also the irrelevant terms to the queries, while increasing recall decreases the precision of the results. In order to solve this problem, Liu (2005) used a scenario-based query expansion in the medical domain with the help of a UMLS Metathesaurus which includes semantic types that define relationships between concepts. Although, Liu’s scenario-based query expansion only considers the parent, children and siblings of the concepts, by using a metatheasarus the much broader scope than previous synonym or parent/children concepts expansion are captured. 2.1.5 Evaluation Experimentation in IR is concerned with user satisfaction. The goal of the IR system is to maximize the effectiveness, such that the maximum number of relevant documents it retrieves, while minimizing the number of irrelevant documents retrieved. However the process of IR involves numerous factors that can influence its outcome, therefore making it difficult to determine the precise contributions of components. 17 The IR experiments are based on the document collection, a set of queries, and associated relevance judgments. Most of the experiments are conducted in labs settings that do not reflect the realities of actual IR environments. Regardless, the key element in the IR evaluation is the relevance – identification of documents in the collection that are “relevant” to a given set of queries. Furthermore, relevance is a complex notion that can be defined in many ways [Saracevic, 1975] and may not necessarily be the best measure of whether the information need has been satisfied [Cooper, 1973]. Evaluation in IT can be categorized in many different ways. It can be about effectiveness or the efficiency of the retrieval, it might be about the presentation of the information in the documents, the trust degree to the provided information or the source of the information. Briefly, it can be quantitative or qualitative. Rocchio (1971) described the notion of an optimal query formulation, where all relevant documents are ranked ahead of the irrelevant ones. However, he recognized that there is no way to formulate such a query. Instead, IR focuses on the formulation of weighting models and other techniques as described in the previous sections, which maximize the user satisfaction for a given query [Belew, 2000; van Rijsbergen, 1979]. The evaluation in IR has evolved from the Cranfield evaluation paradigm [Cleverdon, 1991]. 
The goal of evaluation is the assessment of the comparative performance of different indexing languages and search methods. This evaluation process involves the use of a corpus of documents and a set of test topics/queries. For each query, a set of relevant documents in the collection is identified by having assessors read the documents and ascertain their relevance to each query. The list of relevant documents for each test query is called the relevance assessments. The evaluated IR systems then create indices for the test collection, and return a set of documents for each test query. The IR system can then be evaluated by examining whether the returned documents are relevant to the query or not, and whether all relevant documents are retrieved. When the relevance assessments are available, one or several evaluation measures are used for the evaluation of the IR systems. The most important evaluation measures to a user population are recall, precision, mean average precision, F-measure, and fall-out [Cleverdon, 1970; Salton, 1983]. Among them, recall and precision have received the most attention in the literature; they are defined below. Recall is the ability of the search to find all of the relevant items in the corpus:

\mathrm{Recall} = \frac{\text{Number of Relevant Documents Retrieved}}{\text{Total Number of Relevant Documents}}    (2)

Precision is the ability to retrieve top-ranked items that are mostly relevant:

\mathrm{Precision} = \frac{\text{Number of Relevant Documents Retrieved}}{\text{Total Number of Retrieved Documents}}    (3)

For a given query, the precision of the returned results at position n (P@n) can be calculated by formula (3). For example, if the top five documents returned by the query are (relevant, irrelevant, irrelevant, relevant, relevant), then the precision at positions 1 to 5 takes the values {1, 1/2, 1/3, 2/4, 3/5}. For a set of queries, the average precision values of all the queries are averaged; this is called Mean Average Precision (MAP). For a single query, average precision is defined as the average of the P@n values at the ranks of the relevant documents:

\mathrm{AP} = \frac{\sum_{n=1}^{N} P@n \cdot rel(n)}{\#\,\text{total relevant documents for this query}}    (4)

where N is the number of retrieved documents, and rel(n) is a binary function on the relevance (1 = relevant; 0 = irrelevant) of the n'th document. The insufficient treatment of the ranking effects of a retrieval system in the standard measures of precision and recall has been questioned by many authors [Cooper, 1968; Raghavan, Bollmann, & Jung, 1989; Robertson, 1969]. Consequently, some research has been done on the evaluation of retrieval performance using a multi-valued relevance scale [Bollmann et al., 1986; Frei & Schäuble, 1991; Keen, 1971; Rocchio, 1971]. One difficulty in using a multi-valued relevance scale is that there is no clear guidance on how to design such a scale. Keen (1971) used a four-valued scale called point grade, which assigns four to the most relevant documents, three to the next most relevant, and so on. Saracevic (1987) used a three-valued scale consisting of "relevant, partially relevant, and non-relevant". Since a universal interpretation for such multi-valued relevance scales does not appear to exist, if the numbers or the verbal descriptions are not fully understood, a multi-valued relevance scale may easily be misused in relevance assessment and system evaluation [French, 1986; King, 1968]. As an alternative methodology, an ordinal scale measuring the relevance of documents has been proposed (Wong, Yao, & Bollmann, 1988).
Within this framework, instead of stating whether a document is relevant or not, a user specifies whether a document is more or less relevant than another document. Although the degrees of relevance may be scattered, the relative positions of documents as to their relevance may be expected to be remarkably consistent. The downside of this method is that the evaluation is more time consuming and burdensome for the user. The evaluation measures P@n and MAP can only handle cases with binary judgments; with the introduction of multi-valued judgments, new evaluation metrics are needed. The evaluation measure Normalized Discounted Cumulative Gain (NDCG) has been proposed, which can handle multiple levels of relevance judgment [Jarvelin & Kekalainen, 2000; 2002]. In evaluating a ranking, NDCG takes two rules into consideration: • Highly relevant documents are more valuable than marginally relevant documents. • The lower the ranking position of a document, the less valuable it is for the user, because it is less likely to be examined by the user. According to the above rules, the NDCG value of a ranking list at position n is calculated as follows:

N(n) \equiv Z_n \sum_{j=1}^{n} \frac{2^{r(j)} - 1}{\log_b(1 + j)}    (5)

where r(j) is the rating of the j'th document in the ranking list, Z_n is the normalization constant chosen so that the perfect list gets an NDCG score of 1, and \log_b(1 + j) is a discounting function that reduces a document's gain value as its rank increases. The base of the logarithm, b, controls the amount of the reduction. As the value of b decreases, the sharpness of the discount increases (in this thesis we used b = 10 in our evaluations). In order to calculate the NDCG score, the rating of each document needs to be defined. For example, {0, 1, 2} can be defined for "not relevant", "partially relevant", "definitely relevant" (Liu & Xu, 2007). Once arithmetic values are assigned to the evaluations, they can be used in formula (5) to calculate the NDCG score. 2.1.6 Text Retrieval Conference The Text Retrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology [Voorhees, 2007]. Each year at TREC, various IR research groups participate in tracks. While each group aims to be measured as the best at retrieving over a common set of queries and documents, the primary aim of TREC is to provide re-usable test collections for IR experimentation. Since its inception in 1992, TREC has been applying a pooling technique [Sparck Jones & van Rijsbergen, 1975]. For each test query, the top k (e.g., 100) returned documents of the participating systems are merged into a single pool. The relevance assessments are done only for the pooled documents, instead of all the documents in the test collection. Therefore, many relevant documents that are not in the pool are missed by this strategy. The evaluation measures in TREC are task oriented. Mean Average Precision is commonly used in TREC for all the test queries, to evaluate the overall retrieval performance of an IR system [Voorhees, 2008].
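To make the evaluation measures above concrete, the sketch below computes P@n and average precision as in formulas (3) and (4), and NDCG as in formula (5) with Z_n taken from the ideal ordering. The relevance judgments are invented for illustration, the logarithm base defaults to the b = 10 used in this thesis, and this is an explanatory sketch rather than the evaluation code used for the experiments in Chapter 6.

```python
import math

def precision_at(n, rels):
    """Formula (3) over the top n results; rels is a list of binary judgments."""
    return sum(rels[:n]) / n

def average_precision(rels, total_relevant):
    """Formula (4): P@n summed at the ranks of relevant documents, over total relevant."""
    hits = [precision_at(n, rels) for n in range(1, len(rels) + 1) if rels[n - 1]]
    return sum(hits) / total_relevant if total_relevant else 0.0

def ndcg(ratings, b=10):
    """Formula (5): N(n) = Z_n * sum of (2^r(j) - 1) / log_b(1 + j)."""
    def dcg(rs):
        return sum((2 ** r - 1) / math.log(1 + j, b) for j, r in enumerate(rs, start=1))
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal else 0.0

# Worked example from the text: top five results (relevant, irrelevant, irrelevant, relevant, relevant).
rels = [1, 0, 0, 1, 1]
print([round(precision_at(n, rels), 2) for n in range(1, 6)])  # 1, 1/2, 1/3, 2/4, 3/5
print(round(average_precision(rels, total_relevant=3), 3))     # assumes 3 relevant docs exist in total

# Graded judgments for NDCG: 0 = not relevant, 1 = partially, 2 = definitely relevant.
print(round(ndcg([2, 0, 1, 2, 0]), 3))
print(round(ndcg([2, 2, 1, 0, 0]), 3))  # an ideally ordered list scores 1.0
```

MAP is then simply the mean of the per-query average precision values over all test queries.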
22 A common feature of measure such as MAP is that they are primarily focused on measuring retrieval performance over the entire set of retrieved documents for each query, up to a predetermined maximum 1000. However many users will not read 1000 retrieved documents, thus other measures that are more linked to the user satisfaction such as P@n can be used for that purpose. 2.2 IR on the Web The advent of World Wide Web, from 1990 onwards, has made IR systems popular among the public. Today a billion of people use the Internet [InternetStat, 2009] and a vast majority (91%) makes use of a search engine (Madden et al., 2008). The Web can be considered as a large-scale document collection that has a contextual difference from traditional IR research environments, which have been based on a relatively small, static, and homogeneous text corpus. Thus analyzing the Web structure and adapting traditional IR to the current Web become a wide research topic. The Web uses a hypertext document model that is remotely accessible over the Internet. Documents can contain links to other related pages that the author found of interest. The large-scale document collection structure of the web make is necessary to look for IR techniques that will accommodate the user’s information and navigation needs. IR systems that search the Web are also known as Web search engines. The search engine allows one to ask for content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of items that match those criteria. How a search engine decides which pages are the best matches and what order the results should be shown in varies widely from one engine to another. 23 The first search engines for the Web (e.g. Lycos) appeared around 1992-1993, with full- text indexing. Soon after, many other search engines arrived, including AltaVista, Excite, and Inktomi. These often competed directly with directory-based services, like Yahoo, and added search engine facilities later. Starting from 2001, Google rose with a different methodology of link analysis technique called Page Rank [Page et al, 1998]. Google and Yahoo have become the two most competitive and most popular search engines among Web users. 2.2.1 Web Search Tasks The Web is a dramatically different form of corpus from those that classical IR systems have previously been applied to. The different purposes and nature of various Web sites suggest that users searching the Web will have different tasks and information needs. Analysis on the query logs submitted to the Web show that, Web-based user queries are shorter and are used for various tasks [Wolfram & Saracevic, 2002; Ozmutlu & Jansen, 2002]. Broder (2002) categorized the user needs behind the web into three categories: “Navigational”; Intended to find a specific website that user has in mind, “Informational”; Intended to find the information about a topic, “Transactional”; Intended to perform a Web Mediated activity. Rose and Levinson (2004) later refined Broder’s model by further categorizing informational and transactional needs of the user. They subcategorized informational queries as “directed, “undirected”, “advice”, “locate”, “list” and Transactional queries as “download”, “entertainment”, “interact”, “obtain”. 24 With the introduction of different user search behaviors, classical ad-hoc evaluation measures (e.g., MAP) moved towards evaluation measures that emphasize the accuracy of the top ranking of search engines. 
The concentration of the evaluation measures on the very top of the document ranking is motivated by the fact that in a Web search, users rarely view the second page of results [Spink et al.,2001], and they often click on the top few retrieved documents [Jansen & Spink, 2003; Joachims & Radlinski, 2007]. Hence, a search engine query that does not return the relevant results in the first 5 or 10 ranks is likely, in the perception of the user, to have failed. A further source of evaluation is available to large search engines with many users. For popular queries, the search engine can be evaluated by examining the click of the user to the ranked documents. For instance, if users never click on the top ranked results, it is likely that the top-ranked result is not relevant to the query. However that is not an accurate measure all the time since user preference can be affected by the summarizations methods [Amini, 2005]. Based on the summary presented by the search engine, a user might click to see whether a document is relevant and that does not necessarily mean that the document is relevant for the user’s information need. The time spent on the document is taken into consideration in some cases but again that does not determine the exact measure since the time might be affected by network delays. Using non-binary relevance assessments, a suitable evaluation measure would quantify the extent to which the IR system would rank higher quality relevant documents ahead of lower quality relevant documents. NDCG [Jarvelin & Kekalainen, 2000; 2002] has recently been gaining popularity and is well suited for use when document relevance had been judged using more than two levels. 25 2.2.2 Web IR Approaches Web Information Retrieval models integrate various source of evidence about the documents such as the links, the structure of the document, the actual content of the document, the quality of the document that can be achieved by the search engines. Web search engines have to deal with subversive techniques to return the documents with high ratings [Gyongyi &Garcia-Molina, 2005]. Therefore additional evidence is often derived from sources with the content of a page. These various sources of evidence are often categorized as query-independent and query-dependent sources of evidence which are described in the following sections. 2.2.2.1 Link Analysis One of the defining features of Web is the hyperlink structure. Each document in the Web has a uniquely identified Uniform Resource Locator (URL) and can connect to other sources through these URL’s. When a hyperlink model is formalized, documents can be seen as a node of a graph where the links between the documents can be represented by edges. One of the components of query-independent document quality can be considered by the number of incoming links each document has [Pitkow, 1997]. Although this shows the credibility given to the documents by the authors of the other document, it can be spammed very easily and is not very useful in ranking of the documents. The PageRank algorithm [Page et al, 1998] is based on the number of ingoing and outgoing hyperlinks. PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. In this algorithm the documents that are linked to by many other pages ranked higher. 26 PageRank was reported to be a fundamental component of the early versions of the Google search engine (Brin & Page, 1998). 
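The random-surfer intuition behind PageRank can be sketched with a simple power iteration. This is a generic illustration of the idea in Page et al. (1998), not the production algorithm of any search engine; the damping factor, iteration count, and link graph below are invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of PageRank: rank flows along out-links each round."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                share = rank[p] / len(outgoing)
                for q in outgoing:
                    new_rank[q] += damping * share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new_rank[q] += damping * rank[p] / len(pages)
        rank = new_rank
    return rank

# Hypothetical four-page link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```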
Kleinberg (1997) takes link analysis one step further and introduces HITS (Hyperlink-Induced Topic Search). HITS differs from other link-based approaches in several regards. Instead of simply counting the number of links, HITS calculates the value of a page p based on the aggregate values of the pages that point to p or are pointed to by p. HITS, however, differs from PageRank in three major points. First, it uses inlink and outlink measures to compute two separate values, authority and hub scores, instead of a single measure of importance like PageRank. Second, HITS measures a document's values dynamically for each query, rather than assigning global scores once regardless of the query. Third, HITS scores are computed from a relatively small subset of the Web instead of the totality of the Web. HITS starts with a root set of text-based search engine results in response to a query about some topic, and expands the root set to a base set with the inlinks and outlinks of the root set. HITS is considered query-dependent in the sense that it begins with a root set of documents returned by a search engine in response to a query; the textual contents of pages are only considered in the initial step of obtaining that root set. Once HITS locates a topic neighborhood, it is guided by the link structure alone. The main disadvantage of link-based evaluation is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site. Also, a document with good content that is not connected to other sites has a lower chance of being detected and ranked in link-based systems. In addition, other sources of evidence are considered for the assessment of query-independent document quality. The structure of the URL (i.e., its length and the characters it involves) is used to determine the type of the document [Kraaij et al., 2002]. The time it takes to crawl the Web page is also taken into consideration, on the claim that high-quality pages are often identified earlier while crawling [Najork & Wiener, 2001]. The number of clicks taken to reach a page from a given entry page is also among the evidence used to assess the quality of the page [Craswell et al., 2005].
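Before moving on, the hub/authority iteration at the core of HITS described above can be sketched as follows. The base-set graph, the iteration count, and the normalization step are illustrative assumptions for a toy example, not Kleinberg's exact formulation.

```python
import math

def hits(links, iterations=50):
    """Iteratively update hub and authority scores on a (toy) base-set link graph."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

links = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["b", "c"]}  # hypothetical base set
hub, auth = hits(links)
print({p: round(v, 3) for p, v in auth.items()})
```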
In order to meet the user’s needs better, Inquirus2 [Glover, 1999] metasearch engine was created to add the context information to find relevant documents via regular search engines. However, it requires a user to enter the context information explicitly. Budzik, (2000) with Watson project attempted to model the context of user information needs based on the content of documents being edited in Microsoft Word, or viewed in Internet Explorer, but that idea is far from the common use of the user information search needs. By the advent of technology, the more precise answers to the user’s need by employing the domain of contextual IR has become a priority [Allan, 2003]. The aim is to put the user back into the center of IR models by explicating some elements of the context that can affect the system’s performance. Context refers to the knowledge linked to the user’s intention (i.e., task perception of the task, type of information needed), to the user himself (i.e., a priori knowledge, profile), to his environment (i.e., material environment, culture), to the domain of his information need (i.e., corpus nature, domain- treated), and to the characteristics of the system (i.e., document representation, query/document mapping, strategies for accessing documents, visual interface) [Hernandez, 2007]. Taking context into account implies both identifying and modeling the different aspects that are useful to specify the user’s information needs and then integrating them into IR methods and processes. Taylor (1968) identified two parameters to define information needs of the user. The first is the theme or the subject of the need that determines the kind of information. 29 The second concerns the task or the user’s situation and this conditions how the information is searched and how it will be used. A majority of information systems focus on the theme of the search [Freud and Toms, 2005]. These systems mostly take into account the terms syntactically and search the documents for the terms to specify the theme of user need. However, terms are often ambiguous and refer to several themes. 30 Chapter 3 Ontologies 3.1 What is Ontology? Ontologies provide a structured way of describing knowledge. According to Gruber (1993), ontology is a “shared specification of a conceptualization”. A conceptualization refers to an abstract model of some phenomenon in the world that identifies the relevant concepts (e.g., entities, classes, and categories), their definitions and their inter- relationships. Ontologies are introduced to facilitate knowledge sharing and reuse between various agents, regardless of whether they are human or artificial in nature. They are supposed to offer this service by providing a consensual and formal conceptualization of a certain area. Ontology is used to make available assumptions about the meaning of a term. It can also be seen as an explication of the context of how a term is normally used. Lenat (1989), for example, describes context in terms of twelve independent dimensions that have to be shown in order to understand a piece of knowledge completely. Ontologies are often equated with taxonomic hierarchies of classes, class definitions, and the subsumption relation, but ontologies need not be limited to these forms. By the use of associative relations (i.e., the relationship between two concepts 31 having a non-hierarchical thematic connection such as a cause and its effects), the span of conveying information is much broader than the taxonomies. 
The relationship in taxonomies include is-a (subclass, or hyponym or hypernym) relationship between the concepts. In addition, ontologies might have part-of (holonym, or melonym) relationship with the associative relationships. 3.1.1 Types of Ontologies Depending on the scope of the ontology, it may be classified into four categories: Upper (top-level) ontology, domain ontology, task ontology and application ontology [Figure 2]. Figure 2: Ontology kinds, according to their level of dependence on a particular task or point of view [Gruber, 1997)]. Upper (top-level) Ontology: Upper Ontology is an ontology which describes very general concepts that are the same across all domains [Niles and Pease, 2001]. The aim is to achieve broad semantic interoperability between a large numbers of ontologies accessible "under" this upper ontology. As the metaphor suggests, it is usually a hierarchy of entities and associated rules (both theorems and regulations) that attempt to 32 describe those general entities that do not belong to a specific problem domain. Cyc and WordNet are examples of Upper Ontology. Domain Ontology: Domain ontology represents the particular meanings of terms as they apply to a particular domain such as medicine, biology, terrorism, music or law [Gruber 1993; Guarino, 1998]. For example, the word “python” refers to a programming language in the computer science domain; however, in the ontology of biology it refers to a species of snake. Domain ontology provides a controlled, structured vocabulary to annotate data in order to make it more easily searchable by human beings and process able by computers. Domain ontologies benefit from research in upper ontology, which assists in making communication between and among ontologies possible by providing a common language and common formal framework for reasoning. Task Ontology: While domain ontologies describe concepts used for all tasks that occur within that domain, Task ontologies describe the reasoning concepts and their relationships occurring within a certain domain and for a specific task [Chandrasekaran, 1998]. Task ontologies are models of partitions of reality preserving the context that determines the semantics of the concepts within the partition. Application Ontology: This category describes concepts depending both on a particular domain and task, which are often specializations of both the related ontologies [Guarino, 1997]. These concepts often correspond to roles played by domain entities while performing a certain activity, like a replaceable unit or spare component. Ontologies form a lattice of ontologies defined by the partial ordering of inheritance of ontologies. Task and domain ontologies may be independent and are 33 merged for application ontology, or it is possible that, for example, task ontology imports domain ontology. The upper ontologies are the most reused ones while application ontologies may be suitable for one application only. 3.2 Building Ontologies The process of building ontologies is a high-cost process. Some people believe that the construction of ontologies is an art rather than a science. The goal for building ontologies is to create an agreed-upon vocabulary and a semantic structure for exchanging information about that domain [Stevens, 2001]. There are no standard methodologies for building ontologies. 
Such a methodology would include a set of stages that occur when building ontologies, guidelines and principles to assist in the different stages, and an ontology life-cycle which indicates the relationships among stages [Uschold & Gruninger, 1996]. The most well known ontology construction guidelines were developed by Gruber (1993). Following Gruber there has been increased effort in trying to develop various ontology methodologies [Jones, 1998]. Scientists usually distinguish between formal and informal techniques for ontology development. In informal methods, the ontology is sketched out using either natural language descriptions or diagram techniques like, Unified Modeling Language (UML). In formal methods, the ontology is encoded in a formal knowledge representation language, such as Web Ontology Language (OWL), which is machine compatible. 3.2.1 Ontology Languages Over the years, a number of ontology languages have been developed, focusing on different aspects of ontology modeling. Many ontology languages have their roots in 34 first-order logic. Some of them have a particular focus on modeling ontology in a formal yet intuitive way, mainly frame-based languages such as Ontolingua and F-logic. While the others, use various description-logic based languages, such as LOOM, OIL and OWL, which are more concerned with finding an appropriate subset of first-order logic with decidable and complete subsumption inference procedures. 3.2.1.1 The OWL Ontology Language The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies, and is endorsed by the World Wide Web Consortium [Michael, Welty & McGuinness , 2004]. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans [W3C]. OWL ontologies are most commonly serialized using RDF/XML syntax. Extendible Markup Language (XML) provides syntax for structured documents but places no semantic constraints on their meaning. XML Schema restricts the structure of XML documents and extends XML with data types. Resource Description Framework (RDF) is a data model for resources and the relations between them. It provides a simple semantics for this model, and these semantics can be represented in XML syntax. RDF Schema is a vocabulary for describing properties and classes of RDF resources, with a semantics for generalization hierarchies of such properties and classes. OWL adds more vocabulary [W3C] for describing properties and classes. The data described by the OWL ontology is interpreted as a set of "concepts" and a set of "property assertions" which relate these concepts to each other. OWL ontology consists of a set of axioms which place constraints on sets of concepts and the types of relationships permitted between them. These axioms provide semantics by allowing 35 systems to infer additional information based on the data explicitly provided. For example, an ontology describing families might include axioms stating that a "hasMother" property is only present between two individuals when "hasParent" is also present. 3.2.2 Ontology Development Tools By incorporating the methodologies and languages, many environments for ontology have been developed. This section mentions a couple of these methods, such as OntoEdit [Sure, 2002], Protégé [Musen, 2003] and Hozo [Kozaki, 2002; Sunagawa 2003] and OilEd. These cover a wide range of ontology development. 
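As a concrete illustration of the OWL constructs discussed in Section 3.2.1.1, the short Python sketch below builds a tiny RDF/OWL fragment with the rdflib library and encodes the "hasMother"/"hasParent" example as a subproperty axiom. This is an illustration only; the namespace and class names are invented for the example and are not part of any ontology used in this thesis.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

EX = Namespace("http://example.org/family#")   # illustrative namespace

g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

# Classes and a taxonomic (is-a) relation
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.Woman, RDF.type, OWL.Class))
g.add((EX.Woman, RDFS.subClassOf, EX.Person))

# Object properties
g.add((EX.hasParent, RDF.type, OWL.ObjectProperty))
g.add((EX.hasMother, RDF.type, OWL.ObjectProperty))

# Axiom in the spirit of the "hasMother"/"hasParent" example above:
# whenever hasMother holds between two individuals, hasParent holds too.
g.add((EX.hasMother, RDFS.subPropertyOf, EX.hasParent))
g.add((EX.hasMother, RDFS.range, EX.Woman))

print(g.serialize(format="turtle"))
```

Serializing the graph in Turtle or RDF/XML yields a document that OWL-aware tools such as the editors described below can open directly.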
3.2.2.1 OntoEdit OntoEdit [Sure, 2002], is an ontology engineering environment to support the development and maintenance of an ontology. It supports multilingual development, and the knowledge model is related to frame-based languages. OntoEdit is based on an open plug-in structure. Every plug-in provides other features to deal with the requirements an ontology engineer has. Data about classes, properties, and individuals may be imported or exported via different formats. OntoEdit employs F-Logic [Kifer, 1995] as its inference engine. It is used to process axioms in the refinement and evaluation phases. Especially, it plays an important role in the evaluation phase because it processes competency questions to the ontology to prove that it satisfies them. It exploits the strength of F-logic in that it can express arbitrary powerful rules which quantify over the set of classes that Description logics cannot. 36 3.2.2.2 Protégé Protégé (Muzen, 2003) is a free, open-source Java-based platform that provides a growing user community with a suite of tools to build domain models and knowledge- based applications with ontologies. Protégé implements a rich set of knowledge-modeling structures and actions that support the creation, visualization, and manipulation of ontologies in various representation formats. Protégé can be customized to provide domain-friendly support for creating knowledge models and entering data. Further, Protégé can be extended via a plug-in architecture. Various facilities can be programmatically available via the provided Java API. Protégé gives support for building the ontologies that are frame-based, in accordance with the Open Knowledge Base Connectivity protocol (OKBC). Also, the tool provides support for OWL ontologies. A special plug-in can be used to generate graph representations of the editing ontologies. 3.2.2.3 Hazo Hozo is an ontology visualization and development tool that brings version control constructs to group ontology development. “Hozo” is composed of “Ontology Editor”, “Onto-Studio” and “Ontology Server”. Hozo supports role representation, visualization of ontologies in a well-considered format with distributed development based on management of dependencies between ontologies. The inference mechanism of Hozo is not very sophisticated. Axioms are defined for each class but it works as a semantic constraint checker. 37 3.3 WordNet Ontology WordNet is, so far, the most well-developed and widely used lexical database for English [Fellabaum, 1998]. WordNet is much more than an electronic dictionary and indeed its structure distinguishes it from established lexical resources such as The Oxford English Dictionary. In WordNet, English nouns, verbs, adjectives and adverbs are organized into synonym (sense) sets, each representing one underlying lexical concept. The power of WordNet lies in its set of domain-independent lexical relations. These relations are hold among a WordNet synset. WordNet employs a model of classifying words that aims to imitate how the lexicon is organized in the human brain. This model emerged from psycholinguistic studies carried out over several years. Hence possible uses of WordNet extend way beyond searching for a definition for a particular word or collocation. 3.3.1 Structural Aspects of WordNet Central to the WordNet model is the principle of synonymy. Synonymy is the semantic relation that holds between two words that can (in a given context) express the same meaning. 
Those words that are synonymous are grouped into synsets, which represent the concept described by their shared meaning. A synset is hence a representation of a concept that contains a set of words that are interchangeable in some context. As of 2006, the WordNet database contained about 150,000 words organized in over 115,000 synsets. However, as indicated by Fellbaum (1998), given that words are grouped according to meaning/concept, multiple entries can exist for any given word. In essence, for every definition/meaning of a particular word or collocation there will exist a synset for that definition in WordNet. Semantic relations in WordNet vary based on the type of word. In this thesis we are going to use nouns, thus the word type mentioned for the rest of this thesis is nouns. Following the hierarchical relations, each synset is related to its immediately more general and more specific synsets. To find chains of more general or more specific synsets, one can simply follow a transitive chain of hypernym and hyponym relations [Table 1]. This chain eventually leads to the top of the hierarchy, entity. One should note that WordNet does not have a single top concept; rather it has several top concepts.

Hypernym: synset which is the more general class of another synset (example: food / apple)
Hyponym: synset which is a particular kind of another synset, "is a" (example: apple / food)
Holonym: synset which is the whole of another synset, "is part" (example: flower / plant)
Meronym: synset which is a part of another synset (example: leg / table)
Antonym: synsets which are opposite in meaning (example: good / bad)
Table 1: Noun relations in WordNet.

3.3.2 Semantic Relatedness in WordNet
Measures of similarity use information found in an "is-a" hierarchy of concepts (or synsets), and quantify how much concept A is like (or is similar to) concept B. For example, such a measure might show that a cat is more like a dog than it is a tree, due to the fact that cat and dog share animal as an ancestor in the WordNet noun hierarchy. WordNet is particularly well suited for similarity measures, since it organizes nouns and verbs into hierarchies. In 2003, an open source software package, WordNet::Similarity, was developed at the University of Minnesota [Patwardhan, Banerjee & Pedersen, 2003]. Methods in this package allow the user to measure the semantic similarity or relatedness between a pair of concepts (or word senses), and by extension, between a pair of words. The measures of similarity are divided into two groups: path-based and information content-based. Information content-based similarity is based on the information content of the least common subsumer (LCS) of concepts A and B. Examples of content-based measures include res [Resnik, 1995], lin [Lin, 1998], and jcn [Jiang and Conrath, 1997]. The lin and jcn measures augment the information content of the LCS with the sum of the information content of concepts A and B themselves. The lin measure scales the information content of the LCS by this sum, while jcn takes the difference of this sum and the information content of the LCS. Three similarity measures are based on path lengths between a pair of concepts: lch [Leacock and Chodorow, 1998], wup [Wu and Palmer, 1994], and path. Lch finds the shortest path between two concepts, and scales that value by the maximum path length found in the "is-a" hierarchy in which they occur. Wup finds the depth of the LCS of the concepts, and then scales that by the sum of the depths of the individual concepts.
The depth of a concept is simply its distance to the root node. The measure path is a baseline that is equal to the inverse of the shortest path between two concepts [Pedersen & Siddharth Patwardhan & Michelizzi, 2004]. Following WordNet::Similarity, an extended version, WordNet::SenseRelate was released that uses measures of semantic similarity and relatedness to perform word-sense disambiguation. WordNet::SenseRelate has two different word-sense disambiguation 40 algorithms, an "all words" version (WordNet-SenseRelate-AllWords) and a lexical sample version (WordNet-SenseRelate-TargetWord). The AllWords algorithm assigns a sense to each word in a text, and the lexical sample version assigns a sense to a given target word. There is also a third program which selects the sense of a word that is most related to a given set of words (WordNet-SenseRelate-WordToSet). In this thesis we will be using WordNet-SenseRelate-WordToSet in our implementation. 3.3.3 WordNet Usage WordNet is a well established and highly sophisticated database. Despite the fact that it has been developed “by hand”, topic coverage is relatively good. Its inclusion in other natural language applications gives an indication of its popularity in various areas, including: Disambiguation of Meaning, Semantic Tagging, Information Retrieval, Conceptual Identification, Machine Translation, and Document Classification. WordNet’s hierarchical structure lends itself to numerous search algorithms for the extraction of information it contains. Most importantly searches for thematically related words are possible. 3.4 Ontologies and Information Retrieval Common information retrieval techniques mostly rely on simple full-text analysis. Matching user query to the documents is a challenging process. The introduced query expansion techniques solve only one part of the problem by increasing the recall, but the uncertainty of the user-interest context and dependent ambiguity of the query words make it challenging to retrieve relevant documents for the queries (See section 2.1.4). 41 In order to eliminate the ambiguity of the words, domain knowledge can be represented through thesauri or ontologies [Nirenburg, 2001; Guarino, 2001]. A thesaurus is based on the main terms of one or more fields organized together. However, they lack the semantics. Ontology corresponds to “an explicit and formal specification of a shared conceptualization” [Struder, 1998] and it is a higher degree of conceptualization than the thesaurus. Voorhees (1994) used WordNet as a tool for query expansion. She conducted experiments using the TREC collection [Voorhees and Harman, 1997] in which all terms in the queries were expanded using a combination of synonyms, hypernyms, and hyponyms. Voorhees only succeeded in improving the performance on short queries and almost no significant improvement for long queries. She also tried to use WordNet as a tool for word sense disambiguation [Voorhees, 1993] and applied it to text retrieval, but the performance of retrieval was degraded. Richardson used WordNet to compute the semantic distance between concepts or words and then used this term distance to compute the similarity between a query and a document [Richardson and Smeaton, 1995]. Although he proposed two methods to compute semantic distances, neither of them increased the retrieval performance. Mandala et al. 
(1998) analyzed why the use of WordNet has failed to improve the retrieval effectiveness in information retrieval application and found that the main reason is that most relationships between terms are not found in WordNet, and some terms, such as proper names, are not included in WordNet. To overcome this problem he proposed a method to enrich the WordNet with an automatically constructed thesaurus. 42 Ontologies are mostly used in categorizing or indexing of data [Baziz, 2005; Vallet, 2005; Mihacella, 2000; Guarino, 1994]. Aleman-Meza (2005) uses ontologies to capture domains semantics and semantic metadata to capture semantics of heterogeneous domains. However, these methods mostly do not make a distinction between the knowledge used for representing the domain of the search or the interest of the user. Hernandez (2007) used two ontologies to represent the two aspects of the context (Search task and theme) for semantic indexing of documents. Theme ontology is used to represent the domain information and task ontology is used to represent the information about the task he is engaged in. The document in the corpus, which is relevant to that task, is combined under the concepts. The relations between the concepts of theme ontology make it possible for the user to see the documents related to the concepts of interest. In order to implement this system, a predefined set of corpus needs to be annotated. Thus, this method is not relevant to capture the most up-to-date and dynamic information that what we are interested in for this project. 43 Chapter 4 Trust 4.1 What is Trust? Trust has been an important element in interpersonal relationships in many fields. Trust is a complex subject relating to belief in honesty, truthfulness, competence and reliability of the trusted person or service. There is no consensus in the literature on what trust is or what constitutes trust management [Konrad, 1999], though many research scientists recognize its importance [Povey, 1999]. Trust has been defined in many ways in the literature. Robinson (1996) defines trust as the expectations, assumptions, or beliefs about the likelihood that another’s future actions will be beneficial, favorable, or at least not detrimental to one’s interest. For Gambetta (1990), trust is a particular level of the subjective probability with which an agent will perform a particular action, before we can monitor such action in a context in which it affects our own action. Philosophers argue that only moral agents can be trustworthy by intentionally and freely refraining from harm or doing good [Solomon & Flores, 2001]. However, Reeves et al indicated that, even though computers and software are not moral agents since they do not have intentionally and free will, they are social actors of the social presence. 44 Reeves also concluded that, as a social participant, it would be proper to talk about a computer’s trustworthiness. Many related concepts are often confused with trust. Trust is not the same as trustworthiness, a distinction that is not always made clear in the literature [Blois, 1999]; trust is an act of a trustor. A person places his or her trust in some object. Regardless of whether the person’s trust proves to be well placed or not, trust emanates from a person. In contrast, trustworthiness is a characteristic of someone or something that is the object of trust. Although trust and trustworthiness are distinct, there is a logical link between them [Solomon & Flores, 2001]. 
Trust has also been used to mean credibility, as in this thesis. For example, in the phase “trust in information “, trust is used in the sense of credibility [Fogg &Tseng, 1999]. It is also proposed that trustworthiness is a key component of credibility, rather than credibility being a cue for trustworthiness. By contrast, Corritore (2003) sees credibility as a cue for trustworthiness, hence, credibility provides a reason to trust but not trust itself. Another word that is used to refer trust is reliance. However it is possible to rely on a person without trusting him (Blois,1999). 4.2 Trust in Web One of the first works that tried to give a formal treatment of trust that could be used in Computer Science was introduced by Marsh (1994). This model is based on social properties of trust and presents an attempt to integrate all the aspects of trust taken from sociology and psychology. McKnight (1996) used social sciences terminology in his work and defined three kinds of trust: Impersonal/Structural, trust to a social or institutional structure in the situation. Dispositional, trust based on the personality 45 attributes of the trusting party that develops across broad spectrum of situations and persons. Personal/Interpersonal, trust to a person or a group of people in the specific situation. Falcone (2001) presented a cognitive model of trust in terms of mental ingredients such as beliefs and goals. In the context of the Web, trust is widely used in Peer-to-Peer (P2P) and e- commerce applications. Trust in these applications usually means “confident reliance by one party about the behavior of other parties” [Clarke, 2001]. However, this definition lacks the context of the reliance that is by definition the major element of trust. Dey defines context as follows: Context is any information that can be used to characterize the situation of an entity. An entity is a person, place or object that is considered relevant to the interaction between a user and an application, including the user and application themselves. Trust is context dependent [Gulati, 1995], thus it changes based on the participants and applications. In order to assign a degree to trust and trustworthiness on the Web, most applications use Reputation-Based-Recommendation measures. The concept of reputation is closely linked to that of trustworthiness, but it is evident that there is a clear and important difference. According to the Oxford dictionary, reputation is “what is generally said or believed about a person’s or thing’s character or standing”. Reputation can be considered as a collective measure of trustworthiness (in the sense of reliability) based on the referrals or ratings from members in a community. An individual’s subjective trust can be derived from a combination of received referrals and personal experience [Josang, 2003]. Recommendation is an expressed opinion of an entity that some another entity is reputable which opinion the recommender responsible 46 for. Grishchenko’s (2001) Recommendations are especially popular in P2P system and social networking sites. The new members of these sites usually use the sources they know to get the recommendation about the reputations of the other sources. Because of the increase in the popularity of Internet applications, trust is analyzed in recommender systems and social networks [Massa, 2004]. Golbeck (2003) used FOAF (Friend of a Friend) [Brickley, 2004] trust model that deals with personal information and the relationships to assign trust in Web-based social nets. 
Massa (2004) used trust to select a smaller set of peers in the recommendation process that decreases the computational complexity and increases the scalability of the systems. We have used trust propagation to use the known sources to reach the unknown sources on eBay application [O’Donovan et. al, 2006; 2007]. Although valuable, the work on social networks and recommender system is application specific. where the user of the system can be detected by login subscriptions. Thus, these models are limited to the centralized applications that the users are known to the system and most of the time to the other users of the system. When it comes to World Wide Web, it is a decentralized environment in which users mostly don’t know each other. Besides, in many cases, there is no registration to identify the users or rules to prevent the users from posting malicious information. 4.3 Trust and Information Retrieval Although, there is no mediator to provide recommendation about the document in the Web, some methodologies are developed to trace the source of the provided information to measure the trustworthiness of the sources. Here, instead of recommendations, data 47 provenance is used. Buneman (2000) defines data provenance as tracing and recording the origins of data and its movement between databases. Thus, same definition of data provenance can be used to trace the source of the documents in the Web. Esfandari (2001) states that trust is weakly transitive and as it propagates its value decreases. Thus the link structure of the Web favors data propagation over provenance to assess the trust measures of the sources. Despite the potentially great value of open source information, which is the interest of this thesis, it is often difficult to take the advantage of the information provided through sources. Reaching most of the documents in the Web that correspond to a user’s request is challenging since only a small fraction of the Internet is indexed. Because of this distributed structure, it is often hard to find information about credibility of the free-text documents. The ability to make an estimation about the accuracy of the information available in the Internet is subjective and requires user expertise [Harris, 2000]. Trust is highly subjective and context dependent [O’Hara, 2004]. Thus, it is hard to judge the trustworthiness of sources because of their multidimensionality. Different users base their decisions on different attributes [Marsh S., 1994]. For example, some users of applications consider risk, security and privacy as the main attributes to make trustworthy decisions. However, some others might give more importance to the user- interface in their decision-making. Noble (2003) proposed a method to find a more objective assessment of the credibility of the documents. In his work, the reliability of a source is measured by the source’s previous reliability on similar events or subjects, the report’s consistency with 48 precedents and it’s consistency with the information provided from other sources. Although it is not impossible to use, this method is still infeasible since it needs to collect, structure and analyze the number of related documents, adding too much complexity, considering the amount of information that is accessible through the Internet. Again, in his work Noble does not explain how he calculates the relevancy of the documents. 
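Esfandari's observation above, that trust is weakly transitive and loses value as it propagates, can be illustrated with a toy calculation. This is my own sketch, not the trust relevancy model developed later in this thesis: direct trust values along a referral chain are combined multiplicatively and discounted a little more at each additional hop.

```python
def propagated_trust(chain, decay=0.9):
    """Combine direct trust values (each in [0, 1]) along a referral chain.

    chain -- trust of A in B, of B in C, and so on, one value per hop
    decay -- illustrative per-hop discount modelling weak transitivity
    """
    trust = 1.0
    for hop, value in enumerate(chain):
        trust *= value * (decay ** hop)   # later hops are discounted more
    return trust

print(propagated_trust([0.9]))             # first-hand trust: 0.9
print(propagated_trust([0.9, 0.8]))        # second-hand: 0.648
print(propagated_trust([0.9, 0.8, 0.95]))  # third-hand: lower still
```

The numbers themselves are arbitrary; the point is only that trust obtained through longer referral chains is worth less than trust obtained first-hand.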
As an extension of Google PageRank [Page, 1998], Gyöngy (2004) has introduced a TrustRank algorithm that is mostly used for spam filtering, rather than as a measure of the trustworthiness of the information provided by sources. The basic idea behind the TrustRank algorithm is that the seed set of trusted sources is selected by the experts manually at the beginning and the trustworthiness of the other sites are assigned based on the distance (in terms of links) from this seed set of sources. This method is so general that it does not take into account the user’s interest or the context of trust, which will be the focus of trust relevance in this thesis. 4.3.1 Document Credibility Measures The subjectivity of trust makes it extremely hard to judge about the people decision making process. Unlike traditional Information Retrieval systems where documents are stored in a database or as plain text files, the Web environment provides many different data-presenting options. When a user enters a query, he/she receives thousands, if not millions of documents in return with the low possibility of recognizing the sources or the provenance of the sources. Thus a provenance based evaluation is not a solely sufficient criterion for the credibility judgment in an environment where users have little information about the other users and have many options. 49 In a large-scale study, Fogg et al. (2002) examined 2,684 consumers’ responses to the credibility of Web sites in a number of content areas, including e-commerce, finance, health, entertainment, sports, travel, and news. They found that consumers’ reasons for making actual credibility determinations were different from what they said their reasons were for making these determinations. When asked generally what factors they would consider in making Web site credibility determinations, consumers listed considerations like the presence of a privacy policy. In actual practice, they found that people rarely if ever referred to these criteria in making determinations. Instead, the authors found that credibility judgments focused first upon ‘‘design look’’. Other factors that were commented upon by consumers most frequently included ‘‘information structure’’ (i.e., how the information was structured upon the Web site) and ‘‘focus’’ (i.e., depth and breadth of information). Stanford et al. (2002) carried out another study, in which they examined credibility judgments of 15 experts in the areas of health and finance and compared their determinations to those of the consumers in the earlier study. They found that, in contrast to consumers, health experts rated the following factors most highly in determining Web site credibility: ‘‘name reputation of a site, its operator or its affiliates,’’ ‘‘information source, or references” and ‘‘company motive’’. Finance experts focused on ‘‘information focus,’’ ‘‘quantity of information,’’ and ‘‘company motive’’. Flanagan and Metzger (2007) examined people’s credibility determinations in response to Web sites in e-commerce, special interest, news organizations, and personal Web sites. They found that personal Web sites were rated lowest and news sites the 50 highest. Like Fogg et al. (2002), they found that design aspects of sites had a greater impact on credibility determinations than knowledge of Web site sponsors. Iding et al. 
(2009) conducted a study to test university students determination of the credibility of information on Web sites, the confidence in their determinations and perceptions of Web site authors vested interests. This study involved about 100 students from education and computer science disciplines. Categorization of Web site determinations indicated that the most frequently provided reasons associated with high credibility included information focus or relevance, educational focus and name recognition. As the research studies indicate, the credibility of Web sites varies based on many conditions such as level of expertise of the user, discipline, the goal of an information seeker, the author, the author’s expertise, the interest of an article, its sources, the application (e.g., e-commerce transaction versus information), the outline of the pages, and demographics of subjects [Fogg et al., 2002; Stanford et al., 2002; Flanagan and Metzger, 2007]. Since belief in credibility and dependence on “trust” has a multivariable structure, customizing the credibility/trust criteria is almost impossible. 51 Chapter 5 Context-Based Information and Trust Analysis 5.1 Introduction In this chapter, we are proposing a novel approach for ranking the documents corresponding to a user’s request, base on user’s context information. The Context-Based Information and Trust Analysis (CONITA) is designed to extract semantics from the user’s request to fill the gap between the currently popular information retrieval methodologies such as the ones used by Google / Yahoo, and the user’s needs. CONITA analyzes the context of the user’s interest, information request, and domain of expertise to extract the information that can be used for customizing the result for the particular domain users. Users of the Web have different needs in searching data. Broader (2002) identified the needs of a user’s search on the Web as Navigational, Informational, and Transactional. In general it is diffucult to determine the intent of a user since one user can have different goals in different searches or can combine two or more search needs to satisfy a single goal. In order to better control the users’ intent of a search, the target users of CONITA are selected from designated domains (e.g., Health, Terrorism, and 52 Business). For example, Intelligence Analysts, Health Professionals or Business Analysts are all target users of CONITA. The need of new Information Retrieval methodology for the users of these domains are beyond Navigational and Transactional uses. The dynamic structure of the Web makes it necessary for these users to retrieve dynamic information to better understand up-to-date changes. Rose and Levinson (2004) divided Informational Search to five subcomponents: “directed” (open, closed), “undirected“, “find,”, “list” and “advice”. However, this division does not explain the most common search intent for the domain users described in the scope of this thesis. While “directed” search is defined for searching a specific answer, “undirected” search is defined for searching all the information about the topic. In order to better customize the definition of “informational” search within the context of this thesis, we further subcategorized “informational” search as “Static” and “Dynamic”. Static Informational Search: The intent of the user in this search is to find the definition or description information about a given topic. 
Dynamic Informational Search: The intent of the user in this search is to find event related information about a given topic. The information requested in this category is not a unique answer but a set of answers that will be a response to the user’s request. Although it is not limited, CONITA is mainly designed to answer the “Dynamic Information” need of the user. Retrieving relevant documents for this category of search is challenging since it doesn’t have a single answer but yet should provide specific event information as requested in the query. 53 5.2 System Architecture The goal of this thesis is to provide a system that presents the best possible set of documents in response to information need expressed by a user though an inputted query. CONITA is not a crawler, thus it does not crawl the Web to find the relevant documents for the user’s query. Instead, it uses other search engines' returned results for the user’s query to build up the corpus. In order to provide the best possible set of documents in response to a user’s request, CONITA combines different components together. The flow of the architecture can be described as follows [Figure 3]: • The user enters the query. • The query passes to the search engine such as Google/Yahoo for the document retrieval. The returned documents as response to the query forms the corpus that CONITA will uses in ranking. • In addition to the search engine, the query also passes to the Ontologies for expansion and semantic analysis purposes. First to WordNet and then to the domain and Interest Ontologies. • The expanded query terms and the corpus of documents pass to the “Content Relevancy” module to be used in the computation of relevancy value that will be used in the ranking of the documents. • CONITA also has a “Trust Relevancy” module that can be integrated in calculating the relevance of the document. The trust relevancy module uses the 54 information about the user, query, source provenance and recommendations to estimate the credibility of the documents. • Once the Content and Trust relevancy values are calculated, the relevance of the document is calculated by optional integration of trust relevance value to content relevance value which is then used to rank the documents in a decreasing order of relevancy to be presented to the user. The following sections of the chapter, explain the each component of the architecture in details. Section 5.3 explains the role of WordNet in query expansion. In section 5.4, Interest and Domain ontologies are described and example the domain ontologies, health and terrorism are explained in details. Section 5.5 presents the mapping techniques between WordNet and a domain ontology that is followed by ontology categorization in Section 5.6. In Section 5.7 the documents set used in this thesis is explained followed by the explanation of mapping between ontology and document set in Section 5.8. In Section 5.9 and 5.10 details of content relevance and trust relevance methodologies and 5.11 describes the overall document relevance. 55 Figure 3: Context-Based Information Trust Analysis (CONITA) Framework 5.3 Query Expansion by using WordNet The number of words used in a Google query to search for information increased from 3 to 4 words in 2008 [Ussery, 2008]. This is a clear indication that users desire to get more relevant result by expressing their need with more words. 
As explained in Section 2.1.4, different query expansion techniques are used to expand the query and to help map the user's query to the documents. WordNet, the largest hand-built lexical ontology, has become one of the most practical taxonomies used in query expansion. Liu et al. (2004) used WordNet to disambiguate word senses of query terms, and then considered the synonyms, the hyponyms, and the words from definitions for possible additions to a query. They used common words in the definitions of the words entered in the query to find the relevant synsets of those words. Roberto and Paola (2003) utilized WordNet to expand a query and suggested that a good expansion strategy is to add those words that often co-occur with the words of the query. Moldovan and Mihalcea (2000) applied WordNet to improve Internet searches. In their method the words are paired, and each word is disambiguated by searching the Internet with queries formed using different senses of one word while keeping the other word fixed. The senses are ranked simply by the number of hits. In this way all the words are processed and their senses are ranked. The ambiguity between the query terms and the different senses of the words in WordNet has been the most challenging problem in using WordNet for query expansion. In this thesis we combine two methodologies to capture the context information of the words entered by the user and to reduce the ambiguity introduced by WordNet expansion.

5.3.1 Phrasal Map
Once user input is received, CONITA uses stop-word elimination to remove the words that are not useful in selecting relevant documents. The remaining words are passed to WordNet for evaluation. It is important to note that word-by-word mapping between the query terms and WordNet does not always preserve the context of the terms provided in the query. For example, when "brain tumor" is passed to WordNet, "glioma" is returned as its subclass. However, when "brain" and "tumor" are passed to WordNet separately, hundreds of subclasses of tumor are returned, including "lymphoma" and "Hodgkin's disease", which are not related to brain tumors. In order to better predict the context of the user's input, we find all the combinations of the query terms before mapping them into WordNet. Given a set of n terms in the query, all combinations of size r, C(n, r), are calculated and added to the set W as words/phrases. Starting from the combination with the maximum number of terms, each combination is checked against WordNet for a possible match. If a match is found, all the other combinations in W that are subsets of the matched phrase are erased from the list (see Algorithm: Mapping from Query to WordNet). For the combinations that are mapped (query mapped terms), their synonyms are added to the combination set, and the synonyms replace the mapped words to derive additional combinations.
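A minimal sketch of this combination step is given below. It is an illustration only, written against NLTK's WordNet interface rather than the Perl WordNet packages used in the thesis; the stop-word list is a placeholder, and multi-word phrases are looked up with underscores, as NLTK expects.

```python
from itertools import combinations
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

STOP_WORDS = {"the", "a", "an", "of", "in", "for", "on"}   # illustrative list

def phrasal_map(query):
    """Sketch of the phrasal mapping step: map the longest possible
    combinations of the query terms onto WordNet noun entries."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    mapped = []
    # Largest combinations first, so multi-word phrases win over single words.
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            # Skip combinations already covered by a longer mapped phrase.
            if any(set(combo) <= set(m.split("_")) for m in mapped):
                continue
            phrase = "_".join(combo)
            if wn.synsets(phrase, pos=wn.NOUN):
                mapped.append(phrase)
    return mapped

print(phrasal_map("brain tumor treatment"))
# expected with WordNet 3.0: ['brain_tumor', 'treatment']
```

The expected output mirrors the worked example that follows: the two-word phrase "brain tumor" is kept as a single mapped term, so its unrelated single-word subsets are never expanded on their own.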
The example of the query and mapping is presented as follows:

Query: "brain tumor treatment"
Combinations: {"brain tumor treatment", "brain tumor", "brain treatment", "tumor treatment", "brain", "tumor", "treatment"}
Mapped words: "brain tumor", "treatment"
Synonyms: "brain tumor" / "brain tumour"; "treatment" / "intervention"
Combinations with the added synonyms: {"brain tumor treatment", "brain tumor", "brain treatment", "tumor treatment", "brain", "tumor", "treatment", "treatment brain tumour", "brain tumour", "brain tumor intervention", "brain intervention", "tumor intervention", "intervention"}

This step makes it possible to better estimate the context of the query words by allowing more than one term to be mapped from the query to WordNet. Multiple-word mappings usually correspond to more specific concepts in WordNet than single terms, which helps to reduce the ambiguity of the words by linking them together.

Algorithm: Mapping from Query to WordNet
Input: Q, the set of query words after stop-word elimination; n = |Q|
Output: M, the set of mapped words; WS, the set of derived combinations including synonyms
    W := all combinations of the words in Q of size r, for every r with 0 < r <= n
    for i = n down to 1:
        for each combination w in W with i elements:
            if w is not a subset of any term already in M:
                look up w in WordNet; if it is found, add w to M
    WS := W
    for each mapped term m in M and each synonym s of m:
        for each combination w in W that contains m as a substring:
            replace m with s in w and add the result to WS
Table 2: Algorithm for mapping query to WordNet

5.3.2 Selecting Sense from Synset
Finding the proper sense of a word in WordNet is one of the major steps in determining the semantics of query terms and in reducing the ambiguity while extending the words [Liu et al., 2004; Moldovan and Mihalcea, 2000; Roberto and Paola, 2003]. For example, the word Java has several meanings: it can refer to an island in Indonesia, coffee, or a programming language. Thus there is a need for context information to identify the correct sense of the word. The set of words entered in the query can form a context to use for disambiguating the ambiguous word. In this case, if the set of words were object-oriented, class, platform-independent, and programming, we would know that the third sense would be the most appropriate. Michelizzi (2005) introduced the WordNet::SenseRelate::WordToSet Perl package, in which the context of a word is defined as its associated set of words. Thus, the correct sense of a target word is projected to be the sense that is most related to the words in the associated set. If there are N context words in the associated set of words, the best sense of the target word w is calculated as:

Best\_Sense = \arg\max_{t_i} \sum_{j=1}^{N} \max_{k} \, relatedness(t_i, s_{jk})    (6)

where t_i is sense i of the target word w, and s_{jk} is sense k of the j-th context word. Following that definition, in this thesis we use the WordNet::SenseRelate::WordToSet Perl package with WordNet version 3.0 to find the sense of the words mapped as a result of the "Phrasal Map". The SenseRelate package uses different relatedness measures for the similarity calculation (see Section 3.3.2).
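The thesis implements this step with the WordNet::SenseRelate::WordToSet Perl package. Purely as an illustration of Equation (6), the sketch below re-creates the same idea with NLTK's WordNet interface and the Jiang-Conrath (jcn) measure computed from the Brown information-content file; the corpus choice, helper names, and error handling are my own assumptions, not part of CONITA.

```python
from nltk.corpus import wordnet as wn, wordnet_ic
# requires: nltk.download('wordnet'); nltk.download('wordnet_ic')

BROWN_IC = wordnet_ic.ic("ic-brown.dat")

def relatedness(s1, s2):
    """Jiang-Conrath similarity between two noun synsets (0 if undefined)."""
    try:
        return s1.jcn_similarity(s2, BROWN_IC) or 0.0
    except Exception:                     # e.g. differing parts of speech
        return 0.0

def best_sense(target, context_words):
    """Pick the sense of `target` most related to the context words, in the
    spirit of Equation (6): argmax over target senses of the summed
    best-case relatedness to each context word."""
    scores = {}
    for sense in wn.synsets(target, pos=wn.NOUN):
        score = 0.0
        for word in context_words:
            context_senses = wn.synsets(word, pos=wn.NOUN)
            if context_senses:
                score += max(relatedness(sense, c) for c in context_senses)
        scores[sense] = score
    return max(scores, key=scores.get) if scores else None

print(best_sense("java", ["class", "program", "platform"]))
```

The context words steer the choice: a programming-related context pulls the score of the programming-language sense of "java" above those of the island and the coffee senses.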
In the scope of this thesis, all the mappings and comparisons are based on the WordNet noun set. Thus, as a robust relatedness measure for a noun-noun similarity [Pedersen, 2004], the jcn [Jiang and Conrath, 1997] content based similarity measure is used to determine the relatedness of the words entered in the query. Consequently, the highest ranked sense of the word is used to choose the correct sense of the words from WordNet. The aim of finding the correct sense of a word is to extract the semantically related terms that potentially improve the relevancy of the returned documents for a user’s request. To achieve this, in this research we have used “hyponym”, “holonym” and “synonym” of the mapped word/s for the corresponding sense found as a result of the SenseRelate algorithm. 5.4 Customizing User’s Information Request Customizing the results returned by Information Retrieval Systems for each user is a challenging task. Many research studies have been done on relevance feedback, to collect the data about user behavior [Baeza- Yates & Ribeiro-Neto, 1999]. However, all of this research is affected by the drifficulty of human interaction. In this thesis we are going to take the advantage of the common domain interest of the users to predict their information needs without the need of their interaction. Although it is not a restriction, the main use of CONITA is to find the information need of the users working on a 61 specific domain. This assumption gives us an opportunity to reduce the ambiguity of the words that can be derived from the query in addition of focusing the common goals of the domain users. In this thesis we chose Health and Terrorism domains as example domains to integrate with CONITA. The returned set of documents for the information request of users in these domains can be used for critical decisions. Time and information accuracy are the critical component in both domains. The users of these domains have a need to harvest the most up-to-date, reliable information in the shortest available time, which is a challenging problem for information retrieval techniques. 5.4.1 Interest Ontology Although, the domain ontology provides information about the domain specific aspects, when it comes to user requests, it becomes clear that the users within the domain might have different interests. For example, while one health professional is looking for information about cancer in children, the other might be interested in finding information about cancer in elderly people. Thus, the context of a user’s request within the domain can also be affected by the user’s interest. Therefore, we define a user’s interest within the domain as “interest ontology”, which is defined as follows: A small ontology with general concepts belonging to a specific domain that is different than the domain ontology but relates to it, is called Interest Ontology. In other words, it is a high level domain ontology that shares common concepts or relations with the domain ontology used by the system. In the following two sections, we give the details of examples of Health and Terrorism ontologies used to demonstrate the idea in this thesis. 62 5.4.2 Domain Ontologies One of the challenges of using domain ontology is finding a well-established, detailed ontology about the domain. The ontologies explained in this section are implemented by using Web Ontology Language (OWL) and Protégé 3.3 is used for visualization and manipulation of the ontologies. 
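Since the domain and interest ontologies are authored in OWL with Protégé, they can also be loaded and traversed programmatically. The sketch below is an illustration rather than code from the thesis; it uses the Python rdflib library, and the file name health_domain.owl is a hypothetical stand-in for the customized NCI ontology described in the next subsection.

```python
from rdflib import Graph, RDFS, OWL

def load_ontology(path):
    """Load an OWL file in RDF/XML syntax, e.g. one exported from Protege."""
    g = Graph()
    g.parse(path, format="xml")
    return g

def subclasses(graph, concept):
    """Direct is-a children of a concept."""
    return list(graph.subjects(RDFS.subClassOf, concept))

def descendants(graph, concept):
    """Every concept below `concept` in the is-a hierarchy."""
    seen, stack = [], [concept]
    while stack:
        for child in subclasses(graph, stack.pop()):
            if child not in seen:
                seen.append(child)
                stack.append(child)
    return seen

# Top-level branches (the "ontology categories" of Section 5.5), assuming the
# Protege export asserts top-level classes as subclasses of owl:Thing.
g = load_ontology("health_domain.owl")
print("Ontology categories:", subclasses(g, OWL.Thing))
```

The same two helpers are all that is needed later on to collect the subclasses of a mapped concept when the expanded term list is built.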
5.4.2.1 Health Domain Ontology The National Cancer Institute’s (NCI) Thesaurus is an ontology-like vocabulary that includes a broad coverage of the cancer domain, including cancer-related diseases, findings and abnormalities; anatomy; genes and gene products, drugs and chemicals and organisms. In certain areas, like cancer diseases and combination chemotherapies, it provides the most granular and consistent terminology available. It combines terminology from numerous cancer research-related domains, and provides a way to integrate or link these kinds of information together through semantic relationships. This thesaurus currently contains over 34,000 concepts, structured into 20 taxonomic trees. In the scope of our research we have eliminated some branches of the NCI ontology such as organisms, gene and gene products and dropped the size of concepts to 15258 to make is easier to run in a laptop with 1GB RAM. The extracted NCI ontology is a very shallow with max depth of 5, max sibling number of 814 and very few numbers of associative relationships. As we mentioned in the previous section, users of the domain might have different interest in their information requests. To demonstrate this idea we built two interest ontologies for the health domain: Environment and Person. The Environment 63 interest ontology covers the concepts from the environment that cause health problems. For example air-borne pollutants connect with the respiratory problems in the health domain or commuting in traffic causes stress and dependently depression problems [Figure 4-a]. The concept of Environment Interest ontology is taken from the federal Environmental Protection Agency (2009) and organized in hierarchy. The person interest ontology is used to characterize the people in health problems. For example a user searching for leukemia might be interested with the information sources that talk about leukemia in the context of children or in the context of adults or might want to analyze the bone structure of babies or of senior people. The integration of ontologies has been an ongoing challenge. Research on merging ontoloiges is divided integration process into different categories For example, Gruninger (1996) mentions two kinds of integration: combining ontologies that have been designed for the same domain, and combining ontologies from different domains. Once again the problem of integration is considered difficult since two ontologies may use the same terminology with different semantics. G´omez-P´erez et al. (1995) proposes that ontology building, and therefore ontology integration, should be done preferably at the knowledge level (Newell 1982) (in conceptualization) and not at the symbol level (in formalization, when selecting the representation ontology) or at the implementational level (when the ontology is codified in a target language). Since the aim of this thesis is not ontology integration, in order to ease the communication between the interest ontologies and domain ontology, starting from the root node we added each interest-ontology as a different branch in the domain ontology 64 [Figure 4-b]. Then the relations between the interest ontology branch and the health domain branches are provided by the associative relations. 65 Figure 4: a) Customized NCI ontology, b) Integrated Interest and NCI ontologies 66 5.4.2.2 Terrorism Domain Ontology Since September 11, 2001, the focus of mining terrorism data to predict further attacks by extracting patterns becomes an interest of Intelligence Analysts. 
The need of finding up- to-date data about terrorism events suggested that the terrorism domain should be one of the example domains of this thesis. However, the top secret category of the domain makes it difficult to find a publicly available Terrorism ontology. Between September 14, 2001 and November, 2001, Valdis Krebs (2001) assembled a corpus of texts regarding events preceding September 11 attacks. In the aftermath of the September 11 attacks, it was noted that coherent information sources on terrorism and terrorist groups were not available to researchers [Gruenwald, McNutt, and Mercier, 2003]. Information was either available in fragmentary form, not allowing comparison studies across incidents, groups or tactics, or made available in written articles, which are not readily suitable for quantitative analysis of terrorist networks. Data collected by intelligence and law-enforcement agencies, while potentially better organized, is largely not available to the research community due to restrictions in distribution of sensitive information. To counter the information scarcity, a number of institutions developed unified database services that collected and made available publicly accessible information on terrorist organizations. This information is largely collected from open source media, such as newspaper and magazine articles, and other mass media sources. Such open- source databases include: RAND Terrorism Chronology Database [Corporation, 2003] – including international terror incidents between 1968 and 1997, RAND-MIPT (Memorial Institute for Prevention of Terrorism) Terrorism Incident Database [Houghton, 2002], 67 MIPT Indictment Database [Smith and Damphousse, 2002] – Terrorist indictments in the United States since 1978. Both RAND and MIPT databases rely on publicly available information from reputable information sources, such as newspapers, radio and television. Gruenwald et al. (2003) introduced architecture for extracting ontology from the available databases. Military of Defense (2005) used small terrorism ontology in a case study. Defense Research and Development in Canada introduced TERROGATE (2006) as an information retrieval technology for terrorism domain in which it retrieved the information about tactics, weapons, targets, persons, groups and locations. In 2007, Mannes and Golbeck proposed Mindswap terrorism ontology that is constructed by the cooperation with terrorism specialists. In this work they mentioned “Strategic Intelligence” (group leaders, policies, structure) and “Tactical intelligence” (specific attacks and their methods) and focus their ontology on Strategic intelligence. They also emphasize the importance of describing events within the domain and defined “pre-event planning activity”, and “post-event activity” as the temporal context element for the related events. Despite these few attempts on constructing a terrorism ontology, we only had an access to the Mindswap terrorism ontology and found excerpts from MoD and TerroGate ontologies. Based on our research interest on the topic for years, we have constructed the terrorism ontology used in this research from these three ontologies. The terrorism ontology we built consists of six major branches (Figure 5-a): Event, Location, Organization, Person, Target, Weapons. These are the common categories used in the databases and the ontologies constructed by the other researchers mentioned above. The 68 Terrorism Ontology consists of 128 concepts with max dept of 4 and max number of siblings 12. 
Although it is smaller than the health domain ontology, the associative relationships in this ontology is denser. For example, some of the terrorist groups are connected with events based on the statistical information provided about the terrorist groups and the activities are represented in the terrorism ontology [Gupta, 2005]. Two interest onotologies are constructed regarding the closeness to the terrorism domain: Economics and Politics (Figure 5-b). The Economics interest ontology consists of the concept that relates to the effects of a terrorist attack on Economy, while the Politics interest ontology includes concepts that relate the terrorist attacks to politics. 69 Figure 5: a) Terrorism ontology, b) Integrated Interest and Terrorism ontologies 70 5.5 Mapping form WordNet to Domain Ontology As explained in Section 5.3.1, after mapping query terms to WordNet, the following sets are obtained: A set of mapped query terms from query to WordNet, is-a and part-of subclasses of mapped query terms, and all the possible combinations of the query terms including the synonyms. Although the contextual relations of the query terms are used in the sense selection of the terms, the information about the domain is still missing. In order to better eliminate the ambiguity of the obtained set of words from WordNet, there is a need to select a subset of the expanded set of words, which are relevant in the context of the domain user’s request. Before explaining the mapping process between WordNet and the domain ontology, we would like to define the word “Ontology Category” that will be used in the rest of the thesis as follows: Each branch starting from the node that has the root as a first level ancestor is called an ontology category For example, in the health ontology [Figure 4-a], the root is “Thing”, thus “Anatomy Kind”, “Biological Process Kind” , “Chemicals and Drugs Kind”, “Consequence” , “Environment” , “Finding”, “Finding and Disorders Kind”, “Organism Kind” and ”Person” are the ontology categories. The reason for defining categories within the ontology comes from the fact that each branch of the ontology covers a different kind of information. The more information that is gathered about different aspects of the ontology, the better view of the query context can be observed. 71 5.5.1 Syntactic and Semantic Map from WordNet The mapped set of terms obtained from WordNet (e.g., “brain tumor”, “diagnosis”) are based on syntactic map from query to WordNet. Although the context of query terms are used to extract the proper sense of the words, it is not guaranteed that the terms entered in the query will have any coherence for the determination of relevant senses of terms. Mapping the terms obtained from query to the concepts of the domain ontology is a challenging process. Ontology mapping has been addressed by researchers using different approaches: One-to-one approach [Mena et al., 1996], Single-shared ontology [Visser & Cui, 1998], Ontology clustering [Visser & Tamma, 1999]. Mappings in these approaches are represented as conditional rules [Chang & Garcxia-Molina, 1998], functions [Chawathe et al, 1994], logic [Guha, 1991], or a set of tables and procedures [Weinstein & Birmingham, 1998]. Little research exists on mapping the query terms to the ontologies. In some cases ontology query languages are used instead of free-style requests [Zhang, 2005]. Nagypal (2005) used drop-down menus to help users choose the query terms from the ontology. 
Given a domain-specific ontology and a query term, Syeda-Mahmood et al. (2005) used rule-based inference to find the related terms in the ontology. Domain ontology concepts can be composed of sub sentences that are not necessary would have a definition or synonym (e.g., “Generation of Antibody Diversity”, “Monoclonal Antibody M170”), which are Different than the lexical ontology such as WordNet, in which concepts are composed of mostly single or two, three words that together have a definition. Mapping the set of words obtained from WordNet to an 72 ontology that consists of unstructured concepts is a challenging task. A word such as “antibody” can occur in different ontology categories and might be part of a concept that consists of multiple words. In order to decide which mapped concepts best match to the corresponding query terms, we used the following steps: • Map each query mapped term (qmt) and the synonyms of qmt from WordNet to the domain ontology. • If two subclasses of qmt has one-to-one map to the concepts in the domain ontology that are siblings in the proximity of 2 or less (have a common ancestor in proximity of 2 or less), than keep those subclasses. The Depth 2 is taken since the ontologies used in this thesis are not too deep and mostly shallow [Figure 6-a] • If one mapped subclass is an ancestor of the other mapped subclass in the domain ontology by the proximity of 2, then keep these subclasses [Figure 6-b]. • If any of the mapped qmt on the domain ontology is an ancestor of any other qmt mapped subclass in the proximity of 2 or less than keep the subclass [Figure 6-c]. • Categorize picked subclasses based on the ontology categories and find the number of subclasses in each category. Figure 6: a) Two sibling qmt subclasses, b) qmt subclass is ancestor of the other, c) qmt term and qmt subclass 73 Mapping qmt subclasses to the domain ontology and extracting the common subclasses through the above cases help us to find the concepts of the domain ontology that are semantically related to the query. Besides the semantic matching of subclasses, the importance of the exact words entered by the user should be considered. Thus, the cases below are used to find the best match of the mapped qmt on the domain ontology. • Count the number of qmt maps in each Ontology Category • If there is an exact match (one-to-one mapping) between qmt and the domain ontology concepts than assign qmt to that ontology category If the qmt doesn’t have an exact match but all the maps are in the single ontology category, than assign qmt to that ontology category. When mapped qmt concepts correspond to more than one ontology category, there is a need to choose the proper concept and the related ontology category. For the ontologies that do not have an associative relationship between the mapped qmt concepts the following criteria is used to choose the best category qmt belongs: • If qmt maps more than one category, then check the category of the selected subclasses of qmt that is obtained as a result of the above cases and assign qmt to the category of subclass. • If there is more than one mapped categories and the number of mapped cases between the two highest categories are less than 5, than check whether any other qmt is assigned to the category with highest number by the above mentioned steps, if so pick the second highest category. Otherwise, pick the category with the highest number. 
• If there is a query term that is not part of any qmt or has no map in the domain ontology, then use its synonyms, together with all the combinations derived from the query terms, to check whether there is any map on the domain ontology. If so, apply the above steps.
• If neither the synonyms nor the combinations with synonyms of a query term appear in the domain ontology, assign it to the category called "Independent", which means it is not available in the ontology.
Once the above process is completed, each term (or group of terms) entered by the user in the query is assigned to a category in the domain ontology together with the subclasses relevant to the domain. As a result, we have a subset of the words extracted from WordNet. However, the domain ontology is likely to contain more detailed information about the mapped concepts. Thus, once the mapped concepts are found, their subclasses from the domain ontology are added to the set of extended terms.
The above-mentioned algorithm is effective when there is no associative relationship between the ontology categories. However, the real gain of using ontologies is to extract the semantic information provided through associative relations. Therefore, the following cases are used to select the proper categories of the mapped qmt concepts when associative relations exist between the mapped qmt on the domain ontology.
Finding associative relations between qmt-mapped ontology concepts:
Case I: If there is a direct associative relation from one qmt-mapped concept to another that belongs to a different ontology category, then choose the two qmt terms and their categories [Figure 7-a].
Case II: If there is a relation between one of the mapped qmt concepts and the parent (at a proximity of 2 or less) of another mapped qmt concept belonging to a different ontology category, then choose the mapped qmt concepts and categories along with the connecting concepts and their relations [Figure 7-b].
Case III: If there is a relation between the parent (at a proximity of 2 or less) of a mapped qmt concept and another mapped qmt concept belonging to a different ontology category, then pick the mapped qmt concepts and categories along with the connecting concepts and their relations [Figure 7-c].
Case IV: If there is no direct relation between the mapped qmt concepts but there is one between their ancestors at a proximity of 2 or less, then pick the mapped qmt concepts and categories along with the connecting concepts and their relations [Figure 7-d].

Figure 7: a) Direct associative relation, b) Associative relation to a parent, c) Associative relation from a parent, d) Associative relation between the parents

Finding paths between all possible qmt terms: As a result of the above cases we obtain the paths (i.e., the sets of nodes and relations from one mapped qmt to another) between the mapped qmt concepts. In the following section the terms Left Hand Side (LHS) and Right Hand Side (RHS) are used to describe the associative relations within the paths. Anything before the associative relation on the path is considered the LHS, whereas anything after the associative relation is considered the RHS. In case there is more than one associative relation in the path, everything before the last associative relation is considered the LHS and everything after it is considered the RHS of the path.
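The four cases above amount to a small amount of graph traversal over the domain ontology. The following is a minimal sketch of that check, assuming the ontology is available as two plain maps (associative relations and is-a parents) over string concept names; the class and method names are illustrative and not CONITA's actual implementation.

import java.util.*;

/** Minimal sketch of Cases I-IV: finding an associative connection between two
 *  mapped qmt concepts that belong to different ontology categories. */
public class AssociativePathFinder {
    final Map<String, Set<String>> related;   // concept -> associatively related concepts
    final Map<String, String> parent;         // concept -> direct is-a parent

    AssociativePathFinder(Map<String, Set<String>> related, Map<String, String> parent) {
        this.related = related;
        this.parent = parent;
    }

    /** Ancestors of c up to maxHops levels (a proximity of 2 in the thesis). */
    List<String> ancestors(String c, int maxHops) {
        List<String> out = new ArrayList<>();
        String cur = c;
        for (int i = 0; i < maxHops && parent.containsKey(cur); i++) {
            cur = parent.get(cur);
            out.add(cur);
        }
        return out;
    }

    boolean relatedTo(String a, String b) {
        return related.getOrDefault(a, Set.of()).contains(b);
    }

    /** Returns the connecting concepts (path endpoints plus any bridging parent),
     *  or an empty list if Cases I-IV all fail. */
    List<String> findPath(String a, String b, int maxHops) {
        if (relatedTo(a, b)) return List.of(a, b);                      // Case I
        for (String pb : ancestors(b, maxHops))
            if (relatedTo(a, pb)) return List.of(a, pb, b);             // Case II
        for (String pa : ancestors(a, maxHops))
            if (relatedTo(pa, b)) return List.of(a, pa, b);             // Case III
        for (String pa : ancestors(a, maxHops))
            for (String pb : ancestors(b, maxHops))
                if (relatedTo(pa, pb)) return List.of(a, pa, pb, b);    // Case IV
        return List.of();                                               // no connection found
    }
}

Each non-empty result corresponds to one path (with its LHS/RHS split around the associative relation) that feeds into the combination criteria described next.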
The following criteria are used to combine two paths P1 and P2 that share a common concept, in order to extend the paths and find variations:
• If the first concept, say "x", on the RHS of P1 appears on the LHS of P2, take all the concepts and relations of P1 before "x" and combine them with all the concepts and relations that appear after "x" in P2.
• If a concept, say "x", that is on the LHS of P1 appears on the LHS of P2, take all the concepts to the left of "x" from P1 and combine them with all the concepts and relations to the right of "x" from P2.
• If a concept, say "x", that is on the RHS of P1 appears on the RHS of P2, take all the concepts and relations before "x" from P1 and combine them with all the concepts and relations to the right of "x" from P2.
• If the last element on the RHS of P1 is the first element on the LHS of P2, combine all the concepts and relations of P1 with all the concepts and relations of P2 starting from the first concept.
The above algorithm provides all the possible paths between the mapped qmt terms on the ontology. Since these paths provide coherent information about the overall context of the query words, it is important to choose the mapped qmt terms and the assigned categories that belong to these paths. The longer the path, the more likely it is to contain more mapped qmt terms and associative relations that cross between different ontology categories. Once the proper paths (the ones with the most mapped qmt terms belonging to different ontology categories) are selected, the subclasses of the selected concepts in these paths are also added to the set of chosen terms to extend the word list.

5.6 Ontology Categories and Context

As opposed to a Static Informational search, in which the user usually enters a single word to find the description or definition he requests, users with the intention of finding Dynamic Information tend to enter more than one term to define the context of a search. Passing the query terms to WordNet mostly helps to observe related terms corresponding to the same concept category. Once the terms are mapped from WordNet to the domain ontology, the terms that are relevant for expansion in the context of the domain are obtained together with their assigned ontology categories. One of the main goals of our research is to show that successful IR should take context information into account when retrieving relevant documents. In this thesis, ontology categories are the representatives of the different context elements within the domain. Thus the context of the query terms within the domain is defined as follows: query terms mapped to the domain ontology under different ontology categories constitute different context elements of the query.
Context is mostly associated with events. Context-based event extraction [Baldauf & Dustdar, 2004] is a research area in itself and is not the main focus of this research; however, analyzing the event information within the domain suits the goal of Dynamic Information search well. Thus we define the event category within the domain ontology as follows. An ontology category can be assigned as an event category iff:
- concepts of the ontology category have a consequence, or
- concepts of the category are a consequence of concepts from another ontology category (post-event).
In domains like Terrorism and Health the focus of the domain events is people; thus the consequences of the event category are expected to affect people.
For example, in the Health domain the category "Finding and Disorders Kind" can be assigned as an event category of the domain, since diseases affect people and as a consequence people get sick, die, go blind and so on. At the same time, the ontology category "Consequence" can also be defined as an event within the domain, since it represents an event that occurs as a consequence of another event. Similarly, in the Terrorism domain the ontology category "Event" can be defined as an event category, since concepts such as attack and kidnapping have consequences. When mapping the extended terms to the documents to find which document potentially provides the most relevant information for the user's request, the context information of the mapped terms is taken into account.

5.7 Document Set

The document set used as a corpus in this thesis is collected through the Web. CONITA is not a crawler; instead it uses other search engines to collect the data for its corpus. When a user enters an information request to CONITA, the query is passed to two modules: one to WordNet and the ontologies, as explained in the previous sections, and the other to the search engine. Yahoo is the main search engine used to collect the documents for the corpus of CONITA. Since the users of CONITA are interested in finding the most up-to-date dynamic information, we primarily used Yahoo's news search to collect the relevant documents. Besides the advantages of using other search engines to retrieve the documents, not having control over the crawling process comes with disadvantages. The selection of documents in the corpus of CONITA depends on the relevance criteria used by the search engine. This places a limitation on CONITA, since in some cases the corpus might not contain any documents matching CONITA's relevancy criteria even though there may be documents on the Web that are relevant to the user's request.

5.7.1 Advantage/Disadvantage of using Web Documents

One of the most common ways of measuring the success of an IR technique is to compare it to other techniques that use the same data set. That is the main purpose of the ongoing TREC workshops, as explained in Section 2.1.6. TREC provides data sets for specific research purposes to be used by IR systems. However, these data sets are pre-processed and organized and, therefore, do not always reflect the real challenges IR techniques face in the Web environment. The main purpose of CONITA is to bring the most relevant, up-to-date information to its users. Thus, the motives behind using Web documents in this research are the following:
• Dynamic Structure: With its billions of participants, the Web provides a different set of documents for a user's information request at different times and is well suited for users who need to capture timely changing data.
• Domain-Free Data Set: Unlike most data sets, which consist of domain-specific information sources used to retrieve domain-specific information, we used the Web without any restriction on the domain of the documents to retrieve customized results for the users of a specific domain.
• Easy Access: Users do not need to pay, or obtain permission, to access the available data.
• No Pre-processing: The data is in the original form provided by the source; thus it does not lose any context information, as it would if it were stored in databases.
As with anything else, using Web documents as a corpus comes with the following drawbacks that need to be taken into account when processing the data:
• Trust Issue: The source of the information provided is not necessarily known to the user. This sometimes raises questions about the credibility of the returned documents.
• Lots of Documents: Although having many documents from which to choose information is an advantage, it challenges IR techniques to apply better relevance measures to filter out unwanted information.
• Repeated Documents: In the Web environment it is very common for the same document to be posted by different sources. Thus, besides a good IR technique, there is a need for an integrated module to check the similarity of the returned documents.
• Formatting: Although HTML is one of the common languages used for documents on the Web, it does not enforce a formal structure. A document with the same visual appearance can be written in different ways using HTML and may even have missing tags. As HTML is not the only language used to represent documents, it is also common to find documents in ".pdf", ".doc", ".txt", ".ps" and ".ppt" formats throughout the Web.
In the following sections we explain how we address the above-mentioned challenges within the structure of CONITA.

5.8 Ontology to Document Map

In the previous sections we explained how the terms entered in a user's information request are expanded and described the documents used as a corpus for CONITA. In this section the process of mapping the expanded query terms to the documents is described.

5.8.1 Capturing Concepts in the Document

Traditional IR techniques are based on indexing and use the number of query words appearing in a document to decide on the relevancy of the document. However, this technique does not consider any semantics or the context information that is crucial to a user's information need. The motive behind CONITA is to add the domain and context information into the document selection process and to customize the returned documents based on the user's interests. The main reason for assigning concepts to ontology categories, as mentioned in the previous section, is to use these categories as context elements in the process of mapping the expanded words to the documents. Once all the words extracted from the domain ontology are mapped to the document, there is a need to analyze how the mapped words fit contextually within the document. In this thesis the contextual relation of the mapped words within the document is based on the actual distance between elements of different categories. The distance can be computed at different granularities. WordNet defines "paragraph" as one of several distinct subdivisions of a text intended to separate ideas. Since our interest is in finding the semantically relevant documents for a user's information request, we used paragraphs as the unit over which to calculate the distance between elements of different ontology categories.
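As a concrete illustration of this mapping step, the following is a minimal sketch that scans each paragraph for the expanded terms and records the ontology category each matched term was assigned to. It assumes simple lower-cased substring matching (a real implementation would also need tokenization, stemming and phrase handling), and the example terms and category assignments are illustrative only.

import java.util.*;

/** Minimal sketch of mapping the expanded term set onto a document:
 *  each paragraph is scanned for the expanded terms, and every match is
 *  recorded together with the ontology category the term was assigned to. */
public class ParagraphTermMapper {

    /** expandedTerms maps a lower-cased term to its ontology category. */
    static List<Map<String, String>> mapTerms(List<String> paragraphs,
                                              Map<String, String> expandedTerms) {
        List<Map<String, String>> perParagraph = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String text = paragraph.toLowerCase();
            Map<String, String> matches = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : expandedTerms.entrySet()) {
                if (text.contains(e.getKey())) {
                    matches.put(e.getKey(), e.getValue()); // term -> category
                }
            }
            perParagraph.add(matches);
        }
        return perParagraph;
    }

    public static void main(String[] args) {
        // Illustrative terms and category assignments, not taken from CONITA's code.
        Map<String, String> terms = Map.of(
                "brain tumor", "Finding and Disorders Kind",
                "surgery", "Biological Process Kind",
                "child", "Person");
        List<String> paragraphs = List.of(
                "The child underwent surgery for a brain tumor last week.",
                "Hospitals reported no new cases.");
        System.out.println(mapTerms(paragraphs, terms));
    }
}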
5.9 Finding Content Relevant Documents

In order to retrieve the relevant documents in response to a user's request, a value should be assigned to each document. In this thesis, the relevancy value of the document is assigned at the paragraph level. First, the extended set of words with their context categories is found in the paragraph; then the minimal distance between the different context elements is calculated, as explained in the following sections. Once the minimum distance between the mapped words is calculated, the value of the paragraph is compared with the values of all the other paragraphs, and the paragraph with the minimal contextual distance is selected to represent the relevance value of the document.

Algorithm: Calculating Content Relevancy
Return: Relevance Value of the Document
1: P ← set of paragraphs in the document
2: D ← document
3: W ← set of all expanded words
4: M ← list of words mapped from the ontology to the paragraph
5: MC ← category of each mapped word
6: MinDist ← 100000
7: for each (P: p1, p2, ..., pk) {
8:   M ← MappedWords(W, pi)
9:   for each (M: m1, m2, ..., mn) {
10:    MCj ← FindCategory(mj)
11:  }
12:  Distance ← OverallMinDistanceBetweenDifferentCategoryElements(MC)
13:  if (Distance < MinDist)
14:    MinDist ← Distance
15: }
16: DocumentRelevance ← 1 / MinDist

Table 3: Algorithm for calculating content relevancy

5.9.1 Proximity of Mapped Terms

Term proximity has been explored extensively in document ranking studies [Rasolofo and Savoy, 2003; Hawking and Thistlewaite, 1996; Clark et al., 2000], where several distance factors were proposed. Two common intuitions underlie all of these approaches:
• The closer the terms are in a document, the more likely they are topically related;
• The closer the query terms are in a document, the more likely the document is relevant to the query.
Our intuition in this thesis is the same. The difference is that in the above methods term proximity is calculated at the document level and all terms are treated equally. In CONITA, terms are further customized by the following criteria:
• Terms are associated with context categories, and the proximity is calculated between terms of different context categories.
• Term proximity of the document is calculated at the paragraph level.
Thus, we adapted the following two approaches from the above studies.

5.9.1.1 Span-based Proximity Measure

Span is defined as the length of the shortest document segment that covers all query term occurrences in a document, including repeated occurrences [Hawking and Thistlewaite, 1996]. For example, for the short document d = (t1, t2, t1, t3, t5, t4, t2, t3, t4), if the query terms are {t1, t2} the span value is 7. Similarly, MinCover is defined as the length of the shortest document segment that covers each query term at least once in a document. For {t1, t2} and the above document d, MinCover would be 2. The term proximity used in this thesis is adapted from MinCover and called CMinSpan. Given a paragraph, CMinSpan is defined as the length of the shortest segment within the paragraph that covers at least one term from each context category that appears in the paragraph. For example, if a1 and a2 belong to category 1, b1 belongs to category 2, and c1 and c2 belong to category 3, the term proximity of p = (t1, t2, a1, t4, c2, b1, t5, t6, a2, t1, t2, c1) based on CMinSpan would be {a1, c2, b1} with value 4. Both the Span and MinCover methods favor documents with fewer query occurrences. A normalization factor is used to fix this bias by dividing the Span value by the number of occurrences of query terms in the span segment [Tao and Zhai, 2007]. In CMinSpan, normalization is done based on the number of categories, as explained in Section 5.9.2.

5.9.1.2 Pair-wise Proximity Measure

Pair-wise distance is defined as the distance between individual term occurrences, and the overall proximity value is calculated by aggregating the pair-wise distances.
For example, for the query words q = {t1, t2, t3}, the proximity between the term pairs {(t1, t2), (t1, t3), (t2, t3)} is calculated. Rasolofo et al. (2003) compute the term pair instance weight as follows:

tpi(ti, tj) = 1 / d(ti, tj)^2   (7)

where d(ti, tj) is the distance, expressed in number of words, between search terms ti and tj. We adapted Rasolofo's tpi formula in the scope of CONITA. However, instead of calculating the distance between each pair of mapped terms in the document, the distance between each pair of mapped terms in a paragraph that belong to different categories is calculated. For the example given for CMinSpan, p = (t1, t2, a1, t4, c2, b1, t5, t6, a2, t1, t2, c1) with CMinSpan {a1, c2, b1}, the pair-wise distance would be calculated between {a1, c2}, {a1, b1} and {c2, b1}. Thus we define the term pair weighting between two terms as

Tpw(ci, cj) = 1 / min(d(ci, cj))   (8)

where d(ci, cj) is the distance, expressed in number of words (as in formula (7)), between two mapped terms in the paragraph that belong to different categories. If a category has more than one term mapped, then the minimum distance is selected to represent d(ci, cj). In formula (8) we eliminated the square of the distance, since the distance between terms in our approach is calculated at the paragraph level rather than the document level and will not be that large.
We customize formula (8) further to fit the purpose of the applications used in this research. As mentioned in the previous sections, it is logical to choose an event category among the ontology categories. A user searching for event-related information will most probably be interested in the context elements of the event (which correspond to the associative relations from the event category in the domain ontology). Thus, for paragraphs with terms mapped from more than two different categories, we first check whether there is a term mapped to the event category; if so, we find the minimum pair-wise distance from the term belonging to the event category to each of the other categories. For instance, in the above example, if "c2" were the term from the assigned event category, the pair-wise distances between the terms {c2, a1} and {c2, b1} would be calculated. Therefore, the total pair-wise term proximity of a paragraph with a mapped event term is calculated as

TTpw(E) = Σ (cj ∈ C) 1 / Tpw(E, cj)   (9)

which is the sum of the minimum distances from a term E that is a member of the event category to a term from each of the other mapped categories in the paragraph. If the paragraph has more than one term mapped from the event category, the minimum of TTpw is selected:

MTTpw(p) = min( TTpw(E) )   (10)

In the following section we explain how the CMinSpan and pair-wise distance methods are used and normalized in the scope of CONITA.
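The two measures can be sketched directly from the definitions above. The snippet below assumes the paragraph has already been tokenized and each token labeled with its ontology category (or null when unmapped); it uses a brute-force window scan and word-position differences, so it is an illustration of the definitions rather than CONITA's actual implementation.

import java.util.*;

/** Minimal sketch of the paragraph-level proximity measures described above. */
public class ProximityMeasures {

    /** CMinSpan: length of the shortest token window that contains at least one
     *  term from every distinct category mapped in the paragraph. */
    static int cMinSpan(String[] categoryOf) {
        Set<String> all = new HashSet<>();
        for (String c : categoryOf) if (c != null) all.add(c);
        if (all.size() < 2) return categoryOf.length;   // nothing meaningful to span
        int best = Integer.MAX_VALUE;
        for (int i = 0; i < categoryOf.length; i++) {
            Set<String> seen = new HashSet<>();
            for (int j = i; j < categoryOf.length; j++) {
                if (categoryOf[j] != null) seen.add(categoryOf[j]);
                if (seen.size() == all.size()) { best = Math.min(best, j - i + 1); break; }
            }
        }
        return best;
    }

    /** Event-anchored pair-wise distance (formulas 8-10): for an event-category
     *  term at position eventPos, sum the minimum word distances to the nearest
     *  term of every other mapped category. */
    static int eventPairwiseDistance(String[] categoryOf, int eventPos) {
        Map<String, Integer> minDist = new HashMap<>();
        for (int i = 0; i < categoryOf.length; i++) {
            String c = categoryOf[i];
            if (c == null || i == eventPos || c.equals(categoryOf[eventPos])) continue;
            minDist.merge(c, Math.abs(i - eventPos), Math::min);
        }
        int sum = 0;
        for (int d : minDist.values()) sum += d;
        return sum;
    }

    public static void main(String[] args) {
        // p = (t1, t2, a1, t4, c2, b1, t5, t6, a2, t1, t2, c1) from the example above:
        // a* -> category 1, b* -> category 2, c* -> category 3.
        String[] cat = {null, null, "cat1", null, "cat3", "cat2",
                        null, null, "cat1", null, null, "cat3"};
        System.out.println(cMinSpan(cat));                  // 4, covering {a1, c2, b1}
        System.out.println(eventPairwiseDistance(cat, 4));  // distances from c2 to a1 and b1
    }
}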
5.9.2 Content Relevance Value of the Document

The content-based relevance of a document is calculated from the term proximity values obtained from its paragraphs. The paragraph-level term proximity is calculated by the methods introduced above (CMinSpan, pair-wise). If any paragraph of the document has a term from the event category, then both the CMinSpan and pair-wise methods are used in ranking the documents. Otherwise, if there are no paragraphs with a term from the event category, CMinSpan alone is used to calculate the value of the document. Besides the terms in the event category, the ranking of a document is affected by the number of different category elements available in its paragraphs. The number of distinct categories mapped to a paragraph is the main measure in determining the relevance of a document: the more terms from different categories are mapped to the paragraph, the more likely the paragraph is to have a coherent theme that corresponds to the user's request. If an interest ontology is selected along with the request, the interest ontology is treated as part of the domain ontology in the term mapping. In addition, in order to better see the focus of the document on a particular interest, the number of distinct terms that fall into the interest category is calculated at both the document and paragraph level. Overall document relevancy is calculated in the following order:
• The paragraph with the highest number of mapped categories is selected to represent the relevance value of the document. Each document is categorized into levels based on the number of different categories of the selected paragraph.
• Depending on the category kind, the CMinSpan and/or pair-wise proximity measures are used to calculate the proximity value of the paragraph with the highest number of distinct categories.
• If more than one paragraph has the highest number of categories, then the paragraph with the maximum proximity value is chosen (max = 1/min).
• If the user picks an interest ontology while requesting the information, the number of distinct interest terms within the paragraph and the document is calculated to determine the emphasis of the document on the interest area of the user.
Most span-based proximity calculation methods favor documents that have few words mapped and need normalization to balance the proximity value. Usually the number of words within the document is used to normalize the proximity value [Singhal, 1996]. In this research the number of categories and the document with the maximum proximity value are used to normalize the CMinSpan value as follows:
• The document with the highest proximity value is selected, and the proximity values of all the other documents are divided by that value.
• Documents are sorted within each level based on the proximity value.
Not all information requests have a term from the event category or the interest ontology. If the query has at least one term mapped to the event category, then the event category is used to make comparisons between the sorted documents (by CMinSpan) in each level. Similarly, the interest ontology can be used to judge the document's coverage of the particular area within the domain.

5.9.3 Content extraction from HTML Documents

As mentioned in the previous sections, the documents returned by the search engine might be in different formats. The documents used in this thesis are in HTML format. In order to analyze the paragraph structure of the documents, we pass them through an open-source HTML parser (http://htmlparser.sourceforge.net/). HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. The parser produces a stream of tag objects, which can be further parsed into a searchable tree structure. Our analysis of the retrieved news documents showed that most of the documents use the paragraph tag "<p>" to mark the paragraphs within the body of the document. In addition, in some cases we observed that "<br><br>" and "</br></br>" tags are used to set a new line between paragraphs. Since "</br>" is not recognized as a valid tag by the HTML parser, we used the "<p>" and "<br><br>" tags to split the HTML documents into paragraphs, which covers about 90% of the documents retrieved by the search engine.
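The following is a simplified stand-in for this extraction step. It does not use the HTML Parser library itself; instead it splits the document body on the same "<p>" and "<br><br>" boundaries with a regular expression and strips the remaining tags, which is enough to illustrate how the paragraph units used in the proximity calculations are obtained.

import java.util.*;

/** Simplified sketch of paragraph extraction: split on <p> and <br><br>
 *  boundaries, then strip the remaining tags and whitespace. */
public class ParagraphExtractor {

    static List<String> extractParagraphs(String html) {
        // Split on the two paragraph separators observed in the retrieved news pages.
        String[] chunks = html.split("(?i)<p[^>]*>|<br\\s*/?>\\s*<br\\s*/?>");
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : chunks) {
            String text = chunk.replaceAll("(?s)<[^>]+>", " ")   // drop remaining tags
                               .replaceAll("\\s+", " ")
                               .trim();
            if (!text.isEmpty()) paragraphs.add(text);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>First paragraph.</p>"
                    + "<p>Second paragraph.<br><br>Third paragraph.</p></body></html>";
        extractParagraphs(html).forEach(System.out::println);
    }
}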
5.10 Trust Relevancy of Documents

Trust has an important role in people's decision-making process [Josang, 1999]. Being subjective, trust changes based on the context, which makes it harder to define [Gulati, 1995]. Trust can be used to refer to trusting information or to trusting a source. These two kinds of trust are closely related to each other and are sometimes used interchangeably to refer to trust in general. Trusting a source is mostly based on the reputation of the source in the particular area in which it is to be trusted. On the other hand, trusting information is based more on the correctness of the information provided. Although it does not make sense to talk about the trustworthiness of information provided by a non-trusted source, it is possible not to trust information provided by a trusted source, which can later affect the trustworthiness of that source. In this thesis we use the term trust as a credibility measure in the user's decision-making process of selecting documents as relevant or not. Users with different information needs have different and varying levels of trust requirements in their decision making. While the decision making of users in domains like terrorism and health is highly affected by trust, users in domains such as music and sports are less concerned about trust in their decision-making process.
What makes Websites credible? The answer to this question varies with multiple attributes, such as the domain (discipline, the goal of the information seeker), the information source (author, the author's expertise, the intent of the author), and Website design (design of the page, information structure). Fogg's (2004) experiment on 2684 consumers showed that consumers rely mostly on "design look" and "information structure" when deciding on the credibility of Websites. On the other hand, Stanford et al.'s (2002) study on health experts showed that they value the name reputation of a site, its operator or its affiliates, the information source or references, and the company motive. Similarly, Flanagin and Metzger (2007) found that people consider news the most credible source on the Web, while personal websites are considered the least credible.
One of the goals of CONITA is to integrate the trust component into the document ranking process [Figure 3]. Because of the wide variety of sources and users, trust in IR systems has not been studied in detail. The biggest attempt to measure the credibility of documents on the Web is the TrustRank algorithm [Gyongyi et al., 2004]. However, credibility in TrustRank is calculated based on the link structure and in general omits the interest and context of the user's information request. Customizing the search process for specific domain users reduces the dimensions of trust variance: once the expertise of the user is known, a better idea of the user's goals can be captured. Although the trust attributes should be adjusted for different domain users, we use Stanford et al.'s (2002) experimental results on health professionals to assess the credibility of Web documents for the domain users of CONITA.
Although these criteria might not be correct for all domains, we assume that users of domains in which the accuracy of the information is crucial have similar evaluation metrics for assessing the credibility of sources. Although CONITA is not limited to them, the documents fed to it are collected through the Web. In general, provenance is one of the well-known attributes for measuring the credibility of Web documents [Stanford et al., 2002]. Most of the time, knowing the provenance of information is not enough by itself, and there is a need for reputation information about that provenance. Although some sources have built up a reputation over years (e.g., CNN), not all sources are well known. Besides, the credibility of a source might change depending on the source's area of expertise and the interest of the user requesting the information. Although the dimensions of trust within domain-specific applications are smaller, trust still has variances that make it impossible to have a single formula that applies to all. However, given the domain and the interest information captured about the user, the credibility of the documents in the CONITA framework can be roughly estimated.
The first step in determining the credibility of a document is to learn about the context of the search. In CONITA this is done by collecting "user"- and "request"-specific information. Once the user enters the query, metadata information about the "request" (e.g., the query terms, which ontology categories the terms mapped to, the interest of the user) is stored. In order to collect information about the "user", the user needs a search history. The information provided by the user in his previous requests about the credibility of documents is used to determine which sites the user finds credible. The information about the credibility of the documents is collected through a radio button placed next to each document summary in the ranked list of documents presented to the user. If a user finds a source credible, he simply clicks the radio button to provide his feedback. The reason for having a button instead of using the click measure [Sugiyama, 2004] is that clicking on a document does not necessarily show the intent of a user and can lead to wrong information. Although credibility can be quite subjective, we assume that, as in Stanford's experiment, domain users apply similar measures, such as provenance and the author's expertise, in their decision making. A user's click on the button saves the metadata information about the document, including "document URL", "author", "related links" and "references", which is used in estimating the credibility of sources for further searches.
From the extracted metadata of the document, two levels of trust information are obtained. The metadata that provides information about the "document URL" and "author" is considered "first hand trust" information in the credibility calculation, since this information is the result of the user's direct feedback about the document's credibility. The information collected from metadata about other sources, such as "related links" and "citations", is called "second hand trust" information. These are not documents found credible by the user, but sources credited by the creator of a document that the user selected as credible. The assumption behind this is similar to the recommendation logic of FOAF [Brickley, 2004]: reaching unknown sources/documents through known, trusted ones.
The above-mentioned credibility measure is not useful for first-time users of the system, since feedback from a user's previous search results is needed to obtain the metadata about the documents. However, repeated use of this approach not only helps to collect general data about the credibility of sources but also brings the advantage of using ontologies: the user metadata can be categorized under the ontology categories, detecting the trust a user has placed in more focused areas of the ontology (Figure 8). The third variance we use in calculating the credibility of documents is called "third hand trust" and is quite similar to the "second hand trust". However, "third hand trust" does not depend on a user's credibility judgments from previous searches and can be obtained even for the first search of a user, without the need for history information. Instead of the user's selections, CONITA's top n ranked content-relevant documents are used to measure a document's credibility. The top n content-relevant documents are semantically the best matches found for the user's request in the given corpus. Therefore, we use the judgment of the authors of the top n documents by checking the inter-referred references (e.g., links, citations) among the top n documents. For example, if the fifth element in the ranked set of documents is referred to by the authors of 10 other documents among the first n top-ranked documents, its credibility can be measured as higher than that of a document that did not get any referral from the top n ranked documents.

Figure 8: Trust Relevancy (Credibility) of the documents based on first hand trust (previously selected documents by the user), second hand trust (related links/references of first hand trusted sources) and third hand trust (number of inter-referred links from the first "n" content relevant documents)

Although the second and third hand trust information might be sparse and not available for all articles posted on the Web, it complements the first hand trust when available. Therefore, the overall credibility is calculated as a combination of these parameters:

CD = a*FHT + b*SHT + c*THT   (11)

in which CD is the credibility of the document, a, b and c are coefficients, and FHT, SHT and THT are the first, second and third hand trust. The values of a, b and c depend on the user's behavior and can be calculated after a sufficient amount of data has been accumulated by the system. Although there is no one-size-fits-all when it comes to a trust measure, the above scenario gives a general overview of how the credibility of a document can be measured for specific domain users employing Web documents as information sources.
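A minimal sketch of this combination is shown below; the weights are illustrative defaults (the thesis leaves them to be tuned from accumulated user data), and the three input signals are assumed to have been extracted and normalized elsewhere.

/** Minimal sketch of formula (11); signal extraction (user feedback, related
 *  links, inter-references among the top-n documents) happens elsewhere. */
public class CredibilityScore {
    double a = 0.5, b = 0.3, c = 0.2;   // illustrative weights, to be tuned from user data

    /** firstHand: direct user feedback on the source; secondHand: links/references
     *  of first-hand trusted sources; thirdHand: inter-references among the top-n
     *  content-relevant documents. All three are assumed normalized to [0, 1]. */
    double credibility(double firstHand, double secondHand, double thirdHand) {
        return a * firstHand + b * secondHand + c * thirdHand;   // CD = a*FHT + b*SHT + c*THT
    }
}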
In this thesis, CONITA's major data set consists of news data. News sources in Yahoo and Google are predetermined sites from which new information is collected through RSS feeds; for example, Google collects information from 5306 news sites [Newsknife]. Flanagin and Metzger's (2007) study showed that news is the most trusted source among the alternatives. We believe the reason is that news sources are controlled in several ways. First, they are analyzed before being added to the search engine's news set; second, the authors of the news sites are journalists hired by news organizations, who mostly do this work because of their expertise rather than as a hobby. Given that we did not have a chance to deploy CONITA as an application for health domain and terrorism domain users and collect enough data about real users' behavior, setting trust parameters for sample users such as students would not give results similar to those of real employees, since a user's behavior changes with the domain and the goals of the organization they work for. Besides, using news sources as the major corpus documents eliminated some of the credibility issues that might arise in the context of this research. Therefore, we only provide the theory behind assessing the credibility of documents and did not test it in our experiments.

5.11 Document Relevance

As can be seen from its architecture [Figure 3], CONITA is designed to return relevant documents based on both content-based and trust-based relevance measures. In Sections 5.9 and 5.10 we explained how content and trust relevancy can be calculated separately. In our experiments we do not test trust relevancy, and the document relevance is considered the same as the content relevance. However, if trust relevancy is added to the system, the overall relevance of a document can be calculated as follows:

RD = a*CR + b*TR   (12)

where RD is the relevance of the document, a and b are coefficients, and CR and TR are the content and trust relevance values calculated as described in the previous sections. The coefficients of the content and trust relevance are set based on the needs of the domain. While the value of b can be high for domains like terrorism, it might be small or negligible for domains in which the accuracy of the information is not crucial, and it might even be adjusted differently for different tasks within a domain.

Chapter 6
Experimental Results

6.1 Experiments by Using CONITA

The aim of this thesis is to introduce an IR system that customizes the returned query results based on users' context information. As presented in Chapter 5, CONITA is the system proposed to reach this goal. The aim of this chapter is to present the experimental results for CONITA and other IR systems in order to measure the success of the proposed system. One of the most challenging tasks in IR is measuring the success of a system. The major challenge is finding a data set that can be used by several IR systems and can then serve as the basis for comparison. Since the relevance of a document is relative to the user, the success of any system is measured based on the users' feedback. One of the common ways of measuring the success of an IR system is to use the pre-evaluated sets of documents provided for the TREC workshops. In these workshops the sets of documents are entered into different systems, and their success is rated by comparing the IR systems' results with user feedback. The data set used by CONITA is dynamic and crawled by currently available search engines (e.g., Google/Yahoo). Therefore there is no pre-evaluated set of documents against which to compare the results of CONITA. In an open environment in which the data set is dynamically changing, the success of a system is often measured by user evaluation [Kowalski, 1997; Griesbaum et al., 2002].
For example, Liu (2006) used 25 graduate Computer Science students over 3 weeks to measure the success of the Google, Yahoo and MSN search engines. Similarly, Long et al. (2007) recruited 270 participants from ten universities in Beijing to test three Chinese search engines. Following the above methods, in this thesis we used subject ratings to evaluate the success of CONITA. The rest of the chapter is organized as follows: Section 6.2 introduces the setup of the experiment. Section 6.3 explains the statistical information about the participants of the surveys and the document numbers for the Health and Terrorism domains. Section 6.4 discusses the experimental results measuring the success of CONITA. Section 6.5 summarizes the experimental results and discusses the findings.

6.2 Setup of the Experiment

Most IR systems are compared in two respects: precision and recall. Unlike working with a corpus of pre-collected documents, getting documents from the Web requires advanced crawling techniques, which are a research area in themselves. In this thesis, we use other crawlers to collect the data from the Web. Therefore, it is not relevant to measure the recall of CONITA. Instead, the aim of these experiments is to measure the precision of the system. Most experiments on IR systems, including the ones using TREC, base their measure of success on the first 10 documents returned for the query [Goa et al., 2001; Webber et al., 2008]. The main reason is that a user is usually interested in checking only the first few documents in the returned results to find an answer to his request. In the following sections we evaluate two sets of queries; the first set uses the Health domain and the second set uses the Terrorism domain. The results of each query are evaluated by a set of subjects. The following steps were used in the setup of the experiments:
• First, for a given description, a small survey is done to determine the query (pre-survey). In this process, we provided a one- or two-phrase description of what the user is interested in finding in the documents and asked 3-5 people to write a relevant query for the given description. The result is assessed in two phases: if the majority of the people listed the same query, that query is selected; otherwise the most common words stated in the participants' queries are selected and combined.
• Once the query corresponding to the description is determined, it is passed to the search engine (Yahoo/Google) to retrieve the corresponding documents. In addition to the original query, we send several queries to the search engine to collect results for the expanded set of terms of the query. For example, for the original query "eye disorder", the successive queries "cataract" and "glaucoma", which are subclasses of "eye disorder", are also passed to the search engine. Mainly, the Yahoo API is used to retrieve the documents for the queries. The Yahoo API has a limit of 1000 returned documents for each query; even in cases where it states that millions of documents were found, the upper limit of presented documents is 1000. Although we were not able to get permission to use the Google API, for experimental purposes we crawled the returned results of Google for a given query and used them in some of our experiments. Considering the successive queries of a given query with the expanded terms, the CONITA corpus has an upper limit of a couple of thousand documents.
However, in order to reduce the response time of the system, we used 100 documents as an upper bound, and CONITA's ranked results are returned from this list of 100 documents.
• The success of CONITA (precision in this thesis) is measured by passing its top-ranked results for user evaluation. Although that would show its individual success, it is also important to compare CONITA's success with the search engine used in forming the corpus. Therefore, we picked the first seven returned results from the search engine and from CONITA to pass to user evaluation. The reason for picking the first seven results instead of 10 is to reduce the evaluation time required by the user by about 30%.
• For each query, the first seven results of CONITA and the search engine are combined in a single survey that is passed to the user for evaluation. Since the results returned by the search engine change over time as new documents are added to the Web, we saved the results of the queries at a specific time, with the returned links to the documents, and provided these documents to the subject users.
The experiments presented in this thesis evaluate CONITA's success using the precision measure and do not address the recall measure. CONITA's evaluation is based on the relevancy judgments of the documents provided by user evaluation. First, it is not possible to pass all the corpus documents to user evaluation, which a recall measure would require. Second, recall is mainly used to measure the success of crawlers, and CONITA is not a crawler.

6.3 Metadata Evaluation of Surveys

Each survey passed to the user is designed to measure the success of a single query passed through CONITA and the search engine. The following information is captured by the metadata evaluation of the survey:
• The user's profession and highest degree
• Given the problem, the relevance of the selected query
• For each document, the degree of relevance of the document (Not Relevant, Somehow Relevant, Probably Relevant, Relevant)
• The confidence level of the user in answering the survey
The queries used in the surveys are chosen to compare different aspects of CONITA and the search engines. Overall, the queries are organized to test the success of CONITA in the following cases:
• CONITA in the Health and Terrorism domains
• CONITA with news vs. web (general search) articles
• CONITA-Google comparison
• CONITA-Yahoo comparison
• CONITA-Google-Yahoo comparison on the same queries

6.3.1 Survey Statistics in Health Domain

This set of surveys represents the information requests of a user in the Health domain. The problem descriptions (and, correspondingly, the queries) of this domain are selected to cover a wide variety of topics mentioned in the NCI ontology. Likewise, the survey participants are selected from medical school graduate students. In total, ten queries are tested in eight surveys, and one of the two interest ontologies is used to enhance each query. In six of the surveys CONITA is compared with one other search engine, and in the other two surveys the documents are combined to test the success of three systems: Google, Yahoo and CONITA [Table 4]. Four queries are tested in two different surveys for comparison purposes; two of these queries are used for comparing the documents retrieved from "news search" and "web search", and the other two are used to compare Yahoo and Google news results against CONITA.
Although these four queries are given to two different search engines or to different specializations of a single search engine (news, web), the returned results of these queries had very few or no overlaps. Thus, we consider each case as an independent comparison for measuring the success of CONITA.

# | DESCRIPTION | QUERY | COMPARISON
1 | Find the documents that provide you information about the health effects of air pollution on asthmatic people who are exercising | Air Pollution Exercise Asthma | CONITA / YAHOO-web
2 | Find the documents that provide you information about the health effects of air pollution on asthmatic people who are exercising | Air Pollution Exercise Asthma | CONITA / YAHOO-news
3 | Find the documents that provide you information about the children with brain tumor that had surgery | Brain Tumor Surgery Children | CONITA / YAHOO-news
4 | Find the documents that provide you information about the children with eye disorder that get blind | Eye Disorder Blind Children | CONITA / YAHOO-news
5 | Find the documents that provide you information about the children with eye disorder that get blind | Eye Disorder Blind Children | CONITA / GOOGLE-news
6 | Find the documents about the people who have had a heart attack from smoke and died | Heart Attack Smoke Died | CONITA / YAHOO-news
7 | Find the documents about the people who have had a heart attack from smoke and died | Heart Attack Smoke Died | CONITA / GOOGLE-news
8 | Find the documents that provide you information about the children who died from leukemia | Leukemia Child Died | CONITA / YAHOO-news
9 | Find the documents about the children who had a seizure and become unconscious | Children Seizure Unconscious | CONITA / GOOGLE-news
10 | Find the documents that talk about the unconsciousness of the children during seizure | Children Seizure Unconscious | CONITA / YAHOO-web

Table 4: Health domain queries and the IR systems they are tested with

On average, eight subjects participated in each survey and each evaluated about 12 documents. Although most users spent about five minutes, the average time spent on each survey is measured as 12 minutes, which means less than a minute was spent on the analysis of each document. The standard deviation of the participants' ratings in the Health domain surveys is calculated as 0.875 [Table 5].

# | Query | Comparison | #People | #Questions | Query Rating | Avg Time | STDEV
1 | Air Pollution Exercise Asthma | CONITA / YAHOO-web | 10 | 10 | 8.75 | 11 | 0.862
2 | Air Pollution Exercise Asthma | CONITA / YAHOO-news | 7 | 8 | 7 | 11 | 0.890
3 | Brain Tumor Surgery Children | CONITA / YAHOO-news | 9 | 12 | 8.9 | 18 | 0.851
4 | Eye Disorder Blind Children | CONITA / YAHOO / GOOGLE-news | 8 | 15 | 7.5 | 8.5 | 0.837
5 | Heart Attack Smoke Died | CONITA / YAHOO / GOOGLE-news | 8 | 16 | 8.75 | 15 | 0.843
6 | Leukemia Child Died | CONITA / YAHOO-news | 7 | 11 | 7.5 | 14 | 0.823
7 | Children Seizure Unconscious | CONITA / GOOGLE-news | 6 | 10 | 10 | 10 | 1.059
8 | Children Seizure Unconscious | CONITA / YAHOO-web | 8 | 12 | 6.7 | 7.25 | 0.836
Average | | | 7.875 | 11.75 | 8.137 | 11.843 | 0.875

Table 5: User and query statistics for Health domain surveys

6.3.2 Survey Statistics in Terrorism Domain

This set of surveys represents the information requests of a user in the Terrorism domain. The problem descriptions (and, correspondingly, the queries) of this domain are selected to cover a wide variety of topics mentioned in the Terrorism ontology. In two of the queries the interest ontology is used, whereas in the other five it is not. Since it was not possible to find Intelligence Analysts to participate in the surveys, the survey participants were selected from Computer Science graduate students.
In total, seven queries are tested in seven surveys. Two of the queries are used twice with different search engines (Google and Yahoo) for comparison purposes [Table 6].

# | DESCRIPTION | QUERY | COMPARISON
1 | Find the documents that talk about cyber attacks to the United States Military | Cyber Attack United States Military | CONITA / GOOGLE-news
2 | Find the documents that talk about cyber attacks to the United States Military | Cyber Attack United States Military | CONITA / YAHOO-news
3 | Find the documents that give you information about the girls kidnapped in Iraq by the terrorists | Iraq Girl Kidnapped Terrorist | CONITA / YAHOO-web
4 | Find the documents that give you information about the politics behind the PKK bombings | PKK Bombing Politics | CONITA / GOOGLE-web
5 | Find the documents that talk about hijacking a plane in the United States | Plane Hijack United States | CONITA / GOOGLE-news
6 | Find the documents that talk about hijacking a plane in the United States | Plane Hijack United States | CONITA / YAHOO-news
7 | Find the documents that give you information about the effects of terrorist attacks on the United States economy | Terrorist Attack United States Economy | CONITA / GOOGLE-web

Table 6: Terrorism domain queries and the IR systems they are tested with

On average, 9 subjects participated in each survey and each evaluated about 11 documents. Although most users spent about five minutes, the average time spent on each survey is measured as 9 minutes, which is less than a minute for the analysis of each document. The standard deviation of the participants' ratings in the Terrorism domain surveys is calculated as 0.890 [Table 7], which is very close to the standard deviation of the user evaluations in the Health domain.

# | Query | Comparison | #People | #Questions | Query Rating | Avg Time | STDEV
1 | Cyber Attack United States Military | CONITA / GOOGLE-news | 8 | 10 | 8.75 | 9.25 | 0.983
2 | Cyber Attack United States Military | CONITA / YAHOO-news | 7 | 12 | 10 | 8.25 | 1.076
3 | Iraq Girl Kidnapped Terrorist | CONITA / YAHOO-web | 12 | 12 | 8.3 | 9 | 0.875
4 | PKK Bombing Politics | CONITA / GOOGLE-web | 10 | 11 | 7.5 | 8.3 | 0.74
5 | Plane Hijack United States | CONITA / GOOGLE-news | 8 | 12 | 9.2 | 8 | 0.924
6 | Plane Hijack United States | CONITA / YAHOO-news | 13 | 10 | 8.5 | 11 | 0.818
7 | Terrorist Attack United States Economy | CONITA / GOOGLE-web | 9 | 11 | 8.5 | 12 | 0.817
Average | | | 9.571 | 11.14 | 8.678 | 9.4 | 0.890

Table 7: User and query statistics for Terrorism domain surveys

6.3.3 Summary of the Survey Statistics

Overall, the Health and Terrorism domains did not show many differences in the participants' behavior. On average each survey had 8-9 participants and consisted of 11 questions. Participants who spent less than 2 minutes on a survey were eliminated from the survey list. On average each participant spent a bit under a minute to evaluate each document, and the relevance of the used query for the given description is rated around 8.5. One of the interesting conclusions about rating the relevance of the query terms came from the evaluation of the query "Children Seizure Unconscious" [Table 4]. This query is used in two surveys to test the success of CONITA with corpora of documents returned from news search and from web search. While the participants of the survey with the document set from the news search rated the query terms as matching the description perfectly, for the same query and problem description, the participants of the survey with the document set from the web search rated the query terms as the worst in the Health domain surveys.
The survey results showed that, in general, users do not tend to spend more than a minute to evaluate a given document. Although that represents the real case for "Static Informational search", it does not necessarily represent the ideal evaluation for "Dynamic Informational search" [Chapter 5.1], which may have misled the evaluators in some cases. The reason is that users with a Static Informational need usually look for documents that solely address the topic they are searching for. In Dynamic Informational search, however, the main goal of the user is to obtain information that is new rather than definitional (mostly events); such information is not necessarily the whole theme of a document, can be embedded in different themes, and might be available only in a small section of the document. Therefore, considering the possibility of outliers, different evaluation techniques are used in evaluating the results in Section 6.4.

6.4 Evaluations of Queries

This section presents the evaluation results for the documents returned by each query. Among the evaluation methods used in measuring IR system performance [Section 2.1.5], Normalized Discounted Cumulative Gain (NDCG) is used to measure the precision of CONITA. One of the main reasons for using NDCG is that the P@n and MAP evaluation methods only handle cases with binary judgments, i.e., relevant and not relevant. NDCG is formulated to take into account multiple levels of relevance, in addition to the ranking of the documents, when calculating relevance. Liu et al. (2007) used NDCG scores to calculate the success of the queries used in the OHSUMED subset. They translated three levels of ratings, "irrelevant", "partially relevant" and "definitely relevant", into the numbers {0, 1, 2} to be able to use the values in the calculation of scores. Similarly, we used a four-point scale {0, 1, 2, 3} to represent the users' evaluations, given in the form "not relevant", "somehow relevant", "probably relevant" and "relevant", in the calculation of scores.
Because of the high standard deviation between the users' evaluations of the query documents, we used several variations of evaluation techniques in extracting the user-assigned relevance of the documents. First, the participants' overall judgment for each document is calculated and then passed to NDCG for evaluation. The process of extracting the overall user-assigned relevance of the documents can be done in the following ways (a small sketch of this scoring pipeline is given after the list):
• Mean-4PT: The mean of the survey participants' evaluations on the 4pt scale {0, 1, 2, 3} is taken for each document in the query and passed to NDCG.
• Mean-3PT: The four-point evaluations of the survey participants are first transferred to a 3pt scale {0, 1, 2} by assigning the "somehow relevant" and "probably relevant" judgments to the value 1, and the 3pt mean value is then passed to NDCG. The 3pt scale is used to combine the evaluation results of users who did not make a clear distinction between "somehow relevant" and "probably relevant".
• Mean-2PT: The original (4pt) evaluations of the survey participants are transferred to a 2pt scale {0, 1}. The "Not Relevant" and "Somehow Relevant" judgments are assigned to 0, while the "Probably Relevant" and "Relevant" judgments are assigned to 1. Accordingly, the mean of the participants' evaluations on the 2pt scale is passed to NDCG for evaluation. The 2pt scale is used to compare the user feedback with the traditional binary judgments "not relevant" and "relevant".
• Median-4PT: A high standard deviation in the users' evaluations is usually an indication of skewness in the data. For data sets with outliers, the median (the number in the middle of the set) is considered a more robust measure [Greisdorf & Spink, 2001]. Therefore, we also used the median value of the participants' evaluations (on the 4pt scale) to pass to NDCG for evaluation.
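The sketch below illustrates the pipeline just described: per-document ratings on the {0, 1, 2, 3} scale are aggregated with a mean or median, and the resulting gains are scored with NDCG@7. The log2 position discount is the common NDCG formulation, and the normalization here uses the ideal ordering of the same list, whereas the thesis normalizes against the seven most relevant documents chosen by the users from the whole evaluated set; the numbers it produces are therefore illustrative only.

import java.util.*;

/** Minimal sketch of the rating aggregation and NDCG@k scoring described above. */
public class NdcgEvaluation {

    static double mean(int[] ratings) {
        double s = 0;
        for (int r : ratings) s += r;
        return s / ratings.length;
    }

    static double median(int[] ratings) {
        int[] sorted = ratings.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** DCG@k with graded gains and a log2 position discount. */
    static double dcg(double[] gains, int k) {
        double dcg = 0;
        for (int i = 0; i < Math.min(k, gains.length); i++) {
            dcg += gains[i] / (Math.log(i + 2) / Math.log(2));   // positions 1..k
        }
        return dcg;
    }

    /** NDCG@k: DCG of the given ranking divided by the DCG of the ideal
     *  (descending) ordering of the same gains. */
    static double ndcg(double[] gainsInSystemOrder, int k) {
        double[] ideal = gainsInSystemOrder.clone();
        Arrays.sort(ideal);
        for (int i = 0; i < ideal.length / 2; i++) {             // reverse to descending
            double t = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = t;
        }
        double idealDcg = dcg(ideal, k);
        return idealDcg == 0 ? 0 : dcg(gainsInSystemOrder, k) / idealDcg;
    }

    public static void main(String[] args) {
        // Median-4PT gains for a hypothetical ranked list of seven documents.
        double[] gains = {3, 2, 0, 2, 1, 3, 0};
        System.out.printf("NDCG@7 = %.3f%n", ndcg(gains, 7));
    }
}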
Table 8 shows the NDCG evaluation results for the 10 queries obtained from the 8 Health domain surveys. For CONITA, when the Median-4PT user evaluation is used to assign the relevance measures of the documents, the success of CONITA on the given set of queries varies between 30% and 95%, with a mean average of 60%.

QUERY | CONITA'S SUCCESS (NDCG@7)
Air Pollution Exercise Asthma | 0.750
Air Pollution Exercise Asthma | 0.602
Brain Tumor Surgery Children | 0.565
Eye Disorder Blind Children | 0.519
Eye Disorder Blind Children | 0.575
Heart Attack Smoke Died | 0.317
Heart Attack Smoke Died | 0.626
Leukemia Child Died | 0.406
Children Seizure Unconscious | 0.663
Children Seizure Unconscious | 0.955
AVERAGE | 0.598

Table 8: Ranking performance of CONITA in the Health domain

In order to test the success of CONITA, Table 9 compares its success with that of the other search engine(s) used to retrieve the documents for its corpus. The comparison of the systems is done using the NDCG evaluation for the above-mentioned user-assigned relevance measures. In addition, the last three columns of the table show the number of documents considered "relevant" and "irrelevant" in a binary evaluation. The binary evaluation mentioned in the table is calculated using the mean of the users' evaluations, in which documents with an overall score of less than 0.5 are assigned to "irrelevant" while those with a score of 0.5 or more are assigned to "relevant". The column "User_R" shows the first seven most relevant documents found by the users in the given set of documents, whereas the columns "CON_R" and "Yahoo_R/Google_R" show the relevant documents found by CONITA and by Yahoo/Google in their first seven returned results (P@7).
In order to test the success of CONITA, Table 9 compares it with the search engine(s) used to retrieve the documents for its corpus. The comparison is done using the NDCG evaluation for each of the user-assigned relevance measures described above. In addition, the last three columns of the table show the number of documents considered "relevant" and "irrelevant" in a binary evaluation. This binary evaluation is calculated from the mean of the users' judgments: documents with an overall score of less than 0.5 are assigned to "irrelevant", while those with a score of 0.5 or more are assigned to "relevant". The column "User_R" shows the number of relevant documents among the seven documents the users judged most relevant in the given set, while the columns "CON_R" and "Yahoo_R/Google_R" show the number of relevant documents found by CONITA and by Yahoo/Google within their first seven returned results (P@7).

Table 9: Success of CONITA in Health domain compared to (Google/Yahoo) using the NDCG evaluation measure

QUERY                          COMPARISON            4PT-Mean  3PT-Mean  2PT-Mean  4PT-Median  User_R  CON_R  Yahoo_R/Google_R
Air Pollution Exercise Asthma  CONITA / YAHOO-web     0.095     0.007     0.054     0.086       7       5      5
Air Pollution Exercise Asthma  CONITA / YAHOO-news    0.245     0.115     0.162     0.235       3       1      0
Brain Tumor Surgery Children   CONITA / YAHOO-news   -0.023    -0.053     0.093    -0.217       5       4      3
Eye Disorder Blind Children    CONITA / YAHOO-news    0.034     0.010     0.104    -0.024       3       1      1
Eye Disorder Blind Children    CONITA / GOOGLE-news   0.083     0.064     0.065    -0.017       3       1      2
Heart Attack Smoke died        CONITA / YAHOO-news    0.423     0.473     0.403     0.172       1       0      0
Heart Attack Smoke died        CONITA / GOOGLE-news  -0.010    -0.022     0.041     0.012       1       1      1
Leukemia Child died            CONITA / YAHOO-news    0.237     0.118     0.269     0.277       3       2      1
Children Seizure Unconscious   CONITA / GOOGLE-news   0.150     0.175     0.091     0.145       6       3      2
Children Seizure Unconscious   CONITA / YAHOO-web     0.299     0.212     0.299     0.458       5       4      2
AVERAGE                                                0.153     0.110     0.158     0.113

The overall success of CONITA in the Health domain for the first seven documents of the given queries, compared to the other IR systems, is between 11% and 15.8% depending on the user-assigned relevance measure. Although CONITA's success is calculated for the different user-assigned relevance measures [Table 9], since the original surveys were conducted on the 4-point scale, we use the 4-point scale as the main comparison scale in the rest of the experiments. The difference between the 4-point Mean and Median in Table 9 is about 4% (11.3%, 15.8%). Although this is not a large difference, since the median is accepted as a more robust measure than the mean for data sets with skewed information [Greisdorf & Spink, 2001], in the rest of the discussion the main success of CONITA is measured in terms of the Median.

As a result of the evaluation of the 10 queries, CONITA fell behind the compared search engine's results in three queries and obtained only a negligible improvement in a fourth. Three of these four queries showed a difference of around 2%, which is negligibly small. However, in one query, "Brain Tumor Surgery Children", the compared search engine performed 21% better than CONITA. We investigated the reasons behind CONITA's failure on this query and found that the document CONITA ranked first was judged less relevant than the one ranked first by Yahoo, which caused the dramatic difference in the results. As explained in Chapter 2, the NDCG evaluation takes into account both the judgment value and its rank in the document set; therefore, a misjudged document at a high rank lowers the NDCG value much more than the same misjudgment at a lower rank.

The normalization of the Discounted Cumulative Gain (DCG) is calculated by dividing the DCG by the DCG of the seven most relevant documents chosen by the users. This normalization is based on the assumption that the survey participants evaluate all possible documents in the given set and choose the seven best documents for the given query. In reality, however, only 10-15 documents are presented to the survey participants, and there might be other documents in the corpus that match the user's request but were not selected in the top-seven list of any search engine or of CONITA and therefore were not passed to the survey for evaluation. Considering this case, we also assumed a perfect match in which all participants of the survey rate at least seven of the same documents as "relevant" (i.e., assuming, in theory, that there were perfect documents that CONITA did not select). This gives the worst-case success of CONITA compared to the other search engines, which averages 1.6%-2.3%, with 2.3% for the Median measure.

The users found a total of 37 "relevant" documents among the first seven most relevant documents they selected for the 10 Health domain queries. Accordingly, the success of CONITA and of the search engine (Yahoo/Google) in finding relevant documents within their first seven returned results (P@7, binary scale) is 22 (60%) and 17 (46%), respectively.
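To illustrate this binary P@7 measure, the sketch below (hypothetical Python with invented judgments; it assumes, for illustration, that the 0.5 threshold is applied to the mean of the 2-point judgments) binarizes each document's user judgments and counts how many of a system's first seven results are relevant.

from statistics import mean

def is_relevant(judgments_2pt):
    # A document counts as "relevant" when the mean of the participants'
    # binary judgments (0 = not relevant, 1 = relevant) is at least 0.5.
    return mean(judgments_2pt) >= 0.5

def relevant_in_top_k(ranked_doc_judgments, k=7):
    # ranked_doc_judgments: one list of 2-point judgments per document,
    # in the order the system returned the documents.
    return sum(is_relevant(j) for j in ranked_doc_judgments[:k])

# Hypothetical judgments for the first seven documents of one query.
conita_top7 = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [1, 1, 1],
               [0, 0, 0], [1, 0, 0], [1, 1, 0]]
print(relevant_in_top_k(conita_top7))   # relevant documents in the top 7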
Similarly, Table 10 shows the NDCG evaluation results for the seven queries obtained from the seven Terrorism domain surveys, again using the Median-4PT user evaluations to assign the relevance measures of the documents. CONITA's success on this set of queries ranges between 61% and 88%, with a mean of 74%.

Table 10: Ranking Performance of CONITA in Terrorism domain

QUERY                                   CONITA'S SUCCESS (NDCG@7)
Cyber Attack United States Military     0.613
Cyber Attack United States Military     0.762
Iraq Girl Kidnapped Terrorist           0.771
PKK Bombing Politics                    0.881
Plane Hijack United States              0.757
Plane Hijack United States              0.616
Terrorist Attack United States Economy  0.807
AVERAGE                                 0.744

As in the Health domain, the success of CONITA is compared to the search engine(s) used to retrieve the documents for its corpus [Table 11]. The overall success of CONITA compared to the other search engine(s) in the Terrorism domain at position seven is 27.2% in the Median calculation.

Table 11: Success of CONITA in Terrorism domain compared to (Google/Yahoo) using the NDCG evaluation measure

QUERY                                   COMPARISON            4PT-Mean  3PT-Mean  2PT-Mean  4PT-Median  User_R  CON_R  Yahoo_R/Google_R
Cyber Attack United States Military     CONITA / GOOGLE-news   0.010     0.043    -0.019     0.101       7       4      5
Cyber Attack United States Military     CONITA / YAHOO-news    0.301     0.288     0.134     0.300       7       5      4
Iraq Girl Kidnapped Terrorist           CONITA / YAHOO-web     0.439     0.375     0.396     0.408       5       4      2
PKK Bombing Politics                    CONITA / GOOGLE-web    0.215     0.075     0.328     0.224       7       6      4
Plane Hijack United States              CONITA / GOOGLE-news   0.232     0.223     0.229     0.331       3       3      1
Plane Hijack United States              CONITA / YAHOO-news    0.314     0.388     0.430     0.390       3       3      1
Terrorist Attack United States Economy  CONITA / GOOGLE-web    0.346     0.343     0.270     0.148       7       6      4
AVERAGE                                                         0.265     0.248     0.253     0.272

The normalization of the DCG in the Terrorism domain is also calculated based on the users' evaluations. In the worst-case scenario explained for the Health domain (all participants rating at least seven of the same documents as relevant for a given query), the success rate of CONITA compared to the other search engines is 6.3% in the Median calculation.

The users found a total of 39 "relevant" documents among the first seven most relevant documents they selected for the seven Terrorism domain queries. Accordingly, the success of CONITA and of the search engine (Yahoo/Google) in finding relevant documents within their first seven returned results (P@7, binary scale) is 31 (79.5%) and 21 (53.8%), respectively.

As mentioned in Section 6.2, the surveys are designed to test several cases. One of these cases is the search type (news search or web search) used to collect the documents for CONITA's corpus. When the results of the Health and Terrorism domains are combined, the success of CONITA relative to the baseline is 26.5% for the five queries whose corpus was collected through "Web search" and 14.2% for the 12 queries collected through "News search". This difference most probably comes from the fact that three of the five "Web search" queries were from the Terrorism domain, which returned a high success rate, whereas only four of the 12 "News search" queries were from the Terrorism domain. Two queries were tested with both search types, but the results were inconclusive, and we believe that more experiments are needed to compare this case.

In addition, CONITA is also compared to Yahoo and Google for its overall success. Out of the 17 queries in total, Google was used to retrieve the documents for seven queries; on these seven queries, the success of CONITA compared to Google is 13.5% by Median. For the 10 queries for which Yahoo was used to crawl the documents, CONITA's success is 20.8% [Table 12].
Overall, four queries are tested on both Google and Yahoo, with an overlap of about 10% between the documents returned by the two engines. For these four queries, the success of CONITA compared to Yahoo and to Google is measured as 21% and 10.7%, respectively [Table 12].

Table 12: Success of CONITA compared to Google and Yahoo on the data set shared by all systems (Query set 1), and compared to Yahoo and Google across all the surveys (Overall)

                             4pt-Mean  3pt-Mean  2pt-Mean  4pt-Median
Yahoo-CONITA Query set 1      0.268     0.289     0.268     0.210
Google-CONITA Query set 1     0.079     0.077     0.079     0.107
Yahoo-CONITA Overall          0.236     0.193     0.234     0.208
Google-CONITA Overall         0.147     0.129     0.144     0.135

6.5 Summary of Evaluation Results

In this thesis, 17 queries are tested through 15 surveys. The evaluation results show that CONITA's average success is 60% in the Health domain and 74% in the Terrorism domain. When CONITA's success is compared to the search engine (Google/Yahoo) used to crawl the data for CONITA's corpus, CONITA performed 11% (Health) and 27% (Terrorism) better than the other search engine(s) overall.

The different success rates between the domains mainly come from the fact that the Health domain had a less relevant set of documents than the Terrorism domain. Participants of the Health domain surveys rated most of the documents negatively (i.e., most of the values tended to lie between 0 and 1). The DCG evaluation method scores documents based on their rank and on a gain that grows exponentially with the relevance value given to the document. Since the users' document evaluations lay in a small interval, the differences between the evaluation results correspondingly stayed small.

Our observation while working with the "News search" was that news search in both Google and Yahoo is sensitive to time and does not include documents older than a month. In some cases, this restriction limited the number of documents that could be retrieved for CONITA's corpus and restricted us to certain queries, since not all queries returned results. Collecting data through Web search did not have the scarcity problem of News search, but the documents retrieved by "Web search" tended to be of the "static information" type rather than the "dynamic information" type that is the main interest of CONITA's users. Overall, Google showed about a 10% better success rate than Yahoo, and CONITA outperformed Google by 13.5%.

The overall results of the 15 surveys showed that using semantic information extracted from lexical and domain ontologies in the context of the domain users, and keeping the contextual structure of the extracted information when mapping it into the documents, noticeably improves the performance of Information Retrieval Systems.

Chapter 7: Conclusion and Future Work

7.1 Conclusion and Contributions

In this thesis we presented a technique to improve IR performance for the information requests of users in a given domain. The main goal of this research is to use the context information of the user and the user's request to customize the IR results to meet the user's needs. To achieve this goal, as a key contribution of this thesis, we developed a system called CONITA, which ranks the relevant documents extracted from a dynamically changing data set such as the Web to satisfy the user's need. CONITA is not a crawler; it uses other crawlers to build its dynamically changing corpus.
The relevance of documents in CONITA is measured in two parts: Content Relevance, which covers the semantic analysis of the documents with respect to the domain and interest information obtained from the query and the user, and Trust Relevance, which is the credibility assigned to the documents, extracted from user experience. Although both relevance measures are explained in the thesis, because of the user- and domain-dependent, multi-variable structure of trust relevance, only the content relevance of the documents is tested in the experiments.

In CONITA, the context information of the user's request is obtained by using WordNet and domain ontologies. Further, interest ontologies are used to capture the focus of the user's search within the domain. We introduced a categorization technique to obtain the context elements of the user's query and proposed several algorithms to find the relations between the obtained context elements.

Two domains, Health and Terrorism, are used to test the performance of CONITA. In total, the documents of 17 queries, 10 from the Health domain and seven from the Terrorism domain, are evaluated. In order to catch outliers in the data collected from the evaluators, CONITA's success is measured with several evaluation measures and compared against the Google and Yahoo IR systems. Overall, CONITA's performance is measured as 60% and 74% in the Health and Terrorism domains, respectively. When compared to the other IR systems (Google/Yahoo), it achieved an 11% (Health) and 27% (Terrorism) improvement. Of the two compared IR systems on these document sets, Google performed 10% better than Yahoo, and CONITA outperformed Google by 13%. Overall, the experiments verified that by placing the semantic information extracted from the user and the user's request in context, we are able to return more relevant documents in response to the user's need.

The results also showed that, among the documents presented to the users for evaluation, the Health domain had more irrelevant results than the Terrorism domain. Therefore, CONITA's corpus for the Health domain queries did not contain enough relevant documents to fully demonstrate its performance. This is one of the limitations faced in this thesis: since CONITA is not a crawler, it can only rank the documents returned by other crawlers.

The specific contributions presented in this thesis include:

• Presented and implemented CONITA, an Information Retrieval System that returns the most relevant results in the data set for the domain user's requests.

• With the help of lexical, domain, and interest ontologies, provided a system that is capable of extracting the semantic information behind the user's information request.

• Introduced the concept of an "interest ontology" to represent the focus of the user within the domain ontology and better extract the user's information need.

• Introduced semantic and syntactic mapping algorithms between the query terms and the domain ontology to eliminate information loss when passing query terms to the domain.

• Used the context information in mapping the query words to the documents and in calculating term proximity, to better respond to the user's information need.

• Presented an IR system that is capable of receiving a domain-independent Web document and analyzing its content for the domain users. This expands the view of IR by providing the capability of analyzing not only domain-specific documents but any document.

• Built a terrorism ontology that can be used for further research projects.
• Collected, for evaluation purposes, around 200 documents evaluated by 130 subjects; this is a valuable data set that can be used for further evaluations.

• Provided a novel approach to customize the trust measure of Web documents for domain users. Currently there is no system that explicitly uses a trust measure in an open Web environment to measure document credibility.

7.2 Future Work

The CONITA framework introduced in this thesis consists of multiple components, and its success can be increased in multiple directions. We categorize the future research directions as Ontologies, Presentation, and Coverage.

Ontology Learning: The success of CONITA depends heavily on the availability of a well-built domain ontology that provides detailed semantic information about the domain. Although not impossible, finding detailed domain ontologies is challenging. Several organizations, such as Cyc and RAND, provide professionally built general or domain-specific ontologies, but they too offer this information only for some domains. Therefore, the best solution for obtaining a good-quality domain ontology would be to integrate an ontology learning tool [Biemann, 2005] into the CONITA framework so that new information extracted from the documents can be added to the ontology automatically. Although ontology extraction from unstructured data has not yet reached the success of extraction from structured data, it is an active research area [Hearst, 1992; Cimiano et al., 2004; Dellschaft, 2005] that should be followed to increase the success of CONITA.

Document Presentation: CONITA is designed to rank the documents in descending order of their relevance to the user's request. In this thesis we described different ways of analyzing the mapped words in the documents, such as Category Degree, CMinSpan, Pair-wise Proximity, and Degree of Interest. The ranking in the experiments is done using the assigned Category Degree and CMinSpan values of the documents. A useful extension would be to feed the Pair-wise Proximity and Interest Degree obtained for each document, in addition to CMinSpan and Category Degree, into a machine learning tool to investigate user behavior with respect to these four parameters.

Besides ranking, there are other factors affecting users' decisions when selecting documents. Our assumption in this thesis was that the user would consider the documents from the ranked list in order. However, when it comes to presenting the results to the user, summarization techniques [Amini, 2005] also play an important role. Although it is a different area of research, we believe that more analysis can be done on the presentation of the content-relevant documents through summarization. It is also important to eliminate duplicate documents in the ranked set: retrieving duplicates in response to a user's query clearly lowers the number of valid responses provided to the user. Therefore, documents containing roughly the same semantic content should be detected. Work on measuring document similarity (Sarawagi and Kirpal, 2004; Chaudhuri et al., 2006) can be applied to the CONITA corpus to increase the performance of the system.

Coverage: The Web consists of a wide variety of information in different formats. In this thesis we used documents in HTML format. However, there are documents on the Web in other formats, such as PDF and DOC.
The parsing module of CONITA can be extended to analyze these other document types on the Web.

References

Alpert, J. & Hajaj, N. (2008). 'We knew the web was big...'. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, accessed on 29/08/2008.
Amini, M., Usunier, N., Gallinari, P., 2005. Automatic text summarization based on word-clusters and ranking algorithms. In: Proc. European Conf. on Information Retrieval Research, pp. 142-156.
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison Wesley.
Baldauf, M., and Dustdar, S., A survey on context-aware systems. Technical Report TUV-1841-2004-24, Technical University of Vienna, 2004.
Belew, R. K. (2000). Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press.
Biemann, C.: Ontology learning from text: A survey of methods. LDV Forum 20(2) (2005).
Brickley, D., and Miller, L. (2004). FOAF Vocabulary Specification 0.1, http://xmlns.com/foaf/0.1/
Carolyn J. Crouch. An approach to the automatic construction of global thesauri. Information Processing and Management, Vol. 26, No. 5, pp. 629-640, 1990.
Chandrasekaran, B., 1986, Generic tasks in knowledge-based reasoning: high level building blocks for expert system design. IEEE Expert, 1 (3), 23-30.
Chaudhuri, S., Ganti, V., and Kaushik, R., A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
Chang, C.C.K., and Garcia-Molina, H., Conjunctive constraint mapping for data translation. In: Third ACM Conference on Digital Libraries, Pittsburgh, USA (1998).
Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J. and Widom, J., The TSIMMIS project: Integration of heterogeneous information sources. In: IPSJ Conference, Tokyo, Japan (1994).
Chen, H., Schatz, B., Yim, T., and Fye, D., Automatic thesaurus generation for an electronic community system. Journal of American Society for Information Science, Vol. 46, No. 3, pp. 175-193, 1995.
Cimiano, P., Hotho, A., Staab, S.: Comparing conceptual, divisive and agglomerative clustering for learning taxonomies from text. In: Proceedings of the European Conference on Artificial Intelligence (ECAI). (2004) 435–443.
Clarke, C.L.A., Cormack, G.V. and Tudhope, E.A., Relevance ranking for one to three term queries, Information Processing and Management 36(2) (2000) 291–311.
Clarke, R., Principal, Xamax Consultancy Pty Ltd, Canberra, Version of 1 October 2001.
Corritore, C.L. Int J. Human-Computer Studies (2003).
Dellschaft, K.: Measuring the similarity of concept hierarchies and its influence on the evaluation of learning procedures. Master's thesis, Universität Koblenz-Landau, Campus Koblenz, Fachbereich 4 Informatik, Institut für Computervisualistik (2005).
Douglas B. L.: Ontological Versus Knowledge Engineering. IEEE Trans. Knowl. Data Eng. 1(1): 84-88 (1989).
Dumais, S., Platt, J., Heckerman, D., and Sahami, M., Inductive Learning Algorithms and Representations For Text Categorization, Proceedings of ACM-CIKM'98, 1998.
Edward A. Fox. Lexical relations enhancing effectiveness of information retrieval systems. SIGIR Forum, Vol. 15, No. 3, pp. 6-36, 1980.
Fellbaum, C., ed.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, USA (1998).
Flanagin, A. J., & Metzger, M. J. (2007). The role of site features, user attributes, and information verification behaviours on the perceived credibility of web-based information. New Media Society, 9, 319–342.
Furnas, G., et al, The Vocabulary Problem in Human-System Communication, Communications of the ACM, 1987, 30(11), pp. 964971. Fogg, B. J., Soohoo, C., Danielson, D., Marable, L., Stanford, J., & Tauber, E. R. (2002). How do people evaluate a Web site’s credibility: Results from a large study. Retrieved 2 Oct 2008 from http://www.consumerwebwatch.org/dynamic/web-credibility-reports- evaluate-abstract.cfm. Gao, J., Walker, S., Robertson, S., Cao, G., He, H., Zhang, M. & Nie, J-Y (2001). "TREC-10 Web Track Experiments at MSRA". In: NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001). Gaithersburg, MD: National Institute of Standards and Technology. 124 G´omez-P´erez, A., Juristo, N. & Pazos, J. (1995), Evaluation and Assessement of the Knowledge Sharing Technology, in N. Mars, ed., ‘Towards Very Large Knowledge Bases’, IOS Press, pp. 289–296. Greisdorf, H., & Spink, A., 2001. Median measure: An approach to IR systems evaluation. Information Processing and Management. Griesbaum, J., Rittberger, M. & Bekavac, B. (2002). Deutsche Suchmaschinen im Vergleich: AltaVista.de, Fireball.de, Google.de und Lycos.de. In R. Hammwöhner, C. Wolff, and C. Womser-Hacker (Eds.); Gruber, T.R., 1993, A translation approach to portable ontologies. Knowledge Acquisition, 5 (2), 199-220. Gruber, T.R., . Towards Principles for the Design of Ontologies Used for Knowledge Sharing. International Workshop on Formal Ontology, Padova, Italy, 1993. Grunewald, L., McNutt, G., and Mercier, A., Using an ontology to improve search in a terrorism database system. Proceedings of the 14th Interna- tional Workshop on Database and Expert System Applications (DEXA'03), 2003. Gruninger, M. (1996), Designing and Evaluating Generic Ontologies, in ‘ECAI96’s workshop on Ontological Engineering’. Gulati, R. (1995). "Does Familiarity Breed Trust? The Implications of Repeated Ties for Contractual Choice in Alliances," Academy of Management Journal, Vol. 38, No. 1(February), pp. 85-112. Guha, R.V., Contexts: a Formalization and Some Applications. Ph.D. Thesis, Stanford University (1991). Gupta, D.K., ‘‘Exploring Roots of Terrorism,’’ in Tore Bjørgo, ed., Root Causes of Terrorism (London: Routledge, 2005), 61. Guarino, N. 1997. Semantic Matching: Formal Ontological Distinctions for Information Organization, Extraction, and Integration. In M. T. Pazienza (ed.) InformationExtraction: A Multidisciplinary Approach to an Emerging Information Technology. Springer Verlag: 139-170 Guarino, N., 1998, Formal Ontology and Information Systems. In Formal Ontology in Information Systems (FOIS’98) in Trento, Italy, edited by Guarino, N., IOS Press Amsterdam, 3-15. 125 Hawking, D., and P. Thistlewaite, Proximity operators so near and yet so far. In: D.K. Harman (ed.), Proceedings of the 4th Text Retrieval Conference (TREC-4) (Gaithersburg, MD, USA, 1995) 131–43. Hawking, D., and Thistlewaite, P., Relevance Weighting Using Distance between Term Occurrences (Unpublished manuscript, Joint Computer Science Technical Report Series, TR-CS-96–08, The Australian National University, 1996). Hearst, M.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th International Conference on Computational Linguistics. (1992) Hollink, L., Schreiber, G., Wielinga, B.: Patterns of semantic relations to improve image content search. Journal of Web Semantics 5(3) (2007) 195{203 Houghton, B., Understanding the terrorism database. National Memorial Institute for Prevention of Terrorism Quarterly Bulletin, 2002. 
HTML Parser. http://htmlparser.sourceforge.net/ Jarvelin, K., & Kekalainen, J. (2000). IR evaluation methods for retrieving highly relevant documents. Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, p.41-48, July 24-28, 2000, Athens, Greece Jarvelin, K., & Kekalainen, J. Cumulated Gain-Based Evaluation of IR Techniques, ACM Transactions onInformation Systems, 2002. Jiang and Conrath, D., . 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings on International Conference on Research in Computational Linguistics, pages 19–33, Taiwan. Jones, D.M., Bench-Capon, T.J.M., and Visser P.R.S., Methodologies for Ontology Development. In Proceedings of 15th IFIP World Computer Congress, London, 1998. Jøsang, A., Hird, S., Faccer, E., Simulating the effect of reputation systems on e-markets, in: C. Nikolau (Ed), Proceedings of the First International Conference on Trust Management, Crete, May, 2003 Kifer, M. G. Lausen and J.Wu, Logical foundations of object-oriented and frame- basedlanguages, Journal of the ACM, 42 (1995) 741-843. Kowalski, G. (1997). Information retrieval systems, theory and implementation. Boston, MA: Kluwer Academic Publishers 126 Krebs, V.E., Mapping networks of terrorist cells. Connections, 2001 Kwok, K. L. (1984). A document-document similarity measure based on cited titles and probability theory, and its application to relevance feedback retrieval. In ‘SIGIR ’84: Proceedings of the 7th annual international ACM SIGIR conference on Research and development in information retrieval’. British Computer Society. pp. 221–231. Leacock and Chodorow, M., 1998. Combining local context andWordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 265–283. MIT Press. Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, August. Liu, S., Liu, F., Yu, C.T., and Meng, W.: An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2004) 266-272. Liu, B.,Personal Evaluations of Search Engines: Google, Yahoo! and MSN. 2006 Department of Computer Science University of Illinois at Chicago Liu T., Xu J., Qin T., Xiong W., Li H. LETOR:Benchmark Dataset for Research on Learning to Rank for Information Retrieval, 2007. Long, H., Lv, B., Zhao, T., & Liu, Y. (2007). Evaluate and compare Chinese internet search engines based on users’ experience. In Proceedings of IEEE wireless communications, networking and mobile computing conference (WiCom 2007) Luhn, H. (1957). A statistical approach to mechanized encoding and searching of literary information.IBM Journal of Research and Development. Mandala, R., Takenobu, T., and Hozumi, T., "The Use of WordNet in Information Retrieval," in Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, 1998, 31-37. Mannes, A. & Golbeck, J. Ontology Building: A Terrorism Specialist's Perspective. Aerospace Conference, 2007 IEEE, 2007, 1-5. Marie K., Iding, E., Martha, E., Crosby, E. B. A., Barbara, K., Web site credibility: Why do people believe what they believe? 
Instr Sci (2009) 127 Mena E., Illarramendi, A., Kashyap, V., and Sheth, A.P., OBSERVER: an approach for query processing in global information systems based on interoperation across pre- existing ontologies. In: Proceedings of the First IFCIS International Conference on Cooperative Information Systems (CoopIS’96), Brussels, Belgium (19–21 June 1996). Michelizzi, J., Semantic Relatedness Applied to All Words Sense Disambiguation – Master of Science Thesis, Department of Computer Science, University of Minnesota, Duluth, July, 2005 Moldovan, D.I. and Mihalcea, R.: Using WordNet and Lexical Operators to Improve Internet Searches. IEEE Internet Computing 4(1) (2000) 34-43. Musen, M. A., R. W. Fergerson, W. E. Grosso, M. Crubezy, H. Eriksson, N. F. Noy and S. W. Tu, The Evolution of Protégé: An Environment for Knowledge-Based Systems Development, International Journal of Human-Computer Interaction, 58(1), pp. 89-123, 2003. Nagypal, G.: Improving Information Retrieval Effectiveness by Using Domain Knowledge Navigli, Roberto and Velardi, Paola: An Analysis of Ontology-based Query Expansion Strategies. In: Proceedings of Workshop on Adaptive Text Extraction and Mining at the 14th European Conference on Machine Learning (2003). Newell, A. (1982), ‘The Knowledge Level’, Artificial Intelligence 18, 87–127 Niles, I., and Pease, A., (2001) “Origins of the IEEE Standard Upper Ontology,” Working Notes of the IJCAI-2001 Workshop on the IEEE Standard Upper Ontology, Seattle, 37- 42. Patwardhan, S., Banerjee, S., Pedersen, T.: Using measures of semantic relatedness for word sense disambiguation. In: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics. (2003) Pedersen, T., Patwardhan, S.,and Michelizzi, J., 2004. WordNet::similarity— measuring the relatedness of concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), pages 1024–1025. RAND Corporation. Purpose and description of information found in the incident databases, 2003. http://www.tkb.org/RandSummary.jsp. Rasolofo, Y., and Savoy, J., Term proximity scoring for keyword-based retrieval systems. In: Sebastiani, F., (ed.), Proceedings of the 25th European Conference on Information Retrieval Research, Pisa, Italy, 2003 (Springer, Berlin, 2003), 207–18. 128 Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448–453, Montreal, August. Richardson R. and Smeaton, A.F., 1995. Using WordNet in a knowledge-based approach to information retrieval. Technical Report CA-0395, School of Computer Applications, Dublin City University. Robertson, S. E., & Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of theAmerican Society for Information Science, 27, 129-146. Rocchio, J. J., Jr. (1966). Document retreival system -- optimization and evaluation, DoctoralDissertation, Havard University. In Report ISR-10, to the National Science Foundation,Havard Computational Laboratory, Cambridge, MA. Rocchio, J. J. (1971). Relevance feedback in information retrieval. In G. Salton, ed., ‘The Redefining Web-of-Trust: reputation, recommendations, responsibility and trust among peers Victor S. Grishchenko, 2001 Salton, G. (1971). The SMART Retrieval System - Experments in Automatic Document Processing.Englewood Cliffs, NJ: Prentice-Hall, Inc. Sarawagi, S., and Kirpal, A., Efficient set joins on similarity predicates. In SIGMOD, 2004. 
Smart Retrieval system – Experiments in Automatic Document Processing’. Prentice- Hall.Englewood Cliff. NJ. Smeaton, A.F., and Berrut, C., Thresholding postings lists, query expansion by word- word distances and POS tagging of Spanish text. In Proceedings of The Fourth Text Retrieval Conference, 1996 Smith B. L., and Damphousse K.R. The american terrorism study: Indict- ment database, 2002. Smith, M. K.,Welty, C., Deborah L. McGuinness (2004-02-10). "OWL Web Ontology Language Guide Sparck-Jones, K. & van Rijsbergen, C. J. (1975). Report on the need for and provision of an “ideal” judgements retrieval test collection. Technical Report 5266. British Library Research and Development Report 129 Stanford, J., Tauber, E. R., Fogg, B. J., & Marable, L. (2002). Experts vs. online consumers: A comparative credibility study of health and finance web sites. Retrieved 2 Oct 2008 from http://www.consumer webwatch.org/dynamic/web-credibility-reports-experts-vs-online-abstract.cfm. Stevens R., Building an Ontology. July 2001. http://www.cs.man.ac.uk/~stevensr/onto/node12.html#building Sugiyama, K., Hatano, K., and Yoshikawa, M., Adaptive web search based on user profile constructed without any effort from users. In Proceedings of WWW 2004, 2004. Sure, Y., Staab, S., Erdmann, M., Angele, J., Studer R., and Wenke, D., OntoEdit: Collaborative ontology development for the semantic web, Proc. of ISWC2002, (2002) 221-235. Syeda-Mahmood, T., Shah, G., Akkiraju, R., Ivan, A., Goodwin, R.: Searching service repositories by combining semantic and ontological matching. In: ICWS. (2005) Tudhope, D., Alani, H., Jones, C.: Augmenting thesaurus relationships: Possibilities for retrieval. Journal of Digital Information 1(8) (2001) TERROGATE-information retrieval technology for the terrorism domain, Defence R&D Canada – Valcartier, 2006 Tomassen, S.L., Gulla, J.A., Strasunskas, D.: Document Space Adapted Ontology: Application in Query Enrichment. In: Kop, C., Fliedl, G., Mayer, H.C., Métais, E. (ed.): 11th International Conference on Applications of Natural Language to Information Systems (NLDB 2006) US Environmental Protection Agency, http://www.epa.gov/air/urbanair/, 2009. Uschold, M., and Gruninger, M.,. Ontologies: Principles, Methods and Applications. University of Toronto, Knowledge Engineering Review Volume 11 Number 2, June 1996. Ussery B., 2008, http://www.beussery.com/blog/index.php/2008/02/google-average- number-of-words-per-query-have-increased/ Van Rijsbergen, C. (1979). Information Retrieval, 2nd edition. Butterworths, London Visser, P.R.S., and Cui, Z., On accepting heterogeneous ontologies in distributed architectures. In: Proceedings of the ECAI98 Workshop on Applications of Ontologies and Problem-Solving Methods. Brighton, UK (1998). 130 Visser, P.R.S., and Tamma V.A.M., An experience with ontology clustering for information integration. In: Proceedings of the IJCAI-99 Workshop on Intelligent Information Integration in conjunction with the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden (31 July 1999). Voorhees, E.M.,. Query expansion using lexical-semantic relations. In Proceedings of the 17th ACM-SIGIR Conference, pp. 61{69, 1994} Voorhees, E.M., 1994. Query expansion using lexical-semantic relations. In Proceedings of the 17th ACM-SIGIR Conference, pages 61- 69. Voorhees, E.M., and Harman, D., 1997. Overview of the fifth text retrieval conference (trec- 5). In Proceedings of the Fifth Text Retrieval Conference, pages 1-28. NIST Special Publication 500-238. 
Voorhees, E. (2008). Common Evaluation Measures. In ‘Proceedings of the 16th Text Retrieval Conference (TREC-2007)’. Vol. 500-274 of NIST Special Publication. 2.5.1 W3C OWL Web Ontology Language Overview http://www.w3.org/TR/owl-features/ Weinstein P.C. and Birmingham P., Creating ontological metadata for digital library content and services, International Journal on Digital Libraries 2(1) (1998) 19–36. William, W., Alistair, M. , Justin, Z., Tetsuya, S., Precision-at-ten considered redundant, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore Witten, I.H., Moffat, A., & Bell, T.C. (1994). Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold. WNSTATS(7WN) manual page http://WordNet.princeton.edu/man/wnstats.7WN Wu and Palmer, M., 1994. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, Las Cruces, New Mexico. Xu, J. & Croft, W. B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Trans. Inf. Syst. 18(1), 79–112. Robertson, S. E. (1990). On term selection for query expansion. J. Doc. 46(4), 359–364. Yang, K, (2002).Combining Text-, Link-, and Classigication-based Retrieval Methods to Enhance information Discovery of the web. Doctoral Dissertation. Zhang, Z., Ontology query languages for the semantic web: A performance evaluation. Master’s thesis, University of Georgia, 2005.