Semantics -based customization of information retrieval on the World Wide Web
SEMANTICS-BASED CUSTOMIZATION OF INFORMATION RETRIEVAL ON THE WORLD WIDE WEB

by Yun-An Chen

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2006

Copyright 2006 Yun-An Chen

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 3233848. UMI Microform 3233848, Copyright 2006 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346

Dedication

In the memory of my grandmother

Acknowledgements

First and foremost, I would like to thank Dr. Dennis McLeod, my research advisor and my guide in life. I was very blessed to have the opportunity to be his student and to learn from the best. He not only trained me to be a better scholar but also supported me through the hardships of life during the past four and a half years.
I would like to show my appreciation to my qualifying and dissertation committees: Banu Ozden, Larry Pryor, Roger Zimmermann, and Shri Narayanan.

To my beloved parents, the word “thanks” alone is not enough to show my gratitude. They provided me with the best a child could ever want. Thank you, my parents.

To my “sisters” who supported me all these years, Ting-Yu Shih, Yi-Yao Li, Chao-Tung Lin, Chung-Tin Hsu, and Ying-Lan Liu: may our sisterhood never end. As the only child in my family, I truly treasure our sisterhood.

I would like to thank Pei-Hsuan Lee for his support during all these years. To Chi-Chang Hsieh, thank you for all the support during my toughest time.

Thanks to my friends from USC, Yu Chang, Shu-Fang Cheng, Chi-Wen Su, Hong-Yao Chen, Yan-Po Chen, Pei-Wen Wu, Ailing Chiang, Wen-Ting Liu, Howen Pu, Litsao Chen, and many more. You were my sunshine during all these years.

I am also grateful to my Godmother, Godfather, and their daughter. I can’t imagine a life in the US without you.

To my colleagues, Seokyyung Chung, Jose Marcias, Andrea Donnellan, Michele Judd, Maggi Glassocoe, Robert Granat, Lisa Grant, Mirhya Gould, Marlon Pierce, Susanna Garden, Patrick Dent, and the people in the QuakeSim and AIST projects and in IMSC, I appreciate your friendship all these years. It was my pleasure to work with you.
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
Chapter 2 Research Statements
Chapter 3 A Semantics-based Customization Framework
Chapter 4 A Semantics-based Similarity Model
Chapter 5 Profile Management
Chapter 6 Similarity Decision Mechanisms
Chapter 7 Implementation of the Semantics-based Customization Framework
Chapter 8 Experiment Designs
Chapter 9 Result Analysis
Chapter 10 Discussion and Conclusion
Bibliography

List of Tables

Table 1: Categories on Yahoo.com
Table 2: Categories on LATimes.com
Table 3: Categories on CNN.com
Table 4: Categories on Washingtonpost.com
Table 5: Categories on NYtimes.com
Table 6: Categories on Newsweek.com
Table 7: Categories on USAtoday.com
Table 8: Categories on Timesonline.co.uk
Table 9: Examples of Top Level Concepts
Table 10: A Binary Representation Example for the Individual User Profile
Table 11: Results of Grouping and Priority Assignments
Table 12: The Survey Result From the First Domain Expert
Table 13: The Survey Result From the Second Domain Expert
Table 14: The Survey Result From the First Computer Scientist
Table 15: The Survey Result From the Second Computer Scientist
Table 16: Results of Votes
Table 17: Recommended Concepts based on Our Semantics-based Framework
Table 18: Recommended Concepts based on the First Method Proposed by Fleischemen et al.
Table 19: Recommended Concepts based on the Second Method Proposed by Fleischemen et al.
Table 20: Recommended Concepts Provided by imdb.com
Table 21: Randomly Selected Concepts to be Recommended
Table 22: Recommended Movies Provided by Humans
Table 23: Similar Movies Determined by Humans

List of Figures

Figure 1: A Combination of Several Interrelated Ontologies
Figure 2: A Sketch of the Science Domain Ontology
Figure 3: A Sketch of the Partial Earthquake Science Domain Ontology
Figure 4: A Sketch of the Partial Movie Domain Ontology
Figure 5: An Example of an XML File
Figure 6: A Graph Representation for the File in Figure 5
Figure 7: The Adaptation Algorithm
Figure 8: Creation of the Plane Q
Figure 9: Adaptation of the First Node
Figure 10: Creation of the Second Plane and the Adaptation of the Second Node
Figure 11: Projection of the Second Node on the Plane Q
Figure 12: Adaptation and Projection of the Third Node
Figure 13: Creation of the Third Plane and the Adaptation of the Fourth Node
Figure 14: Projection of the Fourth Node
Figure 15: Further Adaptation of Interrelated Domain Ontologies
Figure 16: Multiple Interrelated Domain Ontologies
Figure 17: Results of the Further Adaptation for Multiple Domain Ontologies with Only the Plane Q Shown
Figure 18: The First Pseudo Codes for Deciding Hierarchically Similar Groups
Figure 19: The Second Pseudo Codes for Deciding Hierarchically Similar Groups
Figure 20: The First Pseudo Codes for Deciding Semantically Similar Groups
Figure 21: The Second Pseudo Codes for Deciding Semantically Similar Groups
Figure 22: Pseudo Codes for the First Value Decision Algorithm
Figure 23: Pseudo Codes for the Second Value Decision Algorithm
Figure 24: Pseudo Codes for the Algorithm Deciding a Sequence of Binary Representations
Figure 25: Pseudo Codes of the Grouping Algorithm
Figure 26: Pseudo Codes of the Online Recommendation Decision Algorithm
Figure 27: The Partial Detailed Earthquake Domain Ontology
Figure 28: An Experimental System Architecture
Figure 29: An Illustration of the Result Display Design
Figure 30: The Survey Questionnaires
Figure 31: The Contents of Query 1
Figure 32: The Contents of Query 2
Figure 33: The Regression Results for the First Data Set
Figure 34: The Regression Results for the Second Data Set

Abstract

Current web search engines provide query results with relatively high precision and recall, but user satisfaction is often low. Precision and recall are traditional measures of retrieval accuracy, not of user satisfaction or preference. Therefore, user characteristics and preferences have become important in the development of information retrieval approaches.
Although many recommendation systems have been designed to provide personalized query results that match user preferences and so increase user satisfaction, none of these systems was designed to include semantics flexibly and efficiently while providing constant-time online recommendation. The goal of this research is to test the hypothesis that a semantics-based system incorporating semantic data representation structures (ontologies) can facilitate customization of online search results. We developed a hybrid recommendation system implemented with a semantics-based similarity decision model. This model utilizes Directed Acyclic Graph (DAG) structures and supports constant-time online recommendation. The recommendation system is designed to solve scalability and sparsity problems and to generate online recommendations in constant time. The semantics-based model is compatible with current Semantic Web technologies. Experiments have been conducted to examine the effects of customized information retrieval.

Chapter 1 Introduction

Vendors and search engine developers aim to provide recommendations based on user preferences in order to increase user attention to, or satisfaction with, query results. For commercial web sites, accurate prediction can result in higher sales, since users may buy the recommended items. Meanwhile, the recommendation list allows users to obtain the information they need quickly and to avoid possible information overload. As a result, users are more willing to return to the vendors or search engines for future information exploration.

The goal of a recommendation system is to find top-priority items and provide them to users [18]. Although many recommendation systems have been designed to provide customized query results, semantics are not considered in most recommendation systems.
In a few systems [23] [40], hierarchical structures have been developed to represent the interrelations of metadata in order to encompass semantics in recommendation system designs. However, hierarchical structures can represent only portions of domain knowledge. Directed Acyclic Graphs (DAGs) are more capable of representing complete domain knowledge than hierarchical structures, since DAGs can represent more complicated interrelationships among concepts. A recommendation system that considers semantics should be capable of managing DAGs.

In order to incorporate semantics into the customization of search results, approaches for determining similarity among concepts and interrelationships in data representation structures are essential. Many approaches have been proposed to solve similarity decision problems. One of the most widely used conventional approaches is the nearest-neighbor algorithm. It has been utilized in recommendation system designs, but it yields non-constant-time computations [63]. Recommendation systems consider similarity either among items or among users to customize search results. Unlike item data, user data are dynamic, which means that they vary within a short time period. Searching among millions of neighbors is a time-consuming process [31]. If new user data must be analyzed and included in recommendation processes in real time, the waiting time for the query result increases. This increase is called data latency.

Another approach to similarity determination is to traverse DAG structures. This approach has been studied in the field of knowledge discovery in databases (KDD) [68]. The relatedness of two concepts is calculated from the results of traversing all possible paths.
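A traversal-based relatedness measure can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of the cited KDD work: the toy ontology, the function names, and the choice of shortest path length (rather than all paths) as the relatedness signal are hypothetical. The point is that the search examines each edge a bounded number of times, so its cost grows with the number of edges.

```python
from collections import deque

# Hypothetical toy ontology: each concept maps to its child concepts (a DAG).
ONTOLOGY = {
    "science": ["earth_science", "physics"],
    "earth_science": ["earthquake", "fault"],
    "physics": ["earthquake"],  # a second parent, so this is not a tree
    "earthquake": [],
    "fault": [],
}


def undirected_view(dag):
    """Return an adjacency map that ignores edge direction."""
    neighbors = {node: set() for node in dag}
    for parent, children in dag.items():
        for child in children:
            neighbors[parent].add(child)
            neighbors[child].add(parent)
    return neighbors


def path_length(dag, source, target):
    """Breadth-first search: fewest edges between two concepts, or None."""
    neighbors = undirected_view(dag)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def relatedness(dag, a, b):
    """Map path length to (0, 1]; shorter paths give higher relatedness."""
    dist = path_length(dag, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)
```

Because the whole graph may have to be searched for every query, the work done online depends on the size of the ontology.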
Traversing all possible paths costs O(|E|), where |E| is the number of edges. |E| is at most n^2, where n is the number of nodes in a data representation structure (ontology), so the computational complexity of the online computations can be expressed as O(n^2). This complexity is not constant. The complexity of the recommendation computation directly affects the waiting time of the user. A non-constant cost grows with the size of the ontology, whereas a constant cost does not. If the computational complexity of an algorithm is high, similarity decisions take longer than they would with a low-complexity algorithm, and information retrieval processes are delayed. Users have to wait longer to obtain customized query results.

Other approaches with lower computational complexity have been proposed to provide customized search results. One approach incorporates the cosine similarity measure, in which the similarity of concepts is measured through vectors defined for them [3]. The disadvantage is that a great deal of manual editing is required to establish the data similarity matrix. Another approach introduces the genetic algorithm into recommendation processes. The genetic algorithm is combined with the Naive Bayes classifier: the classifier classifies items with consideration of semantics, while the genetic algorithm clusters users into groups and reduces dimensionality in order to decrease time complexity [43]. However, this approach is not able to incorporate existing data representation structures without manual edits.

Besides computational complexity problems, other issues exist in current recommendation systems. Scalability problems occur when system management mechanisms are not carefully designed.
If a system is not capable of accepting updates or of extending its coverage to future data, the system is inflexible and has scalability problems. Another issue is the sparsity problem [35]. If user profiles are small or users have unique tastes, similarity decisions cannot be established. Approaches that resolve scalability and sparsity problems are expected to provide better recommendations to users.

Ontologies contain sufficient information to facilitate information retrieval processes so that retrieved results match user expectations [47]. Many ontologies have been developed with Semantic Web technologies. The Semantic Web is defined as the representation of data on the World Wide Web [7]. Many web-based languages and technologies have been proposed to represent data on the web and to manage metadata. The Resource Description Framework (RDF) [69] describes resources in terms of triples, each consisting of a subject, a property, and an object. The DARPA Agent Markup Language (DAML) [59] was developed as an extension of XML and RDF. The Ontology Inference Layer (OIL) [59] was introduced later, and DAML+OIL [78] was developed based on W3C standards as a semantic markup language. The Web Ontology Language (OWL) [68] goes one step further, providing additional metadata for web contents, supported by XML, RDF, and RDF Schema (RDF-S), in order to allow non-human agents and applications to understand and process information directly. Recommendation systems that are compatible with Semantic Web technologies would broaden the application of customized search results.

The Extensible Markup Language (XML) has been an essential component of those web-compatible technologies.
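To make the triple structure concrete, the sketch below stores a few subject, property, and object triples as plain tuples and filters them by pattern. It is an illustration only: the `ex:` names and the `match` helper are invented for this example, and no RDF library is used. Only `rdf:type` and `rdfs:subClassOf` are terms from the actual RDF and RDF Schema vocabularies.

```python
# Hypothetical triples; the "ex:" resources and properties are invented.
TRIPLES = [
    ("ex:SanAndreas", "rdf:type", "ex:Fault"),
    ("ex:Fault", "rdfs:subClassOf", "ex:GeologicalFeature"),
    ("ex:SanAndreas", "ex:locatedIn", "ex:California"),
]


def match(triples, subject=None, prop=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (prop is None or p == prop)
        and (obj is None or o == obj)
    ]
```

For example, `match(TRIPLES, subject="ex:SanAndreas")` returns the two statements made about that resource.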
Ontologies are described with these technologies in the web environment for machines to read and understand. Therefore, a framework compatible with Semantic Web technologies would give recommendation systems transferability and extensibility. Compatibility with Semantic Web technologies ensures that existing ontologies can be utilized and that abundant semantics are considered in the recommendation processes.

In order to incorporate semantics into customized query results, a new system that is flexible with respect to the structure of semantic representations is required. We propose to establish a semantics-based recommendation system framework. The system design focuses on generating customized query results, interpreting the semantics within user queries and data sources, remaining flexible to the acquisition of different ontologies, and reducing manual editing. The foundation of the similarity decision is an ontology, a data representation structure consisting of concepts and their relationships. Items are represented by concepts in an ontology. The proposed approach is able to manage ontology updates or incorporate existing data representation structures automatically, so that profiles are updated and reformed dynamically based on the ontology and metadata. The automatic process ensures that existing ontologies described by Semantic Web technologies can be utilized with only minor input transformation. In addition, user or item profiling is required to be flexible to additions or updates to the data representation structure, which is an ontology in our system. With semantics included in the recommendation, user profiles are dynamic and reflect semantic updates, which avoids scalability problems.
Furthermore, we incorporate semantics-based solutions to overcome the sparsity problem when there is insufficient information about current users. The sparsity problem occurs when data are not related to any information about assigned priorities, user behavior, or user preferences. We further introduce a binary interpretation in order to reduce the computational complexity of online similarity decisions. Fuzzy logic is also introduced into our framework design in order to represent different levels of user preference. Online recommendation decisions are completed in constant time in our proposed framework. Experiments have been designed to collect user data and behavior patterns, to interpret user data according to the domain ontologies, to refine queries using the information from user profiles, and to adjust the final result presentation based on user preferences.

1.1 A Semantics-based Recommendation System

A semantics-based recommendation system framework was developed to solve sparsity and scalability problems and to perform online recommendation in constant time. A semantics-based similarity model was developed to be incorporated in the framework. This model was designed to solve similarity decision problems with consideration of collaborative filtering techniques and the semantics represented in ontologies. The purpose of the model was to make similarity decisions promptly and efficiently.

This semantics-based similarity decision model consisted of offline computations and online operations. Offline computations included user profiling, user grouping, semantically similar grouping for concept nodes, and priority computations for semantically similar groups. Performing grouping and computing priorities offline reduced the time complexity of online recommendation decisions.
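The division of labor between offline and online work can be illustrated with a lookup table. The sketch below is not the thesis's algorithm; the group names, priorities, and functions are hypothetical placeholders. It assumes that semantically similar groups and per-group priorities have already been computed offline, and shows why the online step then reduces to one dictionary lookup, which takes expected constant time regardless of the number of users.

```python
# All names and data below are hypothetical placeholders.
SEMANTIC_GROUPS = {  # concept -> concepts judged similar during offline grouping
    "earthquake": ["fault", "tsunami"],
    "fault": ["earthquake"],
}
GROUP_PRIORITIES = {  # user group -> concept -> priority computed offline
    "group_a": {"fault": 0.9, "tsunami": 0.4, "earthquake": 0.7},
    "group_b": {"fault": 0.2, "tsunami": 0.8, "earthquake": 0.5},
}


def build_table(semantic_groups, group_priorities):
    """Offline step: precompute the best candidate for every (group, concept)."""
    table = {}
    for group, priorities in group_priorities.items():
        for concept, candidates in semantic_groups.items():
            table[(group, concept)] = max(
                candidates, key=lambda c: priorities.get(c, 0.0)
            )
    return table


RECOMMENDATION_TABLE = build_table(SEMANTIC_GROUPS, GROUP_PRIORITIES)


def recommend(user_group, queried_concept):
    """Online step: one dictionary lookup; None if nothing was precomputed."""
    return RECOMMENDATION_TABLE.get((user_group, queried_concept))
```

The trade-off is that the table must be rebuilt, or patched, offline whenever groups or priorities change.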
Online recommendation decisions were completed by a sequence of recurrent operations: locating user groups, selecting candidates for recommended concepts, and deciding the recommended concept(s). The similarity model consisted of simple operations and resulted in constant-time computations.

1.2 Experimental Contexts

The proposed framework was applied to two different domains, and a domain ontology was developed for each based on available data. The first domain ontology was an earthquake science domain ontology. Its data source was the QuakeSim project [19] [28]. The earthquake science domain was considered a sub-domain of the science domain. Therefore, several general concepts from the science domain were included and placed at the top of the earthquake science domain ontology.

The second domain was the entertainment domain, for which an entertainment domain ontology was developed. Its data sources included the online categories and web contents of imdb.com [37]. The Internet Movie Database (IMDb) is a very large collection of movie information. This database records details about a movie, from cast to producers, from filming locations to producing countries, along with all other essential elements.

1.3 Contributions

The framework included consideration of semantics in similarity computations and enabled similarity decisions of low computational complexity for online recommendation processes. The framework was capable of managing DAGs, which are a superset of hierarchical structures, and was compatible with Semantic Web technologies with a minimal requirement for manual edits. This minimal requirement results from the novel geometry-based data adaptation approach.
Furthermore, the binary representation of user profiles in the framework was an innovative approach that facilitated constant-time online recommendation processes and profile updates.

1.4 Outline

In this thesis, the research statements, hypotheses, and other issues are discussed in Chapter 2. The sketch of the semantics-based framework and the domain ontologies are delineated in Chapter 3. The essential entities and mechanisms of the semantics-based framework are detailed in Chapters 4, 5, and 6. Details of the implementation are discussed in Chapter 7. In Chapter 8, the experiment designs are illustrated. Experimental results are analyzed in Chapter 9. Discussion and conclusions are provided in Chapter 10.

Chapter 2 Research Statements

Analyzing user data and extracting useful information for further prediction are major functions of recommendation systems. Recommendation systems are designed to help users locate preferable items quickly and to avoid possible information overload. Data mining techniques are applied in recommendation system designs in order to determine similarity among thousands or even millions of data items. For example, people who like or dislike movies in the same categories would be considered to have similar behavior [13]. Recommendation system designs focus on two major categories of similarity: similarity among user behavior and similarity among items.
In early recommendation designs, user behavior, such as user preferences, was surveyed offline in advance in order to provide customized information selections [3]. As the number of internet users grew, people developed an interest in searching for information on the internet, and user interests have been considered in recommendation system designs ever since. Researchers started to develop automatic collaborative filtering techniques to provide recommendations in the internet environment [60]. Designs of pioneering recommendation systems focused on recommendation in the entertainment domain [32] [60] [66]. Different collaborative filtering techniques reach similarity decisions differently. Besides collaborative filtering techniques, content-based filtering techniques have been proposed and implemented in recommendation systems [5] [45]. Each technique has its own drawbacks. Therefore, hybrid methods that combine collaborative filtering and content-based filtering techniques have been proposed in order to resolve some of these drawbacks. Hybrid methods are still unable to solve the sparsity problem completely.

Our research focuses on the development of an efficient hybrid method with consideration of semantics. Consideration of semantics ensures that at least one recommendation is provided for the queried information, which solves the sparsity problem. In our research, semantics are represented in an ontology. In philosophy, ontology is a branch of metaphysics dealing with the nature of being [39]. In our context, an ontology is a collection of concepts and their relationships in either a general or a specific domain. A concept-based model involving an ontology has been established and tested to have high recall and precision [42].
Our research further refines a concept-based model by involving user profiles for information recommendation and by adding flexibility to different domain ontologies. In this chapter, related work is reviewed in Section 2.1, research assumptions are listed in Section 2.2, and approaches for customization are provided in Section 2.3.

2.1 Literature Reviews

In order to draw users’ attention and to increase their satisfaction with online information search results, search engine developers and vendors have tried to predict user preferences based on user behavior. Search engines or online vendors provide recommendations to users along with the original query results. Recommendation systems are implemented in commercial and non-profit web sites in order to collect user data and predict user preferences. Decisions are based on relationships among items, individual user behavior, relationships among a group of users, or a combination of these.

Two major approaches use user behavior to provide recommendations: content-based filtering and collaborative filtering. Content-based filtering provides recommendations based on individual preferences. Collaborative filtering generates recommendations based on the preferences of a group of users with similar behavior. A newer approach combines content-based and collaborative filtering techniques in order to predict user preferences accurately. A recommendation system including both techniques is a hybrid recommendation system [4]. How accurate a prediction is depends on the subjective opinions of the users. The literature in the recommendation system research area is reviewed in this section.
2.1.1 Recommendation Systems without Consideration of User Behavior

Recommendation systems perform three major processes: object data collection and representation, similarity decisions, and recommendation computations. In a recommendation system without consideration of user behavior, only information about items is required in the data collection process. Several methods have been proposed to represent relationships among items.

One method is the Vector Space Model [62], which is commonly applied to document data. Properties of items are represented as vectors in an n-dimensional space, and similarity between two items is determined by the cosine measurement. Deciding the vectors requires a great deal of manual work for non-document items. Relationships among document items can be decided automatically by keyword spotting and other techniques [8] [9] [79] [83], but properties of non-document items cannot be decided as simply as by counting the distinct words in a document.

Another method is to apply techniques from the Natural Language Processing field [21]. Natural Language Processing techniques are included in recommendation systems to create similarity metrics. Again, these techniques do not suit non-document items.

A third method of representing relationships among items is to create data representation structures, such as an ontology. Ontologies have been included in many information systems [17] [57]. Utilizing an existing ontology in a recommendation system requires less human effort than mapping item properties with the Vector Space Model from scratch.

When content-based filtering, collaborative filtering, or hybrid recommendation systems are implemented with the above methods, the sparsity problem is solved.
The reason is that whenever user preferences are not available, recommendations are provided based on pure item-to-item relationships. Items similar to a queried item can be located through these relationships even when no user ratings or preferences are associated with them. Therefore, recommendations can still be provided to users.

2.1.2 Recommendation System with Consideration of User Behavior

Content-based filtering and collaborative filtering are the two major approaches implemented in recommendation systems. Content-based filtering provides recommendations based on the preferences of individuals. Collaborative filtering provides recommendations based on the common preferences of groups of users with similar behavior [15]. Collaborative filtering approaches overcome several limitations of content-based filtering approaches [4]. The most important one is the sparsity problem. If a user queries an item that this user has not rated, or no similar items have been queried by this user before, recommendations can be provided based on the common preferences of users with similar past behavior. Recommendation systems implemented with collaborative filtering techniques have been shown to provide satisfying recommendations [32] [66].

An important recommendation system design, the GroupLens project, has investigated automated collaborative filtering techniques and related issues since 1992 [44] [60]. GroupLens was a recommendation system for netnews. In its design, the Better Bit Bureaus (BBBs) were proposed to predict user preferences by computing the correlation coefficients between users and averaging all users' ratings of a news article. The approaches developed in the GroupLens project have been included in many recommendation system designs.
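Correlation-based prediction of this kind can be sketched as below. The sketch assumes a common textbook formulation (Pearson correlation over co-rated items, followed by a correlation-weighted average of other users' deviations from their own means) and is not claimed to reproduce the BBB design; the ratings and function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {article: rating}.
RATINGS = {
    "ann": {"a1": 5, "a2": 3, "a3": 4},
    "bob": {"a1": 4, "a2": 2, "a3": 3, "a4": 5},
    "cat": {"a1": 1, "a2": 5, "a3": 2, "a4": 1},
}


def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def pearson(ratings, u, v):
    """Pearson correlation over the items both users rated; 0.0 if undefined."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    mu = mean(ratings[u][i] for i in common)
    mv = mean(ratings[v][i] for i in common)
    cov = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    var_u = sum((ratings[u][i] - mu) ** 2 for i in common)
    var_v = sum((ratings[v][i] - mv) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / sqrt(var_u * var_v)


def predict(ratings, user, item):
    """User's mean rating plus correlation-weighted deviations of other raters.

    The result is not clamped to the rating scale.
    """
    base = mean(ratings[user].values())
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(ratings, user, other)
        num += w * (theirs[item] - mean(theirs.values()))
        den += abs(w)
    return base if den == 0 else base + num / den
```

Each prediction loops over every other user who rated the item, so the online cost grows with the number of users.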
However, BBBs require high computational complexity in time. RecTree [13] was proposed to improve the performance, but its execution time was still not constant, being O(n log2(n)). Other than BBBs, the nearest-neighbor algorithm, which determines and ranks the distance between a target object and every other available object, and methods based on it have been used extensively in the implementation of recommendation systems [60]. The correlation coefficient computation and the nearest-neighbor algorithm have limitations with respect to scalability and sparsity. User data are dynamic, which means that the data vary within a short time period. Current users may change their behavior patterns, and new users may enter the system at any moment. Millions of user records, which are called neighbors, must be examined in real time in order to provide recommendations [31]. Searching among millions of neighbors is a time-consuming process.

To solve the system update issue, item-based collaborative filtering algorithms have been proposed, which reduce computation because the properties of items are relatively static [65]. Suggest is a Top-N recommendation engine implemented with item-based recommendation algorithms [18] [41]. Another example, Amazon.com, employs an item-based algorithm for collaborative-filtering-based recommendation [48] to avoid the disadvantages of conventional collaborative filtering algorithms. Other than item-based methods, clustering [10], the Eigentaste algorithm [24], and Singular Value Decomposition (SVD) [64] have been introduced into collaborative-filtering-based recommendation system designs in order to overcome scalability and sparsity issues.
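The item-based idea mentioned above can be sketched as follows. This is a minimal illustration, assuming cosine similarity between item rating vectors, and it does not reproduce any of the cited systems; the ratings, titles, and function names are hypothetical. Because item vectors change slowly, the ranking for each item could be computed ahead of time and reused.

```python
from math import sqrt

# Hypothetical user-item ratings; a missing entry means "not rated".
RATINGS = {
    "u1": {"matrix": 5, "inception": 4},
    "u2": {"matrix": 4, "inception": 5, "titanic": 1},
    "u3": {"titanic": 5, "notebook": 4},
}


def item_vectors(ratings):
    """Transpose user -> item ratings into item -> {user: rating}."""
    items = {}
    for user, rated in ratings.items():
        for item, score in rated.items():
            items.setdefault(item, {})[user] = score
    return items


def cosine(a, b):
    """Cosine of two sparse vectors stored as dicts; 0.0 for a zero vector."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def similar_items(ratings, item, top_n=2):
    """Rank the other items by cosine similarity to `item`."""
    vectors = item_vectors(ratings)
    scores = [
        (other, cosine(vectors[item], vec))
        for other, vec in vectors.items()
        if other != item
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

With these toy ratings, the item ranked closest to "matrix" is "inception", since the same two users rated both highly.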
Another approach to decrease time consumption is to include the Eigentaste algorithm. Jester is an online joke recommendation system based on the Eigentaste algorithm, which was proposed to reduce the dimensionality of the offline clustering and to perform online computations in constant time [24]. The clustering is based on continuous user ratings of jokes. The Eigentaste and genetic algorithms enable constant-time computations for online processes. One other approach to constant-time recommendation is to utilize the Expectation Maximization (EM) algorithm. The EM algorithm [12] provides a standard procedure for maximum likelihood estimation in latent variable models, and it has been applied to estimate different variants of the aspect model for collaborative filtering [33] [34].

Another important issue to be considered in recommendation system designs is the representation of user preferences. Unlike data in the digital world, human preference is difficult to represent with only 0s and 1s. Therefore, fuzzy logic has been considered in recommendation technique designs. Fuzzy logic covers the gray area of preference levels. A personal computer recommendation system (PCFinder) has included fuzzy logic and utilized a proposed method of order-based similarity measurement [82]. Instead of using 0/1 for the search, this method uses concepts from fuzzy logic to estimate the similarity. Utilizing data representation structures and fuzzy logic enhances the performance of recommendation systems in both accuracy and completeness.

Semantics have also been considered in novel recommendation designs to facilitate user profile management. Hierarchical structures have been employed to describe the relationships among users [40]. The preferences of each user can be described in a hierarchical structure.
Hierarchical structures can also be applied to similarity computations for items [22] [23]. Knowledge of the world cannot be described by hierarchical structures alone. Directed Acyclic Graphs (DAGs) are able to express knowledge more completely. Using semantic similarity within a DAG has been proposed [36] [56] [58]. Polcicova et al. [58] and Hyvonen et al. [36] proposed approaches that combine semantics with a content-based technique. The approach proposed by Polcicova et al. is only suitable for document items. The approach proposed by Hyvonen et al. requires extensive manual editing of recommendation rules. Patrick et al. [56] proposed an approach that combines semantics with a hybrid technique. Nevertheless, computing a recommendation with this approach takes polynomial time. For internet information searches, users expect to receive search results as soon as possible. Highly efficient approaches, meaning constant-time computations, are required in the internet environment.

2.2 Assumptions

1. An ontology represents the world in a systematic, semantic, and sufficient way.
2. If two concepts are relatively closer in the ontology structure, these two concepts are more similar.
3. If two concepts in an ontology have the same parent node, these two concepts must have at least one common property.
4. User behavior patterns indicate user preference towards certain concepts.
5. A user expects a recommended concept that is as similar to the queried concept as possible.
6. The closer a recommendation is to what a user expects, the higher the user satisfaction is.
7. Users expect fast recommendation results in the internet environment.

2.3 The Approaches of Customization

We developed a semantics-based customization framework to implement an online hybrid recommendation system with consideration of semantics.
The online similarity decision making is required to be efficient. Being efficient means that the decision making process is completed in a short amount of time. In order to decrease the computation time, the number of procedures involved in the decision making process was required to be small, and the complexity of each procedure was expected to be low. A similarity decision process based on our similarity model required only three gradational procedures and yielded low overall time complexity. The three gradational procedures were locating a similar user group, selecting candidates for the recommended concepts, and deciding the recommended concept(s).

In order to locate a group whose members share common preferences with a given user, our semantics-based customization framework first performed a collaborative filtering technique. This collaborative filtering technique first located the user's group. If the current user had been assigned to a group, the candidates for the recommended concepts would be the top-priority concepts preferred by this group of users, and these concepts had to be at least hierarchically similar to the queried concept. If the current user did not belong to any group, this user's individual profile would be consulted in order to select candidates for the recommended concepts, and these concepts had to be at least hierarchically similar to the queried concept. The definition of “hierarchically similar” is given in Chapter 4. If selecting candidates for the recommended concepts still failed, candidacy would be decided based on semantics. The concepts that are semantically most similar to the queried concept would be chosen. This completed the second procedure.
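The first two procedures described above amount to a fallback cascade: the group profile if the user has a group, otherwise the individual profile, and finally purely semantic neighbors. The sketch below illustrates only that control flow. All names (`select_candidates`, `user_group`, `group_prefs`, `user_prefs`, `hier_similar`, `semantic_neighbors`) and the dictionary-based data layout are hypothetical and are not taken from the thesis.

```python
def select_candidates(user, query, user_group, group_prefs, user_prefs,
                      hier_similar, semantic_neighbors):
    """Return candidate concepts for `query`, in priority order.

    user_group:         user -> group id (a missing key means no group)
    group_prefs:        group id -> concepts ranked by group preference
    user_prefs:         user -> concepts ranked by individual preference
    hier_similar:       concept -> set of hierarchically similar concepts
    semantic_neighbors: concept -> concepts ranked by semantic similarity
    """
    group = user_group.get(user)
    if group is not None:
        ranked = group_prefs.get(group, [])   # procedure 1 located a group
    else:
        ranked = user_prefs.get(user, [])     # fall back to the individual profile

    # Keep only preferred concepts that are hierarchically similar to the query.
    allowed = hier_similar.get(query, set())
    candidates = [concept for concept in ranked if concept in allowed]
    if candidates:
        return candidates

    # Last resort: decide candidacy on semantics alone.
    return list(semantic_neighbors.get(query, []))


# Hypothetical data for a movie-genre query.
user_group = {"u1": "g1"}
group_prefs = {"g1": ["Drama", "Comedy", "Seismology"]}
user_prefs = {"u2": ["Western"]}
hier_similar = {"Action": {"Comedy", "Drama"}}
semantic_neighbors = {"Action": ["Thriller"]}
```

With precomputed lookup tables like these, each step is a dictionary access followed by a scan of a short ranked list, which is consistent with the goal of keeping the online decision cheap.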
The third procedure, deciding the recommended concept(s), took into consideration the criteria or limitations specified by system users, system designers, or system administrators. For example, in a news recommendation system, if no news content related to the concept candidate with the highest priority exists, that candidate would be removed from candidacy due to the system limitation, and the concept candidate with the second highest priority would be examined.

Several experiments were conducted. Web-based systems were implemented. Experiment designs are disclosed in Chapter 8. Experiment results are discussed in Chapter 9.

Chapter 3
A Semantics-based Recommendation Framework

Many data representation structures, such as web site categories and domain ontologies, have been established for semantics-based information search and retrieval on the web. These structures consist of concepts and their interrelationships. Approaches to determine the similarity among concepts in data representation structures have been developed in order to facilitate information retrieval and recommendation processes. Some approaches are only suitable for similarity computations in tree structures. Other approaches, designed for Directed Acyclic Graph structures, yield high computational complexity for online similarity decisions. Another approach is the Cosine-Similarity Measure. This approach requires manual edits of the data similarity matrix. In order to provide efficient similarity computations for data representation structures, we developed a semantics-based customization framework that includes a geometry-based solution. An ontology is first spontaneously adapted into a geometric 3-dimensional space.
Similarity computations are based on geometric properties, and the online similarity computation is performed in constant time. Details of this framework are disclosed in Section 3.1. In Section 3.2, information about ontologies and the two chosen domain ontologies is provided.

3.1 Semantics-based Customization Framework

Our semantics-based recommendation system framework included two major components: a semantics-based similarity decision model and an online similarity decision algorithm. The proposed similarity model was a semantics-based hybrid technique. Similarity was decided based on adaptation results, and concepts in an ontology were associated with ranks mapped from user behavior patterns. Both individual and group user profiles were created. Similarity decisions performed by our online similarity decision algorithm resulted in constant-time computations.

A similarity decision model was developed to facilitate recommendation processes with systematic mechanisms. This model was also designed to discover and organize the useful data properties hidden in large datasets. This similarity model for semantics-based recommendation included three entities and four mechanisms, which were proposed to facilitate recommendation decision processes. The entities and mechanisms are described and explained in this section.

Entity 1: An ontology was the first essential entity in the similarity decision model. An ontology represents domain or general knowledge.

Entity 2: The second entity was a collection of concept groups. Each group contained one or more concepts with similar properties. How similar these properties are is decided based on semantics. In this case, semantics are indicated by the relationships among concepts.
Groups were decided by Mechanism 2 based on the results from Mechanism 1.

Entity 3: The third entity was a collection of profiles. This collection was divided into three categories: item profiles, individual user profiles, and group user profiles. The profile type of each category is discussed in Chapter 5.

Mechanism 1: The geometry-based data adaptation was the initial step to illustrate the similarity of concepts in an ontology based on the relationships among the concepts. In the data adaptation, relationship types were not considered. The data adaptation was designed to represent the ontology in a 3-dimensional space. Concept nodes were assigned coordinates. Edges were represented by vectors.

Mechanism 2: The semantic grouping was based on the hierarchically or semantically similar properties of the concepts after the data adaptation process was done. Definitions of semantically similar and hierarchically similar are introduced in Chapter 4. In order to facilitate similarity decisions, group labels were introduced to identify the concept groups. The concepts that were hierarchically or semantically similar were labeled as one group, L_i. Any node without a parent node formed a group alone. Every concept that was not hierarchically or semantically similar to any other concept also formed a group alone.

Mechanism 3: The profiling was to analyze user behavior data and to generate item profiles, individual user profiles, and group user profiles. Analysis of user behavior data resulted in user preference representations.

Mechanism 4: Creating online recommendation decisions was the fourth mechanism. This mechanism was designed to predict user preference based on the similarity of user behavior and of semantics.

3.2 Semantics- Ontologies

An ontology is a collection of concepts and relationships [40].
An ontology represents the semantics of the world more completely with several inter-connected hierarchical structures, which are combined into one DAG structure. In this research, two domains were chosen in order to establish smaller-scale ontologies for the experiments. Because of the abundant semantics within the ontologies, the similarities of the concepts are determined efficiently and effectively by referring to the interrelationships in the ontologies. This characteristic of ontologies facilitated the decisions on recommendations associated with user behavior and preferences in our proposed framework. The core of the customization process in this framework was the utilization of ontologies.

In this research, we planned to develop ontologies for two domains: the Earthquake Science domain within the Science domain and the Movie sub-domain within the Entertainment domain. The choice of domains was first based on surveys of online news services. We examined the categorization of eight different online news services: Yahoo.com [80], LATimes.com [49], CNN [11], Washingtonpost.com [75], NYtimes.com [74], Newsweek [54], USAtoday [77], and TimesOnline.co.uk [76]. The Technology/Science domain was listed in every online news service. The Entertainment domain was listed in many of the news services. The following tables show all categories for each news service.

Table 1: Categories on Yahoo.com
Main Categories / Sub-categories if any
Top Stories Business Entertainment Top Stories Movies Music Industry TV Arts and Stage Books Gossip and Celebrity Misc Technology World Politics Sports Science Health Weather Oddly Enough
Crimes and Trials Op/Ed Community Lifestyle/Features Lottery

Table 2: Categories on LATimes.com
Main Categories / Sub-categories if any
The World The Nation California / Local Business Politics Sports Technology Travel Editorials, Op-Ed

Table 3: Categories on CNN.com
Main Categories / Sub-categories if any
World U.S. Weather Business Sports Politics Law Sci-tech Space Health Entertainment Travel Education In-depth

Table 4: Categories on Washingtonpost.com
Main Categories / Sub-categories if any
Nation Columns/Cartoons Courts DOT.MIL National Security Science Search the States Special Reports Photo Galleries Live Online Nation Index
World Search the World Special Reports Africa Americas Asia/Pacific Europe Former Soviet Union Middle East Live Online Photo Galleries World Index
Metro Traffic Schools The District Maryland Virginia Crime Government Obituaries Religion Lottery Columnists Special Reports Community Groups Photo Galleries Live Online Metro Index
Sports Redskins Area Pro Teams Colleges High Schools Leagues & Sports Columnists Features Sports Index
Business Market News Portfolio Washtech Company Research Mutual Funds Personal Finance Industries Columnists Live Online Special Reports Real Estate Business Index
Technology Biotech/Medical Government IT Media/Content 'Net Architecture Policy/Regulation Software/Services Telecom Finance Venture Capital Emerging Cos.
M & A Markets Columnists Community Tech Thursday Special Reports Personal Tech Technology Jobs
Style Books Food Home Post Magazine Sunday Arts Television Weekend Comics Crosswords Horoscopes Ann Landers Columnists Photo Galleries Live Online Style Index
Editorials & Opinion
Travel
Health Alternative Care Children/Youth Chronic Diseases Fitness Health Care Issues Men Mental Health Nutrition Seniors Women Columnists Special Reports Live Online Health Index
Home & Garden Build It/Fix It Furnishings/Design Garden & Patio Home Life Neighborhoods Room by Room Columnists Photo Galleries Live Online Home & Garden Index
Education Parenting Preschool K to 12 KidsPost Higher Education Learning Distance Learning Politics and Policy Teachers Trends & Debates Live Online Education Review Education Index
OnPolitics
Top News
Weather International Weather National Weather D.C. Area Weather Weather Images Historical Weather Data Beach Weather

Table 5: Categories on NYtimes.com
Main Categories / Sub-categories if any
International National Nation Challenged Politics Business Technology Science Health Sports New York Region Education Weather Obituaries NYT Front Page

Table 6: Categories on Newsweek.com
Main Categories / Sub-categories if any
National Business & Money Technology and Science Health & Lifestyle Entertainment Opinion International News Weekend
Table 7: Categories on USAToday.com
Main Categories / Sub-categories if any
Top News Nation States Washington/Politics World Editorial/Opinion Health & Science Census Offbeat

Table 8: Categories on Timesonline.co.uk
Main Categories / Sub-categories if any
Breaking news Britain World Business Sport Your Money Comment Sports Book Travel Shopping Classifieds Law Games Crossword Talking Point Online Specials First night reviews Times 2 Films Books Theatre Food & Drink Television & Radio Arts Property Motoring Health Creme

The technology/science and entertainment categories appeared in almost every online news resource. Based on the available content related to concepts in the domain ontologies, we further focused on the full development of two subdomains: the Earthquake Science domain within the Science domain and the Movie domain within the Entertainment domain. For the Earthquake Science domain, data were from the QuakeSim project [12]. Several recommendation systems had been specifically designed for the Movie domain, and data were publicly available. We chose to collect data from the database of imdb.com [37], which was a very large online movie database. The relationships between the two domains are presented in Figure 1.

Figure 1: A Combination of Several Interrelated Ontologies
[Figure not reproduced. Recoverable labels: Event, Science Domain, Entertainment Domain, Earthquake, Movie.]

3.2.1 The Earthquake Science Domain within the Science Domain

For the science domain ontology, we first used the classifications featured in Science magazine [70]. Then, we added one concept above these classifications: Subject. There were originally four subjects: “Life Sciences”, “Physical Sciences”, “Other Subjects”, and “Additional Topics”.
Since “Additional Topics” covered narrow subjects such as “Asia/Pacific News”, “Corrections”, “Editorials”, “Editors' Choice”, “Enhanced Content”, “Essays”, “European News”, “Latin American News”, “NetWatch”, “Scientific Community”, “Technical Comments”, and “Techniques”, we did not include “Additional Topics” in our Subject concept. The “Subject” concept in our ontology represented the categories in the science domain. We only included “Life Sciences”, “Physical Sciences”, and “Other Subjects” as the child nodes of the “Subject” concept node. A manually drawn sketch of the partial ontology for the science domain is shown in Figure 2.

Figure 2: A Sketch of the Science Domain Ontology
[Figure not reproduced. Recoverable labels: Event, Journal, Conference, Competition, Paper, Subject, Life Science, Astronomy, Geochemistry, Geophysics, Other, Genetics, Virology, Surface, Geo Phenomenon, Segment, San Andreas, Earthquake, Volcano. Legend: IS-A, Part-of, Exhaustive group, Concept.]

We further surveyed Science magazine to determine the narrower subjects in order to decide the concepts in the science domain ontology. For the Life Science, we found the following narrow subjects:
• Anatomy/Morphology/Biomechanics
• Anthropology
• Biochemistry
• Botany
• Cell Biology
• Development
• Ecology
• Epidemiology
• Evolution
• Genetics
• Immunology
• Medicine/Diseases
• Microbiology
• Molecular Biology
• Neuroscience
• Pharmacology/Toxicology
• Physiology
• Psychology
• Virology

For the Physical Science, we had the following narrow subjects:
• Astronomy
• Atmospheric Science
• Chemistry
• Computers/Mathematics
• Engineering
• Geochemistry/Geophysics
• Material Science
• Oceanography
• Paleontology
• Physics
• Physics, Applied
• Planetary Science

For the Other Subjects, we had the following narrow subjects:
• Economics
• Education
• History/Philosophy of Science
• Science and Business
• Science and Policy
• Sociology

Figure 3: A Sketch of the Partial Earthquake Science Domain Ontology
[Figure not reproduced. Recoverable labels: Crust, Journal, Earthquake, Paper, Paper #1, Paper #2, Sierra Madre, Long Tree, San Andreas, Earthquake #1, Earthquake #3, Carrizo, San Fernando, Cucamongo. Legend: IS-A.]

These narrower subjects were included as the top-level concepts in the Science domain ontology. We then built a subsidiary domain ontology for earthquake science based on the available experimental data [14] [16]. The Earthquake Science domain ontology is a subset of the Science domain ontology. The connection between the Earthquake Science domain ontology and the rest was an "IS-A" relation. A sketch of the partial Earthquake Science domain ontology is shown in Figure 3.

3.2.2 The Movie Domain within the Entertainment Domain

A movie domain ontology was developed based on information in the Internet Movie Database (IMDb) [37]. The IMDb started as a newsgroup run by an international group of movie fans. The IMDb now has 452,566 movie titles and related information [38]. Top level concepts in this movie domain ontology were first decided based on the categories listed in the Browse section on imdb.com. Top level concepts included different genres and countries. Examples of top level concepts are listed in Table 9. The categories were decided based on two resources: imdb.com and Yahoo.com.
First, the categories in “Film by categories” on imdb.com were retrieved. These categories include "Top Titles", "Search by Ratings", "Titles by Year", "Titles by Country", "Titles by Language", "Titles by Genre", "Titles by Location", "Titles by Business Information", "Titles by Awards", "Titles by Keywords", and "Titles by Co-Stars". Second, the “Top Categories” in the Yahoo! Movie Directory were reviewed. These categories are "Titles", "Box Office Reports", "Filmmaking", "Genres", "Theaters", "Reviews", and "By Region", categories associated with numbers of data directly. An intersection of the categories from the two resources was performed in order to finalize the categories for determining top level concepts. The reason for selecting Yahoo! as one source is that the Yahoo! search engine developers have established and improved the categories since 1994 [81]. The formats of the categories are reliable. The results of the intersection were two categories: Genres and Countries (Regions). The different genres and countries associated with movies were described as concepts with IS-A relationships to the concept Movie.

Table 9: Examples of Top Level Concepts
Categories: Top Level Concepts
Genres: Action, Adventure, Animation, Biography, Comedy, Crime, Documentary, Drama, Family, Film-Noir, Fantasy, History, Horror, Music, Musical, Mystery, Sci-Fi, Short, Romance, Sport, Thriller, War, Western
Countries*: Albania, Argentina, Australia, Belgium, Brazil, Austria, Bulgaria, Canada, Chile, Colombia, China, Croatia, Cuba, Czech Republic, Czechoslovakia, East Germany, Egypt, Denmark, Germany, France, Finland, Greece, Hong Kong, Hungary, India, Iran, Ireland, Israel, Italy, Japan,
Mexico, Netherlands, New Zealand, Norway, Philippines, Poland, Portugal, Romania, Russia, South Korea, Soviet Union, Spain, Sweden, Switzerland, Taiwan, Turkey, UK, USA, Venezuela, West Germany, Yugoslavia
*Only concepts related to more than 500 movie titles are listed.

Figure 4: A Sketch of the Partial Movie Domain Ontology
[Figure not reproduced. Recoverable labels: Movie, Action, Comedy, Year, USA, Cast, Director, Actress or Actor, Oceans 11, Star War I, Steven Soderbergh, Julia Roberts, George Lucas, Christopher Nolan, Keanu Reeves. Legend: Exhaustive group, IS-A, Part-of.]

Categories for lower level concepts included the cast, year, rating, languages, locations, and plots. The different casts, years, ratings, languages, locations, and plots associated with movies were described as concepts with Part-of relationships to the concept Movie. Lower level concepts were created based on other information about movies on imdb.com. A sketch of a partial Movie domain ontology is shown in Figure 4.

Different categories of lower level concepts may differ in importance to higher level concepts if the relationship type is Part-of. The Part-of relationship type indicates partial inheritance. Some lower level concepts may inherit more properties from higher level concepts than other lower level concepts do. In order to determine how important a lower level concept is to a movie, a survey was conducted. The concepts with Part-of relationships to the concept Movie, which are the lower level concepts, were included in the survey. The survey involved 31 subjects: 16 females and 15 males. Subjects were required to provide a value on a scale from 1 to 100 to indicate the importance numerically.
The results indicated four levels of importance. Leading casts were the most important concepts to a movie. Locations and plots were the second most important. Other casts, years, ratings, and business information were the third most important. The remaining concepts were important, but not as highly valued as leading casts, locations, plots, other casts, years, ratings, and business information. The average scale values for the four levels of importance, rounded to integers, were 80, 60, 20, and 8. These values were employed to represent the weights of relationships.

Chapter 4
A Semantics-based Similarity Model

Our proposed semantics-based similarity model was designed to provide online similarity decisions in constant time. The results of the similarity decisions were concepts that were candidates for recommendation to be provided to information agents. Information agents retrieved the information pieces matching the recommended concepts and presented them back to users. Similarity decisions took semantics into consideration, which means that data representation structures were involved in the similarity decision processes. In this model, four mechanisms were required. The first mechanism was the data representation structure adaptation. Outputs from the data representation structure adaptation were included in the second mechanism, the semantics-based concept grouping with priority assignments. The second mechanism contributed to efficient online similarity decision processes as a preparation for similarity computations. The semantics-based grouping pre-processed the data and separated similar concepts into groups. Deciding the recommended concepts based on the grouping results can be performed in constant time. The third mechanism was the user profiling. Details are discussed in Chapter 5.
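The benefit of the offline grouping step described above can be illustrated with a precomputed lookup table. This is a minimal sketch under the assumption that groups are available as a mapping from a label to its member concepts; the names `build_group_index` and `same_group` are hypothetical, and the thesis's actual grouping and priority assignment are not reproduced here.

```python
def build_group_index(groups):
    """Offline step: invert {label: members} into {concept: label}."""
    label_of = {}
    for label, members in groups.items():
        for concept in members:
            label_of[concept] = label
    return label_of


def same_group(label_of, a, b):
    """Online step: two dictionary lookups, independent of ontology size."""
    return a in label_of and label_of[a] == label_of.get(b)


# Hypothetical groups; a concept with no similar concept forms a group alone.
groups = {
    "L1": ["Action", "Adventure"],
    "L2": ["Drama"],
}
label_of = build_group_index(groups)
```

The inversion costs time proportional to the number of concepts, but it runs once, before any query arrives. Each online check is then an average-case constant-time hash lookup.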
The fourth mechanism, the online similarity decisions, incorporated the grouping results and involved semantics-based criteria and principles. Details are discussed in Chapter 6.

Figure 5: An Example of an XML File

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
      <xs:element name="Surface">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="San Bernadinos"><xs:complexType>
              <xs:sequence>
                <xs:element name="San Bernadino" type="San BernardinoType" maxOccurs="unbounded"/>
              </xs:sequence></xs:complexType>
            </xs:element>
            <xs:element name="North Coasts"><xs:complexType>
              <xs:sequence>
                <xs:element name="North Coast" type="North CoastType" maxOccurs="unbounded"/>
              </xs:sequence></xs:complexType>
            </xs:element>
            <xs:element name="Mojaves"><xs:complexType>
              <xs:sequence>
                <xs:element name="Mojave" type="MojaveType" maxOccurs="unbounded"/>
              </xs:sequence></xs:complexType>
            </xs:element>
            <xs:element name="Carrizos"><xs:complexType>
              <xs:sequence>
                <xs:element name="Carrizo" type="CarrizoType" maxOccurs="unbounded"/>
              </xs:sequence></xs:complexType>
            </xs:element>
            <xs:element name="Coachellas"><xs:complexType>
              <xs:sequence>
                <xs:element name="Coachella" type="CoachellaType" maxOccurs="unbounded"/>
              </xs:sequence></xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:complexType name="San BernardinoType"><xs:sequence></xs:sequence></xs:complexType>
      <xs:complexType name="North CoastType"><xs:sequence></xs:sequence></xs:complexType>
      <xs:complexType name="MojaveType"><xs:sequence></xs:sequence></xs:complexType>
      <xs:complexType name="CarrizoType"><xs:sequence></xs:sequence></xs:complexType>
      <xs:complexType name="CoachellasType"><xs:sequence></xs:sequence></xs:complexType>
    </xs:schema>

Figure 6: A Graph Representation for the File in Figure 5
[Figure not reproduced. Recoverable labels: Segment, Subclasses, Attribute, String, Real, SegmentName, LonStart, Carrizo, Coachella, North Coast, Mojave.]

Our model focused on the fundamental elements of data representation structure development. These elements were concept creation and interrelationship development. The fundamental elements are independent of the languages used to describe data representation structures. Concepts and interrelationships can be expressed in a fundamental graphic representation regardless of language. Our model was designed to manage this fundamental graphic representation. Therefore, our model was also suitable for knowledge described by Extensible Markup Language (XML), XML Schema, Resource Description Framework (RDF), and other Semantic Web technologies that are able to describe data representation structures.
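As an illustration of how nested schema declarations can be read as graph edges, the sketch below walks an XML Schema document and emits one (parent, child) pair per nested element declaration. It uses Python's standard `xml.etree.ElementTree` module. The function name `schema_edges` and the abbreviated example schema are assumptions for illustration (modeled on Figure 5, with `type` attributes omitted), not the mapping procedure used in the thesis.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"  # namespace prefix as ElementTree expands it

EXAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Surface">
    <xs:complexType><xs:sequence>
      <xs:element name="Carrizos">
        <xs:complexType><xs:sequence>
          <xs:element name="Carrizo" maxOccurs="unbounded"/>
        </xs:sequence></xs:complexType>
      </xs:element>
      <xs:element name="Mojaves">
        <xs:complexType><xs:sequence>
          <xs:element name="Mojave" maxOccurs="unbounded"/>
        </xs:sequence></xs:complexType>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>
"""


def schema_edges(xsd_text):
    """Return (parent, child) pairs for nested xs:element declarations."""
    root = ET.fromstring(xsd_text)
    edges = []

    def walk(node, parent_name):
        for child in node:
            if child.tag == XS + "element":
                name = child.get("name")
                if parent_name is not None:
                    edges.append((parent_name, name))
                walk(child, name)          # this element becomes the new parent
            else:
                walk(child, parent_name)   # complexType/sequence wrappers are skipped

    walk(root, None)
    return edges


edges = schema_edges(EXAMPLE_XSD)
```

Each element name becomes a vertex and each nesting relation becomes a directed edge, which is the graph form that the rest of the model operates on.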
Knowledge described by these technologies is easily mapped to a graph, which is why the model applies to them. Figure 5 and Figure 6 show an example of mapping knowledge described by XML Schema to a graph. In this example, the correspondence between an XML file and a graph representation of a data representation structure is clearly displayed.

4.1 Data Adaptation

Graphs are sets of vertices and edges. Data representation structures can be represented as graphs in which concepts are represented as vertices and interrelationships are represented as edges. Many data representation structures are hierarchical structures, including tree structures. A hierarchical structure consists of hierarchical relationships and concept nodes without any occurrence of a cycle. Hierarchical relationships are represented by directed edges between pairs of nodes in the structure. In hierarchical structures, concepts are organized based on specified orders. The orders in hierarchical data representations are decided based on the generality of each concept. Hierarchical relationships connect concepts from general to specific levels. Hierarchical relationships are directed, either from general to specific or from specific to general concepts. In any one hierarchical structure, all hierarchical relationships point in a single direction, either toward decreasing generality or toward decreasing specificity.

Although hierarchical structures have been extensively utilized to represent the knowledge of the world, they are unable to represent that knowledge as completely as DAG structures do. Hierarchical structures are subsets of DAG structures. DAG structures are acyclic. “Acyclic” indicates that traveling along edges must not lead back to any visited node.
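The acyclicity property just defined can be checked mechanically. The sketch below uses Kahn's topological-sort algorithm, a standard technique that is swapped in here for illustration and is not taken from the thesis; the function name `is_acyclic` and the edge-list input format are assumptions.

```python
from collections import defaultdict, deque


def is_acyclic(edges):
    """Return True if the directed graph given as (source, target) pairs has no cycle."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
        nodes.update((source, target))

    # Start from nodes with no incoming edge (the most general concepts).
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(nodes)


tree = [("Movie", "Action"), ("Movie", "Comedy")]
dag = tree + [("Action", "Oceans 11"), ("Comedy", "Oceans 11")]  # two parents, still acyclic
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
```

The `dag` example shows the case that a tree cannot express: one concept with two parents. That is why hierarchical structures are described as subsets of DAG structures.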
DAG structures can be flexibly extended or shrunk. In order to utilize existing structures in the similarity decision processes, this flexibility needs to be preserved in any data adaptation required by the recommendation processes of information agents or systems. Therefore, a geometry-based solution was introduced to preserve the flexibility in the proposed framework. The adaptation associated current data representation structures with redefined geometric properties in a 3-dimensional space of Euclidean geometry. The properties of this space obey both Euclid's Postulates and Hilbert's Axioms of Geometry, including the Axioms of Incidence, Order, Congruence, Parallels, and Continuity. Any update of the structures can be mapped into the 3-dimensional space since the space has no boundary constraint. The outputs of the structure adaptation are a set of coordinates, mapping coordinates, and vectors. Mapping coordinates are required for fast online computation of similarity degrees.

Geometry enables the study of properties of given elements that remain invariant under specific transformations, and allows the given elements to be visualized. Since an ontology remains unchanged between updates of its concepts or interrelationships, the elements of an ontology structure (its nodes and edges) can be considered invariant during that period. Therefore, we were able to develop a transformation of an ontology structure in order to decide similarity among concept nodes based on geometric properties.
In our geometry-based solution, vertices were represented by points and directed edges by vectors, so that vertices and edges could be mapped into a geometric space. Each vertex was assigned a coordinate, and similarity was decided based on geometric properties.

The adaptation process manipulated data representation structures in a 3-dimensional space. An ontology structure first had to be deconstructed; deconstruction was designed to reassign geometric properties to the data representation structure. Mapping coordinates were computed next. The adaptation of each data representation structure required the creation of one new plane. In the adaptation, any node first created on plane $P_i$ was assigned a mapping coordinate on every plane $P_j$ and on $Q$, where $j < i$. Vectors were represented by the coordinates of their starting and ending points. In order to simplify the similarity computation, the closeness and the difference of semantics were determined based on the geometric properties of the structure. The properties of the planes are specified below.

Axiom 1. Any point in this 3-dimensional space is represented by a coordinate $(x, y, z)$.

Axiom 2. Let $Q$ be a plane in this 3-dimensional space. $Q$ is decided based on the following equations and constraints:
    $z = c$, where $c \ge 0$
    $(\cos\theta_i^{*})\,x - (\sin\theta_i^{*})\,y = 0$
    if $0 \le \theta_i < \pi$, then $y > 0$; otherwise, $y < 0$
    if $-\pi/2 \le \theta_i < \pi/2$, then $x > 0$; otherwise, $x < 0$

Axiom 3. Any plane $P_i$ other than $Q$ in the 3-dimensional space is described based on the following equations and constraints:
    $z = c$, where $c \ge 0$
    $(\cos(\theta_i + c_j))\,x - (\sin(\theta_i + c_j))\,y = 0$, where $(\theta_i + c_j) < \theta_{i+1}$
    if $0 \le \theta_i < \pi$, then $y > 0$; otherwise, $y < 0$
    if $-\pi/2 \le \theta_i < \pi/2$, then $x > 0$; otherwise, $x < 0$

* $\theta_i$ is given and treated as a constant when the plane is created.
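As a small numerical illustration of the plane equation in Axioms 2 and 3, the sketch below places a point on the plane with angle $\theta$ and checks that it satisfies $(\cos\theta)\,x - (\sin\theta)\,y = 0$. The helper names `point_on_plane` and `plane_residual` and the radial parameter `r` are assumptions introduced here; the sign constraints on $x$ and $y$ stated in the axioms are deliberately not modelled.

```python
import math


def point_on_plane(theta, r, z):
    """Return a point at horizontal distance r along the plane with angle theta.

    The direction (sin(theta), cos(theta)) is chosen because it makes
    cos(theta) * x - sin(theta) * y vanish identically.
    """
    return (r * math.sin(theta), r * math.cos(theta), z)


def plane_residual(theta, point):
    """Left-hand side of the plane equation; zero means the point lies on the plane."""
    x, y, _ = point
    return math.cos(theta) * x - math.sin(theta) * y


theta = 0.3
root = point_on_plane(theta, 1.0, 0.0)  # same form as the root coordinate (sin θ, cos θ, 0)
print(root, plane_residual(theta, root))
```

With `r = 1` and `z = 0` the point has the same form as the root coordinate $(\sin\theta, \cos\theta, 0)$ assigned in the adaptation algorithm of Figure 7.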
Our geometry-based solution was designed to associate concepts and relationships in data representation structures with new geometric properties. After the adaptation of a data representation structure, concepts were associated with coordinates, and interrelationships were associated with coordinates representing vectors. The formation of the new geometric properties is defined below.

Definition 1. The new geometric properties associated with any concept are formed as follows:
    $\mathrm{Prop}_{\mathrm{concept(identifier)}} = \{\theta_i,\ d_j,\ \mathrm{Coordinate}_{\mathrm{orig}},\ \{\mathrm{Coordinate}_{\mathrm{mapping}(\theta)}\}\}$
$\theta_i$ represents the plane $P_i$. $d_j$ represents the maximum number of edges to reach a node without any ascendant. $\mathrm{Coordinate}_{\mathrm{orig}}$ represents the first decided coordinate. $\{\mathrm{Coordinate}_{\mathrm{mapping}(\theta)}\}$ is a set of mapping coordinates.

Definition 2. The new geometric properties associated with any interrelationship are formed as follows:
    $\mathrm{Prop}_{\mathrm{relationship(concept(identifier1),\,concept(identifier2))}} = \{\mathrm{Coordinate}_1,\ \mathrm{Coordinate}_2,\ \mathrm{Quant}_{\mathrm{relationship(concept(identifier1),\,concept(identifier2))}}\}$
$\mathrm{Coordinate}_1$ is the starting point of the vector, $\mathrm{Coordinate}_2$ is the ending point of the vector, and $\mathrm{Quant}_{\mathrm{relationship(concept(identifier1),\,concept(identifier2))}}$ represents the quantification value of the relationship type.

The structure adaptation associated current data representation structures with redefined geometric properties in a 3-dimensional space. A definition was introduced to determine the plane assignments.

Definition 3. If the maximum number of edges for a node to reach the root node is $m$, the next available $\theta$ is $(\theta_i + m \cdot c_{unit})$.

The proposed algorithm required initialization before the adaptation processes.
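One possible in-memory shape for the property sets of Definitions 1 and 2 is sketched below. The class and field names are hypothetical and chosen only to mirror the definitions; keying the mapping coordinates by the angle of the plane they lie on is likewise an assumption, not something the text specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Coordinate = Tuple[float, float, float]


@dataclass
class ConceptProperties:
    """Geometric properties of one concept (after Definition 1)."""
    identifier: str
    theta: float                 # angle of the plane on which the concept was first placed
    depth: int                   # maximum number of edges to a node without any ascendant
    coordinate_orig: Coordinate  # the first decided coordinate
    # Mapping coordinates on other planes, keyed here by plane angle (an assumption).
    coordinate_mapping: Dict[float, Coordinate] = field(default_factory=dict)


@dataclass
class RelationshipProperties:
    """Geometric properties of one interrelationship (after Definition 2)."""
    source_id: str
    target_id: str
    start: Coordinate   # starting point of the vector
    end: Coordinate     # ending point of the vector
    quant: float        # quantification value of the relationship type


parent = ConceptProperties("Segment", theta=0.0, depth=0, coordinate_orig=(0.0, 1.0, 0.0))
child = ConceptProperties("Carrizo", theta=0.1, depth=1, coordinate_orig=(0.1, 1.0, 1.0))
child.coordinate_mapping[parent.theta] = (0.0, 1.0, 1.0)
edge = RelationshipProperties(parent.identifier, child.identifier,
                              parent.coordinate_orig, child.coordinate_orig, quant=1.0)
```

The example values are made up; only the grouping of fields follows the definitions.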
The major task of the initialization was to decide the start node. If an ontology did not have a single root node, the algorithm created a virtual root node for the structure. In the middle of the algorithm, the virtual root node was removed, and adjustments were performed to reflect the relative distance between nodes. Adding a virtual root node assured that the initial coordinate assignments followed the generality of the ontology concepts. Since a DAG structure can be separated into several small hierarchical structures with overlapping nodes, and since the coordinate assignments started from the top-level concepts, adding a virtual node ensured that more general concepts were assigned coordinates before more specific ones. The reason for assigning coordinates to the more general concepts earlier in the process was that concepts with more general semantics are acknowledged by experts and even by non-experts. Therefore, a complete set of general concepts was more likely to be available than a complete set of specific concepts. Taking the most general concept as the start point and going through all the concepts from general to specific allowed general concepts to be transformed earlier in the algorithm. Earlier adaptation meant fewer possible adjustments of the coordinates. The basic structure was fixed by the coordinates assigned to the general concepts; the algorithm was then able to determine the relatively correct coordinates for the specific concepts.

The proposed algorithm performed the data adaptation. The inputs and the outputs are the following:

Input: $G\{V, E\}$, where $V$ are the nodes (vertices) and $E$ are the edges
Output: $\mathrm{Prop}_{\mathrm{concept(identifier)}}$ and $\mathrm{Prop}_{\mathrm{relationship(concept(identifier1),\,concept(identifier2))}}$

The algorithm of the data adaptation process is shown below.
Figure 7: The Adaptation Algorithm

    // Initialization
    Set θ = (max θ_i) + c_unit;
    Create plane Q based on θ values;
    if (no one root node) {
        Create one root node;
        Assign edges between the root node and the nodes without any parent node;
        Assign the coordinate (sin θ, cos θ, -1) to the root node;
    }
    else
        Assign the coordinate (sin θ, cos θ, 0) to the root node;
    for (each node) {
        Compute the maximum number of edges to reach the root node;
        Assign the node this maximum number as its ID number;
        if (the plane P_i based on (θ + ID · c_unit) does not exist)
            Create a plane P_i;
    }
    Mark the root node as VISITED;
    Put all child nodes of the root node in the queue;

    // Structure transformation starts
    while (the queue is not empty) {
        Get the first node in the queue as the current node;
        if (current node is not marked as VISITED) {
            // x and y represent the next available incremental value on plane (θ + ID · c_unit),
            // and z represents the incremental value of the maximum z from plane P_(i-1)
            Assign the x, y, z values of the coordinate to the current node;
            Define the vector between the node and its parent node;
            Mark the current node as VISITED;
            Put all child nodes of the current node in the queue;
            for (each plane where (θ + ID · c_unit) <= θ) {
                // x and y values are calculated based on the current θ_i value of the current plane;
                // z = current z value
                Assign the mapping coordinate x, y, z;
            }
        }
        else {
            if (the node has more than one ascendant connecting to the virtual node) {
                Put the node in the waiting queue;
            }
        }
    }
    if (the root node is a virtual node) {
        Remove the vectors with the virtual root node as the starting point;
        Remove the virtual root node;
        while (the waiting queue is not empty) {
            Get the node from the waiting queue;
            Put the parent nodes in the candidate queue;
            Change the coordinate based on MAX;
            Put the node in the changing queue;
            while (the candidate queue is not empty) {
                Get the first node in the candidate queue;
                if (the node only has one child node) {
                    Change the coordinate based on MAX-1;
                    Assign new MAX as MAX-1;
                    Put the node in the changing queue;
                    Put the parent nodes in the candidate queue;
                }
                else {
                    Determine the minimum number of θ_j among all child nodes;
                    if (θ_i of the node is not equal to (θ_j - 1)) {
                        Change the coordinate based on θ_j - 1;
                        Put the node in the changing queue;
                        Put the parent nodes in the candidate queue;
                    }
                }
            }
            // sort in increasing order
            Sort the changing queue based on the coordinate;
            while (the changing queue is not empty) {
                Get the first node in the changing queue;
                for (each mapping coordinate) {
                    Change the mapping coordinate value based on the new coordinate;
                }
            }
        }
    }

Figure 8: Creation of the Plane Q

Figure 9: Adaptation of the First Node

Figure 10: Creation of the Second Plane and the Adaptation of the Second Node

Figure 11: Projection of the Second Node on the Plane Q
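The initialization phase of Figure 7 (choosing or creating a root, then labelling every node with the maximum number of edges needed to reach the root) can be sketched as follows. This is a simplified illustration under the assumption that the input is a DAG given as (parent, child) pairs; the names `assign_ids`, `plane_angle`, and `VIRTUAL_ROOT` are hypothetical, and the coordinate assignment and the later removal of the virtual root are not covered.

```python
from collections import defaultdict

VIRTUAL_ROOT = "__virtual_root__"  # hypothetical identifier for the virtual root node


def assign_ids(nodes, edges):
    """Return (root, ids), where ids[n] is the maximum number of edges from the root to n.

    A virtual root is linked to every parentless node when there is no single root.
    The edges are assumed to form a DAG.
    """
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)

    roots = [n for n in nodes if not parents[n]]
    if len(roots) == 1:
        root = roots[0]
    else:
        root = VIRTUAL_ROOT
        for r in roots:
            parents[r].append(VIRTUAL_ROOT)

    ids = {root: 0}

    def longest(node):
        # Longest path to the root: one more than the deepest parent.
        if node not in ids:
            ids[node] = 1 + max(longest(p) for p in parents[node])
        return ids[node]

    for n in nodes:
        longest(n)
    return root, ids


def plane_angle(theta, node_id, c_unit):
    """Angle of the plane that hosts a node with the given ID: θ + ID · c_unit."""
    return theta + node_id * c_unit


root, ids = assign_ids(["A", "B", "D"], [("A", "B"), ("B", "D"), ("A", "D")])
print(root, ids)
```

In the example, node D is reachable from A by one edge or by two; its ID is the larger value, so it is placed on a later plane than B.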
Figure 12: Adaptation and Projection of the Third Node

Figure 13: Creation of the Third Plane and the Adaptation of the Fourth Node

Figure 14: Projection of the Fourth Node

4.1.1 Multiple Interrelated Domain Ontologies

The data adaptation algorithm could be further applied to multiple domain ontologies. If the domain ontologies were not interrelated, each ontology was adapted on the plane $Q_i$ and the planes associated with $Q_i$, where $i$ indicates the $i$-th ontology adapted. If the ontologies were interrelated, further adaptation was required. The inputs and the outputs are the following:

Input: $\mathrm{Prop}_{\mathrm{concept(identifier)}}$ and $\mathrm{Prop}_{\mathrm{relationship(concept(identifier1),\,concept(identifier2))}}$
Output: $\mathrm{New\_Prop}_{\mathrm{concept(identifier)}}$ and $\mathrm{New\_Prop}_{\mathrm{relationship(concept(identifier1),\,concept(identifier2))}}$

The algorithm is shown below.

Figure 15: Further Adaptation of Interrelated Domain Ontologies

    for (each interrelated concept) {
        Find the maximum Z value;
    }
    Sort all z values in ascending order;
    for (all interrelated concepts except the ones with maximum Z values) {
        // start from the concept with the smallest z-value
        for (all concepts related to Q_i with the projection of the interrelated concept) {
            if (the plane θ_i + (ID + (Z - z_[subscript illegible])) · c_unit does not exist)
                Create the plane;
            Change the z value to (ID + (Z - z_[subscript illegible]));
        }
    }

An example of multiple interrelated domain ontologies and the adaptation results for $Q_i$ are shown in Figure 16 and Figure 17.
Figure 16: Multiple Interrelated Domain Ontologies

Figure 17: Results of the Further Adaptation for Multiple Domain Ontologies with Only the Plane Q_i Shown

4.2 The Concept Grouping

The semantics-based concept grouping was introduced to utilize the results of the data representation structure adaptation and to facilitate online recommendation processes. Performing the grouping as an offline task reduced the time complexity of the online similarity decision processes. How the grouping reduces the computational complexity of online operations is discussed in the next chapter.

Before our approach to the similarity decision is introduced, definitions of similarity in semantics need to be declared. First, a definition of similarity based on interrelationships is declared below.

Definition 4. If two concepts are the end points of vectors that share the same starting point, these two concepts have similarity in hierarchies.

Two concept nodes sharing the same starting point of their vectors have at least one common parent node in the data representation structure. Since both child concepts are one edge away from the parent node, they have the same hierarchy.

Definition 5. If two concepts are similar in hierarchies and their mapping z-axis values are the same, these two concepts have similarity in semantics.

Having the same parent node indicates that the hierarchies of two child nodes are similar. However, it does not imply that the two child nodes necessarily have the same generality, because the structure is not necessarily balanced. Two nodes that have the same parent node do not necessarily have the same z-axis value.
Concepts that have similarity in semantics are labelled as one group. The decisions are based on the following definitions.

Definition 6. If any two concepts share $n$ common parent nodes, the similarity degree of these two concepts is higher than that of two concepts that share $n-1$ common parent nodes.

Definition 7. A node without any ascendant forms one group alone.

4.2.1 Hierarchically Similar Groups

We developed two algorithms for deciding hierarchically similar groups. Each algorithm had its own advantage in computational resources. The inputs and the outputs for both algorithms are the following:

Input: coordinates representing the starting and the ending points of vectors
Output: groups of coordinates representing concept nodes

The first algorithm for labelling hierarchically similar groups is presented below in Figure 18.

Figure 18: The First Pseudo Codes for Deciding Hierarchically Similar Groups

    for (each node without parent node)
        Create a group;
    for (each vector) {
        Examine the x, y, z value of the starting points for the belonging group;
    }

In the worst case, the memory allocation required space on the scale of $n^3$, where $n$ was the number of concepts in an ontology. The computational complexity is $O(1)$ for the initialization step. For the grouping operation, the iteration runs for $O(n^2)$ due to the number of vectors. In total, the computational complexity of the first algorithm is $O(n^2)$.
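The single pass over the vectors in the first algorithm can be sketched as a lookup keyed by the starting point. The sketch below is an illustration only: the function name `hierarchical_groups` is hypothetical, and a Python dictionary stands in for the preallocated coordinate-indexed memory that the text assumes, so its memory behaviour differs from the stated worst case.

```python
from collections import defaultdict


def hierarchical_groups(vectors):
    """Group the end points of (start, end) vectors by their shared starting point.

    Nodes that never appear as an end point have no ascendant and each form
    a group alone (Definition 7).
    """
    by_start = defaultdict(list)
    starts, ends = set(), set()
    for start, end in vectors:
        by_start[start].append(end)
        starts.add(start)
        ends.add(end)

    groups = [[node] for node in sorted(starts - ends)]
    groups.extend(by_start[s] for s in sorted(by_start))
    return groups


# Illustrative coordinates: a root with two children, one of which has a child.
top = (0.0, 1.0, 0.0)
a = (1.0, 2.0, 1.0)
b = (2.0, 2.0, 1.0)
c = (1.0, 3.0, 2.0)
print(hierarchical_groups([(top, a), (top, b), (a, c)]))
```

Each vector is examined once, which matches the $O(n^2)$ bound given for the grouping operation when the number of vectors is on the scale of $n^2$.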
The second algorithm for labelling hierarchically similar groups is presented below:

Figure 19: The Second Pseudo Codes for Deciding Hierarchically Similar Groups

    Sort the starting points of the vectors based on the coordinates;
    while (not all starting points are VISITED) {
        if (there is a next point) {
            if (the coordinates of the next point are different from those of the current point) {
                Create a new group;
                Put all end points of the current points in the group;
            }
        }
        // ending condition
        else {
            Create a new group;
            Put all end points of the current points in the group;
        }
        Mark the point as VISITED;
    }

With a sorting technique of $O(m \lg m)$, the computational complexity of the initialization was $O(n^2 \lg n)$. For memory consumption, the algorithm required space on the scale of $n^2$.

4.2.2 Semantically Similar Groups

Deciding semantically similar groups required stricter constraints. There were two algorithms to perform the grouping. The first algorithm (Figure 20) required more memory space, $O(d'^2 b^2)$, but resulted in lower computational complexity, $O(n^2)$. The second algorithm (Figure 21) consumed less memory, $O(n^2)$, but yielded higher computational complexity, $O(n^2 \lg n)$. Here $n$ was the number of nodes, $d'$ was the maximum, over all nodes in the structure, of the minimum number of edges to reach the root node, and $b$ was the branch factor. Both proposed algorithms performed grouping of semantically similar concepts. The inputs and the outputs for both algorithms are the following:

Input: coordinates representing the starting and the ending points of vectors
Output: groups of coordinates representing concept nodes

The first algorithm is described below.
Figure 20: The First Pseudo Codes for Deciding Semantically Similar Groups

    for (each node without parent node)
        Create a group;
    for (each vector) {
        Examine the x, y, z value of the starting points, the z value of the end points, and the relationship types for the belonging group;
    }

The initialization step allocated memory space. In the worst case, $d'$ and $b$ were on the scale of $c' \cdot n$, where $0 < c' < 1$, so the memory allocation required space on the scale of $n^4$. The computational complexity was $O(1)$ for the initialization step. For the grouping operation, the iteration ran for $O(n^2)$ due to the number of vectors. In the iteration, each operation took $O(c)$, where $c$ equals 4. The total computational complexity of the first algorithm was $O(n^2)$.

For the second algorithm, which is shown in Figure 21, the initialization step performed sorting. If a sorting technique of $O(m \lg m)$ was employed, the computational complexity became $O(n^2 \lg n)$: $m$ equals $n^2$, the maximum scale of the number of edges, and $n^2 \lg n^2$ equals $2 n^2 \lg n$. Therefore, the computational complexity for the worst case was $O(n^2 \lg n)$. Creating temporary groups required $O(n^2)$ since there were $n^2$ iterations to run. Since there were $c \cdot n^2$ edges, there were $c \cdot n^2$ starting and ending points. Creating new groups also resulted in $O(n^2)$ since the operations ran for $n^2$ iterations. For memory consumption, the algorithm required space on the scale of $n^2$.
Figure 21: The Second Pseudo Codes for Deciding Semantically Similar Groups

    // Initialization
    Sort the starting points of the vectors based on the coordinates;
    Sort by the z value of the ending points;

    // Grouping
    while (not all starting points are VISITED) {
        if (there is a next point) {
            if (the coordinates of the next point are different from those of the current point) {
                Create a new temporary group;
                Put all end points of the current points in the temporary group;
            }
        }
        // ending condition
        else {
            Create a new temporary group;
            Put all end points of the current points in the temporary group;
        }
        Mark the point as VISITED;
    }
    for (each temporary group) {
        while (not every end point in the temporary group is marked as VISITED) {
            if (there is a next point) {
                if (the z value of the next point is different from that of the current point) {
                    Create a new group;
                    Put all ending points with the same z value in the group;
                }
            }
            // ending condition
            else {
                Create a new group;
                Put all the ending points with the same z value in the group;
            }
            Mark the point as VISITED;
        }
    }

The second algorithm performed better in memory consumption. The first algorithm yielded lower computational complexity in time. The outputs of both algorithms were groups of semantically similar concepts. The choice of algorithm should be based on the available system resources. In our research, we chose the first algorithm since a low computational complexity in time was important.

Different relationship types affected the group labelling results. If two concepts were semantically similar and had the same relationship types, these two concepts were labelled as one group.
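The two grouping strategies can be compared in a short sketch. Both functions below group end points by the same key (starting point, z value of the end point, relationship type); one uses a keyed lookup in a single pass, the other sorts first and then sweeps over runs of equal keys. The function names and the (start, end, relationship type) triple format are assumptions made for illustration, and the dictionary in the first variant does not reproduce the memory layout analysed in the text.

```python
from collections import defaultdict
from itertools import groupby


def _key(vector):
    start, end, rel_type = vector
    return (start, end[2], rel_type)


def semantic_groups_keyed(vectors):
    """Single pass: look up the group of each vector by its key."""
    groups = defaultdict(list)
    for vector in vectors:
        groups[_key(vector)].append(vector[1])
    return sorted(sorted(g) for g in groups.values())


def semantic_groups_sorted(vectors):
    """Sort by key, then start a new group whenever the key changes."""
    ordered = sorted(vectors, key=_key)
    return sorted(sorted(v[1] for v in run) for _, run in groupby(ordered, key=_key))


# Illustrative data: three children of one parent, one of them at a different z level.
p = (0.0, 1.0, 0.0)
a = (1.0, 2.0, 1.0)
b = (2.0, 2.0, 1.0)
c = (3.0, 2.0, 2.0)
vectors = [(p, a, "IS-A"), (p, c, "IS-A"), (p, b, "IS-A")]
print(semantic_groups_keyed(vectors))
print(semantic_groups_sorted(vectors))
```

The keyed variant avoids the sort and therefore the extra logarithmic factor; the sorted variant needs no lookup structure beyond the sorted list, which mirrors the time and memory trade-off described above.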
One concept may be included in more than one group in a DAG structure: each parent node implies inclusion in at least one group. Performing the grouping offline decreased the computational complexity of the online similarity decision making.

4.2.3 Priority Assignments and Priority Matrices

Degrees of similarity among concepts in the same group may differ. Therefore, subgroups may exist, and the priorities assigned to these groups and subgroups were different. Decisions on priority assignments were based on the following definitions:

Definition 8: The priority assigned to one group is the quantification of the similarity degrees representing the whole group.

In order to present degrees of similarity, two values were calculated and presented in a matrix. The first value was decided based on the number of parent nodes that two concept nodes shared. The principle behind the calculation is described in Definition 9 below.

Definition 9: If any two concepts share $n$ common parent nodes, the priority of the two concepts is higher than that of two concepts that share $(n-1)$ parent nodes.
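The ordering stated in Definition 9 depends only on how many parents two concepts have in common, which can be counted as in the sketch below. The helper names `parent_sets` and `shared_parent_count` and the (parent, child) edge format are assumptions for illustration; the sketch does not implement the group-splitting logic of the first value decision algorithm.

```python
from collections import defaultdict


def parent_sets(edges):
    """Map each child to the set of its parents, given (parent, child) pairs."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return parents


def shared_parent_count(parents, a, b):
    """Number of parent nodes that concepts a and b have in common."""
    return len(parents[a] & parents[b])


# Illustrative DAG: A and B share two parents, A and C share one.
edges = [("P1", "A"), ("P2", "A"), ("P1", "B"), ("P2", "B"), ("P1", "C")]
parents = parent_sets(edges)
print(shared_parent_count(parents, "A", "B"), shared_parent_count(parents, "A", "C"))
```

Under Definition 9, the pair (A, B) would receive a higher priority than the pair (A, C) in this example.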
Figure 22: Pseudo Codes for the First Value Decision Algorithm

    Set the first value of every group to 1;
    while (not all groups are visited) {
        for (each concept node in the groups) {
            if (the current concept node has more than one parent node and is not visited through) {
                if (all other concepts in the group have the same parent and the same relationship type)
                    Increase the first value by 1;
                else
                    if (the group with the same parents and relationship types has been created)
                        Add the concept;
                    else
                        Create a temporary group;
            }
        }
    }
    for (each temporary group) {
        if (relationships are not repeated)
            Make the temporary group into a new group;
    }

Based on Definition 9, the priority of two concepts sharing $n$ common parent nodes is higher than that of two concepts sharing $(n-1)$ common parent nodes. For concept nodes, hierarchical relationships suggest the inheritance of properties from the parent nodes. Inheriting from the same parent node indicates that the nodes have common properties. The priority is higher if the concepts within a group share more parent nodes than other concepts do. An algorithm was proposed to decide the first values of hierarchically or semantically similar groups; it is shown in Figure 22.

The priority of a group was further described by another parameter, the number of common child nodes. The priority was decided based on the sum of the normalized weights of the relationships, excluding the exhaustive groups. The normalized weight was calculated based on a single child node and its relationships. A dynamic parameter $p_i$ for the normalization was required to be decided based on the following definition.
Definition 10: If the highest weight of all relationship types is $w_{top}$, and the number of relationships within a concept group in which all concepts share the same parent node(s) is $r_i$, the normalization parameter $p_i$ equals $(w_{top} \cdot r_i)^{-1}$.

The parameter was dynamic because the number of relationships varies from group to group. The second value was calculated by the algorithm shown in Figure 23. The total of the regular weights was multiplied by (highest weight of all relationship kinds * number of relationships). IS-A relationships indicated a high possibility of inheriting similar properties since IS-A relationships represented a complete inheritance. If a child belonged to only part of the group, the group would be further divided. However, different relationships did not cause further division.

Figure 23: Pseudo Codes for the Second Value Decision Algorithm

    Set the second value of every group to φ;
    while (not all groups are visited) {
        for (each concept node in the groups) {
            if (the current concept node has more than one child node and is not visited through) {
                if (all other concepts in the group have the same child)
                    if (the value equals φ)
                        Change the second value to (the weight of the relationship type);
                    else
                        Increase the second value by (the weight of the relationship type);
                else
                    if (the group with the same child nodes and relationship types has been created)
                        Add the concept;
                    else
                        Create a new group;
            }
        }
        Calculate p_i;
    }

Based on the weight computation, the priority could now be described by a matrix. The definition of the priority matrix is the following.

Definition 11: A 1 by 2 matrix is represented as $M_i$. $M_i = [\text{the first value};\ \text{the second value}]$. $M_i$ preserves all mathematical properties of a matrix.
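A small worked example may help with Definitions 10 and 11. The sketch below reads the normalization parameter as $p_i = (w_{top} \cdot r_i)^{-1}$ and applies it to the sum of the relationship weights of a group; treating the second value as that normalized sum, and taking $r_i$ to be the number of weights supplied, are assumptions made for illustration rather than a statement of the original implementation. All function names are hypothetical.

```python
def normalization_parameter(w_top, r):
    """p_i = (w_top * r_i)^(-1), following Definition 10."""
    return 1.0 / (w_top * r)


def normalized_weight_sum(weights, w_top):
    """Sum of relationship weights scaled by p_i (assumes r_i = len(weights))."""
    return sum(weights) * normalization_parameter(w_top, len(weights))


def priority_matrix(first_value, second_value):
    """The 1-by-2 priority matrix M_i = [first value; second value] as a tuple."""
    return (first_value, second_value)


# Two relationships with weights 1.0 and 0.5; the highest possible weight is 1.0.
second = normalized_weight_sum([1.0, 0.5], w_top=1.0)
print(priority_matrix(3, second))
```

With this reading, the scaled sum never exceeds 1, because no individual weight can exceed $w_{top}$.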
4.3 Similarity Computations

Similarity computations provided information on the similarity between two concepts. Similarity was measured by similarity degrees. The difference in similarity degrees was abstract and was decided based on geometric properties. Nevertheless, the difference in similarity degrees in semantics was relative, because nodes plotted on different planes in the 3-dimensional space had different generality. Each plane was expressed as the function $X = n$, where $n$ was a positive integer. A smaller $n$ value meant higher generality in semantics. In geometry, the difference in $n$ values was measured in coordinate units: as long as the differences in units were the same, the difference between two larger $n$ values equaled the difference between two smaller $n$ values. Therefore, similarity computations that take semantics into account must share one common node for the comparison. At any instance, the number of concepts included in a similarity computation was 2. The similarity degrees from several computations could only be compared if each computation included one common node.

Two steps were required for the similarity computation. First, for any two concepts, the y-axis values of their coordinates were compared. The second step was performed based on different conditions. If the x-axis values were the same, the degree of similarity was $(\text{difference of z-axis value})^{-1}$. Otherwise, the x-axis values of the mapping coordinates were compared. The mapping coordinates on the (ID-1) plane of the current mapping coordinates were computed and compared again until the x-axis values of the mapping coordinates were the same and the two nodes had the same ancestor on this $X = c$ plane.
If there were $c$ mappings, the degree of similarity was defined as:

Equation 1: $\text{Degree of Similarity} = w_{rel} \cdot \left[\, |\text{difference of z-axis value}| \times \sum_{i=0}^{c-1} 10^{i} \,\right]^{-1}$

where $w_{rel}$ is the weighting of the relationship type.

The number of mappings indicated the distance to travel in order to reach the first common ancestor node. Each mapping involved a change of the hierarchical level. The relative difference of the hierarchical levels was represented by mathematical scales. One mapping resulted in a one-level decrease in the hierarchy. In order to represent the decrease, the decimal base was included in Equation 1: every change in the hierarchical level was expressed by a power of 10. For the degree of similarity, a smaller value indicated less similarity between two concepts.

Chapter 5 Profile Management

In order to provide customization based on user preference, information associated with user preference has to be collected in advance. One method of collecting this information is to survey users directly. However, users may feel interrupted in their information searching processes or may provide emotion-affected responses to surveys. Another problem of surveying users is cognitive bias: a user may describe an action in one way but actually act in another. Therefore, observing user behavior and mining user behavior patterns is a more objective way of discovering user preference. Observing user behavior means obtaining user data from the interactions between a user and a system. User data have to be further processed in order to obtain user behavior patterns. Based on user behavior patterns, user preference can be discovered, and information can be recommended based on user preference. In our system, user data were obtained by recording user query histories.
User data first had to be mapped to the concepts in the domain ontology. User data were analyzed individually in the beginning. The results of the analysis were the concepts in ontologies and the priorities of the queried concepts. Priority was decided based on weighting functions provided by system designers, or purely based on the number of times that concepts were queried. User data were mapped to these concepts in order to be stored in user profiles and to be further processed. Based on the mapped concepts and the numeric values associated with the queried concepts, user behavior patterns were discovered from user data with the proposed statistics-based method introduced in Section 5.2. The behavior patterns of an individual user were also compared to the patterns that occurred in user groups. User groups were decided based on our statistics-based method. In summary, user profiles were established in the following phases:

Phase 1: Collecting user data based on user queries.
Phase 2: Analyzing user information and mapping it to the corresponding concepts in ontologies in order to obtain user behaviors.
Phase 3: Processing user behavior patterns to generate user profiles of user preference.

User preference toward specific concepts was represented by numeric values, which were the assigned priorities of the concepts. Priority could simply be presented as the sum of the frequencies with which the concepts were queried by each user. Assignments may involve weighting functions defined by system developers. User behavior patterns were recorded continuously. User behavior patterns were quantified and transformed to numeric-based representations. Numeric-based representations correspond to the results of the concept priority assignments. Representation formats are discussed in Section 5.2.
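The simplest priority assignment described above, counting how often each concept was queried and optionally passing the count through a designer-supplied weighting function, can be sketched as follows. The function name `concept_priorities`, the `weight` callback signature, and the example concept names are hypothetical, and the query history is assumed to have already been mapped to ontology concepts.

```python
from collections import Counter


def concept_priorities(query_history, weight=None):
    """Return {concept: priority} for one user's history of queried concepts.

    Without a weighting function the priority is the query frequency.
    Otherwise weight(concept, count) supplies the priority.
    """
    counts = Counter(query_history)
    if weight is None:
        return dict(counts)
    return {concept: weight(concept, n) for concept, n in counts.items()}


history = ["fault", "earthquake", "fault", "segment", "fault"]
print(concept_priorities(history))
print(concept_priorities(history, weight=lambda concept, n: 0.5 * n))
```

The resulting dictionary is one possible numeric-based representation of a user behavior pattern: a set of queried concepts paired with their assigned priorities.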
5.1 User Data Analysis

User data collection extracted information from user query and search histories and mapped that information to concepts in an ontology. Different search engines may allow users to query in different formats. Two common formats are keyword search [1] [25] [53] [80] and natural-language question input [2] [26]. Both formats were compatible with our semantic-based framework. After a user submitted a query, our method first performed data analysis by extracting information from the query and transforming it into a representation of a concept in the domain ontology. In our previous work [51] [52], a semantic-based query interpreter was developed. Using the semantic-based query interpreter saved time by extracting the concept directly instead of applying natural language rules. There are more than 1000 common rules for natural language processing, and interpreting the meaning of input sentences with them requires non-constant time. Utilizing ontologies made information extraction and concept mapping from queries more efficient. The algorithm developed for semantic-based query interpretation required $O(m \cdot l^{2})$ time, where m is the size of the metadata file associated with the domain ontology and l is the length of a user query. l was treated as a constant, since a user query was rarely longer than 100 words. m was also treated as a constant, since the metadata associated with an ontology are rather static: the size of the metadata file changes only when the ontology is updated, which does not happen continuously in real time. The extracted concepts of user queries were saved as user query histories and later used for analysis.
The analysis was performed by mining the collected user information in conjunction with an ontology. Concepts that were queried several times and used directly in query generation were assigned a higher priority if no other weighting function was applied. Priority assignments are explained in the next section.

5.1.1 Priority Assignments and Weighting Functions

User behavior patterns were represented by two sets of elements: a set of queried concepts and a set of numeric representations associated with those concepts. The numeric representation indicated the priority assigned to a concept. A higher value indicated a higher priority, and a higher priority indicated greater user interest in the concept. Priority assignment methods may differ from one search engine to another, and several methods have been proposed. The first and most basic method assigns priority based on the number of times a concept has been queried: the more often a user queries a concept, the more interested the user is assumed to be in information related to it. The second method conducts user surveys [60] [44] and bases priority assignments on the survey ratings. The third method measures the involvement time with information associated with certain concepts [61]. In our experiment design, we implemented our system with the first method. A more fundamental method was chosen in order to demonstrate the effectiveness of the semantic-based methods without other possible distortions. Possible distortions from the second method included user attitude toward surveys, the completeness of the survey design, and the subjective judgment of users at the moment.
Possible distortions from the third method included variation in users' connection speeds, difficulty in specifying the end point of the involvement time, and difficulty in verifying the actual involvement time.

5.2 User Profiling

Most recommendation systems consider only what users like. Considering what users dislike as well helps a recommendation system generate more accurate recommendations. Moreover, the preference of a user cannot be described completely with only two degrees, liking and not liking. Fuzzy logic can be employed to express user preferences, and in our system we incorporated fuzzy logic to represent preference levels. Four mandatory levels and one optional level of user preference were computed from user behavior histories. These five levels were determined by the mean and the variance of the quantified representation of user behaviors. The levels are:

- Larger than the mean plus the variance (Pre1): The items in this level were the items that the user preferred.
- Between the mean plus the variance and the mean minus the variance (Pre2): Items fell into this level because the user had moderate interest in them. In statistic-based approaches, using the mean and variance to evaluate a distribution of numbers is more accurate when the distribution is normal. Although quantified user behavior patterns were not guaranteed to be normally distributed, the mean still approximated the moderate opinion of an individual user, and the variance expressed the spread within a single user's quantified behavior patterns.
- Less than the mean minus the variance (Pre3): The items in this level were the items that the user did not prefer but did not dislike.
- Zero (Pre4): This level contained the items that the user either had no interest in at all or had not yet requested. Users may not be interested in the items in this level, but they do not necessarily dislike them; that an item has not been requested may simply indicate that the user has not noticed it. Therefore, it was important to place items and concepts that had never been queried in a separate level rather than in the previous level. When user grouping was performed, this level was excluded by our proposed algorithm.
- Specified as disliked items (Pre5, optional): Establishing this level required user surveys or additional functionality in the user query process to obtain the necessary information. Not every system included surveys or additional functionality for gathering information on what users disliked. However, including this level contributed to the completeness of our proposed framework. A condition was set in the proposed user grouping algorithm to keep this level optional.

The proposed binary representation of profiles was introduced in our framework in order to reduce online time complexity. This binary representation supported the collaborative portion of our hybrid method. In conventional collaborative filtering approaches, such as the nearest-neighbor algorithm, computation time increases linearly as the number of users or items grows. Binary representations of profiles enabled similarity decisions to be made in constant time.

5.2.1 Individuals

Our proposed algorithm for generating the binary representation of user behavior patterns is shown below.
The design principle behind the algorithm was that behavior patterns related to groups containing concepts with the largest z-axis coordinate value were represented first, followed by groups containing concepts that were the starting points of the represented groups with the same projected x and y coordinate values. The purpose of this design was to represent each distinguishable sub-graph in a DAG one at a time. Each distinguishable sub-graph represented sub-domain knowledge in the DAG. Representing user behavior patterns related to one sub-graph as a continuous sequence eased the user grouping task in the next step, because segments of semantic similarity were easily identified and grouping could take into account the semantic closeness among concept groups.

Figure 24: Pseudo Codes for the Algorithm Deciding a Sequence of Binary Representations

    while (not all groups are visited) {
        Create a digit for unvisited groups having concepts with largest z-axis coordinate value;
        while (any node in the current digit-created group(s) has a parent node) {
            for (each parent node of the concepts in the digit-created group) {
                Find its belonging group;
                if (the group is unvisited) {
                    Create a digit;
                    Make the group the current digit-created group;
                }
            }
        }
    }

After the sequence of the binary representation was decided, the digits could be filled in. The proposed method also ensured zero conflict of user preference, meaning no overlapping value of 1 between Pre1 and Pre5, Pre2 and Pre5, or Pre3 and Pre5. If a conflict occurred, the digit in Pre1, Pre2, or Pre3 was set to 0. A user behavior representation example is shown in Table 10. Before user grouping was performed, a decimal number was computed for the binary representation of each level.
Table 10: A Binary Representation Example for the Individual User Profile

    User i   L5   L18   L2   L3   L8   L20   Decimal Value
    Pre1     1    0     0    0    0    0     Value1
    Pre2     0    1     0    0    1    0     Value2
    Pre3     0    0     0    0    0    0     Value3
    Pre4     0    0     1    1    0    1     Value4
    Pre5     0    0     0    0    0    1     Value5

5.2.2 Groups

Based on the computed decimal values, user groups could be determined. The user grouping algorithm is shown in Figure 25.

Figure 25: Pseudo Codes of the Grouping Algorithm

    Compare the decimal value sets of all users except the level 4 values;
    if (more than one user has the same decimal value set) {
        Create a group;
        Remove users in this group from the computation;
    }
    while ((the number of groups is less than the number of distinguishable sub-graphs) and
           (the maximum shifting factor is less than or equal to the total number of digits)) {
        if (any number in the current decimal value set is odd)
            Decrease the number by 1;
        Find the minimum z-axis value for the maximum z-axis value of the concepts within one segment;
        for (each segment) {
            Find the leftmost digit representing the group with the minimum z-axis value;
            Find the rightmost digit representing the group with the z-axis value which is equal to or less than the minimum z-axis value;
            if (any digit value equals 1 within the range)
                Change the digit value of groups including the parent node to 1;
            Set the shifting factor as the difference of the digit positions plus 1;
            Shift the rest of the digits in the segment using the shifting factor;
        }
        Calculate new decimal value sets;
        Compare the decimal value sets of all users;
        if (more than one user has the same decimal value set) {
            Create a group;
            Remove users in this group from the computation;
        }
    }
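The profile construction of Sections 5.2.1 and 5.2.2 can be illustrated with a minimal Python sketch. It covers only the assignment of preference levels, the binary rows, the decimal values, and the first (exact-match) step of the grouping algorithm; the digit-shifting loop of Figure 25 is omitted. Several choices are assumptions of this sketch rather than statements from the text: the mean and variance are taken over non-zero priorities only, the population variance is used, the boundaries of Pre2 are inclusive, the leftmost digit is the most significant, and the digit order is supplied by the caller instead of being derived as in Figure 24. All function names and the sample priorities are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance

LEVELS = ("Pre1", "Pre2", "Pre3", "Pre4", "Pre5")


def preference_levels(priorities, disliked=frozenset()):
    """Split concept groups into the five preference levels."""
    queried = [v for v in priorities.values() if v > 0]  # assumption: zeros excluded
    mu = mean(queried) if queried else 0.0
    var = pvariance(queried) if queried else 0.0
    levels = {name: set() for name in LEVELS}
    for group, value in priorities.items():
        if value == 0:
            levels["Pre4"].add(group)      # never queried
        elif value > mu + var:
            levels["Pre1"].add(group)      # preferred
        elif value < mu - var:
            levels["Pre3"].add(group)      # not preferred, not disliked
        else:
            levels["Pre2"].add(group)      # moderate interest
    levels["Pre5"] = set(disliked)         # optional level
    for name in ("Pre1", "Pre2", "Pre3"):  # zero-conflict rule with Pre5
        levels[name] -= levels["Pre5"]
    return levels


def decimal_values(levels, order):
    """Turn each level into a bit row over `order` and read it as a binary number."""
    result = {}
    for name in LEVELS:
        bits = "".join("1" if g in levels[name] else "0" for g in order)
        result[name] = int(bits, 2)
    return result


def group_users(decimal_sets):
    """First step of the grouping: users with identical sets, ignoring Pre4."""
    buckets = defaultdict(list)
    for user, d in decimal_sets.items():
        key = (d["Pre1"], d["Pre2"], d["Pre3"], d["Pre5"])
        buckets[key].append(user)
    return [sorted(users) for users in buckets.values() if len(users) > 1]


# Hypothetical priorities chosen so that the bit rows match Table 10.
order = ["L5", "L18", "L2", "L3", "L8", "L20"]
priorities = {"L5": 4, "L18": 2, "L2": 0, "L3": 0, "L8": 2, "L20": 0}
profile = decimal_values(preference_levels(priorities, {"L20"}), order)
```

Because each level is reduced to a single integer, comparing two users on one level is a single equality test, which is the property the binary representation is intended to provide.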
5.3 Other Profiles

In order to decrease the time complexity of online recommendation, two more profiles related to user behavior patterns had to be established. The first specified the sorted priority of each concept within a user group. The second specified the sorted priority of each concept within a single concept group, either a semantically similar group or a hierarchically similar group.

Chapter 6 Similarity Decision Mechanisms

We proposed an algorithm that performs similarity decisions in order to provide online recommendation decisions. Before the algorithm could be developed, several essential numbers had to be determined. The first was the set of numbers j_i, where j_1 is the number of projections and j_2 is the number of inverse projections used to select candidates for recommended concepts. The decision was based on the following definition:

Definition 6: The set of numbers j_i is determined by a probability space, P(like(concept)). The total number of elements in the space equals the total number of concepts. j_i are computed based on the constraint: Max( Ave( P(like(j_1-link upper) | like(current)) ), Ave( P(like(j_2-link lower) | like(current)) ) ).

Another essential number was the constant C_topi, which determined the maximum number of top-priority concepts included in the recommendation process. This number differs from group to group and was decided based on the following definition:

Definition 7: C_topi is the total number of queried concepts minus the variance of the distribution of the number of queried concepts belonging to any single user group.
The next important number was the maximum number of concepts required by a recommendation decision. This number was fixed as a constant, r_i. The algorithm for online recommendation decisions is shown below.

Figure 26: Pseudo Codes of the Online Recommendation Decision Algorithm

    if (the current user belongs to any user group) {
        for (C_topi concepts in the group profile; the number of concepts in the recommendation queue is less than r_i)
            if (the C_topi-th concept belongs to the same groups as the queried concept) {
                Choose the group with the highest priority value;
                Add the C_topi-th concept to the recommendation queue;
            }
    } else {
        for (C_topi concepts in the individual profile; the number of concepts in the recommendation queue is less than r_i)
            if (the C_topi-th concept belongs to the same groups as the queried concept) {
                Choose the group with the highest priority value;
                Add the C_topi-th concept to the recommendation queue;
            }
    }
    while (the number of concepts in the recommendation queue is less than r_i) {
        for (every semantically similar group that the queried concept belongs to, starting with the one with the highest priority but not in the queue)
            Add the most voted concept to the recommendation queue;
        for (every hierarchically similar group that the queried concept belongs to, starting with the one with the highest priority but not in the queue)
            Add the most voted concept to the recommendation queue;
        for (every semantically similar group that the queried concept belongs to, starting with the one with the highest priority;
             for each concept in the group but not in the queue, starting with the one with the highest priority)
            Add the concept to the recommendation queue;
        for (every hierarchically similar group that the queried concept belongs to, starting with the one with the highest priority;
             for each concept in the group but not in the queue, starting with the one with the highest priority)
            Add the concept to the recommendation queue;
        int j = 1;
        while (not all concepts in the ontology are visited) {
            Visit all concepts that are j edges away from the concept;
            Compute all similarity;
            Sort based on similarity;
            if (the number of concepts in the recommendation queue is less than r_i)
                Add the concept to the recommendation queue;
            else
                Break;
        }
        Break;
    }
    Return all concepts in the recommendation queue;

First-time users and users with too little data to belong to any group were treated as users for whom the system could not locate a group. When recommendations could not be provided from user profiles and semantic group information, computation was required to locate the concepts in the ontology most similar to the queried concept. The computation was based on the equation given in Section 4.3.
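The last stage of the algorithm, in which concepts are visited at increasing edge distance from the queried concept and ranked by similarity, can be sketched as a breadth-first traversal. This is an illustration under stated assumptions, not the implemented system: the ontology is reduced to an undirected adjacency dictionary, the similarity measure is passed in as a function (standing in for the equation of Section 4.3), and the names semantic_fallback, graph, and scores, as well as the toy concepts A to D, are hypothetical.

```python
def semantic_fallback(graph, queried, similarity, r, queue=()):
    """Fill the recommendation queue up to r concepts.

    graph: dict mapping a concept to its directly linked concepts (assumed undirected).
    similarity: function (queried, candidate) -> number; larger means more similar.
    queue: concepts already chosen by the earlier, profile-based stages.
    """
    result = list(queue)
    seen = {queried}
    frontier = [queried]
    while frontier and len(result) < r:
        # Collect the concepts one more edge away than the current frontier.
        ring = []
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    ring.append(neighbour)
        # Rank this ring by similarity to the queried concept.
        ring.sort(key=lambda c: similarity(queried, c), reverse=True)
        for concept in ring:
            if len(result) >= r:
                break
            if concept not in result:
                result.append(concept)
        frontier = ring
    return result


# Toy graph and similarity scores, invented for this example only.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
scores = {"B": 0.5, "C": 0.9, "D": 0.7}
sim = lambda queried, candidate: scores[candidate]

top_two = semantic_fallback(graph, "A", sim, r=2)
top_three = semantic_fallback(graph, "A", sim, r=3)
```

In this sketch, nearer concepts always precede farther ones, and similarity only orders concepts at the same distance; whether the original system ranked across distances is not stated in the text.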
If group user profiles and individual user profiles were insufficient to provide recommendations, recommendations were provided without consideration of user preference. Recommendation without user profiles was supported by the abundant information in ontologies. Two cases are provided to demonstrate how a recommendation based only on semantics is reached. The recommendations were based on the partial earthquake domain ontology shown in Figure 27.

Figure 27: The Partial Detailed Earthquake Domain Ontology

Before recommendations could be provided based on semantics, the ontology adaptation, grouping, and priority assignment processes were required. The results of the grouping and priority assignment processes are shown in Table 11.

Table 11: Results of Grouping and Priority Assignments

    Group of Concepts                                                              Priority
    Surface                                                                        1
    Fault                                                                          1
    Segment, San Andreas, Sierra Madre, Lone Tree, Kane Spring                     1
    Carrizo, Cholame, Coachella, Mojave, North Coast, Cucamonga, San Fernando      1
    Carrizo, Cholame, Coachella, Mojave, North Coast                               2
    Cucamonga, San Fernando                                                        2

The recommended concepts associated with queried concepts and decided by the proposed similarity model are demonstrated in the following cases:

Case 1: If the queried concept was Cucamonga, the recommended concept was San Fernando. This was because the group containing both concepts had the higher priority: its assigned priority was 2, and the value of C_top here was set to 1. The other group containing the concept Cucamonga had a priority of only 1.

Case 2: The queried concept was Fault. The results differed under two different assumptions.

Assumption 1: The weighting values for both relationship types were set to 1. The recommended concepts were Crust, Segment, San Andreas, Sierra Madre, Lone Tree, and Kane Spring. No other concept was in the group of the Fault concept, so similarity degree computations were required. The concepts with the maximum degree were Crust, Segment, San Andreas, Sierra Madre, Lone Tree, and Kane Spring.

Assumption 2: The weighting values for the two relationship types were different. Since the IS-A relationship type indicated more complete semantics than the PART-OF relationship type, the weighting value for the IS-A relationship type was higher. The similarity degree for the Crust and Segment concepts was now lower than the similarity degree for the San Andreas, Sierra Madre, Lone Tree, and Kane Spring concepts. Therefore, the recommended concepts were San Andreas, Sierra Madre, Lone Tree, and Kane Spring.

Case 1 represented a normal situation in the recommendation process. Case 2 provided an extreme example.

Chapter 7 Implementation of the Semantics-based Customization Framework

In order to utilize our semantic-based framework, an online information search engine was implemented. Users could access this online search engine through a browser. The search engine was designed to provide semantic-based information recommendations to users based on our semantic-based framework, and it performed semantic-based recommendation in constant time. In Section 7.2, the time complexity is analyzed.
The analysis indicated that our proposed semantics-based customization framework provided efficient online recommendation decisions.

7.1 System Architecture Designs

The system architecture of the information search engine includes four major components: a query processor, a metadata manager, a profiling manager, and a similarity decision maker. The system architecture is shown in Figure 28. The query processor first accepts user queries, extracts information from the queries, and maps the information to concepts in the ontology. In doing so, the query processor performs the initial work of the user data analysis. The mapped concept information is sent to the profiling manager and the similarity computation manager for further processing. The profiling manager accepts information from the query processor and continues the user data analysis. User profile updates are based on the results of the user data analysis. The similarity computation manager decides recommendations based on the information from the query processor, the information in the user profile databases, and the ontology and associated metadata in the metadata databases. Computed recommendation results are sent back to the query processor. The query processor generates a browser-compatible template to display the query results and recommendations for users. A recommendation process is completed after the templates are sent to users.

Figure 28: An Experimental System Architecture (recoverable labels: Profiling Manager, Metadata Manager, Ontology Adaptor, User side, Server side)

The system was implemented on a Java-based platform. Jakarta-Tomcat was the platform for the system development. Jakarta-Tomcat was chosen because it is free and well developed by the Apache research team [71].
7.2 Online Performance Analysis

Only two operations had to be performed online. The first was the recommendation decision. In the worst case, every step of the online recommendation decision algorithm was executed in order to provide a recommendation. Finding the user group to which the current user belonged took O(1), since the metadata had been created during the user grouping task. Matching the user's queried concept with the C_topi top-priority concepts in a group was bounded by O( max( sum of the number of concepts for all groups in the first preference level, max( the number of concepts that a group can have ), C_topi ) ), which was loosely bounded by the number of relationships in the ontology. For the collaborative portion of our online recommendation decision algorithm, matching the user's queried concept with the C_topi top-priority concepts was bounded by O( max( sum of the number of concepts for all groups in the first preference level ) ). If the collaborative portion was unable to provide sufficient recommendations, the content-based portion was performed. There, matching the user's queried concept with the C_topi top-priority concepts was bounded by O( max( sum of the number of concepts in the first preference level in the individual profile ) ). If both portions failed to provide sufficient recommendations, semantic-based recommendation was performed, with a computational complexity of O( max( the number of concepts that a group can have ) ).

The second operation was the user profile update. An online user profile update required 4 steps: adding one number to the sum, calculating the mean, adding a number for the variance calculation, and calculating the new variance.
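The four update steps can be illustrated with a small running-statistics sketch. The text does not say which quantity is accumulated "for the variance calculation"; this sketch assumes a running sum of squares and the population variance. The class name RunningStats and its methods are hypothetical.

```python
class RunningStats:
    """Keeps a mean and population variance that can be updated in O(1)."""

    def __init__(self):
        self.n = 0
        self.total = 0.0      # running sum of the values
        self.total_sq = 0.0   # running sum of squares (assumed accumulator)

    def add(self, x):
        """Record a new value, e.g. the first query for a concept group."""
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def replace(self, old, new):
        """Change one recorded value, e.g. a query count going from old to new."""
        self.total += new - old
        self.total_sq += new * new - old * old

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        return self.total_sq / self.n - self.mean ** 2


stats = RunningStats()
for count in (4, 2, 1):
    stats.add(count)
stats.replace(1, 2)  # one more query for the third concept group
```

Each call touches only two accumulators, so a single update costs the same regardless of how many values have already been recorded.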
The computational complexity of the profile update was O(Adst), where the maximum value of Adst is the number of concept groups, which is loosely bounded by the number of relationships in the ontology. Since both operations are loosely bounded by the number of relationships in the ontology, and that number remained static, we considered that our online recommendation could be completed in constant time.

Chapter 8 Experiment Designs

The questions to be answered in this research were:

1. Is an ontology a good foundation for information recommendation?
2. Does our semantic-based framework provide recommendations that match user expectations?

In order to answer these questions, experiments were designed to examine the effects of our semantics-based framework. Ontologies served as the foundation for semantics-based interpretation and recommendation.

8.1 Hypotheses

The hypotheses for the experiments were:

1. The utilization of the abundant information in and associated with ontologies results in semantic-based recommendations that match user expectations.
2. The incorporation of user behavior patterns into our semantic-based framework in the recommendation process results in recommendations that match user expectations.

8.2 Experiment Designs

In order to examine the first hypothesis, two experiments on recommendation without the involvement of user behavior patterns were conducted, one in the earthquake science domain and one in the movie domain. Recommendation without the involvement of user behavior patterns meant that only semantically similar group data, hierarchically similar group data, and ontologies and their associated metadata were considered in the recommendation process.
These two experiments allowed us to determine whether semantics serves as a good foundation for recommendation. The two experiments can also be considered extreme cases of the recommendation process: when the information in group and individual user profiles was not sufficient to provide recommendations, recommendations were provided based on semantics, according to the design of our semantic-based framework.

8.2.1 The First Experiment - Semantic-based Recommendation for the Earthquake Science Domain

A good recommendation system should be able to provide recommendations that match the expectations of different kinds of users. In particular, recommendations provided for a specific domain require the approval of domain experts. Domain experts are fully knowledgeable about the domain and are able to determine the similarity among concepts in a specific domain ontology. The first experiment was designed to test the hypothesis that the utilization of information in and associated with the earthquake science domain ontology results in semantic-based recommendations that match the expectations of domain experts and research affiliates. Before the experiment could be conducted, an earthquake science domain ontology had to be established.

The earthquake science domain ontology was established based on the QuakeTables Database [27] [73]. QuakeTables was designed to accommodate several types of fault data and data sets, including primary paleoseismic and geologic fault data, non-primary data [29], and simulated or hypothetical data. These contents were valuable for cross-validating different simulation methods, exploring competing theories of plate boundary development, conducting case studies with widely accepted assumptions, and providing input to visualization software that displays simulation results.
Based on data in the QuakeTables Database, a domain ontology was established with 2324 concept nodes and 5249 relationships. Two types of relationships were included: Is_A and Part_of. The Is_A relationship type indicated full inheritance of properties from the parent node. The Part_of relationship type indicated partial inheritance of properties from the parent node. The ontology contained 1080 relationships of the Is_A type and 4169 of the Part_of type.

The experiment was presented in a survey format. The survey included six queries together with recommendations produced by different methods. At least 3 different results from different recommendation methods were provided for each query. The first recommendation method was our method, which considers semantically similar group data, hierarchically similar group data, and the domain ontology and its associated metadata. The second recommendation method was a keyword spotting approach applied to document data, based on author names or fault names. Author names and fault names were chosen because the domain experts agreed that these two are the major search criteria for the QuakeTables Database. Recommendations were based on common words appearing in the documents. The third method was random choice: recommendations were selected at random. In order to ensure that the survey design was fair, the order in which the result choices were arranged for each query was decided randomly. The subjects in this experiment were two domain experts and two computer scientists. The results and analysis are discussed in Chapter 9.

8.2.2 The Second Experiment - Semantic-based Recommendation for the Movie Domain

Many recommendation systems have been developed to provide recommendations in the movie domain.
Commonly cited movie recommendation systems include the Netflix [55], Moviefinder [20], and Movielens [30] systems. Netflix and Moviefinder are commercial systems. Movielens is a research project of the GroupLens Research Project group [81]. The GroupLens Research Project group provides two datasets of movies and associated user ratings for researchers to download. One dataset contains 100,000 ratings for 1682 movies by 943 users, and the other contains approximately 1 million ratings for 3900 movies by 6040 users. Movielens asks users to specify ratings, so it requires more user interaction than systems based on observations of user behavior patterns do.

Another publicly available movie database is the Internet Movie Database [37]. The data in the Internet Movie Database include details about movies. User reviews and ratings may also be stored for some movies. Data in the Internet Movie Database were utilized in our movie domain ontology development. Based on the available data, 3338246 relationships and 795620 concepts were collected for the domain ontology development.

Our second experiment was designed to test the hypothesis that the utilization of information in and associated with the movie domain ontology results in semantic-based recommendations that match the expectations of common users. The results were compared with three other methods and with realistic human choices. Two of the three methods were introduced by Fleischemen et al. [21], and the third selected movies from the Internet Movie Database. A fourth method selected the recommended movies at random. The concept that the user queried for was the movie “French Kiss”. The results and analysis are presented in Chapter 9.
8.2.3 The Third Experiment- A Recommendation System for the Movie Domain The third experiment was designed to examine the hypothesis that the incorporation of the user behavior patterns with our semantic-based framework in the recommendation process results in recommendation matching user expectation. For the third experiment, a recommendation system implemented based on our semantic-based framework was required. The system implementation was 102 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. explained in Chapter 7. The result display contained two portions: the queried movie information and the recommended movie information. The template of the result display in a browser window is shown in Figure 29. The movie information was referred to the imdb.com movie database. Figure 29 : An Illustration of the Result Display Design fiecruit 31 subjects were involved in this experiment. All 31 subjects have a hobby of going to the movies. These 31 subjects included 16 females and 15 males. The age range of these subjects was between 21 and 57. The average age was 26. A subject was requested to query for at least 3 movies that the subject was interested to obtain information about them. Recommendation decided based on our semantic-based framework was returned with the information on the queried movie. After receiving the returned result display of both 103 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the queried movie information and the recommended movie information, the subject was required to participate in a survey about the perception of the recommended movie information. Survey questionnaires are listed in Figure 30. The results and the analysis were detailed in Chapter 9. Figure 30 : The Survey Questionnaires Have you ever watched the recommended movies? YES _1. How similar is the recommended movie to your queried one? 
(Please give a number between 0 and 100; 0 is not similar at all and 100 is extremely similar)
2. How much did you like the recommended movie? (Please give a number between 0 and 100; 0 means you did not like it at all and 100 means you liked it the most)

NO
Please click the link provided (imdb.com) and read the movie description.
1. How similar is the recommended movie to your queried one? (Please give a number between 0 and 100; 0 is not similar at all and 100 is extremely similar)
2. How likely are you to watch this movie in the future after reading the description? (Please give a number between 0 and 100; 0 means you will not see it at all and 100 means you most likely will)

Chapter 9 Result Analysis

Three experiments were conducted to examine the hypotheses of our research. The first two experiments were designed to examine the suitability of ontologies as a foundation for semantic-based recommendation. Two domains were involved: the earthquake science domain and the movie domain. More than one domain ontology was required to examine the compatibility of our framework: since our semantic-based framework was designed to accept existing domain ontologies with simple adaptation, experiments on different domain ontologies could indicate the compatibility of our framework. The results and their analysis are discussed in the following sections.

The first experiment examined the hypothesis that utilizing the information in and associated with the earthquake science domain ontology results in semantic-based recommendations that match the expectations of domain experts and research affiliates. Its results are analyzed and discussed in Section 9.1.
The results of the second experiment, which was designed to test the hypothesis that utilizing the information in and associated with the movie domain ontology results in semantic-based recommendations that match the expectations of common users, are detailed and compared with results from other semantic-based methods in Section 9.2. For the third experiment, which examined the hypothesis that incorporating user behavior patterns into our semantic-based framework in the recommendation process results in recommendations that match user expectations, results and discussion are provided in Section 9.3.

9.1 The First Experiment

9.1.1 Results of Surveys

In the first experiment, two domain experts and two computer scientists were surveyed with 6 query results. The survey results are shown in Table 12 to Table 15.

Table 12: The Survey Result From the First Domain Expert

The First Domain Expert
Query 1: The semantic-based recommendation
Query 2: The semantic-based recommendation
Query 3: The semantic-based recommendation
Query 4: The semantic-based recommendation
Query 5: The semantic-based recommendation
Query 6: The semantic-based recommendation

The first domain expert agreed that all recommendations provided by our semantic-based framework were more similar in semantics to the queried result than those provided by other methods.
Table 13: The Survey Result From the Second Domain Expert

The Second Domain Expert
Queries 1 to 3: The semantic-based recommendation (two entries recorded)
Queries 4 to 6: The semantic-based recommendation (all three entries)

The second domain expert agreed that five of the recommended results provided by our semantic-based framework were more similar in semantics to the queried result than those provided by other methods.

Table 14: The Survey Result From the First Computer Scientist

The First Computer Scientist
Query 1: The semantic-based recommendation
Query 2: The semantic-based recommendation
Query 3: The semantic-based recommendation
Query 4: The semantic-based recommendation
Query 5: The semantic-based recommendation
Query 6: The semantic-based recommendation

For the third subject, the first computer scientist, all recommendations suggested by our semantic-based framework were considered more similar in semantics to the queried result than those provided by other methods.

Table 15: The Survey Result From the Second Computer Scientist

The Second Computer Scientist
Queries 1 to 3: The semantic-based recommendation (two entries recorded)
Queries 4 to 6: The semantic-based recommendation (all three entries)

The second computer scientist considered five of the recommended results provided by our semantic-based framework to be more similar in semantics to the queried result than those provided by other methods.

9.1.2 Discussions on the Results of the Surveys

Table 16 summarizes the votes obtained from the 4 subjects for each query and each selection. Queries 3, 4, 5, and 6 each received all 4 votes for the semantic-based recommendation.
Queries 1 and 2 each received 3 out of 4 votes for the semantic-based recommendation. To understand the survey results better, we calculated the similarity between the queried result of each of these two queries and both the semantic-based recommendation and the keyword spotting-based recommendation. Since the contents of the results and recommendations are text data, the vector space model and the cosine measure were chosen to evaluate the similarity.

Table 16: Results of Votes

Vote                                         Query 1  Query 2  Query 3  Query 4  Query 5  Query 6
The semantic-based recommendation               3        3        4        4        4        4
The keyword spotting-based recommendation       1        1        0        0        0        0
The random selection                            0        0        0        0        0        0
Total Votes                                     4        4        4        4        4        4

The result contents of query 1 and query 2 are shown in Figure 31 and Figure 32. Only the attributes that were represented as concepts in the ontology were included in the similarity evaluation. Attributes not represented as concepts either held numeric values or were considered metadata associated with concepts. Numeric values were measurements of earthquake events. The other attributes are characteristics of ontology concepts; for example, Year, the publication year, provides more detailed information on the publication. Therefore, the following attributes were included in the similarity evaluation: InterpId, Author1, Author2, Publication, Title, FaultName, and SegmentName. InterpId is the interpretation ID number automatically generated for each individual data source entered into the fault database to represent an earthquake event. Author1 and Author2 were the authors of each publication, listed in authorship order. Publication was the name of the journal article, professional paper, professional report, or conference abstract. Title was the name of the journal, professional paper, professional report, or conference proceedings that featured the publication.
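As a minimal sketch of the cosine measure named above, the following Python snippet recomputes the two similarity values reported in this section from the binary vectors given for query 1, with one component per included attribute. The function name `cosine_similarity`, the variable names, and the handling of an all-zero vector are illustrative assumptions, not part of the original system.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


# Binary vectors reported in the text for query 1 (seven included attributes).
v_queried = [1, 1, 1, 1, 1, 1, 1]
v_semantic = [0, 0, 0, 0, 0, 1, 1]
v_keyword = [0, 0, 0, 0, 0, 0, 1]

print(round(cosine_similarity(v_queried, v_semantic), 3))  # 0.535
print(round(cosine_similarity(v_queried, v_keyword), 3))   # 0.378
```

The two printed values correspond to 2/sqrt(14) and 1/sqrt(7), which agree with the similarities of 0.535 and 0.378 stated for the semantic-based and keyword spotting-based recommendations.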
FaultName was the fault name taken from study site descriptions provided in each individual data source. Faults are fractures dividing two bodies of rock, along which the rock masses have moved relative to each other. SegmentName was a segment name taken from study site descriptions provided in each individual data source.

Figure 31: The Contents of Query 1

InterpId = 2
Author1 = Hall, N. T.
Author2 = null
Publication = Holocene slip rates on the San Andreas fault in Northern California
Year = 1986
Title = Minutes of the National Earthquake Prediction Evaluation Council and The San Francisco Bay Region Special Study Areas Workshop, February 26-March 1, 1986, Menlo Park, California, USGS Open-file report 86-630
Pages = 16-31
Datum = NAD27
SiteLat1 = 37.5703
SiteLon1 = -122.4028
Faultid = 30
FaultName = San Andreas
SegmentName = Cholame
SegmentId = 6
Geometry = rl-ss
NationalFaultId = 1
FADFaultId = 69
Length1 = 57
Length2 = 63
Length3 = 69
DownDipWidth1 = 10
DownDipWidth2 = 12
DownDipWidth3 = 14
AveStrikeRate1 = 29
AveStrikeRate2 = 34
AveStrikeRate3 = 39
RateRank = P
Mmax = 7.3
Dip = 90
Ruptop = 0
Rupbot = 12
LatEnd = 35.75
LatStart = 35.31
LonEnd = -120.3
LonStart = -119.87

Figure 32: The Contents of Query 2

InterpId = 76
Author1 = McGill, S. F.
Author2 = Sieh, K.
Publication = Journal of Geophysical Research
Year = 1991
Title = Surficial Offsets on the Central and Eastern Garlock Fault Associated With Prehistoric Earthquakes
Volume = 96
Number = B13
Pages = 21597-21621
Datum = NAD27
SiteLat1 = 35.5603
SiteLat2 = 0
SiteLon1 = -117.375
Faultid = 33
Figure 32: Continued

FaultName = Garlock
SegmentName = Central
SegmentId = 3
AveRecurr1 = 200

The vector representing the contents of the query 1 queried result is the following: (InterpId = 2, Author1 = Hall, N. T., Author2 = null, Publication = Holocene slip rates on the San Andreas fault in Northern California, Title = Minutes of the National Earthquake Prediction Evaluation Council and The San Francisco Bay Region Special Study Areas Workshop, February 26-March 1, 1986, Menlo Park, California, USGS Open-file report 86-630, FaultName = San Andreas, SegmentName = Cholame) = $v_{\mathrm{queried}}$ = (1, 1, 1, 1, 1, 1, 1). The vector representing the contents of the semantic-based recommendation was $v_{\mathrm{semantic}}$ = (0, 0, 0, 0, 0, 1, 1). In the vector space model, the similarity is calculated by the following equation:

$$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$

Therefore,

$$\mathrm{sim}(\mathrm{queried}, \mathrm{semantic}) = \frac{v_{\mathrm{queried}} \cdot v_{\mathrm{semantic}}}{|v_{\mathrm{queried}}|\,|v_{\mathrm{semantic}}|} = \frac{2}{\sqrt{7}\,\sqrt{2}} \approx 0.535.$$

The vector representing the contents of the keyword spotting-based recommendation was $v_{\mathrm{keyword}}$ = (0, 0, 0, 0, 0, 0, 1), so

$$\mathrm{sim}(\mathrm{queried}, \mathrm{keyword}) = \frac{v_{\mathrm{queried}} \cdot v_{\mathrm{keyword}}}{|v_{\mathrm{queried}}|\,|v_{\mathrm{keyword}}|} = \frac{1}{\sqrt{7}} \approx 0.378.$$

Therefore, $\mathrm{sim}(\mathrm{queried}, \mathrm{semantic}) > \mathrm{sim}(\mathrm{queried}, \mathrm{keyword})$. Since the recommendation provided by our semantic-based framework gathered 3 out of 4 votes and had higher textual similarity to the original queried result, we concluded that our method provided a better recommendation for query 1.

The vector representing the contents of the query 2 queried result is computed in the same way and is the following: (InterpId = 76, Author1 = McGill, S.
F., Author2 = Sieh, K., Publication = Journal of Geophysical Research, Title = Surficial Offsets on the Central and Eastern Garlock Fault Associated With Prehistoric Earthquakes, FaultName = Garlock, SegmentName = Central) = $v_{\mathrm{queried}}$ = (1, 1, 1, 1, 1, 1, 1). The vector representing the contents of the semantic-based recommendation was $v_{\mathrm{semantic}}$ = (0, 0, 0, 0, 0, 1, 1). The vector representing the contents of the keyword spotting-based recommendation was $v_{\mathrm{keyword}}$ = (0, 0, 0, 0, 0, 0, 1). Therefore,

$$\mathrm{sim}(\mathrm{queried}, \mathrm{semantic}) = \frac{v_{\mathrm{queried}} \cdot v_{\mathrm{semantic}}}{|v_{\mathrm{queried}}|\,|v_{\mathrm{semantic}}|} = 0.535,$$

$$\mathrm{sim}(\mathrm{queried}, \mathrm{keyword}) = \frac{v_{\mathrm{queried}} \cdot v_{\mathrm{keyword}}}{|v_{\mathrm{queried}}|\,|v_{\mathrm{keyword}}|} = 0.378.$$

Since the recommendation provided by our semantic-based framework had higher textual similarity to the original queried result and received 3 out of 4 votes, we concluded that our method provided a better recommendation for query 2.

Based on the survey results and the similarity analysis, we concluded that our semantic-based method provided better recommendations than the other two methods. The hypothesis that utilizing the information in and associated with the earthquake science domain ontology results in semantic-based recommendations that match the expectations of domain experts and research affiliates was proved to be true.

9.2 The Second Experiment

9.2.1 Experiment Results

For the second experiment, “French Kiss” was the queried concept. The top 5 concepts recommended by our semantic-based framework are shown in Table 17. “Prelude to a Kiss”, “Kate & Leopold”, and “A Midnight Summer Dream” had the same priority and are listed together.
Table 17: Recommended Concepts Based on Our Semantics-based Framework

The Semantic-based Framework
Forget Paris
When Harry Met Sally
Sleepless in Seattle
Prelude to a Kiss*
Kate & Leopold*

All recommended concepts shared three parent nodes with the concept node “French Kiss”. These three parent nodes were “Romance”, “Comedy”, and “USA”. There were 2123 concepts belonging to the semantically similar groups whose matrices had a first value equal to 3. Therefore, the second values of the matrices had to be examined. The semantically similar group with the highest second value contained 2 concepts, “French Kiss” and “Forget Paris”. Therefore, “Forget Paris” was recommended first. The concept with the second highest priority was “Delovely”. However, the other methods compared with our semantic-based framework only considered movies released before the year 2003, and “Delovely” was released in 2004, so it was discarded from the final recommendation list. The concepts with the third highest priority were “When Harry Met Sally” and “Sleepless in Seattle”. At this point three candidates had been selected. Since five candidates were required, the concepts with the fourth highest priority were requested. Two concepts shared this priority: “Prelude to a Kiss” and “Kate & Leopold”.

Table 18 and Table 19 show the concepts recommended by the methods proposed by Fleischman et al. Table 20 shows the recommendations with more than 1000 ratings [21] provided by imdb.com. Randomly selected concepts are shown in Table 21.
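The selection procedure described above for Table 17 can be sketched as follows. This is only an illustration under stated assumptions: priority levels are written as plain integers (1 is the highest) rather than derived from the matrices, the function name `select_recommendations` is hypothetical, and the release years other than that of “Delovely” are filled in from general knowledge rather than from the text.

```python
CUTOFF_YEAR = 2003  # the compared methods only covered movies released before 2003

# (priority level, title, release year); a lower level means a higher priority.
# Levels follow the narrative above; years other than 2004 are assumptions.
candidates = [
    (1, "Forget Paris", 1995),
    (2, "Delovely", 2004),
    (3, "When Harry Met Sally", 1989),
    (3, "Sleepless in Seattle", 1993),
    (4, "Prelude to a Kiss", 1992),
    (4, "Kate & Leopold", 2001),
]


def select_recommendations(candidates, needed=5, cutoff_year=CUTOFF_YEAR):
    """Walk priority levels in order, adding whole tie groups until enough are chosen."""
    eligible = [c for c in candidates if c[2] < cutoff_year]
    selected = []
    for level in sorted({c[0] for c in eligible}):
        if len(selected) >= needed:
            break
        # Concepts sharing a priority level are added together.
        selected.extend(title for lvl, title, _ in eligible if lvl == level)
    return selected


print(select_recommendations(candidates))
```

Under these assumptions the sketch returns the five titles of Table 17 in the same order, with “Delovely” dropped by the release-year filter.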
Table 18: Recommended Concepts Based on the First Method Proposed by Fleischman et al.

Algorithm 1: Words
Man in the Moon
Wyatt Earp
Story of Us
Silverado
Always

Table 19: Recommended Concepts Based on the Second Method Proposed by Fleischman et al.

Algorithm 2: Genres
When Harry Met Sally
Chasing Amy
Shakespeare in Love
Anniversary Party
Shop Around the Corner

Table 20: Recommended Concepts Provided by imdb.com

IMDB
Forget Paris
An American Werewolf in Paris
Prelude to a Kiss
Moulin Rouge!
Charade

Table 21: Randomly Selected Concepts to be Recommended

Random
I Know What You Did Last Summer
Alien
The Wizard of Oz
La Venida del Papa
Honeymoon Horror

Thirty-one subjects were surveyed in order to obtain the results of human judgment. Subjects were asked to provide the movies to be recommended along with the movie “French Kiss”. The five movies with the most votes are shown in Table 22.

Table 22: Recommended Movies Provided by Humans

Human Judgment
Sleepless in Seattle
When Harry Met Sally
Green Card
You’ve Got Mail
Forget Paris

Table 23 shows the top-5 similar movies determined by human subjects in another survey, conducted by Fleischman et al.

Table 23: Similar Movies Determined by Humans

Human
Sleepless in Seattle
When Harry Met Sally
To Catch A Thief
Forget Paris
Pretty Woman

9.2.2 Discussions on the Experiment Results

The recommendations provided by our semantic-based framework matched both the recommended movies chosen by surveyed subjects and the similar movies determined by humans better than the other methods did. While the other methods were only able to match at most one out of five selections by humans, our method was able to provide 3 matches.
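The match counts stated above can be checked directly against Tables 17 to 23. The sketch below counts, for each method, how many of its five recommendations also appear in each of the two human lists. The dictionary keys and the function name `count_matches` are illustrative assumptions, and titles are compared as exact strings.

```python
# Top-5 lists transcribed from Tables 17 to 21.
recommended = {
    "semantic": ["Forget Paris", "When Harry Met Sally", "Sleepless in Seattle",
                 "Prelude to a Kiss", "Kate & Leopold"],
    "words": ["Man in the Moon", "Wyatt Earp", "Story of Us", "Silverado", "Always"],
    "genres": ["When Harry Met Sally", "Chasing Amy", "Shakespeare in Love",
               "Anniversary Party", "Shop Around the Corner"],
    "imdb": ["Forget Paris", "An American Werewolf in Paris", "Prelude to a Kiss",
             "Moulin Rouge!", "Charade"],
    "random": ["I Know What You Did Last Summer", "Alien", "The Wizard of Oz",
               "La Venida del Papa", "Honeymoon Horror"],
}

# Human lists transcribed from Table 22 and Table 23.
human_recommended = ["Sleepless in Seattle", "When Harry Met Sally", "Green Card",
                     "You've Got Mail", "Forget Paris"]
human_similar = ["Sleepless in Seattle", "When Harry Met Sally", "To Catch A Thief",
                 "Forget Paris", "Pretty Woman"]


def count_matches(candidates, reference):
    """Number of titles that appear in both lists."""
    return len(set(candidates) & set(reference))


for method, titles in recommended.items():
    print(method,
          count_matches(titles, human_recommended),
          count_matches(titles, human_similar))
```

With these lists, the semantic-based framework matches three titles in each human list, and every other method matches at most one.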
In addition, these three matches were the concepts with the highest priorities defined by our semantic-based framework. The hypothesis that utilizing the information in and associated with the movie domain ontology results in semantic-based recommendations that match the expectations of users was proved to be true.

9.3 The Third Experiment

In the third experiment, subjects were asked to create at least three queries for movie information. The movies they queried for had to be movies they were interested in; random selections were prohibited. In total, 99 queries for movies were created by 31 users. A recommended movie was provided along with each query result, and users were required to complete a survey after each recommended movie was provided. The survey questionnaires are shown in Figure 30.

51 out of 99 recommended movies had been viewed by subjects. For these movies, the average score representing the similarity of recommended movies to queried movies was 78.80 out of 100, the median value was 80, and the lowest similarity value was 65. The average score representing user interest level towards the recommended movies was 85.45, the median value was 85, and the lowest value was 70.

58 out of 99 recommended movies had not been viewed by subjects. For these movies, the average score representing the similarity of recommended movies to queried movies was 81.25 out of 100, the median value was 80, and the lowest similarity value was 68. The average score representing user interest level towards seeing the recommended movies was 83.40, the median value was 80, and the lowest value was 70.

Combining the similarity evaluations of the two categories, viewed and unseen movies, the recommendations provided by our semantic-based framework received a similarity score of 80 from subjects. Similarity was determined by the perception of subjects.
Perception was based on experience of watched movies or on detailed movie descriptions. Utilizing user profiles to provide recommendations increased the similarity of recommended movies to queried movies: in the second experiment, recommendation without user profiles resulted in 3 out of 5 matches, which equals 60 out of 100 in scale.

For user preference, the interest evaluation result was compared with the human judgment in the second experiment. In the second experiment, subjects were required to provide the movies to be recommended along with the movie “French Kiss”. Recommendation without information on user behavior patterns resulted in 3 out of 5 matches, which equals 60 out of 100 in scale. The average score representing user interest level towards recommended movies that had been seen was 85.35. Including user profiles about user behavior patterns in the recommendation process therefore provided recommendations more preferred by users.

Figure 33: The Regression Results for the First Data Set

Figure 34: The Regression Results for the Second Data Set

Regressions were applied to two data sets. The first was the data set of similarity and preference evaluated for viewed movies, and the second was the data set of similarity and interest determined for unseen movies. Similarity was treated as the x-coordinate value, and preference or interest was treated as the y-coordinate value. The regression results for the data sets associated with viewed and unseen movies are shown as graphs in Figure 33 and Figure 34, respectively. Results indicated that a positive correlation existed between similarity and preference/interest.
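The kind of regression described above can be sketched with an ordinary least-squares line fit and a Pearson correlation coefficient, with similarity on the x-axis and preference on the y-axis. The survey responses themselves are not reproduced in this chapter, so the six score pairs below are hypothetical placeholders, and the function name `linear_fit` is an assumption; the choice of ordinary least squares is likewise an assumption, since the text does not name the regression method.

```python
import math


def linear_fit(xs, ys):
    """Least-squares slope and intercept plus the Pearson correlation of (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical (similarity, preference) scores on the 0-100 survey scale.
similarity = [65, 70, 78, 80, 85, 90]
preference = [70, 78, 80, 85, 88, 95]

slope, intercept, r = linear_fit(similarity, preference)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

A positive slope together with a positive correlation coefficient is what a positive relationship between similarity and preference would look like in such a fit.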
Therefore, recommendation based on semantic similarity with consideration of user profiles was meaningful. The hypothesis that incorporating user behavior patterns into our semantic-based framework in the recommendation process results in recommendations that match user expectations was proved to be true.

Chapter 10 Discussion and Conclusion

A semantic-based model was proposed to incorporate existing data representation structures, ontologies, into a recommendation system. This similarity model was developed to facilitate similarity computation for ontologies. Similarity computation was based on geometric properties. The structure adaptation process was developed to make these geometric properties usable, and similarity computations were performed on the geometric properties mapped from the adaptation results. The proposed model consisted of offline operations and online constant-time computations. The offline operations were grouping semantically similar concept nodes and computing similarity degrees for the semantically similar groups. These offline operations reduced the computational complexity of online similarity decisions, which required only a constant number of comparisons to obtain the recommended concepts. The proposed model achieved the goal of a constant-time recommendation process based on a combination of uncomplicated approaches. Online similarity decisions were made with consideration of the semantics in ontologies.

Experiments were conducted to show that recommendation with consideration of semantics results in recommendations that match user preference. Results from two experiments indicated that our
proposed semantic-based recommendation predicted user preference better than other methods. Combining semantics with information from user profiles provided better recommendations than semantic-based recommendation without consideration of user profiles. An experiment was performed to examine whether incorporating information from user profiles into our semantic-based framework results in recommendations that suit user preference better. The experiment results indicated that recommendation with consideration of information in user profiles matched user preference better than recommendation without it.

Two domain ontologies were included in the experiments. Our semantic-based framework was shown to facilitate online recommendation in multiple domains. In the future, more domain ontologies should be included in order to examine the suitability of our semantic-based framework.

In addition, methods of evaluating user satisfaction with minimum interaction are required. Involvement time measurement has been included in recommendation system designs in order to evaluate user satisfaction with recommendation results. Several steps would be required to determine user preference levels: determining the possibility space of user involvement time, dividing the space into subspaces, and finalizing the preference cut-off points. Since human preference is hard to describe with precise values, finalizing cut-off points is complex. More verification and research are required before including involvement time measurement in our semantic-based recommendation system. One method of verification is to compare direct user survey results with involvement time length measurements. The comparison results would indicate the validity of involvement time length measurement.
If quantified direct user survey results correlate positively with involvement time length results, involvement time length measurement would be shown to be valid for evaluating user satisfaction and preference levels and for determining proper cut-off points.

In future work, enriching the ontology databases is necessary. Enriching the ontology databases means extending domain coverage, which includes two categories of work: completing the Science and Entertainment domain ontologies and incorporating existing domain ontologies. Completing the Science and Entertainment domain ontologies includes adding subdomain ontologies, such as a neuroscience domain ontology. Incorporating existing domain ontologies requires adapting them to our recommendation systems. One possible candidate is the partial news domain ontology from our previous work [52]. Extending the ontology databases would facilitate validating the suitability of our semantic-based framework.

Bibliography

[1] A9.com, Inc., “About us,” A9.com, Inc., 2003. [Online]. Available: http://www.A9.com/company. [Accessed: Nov. 01, 2005].
[2] Ask Jeeves, Inc, “About Ask.com”, 2005. [Online]. Available: http://www.ask.comdocs/about/index.html. [Accessed: Nov. 01, 2005].
[3] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, 1st ed., Boston: Addison Wesley, 1995.
[4] M. Balabanovic and Y. Shoham, “Fab: Content-based collaborative recommendation,” Communications of the ACM, vol. 40, no. 9, pp. 88-89, Mar. 1997.
[5] C. Basu, H. Hirsh, and W. Cohen, “Recommendation as classification: Using social and content-based information in recommendation,” in Proceedings of 5th National Conference on Artificial Intelligence, 1998, pp. 714-720.
[6] S. Bechhofer, J. Broekstra, S. Decker, M. Erdmann, D. Fensel, C. Goble, F. Van Harmelen, I. Horrocks, M. Klein, D.
McGuinness, E. Motta, P. Patel-Schneider, S. Staab, and R. Studer, “An informal description of standard OIL and instance OIL,” University of Manchester, Manchester, United Kingdom, Tech. Rep., 2000.
[7] T. Berners-Lee and R. Swick, “The Semantic Web,” presented at 9th World Wide Web Conference, Amsterdam, Netherlands, 2000.
[8] M. Berry, Z. Drmac, and E. Jessup, “Matrices, vector spaces and information retrieval,” Society for Industrial and Applied Mathematics Review, vol. 41, no. 2, pp. 335-362, 1999.
[9] C. Bighini, A. Carbonaro, and G. Casadei, “InLinx for document classification, sharing and recommendation,” in Proceedings of 3rd IEEE International Conference on Advanced Technologies, 2003, pp. 91-95.
[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithm for collaborative filtering,” Microsoft Research, Redmond, WA, Microsoft Tech. Rep. MSR-TR-98-12, 1998.
[11] Cable News Network LP, LLLP, “CNN.com - Breaking news, U.S., world, weather, entertainment & video news,” Cable News Network LP, LLLP, 2005. [Online]. Available: http://www.cnn.com. [Accessed: Nov. 01, 2005].
[12] C. Charalambous and A. Logothetis, “Maximum likelihood parameter estimation from incomplete data via the sensitivity equations: The Continuous-time Case,” IEEE Transactions on Automatic Control, vol. 45, no. 5, pp. 928-934, May 2000.
[13] S. Chee, J. Han, and K. Wang, “RecTree: An efficient collaborative filtering method,” in Proceedings of 3rd International Conference on Data Warehousing and Knowledge Discovery, 2001, pp. 141-151.
[14] A. Chen, S. Chung, S. Gao, D. McLeod, A. Donnellan, J. Parker, G. Fox, M. Pierce, M. Gould, L. Grant, and J. Rundle, “Interoperability and semantics for heterogeneous earthquake science data,” presented at 2003 Semantic Web Technologies for Searching and Retrieving Scientific Data Conference, Sanibel Island, FL, 2003.
[15] A.
Chen and D. McLeod, “Collaborative filtering for recommendation systems,” in Encyclopedia of E-Technologies and Applications, 1st ed., M. Khosrow-Pour, Ed., Hershey: Idea Group Reference, 2006, to be published.
[16] A. Chen and D. McLeod, “Semantics-Based similarity decisions for ontologies,” in Proceedings of 7th International Conference on Enterprise Information Systems, 2005, pp. 443-446.
[17] C. De Lazzari, E. Guerrieri, D. Pisanelli, and A. Murray, “A domain ontology for mechanical circulatory support systems,” in Proceedings of 2003 Computers in Cardiology, 2003, pp. 417-419.
[18] M. Deshpande and K. Karypis, “Item-based top-N recommendation algorithms,” ACM Transactions on Information Systems, vol. 22, no. 1, pp. 143-177, Jan. 2004.
[19] A. Donnellan, D. McLeod, G. Fox, J. Parker, J. Rundle, L. Grant, and M. Pierce, “QuakeSim,” The NASA Earth Science Enterprise, 2005. [Online]. Available: http://quakesim.jpl.nasa.gov/. [Accessed: Nov. 01, 2005].
[20] E! Entertainment Television, Inc., “Movies, E! Online,” E! Entertainment Television, Inc., 2005. [Online]. Available: http://movies.eonline.com/Reviews/Movies/index.html. [Accessed: Nov. 01, 2005].
[21] M. Fleischman and E. Hovy, “Recommendations without user preferences: A natural language processing approach,” in Proceedings of 2003 International Conference on Intelligent User Interfaces, 2003, pp. 242-244.
[22] P. Ganesan, H. Garcia-Molina, and J. Widom, “Exploiting hierarchical domain structure to compute similarity,” Stanford University, Stanford, CA, Extended Technical Report, 2001.
[23] P. Ganesan, H. Garcia-Molina, and J. Widom, “Exploiting hierarchical domain structure to compute similarity,” ACM Transactions on Information Systems, vol. 21, no. 1, pp. 64-93, Jan. 2003.
[24] K. Goldberg, T. Roeder, D. Gupta, and C.
Perkins, “Eigentaste: A constant time collaborative filtering algorithm,” Information Retrieval, vol. 4, no. 2, pp. 133-151, July 2001.
[25] Google Inc., “About Google,” Google Inc., 2005. [Online]. Available: http://www.google.com. [Accessed: Nov. 01, 2005].
[26] Google Inc., “Google Answers,” Google Inc., 2005. [Online]. Available: http://answers.google.com/answers. [Accessed: Nov. 01, 2005].
[27] L. Grant, A. Donnellan, D. McLeod, M. Pierce, A. Chen, M. Gould, G. Noriega-Carlos, R. Paul, S. Sung, and M. Ta, “QuakeTables: The QuakeSim fault database for California,” presented at 2004 SCEC Meeting, Palm Springs, CA, 2004.
[28] L. Grant, A. Donnellan, D. McLeod, M. Pierce, A. Chen, M. Gould, G. Noriega-Carlos, R. Paul, S. Sung, and M. Ta, “QuakeTables: The fault database for QuakeSim,” presented at 2004 AGU Fall Meeting, San Francisco, CA, 2004.
[29] L. Grant and M. Gould, “Assimilation of paleoseismic data for earthquake simulation,” Pure and Applied Geophysics, vol. 106, no. 11/12, pp. 2295-2306, Nov. 2004.
[30] GroupLens Research at the University of Minnesota, “Movielens - movie recommendations,” GroupLens Research at the University of Minnesota, 2003. [Online]. Available: http://movielens.umn.edu/login. [Accessed: Nov. 01, 2005].
[31] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl, “An algorithmic framework for performing collaborative filtering,” in Proceedings of 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 230-237.
[32] W. Hill, L. Stead, M. Rosenstein, and G. Furnas, “Recommending and evaluating choices in a virtual community of use,” in Proceedings of 1995 International Conference on Human Factors in Computing Systems, 1995, pp. 194-201.
[33] T. Hofmann and J.
Puzicha, “Latent class models for collaborative filtering,” in Proceedings o f 6th International Joint Conference on Artificial Intelligence, 1999, pp. 688-693. [34] T. Hofmann, J. Puzicha, and M. Jordan, ’’Learning from dyadic data,” in M. S. Kearns, S. A. Solla, and D. A. Cohn, eds., Advances in Neural Information Processing Systems 11, San Mateo, CA: Morgan Kaufinann Publishers, 1998, pp. 466-472. [35] Z. Huang, H. Chen, and D. Zeng, “Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering,” ACM Transactions on Information Systems, vol. 22, no. 1, pp. 116-142, Jan. 2004. [36] E. Hyvonen, S. Saarela, and K. Viljanen, “Application of ontology techniques to view-based semantic search and browsing,” in Proceedings of 1st European Semantic Web Symposium, 2004, pp. 92-106. 129 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [37] Internet Movie Database Inc., “Internet movie database,” Internet Movie Database Inc., 2005. [Online]. Available: http://www.imdb.com. [Accessed: Nov. 01, 2005]. [38] Internet Movie Database Inc., “IMDb statistics,” Internet Movie Database Inc., 2005. [Online], Available: http://www.imdb.com/database_statistics. [Accessed: Nov. 01, 2005]. [39] E. Jewell and F. Abate, Ed., The New Oxford American Dictionary. New York: Oxford University Press, 2001. [40] J. Jung, J. Yoon, and G Jo, “Collaborative information filtering by using categorized bookmarks on the web,” in Proceedings o f 14th International Conference on Applications o f Prolog, 2001, pp. 343-357. [41] G Karypis, “Evaluation of item-based top-N recommendation algorithms,” Department of Computer Science, University of Minnesota, Minneapolis, MN, Tech. Rep. 00-046,2000. [42] L. Khan, “Ontology-based information selection,” Ph.D. dissertation, University of Southern California, Los Angeles, CA, 2000. [43] S. 
Ko, “Prediction of preferences through optimizing users and reducing dimension in collaborative filtering system,” in Proceedings o f 17th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, 2004, pp. 1259-1268. [44] A. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl, “GroupLens: Applying collaborative filtering to Usenet news,” Communications o f the ACM, vol. 40, no. 3, pp. 77-87, Mar. 1997. [45] K. Lang, “Newsweeder: Learning to filter news,” in Proceedings of 12th International Conference on Machine Learning, 1995, pp. 331-336. [46] O. Lassila and R. Swick, “Resource Description Framework (RDF) model and syntax specification,” The World Wide Web Consortium, 1999. [Online]. Available: http://www.w3.org/TR/REC-rdf-syntax. [Accessed: Nov. 01, 2005]. 130 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [47] K. Latifur, D. McLeod, and E. Hovy, “Retrieval effectiveness of an ontology-based model for information selection,” The VLDB Journal, vol. 13, no. 1, pp. 71-85, Jan. 2004. [48] G. Linden, B. Smith, and J. York, “Amazon.com recommendations,” IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan./ Feb. 2003. [49] Los Angeles Times, “Los Angeles, California, national and world news, jobs, real estate, cars - Los Angeles Times,” Los Angeles Times, 2005. [Online]. Available: http://www.latimes.com. [Accessed: Nov. 01,2005]. [50] T. Malone, K. Grant, F. Turbak, S. Brobst, and M. Cohen, “Intelligent information sharing systems,” Communications o f the ACM, vol. 30, no.5, pp. 390-402, May 1987. [51] D. McLeod, P. Dent, S. Gardner, S. Narayanan, C. Shahabi, A. Chen, S. Chung, S. Gao, J. Grob, M. Kakar, H. Shin, J. Shin, H. Su, and L. Wang, “I-news,” Integrated Media Systems Center, University of Southern California, Los Angeles, CA, IMSC Annual Report to NSF, 2001. [52] D. McLeod, P. Dent, L. Pryor, S. Gardner, S. Narayanan, C. 
Shahabi, A. Chen, S. Shin, and J. Shin, “1-4,” Integrated Media Systems Center, University of Southern California, Los Angeles, CA, IMSC Annual Report to NSF, 2002. [53] Microsoft Inc., “MSN.com,” Microsoft Inc., 2005. [Online], Available: http://www.msn.com. [Accessed: Nov. 01,2005]. [54] MSNBC.com, “Newsweek.com - National news, world news, health, technology, entertainment and more... - MSNBC.com,” MSNBC.com, 2005. [Online]. Available: http://www.msnbc.msn.com/id/3032542/site/newsweek/. [Accessed: Nov. 01, 2005], [55] Netflix, Inc., “Netflix: Welcome to Netflix - Rent DVDs online - Try free,” Netflix, Inc., 2005. [Online]. Available: http://www.netflix.com/Default. [Accessed: Nov. 01, 2005]. [56] P. Patrick and T. Aimilia, “Combining collaborative and content-Based filtering using conceptual graphs,” in Proceedings o f Modelling with Words 2003, 2003, pp. 168-185. 131 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [57] S. Philippi and J. Kohler, “Using XML technology for the ontology-based semantic integration of life science databases,” IEEE Transactions on Information Technology in Biomedicine, vol. 8, no. 2, pp. 154-60, June 2004. [58] G Polcicova and P. Navrat, “Semantic similarity in content-based filtering,” in Proceedings of 6th East-European Conference on Advances in Databases and Information Systems, 2002, pp. 80-85. [59] B. Popp, “About DAML,” DARPA's Information Exploitation Office, 2003. [Online]. Available: http://www.daml.org/about.html. [Accessed: Nov. 01,2005]. [60] P. Resnick, N. Iacovou, M. Sushak, P. Bergstrom, and J. Riedl, “GroupLens: An open architecture for collaborative filtering of Netnews,” in Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, 1994, pp. 175-186. [61] C. Shahabi, F. Banaei-Kashani, Y. Chen, and D. 
McLeod, “Yoda: An accurate and scalable web-based recommendation system,” In Proceedings o f 6th International Conference on Cooperative Information Systems, 2001, pp.418-432. [62] G. Salton and M. McGill, Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983. [63] B. Sarwar, G Karypis, J. Konstan, and J. Riedl, ’’Analysis of recommendation algorithms for e-commerce,” In Proceedings o f ACM E-Commerce 2000,2000, pp. 158-167. [64] B. Sarwar, G Karypis, J. Konstan, and J. Riedl. ’’Application of dimensionality reduction in recommender system - A case study,” in Proceedings o f the WebKDD 2000 workshop, 2000, pp. 82-90. [65] B. Sarwar, G Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recom m endation algorithm s,” in P roceedings o f the tenth international conference on W orld Wide Web, 2001, pp. 285-295. 132 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [66] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating 'word of mouth',” in Proceedings o f the conference on Computer supported cooperative work, 1995, pp. 210-217. [67] B. Shekar and R Natarajan, “A fuzzy-Graph-Based approach to the determination ofinterestingness of association rules,” in Proceedings of The Fourth International Conference on Practical Aspects o f Knowledge Management, 2002, pp. 377-388. [68] M. Smith, C. Welty, and D. McGuinness, “OWL web ontology language guide,” The World Wide Web Consortium, 2004. [Online]. Available: http://www.w3.org/TR/owl-guide/. [Accessed: Nov. 01, 2005], [69] H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn, “XML Schema part 1: structures," The World Wide Web Consortium, 2004. [Online]. Available: http://www.stylusstudio.com/w3c/schemal/. [Accessed: Nov. 01,2005]. 
[70] The American Association for the Advancement of Science, “Science/AAAS | Scientific research, news and career information,” The American Association for the Advancement of Science, 2005. [Online]. Available: http://www.sciencemag.org. [Accessed: Nov. 01, 2005].
[71] The Apache Software Foundation, “Apache Tomcat - Apache Tomcat,” The Apache Software Foundation, 2005. [Online]. Available: http://jakarta.apache.org/tomcat/. [Accessed: Nov. 01, 2005].
[72] The NASA Earth Science Enterprise, “Goals and components,” The NASA Earth Science Enterprise, 2005. [Online]. Available: http://quakesim.jpl.nasa.gov/goals.html. [Accessed: Nov. 01, 2005].
[73] The NASA Earth Science Enterprise, “Quaketable for the public,” The NASA Earth Science Enterprise, 2005. [Online]. Available: http://danube.ucs.indiana.edu:9090/public.html. [Accessed: Nov. 01, 2005].
[74] The New York Times Company, “The New York Times - Breaking news, world news & multimedia,” The New York Times Company, 2005. [Online]. Available: http://www.NYtimes.com. [Accessed: Nov. 01, 2005].
[75] The Washington Post Company, “washingtonpost.com - nation, world, technology and Washington area news and headlines,” The Washington Post Company, 2005. [Online]. Available: http://www.washingtonpost.com. [Accessed: Nov. 01, 2005].
[76] Times Newspapers Ltd., “World, UK and business news and comment from The Times and The Sunday Times, Times Online,” Times Newspapers Ltd., 2005. [Online]. Available: http://www.timesonline.co.uk. [Accessed: Nov. 01, 2005].
[77] USA TODAY, “USATODAY.com - News & information homepage,” USA TODAY, 2005. [Online]. Available: http://www.USAtoday.com. [Accessed: Nov. 01, 2005].
[78] F. van Harmelen, P. Patel-Schneider, and I. Horrocks, “DAML+OIL (March 2001) reference description,” DARPA's Information Exploitation Office, 2001. [Online]. Available: http://www.daml.org/2001/03/reference. [Accessed: Nov. 01, 2005].
[79] S. Wong and V. Raghavan, “Vector space model of information retrieval: A reevaluation,” in Proceedings of 7th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1984, pp. 167-185.
[80] Yahoo! Inc., “Yahoo!,” Yahoo! Inc., 2005. [Online]. Available: http://www.yahoo.com. [Accessed: Nov. 01, 2005].
[81] Yahoo! Inc., “Yahoo! media relations,” Yahoo! Inc., 2005. [Online]. Available: http://docs.yahoo.com/info/pr/faq.html. [Accessed: Nov. 01, 2005].
[82] B. Xiao, E. Aimeur, and J. Fernandez, “PCFinder: An intelligent product recommendation agent for e-commerce,” in Proceedings of IEEE International Conference on E-Commerce 2003, 2003, pp. 181-189.
[83] H. Zha, O. Marques, and H. Simon, “A subspace-based model for information retrieval with applications in latent semantic indexing,” in Proceedings of Irregular '98, 1998, pp. 29-42.
Asset Metadata
Creator Chen, Yun-An (author) 
Core Title Semantics-based customization of information retrieval on the World Wide Web 
Contributor Digitized by ProQuest (provenance) 
Degree Doctor of Philosophy 
Degree Program Computer Science 
Publisher University of Southern California (original), University of Southern California. Libraries (digital) 
Tag Computer Science, OAI-PMH Harvest 
Language English
Permanent Link (DOI) https://doi.org/10.25549/usctheses-c16-433821 
Unique identifier UC11336724 
Identifier 3233848.pdf (filename), usctheses-c16-433821 (legacy record id) 
Legacy Identifier 3233848.pdf 
Dmrecord 433821 
Document Type Dissertation 
Rights Chen, Yun-An 
Type texts
Source University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection) 
Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au... 
Repository Name University of Southern California Digital Library
Repository Location USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA