COMBINING TEXTUAL WEB SEARCH WITH SPATIAL, TEMPORAL AND SOCIAL ASPECTS OF THE WEB

by

Ali Khodaei

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2013

Copyright 2013 Ali Khodaei

Epigraph

"For years my heart was in search of the Grail
What was inside me, it searched for, on the trail"
Hafiz, 14th-century Persian mystic and poet

Dedication

To my parents, who provided me with everything I needed, most importantly with unconditional love. To my beloved wife, who supported me throughout this journey and gave me hope and encouragement when I needed them the most.

Contents

Epigraph
Dedication
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Spatial-Textual Web Search
  1.2 Temporal-Textual Web Search
  1.3 Social-Textual Web Search
  1.4 Road Map
Chapter 2: Related Work
  2.1 Spatial-Textual Web Search
  2.2 Temporal-Textual Web Search
  2.3 Social-Textual Web Search
Chapter 3: Spatial-Textual Web Search
  3.1 Preliminaries
    3.1.1 Problem Definition
    3.1.2 Background
      3.1.2.1 tf-idf Score
      3.1.2.2 Inverted Files
  3.2 Spatial-Keyword Search
    3.2.1 Seamless Spatial-Keyword Ranking
      3.2.1.1 Spatial tf-idf
      3.2.1.2 Spatial-Keyword Relevance
    3.2.2 Spatial-Keyword Inverted File
      3.2.2.1 SKIF Structure
      3.2.2.2 Query Processing
    3.2.3 SKIF-P: Spatial-Keyword Inverted File for Points
      3.2.3.1 Spatial Decay
      3.2.3.2 SKIF-P
    3.2.4 Generalization
  3.3 Experimental Evaluation
    3.3.1 Performance
    3.3.2 Accuracy
    3.3.3 Performance: SKIF-P Parameters
    3.3.4 Accuracy: SKIF-P Parameters
Chapter 4: Temporal-Textual Web Search
  4.1 Preliminaries
    4.1.1 Problem Definition
  4.2 Baseline Approach
  4.3 Hybrid Approaches
    4.3.1 Inverted File and Interval-Tree Index (FnT)
    4.3.2 Inverted File Then Interval-Tree Index (FtT)
    4.3.3 Interval-Tree Then Inverted File Index (TtF)
  4.4 Temporal-Textual Search
    4.4.1 Seamless Tempo-Textual Ranking
      4.4.1.1 Temporal tf-idf
      4.4.1.2 Tempo-Textual Relevance
      4.4.1.3 Variants
    4.4.2 Tempo-Textual Inverted Index
      4.4.2.1 T2I2 Structure
      4.4.2.2 Query Processing
    4.4.3 Generalization
      4.4.3.1 Multiple Timespans
      4.4.3.2 Points
      4.4.3.3 Freshness
      4.4.3.4 Weights
      4.4.3.5 Leveraging Existing Search Engines
  4.5 Experiments
    4.5.1 Cost Model
      4.5.1.1 Cost of FnT
      4.5.1.2 Cost of FtT
      4.5.1.3 Cost of TtF
      4.5.1.4 Cost of T2I2
    4.5.2 Performance Experiments
      4.5.2.1 NY-TIMES Dataset
      4.5.2.2 FREEBASE Dataset
      4.5.2.3 Cell Size (Temporal Granularity)
    4.5.3 Accuracy Experiments
      4.5.3.1 Setting and Queries
      4.5.3.2 Results
Chapter 5: Social-Textual Web Search
  5.1 Overview
  5.2 PerSocialization
    5.2.1 PerSocial Relevance Model
      5.2.1.1 PerSocial Relevance - Level 1
      5.2.1.2 PerSocial Relevance - Level 2
      5.2.1.3 PerSocial Relevance - Level 3
    5.2.2 PerSocialized Ranking
      5.2.2.1 Textual Filtering, PerSocial Ranking
      5.2.2.2 Textual Ranking, PerSocial Filtering
      5.2.2.3 PerSocial-Textual Ranking
  5.3 Experimental Evaluation
    5.3.1 Main Approaches
    5.3.2 PerSocial Relevance Levels
    5.3.3 Friends vs. User
    5.3.4 Number of Friends
Chapter 6: Conclusions
References

List of Tables

3.1 Dataset Details
3.2 R-precision of various rankings
3.3 Ranking preferred by users
3.4 Precision@k of various rankings
3.5 nDCG@k of various rankings
4.1 Symbols
4.2 Dataset Details
4.3 Queries
4.4 Precision@k and nDCG@k of various rankings
4.5 Precision@k and nDCG@k by topic
4.6 Precision@k and nDCG@k by cell size
5.1 Main Approaches: qset1
5.2 Main Approaches: qset2
5.3 Levels
5.4 User-only vs. Friends-only
5.5 Number of Friends

List of Figures

1.1 A spatial-keyword query on documents with location information
1.2 A tempo-textual query on documents with time information
3.1 Inverted file for Example 1
3.2 Properties (2) and (3)
3.3 Example 1 on the grid
3.4 Spatial-keyword inverted file for Example 1
3.5 Single-Score Approach
3.6 Double-Score Approach
3.7 Example of Windows Decay with δ equal to 2 cells (Euclidean) and focal point at the center of the grid
3.8 Example of Polynomial Decay with δ equal to 2 cells (Euclidean) and focal point at the center of the grid
3.9 Example of Exponential Decay with δ equal to 2 cells (Euclidean) and focal point at the center of the grid
3.10 Documents with location information and keyword frequencies
3.11 A spatial-keyword query on documents with location information
3.12 Example 2 on the grid
3.13 Spatial-keyword inverted file for Example 2 (for δ = 0). The entry for each term t is composed of the term frequency (f_t) and a list of pairs, each composed of a document id d and a normalized term frequency f̄_{d,t}
3.14 Impact of |K_q| on query cost
3.15 Impact of k on query cost
3.16 Impact of α on query cost
3.17 Impact of number of cells on query cost
3.18 Impact of δ on query cost
4.1 Example 3 with temporal cells
4.2 Tempo-textual inverted index for Example 3
4.3 Top-k tempo-textual search, uni-score
4.4 Top-k tempo-textual search, dual-score
4.5 Impact of number of keywords on query cost - NY-TIMES
4.6 Impact of k on query cost - NY-TIMES
4.7 Impact of time-interval size on query cost - NY-TIMES
4.8 Impact of cell size on query cost
5.1 Overview of PERSOSE
5.2 Friendship Structure for the Running Example

Abstract

Over the last few years, the web has changed significantly. The emergence of Web 2.0 has enabled people to interact with web documents in ways not possible before. It is now common practice to geo-tag or time-tag web documents, or to integrate web documents with popular social networks. With these changes and the abundant use of spatial, temporal and social information in web documents and search queries, the need to integrate such non-textual aspects of the web into regular textual web search has grown rapidly over the past few years.

To integrate each of these non-textual dimensions into textual web search and to enable spatial-textual, temporal-textual and social-textual web search, in this dissertation we propose a set of new relevance models, index structures and algorithms specifically designed for adding each non-textual dimension (spatial, temporal and social) to the current state of (textual) web search. First, we propose a new ranking model and a hybrid index structure called the Spatial-Keyword Inverted File to handle location-based ranking and indexing of web documents in an integrated and efficient manner. Second, we propose a new indexing and ranking framework for temporal-textual retrieval. The framework leverages the classical vector space model and provides a complete scheme for indexing, query processing and ranking of temporal-textual queries. Finally, we show how to personalize search results based on users' social actions. We propose a new relevance model, called the PerSocial relevance model, that utilizes three levels of social signals to improve web search, and we develop several approaches to integrate the PerSocial relevance model into the textual web search process.

Chapter 1
Introduction

In this chapter, we introduce and motivate the problem of combining textual web search with the spatial, temporal and social aspects of the web. Since integrating each dimension (spatial, temporal, social) into web search is a separate (yet related) problem, we introduce and motivate each problem separately. In Section 1.1, we introduce the problem of spatial-textual (spatial-keyword) search. Next, in Section 1.2, we motivate and introduce the problem of temporal-textual search. Finally, we introduce the problem of social-textual search (PerSocialization) in Section 1.3.

1.1 Spatial-Textual Web Search

There is a large amount of location-based information generated and used by many applications. The Internet is the most popular source of data with location-specific information, such as documents describing schools in certain regions, Wikipedia pages containing spatial information, and images with annotations and information about the places where they were taken. Users of such web-based applications often need to query the system by providing requirements on a location as well as keywords in order to find relevant documents, as illustrated by the following example (Example 1).

Suppose we have a collection of web pages, and each page describes objects for a specific location, such as a district, a city, or a county. Objects can be schools, real-estate agencies, golf courses, and sports teams. We want to build a system that allows users to search these documents.
Consider a user, Mike, who moves to the central part of Los Angeles. He likes to play soccer, and wants to find soccer leagues in this area so that he can choose one to join. He submits a query to the system with two keywords, "soccer league", and specifies "Central Los Angeles" as the location restriction. Figure 1.1 shows the location of his query represented as a shaded region. Our goal is to find the best documents that are of interest to Mike.

Figure 1.1: A spatial-keyword query on documents with location information. [Panel (a) shows the query region Q and the MBRs of documents d1-d6 in the Los Angeles area; panel (b) lists each document's location (Pasadena, Downtown, Santa Monica, Palms, Culver City, Redondo Beach) and the frequencies of the query keywords:]

  Document   soccer   league
  d1         5        4
  d2         3        0
  d3         1        1
  d4         1        1
  d5         2        2
  d6         0        2

Suppose there are six documents in the repository with locations close to Central Los Angeles. Figure 1.1(a) shows these locations represented as rectangles. In addition, each document has text keywords in its content. Figure 1.1(b) shows the frequencies of the two query keywords in these documents. We want to find the documents most relevant to the query. The result cannot be found with a simple keyword-only query, since none of the documents may contain the actual keywords "Central Los Angeles" or even "Los Angeles".

One way to answer the query is to find the documents with a location contained in the query region and containing both query keywords, as suggested in several studies in the literature [HHLM07, CSM06, VJJS]. Using this approach, we can only find document d3 as an answer, since it satisfies both conditions. Document d5 is not an answer: even though it has both query keywords, its region is not totally contained in the query region. One major limitation of this approach is that many documents with partial spatial and/or textual matching will not be considered, even though they include information that could be interesting to the user.

In this proposal we show how to support spatial-keyword queries on documents with spatial information. We demonstrate how to rank documents by seamlessly combining spatial and textual features, in order to find highly relevant answers to user queries. In our running example, an interesting question is how to measure the relevance of a document to the query. Intuitively, a document could be of interest to the user if it has at least one of the query keywords and its location is close to the region mentioned in the query. Document d6 is not very relevant to the query, since its region is far from the query region. The other five documents all overlap with the query region, and thus could potentially be of interest to the user. Hence, we need to rank them, since the user may be interested in only the most relevant documents. However, it is not clear how to measure the relevance of the documents to the user query. For example, it is clear that document d3 should have a high relevance, since its region is contained in the query region and both query keywords appear in the document. It is less clear how relevant d2 is to the query: even though its region is contained in the query region, it does not contain the keyword "league". The other documents, d1, d4, and d5, all have the two query keywords, but with different frequencies, and they have different amounts of overlap with the query region.
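To make this ranking question concrete, the following sketch scores the documents of Example 1 with a simple weighted sum of a spatial score and a textual score, of the kind formalized in Section 3.1.1. The keyword frequencies come from Figure 1.1(b); the overlap fractions and the weight alpha are illustrative assumptions (the actual geometry exists only in the figure), not values from this dissertation.

```python
# Illustrative ranking for Example 1: score = alpha * spatial + (1 - alpha) * textual.
# Keyword frequencies are from Figure 1.1(b); the spatial overlap fractions are
# invented for illustration, since the real geometry is only drawn in Figure 1.1(a).

freqs = {  # document -> {keyword: frequency}
    "d1": {"soccer": 5, "league": 4}, "d2": {"soccer": 3, "league": 0},
    "d3": {"soccer": 1, "league": 1}, "d4": {"soccer": 1, "league": 1},
    "d5": {"soccer": 2, "league": 2}, "d6": {"soccer": 0, "league": 2},
}
overlap = {"d1": 0.3, "d2": 1.0, "d3": 1.0, "d4": 0.5, "d5": 0.6, "d6": 0.0}  # assumed

def score(doc, query=("soccer", "league"), alpha=0.5):
    max_f = max(freqs[doc].values())            # for normalized term frequency
    textual = sum(freqs[doc].get(k, 0) / max_f for k in query) / len(query)
    return alpha * overlap[doc] + (1 - alpha) * textual

for doc, s in sorted(((d, score(d)) for d in freqs), key=lambda x: -x[1]):
    print(doc, round(s, 3))
```

Even this toy version shows why partial matches matter: d5 and d4, excluded by containment-plus-AND semantics, outrank d1 here because overlap and frequency are traded off rather than used as filters.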
We present a ranking method that considers both the spatial overlap of a document with a query and the frequencies of the query keywords in the document in order to compute a relevance score for the document. We present a new scoring mechanism to calculate the spatial relevance of a document with respect to a query, and propose a method to combine the spatial relevance and the textual relevance.

We also extend our ranking model and the proposed index structure to a setting in which query and document locations are (geographical) points rather than geographical regions. We show how to use spatial decay functions to define and model spatial relevance and spatial-keyword relevance for this new setting.

Given a good ranking method for documents, a natural question is how to efficiently index and search location-specific documents. There are several challenges. First, space and text are two totally different data types requiring different index structures. For instance, conventional text engines are set-oriented, while location indexes are usually based on two-dimensional Euclidean space. Second, the ranking and search processes should not be separated; otherwise, the ranking process will rank all the candidate documents (instead of only the relevant documents), making query processing inefficient. Third, the meaning of spatial relevance and textual relevance, and a way to combine them using the proposed index structure, have to be defined precisely. Finally, it should be easy to integrate the index structure into existing search engines.

To solve the above problems, we propose a new hybrid index structure called the Spatial-Keyword Inverted File ("SKIF" for short), which can handle the spatial and textual features of data simultaneously and in a similar manner. SKIF is an inverted file capable of indexing and searching both textual and spatial data in a similar, integrated manner. Towards this end, the space is partitioned into a number of grid cells, and each cell is treated similarly to a textual keyword. We describe the structure of SKIF, and present two efficient algorithms for answering a ranking query using SKIF.

Finally, we evaluate the efficiency and accuracy of our proposed methods using a comprehensive set of experiments. We have conducted an experimental evaluation on both real and synthetic datasets to show that our techniques can answer ranking queries efficiently and accurately.
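Since SKIF's central idea is to treat grid cells like keywords, a minimal sketch of that mapping may help. The grid resolution, cell numbering, and posting payloads below are assumptions for illustration only; the actual SKIF structure and its query algorithms are defined in Section 3.2.2.

```python
# Minimal sketch of SKIF's core idea: rasterize each document's MBR into grid
# cells and index the cell ids exactly like textual terms, in one inverted file.
# Grid resolution and posting payloads here are illustrative assumptions.
from collections import defaultdict

GRID = 4  # a 4x4 grid over the space; cell id = row * GRID + col

def cells(mbr):
    """Return ids of the grid cells overlapped by an MBR (x1, y1, x2, y2) in [0, 4)^2."""
    x1, y1, x2, y2 = mbr
    return [r * GRID + c
            for r in range(int(y1), int(y2) + 1)
            for c in range(int(x1), int(x2) + 1)]

index = defaultdict(list)  # term (keyword or cell id) -> postings [(doc, weight)]

def add_document(doc_id, keywords, mbr):
    for k, f in keywords.items():
        index[("kw", k)].append((doc_id, f))
    for c in cells(mbr):
        index[("cell", c)].append((doc_id, 1))  # weight could be the overlap area

add_document("d3", {"soccer": 1, "league": 1}, (1.0, 1.0, 2.0, 2.0))
add_document("d5", {"soccer": 2, "league": 2}, (2.5, 0.5, 3.5, 1.5))
# A query region is rasterized the same way, so a single inverted-file
# traversal can score spatial and textual "terms" uniformly.
```

The payoff of this uniformity is that decades of inverted-file machinery (accumulators, list ordering, early termination) apply unchanged to the spatial dimension.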
1.2 Temporal-Textual Web Search

For the first time in human history, there is a medium, the World Wide Web, that continuously documents our lives as they happen. It would be a waste to be able to search this rich history only by keywords and not by time. The web is no longer a mere snapshot of our history; neither should its search scheme be. Unfortunately, the content of web pages is currently not time-tagged. This will likely become common practice in the near future (the same way that pages are now being geo-tagged), and until then, the many techniques for automatic extraction of temporal information from documents [VM09], [We05], [AGBY07] will serve the purpose. Consequently, the challenge is how to enable efficient search of time-tagged web pages. Or, even more fundamentally, how should the results of a keyword-time search be ranked? Should a page with more textual similarity to the query keywords be ranked higher, or one with less textual similarity but higher temporal similarity? What does temporal similarity or relevance even mean, and how should it be quantified?

Several search engines have already started to exploit time in their search process. For instance, Google has added a feature called search result options that allows users to filter their search results by a custom time interval. For these search engines, the time attribute is usually the publication time (or the last-modified time) of a document. The main assumption here is that the time-tag is a single time point, which simplifies the temporal "retrieval" problem to a temporal "filter" problem. That is, the final relevance (and ranking) of a document is still determined by the document's textual relevance to the query keywords, and the temporal feature of the document is only used to filter documents in or out of the final result set. To illustrate why this is a simpler problem, consider the analogous situation where each web document contains a single keyword. In this case, for each web document, we would simply need to check whether its keyword exists in the list of query keywords; there would be no need for relevance metrics such as tf-idf or indexing techniques such as inverted files. Similarly, if we relax the assumption that the temporal aspect of a web document is represented by a single time point, and instead allow one or more time points and/or intervals, then we need temporal similarity metrics and index structures to efficiently estimate the temporal relevance of a web document to one or more query time points/intervals. Note that given the sophistication of web content, it is no longer feasible to represent the temporal content of many web documents with a single time point. For instance, a Wikipedia page describing the biography of an individual is very likely to have several references to time, each relating to an event or achievement during the person's lifetime. Even if we consider representing only the published and modified times of a web document (instead of its content time), a single time point is simplistic, as the publication and/or modification of a web document evolves through multiple time intervals or time points.

This work considers a new kind of top-k query that takes into account both the temporal information in a document's content and the textual keywords of the document. An example query may search for "Lakers Celtics Rivalry between years 1984 and 1986". We call this type of query a Temporal-Textual Retrieval ("tempo-textual" for short) query. The answer to a top-k tempo-textual query is a list of k documents ranked according to a scoring function that combines their textual and temporal relevances to the query keywords and the query timestamp, respectively. The tempo-textual query is different from queries that retrieve textually relevant documents within a time range, and from queries that rank relevant documents based only on their temporal feature.

Figure 1.2: A tempo-textual query on documents with time information. [Panel (a) shows each document's time-interval and the query's temporal expression Q on a 1980-2010 timeline; panel (b) lists each document's title, time-interval and query-keyword frequencies:]

  Document   Title               Time        Iraq   war
  d1         Iraq Invasion       2003-2010   12     4
  d2         Iran-Iraq War       1980-1988   10     9
  d3         Gulf War            1990-1991   5      4
  d4         Kurdish Civil War   1991-1997   11     6
  d5         Al-Anfal Campaign   1986-1989   7      0
  d6         Reagan Presidency   1981-1985   0      0

Example 3: Consider a collection of web pages, each one describing an event using textual keywords and also including one (or more) time-intervals (e.g., days, months, years or even decades). Suppose Tom is a student researching the history of the war in Iraq.
He submits a query to the system with two textual keywords, "Iraq" and "war", and specifies "1982 - 1992" as the temporal expression.

Suppose there exist six documents in our collection with temporal information close (in time) to the query's temporal expression. Figure 1.2(a) shows these documents' temporal information plus the query's temporal expression (for better readability, time-intervals are shown at different levels). Figure 1.2(b) shows each document's title and the frequencies of the two query keywords in each document. Tom wants to find the top-3 documents relevant to the query.

In this example, document d6 is not a relevant document, since it does not contain any of the query keywords. Document d1 is not very relevant to the query either, since its time-interval is very far from the query's time-interval (interestingly, this document is very likely to be the most relevant result when a regular textual search is used). The other four documents overlap (in time) with the query's time-interval and contain at least one of the query keywords, so all could potentially be of interest to the user. However, it is not clear how to measure the relevance of these documents and rank them relative to each other. For instance, it is clear that document d2 should have a high relevance, since its time-interval is very similar to the query's time-interval (it has a significant overlap) and it contains both query keywords. On the other hand, it is not clear how relevant document d5 is to the query: although its time-interval is contained in the query time-interval, it lacks one of the query keywords. The other two documents, d3 and d4, both have the two query keywords and both overlap the query time-interval (with various periods of overlap).

In this work, we first present a simple baseline indexing technique to answer temporal-textual queries. The baseline approach uses only one textual index structure and calculates the temporal relevance on the fly. We also introduce three new hybrid index structures that index documents based on both their temporal and textual features. Using one textual index structure (inverted file) and one temporal index structure (interval-tree), we present three different ways of combining a textual index and a temporal index to answer temporal-textual queries. We argue that these techniques are not efficient in all circumstances: their performance is highly dependent on the distribution/selectivity of the data (e.g., the distribution of textual keywords and temporal intervals in the corpus) and on the type of queries issued.

Next, we introduce our novel tempo-textual retrieval framework based on the classical vector space model. Following the same intuitions and techniques used in textual search, and inspired by the tf-idf scheme in the textual context, we define a new scoring scheme called temporal tf-idf for the temporal context. Using textual and temporal tf-idf, we define a new tempo-textual relevance score and ranking.

The third contribution of this work is a novel tempo-textual index structure. Designing an efficient index structure for both textual and temporal data has several challenges. First, text and time are two totally different data types requiring different index structures. An ideal index should be able to handle both the temporal and the textual data simultaneously and in an integrated fashion.
Second, the meaning of temporal relevance and textual relevance, and how to combine them into one aggregate relevance score using the index structure, have to be defined precisely. Moreover, the index structure needs to support cases where the influences of text and time on the overall relevance differ. Third, the ranking and search processes should not be separated; otherwise, the ranking process will rank all the candidate documents (instead of only the relevant documents), making query processing inefficient. Last but not least, it should be straightforward to integrate the proposed index structure into existing search engines.

In this dissertation, we propose a new hybrid index structure called the Tempo-Textual Inverted Index ("T2I2" for short) for efficient search and ranking of time and text in a unified manner. T2I2 is an inverted index capable of indexing and searching both textual and temporal data in a similar, integrated manner. Towards this end, the time domain is divided into a number of consecutive cells, and each cell is treated similarly to a textual keyword. We present the structure of T2I2, and discuss two efficient algorithms for answering tempo-textual queries using T2I2.

Overall, we present a complete framework of indexing, query processing and ranking for answering tempo-textual queries. Through experimental evaluation, we show that the proposed framework is both efficient and accurate.
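As with SKIF's spatial cells, the cell-as-keyword idea for time can be sketched in a few lines. The yearly granularity and the overlap-fraction payload below are illustrative assumptions; Chapter 4 defines the actual temporal tf-idf weights (Section 4.4.1) and the T2I2 layout (Section 4.4.2).

```python
# Sketch of T2I2's core idea: chop the time domain into consecutive cells
# (here, years) and index each cell id like a keyword. Each posting stores
# the fraction of the document's interval falling in that cell -- an assumed
# payload standing in for the temporal tf weight of Section 4.4.1.
from collections import defaultdict

index = defaultdict(list)  # cell (year) -> [(doc_id, fraction)]

def add_interval(doc_id, start, end):
    length = end - start + 1
    for year in range(start, end + 1):
        index[year].append((doc_id, 1.0 / length))

add_interval("d2", 1980, 1988)  # Iran-Iraq War
add_interval("d5", 1986, 1989)  # Al-Anfal Campaign

def temporal_score(query_start, query_end):
    """Accumulate per-document scores by scanning only the queried cells."""
    scores = defaultdict(float)
    for year in range(query_start, query_end + 1):
        for doc_id, frac in index.get(year, []):
            scores[doc_id] += frac
    return dict(scores)

print(temporal_score(1982, 1992))  # d2 scores 7/9; d5 scores 4/4 = 1.0
```

Because the query interval is rasterized into the same cells, temporal scoring becomes an ordinary posting-list traversal and can be merged with the textual lists in one pass.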
1.3 Social-Textual Web Search

While in their early stages search engines focused mainly on searching and retrieving relevant documents based on their content (e.g., textual keywords), newer search engines and new studies have started to focus on context alongside content. For instance, [SG05] proposed a search engine that combines traditional content-based search with context information gathered from users' activities. More recently, search engines have started to make search results more personalized. With personalized search, search engines consider the searcher's preferences, interests, behavior and history. The final goal of personalized search, as well as of other techniques studying users' preferences and interests, is to make the returned results more relevant to what the user is actually looking for.

The emergence of social networks on the web (e.g., Facebook and Google Plus) has caused the following key changes. First, social networks reconstruct friendship networks in the virtual world of the web. Many of these virtual relationships are good representatives of their actual (friendship) networks in the real world. Second, social networks provide a medium for users to express themselves and freely write about their opinions and experiences. The social data generated for each user is a valuable source of information about that user's preferences and interests. Third, social networks create user identifiers (identities) for people on the web. Users of a social network such as Facebook have a unique identity that can be used in many places on the web. Not only can such users use their Facebook identities on the social network itself, but they can also use that identity to connect and interact with many other websites and applications. Along the same lines, social networks such as Facebook and Google Plus provide utilities for other websites to integrate with them directly, enabling users of the social network to interact directly with those websites and web documents using their social network identity. For instance, a web document can be integrated into Facebook (using either Facebook Connect or instant personalization; see https://developers.facebook.com/docs/guides/web/), allowing every Facebook user to perform several actions (e.g., LIKE, RECOMMEND, SHARE) on that document. Finally, many search engines are starting to connect to social networks and allow the users of those social networks to be users of the search engine. For instance, the Bing search engine is connected to Facebook, and hence users can log in to Bing with their Facebook identities to perform their searches.

The above developments inspired us to study a new framework for search personalization. In this dissertation, we propose a new approach to performing personalized search using users' social actions (activities) on the web. We utilize the new social information mentioned above (users' social activities, friendships, user identities, and users' interactions with web documents) to personalize the search results generated for each user. We call this new approach PerSocialized search, since it uses social signals to personalize the search. While a traditional personalized search system maintains information about users and the history of their interactions with the system (search history, query logs), a PerSocialized search system maintains information about users, their friendships (relations) with other users, and their social interactions with documents (via social actions).

Recently, [MS] conducted a complete survey on the topic of social search and the various existing approaches to it. As mentioned in [MS], several definitions of social search exist. One definition is the way individuals make use of peers and other available social resources during search tasks [EKP09]. Similarly, [VK09] defines social search as using the behavior of other people to help navigate online, driven by the tendency of people to follow other people's footprints when they feel lost. A third definition, by [AHSS04], is searching for similar-minded users based on the similarity of bookmarks. Finally, [EC08]'s definition of social search includes a range of possible social interactions that may facilitate information seeking and sense-making tasks: utilizing social and expertise networks; employing shared social work spaces; or involving social data-mining or collective-intelligence processes to improve the search process. For us, social search focuses on utilizing the querying user's, as well as her friends', social actions to improve conventional textual search. By integrating these social actions/signals into the textual search process, we define a new search mechanism: PerSocialized search. Our main goal is to prove our hypothesis that these social actions (from the querying user and her friends) are relevant and useful for improving the quality of search results.

Towards this end, we propose a new relevance model called the PerSocial relevance model to determine the social relevance between a user and a document. The PerSocial model is developed in three levels, where each level complements the previous one. First, we use the social actions of a user on documents as implicit judgments/ratings of those documents by the user. For instance, if a Facebook user u performs any type of social action (e.g., LIKE, SHARE) on document d, she implicitly expresses her positive opinion about d.
As a result, d should get a slightly higher score for queries relevant to d and issued by u. In Section 5.3, we show that using social actions from each user and boosting documents' scores with such actions (level 1) by itself improves the accuracy of search results. Second, it is both intuitive and proven [KSJ09] that people have interests very similar to their friends'. People also tend to trust the opinions and judgments of their friends more than those of strangers. As a result, not only are the documents with direct social actions by user u relevant to u, but the documents with social actions performed by u's friends are relevant to u as well. Hence, we adjust (increase) the weights given to those documents for relevant queries issued by u. As we discuss in more detail in Section 5.2, many parameters, such as the strength of the social connections between users and the influence of each user, must be incorporated into the model to generate the most accurate results. In Section 5.3, we show that using social signals from friends improves the search results significantly. Furthermore, we show that using a combination of a user's data and her friends' data generates the best results. Finally, web documents are often well connected to each other. We argue that the social features of each document should be dynamic, meaning that a document's social actions/signals can and should be propagated to other adjacent documents. A user's interest in a document (shown by a social action such as LIKE) can often imply the user's interest in other relevant documents, which are often connected to the original document. Thus, we use connections among documents to let social scores flow among them, generating a larger document set with more accurate PerSocial relevance scores for each user. A sketch of the three levels stacked together follows this paragraph.

In sum, the major contribution of our work (in the social domain) is to propose a model and build a system that utilizes users' social actions to personalize web search. We propose a new relevance model to capture the relevance between documents and users based on users' social activities. We model three levels of personalization based on three sets of social signals and show how each level improves web search personalization. In addition, we propose three new ranking approaches to combine the textual and social features of documents and users. Furthermore, we develop a PerSocialized search engine, dubbed PERSOSE, to perform PerSocialized search on real data with real users. Using PERSOSE, we conduct a comprehensive set of experiments using 14 million Wikipedia documents as our document set and real Facebook users as our users. Through these experiments, we show that a user's social actions, her friends' social actions, and social expansion of documents (all three levels of social signals) improve the accuracy of search results.
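The three levels can be read as successive additions to one scoring function. The rendering below is hypothetical: the weights, the tie-strength factor, and the one-hop propagation are assumptions made for illustration, not the model of Section 5.2, which defines the actual PerSocial relevance scores.

```python
# Hypothetical sketch of the three PerSocial levels stacked on a textual score.
# All weights and the one-hop propagation scheme are illustrative assumptions.

def persocial_score(doc, user, textual, actions, friends, links,
                    w_user=0.2, w_friend=0.1, w_prop=0.05):
    # Level 1: the user's own actions (LIKE, SHARE, ...) on the document.
    s = w_user * actions.get((user, doc), 0)
    # Level 2: actions by the user's friends, weighted by tie strength.
    s += w_friend * sum(strength * actions.get((f, doc), 0)
                        for f, strength in friends.get(user, []))
    # Level 3: social signals propagated from documents linked to this one.
    s += w_prop * sum(actions.get((user, neighbor), 0)
                      for neighbor in links.get(doc, []))
    return textual + s

actions = {("alice", "d1"): 1, ("bob", "d2"): 1}   # user performed an action
friends = {"alice": [("bob", 0.8)]}                # (friend, tie strength)
links = {"d2": ["d1"]}                             # document adjacency
print(persocial_score("d2", "alice", textual=0.5,
                      actions=actions, friends=friends, links=links))
# 0.5 + 0.1*0.8 (friend bob acted on d2) + 0.05 (alice acted on linked d1) = 0.63
```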
1.4 Road Map

The remainder of this dissertation is organized as follows. Chapter 2 reviews the related work. In Chapters 3, 4 and 5, we show how to combine the spatial, temporal and social aspects of the web, respectively, with textual web search. Finally, in Chapter 6, we present conclusions and future work.

Chapter 2
Related Work

In this chapter, we review the related work. First, in Section 2.1, we survey existing studies on adding location to web search. Next, in Section 2.2, we review existing work on temporal-textual search. Finally, in Section 2.3, we examine related studies on utilizing social signals in web search.

2.1 Spatial-Textual Web Search

Existing index structures for handling spatial-keyword queries can be categorized into two broad groups: 1) individual index structures, and 2) hybrid index structures.

Individual index structures use one index for each set of data features (space or text). The index structures of choice for spatial data are usually the grid, the R*-tree, or the quadtree; for text, inverted files are often used. Using separate index structures, documents satisfying the textual part of the query and documents satisfying the spatial part of the query are retrieved separately, using the textual and spatial indexes respectively, and the final result is the merge of the two result sets. An example of this method is the inverted file and R*-tree double index presented in [ZXW+05]. An improved variation of this approach is to filter the results based on one feature first and then use the second index structure on only the results generated in the first step, rather than on the entire collection. The main problem with these methods is that each one-feature search usually returns a huge number of results.

Hybrid index structures combine the textual and spatial indexes into one index structure. Two basic designs introduced in [ZXW+05] are the inverted file-R*-tree and the R*-tree-inverted file. The inverted file-R*-tree is essentially an inverted file on top of R*-trees. With this structure, the inverted file is built first, and then R*-trees are built on each inverted list, indexing the spatial features of the documents (objects) in the list. For a given spatial-keyword query, the query keywords are filtered using the inverted file, and then the R*-trees corresponding to those keywords are traversed to search/filter based on the spatial features of the data. This structure performs well only when the number of keywords is very small: an increase in the number of keywords results in traversing multiple R*-trees separately and combining the results from those trees, which is very costly. The R*-tree-inverted file is an R*-tree on top of inverted files. In this structure, an R*-tree is first built on all the documents' (objects') locations, and then inverted lists are generated for the keywords appearing in the leaf nodes of the tree. For a given spatial-keyword query, the R*-tree leaf nodes intersecting the query location are retrieved first, and then all the corresponding inverted lists are traversed. The main disadvantage of this method is its spatial filtering step, which usually generates many candidate objects.

Two other structures are those presented in [CSM06] and [VJJS]. Both studies are conceptually similar to the work in [ZXW+05], with the main difference that they use a grid as their spatial index structure; both also use an inverted file as the textual index. In both approaches, query processing requires two stages: although each approach uses one hybrid index, the spatial and textual search processes remain separate. Both discuss hybrid (spatial and textual) relevance ranking only in a very abstract (high-level) form, falling short of explaining in detail how this ranking would be implemented and how the proposed index structures would be used to provide a hybrid relevance ranking.
Moreover, neither of the papers presents any real experiments (even in the simplest form) to evaluate the accuracy of any relevance ranking scheme. Both approaches are also based on AND semantics, meaning that all the keywords in the query must appear in each result. The main difference between the two approaches ([CSM06] and [VJJS]) is that the work in [CSM06] focuses more on system (and algorithmic) issues, specifically scaling to very large datasets and high query loads. Although the work in [VJJS] is very different from what we propose in this proposal, the authors of [VJJS] suggested, as a future direction, a closer integration of textual and spatial indexing by using spatial cell identifiers as part of the textual index.

Several improved hybrid index structures have been introduced more recently. In [HHLM07], a hybrid index structure called the KR*-tree is proposed. The KR*-tree extends the R*-tree-inverted file structure by augmenting each node of the R-tree with the list of all the keywords appearing in the objects of that subtree. At query time, the KR*-tree is traversed, and for each node, not only is the spatial intersection with the query region checked, but the node is also checked for the presence of the query keywords. The results of the query are the objects contained in the query region that have all the query keywords (AND semantics). The KR*-tree is only good for spatial objects with a small number of keywords. Moreover, the KR*-tree cannot be used in the context of location-based web search, since web documents typically contain a large number of keywords. Textual relevance ranking (and therefore spatial-keyword relevance ranking) cannot be supported either.

Another hybrid index structure, called the IR2-tree, is presented in [DFHR08]; it combines the R-tree with signature files. With this method, each node of the R-tree is augmented with a signature representing the union of the keywords (text) of the objects in the subtree rooted at that node. Similar to the KR*-tree, the IR2-tree can identify the subtrees that do not contain the query keywords and eliminate them from the search process early on. The IR2-tree is more efficient than the KR*-tree, since the auxiliary data structure augmented to each node is much smaller (and more efficient). Similar to the KR*-tree, the input data here is a set of spatial objects, each associated with a number of keywords. Using the IR2-tree, the final result set is a ranked list of objects containing all the query keywords, in order of their distances to the query point. The IR2-tree performs better than the index structures in [ZXW+05] and [HHLM07], but it still has shortcomings. At times, the signature files are unable to eliminate the objects that do not satisfy the query keywords (false hits), which results in loading and reading more objects and is costly. Furthermore, performance gets worse when the number of query keywords increases or when the final result is very far from the query point. Another problem with the IR2-tree is that the signature files need to be loaded into memory each time a node is visited. Finally, since there is no intuitive way to use signature files for textual relevance ranking, the IR2-tree cannot really perform a meaningful ranking.

In summary, none of the existing methods mentioned here is designed to support spatial-keyword relevance ranking; if there is a ranking, it is usually based on spatial relevance alone (either distance or overlap).
The existing methods use AND semantics and hence cannot return results with partial relevance (objects with some, but not all, of the keywords). Also, none of the existing methods uses both the spatial and the textual aspects of the data simultaneously and in a similar manner. At best, they use one feature for pruning the result set and one feature for the actual search (e.g., the IR2-tree uses signature files to prune the data based on text and uses the R-tree to do the actual search). Finally, for all the mentioned methods there are inevitable cases in which the actual objects have to be read from disk and reevaluated. This happens when the search is based on one feature (e.g., space in the IR2-tree) and pruning on the other feature (e.g., text in the IR2-tree) is unable to prune all the non-qualifying objects. In this case, after reaching a candidate object using one feature (e.g., reaching a leaf node in the IR2-tree), the actual object has to be read and rechecked to see whether it qualifies based on the other feature (e.g., whether the object's textual description matches the query keywords). In other words, in existing methods the index structure by itself is not sufficient for query processing, and access to the actual objects is often needed.

Very recently, another hybrid index structure, called the IR-tree, was presented; it also combines the R-tree with inverted files [CJW09]. With this index structure, each node of the R-tree is augmented with an inverted file for the objects contained in the subtree rooted at that node. The IR-tree is the approach most similar to our work, since it considers space and text together. It is a single index structure requiring only one step to process the query, and it supports ranking of documents based on both spatial and textual features (although not very accurately). Nevertheless, the IR-tree has some major problems. First, one inverted file needs to be stored and possibly accessed for each node in the tree. For web documents, the total number of documents and the total number of keywords are very large, resulting in a huge number of nodes in the tree and large inverted files for each node. During the search process, the application needs to load the entire inverted file of each visited node into memory, which causes extra I/Os. Another problem is that during the search process, the IR-tree often needs to visit nodes that contain no relevant results. Finally, it is not clear whether the ranking proposed in [CJW09] is an accurate spatial-keyword relevance ranking (see Section 4.5).

There are many other relevant topics, such as approximate keyword search on spatial data [ABL10, WDL09]; the m-closest keywords (mCK) query, which returns the closest objects containing at least m keywords [ZCM+09]; location-aware prestige-based text retrieval, which takes into account the inter-relationships among spatial objects [CCJ10]; extraction of geographical information [McC01, DGS00]; geo-coding of documents' locations [AHSS04]; and geographic crawling [GLM06]. In this proposal, we focus only on index structures, relevance ranking, and search algorithms for spatial-keyword queries.

2.2 Temporal-Textual Web Search

There are several studies on versioned text data such as web archives. In time-travel text search [BBNW], [BBNW07], the goal is to identify and rank relevant documents as if the collection were in its state as of the query time.
In these methods, the final score of each document is usually calculated by aggregating the scores of the document over the query time interval. In other words, the final top-k results are the most textually relevant documents during the query interval (or as of the query time) [HLY07]. Another type of query on versioned text collections is durable top-k search, where the goal is to find the documents that are consistently in the top-k throughout the sequence of rankings defined by the query time-interval and the query keywords [LHMBB10]. [NN06] focused on the simpler problem of temporal text-containment queries, which ask for all versions of the documents that contained one or more particular words at a particular time; such queries ignore relevance scoring of the results. All of the above approaches work on the document collection as a whole rather than on specific keyword-temporal queries. Moreover, none of them produces final scores and rankings based on both textual relevance and temporal relevance.

A few other methods use the publication time of documents to improve relevance ranking. [LC03] presents a language modeling technique that factors in the publication time of documents in order to favor recent documents in its relevance ranking. [Cor05] also takes into account the publication time of documents (as well as their interlinkage) to rank news articles. On a separate but related topic, [DD10], [DSD11] and [EG11] focus on the freshness of web documents and study the integration of freshness/recency into the relevance ranking model.

In [Pas08], the temporal features of documents are used to help answer time-related questions in open-domain question answering systems. [DG08] proposed a more general framework that automatically detects the important time intervals likely to be of interest for time-sensitive queries and leverages documents published within those intervals. In [KC05], a temporal document retrieval model for business news archives is presented. In [AG06], a new method is presented for clustering and exploring search results based on temporal expressions within the text.

Over the past decade, several automatic methods for extracting temporal information from web documents have been proposed. Using natural language processing (NLP) techniques, specifically information extraction methods, it is possible to identify words or expressions that convey temporal meaning (e.g., "today", "a long time ago") and use these to date documents [We05]. In [AGBY07], a three-step approach called the document annotation pipeline is described; it extracts temporal information based on the TimeML standard described in [tim]. TimeML has become the standard markup language for events and temporal expressions in natural language. There are several other tools that can extract temporal information from documents, such as Lingua::EN::Tagger [lin] and TempEX [MW00]. For a more detailed study of temporal information extraction techniques, we refer readers to [VM09] and [We05].

Finally, a few approaches consider the temporal information in documents' content for relevance ranking and retrieval purposes. In the work of [BY05], the goal is to search for information that points to the future. The presented retrieval model, called future retrieval, uses a simple probability model for future events based on a set of time segments and a simple ranking function.
In [JLZW08], a temporal search engine (called TISE) supporting content-time retrieval for web pages is presented. TISE extracts temporal features from web pages through natural language processing techniques and ranks web pages using a simple linear combination of their textual relevance, temporal relevance and importance. This is a short paper and does not include detailed information about the indexing and query processing steps or how efficient those steps are. In addition, the proposed textual relevance function is basic and does not generate accurate results. In [JCZ+11], the same authors propose several hybrid index structures for temporal-textual web search. No ranking function or relevance model is discussed in either paper.

The studies closest to our work are [ABE09] and [BBAW10]. [ABE09] describes how to integrate temporal expressions into a language modeling approach; two different approaches (LMF and LMW) are presented to leverage temporal expressions and improve retrieval effectiveness. In a similar paper, [BBAW10] studies how to integrate temporal expressions into a language model retrieval framework, focusing mostly on the uncertainty in the meaning of temporal expressions. Both of these studies rank documents according to the estimated probability of generating the query, textually and temporally. There are a couple of major differences between these two approaches and our proposed study. Both are based on probabilistic language models, while our proposal is based on the classical vector space model; as [BY05] noted in its conclusion, a probabilistic model does not make much sense for searching the past, as events in the past did (almost always) happen. More importantly, while our proposed solution provides a complete framework for temporal-textual indexing, query processing and relevance ranking, the aforementioned approaches focus only on relevance ranking, and it is not clear how efficiently their indexing and search processes perform. Furthermore, as we explain in Section 4.4.3, our proposed solution can be easily integrated into existing systems, while the integration mechanism for [ABE09] and [BBAW10] does not seem trivial.

2.3 Social-Textual Web Search

There are several groups of related studies on the application of social networks in search. In the first group, people are identified through their social networks and contacted directly to answer search queries. In other words, queries are sent directly to individuals, and answers to the queries come from the people themselves [HK10]. In this approach, called search services, people and their networks are indexed, and a search engine has to find the most relevant people to send the queries/questions to. There are also systems based on the synchronous collaboration of users in the search process. HeyStacks [SBCO09], an example of such a system, supports explicit/direct collaboration between users during search: it enables users to create search tasks and share them with others. HeyStacks is a complementary (not comprehensive) search engine that needs to work with a mainstream search engine to be useful. In [BBC12], the authors show how social platforms (such as Facebook and LinkedIn) can be used for crowdsourcing search-related tasks. They propose a new search paradigm that treats crowds as first-class sources for the information-seeking process, and present a model-driven approach for the specification of crowd-search tasks.
Crowdsourcing search tasks, or crowdsearching, is a fairly new topic focusing on the active and explicit participation of human beings in the search process.

Personalized search has been the topic of many studies in the research community. A search engine can either explicitly ask users for their preferences and interests [CNPK05] or, more commonly, use data sources related to users' search history, such as query logs and click-through data. The most common data source used in search personalization is users' web (query) log data. Recently, a few studies have started to exploit data from online social systems to infer users' interests and preferences. [XBF+08] and [NM07] exploit each user's bookmarks and tags on social bookmarking sites and propose frameworks to utilize such data for personalized search. In a similar paper [WJ10], the authors explore users' public social activities from multiple sources, such as blogging and social bookmarking, to derive users' interests and use those interests to personalize the search.

In [CZG+09], the authors investigate a personalized social search engine based on users' relations. They study the effectiveness of three types of social networks: familiarity-based, similarity-based, and a combination of both. In [YLL10], a short paper, the authors propose two search strategies for performing search on the web: textual relevance (TR)-based search and social influence (SI)-based search. In the former, the search is first performed according to the classical tf-idf approach, and then for each retrieved document the social influence between its publisher and the querying user is computed; the final ranking is based on both scores. In the latter, the social influence of users on the querying user is calculated first and users with high scores are selected; then, for each document, the final ranking score is determined based on both TR and SI. In a set of similar papers [GCK+10, GZC+10], the authors propose several social network-based search ranking frameworks. The proposed frameworks consider both document contents and the similarity between a searcher and document owners in a social network. They also propose a new user similarity algorithm (MAS) to calculate user similarity in a social network. These papers focus mainly on user similarity functions and how to improve those algorithms. The majority of their experiments are limited to a small number of queries on YouTube only, and their definition of a relevant document is somewhat ad hoc: a relevant (interesting) result is a result (video) whose category is similar or equal to the dominant category of videos that the searcher has uploaded.

With regard to commercial search engines, Bing and, more recently, Google have started to integrate Facebook and Google+, respectively, into their search processes. For some search results, they show the query issuer's friends (from his/her social network) who have liked or +1'd that result. Their algorithms are not public, and it appears that they only display the likes and +1's, without the actual ranking being affected.

There exists a relevant but somewhat different topic: folksonomies. Tags and other conceptual structures in social tagging networks are called folksonomies. A folksonomy is usually interpreted as a set of user-tag-resource triplets. Existing work on social search over folksonomies mainly aims at improving the search process over social data (tags and users) gathered from social tagging sites [RKES11][SCK+08][YBLS08].
In this context, relationships between a user and a tag, and also between two tags, are of significant importance. Ranking models proposed in the context of folksonomies include [GCF09, HJSS06]. Studies on folksonomies and/or with a focus on social tags/bookmarking face the same limitations as user-based tagging. The main issue with user tagging is that results are unreliable and inconsistent due to the lack of control and consistency in user tags [MS].

Chapter 3
Spatial-Textual Web Search

This chapter is organized as follows. In Section 3.1, we define a set of terminologies and also discuss some background studies. Next, in Section 3.2 we explain our spatial-keyword relevance ranking model as well as the proposed hybrid index structure to answer spatial-keyword queries. We also show how to extend our approach to a setting in which document (and query) locations are geographical points (and not regions). Section 3.3 presents a comprehensive set of experiments to evaluate the efficiency of our proposed index structure as well as the effectiveness of our ranking model.

3.1 Preliminaries

3.1.1 Problem Definition

We assume a collection $D = \{d_1, \ldots, d_n\}$ of $n$ documents (web pages). Each document $d$ is composed of a set of keywords $K_d$ and a set of locations $L_d$. Each location is represented by a minimum bounding rectangle (MBR), although any other arbitrary shape can be used.

Spatial-keyword query: A spatial-keyword query is defined as $Q = \langle K_q, L_q \rangle$, where $L_q$ is the spatial part of the query, specified as one or more minimum bounding rectangles, and $K_q$ is a set of keywords in the query.

Spatial relevance: Spatial relevance between a document $d$ and the query $q$ is defined based on the type of the spatial relationship that exists between $L_d$ and $L_q$. We focus only on the overlap relationship, although our approach can easily be extended to cover other spatial relationships. Subsequently, we define spatial relevance as follows: a document $d$ and the query $q$ are spatially relevant if at least one of the query's MBRs has a non-empty intersection with one of the document's MBRs, i.e., $L_q \cap L_d \neq \emptyset$. The larger the area of the intersection is, the more spatially relevant $d$ and $q$ are. We denote the spatial relevance of document $d$ to query $q$ by $sRel_q(d)$.

Textual relevance: A document $d$ is textually relevant to the query $q$ if there exists at least one keyword belonging to both $d$ and $q$, i.e., $K_q \cap K_d \neq \emptyset$. The more keywords $q$ and $d$ have in common, the more textually relevant they are. We represent the textual relevance of document $d$ to query $q$ by $kRel_q(d)$.

Spatial-keyword relevance: A document $d$ is spatial-keyword relevant to the query $q$ if it is both spatially and textually relevant to the query $q$. Spatial-keyword relevance can be defined by a monotonic scoring function $F$ of the textual and spatial relevances. For example, $F$ can be the weighted sum of the spatial and textual relevances: $F_q(d) = \alpha_s \cdot sRel_q(d) + (1 - \alpha_s) \cdot kRel_q(d)$, where $\alpha_s$ is a parameter assigning relative weights to the spatial and textual relevances. The output of function $F_q(d)$ is the spatial-keyword relevance score of document $d$ and query $q$, and is denoted by $skRel_q(d)$. In Section 3.2.1 we show in detail how to calculate spatial-keyword relevance using our proposed index.

Spatial-keyword search: A spatial-keyword search identifies all the documents (web pages) that are spatial-keyword relevant to $q$. The result is the list of the top-k most spatial-keyword relevant documents, sorted by their spatial-keyword relevance scores.
The parameter $k$ is determined by the user.

3.1.2 Background

3.1.2.1 tf-idf Score

All current textual (keyword) search engines use a similarity measure to rank and identify potentially (textually) relevant documents. In most keyword queries, a similarity measure is determined by using the following important parameters:

- $f_{d,k}$: the frequency of keyword $k$ in document $d$
- $\max(f_{d,k})$: the maximum value of $f_{d,k}$ over all the keywords in document $d$
- $\bar{f}_{d,k}$: the normalized $f_{d,k}$, which is $\frac{f_{d,k}}{\max(f_{d,k})}$
- $f_k$: the number of documents containing one or more occurrences of keyword $k$

Using these values, three monotonicity observations are enforced [ZM06]: (1) less weight is given to the terms that appear in many documents; (2) more weight is given to the terms that appear many times in a document; and (3) less weight is given to the documents that contain many terms. The first property is quantified by measuring the inverse of the frequency of keyword $k$ among the documents in the collection. This factor is called inverse document frequency, or the idf score. The second property is quantified by the raw frequency of keyword $k$ inside a document $d$. This is called term frequency, or the tf score, and it describes how well that keyword describes the contents of the document [BYRN99, SB97]. The third property is quantified by measuring the total number of keywords in the document. This factor is called document length.

Figure 3.1: Inverted file for Example 1.
keyword $k$ | $f_k$ | Inverted list for $k$
soccer      | 5     | ⟨1,1⟩ ⟨2,1⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨5,1⟩
league      | 5     | ⟨1,0.8⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨5,1⟩ ⟨6,1⟩

A simple and very common formula to calculate the similarity between a document $d$ and the query $q$ is shown in Equation 3.1.

$$w_{q,k} = \ln\left(1 + \frac{n}{f_k}\right); \quad w_{d,k} = \ln(1 + f_{d,k}); \quad W_d = \sqrt{\sum_k w_{d,k}^2}; \quad W_q = \sqrt{\sum_k w_{q,k}^2}; \quad S_{q,d} = \frac{\sum_k w_{d,k} \cdot w_{q,k}}{W_d \cdot W_q} \quad (3.1)$$

The variable $w_{d,k}$ captures the tf score, while the variable $w_{q,k}$ captures the idf score. $W_d$ represents the document length and $W_q$ is the query length (which can be neglected since it is a constant for a given query). Finally, $S_{q,d}$ is the similarity measure showing how relevant document $d$ and query $q$ are. In this (textual) context it is the same as $kRel_q(d)$.

3.1.2.2 Inverted Files

The inverted file is the most popular and a very efficient data structure for textual query evaluation. An inverted file is a collection of lists, one per keyword, recording the identifiers of the documents containing that keyword [ZM06]. An inverted file consists of two major parts: the vocabulary and the inverted lists. The vocabulary stores for each keyword $k$: a count $f_k$ showing the number of documents containing $k$, and a pointer to the corresponding inverted list. The second part of the inverted file is a set of inverted lists, each corresponding to a keyword. Each list stores, for the corresponding keyword $k$: the identifiers $d$ of the documents containing $k$, and the normalized frequencies $\bar{f}_{d,k}$ of term $k$ in document $d$ [ZM06]. A complete inverted file for Example 1 is shown in Figure 3.1.
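To make the scoring concrete, here is a minimal sketch (not the system's actual implementation) of Equation 3.1 evaluated over the inverted file of Figure 3.1; the document lengths $W_d$ are assumed to be precomputed and passed in as `doc_lengths`.

```python
import math
from collections import defaultdict

# Inverted file for Example 1: keyword -> list of (document id, normalized frequency)
inverted_file = {
    "soccer": [(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0)],
    "league": [(1, 0.8), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0)],
}
n = 6  # total number of documents in the collection

def rank(query_keywords, doc_lengths, k=10):
    """Evaluate Equation 3.1 with one accumulator per candidate document."""
    acc = defaultdict(float)
    for kw in query_keywords:
        postings = inverted_file.get(kw, [])
        if not postings:
            continue
        w_qk = math.log(1 + n / len(postings))  # idf score w_{q,k}
        for doc_id, f_dk in postings:
            w_dk = math.log(1 + f_dk)           # tf score w_{d,k}
            acc[doc_id] += w_dk * w_qk
    # Divide by the document length W_d; the query length W_q is a
    # constant for a given query and is omitted.
    scores = {d: a / doc_lengths[d] for d, a in acc.items()}
    return sorted(scores.items(), key=lambda s: -s[1])[:k]
```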
3.2 Spatial-Keyword Search

3.2.1 Seamless Spatial-Keyword Ranking

In this section, we define a new scoring mechanism to calculate the spatial relevance and spatial-keyword relevance scores. Following the same intuitions and concepts used in regular (textual) searches, we define new concepts and parameters for spatial data. Most notably, inspired by tf-idf in the textual context, we define a new scoring mechanism called spatial tf-idf for the spatial context. Using the (textual) tf-idf scores and the spatial tf-idf scores, the spatial-keyword relevance is defined and can be used to rank documents based on both the spatial and textual aspects of the data, simultaneously and efficiently. We discuss two different approaches to calculating the spatial-keyword relevance using the spatial tf-idf score. Several variants of the final similarity measure are also presented.

3.2.1.1 Spatial tf-idf

In order to be able to use ideas analogous to those of the regular tf-idf score, we need to treat spatial data similarly to textual data. Most importantly, we need to represent space, which is coherent and continuous in nature, as disjoint and set-oriented units of data, similar to textual keywords. Hence, we partition the space into grid cells and assign a unique identifier to each cell. Each location in a document can then be associated with a set of cell identifiers. Since we are using overlap as our main spatial query type, these cells are defined as the cells which overlap with the document location. With spatial tf-idf, the overlap of a cell with the document is analogous to the existence of a keyword in a document with tf-idf. However, knowing the overlapping cells is not enough. We need to know how well a cell describes the spatial content of the document. We use the overlap area between each cell and the document to provide a measure of how well that cell describes the document. Analogous to the frequency of term $t$ in document $d$, we define the frequency of cell $c$ in document $d$ as follows:

$$f_{d,c} = \frac{|L_d \cap c|}{|c|}$$

which is the area of overlap between the document location $L_d$ and cell $c$, divided by the area of cell $c$. Similar to the frequency of a keyword, which describes how well the keyword describes the document's textual contents ($K_d$), the frequency of a cell describes how well the cell describes the document's spatial contents ($L_d$). The more the overlap, the better this cell describes the document location, and vice versa. Now we can define the following parameters analogous to those of Section 3.1.2:

- $f_{d,c}$: the frequency of cell $c$ in document $d$
- $\max(f_{d,c})$: the maximum value of $f_{d,c}$ over all the cells in document $d$
- $\bar{f}_{d,c}$: the normalized $f_{d,c}$, which is $\frac{f_{d,c}}{\max(f_{d,c})}$
- $f_c$: the number of documents containing one or more occurrences of cell $c$

Using the above parameters, we revisit the three monotonicity properties discussed in Section 3.1.2, this time in the spatial context: (1) less weight is given to cells that appear in many documents; (2) more weight is given to cells that overlap largely with a document; and (3) less weight is given to documents that contain many cells. The first property is quantified by measuring the inverse of the frequency of a cell $c$ among the documents in the collection. We call this the spatial inverse document frequency, or the $idf_s$ score. The second property is quantified by the frequency of cell $c$ in document $d$ (as defined earlier). This is called the spatial term frequency, or the $tf_s$ score, and it describes how well that cell describes the document's spatial contents (i.e., $L_d$). The third property is quantified by measuring the total number of cells in the document. This factor is called document spatial length.

Among the above properties, properties (2) and (3) are more intuitive. Property (2) states that more weight should be given to the cells having a large overlap area with the document.

Figure 3.2: Properties (2) and (3)

The larger the overlap, the better that cell describes the document location.
For example, in Figure 3.2, cell $c_8$ describes document $d_2$ better than cell $c_9$ does. Property (3) states that less weight should be given to those documents whose locations cover more cells. Assuming all the other parameters are equal, a document with a smaller coverage (a smaller number of cells) should get a higher weight than a document with a larger coverage. To illustrate, consider Figure 3.2, where both documents $d_1$ and $d_2$ contain the query location and its corresponding cell (i.e., $c_5$). In other words, they have equal spatial tf scores for cell $c_5$. Cell $c_5$ also has an identical spatial idf for all documents. Under these conditions (equal tf scores and equal idf scores), the smaller document ($d_1$) should be ranked higher. This is analogous to the fact that in the textual context, more weight is given to the documents that contain fewer keywords.

Contrary to properties (2) and (3), property (1) is not very intuitive. It states that less weight is given to the cells appearing in more documents. In the textual context, the idf score is a weighting factor determining the importance of each keyword independent of the query. It assigns more weight to keywords appearing in fewer documents, since those are more meaningful keywords. However, the definition of a meaningful cell is not as clear in the spatial context. A popular cell (location) - a cell overlapping with many documents - is a very meaningful cell for some users/applications, while for others, a distinctive cell (location) - a cell appearing in few documents - is more meaningful. (In Example 1, one user may look for the more popular locations for soccer leagues, while another user may be interested in a less crowded, more private location.) To cover both cases, we define the spatial idf of cell $c$ in two different ways: the inverse of the frequency of cell $c$ among the documents (inverted $idf_s$) and the direct frequency of cell $c$ among the documents (direct $idf_s$).

Figure 3.3: Example 1 on the grid

3.2.1.2 Spatial-Keyword Relevance

In this section, we introduce two novel approaches for calculating the spatial-keyword relevance between a document $d$ and a query $q$. With the single-score approach, one similarity measure and one document length are used to combine the spatial relevance and textual relevance into one equation. With the double-score approach, the spatial and textual relevances are calculated separately, using two document lengths, one for each relevance; to this end, a new spatial similarity measure analogous to the textual similarity measure is defined. Both approaches can use the parameter $\alpha_s$ to assign relative weights.

Single-Score Approach

After partitioning each document location into a set of cells, defining the spatial tf-idf score and creating one document spatial length for each document location, the cells are ready to be treated in a manner similar to the keywords. We define a term as the smallest unit of data describing each document, which is either a keyword or a cell. If we represent the keywords associated with document $d$ by $K_d$ and the cells associated with the same document by $C_d$, then the set of terms associated with document $d$ is represented by $T_d$ and defined as follows: $T_d = K_d \cup C_d$. Simply stated, a document's terms are the union of the document's keywords and cells. For instance, in Example 1: $T_{d_1} = \{soccer, league, c_1, c_2\}$ (see Figures 1.1(b) and 3.3).
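As a concrete illustration, the following is a minimal sketch (with hypothetical grid parameters) of how a document's MBR can be turned into overlapping cells with frequencies $f_{d,c}$, and then combined with its keywords into the unified term set $T_d$.

```python
def cell_frequencies(mbr, cell_size):
    """Map a document MBR (x1, y1, x2, y2) to its overlapping grid cells.

    Returns {cell id: f_{d,c}}, where f_{d,c} is the overlap area between
    the MBR and the cell, divided by the cell area.
    """
    x1, y1, x2, y2 = mbr
    freqs = {}
    for i in range(int(x1 // cell_size), int(x2 // cell_size) + 1):
        for j in range(int(y1 // cell_size), int(y2 // cell_size) + 1):
            cx, cy = i * cell_size, j * cell_size  # cell's lower-left corner
            w = min(x2, cx + cell_size) - max(x1, cx)
            h = min(y2, cy + cell_size) - max(y1, cy)
            if w > 0 and h > 0:
                freqs[("c", i, j)] = (w * h) / (cell_size ** 2)
    return freqs

# Unified term set T_d = K_d ∪ C_d for a document with two keywords and a
# (hypothetical) MBR on a grid of 100 x 100 cells.
K_d = {"soccer", "league"}
C_d = cell_frequencies((120.0, 40.0, 260.0, 90.0), cell_size=100.0)
T_d = K_d | set(C_d)
```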
In order to be able to define a single similarity measure capturing both the textual and spatial relevances, we define the following parameters:

- $f_{d,t}$: the frequency of term $t$ in document $d$
- $f_t$: the number of documents containing occurrences of term $t$

where each parameter gets its value from the corresponding parameter in the space or text domain (based on the term type). For instance, the value of $f_{d,t}$ is equal to $f_{d,k}$ when the term is a keyword and to $f_{d,c}$ when the term is a cell. Having defined these new parameters, we can now easily redefine Equation 3.1, this time with terms instead of keywords. This is a new formulation capturing the keywords (textual relevance) and the cells (spatial relevance) in a unified manner.

$$w_{q,t} = \begin{cases} (1 - \alpha_s) \cdot \ln\left(1 + \frac{n}{f_t}\right) & \text{if } t \text{ is a keyword} \\ \alpha_s \cdot w_{q,c} & \text{if } t \text{ is a cell} \end{cases} \qquad w_{d,t} = \begin{cases} (1 - \alpha_s) \cdot \ln(1 + f_{d,t}) & \text{if } t \text{ is a keyword} \\ \alpha_s \cdot \ln(1 + f_{d,t}) & \text{if } t \text{ is a cell} \end{cases}$$
$$\hat{W}_d = \sqrt{\sum_t w_{d,t}^2}; \quad \hat{W}_q = \sqrt{\sum_t w_{q,t}^2}; \quad \hat{S}_{q,d} = \frac{\sum_t w_{d,t} \cdot w_{q,t}}{\hat{W}_d \cdot \hat{W}_q} \quad (3.2)$$

The variable $w_{d,t}$ captures the spatial-keyword term frequency score ($tf_{sk}$). The variable $w_{q,t}$ captures the spatial-keyword inverse document frequency ($idf_{sk}$). The parameter $\alpha_s$ is integrated into the weighting scheme to capture the relative weight of space versus text. $\hat{W}_d$ represents the spatial-keyword document length and $\hat{W}_q$ is the (spatial-keyword) query length. Finally, $\hat{S}_{q,d}$ is the similarity measure showing how spatial-keyword relevant document $d$ is to query $q$.

Double-Score Approach

In the single-score approach, keywords and cells are treated in exactly the same manner. The keyword and cell tf and idf scores are used in one equation, and one similarity measure ($\hat{S}_{q,d}$) with one document length ($\hat{W}_d$) is used to calculate the final relevance score. There may be cases where most of the documents in the collection contain a very large document location but very few keywords (or the opposite). In this situation, it is better to calculate the textual and spatial relevance scores separately. Hence, we discuss another approach to calculating the similarity measure between document $d$ and query $q$ in the spatial-keyword context. One can first calculate the spatial relevance and the textual relevance of document $d$ and query $q$ independently and then use an aggregation function to compute the overall spatial-keyword relevance score. Using the spatial tf-idf parameters and definitions, we calculate the spatial similarity measure between document $d$ and query $q$, analogous to the textual similarity measure, as follows:

$$w_{q,c} = \begin{cases} \ln\left(1 + \frac{n}{f_c}\right) & \text{if inverted document frequency} \\ \ln\left(1 + \frac{f_c}{n}\right) & \text{if direct document frequency} \end{cases} \qquad w_{d,c} = \ln(1 + f_{d,c});$$
$$W'_d = \sqrt{\sum_c w_{d,c}^2}; \quad W'_q = \sqrt{\sum_c w_{q,c}^2}; \quad S'_{q,d} = \frac{\sum_c w_{d,c} \cdot w_{q,c}}{W'_d \cdot W'_q} \quad (3.3)$$

where $S'_{q,d}$ is the spatial similarity measure between document $d$ and query $q$. This value captures the spatial relevance $sRel_q(d)$ defined in Section 3.1.1. After calculating the spatial relevance using the above equation and computing the textual relevance using Equation 3.1, the aggregation function $F$ can be used to calculate the final spatial-keyword relevance. More formally: $skRel_q(d) = \alpha_s \cdot S'_{q,d} + (1 - \alpha_s) \cdot S_{q,d}$.

Variants. We conclude this section by summarizing the possible variants of the spatial-keyword relevance score. We defined two different approaches to calculating the spatial-keyword relevance scores. We also introduced two different ways to define the spatial idf score.
Combining our two main approaches with the two definitions of the spatial idf score yields four different variants of our final similarity measure:

1. Single-Score with Inverted document frequency (SSI), where $skRel_q(d) = \hat{S}_{q,d}$ and $w_{q,c} = \ln(1 + \frac{n}{f_c})$
2. Single-Score with Direct document frequency (SSD), where $skRel_q(d) = \hat{S}_{q,d}$ and $w_{q,c} = \ln(1 + \frac{f_c}{n})$
3. Double-Score with Inverted document frequency (DSI), where $skRel_q(d) = \alpha_s \cdot sRel_q(d) + (1 - \alpha_s) \cdot kRel_q(d)$ and $w_{q,c} = \ln(1 + \frac{n}{f_c})$
4. Double-Score with Direct document frequency (DSD), where $skRel_q(d) = \alpha_s \cdot sRel_q(d) + (1 - \alpha_s) \cdot kRel_q(d)$ and $w_{q,c} = \ln(1 + \frac{f_c}{n})$

3.2.2 Spatial-Keyword Inverted File

The Spatial-Keyword Inverted File (SKIF) is an inverted file capable of indexing and searching both textual and spatial data in a similar, integrated manner using a single data structure. In this section, we first describe the structure of SKIF and the information it stores. Next, we show how spatial-keyword query evaluation is performed using SKIF. Two algorithms corresponding to our two approaches are presented. Finally, we briefly discuss how SKIF can be extended to more general cases.

3.2.2.1 SKIF Structure

Since SKIF is an inverted file, its structure is very similar to the structure of regular inverted files. SKIF consists of two parts: the vocabulary and the inverted lists. The vocabulary contains all the terms in the system, which includes all the (textual) keywords and cells (cell identifiers). For each distinct term, three values are stored in the vocabulary: 1) $f_t$, representing the number of documents containing the term $t$, 2) a pointer to the corresponding inverted list, and 3) the type of the term, which is used to help calculate the tf and idf scores. The second component of SKIF is a set of inverted lists, each corresponding to a term. For the corresponding term $t$, each list stores the following values: the identifiers of the documents containing term $t$ and the normalized frequency of term $t$ for each document $d$, represented by $\bar{f}_{d,t}$. Figure 3.3 redraws Example 1 on the grid and Figure 3.4 shows the complete SKIF for Example 1.

Figure 3.4: Spatial-keyword inverted file for Example 1.
term $t$ | $f_t$ | type | Spatial-Keyword Inverted List for $t$
soccer   | 5 | 1 | ⟨1,1⟩ ⟨2,1⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨5,1⟩
league   | 5 | 1 | ⟨1,0.8⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨5,1⟩ ⟨6,1⟩
$c_1$    | 1 | 0 | ⟨1,1⟩
$c_2$    | 2 | 0 | ⟨1,0.55⟩ ⟨2,1⟩
$c_4$    | 1 | 0 | ⟨4,0.25⟩
$c_5$    | 3 | 0 | ⟨3,1⟩ ⟨4,0.06⟩ ⟨5,1⟩
$c_6$    | 1 | 0 | ⟨5,0.35⟩
$c_7$    | 1 | 0 | ⟨4,1⟩
$c_8$    | 2 | 0 | ⟨4,0.3⟩ ⟨5,0.65⟩
$c_9$    | 2 | 0 | ⟨5,0.25⟩ ⟨6,1⟩

3.2.2.2 Query Processing

As discussed in Section 3.1.1, a spatial-keyword query consists of two parts: the query keywords $K_q$ and the query location $L_q$. To process spatial-keyword queries, we first need to convert $L_q$ into a set of cells $C_q$, the set of cells overlapping with the query location $L_q$. After calculating $C_q$, we define the set of terms associated with each query by $T_q$ as follows: $T_q = K_q \cup C_q$.

Algorithms 1 and 2 (presented in Figures 3.5 and 3.6, respectively) show how to perform top-k spatial-keyword search using SKIF for the single-score and double-score approaches, respectively. The two algorithms are very similar. Accumulators are used to store the partial similarity scores. The main difference is that Algorithm 1 uses one accumulator $A_d$ while Algorithm 2 uses two accumulators, $A_d$ and $A'_d$. After all the query terms are processed, the similarity scores $\hat{S}_{q,d}$, $S_{q,d}$ and $S'_{q,d}$ are derived by dividing each accumulator value by the corresponding length $\hat{W}_d$, $W_d$ and $W'_d$, respectively. Finally, the $k$ documents with the largest scores are identified and returned to the user.
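Since the full pseudocode lives in Figures 3.5 and 3.6, the following is only a rough Python sketch of the single-score accumulator loop (Algorithm 1), under the assumptions that the vocabulary and inverted lists are in-memory dictionaries and that the inverted (SSI) definition of the spatial idf is used; the double-score variant simply keeps a second accumulator for the cell terms.

```python
import heapq
import math

def skif_single_score(query_keywords, query_cells, vocab, lists, lengths, n, alpha, k):
    """Top-k single-score search over SKIF (a sketch of Algorithm 1).

    vocab:   term -> (f_t, type), with type 1 for keywords and 0 for cells
    lists:   term -> [(doc id, normalized f_{d,t}), ...]
    lengths: doc id -> precomputed spatial-keyword document length
    """
    acc = {}
    for t in set(query_keywords) | set(query_cells):
        if t not in vocab:
            continue
        f_t, term_type = vocab[t]
        weight = (1 - alpha) if term_type == 1 else alpha
        w_qt = weight * math.log(1 + n / f_t)       # idf_sk component (SSI)
        for doc_id, f_dt in lists[t]:
            w_dt = weight * math.log(1 + f_dt)      # tf_sk component
            acc[doc_id] = acc.get(doc_id, 0.0) + w_dt * w_qt
    # Normalize each accumulator by the document length and keep the k best.
    return heapq.nlargest(k, ((a / lengths[d], d) for d, a in acc.items()))
```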
Figure 3.5: Single-Score Approach
Figure 3.6: Double-Score Approach

3.2.3 SKIF-P: Spatial-Keyword Inverted File for Points

Up to this point, we have studied a setting where query and document locations were represented as regions, and we proposed indexing and ranking techniques for that setting. In this section, we show how to modify the proposed indexing and ranking techniques for a setting in which query and document locations are two-dimensional points. For most of the spatial information on the web, locations are either directly specified as two-dimensional points (latitude, longitude) or can be easily converted to that format. Examples of the former are most geo-tagged Web 2.0 objects, such as geo-tagged micro-blogs (e.g., tweets [1]), geo-tagged images (e.g., Flickr [2] images) and geo-tagged videos (e.g., YouTube [3] videos). For the latter, almost all documents (web pages) containing some spatial information about different places - such as news web pages (e.g., NYTimes [4]), business listings (e.g., Yelp [5]) and check-in objects (e.g., foursquare [6]) - can be geo-coded into two-dimensional geo-coordinates (latitude, longitude). In order to define and model spatial relevance in this new setting (where locations are points and relevance is based on proximity), we define the concept of spatial decay and introduce several spatial decay functions. Using the spatial decay functions enables us to define the spatial relevance and the spatial-keyword relevance seamlessly and more naturally for the new setting.

[1] http://www.twitter.com
[2] http://www.flickr.com
[3] http://www.youtube.com
[4] http://www.nytimes.com
[5] http://www.yelp.com
[6] http://www.foursquare.com

3.2.3.1 Spatial Decay

According to the first law of geography, "Everything is related to everything else, but near things are more related than distant things." [Tob70]. Given a location (point), we want to find those points that are more related to that location. We call the given location the focal point and the resulting related locations the relevant locations to the focal point, or relevant locations in short. The relevance of the relevant locations to the focal point can vary from location to location. Therefore, a weight is assigned to each location. The weight of the focal point itself is always 1, while the weight of the non-relevant locations is always 0. The weight of all the other locations (the relevant locations) is greater than 0 and less than or equal to 1. As shown in Section 3.2.1, in this proposal we partition the space into grid cells. As a result, each location essentially corresponds to one cell. In this context, the relevant locations are the grid cells that are relevant (more related) to the document location (although this can be generalized to other cases and applications), and the focal point is the document location [7]. The larger the weight, the more relevant that cell is to the document location. In this section, we show how to find the cells relevant to a given focal point and how to calculate the weight of each relevant cell.

In evaluating locations around the focal point, it is common to give less importance to cells which correspond to locations farther from the focal point. Intuitively, this reflects the first law of geography mentioned above: locations nearer to the focal point are of more significance, while farther ones are of less significance and can be assigned lower weights or ignored entirely.
Several notions of decay functions (including time decay functions) have been used in the literature to capture such characteristics in different data (and temporal data) management applications [CSSX09a, CS06]. Here, we apply and customize some of those functions in our spatial setting and define several spatial decay functions to be used in our proposed framework.

[7] For simplicity, we assume that each document has only one location. Multiple locations can be easily handled by using the same methods multiple times - once for each focal point.

We consider input items ($c_i$), which describe all the cells in our setting. We also have the focal point $FP$ as another input. $FP_c$ represents the cell associated with the focal point (the focal point's cell).

Definition 1: A decay function takes some information about the focal point $FP$ and the $i$th cell and returns a weight for this cell. We define a function $w(i, FP)$ to be a decay function if it satisfies the following properties:

1. $w(i, FP) = 1$ when $c_i = FP_c$, and $0 \leq w(i, FP) \leq 1$ for all $c_i \neq FP_c$.
2. $w$ is monotone non-increasing as the distance between a cell and $FP_c$ increases: $distance(c_j, FP_c) \geq distance(c_i, FP_c) \Rightarrow w(j, FP) \leq w(i, FP)$.
3. $w(i, FP) = 0$ when $distance(c_i, FP_c) > \delta$. We call $\delta$ the threshold value; it is used to prune the locations (points) that are not very related (relevant) to the focal point.

In this proposal, we focus on decay functions of a certain form, where the weight of a cell can be written as a function of its distance $dist$, where $dist$ for cell $i$ ($c_i \neq FP_c$) is simply $dist = distance(c_i, FP_c)$. Here and everywhere else in this proposal, distance can be any distance function in the metric space as long as it satisfies the three main distance properties: 1) non-negativity: the distance between distinct points is positive; 2) symmetry: the distance from $x$ to $y$ is the same as the distance from $y$ to $x$; and 3) triangle inequality: the distance from $x$ to $z$ via $y$ is at least as great as the distance from $x$ to $z$ directly. Euclidean distance, block (Manhattan) distance and road-network distance are valid examples of such a distance function (throughout this proposal, we use the Euclidean distance function measured from the centers of cells).

Figure 3.7: Example of Windows Decay with δ equal to 2 cells (Euclidean) and the focal point at the center of the grid

Definition 2: A spatial decay function is defined by a positive monotone non-increasing function $f()$ such that the weight of the $i$th cell with respect to focal point $FP$ is given by:

$$w(i, FP) = \frac{f(distance(c_i, FP_c))}{f(distance(FP_c, FP_c))} = \frac{f(distance(c_i, FP_c))}{f(0)} \quad (3.4)$$

The denominator in the equation normalizes the weight and makes the first property of Definition 1 hold. Different choices of the function $f$ generate several interesting spatial decay functions. We study three of these decay functions here and report the effect of each one in Section 3.3.

Windows Decay. With windows decay, all the cells whose distance from the focal point is less than the threshold value $\delta$ are considered and, more importantly, are treated the same (i.e., they have the same weight). Setting the "window size" parameter equal to the threshold parameter $\delta$, the function is $f(dist) = 1$ for $dist \leq \delta$ and $f(dist) = 0$ for $dist > \delta$.

Figure 3.8: Example of Polynomial Decay with δ equal to 2 cells (Euclidean) and the focal point at the center of the grid

Polynomial Decay. Often, treating cells (locations) as simply relevant or not is not precise enough, and a more fine-grained, fuzzy weighting mechanism is needed.
This is the reason polynomial and exponential decay functions are defined. Spatial polynomial decay is defined as $f(dist) = (dist + 1)^{-\gamma}$, for some $\gamma > 0$. Here, 1 is added to $dist$ to ensure that $f(0) = 1$. We can also write the function as $f(dist) = \exp(-\gamma \ln(dist + 1))$. $f(dist)$ is still zero for $dist > \delta$.

Exponential Decay. Sometimes, polynomial decay is too slow (the weight changes are not very significant) and a faster decay function is needed. Spatial exponential decay is defined as $f(dist) = \exp(-\lambda \times dist)$ for $\lambda > 0$. Again, $f(dist)$ is zero for $dist > \delta$.
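A minimal sketch of the three spatial decay functions follows; the cutoff at $\delta$ mirrors Definitions 1 and 2 (no explicit normalization is needed since $f(0) = 1$ for all three), and the parameter values are only illustrative.

```python
import math

def windows_decay(dist, delta):
    # Every cell within the window gets full weight.
    return 1.0 if dist <= delta else 0.0

def polynomial_decay(dist, delta, gamma=1.8):
    # f(dist) = (dist + 1)^(-gamma)
    return (dist + 1) ** (-gamma) if dist <= delta else 0.0

def exponential_decay(dist, delta, lam=1.8):
    # f(dist) = exp(-lam * dist)
    return math.exp(-lam * dist) if dist <= delta else 0.0

# Weights at cell distances 0, 1, 2, 3 with delta = 2:
#   windows:     1.0, 1.0,   1.0,   0.0
#   polynomial:  1.0, 0.287, 0.138, 0.0
#   exponential: 1.0, 0.165, 0.027, 0.0
```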
3.2.3.2 SKIF-P

We start this section with an example that we will use throughout this section.

Figure 3.9: Example of Exponential Decay with δ equal to 2 cells (Euclidean) and the focal point at the center of the grid

Figure 3.10: Documents with location information and keyword frequencies.

Figure 3.11: A spatial-keyword query on documents with location information.
Document | Location      | soccer | league
$d_1$    | Redondo Beach | 5      | 4
$d_2$    | Culver City   | 3      | 0
$d_3$    | Palms         | 1      | 1
$d_4$    | Santa Monica  | 1      | 1
$d_5$    | Downtown      | 2      | 2
$d_6$    | Pasadena      | 0      | 2

Example 2. Suppose there are six documents in the repository with locations close to Central Los Angeles. Figure 3.10 shows these locations, represented as small triangles. In addition, each document has text keywords in its content. Figure 3.11 shows the frequencies of the query keywords in these documents. The query to the system contains three keywords, "park concert free", and specifies "Central Los Angeles" as the location restriction. Figure 3.10 shows the location of this query, represented as a black circle. We want to find the most relevant results (documents) to the query.

As mentioned earlier, spatial relevance between a document $d$ and a query $q$ is defined based on the type of the spatial relationship that exists between $L_d$ and $L_q$. In Section 3.2.1, we studied the overlap relationship, where query and document locations were represented as regions (MBRs). Here, we focus on the proximity relationship, since query and document locations are points. Subsequently, we define spatial relevance as follows: a document $d$ and a query $q$ are spatially relevant if at least one of the document's locations is within threshold distance $\delta$ of one of the query's locations, i.e., $distance(L_q, L_d) \leq \delta$. The function $distance$ can be any arbitrary distance function, such as Euclidean distance, Manhattan (block) distance or road-network distance. The smaller the distance is, the more spatially relevant $d$ and $q$ are. We denote the spatial relevance of document $d$ to query $q$ by $sRel_q(d)$.

We now define a new scoring mechanism to calculate the spatial relevance for this new setting, similar to the scoring method presented in Section 3.2.1 (and hence, similar to the tf-idf model). As before, we partition the space into grid cells and assign a unique identifier to each cell. Therefore, each location in a document can be associated with a cell identifier. Since we are using proximity as our main spatial query type, these cells are defined as the cells nearby the document location. In this new setting, the closeness of a cell to the document location is analogous to the existence of a keyword in a document with tf-idf. However, knowing the nearby cells is not enough. We need to know how well a cell describes the spatial content of the document. We use the distance between each cell and the document location to provide a measure of how well that cell describes the document.

For this new setting, we again represent the frequency of cell $c$ in document $d$ by $f_{d,c}$, but set its value equal to the value of a spatial decay function (Definition 2 in Section 3.2.3.1), where the focal point $FP$ is document $d$'s location ($L_d$) and $i$ is the index of cell $c$. The value of $f_{d,c}$ is monotone non-increasing as the distance between the document location $L_d$ and the cell $c$ increases. As before, the frequency of a cell describes how well the cell describes the document's spatial contents ($L_d$). The smaller the distance, the better the cell describes the document location, and vice versa. Note that the different variations of the spatial decay function generate different values for $f_{d,c}$. Now we can define the following parameters analogous to those of Section 3.1.2:

- $f_{d,c}$: the frequency of cell $c$ in document $d$
- $\max(f_{d,c})$: the maximum value of $f_{d,c}$ over all the cells in document $d$
- $\bar{f}_{d,c}$: the normalized $f_{d,c}$, which is $\frac{f_{d,c}}{\max(f_{d,c})}$
- $f_c$: the number of documents containing one or more occurrences of cell $c$

Using the above parameters, we revisit the three monotonicity properties discussed in Section 3.1.2, this time in the spatial context: (1) less weight is given to cells that appear in many documents; (2) more weight is given to cells that are closer to the document location; and (3) less weight is given to documents that contain many cells. The rest is exactly as presented in Sections 3.2.1 and 3.2.2: we can apply the spatial relevance and spatial-keyword relevance as before, but this time for the new setting. Also, SKIF-P (Spatial-Keyword Inverted File for Points) has the same structure as SKIF (discussed in Section 3.2.2). Figure 3.12 redraws Example 2 on the grid and Figure 3.13 shows the complete SKIF-P for Example 2. To calculate the values of this example, we used a polynomial decay function with $\gamma = 1.8$. We also used the Euclidean distance function with the value of $\delta$ set to 0 (since $\delta = 0$, the values of $f_{d,t}$ are all equal to 1; for other values of $\delta$, the inverted index and its values would be more complex).

Figure 3.12: Example 2 on the grid

Figure 3.13: Spatial-keyword inverted file for Example 2 (for δ = 0). The entry for each term $t$ is composed of the term frequency ($f_t$) and a list of pairs, each composed of a document id $d$ and a normalized term frequency $\bar{f}_{d,t}$.
term $t$ | $f_t$ | type | Spatial-Keyword Inverted List for $t$
park     | 5 | 1 | ⟨1,1⟩ ⟨2,1⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨5,1⟩
free     | 5 | 1 | ⟨1,0.8⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨5,1⟩ ⟨6,1⟩
concert  | 5 | 1 | ⟨1,0.6⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨5,0.5⟩ ⟨6,0.5⟩
$c_2$    | 2 | 0 | ⟨1,1⟩ ⟨2,1⟩
$c_5$    | 1 | 0 | ⟨3,1⟩
$c_8$    | 2 | 0 | ⟨4,1⟩ ⟨5,1⟩
$c_9$    | 1 | 0 | ⟨6,1⟩

3.2.4 Generalization

In this section, we briefly show how to extend SKIF to more general cases.

Multiple Locations: One of the advantages of our technique is that there is no limiting constraint on the representation of the document location. Instead of treating the document location as one large and sparse MBR, SKIF can use several separate, disjoint locations. This is feasible because our final spatial relevance score can be computed by separately computing the spatial score of each cell intersecting with the various document locations. Another advantage of SKIF is its capability to represent the document location as any arbitrary shape, and not necessarily as an MBR or rectangle. The only information we need in order to calculate the spatial tf-idf score is the area of overlap between each cell and each document location.

Points: We assumed that each document location is a region.
In the context of the web, this is a reasonable assumption; still, in the rare cases when the document location is a single geographical point $p$, we can generalize our approach as follows. A circle centered at $p$ with radius $r$ is constructed, and the MBR covering the circle becomes the new document location. The radius $r$ is a parameter determined by the user and by the context of the web page (if available).

Weights: When querying the system, there are two types of weights users may want to manipulate: 1) setting different weights for the spatial and textual relevances, and 2) setting different weights for different terms in the query. For the first scenario, we have used the parameter $\alpha_s$ in this proposal. SKIF can also support setting different weights for different terms. There are several existing methods to solve this problem for textual keywords. Since we treat cells similarly to keywords, those methods can also be applied to the spatial cells. As one possible solution, we define query term weights $\alpha_{q,k}$ and $\alpha_{q,c}$ as the weight of keyword $k$ in query $q$ and the weight of cell $c$ in query $q$, respectively. By multiplying the $w_{d,k}$ and $w_{d,c}$ values by $\alpha_{q,k}$ and $\alpha_{q,c}$, respectively, query term weights are integrated into the relevance scores. This opens up a wide range of sophisticated queries to the users.

Leveraging Existing Search Engines: One of the most practical advantages of the proposed approach is the fact that it can be integrated into existing search engines easily and seamlessly. Since the structure of SKIF is very similar to the structure of regular inverted files, the same techniques used in regular search engines (built on inverted files) can be applied to our location-based search engine (built on SKIF). The easy integration of our approach into existing search engines is not only very beneficial for current search engines but also enables us to optimize SKIF using the body of work that exists in this field. For example, compression techniques are very popular for inverted files [ZM06, ZM95]. Since the structure of the inverted lists is identical for SKIF and regular inverted files, no change is needed to apply the same compression techniques to SKIF. More interestingly, some optimization techniques seem to work even better on SKIF. For instance, caching is another technique used in existing search engines. It is easy to see that with SKIF, by caching the inverted lists for the cells nearby the current query cell, we can improve the query performance significantly, since it is very likely that nearby cells are queried together or very close in time to each other.

Table 3.1: Dataset Details
Dataset  | Total # of documents | Average # of unique keywords per document | Total # of unique keywords | Total # of keywords
DATASET1 | 19,841               | 64                                        | 31,721                     | 1,269,824
DATASET2 | 250,000              | 230                                       | 50,000                     | 57,500,000
DATASET3 | 8,964                | 11                                        | 2,340                      | 109,604

3.3 Experimental Evaluation

In this section, we experimentally evaluate the performance and accuracy of SKIF and SKIF-P. Comparison is done with the most efficient proposed solutions: the MIR$^2$-tree [DFHR08] and the CDIR-tree [CJW09], which are optimized versions of the IR$^2$-tree and the IR-tree, respectively. Since both the MIR$^2$-tree and the CDIR-tree use query points instead of query regions, we apply the following adjustment when comparing with SKIF: each query is executed using the MIR$^2$-tree and the CDIR-tree separately, for a random query point $q$ and a total number of results $k$. Subsequently, the farthest document in the union of the result sets is identified.
Let $r$ be the distance between $q$ and this farthest document. We construct a circle centered at $q$ with radius $r$; the MBR covering the circle is considered as the query location ($L_q$).

Our experiments use three datasets whose properties are summarized in Table 3.1. DATASET1 is generated from a real-world online web application called the Shoah Foundation Visual History Archive (http://college.usc.edu/vhi/). Each document (testimony) is tagged with a set of textual and spatial keywords describing the content of the testimony. In preparing DATASET1, we extracted location names (spatial keywords) from all the testimonies and geo-coded the location names into spatial regions using Yahoo! Placemaker (http://developer.yahoo.com/geo/placemaker/). We run our experiments on all the documents in the US. For DATASET1, we partition the space into 100km×100km cells. DATASET2 is generated synthetically. A set of keywords (from 1 to 500) and one location are assigned randomly to each document. The space is partitioned into 225×225 cells. The documents' keywords and locations are uniformly distributed. Finally, DATASET3 is generated from some of the geo-tagged images on the online photo-sharing website Flickr, using the Flickr API [8]. We chose random (geo-tagged) documents inside California. Again, each document (image) is tagged with a set of textual keywords and one (point) location.

Each query contains 1 to 4 randomly generated keywords and one rectangle. Each query round consists of 100 queries. All three structures are disk-resident and the page size is fixed at 4KB. The MIR$^2$-tree and CDIR-tree implementations are the same as in [DFHR08] and [CJW09], respectively (e.g., signature length = 189 bytes, β = 0.1, number of clusters = 5, etc.). Experiments were run on a machine with an Intel Core2 Duo 3.16 GHz CPU and 4GB of main memory.

[8] http://www.flickr.com/services/api/

3.3.1 Performance

In the first set of experiments, we evaluate the impact of the number of keywords in each query, $|K_q|$, on query cost. We vary $|K_q|$ from 1 to 4 while fixing $k$ at 10 and $\alpha$ at 0.5. For each method, we report the average query cost in processing each round. The results are shown in Figures 3.14-a and 3.14-b. For almost all cases, SKIF significantly outperforms both the MIR$^2$-tree and the CDIR-tree. While the query cost increases for all approaches as $|K_q|$ grows, the growth rate for SKIF is very marginal. While the I/O costs of the CDIR-tree and the MIR$^2$-tree increase by factors of 15 and 8, respectively, SKIF's query cost barely doubles when the number of keywords grows from 1 to 4. Both the CDIR-tree and the MIR$^2$-tree would perform even worse if $|K_q|$ increased further. This is because with the IR$^2$-tree, as the number of keywords increases, fewer documents contain all the keywords and hence more documents need to be searched (this also increases the number of false hits). With the CDIR-tree, when the query contains more keywords, the textual relevance of the query to each node of the CDIR-tree is very similar, which makes textual relevance pruning less effective. Therefore, both approaches need to search a larger and larger number of documents as $|K_q|$ increases. On the other hand, SKIF only searches the documents that contain the query keywords and therefore need to be scored.

In the second set of experiments, we evaluate the impact of the number of requested results $k$ on the query performance. Again, we report the average query cost for each round.
$|K_q|$ is fixed at two, $\alpha$ is fixed at 0.5, and $k$ varies from 1 to 50. Figures 3.15-a and 3.15-b show the results for search time and number of page accesses, respectively. The first observation is that for SKIF, the query cost changes only slightly as $k$ increases. Since the average number of terms in the query as well as $k$ are small, only a few disk pages in the inverted lists of the few query terms are retrieved and processed. On the other hand, the CDIR-tree and the MIR$^2$-tree perform worse as $k$ grows, since they have to access and process more entries in their corresponding trees.

In the third set of experiments, we study the impact of the parameter $\alpha$ on the performance of SKIF and the CDIR-tree. As mentioned earlier, $\alpha$ is the parameter that assigns relative weights to the textual and spatial relevances. We fix $|K_q|$ at two and $k$ at 10.

Figure 3.14: Impact of $|K_q|$ on query cost (a. Search Time, b. I/O)

Figures 3.16-a and 3.16-b show the results. The important observation is that the query cost for SKIF is weight-independent, while the CDIR-tree performs very poorly when the spatial relevance is more important (large $\alpha$). Since the CDIR-tree takes document similarity into account, it performs well when the textual relevance is given higher importance and poorly when the spatial relevance is given higher importance. On the contrary, SKIF performs well in all cases, since the query processing is the same for both keywords and space and is not affected by the relative weights.

Figure 3.15: Impact of $k$ on query cost (a. Search Time, b. I/O)

3.3.2 Accuracy

Our next set of experiments was conducted to evaluate the accuracy of our four proposed scoring approaches. Since spatial-keyword relevance ranking is new and no ground truth exists for our work, we conducted a user study to evaluate the effectiveness of our ranking methods. To conduct the user study, we utilized the user study in [Hav03] (a well-known paper in information retrieval) as our model and followed a similar procedure. We randomly selected 10 queries from our query set and recruited 10 volunteers. For each query, each volunteer was shown 6 result rankings, each consisting of the top 10 results satisfying the query when the results were ranked with one of these approaches: DSI, DSD, SSD, SSI, CDIR-tree and MIR$^2$-tree. Each volunteer was asked to select all documents which were "relevant" to the query, in their opinion. They were not told how any of the rankings were produced.

Figure 3.16: Impact of $\alpha$ on query cost (a. Search Time, b. I/O)

Table 3.2: R-precision of various rankings
Query   | DSI  | DSD  | SSD  | SSI  | MIR$^2$-tree | CDIR
1       | 1    | 1    | 1    | 1    | 1.00         | 0.67
2       | 1    | 1    | 1    | 1    | 1.00         | 1.00
3       | 1    | 1    | 1    | 0.95 | 1.00         | 1.00
4       | 1    | 1    | 0.8  | 0.8  | 0.00         | 0.10
5       | 0.9  | 0.7  | 0.9  | 0.8  | 0.00         | 0.20
6       | 1    | 1    | 1    | 1    | 0.88         | 0.88
7       | 1    | 1    | 1    | 1    | 0.00         | 0.38
8       | 1    | 1    | 1    | 1    | 1.00         | 1.00
9       | 1    | 1    | 1    | 1    | 1.00         | 0.80
10      | 1    | 1    | 1    | 1    | 1.00         | 1.00
Average | 0.99 | 0.97 | 0.97 | 0.96 | 0.69         | 0.70

We used R-precision [MRS08] to evaluate the results of the various rankings. R-precision is defined as follows.
Let a document be considered relevant if at least 6 of the 10 volunteers chose it as relevant for the query. Let Rel be the set that contains all such relevant documents and let |Rel| be the size of that set. Then, the R-precision of each list is the fraction of the top |Rel| documents that are deemed relevant. Hence, the higher the value of R-precision, the more relevant the corresponding ranking.

The R-precision of the six ranking approaches for each test query is shown in Table 3.2. We have also included the average R-precision for each ranking method. The first important observation is that for the majority of cases, our four proposed approaches generate results with an R-precision equal to one, i.e., lists in which all the top |Rel| documents are relevant. The second observation is that the average R-precision of the rankings generated by our approaches is substantially higher than that of the other two rankings. Finally, Table 3.3 shows the rankings preferred by the majority of the users. For nearly all the queries, a majority of the users preferred one of our proposed scoring methods. These results further confirm the effectiveness of our proposed approaches.

Table 3.3: Ranking preferred by users
Query | Preferred by Majority
1     | DSI
2     | DSI
3     | DSI
4     | DSD
5     | DSI
6     | DSD
7     | SSD
8     | MIR$^2$-tree
9     | MIR$^2$-tree
10    | DSI

3.3.3 Performance: SKIF-P Parameters

In this section, we evaluate the performance of SKIF-P based on the SKIF-P-specific parameters: the number of cells and the threshold value $\delta$. We use DATASET1 for this set of experiments. Again, each query round consists of 100 queries. For each method, we report the average query cost in processing each round.

Number of cells. In the first set of experiments, we study the impact of changing the number of cells on the performance of our system. We change the grid resolution from 2×2 cells to 400×400 cells and report the performance of SKIF-P. $|K_q|$ is fixed at 2, $\alpha$ is fixed at 0.5, $k$ is fixed at 10 and $\delta$ is equal to two cells. Figures 3.17-a and 3.17-b show the results for search time and number of pages accessed, respectively. Both figures convey similar observations. The main observation is that, most often, when the number of cells increases (cell sizes decrease), the performance of the system improves. Smaller cells usually correspond to fewer documents in each cell and hence fewer entries in each cell's inverted list. This translates to fewer I/Os and hence less processing time for a fixed $\delta$ (a fixed number of cells is retrieved and processed for a fixed $\delta$). In other words, larger cell sizes usually lead to larger search regions and hence more documents that need to be searched.

Figure 3.17: Impact of number of cells on query cost (a. Search Time, b. I/O)

Since the number of cells is not the only factor affecting the performance, there may be cases where the above argument does not hold (e.g., the 40×40 case in Figures 3.17-a and 3.17-b). The query location, the distribution of document locations, and also the query's keywords and the distribution of document keywords may change this pattern.

Delta (δ). In the second set of experiments, we evaluate the impact of $\delta$ on the performance of SKIF-P. Again, $|K_q|$ is fixed at two, $\alpha$ is fixed at 0.5, $k$ is fixed at 10 and the number of cells is 40×40. The results are shown in Figures 3.18-a and 3.18-b. In both figures, the main observation is that, as expected, both the number of I/Os and the processing time increase as $\delta$ increases.

Figure 3.18: Impact of δ on query cost (a. Search Time, b. I/O)
Increasing $\delta$ results in the retrieval and processing of a larger number of cells, and hence a larger number of disk I/Os and, subsequently, more processing time.

3.3.4 Accuracy: SKIF-P Parameters

In this section, we evaluate the impact of the threshold value $\delta$ and also the type of decay function on the accuracy of our system. We also use two new (and standard) metrics to evaluate the accuracy: precision at k and nDCG at k. We used the real-world Flickr dataset (DATASET3). As explained earlier, this dataset contains information about geo-tagged images on the Flickr website. We started with a set of 30,000 random images from Flickr. In processing this data, we removed the images with no location and also removed any image outside California. After these steps, the final dataset size was reduced to 8,964 documents (images). We used SSI as our scoring mechanism (the other three have slightly better but almost identical accuracy; see Table 3.2), and again we used a 40×40 grid of cells. The values of $\gamma$ and $\lambda$ (for the polynomial and exponential decays, respectively) are set to 1.8.

Approaches. We ran our experiments for the three decay functions discussed in Section 3.2.3.1 and for three ranges of $\delta$: short (0-1 cells), medium (2-5 cells) and large (6-10 cells). The combination of the three values of $\delta$ and the three spatial decay functions generates a total of 9 approaches.

Queries. We generated a set of 15 queries from different keywords in DATASET3 and randomly assigned each a two-dimensional point in California.

Relevance Assessment. After computing the top-5 results for each of our 15 queries using all 9 approaches, we ran a user study using Amazon Mechanical Turk [9]. One task (HIT) was generated for each query. Each query was run using all 9 approaches and the top-5 results were returned for each approach. Then, all the results from all the methods were combined together. A Google Maps mashup web page with markers representing these results (Flickr images) was generated. Also, the query location (as a separate marker on the map), alongside the query keywords, was provided to the workers. Workers could click on each marker and see the keywords associated with that marker (image) as well. Workers could choose whether the result (marker plus keywords) was relevant or non-relevant. Workers could also add their comments/explanations for each assessment. Each task (query assessment) was assessed by ten workers. We used workers with a HIT approval rate greater than or equal to 90 (meaning that more than 90% of their past assessments were correct). Each worker was rewarded $0.04 upon completion of each assessment. Overall, workers chose relevant for 73% of the assessments and non-relevant for 27% of the assessments.

[9] https://www.mturk.com/

Metrics. We evaluated the accuracy of the methods under comparison using two standard metrics: precision at k and nDCG at k. In calculating precision at k, we consider a document relevant if a majority of the workers assessed that document as relevant, and non-relevant otherwise. When computing nDCG at k, we consider the average relevance given by the users to each document, interpreting relevant as score 1 and non-relevant as 0. The ideal ranking is calculated based on these average relevance scores.
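For reference, here is a small sketch of the two metrics as described above; `rels` is assumed to hold the workers' average relevance score (in [0, 1]) for each result, in ranked order.

```python
import math

def precision_at_k(rels, k, majority=0.5):
    """Fraction of the top-k results that a majority of workers judged relevant."""
    return sum(1 for r in rels[:k] if r > majority) / k

def dcg(scores):
    # Standard discounted cumulative gain: rel_i / log2(i + 1), with 1-based ranks.
    return sum(r / math.log2(i + 1) for i, r in enumerate(scores, start=1))

def ndcg_at_k(rels, k):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0
```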
Results. The results of our relevance assessments with k = 5 for the nine approaches, measured by precision at k and nDCG at k, are shown in Tables 3.4 and 3.5, respectively. For precision@5, the first observation is that all of the evaluated methods generate accurate results (precision@5 larger than 0.6). The second observation is that the approaches with medium and large values of $\delta$ perform the search more accurately. Smaller values of $\delta$ may lead to fewer results in the final result set and hence fewer relevant results. On the other hand, larger values of $\delta$ usually lead to more relevant results in the final result set. As can be seen in Table 3.4, the medium values of $\delta$ are slightly better than the large values of $\delta$ in our experiments. This is because, in some cases (especially for the windows decay function), when the value of $\delta$ is large and there are many documents relevant to the query, some documents farther from the query location become relevant and end up in the final result set. Some users do not consider these documents relevant, while others seem to evaluate them as relevant. The final observation is that no single decay function is superior. All three generate fairly similar and accurate results, while the windows decay function generates a slightly less accurate result set.

Table 3.4: Precision@k of various rankings
Delta (δ) | Exponential | Polynomial | Windows | Average
Short     | 0.613       | 0.626      | 0.626   | 0.622
Medium    | 0.76        | 0.746      | 0.76    | 0.755
Long      | 0.733       | 0.746      | 0.68    | 0.72
Average   | 0.702       | 0.706      | 0.688   |

As for nDCG@5, the first observation is that all the evaluated methods rank the results very accurately (nDCG@5 larger than 0.8). In other words, the average relevance values of the top-5 results (with their respective ranks) are very similar to the relevance values of the best possible ranking. The second observation is that, as expected, the rankings improve when $\delta$ increases. Using smaller values of $\delta$ may lead to missing some relevant results in the final result set (and ranking). On the other hand, larger values of $\delta$ usually lessen the probability of missing relevant results, so the final ranking can be done over a larger set of relevant documents, generating a more accurate (and complete) ranking. The last observation is that, although there is still no clearly superior decay function, the polynomial type generates more accurate rankings. Based on the users' input, it seems that the exponential decay is a little too fast for our setting and users prefer the polynomial decay slightly more. In conclusion, all 9 approaches generate accurate results and very accurate rankings, with the accuracy improving for larger $\delta$ values. All the other properties seem to be very similar (although slightly different from case to case).

Table 3.5: nDCG@k of various rankings
Delta (δ) | Exponential | Polynomial | Windows | Average
Short     | 0.812       | 0.811      | 0.812   | 0.812
Medium    | 0.913       | 0.917      | 0.913   | 0.915
Long      | 0.934       | 0.982      | 0.934   | 0.950
Average   | 0.886       | 0.904      | 0.886   |

Chapter 4
Temporal-Textual Web Search

The chapter is organized as follows. In Section 4.1, we formally define the tempo-textual search problem. Section 4.2 presents our baseline approach for simple processing of tempo-textual queries. In Section 4.3, we introduce three new hybrid index structures for the processing of tempo-textual queries. In Section 4.4.1, we introduce a new ranking mechanism to calculate the temporal relevance and tempo-textual relevance scores. In Section 4.4.2, we present our efficient hybrid index structure and show how it is used in query processing.
We discuss how the proposed framework can be extended to support more general scenarios in Section 4.4.3. Section 4.5 empirically evaluates our proposed solution in terms of effectiveness and performance.

4.1 Preliminaries

4.1.1 Problem Definition

We assume a collection $D = \{d_1, \ldots, d_n\}$ of $n$ documents (web pages). Each document $d$ is composed of a set of keywords $K_d$ and a timespan [1] $T_d$ represented by two timestamps: a begin time denoted by $t_s$ and an end time denoted by $t_e$. The terms $t_s$ and $t_e$ represent the number of time units (e.g., days) from a reference point in time (which is the same for all the documents). The difference between $t_e$ and $t_s$ (in number of time units) is defined as the timespan length. For better readability, we will use a normal date format (e.g., November 7th, 1980) for $t_s$ and $t_e$ in our examples. We will use "the timespan" to refer to $T_d$ in this dissertation.

[1] Throughout our definitions and examples, for simplicity we consider one timespan for each document and one timespan for each query. In Section 4.4.3, we show how to generalize this model to multiple timespans.

Tempo-textual query: A tempo-textual query is defined as $Q = \langle K_q, T_q \rangle$, where $T_q$ is the temporal part of the query, specified as one timespan, and $K_q$ is a set of keywords in the query.

Temporal relevance: Temporal relevance between a document $d$ and the query $q$ is defined based on the type of the temporal relationship that exists between $T_d$ and $T_q$. We focus only on the overlap relationship, although our approach can easily be extended to cover other temporal relationships [All81]. Subsequently, we define temporal relevance as follows: a document $d$ is temporally relevant to the query $q$ if the query's timespan has a non-empty intersection with the document's timespan, i.e., $T_q \cap T_d \neq \emptyset$. The larger the intersection is, the more temporally relevant $d$ and $q$ are. We denote the temporal relevance of document $d$ to query $q$ by $tmRel_q(d)$.

Textual relevance: A document $d$ is textually relevant to the query $q$ if there exists at least one keyword belonging to both $d$ and $q$, i.e., $K_q \cap K_d \neq \emptyset$. The more keywords $q$ and $d$ have in common, the more textually relevant they are. We represent the textual relevance of document $d$ to query $q$ by $txRel_q(d)$. See Section 3.1.2 for more information regarding textual relevance.

Tempo-textual relevance: A document $d$ is tempo-textual relevant to the query $q$ if it is both temporally and textually relevant to the query $q$. Tempo-textual relevance can be defined by a monotonic scoring function $F$ of the textual and temporal relevances. For example, $F$ can be the weighted sum of the temporal and textual relevances:

$$F_q(d) = \begin{cases} \alpha_t \cdot tmRel_q(d) + (1 - \alpha_t) \cdot txRel_q(d) & \text{if } tmRel_q(d) > 0 \text{ and } txRel_q(d) > 0 \\ 0 & \text{otherwise} \end{cases} \quad (4.1)$$

$\alpha_t$ is a parameter assigning relative weights to the temporal and textual relevances. The output of function $F_q(d)$ is the tempo-textual relevance score of document $d$ to query $q$, and is denoted by $ttRel_q(d)$. In Section 4.4.1 we show in detail how to calculate tempo-textual relevance using our proposed index.

Tempo-textual search: A tempo-textual search identifies all the documents (web pages) that are tempo-textual relevant to $q$. The result is the list of top-k documents sorted by their tempo-textual relevance scores. The parameter $k$ is determined by the user.
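As a quick illustration, here is a minimal sketch of Equation 4.1 with hypothetical relevance values; note the gate requiring both relevances to be positive.

```python
def tt_rel(tm_rel, tx_rel, alpha_t=0.5):
    """Equation 4.1: weighted sum, gated on both relevances being positive."""
    if tm_rel > 0 and tx_rel > 0:
        return alpha_t * tm_rel + (1 - alpha_t) * tx_rel
    return 0.0

# A document overlapping the query timespan (tm_rel = 0.4) with a moderate
# textual score (tx_rel = 0.7) under alpha_t = 0.5 scores 0.55; a document
# with no temporal overlap scores 0 regardless of its textual relevance.
assert abs(tt_rel(0.4, 0.7) - 0.55) < 1e-9
assert tt_rel(0.0, 0.9) == 0.0
```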
4.2 Baseline Approach

In this section, we briefly discuss our baseline index structure and algorithm, which exploit existing techniques for processing tempo-textual queries.

IIO (Inverted Index Only): The basic idea behind IIO is to leverage the inverted index to calculate textual relevance using a classical tf-idf model over all the documents, thereby obtaining a list of the documents ranked by their textual relevance. The list is then scanned to check whether each (textually relevant) document's temporal expression has a non-empty overlap with the query's temporal expression. For each overlapping document d, we calculate the temporal relevance between d and the query q with a simple overlap function as follows:

\[
tmRel_q(d) =
\begin{cases}
\dfrac{|T_d \cap T_q|}{|T_q|} & \text{if } |T_d \cap T_q| < |T_q| \\
1 & \text{otherwise}
\end{cases}
\tag{4.2}
\]

where |T_d ∩ T_q| / |T_q| is the area of overlap between the document time interval T_d and the query time interval T_q divided by the area of T_q. After calculating the temporal relevance, we compute the final relevance score for the documents returned from the inverted index, as shown earlier in Equation 4.1. We sort the results and return the top-k results to the user. Using this straightforward approach, we do not need to calculate the temporal relevance for all the documents, but only for the documents whose textual relevance is greater than 0. We can use the following optimization to improve the performance of the above approach. While scanning (sorted) documents from the inverted index, if the score of d is smaller than the score of the kth document in the intermediate results, we skip this document and go to the next one. Otherwise (or if we do not yet have k intermediate results), we insert d into the list of k intermediate results (in its proper position) and remove the kth element from the list. We stop scanning when we are sure that none of the remaining documents can generate a score larger than the current kth document in the intermediate result set (or when we reach the end of the list). The IIO algorithm only uses the inverted index and calculates the temporal relevance on-the-fly.
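A minimal sketch of the IIO scan (without the early-termination optimization) follows; the inverted-index output format and the day-offset interval representation are simplifying assumptions for illustration:

```python
import heapq

def overlap_rel(doc_span, query_span):
    """Temporal relevance of Equation 4.2: the fraction of the query
    interval covered by the document interval, capped at 1."""
    (ds, de), (qs, qe) = doc_span, query_span
    overlap = min(de, qe) - max(ds, qs)
    return max(0.0, min(overlap / (qe - qs), 1.0))

def iio_topk(textual_hits, spans, query_span, k=10, alpha_t=0.5):
    """textual_hits: (doc_id, txRel) pairs from the inverted index;
    spans: doc_id -> (t_s, t_e). Temporal relevance is computed
    on-the-fly and only the k best documents are kept."""
    heap = []  # min-heap of (score, doc_id)
    for doc_id, tx in textual_hits:
        tm = overlap_rel(spans[doc_id], query_span)
        if tm == 0.0:
            continue  # drop documents with no temporal overlap
        heapq.heappush(heap, (alpha_t * tm + (1 - alpha_t) * tx, doc_id))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the current worst document
    return sorted(heap, reverse=True)

# Example: two candidate documents and a query over days 100-110.
hits = [(1, 0.9), (2, 0.4)]
spans = {1: (105, 140), 2: (90, 200)}
print(iio_topk(hits, spans, (100, 110), k=2))
```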
4.3 Hybrid Approaches

The baseline method described previously only makes use of a textual index structure (i.e., an inverted file) and does not employ a temporal indexing structure. As a result, all the documents returned from the textual index (which may be a very large set) must be accessed, and the temporal relevance must be computed for most, if not all, of these documents. In this section, we propose three hybrid index structures, each using both temporal indexing and textual indexing, thus enabling us to prune irrelevant documents both temporally and textually. As before, we use an inverted file as the textual index structure. For temporal indexing, we use an interval-tree [PS85], which allows us to efficiently find all time intervals that overlap with a given (query) time interval. An interval tree is an ordered tree data structure for holding intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval (or point). An interval tree for a set of m intervals uses O(m) storage and has height O(log m). It can be built in O(m log m) time, and it takes O(log m) time to process a query and return the overlapping intervals. Further details regarding interval trees can be found in [PS85].

In all index structures in this section, an interval-tree is built on the documents' time intervals and then used to filter documents overlapping with the query's temporal expression. By adding a few simple steps to the search algorithm of the interval-tree, we also calculate the amount of overlap during the search process. Using the overlap duration, we calculate a normalized temporal score (tmRel_q(d) in Equation 4.1) for each document as follows: tmRel_q(d) = overlapLength_q(d) / maxOverlapLength, where overlapLength_q(d) is the length of the overlap between document d's and query q's temporal expressions and maxOverlapLength is the maximum of such values for q.

4.3.1 Inverted File and Interval-Tree Index (FnT)

This hybrid index structure combines the inverted file and the interval-tree separately and in an independent fashion. In this structure, documents are indexed by both index structures separately. While all textual information is indexed by an inverted file (as is done in textual search engines), all the temporal information is indexed by an interval-tree. The primary difference between the interval-tree used here and traditional interval-trees is that here each leaf node of the interval-tree points to a list of documents containing the time-interval corresponding to that leaf node. With this structure, two independent processes take place when processing a query q. First, the textual part of the query K_q is fed into the inverted file and the textual relevance for each document is calculated. Second, the temporal part of the query T_q is sent to the interval-tree and the temporal intervals overlapping with T_q are found. Subsequently, the temporal relevance between each overlapping time-interval and T_q is computed. Finally, the lists for each leaf node (time-interval) are accessed and the corresponding temporal relevance values are assigned to the documents in each list. After finishing the above steps, the two sets of lists are merged, the final relevance scores are computed and the top-k results are returned.
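The merge step of FnT can be sketched as follows; a brute-force scan over leaf intervals stands in for the interval-tree lookup, and the index layouts are simplified assumptions:

```python
def fnt_query(query_terms, query_span, inverted_file, leaf_lists, k=10, alpha_t=0.5):
    """Sketch of the FnT merge: textual scores come from the inverted
    file, temporal scores from the interval index, and only documents
    that appear on both sides receive a final score."""
    # Textual side: doc_id -> accumulated txRel (e.g., tf-idf weights).
    tx = {}
    for term in query_terms:
        for doc_id, w in inverted_file.get(term, []):
            tx[doc_id] = tx.get(doc_id, 0.0) + w

    # Temporal side: leaf_lists maps (start, end) -> documents holding
    # that time-interval; a real implementation would query the tree.
    qs, qe = query_span
    ov_len = {}
    for (ds, de), doc_ids in leaf_lists.items():
        ov = min(de, qe) - max(ds, qs)
        if ov > 0:
            for doc_id in doc_ids:
                ov_len[doc_id] = max(ov_len.get(doc_id, 0), ov)
    max_ov = max(ov_len.values(), default=1)

    # Merge: normalized temporal score combined with the textual score.
    scored = [(alpha_t * ov_len[d] / max_ov + (1 - alpha_t) * tx[d], d)
              for d in tx.keys() & ov_len.keys()]
    return sorted(scored, reverse=True)[:k]
```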
4.3.2 Inverted File Then Interval-Tree Index (FtT)

With the inverted file then interval-tree index structure, the vocabulary of an inverted file is constructed on top of all textual keywords. However, instead of pointing to inverted lists, each keyword in the vocabulary points to (the root node of) an interval tree. The interval tree for keyword k_i is built on the temporal intervals that exist in documents containing k_i. Each leaf node of each interval-tree points to a (page) list of documents that 1) have a time-interval overlapping with the time-interval corresponding to that leaf node, and 2) contain the textual keyword pointing to that interval-tree. In other words, we get a set of (page) lists whose entry is determined by a pair of a keyword and a time-interval. We call a pair of a keyword and a time-interval a time-interval-keyword (TIK) if there is a document which contains the textual keyword and whose temporal expression (time-interval) overlaps with the time-interval. For query processing, the textual part of the query K_q is read first. For each keyword in the query, the corresponding interval-tree is accessed and processed until all leaf nodes (time-intervals) overlapping with the temporal part of the query are found. For each time-interval, the temporal relevance is calculated and saved. Next, the (page) lists for each leaf node (time-interval) are traversed. While traversing the page lists, the textual relevance of each document is also computed. Finally, all the page lists (from all TIKs) are merged and the temporal and textual relevances are combined.

4.3.3 Interval-Tree Then Inverted File Index (TtF)

With the interval-tree then inverted file index structure, an interval-tree is first constructed on top of all the time-intervals in the system. Instead of pointing to a list, each leaf node of the tree points to an inverted file. An inverted file for each time-interval (leaf node) is built on top of all documents overlapping with that time-interval. As a result, the interval-tree then inverted file structure has one interval tree and m (the number of unique time intervals in the system) inverted files. Similar to the inverted file then interval-tree structure, we get a set of page lists whose entry is a pair of a time-interval and a textual keyword (this time, time-interval first and then textual keyword). For processing a query q, the temporal part of the query T_q is first fed into the interval-tree to find the time-intervals overlapping with T_q. For each such time-interval, the temporal relevance between that time-interval and the query time-interval (T_q) is computed and saved. Next, the inverted file corresponding to the time-interval (leaf node) is accessed and processed for the textual part of the query K_q. While accessing the lists, the textual score is also calculated. Finally, all lists from all TIKs (time-interval-keyword pairs) are merged and the textual and temporal scores are combined.

4.4 Temporal-Textual Search

4.4.1 Seamless Tempo-Textual Ranking

In this section, we define a new scoring mechanism to calculate the temporal relevance and tempo-textual relevance scores. Following the same intuitions and concepts used in regular (textual) searches, we define new concepts and parameters for temporal data. Most notably, inspired by tf-idf in the textual context, we define a new scoring mechanism called temporal tf-idf for the temporal context. Using (textual) tf-idf scores and temporal tf-idf scores, the tempo-textual relevance is defined and can be used to rank the documents based on both the temporal and textual aspects of the data, simultaneously and efficiently. We discuss two different approaches to calculating the tempo-textual relevance using the temporal tf-idf score. Several variants of the final similarity measure are also presented.

4.4.1.1 Temporal tf-idf

In order to be able to use ideas analogous to those behind the regular tf-idf score, we need to treat temporal data similarly to textual data. Most importantly, we need to represent time, which is coherent and continuous in nature, as disjunct and set-oriented units of data, similar to textual keywords. Hence, we partition the time domain into consecutive cells and assign a unique identifier to each cell. Each timespan in a document can therefore be associated with a set of cell identifiers. Since we are using overlap as our main temporal query type, these cells are defined as the cells which overlap with the document timespan. With temporal tf-idf, the overlap of a cell with the document is analogous to the existence of a keyword in the document with tf-idf. However, knowing the overlapping cells is not enough. We need to know how well a cell describes the temporal content of the document. We use the overlap area between each cell and the document to provide a measure of how well that cell describes the document. Analogous to the frequency of term t in document d, we define the frequency of cell c in document d as follows:

\[
f_{d,c} = \frac{|T_d \cap c|}{|c|}
\]

which is the area of overlap between the document timespan T_d and cell c divided by the area of cell c. Similar to the frequency of a keyword, which describes how well the keyword describes the document's textual contents (K_d), the frequency of a cell describes how well the cell describes the document's temporal contents (T_d).
The larger the overlap, the better the cell describes the document timespan, and vice versa. Now we can define the following parameters, analogous to those of Section 3.1.2:

1. f_{d,c}: the frequency of cell c in document d
2. max(f_{d,c}): the maximum value of f_{d,c} over all the cells in document d
3. f̄_{d,c}: the normalized f_{d,c}, which is f_{d,c} / max(f_{d,c})
4. f_c: the number of documents containing one or more occurrences of cell c

Using the above parameters, we revisit the three monotonicity properties discussed in Section 3.1.2, this time in the temporal context: (1) less weight is given to cells that appear in many documents; (2) more weight is given to cells that overlap largely with a document; and (3) less weight is given to documents that contain many cells.

The first property is quantified by measuring the inverse of the frequency of a cell c among the documents in the collection. We call this the temporal inverse document frequency or idf_temp score. The second property is quantified by the frequency of cell c in document d (as defined earlier). This is called the temporal term frequency or tf_temp score and describes how well that cell describes the document's temporal contents (i.e., T_d). The third property is quantified by measuring the total number of cells in the document and is called the document temporal length.

[Figure 4.1: Example 3 with temporal cells - documents d_1 through d_5 and query Q placed on a 1980-2010 timeline partitioned into 5-year cells c_1 through c_6.]

Among the above properties, properties (2) and (3) are the more intuitive ones. Property (2) states that more weight should be given to the cells having a large overlap area with the document. The larger the overlap, the better that cell describes the document timespan. For example, in Figure 4.1, cell c_6 describes document d_1 better than cell c_5 does. Property (3) states that less weight should be given to those documents whose timespan covers more cells. Assuming all the other parameters are equal, a document with a smaller coverage (fewer cells) should get a higher weight than a document with a larger coverage. Assume two documents, one containing the history of the world from year 1503 to 2000 and the other containing the history of the world only in 1938. When searching for/about year 1938, the second document should be assigned more weight since it is a better representative of year 1938 than the first document. This is analogous to the fact that in the textual context, more weight is given to documents that contain fewer keywords.

Contrary to properties (2) and (3), property (1) is not very intuitive. It states that less weight is given to the cells appearing in more documents. In the textual context, the idf score is a weighting factor determining the importance of each keyword independent of the query. It assigns more weight to keywords appearing in fewer documents, since those are more meaningful keywords. However, the definition of a meaningful cell is not very clear in the temporal context. A popular cell (time) - a cell overlapping with many documents - is a very meaningful cell for some users/applications, while for others, a distinctive cell (time) - a cell appearing in few documents - is more meaningful. To cover both cases, we define two variants of the temporal idf of cell c: the inverse of the frequency of cell c among the documents (inverted idf_temp) and the direct frequency of cell c among the documents (direct idf_temp).
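The following sketch shows how a document timespan could be mapped to cells and how f_{d,c} and the two idf_temp variants could be computed; the 5-year cell size and day-based time units mirror the running example, while the function names are illustrative:

```python
import math

CELL_DAYS = 5 * 365  # cell size of the running example: five years

def cell_frequencies(span, cell=CELL_DAYS):
    """Map a timespan (t_s, t_e), given in days from the reference
    point, to the cells it overlaps; returns cell id -> f_{d,c}."""
    ts, te = span
    freqs = {}
    for c in range(ts // cell, (te - 1) // cell + 1):
        cs, ce = c * cell, (c + 1) * cell
        freqs[c] = (min(te, ce) - max(ts, cs)) / cell  # overlap / |cell|
    return freqs

def idf_temp(f_c, n, inverted=True):
    """Temporal idf of a cell occurring in f_c of the n documents:
    'inverted' favors distinctive cells, 'direct' favors popular ones."""
    return math.log(1 + n / f_c) if inverted else math.log(1 + f_c / n)

# A document starting at the reference point and spanning 7.5 years
# covers cell 0 fully (1.0) and cell 1 about half-way (~0.5):
print(cell_frequencies((0, int(7.5 * 365))))
```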
4.4.1.2 Tempo-Textual Relevance

In this section, we introduce two novel approaches for calculating the tempo-textual relevance between a document d and a query q. With the uni-score approach, one similarity measure and one document length are used to combine the temporal relevance and textual relevance into one equation. With the dual-score approach, the temporal and textual relevances are calculated separately, using two document lengths, one for each relevance; thus a new temporal similarity measure analogous to the textual similarity measure is defined. Both approaches can use the parameter α_t to assign relative weights.

Uni-Score Approach

After partitioning each document timespan into a set of cells, defining the temporal tf-idf score and creating one document temporal length for each document timespan, the cells are ready to be treated in a similar manner as the keywords. We define a term as the smallest unit of data describing each document, which is either a keyword or a cell. If we represent the keywords associated with a document d by K_d and the cells associated with the same document by C_d, then the set of terms associated with document d is represented by U_d and defined as follows: U_d = K_d ∪ C_d. Simply stated, the document's terms are the union of the document's keywords and cells. For instance, in Example 3: U_{d_1} = {Iraq, war, c_5, c_6} (see Figures 1.2(b) and 4.1). In order to be able to define a single similarity measure capturing both the textual and temporal relevance, we define the following parameters:

1. f_{d,u}: the frequency of term u in document d, equal to f_{d,k} if u is a keyword and to f_{d,c} if u is a cell
2. f̄_{d,u}: the normalized frequency of term u in document d, equal to f̄_{d,k} if u is a keyword and to f̄_{d,c} if u is a cell
3. f_u: the number of documents containing occurrences of term u, equal to f_k if u is a keyword and to f_c if u is a cell

where each parameter takes its value from the corresponding parameter in the time or text domain (based on the term type). For instance, the value of f_{d,u} is equal to f_{d,k} when the term is a keyword and to f_{d,c} when the term is a cell. Having defined these new parameters, we can now easily redefine Equation 3.1, this time using terms instead of keywords. This is a new formulation capturing the keywords (textual relevance) and the cells (temporal relevance) in a unified manner:

\[
w_{q,u} =
\begin{cases}
(1-\alpha_t)\,\ln\!\big(1+\tfrac{n}{f_u}\big) & \text{if } u \text{ is a keyword} \\
\alpha_t\, w_{q,c} & \text{if } u \text{ is a cell}
\end{cases}
\qquad
w_{d,u} =
\begin{cases}
(1-\alpha_t)\,\ln\!\big(1+\bar{f}_{d,u}\big) & \text{if } u \text{ is a keyword} \\
\alpha_t\,\ln\!\big(1+\bar{f}_{d,u}\big) & \text{if } u \text{ is a cell}
\end{cases}
\]

\[
\widehat{W}_d = \sqrt{\sum_u w_{d,u}^2}; \qquad
\widehat{W}_q = \sqrt{\sum_u w_{q,u}^2}; \qquad
\widehat{S}_{q,d} = \frac{\sum_u w_{d,u}\, w_{q,u}}{\widehat{W}_d \cdot \widehat{W}_q}
\tag{4.3}
\]

where w_{q,c} is the temporal idf weight of a cell, defined in Equation 4.4 below. The variable w_{d,u} captures the tempo-textual term frequency score (tf_tt). The variable w_{q,u} captures the tempo-textual inverted document frequency (idf_tt). The parameter α_t is integrated into the weighting scheme to capture the weighted relevance of time versus text. Ŵ_d represents the tempo-textual document length and Ŵ_q is the (tempo-textual) query length. Finally, Ŝ_{q,d} is the similarity measure showing how tempo-textually relevant document d is to query q.
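A minimal sketch of the uni-score similarity of Equation 4.3 follows; the dictionary-based document and query representations are illustrative assumptions:

```python
import math

def uni_score(doc_terms, query_terms, df, n, alpha_t=0.5, inverted_idf=True):
    """Cosine-style uni-score similarity of Equation 4.3.
    doc_terms and query_terms map a term to ('keyword' | 'cell',
    normalized frequency); df maps a term to its document frequency."""
    def scale(kind):
        return alpha_t if kind == "cell" else 1 - alpha_t

    def w_q(u, kind):  # query-side (idf_tt) weight; keywords always
        # use the inverted form, cells honor the chosen idf variant
        ratio = n / df[u] if (kind == "keyword" or inverted_idf) else df[u] / n
        return scale(kind) * math.log(1 + ratio)

    dw = {u: scale(kind) * math.log(1 + f) for u, (kind, f) in doc_terms.items()}
    qw = {u: w_q(u, kind) for u, (kind, _) in query_terms.items()}
    W_d = math.sqrt(sum(w * w for w in dw.values()))  # document length
    W_q = math.sqrt(sum(w * w for w in qw.values()))  # query length
    dot = sum(dw[u] * qw[u] for u in dw.keys() & qw.keys())
    return dot / (W_d * W_q) if W_d and W_q else 0.0
```

The inverted_idf flag anticipates the direct variant of w_{q,c} introduced in Equation 4.4 below.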
Dual-Score Approach

In the uni-score approach, keywords and cells are treated in exactly the same manner: the keyword and cell tf and idf scores are used in one equation and one similarity measure (Ŝ_{q,d}) with one document length (Ŵ_d) to calculate the final relevance score. There might be cases where most of the documents in the collection contain very long document timespans but very few keywords (or vice versa). In such cases, it is better to calculate the textual and temporal relevance scores separately. Hence, we discuss another approach to calculating the similarity measure between document d and query q in the tempo-textual context. We first calculate the temporal relevance and the textual relevance of document d and query q independently, and then use an aggregation function to compute the overall tempo-textual relevance score. Using the temporal tf-idf parameters and definitions, we calculate the temporal similarity measure between document d and query q, analogous to the textual similarity measure, as follows:

\[
w_{q,c} =
\begin{cases}
\ln\!\big(1+\tfrac{n}{f_c}\big) & \text{for inverted document frequency} \\
\ln\!\big(1+\tfrac{f_c}{n}\big) & \text{for direct document frequency}
\end{cases}
\qquad
w_{d,c} = \ln\!\big(1+\bar{f}_{d,c}\big);
\]

\[
W'_d = \sqrt{\sum_c w_{d,c}^2}; \qquad
W'_q = \sqrt{\sum_c w_{q,c}^2}; \qquad
S'_{q,d} = \frac{\sum_c w_{d,c}\, w_{q,c}}{W'_d \cdot W'_q}
\tag{4.4}
\]

where S'_{q,d} is the temporal similarity measure between document d and query q. This value captures the temporal relevance tmRel_q(d) defined in Section 4.1.1. After calculating the temporal relevance using the above equation and computing the textual relevance using Equation 3.1, the aggregation function F can be used to calculate the final tempo-textual relevance. More formally: ttRel_q(d) = α_t · S'_{q,d} + (1 − α_t) · S_{q,d}.

4.4.1.3 Variants

We conclude this section by summarizing the possible variants of the tempo-textual relevance score. We defined two different approaches to calculating the tempo-textual relevance scores, and we also introduced two different ways to define the temporal idf factor. Combining our two main approaches with the two definitions of the temporal idf score yields four different variants for our final similarity measure:

1. Uni-score with Inverted document frequency (UI), where ttRel_q(d) = Ŝ_{q,d} and w_{q,c} = ln(1 + n/f_c)
2. Uni-score with Direct document frequency (UD), where ttRel_q(d) = Ŝ_{q,d} and w_{q,c} = ln(1 + f_c/n)
3. Dual-score with Inverted document frequency (DI), where ttRel_q(d) = α_t · tmRel_q(d) + (1 − α_t) · txRel_q(d) and w_{q,c} = ln(1 + n/f_c)
4. Dual-score with Direct document frequency (DD), where ttRel_q(d) = α_t · tmRel_q(d) + (1 − α_t) · txRel_q(d) and w_{q,c} = ln(1 + f_c/n)
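Under the same assumed representations as the previous sketch, the dual-score path of variants DI and DD could look as follows:

```python
import math

def temporal_similarity(doc_cells, query_cells, f_c, n, inverted=True):
    """Temporal similarity S'_{q,d} of Equation 4.4, computed over
    cells only; doc_cells maps cell id -> normalized f_{d,c}."""
    dw = {c: math.log(1 + f) for c, f in doc_cells.items()}
    qw = {c: math.log(1 + (n / f_c[c] if inverted else f_c[c] / n))
          for c in query_cells}
    W_d = math.sqrt(sum(w * w for w in dw.values()))
    W_q = math.sqrt(sum(w * w for w in qw.values()))
    dot = sum(dw[c] * qw[c] for c in dw.keys() & qw.keys())
    return dot / (W_d * W_q) if W_d and W_q else 0.0

def dual_score(s_temp, s_text, alpha_t=0.5):
    """ttRel_q(d) for the dual-score variants (DI with inverted=True,
    DD with inverted=False in temporal_similarity above)."""
    return alpha_t * s_temp + (1 - alpha_t) * s_text
```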
4.4.2 Tempo-Textual Inverted Index

The tempo-textual inverted index (T²I²) is an inverted index capable of indexing and searching both the textual and temporal data in a unified, integrated manner using a single data structure. In this section, we first describe the structure of T²I² and the information it stores. Next, we show how tempo-textual query evaluation is performed using T²I²; two algorithms corresponding to our two approaches are presented. Finally, we briefly discuss how T²I² can be extended to more general cases.

4.4.2.1 T²I² Structure

Since T²I² is an inverted index, its structure is very similar to the structure of regular inverted indexes. T²I² consists of two parts: a vocabulary and inverted lists. The vocabulary contains all the terms in the system, which includes all the (textual) keywords and cells (cell identifiers). For each distinct term, three values are stored in the vocabulary: 1) f_u, representing the number of documents containing the term u, 2) a pointer to the corresponding inverted list, and 3) the type of the term, which is used to help calculate the tf and idf scores. The second component of T²I² is a set of inverted lists, each corresponding to a term. For the corresponding term u, each list stores the identifiers of the documents containing term u and the normalized frequency of term u for each document d, represented by f̄_{d,u}. Figure 4.1 redraws Example 3 with temporal cells (each cell is 5 years) and Figure 4.2 shows the complete T²I² for Example 3.

Figure 4.2: Tempo-textual inverted index for Example 3 (type 1 denotes a keyword and type 0 a cell; each posting is ⟨document id, f̄_{d,u}⟩).

| term u | f_u | type | tempo-textual inverted list for u |
|--------|-----|------|-----------------------------------|
| Iraq   | 5   | 1    | ⟨1,1⟩ ⟨2,1⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨5,1⟩ |
| war    | 4   | 1    | ⟨1,0.33⟩ ⟨2,0.9⟩ ⟨3,0.8⟩ ⟨4,0.54⟩ |
| c_1    | 2   | 0    | ⟨2,1⟩ ⟨6,1⟩ |
| c_2    | 2   | 0    | ⟨2,0.6⟩ ⟨5,1⟩ |
| c_3    | 2   | 0    | ⟨3,1⟩ ⟨4,1⟩ |
| c_4    | 1   | 0    | ⟨3,0.5⟩ |
| c_5    | 1   | 0    | ⟨1,0.4⟩ |
| c_6    | 1   | 0    | ⟨1,1⟩ |

4.4.2.2 Query Processing

As discussed in Section 4.1.1, a tempo-textual query consists of two parts: the query keywords K_q and the query timespan T_q. To process tempo-textual queries, we first need to convert T_q to a set of cells C_q, the set of cells overlapping with the query timespan T_q. After calculating C_q, we define the set of terms associated with each query by U_q as follows: U_q = K_q ∪ C_q.

Figures 4.3 and 4.4 show the algorithms for performing top-k tempo-textual search using T²I² for the uni-score and the dual-score approaches, respectively. With both algorithms, accumulators are used to store the partial similarity scores. The main difference is that the first algorithm uses one accumulator A_d while the second uses two accumulators, A_d and A'_d. After all the query terms are processed, the similarity scores Ŝ_{q,d}, S_{q,d} and S'_{q,d} are derived by dividing each accumulator value by the corresponding values of Ŵ_d, W_d and W'_d, respectively (the first is used in the uni-score algorithm, while the last two are used in the dual-score algorithm). Finally, the k highest-scoring documents are identified and returned to the user.

In the uni-score approach (Figure 4.3), we assign one accumulator, denoted A_d, to each document d. Partial similarity scores are stored in these accumulators. Initially, all the accumulators have a value of zero (i.e., similarity of zero). The query terms are processed one at a time, and for term u, the accumulator A_d for each document d included in u's inverted list is increased by the contribution of u to the similarity of d and q. After all query terms are processed, the similarity scores Ŝ_{q,d} are derived by dividing each accumulator value by the corresponding value Ŵ_d. Finally, the k highest-scoring documents are identified and returned to the user.

[Figure 4.3: top-k tempo-textual search, uni-score.]

In the dual-score approach (Figure 4.4), two accumulators are assigned to each document: A_d and A'_d. The partial textual similarity score is stored in A_d while the partial temporal similarity score is stored in A'_d. Initially, both accumulators are empty (zero score). Again, terms are processed one at a time, and for each term u and each document d included in u's inverted list, the values of A_d and A'_d are increased by the contribution of term u to the textual and temporal similarity of document d to q, respectively. After processing all the query terms, the temporal similarity scores S'_{q,d} are calculated by dividing each A'_d by its corresponding W'_d. In addition, the textual similarity scores S_{q,d} are computed by dividing the values of the A_d accumulators by the corresponding W_d values. Finally, for each document, if both similarity scores are larger than zero, the final similarity score is calculated as the weighted sum of these two scores. (In order to perform this algorithm efficiently, we use the Threshold Algorithm described in [FLN03].)

[Figure 4.4: top-k tempo-textual search, dual-score.]
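The accumulator-based evaluation over T²I² can be sketched as follows for the uni-score variant of Figure 4.3; the postings layout matches Figure 4.2, and the weighting helper is assumed to come from the earlier sketches:

```python
def topk_uniscore(index, query_terms, contrib, W_hat, k=10):
    """Sketch of the uni-score algorithm of Figure 4.3. index maps a
    term u to its postings [(doc_id, f_norm), ...] as in Figure 4.2;
    contrib(u, f_norm) returns the term's contribution
    w_{d,u} * w_{q,u} of Equation 4.3; W_hat maps a document to its
    tempo-textual length."""
    acc = {}  # one accumulator A_d per candidate document
    for u in query_terms:
        for doc_id, f_norm in index.get(u, []):
            acc[doc_id] = acc.get(doc_id, 0.0) + contrib(u, f_norm)
    ranked = sorted(((a / W_hat[d], d) for d, a in acc.items()), reverse=True)
    return ranked[:k]
```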
4.4.3 Generalization

In this section, we briefly show how T²I² can be extended to more general cases.

4.4.3.1 Multiple Timespans

One advantage of T²I² is that there is no limit on the number of timespans in a document. Instead of treating the document timespan as one long (and maybe sparse) interval, we can use several separate, disjoint time intervals with T²I². This is feasible because our final temporal relevance score can be computed by separately computing the temporal score of each cell intersecting with the various document timespans. Another advantage of T²I² is its capability to represent the document timespan at any arbitrary granularity and not necessarily in common time units (e.g., days, months, years). The only information we need in order to calculate the temporal tf-idf score is the area of overlap between each cell and each document timespan. The cost is negligible since the computation happens once, during index construction.

4.4.3.2 Points

We assumed that each document timespan is an interval. In the context of the web this is a reasonable assumption; still, in cases where the document's temporal feature is only a point in time (p), we can generalize our approach as follows. We find the temporal cell that intersects with p and call it c_p. The new document timespan is c_p plus the m temporal cells before and the m cells after c_p. The value of m is determined by the user and is usually a small number. In the case of multiple points, we can either apply the above algorithm to each point and generate multiple timespans, or use one (possibly long) timespan covering all the points.

4.4.3.3 Freshness

We assumed that when a user issues a query using a time-interval, all temporal cells inside the time-interval have the same importance to the user. This assumption is true for most historical queries. However, for some types of queries (e.g., queries in which the end time is now), it makes more sense to use time decay to reduce the importance of the older cells. There exist several studies on time decay in the literature [CSSX09b]. We can integrate one of the existing approaches into our proposed technique as follows. For each cell, we define a new parameter called the temporal decay factor or df_temp. The weight given to this parameter represents the cell's decay and is inverse-polynomial in the cell's elapsed time. The factor df_temp is defined as a monotone non-increasing function, as follows:

\[
df_{temp}(c) = (t_c - t_{base} + 1)^{-(t_c - t_{base})}
\]

where c is the cell id, t_c is the time associated with cell c (e.g., the cell's start time) and t_base is the time associated with the reference point. Finally, we integrate the inverse of df_temp into our methods, similar to what we did for idf_temp. Consequently, we have three parameters impacting the temporal ranking: tf_temp, idf_temp and the inverse of df_temp.
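A direct transcription of the decay factor, under the assumption that cell times are expressed in whole time units from the reference point:

```python
def df_temp(t_c, t_base):
    """Temporal decay factor of Section 4.4.3.3; t_c and t_base are in
    the same time units (e.g., days from the reference point)."""
    elapsed = t_c - t_base
    return (elapsed + 1) ** (-elapsed)

# The ranking integrates the *inverse* of df_temp, so cells farther
# from the reference point receive an increasingly large weight:
print(1.0 / df_temp(t_c=0, t_base=0))  # 1.0
print(1.0 / df_temp(t_c=3, t_base=0))  # (3 + 1) ** 3 = 64.0
```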
4.4.3.4 Weights

When querying the system, there are two types of weight factors users may want to manipulate: 1) setting different weights for the temporal and textual relevances, and 2) setting different weights for different terms in the query. For the first scenario, we have used the parameter α_t in this dissertation. T²I² can also support setting different weights for different terms. There are several existing methods that solve this problem for textual keywords; since we are treating cells similarly to keywords, those methods can also be applied to the temporal cells. As one possible solution, we define the query term weights α_{q,k} and α_{q,c} as the weight of keyword k in query q and the weight of cell c in query q, respectively. By multiplying the w_{d,k} and w_{d,c} values by α_{q,k} and α_{q,c}, respectively, the query term weights are integrated into the relevance scores. This opens up a wide array of sophisticated query capabilities for the users.

4.4.3.5 Leveraging Existing Search Engines

Another major advantage of our approach is the fact that it can be integrated into existing search engines easily and seamlessly. Since the structure of T²I² is very similar to the structure of regular inverted indexes, the same techniques used in regular search engines (built on inverted indexes) can be applied to our time-based search technique. Structure-wise, the main difference is in using T²I² instead of a regular inverted index. This essentially translates to using a larger vocabulary (a combination of cells and keywords instead of only keywords). The average size of the inverted lists would not change, since that depends only on the total number of documents, which is fixed. This is very promising because the cost of existing search engines is dominated by the cost of traversing the inverted lists and not by the size of the vocabulary. The easy integration of our approach into existing search engines is not only very beneficial for current search engines but also enables us to optimize T²I² using the large body of work that exists in this field. More interestingly, some of the optimization techniques seem to work even better on T²I². For instance, caching is another technique used in existing search engines. It is easy to see that with T²I², by caching the inverted lists for the cells near the current query cell, we can improve the query performance significantly, since it is very likely that nearby cells are queried together.

4.5 Experiments

In this section we evaluate the efficiency and accuracy of our proposed approaches in two ways. First, we provide a cost model analysis for the proposed approaches. Next, we present results from simulations based on real document sets.

4.5.1 Cost Model

In this section, we analyze the search (query processing) cost for T²I² and the three hybrid approaches. The symbols used in the cost models are presented in Table 4.1.

Table 4.1: Symbols

| Symbol   | Meaning |
|----------|---------|
| n        | total number of documents |
| m        | total number of time-intervals (temporal expressions) |
| H(q)     | total number of time-interval-keywords (TIKs) in the query q |
| P_K(k_i) | length of the page list for the keyword k_i |
| P_C(c_i) | length of the page list for the temporal cell c_i |
| P_T(t_i) | length of the page list for the time-interval t_i |
| P_H(h_i) | length of the page list for the TIK h_i |
| T_tree   | time cost associated with retrieving an interval-tree |
| T_tRel   | time cost associated with calculating temporal relevance |
| T_list   | time cost associated with retrieving page lists |
| T_temp   | time cost of calculating (one) temporal relevance |
| T_I/O    | time cost of one disk access |
| B        | page size |

4.5.1.1 Cost of FnT

For a query with |K_q| keywords and a time-interval T_q, the search cost has three parts: 1) retrieval and processing of the interval-tree and the m lists corresponding to the m time-intervals, 2) calculating the temporal relevance for each overlapping time-interval, and 3) access and retrieval of |K_q| keywords and their corresponding |K_q| lists. In other words:

Time(FnT) = T_tree(FnT) + T_tRel(FnT) + T_list(FnT)

The cost of the retrieval and processing of an interval-tree with m leaf nodes is T_tree = O(log m).
The time to read a list of length l from disk is T_list = O(l/B) · T_I/O. For all the approaches, we ignore the cost of retrieving the |K_q| keywords from the vocabulary (we can assume that the vocabulary resides in memory and/or is implemented with a simple hash table). Thus:

\[
\text{Time}(FnT) = T_{tree}(FnT) + T_{tRel}(FnT) + T_{list}(FnT)
= O(\log m) + \sum_{i=1}^{m'} O(P_T(t_i)/B)\cdot T_{I/O} + \sum_{i=1}^{m'} T_{temp} + \sum_{i=1}^{|K_q|} O(P_K(k_i)/B)\cdot T_{I/O}
\tag{4.5}
\]

where m' is the number of time-intervals overlapping with the query's time-interval and |K_q| is the number of textual keywords in the query.

4.5.1.2 Cost of FtT

Given a query q with temporal and textual parts (T_q and K_q, respectively) and assuming H(q) is the number of time-interval-keywords (TIKs) for query q, the search cost for the FtT approach has three parts: 1) retrieval and processing of |K_q| interval-trees, 2) calculating the temporal relevance for all leaf nodes of all trees, and 3) access and retrieval of H(q) keywords and the H(q) lists corresponding to them (one list per TIK). Assuming m is the average number of leaf nodes of the interval-trees, we have:

\[
\text{Time}(FtT) = T_{tree}(FtT) + T_{tRel}(FtT) + T_{list}(FtT)
= |K_q| \cdot O(\log m) + \sum_{i=1}^{H(q)} T_{temp} + \sum_{i=1}^{H(q)} O(P_H(h_i)/B)\cdot T_{I/O}
\tag{4.6}
\]

4.5.1.3 Cost of TtF

The search cost of the TtF approach is dominated by three parts: 1) retrieval and processing of one interval-tree with m leaf nodes, 2) calculating the temporal relevance for the overlapping leaf nodes (time-intervals), and 3) access and retrieval of H(q) keywords and the H(q) lists corresponding to them. So:

\[
\text{Time}(TtF) = T_{tree}(TtF) + T_{tRel}(TtF) + T_{list}(TtF)
= O(\log m) + \sum_{i=1}^{m'} T_{temp} + \sum_{i=1}^{H(q)} O(P_H(h_i)/B)\cdot T_{I/O}
\tag{4.7}
\]

4.5.1.4 Cost of T²I²

For T²I², the main search cost is dominated by only one part: access and retrieval of the |K_q| + |T_q| terms and the |K_q| + |T_q| lists corresponding to them. Note that for all four approaches, the textual relevance calculation is done seamlessly during query processing. For T²I², the temporal relevance computation is also integrated into the search process, similar to the textual relevance calculation. Since the values of f̄_{d,c} (capturing the temporal relevance between cell c and document d) are already calculated and stored in T²I², we do not need to compute the temporal relevance for each document on-the-fly during the search process. That by itself reduces the search cost of T²I² in comparison with the other approaches. Also, since there is no extra tree structure, no cost is associated with the retrieval of one or more (interval) trees or with access to and processing of the different nodes in each tree. We can express the search cost of T²I² as follows:

\[
\text{Time}(T^2I^2) = T_{list}(T^2I^2)
= \sum_{i=1}^{|T_q|} O(P_C(c_i)/B)\cdot T_{I/O} + \sum_{i=1}^{|K_q|} O(P_K(k_i)/B)\cdot T_{I/O}
\tag{4.8}
\]

4.5.2 Performance Experiments

In this section, we present the experimental evaluation of our approaches on real-world datasets in order to study the efficiency of our proposed index structures.

Setting and Dataset: Our experiments use two datasets, with their properties summarized in Table 4.2. The FREEBASE dataset is generated from data on the Freebase website (www.freebase.com). Freebase is an online collection of structured data harvested from many sources, including individual wiki contributions. We used the events data on Freebase (http://www.freebase.com/type/schema/time/event). Based on the events' schema definition, "An event is a topic that can be described by the time or date at which it happened. Long-lasting events may be described as occurring between two dates".
Among the properties of each event (web page) on Freebase are the attributes start date and end date, which are used in our experiments as the t_s and t_e of each document, respectively. The NY-TIMES dataset is from the New York Times Annotated Corpus (http://www.ldc.upenn.edu/Catalog/docs/LDC2008T19/), which contains around 1.8 million articles published in the NY Times newspaper between 1987 and 2007. This is the de-facto document set used in most recent studies in this field. For the temporal expressions in the documents, we used the data generated by [BBAW10]. They extracted temporal information from the content of the documents as follows. Temporal expressions were extracted using TARSQI [VMS+05]. TARSQI detects and resolves temporal expressions using a combination of hand-crafted rules and machine learning, and annotates a given input document using the TimeML [tim] markup language. Building on TARSQI's output, they extracted range temporal expressions such as "from 1999 until 2002", which TARSQI could not support [BBAW10]. While the publication time of this dataset is from 1987 to 2007, the temporal expressions extracted from the content of the documents contain time-intervals for a much larger time period (temporal expressions can basically relate to any event in the past or future). For our NY-TIMES dataset, we filtered in documents with time-intervals between 1512 and 2011.

All the index structures are disk resident and the page size is set at 4 KB. For T²I², we partition the time domain into 1-day cells (i.e., each temporal cell is a day). As we show in Section 4.5.2.3, cell sizes larger than one day yield much better performance than this default cell size. All of our experiments are conducted on a machine with an Intel Core2 Duo 3.16 GHz CPU and 4 GB of main memory.

Table 4.2: Dataset Details

| Property | FREEBASE | NY-TIMES |
|----------|----------|----------|
| Total # of documents | 34,641 | 1,855,655 |
| Total # of keywords | 2,093,764 | 423,704,062 |
| Total # of unique keywords | 111,038 | 6,234,465 |
| Average # of unique keywords per document | 60 | 228 |
| Total # of unique time-intervals (temporal expressions) | 14,051 | 128,905 |
| Average # of unique time-intervals (temporal expressions) per document | 0.94 | 3.35 |
| Time range | 211 years (1800-2010) | 500 years (1512-2011) |

4.5.2.1 NY-TIMES Dataset

In this section, we evaluate the performance of T²I² in terms of the number of disk IOs and the search time for the NY-TIMES dataset. We also perform the same evaluation study and show the results for the IIO, FnT, FtT and TtF approaches. (To have a fair comparison, none of the optimization techniques, such as early termination, are implemented for any of the approaches.) For each query, we randomly choose 1 to 6 keywords from the list of the top-1000 most frequent keywords in the dataset, one random start time between January 1, 1512 and December 31, 2011, and one random time-interval length, which can be one day, one week, one month or one year. Given the start time and the time-interval length of each query, an end time is calculated and assigned to the query (i.e., end time = start time + timespan length). Queries are performed in rounds. Each round consists of 100 queries and is conducted for each input setting. The value of α_t is 0.5 unless another value is specified.

Effect of number of keywords: With the first set of experiments, we evaluate the impact of the number of keywords in each query (|K_q|) on the query cost. In this set of experiments, we vary |K_q| from 1 to 6 while fixing k at 10 and the query time-interval length at 30 days (one month). For each method, we report the average query cost in processing each round.
Figures 4.5(a) and 4.5(b) show the results for the search time and the number of page accesses, respectively.

[Figure 4.5: Impact of number of keywords on query cost - NY-TIMES; (a) search time, (b) disk I/O.]

The major observation is that T²I² is significantly superior to all the other approaches in all cases. The other observation is that all five approaches perform worse as the number of keywords increases (as expected). The impact of the increase in the number of keywords is especially significant for FtT and IIO. With FtT, the number of disk IOs increases by a factor of 7 when the number of keywords changes from 1 to 6, and for IIO this factor is around 11. This is because a larger number of query keywords for FtT results in accessing and traversing more (sometimes very large) trees. For IIO, a larger number of keywords results in more lists to retrieve and process; more importantly, a larger number of candidate documents will be returned from the textual filtering step, and as a result more documents need to be retrieved and processed in the second phase.

Effect of k: In this set of experiments, we evaluate the performance of T²I², IIO, FnT, FtT and TtF by varying the number of requested results k. We report the average query cost for each round. Here, we fix the number of keywords at 2 and the timespan length at 30 days. The value of k varies from 1 to 100. The results are shown in Figures 4.6(a) and 4.6(b).

[Figure 4.6: Impact of k on query cost - NY-TIMES; (a) search time, (b) disk I/O.]

The first observation is that T²I² again (significantly) outperforms all the other approaches. The second observation is that for all five approaches the query cost does not change much as k increases. This happens because our implementation does not use any of the early-termination techniques for any of the approaches (as we said earlier, so that we can have a fair comparison). In both figures (similar to Figures 4.5(a) and 4.5(b)), TtF performs the worst. TtF performing the worst in almost all the experiments indicates the sensitivity of TtF to the number and size of the time intervals in the system (i.e., the temporal distribution of the data). For both datasets, the number of time intervals in the dataset is fairly large, resulting in many leaf nodes and inverted files in TtF. Also, for leaf nodes that represent large time intervals (which is common in our datasets), not much filtering is done based on the temporal aspect of the data, and hence the inverted files for those leaf nodes will be fairly large.

Effect of time-interval length: In the third set of experiments, we evaluate the impact of changing the length of the query's time-interval. For different rounds, we set the query's time-interval length to one day, one week (7 days), one month (30 days) and one year (365 days). In this set of experiments, we fix the number of keywords at 2 and k at 10. The results for the search time and the number of disk IOs are shown in Figures 4.7(a) and 4.7(b), respectively.

[Figure 4.7: Impact of time-interval size on query cost - NY-TIMES; (a) search time, (b) disk I/O.]

Again, for most cases T²I² outperforms the other four approaches. As expected, the query cost increases for T²I², FnT, FtT and TtF as the time-interval length increases.
For T²I², an increase in the query time-interval size translates into an increase in the number of query terms (temporal cells) and consequently more disk IOs (and search time). Hence, for very long query time-intervals with small query cell sizes (e.g., one day), our proposed solution does not significantly outperform all the other approaches. Even for very long time-intervals, however, T²I² still significantly outperforms both FnT and TtF, because both FnT and TtF need to 1) process huge time-intervals, and 2) retrieve and access a large number of leaf nodes and their corresponding (page) lists (due to the size of the query interval). FtT needs to traverse smaller trees and smaller time-intervals, and the IIO results change only slightly for different query time-interval lengths, since its performance does not depend on the temporal part of the query. As we show in Section 4.5.2.3, increasing the temporal cell size from one day to one week, one month, etc. significantly improves the performance of T²I².

4.5.2.2 FREEBASE Dataset

We also evaluated the performance of T²I² in terms of the number of disk IOs and the search time for the FREEBASE dataset. Except for the scale of the results, the trend is very similar to the results reported and discussed for the NY-TIMES dataset. Due to space limitations and the similarity of the results between the NY-TIMES dataset and the FREEBASE dataset, we do not report the results from the FREEBASE dataset here.

4.5.2.3 Cell Size (Temporal Granularity)

Finally, we show how choosing different values for the temporal cell size (temporal granularity) affects the performance of the system. As we noted earlier, we chose one day as the default cell size while constructing T²I², to comply with most of the existing studies and also because the day was the smallest unit of time extracted from the New York Times Annotated Corpus in [BBAW10]. To perform this set of experiments, we built T²I² four different times (once for each cell size) and for both datasets (eight times in total). We randomly generated a set of 100 queries for each dataset while fixing k at 10, the number of keywords at 2 and the query time-interval length at 365 days for each query. The four cell sizes are: 1 day, 7 days (one week), 30 days (one month) and 365 days (one year). The results are shown in Figure 4.8.

[Figure 4.8: Impact of cell size on query cost.]

The main observation for both datasets is that the performance improves as the cell size increases. This improvement is most significant when the cell size increases from one day to one week (note that the figures are in logarithmic scale). A larger cell size results in a smaller number of temporal cells and hence fewer list retrievals for the same query. For instance, for a query with a time-interval equal to 365 days, when each cell is one day we need to access 365 temporal cells and possibly retrieve all their corresponding inverted lists. This number decreases significantly for larger cell sizes (e.g., 30 days). One can argue that by increasing the cell size, the number of documents overlapping with each temporal cell, and consequently the size of the page lists, will increase and that this will affect the performance. This is true, but as seen in Figure 4.8, the impact of the increase in list size is not significant and is dominated by the impact of the decrease in the number of cells, resulting in an overall decrease in the performance cost. There are two reasons for this behavior. First, for many documents, the time-intervals are large and contain many consecutive temporal cells.
As a result, there will be many repeated documents in the lists corresponding to these consecutive cells. When these consecutive cells merge and become one cell (e.g., when the cell size changes from 1 to 7 days), not many new documents will be added to the new cell. Second, the increase in the number of documents for each temporal cell would have to be significantly large in order to have a significant impact on the number of disk page IOs. This is true because each (disk) page can store a large number of document postings for each inverted list.

4.5.3 Accuracy Experiments

In this section, we evaluate the effectiveness of our proposed approaches in terms of accuracy.

4.5.3.1 Setting and Queries

Data. We used the real-world FREEBASE dataset. As we explained earlier, this dataset contains information regarding events on the Freebase website. Originally, this data contained 74,591 events (web pages). In processing this data, we removed the events with no start or end dates, and also removed any event that occurred before the year 1800. After these steps, the final dataset's size was reduced to 34,641 documents.

Queries. For the queries, we generated a set of 100 queries from different Freebase topics and assigned them timestamps with different granularities in length. The topics were sports, politics and misc. Timestamp length granularities ranged from one day to a few decades. We categorized the timestamp granularities into three groups: short, ranging from one day to a few weeks; medium, ranging from one month to a few months; and long, ranging from one year to several years. To fit these groups, we had to filter out and/or tune some query timestamps. Finally, we categorized all 100 queries into nine different groups with regard to their timestamp granularity and topic, and randomly selected two queries for each group from our set of 100 queries. All the selected queries are shown in Table 4.3.

Table 4.3: Queries

| Granularity | Sports | Politics | Misc. |
|-------------|--------|----------|-------|
| Short | ncaa men basketball [03/24/2010-04/07/2010] | poland [03/24/1943-04/07/1943] | film festival [09/03/2000-09/15/2000] |
|       | swimming olympics [08/10/2008-08/18/2008] | protests [06/01/2009-06/29/2009] | rock concert [05/15/2009-05/30/2009] |
| Medium | roger federer [06/01/2008-10/31/2009] | senate election [03/01/2008-08/30/2008] | earthquake [01/15/1990-07/31/1991] |
|        | nfl ravens [08/05/2004-06/31/2005] | bombing [05/15/2007-09/31/2007] | vietnam operation [03/01/1966-04/10/1967] |
| Long | lakers celtics [08/01/1984-07/29/1986] | german battle [09/03/1916-09/15/1922] | cholera [01/01/1820-12/30/1840] |
|      | fifa world cup [01/01/1958-12/31/1970] | Iraq war [08/01/1980-09/29/1990] | STS shuttle [08/01/2006-06/30/2010] |

Approaches. We computed the top-5 query results for each query using the four approaches of Section 4.4.1 (DI, DD, UD and UI), and also the baseline and hybrid approaches of Section 4.2, denoted BH. (Note that all three hybrid approaches as well as the baseline approach generate the same final ranking.)

Relevance Assessment. After computing the top-5 results for each of our 18 queries using all 5 approaches, we ran a user study using Amazon Mechanical Turk (https://www.mturk.com/). One task (HIT) was generated for each unique result (web page). The web page, alongside the query keywords and the query timestamp in a regular date format (e.g., November 7th, 1980 to December 12, 1981), was provided to the workers. Workers could choose whether the web page is relevant or non-relevant. They could also choose the 'I cannot assess this document' option (in case their knowledge was not sufficient to evaluate the document).
Workers could also add comments/explanations for each assessment. Each task (web page assessment) was assessed by five workers. Each worker was rewarded $0.02 upon completion of each assessment. Overall, workers chose relevant for 64% of the assessments, non-relevant for 33% of the assessments and 'I cannot assess this document' for 3% of the assessments.

4.5.3.2 Results

We evaluated the accuracy of the methods under comparison using two standard metrics: precision at k and nDCG at k. In calculating precision at k, we consider a document relevant if the majority of workers assessed that document as relevant, and non-relevant otherwise. When computing nDCG at k, we consider the average relevance given by the users to each document, interpreting relevant as grade 1 and non-relevant as grade 0.

Overall. The overall result of our relevance assessments with k = 5 for the five approaches under comparison is shown in Table 4.4. For the precision@5, the first observation is that all of the evaluated methods generate accurate results (precision larger than 0.6), while three of the (seamless) tempo-textual ranking methods (DI, DD and UD) generate very accurate results (precision larger than 0.8). The second observation is that, as expected, the dual-score approaches perform the search more accurately than the uni-score approaches: using two scores and two document lengths generates more accurate rankings than using one combined score and only one document length. As for the nDCG@5, the above observations are reconfirmed. Three of our tempo-textual ranking methods outperform all the other approaches by a clear margin, while the baseline and naive (hybrid) approaches attain an nDCG@5 of less than 0.7. Again, the dual-score approaches perform better than the uni-score approaches.

Table 4.4: Precision@k and nDCG@k of various rankings

| Method | Precision@5 | nDCG@5 |
|--------|-------------|--------|
| DI     | 0.83        | 0.74   |
| UI     | 0.68        | 0.62   |
| DD     | 0.88        | 0.77   |
| UD     | 0.82        | 0.74   |
| BH     | 0.63        | 0.58   |

Topic. For each query topic, we evaluate the effectiveness of each method using the same metrics. In Table 4.5, we show the results of the relevance assessment for each query topic separately. The results support our prior observations; the most accurate method for all three topics is DD.

Table 4.5: Precision@k and nDCG@k by topic

| Method | Sports P@5 | Sports N@5 | Politics P@5 | Politics N@5 | Misc. P@5 | Misc. N@5 |
|--------|------------|------------|--------------|--------------|-----------|-----------|
| DI     | 0.73       | 0.71       | 0.9          | 0.75         | 0.86      | 0.76      |
| UI     | 0.53       | 0.53       | 0.66         | 0.59         | 0.86      | 0.73      |
| DD     | 0.8        | 0.73       | 0.96         | 0.77         | 0.9       | 0.82      |
| UD     | 0.76       | 0.70       | 0.8          | 0.69         | 0.9       | 0.85      |
| BH     | 0.53       | 0.53       | 0.66         | 0.58         | 0.7       | 0.62      |

Timestamp Granularity. In this experiment, we present the results of our relevance assessment for the different query timestamp lengths. The results are shown in Table 4.6. Clearly, there exist significant variations in the accuracy of the approaches across different timestamp granularities; the best ranking varies by timestamp granularity and by measure.

Table 4.6: Precision@k and nDCG@k by timestamp granularity

| Method | Short (days) P@5 | Short (days) N@5 | Medium (months) P@5 | Medium (months) N@5 | Long (years) P@5 | Long (years) N@5 |
|--------|------------------|------------------|---------------------|---------------------|------------------|------------------|
| DI     | 0.8              | 0.76             | 0.73                | 0.63                | 0.96             | 0.83             |
| UI     | 0.8              | 0.70             | 0.6                 | 0.52                | 0.66             | 0.64             |
| DD     | 0.96             | 0.82             | 0.73                | 0.68                | 0.96             | 0.83             |
| UD     | 1                | 0.83             | 0.66                | 0.67                | 0.8              | 0.73             |
| BH     | 0.76             | 0.68             | 0.66                | 0.57                | 0.46             | 0.50             |

Chapter 5

Social-Textual Web Search

This chapter is organized as follows. In Section 5.1, we present the problem statement as well as an overview of our prototype system PERSOSE. In Section 5.2, we present a novel model to capture the relevance between documents and users based on users' social activities.
We propose three levels of PerSocialization based on three sets of social signals. In addition, we propose three new ranking approaches to combine the textual and social features of documents and users. Finally, we conduct a comprehensive set of experiments using 14 million documents from Wikipedia as our document set and real Facebook users as our users. The complete evaluation and analysis of these experiments are presented in Section 5.3.

5.1 Overview

In this section, we present the problem statement without going into much detail (we present some of the definitions/formalizations in Section 5.2.1). We also provide the system overview of PERSOSE.

The objective of the PERSOSE search engine can be stated as follows. Suppose D = {d_1, d_2, ..., d_n} is the set of documents that exist in our system. Each document is composed of a set of textual keywords. Also, there is a set U = {u_1, u_2, ..., u_m} of users interacting with the system. Users can search for documents, but more importantly, users can also perform a set of defined social actions (e.g., LIKE, RECOMMEND, SHARE) on the documents. We also assume a social network modeled as a directed graph G = (V, E) whose nodes V represent the users and whose edges E represent the ties (relationships) among the users. Finally, each query issued to the system has two parts: the textual part of the query, which is presented as a set of textual keywords (terms), and the social part of the query, which is defined mainly as the user issuing the query. The goal of PERSOSE is to first identify and model the social dimension of the documents in the system, and then to score and rank the documents based on their relevance to both the textual and the social dimensions of the query. We call the type of search performed by PERSOSE PerSocialized Search, since the search is personalized using social signals.

System Overview. A general overview of PERSOSE is displayed in Figure 5.1. As shown in this figure, there exist two types of modules in PERSOSE: modules that belong to the (existing) textual search models and modules that are new and are part of the new social model. In Figure 5.1, textual modules are displayed with solid lines, social modules are depicted with dotted lines and modules with both textual and social features are shown with mixed lines.

[Figure 5.1: Overview of PERSOSE.]

Accordingly, PERSOSE has two engines. 1) The textual engine reads (crawls) the documents in the system and generates the necessary textual meta-data for each document (e.g., textual vectors); there is nothing new about the textual engine. 2) The social engine has two inputs. One is the social network G with all its properties and relationships. The second is a data structure maintaining a dataset of users' social activities. This dataset contains, for each user in the social network, all their social activities (feed), including their interactions with documents in the system. The social engine processes this dataset as well as the graph G and generates multiple social vectors for documents and users. In addition to the social vectors, the social engine defines and calculates relevance scores between documents and users as well as among documents. A description of each vector, as well as the detailed description of the new relevance model, is given in Section 5.2.1.

Another major module in our system is the ranker module. The ranker, which contains both the textual and PerSocial aspects, receives queries from each user, generates a ranked list of documents for each query and returns it to the user.
As we mentioned earlier, each query has two parts: the textual part of the query (a set of terms) and the user issuing the query. The ranker takes both pieces of information, as well as the different vectors generated by the textual and social engines, and using one of the approaches described in Section 5.2.2 ranks the documents based on their (textual and social) relevance to the query. Details of the different ranking approaches are discussed in Section 5.2.2.

5.2 PerSocialization

In this section, we show how to personalize the search results using social data, or what we call search PerSocialization. First, we propose a new relevance model called the PerSocial relevance model to capture and model the social information for both documents and users. In the second part, we show how to use the proposed PerSocial relevance model to perform PerSocialized search, and we propose various rankings.

5.2.1 PerSocial Relevance Model

In this section, we model the social relationships between users and documents as well as other social information about users, and propose a new weighting scheme to quantify the relevance of each user to each document. We define the PerSocial relevance model at three levels, each level complementing the previous one. We develop the simplest model in level 1, using a minimal amount of social data, i.e., social data from the user himself. We extend our model significantly in level 2, creating the core of our PerSocial model. At this level, we also define multiple new social vectors in order to be able to model the PerSocial relevance more accurately. In the process of modeling level 2 PerSocial relevance, we create a new weighting scheme called the uf-ri weighting scheme and define new weights and weight functions for several relationships in the system. Finally, in level 3, we extend our model even further using the concept of social expansion.

5.2.1.1 PerSocial Relevance - Level 1

In the first level of the PerSocial model, we leverage each user's past social data to calculate the PerSocial relevance between that user and the documents.

Definition. We formalize social interactions between users and documents by social actions. We define A = {a_1, a_2, ..., a_l} as the set of all possible social actions available to the system. For each document d_j, a set A_{d_j} defines the set of valid (supported) actions for d_j. A_{d_j} is a subset of A (A_{d_j} ⊆ A) and contains all the social actions possible for document d_j. For each user u_i, we define a set UDA_i as the set of all document-action pairs performed by user u_i. To be more formal, UDA_i = {(d_j, a_k) | there is an action a_k on document d_j by user u_i}. Each social action is unique and can be applied only once by user u_i on document d_j (nevertheless, that action can be applied by the same user u_i on multiple documents and/or by multiple users on the same document d_j).

Social actions do not have equal importance. We define a weight function W: A → R mapping social actions to real numbers in the range [0, 1]. The values generated by the weight function represent the importance of each social action in the system. The weight function should be designed by a domain expert under the following two constraints: 1) each weight should be between 0 and 1 (inclusive), and 2) the more important the action, the higher the value. The importance of the actions is determined based on the domain/application.

Example. Assume that our document set contains all the web pages of a sports website (e.g., ESPN).
Example. Assume that our document set contains all the web pages of a sports website (e.g., ESPN). Web pages can include news articles, athlete profile pages, sports team pages and so on. Also, this website is already integrated (connected) with a social network platform. In this example, all web pages in our document set are connected to the Facebook social plug-ins (https://developers.facebook.com/docs/plugins/) and support the following social actions: LIKE, RECOMMEND and SHARE. So, A = {LIKE, RECOMMEND, SHARE} and also A_dj = {LIKE, RECOMMEND, SHARE} for each and every d_j in our document set (all documents support all actions). Each user u_i in the system can LIKE, RECOMMEND or SHARE any document d_j on the website. For this example, we define the weight function W as follows: W(RECOMMEND) = 0.6, W(LIKE) = 0.6, and W(SHARE) = 0.8. These weights indicate that in this domain, SHARE is the most important action, and the LIKE and RECOMMEND actions have the same importance.

Definition. The PerSocial relevance - level 1 between document d_j and user u_i is defined based on the number and type of social actions between user u_i and document d_j, as follows:

psRel_L1(u_i, d_j) = Σ_{a_k : (d_j, a_k) ∈ UDA_i} W(a_k)

where psRel_L1(u_i, d_j) is the PerSocial relevance level 1 between user u_i and document d_j.

Example. In our running example, assume we have two documents d_1 and d_2 and user u_1. User u_1 has LIKED and SHARED d_1, and he also has RECOMMENDED document d_2. Hence, psRel_L1(u_1, d_1) = W(LIKE) + W(SHARE) = 1.4 and psRel_L1(u_1, d_2) = W(RECOMMEND) = 0.6.

5.2.1.2 PerSocial Relevance - Level 2

The amount of data generated from one user's social actions is typically insignificant. If we only consider the user's own social actions, many documents will end up having a PerSocial relevance of zero for that user. In addition, as we discussed earlier, people have very similar interests to their friends and trust the opinions of their friends more than those of others. Hence, in the second level of the PerSocial model, we utilize friendship relationships between users to improve and extend the level 1 model.

Definition. A weight w_i,j ≥ 0 is associated with each user u_i and document d_j. The term w_i,j represents the social importance/relevance of user u_i to document d_j, and its value is equal to psRel_L1(u_i, d_j) defined earlier. For a user u_i with no social action on document d_j, w_i,j = 0. We define the document social vector to represent the social dimension of document d_j and denote it by S_dj, defined as below:

S_dj = (w_1,j, w_2,j, ..., w_m,j)

where m is the total number of users.

The concept of a social vector for a document is analogous to (and inspired by) the concept of the textual vector of a document. While the textual vector represents the textual dimension of a document, the social vector characterizes its social dimension. Moreover, our weights w_i,j are analogous to term frequency weights (tf_i,j) in the context of textual search. While each tf_i,j indicates the relevance between term (keyword) i and document j, each w_i,j represents the relevance between user i and document j. Traditionally (and in the context of textual search), such term frequency is referred to as the tf (term frequency) factor and offers a measure of how well that term describes the document's textual content. Similarly, we name our social weights (w_i,j) the uf (user frequency) factor. The uf factor provides a measure of how well a user describes a document's social content.
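To make the level 1 scheme and the uf factor concrete, the following is a minimal Python sketch for the running example; the data structures and names (ACTION_WEIGHTS, uda, ps_rel_l1) are illustrative assumptions, not part of the PERSOSE implementation.

```python
# A toy instantiation of the level 1 model for the running example.
# The weight function W and the UDA sets are assumptions; any
# domain-specific weights in [0, 1] would work.

ACTION_WEIGHTS = {"RECOMMEND": 0.6, "LIKE": 0.6, "SHARE": 0.8}  # W : A -> [0, 1]

# UDA_i: the document-action pairs performed by each user.
uda = {
    "u1": {("d1", "LIKE"), ("d1", "SHARE"), ("d2", "RECOMMEND")},
}

def ps_rel_l1(user: str, doc: str) -> float:
    """psRel_L1(u_i, d_j): sum of W(a_k) over u_i's actions on d_j."""
    return sum(ACTION_WEIGHTS[a] for (d, a) in uda.get(user, set()) if d == doc)

def document_social_vector(doc: str, users: list) -> list:
    """S_dj = (w_1j, ..., w_mj): the uf factors of all users for doc."""
    return [ps_rel_l1(u, doc) for u in users]

print(ps_rel_l1("u1", "d1"))  # 1.4 = W(LIKE) + W(SHARE)
print(ps_rel_l1("u1", "d2"))  # 0.6 = W(RECOMMEND)
```

Because each action can occur at most once per user-document pair, a set of pairs suffices here; a production system would likely store these weights in an inverted structure rather than recompute them per query.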
Example. Continuing with our running example, let's add users u_2 and u_3 to the system. Suppose u_2 has LIKED document d_1 and u_3 has no social action on d_1. Given this information and the previous information about u_1, the social vector for d_1 is as follows: S_d1 = (w_1,1, w_2,1, w_3,1) = (1.4, 0.6, 0).

Definition. We measure w′_i,p, the weight between user u_i and user u_p, based on the user relatedness function between users u_i and u_p. The user relatedness function is denoted by W′(u_i, u_p) and measures the relatedness/closeness of two users. There are several existing measures to calculate the relatedness/closeness of two nodes in a graph/social network. Some of the approaches consider the distance between nodes, some look at the behaviors of users in a social network, and some take into consideration the number of mutual neighbors of two nodes. As long as the required data is available, any of the above methods or any other existing method can be used for the user relatedness function, provided that the following three constraints are satisfied: (1) W′(u_i, u_i) = 1, (2) 0 ≤ W′(u_i, u_p) ≤ 1, and the more related the users, the higher the value, and (3) W′(u_i, u_p) is set to 0 when its value falls below a threshold δ. The first constraint states that each user is the most related user to himself. The second constraint normalizes this measure and also ensures that more related users are assigned higher scores. Finally, the third constraint filters out all relationships whose significance is below a certain threshold (δ).

Now, we define the user social vector to represent the social dimension of user u_i and denote it by S′_ui, defined as below:

S′_ui = (w′_1,i, w′_2,i, ..., w′_m,i).

Example. Let's add users u_4 and u_5 to the running example. The friendship structure among all five users of our system is depicted in Figure 5.2. In the following, we calculate the user social vector for user u_1 using two different user relatedness functions.

[Figure 5.2: Friendship Structure for the Running Example]

As case 1, we use the inverse of the distance between two users (in the network) to capture their relatedness. We also set the threshold value δ equal to 0.3. More formally,

W′(u_i, u_p) = 1 / (dist(u_i, u_p) + 1)

where δ = 0.3 and dist(u_i, u_p) is the number of edges in a shortest path connecting u_i and u_p (dist(u_i, u_i) = 0). Using this function for user u_1:

S′_u1 = (W′(u_1, u_1), W′(u_2, u_1), W′(u_3, u_1), W′(u_4, u_1), W′(u_5, u_1)) = (1, 0.5, 0.33, 0, 0.5).

Note that W′(u_4, u_1) = 1/(1+3) = 0.25, but since 0.25 < 0.3, this value becomes zero. As case 2, we only consider friends (direct links) and calculate the relatedness as follows:

W′(u_i, u_p) = numMutualFriends(u_i, u_p) / max(numFriends(u_i), numFriends(u_p)) if u_i and u_p are friends, and 0 otherwise,   (5.1)

where numMutualFriends(u_i, u_p) is the number of mutual friends between u_i and u_p and numFriends(u_i) is the number of friends of u_i. Using this user relatedness function, the user social vector for u_1 is calculated as follows: S′_u1 = (1, 0.5, 0.5, 0, 0).
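Both relatedness functions can be sketched in a few lines of Python. The friendship graph below is a hypothetical reconstruction chosen so that the case 1 vector matches the text; the true edge set is defined by Figure 5.2 itself, and the case 2 values depend on those exact edges.

```python
from collections import deque

# Hypothetical reconstruction of the Figure 5.2 friendship graph.
FRIENDS = {
    "u1": {"u2", "u5"},
    "u2": {"u1", "u3", "u5"},
    "u3": {"u2", "u4"},
    "u4": {"u3"},
    "u5": {"u1", "u2"},
}
DELTA = 0.3  # relatedness threshold; values below it are zeroed out

def dist(a: str, b: str) -> float:
    """Number of edges on a shortest path from a to b (BFS)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in FRIENDS[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")  # unreachable users

def w_prime_case1(ui: str, up: str) -> float:
    """Case 1: inverse-distance relatedness with threshold delta."""
    v = 1.0 / (dist(ui, up) + 1)
    return v if v >= DELTA else 0.0

def w_prime_case2(ui: str, up: str) -> float:
    """Case 2 (Eq. 5.1): mutual-friends ratio for direct friends."""
    if ui == up:
        return 1.0  # constraint (1): a user is most related to himself
    if up not in FRIENDS[ui]:
        return 0.0
    mutual = len(FRIENDS[ui] & FRIENDS[up])
    return mutual / max(len(FRIENDS[ui]), len(FRIENDS[up]))

print([round(w_prime_case1(u, "u1"), 2) for u in ["u1", "u2", "u3", "u4", "u5"]])
# -> [1.0, 0.5, 0.33, 0.0, 0.5], reproducing the case 1 example
```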
In addition to the relatedness between users, knowing the overall importance/influence of each user can also help us detect (and thus give more weight to) social actions with higher quality and more reliability. Often, when a high-profile user (super user) performs a social action on a document, that action and consequently that document are of higher value/quality compared to the case when the same action is performed on the same document by a less influential user. We quantify the overall (global) importance of each user by the user weight function W″(u_i). This measure quantifies the significance of a user in the social network. For instance, on Twitter, a user with many followers will be assigned a higher weight than a user with only a few followers; on Facebook, a user with more friends is often more important to the social network than a user with fewer friends. In the field of graph theory and social networks, this value is called centrality, and there exist several approaches to measure it. Four popular methods to compute the centrality value are degree centrality, betweenness, closeness, and eigenvector centrality [Fre79]. Similar to the user relatedness function, the user weight function is also generic enough that most of the existing approaches can be applied to obtain W″.

Definition. We define a weight function W″ : U → R mapping users to real numbers in the range [0,1]. Each value w″(i) generated by this weight function represents the overall importance of user i in the system. The weight function should satisfy the following two constraints: 1) each w″(i) should be between 0 and 1 (inclusive), and 2) the more important the user, the higher the value. The importance of users is determined by the user weight function W″. (Commercialized and more sophisticated examples of this measure include Klout (klout.com) and PeerIndex (peerindex.com).)

In the context of textual search, there is the idf (inverse document frequency) factor for each term in the system, offering a measure of how important (distinctive) that term is in the system. Analogously, we name the weights generated by the weight function W″ the ui (user influence) factor. The value of ui for a user provides a measure of how important that user is in the system. We define the influence social vector to represent the importance/influence of all the users, and denote it by S″. S″ is defined as follows:

S″ = (w″_1, w″_2, ..., w″_m).

Example. For the network depicted in Figure 5.2, we use the degree centrality of nodes (users) as an indication of their importance, as follows:

W″(u_i) = deg(u_i) / (m − 1)

where deg(u_i) is the number of edges of node u_i and m is the number of nodes (users). Using the above user weight function, the following weights are generated for the five users: w″(u_1) = 0.5, w″(u_2) = 0.75, w″(u_3) = 0.5, w″(u_4) = 0.25, w″(u_5) = 0.5. Thus, S″ = (0.5, 0.75, 0.5, 0.25, 0.5).

Definition. The PerSocial relevance - level 2 between document d_j and user u_i is defined based on the number and type of social actions between user u_i and document d_j, the relationships between user u_i and other users, the overall importance of each user, and the number and type of social actions between user u_i's friends (more precisely, the set U′ of users such that for every u′_l ∈ U′, W′(u′_l, u_i) > δ) and document d_j, as follows:

psRel_L2(u_i, d_j) = Σ_{k=1}^{m} w(k,j) × w′(k,i) × w″(k)   (5.2)

where w(k,j) is the user frequency (uf) factor, w′(k,i) is the user relatedness (ur) factor, and w″(k) is the user influence (ui) factor. We call this weighting scheme the uf-ri (user frequency-relatedness influence) weighting scheme. While in classical textual weighting schemes such as tf-idf, for given terms, more weight is given to documents with 1) more occurrences of the terms (tf) and 2) more important terms (idf), in our uf-ri weighting scheme, for a given user, more weight is given to documents with 1) more important actions, 2) performed by more important users, 3) who are more related (closer) to the given user.
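A sketch of Equation 5.2 in Python follows. The three factor vectors are hardcoded with the running example's values (uf from level 1 on d_1, ur from case 2, ui from degree centrality); the variable names are assumptions made for illustration.

```python
# Sketch of the uf-ri weighting scheme (Equation 5.2) on the running
# example; the factor values are taken from the text.

UF = {"u1": 1.4, "u2": 0.6, "u3": 0.0, "u4": 0.0, "u5": 0.0}    # w(k, d1)
UR = {"u1": 1.0, "u2": 0.5, "u3": 0.5, "u4": 0.0, "u5": 0.0}    # w'(k, u1), case 2
UI = {"u1": 0.5, "u2": 0.75, "u3": 0.5, "u4": 0.25, "u5": 0.5}  # w''(k)

def ps_rel_l2(uf, ur, ui):
    """psRel_L2 = sum over all users k of uf(k) * ur(k) * ui(k)."""
    return sum(uf[k] * ur[k] * ui[k] for k in uf)

print(round(ps_rel_l2(UF, UR, UI), 3))  # 0.925, as in the worked example below
```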
Example. Given the values we have so far (using case 2 for W′), the PerSocial relevance level 2 between u_1 and document d_1 is calculated as follows:

psRel_L2(u_1, d_1) = Σ_{k=1}^{5} w(k,1) × w′(k,1) × w″(k) = 1.4 × 1 × 0.5 + 0.6 × 0.5 × 0.75 + 0 + 0 + 0 = 0.7 + 0.225 = 0.925

5.2.1.3 PerSocial Relevance - Level 3

In this section, we present the concept of social expansion and discuss how it can be useful in generating more accurate PerSocial relevance scores. We show how to define level 3 of the PerSocial relevance by integrating social expansion into the PerSocial relevance level 2.

Each document on the web is often well connected to other documents, most commonly via hyperlinks. We argue that the social features of each document should be dynamic, meaning that the social actions/signals of a document can and should be propagated to other adjacent documents. A user's interest in a document - shown by a social action such as LIKE - can often imply the user's interest in other relevant documents, which are often connected to the original document. In simpler words, we enable social signals to flow in the network of documents. (Many existing approaches and definitions can be used to measure connections between documents; here, we do not go into the details of such approaches.) We propose to propagate social actions from one document - with some social action - to all documents connected to that document.

As an example, imagine a user LIKES ESPN's Los Angeles Lakers page. Using this signal (action) alone can help us derive the fact that this document is socially relevant to this user. However, we can do much better by taking into consideration the documents adjacent to the Los Angeles Lakers document. By looking at the documents that the original document links to, we can retrieve a new set of documents that are also socially relevant to the user. In our example, the Los Angeles Lakers document has outgoing links to documents on the NBA and Kobe Bryant. Assuming there is one outgoing link to each of the two documents, half of the original social score can be given to each of these two new documents. As a result, the documents on the NBA and Kobe Bryant become socially relevant to the user as well (note that the original Los Angeles Lakers document is still more socially relevant to the user than the other two documents). If we continue this propagation, many new documents will get adjusted social scores from the same social action.

We define the PerSocial relevance level 3 (psRel_L3) between document d_j and user u_i as follows:

psRel_L3(u_i, d_j) = psRel_L2(u_i, d_j) + Σ_{d_k ∈ D_dj} V′(d_k, d_j) × psRel_L2(u_i, d_k)   (5.3)

where psRel_L2(u_i, d_j) is the (level 2) PerSocial relevance between document d_j and user u_i as defined in Equation 5.2, D_dj is the set of documents connected to document d_j, and V′(d_k, d_j) is the value of the document relatedness function between documents d_k and d_j. The document relatedness function measures the connectivity of two documents. Again, we intentionally define this function as generically as possible and do not limit our model to any particular implementation. Simple models, such as the number of hyperlinks between two documents, or more sophisticated models, such as those that calculate the textual and/or topical similarities between two documents, can be used.
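The Lakers example can be turned into a small sketch of Equation 5.3. The link structure and V′ weights below are assumptions (each document splits a unit of weight evenly over its outgoing links), not a prescribed implementation of the document relatedness function.

```python
# Sketch of social expansion (Equation 5.3) for the Lakers example.

V_PRIME = {  # assumed V'(source, target) weights, summing to 1 per source
    "Los Angeles Lakers": {"NBA": 0.5, "Kobe Bryant": 0.5},
}

def ps_rel_l3(user, doc, ps_rel_l2):
    """Level-2 score of doc plus weighted level-2 scores propagated
    from documents connected to it."""
    score = ps_rel_l2(user, doc)
    for src, out_links in V_PRIME.items():
        if doc in out_links:
            score += out_links[doc] * ps_rel_l2(user, src)
    return score

# Assume one LIKE gives the Lakers page a level-2 score of 0.5:
level2 = lambda u, d: 0.5 if d == "Los Angeles Lakers" else 0.0
print(ps_rel_l3("u1", "NBA", level2))                 # 0.25
print(ps_rel_l3("u1", "Los Angeles Lakers", level2))  # 0.5, still highest
```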
The main advantage of using social expansion is to find more socially relevant documents for each user. Social expansion also helps in adjusting documents' scores and assigning more accurate relevance scores to each document. Imagine a user who has two explicit LIKES, on Google and Microsoft. The same user also has other social actions, on XBOX and Bing. Without using expansion, both Google and Microsoft generate the same social weight for this user, while using expansion will propagate some weight from both XBOX and Bing to Microsoft and hence gives Microsoft a slight advantage (assuming there are links from XBOX and Bing to Microsoft). Using social expansion is also very practical for the current state of the web, where social actions are not very common yet and many documents do not have any social action. Social expansion helps more documents get scored and hence improves the overall social search experience.

5.2.2 PerSocialized Ranking

As described earlier, the goal of the ranker module in PERSOSE is to personalize and rank the search results using both the social and textual features of the documents. In this section, we discuss three different approaches to rank the documents based on the combination of the textual relevance and PerSocial relevance scores. In any of the discussed approaches, a PerSocial relevance model of any level (1 through 3) can be applied. Hence, for instance, if friends' information does not exist in the system and only the querying user's own actions are available, we can use PerSocial relevance level 1 as the PerSocial relevance model in the proposed approaches. We also incorporate textual relevance in the proposed approaches. Any existing textual model (e.g., tf-idf [SB97], BM25 [RW94]) can be used to calculate the textual relevance scores. Furthermore, we note that most of the existing search optimization techniques (e.g., PageRank [PBMW99]) and other personalization approaches are orthogonal to our approaches and can be added to the textual relevance model part (for instance, a combination of tf-idf and PageRank can be used as the textual model). A sketch of the first two (two-step) approaches is given after their descriptions below.

5.2.2.1 Textual Filtering, PerSocial Ranking

In the textual filtering, PerSocial ranking approach, a regular textual filtering is conducted first, and all the documents with textual relevance larger than 0 are returned (in the simplest case, the documents that contain at least one of the query keywords). Next, the remaining documents are scored and ranked using their PerSocial relevance to the querying user. This is a two-step process in which filtering is based on the textual dimension of the documents and ranking is based on the social aspect of the documents.

5.2.2.2 PerSocial Filtering, Textual Ranking

In the PerSocial filtering, textual ranking approach, any document d_j with no PerSocial relevance to the querying user u_i (i.e., psRel(u_i, d_j) = 0) is pruned first. The result of this step is a set of documents with at least one social action from the querying user or her friends (related users). Next, regular textual search is performed on the remaining documents, and the documents are scored and ranked based on the textual relevance model. This is also a two-step process, with the filtering step based on the social dimension of the documents and the ranking step based on the textual features of the documents.
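The two two-step rankers can be sketched schematically in Python; tex_rel and ps_rel stand for whichever textual and PerSocial relevance functions are plugged in, and all names here are illustrative rather than part of PERSOSE.

```python
# Schematic sketches of the two two-step rankers.

def rank_tp(terms, user, docs, tex_rel, ps_rel):
    """Textual filtering, PerSocial ranking (Section 5.2.2.1)."""
    candidates = [d for d in docs if tex_rel(terms, d) > 0]
    return sorted(candidates, key=lambda d: ps_rel(user, d), reverse=True)

def rank_pt(terms, user, docs, tex_rel, ps_rel):
    """PerSocial filtering, textual ranking (Section 5.2.2.2)."""
    candidates = [d for d in docs if ps_rel(user, d) > 0]
    return sorted(candidates, key=lambda d: tex_rel(terms, d), reverse=True)
```

The third, hybrid approach, described next, replaces the filtering step with a single weighted combination of the two scores.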
5.2.2.3 PerSocial-Textual Ranking

In the PerSocial-textual ranking approach, both the textual and the social features of the documents are used simultaneously to calculate the final relevance of the query to each document. We define Rel(q, d_j) as the overall (textual plus PerSocial) relevance of document d_j to query q. The value of Rel(q, d_j) is defined by a monotonic scoring function F of the textual relevance and PerSocial relevance values. In PERSOSE, F is the weighted sum of the PerSocial relevance and textual relevance scores:

Rel(q, d_j) = F(psRel(u_q, d_j), texRel(T_q, d_j)) = α_p · psRel(u_q, d_j) + (1 − α_p) · texRel(T_q, d_j)

where T_q is the textual part of the query, u_q is the querying user (the social part of the query), texRel(T_q, d_j) is a textual relevance model used to calculate the textual relevance between T_q and document d_j, and α_p is a parameter set by the querying user, assigning relative weights to the PerSocial and textual relevance values. In this approach, and using the above formula, the ranking is calculated using both the textual and the social features of the documents and the query. This is a one-step process with no filtering step.

5.3 Experimental Evaluation

In this section, we evaluate the effectiveness of PERSOSE using data from Facebook and Wikipedia. We first discuss the dataset, approaches and other settings used for the experiments, and then present the results.

Data. For a complete and accurate set of experiments, we need a dataset that contains the following data: 1) a large set of documents with textual information, 2) a link structure between documents, 3) real users with friendship relationships, and 4) social actions from users on documents. Unfortunately, no such dataset exists. As a result, we built such a dataset to be used in PERSOSE and to evaluate our approaches. As outlined in Section 5.1, two main data types are fed into PERSOSE. One is a set of documents and the other is the social data containing social actions from users as well as relationships among users. We used Wikipedia articles as our document set and Facebook as our social platform. We developed a web crawler to crawl around 14 million Wikipedia articles and extract textual information from those documents. While crawling, we also captured the relationships among documents and built a (partial) Wikipedia graph. In this graph, each node represents a Wikipedia article. Node d_1 has a directed edge to node d_2 if their Wikipedia articles are related to each other, either explicitly, when article d_1 has a link to article d_2, or implicitly, when article d_2 is mentioned several times by article d_1. The weight of each connection is based on the frequency and the position of the mentions of one article inside another. The total weight of all outgoing edges of each node of the graph always adds up to one.

As for the social data, we integrated PERSOSE with Facebook using Facebook Connect, hence allowing users to log in to PERSOSE using their Facebook account and information. When a user connects to PERSOSE, our system asks for permission to read and access the user's Facebook data. The Facebook data that our system reads includes the user's Facebook activities (e.g., STATUS, LIKES, PHOTOS) as well as the user's friendship information. We also read all public data from the user's friends.

Finally, we map users' Facebook activities to social actions on Wikipedia documents. In order to perform this step, we utilized the technology developed at GraphDive (graphdive.com) to link Facebook data to Wikipedia articles. With the GraphDive API, each Facebook activity/post (e.g., STATUS, CHECK-IN, LIKE) can be mapped to one or more Wikipedia articles. The GraphDive algorithm works as follows. The GraphDive API receives a Facebook post, parses the text into all possible word-level n-grams (1 ≤ n ≤ total number of words in the post) and then looks for a Wikipedia article with the same title for each n-gram. For instance, for a status update of "I love Los Angeles and Southern California", the GraphDive API will match the Wikipedia articles on Los Angeles, California, and Southern California to the post. There are other optimizations performed by the GraphDive API (e.g., disambiguation, varying weights for each n-gram) that are not the focus of this work. We only use the GraphDive API to map Facebook actions to Wikipedia articles, thereby generating a rich set of documents with social actions from real users.
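A simplified sketch of this n-gram mapping step follows; it only illustrates the title-matching core, while the real GraphDive API additionally performs disambiguation and per-n-gram weighting.

```python
# Simplified n-gram-to-article-title matching; names are illustrative.

def map_post_to_articles(post: str, wikipedia_titles: set) -> set:
    """Return all word-level n-grams of the post that match an
    article title (case-insensitive)."""
    words = post.lower().split()
    matches = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            ngram = " ".join(words[i:j])
            if ngram in wikipedia_titles:
                matches.add(ngram)
    return matches

titles = {"los angeles", "california", "southern california"}
print(map_post_to_articles("I love Los Angeles and Southern California", titles))
# -> {'los angeles', 'california', 'southern california'}
```

The quadratic enumeration of n-grams is harmless for short posts; for longer texts, hashing n-grams against the title dictionary keeps each lookup constant-time.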
Actions. From the data that Facebook provides via its graph API (https://developers.facebook.com/tools/explorer/), we considered the following six actions: LIKE, CHECK-IN, STATUS, PHOTO, WORK and SCHOOL. LIKE is when a user likes a page/topic on Facebook or a document on the web. CHECK-IN is when a user checks in at a location using Facebook. STATUS is a free-format text usually describing the user's activities and feelings. PHOTO is the text associated with each photo a user uploads to Facebook. Finally, WORK and SCHOOL are information about the user's workplace and school/university, respectively. Each of the above six actions contains some textual content. As described above, using the GraphDive technology, we map that textual content to a set of Wikipedia articles - when possible. For instance, when a user checks in at Peet's Coffee & Tea, using GraphDive, we extract the action CHECK-IN between the user and the Wikipedia article on Peet's Coffee and Tea, between the user and the Wikipedia article on coffee, and between the user and the Wikipedia article on tea.

Approaches. We use the three main approaches described in Section 5.2.2 to generate the results: textual filtering, PerSocial ranking (TP); PerSocial filtering, textual ranking (PT); and PerSocial-textual ranking (HB, for hybrid). We also use a baseline approach called BS. The BS approach generates the results based on the combination of the tf-idf and PageRank models. The same baseline approach is used as the textual model in our proposed approaches (whenever a textual model is needed). The default setting is as follows. The social actions all have the same weight (equal to 0.5) and the number of results returned by each approach is 5. When using friends' data, we only access data from the top 25 friends of the user (ranked by their user relatedness score to the user). Also, all four approaches use expansion as described in Section 5.2.1.3. Finally, α_p is set to 0.7 for the HB approach (to give more importance to the social part of the search and hence evaluate the impact of social signals more thoroughly). In addition to the main approaches and the baseline approach, we also implemented the three levels of the PerSocial model on the hybrid approach to study and evaluate the impact of each level. The three variations are called HB-Level1, HB-Level2, and HB-Level3.
Queries. We generate two sets of queries for our experiments. The first set, called qset1, is generated from Google's top 100 queries in 2009. For each user, five queries are randomly selected from that list. The second set of queries, called qset2, is generated from each user's social data. With qset2, we randomly generate 5 queries from the user's own Facebook data (e.g., pages they liked, the city they live in, the school they attended). We believe qset2 is of higher quality, since these are queries that users are very familiar with and hence can understand and evaluate better. (For instance, a user living in Irvine, California can evaluate the results for the query Irvine, California very well.) Another benefit of choosing queries from a user's Facebook profile is a higher chance of having social actions from the user on the query topic. As a result, using qset2 provides us with a better evaluation of our system. Note that in the absence of any social signal, our approaches perform the same as the baseline approach and hence do not provide many new insights. For the above reasons, we only use qset1 for the first set of experiments (comparing the main approaches) and use qset2 for the other experiments.

Relevance Assessment. After computing the top-5 results for each of our queries using all approaches, we ran a user study using Amazon Mechanical Turk (mturk.com). One task (HIT) was generated for each query. We asked workers to log in to our experiment setting with their Facebook account via Facebook Connect (https://developers.facebook.com/docs/guides/web/); each volunteer allowed us to read/access his/her Facebook data for this experiment. For each query and for each worker, the top 5 results from all approaches were generated, mixed together (duplicates removed) and presented to the worker. Workers could mark each result (Wikipedia article) as very relevant, somehow relevant or not relevant. Workers were not aware of the different approaches and could not tell which result came from which approach. Moreover, for each query, we asked each worker to provide us with an ordered list of the top-5 most relevant documents (from the documents presented) based on his/her own preferences. We use this information to calculate nDCG for each query. Each task (query assessment) was assessed by 12 workers for query set 1 and 8 workers for query set 2. Each worker was rewarded $0.25 upon completion of each assessment.

User Relatedness. To capture the relatedness between two users, we used the total number of interactions between those users (on Facebook) as our metric. We retrieved and counted all the direct interactions (except private messages) between two users and used the normalized value of this number as the value of our user relatedness function. Although we could use simpler metrics such as the number of mutual friends, we believe that the number of actual interactions is a better representative of the relatedness/closeness of two Facebook users than the number of mutual friends between them.

Evaluation Metric. We evaluated the accuracy of the methods under comparison using the popular nDCG@k and precision@k metrics. When computing nDCG@k, we considered the ordered list of top-5 results entered by the user as the ideal ordering (ideal DCG). The relevance values used for very relevant, somehow relevant and not relevant are 2, 1, and 0, respectively. We calculate prec@k for two scenarios. For the first scenario, prec@k (rel), we considered the results evaluated as somehow relevant or very relevant as relevant. For the second scenario, prec@k (vrel), we only considered the results evaluated as very relevant as relevant. We calculated the final nDCG and precision values by averaging the nDCG and precision values over all queries.
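For reference, the two metrics can be sketched as follows, assuming graded judgments per ranked result (2 = very relevant, 1 = somehow relevant, 0 = not relevant); the handling of the user-supplied ideal ordering is simplified to a list of ideal grades.

```python
import math

def dcg(grades):
    """Discounted cumulative gain of a graded ranking."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(system_grades, ideal_grades, k=5):
    """nDCG@k against the (user-provided) ideal ordering."""
    ideal = dcg(ideal_grades[:k])
    return dcg(system_grades[:k]) / ideal if ideal > 0 else 0.0

def prec_at_k(grades, k=5, min_grade=1):
    """prec@k(rel) with min_grade=1; prec@k(vrel) with min_grade=2."""
    return sum(1 for g in grades[:k] if g >= min_grade) / k

grades = [2, 1, 0, 2, 1]
print(prec_at_k(grades))               # 0.8
print(prec_at_k(grades, min_grade=2))  # 0.4
print(round(ndcg_at_k(grades, sorted(grades, reverse=True)), 3))
```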
5.3.1 Main Approaches

In the first set of experiments, we evaluate the effectiveness of our three main approaches (rankers) and compare the results with the baseline approach. The results (prec@5(rel), prec@5(vrel) and nDCG@5) of the four approaches for the two query sets are shown in Tables 5.1 and 5.2, respectively.

Table 5.1: Main Approaches: qset1

Approach   prec@5(rel)   prec@5(vrel)   nDCG@5
BS         0.714         0.359          0.760
TP         0.630         0.329          0.652
PT         0.787         0.413          0.655
HB         0.760         0.420          0.815

Table 5.2: Main Approaches: qset2

Approach   prec@5(rel)   prec@5(vrel)   nDCG@5
BS         0.787         0.491          0.689
TP         0.856         0.626          0.806
PT         0.890         0.628          0.777
HB         0.846         0.590          0.792

The first observation is that for qset2, all our proposed approaches (TP, PT and HB) are noticeably better than the baseline (BS) approach. The second observation is that for qset1, while HB outperforms BS with regard to all three metrics, the other two social approaches are not as successful. Together, these two observations show that the hybrid (HB) approach is the most robust of the four approaches, as it is the only one that outperforms the baseline in all cases. We can also see that while the other two PerSocial approaches work quite well for some queries (queries for which users already have some related social actions), they may generate less accurate results for random/generic queries (although PT still outperforms BS for two of the three metrics). This shows that search PerSocialization works best for queries relevant to the querying user (queries such that the querying user has some social actions on documents relevant to those queries). The third observation is that for both query sets, the margin by which our PerSocial approaches (except TP on qset1) beat the baseline approach increases from prec@5(rel) to prec@5(vrel). This shows that if users are looking for very relevant results, our proposed approaches generate even better results.

5.3.2 PerSocial Relevance Levels

In this set of experiments, we evaluate and compare the results generated from the three levels of PerSocial relevance with each other and also with the baseline approach. We use HB as our PerSocial ranker and qset2 as the query set. The results for the three levels and the BS approach are shown in Table 5.3.

Table 5.3: Levels

Approach    prec@5(rel)   prec@5(vrel)   nDCG@5
BS          0.787         0.491          0.689
HB-Level1   0.787         0.506          0.730
HB-Level2   0.809         0.548          0.744
HB-Level3   0.846         0.590          0.792

The first observation is that all three levels generate more accurate results than the baseline approach with regard to all three metrics (or equally accurate, in the case of level 1 and the prec@5(rel) metric). This not only confirms that our final proposed approach (level 3) generates more accurate results than the baseline, but also shows that even applying one or two levels of our PerSocial model can improve the search results. The second observation is that, with regard to all three metrics, each level improves the accuracy of the search results in comparison to the previous level. As we discussed earlier, each level is built on top of the previous level and complements it by adding more social signals to the PerSocial relevance model. In other words, this set of experiments supports our hypothesis and shows that 1) social actions improve the search results, 2) using friends' social signals further improves the accuracy of the results, and 3) social expansion also adds to the accuracy of search personalization. Overall, applying all three levels to the baseline approach improves both the precision and the nDCG of the results significantly. Metrics prec@5(rel) and prec@5(vrel) improve from 0.78 and 0.49 to 0.84 and 0.59 (6% and 20% improvements), respectively.
Also, the final ordering of the results in comparison to the ideal ordering (nDCG@5) improves significantly as well, from 0.68 to 0.79 (a 16% improvement).

5.3.3 Friends vs. User

In this set of experiments, we compare the impact of using social data from friends only, from the (querying) user only, or from a combination of both on our proposed model. We developed two new variations of HB, called HB-UO (User Only) and HB-FO (Friends Only), and compare them with each other and also with the original HB. Again, qset2 is used and social expansion is enabled for all the approaches. The results for the three approaches are shown in Table 5.4.

Table 5.4: User-only vs. Friends-only

Approach   prec@5(rel)   prec@5(vrel)   nDCG@5
HB         0.846         0.590          0.792
HB-UO      0.823         0.545          0.778
HB-FO      0.831         0.582          0.777

The first and important observation is that the friends-only approach generates results as effective as or even better than those of the user-only approach. This further proves the point that friends' interests and preferences are very similar to the user's own interests and preferences. This finding encourages using friends' actions in the search and ranking process. The second observation from Table 5.4 is that HB is the best approach among all three (reconfirming the observation that level 2 results are better than level 1 results). As we also saw earlier (for the non-expanded case), mixing data from both the querying user and his friends generates the most accurate results.

5.3.4 Number of Friends

In this set of experiments, we evaluate the impact of the number of friends of the querying user on the accuracy of the results. We categorize users based on their number of friends into three groups: popular, semi-popular and non-popular. Non-popular users are those with fewer than 200 friends (between 50 and 200). Semi-popular users are those with more than 200 and fewer than 500 friends. Finally, popular users are those with more than 500 friends (the largest number of friends among our workers is 1,312). We present the results for the three groups in Table 5.5.

Table 5.5: Number of Friends

Number of Friends   prec@5(rel)   prec@5(vrel)   nDCG@5
popular             0.889         0.626          0.826
semi-popular        0.821         0.564          0.782
non-popular         0.780         0.540          0.733

The main observation is that the accuracy of the results is directly correlated with the number of friends of the querying user. The non-popular group generates the least accurate results; this is expected, since not many social signals from friends, and perhaps even from the user himself (users with fewer friends tend to be less active on their social network), are available to influence the search. The popular group generates the most accurate results, and the semi-popular group is in between. This observation shows that the larger the amount of data from a user's friends, the more accurate the PerSocial relevance scores for that user, and hence the better the results generated for that user.

To summarize, the main observations derived from our experimental evaluation are:

• Each level of our PerSocial model improves the accuracy of the search results compared to the previous level. All levels generate more accurate results than the baseline approach.

• For qset2, all three proposed approaches generate more precise results and a better ranking than the baseline approach.
• For qset1, our proposed HB approach generates more accurate results than the baseline approach (for all three metrics), while the results of the other two approaches vary.

• Results generated from users' friends' social data alone are as good as (if not better than) the results generated from the user's own social actions. The best results are achieved when combining the user's own and friends' social data.

• The accuracy of the results for each user is directly correlated with the number of friends of that user.

Chapter 6

Conclusions

In this dissertation, we showed how to efficiently and effectively combine textual web search with the spatial, temporal and social aspects of web documents.

First, in Chapter 3, we introduced the problem of ranking the spatial and textual features of web documents. We proposed new scoring methods to rank documents by seamlessly combining their spatial and textual features. We also proposed an efficient index structure which handles the spatial and textual features of the data simultaneously and also supports spatial-keyword relevance ranking. In particular, we introduced SKIF and showed how it is used to search and rank documents efficiently. We studied our methods experimentally, demonstrating their superior performance and accuracy.

Furthermore, in Chapter 4, we introduced the problem of temporal-textual retrieval and proposed a complete framework with both effective ranking and efficient indexing of the temporal and textual features of (web) documents. We proposed a baseline approach and a variety of hybrid index structures to exploit popular textual and temporal index structures (i.e., the inverted file and the interval tree). We proposed a novel index structure called T²I² that handles the temporal and textual features of the data efficiently and in a unified manner. Using T²I², we showed how query processing is performed efficiently and also discussed how to extend T²I² to more general cases. We evaluated our proposed approaches experimentally and analytically and showed the high efficiency of T²I².

Finally, in Chapter 5, we introduced a novel way to personalize web search using users' social actions - dubbed PerSocialized search. With PerSocialized search, we showed how social actions are relevant and useful for improving the quality of the search results. We proposed a model called the PerSocial relevance model to incorporate three levels of social signals into the search process. In level 1, we showed how to utilize the user's own social actions to improve the search results. With level 2, we added social data from the user's friends to the proposed model. Finally, in level 3, we proposed social expansion to extend the effect of social actions to more documents. Using the PerSocial relevance model, we proposed three ranking approaches to combine existing textual relevance models with the PerSocial relevance model. Furthermore, we developed a system called PERSOSE as a prototype search engine capable of performing PerSocialized search. Employing PERSOSE, we conducted an extensive set of experiments using real documents from Wikipedia and real users and social properties from Facebook. With several sets of experiments, we showed how the different levels of our PerSocial model improve the accuracy of search results. We also evaluated the proposed ranking functions and compared them with each other and with a baseline approach.
As part of future work, one main direction is to develop a unified framework to combine all three dimensions - spatial, temporal and social - together and integrate all three with textual web search using one ranking model and one hybrid index structure. We believe that, using our proposed ranking (relevance) models, one hybrid model can be developed and used to calculate the relevance of each query to each document considering all four dimensions (textual, spatial, temporal and social). Also, since our proposed index structures for spatial-keyword and temporal-textual search (SKIF and T²I², respectively) are both variations of the inverted file index structure, it will be seamless to combine those two index structures, as well as the regular inverted index, and hence to develop one hybrid index structure for the textual, spatial and temporal aspects of web documents.

Finally, as for the social dimension, we believe that we have defined the overall framework needed for PerSocialized search. By design, and whenever possible, we allowed for different implementations of the proposed methods. This enables easier customization as well as optimization of PERSOSE for different settings and applications. For any given method, finding the best variation/implementation for a given context is a general and orthogonal research topic that can and should be pursued by experts in that specific context (e.g., optimal user influence or action weight values should be determined for a given application by experts on that application), and can be another direction for future work.

References

[ABE09] Irem Arikan, Klaus Berberich, and Anthony Eden. Time will tell: Leveraging temporal expressions in IR. 2009.

[ABL10] Sattam Alsubaiee, Alexander Behm, and Chen Li. Supporting location-based approximate-keyword queries. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '10, pages 61–70, New York, NY, USA, 2010. ACM.

[AG06] Omar Alonso and Michael Gertz. Clustering of search results using temporal attributes. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 597–598. ACM, 2006.

[AGBY07] Omar Alonso, Michael Gertz, and Ricardo Baeza-Yates. On the value of temporal information in information retrieval. SIGIR Forum, 41:35–41, December 2007.

[AHSS04] Einat Amitay, Nadav Har'El, Ron Sivan, and Aya Soffer. Web-a-where: geotagging web content. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '04, pages 273–280, New York, NY, USA, 2004. ACM.

[All81] James F. Allen. An interval-based representation of temporal knowledge. In Proc. 7th International Joint Conference on Artificial Intelligence, Vancouver, Canada, pages 221–226, 1981.

[BBAW10] Klaus Berberich, Srikanta Bedathur, Omar Alonso, and Gerhard Weikum. A language modeling approach for temporal information needs. pages 13–25, 2010.

[BBC12] Alessandro Bozzon, Marco Brambilla, and Stefano Ceri. Answering search queries with crowdsearcher. WWW '12, 2012.

[BBNW] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, and Gerhard Weikum. pages 1414–1417.

[BBNW07] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, and Gerhard Weikum. A time machine for text search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, page 519, 2007.
[BY05] Ricardo Baeza-Yates. Searching the future. In SIGIR Workshop MF/IR, 2005.

[BYRN99] Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[CCJ10] Xin Cao, Gao Cong, and Christian S. Jensen. Retrieving top-k prestige-based relevant spatial web objects. Proc. VLDB Endow., 3:373–384, September 2010.

[CJW09] Gao Cong, Christian S. Jensen, and Dingming Wu. Efficient retrieval of the top-k most relevant spatial web objects. Proc. VLDB Endow., 2:337–348, August 2009.

[CNPK05] Paul Alexandru Chirita, Wolfgang Nejdl, Raluca Paiu, and Christian Kohlschütter. Using ODP metadata to personalize search. SIGIR '05, 2005.

[Cor05] G. M. Del Corso. Ranking a stream of news. In WWW, 2005.

[CS06] Edith Cohen and Martin J. Strauss. Maintaining time-decaying stream aggregates. Journal of Algorithms, 59(1):19–36, 2006.

[CSM06] Yen-Yu Chen, Torsten Suel, and Alexander Markowetz. Efficient query processing in geographic web search engines. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, SIGMOD '06, pages 277–288, New York, NY, USA, 2006. ACM.

[CSSX09a] Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu. Forward decay: A practical time decay model for streaming systems. In Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on, 2009.

[CSSX09b] Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu. Forward decay: A practical time decay model for streaming systems. In Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on, pages 138–149. IEEE, 2009.

[CZG+09] David Carmel, Naama Zwerdling, Ido Guy, Shila Ofek-Koifman, Nadav Har'el, Inbal Ronen, Erel Uziel, Sivan Yogev, and Sergey Chernov. Personalized social search based on the user's social network. CIKM '09, 2009.

[DD10] Na Dai and Brian D. Davison. Freshness matters: in flowers, food, and web authority. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 114–121. ACM, 2010.

[DFHR08] Ian De Felipe, Vagelis Hristidis, and Naphtali Rishe. Keyword search on spatial databases. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 656–665, Washington, DC, USA, 2008. IEEE Computer Society.

[DG08] Wisam Dakka and Luis Gravano. Answering general time-sensitive queries. pages 3–4, 2008.

[DGS00] Junyan Ding, Luis Gravano, and Narayanan Shivakumar. Computing geographical scopes of web resources. In Proceedings of the 26th International Conference on Very Large Data Bases, VLDB '00, pages 545–556, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[DSD11] Na Dai, Milad Shokouhi, and Brian D. Davison. Learning to rank for freshness and relevance. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 95–104. ACM, 2011.

[EC08] Brynn M. Evans and Ed H. Chi. Towards a model of understanding social search. In Proceedings of the 2008 ACM conference on Computer supported cooperative work, pages 485–494. ACM, 2008.

[EG11] Miles Efron and Gene Golovchinsky. Estimation methods for ranking recent information. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 495–504. ACM, 2011.
[EKP09] Brynn M. Evans, Sanjay Kairam, and Peter Pirolli. Exploring the cognitive consequences of social search. In CHI '09 Extended Abstracts on Human Factors in Computing Systems, pages 3377–3382. ACM, 2009.

[FLN03] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66:614–656, June 2003.

[Fre79] L. Freeman. Centrality in social networks: conceptual clarification. Social Networks, 1(3):215–239, 1979.

[GCF09] Antonio Gulli, Stefano Cataudella, and Luca Foschini. TC-SocialRank: Ranking the social web. WAW '09, 2009.

[GCK+10] Liang Gou, Hung-Hsuan Chen, Jung-Hyun Kim, Xiaolong (Luke) Zhang, and C. Lee Giles. SNDocRank: a social network-based video search ranking framework. MIR '10, 2010.

[GLM06] Weizheng Gao, Hyun Chul Lee, and Yingbo Miao. Geographically focused collaborative crawling. In Proceedings of the 15th international conference on World Wide Web, WWW '06, pages 287–296, New York, NY, USA, 2006. ACM.

[GZC+10] Liang Gou, Xiaolong (Luke) Zhang, Hung-Hsuan Chen, Jung-Hyun Kim, and C. Lee Giles. Social network document ranking. JCDL '10, 2010.

[Hav03] T. H. Haveliwala. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. Knowledge and Data Engineering, IEEE Transactions on, 15(4):784–796, July-August 2003.

[HHLM07] Ramaswamy Hariharan, Bijit Hore, Chen Li, and Sharad Mehrotra. Processing spatial-keyword (SK) queries in geographic information retrieval (GIR) systems. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, SSDBM '07, Washington, DC, USA, 2007. IEEE Computer Society.

[HJSS06] Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. Information retrieval in folksonomies: search and ranking. ESWC '06, 2006.

[HK10] Damon Horowitz and Sepandar D. Kamvar. The anatomy of a large-scale social search engine. WWW '10, 2010.

[HLY07] Michael Herscovici, Ronny Lempel, and Sivan Yogev. Efficient indexing of versioned document sequences. In Advances in Information Retrieval, pages 76–87. Springer, 2007.

[JCZ+11] Peiquan Jin, Hong Chen, Xujian Zhao, Xiaowen Li, and Lihua Yue. Indexing temporal information for web pages. Computer Science and Information Systems/ComSIS, 8(3):711–737, 2011.

[JLZW08] Peiquan Jin, Jianlong Lian, Xujian Zhao, and Shouhong Wan. TISE: A temporal search engine for web contents. In Second International Symposium on Intelligent Information Technology Application, pages 220–224, 2008.

[KC05] Pawel Jan Kalczynski and Amy Chou. Temporal document retrieval model for business news archives. Information Processing & Management, 41(3):635–650, 2005.

[KSJ09] Ioannis Konstas, Vassilios Stathopoulos, and Joemon M. Jose. On social networks and collaborative recommendation. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pages 195–202, New York, NY, USA, 2009. ACM.

[LC03] Xiaoyan Li and W. Bruce Croft. Time-based language models. In Proceedings of the twelfth international conference on Information and knowledge management, pages 469–475. ACM, 2003.

[LHMBB10] Leong Hou U, Nikos Mamoulis, Klaus Berberich, and Srikanta Bedathur. Durable top-k search in document archives. In Proceedings of the 2010 international conference on Management of data, 2010.

[lin] Lingua::EN::Tagger.

[McC01] Kevin S. McCurley. Geospatial mapping and navigation of the web. In Proceedings of the 10th international conference on World Wide Web, WWW '01, pages 221–229, New York, NY, USA, 2001. ACM.
[MRS08] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[MS] M. McDonnell and A. Shiri. Social search: A taxonomy of, and a user-centred approach to, social web search. Program: electronic library and information systems, 45(1).

[MW00] Inderjeet Mani and George Wilson. Robust temporal processing of news. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 69–76. Association for Computational Linguistics, 2000.

[NM07] Michael G. Noll and Christoph Meinel. Web search personalization via social bookmarking and tagging. ISWC '07/ASWC '07, 2007.

[NN06] Kjetil Nørvåg and A. O. Nybø. DyST: dynamic and scalable temporal text indexing. In Temporal Representation and Reasoning, 2006. TIME 2006. Thirteenth International Symposium on, pages 204–211. IEEE, 2006.

[Pas08] Marius Pasca. Towards temporal web search. In Proceedings of the 2008 ACM symposium on Applied computing, pages 1117–1121. ACM, 2008.

[PBMW99] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: bringing order to the web. 1999.

[PS85] F. P. Preparata and Michael Ian Shamos. Computational Geometry: An Introduction. 1985.

[RKES11] Majdi Rawashdeh, Heung-Nam Kim, and Abdulmotaleb El Saddik. Folksonomy-boosted social media search and ranking. ICMR '11, 2011.

[RW94] Stephen E. Robertson and Steve Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pages 232–241. Springer-Verlag New York, Inc., 1994.

[SB97] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Readings in Information Retrieval, pages 323–328. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[SBCO09] Barry Smyth, Peter Briggs, Maurice Coyle, and Michael O'Mahony. Google shared. A case-study in social search. UMAP '09, 2009.

[SCK+08] Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane X. Parreira, and Gerhard Weikum. Efficient top-k querying over social-tagging networks. SIGIR '08, 2008.

[SG05] Craig A. N. Soules and Gregory R. Ganger. Connections: using context to enhance file search. SOSP '05, 2005.

[tim] TimeML specification language.

[Tob70] W. R. Tobler. A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(2):234–240, 1970.

[VJJS] Subodh Vaid, Christopher Jones, Hideo Joho, and Mark Sanderson. Spatio-textual indexing for geographical search on the web. In Claudia Bauzer Medeiros, Max Egenhofer, and Elisa Bertino, editors, Advances in Spatial and Temporal Databases, volume 3633 of Lecture Notes in Computer Science, pages 923–923. Springer Berlin / Heidelberg.

[VK09] Riina Vuorikari and Rob Koper. Ecology of social search for learning resources. Campus-Wide Information Systems, 26(4):272–286, 2009.

[VM09] Marc Verhagen and Jessica L. Moszkowicz. Temporal annotation and representation. Language and Linguistics Compass, 3(2):517–536, 2009.

[VMS+05] Marc Verhagen, Inderjeet Mani, Roser Sauri, Robert Knippen, Seok Bae Jang, Jessica Littman, Anna Rumshisky, John Phillips, and James Pustejovsky. Automating temporal annotation with TARSQI. In Proceedings of the ACL 2005 on Interactive poster and demonstration sessions, pages 81–84. Association for Computational Linguistics, 2005.
[WDL09] Zhijun Wang, Ming Du, and Jiajin Le. gR*-tree: An index for querying approximate keywords in geographic information system. In Information Engineering and Computer Science, 2009. ICIECS 2009. International Conference on, pages 1–4, 2009.

[We05] K. F. Wong et al. An overview of temporal information extraction. Int. J. Comput. Proc. Oriental Lang., 2005.

[WJ10] Qihua Wang and Hongxia Jin. Exploring online social activities for adaptive search personalization. CIKM '10, 2010.

[XBF+08] Shengliang Xu, Shenghua Bao, Ben Fei, Zhong Su, and Yong Yu. Exploring folksonomy for personalized search. SIGIR '08, 2008.

[YBLS08] Sihem Amer Yahia, Michael Benedikt, Laks V. S. Lakshmanan, and Julia Stoyanovich. Efficient network aware search in collaborative tagging sites. Proc. VLDB Endow., 1(1), August 2008.

[YLL10] Peifeng Yin, Wang-Chien Lee, and Ken C. K. Lee. On top-k social web search. CIKM '10, 2010.

[ZCM+09] Dongxiang Zhang, Yeow Meng Chee, A. Mondal, A. Tung, and M. Kitsuregawa. Keyword search in spatial databases: Towards searching by document. In Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on, 2009.

[ZM95] Justin Zobel and Alistair Moffat. Adding compression to a full-text retrieval system. Softw. Pract. Exper., 25:891–903, August 1995.

[ZM06] Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Comput. Surv., 38, July 2006.

[ZXW+05] Yinghua Zhou, Xing Xie, Chuang Wang, Yuchang Gong, and Wei-Ying Ma. Hybrid index structures for location-based web search. In Proceedings of the 14th ACM international conference on Information and knowledge management, CIKM '05, pages 155–162, New York, NY, USA, 2005. ACM.
Abstract
Over the last few years, the Web has changed significantly. The emergence of Web 2.0 has enabled people to interact with web documents in new ways not possible before. It is now common practice to geo-tag or time-tag web documents, or to integrate web documents with popular social networks. With these changes and the abundant usage of spatial, temporal and social information in web documents and search queries, the necessity of integrating such non-textual aspects of the web into regular textual web search has grown rapidly over the past few years.

To integrate each of those non-textual dimensions into textual web search and to enable spatial-textual, temporal-textual and social-textual web search, in this dissertation we propose a set of new relevance models, index structures and algorithms specifically designed for adding each non-textual dimension (spatial, temporal and social) to the current state of (textual) web search. First, we propose a new ranking model and a hybrid index structure called the Spatial-Keyword Inverted File to handle location-based ranking and indexing of web documents in an integrated and efficient manner. Second, we propose a new indexing and ranking framework for temporal-textual retrieval. The framework leverages the classical vector space model and provides a complete scheme for indexing, query processing and ranking of temporal-textual queries. Finally, we show how to personalize the search results based on users' social actions. We propose a new relevance model called the PerSocial relevance model, utilizing three levels of social signals to improve web search, and we develop several approaches to integrate the PerSocial relevance model into the textual web search process.