PROBABILISTIC FRAMEWORK FOR MINING KNOWLEDGE FROM GEOREFERENCED SOCIAL ANNOTATION

by

Suradej Intagorn

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2015

Copyright 2015 Suradej Intagorn

Acknowledgements

My most sincere thanks and gratitude go to my advisor and mentor, Professor Kristina Lerman, both for her seemingly infinite support and for being an excellent advisor. It has been a great pleasure to be advised by her. Without her help and guidance, it would have been impossible for me to reach this point.

I also thank the other members of my proposal and dissertation committees, including Professors Aiichiro Nakano, Craig Knoblock, Greg Ver Steeg, John Wilson, and Shahram Ghandeharizadeh. They have provided me with their insight and suggestions, which are precious to me. I also want to thank my senior, Anon Plangprasopchok, for his useful suggestions and for helping me in many ways. I also want to thank my friends at the University of Southern California and the Information Sciences Institute for their friendship and support.

I would like to thank the Royal Thai Government for providing the financial support for six years of my graduate studies at the University of Southern California. Finally, I would like to express my eternal gratitude to my parents for their everlasting love and support.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Overview of Data
1.2 Motivation
1.3 Challenges
1.3.1 Noisy and Ambiguous Data
1.3.2 Discretization and Scale Effect Problem
1.4 Research Questions
1.5 Outline of the Dissertation

2 Background and Related Work
2.1 Location Prediction
2.2 Location Prediction Error Estimation
2.3 Place Identification
2.4 Learning Place Relations and Folksonomies

3 A Parametric Approach to Modeling Geospatial Data
3.1 Introduction
3.2 Probabilistic Framework for Mining Geospatial Data
3.2.1 Tag Distribution as a Mixture of Gaussians

4 Mining Geospatial Knowledge
4.1 Applications
4.1.1 Data Collection
4.1.2 Place Identification
4.1.3 Location Prediction
4.1.4 Learning Relations between Places
4.1.5 Learning Folksonomies

5 A Non-Parametric Approach to Modeling Geospatial Data
5.1 Prediction Confidence and Error
5.1.1 Computing Confidence from Samples
5.1.2 Computing Confidence from PDF
5.1.3 Prediction Error
5.1.4 Probability Density Estimation
5.1.5 Adaptive Resolution Prediction
5.2 Model Refinements
5.2.1 Feature Selection
5.2.2 Bandwidth Selection
5.2.3 Weighted Confidence Prediction
5.3 Evaluation
5.3.1 Location Prediction Baseline
5.3.2 Error Estimation Baseline
5.3.3 Result Ranking Baseline
5.3.4 Placing Task Results

6 Improving Location Prediction with Temporal Constraints
6.1 Improving Accuracy with Spatial Entropy
6.1.1 Spatial Entropy using GGMM
6.1.2 Baseline with Feature Selection
6.1.3 Data Collection and Processing
6.1.4 Evaluation
6.2 Improving Accuracy with Mobility Constraints
6.2.1 Methodology
6.2.2 Evaluation

7 Conclusion

Reference List

List of Tables

4.1 Comparison of the model-based approach (gmm) to baseline on the place identification task. AUC is the area under the precision-recall curve. Max F1 is the maximum value of the F1 score. Min CE is the minimum classification error rate. Precision, recall, F1 score, and classification error can be computed from the formulations below.
4.2 Comparison of the maximum F-score of the model-based approach (gmm) to baseline on the relation induction task.
5.1 Evaluating the performance of three different methods (normal confidence, weighted confidence, and the variation method) on the predicted error estimation task. The Kendall-Tau correlation and accuracy metrics are used to evaluate performance.
5.2 Comparing the performance of four different methods (normal confidence, weighted confidence, variation, and probability ratio) on the result ranking task. All accuracies are reported at the error level of 100 km. The Acc column reports the accuracy before filtering. The Acc75% column reports the accuracy after filtering 25% of results.
6.1 Median and average errors, in km, of tweet locations predicted by the proposed method and baseline at levels of feature selection given by the resulting recall.
6.2 Comparison of the prediction errors (in km) of the proposed GGMM approach with entropy and mobility constraints, the basic GGMM, and baseline.

List of Figures

1.1 Portion of the world-wide distribution of (a) photos labeled with the tag `california' on Flickr and (b) tweets containing the term `california'.
3.1 Portion of the world-wide distribution of photos labeled with the tags (a) `california' and (b) `socal'.
4.1 Comparison of performance of different methods in terms of the error between the photos' predicted and actual locations. (a) Distribution of errors in the test set. (b) Cumulative distribution function of errors produced by the proposed method (gmm) and baseline.
4.2 Using tag entropy to improve location prediction from tags. (a) Scatter plot of the prediction error (in units of 100 km) vs. entropy of the photo's representative (lowest entropy) tag. (b) CDF of prediction errors after filtering out photos whose tags are not well localized. Each line corresponds to a different filtering threshold, i.e., the lowest entropy value for the most localized tag in the photo.
4.3 Folksonomies learned by the greedy approach for two different seed terms.
5.1 Locations of Flickr photos tagged with terms (a) `california' and (b) `hollywood'.
5.2 Illustrations of confidence estimation from (a) samples and from (b) the probability density function.
5.3 Flow chart of the adaptive resolution prediction.
5.4 Bandwidth effect on estimated entropy. Distribution of 500 samples drawn from a Gaussian with (a) zero mean and variance five, and entropy 3.0, (b) zero mean and variance 0.5, with entropy 0.72. The histogram approximations of the distribution in (a) with different bandwidths h lead to different estimated entropies H: (c) using h = 0.1, we get H = 2.6; (e) using h = 1, the entropy is H = 2.9. The histogram approximations of the distribution in (b): (d) using h = 0.1 gives H = 0.66; (f) using h = 1, the entropy is H = 0.83.
5.5 Evaluating the effect of the bandwidth parameter h (0.01, 0.1, and 0.5) on the prediction error.
5.6 Evaluating the effect of the spatial resolution parameter r (0.01, 0.05, 0.1, 0.5, 1, and 5) on the prediction error.
5.7 Evaluating the performance of four different methods (normal, weighted, and adaptive confidence prediction, and an add-one smoothing Naive Bayes classifier) on the prediction error.
5.8 Evaluating the performance of four different methods (normal and weighted confidence prediction, variation, and probability ratio) on the result ranking task. All methods filter 25% of results according to their heuristics.
6.1 Comparison of spatial distributions of tweets containing (a) a low-entropy entity, `bostonloganinternationalairport', and (b) a high-entropy entity, `iphone'.
6.2 Scatter plot of the prediction error vs. entropy of the tweet's representative (lowest entropy) tag.
6.3 Using feature selection techniques to improve location prediction from tags. Each line corresponds to a different recall level. (a) CDF of prediction errors of GGMM after using spatial entropy to select localized terms. (b) CDF of prediction errors of the baseline after using spatial variation to select localized terms.
6.4 We employ a Hidden Markov Model to improve the recall of the location prediction task. A filled circle represents a tweet whose location can be predicted. An open circle represents a tweet with no well-localized words.
6.5 Cumulative distribution function of errors produced by combining GGMM, geospatial entropy, and mobility constraints. Percentage of filtered tweets gives the level of recall.
Abstract

Geography plays an important role in the evolution of online social networks and the creation of user-generated content, including photos, news articles, and short messages or tweets. Some of this content is automatically geotagged. Useful knowledge, such as place boundaries and relations between places, can be extracted from geotagged social media documents. Web applications could use this geographic knowledge to assist people with geospatial information retrieval, such as finding cheap hotels in a specific location, or local news in a specific area.

This dissertation describes a method for extracting geospatial knowledge from user-generated data. The method learns a geospatial model from data using statistical estimation techniques and then applies the model in tasks that include place identification, geofolksonomy learning, and document geolocation. To that end, I make three contributions in the field of geospatial data mining and statistical geolocation.

First, I address the question of how to model the underlying processes that generate the locations of social media documents. I study alternate methods for estimating the models from data and discuss methods for approximating these models. This model approximation can be used to improve the quality of results in applications of interest such as location prediction. I also study the effects of statistical estimation methods on the quality of the extracted geospatial knowledge. Probability density estimation methods can be categorized into two main approaches: parametric and non-parametric. Parametric approaches estimate the common density f from samples by assuming that f belongs to a parametric family of functions, such as Gaussian or gamma functions. A non-parametric approach, on the other hand, makes no assumptions about the distribution of f. I discuss the challenges of both approaches and possible solutions to these challenges.

Second, I develop a probabilistic framework to extract geospatial knowledge from the learned models. The probabilistic framework is general and flexible and can be used in a variety of geospatial data mining applications. I evaluate its performance on three knowledge extraction tasks. In the first task, place identification, I classify terms as place names or not place names based on their spatial distributions. In the second task, location prediction, I attempt to predict the locations of documents such as photos or tweets given their terms. In the third task, I use the probabilistic framework to learn relations between terms.

Finally, I develop methods for error estimation of extracted geospatial knowledge, particularly for the location prediction task. When automatically geotagging documents, it is also useful to give the error associated with the predicted location. Error estimation can be used to rank the quality of prediction results. In some situations, we may improve prediction quality by sacrificing coverage; in other words, by filtering out documents whose locations we cannot precisely predict, we retain a small fraction of documents whose predicted locations we believe to be close to their actual locations.

This dissertation addresses some of the challenges of mining user-generated content. The success of the methods described in the dissertation suggests that social media can serve as a useful source of geospatial knowledge.
Chapter 1
Introduction

The information people create while organizing and using content and interacting with others on social media is called user-generated content (UGC). User-generated content includes text, such as short messages in tweets, photos, tags, which are the metadata or labels people use to describe content, relations that are used to hierarchically organize content, and geotags, which are the geographic coordinates attached to content. Although social metadata are freely generated and uncontrolled, they reflect how a community organizes knowledge, including geospatial knowledge. Geospatial knowledge can be automatically harvested from user-generated content to learn about places in the world.

There are advantages to leveraging social metadata to learn geospatial concepts and relations. Specifically, social metadata are distributed and dynamic in nature [Golder and Huberman, 2006], and are therefore more likely to represent current geospatial knowledge than formal ontologies created by groups of experts. More importantly, they are closer to the "common knowledge" shared by a community.

The advent of social media, which provides an unprecedented amount of data, promises to enable large-scale and automatic knowledge extraction, but only if we have an efficient computational framework. Thus, automatic knowledge extraction from UGC has received attention in recent years from both the academic community and industry. Data-driven approaches have been widely used due to the massive size of these data. The variety of social websites on which users create UGC requires an efficient and flexible framework that integrates different forms of social data to achieve high accuracy in automatic knowledge extraction.

In this dissertation, I develop such a framework for geospatial knowledge extraction from UGC. The framework relies on estimating the probability density function of the spatial distribution of terms. There are two main approaches to probability density estimation: parametric and non-parametric. Both approaches have advantages and disadvantages. Parametric methods assume that the common density f belongs to a specific family of functions, which then allows maximum likelihood estimation to be used to estimate the parameters of the density function. If we know that the data come from a specific model, then parametric density estimation will usually perform well. However, this method can also significantly bias conclusions if the wrong model is used. In contrast, non-parametric estimation makes fewer assumptions about the data, and thus will generally provide better estimates in situations where the true distribution is not known or cannot be easily approximated.

The proposed framework is general and flexible. Once the probability density function f is estimated, it can be used in a number of geospatial applications. For example, we can learn place names by looking for terms whose spatial density functions are highly localized. To predict the location of a document with terms W, we look for the location that maximizes the value of f. Finally, we can learn relations between places represented by terms $w_1$ and $w_2$ simply by comparing their density functions. We can then arrange the learned relations within a taxonomy of places.

The proposed framework is quantitatively more powerful than competing methods.
For example, the placing task competition recently challenged researchers to estimate the errors of predicted locations, what is referred to as the placeability of a document [Hauff, Thomee, and Trevisiol, 2013]. Knowing the error allows for eliminating low-quality predictions, so as to improve the accuracy of the remaining predictions. The proposed framework allows me to estimate the prediction error along with its confidence.

Figure 1.1: Portion of the world-wide distribution of (a) photos labeled with the tag `california' on Flickr and (b) tweets containing the term `california'.

1.1 Overview of Data

This dissertation uses data from the social photo-sharing site Flickr and the microblogging site Twitter. We used the Flickr API to retrieve information about more than 14 million geotagged photos created by 260K distinct users. These photos were tagged over 89 million times with 3 million unique tags. Following Rattenbury and Naaman [2009b], we represent each photo p as a tuple {id_p, u_p, l_p, T_p}, where id_p is the site-specific id of the photo, u_p is the id of the user who created it, l_p represents the photo's location as a lat-long pair, and T_p is the set of tags in the photo. The photo has only one location and one user, but it may have multiple tags.

On Twitter, we used twitter4j, a Java library for the Twitter API, to collect over 1.8 million short messages, or tweets. These tweets had geographic coordinates attached to them. Figure 1.1 compares the spatial distributions of Flickr photos and tweets containing the term `california'. The distributions look different: tweets are concentrated in urban areas, while photos are more broadly distributed, including around sparsely populated landmarks and parks. Despite these differences, both distributions roughly suggest California's shape. The framework proposed in this dissertation will leverage these spatial distributions to extract geospatial knowledge.

1.2 Motivation

The main motivation behind this research is to study the underlying processes that generate the locations of social media documents and to develop a framework to extract geospatial knowledge. Traditional gazetteers work well in several applications, such as geotagging news articles or webpages. However, they do not work well for social media content, such as Flickr photos and tweets, because of two limitations.

First, gazetteers are usually manually maintained by domain experts. While this leads to high-quality data, the high cost of maintenance leads to limited and outdated coverage of the information contained in them. Second, the locations of vernacular places often cannot be found in traditional gazetteers. Vernacular places, such as `socal', also have vague boundaries. It is not clear precisely where Southern California is, because there is a variety of definitions of what constitutes Southern California. However, vernacular names are often mentioned in social media. Thus, gazetteer-based approaches may fail to recognize place names in social media documents.

Our framework can be used to extract geospatial knowledge from social media to enrich existing expert-created gazetteers. The new knowledge can be integrated into gazetteers to address the vernacular place challenge in gazetteer-based location prediction. Although our framework works very well when localized terms are present, it fails to accurately georeference tweets when users do not specify localized terms in them.
Place terms are terms that exhibit spatial usage patterns that are significantly geographically localized [Rattenbury and Naaman, 2009b].

1.3 Challenges

1.3.1 Noisy and Ambiguous Data

Acquiring accurate geospatial knowledge from evidence contributed by many different people presents a number of challenges. Individuals vary in their level of expertise, experience, expressiveness, and their enthusiasm for creating and annotating content. As a result, social data are sparse, ambiguous, and very noisy. Some of the noise is the result of geo-coding errors, or mistakes people make when tagging images. Other sources of noise are more subtle, resulting from convention or convenience. For example, users may tag a batch of vacation photos "California", forgetting about a side trip to Las Vegas, Nevada.

In location prediction, noisy and ambiguous data are the main challenge for a gazetteer approach. For example, the term `victoria' could refer to a place in Canada, Australia, or half a dozen other places called Victoria. It is also a popular female name, leading to this tag cropping up in documents taken all over the world. The key challenge is that a traditional gazetteer may lack the information needed to disambiguate a tag such as `victoria'. An alternate approach is statistical. It uses probability to deal with uncertainty and ambiguity. However, the statistical approach presents its own challenges in the probability density estimation process.

In the place identification problem, the general idea is to capture the spatial patterns of terms. Based on the spatial distribution of each term's usage, we can determine whether the term has a coherent place semantic [Rattenbury and Naaman, 2009b, Backstrom, Kleinberg, Kumar, and Novak, 2008]. Place terms generally have low variance in their spatial distributions. However, this assumption may not hold, because noisy and ambiguous data may increase the variance of the spatial distributions. Thus, handling noise and ambiguity is necessary for extracting place semantics from terms.

For learning relations between places, boundary estimation is generally required. Once the boundaries are known, it is easy to learn a `part-of' relation by checking whether the area associated with one place subsumes another place. Noisy data can significantly distort the shape of a boundary, which leads to incorrect relation learning. Thus, it is necessary to handle noisy data when learning relations between places.

1.3.2 Discretization and Scale Effect Problem

A scale parameter is generally required for estimating the geographic probability distribution of terms. Serdyukov et al. use the scale parameter to divide the world into grids and then estimate the probability distribution of terms from these grids [Serdyukov, Murdock, and Van Zwol, 2009]. The bandwidth parameter is selected by the scale of interest for the mean-shift algorithm [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b]. Both works show that classification error is affected by scale parameter selection.

When we change the scale of the aggregation unit, for example from 100 kilometers to 1000 kilometers, we obtain different results when the same analysis is applied to the same data. This is known as the scale effect problem. It is one of two forms of the modifiable areal unit problem (MAUP). The issue was discovered in 1934 and described in detail by [Openshaw, 1983]. It is challenging to manually determine the scale parameter, as fine-scale data may cause fragmentation and coarse-scale data may cause loss of accuracy [Kelm, Murdock, Schmiedeke, Schockaert, Serdyukov, and Van Laere, 2013]. Several researchers have proposed methodologies to address this, such as the multi-scale analysis in [Rattenbury and Naaman, 2009b].

The scale effect problem is also related to the discretization problem. For a Naive Bayes classifier, numeric features are often preprocessed into categories by discretization. Several researchers have used Naive Bayes for georeferencing problems [Serdyukov, Murdock, and Van Zwol, 2009, Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b]. Some chose the scale parameter manually [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b], while others [Serdyukov, Murdock, and Van Zwol, 2009] experimented with varying scale parameters to obtain different results. Yang and Webb surveyed nine discretization methods for Naive Bayes [Yang and Webb, 2002]. Each method has different strengths and weaknesses. Classification error is one of the metrics used to evaluate the performance of discretization methods [Yang and Webb, 2002].
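To make the scale effect concrete, here is a minimal sketch (not from the dissertation): it aggregates the same synthetic coordinates on a 100 km grid and a 1000 km grid and reports how concentrated the samples look at each scale. The sample locations and the crude 111-km-per-degree conversion are illustrative assumptions.

```python
import numpy as np

def cell_counts(lats, lons, cell_km):
    """Count points per grid cell; cell size given in km (1 degree is roughly 111 km)."""
    deg = cell_km / 111.0
    cells = np.stack([np.floor(lats / deg), np.floor(lons / deg)], axis=1)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    return counts

rng = np.random.default_rng(0)
# synthetic 'socal'-like samples spread over a few hundred kilometers
lats = rng.normal(34.0, 1.0, 1000)
lons = rng.normal(-118.0, 1.0, 1000)

for cell_km in (100, 1000):
    counts = cell_counts(lats, lons, cell_km)
    top = counts.max() / counts.sum()
    print(f"{cell_km:4d} km grid: {len(counts):3d} occupied cells, "
          f"{top:.0%} of samples in the largest cell")
# At 1000 km nearly everything falls into one cell (looks 'localized');
# at 100 km the same samples spread over many cells, so any rule based on
# the aggregated counts reaches a different conclusion.
```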
1.4 Research Questions

This dissertation addresses the problem of geospatial knowledge extraction from social data. Specifically, the research questions are:

RQ1. How can we model the underlying processes that generate the locations of social media documents and their terms from observed evidence, and what statistical estimation methods can best learn models from data?

RQ2. How do we use the learned models to extract geospatial knowledge?

RQ3. How can statistical methods be used to estimate, control, and improve the quality of extracted geospatial knowledge?

1.5 Outline of the Dissertation

The outline of the dissertation is as follows. Chapter 2 describes background and related work. Chapter 3 presents the probabilistic framework using parametric density estimation, while Chapter 5 presents the probabilistic framework using non-parametric density estimation. Chapter 4 applies the learned density functions within geospatial knowledge extraction problems. Chapter 6 presents the method and results of error estimation in the location prediction problem.

Chapter 2
Background and Related Work

2.1 Location Prediction

In the location prediction task, one estimates the location of documents such as images or textual documents using textual or other features. Location prediction is an important process for several applications in geographic information retrieval; for example, Sadilek et al. use it for epidemic studies [Sadilek, Kautz, and Bigham, 2012].

There are two major categories of location prediction approaches. The first approach is to extract a toponym and then look it up in a gazetteer. Several commercial services, for instance Yahoo Placemaker and GeoNames, use this approach to locate documents. It has proven to work well with documents such as news articles and webpages. However, it does not work well for social media such as Flickr images and tweets because of three main limitations. Ambiguity is an important challenge in this process [Amitay, Har'El, Sivan, and Soffer, 2004]. Many place names are also common words in English, such as Turkey. In addition, a place name could be ambiguous if it refers to different places; for example, `victoria' could refer to a place in Canada, Australia, or half a dozen other places called Victoria.
Several techniques have been proposed to improve the accuracy of gazetteer methods, mainly to resolve ambiguities. In [Amitay, Har'El, Sivan, and Soffer, 2004], the authors use a taxonomy to compute a score for each place in the document. The place scores are then sorted by a focus-finding algorithm to find the place focus of the document. Lieberman et al. proposed a method to disambiguate places in a document by comma groups, for instance, "Ontario, California" [Lieberman, Samet, and Sankaranarayanan, 2010]. The key idea is that if one of the terms in a comma group is disambiguated, the remaining terms are also disambiguated.

The second approach is statistical modeling. It generally yields better results than a gazetteer approach for social media documents, as shown in [Kinsella, Murdock, and O'Hare, 2011, De Rouck, Van Laere, Schockaert, and Dhoedt, 2011]. The key idea of this approach is to estimate the probability distribution relating features to places. Several types of features can be used, such as visual, textual, or social features.

Some researchers framed the location prediction problem as one of classification [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b, Serdyukov, Murdock, and Van Zwol, 2009]. [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b], for example, used the mean-shift clustering algorithm to separate photos into distinct clusters, with each cluster representing a class of locations. The terms of the photos in the cluster are used as prediction features. Their method computes the probability of a class and the probability of each feature given the class. Then, given a new photo with some tags, they use a Naive Bayes classifier to predict the class the photo belongs to. Their method is sensitive to the scale parameter used to discretize the data [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b]. They used 100 km as the grid size for clustering photos.

Cheng et al. propose a method to determine the city that a Twitter user is located in [Cheng, Caverlee, and Lee, 2010]. They view the problem as a classification problem. The joint probabilities of two random variables, words and location, are estimated from the training set. Smoothing and word selection techniques are used to increase accuracy. The word selection technique was proposed by [Backstrom, Kleinberg, Kumar, and Novak, 2008], who model the spatial distribution of a word by a unimodal multivariate distribution. The parameters of the unimodal multivariate distribution are obtained by a numerical method such as the golden section search [Cheng, Caverlee, and Lee, 2010].

Van Laere et al. proposed a two-step approach to increase accuracy [Van Laere, Schockaert, and Dhoedt, 2011]. The first step is quite similar to [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b]; however, k-medoids clustering is used instead of the mean-shift algorithm. Their method then computes the probability of a class and the probability of each feature given the class, and finds the class of a Flickr photo with a Naive Bayes classifier. In the second step, they use a similarity search to reduce errors from the limited granularity of the first step. The Jaccard measure is used to measure the similarity of two photos. Only photos within the class from the first step are used in the similarity search process.
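The classification framing described above can be sketched in a few lines (an illustration, not any of the cited authors' code): photos are binned into coarse grid cells, a multinomial Naive Bayes classifier is trained on bag-of-tags features with the cell as the class label, and a new photo is assigned a cell from its tags alone. The toy photos, the 1-degree cell size, and the scikit-learn classes are assumptions of the sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training data: (lat, lon, tags) triples; real systems use millions of photos
photos = [
    (37.8, -122.4, "goldengate bridge fog"),
    (37.7, -122.4, "sanfrancisco alcatraz"),
    (40.7, -74.0,  "newyork manhattan skyline"),
    (40.8, -74.0,  "newyork centralpark"),
]

def cell_of(lat, lon, cell_deg=1.0):
    """Discretize a coordinate into a grid-cell label (roughly 100 km cells)."""
    return (int(np.floor(lat / cell_deg)), int(np.floor(lon / cell_deg)))

texts = [tags for _, _, tags in photos]
labels = [str(cell_of(lat, lon)) for lat, lon, _ in photos]

vec = CountVectorizer()
X = vec.fit_transform(texts)        # bag-of-tags features
clf = MultinomialNB(alpha=1.0)      # add-one smoothing
clf.fit(X, labels)

# predict the grid cell of an unseen photo from its tags alone
print(clf.predict(vec.transform(["foggy goldengate morning"])))
```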
[Kinsella, Murdock, and O'Hare, 2011, De Rouck, Van Laere, Schockaert, and Dhoedt, 2011] showed that the performance of a statistical model approach is better than that of a gazetteer approach. Kinsella et al. showed that approximately 20% of tweets can be located at the correct neighbourhood level [Kinsella, Murdock, and O'Hare, 2011, Kelm, Murdock, Schmiedeke, Schockaert, Serdyukov, and Van Laere, 2013]. They report that only 1.5% of tweets can be located at the neighbourhood level by the baseline. In this work, Yahoo! Placemaker, a gazetteer approach, is used as the baseline. De Rouck et al. trained a statistical model from a Flickr data set. The model was then used to georeference Wikipedia pages. They reported that 15.4% of the pages can be located within 1 km, while the baseline achieves only 4.2%. They also used Yahoo! Placemaker as the baseline, the same as [Kinsella, Murdock, and O'Hare, 2011]. Both of these works suggest that a statistical model offers better performance than a gazetteer approach on social media documents.

2.2 Location Prediction Error Estimation

Location prediction error estimation is new to the location prediction task, with relatively few existing methods providing the error associated with the predicted location. One such method was described in [Hauff, Thomee, and Trevisiol, 2013]. This method returns a ranked list of possible locations for a document, with the top-ranked location taken as the document's predicted location. If the top n locations are distributed over the globe, the document's true location is considered to be very uncertain; hence, the estimated location error is high. In contrast, if the top n locations are spatially close (e.g., with a standard deviation of a few kilometers), then the error is considered to be low. Unfortunately, Hauff et al. [Hauff, Thomee, and Trevisiol, 2013] do not provide optimal parameters for error estimation.

A related task is ranking location prediction results. The goal of this task is to rank location prediction results and pick a subset of documents with potentially very accurate predicted locations. In practice, the error may be too high when documents do not contain any location-indicative words. Thus, we may need to predict locations for only a subset of all documents. Han et al. [Han, Cook, and Baldwin, 2014] propose several heuristics for ranking predictions. Error estimation can be used in the result ranking task by ranking the results on the estimated error values.

2.3 Place Identification

Rattenbury and Naaman observed that "place tags exhibit spatial usage patterns that are significantly geographically localized" [Rattenbury and Naaman, 2009b]. They proposed a quantitative method to identify place tags from a Flickr data set. A tag refers to a place if its distribution is spatially localized, i.e., highly clustered around a specific location.

However, aggregate statistics of tag occurrence are sensitive to the choice of spatial unit used to discretize data, and this can impact our decision about whether the tag is localized. This is the Modifiable Areal Unit Problem, or MAUP [Openshaw, 1983]. For example, if we choose a 100 km × 100 km grid, we may see that samples of `socal' are localized, i.e., fall within one grid cell, while samples of `california' are not. However, by increasing the grid size to 1,000 km × 1,000 km, both of them will be localized.

The solution proposed by [Rattenbury and Naaman, 2009b] relied on a multi-scale analysis method that they called Scale-Structure Identification. Their technique relied on the observation that a place-related term is usually localized in a specific area. The key step in this method is to cluster data points into discrete areas.
Then, the probability of each area is computed. Finally, entropy is computed from the probability distribution. They tackle the MAUP by using a multiple-scale technique. The key idea is that, instead of using a single scale parameter, they use multiple scale parameters for clustering and then aggregate the resulting entropies.

Their spatial scan method was inspired by [Kulldorff, 1999] in the epidemiology field. Kulldorff studied the disease outbreak problem. This problem is closely related to the place semantic identification problem, since both detect bursts in a specific area. He proposed spatial scan methods for disease outbreak analysis which can be used in the place identification problem [Rattenbury and Naaman, 2009b]. Neill et al. also studied disease outbreak analysis and proposed a method to improve the spatial scan method [Kulldorff, 1999, Neill and Moore, 2003]. Their algorithms exploited a multi-resolution technique to improve the time complexity of the algorithm.

Backstrom et al. introduced a method to solve the place identification problem by analyzing search engine query logs [Backstrom, Kleinberg, Kumar, and Novak, 2008]. The authors model the spatial distribution of search query logs as unimodal multivariate functions. The location of a search query log is estimated from an IP address. The function has two parameters: C and α. C identifies the frequency at the center, and α determines how quickly the frequency decreases as the point moves further away from the center. They developed an efficient algorithm to compute the maximum likelihood values for C and α. C and α can also be obtained by an optimization technique such as the golden section search [Cheng, Caverlee, and Lee, 2010].

Spatial knowledge extracted from the place identification problem can be used in several applications. It was used to improve the prediction accuracy of Twitter users' home locations [Cheng, Caverlee, and Lee, 2010]. Flickr photos can be georeferenced more accurately using place identification techniques [Intagorn and Lerman, 2012b]. Rattenbury et al. suggested other possible applications, such as image search improvement through inferred query semantics, tag suggestions for social media based on location, and automated creation of place gazetteer data [Rattenbury and Naaman, 2009b].

2.4 Learning Place Relations and Folksonomies

Several researchers have recently investigated approaches to learning relations between concepts from social metadata. [Plangprasopchok, Lerman, and Getoor, 2011] learn folksonomies by aggregating many shallow individual hierarchies, expressed through the collection/set relations on Flickr. They address many challenges in their approach, for example, differences in the level of expertise and granularity for each user. [Schmitz, 2006a] applies a statistical subsumption model [Sanderson and Croft, 1999] to learn hierarchical relations between tags. He addresses the challenge of the popularity vs. generality problem using tag frequency.

Schmitz used probabilistic subsumption to learn broader-narrower relations between Flickr tags. His method can be expressed mathematically as follows: tag $w_2$ potentially subsumes tag $w_1$ (i.e., $w_2$ is the broader term) if $P(w_2 \mid w_1) \ge t$ and $P(w_1 \mid w_2) < t$, where $t$ is some threshold. Here $P(w_1 \mid w_2) = N(w_1, w_2)/N(w_2)$, where $N(w_1, w_2)$ is the number of photos tagged with both $w_1$ and $w_2$, and $N(w_2)$ is the number of photos tagged with $w_2$ [Schmitz, 2006a]. One of the challenges of this method is the popularity vs. generality problem: more popular tags, such as `alabama', can be more specific than less popular tags, such as `northamerica'. The algorithm assumes that more general tags have higher frequency than specific tags, but this assumption does not hold in some cases.
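A minimal sketch of this subsumption test follows (toy counts rather than real co-occurrence statistics; the 0.8 threshold is an illustrative choice in the spirit of Sanderson and Croft):

```python
def subsumes(n_w1, n_w2, n_both, t=0.8):
    """w2 subsumes w1 (w2 is broader) if P(w2|w1) >= t and P(w1|w2) < t."""
    p_w2_given_w1 = n_both / n_w1   # N(w1, w2) / N(w1)
    p_w1_given_w2 = n_both / n_w2   # N(w1, w2) / N(w2)
    return p_w2_given_w1 >= t and p_w1_given_w2 < t

# invented co-occurrence counts, not from the dissertation's data set
n_california, n_socal, n_both = 120_000, 9_000, 8_000
print(subsumes(n_socal, n_california, n_both))  # does 'california' subsume 'socal'?  -> True
print(subsumes(n_california, n_socal, n_both))  # does 'socal' subsume 'california'?  -> False
```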
In most existing gazetteers, place names are usually limited to an administrative view of places [Keßler, Maué, Heuer, and Bartoschek, 2009]. Several methods [Keßler, Maué, Heuer, and Bartoschek, 2009, Montello, Goodchild, Gottsegen, and Fohl, 2003] have been proposed for learning non-administrative place names, such as `soho' or `bay area'. The boundary of each place is generated to measure the degree of geospatial subsumption between places.
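The geometric version of this idea can be sketched as follows (an illustration of the general approach, not the cited authors' method): estimate a trimmed bounding box for each place from its geotagged points, and treat place A as part of place B when most of A's points fall inside B's box. The synthetic coordinates, the percentile trimming, and the 0.8 containment threshold are all assumptions of the sketch.

```python
import numpy as np

def bounding_box(points, trim=0.05):
    """Axis-aligned box from the 5th-95th percentiles, to blunt noisy outliers."""
    lo = np.percentile(points, 100 * trim, axis=0)
    hi = np.percentile(points, 100 * (1 - trim), axis=0)
    return lo, hi

def fraction_inside(points, box):
    lo, hi = box
    return np.all((points >= lo) & (points <= hi), axis=1).mean()

rng = np.random.default_rng(1)
california = rng.normal([37.0, -120.0], [2.5, 3.0], size=(2000, 2))  # broad region
socal      = rng.normal([34.0, -118.0], [0.8, 0.8], size=(400, 2))   # sub-region

box_ca, box_socal = bounding_box(california), bounding_box(socal)
print("socal part-of california:", fraction_inside(socal, box_ca) > 0.8)       # True
print("california part-of socal:", fraction_inside(california, box_socal) > 0.8)  # False
```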
Being probabilistic, the model also presents a natural way to deal with noise and uncer- tainty. However, using Gaussian mixture models presents its own challenges, as the number of Gaussian components in the model is not known a priori. Too high a number may lead to overtting the estimated probability density function, while too low a number will lead to an inaccurate probability density function. In this thesis, we solve this problem by using the Bayes Information Criterion to nd the best number of components, and then estimate the parameters of each component using Expectation Maximization. The probabilistic models oer a general and exible framework for solving a variety of geospatial data mining applications. Specically, it can be used for 18 place identication [Rattenbury and Naaman, 2009b], we classify tags as place names or not place names based on their spatial distributions. It can also be used for the location prediction task, which attempts to predict locations of docu- ments given their terms. In addition, we use the probabilistic framework to learn relations between places, specically part-of relations such as `socal' is in `cali- fornia'. Finally, we assemble the learned relations within hierarchical directories called folksonomies. We show that our method is competitive with state-of-the-art algorithms, where they exist. Moreover, the probabilistic framework enables us to improve performance on the location prediction task by eliminating nonspatial tags, e.g., `iphone'. 3.2 Probabilistic Framework for Mining Geospa- tial Data We model tags and photos in terms of spatial probability distributions or density functions. Spatial distribution of tag w can be written as P (Xjw), where X represents locations of photos tagged with w. Then, we can easily model spatial probability distribution of a photo as a superposition of probability distributions of all tags in that photo. We start by modeling the probability distribution of each tag separately. Multi- variate Gaussians, Gaussian mixture models (GMM), and kernel density estimation (KDE) are some of the more popular methods for estimating probability density of the observed data [Kobos and Ma ndziuk, 2009]. However, dierent tags may be expressed at dierent levels of granularity, from continent (e.g., `asia') to the level of landmarks (e.g., `goldengatebridge'). Kernel density estimators require a scale, or bandwidth, parameter to be set, which may not be known a priori and vary from 19 tag to tag. Therefore, we prefer to use GMMs as the method for estimating the probability distribution of each tag. Using a sucient number of Gaussian compo- nents, and by tuning the parameters of each components and adding them linearly, most continuous distributions can be approximated [Bishop and En Ligne, 2006]. 3.2.1 Tag Distribution as a Mixture of Gaussians A Gaussian mixture model is a superposition of K Gaussians: P (x) = K X k=1 k N (xj k ; k ); (3.1) where each Gaussian densityN (xj k ; k ) is called a component of the mixture with mean k and covariance matrix k . Parameter k , called the mixing coecient, gives the weight of the kth component, or the fraction of all data points that are explained by that component. It has a value between 0 and 1, and P K k=1 k = 1. The distribution of photos tagged `victoria', for example, will have a highly localized component in Canada and one in Australia among others. In addition, there will be random photos all over the world tagged `victoria', where the term is being used as a female name. 
These points may be assigned to their own, noise component, with a low k where it is classied as noise. Noise can also contribute to the variance of each Gaussian component. The Gaussian mixture model is specied by the number of mixture components, each governed by parameters k ; k ; k . For a given model, we use expectation- maximization (EM) algorithm [Dempster, Laird, and Rubin, 1977] to estimate model parameters k ; k ; k . EM is an iterative algorithm with two major steps: expectation (E) and maximization (M) step. The E step estimate (z nk ) from the 20 current parameter values where (z nk ) can be viewed as responsibility of compo- nent k generate x n . The M-step updates the values of k ; k ; k from (z nk ) of the previous step. The process continues until convergence. The convergence is usually dened by when the loglikelihood function or nding parameters changes below some threshold [Bishop and En Ligne, 2006]. How many mixture componentsK should we use to model each tag? Intuitively, the optimal number of components is one that best explains the data. By using more components (increasing the number of model parameters), one can usually get the model to better describe the data. However, this may lead to overtting, where a very complex model explains every point in the training data set, but it cannot generalize, and therefore, has no predictive power on test data. We can reduce this problem by penalizing model for its complexity (e.g., using AIC or BIC) or testing predictive performance on test data (e.g., cross-validation) [Liddle, 2007, Smyth, 1996, Mardia, Kent, and Bibby, 1980, Yan and Ye, 2007]. The BIC is used for model selection in statistics. It avoids overtting by intro- ducing a penalty term for the number for parameters in the model [Liddle, 2007]: BIC =2 lnL +k ln(n). HereL is likelihood estimate of the model,k is number of parameters and n the number of observations, or data points. Our model is a mixture of K bivariate Gaussian components. Each component has a bivariate mean, which is specied by two parameters, covariance matrix, which is sym- metric, therefore, specied by three parameters, and a mixing coecient, which contributes one parameter to the model. However, since mixing coecients have to add to one, this constraint removes one parameter. Therefore, the total number of model parameters is k = (K 1) + 2K + 3K = 6K 1. Our model selection process is very simple. We estimate model parameters using the EM algorithm to get the maximum likelihood estimateL(K) with respect 21 to the number of components K. We then choose the values of K that minimizes the BIC value of the model: K = arg min K BIC(K). Thus, each concept has dif- ferent number of components. About 34% of the tags in our data set have between one and ten components, 57% have between 11 and 20 components, and 9% of the tags have more than 20 components (the maximum number of components we consider is 30). In this chapter, we presented a parametric approach to modeling geospatial data. Once the probability density functions of each term are learned, we can use them for extracting geospatial knowledge. In the next chapter, we focus on three specic knowledge extraction tasks: place identication, location prediction, and learning relations between places. Since the ground truth is available for these tasks, we will use them to evaluate the quality of the learned models of data. 
22 Chapter 4 Mining Geospatial Knowledge 4.1 Applications Modeling tag distribution as a mixture of Gaussians oers a general and exible framework for data mining applications in the geospatial domain. In this section, we study the problems of place identication, location prediction, learning part-of relations between places, and combining these into a directory of places. Finally, we aggregate the learned relations within a hierarchy of places. This informal ontology, or folksonomy, captures how people think of places and relations between them. 4.1.1 Data Collection The data for the experiments were collected from the social photo-sharing site Flickr. We used the Flickr API to retrieve information about more than 14 million geotagged photos created by 260K distinct users. These photos were tagged over 89 million times with 3 million unique tags. Following [Rattenbury and Naaman, 2009b], we represent each photo p as a tuplefid p , u p , l p , T p g, where id p is the site-specic id of the photo, u p is id of the user who create it, l p represents the photo's location as lat-long pair, andT p is set of tags in the photo. The photo has only one location, and one user, but it may have multiple tags. As a preprocessing step, we lter out tags which were used by fewer than 100 people. As a consequence, photos containing only less popular tags will not be 23 included in the data set. We randomly select ve thousand users whose photos will be the basis of the test data set. To create the training data set for learning GMM parameters of the distribution of a given tag, we sample a single photo from each of the remaining users uniformly from a 100 km grid. This sampling procedure helps reduce bias from users who take many more photos in some location compared to other users [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b]. The same training set is used for both place identication and location prediction tasks. After these preprocessing steps, the training set contains 2.5M photos with 192K distinct users and 11K distinct tags. 4.1.2 Place Identication A Place Identication problem is the problem of identifying place-related terms. Several researchers proposed solutions to this problem discussed in the related works section. A tag refers to place name if its distribution is spatially localized. However, aggregate statistics of tag occurrence are sensitive to the choice of spatial unit used to discretize data, and this can impact our decision about whether the tag is localized. Baseline: Discrete Entropy-based Method The solution proposed by [Rattenbury and Naaman, 2009b] relied on multi-scale analysis method that they called Scale-Structure Identication. They observed that \place tags exhibit spatial usage patterns that are signicantly geographically localized" [Rattenbury and Naaman, 2009b]. This can be seen in the distribution of photos tagged `california' in Figure 3.1(a). While these photos can be found all over the world, they are much more dense around California. Rattenbury et al. [Rattenbury and Naaman, 2009b] proposed a quantitative method to identify 24 place tags. We use the same intuition to distinguish between place or non-place tags, but analyze the spatial patterns of the tags using the probabilistic modeling framework. The intuition behind this method is if a tag is localized, most of the samples should be found within a single cluster. Then Rattenbury et al. compute the probability of samples in each cluster. Finally, they compute entropy from this probability mass function. 
Lower entropy means that most of samples can be explained by one cluster, i.e., they fall within the same cluster. In extreme case, if all the samples fall in the same cluster, the entropy is zero. They tackle problem of MAUP by clustering data at many dierent values of the scale parameter r and combine entropy values at dierent scales. The entropy of a place tag will converge to zero very fast for small r. In contrast, the entropy of a non-place tag will converge to zero at very large r. The method checks whether the entropy of a tag is below some threshold, and if so, it classies the tag as a place name. However, if the tag is ambiguous and refers to places that are far away from each other, e.g., `victoria' may refer to a place in Canada or Australia, the method will not judge it as being well recognized, and hence, not recognize it as a place tag. Model-Based Place Identication Instead of discretizing data at dierent scales, we work with a continuous prob- ability density function. We decide whether a tag is well-localized by examining the continuous entropy ofP (Xjw). The intuition behind our method is as follows. People usually use non-place tags everywhere in the world, for example, people use the tag `iphone' all over the globe. This tag has very high uncertainty, thus, it is unlikely that we can predict locations of photos tagged `iphone'. In contrast, the tag `alcatraz' is highly localized. In other words, it has low uncertainty in the geospatial domain. 25 Entropy is used to measure uncertainty of a distribution. In this application, we use entropy to estimate the uncertainty of P (Xjw), the spatial distribution of the tag w. There are advantages to using entropy in continuous space. First, geographic locations occur in continuous space. To estimate entropy in continuous space, there is no need for discretization parameters. Once entropy is estimated, it can be directly manipulated, e.g., compared to a threshold, without further processing as in discrete entropy method. The continuous entropy of tag w can be estimated by H P (Xjw) = Z X P (xjw) logP (xjw) dx (4.1) There is no analytic form for computing the entropy. Instead, we estimate it using the Monte Carlo method [Hershey and Olsen, 2007]. The idea is to draw a sample x i from the probability distribution P (x) such that the expectation of logP (x),E P (x) (logP (x)) =H(P ). Monte Carlo method is used because, accord- ing to [Hershey and Olsen, 2007], Monte Carlo sampling has highest accuracy for approximation for high enough number of samples; however, it is computationally expensive. A better runtime approximation can be obtained by more sophisticated approaches discussed in [Hershey and Olsen, 2007]. The Monte Carlo approxima- tion of entropy can be expressed as: H MC (P ) = 1 n n X i=1 logP (x i ) (4.2) where x i is a sample from the distribution P (x). As the number of samples n grows large, H MC (P )!H(P ). 26 If a tag's entropy is lower than some threshold , then P (xjw) is judged to be localized, and we identify the tag w as a place name: H(P (xjw))< (4.3) Evaluation To compare the performance of dierent methods on the place identication task, we need a ground truth about place names. For this purpose we use GeoNames (http://geonames.org), a geographical database containing over eight million place names from all over the world. We assume that if a tag exactly matches a name or an alternate name in GeoNames, then it is a place name. For example, tag `usa' matches the alternate name of United States. 
If successful, the match function returns a geoid (geographic id) of the place. Unfortunately, any word, including non-place word, may match a name and an alternate name in GeoNames. For example, the term `cake' will return a place with geoid 7063723. We exploit the geographic information in the GeoNames database to reduce this problem. We use a simple heuristic in building the ground truth that tags must match both names and locations. Specically, for a location match, a Flickr tag w must match not only the name of a GeoNames place (specied by geoid), but more than 5% of photos with that tag must be located within the bounding box of the place. We use standard precision and recall metrics to evaluate the performance of baseline and the proposed model-based method on the place identication task. Precision measures the fraction of tags that were correctly predicted to be place names relative to the number of all tags that were predicted to be place names. Recall measures the fraction of place names in GeoNames that were predicted to be place names. However, precision and recall are sensitive to threshold value; 27 approach AUC Max F1 Min CE baseline 0.6805 0.6807 0.0937 gmm 0.6989 0.7060 0.0904 Table 4.1: Comparison of the model-based approach (gmm) to baseline on the place identication task. AUC is an area under precision-recall curve. Max F1 is a maximum value of f1 score. Min CE is a minimum classication error rate. Precision, Recall, F1 score, and Classication error can be computed from the formulations below. therefore, we compute the performance as the precision-recall curve. Aggregate metrics, including AUC (area under precision-recall curve, maximum f1 score (Max F1), and minimum classication error rate (Min CE), are reported in Table 4.1. As we can see from these results, our method performs slightly better than baseline. Its true advantage, however, is exibility, as the same probabilistic framework can be used to address dierent geospatial data mining tasks, as shown below. TP and FP are the true and false positives respectively. The true positives are tags that the algorithm predicts as place tags, and the tags are place tags under the ground truth. The false positives are tags that the algorithm predicts as place tags, but the tags are non-place tags under the ground truth. recall = TP TP +FN precision = TP TP +FP f1 score = 2precisionrecall precision +recall 28 classication error = TP TP +FP 4.1.3 Location Prediction We apply the probabilistic model to solve a dierent problem, namely, nd the most likely geographic location of a photo given its tags. Many of the photos on Flickr were not taken by a GPS-enabled device. Can we leverage the annota- tions users create to automatically geo-reference them? People can often infer a photo's location from the textual information associated with it. Thus, a set of tagsf`castle', `smithsonian', `dc', `washington'g allow us to accurately place the photo near the Smithsonian Museum in Washington, DC, while the tagsf` ower', `night', `bug', `insect', `moth'g do not. However, this method has disadvantages. First, it is time consuming. Second, a person's limited memory and knowledge of places may allow him to accurately guess the location only for familiar places. Europeans may not make good guesses about places in the US, and Americans may not guess accurately about places in Asia. In contrast, an automatic method could do this more cheaply and accurately. 
Baseline

Previous researchers framed the location prediction problem as one of classification [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b, Serdyukov, Murdock, and Van Zwol, 2009]. [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b], for example, used the mean-shift clustering algorithm to separate photos into distinct clusters, with each cluster representing a class of locations. The tags of the photos in a cluster are used as prediction features. Their method computes the probability of a class and the probability of each feature given the class. Then, given a new photo with some tags, they use a Naive Bayes classifier to predict the class the photo belongs to. Their method is also sensitive to the scale parameter used to discretize data [Yang and Webb, 2009]. They used 100 km as the grid size for clustering photos.

Instead of measuring the classification accuracy of the baseline method, as [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b] did, we measure its performance in terms of the error between the photo's actual and predicted locations. For example, the actual location of a photo could be Los Angeles, but the algorithm predicts its location to be San Diego. This prediction error is much smaller than one resulting from the predicted location being New York City. Since the Bayes classifier gives only the class of the photo but not the numeric value of its geographic location, we take the location of the class to be the mode of the cluster from the mean-shift clustering procedure. While the original work reported predictions made by two classifiers, they were found to have similar performance. We chose to use the Naive Bayes classifier as the baseline, because it is easier for us to verify its results.

Model-based Location Prediction

We use the methodology described in [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009b] to prepare the test data set for evaluating the proposed method. The test set is created by randomly selecting 5000 users, then choosing a photo at random from each user. Thus, the test data set consists of 5000 photos from 5000 users. Finally, photos from the users in the test set are removed from the training set to prevent bias that may come from having the same users in the training and test data sets.

We hide the actual locations of photos in the test set and use our method and the baseline to predict them. We compare the performance of our method and the baseline using the geographical distance [Sinnott, 1984] between the predicted (prd) and actual (act) locations, computed with the haversine formula:

a = \sin^2\left(\frac{lat_{act} - lat_{prd}}{2}\right)

b = \cos(lat_{prd}) \cos(lat_{act}) \sin^2\left(\frac{lon_{act} - lon_{prd}}{2}\right)

error = 2 r \arcsin\left(\sqrt{a + b}\right)

where r is Earth's radius (r = 6,371 km).
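A sketch of this error metric follows; the function name and argument order are ours.

import math

def haversine_km(lat_act, lon_act, lat_prd, lon_prd, r=6371.0):
    # Great-circle distance between the actual and predicted locations, in km.
    lat_act, lon_act, lat_prd, lon_prd = map(
        math.radians, (lat_act, lon_act, lat_prd, lon_prd))
    a = math.sin((lat_act - lat_prd) / 2.0) ** 2
    b = math.cos(lat_prd) * math.cos(lat_act) * math.sin((lon_act - lon_prd) / 2.0) ** 2
    return 2.0 * r * math.asin(math.sqrt(a + b))

# Example: error when predicting San Diego for a photo actually taken in Los Angeles.
print(round(haversine_km(34.05, -118.25, 32.72, -117.16)))   # roughly 179 km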
Evaluation: Figure 4.1(a) shows the distribution of prediction errors made by the model-based approach as a histogram, where each bin corresponds to a unit of 100 km. The bins corresponding to the lowest errors have the highest frequency, implying that our method produces small prediction errors most of the time. Figure 4.1(b) compares the cumulative probability distribution (CDF) of the errors made by the proposed method (gmm) and the baseline. The proposed method has higher probability for lower errors, meaning that it produces better predictions than the baseline.

Figure 4.1: Comparison of the performance of different methods in terms of the error between the photos' predicted and actual locations. (a) Distribution of errors in the test set. (b) Cumulative distribution function of errors produced by the proposed method (gmm) and baseline.

Improving Accuracy: Existing location prediction methods suggest using gazetteers to improve prediction accuracy [Amitay, Har'El, Sivan, and Soffer, 2004, Gouvêa, Loh, Garcia, Fonseca, and Wendt, 2008, Serdyukov, Murdock, and Van Zwol, 2009]. The key idea is that place-related terms, e.g., `goldengate', should have higher weight than non-place-related terms, e.g., `iphone', in classification. Our model-based method can integrate a tag's spatial uncertainty to come up with better, more accurate location predictions. Intuitively, localized tags tend to have higher predictive power; therefore, tags that have low entropy will produce a lower error on the location prediction task. Figure 4.2(a) validates this assumption. It plots the entropy of the representative (most localized) tag of each photo vs the error of the predicted location.

Figure 4.2: Using tag entropy to improve location prediction from tags. (a) Scatter plot of the prediction error (in units of 100 km) vs the entropy of the photo's representative (lowest entropy) tag. (b) CDF of prediction errors after filtering out photos whose tags are not well localized. Each line corresponds to a different filtering threshold, i.e., the threshold on the entropy of the most localized tag in the photo.

We can use these results to filter out photos whose locations we cannot confidently predict. We repeat the location prediction experiment, keeping photos whose representative tag's entropy value is below some threshold. Figure 4.2(b) shows the CDF of the prediction error for different values of the entropy threshold. The line that shows the CDF of the predicted error for a threshold of 12 (which overlaps the results for threshold = 9) corresponds to the CDF of the model-based prediction error without filtering (Fig. 4.1(b)). The figure shows that we can get much more accurate predictions (lower errors) for photos that use well-localized tags, and the more localized the tags, the better the performance. In fact, the performance is much better than the baseline.

4.1.4 Learning Relations between Places

In this section we describe how to use the proposed framework to learn relations between places, for example, `socal' is part of `california' (represented as `california'→`socal'). Our approach is inspired by probabilistic subsumption [Sanderson and Croft, 1999]: given two concepts w_1 and w_2, if the occurrences of instances of w_2 can explain most of the instances of w_1, but not vice versa, we say that w_2 subsumes w_1, or w_2 refers to a broader concept than w_1. Transposed to the geospatial domain, this implies that the spatial distribution of the more general parent tag, e.g., `california', should subsume the spatial distribution of the child tag, e.g., `socal'.

Baseline: Probabilistic Subsumption

[Schmitz, 2006a] used probabilistic subsumption to learn broader-narrower relations between Flickr tags. His method can be expressed mathematically as follows: tag w_2 potentially subsumes tag w_1 if P(w_2|w_1) >= t and P(w_1|w_2) < t, where t is some threshold and P(w_2|w_1) = N(w_1, w_2)/N(w_1); here N(w_1, w_2) is the number of photos tagged with both w_1 and w_2, and N(w_1) is the number of photos tagged with w_1.
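A sketch of this co-occurrence test follows, assuming photo annotations are available as sets of tags. The threshold t = 0.8 and the toy data are illustrative, and the rule is applied symmetrically so that, within a pair, the broader tag is the one that accompanies most occurrences of the narrower one.

from itertools import combinations
from collections import Counter

def subsumption_relations(photo_tag_sets, t=0.8):
    # Count single-tag and pairwise co-occurrence frequencies over photos.
    single, pair = Counter(), Counter()
    for tags in photo_tag_sets:
        tags = set(tags)
        single.update(tags)
        pair.update(frozenset(p) for p in combinations(sorted(tags), 2))

    relations = []          # (parent, child) pairs, i.e. parent -> child
    for p, count in pair.items():
        a, b = tuple(p)
        p_a_given_b = count / single[b]
        p_b_given_a = count / single[a]
        if p_a_given_b >= t and p_b_given_a < t:
            relations.append((a, b))        # a subsumes b
        elif p_b_given_a >= t and p_a_given_b < t:
            relations.append((b, a))        # b subsumes a
    return relations

photos = [{'california', 'socal'}, {'california', 'socal', 'beach'},
          {'socal', 'california'}, {'california', 'yosemite'}, {'california'}]
print(subsumption_relations(photos))   # includes ('california', 'socal')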
According to [Schmitz, 2006a], certain place names, such as `northamerica', are rarely specified by Flickr users and can be mistakenly subsumed by their children. This problem is generally known as the popularity vs. generality problem, i.e., popular terms are mistaken for more general terms. For example, the place `northamerica' is more general than the 50 U.S. states, and ideally, photos tagged with `alabama' should also be tagged with `northamerica' for the frequency-based approach to learn a relation between them. However, in practice, people prefer to tag their photos with more specific concepts, such as the names of states, rather than the very general `northamerica'. The performance of tag-based probabilistic subsumption, therefore, suffers when general concepts are not popular. Our approach tends to be more robust, because it relies on geospatial subsumption to learn relations. However, our method is applicable only to the geospatial domain, while the baseline method can be applied in other domains, although the popularity vs. generality problem still occurs there.

Model-based Method for Learning Relations

We can view the problem of learning relations as finding a broader distribution, associated with a parent place, that approximates the distribution of the child place well enough. For example, for a given tag w_i with distribution P(x|w_i), the problem is to find P(x|w_j) that approximates P(x|w_i) well enough, where w_i \neq w_j. For each P(x|w_j) that adequately approximates P(x|w_i), we learn the relation w_j → w_i. We use cross entropy to quantify this intuition.

Cross entropy measures the "difference" in information content between two probability distributions and can be interpreted as the average information needed to discriminate between them. The cross entropy of distributions P and Q, H(P, Q), is asymmetric, meaning that H(P, Q) \neq H(Q, P). We can use cross entropy to measure the information difference between the spatial distributions of two tags. The cross entropy of a child tag with respect to its parent should be low compared to the cross entropy of the same child tag with respect to an irrelevant term, since less information is required to differentiate the distribution of the child tag from that of the parent than from the distribution of an irrelevant tag. For example, let tag w_1 = `socal', tag w_2 = `california', and tag w_3 = `malaysia'. Then

H(P(x|w_1), P(x|w_2)) < H(P(x|w_1), P(x|w_3))

because `malaysia' has no spatial relation with `socal' but `california' does. Therefore, we can learn a relation between tags as follows: given tags w_1 and w_2, w_2 is a parent of w_1 if and only if H(P(x|w_1), P(x|w_2)) \leq \theta, for some threshold \theta.

The parent concept is usually geographically broader than the child concept. If w_1 can approximate w_2 well enough and w_2 can also approximate w_1 well enough, they could potentially be synonyms. Therefore, we need to add two more conditions. First, the child tag distribution must not approximate the parent tag distribution equally well (the reverse condition). Second, the parent tag must be geographically broader than the child tag, which can be measured using its entropy H(P(x|w)). This changes the formulation of the relation learning problem.
Given tags w_1 and w_2, w_2 is a parent of w_1 if and only if w_1 and w_2 satisfy the following three conditions:

H(P(x|w_1), P(x|w_2)) \leq \theta    (4.4)

H(P(x|w_2), P(x|w_1)) > H(P(x|w_1), P(x|w_2))    (4.5)

H(P(x|w_2)) > H(P(x|w_1))    (4.6)

where the entropy and cross entropy are defined as:

H(P(x|w)) = -\int_X P(x|w) \log P(x|w)\, dx

H(P(x|w_1), P(x|w_2)) = -\int_X P(x|w_1) \log P(x|w_2)\, dx

Entropy in continuous space is defined as the expectation, with respect to P(x|w) itself, of -\log P(x|w), while the cross entropy of the distributions P(x|w_1) and P(x|w_2) is defined as the expectation of -\log P(x|w_2) with respect to the distribution P(x|w_1). Each P(x|w_i) is given by Equation 3.1. There is no analytic form for computing the cross entropy. Thus, we estimate it using the Monte Carlo method [Hershey and Olsen, 2007]. The idea is the same as in the approximation of entropy in the place identification section. The Monte Carlo approximation of cross entropy can be expressed as:

H_{MC}(P, Q) = -\frac{1}{n} \sum_{i=1}^{n} \log Q(x_i)    (4.7)

where x_i is a sample from the distribution P(x). As the number of samples n grows large, H_{MC}(P, Q) \to H(P, Q).

Evaluation: We evaluate our methods on three test sets: U.S. states, countries, and continents. The U.S. states set is seeded by tags such as `alabama'. The countries set is seeded by tags such as `germany'. The continents set is seeded by the five tags `africa', `asia', `europe', `northamerica', and `southamerica'. The different granularity levels of the seed terms illustrate the challenges of mining geospatial knowledge, including the popularity vs. generality problem. We use GeoNames to measure the quality of the learned relations. We construct the ground truth as follows. For each seed in the test set, e.g., `alabama', we identify the corresponding geoid in GeoNames, e.g., the geoid corresponding to Alabama, and extract the names of all geoids within it. Then, using the methodology described in Section 4.1.2, we filter these child places, keeping only those places that match Flickr tags in our data. For each child place, we then create a relation seed→child, or "child place is in seed place," for example, `alabama'→`mobil'.

We use precision and recall to evaluate the learned relations. Precision measures the fraction of the learned relations that exist in the ground truth, and recall measures the fraction of relations in the ground truth that the method was able to learn. The maximum F-score is used to quantify the aggregate performance of the method. Maximum values of the F-score (max-F1) attained by the two methods on the data sets are shown in Table 4.2. The max-F1 score of our method on the U.S. states data set is a little lower than that of the baseline. However, on the countries and continents data sets our method performs significantly better than the baseline. We conclude that for learning relations between places, our method is more robust than the baseline.

approach      model   baseline
U.S. states   0.429   0.437
Countries     0.464   0.318
Continents    0.469   0.123

Table 4.2: Comparison of the max F-score of the model-based approach (gmm) to baseline on the relation induction task.
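A sketch of this test (Eqs. 4.4-4.7) follows, again assuming each tag's spatial density is available as a fitted Gaussian mixture; the scikit-learn models, the sample size, and the threshold value of 5 are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

def mc_entropy(gmm, n=5000):
    samples, _ = gmm.sample(n)
    return -gmm.score_samples(samples).mean()

def mc_cross_entropy(p_gmm, q_gmm, n=5000):
    # Eq. 4.7: average -log Q(x_i) over samples x_i drawn from P.
    samples, _ = p_gmm.sample(n)
    return -q_gmm.score_samples(samples).mean()

def is_parent(child, parent, theta=5.0):
    # Eqs. 4.4-4.6: the parent explains the child, not vice versa,
    # and the parent is geographically broader (higher entropy).
    h_cp = mc_cross_entropy(child, parent)
    h_pc = mc_cross_entropy(parent, child)
    return h_cp <= theta and h_pc > h_cp and mc_entropy(parent) > mc_entropy(child)

rng = np.random.RandomState(0)
socal = GaussianMixture(1).fit(rng.normal([34.0, -118.2], 0.1, (400, 2)))
california = GaussianMixture(1).fit(rng.normal([36.5, -119.5], 2.0, (400, 2)))
malaysia = GaussianMixture(1).fit(rng.normal([3.1, 101.7], 2.0, (400, 2)))
print(is_parent(socal, california))   # expected: True  ('california' -> 'socal')
print(is_parent(socal, malaysia))     # expected: False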
4.1.5 Learning Folksonomies

Researchers have made various attempts to learn ontologies of concepts from tags [Mika, 2007, Schmitz, 2006a] and from more structured social metadata [Plangprasopchok and Lerman, 2009, Plangprasopchok, Lerman, and Getoor, 2011]. Such ontologies, or what we refer to as folksonomies, capture social knowledge, including knowledge about places [Keßler, Maué, Heuer, and Bartoschek, 2009]. In this section we describe a simple greedy method that combines shallow relations, learned using the method described in the previous section, to create a hierarchy of places.

Methodology

We represent a geospatial folksonomy as a tree, where nodes are places and directed, weighted edges represent relations between nodes; for example, `california'→`bayarea' means that California is a parent of Bay Area. Since the folksonomy is a tree, each node may have at most one parent, and no cycles are allowed. We separate folksonomy construction into three steps. In the first step we identify all relevant child nodes for the given seed tag, which becomes the root of the tree. In the second step we create a directed graph from the candidate child nodes. In the last step we convert the directed graph to a tree using the single-parent constraint.

Find relevant children: For a given seed tag, which is to be the root of the learned folksonomy, we use the methodology described in Section 4.1.4 to identify tags referring to places within the seed place.

Construct graph: We construct a directed graph G with nodes V, which represent the seed tag and the children identified in the previous step, and directed weighted edges E, where an edge from v_i to v_j means that tag v_i is the parent of tag v_j, and its weight w_{ij} is the cross entropy of the two tags, w_{ij} = H(P(x|w_j), P(x|w_i)). An edge is created only if the cross entropy satisfies the conditions in Eqs. 4.4-4.6. By these conditions, the directed graph is automatically acyclic.

Convert the graph to a tree: To convert the directed graph to a tree, we need to enforce the constraint that each node (except the root node) may have at most one parent, i.e., for every node v \in V, deg^-(v) \leq 1, where deg^-(v) is the number of incoming edges incident on the node. Our procedure removes incident edges until deg^-(v) \leq 1. We keep the edge to the parent whose spatial distribution best describes the spatial distribution of the child. Quantitatively, this is expressed by cross entropy; therefore, we keep the edge with the lowest cross entropy (encoded in the weight of the edge) and remove all other edges. The output of this step is a tree (a code sketch of this step is given at the end of this section).

Illustrations

Figure 4.3 illustrates folksonomies learned by our method for the seed terms `utah' and `newzealand', using \theta = 5. The Utah folksonomy shows places and non-place concepts that are strongly associated with Utah, such as `mormon' and `lds' (abbreviation of The Church of Jesus Christ of Latter-day Saints, the formal name of the Mormon religion, which is headquartered in Salt Lake City, Utah). The folksonomy also shows vernacular terms, such as `slc', a commonly used abbreviation for Salt Lake City, and landmarks, such as Delicate Arch in Arches National Park. None of these concepts exist in expert-created gazetteers, which focus on administrative-level concepts [Keßler, Maué, Heuer, and Bartoschek, 2009]; however, they are useful concepts to know about Utah. These expressions of folk knowledge are what makes folksonomies a valuable tool in the analysis of user-generated content. The New Zealand folksonomy reflects the geographic division of the country into the South Island and the North Island, each with a major city within it. Both folksonomies are far from complete, which is a reflection of data sparsity. However, since user-generated content is growing at a fast rate, there will be ever more geo-referenced data to learn from, and sparsity won't be an issue.

Figure 4.3: Folksonomies learned by the greedy approach for two different seed terms. (a) Utah folksonomy. (b) New Zealand folksonomy.
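For concreteness, here is the promised sketch of the tree-conversion step; the (parent, child, cross_entropy) edge representation and the example weights are our own illustration.

def graph_to_tree(edges, root):
    # Keep, for every child, only the incoming edge with the lowest cross entropy,
    # so each node ends up with at most one parent (the root keeps none).
    best = {}                                    # child -> (weight, parent)
    for parent, child, weight in edges:
        if child == root:
            continue
        if child not in best or weight < best[child][0]:
            best[child] = (weight, parent)
    return [(parent, child) for child, (_, parent) in best.items()]

edges = [('utah', 'saltlakecity', 2.1), ('utah', 'slc', 2.4),
         ('saltlakecity', 'slc', 1.9), ('utah', 'moab', 2.8)]
print(graph_to_tree(edges, root='utah'))
# e.g. [('utah', 'saltlakecity'), ('saltlakecity', 'slc'), ('utah', 'moab')]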
Chapter 5

A Non-Parametric Approach to Modeling Geospatial Data

5.1 Prediction Confidence and Error

Our method exploits the small fraction of content that is geotagged, for example by a GPS-enabled device, to automatically estimate the location of documents with unknown geographic coordinates.

Figure 5.1: Locations of Flickr photos tagged with the terms (a) `california' and (b) `hollywood'.

Figure 5.1 shows on the map the locations of Flickr photos with some textual features, such as the tags `california' (a) and `hollywood' (b). We use these samples to estimate the spatial probability density function (PDF) of the features in the data set. Then, given a new document with some features, we can predict its location and estimate the quality of the prediction. We integrate the PDF of the features over a region of radius r and return as the document's predicted location the center of the region that maximizes this value, which we call confidence. The radius r controls the error of the prediction, which we call spatial resolution. The confidence represents our certainty about the prediction at that spatial resolution, i.e., the likelihood that the document's true location falls within the region of radius r centered on the predicted location. For example, suppose we find that the confidence of the predicted location of a photo tagged `losangeles' is 70% at a resolution of 100 km. This means that at this confidence level, 70% of the time a photo's location will be predicted correctly with error < 100 km.

There exists a trade-off between the resolution scale of the prediction and the degree of confidence in it. We can increase the confidence of the prediction by increasing the resolution scale r. For example, we can increase confidence to 100% by setting r to approximately 20,040 km, which is half of the planet's circumference. Despite the trade-off, our goal is to improve geotagging accuracy at fine-grained resolution scales, i.e., for low values of r, without relying on new features or using external data as was done by [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009a, Popescu and Ballas, 2012].

5.1.1 Computing Confidence from Samples

We assume that document terms W have some spatial distribution from which samples are drawn. For example, we treat the points in Fig. 5.1(a), which represent photos with the tag `california', as training samples of the distribution of the term `california'. If we assume that new documents using the term `california', which we refer to as test samples, are also drawn from the same distribution, we can calculate the likelihood of finding a test sample within a circle of radius r (the spatial resolution) by counting the fraction of training samples that fall within the circle. This likelihood is the confidence:

Confidence(x, r, w) = N_w(x, r) / N_w

Here, N_w is the number of samples of photos with tag w, and N_w(x, r) is the number of samples of photos with tag w within a circle of radius r centered at location x. We refer to r as the spatial resolution.

Figure 5.2: Illustrations of confidence estimation from (a) samples and from (b) the probability density function.

Figure 5.2(a) illustrates the calculation. Here, red points are samples of the tag w_1. Given a new document with the single tag w_1, the confidence of the predicted location x at the specified spatial resolution is 40%, which is the fraction of training samples that fall within the circle centered at x. Note that we can raise the confidence by increasing the spatial resolution, since N_w remains constant while N_w(x, r) increases monotonically with r.
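A sketch of this sample-based confidence and the corresponding location prediction follows. The haversine distance helper and the use of the training samples themselves as candidate locations follow the text, while the function names and the toy data are ours.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat1 - lat2) / 2) ** 2
    b = np.cos(lat1) * np.cos(lat2) * np.sin((lon1 - lon2) / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a + b))

def confidence(x, samples, r_km):
    # Fraction of the term's training samples that fall within r_km of x.
    lat, lon = x
    return float(np.mean(haversine_km(lat, lon, samples[:, 0], samples[:, 1]) <= r_km))

def predict_location(samples, r_km):
    # Candidate locations are the training samples themselves, as suggested above.
    best = max(map(tuple, samples), key=lambda x: confidence(x, samples, r_km))
    return best, confidence(best, samples, r_km)

pts = np.array([[34.05, -118.25]] * 8 + [[40.71, -74.01]] * 2)  # 80% LA, 20% NYC
print(predict_location(pts, r_km=100))   # the LA point, with confidence 0.8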
We assume that the spatial resolution is a requirement of the system. For example, we may want to estimate locations of documents at some resolution, such as city-level resolution, at which we can tolerate errors of at most 161 km [Han, Cook, and Baldwin, 2014]. In order to maximize the prediction accuracy at this resolution scale, we must set the spatial resolution parameter to r = 161 km. Then, the most likely location of a document containing terms W is the x that maximizes confidence:

x_{predict} = \arg\max_{x \in X} Confidence(x, r, W)

The set X of possible locations can be computed by discretizing space into small grid cells, or we can take X to be the set of locations of the training samples of terms W.

5.1.2 Computing Confidence from PDF

In practice, the training set of documents containing terms W, e.g., photos with tags {`california', `john', `2007', `iphone', `hbd'}, may be very sparse. As a result, we may not have enough samples to predict their location with high confidence. Instead, we assume that we can approximate the probability density function of the terms {`california', `john', `2007', `iphone', `hbd'} by combining the individual PDFs of each term. If we know the true PDF that generated a document with terms W, we can compute the prediction confidence at location x_0 with spatial resolution r as:

Confidence(x_0, r, W) = \int_R f_W(x)\, dx    (5.1)

Here x_0 is the location at which we want to measure the prediction confidence, and r is the given spatial resolution. The function f_W(x) is the probability density function of documents with terms W. A detailed discussion of the estimation of f_W(x) is provided in the next section. The region R is the set of locations whose distances to x_0 are less than r: R = {x \in X | dist(x_0, x) \leq r}. Another view of the confidence, Confidence(x_0, r, W), is that it is the probability mass of f_W within a circle of radius r centered at x_0.

Depending on the specific density distribution, the predicted location will vary with r. We illustrate this point with a one-dimensional example in Figure 5.2(b); the example generalizes easily to two-dimensional geographic space. The predicted location is the center of the segment of length 2r that gives the highest confidence, i.e., the largest area under the PDF curve in this one-dimensional example. The predicted location depends significantly on the choice of r. When r is small, e.g., r = 0.01, we will predict the location at x = 3, because this location results in the largest probability mass at that value of r. However, when r = 1, location x = 0 is preferred over x = 3, because the probability mass at that location (black area) is larger than at x = 3 (red area). We calculate the integral of f_W(x) numerically using the Riemann sum. The accuracy of the sum depends on the grid size parameter used to discretize space. We empirically investigate this dependence in Section 5.3.
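A one-dimensional sketch of Eq. 5.1 that reproduces the behavior of the example above follows; the grid spacing, the mixture weights, and the argmax-over-grid search are illustrative.

import numpy as np

def confidence(x0, r, pdf_on_grid, grid, h):
    # Riemann-sum approximation of the probability mass within distance r of x0 (Eq. 5.1).
    mask = np.abs(grid - x0) <= r
    return float(np.sum(pdf_on_grid[mask]) * h)

def predict(pdf_on_grid, grid, h, r):
    # Predicted location: the grid point whose surrounding region holds the most mass.
    conf = np.array([confidence(x, r, pdf_on_grid, grid, h) for x in grid])
    i = int(np.argmax(conf))
    return float(grid[i]), float(conf[i])

# A narrow peak at x = 3 and a broad bump at x = 0, as in the example of Fig. 5.2(b).
h = 0.01
grid = np.arange(-5.0, 6.0, h)
narrow = np.exp(-0.5 * ((grid - 3.0) / 0.05) ** 2) / (0.05 * np.sqrt(2 * np.pi))
broad = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
pdf = 0.3 * narrow + 0.7 * broad

print(predict(pdf, grid, h, r=0.01))  # near x = 3: the narrow peak wins at fine r
print(predict(pdf, grid, h, r=1.0))   # near x = 0: the broad bump wins at coarse r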
5.1.3 Prediction Error

When geotagging documents, it is useful to give the error associated with the predicted location. [Hauff, Thomee, and Trevisiol, 2013] suggested one approach for estimating prediction errors. The intuition of the approach is as follows: assume that location prediction returns a ranked list of locations, with the top-ranked result usually taken as the predicted location. If the top n locations are distributed all over the globe, the estimated prediction error is high. In contrast, if the top n locations are geographically close, then the error is low.

We employ the confidence prediction framework to estimate the location error. First, we estimate the probability mass around the predicted location, as described above. If the predicted location has moderate probability mass within a small spatial resolution r, it is likely that this predicted location has low error. If another predicted location has the same probability mass but within a larger spatial resolution, it can be inferred that this location has a higher error than the first predicted location. The method requires a threshold \tau to define `moderate' probability. The natural choice would be \tau = 0.5, because this is the minimum confidence at which we have an even chance of finding a sample in the region. However, errors in the PDF estimation shift this value away from 0.5. Empirically, we find that a good threshold is around 0.2.

The error estimation method is a search algorithm. We start by setting the error to the minimum spatial resolution r. If the probability mass at this error bound is higher than the threshold, we report the estimated error as r. If not, we increase the spatial resolution until the probability mass exceeds the threshold. The process is similar to the adaptive resolution prediction in Section 5.1.5. In fact, if the probability mass (confidence) exceeds the threshold before the spatial resolution r reaches its maximum value r_max, the adaptive resolution prediction can report the estimated error as r without additional calculation. The estimated error increases at every iteration until the confidence value exceeds the threshold. The algorithm is guaranteed to terminate if \tau < 1, because Confidence(·) is a monotonically increasing function of r and always reaches one for large enough r.

5.1.4 Probability Density Estimation

In the previous section we assumed that we know f_W(x), the true PDF of terms W; in practice, however, we can only approximate it. Below we describe a simple method that is easy to implement and relatively fast in practice. Improving the accuracy of PDF estimation is a subject for future research. There are two important choices in estimating f_W(x). The first choice is how to model the combined PDF of all terms w \in W. We use a mixture model to approximate f_W(x) as the weighted sum of the individual PDFs f_w(x):

f_W(x) = \sum_{w \in W} \alpha_w f_w(x)    (5.2)

where \alpha_w is the weight of term w, which represents the importance of that term. For example, it is obvious that the term `california' should have the highest weight in a photo tagged {`california', `iphone', `ramen'}, because it is the most geographically indicative term. Thus, we can incorporate feature selection into this model by giving such terms a higher weight. For simplicity, in this section we assume that all terms are weighted equally, and \alpha_w = 1/|W|.

The second modeling choice is how to describe f_w(x), the spatial distribution of each term w. The distribution is learned from data using either parametric or non-parametric approaches. Let x_1, ..., x_n be a set of samples drawn from a continuous distribution function f_w(x) associated with the term w. Parametric approaches, such as those in [Wand and Jones, 1994], estimate the common density f from samples by assuming that f belongs to a parametric family of functions, such as Gaussian or gamma functions. However, spatial distributions of terms are usually multimodal and cannot be described by such unimodal functions, unless mixture models are used [Intagorn and Lerman, 2012a].
However, spatial distri- butions of terms are usually multimodal and cannot be described by such unimodal functions, unless mixtures models are used [Intagorn and Lerman, 2012a]. 49 In contrast, non-parametric density estimation makes no assumptions about the spatial distributions of terms. Histogram is the oldest and most widely used non-parametric density estimator [Wand and Jones, 1994], and lends itself easily to spatial density estimation. We construct the histogram by dividing the surface of the earth into a uniform square grid. The grid sizeh, also called bandwidth, is a critical parameter in density estimation, because it determines the smoothness of the distribution. Wing et al. [Wing and Baldridge, 2011] also showed that it is an important parameter in geospatial applications and can aect location prediction. This is because the mode of the PDF depends on the discretization parameter h. One of possible reasons may relate to Modiable Areal Unit Problem [Openshaw, 1983]. The problem states that aggregate statistics are sensitive to the choice of spatial unit used to discretize data, and this can impact the predicted location. However, since we use condence, rather than estimate the mode of the PDF, to predict location, our method is less sensitive to the choice of bandwidth. While the bandwidth determines the accuracy of Riemann sum, and hence the precision with which condence is calculated, it will not aect the predicted location. As the grid size h! 0, the value of Riemann sum gets closer to the true value of integral. Thus, we set h to be small for higher accuracy. The trade-o is that computational time increases signicantly. To estimate f w (x) using a histogram, we simply count the number of observa- tions in each grid of size h: ^ f(x) = numberofobservationsinbincontainingx nh (5.3) 50 5.1.5 Adaptive Resolution Prediction The most appropriate level of granularity at which a document should be localized is generally not known beforehand and must be estimated [Van Laere, Schockaert, and Dhoedt, 2013]. For example, if the only tag available for a document is `lax', we should set the spatial resolution, r, to a landmark level instead of a city level. The adaptive resolution prediction provide a solution to this problem by search- ing the smallest spatial resolution that make the prediction condence high enough. The method requires two parameters: maximum spatial resolution (r max ) and con- dence threshold (). The maximum spatial resolution (r max ) is the maximum error that we can tolerate. The condence threshold is same as the condence threshold in the section 5.1.3. Given a document with its terms, our method searches for a location that maximizes its placeability condence. We start with the smallest spatial resolution and compute the condence by integrating estimated probability density function of the terms. If the calculated condence falls below the specied threshold, we increase the spatial resolution and recalculate condence. The process repeats until the condence is above the threshold or the maximum spatial resolution is reached. This process is summarized in the Figure 5.3. An advantage of this adaptive resolution prediction is that the error can be reduced by taking the risk to predict at a ne-grained resolution if the con- dence is high enough. This generally happens when documents contain local terms describing landmarks, e.g., `lax'. When local terms are present, some locations will have very high condence at small error bounds. 
5.2 Model Refinements

We can improve the performance of the location prediction algorithm by giving more weight to terms that are more predictive of a document's location. The most predictive tags are generally place names. For example, when a photo contains tags such as {`losangeles', `iphone', `ramen'}, the tag `losangeles' is more helpful for inferring the photo's location than the other tags. However, if a photo contains the tags {`losangeles', `lasvegas', `trip'}, `lasvegas' may influence the location likelihood of the photo as much as `losangeles'. One characteristic of such predictive terms is that they are spatially localized. In contrast, terms that are not place names tend to be uniformly distributed around the world. Although non-place terms may not affect the predicted location, they do affect the confidence value. Below we describe how to use feature selection to find terms associated with place names and refine the mixture model by giving them more weight.

5.2.1 Feature Selection

Place names are used widely for geotagging [Han, Cook, and Baldwin, 2014, Van Laere, Schockaert, and Dhoedt, 2013]. One popular procedure for identifying place names is to measure how well-localized the probability density function f_w(x) is. The first step in this procedure is to estimate \hat{f}_w(x) from the samples of term w. The estimation can be parametric or non-parametric. Parametric methods must choose an appropriate mathematical function to represent the distribution, for example a Gaussian function. However, many functions, including the Gaussian, are unimodal and may not describe well the distributions of terms that are multimodal. In contrast, non-parametric methods, such as the histogram, do not construct the distributions from mathematical functions specified a priori, but infer them directly from data. A histogram method starts by discretizing space into uniform regions, or bins, and calculates statistics over these bins. The value of the discretization parameter h, also called the bandwidth, determines the smoothness of the learned distribution. Once we know \hat{f}_w(x), we can estimate how well-localized the term w is by calculating its uncertainty using continuous entropy:

H_w = -\int_X \hat{f}_w(x) \ln \hat{f}_w(x)\, dx    (5.4)

As in any non-parametric smoothing application, the statistical properties of estimators such as \hat{f}_w(x) depend on the choice of the bandwidth parameter h, which will, in turn, affect the estimated entropy. An inappropriate value of h may lead to an estimator with a large bias or variance, or both.
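A small sketch makes this sensitivity concrete; the one-dimensional setting, the sample size, and the bandwidths loosely mirror Figure 5.4 and are illustrative.

import numpy as np

def histogram_entropy(samples, h):
    # Plug-in estimate of continuous entropy from a histogram with bin width h
    # (a one-dimensional version of Eq. 5.4).
    edges = np.arange(samples.min(), samples.max() + h, h)
    counts, _ = np.histogram(samples, bins=edges)
    p = counts[counts > 0] / len(samples)      # probability mass of occupied bins
    return float(-np.sum(p * np.log(p / h)))   # density within a bin is p / h

rng = np.random.RandomState(0)
wide = rng.normal(0.0, 5.0, 500)          # true differential entropy is about 3.02
print(histogram_entropy(wide, h=1.0))     # close to 3
print(histogram_entropy(wide, h=0.1))     # noticeably lower: sparse bins bias it down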
Moreover, the optimal value of the bandwidth parameter may be different for each w. Figure 5.4 illustrates how the bandwidth parameter h affects entropy. The samples are drawn from two Gaussian distributions, N(x|0, 5) and N(x|0, 0.5), with entropies 3.02 and 0.72 respectively. Using bandwidth h = 1 to discretize space, we get an entropy estimate (2.9) that is closer to its theoretical value than the one obtained using the smaller bandwidth h = 0.1 (2.6). However, for the narrower Gaussian in Fig. 5.4(b), the narrower bandwidth h = 0.1 results in a better entropy estimate (0.72) than the larger bandwidth h = 1 (0.83).

Figure 5.4: Bandwidth effect on estimated entropy. Distribution of 500 samples drawn from a Gaussian with (a) zero mean and variance five, and entropy 3.0, and (b) zero mean and variance 0.5, with entropy 0.72. The histogram approximations of the distribution in (a) with different bandwidths h lead to different estimated entropy H: (c) using h = 0.1, we get H = 2.6; (e) using h = 1, the entropy is H = 2.9. The histogram approximations of the distribution in (b): (d) using h = 0.1 gives H = 0.66; (f) using h = 1, the entropy is H = 0.83.

5.2.2 Bandwidth Selection

The observations above motivate us to develop a data-driven method to automatically and objectively select the optimal bandwidth parameter h. Using the optimal bandwidth will lead to an estimator \hat{f}(x) that is close to the true density f(x). An objective function that measures the difference between the estimator \hat{f}(x) and the true density f(x) is the mean integrated squared error (MISE), which is the squared L2 distance between them. This is the intuition behind the Least-Squares Cross Validation (LSCV) method [Wand and Jones, 1994], which finds the optimal bandwidth parameter h that minimizes the MISE [Wand and Jones, 1994]:

MISE\{\hat{f}(\cdot; h)\} = E \int \{\hat{f}(x; h) - f(x)\}^2\, dx    (5.5)

We cannot minimize the MISE directly, because we do not know the true density f(x). Instead, we expand the MISE of \hat{f}(\cdot; h) to obtain [Wand and Jones, 1994]:

MISE\{\hat{f}(\cdot; h)\} = E \int \{\hat{f}(x; h)\}^2\, dx - 2 E \int \hat{f}(x; h) f(x)\, dx + E \int \{f(x)\}^2\, dx    (5.6)

The last term is constant for all h, because f(x) is the true density. Thus, instead of minimizing Eq. 5.5, we minimize [Wand and Jones, 1994]

E\left[ \int \{\hat{f}(x; h)\}^2\, dx - 2 \int \hat{f}(x; h) f(x)\, dx \right]

The unknown f(x) still appears in the second term. However, we know that f(x) is a probability density function, so the second term is the expectation of \hat{f}(\cdot; h) with respect to f(x). We use the "leave-one-out" cross-validation criterion to approximate this term, and obtain [Wand and Jones, 1994]

LSCV\{\hat{f}(\cdot; h)\} = \int \{\hat{f}(x; h)\}^2\, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i; h)

The term \hat{f}_{-i} is the histogram of X computed with the data point X_i removed. The above equation is the general formulation of LSCV. To compute it numerically for the histogram kernel, we need to estimate the first term, which is

\int \{\hat{f}(x; h)\}^2\, dx = \sum_{i=1}^{M} \hat{f}(c_i; h)^2 A_i = \sum_{i=1}^{M} \hat{f}(c_i; h)^2 h^2

The integral of \hat{f}(\cdot; h)^2 is the area under the curve of \hat{f}(\cdot; h)^2, so we can compute this term with a Riemann sum. Here, M is the number of bins in the histogram, and A_i is the area of bin i, which is constant for every bin; since bins are square by construction, the area of each is h^2. The point c_i is the center of bin i; it could be any point of bin i, because we assume the density is uniform across the bin. LSCV then searches a finite set of bandwidth candidates, e.g., {1°, 2°, 3°, ..., 10°}, for the optimal bandwidth.

5.2.3 Weighted Confidence Prediction

We integrate feature selection into the confidence prediction framework; we call the result weighted confidence prediction. Instead of using uniform weights in the mixture model of Eq. 5.2, we give high-entropy features (those above some threshold) a low weight, because they are probably not indicative of places. For example, the entropy of the Flickr tag `socal' is H = 1.69 and the entropy of `geotagged' is H = 5.8. Thus, `socal' is more indicative of a place than `geotagged' and should be given a higher weight in the model.

5.3 Evaluation

We used Flickr and Twitter data to evaluate the performance of the proposed geotagging method. The Flickr data came from the placing task competition at the MediaEval 2013 workshop (http://www.multimediaeval.org/mediaeval2013/). The placing task asked participants to estimate the geographical locations of Flickr photos and to estimate the error of those estimates.
The training set consisted of approximately 8.5 million photos with both visual and textual features; however, we used only the textual features, namely the user-assigned tags, in our experiments. There were five test sets of different sizes (between 5,300 and 262,000 images). The test sets were developed following the Russian doll approach, in which each larger test set contains all documents of the smaller test sets. We used the third test set, with 53,000 documents, in our experiments. Some documents did not contain any textual information or contained only tags that did not exist in the training set. We filtered out these documents, because our method has nothing to go on in predicting their location. After filtering, there were 45,742 documents in the test set.

The Twitter data set contained tweets we collected using twitter4j, a Java library for the Twitter API. Tweets without geographic coordinates were excluded from the data set. We extracted entities from the collected tweets. These entities contained all the hashtags in tweets (terms starting with a `#' symbol), in addition to the entities identified in the tweet text using an entity recognition library. Entity names used by few (fewer than 18) people were also filtered out, since they were likely to be idiosyncratic or noisy. After preprocessing, the number of tweets was 12 million in the training set and 44K in the test set. The data set contained 1.8 million users in the training set and 752 in the test set.

5.3.1 Location Prediction Baseline

The multinomial Naive Bayes (NB) classifier remains the state of the art in automatically geotagging user-generated content in social media [Han, Cook, and Baldwin, 2014, Kelm, Murdock, Schmiedeke, Schockaert, Serdyukov, and Van Laere, 2013]. The performance of this simple classifier is comparable to that of a Support Vector Machine (SVM) [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009a]; hence, we use it as the baseline in our experiments. Naive Bayes-based methods start by dividing the geographic space into a regular grid, with each grid cell considered to be a unique class c. Given a document with features f and an unknown location, NB returns as the document's location the class c that maximizes the quantity P(c|f):

P(c|f) = \frac{P(c) \prod P(f|c)}{P(f_1, ..., f_n)}    (5.7)

While f may consist of any features, including visual features [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009a], in our experiments we focus solely on textual features, such as the tags of photos or the entities in tweets. The density of features for each class c, P(f|c), is calculated from data, specifically by counting the fraction of observations of each feature in c. This probability can be zero when there are no observations of f in c, which leads to a zero numerator in Eq. 5.7. A solution to this problem is to use smoothing to eliminate zero values, for example add-one smoothing, which adds one to the counts of f in all classes. More complex smoothing methods may lead to better prediction accuracy, as shown in [Serdyukov, Murdock, and Van Zwol, 2009, Cheng, Caverlee, and Lee, 2010]. Note that the discretization parameter used to partition the continuous geographical space into a grid plays an important role in prediction accuracy [Wing and Baldridge, 2011]. Following others, we use a grid size of one degree (approximately 100 km) [Wing and Baldridge, 2011].
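A sketch of this baseline with add-one smoothing follows; the grid-cell encoding, the class structure, and the toy training data are our own simplifications, not the exact implementation used in the experiments.

import numpy as np
from collections import Counter, defaultdict

class GridNaiveBayes:
    """Grid cells (1 degree by default) are classes; features are tags."""
    def __init__(self, grid_deg=1.0):
        self.grid_deg = grid_deg
        self.class_counts = Counter()               # photos per cell
        self.feature_counts = defaultdict(Counter)  # cell -> tag counts
        self.vocab = set()

    def _cell(self, lat, lon):
        g = self.grid_deg
        return (int(np.floor(lat / g)), int(np.floor(lon / g)))

    def fit(self, documents):
        for tags, (lat, lon) in documents:
            c = self._cell(lat, lon)
            self.class_counts[c] += 1
            self.feature_counts[c].update(tags)
            self.vocab.update(tags)
        return self

    def predict(self, tags):
        n = sum(self.class_counts.values())
        best, best_lp = None, -np.inf
        for c, nc in self.class_counts.items():
            total = sum(self.feature_counts[c].values())
            lp = np.log(nc / n)
            for t in tags:
                # add-one smoothing avoids zero probabilities
                lp += np.log((self.feature_counts[c][t] + 1) /
                             (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        # return the center of the winning cell
        return ((best[0] + 0.5) * self.grid_deg, (best[1] + 0.5) * self.grid_deg)

docs = [({'hollywood', 'losangeles'}, (34.1, -118.3)),
        ({'timessquare', 'nyc'}, (40.76, -73.99)),
        ({'losangeles', 'beach'}, (33.9, -118.4))]
model = GridNaiveBayes().fit(docs)
print(model.predict({'losangeles'}))   # a cell centered near Los Angeles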
5.3.2 Error Estimation Baseline

Error estimation is new to the location prediction task, and relatively few existing methods provide the error associated with the predicted location. One such method was described in [Hauff, Thomee, and Trevisiol, 2013]. This method returns a ranked list of possible locations for a document, with the top-ranked location taken as the document's predicted location. If the top n locations are distributed over the globe, the document's true location is considered to be very uncertain; hence the estimated location error is high. In contrast, if the top n locations are spatially close (e.g., with a standard deviation of a few kilometers), then the error is considered to be low. Unfortunately, [Hauff, Thomee, and Trevisiol, 2013] do not provide optimal parameters for error estimation.

The ranked list is created by storing each Naive Bayes class along with its probability P(c) \prod P(f|c). The size of the list equals the number of classes. The list is then sorted in descending order. The top n classes (locations) are chosen to compute the prediction variance. We choose n = 5 and compute the pairwise distances between these n locations. These pairwise distances are sorted, and the estimated error is the median of these distances. If the number of candidates is less than five, we set n equal to the number of candidates. If the number of candidates is one, the estimated error is assumed to be zero.

5.3.3 Result Ranking Baseline

The goal of this task is to rank location prediction results and pick a subset of documents with potentially very accurate predicted locations. In practice, the error may be too high when documents do not contain any location-indicative words; thus, we may want to predict locations for only a subset of all documents. [Han, Cook, and Baldwin, 2014] propose several heuristics for ranking predictions. The probability ratio is one of the two best-performing heuristics, along with prediction coherence [Han, Cook, and Baldwin, 2014]. In this work, we choose the probability ratio heuristic as a baseline because of its simplicity and performance. The algorithm is very similar to the variation method described in the previous section. We create a ranked list of locations predicted by the Naive Bayes classifier. The list is sorted by the probability P(c|f) in descending order. The heuristic is the ratio of the probabilities of the first and second elements in the sorted list. If there is only one element, we set the probability ratio to infinity. The intuition behind this method is the observation that when the prediction is accurate, the highest-probability class tends to have a much higher probability than the other classes [Han, Cook, and Baldwin, 2014].

Error estimation can be used for the result ranking task; thus, the variation method from the previous section is also used as a baseline in this task. However, the ranking heuristics cannot be used for the error estimation task, because it is not clear how to convert the probability ratio and other heuristics into an error value. Thus, we do not include the probability ratio as a baseline for the previous task.
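A sketch of the two ranking signals used in the experiments below, operating on a ranked candidate list, follows; the input format, n = 5, and the distance helper are assumptions.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat1 - lat2) / 2) ** 2
    b = np.cos(lat1) * np.cos(lat2) * np.sin((lon1 - lon2) / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a + b))

def probability_ratio(sorted_scores):
    # Ratio of the top two candidate scores; infinity if only one candidate.
    return float('inf') if len(sorted_scores) < 2 else sorted_scores[0] / sorted_scores[1]

def variation_error(sorted_locations, n=5):
    # Median pairwise distance (km) among the top-n candidates; 0 for a single candidate.
    top = sorted_locations[:n]
    if len(top) < 2:
        return 0.0
    dists = [haversine_km(*a, *b) for i, a in enumerate(top) for b in top[i + 1:]]
    return float(np.median(dists))

locations = [(34.05, -118.25), (34.00, -118.30), (40.71, -74.01)]
scores = [0.42, 0.40, 0.05]
print(probability_ratio(scores))        # ~1.05: the top result barely beats the runner-up
print(variation_error(locations, n=3))  # thousands of km: the candidates are spread out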
5.3.4 Placing Task Results

We evaluate how well the proposed location prediction method performs with respect to different parameter values and how accurate its predictions are compared to the baseline. The five experiments presented below summarize our findings. All experiments were carried out on both the Flickr and Twitter data sets. Following convention [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009a, Wing and Baldridge, 2011], we report the bandwidth and spatial resolution (error bound) in units of degrees and location errors in kilometers. While the spatial extent of one degree is location-dependent, for simplicity we take it to be approximately 100 km.

The first experiment tested the effect of the histogram discretization parameter, the bandwidth h, on the accuracy of the proposed method. We hypothesize that the bandwidth does not affect prediction accuracy very much. As long as the bandwidth is smaller than the spatial resolution scale r, predictions made using different bandwidths should be approximately the same. This is because the bandwidth affects only the value of the Riemann sum used to calculate the confidence, not the predicted location itself. However, a finer bandwidth should be better for two reasons. First, the Riemann sum becomes more accurate as space is discretized into finer units. Second, the discretization error decreases with bandwidth, i.e., as the grid becomes finer. To understand why, consider, for example, a 1° × 1° (h ≈ 100 km) histogram bin. When we predict the location to be within some bin, we take it to be at the center of that bin. As the true location could be anywhere within the bin, the error ranges from 0 km to approximately 71 km, or half of the grid diagonal. On the other hand, when the grid size (bandwidth) is 1 km, the discretization error ranges from approximately 0 km to 0.7 km.

Figure 5.5 reports the prediction error at three different bandwidths: h = 0.01°, h = 0.1°, and h = 0.5°. We set the spatial resolution r in these experiments to r = 1°. We compute errors as the distance between the actual and predicted locations, where the distance between two points is the geographical distance computed using the haversine formula [Sinnott, 1984]. We present the results as the cumulative probability distribution (CDF) of the errors. The results confirm our hypothesis. At the 100 km error level, prediction accuracy is about 0.6 and approximately the same regardless of grid size; that is, the algorithm locates 60 of every 100 documents with less than 100 km error. Below the 50 km error level, finer bandwidths produce more accurate results, probably due to the better accuracy of the Riemann sum. Above the 50 km error level, there are small fluctuations in the performance results, but above the 200 km error level there is almost no difference (< 0.005) in performance for different grid size parameters. Thus, we can choose h → 0 to obtain good accuracy at small error levels without sacrificing accuracy at larger error bounds. However, a small grid size increases computational time due to the larger search space and the Riemann sum computation.

In the second experiment we test the effect of the spatial resolution of the prediction on its accuracy. We hypothesize that we can maximize accuracy at different spatial granularity levels by changing the error bound r. There is no single optimal error bound for the prediction: for example, we may want to sacrifice accuracy at larger spatial resolutions to obtain better accuracy at finer resolutions. Theoretically, the error bound r leads to the maximum accuracy at that resolution, because we are maximizing confidence, i.e., the mass of the PDF in a circle of radius r. However, in practice we never know the true PDF, only its approximation \hat{f}. This may lead to discrepancies between theory and practice. Nonetheless, the results from both data sets are consistent with theory.
Figure 5.6 reports the performance of the location prediction task at six different spatial resolutions, set by the error bounds r = 0.01°, 0.05°, 0.1°, 0.5°, 1°, and 5°, and a fixed bandwidth h = 0.01°. The results on both data sets confirm our hypothesis about the relationship between the spatial resolution r and prediction accuracy at the corresponding error levels. Thus, if we want more accurate predictions at a finer-grained level, for example at a city level rather than a country level, we should set the error bound lower. For example, to maximize prediction accuracy at 10 km, we have to set the spatial resolution to r = 0.1° (10 km). However, if we are interested in getting the country level of the prediction right, we should set the error bound to a larger value, e.g., r = 5° (500 km).

In the third experiment we compare variations of the proposed location prediction method to the Naive Bayes baseline. Specifically, we compare the fixed-error-bound confidence-based location prediction method (Section 5.1), the term-weighted variant (using the feature selection of Section 5.2.1), and adaptive confidence prediction (Section 5.1.5) against the baseline. The baseline, NB, has two parameters: the grid size and the smoothing parameter. We set the grid size to 1° × 1° and smooth densities using the add-one method. The fixed confidence prediction has two parameters: the bandwidth h, which we set to 0.01°, and the spatial resolution, which we set to r = 0.5°. The weighted confidence prediction has two additional parameters: the weights and the weight threshold. We assign weight w_i = 1 to high-entropy features and w_i = 2 to all other features. We rank the features according to their entropy and treat the top 20% as high-entropy features in both data sets. The weights within a document are then normalized to sum to 1. The adaptive confidence prediction has three parameters: the bandwidth h, the maximum spatial resolution r_max, and the confidence threshold \tau. We set the bandwidth h to 0.01°, the maximum spatial resolution to r_max = 0.5°, and the confidence threshold to \tau = 0.2.

Figure 5.7 reports the results of this experiment on the two data sets. We found that the Flickr results are more consistent with our expectations than the Twitter results, because on Flickr the weighted model improves over the uniform-weight (normal) model. Despite their differences, both sets of results show similar trends. The proposed confidence prediction method outperforms the baseline significantly on Flickr data at all granularity levels. The weighted confidence prediction shows a slight improvement over the uniform-weight (normal) model. The accuracy of adaptive confidence prediction is significantly higher than that of the normal model at error levels less than r_max and slightly lower at error levels above r_max.

On Twitter, the normal confidence prediction outperforms the baseline by a large margin for errors in the 0-50 km range, and has approximately the same performance at higher error levels. The adaptive confidence prediction framework significantly improves on the mixture model at small error levels. In adaptive confidence prediction, there is a risk of losing accuracy at the larger error levels; however, we use the confidence value to help us decide when to take that risk. Again, if we really want to maximize the accuracy at 50 km, we should use the normal confidence prediction with fixed spatial resolution r = 0.5° (50 km). This is useful when we try to minimize the error as much as possible while keeping it below a given error bound.
The adaptive confidence prediction is better than both the normal confidence prediction and the baseline, with a large gap in the 0-30 km range, and has approximately the same performance as the normal confidence prediction at 40 km. It has lower performance than the normal confidence prediction at 50 km, with a very small gap (< 0.05). The median error of adaptive confidence prediction is also the best, at 15 km, while that of the normal confidence prediction is 33 km and that of Naive Bayes is 56 km.

For Flickr data, the accuracy of the normal confidence prediction is significantly higher than the baseline. The weighted mixture model leads to only a slight improvement. A possible reason for this is that the locations with the highest confidence value are approximately the same under the normal and weighted mixture models. Since non-place tags are roughly uniformly distributed, reducing their weights may not change the location of the mode. However, these tags can change the calculated confidence values.

Next, in the fourth experiment, we compare the error estimation results produced by the Naive Bayes top-n method and the proposed method. We use the Kendall tau correlation suggested by Hauff et al. [2013] to compute the correlation between the actual errors and the estimated errors. A higher correlation means better estimates. Alternatively, we can view error estimation as a decision problem: is the location error lower than a specified bound or not? We then compute prediction accuracy as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP and FP are the true and false positives respectively, and TN and FN are the true and false negatives. The accuracy metric may lead to the accuracy paradox [Brodersen, Ong, Stephan, and Buhmann, 2010], which states that the accuracy metric may misrepresent classifier performance when an imbalanced data set (one with a large difference between the numbers of positive and negative examples) is used [Brodersen, Ong, Stephan, and Buhmann, 2010]. One solution is to restore the balance between the positive and negative classes [Brodersen, Ong, Stephan, and Buhmann, 2010]. Thus, we set the decision boundary to the median of each method's estimated errors so that the numbers of positive and negative examples are the same.

Method       Data Set   Correlation   Accuracy   Median (km)
Variation    Twitter    0.2460        0.5033     56
Confidence   Twitter    0.5528        0.8067     33
Weight       Twitter    0.5556        0.8097     33
Variation    Flickr     0.2912        0.5066     102
Confidence   Flickr     0.3864        0.6789     34
Weight       Flickr     0.4221        0.6797     33

Table 5.1: Evaluating the performance of three different methods, normal confidence, weighted confidence, and the variation method, on the predicted error estimation task. The Kendall tau correlation and the accuracy metric are used to evaluate performance.

For Flickr data, the proposed confidence prediction framework is better than the baseline according to both metrics. We found that the baseline method fails to estimate the error accurately when the probabilities of the top-5 classes differ greatly but their locations are far from each other.

Error estimation can be used to rank the quality of prediction results. In some situations, we may improve prediction quality by sacrificing coverage; in other words, by filtering out documents whose locations we cannot precisely predict, we retain a small fraction of documents whose predicted locations we believe to be close to their actual locations. In the final experiment, we evaluated this aspect of the error estimation task.
We sorted predictions in an ascending order of their 66 estimated errors, with lowest error predictions at the top of the list. We then ltered out the worst predictions, i.e., the lowest-ranking 25% of the predictions. Method Data Set Acc Acc 75% Improve Variation Twitter 0.5805 0.6676 15.0 PR Ratio Twitter 0.5805 0.7305 25.8 CP Twitter 0.5892 0.7501 27.3 Weight CP Twitter 0.5892 0.7508 27.4 Variation Flickr 0.4990 0.5617 12.5 PR Ratio Flickr 0.4990 0.6028 20.8 CP Flickr 0.6420 0.7719 20.5 Weight CP Flickr 0.6546 0.7983 21.9 Table 5.2: Comparing the performances of four dierent methods: normal con- dence, weight condence, variation, and probability ratio methods on the result ranking task. All accuracies are reported at the error level = 100 km. Acc column reports the accuracy before ltering. Acc75% column reports the accuracy after ltering 25% of results. We rank the results based on the error estimation for normal and weight con- dence prediction and variation method. The probability ratio heuristic is used to rank for the PR Ratio method. Then, we select the top 75% results to see the performance improvements. All methods seem to have a positive eect on ranking results. The baseline variation has the lowest improvement on both data sets. The state-of-the-art probability ratio has a comparable performance to the proposed method in both data sets. The feature selection seems to have more impacts on the 67 ranking task than the location prediction task on the ickr data set. The weight condence prediction has the highest accuracy and improvement. 68 (a) Flickr (b) Twitter Figure 5.5: Evaluating the eect of the bandwidth parameter, h: 0:01 , 0:1 and 0:5 on the prediction error. 69 (a) Flickr (b) Twitter Figure 5.6: Evaluating the eect of the spatial resolution parameter, r: 0:01 , 0:05 , 0:1 , 0:5 , 1 and 5 on the prediction error. 70 (a) Flickr (b) Twitter Figure 5.7: Evaluating the performances of four dierent methods: normal, weight, adaptive condence prediction, and add-one smoothing Naive Bayes classier on the prediction error. 71 (a) Flickr (b) Twitter Figure 5.8: Evaluating the performances of three dierent methods: normal, weight condence prediction, variation, and probability ratio on the result ranking task. All methods lter 25% of results according to their heuristics. 72 Chapter 6 Improving Location Prediction with Temporal Constraints 6.1 Improving Accuracy with Spatial Entropy Researchers observed that performance of the location prediction task usually suf- fers from the presence of words that do not exhibit strongly localized usage patterns [Cheng, Caverlee, and Lee, 2010, Intagorn and Lerman, 2012a, Kelm, Murdock, Schmiedeke, Schockaert, Serdyukov, and Van Laere, 2013, Van Laere, Schockaert, and Dhoedt, 2011]. Feature selection techniques are used to improve prediction accuracy. In the context of location prediction, feature selection techniques are used to select localized words, which have higher predictive power. Figure 6.1 com- pares spatial distributions of terms `loganinternationalairport' and `iphone'. The term `loganinternationalairport' is strongly localized, with a single peak around (33:9471;118:4082). On the other hand, `iphone' is an example of non-localized term which appears in many places. To validate our intuition, we plot the entropy of the most localized term within each tweet vs the tweet's predicted location error. The gure shows that sometimes non-localized tags also can give low error, though they often give high error. 
By binning the points along the entropy axis and computing statistics in each bin, we found that localized (low-entropy) terms have a low mean error with a small standard deviation, whereas non-localized (high-entropy) tags have a large mean error with a large standard deviation. We conclude from this evidence that localized terms tend to give lower error on average, with smaller variance.

Several methods have been proposed to identify localized words, e.g., using a gazetteer. Unfortunately, vernacular place names are present in tweets but not in a gazetteer. Alternatively, we can identify localized words by analyzing their spatial usage patterns. However, when data-driven feature selection is used in classification, the primary issue is how to define the classes or areas. This problem is known as the Modifiable Areal Unit Problem, or MAUP, in the spatial analysis literature [Rattenbury and Naaman, 2009a]. For example, if we want to compare the terms 'losangeles' and 'lax', defining the class at the city level may lead to both terms falling into the same class. This would lead to the conclusion that both terms are localized to the same degree, which is incorrect: the term 'lax' should be more localized than the term 'losangeles'. Therefore, additional probability estimation steps are required by classification-based approaches. In this section, we show that we can compute spatial entropy directly using the GGMM framework, without additional probability estimation steps. In addition to its simplicity, this also leads to better performance than the baseline.

6.1.1 Spatial Entropy using GGMM

The uncertainty of a distribution can be measured by entropy. A strongly localized word usually has low uncertainty in the spatial domain. For example, we are quite confident that the location of 'lax' is around (33.9471, -118.4082), whereas the location of 'iphone' is uncertain. Based on this intuition, we use entropy to estimate the uncertainty of P(X|w), the spatial distribution of the term w, and then use the entropy to filter out non-localized words. Spatial entropy can be estimated directly from the Gaussian mixture model:

    H(P(X|w)) = - ∫ P(x|w) log P(x|w) dx        (6.1)

There is no analytic form for computing this entropy. Instead, we estimate it using the Monte Carlo method [Hershey and Olsen, 2007]. The idea is to draw samples x_i from the probability distribution P(x), so that the expectation of -log P(x) equals the entropy, E_{P(x)}[-log P(x)] = H(P). The Monte Carlo method is used because, according to [Hershey and Olsen, 2007], Monte Carlo sampling has the highest approximation accuracy given a large enough number of samples; however, it is computationally expensive. Better runtime approximations can be obtained with the more sophisticated approaches discussed in [Hershey and Olsen, 2007]. The Monte Carlo approximation of the entropy can be expressed as

    H_MC(P) = - (1/n) Σ_{i=1..n} log P(x_i)        (6.2)

where x_i is a sample from the distribution P(x). As the number of samples n grows large, H_MC(P) → H(P). In this case, the real data points can be used as the samples. Thus, the formulation is exactly the same as the log-likelihood computed in the EM algorithm, with a negative sign and divided by the number of data points.
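To make the estimator concrete, the sketch below shows one way Eq. (6.2) can be computed for a single tag. It uses scikit-learn's GaussianMixture as a stand-in for the thesis's own GGMM implementation, and the component count, sample size, and variable names are illustrative assumptions rather than the thesis's settings.

# A minimal sketch of the Monte Carlo entropy estimate in Eq. (6.2).
import numpy as np
from sklearn.mixture import GaussianMixture

def spatial_entropy(points, n_components=3, use_data_as_samples=True, n_samples=10000):
    """points: array of shape (n, 2) with (longitude, latitude) of one tag."""
    gmm = GaussianMixture(n_components=n_components).fit(points)
    if use_data_as_samples:
        # As in the text: reuse the real data points as Monte Carlo samples,
        # so the estimate is the negative mean log-likelihood of the data.
        samples = points
    else:
        samples, _ = gmm.sample(n_samples)
    log_density = gmm.score_samples(samples)   # log P(x_i) for each sample
    return -np.mean(log_density)               # H_MC(P) from Eq. (6.2)

# A localized tag should yield a much lower value than a non-localized one,
# e.g., spatial_entropy(lax_points) < spatial_entropy(iphone_points).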
Figure 6.1 demonstrates the difference between a low-entropy and a high-entropy entity. The spatial entropy of the term 'bostonloganinternationalairport' is only -2.7, which is low, since the term is well localized; its entropy is negative because its distribution is continuous, and all of its data points lie around the actual location of the Boston Logan Airport. In contrast, the spatial entropy of the term 'iphone' is 7.738, which is quite high.

6.1.2 Baseline with Feature Selection

Cheng et al. used feature selection to improve the performance of their location prediction method. Their feature selection approach is based on a technique introduced by Backstrom et al. [Backstrom, Kleinberg, Kumar, and Novak, 2008]. The key idea of this technique is that the geographic distribution of a word can be modeled as a unimodal distribution, controlled by two parameters: C, the spatial focus, which controls the frequency at the center, and α, the dispersion, which controls how quickly the frequency decreases as the distance from the center increases. A large value of α indicates that the word whose distribution is being modeled is well localized. On the other hand, non-localized words are expected to have α values very close to zero.

6.1.3 Data Collection and Processing

We collected data using twitter4j, a Java library for the Twitter API. The training and test sets were collected separately, through different APIs. Geographic coverage is a desirable property for training data: we want our training data to represent the actual geographic distribution of tweets. We therefore used the Twitter search API with a geographic coordinate as an argument to collect 291 million tweets for the training set. For the test set, in contrast, we prefer to collect all tweets made by a user, to demonstrate the usefulness of mobility constraints. Thus, we used the twitter4j user timeline method to collect the data of each user in the test set, resulting in 1.8 million tweets.

We define a raw tweet as T = t, where t is a tuple (id_t, u_t, c_t, x_t) containing a unique tweet ID id_t; u_t is the ID of the user who posted the tweet t; c_t is the tweet content, with up to 140 characters, which we decompose into a bag of words c_t = {w_t1, w_t2, ..., w_tn}; and x_t represents the geographic coordinates of t. The goal of the preprocessing step is to create a relation between terms w and geographic coordinates x. Formally, we process t = (id_t, u_t, c_t, x_t) into t = (id_t, u_t, {w_t1, w_t2, ..., w_tn}, x_t). Then, we normalize these multi-valued relations into atomic-valued relations, converting the relation t into r = (id_t, u_t, w_ti, x_t), where i is a term index. Tweets without geographic coordinates are excluded from the data sets. In addition, we filter out terms w_ti that do not represent entity names. All hashtags in tweets are assumed to be entities, and we also use an entity recognition library to identify entities in the text of the tweet. Entity names used by few (fewer than 18) people are also filtered out, since they are likely to be idiosyncratic or noisy terms.

In summary, entity names are extracted from the full text of tweets, and relations between entities and geographic coordinates are established in this process. Many tweets are filtered out because they do not contain geographic coordinates or popular entity names. After processing, the number of tweets is reduced to 12 million in the training set and 44K in the test set. The number of users is 1.8 million in the training set and 752 in the test set. The average number of tweets per user is approximately 6.6 in the training set and 58.5 in the test set.
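The normalization into atomic relations described above can be summarized by a short sketch. This is our illustration, not the thesis code: the record layout, field names, and the hashtag-only entity extraction are simplifying assumptions (the thesis additionally applies an entity recognition library to the tweet text).

# A minimal sketch of flattening raw tweets into atomic
# (tweet id, user id, term, coordinates) relations.
MIN_USERS = 18  # entity names used by fewer distinct users are treated as noise

def extract_entities(text):
    # Simplification: keep only hashtags as entity names.
    return [tok.lstrip('#').lower() for tok in text.split() if tok.startswith('#')]

def build_relations(raw_tweets):
    """raw_tweets: iterable of dicts with keys 'id', 'user', 'text', 'coords'."""
    rows = []
    users_per_term = {}
    for t in raw_tweets:
        if not t.get('coords'):          # discard tweets without coordinates
            continue
        for term in extract_entities(t['text']):
            rows.append((t['id'], t['user'], term, t['coords']))
            users_per_term.setdefault(term, set()).add(t['user'])
    # Keep only entity names used by at least MIN_USERS distinct users.
    popular = {w for w, users in users_per_term.items() if len(users) >= MIN_USERS}
    return [r for r in rows if r[2] in popular]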
6.1.4 Evaluation

To evaluate the effect of feature selection on location prediction, we compute the values of α and entropy for all terms and eliminate from the test set those terms whose values are below some threshold. Then, the baseline and our method are used to predict the locations of tweets using the remaining (localized) terms. The thresholds were chosen such that the baseline and our method have the same recall. For example, a recall of 0.75 means that the method cannot predict the location of 25% of the tweets, because the entropy (or α) values of all terms in those tweets are below the specified threshold.

                  GGMM                 baseline
    method        median    average    median    average    recall
    filter 75%    4.370     255.940    58.446    521.065    0.25
    filter 50%    5.659     531.574    73.703    549.986    0.50
    filter 25%    6.875     868.790    78.030    913.035    0.75

Table 6.1: Median and average errors, in km, of tweet locations predicted by the proposed method and the baseline at levels of feature selection given by the resulting recall.

Table 6.1 gives the median and average errors of the locations predicted by the baseline and by the GGMM approach. Both methods show the same trend, with average and median errors decreasing as more words are filtered out. However, GGMM combined with entropy outperforms the baseline algorithm using feature selection. If we are willing to sacrifice recall (e.g., a recall of 0.25), our approach leads to a significantly better average error than the baseline: the average error of the GGMM approach is 255.940 km, while that of the baseline is 531.574 km, roughly two times higher. This is because we compute entropy from the same distributions we use for predicting location. The baseline, on the other hand, uses a discrete multinomial distribution for the location prediction task but continuous unimodal distributions for the feature selection task. The discrepancy between these two distributions leads to lower performance in measuring randomness in the feature selection task.

6.2 Improving Accuracy with Mobility Constraints

In the previous section, we used feature selection to significantly improve the accuracy of the predicted locations of tweets, decreasing the average error from 1546.048 km to 255.940 km. However, this performance improvement came at the expense of recall: the locations of most tweets could not be predicted because they do not contain any localized words. It may seem impossible to predict the correct location of a tweet containing the terms {'flower', 'bug', 'insect', 'moth'}, since these words are not strongly associated with any location. However, if we know that this tweet was posted 10 minutes after a tweet containing the words {'castle', 'smithsonian', 'dc', 'washington'} by the same user, we are quite confident that the location should be somewhere in Washington, D.C., because mobility constraints limit how far a person can travel in ten minutes. This suggests that we can improve the accuracy of our approach by leveraging historical data from the same user. In this section, we describe the problem statement and propose a solution to improve our location prediction method.

6.2.1 Methodology

We use two additional pieces of information, the ID of the user who posted a tweet and the tweet's time stamp, to constrain the possible locations of tweets. Specifically, the location of a tweet at time k depends not only on its content but also on the location of the tweet at time k-1. Let the observations be a sequence of tweets from user u, W_u = {W_u1, W_u2, ..., W_uk}. The hidden states are the corresponding sequence of tweet locations, X_u = {X_u1, X_u2, ..., X_uk}. The problem is then to find the most likely sequence of hidden states, X_u. This problem can be modeled by a Hidden Markov model (HMM), and the solution can be obtained by the Viterbi algorithm.
An example of a Hidden Markov model is shown in Figure 6.4. We define each component of the HMM as follows.

Hidden state X: The locations of tweets are the hidden states in this problem. The hidden state space is discrete and has N possible values. The set of locations is comprised of the maximum likelihood location of each tweet. For example, suppose we process 10 tweets from a user, and only the locations x_1, x_5, x_6, and x_7 can be estimated from tweets t_1, t_5, t_6, and t_7. Then X_u is {x_1, x_5, x_6, x_7} and N = |X| ≤ |T|.

Observations W: We observe only the content of a tweet produced by a user. We define a tweet as a set of words, W_uk = {w_uk1, w_uk2, ..., w_ukn}. The feature selection method may filter out all words from some tweets; consequently, there may be no observations for some tweets.

Start probability P(X_0): Currently, the initial probability distribution is uniform, because we assume no prior knowledge about the user. This can be changed if we acquire information about users, such as their home locations.

Transition probability P(X_{k+1} | X_k): Transitions between states are limited by the user's mobility; therefore, they are functions of time and geographic distance. For example, let the maximum likelihood location of the first tweet based on its content be Los Angeles, CA, and the second location Ontario, Canada, and suppose that the time interval between the two tweets is one hour. It is unlikely that the user could have traveled from California to Ontario, Canada in one hour. The transition probability represents the change of location in the underlying Markov chain. The functional form of the transition probability is not certain, although it must satisfy certain properties, such as the first law of geography ("Everything is related to everything else, but near things are more related than distant things"); i.e., it should prefer the next location to be near the current location. In the experiments, we use an inverse logistic function as the transition probability:

    t* = dist(x_{j,k+1}, x_{i,k}) / v_c
    Δ(t) = t_{k+1} - t_k
    P(x_{k+1} | x_k) = 1 - 1 / (1 + exp(-(t* - Δ(t))))

Here x_{i,k} is location i at time k, and dist(x_{j,k+1}, x_{i,k}) is the distance between the two locations at times k and k+1. v_c is a given velocity that governs the mobility constraint, which we set to 100 km/hour. t* is the travel-time constraint implied by the distance and velocity; it is computed from the simple distance-velocity relation, assuming zero acceleration. Δ(t) is the time interval between the two tweets. When the time interval Δ(t) is much greater than the travel-time constraint t*, P(x_{k+1} | x_k) is close to 1.0. The probability decreases from 1.0 towards 0 as Δ(t) shrinks towards and then below t*, and it equals 0.5 when Δ(t) = t*, because the logistic term then evaluates to 0.5.

Emission probability P(x_{i,k}): This value represents how likely the location is. Its value depends on the words in the tweet, and it can be derived from the marginal probability formula for the joint probability of the two random variables, location x and tag w, P(x, w). P(x_{i,k}) must be normalized by dividing by the term Σ_{x_{i,k} ∈ X_k} P(x_{i,k}).

In summary, we first create a set X of likely locations for the tweets. Second, we compute the time interval Δ(t) and the distance dist(x_{k+1}, x_k) between successive tweets t_k and t_{k+1}, and compute the transition probability. Finally, we compute the emission probabilities P(X_k), whose values depend only on the words W in the tweet and the location x_k. If all the terms in a tweet are filtered out by the feature selection method, P(x_k) has a uniform distribution.
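To make the model concrete, the sketch below wires together the mobility-constrained transition probability, as reconstructed in the formula above, with a plain Viterbi decode over the candidate locations; the recurrence it implements is spelled out in the next paragraphs. This is our illustration under stated assumptions, not the thesis implementation: the function names, the NumPy formulation, and the way candidates, emission probabilities, and the distance matrix are supplied are all placeholders the caller must provide.

# Minimal sketch: mobility-constrained transition probability and Viterbi decode.
import numpy as np

V_C = 100.0  # mobility velocity in km/hour

def transition_prob(dist_km, dt_hours, v_c=V_C):
    t_star = dist_km / v_c                                    # travel time needed at v_c
    return 1.0 - 1.0 / (1.0 + np.exp(-(t_star - dt_hours)))   # high when dt >> t_star

def viterbi(candidates, times_hours, emissions, pairwise_dist_km):
    """candidates: list of N candidate locations (the hidden-state space).
    times_hours: timestamps of the K tweets.  emissions: K x N array of
    normalized P(x_{i,k}).  pairwise_dist_km: N x N distance matrix."""
    K, N = emissions.shape
    logV = np.log(emissions[0] + 1e-12)        # uniform start prior is a constant offset
    back = np.zeros((K, N), dtype=int)
    for k in range(1, K):
        dt = times_hours[k] - times_hours[k - 1]
        trans = transition_prob(pairwise_dist_km, dt)     # trans[j, i]: from state j to i
        scores = logV[:, None] + np.log(trans + 1e-12)
        back[k] = np.argmax(scores, axis=0)
        logV = scores[back[k], np.arange(N)] + np.log(emissions[k] + 1e-12)
    # Backtrack to recover the most likely sequence of locations.
    path = [int(np.argmax(logV))]
    for k in range(K - 1, 0, -1):
        path.append(int(back[k][path[-1]]))
    path.reverse()
    return [candidates[i] for i in path]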
Once the Hidden Markov model is created, we can use the Viterbi algorithm to find the most likely sequence of tweet locations, X_u = {x_1, x_2, ..., x_k}. The solution is given by the recurrence relation

    V_{i,k} = P(x_{i,k}) · max_{j ∈ J} [ P(x_{i,k} | x_{j,k-1}) · V_{j,k-1} ]

where V_{i,k} is the probability of the most probable state sequence at time k whose final state is i. Let T be the time of the last tweet. The location of the last tweet is then

    x_{i,T} = x_{i*},  where  i* = arg max_{i ∈ I} V_{i,T}

The location at time T-1 can be retrieved by saving back pointers, which is standard in the Viterbi algorithm. If the back pointers are not saved, let i be the index of the most probable state at time T; the most probable state at time T-1 can then be computed by the recurrence

    j* = arg max_{j ∈ J} [ P(x_{i,T} | x_{j,T-1}) · V_{j,T-1} ]

where j* is the index of the most probable state x_{j,T-1} at time T-1. This equation is almost the same as the one for computing V_{i,k}, but it returns the index of the most probable state at the previous time step T-1 instead of the probability V_{i,T}. The process is repeated until time step 1 is reached. The solution of this Hidden Markov model is the location at each time step k. The time complexity of this process is the same as that of the standard Viterbi algorithm, O(|T| · |N|^2), where |T| is the number of tweets and |N| is the number of states. In the worst case, the number of states equals the number of tweets, |N| = |T|, and the time complexity is approximately O(|T|^3).

6.2.2 Evaluation

We use the mobility constraints method to estimate the likely locations of tweets whose content was filtered out by the feature selection method. Figure 6.5 shows the cumulative distribution of the error of the predicted locations of tweets for different levels of filtering, using entropy as the feature selection method. The median and average errors are summarized in Table 6.2.

    approach                      median    average     recall
    Baseline                      96.922    1687.585    1.00
    GGMM                          17.042    1546.048    1.00
    Using mobility constraints:
      no filter                   26.278    918.776     1.00
      filter 75%                  19.527    850.502     1.00
      filter 50%                  17.265    811.150     1.00
      filter 25%                  28.204    923.009     1.00

Table 6.2: Comparison of the prediction errors (in km) of the proposed GGMM approach with entropy and mobility constraints, the basic GGMM, and the baseline.

Although the median and average errors are about a factor of three larger than when using feature selection alone (see Table 6.1), the mobility constraints method achieves a recall of 1.00, i.e., it predicts the location of every tweet. Thus, the main advantage of using mobility constraints is improved recall. Even the locations of some high-entropy tweets at time k can be accurately predicted. However, some cannot be predicted, because mobility cannot constrain the location when the time interval between successive tweets is very large. Differences in the number of filtered tweets yield differences in the average error. Without feature selection (using entropy), the average error of the GGMM with mobility constraints is 918.776 km, which is much lower than the average error of the basic GGMM method. However, its median error is 26.278 km, while the basic GGMM gives a median error of 17.042 km. The median error of the GGMM with mobility constraints is thus around 50% higher than that of the basic GGMM, although the difference in absolute value is small, only about 9 km. Filtering out 50% of the tweets gives the best results according to both the median and the average error.
Figure 6.1: Comparison of the spatial distributions of tweets containing (a) a low-entropy entity, 'bostonloganinternationalairport', and (b) a high-entropy entity, 'iphone'.

Figure 6.2: Scatter plot of the prediction error vs. the entropy of a tweet's representative (lowest-entropy) tag.

Figure 6.3: Using feature selection techniques to improve location prediction from tags; each line corresponds to a different recall level. (a) CDF of the prediction error of GGMM after using spatial entropy to select localized terms. (b) CDF of the prediction errors of the baseline after using the spatial variation to select localized terms.

Figure 6.4: We employ a Hidden Markov model to improve the recall of the location prediction task. A filled circle represents a tweet whose location can be predicted; an open circle represents a tweet with no well-localized words.

Figure 6.5: Cumulative distribution function of the errors produced by combining GGMM, geospatial entropy, and mobility constraints. The percentage of filtered tweets determines the level of recall.

Chapter 7

Conclusion

In this thesis, I presented a flexible probabilistic framework for geospatial data mining. This framework models each tag in continuous space as a probability density function, P(x|w), whose parameters can be estimated by analyzing a corpus of geo-referenced and annotated documents. Once the distribution P(x|w) is estimated, it can be used in a number of geospatial applications within one framework. For example, it can identify place names by looking for tags that have low entropy. To predict the location of a document with tags W, I look for the maximum of the tag distributions: x_predict = arg max_x Σ_{w ∈ W} P(x|w).

One of the central problems of this dissertation was, given a set of points of geotagged data (e.g., from social media, such as Flickr photos), to estimate their probability density function. The density function is learned from data using either parametric or non-parametric approaches. Let x_1, ..., x_n be a set of samples drawn from a continuous distribution P(x|w) associated with the continuous random variable x. Parametric approaches estimate the common density P(x|w) from samples by assuming that P(x|w) belongs to a parametric family of functions, such as Gaussian or gamma functions. In contrast, non-parametric density estimators make no assumptions about the spatial distribution of w. A range of kernel functions is commonly used in non-parametric density estimation, such as the uniform, Epanechnikov, or normal kernels, and the quality of the estimate depends less on the choice of kernel than a parametric estimate depends on the choice of family.

There are challenges for both methods. Parametric methods can be unrealistic in the applications mentioned above because I may not know the distribution family beforehand. For example, I may assume that the spatial distribution of each tag is Gaussian. However, the spatial distributions of terms are usually multi-modal and cannot be described by such unimodal functions; for example, the term 'canada' has two clumps in its distribution, and a simple distribution such as a Gaussian cannot capture this shape.

In this thesis, I proposed a solution to this problem by approximating a distribution with a mixture of Gaussians. Almost any continuous density can be approximated by tuning the means and variances of an appropriate number of Gaussian functions. It is important to choose an appropriate number of Gaussians, as a value that is too small or too large is not useful: too few Gaussians lead to underfitting, while too many lead to overfitting. I solved this problem by penalizing the model for its complexity using the Bayesian Information Criterion (BIC).
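As an illustration of this model selection step, the sketch below fits mixtures with an increasing number of components and keeps the one with the lowest BIC. It is a hedged example, not the thesis code: it relies on scikit-learn's GaussianMixture and its bic() method, and the candidate range and random seed are arbitrary choices.

# A minimal sketch of choosing the number of Gaussian components with BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_with_bic(points, max_components=10):
    """points: (n, 2) array of (longitude, latitude) samples for one tag."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        model = GaussianMixture(n_components=k, random_state=0).fit(points)
        bic = model.bic(points)   # lower BIC = better fit/complexity trade-off
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model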
In contrast, non-parametric density estimation makes no assumptions about the spatial distributions of terms. The bandwidth parameter h is a critical parameter in density estimation, because it determines the smoothness of the distribution. It is an important parameter in geospatial applications and can affect location prediction, because the mode of the PDF depends on h. One possible explanation relates to the Modifiable Areal Unit Problem [Openshaw, 1983], which states that aggregate statistics are sensitive to the choice of spatial unit used to discretize the data. The quality of a kernel estimate depends on the value of its bandwidth h: too small an h leads to undersmoothing, while too large an h leads to oversmoothing. The optimal bandwidth of each term may differ depending on its granularity; for example, the term 'california' may require a larger bandwidth h than the term 'losangeles'. To address this, I use a data-driven method, least-squares cross-validation (LSCV), to find an optimal value of h.

The next central problem of this thesis was, given the set of points of geotagged data, how can we extract useful knowledge from them? To demonstrate this, I studied the problems of place identification, learning part-of relations between places, and location prediction. The goal of place identification is to learn whether or not a tag refers to a place based on its spatial distribution. The intuition behind the approach is that tags that are place names are more localized than those that are not, and the amount of localization can be estimated from the entropy of the components of the tag. This approach allows me to learn about places that do not exist in expert-curated directories. Our method decides whether a tag is well localized by examining the continuous entropy of P(X|w); the entropy of the mixture of Gaussians can be computed by the Monte Carlo method. To evaluate the method, I implemented the proposed framework and compared it with the discrete entropy approach of Rattenbury [Rattenbury and Naaman, 2009b]. I used the GeoNames database as the ground truth about place names and standard precision and recall metrics to evaluate the performance of the baseline and the proposed model-based method on the place identification task. The precision and recall performance depends on a threshold parameter for both the proposed method and the baseline; thus, I also report aggregate metrics, including the area under the precision-recall curve (AUC), the maximum F1 score (Max F1), and the minimum classification error rate (Min CE). The proposed method performs better than the baseline.

The method can also be used to learn relations between places, for example, that 'socal' is part of 'california'. The intuition behind the approach is that the spatial distribution of the more general parent tag, e.g., 'california', should subsume the spatial distribution of the child tag, e.g., 'socal'. The proposed method decides whether a tag w1 subsumes a tag w2 by examining the cross entropy of the distributions P(X|w1) and P(X|w2).
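The sketch below shows one way such a cross-entropy test can be computed from the fitted mixtures, again using scikit-learn as a stand-in for the GGMM implementation. The decision rule and the margin parameter are our assumptions rather than the thesis's exact criterion.

# A Monte Carlo sketch of the cross-entropy subsumption test, e.g.,
# whether w1 = 'california' subsumes w2 = 'socal'.
import numpy as np
from sklearn.mixture import GaussianMixture

def cross_entropy(points_w2, gmm_w1):
    """H(P_w2, P_w1), estimated with w2's data points as Monte Carlo samples."""
    return -np.mean(gmm_w1.score_samples(points_w2))

def subsumes(points_w1, points_w2, n_components=3, margin=1.0):
    gmm_w1 = GaussianMixture(n_components=n_components).fit(points_w1)
    gmm_w2 = GaussianMixture(n_components=n_components).fit(points_w2)
    h_cross = cross_entropy(points_w2, gmm_w1)             # how well w1 explains w2
    h_self = -np.mean(gmm_w2.score_samples(points_w2))     # w2's own entropy estimate
    # Assumed rule: w1 subsumes w2 if its model explains w2's points nearly
    # as well as w2's own model does.
    return h_cross - h_self < margin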
I evaluated the cross-entropy method by comparing it against the tag-frequency subsumption approach of Schmitz [Schmitz, 2006b]. I again used the GeoNames database as the ground truth, with precision and recall as the metrics, and found that the method performs better than the baseline on the countries and continents data sets.

Finally, the proposed method can be used to predict the locations of other textual content. The intuition behind the approach is to estimate the probability density function of a document, P(x). This distribution can be interpreted as the probability of the document's location. Therefore, the best guess for the location x of a document is the mode of P(x), in other words, the location x that corresponds to the maximum value of P(x). I evaluated the method by comparing it with the Naive Bayes and mean-shift clustering approaches of Crandall [Crandall, Backstrom, Huttenlocher, and Kleinberg, 2009a]. I hid the actual locations of the documents in the test set and used the proposed method and the baselines to predict their locations, with the median error as the metric. I found that our method performs better than the baselines.

Two subproblems of location prediction are also studied in this thesis. The first subproblem is to improve the accuracy of prediction by using temporal information; for this I employ a Hidden Markov model. I evaluate the method by predicting locations with and without the temporal information, and I find that the Hidden Markov method performs better than the baseline. The second subproblem is to estimate the prediction error; with an error estimate, the end user knows the quality of a prediction. I evaluate our method against the top-n method of Hauff [Hauff, Thomee, and Trevisiol, 2013], using the Kendall-Tau correlation as the metric, and find that our method performs better than the baseline.

Geospatial information that is automatically extracted from user-generated content can be used in applications to improve the user experience, for example, by displaying spatially proximate content. It can also be used by social scientists to study geo-social questions. Finally, it can be used to learn how people conceptualize space and how they use it.

Reference List

E. Amitay, N. Har'El, R. Sivan, and A. Soffer. Web-a-where: Geotagging web content. In Proc. 27th Int. Conf. on Research and Development in Information Retrieval, pages 273-280. ACM, 2004.

L. Backstrom, J. Kleinberg, R. Kumar, and J. Novak. Spatial variation in search engine queries. In Proceedings of the 17th International Conference on World Wide Web, pages 357-366. ACM, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer, New York, 2006. URL http://www.library.wisc.edu/selectedtocs/bg0137.pdf.

K. H. Brodersen, C. S. Ong, K. E. Stephan, and J. M. Buhmann. The balanced accuracy and its posterior distribution. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 3121-3124. IEEE, 2010.

Z. Cheng, J. Caverlee, and K. Lee. You are where you tweet: A content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 759-768. ACM, 2010.

D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the world's photos. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 761-770, New York, NY, USA, 2009a. ACM. ISBN 978-1-60558-487-4. doi: 10.1145/1526709.1526812. URL http://dx.doi.org/10.1145/1526709.1526812.
D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the world's photos. In Proc. 18th Int. Conf. on World Wide Web, pages 761-770. ACM, 2009b. doi: 10.1145/1526709.1526812. URL http://dl.acm.org/citation.cfm?id=1572025.

C. De Rouck, O. Van Laere, S. Schockaert, and B. Dhoedt. Georeferencing Wikipedia pages using language models from Flickr. 2011.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38, 1977.

S. Golder and B. A. Huberman. Usage patterns of collaborative tagging systems. Journal of Information Science, 32(2):198-208, 2006.

C. Gouvêa, S. Loh, L. F. F. Garcia, E. B. Fonseca, and Wendt. Discovering location indicators of toponyms from news to improve gazetteer-based geo-referencing. In Simpósio Brasileiro de Geoinformática (GEOINFO), 2008.

B. Han, P. Cook, and T. Baldwin. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research (JAIR), 49:451-500, 2014.

C. Hauff, B. Thomee, and M. Trevisiol. Working notes for the Placing Task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, 2013.

J. R. Hershey and P. A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), IEEE Int. Conf. on, volume 4, pages IV-317. IEEE, 2007.

S. Intagorn and K. Lerman. Learning boundaries of vague places from noisy annotations. In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 425-428. ACM, 2011.

S. Intagorn and K. Lerman. A probabilistic approach to mining geospatial knowledge from social annotations. In CIKM, pages 1717-1721, 2012a.

S. Intagorn and K. Lerman. A probabilistic approach to mining geospatial knowledge from social annotations. SIGSPATIAL Special, 4(3):2-7, 2012b.

P. Kelm, V. Murdock, S. Schmiedeke, S. Schockaert, P. Serdyukov, and O. Van Laere. Georeferencing in social networks. In Social Media Retrieval, pages 115-141. Springer, 2013.

C. Keßler, P. Maué, J. T. Heuer, and T. Bartoschek. Bottom-up gazetteers: Learning from the implicit semantics of geotags. In Krzysztof Janowicz, Martin Raubal, and Sergei Levashkin, editors, GeoSpatial Semantics, volume 5892 of Lecture Notes in Computer Science, chapter 6, pages 83-102. Springer, Berlin, Heidelberg, 2009.

S. Kinsella, V. Murdock, and N. O'Hare. "I'm eating a sandwich in Glasgow": Modeling locations with tweets. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pages 61-68. ACM, 2011.

M. Kobos and J. Mańdziuk. Classification based on combination of kernel density estimators. Artificial Neural Networks - ICANN 2009, pages 125-134, 2009. doi: 10.1007/978-3-642-04277-5_13.

M. Kulldorff. Spatial scan statistics: Models, calculations, and applications. In Scan Statistics and Applications, pages 303-322. Springer, 1999.

A. R. Liddle. Information criteria for astrophysical model selection. Monthly Notices of the Royal Astronomical Society: Letters, 377(1):L74-L78, 2007.

M. D. Lieberman, H. Samet, and J. Sankaranarayanan. Geotagging: Using proximity, sibling, and prominence clues to understand comma groups. In Proceedings of the 6th Workshop on Geographic Information Retrieval, page 6. ACM, 2010.

K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, London, 1980.
P. Mika. Ontologies are us: A unified model of social networks and semantics. Web Semantics: Science, Services and Agents on the World Wide Web, 5(1):5-15, March 2007. ISSN 1570-8268. doi: 10.1016/j.websem.2006.11.002. URL http://dx.doi.org/10.1016/j.websem.2006.11.002.

D. R. Montello, M. F. Goodchild, J. Gottsegen, and P. Fohl. Where's downtown?: Behavioral methods for determining referents of vague spatial queries. Spatial Cognition & Computation, 3(2):185-204, 2003.

D. B. Neill and A. W. Moore. A fast multi-resolution method for detection of significant spatial disease clusters. Advances in Neural Information Processing Systems, 16, 2003.

S. Openshaw. The Modifiable Areal Unit Problem. Geo Books, Norwich, UK, 1983.

A. Plangprasopchok and K. Lerman. Constructing folksonomies from user-specified relations on Flickr. In Proc. 18th Int. Conf. on World Wide Web, WWW '09, pages 781-790, New York, NY, USA, 2009. ACM.

A. Plangprasopchok, K. Lerman, and L. Getoor. A probabilistic approach for learning folksonomies from structured data. In Proc. 4th ACM Web Search and Data Mining Conference (WSDM), November 2011.

A. Popescu and N. Ballas. CEA LIST's participation at the MediaEval 2012 Placing Task. In MediaEval, 2012.

T. Rattenbury and M. Naaman. Methods for extracting place semantics from Flickr tags. ACM Trans. Web, 3(1):1-30, 2009a. ISSN 1559-1131. doi: 10.1145/1462148.1462149. URL http://dx.doi.org/10.1145/1462148.1462149.

T. Rattenbury and M. Naaman. Methods for extracting place semantics from Flickr tags. ACM Transactions on the Web (TWEB), 3(1):1, 2009b.

A. Sadilek, H. Kautz, and J. P. Bigham. Finding your friends and following them to where you are. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 723-732. ACM, 2012.

M. Sanderson and B. Croft. Deriving concept hierarchies from text. In Proc. 22nd Int. Conf. on Research and Development in Information Retrieval, pages 206-213. ACM, 1999.

P. Schmitz. Inducing ontology from Flickr tags. In Proc. of the Collaborative Web Tagging Workshop (WWW '06), May 2006a. URL http://www.rawsugar.com/www2006/22.pdf.

P. Schmitz. Inducing ontology from Flickr tags. In Collaborative Web Tagging Workshop at WWW2006, Edinburgh, Scotland, volume 50, 2006b.

P. Serdyukov, V. Murdock, and R. Van Zwol. Placing Flickr photos on a map. In Proc. 32nd Int. Conf. on Research and Development in Information Retrieval, pages 484-491. ACM, 2009. doi: 10.1145/1571941.1572025. URL http://dl.acm.org/citation.cfm?id=1572025.

R. W. Sinnott. Virtues of the Haversine. Sky and Telescope, 68:158, 1984.

P. Smyth. Clustering using Monte Carlo cross-validation. In Proc. Second Int. Conf. on Knowledge Discovery and Data Mining, pages 126-133, 1996.

O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 48. ACM, 2011.

O. Van Laere, S. Schockaert, and B. Dhoedt. Georeferencing Flickr resources based on textual meta-data. Information Sciences, 238:52-74, 2013.

M. P. Wand and M. C. Jones. Kernel Smoothing, volume 60. CRC Press, 1994.

B. P. Wing and J. Baldridge. Simple supervised document geolocation with geodesic grids. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 955-964. Association for Computational Linguistics, 2011.

M. Yan and K. Ye. Determining the number of clusters using the weighted gap statistic. Biometrics, 63(4):1031-1037, 2007.
Y. Yang and G. I. Webb. A comparative study of discretization methods for naive-Bayes classifiers. In Proceedings of PKAW, volume 2002. Citeseer, 2002.

Y. Yang and G. I. Webb. Discretization for naive-Bayes learning: Managing discretization bias and variance. Machine Learning, 74(1):39-74, 2009. doi: 10.1007/s10994-008-5083-5. URL http://www.springerlink.com/index/e30q39mt02g47208.pdf.
Abstract
Geography plays an important role in the evolution of online social networks and the creation of user-generated content, including photos, news articles, and short messages or tweets. Some of this content is automatically geotagged. Useful knowledge can be extracted from geotagged social media documents, such as place boundaries and relations between places. Web applications could use this geographic knowledge to assist people with geospatial information retrieval, such as finding cheap hotels in a specific location, or local news in a specific area.

This dissertation describes a method for extracting geospatial knowledge from user-generated data. The method learns a geospatial model from data using statistical estimation techniques and then applies the model in tasks that include place identification, geofolksonomy learning, and document geolocation. To that end, I make three contributions in the field of geospatial data mining and statistical geolocation.

First, I address the question of how to model the underlying processes that generate the locations of social media documents. I study alternate methods for estimating the models from data and discuss methods for approximating these models. This model approximation can be used to improve the quality of results in applications of interest such as location prediction. I also study the effects of statistical estimation methods on the quality of the extracted geospatial knowledge. The probability density estimation methods can be categorized into two main approaches: parametric and non-parametric. Parametric approaches estimate the common density f from samples by assuming that f belongs to a parametric family of functions, such as Gaussian or gamma functions. Non-parametric approaches, on the other hand, make no assumptions about the distribution of f. I discuss the challenges of both approaches and possible solutions to these challenges.

Second, I develop the probabilistic framework to extract geospatial knowledge from the learned models. The probabilistic framework is general and flexible and can be used in a variety of geospatial data mining applications. I evaluate its performance on three knowledge extraction tasks. In the first task, place identification, I classify terms as place names or not place names based on their spatial distributions. In the second task, location prediction, I attempt to predict the locations of documents such as photos or tweets given their terms. In the third task, I use the probabilistic framework to learn relations between terms.

Finally, I develop methods for error estimation of extracted geospatial knowledge, particularly for the location prediction task. When automatically geotagging documents, it is also useful to give the error associated with the predicted location. Error estimation can be used to rank the quality of prediction results. In some situations, we may improve prediction quality by sacrificing coverage.