TAG BASED SEARCH AND RECOMMENDATION IN SOCIAL MEDIA
by
Sang Su Lee
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2011
Copyright 2011 Sang Su Lee
Table of Contents
List of Tables
Abstract
Acknowledgements
1 Introduction
2 Related Work
3 Tag-Geotag Relations
  3.1 Introduction
  3.2 Approach
    3.2.1 Tag Similarity Calculation
    3.2.2 Geographical Cluster Calculation
    3.2.3 Geographical Distribution Similarity (GDS) Calculation
  3.3 Experimental Evaluation
    3.3.1 Experiment
    3.3.2 Evaluation
  3.4 Conclusion
4 Dynamic Tag-Based Recommendation
  4.1 Introduction
  4.2 Approach
    4.2.1 Tag Pruning
    4.2.2 LDA-based topic modeling over time
    4.2.3 Time-based Similarity Weight Calculation
    4.2.4 Recommendation System
  4.3 Experimental Evaluation
    4.3.1 Data Retrieval and Preprocessing
    4.3.2 Precision Evaluation of Static and Dynamic Systems
    4.3.3 Top-K Precision Evaluation of Static and Dynamic Systems
    4.3.4 Rank Relevance
  4.4 Conclusion
5 Discussion
  5.1 Contributions
  5.2 Future Work
Bibliography
List of Tables
3.1 Example Raw Data from Flickr
3.2 Relevant Tags for "newyork"
3.3 Geographical Clusters for "newyork" and "newyorkcity"
3.4 Tag Similarity-GDS List for "newyork"
4.1 Symbols for LDA
List of Figures
3.1 Tag-Location Assignment
3.2 Overlapped Regions among Two Different Clusters
3.3 Distribution of log(SIM(x,y)) and log(GEO-SIM(x,y))
4.1 Illustration of Approach
4.2 LDA model for item and tag modeling over time
4.3 Precision Rates over Different Settings
4.4 Similarity Weights over Time
4.5 NDCG over Different Settings
Abstract
Social media, unlike traditional media, facilitate direct and real-time interaction among
users. The increase in interaction results in massive amounts of data, which require
appropriate categorization so that users can find the specific information they need. Tra-
ditional categorization by a few moderators cannot handle the massive amount of infor-
mation being generated. Instead, tags, which are free-format keywords or terms created
by users to describe content, are best able to categorize information in social media and
to cope with the large amount of data in a timely manner. Each tag may not perfectly
describe the content, but the aggregated tags reflect the knowledge of multiple users and
result in a taxonomy for categorization.
This categorization method can locate appropriate information for users, via search
and recommendation functions. Search functions allow users to locate appropriate infor-
mation from within the entire dataset, and recommendation functions involve the system
actively suggesting appropriate information for the user. The performance of search and
recommendation functions is degraded by the ambiguity inherent in the free format of
tagging. To resolve this ambiguity and to propose better search and recommendation
results, extra information, such as time and location, should be used. In this thesis, we
describe how tags and extra information can be used for search and recommendation
functions.
We present an analysis of the correlation between tags and geographical identifica-
tion metadata, or geotags. To make the analysis of geotagging and tagging possible,
we prove that there is a strong correlation between these two types of information. Our
approach uses similarity between tags and geographical distribution to determine inter-
relationships between tags and geotags. From our initial experiments, we show that the
power law is established between tag similarity and geographical distribution similar-
ity. We also present a system for recommendations that uses a modified latent Dirichlet
allocation model in which users and tags associated with an item are represented and
clustered by topics, and the topic-based representation is combined with the item's time
stamp to show time-based topic distribution. By representing users via topics, the model
can cluster users to reveal common interests. Based on this model, we developed a rec-
ommendation system that reflects both user and group interests in a dynamic manner
that accounts for time.
This thesis contributes to the understanding of the use of tagging and improves the
use of tagging in social media. In addition, this thesis provides guidance on user behav-
ior analysis in social media.
Acknowledgements
First and foremost, I would like to express my heartfelt gratitude to Professor Dennis
McLeod, my research advisor and dissertation committee chair, for his support and
guidance in my Ph.D. study. I am grateful to him for giving me the freedom to explore
broadly, and for providing me invaluable advice during my research. I am also grateful
to the other members of my qualifying exam and dissertation committee: Professor Ellis
Horowitz, Professor Kristina Lerman, Professor Aiichiro Nakano, and Professor Larry
Pryor.
I am grateful to my colleagues, Dongwoo Won, Tagyoung Chung, and other members
of the Semantic Information Research Laboratory. I appreciate the invaluable feedback on
my research and dissertation from them. I would like to thank all my other friends for
their friendship, encouragement and advice. Finally, and most importantly, I would like
to thank my family members. I am deeply indebted to my parents for their love and
support. Without them this dissertation would never have come into existence.
Chapter 1
Introduction
Social media is an online service and media platform on which users can share ideas,
experiences, and perspectives as well as build social relationships with other users.
Social media includes many types of information and content, such as blogs (e.g., Blogger),
microblogs (e.g., Twitter), bookmarks (e.g., del.icio.us), photographs or images
(e.g., Flickr), video (e.g., YouTube), social networking (e.g., Facebook), and others. In
social media, users can participate in creating and extending content, as opposed to tra-
ditional media (e.g., newspapers, television, and radio) in which content is generated by
designated producers. Users of social media are content consumers as well as produc-
ers. This bidirectional participation is one of the main characteristics that distinguishes
social media from traditional media. Users of social media need no authority or special-
ized knowledge to produce new content.
Users of social media can react to events or content much more quickly than they
could with traditional media. In traditional media, there is a time gap between the pro-
duction and the consumption of content. In social media, there is no such gap. Producers
of social media can interact immediately with consumers as soon as they produce their
content. This quick interaction also facilitates updating content. In traditional media,
it is difficult to edit or update content after it has been produced and consumed by the
public, even when errors are found. In social media, content can be updated by other
users' comments or replies.
These characteristics enable users to create content easily and result in an enormous
amount of user-generated content being available on social media.
To navigate and browse this large amount of content effectively, the categorization of
content in social media is indispensable. Traditionally, content has been categorized and
classified by a few authorized domain experts. In social media, however, this traditional
method is not adequate. The amount of content in social media is much larger than that
of traditional media, and a limited number of experts cannot categorize all the content
effectively. By its nature, social media is capable of publishing instant responses to time-
sensitive events, and this rapid pace makes categorization by a limited number of experts
difficult if not impossible. This centralized method of categorization is unsuitable for
social media.
The characteristics of social media require a scalable method of categorization that
can handle massive amounts of content in a timely fashion. Tagging is a possible alterna-
tive method of categorization that is better-suited for social media than the expert-based
centralized method. Tags are keywords or terms that describe content and are generated
by the owner of the content or by a user who understands the content. Tagging is used
in different types of social media.
Tagging by users may not accurately describe or categorize the content, because not
all users are experts in the specific content. However, as more users annotate the content
with tags, the overall descriptiveness of the tags may become more accurate. This can
be explained as regression toward the mean. At first glance, each tag may explain
the content quite differently, but a large number of tags will cause the explanation to
converge in one direction. Also, users tend to mimic the tag vocabulary of other users,
which also affects the convergence of the content description via tagging.
The main characteristic of tagging is flexibility. Each user can add tags to the content
and build a hierarchy of content based on his or her perspective. This flexibility lowers the bar
for tagging, but at the same time makes it difficult to search for content via tags. Also,
tags do not have a rigid format. Users can annotate content with tags in any format they
want. This free format is an advantage in that it facilitates user adaptation to tagging.
On the other hand, it is the main cause of homonyms (the same tags used with differ-
ent meanings) and synonyms (multiple tags for the same concept), which may lead to
inappropriate connections between items and inefficient searches for information about
a subject.
Tagging enables the timely scalable categorization of content in social media, but
the inherent ambiguity of tagging leads to inappropriate search results. Much research
has been done to reduce the drawbacks of tag-based searching while maintaining the
advantages. Some researchers [15] built automated taxonomies or hierarchies based on
the non-hierarchical tags of users, and others [38] used extra information to reduce the
ambiguity of tagging. Location and temporal information are good examples of extra
information that can be used in tagging. When content is published, we can track the
location from which it was published and the time of publication. This information
helps to reduce the ambiguity of tagging. For example, content with a tag of "jaguar" is
more likely to be related to an animal if the content is published and tagged from the San
Diego Zoo. Content with a tag of "usopen" is more likely to be related to the US Open
Tennis Championships if the content is published in August or September. If the content
is published in June, it is more likely to be related to the US Open Golf Championships.
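The disambiguation idea in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method developed in later chapters: the SENSES table, its field names, and the simple additive score are hypothetical, and only the senses named in the text are listed.

```python
from datetime import date

# Hypothetical sense table (an assumption for illustration): each ambiguous
# tag maps to candidate senses, and each sense lists the publication months
# and places that make it more likely.
SENSES = {
    "usopen": [
        {"sense": "US Open Tennis Championships", "months": {8, 9}, "places": set()},
        {"sense": "US Open Golf Championships", "months": {6}, "places": set()},
    ],
    "jaguar": [
        {"sense": "animal", "months": set(), "places": {"San Diego Zoo"}},
    ],
}


def disambiguate(tag, published, place=None):
    """Return the sense best supported by time and location, or None."""
    best_sense, best_score = None, 0
    for candidate in SENSES.get(tag, []):
        score = 0
        # One point when the publication month matches the sense.
        if published.month in candidate["months"]:
            score += 1
        # One point when the publication place matches the sense.
        if place is not None and place in candidate["places"]:
            score += 1
        if score > best_score:
            best_sense, best_score = candidate["sense"], score
    return best_sense
```

When neither the time stamp nor the location supports any listed sense, the function returns None, which reflects that the tag alone remains ambiguous.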
Time stamps and location-awareness are important technical features that differen-
tiate social media from traditional media. Twitter is a good example of a social media
platform that takes advantage of communication in real-time. Twitter users can broad-
cast, in a 140-character format, what they are thinking, what they are doing, and where
they are to all the other users. Tweets are broadcast as soon as they are written, which
allows users to share ideas immediately and facilitates more interaction among users.
Location-awareness helps to specify interests and preferences based on location. Users
of social media are able to refine their searches for information or content based on
location, which helps them find appropriate information.
So far, we have discussed how tag-based searching helps users find the most appro-
priate content and information. Searching in social media is an active process in which
users deliberately seek out and locate appropriate information. If users do not try to
find information, the system cannot provide it. Search functions cannot help nonac-
tive users who use social media but do not try to search or browse content. To provide
appropriate information to these nonactive users, we need the second functionality, rec-
ommendations. Recommendations provide potentially preferable information to users.
They can help users for whom search functionality fails to provide information. Gener-
ally, recommendations for each user are based on information from that user’s profile. In
tagging, each user’s tag vocabulary builds the user profile. Like tag-based searches, tag-based
recommendations require extra information to provide more accuracy and specificity.
Researchers have worked on tag-based recommendation systems that use several
sources of extra information [29, 12].
We have discussed the characteristics of tagging and the use of tagging in social
media. In social media, tagging is able to contribute to content searches as well as
content recommendations. But the ambiguity of tagging makes search results and rec-
ommendations less accurate. To reduce the ambiguity in tagging and to improve search
results and recommendations, extra information, such as time and location, is needed.
Based on this understanding of tagging in social media, we will continue to discuss
using extra information in tag-based searches and recommendations in the next chap-
ters. More specifically, in Chapter 3, we will discuss how extra information, such as
location, can be related to tagging based on the assumption that extra information can
reduce the ambiguity of tagging and thus improve search results. In Chapter 4, we will
discuss how extra information, such as time, can improve tag-based recommendations.
By doing so, we cover the feasibility of tagging in terms of searches and recommenda-
tions and also cover how time and location information can improve the feasibility of
tagging.
Chapter 3 presents an analysis of the correlation between tags and geographical identification
metadata, or geotags. Despite the increased use of geotagging in collaborative
tagging systems, most current research focuses on textual tagging alone as a solution
to the problem of tag-based searching. This could make it difficult to search for pre-
cise and relevant information within the given tag space. For example, inconsistencies,
such as polysemy, synonyms, and word inflections with plural forms, complicate tag-
based searching. Therefore, an analysis of how geotag information works with existing
tagging information is needed. To make this analysis possible, we prove that there is
a strong correlation between tagging and geotagging information. Our approach uses
tag similarity and geographical distribution similarity (GDS) to determine interrelation-
ships between tags and geotags. From our initial experiments, we show that the power
law is established between tag similarity and GDS; this means that tag similarity and
GDS have a strong correlation, which can be used to find more relevant tags in the tag
space. The power law confirms that there is an increased relationship between tagging
and geotagging and that the increased relationship is scalable with the amount of tags
and geotags.
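A power-law relationship appears as a straight line when both quantities are plotted on logarithmic axes. The sketch below assumes the relationship has the form GDS ≈ c · SIM^alpha, where c and alpha are hypothetical symbols not defined in this passage, and recovers them with an ordinary least-squares fit in log-log space. The input values are synthetic and are not data from the thesis.

```python
import math


def fit_power_law(sim, gds):
    """Fit log(gds) = log(c) + alpha * log(sim) by least squares.

    Returns the pair (c, alpha). All inputs must be positive.
    """
    xs = [math.log(s) for s in sim]
    ys = [math.log(g) for g in gds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                       # slope of the log-log line
    c = math.exp(mean_y - alpha * mean_x)   # intercept mapped back from log space
    return c, alpha


# Synthetic values constructed to follow gds = 2 * sim ** 0.5 exactly.
sim_values = [0.01, 0.04, 0.16, 0.64]
gds_values = [2 * s ** 0.5 for s in sim_values]
c, alpha = fit_power_law(sim_values, gds_values)
```

On real measurements the points would scatter around the fitted line, and the quality of the fit would indicate how well the power-law assumption holds.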
Chapter 4 presents a novel technique for content recommendation within social
media that matches user and group interests over time. Users often tag items in social
media with words and phrases that reflect their preferred "vocabulary." As such, these
tags provide succinct descriptions of the content, implicitly reveal user preferences, and,
because the tag vocabulary of users tends to change over time, reflect the dynamics of
user preferences. Based on evaluation of user and group interests over time, we present
a recommendation system that uses a modified latent Dirichlet allocation (LDA) model
in which users and tags associated with an item are represented and clustered by topics,
and the topic-based representation is combined with the item's time stamp to show time-based
topic distribution. By representing users via topics, the model can cluster users
to reveal group interests. Based on this model, we developed a recommendation system
that reflects user as well as group interests in a dynamic manner that accounts for time,
allowing the system to perform better than static recommendation systems in terms of
precision rate.
Our work makes several meaningful contributions. First, we show how extra infor-
mation, such as location, can be related to tagging information to improve the effective-
ness of tagging. Then, we show how extra information, such as time, can improve the
effectiveness of tagging. Chapter 2 discusses related work, Chapter 3 describes the rela-
tion between tag and location for tag-based searching, Chapter 4 describes the tag-based
recommendation system with time information, and Chapter 5 concludes our work and
presents extensions for further research.
Chapter 2
Related Work
We have investigated several related studies while developing our approach. First, we
will discuss articles about the structure of the collaborative tagging system. Then, we
discuss an article regarding social analysis for tagging behavior. Then, special attention
is paid to articles focusing on tag similarity using various techniques and to an article
about geo-tagging in collaborative tagging systems. Then, we investigate several studies
on tag recommendations. Finally, we examine more directly related research that uses
tags as the basis for recommending items.
Two intriguing articles analyze the current tagging framework [9, 23]. Both articles
attempted to classify tags and to investigate the social aspects of tagging. One of them
[9] asserted that tagging is a kind of social activity, because tag usage is stabilized by
imitation and shared knowledge. For example, users of a social bookmarking Web site
called del.icio.us can imitate other users' tag choices, which makes tagging into a social
activity. Also, another article [23] referred to social incentives that demonstrate the
communicative nature of tagging. The authors showed how the increase in number of tags
is proportional to the increase in number of contacts. Furthermore, they have shown
the relationship between affiliation and formation of tag vocabulary by showing how
users linked by social contacts use similar tag vocabularies, which further suggests that
tagging is related to social activity, as other authors [9] have pointed out. Both studies
indicated that users must be taken into consideration when analyzing tagging systems.
The social aspect of tagging systems is further investigated in another study [20].
This work focused on social psychological aspects of tagging behavior in del.icio.us. It
described the relation between the user's annotative tendency and the degree of perceived
social presence, which is the key concept of this approach. Many human social activities
are carried out because of social position and association. This point of view can be
applied to the analysis of collaborative tagging systems. Users who recognize other
users in online communities are more likely to tag resources more precisely and actively.
Some interesting studies have been done to solve the problem of tag-based searching
by finding tag relations [4, 2, 15]. Brooks and Montanez [4] induced a hierarchy of tags
by using data from Technorati. They used an agglomerative clustering technique
to iteratively cluster similar blog articles by using a cosine similarity metric. Begelman
et al. [2] proposed a technique similar to spectral clustering to generate tag clusters in
del.icio.us. Several small graphs of tag relations are made from clustering a big graph by
using tag similarity. Heymann and Garcia-Molina [15] suggested creating hierarchical
taxonomies of tags that are aggregated into tag vectors by using a cosine similarity
metric. Schmitz [27] generated an ontology of tags in Flickr. A subsumption-based
statistical model is adapted to generate a graph of possible parent-child relationships. All
of these studies used different ways to find tag similarity, but they all have one thing in
common: They all tried to find tag similarity based on the co-occurrence of tags from
resources.
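The common idea behind these studies, tag similarity derived from co-occurrence on shared resources, can be sketched as follows. This is a generic illustration and not the procedure of any one cited study: the function names and the toy resources are assumptions, and each tag is represented simply by the counts of the other tags it appears with.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(resources):
    """Map each tag to a Counter of the tags it co-occurs with on a resource."""
    vectors = defaultdict(Counter)
    for tags in resources:
        unique = set(tags)  # count each tag at most once per resource
        for tag in unique:
            for other in unique:
                if other != tag:
                    vectors[tag][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy resources (hypothetical): each inner list holds the tags of one item.
resources = [
    ["newyork", "manhattan", "skyline"],
    ["newyorkcity", "manhattan", "skyline"],
    ["cat", "pet"],
]
vectors = cooccurrence_vectors(resources)
```

In this toy data, "newyork" and "newyorkcity" never appear on the same resource, yet they share the same co-occurring tags, so their cosine similarity is close to 1, while "newyork" and "cat" share none and score 0.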
[19] is one of the few studies available on geotagging in collaborative tagging systems.
It uses disparate information (tags, the location information of photos, and the photos
themselves) to generate information, such as representative photos in certain areas.
The authors used a location-driven approach to generate aggregate knowledge in the
form of representative tags for arbitrary areas in the world. They also used a tag-driven
approach to automatically extract place and event semantics for Flickr tags based on
each tag's metadata patterns. Based on the extracted patterns, vision algorithms are used
with greater precision. The significance of this study is that it is the first to gener-
ate information from tagging, geotagging, and photos. However, this study extracted
knowledge separately, and therefore does not express compound information, such as
tagging and geotagging information together.
Some interesting studies have been done on tag recommendations. One tag
recommendation approach [31] targets a specific domain, Flickr. The authors categorize tags
into several groups: location, facts or objects, people or groups, actions or events, and
time. Then they propose a tag recommendation system for Flickr based on their analysis.
Given a photo with user-defined tags, similar (co-occurred) tags are retrieved and ranked
by promotion functions. The promotion functions derived from the analysis are used to
promote more descriptive tags for the recommendation because the descriptive tags are
more related to the contents of the photo and give a richer semantic description of the
photo. Song et al. [32] proposed a framework for real-time tag recommendation. The
tagged training documents are treated as triplets of (words, docs, tags) and are repre-
sented in two bipartite graphs, which are partitioned into clusters by Spectral Recursive
Embedding. Tags in each topical cluster are ranked by an algorithm. A two-way Poisson
Mixture Model is proposed to model the document distribution into mixture components
within each cluster and to aggregate words into word clusters simultaneously. A new
document is classified by the mixture model based on its posterior probabilities so that
tags are recommended according to their rank. Symeonidis et al. [34] created a tag rec-
ommendation model by performing a three-dimensional analysis on the social tagging
data (users who add tags to a set of resources) and by modeling them with a three-order
tensor and Singular Value Decomposition techniques. Si et al. [29] proposed a scal-
able and real-time method for tag recommendations. They modeled documents, words,
and tags by using a tag-LDA model, which extends the LDA model by adding the tag
variable. The tag-LDA model allowed for real-time inferences about the likelihood of
assigning a tag to a new document, and the authors used that likelihood to generate
recommended tags.
We have investigated several studies regarding tag-based item recommendation. Sen
et al. [28] extended the capability of the current movie recommendation systems by
using tags as an indicator of user preferences. To recommend movies, the authors
inferred user preferences for tags from user-movie interactions, such as movie ratings
and clicks or users' tagging behavior, then used the inferred tag preferences to make
movie recommendations. We, on the other hand, examined the nature of tag-preference
inferences despite having insufficient information, such as user movie ratings, to vali-
date the tag preferences.
Tso-Sutter et al. [35] proposed a generic method that allows tags to be incorporated
into standard collaborative filtering (CF) algorithms by reducing the three-dimensional
correlations to three correlations of two dimensions each and then applying a fusion
method to re-associate these correlations. They then investigated the effect of incorporating
tag information into different CF algorithms. Their empirical evaluation of three CF
algorithms using real-life datasets showed that incorporating tags into their proposed
approach provides promising and significant results. They also incorporated tagging
with existing rating information. Based on rating information, Zhen et al. [40] proposed
a novel framework called tag-informed collaborative filtering (TagiCoFi) to integrate
tagging information into the CF procedure. When they used tagging information to reg-
ularize the matrix factorization of Probabilistic Matrix Factorization [26], their experi-
mental results showed that TagiCoFi outperforms its counterpart, which discarded the
tagging information even when it was available. Their approach used tags as well as
ratings to generate recommendations.
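The reduction of three-dimensional tagging data to three two-dimensional correlations, described above for Tso-Sutter et al. [35], can be pictured with a short sketch. It assumes tagging data arrive as (user, item, tag) triples and shows only the projection step. The fusion step is not shown, and the function name and sample data are hypothetical.

```python
def project_triples(triples):
    """Split (user, item, tag) assignments into three sets of pairs."""
    user_item = {(user, item) for user, item, _ in triples}
    user_tag = {(user, tag) for user, _, tag in triples}
    item_tag = {(item, tag) for _, item, tag in triples}
    return user_item, user_tag, item_tag


# Hypothetical tag assignments for illustration only.
triples = [
    ("alice", "photo1", "newyork"),
    ("alice", "photo1", "skyline"),
    ("bob", "photo1", "newyork"),
]
user_item, user_tag, item_tag = project_triples(triples)
```

Each projection can then be handled by a standard two-dimensional CF technique, which is the motivation for the reduction.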
Other research has focused on building a hierarchical structure from tags to solve
the problem of low levels of information sharing caused by the free-style vocabulary
of tags and the long tails of the distribution of tags and items. Liang et al. [21] pro-
posed an approach that integrates the social tags given by users and the item taxonomy
with standard vocabulary and a hierarchical structure provided by experts to make per-
sonalized recommendations. Specifically, Liang et al. [21] proposed using the multiple
relationships among users, items, and tags to find the semantic meaning of each tag
for each user individually, thus determining the relevant tags of each item and the tag
preferences of each user. Other research has examined the use of user- and item-based
CF combined with the content-filtering approach. De Gemmis et al. [6] investigated
whether folksonomies might be valuable sources of information regarding user inter-
ests and might contribute to a strategy that enables a content-based recommender to
infer user interests by applying machine-learning techniques on both the "official" item
descriptions provided by a publisher and on the tags that users adopt to freely annotate
relevant items. A folksonomy is a classification system that emerges from collaboratively
created and managed tagging information. They found that such a strategy allows static content
and tags to be preventively analyzed by advanced linguistic techniques to capture the
semantics of user interests often hidden behind keywords. Their proposed approach,
which has been evaluated in the context of cultural heritage personalization, has been
shown to improve the predictive accuracy of the tag-augmented recommender compared
with a pure content-based recommender. The main weakness of their evaluation is that
the experiment involved only 30 real users, a sample too small to support statistically
sound generalization.
The approach taken by De Gemmis et al. [6] differs from our own in terms of the
data sources used to recommend resources. Our approach uses neither official descrip-
tions nor unofficial tags. Guan [11] considered the problem of document (e.g., Web
pages and research papers) recommendation using only tagging data; that is, only data
containing users, tags, documents, and the relationships among them. Guan proposed
a graph-based representation learning algorithm for this purpose according to which
the users, tags, and documents were represented in the same semantic space in which
two related objects were close to each other. For a given user, the algorithm recommends
those documents that are sufficiently close to that user. Siersdorfer et al. [30] proposed a formal
model to characterize users, items, and annotations within social networks to fulfill their
goal of constructing a social recommendation system that predicts the utility of items,
users, or groups based on the tagging vocabulary of a given user. To do so, they imple-
mented a framework using the LDA model, just as we do, but they did not consider time
factors in recommending items, users, or groups to a given user. Guo et al. [12] pro-
posed a recommendation system based on a probabilistic generative model for tagging.
They introduced a modified LDA model, which is used to cluster the tags and users,
to generate user- as well as group-interest information from the LDA model, and used
that information to recommend items to users. Although similar to our approach, their
LDA model did not model topic distributions over time or consider other time factors in
making recommendations.
Chapter 3
Tag-Geotag Relations
3.1 Introduction
User-generated tags are being used more and more often. A tag is a relevant keyword
or term that is associated with or assigned to a unit of information. A tag describes
the item and enables keyword-based classification of the information to which it is
related. Tagging is acknowledged as a useful way to accumulate and categorize information
(e.g., bookmarks, blog posts, articles, photos, and videos). Often, users can annotate
resources with no restriction in format and no limit on the number of tags for each
resource. These characteristics make tagging easy for regular users. Another benefit
of tagging systems is vocabulary enhancement [23]. This is aided by the shared tag
dataset generated by numerous users. It can also reduce the burden of building compre-
hensive and accurate metadata. In spite of these benefits, tagging systems have a critical
limitation. One characteristic of tagging system is the lack of a rigid format. This may
produce the following inconsistencies:
1. polysemy, words with multiple related meanings (e.g., a window can be an oper-
ating system or a sheet of glass)
2. synonyms, multiple words with the same or similar meanings (e.g., TV and tele-
vision, Netherlands/Holland/Dutch)
3. word inflections with plural forms (e.g., "cat" versus "cats")
These inconsistencies hinder users from finding appropriate resources by keywords.
To overcome these drawbacks, researchers are trying to find the relations among tags.
Relations among tags can link synonyms and provide a way to
classify polysemous tags into several subgroups. One study focused on Flickr, attempting to
cluster tags for user convenience.
Tag relation, however, still has a deficiency. Although tagging systems are evolving,
tag relation does not reflect these changes, especially the new function called geotag-
ging. Geotagging is the process of adding geographical identification metadata to vari-
ous media, such as Web sites, RSS feeds, or images. Recently, geotagging has been used
widely in collaborative tagging systems. However, geotagging information has not been
included in analyses to improve tag relations. We believe that adding geotagging infor-
mation to retrieve new relationships among tags will make current tag relations more
precise and relevant. To support this, our study focuses on finding strong relationships
between tagging and geotagging.
We will show three steps to confirm our approach. We present tag similarity based
on cosine similarity and point-wise mutual information to express similarities among tag
pairs. Then, we calculate the geographical clusters for each tag based on the k-means
and k-means++ algorithms [1] to lower the sum of squared errors in cluster creation.
After the creation of the geographical clusters, we calculate the GDS for clusters. The
rest of this chapter includes the algorithm for generating geographical clusters for each
tag, and the algorithm to calculate the GDS of tags from these clusters in Section 3.2; an
evaluation of our approach for finding the correlation between tag similarity and GDS
in Section 3.3; and a conclusion and recommendations for future work in Section 3.4.
3.2 Approach
The approach taken in this research consists of three parts. The first part is calculating
tag similarity to discover tag relations. The second part is building geographical clusters
with tags. The last part is calculating the geographical distribution similarity (GDS) for
the geographical clusters of each tag.
3.2.1 Tag Similarity Calculation
Each photo has related tags, through which the user describes the characteristics of the
photo. From photo-tag information, we create the feature vector for each tag to calculate
similarity among tags. If tag A is co-annotated with another tag B, A is considered a
feature of B and vice versa. As in previous studies [22, 24], each entry of a feature vector
is the point-wise mutual information between the tag and one of its features (co-occurring
tags), which is used as the feature weight:

$$mi_{w,c} = \frac{p(w,c)}{p(w)\,p(c)} \tag{3.1}$$

where $c$ is a co-occurring tag, $w$ is the tag, and $p(w,c)$ is the frequency count of
tag $w$ occurring with co-occurring tag $c$. Again, as in a previous study [24], these
point-wise mutual information values are multiplied by a discounting factor to mitigate bias
toward infrequent words. Once feature vectors are created, simple cosine similarity is
used to calculate similarity between all tags.
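The feature-vector construction and cosine comparison described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes probabilities are estimated as photo-level relative frequencies, it omits the discounting factor from [24], and the names `pmi_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def pmi_vectors(photos):
    """photos: list of tag lists, one per photo.
    Returns {tag: {co_tag: weight}} with weights following Eq. (3.1).
    Assumption: p(.) is estimated as the fraction of photos containing the tag(s)."""
    tag_count = Counter()
    pair_count = Counter()
    for tags in photos:
        unique = set(tags)
        tag_count.update(unique)
        # ordered pairs, so that each tag is a feature of the other
        pair_count.update(permutations(unique, 2))
    n = len(photos)
    vectors = {}
    for (w, c), n_wc in pair_count.items():
        p_wc = n_wc / n
        p_w = tag_count[w] / n
        p_c = tag_count[c] / n
        vectors.setdefault(w, {})[c] = p_wc / (p_w * p_c)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

photos = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
vectors = pmi_vectors(photos)
similarity_ab = cosine(vectors["a"], vectors["b"])
```

Storing each vector as a dictionary keeps the representation sparse, which matters when most tag pairs never co-occur.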
3.2.2 Geographical Cluster Calculation
To calculate the similarity of the geographical distribution between tags, we first create
the geographical clusters for each tag by using the coordinate (latitude and longitude)
information of photos. Each photo has coordinates and annotated tags. We organize the
data to observe which tags are annotated in which places. Based on geotagging data and
annotated tags from photos, we assign latitude and longitude information to each
tag. Then, each tag that holds several related coordinates is used to generate
geographical clusters. Figure 3.1 is an example of the tag-location assignment.
Figure 3.1: Tag-Location Assignment
From tag-location information, we use the k-means algorithm to generate geograph-
ical clusters for tags. The k-means algorithm is widely used in cluster generation
because of its efficiency. In short, the k-means algorithm partitions objects into $k$
groups based on their attributes. The objective of k-means is to minimize the total intra-cluster
variance, or the sum of squared errors. Usually, k-means works as follows:
1. Select $k$ points randomly as the initial centroids.
2. Form $k$ clusters by assigning each point to its closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat Steps 2 and 3 until the centroids no longer change.
However, the efficiency of k-means comes at the cost of accuracy. The k-means algorithm
is not guaranteed to find a global optimum, and there are many
examples of k-means generating inaccurate clusters. The accuracy of the result largely
depends on the initial set of clusters. Another disadvantage is that the number of clusters $k$
must be specified before executing the algorithm. We propose a way to overcome
these disadvantages of k-means.
To improve the accuracy, we need to find the best possible initial set of seed points.
The k-means++ algorithm [1] is adopted to find appropriate seed points. The authors of a
previous study [1] showed that k-means++ improves the accuracy of the k-means algorithm
while maintaining its speed and simplicity. The idea of the k-means++ algorithm
is to keep the seed points as far apart from one another as possible while selecting
initial centers that are already close to large quantities of
points. To do so, $D(x)$, the shortest distance from a data point $x$ to the closest
center already chosen, is calculated. Using $D(x)$, a weighting probability is
calculated and used to choose the next center. The k-means++ algorithm works in the
following manner:
1. Take one center $c_1$, chosen uniformly at random from $X$.
2. Take a new center $c_i$, choosing $c_i = x' \in X$ with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$.
3. Repeat Step 2 until we have taken $k$ centers.
4. Proceed as with the standard k-means algorithm.
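The seeding steps above, followed by the standard k-means iterations, can be sketched as below. This is an illustrative sketch under stated assumptions rather than the code used in this study: points are planar tuples, at least $k$ distinct points exist (otherwise all $D(x)^2$ weights may be zero), and the helper names `nearest`, `kmeans_pp_seeds`, and `kmeans` are hypothetical. It requires Python 3.8 or later for `math.dist`.

```python
import math
import random

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans_pp_seeds(points, k, rng):
    """Pick k seeds: the first uniformly at random, each later one with
    probability D(x)^2 / sum of D(x)^2, as in the k-means++ steps."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [math.dist(p, centers[nearest(p, centers)]) ** 2 for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def assign(points, centers):
    """Group points by their closest center."""
    clusters = [[] for _ in centers]
    for p in points:
        clusters[nearest(p, centers)].append(p)
    return clusters

def kmeans(points, k, rng, max_iter=100):
    """Standard k-means iterations started from k-means++ seeds.
    Returns (centers, clusters, sum of squared errors)."""
    centers = kmeans_pp_seeds(points, k, rng)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centers:  # centroids no longer change
            break
        centers = updated
    clusters = assign(points, centers)
    sse = sum(math.dist(p, centers[i]) ** 2 for i, ms in enumerate(clusters) for p in ms)
    return centers, clusters, sse

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters, sse = kmeans(points, 2, random.Random(0))
```

Returning the sum of squared errors alongside the clusters makes it straightforward to compare runs with different values of $k$.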
Then we use a heuristic to choose an appropriate value of $k$, which enables us to keep
the sum of squared errors as small as possible. We assume the k-means++ algorithm
can give the lowest possible sum of squared errors for an arbitrary value of $k$. Based on
this assumption, we first find the location of the initial seed point that gives the
lowest sum of squared errors for $k = 1$. Then we increase $k$ gradually
and perform k-means++ until we find the lowest sum of squared errors. As a result, we
are able to find the value of $k$ and the locations of the $k$ initial seed points that give
the lowest sum of squared errors among all possible values of $k$. The whole procedure
works as follows:
1. Find the locations of the initial seed points by k-means++.
2. Execute k-means from the calculated initial seed points.
3. Increment $k$.
4. Repeat the above steps until the sum of squared errors is the smallest.
Based on the k-means and k-means++ algorithms, we generate clusters for tags. Every
cluster has three attributes: the name of the tag, the coordinates of the centroid, and the
radius of the cluster. The radius of the cluster is the average Euclidean distance from the
centroid to its member points. Clusters are defined as circles.
3.2.3 Geographical Distribution Similarity (GDS) Calculation
The next step is to calculate how geographically similar two tags are. To find the
geographical similarity between two tags, we exploit the geographical clustering of tags.
The output of the previous section is the set of circle-shaped
clusters held by each tag on the coordinate system. These clusters are the resources
for expressing the geographical distribution similarities of different tags. For two arbitrary
tags, the corresponding clusters are retrieved, and the similarity of the clusters from the two
tags is calculated. This similarity indicates how similar the two tags are in
their geographical locations. To calculate it, we find the size
of the overlapped regions between the clusters of the two tags. Then, we calculate the total
size of the clusters from the two tags. Figure 3.2 shows geographical clusters and overlapped
regions of different clusters.
Figure 3.2: Overlapped Regions among Two Different Clusters
There are $a_1, a_2 \in A$ and $b_1, b_2 \in B$, where $A$ and $B$ are the sets of clusters for two different
tags; $a_1$, $a_2$ and $b_1$, $b_2$ are geographical clusters for tags $A$ and $B$, respectively. As shown
in Figure 3.2, $a_1 \cap b_1$ and $a_2 \cap b_2$ are the overlapped regions of tags $A$ and $B$.
The proportion of the overlapped regions to the whole region covered by $A$ and $B$ is referred to as the
geographical similarity of the two tags. The equation for the geographical distribution
similarity is shown below.
$$\text{geo-sim}(A,B) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} a_i \cap b_j}{\sum_{i=1}^{m} a_i + \sum_{j=1}^{n} b_j - \sum_{i=1}^{m}\sum_{j=1}^{n} a_i \cap b_j} \tag{3.2}$$
where $a_1, \ldots, a_m \in A$ and $b_1, \ldots, b_n \in B$. For efficiency, prior to using Equation (3.2) we
check whether at least one overlapped region exists. To do this,
cluster pairs from two different tags are retrieved, and overlapped regions are identified
(if they exist). If there is at least one overlapped region, the similarity is calculated. Then,
we calculate the similarities for all possible tag pairs. The whole procedure
works as follows:
1. For each tag $T_i$, retrieve all relevant geographical clusters $a_1, \ldots, a_m$.
2. For each tag $T_j$, retrieve all relevant geographical clusters $b_1, \ldots, b_n$.
3. If $T_i$ and $T_j$ have overlapped regions, calculate the overlapped regions and compute the
geographical distribution similarity, geo-sim(A,B), by Equation (3.2).
4. Repeat the above steps until there are no more overlapped regions for the tag pair $T_i$ and $T_j$.
So far, we have shown the steps for calculating Geographical Distribution Similarity
(GDS). In the next section, we will find the relationship between tag similarity and
GDS.
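The GDS calculation can be sketched in code as follows. This is a hedged illustration, not the implementation used in the experiments: it assumes each cluster is a planar circle given as a hypothetical `(x, y, radius)` tuple, uses the standard closed-form area of intersection of two circles, and reads the denominator of the similarity as the total cluster area minus the overlapped area.

```python
import math

def circle_area(r):
    return math.pi * r * r

def overlap_area(c1, c2):
    """Area shared by two circles, each given as (x, y, radius)."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    if d >= r1 + r2:            # disjoint circles
        return 0.0
    if d <= abs(r1 - r2):       # one circle lies inside the other
        return circle_area(min(r1, r2))
    # lens-shaped intersection; clamp guards against rounding just outside [-1, 1]
    clamp = lambda v: max(-1.0, min(1.0, v))
    a1 = r1 * r1 * math.acos(clamp((d * d + r1 * r1 - r2 * r2) / (2 * d * r1)))
    a2 = r2 * r2 * math.acos(clamp((d * d + r2 * r2 - r1 * r1) / (2 * d * r2)))
    tri = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
    return a1 + a2 - tri

def geo_sim(clusters_a, clusters_b):
    """Overlapped area divided by (total cluster area minus overlapped area)."""
    overlap = sum(overlap_area(a, b) for a in clusters_a for b in clusters_b)
    if overlap == 0.0:          # efficiency check: no overlapped region at all
        return 0.0
    total = sum(circle_area(a[2]) for a in clusters_a)
    total += sum(circle_area(b[2]) for b in clusters_b)
    return overlap / (total - overlap)

identical = geo_sim([(0.0, 0.0, 1.0)], [(0.0, 0.0, 1.0)])
disjoint = geo_sim([(0.0, 0.0, 1.0)], [(5.0, 0.0, 1.0)])
partial = overlap_area((0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
```

Treating latitude and longitude as planar coordinates mirrors the Euclidean radius used for the clusters; a geodesic distance would be a different design choice.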
3.3 Experimental Evaluation
3.3.1 Experiment
The machine used for this experiment has a 2.4 GHz Pentium 4 CPU and 1 GB of
memory, and runs Windows XP. The approach is implemented in
Python 2.5 and Java J2SE 1.5. First, we collected raw data from the photo-sharing Web
site Flickr.com. The data from Flickr.com consist of four elements: the owner's
information regarding the photo, the tags attached to the photo, the geotag, and the photo
itself. We randomly selected approximately 340 tags and retrieved data for 5,000 photos per
tag. The raw data were retrieved using the Flickr API. For our experiment, 729,948
photos were collected as an initial dataset. The dataset includes 12,545 distinct tags and
54,811 users. A total of 89,855 photos were retrieved with geotagging information, and
50,262 tags are associated with geotagging information. Table 3.1 shows a partial
example of our raw dataset.

Table 3.1: Example Raw Data from Flickr

| Photo ID | User ID | Tags | Lng/Lat |
|---|---|---|---|
| 138602753 | 12774574@N00 | audience, baseball, nyc, ny, yanks, bronx, newyork, newyorkcity, stadium, yank, yankee, yankees, yankeestadium | -74.157715/40.797176 |
| 143045373 | 54267419@N00 | swedenborgian, seder, church, newyork, maundythursday, holyweek, newchurch | -73.980354/40.747257 |
| 149113095 | 43208665@N00 | yankeestadium, newyork | -73.92859/40.82696 |

Using this data, we first calculate the tag similarity. From the
data, photo IDs and relevant tags are retrieved. Then the feature vector for each tag
is calculated, and those vectors are used to calculate the cosine similarity between two
tags. Table 3.2 shows the partial results of tag similarity calculations for a tag named
"newyork".

Table 3.2: Relevant Tags for "newyork"

| Tag 1 | Tag 2 | Similarity |
|---|---|---|
| newyork | newyorkcity | 0.2591218990962 |
| newyork | gothamist | 0.2255912261205 |
| newyork | bronx | 0.1284546640474 |
| newyork | nycpb | 0.1152881157987 |
| newyork | podcast | 0.1092873413209 |
| newyork | yankeestadium | 0.1039956324392 |

Next, we generate geographical clusters for each tag based on the raw data.
We assign tag locations based on the method shown in Figure 3.1. After that, tags and
coordinates from the tag-location assignment are applied to the algorithm to generate
geographical clusters for each tag. Table 3.3 shows the partial result of geographical clusters
for the tags "newyork" and "newyorkcity". A Cluster ID of the form "ny" followed by a number indicates
a cluster of "newyork," and a Cluster ID of the form "nyc" followed by a number indicates a cluster of
"newyorkcity."

Table 3.3: Geographical Clusters for "newyork" and "newyorkcity"

| Cluster ID | Latitude of Centroid | Longitude of Centroid | Cluster Radius |
|---|---|---|---|
| ny1 | 40.8275305 | -73.9265935 | 0.009425306704 |
| ny2 | 40.730645142 | -73.990243428 | 0.018889052616 |
| ny3 | 40.77757425 | -73.970645 | 0.016690971574 |
| ny4 | 40.826908947 | -73.928367578 | 0.000606728735 |
| nyc1 | 40.76105525 | -73.9758085 | 0.0093464984609 |
| nyc2 | 40.82771825 | -73.92622025 | 0.0007698066395 |
| nyc3 | 40.703349666666 | -73.99447833333 | 0.0223051413547 |
| nyc4 | 40.827328090909 | -73.92839290909 | 0.0008228534878 |

Once we generate the geographical clusters, we calculate the GDSes for
arbitrary tag pairs. For example, suppose we are to find the GDS for the tags "newyork"
and "newyorkcity." According to Equation (3.2), the overlapping regions of clusters and
the total size for all clusters from the two tags "newyork" and "newyorkcity" need to be
derived. These are the derived overlapping regions:
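$$
\begin{aligned}
\text{ny1} \cap \text{nyc2} &= 1.5542 \times 10^{-6} \\
\text{ny2} \cap \text{nyc3} &= 2.8556 \times 10^{-4} \\
\text{ny3} \cap \text{nyc1} &= 1.0982 \times 10^{-4} \\
\text{ny4} \cap \text{nyc2} &= 9.6561 \times 10^{-7}
\end{aligned}
$$

All of the above values are added into the total overlapped size:

$$\text{total overlapped size} = 3.9790 \times 10^{-4}$$

Then, we calculate the total size of all clusters by adding each cluster's area, which is
the same as $\pi \cdot (\text{cluster radius})^2$.

$$\text{total size} = 0.003839$$

By applying the total overlapped size and the total size to Equation (3.2), the GDS for
"newyork" and "newyorkcity" can be calculated as:

$$\text{geo-sim}(\text{"newyork"}, \text{"newyorkcity"}) = 0.115613$$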
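As a quick arithmetic check of this worked example, the sketch below recomputes the result from the rounded figures quoted in the text. It assumes the overlap values carry negative exponents (1.5542E-6, 2.8556E-4, 1.0982E-4, 9.6561E-7) and that the similarity is the total overlapped size divided by the total size minus the total overlapped size; the variable names are illustrative only.

```python
# Rounded overlap areas as quoted in the text (negative exponents assumed).
overlaps = [1.5542e-6, 2.8556e-4, 1.0982e-4, 9.6561e-7]
total_overlap = sum(overlaps)

# Total area of all clusters for both tags, as quoted in the text.
total_size = 0.003839

# Overlapped area relative to the area covered by either tag.
gds = total_overlap / (total_size - total_overlap)

print(f"{total_overlap:.4e} {gds:.4f}")  # prints "3.9790e-04 0.1156"
```

The small difference from the reported 0.115613 is consistent with rounding in the quoted intermediate values.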
The GDS for other tag pairs can be calculated the same way. Table 3.4 lists all tags
related to the tag "newyork" in terms of tag similarity and GDS.

Table 3.4: Tag Similarity-GDS List for "newyork"

| Tag 1 | Tag 2 | Similarity | GDS |
|---|---|---|---|
| newyork | newyorkcity | 0.2591218990962 | 0.1156128726629 |
| newyork | gothamist | 0.2255912261205 | 0.0047746618822 |
| newyork | bronx | 0.1284546640474 | 3.926857876e-06 |
| newyork | nycpb | 0.1152881157987 | 0.0010182582011 |
| newyork | podcast | 0.1092873413209 | 1.139363029e-06 |
| newyork | yankeestadium | 0.1039956324392 | 0.0010063674171 |
3.3.2 Evaluation
In this section, we find the relationship between tag similarity and GDS. To do this, we
first calculate the tag similarities and GDSes. Then, we need two other factors to discover
the relationship between the two similarities. The first is the photo
frequency, $pf(x)$, where $x$ is a tag. This refers to how many photos use this tag. The
second is the user frequency, $uf(x)$, where $x$ is a tag. This refers to how
many users use this tag. If one tag is used in only one photo, and another tag is used in
many photos, then those tags have different popularity and must be dealt with differently.
To distinguish the popularity of tags in terms of the number of photos that use them,
we introduce the "photo frequency" of tags: the percentage of all photos that
use the specific tag. We also need to consider how
many users use a tag. One user can take many photos and label them all with only
one tag. In that case, even though the tag is used many times, it is used by only one user.
We need to distinguish between a tag used by only one person and a tag used by many
people. For that reason, we introduce the idea of user frequency: the
percentage of users who use a certain tag. To find the relation between tag similarities
and GDS, we use the factors sim(x,y), geo-sim(x,y), $pf(x)$, $pf(y)$, $uf(x)$,
and $uf(y)$. We define the weighted tag similarity SIM(x,y) and the weighted GDS GEO-SIM(x,y)
by the following equations:

$$\text{SIM}(x,y) = \text{sim}(x,y)\,pf(x)\,pf(y)\,uf(x)\,uf(y) \tag{3.3}$$

$$\text{GEO-SIM}(x,y) = \text{geo-sim}(x,y)\,pf(x)\,pf(y)\,uf(x)\,uf(y) \tag{3.4}$$

where $x$ and $y$ are tags, sim(x,y) is the similarity between the two tags $x$ and $y$, geo-sim(x,y)
is the GDS of the two tags $x$ and $y$, $pf(x)$ is the photo frequency of tag $x$, and $uf(x)$ is the
user frequency of tag $x$. To clearly show the relationship, we first provide the log-log
plot for the two weighted similarities. Figure 3.3 shows the relationship between the two
similarities.
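The photo frequency, user frequency, and weighted similarity can be sketched as below. This is a minimal illustration under stated assumptions: each photo is a hypothetical `(user_id, tags)` pair, frequencies are expressed as fractions rather than percentages, and the weighted value is read as the plain product of the five factors.

```python
def frequencies(photos):
    """photos: list of (user_id, tags) pairs.
    Returns (pf, uf): fraction of photos and fraction of users using each tag."""
    n_photos = len(photos)
    all_users = {user for user, _ in photos}
    photo_count = {}
    tag_users = {}
    for user, tags in photos:
        for tag in set(tags):
            photo_count[tag] = photo_count.get(tag, 0) + 1
            tag_users.setdefault(tag, set()).add(user)
    pf = {tag: count / n_photos for tag, count in photo_count.items()}
    uf = {tag: len(users) / len(all_users) for tag, users in tag_users.items()}
    return pf, uf

def weighted(value, x, y, pf, uf):
    """Weight a raw similarity (sim or geo-sim) for tags x and y by pf and uf."""
    return value * pf[x] * pf[y] * uf[x] * uf[y]

photos = [("u1", ["a", "b"]), ("u1", ["a"]), ("u2", ["b"]), ("u2", ["c"])]
pf, uf = frequencies(photos)
sim_weighted = weighted(0.2, "a", "b", pf, uf)
```

The same `weighted` helper applies to both the tag similarity and the GDS, since the two equations differ only in the raw similarity being weighted.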
Figure 3.3: Distribution of log(SIM(x,y)) and log(GEO-SIM(x,y))
The X axis is log(SIM(x,y)) and the Y axis is log(GEO-SIM(x,y)). To make the relationship easier to
understand, we added the minus sign to the X and Y axes. From Figure 3.3, the regression
equation is derived as follows:

$$\log(\text{GEO-SIM}(x,y)) = 1.3914\,\log(\text{SIM}(x,y)) - 9.0435 \tag{3.5}$$

We suppose that the regression is written as:

$$\log(y) = \alpha \log(x) + \log(c) \tag{3.6}$$
A linear relationship in log-log space is commonly interpreted as evidence of a power
law, and the straight line in Figure 3.3 can be seen as such evidence.
A power law is a relationship between two scalar quantities $x$ and $y$ of
the form:

$$y = c\,x^{\alpha} \tag{3.7}$$

Equation (3.6) reduces to the form of Equation (3.7) once the logarithmic scale is removed from both axes.
Before we investigate the meaning of this distribution regarding SIM(x,y) and
GEO-SIM(x,y), we need to validate whether this distribution precisely follows the power
law. As mentioned previously, the simplest and most widely used way to check whether a
distribution follows the power law is to perform linear regression in log-log space.
However, one study [23] suggested that this can cause a bias in the value of the exponent.
So, the following formula is proposed as a reliable alternative:

$$\alpha = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1} \tag{3.8}$$

where $x_i$, $i = 1, \ldots, n$, are the measured values of $x$, and $x_{\min}$ corresponds to the
lowest value for which the power law holds. By applying Equation (3.8), $\alpha$ is calculated
as 1.184648305. Hence, the value of $\alpha$ from Equations (3.7) and (3.8) implies that the
distribution of tag similarity and GDS follows the power law with $\alpha < 2$.
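Both ways of estimating the exponent discussed above can be sketched as follows. This is an illustrative sketch only: the function names `loglog_fit` and `power_law_exponent` are hypothetical, natural logarithms are assumed for the regression, and the estimator is undefined when every value equals the chosen minimum (the sum of logarithms is then zero).

```python
import math

def loglog_fit(xs, ys):
    """Least-squares line log(y) = slope * log(x) + intercept."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    return slope, my - slope * mx

def power_law_exponent(xs, x_min):
    """Exponent estimate 1 + n / sum(ln(x_i / x_min)), using only values >= x_min."""
    tail = [x for x in xs if x >= x_min]
    return 1 + len(tail) / sum(math.log(x / x_min) for x in tail)

# Synthetic data lying exactly on y = 3 * x^1.5, so the fit should recover 1.5.
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** 1.5 for x in xs]
slope, intercept = loglog_fit(xs, ys)

alpha = power_law_exponent([math.e, math.e, math.e], 1.0)
```

Keeping the two estimators separate makes it easy to compare the regression slope with the value obtained from the logarithm-sum formula on the same data.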
Our evaluation of the results from the distribution suggests two points. First, the
distribution follows a power law with an increasing relation: one similarity increases as
the other increases. Hence, geotagging
and tagging are closely related to each other in terms of tag similarity and GDS.
This evidence helps us conclude that both geotagging and tagging information can be
integrated into the tag search problem, allowing users to get more refined and relevant
tag search results.
In addition, our approach assures scalability. Our analysis is supported by the scale-free
characteristic of the power law. Scale invariance is a feature of objects or laws that do
not change when length scales are multiplied by a common factor. Thus, the shape of
the distribution curve does not depend on the scale at which we measure the
similarity. In other words, the increasing relation is maintained regardless of the number of
tag-pair examples.
3.4 Conclusion
We have shown how tag similarity is strongly related to GDS. To do this, we first
calculated tag similarities from tag pairs. Then, we calculated geographical clusters
for each tag. From those geographical clusters, we computed the GDSes for tag pairs.
Next, we introduced the weighted tag similarity and the weighted GDS to reveal the
relation between them. By using those two weighted similarities, we discovered linear
regression in the log-log scale. The results showed that one similarity increased as the
other similarity increased.
In the future, we plan to further explore the most appropriate metric for finding
relevant tags by the association between tag similarity and GDS. We hope to see more
refined results in searching the tag space. Next, we will try to improve the geographical
cluster generation. Because k-means clustering is sensitive to outliers, we
plan to generate better clusters by removing outliers. In addition, the mutual information
of tagging and geotagging will be researched further. Finally, we are working on finding
relevant users in the collaborative tagging system by using tag similarity. Users can be
classified by which tags they use frequently and grouped together by this classification.
Chapter 4
Dynamic tag-based recommendation
4.1 Introduction
Within social networks, tagging (the process of creating and using a
tag, a relevant term that is associated with a unit of information) has become common.
Users create and share their own content, such as blog posts, photographs, and videos,
and then may create and use several tags to describe their content. A recommendation
system can be considered a specific type of information filtering (IF) technique [13] that
attempts to present information resources (e.g., images, videos, music, and URLs) that
are likely to be of interest to users. Typically, a recommendation system compares a
user’s characteristics against some reference and attempts to predict how the user would
rate an item that he or she has not yet rated. The characteristics of users may include
those concerning an information item (the content-based approach) or the user’s social
environment (the collaborative-filtering [8] approach).
The need for recommendation systems has increased in tandem with the great
increase in the number of information resources and the consequent difficulty of iden-
tifying relevant resources within social networks. Reacting to this need, the research
into tag recommendation systems based on users’ previous tag usages has recently been
extended to research into item recommendation systems using tags. As described above,
a tag describes the characteristics of an item, and may represent the characteristics of
users who are associated with the item. For this reason, tags are useful indicators of
characteristics in recommendation systems, particularly in cases in which it is difficult
to quantitatively retrieve the characteristics of items (e.g., video, audio, or images) from
the contents of the items.
Although we hypothesize that tag distribution changes over time, current research
into the use of tagging in item recommendation has not considered changes in interests
over time. For example, some groups of users who were interested in football last
fall may be interested in basketball this spring. In this case, the users' tag vocabulary
contains more tags related to football in fall but more tags related to basketball in
spring. Thus, at a specific point of time (e.g., March), it would be more desirable for a
recommendation system to recommend items related to basketball to users.
To address this consideration, we developed a system that uses the process of
dynamic adjustment and includes tags with similar concepts and interests to recom-
mend items with greater precision. We modeled our system after the latent Dirichlet
allocation (LDA) model [3], a generative model that approximates the generation of
items in terms of latent topics. The LDA model considers an item a mixture of various
latent topics, and chooses tags in the item according to the topics. We extended the LDA
model in order to model users and tags over time by representing users and tags as mix-
tures of topics and by reflecting the time-based distribution of topics. To reflect changes
in interests over time, we introduced the concept of a time-based similarity weight. Gen-
erally, a recommendation system suggests a new item to a user after determining which
of the user’s item is most similar to the new item. If the similarity between the new item
and the user’s item reaches or exceeds a threshold value, the system recommends the
new item to the user. If the tag distribution differs over time, the similarity metric that
determines the distribution change can better determine the similarities between items,
thus allowing the system to make better recommendations.
Our work makes the following meaningful contributions. First, our recommenda-
tion system considers solely tags and not the items themselves to make recommenda-
tions. By doing so, the system remains simple in terms of types of attributes while,
as we demonstrate, performing well, suggesting that tags provide accurate summaries
of contents. Second, our system reflects changes in interests over time and makes use
of change as a factor in recommendations. The remainder of this chapter is organized
as follows. Section 4.2 describes our approach, Section 4.3 presents and evaluates our
results, and Section 4.4 concludes our work and presents recommendations for further
research.
4.2 Approach
Our approach for dynamic item recommendation has four parts: preprocessing a dataset
from social networks; topic modeling from the preprocessed dataset; time-based simi-
larity weight calculation; and recommendation. A dataset in our approach consists of
three entities: an item, which is a resource preferred by a user; a user, who is an individ-
ual who prefers an item; and a tag, which is an entity annotated to the item to describe
the item. The relationship between items and tags is many-to-many, as is the relation-
ship between users and tags. To design our experiments, we extracted each user’s items
and associated tags. From the extracted data, we created a dataset in which each item’s
features are its tags. We then passed our dataset through the steps of pruning irrelevant
tags, grouping users and documents by topic through a variant of the LDA model, and
learning similarity weights over time. Then, we designed the system to recommend
new testing items to each user based on the similarity between weights over time and
between new testing items and his or her own items. Figure 4.1 illustrates the proposed system.
Each numbered box denotes a subprocess of our approach and is explained in the
following sections.
4.2.1 Tag Pruning
In our experiment, we collect the items and associated tags for each user, and represent
each item as a vector of tags with which the item is annotated. Whereas some tags
provide useful information for creating recommendations, other tags are too general or
too specific to be useful. As it would be computationally difficult to use all the tags in
the following step, we reduced the number of tags by determining their term frequency-
inverse document frequency (tf-idf) weight [18], a statistical measure used to determine
the importance of a term to a document in a collection or corpus. The importance of
a term $t_i$ increases proportionally to the number of times that the term appears in the
document $d_j$, but is offset by the frequency of the term in the corpus. The tf-idf weight is
obtained by multiplying the term frequency (tf) by the inverse document frequency (idf).

Figure 4.1: Illustration of Approach

The term frequency, which indicates the importance of a term in a certain document, is
defined as follows:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{4.1}$$

In Equation (4.1), $n_{i,j}$ denotes the number of occurrences of the term $t_i$ in the
document $d_j$, and the denominator denotes the sum of the occurrences of all terms in
document $d_j$. The inverse document frequency is a measure of the general importance of the
term $t_i$ obtained by dividing the number of all documents in the corpus by the number of
documents containing the term. As such, the greater the number of documents that
contain the term $t_i$, the lesser the importance of the term. The inverse document frequency
is defined in Equation (4.2):

$$idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \tag{4.2}$$

where $|D|$ denotes the total number of documents in the corpus and $|\{d : t_i \in d\}|$
denotes the number of documents in which the term $t_i$ is contained. Then,

$$(\text{tf-idf})_{i,j} = tf_{i,j} \times idf_i \tag{4.3}$$
In our dataset, we consider each tag a term and each user a document. After calculating
the tf-idf weights of favorite tags for each user, we eliminate those tags whose
weights fall below a certain threshold, leaving us with the tags that best describe a user's
preferences. In other words, we prune very generic tags, such as "friend," that would
provide little or no information about the user's preferences, leaving us with only the
most salient tags.
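The pruning step can be sketched as below, treating each user as a document and each tag as a term. This is a minimal sketch under assumptions: the input format, the natural logarithm, the function names `tfidf_by_user` and `prune`, and the threshold value are all illustrative and not taken from the experiments.

```python
import math
from collections import Counter

def tfidf_by_user(user_tags):
    """user_tags: {user: list of tags, repeats allowed}.
    Returns {user: {tag: tf-idf weight}} with each user treated as a document."""
    n_docs = len(user_tags)
    doc_freq = Counter()
    for tags in user_tags.values():
        doc_freq.update(set(tags))          # number of users who used each tag
    weights = {}
    for user, tags in user_tags.items():
        counts = Counter(tags)
        total = sum(counts.values())
        weights[user] = {
            tag: (n / total) * math.log(n_docs / doc_freq[tag])
            for tag, n in counts.items()
        }
    return weights

def prune(weights, threshold):
    """Keep only the tags whose weight does not fall below the threshold."""
    return {
        user: {tag for tag, w in tag_weights.items() if w >= threshold}
        for user, tag_weights in weights.items()
    }

user_tags = {"u1": ["friend", "jazz", "jazz"], "u2": ["friend", "soccer"]}
kept = prune(tfidf_by_user(user_tags), threshold=0.1)
```

In this toy input the tag "friend" is used by every user, so its inverse document frequency is zero and it is pruned for both users.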
4.2.2 LDA-based topic modeling over time
Using tf-idf weights for pruning, we effectively reduce the dimension of the problem by
decreasing the number of tags. Nevertheless, the feature space remains too large.
Fortunately, we can further reduce the dimension of the problem with a dimensionality
reduction technique, specifically the LDA model [3].
In the LDA model, which is widely used for topic discovery based on the
occurrences of words (tags) in documents (items), an item is considered a bag of
particular tags and represented by mixtures over latent topics, with each latent topic being
characterized by a fixed conditional distribution over tags. Using the LDA model is thus
regarded as a form of topic modeling because of the topic representation that it
provides. Specifically, the LDA model assumes that all tags of all items are generated
by randomly chosen latent topics. The following section introduces the basic concepts
of the LDA model.
Basic concepts of the LDA model
Topic-based document modeling is based on the simple sampling model in which words
in documents are generated by latent random variables (topics). The goal of topic-based
document modeling is to identify the set of latent variables that can best explain the
observed data. Occurrences of a word in a topic indicate the frequency of the word in
the topic and the probabilistic distribution of the word over the topic. Each document is
generated by selecting words from a topic, and topic selection depends on the weights.
For each document, a single word can be sampled from several topics. Because the
model does not restrict a single word to a single topic, it can capture polysemy. For
example, it allows topics related to money as well as topics related to rivers to yield
the term ”bank.” This generative process is not based on any assumptions regarding
the order of words in documents. The only relevant information in this model is the
frequency with which a word appears, and it is thus based on the so-called ”bag-of-
words assumption.” Also, we can illustrate the problem of statistical inferences. Given
a set of words that appear in a document, the goal is to determine the topic model that
best generates the data by inferring the probabilistic distribution of words regarding each
topic and the distribution of topics regarding each document.
Extension of the LDA model
In our approach, we combine two LDA models [36, 12] to represent users as well as
items using a mixture of latent topics over time. The topics over time (TOT) model [36]
is a variant of the LDA model that models timestamps and the tags in the timestamped
items. The LDA model for collaborative filtering [12] is a variant of the LDA model
that models users and items using mixtures of latent topics. Our LDA model generates
topic-tag distributions and topic-user distributions over time as follows:
1. For each topic $z \in T$:
(a) Draw a $V$-dimensional multinomial $\phi_z$ from a Dirichlet prior $\beta$;
(b) Draw a $U$-dimensional multinomial $\varepsilon_z$ from a Dirichlet prior $\gamma$;
2. For each item $d \in D$, draw a $T$-dimensional multinomial $\theta_d$ from a Dirichlet
prior $\alpha$;
3. For each tag $w_{di}$ in item $d$:
(a) Draw a topic $z_{di}$ from multinomial $\theta_d$;
(b) Draw a tag $w_{di}$ from multinomial $\phi_{z_{di}}$;
(c) Draw a timestamp $t_{di}$ from Beta $\psi_{z_{di}}$;
4. For each user $u_{di}$ in item $d$:
(a) Draw a topic $z_{di}$ from multinomial $\theta_d$;
(b) Draw a user $u_{di}$ from multinomial $\varepsilon_{z_{di}}$;
Table 4.1: Symbols for LDA
Symbol             Description
$T$                number of topics
$D$                number of items
$V$                number of unique tags
$U$                number of users
$N_d$              number of tags in item $d$
$\alpha$           the hyperparameter of the Dirichlet prior for multinomial $\theta$
$\beta$            the hyperparameter of the Dirichlet prior for multinomial $\phi$
$\gamma$           the hyperparameter of the Dirichlet prior for multinomial $\varepsilon$
$\psi$             the hyperparameter of Beta
$\theta_d$         the multinomial distribution of topics in the item $d$
$\psi_z$           the beta distribution of time specific to topic $z$
$\varepsilon_z$    the multinomial distribution of users specific to topic $z$
$z_{di}$           the topic associated with the $i$th tag in the item $d$
$w_{di}$           the $i$th tag in item $d$
$u_{di}$           the user associated with the $i$th tag in the item $d$
$t_{di}$           the timestamp associated with the $i$th tag in the item $d$
The graphical model for this process is shown in Figure 4.2, and the symbols in
Figure 4.2 are described in Table 4.1. In Figure 4.2, only the shaded circles ($u$, $w$, and
$t$) are observed data. As shown in the process, the posterior distribution of topics is
evaluated by tag, user, and time. The parameterization is as follows:
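\[
\begin{aligned}
\theta_d \mid \alpha &\sim \mathrm{Dirichlet}(\alpha) \\
\phi_z \mid \beta &\sim \mathrm{Dirichlet}(\beta) \\
\varepsilon_z \mid \gamma &\sim \mathrm{Dirichlet}(\gamma) \\
z_{di} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) \\
w_{di} \mid \phi_{z_{di}} &\sim \mathrm{Multinomial}(\phi_{z_{di}}) \\
u_{di} \mid \varepsilon_{z_{di}} &\sim \mathrm{Multinomial}(\varepsilon_{z_{di}}) \\
t_{di} \mid \psi_{z_{di}} &\sim \mathrm{Beta}(\psi_{z_{di}})
\end{aligned}
\]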
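As a minimal sketch of the generative process above, the following Python code draws the tags, timestamps, and users of a single item using only the standard library. The toy sizes, the Beta parameters in `psi`, and the helper names `dirichlet`, `categorical`, and `generate_item` are illustrative assumptions, not part of the model description.

```python
import random

def dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]

def categorical(probs, rng):
    """Draw one index according to a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_item(n_tags, n_users, theta_d, phi, eps, psi, rng):
    """Generate (topic, tag, timestamp) and (topic, user) draws for one item."""
    tags, users = [], []
    for _ in range(n_tags):
        z = categorical(theta_d, rng)      # step 3(a): topic from the item's mixture
        w = categorical(phi[z], rng)       # step 3(b): tag from the topic-tag distribution
        t = rng.betavariate(*psi[z])       # step 3(c): timestamp in [0, 1] from the topic's Beta
        tags.append((z, w, t))
    for _ in range(n_users):
        z = categorical(theta_d, rng)      # step 4(a)
        u = categorical(eps[z], rng)       # step 4(b): user from the topic-user distribution
        users.append((z, u))
    return tags, users

rng = random.Random(0)
T, V, U = 3, 8, 5                          # toy numbers of topics, tags, users (assumed)
alpha, beta, gamma = 50 / T, 0.1, 0.1      # symmetric Dirichlet hyperparameters
phi = [dirichlet(beta, V, rng) for _ in range(T)]    # step 1(a)
eps = [dirichlet(gamma, U, rng) for _ in range(T)]   # step 1(b)
psi = [(2.0, 5.0), (5.0, 2.0), (3.0, 3.0)]           # assumed Beta parameters per topic
theta_d = dirichlet(alpha, T, rng)                   # step 2
tags, users = generate_item(4, 2, theta_d, phi, eps, psi, rng)
```

In practice the process is never run forward like this; it only defines the distributions that inference later has to recover from the observed users, tags, and timestamps.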
We now must infer the unknown parameters, such as $\phi$, $\varepsilon$, and $\psi$. Because
exact inference for these parameters is intractable [3], we used Gibbs sampling to
obtain approximate inferences [10, 33, 37]. Using Gibbs sampling, we evaluated the
posterior distribution of $z$ and used it to infer $\phi$, $\varepsilon$, and $\psi$. The topic
assignment $z$ of a randomly chosen user $u$, tag $w$, and time $t$ is sampled from all latent
topics according to Equation (4.4):
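\[
P(z_{di} = k \mid z_{-di}, R_u, R_w, R_t) \propto
\frac{n^{t}_{R,-i} + \alpha_t}{\sum_{t=1}^{T}\left(n^{t}_{R} + \alpha_t\right) - 1}
\cdot \frac{n^{w}_{k,-i} + \beta_w}{\sum_{w=1}^{V}\left(n^{w}_{k,-i} + \beta_w\right)}
\cdot \frac{n^{u}_{k,-i} + \gamma_u}{\sum_{u=1}^{U}\left(n^{u}_{k,-i} + \gamma_u\right)}
\cdot \frac{(1 - t_{di})^{\psi_{k1} - 1}\, t_{di}^{\psi_{k2} - 1}}{B(\psi_{k1}, \psi_{k2})}
\tag{4.4}
\]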
, where $n^{u}_{k}$ is the number of times topic $k$ is assigned to the user $u$; $n^{w}_{k}$ is the
number of times topic $k$ is assigned to the tag $w$; $n^{t}_{R}$ is the number of times topic $t$
appears in item $R$; the subscript $-i$ indicates that the current $i$th allocation is not
counted; $B$ denotes the beta function; $\bar{t}_k$ and $s^{2}_{k}$ in
$\psi_{k1} = \bar{t}_k \left( \frac{\bar{t}_k (1 - \bar{t}_k)}{s^{2}_{k}} - 1 \right)$ and
$\psi_{k2} = 1 - \psi_{k1}$ denote the sample mean and the sample variance of the timestamps
belonging to the topic $k$; and $\psi_{k1}$ and $\psi_{k2}$ parameterize the distribution of
topic $k$ over time.
Figure 4.2: LDA model for item and tag modeling over time
From Equation (4.4), we can estimate the topic-tag distribution and the topic-user
distribution using the following formulas:
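\[
\phi_{k,w} = \frac{n^{w}_{k,-i} + \beta_w}{\sum_{w=1}^{V}\left(n^{w}_{k,-i} + \beta_w\right)}
\tag{4.5}
\]
\[
\varepsilon_{k,u} = \frac{n^{u}_{k,-i} + \gamma_u}{\sum_{u=1}^{U}\left(n^{u}_{k,-i} + \gamma_u\right)}
\tag{4.6}
\]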
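Equations (4.5) and (4.6) amount to smoothing and normalizing each topic's row of assignment counts. The sketch below illustrates this in Python; the count tables and the function name `smoothed_rows` are hypothetical, and a symmetric prior is assumed.

```python
def smoothed_rows(counts, prior):
    """Turn a topic-by-outcome count table into smoothed probability rows.

    counts[k][x] is the number of times topic k was assigned to outcome x
    (a tag for Eq. 4.5, a user for Eq. 4.6); prior is the symmetric
    Dirichlet hyperparameter added to every count.
    """
    rows = []
    for row in counts:
        denom = sum(c + prior for c in row)
        rows.append([(c + prior) / denom for c in row])
    return rows

# Hypothetical counts for 2 topics: 3 tags and 2 users.
n_tag = [[4, 0, 1], [0, 3, 2]]
n_user = [[2, 3], [4, 1]]

phi = smoothed_rows(n_tag, 0.1)    # topic-tag distribution, Eq. (4.5)
eps = smoothed_rows(n_user, 0.1)   # topic-user distribution, Eq. (4.6)
```

Because of the added prior, a tag or user that was never assigned to a topic still receives a small nonzero probability under that topic.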
As topic models are typically sensitive to hyperparameters, it is important to obtain
the correct values for the hyperparameters. After finding that the sensitivity to
hyperparameters was not very strong in our model, we used fixed symmetric Dirichlet
distributions ($\alpha = 50/T$, $\beta = 0.1$, and $\gamma = 0.1$).
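With fixed symmetric hyperparameters, the conditional of Equation (4.4) for a single assignment can be sketched as below. This is an illustrative reading of the equation, not the implementation used in the experiments: the count tables are hypothetical, they are assumed to already exclude the assignment being resampled, and the names `time_term` and `topic_posterior` are introduced only for this example.

```python
import math

def time_term(t, psi1, psi2):
    """Time factor of Eq. (4.4): (1-t)^(psi1-1) * t^(psi2-1) / B(psi1, psi2)."""
    log_beta = math.lgamma(psi1) + math.lgamma(psi2) - math.lgamma(psi1 + psi2)
    return math.exp((psi1 - 1) * math.log(1 - t)
                    + (psi2 - 1) * math.log(t) - log_beta)

def topic_posterior(item_counts, tag_counts, user_counts, psi, w, u, t,
                    alpha, beta, gamma):
    """Normalized P(z = k | ...) over all topics for one (tag, user, time) triple.

    item_counts[k]    : assignments of topic k within the current item
    tag_counts[k][w]  : assignments of topic k to tag w
    user_counts[k][u] : assignments of topic k to user u
    All counts are assumed to exclude the assignment being resampled.
    """
    T = len(item_counts)
    weights = []
    for k in range(T):
        item = (item_counts[k] + alpha) / (sum(item_counts) + T * alpha)
        tag = (tag_counts[k][w] + beta) / (sum(tag_counts[k]) + len(tag_counts[k]) * beta)
        user = (user_counts[k][u] + gamma) / (sum(user_counts[k]) + len(user_counts[k]) * gamma)
        weights.append(item * tag * user * time_term(t, *psi[k]))
    total = sum(weights)
    return [x / total for x in weights]

T = 2
probs = topic_posterior(
    item_counts=[3, 1],
    tag_counts=[[5, 0], [0, 5]],
    user_counts=[[4, 1], [1, 4]],
    psi=[(2.0, 2.0), (2.0, 2.0)],
    w=0, u=0, t=0.5,
    alpha=50 / T, beta=0.1, gamma=0.1,
)
```

A Gibbs sweep would draw the new topic from `probs`, update the three count tables, and move on to the next assignment. The timestamp `t` must lie strictly between 0 and 1 for the logarithms to be defined.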
4.2.3 Time-based Similarity Weight Calculation
The previous section described the manner in which we obtained topic-user and topic-
tag distributions. By obtaining topic-user distributions, we can understand which users
have similar topic distributions and perform user grouping by using clustering methods,
such as k-means clustering, by which each user is an entity whose attributes are topics.
As described in the introduction, when a recommendation system suggests a new item
to a specific user at a specific point in time, the system identifies which of the user’s
current items is most similar to the new item and, if their degree of similarity reaches
a threshold, recommends the new item to the user. As user interests can change over
time, our model may contain users who have different tag distributions over time and,
consequently, different topic distributions over time. Thus, given topic distributions at
an arbitrary point in time, certain topic distributions will have a higher level of similarity
with the distributions than will topic distributions at other points in time. Taking into
account different topic distributions according to time, we can calculate the similarity
among topic distributions according to time periods and adopt the similarity as a weight
in order to suggest a new item to a user.
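The user grouping step can be illustrated with a plain k-means (Lloyd's algorithm) over user topic vectors. The sketch below is a simplified stand-in with random initialization; the toy vectors, the iteration count, and the function names are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two topic vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster topic vectors into k groups; returns one group label per vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)       # random initial centers
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every user to the nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(v, centers[c]))
                  for v in vectors]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:                    # keep the old center if a group is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Four hypothetical users described by their weights on two topics.
user_topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans(user_topics, k=2)
```

Users whose topic distributions are close end up with the same label, and each label then defines one group for which similarity weights over time are computed.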
Users who have similar interest dynamics over time may be grouped together. To do
so, we created user groups based on their topic distributions, obtained as described in
the previous section, and calculated similarity weights over time for each group. After
identifying the user groups, we collected the user items associated with each group and
then divided the collected items according to the time period (in this case, one month).
Defining each item as a topic vector, we identified the tags associated with each item
and the topic-tag distributions for all the tags. Converting new items into topic vectors,
we incorporated topic vectors from the new items and topic vectors of the group from
each month into the dataset in order to measure the similarity weights. For example, if
one group's items existed over 12 months, that group had 12 datasets. From each
dataset, we measured the topic similarity over time using Equation (4.7). As a result,
we obtained several topic similarity values over time for each group, which we termed
group similarity weights over time.
\[
\mathrm{weight}(g, t) =
\frac{\vec{x}_{target} \cdot \vec{x}_{g,t}}{\|\vec{x}_{target}\| \, \|\vec{x}_{g,t}\|}
\tag{4.7}
\]
, where $\vec{x}_{target}$ denotes the topic vector in the target time $target$, $\vec{x}_{g,t}$
denotes the topic vector in time $t \in TS$ for group $g$, and $TS$ is the set of time slots
of the dataset.
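As a small illustration of Equation (4.7), the sketch below computes one weight per time slot for a single group. The monthly topic vectors are hypothetical, and how a group's items are aggregated into one vector per month is left outside the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def group_weights(target_vector, vectors_by_slot):
    """Eq. (4.7): weight(g, t) for every time slot t of one group g."""
    return {slot: cosine(target_vector, vec) for slot, vec in vectors_by_slot.items()}

# Hypothetical topic vectors of one group (3 topics) for the target month
# and for two earlier months.
target = [0.7, 0.2, 0.1]
history = {"2008-12": [0.1, 0.2, 0.7], "2009-04": [0.6, 0.3, 0.1]}
weights = group_weights(target, history)
```

Here the April vector is much closer to the target than the December one, so items from April would receive the larger weight.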
4.2.4 Recommendation System
In this section, we explain two recommendation approaches. When a recommendation
system suggests a new item to a specific user, the system finds the user's item that has
the highest similarity with the new item. The two approaches differ, however, in their
use of temporal information. The first approach is a static recommendation: it does not
employ temporal information. The second approach is a dynamic recommendation,
which employs the group similarity weights over time.
Static topic-based recommendation
The topic-based recommendation system described in this section served as the basis
for our approach. First, we converted each item into a topic vector, as described in
the previous section. To determine whether to recommend a new item to a user, we
identified which of his or her items has the greatest level of similarity with the new item
and, if the level of similarity reached a threshold, recommended the new item to the user.
To determine the level of similarity, we determined the cosine similarity between items
using Equation (4.8):
\[
\mathrm{sim}(\vec{y}_i, \vec{y}_j) =
\frac{\vec{y}_i \cdot \vec{y}_j}{\|\vec{y}_i\| \, \|\vec{y}_j\|}
\tag{4.8}
\]
, where $\vec{y}_i$ and $\vec{y}_j$ are topic vectors for items $i$ and $j$.
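The static decision rule can be sketched as follows: compute Equation (4.8) between the new item and each of the user's items, and recommend if the best match reaches the threshold. The vectors, the threshold value, and the name `recommend_static` are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Eq. (4.8): cosine similarity between two topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend_static(new_item, user_items, threshold):
    """Return (decision, best similarity) for one new item and one user."""
    best = max((cosine(new_item, y) for y in user_items), default=0.0)
    return best >= threshold, best

# Hypothetical topic vectors over three topics.
user_items = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
decision, best = recommend_static([0.7, 0.2, 0.1], user_items, threshold=0.5)
```

A user with no items yields a best similarity of 0.0, so nothing is recommended to that user under this rule.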
Dynamic topic-based recommendation with group similarity weight over time
To recommend a new item to a user using group similarity weights over time from the
previous section, we first identified the group with which the user was initially asso-
ciated. We then calculated the similarity between the new item and the user’s items
employing group similarity weights over time, which differ according to the user’s group
and the group item’s time period. By using group similarity weights over time in Equa-
tion (4.9) to calculate the similarity between items, we could identify which item had
the highest level of similarity with the new item and, if the level reached a threshold
value, recommend the item.
\[
\mathrm{sim}(\vec{y}_i, \vec{y}_{j,g,t}) =
\mathrm{weight}(g, t) \cdot
\frac{\vec{y}_i \cdot \vec{y}_{j,g,t}}{\|\vec{y}_i\| \, \|\vec{y}_{j,g,t}\|}
\tag{4.9}
\]
, where $\vec{y}_i$ is a topic vector for item $i$, $\vec{y}_{j,g,t}$ is a topic vector for item $j$
in the time slot $t$ and in the group $g$ with which the user is associated, and
$\mathrm{weight}(g, t)$ is a similarity weight for time $t$ in the group $g$.
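The dynamic rule differs from the static one only in that each similarity is scaled by the weight of the time slot in which the user's item was collected. A minimal sketch, with hypothetical weights, vectors, and function names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend_dynamic(new_item, dated_items, slot_weights, threshold):
    """Eq. (4.9): weight each similarity by weight(g, t) of the item's time slot.

    dated_items  : list of (time_slot, topic_vector) for one user
    slot_weights : weight(g, t) for the user's group g, keyed by time slot
    """
    best = max((slot_weights[slot] * cosine(new_item, y) for slot, y in dated_items),
               default=0.0)
    return best >= threshold, best

# An old item matches the new item perfectly but belongs to a down-weighted month;
# a recent item matches less well but belongs to a highly weighted month.
dated_items = [("2008-07", [1.0, 0.0]), ("2009-05", [0.6, 0.8])]
slot_weights = {"2008-07": 0.2, "2009-05": 0.9}
decision, best = recommend_dynamic([1.0, 0.0], dated_items, slot_weights, threshold=0.5)
```

In this example the recent item drives the decision (0.9 × 0.6 = 0.54), whereas the perfectly matching but old item contributes only 0.2.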
Learning to rank of recommended items
In recommendation, several items are recommended to each user. If the recommended
items are presented as a list, users tend to look at the higher-ranked items in the list
first and to look at the lower-ranked items only if the higher-ranked items are not
relevant. If the recommendation system can place relevant items at higher ranks in the
list, the system becomes more convenient to use.
To produce a better listing of the recommended items, in which relevant items are
ranked higher and non-relevant items are ranked lower, we employ learning to rank
(LETOR) [7, 14, 17, 25, 39, 5] algorithms, which build a ranking model from a training
dataset. Among the LETOR algorithms, we use the RankBoost [7] algorithm. The basic
idea of RankBoost is to formalize ranking as a problem of binary classification on each
item pair and then to adopt a boosting approach.
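The pairwise view of ranking can be made concrete with a short example. The sketch below is not RankBoost itself; it only builds the item pairs on which a pairwise method is trained and measures how many of them a given scoring misorders. The scores, labels, and function names are hypothetical.

```python
def preference_pairs(relevance):
    """All index pairs (i, j) where item i should be ranked above item j."""
    return [(i, j)
            for i, ri in enumerate(relevance)
            for j, rj in enumerate(relevance)
            if ri > rj]

def pairwise_error(scores, relevance):
    """Fraction of preference pairs that the scores order incorrectly."""
    pairs = preference_pairs(relevance)
    if not pairs:
        return 0.0
    wrong = sum(1 for i, j in pairs if scores[i] <= scores[j])
    return wrong / len(pairs)

relevance = [1, 0, 1, 0]            # binary relevance of four recommended items
scores = [0.9, 0.8, 0.3, 0.1]       # similarity scores used for the initial ranking
error = pairwise_error(scores, relevance)
```

Each pair acts as one binary classification example (is the first item ranked above the second?), and a boosting procedure would combine weak rankers to drive this error down.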
4.3 Experimental Evaluation
To validate our approach, we conducted experiments. First, we collect data from
social networks and preprocess the collected data in order to remove irrelevant tags.
Then, we evaluate how well the group similarity weights reveal the actual trends in
user interest in topics over time. After evaluating the similarity weights, we evaluate
the precision rates of two recommendation approaches: static recommendation, and
dynamic recommendation with group similarity weights over time. Finally, we evaluate
the normalized discounted cumulative gain of the recommended items with LETOR and
without LETOR.
4.3.1 Data Retrieval and Preprocessing
We need to show that our dynamic item recommendation is more effective than static
item recommendation. To do this, we have experimentally retrieved real data from social
networks. Our dataset needs to have three entities: user, item, and tag. In our dataset,
users have collections of items which are selected by them or recommended to them in
order to show their preferences explicitly and items are described by tags. Also, we need
to track the selection of the items over time in order to manipulate temporal information.
To conduct our experiments, we used datasets from Flickr, a photo-sharing network
that allows users to upload photos, describe them with tags, and set other users’ photos
as their favorite photos. Considering that setting a photo as a favorite is a strong sign
of user preference, our objective in this experiment was to recommend photos set as
favorite photos by users given their previous favorite photo datasets. The data that we
collected from Flickr consisted of users’ information, users’ favorite photos, and the tags
associated with the favorite photos. In our experiments, all the photos were favorite pho-
tos and considered items in our approach. Our dataset contained 5,821 users, 1,183,398
photos and 13,721,075 favorite tags regarding photos chosen by users as their favorite
photos between June 1, 2008 and June 30, 2009.
We sorted the tags by tf-idf weight and pruned the data to make it more manageable
and to suppress noise. For each user, we pruned tags that fell below the top 0.5% by
tf-idf weight, and retained only those photos that had at least one of the top 0.5% tags
for each user. After pre-processing the data, we executed Gibbs sampling for the LDA
model. The number of topics for the LDA model sampling was 50. As a result, we
retrieved topic-user and topic-tag distributions over time. Using the topic-user
distributions, we converted each user into a topic vector that we used to create user
groups with k-means clustering, with the value of k set at 20. After creating user
groups, we identified the user items associated with each group for each month (having
set the time period to one month) to calculate the similarity weight for each group and
each time period.
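The per-user pruning step might look like the sketch below. The exact tf-idf definition is not restated in this section, so the code assumes a common variant: term frequency is the number of times a user applied a tag, and inverse document frequency is the logarithm of the number of users divided by the number of users who applied it. The data and the name `prune_tags` are hypothetical, and the example keeps 50% of the tags only because the toy data is tiny.

```python
import math
from collections import Counter

def prune_tags(user_tags, keep_fraction):
    """Keep, for each user, the top fraction of that user's tags by tf-idf weight.

    user_tags maps a user to the list of tags (with repeats) on the user's items.
    """
    n_users = len(user_tags)
    users_with_tag = Counter(tag for tags in user_tags.values() for tag in set(tags))
    kept = {}
    for user, tags in user_tags.items():
        tf = Counter(tags)
        score = {tag: count * math.log(n_users / users_with_tag[tag])
                 for tag, count in tf.items()}
        # Sort by descending weight; break ties alphabetically for determinism.
        ranked = sorted(score, key=lambda tag: (-score[tag], tag))
        n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
        kept[user] = set(ranked[:n_keep])
    return kept

user_tags = {
    "u1": ["sunset", "sunset", "beach", "photo"],
    "u2": ["cat", "cat", "cat", "photo"],
    "u3": ["photo", "city"],
}
kept = prune_tags(user_tags, keep_fraction=0.5)
```

The tag "photo" is used by every user, so its weight is zero and it is pruned everywhere, which is the kind of uninformative tag the pruning step is meant to remove.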
We then created a training set that included topic vectors of new items and user items
within a specific month. From the training set, we determined a similarity weight for
the specific time period of the group. To determine the similarity weight, we analyzed
500 user photos chosen as favorites between June 2008 and May 2009, and considered
the photos chosen in June 2009 to be the current favorite photos. When we applied our
recommendation system to 3,821 users to test our system using the dataset, we found
that the ratio of positive photos (labeled as recommended) and negative photos (labeled
as non-recommended) of the 186,408 photos that we evaluated was 1:5.
4.3.2 Precision Evaluation of Static and Dynamic Systems
As the item space in social networks is almost infinite, it is impossible to retrieve all
possible items in which users may be interested. However, it is possible and valuable to
recommend items that exactly match users' interests by considering the precision results
of the approaches in which recall is greater than 0.1. To compare the approaches by
precision rate, we first recommend photos without using any weights, and then
recommend photos using group similarity weights.
Figure 4.3: Precision Rates over Different Settings
Figure 4.3 displays the precision results of the two recommendation approaches.
The X-axis denotes the threshold for the similarity between the new photo and the user's
photo. For example, with a threshold of 0.3, the new photo is recommended to the user
if its similarity with the user's most similar previous photo is greater than 0.3. The
Y-axis denotes the precision rate for the recommendation. Each column denotes a
recommendation approach. The first column denotes the static recommendation and the
second column denotes the dynamic recommendation. The precision rates of the static
recommendation are 0.61, 0.69, 0.73, and 0.76 given the thresholds of 0.3, 0.4, 0.5, and
0.6, respectively. The precision rates of the recommendation using group weights are
0.75, 0.80, 0.81, and 0.83 for the same thresholds. The dynamic recommendation
system outperforms the static recommendation system, which supports our hypothesis
that accounting for changes in user interests over time increases the precision of the
recommendations made by a system.
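For concreteness, the precision rate at a given threshold can be computed as in the sketch below, where each candidate photo carries its best similarity to the user's previous photos and a flag indicating whether the user actually chose it as a favorite. The data and the name `precision_at_threshold` are hypothetical.

```python
def precision_at_threshold(scored, threshold):
    """Share of recommended items that are relevant.

    scored is a list of (similarity, is_relevant) pairs; an item is
    recommended when its similarity is greater than the threshold.
    """
    recommended = [relevant for similarity, relevant in scored if similarity > threshold]
    if not recommended:
        return 0.0
    return sum(recommended) / len(recommended)

scored = [(0.9, True), (0.7, False), (0.45, True), (0.2, False)]
p_low = precision_at_threshold(scored, 0.3)    # three items recommended, two relevant
p_high = precision_at_threshold(scored, 0.6)   # two items recommended, one relevant
```

Raising the threshold shrinks the recommended set, so precision can move in either direction on a small sample like this one; the reported rates are computed over the full test set.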
4.3.3 Top-K Precision Evaluation of Static and Dynamic Systems
In this section, we adopt another evaluation measure: the top-k precision rate. We
sort the recommended images for each user by their similarities, pick the top-k images,
and calculate the precision rate from those top-k images for each user. In a real system,
users do not want a huge number of recommendations; they want only a small number.
Under this assumption, adopting the top-k precision rate is reasonable.
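A per-user top-k precision can be sketched as follows. The data is hypothetical, and averaging the per-user values into a single rate is an assumption of this example rather than something stated above.

```python
def top_k_precision(scored, k):
    """Precision among the k highest-scoring recommended items of one user.

    scored is a list of (similarity, is_relevant) pairs.
    """
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    if not top:
        return 0.0
    return sum(1 for _, relevant in top if relevant) / len(top)

def mean_top_k_precision(scored_by_user, k):
    """Average the per-user top-k precision over all users (assumed aggregation)."""
    values = [top_k_precision(scored, k) for scored in scored_by_user.values()]
    return sum(values) / len(values) if values else 0.0

scored_by_user = {
    "u1": [(0.9, True), (0.7, False), (0.45, True), (0.2, False)],
    "u2": [(0.8, True), (0.6, True)],
}
p_at_1 = mean_top_k_precision(scored_by_user, 1)
p_at_2 = mean_top_k_precision(scored_by_user, 2)
```

If a user has fewer than k recommended items, the example divides by the number of items actually available instead of by k.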
Figure 4.4 displays the top-k precision results of the two recommendation approaches.
The diagram contains four graphs, which show the top-k precision results for each
threshold: 0.3, 0.4, 0.5, and 0.6. The X-axis denotes the top-k number; the slots denote
top-1, top-2, top-5, and top-10, respectively. The Y-axis denotes the precision rate. For
a similarity threshold of 0.5, the top-k precision rates for dynamic recommendation are
0.69, 0.70, 0.76, and 0.80, while the top-k precision rates for static recommendation are
0.62, 0.65, 0.72, and 0.74, respectively. The precision rates of the dynamic
recommendation are also higher than those of the static recommendation for the other
similarity thresholds (0.3, 0.4, and 0.6). These graphs also confirm that the dynamic
recommendation outperforms the static recommendation.
Figure 4.4: Similarity Weights over Time
4.3.4 Rank Relevance
As mentioned in Section 4.2, a better listing, in which relevant items are ranked
higher and non-relevant items are ranked lower, improves the convenience of the
recommendation system. To improve the list of recommended items for each user, we
use the RankBoost [7] algorithm, which is one of the learning to rank (LETOR)
algorithms. To evaluate the effectiveness of LETOR algorithms in our approach, we
employ the normalized discounted cumulative gain (NDCG). NDCG [16] was originally
a measure of the effectiveness of Web search engine algorithms. Using a graded
relevance scale of documents in a search engine result set, NDCG measures the
usefulness, or gain, of a document based on its position in the result list. The gain is
accumulated from the top of the result list to the bottom, with the gain of each result
discounted at lower ranks. In our case, an item is a document and one k-item
recommendation is the result list.
\[
\mathrm{NDCG} \propto \sum_{i} \frac{2^{\mathrm{rel}(i)} - 1}{\log(1 + i)}
\tag{4.10}
\]
, where $i$ is the position of the item in the list and $\mathrm{rel}(i)$ is the relevance of
the item at position $i$. In our case, we use binary relevance, 0 or 1.
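Equation (4.10) gives NDCG up to a normalizing constant. The sketch below assumes the usual normalization, in which the discounted gain of a list is divided by that of the ideal ordering of the same items; the relevance lists and function names are hypothetical. The base of the logarithm cancels in the ratio.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: sum of (2^rel(i) - 1) / log(1 + i), i from 1."""
    return sum((2 ** rel - 1) / math.log(1 + i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances, k):
    """NDCG@k: DCG of the first k items divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0

# Binary relevance of a recommended list, in ranked order.
ranked = [1, 0, 1, 0, 0]
score = ndcg(ranked, k=5)
```

A list that already places all relevant items first scores 1.0, and any relevant item pushed to a lower position reduces the score.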
Figure 4.5: NDCG over Different Settings
Figure 4.5 displays the NDCG results of the two recommendation approaches. The
diagram contains two graphs, which show the NDCG result for a 5-item list (NDCG@5)
and the NDCG result for a 10-item list (NDCG@10), respectively. The Y-axis denotes
the NDCG rate. Each column denotes a recommendation approach, one with the LETOR
algorithm and the other without it. For NDCG@5, the results for the recommendation
with and without LETOR do not differ much. This is because the list has only 5 items,
so there is not enough room for re-ranking to improve relevance. For NDCG@10, the
recommendation with LETOR outperforms the recommendation without LETOR by
0.08. The recommendation with LETOR improves the rank relevance of recommended
item lists as the length of the recommended item list increases.
4.4 Conclusion
The recommendation systems used within social networks must address the phe-
nomenon that user interests change over time. We addressed this phenomenon by devel-
oping and testing a recommendation system that matches user and group interests over
time by topics extracted by tags associated with items to make recommendations. The
data in our approach consists of tuples of a user, a set of favorite items, and associated
tags. As manipulating a dataset is computationally difficult due to its inherent noisiness
and large feature space, we pre-processed data by calculating tf-idf weights for each tag
of each user and, after sorting the tags by tf-idf weight, retained only those tags with
high weights for making recommendations. Although this preprocessing reduced the
number of tags, too many tags remained to make recommendations. To further reduce
the complexity and model items by topics over time, we applied the LDA model to
identify latent topics from items and tags, which allowed us to determine topic-user
and topic-tag distributions over time. After grouping the users by topic-user
distributions and evaluating the similarity weights given the groups and time, we
employed the similarity weights to calculate the level of similarity between users' items
and new items for making recommendations. Our results proved promising. When we
compared the precision rates on test data for the two systems (a static system not using
weights and a dynamic system using group similarity weights over time), we found that
the dynamic system improved the precision rate.
In conclusion, we proposed an approach for item recommendation by examining tag
vocabulary over time. Our results demonstrated that tags can serve as useful keys in
identifying user preferences and that gaining understanding of trends in user interests
over time is essential for better recommendation results. We plan to extend our work in
the future by adding more temporal aspects of item recommendation, such as identifying
periodic or seasonal trends and extending our focus from discovering monthly trends to
discovering daily trends, to provide more up-to-date recommendations for users.
We hypothesized that tag distributions change over time and that we can use them
for dynamic item recommendation with higher precision rates than static item
recommendation in social networks. We confirmed that the distribution changes over
time by showing that the similarity weight changes over time. We also confirmed that
dynamic item recommendation outperforms static item recommendation in terms of
precision rate. These empirical results support our hypotheses.
This tag-based item recommendation approach is useful in other social network
domains. In social networks, it is hard to quantitatively analyze items such as images
or videos in order to extract information. In this case, the system needs additional
information to analyze the content of the item. Tags are a good candidate for this
additional information: a tag describes the item, and we can infer the content of the
item from its associated tags. In this research, our empirical domain is photo
recommendation, but the approach can be extended to video or audio recommendation,
where the characteristics of items are not easily identified quantitatively.
Chapter 5
Discussion
This thesis has investigated two uses of tagging in social media: search and recom-
mendation. Also, we have discussed which extra information can reduce the ambiguity
of tagging and consequently improve search and recommendation performance. More
specifically, we have discussed how extra information, such as location, can be related
to tagging. Also, we have suggested that tag-based recommendations can be improved
by adopting extra information, such as time.
We have shown how tag similarity has a strong relationship with geographical dis-
tribution similarity (GDS). To do this, we first calculated the tag similarities from tag
pairs. Then, we calculated geographical clusters for each tag. From those geographi-
cal clusters, we computed the GDSes for tag pairs. Next, we introduced the weighted
tag similarity and the weighted GDS to reveal the relationship between tag similarity
and GDS. Using those two weighted similarities, we found a linear relationship on a
log-log scale. This result shows that one similarity increases as the other similarity
increases.
The recommendation systems used within social networks must address the fact
that user interests change over time. We addressed this phenomenon by developing
and testing a system that makes recommendations by matching user and group inter-
ests over time by topics extracted from tags associated with certain items. The data in
our approach consisted of tuples of a user, a set of favorite items, and associated tags.
Because manipulating a dataset is computationally difficult due to its inherent noisiness
and large feature space, we preprocessed data by calculating tf-idf weights for each tag
of each user and, after sorting the tags by tf-idf weight, retained only those tags with
high weights for making recommendations. Although this preprocessing reduced the
number of tags, too many tags remained to make recommendations. To further reduce
the complexity and to model items by topics over time, we used the LDA model to
identify latent topics from items and tags, which allowed us to determine topic-user and
topic-tag distributions over time. After grouping the users by topic-user distributions
and evaluating the similarity weights given the groups and time, we used the similarity
weights to calculate the level of similarity between users' items and new items for
making recommendations. Our results proved promising. When we compared the precision
rates on test data with the different systems (a static system not using weights and a
dynamic system using group similarity weights over time), we found that the dynamic
system using group similarity weights over time improved the precision rate.
5.1 Contributions
The main contributions of this thesis are the discovery of a relationship between tag
and location information (geotag) for better search results, and the discovery of a rela-
tionship between tag and temporal information for better recommendations. Search and
recommendation functionality are designed to locate appropriate information for users.
Location and temporal information are key to reducing the inherent ambiguity of tags.
The contribution of this thesis is to discover the relation between tags and other
information (e.g., location information). This relationship reduces the ambiguity of tags
and improves tag-based searching. We have shown that tags with similar meanings
are similar in terms of location. Although the same tags can be located in different
places, those tags have different semantic meanings. When tags can track the location
of content, we can suggest similar tags based on that location.
This thesis also suggests that tags can be an indicator of user preferences over
information. We are able to model user preferences based on tags, but this method has
problems. The vast number of tags makes the data high-dimensional and tag-based
modeling less scalable. Also, the inherent ambiguity of tags makes tag-based modeling
less accurate. To resolve these problems, user preferences are represented as topics by
using topic modeling. The topics are weighted over time to reflect changes in preference
over time.
5.2 Future Work
This thesis has shown the relationship between tag and locational information (geotag)
for better searching. This suggests adopting geotags to resolve tag ambiguity. Also, we
showed how temporal information can be used to identify a burst in the popularity of
certain topics or information. In addition, we have shown how tagging can be used to
indicate user interests and preferences.
This thesis focuses on the use of tagging in social media, but the underlying concepts
are not confined to tagging. The underlying concept is to implement user-generated text
to categorize, search for, and recommend information. Other user-generated textual
information besides tags can be used for this. Tweets, which use a limited number
of text characters, are a good example. Information categorization, search functional-
ity, and recommendations based on user-generated information in social media are an
important part of social media research.
We have discovered user preferences based on tags related to the users, and this
analysis of user preference is the basis of the observed behavior. User preference is not
shown explicitly, but is revealed implicitly through the users' behavior, such as tagging.
This tag-based research is closely related to user behavior analysis, which is a promising
research domain given the exponential growth of social media.
Bibliography
[1] ARTHUR, D., AND VASSILVITSKII, S. k-means++: the advantages of careful
seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Dis-
crete algorithms (Philadelphia, PA, USA, 2007), SODA ’07, Society for Industrial
and Applied Mathematics, pp. 1027–1035.
[2] BEGELMAN, G. Automated tag clustering: Improving search and exploration in
the tag space. In Proc. of the Collaborative Web Tagging Workshop at WWW06
(2006).
[3] BLEI, D. M., NG, A. Y., AND JORDAN, M. I. Latent dirichlet allocation. J.
Mach. Learn. Res. 3 (2003), 993–1022.
[4] BROOKS, C. H., AND MONTANEZ, N. Improved annotation of the blogosphere
via autotagging and hierarchical clustering. In Proceedings of the 15th interna-
tional conference on World Wide Web (New York, NY , USA, 2006), WWW ’06,
ACM, pp. 625–632.
[5] CAO, Z., QIN, T., LIU, T.-Y., TSAI, M.-F., AND LI, H. Learning to rank:
from pairwise approach to listwise approach. In ICML ’07: Proceedings of the
24th international conference on Machine learning (New York, NY , USA, 2007),
ACM, pp. 129–136.
[6] DE GEMMIS, M., LOPS, P., SEMERARO, G., AND BASILE, P. Integrating tags in
a semantic content-based recommender. In RecSys ’08: Proceedings of the 2008
ACM conference on Recommender systems (2008), pp. 163–170.
[7] FREUND, Y., IYER, R., SCHAPIRE, R. E., AND SINGER, Y. An efficient boosting
algorithm for combining preferences. J. Mach. Learn. Res. 4 (2003), 933–969.
[8] GOLDBERG, D., NICHOLS, D., OKI, B. M., AND TERRY, D. Using collaborative
filtering to weave an information tapestry. Commun. ACM 35, 12 (1992), 61–70.
[9] GOLDER, S., AND HUBERMAN, B. A. The structure of collaborative tagging
systems, Aug 2005.
[10] GRIFFITHS, T. L., AND STEYVERS, M. Finding scientific topics. Proceedings
of the National Academy of Sciences of the United States of America 101, Suppl 1
(April 2004), 5228–5235.
[11] GUAN, Z., WANG, C., BU, J., CHEN, C., YANG, K., CAI, D., AND HE, X.
Document recommendation in social tagging services. In WWW ’10: Proceedings
of the 19th international conference on World wide web (2010), pp. 391–400.
[12] GUO, Y., AND JOSHI, J. B. Topic-based personalized recommendation for col-
laborative tagging system. In HT ’10: Proceedings of the 21st ACM conference on
Hypertext and hypermedia (2010), pp. 61–66.
[13] HANANI, U., SHAPIRA, B., AND SHOVAL, P. Information Filtering: Overview
of Issues, Research and Systems. User Modeling and User-Adapted Interaction
11 (2001), 203–259.
[14] HERBRICH, R., GRAEPEL, T., AND OBERMAYER, K. Large margin rank bound-
aries for ordinal regression. MIT Press, Cambridge, MA, 2000.
[15] HEYMANN, P., AND GARCIA-MOLINA, H. Collaborative creation of commu-
nal hierarchical taxonomies in social tagging systems. Technical Report 2006-10,
Stanford InfoLab, April 2006.
[16] JÄRVELIN, K., AND KEKÄLÄINEN, J. Cumulated gain-based evaluation of IR
techniques. ACM Trans. Inf. Syst. 20 (October 2002), 422–446.
[17] JOACHIMS, T. Optimizing search engines using clickthrough data. In KDD ’02:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining (New York, NY , USA, 2002), ACM, pp. 133–142.
[18] JONES, K. S. A statistical interpretation of term specificity and its application in
retrieval. Journal of Documentation 28 (1972), 11–21.
[19] KENNEDY, L., NAAMAN, M., AHERN, S., NAIR, R., AND RATTENBURY, T.
How flickr helps us make sense of the world: context and content in community-
contributed media collections. In Proceedings of the 15th international conference
on Multimedia (New York, NY , USA, 2007), MULTIMEDIA ’07, ACM, pp. 631–
640.
[20] LEE, K. J. What goes around comes around: an analysis of del.icio.us as social
space. In Proceedings of the 2006 20th anniversary conference on Computer sup-
ported cooperative work (New York, NY , USA, 2006), CSCW ’06, ACM, pp. 191–
194.
[21] LIANG, H., XU, Y., LI, Y., NAYAK, R., AND WENG, L.-T. Personalized rec-
ommender systems integrating social tags and item taxonomy. In WI-IAT ’09:
Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web
Intelligence and Intelligent Agent Technology (2009), pp. 540–547.
[22] LIN, D. An information-theoretic definition of similarity. In Proceedings of
the Fifteenth International Conference on Machine Learning (San Francisco, CA,
USA, 1998), ICML ’98, Morgan Kaufmann Publishers Inc., pp. 296–304.
[23] MARLOW, C., NAAMAN, M., BOYD, D., AND DAVIS, M. Ht06, tagging paper,
taxonomy, flickr, academic article, to read. In HYPERTEXT ’06: Proceedings of
the seventeenth conference on Hypertext and hypermedia (New York, NY , USA,
2006), ACM Press, pp. 31–40.
[24] PANTEL, P., AND LIN, D. Discovering word senses from text. In Proceedings of
the eighth ACM SIGKDD international conference on Knowledge discovery and
data mining (New York, NY , USA, 2002), KDD ’02, ACM, pp. 613–619.
[25] RICHARDSON, M., PRAKASH, A., AND BRILL, E. Beyond pagerank: machine
learning for static ranking. In Proceedings of the 15th international conference on
World Wide Web (New York, NY , USA, 2006), WWW ’06, ACM, pp. 707–715.
[26] SALAKHUTDINOV, R., AND MNIH, A. Probabilistic matrix factorization. In
Advances in Neural Information Processing Systems (2008), vol. 20.
[27] SCHMITZ, P. Inducing ontology from flickr tags. In Proceedings of the Workshop
on Collaborative Tagging at WWW2006 (Edinburgh, Scotland, May 2006).
[28] SEN, S., VIG, J., AND RIEDL, J. Tagommenders: connecting users to items
through tags. In WWW ’09: Proceedings of the 18th international conference on
World wide web (2009), pp. 671–680.
[29] SI, X., AND SUN, M. Tag-lda for scalable real-time tag recommendation. Journal
of Computational Information Systems (2009).
[30] SIERSDORFER, S., AND SIZOV, S. Social recommender systems for web 2.0
folksonomies. In HT ’09: Proceedings of the 20th ACM conference on Hypertext
and hypermedia (2009), pp. 261–270.
[31] SIGURBJÖRNSSON, B., AND VAN ZWOL, R. Flickr tag recommendation based
on collective knowledge. In Proceeding of the 17th international conference on
World Wide Web (New York, NY, USA, 2008), WWW ’08, ACM, pp. 327–336.
[32] SONG, Y., ZHUANG, Z., LI, H., ZHAO, Q., LI, J., LEE, W.-C., AND GILES,
L. C. Real-time automatic tag recommendation. In SIGIR ’08: Proceedings of the
31st annual international ACM SIGIR conference on Research and development in
information retrieval (New York, NY, USA, 2008), ACM, pp. 515–522.
[33] STEYVERS, M., SMYTH, P., ROSEN-ZVI, M., AND GRIFFITHS, T. Probabilistic
author-topic models for information discovery. In KDD ’04: Proceedings of the
tenth ACM SIGKDD international conference on Knowledge discovery and data
mining (2004), pp. 306–315.
[34] SYMEONIDIS, P., NANOPOULOS, A., AND MANOLOPOULOS, Y. Tag recommendations
based on tensor dimensionality reduction. In RecSys ’08: Proceedings
of the 2008 ACM conference on Recommender systems (New York, NY, USA,
2008), ACM, pp. 43–50.
[35] TSO-SUTTER, K. H. L., MARINHO, L. B., AND SCHMIDT-THIEME, L. Tag-
aware recommender systems by fusion of collaborative filtering algorithms. In
SAC ’08: Proceedings of the 2008 ACM symposium on Applied computing (2008),
pp. 1995–1999.
[36] WANG, X., AND MCCALLUM, A. Topics over time: a non-markov continuous-
time model of topical trends. In KDD ’06: Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery and data mining
(2006), pp. 424–433.
[37] WEI, X., AND CROFT, W. B. Lda-based document models for ad-hoc retrieval. In
SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference
on Research and development in information retrieval (2006), pp. 178–185.
[38] WEINBERGER, K. Q., SLANEY, M., AND VAN ZWOL, R. Resolving tag ambiguity.
In Proceeding of the 16th ACM international conference on Multimedia (New
York, NY, USA, 2008), MM ’08, ACM, pp. 111–120.
[39] XU, J., AND LI, H. Adarank: a boosting algorithm for information retrieval. In
Proceedings of the 30th annual international ACM SIGIR conference on Research
and development in information retrieval (New York, NY, USA, 2007), SIGIR ’07,
ACM, pp. 391–398.
[40] ZHEN, Y., LI, W.-J., AND YEUNG, D.-Y. Tagicofi: Tag informed collaborative
filtering. In Proceedings of the 3rd ACM Conference on Recommender Systems
(RecSys ’09) (New York City, New York, USA, Oct. 22–25 2009), pp. 69–76.
Abstract
Social media, unlike traditional media, facilitate direct and real-time interaction among users. The increase in interaction results in massive amounts of data, which require appropriate categorization so that users can find the specific information they need. Traditional categorization by a few moderators cannot handle the massive amount of information being generated. Instead, tags, which are free-format keywords or terms created by users to describe content, are best able to categorize information in social media and to cope with the large amount of data in a timely manner. Each tag may not perfectly describe the content, but the aggregated tags reflect the knowledge of multiple users and result in a taxonomy for categorization.
This categorization method can locate appropriate information for users via search and recommendation functions. Search functions allow users to locate appropriate information within the entire dataset, whereas recommendation functions have the system actively suggest appropriate information to the user. The performance of search and recommendation functions is degraded by the ambiguity inherent in the free format of tagging. To resolve this ambiguity and to produce better search and recommendation results, extra information, such as time and location, should be used. In this thesis, we describe how tags and extra information can be used for search and recommendation functions.
We present an analysis of the correlation between tags and geographical identification metadata, or geotags. To make the analysis of geotagging and tagging possible, we prove that there is a strong correlation between these two types of information. Our approach uses similarity between tags and geographical distribution to determine interrelationships between tags and geotags. From our initial experiments, we show that a power law holds between tag similarity and geographical distribution similarity.
We also present a recommendation system that uses a modified latent Dirichlet allocation model in which users and tags associated with an item are represented and clustered by topics, and the topic-based representation is combined with the item’s time stamp to show time-based topic distribution. By representing users via topics, the model can cluster users to reveal common interests. Based on this model, we developed a recommendation system that reflects both user and group interests in a dynamic manner that accounts for time.
This thesis contributes to the understanding of the use of tagging and improves the use of tagging in social media. In addition, this thesis provides guidance on user behavior analysis in social media.
Asset Metadata
Creator
Lee, Sang Su (author)
Core Title
Tag based search and recommendation in social media
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
10/17/2011
Defense Date
08/17/2011
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
information analysis, OAI-PMH Harvest, recommender systems, social media, social tagging, web mining
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
McLeod, Dennis (committee chair), Nakano, Aiichiro (committee member), Pryor, Lawrence (committee member)
Creator Email
magnusyi@gmail.com,sangsl@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c127-660743
Unique identifier
UC1391205
Identifier
usctheses-c127-660743 (legacy record id)
Legacy Identifier
etd-LeeSangSu-342-0.pdf
Dmrecord
660743
Document Type
Dissertation
Rights
Lee, Sang Su
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu