STATISTICAL APPROACHES FOR INFERRING CATEGORY KNOWLEDGE FROM SOCIAL ANNOTATION

by Anon Plangprasopchok

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2010

Copyright 2010 Anon Plangprasopchok

Dedication

To Mom and Dad for their tremendous amount of support throughout my life.

Acknowledgments

I would especially like to thank my thesis advisor, Prof. Kristina Lerman, who has constantly given me an enormous amount of support over my past five years at ISI. Throughout my Ph.D. study, I couldn't really count how many times she repeatedly taught and explained ideas, due to my ignorance, disobedience, and language barrier. I also appreciate her availability to all of her advisees. My academic siblings and I always know that when we have problems or ideas that need to be discussed, she is always there to listen and give constructive comments. I enjoyed our meetings, as they usually generated many good ideas. I also felt very comfortable doing my research, thanks to the freedom she gave, even though I got many "yellow lights" on some digressive thoughts from time to time. One of her quotes that I like the most is that "if you can work out a problem yourself, then the machine also can." This empiricist quote is still in my head to remind me that I should understand the problem (and the data) well enough, instead of rushing to develop a solution without sufficient insight. In the "drought" year in which none of my papers was accepted, Prof. Lerman kept encouraging me to think positively, believe in my work, and keep strengthening it. This advice made me feel that there was, at least, one other person in the research community who still supported my research. Moreover, I am grateful for her patience in editing and re-organizing my manuscripts; I have learned a lot through the examples she made. Advising a graduate student is not an easy task, especially when the student is your first. Nevertheless, I feel that she did a fantastic job. If I have the chance to advise students, I hope I will do for them the things that she did for me. Thank you, Prof. Lerman.

Many thanks to my thesis committee members. In particular, I would like to thank Prof. Michael A. Arbib, who kindly guided me during my Master's study. I also appreciate his thoughtful comments and thorough inspections of my proposal and thesis. I am grateful to Prof. Craig A. Knoblock, who brought me to the Information Integration group at ISI. He urged me to be aware of evaluations, and to think hard about the big picture of my research. I thank Prof. Daniel E. O'Leary for the useful pointers on ontology matching and his constructive comments on future directions. I appreciate Prof. Fei Sha's remarks on my early work, which influenced my present folksonomy learning work. Lastly, I would like to express my gratitude to Prof. Lise Getoor from the University of Maryland. She provided a lot of useful pointers and guided me in developing a research strategy and conveying ideas through research papers.

I would like to mention my past and present ISI mates: Martin, Matt, Yao-Yi, Rumi, Shubham, Sudadej, Jeon-Hyung and Aman, for all the smiles, laughs and other funny stuff. I will never forget the cheerful working environment they created. Last but not least, I would like to thank my family: Mom, Dad and my sister.
I particularly would like to thank my Mom who has been giving me courage everyday to overcome all hardships. She keeps telling me about her dream of obtaining the highest iv degree abroad, butunfortunately it has never happened. Therefore, I would like dedicate this thesis to fulfill her dream. v Table of Contents Dedication ii Acknowledgments iii List Of Tables ix List Of Figures xi Abstract xvi Chapter 1: Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.5 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Chapter 2: Learning Concepts from Social Annotation 11 2.1 Social Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Modeling Social Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3 Finite Interest Topic Model . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4.1 Evaluation on Synthetic Data . . . . . . . . . . . . . . . . . . . . . 25 2.4.2 Real-World Data: Resource Discovery Task . . . . . . . . . . . . . 32 2.5 Infinite Interest Topic Model . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5.1 Evaluation on Synthetic Data . . . . . . . . . . . . . . . . . . . . . 46 2.5.2 Evaluation on Real-World Data . . . . . . . . . . . . . . . . . . . . 47 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Chapter 3: Learning Concept Hierarchies: Folksonomy Learning Problem and Evaluation Metrics 53 3.1 Hierarchical Relations in Social Annotation Systems . . . . . . . . . . . . 53 3.1.1 Personal Hierarchies in Flickr . . . . . . . . . . . . . . . . . . . . . 55 3.2 Folksonomy Learning Definition . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3 Challenges in Learning Folksonomies from Social Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 vi 3.3.1 Sparseness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.3.2 Noisy vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.3.3 Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.3.4 Structural noise and conflicts . . . . . . . . . . . . . . . . . . . . . 62 3.3.5 Varying granularity level . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.4 Folksonomy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.4.1 Taxonomy and Ontology Evaluation: An Overview . . . . . . . . . 64 3.4.2 Lexical Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.4.3 Modified Taxonomic Overlap . . . . . . . . . . . . . . . . . . . . . 65 3.4.4 Structural Metrics: Area Under Tree . . . . . . . . . . . . . . . . . 68 Chapter 4: An Incremental Approach to Folksonomy Learning 71 4.1 An Incremental Approach to Learn Folksonomies . . . . . . . . . . . . . . 71 4.1.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.1.2 Relational Clustering of Structured Annotation . . . . . . . . . . . 76 4.1.2.1 Local Similarity . . . . . . . . . . . . . . . . . . . . . . . 77 4.1.2.2 Structural Similarity . . . . 
. . . . . . . . . . . . . . . . . 78 4.1.3 SAP: Growing a Folksonomy by Merging Saplings . . . . . . . . . 81 4.1.3.1 Handling Shortcuts . . . . . . . . . . . . . . . . . . . . . 82 4.1.3.2 Handling Loops . . . . . . . . . . . . . . . . . . . . . . . 84 4.1.3.3 Mitigating Other Structural Noise . . . . . . . . . . . . . 84 4.1.3.4 Mitigating Noisy Vocabularies . . . . . . . . . . . . . . . 85 4.1.3.5 Managing Complexity . . . . . . . . . . . . . . . . . . . . 85 4.1.4 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.2 Empirical Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.2.1 Baseline Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.2.2.1 Evaluation against the reference hierarchy. . . . . . . . . 91 4.2.2.2 Structural evaluation . . . . . . . . . . . . . . . . . . . . 93 4.2.2.3 Manual evaluation . . . . . . . . . . . . . . . . . . . . . . 93 4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 Chapter 5: A Structure Learning Approach to Folksonomy Learning 99 5.1 Structure Learning by Integrating Many Small Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.1.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.2 Probabilistic Integration of Structured Data . . . . . . . . . . . . . . . . . 103 5.2.1 Affinity Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2.1.1 I-Constraint . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.2.1.2 E-Constraint . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.2.2 Expressing Structure through Similarity . . . . . . . . . . . . . . . 114 5.2.2.1 Structural Similarity with Cluster Labels . . . . . . . . . 115 5.2.2.2 Negative Similarity . . . . . . . . . . . . . . . . . . . . . 116 vii 5.2.3 ExpressingStructurethroughConstraints: RelationalAffinityProp- agation (RAP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.2.4 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . 123 5.3 Evaluation on Real-World Data . . . . . . . . . . . . . . . . . . . . . . . . 124 5.3.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . 126 5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Chapter 6: Related Work 134 6.1 Social Annotation Characteristics . . . . . . . . . . . . . . . . . . . . . . . 134 6.2 Extracting Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 6.3 Learning Conceptual Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . 139 Chapter 7: Conclusions 144 7.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 7.2 Application Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 7.2.1 Recommendation and Personalization . . . . . . . . . . . . . . . . 145 7.2.2 Community Detection . . . . . . . . . . . . . . . . . . . . . . . . . 146 7.2.3 Learning Complex Structures from Structured Data . . . . . . . . 146 7.3 Directions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 147 7.3.1 Interest Topic Model . . . . . . . . . . . . . . . . . . . . . . . . . . 147 7.3.2 Folksonomy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 
148 7.3.3 Relational Affinity Propagation . . . . . . . . . . . . . . . . . . . . 149 7.4 Closing Remark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 Bibliography 151 Appendices: Appendix A: Derivation of Gibbs Sampling Formula . . . . . . . . . . . . . . . 157 Appendix B: SIG: Learning Folksonomies from User-Specified Relations . . . . 162 viii List Of Tables 2.1 The table presents statistics for five data sets for evaluating models’ per- formance. Note that a triple is a resource, user, and tag co-occurrence. . . 34 4.1 Parameters of the folksonomy learning approach. . . . . . . . . . . . . . . 89 4.2 This table presents empirical validation on folksonomies induced by the proposed approach, sap, comparing to the baseline approach, sig. The first column group presents properties of the whole induced trees: the number of leaves and Area Under Tree (AUT). The second column group reports the quality of induced trees, relatively to the ODP hierarchy. The metrics in this group are modified Taxonomic Overlap (mTO) (averaged using harmonic mean), Lexical Recall (LR), where their scales are ranging from0.0to1.0(themorethebetter),asAUTiscomputedfromportionsof the trees, which are comparable to ODP. “#ovlp lvs” stands for a number of overlap leaves (to ODP). Thelast column group reportsperformanceon manually labeled portions of the trees, which do not occur in ODP. . . . . 94 4.3 The table lists all incorrect paths caused by possibly ambiguous nodes, which are in bold. Note that all node names are stemmed. . . . . . . . . . 97 5.1 The table compares the performance of (a) AP and (b) RAP, when using different similarity schemes on various metrics. The numbers show the average ranks across all 32 seeds. The lower rank, the better performance. 127 5.2 The table compares the performance between AP and RAPwhen using (a) local, (b) hybrid and (c) class-hybrid similarity on various metrics. . . . . 129 5.3 The table compares the performance on mTO of the proposed approach, RAP with local similarity scheme, to SAP, described in Chapter 4. The tablealsoreportsanumberofcomparablepaths, #OPaths tothereference hierarchies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 ix 5.4 The table compares the performance on LR and AUT of the proposed approach, RAP with local similarity scheme, to SAP, described in the previous chapter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 x List Of Figures 1.1 Screenshots of social annotation systems, to which I mostly refer in this thesis: (a) Delicious allowing users to annotate Web resources with tags; (b) Flickr allowing users to organize their photos using tags and descrip- tion. Basic entities involved in social annotation systems are (1) content, e.g., Web pages in Delicious and photos in Flickr; (2) users; (3) annotation. 4 1.2 Screenshots of folders specified by users in social annotation systems: (a) Delicious allowing users to “bundle” related tags together; (b) Flickr per- mitting users to organize their photos into a set (album) and related sets can also be grouped into a collection (super album). . . . . . . . . . . . . 5 2.1 Schematic diagram represents relations between users, resources and tags through posts or bookmarks in social tagging systems. A bookmark is comprised of a user, a resource and a set of tags that the user uses for annotating the resource. . . . . . . . . . . . . . . . . . . . . . . . . . . . 
14 2.2 Schematic diagrams represent: (a) tag generation process in social anno- tation domain; (b) word generation process in document modeling domain. 15 2.3 Graphical representation of the social annotation process. R,U,T,X and Z denote variables “Resource”, “User”, “Tag”, “Interest” and “Topic” respectively. ψ, φ and θ are distributions of user over interests, resource over topics and interest-topic over tags respectively. N t represents the number of tag occurrences for one bookmark (by a particular user, u, on a particular resource, r); D represents the number of all bookmarks in the social annotation system. The hyperparameters α, β, and η are the parameters for the priors of φ, ψ and θ. . . . . . . . . . . . . . . . . . . . 17 2.4 Deviations, Delta(Δ), between actual and learned topics on synthetic data sets for different regimes: (a) high tag ambiguity; (b) low tag ambiguity; (c) high interest spread; (d) low interest spread. LDA(10) and LDA(30) referstoLDAthatistrainedwith10and30topicsrespectively; ITM(10,3) refers to ITM that is trained with 10 topics and 3 interests. . . . . . . . . 29 xi 2.5 Thisplot showsthe deviation Δ between actual and learned topics on syn- thetic data sets, under different degrees of tag-ambiguity and user interest variation. The Δ of LDA is shown on the left (a); as that of ITM is on the right (b). The colors were automatically generated by the plotting program to improve readability. . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6 This diagram summarizes the evaluation procedure for comparing LDA and ITM on a resource discovery task. . . . . . . . . . . . . . . . . . . . . 35 2.7 Performanceofthedifferentmodelsonthefivedatasets. X-axisrepresents the numberofretrieved resources; y-axis representsthenumberof relevant resources(thathavethesamefunctionastheseed). LDA(80)referstoLDA that is trained with 80 topics. ITM(80/40) refers to ITM that is trained with 80 topics and 40 interests. In wunderground case, I can only runITM with 30 interests due to the memory limits. . . . . . . . . . . . . . . . . . 36 2.8 Topicdistributionsofthreeresources: flytecomm,usatoday,bookings learned by (a) LDA and (b) ITM. φ z,r in y-axis indicates a weight of the topic z in the resource r – the degree to which r is about the topic z. . . . . . . . 40 2.9 Graphical representation on the Interest Topic model with hierarchical Dirichlet process (HDPITM). . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.10 Thisplot showsthe deviation Δ between actual and learned topics on syn- thetic data sets, under different degrees of tag-ambiguity and user interest variation. TheΔofHDPisshownontheleft(a); asthatofHDPITMison the right (b). (c) and (d) show the deviation produced by HDP+LDA and HDPITM+ITM respectively. For HDP+LDA, new topics can be instanti- ated, and thusthe numberof topics can change, duringthe firsthalf of the run (HDP); then all topics are frozen (no new topic can be instantiated) during the second half (LDA). This is similar to HDPITM+ITM where I take into account user information. See Section 2.5.1 for more detail. . . . 49 2.11 Performance of different methods on the five data sets (a) – (e). Each plot showsthenumberofrelevantresources(thataresimilartotheseed)within the top 100 results produced by HDP (non-parameteric version of LDA) and HDPITM(nonparametric version of ITM). Each modelwas initialized with 100 topics and 20 interests for HDPITM. 
(f) demonstrates the log likelihood of the HDPITM model during the parameter estimation period of flytecomm data set. Similar behavior of the plot (f) is found in both HDP and HDPITM for all data sets. . . . . . . . . . . . . . . . . . . . . . 51 3.1 Hierarchical relations appearing in social annotation systems . . . . . . . 54 xii 3.2 Personal hierarchies specified by a Flickr user. (a) Some of the collections created by the user and (b) sets associated with the Plant Pests collection, and (c) tags associated with an image in the Caterpillars set. . . . . . . . . 56 3.3 Schematic diagram presents the personal hierarchy of Figure 3.2(b) . . . . 57 3.4 Schematic diagram illustrates the interaction between a folksonomy and personal hierarchies in an animal domain. Specifically, users select small fractionsofthefolksonomytoannotateobjects,whichcanbeobservedasin many personalhierarchies. Thegoal of thisthesisisto develop folksonomy learning approaches that can infer a folksonomy from these hierarchies. . 58 3.5 SchematicdiagramsofpersonalhierarchiescreatedbyFlickrusers. (a)Am- biguity: the same term may have different meaning (“turkey” can refer to a bird or a country). (b) Conflict: users’ different organization schemes can be incompatible (china is a parent of travel in one hierarchy, but the other way around in another). (c) Granularity: users have different levels of expressiveness and specificity, and even mix different specificity levels within the same hierarchy (Scotland (country) and London (city) are both children of UK). Nodes are colored to aid visualization. . . . . . . 60 3.6 Illustrationsof(a)acorrecttreeabout“moth”,and(b)anincorrectversion of(a)where“insect”and“arctiida”(arctiidae)aremisplaced. OriginalTO will judge the trees identical. . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.7 IllustrationofcomputingAUTfromthedistributionofnodesateachdepth of a tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.8 Examples of different tree shapes with their AUT. . . . . . . . . . . . . . 70 4.1 Some examples of personal hierarchies about “bird” in Flickr. Most of the hirarchies contain a small number of children. All node names are normalized using Porter stemming algorithm (Porter, 1980). . . . . . . . . 72 4.2 Two personal hierarchies about “victoria” concepts from different users. These “victoria”s are different since their local and structural information is different. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.3 AppearanceofmutualshortcutsbetweenLondonandEnglandwhenmerg- ing London and England saplings. To resolve them, I compare the simi- larity between UK-London and UK-England sapling pairs. Since England sapling is closer to UK than London sapling, I simply attach England sapling to the tree; while ignoring London leaf under UK. . . . . . . . . . 83 xiii 4.4 Comparison of performance of SAP and baseline (SIG) approaches. Bars report the numbers of the cases that one approach outperformed the other on various metrics, which are summarized from Table 4.2. The higher the better. Note that “mTO& LR & AUT” is computed fromthe intersection of superior cases in mTO, LR and AUT metrics. . . . . . . . . . . . . . . 95 4.5 Folksonomies learned for bird and sport . . . . . . . . . . . . . . . . . . 96 5.1 Illustrative examples of (a) a commonly shared conceptual categorization (hierarchy) system; (b) personal hierarchies expressed by the users based on the conceptual categorization in (a). 
For illustrative purposes, nodes with similar names have similar color. . . . . . . . . . . . . . . . . . . . . 102 5.2 TheoriginalbinaryvariablemodelforAffinityPropagationproposedbyGivoni and Frey (2009): (a) a matrixof binaryhiddenvariables (circles) and their factors (boxes); (b) incoming and outgoing messages of a hidden variable node from/to its associated factor nodes. . . . . . . . . . . . . . . . . . . . 106 5.3 Binary AP factor graph with 4 hidden variables: (a) an original view; (b) an unfolded view of (a), at which a loop can obviously be observed. . . . . 108 5.4 A mapping between actual nodes and hidden variable nodes in Relational Affinity Propagation: (a) examples of samplings; (b) a schematic diagram of the matrix of binary hidden variables (circles) of the nodes in (a). The variables within the green shade correspond to the leaf nodes in (a); while those within the pink shade corresponding to the root nodes in (a). pa(.) is a function returning a pointer to a root node of its argument. . . . . . . 117 5.5 The Relational Affinity Propagation (RAP) proposed in this chapter: (a) a schematic diagram of thematrix of binaryhiddenvariables (circles) with the new set of constraints F j , which impose on leaf nodes (their indices runningfrom1toL)ateachcolumnj. (b)incomingandoutgoingmessages of a hidden variable node from/to its associated factor nodes. There are two more messages σ and τ. Note that for (a), I omit E, I and S factors simply for the sake of clarity. . . . . . . . . . . . . . . . . . . . . . . . . . 118 5.6 Comparisonofperformanceof RAP(withlocalsimilarity)andSAPapproaches on a variety of metrics reported in Table 5.3 and Table 5.4. Bars report the number of the cases that one approach outperformed the other. The higher the better. Note that “mTO & #OPaths” is computed from the intersection between mTO and #OPaths’ superior cases. . . . . . . . . . . 130 xiv B.1 Anillustrativediagramrepresentsrelations(arrows)betweenfourconcepts (circles): anim,insect,bug,andmoth. Thenumbersrepresentthenumber of userswhoagree (disagree) on a particular relation, e.g., anim→ bug(vs bug→ anim). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 xv Abstract Social annotation captures the collective knowledge of thousands of users and can po- tentially be used to enhance an array of applications including Web search, information personalization and recommendation, and even synthesize categorical knowledge. In or- der to make best use of social annotation – annotation generated by many users, we need methods that effectively deal with the challenges of data sparseness and noise, as well as take into account inconsistency in the vocabulary, interests, and the level of exper- tise among individual users. In this thesis, I study computational approaches to learning andintegratingcategoryknowledgeintermsoftopics,concepts,andhierarchicalrelations between themfromtwopopularformsofsocialannotation: tagsandpersonalhierarchies. Learning category knowledge from tags created by many distinct users to describe objectsischallenging sincetagsnotonlyreflectobjectscategories butalsousers’interests inthetaggedobjects. Toaddressthischallenge, Iproposeaprobabilisticmodelthattakes into account variation in interest among usersto infer a more accurate topic model of the tagged objects. I explore its performance in detail on a synthetic data set and compare it to Latent Dirichlet Allocation (LDA), a popular document modeling algorithm. 
I show that in domains with high tag ambiguity, variations among users can actually help discriminate between tag senses, leading to better topics. My approach is, therefore, well suited to making sense of social annotation, since this domain is characterized both by a high degree of noise and ambiguity, and by a highly diverse user population with varied interests. Additionally, I extend the model to automatically adjust its key parameters as suggested by the data. This capability helps overcome one of the main difficulties of applying the original model to the data: namely, having to specify the right number of common topics and interests.

Structured social annotation, such as personal hierarchies, helps users organize their content. Although individual structures (broader/narrower relations between concepts) are already explicitly specified by users, learning their common complex structures in a specific form, such as a tree, is a difficult task. This is because individual users usually specify them in many different ways; therefore, the structures do not conform to one another. As the second main contribution of the thesis, I study the folksonomy learning problem, i.e., learning a common hierarchy from many small personal hierarchies. I first propose a simple, yet efficient, clustering-based method that incrementally weaves individual hierarchies into a deeper, more complete folksonomy, from its root down to its leaves. Inconsistencies are removed as the common hierarchy grows. Alternatively, I frame folksonomy learning as a generic structure learning problem: learning complex structures from many smaller ones. I develop a novel probabilistic approach, which is based on distributed inference. Thanks to structural constraints integrated into the inference procedure, the method avoids structural inconsistencies, as all individual structures are combined simultaneously.

All proposed approaches are evaluated on real-world data sets, and the experimental results demonstrate their advantages in many respects. As social annotations become more and more available, the approaches are very promising as a means to mine knowledge from social annotation that can prove useful in many applications.

Chapter 1
Introduction

In this chapter, I first provide the motivation and outline the problems that this thesis attempts to address. I then briefly describe the solutions I developed to solve those problems, followed by the outline of the thesis.

1.1 Motivation

As social Web sites such as Delicious (http://delicious.com/), Flickr (http://www.flickr.com/) and YouTube (http://www.youtube.com/) have become increasingly popular, social annotation, i.e., metadata on content of interest generated by content creators and other users, has become more available. This annotation comes in several forms, such as tags, sets, ratings, and descriptions, for describing, sharing, and organizing the content users create. A tag, as shown in Figure 1.1, is a keyword that users use to describe content. In general, users annotate a content item with multiple tags. A set, as shown in Figure 1.2, offers users a means to group related items together, similar to a folder that groups files. A description is a content note or discussion, usually in free-text format. I argue that the annotation is potentially a good source of evidence for inducing category knowledge, which is useful in many applications such as the following:
• Help users organize content they have created. Specifically, it can be utilized to create semantic directories, similar to those in the Open Directory Project (ODP, http://www.dmoz.org/), which arrange user content items based on how they hierarchically relate to each other, making them easy to browse.

• Search content. Since the annotation is generated according to users' and creators' perspectives on what the content is about, it can potentially be used to enhance content searching. Moreover, some content types, such as photos and videos, are in binary form, which cannot be semantically understood by machines at present. In such cases, this annotation is invaluable evidence for content indexing and searching.

• Recommend relevant content. The annotation can be used to infer similar content, while the tastes or interests of other users can also be learned from it. Content items annotated or created by like-minded users can then be selected as recommended items.

• Complement existing knowledge bases, such as lexical systems like WordNet (http://wordnet.princeton.edu/) and several ontologies used in semantic web applications. Since users ceaselessly and freely provide annotation on new content, the annotation is up to date and can thus be used to update ontologies or expand their scope.

• Understand how new content fits with existing content. Consequently, it can help content creators label new content appropriately, with high-quality annotation, which will in turn help in content searching and browsing. Moreover, once content is properly annotated, it can be used by information integration applications.

Although annotation from an individual user may be inaccurate and incomplete, annotations from different users may complement each other, making them, in combination, meaningful for the tasks previously described.

Since annotation is generated freely by users without any constraints, it is sparse, ambiguous, multi-faceted and inconsistent across individual users. Take tag usage, for example. The number of tags provided by a user for a certain content item is very small: approximately four to seven tags per bookmarked webpage on Delicious, according to the data set I obtained (Plangprasopchok and Lerman, 2007), and 3.74 tags per photo on Flickr, as reported in Rattenbury et al. (2007). It is multi-faceted, since each content item can be described by different topics (Golder and Huberman, 2006). One obvious example is a photo of a conifer flower that is annotated by its appearance, the location at which it was observed, and its scientific/colloquial names (as shown in Figure 1.1(b)). Moreover, the same content can be annotated according to word variations, users' disagreements, and different levels of user expertise (Golder and Huberman, 2006; Angeletou et al., 2007). For instance, some users organize photos by subject and then technique, while others use technique and then subject. Many users also use inaccurate terms to annotate content, e.g., the term "insect" is usually used to tag photos of spiders. Also, an individual tag is ambiguous (Mathes, 2004; Golder and Huberman, 2006). For instance, the term "jaguar" can be used to refer either to a mammal or to a luxury car, while "cat" and "tiger" are sometimes used to refer to the same thing.

Figure 1.1: Screenshots of social annotation systems, to which I mostly refer in this thesis: (a) Delicious, allowing users to annotate Web resources with tags; (b) Flickr, allowing users to organize their photos using tags and descriptions. Basic entities involved in social annotation systems are (1) content, e.g., Web pages in Delicious and photos in Flickr; (2) users; (3) annotation.

Figure 1.2: Screenshots of folders specified by users in social annotation systems: (a) Delicious, allowing users to "bundle" related tags together; (b) Flickr, permitting users to organize their photos into a set (album); related sets can also be grouped into a collection (super album).
Despite these difficulties, I believe that it is feasible to induce useful knowledge by carefully combining annotations generated by many different users. In this thesis, I develop several strategies to extract category knowledge from user-generated annotation. The category knowledge is in the form of concepts and their hierarchies. By a concept, I mean a group of content items and their descriptions that share some similar features. By a concept hierarchy, or taxonomy, I mean a tree that contains concepts and their broader-narrower relations. The strategies I developed, which take the previously mentioned challenges into consideration, are able to extract high-quality category knowledge. Note that a comparison between using and not using social annotation for automating tasks is not the main focus of this thesis. However, some of these tasks are utilized for evaluating the approaches.

1.2 Thesis Statement

We can infer high-quality category knowledge, in the form of concepts and their hierarchies, from social annotation by combining annotations from many individual users and utilizing statistics of such annotation.

1.3 Approach

In this section, I briefly describe my frameworks for learning concepts and for learning concept hierarchies. Both frameworks are based on statistical models for analyzing co-occurrences of content, users, annotation and its structure.

In order to extract concepts from social annotation, which has the challenges listed earlier, I develop a probabilistic generative model, which models how annotation (in this part, I mainly focus on tags) for a certain content item is generated by a certain user in a stochastic fashion. This type of approach has been widely used in several domains, for example, document modeling (Hofmann, 1999; Blei et al., 2003b), topic-authorship associations (Rosen-Zvi et al., 2004), user roles in a social network (McCallum et al., 2007) and Collaborative Filtering (Marlin, 2004; Jin et al., 2006).

Essentially, the model has two sets of hidden variables: topics and interests. Topics represent descriptive concepts of content. For example, photos of mammal jaguars associate with the wildlife concept, while photos of Jaguar cars associate with the automobile concept. Meanwhile, interests represent user interests: groups of users who share common interests in a certain set of topics. The model assumes that a user's annotation for certain content is generated according to the content's topics and the user's interests. Model parameters, the associations between hidden variables (topics and interests) and observable variables (users, content and annotation), are in the form of conditional probabilities, which can be inferred using iterative techniques, such as Expectation-Maximization (Dempster et al., 1977), and simulation techniques, such as Gibbs sampling (Gilks et al., 1996).
The numbers of topics and interests are set to be much smaller than the number of content items and the number of users, respectively. In addition, the associations between the variables are computed from co-occurrences of the observable variables. Consequently, "similar" annotations (tags) are grouped together into the same topics if they are used to annotate "similar" content items by "similar" users. Through this single framework, clustering effects appear: similar annotations are grouped into the same concepts, and different annotations are disambiguated. Hence, the tag "jaguar" that refers to a luxury car and the one that refers to a mammal are associated with different concepts, "car" and "wildlife" respectively.

In learning concept hierarchies from social annotation, also called folksonomy learning, existing approaches utilize statistics of tag co-occurrences to induce concept hierarchies (Schmitz, 2006; Heymann and Garcia-Molina, 2006; Zhou et al., 2007). As argued in one of my studies (Plangprasopchok and Lerman, 2009), approaches that learn hierarchical "broader/narrower" relations based on tag frequency alone are unable to properly distinguish between popular and general concepts. For instance, there are ten times as many images on the photosharing site Flickr tagged with "car" as with "automobile," a concept that subsumes "car." Instead, I explore statistical frameworks for aggregating many shallow hierarchies created by individual users, for example, the "Plants" and "Wild Conifers" relation in Figure 1.2(b), into a common deeper hierarchy that reflects how a community organizes knowledge.

Although hierarchical relations among concepts are explicitly specified by individual users, learning common complex hierarchies by aggregating them is very challenging due to concept ambiguity. In addition, since personal hierarchies are very shallow and generated voluntarily from heterogeneous sources, there is certainly no single, unified structure to be found. Moreover, structural inconsistencies are likely to appear when such hierarchies are combined arbitrarily. It is very likely that we would end up with an arbitrary graph rather than a tree. Therefore, we need extra machinery to avoid structural inconsistencies while combining them.

In this thesis, I develop two statistical approaches that combine personal hierarchies into a single complex tree or a few complex trees. Both approaches are driven by a similarity measure that utilizes both contextual and relational information. In the first approach, a common hierarchy is efficiently induced by incrementally weaving individual hierarchies in horizontal and vertical directions. Inconsistencies are removed as new hierarchies are attached to the learned hierarchy. In the second approach, I instead allow all personal hierarchies to be merged concurrently in a distributed manner. With a global constraint attached to the inference procedure, the approach finds a good, yet more consistent, integration.

1.4 Contributions of the Research

The key contributions of the thesis are statistical frameworks for inferring category knowledge (concepts and their hierarchies) from social annotation.
Specifically, the thesis includes the following major contributions:

• A probabilistic model for inferring concepts from social annotation;
• Two probabilistic approaches that learn complex hierarchies from structured social annotation in an incremental and a distributed manner.

Additionally, the secondary contributions include:

• An automatic approach for quantitatively evaluating the quality of learned hierarchies by comparing them to a reference taxonomy;
• A simple, yet intuitive, metric for measuring how detailed a tree's structure is in terms of depth and bushiness.

1.5 Outline of the Thesis

The remainder of this thesis is organized as follows. In Chapter 2, I first explain a proposed approach that infers concepts from social tags and evaluate it on both synthetic and real-world data on a resource discovery task. In Chapter 3, I then describe the folksonomy learning problem along with strategies and metrics for evaluating the quality of learned folksonomies. Two statistical approaches to folksonomy learning are then described in Chapter 4 and Chapter 5; the former is an incremental approach that removes inconsistencies as individual hierarchies are attached to the folksonomy, while the latter is based on distributed inference with constraints, which is generic enough for solving other structure learning problems. I then review related work in Chapter 6 and summarize the contributions and future directions in Chapter 7.

Chapter 2
Learning Concepts from Social Annotation

In this chapter, I present my first contribution, a probabilistic model for inferring concepts from the tags that users use to annotate content. I then study the behavior of the approach on synthetic data with different degrees of tag ambiguity and variation among users. The approach is also evaluated using a resource discovery task on a real-world dataset. At the end of the chapter, I also describe the extension that enables the model to automatically infer key parameters.

2.1 Social Tagging

Tagging has recently become a popular method for annotating content on the social Web. When a user tags some content, e.g., a Web resource on Delicious, a scientific paper on CiteULike, or a photo on Flickr, the user is free to select any keyword from an uncontrolled personal vocabulary. I claim that tags are evidence of the category knowledge that users use for categorizing content.

To automatically infer users' category knowledge from tags, one can exploit the fact that "similar" content items tend to be annotated with "similar" tags. For example, websites about luxury cars are usually annotated with a similar tag set, e.g., "jaguar", "bmw", "auto". Intuitively, one can infer a set of tags about a "luxury car" concept by determining the frequent tags annotating luxury-car-related websites.

Since tags from individual users are sparse and incomplete, one way to address this challenge is to aggregate them from many users. Tags from one user may complement those of other users. One straightforward aggregation is to combine all users' tags as if they came from the same user. In other words, each item is annotated by a bag of tags from all the users who annotated it. Unfortunately, such aggregation discards important information about individual interests, which can potentially help infer category concepts. Suppose content "A" is tagged with "jaguar", "run" and "yellow". We would have high uncertainty about whether this content is about "automobile" or "wildlife" if we have no information about the users who tagged it.
I claim that users express their individual interests and vocabulary through tags, and that we can use this information to learn the concepts of tagged content. For instance, we are likely to discover that users who are interested in the luxury car domain usually use the keyword "jaguar" to tag car-related URLs, while those who are interested in wildlife use the tag "jaguar" to tag wildlife-related URLs. If we know that the majority of the users who annotated "A" are interested in luxury cars, we would know that "A" is about luxury cars. Therefore, the tags "jaguar", "run" and "yellow" are associated with the luxury car concept in this context.

One possible way to extract such knowledge from social tagging is to model how tags are generated by users for content items in a stochastic fashion. Basically, I assume that each individual user has a mixture of interests, modeled as a probability distribution over interests. Each content item is comprised of a mixture of concepts, modeled as a probability distribution over concepts. Additionally, each pair of an interest and a concept is associated with a probability distribution over tags. To generate a tag, a user interest and a content concept are first drawn from their distributions; then, a tag is drawn from the tag distribution of that interest-concept pair. This stochastic process can be inverted using statistical techniques; as a result, the distributions over interests, concepts and tags are inferred. Within these distributions, category knowledge appears as conditional probabilities between concepts, content items and tags.

2.2 Modeling Social Tagging

In general, a social annotation system involves three types of entities: resources (e.g., Web pages on Delicious), users and annotation. Although there are different forms of annotation, such as descriptions, notes and ratings, I focus on tags only in this context. Here, I define a variable R for resources, U for users, and T for tags. Their realizations are denoted r, u and t respectively. A post (or bookmark) k on a resource r by a user u can be formalized as a tuple ⟨r, u, {t_1, t_2, ..., t_j}⟩_k, which can be further broken down into the co-occurrence of j resource-user-tag triples ⟨r, u, t⟩. N_R, N_U and N_T are the numbers of distinct resources, users and tags respectively. Figure 2.1 schematically illustrates these entity types and their relations in social tagging systems.

Figure 2.1: Schematic diagram representing the relations between users, resources and tags through posts or bookmarks in social tagging systems. A bookmark is comprised of a user, a resource and the set of tags that the user uses for annotating the resource.
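As a concrete illustration of this representation, the short sketch below expands posts into resource-user-tag triples. The data structures and field names are hypothetical, not taken from the thesis.

```python
from itertools import chain

def post_to_triples(post):
    """Expand one bookmark <r, u, {t1, ..., tj}> into j (r, u, t) triples."""
    return [(post["resource"], post["user"], t) for t in post["tags"]]

posts = [  # hypothetical bookmarks
    {"resource": "www.jaguar.com", "user": "u7", "tags": ["jaguar", "car", "luxury"]},
    {"resource": "flickr.com/photos/123", "user": "u42", "tags": ["jaguar", "wildlife"]},
]
triples = list(chain.from_iterable(post_to_triples(p) for p in posts))
N_R = len({r for r, _, _ in triples})   # number of distinct resources
N_U = len({u for _, u, _ in triples})   # number of distinct users
N_T = len({t for _, _, t in triples})   # number of distinct tags
```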
In addition to the observable variables defined above, I introduce two 'hidden' or 'latent' variables, which we will attempt to infer from the observed data. The first variable, Z, represents resource topics, which we view as categories or concepts of resources. From the previous example, the tag "jaguar" can be associated with the topics 'cars', 'animals', 'South America', 'computers', etc. The second variable, X, represents user interests, the degree to which users subscribe to these concepts. One user may be interested in collecting information about luxury cars before purchasing one, while another user may be interested in vintage cars. A user u has her interest profile, ψ_u, which is a weight distribution over all possible interests x; ψ (without subscript) is simply an N_U × N_X matrix. Similarly, a resource r has its topic profile, φ_r, which is again a weight distribution over all possible topics z, whereas φ (without subscript) is an N_R × N_Z matrix. Thus, a resource about South American jaguars will have a higher weight on the 'animals' and 'South America' topics than on the 'cars' topic. The usage of tags for a certain interest-topic pair (x, z) is defined as a weight distribution over tags, θ_{x,z}; that is, some tags are more likely to occur for a given pair than others. The weight distribution over all tags, θ, can be viewed as an N_T × N_Z × N_X matrix.

Figure 2.2: Schematic diagrams representing (a) the tag generation process in the social annotation domain; (b) the word generation process in the document modeling domain.

I cast an annotation event as a stochastic process as follows:

• User u finds a resource r interesting and would like to bookmark it.
• For each tag that u generates for r:
  – User u selects an interest x from her interest profile ψ_u; resource r selects a topic z from its topic profile φ_r.
  – Tag t is then chosen, based on the user's interest and the resource's topic, from the interest-topic distribution over all tags, θ_{x,z}.

This process is depicted schematically in Figure 2.2(a). Specifically, a user u has an interest profile, represented by a vector of interests ψ_u. Meanwhile, a resource r has its own topic profile, represented by a vector of topics φ_r. Users who share the same interest (x) have the same tagging policy, the tagging profile "plate" shown in the diagram. For the "plate" corresponding to an interest x, each row corresponds to a particular topic z, and it gives θ_{x,z}, the distribution over all tags for that topic and interest.

The process can be compared to the word generation process in standard topic modeling approaches, e.g., Latent Dirichlet Allocation (LDA) (Blei et al., 2003b) and Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 2001), as shown in Figure 2.2(b). In topic modeling, a new document is presumably generated by first drawing a distribution over topics, while each topic is a distribution over words. For each word in that document, a topic is chosen randomly according to the distribution over topics. Then, a word is drawn from that topic. Here, we can see that the words of a certain document are generated according to a single policy, which assumes that all authors of documents in the corpus have the same word usage patterns. In other words, a set of "similar" words is used to represent a topic across all authors.

In the "jaguar" example, for instance, we may find one topic to be strongly associated with the words "cars", "automotive", "parts", "jag", etc., while another topic may be associated with the words "animals", "cats", "cute", "black", etc., and still another with "guitar", "fender", "music", etc., and so on. In social annotation, however, a resource can be annotated by many users, who may have different opinions, even on the same topic. Users who are interested in restoring vintage cars will have a different tagging profile than those who are interested in shopping for a luxury car.
The 'cars' topic would then decompose under different tagging profiles into one that is highly associated with the words "restoration", "classic", "parts", "catalog", etc., and another that is associated with the words "luxury", "design", "performance", "brand", etc.

Figure 2.3: Graphical representation of the social annotation process. R, U, T, X and Z denote the variables "Resource", "User", "Tag", "Interest" and "Topic" respectively. ψ, φ and θ are the distributions of users over interests, of resources over topics and of interest-topic pairs over tags respectively. N_t represents the number of tag occurrences for one bookmark (by a particular user, u, on a particular resource, r); D represents the number of all bookmarks in the social annotation system. The hyperparameters α, β, and η are the parameters of the priors of φ, ψ and θ.

The existence of tagging profiles for each group of users who share the same interest provides the machinery to address this issue and constitutes the major distinction between my approach and standard topic modeling.

2.3 Finite Interest Topic Model

Following the social tagging process described in the previous section, I develop the Interest Topic Model (ITM) for extracting category knowledge by extending the LDA model (Blei et al., 2003b). Specifically, ITM additionally takes users and interests into account explicitly, which helps separate user interests from content topics. From this, I expect the extracted topics to be closer to the actual topics than those extracted by LDA.

As in LDA, I implement ITM in a fully generative manner, under a Bayesian framework. Specifically, the distribution over topics for resource r, φ_r, can be modeled as a discrete (categorical) distribution, along with its prior probability. I assume that a resource is comprised of only a few topics, which have much higher weight than the rest. Similar treatments are also applied to ψ_u (the distribution over interests of user u) and θ_{x,z} (the distribution over tags of the interest x and topic z pair). To model such sparsity, one natural choice is to use a prior that favors distributions with sparse elements, and the Dirichlet distribution can capture this characteristic.

The Dirichlet distribution is a distribution over distributions, i.e., a draw from a k-dimensional Dirichlet distribution is a distribution on a discrete probability space with k dimensions (Ranganathan, 2006). It has a set of parameters that can be specified as a vector (α_1, ..., α_k), each of which is considered a weight on dimension i. The higher α_i, the more likely we are to draw a distribution with a high weight on dimension i. For simplicity, much recent research (e.g., Rosen-Zvi et al., 2004; Griffiths and Steyvers, 2004; Steyvers and Griffiths, 2006; Asuncion et al., 2009) uses a symmetric Dirichlet distribution, where all parameters are set to the same value α (α_1 = ... = α_k = α). By doing so, the number of parameters is reduced to one.

α can be used to adjust the mode of a symmetric Dirichlet distribution. For α ≥ 1, the mode of the distribution is located at the center between all dimensions, i.e., the center of the simplex. The higher α, the more the distribution favors discrete distributions that have high values in all dimensions, leading to more smoothing. For α < 1, the modes of the distribution are located at the corners of the simplex. A symmetric Dirichlet distribution with this kind of setting biases toward sparsity: it favors discrete distributions that have high weights on a few dimensions (Steyvers and Griffiths, 2006).
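The following minimal NumPy sketch (not from the thesis) illustrates this sparsity effect by drawing from symmetric Dirichlet distributions with a small and a large concentration parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # number of dimensions of the simplex

for alpha in (0.01, 1.0):
    draw = rng.dirichlet(np.full(k, alpha))   # one sample from a symmetric Dirichlet
    heavy = int(np.sum(draw > 0.05))
    print(f"alpha={alpha}: {heavy} of {k} dimensions carry noticeable weight")
# With alpha < 1 the mass typically concentrates on a few dimensions (sparse draws);
# with alpha >= 1 it spreads much more evenly (smoother draws).
```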
A symmetric Dirichlet distribution 18 with this kind of setting biases toward sparsity – it favors discrete distributionsthat have high weights on few dimensions (Steyvers and Griffiths, 2006). I place symmetric Dirichlet priors on top of the parameters φ, ψ and θ, with Dirich- let parameters α, β and η respectively. These Dirichlet parameters are correspondingly normalized by the numbers of possible topics (N Z ), interests (N x ) and tags (N T ). Addi- tionally, I assume N Z and N X must be fixed and known a priori (this requirement can be relaxed as I will demonstrate in Section 2.5). I also assume that most users share some similar interests, and most resources share some similar topics. Thus, we expect N X N U and N Z N R . Following the generative process described in Section 2.2 and the configurations we have here, the model can be described as a stochastic process, depicted in graphical form (Buntine, 1994) in Figure 2.3: ψ u ∼Dirichlet(β/N X ,...,β/N X ) (generating user u interest’s profile) φ r ∼Dirichlet(α/N Z ,...,α/N Z ) (generating resource r topic’s profile) θ x,z ∼Dirichlet(η/N T ,...,η/N T ). (generating tag’s profile for interest x and topic z) For each tag t i of a bookmark, x i ∼Discrete(ψ u ) z i ∼Discrete(φ r ) t i ∼Discrete(θ x i ,z i ). For the nextstep of modelingsocial tagging, Iusethisgeneratative processto fitwith theobservations: all user-resource-tag triplets. Thisprocedurerequiresestimatingψ u ,φ r 19 andθ x,z . One possible way to estimate these parameters is to use Gibbs sampling (Gilks et al., 1996; Neal, 2000). Briefly, the idea behind the Gibbs sampling is to iteratively use the parameters of the current state to estimate parameters of the next state. In particular, each next-state parameter is sampled from the posterior distribution of that parameter given all other parameters in the previous state. The sampling process is done sequentially until sampled parameters approach the target posterior distributions. Recently, this approach was demonstrated to be simple to implement, yet competitively efficient, and to yield relatively good performance on the topic extraction task (Griffiths and Steyvers, 2004; Rosen-Zvi et al., 2004). Since Dirichlet priors are conjugate to discrete distributions, it is straightforward to integrate outψ,φ andθ. Thus, I only need to sample hidden variables x andz and later on estimate ψ, φ and θ once x and z approach their target posterior distribution. To deriveGibbssamplingformulaforsamplingxandz, Ifirstassumethatallbookmarksare broken intoN K tuples. Each tuple is indexed byi and I refer to the observable variables, resource, user and tag, of the tuple i as r i , u i , t i . I refer to the hidden variables, topic and interest, for this tuple asz i andx i respectively, withx andz representing the vector of interests and topics over all tuples. I define N r i ,z −i as the number of all tuples having r = r i and z but excluding the present tuple i. In words, if z = z i , N r i ,z −i = N r i ,z i − 1; otherwise, N r i ,z −i = N r i ,z i . Similarly,N z −i ,x i ,t i is a numberof all tuples havingx=x i ,t=t i andz butexcluding the present tuplei;z −i represents all topic assignments except that of the tuplei. The Gibbs 20 sampling formulas for sampling z and x, whose derivation I provide in Appendix A, are as follows: p(z i |z −i ,x,t) = N r i ,z −i +α/N Z N r i +α−1 . N z −i ,x i ,t i +η/N T N z −i ,x i +η (2.1) p(x i |x −i ,z,t)= N u i ,x −i +β/N X N u i +β−1 · N x −i ,z i ,t i +η/N T N x −i ,z i +η . (2.2) Consider Eq. 
For the next step of modeling social tagging, I use this generative process to fit the observations: all user-resource-tag triples. This procedure requires estimating ψ_u, φ_r and θ_{x,z}. One possible way to estimate these parameters is to use Gibbs sampling (Gilks et al., 1996; Neal, 2000). Briefly, the idea behind Gibbs sampling is to iteratively use the parameters of the current state to estimate the parameters of the next state. In particular, each next-state parameter is sampled from the posterior distribution of that parameter given all other parameters in the previous state. The sampling process is carried out sequentially until the sampled parameters approach the target posterior distributions. This approach has been demonstrated to be simple to implement, yet competitively efficient, and to yield relatively good performance on the topic extraction task (Griffiths and Steyvers, 2004; Rosen-Zvi et al., 2004).

Since Dirichlet priors are conjugate to discrete distributions, it is straightforward to integrate out ψ, φ and θ. Thus, I only need to sample the hidden variables x and z, and later estimate ψ, φ and θ once x and z approach their target posterior distribution. To derive the Gibbs sampling formulas for sampling x and z, I first assume that all bookmarks are broken into N_K tuples. Each tuple is indexed by i, and I refer to the observable variables (resource, user and tag) of tuple i as r_i, u_i, t_i. I refer to the hidden variables (topic and interest) of this tuple as z_i and x_i respectively, with x and z representing the vectors of interests and topics over all tuples. I define N_{r_i, z_{-i}} as the number of all tuples having r = r_i and topic z, excluding the present tuple i. In words, if z = z_i, N_{r_i, z_{-i}} = N_{r_i, z_i} − 1; otherwise, N_{r_i, z_{-i}} = N_{r_i, z_i}. Similarly, N_{z_{-i}, x_i, t_i} is the number of all tuples having x = x_i, t = t_i and topic z, excluding the present tuple i; z_{-i} represents all topic assignments except that of tuple i. The Gibbs sampling formulas for sampling z and x, whose derivation I provide in Appendix A, are as follows:

\[ p(z_i \mid z_{-i}, x, t) = \frac{N_{r_i, z_{-i}} + \alpha/N_Z}{N_{r_i} + \alpha - 1} \cdot \frac{N_{z_{-i}, x_i, t_i} + \eta/N_T}{N_{z_{-i}, x_i} + \eta} \qquad (2.1) \]

\[ p(x_i \mid x_{-i}, z, t) = \frac{N_{u_i, x_{-i}} + \beta/N_X}{N_{u_i} + \beta - 1} \cdot \frac{N_{x_{-i}, z_i, t_i} + \eta/N_T}{N_{x_{-i}, z_i} + \eta} \qquad (2.2) \]

Consider Eq. 2.1, which computes the probability of a certain topic for the present tuple. This equation is composed of two factors. Suppose that we are currently determining the probability that the topic of the present tuple i is j (z_i = j). The left factor determines the probability of topic j to which resource r_i belongs, according to the present topic distribution of r_i. Meanwhile, the right factor determines the probability of tag t_i under topic j for the users who have interest x_i. If resource r_i has many tags assigned to topic j, and the present tag t_i is "very important" to topic j according to the users with interest x_i, there is a higher chance that tuple i will be assigned to topic j. A similar insight also applies to Eq. 2.2. In particular, suppose that we are currently determining the probability that the interest of the present tuple i is k (x_i = k). If user u_i has many tags assigned to interest k, and tag t_i is "very important" to topic z_i according to users with interest k, then tuple i will be assigned to interest k with higher probability.

In the model training process, I first randomly assign interests x and topics z to all tuples. I then sample the topic z_i and interest x_i of each individual tuple based on the present states z_{-i} and x_{-i} of the other tuples. By sampling z and x of all tuples sequentially using Eq. 2.1 and Eq. 2.2, the posterior distribution of topics and interests is expected to converge to the true posterior distribution after enough iterations. Regardless of the initial assignments of all interests x and topics z, the sampler will eventually reach a unique stationary state if it is ergodic, i.e., if all states can be reached from any other state within a finite number of sampling steps (Gilks et al., 1996). Although I have no empirical proof of ergodicity, I simply assume that this property holds, and in practice, the samplers seem to converge to some state after a certain number of iterations.
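To make the sampler concrete, here is a sketch of a single training sweep written directly from Eq. 2.1 and Eq. 2.2 (a simplified illustration, not the implementation used in the thesis; the count tables are assumed to be dense NumPy arrays and rng a numpy.random.Generator).

```python
def gibbs_sweep(triples, z, x, counts, dims, hypers, rng):
    """One collapsed Gibbs sweep over all (r, u, t) tuples: resample each
    tuple's topic z[i] via Eq. 2.1 and its interest x[i] via Eq. 2.2."""
    N_Z, N_X, N_T = dims
    alpha, beta, eta = hypers
    # N_rz: (N_R, N_Z), N_r: (N_R,), N_ux: (N_U, N_X), N_u: (N_U,),
    # N_xzt: (N_X, N_Z, N_T), N_xz: (N_X, N_Z)
    N_rz, N_r, N_ux, N_u, N_xzt, N_xz = counts
    for i, (r, u, t) in enumerate(triples):
        # remove tuple i from the counts (these become the "-i" statistics)
        N_rz[r, z[i]] -= 1
        N_ux[u, x[i]] -= 1
        N_xzt[x[i], z[i], t] -= 1
        N_xz[x[i], z[i]] -= 1
        # Eq. 2.1: probability of each topic, given the current interest x[i]
        p_z = ((N_rz[r] + alpha / N_Z) / (N_r[r] + alpha - 1)
               * (N_xzt[x[i], :, t] + eta / N_T) / (N_xz[x[i]] + eta))
        z[i] = rng.choice(N_Z, p=p_z / p_z.sum())
        # Eq. 2.2: probability of each interest, given the new topic z[i]
        p_x = ((N_ux[u] + beta / N_X) / (N_u[u] + beta - 1)
               * (N_xzt[:, z[i], t] + eta / N_T) / (N_xz[:, z[i]] + eta))
        x[i] = rng.choice(N_X, p=p_x / p_x.sum())
        # put tuple i back with its new topic and interest assignments
        N_rz[r, z[i]] += 1
        N_ux[u, x[i]] += 1
        N_xzt[x[i], z[i], t] += 1
        N_xz[x[i], z[i]] += 1
```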
Once the sampler reaches the stable state, samples can be used to estimate the model parameters. In practice, one takes a representative set of samples at regularly spaced intervals to avoid correlations between samples (Gilks et al., 1996). Here, I define N_{r,z} as the number of all tuples associated with resource r and topic z, with N_r, N_{u,x}, N_u, N_{x,z,t} and N_{x,z} defined in a similar way. From Eq. A.5 and Eq. A.6 in Appendix A, the formulas for computing these parameters are as follows:

\phi_{r,z} = \frac{N_{r,z} + \alpha/N_Z}{N_r + \alpha}   (2.4)

\psi_{u,x} = \frac{N_{u,x} + \beta/N_X}{N_u + \beta}   (2.5)

\theta_{x,z,t} = \frac{N_{x,z,t} + \eta/N_T}{N_{x,z} + \eta}.   (2.6)

Parameter estimation via Gibbs sampling is less prone to the local-maxima problem than the generic Expectation Maximization (EM) algorithm (Dempster et al., 1977), as argued in Rosen-Zvi et al. (2004). In particular, this scheme does not estimate the parameters φ, ψ, and θ directly. Rather, they are integrated out, while the hidden variables z and x are iteratively sampled during the training process. The process estimates the "posterior distribution" over possible values of φ, ψ, and θ. At a stable state, z and x are drawn from this distribution and then used to estimate φ, ψ, and θ. Consequently, these parameters are estimated from a combination of "most probable solutions" obtained from multiple maxima. This clearly differs from generic EM with point estimation (e.g., Hofmann, 2001), which I used in earlier work (Plangprasopchok and Lerman, 2007). Specifically, the point-estimation scheme estimates φ, ψ, and θ from a single local maximum.

Per training iteration, the computational cost of Gibbs sampling is higher than that of EM. This is because we need to sample the hidden variables (z and x) for each data point (tuple), whereas EM only requires updating parameters, and in general the number of data points is larger than the dimension of the parameters. However, it has been reported in Griffiths and Steyvers (2004) that, to reach the same performance, Gibbs sampling requires fewer floating point operations than other popular approaches, namely Variational Bayes and Expectation Propagation (Minka, 2001). Moreover, to my knowledge, there is currently no explicit way to extend those approaches to automatically infer the number of hidden components, as Gibbs sampling can. The inference of these numbers will be described in Section 2.5.

2.4 Evaluation

In this section I evaluate the Interest Topic Model and compare its performance to LDA, which I explained in Section 2.2, on both synthetic and real-world data. The synthetic data set enables us to control the degree of tag ambiguity and individual user variation, and to examine in detail how both learning algorithms respond to these key challenges of learning from social annotation. The real-world data set, obtained from the social bookmarking site Delicious, demonstrates the utility of the proposed model.

LDA is a probabilistic generative model originally developed for modeling text documents (Blei et al., 2003b) and more recently extended to many problems, such as finding topics of scientific papers (Griffiths and Steyvers, 2004), topic-author associations (Rosen-Zvi et al., 2004), user roles in a social network (McCallum et al., 2007), and collaborative filtering (Marlin, 2004). To apply LDA to the social tagging context, one can straightforwardly ignore information about users, treating all tags as if they came from the same user, as in some recent work (e.g., Kashoob et al., 2009). Then a resource can be viewed as a document, while tags across the different users who bookmarked it are treated as words. I then use LDA to fit the data, where its parameters to be learned are the distribution over topics of resource r (φ_r) and the distribution over words of topic z (θ_z). To reiterate, ITM extends LDA by taking into account individual variations among users. In particular, a tag for a certain bookmark is chosen not only from the resource's topics but also from the user's interests.
This allows each user group (with the same interest) to have its own policy, θ x,z,t , for choosing tags to represent a topic. Each policy is then used to update resource topics as in Eq. 2.1. Consequently, φ r,z is updated based on interests of users who actually annotated resource r, rather than updating it from a single policy that ignores user information. Hence, I expect ITM can extract better distributionsovertopics(φ)thanLDAwhenannotationsaremadebydiverseusergroups, and especially when tags are ambiguous. 2.4.1 Evaluation on Synthetic Data Toverifytheintuition aboutITM,Ievaluated theperformanceofthelearningalgorithms on synthetic data. Thedata set consists of40 resources, 10 topics, 100 users,10 interests, 25 and 100 tags. I first separate resources into five groups, with resources in each group assigned topic weights from the same (Dirichlet) probability distribution, which forces each resource to favor two to four out of ten topics. Rather than simulate the tagging behaviorof usergroupsby generating individualtagging policy plates as in Figure2.2(a), I simplify the generative process to simulate the impact of diversity in user interests on tagging. To this end, I represent user interests as distributions over topics. I create data sets under different tag ambiguity and user interest variation levels. To make these settings tunable, I generate distributionsof topics over tags, and distributions of resources over topics using symmetric Dirichlet distributions with different parame- ter values. Intuitively, when sampling from the symmetric Dirichlet distribution 1 with a low parameter value, for example 0.01, the sampled distribution contributes weights (probability values that are greater than zero) to only a few elements. In contrast, the distribution will contribute weights to many elements when it is sampled from a Dirichlet distribution with a high parameter value. I used this parameter of the symmetric Dirich- let distribution to adjust user variation, i.e., how broad or narrow user interests are, and tag ambiguity, i.e., how many or how few topics each tag belongs to. With higher pa- rameter values, we can simulate the behavior of more ambiguous tags, such as “jaguar”, which has multiple senses, i.e., it has weights allocated to many topics. Low parameter values can be used to simulate low ambiguity tags, such as “mammal”, which has one or few senses. The parameter values used in the experiments are 1, 0.5, 0.1, 0.05 and 0.01. To generate tags for each simulated data set, user interest profiles ψ u are first drawn from the symmetric Dirichlet distribution with the same parameter value. A similar 1 Samples that are sampled from Dirichlet distribution are discrete probability distributions 26 procedure is done for distributions of topics over wordsθ. A resource will presumably be annotated by a user if the match between the resource’s topics and the user’s interests is greater than some threshold. The match is given by the inner product between the resource’stopicsandtheuser’sinterests,andIsetthethresholdat1.5×theaveragematch computed over all user-resource pairs. The rationale behind this choice of threshold is to ensure that a resource will be tagged by a user who is strongly interested in the topics of that resource. When the user-resource match is greater than threshold, a set of tags (a bookmark) is generated according to the following procedure. First, I compute the topic distribution from an element-wise product of the resource’s topics and the user’s interests. 
Next, I sample a topic from this distribution and produce a tag from the tag distribution of that topic. This guarantees that tags are only generated according to the user's interests. I repeat this process seven times in each post and eliminate redundant tags (I chose seven because Delicious users generally use four to seven tags in each post). The process of generating tags is summarized below:

for each resource-user pair (u, r) do
    m_{r,u} = φ_r · ψ_u   (compute the match score)
end for
m̄ = Average(m)
for each resource-user pair (r, u) do
    if m_{r,u} > 1.5 m̄ then
        topicPref = φ_r × ψ_u   (element-wise product)
        for i = 1 to 7 do
            z ~ topicPref   (draw a topic from the topic preference)
            t^i_{r,u} ~ θ_z   (sample the i-th tag for the (u, r) pair)
        end for
        Remove redundant tags
    end if
end for

I measure sensitivity to tag ambiguity and user interest variation for LDA and ITM on the synthetic data, generated with different values of the symmetric Dirichlet parameters. One way to measure sensitivity is to determine how the learned topic distribution, φ^ITM_r or φ^LDA_r, deviates from the actual topic distribution of resource r, φ^Actual_r. Unfortunately, we cannot compare them directly, since the topic order of the learned topic distribution may not be the same as that of the actual one; this property of probabilistic topic models is called exchangeability of topics (Steyvers and Griffiths, 2006). An indirect way to measure this deviation is to compare distances between pairs of resources computed using the actual and the learned topic distributions. I define this deviation as Δ. I calculate the distance between two distributions using Jensen-Shannon divergence (JSD) (Lin, 1991). If a model accurately learned the resources' topic distributions, the distance between two resources computed using the learned distribution will equal the distance computed from the actual distribution. Hence, the lower the Δ, the better the model's performance. The deviation between the actual and learned topic distributions is

\Delta = \sum_{r=1}^{N_R - 1} \sum_{r'=r+1}^{N_R} \left| JSD(\phi^{Learned}_r, \phi^{Learned}_{r'}) - JSD(\phi^{Actual}_r, \phi^{Actual}_{r'}) \right|.   (2.7)

Δ is computed separately for each algorithm, Learned = ITM and Learned = LDA.
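For reference, the following is a minimal sketch (Python, not the original experimental code) of the JSD distance and the deviation Δ of Eq. 2.7. The arguments are assumed to be sequences of topic-distribution vectors indexed by resource; base-2 logarithms are used, which is one common convention for JSD.

import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (Lin, 1991)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def deviation(phi_learned, phi_actual):
    """Eq. 2.7: sum over resource pairs of |JSD(learned pair) - JSD(actual pair)|."""
    n = len(phi_actual)
    delta = 0.0
    for r in range(n - 1):
        for r2 in range(r + 1, n):
            delta += abs(jsd(phi_learned[r], phi_learned[r2]) -
                         jsd(phi_actual[r], phi_actual[r2]))
    return delta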
I ran both LDA and ITM to learn the distributions of resources over topics, φ, for the simulated data sets generated with different values of tag ambiguity and user interest variation. I set the number of topics to 10 for each model, and the number of interests to three for ITM. Both models were initialized with random topic and interest assignments and then trained for 1000 iterations. For the last 100 iterations, I used the topic and interest assignments in each iteration to compute φ (using Eq. 2.4 for ITM and Eq. (7) in Griffiths and Steyvers (2004) for LDA). The average of φ over this period is then used as the distribution of resources over topics. (The reason for averaging φ is that, in the stable state, the topic/interest assignments can still fluctuate from one iteration to another; to avoid estimating φ from an iteration with idiosyncratic topic/word assignments, one averages φ over multiple iterations (Steyvers and Griffiths, 2006).) I ran the learning algorithm five times for each data set.

Figure 2.4: Deviations Δ between actual and learned topics on the synthetic data sets for different regimes: (a) high tag ambiguity; (b) low tag ambiguity; (c) high interest spread; (d) low interest spread. LDA(10) and LDA(30) refer to LDA trained with 10 and 30 topics respectively; ITM(10,3) refers to ITM trained with 10 topics and 3 interests.

Figure 2.5: The deviation Δ between actual and learned topics on the synthetic data sets, under different degrees of tag ambiguity and user interest variation, shown as 3D surfaces; (a) shows the Δ of LDA and (b) that of ITM.

The deviations between the learned topics and the actual ones of the simulated data sets are shown in Figure 2.4 and Figure 2.5. In the case when the degree of tag ambiguity is high, ITM is superior to LDA for the entire range of user interest variations, as shown in Figure 2.4 (a). This is because ITM exploits user information to help disambiguate tag senses; thus, it can learn better topics, closer to the actual ones, than LDA. In the other regime, when tag ambiguity is low, user information does not help and can even degrade ITM's performance, especially when the degree of interest variation is low, as in Figure 2.4 (b). This is because a low amount of user interest variation demotes the statistical strength of the learned topics. Suppose, for example, that we have two similar resources: the first is bookmarked by one group of users, the second by another. If these two groups have very different interest profiles, ITM will tend to split the "actual" topic that describes those resources into two different topics, one for each group. Hence, each of these resources will be assigned to a different learned topic, resulting in a higher Δ for ITM. In the case when user interest variation is high (Figure 2.4 (c)), ITM is superior to LDA for the same reason that it uses user information to disambiguate tag senses. Of course, there is no advantage to using ITM when the degree of tag ambiguity is very low, and it then yields performance similar to LDA. In the last regime, when interest variation is low (Figure 2.4 (d)), ITM is superior to LDA for high degrees of tag ambiguity, even though its topics may lose some statistical strength. ITM's performance starts to degrade when the tag ambiguity degree is low, for the same reason as in Figure 2.4 (b). These results are summarized in the 3D plots in Figure 2.5.

I also ran LDA with 30 topics, in order to compare LDA to ITM when both models have the same complexity. As shown in Figure 2.4, with the same model complexity, ITM is preferable to LDA in all settings. In some cases, LDA with higher complexity (30 topics) is inferior to LDA with lower complexity (10 topics). I suspect that this degradation is caused by over-specification of the model with too many topics.
In terms of the computational complexity, both LDA and ITM are required to sample the hidden variables for all data points. For LDA, only the topic variable z needs to be 31 sampled; for ITM, the interest variable x is also required. The computational cost in each sampling is proportional to a number of topics, N Z , forz, and that of interest, N X , for x. Let’s define κ as a constant. I also define a number of all datapoints (tuples) as N K . Hence, the computational cost for LDA, in each iteration can be approximated as N K ×(κ×N Z ). The computational cost of ITM in each iteration can be approximated as N K ×(κ×(N Z +N X )). In summary, ITM is not superior to LDA in learning topics associated with resources ineverycase. However, IshowedthatITMispreferabletoLDAinscenarioscharacterized by a high degree of tag ambiguity, and at least moderate user interest variation, which is the case in the social annotation domain. 2.4.2 Real-World Data: Resource Discovery Task In this section I validate the proposed model on real-world data obtained from the social bookmarkingsite Delicious. ThehypothesisImake forevaluating the proposedapproach is that the model that takes users into account can infer higher quality (more accurate) topics φ than those inferred by the model that ignores user information. The“standard”measure 5 usedforevaluatingtopicmodelsistheperplexityscore(Blei et al., 2003b; Rosen-Zvi et al., 2004). Specifically, it measuresgeneralization performance on how a certain model can predict unseen observations. In document topic modeling, someofthewordsineachdocumentaresetasideastestingdata; whiletherestareusedas trainingdata. Thentheperplexityscoreiscomputedfromaconditionalprobabilityofthe 5 In fact, topic model’s evaluation is still in controversy according to a personal communication at http://nlpers.blogspot.com/2008/06/evaluating-topic-models.html by Hal Daum´ e. 32 testing given training data. This evaluation is infeasible in the social annotation domain, where each bookmark contains relatively few tags, compared to document’s words. Instead of using perplexity, I propose to directly measure the quality of the learned topicsonasimplifiedresourcediscoverytask. Thetaskisdefinedasfollows: “given aseed resource, find other most similar resources” (Ambite et al., 2009). The seed resource is manually specified and each resource is represented as a distribution over learned topics, φ, which is computed using Eq. 2.4. Topics learned by the better approach will have more discriminative power for categorizing resources. When using such distribution to rankresourcesbysimilarity to theseed, wewould expect themore similarresources to be ranked higherthanless similar resources. Note thatsimilarity between a pairofresources A andB is computed using Jensen-Shannon divergence (JSD) (Lin, 1991) on their topic distributions, φ A and φ B . To evaluate the approach, I collected data for five seeds: flytecomm, 6 geocoder, 7 wun- derground, 8 whitepages, 9 and online-reservationz. 
The flytecomm seed allows users to track flights given the airline and flight number, or the departure and arrival airports; geocoder returns geographic coordinates of a given address; wunderground gives weather information for a particular location (given by zipcode, city and state, or airport); whitepages returns a person's phone numbers; and online-reservationz lists hotels available in some city on some dates.

6 http://www.flytecomm.com/cgi-bin/trackflight/
7 http://geocoder.us
8 http://www.wunderground.com/
9 http://www.whitepages.com/
10 http://www.online-reservationz.com/

Seed                 # Resources   # Users   # Tags   # Triples
Flytecomm            3,562         34,594    14,297   2,284,308
Geocoder             5,572         46,764    16,887   3,775,832
Wunderground         7,176         45,852    77,056   6,327,211
Whitepages           6,455         12,357    64,591   2,843,427
Online-Reservationz    764         41,003     9,194     162,763

Table 2.1: Statistics for the five data sets used to evaluate the models' performance. Note that a triple is a resource, user, and tag co-occurrence.

I crawl Delicious to gather resources possibly related to each seed. The crawling strategy is as follows: for each seed,
• retrieve the 20 most popular tags associated with this resource;
• for each of these tags, retrieve other resources that have been annotated with the tag;
• for each resource, collect all bookmarks (resource-user-tag triples).

I wrote a special-purpose page scraper to extract this information from Delicious. In principle, we could continue to expand the collection of resources by gathering tags and retrieving more resources tagged with those keywords, but in practice, even after a small traversal, I already obtain millions of triples. In each corpus, each resource has at least one tag in common with the seed. Statistics on these data sets are presented in Table 2.1.

For each corpus, LDA is trained with 80 topics, while the numbers of topics and interests for ITM are set to 80 and 40 respectively. The topic and interest assignments are randomly initialized, and then both models are trained for 500 iterations. (I discovered that the models converged very quickly; they appear to reach the stable state within 300 iterations in all data sets.) For the last 100 iterations, I use the topic and interest assignments in each iteration to compute the distributions of resources over topics, φ. The average of φ over this period is then used as the distribution of resources over topics.

Figure 2.6: This diagram summarizes the evaluation procedure for comparing LDA and ITM on the resource discovery task: annotations (seed URL, candidate URLs, users, tags) are obtained from Delicious and fed to the probabilistic model (ITM or LDA), which produces each URL's distribution over concepts, φ; URL similarity to the seed is then computed, and the sorted list of similar URLs is evaluated manually.

Next, the learned distributions of resources over topics, φ, are used to compute the similarity of the resources in each corpus to the seed. The performance of each model is evaluated by manually checking the 100 most similar resources produced by the model. A resource is judged to be similar if it provides an input form that takes semantically the same inputs as the seed and returns semantically the same data. Hence, flightaware (http://flightaware.com/live/) is judged similar to flytecomm because both take flight information and return flight status. This evaluation procedure is schematically summarized in Figure 2.6. Figure 2.7 shows the number of relevant resources identified within the top x resources returned by LDA and ITM.
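A minimal sketch of the ranking step, assuming phi is an array of per-resource topic distributions (the averaged estimates from Eq. 2.4) and seed_index identifies the seed; this is illustrative Python, not the code used in the experiments.

import numpy as np

def rank_by_similarity_to_seed(phi, seed_index, top_k=100):
    """Rank resources by JSD between their topic distribution and the seed's (smaller = more similar)."""
    def jsd(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        m = 0.5 * (p + q)
        def kl(a, b):
            mask = a > 0
            return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)
    seed = phi[seed_index]
    distances = [(r, jsd(phi[r], seed)) for r in range(len(phi)) if r != seed_index]
    distances.sort(key=lambda pair: pair[1])
    return distances[:top_k]  # the top results are then inspected manually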
Figure 2.7: Performance of the different models on the five data sets. The x-axis represents the number of retrieved resources; the y-axis represents the number of relevant resources (those that have the same function as the seed). LDA(80) refers to LDA trained with 80 topics; ITM(80/40) refers to ITM trained with 80 topics and 40 interests. In the wunderground case, I could only run ITM with 30 interests (ITM(80/30)) due to memory limits.

From the results, we can see that ITM is superior to LDA on three data sets: flytecomm, geocoder and online-reservationz. However, its performance for wunderground and whitepages is about the same as that of LDA. Although I have no empirical proof, I hypothesize that weather and directory services are of interest to all users, and are therefore bookmarked by a large variety of users, unlike users interested in tracking flights or booking hotels online. As a result, ITM cannot exploit individual user differences to learn more accurate topics φ in the wunderground and whitepages cases.

To illustrate the utility of ITM, I select examples of topics and interests of the model learned from the flytecomm corpus. For purposes of visualization, I first list in descending order the top tags that are highly associated with each topic, obtained from θ_z (aggregated over all interests in topic z). For each topic, I then enumerate some interests and present a list of top tags for each interest, obtained from θ_{x,z}. I manually label topics and interests (in italics) according to the meaning of their dominant tags.
Travel&Flightstopic: travel, Travel, flights, airfare, airline, flight, airlines, guide, aviation, hotels, deals, reference, airplane • FlightTrackinginterest: travel, flight, airline, airplane, tracking, guide, flights, hotel, aviation, tool, packing, plane • Deal & Booking interest: vacation, service, travelling, hotels, search, deals, europe, portal, tourism, price, compare, old • Guideinterest: travel, cool, useful, reference, world, advice, holiday, international, vacation, guide, information, resource Video&p2ptopic: video, download, bittorrent, p2p, youtube, media, torrent, torrents, movies, videos, Video, downloads, dvd, free, movie 37 • p2pVideointerest: video, download, bittorrent, youtube, torrents, p2p, torrent, videos, movies, dvd, media, googlevideo, downloads, pvr • Media&Creationinterest: video, media, movies, multimedia, videos, film, editing, vlog, remix, sharing, rip, ipod, television, videoblog • FreeVideointerest: video, free, useful, videos, cool, downloads, hack, media, utilities, tool, hacks, flash, audio, podcast Referencetopic: reference, database, cheatsheet, Reference, resources, documentation, list, links, sql, lists, resource, useful, mysql • Databasesinterest: reference, database, documentation, sql, info, databases, faq, technical, reviews, tech, oracle, manuals • Tips & Productivity interest: reference, useful, resources, information, tips, howto, geek, guide, info, productivity, daily, computers • Manual & Reference interest: resource, list, guide, resources, collection, help, directory, manual, index, portal, archive, bookmark The three interests in the “Travel & Flights” topic have obviously different themes. The dominant one is more about tracking status of a flight; while the less dominant ones are about searching for travel deals and traveling guides respectively. This implies that 38 therearesubsetsofuserswhohavedifferentperspectives(orwhatIcallinterests)towards the same topic. Similarly, different interests also appear in the following topics, “Video & p2p” and “Reference.” Figure2.8 presentsexamplesoftopic distributionsforthreeresourceslearned byLDA and ITM: the seed flytecomm, usatoday, 13 and bookings. 14 Although all are about travel, the first two resources have specific flight tracking functionality; while the last one is about hotel & trip booking. In distribution of resources over the topics learned by LDA, shown in Figure 2.8 (a), all resources have high weights on topics #1 and #2, which are about traveling deals and general aviation. In the case of topics learned by ITM, shown in Figure 2.8 (b), flytecomm and usatoday have their high weight on topic #25, which is about tracking flights, while bookings does not. Consequently, ITM will be more helpful than LDA in identifying flight tracking resources. This demonstrates the advantage of ITM in exploiting individual differences to learn more accurate topics. 2.5 Infinite Interest Topic Model In Section 2.3, I assumed that parameters, such as, N Z and N X (the numbers of topics and interests), were fixed and known a priori. The choice of values for these parameters can conceivably affect the model performance. The traditional way to determine these numbers is to learn the model several times with different values of parameters, and then select those that yield the best performance (Griffiths and Steyvers, 2004). 
13 http://www.usatoday.com/travel/flights/delays/tracker-index.htm
14 http://www.bookings.org/

Figure 2.8: Topic distributions of the three resources flytecomm, usatoday and bookings, learned by (a) LDA and (b) ITM. φ_{z,r} on the y-axis indicates the weight of topic z in resource r, i.e., the degree to which r is about topic z. Among the LDA topics, the prominent ones are Topic #1 (travel, flight, airfare, airline, guide, hotel, cheap) and Topic #2 (flight, wireless, aviation, japan, airplane, wifi, tracking); among the ITM topics, the prominent ones are Topic #13 (travel, flights, airfare, guide, shopping, airlines) and Topic #25 (flight, aviation, airplane, tracking, airline, airlines).

In this work, I choose another solution by extending my finite model to have "countably" infinite numbers of topics and interests. By a "countably" infinite number of components, I mean that such numbers are flexible and can vary according to the number of observations. Intuitively, there is a higher chance that more topics and interests will be found in a data set that has more resources and users. Such an unbounded number of components can be dealt with within a Bayesian framework, as mentioned in previous works (Neal, 2000; Rasmussen, 2000; Teh et al., 2004). This approach helps bypass the problem of selecting values for these parameters. Following Neal (2000), I set both N_Z and N_X to approach ∞. This gives the model the ability not only to select previously used topic/interest components but also to instantiate "unused" components when required.

However, the model that I derived in the previous section cannot be extended directly under this framework, due to the use of symmetric Dirichlet priors. As pointed out by Teh et al. (2004), when the number of components grows, using a symmetric Dirichlet prior results in a very low (even zero) probability that a mixture component is shared across groups of data. That is, in this context, there is a higher chance that a certain topic is used only within one resource rather than by many of them. Considering Eq. 2.1, if I set N_Z to approach ∞, I obtain the posterior probability of z as follows:

p(z_i = z_{used} \mid z_{-i}, x, t) = \frac{N_{r_i, z_{-i}}}{N_{r_i} + \alpha - 1} \cdot \frac{N_{z_{-i}, x_i, t_i} + \eta/N_T}{N_{z_{-i}, x_i} + \eta}   (2.8)

p(z_i = z_{new} \mid z_{-i}, x, t) = \frac{\alpha}{N_{r_i} + \alpha - 1} \cdot \frac{1}{N_T}.   (2.9)

From Eq. 2.8, we can see that the model only favors topic components that are already used within resource r_i. Meanwhile, for other components not used by that resource, N_{r_i,z_{-i}} equals zero, resulting in zero probability of choosing them. Consequently, the model chooses topic components for a resource either from components that are currently used by that resource, or it instantiates a new component for that resource, with probabilities according to Eq. 2.8 and Eq. 2.9 respectively. As more new components are instantiated, each resource tends to own its components exclusively. From the previous section, we can also see that each resource profile is generated independently (using a symmetric Dirichlet prior); there is no mechanism to link the used components across different resources. 15 As mentioned in Teh et al.
(2004), this is an undesired characteristic, because, in this context, we would expect “similar” resources to be described by the same set of “similar” topics. One possible way to handle this problem is to use Hierarchical Dirichlet Process (HDP) (Teh et al., 2004) as the prior instead of the symmetric Dirichlet prior. The idea of HDP is to link components at group-specific level together by introducing global components across all groups. Each group is only allowed to use some (or all) of these globalcomponentsandthus,someofthemareexpectedtobesharedacrossseveralgroups. I adapt this idea by considering all tags of resource r to belong to the resource group r. Similarly, all tags of user u belong to the user group u. Each of the resource groups is assigned to some topic components selected from the global topic component pool. 15 This behavior can be easily observed in multiple samples, each drawn independently from a Dirichlet distribution Dirichlet(α1,...,α k ) . If αi is “small” and k is “large”, there is a higher chance that samples obtained from this Dirichlet distribution will have no overlapped component, i.e., for any pair of samples, there is no case when the same components have their value greater than 0 at the same time. Lack of this component overlap across samples will be obvious when k→∞. This is the problem that can be found in the model with infinite limit on NZ and NX. 42 Similarly, each of the user groups is assigned to some interest components selected from the global interest component pool. This extension is depicted in Figure 2.9. Suppose that a number of all possible topic components is N Z (which will be set to approach∞ later on) and that for the interest components isN X , such extension can be described as a stochastic process as follows. At the global level, the weight distribution of components is sampled according to (β 1 ,...,β N X )∼Dirichlet(γ x /N X ,...,γ x /N X ) (generating global interest component weight); (α 1 ,...,α N Z )∼Dirichlet(γ z /N Z ,...,γ z /N Z )(generatingglobaltopiccomponentweight), whereγ x and γ z are parameters, which control diversity of interests and topics at global level. At the group specific level, ψ u ∼Dirichlet(μ x ·β 1 ,...,μ x ·β N X ) (generating user u interest’s profile); φ r ∼Dirichlet(μ z ·α 1 ,...,μ z ·α N Z ) (generating resource r topic’s profile), whereμ x andμ z are parameters, which control diversity of interests and topics at group specific level. The remaining steps involving generation of tags for each bookmark are the same as in the previous process. Supposethatthereisaninfinitenumberofallpossibletopics,N Z →∞,andaportion of them are currently used in some resources. By following Teh et al. (2004), we can rewritetheglobalweightdistributionoftopiccomponents,α,as(α 1 ,..α kz ,α u ),wherek z is thenumberofcurrentlyusedtopiccomponentsandα u = P N Z k=kz+1 α k –allofunusedtopic 43 components. Similarly, we can write (α 1 ,...,α kz ,α u )∼ Dirichlet(γ z/N Z ,...,γ z/N Z ,γ zu ), where γ z/N Z =γ z /N z and γ zu = (N Z −kz)·γz N Z . The same treatment is also applied to that of the interest components. Nowwecangeneralize Eq.2.1andEq.2.2forsamplingposteriorprobabilitiesoftopic z and interest x with HDP priors as follows. 
For sampling the topic component assignment for data point i,

p(z_i = k \mid z_{-i}, x, t) = \frac{N_{r_i, z_{-i}} + \mu_z \alpha_k}{N_{r_i} + \mu_z - 1} \cdot \frac{N_{z_{-i}, x_i, t_i} + \eta/N_T}{N_{z_{-i}, x_i} + \eta}   (2.10)

p(z_i = k_{new} \mid z_{-i}, x, t) = \frac{\mu_z \alpha_u}{N_{r_i} + \mu_z - 1} \cdot \frac{1}{N_T}.   (2.11)

For sampling the interest component assignment for data point i,

p(x_i = j \mid x_{-i}, z, t) = \frac{N_{u_i, x_{-i}} + \mu_x \beta_j}{N_{u_i} + \mu_x - 1} \cdot \frac{N_{x_{-i}, z_i, t_i} + \eta/N_T}{N_{x_{-i}, z_i} + \eta}   (2.12)

p(x_i = j_{new} \mid x_{-i}, z, t) = \frac{\mu_x \beta_u}{N_{u_i} + \mu_x - 1} \cdot \frac{1}{N_T},   (2.13)

where k and j are indices of topic and interest components respectively. With these equations, we allow the model to instantiate a new component from the pool of unused components. Consider the case when a new topic component is instantiated and, for simplicity, let this new component be the last used component, indexed by k'_z. We need to obtain the weight α_{k'_z} for this new component and also update the weight of all unused components, α_{u'}. From the unused component pool, we know that one of its unused components will be chosen as the newly used component k'_z, with probability distribution (α_{k_z+1}/α_u, ..., α_{N_Z}/α_u), which can be sampled from Dirichlet(γ_z/N_Z, ..., γ_z/N_Z). Suppose that component k'_z is chosen from one of these components and we collapse the remaining unused components. It will be chosen with probability α_{k'_z}/α_u, which can be sampled from Beta(γ_z/N_Z, γ_{zu}/N_Z − γ_z/N_Z), where Beta(.) is a Beta distribution. Now, once k'_z is chosen, the probability of choosing this component is updated to α_{k'_z}/α_u ~ Beta(γ_z/N_Z + 1, γ_{zu}/N_Z − γ_z/N_Z). When N_Z → ∞, this reduces to α_{k'_z}/α_u ~ Beta(1, γ_{zu}/N_Z). Hence, to update α_{k'_z}, we first draw a ~ Beta(1, α_u); we then update α_{k'_z} ← a · α_u and α_{u'} ← (1 − a) · α_u. Similar steps are also applied to the interest components. Note that if we compare Eq. 2.10 to Eq. 2.8, the problem identified earlier is gone, since p(z_i = k | z_{-i}, x, t) never has zero probability, even if N_{r_i,z_{-i}} = 0.

At the end of each iteration, I use the method described in Teh et al. (2004) to sample α and β, and update the hyperparameters γ_z, γ_x, μ_z, μ_x using the method described in Escobar and West (1995). I refer to this infinite version of ITM as the "Interest Topic Model with Hierarchical Dirichlet Process" (HDPITM) for the rest of the chapter.

Regarding computational complexity, although N_Z and N_X are both set to approach ∞, the computational cost of each iteration does not approach ∞. Considering Eq. 2.10 and Eq. 2.11, sampling z_i only involves the currently instantiated topics plus one "collapsed topic", which represents all currently unused topics. Similarly, sampling x_i only involves the currently instantiated interests plus one. For a particular iteration, the computational cost for HDP can therefore be approximated as N_K × (κ × (N̄_Z + 1)), and the cost for HDPITM as N_K × (κ × (N̄_Z + N̄_X + 2)), where N̄_Z and N̄_X are respectively the average numbers of topics and interests in that iteration.

2.5.1 Evaluation on Synthetic Data

I ran both HDP and HDPITM to extract topic distributions, φ, on the simulated data sets. In each run, the number of instantiated topics was initialized to ten, which equals the actual number of topics, for both HDP and HDPITM. The number of interests was initialized to three. Similar to the setting in Section 2.4.1, topic and interest assignments were randomly initialized and then trained using 1000 iterations. Subsequently, φ was computed from the last 100 iterations.
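Before turning to the results, the sketch below illustrates the per-tuple topic-sampling step used in these nonparametric runs (Eqs. 2.10-2.11), including the weight update performed when a new topic is instantiated. It is an illustrative Python sketch under simplifying assumptions, not the original code: the count arrays passed in are assumed to already exclude tuple i, alpha_weights holds the global weights of the currently instantiated topics, alpha_u is the remaining unused mass, and the Beta draw follows the update rule stated in the text above.

import numpy as np

def sample_topic_hdp(r, t, x_i, counts, alpha_weights, alpha_u, mu_z, eta, N_T, rng):
    """Resample one topic assignment under HDP priors; may instantiate a new topic."""
    N_rz, N_r, N_zxt, N_zx = counts          # counts exclude the current tuple
    K = len(alpha_weights)                   # currently instantiated topics
    p = np.empty(K + 1)
    p[:K] = ((N_rz[r, :K] + mu_z * alpha_weights) / (N_r[r] + mu_z - 1) *
             (N_zxt[:K, x_i, t] + eta / N_T) / (N_zx[:K, x_i] + eta))   # Eq. 2.10
    p[K] = (mu_z * alpha_u) / (N_r[r] + mu_z - 1) * (1.0 / N_T)         # Eq. 2.11
    k = rng.choice(K + 1, p=p / p.sum())
    if k == K:                               # a new topic is instantiated:
        a = rng.beta(1.0, alpha_u)           # split off part of the unused mass
        alpha_weights = np.append(alpha_weights, a * alpha_u)
        alpha_u = (1.0 - a) * alpha_u
    return k, alpha_weights, alpha_u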
The results are shown in Figure 2.10 (a) and (b) for HDP and HDPITM respectively. From these results, the behaviors of both modelsfor differentsettingsaresomewhatsimilartothoseofLDAandITM.Inparticular, HDPITM can exploit user information to help disambiguate tag senses, while HDP cannot. Hence, the performance of HDPITM is better than that of HDP when tag ambiguity level is high. Moreover, since topics may lose some statistical strength under low user interest condition, HDPITM is inferior to HDP, similar to Figure 2.4 (b) for the finite case. As one can compare the plots (a) and (b) in Figure 2.4 and Figure 2.10, the perfor- mance of infinite model is generally worse than that of the finite one, even though I allow the former the ability to adjust topic/interest dimensions. One possible factor is that the model still allows topic/interest dimensions (configuration) to change even though the trained model is in a “stable” state. That would prohibit the model from optimizing its parameters for a certain configuration of topic/interest dimensions. One evidence that supports this claim is that, although the log likelihood seems to converge, the number of 46 topics (for both models) and interests (only for HDPITM) still slightly fluctuate around a certain value. From this speculation, I ran both HDP and HDPITM with the different strategy. In particular, Isplitmodeltrainingintotwo periods. Inthefirstperiod,Iallowthemodelto adjust its configuration, i.e., the numbers of topics and interests. In the second period, I still train the model but do not allow the numbers of topics and interests to change. The first one is similar to the training process in the plain HDP and HDPITM. The second one is similar to that of the plain LDA and ITM that use the latest configuration from the first period. In this experiment, I set the first period to 500 iterations; another 500 iterations were set for the second phase. Subsequently, φ is computed from the last 100 iterations of the second. I refer to this training strategy for HDP as HDP+LDA, and that for HDPITM as HDPITM+ITM. The overall improvement of performance using this strategy are shown in Figure 2.10 (c) and (d), compared to (a) and (b). That is, both HDP+LDA and HDPITM+ITM can produce φ, which provide lower Δ, under this strategy. However, HDPITM+ITM performance under the condition with low user interest and low tag ambiguity isstill inferiorto HDP+LDA. Thisissimplybecause their structures are still the same to those of HDP and HDPITM respectively. 2.5.2 Evaluation on Real-World Data In the experiments, I initialize the numbers of topics and interests to 100 and 20 (the numberofinterestsisonlyapplicabletoHDPITM),andtrainthemodelsonthesamereal- world data sets I used in Section 2.4.2. The topic and interest assignments are randomly initialized, and then both models are trained with the minimum 400 and maximum 600 47 X Z T R U N t ψ N U ϕ N R θ N Z ,N X β α η D γ X γ Z μ X μ Z X X Z Z T T R R U U N t ψ ψ N U ϕ ϕ N R θ N Z ,N X β α η η D γ X γ Z μ X μ Z Figure2.9: GraphicalrepresentationontheInterestTopicmodelwithhierarchicalDirich- let process (HDPITM). iterations. For the first 100 iterations, I allow both models to instantiate a new topic or interest as required, under the constraint that the numbers of topics and interests do not exceed 400 and 80 respectively. If the model violates this constraint, it will exit this phase early. 
For the remainder of the iterations, I do not allow the model to add new topics or interests (but these numbers can shrink if some topics/interests collapse during this phase). Then, if the change in log likelihood, averaged over the 10 preceding iterations, is less than 2%, the training process enters the final learning phase. (See Figure 2.11 (f) for an example of the log likelihood over training iterations.) In fact, I found that the process enters the final phase early in all data sets. In the final phase, consisting of 100 iterations, I use the topic and interest assignments in each iteration to compute the distributions of resources over topics.

Figure 2.10: The deviation Δ between actual and learned topics on the synthetic data sets, under different degrees of tag ambiguity and user interest variation. The Δ of HDP is shown on the left (a); that of HDPITM on the right (b). Panels (c) and (d) show the deviation produced by HDP+LDA and HDPITM+ITM respectively. For HDP+LDA, new topics can be instantiated, and thus the number of topics can change, during the first half of the run (HDP); then all topics are frozen (no new topic can be instantiated) during the second half (LDA). The same applies to HDPITM+ITM, which additionally takes user information into account. See Section 2.5.1 for more detail.

The reason I limit the maximum numbers of topics, interests, and iterations over which these models are allowed to instantiate a new topic/interest is that the numbers of users and tags in the data sets are large, and many new topics and interests could be instantiated. This would require many more iterations to converge, and the models would require more memory than is available on the desktop machine I used in the experiments (at maximum, I could allocate 1,300 Mbytes of memory). I would rather allow the model to "explore" the underlying structure of the data within the constraints; in other words, find a configuration that is best suited to the data under a limited exploration period, and then fit the data within that configuration.
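To summarize the schedule used in these experiments, the following sketch outlines the three phases (grow, freeze, average). It is purely illustrative: the `model` object and its methods (gibbs_sweep, log_likelihood, num_topics, num_interests, estimate_phi) are hypothetical placeholders, and the caps and thresholds mirror the values described above.

def train_nonparametric(model, max_iter=600, grow_iter=100, max_topics=400,
                        max_interests=80, window=10, threshold=0.02, final_iter=100):
    """Grow components under caps, freeze the configuration, then average phi."""
    ll_history, phi_samples = [], []
    growing = True
    for it in range(max_iter):
        model.gibbs_sweep(allow_new_components=growing)
        ll_history.append(model.log_likelihood())
        if growing and (it + 1 >= grow_iter or
                        model.num_topics() >= max_topics or
                        model.num_interests() >= max_interests):
            growing = False                  # freeze the topic/interest configuration
        if not growing and len(ll_history) > window:
            recent = ll_history[-(window + 1):]
            changes = [abs((recent[j + 1] - recent[j]) / recent[j]) for j in range(window)]
            if sum(changes) / window < threshold:
                break                        # enter the final averaging phase
    for _ in range(final_iter):
        model.gibbs_sweep(allow_new_components=False)
        phi_samples.append(model.estimate_phi())
    return sum(phi_samples) / len(phi_samples)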
At the end of the parameter estimation, the numbers of allocated topics of the HDP models for flytecomm, geocoder, wunderground, whitepages and online-reservationz were 171, 174, 197, 187 and 175 respectively. The numbers of allocated topics and interests in HDPITM were ⟨307, 43⟩, ⟨329, 44⟩, ⟨231, 81⟩, ⟨225, 78⟩ and ⟨207, 72⟩ respectively, which are larger than those inferred by HDP in all cases. These results suggest that user information allows HDPITM to discover more detailed structure. HDPITM performs somewhat better than HDP on the flytecomm, online-reservationz, and geocoder data sets. Its performance for wunderground and whitepages, however, is almost identical to HDP. As in Section 2.4.2, this is possibly due to high interest variation among users: I suspect that weather and directory services are of interest to all users, and are therefore bookmarked by a large variety of users.

Figure 2.11: Performance of the different methods on the five data sets, panels (a)-(e). Each plot shows the number of relevant resources (those similar to the seed) within the top 100 results produced by HDP (the nonparametric version of LDA) and HDPITM (the nonparametric version of ITM). Each model was initialized with 100 topics, and HDPITM with 20 interests. Panel (f) shows the log likelihood of the HDPITM model during the parameter estimation period for the flytecomm data set; similar behavior is found for both HDP and HDPITM on all data sets.

2.6 Conclusion

I have presented a probabilistic model of social annotation that takes into account the users who created the annotations. I argued that my model is able to learn a more accurate topic description of a corpus of annotated resources by exploiting individual variations in user interests and vocabulary to help disambiguate tags. My experimental results on collections of annotated Web resources from the social bookmarking site Delicious show that the model can effectively exploit social annotation on the resource discovery task.

Chapter 3
Learning Concept Hierarchies: Folksonomy Learning Problem and Evaluation Metrics

In this chapter, I describe the hierarchical relations appearing in social annotation systems. Subsequently, I formally define the problem of learning a communal hierarchy, folksonomy learning, followed by a discussion of the challenges. I then explain the strategies for evaluating a learned folksonomy.

3.1 Hierarchical Relations in Social Annotation Systems

Hierarchical relations among concepts appear implicitly and explicitly in social annotation systems. In tagging, for example, we usually observe both generic and specific tags associated with a certain content item. Usually, both the tags "insect" and "grasshopper" appear on grasshopper photos. On the other hand, hierarchical relations can also be specified by users explicitly. Some users employ inventions like the colon ":" or slash "/" to combine several related keywords into a new tag; in such cases, the preceding keyword is often a superclass of the following keyword.

Figure 3.1: Hierarchical relations appearing in social annotation systems. (a) Implicit hierarchical relations between the tags "Grasshopper", "Insect" and "Orthoptera" on a photo tagged Insect, Grasshopper, Australian, Macro, Orthoptera. (b) Explicit hierarchical relations between a folder (collection) and its sub-folders (sets).

Recently, some social annotation systems have also begun to allow users to explicitly specify hierarchical relations. On Delicious, users can group related tags into bundles. On Bibsonomy, users can group related tags into relations. On Flickr, users can group related photos into a set and can also group related sets into a collection. A collection and all of its sets can be viewed as a shallow hierarchy.
One-level-deep examples of these hierarchical relations are illustrated in Figure 3.1 (b). While the sites themselves do not impose any constraints on the vocabulary or se- mantics of the relations used, in practice users employ them to represent both subclass relationships (‘dog’ is a kind of ‘mammal’) and part-of relationship (‘my kids’ is a part of ‘family’). Users appear to express both types of relations (and possibly others) through personal hierarchies, in effect using the hierarchies to specify broader/narrower relations. 54 3.1.1 Personal Hierarchies in Flickr In this subsection, I use the photo sharing Web site, Flickr, 1 as an example to explain the previously mentioned hierarchical relations more precisely. Flickr allows users to group their photos in album-like folders, called sets. Users can also group sets into “super” albums, called collections (and related collections into new collections). 2 Both sets and collections are named by the owner of the image. A photo can belong to multiple sets. While Flickr does not enforce any specific rules about how to organize photos or how to name them, most users group “similar” or “related” photos into the same set and related sets into the same collection. Some users create multi-level hierarchies containing collections of collections, etc., but the vast majority of users who use collections create shallow hierarchies, consisting of collections and their constituent sets. Figure 3.2 (a) shows some of the collections created by an avid naturalist on Flickr. These collections reflect the subjects she likes to photograph: Birds, Mammals, Plants, Mushrooms & Fungi, PlantPests,PlantDiseases,etc. Figure3.2(b)showstheconstituentsetsofthePlantPests collection: Plant Parasites, Sap Suckers, Plant Eaters, and Caterpillars. Each set contains one or more photos, which are tagged bythe user. Thiscollection and its constituent sets can be schematically viewed as a hierarchy as in Figure 3.3. For example, a photograph in the set Caterpillars, shown in Figure 3.2 (c), is annotated with multiple tags describing it (Animal, Lepidoptera, Moth, larva, Caterpillar), its color (Black and orange), 1 http://www.flickr.com 2 The collection feature is limited to paid “pro” users. Pro users can also create an unlimited number of photo sets, while free membership limits a user to three sets. 55 (a) (b) (c) Figure 3.2: Personal hierarchies specified by a Flickr user. (a) Some of the collections created by the user and (b) sets associated with the Plant Pests collection, and (c) tags associated with an image in the Caterpillars set. 56 collection set photos tags Plant Pests Plant Parasites Sap Suckers Plant Eaters Caterpillars Figure 3.3: Schematic diagram presents the personal hierarchy of Figure 3.2(b) condition (on Senecio, eating), and location (North Seatac Park, King County, WA, North America). 3.2 Folksonomy Learning Definition Individual hierarchical relations and personal hierarchies can be considered as footprints ofacommunity’sknowledge,whichisexpressedasacommontaxonomy, alsocalledafolk- sonomy. 3 To illustrate this, Figure 3.4 (left) depicts one such folksonomy about ‘animal’ and its ‘bird’ subconcepts shared by a group of users. When users organize the content they created, e.g., photographs on Flickr, they select some portions of the folksonomy for categorization. Weobservethesecategoriesthroughtheshallowpersonalhierarchiesusers create. 
Figure 3.4 (right) depicts some of the personal hierarchies specified by different users to organize their ‘animal’ and ‘bird’ content. 3 This term is a blend of the words “folk” and “taxonomy”, invented by Thomas Vander Wal(http://vanderwal.net/folksonomy.html). Its original meaning was intended to refer to social tag- ging and social classification. In this thesis, nevertheless, folksonomy is used to refer to a communal taxonomy. 57 Folksonomy `folk knowledge’ Personal hierarchies Users Annotate Objects Folksonomy Learning Figure 3.4: Schematic diagram illustrates the interaction between a folksonomy and per- sonal hierarchies in an animal domain. Specifically, users select small fractions of the folksonomy to annotate objects, which can be observed as in many personal hierarchies. The goal of this thesis is to develop folksonomy learning approaches that can infer a folksonomy from these hierarchies. 58 Since folksonomies are central to an array of applications as mentioned in Chapter 1, much of recent research has focused on extracting or learning folksonomies (e.g., Brooks and Montanez, 2006; Markines et al., 2006; Schmitz, 2006; Mika, 2007; Zhou et al., 2007). HereIformallyrefertofolksonomy learningasaprocess that learns acommunal taxonomy from social annotations, which are in form of tags, hierarchical relations and/or personal hierarchies. Because individual annotation is sparse, one way to learn folksonomy is to integrate many annotations together into a more detailed taxonomy, with the hope that the combination is close to the actual folksonomy that users have in mind. Learning folksonomy comes with a number of challenges. Since users are free to annotate content according to their own preferences, social annotation is noisy, shallow, sparse, ambiguous, conflicting, multi-faceted, and expressed at inconsistent granularity levels across many users. 3.3 Challenges in Learning Folksonomies from Social Annotation Learning folksonomies from social annotation, specifically, from structured annotation, presents a number of challenges: 3.3.1 Sparseness Social annotation is usually very sparse. Users provide four to seven tags per bookmark on Delicious in my data set and 3.74 tags per photo on Flickr (Rattenbury et al., 2007). Sparsenessisalso manifested in thehierarchical organization created byan individual. In 59 Istanbul Antalya Amasya Bird Duck Geese Turkey Turkey (a) Ambiguity Japan Japan USA USA Food Food People People Travel China China Travel (b) Conflict UK Scotland London UK Glasgow Edinburgh London Scotland Glasgow Shetland Scotland London (c) Varying granularity Figure 3.5: Schematic diagrams of personal hierarchies created by Flickr users. (a) Am- biguity: the same term may have different meaning (“turkey” can refer to a bird or a country). (b) Conflict: users’ different organization schemes can be incompatible (china is a parent of travel in one hierarchy, but the other way around in another). (c) Granu- larity: users have different levels of expressiveness and specificity, and even mix different specificity levels within the same hierarchy (Scotland (country) and London (city) are both children of UK). Nodes are colored to aid visualization. 60 myFlickr data set, Ifoundonly 600 out of21,792 users—approximately 0.02 percent— who created multi-level (collections of collections) hierarchies. Most users define shallow (single-level) hierarchies. Moreover, among these shallow hierarchies, few users organize content the same way. 
For instance, of the 433 users who created an animal collection, only a few created common child sets, such as bird, cat, dog or insect. In order to learn a rich andcomplete folksonomy, we have to aggregate social annotations frommany different users. 3.3.2 Noisy vocabulary Vocabulary noise has several sources. One common source is variations and errors in spelling. Noise also arises from users’ idiosyncratic naming conventions. While such names as not sure, pleaseaddthistothethemecomppoll, mykid may be meaningful to the image owner and her narrow interest group, they are relatively meaningless to other users. 3.3.3 Ambiguity An individual tag is often ambiguous (Mathes, 2004; Golder and Huberman, 2006). For example, jaguar can be used to refer to a mammal or a luxury car. Similarly, terms that are used to name collections and sets can refer to different concepts. Consider the hierarchy in Figure 3.5 (a), where turkey collection could be about a bird or a country. Similarly, victoria can be a place either in Canada or Australia. When combining annotationstolearncommonfolksonomies,weneedtobeawareofitsmeaning. Structural and contextual information may help disambiguate annotation. 61 3.3.4 Structural noise and conflicts Like vocabulary noise, structural noise has a number of sources and can lead to incon- sistent or conflicting structures. Structural noise can arise as a result of variations in individuals’ organization preferences. Suppose that, as shown in Figure 3.5 (b), user A organizes photos first by activity, creating a collection called travel, and as part of this collection, a set called china, for photos of her travel in China. Meanwhile, user B or- ganizes photos by location first, creating a collection china, with constituent sets travel, people, food, etc. In one hierarchy, therefore, travel is more general than china, and in the second hierarchy, it is the other way around. Sometimes conflicts are caused by vo- cabulary differences among individual users. For example, to some users bug is a “pest,” a term broader than insect, while to others it is a subclass of insect. As a result, some users may express bug→ insect, while the others express an inverse relation. Another source of noise is variation in degree of expertise on a topic. Many users assemble images of spidersin a set called spidersand assign it to an insect collection, while others correctly assign spiders to arachnid. 3.3.5 Varying granularity level Differences in users’ level of expertise and expressiveness may also lead to relatively imprecise annotation. Experts may use specific breed names to tag dog photos, while non-experts will simply use the tag dog to annotate them (Golder and Huberman, 2006). Inaddition, oneusermayorganizephotosfirstbycountryandthenbycity, whileanother organizes them by country, then subregion and then city, as shown in Figure 3.5 (c). 62 Combining data from these users potentially generates multiple paths from one concept to another. 3.3.6 Discussion Several recentresearches have addressedsomeoftheabove challenges. For instance,Hey- mann and Garcia-Molina (2006); Schmitz (2006) proposed inducing folksonomies from tags by utilizing tag statistics. The basic motivation behind these approaches is that more frequent tags describe more general concepts. However, frequency-based methods cannot distinguish between more general and more popular concepts, e.g., we can end up with Los Angeles is broader than California because people annotate the former a lot more frequently than the latter. 
My earlier work, namely sig (Plangprasopchok and Lerman,2009), whichIwillbrieflyexplaininAppendixB, bypassesthisproblembyusing user-specified hierarchical relations, extracted from personal hierarchies. Nevertheless, it ignored other evidence, e.g., structures of hierarchies and tags, which potentially address the challenges listed above. The next chapters will present the approaches that better respond to these challenges. 3.4 Folksonomy Evaluation To measure how good the folksonomy learning approach is, one can measure the quality of the learned folksonomy. Because folksonomy is in a tree form, which shares some commonality to taxonomies and ontologies, here I describe strategies to evaluate them and how to adapt these approaches to the present setting. A new metric for evaluating how comprehensive a tree structure is, will also be described at the end of this section. 63 3.4.1 Taxonomy and Ontology Evaluation: An Overview According to Brewster et al. (2004), ontology (and taxonomy) evaluations can beroughly classified in three different classes as follows. • Human-judgment-based approaches: basically, the evaluations select parts of an ontology and ask people to judge them. Although this type of approach seems quite natural, it’s relatively subjective and laborious. • Task-based approaches: the evaluations usetasks to measurehow good an ontology is. However, it’sdifficulttostandardizespecifictasksforvariouskindsofontologies. • Reference-based approaches: the evaluations compare the learned ontology to some gold-standard/referenceontology. However, thisevaluationrequiresmetricstomea- sure how consistent the learned one to the reference is. In addition, the reference must be “comparable”, i.e., it is in the same domain as to the learned ontology. In this thesis, I follow the third class of evaluations since in recent years several hand-crafted hierarchies have become available, such as hierarchies from WordNet 4 and Open Directory Project (ODP). 5 I chose ODP because, in contrast to WordNet, ODP is collaboratively built by many registered users. These users seem to use more colloquial terms than those that appear in WordNet. In addition, like social Web users, they specify less formal relations, mainly broader/narrower relations. WordNet, on the other hand, specifies a number of formal relations among concepts, including hypernymy and 4 http://wordnet.princeton.edu/ 5 http://rdf.dmoz.org/ 64 meronymy. In whatfollows, Iwill describemetrics that measure howconsistent a learned hierarchy is with its reference hierarchy. 3.4.2 Lexical Recall According to Maedche and Staab (2002), Lexical Recall measures how well a taxonomy induction process can recover concepts that exist in the actual taxonomy, regardless of the correctness of the structure of the learned taxonomy. For simplicity, I also ignore the polysemy issue, i.e., I assume that concepts with the same name are the same. LetC 1 be a set of all concepts in the learned taxonomy T 1 , and let C 2 be the set of concepts in the reference taxonomy T 2 . Lexical Recall is defined as follows: LR(T 1 ,T 2 ) = |C 1 ∩C 2 | |C 2 | . (3.1) 3.4.3 Modified Taxonomic Overlap Taxonomic Overlap (Maedche and Staab, 2002; Cimiano et al., 2005) is a similarity measure that takes into consideration taxonomy structure. In particular, each concept in a learned taxonomy and a corresponding concept in a reference taxonomy are compared onhowmuchtheirancestorsanddescendantsoverlap. 
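Lexical Recall itself (Eq. 3.1) is straightforward to compute once the two concept sets have been extracted; a minimal sketch, treating identically named (stemmed) concepts as identical, in line with the simplification above:

```python
def lexical_recall(learned_concepts, reference_concepts):
    """Eq. 3.1: fraction of reference concepts recovered by the learned taxonomy,
    ignoring structure and treating same-named concepts as the same concept."""
    c1, c2 = set(learned_concepts), set(reference_concepts)
    return len(c1 & c2) / len(c2)

# Hypothetical example: a learned 'bird' subtree versus a reference hierarchy.
learned = {"bird", "duck", "swan", "heron", "gull"}
reference = {"bird", "duck", "swan", "owl", "hawk", "wren"}
print(lexical_recall(learned, reference))   # 3 of 6 reference concepts recovered -> 0.5
```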
A set of super-concepts (ancestors) and sub-concepts (descendants) of a given concept $c$ in a taxonomy $T$ is referred to as its Semantic Cotopy (SC), defined following Maedche and Staab (2002) as:

$$ SC(c,T) := \{\, c_j \in T \mid c <_T c_j \ \lor\ c >_T c_j \,\}. \qquad (3.2) $$

Note for Eq. 3.2 that $c <_T c_j$ returns all descendants of $c$ in taxonomy $T$, and $c >_T c_j$ returns the ancestors of $c$. Unlike the original formulation of SC, I do not include the node $c$ itself, to avoid an overly optimistic evaluation.

Taxonomic Overlap (TO) between two taxonomies can be determined from the average degree of overlap between the SCs of concepts in these taxonomies. According to Maedche and Staab (2002), the TO of taxonomies $T_1$ and $T_2$ is:

$$ TO(T_1,T_2) = \frac{1}{|C_1|} \sum_{c \in C_1} TO(c,T_1,T_2), \qquad (3.3) $$

where

$$ TO(c,T_1,T_2) := \begin{cases} TO'(c,T_1,T_2) & \text{if } c \in C_2 \\ TO''(c,T_1,T_2) & \text{if } c \notin C_2, \end{cases} \qquad (3.4) $$

and where $TO'$ and $TO''$ are defined as:

$$ TO'(c,T_1,T_2) := \frac{|SC(c,T_1) \cap SC(c,T_2)|}{|SC(c,T_1) \cup SC(c,T_2)|}, \qquad (3.5) $$

$$ TO''(c,T_1,T_2) := \max_{c' \in C_2} \frac{|SC(c,T_1) \cap SC(c',T_2)|}{|SC(c,T_1) \cup SC(c',T_2)|}. \qquad (3.6) $$

Note that Eq. 3.6 makes an optimistic assessment when a concept name $c$ in $T_1$ does not exist in $T_2$, by picking the $c'$ in $T_2$ whose SC best matches that of $c$ in $T_1$. In other words, the method assumes that $c'$ refers to the same concept as $c$, although their names differ.

I discovered that the original version of TO in Eq. 3.3 does not penalize incorrect concept ordering. Consider the two trees in Figure 3.6.

Figure 3.6: Illustrations of (a) a correct tree about "moth", and (b) an incorrect version of (a) in which "insect" and "arctiida" (arctiidae) are misplaced. The original TO judges the two trees identical.

Since the SCs of "insect", "moth" and "arctiida" are the same in both trees, TO in Eq. 3.3 will judge trees (a) and (b) to be identical ($TO = 1.0$). This is because SC in Eq. 3.2 considers all ancestors and descendants, regardless of their ordering. One possible solution is to consider a concept's ancestors and descendants separately. I modify Eq. 3.3 as follows:

$$ TO(T_1,T_2)^{*} = \frac{1}{2}\left[ \frac{1}{|C_1^{-root}|} \sum_{c \in C_1^{-root}} \widehat{TO}(c,T_1,T_2) \;+\; \frac{1}{|C_1^{-leaves}|} \sum_{c \in C_1^{-leaves}} \widecheck{TO}(c,T_1,T_2) \right], \qquad (3.7) $$

where $C_1^{-root}$ is the set of all concepts in $T_1$ except its root concept, which I exclude because it has no ancestors. Similarly, $C_1^{-leaves}$ is the set of all concepts in $T_1$ except its leaf concepts. $\widehat{TO}$ ($\widecheck{TO}$) is computed as in Eq. 3.5, but uses $\widehat{SC}$ ($\widecheck{SC}$) instead of SC. I define $\widehat{SC}$ as the ancestor Semantic Cotopy, which only considers the ancestors of a given concept, and $\widecheck{SC}$ as the descendant Semantic Cotopy, which only considers its descendants:

$$ \widehat{SC}(c,T) := \{\, c_j \in T \mid c >_T c_j \,\}, \qquad (3.8) $$
$$ \widecheck{SC}(c,T) := \{\, c_j \in T \mid c <_T c_j \,\}. \qquad (3.9) $$

Returning to the case in Figure 3.6, the modified TO metric detects that trees (a) and (b) have different concept orderings: $TO(T_b,T_a)^{*} = 0.417$. Since TO is not symmetric, as pointed out in Cimiano et al. (2005), one can compute the harmonic mean of $TO(T_1,T_2)$ and $TO(T_2,T_1)$ to obtain a symmetric score.
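To make these definitions concrete, the following sketch computes the cotopies of Eqs. 3.2, 3.8 and 3.9 and a simplified, one-directional version of the modified overlap of Eq. 3.7. It is a minimal illustration rather than the evaluation code used in the experiments: taxonomies are assumed to be given as a child-to-parent map plus a parent-to-children map, concepts are matched by (stemmed) name as in the simplification above, and the optimistic $TO''$ fallback of Eq. 3.6 is omitted (concepts missing from the reference simply score zero).

```python
from itertools import chain

def ancestors(c, parent):
    """Ancestor cotopy of Eq. 3.8: all ancestors of c, given a child -> parent map."""
    out = set()
    while c in parent:
        c = parent[c]
        out.add(c)
    return out

def descendants(c, children):
    """Descendant cotopy of Eq. 3.9: all descendants of c, given a parent -> [children] map."""
    out, stack = set(), list(children.get(c, []))
    while stack:
        d = stack.pop()
        out.add(d)
        stack.extend(children.get(d, []))
    return out

def semantic_cotopy(c, parent, children):
    """SC(c, T) of Eq. 3.2: ancestors and descendants of c, excluding c itself."""
    return ancestors(c, parent) | descendants(c, children)

def overlap(sc1, sc2):
    """TO' of Eq. 3.5: Jaccard overlap of two cotopies."""
    union = sc1 | sc2
    return len(sc1 & sc2) / len(union) if union else 0.0

def modified_to(tax1, tax2):
    """One direction of the modified TO of Eq. 3.7: ancestor and descendant cotopies
    are compared separately and their averages combined.
    Each taxonomy is a (child_to_parent, parent_to_children) pair."""
    p1, ch1 = tax1
    p2, ch2 = tax2
    concepts = set(p1) | set(ch1) | set(chain.from_iterable(ch1.values()))
    non_root = [c for c in concepts if c in p1]        # concepts that have ancestors
    non_leaf = [c for c in concepts if ch1.get(c)]     # concepts that have descendants
    up = [overlap(ancestors(c, p1), ancestors(c, p2)) for c in non_root]
    down = [overlap(descendants(c, ch1), descendants(c, ch2)) for c in non_leaf]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return 0.5 * (mean(up) + mean(down))
```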
3.4.4 Structural Metrics: Area Under Tree

The Lexical Recall and modified Taxonomic Overlap metrics described above measure how consistent a learned tree is with its reference. Nevertheless, they do not explicitly reflect how expressive a learned tree is. Given the same data set, we would prefer an approach that generates bushier and deeper trees: the scope of concepts in such trees is broadly enumerated (tree width), while each concept is subcategorized in sufficient detail (tree depth). Although one could report a tree's average depth and branching factor, it is difficult to judge which trees are better overall from these two independent numbers: a very bushy tree may be only one level deep, while a very deep tree may have a chain-like structure. Here I define a simple, yet intuitive measure, Area Under Tree (AUT), which takes both tree bushiness and depth into account.

Figure 3.7: Illustration of computing AUT from the distribution of nodes at each depth of a tree (left: a learned tree; right: the distribution of nodes at each depth).

Basically, the metric balances the trade-off between bushiness and depth. To calculate the AUT of a tree, I compute the distribution of the number of nodes at each level and then compute the area under that distribution. Intuitively, trees that keep branching out at each level will have a larger AUT than those that are short and thin. Suppose we have a tree with one node at the root, three nodes at the 1st level and four at the 2nd. With the scale of the tree depth set to 1.0, the AUT of this tree is 0.5×(1+3)+0.5×(3+4) = 5.5 (a sum of trapezoids). This computation is illustrated in Figure 3.7.

To demonstrate how AUT behaves, Figure 3.8 provides further examples of different tree shapes with their AUT values. One can see that a tree that keeps branching out at each depth, as in Figure 3.8 (c), gets a higher AUT than trees that are bushy but shallow, e.g., Figure 3.8 (b), or that have chain-like structures, e.g., Figure 3.8 (a) and (d). Unfortunately, given the same number of nodes, AUT unfairly favors a chain-like structure over a tree shape that keeps spanning out from its root to its descendants.

Figure 3.8: Examples of different tree shapes with their AUT.

Nevertheless, at the same depth, a tree that keeps spanning out at each level will generally get a better AUT.
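The trapezoid computation is easy to express in code. The sketch below assumes the tree is given as a parent-to-children map and that successive depths are one unit apart; it is an illustration of the metric as defined above, not the evaluation script used in the experiments.

```python
def nodes_per_depth(root, children):
    """Count the number of nodes at each depth of a tree given as a parent -> [children] map."""
    counts, level = [], [root]
    while level:
        counts.append(len(level))
        level = [c for n in level for c in children.get(n, [])]
    return counts

def area_under_tree(root, children):
    """AUT: the area under the node-count-versus-depth curve, computed as a sum
    of trapezoids with unit depth spacing."""
    counts = nodes_per_depth(root, children)
    return sum(0.5 * (a + b) for a, b in zip(counts, counts[1:]))

# Example: 1 node at the root, 3 at depth 1 and 4 at depth 2.
tree = {"r": ["a", "b", "c"], "a": ["a1", "a2"], "b": ["b1"], "c": ["c1"]}
print(area_under_tree("r", tree))  # 0.5*(1+3) + 0.5*(3+4) = 5.5
```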
In all, one should rely on a combination of several metrics rather than on just one of them. It is straightforward that one should prefer the approach that learns a folksonomy which (1) recovers a larger number of concepts, (2) is more structurally consistent, and (3) is more expressive. The metrics described in this chapter can gauge these properties.

Chapter 4

An Incremental Approach to Folksonomy Learning

In this chapter, I present a statistical approach to induce hierarchies from social annotation. The approach incrementally combines small personal hierarchies into a bushier and deeper common hierarchy. I evaluate the approach on a real-world data set obtained from a photo-sharing Web site.

4.1 An Incremental Approach to Learn Folksonomies

Because of the first challenge mentioned in Section 3.3, data sparseness, we cannot rely on a few personal hierarchies to construct a deep and detailed folksonomy: most hierarchies contain only a few children, as illustrated in Figure 4.1. In order to learn a more detailed folksonomy from these hierarchies, one possible solution is to combine them through their overlapping substructures, e.g., nodes. Substructures of one hierarchy may complement another; hence, the combined structure becomes more complete. To illustrate the idea, suppose we have two personal hierarchies: bird→{swan, duck, heron} and bird→{gull, swan, heron, geese}. Combining these two hierarchies creates a more complete bird hierarchy with five different kinds of birds, which is larger than either of them. Moreover, if we also have personal hierarchies about different bird breeds, combining them with our bird hierarchies will reveal an even more complete structure of bird concepts.

Figure 4.1: Some examples of personal hierarchies about "bird" in Flickr. Most of the hierarchies contain a small number of children. All node names are normalized using the Porter stemming algorithm (Porter, 1980).

Similar substructures can be identified from their commonality. From the bird examples, it is clear that the bird node of the former is similar to the bird node of the latter, since they share not only node features, e.g., a node name and tags, but also structural information: both have the children swan and heron in common. Such structural information becomes even more crucial when features are missing or are not specific enough.

Combining all hierarchies arbitrarily can cause inconsistencies, i.e., loops and shortcuts, because the hierarchies are generated independently by individuals, as described in Section 3.3. Hence, we need to consider the structure of the learned folksonomy as we construct it. To tackle this problem, the approach presented in this chapter aggregates personal hierarchies incrementally, from the top of the folksonomy down to its bottom, removing inconsistent parts as it goes.

To begin describing the approach, I first formally define a personal hierarchy as a shallow tree, a sapling, composed of a root node $r_i$ and its children, or leaf nodes, $\langle l_{i_1}, \ldots, l_{i_j} \rangle$. The root node corresponds to a user's collection and inherits its name, while the leaf nodes correspond to the collection's constituent sets and inherit their names. Only a small number of users define multi-level hierarchies; for these, I decompose and represent each hierarchy as a collection of saplings: at the top level there is a root node corresponding to the top-level collection, with leaf nodes corresponding to that collection's sets or collections; I then construct saplings for the leaf nodes that are themselves collections, and so on. I assume that the hierarchical relations between a root and its children, $r_i \rightarrow \langle l_{i_j} \rangle$, specify broader-narrower relations. Hence, the sapling in Figure 3.2 (b) is Plant Pests → {Plant Parasites, Sap Suckers, Plant Eaters, Caterpillars}.

In addition to hierarchical structure, each sapling carries information derived from tags. On Flickr, users attach tags only to photos; therefore, the tag statistics of a sapling's leaf (set) are aggregated from that set's constituent photos. Tag statistics are then propagated from the leaves to the parent node. In my example, Plant Parasites aggregates tag statistics from all photos in this set, and its parent Plant Pests contains tag statistics accumulated from all photos in Plant Parasites and its siblings. Let us define the tag statistics of a node $x$ as $\tau_x := \{(t_1, f_{t_1}), (t_2, f_{t_2}), \ldots, (t_k, f_{t_k})\}$, where $t_k$ and $f_{t_k}$ are a tag and its frequency, respectively. Hence, $\tau_{r_i}$ is aggregated from all the $\tau_{l_{i_j}}$'s.

Given a collection of saplings specified by many different users, my goal is to aggregate them into a common, denser and deeper tree. Before describing the approach, I first briefly describe data preprocessing steps that address some of the sparseness and noise challenges listed above.
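As a concrete illustration of this representation, the sketch below models a sapling as a root name plus per-leaf tag counters, with the root's tag statistics aggregated from its leaves. The class and field names are illustrative choices, not the data structures of the actual implementation.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Sapling:
    """A shallow personal hierarchy: a root (collection) and its leaves (sets)."""
    root: str
    leaf_tags: dict = field(default_factory=dict)   # leaf name -> Counter of photo tags

    @property
    def leaves(self):
        return list(self.leaf_tags)

    @property
    def root_tags(self):
        """Tag statistics of the root, aggregated from all of its leaves."""
        total = Counter()
        for tags in self.leaf_tags.values():
            total.update(tags)
        return total

# Hypothetical sapling built from the tags of photos in each set.
s = Sapling("plant pests", {
    "plant parasites": Counter({"aphid": 12, "leaf": 7, "macro": 5}),
    "caterpillars":    Counter({"caterpillar": 20, "larva": 9, "macro": 4}),
})
print(s.root_tags.most_common(3))   # top tags propagated to the root
```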
4.1.1 Data Preprocessing

I extract terms representing concepts from collection and set names. I found that users often combine two or more concepts within a single name, e.g., "Dragonflies/Damselflies", "Mushrooms & Fungi", "Moth at Night." Terms can be joined by bridging words, which include the prepositions "at", "of" and "in," the conjunctions "and" and "or," or special characters such as '&', '<', '>', ':', '/'. I start by tokenizing collection and set names on these words and characters. I do not tokenize on white space, to avoid breaking up terms like "South Africa." I remove terms composed only of non-alphanumeric characters, as well as frequently used uninformative words, e.g., "me" and "myself." I then lowercase all terms and use the Porter stemming algorithm (Porter, 1980) to normalize the remaining terms. This step is necessary to mitigate noise due to individual variations in naming conventions and vocabulary usage.¹

¹ In some cases, stemming can cause term ambiguity. For example, "skiing" and "skies" both stem to "ski." However, such ambiguity can be resolved by exploiting contextual information, as described later on.

Learning to tokenize. In some cases, multiple terms in a collection or set name are more meaningful when kept together. For instance, black & white usually refers to a photography technique, not to things that are colored black or white. I use a likelihood ratio test, which is sensitive to the order of the terms, to determine whether terms should be split or kept together. This method, which tests whether terms are likely to be generated together or independently, is similar to collocation discovery (Manning and Schütze, 1999) in statistical natural language processing.

Suppose that I have a set with a composite name: term a followed by term b. Following Manning and Schütze (1999), I assume that there are two hypotheses that could explain the co-occurrence of a and b. The first hypothesis, $H_1$, assumes that b's occurrence is independent of a's: $p(b|a) = p(b|\neg a)$. The second hypothesis, $H_2$, assumes that b's occurrence depends on a's: $p(b|a) \neq p(b|\neg a)$. Assuming that the probability of term occurrence follows a binomial distribution, the log of the likelihood ratio between the first and the second hypothesis is

$$ \log\lambda(a,b) = \log\frac{L(H_1)}{L(H_2)} = \log\left[\frac{b(N(a,b),\,N(a),\,p)}{b(N(a,b),\,N(a),\,p_1)} \cdot \frac{b(N(b)-N(a,b),\,N(\cdot)-N(a),\,p)}{b(N(b)-N(a,b),\,N(\cdot)-N(a),\,p_2)}\right], \qquad (4.1) $$

where $b(k,n,x) = \binom{n}{k}x^k(1-x)^{n-k}$ is the binomial distribution, $N(a,b)$ is the number of sets and collections in which both a and b appear, $N(a)$ ($N(b)$) is the number in which a (b) appears, and $N(\cdot)$ is the number of all sets/collections. The probabilities $p$, $p_1$ and $p_2$ are estimated as $\frac{N(b)}{N(\cdot)}$, $\frac{N(a,b)}{N(a)}$ and $\frac{N(b)-N(a,b)}{N(\cdot)-N(a)}$, respectively. Note that this ratio test is sensitive to the order of the terms, as opposed to the usual independence criterion, which tests whether $p(a,b) = p(a)p(b)$. I use this likelihood ratio test
After tokenization I get the sapling animal→{cats, dogs}. However, if the root node is determined to have a composite name, I simply ignore the entire sapling because it’s unclear which parent concepts correspond to which child concepts. 4.1.2 Relational Clustering of Structured Annotation In order to learn a folksonomy, we need to aggregate saplings both horizontally and ver- tically. By horizontal aggregation, I mean merging saplings with similar roots, which expands the width of the learned tree by adding leaves to the root. By vertical aggre- gation, I mean merging one sapling’s leaf to the root of another, extending the depth of the learned tree. The approach I use exploits contextual information from neighbors in addition to local features to determine which saplings to merge. The approach is simi- lar to relational clustering (Bhattacharya and Getoor, 2007) and its basic element is the similarity measure between a pair of nodes. 76 I define a similarity measure, which combines heterogeneous evidence available in the structured social annotation, and is a combination of local similarity and structural similarity. The local similarity between nodes a and b, localSim(a,b), is based on the intrinsic features of a and b, such as their names and tag distributions. The structural similarity, structSim(a,b), is based on features of neighboring nodes. If a is a root of a sapling, its neighboring nodes are all of its children. If a is a leaf node, the neighboring nodes are its parent and siblings. The similarity between nodesa andb is: nodesim(a,b) =(1−α)×localSim(a,b)+α×structSim(a,b), (4.2) where0≤α≤1isaweightforadjustingcontributionsfromlocalSim(,)andstructSim(,). I judge whether two nodes are similar if the similarity is greater than the threshold, τ. To illustrate how the similarity function works, let’s consider the personal hierarchies in Figure 4.2. Here, we are certain that the nodes victoria of user#1 and victoria of user#2 are different because they neither have common tags (local information) nor common children (structural information). 4.1.2.1 Local Similarity The local similarity of nodes a and b is composed of (1) name similarity and (2) tag distribution similarity. Name similarity can be any string similarity metric, which re- turns a value ranging from 0 to 1. Tag similarity, tagSim(,), can be any function for measuring the similarity of distributions. Because of the sparseness of data, and to make the computation fast, I use a simple function which counts the number of common tags, 77 !" " Figure 4.2: Two personal hierarchies about “victoria” concepts from different users. These “victoria”s are different since their local and structural information is different. n, in the top K tags ofa andb; it returns 1 if this number is equal or greater thanJ, else it returns n J . Local similarity is a weighted combination of name and tag similarities: localSim(a,b) =β×nameSim(a,b)+(1−β)×tagSim(a,b)). (4.3) Tag similarity helps address the ambiguity challenge described in Section 3.3. For exam- ple, the top tags of the node turkey that refers to a bird include “bird”, “beak”, “feed”, while the top tags of turkey that refers to the country include different terms about places within the country. 4.1.2.2 Structural Similarity Structuralsimilarity between two nodesdependson thepositionofthenodeswithintheir saplings. 
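Before turning to the structural component, the local part of the similarity (Eqs. 4.2 and 4.3) can be sketched as follows. The sketch builds on the Sapling class above and compares two saplings by their roots; the exact-match name similarity, the default weights and the K/J settings are illustrative placeholders rather than the tuned values reported later in Section 4.2.

```python
from collections import Counter

def tag_sim(tags_a, tags_b, K=40, J=4):
    """Count common tags among the top-K tags of each node; saturate at J common tags."""
    top_a = {t for t, _ in Counter(tags_a).most_common(K)}
    top_b = {t for t, _ in Counter(tags_b).most_common(K)}
    n = len(top_a & top_b)
    return 1.0 if n >= J else n / J

def local_sim(a, b, beta=0.5, name_sim=lambda x, y: float(x == y)):
    """Eq. 4.3: weighted combination of name similarity and tag-distribution similarity.
    Any string similarity in [0, 1] can be plugged in for name_sim."""
    return beta * name_sim(a.root, b.root) + (1 - beta) * tag_sim(a.root_tags, b.root_tags)

def node_sim(a, b, struct_sim, alpha=0.1, beta=0.5):
    """Eq. 4.2: combine local and structural similarity, weighted by alpha."""
    return (1 - alpha) * local_sim(a, b, beta) + alpha * struct_sim(a, b)
```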
I define two versions: structSimRR(,) which computes structural similarity between two root nodes (root-to-root similarity), and structSimLR(,) which evaluates structural similarity between a root of one sapling and the leaf of another (leaf-to-root similarity). 78 Root-to-Root similarity Two saplings A and B are likely to describe the same concept if their root nodes r A and r B have a similar name and some of their leaf nodes also have similar names. In this case, there is no need to compute tagSim(,) of these leaf nodes. I define the normalized common leaves factor, namely CL, as 1 Z P i,j δ(name(l A i ),name(l B j )), where δ(.,.) returns 1 if both arguments are exactly the same; otherwise, it returns0;name(l A i ) is a function that returnsthe name of a leaf node l A i of sapling A. Z is a normalizing constant, which is described in greater detail later. The structural similarity between two root nodes is then defined as follows: structSimRR(r A ,r B ) =CL+(1−CL)×tagSim( ` L A tag , ` L B tag ), (4.4) where ` L A tag isan aggregation oftag distributionsofalll A i ,atwhichname(l A i )6=name(l B j ) for any leaf nodel B j of the sapling B. From Eq. 4.4, I compute the similarity based on: (1) how many of their children have common name (they match); (2) the tag distribution similarity of those that do not have the same name. The second term is an optimistic estimatethatchildnodesofthesesaplingsrefertothesameconceptwhilehavingdifferent names. ThenormalizationcoefficientZ =min(|l X |,|l Y |),where|l X |isanumberofchildrenof X. Iusemin(,)insteadofunion. Thereasonisthatsaplingsaggregated frommanysmall saplingswillcontain alarge numberofchild nodes. When mergingwith arelatively small sapling, the fraction of common nodes may be very low compared to the total number of child nodes. Hence, the normalization coefficient with the union (Z =union(l X ,l Y )), as defined in Jaccard similarity, results in overly penalizing small saplings. min(,), on the 79 other hand, seems to correctly consider the proportion of children of the smaller sapling that overlap with the larger sapling. When I decide that roots r A and r B are similar, I merge saplings A and B with the mergeByRoot(A,B)operation. Thisoperationcreatesanewsapling,M,whichcombines structures and tag statistics ofA andB. In particular, the tag statistics of the root ofM is a combination of those fromr A andr B . The leaves ofM,l M , are a union ofl A andl B . If there are leaves fromA andB that share a name, their tag statistics will be combined and attached to the corresponding leaf in M. The width of the newly merged sapling will increase as more saplings are merged. Also, sinceIsimplymergeleaf nodeswithsimilar names, andtheirrootsalso have similar names, leaf-to-leaf structural similarity structSimLL(,) is not required. This operation addresses the sparseness challenge mentioned in Section 3.3. Root-to-Leaf similarity Merging the root node of one sapling with the leaf node of another sapling extends the depth of the learned folksonomy. Since I consider a pair of nodes with different roles, their neighboring nodes also have different roles. This would appear to make them structurally incompatible. However, in many cases, some overlap between siblings of one sapling and children of another sapling exists. Formally, suppose that we are considering similarity between leaf l A i of sapling A and root r B of sapling B. There might be some l A k6=i of A similar to l B j of B. Consider Figure 3.5 (c). 
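Before walking through that example, the root-to-root machinery just described (the normalized common-leaves factor CL, the structural similarity of Eq. 4.4, and the mergeByRoot operation) can be sketched as follows. The sketch reuses the Sapling and tag_sim definitions above and is a simplified illustration: the tag distributions of the unmatched leaves are simply pooled before being compared.

```python
from collections import Counter

def common_leaf_factor(A, B):
    """CL: number of leaf names shared by A and B, normalized by the smaller sapling."""
    shared = set(A.leaves) & set(B.leaves)
    m = min(len(A.leaves), len(B.leaves))
    return len(shared) / m if m else 0.0

def struct_sim_rr(A, B, K=40, J=4):
    """Eq. 4.4: CL plus the tag similarity of the leaves that did not match by name."""
    cl = common_leaf_factor(A, B)
    shared = set(A.leaves) & set(B.leaves)
    rest_a, rest_b = Counter(), Counter()
    for leaf, tags in A.leaf_tags.items():
        if leaf not in shared:
            rest_a.update(tags)
    for leaf, tags in B.leaf_tags.items():
        if leaf not in shared:
            rest_b.update(tags)
    return cl + (1 - cl) * tag_sim(rest_a, rest_b, K, J)

def merge_by_root(A, B):
    """mergeByRoot(A, B): union the leaves of A and B, summing the tag statistics
    of leaves that share a name."""
    merged = Sapling(A.root)
    for sapling in (A, B):
        for leaf, tags in sapling.leaf_tags.items():
            merged.leaf_tags.setdefault(leaf, Counter()).update(tags)
    return merged
```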
Suppose that we have already merged uk saplings. Now, there are two saplings uk→{scotland, glasgow, edinburgh, london} and scotland→{glasgow, shetland}, and we would like to merge the two scotlands. Since both uk and scotland saplings have glasgow 80 in common, and the user placed glasgow under uk instead of scotland, this shortcut contributes to the similarity between scotlandnodes. The structural similarity between leaf and root nodes that takes this type of shortcut into consideraion is: structSimLR(l A i ,r B ) =structSimRR(r A ,r B ). (4.5) Specifically, this is simply the root-to-root structural similarity of r A and r B , which measures the overlap between siblings of l A i and children of r B . For the case when there is no shortcut, the similarity from this part will be dropped out; hence, the Eq. 4.2 will only be based on the local similarity. 4.1.3 SAP: Growing a Folksonomy by Merging Saplings At this point, I describe sap algorithm, which uses the operations defined above to in- crementally grow a deeper, bushier tree by merging saplings created by different users. In order to learn a folksonomy corresponding to some concept, I start by providing a seed term, the name of that concept. The seed term will be the root of the learned tree. I cluster individual saplings whose roots have the same name as the seed by us- ing the similarity measures as defined in Eq. 4.2, Eq. 4.3 and Eq. 4.4 to identify similar saplings. Saplings within the same cluster are merged into a bigger sapling using the mergeByRoot(,) operation. Each merged sapling corresponds to a different sense of the seed term. Next, Iselect one ofthe merged saplings as the starting point forgrowing thefolkson- omyforthatconcept. Foreach leafoftheinitialsapling,Iusetheleafnametoretrieveall 81 other saplings whose roots are similar to the name. I then merge saplings corresponding to different senses of this term as described above. The merged sapling whose root is most similar to the leaf (using Eq. 4.2, Eq. 4.3 and Eq. 4.5), is then linked to the leaf. In the case that several saplings match the leaf, I merge all of them together before linking. Clustering saplings into different senses, and then merging relevant saplings to the leaves of the tree proceeds incrementally until reaching some threshold, e.g., at a certain depth. Suppose I start with saplings shown in Figure 3.5 (c), and the seed term is uk. The process will first cluster uk saplings. Suppose, for illustrative purposes, that there is only one sense of uk, resulting in a single sapling with root uk. Next, the procedure selects one of the unlinked leaves, say glasgow, to work on. All saplings with root glasgow will be clustered, and the merged glasgow sapling that is sufficiently similar to the glasgow leaf of the uk sapling will then be linked to it at the leaf, and so on. 4.1.3.1 Handling Shortcuts Attaching a sapling A to the learned tree F can result in structural inconsistencies in F. One type of inconsistency is a shortcut, which arises when a leaf of A is similar to a leaf of F. In the illustration above, attaching the scotland sapling to the uk tree will generate a shortcut, or two possible paths from uk to glasgow (r uk →l uk glassgow and r uk →l uk scotland →l scotland glasgow ). Ideally, we would drop the shorter path and keep the longer one which captures more specific knowledge. There are cases where the decision to drop the shorter path cannot be made imme- diately. 
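Before working through those cases, the overall sap loop described at the start of this section can be summarized in code. The sketch below is a simplified, hypothetical rendering of that loop; the callables passed in stand for the pieces discussed above and in the following subsections, and starting from the largest merged sense of the seed is an arbitrary illustrative choice.

```python
def grow_folksonomy(seed, retrieve, cluster, leaf_root_sim, attach, tau=0.5, max_depth=4):
    """Simplified sap loop. The callables are stand-ins for steps described in the text:
      retrieve(name)                -> saplings whose (stemmed) root name matches `name`
      cluster(saplings)             -> merged saplings, one per sense (root-to-root merging)
      leaf_root_sim(tree, leaf, s)  -> Eq. 4.2 similarity between a leaf and a sense's root
      attach(tree, leaf, s)         -> link the sense under the leaf, handling shortcuts
                                       and loops (Sections 4.1.3.1 and 4.1.3.2)
    """
    senses = cluster(retrieve(seed))
    tree = max(senses, key=lambda s: len(s.leaves))        # pick one sense of the seed as root
    frontier = [(tree, leaf, 1) for leaf in tree.leaves]
    while frontier:
        parent, leaf, depth = frontier.pop(0)
        if depth >= max_depth:
            continue
        matches = [s for s in cluster(retrieve(leaf))
                   if leaf_root_sim(parent, leaf, s) > tau]
        if not matches:
            continue
        sense = matches[0]
        for other in matches[1:]:                          # several senses match: merge them
            sense = merge_by_root(sense, other)
        attach(parent, leaf, sense)
        frontier.extend((sense, l, depth + 1) for l in sense.leaves)
    return tree
```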
Suppose we have uk→{london, england, scotland} as the current learned tree, and are about to attach london→{british museum, dockland, england} to it. 82 UK Scotland London England London England Glasgow England B. Museum London Liverpool Manchester Dockland Figure 4.3: Appearance of mutual shortcuts between London and England when merging London and England saplings. To resolve them, I compare the similarity between UK- London and UK-England sapling pairs. Since England sapling is closer to UK than London sapling, I simply attach England sapling to the tree; while ignoring London leaf under UK. Unfortunately, some users placed england under london, and attaching this sapling will create a shortcut to england. The decision to eliminate the shorter path to england cannot be made at this point, since we have no information about whether attaching the englandsapling will also create a shortcut to londonfrom the root (uk). Hence, we have to postpone this decision until we retrieve all relevant saplings that can be attached to the present leaf (l uk london ) and its siblings (l uk england and l uk scotland ). Supposethatl uk england doesmatchtherootofsaplingengland→{london,manchester, liverpool}. Mutualshortcutstoenglandandlondonwouldundesirablyappearonceall the saplings are attached to the tree. Hence, the decision to dropl uk england orl uk london must be made. I base the decision on similarity. Intuitively, a sapling that is more similar, or “closer,” to r uk should be linked to the tree. Formally, the node to be kept is l uk ˆ x , where ˆ x =argmax x {nodesim(r uk ,r x )} and x ={england,london}, while the other will be dropped. This is illustrated in Figure 4.3. 83 4.1.3.2 Handling Loops Attaching a sapling to a leaf of the learned tree may result in another undesirable struc- ture, a loop. Suppose that we are about to attach a saplingA to the leafl F i ofF. A loop will appear if there exists a leaf l A j of A with the same name as some node in the path from root to l F i in F. In order to make the learned tree consistent, we must remove l A j beforeattaching the sapling. For instance, supposewe decide to attach londonsapling to the england sapling in Figure 4.3 at its london node, we have to remove england node of london sapling first. Insomecases, loopsindicatesynonymousconcepts. Inthedataset, Ifoundthatthere are users who specify the relation animal→ fauna, and those who specify the reverse fauna→ animal. Since animal and fauna have similar meaning, I hypothesize that this conflict appears because of variations in users’ expertise and categorization preferences. Todeterminewhetheraloopiscausedbyasynonym,Icheckthesimilaritybetweenr A andr F . Ifitishighenough, Isimplyremovel F j fromF, forwhichname(l F j )=name(l A j ); then,merger A andr F . ThesimilaritymeasureisbasedonEq.4.2. Morestringentcriteria are required since r A and r F have different names. Specifically, I modify tagSim(X,Y) to tagSim syn (X,Y), which instead evaluates |τ X ∩τ Y | min(|τ X |,|τ Y |) , and modify structSim(X,Y) to structSim syn (X,Y), which only evaluates 1 Z P i,j δ(name(l X i ), name(l Y j )). 4.1.3.3 Mitigating Other Structural Noise The similarity measure between root-to-leaf defined earlier is only based on contextual information fromadjacent saplings. Hence, at adistantleaf node,farfromtherootofthe tree, the measure may consider merging some sapling sense, that is relevant to the leaf, 84 but irrelevant to the tree root. To illustrate, suppose we have the following hierarchy, flower→ rose→ black & white. 
There is a chance that the sapling, black & white →{macro, portrait, landscape} will be judged relevant to the leaf white of the tree, sincetheyshareenoughcommontagssuchasmacro,white,etc. Whendecidingtoattach this sapling to the tree, we could end up with a tree that mixes concepts from “flower” and “portraiture.” Iuseacontinuity measure tocheckwhetherthesenseofthesaplingweareconsidering attaching is relevant to the ancestors of the leaf. Recall that the root node inherits tags from all of its decendents. I examine the tag overlap and do not attach the sapling if it has less than L tags in common with the grandparent node. In addition, I only attach new saplings to leaf nodes which are the result of input from more than one user. 4.1.3.4 Mitigating Noisy Vocabularies As mentioned in Section 3.3, noisy nodes appear from idiosyncratic vocabularies, used by a small number of users. For a certain merged sapling, we can identify these nodes by the number of users who specified them. Specifically, I use 1% of the number of all users who “contribute” to this merged sapling as the threshold. I then remove leaves of the sapling, that are specified by a fewer number of users than the threshold. 4.1.3.5 Managing Complexity Computing the similarity measure for all pairs of saplings in the corpus is impractical, even considering local or structural similarity only. I address this scalability issue in two ways. First, I only compare sapling nodes if they share the same (stemmed) name. This 85 reduces the total number of pairs which need to be compared, and eliminates the need to compute nameSim(,) in Eq. 4.3. Second, I apply the blocking approach (Monge and Elkan,1997)forefficientlycomputingsimilarityandmergingsaplingroots. Thebasicidea behind this approach is to first use a cheap similarity measure to “roughly” group similar items. I can then thoroughly compute item similarities and merge them within each “roughly similar” group by using a more computationally expensive similarity measure. I assume that items judged to be dissimilar by the cheap measure will also be dissimilar when they are evaluated by the more expensive measure. Since the approach applies the expensive measure to a much smaller set of items, it reduces the time complexity of the clustering method. In my case, I use an inexpensive similarity measure based on the most frequent tags. Specifically, I map the top tags to some integer code, which can be cheaply sorted by any database. Subsequently, I use the database to sort saplings by their codes, moving roughly similar saplings to neighboring rows. The process begins by scanning sorted saplings in the database table on a sapling by sapling basis. If the presently scanned sapling has not been merged with some other sapling, I add this sapling to the top of the queue. If the present sapling belongs to some merged sapling, I check if this sapling is also similartosome othermergedsaplingsin thequeue. IuseEq.4.2, Eq.4.3 andEq.4.4 to evaluate their similarity. If they are similar enough, they will be merged together into a new merged sapling; then add it to the top of the queue. The scanning is performed repeatedly until the number of merged saplings no longer changes. 86 4.1.4 Complexity Analysis Here I sketch the computational complexity of sap. Basically, sap can be decomposed into two different parts: (1) root-to-root merging, which expands folksonomies’ width; (2) leaf-to-root merging, which extends folksonomies’ depth. 
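Before analyzing the cost of these two parts, the blocking-based scan of Section 4.1.3.5, which keeps the root-to-root part tractable, can be sketched as follows. The key construction, the window size and the convergence cap are illustrative choices, and the sketch reuses the Sapling and merge_by_root definitions above together with an expensive pairwise check such as Eqs. 4.2-4.4.

```python
def blocking_key(sapling, key_tags=5):
    """Cheap similarity proxy: a sortable key built from the sapling's top tags."""
    return tuple(sorted(t for t, _ in sapling.root_tags.most_common(key_tags)))

def merge_within_blocks(saplings, similar, window=20, max_iter=10):
    """Sort saplings by the cheap key so that roughly similar ones become neighbors,
    then scan the sorted list and merge each sapling into the first recently formed
    merged sapling it matches under the expensive measure. Repeat until the number
    of merged saplings stops changing."""
    merged = list(saplings)
    for _ in range(max_iter):
        merged.sort(key=blocking_key)
        out = []
        for s in merged:
            for i in range(len(out) - 1, max(len(out) - window, -1), -1):
                if similar(s, out[i]):              # expensive check (Eqs. 4.2-4.4)
                    out[i] = merge_by_root(out[i], s)
                    break
            else:
                out.append(s)
        if len(out) == len(merged):                  # converged: no merges in this pass
            break
        merged = out
    return merged
```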
These two parts are loosely dependent, i.e., one can cluster all saplings into different senses; then “vertically” merge the root of one sapling sense with a leaf of the other. Since I use the blocking technique and only cluster saplings with the same stemmed names, the computational complexity dependson (1) the numberof the unique stemmed names in the data set; (2) the average number of saplings that share a name. Let N and M be the number of the nodes and the number of the unique stemmed names in the data set respectively. Hence, for each stem, there are N M nodes to be compared on average. I use database to first roughly sort saplings, which generally requires O( N M log( N M )). After saplings are sorted, they are scanned and merged. This is repeatedly, say in i iterations, until the number of clusters no longer changes, which requires O(i× N M ). In all, the complexity of the first part is O(Nlog( N M )+iN). Empirically,thenumberofclustersconvergesintwotothreeiterations on average. Let b and d be the branching factor and the depth of the tree we want to produce. In addition, suppose that there are s sapling senses for each stemmed name on average. Since we have to traverse each inner node of the tree to attach relevant sapling senses, and for each of these nodes we need to compare the similarity to all sapling senses with similar root names, this requires O(s×b d ). 87 Myearlierwork, sig(PlangprasopchokandLerman,2009), whichisdescribedinmore detailinAppendixB, onlyconsideredthebestpathfromaroottoagiven leafofthetree, and required enumerating all possible paths between them. In the best case, when there are no shortcuts or loops in the data set, the number of paths from the root to all leaves of a given tree is equal to the number of the leaves, and that only requires O(b d +b d−1 ) to check whethereach edge should be included. In the worst case, when shortcuts appear to all node pairs, we would need O( d+1 2 ×b d ) to check all possible edges. Moreover, we also need to enumerate all possible paths for the root to all leaves of the tree, which requiresO(1+ P e=1:d−1 d−1 e ) per root-to-leaf pair. Hence, I expect SAP to scale better than SIG as the depth of the output tree increases and when there are many shortcuts. 4.2 Empirical Validation I constructed a data set containing collections and their constituent sets (or collections) createdbyasubsetofFlickruserswhoaremembersofseventeen publicgroupsdevotedto wildlife and nature photography (Plangprasopchok and Lerman, 2009). These users had many other common interests, such as travel and sports, arts and crafts, and people and portraiture. I extracted all the tags associated with the images in the set, and retrieved all other images that the user annotated with these tags. Iconstructedpersonalhierarchies,orsaplings,fromthisdata,witheachsaplingrooted at one of user’s top-level collections. For the reason described in Section 4.1.1, I ignored collections withcomposite names. Thisreducesthe sizeof thedata setto 20,759 saplings created by 7,121 users. A small number of these saplings are multi-level. 
88 Parameters Description K The number of top frequent tags J The number of common tags for tag similarity α RR The weight combination of local and structural similarity for computing root-to-root similarity α LR The weight combination of local and structural similarity for computing leaf-to-root similarity β Theweightcombinationofnameand tagsimilarity(notrequiredinmyex- periment) τ The similarity threshold Table 4.1: Parameters of the folksonomy learning approach. The folksonomy learning approach described in this chapter has a number of param- eters as shown in Table 4.1. In my experiment, I ignored the parameter β since only sapling nodes with the same name are needed to be compared as described in the previ- oussection. Toexploretherange oftheseparameters, Iset upa smallexperimentbyfirst selecting 5 different seed terms 2 ; then running the approach with different optimal pa- rameter values would enable the approach to reasonably combine/separate saplings with similar/different senses. I manually inspected the induced folksonomies to check how the saplings were merged/separated. The parameter K allows the approach to consider only top frequency tags, which tend to be more stable and less noisy (Golder and Huberman, 2006). Nevertheless, the top tags will not contain enough information if the number is set too low, e.g., K = 10. At the fixed values of the common tag threshold, J = 4, and the structural-local weight combination, α RR = 0.1 (in this case, I simply evaluated on merging root-root nodes; 2 ski, bird, victoria, africa and insect 89 hence there is no need for α LR ), I found that the approach performs reasonably well when the value of K is around 30–60, while the performance starts to degrade for K>60. Smaller values of J lead to a weak tag similarity measure, which, in turn, mistakenly causes the approach to merge saplings with different senses. Large J will be relatively stringent, and as a result, saplings of the same sense will not be merged. I found that, at K=40, the value of J between 4 to 6 allows reasonable results. Forα RR andα LR , the weight combination between local and structural similarity for root-root and leaf-root nodes in Eq. 4.2, the larger the values the more the similarity measure emphasizes on the structural similarity. From my experiments, I found that the structure information is very informative. When α RR is set to a very large value or the maximum, 1.0, the approach clusters “structure-rich” saplings, i.e., saplings containing manychildren, reasonably well. For leaf-to-root merging orin situations wherestructural information is uncommon, local similarity becomes more important. I discovered that at α RR =0.1andα LR =0.8, theapproach producesreasonable folksonomies. Here, Ireport the parameter values that resulted in good performance: I set K=40; J=4. In addition, since all similarity measures are normalized to range within 0.0 and 1.0, I set τ =0.5. 4.2.1 Baseline Approach I compare sap against the folksonomy learning method, sig, described in Appendix B and in Plangprasopchok and Lerman (2009). Briefly, sig first breaks a given sapling into (collection-set) individual parent-child relations. With the assumption that the nodes with the same (stemmed) name refer to the same concept, the approach employs hypoth- esis testing to identify the informative relations, i.e., checking if the relation is statistical 90 significant. The informative relations are then linked into a deeper folksonomy. I used a significance test threshold of 0.01. 
4.2.2 Methodology I quantitatively evaluate the induced folksonomies by (1) automatically comparing them to a reference hierarchy; (2) structural evaluation; (3) manual evaluation. 4.2.2.1 Evaluation against the reference hierarchy I use the reference hierarchy from the Open Directory Project (ODP). 3 I chose ODP because, in contrast to WordNet, ODP is generated, reviewed and revised by many reg- istered users. These users seem to use more colloquial terms than appear in WordNet. In addition, like Flickr users, they specify less formal relations, mainly broader/narrower relations. WordNet, on the other hand, specifies a number of formal relations among concepts, including hypernymy and meronymy. I use the following methodology to automatically evaluate the quality of the learned folksonomies. Although ODP and saplings are generated from different sources, there is substantial vocabulary overlap that makes them comparable. Since the ODP hierarchy is relatively large and composed of many topics, I carved out the “relevant” portion for comparison. First, I specified a seed, S, which is the root of the learned folksonomy F and the reference hierarchy to which it is compared. Next, the folksonomy is expanded two levels along the relations in F. The nodes in the second level are added as leaf candidates,LC. If the spanning stops after one level, I 3 http://rdf.dmoz.org/, as of September 2008 91 also add this node’s name toLC. GivenS andLC, I identify leaf candidates,LCD, that also appear in ODP,D. All paths fromS toLCD inD constitute the reference hierarchy for the seed S. Subsequently, S is used as the seed for learning the folksonomy associated with this concept. In sig, S and LC are both used to learn the folksonomy. The maximum depth of learned trees is limited to four. The metrics to compare the learned folksonomies to the reference are Lexical Recall (Maedche and Staab, 2002) and the modified Taxonomic Overlap,mTO(PlangprasopchokandLerman,2009), whichweredescribedinSection3.4. Recall that Lexical Recall measures the overlap between the learned and reference tax- onomies, independent of their structure. mTO measures the quality of structural align- ment of the taxonomies. Here,IreporttheharmonicmeanofmTOinstead,becausemTOisasymmetric. Since the proposed approach generates bushyfolksonomies whose leaf nodes may not appear in the reference taxonomy, the mTO metric may unfairly penalize the learned folksonomy. Instead, I only consider the paths of the learned folksonomy that are comparable to the reference hierarchy. Specifically, for each leaf l in LCD, I select the path S→ l in the learned folksonomy and compare it to one in the reference hierarchy. If there are many comparable paths existing in the reference, I select the one that has the highest LR to compare. 92 4.2.2.2 Structural evaluation Ideally, wepreferan approachthatgenerates bushieranddeepertrees. Inthiswork, Iuse a simple, yet intuitive measure, Area Under Tree (AUT), which takes both tree bushiness and depth into account. AUT was described in detail in Section 3.4. 4.2.2.3 Manual evaluation I set up three human subjects to evaluate portions of induced folksonomies which were not comparable to ODP hierarchy. I randomly selected 10% of the paths (all of them if there are fewer than 10 paths in the learned folksonomy) that are not in the reference hierarchy and asked three judges to evaluate them. 
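The carving of the reference hierarchy described above can be sketched as follows. The sketch assumes the ODP dump has already been parsed into a child-to-parent map of stemmed concept names (i.e., each concept has a single parent) and that the learned folksonomy is available as a parent-to-children map; both assumptions, and the helper names, are simplifications for illustration.

```python
def two_level_leaf_candidates(seed, children):
    """LC: nodes two levels below the seed in the learned folksonomy, or the
    first-level node itself if the expansion stops after one level."""
    lc = set()
    for n in children.get(seed, []):
        grand = children.get(n, [])
        lc.update(grand if grand else [n])
    return lc

def reference_paths(seed, leaf_candidates, odp_parent):
    """Paths in ODP from the seed to every leaf candidate that occurs in ODP (LCD);
    these paths constitute the reference hierarchy for the seed."""
    paths = {}
    for leaf in leaf_candidates & set(odp_parent):
        path, node = [leaf], leaf
        while node in odp_parent and node != seed:
            node = odp_parent[node]
            path.append(node)
        if node == seed:
            paths[leaf] = list(reversed(path))
    return paths
```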
If a portion of the path is incorrect, either because an incorrect concept appears or the ordering of concepts is wrong, the judges were asked to mark it incorrect, otherwise it is correct. They can also mark the path “unsure” if there is not enough evidence for a decision. A path’s label is based on the majority decision. If there is no agreement, or the path is marked uncertain by all judges, I exclude it. 4.2.3 Results InTable4.2,Icomparethequalityofthefolksonomylearnedforeachseedbysap,andthe earlier work, sig. sap generally recovers a larger number of concepts, relative to ODP, as indicated bythenumbersofoverlapping leaves (in 90%ofthe cases) andbetterLRscores (in 76%ofthecases). Moreover, sapcan producetrees withhigherquality, relative to the ODP, as indicated by mTO score (in 68% of the cases). From the structural evaluation, sap produced bushier trees as indicated by AUT in 87% of the cases. The average depth 93 Whole folksonomies Comparison with ODP Manual #leaves AUT #ovlp lvs mTO LR AUT Acc (10%) seeds sig sap sig sap sig sap sig sap sig sap sig sap sig sap anim 268 583 694.0 1076.0 68 92 0.602 0.659 0.281 0.360160.0 189.50.89 0.74 bird 73 103 84.5 113.5 20 220.760 0.755 0.281 0.315 21.5 28.5 0.60 1.00 invertebr 11 15 15.5 19.5 3 1 0.762 1.0000.250 0.125 4.5 1.5 1.00 1.00 vertebr 80 114 162.5 236.5 1 0 1.000 n/a0.600 0.200 2.5 n/a 1.00 1.00 insect 29 44 35.5 61.5 5 5 0.924 0.924 0.857 0.857 6.5 6.5 1.00 1.00 fish 7 6 7.5 6.5 0 0 n/a n/a 0.016 0.016 n/a n/a 1.00 1.00 plant 110 194 265.5 426.0 6 7 0.613 0.735 0.250 0.273 13.0 11.5 0.67 1.00 flora 64 403 173.0 1048.5 6 180.483 0.481 0.130 0.407 16.0 84.0 1.00 1.00 fauna 141 609 420.0 1146.0 9 31 0.463 0.490 0.113 0.212 27.0 71.50.91 0.85 flower 112 169 210.5 226.5 1 1 0.379 1.0000.267 0.250 3.5 1.5 1.00 n/a reptil 3 4 4.5 4.5 2 30.625 0.622 0.500 0.667 2.5 3.5 n/a n/a amphibian 1 1 1.5 1.5 1 1 1.000 1.000 1.000 1.000 1.5 1.5 n/a n/a build 7 23 11.5 37.5 0 0 n/a n/a 1.000 1.000 n/a n/a 1.00 1.00 urban 6 80 15.0 145.5 0 0 n/a n/a 0.071 0.071 n/a n/a 1.00 1.00 countri 3781605 798.5 4504.0 2 4 0.447 0.665 0.143 0.214 8.0 8.5 1.00 1.00 africa 53 71 90.5 119.5 23 27 0.773 0.895 0.508 0.547 37.5 40.5 1.00 1.00 asia 187 284 389.0 631.5 80 85 0.734 0.788 0.396 0.484165.5 168.5 1.00 1.00 europ 3791073 916.0 2706.5165 301 0.619 0.670 0.236 0.418369.0 874.51.00 0.94 south africa 12 17 15.5 18.5 3 3 0.431 0.600 0.444 0.444 3.5 3.5 0.78 1.00 north america 166 731 435.0 2203.5 67 118 0.545 0.576 0.165 0.319170.5 361.51.00 0.92 south america 32 50 54.5 101.5 12 15 0.706 0.832 0.415 0.463 20.5 28.5 1.00 1.00 central america 27 8 53.5 12.5 1 2 0.631 0.754 0.417 0.500 2.5 4.5 1.00 1.00 unit kingdom 106 267 274.5 658.5 31 820.787 0.724 0.099 0.127 71.5 179.5 1.00 1.00 unit state 102 375 217.0 936.5 35 55 0.620 0.749 0.130 0.256 74.5 122.0 1.00 1.00 world 54531771437.0 9235.0191 4750.476 0.461 0.085 0.215490.0 1676.50.97 0.96 citi 123 448 234.0 927.5 0 0 n/a n/a0.111 0.100 n/a 2.5 1.00 1.00 craft 5 1 10.5 1.5 1 0 0.603 n/a0.056 0.050 2.5 n/a 1.00 n/a dog 15 26 17.5 28.5 0 1 n/a 1.000 0.045 0.080 n/a 1.5 n/a n/a cat 11 39 13.5 41.5 0 0 n/a n/a 0.100 0.100 n/a n/a n/a n/a sport 207 74 407.0 86.5 19 270.693 0.647 0.091 0.084 30.0 31.5 0.28 1.00 australia 47 83 71.0 147.5 12 27 0.354 0.665 0.123 0.216 14.5 36.5 0.67 1.00 canada 55 763 128.0 2502.0 11 270.620 0.587 0.158 0.241 21.5 75.5 1.00 1.00 Table 4.2: This table presents empirical validation on folksonomies induced by the pro- posed approach, sap, comparing to the baseline approach, sig. 
The first column group presents properties of the whole induced trees: the number of leaves and Area Under Tree (AUT). The second column group reports the quality of induced trees, relatively to the ODP hierarchy. The metrics in this group are modified Taxonomic Overlap (mTO) (averaged usingharmonicmean), Lexical Recall (LR),wheretheirscalesarerangingfrom 0.0 to 1.0 (the more the better), as AUT is computed from portions of the trees, which are comparable to ODP. “#ovlp lvs” stands for a number of overlap leaves (to ODP). The last column group reports performance on manually labeled portions of the trees, which do not occur in ODP. 94 ! Figure 4.4: Comparison of performance of SAP and baseline (SIG) approaches. Bars report the numbers of the cases that one approach outperformed the other on various metrics, which are summarized from Table 4.2. The higher the better. Note that “mTO &LR&AUT” is computed fromthe intersection ofsuperiorcases in mTO,LRand AUT metrics. (not shown in the Table) from roots to all leaves of the trees over all cases generated by sap is deeper than sig (2.68 vs. 2.37). Moreover, when considering the intersection of the superior cases in TO, LR and AUT metrics, sap clearly outclasses sig as shown in Figure 4.4 (13 vs. 0 cases). Although the manual evaluation suggests that both approaches can induce about the same quality on the paths that are uncomparable to ODP, after closely inspecting the learned trees, I foundthat sap demonstrates its advantage over sig in disambiguating andcorrectly attaching relevant saplingstoappropriateinducedtrees. For instance, bird tree produced by sap does not includes Istanbul or other Turkey locations, as shown in Figure 4.5. Inthesporttree,sapdoesnotincludeanyconceptaboutthesky(notethatskiesand skiingsharecommon stemmed name). Inaddition, there arenoconceptsaboutirrelevant events like birthdays and parades appearing in the tree. There are some cases, e.g., dog 95 Figure 4.5: Folksonomies learned for bird and sport 96 Approach Incorrect Path sap anim/other anim/mara sap world/landscap/architectur/scarborough sap world/scotland/through viewfind sap europ/franc/flight to sig anim/pet/chester/chester zoo sig bird/turkei/antalya sig bird/turkei/ephesu sig fauna/underwat/destin sig south africa/safari/isla paulino sig south africa/safari/la flore sig sport/golf/adamst sig sport/ski/cloud/other/new year sig world/canada/victoria/melbourn Table 4.3: The table lists all incorrect paths caused by possibly ambiguous nodes, which are in bold. Note that all node names are stemmed. and cat, where I could not compute the hand labeling scores because these trees often contained pet names, rather than breeds. Ifurtherconsideredhowmanyoftheincorrectpathsarecausedbynodeambiguity. To do so, I first identified ambiguous terms, and checked to see how many of the incorrect paths contain these terms. Although it is not obvious how to automatically identify ambiguous terms, I use the following heuristic to determine the possible ambiguities: for a given leaf of the induced tree, if many different merged senses exist (i.e., > 10), then I consider the leaf ambiguous. During the tree induction process, I keep track of these nodes and the root. Subsequently, I use the ambiguous terms and their root names to check the accuracy of paths in the hand labeled data containing them. As presented in Table 4.3, there is about a half reduction in error for ambiguous paths using sap. This supports my claim about superiority of sap on node disambiguation. 
Inall,theproposedapproach,sap,hasseveraladvantagesoverthebaseline,sig. First, it exploits both structure information and tag statistics to combine relevant saplings, 97 which can produce more comprehensive folksonomies as well as resolve ambiguity of the concept names. Second, it allows similar concepts to appear multiple times within the same hierarchy. For example, sap allows the anim folksonomy to have both anim→ pet → catand anim→ mammal→ catpaths, whileonlyone ofthese pathsisretained by sig. Lastly, sap can identify synonyms from structure (loops). The following synonyms are learned from Flickr data:{anim, creatur, critter, all anim, wildlife}and{insect, bug}. 4.3 Conclusion I have presented an incremental approach to learn communal hierarchies by integrating many personal hierarchies. The experimental results demonstrate that the approach re- sponds to all challenges I mentioned earlier in Chapter 3. Additionally, it is superior to the baseline approach sig in terms of the quality of the induced hierarchies and compu- tational efficiency. 98 Chapter 5 A Structure Learning Approach to Folksonomy Learning In this chapter, I frame the folksonomy learning as a general structure learning problem, where we would like to learn common complex structures from many smaller ones. I develop a probabilistic framework for learning complex structures with a specific form fromstructureddatabyexploiting structuralinformation. Theapproach extendsAffinity Propagation (AP) (Frey and Dueck, 2007) and allows it to use structural information to guide the inference process to combine data concurrently into more complex structures with a desired form. I examine two strategies for introducing structural information into affinity propagation: through similarity function and through constraints. The approach is evaluated on the folksonomy learning task as in Chapter 4. 5.1 Structure Learning by Integrating Many Small Structures Learningstructurefromdata hasemerged asan importantprobleminmanydomains, for example, learning gene networks from microarray data (Chen et al., 2006) and structure 99 of probabilistic networks from knowledgebases (Heckerman et al., 1995; Kok and Domin- gos, 2005). However, few methods focus on learning complex structures from data that may already be explicitly structured. Such structured data is ubiquitous in several do- mains. Examplesincludeauthorswithco-authorshiprelations; eventsandtheircausality; concepts and their hypernyms. In general, this type of data is composed of entities and binary relations between pairs of them. Learningcomplex structuresfromcollections ofmany smaller, simplerstructuresmay provide insights into data that individual simple structures cannot provide. To infer complexstructuresoneneedsmachinerytomanipulateandcombineindividualstructures. For example, in order to construct a community graph of authors of scientific papers, one must first identify individual entities appearing among author names in co-authorship network(Bhattacharya andGetoor,2007), thenaggregaterelationsbetweentheidentified entities. To learn complex structures with a specific form such as a tree or a directed acyclic graph (DAG), the integration method must have extra machinery to avoid structural inconsistencies. That is, the integrated structure must be self-consistent. Inconsistencies are likely to appearwhen structureddata is combined arbitrarily. Thetask becomeseven more challenging when data comes from numerous heterogeneous sources. 
Such data is inherently noisy and inconsistent, and there is certainly no single, unified structure to be found that explains all the data. One instance of such a task is the folksonomy learning I discussed earlier in this thesis. Infolksonomy learning, theinput, structuredannotations in theformofhierarchies of conceptual terms created by many users, is combined into a global hierarchy of concepts 100 that reflects how a community organizes knowledge. Users who create personal hierar- chies, or what I also call saplings to organize content may use idiosyncratic categoriza- tion schemes (Golder and Huberman, 2006) and naming conventions. Simply combining nodes with similar names is very likely to lead to ill-structured graphs containing loops and shortcuts (multiple paths from one node to another), rather than a tree. Although SAP presented in the Chapter 4 addressed this problem, its strategy was to incremen- tally learn a folksonomy in a top-to-bottom manner. In particular, the bottom parts of a learned folksonomy will only be considered once the structure at the top is completedly learned and fixed. Consequently, a small portion of the structure is determined at a time, which is likely to create non-optimal structures. Moreover, that method also relied on many heuristics and an extra procedure to remove structural inconsistencies in the learned folksonomy. 5.1.1 Motivating Example To learn folksonomy by aggregating personal hierarchies, one needs a strategy that mea- sures the degree to which two sapling nodes are similar, and therefore, should be merged. Suppose that we have a very simple aggregation strategy that says two nodes are similar if they have similar names, as in the prior work (Plangprasopchok and Lerman, 2009). From Figure 5.1 (b), we will end up with a graph containing one loop and two paths from animal to bird, rather than the tree shown in Figure 5.1 (a). Suppose that we can also access tags with which users annotated photos within saplings, and that photos of the bird nodes have tags like “pet” and “domestic” in common, while photos belonging to the other ‘bird’ node have tags like “wildlife” and “forest” in common. A cleverer 101 Animal Wildlife Pet Mammal Bird Bird Corvid Pigeon Quail Wren Hawk Animal Wildlife Pet Mammal Bird Bird Corvid Pigeon Quail Wren Hawk Animal Wildlife Pet Animal Pet Wildlife Animal Bird Wildlife Mammal Wildlife Bird Pet Bird Animal Animal Animal Animal Animal Bird Corvid Pigeon Bird Quail Wren Bird Hawk Bird Pet Animal Wildlife Pet Animal Pet Wildlife Animal Bird Wildlife Mammal Wildlife Bird Pet Bird Animal Animal Animal Animal Animal Bird Corvid Pigeon Bird Quail Wren Bird Hawk Bird Pet (a) (b) Figure 5.1: Illustrative examples of (a) a commonly shared conceptual categorization (hi- erarchy) system; (b) personal hierarchies expressed by the users based on the conceptual categorization in (a). For illustrative purposes, nodes with similar names have similar color. similarity function that, in addition to node names, takes tag statistics within a node into consideration, should split bird nodes into two different groups: pet bird and wild bird, which are put under pet and wildlife nodes respectively. The similarity function plays a crucial role in sapling integration process, and a so- phisticated enough similarity function that can differentiate node senses may potentially correctly integrate the final tree. However, finding and tuning such function is very diffi- cult. 
Moreover, data is often inconsistent, noisy and incomplete, especially on the social Web, where data is generated by many different users. One possible way to tackle this challenge is to use a simple similarity function and incorporate constraints during the merging process. Intuitively, we would not consider merging the bird node under pet with the one under wildlife because it would result in multiple paths from animal. Specifically, we can impose constraints that will prevent two nodes from being merged if:

1. this will lead to links from different parent concepts, or
2. this will lead to an incoming link to the root node of a tree.

These constraints guarantee that there is, at most, a single path from one node to another.

5.2 Probabilistic Integration of Structured Data

A key component of folksonomy learning through sapling integration is merging similar nodes in different saplings. Merging similar root nodes expands the width of the learned tree, while merging the leaf of one sapling to the root of another extends the depth of the learned tree. The merging process has two key sub-components: (1) a similarity function that evaluates how similar a pair of nodes is; (2) a procedure that decides if two nodes should or should not be merged, based on their similarity. In this section I describe the probabilistic framework for distributed inference, and then investigate in detail alternative ways to introduce structural information into the inference process in order to learn deep, bushy trees from many smaller, shallow trees.

5.2.1 Affinity Propagation

As described in the previous section, we need an inference procedure to merge or separate terms, and to exploit structural information to guide the clustering so that the integrated data takes a specific form, or a tree in this context. Affinity Propagation (AP) (Frey and Dueck, 2007), a powerful clustering algorithm, offers a natural way to incorporate structural information. In this subsection, I review AP.

AP is a clustering algorithm that identifies a set of exemplar points that well represent all the points in the data set. The exemplars emerge as messages are passed between data points, with each point assigned to an exemplar. AP tries to find the exemplar set which maximizes the net similarity, or the overall sum of similarities between all exemplars and their data points.

Here I describe AP in terms of a factor graph (Kschischang et al., 2001) on binary variables, which was recently introduced by Givoni and Frey (2009). The model is comprised of a square matrix of binary variables, along with a set of factor nodes imposed on each row and column of the matrix. Following the notation defined in the original work, let c_{ij} be a binary variable: c_{ij} = 1 indicates that node i belongs to node j (or, j is an exemplar of i); otherwise, c_{ij} = 0. Let N be the number of data points; consequently, the size of the matrix is N x N.

To cluster points, one must enforce cluster consistency, i.e., a point can belong to only one cluster. In other words, it can have only one exemplar in this context. There are two types of constraints that enforce this cluster consistency. The first type, I_i, which is imposed on row i, indicates that a data point can belong to only one exemplar (\sum_j c_{ij} = 1). The second type, E_j, which is imposed on column j, indicates that if a point other than j chooses j as its exemplar, then j must be its own exemplar (c_{jj} = 1). AP avoids forming exemplars and assigning cluster memberships that violate these constraints. In particular, if the configuration at row i violates the I constraint, I_i becomes -\infty, so that configuration can never be optimal (and similarly for E_j). These constraints can be formally defined as follows:

I_i(c_{i1}, \dots, c_{iN}) = \begin{cases} -\infty & \text{if } \sum_j c_{ij} \neq 1, \\ 0 & \text{otherwise.} \end{cases}   (5.1)

E_j(c_{1j}, \dots, c_{Nj}) = \begin{cases} -\infty & \text{if } c_{jj} = 0 \text{ and } \exists i \neq j \text{ s.t. } c_{ij} = 1, \\ 0 & \text{otherwise.} \end{cases}   (5.2)
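As a concrete reading of these two factors, the short sketch below checks whether a given binary assignment matrix satisfies the I and E constraints. The helper names are hypothetical and only mirror Eqs. 5.1 and 5.2; they are not part of the message-passing machinery itself.

import numpy as np

def violates_I(c):
    # I_i: every row must select exactly one exemplar (sum_j c_ij == 1).
    return np.any(c.sum(axis=1) != 1)

def violates_E(c):
    # E_j: if any point i != j picks j as its exemplar, then c_jj must be 1.
    col_has_member = (c.sum(axis=0) - c.diagonal()) > 0
    return np.any(col_has_member & (c.diagonal() == 0))

A configuration for which either function returns True corresponds to a factor value of -infinity in Eqs. 5.1-5.2 and therefore contributes -infinity to the objective below.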
In addition to the constraints, a similarity function S(.) indicates how similar a certain node is to its exemplar. If c_{ij} = 1, then S(c_{ij}) is the similarity between nodes i and j; otherwise, S(c_{ij}) = 0. S(c_{jj}) evaluates the "self-similarity," also called the "preference," which should be less than the maximum similarity value in order to avoid all singleton points becoming exemplars, since that configuration would yield the highest net similarity. In general, the higher the value of the preference for a particular point, the more likely that point will become an exemplar. In addition, we can set the same self-similarity value for all data points, which indicates that all points are equally likely to become exemplars.

A graphical model of Affinity Propagation is depicted in Figure 5.2 in terms of a factor graph. In the log domain, the global objective function, which measures how good the present configuration (a set of exemplars and cluster assignments) is, can be written as a summation of all local factors as follows:

S(c_{11}, \dots, c_{NN}) = \sum_{i,j} S_{ij}(c_{ij}) + \sum_i I_i(c_{i1}, \dots, c_{iN}) + \sum_j E_j(c_{1j}, \dots, c_{Nj}).   (5.3)

Figure 5.2: The original binary variable model for Affinity Propagation proposed by Givoni and Frey (2009): (a) a matrix of binary hidden variables (circles) and their factors (boxes); (b) incoming and outgoing messages of a hidden variable node from/to its associated factor nodes.

That is, optimizing this objective function means finding the configuration that maximizes the net similarity S while not violating the I and E constraints. The original work uses the max-sum algorithm (Kschischang et al., 2001) to optimize this global objective function, and it requires updating and passing five messages for each hidden variable node, as shown in Figure 5.2 (b). Since each hidden node c_{ij} is a binary variable (two possible values), one can pass a scalar message, the difference between the two possible values of a message when c_{ij} = 1 and c_{ij} = 0, instead of carrying two values at a time. Since these two possible values must be normalized (their summation equals a certain constant), it is straightforward to recover them from a scalar message. In fact, recovering these values is not of main concern, because we can still recover an estimated MAP state of any hidden variable by summing all values of its incoming scalar messages. The merged message value is simply a (log) ratio between the two states of c_{ij} (1 or 0); if the value is greater than 0, the state of c_{ij} is more likely to be 1 than 0.

The derivations of the message update equations are originally described in Section 2 of the original work (Givoni and Frey, 2009).
To make the thesis self-contained, and because these equations will be re-used in my extension, I nevertheless briefly describe their derivations in what follows.

Consider the binary AP factor graph in Figure 5.2 (a) and (b) and its objective function, Eq. 5.3. We can apply the max-sum algorithm (Kschischang et al., 2001) to find the estimated MAP assignments. To begin, let us consider Eq. 5.3 as a pseudo joint probability distribution p(c) over all hidden variables,(1) where c = {c_{11}, c_{12}, \dots, c_{NN}}. The max-sum algorithm provides an efficient way to find the best setting, c*, that yields the highest p(c) through local computations, which involve two types of messages: (1) those that flow from a variable node to a factor node, \mu_{x \to f}(x); (2) those that flow from a factor node to a variable node, \mu_{f \to x}(x). For a factor graph having no loop, i.e., a tree or a chain, messages are computed from a root node to the leaf nodes; then, messages are computed and sent from the leaves back to the root. On the way back, each node c_{ij} is set to the value that yields the highest p(c) by finding the setting that maximizes the summation of all incoming messages, \sum_{f \in ne(x)} \mu_{f \to x}(x), where ne(x) are all factor nodes adjacent to the variable node x. The best value of each c_{ij} is stored as a part of c* while messages are passed back to the root. This procedure is known as back-tracking in the Viterbi algorithm (Bishop, 2006).

(1) This joint distribution p(c) can be viewed as a multiplication of all factors over all variables in Figure 5.2(a), as log p(c) is simply equivalent to S(c) in Eq. 5.3.

The factor graph of AP, however, contains loops, as we can easily perceive in the simple case illustrated in Figure 5.3. In fact, a loop exists for every possible rectangle in the hidden variable matrix.

Figure 5.3: Binary AP factor graph with 4 hidden variables: (a) an original view; (b) an unfolded view of (a), in which a loop can obviously be observed.

In most cases, the max-sum algorithm will no longer yield exact solutions as messages flow many times along these loops. However, this loopy inference, so-called loopy belief propagation (Frey and MacKay, 1998), has empirically been shown to provide good approximate results (Murphy et al., 1999; Yedidia et al., 2005; Frey and Dueck, 2007). In this loopy inference, one can generally start at any node and propagate messages back and forth between nodes sequentially (serial schedules), or concurrently update incoming/outgoing messages (flooding schedules) (Bishop, 2006). One can determine the termination of the inference from the convergence of p(c), or of the net similarity S(c_{11}, \dots, c_{NN}) in my case.

According to Bishop (2006; chap. 8), the two message types are formally defined as follows:

\mu_{x \to f}(x) = \sum_{\{l | f_l \in ne(x) \setminus f\}} \mu_{f_l \to x}(x)   (5.4)

\mu_{f \to x}(x) = \max_{x_1, \dots, x_M} \Big[ f(x, x_1, \dots, x_M) + \sum_{\{m | x_m \in ne(f) \setminus x\}} \mu_{x_m \to f}(x_m) \Big].   (5.5)

5.2.1.1 I-Constraint

Now I begin describing the message update formulas for the I_i constraint. For each message, there are two possible values corresponding to the possible states m \in \{0,1\} of its corresponding variable node.
For the incoming message of this constraint, \mu_{c_{ij} \to I_i}(m), or \beta_{ij}(m) for short, the message values can be computed as follows:

\beta_{ij}(1) = \sum_{\{l | f_l \in ne(c_{ij}) \setminus I_i\}} \mu_{f_l \to c_{ij}}(1) = S_{ij}(1) + \alpha_{ij}(1);   (5.6)

\beta_{ij}(0) = \sum_{\{l | f_l \in ne(c_{ij}) \setminus I_i\}} \mu_{f_l \to c_{ij}}(0) = S_{ij}(0) + \alpha_{ij}(0),   (5.7)

where \alpha_{ij}(m) is short for \mu_{E_j \to c_{ij}}(m). The message difference, \beta_{ij}, is simply:

\beta_{ij} = \beta_{ij}(1) - \beta_{ij}(0) = S(i,j) + \alpha_{ij},   (5.8)

where S(i,j) = S_{ij}(1) - S_{ij}(0), or simply the similarity between nodes i and j.

For the outgoing message of the I constraint, \mu_{I_i \to c_{ij}}(m), or \eta_{ij} for short, we again have to consider two possible cases. For each of them, we have to find the optimal configuration of c_{ik}, where k = 1:N, k \neq j. A configuration that violates the constraint I_i will never be chosen because I_i(\cdots) = -\infty, which can never be optimal. For the case c_{ij} = 1, the message update formula is as follows:

\eta_{ij}(1) = \mu_{I_i \to c_{ij}}(1) = \max_{c_{ik}, k \neq j} \Big[ I_i(c_{i1}, \dots, c_{ij} = 1, \dots, c_{iN}) + \sum_{\{t | c_{it} \in ne(I_i) \setminus c_{ij}\}} \mu_{c_{it} \to I_i}(c_{it}) \Big] = \sum_{t \neq j} \beta_{it}(0).   (5.9)

The above equation holds because c_{ij} = 1, which forces the other nodes in the row to become 0. For the case c_{ij} = 0, the message update formula is:

\eta_{ij}(0) = \mu_{I_i \to c_{ij}}(0) = \max_{c_{ik}, k \neq j} \Big[ I_i(c_{i1}, \dots, c_{ij} = 0, \dots, c_{iN}) + \sum_{\{t | c_{it} \in ne(I_i) \setminus c_{ij}\}} \mu_{c_{it} \to I_i}(c_{it}) \Big] = \max_{k \neq j} \Big[ \beta_{ik}(1) + \sum_{t \neq k,j} \beta_{it}(0) \Big].   (5.10)

The above equation holds because point i must belong to exactly one exemplar, which also forces the other nodes in row i to become 0. The message difference, \eta_{ij}, is then:

\eta_{ij} = \eta_{ij}(1) - \eta_{ij}(0) = -\max_{k \neq j} [\beta_{ik}(1) - \beta_{ik}(0)] = -\max_{k \neq j} \beta_{ik}.   (5.11)

5.2.1.2 E-Constraint

Now I derive the message difference for E_j's incoming message, \mu_{c_{ij} \to E_j}(m), or \rho_{ij}(m) for short. Its message values can be computed as in the following equations:

\rho_{ij}(1) = \sum_{\{l | f_l \in ne(c_{ij}) \setminus E_j\}} \mu_{f_l \to c_{ij}}(1) = S_{ij}(1) + \eta_{ij}(1);   (5.12)

\rho_{ij}(0) = \sum_{\{l | f_l \in ne(c_{ij}) \setminus E_j\}} \mu_{f_l \to c_{ij}}(0) = S_{ij}(0) + \eta_{ij}(0).   (5.13)

The message difference \rho_{ij} is simply:

\rho_{ij} = \rho_{ij}(1) - \rho_{ij}(0) = S(i,j) + \eta_{ij}.   (5.14)

For the outgoing message of the E constraint, \mu_{E_j \to c_{ij}}(m), or \alpha_{ij} for short, we need to consider two sub-cases: i = j and i \neq j. This is because the optimal configurations for these two conditions are different. Clearly, if c_{ij} = 1 and i \neq j, this forces c_{jj} to become 1 in order to meet the E_j constraint. Moreover, if c_{ij} = 0 and i = j, all other c_{kj}, where k \neq j, must be 0 to meet the E_j constraint as well.

I first derive the message update formula of \alpha_{ij} for the case i = j (or simply \alpha_{jj}). When c_{jj} = 1, the formula is:

\alpha_{jj}(1) = \mu_{E_j \to c_{jj}}(1) = \max_{c_{kj}, k \neq j} \Big[ E_j(c_{1j}, \dots, c_{jj} = 1, \dots, c_{Nj}) + \sum_{\{t | c_{tj} \in ne(E_j) \setminus c_{jj}\}} \mu_{c_{tj} \to E_j}(c_{tj}) \Big] = \sum_{t \neq j} \max_{m \in \{0,1\}} \rho_{tj}(m).   (5.15)

Specifically, we have full flexibility to choose the configuration of the other nodes c_{tj} (t \neq j) that maximizes the above function. When c_{jj} = 0, to meet the E_j constraint, we now have the restriction that no other node can be assigned to 1. Therefore, the message update formula is simply:

\alpha_{jj}(0) = \sum_{t \neq j} \rho_{tj}(0),   (5.16)

which yields the message difference formula (using Eqs. 5.15 and 5.16)

\alpha_{jj} = \alpha_{jj}(1) - \alpha_{jj}(0) = \sum_{t \neq j} \max[0, \rho_{tj}].   (5.17)

For the case i \neq j, if c_{ij} = 1, the message update formula is as follows:

\alpha_{ij}(1) = \mu_{E_j \to c_{ij}}(1) = \max_{c_{kj}, k \neq j} \Big[ E_j(c_{1j}, \dots, c_{ij} = 1, \dots, c_{Nj}) + \sum_{\{t | c_{tj} \in ne(E_j) \setminus c_{ij}\}} \mu_{c_{tj} \to E_j}(c_{tj}) \Big] = \rho_{jj}(1) + \sum_{k \notin \{i,j\}} \max_{m \in \{0,1\}} \rho_{kj}(m).   (5.18)
The above equation holds because if c_{ij} = 1, then c_{jj} must be 1; as for the other nodes c_{kj}, we have full flexibility to choose the configuration that maximizes the message.

Lastly, for the case i \neq j and c_{ij} = 0, the message update equation is simply:

\alpha_{ij}(0) = \mu_{E_j \to c_{ij}}(0) = \max_{c_{kj}, k \neq j} \Big[ E_j(c_{1j}, \dots, c_{ij} = 0, \dots, c_{Nj}) + \sum_{\{t | c_{tj} \in ne(E_j) \setminus c_{ij}\}} \mu_{c_{tj} \to E_j}(c_{tj}) \Big] = \max\Big[ \rho_{jj}(1) + \sum_{k \notin \{i,j\}} \max_{m \in \{0,1\}} \rho_{kj}(m), \; \sum_{k \neq i} \rho_{kj}(0) \Big].   (5.19)

From Eq. 5.18 and Eq. 5.19, the message difference \alpha_{ij} when i \neq j is:

\alpha_{ij} = \min\Big[ 0, \; \rho_{jj}(1) + \sum_{k \notin \{i,j\}} \max_{m \in \{0,1\}} \rho_{kj}(m) - \sum_{k \neq i} \rho_{kj}(0) \Big] = \min\Big[ 0, \; \rho_{jj} + \sum_{k \notin \{i,j\}} \max[0, \rho_{kj}] \Big].   (5.20)

In all, the message update equations are:

\beta_{ij} = S(i,j) + \alpha_{ij}   (5.21)
\eta_{ij} = -\max_{k \neq j} \beta_{ik}   (5.22)
\rho_{ij} = S(i,j) + \eta_{ij}   (5.23)
\alpha_{ij} = \begin{cases} \sum_{t \neq j} \max[0, \rho_{tj}] & i = j \\ \min[0, \; \rho_{jj} + \sum_{k \notin \{i,j\}} \max[0, \rho_{kj}]] & i \neq j \end{cases}   (5.24)

The inference procedure starts by first setting all message values to 0, and then keeps updating all messages according to the above formulas iteratively until the process converges. Once the inference process terminates, the MAP configuration (exemplars and their members) can be recovered as follows. First, I identify the exemplar set by considering the sum of all incoming messages of each c_{jj} (each node on the diagonal of the variable matrix). If the sum is greater than 0 (there is a higher probability that node j is an exemplar), j is an exemplar. Once the set of exemplars K is recovered, each non-exemplar point i is assigned to the exemplar k for which the sum of all incoming messages of c_{ik} is the highest compared to the other exemplars.

5.2.2 Expressing Structure through Similarity

Following the incremental approach described in Chapter 4, I define a similarity measure between nodes in different saplings, which also exploits the heterogeneous evidence available in the structure of the input data. To recap, the similarity function is a combination of local similarity and structural similarity. The local similarity between two nodes i and j, localSim(i,j) as defined in Eq. 4.3, is based on the intrinsic features of i and j, such as tag distributions. The structural similarity, structSim(i,j), is based on features of neighboring nodes. If i is the root of a certain sapling, its neighboring nodes are all of its children. If i is a leaf node, the neighboring nodes are its parent and siblings. Here, I define three versions of structSim(,): structSimRR(,), which computes structural similarity between two root nodes (root-to-root similarity); structSimLL(,), which evaluates structural similarity between a leaf of one sapling and that of another; and structSimLR(,), which evaluates structural similarity between a root of one sapling and the leaf of another (leaf-to-root similarity). The formulas for these similarity functions are defined as in Section 4.1.2.2.

5.2.2.1 Structural Similarity with Cluster Labels

The structural similarity mentioned above does not take the cluster (or concept) of a term into account. For a given pair of terms, we can use the cluster labels of their neighboring terms to help decide whether or not they should belong to the same cluster. Intuitively, the more neighboring terms share common cluster labels, the more similar the node pair is. This is along the same lines as the earlier work on collective entity resolution (Bhattacharya and Getoor, 2007), where the entity identification decision is based on common neighboring entities rather than on references' features. Let clust(i) be a function which returns the cluster label of node i. For the root-to-root structural similarity using cluster labels, I modify structSimRR(,) simply by replacing name() with clust(). In other words, structSimRR(r_A, r_B) is a normalized intersection between the cluster labels of A's leaves and B's leaves. For the leaf-to-leaf similarity of a pair of leaf nodes, we can only consider the cluster label of their roots rather than all of their siblings. This is because the cluster labels of their root nodes have already taken the cluster labels of their siblings into account. Consequently, structSimLL(l_A, l_B) with cluster labels is simply computed from \delta(clust(r_A), clust(r_B)).
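A minimal sketch of these cluster-label variants is given below. The node bookkeeping (leaves and root references per node) and the normalization by the smaller label set are assumptions made for illustration; they only mirror the definitions above, not the exact implementation.

def struct_sim_rr(root_a, root_b, clust):
    # Root-to-root: normalized intersection of the cluster labels of the two
    # saplings' leaves (name() replaced by clust(), as described above).
    labels_a = {clust[leaf] for leaf in root_a["leaves"]}
    labels_b = {clust[leaf] for leaf in root_b["leaves"]}
    if not labels_a or not labels_b:
        return 0.0
    # one plausible normalization: divide by the smaller label set
    return len(labels_a & labels_b) / min(len(labels_a), len(labels_b))

def struct_sim_ll(leaf_a, leaf_b, clust):
    # Leaf-to-leaf: 1 if the two leaves' roots carry the same cluster label,
    # i.e., the delta function over clust(r_A) and clust(r_B).
    return 1.0 if clust[leaf_a["root"]] == clust[leaf_b["root"]] else 0.0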
5.2.2.2 Negative Similarity

The structural similarity above does not provide an explicit force to discourage clustering that may cause incoming links from different parent concepts. In some settings (e.g., Bhattacharya and Getoor, 2007), a negative similarity can be applied to discourage a merge that violates a constraint. In my context, nevertheless, this similarity is inapplicable since it is imposed on a pairwise basis. Specifically, suppose that a root node is formed as a representative (exemplar) of a certain cluster; leaf nodes having different parent clusters can still legally be merged into the cluster. This is due to the fact that AP only considers the similarity between nodes and their exemplar. As a result, such a configuration does not violate any constraints; therefore, incoming links from different concepts are still permitted.

5.2.3 Expressing Structure through Constraints: Relational Affinity Propagation (RAP)

I extend AP to add structural constraints that will ensure that the learned folksonomy makes sense: no loops and, to the extent possible, the form of a taxonomy. Since we want the learned folksonomy to be a tree, all nodes assigned to some exemplar must have their incoming links from nodes in the same cluster, i.e., assigned to the same exemplar. To achieve this, we must enforce the following two constraints: (1) merging should not create incoming links to a cluster, or concept, from more than one parent cluster (single parent constraint); (2) merging should not create an incoming link to the root of the induced tree (no root parent constraint). For the second constraint, we can simply discard all sapling leaves that are named similarly to the tree root. Hence, we only need to enforce the first constraint.

Figure 5.4: A mapping between actual nodes and hidden variable nodes in Relational Affinity Propagation: (a) examples of saplings; (b) a schematic diagram of the matrix of binary hidden variables (circles) of the nodes in (a). The variables within the green shade correspond to the leaf nodes in (a), while those within the pink shade correspond to the root nodes in (a). pa(.) is a function returning a pointer to the root node of its argument.

The first constraint will be violated if leaf nodes of two saplings are merged, i.e., assigned to the same exemplar, while the root nodes of these saplings are assigned to different exemplars. Consequently, the leaf cluster will have multiple parents pointing to it, which leads to an undesirable configuration. Let pa(.) be a function that returns the index of the parent node of its argument, and explr(.) be a function that returns the index of the argument's exemplar. The factor F_j, the "single parent constraint," checks for the violation of multiple parent concepts pointing to a given concept. The constraint is formally defined as follows:

F_j(c_{1j}, \dots, c_{Nj}) = \begin{cases} -\infty & \exists i,k : c_{ij} = 1; \; c_{kj} = 1; \; explr(pa(i)) \neq explr(pa(k)), \\ 0 & \text{otherwise.} \end{cases}   (5.25)
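Read procedurally, Eq. 5.25 amounts to the small check sketched below; members_j, pa and explr are hypothetical lookup structures (the leaf nodes currently assigned to column j, the parent index of each node, and the exemplar index of each node), not part of the message-passing code itself.

def violates_F(members_j, pa, explr):
    # F_j: the leaf nodes assigned to exemplar j must all have parents that share
    # a single parent exemplar; otherwise the configuration scores -inf.
    parent_exemplars = {explr[pa[i]] for i in members_j if pa[i] is not None}
    return len(parent_exemplars) > 1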
Figure 5.5: The Relational Affinity Propagation (RAP) proposed in this chapter: (a) a schematic diagram of the matrix of binary hidden variables (circles) with the new set of constraints F_j, which are imposed on the leaf nodes (their indices running from 1 to L) at each column j; (b) incoming and outgoing messages of a hidden variable node from/to its associated factor nodes. There are two additional messages, \sigma and \tau. Note that for (a), I omit the E, I and S factors simply for the sake of clarity.

Figure 5.4(b) illustrates the way I impose the new constraint on the binary variable matrix of the nodes in Figure 5.4(a). The configuration shown in the figure is valid since A, B and C all belong to the same exemplar A, and their respective parents, D and E, belong to the same exemplar E. However, if c_{DD} = 1, the configuration is invalid, because the parents of the nodes in the cluster of exemplar A would then belong to different exemplars. This constraint is imposed only on leaf nodes, because merging root nodes will never lead to multiple parents.

The global objective function for Relational Affinity Propagation is basically Eq. 5.3 plus \sum_j F_j(c_{1j}, \dots, c_{Nj}). I modify the equations for updating the messages \rho and \beta, and also derive \sigma and \tau, to take this additional constraint into account. Following the max-sum message update rule from a variable node to a factor node (cf. Eq. 5.4), the message update formulas for \rho, \beta and \sigma (incoming messages to factor nodes) are simply:

\rho_{ij} = S(i,j) + \eta_{ij} + \tau_{ij},   (5.26)
\beta_{ij} = S(i,j) + \alpha_{ij} + \tau_{ij},   (5.27)
\sigma_{ij} = S(i,j) + \alpha_{ij} + \eta_{ij}.   (5.28)

For deriving the outgoing message from the factor F_j, namely \tau, we have to consider two cases, i = j and i \neq j, i.e., the \tau message to the nodes on the diagonal and \tau for the others. For simplicity, I also assume that all leaf nodes have index numbers smaller than any root. Let L be the number of leaf nodes; hence, leaf node indices run from 1 to L. Figure 5.5 (a) illustrates how each F_j constraint is imposed on all leaf nodes c_{kj}.

For the case i = j (the diagonal nodes c_{jj}), we have to consider the update message for \tau in two possible settings, c_{jj} = 1 and c_{jj} = 0 (written as \tau_{jj}(1) and \tau_{jj}(0) respectively), and then find the best configuration for each. Following the max-sum message update rule from a factor node to a variable node (cf. Eq. 5.5), when c_{jj} = 1:

\tau_{jj}(1) = \max_{S_{\{j\}}} \Big[ \sum_{k \in S_{\{j\}}; k \neq j} \sigma_{kj}(1) + \sum_{l \notin S_{\{j\}}; l \neq j} \sigma_{lj}(0) \Big].   (5.29)

For c_{jj} = 0, we have

\tau_{jj}(0) = \sum_{k = 1:L; k \neq j} \sigma_{kj}(0),   (5.30)

where S_{\{j\}} \in T, T being the collection of subsets of \{1, \dots, L\}; \{j\} \subset S_{\{j\}}; and all k in S_{\{j\}} share the same parent exemplar. Eq. 5.29 favors the "valid" configuration (the assignments of c_{kj}) that maximizes the summation of all incoming messages to the factor node F_j. For Eq. 5.30, since no other nodes can belong to j, the valid configuration is simply setting all c_{kj} to 0. Note that I omit F_j from the above equations since invalid configurations can never be optimal and so will never be chosen; thus, F_j is always 0.
From Eq. 5.29 and Eq. 5.30, the message difference \tau_{jj} is simply:

\tau_{jj} = \max\Big[ \max_{S_{\{j\}}} \sum_{k \in S_{\{j\}}; k \neq j} \sigma_{kj}, \; 0 \Big].   (5.31)

For i \neq j, we also have to consider two sub-cases, in the same way as in the previous setting. When c_{ij} = 1:

\tau_{ij}(1) = \max_{S_x} \Big[ \sum_{k \in S_x; k \neq i} \sigma_{kj}(1) + \sum_{l \notin S_x; l \neq i} \sigma_{lj}(0) \Big].   (5.32)

For c_{ij} = 0, we have

\tau_{ij}(0) = \max\Big[ \sum_{k \neq i} \sigma_{kj}(0), \; \max_S \Big[ \sum_{k \in S; k \neq i} \sigma_{kj}(1) + \sum_{l \notin S; l \neq i} \sigma_{lj}(0) \Big] \Big],   (5.33)

where S \in T, T being the collection of subsets of \{1, \dots, L\}, and all k in S share the same parent exemplar, without the restriction that S must contain x. When j is a root node, the leaf node i will never have a multiple-parent conflict with j, but we still need to check whether the other merging leaf nodes share the same parent exemplar as i. Therefore, I set x = \{j\} for this case. When j is a leaf node, however, we have to make sure that node i, node j and the other merging leaf nodes have the same parent exemplar; thus, I set x = \{i,j\}. In the c_{ij} = 0 case, the best configuration may or may not have j as the exemplar, which differs from the c_{ij} = 1 case, where the best configuration necessarily has j as the exemplar.

The message difference \tau_{ij}, which is the difference between \tau_{ij}(1) (Eq. 5.32) and \tau_{ij}(0) (Eq. 5.33), is as follows:

\tau_{ij} = \min\Big[ \max_{S_x} \sum_{k \in S_x; k \neq i} \sigma_{kj}, \; \max_{S_x} \sum_{k \in S_x; k \neq i} \sigma_{kj} - \max_S \sum_{l \notin S; l \neq i} \sigma_{lj} \Big].   (5.34)

From the above equation, since the first argument is always larger than, or equal to, the second one, its shorter form is simply \max_{S_x} \sum_{k \in S_x; k \neq i} \sigma_{kj} - \max_S \sum_{l \notin S; l \neq i} \sigma_{lj}.

There is one specific case in which the above equation, Eq. 5.34, does not hold. In particular, this case arises when both i and j are leaf nodes and they do not share the same parent exemplar. Then the case c_{ij} = 1 should never happen, and that makes \tau_{ij} \to -\infty. In other words, we will always prefer c_{ij} = 0 to c_{ij} = 1. As a result, the message difference for this case is defined as

\tau_{ij}(explr(pa(i)) \neq explr(pa(j))) = -\infty.   (5.35)

For the sake of simplifying the implementation, we can use any sufficiently negative value instead of -\infty to tell the inference procedure that we always favor c_{ij} = 0 in this case.

The inference of exemplars and cluster assignments starts by initializing all messages to zero and keeps updating all messages for each node iteratively until convergence. One possible way to determine convergence is to monitor the stability of the net similarity value, \sum_{i,j} S_{ij}(c_{ij}), as in the original AP. To summarize, the message update equations for RAP are composed of the following:

\rho_{ij} = \begin{cases} S(i,j) + \eta_{ij} + \tau_{ij} & \text{if } i \text{ is a leaf node} \\ S(i,j) + \eta_{ij} & \text{otherwise,} \end{cases}   (5.36)

\beta_{ij} = \begin{cases} S(i,j) + \alpha_{ij} + \tau_{ij} & \text{if } i \text{ is a leaf node} \\ S(i,j) + \alpha_{ij} & \text{otherwise,} \end{cases}   (5.37)

\eta_{ij} = -\max_{k \neq j} \beta_{ik},   (5.38)

\alpha_{ij} = \begin{cases} \sum_{t \neq j} \max[0, \rho_{tj}] & i = j \\ \min[0, \; \rho_{jj} + \sum_{k \notin \{i,j\}} \max[0, \rho_{kj}]] & i \neq j. \end{cases}   (5.39)

In addition, the following message update equations are only required if i is a leaf node:

\sigma_{ij} = S(i,j) + \alpha_{ij} + \eta_{ij},   (5.40)

\tau_{ij} = \begin{cases} \max[0, \; \max_{S_{\{j\}}} \sum_{k \in S_{\{j\}}; k \neq j} \sigma_{kj}] & i = j \\ -\infty & i \neq j; \; explr(pa(i)) \neq explr(pa(j)); \; i \text{ and } j \text{ are leaf nodes} \\ \max_{S_x} \sum_{k \in S_x; k \neq i} \sigma_{kj} - \max_S \sum_{l \notin S; l \neq i} \sigma_{lj} & \text{otherwise.} \end{cases}   (5.41)
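To convey the flavor of these damped scalar-message iterations, the sketch below implements only the plain AP core (Eqs. 5.21-5.24) together with the exemplar/assignment read-out described in Section 5.2.1; the RAP-specific \sigma and \tau updates of Eqs. 5.40-5.41 are deliberately omitted. It is a minimal illustration, not the implementation used in the experiments.

import numpy as np

def affinity_propagation(S, damping=0.9, n_iter=2000):
    # S: N x N similarity matrix; S[j, j] holds the preference of point j.
    N = S.shape[0]
    alpha = np.zeros((N, N))   # messages from the E_j factors
    rho = np.zeros((N, N))     # messages from variables to the E_j factors
    eta = np.zeros((N, N))

    for _ in range(n_iter):
        # Eqs. 5.21-5.22: beta_ij = S + alpha, eta_ij = -max_{k != j} beta_ik
        beta = S + alpha
        row_max = beta.max(axis=1)
        row_argmax = beta.argmax(axis=1)
        tmp = beta.copy()
        tmp[np.arange(N), row_argmax] = -np.inf
        row_second = tmp.max(axis=1)
        eta = -np.where(np.arange(N)[None, :] == row_argmax[:, None],
                        row_second[:, None], row_max[:, None])
        # Eq. 5.23 (damped): rho_ij = S(i, j) + eta_ij
        rho = damping * rho + (1 - damping) * (S + eta)

        # Eq. 5.24: alpha_jj = sum_{t != j} max(0, rho_tj);
        #           alpha_ij = min(0, rho_jj + sum_{k not in {i,j}} max(0, rho_kj))
        pos = np.maximum(rho, 0.0)
        np.fill_diagonal(pos, 0.0)
        col_sum = pos.sum(axis=0)
        alpha_new = np.minimum(0.0, rho.diagonal()[None, :] + col_sum[None, :] - pos)
        np.fill_diagonal(alpha_new, col_sum)
        alpha = damping * alpha + (1 - damping) * alpha_new

    # Read-out: j is an exemplar if the summed incoming messages at c_jj are positive.
    evidence = rho.diagonal() + alpha.diagonal()
    exemplars = np.where(evidence > 0)[0]
    if len(exemplars) == 0:                      # degenerate case: keep the best candidate
        exemplars = np.array([int(np.argmax(evidence))])
    assignment = exemplars[np.argmax((S + eta + alpha)[:, exemplars], axis=1)]
    assignment[exemplars] = exemplars
    return exemplars, assignment

RAP would add, for each leaf row, the \tau message of Eq. 5.41 into the beta and rho updates and the \sigma message of Eq. 5.40 into the \tau computation, as summarized above.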
Recovering the MAP exemplars and cluster assignments is done in a slightly different way than in the original AP, with one extra step, in order to guarantee that the final graph is a tree. In particular, for a certain exemplar, I sort its members by their similarity value in descending order. The parent exemplar of a cluster of nodes is determined as follows. If the exemplar of the cluster is a leaf node, the parent exemplar of the cluster is the parent exemplar of the exemplar itself. Otherwise, the parent exemplar of the highest-ranked leaf node is chosen. I then split out of the cluster all member nodes whose parent exemplars differ from that of the cluster. Note that a more sophisticated approach to this task could be applied, e.g., once a node is split out, find the next best valid exemplar for it to join. However, this more complex procedure is very cumbersome: the decision to re-join a certain cluster may recursively result in the invalidity of other clusters.

Note that RAP can be extended to induce other structure types, such as a DAG. In the DAG case, we can simply change the condition in Eq. 5.25. In particular, for a certain exemplar, its leaf nodes can now have multiple parents, but no descendant node of its root nodes may belong to the same exemplar as some ancestor node of its leaf nodes.

5.2.4 Computational Complexity

Both AP and RAP use the similarity between pairs of nodes to make clustering decisions. A standard similarity function that only relies on node features can be pre-computed at the first iteration and reused throughout the inference process. On the other hand, the class label-based similarity has to be evaluated at every iteration.(2) Therefore, the cost of computing class label-based similarity grows linearly with the number of iterations.

(2) For more accurate similarity, one could re-evaluate the similarity whenever a relevant node is reassigned to a different cluster, but this would require much more computation.

Let N be the number of all terms (data points) in the data set. Generally, it requires O(N^2) operations to compute all pairwise similarities. Nevertheless, one can apply the blocking idea (e.g., McCallum et al., 2000) to significantly reduce the number of such pairwise computations. I use a simple blocking scheme, only comparing sapling nodes that share the same stemmed name (I assume that terms having different stemmed names will never be clustered together). Let M be the number of unique stem terms. Hence, for each stem term, there are N/M nodes to be compared on average; as a result, the computational complexity of pairwise similarity reduces to O((N/M)^2).

To determine the computational complexity of the clustering procedure, note that in each iteration AP needs to pass messages to O((N/M)^2) nodes. Therefore, the number of operations is proportional to the number of node pairs to be compared. RAP, however, uses additional operations to update the \tau messages. Specifically, it needs to (1) update all cluster labels and (2) group nodes that share the same parents. For each node group with the same stem name, the first operation requires sorting nodes by their message values, which can be done in O((N/M) log(N/M)) operations. The second step can be done in O(N/M) operations with a proper data structure. Consequently, RAP requires an additional O(N(1 + log(N/M))) operations per iteration compared to AP.
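The blocking step amounts to the small sketch below; the stem function and the node dictionaries are assumptions made for illustration (any stemmer, e.g., a Porter stemmer, could play that role). The point is only that candidate pairs are generated within a stem block rather than across the whole data set.

from collections import defaultdict
from itertools import combinations

def candidate_pairs(nodes, stem):
    # Group sapling nodes by stemmed name and only compare nodes within a block,
    # reducing the pairwise work from O(N^2) to roughly O((N/M)^2) per stem term.
    blocks = defaultdict(list)
    for node in nodes:
        blocks[stem(node["name"])].append(node)
    for block in blocks.values():
        for a, b in combinations(block, 2):
            yield a, b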
5.3 Evaluation on Real-World Data

I evaluate the different approaches to learning a tree from the structured data collected from Flickr that was described in the previous chapter. I also used the same seed terms and applied the same strategy as in the previous chapter to collect saplings relevant to each seed. Briefly, I selected saplings whose root names were similar to the seed term. I then used the leaf node names of these saplings to expand the data set by identifying other saplings whose root names were similar to these names, and so on, for two iterations.

To compare the different strategies for exploiting structural information, I apply the two clustering procedures, AP and RAP, with different similarity functions to these seed sets. The similarity functions were: (1) local: only local similarity; (2) hybrid: local and structural similarity; (3) class-hybrid: local and structural similarity using class labels. To make this work comparable to SAP, I used the same parameter values in these similarity functions. For local similarity, I set the number of top tags K = 40 and the number of common tags J = 4. For the hybrid similarity function, the weight combining local and structural similarity is \alpha = 0.9 when comparing two nodes that are both roots or both leaves, and \alpha = 0.2 when one node is a root and the other a leaf. The damping factor(3) for message updating is set to 0.9, and the number of iterations is set to 2,000. The preference is set to 0.0001 uniformly. Note that, unlike SAP, there is no need for a clustering threshold, since exemplars emerge and compete against each other to attract other similar nodes. In all, I have six different settings (two clustering procedures with three similarity schemes).

(3) The damping factor is used to avoid the numerical oscillations that may arise, as argued in Frey and Dueck (2007). This value ranges from 0.0 to 1.0. Let \lambda be the damping factor; a message is set to \lambda times its previous value plus (1 - \lambda) times its updated value.

I apply a strategy similar to SAP to remove noisy nodes. Specifically, a noisy leaf node is identified from the number of users who specified it, N_l, and the number who specified its parent, N_r. If N_l / N_r < 0.01, the leaf node term is highly idiosyncratic, and I classify it as noise. Moreover, if there is only one leaf node and a few or more root nodes in a certain cluster, I split the leaf node out of the cluster. This heuristic helps to remove concepts that are less relevant to the seed term of the folksonomy.

5.3.1 Evaluation Methodology

I evaluate the performance of the alternative learning strategies by measuring the properties of the learned tree. Unlike SAP, the methods proposed in this chapter generally return more than one tree. I simply evaluate the most popular tree, which has the largest number of merged nodes at the root level. Specifically, I evaluate both the quality and the structure of the learned folksonomy.

The quality of the learned folksonomy is determined by comparing it to the reference taxonomy, as was done for SAP (cf. Chapter 4). I apply two metrics: modified Taxonomic Overlap (mTO), defined in Section 3.4.3, and Lexical Recall (LR), defined in Section 3.4.2.

For the structural evaluation, I apply two metrics: (1) net similarity (NetSim); (2) the number of structural conflicts (Conflicts). Net similarity measures how well the approach can combine similar smaller structures. To make all settings comparable, I use the Jaccard similarity of the top tags to compute NetSim. The number of conflicts measures the structural integrity of the learned tree. It is given by the number of nodes whose parents belong to different clusters. This number is monitored at the end of the final iteration, just before the last step that removes any structural conflicts that may still appear. The smaller the value, the more consistent the learned structure. Note that I exclude the Area Under Tree (AUT) metric applied in Chapter 4, since that measure generally favors an approach that produces trees with a large number of conflicts.
Specifically, the conflict nodes will be "corrected" to ensure the learned folksonomies have a tree structure (nodes that have different parent exemplars will get split out). Therefore, such folksonomies misleadingly have a bushier structure and so a higher AUT.

5.3.2 Results

I study the uses of structural information through structural similarity and through structural constraints, and measure how these strategies affect the quality of the learned folksonomy. To begin, I first evaluate the performance of the different similarity schemes (with or without structural information) by running them with AP and RAP. Since all learning strategies tend to produce more than one tree, I average their performance across all induced trees. I report the performance of each learning strategy on a particular metric by ranking it against all other strategies and averaging the rankings across all data sets. This gives a measure of how often a strategy outperforms the others. The average rankings are summarized in Table 5.1 (a) for AP and in Table 5.1 (b) for RAP.

(a) AP
Metric     local   hybrid   class-hybrid
LR         1.71    1.39     1.87
mTO        1.55    2.32     1.81
Conflict   2.84    1.39     1.68
NetSim     1.48    2.45     2.03

(b) RAP
Metric     local   hybrid   class-hybrid
LR         1.61    1.39     1.94
mTO        1.61    2.26     1.81
Conflict   2.29    1.39     2.10
NetSim     1.39    2.35     2.23

Table 5.1: The table compares the performance of (a) AP and (b) RAP when using different similarity schemes on various metrics. The numbers show the average ranks across all 32 seeds. The lower the rank, the better the performance.

From Table 5.1, all similarity schemes perform in a similar manner in both AP and RAP. Specifically, structural information in the similarity function (hybrid and class-hybrid) does help reduce the number of structural conflicts in both AP and RAP. Nevertheless, these similarity functions performed worse on mTO and NetSim. This is because they are more stringent than local and cluster fewer saplings together in the folksonomy learning task, where individual saplings are rather sparse. Therefore, fewer "similar" structures are merged, as indicated by the lower NetSim. Not surprisingly, these similarity functions do not improve mTO scores over local. This is because mTO favors deeper trees over shorter ones if the nodes are ordered correctly. Nevertheless, I hypothesize that in domains where individual structures contain rich information, hybrid similarity should outperform local similarity.

For LR, structural information through hybrid similarity can help recover more concepts. This is because learning strategies with this similarity function can exploit structural information when local information is deficient. However, class-hybrid performs worse than hybrid on LR and on the other metrics. I speculate that class labels at the beginning of the learning process may not be reliable enough, and that this leads to worse performance.

When the similarity function is fixed, as shown in Table 5.2, RAP generally outperforms AP on almost all measures. Specifically, it recovers more concepts (better LR score), learns more accurate structures (mTO), and produces more consistent folksonomies (fewer Conflicts). However, RAP produces trees with lower net similarity (NetSim), since it imposes more stringent criteria for merging saplings than AP.
(a) Local Similarity
Measure    AP     RAP
LR         1.35   1.35
mTO        1.42   1.29
Conflict   1.97   1.00
NetSim     1.39   1.55

(b) Hybrid Similarity
Measure    AP     RAP
LR         1.29   1.29
mTO        1.48   1.26
Conflict   1.97   1.00
NetSim     1.48   1.45

(c) Class-Hybrid Similarity
Measure    AP     RAP
LR         1.29   1.19
mTO        1.48   1.29
Conflict   1.97   1.00
NetSim     1.32   1.58

Table 5.2: The table compares the performance of AP and RAP when using (a) local, (b) hybrid and (c) class-hybrid similarity on various metrics.

Next, I compare RAP with local similarity, found to be superior to the alternative clustering schemes, to the previous folksonomy learning approach, SAP (Plangprasopchok et al., 2010), described in Chapter 4. Unlike SAP, the methods proposed in this chapter generally return more than one tree. I simply evaluate the most popular tree, which has the largest number of merged nodes at the root level. Here I report the quality of the learned folksonomy, as measured by mTO scores and the number of overlapping paths (#OPaths) to the reference hierarchy. For #OPaths, I consider two paths as "overlapping" if their root (source) nodes share the same name and their leaf (sink) nodes share the same name. Therefore, the number of overlapping paths is enumerated by counting how many leaves in the learned folksonomy share similar names with some leaves in the reference hierarchy. Since mTO is computed from the overlapping paths, an approach that yields both higher mTO and higher #OPaths is preferable.

Figure 5.6: Comparison of performance of the RAP (with local similarity) and SAP approaches on a variety of metrics reported in Table 5.3 and Table 5.4. Bars report the number of cases in which one approach outperformed the other. The higher the better. Note that "mTO & #OPaths" is computed from the intersection of the mTO and #OPaths superior cases.

As shown in Figure 5.6 (summarized from Table 5.3 and Table 5.4), RAP with local similarity can produce more consistent taxonomies compared to SAP (15 vs. 12 cases). Moreover, if we consider both the number of comparable paths (#OPaths) and mTO, RAP+local is clearly superior to SAP (14 vs. 4 cases). Specifically, the former produces more consistent structures over a higher number of comparable paths with respect to the reference hierarchy.

Nevertheless, the AUT and LR scores shown in Figure 5.6 are inferior to SAP (7 vs. 24 cases, and 13 vs. 18 cases respectively). This is because of the nature of AP and its extension, RAP, which allows different trees to emerge simultaneously. In many cases, each of these trees attracts the structures most similar to it. Compared to SAP, which grows one tree at a time, attracting all similar concepts to it, in RAP concepts are attracted to the different trees with which they have the best fit.
Since I only consider one of the trees in the evaluation, there is a high chance that the selected tree contains relatively fewer unique concepts and so is not bushier than SAP's.

seed             SAP #OPaths  SAP mTO    RAP+local #OPaths  RAP+local mTO
africa           27           0.895      37                 0.869
anim             92           0.659      106                0.656
asia             85           0.788      43                 0.785
australia        27           0.665      46                 0.672
bird             22           0.755      38                 0.714
build            0            0.000      0                  0.000
canada           27           0.587      47                 0.689
cat              0            0.000      1                  0.508
central america  2            0.754      6                  0.863
citi             0            0.000      0                  0.000
countri          4            0.665      1                  0.000
craft            0            0.000      14                 0.400
dog              1            1.000      4                  1.000
europ            301          0.670      133                0.596
fauna            31           0.490      14                 0.529
fish             0            0.000      7                  0.672
flora            18           0.481      28                 0.512
flower           1            1.000      9                  0.783
insect           5            0.924      18                 0.836
invertebr        1            1.000      26                 0.752
north america    118          0.576      182                0.683
plant            7            0.735      11                 0.795
reptil           3            0.622      4                  0.625
south africa     3            0.600      4                  0.600
south america    15           0.832      28                 0.637
sport            27           0.647      114                0.649
unit kingdom     82           0.724      135                0.620
unit state       55           0.749      133                0.823
urban            0            0.000      4                  0.603
vertebr          0            0.000      3                  1.000
world            475          0.461      44                 0.432
Summary          1429 (sum)   0.557 (avg) 1240 (sum)        0.629 (avg)

Table 5.3: The table compares the performance on mTO of the proposed approach, RAP with the local similarity scheme, to SAP, described in Chapter 4. The table also reports the number of comparable paths, #OPaths, to the reference hierarchies.

seed             LR: SAP   LR: RAP+local   AUT: SAP   AUT: RAP+local
africa           0.547     0.304           119.5      42.5
anim             0.360     0.173           1076.0     448
asia             0.484     0.212           631.5      183
australia        0.216     0.097           147.5      54
bird             0.315     0.263           113.5      157
build            1.000     0.500           37.5       39
canada           0.241     0.159           2502       90
cat              0.100     0.143           41.5       53.5
central america  0.500     0.692           12.5       35
citi             0.100     0.080           927.5      86
countri          0.214     0.091           4504       47.5
craft            0.050     0.068           1.5        10.5
dog              0.080     0.063           28.5       46.5
europ            0.418     0.228           2706.5     960.5
fauna            0.212     0.050           1146       81
fish             0.016     0.085           6.5        16
flora            0.407     0.232           1048.5     139
flower           0.250     0.238           226.5      375.5
insect           0.857     0.778           61.5       26.5
invertebr        0.125     0.700           19.5       48.5
north america    0.319     0.032           2203.5     48
plant            0.273     0.182           426        71
reptil           0.667     0.500           4.5        6
south africa     0.444     0.364           18.5       27.5
south america    0.463     0.419           101.5      74
sport            0.084     0.047           86.5       92
unit kingdom     0.127     0.054           658.5      108.5
unit state       0.256     0.074           936.5      122
urban            0.071     0.118           145.5      51
vertebr          0.200     0.800           236.5      57
world            0.215     0.020           1676.5     194

Table 5.4: The table compares the performance on LR and AUT of the proposed approach, RAP with the local similarity scheme, to SAP, described in the previous chapter.

The overall experimental results clearly suggest that the proposed approach (RAP), which incorporates structural information through constraints during the probabilistic inference process, can learn better, more consistent structures. I speculate that RAP can be even more advantageous in domains where heuristics for correcting the learned structure to a specific form are difficult to specify and expensive to carry out.

5.4 Conclusion

In this chapter, I described a fully probabilistic approach to structure learning that is based on distributed inference. It combines a large number of small structures into a few more complex structures by determining all parts of the structures simultaneously, while attempting to integrate the structures into a desired global form. I studied two different ways to incorporate structural information into the inference process, and applied the approach to the folksonomy learning problem.
The experimental results suggest that, in the folksonomy learning setting, the approach that incorporates structural information through constraints, RAP, helps produce folksonomies of high quality that are even more consistent with the reference hierarchy than those of SAP in a majority of the cases.

Chapter 6

Related Work

In this chapter, I will first summarize previous works that study the characteristics of social annotation. Then, I will relate my research problems and proposed approaches to previous works in other domains.

6.1 Social Annotation Characteristics

Social annotation, especially social tagging, has recently been examined by many researchers and practitioners from many angles. Some of these aspects, which relate to my work, are summarized as follows. First, social tagging users annotate a certain item using tags from multiple aspects or facets (Mathes, 2004; Rashmi, 2005). Moreover, tags are used not only for categorization but for other reasons, e.g., attraction and findability, as well (Mathes, 2004). Secondly, although tags for a given item are freely chosen by different users, their occurrences are not completely random. In particular, the size of the tag vocabulary on an item becomes stable soon after a few users have tagged the item (Golder and Huberman, 2006). Heymann et al. (2008) also discovered that, for a given item, only 7 percent of tags are found to be irrelevant to the item, and less than 5 percent of them are subjective. These findings imply that tags reflect what an item is about; meanwhile, a consensus across users on tagging does exist to a certain degree.

The third aspect is that there exists a relationship between a group of users and the tag vocabulary they use. Marlow et al. (2006) discovered that Flickr users who are in the same social groups tend to use the same set of tags to annotate items ("sociolect"). Meanwhile, Mika (2007) used a graph-based approach to demonstrate that the same set of users use the same set of terms to annotate items in Delicious. Moreover, since similar items are usually annotated with similar tags, it is very likely that the same set of users usually annotate the same set of items. Fourthly, the level of specificity in tagging varies from one user to another, as well as from one domain to another (Golder and Huberman, 2006). This depends on the "basic level" of categorization (Mervis and Rosch, 1981) that each individual has in a given domain. For instance, a dog expert would tag a dog photo using very detailed tags, while she uses somewhat vague and general tags on a cat photo. Lastly, both synonymy and polysemy usually appear in the social tagging context (Mathes, 2004; Golder and Huberman, 2006). Acronyms have also been widely used for tagging items.

6.2 Extracting Concepts

The problem of learning concepts from social annotation is similar to that of clustering words from texts. A concept, which groups similar tags, in the former is analogous to a cluster of semantically related words in the latter. Many works on text clustering (e.g., Hindle, 1990; Pantel and Lin, 2002) basically utilize contextual information as features for each word. For instance, co-occurrences of noun and verb pairs can be used as noun features. Similarity between nouns is then computed based on these features. However, these approaches are not directly applicable to the social annotation domain. In particular, co-occurrences of users, items and tags are very sparse in this domain.
Consider clustering similar tags: we need to group similar users and similar items first, and subsequently use the clusters of these entities as tag features to reduce sparseness. Ironically, to group similar users or items, we have to group similar tags first. This is a chicken-and-egg problem.

Ideally, one needs to cluster all entities simultaneously. Although co-clustering techniques, which can cluster two different views simultaneously, have been proposed (e.g., Dhillon et al., 2003), to my knowledge a "tri-clustering" framework does not yet exist. I instead frame this problem using a probabilistic generative approach, which explicitly models how all entities relate to each other. Clusters of all entities are by-products of this approach.

Modeling social annotation is closely related to two fields: document modeling and modeling user profiles for collaborative filtering. It is relevant to the former in that one can view an item annotated by users with a set of tags as analogous to a document composed of words from the document's authors. Usually, the number of users involved in creating a document is much smaller than the number involved in annotating an item. With regard to collaborative rating systems, the annotations created by users in a social annotation system are analogous to item ratings in a recommendation system. However, users only provide one rating per item in a recommendation system, whereas they usually annotate an item with several keywords. Therefore, there are several relevant threads of research connecting my work to earlier ones in these areas.

In relation to document modeling, my work in Chapter 2 is conceptually motivated by the Author-Topic model (AT) (Rosen-Zvi et al., 2004), where we can view a user who annotates an item as an author who composes a document. In particular, the model explains a process of document generation, governed by author profiles, in the form of distributions of authors over topics. However, this work is not directly applicable to social annotation. First, we know explicitly by whom a certain tag is generated; thus, the author selection process in AT, the process that selects one of the co-authors to be responsible for the generation of a certain tag, is not needed in my context. Second, co-occurrences of user-tag pairs for a certain bookmark are very sparse, i.e., there are approximately fewer than ten tags per bookmark. Thus, we need to group users who share the same interests together to avoid the sparseness problem. Third, AT has no direct way to estimate distributions of items over topics, since there are only author-topic and topic-word associations in that model. One possible indirect way is to compute this as an average over the distributions of authors over topics for the authors who actually annotated that item. My model, instead, explicitly models these distributions; thus, no extra computation is required. Additionally, since my model uses the profiles of groups of similar users, rather than those of an individual, the distributions are expected to be less biased.

There is recent work that applies document modeling to a social annotation system (Wu et al., 2006). This work utilizes the multi-way aspect model (Hofmann, 2001; Popescul et al., 2001) on social annotation data in Delicious. The model does not explicitly separate user interests and item topics as my model does. Therefore, it cannot exploit individual differences to extract distributions of items over topics, as shown in my earlier work (Plangprasopchok and Lerman, 2007).
Moreover, that work focuses on personalized item search, and was evaluated by demonstrating that it can alleviate the problems of tag sparseness and synonymy in the task of searching for items by a tag. Recently, Kashoob et al. (2009) proposed a probabilistic model of the social annotation process. However, the work does not explicitly model users; hence, it cannot distinguish user interests from item topics.

Collaborative filtering was among the first successful social applications. Collaborative filtering is a technology used by recommender systems to find users with similar interests by asking them to rate items. It then compares their ratings to find users with similar opinions, and recommends to users new items that similar users liked. Among the recent works in the collaborative filtering area, Jin et al. (2006) is most relevant to mine. In particular, that work describes a mixture model for collaborative filtering that takes into account users' intrinsic preferences about items. In this model, an item rating is generated from both the item type and the user's individual preference for that type. Intuitively, like-minded users would give similar ratings to the same item types (e.g., movie genres). When predicting the rating of a certain item for a certain user, the user's previous ratings on other items are used to infer a like-minded group of users. The "common" rating on that item from the users of that group is the prediction. This collaborative rating process is very similar to that of collaborative tagging. The only technical difference is that each "item" can have multiple "ratings" (in my case, tags) from a single user. This is because an item usually has multiple subjects, and each subject can be represented using multiple terms.

There exist, however, major differences between Jin et al. (2006) and my work. I use the probabilistic model to discover an "item description" despite users annotating items with potentially ambiguous tags. My goal is not to predict how a user will tag an item (analogous to predicting the rating a user will give to an item), or to discover like-minded groups of users, although my proposed work could also do this. The main purpose of the work is to recover the actual "concepts" that describe items from the noisy observations generated by different users. In essence, I hypothesize that there is an actual description of a certain item, and that users select and then annotate the item with that description partially, according to their "interest" or "expertise." Another technical difference is that their model is not implemented as a fully Bayesian network and uses point estimation to estimate its parameters, which has been criticized as susceptible to local maxima (Griffiths and Steyvers, 2004; Steyvers and Griffiths, 2006). Moreover, it cannot be extended to allow the numbers of topics/interests to be flexible; thus, a strong assumption on the numbers of topics and interests is required.

6.3 Learning Conceptual Hierarchy

Many researchers have studied the problem of constructing ontological relations from text (e.g., Hearst, 1992; Caraballo, 1999; Paşca, 2004; Snow et al., 2006). These works exploit linguistic patterns to infer whether two keywords are related under a certain relationship. For instance, they use "such as" to learn hyponym relations. Cimiano et al. (2005) also apply linguistic patterns to extract item properties and then use Formal Concept Analysis (FCA) to infer concept hierarchies. In FCA, a given item consists of a set of attributes, and some attributes are common to a subset of items.
A concept 'A' subsumes concept 'B' if all items in 'B' (with some common attributes) are also in 'A'. However, these approaches are not applicable to annotation on social Web sites, such as tags, bundles and photo sets, which is ungrammatical and unstructured.

There are several works which do not utilize linguistic patterns to construct word hierarchies. Instead, these approaches only consider occurrences of words. One of them is the probabilistic abstraction hierarchy (Segal et al., 2002), which assumes that each word (data point) is generated from a particular class, a probability distribution over words in a corpus. Subsequently, similar classes, which have similar probability distributions, are merged into more abstract classes. Another example is the work by Blei et al. (2003a), which uses a probabilistic generative model to recover a topic hierarchy of words according to their usage in documents. The assumption of that work is that words in each document are generated from topics along a path from the root to a leaf of the topic hierarchy. Since topics at the top levels are shared among several paths, general words will likely appear in the topics at the top of the hierarchy.

Recently, several works have proposed different approaches to construct concept hierarchies from social annotation. Mika (2007) uses a graph-based approach to construct a network of related tags, projected from either user-tag or item-tag association graphs. Although there is no evaluation of inducing broader/narrower relations, the work suggests inferring them by using betweenness centrality and set theory. Other works apply agglomerative clustering techniques to tags and use their co-occurrence statistics to produce concept hierarchies (Brooks and Montanez, 2006). In a variation of the clustering approach, Heymann and Garcia-Molina (2006) use graph centrality in the similarity graph of tags. In particular, a tag with higher centrality is taken to be more abstract than one with lower centrality; thus it should be merged into the hierarchy before the latter, to guarantee that the more abstract node gets closer to the root node. Schmitz (2006) has applied a statistical subsumption model (Sanderson and Croft, 1999) to induce hierarchical relations among tags.

I believe that the previously mentioned works suffer from the "popularity vs. generality" problem that arises when using tags to induce a hierarchy. Specifically, a certain tag may be used more frequently not only because it is more general, but because it is more popular among users. On Flickr, I found that there are ten times as many photos tagged with "car" as with "automobile." If we apply clustering approaches and the like, "car" may be found to be more abstract than "automobile," since the former is likely to have higher centrality than the latter. In addition, if we apply a statistical subsumption model, the former would be likely to subsume the latter, since there is a higher chance that photos tagged with "automobile" are also tagged with "car." Of course, I am convinced that tag statistics are a good source of evidence for inducing hierarchies; however, tag statistics alone may not be enough to discover concept hierarchies.

There is another line of research that focuses on exploiting partial hierarchies contributed by users. The GiveALink project (Markines et al., 2006) collects bookmarks donated by users. Each bookmark is organized in a tree structure, as folders and subfolders, by an individual user.
There is another line of research that focuses on exploiting partial hierarchies contributed by users. The GiveALink project (Markines et al., 2006) collects bookmarks donated by users. Each bookmark is organized in a tree structure of folders and subfolders by an individual user. Based on these tree structures, similarities between URLs are computed and used for URL recommendation and ranking. Although this project does not concentrate on concept hierarchy construction, it provides a good motivation to exploit explicit partial structures like folder-subfolder relations. My approach, previously described in Chapter 4 and Chapter 5, is in the same spirit as GiveALink: I exploit collection and set relations contributed by users on a social Web site to construct concept hierarchies. I hypothesize that the generality-popularity problem of keywords is smaller in the collection-set relation space than in the tag space. Although people may use the keyword "car" far more often than "automobile" to name their collections and sets, not many people would put their "automobile" album into a "car" super-album.

The folksonomy learning problem, as described in Chapter 3, is similar to ontology alignment (Euzenat and Shvaiko, 2007; Udrea et al., 2007) in that both identify matches between concepts in pairs of structures. Nevertheless, ontology alignment differs in an important respect: there are typically just a few structures to align, and those structures are deep and semantically rich. Folksonomy learning, in contrast, focuses on a much noisier setting, where we have many smaller fragments created by end users with a variety of purposes in mind.

Affinity Propagation (AP), on which the work in Chapter 5 is based, has been applied to many clustering problems, e.g., segmentation in computer vision (Lazic et al., 2009), because it provides a natural way to incorporate constraints while simultaneously improving the net similarity of the cluster assignments, which is not trivial to handle in standard clustering techniques. In addition, no strong assumption is required about the threshold that determines whether clusters should be merged. Moreover, cluster assignments can change during the inference process, as suggested by the emergence of exemplars, in contrast to many "incremental" clustering approaches (e.g., Bhattacharya and Getoor, 2007), in which previous clustering decisions cannot be changed. Nevertheless, to my knowledge, there is no extension of the AP algorithm that learns tree structures from many sparse and shallow trees as presented in this work.

There are many other statistical relational learning (SRL) approaches that are applicable to this class of problems as well. For example, Markov Logic Networks (MLN) (Richardson and Domingos, 2006) and Probabilistic Similarity Logic (PSL) (Broecheler et al., 2010), generic frameworks for solving probabilistic inference problems, may also be applied to folksonomy learning, by translating the similarity function as well as the constraints into logical predicates. Since my similarity function is continuous, hybrid MLN (HMLN) (Wang and Domingos, 2008) would be required. Nevertheless, the AP framework is preferable for the present problem due to its simplicity; for problems that require modeling multiple types of relations and constraints, MLN and PSL may be more suitable.
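For background on the Affinity Propagation machinery discussed above, the following sketch clusters a handful of tags with standard affinity propagation over a precomputed co-occurrence similarity matrix, using scikit-learn. The tag list and counts are invented for illustration, and the structural constraints that distinguish RAP (Chapter 5) from plain AP are not shown.

import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical tag co-occurrence counts; the diagonal holds each tag's photo count.
tags = ["car", "automobile", "bmw", "insect", "moth", "butterfly"]
cooc = np.array([
    [50, 12, 20,  0,  0,  0],
    [12, 15,  6,  0,  0,  0],
    [20,  6, 25,  0,  0,  0],
    [ 0,  0,  0, 40,  9, 11],
    [ 0,  0,  0,  9, 14,  5],
    [ 0,  0,  0, 11,  5, 18],
], dtype=float)

# Cosine-normalize the counts so they can serve as a similarity matrix.
norm = np.sqrt(np.outer(np.diag(cooc), np.diag(cooc)))
sim = cooc / norm

ap = AffinityPropagation(affinity="precomputed", damping=0.9, random_state=0).fit(sim)
for tag, label in zip(tags, ap.labels_):
    exemplar = tags[ap.cluster_centers_indices_[label]]
    print(f"{tag:12s} -> cluster exemplar: {exemplar}")

In plain AP, raising a point's self-similarity (its preference) makes it more likely to be chosen as an exemplar; here the default preference, the median similarity, is used.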
Chapter 7

Conclusions

7.1 Contribution

This thesis presents statistical approaches to inferring category knowledge, in terms of concepts and conceptual hierarchies, from user-generated annotation. In the first part of the thesis (Chapter 2), I developed a probabilistic model that goes beyond standard text modeling approaches by taking into account the users who create the annotations. I hypothesized that user information can help the approach learn more accurate concepts, especially when user variation is high. The experiments on both synthetic and real-world data sets support this claim.

In the second part of the thesis, I first identified the folksonomy learning problem (Chapter 3): given a set of small personal hierarchies generated by many different users, learn complex structures that best combine similar structures and resolve their inconsistencies. I described a strategy for evaluating the quality of a learned folksonomy, along with some novel metrics. I then developed two statistical approaches: one that incrementally combines personal hierarchies and removes inconsistencies (Chapter 4), and another that combines the hierarchies in a distributed manner and utilizes structural constraints to avoid inconsistent integration (Chapter 5). Each approach has its own advantage: the incremental approach is efficient, while the distributed inference approach is accurate and generalizes to other structure learning problems.

To reiterate, the major contributions of this thesis are:

• A probabilistic model for inferring concepts from social annotation;

• Two probabilistic approaches that learn complex hierarchies from structured social annotation in incremental and distributed manners.

Additionally, the secondary contributions include:

• An automatic approach for quantitatively evaluating the quality of learned taxonomies, by comparing them to a reference taxonomy;

• A simple, yet intuitive metric for measuring how detailed a tree's structure is in terms of depth and bushiness.

7.2 Application Areas

The approaches proposed in this thesis are applicable to many application areas. Some examples are enumerated in the following subsections.

7.2.1 Recommendation and Personalization

In addition to the resource discovery task, my approach to learning concepts from social tagging, the Interest Topic Model (ITM) described in Chapter 2, can be applied directly to tag recommendation. Basically, the approach models the social annotation process by taking into account all essential entity types: namely, user, resource, and tag. Given a user u and a resource r, we can identify the most probable tags that best reflect both the user's interests and the resource's topics, using the learned parameters of the model. Specifically, tags t with high p(t|u,r), straightforwardly computed from the model parameters, can be used for recommendation. My approach has already been extended and applied to this area in some recent work (e.g., Lerman et al., 2007; Harvey et al., 2010).

7.2.2 Community Detection

One can also apply ITM to detect groups or communities of like-minded users who share the same interests. Basically, the distribution of a user over interests, p(x|u) or ψ, which is one of the ITM model parameters, can be used for finding communities. In particular, from ψ, we compute the conditional probability p(u|x), i.e., how likely user u is to belong to a group of users who share a common taste or interest x. The top users with high p(u|x) then belong to the community x.
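As a concrete illustration of how the learned ITM parameters support the two applications above, here is a minimal sketch. The arrays below are randomly generated stand-ins for ψ, φ, and θ, the function names are mine rather than part of the thesis implementation, and a uniform prior p(u) is assumed when inverting p(x|u) into p(u|x).

import numpy as np

rng = np.random.default_rng(0)
N_U, N_R, N_T, N_X, N_Z = 5, 4, 10, 3, 3   # users, resources, tags, interests, topics

# Stand-ins for the learned ITM parameters (each row/slice is a distribution).
psi   = rng.dirichlet(np.ones(N_X), size=N_U)          # psi[u, x]      = p(x | u)
phi   = rng.dirichlet(np.ones(N_Z), size=N_R)          # phi[r, z]      = p(z | r)
theta = rng.dirichlet(np.ones(N_T), size=(N_Z, N_X))   # theta[z, x, t] = p(t | z, x)

def recommend_tags(u, r, k=3):
    # Rank tags by p(t|u,r) = sum_{x,z} p(x|u) p(z|r) p(t|z,x).
    p_t = np.einsum("x,z,zxt->t", psi[u], phi[r], theta)
    return np.argsort(p_t)[::-1][:k]

def community(x, k=3):
    # Top users for interest x, ranked by p(u|x) proportional to p(x|u) with uniform p(u).
    p_u = psi[:, x] / psi[:, x].sum()
    return np.argsort(p_u)[::-1][:k]

print("tags for user 0 on resource 1:", recommend_tags(0, 1))
print("core users of interest 2:", community(2))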
7.2.3 Learning Complex Structures from Structured Data

I believe that the approach in the second part of the thesis can be applied to many structure learning problems in which one needs to learn a complex structure by integrating many smaller ones. For instance, one can apply the approach to learn a full semantic network by combining many small hypernym relations that have already been induced by existing approaches (e.g., Hearst, 1992; Sanderson and Croft, 1999; Kozareva and Hovy, 2010). Additionally, the approach is potentially applicable to sequence assembly, the process in bioinformatics of aligning and merging fragments to reconstruct long genomic sequences (see http://en.wikipedia.org/wiki/Sequence_assembly). In such a setting, one needs an approach that finds a structure maximizing the overlaps (similarity) of the constituent structures, since the result must be in chain form. This is very much in the spirit of Relational Affinity Propagation (RAP), described in Chapter 5. Moreover, the approach can potentially be used to combine many small threads of events, where a temporal constraint must be respected; this can be expressed as a structural constraint in RAP.

7.3 Directions for Future Work

7.3.1 Interest Topic Model

One issue that the ITM presented in Chapter 2 does not address is tag bias, probably caused by the expressiveness of users with high interest in a certain domain. In general, a few users use many more tags than others when annotating resources. This biases the model toward those users' annotations, causing the learned topic distributions to deviate from the actual distributions. One possible way to compensate for this is to tie the number of tags to individual interests in the model. ITM also does not at present allow us to include other sources of evidence about documents, e.g., their contents. It would be interesting to extend ITM to include content words, as in Zhou et al. (2008), which would make the model more attractive for Information Retrieval tasks.

Since ITM is more computationally expensive than models that ignore user information, e.g., LDA, it is not practical to blindly apply my approach to all data sets. Specifically, my model cannot exploit individual variation in data that has low tag ambiguity and small individual variation, as shown in Section 2.4.1; in this case, the model produces only a small improvement, or performance similar to that of the simpler models. For practical reasons, a heuristic for determining the level of tag ambiguity and user variation would be very beneficial in deciding whether the complex model is preferable to the simpler one. Ratios between the number of tags and the number of users or resources may provide some clues.
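A rough sketch of one such heuristic follows, assuming the annotation data is available as (user, resource, tag) tuples; the cutoff values are placeholders of my own, not thresholds established in this thesis.

def annotation_stats(tuples):
    # Simple ratios hinting at vocabulary size relative to users and resources.
    users     = {u for u, r, t in tuples}
    resources = {r for u, r, t in tuples}
    tags      = {t for u, r, t in tuples}
    return len(tags) / len(users), len(tags) / len(resources)

def prefer_user_aware_model(tuples, user_cut=2.0, resource_cut=2.0):
    # Placeholder rule: a large tag vocabulary per user/resource suggests higher
    # ambiguity and individual variation, where a user-aware model such as ITM may pay off.
    tags_per_user, tags_per_resource = annotation_stats(tuples)
    return tags_per_user > user_cut or tags_per_resource > resource_cut

# Toy usage with hypothetical tuples.
data = [("u1", "r1", "jaguar"), ("u2", "r1", "car"), ("u3", "r2", "jaguar"), ("u3", "r2", "cat")]
print(annotation_stats(data), prefer_user_aware_model(data))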
7.3.2 Folksonomy Evaluation

In some cases, using a reference hierarchy as ground truth to evaluate a learned folksonomy may not be appropriate, if the topics of the folksonomy and the reference are not the same. Moreover, annotation created by users may be colloquial, as some topics are too recent and have yet to be populated into the reference. In such situations, a better way to evaluate the folksonomy is to employ human judgment.

Since many learned folksonomies are complex, i.e., trees that are both bushy and deep, it is too cumbersome to ask a few judges to evaluate whole folksonomies. Additionally, manual evaluation is subject to bias if only one or a few annotators are used. Recently, a number of studies (e.g., Dakka and Ipeirotis, 2008; Sorokin and Forsyth, 2008) have started using crowdsourcing systems, i.e., a mass of non-expert users, for manual evaluation. A recent work by Snow et al. (2008) also addresses the feasibility of this approach and presents strategies for combining annotations and correcting bias in a mass of annotations. From this, I believe that it is promising to use crowdsourcing techniques to help evaluate learned folksonomies. Nevertheless, we will need to consider the following questions: how do we ask each user to annotate a folksonomy, as a whole or in parts? How many users are needed for evaluating a certain folksonomy?

7.3.3 Relational Affinity Propagation

Since the similarity function used in RAP is a weighted linear combination of local and structural similarities, a good weight must be specified a priori for learning good folksonomies. Additionally, different weight values can result in different folksonomy quality. The current strategy I use in this thesis is simply to set aside a small portion of the data set and run the approach with different weight values; the value that yields the highest performance is then selected. In many cases, labeled data, i.e., some cluster labels, are available. One can use such labels to automatically learn the similarity weight. One possible strategy is to use a gradient-descent-like algorithm to estimate it, as in the Markov logic network weight learning strategy (Lowd and Domingos, 2007).

Moreover, affinity propagation, which RAP extends, provides a natural way to impose background knowledge about which data points should become exemplars, through the self-similarity S_ii. The higher the value, the more likely the point is to become an exemplar of others. In cases where we know that some users have better expertise in annotating content than others, we can increase the self-similarity of all data points that belong to them. Such a strategy may help RAP converge faster, as well as learn a folksonomy of higher quality.

7.4 Closing Remark

As content is unceasingly generated by users on the social Web, social annotation becomes more and more available to be explored and exploited. I hope that the contributions in this thesis provide some inspiration, as well as establish some foundation, for researchers and practitioners to better explore and exploit such collective knowledge, in order to soon crack many yet-to-be-solved real-world problems.

Bibliography

M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, 1964.
R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.
J. L. Ambite, S. Darbha, A. Goel, C. A. Knoblock, K. Lerman, R. Parundekar, and T. A. Russ. Automatically constructing semantic web services from online sources. In Proceedings of the International Semantic Web Conference, 2009.
S. Angeletou, M. Sabou, L. Specia, and E. Motta. Bridging the gap between folksonomies and the semantic web: An experience report. In Proceedings of the ESWC Workshop on Bridging the Gap between Semantic Web and Web 2.0, 2007.
A. Asuncion, M. Welling, P. Smyth, and Y.-W. Teh. On smoothing and inference for topic models. In Proceedings of Uncertainty in Artificial Intelligence, 2009.
I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Trans. Knowl. Discov. Data, 1(1):5, 2007.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Proceedings of the Neural Information Processing Systems, 2003a.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003b.
C. Brewster, H. Alani, S. Dasmahapatra, and Y. Wilks. Data driven ontology evaluation.
In Proceedings of the International Conference on Language Resources and Evaluation, 2004.
M. Broecheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010.
C. H. Brooks and N. Montanez. Improved annotation of the blogosphere via autotagging and hierarchical clustering. In Proceedings of the World Wide Web conference, 2006.
W. L. Buntine. Operations for learning with graphical models. J. Artif. Intell. Res. (JAIR), 2:159–225, 1994.
S. A. Caraballo. Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1999.
X. Chen, G. Anantha, and X. Wang. An effective structure learning method for constructing gene networks. Bioinformatics, 22:1367–1374, 2006.
P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using formal concept analysis. J. Artif. Intell. Res. (JAIR), 24:305–339, 2005.
W. Dakka and P. G. Ipeirotis. Automatic extraction of useful facet terms from text documents. In Proceedings of the International Conference on Data Engineering, 2008.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2003.
M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.
J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, 2007.
B. Frey and D. MacKay. A revolution: Belief propagation in graphs with cycles. In Proceedings of Neural Information Processing Systems, 1998.
B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 312:972–976, 2007.
W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996.
I. E. Givoni and B. J. Frey. A binary variable model for affinity propagation. Neural Comput., 21(6):1589–1600, 2009.
S. A. Golder and B. A. Huberman. Usage patterns of collaborative tagging systems. J. Inf. Sci., 32:198–208, 2006.
T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl Acad Sci U S A, 101 Suppl 1:5228–5235, 2004.
M. Harvey, M. Baillie, I. Ruthven, and M. J. Carman. Tripartite hidden topic models for personalised tag suggestion. In Proceedings of the European Conference on IR Research, 2010.
M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1992.
D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: The combination of knowledge & statistical data. Mach. Learn., 20:197–243, 1995.
P. Heymann and H. Garcia-Molina. Collaborative creation of communal hierarchical taxonomies in social tagging systems. Technical report, Stanford University, 2006.
P. Heymann, G. Koutrika, and H. Garcia-Molina. Can social bookmarking improve web search? In Proceedings of the International Conference on Web Search and Web Data Mining, 2008.
D. Hindle. Noun classification from predicate-argument structures. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1990.
T. Hofmann. Probabilistic latent semantic analysis.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn., 42(1-2):177–196, 2001.
R. Jin, L. Si, and C. Zhai. A study of mixture models for collaborative filtering. Inf. Retr., 9(3):357–382, 2006.
S. Kashoob, J. Caverlee, and Y. Ding. A categorical model for discovering latent structure in social annotations. In Proceedings of the International Conference on Weblogs and Social Media, 2009.
S. Kok and P. Domingos. Learning the structure of Markov logic networks. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, 2005. ISBN 1-59593-180-5.
Z. Kozareva and E. Hovy. Learning arguments and supertypes of semantic relations using recursive patterns. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2010.
F. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47:498–519, 2001.
N. Lazic, I. Givoni, B. Frey, and P. Aarabi. FLoSS: Facility location for subspace segmentation. In Proceedings of the International Conference on Computer Vision, 2009.
K. Lerman, S. Minton, and C. A. Knoblock. Wrapper maintenance: A machine learning approach. J. Artif. Intell. Res. (JAIR), 18:149–181, 2003.
K. Lerman, A. Plangprasopchok, and C. Wong. Personalizing image search results on Flickr. In Proceedings of the AAAI Workshop on Intelligent Web Personalization, 2007.
J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
D. Lowd and P. Domingos. Efficient weight learning for Markov logic networks. In European Conference on Principles and Practice of Knowledge Discovery in Databases, 2007.
A. Maedche and S. Staab. Measuring similarity between ontologies. In Proceedings of Knowledge Engineering and Knowledge Management, 2002.
C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
B. Markines, L. Stoilova, and F. Menczer. Bookmark hierarchies and collaborative recommendation. In Proceedings of the Association for the Advancement of Artificial Intelligence, 2006.
B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.
C. Marlow, M. Naaman, D. Boyd, and M. Davis. HT06, tagging paper, taxonomy, Flickr, academic article, to read. In Proceedings of the Conference on Hypertext and Hypermedia, 2006.
A. Mathes. Folksonomies: Cooperative classification and communication through shared metadata. 2004.
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. J. Artif. Intell. Res. (JAIR), 30:249–272, 2007.
C. Mervis and E. Rosch. Categorization of natural objects. Annual Review of Psychology, 32(1):89–115, 1981.
P. Mika. Ontologies are us: A unified model of social networks and semantics. J. Web Sem., 5(1):5–15, 2007.
T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 2001.
A. E. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records.
In Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, 1997.
K. P. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
M. Paşca. Acquisition of categorized named entities for web search. In Proceedings of the International Conference on Information and Knowledge Management, 2004.
P. Pantel and D. Lin. Discovering word senses from text. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2002.
A. Plangprasopchok and K. Lerman. Exploiting social annotation for automatic resource discovery. In Proceedings of the AAAI Workshop on Information Integration, 2007.
A. Plangprasopchok and K. Lerman. Constructing folksonomies from user-specified relations on Flickr. In Proceedings of the World Wide Web conference, 2009.
A. Plangprasopchok, K. Lerman, and L. Getoor. Growing a tree in the forest: Constructing folksonomies by integrating structured metadata. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
A. Popescul, L. Ungar, D. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2001.
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
A. Ranganathan. The Dirichlet process mixture (DPM) model, October 2006. URL http://www.ananth.in/docs/dirichlet.pdf.
S. Rashmi. A cognitive analysis of tagging, 2005. URL http://rashmisinha.com/2005/09/27/a-cognitive-analysis-of-tagging/.
C. E. Rasmussen. The infinite Gaussian mixture model. In Proceedings of the Neural Information Processing Systems, pages 554–560. MIT Press, 2000.
T. Rattenbury, N. Good, and M. Naaman. Towards automatic extraction of event and place semantics from Flickr tags. In Proceedings of the Conference on Research and Development in Information Retrieval, 2007.
M. Richardson and P. Domingos. Markov logic networks. Mach. Learn., 62:107–136, 2006.
C. Ritter and M. A. Tanner. Facilitating the Gibbs sampler: The Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association, 87:861–868, 1992.
M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2004.
S. K. Sahu and G. O. Roberts. On convergence of the EM algorithm and the Gibbs sampler. Statistics and Computing, 9:9–55, 1998.
M. Sanderson and W. B. Croft. Deriving concept hierarchies from text. In Proceedings of the Conference on Research and Development in Information Retrieval, pages 206–213, 1999.
P. Schmitz. Inducing ontology from Flickr tags. In Proceedings of the WWW Workshop on Collaborative Web Tagging, 2006.
E. Segal, D. Koller, and D. Ormoneit. Probabilistic abstraction hierarchies. In Proceedings of the Neural Information Processing Systems, pages 913–920, Vancouver, Canada, December 2002.
R. Snow, D. Jurafsky, and A. Y. Ng. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2006.
R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast – but is it good?
Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008.
A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Proceedings of the IEEE Workshop on Internet Vision at CVPR, 2008.
M. Steyvers and T. Griffiths. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. 2006.
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 2004.
O. Udrea, L. Getoor, and R. J. Miller. Leveraging data and structure in ontology integration. In SIGMOD Conference, 2007.
J. Wang and P. Domingos. Hybrid Markov logic networks. In Proceedings of the Association for the Advancement of Artificial Intelligence, 2008.
X. Wu, L. Zhang, and Y. Yu. Exploring social annotations for the semantic web. In Proceedings of the World Wide Web conference, 2006.
J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005.
D. Zhou, J. Bian, S. Zheng, H. Zha, and C. L. Giles. Exploring social annotations for information retrieval. In Proceedings of the World Wide Web conference, 2008.
M. Zhou, S. Bao, X. Wu, and Y. Yu. An unsupervised model for exploring hierarchical semantics from social annotations. In Proceedings of the International Conference on Semantic Web, 2007.

Appendix A

Derivation of Gibbs Sampling Formula

In this Appendix, I provide the derivations of the Gibbs sampling equations for ITM in Section 2.3. Specifically, I derive the closed-form formula for computing p(z_i | z_{-i}, x, t), and then generalize this formula for p(x_i | x_{-i}, z, t). To begin, I define the joint probability of t, x, and z over all tuples. Suppose that we currently have n tuples. Their joint probability is defined as follows:

\[
\begin{aligned}
p(t_i,x_i,z_i;\,i=1{:}n)
&= \int p(t_i,x_i,z_i \mid \psi,\phi,\theta;\,i=1{:}n)\; p(\psi,\phi,\theta)\; d\langle\psi,\phi,\theta\rangle \\
&= c \int \prod_{i=1:n}\big(\psi_{u_i,x_i}\,\phi_{r_i,z_i}\,\theta_{t_i,z_i,x_i}\big)
   \prod_{u,x}\psi_{u,x}^{\beta/N_X-1}
   \prod_{r,z}\phi_{r,z}^{\alpha/N_Z-1}
   \prod_{t,x,z}\theta_{t,x,z}^{\eta/N_T-1}\; d\langle\psi,\phi,\theta\rangle \\
&= c \int \prod_{u,x}\psi_{u,x}^{\sum_i\delta_u(x_i,x)+\beta/N_X-1}\,d\psi
   \int \prod_{r,z}\phi_{r,z}^{\sum_i\delta_r(z_i,z)+\alpha/N_Z-1}\,d\phi
   \int \prod_{t,z,x}\theta_{t,z,x}^{\sum_i\delta_{z,x}(t_i,t)+\eta/N_T-1}\,d\theta \\
&= c \prod_r \frac{\prod_z\Gamma\big(\sum_i\delta_r(z_i,z)+\alpha/N_Z\big)}{\Gamma(N_r+\alpha)}
   \prod_u \frac{\prod_x\Gamma\big(\sum_i\delta_u(x_i,x)+\beta/N_X\big)}{\Gamma(N_u+\beta)}
   \prod_{z,x} \frac{\prod_t\Gamma\big(\sum_i\delta_{z,x}(t_i,t)+\eta/N_T\big)}{\Gamma(N_{z,x}+\eta)},
\end{aligned}
\tag{A.1}
\]

where ψ, φ, and θ are, respectively, the set of distributions over interests for all users, the set of distributions over topics for all resources, and the set of distributions over tags for all interest-topic pairs. p(ψ, φ, θ) denotes the priors for ψ, φ, and θ; these are the symmetric Dirichlet priors described in Section 2.3. The normalization constant is

\[
c = \prod_r \frac{\Gamma(\alpha)}{\Gamma(\alpha/N_Z)^{N_Z}}
  \cdot \prod_u \frac{\Gamma(\beta)}{\Gamma(\beta/N_X)^{N_X}}
  \cdot \prod_{z,x} \frac{\Gamma(\eta)}{\Gamma(\eta/N_T)^{N_T}},
\]

and δ_r(z_i, z) is a function that returns 1 if z_i = z and r_i = r, and 0 otherwise (δ_u and δ_{z,x} are defined analogously). Γ is the gamma function arising from the Dirichlet distributions. N_r denotes the number of tuples associated with resource r; similarly, N_{x,z} denotes the number of tuples associated with interest x and topic z. As defined in Section 2.2, N_Z, N_X, and N_T are the numbers of possible topics, interests, and tags, respectively.
By rearranging Eq. A.1, we obtain

\[
\begin{aligned}
p(t_i,x_i,z_i;\,i=1{:}n)
&= \prod_r \frac{\Gamma(\alpha)}{\Gamma(N_r+\alpha)}
   \prod_{r,z} \frac{\Gamma\big(\sum_i\delta_r(z_i,z)+\alpha/N_Z\big)}{\Gamma(\alpha/N_Z)} \\
&\quad\cdot \prod_u \frac{\Gamma(\beta)}{\Gamma(N_u+\beta)}
   \prod_{u,x} \frac{\Gamma\big(\sum_i\delta_u(x_i,x)+\beta/N_X\big)}{\Gamma(\beta/N_X)} \\
&\quad\cdot \prod_{x,z} \frac{\Gamma(\eta)}{\Gamma(N_{x,z}+\eta)}
   \prod_{x,z,t} \frac{\Gamma\big(\sum_i\delta_{x,z}(t_i,t)+\eta/N_T\big)}{\Gamma(\eta/N_T)}.
\end{aligned}
\tag{A.2}
\]

Suppose that we have a new tuple and index it with k (say k = n+1 for convenience). From Eq. A.2, we can derive the joint probability of this new tuple k and all previous tuples as follows:

\[
\begin{aligned}
p(t_k,x_k,z_k,t_i,x_i,z_i;\,i=1{:}n)
&= \frac{\Gamma(\alpha)}{\Gamma(N_{r_k}+\alpha+1)}
   \prod_{r\neq r_k}\frac{\Gamma(\alpha)}{\Gamma(N_r+\alpha)}
   \cdot\frac{\Gamma\big(\sum_i\delta_{r_k}(z_i,z_k)+\alpha/N_Z+1\big)}{\Gamma(\alpha/N_Z)}
   \prod_{r\neq r_k,\,z\neq z_k}\frac{\Gamma\big(\sum_i\delta_r(z_i,z)+\alpha/N_Z\big)}{\Gamma(\alpha/N_Z)} \\
&\quad\cdot \frac{\Gamma(\beta)}{\Gamma(N_{u_k}+\beta+1)}
   \prod_{u\neq u_k}\frac{\Gamma(\beta)}{\Gamma(N_u+\beta)}
   \cdot\frac{\Gamma\big(\sum_i\delta_{u_k}(x_i,x_k)+\beta/N_X+1\big)}{\Gamma(\beta/N_X)}
   \prod_{u\neq u_k,\,x\neq x_k}\frac{\Gamma\big(\sum_i\delta_u(x_i,x)+\beta/N_X\big)}{\Gamma(\beta/N_X)} \\
&\quad\cdot \frac{\Gamma(\eta)}{\Gamma(N_{x_k,z_k}+\eta+1)}
   \prod_{x\neq x_k,\,z\neq z_k}\frac{\Gamma(\eta)}{\Gamma(N_{x,z}+\eta)}
   \cdot\frac{\Gamma\big(\sum_i\delta_{x_k,z_k}(t_i,t_k)+\eta/N_T+1\big)}{\Gamma(\eta/N_T)}
   \prod_{x\neq x_k,\,z\neq z_k,\,t\neq t_k}\frac{\Gamma\big(\sum_i\delta_{x,z}(t_i,t)+\eta/N_T\big)}{\Gamma(\eta/N_T)}
\end{aligned}
\tag{A.3}
\]

For the tuple k, suppose that we only know the values of x_k and t_k, while that of z_k is unknown. The joint probability of all tuples, excluding z_k, is as follows:

\[
\begin{aligned}
p(t_k,x_k,t_i,x_i,z_i;\,i=1{:}n)
&= \frac{\Gamma(\alpha)}{\Gamma(N_{r_k}+\alpha)}
   \prod_{r\neq r_k}\frac{\Gamma(\alpha)}{\Gamma(N_r+\alpha)}
   \cdot\frac{\Gamma\big(\sum_i\delta_{r_k}(z_i,z_k)+\alpha/N_Z\big)}{\Gamma(\alpha/N_Z)}
   \prod_{r\neq r_k,\,z\neq z_k}\frac{\Gamma\big(\sum_i\delta_r(z_i,z)+\alpha/N_Z\big)}{\Gamma(\alpha/N_Z)} \\
&\quad\cdot \frac{\Gamma(\beta)}{\Gamma(N_{u_k}+\beta+1)}
   \prod_{u\neq u_k}\frac{\Gamma(\beta)}{\Gamma(N_u+\beta)}
   \cdot\frac{\Gamma\big(\sum_i\delta_{u_k}(x_i,x_k)+\beta/N_X+1\big)}{\Gamma(\beta/N_X)}
   \prod_{u\neq u_k,\,x\neq x_k}\frac{\Gamma\big(\sum_i\delta_u(x_i,x)+\beta/N_X\big)}{\Gamma(\beta/N_X)} \\
&\quad\cdot \frac{\Gamma(\eta)}{\Gamma(N_{x_k,z_k}+\eta)}
   \prod_{x\neq x_k,\,z\neq z_k}\frac{\Gamma(\eta)}{\Gamma(N_{x,z}+\eta)}
   \cdot\frac{\Gamma\big(\sum_i\delta_{x_k,z_k}(t_i,t_k)+\eta/N_T\big)}{\Gamma(\eta/N_T)}
   \prod_{x\neq x_k,\,z\neq z_k,\,t\neq t_k}\frac{\Gamma\big(\sum_i\delta_{x,z}(t_i,t)+\eta/N_T\big)}{\Gamma(\eta/N_T)}
\end{aligned}
\tag{A.4}
\]

By dividing Eq. A.3 by Eq. A.4, we obtain the posterior probability of z_k given all other variables:

\[
\begin{aligned}
p(z_k \mid t_k,x_k,t_i,x_i,z_i;\,i=1{:}n)
&= \frac{\Gamma(N_{r_k}+\alpha)}{\Gamma(N_{r_k}+\alpha+1)}
 \cdot \frac{\Gamma\big(\sum_i\delta_{r_k}(z_i,z_k)+\alpha/N_Z+1\big)}{\Gamma\big(\sum_i\delta_{r_k}(z_i,z_k)+\alpha/N_Z\big)}
 \cdot \frac{\Gamma(N_{x_k,z_k}+\eta)}{\Gamma(N_{x_k,z_k}+\eta+1)}
 \cdot \frac{\Gamma\big(\sum_i\delta_{x_k,z_k}(t_i,t_k)+\eta/N_T+1\big)}{\Gamma\big(\sum_i\delta_{x_k,z_k}(t_i,t_k)+\eta/N_T\big)} \\
&= \frac{\sum_i\delta_{r_k}(z_i,z_k)+\alpha/N_Z}{N_{r_k}+\alpha}
 \cdot \frac{\sum_i\delta_{x_k,z_k}(t_i,t_k)+\eta/N_T}{N_{x_k,z_k}+\eta} \\
&= \frac{N_{r_k,z_k}+\alpha/N_Z}{N_{r_k}+\alpha}
 \cdot \frac{N_{x_k,z_k,t_k}+\eta/N_T}{N_{x_k,z_k}+\eta}.
\end{aligned}
\tag{A.5}
\]

Intuitively, we can see from Eq. A.5 that the first factor, (N_{r_k,z_k} + α/N_Z)/(N_{r_k} + α), tells us how likely resource r_k is to be described by topic z_k, while the second factor, (N_{x_k,z_k,t_k} + η/N_T)/(N_{x_k,z_k} + η), tells us how likely tag t_k is to be chosen given interest x_k and topic z_k. Similarly, we can obtain the posterior probability of x_k as we did for z_k:

\[
p(x_k \mid t_k,z_k,t_i,x_i,z_i;\,i=1{:}n)
= \frac{N_{u_k,x_k}+\beta/N_X}{N_{u_k}+\beta}
 \cdot \frac{N_{x_k,z_k,t_k}+\eta/N_T}{N_{x_k,z_k}+\eta}.
\tag{A.6}
\]

We can now generalize Eq. A.5 and Eq. A.6 for sampling the posterior probabilities of the topic z and interest x of a present tuple i given all other tuples. I define N_{r_i,z_{-i}} as the number of tuples having r = r_i and topic z, excluding the present tuple i. Similarly, N_{z_{-i},x_i,t_i} is the number of tuples having x = x_i, t = t_i, and topic z, excluding the present tuple i, and z_{-i} represents all topic assignments except that of tuple i. Eq. A.5 and Eq. A.6 can then be rewritten as follows:

\[
p(z_i \mid z_{-i},x,t)
= \frac{N_{r_i,z_{-i}}+\alpha/N_Z}{N_{r_i}+\alpha-1}
 \cdot \frac{N_{z_{-i},x_i,t_i}+\eta/N_T}{N_{z_{-i},x_i}+\eta};
\tag{A.7}
\]

\[
p(x_i \mid x_{-i},z,t)
= \frac{N_{u_i,x_{-i}}+\beta/N_X}{N_{u_i}+\beta-1}
 \cdot \frac{N_{x_{-i},z_i,t_i}+\eta/N_T}{N_{x_{-i},z_i}+\eta}.
\tag{A.8}
\]
Given source a and sink b concepts, we determine the least disagreement path i ∗ from i ∗ =max i (P i a→b ) =max i (min j {W(e ij )|e ij ∈E(P i a→b )}), where P i a→b is a path i from concept a to b, e ij is relation j of the path i; E(x) is a function which returns all relations in the path x, and W(y) returns the weight of the relation y. Considering the case in Figure B.1, the bottleneck score for anim→ insect →moth is 18 (subtracting a number of conflicting relations); anim→moth is 10; anim→ 164 bug→ moth is 4; anim→ bug→ insect→ moth is 1. Consequently, anim→ insect→ moth is chosen. Significance Test Framework Thisapproach findsmeaningful relations in the data by checking whether they are statis- tically significant. Consider a particular relation from concept a to b. Following Lerman et al. (2003), I use hypothesis testing approach to decide whether a relation a→ b is significant, i.e., highly unlikely to arise purely by chance in a given data set. In this context, the null hypothesis is that, observed relations were generated by chance, via the random, independent generation of the individual concepts. Hypothesis testing decides, at a given confidence level, whether the data supportsrejecting the null hypothesis. Sup- posen instances of a concepta were generated by a random source. The probability that a concept b (which occurs with an overall probability p in the data) was used as a child of a k times has a binomial distribution. I will reject the null hypothesis if k is larger than was expected if relations were generated by chance. In order to determine if k is large enough for rejecting the null hypothesis, I first compute cumulative probability of the binomial distribution, i.e., the probability of ob- serving at least k events. For a large n, the binomial distribution approaches a normal distribution N(x,μ,σ) with μ = np and σ 2 = np(1−p). The cumulative probability in observing at least k events is: p(x≥k) = Z ∞ x=k N(x,μ,σ)dx. (B.1) 165 I approximate the value of the integral in Eq. B.1 using approximation formulas in Abramowitz and Stegun (1964). The significance level of the test, α, is the probability that the null hypothesis is rejected even though it is true, and it is given by the cumulative probability above. Suppose we set α = 0.01. This means that we expect to observe at least k events 1% of the time under the null hypothesis. If the number of users who expressed the relation a→b is greater, I reject the null hypothesis, i.e., decide that the relation is significant. After discarding all uninformative relations using significance testing approach, we still need to select the best path out of several possible ones linking one concept to another. Since all retained relations are judged to be significant, we cannot rank paths using Network Bottleneck metric as in the Conflict Resolution framework. Instead, we cansimplyselectthelongestpath. IntheexampleinFigureB.1,supposethatallrelations are significant. Then, the path anim→bug→insect→moth will be selected. 166
Abstract
Social annotation captures the collective knowledge of thousands of users and can potentially be used to enhance an array of applications including Web search, information personalization and recommendation, and even synthesize categorical knowledge. In order to make best use of social annotation -- annotation generated by many users, we need methods that effectively deal with the challenges of data sparseness and noise, as well as take into account inconsistency in the vocabulary, interests, and the level of expertise among individual users. In this thesis, I study computational approaches to learning and integrating category knowledge in terms of topics, concepts, and hierarchical relations between them from two popular forms of social annotation: tags and personal hierarchies.