ADVANCED MACHINE LEARNING TECHNIQUES FOR VIDEO, SOCIAL AND BIOMEDICAL DATA ANALYTICS

by

Sanjay Purushotham

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2015

Copyright 2015 Sanjay Purushotham

Dedication

Dedicated to my parents and teachers.

Acknowledgements

I would like to first express my heartfelt thanks to my advisor and mentor Prof. C.-C. Jay Kuo for his guidance, encouragement and support during my PhD studies at USC. Prof. Kuo shared his wisdom, pushed me to work hard, and gave me the freedom to collaborate with other researchers and explore a diverse set of research projects. My sincere thanks to Prof. Qi Tian and Prof. Yan Liu, who mentored and guided me during different stages of my PhD. Prof. Qi Tian held in-depth research discussions and collaborated with me on social multimedia copy detection and retrieval problems. Prof. Yan Liu's Big Data Analytics course is one of the best courses I have taken at USC. She sparked my interest in machine learning research and helped me with the social recommender systems topic. I am indebted to my PhD committee, Prof. Jerry Mendel, Prof. Antonio Ortega and Prof. Keith Jenkins, for their kindness, insights and suggestions regarding my dissertation. Special thanks to my mentors Dr. Renqiang (Martin) Min and Dr. Hans Peter at NEC Labs of America, and Mr. Junaith and Mrs. Lama Nachman at Intel Labs, for providing me with valuable summer internship experiences and allowing me to explore topics related to my PhD research.

I would like to acknowledge my fellow Media Communication Lab members for stimulating discussions, fun times and memorable experiences during my graduate studies: Hang Yuan, Jiangyang Zhang, Xingze He, Sachin Chachada, Harshad Kadu, Sudeng Hu, Xiang Fu, Jian Li, Hyunsuk Ko, Shangwen Li, Jewon Kang, Martin Gawecki, Steve Cho, Jing Zhang, Tze-Ping, Dr. Kim, Abhijit Bhattacharjee, Ping-Hao Wu, Byung Tae Oh and many others. I would like to thank my buddies Srinivas Yerramalli, Karthikeyan Shanmugam, Manish Jain, Vikram Ramanarayanan, Prasanta Ghosh and Mohammed Reza Rajati for sharing good times during my PhD stay at USC.

Finally and most importantly, I would like to express my heartfelt thanks to my parents and family for their constant support and encouragement throughout my graduate studies.

Contents

List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Significance of the Research
    1.1.1 Social Multimedia Retrieval and Recommendation
    1.1.2 Sparse Learning in Biomedical Data Analytics
  1.2 Review of Previous Work
    1.2.1 Review of Video Copy Detection, Alignment and Retrieval
    1.2.2 Review of Recommender Systems
    1.2.3 Review of Sparse Learning Models
  1.3 Contributions of this Research
  1.4 Organization of the Proposal

2 Research Background
  2.1 Introduction
  2.2 Image Local Features
    2.2.1 Detectors
    2.2.2 Descriptors
  2.3 Image/Video Representation
    2.3.1 Bag-of-Visual Words
    2.3.2 Visual Codebooks
  2.4 Longest Common Subsequence Problem
  2.5 Collaborative Filtering
  2.6 Matrix Factorization
  2.7 Latent Dirichlet Allocation
  2.8 Variational Bayesian Methods
    2.8.1 Variational Bayesian EM
  2.9 Regression and Binary Classification

3 Social Multimedia Copy Detection and Retrieval
  3.1 Introduction
  3.2 Related Work
  3.3 Picture-in-Picture (PiP) Video Copy Detection
  3.4 Our Video Copy Detection System
  3.5 Spatial Coding
    3.5.1 Log-Polar Spatial Code Representation
  3.6 Spatial Verification Approaches
    3.6.1 Inconsistency Sum Method (ISM)
    3.6.2 Maximum Clique Problem (MCP)
    3.6.3 Iterated Zero Row Sum Method (IZRS)
  3.7 Video Indexing and Retrieval
  3.8 Spatial Verification Experimental Results
    3.8.1 Datasets
    3.8.2 Discussion
  3.9 Partial-Near Duplicate Video Alignment
  3.10 Efficient Subsequence Based Temporal Alignment Algorithm
    3.10.1 Review of Sequence Alignment Techniques
    3.10.2 Video Sequence Alignment
    3.10.3 Experimental Results
    3.10.4 Datasets for Partial Near-Duplicate Video Alignment
  3.11 Future Work
    3.11.1 Incorporating Social Network Information for Video Copy Detection and Alignment
  3.12 Conclusions

4 Recommendation Systems for Social Media Users
  4.1 Introduction
  4.2 Related Work
  4.3 Proposed Approach
    4.3.1 Parameter Learning
    4.3.2 Prediction
  4.4 Experimental Setup
    4.4.1 Description of Datasets
    4.4.2 Evaluation
    4.4.3 Experimental Settings
  4.5 Experimental Analysis
    4.5.1 Comparisons
    4.5.2 Impact of Parameters λ_v, λ_q
    4.5.3 Content Representation of Items
    4.5.4 Examining User Latent Space
    4.5.5 Computational Time Analysis
    4.5.6 Discussion on Social Network Structure
  4.6 Future Work
  4.7 Conclusions

5 Personalized Group Recommender Systems
  5.1 Introduction
  5.2 Related Work
  5.3 Our Approach
    5.3.1 Problem Statement
    5.3.2 Groups and Group Dynamics
  5.4 Personalized Collaborative Group Recommender Models
    5.4.1 Parameter Learning
    5.4.2 Prediction
  5.5 Experiments
    5.5.1 Dataset Description
    5.5.2 Gowalla Dataset Characteristics
    5.5.3 Dominant User Characteristics in LBSN
    5.5.4 Meetup Dataset Characteristics
    5.5.5 Offline Social Group Event Participation in EBSN
    5.5.6 Offline-Online Social Interactions in EBSN
    5.5.7 Offline Social Group Properties in EBSN
  5.6 Experimental Settings
    5.6.1 Evaluation Metrics
    5.6.2 Evaluated Recommendation Approaches
  5.7 Results
    5.7.1 Impact of Model Parameters
    5.7.2 Dominant User Influence in LBSN
    5.7.3 Learned Group Preferences vs. Aggregating User Preferences
    5.7.4 Examining Latent Spaces of Groups and Events
    5.7.5 Computational Time Analysis
    5.7.6 Discussion
  5.8 Summary and Future Work

6 Factorized Sparse Learning Models with Interpretable High Order Feature Interactions
  6.1 Introduction
  6.2 Related Work
  6.3 Problem Formulation
  6.4 Our Approach
  6.5 Theoretical Properties
    6.5.1 Asymptotic Oracle Properties when n → ∞
    6.5.2 Asymptotic Oracle Properties when p_n → ∞ as n → ∞
  6.6 Optimization
    6.6.1 Sub-Gradient Methods
    6.6.2 Soft-Thresholding
  6.7 Experiments
    6.7.1 Datasets
    6.7.2 Experimental Design
    6.7.3 Performance on Synthetic Dataset
    6.7.4 Classification Performance on RCC
    6.7.5 Computational Time Analysis
  6.8 Conclusions and Future Work
  6.9 Appendix
    6.9.1 Proofs for Section 6.5
    6.9.2 Computing Adaptive Weights

7 Knowledge based Factorized High Order Sparse Learning Models
  7.1 Introduction
  7.2 Related Work
  7.3 Notations and Problem Formulation
  7.4 Group FHIM
    7.4.1 Overlapping Group FHIM
  7.5 Optimization
    7.5.1 Spectral Projected Gradient
    7.5.2 Greedy Alternating Optimization
  7.6 Theoretical Properties
    7.6.1 Asymptotic Oracle Properties when n → ∞
    7.6.2 Properties of Overlapping Group FHIM
  7.7 Experiments
    7.7.1 Optimization Settings
    7.7.2 Datasets
    7.7.3 Experimental Design and Evaluation Metrics
    7.7.4 Performance on Synthetic Dataset
    7.7.5 Classification Performance on RCC Samples
    7.7.6 Gene Expression Prediction from ChIP-Seq Signals
    7.7.7 Peptide-MHC I Binding Prediction
    7.7.8 Computational Time Analysis
  7.8 Conclusions
  7.9 Appendix
    7.9.1 Proofs for Section 7.6

8 Conclusions and Future Work
  8.1 Summary of the Research
    8.1.1 Social Multimedia Retrieval and Recommendation
    8.1.2 Sparse Learning Models
  8.2 Future Research Topics

Reference List

List of Tables

3.1 Position and orientation of visual features w.r.t. the log-polar plot centered at V_1
3.2 Spatial map for Figure 3.4
3.3 Computation time
3.4 Complexity analysis
4.1 Dataset description
4.2 Social network statistics of the Last.fm dataset
4.3 User latent space interpretation for item + user tags. Tags highlighted in red correspond to music genres/sub-genres, tags highlighted in blue correspond to similar artists, albums, labels or band names, and tags highlighted in green correspond to locations or other noisy tags. Please note that some tags can be highlighted by more than one color, since the tags are ambiguous. It is suggested to view this table in color
4.4 User latent space interpretation for item tags. Tags highlighted in red correspond to music genres/sub-genres, tags highlighted in blue correspond to similar artists, albums, labels or band names, and tags highlighted in green correspond to locations or other noisy tags. Some tags can be highlighted by more than one color, since the tags are ambiguous
4.5 User latent space interpretation for user tags. Tags highlighted in red correspond to music genres/sub-genres, tags highlighted in blue correspond to similar artists, albums, labels or band names, and tags highlighted in green correspond to locations or other noisy tags. Some tags can be highlighted by more than one color, since the tags are ambiguous
4.6 Time complexity comparison of our model with respect to the CTR model (K = 200, convergence rate = 10^-6)
4.7 Time complexity comparison of our model for varying latent space dimensions (convergence rate = 10^-5)
5.1 Gowalla dataset description
5.2 Meetup dataset description
5.3 User 1 and User 2 group location check-ins
5.4 Performance comparison on the Meetup dataset
5.5 Performance comparison on the Gowalla dataset
5.6 Learned group preferences vs. aggregating user preferences for the Meetup dataset
5.7 Latent topics for an offline social group in EBSN. We list the top 5 events from the test data that were recommended by our model. The last column shows whether the group actually participated in the event and whether the group-event distance was < 50 km
5.8 Time comparison for K = 50
5.9 Time comparison of our models by varying K on Gowalla
6.1 Performance comparison for synthetic data on a linear regression model with high-order interactions. Prediction error (MSE) and the standard deviation of the MSE (shown in brackets) on test data are used to measure model performance. For p ≥ 500, Hierarchical Lasso (HLasso) has heavy computational complexity, hence we do not show its results here
6.2 Performance comparison for a synthetic dataset on a logistic regression model with high-order interactions. Misclassification error on test data is used to measure model performance
6.3 Comparison of optimization methods for our FHIM model based on test-data prediction error
6.4 Support recovery of β, W
6.5 Recovering K using a greedy strategy
7.1 Peptide-MHC I binding datasets
7.2 ROC scores on synthetic data with non-overlapping groups: case 1) q = 5100, n = 1000, |G| = 25; case 2) p = 250, n = 100, |G| = 25. Note: we were not able to run the HLasso R package due to its high computational complexity
7.3 ROC scores on synthetic data with overlapping groups: case 1) q = 5100, n = 1000, |G| = 10; case 2) p = 250, n = 100, |G| = 10
7.4 Gene expression prediction from ChIP-Seq signals
7.5 Peptide-MHC I binding prediction AUC scores

List of Figures

2.1 Example of corner interest keypoints extracted from an image
2.2 An illustration of the process of constructing the vocabulary tree (105). The hierarchical quantization is defined at each level by k centers (in this case k = 3) and their Voronoi regions
2.3 Graphical model (plate notation) representation of LDA. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document
3.1 Example of a Picture-in-Picture video keyframe
3.2 Block diagram of our video copy detection system
3.3 Block diagram of the retrieval system
3.4 Example of log-polar spatial codes
3.5 Visualization of S_diff as an undirected graph
3.6 PiP query types
3.7 Average retrieval accuracy for the T2 query when the small dataset is indexed
3.8 Average retrieval accuracy for the T9 query when the small dataset is indexed
3.9 Average retrieval accuracy for the T10 query when the small dataset is indexed
3.10 Comparison of various geometric verification approaches. The T10 query is used and the large dataset is indexed
3.11 Complexity curves for LP-MCP (Robson), LP-MCP (Bron-Kerbosch) and LP-ISM (note: certain constants are ignored in plotting these complexity curves)
3.12 Video alignment configurations
3.13 Example of the partial near-duplicate video alignment k-crossover configuration
3.14 Partial near-duplicate video alignment results on our query dataset
3.15 Parallelization framework for LCS-based algorithms
4.1 Collaborative Topic Regression model (135). Here, θ is the topic proportions of the LDA model, U is the user random variable, V is the item random variable, W is the observed words of items, r is the observed user ratings, K is the number of topics, α is the Dirichlet prior, and Z is the latent variable
4.2 Proposed model - CTR with SMF; the CTR part is shown in red and the Social Matrix Factorization (SMF) part in blue
4.3 Recall of the in-matrix prediction task for the CTR model by varying the content parameter λ_v and fixing the number of recommended items, i.e., M = 250. Dataset used: hetrec2011-lastfm-2k
4.4 Comparison of recall for CTR and our proposed model by varying λ_v and fixing M = 250; our model is indicated by CTR with SMF. Left plot: hetrec2011-lastfm-2k; right plot: hetrec2011-delicious-2k
4.5 Recall comparison of various models for the in-matrix prediction task by varying the number of recommended items M and fixing λ_v = 100. Our model is indicated by CTR with SMF; PMF indicates matrix factorization (CF). Dataset used: hetrec2011-lastfm-2k
4.6 Plots of in-matrix prediction recall for the proposed model by varying the content parameter λ_v and the social network parameter λ_q, fixing M = 250. Dataset used: hetrec2011-delicious-2k
4.7 Recall of the in-matrix prediction task for our proposed model by varying the content parameter λ_v and the social network parameter λ_q at M = 250 and K = 200. Dataset used: hetrec2011-lastfm-2k
4.8 Plots of in-matrix prediction recall for the proposed model by varying the content parameter λ_v and the social network parameter λ_q at M = 250 and K = 200. Dataset used: hetrec2011-lastfm-2k
4.9 Zoomed-in plots of in-matrix prediction recall for the proposed model by varying the content parameter λ_v and the social network parameter λ_q at M = 250 and K = 200. Dataset used: hetrec2011-lastfm-2k
4.10 Plots of in-matrix prediction recall for the proposed model by varying the content parameter λ_v and the social network parameter λ_q, fixing M = 250 and K = 50. Only 'user tags' are used for modeling the content information
4.11 Plots of in-matrix prediction recall for the proposed model by varying the content parameter λ_v and the social network parameter λ_q, fixing M = 250 and K = 50. Only 'item tags' are used for modeling the content information
4.12 Plots of in-matrix prediction recall for our proposed model by varying the content parameter λ_v and the social network parameter λ_q, fixing M = 250 and K = 50. Both the 'user and item tags' are used for modeling the content information
4.13 Recall of the in-matrix prediction task for our proposed model by varying the content parameter λ_v at M = 250 and K = 50
4.14 Recall of the in-matrix prediction task for our proposed model by varying the social network parameter λ_q at M = 250 and K = 50
4.15 Recall of the proposed model by varying the social network structure. Dataset used: hetrec2011-delicious-2k
4.16 Incorporating Limited Attention into our CTR with SMF model
5.1 Groups in LBSN
5.2 Personalized Collaborative Group Recommendation Systems (PCGR and PCGR-D)
5.3 Gowalla group location characteristics
5.4 Gowalla group location characteristics (please view this figure in color)
5.5 User group memberships
5.6 User behavior in different groups, from group location check-ins. These example plots are for 2 different users (User 1, User 2) who are in 7 different groups
5.7 Meetup offline groups and event localities
5.8 Meetup offline groups and event localities
5.9 Impact of λ_G and λ_A on prediction accuracy for the PCGR-D model using the Gowalla dataset
5.10 Dominant user influence in Gowalla
6.1 Support recovery of β (90% sparse) and W (99% sparse) for synthetic data, Case 1: n > p and q > n, where n = 1000, p = 50, q = 1275
6.2 Support recovery of W (99.5% sparse) for synthetic data, Case 2: p > n, where p = 500, n = 100
6.3 Comparison of the classification performance of different feature selection approaches with our model in identifying the different stages of RCC. We perform five-fold cross-validation five times and report the average AUC score. For updated results, please refer to Chapter 7 (Figure 7.2)
7.1 Support recovery of W_OD (95% sparse) for synthetic data, q = 5100, n = 1000
7.2 Comparison of the classification performance of different feature selection approaches with our model (OvGroup FHIM) in identifying the different stages of RCC
7.3 Interpretable interactions identified by OvGroup FHIM for predicting gene expression from ChIP-Seq signals
7.4 Interaction feature factor coefficients for A2402
7.5 ROC curves for A0206

Abstract

Advanced machine learning techniques are developed in this thesis to tackle challenging problems in three Big Data application domains: 1) partial near-duplicate video copy detection and alignment for the video application, 2) personalized single-user and group recommender systems for the social media data application, and 3) identification of discriminative feature interactions for gene expression prediction and cancer stage prediction for the biomedical data application. Novel and suitable machine learning tools are designed to meet the nature of the data in each specific application domain.

For the video data application, we propose a novel spatio-temporal verification and alignment algorithm to accurately detect and retrieve partially near-duplicate videos from online video-sharing networks. We propose a generalized spatial coding and spatial verification scheme for video keyframe representation and matching, respectively. We introduce efficient subsequence-matching-based algorithms for temporal alignment, which can be made scalable using a parallelization framework. We also propose a novel way to use social network information to reduce the computational cost of social media alignment and retrieval.

For the social media data application, we investigate the effectiveness of social network information for content recommendation to users. In particular, we propose a novel hierarchical Bayesian modeling framework that incorporates both topic modeling of contents and matrix factorization of social networks to automatically infer topics in the latent space and provide interpretable recommendations. Our models reveal interesting insights by showing that social circles can have more influence on people's decisions about the usefulness of content than personal taste.
Empirical experiments on large-scale datasets show that our proposed techniques for retrieval and recommendation outperform existing state-of-the-art approaches. Furthermore, we develop a new group recommender systems framework to model group dynamics and to personalize location/event recommendations to groups of users.

For the biomedical data application, we propose a novel scalable knowledge-based high-order sparse learning framework, called the Group Factorized High order Interactions Model (Group FHIM), to identify discriminative feature groups and high-order feature group interactions. This study allows us to understand disrupted gene interactions, which are causes of some diseases. Unlike previous sparse learning approaches, the proposed model can recover both the discriminative feature groups and the pairwise feature group interactions accurately, without enforcing any hierarchical feature constraints. Experiments on synthetic and real datasets show that our model outperforms other state-of-the-art sparse learning techniques, and it provides interpretable high-order feature group interactions for biomarker discovery and gene expression prediction.

Chapter 1
Introduction

We are living in the 'Big Data' era, where a tremendous amount of data is generated every day in diverse fields such as astrophysics, bioinformatics, computer vision, social networks, etc. 'Big Data' brings many challenges in storage, analysis, visualization and privacy. To address the challenges of the 'Big Data' era, new theory and scalable algorithms, architectures and systems need to be developed. In this thesis, we propose advanced machine learning techniques to address 'Big Data' challenges arising in video (multimedia), social (social media) and biomedical data analytics. In particular, we develop new and efficient systems for the retrieval and recommendation of multimedia content to social network users, and we propose a new feature selection framework for gene expression prediction and disease diagnosis.

1.1 Significance of the Research

1.1.1 Social Multimedia Retrieval and Recommendation

Ever-growing multimedia content on the Internet is revolutionizing content distribution and social interaction. Multimedia content such as photos and videos pervades social networking sites like Facebook and Twitter, and content-sharing sites such as Flickr and YouTube also support social networking. This hybrid of multimedia and social media presents a new computing paradigm and is influencing the manner in which we communicate and collaborate with each other. The hybrid of multimedia technology and social media is called social multimedia, and its computing paradigm is called social multimedia computing (128). Social multimedia supports new types of user interactions. For example, YouTube users can provide video responses to other users' videos, which creates asynchronous multimedia conversations. Massive Open Online Courses (MOOCs) provide free online training videos to students and use Google+ hangouts to enable synchronous interactions between instructors and students. Social multimedia can also provide additional context for understanding multimedia content. For example, aggregating all users' clicks/likes on a particular video might help in understanding which segments/objects of the video are interesting to these users. Clearly, social multimedia has great potential to change how we communicate and collaborate.
Social multimedia poses new challenges for both social computing and multimedia. Social computing raises new problems, such as modeling the interactions between multimedia and the social activities around it. There are two views on social multimedia computing. The first is social computing over multimedia, which is motivated by the social sciences. It is concerned with understanding multimedia technology's role in social interactions and its impact on user behavior. The second view is social-network-influenced multimedia computing. This approach emphasizes applying knowledge from the social sciences to the design of improved multimedia applications, including harvesting more accurately labelled data and deriving metadata from social activities and resources, using social network analysis and socially collected data for content understanding, exploiting social dynamics to improve multimedia communication and content protection, and employing user behavior analysis to recommend multimedia resources to users. In our research, we take this second view and revisit challenging classical multimedia problems in the context of social networks.

We are particularly interested in addressing the following two challenging multimedia problems:

1. Partial near-duplicate video copy detection. This problem is common on content-sharing websites like YouTube, YouKu, etc., and content-streaming websites like Justintv, Ustream, etc.

2. Multimedia content recommendation. This technique attempts to make recommendations to social network users by incorporating their social network information.

Both have significant impact on the academic and industrial communities as well as important real-world societal applications. For example, discovering and extracting user-community relationships based on the shared content (multimedia) from social media networks can help in the design of better advertisement strategies for media companies. Understanding tags related to shared multimedia content on social networks might help reduce the semantic gap (the gap between low-level features and high-level concepts) - one of the fundamental academic challenges of multimedia content analysis. A significant amount of research has been done in social network analysis, multimedia content analysis and other related fields. However, these results cannot be directly applied to social multimedia, since many problems such as representing, modeling and analyzing multimedia in social networks are still open. Some of the key scientific challenges of social multimedia are:

• Multimedia social dynamics. Social networks evolve over time, and the user interactions associated with multimedia also evolve over time. This dynamic nature of user interactions with multimedia influences users' behaviors and preferences. Social multimedia computing must therefore take special care to model and analyze social dynamics, and multimedia social dynamics is a key factor.

• Fusion of multimedia content, social network and context. Analysis of multimedia social networks must be closely integrated with multimedia content analysis. It should also incorporate social contextual information. These are especially important for interaction-driven social networks and social data networks (e.g., Flickr, YouTube). User preferences, social relationships and multimedia tags are the social data generally used to address social multimedia computing problems.
However, when using (sampling) social data, we should be careful not to make an independent-data-instance assumption, since social data instances are inherently related. Inference should be made jointly, which makes the fusion of multimedia content and social networks harder. Moreover, social data is generally noisier than multimedia data and, hence, care should be taken when handling it, as it may otherwise lead to undesirable results. Fusion of multimedia content, social network and context is a key challenge.

• Relationship discovery and prediction. Relationship discovery and prediction are basic computational problems in social networks. They are challenging in multimedia social networks, since they must include the social interactions around and captured in multimedia. The relationship discovery and prediction problem seeks to determine the extent to which a relationship and its evolution can be modeled via features. This problem becomes very challenging, especially if the multimedia is complicated (noisy or complex background).

The challenging multimedia problems addressed in this research, partial near-duplicate video copy detection and multimedia content recommendation, have some of the key challenging traits of social multimedia. In this thesis research, we present our novel approach to multimedia data modeling in social settings and analyze the proposed solutions. Our solution for finding partial near-duplicates in social networks can help reduce copyright infringement on content-streaming websites and help reduce illegal content uploads on YouTube. Our approach to the multimedia content recommendation system can help users discover interesting items through their social networks. Moreover, it can help them find new friends who share similar interests.

1.1.2 Sparse Learning in Biomedical Data Analytics

Advances in 'omics' (genomics, proteomics, etc.), devices, imaging and other biomedical technologies have ushered in the 'Big Data' era in the healthcare and biomedicine areas. Big data in biomedicine arises from a data deluge from two sources: 1) genomics-driven sources (genotyping, gene expression, next-generation sequencing data, etc.), and 2) health-driven sources (electronic medical records, pharmacy prescription and insurance records, etc.). These data sources usually have tens of millions of very high-dimensional data points (for example, gene features and patient temporal data), and thus standard statistics and machine learning techniques are not sufficient to handle their scale. Moreover, these data sources can be noisy or incomplete. In spite of these challenges, effective use of biomedical Big Data analytics holds great promise for risk assessment and the development of personalized medical treatments.
Iden- tifying these disrupted gene interactions for different diseases such as cancer will help us understand the underlying mechanisms of the diseases and develop effec- tive drugs to cure them. However, identifying reliable discriminative high-order gene/protein or SNP interactions for accurate disease diagnosis such as early can- cer diagnosis directly based on patient blood samples is still a challenging problem, because we often have very limited patient samples but a huge number of complex feature interactions to consider. In this thesis research, we propose a novel sparse learning framework based on weight matrix factorizations and ` 1 regularizations for identifying discrimina- tive high-order feature interactions in linear and logistic regression models, and we study several optimization methods for solving them. Our sparse learning frame- work can incorporate domain knowledge into the factorization to make it scalable and more robust. Experimental results on synthetic and real-world datasets show that our method outperforms the state-of-the-art sparse learning techniques, and it provides ‘interpretable’ blockwise high-order interactions for disease status pre- diction. Our proposed sparse learning framework is general, and can be used to 6 identify any discriminative complex system input interactions that are predictive of system outputs given limited high-dimensional training data. 1.2 Review of Previous Work There are many multimedia (video) copy detection and alignment algorithms, rec- ommender system algorithms and sparse learning models proposed in the litera- ture, which will be reviewed in detail in later chapters. In this section, we provide a brief overview of related work on these research fields. 1.2.1 Review of Video Copy Detection, Alignment and Retrieval Near-duplicate video detection and retrieval has been intensively studied recently because of the growing amount of multimedia shared on content-sharing sites and social networks. Most of the previous research has been geared towards solving the fully-duplicate or the near-duplicate video copy detection problem (59), (86). A few recent works such as (124) and (125) have focused on addressing the partial- near duplicate video copy detection problem. We can categorize existing work into three main types: signature-based, keyframe-based and trajectory-based approaches. • The signature-based approach summarizes video content with global descrip- tors for fast retrieval. Researchers have proposed different ways to describe the signature. Mostly, signatures are computed by "averaging" the global color histogram (144), ordinal (59) or temporal-ordinal (35) features of video frames. In almost all work, the temporal information is not used in the fingerprint generation and, thus, retrieval of partial near-duplicates is not 7 supported. The review conducted in (87) shows that the signature-based approach is effective only for copies with small transformations. • The keyframe-based approach samples keyframes or representative frames from video and use them for video copy detection. The used matching strategies include dynamic programming (38), (68), graph-based matching (37), windowing (144) and voting-based methods (49). The window-based method (sliding window) is sensitive to the temporal resolution and not suit- able in retrieving near-duplicates with changes in the frame rate. Dynamic programming finds the longest common subsequence and computes the edit distance to determine the near-duplicate identity. 
Dynamic programming is useful for extracting temporal entities from the correspondence set generated by visual content matching. The heuristic voting scheme proposed in (49), (42) aggregates votes from near-duplicate frames to make near-duplicate shot detection robust to individual near-duplicate false-positive frames. However, voting can only generate a coarse alignment result, which is sub-optimal and not useful for precise localization of partial near-duplicates.

• The trajectory-based approach tracks temporally coherent interest points in video to enrich keypoint features with spatiotemporal information. Traditionally, trajectories have been utilized to highlight and label different motion behaviors (86). In (145), the whole shot was represented by a bag-of-trajectories, where each trajectory in turn is described as temporal patterns of discontinuities. Although the trajectory-based approach uses temporal coherency, the extraction of trajectories is an expensive operation. Moreover, trajectory features are sensitive to camera motion and, therefore, their robustness is limited to exact copies. They are not robust for near-duplicates or partial near-duplicates, especially when viewpoint changes are involved.

1.2.2 Review of Recommender Systems

Recommendation systems predict a user's preferences or ratings on particular items. Collaborative Filtering (CF) is one of the most popular recommender system techniques and has been heavily used in industry. It automatically predicts the interests of a particular user based on the collective rating records of similar users or items. The technique has been extensively studied in the literature, e.g., (122), (118), (67). The underlying assumption in the traditional CF model is that similar users would prefer similar items. In the following, we review a few state-of-the-art techniques proposed for CF-based recommendation systems. There are mainly two CF-based approaches.

• The memory-based approach uses either user-based (61) or item-based techniques (77) for the prediction (recommendation) of ratings for items. Although this approach is easy to implement and popular, it does not guarantee good prediction results.

• The model-based approach includes several model-based learning methods, such as clustering models, aspect models (65) and latent factor models. These model-based methods, especially the latent-factor models based on matrix factorization (81), (118), have shown promise in better rating prediction, since they incorporate user interests in the model effectively (a minimal code sketch of this idea appears below).

All CF-based approaches assume that users are independent and identically distributed, and they ignore additional information, such as the content description of items and the social connections of users, while performing the recommendation task. Moreover, CF-based models suffer from the sparsity and imbalance of rating data, especially for new and infrequent users. Thus, the predicted ratings from CF models can be unreliable.

To overcome the weaknesses of the CF-based recommendation method, many models have been proposed to explore additional information, such as an item's content information (14), (135) and a user's social network information (97), (71). Collaborative Topic Regression (CTR) offers a state-of-the-art model that incorporates the content information via latent Dirichlet allocation (24) into the CF framework.
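To ground the latent-factor idea referenced above, the following minimal sketch (our own illustration with hypothetical toy data and hyperparameters, not code from this thesis) learns user and item latent vectors by stochastic gradient descent and predicts a rating as their inner product:

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=10, lr=0.01, reg=0.1, epochs=200, seed=0):
    # ratings: iterable of (user_index, item_index, observed_rating) triples
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))  # user latent factors
    V = 0.1 * rng.standard_normal((n_items, k))  # item latent factors
    for _ in range(epochs):
        for u, i, r in ratings:
            u_row = U[u].copy()
            err = r - u_row @ V[i]  # error of the current prediction u_i . v_j
            # SGD steps on the L2-regularized squared error
            U[u] += lr * (err * V[i] - reg * u_row)
            V[i] += lr * (err * u_row - reg * V[i])
    return U, V

# Toy usage: 3 users, 4 items, a handful of observed ratings
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 2, 1.0), (2, 3, 4.5)]
U, V = factorize(observed, n_users=3, n_items=4)
print(U[0] @ V[2])  # predicted rating of user 0 for the unseen item 2
```

Models such as CTR, and the models proposed in this thesis, extend exactly this factorization with content and social-network information.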
Social-network-based CF models (97) were recently proposed to find a user's like-minded friends in order to address the rating-sparsity limitation. Most existing work has focused on utilizing either the content or the social network information, but few have considered them jointly. In this proposal, we propose a novel way to incorporate both the social network information and the content information to achieve a better recommender system.

1.2.3 Review of Sparse Learning Models

Feature selection (also known as variable selection) is the process of selecting relevant features (or variables) for model construction and prediction. Feature selection has been a well-studied topic in the statistics, machine learning, and data mining literature. Regularization-based techniques are the most popular feature selection approaches for identifying discriminative features, especially for high-dimensional data. Most recent methods focus on identifying discriminative features or groups of discriminative features based on the Lasso penalty (129), Group Lasso (150), the Trace-norm (56), the Dirty model (70) and Support Vector Machines (SVMs) (123). However, reliably identifying interpretable discriminative feature interactions among high-dimensional input features with limited training data remains an unsolved problem. Identifying interpretable high-order feature interactions is an important problem in machine learning and biomedical informatics, because feature interactions often help reveal hidden domain knowledge and the structures of the problems under consideration. In the following, we review a few state-of-the-art sparse learning models that try to recover high-order feature interactions.

A recent approach (143) heuristically adds some possible high-order interactions into the input feature set in a greedy way based on lasso-penalized logistic regression. Some recent approaches (23), (40) enforce strong and/or weak heredity constraints to recover the pairwise interactions in linear regression models. In strong heredity, an interaction term can be included in the model only if the corresponding main terms are also included, while in weak heredity an interaction term is included when either of the main terms is included. However, recent progress in bioinformatics has shown that feature interactions need not follow heredity constraints in the manifestation of diseases, and thus the above approaches (23), (40) have limited chance of recovering the relevant interactions. Kernel methods such as Gaussian Processes (51) and Multiple Kernel Learning (84) can be used to model high-order feature interactions, but they can only tell which orders are important; they cannot recover the high-order feature interactions themselves. Thus, all these previous approaches either fail to identify specific high-order interactions for prediction or identify sporadic pairwise interactions in a greedy way, which is very unlikely to recover the 'interpretable' blockwise high-order interactions among features in different sub-components (for example, pathways or gene functional modules) of systems. Recently, (102) proposed an efficient way to identify combinatorial interactions among interactive genes in complex diseases by using overlapping group lasso and screening. However, they use prior information such as gene ontology in their approach, which is generally not available or is difficult to collect for some machine learning problems.
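To make the scalability issue concrete, the following sketch mimics the feature-expansion strategy behind lasso-penalized interaction models such as (143); the data, the library choices (scikit-learn) and the hyperparameters are our own hypothetical illustration, not part of this thesis:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Hypothetical data: n patient samples, p gene features, with p >> n
n, p = 100, 250
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)

# Materialize every pairwise interaction: adds p*(p-1)/2 = 31,125 columns
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = expand.fit_transform(X)
print(X_int.shape)  # (100, 31375): original 250 features + 31,125 products

# Lasso-penalized logistic regression over the expanded feature set
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_int, y)
print((clf.coef_ != 0).sum(), "selected features")
```

With only p = 250 input features there are already 31,125 candidate pairwise interactions against 100 samples, and the count grows as p(p-1)/2 (cubically for third-order terms), so explicitly enumerating interactions quickly becomes infeasible.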
Thus, there is a need to develop new, efficient techniques that automatically capture the important 'blockwise' high-order feature interactions in regression models, which is one focus of our research. In this thesis, we propose a scalable knowledge-based high-order sparse learning framework for identifying discriminative feature groups and high-order feature group interactions in regression and classification problems.

1.3 Contributions of this Research

There are several major contributions of this research. They are detailed below.

• We develop a novel spatial-coding scheme to encode the spatial distribution of visual words in a keyframe. Compared with state-of-the-art methods, the spatial-coding scheme is more efficient in terms of computational and storage costs. Furthermore, we propose both exact and approximate spatial verification algorithms to accurately detect true matches and remove false matches from the matched visual features between two keyframes. Based on the proposed spatial-coding and verification techniques, we can detect partial near-duplicates present in a large video database with high accuracy.

• We present an efficient temporal verification algorithm to accurately align common video segments between two video sequences. Our approach is based on the longest common subsequence (LCS) technique, and it is robust to video transforms and temporal attacks. The algorithm can be easily parallelized so that it is scalable to different video lengths.

• We develop a novel probabilistic Bayesian model that exploits a user's social network information and an item's content information to recommend items to users. Our model seamlessly incorporates topic modeling and matrix factorization in the CF framework for accurate recommendation. Experiments on real-world datasets show that the proposed model outperforms state-of-the-art algorithms such as CTR and matrix factorization in prediction accuracy.

• The social recommendation model provides interpretable recommendations and user profiles. In other words, the model can explain the user latent space using topics learned from the data, which helps explain why the model recommends certain items to a particular user.

• We propose a class of collaborative-filtering-based Bayesian models that can personalize recommendations to a group of users. Our novel framework models group dynamics such as user-user interactions, user-group membership and user influence, and it infers the group preferences. Experiments on Location-Based Social Networks (LBSN) and Event-Based Social Networks (EBSN) show that our modeling framework impressively outperforms state-of-the-art group recommenders, and it provides interpretable recommendations.
• Our contributions for factorized sparse learning models are as follows: (1) We propose a method capable of simultaneously identifying both informative single discriminative features and discriminative block-wise high-order inter- actions, and which is extended to incorporate domain knowledge, moreover it can be easily extended to handle arbitrarily high-order feature interac- tions; (2) Our method works on high-dimensional input feature spaces and ill-posedproblemswithmuchmorefeaturesthandatapoints, whichistypical for biomedical applications such as biomarker discovery and cancer diagno- sis; (3) Our method has interesting theoretical properties for generalized linear regression models; (4) The interactions identified by our method lead to biomedical insight into understanding blood-based cancer diagnosis and gene expression prediction 14 1.4 Organization of the Proposal The rest of this proposal is organized as follows. In Chapter 2, we provide a techni- caloverviewofseveralbasicconceptstobediscussedinthisproposal. InChapter3, we first propose a generalized spatial coding scheme and a novel spatial verification algorithm for partial near-duplicate video copy detection. Then, we present exper- imental results to show that our spatial verification techniques, are quite robust to challenging video transformations such as PiP and, provide better storage and computational cost when compared to current state-of-art-techniques. In Chapter 4, we present a probabilistic graphical modeling framework which exploits the user social network information and item content information to recommend items to users. We present experimental results on music and bookmark datasets which reveal interesting insights that the social circles have more influence on people’s decisions about the usefulness of information than personal taste. In chapter 5, we present a class of Bayesian models to personalize recommendations to group of users. We show state-of-the-art performance of our group recommender systems on real-world LBSN and EBSN datasets and discuss how modeling group dynamics is better than aggregating user preferences. In chapter 6, we propose a Factor- ization based sparse learning technique called FHIM to learn interpretable high order feature interactions in regression and classification problems. In chapter 7, we propose Group FHIM to make FHIM scalable by incorporating domain knowl- edge. We conduct geneexpressionprediction and diseaseclassificationexperiments to show the performance of our Factorized Sparse Models compared to the cur- rent state of the art approaches. Finally, concluding remarks and future research extensions are pointed out in Chapter 8. 15 Chapter 2 Research Background 2.1 Introduction Some background knowledge for the current research is given in this chapter. 2.2 Image Local Features Image local feature is a pattern in the image that differs from its immediate neigh- bors. It is related to the change in some of the image properties such as color, texture, intensity, etc. andisgenerallylocalizedaroundcertainregionintheimage. Image local features can be points (keypoints), edges, image patches or histograms. Image keypoint features (especially invariant features) have been widely researched in the recent decades due to their several advantages over image global features. They have been successfully applied to a wide variety of applications such as object retrieval and recognition, near duplicate image/video detection, etc. 
Compared to global features, image local features (keypoints) characterize an image at a fine-grained level. Hessian (18), Harris (100) and DoG (96) are some of the most popular keypoint detectors, and SIFT (96) and SURF (15) are the most popular descriptors used to represent an image's local or low-level features.

2.2.1 Detectors

Detectors directly extract local features from the image based on the underlying intensity patterns. The common detectors are corner, blob and edge detectors. A first category of interest point detectors are the contour-curvature-based methods. Originally, these were mainly applied to line drawings, piecewise constant regions, and CAD-CAM images rather than natural scenes. The focus was especially on the accuracy of point localization. Recently, detectors based on intensity methods have become popular, since these methods make weak assumptions and are thus applicable to a wide range of images. Many of the intensity-based detectors are based on first- and second-order gray-value derivatives, and some use heuristics to find regions of high variance. In this section, we briefly review the intensity-differential-based detectors. Other types of detectors can be found in the review (131).

Figure 2.1: Example of corner interest keypoints extracted from an image

2.2.1.1 Hessian-based detector

The Hessian-based detector was proposed by Beaudet (18) in the late 1970s. It is based on the Hessian matrix of second-order derivatives, obtained from the Taylor series expansion of the intensity surface. The determinant of this Hessian matrix reaches a maximum for blob-like structures in the image. The Hessian matrix at a point X is given as

$$H(X, \sigma) = \begin{bmatrix} L_{xx}(X, \sigma) & L_{xy}(X, \sigma) \\ L_{yx}(X, \sigma) & L_{yy}(X, \sigma) \end{bmatrix} \qquad (2.1)$$

where σ is a Gaussian smoothing parameter. The saliency of the point X is based on the determinant of the Hessian matrix:

$$\mathrm{Hessian}(X, \sigma) = \det(H(X, \sigma)) \times \sigma^4 \qquad (2.2)$$

Construction of a scale space can be done using a proper σ_D:

$$H(X, \sigma_D) = \begin{bmatrix} L_{xx}(X, \sigma_D) & L_{xy}(X, \sigma_D) \\ L_{yx}(X, \sigma_D) & L_{yy}(X, \sigma_D) \end{bmatrix} \qquad (2.3)$$

Hessian points defined by equation 2.3 attain local extrema in the spatial and scale spaces. If we use the Laplacian-of-Gaussian function to select the proper σ_D, the result is the Hessian-Laplacian detector.

2.2.1.2 Harris Detector

The Harris detector was originally proposed by Harris and Stephens in the 1980s (60) and is based on the second moment matrix, also called the auto-correlation matrix. This matrix is often used for describing local image structures: it describes the gradient distribution in a local neighborhood of a point as follows:

$$M = \sigma_D^2\, g(\sigma_I) \times \begin{bmatrix} L_x^2(x, \sigma_D) & L_x(x, \sigma_D)\, L_y(x, \sigma_D) \\ L_x(x, \sigma_D)\, L_y(x, \sigma_D) & L_y^2(x, \sigma_D) \end{bmatrix} \qquad (2.4)$$

with

$$L_x(x, \sigma_D) = \frac{\partial}{\partial x}\, g(\sigma_D) * L(x) \qquad (2.5)$$

$$g(\sigma_D) = \frac{1}{2\pi\sigma_D^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma_D^2}\right) \qquad (2.6)$$

where σ_I and σ_D are the integration and differentiation scales, respectively, and L_x, L_y are the derivatives computed in the x and y directions, respectively. The local derivatives are computed with Gaussian kernels whose size is determined by the local (differentiation) scale σ_D. The derivatives are then averaged in the neighborhood of the point by Gaussian window smoothing of size σ_I (the integration scale). The eigenvalues of the matrix in equation 2.4 represent the two principal signal changes in orthogonal directions in the neighborhood of a point defined by σ_I. Based on this property, corners can be found as locations in the image where the image signal varies significantly in both directions, i.e., where both eigenvalues are large.
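As a minimal illustration of these two responses (our own NumPy/SciPy sketch, not the implementation used in this thesis), the following code evaluates the Hessian response of equation 2.2 and the per-pixel eigenvalues of the second moment matrix of equation 2.4 for a grayscale float image. Note that it extracts the eigenvalues explicitly at every pixel, which is exactly the cost that the cornerness measure below avoids:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(img, sigma):
    # Second-order Gaussian derivatives (axis 0 = y, axis 1 = x), cf. Eq. 2.1
    Lxx = gaussian_filter(img, sigma, order=(0, 2))
    Lyy = gaussian_filter(img, sigma, order=(2, 0))
    Lxy = gaussian_filter(img, sigma, order=(1, 1))
    # Scale-normalized determinant of the Hessian, cf. Eq. 2.2
    return (Lxx * Lyy - Lxy ** 2) * sigma ** 4

def second_moment_eigenvalues(img, sigma_d, sigma_i):
    # First-order derivatives at the differentiation scale, cf. Eq. 2.5
    Lx = gaussian_filter(img, sigma_d, order=(0, 1))
    Ly = gaussian_filter(img, sigma_d, order=(1, 0))
    # Average the gradient outer products over a sigma_i window, cf. Eq. 2.4
    Mxx = sigma_d ** 2 * gaussian_filter(Lx * Lx, sigma_i)
    Mxy = sigma_d ** 2 * gaussian_filter(Lx * Ly, sigma_i)
    Myy = sigma_d ** 2 * gaussian_filter(Ly * Ly, sigma_i)
    # Closed-form eigenvalues of the symmetric 2x2 matrix M at every pixel
    trace, det = Mxx + Myy, Mxx * Myy - Mxy ** 2
    disc = np.sqrt(np.maximum(trace ** 2 / 4.0 - det, 0.0))
    return trace / 2.0 + disc, trace / 2.0 - disc
```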
2.2.1.2 Harris Detector

The Harris detector was originally proposed by Harris and Stephens in the 1980s (60) and is based on the second moment matrix, also called the auto-correlation matrix, which is often used for describing local image structures. This matrix describes the gradient distribution in a local neighborhood of a point:

\[ M = \sigma_D^2 \, g(\sigma_I) \ast \begin{pmatrix} L_x^2(x,\sigma_D) & L_x(x,\sigma_D)L_y(x,\sigma_D) \\ L_x(x,\sigma_D)L_y(x,\sigma_D) & L_y^2(x,\sigma_D) \end{pmatrix} \qquad (2.4) \]

with

\[ L_x(x,\sigma_D) = \frac{\partial}{\partial x}\, g(\sigma_D) \ast L(x) \qquad (2.5) \]

\[ g(\sigma) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2+y^2}{2\sigma^2}\right) \qquad (2.6) \]

where \(\sigma_I, \sigma_D\) are the integration and differentiation scales respectively, and \(L_x, L_y\) are the derivatives computed in the x and y directions respectively. The local derivatives are computed with Gaussian kernels whose size is determined by the local (differentiation) scale \(\sigma_D\). The derivatives are then averaged in the neighborhood of the point by Gaussian window smoothing of size \(\sigma_I\) (integration scale). The eigenvalues of the matrix in equation (2.4) represent the two principal signal changes in the orthogonal directions in the neighborhood of a point defined by \(\sigma_I\). Based on this property, corners can be found as locations in the image where the image signal varies significantly in both directions, i.e., where both eigenvalues are large.

Harris proposed a cornerness measure which combines the two eigenvalues in a single quantity (and is computationally less expensive):

\[ \mathrm{cornerness} = \det(M) - \lambda\, \mathrm{trace}^2(M) \qquad (2.7) \]

with \(\det(M)\) and \(\mathrm{trace}(M)\) being the determinant and trace of the matrix M. High values of the cornerness measure correspond to both eigenvalues being large. Consequently, a local interest point is identified where a pixel attains a local maximum of cornerness; the local maxima can be identified after non-maximal suppression. Scale invariance is achieved at each detected point by choosing an appropriate scale, which involves seeking local extrema in scale space. In the scale-adapted second moment matrix of equation (2.4), \(\sigma_I\) determines the scale of the local region centered at point X. Different values of the parameter \(\sigma_I\) result in different local maxima of equation (2.7); however, not all maxima are meaningful. We choose only those local maxima (local extrema in scale space) which are stable and coincide well with local structures (91). The parameter \(\sigma_D\) is generally chosen as a constant ratio (between 0 and 1) of \(\sigma_I\), which simplifies the problem of seeking proper parameters and reduces it to searching for local extrema in scale space.

\[ \mathrm{LoG}(X,\sigma_I) = \sigma_I \left( L_{xx}(X,\sigma_I) + L_{yy}(X,\sigma_I) \right) \qquad (2.8) \]

where \(L_{gg}\) denotes the second-order derivative in direction g. If the region saliency is measured by the Laplacian-of-Gaussian function (equation 2.8) in scale space, then the local extrema can be defined more precisely, and this results in the Harris-Laplacian detector. For the Harris-Laplacian detector, the cornerness measure of equation (2.7) is applied at each pixel. This process is repeated at multiple scales (by increasing \(\sigma_I\) in constant, discrete steps) and multiple octaves. The final keypoints are localized in X-Y space where equation (2.8) attains a local extremum and equation (2.7) attains a local maximum simultaneously.
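A minimal sketch of the cornerness computation of equations (2.4)-(2.7) follows; the value 0.04 for \(\lambda\) and the helper name are illustrative assumptions, not values fixed by the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_cornerness(image, sigma_d, sigma_i, lam=0.04):
    """Harris cornerness map (equation 2.7) from the second moment matrix (2.4)."""
    # First-order Gaussian derivatives at the differentiation scale (equation 2.5)
    Lx = gaussian_filter(image, sigma_d, order=(0, 1))
    Ly = gaussian_filter(image, sigma_d, order=(1, 0))
    # Average the gradient products over the integration scale sigma_i
    Mxx = gaussian_filter(Lx * Lx, sigma_i) * sigma_d ** 2
    Myy = gaussian_filter(Ly * Ly, sigma_i) * sigma_d ** 2
    Mxy = gaussian_filter(Lx * Ly, sigma_i) * sigma_d ** 2
    det_M = Mxx * Myy - Mxy ** 2
    trace_M = Mxx + Myy
    return det_M - lam * trace_M ** 2
```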
2.2.2 Descriptors

The keypoints localized in X-Y and scale space help in understanding local image structures. However, they are not discriminative enough to be readily used in image retrieval applications. Thus, we need an efficient way to represent the local structures around the keypoints. A descriptor provides such an efficient representation: it converts the intensity map surrounding a keypoint into a feature vector. SIFT and SURF are two of the most popular descriptors in the computer vision field, and we briefly describe them in this section.

2.2.2.1 Scale-Invariant Feature Transform - SIFT

SIFT is a popular image local feature which has been successfully used in many applications such as object classification, image matching and image registration. The SIFT approach to image feature generation takes an image and transforms it into a large collection of local feature vectors (96). Each of these feature vectors is invariant to scaling, rotation or translation of the image. SIFT features are also very resilient to the effects of noise in the image, and the approach shares many features with neuron responses in primate vision. To extract these features, the SIFT algorithm applies a four-stage filtering approach:

1. Scale-space extrema detection: The first stage searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation.

2. Keypoint localization: At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.

3. Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale and location of each feature, thereby providing invariance to these transformations.

4. Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

The keypoint descriptor typically uses a set of 16 histograms, aligned in a 4x4 grid, each with 8 orientation bins: one for each of the main compass directions and one for each of the mid-points of these directions. This results in a SIFT feature vector with 128 dimensions. In addition to the 128-dimensional descriptor, the location (in image coordinate space), scale and orientation are associated with each SIFT feature vector.

2.2.2.2 Speeded Up Robust Features - SURF

SURF is a robust local feature detector developed by Herbert Bay et al. (16) in 2006. It was mainly inspired by SIFT and was designed to be faster and more robust than SIFT. It is used in several computer vision tasks such as object recognition, image registration, image retrieval and 3D reconstruction. SURF is based on sums of 2D Haar wavelet responses and makes efficient use of integral images. It uses an integer approximation to the determinant-of-Hessian blob detector, which can be computed extremely quickly with an integral image. For features, it uses the sum of the Haar wavelet responses around the point of interest, which can again be computed with the aid of the integral image.
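In practice, SIFT keypoints and descriptors can be extracted with off-the-shelf libraries. A minimal OpenCV-based sketch is shown below; the file name keyframe.jpg is a placeholder, and this is simply one common way to obtain the features described above, not the implementation used in this work.

```python
import cv2

img = cv2.imread("keyframe.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint carries the properties used later in this proposal:
# kp.pt (x, y location), kp.size (scale) and kp.angle (orientation);
# descriptors is an N x 128 array of SIFT vectors.
for kp in keypoints[:3]:
    print(kp.pt, kp.size, kp.angle)
```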
2.3 Image/Video Representation

Traditionally, depending on the application, images and videos have been represented using histograms of pixels, edges, textures, interest points, low-level features or global features. One of the most popular image/video representations for retrieval applications is the bag-of-words model (120), which has been borrowed from the text retrieval community. Here, we briefly review the bag-of-words model and visual codebooks.

2.3.1 Bag-of-Visual Words

Bag of visual words (BOVW) is currently a popular approach to object and scene recognition in computer vision. Local features are extracted from an image/video, and the image/video is then considered as a bag of features, completely ignoring the spatial relationships among the features. Despite the lack of an efficient and effective mechanism to encode spatial information among features, BOVW is widely adopted in vision tasks. A typical BOVW-based method consists of the following stages (141):

1. Extract features. Visual features and their corresponding descriptors are extracted from local image patches. Typical visual descriptors are SIFT (96) and HOG (46). Two approaches are commonly used to determine where to extract local features: some methods extract features at detected interest points, while others densely sample local features on a regular grid of pixel locations, for example in Lazebnik et al. (88). Visual descriptors extracted from these local patches are treated as feature vectors that describe the local regions.

2. Generate a codebook and map features to visual code words. A visual codebook divides the space of visual descriptors into several regions. Features in one region correspond to the same visual code word, which is represented by an integer between 1 and the size of the codebook. An image is then encoded as a histogram of visual code words.

3. Learn and test. Various machine learning methods can be applied to the histogram representation of images. For example, SVM is a frequently used learner in BOVW models for object and scene recognition.

The quality of the visual codebook has a significant impact on the success of BOVW-based methods. Popular and successful methods for object and scene categorization typically employ unsupervised learning methods (for example, k-means clustering or a Gaussian mixture model) to obtain a visual codebook. When the dissimilarity of two feature vectors needs to be computed, the Euclidean distance is the most frequently used metric.

2.3.2 Visual Codebooks

A visual codebook divides the space of visual descriptors (such as SIFT) into several regions. Descriptors extracted from local affine-invariant regions are quantized into visual words. Features in one region correspond to the same visual code word, which is represented by an integer between 1 and the size of the codebook. An image/video can then be encoded as a histogram of visual code words. The most popular approach to generating visual words is k-means quantization of the descriptor vectors from a number of training frames. The collection of visual words is used in Term Frequency Inverse Document Frequency (TF-IDF) scoring of the relevance of an image to the query, and the scoring is accomplished using inverted files.

2.3.2.1 Vocabulary Tree

The vocabulary tree defines a hierarchical quantization that is built by hierarchical k-means clustering (105). A large set of representative descriptor vectors is used in the unsupervised training of the tree. Instead of k defining the final number of clusters or quantization cells, k defines the branch factor (number of children of each node) of the tree. First, an initial k-means process is run on the training data, defining k cluster centers. The training data is then partitioned into k groups, where each group consists of the descriptor vectors closest to a particular cluster center. The same process is then applied recursively to each group of descriptor vectors, defining quantization cells by splitting each cell into k new parts. The tree is determined level by level, up to some maximum number of levels L, and each division into k parts is defined only by the distribution of the descriptor vectors that belong to the parent quantization cell. The process is illustrated in figure 2.2.

Figure 2.2: An illustration of the process of construction of the vocabulary tree (105). The hierarchical quantization is defined at each level by k centers (in this case k = 3) and their Voronoi regions
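The sketch below illustrates the flat-codebook case: training a k-means codebook on SIFT descriptors and encoding a key-frame as a visual-word histogram. scikit-learn's KMeans is used purely for illustration; the system described later in this proposal uses hierarchical k-means (105).

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(descriptors, codebook_size=1000):
    # descriptors: stacked N x 128 SIFT vectors from the training frames
    return KMeans(n_clusters=codebook_size, n_init=4).fit(descriptors)

def bovw_histogram(codebook, frame_descriptors):
    # Quantize each descriptor to its nearest visual word, then histogram
    words = codebook.predict(frame_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / max(hist.sum(), 1)  # normalized bag-of-visual-words vector
```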
2.4 Longest Common Subsequence Problem

The Longest Common Subsequence (LCS) problem is a classical problem in computer science where the goal is to find the longest subsequence common to all sequences in a set of sequences. Traditionally it has been used in file comparison programs, and it is heavily used for finding DNA sequence alignments in bioinformatics. The LCS problem is NP-hard (98) for an arbitrary number of input sequences. The problem is solvable in polynomial time if the number of input sequences is constant, and dynamic programming is generally used to find the longest common subsequence.

The dynamic programming approach for finding the LCS of two sequences is described below. Let two sequences be defined as \(A = (a_1, a_2, \ldots, a_m)\) and \(B = (b_1, b_2, \ldots, b_n)\). Let the prefixes of A be \(A_1, A_2, \ldots, A_m\) and the prefixes of B be \(B_1, B_2, \ldots, B_n\). Let \(LCS(A_i, B_j)\) represent the longest common subsequence of prefixes \(A_i\) and \(B_j\). It is given by the following recurrence:

\[ LCS(A_i, B_j) = \begin{cases} \emptyset & \text{if } i = 0 \text{ or } j = 0 \\ LCS(A_{i-1}, B_{j-1}) + 1 & \text{if } a_i = b_j \\ \mathrm{longest}\big(LCS(A_{i-1}, B_j),\, LCS(A_i, B_{j-1})\big) & \text{if } a_i \neq b_j \end{cases} \qquad (2.9) \]

The dynamic programming approach was first proposed by Wagner and Fischer in the early 1970s (134). It has quadratic time and space complexity and is thus inefficient for long sequences. Several improvements have been published since their work; a good review of newer algorithms for the LCS problem can be found in the thesis (13).
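A direct implementation of recurrence (2.9) for two sequences is sketched below, computing the LCS length table and recovering one longest common subsequence by backtracking.

```python
def lcs(A, B):
    """Longest common subsequence of A and B via the DP of equation (2.9)."""
    m, n = len(A), len(B)
    # L[i][j] = length of the LCS of prefixes A_i and B_j
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Backtrack through the table to recover one LCS
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:
            out.append(A[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

print(lcs("ABCBDAB", "BDCABA"))  # one LCS of length 4
```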
2.5 Collaborative Filtering

Collaborative filtering (CF) is a technique used by recommender systems to produce user-specific recommendations of items based on patterns of ratings or usage (e.g., purchases), without needing additional information about either items or users. The Netflix Prize competition, which ran from October 2006 to September 2009, fueled much recent progress in the field of collaborative filtering (80). To establish recommendations, CF systems need to relate two fundamentally different entities: items and users. There are two primary approaches to facilitate such a comparison, which constitute the two main techniques of CF: the neighborhood approach and latent factor models. Neighborhood methods focus on relationships between items and/or between users; for example, an item-item approach models the preference of a user for an item based on the same user's ratings of similar items. Latent factor models, such as matrix factorization, comprise an alternative approach that transforms both items and users into the same latent factor space. The latent space tries to explain ratings by characterizing both products and users on factors automatically inferred from user feedback.

2.6 Matrix Factorization

Matrix factorization models (81) map both users and items to a joint latent factor space of dimensionality f, such that user-item interactions are modeled as inner products in that space. Accordingly, each item i is associated with a vector \(q_i \in \mathbb{R}^f\) and each user u is associated with a vector \(p_u \in \mathbb{R}^f\). For a given item i, the elements of \(q_i\) indicate the extent to which the item possesses those factors, either positive or negative. Similarly, for user u, \(p_u\) measures the extent of interest the user has in items that score highly on the corresponding factors (positive or negative). The dot product \(q_i^T p_u\) captures the interaction between user u and item i, i.e., the user's overall interest in the item's characteristics. This leads to an approximation of user u's rating of item i, denoted by \(r_{ui}\):

\[ \hat{r}_{ui} = q_i^T p_u \qquad (2.10) \]

The major challenge is computing the mapping of each item and user to the factor vectors \(q_i, p_u \in \mathbb{R}^f\). After the recommender system completes this mapping, it can easily estimate the rating a user will give to any item using equation (2.10). The above model is closely related to the singular value decomposition (SVD) used in information retrieval. However, SVD cannot be applied directly in collaborative filtering settings, since SVD involves factorization (decomposition) of the user-item matrix, which is difficult because the user-item matrix is highly sparse/incomplete (i.e., it has many zero or missing entries). Earlier systems relied on imputation to fill in the missing ratings (entries of the user-item matrix) and make the rating matrix dense; however, imputation is expensive as it increases the amount of data, and inaccurate imputation can distort the user-item matrix. Recent works (79), (107) suggest directly modeling only the observed ratings, while avoiding overfitting through regularization. To learn the latent factor vectors (\(p_u\) and \(q_i\)), the model minimizes the regularized squared error on the set of known ratings:

\[ \min_{p_*, q_*} \sum_{(u,i) \in K} (r_{ui} - q_i^T p_u)^2 + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right) \qquad (2.11) \]

where K is the set of (u, i) pairs for which \(r_{ui}\) is known in the training set. The above system learns the model by fitting the previously observed ratings. However, our goal is to generalize from those ratings so that we can predict future, unknown ratings. Thus, the system should avoid overfitting the observed data by regularizing the learned parameters, whose magnitudes are penalized. The constant \(\lambda\) controls the extent of regularization and is usually determined by cross-validation. Salakhutdinov et al. (118) offer a probabilistic interpretation of this regularization model. Equation (2.11) can be minimized using two popular approaches: stochastic gradient descent (26) and alternating least squares (162).
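A minimal stochastic gradient descent sketch for equation (2.11) follows; the learning rate, latent dimensionality and epoch count are illustrative choices, not values fixed by the text.

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, f=20, lam=0.05, lr=0.01, epochs=50):
    """Fit the matrix factorization model of equation (2.11) by SGD.

    ratings: list of (u, i, r_ui) triples, the known set K.
    """
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, f))  # user factors p_u
    Q = 0.1 * rng.standard_normal((n_items, f))  # item factors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            pu = P[u].copy()
            err = r - Q[i] @ pu                  # r_ui - q_i^T p_u
            P[u] += lr * (err * Q[i] - lam * pu)
            Q[i] += lr * (err * pu - lam * Q[i])
    return P, Q

# Predicted rating (equation 2.10): r_hat = Q[i] @ P[u]
```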
2.7 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) (24) is a generative probabilistic model for collections of discrete data such as text corpora. The basic idea is that documents (items) are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. More formally, LDA is a three-level hierarchical Bayesian model (see figure 2.3), in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. LDA assumes the following generative process for each document w in a corpus D:

1. Choose N ~ Poisson(ξ)
2. Choose θ ~ Dir(α)
3. For each of the N words w_n:
(a) Choose a topic assignment z_n ~ Multinomial(θ)
(b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n

Given a corpus, the posterior distribution (or maximum likelihood estimate) of the topics reveals the K topics that likely generated its documents. Unlike a clustering model, where each document is generally assigned to one cluster, LDA allows documents to exhibit multiple topics (i.e., documents are assigned to multiple clusters). For example, LDA can capture that one article might be about climate and statistics, while another might be about climate and politics. Since LDA is unsupervised, the themes of "climate", "statistics" and "politics" can be discovered from the corpus; the mixed-membership assumptions lead to sharper estimates of word co-occurrence patterns. Given a corpus of documents, we can use variational EM to learn the topics and decompose the documents according to them (24). Further, given a new document, we can use variational inference to situate its content in terms of the topics. In our work, we use LDA to give a content-based representation of items for the recommender system.

Figure 2.3: Graphical model (plate notation) representation of LDA. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document

2.8 Variational Bayesian Methods

Variational methods are a powerful tool from statistical physics that can be used to approximate Bayesian learning. Bayesian learning relies mainly on the marginal likelihood, which results from averaging over the parameters of the Bayesian model. In almost every scenario, these averages are analytically intractable, and thus we are forced to work with approximations. Variational Bayesian methods (17) offer a fast and efficient alternative to sampling techniques; moreover, they offer an approximation in the form of a bound on the marginal likelihood.

2.8.1 Variational Bayesian EM

Let m be a model with parameters θ giving rise to an i.i.d. data set \(y = y_1, \ldots, y_n\) with corresponding hidden variables \(x = x_1, \ldots, x_n\). A lower bound on the model log marginal likelihood is

\[ F_m(q_x(x), q_\theta(\theta)) = \int d\theta \, dx \; q_x(x)\, q_\theta(\theta) \log \frac{p(x, y, \theta \mid m)}{q_x(x)\, q_\theta(\theta)} \qquad (2.12) \]

and this can be iteratively optimized by performing the following updates, using superscript (t) to denote the iteration number:

VBE step:

\[ q_{x_i}^{(t+1)}(x_i) = \frac{1}{Z_{x_i}} \exp\!\left[ \int d\theta \; q_\theta^{(t)}(\theta) \log p(x_i, y_i \mid \theta, m) \right] \quad \forall i \qquad (2.13) \]

where

\[ q_x^{(t+1)}(x) = \prod_{i=1}^{n} q_{x_i}^{(t+1)}(x_i) \qquad (2.14) \]

VBM step:

\[ q_\theta^{(t+1)}(\theta) = \frac{1}{Z_\theta} \, p(\theta \mid m) \exp\!\left[ \int dx \; q_x^{(t+1)}(x) \log p(x, y \mid \theta, m) \right] \qquad (2.15) \]

Moreover, these update rules converge to a local maximum of \(F_m(q_x(x), q_\theta(\theta))\).

2.9 Regression and Binary Classification

In regression, we are given a set of n real-valued target values \(y_i\) for \(i = 1, 2, \ldots, n\) and a corresponding set of n real-valued p-vectors \(X_i\) for \(i = 1, 2, \ldots, n\). The goal is to build a model which accurately predicts \(y_i\) from the corresponding p-vector \(X_i\). Binary classification is similar to regression except that the \(y_i\) take binary values \(\{-1, 1\}\). The linear least squares model is the most popular regression method, where \(y_i\) is a linear function of \(X_i\) (with a bias term b), and we find the parameters w, b of the model by minimizing the least-squares objective

\[ \min_{w,b} \sum_{i=1}^{n} \frac{1}{2} (y_i - w^T X_i - b)^2 \]

The least-squares estimator can be viewed as a maximum likelihood estimator, under the assumption that each \(y_i\) follows a Gaussian distribution with mean \(w^T X_i + b\) and variance \(\sigma^2\):

\[ p(y_i \mid X_i, w, b) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(y_i - w^T X_i - b)^2}{2\sigma^2} \right) \]

Minimizing the negative logarithm of the likelihood, \(-\sum_{i=1}^{n} \log p(y_i \mid X_i, w, b)\), with \(\sigma = 1\) (and ignoring constant terms), we arrive at the least-squares objective.

Logistic regression is the most common binary classification approach, where the assumption is that the logarithm of the odds of \(y_i\) taking value +1 is a linear function \(w^T X_i + b\). This implies that we assume \(y_i\) follows a logistic distribution with location \(w^T X_i + b\) and scale 1:

\[ p(y_i \mid X_i, w, b) = \frac{1}{1 + \exp(-y_i (w^T X_i + b))} \]

Maximum likelihood estimation of the logistic regression model is usually carried out by minimizing the negative log-likelihood, i.e.,

\[ \min_{w,b} \sum_{i=1}^{n} \log\!\left( 1 + \exp(-y_i (w^T X_i + b)) \right) \]

In general, unlike the least-squares objective, there is no closed-form solution for the parameters of the logistic regression model. However, we can obtain accurate numerical maximum likelihood estimates by minimizing the differentiable, unconstrained and convex negative log-likelihood.
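The following sketch carries out this numerical maximum likelihood estimation with a generic optimizer (SciPy's BFGS, an illustrative choice); the stable logaddexp form avoids overflow in the exponential.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logistic(X, y):
    """Numerical MLE for logistic regression; y takes values in {-1, +1}."""
    n, p = X.shape

    def nll(params):
        w, b = params[:p], params[p]
        margins = y * (X @ w + b)
        # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
        return np.logaddexp(0.0, -margins).sum()

    res = minimize(nll, np.zeros(p + 1), method="BFGS")
    return res.x[:p], res.x[p]   # (w, b)
```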
Nevertheless, for both of these models there are several reasons why we might not want to use a maximum likelihood estimate of the parameters:

• The maximum likelihood estimate results in all-non-zero coefficients \(w_i\), even though some variables might not be relevant for prediction.

• There is a high chance that the maximum likelihood estimator will overfit. This is especially true if the sample size n is small relative to the number of features p.

Subset selection and regularization methods, respectively, are commonly used to address these two issues. A good review of regularization techniques is presented in the thesis (119).

Chapter 3
Social Multimedia Copy Detection and Retrieval

3.1 Introduction

The popularity of online social networks has resulted in exponential growth of professional and user-generated multimedia such as videos and photos on the web. One of the main reasons for this enormous growth is that the role of users has changed from 'consumers' to 'creators'. As a result, a significant part of user-generated multimedia (videos in particular) consists of 'copies' or 'near-duplicates'; that is, users create their videos by copying (legally or illegally) and/or modifying videos from other sources (friends, online websites, TV streams, etc.). Thus, video copy detection (visual redundancy analysis) (86), (157) has attracted much attention recently, with various practical applications such as copyright enforcement (78), video threading (153), and video search and retrieval (120). The redundant contents (copies) can be considered either useful or negligible depending on the application. For example, detecting copies can help in copyright enforcement and in finding similar content consumption by friends, while near-duplicate copies might be less useful for video ranking.

Copied multimedia (videos) can be broadly categorized into three types: exact (full) duplicates, near-duplicates and partial near-duplicates. As the name indicates, the degree of overlap between the videos is used for this categorization, and it provides a complete picture of their relationships by outlining dependency and hierarchy among them (124). In general, fully near-duplicate videos can share the same plot but have different contextual (additional) information, while partial near-duplicate videos (refer to figure 3.12) share only certain segments of the videos with one another. Moreover, the near-duplicate and partial near-duplicate videos can vary widely, from simple formatting to complex editing transforms such as alpha gain, blurring, addition of noise, logo insertion, picture-in-picture (PiP) transform, camcording, flipping and rotation, etc.

Most existing copy detection techniques are designed to discover the fully duplicate or near-duplicate videos in a video collection, where videos are regarded as identical if a sufficient number of video segments/features match. While these approaches can handle fully duplicate videos and many near-duplicate videos, the alignment (localization) of the segments is made using ad-hoc or heuristic approaches: (49) uses a voting scheme to count the number of duplicates within different time stamps of a video and locate duplicate segments. Such heuristics will not work for partial near-duplicate videos, because they involve various difficult video transforms such as PiP, logo insertion, or combinations of them along with noise addition or blurring.
Picture-in-Picture (PiP) is a special class of partial near-duplicate video transform in which one or more videos (the inner video(s)) appear inside another video (the host video). Figure 3.1 shows an example of a PiP video key-frame. The inner video(s) are generally different from the host video in terms of visual content. PiP is quite popular in today's TV, appearing regularly in news broadcasts, commercial advertisements, talk shows, etc. The PiP service on TVs provides viewers with the flexibility to watch two or more of their favorite programs simultaneously on the same screen. Even though PiP is a very useful transform, people can use it for copyright infringement by hiding one video inside another video or overlaying one video over another. Traditional video copy detection techniques are known to perform poorly in PiP settings (TRECVID competitions, 2009-2011) (4). This motivates us to develop new effective approaches to handle copy detection in partial near-duplicate videos such as PiP (114).

Figure 3.1: Example of Picture-in-Picture video keyframe

In this chapter, we propose novel spatio-temporal verification and alignment techniques to solve the partial near-duplicate video copy detection and alignment problem. First, we present efficient spatial verification techniques to accurately retrieve (detect) video copies from the database. Then, we formulate video alignment as a temporal subsequence matching and alignment problem and provide efficient algorithms tailored to accurately localize all the partial near-duplicate video segments present in the query video. The spatial verification techniques are explained in the challenging PiP video copy detection setting, where we show that our generalized spatial verification techniques can efficiently identify video copies present in PiP. We provide approximate (polynomial-time) and exact (combinatorial-time) algorithms to address the spatial verification problem. Our main contributions for the spatial verification technique include proposing a novel framework for spatial code representation and formulating the spatial verification problem as finding a maximum clique in an undirected graph. Our spatial verification approach provides lower computation and storage costs compared to the state-of-the-art approach (161). For the temporal verification (video alignment) techniques, we propose two algorithms based on finding the common subsequences between the video sequences. Our main contributions for temporal verification include proposing an efficient DAG-LCS algorithm which achieves better computational complexity than state-of-the-art approaches.

The outline of this chapter is as follows. Related work is discussed in section 3.2. In sections 3.3 and 3.4, we discuss the PiP copy detection problem and our video copy detection system. In section 3.5, we introduce our novel log-polar spatial code representation. In section 3.6, we discuss our approaches to solving the spatial verification problem. Video indexing and retrieval are discussed in section 3.7. Experimental results and complexity analysis for PiP copy detection are reported in section 3.8. In section 3.9, we introduce the partial near-duplicate video alignment problem and discuss some related work. Then, in section 3.10, we present our subsequence alignment algorithms and show experimental results. Future work and conclusions are discussed in sections 3.11 and 3.12 respectively.
3.2 Related Work

Near-duplicate video detection and retrieval has been intensively studied recently because of the growing amount of multimedia shared on content-sharing sites and social networks. Most previous research has been geared towards solving the fully-duplicate or near-duplicate video copy detection problem (59), (86), while a few recent works such as (124) and (125) have focused on the partial near-duplicate video copy detection problem. In this section, we briefly review recent work on geometric verification approaches (used in the partial near-duplicate image search problem) and on partial near-duplicate detection and alignment approaches.

Geometric verification has been a popular approach in partial near-duplicate image retrieval systems (73), (72), (146). However, due to the expensive computational cost of full geometric verification, it is usually applied only to some top-ranked candidate images, which makes it impractical for large web-based image retrieval systems. To overcome this limitation of full geometric verification, a novel spatial coding technique was first proposed by Zhou et al. (161). Their paper addresses partial-duplicate image search using more efficient feature vector quantization and spatial coding strategies. Their approach is based on the bag-of-visual-words model (section 2.3.1). To improve the discriminative power of visual words, they quantize local features such as SIFT (section 2.2.2.1) in both descriptor space and orientation space. They then propose a spatial coding scheme to encode the relative spatial positions of local features in images (thus capturing the spatial layout of the visual words) and use a spatial verification algorithm to effectively remove the false matches of local features.

In their paper (124), Tan et al. address the problem of partial near-duplicate detection and localization. Given two videos comprising M and N frames respectively, a partial near-duplicate is defined as a temporally contiguous set of K near-duplicate frame pairs across the two videos, where K is considerably smaller than either M or N, or both. The partial-duplicate links among videos are established by partially aligning video content. The partial alignment is modeled as a network flow problem which takes into account the joint visual-temporal consistency among videos. Given two videos, a temporal network is constructed to model the visual-temporal consistency among frames. For multiple partial alignments, the network can be further split into multiple partitions, and frames in each partition are separately aligned by an efficient network flow algorithm. Temporal verification is imposed to derive the sets of must-align and cannot-align frames as constraints, which formulate the alignment problem in an iterative manner. Repeating the procedure over all video pairs, groups of partial alignments are mined from a large video collection. Even though this approach is able to recover partial near-duplicates, the flow optimization incurs a high computational cost.

3.3 Picture-in-Picture (PiP) Video Copy Detection

Picture-in-Picture can be used to visually hide one video in another video. Finding the copy of a video's content in a PiP video corresponds to PiP copy detection. Current researchers rely on the same copy detection techniques used for non-PiP videos to solve the PiP copy detection problem.
However, the results from TRECVID competitions (5) indicate that PiP copy detection is harder to solve and the results are quite unsatisfactory. Thus, we propose new spatial verification approaches to address PiP copy detection by borrowing techniques from the partial image matching problem (161).

3.4 Our Video Copy Detection System

Our video copy detection system is shown in figures (3.2) and (3.3). The system is based on the popular bag-of-words approach (120) and includes feature extraction, codebook construction, indexing and post-processing (re-ranking) stages. We adopt visual words obtained from SIFT features (96) for video key-frame representation. Each SIFT feature possesses several property values: a 128-D descriptor, a 1-D orientation value (ranging from −π to π), a 1-D scale value and the (x, y) coordinates of the key point. We apply SIFT descriptor quantization and use hierarchical k-means to find the visual words and construct the visual codebook (105). Indexing is done using inverted files. The spatial verification techniques (discussed later in section 3.6) are used as post-processing, exploiting the locations of the SIFT key points. In our retrieval system (figure 3.3), the feature descriptors are extracted from the query key-frame and matched with the visual words using inverted files, and the retrieved video list is re-ranked using the spatial verification algorithms.

Figure 3.2: Block diagram of our video copy detection system
Figure 3.3: Block diagram of retrieval system

3.5 Spatial Coding

Spatial relationships among visual words are important for identifying similar image patches between two video key-frames. Local feature matching (SIFT matching) generally results in both true and false matches. Traditionally, researchers have adopted geometric verification techniques like RANSAC (54) to remove or reduce the false matches obtained by local feature matching. However, these geometric verification techniques are known to be computationally expensive; to overcome this drawback, (161) proposed an efficient spatial coding scheme to encode the relative positions between each pair of local features in an image. They proposed two spatial maps, an X-map and a Y-map, to describe the relative spatial positions between each feature pair along the horizontal and vertical directions. Zhou et al. (161) showed that their scheme is quite robust and accurate in identifying partial-duplicate images in a large image dataset. Inspired by their work on partial near-duplicate image search, we propose a generalized spatial coding scheme which is a more efficient representation, since we use only one spatial map to represent our spatial codes. Our scheme reduces both the computation requirements and the storage space of the spatial maps by almost half compared with Zhou's approach (161). Our generalized spatial coding scheme is described in the following subsection.

3.5.1 Log-Polar Spatial Code Representation

Our generalized spatial coding scheme encodes both the relative position and the relative orientation between each pair of matched features (matched visual words) in a video key-frame. One spatial map, called the log-polar spatial map, is generated in our scheme. We use the log-polar plot representation since (21) showed that this representation is highly efficient in capturing shape patterns. Figure (3.4) shows an example of the 5-bit (32 regions) log-polar representation.
In our spatial coding scheme, we use an 8-bit spatial code to represent both the relative orientation and the relative distance of visual features with respect to each other. The first 3 bits represent the relative orientation (2^3 = 8 bins), and the remaining 5 bits (2^5 = 32 regions) represent the relative distance in the log-polar plot. The first radius of the log-polar region is defined by the scale of the visual feature (SIFT feature); the second and third radii correspond to two and four times the scale of the visual feature, and so on. An example of the log-polar representation is shown in figure (3.4). The relative orientation is given as the difference between the orientations of the visual feature pair.

3.5.1.1 Spatial Maps

A spatial map captures the spatial layout of the visual features present in a video key-frame. Table 3.1 shows the positions and orientations of the visual features in figure (3.4), and table 3.2 the resulting spatial map. Every visual feature has a spatial code with respect to every other visual feature, and each such code corresponds to an entry in the spatial map. For example, the rows of table 3.1 give the orientations and positions of visual features V_2 and V_3 with respect to the log-polar plot centered at visual feature 1 (V_1). The spatial code of V_1V_2 is given by the relative orientation between V_1 and V_2, i.e., 5π/6 − π/12 = 9π/12, and the relative position, region 26. Thus, in the spatial map shown in table 3.2, this spatial code corresponds to the entry for V_1V_2 and is represented as (4, 26), where 4 is the decimal value of the 3-bit relative orientation code and 26 is the decimal value of the 5-bit relative position code. The 8-bit spatial code of V_1V_2 is therefore 10011010. The log-polar plot is centered at each visual feature location in turn, and the relative position and relative orientation of the other features with respect to this feature are calculated and stored as spatial codes in the spatial map.

Figure 3.4: Example for log-polar spatial codes

Table 3.1: Position and orientation of visual features w.r.t. the log-polar plot centered at V_1

    Visual feature    Position    Orientation
    V_1               0           π/12
    V_2               26          5π/6
    V_3               23          π/12

Table 3.2: Spatial map for figure 3.4

           V_1      V_2      V_3
    V_1    0        4,26     0,23
    V_2    6,30     0        6,30
    V_3    0,19     4,26     0
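A sketch of how the 8-bit code could be packed from the 3-bit relative orientation bin and the 5-bit log-polar region follows. The uniform orientation binning is an assumption, since the exact bin-indexing convention behind table 3.2 is not spelled out; the function names are our own.

```python
import numpy as np

def spatial_code(rel_orientation, region):
    """Pack a 3-bit orientation bin and a 5-bit log-polar region into 8 bits."""
    # Quantize the relative orientation (a full 2*pi range) into 2^3 = 8
    # uniform bins; this uniform binning is an illustrative assumption.
    ori_bin = int(((rel_orientation % (2 * np.pi)) / (2 * np.pi)) * 8) % 8
    assert 0 <= region < 32               # 2^5 = 32 log-polar regions
    return (ori_bin << 5) | region

# Packing orientation bin 4 with region 26 reproduces the V1V2 code above:
print(format((4 << 5) | 26, '08b'))       # -> 10011010
```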
3.6 Spatial Verification Approaches

Spatial verification is the key step in comparing the spatial layouts of two video key-frames. Here we focus on copy detection in PiP videos, where one of the videos in the PiP query shares similar keyframes with some video indexed in the large video database. Due to the PiP characteristics and feature quantization error, many false feature matches appear during comparison of the spatial layouts. To have high confidence in image patch detection, we need to remove the false matches and keep only the true feature matches. Several spatial verification algorithms are proposed in this section to find the correct duplicate or similar image patches between the PiP query and target video key-frames.

3.6.1 Inconsistency Sum Method (ISM)

The Inconsistency Sum Method was proposed in Zhou's paper (161) for comparing X and Y maps between query and target images. In this section, we extend their approach with a few modifications. Let \(K_q\) denote the query key-frame and \(K_t\) the matched target key-frame, and let them share N pairs of matched features through SIFT quantization. We obtain the spatial maps \(S_q\) and \(S_t\) by spatial coding. For efficient comparison of the spatial maps, i.e., comparison of the spatial codes of the matched features, we use the logical XOR operation:

\[ V(i,j) = S_q(i,j) \oplus S_t(i,j) \qquad (3.1) \]

If all N matched pairs are true matches, then all entries of V are zero. If there are false matches, then the spatial codes of the query \(K_q\) and target \(K_t\) key-frames differ, and the XOR operation yields a non-zero value (i.e., 1) for V(i,j). These false matches represent the inconsistency in the spatial codes of the matched features. To remove them, we calculate an inconsistency sum for each matched feature. Let \(\mathcal{S}_i\) denote the inconsistency sum of the i-th matched feature:

\[ \mathcal{S}_i = \sum_j V(i,j) \qquad (3.2) \]

The value of \(\mathcal{S}_i\) is checked for being non-zero, and the false matches are removed recursively; the recursion stops when \(\mathcal{S}_i\) is zero for all remaining matched features i.

\[ i^* = \arg\max_i \mathcal{S}_i \qquad (3.3) \]

Equation (3.3) shows how the mismatched feature \(i^*\) is removed whenever \(\mathcal{S}_{i^*}\) is non-zero. After \(i^*\) is removed, V and the inconsistency sums \(\mathcal{S}_i\) are recalculated for the remaining matched features, and the other false matches are recursively removed. Thus, this spatial verification algorithm efficiently compares the spatial layouts of the query and target key-frames. However, this approach is sub-optimal, since we could end up keeping some false matches in some recursion; if this happens frequently, our geometric verification suffers. To overcome this limitation of ISM, in the next section we propose an exact algorithm (the MC method) which retains all the true matches while removing all the false matches between the matched video keyframes.
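A compact NumPy rendering of the ISM recursion follows; treating any code disagreement as a 1 in V reproduces the XOR test of equation (3.1), and the function name is our own.

```python
import numpy as np

def ism_verify(Sq, St):
    """Inconsistency Sum Method: keep indices of consistent (true) matches.

    Sq, St: N x N spatial maps (8-bit codes) for query and target key-frames.
    """
    keep = list(range(len(Sq)))
    while True:
        # Equation (3.1): 1 wherever the spatial codes disagree
        V = Sq[np.ix_(keep, keep)] != St[np.ix_(keep, keep)]
        s = V.sum(axis=1)                       # equation (3.2)
        if s.max(initial=0) == 0:
            return keep                         # all remaining matches agree
        keep.pop(int(np.argmax(s)))             # remove i* of equation (3.3)
```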
There have 46 Figure 3.5: Visualization of S diff as an undirected graph been several algorithms (126), (106) proposed to efficiently handle the MC prob- lem. Bron-Kerbosch algorithm (29) proposed in 1973 lists all the maximal cliques present in a graph. (Note: largest of all the maximal cliques gives the maximum clique). Bron-Kerbosch algorithm is one of the most efficient algorithms to find the maximal cliques present in an undirected graph. Recent research on directly finding maximum clique in a graph has resulted in new efficient algorithms. The fastest algorithm known today is from Robson (2001) (117). 3.6.3 Iterated Zero Row Sum Method (IZRS) In this section we present another novel spatial verification algorithm. This algo- rithmisanapproximateheuristicpolynomialalgorithmtofindthemaximumclique in the graph. The algorithm is as follows: 47 Algorithm 1 Iterated Zero Row Sum Method (IZRS) 1: Find S diff from the spatial maps of the target and query video key-frames (by XOR or difference operations). Initialize the maximum clique number to 0 and vertex set of maximum clique to null. 2: Consider the lower (or upper) triangular matrix of complement of S diff matrix. Denote this lower triangular matrix as S L . 3: Find the row sum of the S L matrix. Count the number of zero-sum rows (this count represents the size of one of the cliques). If the count is greater than 1 and count is greater than the maximum clique number then update the maximum clique number to count and store the clique vertices in the vertex set of maximum clique. 4: Check if row sum equals to column sum for the 1st row/column of S L . If it is so, store this row number (vertex) in a temporary dynamic array (TDA), else store it in elimination dynamic array (EDA). 5: Remove first row and first column from S L matrix and obtain the new sub- matrix S L1 . 6: Iteratively repeat steps 3-4-5 on S L1 for the remaining number of rows in the matrix S L1 . 7: Update the vertex set of the maximum clique by adding the vertices stored in TDA. 8: Now take an element from EDA and check if it is connected to all the max- imum clique vertices. If it is connected, then add it to the vertex set of the maximum clique. 9: Repeat step 8 till EDA array becomes null. 3.7 Video Indexing and Retrieval An inverted-file index structure is used for video indexing and retrieval (161). Each visual word has an entry in the index that contains the list of videos in which the visual word appears. For each indexed feature, we store its video ID. Weformulatethevideoretrievalproblemasavotingproblem. Eachvisualword in the query video key-frame votes on its matched videos. The tf-idf weighting is used to distinguish different matched features. Similarity is defined in terms of the tf-idf weighting of visual words between the query and the indexed videos. The retrieval system returns all the matched videos in the descending order of 48 similarity. Spatial verification is applied as a post processing technique to re-rank the list of retrieval videos. If a copy is found it is returned in the top-N re-ranked results. (Refer figure 3.3). 3.8 Spatial Verification Experimental Results In this section, we present our experimental settings and discuss our results for picture-in-picture video copies (partial near-duplicates). We conducted experi- ments on both small and large datasets. 3.8.1 Datasets 3.8.1.1 Small dataset Our small training/indexing dataset consists of 120 videos which are chosen from TRECVID 2010 benchmarking dataset (5). 
3.7 Video Indexing and Retrieval

An inverted-file index structure is used for video indexing and retrieval (161). Each visual word has an entry in the index that contains the list of videos in which the visual word appears; for each indexed feature, we store its video ID. We formulate the video retrieval problem as a voting problem: each visual word in the query video key-frame votes on its matched videos. The tf-idf weighting is used to distinguish different matched features, and similarity is defined in terms of the tf-idf weighting of visual words between the query and the indexed videos. The retrieval system returns all the matched videos in descending order of similarity. Spatial verification is applied as a post-processing step to re-rank the list of retrieved videos; if a copy is found, it is returned in the top-N re-ranked results (refer to figure 3.3).

3.8 Spatial Verification Experimental Results

In this section, we present our experimental settings and discuss our results for picture-in-picture video copies (partial near-duplicates). We conducted experiments on both small and large datasets.

3.8.1 Datasets

3.8.1.1 Small dataset

Our small training/indexing dataset consists of 120 videos chosen from the TRECVID 2010 benchmarking dataset (5). For each video, shot detection is performed and key-frames are extracted from each shot. In total there are more than 3,600 video key-frames in our training/indexed dataset, from which nearly 200,000 SIFT feature descriptors are extracted. The training dataset videos are indexed in the inverted files.

3.8.1.2 Large dataset

Our large training (indexing) dataset consists of around 12,620 videos, of which 12,500 are chosen from the MSRA MM v.2.0 dataset (3) and 120 from the TRECVID 2010 benchmarking dataset (5). For each video, shot detection is performed and key-frames are extracted from each shot. In total there are more than 600,000 video key-frames in our large training dataset, from which nearly 40 million SIFT feature descriptors are extracted.

3.8.1.3 Query dataset

The testing dataset (query dataset) consists of PiP videos in which the inner or host video is a copy of one of the indexed videos. We consider three PiP testing cases (along the lines of the CCD task in the TRECVID 2010 competition (5)):

• PiP Type 1: T2 query videos - the inner video of the PiP query is a video indexed in the inverted file. 120 videos of this type are used for testing.
• PiP Type 2: T9 query videos - the host video of the PiP query is a video indexed in the inverted file. 120 videos of this type are used for testing.
• PiP Type 3: T10 query videos - a T2 or T9 query with additional video transformations such as blurring, addition of noise, camcording, etc. 120 videos of this type are used for testing.

Figure (3.6) shows examples of T2, T9 and T10 query videos.

Figure 3.6: PiP Query Types

3.8.1.4 Evaluation criterion

We use the average retrieval accuracy metric to evaluate the copy-detection results on the different PiP query types. Average retrieval accuracy is defined as:

Average retrieval accuracy (for the top-N rank) = (number of correctly retrieved results in the top N) / (total retrieved results, N)

3.8.2 Discussion

We compare our results with popular state-of-the-art approaches such as the hierarchical bag-of-words approach (105), RANSAC (54) and the spatial verification algorithm of (161). Figures (3.7), (3.8) and (3.9) respectively show the average retrieval accuracies for T2, T9 and T10 PiP queries when the small dataset is indexed. These plots clearly show that both spatial verification approaches, i.e., log-polar spatial coding with the Inconsistency Sum Method (LP-ISM) and log-polar spatial coding with the Maximum Clique Problem (Bron-Kerbosch algorithm) (LP-MCP), outperform the bag-of-words model. Moreover, figure (3.7) shows that for T2 queries we get 20% and 9% improvements over the baseline hierarchical bag-of-visual-words approach for the top-10 and top-100 results respectively. Similarly, figures (3.8) and (3.9) show that we obtain 10% and 18% improvements in the top-10 results over the baseline hierarchical bag-of-visual-words approach for T9 and T10 queries respectively. Figure (3.10) shows that when the large dataset is indexed and a T10 video is queried, we obtain 30%, 18% and 4% improvements over the hierarchical bag-of-visual-words (BoVW), RANSAC and Zhou's (161) approaches respectively for the top-10 retrieval results. These experimental results show that our proposed spatial coding scheme with spatial verification algorithms outperforms popular state-of-the-art techniques. (Note: LP-IZRS gives the same results as the LP-MCP approach; to avoid cluttering the plots (figures 3.7, 3.8, 3.9) we have not shown LP-IZRS.)
Figure 3.7: Average Retrieval accuracy for T2 query when small dataset is indexed
Figure 3.8: Average Retrieval accuracy for T9 query when small dataset is indexed
Figure 3.9: Average Retrieval accuracy for T10 query when small dataset is indexed
Figure 3.10: Comparison of various geometric verification approaches. T10 query is used and large dataset is indexed

3.8.2.1 Complexity analysis

Now we compare the computational complexities of the different spatial verification techniques. Table (3.3) shows that the average computation time of the maximum clique approach (LP-MCP) is around four times less than that of the LP-ISM approach.

Table 3.3: Computation Time

    Spatial verification approach               Avg. computation time/query
    Inconsistency Sum Method                    2 seconds
    Maximum Clique - Bron-Kerbosch algorithm    0.5 seconds

Table 3.4: Complexity Analysis

    Spatial verification approach                         Complexity
    Inconsistency Sum Method                              O(n^3)
    Maximum Clique - Bron-Kerbosch algorithm (29)         O(3^{n/3})
    Maximum Clique - Robson algorithm (117)               O(1.1888^n)
    Iterated Zero Row Sum Method                          O(n^3)

However, table (3.4) clearly shows that the LP-MCP algorithms have combinatorial worst-case complexities, while the LP-ISM and LP-IZRS approaches have polynomial worst-case complexity. For a detailed comparison of the worst-case complexities of the LP-MCP and LP-ISM approaches, refer to figure (3.11). From figure (3.11) we see that if the number of vertices is less than 27, the LP-MCP (Bron-Kerbosch) algorithm is definitely faster than the LP-ISM approach, and if there are fewer than 74 matched features, the LP-MCP (Robson) algorithm is much faster than the LP-ISM approach. Moreover, we found that if there are many false matched pairs, the LP-MCP algorithms are much faster than the LP-ISM algorithm. This can be explained intuitively as follows: many false matched pairs correspond to a sparse graph (i.e., a sparse S_diff matrix) with few edges, which implies that we have to find the maximum clique among the relatively small cliques of a sparse graph, while in LP-ISM we have to remove all the false matches recursively through the inconsistency sum, which takes more computation time. Conversely, if there are many true matched pairs (n > 74), LP-ISM is much faster than the LP-MCP algorithms.

Figure 3.11: Complexity curves for LP-MCP (Robson), LP-MCP (Bron-Kerbosch) and LP-ISM (Note: certain constants are ignored in plotting these complexity curves)

3.8.2.2 Discussion on improving the complexity of LP-MCP techniques

To handle large S_diff matrices (large graphs), we can devise intelligent pre-processing steps to find the vertices which can be part of the maximum clique in a graph; that is, if we can bound the maximum clique size, the computation cost is greatly reduced. There are two approaches which can improve the computational complexity of LP-MCP for large graphs:

1. Partition the original matrix S_diff (original graph) into sub-matrices (sub-graphs), so that spatial verification can be applied to smaller matrices.
2. Find upper and lower bounds on the maximum clique size of an undirected graph.

Approach 1 can be addressed using graph decomposition algorithms.
One interesting graph decomposition technique uses clique separators and runs in polynomial time. Moreover, approach 1 can be approximately addressed through solving approach 2: if we find tight bounds on the maximum clique size, then the original large graph can be reduced to a smaller subgraph by removing vertices of small degree. Many works over the past several decades have studied bounds on the maximum clique size. Wilf (1967) (139) provided the upper bound \(\chi(G) \le 1 + \lambda_{\max}(A)\), where \(\lambda_{\max}\) is the maximum eigenvalue of the adjacency matrix (\(A = S_{diff}\)). Hoffman (1970) (64) showed the lower bound \(\chi(G) \ge 1 + \frac{\lambda_{\max}(A)}{|\lambda_{\min}(A)|}\), where \(\lambda_{\min}\) is the minimum eigenvalue of A. Lovász (1979) (95) generalized the lower and upper bounds and proved a sandwich theorem: he introduced a number θ which satisfies \(\omega(G) \le \theta \le \chi(G)\), where \(\omega(G)\) is the clique number and \(\chi(G)\) is the chromatic number of the graph G. Thus, we could reduce the computational complexities of the LP-MCP techniques by operating on smaller and sparser graphs; a small numerical sketch of these spectral bounds follows.
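The spectral quantities above are cheap to compute. The sketch below evaluates the Wilf and Hoffman quantities for a match graph; note that both are, strictly speaking, bounds on the chromatic number, used here as pruning estimates for the clique size in the same spirit as the discussion above.

```python
import numpy as np

def clique_size_estimates(S_diff):
    """Spectral estimates used to prune the search for the maximum clique.

    S_diff: 0/1 adjacency matrix of the match graph.
    """
    eigvals = np.linalg.eigvalsh(S_diff)   # eigenvalues in ascending order
    lam_min, lam_max = eigvals[0], eigvals[-1]
    upper = 1 + lam_max                    # Wilf (1967)
    lower = 1 + lam_max / abs(lam_min)     # Hoffman (1970)
    return lower, upper
```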
In the previous sections (3.4-3.8.2.2), we discussed spatial verification techniques for finding partial near-duplicate copies in a large database. However, in some applications, the partial near-duplicate copies can appear as video segments randomly arranged in the target video in the database. For example, in an advertisement tracking application, an ad agency will be interested in finding all the time slots in a video where its competitors displayed their ads. For this application, the competitor's ad can be found using spatial verification approaches, but the time slots need to be found using a temporal alignment (localization) algorithm. In the next sections, we discuss our approaches to efficiently find video alignments between two video sequences.

3.9 Partial-Near Duplicate Video Alignment

Many video copy detection systems perform video alignment in a naive or heuristic fashion. (49) uses a voting scheme to count the number of duplicates within different time stamps of a video and locate duplicate segments. Such heuristics will not work for partial near-duplicate videos, because they involve various difficult video transforms such as PiP, logo insertion, or combinations of them along with noise addition or blurring. (50) uses a simple time-shift approach with a temporal grouping algorithm to align video segments between two video sequences; this approach is not robust enough and does not work for slow-motion or fast-forwarded video transforms and video crossover configurations. (124) constructs a temporal flow network and performs alignment by solving a flow optimization problem. Even though this approach is robust to many video transforms, it has a huge computational cost due to preprocessing overhead, especially when the temporal network is large. Motivated by the limitations of the existing partial near-duplicate video alignment techniques, we propose a novel subsequence-based temporal alignment algorithm to accurately retrieve and align the video segments of the query video.

3.10 Efficient Subsequence Based Temporal Alignment Algorithm

Our temporal video sequence alignment algorithm is based on the subsequence matching approach. We are motivated to use this approach since we can design our algorithm to be robust to several temporal attacks such as fast-forward and cross-over configurations, and it has a small computational cost, especially for short query videos. In the following sections, we briefly review some popular work on sequence alignment and then present our novel subsequence matching algorithms.

3.10.1 Review of Sequence alignment techniques

Sequence alignment is a classical problem in bioinformatics and computer science and has been extensively studied; some of the famous papers in the bioinformatics field are (127), (44), (32). Sequence alignment can be divided into pairwise alignment and multiple sequence alignment (137), (142), (48), (55). When constructing an alignment, there are two main strategies. One is global alignment, which focuses on the alignment of full sequences; the other is local alignment, which focuses on matching subsequences of the input sequences. In some problems, the alignment need not be in successive regions and can be scattered, which results in an alignment problem with affine gaps. Aligning sequences without gaps can be done using BLAST and FASTA (7), while aligning sequences with gaps can be done using the Needleman-Wunsch algorithm (104) and the Smith-Waterman algorithm (121). Multiple sequence alignment (MSA) algorithms can be classified into exact, progressive and iterative algorithms; in many practical cases, MSA is based on pairwise alignment.

3.10.2 Video Sequence Alignment

In this section, we discuss our subsequence-matching based algorithms for solving the video alignment problem. Consider a query and a reference video of durations T_1 and T_2 respectively. Without loss of generality, assume the query and reference videos are uniformly sampled at 1 frame/second. Let Q = q_1, q_2, ..., q_n and R = r_1, r_2, ..., r_m denote the sampled query and reference video frame sequences, with n and m frames respectively. We can formulate the video temporal alignment problem as finding all the common subsequences between Q and R. To find a common subsequence, we need subsequence matching approaches. Subsequence matching is one of the classical problems in time-series analysis and has been well studied over the past several decades; some of the early works include (52), (7), (6), where the main goal was to find matches between subseries of a time series and a given query. The Longest Common Subsequence (LCS) problem, related to the subsequence matching problem, is a classical problem in computer science with several applications in bioinformatics: it finds the longest subsequence common to all sequences in a set of sequences. Section 2.4 briefly discusses the dynamic programming approach to solving the LCS problem. Here, we propose to use the LCS algorithm to find common video subsequences between any two video sequences. The LCS problem is generally defined for character/alphabet sequences, where each character is a unique symbol taken from some finite alphabet Σ. For video sequences, we could represent a video frame as a unique symbol (unique signature) using some signature-based approach. However, such a representation of a video frame is not robust to video transformations such as PiP, logo addition and noise addition, which are quite frequent in the videos on today's content-sharing networks. Thus, in our work, we represent a frame as consisting of a bag of visual words, which cannot be encoded as a single symbol.
3.10.2.1 Iterative LCS algorithm

The dynamic programming based LCS algorithm works well for aligning sequential video segments between two video sequences. If the video sequences have a crossover configuration between them (as shown in Figure 3.12), then it cannot localize all the partial video segment matches between the two video sequences. To overcome this limitation, we propose a simple Iterative LCS (I-LCS) algorithm, which accurately finds and localizes all the partial near-duplicate video segments between two video sequences. As the name suggests, the I-LCS algorithm runs the LCS algorithm several times on the video sequences; in each iteration, it localizes the temporally coherent alignments (from largest to smallest length) and removes them from the video sequences. The I-LCS algorithm is shown below.

Algorithm 2 Iterative LCS (I-LCS) algorithm
Input: Two video sequences Q and R
Output: All subsequences (partial alignments) between Q and R
1: Find the longest common subsequence (lcs) between Q and R
2: Remove the elements of lcs from the two sequences while maintaining the order of the sequences, obtaining two new sequences Q_1 and R_1
3: Add lcs to the Matched Video Segment List
4: Update Q and R to the new sequences (Q = Q_1, R = R_1)
5: Repeat steps (1) to (4) until no common subsequence is found between Q and R. Return the Matched Video Segment List

Complexity discussion of the I-LCS algorithm: Due to its iterative nature, each iteration of I-LCS costs O(ij) for the current sequence lengths i ≤ m, j ≤ n, and there can be up to min(m, n) iterations, giving a worst-case time complexity of O(mn · min(m, n)) and a worst-case space complexity of O(mn), where m and n are the lengths of the query and reference video sequences. These time and space complexities make I-LCS an inefficient approach for matching long video sequences, especially when there are many partial video segments to be aligned. To improve upon the complexities of I-LCS, we propose a novel graph-based partial video alignment algorithm called DAG-LCS, discussed below in Section 3.10.2.2. A sketch of I-LCS, reusing the frame-level LCS routine above, follows.
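A minimal sketch of I-LCS under the same assumptions as before (video_lcs, sim and tau are the illustrative names introduced above):

def iterative_lcs(Q, R, sim, tau=0.5):
    """I-LCS: repeatedly extract the longest common video subsequence,
    remove its frames from both sequences (preserving order), repeat."""
    # Work on index lists so removals do not disturb original frame ids.
    qi, ri = list(range(len(Q))), list(range(len(R)))
    segments = []  # Matched Video Segment List
    while True:
        pairs = video_lcs([Q[i] for i in qi], [R[j] for j in ri], sim, tau)
        if not pairs:
            break
        # Map positions in the reduced sequences back to original frame ids.
        segments.append([(qi[a], ri[b]) for a, b in pairs])
        used_q = {a for a, _ in pairs}
        used_r = {b for _, b in pairs}
        qi = [x for k, x in enumerate(qi) if k not in used_q]
        ri = [x for k, x in enumerate(ri) if k not in used_r]
    return segments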
3.10.2.2 Directed Acyclic Graph based video alignment algorithm

The I-LCS algorithm proposed in Section 3.10.2.1 can find all the common subsequences between two video sequences, but at a high computation and storage cost. The high computation cost is due to the iterative nature of I-LCS: a careful look at the algorithm reveals that the LCS comparison steps run on the same keyframe pairs multiple times, introducing redundant comparison and storage costs. The table constructed during the dynamic programming of LCS is called the LCS table, and it essentially stores all the information about the keyframe pair comparisons. Ideally, we should therefore be able to read off all the matched subsequences between two video sequences by intelligently searching this LCS table.

There has been some recent progress on finding all the common subsequences (136) and all the longest common subsequences (116), (58) between any two sequences. However, all these previous works were designed for character sequences over some finite alphabet (Σ). They are not directly applicable to the video subsequence alignment problem, since they cannot exploit the underlying structure of the LCS table constructed during video subsequence matching, and thus cannot efficiently recover all the common subsequences between the two video sequences. We therefore propose a novel graph-based video alignment algorithm, called Directed Acyclic Graph LCS (DAG-LCS), which can efficiently recover and align all the common subsequences between two video sequences. The main idea of DAG-LCS is to construct a directed acyclic graph such that the paths in the DAG (from the root nodes) correspond to all the common subsequences between the two video sequences. The key is thus the DAG construction, which is based on the dominant matches (corresponding to the dominant matches of the LCS table; note that we do not construct the LCS table itself). Our proposed algorithm is outlined in Algorithm 3.

Algorithm 3 Directed Acyclic Graph (DAG) Construction
Input: Two video sequences Q and R
Output: All subsequences (partial alignments) between Q and R
1: Use the spatial verification algorithm to find all the dominant matches (keyframe matches) between Q and R. Store the rank and coordinates of the dominant matches in a skeleton DAG (or linked list). MaximumRank is the highest rank in the DAG.
2: Add directed links to the DAG as follows:
3: For rank = 1 to MaximumRank - 1
4:   Let G be the set of all nodes at rank
5:   For k = 1 to |G|
6:     Read node coordinates (i0, j0)
7:     Addedlink = 'no'
8:     For temprank = rank + 1 to MaximumRank
9:       Read nodes at temprank; their coordinates are (i, j)
10:        If (i0 < i) and (j0 < j)
11:          Add link from (i0, j0) to (i, j)
12:          Addedlink = 'yes'
13:        Else
14:          continue
15:        EndIf
16:      If Addedlink = 'yes'
17:        break
18:      Else
19:        continue
20:      EndIf
21:    EndFor
22:  EndFor
23: EndFor
24: Run a graph traversal algorithm on the constructed DAG to find all the paths (the set P). Paths in P correspond to all common subsequences between Q and R.
25: For all paths in P
26:   Find the longest path in P and add it to the Common Subsequence Set
27:   Remove the longest path from P and delete its elements from the other paths.
28: EndFor
29: Return the Common Subsequence Set

Complexity discussion for the DAG-LCS algorithm: DAG-LCS has a worst-case time complexity of O(mn) and a worst-case space complexity of O(min(m, n)), where m and n are the lengths of the query and reference video sequences. (Note: in DAG-LCS we do not construct the LCS table, only the skeleton DAG. We could also construct the DAG from the LCS table, but in that case the space complexity would become O(mn).) The time and space complexities of DAG-LCS are much better than those of I-LCS. Note that for matching longer video sequences with k-crossover configurations, DAG-LCS still incurs a large computation cost due to its quadratic time complexity. A simplified sketch of the construction and path extraction follows.
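A simplified Python sketch under the assumptions above: dominant matches are given as (rank, i, j) triples from spatial verification, rank 1 being the strongest. For clarity, the sketch enumerates paths explicitly, so it does not reproduce the O(mn) bookkeeping of the full algorithm.

from collections import defaultdict

def dag_lcs(matches):
    """Build the DAG of dominant matches (Algorithm 3) and greedily read
    off the common subsequences from longest to shortest."""
    by_rank = defaultdict(list)
    for rank, i, j in matches:
        by_rank[rank].append((i, j))
    ranks = sorted(by_rank)
    succ, has_parent = defaultdict(list), set()
    for a, ra in enumerate(ranks):
        for (i0, j0) in by_rank[ra]:
            for rb in ranks[a + 1:]:
                linked = False
                for (i, j) in by_rank[rb]:
                    if i0 < i and j0 < j:          # temporally consistent
                        succ[(i0, j0)].append((i, j))
                        has_parent.add((i, j))
                        linked = True
                if linked:                          # stop at first linking rank
                    break
    def paths(node):                                # DFS path enumeration
        return [[node] + p for c in succ[node] for p in paths(c)] or [[node]]
    roots = [n for r in ranks for n in by_rank[r] if n not in has_parent]
    all_paths = [p for root in roots for p in paths(root)]
    segments = []
    while all_paths:                                # longest-first extraction
        best = max(all_paths, key=len)
        segments.append(best)
        used = set(best)
        remaining = []
        for p in all_paths:
            if p is best:
                continue
            q = [n for n in p if n not in used]
            if q:
                remaining.append(q)
        all_paths = remaining
    return segments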
3.10.3 Experimental Results

In this section, we present our experimental results for the partial near-duplicate video alignment problem under different alignment configurations (Figure 3.12). We used the TRECVID 2010 training corpus (4) for our reference videos and generated query videos (from the same TRECVID 2010 corpus) with different video transformations, such as PiP and noise addition, and different video configurations (Figure 3.12).

Figure 3.12: Video alignment configurations

3.10.4 Datasets for Partial Near-Duplicate Video Alignment

3.10.4.1 Query Dataset

We generated a query dataset consisting of 600 short videos from the TRECVID 2010 training corpus. We used the video transform generation tools provided by NIST (IMEDIA - INRIA) (2) and FFMPEG (1) to add PiP, noise and other video transforms to the query videos, and combined video segments under different types of alignment configurations. The following video alignment configurations are present in our query dataset (Figure 3.12):

• Sequential configuration (100 videos, each 30 s ∼ 1 min in duration)
• Crossover configuration (simple crossover, k-crossover, reversed crossover; 400 videos, each 30 s ∼ 2 min in duration)
• Self-duplication configuration (100 videos, each 30 s ∼ 1 min in duration)

We tested our proposed video alignment algorithms to evaluate their performance in terms of overall video alignment accuracy. Figure 3.14 shows the average F1-measure of the video alignment for the different types of alignment configurations (Figure 3.12). From this plot, we observe that our proposed approaches (I-LCS and DAG-LCS) obtain a high F1-score of 0.91 averaged over all configurations. Our LCS-based algorithms take ∼10 seconds (Matlab) to align a k-crossover video configuration for two video sequences of length 1∼5 minutes (short videos).

One advantage of our LCS-based video alignment algorithms (I-LCS and DAG-LCS) is that they can be made scalable using a parallelization framework. We can parallelize the LCS-based algorithms by parallelizing the dynamic programming part. Note that there are dependencies among the dynamic programming steps (keyframe comparisons) which must be preserved for LCS table construction. We can preserve these dependencies while comparing the video keyframes on different processors. Figure 3.15 shows how a dynamic program can be parallelized on different processors; the different processors in this figure correspond to different color-coded squares. Alternatively, we can achieve parallelism of the dynamic programming by partitioning the LCS table into blocks and running each block on a different processor. Key aspects of parallelization are load-balancing the job across processors and keeping the communication between processors to a minimum cost. The parallelization framework can be used to compare multiple video sequences with a given query sequence at the same time. This is especially applicable during DAG-LCS construction and thus reduces the time complexity of query comparison, at an added expense in computational resources. A minimal sketch of the anti-diagonal scheme follows.
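A minimal sketch of the anti-diagonal (wavefront) parallelization: cells on the same anti-diagonal i + j = d have no mutual dependencies, so they can be filled concurrently once diagonals d-1 and d-2 are complete. The thread pool here only illustrates the schedule; a CPU-bound sim() would need processes or a GPU for real speedup, and sim/tau are the illustrative names used earlier.

from concurrent.futures import ThreadPoolExecutor

def parallel_lcs_table(Q, R, sim, tau=0.5, workers=4):
    """Fill the LCS table by anti-diagonals. Cell (i, j) depends only on
    (i-1, j), (i, j-1) and (i-1, j-1), so every cell on diagonal d = i + j
    can be computed in parallel once the two previous diagonals are done."""
    n, m = len(Q), len(R)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    def fill(ij):
        i, j = ij
        if sim(Q[i - 1], R[j - 1]) > tau:
            L[i][j] = L[i - 1][j - 1] + 1
        else:
            L[i][j] = max(L[i - 1][j], L[i][j - 1])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for d in range(2, n + m + 1):           # one wavefront per diagonal
            cells = [(i, d - i)
                     for i in range(max(1, d - m), min(n, d - 1) + 1)]
            list(pool.map(fill, cells))
    return L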
A full parallelization framework along these lines will be considered in our future work.

Figure 3.13: Example of partial near-duplicate video alignment for the k-crossover configuration

3.11 Future Work

All previous work on the video copy detection problem (including ours) is based on analyzing the video content (or multimodal content) to find video copies in large video collections. We were mainly motivated to solve the challenging copy detection problems arising on content-sharing websites (YouTube, Youku, etc.) and content hosting/streaming websites (Ustream, etc.). Note that most of these content sharing/hosting services provide a platform for social interactions. It is therefore an interesting prospect to incorporate this additional information, i.e., social network/interaction information, into the video copy detection and alignment system to improve its accuracy and complexity. Here, we propose a novel approach which uses social network information to solve the video copy detection problem. The main motivation for why social network information might be useful is discussed below, followed by our proposed approach.

Figure 3.14: Partial near-duplicate video alignment results on our query dataset
Figure 3.15: Parallelization framework for LCS-based algorithms

3.11.1 Incorporating Social Network Information for Video Copy Detection and Alignment

Multimedia sharing on social networks occurs in two main scenarios. One is content-sharing websites like YouTube, Youku, Ustream, etc.; the second is P2P social networks, such as colluder social networks. On content-sharing websites such as YouTube, some users copy video content from their friends or subscribers without permission and without crediting the original video creator. This is a frequent problem that is difficult to handle (and is generally not handled) by YouTube unless the video creator lodges a complaint. On the other hand, in P2P colluder social networks, users actively participate to collectively attack a multimedia fingerprinting system and consume multimedia content illegally. In these networks, users also share their multimedia content with other users on the network. The video copy detection problem thus arises when some user steals and makes a copy of the multimedia/video content without the group's (other users') consent. These video copy detection problems therefore generally arise in social network settings. To address this issue, we could directly use standard content-based copy detection techniques. However, it may become infeasible to index and search for video copies over all the videos uploaded by all the users of a social network. Moreover, we believe the social network structure can help reduce the set of videos to be matched. We therefore propose a novel approach that incorporates the social network information into the video copy detection algorithms to obtain a reliable video copy detection system.

The main intuition is that a user in these social network settings is generally exposed to videos created and/or shared by his own social network. Some recent studies have shown that 80% of YouTube users watch a video shared or recommended by friends or their social network. Thus, it is a reasonable assumption to only analyze (index/search) the videos present in a user's social network when checking for a possible video copy. In this way, we can use the social network information to reduce the search space of the video copies; a sketch of this candidate filtering follows.
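A minimal sketch of this pruning step, assuming an adjacency-list friend graph and a mapping from users to their uploaded video ids (all names here are illustrative):

from collections import deque

def candidate_videos(user, friends, uploads, hops=1):
    """Restrict copy-detection search to videos uploaded within `hops`
    steps of `user` in the social graph (BFS), rather than the whole site."""
    seen, frontier = {user}, deque([(user, 0)])
    candidates = []
    while frontier:
        u, d = frontier.popleft()
        candidates.extend(uploads.get(u, []))
        if d < hops:
            for v in friends.get(u, []):
                if v not in seen:
                    seen.add(v)
                    frontier.append((v, d + 1))
    return candidates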
One of the ideas that we propose here is based on the temporal network idea proposed in (124). In that paper, the authors model video alignment as an assignment problem and use a network flow approach to solve it. Our idea is based on the network flow optimization, but it additionally incorporates social network information into the model.

3.11.1.1 Social and Temporal Network based Flow Optimization

Given two videos, one is designated as the anchor video Q = q_1, ..., q_|Q| and the other as the reference video R = r_1, ..., r_|R|, where |·| denotes the number of frames in a video. The temporal network (124) is initially formed by querying the top-k similar frames from R using the query frames q_i. Directed edges are established across the frames in the columns by chronologically linking frames according to their time stamp values. For example, a frame in a column with time stamp value t can link to a frame in a column to its right with time stamp larger than t. In other words, when tracing the list of connected edges from left to right, the time stamp values are monotonically increasing. Two artificial nodes, the source and sink nodes, are included so that all paths in the network originate at the source node and end at the sink node. The set of all possible paths in the temporal network from the source to the sink node encompasses all possible frame alignments between videos Q and R which follow strict temporal coherency.

Let us denote the temporal network as G = (N, E), where N = N_1, ..., N_|Q| are columns of frames from R; each column N_i = [n_1, ..., n_k] is the retrieval result using q_i ∈ Q as the query. E = {e_ij} is the set of all edges, where e_ij represents a weighted directed edge linking any two nodes from column N_i to N_j. Each edge is characterized by two terms: a weight w(·) and a flow f(·). Given an edge e_ij, its weight is proportional to the similarity of the destination node to its query frame in Q. In this network, the weight signifies the capacity that an edge can carry: w(e_ij) = Sim(q_j, n_j), where n_j is the node in N_j and q_j is the query frame which retrieves n_j. Note that the weight does not depend on the node n_i which links to n_j. For any edge terminating at the sink node, the weight is assigned zero. The flow f(e_ij) is a binary indicator with value 1 or 0.

Let us denote the social network as S = (U, L), where U = U_1, ..., U_|S| are the users of the social network. Let the query video Q be from a user u_q and the reference video R be from another user u_r in the social network. Considering the social network along with the temporal network essentially means that we are interested in checking whether the query video Q is similar in content (a copy) to another video R from another user in the social network. Thus, if a user copies a video from another user in his social network, we can incorporate this social relationship into the network flow model; specifically, we include the social relationships in the objective function via social network constraints. A valid solution is an unbroken chain of edges forming a path from the source node to the sink node, where the flow on the edges traversed by the path is 1 and the flow on all other edges is 0. Finding a maximal path with the maximum flow is thus equivalent to searching for a sequence alignment that maximizes the similarity between Q and R in monotonically increasing temporal order, conditioned on Q and R coming from users of the social network.
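Before stating the optimization formally, here is a minimal sketch of the unit-flow case: because the temporal network is acyclic, the maximum-weight source-to-sink path can be found by dynamic programming in column/time-stamp order, with the social term simply scaling each weight. All names are illustrative.

def best_alignment_path(ts, w, g=1.0):
    """Maximum-weight temporally coherent path through the temporal network.
    ts[i][a], w[i][a]: time stamp and similarity weight of the a-th frame
    retrieved for query frame q_i; g scales weights by the social term
    g(s_ij). Edges only go from column i to a later column with a strictly
    larger time stamp, so the network is acyclic and plain DP suffices."""
    score, prev = {}, {}
    nodes = [(i, a) for i in range(len(ts)) for a in range(len(ts[i]))]
    for i, a in nodes:                        # columns already in order
        best_pred, best_val = None, 0.0
        for j, b in nodes:
            if j < i and ts[j][b] < ts[i][a] and score[(j, b)] > best_val:
                best_pred, best_val = (j, b), score[(j, b)]
        score[(i, a)] = best_val + g * w[i][a]
        prev[(i, a)] = best_pred
    end = max(score, key=score.get)           # best endpoint feeds the sink
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return list(reversed(path)), score[end]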
The video copy alignment based on network flow can thus be formulated as

maximize  Σ_{e_ij ∈ E, s_ij ∈ S} f(e_ij) w(e_ij) g(s_ij)   (3.5)

subject to

Σ_{e_in ∈ E_in(n)} f(e_in) - Σ_{e_out ∈ E_out(n)} f(e_out) = 0,  ∀ n ∈ N   (3.6)
Σ_{e_out ∈ E_out(n_src)} f(e_out) = 1   (3.7)
Σ_{e_in ∈ E_in(n_sink)} f(e_in) = 1   (3.8)
0 ≤ f(e_ij) ≤ 1,  ∀ e_ij ∈ E   (3.9)
0 ≤ s_ij ≤ 1,  ∀ s_ij ∈ S   (3.10)

where g(s_ij) is a function based on social network information. A simple choice for g(·) is the delta function δ(s_ij) or the logistic smoothing function 1 - sigmoid(s_ij). This flow optimization problem can be solved using an LP solver or other popular optimization methods.

The idea of incorporating social network information into the (content-based) video copy detection problem is quite novel and, to the best of our knowledge, has never been explored in the literature. This new way of incorporating social network information might be quite useful for content-copy detection and can be extended to other interesting problems, such as tracking content consumption by users and their friends, identifying the interesting parts of a video based on content and social interactions, content-based information diffusion in social networks, etc.

3.12 Conclusions

In this chapter, we proposed a generalized spatial coding scheme and novel spatial verification algorithms for partial near-duplicate video copy detection. Our proposed spatial verification techniques are quite robust to challenging video transformations such as PiP and provide better storage and computational cost compared to the current state-of-the-art techniques for partial near-duplicate video copy detection. We formulated the partial near-duplicate video alignment problem as an efficient subsequence matching problem and provided efficient algorithms based on dynamic programming and DAG construction. Our experimental results showed that our proposed video alignment algorithms achieve high alignment accuracy for short video sequences. Our future work will be to incorporate social network structure information into the video copy detection system, as outlined in Section 3.11.

Chapter 4
Recommendation Systems for Social Media Users

4.1 Introduction

Social networking sites such as Facebook, YouTube, Last.fm, etc. have become a popular platform for users to socially connect with friends and share their content and opinions. Many users share content (links) such as music, images and videos with each other, and these sites also provide the ability to tag each other's content. Such social networking sites represent a kind of rich content information network. These social networks provide explicit connections between users by allowing them to connect as friends, while the content information networks provide direct links (connections) between users and their content. Discovering and extracting user-community relationships based on the shared content from these social media networks is an exciting problem, since it finds interesting uses in academic and industrial research. For example, social media advertising is a new paradigm in product marketing which helps companies market their products to specific audiences. Moreover, understanding the user-item relationship using content and social information provides a new approach for personalized recommendation and filtering systems. Recommendation systems generally suggest items to users based on their interests.
The Collaborative Filtering (CF) framework is generally used in recommender systems (122), (118), since it can automatically predict the interests of a particular user based on the collective rating records of similar users or items. The underlying assumption in CF-based recommendation systems is that similar users prefer similar items. However, CF-based recommendation systems suffer from a few weaknesses: (1) sparsity in the user-item matrix implies that the predicted ratings for a new user can be unreliable; (2) traditional CF-based recommender systems do not consider the content information of the item for prediction; (3) CF-based recommender systems do not consider the impact of the social influence of friends on recommendations, and thus may lead to unrealistic predictions. To overcome these weaknesses, and based on the intuition that a user's social network and item content information affect the user's personal decisions on item ratings, we propose to combine the user's social graph with a topic-based model to perform more accurate and personalized recommendation of items. To capture the content information, we use the topic-modeling based CF system called Collaborative Topic Regression (CTR) (135), and to capture the social network information, we use a matrix factorization technique.

An item's content information can be represented using 'item tags' (attributes), or it can be modeled using 'user tags'. Here we consider 'item tags' as a 'global' description of items (generally provided by a company during item advertisement), while user tags can be considered 'local' descriptions of items given by users (consumers), roughly indicating the users' preferences and familiarity with the items. Previous works (130), (152), (158) have studied the influence of tags on the accuracy of recommendation systems; however, their final results are not human-interpretable, i.e., they do not explain why the content/item was recommended to the user. In this chapter, we discuss the trade-offs of modeling content information using 'item tags' and 'user tags' and present user-interpretable results.

We propose a hierarchical Bayesian model that integrates social network structure (using matrix factorization) and item content information (using an LDA model) for item recommendation (112). We connect these two data sources through the shared user latent feature space. The matrix factorization of the social network learns the low-rank user latent feature space, while topic modeling (content analysis) provides a representation of the items in the item latent feature space, in order to make social recommendations. Our experimental results on a large music dataset (Last.fm) and a bookmarking dataset (Delicious) (30) show that our proposed model outperforms state-of-the-art collaborative filtering algorithms such as CTR and Probabilistic Matrix Factorization (PMF). More importantly, our model can provide useful insights into how much social network information helps improve prediction performance. Our results reveal the interesting insight that social circles have more influence on people's decisions about the usefulness of information (e.g., bookmarking preference on Delicious) than on personal taste (e.g., music preference on Last.fm). We examine and discuss solutions to a potential information leak in many recommendation systems that utilize social information. We also discuss the impact of using external sources such as Wikipedia for modeling the content of items (item tags).
The remainder of the chapter is arranged as follows: in Section 4.2, we provide a brief overview of related work on recommendation systems. In Section 4.3, we present our proposed model and discuss how to learn its parameters and perform inference. Our experimental setup and analysis are presented in Sections 4.4 and 4.5, followed by conclusions and future work in Section 4.7.

4.2 Related Work

In this section, we review a few state-of-the-art approaches proposed for Collaborative Filtering (CF) based recommendation systems. There are mainly two types of CF-based approaches: (1) memory-based approaches and (2) model-based approaches. The memory-based approaches use either user-based methods (61) or item-based methods (77) for the prediction (recommendation) of ratings for items. Even though memory-based approaches are easy to implement and popular, they do not guarantee good prediction results. On the other hand, model-based approaches include several model-based learning methods such as clustering models, aspect models (65) and latent factor models. These model-based approaches, especially latent factor models based on matrix factorization (81), (118), have shown promise for better rating prediction since they efficiently incorporate user interests into the model. However, all of the above CF-based approaches assume users are independent and identically distributed, and they ignore additional information such as the content of the item and the social connections of users when performing the recommendation task.

The Collaborative Topic Regression (CTR) model was recently proposed (135) for article (document) recommendation based on a probabilistic topic modeling approach. Figure 4.1 shows the CTR model. The CTR model combines the merits of both traditional collaborative filtering and probabilistic topic modeling. CTR represents users with topic interests and assumes that items (documents) are generated by a topic model. CTR additionally includes a latent variable ε_j which offsets the topic proportions θ_j when modeling the user ratings. Assume there are K topics β = β_{1:K}. The generative process of the CTR model is as follows:

1. For each user i, draw the user latent vector u_i ∼ N(0, λ_u^{-1} I_K)
2. For each item j:
 (a) Draw topic proportions θ_j ∼ Dirichlet(α)
 (b) Draw the item latent offset ε_j ∼ N(0, λ_v^{-1} I_K) and set the item latent vector as v_j = ε_j + θ_j
 (c) For each word w_jn:
  i. Draw topic assignment z_jn ∼ Mult(θ_j)
  ii. Draw word w_jn ∼ Mult(β_{z_jn})
3. For each user-item pair (i, j), draw the rating r_ij ∼ N(u_i^T v_j, c_ij^{-1})

The CTR model does a good job of using content information for the recommendation of items. However, this model does not reliably learn the user latent space for new or inactive users. It has been well studied and established in social science and social network analysis research that a user's social relations affect the user's decision process and interests (19), (41). For example, users generally trust their friends' recommendations to buy an item or watch a movie. More recently, recommendation techniques have been developed that incorporate social relationship information into CF techniques. (97) proposed a social recommendation system, based on matrix factorization techniques, which uses the user's social network information and rating records to recommend products and movies. However, their model cannot be used for the recommendation of new or unseen items. In our work, we propose a novel probabilistic model to address the recommendation problem when item content, rating records and the user's social network information are all known. Our model can be used to predict ratings for new or unseen items and for new or inactive users of a social network.
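To make the CTR generative process above concrete, here is a minimal forward-simulation sketch in numpy. All sizes and hyperparameter defaults are illustrative, not the tuned values used later in the chapter.

import numpy as np

def simulate_ctr(n_users=100, n_items=50, K=10, vocab=1000, n_words=80,
                 lam_u=0.01, lam_v=100.0, alpha=1.0, c=1.0, seed=0):
    """Forward-simulate the CTR generative process of Section 4.2."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.ones(vocab), size=K)          # K topics over words
    U = rng.normal(0, lam_u ** -0.5, (n_users, K))        # user latent vectors
    theta = rng.dirichlet(alpha * np.ones(K), size=n_items)
    eps = rng.normal(0, lam_v ** -0.5, (n_items, K))      # item latent offsets
    V = theta + eps                                       # v_j = eps_j + theta_j
    docs = []                                             # item words
    for j in range(n_items):
        z = rng.choice(K, size=n_words, p=theta[j])       # z_jn ~ Mult(theta_j)
        docs.append([rng.choice(vocab, p=beta[k]) for k in z])
    R = U @ V.T + rng.normal(0, c ** -0.5, (n_users, n_items))  # r_ij
    return U, V, theta, docs, R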
In our work, we propose a novel probabilistic model to address the recommendation problem when user’s item content, rating records and social network information are all known. Our model can be used for predicting ratings for new or unseen items and for new or inactive users of a social network. 78 Figure 4.1: Collaborative Topic Regression Model (135). Here, θ is the topic pro- portions of LDA model, U is user random variable, V is item random variable, W is the observed words of items, and r is the observed user ratings, K is no. of topics, α is Dirichlet prior, Z is latent variable 4.3 Proposed Approach In this section, we discuss our proposed model, shown in Figure 4.2. Our model is a generalized hierarchical Bayesian model which jointly learns the user, item and social factor latent spaces. We use LDA (24) to capture item’s content information in latent topic space, and we use matrix factorization to derive latent feature space ofuserfromhis/hersocialnetworkgraph. ItcanbeseenthatCTRmodel(135)and social matrix factorization (97) can be derived as special cases from our proposed model. Our model fuses LDA with social matrix factorization (SMF) to obtain a con- sistent and compact feature representation. First, we discuss the social matrix factorization and then we will discuss the factorization for our complete model. Consider a social network graphG = (V,E), where the users and their social relations are respectively represented as the vertex setV ={v i } m i=1 and the edge setE ofG. Let Q =q ik denote them×m matrix ofG, which is the social network matrix. For any pair of vertices v i and v k , let q ik denote the the relation between 79 Figure 4.2: Proposed Model - CTR with SMF, CTR part shown in red color, (SMF) Social Matrix Factorization shown in blue color two users ’i’ and ’k’. We associate q ik with a confidence parameter d ik , which is used to capture the strength of the user relations. A high value ofd ik indicates that user ’i’ has a stronger connection (likeliness) with user ’k’. Therefore, the idea of socialnetworkmatrixfactorizationistoderivel-dimensionalfeaturerepresentation of users, based on analyzing the social network graphG. Let U ∈ R l×m and S∈R l×m be the latent user and social factor feature matrices, with column vectors U i and S k representing the user-specific and social factor-specific latent feature vectors respectively. The conditional distribution over the observed social network relationships can be shown as P (Q|U,S,σ 2 Q ) = m Y i=1 m Y k=1 N (q ij |g(U T i S k ),σ 2 Q ) I Q ij (4.1) whereN (x|μ,σ 2 ) is the pdf of Gaussian distribution with meanμ and varianceσ 2 Q , and I Q ik is the indicator function that is ’1’ if user ’i’ and user ’k’ are connected in the social graph (i.e. there is an edge between the vertices ’i’ and ’k’), and equal 80 to 0 otherwise. The function g(x) is the logistic function g(x) = 1 1+exp(−x) , which bounds the range of U T i S k within [0, 1]. We place zero-mean spherical Gaussian priors on user and factor feature vectors: P (U|σ 2 U ) = m Y i=1 N (U i |0,σ 2 U I) (4.2) P (S|σ 2 S ) = m Y k=1 N (S k |0,σ 2 S I) (4.3) Hence, through Bayesian inference, we have p(U,S|Q,σ 2 Q ,σ 2 U ,σ 2 S )∝p(Q|U,S,σ 2 Q )p(U|σ 2 U )p(S|σ 2 S ) = m Y i=1 m Y k=1 N [(q ik |q(U T i S k ),σ 2 Q )] I Q ik × m Y i=1 N (U i |0,σ 2 U I)× m Y k=1 N (S k |0,σ 2 S I) (4.4) Now, combining LDA with SMF (Figure 4.2), we have p(U,V,S|Q,R,σ 2 Q ,σ 2 R ,σ 2 U ,σ 2 V ,σ 2 S ) ∝p(R|U,V,σ 2 R )p(Q|U,S,σ 2 Q ) ×p(U|σ 2 U )p(V|σ 2 V )p(S|σ 2 S ). 
Now, combining LDA with SMF (Figure 4.2), we have

p(U, V, S | Q, R, σ_Q^2, σ_R^2, σ_U^2, σ_V^2, σ_S^2) ∝ p(R | U, V, σ_R^2) p(Q | U, S, σ_Q^2) × p(U | σ_U^2) p(V | σ_V^2) p(S | σ_S^2)   (4.5)

The log of the posterior distribution for the above equation can be found by substituting the corresponding probability density functions. Note that the item latent vector v_j is generated by a key property due to collaborative topic regression:

P(V | σ_V^2) ∼ N(θ_j, λ_V^{-1} I_K)   (4.6)

where λ_V = σ_R^2 / σ_V^2 and θ_j is the topic proportions from LDA.

4.3.1 Parameter Learning

For learning the parameters, we develop an EM-style algorithm similar to the one discussed in (135). Maximization of the posterior is equivalent to maximizing the complete log-likelihood of U, V, S, θ_{1:J}, R and Q given λ_U, λ_V, λ_S, λ_Q and β:

L = -(λ_U/2) Σ_i u_i^T u_i - (λ_V/2) Σ_j (v_j - θ_j)^T (v_j - θ_j) + Σ_j Σ_n log( Σ_k θ_jk β_{k,w_jn} ) - Σ_{ij} (c_ij/2)(r_ij - u_i^T v_j)^2 - (λ_Q/2) Σ_{i,m} d_im (q_im - u_i^T s_m)^2 - (λ_S/2) Σ_k s_k^T s_k   (4.7)

where λ_U = σ_R^2/σ_U^2, λ_S = σ_R^2/σ_S^2, λ_Q = σ_R^2/σ_Q^2 and the Dirichlet prior is set to α = 1. We optimize this function by a gradient ascent approach, iteratively optimizing the collaborative filtering and social network variables u_i, v_j, s_m and the topic proportions θ_j. For u_i, v_j, s_m, the maximization follows a form similar to matrix factorization (67). Given a current estimate of θ_j, taking the gradient of L with respect to u_i, v_j and s_m and setting it to zero lets us solve for u_i, v_j, s_m in terms of U, V, C, R, S, λ_V, λ_U, λ_S, λ_Q:

∂L/∂u_i = 0;  ∂L/∂v_j = 0;  ∂L/∂s_m = 0   (4.8)

Solving the corresponding equations leads to the following update equations:

u_i ← (V C_i V^T + λ_Q S D_i S^T + λ_U I_K)^{-1} (V C_i R_i + λ_Q S D_i Q_i)   (4.9)
v_j ← (U C_j U^T + λ_V I_K)^{-1} (U C_j R_j + λ_V θ_j)   (4.10)
s_m ← (λ_Q U D_m U^T + λ_S I_K)^{-1} (λ_Q U D_m Q_m)   (4.11)

where C_i and D_i are diagonal matrices with c_ij, d_ij, j = 1, ..., J as their diagonal elements, and R_i = (r_ij)_{j=1}^{J} for user i. For each item j, C_j and R_j are defined similarly. Note that c_ij is the confidence parameter for rating r_ij; for more details refer to (135). We define d_ik as the confidence parameter for q_ik, where q_ik is the relationship between users i and k. Equation (4.10) shows how the topic proportions θ_j affect the item latent vector v_j, with λ_V balancing this effect.

Given U and V, we can learn the topic proportions θ_j. We define q(z_jn = k) = φ_jnk, separate the items that contain θ_j, and apply Jensen's inequality:

L(θ_j) ≥ -(λ_V/2)(v_j - θ_j)^T (v_j - θ_j) + Σ_n Σ_k φ_jnk ( log θ_jk β_{k,w_jn} - log φ_jnk ) = L(θ_j, φ_j)   (4.12)

The optimal φ_jnk satisfies φ_jnk ∝ θ_jk β_{k,w_jn}. Note that we cannot optimize θ_j analytically, so we use projection gradient approaches to optimize θ_{1:J} and the other parameters U, V, φ_{1:J}. After we estimate U, V and φ, we can optimize β:

β_kw ∝ Σ_j Σ_n φ_jnk 1[w_jn = w]   (4.13)

A minimal sketch of the alternating updates (4.9)-(4.11) is given below.
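The sketch below implements one coordinate-ascent sweep of Eqs. (4.9)-(4.11) with dense matrices for clarity; a real implementation would exploit the sparsity of the confidence matrices C and D. Variable names are illustrative.

import numpy as np

def update_latents(U, V, S, R, Q, C, D, theta, lam_u, lam_v, lam_s, lam_q):
    """One sweep of the closed-form updates (4.9)-(4.11).
    U: (K, m) users, V: (K, J) items, S: (K, m) social factors;
    R: (m, J) ratings, Q: (m, m) social relations;
    C: (m, J) and D: (m, m) confidence weights c_ij, d_ik."""
    K = U.shape[0]
    I = np.eye(K)
    for i in range(U.shape[1]):                              # Eq. (4.9)
        Ci, Di = np.diag(C[i]), np.diag(D[i])
        A = V @ Ci @ V.T + lam_q * (S @ Di @ S.T) + lam_u * I
        b = V @ Ci @ R[i] + lam_q * (S @ Di @ Q[i])
        U[:, i] = np.linalg.solve(A, b)
    for j in range(V.shape[1]):                              # Eq. (4.10)
        Cj = np.diag(C[:, j])
        A = U @ Cj @ U.T + lam_v * I
        V[:, j] = np.linalg.solve(A, U @ Cj @ R[:, j] + lam_v * theta[j])
    for m_ in range(S.shape[1]):                             # Eq. (4.11)
        Dm = np.diag(D[:, m_])
        A = lam_q * (U @ Dm @ U.T) + lam_s * I
        S[:, m_] = np.linalg.solve(A, lam_q * (U @ Dm @ Q[:, m_]))
    return U, V, S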
4.3.2 Prediction

After the optimal parameters U*, V*, θ*_{1:J} and β* are learned, our proposed model can be used for in-matrix and out-matrix prediction (recommendation) tasks. If D is the observed data, then both in-matrix and out-matrix predictions can be easily estimated. As discussed in (135), in-matrix prediction refers to the case where a user has not rated an item but that item has been rated by at least one other user. On the other hand, out-matrix prediction refers to the case where none of the users have rated a particular item, i.e., the item has no rating records. For in-matrix prediction, we use the point estimates of u_i, θ_j and ε_j to approximate their expectations:

E[r_ij | D] ≈ E[u_i | D]^T ( E[θ_j | D] + E[ε_j | D] )   (4.14)
r*_ij ≈ (u*_i)^T v*_j   (4.15)

For out-matrix prediction, the item is new and has not been rated by other users. Thus E[ε_j] = 0 and we predict the ratings as

E[r_ij | D] ≈ E[u_i | D]^T E[θ_j | D]   (4.16)
r*_ij ≈ (u*_i)^T θ*_j   (4.17)

4.4 Experimental Setup

We conduct several experiments to compare the performance of our proposed model with state-of-the-art techniques. Here, we describe the datasets and experimental settings.

4.4.1 Description of Datasets

Table 4.1 shows the description of the two real-world datasets considered in our experiments: hetrec2011-lastfm-2k (Last.fm) and hetrec2011-delicious-2k (Delicious) (30).

Table 4.1: Dataset description
Dataset              | Last.fm | Delicious
users                | 1892    | 1867
items                | 17632   | 69226
tags                 | 11946   | 53388
user-user relations  | 25434   | 15328
user-tags-items      | 186479  | 437593
user-item relations  | 92834   | 104799

The Last.fm music dataset was obtained from the Last.fm online music system. It has 1892 users, 17632 artists, 12717 bidirectional user friend relations, i.e., 25434 (user_i, user_j) pairs, 92834 user-listened-artist relations stored as tuples [user, artist, listeningCount], 11946 tags, and 186479 tag assignments, i.e., tuples [user, tag, artist]. On average, 49.067 artists are most listened to by each user, and the average number of users listening to an artist is 5.265. There are 98.562 tag assignments per user and 14.891 tag assignments per artist on average. The average number of distinct tags used by each user is 18.93, and the average number of distinct tags used to tag each artist is 8.764. Each user has 13.443 friend relations on average. Table 4.2 shows the social network statistics of the Last.fm dataset.

Table 4.2: Social network statistics of the Last.fm dataset
Total Nodes             | 1892
Total Links             | 25434
Density                 | 0.00711
Clustering Co-efficient | 0.01538
Maximum In-degree       | 119
Maximum Out-degree      | 119

The datasets are first preprocessed and cleaned to remove noisy entries. For the hetrec2011-lastfm-2k dataset, if a user has listened to an artist, then we set the user's rating for that artist to 1; otherwise we leave the rating unobserved. Similarly, for the hetrec2011-delicious-2k dataset, if a user has bookmarked a URL (item), then we set the rating for that bookmarked URL to 1; otherwise we leave the rating unobserved. We consider artists and URLs as the items in the two datasets, respectively. We observe that the user-item matrices for both datasets are highly sparse (99.7% and 99.91% sparse, respectively).

We used the Last.fm API to crawl the Last.fm website and collect the biography of each artist in the dataset. The artist biographies are created by artists, label companies or users, but they are regularly moderated (similar to Wikipedia). For the artist biographies, we remove stop words and use tf-idf to choose the top 1,000 distinct words as the vocabulary. This yielded a corpus of more than a million words. We consider these words as 'item tags'; that is, these tags are generated from the Last.fm wiki and generally represent the global description (information) of the artist. On the other hand, we consider the tags assigned by individual users as 'user tags', representing the 'local' description of the artist (capturing the user's preferences). The 'user tags' are provided in the hetrec2011-lastfm-2k dataset (30) as [user, tag, artist] tuples. A sketch of the rating binarization step follows.
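A minimal sketch of the implicit-feedback binarization described above, assuming the [user, item, count] tuples use 0-based integer ids (an illustrative assumption about the preprocessed input):

import numpy as np
from scipy.sparse import csr_matrix

def build_rating_matrix(tuples, n_users, n_items):
    """Each observed (user_id, item_id, listening_count) tuple becomes
    rating 1; everything else stays unobserved. The resulting matrix is
    ~99.7% sparse on hetrec2011-lastfm-2k."""
    rows = [u for u, i, c in tuples]
    cols = [i for u, i, c in tuples]
    data = np.ones(len(rows))
    return csr_matrix((data, (rows, cols)), shape=(n_users, n_items))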
4.4.2 Evaluation

In our experiments, we split each dataset into two parts: a training set (90%) and a testing set (10%). The model is trained on the majority of the training dataset, and the optimal parameters are obtained on a small held-out dataset. Using the optimal parameters, ratings are predicted for the entries in the testing dataset. For evaluation, we use recall as our performance metric (135), since precision is difficult to evaluate (a zero rating for an item can imply either that the user does not like the item or that the user does not know about it). Recall only considers the rated items within the top M; a higher recall at lower M implies a better system. For each user, we define recall@M as:

recall@M = (number of items the user likes in the top M) / (total number of items the user likes)   (4.18)

The above equation calculates user-oriented recall. Item-oriented recall can be defined similarly. For consistency and convenience, we use user-oriented recall for in-matrix prediction throughout this chapter; a minimal sketch of this metric follows.
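A minimal sketch of averaged user-oriented recall@M (Eq. 4.18); scores and liked are illustrative input names:

import numpy as np

def recall_at_m(scores, liked, M=250):
    """Average user-oriented recall@M. scores: (n_users, n_items) predicted
    ratings; liked: list of sets of held-out liked item ids per user."""
    recalls = []
    for u, items in enumerate(liked):
        if not items:
            continue                      # skip users with no test items
        top = np.argsort(-scores[u])[:M]  # top-M recommended items
        hits = len(items.intersection(top))
        recalls.append(hits / len(items))
    return float(np.mean(recalls))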
4.4.2.1 In-matrix prediction

In-matrix prediction considers the case where each user has a set of artists (items) that he has not listened to, but at least one other user in the dataset has listened to them. For in-matrix prediction, we are interested in evaluating the predicted ratings of these artists for the user. In fact, this task is similar to traditional collaborative filtering. For this task, we ensure that all the artists in the test set have appeared in the training set. In-matrix prediction is considered similarly for the bookmarking dataset, where the items are bookmarks.

4.4.3 Experimental settings

For collaborative filtering based on matrix factorization (denoted by CF), we used grid search to find the parameters that give good performance on the testing dataset. We found that λ_v = 100, λ_u = 0.01, a = 1, b = 0.01, K = 200 gives good performance for the CF approach. Note that a and b are tuning parameters (a > b > 0) for the confidence parameters c_ij and d_ij (Equation 4.7). For the Collaborative Topic Regression model (denoted by CTR), we choose parameters similar to the CF approach: we set λ_u = 0.01, a = 1, b = 0.01, K = 200 and vary the parameter λ_v to study its effect on prediction accuracy. For our model, we set a = 1, b = 0.01 and vary all other parameters to study their effect on prediction accuracy. Note that we use the terms prediction accuracy and recall interchangeably throughout this chapter.

4.5 Experimental Analysis

We evaluate our model on real-world music and bookmarking datasets for artist and URL recommendation, respectively. Our experiments help us answer the following key questions:

• How does our model compare with the state-of-the-art collaborative filtering techniques?
• How do the content parameter λ_v and the social network parameter λ_q affect the prediction accuracy?
• How is prediction accuracy affected when content is modeled using 'item tags' versus 'user tags'?
• How can the user latent space be interpreted?

4.5.1 Comparisons

To show the performance improvement of our proposed model, we compare results from our algorithm with state-of-the-art algorithms such as Collaborative Topic Regression (135) and matrix factorization (81). First, we study the effect of the precision parameter λ_v on the CTR model. Figure 4.3 shows that when λ_v is small in CTR, the per-item latent vector v_j can diverge significantly from the topic proportions θ_j, as observed in (135). Figure 4.4 shows that we observe a similar effect of λ_v for our model as well. Moreover, the plots in Figure 4.4 clearly show that our proposed model outperforms the CTR model by a margin of 2.5∼3% on both datasets. This can be explained by the fact that our model uses social network information to better model the user latent space, i.e., a user's preferences are better modeled thanks to friends with similar tastes.

Figure 4.3: Recall of the in-matrix prediction task for the CTR model, varying the content parameter λ_v with the number of recommended items fixed at M = 250. Dataset used: hetrec2011-lastfm-2k

Figure 4.5 shows the overall performance for in-matrix prediction when we vary the number of returned items M = 50, 100, ..., 250 while keeping λ_v (= 100) constant. This plot shows that the performance of our model improves as the number of returned items increases, and that our approach always outperforms both the CTR and CF approaches at different values of M. We observed that the recall measured at smaller M (M < 50) is quite small for all the models, since for many users in the test dataset the average number of items per user is quite small, and so all the models tend to recommend the most popular items in the top M (M < 50).

Figure 4.4: Comparison of recall for CTR and our proposed model (indicated by CTR with SMF), varying λ_v with M = 250 fixed. Left plot: hetrec2011-lastfm-2k; right plot: hetrec2011-delicious-2k

4.5.2 Impact of parameters λ_v, λ_q

Our model allows us to study how the content parameter (λ_v) and the social network parameter (λ_q) affect the overall performance of our recommendation system. Here we discuss how to balance these parameters to achieve better recommendation of artists to users in the social network. If λ_q = 0, our model collapses to the CTR model, which uses topic modeling and the user-item rating matrix for prediction. If λ_q = ∞, our model uses only the information from the social network to model user preferences. In all other cases, our model fuses information from the topic model and the user social network for matrix factorization, and furthermore to predict the ratings for users.

Figure 4.7 shows how our system performs when the social network parameter λ_q is varied while keeping the content parameter λ_v constant (fixed topic model). From this figure, we observe that the value of λ_q impacts the recommendation results significantly, which demonstrates that fusing the user social network with the topic model (CTR) improves recommendation accuracy quite a bit (2∼3%). Figure 4.7 also indicates that for small values of λ_q the improvement in prediction accuracy is small and negligible; it increases with further increase in λ_q. However, when λ_q increases beyond a certain threshold, the prediction accuracy decreases with further increase in λ_q. This can be intuitively explained as follows: for large values of λ_q, our model gives more preference to the social network information (similar neighbors) and less preference to the user's own tastes (previous item rating records), so the prediction accuracy may not be reliable for large λ_q. Our model achieves its best prediction accuracy for the hetrec2011-lastfm-2k dataset around λ_q ∈ (100, 200), irrespective of the value of λ_v. This insensitivity of the optimal value of λ_q shows that our model can be easily trained on the held-out dataset.

To study how the content and social network parameters balance our model's prediction accuracy, we plot the recall while varying both parameters. Figure 4.8 shows the contour and 3D plots of recall for our proposed model.
When the parameters are zero, i.e., λ_v = 0 and λ_q = 0, our model reduces to the standard CF model, which has poor prediction accuracy, and this is confirmed in these plots. When we increase λ_v at fixed λ_q, we see that our model's performance improves (a smaller λ_v implies the model behaves like the CF model). Similar observations can be made when varying λ_q at fixed λ_v. Moreover, there is a region of values for λ_v and λ_q (near ∼(100, 100)) around which our model provides the best performance in terms of recall.

Figure 4.5: Recall comparison of various models for the in-matrix prediction task, varying the number of recommended items M with λ_v = 100 fixed. Our model is indicated by CTR with SMF; PMF indicates matrix factorization (CF). Dataset used: hetrec2011-lastfm-2k

To investigate further, we plotted the recall contours for our model around the empirically chosen range (100, 250); Figure 4.9 shows this plot. From this figure, we infer that there is a region where the optimal values of λ_v and λ_q ensure the best prediction accuracy for our model. To our surprise, we found that both parameters had a similar optimal value of ∼150. To find out whether similar and higher values of the λ_q and λ_v parameters always guarantee the best performance for our model, we conducted similar experiments on the hetrec2011-delicious-2k dataset.

Figure 4.6 shows the contour and 3D plots of recall when varying λ_v and λ_q for the hetrec2011-delicious-2k dataset. We observe in Figure 4.6 that the optimal values of λ_q and λ_v are quite small: our model achieves its best prediction accuracy at λ_q = 0.05 and λ_v = 0.01. We can explain this by looking at the dataset. hetrec2011-delicious-2k is obtained from the Delicious social bookmarking website, where the majority of a user's bookmarks (items) are publicly shared online or with friends. Thus, the user's social network plays a more important role than the content information of the URLs (items) for the URL prediction task. Moreover, since the dataset is highly sparse (99.91%), the content and social network information help only for some users, and thus these parameters take smaller values. On the other hand, hetrec2011-lastfm-2k (99.7% sparse) is a music dataset, and generally the artist's music (item content) has a great influence on a user's tastes; moreover, Last.fm users tend to be friends with other users who have similar music interests. Thus, we observe that higher values of the parameters λ_v and λ_q achieve the best prediction for our model. From these experiments (Figures 4.9 and 4.6), we conclude that the optimal values of λ_q and λ_v are highly dataset-dependent, and their values balance how the content and social network information are used to achieve the best prediction accuracy (best recommendation).

4.5.3 Content Representation of items

As mentioned in Section 4.4.1, we use 'user tags' and/or 'item tags' to model the artist (content) information (Last.fm dataset). We conducted several experiments to gauge the impact of content modeling using the different tags. First, we modeled the content information using 'user' and 'item' tags separately. Figure 4.11 shows the in-matrix prediction plots when 'item tags' are used to represent the content information.
From this figure, we observe that we get the best prediction accuracy of 0.47 for λ_v ∼ 10 and λ_q ∼ 10. Note that K = 50 is used here (K is the latent space dimension/number of topics). We observe similar characteristics when 'user tags' are used for modeling the content. Figure 4.10 shows that we achieve the best recall of around 0.49 when 'user tags' are used; moreover, the best recall is achieved at λ_v ∼ 10 and λ_q ∼ 0.1. This improvement in prediction accuracy when user tags are used can be explained by looking at the user latent space, which is discussed in the next section (Section 4.5.4). We also conducted experiments representing the artist description using both the 'user tags' and 'item tags' and plotted the recall metrics for our model. We observe in Figure 4.12 that this joint-tag content representation achieves a best prediction accuracy (recall) of around ∼0.48.

Figure 4.6: Plots of in-matrix prediction recall for the proposed model, varying the content parameter λ_v and social network parameter λ_q with M = 250 fixed. Dataset used: hetrec2011-delicious-2k

Figures 4.13 and 4.14 show the recall comparison plots for our model when different tags are used and λ_v and λ_q are varied. Figure 4.13 is plotted by keeping λ_q constant (= 10) while varying λ_v; in this figure, we observe that our model achieves its best accuracy around λ_v = 10, irrespective of which tags are used for content representation. On the other hand, Figure 4.14 shows the plot when λ_v is fixed (= 10) and λ_q is varied. In this figure, we observe that we get the best recall for 'user + item' tags when λ_q ∼ 0.1, while the best recall for models using 'user' tags and 'item' tags separately is achieved when λ_q ∼ 10. These plots imply that when we combine different sources for content representation, the combined content feature might not always give the best accuracy; it generally depends on the data sources and the data content we model.

Figure 4.7: Recall of the in-matrix prediction task for our proposed model, varying the content parameter λ_v and social network parameter λ_q at M = 250 and K = 200. Dataset used: hetrec2011-lastfm-2k

4.5.4 Examining User Latent Space

In Section 4.5.3, we discussed how different types of tags influence prediction accuracy. In this section, we explain those results by analyzing the latent space of users. One major advantage of our model (and the CTR model) over other collaborative filtering models is that it can explain the user latent space using the topics learned from the data. For any user, we can find the top matched topics by ranking the entries of the latent vector u_i. Table 4.3 shows two example users from the dataset along with their top 3 matched topics and the top 5 artists they listened to.

Figure 4.8: Plots of in-matrix prediction recall for the proposed model, varying the content parameter λ_v and social network parameter λ_q at M = 250 and K = 200. Dataset used: hetrec2011-lastfm-2k
The user latent space of Table 4.3 is generated from the topics learned by LDA when both 'user and item tags' are used for content representation. In Table 4.3, the tags indicated in blue correspond to similar artists, albums, bands or label names; tags highlighted in red represent music genres/sub-genres; and tags indicated in green correspond to noisy tags representing locations or other common vocabulary terms. For User 1 in Table 4.3, we see that his top artists belong to the music genres/sub-genres represented by the predicted topics. For example, 'Duran Duran' is an electronic pop-rock artist from the UK who made popular songs in the 1980s. Similarly, for User 2, we observe that his top predicted topics are generally represented by similar artists, albums or labels, and they correspond to the genres/music styles of the top artists in his listening records.

Figure 4.9: Zoomed-in plots of in-matrix prediction recall for the proposed model, varying the content parameter λ_v and social network parameter λ_q at M = 250 and K = 200. Dataset used: hetrec2011-lastfm-2k

In Table 4.4, we represent the user latent space using the 'item tags'. In this table, we see that many of the tags in the predicted topics for a user represent similar artists or albums, or are noisy tags, while only a few tags represent music genres/sub-genres. This clearly shows that using an external source like the Last.fm wiki to model the artist description can produce a latent space whose tags are noisy and less representative of the artist's music genre; that is, these 'item tags' are generic in nature and give a global description of the artists. In Table 4.5, we show the interpretable user latent space when only 'user tags' are used for modeling the artist content. We observe that many of the user tags in the topics correspond to music genres/sub-genres and some to noisy (user-dependent) tags, while only a few correspond to albums, labels or other artists. This shows that user tags are user-dependent and tend to represent a 'local description' of the artist by that particular user.

All three tables show that a user's music profile (tastes/interests) can be easily interpreted: the learned topics can serve as a summary of what the user might be interested in. This can help users manage and filter their own music selections. For example, if a user recognizes that his profile represents different topics, he can choose to filter or hide some topics while seeking music recommendations on the other topics.

Figure 4.10: Plots of in-matrix prediction recall for the proposed model, varying the content parameter λ_v and social network parameter λ_q with M = 250 and K = 50 fixed. Only 'user tags' are used for modeling the content information

4.5.5 Computational Time Analysis

Our model utilizes LDA for topic modeling; thus, its time complexity is quite expensive compared to traditional matrix factorization techniques. Table 4.6 shows the average time consumed by our model compared with the CTR model.
For the hetrec2011-lastfm-2k dataset, we observed that when λ_v is small, our model converges faster than the CTR model; on the other hand, when λ_v is large, our model takes more time to converge. This is because the update of the parameter v_j is done much faster in our model (due to the joint learning of the latent space vectors) for smaller values of λ_v. Averaged over all values of λ_v, our model took time comparable to the CTR model. Table 4.7 shows how our model performs when the latent space dimension (K) is varied. We observed that when we use a smaller value of K (K < 10), the accuracy of our model decreases quite a bit, but it converges much faster. This shows that there can be a trade-off between prediction accuracy and the latent topic dimension. Moreover, we observed that using a smaller K in our model achieves accuracy similar to a CTR model that uses a larger K: our model with small K (50) gives performance similar to CTR with large K (200), and is thus about 30 times faster than the CTR model at the same level of prediction accuracy. However, when K is large (K > 50), our model tends to consistently give good prediction results for suitable values of λ_v, λ_q. All our experiments were run on single-core processors with 2∼4 GB RAM.

Figure 4.11: Plots of in-matrix prediction recall for the proposed model, varying the content parameter λ_v and social network parameter λ_q with M = 250 and K = 50 fixed. Only 'item tags' are used for modeling the content information

4.5.6 Discussion on Social Network Structure

In our experiments, we considered a 'final' static social network in which the relations between users are fixed (that is, they do not change with time). We showed that, given the user's social network, our model can more accurately predict the user ratings. It is possible that users form a social network because they like similar types of items (music or bookmarks), and that this social network dynamically evolves over time. Hence, we feel that using a final static social network could be a source of a potential information leak: our model could be making better predictions using future social network information. From our experience, none of the current literature discusses or incorporates this into the recommendation framework. To test our hypothesis, we conducted new experiments on the Delicious dataset considering the evolving social network structure. First, we obtained different training datasets based on the timestamps of the social network; then we evaluated the in-matrix recall on the test users considering both the 'final' and the 'timestamped' social network information. Figure 4.15 shows that using the 'final' static social network provides better recall than using the timestamped (evolving) social network. In fact, we observed that the smaller the training dataset, the larger the information leak in the recommendation system.

Figure 4.12: Plots of in-matrix prediction recall for our proposed model, varying the content parameter λ_v and social network parameter λ_q with M = 250 and K = 50 fixed. Both the 'user and item tags' are used for modeling the content information
Figure 4.13: Recall of the in-matrix prediction task for our proposed model by varying the content parameter λ_v at M = 250 and K = 50.

Figure 4.14: Recall of the in-matrix prediction task for our proposed model by varying the social network parameter λ_q at M = 250 and K = 50.

Table 4.3: User Latent Space Interpretation for Item + User tags. Tags highlighted in red correspond to music genres/sub-genres; tags highlighted in blue correspond to similar artists, albums, labels or band names; tags highlighted in green correspond to locations or other noisy tags. Please note that some tags can be highlighted by more than one color since the tags are ambiguous. It is suggested to view the table in a color printout.

User 1
  Top 3 topics:
    1. mesajah, hearttoheartca, mod revival, progressive metal, gangsta rap, alternative rock, electronic, indie 80's, ethereal
    2. ironjesus, rockfest, saez, goatmachine, aya, undead, childhood battle metal, velha, pullman
    3. undead, covers, dude, unknown, caia, subsonica, bocscadet, haeiresis, umay, somali
  Top 5 artists: 1. Duran Duran  2. Morcheeba  3. Air  4. Hooverphonic  5. Kylie Minogue

User 2
  Top 3 topics:
    1. dert, defunct, himekami, invader, scadet, makoto, chamber, drewsifstalin, emicida, yuri
    2. lavigne, michael, brokenfistmoscow, rinta, snaper, nicolas, soweto, aguabella2007, uhhyeahdude, ca
    3. bytes, anontron, cephyrius, dogs, mc, pullman, brainforestrytual, hagal, id3, yosuke
  Top 5 artists: 1. Segue  2. Max Richter  3. Celer  4. Pjusk  5. Pleq & Segue

4.6 Future Work

Attention has been shown to impact the popularity of memes (138), (140), what people retweet (63), (45) and the number of meaningful conversations they can have on Twitter (75). Users of a social network have limited attention, which limits their ability to process all the recommended items (updates) from friends.

Table 4.4: User Latent Space Interpretation for Item tags. Tags highlighted in red correspond to music genres/sub-genres, tags highlighted in blue correspond to similar artists, albums, labels or band names, and tags highlighted in green correspond to locations or other noisy tags. Please note that some tags can be highlighted by more than one color since the tags are ambiguous.

User 1
  Top 3 topics:
    1. sunset, emily, glee, md, drivers, koma, rnb, instrumentalism, deleted, nj
    2. melody, mood, hammered, elena, dulcimer, mali, depp, beyer, south
    3. anna, louis, das, khan, haruka, ukulele, bas, references, peaceful, brandenburg

User 2
  Top 3 topics:
    1. temple, zone, click, robot, masta, soweto, nberg, seychelles, vegan, myspace
    2. aphex, mash, absurdity, hoo, boogieman, idm, ragga, remixes, bavaria, vegan
    3. des, dooley, lab, altai, brighton, reject, salzburg, vid, references, mathy

Figure 4.15: Recall of the proposed model by varying the social network structure.
Dataset used: hetrec2011-delicious-2k.

Moreover, users tend to pay attention to only some friends; i.e., they divide their attention non-uniformly over their friends and interests. So, in our future work, we plan to incorporate this limited, non-uniformly divided attention model of social media users directly into our social recommendation system.

Table 4.5: User Latent Space Interpretation for User tags. Tags highlighted in red correspond to music genres/sub-genres, tags highlighted in blue correspond to similar artists, albums, labels or band names, and tags highlighted in green correspond to locations or other noisy tags. Please note that some tags can be highlighted by more than one color since the tags are ambiguous.

User 1
  Top 3 topics:
    1. atmospheric, synth pop, psychedelic rock, electronic, polish rock, electroclash, glam rock, gregorian chant, 80's, 80s
    2. instro rock, synth pop, atmospheric, glam rock, polish rock, punk rock, doom metal, synthpop, old school, female and male vocalists
    3. downtempo, ambient, pop, relaxing, classic rock, sad and slow, good mood, deutsch, polish, 80's

User 2
  Top 3 topics:
    1. pacific, nice, afrocuban, qari, sahara, greek hip-hop, electropunk, filmgroove, kabyle, favorite tracks
    2. downtempo, ambient, pop, classic rock, relaxing, 80's, polish, sad and slow, good mood, deutsch
    3. kim jong kook, hit, berlin, hi-nrg, glitch hop, schlager, synthpop, glam rock, techno hit, girl groups

Table 4.6: Time complexity comparison of our model with respect to the CTR model (K = 200, convergence rate = 10^-6)
  Model       Time taken (hrs), λ_v < 1   Time taken (hrs), λ_v > 1   Avg. time taken (hrs)
  CTR         9.45                        9.47                        9.46
  Our model   8.47                        10.59                       9.53

Table 4.7: Time complexity comparison of our model for varying latent space dimensions (convergence rate = 10^-5)
  K     Time taken      Avg. recall at λ_v = 100, λ_q = 1
  50    ∼30 seconds     0.412
  200   ∼3.1 hours      0.457

Incorporating Limited Attention into our social recommendation system

Here, we propose a new recommendation system that seamlessly integrates the attention model into our social recommendation system (112). Our proposed probabilistic model is shown in Figure 4.16. In this model, we use the source of adoption S and the user preferences U to model the limited attention variable φ. The social latent variable is ψ and the latent item variable is V. The variables W, R, Q are observed in the model, while the variables θ, U, V, S, ψ, φ are learned (discussed below).

Figure 4.16: Incorporating Limited Attention into our CTR with SMF model.

Learning the parameters of the Limited Attention social recommendation model

Since our model is hierarchical (Figure 4.16), we can use factorization to solve for the posterior distribution. Writing out the joint probability of all the unknown variables given the known variables, we get

P(φ, U, S, V, θ, ψ, Z | R, Q, W, α, β) = P(φ|S, U) P(Q|ψ, φ) P(R|φ, V) P(θ|α) P(Z|θ) P(W|Z, β) P(ψ) P(V) P(S) P(U)    (4.19)

where we define each variable as follows:

φ_j^i ∼ N(S_j^i û_i, λ_φ^{-1} I_K),  S_k^i ∼ N(0, λ_S^{-1} I_N),  u_i ∼ N(0, λ_U^{-1} I_K),  ψ_i ∼ N(0, λ_ψ^{-1} I_K)    (4.20)

where k = 1 : K, with K the number of topics, and i, j = 1 : N, with N the number of users in the dataset. We define û_i as the diagonal matrix obtained by taking the outer product of u_i with the K-dimensional identity matrix, i.e., û_i = u_i ⊗ I_K.

Maximization of the posterior is equivalent to maximizing the complete log-likelihood of U, V, S, θ, ψ, R and Q given λ_U, λ_V, λ_S, λ_ψ, λ_Q, λ_φ, α and β.
L = − (λ_U/2) Σ_{i=1}^{N} u_i^T u_i − (λ_V/2) Σ_{j=1}^{D} (v_j − θ_j)^T (v_j − θ_j)
  + Σ_{j=1}^{D} Σ_t log( Σ_k θ_{jk} β_{k,w_jt} )
  − Σ_{i=1}^{N} Σ_{j=1}^{D} (c_{ij}^r / 2) (r_ij − g_r(φ_i^T v_j))^2
  − (λ_S/2) Σ_{i=1}^{N} Σ_{j=1}^{N} (S_j^i)^T S_j^i
  − (λ_φ/2) Σ_{i=1}^{N} Σ_{j=1}^{D} (c_{ij}^φ / 2) (φ_ij − g_φ(S_ij^T û_i))^2
  − (λ_ψ/2) Σ_{i=1}^{N} Σ_{j=1}^{N} (ψ_j^i)^T ψ_j^i
  − (λ_Q/2) Σ_{i=1}^{N} Σ_{j=1}^{D} (c_{ij}^q / 2) (q_ij − g_ψ(φ_ij^T ψ̂_i))^2    (4.21)

We can optimize this function by a gradient ascent approach, iteratively optimizing the collaborative filtering and social network variables u_i, v_j, ψ_i, S_ij, φ_ij and the topic proportions θ_j. Thus, we get the following update equations:

u_i ← (λ_u I_K + λ_φ S_i C_i^φ S_i^T)^{-1} λ_φ S_i C_i^φ φ̃_i^T
v_j ← (λ_v I_K + φ̂ C_j^r φ̂^T)^{-1} (φ̂ C_j^r R_j + λ_v θ_j)
ψ_i ← (λ_ψ I_K + λ_q φ_i C_i^q φ_i^T)^{-1} (λ_q φ_i C_i^q Q_i)
S_ij ← (λ_S I_K + λ_φ û_i C_ij^φ û_i^T)^{-1} λ_φ C_ij^φ û_i φ_ij
φ_ij ← (λ_φ C_ij^φ I_K + V C_i^r V^T)^{-1} (V C_i R_i + λ_φ C_ij^φ û_i S_ij)    (4.22)

where φ̃ = Σ_{k=1}^{K} φ_k and φ̂ = Σ_{j=1}^{N} φ_j.
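To illustrate the form of these updates, the following numpy sketch (hypothetical toy data and shapes; a sketch of the general pattern, not our actual implementation) implements the v_j update from Eq. (4.22); each of the other updates is an analogous regularized linear solve:

```python
import numpy as np

# Hypothetical toy dimensions: K topics, N users, D items.
K, N, D = 10, 50, 40
rng = np.random.default_rng(0)
Phi = rng.random((K, N))                             # attention vectors phi_i stacked as columns
R = rng.integers(0, 2, size=(N, D)).astype(float)    # observed (implicit) ratings
Theta = rng.dirichlet(np.ones(K), size=D).T          # topic proportions theta_j as columns
lam_v, a, b = 100.0, 1.0, 0.01                       # regularizer and confidence weights

def update_v(j):
    """v_j <- (lam_v I + Phi C_j Phi^T)^{-1} (Phi C_j R_j + lam_v theta_j)."""
    c_j = np.where(R[:, j] > 0, a, b)        # per-user confidence c_ij (higher for observed)
    lhs = lam_v * np.eye(K) + (Phi * c_j) @ Phi.T
    rhs = Phi @ (c_j * R[:, j]) + lam_v * Theta[:, j]
    return np.linalg.solve(lhs, rhs)

V = np.column_stack([update_v(j) for j in range(D)])
```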
4.7 Conclusions

In this chapter, we presented a generalized hierarchical Bayesian model that exploits the user's social network and the item's content information to recommend items to users. The proposed model seamlessly integrates topic modeling and social matrix factorization into the collaborative filtering (CF) framework for accurate recommendation. Our experiments on real-world datasets showed that our proposed model outperforms state-of-the-art approaches such as CTR and matrix factorization in prediction accuracy. For future work, we plan to address several key challenges, such as: how does our proposed model perform for very large datasets, i.e., how can it be made scalable for large real-life datasets? And how can the dynamics of the evolving social network be captured in our model, and what is their effect on prediction accuracy?

In this chapter, we modeled the artist's (item) content information using the 'user' and 'item' tags, while features from the music signal were not used for modeling the topics. We feel that audio features from the music could be cleverly incorporated to model the content of an artist; however, audio features cannot be used to interpret the user's latent space. Thus, in future work, we feel that both the audio signal and the user/item tags should be explored for jointly modeling and interpreting the content and user latent spaces.

Chapter 5

Personalized Group Recommender Systems

5.1 Introduction

A recommendation system is an important exploration paradigm that retrieves interesting items for users based on their tastes, past activities and friend suggestions. Single-user recommendation has been well studied in the past few decades and has been successfully used in commercial services such as Amazon, Netflix, etc. However, many real-life scenarios ask for item/activity recommendation to a group of users, for example, recommending a movie for friends to watch together or recommending a good restaurant for colleagues to have a work lunch. Recommendation to groups is, in general, a challenging problem, since the users of a group may or may not share similar tastes, and user preferences may change due to the other users in the group. Therefore, it is important for recommender systems to capture group dynamics such as user interactions, user group membership, user influence, etc., to do better group recommendation. Location-Based Social Networks (LBSNs) such as Google+ Local, Foursquare and Gowalla(1) have become ubiquitous platforms for users to share meetups/activities with family and friends. Similarly, Event-Based Social Networks (EBSNs) such as Meetup, Plancast, Eventbrite, etc., have become popular and convenient platforms for users to coordinate, organize and participate in social meet-ups/events and share these activities with their contacts.

(1) Gowalla was an LBSN operational till 2012, when it was acquired by Facebook.

Recommendation in LBSNs has been researched in the past few years (159), (149), (66) to recommend activities or friends to users of an LBSN. Similarly, event recommendation has recently been studied (154), (47), (115) for recommending events, or event-sponsoring groups, to EBSN users. Most of these previous works are designed for single-user recommendations and make use of user event/location check-in information. However, many users use LBSNs and EBSNs to organize personal (and sometimes professional) group activities (such as dining with friends, or a research talk at the office), since LBSN/EBSN services provide easy-to-use online interfaces together with rich user interaction and networking options. We believe that LBSNs/EBSNs provide a natural platform for studying the recommendation of location activities/events to groups of users, i.e., group location-activity or group-event recommendation.

Personalization of group recommendation is possible for LBSNs and EBSNs, since these online social networks provide rich content (location/event descriptions, reviews, time-stamps) and social network information, which help in accurate modeling of group dynamics. In this chapter, we study the problem of personalized group recommendation in LBSNs and EBSNs in the context of social, spatial and temporal settings, and we propose an efficient recommendation modeling framework to address it.

Group recommendation is a classical problem in social choice theory and in the information retrieval (recommender systems) field. Traditional group recommendation approaches have mainly focused on aggregating individual group members' preferences to produce recommendations for a group. However, such approaches cannot handle group dynamics or data sparsity, and they do not work well for cold-start recommendation. Moreover, most aggregation methods (11), (99) do not even model the social relationships within the group. To overcome the shortcomings of the existing approaches, we propose novel probabilistic models for joint collaborative group-activity or group-event recommendation. The main idea of our approach is to model the groups with a generative process to capture the group dynamics, to model the activities/events at a location using a topic model to capture their semantics, and finally to use a collaborative filtering framework to perform the group-activity or group-event recommendation. The modeling of group dynamics relates to learning and inferring group preferences from the user interactions, user group engagement and location-temporal information of the LBSNs/EBSNs. The group generative process and collaborative filtering framework allow our recommender system to efficiently address the cold-start problem for new/infrequent groups/events by transferring relevant historical knowledge about a group-event pair from a similar group-event pair.

In this chapter, we present a class of collaborative filtering based hierarchical Bayesian models for personalized group recommendation, and we study the impact of the model parameters on group recommendation.
Our contributions include: (1) mining unstructured spatial-temporal social network data to model groups and location activities/events, (2) proposing a generative probabilistic modeling framework for learning group dynamics and user behaviors in groups, (3) studying the impact of model parameters and user influence on personalized group recommendation, and (4) handling the data sparsity and cold-start recommendation challenges associated with personalized group recommendation.

The remainder of this chapter is arranged as follows: in Section 5.2, we provide an overview of the previous work on group, location and content recommender systems. In Section 5.3, we present our modeling framework and discuss how to learn its parameters and do inference. The experimental results and discussion are presented in Section 5.5, followed by conclusions and future work in Section 5.8.

5.2 Related Work

In this section, we present a brief overview of state-of-the-art techniques for group, location and event-based recommender systems. Group recommendation has been widely studied in the social choice theory (108) and information retrieval (31) fields. Group recommender systems can be classified into two categories: aggregated models, which aggregate individual user data into group data and generate recommendations based on the aggregated group data; and aggregated-prediction based group recommenders, which aggregate the predictions for individual users into group predictions. Aggregation models are quite popular due to their simplicity; however, they have several drawbacks: (1) they cannot handle group dynamics or data sparsity, (2) they cannot perform cold-start recommendation, and (3) they do not model the social relationships of the users of a group.

Latent-factor models based on matrix factorization (118) have shown promise in better rating prediction for item/content recommendation, since they efficiently incorporate user interests into the model. Collaborative Topic Regression (CTR) (135) is a state-of-the-art latent model for item recommendation that combines the merits of both traditional collaborative filtering and probabilistic topic modeling approaches. CTR with social matrix factorization (112), presented in the previous chapter, is a state-of-the-art item recommendation approach for social network users. These state-of-the-art recommenders were designed for single-user recommendations, and therefore cannot be used directly for group recommendation.

Location-based recommender systems have been actively studied in the past few years; (12), (90), (148), (155) have studied link prediction and location-based venue and event recommendation in LBSNs. Similarly, event recommendation in EBSNs has recently been studied in (47), (115), (154) for recommending events, or event-sponsoring groups, to EBSN users. All these approaches are designed for single-user recommendation and cannot be easily extended to group recommendation in LBSNs/EBSNs. There are a few recent works on group recommendation in the social network domain. Recently, (147) proposed a social influence based group recommender model for item recommendation. Their idea is interesting; however, it cannot be used for LBSNs/EBSNs, since their model cannot capture group dynamics and cannot incorporate location/event-specific information for the personalization of group recommendation.
In (151), the authors proposed a probabilistic framework to model the generative process of group activities, though the final group recommendation is done by aggregation of user selections without directly learning group preferences, and is thus ineffective for personalizing group recommendations. Our literature survey on group recommendation showed that few of the previous works systematically address the personalized group recommendation problem. In this chapter, we propose a class of probabilistic models ((111), (110)) that incorporate group dynamics such as user influence, user interactions and user-group membership into a probabilistic framework, and we study their effect on personalized group recommendation in LBSNs and EBSNs. Our probabilistic approach models groups and activities/events as generative processes to capture group dynamics and location/event semantics respectively, and finally uses collaborative filtering to perform group recommendation. In the following sections, we describe our models and algorithms and report our experimental results on large real-world LBSN and EBSN datasets.

5.3 Our Approach

We first define the group recommendation problem in this section, and then we present our proposed collaborative filtering based Bayesian models in the following section.

5.3.1 Problem Statement

The problem of group-activity/group-event recommendation is defined as recommending a list of locations/events that the groups of an LBSN/EBSN may participate in. For an EBSN, it is related to the group-event rating prediction task, where the group's event participation is predicted as an (implicit) group-event rating. Similarly, for an LBSN, group-location rating prediction constitutes the group-activity recommendation. The definitions of groups in LBSNs and EBSNs are discussed below.

5.3.2 Groups and Group Dynamics

LBSNs and EBSNs are not identical and have their own unique features (94). So, groups are defined (slightly) differently for LBSNs and EBSNs due to the nature of the user interactions in these networks.

Let U, E, G_on, G_off represent the sets of Users, Events, Online Social Groups and Offline Social Groups of an EBSN. As the name indicates, 'Online Social Groups' correspond to groups whose users interact online (in the virtual world), i.e., these groups arise from online social interactions. On the other hand, 'Offline Social Groups' correspond to groups of users who physically meet and participate in events organized by members of online social groups. Offline social group users interact at a location during a particular interval of time while participating in a social event. Mathematically, online and offline social groups correspond to the connected components of the online and offline social network graphs (94). Throughout this chapter, we consider the 'offline social groups' as the 'groups' for our group-event recommendation task in EBSNs. Note: event-sponsoring online groups in an EBSN play the role of online communities in this work, and they do not correspond to 'offline social groups'.

Unlike EBSNs, LBSNs usually do not have well-defined 'groups' (i.e., there are no group check-ins in the dataset). So, we need to infer 'groups' and their check-ins from the user check-ins available in the LBSN. We define a 'group' in an LBSN as follows:

Definition 1. A 'Group' in an LBSN is a set of friends checked in at a location during a particular interval of time.

Definition 1 is quite intuitive. Mathematically, a group in an LBSN represents the 'connected components of the social network graph' during a time interval at a particular location. Figure 5.1 shows an illustration of a 'group' in an LBSN.

Figure 5.1: Groups in LBSN

To find the groups in an LBSN, we first sort all the user check-ins at a particular location using the check-in time-stamps; we then consider all the users within a particular duration of time, say 30 minutes or 1 hour, and find out how many of these users are connected to each other.
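A minimal Python sketch of this extraction step (with hypothetical toy check-ins and friendships; a simplified version of the procedure that ignores groups straddling window boundaries):

```python
from collections import defaultdict

# Hypothetical check-ins: (user, location, unix_time); friendships as an edge set.
checkins = [(1, 'cafe', 0), (2, 'cafe', 600), (3, 'cafe', 900), (4, 'cafe', 5000)]
friends = {(1, 2), (2, 3)}

def groups_at(location, window=3600):
    """Connected components of the friendship graph among users who
    checked in at `location` within the same time `window` (e.g., 1 hour)."""
    buckets = defaultdict(list)                 # bucket check-ins into windows
    for u, loc, t in checkins:
        if loc == location:
            buckets[t // window].append(u)
    groups = []
    for bucket in buckets.values():
        seen = set()
        for u in bucket:
            if u in seen:
                continue
            comp, frontier = {u}, [u]           # grow a component via BFS over friend edges
            while frontier:
                x = frontier.pop()
                for v in bucket:
                    if v not in comp and ((x, v) in friends or (v, x) in friends):
                        comp.add(v)
                        frontier.append(v)
            seen |= comp
            if len(comp) >= 2:                  # a group needs at least two connected friends
                groups.append(comp)
    return groups

print(groups_at('cafe'))   # [{1, 2, 3}] with the toy data above
```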
In this chapter, group dynamics corresponds to the user-user interaction, user-group membership and user-group influence characteristics. Social network links (online interactions) and users participating in an event or visiting a location (offline interactions) correspond to user-user interactions. User-group membership is captured when a user participates in an event (or visits a location) as a member of a group. User-group influence is captured by user participation in group events (activities) over time. All of these are encoded in the check-in/RSVP records, social network relations and location information present in the LBSN/EBSN data, and they will be captured in our generative models.

5.4 Personalized Collaborative Group Recommender Models

We first discuss the Personalized Collaborative Group Recommender (PCGR) (9) and then propose one extension, namely PCGR-D. Figure 5.2 shows our proposed probabilistic models, PCGR and PCGR-D. Our models belong to a class of generalized hierarchical Bayesian models, and they jointly learn the group and activity latent spaces. We use topic models based on Latent Dirichlet Allocation (LDA) (25) to model the location-activity or location-event descriptions and the user-group memberships, and we use matrix factorization to match the group latent features to the latent features of the locations/events. Our modeling framework fuses topic models with matrix factorization to obtain a consistent and compact latent feature representation.

PCGR

PCGR combines topic models of groups and location activities in a collaborative filtering framework. That is, PCGR represents groups with activity interests and assumes that locations are generated by a topic model that describes the activities that could be performed at a location. Furthermore, PCGR assumes that groups are generated from users who are drawn from latent communities. Thus, a user is a member of different communities, while a group is formed by multiple users belonging to different communities; in other words, a group is generated as a mixture of users from different communities. Communities are latent variables in our model, analogous to topics in LDA, and they represent collections of users with common interests. The group generative process of PCGR captures the user-user interaction and user-group membership dynamics. PCGR additionally includes latent variables ε_i and ξ_j, which act as offsets for the community topic proportions φ_i and the activity topic proportions θ_j respectively, when modeling the group-activity ratings. As more groups check in at a location, we get a better idea of what these offsets are. The offset variable ξ_j can explain, for example, that an activity at location L is more interesting to a group G_1 than it is to a different group G_2. Similarly, the offset ε_i can help explain how the community proportions of a group G play a role in the group preferring location L_1 over location L_2. How much the group-activity prediction relies on the location, and how much it relies on other groups, depends on the group preferences and on how many groups have checked in at that location.
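The role of the two offsets can be made concrete with a small numpy sketch (hypothetical toy vectors): the group and location latent vectors are topic proportions plus their learned offsets, and the predicted rating is their inner product; for a location with no check-ins, the offset is unavailable and the prediction falls back to the topic proportions alone (this corresponds to the in-matrix and out-matrix prediction rules derived later in Section 5.4.2):

```python
import numpy as np

K = 10
rng = np.random.default_rng(2)
phi_i = rng.dirichlet(np.ones(K))      # community proportions of group i
theta_j = rng.dirichlet(np.ones(K))    # activity topic proportions of location j
eps_i = 0.1 * rng.standard_normal(K)   # group latent offset (epsilon_i)
xi_j = 0.1 * rng.standard_normal(K)    # activity latent offset (xi_j)

g_i = phi_i + eps_i                    # group latent vector
a_j = theta_j + xi_j                   # location-activity latent vector

r_in = g_i @ a_j        # prediction when location j has check-in history
r_out = g_i @ theta_j   # cold-start prediction: no ratings, so xi_j is taken as 0
```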
Note: the term 'location-activity' in LBSNs is synonymous with 'event' in EBSNs in our generative models. We use 'location-activity' to describe the PCGR and PCGR-D models.

Figure 5.2: Personalized Collaborative Group Recommendation Systems. (a) PCGR; (b) PCGR-D.

Let G ∈ R^{K×P} and A ∈ R^{K×L} be the latent group and location-activity feature matrices in Figure 5.2a, with column vectors G_i and A_j representing the group-specific and location-activity specific latent feature vectors respectively. The conditional distribution over the observed group location-activity ratings R_{P×L} can be written as

P(R | G, A, σ_R²) = ∏_{i=1}^{P} ∏_{j=1}^{L} [ N(r_ij | g(G_i^T A_j), σ_R²) ]^{I_ij^R}    (5.1)

where N(x | μ, σ²) is the pdf of the Gaussian distribution with mean μ and variance σ_R², and I_ij^R is the indicator function that equals 1 if group i has rated location j, and 0 otherwise. Note that here we consider a group check-in as an implicit group rating for the LBSN. The function g(x) is the logistic function g(x) = 1/(1 + exp(−x)), which bounds the range of G_i^T A_j within [0, 1]. The group and location-activity latent feature vectors are generated as follows:

P(G | σ_G²) ∼ N(φ_i, λ_G^{-1} I_K)    (5.2)
P(A | σ_A²) ∼ N(θ_j, λ_A^{-1} I_K)    (5.3)

where λ_G = σ_R²/σ_G² and λ_A = σ_R²/σ_A². The generative process of the PCGR model is shown below:

1. For each group i,
   (a) Draw community proportions φ_i ∼ Dirichlet(γ)
   (b) Draw group latent offset ε_i ∼ N(0, λ_G^{-1} I_K) and set the group latent vector as g_i = ε_i + φ_i
   (c) For each user u_im in group i,
       i. Draw community assignment h_im ∼ Multinomial(φ)
       ii. Draw user u_im ∼ Multinomial(δ_{h_im})
2. For each location j,
   (a) Draw topic proportions θ_j ∼ Dirichlet(α)
   (b) Draw activity latent offset ξ_j ∼ N(0, λ_A^{-1} I_K) and set the location-activity latent vector as a_j = ξ_j + θ_j
   (c) For each word w_jn describing an activity at location j,
       i. Draw activity (topic) assignment z_jn ∼ Multinomial(θ)
       ii. Draw word w_jn ∼ Multinomial(β_{z_jn})
3. For each group-location pair (i, j), draw the rating r_ij ∼ N(g_i^T a_j, c_ij^{-1})

PCGR-D

In PCGR, we recommend activities based on the group preferences learned by the model. However, some users are experts (i.e., dominant users; the definition appears in a later section) in certain activities, and they have a tremendous influence on their group's activities. Even though PCGR captures user influence implicitly, we can explicitly model user influence in our PCGR framework. Here, we propose one extension model, PCGR-D, to incorporate user influence into the PCGR framework. In PCGR-D, we use a deterministic switch s_i to select the group preference according to the influential (expert/dominant) user in group i. The generative process of the PCGR-D model is similar to that of PCGR except for the generative process for groups. For brevity, below we only show the group generative process for PCGR-D.

1. For each group i,
   (a) Draw community proportions φ_i ∼ Dirichlet(γ)
   (b) Set the switch s_i = I(G_i, D_U)
   (c) If s_i = 0:
       - Draw group latent offset ε_i ∼ N(0, λ_G^{-1} I_K)
       - Set the group latent vector as g_i = ε_i + φ_i
   (d) If s_i = 1:
       - Draw user latent offset ε_i^u ∼ N(0, λ_U^{-1} I_K)
       - Set the group latent vector as g_i = ε_i^u + ψ_i
   (e) For each user u_im in group i,
       i. Draw community assignment h_im ∼ Multinomial(φ)
       ii. Draw user u_im ∼ Multinomial(δ_{h_im})

where I(G_i, D_U) is an indicator function that takes the value 1 if G_i ∩ D_U ≠ ∅, and the value 0 otherwise.
D_U is the set of dominant users in the dataset, G_i is the set of users in group i, ψ_i is the user-specific multinomial topic distribution, and ε_i^u is the offset for ψ_i. Note: in PCGR-D, we consider only one dominant user per group, i.e., the user who has the maximum influence on the group's activities. In this chapter, ψ_i is obtained using the CTR model (135).

5.4.1 Parameter Learning

We can use variational inference, Markov chain Monte Carlo, importance sampling and other approaches for Bayesian inference. In this work, we choose variational inference, for fair comparison with the state-of-the-art recommender systems and because, in general, it also converges faster. Here, we discuss the parameter learning for PCGR-D (for PCGR parameter learning, refer to (9)). Given the β, δ, ψ and s parameters, computing the full posterior of g_i, a_j, φ_i and θ_j is intractable. We propose an EM-style algorithm to learn the maximum-a-posteriori estimates. Maximization of the posterior is equivalent to maximizing the complete log-likelihood of G, A, θ_{1:L}, φ_{1:P} and R given λ_G, λ_A and β, δ, ψ, s:

L = − (λ_G/2) Σ_i (g_i − f_{s_i}(φ_i, ψ_i))^T (g_i − f_{s_i}(φ_i, ψ_i)) − (λ_A/2) Σ_j (a_j − θ_j)^T (a_j − θ_j)
  + Σ_j Σ_n log( Σ_k θ_{jk} β_{k,w_jn} ) − Σ_{ij} (c_ij/2) (r_ij − g_i^T a_j)^2 + Σ_i Σ_m log( Σ_p φ_{ip} δ_{p,u_im} )    (5.4)

where f_{s_i}(φ_i, ψ_i) = φ_i if s_i = 0, and f_{s_i}(φ_i, ψ_i) = ψ_i otherwise; λ_G = σ_R²/σ_G², λ_A = σ_R²/σ_A², and the Dirichlet priors (α and γ) are set to 1. We optimize this function by a gradient ascent approach, iteratively optimizing the collaborative filtering variables g_i, a_j and the topic proportions θ_j and φ_i. For g_i, a_j, the maximization follows similarly to matrix factorization (82). Given a current estimate of θ_j, φ_i, taking the gradient of L with respect to g_i and a_j and setting it to zero helps us find g_i, a_j in terms of G, A, C, R, λ_G, λ_A, ψ:

∂L/∂g_i = 0;   ∂L/∂a_j = 0    (5.5)

Solving the corresponding equations leads to the following update equations:

g_i ← (A C_i A^T + λ_G I_K)^{-1} (A C_i R_i + λ_G f_{s_i}(φ_i, ψ_i))    (5.6)
a_j ← (G C_j G^T + λ_A I_K)^{-1} (G C_j R_j + λ_A θ_j)    (5.7)

where C_i is a diagonal matrix with c_ij, j = 1…L, as its diagonal elements, and R_i = (r_ij)_{j=1}^{L} for group i. For each location j, C_j and R_j are defined similarly. Note that c_ij is the confidence parameter for rating r_ij; for more details, refer to (112). Equation (5.7) shows how the activity topic proportions θ_j affect the location-activity latent vector a_j, where λ_A balances this effect.

Given G and A, we can learn the activity topic proportions θ_j and the community proportions φ_i. We define q(z_jn = k) = π_jnk, then separate the items that contain θ_j and apply Jensen's inequality:

L(θ_j) ≥ − (λ_A/2) (a_j − θ_j)^T (a_j − θ_j) + Σ_n Σ_k π_jnk (log θ_jk β_{k,w_jn} − log π_jnk) = L(θ_j, π_j)    (5.8)

The optimal π_jnk satisfies π_jnk ∝ θ_jk β_{k,w_jn}. Note that we cannot optimize θ_j analytically, so we use projection gradient approaches to optimize θ_{1:L} and the other parameters G, A, π_{1:L}. After we estimate G, A and π, we can optimize β:

β_kw ∝ Σ_j Σ_n π_jnk 1[w_jn = w]    (5.9)

We solve for φ_i using a variational Bayesian approach that is very similar to solving for θ_j. Note that equation (5.9) is the same as the M-step update for the topics in LDA (25).
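As an illustration of the alternating updates above, here is a numpy sketch of the g_i step (Eq. (5.6)), including the PCGR-D switch f_{s_i}; all data below are hypothetical toy values, and the a_j step (Eq. (5.7)) has the same form with the roles of the group and location matrices exchanged:

```python
import numpy as np

K, P, L = 10, 30, 60                           # hypothetical: latent dims, #groups, #locations
rng = np.random.default_rng(1)
A = rng.random((K, L))                         # location-activity latent vectors a_j as columns
R = rng.integers(0, 2, size=(P, L)).astype(float)
Phi = rng.dirichlet(np.ones(K), size=P)        # community proportions phi_i
Psi = rng.dirichlet(np.ones(K), size=P)        # dominant-user topic vectors psi_i (from CTR)
s = rng.integers(0, 2, P)                      # switch: 1 if group i contains a dominant user
lam_G, a_c, b_c = 0.01, 1.0, 0.01

def update_g(i):
    """g_i <- (A C_i A^T + lam_G I)^{-1} (A C_i R_i + lam_G f_{s_i}(phi_i, psi_i))."""
    prior_mean = Psi[i] if s[i] == 1 else Phi[i]    # f_{s_i} selects the prior mean
    c_i = np.where(R[i] > 0, a_c, b_c)              # confidence c_ij per location
    lhs = (A * c_i) @ A.T + lam_G * np.eye(K)
    return np.linalg.solve(lhs, A @ (c_i * R[i]) + lam_G * prior_mean)

G = np.vstack([update_g(i) for i in range(P)])
```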
5.4.2 Prediction

After the optimal parameters G*, A*, θ*_{1:L}, φ*_{1:P} and β*, δ* are learned, our models (PCGR, PCGR-D) can be used for in-matrix and out-matrix prediction tasks (group-oriented recommendations). If D is the observed data, then both in-matrix and out-matrix predictions can be easily estimated. For an LBSN, in-matrix prediction refers to the case where a group has not rated (visited) a location, but that place has been visited by at least one other group. On the other hand, out-matrix prediction refers to the case where none of the groups have rated a particular place, i.e., the place has no rating records (no group check-ins). For an EBSN, in-matrix prediction is the case where a group has not RSVP'ed to an event but other groups have RSVP'ed for that event, and out-matrix prediction refers to the case where the event is new and none of the groups have RSVP'ed to it. For in-matrix prediction, we use the point estimates of g_i, θ_j and ξ_j to approximate their expectations:

E[r_ij | D] ≈ E[g_i | D]^T (E[θ_j | D] + E[ξ_j | D])    (5.10)
r*_ij ≈ (g*_i)^T a*_j    (5.11)

In out-matrix prediction, the place is new, i.e., it does not have any ratings (the cold-start recommendation setting for a location/event). Thus E[ξ_j] = 0, and we predict the ratings as:

r*_ij ≈ (g*_i)^T θ*_j    (5.12)

5.5 Experiments

We conduct experiments to compare the performance of PCGR and PCGR-D with other state-of-the-art group recommender techniques. We evaluate our models on the Gowalla dataset (39) and the Meetup dataset (94) for group recommendation in LBSNs and EBSNs respectively. Our experiments help us answer the following key questions:

• How do our models perform with respect to the state-of-the-art group recommenders?
• How do the model parameters (λ_G, λ_A) affect the prediction accuracy?
• How does a dominant user influence personalized group recommendations?
• What are the group characteristics of LBSNs and EBSNs?
• How can the group preferences learned by our models be interpreted?
• How does a user's group engagement (user behavior) affect group recommendation?

The first three questions are answered by the experimental results from our models, and the next three questions are answered by understanding the dataset characteristics and the group preferences learned by our models.

5.5.1 Dataset Description

Our experiments were conducted on the Gowalla and Meetup datasets. Gowalla was a real-world LBSN, and Meetup is a real-world EBSN. Tables 5.1 and 5.2 describe these datasets. Since the Gowalla dataset does not contain location content information, it was crawled from Foursquare using its API and the locations' geo-coordinates provided in the Gowalla dataset.

Table 5.1: Gowalla Dataset Description
  Users: 196591
  Relations: 950327
  Check-ins: 6442890
  Locations: 1280969
  Total # of unique groups @ 1 hr: 36669
  # of unique groups with at least 10 check-ins @ 1 hr: 2702
  # of locations checked in by the 2702 groups: 94067
  Total # of group check-ins of the 2702 groups (sparsity %): 57669 (99.9%)

Table 5.2: Meetup Dataset Description
                                                     Meetup (Full)   Meetup (Small)
  # Users                                            4,111,476       3,650
  # Events                                           2,593,263       27,244
  # Online groups                                    70,604          454
  # Offline groups of size 2                         345,998         2,000
  # Offline groups of size 2 at at least 10 events   126,141         511
  # Events attended by offline groups of size 2      790,547         9,510

5.5.2 Gowalla Dataset Characteristics

In this section, we present the group characteristics and dominant user characteristics of the Gowalla dataset.

5.5.2.1 Characteristics of Gowalla groups

The Gowalla dataset provides user check-ins, user social relationships and location geo-coordinates.
In our experiments, we generated the 'groups' in the following way: first, we clustered all the users present at a particular location during a particular duration of time, say 1 hour; then we considered only the groups that were connected components (refer to Section 5.3.2). The time slot of 1 hour was chosen based on the reasonable assumption that the users of a group check in at a particular location within 30 minutes of each other. Note that, due to the large time slot interval (1 hour) and the sparsity of user check-ins in the LBSN, there were few groups spread across two consecutive 1-hour time slots. Based on this procedure, we found that there are around 36669 unique groups in the Gowalla dataset, who have checked in at 94067 unique locations. To make our group recommendation comparison study in Section 5.6.2 more sound, we only considered groups that had at least 10 check-ins. In Table 5.1, we see that the Gowalla dataset has 2702 groups (of group sizes 2 to 4) with at least 10 check-ins. A total of 4870 users form these 2702 groups. These 2702 groups have checked in at 47761 unique locations, while the 4870 users have checked in at 348338 unique locations in total. The total number of group-location check-ins for the 2702 groups is 57669. We consider a group-location check-in as a group rating for that location. Thus, we observe that our dataset is 99.96% sparse.

Figure 5.3 shows the distribution of group check-ins with respect to group size in the entire dataset. Figure 5.3a plots the total group check-ins (of the 36669 groups) against group size, while Figure 5.3b plots the total number of unique groups against group size. We observe a Pareto-like distribution for groups and group check-ins in these plots. In the following sections, we study the characteristics of the group and user location check-ins of the 2702 Gowalla groups. We also study how modeling the group dynamics impacts the group preferences, in the context of group and user behaviors.

Figure 5.3: Gowalla group location characteristics. (a) Group location check-ins; (b) Unique groups.

5.5.2.2 Group-User-Location Characteristics

Figure 5.4 shows the characteristics of the locations checked in (in relative %) by the 2702 groups and by the users (4870) of these groups. From Figure 5.4a, we see that users visit many more places alone than in groups. Figure 5.4b shows that the places visited by groups have little overlap with the places visited by their users. To better understand the group and user check-ins with respect to locations, let us study their characteristics. Let X denote the set of places visited by a group G, and let Y_i denote the set of places visited by user i of the group G. Let R denote the ratio of the number of places visited in common by the users of the group to the number of places visited by the group (#X denotes the number of elements of X):

R = #(X ∩ (∪_i Y_i)) / #X

From Figure 5.4, we can observe the following in our dataset. When R ≤ 0.5, we observe that 62% of groups have visited 50% new places that were not visited by the users of the group (i.e., users check in at a lot of new, different places when in groups). For R ≤ 0.25, we observe that 27% of groups have visited 75% new places that were not visited by the users of the group. R = 0 holds for 90 groups, i.e., 3.5% of all groups visited only new places that were not visited by the individual users of the groups; the group behavior of these groups is completely different from the individual user behaviors, that is, for these groups the group preferences cannot be learned by simply aggregating individual user preferences.
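For clarity, the ratio R can be computed with plain set operations; a tiny Python sketch with hypothetical toy data:

```python
# Hypothetical toy sets of visited places
group_places = {'park', 'cafe', 'museum', 'gym'}     # X: places visited by group G
user_places = [{'park', 'mall'}, {'cafe', 'zoo'}]    # Y_i: places visited by each user i

def overlap_ratio(X, Ys):
    """R = #(X ∩ (∪_i Y_i)) / #X: fraction of the group's places that were
    also visited individually by at least one member."""
    union = set().union(*Ys)
    return len(X & union) / len(X)

print(overlap_ratio(group_places, user_places))  # 0.5: half the group's places are 'new'
```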
Figure 5.4: Gowalla group location characteristics. (a) Group-location characteristics; (b) Group-user-location characteristics. (Please view this figure in a color printout.)

5.5.2.3 User Group Memberships

Figure 5.5 shows the plot of group memberships for the users of the Gowalla dataset. From this figure, we clearly see that many users (4449 users, i.e., 91% of the 4870 users) are members of exactly one of the 2702 groups, while 421 users (i.e., 9% of all users) are members of more than one group. The users with multiple group memberships help us understand how they influence the preferences of the different groups they are part of.

Figure 5.5: User group memberships

Figure 5.6 shows two users who have multiple group memberships; in fact, both belong to 7 different groups (of different group sizes), and they are not connected to each other in the LBSN. Figure 5.6 shows that User 1 has more check-ins at different location types when compared to User 2. Moreover, from Table 5.3, we see that User 1 performs different activities when in different groups. On the other hand, from Figure 5.6b and Table 5.3, we observe that User 2 performs similar activities in different groups. From these examples, we can infer two things: (1) the group memberships of users affect group activities, so it is important to directly model users' group associations (from group check-ins) for learning group preferences, instead of simply aggregating user preferences (from user check-ins); and (2) some users (like User 2) can be considered 'experts' at certain activities (baseball, electronics store, gym, etc.), and this information could be useful for improving group recommendations. In the results section, we will look at how our model does group recommendation when users belong to multiple groups.

5.5.3 Dominant User Characteristics in LBSN

In this section, we study dominant user characteristics with respect to group location check-ins in the Gowalla dataset. We define a 'dominant user' as the user of the group who has the most overlap with the locations visited by the group. Mathematically, we define the dominant user as follows. Let X denote the set of places visited by a group G, and let Y_i denote the set of places visited by user i of the group G.

Figure 5.6: Users' behavior in different groups, analyzed from group location check-ins. (a) User 1; (b) User 2.
These example plots are for two different users (User 1 and User 2) who are each in 7 different groups.

Table 5.3: User 1 and User 2 group location check-ins
            User 1: Top 3 location-activities     User 2: Top 3 location-activities
  Group 1   Movie Theater, Food, Arts             Gym, Electronics Store, Office
  Group 2   Football, Baseball, Sandwiches        Gym, Baseball, Office
  Group 3   Football, Gym, Hotel Bar              Gym, Electronics Store, Movie
  Group 4   Comedy Club, Entertainment, Pub       Office, Cafe, Electronics Store
  Group 5   Apparel Shopping, Stores, Cafe        Baseball, Burgers, Desserts
  Group 6   Museum, Theme Park, Train             Pub, Baseball, Office
  Group 7   Travel, Gym, Coffee Shop              Library, Performing Arts, French Class

The dominant user d of G is then given by d = arg max_i #(Y_i ∩ X). Here, ∩ denotes set intersection, and #(X) denotes the total number of elements of the set X. Let R_d denote the ratio of the places visited in common by a dominant user and the group to the places visited by the group. Thus, R_d is given by

R_d = #(Y_d ∩ X) / #(X)
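A small set-based Python sketch of the dominant user and R_d computation (hypothetical toy data):

```python
# Hypothetical toy data
X = {'park', 'cafe', 'museum', 'gym'}              # places visited by group G
Y = {1: {'park', 'cafe', 'gym'}, 2: {'museum'}}    # places visited by each user

def dominant_user(X, Y):
    """d = argmax_i #(Y_i ∩ X): the member with the largest overlap with the group."""
    return max(Y, key=lambda i: len(Y[i] & X))

d = dominant_user(X, Y)
R_d = len(Y[d] & X) / len(X)   # R_d = #(Y_d ∩ X) / #(X)
print(d, R_d)                  # 1 0.75
```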
The ratio R_d indicates how much a dominant user influences the group in terms of the group's location check-ins. In our Gowalla dataset, we observe the following dominant user characteristics in terms of R_d. When R_d ≥ 0.8, we see that 170 groups (∼6%) have a dominant user who influences the group in at least 80% of the places visited by the group. When R_d ≤ 0.50, we see that 2000 groups (∼74%) have a dominant user who influences the group in at most 50% of the places visited by the group. When R_d = 0, we see that there are 80 groups (∼3%) that have no dominant user. On the other hand, when R_d = 1, we observe that 80 groups (3%) are completely influenced by the dominant user. These observations indicate that there are a few groups where the dominant user has a large influence on the group check-ins, while there are many groups where the dominant user's influence on the group check-ins is limited. Thus, a user's influence on group check-ins (and group preferences) depends on how dominant the dominant user (expert) is. In the results section, we study how our models utilize the dominant user's influence for personalized group recommendation.

5.5.4 Meetup Dataset Characteristics

The event and online group participation characteristics and the network properties of EBSNs have been well studied in (94) for the Meetup dataset. In the following sections, we present the characteristics of offline social group event participation, offline-online social interactions and offline social network properties (such as the locality of the social interactions of groups) on the Meetup (Full) dataset.

5.5.5 Offline Social Group Event Participation in EBSN

To understand the group-event characteristics of the Meetup dataset, we first study the event participation of the offline social groups. Figure 5.7a shows that many events have a small number of participants (note: an event corresponds to one group count in the plot), i.e., many offline social groups have a small group size. However, there are a few events with a large number of participants (i.e., a heavy tail exists). A smaller group size indicates tighter social connections. Figure 5.7b shows the user-group membership characteristics: we observe that many users belong to only one group; however, there are a few users who attend several events in different groups, i.e., these users are members of many different offline groups. The users with multiple group memberships help us understand how they influence the preferences of the different groups they are part of.

Figure 5.7: Meetup offline group characteristics. (a) Offline social groups vs. event participation; (b) User group count vs. user group membership size.

5.5.6 Offline-Online Social Interactions in EBSN

In our experiments, we observed that the users of the offline social groups need not be members of the same online groups. In fact, for the offline social groups with a smaller group size (e.g., size < 5), we found that there is little or (in some cases) no overlap between the online social groups of these offline social groups' users. This shows that even though users join multiple online groups, they participate as offline group users only in some events. This observation can be used to explain the online and offline social network statistics discussed in (94).

5.5.7 Offline Social Group Properties in EBSN

In this section, we study the geographical aspects of offline social groups with smaller group sizes (size < 5). In Figure 5.8a, we examine the distance between the users' home locations for the offline group users. This plot shows that many groups have users who live quite close to each other; in fact, we observe that nearly 91.52% of all the offline groups (of group size 2) have users who stay within 50 km of each other. Figure 5.8b shows the geographical distance between the event location and the groups. This distance, denoted the group-event distance, captures how far the users of an offline group will travel to attend an event. Since a group is not localized, and consists of users who have different home locations, we considered three ways to measure the group-event distance: 1) the average of the distances of the group's user home locations to the event location, 2) the minimum of the distances of the group's user home locations to the event location, and 3) the maximum of the distances of the group's user home locations to the event location. In Figure 5.8b, we show the CDF of the group-event distances for these measures, and we observe that 93.55% of events are located within 50 km when measured by the average group-event distance, while 79.17% and 71.14% of events are located within 50 km if the maximum and minimum group-event distances, respectively, are used as the geographical distance. This shows that groups participate in events that are close to the group's user home locations. This was observed to be true in general for the groups with smaller group sizes (size < 5); however, the group-event distance for large offline groups (size ≥ 5) needs further study.

Figure 5.8: Meetup offline groups and event localities. (a) Locality of offline group users; (b) Locality of events w.r.t. offline groups.

Studying the group characteristics of the LBSN and EBSN datasets helps us better understand the underlying group dynamics of these social networks, and it also helps in the interpretation of our group recommendations.
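As a concrete reading of the three group-event distance measures, here is a short Python sketch; the exact distance computation used in our measurements is not specified above, so the haversine great-circle formula below, like the toy coordinates, is an assumption:

```python
import math

def haversine_km(p, q):
    """Great-circle distance between (lat, lon) pairs, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

# Hypothetical home locations of a two-member offline group and one event
homes = [(34.05, -118.25), (34.15, -118.14)]   # (lat, lon) of group members
event = (34.07, -118.40)

dists = [haversine_km(h, event) for h in homes]
avg_d, min_d, max_d = sum(dists) / len(dists), min(dists), max(dists)
```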
5.6 Experimental Settings

In our experiments, we split each of the Gowalla and Meetup (small) datasets into three parts: training (∼80%), held-out (∼5%) and test (∼15%). The model is trained on the training data, the optimal parameters are obtained on the held-out data, and ratings are predicted for the test data. We ran 5 simulations for all our comparison experiments; for each simulation, the dataset splitting was done randomly. For collaborative filtering based on matrix factorization (MF), we used grid search to find parameters that give good performance on the test data. We found that λ_u = 1, λ_v = 0.01, K = 50 gives good performance for the MF approach. For Collaborative Topic Regression (CTR), we chose the parameters similarly to the MF approach, and found that λ_u = 0.01, λ_v = 1, K = 50 gave the best prediction accuracy. (λ_u and λ_v are the user and item (location) regularization parameters in the MF and CTR models.) For our models, we fix the confidence parameters C by setting a = 1, b = 0.01, and we vary the parameters λ_G, λ_A to study their effect on the prediction accuracy. Note: a and b are tuning parameters (a > b > 0) for the confidence parameters c_ij in the CTR, PCGR and PCGR-D models. All our experiments were run on Intel quad-core 2.2 GHz CPUs with 8 GB RAM.

5.6.1 Evaluation Metrics

For performance evaluation, we consider three metrics: (1) average group-activity/group-event rating prediction accuracy (avg. accuracy), (2) average root mean squared error (avg. RMSE), and (3) average recall. We define the avg. accuracy on the test data as the ratio of correctly predicted ratings to the total number of ratings in the test data. The RMSE is given by:

RMSE = sqrt( (1/|T|) Σ_{(i,j)∈T} (R̂_ij − R_ij)² )

where R̂_ij are the predicted ratings of the group-activity (group-event) pairs (i, j) in a test set T, and R_ij are the true ratings. The average of the RMSE over the test sets is denoted avg. RMSE. 'Recall' only considers the activities/events participated in within the top M suggestions. For each group, we define recall@M as:

recall@M = (# events the group participates in among the top M suggestions) / (# events the group participates in)

The above equation calculates a group-oriented recall; we can similarly define an activity/event-oriented recall. For consistency and convenience, we use the group-oriented recall for in-matrix and out-matrix prediction throughout this chapter. Higher avg. accuracy, lower avg. RMSE and higher recall at smaller M mean a better recommender system.
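The two quantitative metrics translate directly into code; a minimal Python sketch with hypothetical toy inputs:

```python
import math

def rmse(pred, true):
    """RMSE over a test set T of (group, event) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def recall_at_m(ranked_events, attended, M):
    """Fraction of a group's attended events found in its top-M suggestions."""
    return len(set(ranked_events[:M]) & attended) / len(attended)

# Hypothetical usage
print(rmse([0.9, 0.1, 0.8], [1, 0, 1]))                          # ~0.141
print(recall_at_m(['e1', 'e2', 'e3', 'e4'], {'e1', 'e4'}, M=2))  # 0.5
```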
5.6.2 Evaluated Recommendation Approaches

We compare our PCGR and PCGR-D models with the following popular state-of-the-art recommendation systems:

• Matrix Factorization (MF): MF (118) is a state-of-the-art and very popular collaborative filtering model used for item recommendation.
• Collaborative Topic Regression (CTR): CTR (135) is a state-of-the-art method which uses LDA to model the content for item recommendation.
• Aggregation methods for group recommendation: we considered the following popular aggregation methods (31), (147): (a) the least misery method, (b) the most pleasurable method, (c) the averaging method, and (d) plurality voting.

The user preferences for the aggregation methods were learned using the MF model, since it is a popular state-of-the-art method for learning user preferences. The averaging method performs the best among all the aggregation methods, and it is used for comparison in the results section.

5.7 Results

In this section, we present our experimental results and answer the questions raised in the experiments section. We will study the impact of the model parameters on the Gowalla dataset (LBSN), and we will showcase the model interpretability and the learned group preferences using the Meetup dataset (EBSN).

Table 5.4: Performance Comparison on the Meetup Dataset
                              Best Aggregation   MF      CTR     PCGR
  In-matrix avg. accuracy     0.870              0.881   0.915   0.955
  In-matrix avg. RMSE         0.359              0.329   0.282   0.205
  Out-matrix avg. accuracy    -                  -       0.794   0.908
  Out-matrix avg. RMSE        -                  -       0.441   0.298
  Avg. recall@10 (K=50)       0.49               0.532   0.583   0.792

Performance Comparisons

Table 5.4 shows the evaluation of our model (PCGR) with respect to the state-of-the-art group recommendation systems on the EBSN dataset. Our approach outperforms all the other models for both the in-matrix and out-matrix prediction tasks; our model performs better than the state-of-the-art group recommenders by an impressive 20% in terms of the recall@10 metric. Table 5.5 shows the performance comparison of our models (PCGR, PCGR-D) with respect to the state-of-the-art group recommendation systems on the LBSN dataset. Both PCGR and PCGR-D outperform all the other models for the in-matrix and out-matrix prediction tasks, while PCGR-D performs slightly better (∼2%) than PCGR. For PCGR-D, the results are reported for R_d = 1. In the following sections, we study the impact of the parameters λ_A, λ_G and R_d on our group recommendation using the LBSN dataset.

Table 5.5: Performance Comparison on the Gowalla Dataset
                               Best Aggregation   MF     CTR    PCGR   PCGR-D
  In-matrix avg. accuracy      0.75               0.74   0.75   0.89   0.91
  In-matrix avg. RMSE          0.51               0.50   0.49   0.32   0.31
  Out-matrix avg. accuracy     -                  -      0.40   0.52   0.53
  Out-matrix avg. RMSE         -                  -      0.77   0.69   0.68
  Avg. recall@1000 (K=50)      0.34               0.34   0.42   0.45   0.46

5.7.1 Impact of model parameters

Our models allow us to study how the group parameter λ_G and the location parameter λ_A affect the overall performance of personalized group-activity recommendation. Figure 5.9 shows the impact of λ_G and λ_A on the PCGR-D model's prediction accuracy on the Gowalla test data. We observe that when λ_A is fixed and finite and λ_G = 0, our model collapses to the CTR model, which uses topic modeling (LDA) and the group-location rating matrix for prediction. When λ_G → ∞ (and λ_A is fixed and finite), our model only uses group preferences for prediction, without considering the activities offered at a location. When λ_A = 0 and λ_G = 0, our model reduces to probabilistic matrix factorization (PMF) (?), since the latent variables G and A are not affected by the topic models. For all other cases, PCGR-D fuses information from the activity and group topic models to predict ratings for the groups. From Figure 5.9, we see that very small and very large values of λ_G and λ_A do not improve the prediction accuracy. This can be explained as follows: very small parameter values (< 0.001) mean that the model is close to MF and does not use the location-activity/community topics for recommendation, while very large parameter values (> 10) mean that the model relies heavily on the topic models and less on the collaborative filtering, which is not good, especially if the group/location descriptions are noisy or sparse. Our PCGR-D model obtains its best prediction accuracy when λ_G ∼ 0.01 and λ_A ∼ 0.1, showing that both the group communities and the location activity topics are important for better recommendation. We also observe that a higher λ_A (compared to λ_G) gives better predictions, since we have better and more reliable descriptions for the location-activity topic models.
Figure 5.9: Impact of λ_G and λ_A on the prediction accuracy of the PCGR-D model using the Gowalla dataset.

5.7.2 Dominant User Influence in LBSN

Figure 5.10 shows how the dominant user's influence affects the performance (in-matrix avg. accuracy) of PCGR-D on the Gowalla test data. To study the effects of explicitly incorporating user influence into the model, we select different groups in the test data by varying R_d. When R_d = 1, the dominant user completely influences the group check-ins. For R_d = 1, there are around 35 groups with dominant users in our test data, and PCGR-D chooses the user-specific multinomial topic distribution ψ (learned from the CTR model (135)) for group recommendation to these groups. When R_d = 0, a few groups (∼10) do not have a dominant user; only these groups use the group-specific topic distribution φ for group recommendation, while the remaining groups use ψ. We observe that PCGR-D performs better than PCGR when R_d > 0.8, since it explicitly incorporates the dominant user's influence into the recommendation model. However, as R_d decreases, the performance of PCGR-D decreases compared to PCGR. This is because when R_d < 0.5, PCGR-D relies on the (dominant) user-specific topic distributions (ψ) for group recommendation for many groups, while it uses the group topic distribution for only a few groups. When R_d = 0, the accuracy of the PCGR-D model is closer to that of the CTR model, since PCGR-D uses the dominant user's latent vector ψ (instead of the group latent vector) for group recommendation. That is, for R_d = 0, ψ does not correctly capture the group preferences (CTR does not model group dynamics), since the dominant user is not very influential in selecting the group activities. Thus, we find that explicitly modeling the dominant user's influence is useful for personalized group recommendation only when R_d is high (> 0.8), i.e., for really dominant users.

Figure 5.10: Dominant user influence in Gowalla.

5.7.3 Learned Group Preferences vs. Aggregating User Preferences

We compare our learned group preferences with aggregated user preferences on the EBSN dataset using two scenarios. In the first scenario, Group I has users who have similar user preferences, and in the second scenario, Group II has users who have different user preferences. From Table 5.6, we observe that aggregating user preferences may recommend events that are not relevant for the group, while directly learning group preferences (by capturing group dynamics) recommends events that are similar to the events the group has participated in. In other words, directly learning group preferences is better than aggregating user preferences when we have rich group dynamics information; this has been independently observed by other researchers as well (33). Here, we have used the event tags (provided in the dataset) to represent the event topics and the learned group preferences.

Table 5.6: Learned Group Preferences vs.
Table 5.6: Learned group preferences vs. aggregating user preferences on the Meetup dataset

                                   Group I                                 Group II
Learned group preferences          Book club, Spirituality, Adventure      Sports, Fitness, Business networking
Aggregating user preferences       Board games, Religion, Spirituality     Politics, Movies, Book club
Events participated in by groups   Harry Potter club, Meditation, Hiking   Baseball, Yoga, Conference

5.7.4 Examining Latent Spaces of Groups and Events

We can interpret our model's group recommendations by studying the latent spaces of the groups and events. Table 5.7 shows the top 3 group-preference topics for an example offline social group (Group I). Group I has users who are interested in fantasy literature, fitness, and adventure-related events, and most of the recommended events belong to these topics. One advantage of our model is that it can predict whether an event will become popular among groups (i.e., whether an event will have more group participants) by studying the offsets of the event-topic proportions. An event whose topics have large offsets indicates that many groups (of different group sizes) will take part in that event. Examples of such events are a hiking trip organized by a school's travel club, or an author book-reading session hosted by a book club.

Table 5.7: Latent topics for an offline social group in the EBSN. We list the top 5 events from the test data that were recommended by our model. The last column shows whether the group actually participated in the event and whether the group-event distance was < 50 km.

Group I
  Top 3 topics (top 5 words):   1. bookclub, sci-fi, harrypotter, kidlit, fantasy-literature
                                2. wellness, spirituality, self-empowerment, Yoga
                                3. nightlife, travel, adventures, dance, singles
  Top 5 event recommendations:  harrypotter, sci-fi, Yoga, book-club, hiking
  Event participation:          Yes, Yes, Yes, Yes, Yes
  (Group-event distance < 50 km: Yes, Yes, Yes, Yes, Yes)

5.7.5 Computational Time Analysis

Our approaches use topic models to model the groups and locations/events; thus, the time complexity of our model is high compared to matrix factorization methods. When λ_G = 0, our model has a time complexity similar to that of the CTR model. Table 5.8 shows the time consumed by PCGR and PCGR-D compared to the MF and CTR models. Table 5.9 shows how our model performs when the latent-space dimension K is varied. We observe that for a smaller value of K, the recall of our model decreases, but it converges much faster (about 5 times faster). This shows that there is a trade-off between prediction accuracy and the dimension of the latent factors (K). Moreover, we observed that our model with a smaller K achieves accuracy similar to that of a CTR model with a larger K.

Table 5.8: Time comparison for K = 50

Model           Avg. time taken (seconds)
MF              ~247
CTR             ~277
PCGR/PCGR-D     ~324

Table 5.9: Time comparison of our model on Gowalla for varying K

K      Time taken (seconds)   Avg. Recall
50     324                    0.45
100    1711                   0.48

5.7.6 Discussion

In this chapter, we showed that modeling group dynamics, such as user interactions, user influence, and user-group memberships, using a group generative process helps us personalize group preferences in LBSNs and EBSNs. For the activity descriptions at a location, we used only the information provided by the LBSN, and we treated a group check-in as an indicator of the group's (implicit) rating of the location. However, it is possible that groups check in at a location but do not like the activities offered there. Hence, we believe that mining a group's feedback, such as comments, actual ratings, and tweets, would provide better data for modeling group preferences in LBSNs.
On the other hand, for the EBSN dataset we considered only the group RSVP to an event as an indicator of the group's rating of that event. The EBSN group rating is more accurate than the LBSN group rating, since in most cases the users who RSVP actually attend the event. Also, external factors such as weather, holidays, and trends could influence group activities. We could include these factors in our activity/event topic model using external variables, but this would make our models more complex and inference harder. Our models provide interpretable results, which are useful for user studies in the deployment of personalized group recommender systems.

5.8 Summary and Future Work

In this chapter, we presented collaborative-filtering-based hierarchical Bayesian models that exploit group check-ins or group RSVPs, location/event information, and group dynamics to learn group preferences and recommend personalized location activities/events to groups. Our experiments on LBSN and EBSN datasets showed that our models consistently outperform the state-of-the-art group and content recommender systems. Our framework models the group dynamics and allows us to address cold-start recommendation for new locations/groups. The main contributions include: 1) demonstrating the effectiveness of modeling group dynamics, such as user interactions, user influence, and user-group membership, to improve and personalize group recommendation in LBSNs and EBSNs; 2) studying the group characteristics and the impact of the model parameters on the recommendation; and 3) demonstrating the interpretability of our recommendations.

There are many directions for future work. We will study approaches to make our algorithms scalable to ever-evolving LBSNs/EBSNs. We plan to conduct user studies to check the performance of our personalized group recommenders in real-world settings. We will also investigate how to incorporate external factors, such as reviews, tweets, and current trends, into our modeling framework.

Chapter 6
Factorized Sparse Learning Models with Interpretable High Order Feature Interactions

6.1 Introduction

Identifying interpretable high-order feature interactions is an important problem in machine learning, data mining, and biomedical informatics, because feature interactions often help reveal hidden domain knowledge and the structure of the problems under consideration. For example, genes and proteins seldom perform their functions independently, so many human diseases manifest as the dysfunction of pathways or functional gene modules, and the disrupted patterns due to disease are often more obvious at the pathway or module level. Identifying these disrupted gene interactions for different diseases such as cancer will help us understand the underlying mechanisms of the diseases and develop effective drugs to cure them. However, identifying reliable discriminative high-order gene/protein or SNP interactions for accurate disease diagnosis, such as early cancer diagnosis directly from patient blood samples, is still a challenging problem, because we often have very limited patient samples but a huge number of complex feature interactions to consider.

In this chapter, we propose a sparse learning framework based on weight matrix factorizations and ℓ1 regularization for identifying discriminative high-order feature interactions in linear and logistic regression models, and we study several optimization methods for solving it.
Experimental results on synthetic and real-world datasets show that our method outperforms the state-of-the-art sparse learning techniques and that it provides 'interpretable' blockwise high-order interactions for disease status prediction. Our proposed sparse learning framework is general and can be used to identify any discriminative complex-system input interactions that are predictive of system outputs given limited high-dimensional training data.

Our contributions are as follows: (1) we propose a method capable of simultaneously identifying both informative single discriminative features and discriminative blockwise high-order interactions in a sparse learning framework, and the method can easily be extended to handle arbitrarily high-order feature interactions; (2) our method works on high-dimensional input feature spaces and ill-posed problems with many more features than data points, which is typical of biomedical applications such as biomarker discovery and cancer diagnosis; (3) our method has interesting theoretical properties for generalized linear regression models; (4) the interactions identified by our method lead to biomedical insight into blood-based cancer diagnosis.

6.2 Related Work

Variable selection is a well-studied topic in the statistics, machine learning, and data mining literature. Generally, variable selection approaches focus on identifying discriminative features using regularization techniques. Most recent methods identify discriminative features, or groups of discriminative features, based on the Lasso penalty (129), Group Lasso (150), Trace-norm (56), the Dirty model (70), and Support Vector Machines (SVMs) (123). A recent approach (143) heuristically adds some possible high-order interactions to the input feature set in a greedy way based on lasso-penalized logistic regression. Some recent approaches (23), (40) enforce strong and/or weak heredity constraints to recover pairwise interactions in linear regression models. Under strong heredity, an interaction term can be included in the model only if the corresponding main terms are also included, while under weak heredity, an interaction term is included when either of the main terms is included. However, recent studies in bioinformatics have shown that feature interactions need not follow heredity constraints for the manifestation of diseases, and thus the above approaches (23), (40) have limited chance of recovering the relevant interactions. Kernel methods such as Gaussian Processes (51) and Multiple Kernel Learning (84) can be used to model high-order feature interactions, but they can only tell which orders are important. Thus, all these previous approaches either fail to identify specific high-order interactions for prediction or identify sporadic pairwise interactions in a greedy way, which is very unlikely to recover the 'interpretable' blockwise high-order interactions among features in different sub-components of systems (for example, pathways or gene functional modules). Recently, (102) proposed an efficient way to identify combinatorial interactions among interactive genes in complex diseases by using overlapping group lasso and screening. However, they use prior information such as gene ontology in their approach, which is generally not available, or difficult to collect, for some machine learning problems.
Thus, there is a need for new, efficient techniques that automatically capture the important 'blockwise' high-order feature interactions in regression models, which is the focus of this chapter.

The remainder of the chapter is organized as follows: in Section 6.3, we present our problem formulation and the notation used in the chapter. In Section 6.4, we present the main idea of our approach, and in Section 6.5 we give an overview of the theoretical properties of our method. In Section 6.6, we present the optimization methods that we use to solve our optimization problem. In Section 6.7, we discuss our experimental setup and present our results on synthetic and real datasets. Finally, in Section 6.8, we conclude the chapter with discussion and future research directions.

6.3 Problem Formulation

Consider a regression setup with a training set of n samples and p features, {(X^(i), y^(i))}, where X^(i) ∈ R^p is the i-th instance (column) of the design matrix X (p × n), y^(i) ∈ R is the i-th instance of the response variable y (n × 1), and i = 1, ..., n. To model the response in terms of the predictors, we can set up a linear regression model

    y^(i) = β^T X^(i) + ε^(i),    (6.1)

or a logistic regression model

    p(y^(i) = 1 | X^(i)) = 1 / (1 + exp(−β^T X^(i) − β_0)),    (6.2)

where β ∈ R^p is the weight vector associated with single features (also called main effects), ε ∈ R^n is a noise vector, and β_0 ∈ R is the bias term. In many practical fields, such as bioinformatics and medical informatics, the main terms (the terms involving only single features) are not enough to capture the complex relationship between the response and the predictors, and thus high-order interactions are necessary. In this chapter, we consider regression models with both main effects and high-order interaction terms. Equation (6.3) shows a linear regression model with pairwise interaction terms,

    y^(i) = β^T X^(i) + X^(i)T W X^(i) + ε^(i),    (6.3)

where W (p × p) is the weight matrix associated with the pairwise feature interactions. The corresponding loss function (the sum of squared errors) is as follows (we center the data to avoid an additional bias term):

    L_sqerr(β, W) = (1/2) Σ_{i=1}^n || y^(i) − β^T X^(i) − X^(i)T W X^(i) ||_2^2.    (6.4)

We can similarly write the logistic regression model with pairwise interactions as

    p(y^(i) | X^(i)) = 1 / (1 + exp(−y^(i) (β^T X^(i) + X^(i)T W X^(i) + β_0))),    (6.5)

and the corresponding loss function (the sum of the negative log-likelihoods of the training data) is

    L_logistic(β, W, β_0) = Σ_{i=1}^n log(1 + exp(−y^(i) (β^T X^(i) + X^(i)T W X^(i) + β_0))).    (6.6)

6.4 Our Approach

In this section, we propose an optimization-driven sparse learning framework to identify discriminative single features, and groups of high-order interactions among input features, for output prediction in the setting of limited training data. When the number of input features is huge (e.g., in biomedical applications), it is practically impossible to explicitly consider quadratic or even higher-order interactions among all the input features based on simple lasso-penalized linear or logistic regression. To solve this problem, we propose to factorize the weight matrix W associated with high-order interactions between input features as a sum of K rank-one matrices for pairwise interactions, or as a sum of low-rank high-order tensors for higher-order interactions. Each rank-one matrix for pairwise feature interactions is represented by the outer product of two identical vectors, and each m-order (m > 2) tensor is represented by the outer product of m identical vectors. Besides minimizing the loss function of linear or logistic regression, we penalize the ℓ1 norm of both the weights associated with single input features and the weights associated with high-order feature interactions.
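A practical consequence of this factorization is that the interaction term never has to be materialized as a p × p matrix, since x^T W x = Σ_k (a_k^T x)² when W = Σ_k a_k a_k^T. The sketch below (plain NumPy; the function name is our own) verifies this identity:

```python
import numpy as np

def interaction_term(x, A):
    """Evaluate x^T W x with W = sum_k a_k a_k^T, without forming W.

    x : (p,) feature vector
    A : (K, p) matrix whose rows are the rank-one factors a_k
    """
    return np.sum((A @ x) ** 2)  # sum_k (a_k^T x)^2

# Sanity check of the identity on random data.
rng = np.random.default_rng(0)
p, K = 1000, 5
A = rng.standard_normal((K, p))
x = rng.standard_normal(p)
W = A.T @ A                      # explicit p x p matrix, for checking only
assert np.isclose(x @ W @ x, interaction_term(x, A))
```

Evaluating the factorized form costs O(Kp) per sample instead of O(p²), which is what makes the model feasible when p is in the thousands.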
Mathematically, we solve the following optimization problem to identify the discriminative single and pairwise interaction features:

    {β̂, â_k} = argmin_{a_k, β} L_sqerr(β, W) + λ_β ||β||_1 + Σ_{k=1}^K λ_{a_k} ||a_k||_1,    (6.7)

where W = Σ_{k=1}^K a_k ⊗ a_k, ⊗ denotes the tensor (outer) product, and β̂, â_k are the estimated parameters of our model; let Q denote the objective function of Equation (6.7). For logistic regression, we replace L_sqerr(β, W) in Equation (6.7) with L_logistic(β, W, β_0). We call our model the Factorization-based High-order Interaction Model (FHIM).

Proposition 6.4.1. The optimization problem in Equation (6.7) is convex in β and non-convex in a_k.

Because of the non-convexity of our optimization problem, it is difficult to design optimization algorithms that guarantee convergence to the global optimum. Here, we adopt a greedy alternating optimization method to find a local optimum. In the case of pairwise interactions, fixing the other weights, we solve for one rank-one weight matrix at a time. Note that our symmetric positive semidefinite factorization of W makes this sub-problem very easy. Moreover, for a particular rank-one weight matrix a_k ⊗ a_k, the nonzero entries of the corresponding vector a_k can be interpreted as the blockwise interaction feature indices of a densely interacting feature group. In the case of higher-order interactions, the optimization procedure is similar to that for pairwise interactions, except that we have more rounds of alternating optimization. The parameter K of W is generally unknown for real datasets; thus, we estimate K greedily during the alternating optimization. In fact, the combination of our factorization formulation and the greedy algorithm is effective for estimating the interaction weight matrix W. β is re-estimated each time K is greedily incremented during the alternating optimization, as shown in Algorithm 4.

Algorithm 4 Greedy Alternating Optimization
 1: Initialize β to 0, K = 1, and a_K = 1
 2: While (K == 1) or (a_{K−1} ≠ 0 for K > 1)
 3:   Repeat until convergence
 4:     β_j^t = argmin_{β_j} Q((β_1^t, ..., β_{j−1}^t, β_j, β_{j+1}^{t−1}, ..., β_p^{t−1}), a_k^{t−1})
 5:     a_{k,j}^t = argmin_{a_{k,j}} Q(β^t, (a_{k,1}^t, ..., a_{k,j−1}^t, a_{k,j}, a_{k,j+1}^{t−1}, ..., a_{k,p}^{t−1}))
 6:   End Repeat
 7:   K = K + 1; a_K = 1
 8: End While
 9: Remove a_K and a_{K−1} from a
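For concreteness, a minimal Python skeleton of Algorithm 4 is sketched below. The inner solvers `update_beta` and `update_a` are placeholders for any of the ℓ1 solvers discussed in Section 6.6, and the convergence test is a simple objective-change check, so this is an illustrative sketch of the control flow rather than the thesis implementation.

```python
import numpy as np

def fhim_greedy(X, y, update_beta, update_a, objective, tol=1e-4, max_iter=100):
    """Greedy alternating optimization for FHIM (Algorithm 4, sketch).

    update_beta / update_a : callables performing one pass of l1-penalized
        coordinate (or sub-gradient) updates for beta and a_k, respectively.
    objective : evaluates Q(beta, [a_1, ..., a_K]) from Equation (6.7).
    """
    p = X.shape[0]                      # features are rows, as in the chapter
    beta = np.zeros(p)
    factors = [np.ones(p)]              # a_1 initialized to all ones

    # Keep adding rank-one factors while the last optimized factor is nonzero.
    while len(factors) == 1 or np.any(factors[-2] != 0):
        prev = np.inf
        for _ in range(max_iter):       # alternate until Q stops decreasing
            beta = update_beta(X, y, beta, factors)
            factors = [update_a(X, y, beta, factors, k) for k in range(len(factors))]
            cur = objective(X, y, beta, factors)
            if prev - cur < tol:
                break
            prev = cur
        factors.append(np.ones(p))      # greedily add a new rank-one factor

    return beta, factors[:-2]           # drop a_K and the zeroed a_{K-1}
```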
6.5 Theoretical Properties

In this section, we study the asymptotic behavior of FHIM for likelihood-based generalized linear regression models. The lemmas and theorems proved here are similar to those shown in (40). However, in that work the authors assume strong heredity (i.e., the interaction-term coefficients depend on the main effects), which our model does not assume, since we are interested in identifying all high-order interactions irrespective of heredity constraints. Here, we discuss the asymptotic properties with respect to the main effects and the factorized coefficients.

Problem Setup: Assume that the data V_i = (X_i, y_i), i = 1, ..., n, are collected independently, and that y_i has density f(g(X_i), y_i) conditioned on X_i, where g is a known regression function with main effects and all possible pairwise interactions. Let β*_j and a*_{k,j} denote the underlying true parameters, satisfying the blockwise properties implied by our factorization. Let Q_n(θ) denote the objective with the negative log-likelihood, and let θ* = (β*^T, α*^T)^T, where α* = (a*_k), k = 1, ..., K. We consider the FHIM estimates

    θ̂_n = argmin_θ Q_n(θ) = argmin_θ { −(1/n) Σ_{i=1}^n L(g(X_i), y_i) + λ_β |β| + Σ_k λ_{α_k} |α_k| },    (6.8)

where L(g(X_i), y_i) is the loss function of the generalized linear regression model with pairwise interactions. In the case of linear regression, g(·) takes the form of Equation (6.3) without the noise term, and L(·) takes the form of Equation (6.4). Now define

    A_1 = {j : β*_j ≠ 0}, A_2 = {(k, l) : α*_{k,l} ≠ 0}, A = A_1 ∪ A_2,    (6.9)

where A_1 contains the indices of the main terms with nonzero true coefficients, and A_2 similarly contains the indices of the factorized interaction terms whose true coefficients are nonzero. Define

    a_n = max{λ_{β_j}, λ_{α_k,l} : j ∈ A_1, (k, l) ∈ A_2},
    b_n = min{λ_{β_j}, λ_{α_k,l} : j ∈ A_1^c, (k, l) ∈ A_2^c}.    (6.10)

We now show that our model possesses the oracle properties for (i) n → ∞ with fixed p, and (ii) p_n → ∞ as n → ∞, under some regularity conditions. Please refer to the Appendix for the proofs of the lemmas and theorems of Sections 6.5.1 and 6.5.2.

6.5.1 Asymptotic Oracle Properties when n → ∞

The asymptotic properties when the sample size increases and the number of predictors is fixed are described in the following lemmas and theorems. FHIM possesses the oracle properties (40) under the regularity conditions (C1)-(C3) below. Let Ω denote the parameter space for θ.

(C1) The observations V_i, i = 1, ..., n, are independent and identically distributed with probability density f(V, θ), which has common support. We assume the density f satisfies

    E_θ[∂ log f(V, θ)/∂θ_j] = 0 for j = 1, ..., p(K+1),

and

    I_jk(θ) = E_θ[(∂ log f(V, θ)/∂θ_j)(∂ log f(V, θ)/∂θ_k)] = E_θ[−∂² log f(V, θ)/(∂θ_j ∂θ_k)].

(C2) The Fisher information matrix

    I(θ) = E[(∂ log f(V, θ)/∂θ)(∂ log f(V, θ)/∂θ)^T]

is finite and positive definite at θ = θ*.

(C3) There exists an open set ω of Ω that contains the true parameter point θ* such that for almost all V the density f(V, θ) admits all third derivatives ∂³f(V, θ)/(∂θ_j ∂θ_k ∂θ_l) for all θ ∈ ω and any j, k, l = 1, ..., p(K+1). Furthermore, there exist functions M_jkl such that

    |∂³ log f(V, θ)/(∂θ_j ∂θ_k ∂θ_l)| ≤ M_jkl(V)

for all θ ∈ ω, where m_jkl = E_{θ*}[M_jkl(V)] < ∞.

These regularity conditions are the existence of common support and of first and second derivatives for f(V, θ); the Fisher information matrix being finite and positive definite; and the existence of a bounded third derivative for f(V, θ). They guarantee asymptotic normality of the ordinary maximum likelihood estimates (89).

Lemma 6.5.1. Assume a_n = o(1) as n → ∞. Then under regularity conditions (C1)-(C3), there exists a local minimizer θ̂_n of Q_n(θ) such that ||θ̂_n − θ*|| = O_P(n^{−1/2} + a_n).

Theorem 6.5.2. Assume √n b_n → ∞, and assume the minimizer θ̂_n given in Lemma 6.5.1 satisfies ||θ̂_n − θ*|| = O_P(n^{−1/2}).
Then under regularity conditions (C1)-(C3), we have

    P(β̂_{A_1^c} = 0) → 1, P(α̂_{A_2^c} = 0) → 1.

Lemma 6.5.1 implies that when the tuning parameters associated with the nonzero coefficients of the main effects and pairwise interactions tend to 0 at a rate faster than n^{−1/2}, there exists a local minimizer of Q_n(θ) that is √n-consistent (the sampling error is O_P(n^{−1/2})). Theorem 6.5.2 shows that our model removes noise consistently with high probability (→ 1). If √n a_n → 0 and √n b_n → ∞, then Lemma 6.5.1 and Theorem 6.5.2 imply that the √n-consistent estimator θ̂_n satisfies P(θ̂_{A^c} = 0) → 1.

Theorem 6.5.3. Assume √n a_n → 0 and √n b_n → ∞. Then under the regularity conditions (C1)-(C3), the component θ̂_A of the local minimizer θ̂_n (given in Lemma 6.5.1) satisfies

    √n (θ̂_A − θ*_A) →_d N(0, I^{−1}(θ*_A)),

where I(θ*_A) is the Fisher information matrix of θ_A at θ_A = θ*_A, assuming that θ*_{A^c} = 0 is known in advance.

Theorem 6.5.3 shows that our model estimates the nonzero coefficients of the true model with the same asymptotic distribution as if the zero coefficients were known in advance. Based on Theorems 6.5.2 and 6.5.3, our model has the oracle property (40), (53) when the tuning parameters satisfy the conditions √n a_n → 0 and √n b_n → ∞. To satisfy these conditions, we consider adaptive weights w_{β_j}, w_{α_k,l} (163) for our tuning parameters λ_β, λ_{α_k} (see the Appendix for more details). Thus, our tuning parameters are

    λ_{β_j} = (log n / n) λ_β w_{β_j}, λ_{α_k,l} = (log n / n) λ_{α_k} w_{α_k,l}.

6.5.2 Asymptotic Oracle Properties when p_n → ∞ as n → ∞

In this section, we consider the asymptotic behavior of our model when the number of predictors p_n grows to infinity along with the sample size n. If the regularity conditions (C4)-(C6) below hold, we can show that our model possesses the oracle property. We denote the total number of predictors by p_n, and we add the subscript n to all quantities that change with the sample size. A_1, A_2, and A are defined as in Section 6.5, and we let s_n = |A_n|. The asymptotic properties of our model when the number of predictors increases along with the sample size are described in the following lemma and theorem. The regularity conditions (C4)-(C6) are given below; let Ω_n denote the parameter space for θ_n.

(C4) The observations V_ni, i = 1, ..., n, are independent and identically distributed with probability density f_n(V_n, θ_n), which has common support. We assume the density f_n satisfies

    E_{θ_n}[∂ log f_n(V_n, θ_n)/∂θ_nj] = 0 for j = 1, ..., p_n,

and

    I_jk(θ_n) = E_{θ_n}[(∂ log f_n(V_n, θ_n)/∂θ_nj)(∂ log f_n(V_n, θ_n)/∂θ_nk)] = E_{θ_n}[−∂² log f_n(V_n, θ_n)/(∂θ_nj ∂θ_nk)].

(C5) I_n(θ_n) = E[(∂ log f_n(V_n1, θ_n)/∂θ_n)(∂ log f_n(V_n1, θ_n)/∂θ_n)^T] satisfies

    0 < C_1 < λ_min(I_n(θ_n)) ≤ λ_max(I_n(θ_n)) < C_2 < ∞

for all n, where λ_min(·) and λ_max(·) denote the smallest and largest eigenvalues of a matrix, respectively. Moreover, for any j, k = 1, ..., p_n,

    E_{θ_n}[((∂ log f_n(V_n1, θ_n)/∂θ_nj)(∂ log f_n(V_n1, θ_n)/∂θ_nk))²] < C_3 < ∞

and

    E_{θ_n}[∂² log f_n(V_n1, θ_n)/(∂θ_nj ∂θ_nk)] < C_4 < ∞.

(C6) There exists a large open set ω_n ⊂ Ω_n ⊂ R^{p_n} that contains the true parameters θ*_n such that for almost all V_ni the density admits all third derivatives ∂³f_n(V_ni, θ_n)/(∂θ_nj ∂θ_nk ∂θ_nl) for all θ_n ∈ ω_n. Furthermore, there are functions M_njkl such that

    |∂³f_n(V_ni, θ_n)/(∂θ_nj ∂θ_nk ∂θ_nl)| ≤ M_njkl(V_ni)

for all θ_n ∈ ω_n, and E_{θ_n}[M²_njkl(V_ni)] < C_5 < ∞ for all p_n, n, and j, k, l.
Lemma 6.5.4. Assume that the density f_n(V_n, θ*_n) satisfies the regularity conditions (C4)-(C6). If √n a_n → 0 and p_n^5/n → 0 as n → ∞, then there exists a local minimizer θ̂_n of Q_n(θ) such that ||θ̂_n − θ*_n|| = O_P(√p_n (n^{−1/2} + a_n)).

Theorem 6.5.5. Suppose that the density f_n(V_n, θ*_n) satisfies the regularity conditions (C4)-(C6). If √(n p_n) a_n → 0, √(n/p_n) b_n → ∞, and p_n^5/n → 0 as n → ∞, then with probability tending to 1, the √(n/p_n)-consistent local minimizer θ̂_n in Lemma 6.5.4 satisfies the following:

• Sparsity: θ̂_{n A_n^c} = 0.
• Asymptotic normality: √n A_n I_n^{1/2} (θ̂_{n A_n} − θ*_{n A_n}) →_d N(0, G), where A_n is an arbitrary m × s_n matrix with finite m such that A_n A_n^T → G, G is an m × m nonnegative symmetric matrix, and I_n(θ*_{n A_n}) is the Fisher information matrix of θ_{n A_n} at θ_{n A_n} = θ*_{n A_n}.

Since the dimension of θ̂_{n A_n} → ∞ as the sample size n → ∞, we consider an arbitrary linear combination A_n θ̂_{n A_n} for the asymptotic normality of our model's estimates. As in Section 6.5.1, to satisfy the oracle property we consider adaptive weights w_{β_nj}, w_{α_k,nl} (163) for our tuning parameters λ_β, λ_{α_k}:

    λ_{β_nj} = (log(n) p_n / n) λ_β w_{β_nj}, λ_{α_k,n,l} = (log(n) p_n / n) λ_{α_k} w_{α_k,nl}.

6.6 Optimization

In this section, we outline three optimization methods that we employ to solve our objective function (6.7); they implement the β and a_k updates of Algorithm 4. (119) provides a good survey of optimization approaches for solving ℓ1-regularized regression problems. In this chapter, we use sub-gradient and coordinate-wise soft-thresholding based optimization methods, since they work well and are easy to implement. We compare these methods in the experimental results of Section 6.7.

6.6.1 Sub-Gradient Methods

Sub-gradient based strategies treat the non-differentiable objective as a non-smooth optimization problem and use sub-gradients of the objective function at the non-differentiable points. For our model, the optimality conditions with respect to the parameter vectors β and a_k can be written out separately based on the objective function (6.7). The optimality conditions with respect to a_k are

    ∇_j L(a_k) + λ_{a_k} sgn(a_kj) = 0   if |a_kj| > 0,
    |∇_j L(a_k)| ≤ λ_{a_k}               if a_kj = 0,

where L(a_k) is the loss function of our linear or logistic regression model in Equation (6.7) with respect to a_k. The optimality conditions for β can be written similarly. The sub-gradient ∇^s_j f(a_k) for each a_kj is given by

    ∇^s_j f(a_k) =
      ∇_j L(a_k) + λ_{a_k} sgn(a_kj)   if |a_kj| > 0,
      ∇_j L(a_k) + λ_{a_k}             if a_kj = 0 and ∇_j L(a_k) < −λ_{a_k},
      ∇_j L(a_k) − λ_{a_k}             if a_kj = 0 and ∇_j L(a_k) > λ_{a_k},
      0                                if −λ_{a_k} ≤ ∇_j L(a_k) ≤ λ_{a_k},

where, for our linear regression model,

    ∇_j L(a_k) = (1/2) Σ_i (−2 X^(i)_j X^(i)T a_k) [y^(i) − β^T X^(i) − X^(i)T W X^(i)].

The negation of the sub-gradient is the steepest descent direction. The sub-gradients for β, ∇^s_j f(β), can be calculated similarly. The differential of the loss function of the linear regression in Equation (6.7) with respect to β is given by

    ∇_j L(β) = (1/2) Σ_i (−2 X^(i)_j) [y^(i) − β^T X^(i) − X^(i)T W X^(i)].

6.6.1.1 Orthant-Wise Descent (OWD)

Andrew and Gao (8) proposed an effective strategy for solving large-scale ℓ1-regularized regression problems based on choosing an appropriate steepest descent direction for the objective function and taking a step like a Newton iteration in this direction (with an L-BFGS Hessian approximation (92)). The orthant-wise descent method for our model takes the form

    β ← P_O[β − γ_β P_S[H_β^{−1} ∇^s f(β)]],
    a_k ← P_O[a_k − γ_{a_k} P_S[H_{a_k}^{−1} ∇^s f(a_k)]],

where P_O and P_S are two projection operators, H_β is the positive definite approximation of the Hessian of the quadratic approximation of the objective function f(β), and γ_β and γ_{a_k} are step sizes. P_S projects the Newton-like direction to guarantee that it is a descent direction. P_O projects the step onto the orthant containing β or a_k and ensures that the line search does not cross points of non-differentiability.

6.6.1.2 Projected Scaled Sub-Gradient (PSS)

Schmidt (119) proposed optimization methods called Projected Scaled Sub-Gradient methods, in which each iteration can be written as the projection of a scaling of a sub-gradient of the objective. Please refer to (8) and (119) for more details on the OWD and PSS methods.

6.6.2 Soft-thresholding

A soft-thresholding based coordinate descent optimization method can be used to find the β and a_k updates in the alternating optimization algorithm for our FHIM model. The β updates (neglecting some terms) are β̃_j, given by

    β̃_j(λ_β) ← S( β̃_j(λ_β) + Σ_{i=1}^n X_ij (y_i − Σ_{k≠j} X_ik β̃_k − Σ_k X_ik W X_ki), λ_β ),

where W = Σ_k a_k ⊗ a_k and S is the soft-thresholding operator (57). Similarly, the updates for a_k are ã_kj, given by

    ã_kj(λ_{a_k}) ← S( ã_kj(λ_{a_k}) + Σ_{i=1}^n X_ij (Σ_k Σ_{r=1}^p a_kr X_ir) [y_i − Σ_{k≠j} X_ik β̃_k − Σ_k X_ik W_{∼j} X_ki], λ_{a_k} ),

where W_{∼j} is W with the elements of the j-th column and j-th row all set to zero.
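The soft-thresholding operator S used in these updates is only a few lines of code. The sketch below is a generic illustration (with our own helper names), including a single lasso-style coordinate update for a centered linear model with unit-norm feature columns, a simplifying assumption:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def update_coordinate(xj, partial_residual, lam):
    """One coordinate update: regress the partial residual (with coordinate j's
    contribution removed) on feature column xj (assumed unit-norm), then shrink."""
    return soft_threshold(xj @ partial_residual, lam)
```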
6.7 Experiments

In this section, we use synthetic and real datasets to demonstrate the efficacy of our model (FHIM), and we compare its performance with LASSO (129), All-Pairs Lasso (23), Hierarchical LASSO (23), Group Lasso (150), Trace-norm (74), the Dirty model (70), and QUIRE (102). For all these models, we perform 5 runs of 5-fold cross-validation on the training dataset (80%) to find the optimal parameters, and we evaluate the prediction error on a test dataset (20%). We search the tuning parameters for all methods using grid search; for our model, the parameters λ_β and λ_{a_k} are searched in the range [0.01, 10]. We also discuss the support recovery of β and W for our model.

6.7.1 Datasets

We use synthetic datasets and a real dataset for the classification and support recovery experiments. We give a detailed description of these datasets below.

6.7.1.1 Synthetic Dataset

We generate the predictors of the design matrix X from a normal distribution with mean zero and variance one. The weight matrix W is generated as a sum of K rank-one matrices, i.e., W = Σ_{k=1}^K a_k a_k^T. β and the a_k are generated as sparse vectors from a normal distribution with mean 0 and variance 1, while the noise vector is generated from a normal distribution with mean 0 and variance 0.1. Finally, the response vectors y of the linear and logistic regression models with pairwise interactions are generated using Equations (6.3) and (6.5), respectively. We generated several synthetic datasets by varying the number of instances (n), the number of variables/predictors (p), the rank K of W, and the sparsity level of β and a_k. We denote the combined total number of predictors (that is, main-effect predictors plus predictors for the interaction terms) by q; here, q = p(p+1)/2. The sparsity level (fraction of non-zeros) was chosen as 2-4% for large p (> 100) and 5-10% for small p (< 100), for both β and a_k.
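This generation process is easy to reproduce. The sketch below follows Equation (6.3), with our own choices for the random seed and the sparsity mask:

```python
import numpy as np

def make_synthetic(n=1000, p=50, K=5, sparsity=0.05, noise_std=0.1, seed=0):
    """Generate (X, y) from the pairwise-interaction linear model, Eq. (6.3)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                 # predictors ~ N(0, 1)

    def sparse_normal(size):
        v = rng.standard_normal(size)
        v[rng.random(size) > sparsity] = 0.0        # keep ~sparsity fraction nonzero
        return v

    beta = sparse_normal(p)
    A = np.vstack([sparse_normal(p) for _ in range(K)])  # rows are the a_k
    # y_i = beta^T x_i + x_i^T W x_i + eps_i, with W = sum_k a_k a_k^T
    y = X @ beta + np.sum((X @ A.T) ** 2, axis=1) + noise_std * rng.standard_normal(n)
    return X, y, beta, A
```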
In this chapter, we show results for synthetic data in these settings: Case (1), n > p and q > n (high-dimensional with respect to the combined predictors), and Case (2), p > n (high-dimensional with respect to the original predictors).

6.7.1.2 Real Dataset

To predict cancer progression status directly from blood samples, we generated our own dataset. All samples and clinical information were collected under Health Insurance Portability and Accountability Act compliance from study participants, after obtaining written informed consent under clinical research protocols approved by the institutional review boards of each site. Blood was processed within 2 hours of collection according to established standard operating procedures. To predict RCC status, serum samples were collected at a single study site from patients diagnosed with RCC or a benign renal mass prior to treatment. The definitive pathology diagnosis of RCC and the cancer stage were made after resection. Outcome data were obtained through follow-up from 3 months to 5 years after the initial treatment. Our RCC dataset contains 212 samples spanning benign masses and 4 different tumor stages. Expression levels of 1092 proteins are collected using a high-throughput SOMAmer protein quantification technology. The numbers of Benign, Stage 1, Stage 2, Stage 3, and Stage 4 tumor samples are 40, 101, 17, 24, and 31, respectively.

6.7.2 Experimental Design

We use linear regression models (Equation 6.3) for all the following experiments; we use logistic regression models (Equation 6.5) only for the synthetic data experiments shown in Table 6.2. We evaluate the performance of our method (FHIM) with the following experiments:

1. Prediction error and support recovery experiments on synthetic datasets.

2. Classification experiments using RCC samples. We perform three stage-wise binary classification experiments:
   (a) Case 1: classification of Benign samples vs. Stage 1-4 samples.
   (b) Case 2: classification of Benign and Stage 1 samples vs. Stage 2-4 samples.
   (c) Case 3: classification of Benign, Stage 1, and Stage 2 samples vs. Stage 3-4 samples.

Table 6.1: Performance comparison for synthetic data on the linear regression model with high-order interactions. Prediction error (MSE) on test data, with the standard deviation of the MSE in brackets, is used to measure each model's performance. For p ≥ 500, Hierarchical Lasso (HLasso) has heavy computational complexity, hence we do not show its results.

n, p, K          FHIM             Fused Lasso       Lasso             HLasso           Trace norm        Dirty Model
q > n
1000, 50, 1      338.4 (14.5)     425.9 (20.7)      474.7 (15.3)      354.32 (24.82)   464.4 (36.3)      613.5 (0.76)
1000, 50, 5      343.7 (12.9)     1888.3 (121.1)    1922.9 (143.9)    889.1 (112.5)    1822.6 (99.8)     2453.8 (0.76)
10000, 500, 1    1093.1 (19.5)    2739.57 (155.1)   3896.3 (129.5)    -                3887.9 (101.1)    4674.7 (0.8)
10000, 500, 5    1090.76 (12.21)  22720 (597.8)     23279.6 (231.3)   -                22916.5 (321.4)   29214 (0.8)
p > n
100, 500, 1      230.49 (50.3)    1157.2 (355.0)    1335.0 (159.2)    -                1160.3 (299.7)    1651.9 (62.6)
100, 1000, 1     340.1 (40.02)    770.9 (127.6)     879.1 (180.3)     -                699.9 (208.7)     808.1 (5.1)
100, 2000, 1     907.8 (100.1)    1022.3 (406.2)    919.2 (132.1)     -                880.42 (471.6)    1916.7 (63.4)
Table 6.2: Performance comparison for synthetic data on the logistic regression model with high-order interactions. Misclassification error on test data is used to measure each model's performance.

n, p, K          FHIM            Fused Lasso     Lasso           HLasso          Trace norm
q > n
1000, 50, 1      0.127 (0.009)   0.128 (0.017)   0.156 (0.017)   0.136 (0.02)    0.128 (0.016)
1000, 50, 5      0.189 (0.03)    0.227 (0.024)   0.292 (0.042)   0.257 (0.022)   0.503 (0.027)
10000, 500, 1    0.135 (0.002)   0.265 (0.007)   0.161 (0.012)   -               0.225 (0.077)
10000, 500, 5    0.390 (0.05)    0.514 (0.006)   0.507 (0.108)   -               0.514 (0.006)
p > n
100, 500, 1      0.325 (0.04)    0.352 (0.086)   0.4323 (0.054)  -               0.40 (0.079)
100, 1000, 1     0.390 (0.056)   0.409 (0.086)   0.458 (0.083)   -               0.438 (0.011)

6.7.3 Performance on the Synthetic Dataset

We evaluate the performance of our model (FHIM) on the synthetic datasets with the following experiments: (i) comparison of the optimization methods presented in Section 6.6; (ii) prediction error on the test data for q > n and p > n (high-dimensional settings); (iii) support recovery accuracy of β and W; and (iv) prediction of the rank of W using the greedy approach.

Table 6.3 shows the prediction error on test data when the different optimization methods (discussed in Section 6.6) are used for our model (FHIM). From Table 6.3, we see that the OWD and PSS methods perform nearly the same (OWD is marginally better), and both are better than the soft-thresholding method. This is because, in soft-thresholding, the coordinate updates of the variables might not be accurate in high-dimensional settings (i.e., the solution is affected by the path taken during the updates). We observed that soft-thresholding is in general slower than the OWD and PSS methods. For all the other experiments discussed in this chapter, we choose OWD as the optimization method for FHIM. Tables 6.1 and 6.2 show the performance comparison (in terms of prediction error on the test dataset) of FHIM for linear and logistic regression models against state-of-the-art approaches such as Lasso, Fused Lasso, Trace-norm, and Hierarchical Lasso (HLasso is a general version of SHIM (40)). From Tables 6.1 and 6.2, we see that FHIM generally outperforms all the state-of-the-art approaches for both linear and logistic pairwise regression models. For q > n, the test-data prediction error of FHIM is consistently lower than that of all the other approaches. For p > n, FHIM performs slightly better than the other approaches; however, the prediction error of all the approaches is high, since it is hard to accurately recover the coefficients of the main effects and pairwise interactions in very high-dimensional settings.

[Figure 6.1: Support recovery of β (90% sparse) and W (99% sparse) for synthetic data, Case 1: n > p and q > n, with n = 1000, p = 50, q = 1275.]

From Figure 6.1 and Table 6.5, we see that our model performs very well (F1 score close to 1) in the support recovery of β and W in the q > n setting. From Figure 6.2 and Table 6.5, we see that our model performs fairly well in the support recovery of W in the p > n setting.

[Figure 6.2: Support recovery of W (99.5% sparse) for synthetic data, Case 2: p > n, with p = 500, n = 100.]

We observe that when the tuning parameters are correctly chosen, the support recovery of W works very well when W is low-rank (see Tables 6.4 and 6.5), and the F1 score for the support recovery of W decreases as the rank of W increases. Table 6.5 shows that for q > n the greedy strategy of FHIM accurately recovers the rank K of W, while for p > n the greedy strategy might not correctly recover K. This is because the tensor factorization is not unique, and slightly correlated variables can enter our model during the optimization.
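For reference, the support-recovery F1 score used above compares the estimated and true nonzero patterns. A minimal utility (our own, not the thesis code) is:

```python
import numpy as np

def support_f1(W_hat, W_true, tol=1e-8):
    """F1 score between the supports (nonzero patterns) of two weight matrices."""
    est = np.abs(W_hat) > tol
    true = np.abs(W_true) > tol
    tp = np.sum(est & true)
    if tp == 0:
        return 0.0
    precision = tp / np.sum(est)
    recall = tp / np.sum(true)
    return 2 * precision * recall / (precision + recall)
```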
6.7.4 Classification Performance on RCC

In this section, we report systematic experimental results on the classification of samples from different stages of RCC. The predictive performance of the markers and pairwise interactions selected by our model (FHIM) is compared against the markers selected by Lasso, All-Pairs Lasso (23), Group Lasso, the Dirty model (70), and QUIRE. We use the SLEP (93), MALSAR (160), and QUIRE packages for the implementations of these models. The overall performance of the algorithms is shown in Figure 6.3, in which we report the average AUC score over five runs of 5-fold cross-validation experiments for cancer stage prediction in RCC.

[Figure 6.3: Comparison of the classification performance (average area under the ROC curve) of different feature selection approaches (APLasso, Lasso, Trace-norm, Group Lasso, Dirty Model, QUIRE) and our model (FHIM) in identifying the different stages of RCC: Benign vs. Stage 1-4; Benign, Stage 1 vs. Stage 2-4; and Benign, Stage 1-2 vs. Stage 3-4. We perform five-fold cross-validation five times and report the average AUC score. For updated results, please refer to Chapter 7 (Figure 7.2).]

In the 5-fold cross-validation experiments, we train our model on four folds to identify the main effects and pairwise interactions, and we use the remaining fold to test the prediction. The average AUCs achieved by the features selected with our model are 0.72, 0.93, and 0.92, respectively, for the three cases discussed in Section 6.7.2. We performed pairwise t-tests comparing our method against the other methods, and all p-values are below 0.0075. From Figures 6.3 and 7.2 (Chapter 7), it is clear that FHIM outperforms all the other algorithms that do not use prior feature-group information in all three RCC classification cases. In addition, our model performs better than the state-of-the-art technique QUIRE (102), which uses Gene Ontology based functional annotation for the grouping and clustering of genes to identify high-order interactions. Interestingly, we found that FHIM selects the 'STC1' biomarker to achieve its best AUC score; in other words, for the RCC dataset, FHIM did not find any feature interactions.

6.7.5 Computational Time Analysis

FHIM has O(np) time complexity for Algorithm 4. In general, FHIM takes more time than the Lasso approach, since we perform alternating optimization over β and a_k. For the q ~ n setting with n = 1000 and q = 1275, our OWD optimization method in Matlab takes around 1 minute for 5-fold cross-validation, while for p > n with p = 2000 and n = 100, our FHIM model took around 2 hours for 5-fold cross-validation. Our experiments were run on an Intel i3 dual-core 2.9 GHz CPU with 8 GB of RAM.

Table 6.3: Comparison of optimization methods for our FHIM model based on test-data prediction error

n, p         OWD     Soft-thresholding   PSS
100, 500     230.5   276.2               239.5
100, 1000    340.1   710.5               358.7
100, 2000    907.8   1174.1              927.4

Table 6.4: Support recovery of β, W

n, p          Sparsity (β, a_k)   K   Support recovery of β, W (F1 score)
1000, 50      5, 5                1   1.0, 1.0
1000, 50      5, 5                3   1.0, 0.95
1000, 50      5, 5                5   1.0, 0.82
10000, 500    10, 20              1   0.95, 0.72
10000, 500    10, 20              3   0.80, 0.64
10000, 500    10, 20              5   0.72, 0.55

6.8 Conclusions and Future Work

In this chapter, we proposed a factorization-based sparse learning framework called FHIM for identifying high-order feature interactions in linear and logistic regression models, and we studied several optimization methods for our model.
Empirical experiments on synthetic and real datasets showed that our model outperforms several well-known techniques, such as Lasso, Trace-norm, and Group Lasso, and achieves performance comparable to the current state-of-the-art method, QUIRE, while not assuming any prior knowledge about the data. Our model gives 'interpretable' results on the RCC dataset, which can be used for biomarker discovery for disease diagnosis.

Table 6.5: Recovering K using the greedy strategy

n, p         True K   Estimated K   W support recovery (F1 score)
1000, 50     1        1             1.0
1000, 50     3        3             1.0
1000, 50     5        5             0.8
100, 100     1        2             0.75
100, 500     3        2             0.6
100, 1000    5        4             0.5

In the future, we will consider the following directions: 1) we will consider the factorization of the weight matrix W as W = Σ_k a_k b_k^T and higher-order feature interactions, which is more general but leads to a non-convex optimization problem; 2) we will extend our optimization methods from single-task learning to multi-task learning; and 3) we will consider groupings of features for both single-task and multi-task learning.

6.9 Appendix

6.9.1 Proofs for Section 6.5

Proof [Lemma 6.5.1]: Let η_n = n^{−1/2} + a_n and let {θ* + η_n δ : ||δ|| ≤ d} be the ball around θ*, where δ = (u_1, ..., u_p, v_11, ..., v_Kp)^T = (u^T, v^T)^T. Define

    D_n(δ) ≡ Q_n(θ* + η_n δ) − Q_n(θ*),

where Q_n(θ*) is defined in Equation (6.8). For δ satisfying ||δ|| = d, we have

    D_n(δ) = −L_n(θ* + η_n δ) + L_n(θ*) + n Σ_j λ_{β_j} (|β*_j + η_n u_j| − |β*_j|) + n Σ_{k,l} λ_{α_k,l} (|α*_{k,l} + η_n v_kl| − |α*_{k,l}|)
    ≥ −L_n(θ* + η_n δ) + L_n(θ*) + n Σ_{j∈A_1} λ_{β_j} (|β*_j + η_n u_j| − |β*_j|) + n Σ_{(k,l)∈A_2} λ_{α_k,l} (|α*_{k,l} + η_n v_kl| − |α*_{k,l}|)
    ≥ −L_n(θ* + η_n δ) + L_n(θ*) − n η_n Σ_{j∈A_1} λ_{β_j} |u_j| − n η_n Σ_{(k,l)∈A_2} λ_{α_k,l} |v_kl|
    ≥ −L_n(θ* + η_n δ) + L_n(θ*) − n η_n² (Σ_{j∈A_1} |u_j| + Σ_{(k,l)∈A_2} |v_kl|)
    ≥ −L_n(θ* + η_n δ) + L_n(θ*) − n η_n² (|A_1| + |A_2|) d
    = −[∇L_n(θ*)]^T (η_n δ) − (1/2)(η_n δ)^T [∇²L_n(θ*)](η_n δ)(1 + o_p(1)) − n η_n² (|A_1| + |A_2|) d,

where we used a Taylor expansion in the last step. We split the above into three parts:

    K_1 = −η_n [∇L_n(θ*)]^T δ = −√n η_n ((1/√n) ∇L_n(θ*))^T δ = −O_p(n η_n²) δ,
    K_2 = (1/2) n η_n² {δ^T [−(1/n) ∇²L_n(θ*)] δ (1 + o_p(1))} = (1/2) n η_n² {δ^T [I(θ*)] δ (1 + o_p(1))},
    K_3 = −n η_n² (|A_1| + |A_2|) d.

Thus,

    D_n(δ) ≥ K_1 + K_2 + K_3 = −O_p(n η_n²) δ + (1/2) n η_n² {δ^T [I(θ*)] δ (1 + o_p(1))} − n η_n² (|A_1| + |A_2|) d.

We see that K_2 dominates the other terms and is positive, since I(θ) is positive definite at θ = θ* by regularity condition (C2). Therefore, for any given ε > 0 there exists a large enough constant d such that

    P{ inf_{||δ||=d} Q_n(θ* + η_n δ) > Q_n(θ*) } ≥ 1 − ε.

This implies that with probability at least 1 − ε there exists a local minimizer in the ball {θ* + η_n δ : ||δ|| ≤ d}. Thus, there exists a local minimizer of Q_n(θ) such that ||θ̂_n − θ*|| = O_p(η_n).

Proof [Theorem 6.5.2]: Let us first consider P(α̂_{A_2^c} = 0) → 1. It is sufficient to show that, for any (k, l) ∈ A_2^c,

    ∂Q_n(θ̂_n)/∂α_{k,l} < 0 for −ε_n < α̂_{k,l} < 0,    (6.11)
    ∂Q_n(θ̂_n)/∂α_{k,l} > 0 for ε_n > α̂_{k,l} > 0,    (6.12)

with probability tending to 1, where ε_n = C n^{−1/2} and C > 0 is any constant. To show (6.12), notice that

    ∂Q_n(θ̂_n)/∂α_{k,l} = −∂L_n(θ̂_n)/∂α_{k,l} + n λ_{α_k,l} sgn(α̂_{k,l})
    = −∂L_n(θ*)/∂α_{k,l} − Σ_{j=1}^{p(K+1)} (∂²L_n(θ*)/(∂α_{k,l} ∂θ_j))(θ̂_j − θ*_j) − Σ_{j=1}^{p(K+1)} Σ_{m=1}^{p(K+1)} (∂³L_n(θ̃)/(∂α_{k,l} ∂θ_j ∂θ_m))(θ̂_j − θ*_j)(θ̂_m − θ*_m) + n λ_{α_k,l} sgn(α̂_{k,l}),

where θ̃ lies between θ̂_n and θ*.
By regularity conditions (C1)-(C3) and Lemma 6.5.1, we have

    ∂Q_n(θ̂_n)/∂α_{k,l} = √n {O_p(1) + √n λ_{α_k,l} sgn(α̂_{k,l})}.

Since √n λ_{α_k,l} → ∞ for (k, l) ∈ A_2^c by assumption, the sign of ∂Q_n(θ̂_n)/∂α_{k,l} is dominated by sgn(α̂_{k,l}). Thus,

    P( ∂Q_n(θ̂_n)/∂α_{k,l} > 0 for 0 < α̂_{k,l} < ε_n ) → 1 as n → ∞.

(6.11) has an identical proof. Also, P(β̂_{A_1^c} = 0) → 1 can be proved similarly, since in our model β and α are independent of each other.

Proof [Theorem 6.5.3]: Let Q_n(θ_A) denote the objective function Q_n restricted to the A-component of θ, that is, Q_n(θ) with θ_{A^c} = 0. Based on Lemma 6.5.1 and Theorem 6.5.2, we have P(θ̂_{A^c} = 0) → 1. Thus,

    P( argmin_{θ_A} Q_n(θ_A) = (A-component of argmin_θ Q_n(θ)) ) → 1.

Hence θ̂_A should satisfy

    ∂Q_n(θ_A)/∂θ_j |_{θ_A = θ̂_A} = 0 for all j ∈ A    (6.13)

with probability tending to 1. Let L_n(θ_A) and P_λ(θ_A) denote the log-likelihood function of θ_A and the penalty function of θ_A, respectively, so that

    Q_n(θ_A) = −L_n(θ_A) + n P_λ(θ_A).

From (6.13), we have

    ∇_A Q_n(θ̂_A) = −∇_A L_n(θ̂_A) + n ∇_A P_λ(θ̂_A) = 0    (6.14)

with probability tending to 1. Now, by Taylor expansion of the first and second terms at θ_A = θ*_A, we get

    −∇_A L_n(θ̂_A) = −∇_A L_n(θ*_A) − [∇²_A L_n(θ*_A) + o_p(1)](θ̂_A − θ*_A)
    = √n { −(1/√n) ∇_A L_n(θ*_A) + I(θ*_A) √n (θ̂_A − θ*_A) + o_p(1) },

    n ∇_A P_λ(θ̂_A) = n [λ_{β_j} sgn(β_j); λ_{α_k,l} sgn(α_{k,l})]_{j∈A_1, (k,l)∈A_2} + o_p(1)(θ̂_A − θ*_A) = √n o_p(1),

since √n a_n = o(1) and ||θ̂_A − θ*_A|| = O_p(n^{−1/2}). Thus we get

    0 = √n { −(1/√n) ∇_A L_n(θ*_A) + I(θ*_A) √n (θ̂_A − θ*_A) + o_p(1) }.

Therefore, by the central limit theorem,

    √n (θ̂_A − θ*_A) →_d N(0, I^{−1}(θ*_A)).

Proof [Lemma 6.5.4]: Let η_n = √p_n n^{−1/2} + a_n and let {θ*_n + η_n δ : ||δ|| ≤ d} be the ball around θ*_n, where δ = (u_1, ..., u_p, v_11, ..., v_Kp)^T = (u^T, v^T)^T. Define

    D_n(δ) ≡ Q_n(θ*_n + η_n δ) − Q_n(θ*_n).

Let −L_n and n P_n denote the first and second terms of Q_n. For any δ satisfying ||δ|| = d, we have

    D_n(δ) = −L_n(θ*_n + η_n δ) + L_n(θ*_n) + n P_n(θ*_n + η_n δ) − n P_n(θ*_n)
    = −L_n(θ*_n + η_n δ) + L_n(θ*_n) + n Σ_{j∈A_n1} λ_{β_nj} (|β*_j + η_n u_j| − |β_j|) + n Σ_{(k,l)∈A_n2} λ_{α_k,nl} (|α*_{k,l} + η_n v_{k,l}| − |α*_{k,l}|)
    ≥ −L_n(θ*_n + η_n δ) + L_n(θ*_n) − n η_n ( Σ_{j∈A_n1} λ_{β_nj} |u_j| + Σ_{(k,l)∈A_n2} λ_{α_k,nl} |v_{k,l}| )
    ≥ −L_n(θ*_n + η_n δ) + L_n(θ*_n) − n η_n ( Σ_{j∈A_n1} a_n |u_j| + Σ_{(k,l)∈A_n2} a_n |v_{k,l}| )
    ≥ −L_n(θ*_n + η_n δ) + L_n(θ*_n) − n η_n (√s_n a_n) d
    = −L_n(θ*_n + η_n δ) + L_n(θ*_n) − n η_n² d
    = −[∇L_n(θ*_n)]^T (η_n δ) − (1/2)(η_n δ)^T [∇²L_n(θ*_n)](η_n δ) − (1/6) ∇^T {δ^T [∇²L_n(θ̃_n)] δ} δ η_n³ − n η_n² d (by Taylor series expansion),

where θ̃_n lies between (θ*_n + η_n δ) and θ*_n.
We split the above into four parts:

    K_1 = −[∇L_n(θ*_n)]^T (η_n δ),
    K_2 = −(1/2)(η_n δ)^T [∇²L_n(θ*_n)](η_n δ),
    K_3 = −(1/6) ∇^T {δ^T [∇²L_n(θ̃_n)] δ} δ η_n³,
    K_4 = −n η_n² d.

Then

    |K_1| = |−η_n [∇L_n(θ*_n)]^T δ| ≤ η_n ||∇L_n(θ*_n)|| ||δ|| = O_p(η_n √(n p_n)) δ = O_p(n η_n²) d.

Next, since

    (1/n) ∇²L_n(θ*_n) + I_n(θ*_n) = o_p(1/p_n)    (6.15)

by Chebyshev's inequality and (C5), we can show that

    K_2 = −(1/2)(η_n δ)^T [∇²L_n(θ*_n)](η_n δ) = (1/2) n η_n² δ^T [I_n(θ*_n)] δ − (1/2) n η_n² d² o_p(1).

Moreover, by the Cauchy-Schwarz inequality, (C6), and the conditions √n a_n → 0 and p_n^5/n → 0,

    |K_3| = |−(1/6) ∇^T {δ^T [∇²L_n(θ̃_n)] δ} δ η_n³| = (1/6) η_n³ | Σ_{i=1}^n Σ_{j,l,m=1}^{p_n} (∂³L_n(θ̃_n)/(∂θ_nj ∂θ_nl ∂θ_nm)) δ_j δ_l δ_m |
    ≤ η_n³ Σ_{i=1}^n ( Σ_{j,l,m=1}^{p_n} M²_njlm(V_ni) )^{1/2} ||δ||³ = n η_n³ O_p(p_n^{3/2}) (p_n O(1))^{1/2} ||δ||² = n η_n² O_p(η_n p_n²) d² = n η_n² o_p(1) d².

Hence

    D_n(δ) ≥ K_1 + K_2 + K_3 + K_4 = −O_p(n η_n²) δ + (1/2) n η_n² δ^T [I_n(θ*_n)] δ − (3/2) n η_n² o_p(1) d² − n η_n² d.

We see that K_2 dominates the other terms and is positive, since I_n(θ_n) is positive definite at θ_n = θ*_n by (C5). Therefore, for any given ε > 0 there exists a large enough constant d such that

    P{ inf_{||δ||=d} Q_n(θ*_n + η_n δ) > Q_n(θ*_n) } ≥ 1 − ε.

This implies that with probability at least 1 − ε there exists a local minimizer in the ball {θ*_n + η_n δ : ||δ|| ≤ d}. Thus, there exists a local minimizer of Q_n(θ_n) such that ||θ̂_n − θ*_n|| = O_p(η_n √p_n).

Proof [Theorem 6.5.5, sparsity]: First, we prove P(β̂_{A_{n1}^c} = 0) → 1 as n → ∞. It is sufficient to show that, with probability tending to 1, for any j ∈ A_{n1}^c,

    ∂Q_n(θ̂_n)/∂β_nj < 0 for −ε_n < β̂_nj < 0,    (6.16)
    ∂Q_n(θ̂_n)/∂β_nj > 0 for ε_n > β̂_nj > 0,    (6.17)

where ε_n = C n^{−1/2} and C > 0 is any constant. To show (6.17), we consider the Taylor expansion of ∂Q_n(θ̂_n)/∂β_nj at θ = θ*_n:

    ∂Q_n(θ̂_n)/∂β_nj = −∂L_n(θ̂_n)/∂β_nj + n λ_{β_nj} sgn(β̂_nj)
    = −∂L_n(θ*_n)/∂β_nj − Σ_{k=1}^{p_n} (∂²L_n(θ*_n)/(∂β_nj ∂θ_nk))(θ̂_nk − θ*_nk) − Σ_{k=1}^{p_n} Σ_{l=1}^{p_n} (∂³L_n(θ̃_n)/(∂β_nj ∂θ_nk ∂θ_nl))(θ̂_nk − θ*_nk)(θ̂_nl − θ*_nl) + n λ_{β_nj} sgn(β̂_nj),    (6.18)

where θ̃_n lies between θ̂_n and θ*_n. By (C4)-(C6), Lemma 6.5.4, and carefully bounding the parts of Equation (6.18) using the Cauchy-Schwarz inequality, we have

    ∂Q_n(θ̂_n)/∂β_nj = O_p(√(n p_n)) + n λ_{β_nj} sgn(β̂_nj) = √(n p_n) { O_p(1) + √(n/p_n) λ_{β_nj} sgn(β̂_nj) }.

Since √(n/p_n) b_n → ∞, sgn(β̂_nj) dominates the sign of ∂Q_n(θ̂_n)/∂β_nj when n is large. Thus,

    P( ∂Q_n(θ̂_n)/∂β_nj > 0 for 0 < β̂_nj < ε_n ) → 1 as n → ∞.

(6.16) can be shown in the same way. Also, P(α̂_{n A_{n2}^c} = 0) → 1 can be proved similarly, since in our model β_n and α_n are independent of each other.

Proof [Theorem 6.5.5, asymptotic normality]: We want to show that, with probability tending to 1,

    √n A_n I_n^{1/2}(θ*_{nA_n})(θ̂_{nA_n} − θ*_{nA_n}) = √n A_n I_n^{−1/2}(θ*_{nA_n}) [(1/n) ∇_{A_n} L_n(θ*_{nA_n})] + o_p(n^{−1/2}).    (6.19)

Also, we need to show that, with probability tending to 1,

    √n A_n I_n^{1/2}(θ*_{nA_n})(θ̂_{nA_n} − θ*_{nA_n}) = (1/√n) A_n I_n^{−1/2}(θ*_{nA_n}) Σ_{i=1}^n [∇L_ni(θ*_{nA_n})] + o_p(A_n I_n^{−1/2}(θ*_{nA_n}) 1_{s_n×1})
    = (1/√n) A_n I_n^{−1/2}(θ*_{nA_n}) Σ_{i=1}^n [∇L_ni(θ*_{nA_n})] + o_p(1)
    ≡ Σ_{i=1}^n Y_ni + o_p(1) →_d N(0, G),    (6.20)

where Y_ni = (1/√n) A_n I_n^{−1/2}(θ*_{nA_n})[∇L_ni(θ*_{nA_n})]. We now prove (6.19) and (6.20) in parts (I) and (II), respectively.

(I) We want to show that I_n(θ*_{nA_n})(θ̂_{nA_n} − θ*_{nA_n}) = (1/n) ∇_{A_n} L_n(θ*_{nA_n}) + o_p(n^{−1/2}).
We know that, with probability tending to 1,

    ∇_{A_n} Q_n(θ̂_{nA_n}) = −∇_{A_n} L_n(θ̂_{nA_n}) + n ∇_{A_n} P_{λ_n}(θ̂_{nA_n}) = 0.

By Taylor expansion of ∇_{A_n} L_n(θ̂_{nA_n}) at θ = θ*_{nA_n}, and substituting into (6.19), we get

    I_n(θ*_{nA_n})(θ̂_{nA_n} − θ*_{nA_n}) = −(1/n) ∇²_{A_n} L_n(θ*_{nA_n})(θ̂_{nA_n} − θ*_{nA_n}) + [I_n(θ*_{nA_n}) + (1/n) ∇²_{A_n} L_n(θ*_{nA_n})](θ̂_{nA_n} − θ*_{nA_n})
    = (1/n) ∇_{A_n} L_n(θ*_{nA_n}) − (1/2n)(θ̂_{nA_n} − θ*_{nA_n})^T [∇_{A_n}(∇²_{A_n} L_n(θ*_{nA_n}))](θ̂_{nA_n} − θ*_{nA_n}) − ∇_{A_n} P_{λ_n}(θ*_{nA_n}) + [I_n(θ*_{nA_n}) + (1/n) ∇²_{A_n} L_n(θ*_{nA_n})](θ̂_{nA_n} − θ*_{nA_n}).

Therefore, it is sufficient to show that

    −(1/2n)(θ̂_{nA_n} − θ*_{nA_n})^T [∇_{A_n}(∇²_{A_n} L_n(θ*_{nA_n}))](θ̂_{nA_n} − θ*_{nA_n}) − ∇_{A_n} P_{λ_n}(θ*_{nA_n}) + [I_n(θ*_{nA_n}) + (1/n) ∇²_{A_n} L_n(θ*_{nA_n})](θ̂_{nA_n} − θ*_{nA_n}) ≡ A_1 + A_2 + A_3 = o_p(n^{−1/2}).

Now, using the Cauchy-Schwarz inequality and (C6), we can show that ||A_1||² = o_p(1/n). Since a_n = o(1/√(n p_n)) by the condition of the theorem,

    ||A_2||² = ||(λ_{β_n1} sgn(β*_n1), ..., λ_{α_k,n,Kp} sgn(α*_{k,Kp}))^T||² ≤ s_n [max{λ_{β_nj}, λ_{α_k,n,l} : j ∈ A_n1, (k, l) ∈ A_n2}]² = s_n a_n² = s_n o(1/(n p_n)) = o(1/n).

Now, from Equation (6.15), we can show that

    ||A_3||² ≤ ||I_n(θ*_{nA_n}) + (1/n) ∇²_{A_n} L_n(θ*_{nA_n})||² ||θ̂_{nA_n} − θ*_{nA_n}||² = o_p(1/p_n²) O_p(p_n/n) = o_p(1/(n p_n)) = o_p(1/n).

Therefore, we get A_1 + A_2 + A_3 = o_p(n^{−1/2}).

(II) We now show Σ_{i=1}^n Y_ni + o_p(1) →_d N(0, G), where Y_ni = (1/√n) A_n I_n^{−1/2}(θ*_{nA_n})[∇_{A_n} L_ni(θ*_{nA_n})]. We need to show that the Y_ni satisfy the conditions of the Lindeberg-Feller central limit theorem. For any ε > 0, by the Cauchy-Schwarz inequality, we get

    Σ_i E[||Y_ni||² 1{||Y_ni|| > ε}] = n E[||Y_n1||² 1{||Y_n1|| > ε}] ≤ n [E||Y_n1||⁴]^{1/2} [E(1{||Y_n1|| > ε})]^{1/2} = n A_4^{1/2} A_5^{1/2}.

Now, solving for A_4, we get

    A_4 = (1/n²) E||A_n I_n^{−1/2}(θ*_{nA_n})[∇_{A_n} L_n1(θ*_{nA_n})]||⁴
    ≤ (1/n²) ||A_n^T A_n||² ||I_n^{−1}(θ*_{nA_n})||² E[∇^T_{A_n} L_n1(θ*_{nA_n}) ∇_{A_n} L_n1(θ*_{nA_n})]²
    = (1/n²) λ²_max(A_n^T A_n) λ²_max(I_n^{−1}(θ*_{nA_n})) O(s_n²) = O(p_n²/n²).

Now, by the Markov inequality,

    A_5 = P(||Y_n1|| > ε) ≤ E(||Y_n1||²)/ε² = O(p_n/n).

Therefore, we get

    Σ_i E[||Y_ni||² 1{||Y_ni|| > ε}] = n O(p_n/n) O(√(p_n/n)) = o(1).

Moreover, we have

    Σ_{i=1}^n Cov(Y_ni) = n Cov(Y_n1) = A_n A_n^T → G.

Since the Y_ni, i = 1, ..., n, satisfy the conditions of the Lindeberg-Feller central limit theorem, we have Σ_{i=1}^n Y_ni + o_p(1) →_d N(0, G).

6.9.2 Computing Adaptive Weights

Here, we explain how the adaptive weights w_{β_j}, w_{α_k,l} can be calculated for the tuning parameters λ_β, λ_{α_k} used in the theoretical properties (Theorems 6.5.3 and 6.5.5) of Section 6.5. Let q be the total number of predictors, and let n be the total number of instances. When n > q, we can compute the adaptive weights w_{β_j}, w_{α_k,l} for the tuning parameters λ_{β_j}, λ_{α_k,l} using the ordinary least squares (OLS) estimates on the training observations:

    λ_{β_j} = (log n / n) λ_β w_{β_j}, λ_{α_k,l} = (log n / n) λ_{α_k} w_{α_k,l},

where w_{β_j} = |1/β̂_j^{OLS}| and w_{α_k,l} = |1/α̂_{k,l}^{OLS}|. When q > n, the OLS estimates are not available, so we compute the weights using ridge regression estimates, that is, replacing all the OLS estimates above with ridge regression estimates. The tuning parameter for the ridge regression can be selected using cross-validation. Note that we find α̂_k^{OLS} by taking least squares with respect to each α_k, where k ∈ [0, K] for some K ≥ K_true. Without loss of generality, we can assume K = K_true for proving the theoretical properties in Section 6.5.
Even if K ≥ K_true, the theoretical properties are not affected, since the cardinality of A_2 (|A_2|) does not affect the root-n consistency (see the proof of Lemma 6.5.1). In practice, K is chosen greedily by our algorithm.
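A small sketch of this weight computation for the q > n case (using scikit-learn's ridge regression; the helper names and the zero-coefficient floor eps are our own choices) might look like:

```python
import numpy as np
from sklearn.linear_model import Ridge

def adaptive_weights(X, y, alpha=1.0, eps=1e-6):
    """Adaptive weights w_j = 1/|coef_j| from ridge estimates (for q > n).

    eps guards against division by zero for coefficients estimated as ~0.
    """
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    return 1.0 / np.maximum(np.abs(coef), eps)

def scaled_lambdas(w, lam, n):
    """Scaled tuning parameters lambda_j = (log n / n) * lam * w_j, per Section 6.5."""
    return (np.log(n) / n) * lam * w
```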
Kernel methods have been used to model high order feature interactions, but they only help to identify which orders are important rather than finding the relevant high order feature interactions. Recently, regularization methods have become very popular for feature selection because they are well suited to high-dimensional problems. Many regularization methods focus on identifying discriminative features or groups of discriminative features based on the $\ell_1$ penalty, group penalties, the trace-norm penalty (56), or a combination of these penalties, as in the Dirty model (70). More recent approaches (40), (23), (101) aim to recover not only the discriminative features but also high order feature interactions in regression models by enforcing strong and/or weak heredity (hierarchical) constraints. In strong heredity, a feature interaction term is included in the model only if the corresponding features are also included in the model, while in weak heredity, a feature interaction term is included when one of the features is included in the model (23). Even though hierarchical constraints help model interpretability in some applications, recent studies in bioinformatics and related areas have shown that feature interactions need not follow heredity constraints for the manifestation of diseases; thus, the above approaches based on heredity constraints have a limited chance of recovering relevant interactions in these areas. (102) proposed an efficient way to identify combinatorial interactions among interactive genes in complex diseases by using prior information such as gene ontology. However, they also make hereditary assumptions, which limits their model's capacity to capture all the important high order interactions. Thus, all these previous approaches are very unlikely to recover 'interpretable' blockwise high order feature and feature group interactions for prediction, either due to heredity constraints or because they do not incorporate existing domain knowledge. This motivates us to develop new efficient knowledge based techniques to capture the important 'blockwise' high-order feature and feature group interactions without making heredity assumptions.

Recently, in our previous work (113) (previous chapter), we proposed the Factorized High order Interactions Model (FHIM) to identify high order feature interactions in regression models in a greedy way, based on an $\ell_1$ penalty on features and without assuming heredity constraints. This chapter generalizes the sparse learning framework introduced in (113) with the following new and significant contributions: 1) we show how to incorporate domain knowledge into the sparse learning framework using a knowledge-based factorization technique and regularization penalties; 2) we show state-of-the-art results on 3 real world datasets to showcase the advantage of capturing prior information in our sparse learning framework; 3) we show that our Group FHIM retains the same nice theoretical properties as FHIM.

The remainder of the chapter is organized as follows: in Section 7.3, we discuss our problem formulation and the relevant notation used in the chapter. In Section 7.4, we discuss the main idea of this chapter. In Section 7.5, we present the optimization method used in our sparse learning framework, and we give an overview of theoretical properties in Section 7.6. In Section 7.7, we discuss our experimental setup and present our results on synthetic and real datasets.
Finally, in Section 7.8, we conclude the chapter with a discussion and future research directions.

7.3 Notations and Problem Formulation

For any vector $w$, let $\|w\|_2$ denote the Euclidean norm of $w$, and let $supp(w) \subset [1,p]$ denote the support of $w$, i.e., the set of features $i \in [1,p]$ with $w_i \neq 0$. A group of features is a subset $g \subset [1,p]$. The set of all possible groups is the power set of $[1,p]$; let us denote it by $\mathcal{P}$. Let $\mathcal{G} \subset \mathcal{P}$ denote a set of groups of features. In our chapter, the domain knowledge is presented in terms of $\mathcal{G}$. For any vector $w \in \mathbb{R}^p$ and any group $g \in \mathcal{G}$, let $w_g$ denote a vector whose entries are the same as those of $w$ for the features in $g$ and 0 for the other features. Let $W_g$ denote a matrix of size $p \times p$ for some $g \in \mathcal{G}$, where the entries of $W_g$ are non-zero for the corresponding column entries in $g$ (i.e., $W^{ij}_g \neq 0$ for $g \in \mathcal{G}$ and 0 otherwise). Let $\mathcal{V}_{\mathcal{G}} \in \mathbb{R}^{p \times G}$ denote a set of $N_{\mathcal{G}}$ tuples of vectors $v = (v_g)_{g \in \mathcal{G}}$, where each $v_g$ is a separate vector in $\mathbb{R}^p$ with $supp(v_g) \subset g$, $\forall g \in \mathcal{G}$. If two groups overlap, then they share at least one feature in common.

Let $\{(X^{(i)}, y^{(i)})\}, i \in [1,n]$, represent a training set of $n$ samples and $p$ features (predictors), where $X^{(i)} \in \mathbb{R}^p$ is the $i$-th instance (column) of the design matrix $X$ and $y^{(i)} \in \{-1, 1\}$ is the $i$-th instance of the response variable (output) $y$. Let $\{\beta, \beta_g\} \in \mathbb{R}^p$ be the weight vectors associated with single features (also called main effects) and feature groups respectively, and let $\beta_0 \in \mathbb{R}$ be the bias term. Note that $\beta = \sum_{g \in \mathcal{G}} \beta_g$. Let $W$ be the weight matrix associated with the pairwise feature group interactions, and let $W_{OD}$ be the weight matrix associated with only the pairwise feature group interactions without self interactions. $W_{OD}$ is an off-diagonal matrix and is given by equation (7.7).

In this chapter, we study the problem of identifying the discriminative feature groups $\beta_g$ and the pairwise feature group interactions $W_{OD}$ in classification settings, when domain knowledge such as a grouping of features ($\mathcal{G}$) is given, and without making any heredity assumptions. In the classification setting, we can model the output in terms of the features and their high order interactions using a logistic regression model or large-margin models. Here we consider both of these popular classifiers. A logistic regression model with pairwise interactions can be written as follows:

\[ p(y^{(i)} \mid X^{(i)}) = \frac{1}{1 + \exp\big(-y^{(i)}(\beta^T X^{(i)} + X^{(i)T} W_{OD} X^{(i)} + \beta_0)\big)} \qquad (7.1) \]

The corresponding loss function or empirical risk (i.e., the sum of the negative log-likelihood of the training data) is given by

\[ L_{LogReg}(\beta, W_{OD}, \beta_0) = \sum_{i=1}^n \log\Big(1 + \exp\big(-y^{(i)}(\beta^T X^{(i)} + X^{(i)T} W_{OD} X^{(i)} + \beta_0)\big)\Big) \qquad (7.2) \]

Similarly, we can solve the classification problem with high order interactions using a large-margin formulation with the hinge loss:

\[ L_{Hinge}(\beta, W_{OD}, \beta_0) = \sum_{i=1}^n \max\Big(0,\; 1 - y^{(i)}(\beta^T X^{(i)} + X^{(i)T} W_{OD} X^{(i)} + \beta_0)\Big) \qquad (7.3) \]
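For concreteness, the sketch below (a minimal Python illustration, not the chapter's code; it assumes the rows of X hold the samples and that $W_{OD}$ is given) evaluates the two losses in equations (7.2) and (7.3).

```python
# A minimal sketch of the losses in Equations (7.2) and (7.3);
# labels y_i are in {-1, +1}.
import numpy as np

def logistic_interaction_loss(X, y, beta, W_OD, beta0):
    """sum_i log(1 + exp(-y_i (beta^T x_i + x_i^T W_OD x_i + beta0)))."""
    # Linear (main-effect) term plus quadratic pairwise-interaction term.
    margin = X @ beta + np.einsum('ij,jk,ik->i', X, W_OD, X) + beta0
    return np.sum(np.logaddexp(0.0, -y * margin))

def hinge_interaction_loss(X, y, beta, W_OD, beta0):
    """Large-margin analogue, Equation (7.3)."""
    margin = X @ beta + np.einsum('ij,jk,ik->i', X, W_OD, X) + beta0
    return np.sum(np.maximum(0.0, 1.0 - y * margin))
```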
7.4 Group FHIM

Here, we present our optimization-driven, knowledge based sparse learning framework to identify discriminative feature groups and pairwise feature-group interactions (blockwise interactions) for the classification problem introduced in the previous section. For simplicity, here we consider groups that do not overlap. A natural way to recover the feature groups and their interactions is by regularization, as shown below:

\[ \{\hat{\beta}, \hat{W}\} = \operatorname*{argmin}_{\beta, W}\; L(\beta, W) + \lambda_\beta \sum_{g \in \mathcal{G}} \|\beta_g\|_2 + \lambda_W \sum_{g \in \mathcal{G}} \|vec(W_g)\|_2 \qquad (7.4) \]

where $vec(W_g)$ is the vectorization of the group block matrix $W_g$. When the number of input features is huge (e.g., in biomedical applications), it is practically impossible to explicitly consider pairwise or even higher-order interactions among all the input feature groups based on a simple $\ell_1$ penalty or Group Lasso penalty. To solve this problem, we propose a novel way to factorize the blockwise interaction weight matrix $W$ as a sum of $K$ rank-one matrices. Each rank-one matrix is represented by the outer product of two identical vectors (termed rank-one factors) with the grouping structure imposed on these vectors. The feature group interactions of $W$ can be effectively captured by the grouping on the rank-one factors. Proposition 7.4.1 shows that grouping on the rank-one factors of $W$ is a feasible option for representing the blockwise interactions present in $W$.

Proposition 7.4.1. A feasible decomposition of blockwise $W$ is

\[ W = \sum_{k=1}^K \Big(\sum_{g \in \mathcal{G}} a_{kg}\Big) \otimes \Big(\sum_{g \in \mathcal{G}} a_{kg}\Big) \]

where $\otimes$ represents the tensor product/outer product and $a_k$ is a rank-one factor of $W$ given by $a_k = \sum_{g \in \mathcal{G}} a_{kg}$.

The above proposition can be easily verified by constructing each rank-one matrix decomposition of $W$ as a weighted combination of the group block matrices $W_g$.

Now, we can rewrite the optimization problem (7.4) to identify the discriminative feature groups and pairwise feature group interactions by using the grouped rank-one factors as follows:

\[ \{\hat{\beta}, \hat{a}_k\} = \operatorname*{argmin}_{a_k, \beta}\; L(\beta, W_{OD}) + P_\lambda(\beta, a_k) \qquad (7.5) \]

where

\[ P_\lambda(\beta, a_k) = \lambda_\beta \sum_{g \in \mathcal{G}} \|\beta_g\|_2 + \sum_k \lambda_{a_k} \sum_{g \in \mathcal{G}} \|a_{kg}\|_2 \qquad (7.6) \]

and

\[ W_{OD} = \sum_{k=1}^K \Big(\sum_{g \in \mathcal{G}} a_{kg}\Big) \otimes \Big(\sum_{g \in \mathcal{G}} a_{kg}\Big) - D\Big(\sum_{k=1}^K (\tilde{a}^2_{k,i})_{i \in [1,p]}\Big) \qquad (7.7) \]

where $\hat{\beta}, \hat{a}_k$ represent the estimated parameters of our model, $D$ is a diagonalizing matrix operator which returns a $p \times p$ diagonal matrix, and $\tilde{a}_{k,i}$ is the $i$-th component of $a_k$. Let $Q$ represent the objective function (the loss function with the regularization penalties), i.e., the right hand side of equation (7.5). We replace $L$ in (7.5) by $L_{LogReg}(\beta, W_{OD}, \beta_0)$ for logistic regression, and by $L_{Hinge}(\beta, W_{OD}, \beta_0)$ for large-margin classification. We call our model the Group Factorization-based High-order Interaction Model (Group FHIM). In Section 7.5.2, we present a greedy alternating optimization algorithm to solve our optimization problem.

Note that we use $W_{OD}$ in equation (7.5) instead of $W$. Although the original $W$ is a sum of $K$ rank-one matrices with maximum rank $K$, the actual rank of $W_{OD}$ is often much larger than $K$. However, $W$ and the off-diagonal $W_{OD}$ define the same blockwise interaction patterns between different input features. In practice, we often focus on identifying interpretable discriminative high-order interactions between different features instead of uninteresting self-interactions. Moreover, removing the diagonal elements of $W$ has the advantage of eliminating the interference between optimizing $\beta$ and optimizing the $a_k$'s for binary input feature vectors, which greatly helps our alternating optimization procedure and often results in a much better local optimum in practice. Our empirical studies also show that, even for continuous input features, $W_{OD}$ often results in faster parameter learning and better local optima. Therefore, we use $W_{OD}$ instead of $W$ in the objective functions of both FHIM and Group FHIM for all the experiments in this chapter.
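The construction of $W_{OD}$ in equation (7.7) amounts to summing the rank-one matrices $a_k \otimes a_k$ and zeroing out the diagonal. A minimal sketch (our own, with assumed names), assuming each row of a matrix A already holds a rank-one factor $a_k = \sum_g a_{kg}$:

```python
# A minimal sketch of Equation (7.7).
import numpy as np

def build_W_OD(A):
    """A has shape (K, p); row k is the rank-one factor a_k.
    Returns sum_k a_k a_k^T with the diagonal (self-interactions) removed."""
    W = A.T @ A                       # sum of rank-one matrices a_k (x) a_k
    W_OD = W - np.diag(np.diag(W))    # subtract D(sum_k a_{k,i}^2)
    return W_OD
```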
7.4.1 Overlapping Group FHIM

The non-overlapping group structure used in Group FHIM limits its applicability in practice. Hence, we propose an extension of Group FHIM to overlapping groups and call our method Overlapping Group FHIM (denoted by OvGroup FHIM). In OvGroup FHIM, we consider the overlapping group penalty (69) instead of the $\ell_1/\ell_2$ penalty used in Group FHIM. The overlapping group penalty for $a_k$ is given below in equation (7.8); the overlapping group penalty for $\beta$ is similar.

\[ \Omega^{\mathcal{G}}_{overlap}(a_k) = \inf_{v \in \mathcal{V}_{\mathcal{G}},\; \sum_{g \in \mathcal{G}} v_g = a_k} \;\sum_{g \in \mathcal{G}} \|v_g\| \qquad (7.8) \]

7.5 Optimization

In this section, we briefly discuss the optimization method that we employ to solve our optimization problem (7.5), which corresponds to lines 4 and 5 of Algorithm 5 (Section 7.5.2). Chapter 3 of (119) provides a good survey of several optimization approaches for solving group $\ell_1$-regularized regression problems. For our sparse learning model, we choose the spectral projected gradient method, since we found it to be much faster than other popular approaches such as quasi-Newton methods.

7.5.1 Spectral Projected Gradient

The spectral projected gradient method is a popular method for solving convex constrained optimization problems. It uses two modifications of the projected gradient method to achieve a faster convergence rate: first, it initializes the line search with a Barzilai-Borwein step size, and second, it uses a non-monotonic version of the Armijo condition. Note that for non-overlapping groups, we can compute the projection across the groups by independently solving the projection problem for each group. For each group, the projection problem corresponds to solving a Euclidean projection onto a second order proper cone, which is described in detail in (119).

7.5.2 Greedy Alternating Optimization

The optimization problem in equation (7.5) is convex in $\beta$ but non-convex in $a_k$. The non-convexity of our problem makes it difficult to propose an optimization strategy which guarantees convergence to a global optimum. Thus, we propose a greedy alternating optimization approach (shown in Algorithm 5) to find a local optimum for our problem, and we discuss the theoretical properties of this local optimum. Please refer to Section 7.7.1 for details about the optimization settings.

Algorithm 5 Greedy Alternating Optimization
1: Initialize $\beta$ to $\beta_{LASSO}$, $K = 1$ and $a_K = \mathbf{1}$
2: While ($K == 1$) OR ($a_{K-1} \neq \mathbf{0}$ for $K > 1$)
3:   Repeat until convergence
4:     $a_K^t = \operatorname{argmin} Q(a_K^{t-1}, \beta^{t-1})$
5:     $\beta^t = \operatorname{argmin} Q(\beta^{t-1}, a_K^t)$
6:   End Repeat
7:   $K = K + 1$; $a_K = \mathbf{1}$
8: End While
9: Return the $a_K$'s and $\beta$ which give the least loss function value.
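A high-level sketch of Algorithm 5 is given below. It is our own loose Python rendering, not the chapter's Matlab implementation: the callables `solve_a`, `solve_beta` and `loss` stand in for the spectral-projected-gradient subproblem solvers and the loss evaluation, which are assumed rather than shown.

```python
# A loose sketch of the greedy alternating optimization (Algorithm 5).
import numpy as np

def greedy_alternating_opt(X, y, solve_beta, solve_a, loss, max_K=10, tol=1e-4):
    p = X.shape[1]
    beta = solve_beta(X, y, a_list=[])            # LASSO-style initialization
    a_list, best = [], (np.inf, None)
    for K in range(1, max_K + 1):
        a_k = np.ones(p)                          # new factor a_K starts at 1
        prev = np.inf
        while True:                               # alternate until convergence
            a_k = solve_a(X, y, beta, a_list, a_k)
            beta = solve_beta(X, y, a_list=a_list + [a_k])
            cur = loss(X, y, beta, a_list + [a_k])
            if abs(prev - cur) < tol:
                break
            prev = cur
        if cur < best[0]:                         # track the best loss so far
            best = (cur, (beta.copy(), [a.copy() for a in a_list] + [a_k]))
        if np.allclose(a_k, 0):                   # stop adding rank-one factors
            break
        a_list.append(a_k)
    return best[1]                                # parameters with least loss
```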
7.6 Theoretical Properties

In this section, we study the asymptotic behavior of our proposed Group FHIM for likelihood based generalized linear regression models (e.g., the logistic regression model). The theorems shown here are similar to the ones in our previous work (113). However, in this chapter, we show that the asymptotic properties still hold even with the regularization penalty on the rank-one factors used in our Group FHIM estimator.

Problem Setup: Assume that the data $V_i = (X_i, y_i), i = 1,\ldots,n$ are collected independently and $Y_i$ has a density $f(Z(X_i), y_i)$ conditioned on $X_i$, where $Z$ is a known regression function with grouped main effects and all possible pairwise group interactions. Let $\beta^*_h$ and $a^*_{k,h}$ denote the underlying true parameters satisfying the blockwise properties implied by our factorization. Let $\theta^* = (\beta^{*T}, \alpha^{*T})^T$, where $\beta^* = (\beta^*_h)$, $\alpha^* = (a^*_{k,h})$, $k = 1,\ldots,K$; $h = 1,\ldots,|\mathcal{G}|$ (note: $\theta^*$ is $p(K+1) \times 1$). We consider the estimates for Group FHIM as $\hat{\theta}_n$:

\[ \hat{\theta}_n = \operatorname*{argmin}_\theta Q_n(\theta) = \operatorname*{argmin}_\theta \Big( -\frac{1}{n}\sum_{i=1}^n L(Z(X_i), y_i) + \lambda_\beta \sum_h \|\beta_h\|_2 + \sum_k \lambda_{\alpha_k} \sum_h \|\alpha_{k,h}\|_2 \Big) \qquad (7.9) \]

where $L(Z(X_i), y_i)$ is the loss function of a generalized linear regression model with pairwise feature group interactions. In the case of logistic regression, $Z(\cdot)$ takes the form of equation (7.1) and $L(\cdot)$ takes the form of equation (7.2). Now, let us define

\[ \mathcal{A}_1 = \{h : \beta^*_h \neq 0\}, \qquad \mathcal{A}_2 = \{(k, h') : \alpha^*_{k,h'} \neq 0\}, \qquad \mathcal{A} = \mathcal{A}_1 \cup \mathcal{A}_2 \qquad (7.10) \]

where $\mathcal{A}_1$ contains the indices of the groups of main terms which correspond to the non-zero true group coefficients, and similarly $\mathcal{A}_2$ contains the indices of the factorized group interaction terms whose true group coefficients are non-zero. Let us define

\[ a_n = \max\{\lambda^\beta_h, \lambda^{\alpha_k}_{h'} : h \in \mathcal{A}_1, (k, h') \in \mathcal{A}_2\}, \qquad b_n = \min\{\lambda^\beta_h, \lambda^{\alpha_k}_{h'} : h \in \mathcal{A}^c_1, (k, h') \in \mathcal{A}^c_2\} \qquad (7.11) \]

where $\mathcal{A}^c_1$ is the complement of the set $\mathcal{A}_1$. Now, we show that our model possesses the oracle properties for $n \to \infty$ with fixed $p$ under some regularity conditions. The asymptotic properties for $p_n \to \infty$ as $n \to \infty$ will be addressed in our future work.

7.6.1 Asymptotic Oracle Properties when $n \to \infty$

The asymptotic properties when the sample size increases and the number of predictors is fixed are described in the following theorems. We show that Group FHIM possesses the oracle properties under the regularity conditions (R1)-(R3) shown below. Let $\Omega$ denote the parameter space for $\theta$.

(R1) The observations $V_i : i = 1,\ldots,n$ are independent and identically distributed with a probability density $f(V, \theta)$, which has a common support. We assume the density $f$ satisfies the following equations:

\[ E_\theta\Big[\frac{\partial \log f(V,\theta)}{\partial \theta_j}\Big] = 0 \quad \text{for } j = 1,\ldots,p(K+1), \]

and

\[ I_{jk}(\theta) = E_\theta\Big[\frac{\partial \log f(V,\theta)}{\partial \theta_j}\,\frac{\partial \log f(V,\theta)}{\partial \theta_k}\Big] = E_\theta\Big[-\frac{\partial^2 \log f(V,\theta)}{\partial \theta_j \partial \theta_k}\Big] \]

(R2) The Fisher information matrix

\[ I(\theta) = E\Big[\frac{\partial \log f(V,\theta)}{\partial \theta}\,\frac{\partial \log f(V,\theta)}{\partial \theta}^T\Big] \]

is finite and positive definite at $\theta = \theta^*$.

(R3) There exists an open set $\omega$ of $\Omega$ that contains the true parameter point $\theta^*$ such that, for almost all $V$, the density $f(V,\theta)$ admits all third derivatives $\partial^3 f(V,\theta)/(\partial \theta_j \partial \theta_k \partial \theta_l)$ for all $\theta \in \omega$ and any $j,k,l = 1,\ldots,p(K+1)$. Furthermore, there exist functions $M_{jkl}$ such that

\[ \Big|\frac{\partial^3}{\partial \theta_j \partial \theta_k \partial \theta_l} \log f(V,\theta)\Big| \leq M_{jkl}(V) \quad \text{for all } \theta \in \omega, \]

where $m_{jkl} = E_{\theta^*}[M_{jkl}(V)] < \infty$.

These regularity conditions are: the existence of a common support and of first and second derivatives for $f(V,\theta)$; the Fisher information matrix being finite and positive definite; and the existence of a bounded third derivative for $f(V,\theta)$. These regularity conditions guarantee the asymptotic normality of the ordinary maximum likelihood estimates (89).

Theorem 7.6.1. Assume $a_n = o(1)$ as $n \to \infty$. Then, under regularity conditions (R1)-(R3), there exists a local minimizer $\hat{\theta}_n$ of $Q_n(\theta)$ such that $\|\hat{\theta}_n - \theta^*\| = O_P(n^{-1/2} + a_n)$.

Remark. Theorem 7.6.1 implies that when the tuning parameters associated with the non-zero coefficients of the grouped main effects and grouped pairwise interactions tend to 0 at a rate faster than $n^{-1/2}$, then there exists a local minimizer of $Q_n(\theta)$ which is $\sqrt{n}$-consistent (the sampling error is $O_p(n^{-1/2})$).

Theorem 7.6.2. Assume $\sqrt{n}\,a_n \to 0$, $\sqrt{n}\,b_n \to \infty$ and $P(\hat{\theta}_{\mathcal{A}^c} = 0) \to 1$. Then, under the regularity conditions (R1)-(R3), the component $\hat{\theta}_{\mathcal{A}}$ of the local minimizer $\hat{\theta}_n$ (given in Theorem 7.6.1) satisfies

\[ \sqrt{n}(\hat{\theta}_{\mathcal{A}} - \theta^*_{\mathcal{A}}) \to_d N(0, I^{-1}(\theta^*_{\mathcal{A}})), \]

where $I(\theta^*_{\mathcal{A}})$ is the Fisher information matrix of $\theta_{\mathcal{A}}$ at $\theta_{\mathcal{A}} = \theta^*_{\mathcal{A}}$, assuming that $\theta^*_{\mathcal{A}^c} = 0$ is known in advance.

Remark.
Theorem 7.6.2 shows that our model estimates the non-zero coefficients of the true model with the same asymptotic distribution as if the zero coefficients were known in advance. Based on Theorems 7.6.1 and 7.6.2, we can say that our Group FHIM estimator has the oracle property, i.e., it is asymptotically optimal, namely unbiased and efficient, when the tuning parameters satisfy the conditions $\sqrt{n}\,a_n \to 0$ and $\sqrt{n}\,b_n \to \infty$. To satisfy these conditions, we have to consider adaptive weights $w^\beta_j, w^{\alpha_k}_l$ (163) for our tuning parameters $\lambda^\beta, \lambda^{\alpha_k}$. Thus, our tuning parameters are:

\[ \lambda^\beta_j = \frac{\log n}{n}\lambda^\beta w^\beta_j, \qquad \lambda^{\alpha_k}_l = \frac{\log n}{n}\lambda^{\alpha_k} w^{\alpha_k}_l \]

Note. Please see the Appendix for the proofs of the above theorems.

7.6.2 Properties of Overlapping Group FHIM

Lemma 7.6.3. $\beta \mapsto \Omega^{\mathcal{G}}_{overlap}(\beta)$ is a norm.

Proof: Lemma 1 of (69).

Overlapping Group FHIM (OvGroup FHIM) can be realized using a non-overlapping Group FHIM. Let us form $\tilde{X} \in \mathbb{R}^{n \times \sum|g|}$ by the concatenation of copies of the design matrix $X \in \mathbb{R}^{n \times p}$ restricted to each group $g$, i.e., $\tilde{X} = [X_{g_1}, \ldots, X_{g_{|\mathcal{G}|}}]$ with $\mathcal{G} = \{g_1, \ldots, g_{|\mathcal{G}|}\}$; and let $\tilde{v} = (\tilde{v}^T_{g_1}, \ldots, \tilde{v}^T_{g_{|\mathcal{G}|}})$, i.e., $\tilde{v} \in \mathbb{R}^{\sum|g|}$ with $\tilde{v}_g = (v_{g_i})_{i \in g}$. Let the empirical risks of OvGroup FHIM and of the equivalent Group FHIM be represented by $R(\cdot)$ and $\tilde{R}(\cdot)$ respectively; therefore, $R(a_k) = \tilde{R}(X^T a_k a_k^T X)$ and $R(\beta) = \tilde{R}(X\beta)$ respectively.

Theorem 7.6.4. (i) $R(a_k) = \tilde{R}(\tilde{X}^T \tilde{v}_g \tilde{v}_g^T \tilde{X})$ and (ii) $R(\beta) = \tilde{R}(\tilde{X} \tilde{v}_g)$.

Proof: We prove (i) here; the proof of (ii) is similar.

\[ R(a_k) = \tilde{R}(X^T W X) = \tilde{R}(X^T a_k a_k^T X) = \tilde{R}\Big(X^T \Big(\sum_g a_{kg}\Big)\Big(\sum_g a_{kg}\Big)^T X\Big) = \tilde{R}(\tilde{X}^T \tilde{v}_g \tilde{v}_g^T \tilde{X}) \]

Remark. Theorem 7.6.4 shows that the empirical risk minimization of Overlapping Group FHIM is the same as that of an expanded non-overlapping Group FHIM, i.e., the OvGroup FHIM optimization can be solved via an equivalent expanded Group FHIM optimization problem. This result is used in the implementation of OvGroup FHIM for our experiments.
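As an implementation note, this expansion can be sketched as follows (our own illustration, with assumed names): each group $g$ contributes its own copy of the corresponding columns of $X$, yielding a non-overlapping group structure over the expanded design matrix $\tilde{X}$.

```python
# A minimal sketch of the group-expansion used in Theorem 7.6.4.
import numpy as np

def expand_overlapping_groups(X, groups):
    """X: (n, p) design matrix; groups: list of index lists (may overlap).
    Returns X_tilde = [X_g1, ..., X_g|G|] and the induced non-overlapping
    group structure over the expanded columns."""
    X_tilde = np.hstack([X[:, g] for g in groups])
    new_groups, start = [], 0
    for g in groups:
        new_groups.append(list(range(start, start + len(g))))
        start += len(g)
    return X_tilde, new_groups
```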
7.7 Experiments

We use synthetic and real datasets to demonstrate the performance of our Group FHIM and OvGroup FHIM models, and compare them with LASSO (129), Hierarchical LASSO (23), Group Lasso (150), Trace-norm (74), the Dirty model (70), QUIRE (102) and FHIM (113). We use 80% of each dataset for training and 20% for testing, and 20% of the training data as a validation set to find the optimal tuning parameters. We search the tuning parameters for all methods using grid search; for our model, the parameters $\lambda_\beta$ and $\lambda_{a_k}$ are searched in the range [0.01, 100]. Here we report our results over 5 simulations.

7.7.1 Optimization Settings

Initialization, warm start, and the stopping criterion play an important role in our greedy alternating optimization algorithm (Algorithm 5). Below, we discuss how we choose them for our optimization.

Initialization and Warm Start: We initialize $a_k$ with $\mathbf{1}$ and $\beta$ with the $\beta$ obtained from LASSO or $\ell_1$ logistic regression. We use a warm start for $\beta$ and $a_k$ during the optimal parameter grid search. Our experiments on the synthetic dataset showed that this initialization and warm start strategy results in faster convergence and gives better performance on the validation data.

Stopping Criterion for Greedy Optimization: The stopping criterion of our alternating optimization algorithm directly affects the convergence rate and the local optimum reached. Due to the non-convexity of our optimization problem, a decrease in the objective function's value may not correspond to a decrease in the loss function's value. Thus, we introduce a new stopping criterion in which our optimization algorithm keeps track of the decrease in both the loss function's value and the objective function's value, along with the best loss function value, across iterations. During the greedy estimation of $K$, we stop the addition of a new $a_k$ if and only if the loss value of $a_k$ is larger than the loss value of $a_{k-1}$ for $k > 1$. When the stopping criterion is met, the parameters associated with the best loss function value are chosen as the parameters of the local optimum.

7.7.2 Datasets

We use synthetic datasets and 3 real datasets for classification and support recovery experiments. We give a detailed description of these datasets below.

7.7.2.1 Synthetic Dataset

We generate the features of the design matrix $X$ using a normal distribution with mean zero and variance one ($\mathcal{N}(0,1)$). $\beta$ and $a_k$ were generated as $s$-sparse vectors from $\mathcal{N}(0,1)$, where $s$ is chosen as 5-10% of $p$ and the number of groups is $|\mathcal{G}| \in [10, 50]$. The group interaction weight matrix $W_{OD}$ was generated using equation (7.7) for $K \in [1, 5]$. The response vector $y$ was generated for the logistic and large-margin formulations with a noise factor of 0.01. We generated several synthetic datasets by varying $n, p, K, |\mathcal{G}|$ and $s$. Note that we denote the combined total number of features (that is, main effects + pairwise interactions) by $q$; here $q = p(p+1)/2$. In this chapter, we show results for synthetic data in two settings: Case 1) $n > p$ and $q > n$ (high-dimensional setting w.r.t. the interaction features), and Case 2) $p > n$ (high-dimensional setting w.r.t. the original features).
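A minimal sketch of this synthetic-data generation follows (our own Python; the function name and the exact label-generation details are assumptions, while the sparsity level and the 0.01 noise factor are taken from the text):

```python
# A minimal sketch of the synthetic-data generation described above.
import numpy as np

def make_synthetic(n=1000, p=100, K=2, sparsity=0.05, noise=0.01, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                  # features ~ N(0, 1)
    s = max(1, int(sparsity * p))                    # s-sparse supports

    def sparse_vec():
        v = np.zeros(p)
        idx = rng.choice(p, size=s, replace=False)
        v[idx] = rng.standard_normal(s)
        return v

    beta = sparse_vec()
    A = np.vstack([sparse_vec() for _ in range(K)])  # rank-one factors a_k
    W = A.T @ A
    W_OD = W - np.diag(np.diag(W))                   # as in Equation (7.7)
    margin = X @ beta + np.einsum('ij,jk,ik->i', X, W_OD, X)
    y = np.where(margin + noise * rng.standard_normal(n) >= 0, 1, -1)
    return X, y, beta, W_OD
```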
7.7.2.2 Real Datasets

To assess the performance of our model, we tested our methods on three prediction tasks:

1. Classification on RCC samples: This dataset contains 213 RCC samples from Benign and 4 different stages of tumor. Expression levels of 1092 proteins are collected in this dataset, and these 1092 proteins belong to 341 (overlapping) groups. The numbers of Benign, Stage 1, Stage 2, Stage 3 and Stage 4 tumor samples are 40, 101, 17, 24 and 31 respectively.

2. Gene Expression Prediction: This dataset (36) has 157 ChIP-Seq signals for transcription factor bindings and chromatin modifications, and 1000 samples of gene transcripts. The features were grouped into 101 non-overlapping groups based on prior knowledge about the ChIP-Seq experimental setup. For example, different ChIP-Seq experiments under different conditions or treatments for the same transcription factor are grouped into the same group.

3. Peptide-MHC I Binding Prediction: This dataset (83) is listed in Table 7.1. There are 9 positional groups (non-overlapping) in this dataset. Each positional group contains 20 features, which are substitution log-odds from BLOSUM62 for the amino acid at that position.

Remark. The RCC dataset was requested from the authors of (102). The ChIP-Seq data is publicly available at http://genome.ucsc.edu/ENCODE/downloads.html. The Peptide-MHC I Binding dataset consists of publicly available data from the Immune Epitope Database and Analysis Resource (IEDB) (133), which was used for training, and privately collected data from our research collaborators, which was used for testing.

7.7.3 Experimental Design and Evaluation Metrics

For synthetic data, we evaluate the performance of our methods using prediction error and support recovery experiments. For the real datasets, we perform the following evaluations:

1. RCC Classification: We perform three stage-wise binary classification experiments using the RCC samples:
(a) Case 1: Benign samples vs. Stage 1-4.
(b) Case 2: Benign and Stage 1 vs. Stage 2-4.
(c) Case 3: Benign, Stage 1, 2 vs. Stage 3, 4.

2. ChIP-Seq Gene Expression Classification: We perform two binary classification experiments: Case 1) predict gene expression levels as low or high; Case 2) predict whether genes are expressed or not.

3. Peptide-MHC I Binding Prediction: We predict binding peptides versus non-binding peptides for three alleles, HLA-A*0201, HLA-A*0206 and HLA-A*2402.

For evaluation metrics, we use 1) the F1-measure for the support recovery of $W_{OD}$ (synthetic data) and 2) the area under the ROC curve (AUC) for classification (synthetic and real data).

Dataset           #Peptides   #Binders   #Non-binders
A0201-IEDB        8471        3939       8532
A0201-Japanese    114         59         55
A0206-IEDB        1820        951        869
A0206-Japanese    81          33         48
A2402-IEDB        2011        890        1121
A2402-Japanese    167         125        42

Table 7.1: Peptide-MHC I binding datasets

        ℓ1 Logistic Reg.   Group Lasso   H.Lasso   FHIM   Group FHIM (Logistic Loss)
q > n   0.52               0.74          0.58      0.89   0.97
p > n   0.51               0.52          -         0.54   0.62

Table 7.2: ROC scores on synthetic data with non-overlapping groups: case 1) q = 5100, n = 1000, |G| = 25; case 2) p = 250, n = 100, |G| = 25. Note: we were not able to run the HLasso R package for case 2 due to its high computational complexity.

        ℓ1 Logistic Reg.   Overlap Group Lasso   H.Lasso   FHIM   OvGroup FHIM (Hinge Loss)
q > n   0.54               0.67                  0.56      0.69   0.81
p > n   0.53               0.58                  -         0.57   0.64

Table 7.3: ROC scores on synthetic data with overlapping groups: case 1) q = 5,100, n = 1,000, |G| = 10; case 2) p = 250, n = 100, |G| = 10.

[Figure 7.1: Support recovery of $W_{OD}$ (95% sparse) for synthetic data with q = 5100, n = 1000; the two panels show the original W and the estimated W.]

7.7.4 Performance on the Synthetic Dataset

Tables 7.2 and 7.3 show that our Group FHIM and OvGroup FHIM outperform state-of-the-art approaches such as $\ell_1$ logistic regression, Group Lasso (150), Hierarchical Lasso (23) and FHIM (113). These models (except $\ell_1$ logistic regression) were chosen for comparison because they are the state-of-the-art approaches that can recover grouping structure or high order feature interactions. Figure 7.1 shows an example of the support recovery of $W_{OD}$ for the $q > n$ setting. From this figure, we see that our model performs very well (i.e., the F1 score is close to 1). For the $p > n$ setting, our model also performs fairly well in the support recovery of $W_{OD}$.
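For reference, the F1 support-recovery score for $W_{OD}$ reported above can be computed as in the following minimal sketch (our own, with an assumed numerical tolerance for deciding which entries count as non-zero):

```python
# A minimal sketch of the F1 support-recovery metric for W_OD: compare the
# non-zero pattern of the estimated matrix against the true one.
import numpy as np

def support_f1(W_true, W_hat, tol=1e-6):
    t = np.abs(W_true) > tol          # true support
    h = np.abs(W_hat) > tol           # recovered support
    tp = np.sum(t & h)
    precision = tp / max(np.sum(h), 1)
    recall = tp / max(np.sum(t), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```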
7.7.5 Classification Performance on RCC Samples

In this section, we report systematic experimental results on the classification of samples from different stages of RCC. This dataset does not have grouping information for the proteins. In order to group the proteins, we use the web based tool "Database for Annotation, Visualization, and Integrated Discovery" (DAVID, http://david.abcc.ncifcrf.gov/). There is a set of parameters that can be adjusted in DAVID, based on which the functional classification is done. This whole set of parameters is controlled by a higher level parameter, "Classification Stringency", which determines how tight the resulting groups are in terms of the association of the genes in each group. We set the stringency level to "Medium", which results in balanced functional groups where the association of the genes is moderately tight. The total number of groups based on cellular component annotations for RCC is 56. Each ungrouped gene forms a separate group, and in total we have 341 overlapping groups.

The predictive performance of the biomarkers and pairwise group interactions selected by our OvGroup FHIM model (hinge loss) is compared against the markers selected by Lasso, All-Pairs Lasso (23), Group Lasso, the Dirty model (70), QUIRE and FHIM. We use the SLEP (93) and MALSAR (160) packages for the implementation of most of these models; the QUIRE and FHIM codes were obtained from the authors. The overall performance of the algorithms is shown in Figure 7.2. In this figure, we report the average AUC score over five runs of 5-fold cross validation experiments for cancer stage prediction in RCC. The average ROC scores achieved by the feature groups selected with our model are 0.72, 0.93 and 0.95 respectively for the three cases discussed in Section 7.7.3. We performed pairwise t-tests for the comparisons of our method vs. the other methods, and all p-values were below 0.0075, which shows that our results are statistically significant. From Figure 7.2, we see that our model outperforms all the other algorithms for the three classification cases of RCC prediction and performs similarly to the well-known biomarker STC1. Interestingly, our OvGroup FHIM did not find any feature group interactions (i.e., $a_k = 0$) for the RCC dataset, and the feature groups (of $\beta_g$) found by our model correspond to the two groups containing STC1.

[Figure 7.2: Comparison of the classification performance (AUC) of AP-Lasso, Lasso, Trace-norm, H.Lasso, the Dirty Model, QUIRE, FHIM, the STC1 biomarker and our OvGroup FHIM in identifying the different stages of RCC.]

7.7.6 Gene Expression Prediction from ChIP-Seq Signals

For case 1, gene expression measured by Cap Analysis of Gene Expression (CAGE) from the ENCODE project (36) above 3.0 (the median of the nonzero gene expression levels) is considered high, while gene expression between 0 and 3.0 is considered low for the classification experiments; for case 2, the genes with nonzero expression levels are considered expressed and the others non-expressed. Table 7.4 shows the gene expression prediction results for these two classification experiments. We observe that our Group FHIM outperforms all the state-of-the-art models, including Group $\ell_1$ logistic regression and FHIM. Moreover, our model discovers biologically meaningful ChIP-Seq signal interactions, which are discussed in Section 7.7.6.1.

         ℓ1 Logistic Reg.   Group ℓ1 Logistic Reg.   FHIM   Group FHIM
Case 1   0.74               0.90                     0.82   0.92
Case 2   0.72               0.89                     0.80   0.91

Table 7.4: Gene Expression Prediction from ChIP-Seq signals

7.7.6.1 Feature Group Interactions Discovered by Group FHIM

An investigation of the interactions identified by our Group FHIM on the ChIP-Seq dataset reveals that many of these interactions are indeed relevant for gene expression.

[Figure 7.3: Interpretable interactions identified by OvGroup FHIM for predicting gene expression from ChIP-Seq signals, involving POL2, MYC, histone modifications, SETDB1, CTCF and YY1.]

Figure 7.3 shows 6 out of the top 7 group interactions for the Case
Among these group interactions, POL2 catalyzes DNA transcription and synthesizes mRNAs and most of small non-coding RNAs, and many transcription factors require its binding to gene promoters to begin gene transcription; MYC is knowntorecruithistonemodificationstoactivategeneexpression; YY1isknownto interact with histone modifications to activate or repress gene expression; SETDB1 regulates histone modifications to repress gene expression; CTCF is an insulator, its binding to MYC locus prevents the expression of MYC to be altered by DNA methylation, anditregulateschromatinstructureforwhichitsgroupalsoappeared in the dicriminative ones identified by our model. Further investigations of the interactions identified by our Group FHIM model might reveal novel insights that will help us to better understand gene regulation. 209 Alleles ` 1 Logistic Group` 1 FHIM Group Reg. Logistic Reg. FHIM A0201 0.74 0.72 0.72 0.80 A0206 0.76 0.75 0.68 0.79 A2402 0.83 0.77 0.75 0.82 Table 7.5: Peptide-MHC I binding prediction AUC scores 7.7.7 Peptide-MHC I Binding Prediction Table 7.5 shows the comparison of peptide-MHC I binding prediction of our model with respect to the state-of-the-art ` 1 and Group` 1 logistic regression and FHIM. Figure 7.5 shows the ROC curves of Group FHIM and Group` 1 logistic regression for Allele 0206. As evident from the AUC scores and ROC curve plots, our method achieves significant improvement over Group` 1 logistic regression in separating the ‘binders’ from ‘non-binders’. We found that` 1 logistic regression gave slightly bet- ter performance on A2402, but our model identified meaningful group interactions as discussed below. Group ` 1 logistic regression produces worse performance than ` 1 logistic regression, which shows that only using grouping information does not help to identify discriminative individual features. However, our model Group FHIM significantly outperforms FHIM, which demonstrates the effectiveness of modeling both grouping information and high-order feature interactions. Figure 7.4 shows the factorized rank-1 interaction weight vector with absolute values greater than 0.1. This feature shows that the positions 2,5,6,9 interact; and moreover the interaction between the middle position and the position 9 is very important for predicting 9-mer peptide binding, which has experimental support from the crystal structure of the interaction complex (43). We also found positions 2 and 9 interact for Alleles A0201 and A0206. 210 Feature Index Interaction Feature Co−efficient 20 40 60 80 100 120 140 160 0.5 1 1.5 −0.5 0 0.5 1 1.5 Figure 7.4: Interaction feature factor coefficients for A2402 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 FPR TPR Allele A0206 Group L1 Log Reg (Ours) Group FHIM Figure 7.5: ROC curves for A0206 7.7.8 Computational Time Analysis Group FHIM takes more time for convergence than the state-of-the-art approaches (LASSO and ` 1 logistic regression) since we do multiple rounds of greedy alter- nating optimization forβ anda k . For q >n setting with n = 1000,p = 100,q = 5100,|G| = 25, our optimization method on Matlab takes around∼ 5 minutes to converge for fixed parameter, while for p > n with p = 250, n = 100, our Group FHIM model takes around∼ 10 mins to converge. Our experiments were run on intel i3 dual-core 2.9GHz CPU with 8 GB RAM. 
7.8 Conclusions

In this chapter, we proposed a knowledge-based sparse learning framework called Group FHIM for identifying discriminative high-order feature group interactions in logistic regression and large-margin models, and we studied interesting theoretical properties of our model. Empirical experiments on synthetic and real datasets showed that our model outperforms several well-known and state-of-the-art sparse learning techniques such as Lasso, $\ell_1$ logistic regression, Group Lasso, Hierarchical Lasso and FHIM, and it achieves comparable or better performance compared to state-of-the-art knowledge based approaches such as QUIRE. Our model identifies high-order positional group interactions for peptide-MHC I binding prediction, and it discovers important group interactions such as POL2-MYC, YY1-histone modifications, MYC-histone modifications, and CTCF-MYC, which are valuable for understanding gene transcriptional regulation. In the future, we will consider the following directions: (i) we will consider a factorization of the weight matrix $W$ as $W = \sum_k a_k b_k^T$, since it is more general and can capture non-symmetric $W$; (ii) we will extend our optimization methods from the single-task learning to the multi-task learning framework; (iii) we will investigate how to find and interpret much higher order feature interactions.

7.9 Appendix

7.9.1 Proofs for Section 7.6

Here, we present the proofs of Theorems 7.6.1 and 7.6.2. Please refer to Section 7.6 for the regularity conditions (R1)-(R3) and the statements of these theorems.

Proof of Theorem 7.6.1: Let $\eta_n = n^{-1/2} + a_n$ and let $\{\theta^* + \eta_n \delta : \|\delta\|_2 \leq d\}$ be the ball around $\theta^*$, where $\delta = (u_1, \ldots, u_p, v_{11}, \ldots, v_{Kp})^T = (u^T, v^T)^T$. Define

\[ D_n(\delta) \equiv Q_n(\theta^* + \eta_n \delta) - Q_n(\theta^*) \]

where $Q_n(\theta^*)$ is defined in equation (7.9). For $\delta$ that satisfies $\|\delta\| = d$, we have

\[ D_n(\delta) = -L_n(\theta^* + \eta_n \delta) + L_n(\theta^*) + n \sum_h \lambda^\beta_h \big(\|\beta^*_h + \eta_n u_h\|_2 - \|\beta^*_h\|_2\big) + n \sum_{k,h'} \lambda^{\alpha_k}_{h'} \big(\|\alpha^*_{k,h'} + \eta_n v_{kh'}\|_2 - \|\alpha^*_{k,h'}\|_2\big) \]
\[ \geq -L_n(\theta^* + \eta_n \delta) + L_n(\theta^*) + n \sum_{h \in \mathcal{A}_1} \lambda^\beta_h \big(\|\beta^*_h + \eta_n u_h\|_2 - \|\beta^*_h\|_2\big) + n \sum_{(k,h') \in \mathcal{A}_2} \lambda^{\alpha_k}_{h'} \big(\|\alpha^*_{k,h'} + \eta_n v_{kh'}\|_2 - \|\alpha^*_{k,h'}\|_2\big) \]
\[ \geq -L_n(\theta^* + \eta_n \delta) + L_n(\theta^*) - n \eta_n \sum_{h \in \mathcal{A}_1} \lambda^\beta_h \|u_h\|_2 - n \eta_n \sum_{(k,h') \in \mathcal{A}_2} \lambda^{\alpha_k}_{h'} \|v_{kh'}\|_2 \]
\[ \geq -L_n(\theta^* + \eta_n \delta) + L_n(\theta^*) - n \eta_n^2 \Big(\sum_{h \in \mathcal{A}_1} \|u_h\|_2 + \sum_{(k,h') \in \mathcal{A}_2} \|v_{kh'}\|_2\Big) \]
\[ \geq -L_n(\theta^* + \eta_n \delta) + L_n(\theta^*) - n \eta_n^2 \big(\|u_h\|_2 |\mathcal{A}_1| + \|v_{kh'}\|_2 |\mathcal{A}_2|\big) \]
\[ = -[\nabla L_n(\theta^*)]^T (\eta_n \delta) - \frac{1}{2}(\eta_n \delta)^T [\nabla^2 L_n(\theta^*)] (\eta_n \delta)(1 + o_p(1)) - n \eta_n^2 \big(\|u_h\|_2 |\mathcal{A}_1| + \|v_{kh'}\|_2 |\mathcal{A}_2|\big) \]

where we used Taylor's expansion in the last step. We split the above into three parts and get:

\[ K_1 = -\eta_n [\nabla L_n(\theta^*)]^T \delta = -\sqrt{n}\,\eta_n \Big(\frac{1}{\sqrt{n}} \nabla L_n(\theta^*)\Big)^T \delta = -O_p(n \eta_n^2)\,\delta \]
\[ K_2 = \frac{1}{2} n \eta_n^2 \Big\{\delta^T \Big[-\frac{1}{n}\nabla^2 L_n(\theta^*)\Big]\delta\,(1 + o_p(1))\Big\} = \frac{1}{2} n \eta_n^2 \big\{\delta^T [I(\theta^*)]\,\delta\,(1 + o_p(1))\big\} \]
\[ K_3 = -n \eta_n^2 \big(\|u_h\|_2 |\mathcal{A}_1| + \|v_{kh'}\|_2 |\mathcal{A}_2|\big) \]

Thus,

\[ D_n(\delta) \geq K_1 + K_2 + K_3 = -O_p(n \eta_n^2)\,\delta + \frac{1}{2} n \eta_n^2 \big\{\delta^T [I(\theta^*)]\,\delta\,(1 + o_p(1))\big\} - n \eta_n^2 \big(\|u_h\|_2 |\mathcal{A}_1| + \|v_{kh'}\|_2 |\mathcal{A}_2|\big) \]

We see that $K_2$ dominates the remaining terms and is positive, since $I(\theta)$ is positive definite at $\theta = \theta^*$ by regularity condition (R2). Therefore, for any given $\epsilon > 0$, there exists a large enough constant $d$ such that

\[ P\Big\{\inf_{\|\delta\| = d} Q_n(\theta^* + \eta_n \delta) > Q_n(\theta^*)\Big\} \geq 1 - \epsilon \]

This implies that, with probability at least $1 - \epsilon$, there exists a local minimizer in the ball $\{\theta^* + \eta_n \delta : \|\delta\|_2 \leq d\}$.
Thus, there exists a local minimizer of $Q_n(\theta)$ such that $\|\hat{\theta}_n - \theta^*\| = O_p(\eta_n)$.

Proof of Theorem 7.6.2: Let $Q_n(\theta_{\mathcal{A}})$ denote the objective function $Q_n$ restricted to the $\mathcal{A}$-component of $\theta$, that is, $Q_n(\theta)$ with $\theta_{\mathcal{A}^c} = 0$. Based on Theorem 7.6.1, and if we assume $P(\hat{\theta}_{\mathcal{A}^c} = 0) \to 1$ ((103)), then we have

\[ P\Big(\operatorname*{argmin}_{\theta_{\mathcal{A}}} Q_n(\theta_{\mathcal{A}}) = \big(\mathcal{A}\text{-component of } \operatorname*{argmin}_\theta Q_n(\theta)\big)\Big) \to 1 \]

Thus, $\hat{\theta}_{\mathcal{A}}$ should satisfy

\[ \frac{\partial Q_n(\theta_{\mathcal{A}})}{\partial \theta_j}\Big|_{\theta_{\mathcal{A}} = \hat{\theta}_{\mathcal{A}}} = 0 \quad \forall j \in \mathcal{A} \qquad (7.12) \]

with probability tending to 1. Let $L_n(\theta_{\mathcal{A}})$ and $P_\lambda(\theta_{\mathcal{A}})$ denote the log-likelihood function of $\theta_{\mathcal{A}}$ and the penalty function of $\theta_{\mathcal{A}}$ respectively, so that we have $Q_n(\theta_{\mathcal{A}}) = -L_n(\theta_{\mathcal{A}}) + n P_\lambda(\theta_{\mathcal{A}})$. From (7.12), we have

\[ \nabla_{\mathcal{A}} Q_n(\hat{\theta}_{\mathcal{A}}) = -\nabla_{\mathcal{A}} L_n(\hat{\theta}_{\mathcal{A}}) + n \nabla_{\mathcal{A}} P_\lambda(\hat{\theta}_{\mathcal{A}}) = 0, \qquad (7.13) \]

with probability tending to 1. Now, by Taylor expansion of the first and second terms at $\theta_{\mathcal{A}} = \theta^*_{\mathcal{A}}$, we get the following:

\[ -\nabla_{\mathcal{A}} L_n(\hat{\theta}_{\mathcal{A}}) = -\nabla_{\mathcal{A}} L_n(\theta^*_{\mathcal{A}}) - [\nabla^2_{\mathcal{A}} L_n(\theta^*_{\mathcal{A}}) + o_p(1)](\hat{\theta}_{\mathcal{A}} - \theta^*_{\mathcal{A}}) = \sqrt{n}\Big\{-\frac{1}{\sqrt{n}}\nabla_{\mathcal{A}} L_n(\theta^*_{\mathcal{A}}) + I(\theta^*_{\mathcal{A}})\sqrt{n}(\hat{\theta}_{\mathcal{A}} - \theta^*_{\mathcal{A}}) + o_p(1)\Big\} \]

\[ n \nabla_{\mathcal{A}} P_\lambda(\hat{\theta}_{\mathcal{A}}) = n \Big(\lambda^\beta_h \frac{\beta_h}{\|\beta_h\|_2},\; \lambda^{\alpha_k}_{h'} \frac{\alpha_{h'}}{\|\alpha_{h'}\|_2}\Big)_{h \in \mathcal{A}_1, (k,h') \in \mathcal{A}_2} + o_p(1)(\hat{\theta}_{\mathcal{A}} - \theta^*_{\mathcal{A}}) = \sqrt{n}\, o_p(1) \]

since $\sqrt{n}\,a_n = o(1)$ and $\|\hat{\theta}_{\mathcal{A}} - \theta^*_{\mathcal{A}}\| = O_p(n^{-1/2})$. Thus, we get

\[ 0 = \sqrt{n}\Big\{-\frac{1}{\sqrt{n}}\nabla_{\mathcal{A}} L_n(\theta^*_{\mathcal{A}}) + I(\theta^*_{\mathcal{A}})\sqrt{n}(\hat{\theta}_{\mathcal{A}} - \theta^*_{\mathcal{A}}) + o_p(1)\Big\} \]

Therefore, by the central limit theorem,

\[ \sqrt{n}(\hat{\theta}_{\mathcal{A}} - \theta^*_{\mathcal{A}}) \to_d N(0, I^{-1}(\theta^*_{\mathcal{A}})) \]

Chapter 8

Conclusions and Future Work

8.1 Summary of the Research

8.1.1 Social Multimedia Retrieval and Recommendation

Multimedia copy retrieval and alignment, and content recommendation, are challenging research problems that have become prominent in ever-growing social media networks. In this thesis, we have developed novel and efficient techniques to address these problems, and we showed how social network information can play an important role in addressing them. A brief summary of our research is given below.

In Chapter 3, we discussed several novel ideas for solving the partial near-duplicate video copy detection and alignment problem. We first proposed a generalized spatial coding scheme and a novel spatial verification algorithm for partial near-duplicate video copy detection. Then, we presented experimental results showing that the proposed spatial verification techniques are robust to challenging video transformations such as PiP, and that they provide better storage and computational cost compared to several state-of-the-art techniques. Furthermore, we formulated the partial near-duplicate video alignment problem as an efficient subsequence matching problem and provided efficient algorithms based on the longest common subsequence technique. Experimental results showed that the proposed video alignment algorithms achieve high alignment accuracy for short video sequences at a reasonable computational cost. Finally, we discussed a novel modeling framework to incorporate social network structure information into the video copy detection system. Some of the material in this chapter appeared in (114).

In Chapter 4, we presented a novel probabilistic graphical modeling framework that exploits user social network information and item content information in recommending items to users. The proposed model seamlessly incorporated topic modeling and social network information into the collaborative filtering (CF) framework. We derived mathematical equations to learn user and item latent vectors, which were then used to predict user-item ratings.
Experimental results on real-world datasets revealed interesting insights into how social circles can have more influence on people's decisions about the usefulness of information than their personal taste. We showed how the model can provide human interpretable recommendations by examining the user latent space. Finally, we identified a potential information leak phenomenon that can occur in recommendation systems that utilize social network information for recommendations. Some of the material in this chapter appeared in (112).

In Chapter 5, we proposed a class of collaborative filtering based Bayesian models that can personalize recommendations to a group of users. Our novel framework models group dynamics such as user-user interactions, user-group membership and user influence, and automatically infers the group preferences. Experiments on Location-Based Social Networks (LBSN) and Event-Based Social Networks (EBSN) showed that our modeling framework impressively outperforms state-of-the-art group recommenders and provides interpretable recommendations. Our group recommendation framework is general and can easily incorporate additional information into the model for real-time performance. Some of the material in this chapter appeared in (109), (110).

8.1.2 Sparse Learning Models

A major challenge in biomarker discovery and personalized medicine is to identify gene/protein interactions and their relations with other physical factors in medical records to predict the health status of patients. In this thesis, we addressed the above challenge by developing scalable sparse learning models which can incorporate domain knowledge into the model construction process. A brief summary of this research is given below.

In Chapter 6, we developed a novel factorized sparse learning framework based on weight matrix factorizations and $\ell_1$ regularization (called FHIM) for identifying discriminative high-order feature interactions in regression models, and we showed interesting theoretical properties of the same. Experimental results on synthetic and real-world datasets showed that our sparse learning framework outperforms the state-of-the-art sparse learning techniques, and it provides 'interpretable' blockwise high-order feature interactions for blood-based cancer diagnosis. Our proposed sparse learning framework is quite general, and can be used to identify any discriminative complex system input interactions that are predictive of system outputs given limited high-dimensional training data.

In Chapter 7, we generalized the sparse learning framework introduced in Chapter 6 with the following new and significant contributions: (1) we showed how to incorporate domain knowledge into the sparse learning framework using a knowledge-based factorization technique and regularization penalties; (2) we showed that our model (Group FHIM) retains nice theoretical properties similar to FHIM; (3) we showed state-of-the-art results for gene expression prediction and peptide-MHC I binding prediction tasks, showcasing the advantage of capturing prior information in our sparse learning framework. Some of the material in Chapters 6 and 7 appeared in (113).

8.2 Future Research Topics

The following topics will be studied during future research.

1. Incorporating social network information for video copy detection and alignment
We would like to examine whether social network information, such as a user's neighbourhood on the social graph, could be useful for multimedia copy detection and alignment. We briefly discussed initial thoughts and ideas on this problem in Section 3.11. Incorporating social network information into the video copy detection framework might be useful for addressing related interesting problems such as tracking content consumption by users and their friends, identifying the interesting parts of a video based on content and social interactions, and content-based information diffusion in social networks.

2. Incorporating a limited attention model in social recommendation

Users of social networks have limited attention, which constrains their ability to process all recommended items (updates) from friends. Moreover, users tend to pay attention to only some friends, i.e., they divide their attention non-uniformly over their friends and interests. Thus, we plan to incorporate the limited, non-uniformly divided attention model of social media users into our social recommendation system.

3. Distributed factorized sparse learning models

Recent advances in distributed optimization algorithms have made them very attractive for solving big data problems. Our factorized sparse learning models (FHIM and Group FHIM) are global consensus optimization problems, which can be solved in a distributed manner by the popular alternating direction method of multipliers (ADMM) algorithm. In our future work, we plan to develop distributed algorithms for our factorized sparse learning models to make the discovery of high order feature interactions much faster, especially in very high-dimensional settings.

Reference List

[1] FFMPEG. http://www.ffmpeg.org/.

[2] INRIA IMEDIA. http://vitalas.ina.fr:8081/public/Trecvideo2008-copydetection-software.tgz.

[3] MSRA. http://research.microsoft.com/en-us/projects/msrammdata/.

[4] TRECVID 2010. http://www-nlpir.nist.gov/projects/tv2010/.

[5] TRECVID 2010 CONTENT-BASED COPY DETECTION TASK. http://www-nlpir.nist.gov/projects/tv2010/tv2010.html#ccd.

[6] Agrawal, R., Faloutsos, C., and Swami, A. Efficient similarity search in sequence databases. Springer, 1993.

[7] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215, 3 (1990), 403–410.

[8] Andrew, G., and Gao, J. Scalable training of l1-regularized log-linear models. In Proceedings of the 24th international conference on Machine learning (2007), ACM, pp. 33–40.

[9] Anonymous. ———————. In —— (2014).

[10] Baeza-Yates, R. A. Searching subsequences. Theoretical Computer Science 78, 2 (1991), 363–376.

[11] Baltrunas, L., Makcinskas, T., and Ricci, F. Group recommendations with rank aggregation and collaborative filtering. In ACM RecSys (2010).

[12] Bao, J., Zheng, Y., and Mokbel, M. F. Location-based and preference-aware recommendation using sparse geo-social networking data. In ACM GIS (2012).

[13] Bareš, B. L. Algorithms for longest common subsequence computation.

[14] Basilico, J., and Hofmann, T. Unifying collaborative and content-based filtering. In Proceedings of the twenty-first international conference on Machine learning (2004), ACM, p. 9.

[15] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. Speeded-up robust features (SURF). Computer vision and image understanding 110, 3 (2008), 346–359.

[16] Bay, H., Tuytelaars, T., and Van Gool, L. SURF: Speeded up robust features. In Computer Vision–ECCV 2006. Springer, 2006, pp. 404–417.

[17] Beal, M. J. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.

[18] Beaudet, P. R. Rotationally invariant image operators.
In Proceedings of the International Joint Conference on Pattern Recognition (1978), vol. 579, pp. 579–583.

[19] Beenen, G., Ling, K., Wang, X., Chang, K., Frankowski, D., Resnick, P., and Kraut, R. E. Using social psychology to motivate contributions to online communities. In Proceedings of the 2004 ACM conference on Computer supported cooperative work (2004), ACM, pp. 212–221.

[20] Bell, R. M., and Koren, Y. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on (2007), IEEE, pp. 43–52.

[21] Belongie, S., Malik, J., and Puzicha, J. Shape matching and object recognition using shape contexts. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24, 4 (2002), 509–522.

[22] Bergroth, L., Hakonen, H., and Raita, T. A survey of longest common subsequence algorithms. In String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. Seventh International Symposium on (2000), IEEE, pp. 39–48.

[23] Bien, J., Taylor, J., and Tibshirani, R. A lasso for hierarchical interactions. The Annals of Statistics 41, 3 (2013), 1111–1141.

[24] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. The Journal of Machine Learning Research 3 (2003), 993–1022.

[25] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. JMLR (2003).

[26] Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.

[27] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3, 1 (2011), 1–122.

[28] Breimer, E. A., Goldberg, M. K., and Lim, D. T. A learning algorithm for the longest common subsequence problem. Journal of Experimental Algorithmics (JEA) 8 (2003), 2–1.

[29] Bron, C., and Kerbosch, J. Algorithm 457: finding all cliques of an undirected graph. Communications of the ACM 16, 9 (1973), 575–577.

[30] Cantador, I., Brusilovsky, P., and Kuflik, T. 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM conference on Recommender systems (2011), RecSys 2011, ACM.

[31] Cantador, I., and Castells, P. Group recommender systems: New perspectives in the social web. In Recommender Systems for the Social Web. 2012.

[32] Carrillo, H., and Lipman, D. The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics 48, 5 (1988), 1073–1082.

[33] Chaney, A., Gartrell, M., Hofman, J., Guiver, J., Koenigstein, N., Kohli, P., and Paquet, U. Mining large-scale TV group viewing patterns for group recommendation. Tech. Rep. MSR-TR-2013-114, 2013.

[34] Chen, J.-B. A survey of the longest common subsequence problem and its related problems.

[35] Chen, L., and Stentiford, F. Video sequence matching based on temporal ordinal measurement. Pattern Recognition Letters 29, 13 (2008), 1824–1831.

[36] Cheng, C., Alexander, R., Min, R., Leng, J., Yip, K. Y., Rozowsky, J., Yan, K.-K., Dong, X., Djebali, S., Ruan, Y., et al. Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Research (2012).

[37] Chiu, C.-Y., Chen, C.-S., and Chien, L.-F. A framework for handling spatiotemporal variations in video copy detection. Circuits and Systems for Video Technology, IEEE Transactions on 18, 3 (2008), 412–417.

[38] Chiu, C.-Y., Li, C.-H., Wang, H.-A., Chen, C.-S., and Chien, L.-F. A time warping based approach for video copy detection. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on (2006), vol. 3, IEEE, pp. 228–231.

[39] Cho, E., Myers, S. A., and Leskovec, J. Friendship and mobility: user movement in location-based social networks. In ACM SIGKDD (2011).

[40] Choi, N. H., Li, W., and Zhu, J. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association 105, 489 (2010), 354–364.

[41] Christakis, N. A., and Fowler, J. H. Connected: amazing power of social networks and how they shape our lives. HarperCollins Publishers, 2010.

[42] Chum, O., Philbin, J., Isard, M., and Zisserman, A. Scalable near identical image and shot detection. In Proceedings of the 6th ACM international conference on Image and video retrieval (2007), ACM, pp. 549–556.

[43] Cole, D. K., Rizkallah, P. J., Gao, F., Watson, N. I., Boulter, J. M., Bell, J. I., Sami, M., Gao, G. F., and Jakobsen, B. K. Crystal structure of HLA-A*2402 complexed with a telomerase peptide. European Journal of Immunology (2006).

[44] Corpet, F. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research 16, 22 (1988), 10881–10890.

[45] Counts, S., and Fisher, K. Taking it all in? Visual attention in microblog consumption. Proc. ICWSM 2011 (2011).

[46] Dalal, N., and Triggs, B. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on (2005), vol. 1, IEEE, pp. 886–893.

[47] De Macedo, A. Q., and Marinho, L. B. Event recommendation in event-based social networks. In HyperText, Social Personalization Workshop (2014).

[48] Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L. Alignment of whole genomes. Nucleic Acids Research 27, 11 (1999), 2369–2376.

[49] Douze, M., Gaidon, A., Jegou, H., Marszałek, M., Schmid, C., et al. INRIA-LEAR's video copy detection system. In TREC Video Retrieval Evaluation (TRECVID Workshop) (2008).

[50] Douze, M., Jégou, H., and Schmid, C. An image-based approach to video copy detection with spatio-temporal post-filtering. Multimedia, IEEE Transactions on 12, 4 (2010), 257–266.

[51] Duvenaud, D. K., Nickisch, H., and Rasmussen, C. E. Additive gaussian processes. In NIPS (2011), pp. 226–234.

[52] Faloutsos, C., Ranganathan, M., and Manolopoulos, Y. Fast subsequence matching in time-series databases, vol. 23. ACM, 1994.

[53] Fan, J., and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 456 (2001), 1348–1360.

[54] Fischler, M. A., and Bolles, R. C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 6 (1981), 381–395.

[55] Fitch, W. M., and Smith, T. F. Optimal sequence alignments. Proceedings of the National Academy of Sciences 80, 5 (1983), 1382–1386.

[56] Foygel, R., Srebro, N., and Salakhutdinov, R. Matrix reconstruction with the local max norm. arXiv preprint arXiv:1210.5196 (2012).

[57] Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. Pathwise coordinate optimization. The Annals of Applied Statistics 1, 2 (2007), 302–332.

[58] Greenberg, R. I. Fast and simple computation of all longest common subsequences. arXiv preprint cs/0211001 (2002).

[59] Hampapur, A., Hyun, K., and Bolle, R. M. Comparison of sequence matching techniques for video copy detection. In Electronic Imaging 2002 (2001), International Society for Optics and Photonics, pp. 194–201.

[60] Harris, C., and Stephens, M. A combined corner and edge detector. In Alvey vision conference (1988), vol. 15, Manchester, UK, p. 50.

[61] Herlocker, J. L., Konstan, J. A., Borchers, A., and Riedl, J. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (1999), ACM, pp. 230–237.

[62] Hirschberg, D. S. Algorithms for the longest common subsequence problem. Journal of the ACM (JACM) 24, 4 (1977), 664–675.

[63] Hodas, N., and Lerman, K. How limited visibility and divided attention constrain social contagion. In Proc. ASE/IEEE Intl. Conf. on Social Computing (SocialComm) (2012).

[64] Hoffman, A. J. On eigenvalues and colorings of graphs. Graph Theory and its Applications (1970), 79–91.

[65] Hofmann, T. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 89–115.

[66] Hu, B., and Ester, M. Spatial topic modeling in online social media for location recommendation. In ACM RecSys (2013).

[67] Hu, Y., Koren, Y., and Volinsky, C. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on (2008), IEEE, pp. 263–272.

[68] Hua, X.-S., Chen, X., and Zhang, H.-J. Robust video signature based on ordinal measure. In Image Processing, 2004. ICIP'04. 2004 International Conference on (2004), vol. 1, IEEE, pp. 685–688.

[69] Jacob, L., Obozinski, G., and Vert, J.-P. Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning (2009).

[70] Jalali, A., Ravikumar, P., and Sanghavi, S. A dirty model for multiple sparse regression. arXiv preprint arXiv:1106.5826 (2011).

[71] Jamali, M., and Ester, M. Trustwalker: a random walk model for combining trust-based and item-based recommendation. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (2009), ACM, pp. 397–406.

[72] Jegou, H., Douze, M., and Schmid, C. Hamming embedding and weak geometric consistency for large scale image search. In Computer Vision–ECCV 2008. Springer, 2008, pp. 304–317.

[73] Jegou, H., Harzallah, H., and Schmid, C. A contextual dissimilarity measure for accurate and efficient image search. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on (2007), IEEE, pp. 1–8.

[74] Ji, S., and Ye, J. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th Annual International Conference on Machine Learning (2009), ACM, pp. 457–464.

[75] Kang, J., Lerman, K., and Getoor, L. LA-LDA: A limited attention topic model for social recommendation. In The 2013 International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction (SBP 2013) (2013).

[76] Karp, R. M. Reducibility among combinatorial problems. Springer, 1972.

[77] Karypis, G. Evaluation of item-based top-n recommendation algorithms. In Proceedings of the tenth international conference on Information and knowledge management (2001), ACM, pp. 247–254.

[78] Ke, Y., Sukthankar, R., and Huston, L. Efficient near-duplicate detection and sub-image retrieval. In ACM Multimedia (2004), vol. 4, p. 5.

[79] Koren, Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008), ACM, pp. 426–434.

[80] Koren, Y., and Bell, R. Advances in collaborative filtering. In Recommender Systems Handbook. Springer, 2011, pp. 145–186.

[81] Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.

[82] Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer (2009).

[83] Kuksa, P. P., Min, M. R., Dugar, R., and Gerstein, M. High-order neural networks and kernel methods for peptide-MHC binding prediction. NIPS 2014 Workshop on Machine Learning in Computational Biology (2014).

[84] Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5 (Dec. 2004), 27–72.

[85] Landau, G. M., and Ziv-Ukelson, M. On the common substring alignment problem. Journal of Algorithms 41, 2 (2001), 338–359.

[86] Law-To, J., Buisson, O., Gouet-Brunet, V., and Boujemaa, N. Robust voting algorithm based on labels of behavior for video copy detection. In Proceedings of the 14th annual ACM international conference on Multimedia (2006), ACM, pp. 835–844.

[87] Law-To, J., Chen, L., Joly, A., Laptev, I., Buisson, O., Gouet-Brunet, V., Boujemaa, N., and Stentiford, F. Video copy detection: a comparative study. In Proceedings of the 6th ACM international conference on Image and video retrieval (2007), ACM, pp. 371–378.

[88] Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (2006), vol. 2, IEEE, pp. 2169–2178.

[89] Lehmann, E. L., and Casella, G. Theory of point estimation, vol. 31. Springer, 1998.

[90] Levandoski, J. J., Sarwat, M., Eldawy, A., and Mokbel, M. F. LARS: A location-aware recommender system. In ICDE (2012).

[91] Lindeberg, T. Feature detection with automatic scale selection. International Journal of Computer Vision 30, 2 (1998), 79–116.

[92] Liu, D. C., and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 1-3 (1989), 503–528.

[93] Liu, J., Ji, S., and Ye, J. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.

[94] Liu, X., He, Q., Tian, Y., Lee, W.-C., McPherson, J., and Han, J. Event-based social networks: linking the online and offline social worlds. In ACM SIGKDD (2012).

[95] Lovász, L. On the Shannon capacity of a graph. Information Theory, IEEE Transactions on 25, 1 (1979), 1–7.

[96] Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.

[97] Ma, H., Yang, H., Lyu, M. R., and King, I. SoRec: social recommendation using probabilistic matrix factorization. In Proceedings of the 17th ACM conference on Information and knowledge management (2008), ACM, pp. 931–940.

[98] Maier, D. The complexity of some problems on subsequences and supersequences. Journal of the ACM (JACM) 25, 2 (1978), 322–336.

[99] Masthoff, J. Group recommender systems: Combining individual models. In Recommender Systems Handbook. 2011.

[100] Mikolajczyk, K., and Schmid, C. Scale & affine invariant interest point detectors. International Journal of Computer Vision 60, 1 (2004), 63–86.

[101] Min, M. R., Ning, X., Cheng, C., and Gerstein, M. Interpretable sparse high-order Boltzmann machines. In The 17th International Conference on Artificial Intelligence and Statistics (2014).

[102] Min, R., Chowdhury, S., Qi, Y., Stewart, A., and Ostroff, R. An integrated approach to blood-based cancer diagnosis and biomarker discovery. In Proceedings of the Pacific Symposium on Biocomputing (2014).

[103] Nardi, Y., Rinaldo, A., et al. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics (2008).

[104] Needleman, S. B., and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 3 (1970), 443–453.

[105] Nister, D., and Stewenius, H. Scalable recognition with a vocabulary tree. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (2006), vol. 2, IEEE, pp. 2161–2168.

[106] Östergård, P. R. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics 120, 1 (2002), 197–207.

[107] Paterek, A. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD cup and workshop (2007), vol. 2007, pp. 5–8.

[108] Pattanaik, P. K. Voting and collective choice: Some aspects of the theory of group decision-making. 1971.

[109] Purushotham, S., and Kuo, C.-C. J. Studying user influence in personalized group recommenders in location based social networks. In NIPS 2014, Personalization: Methods and Applications Workshop (2014).

[110] Purushotham, S., and Kuo, C.-C. J. Modeling group dynamics for personalized group-event recommendation. In Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer, 2015, pp. 405–411.

[111] Purushotham, S., Kuo, C.-C. J., Shahabdeen, J., and Nachman, L. Collaborative group-activity recommendation in location-based social networks. In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information (2014), ACM, pp. 8–15.

[112] Purushotham, S., Liu, Y., and Kuo, C.-C. J. Collaborative topic regression with social matrix factorization for recommendation systems. In Proceedings of the twenty-ninth International Conference on Machine Learning (ICML) (2012), p. 8.

[113] Purushotham, S., Min, M. R., Kuo, C.-C. J., and Ostroff, R. Factorized sparse learning models with interpretable high order feature interactions. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014), KDD '14, ACM.

[114] Purushotham, S., Tian, Q., and Kuo, C.-C. J. Picture-in-picture copy detection using spatial coding techniques. In Proceedings of the 2011 ACM international workshop on Automated media analysis and production for novel TV services (2011), ACM, pp. 25–30.

[115] Qiao, Z., Peng, Z., Zhou, C., Cao, Y., Guo, L., and Zhang, Y. Event recommendation in event-based social networks. In AAAI (2014).

[116] Rick, C. Efficient computation of all longest common subsequences. In Algorithm Theory-SWAT 2000. Springer, 2000, pp. 407–418.

[117] Robson, J. M. Finding a maximum independent set in time O(2^{n/4}). Tech. rep., Technical Report 1251-01, LaBRI, Université de Bordeaux I, 2001.

[118] Salakhutdinov, R., and Mnih, A. Probabilistic matrix factorization. Advances in Neural Information Processing Systems 20 (2008), 1257–1264.

[119] Schmidt, M. Graphical model structure learning with l1-regularization. PhD thesis, University of British Columbia, 2010.

[120] Sivic, J., and Zisserman, A.
Video google: A text retrieval approach to object matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on (2003), IEEE, pp. 1470–1477. [121] Smith, T. F., and Waterman, M. S. Comparison of biosequences. Advances in Applied Mathematics 2, 4 (1981), 482–489. [122] Su, X., and Khoshgoftaar, T. M. A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009 (2009), 4. [123] Suykens, J. A., and Vandewalle, J. Least squares support vector machine classifiers. Neural processing letters 9, 3 (1999), 293–300. [124] Tan, H.-K., Ngo, C.-W., and Chua, T.-S. Efficient mining of multiple partialnear-duplicatealignmentsbytemporalnetwork. Circuits and Systems for Video Technology, IEEE Transactions on 20, 11 (2010), 1486–1498. 231 [125] Tan, H.-K., Ngo, C.-W., Hong, R., and Chua, T.-S. Scalable detec- tion of partial near-duplicate videos by visual-temporal consistency. In Pro- ceedings of the 17th ACM international conference on Multimedia (2009), ACM, pp. 145–154. [126] Tarjan, R. E., and Trojanowski, A. E. Finding a maximum indepen- dent set. SIAM Journal on Computing 6, 3 (1977), 537–546. [127] Thompson, J. D., Higgins, D. G., and Gibson, T. J. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research 22, 22 (1994), 4673–4680. [128] Tian, Y., Srivastava, J., Huang, T., and Contractor, N. Social multimedia computing. Computer 43, 8 (2010), 27–36. [129] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288. [130] Tso-Sutter, K. H., Marinho, L. B., and Schmidt-Thieme, L. Tag- aware recommender systems by fusion of collaborative filtering algorithms. In Proceedings of the 2008 ACM symposium on Applied computing (2008), ACM, pp. 1995–1999. [131] Tuytelaars, T., and Mikolajczyk, K. Local invariant feature detec- tors: a survey. Foundations and Trends R in Computer Graphics and Vision 3, 3 (2008), 177–280. [132] Ullman, J., Aho, A., and Hirschberg, D. Bounds on the complexity of the longest common subsequence problem. Journal of the ACM (JACM) 23, 1 (1976), 1–12. [133] Vita, R., Zarebski, L., Greenbaum, J. A., Emami, H., Hoof, I., Salimi, N., Damle, R., Sette, A., and Peters, B. The immune epitope database 2.0. Nucleic acids research (2010). [134] Wagner, R. A., and Fischer, M. J. The string-to-string correction problem. Journal of the ACM (JACM) 21, 1 (1974), 168–173. [135] Wang,C.,andBlei,D.M. Collaborative topic modeling for recommend- ing scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (2011), ACM, pp. 448– 456. [136] Wang,H. Allcommonsubsequences. InProc. IJCAI-07 (2007), pp.635–40. 232 [137] Waterman, M. S., and Eggert, M. A new algorithm for best subse- quence alignments with application to trna-rrna comparisons. Journal of molecular biology 197, 4 (1987), 723–728. [138] Weng, L., Flammini, A., Vespignani, A., and Menczer, F. Compe- tition among memes in a world with limited attention. Scientific reports 2 (2012). [139] Wilf, H. S. The eigenvalues of a graph and its chromatic number. J. London Math. Soc 42, 1967 (1967), 330. [140] Wu, F., and Huberman, B. A. Novelty and collective attention. Pro- ceedings of the National Academy of Sciences 104, 45 (2007), 17599–17601. [141] Wu,J.,Tan,W.-C.,andRehg,J.M. 
Efficient and effective visual code- book generation using additive kernels. The Journal of Machine Learning Research 12 (2011), 3097–3118. [142] Wu,S.,Manber,U.,Myers,G.,andMiller,W. Ano(<i>np</i>) sequence comparison algorithm. Information Processing Letters 35, 6 (1990), 317–323. [143] Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., and Lange, K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25, 6 (2009), 714–721. [144] Wu, X., Hauptmann, A. G., and Ngo, C.-W. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th interna- tional conference on Multimedia (2007), ACM, pp. 218–227. [145] Wu, X., Takimoto, M., Satoh, S., and Adachi, J. Scene duplicate detection based on the pattern of discontinuities in feature point trajecto- ries. In Proceedings of the 16th ACM international conference on Multimedia (2008), ACM, pp. 51–60. [146] Wu, Z., Ke, Q., Isard, M., and Sun, J. Bundling features for large scale partial-duplicate web image search. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 25– 32. [147] Ye, M., Liu, X., and Lee, W.-C. Exploring social influence for recom- mendation: a generative model approach. In ACM SIGIR (2012). [148] Ye, M., Yin, P., andLee, W.-C. Location recommendation for location- based social networks. In ACM SIGSPATIAL (2010). 233 [149] Ye, M., Yin, P., Lee, W.-C., and Lee, D.-L. Exploiting geographical influence for collaborative point-of-interest recommendation. In ACM SIGIR (2011). [150] Yuan, M., and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 1 (2006), 49–67. [151] Yuan, Q., Cong, G., andLin, C.-Y. Com: a generative model for group recommendation. In ACM SIGKDD (2014). [152] Zanardi, V., and Capra, L. Social ranking: uncovering relevant content using tag-based recommender systems. In Proceedings of the 2008 ACM conference on Recommender systems (2008), ACM, pp. 51–58. [153] Zhang, D.-Q., and Chang, S.-F. Detecting image near-duplicate by stochastic attributed relational graph matching with learning. In Proceed- ings of the 12th annual ACM international conference on Multimedia (2004), ACM, pp. 877–884. [154] Zhang, W., Wang, J., and Feng, W. Combining latent factor model with location features for event-based group recommendation. In ACM SIGKDD (2013). [155] Zhang, W., Wang, J., and Feng, W. Combining latent factor model with location features for event-based group recommendation. In ACM SIGKDD (2013). [156] Zhao, W.-L. Lip-vireo in action. [157] Zhao, W.-L., and Ngo, C.-W. Scale-rotation invariant pattern entropy for keypoint-based near-duplicate detection. Image Processing, IEEE Trans- actions on 18, 2 (2009), 412–423. [158] Zhen, Y., Li, W.-J., and Yeung, D.-Y. Tagicofi: tag informed collabo- rative filtering. In Proceedings of the third ACM conference on Recommender systems (2009), ACM, pp. 69–76. [159] Zheng, V. W., Zheng, Y., Xie, X., and Yang, Q. Collaborative location and activity recommendations with gps history data. In WWW (2010). [160] Zhou, J., Chen, J., and Ye, J. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2011. 234 [161] Zhou, W., Lu, Y., Li, H., Song, Y., and Tian, Q. Spatial coding for large scale partial-duplicate web image search. In Proceedings of the international conference on Multimedia (2010), ACM, pp. 511–520. 
[162] Zhou, Y., Wilkinson, D., Schreiber, R., and Pan, R. Large-scale parallel collaborative filtering for the netflix prize. In Algorithmic Aspects in Information and Management. Springer, 2008, pp. 337–348. [163] Zou, H. The adaptive lasso and its oracle properties. Journal of the Amer- ican statistical association 101, 476 (2006), 1418–1429. 235
Abstract
Advanced machine learning techniques are developed in this thesis to tackle challenging problems in three Big Data application domains: 1) partial near-duplicate video copy detection and alignment for video data, 2) personalized single-user and group recommender systems for social media data, and 3) identification of discriminative feature interactions for gene expression prediction and cancer stage prediction in biomedical data. In each domain, machine learning tools are designed to suit the specific nature of the data.

For the video data application, we propose a novel spatio-temporal verification and alignment algorithm to accurately detect and retrieve partial near-duplicate videos from online video-sharing networks. We propose a generalized spatial coding scheme for video key-frame representation and a spatial verification scheme for key-frame matching. We introduce efficient subsequence-matching algorithms for temporal alignment that can be made scalable through a parallelization framework, and we propose a novel way to exploit social network information to reduce the computational cost of social media alignment and retrieval.

For the social media data application, we investigate how effectively social network information supports content recommendation. In particular, we propose a novel hierarchical Bayesian modeling framework that combines topic modeling of content with matrix factorization of social networks to automatically infer topics in the latent space and provide interpretable recommendations. Our models reveal interesting insights, showing that social circles can influence people's judgments of a content item's usefulness more than personal taste does. Empirical experiments on large-scale datasets show that our proposed retrieval and recommendation techniques outperform existing state-of-the-art approaches. Furthermore, we develop a new group recommender systems framework that models group dynamics and personalizes location and event recommendations for groups of users.

For the biomedical data application, we propose a novel scalable knowledge-based high-order sparse learning framework, the Group Factorized High-order Interactions Model (Group FHIM), to identify discriminative feature groups and high-order feature group interactions. This allows us to understand disrupted gene interactions, which underlie some diseases. Unlike previous sparse learning approaches, the proposed model recovers both the discriminative feature groups and the pairwise feature group interactions accurately without enforcing hierarchical feature constraints. Experiments on synthetic and real datasets show that our model outperforms other state-of-the-art sparse learning techniques and provides interpretable high-order feature group interactions for biomarker discovery and gene expression prediction.
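To make the subsequence-matching idea behind the temporal alignment stage concrete, the following is a minimal illustrative sketch in Python of the classic longest-common-subsequence (LCS) dynamic program on which many temporal alignment methods build. It is a simplified stand-in under stated assumptions, not the algorithm developed in this thesis: the function name lcs_alignment, the frame-signature strings, and the toy query/candidate inputs are all hypothetical placeholders.

    def lcs_alignment(a, b):
        # Classic O(len(a) * len(b)) dynamic program for the longest
        # common subsequence. Here a and b are sequences of hashable
        # key-frame signatures (hypothetical placeholders); a real
        # system would compare quantized visual descriptors instead.
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        # Backtrack to recover one optimal frame-to-frame alignment.
        pairs, i, j = [], m, n
        while i > 0 and j > 0:
            if a[i - 1] == b[j - 1]:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
            elif dp[i - 1][j] >= dp[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return pairs[::-1]

    # Toy example: a 4-frame query clip against a 6-frame candidate.
    query = ["f1", "f2", "f3", "f4"]
    candidate = ["f0", "f1", "x", "f3", "f4", "f5"]
    print(lcs_alignment(query, candidate))  # [(0, 1), (2, 3), (3, 4)]

The quadratic table computed here is the kind of per-pair work that a parallelization framework would distribute across candidate videos in a large-scale setting, in line with the scalability claim above.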
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Efficient machine learning techniques for low- and high-dimensional data sources
Efficient coding techniques for high definition video
Novel algorithms for large scale supervised and one class learning
Scalable machine learning algorithms for item recommendation
Motion pattern learning and applications to tracking and detection
Graph-based models and transforms for signal/data processing with applications to video coding
Efficient graph learning: theory and performance evaluation
Scalable multivariate time series analysis
Heterogeneous graphs versus multimodal content: modeling, mining, and analysis of social network data
Event detection and recounting from large-scale consumer videos
Deep learning models for temporal data in health care
Latent space dynamics for interpretation, monitoring, and prediction in industrial systems
Data-efficient image and vision-and-language synthesis and classification
Modeling and predicting with spatial‐temporal social networks
Machine learning methods for 2D/3D shape retrieval and classification
Compression of signal on graphs with the application to image and video coding
Advanced technologies for learning-based image/video enhancement, image generation and attribute editing
Advanced features and feature selection methods for vibration and audio signal classification
Incorporating aggregate feature statistics in structured dynamical models for human activity recognition
Representation, classification and information fusion for robust and efficient multimodal human states recognition
Asset Metadata
Creator: Purushotham, Sanjay (author)
Core Title: Advanced machine learning techniques for video, social and biomedical data analytics
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 09/15/2015
Defense Date: 08/31/2015
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: big data analytics, biomedical data analytics, collaborative filtering, high order feature interactions, OAI-PMH Harvest, social media analytics, social recommender systems, sparse learning, video data analytics, video retrieval and alignment, video-copy detection
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kuo, C.-C. Jay (committee chair), Jenkins, Keith (committee member), Liu, Yan (committee member), Mendel, Jerry (committee member), Ortega, Antonio (committee member)
Creator Email: sanjayp2005@gmail.com, spurusho@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-179003
Unique Identifier: UC11275961
Identifier: etd-Purushotha-3901.pdf (filename), usctheses-c40-179003 (legacy record id)
Legacy Identifier: etd-Purushotha-3901.pdf
Dmrecord: 179003
Document Type: Dissertation
Rights: Purushotham, Sanjay
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA