Interactive Learning: A General Framework and Various Applications

by Ehsan Emamjomeh-Zadeh

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE), December 2020. Copyright 2020 Ehsan Emamjomeh-Zadeh.

Dedication

To my role model and little sister, Maryam, and our parents. To my beloved grandfather, who left us this year. He lives in our memories and hearts forever.

Acknowledgements

First and foremost, I would like to thank my supervisors, David Kempe and Shaddin Dughmi, for welcoming me into their group and being constantly supportive and kindhearted throughout this journey. My friends have heard from me over and over that I cannot even imagine a better PhD supervisor than David or Shaddin. I also thank Shanghua Teng for his help and care throughout my PhD. Not only was it a pleasure and an honor to collaborate with him, I always enjoyed his wise words, and I have already missed Sonia's stories. I thank all of my collaborators who made my success possible. In particular, I thank Haipeng Luo at USC, Robert Schapire at Microsoft Research, and Mohammad Mahdian at Google. I also thank Vahab Mirrokni, Renato Paes Leme, Balasubramanian Sivan, and Jon Schneider at Google for the enjoyable and memorable summer internship in 2018 in New York City. Special thanks to my PhD dissertation committee members for their very useful feedback and their patience with me while I have been writing this dissertation.

I moved to the US in 2013 knowing that my student visa was valid only for a single entry to the country. The next time I visited home was in 2019. During these six long years, my friends were my family. Later, the year 2020 turned out to be (objectively) the worst year of my life. (I sincerely hope this statement remains true forever.) I cannot, and do not want to, imagine how these past years of my life would have been without these friends. To express my gratitude to my friends, I list them here. Following the tradition in our community, Theoretical Computer Science, this list is in alphabetical order to avoid any uncomfortable follow-up discussion: Ehsan Abbasi, Negarsadat Abolhassani, Nazanin AlipourFard, Ida Amini, Mohammad Asghari, Sepehr Assadi, Sumita Barahmand, Pooyan Behnamghader, Reihane Boghrati, Navid Dehdari Ebrahimi, Kiana Ehsani, Homa Esfahanizadeh, Sahba Ezami, Majid Ghasemi-Gol, Hana Kooredavoodi, Shaghayegh Mardani, Mehrnoosh Mirtaheri, Sajjad Moazeni, Setareh Nasihati Gilani, Zahra Nazari, Ashkan Norouzi-Fard, Aida Rahmattalabi, Alireza Rezai, Aria Rezai, Keyvan Rezaei-Moghaddam, Zeinab Sadeghipour, Abolfazl Sadeghpour, Ario Salmasi, Arman Shahbazian, Khadijeh Sheikhan, Mina Tahmasbi, Leili Tavabi, Nazgol Tavabi, Pooya Vahidi, and Donia Zaheri.

Last, and certainly the most, is my family. I can never be grateful enough to my mom and my dad, who have gone through tough days to support me in my decisions. My living more than 7000 miles away from them has not been easy for them either. I was lucky to go back home in 2019 to reunite with my family on Maryam's birthday. I am submitting my dissertation, and Maryam's years of being apart from our parents have just begun. I am extremely proud of her, I wish her the best, and I will be there for her throughout her journey the same way she has always been there for me.

Funding: My PhD work was supported in part by NSF Grant 1619458 and ARO MURI grant W911NF1810208.
Table of Contents

Dedication
Acknowledgements
Abstract

Chapter 1: Introduction
  1.1 Background
  1.2 Online Learning
    1.2.1 Expert Advice Problem
    1.2.2 Multi-Armed Bandit Problem
  1.3 Here, in This Dissertation

Chapter 2: Preliminaries
  2.1 Multiplicative Weight Update Method
    2.1.1 On the Computational Complexity of the Hedge Algorithm
  2.2 Graph Notation
  2.3 Distance Function and Metric Space
    2.3.1 Examples of Metrics
      2.3.1.1 Kendall-Tau Distance for Permutations
      2.3.1.2 Hamming Distance for Binary Classifiers
    2.3.2 Shortest-Path Representation of a Distance Function
    2.3.3 Shortest-Path Property of a Triple
  2.4 Tail Bounds
  2.5 Vapnik-Chervonenkis Dimension
  2.6 Hardness Conjectures

Chapter 3: Interactive Learning Model
  3.1 Problem Statement
  3.2 Applications
    3.2.1 Learning of a Ranking
    3.2.2 Learning of a Classifier
  3.3 The Same Problem in a Different Language
    3.3.1 Special Case: Binary Search on Lines
    3.3.2 Special Case: Binary Search on Trees
  3.4 Related Models
  3.5 Noisy Feedback
  3.6 Dynamic Model
    3.6.1 Shifting Target
    3.6.2 Drifting Target
  3.7 Non-Uniform Cost

Chapter 4: Basic Setting
  4.1 Results
    4.1.1 Incorporation of Prior Knowledge
  4.2 Applications
    4.2.1 Learning a Ranking
    4.2.2 Learning a Binary Classifier
Chapter 5: Interactive Learning with Noisy Feedback
  5.1 Background and Preliminaries
    5.1.1 Independent Noise Model
    5.1.2 Information-Theoretical Background
    5.1.3 Alternative Models of Noise
  5.2 Small Noise Assumption
  5.3 Easier Algorithms for Special Cases
  5.4 Results
    5.4.1 Incorporation of Prior Knowledge
    5.4.2 Hedge Algorithm, Similarities and Dissimilarities
  5.5 Applications

Chapter 6: Computational Considerations
  6.1 Background
  6.2 Re-implementation of the Learning Algorithms Using Sampling
  6.3 Learning of a Ranking with Sampling

Chapter 7: Dynamic Target
  7.1 Background
  7.2 Preliminaries and Definitions
    7.2.1 The Shifting Target Model
    7.2.2 The Drifting Target Model
  7.3 A Generic Mistake Bound
  7.4 Results for the Shifting Target Model
    7.4.1 Negative Result
  7.5 Results for the Drifting Target Model
    7.5.1 Negative Results
  7.6 Proofs of the Lower Bounds
    7.6.1 A Lower Bound on Noisy Standard Binary Search
    7.6.2 Proof of Theorem 7.7
    7.6.3 Proof of Theorem 7.9

Chapter 8: Non-Uniform Known Costs
  8.1 Computational Complexity of the Uniform Cost Setting
  8.2 Computational Complexity of the Non-Uniform Cost Setting
  8.3 Hardness Proofs
    8.3.1 Proof of Theorem 8.3
    8.3.2 Proof of Theorems 8.1 and 8.2

Chapter 9: Posted-Price Auction
  9.1 Background
  9.2 Selling a Single Item
  9.3 Selling Multiple Items

Chapter 10: Conclusion
  10.1 Summary of our Main Contributions
  10.2 Main Open Problems
    10.2.1 Computation of the Optimal Strategy
    10.2.2 Closing the Gap for Dynamic Target
    10.2.3 Non-Uniform Cost

Bibliography

Abstract

In many applications of machine learning, a system has to "learn" through interaction with the environment. For instance, a recommendation system or a search engine needs to learn the relative importance of a set of items in order to properly rank them for its users. Such systems observe the users' click patterns and exploit this information to gradually improve their understanding of the users' preferences. In this dissertation, we present a general framework for the task of interactively learning a combinatorial structure (such as a ranking, a classifier, or a clustering) over a finite set of items. The learning task proceeds in T rounds, throughout which a learner aims to discover the true hidden combinatorial structure, called the target. In each round t, the learner proposes a structure $x_t$ and, in response, either learns that $x_t$ is the target or receives some partial information about the target, naturally in the form of a "correction" to $x_t$.

In this dissertation, we start by introducing a general abstract framework for interactive learning problems. We then extend our framework to address several real-world aspects of such learning problems. First, we take into account the fact that the feedback is usually noisy, that is, it may sometimes be incorrect (due to human error, for example). We discuss several algorithms that are robust and work even in the presence of noise. Next, we consider interactive learning of a target when the target itself changes over time. Finally, we briefly discuss the learning task when a non-uniform cost function is associated with the structures. In this setting, different structures cost differently when they are proposed to the users, and the learner's objective is to minimize the total cost of the structures it proposes.

As the main building block of our framework, we introduce and analyze a natural generalization of the classic binary search algorithm to metric spaces. This abstract problem is of interest in its own right in the theoretical computer science community.

Chapter 1: Introduction

1.1 Background

Reliance on machine learning is spreading across myriad domains. In many applications, machine learning systems are deployed before they are fully trained, and they are designed to improve through interaction with their environment. Here is an example: consider a recommendation system whose goal is to rank a set of items based on their (initially unknown) importance or relevance. In the basic model of interaction commonly assumed in the context of online/interactive learning, the learner (in this example, the recommendation system) interacts with its users in a sequence of rounds. In each round, the system shows a ranking of the items to a user. Subsequently, the user's clicks, or lack thereof, provide implicit feedback to the system about the user's preferences. The recommendation system can incorporate this feedback to come up with a "better" ranking for the next round.

Another example of interactive learning is the classic equivalence query model, introduced by [Angluin, 1988], for learning a binary classifier over a (finite) set of items.
In this problem, the target is a binary classifier over a finite set of items. In each round of the learning process, the learner proposes a binary classifier. In response, the learner observes an item which is misclassified. This continues until the learner finds the true underlying classifier.

1.2 Online Learning

Online learning is the broad term used for learning settings in which the learner interacts with an environment or an oracle to extract information. The two iconic (families of) problems studied in the field of online learning are closely related to the interactive learning model we discuss here. We introduce them in Sections 1.2.1 and 1.2.2.

1.2.1 Expert Advice Problem

The first problem we discuss in this section is called the expert advice problem. In this problem, a special case of which was initially studied by [Cesa-Bianchi et al., 1997], the learner has to "combine" several decision-making policies, called experts. Formally, there is a set of N experts, and learning proceeds interactively in T rounds. At the beginning of each round, each expert provides a piece of advice (in the form of a recommended action). Then, the learner chooses which expert to follow. At the end of each round, the learner observes the loss associated with the advice of the expert it picked, as well as the loss of every other expert. The loss of the learner in each round equals the loss of the expert it followed in the same round. Notice that in this setting, the learner has "full information" regarding the loss of each expert's advice in each round.

The learner's final loss is the sum of all losses it incurs throughout the T rounds. We define the loss of each expert e as the total loss associated with its advice throughout the entire T rounds. The most commonly used quantity to evaluate the learner's performance after T rounds is the loss it incurs compared to the loss of the best expert. Formally, if $M$ and $M^*$ denote the loss of the learning algorithm and the loss of the best expert throughout the T rounds, respectively, then $M - M^*$ is called the regret of the learner. Notice that even negative regret is technically possible: because the learner can switch among the experts, it is possible that it outperforms every single one of them. However, for the problem stated above (with no further assumption whatsoever), it is not even trivial to achieve a learning algorithm whose regret is sub-linear in T.

We will not survey this problem in this dissertation. However, in Section 2.1, we discuss a well-known algorithm for this problem, called the multiplicative weight update algorithm (also known as the exponential weights algorithm). This algorithm achieves sub-linear regret in expectation even in the worst-case setting (that is, with no further assumption on the problem except that the losses are bounded). Several of our algorithms are inspired by this algorithm, although they have fundamental differences.

1.2.2 Multi-Armed Bandit Problem

This problem is similar to the expert advice problem, although it is motivated (hence named) differently. Imagine a gambler who has access to N slot machines. She has T rounds: in each round, she chooses one of the machines, pulls its arm, and collects the reward associated with the arm. In the context of online learning, it is significantly more common to associate a loss (rather than a reward) with each choice that the learner makes. The two models, as we will explain in Section 2.1, are equivalent. So, similar to the expert advice problem, we assume here that there is a loss associated with each arm.
The multi-armed bandit problem is similar to the expert advice problem in the sense that in both problems, the learner sequentially makes a decision among a pool of N options and incurs the loss associated with its choice. However, the main difference that distinguishes the two problems is that in the multi-armed bandit problem, at the end of each round, the learner only observes the loss it incurs, that is, the loss associated with the arm it pulled. As opposed to the full-information setting we discussed in Section 1.2.1, in the setting of the multi-armed bandit problem, the learner does not observe what the loss would have been had she chosen a different arm. (In the multi-armed bandit problem, the standard assumption is that the loss associated with each arm may change from one round to the next even if the arm is not pulled. Therefore, when a round is over, the learner can never know the loss of the arms it did not pull.) In a seminal paper, [Auer et al., 2002] adapted the multiplicative weight update algorithm and introduced an elegant algorithm called EXP3. The three "EXP"s stand for the words "exploration," "exploitation," and "exponential": three keywords that characterize the algorithm.

1.3 Here, in This Dissertation

In this dissertation, we design generic learning algorithms which are applicable to various interactive learning problems, including (but not limited to) the two examples we mentioned before: learning of a ranking and learning of a binary classifier. In our model, the learner interacts with a user in several rounds, and the user provides some feedback to the learner. Under a natural assumption on the feedback that the learner receives in each round, we unify various applications from different domains within one abstract framework. We discuss our model and assumptions in Chapter 3. Our framework is based on a natural generalization of classic binary search to metric spaces that we first introduced and studied in [Emamjomeh-Zadeh et al., 2016].

In Chapter 4, we state our basic results and pose several theoretical open problems. Our main result for the basic setting is that if N denotes the size of the search space, then there is a learning algorithm that discovers the target within $\log_2 N$ rounds [Emamjomeh-Zadeh et al., 2016]. Throughout this dissertation, every log is to the base 2 unless stated otherwise.

An important aspect of deploying such interactive systems is that the feedback the learner receives from its users is likely to be noisy (due to human error, for example). Hence, it is crucial for us that our algorithms be robust. In Chapter 5, we study the problem of finding the target from noisy feedback in as few rounds as possible. Our main result in this chapter is a robust algorithm that discovers the target with high probability using $O(\log N)$ rounds even in the presence of noise [Emamjomeh-Zadeh et al., 2016, Emamjomeh-Zadeh and Kempe, 2017]. The constant absorbed by the $O$ notation depends on the amount of noise in the feedback. Also, in Chapter 6, we discuss an idea, originally presented in [Emamjomeh-Zadeh and Kempe, 2017], to implement our learning algorithms more efficiently in terms of their computational complexity.

Another practical challenge, when interactive systems are deployed "in the wild," is that there may not exist a single static target to be found. In the recommendation system example (stated above), the best ranking over a set of items may evolve over time as people's tastes do.
Furthermore, different individuals may not necessarily agree on the same ranking. In this scenario, if the learner interacts with a new user drawn from a mixed population in each round, then the feedback it receives throughout the learning process can be inconsistent. In Chapter 7, we propose two different models of dynamic targets and present several learning algorithms for these scenarios. These algorithms are designed to "keep track" of the target. The number of mistakes that our algorithms make is logarithmic in the size of the search space; the remaining parameters on which the mistake bound depends quantify how "freely" the target can change throughout the learning. This chapter is based on our work [Emamjomeh-Zadeh et al., 2020].

Towards the end of the dissertation (in Chapters 8 and 9, as well as Subsection 10.2.3), we discuss the interactive learning problem when querying different structures does not cost uniformly. An interesting example of this scenario is when the learner is interested in minimizing the "total dissatisfaction" of the users rather than quickly discovering the target(s). In this model, we assume that there is a cost associated with each structure representing how unsatisfactory the structure is. Our results for this setting are preliminary, and some of them are not yet published (as of November 7, 2020).

Chapter 2: Preliminaries

In this chapter, we define some of the basic concepts and state the theorems which we will use later. Throughout this dissertation, every log is to the base 2 unless stated otherwise.

2.1 Multiplicative Weight Update Method

In this section, we explain a classic algorithm for the expert advice problem (defined in Section 1.2.1) called the Hedge algorithm [Freund and Schapire, 1997]. In the context of online learning, the Hedge algorithm and many other learning algorithms are based on a technique called the "multiplicative weight update" method; see, for example, the WINNOW algorithm [Littlestone, 1988]. Our main learning algorithms (discussed in this dissertation) also belong to this family and are essentially inspired by the Hedge algorithm.

In the expert advice problem, let N and T denote the number of experts and the number of rounds, respectively. At the beginning of each round t ($1 \le t \le T$), each expert e provides some advice, with which some loss $\ell_t(e) \in \mathbb{R}$ is associated. It is common to assume that each loss is bounded, and it is without loss of generality to further assume $0 \le \ell_t(e) \le 1$.

Remark 2.1 In standard models of online learning (for both the expert advice problem and the multi-armed bandit), it is more common to associate a loss (rather than a reward) with the advice of each expert in each round. However, the two languages can be used interchangeably. Let $\rho_t(e)$ denote the reward of the advice of expert e in round t. Similar to losses, rewards are also assumed to be bounded, that is, $0 \le \rho_t(e) \le 1$ in every round t and for every expert e. In this language, the learner's goal is to maximize its total reward (rather than minimizing the total loss). To see the equivalence of the models, one can define $\ell_t(e) = 1 - \rho_t(e)$. In this dissertation, we always assume that the learner suffers some loss in each round (rather than gaining some reward).

We now explain the algorithm. Let $\bar\ell_t(e) = \sum_{t'=1}^{t} \ell_{t'}(e)$ denote the total loss (associated with the advice) of expert e in the first t rounds. Also, define $\bar\ell_0(e) = 0$ for every expert e.
At the beginning of each round t, the multiplicative weight update algorithm computes a weight $L_t(e) = \beta^{\bar\ell_t(e)}$ for each expert, where $0 < \beta < 1$ is a fixed constant. Notice that the weight of each expert is exponentially decreasing in the cumulative loss of the expert. The algorithm then normalizes the weights so that they add up to 1. Formally, it defines $\pi_t(e)$ as follows:

$$\pi_t(e) = \frac{L_t(e)}{\sum_{e'} L_t(e')}.$$

Notice that $\pi_t(\cdot)$ is in fact a probability distribution over the experts. The algorithm chooses an expert randomly according to this probability distribution. At the end of the round, the algorithm computes the weights for the next round.

2.1.1 On the Computational Complexity of the Hedge Algorithm

Notice that

$$L_{t+1}(e) = L_t(e) \cdot \beta^{\ell_{t+1}(e)}.$$

This implies that, having $L_t(e)$, when the learner observes $\ell_{t+1}(e)$, it can compute the new weight $L_{t+1}(e)$ in time O(1). Thus, the overall computational complexity of this learning algorithm is $N \cdot O(1) = O(N)$ per round.
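To make the update rule concrete, the following is a minimal Python sketch of the Hedge algorithm as described above. The environment callback `losses_for_round` is a hypothetical stand-in for however the per-round losses are revealed; it is not part of the original description.

```python
import random

def hedge(num_experts, num_rounds, beta, losses_for_round):
    """Minimal Hedge sketch: one weight per expert, multiplied by
    beta**loss after each round (0 < beta < 1)."""
    weights = [1.0] * num_experts  # L_0(e) = beta**0 = 1 for every expert
    total_loss = 0.0
    for t in range(num_rounds):
        # Normalize the weights into the probability distribution pi_t.
        total = sum(weights)
        probs = [w / total for w in weights]
        # Sample one expert according to pi_t and follow its advice.
        expert = random.choices(range(num_experts), weights=probs)[0]
        losses = losses_for_round(t)  # placeholder: losses in [0, 1]
        total_loss += losses[expert]
        # Multiplicative update: L_{t+1}(e) = L_t(e) * beta**loss_t(e),
        # an O(1) operation per expert, O(N) per round overall.
        for e in range(num_experts):
            weights[e] *= beta ** losses[e]
    return total_loss
```

Note that the per-round work is exactly one O(1) multiplicative update per expert, matching the O(N) bound stated in Section 2.1.1.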
2.2 Graph Notation

Let G = (V, E) be a (directed or undirected) graph. In this notation, V denotes the nodes and $E \subseteq V \times V$ denotes the edges. Let $N = |V|$ denote the number of nodes. Let $\omega : E \to \mathbb{R}_{>0}$ be the edge weights (that is, the edge "lengths" when we compute the weighted distance between pairs of nodes). If $e = (v, v') \in E$ is an edge, we may use $\omega(e)$ and $\omega(v, v')$ interchangeably. Notice the $\mathbb{R}_{>0}$ in the definition of the edge weight function: we always assume that the edge weights in every graph are positive. When all the edge weights are equal, we say that the graph is unweighted.

For every node $v \in V$, let $N_G(v) = \{v' \mid (v, v') \in E\}$ denote the set of neighbors of v in G. We also define $N^+_G(v) = N_G(v) \cup \{v\}$. Lastly, for every node $v \in V$ and every edge $e = (v, v')$ incident to it, we define $R_G(v, e)$ as the set of nodes which have a shortest path from v through e. More formally,

$$R_G(v, e) = \{v'' \mid d(v, v'') = \omega(e) + d(v', v'')\}.$$

We first presented this definition in [Emamjomeh-Zadeh et al., 2016], and it is used at the core of our framework, as we will see later. When the graph G is clear from the context, we may drop it from the notation; so in most of this dissertation, we simply use N(v), $N^+(v)$, and R(v, e) instead of $N_G(v)$, $N^+_G(v)$, and $R_G(v, e)$.

2.3 Distance Function and Metric Space

Let V be a set. A distance function over V is a function that assigns (numerical) distances to pairs of elements of V. Definition 2.1 states the formal definition of a distance function.

Definition 2.1 (Distance Function) Let $d : V \times V \to \mathbb{R}_{\ge 0}$ be a function. We say that d is a distance function over V if it satisfies the following properties:
(i) $d(v, v') = 0$ if and only if $v = v'$. Consequently, $d(v, v') > 0$ whenever $v \ne v'$.
(ii) Symmetry: $d(v, v') = d(v', v)$ for all $v, v' \in V$.
(iii) Triangle Inequality: $d(v, v') + d(v', v'') \ge d(v, v'')$ for all $v, v', v'' \in V$.

Definition 2.2 We say that (V, d) is a metric space if d is a valid distance function over V.

An asymmetric distance function is defined similarly, except that it is not required to satisfy Property (ii) in Definition 2.1. In [Emamjomeh-Zadeh et al., 2016, Emamjomeh-Zadeh and Kempe, 2017], we discuss asymmetric distance functions as well (in the presentation of an algorithm for interactive learning of a clustering). This is, however, not included in this dissertation.

2.3.1 Examples of Metrics

In this dissertation, we refer to two standard and well-known metrics as we discuss applications of our framework. These two metrics are formally defined here.

2.3.1.1 Kendall-Tau Distance for Permutations

Let $I = \{1, \dots, n\}$ and let V denote the set of all permutations over I (therefore, $|V| = n!$). For every permutation $\pi \in V$ and every $i \in I$, $\pi(i)$ and $\pi^{-1}(i)$ denote the element in position i in $\pi$ and the position of element i in $\pi$, respectively. A standard distance function over V is the Kendall-Tau distance, which is defined in Definition 2.3.

Definition 2.3 (Kendall-Tau Distance for Permutations) Let $\pi_1, \pi_2 \in V$ be two permutations over I. The Kendall-Tau distance between $\pi_1$ and $\pi_2$, denoted by $d_{KT}(\pi_1, \pi_2)$, is defined as the number of pairs of items in I which are ordered differently in $\pi_1$ and $\pi_2$. More formally,

$$d_{KT}(\pi_1, \pi_2) = |\{1 \le i < j \le n \mid \mathbb{1}[\pi_1^{-1}(i) < \pi_1^{-1}(j)] \ne \mathbb{1}[\pi_2^{-1}(i) < \pi_2^{-1}(j)]\}|.$$

2.3.1.2 Hamming Distance for Binary Classifiers

Let $I = \{1, \dots, n\}$ and let V denote the set of all binary classifiers over I (therefore, $|V| = 2^n$). For every binary classifier $\mathcal{C} \in V$ and every element $i \in I$, $\mathcal{C}(i) \in \{-, +\}$ denotes the label assigned to i according to $\mathcal{C}$. The Hamming distance between two binary classifiers $\mathcal{C}_1, \mathcal{C}_2 \in V$ is defined in Definition 2.4.

Definition 2.4 (Hamming Distance for Binary Classifiers) Let $\mathcal{C}_1, \mathcal{C}_2 \in V$ be two binary classifiers over I. The Hamming distance between $\mathcal{C}_1$ and $\mathcal{C}_2$ is defined as the number of elements which are labeled differently according to these two classifiers. More formally,

$$d_H(\mathcal{C}_1, \mathcal{C}_2) = |\{1 \le i \le n \mid \mathcal{C}_1(i) \ne \mathcal{C}_2(i)\}|.$$
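As a concrete companion to Definitions 2.3 and 2.4, here is a small sketch that computes both distances by direct enumeration; the quadratic-time loop mirrors the definition of $d_{KT}$ rather than aiming for efficiency, and the function names are ours.

```python
def kendall_tau(p1, p2):
    """Number of item pairs ordered differently by permutations p1 and p2
    (Definition 2.3). Permutations are lists of items; brute force O(n^2)."""
    pos1 = {item: i for i, item in enumerate(p1)}  # item -> position in p1
    pos2 = {item: i for i, item in enumerate(p2)}
    items = list(p1)
    return sum(
        1
        for i in range(len(items))
        for j in range(i + 1, len(items))
        if (pos1[items[i]] < pos1[items[j]]) != (pos2[items[i]] < pos2[items[j]])
    )

def hamming(c1, c2):
    """Number of items labeled differently by classifiers c1 and c2
    (Definition 2.4), each given as a sequence of '+'/'-' labels."""
    return sum(1 for a, b in zip(c1, c2) if a != b)

# Example: only the pair {2, 3} is ordered differently, so d_KT = 1.
assert kendall_tau([1, 2, 3], [1, 3, 2]) == 1
assert hamming("++-", "+--") == 1
```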
2.3.2 Shortest-Path Representation of a Distance Function

Suppose that d is a distance function over V. One can define a weighted graph G as follows: let V be the set of nodes, and for every pair of nodes $v, v' \in V$ ($v \ne v'$), let G have an undirected edge between v and v' with weight $d(v, v')$. Note that this graph is positively weighted, that is, all the edge weights are positive. The length of the shortest path between every pair of nodes equals their distance with respect to d; this easily follows from the triangle inequality. Notice that in G, one of the shortest paths between every pair of nodes is always the edge between them. However, there might also be other shortest paths in G which go through multiple nodes.

Conversely, one can define a metric space based on a given positively weighted undirected graph. Let G be such a graph. For every pair of nodes v, v' in G, define $d(v, v')$ as the length of the shortest path between these two nodes in G. Also, by the definition of shortest paths, $d(v, v) = 0$ for every node v. It is easy to see that d satisfies all the required properties of a distance function, and hence this is a distance function over the nodes of G.

As we discussed in this subsection, (finite) metric spaces and undirected positively weighted graphs can be used interchangeably. For most of this dissertation, we use graphs to represent metric spaces.

2.3.3 Shortest-Path Property of a Triple

Let (V, d) be a (symmetric or asymmetric) metric space and $x, y, z \in V$.

Definition 2.5 (Shortest-Path Property of a Triple) We say that (x, y, z) satisfies the shortest-path property if and only if

$$d(x, y) + d(y, z) = d(x, z). \tag{2.6}$$

When Equality (2.6) holds, we say that y lies on a shortest path from x to z.

Definition 2.5 is at the core of the framework we will introduce in this dissertation.

2.4 Tail Bounds

The tail of a distribution over $\mathbb{R}$ refers to the values with non-zero probability that are "far" from the expected value of the distribution. In this subsection, we state two theorems asserting that, under certain conditions, the probability mass is mostly concentrated close to the expected value of the distribution. These two theorems are going to be used in some of our proofs.

Theorem 2.2 (Markov's Inequality) Let X be a random variable drawn from a distribution $\mathcal{D}$ over $\mathbb{R}_{\ge 0}$ whose expected value is $\mu$. For every $\lambda > 1$,

$$\Pr[X \ge \lambda \mu] \le \frac{1}{\lambda}.$$

Before we state the next theorem, we define the notion of the empirical expected value (also known as the empirical mean). Let $\mathcal{D}$ be a real-valued distribution whose expected value $\mu$ is unknown. Let $X_1, \dots, X_k$ be k samples drawn from $\mathcal{D}$, and define $\bar{X} = \frac{1}{k} \sum_{i=1}^{k} X_i$ as their average. Intuitively, we expect $\bar{X}$ to be close to $\mu$. The average $\bar{X}$ is called the empirical mean of $\mathcal{D}$ with respect to the k samples we drew. Hoeffding's bound (Theorem 2.3) states that if $\mathcal{D}$ is a bounded distribution, then $|\bar{X} - \mu|$ is "likely to be small" when k is "large enough." See the formal statement below.

Theorem 2.3 (Hoeffding's Inequality) Let $X_1, \dots, X_k$ be real-valued random variables bounded in the interval [0, 1] and independently drawn from a distribution $\mathcal{D}$. Let $\mu$ denote the expected value of each variable drawn from $\mathcal{D}$, and let $\bar{X} = \frac{1}{k} \sum_{i=1}^{k} X_i$ be the empirical expected value. For every real number $\epsilon > 0$,

$$\Pr[|\bar{X} - \mu| \ge \epsilon] \le 2e^{-2k\epsilon^2}.$$

Notice that Hoeffding's bound states an upper bound on the (additive) gap between the empirical expected value and the true expected value of the distribution. The right-hand side of Hoeffding's inequality (in Theorem 2.3) is exponentially decreasing in the number of samples, k.

Lastly, we define the Kullback-Leibler divergence (KL divergence, for short). Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be two distributions over the same (finite) set. The KL divergence from $\mathcal{D}_1$ to $\mathcal{D}_2$, denoted by $D_{KL}(\mathcal{D}_1 \| \mathcal{D}_2)$, is a measure of how far apart these two distributions are, although this notion is not symmetric (that is, $D_{KL}(\mathcal{D}_1 \| \mathcal{D}_2)$ is not necessarily equal to $D_{KL}(\mathcal{D}_2 \| \mathcal{D}_1)$).

Definition 2.7 Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be two distributions over the same (finite) set X. The KL divergence from $\mathcal{D}_1$ to $\mathcal{D}_2$, denoted by $D_{KL}(\mathcal{D}_1 \| \mathcal{D}_2)$, is defined as follows:

$$D_{KL}(\mathcal{D}_1 \| \mathcal{D}_2) = \sum_{x \in X} \mathcal{D}_1(x) \log \frac{\mathcal{D}_1(x)}{\mathcal{D}_2(x)}.$$
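For intuition about how Theorem 2.3 is typically used, the sketch below inverts Hoeffding's inequality to compute how many samples suffice for a desired accuracy and confidence, and it evaluates the KL divergence of Definition 2.7. The function names and numeric example are ours, chosen for illustration.

```python
import math

def hoeffding_samples(eps, delta):
    """Smallest k with 2 * exp(-2 * k * eps**2) <= delta: with this many
    i.i.d. samples in [0, 1], the empirical mean is within eps of the
    true mean except with probability at most delta (Theorem 2.3)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def kl_divergence(d1, d2):
    """D_KL(D1 || D2) from Definition 2.7 (log base 2); the distributions
    are dicts mapping each outcome to its probability, with d2 > 0
    wherever d1 > 0."""
    return sum(p * math.log2(p / d2[x]) for x, p in d1.items() if p > 0)

print(hoeffding_samples(0.05, 0.01))                      # 1060 samples
print(kl_divergence({0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}))  # ~0.737 bits
```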
2.5 Vapnik-Chervonenkis Dimension

In this section, we define a classic concept called the Vapnik-Chervonenkis dimension (or VC dimension, for short). Even though we do not directly use this notion in our main results, it is helpful for illustrating some of the interesting applications of our main results.

The Vapnik-Chervonenkis dimension is defined in the context of binary classification. Let I be a finite set of items (the VC dimension is well-defined for infinite sets as well; in this dissertation, however, we only use finite sets) and let $S \subseteq \{+,-\}^I$ be a set of binary classifiers over I. Notice that each classifier $\mathcal{C} \in \{+,-\}^I$ is a function that assigns one of the two labels + and - to each of the items in I. In this context, it is more convenient (and common) to think of each classifier $\mathcal{C}$ as the set of items which are labeled + according to $\mathcal{C}$. This introduces a one-to-one correspondence between the binary classifiers over I and the subsets of I.

Let $\bar{I} \subseteq I$ be a subset of items. We say S shatters $\bar{I}$ if and only if for every subset $I' \subseteq \bar{I}$ there exists a classifier in S that assigns + to $I'$ and - to $\bar{I} \setminus I'$. See Definition 2.8, in which each classifier is thought of as a subset of I (rather than a function).

Definition 2.8 (Shattering) Let $\bar{I} \subseteq I$ be a subset of items. We say S shatters $\bar{I}$ if for every subset $I' \subseteq \bar{I}$ there exists a classifier $\mathcal{C} \in S$ such that
• $I' \subseteq \mathcal{C}$; and
• $(\bar{I} \setminus I') \cap \mathcal{C} = \emptyset$.

Having Definition 2.8 in hand, we can now define the VC dimension of a set S.

Definition 2.9 (VC Dimension) The Vapnik-Chervonenkis dimension, usually referred to as the VC dimension, of S is the size of the largest set $\bar{I}$ that it shatters.

The VC dimension is widely used in different fields of machine learning in order to quantify the complexity of a family of classifiers. Consider the problem of discovering an underlying binary classifier over I under the assumption that it belongs to the set S. In several learning models (for example, PAC learning), the VC dimension of S captures "how difficult" it is to find the target. In this dissertation, we will show that in interactive learning of a binary classifier, if the target is known to belong to a set S with "low" VC dimension, then the query complexity of our algorithms is small as well. This connection is based on the Sauer-Shelah theorem (Theorem 2.4), which was initially proved in [Sauer, 1972]; Shelah [Shelah, 1972] proved the same theorem independently in the same year.

Theorem 2.4 (Sauer-Shelah) Let $S' \subseteq V$ be a set of classifiers of VC dimension at most c. Then

$$|S'| \le \left(\frac{en}{c}\right)^{c} = O(n^c).$$
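The brute-force sketch below checks Definitions 2.8 and 2.9 directly, representing each classifier by its set of '+'-labeled items; it is exponential in |I| and meant only for tiny examples. The threshold family at the end is our own illustration, not an example from the text.

```python
from itertools import combinations

def shatters(classifiers, subset):
    """Definition 2.8: the classifiers (frozensets of '+'-labeled items)
    shatter `subset` iff every I' of subset arises as C & subset."""
    realized = {frozenset(c & subset) for c in classifiers}
    return len(realized) == 2 ** len(subset)

def vc_dimension(classifiers, items):
    """Definition 2.9: size of the largest shattered subset (brute force,
    exponential in |items|, so suitable only for tiny instances)."""
    best = 0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            if shatters(classifiers, frozenset(subset)):
                best = max(best, r)
    return best

# Threshold classifiers over {1, 2, 3} have VC dimension 1: every
# singleton is shattered, but no pair is.
thresholds = [frozenset(range(k, 4)) for k in range(1, 5)]
print(vc_dimension(thresholds, {1, 2, 3}))  # 1
```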
2.6 Hardness Conjectures

The most standard and popular hardness assumption in the computer science community is $\mathsf{NP} \not\subseteq \mathsf{P}$. In this dissertation, we refer to two stronger conjectures, namely ETH and SETH, both of which imply $\mathsf{NP} \not\subseteq \mathsf{P}$. The following two conjectures were originally introduced by [Impagliazzo and Paturi, 2001] and [Calabro et al., 2009], respectively.

Conjecture 1 (Exponential Time Hypothesis) The Exponential Time Hypothesis (ETH, for short) [Impagliazzo and Paturi, 2001] states that there exists a positive constant s such that 3-CNF-SAT is not solvable in time $2^{s'n} \cdot \mathrm{poly}(n, m)$ for any constant $s' < s$. Here, n and m are the number of variables and the number of clauses, respectively.

Conjecture 2 (Strong Exponential Time Hypothesis) The Strong Exponential Time Hypothesis (SETH, for short) [Calabro et al., 2009] states that CNF-SAT does not admit any algorithm with running time $2^{(1-\epsilon)n} \cdot \mathrm{poly}(n, m)$ for any constant $\epsilon > 0$.

Notice that ETH (and hence SETH) implies that neither SAT nor any other NP-complete problem admits a quasi-polynomial time algorithm. More specifically, an NP-complete problem with input size n is not solvable in time $n^{s \log n}$ for any constant s.

Next, we talk about a "harder" family of problems. The complexity class PSPACE is defined as the set of (decision) problems that can be solved in polynomial space. A problem A is called PSPACE-hard if for every PSPACE problem B, there is a polynomial-time reduction from B to A. Also, a problem A is called PSPACE-complete if A belongs to PSPACE and is PSPACE-hard as well. A generalization of the standard SAT problem, named the quantified Boolean formula (QBF) problem, is known to be PSPACE-complete. We state this problem here. For every $1 \le i \le \kappa$, let each of $x_i$ and $y_i$ be a Boolean variable, and let $F(x_1, \dots, x_\kappa, y_1, \dots, y_\kappa)$ be a CNF formula. The QBF question is to decide whether

$$\exists x_1\, \forall y_1\, \cdots\, \exists x_\kappa\, \forall y_\kappa \;\; F(x_1, \dots, x_\kappa, y_1, \dots, y_\kappa)$$

is true. Notice that the quantifiers alternate.

Chapter 3: Interactive Learning Model

In this chapter, we explain the basic notation, definitions, and assumptions of our model. We initially introduced this model in [Emamjomeh-Zadeh and Kempe, 2017]. It is based on a generalization of the binary search algorithm to metric spaces, which we proposed a year earlier in [Emamjomeh-Zadeh et al., 2016].

3.1 Problem Statement

In this dissertation, we discuss a category of machine learning problems in which the learner interacts with a user or an oracle in order to discover the ground truth, which we call the target. To formalize the problem statement, we here state the most basic version of this problem. Later on, we relax or generalize some of the assumptions to deal with more real-world challenges.

The abstract problem which we define in this section is called generalized binary search in metric spaces (or generalized binary search, for short). We initially proposed this problem in our work [Emamjomeh-Zadeh et al., 2016]. Here, we change the language used in the original definition of [Emamjomeh-Zadeh et al., 2016] and present an equivalent (yet more intuitive) problem. In Section 3.3, we present the original problem from [Emamjomeh-Zadeh et al., 2016] and show that the two problems are indeed equivalent. (When the problem is formulated as we originally presented it, several interesting theoretical problems come up, some of which may not have direct implications in machine learning. For this reason, presenting both problem statements is useful. For more theoretical discussion of this abstract model, we refer the reader to our paper [Emamjomeh-Zadeh et al., 2016].)

In an interactive learning problem, the learner's goal is to find a (hidden) combinatorial structure (such as a ranking or a classification) over a (finite) set of items. Let V denote the entire search space. One of the structures $z \in V$ is the true hidden target which the learner wants to discover. The learning task proceeds in T rounds. In each round t, the learner proposes a structure $x_t$. We call $x_t$ the proposal or the query of the learner; we use the terms "proposing" and "querying" interchangeably. If $x_t = z$, the learner learns that its proposal is correct. Otherwise, it receives a response $y_t \in V$ which is "closer" to z than $x_t$ according to some known metric function $d^*$. Notice that $y_t$ is a structure itself. The term "closer" was intentionally vague; Assumption 3.1 formalizes its definition.

Assumption 3.1 (Shortest Path) There exists a known distance function $d^*$ over V which satisfies the following condition: in every round, if x and z denote the query and the target, respectively, and y is the feedback, then y lies on a shortest path from x to z with respect to $d^*$. See Definition 2.5 for the definition of "lying on a shortest path."

First of all, notice that if $z = x$, then the only response y that satisfies Assumption 3.1 is x itself. In other words, if the learner queries the correct target, then the response must also be the same as the query. Another straightforward observation is that regardless of x and z, if $y = x$, then Assumption 3.1 is always satisfied. That means that in each round, one trivial response which is valid according to this assumption is the query itself. We therefore require the feedback to also satisfy Assumption 3.2.

Assumption 3.2 (Non-Trivial Feedback) Let x, y, and z denote the query, the response, and the target, respectively. If $z \ne x$, then $y \ne x$.

Notice that Assumption 3.1 (in combination with Assumption 3.2) is stronger than $d^*(y, z) < d^*(x, z)$: not only do we assume that y is closer to z than x is, we expect y to lie on a shortest path from x to z. As was mentioned earlier, this is the most basic version of the problem. In Sections 3.5, 3.6, and 3.7, we relax some assumptions and deal with several challenges of real-world applications. As we will see in Sections 3.2.1 and 3.2.2, in each application every response $y_t$ is obtained by making a "natural correction" to the proposed structure $x_t$.
Assumption 3.1 does not restrict the feedback which the user can provide; rather, it states that a distance function $d^*$ should be properly chosen to model the underlying domain so that natural responses satisfy Equation (2.6). Whether such a distance function exists depends on the nature of the application.

3.2 Applications

In this section, we list two specific applications where our abstract model applies. These two will serve as our main examples throughout this dissertation in order to illustrate our general results. In our paper [Emamjomeh-Zadeh and Kempe, 2017], we also discuss a third application: clustering. Applying our framework to this application, however, requires some extra work: we need to model the search space with an asymmetric metric space rather than a symmetric one. For this reason, clustering is excluded from this dissertation. For the sake of consistency, here we use the unified notation of Section 3.1 for the queries, responses, and targets (rather than the application-specific notation we used in Sections 2.3.1.1 and 2.3.1.2).

3.2.1 Learning of a Ranking

The first example we illustrate is interactive learning of a ranking. This task is motivated by learning a user's preferences over a set of items in a recommendation system or a search engine. For more background on what has been done for this problem, see, for example, [Crammer and Singer, 2002, Joachims, 2002, Radlinski and Joachims, 2005]. In this dissertation, we show that this task can be formulated in a way that fits our generic framework.

Consider a recommendation system which needs to rank a set of items based on their relevance or importance. Let $I = \{1, \dots, n\}$ be the set of items which the system has to rank, and let $n = |I|$ denote its size. Define V as the set of all possible rankings over I. Hence, $|V| = n!$, which we denote by N.

Let $x = \langle a_1, \dots, a_n \rangle$ be a ranking which the learner (the recommendation system, in our example) proposes ($a_i \in I$ for all $1 \le i \le n$). It has been shown [Granka et al., 2004] that when a ranking over I is presented, users normally go through the list from top to bottom. Under this assumption, suppose that the user skips items $a_i, \dots, a_{j-1}$ and clicks on item $a_j$. In this case, the user implicitly asserts that $a_j$ is ahead of $a_i, \dots, a_{j-1}$ in the correct underlying ranking z.

We formalize this learning task as follows: each proposal is a ranking $x = \langle a_1, \dots, a_n \rangle$ over I. The feedback consists of two indices $1 \le i < j \le n$ and asserts that $a_j$ appears ahead of $a_i, \dots, a_{j-1}$ in the target ranking z. We translate this feedback into a new ranking $y \in V$ which is obtained from x by shifting $a_j$ to position i. See Figure 3.1.

[Figure 3.1: Feedback for interactive learning of a ranking. The proposed ranking $\langle a_1, \dots, a_i, a_{i+1}, \dots, a_j, a_{j+1}, \dots, a_n \rangle$ is transformed into $\langle a_1, \dots, a_j, a_i, \dots, a_{j-1}, a_{j+1}, \dots, a_n \rangle$.]

Observation 3.1 Let x, y, and z denote the proposed ranking, the given response, and the target, respectively. Let $d^*$ be the Kendall-Tau distance over permutations, as defined in Section 2.3.1.1. Then $d^*(x, y) + d^*(y, z) = d^*(x, z)$.

Observation 3.1 implies that Assumption 3.1 holds for this learning task when the Kendall-Tau distance is used over the permutations. Interactive learning of a ranking, as explained in this section, will be referred to throughout this dissertation as the signature application of the framework.
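Concretely, the translation from click feedback to the response ranking y can be sketched as follows (0-indexed positions; the function name is ours). Combined with a Kendall-Tau implementation such as the earlier sketch in Section 2.3.1, Observation 3.1 can be verified on small examples.

```python
def apply_click_feedback(x, i, j):
    """Translate 'the user skipped positions i..j-1 and clicked position j'
    into the response ranking y: move item x[j] up to position i,
    shifting the skipped items down by one (see Figure 3.1)."""
    y = list(x)
    item = y.pop(j)
    y.insert(i, item)
    return y

x = [1, 2, 3, 4, 5]
y = apply_click_feedback(x, 1, 3)  # user skipped items 2, 3; clicked 4
print(y)  # [1, 4, 2, 3, 5]
```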
3.2.2 Learning of a Classifier

Another example that we discuss in this dissertation is learning of a classifier over a set I of items. Even though our approach can be generalized to multi-class classification, for simplicity, we only focus on binary classification. A real-world application of interactively learning a binary classifier is when an email service aims to distinguish spam from non-spam. When the user is shown a list of emails, some of which are categorized as spam, the user may provide feedback to the system by reporting an email which was not detected as spam or, conversely, clearing a legitimate email which was mistakenly classified as spam by the system.

This interactive learning model for learning of a binary classifier, which we formalize momentarily, was initially introduced by Angluin [Angluin, 1988] and was named the "equivalence query model." Independently, Littlestone [Littlestone, 1988] also defined the same model and named it the "online learning" model.

Let $I = \{1, \dots, n\}$ denote the items, and define $V = \{+,-\}^n$ as the set of all binary classifiers over I. In the example of an email server, I is the set of all emails in a user's inbox which need to be classified; the labels - and + represent non-spam and spam, respectively. Also, let $N = |V| = 2^n$ denote the size of V. Let $z \in V$ denote the correct classifier over I. In each round, the learner presents a classifier $x \in V$. If its proposal is not correct (that is, $x \ne z$), then the user provides feedback: it corrects one item that is misclassified. This interaction continues until the learner discovers the correct classifier z. The equivalence query model has been intensively studied in the literature; see, for example, [Hoi et al., 2018], which surveys the previous work in this domain.

Let x be a classifier that the learner proposes and z be the true unknown classifier. If an item $a \in I$ is labeled differently by x and z, we say a is misclassified by x. As explained earlier, in the equivalence query model, the feedback is a misclassified item a. We can translate this feedback into a binary classifier $y \in V$ obtained from x by flipping the label of a. In other words, y agrees with x on $I \setminus \{a\}$ but disagrees with it on $\{a\}$.

Observation 3.2 Let $d^*$ be the Hamming distance over binary classifiers, as defined in Section 2.3.1.2. Then $d^*(x, y) + d^*(y, z) = d^*(x, z)$.

Observation 3.2 implies that Assumption 3.1 holds for learning under the equivalence query model when the Hamming distance is used over the set of classifiers.
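The corresponding translation for the equivalence query model is even simpler; the following sketch (with a function name of our choosing) flips the reported item's label, and the Hamming identity of Observation 3.2 then holds by construction.

```python
def apply_correction(x, a):
    """Equivalence-query feedback: flip the label of misclassified item a,
    yielding the response classifier y that agrees with x off {a}."""
    y = dict(x)
    y[a] = "+" if x[a] == "-" else "-"
    return y

x = {1: "+", 2: "-", 3: "-"}   # proposed classifier
print(apply_correction(x, 2))  # {1: '+', 2: '+', 3: '-'}
```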
In this case, any other node y 0 that lies on a shortest path from x to y does also lie on a shortest path from x to z. Assumption 3.3 summarizes the model. Assumption 3.3 Let x and z denote the query and the target, respectively. • if x is the target, then the response is x itself; • otherwise, the response is a neighbor of x (or equivalently, an edge incident to x) that lies on a shortest path from x to z. In either case, the response always belongs toN + G (x) (see Section 2.2 for the graph theory notation). While the rest of this chapter is consistent with our rst problem statement (stated in Sec- tion 3.1), most of our results will follow this alternative representation (that uses graph theory language) of the same problem because it opens up interesting directions. Our original papers [Emamjomeh-Zadeh et al., 2016, Emamjomeh-Zadeh and Kempe, 2017] also use the graph theo- retic language. This language immediately suggests some connections between our learning model and some of the classic graph theory problems in theoretical computer science. 23 3.3.1 Special Case: Binary Search on the Lines Consider the classic binary search problem: the learner is given a sorted array with N unique numbers as well as a value from the array. The learner's task is to locate in the array. In each round, the learner can choose an index in the array and compare the corresponding number in the array to . If this number equals , then the learning task ends. Otherwise, the learner learns whether is larger than the queried element or smaller. In this problem, the learner can always locate using no more than log 2 N queries by always querying the middle element in the remaining sub-array. It is straightforward to see that classic binary search problem as stated above is a special case of the interactive learning problem we dened in this section. In fact, interactive learning in a simple path of lengthn is equivalent to the classic binary search problem. This is why we call the interactive learning problem \generalization of binary search" to graphs in [Emamjomeh-Zadeh et al., 2016]. 3.3.2 Special Case: Binary Search on the Trees Another interesting special case which has been studied in theoretical computer science litera- ture is a special case of our interactive learning problem when G is a tree (see, for example, [Ben-Asher et al., 1999, Onak and Parys, 2006, Mozes et al., 2008]). Binary search in a tree is in fact studied mostly because of its application to catching faulty les in massive data stores. As shown by Jordan [Jordan, 1869], every tree has a \separator" node with the property that each of its subtrees has at most half of the nodes of the tree. Therefore, as pointed out by Onak and Parys [Onak and Parys, 2006], one can nd the target using at most log 2 N queries (where N is the number of nodes) by iteratively querying a separator of the remaining subtree and eliminating all subtrees known not to contain the target. 24 3.4 Related Models As discussed in Section 3.2.2, our interactive learning model is in fact, a generalization of the equivalence query model of [Angluin, 1988]. The equivalence query model is known to be equivalent to the Online Learning model of [Littlestone, 1988]. Both models focus on learning a (binary) classier; a large amount of follow-up work has studied this problem. Interactive learning of a combinatorial structure can be considered as a version of the multi- armed Bandit problem or the Expert Advice problem (see Section 1.2). 
3.4 Related Models

As discussed in Section 3.2.2, our interactive learning model is, in fact, a generalization of the equivalence query model of [Angluin, 1988]. The equivalence query model is known to be equivalent to the online learning model of [Littlestone, 1988]. Both models focus on learning a (binary) classifier; a large amount of follow-up work has studied this problem.

Interactive learning of a combinatorial structure can be considered a version of the multi-armed bandit problem or the expert advice problem (see Section 1.2). Each guess (proposed by the learner) corresponds to one arm/expert. In every round, the loss associated with the target structure is 0, while the loss of every other arm is 1. In our model, the learner does not receive full information about all the experts; rather, it observes the loss of its guess as well as "directional" feedback toward the target. In this sense, the feedback in our model is less informative than in the expert advice problem but more informative than in the bandit setting.

In our model, we make two "worst-case" assumptions:
• the target is chosen adversarially; and
• in each round, if there are multiple correct responses, one of them is chosen adversarially.

A significant departure from the model we consider is to relax the first assumption and assume that a probability distribution over target locations is given, instead of the worst-case assumption made here. The goal then typically is to minimize the expected number of queries until the target is found. This type of model is studied in several papers, e.g., [Laber et al., 2002, Cicalese et al., 2012]. The techniques and results are significantly different from ours. In this dissertation, we only focus on worst-case analysis. As argued by [Bei et al., 2013], in many real-world examples in which we are interested, the target as well as the responses are usually generated by the users of the system, whose behavior is beyond our understanding.

A learning model with a superficially similar flavor, which lies between the expert advice problem and the multi-armed bandit problem, is the graph feedback model (see, e.g., [Alon et al., 2015, Alon et al., 2017]). In this model, the arms/experts are also the nodes of a graph. In each round, the learner observes the loss of the chosen arm as well as the loss of each of its neighbors. Notice that the graph plays very different roles in the graph feedback model and in our model: in the graph feedback model, it captures observability of rewards, while in our model, it provides directional information.

Several other "partial-information" versions of the expert advice problem have been studied in the literature. The path planning problem is a well-known example (see, e.g., [György et al., 2007]): in a given directed acyclic graph, a decision maker has to pick a path between two particular nodes in every round. In this problem, the length of each edge changes in each round. The decision maker observes only the lengths of the edges in its chosen path in each round. One can cast this problem as a version of the multi-armed bandit problem, where each path is an arm. In contrast to the classic multi-armed bandit problem, the learner observes partial feedback about every path that shares at least one edge with its chosen path, so in this setting as well, the feedback is more informative than in the multi-armed bandit setting.

Lastly, we distinguish our interactive learning model from "active learning" models. In our model, the learner repeatedly proposes a structure, and the response corrects one mistake in the proposal. If the proposal has multiple mistakes, an adversary decides which one is revealed. In active learning, on the other hand, the learner adaptively asks questions in a specific format until it finds the target. For instance, here is a standard active learning model for learning a permutation over a set of n items: the learner, in each round, chooses two items a and a' and learns which one comes first in the target. This learning problem, as stated here, resembles the classic sorting question: the learner can find the target in $O(n \log n)$ rounds. In this dissertation, we only focus on the interactive learning model. For instance, in the context of ranking, the learner proposes a full permutation in each round, and an adversary decides which mistake(s) should be revealed. The "trial and error" model of [Bei et al., 2013] for learning a solution to a Boolean formula is another interactive learning model.
This learning problem, as stated here, resembles the classic sorting question: the learner can nd the target in O(n logn) rounds. In this dissertation, we only 26 focus on interactive learning model. For instance, in the context of ranking, the learner proposes a full permutation in each round and an adversary decides which mistake(s) should be revealed. The \trial and error" model of [Bei et al., 2013] for learning of a solution to a boolean formula is another interactive learning model. 3.5 Noisy Feedback We stated the most basic version of the problem in Section 3.1. However, our assumptions can be relaxed to capture more real-world aspects of the applications. In this section, we consider the case when the feedback is not always correct (due to human error, for example). If the feedback which the learner receives in each round is noisy, then it is crucial to design robust algorithms which do not fail even if some responses are incorrect. To address this concern, we use the independent stochastic noise model, also known as independent noise model, as dened below. Denition 3.4 (Independent Stochastic Noise Model) There exists a xed and known con- stant 0p 1 such that the feedback in each round is (adversarially) incorrect with probability p, independent of previous rounds. Notice that in this model, we do not make any assumption or generative model for incorrect feedback: in each round, the feedback is incorrect with probabilityp, in which case it is adversarially misleading. With the remaining probability, the feedback is correct in which case it is guaranteed that Assumption 3.1 is satised. In it crucial that in our model, the responses are independent: if the learner proposes the same structure multiple times, each of the responses is noisy independently with probability p. An al- ternative assumption (see, for example, [Boczkowski et al., 2016]) is that the errors are permanent, 27 that is, each structure is \broken" with some probability in which case, the responses that the learner receives from that node may be permanently incorrect. The \permanent error" model for noise is harder to justify in our applications and is not surveyed in this dissertation. 3.6 Dynamic Model For most of our results, we assume that the target is static, that is the target remains xed during the T rounds. In such cases, the goal is to identify the target as quickly as possible. However, as we mentioned before, in many real-world applications, the target may change over time. Here, we dene two models for an evolving target. If the target moves at the end of round t, then z t+1 6= z t . Let B be an upper bound on the number of rounds in which the target moves. Formally, jf1t<Tjz t+1 6=z t gjB: (3.5) To simplify some of the notation, we dene b = B T . In the applications where the target is dynamic, it is usually not sucient to locate it once. Rather, a natural goal is to \track" the target during the T rounds. We dene the mistake bound as the number of rounds in which the algorithm makes a mistake. This terminology is standard in several online learning domains to refer to the number of mistakes made by the learner throughout the learning process. As before, letx t andz t be the learner's proposal and the target, respectively. Then, the mistake bound, denoted by M, is formally dened as M =jf1tTjx t 6=z t gj (3.6) 28 3.6.1 Shifting Target Consider an interactive learning system who interacts with a population of users rather than a single user. 
3.6 Dynamic Model

For most of our results, we assume that the target is static, that is, the target remains fixed during the $T$ rounds. In such cases, the goal is to identify the target as quickly as possible. However, as we mentioned before, in many real-world applications, the target may change over time. Here, we define two models for an evolving target. If the target moves at the end of round $t$, then $z_{t+1} \neq z_t$. Let $B$ be an upper bound on the number of rounds in which the target moves. Formally,

$|\{1 \le t < T \mid z_{t+1} \neq z_t\}| \le B$.   (3.5)

To simplify some of the notation, we define $b = B/T$.

In the applications where the target is dynamic, it is usually not sufficient to locate it once. Rather, a natural goal is to "track" the target during the $T$ rounds. We define the mistake bound as the number of rounds in which the algorithm makes a mistake. This terminology is standard in several online learning domains to refer to the number of mistakes made by the learner throughout the learning process. As before, let $x_t$ and $z_t$ be the learner's proposal and the target, respectively. Then, the mistake bound, denoted by $M$, is formally defined as

$M = |\{1 \le t \le T \mid x_t \neq z_t\}|$.   (3.6)

3.6.1 Shifting Target

Consider an interactive learning system that interacts with a population of users rather than a single user. In each round $t$, the learner proposes a structure $x_t$, and one user from the population provides the feedback for the learner. In this setting, different users can have different target structures in mind. Therefore, not all the responses point to the same target. We assume that there exists a "relatively small" set of potential targets $Z$ such that every user's target belongs to $Z$. The learner knows the size bound $\kappa$, but not $Z$ itself. We formalize this model in Definition 3.7.

Definition 3.7 (Shifting Model) In the shifting model, there exists a set $Z$ with $|Z| \le \kappa$ which is unknown to the learner, and $z_t \in Z$ for every round $t$.

Note that in this model, we assume that the learner does not observe the identity of the users. Therefore, the learner cannot verify whether the user interacting with the system in round $t+1$ is the same one who provided feedback in round $t$.

3.6.2 Drifting Target

Recall the example of a recommendation system: a recommendation system aims to find "the best ranking" its users have in mind. However, this best ranking can gradually evolve. For example, if the system is ranking a set of songs, people's taste in music and, consequently, the optimal ordering of the songs, may change. To model this type of evolution, we assume there exists a known (possibly directed) graph $G'$ whose node set is $V$, and if the target changes from one round to another, it changes along an edge of $G'$. To make this more formal, let $E' \subseteq V \times V$ denote the set of edges in $G'$. Also, let $N^{\mathrm{out}}_{G'}(v)$ be the set of nodes to which node $v$ has an edge in $G'$. Whenever the graph $G'$ is obvious from the context, we may drop the subscript in our notation.

Definition 3.8 (Drifting Model) In the drifting model, for every $1 \le t < T$, $z_{t+1} \in \{z_t\} \cup N^{\mathrm{out}}_{G'}(z_t)$, where $G'$ is the evolution graph.

We define a parameter $\Delta$ which quantifies the "freedom" of the target when it moves. Formally, $\Delta$ is defined as the maximum number of possibilities for $z_{t+1}$ given $z_t$. In other words, $\Delta$ is the maximum out-degree in the graph $G'$, assuming every node has a self-loop:

$\Delta = \max_{v \in V} |\{v\} \cup N^{\mathrm{out}}_{G'}(v)|$.   (3.9)
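Equation (3.9) is easy to instantiate in code. The following minimal Python sketch (the dictionary-based representation of the evolution graph is an illustrative assumption) computes $\Delta$ from the out-neighbor sets:

```python
def drift_degree(out_neighbors):
    """Delta from Equation (3.9): max over v of |{v} union N_out(v)|.

    out_neighbors maps each node of the evolution graph G' to the set of
    its out-neighbors; the union with {v} accounts for the self-loop
    (the target may always stay put).
    """
    return max(len({v} | set(nbrs)) for v, nbrs in out_neighbors.items())

# Toy example (a hypothetical 3-node evolution graph):
# drift_degree({1: {2}, 2: {1, 3}, 3: set()}) == 3
```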
3.7 Non-Uniform Cost

In this dissertation, we assume that all mistakes are equally costly to the learner. The problem becomes significantly more challenging when the learner's loss in one round may depend on "how bad" its guess was. Indeed, in many real-world examples, the learner's cost is monotone in the user's dissatisfaction, which relates to how bad the proposal was. For instance, in the example of ranking, the "distance" (according to some metric) from the proposed ranking to the target may be considered the cost of the learner. Under this model, which we call the "non-uniform cost" model, the learner is interested in minimizing its total cost/loss throughout the learning process, rather than simply minimizing the number of mistakes. This setup is mostly out of the scope of this dissertation, but the discussions in Chapters 8 and 9 are related to this topic.

Chapter 4

Basic Setting

Here, we present our results for the basic setup of our problem. Namely, we assume the responses are noiseless and the target is static. In this chapter, we follow the notation from Section 3.3, that is, we use the graph-theoretic version of the problem. This chapter is based on our work [Emamjomeh-Zadeh et al., 2016].

4.1 Results

We establish a tight logarithmic upper bound on the number of queries required when the target is static and the responses are not noisy. Notice that for the special case of a simple path, our main result in this chapter is simply equivalent to the classic binary search problem (see Section 3.3.1). More generally, this result was previously known for trees (see, for example, [Onak and Parys, 2006]). Our main contribution is to generalize it to all positively weighted, connected, undirected graphs. Also, note that in this chapter, we only analyze the query complexity of our algorithm. Later, in Chapter 6, we discuss how and under what assumptions this algorithm can be implemented efficiently.

Theorem 4.1 ([Emamjomeh-Zadeh et al., 2016], Theorem 3) There exists an adaptive strategy with the following property: given any undirected, connected and positively weighted graph $G = (V,E)$ with an unknown target vertex $z$, the strategy will find $z$ using at most $\log N$ queries, where $N$ denotes the number of nodes in $G$. This bound is tight even when the graph is a line.

Recall that, as we mentioned earlier, every logarithm is to the base 2 (unless stated otherwise).

Proof. In each iteration, based on the responses the algorithm has received so far, there will be a set $S \subseteq V$ of candidate nodes remaining, one of which is the target. The strategy is to always query a vertex $x$ minimizing the sum of distances to vertices in $S$, i.e., a 1-median of $S$ (with ties broken arbitrarily). Notice that $x$ itself may not lie in $S$. More formally, for any set $S \subseteq V$ and vertex $v \in V$, we define $\Phi_S(v) = \sum_{v' \in S} d(v,v')$. The algorithm is given as Algorithm 1. Recall that in Section 2.2, we defined $R(x,e)$ as the set of nodes that have a shortest path from $x$ containing the edge $e$.

Algorithm 1 Target Search for Undirected Graphs
1: $S \leftarrow V$.
2: while $|S| > 1$ do
3:   $x \leftarrow$ a vertex minimizing $\Phi_S(x)$.
4:   if $x$ is the target then
5:     return $x$.
6:   else
7:     $e = (x,y) \leftarrow$ response to the query.
8:     $S \leftarrow S \cap R(x,e)$.
9:   end if
10: end while
11: return the only vertex in $S$.

First, note that $\Phi_S(x)$ and $R(x,e)$ can be computed using Dijkstra's algorithm, and $x$ can be found by exhaustive search in linear time, so the entire algorithm takes time polynomial in the size of the graph. We discuss how this algorithm can be implemented more efficiently in some domain-specific cases in Chapter 6.

We claim that Algorithm 1 uses at most $\log N$ queries to find the target $z$. To see this, consider an iteration in which $z$ was not found; let $e = (x,y)$ be the response to the query. We write $S^+ = S \cap R(x,e)$ and $S^- = S \setminus S^+$. By definition, the edge $e$ lies on a shortest path from $x$ to $v'$ for all $v' \in S^+$, so that $d(y,v') = d(x,v') - \omega_e$ for all $v' \in S^+$. Furthermore, for all $v' \in S^-$, the shortest path from $y$ to $v'$ can be no longer than the shortest path going through $x$, so that $d(y,v') \le d(x,v') + \omega_e$ for all $v' \in S^-$. Thus, $\Phi_S(y) \le \Phi_S(x) - \omega_e\,(|S^+| - |S^-|)$. By minimality of $\Phi_S(x)$, it follows that $|S^+| \le |S^-|$, so $|S^+| \le \frac{|S|}{2}$; that is, each query at least halves the number of candidates. Consequently, the algorithm takes at most $\log N$ queries.
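The following Python sketch is a direct, unoptimized transcription of Algorithm 1. The feedback oracle `query` is an assumption of the sketch (it stands in for the user), the graph is a dictionary mapping each node to a list of `(neighbor, weight)` pairs, and recomputing all shortest paths in every iteration is wasteful but keeps the sketch short:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths; graph[v] is a list of (u, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue
        for u, w in graph[v]:
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def target_search(graph, query):
    """Algorithm 1: repeatedly query a 1-median of the candidate set S.

    query(x) is the (hypothetical) feedback oracle: it returns None if x
    is the target, and otherwise an edge (x, y) lying on a shortest path
    from x toward the target.
    """
    S = set(graph)
    while len(S) > 1:
        dists = {v: dijkstra(graph, v) for v in graph}
        x = min(graph, key=lambda v: sum(dists[v][u] for u in S))  # 1-median of S
        response = query(x)
        if response is None:
            return x
        _, y = response
        w = next(wt for u, wt in graph[x] if u == y)
        # Keep only candidates in R(x, e): nodes whose shortest path from x
        # starts with the edge e = (x, y).
        S = {u for u in S if dists[x][u] == w + dists[y][u]}
    return S.pop()
```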
4.1.1 Incorporation of Prior Knowledge

Algorithm 1, as well as Theorem 4.1, does not incorporate any prior knowledge about the target. In some cases, as will be discussed in Section 4.2, the target is known to belong to a subset $S_0 \subseteq V$. In these cases, one can replace Line 1 of Algorithm 1 with $S \leftarrow S_0$. In other words, the candidate set $S$ can be initialized to be the minimal set to which the target is known to belong. With the exact same analysis, we can prove the following theorem as a generalization of Theorem 4.1:

Theorem 4.2 There exists an adaptive querying strategy with the following property: given any undirected, connected and positively weighted graph $G = (V,E)$ with an unknown target vertex $z$ and a set $S_0 \subseteq V$ such that $z$ is guaranteed to be in $S_0$, the strategy will find $z$ using at most $\log N_0$ queries, where $N_0 = |S_0|$.

4.2 Applications

In Section 3.2, we named two paradigmatic applications of our framework. Here, we apply Theorem 4.1 to each of them.

4.2.1 Learning a Ranking

In the problem of learning a ranking, the search space $V$ is the set of all permutations over a set of $n$ elements. Hence $|V| = n!$. As discussed in Section 3.2.1, the Kendall-Tau metric satisfies the shortest-path property (see Observation 3.1). Therefore, Theorem 4.1 implies the following corollary.

Corollary 4.3 In the absence of noise, assuming the target is static, there is an interactive learning algorithm that finds the target in at most $\log n! \le n\log n$ queries.

4.2.2 Learning a Binary Classifier

Consider the problem of learning a binary classification over a set of $n$ items. The set of all structures, denoted by $V$, is the set of all binary classifiers over the $n$ items. Therefore, $|V| = 2^n$. If we apply Theorem 4.1 with no further assumptions, it implies that the target can be found in $\log 2^n = n$ rounds. However, this result is in fact trivial: it is easy to see that any algorithm that always queries a classifier that is consistent with all the feedback so far finds the target in no more than $n$ rounds.

Applying our result becomes more interesting if we assume the target belongs to a relatively small subset of $V$. For instance, it is natural in many applications to assume that the target belongs to a family of classifiers with small VC-dimension. (For background on VC-dimension, see Section 2.5.) In particular, if we assume that the target belongs to a set $S_0 \subseteq V$ with VC-dimension at most $c$, then, by Theorem 2.4, $|S_0| \le O(n^c)$. Using Theorem 4.2 (instead of Theorem 4.1), the following corollary is obtained.

Corollary 4.4 For the problem of interactive learning of a binary classifier, if the target is known to belong to $S_0$ whose VC-dimension is bounded by $c$, then our adaptive algorithm finds the target using at most $\log|S_0| \le c\log n + O(1)$ rounds.
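Both corollaries reduce to evaluating logarithms of very large search spaces. As a quick, purely illustrative check of Corollary 4.3's arithmetic, the exact bound $\log_2 n!$ can be computed without overflow via the log-gamma function:

```python
import math

def ranking_query_bound(n):
    """Exact upper bound of Corollary 4.3: log2(n!) queries for n items."""
    return math.lgamma(n + 1) / math.log(2)

# For n = 10: log2(10!) is about 21.8, noticeably below the cruder bound
# n*log2(n) = 33.2; both are logarithmic in the size n! of the search space.
```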
Chapter 5

Interactive Learning with Noisy Feedback

In Chapter 4, we discussed the simplest setup, in which all the query responses are correct. It is natural, however, to assume that the feedback is noisy, that is, it may sometimes be incorrect. In this chapter, we study noisy responses and present robust algorithms. This chapter is based on our work [Emamjomeh-Zadeh et al., 2016, Emamjomeh-Zadeh and Kempe, 2017].

5.1 Background and Preliminaries

5.1.1 Independent Noise Model

In this dissertation, we assume that whenever the responses are noisy, the noise follows the independent noise model (see Definition 3.4). As its name suggests, in the independent noise model, whether the response in one round is correct or not is independent of any other round. Importantly, the learner may query the same structure several times (in separate rounds) and receive different responses. Notice that we assume whether a response is correct or not is random and independent of other responses; however, every correct response as well as every incorrect response is chosen adversarially. In fact, our model of noise can be restated as follows: at the beginning of each round, a coin is flipped which comes up tails with probability $p$ and heads with the remaining probability.

• If the coin comes up tails, the response to the query can be anything, that is, it is adversarially incorrect. This of course includes the possibility that the response happens to be correct.
• If the coin comes up heads, the response is correct. However, if multiple correct responses exist, what is revealed to the learner is one correct answer chosen, possibly adversarially, among all the correct responses.

Throughout this dissertation, we always assume that $p$ is a constant and is known to the learner. In other words, similar to the graph itself, (an upper bound on) $p$ is assumed to be part of the nature of the problem itself.

Notice that when the responses are noisy, the learner can never be completely sure about the target after any finite number of rounds. In this chapter, we present an algorithm that finds the target with high probability using a logarithmic number of rounds.

5.1.2 Information-Theoretic Background

In this section, we use the notion of "capacity," which is defined in the context of information theory. We do not go into the details of this definition; for more background, see, for example, [Cover, 1999] (Chapter 1).

Definition 5.1 (Capacity of a Channel) Let $0 \le p \le 1$ be a probability. The capacity of a channel (or capacity, for short) with transmission probability $p$ is defined as

$C(p) = 1 - H(p) = 1 + p\log p + (1-p)\log(1-p)$.

Notice that $H(p) = -p\log p - (1-p)\log(1-p)$ is the standard definition of "entropy" in information theory.
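Definition 5.1 translates directly into code. The small sketch below computes $H(p)$ and $C(p)$ with base-2 logarithms, matching the convention used throughout this dissertation:

```python
import math

def entropy(p):
    """Binary entropy H(p) in bits, with the convention 0*log(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def capacity(p):
    """Channel capacity C(p) = 1 - H(p) from Definition 5.1."""
    return 1.0 - entropy(p)

# capacity(0.0) == 1.0 (noiseless) and capacity(0.5) == 0.0 (pure noise);
# for example, capacity(0.1) is roughly 0.531.
```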
5.1.3 Alternative Models of Noise

Throughout this dissertation, we only study the independent noise model (Definition 3.4), which is a standard model in many machine learning papers. This model is used, for example, by [Rényi, 1961] and [Ulam, 1991]. They pose a game between two parties: one party asks questions, and the other replies through a noisy channel (thus lying occasionally). Several models of noise are examined in detail in [Pelc, 2002]. Rivest et al. [Rivest et al., 1980] study a model in which the number of queries the adversary can answer incorrectly is a constant (or a function of the search space size), but independent of the total number of queries. Another model that is more relevant to ours is studied by [Pedrotti, 1999], among others. In this model, a $p$ fraction of the responses are incorrect, but the responses are not necessarily random. In other words, the incorrect responses may be distributed throughout the learning rounds in an adversarial (rather than random) way. Notice that this model is more adversarial than ours. In 2018, [Dereniowski et al., 2018] studied this generalized version of our problem and improved our main result of this chapter.

5.2 Small Noise Assumption

Consider the binary search problem in the special case outlined in Section 3.3.1: when $G$ is a simple line. As we discussed in Section 3.3.1, this special case of binary search is equivalent to the well-known classic binary search on sorted arrays. In this special case, each time the learner queries a node that is not the target, a truthful response is either "left" or "right" (depending on which side of the queried node the target lies on). Assume $p = \frac{1}{2}$, meaning that each query response is adversarially incorrect with probability $\frac{1}{2}$. Suppose that the adversary's strategy to generate incorrect responses is as follows: whenever the learner queries a node that is not the target, respond to the query with the incorrect direction. It is easy to see that in this scenario, each time the learner queries a non-target node, the response is "left" with probability $\frac{1}{2}$ and "right" with probability $\frac{1}{2}$. Therefore, even if the learner queries a non-target node $v$ many times, it learns absolutely no information about the target, except the fact that $v$ is unlikely to be the target. This implies that the learner requires at least a linear number of queries in order to find the target with "high" probability. Notice that $p > \frac{1}{2}$ gives even more power to the adversary, so the learning task cannot become any easier. Indeed, with a similar argument, one can show that if $p \ge \frac{2}{3}$, it is information-theoretically impossible to locate the target with probability more than $\frac{1}{N}$ using any finite number of queries. As we saw in Chapter 4, we are generally interested in logarithmic query complexity. For this reason, we always assume $p < \frac{1}{2}$.

Assumption 5.2 (Small Noise) Throughout this dissertation, we always assume that $p < \frac{1}{2}$.

5.3 Easier Algorithms for Special Cases

First, note that for the path (or, more generally, trees), a simple idea leads to an algorithm using $c\log N\log\log N$ queries (where $c$ is a constant depending on $p$): simulate the deterministic algorithm, but repeat each query $c\log\log N$ times and use the majority outcome. A straightforward analysis shows that with a sufficiently large $c$, this finds the target with high probability. For general graphs, however, this idea fails. The reason is that if there are multiple "correct" answers to the query, then there may be no majority among the answers, and the algorithm cannot infer which answers are correct.

The $c\log N\log\log N$ bound for a path is not information-theoretically optimal. A sequence of papers [Feige et al., 1994, Karp and Kleinberg, 2007, Ben-Or and Hassidim, 2008] improved the upper bound to $O(\log N)$ for simple paths. Optimization of the constant hidden in the big-$O$ notation is an interesting problem, which was solved by Ben-Or and Hassidim [Ben-Or and Hassidim, 2008]. They give an algorithm that finds the target with probability $1-\delta$, using $(1-\delta)\frac{\log N}{C(p)} + o(\log N) + O(\log(1/\delta))$ queries, where $C(p) = 1 - H(p) = 1 + p\log p + (1-p)\log(1-p)$. As pointed out in [Ben-Or and Hassidim, 2008], this bound is information-theoretically optimal up to the $o(\log N) + O(\log(1/\delta))$ term.
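The repetition-and-majority trick from the first paragraph of this section takes only a few lines. In this sketch, `query` is an assumed noisy oracle returning hashable answers; as noted above, the trick breaks on general graphs, where several distinct correct answers can split the vote:

```python
from collections import Counter

def majority_response(query, x, repetitions):
    """Repeat a noisy query and return the most frequent answer.

    With repetitions on the order of c*log(log(N)) for a large enough
    constant c, the majority answer on a tree is correct with high
    probability; query(x) must return hashable answers for Counter.
    """
    votes = Counter(query(x) for _ in range(repetitions))
    return votes.most_common(1)[0][0]
```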
5.4 Results

In this section, we extend the result of [Ben-Or and Hassidim, 2008] (mentioned in Section 5.3) to general connected undirected graphs. Similar to the multiplicative weights update algorithm presented in Section 2.1, at a high level, the idea of the algorithm is to keep track of each node's likelihood of being the target, based on the responses to queries so far. Generalizing the idea of iteratively querying a median node from Chapter 4, the algorithm now iteratively queries a weighted median, where the likelihoods are used as weights. Intuitively, this should ensure that the total weight of non-target nodes decreases exponentially faster than the target's weight. In this sense, the algorithm shares a lot of commonalities with the multiplicative weights update algorithm, an idea that is also prominent in [Karp and Kleinberg, 2007, Ben-Or and Hassidim, 2008], for example.

The high-level idea runs into difficulty when there is one node that accumulates at least half of the total weight. In this case, such a node will be the queried weighted median, and the algorithm cannot ensure a sufficient decrease in the total weight. On the other hand, such a high-likelihood node is an excellent candidate for the target. Therefore, our algorithm marks such nodes for "subsequent closer inspection," and then removes them from the multiplicative weights algorithm by setting their weight to 0. The key lemma shows that with high probability, within $\Theta(\log N)$ rounds, the target node has been marked. Therefore, the algorithm could now identify the target by querying each of the $O(\log N)$ marked nodes $\Theta(\log\log N)$ times, and then keeping the node that was identified as the target most frequently. However, because there could be up to $\Theta(\log N)$ marked nodes, this naive idea could still take $\Theta(\log N\log\log N)$ rounds. Instead, the algorithm runs a second phase of the multiplicative weights algorithm, starting with weights only on the nodes that had been previously marked. Akin to the first phase, we can show that with high probability, the target is among the nodes marked in the first $\Theta(\log\log N)$ rounds of the second phase. Among the at most $\Theta(\log\log N)$ resulting candidates, the target can now be identified with high probability by querying each node sufficiently frequently. Because there are only $\Theta(\log\log N)$ nodes remaining, this final phase takes only $O((\log\log N)^2)$ rounds.

Given a function $\mathcal{L} : V \to \mathbb{R}_{\ge 0}$, for every vertex $v$, we define $\Phi_{\mathcal{L}}(v) = \sum_{v'\in V} \mathcal{L}(v')\,d(v,v')$ as the weighted sum of distances from $v$ to the other vertices. The weighted median is then $\operatorname{argmin}_{v\in V}\Phi_{\mathcal{L}}(v)$ (ties broken arbitrarily). For any $S \subseteq V$, let $\mathcal{L}(S) = \sum_{v\in S}\mathcal{L}(v)$ be the sum of weights of the vertices in $S$. The following lemma, proved by exactly the same arguments as we used in Chapter 4, captures the key property of the weighted median.

Lemma 5.1 Let $v$ be a weighted median with respect to $\mathcal{L}$ and $e = (v,v')$ an edge incident on $v$. Then, $\mathcal{L}(R(v,e)) \le \mathcal{L}(V)/2$.

Proof. The proof of Lemma 5.1 is very similar to the argument made in the proof of our main result in Section 4.1. We write $S^+ = R(v,e)$ and $S^- = V\setminus S^+$. By definition, the edge $e$ lies on a shortest path from $v$ to $u$ for all $u \in S^+$, so that $d(v',u) = d(v,u) - \omega_e$ for all $u \in S^+$. Furthermore, for all $u \in S^-$, the shortest path from $v'$ to $u$ can be no longer than the shortest path going through $v$, so that $d(v',u) \le d(v,u) + \omega_e$ for all $u \in S^-$. Thus, $\Phi_{\mathcal{L}}(v') \le \Phi_{\mathcal{L}}(v) - \omega_e\left(\mathcal{L}(S^+) - \mathcal{L}(S^-)\right)$. By minimality of $\Phi_{\mathcal{L}}(v)$, it follows that $\mathcal{L}(S^+) \le \mathcal{L}(S^-)$, so $\mathcal{L}(S^+) \le \mathcal{L}(V)/2$.

We now specify the algorithm formally. Algorithm 2 encapsulates the multiplicative weights procedure, and Algorithm 3 shows how to combine the different pieces. Compared to the high-level outline above, the target bounds of $\Theta(\log N)$ and $\Theta(\log\log N)$ are modified somewhat in Algorithm 3, mostly to account for the case when $\delta$ is very small.

Algorithm 2 Multiplicative-Weights($S \subseteq V$, $T$)
1: $\mathcal{L}(v) \leftarrow 1/|S|$ for all vertices $v \in S$, and $\mathcal{L}(v) \leftarrow 0$ for all $v \in V\setminus S$.
2: $M \leftarrow \emptyset$.
3: for $T$ iterations do
4:   Let $x$ be a weighted median with respect to $\mathcal{L}$.
5:   if $\mathcal{L}(x) \ge \frac{1}{2}\mathcal{L}(V)$ then
6:     Mark $x$, by setting $M \leftarrow M \cup \{x\}$.
7:     $\mathcal{L}(x) \leftarrow 0$.
8:   else
9:     Query node $x$.
10:    for all nodes $v \in S$ do
11:      if $v$ is consistent with the response then
12:        $\mathcal{L}(v) \leftarrow (1-p)\,\mathcal{L}(v)$.
13:      else
14:        $\mathcal{L}(v) \leftarrow p\,\mathcal{L}(v)$.
15:      end if
16:    end for
17:  end if
18: end for
19: return $M$

Algorithm 3 Probabilistic Binary Search($\delta$)
1: $\delta' \leftarrow \delta/3$.
2: Fix $\varepsilon_1 = \min\left(\sqrt{\frac{1}{\log\log N}},\ \frac{C(p)}{2\log((1-p)/p)}\right)$.
3: $T_1 \leftarrow \max\left\{\frac{\log N}{C(p) - \varepsilon_1\log((1-p)/p)} + 1,\ \frac{\ln(1/\delta')}{2\varepsilon_1^2}\right\}$.
4: $S_1 \leftarrow$ Multiplicative-Weights($V$, $T_1$).
5: Fix $\varepsilon_2 = \frac{C(p)}{2\log((1-p)/p)}$.
6: $T_2 \leftarrow \max\left\{\frac{\log|S_1|}{C(p) - \varepsilon_2\log((1-p)/p)} + 1,\ \frac{\ln(1/\delta')}{2\varepsilon_2^2}\right\}$.
7: $S_2 \leftarrow$ Multiplicative-Weights($S_1$, $T_2$).
8: for all $v \in S_2$ do
9:   Query $v$ repeatedly $\frac{2\ln(|S_2|/\delta')}{(1-2p)^2}$ times.
10:  if $v$ is returned as the target for at least half of these queries then
11:    return $v$.
12:  end if
13: end for
14: return failure.
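A minimal Python sketch of Algorithm 2 follows. It omits Algorithm 3's careful parameter choices and the final verification phase, and it assumes a precomputed all-pairs distance table `dists` and a noisy oracle `query` as in the earlier sketch:

```python
def multiplicative_weights(graph, dists, S, T, p, query):
    """Sketch of Algorithm 2: returns the set M of marked candidate nodes.

    dists[v][u] = d(v, u) is precomputed; query(x) returns None if x is
    the target and otherwise an edge (x, y) toward it, each response
    being adversarially incorrect with probability p.
    """
    L = {v: (1.0 / len(S) if v in S else 0.0) for v in graph}
    marked = set()
    for _ in range(T):
        total = sum(L.values())
        # Weighted median: minimizes the likelihood-weighted distance sum.
        x = min(graph, key=lambda v: sum(L[u] * dists[v][u] for u in graph))
        if L[x] >= total / 2:
            marked.add(x)        # high-likelihood node: set aside, zero out
            L[x] = 0.0
            continue
        response = query(x)
        for v in graph:
            if consistent(v, x, response, dists, graph):
                L[v] *= 1 - p    # consistent nodes gain relative weight
            else:
                L[v] *= p
    return marked

def consistent(v, x, response, dists, graph):
    """Is node v consistent with the response to querying x?"""
    if response is None:         # "x is the target": only x is consistent
        return v == x
    _, y = response
    w = next(wt for u, wt in graph[x] if u == y)
    return dists[x][v] == w + dists[y][v]    # membership in R(x, (x, y))
```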
To analyze the performance of Algorithm 2, we keep track of the sum of the node weights and show, by induction, that after $i$ rounds/iterations, $\mathcal{L}(V) \le 2^{-i}$, deterministically. Consider the $i$-th iteration:

• If there exists a node in $V$ whose weight is at least half of the total node weight, then Multiplicative-Weights sets its weight to 0. Therefore, in this case, the sum of node weights drops to at most half of its previous value.

• If Multiplicative-Weights queries a median $x$, and is told that $x$ is the target, then $x$'s weight decreases by a factor of $1-p$, and the other nodes' weights decrease by a factor of $p$. The new sum of node weights is
$(1-p)\,\mathcal{L}(x) + p\left(\mathcal{L}(V) - \mathcal{L}(x)\right) \le \mathcal{L}(V)/2$,
because $\mathcal{L}(x) \le \mathcal{L}(V)/2$.

• If Multiplicative-Weights queries a median $x$, and the response is an edge $e = (x,v)$, the algorithm multiplies the weights of nodes in $R(x,e)$ by $1-p$ and the weights of all other nodes by $p$. The new total weight is
$(1-p)\,\mathcal{L}(R(x,e)) + p\left(\mathcal{L}(V) - \mathcal{L}(R(x,e))\right)$,
which is at most $\mathcal{L}(V)/2$, because Lemma 5.1 implies that $\mathcal{L}(R(x,e)) \le \mathcal{L}(V)/2$.

The key lemma for the analysis (Lemma 5.2) shows that with high probability, the target is in the set returned by the Multiplicative-Weights algorithm. Recall that $C(p)$ is defined as $1 - H(p) = 1 + p\log p + (1-p)\log(1-p)$.

Lemma 5.2 Let $S \subseteq V$ be a subset of vertices containing the target, and let $0 < \varepsilon < \frac{C(p)}{\log((1-p)/p)}$. If $T \ge \frac{\ln(1/\delta')}{2\varepsilon^2}$ and $T > \frac{\log|S|}{C(p) - \varepsilon\log((1-p)/p)}$, then with probability at least $1-\delta'$, the target is in the set returned by Multiplicative-Weights($S$, $T$).

Proof. By a standard Hoeffding Bound (Theorem 2.3), the probability that fewer than $(1-p-\varepsilon)T$ queries are answered correctly is at most $e^{-2T\varepsilon^2}$. By the first bound on $T$ in Lemma 5.2, this probability is at most $\delta'$. If we assume that the target $t$ is not marked by the end of the Multiplicative-Weights algorithm, its final weight is at least $\mathcal{L}(t) \ge \frac{\left((1-p)^{1-p-\varepsilon}\,p^{\,p+\varepsilon}\right)^T}{|S|}$, with probability at least $1-\delta'$. On the other hand, we showed that the final sum of node weights $\Lambda$ is deterministically upper-bounded by $2^{-T}$. Using both bounds, we obtain that

$\log(\mathcal{L}(t)/\Lambda) \ge T\log\left(2\,(1-p)^{1-p-\varepsilon}\,p^{\,p+\varepsilon}\right) - \log|S| = T\left(1 - H(p) - \varepsilon\log\tfrac{1-p}{p}\right) - \log|S| = T\left(C(p) - \varepsilon\log\tfrac{1-p}{p}\right) - \log|S| > 0$,

where the final inequality follows from the second assumed bound on $T$ in the lemma. This implies that $\mathcal{L}(t) > \Lambda$, a contradiction (a single node's weight cannot exceed the total weight).

The next lemma shows that the final phase of testing each node finds the target with high probability.

Lemma 5.3 Assuming that the target is in $S$, if each node $v \in S$ is queried at least $T \ge \frac{2\ln(|S|/\delta')}{(1-2p)^2}$ times, then with probability at least $1-\delta'$, the true target is the unique node returned as the "target" by at least half of its queries.

Proof. We prove Lemma 5.3 using standard tail bounds and a union bound. Let $\varepsilon = \frac{1}{2} - p$. As in the proof of Lemma 5.2, for each queried node, a Hoeffding Bound (Theorem 2.3 in Section 2.4) establishes that the probability that at most $(1-p-\varepsilon)T = \frac{T}{2}$ queries are correctly answered is at most $e^{-2T\varepsilon^2}$. Because of the lower bound on $T$, this probability is upper-bounded by $\delta'/|S|$. By a union bound, the probability that any queried vertex has at least half of its queries answered incorrectly is at most $\delta'$.
Barring this event, the target vertex is confirmed as such more than half the time, while none of the non-target vertices is confirmed as the target at least half the time.

Combining these lemmas, we prove the following:

Lemma 5.4 Given $\delta > 0$, Algorithm 3 finds the target with probability at least $1-\delta$, using no more than $\frac{\log N}{C(p)} + o(\log N) + O(\log^2(1/\delta))$ queries.

Proof. Because $\varepsilon_1$ and $T_1$ satisfy both bounds of Lemma 5.2 by definition, Lemma 5.2 can be applied, implying that with probability at least $1-\delta'$, the set $S_1$ contains the target. Note that the maximum in the definition of $T_1$ is taken over two terms. The first one is $\frac{\log N}{C(p)} + o(\log N)$, because $\varepsilon_1 = o(1)$. The second one is $O(\log\log N \cdot \log(1/\delta')) = O((\log\log N)^2) + O(\log^2(1/\delta'))$. Therefore, $|S_1|$ (which is at most $T_1$, because at most one node is marked per round) is bounded by $\frac{\log N}{C(p)} + o(\log N) + O(\log^2(1/\delta'))$.

For the second invocation of Multiplicative-Weights, the conditions of Lemma 5.2 are again satisfied by definition, so $S_2$ will contain the target with probability at least $1-2\delta'$. The maximum in the definition of $T_2$ is again taken over two terms. The first one is $O(\log\log N + \log\log(1/\delta'))$ (by the bound on $|S_1|$), and the second one is $O(\log(1/\delta'))$. Finally, by Lemma 5.3, the target will be returned with probability at least $1 - 3\delta' = 1 - \delta$. The final phase again only makes $o(\log N) + O(\log^2(1/\delta))$ queries, giving us the claimed bound on the total number of queries.

To obtain the bound $(1-\delta)\frac{\log N}{C(p)} + o(\log N) + O(\log^2(1/\delta))$, we can reason as follows:

• If $\delta < \frac{1}{\log N}$, then $\frac{\log N}{C(p)} = (1-\delta)\frac{\log N}{C(p)} + \delta\,\frac{\log N}{C(p)}$, and $\delta\,\frac{\log N}{C(p)} = O(1)$ can be absorbed into the $o(\log N)$ term.

• If $\delta \ge \frac{1}{\log N}$, we modify the algorithm using a trick from [Ben-Or and Hassidim, 2008]: with probability $\delta - \frac{1}{\log N}$, the modified algorithm outputs an arbitrary node without any queries; otherwise, it runs Algorithm 3 with error parameter $\hat\delta = \frac{1}{\log N}$. The success probability is then at least $\left(1-\delta+\frac{1}{\log N}\right)\left(1-\frac{1}{\log N}\right) \ge 1-\delta$, while the number of queries in expectation is
$\left(1-\delta+\frac{1}{\log N}\right)\left(\frac{\log N}{C(p)} + o(\log N) + O(\log^2(1/\delta))\right) = (1-\delta)\frac{\log N}{C(p)} + o(\log N) + O(\log^2(1/\delta))$.

Hence, we have proved our main theorem:

Theorem 5.5 ([Emamjomeh-Zadeh et al., 2016], Theorem 8) There exists an algorithm with the following property: given a connected undirected graph and a positive $\delta > 0$, the algorithm finds the target with probability at least $1-\delta$, using no more than $(1-\delta)\frac{\log N}{C(p)} + o(\log N) + O(\log^2(1/\delta))$ queries in expectation.

Notice that except for having $O(\log^2(1/\delta))$ instead of $O(\log(1/\delta))$, this bound matches the one obtained by Ben-Or and Hassidim [Ben-Or and Hassidim, 2008] for the line.

5.4.1 Incorporation of Prior Knowledge

Similar to Section 4.1.1, we can generalize our main theorem to the case when the target is known to belong to a set $S_0 \subseteq V$. In this case, $N$ in Theorem 5.5 is replaced with $|S_0|$.

Notice that throughout this dissertation, we assume that the target is chosen adversarially. Another way to incorporate prior knowledge is to assume that the target is drawn randomly from a known distribution $\mathcal{D}^*$ rather than being adversarial. For this case, we assume that $p = 0$, that is, the responses are noiseless. In this scenario, we can design a learning algorithm as follows. Initially, the learner sets the node likelihoods to be the probabilities in $\mathcal{D}^*$. Then, the learner runs the multiplicative weights update algorithm. Notice that because $p = 0$, once a node $v$ is inconsistent with the response in one round, its likelihood drops to 0. The learner keeps running the multiplicative weights update algorithm until only one node has non-zero weight. Such a node must be the target. Using an analysis similar to our proofs in Chapter 4 and this chapter, one can prove Theorem 5.6.

Theorem 5.6 If the target is drawn from a distribution $\mathcal{D}^*$ known to the learner, then the algorithm explained above finds the target in $H[\mathcal{D}^*]$ rounds, in expectation (where $H$ denotes the entropy).

In the real world, it is more reasonable to assume that the learner does not know $\mathcal{D}^*$, but knows a distribution $\mathcal{D}$ which is "close" to $\mathcal{D}^*$. In this scenario, the learner initializes the likelihoods of the nodes using $\mathcal{D}$ and runs the multiplicative weights update algorithm until only one node has non-zero likelihood. Theorem 5.7 states the number of rounds this algorithm needs to find the target.

Theorem 5.7 If the target is drawn from an unknown distribution $\mathcal{D}^*$ and the learner initializes the node likelihoods according to another distribution $\mathcal{D}$, then the algorithm finds the target in $D_{KL}(\mathcal{D}^*\|\mathcal{D}) + H[\mathcal{D}^*]$ rounds, in expectation (where $D_{KL}(\cdot\|\cdot)$ denotes the KL-divergence).
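The bounds of Theorems 5.6 and 5.7 are easy to evaluate numerically. In the sketch below, dictionaries serve as stand-ins for the two distributions; with a uniform prior that the learner knows exactly, the bound collapses to $\log_2 N$, matching Theorem 4.1:

```python
import math

def expected_rounds(true_prior, assumed_prior):
    """Bound of Theorem 5.7: D_KL(D* || D) + H[D*], in base-2 logarithms.

    true_prior and assumed_prior map nodes to probabilities (illustrative
    dictionaries). When the learner knows the prior exactly, the KL term
    vanishes and the bound collapses to H[D*] (Theorem 5.6).
    """
    kl = sum(q * math.log2(q / assumed_prior[v])
             for v, q in true_prior.items() if q > 0)
    h = -sum(q * math.log2(q) for q in true_prior.values() if q > 0)
    return kl + h

# Uniform prior over N nodes, known exactly: the bound equals log2(N).
```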
Using an analysis similar to our proofs in Chapter 4 and this chapter, one can prove Theorem 5.6. Theorem 5.6 If the target is drawn from a distributionD known to the learner, then the algorithm explained above nds the target inH[D ] rounds, in expectation (whereH denotes the entropy). In real world, it is more reasonable to assume that the learner does not knowD , but she knows a distributionD which is \close" toD . In this scenario, the learner initializes the likelihood of the nodes usingD and runs the multiplicative weight update algorithm until only one node has non-zero likelihood. Theorem 5.7 states number of the rounds this algorithm needs to nd the target. Theorem 5.7 If the target is drawn from an unkonw distributionD and the learner initializes the node likelihoods according to another distribution D, then the algorithm nds the target in D KL (D jjD) +H[D ] rounds, in expectation (where D KL (jj) denotes the KL-divergence). 5.4.2 Hedge Algorithm, Similarities and Dissimilarities In each iteration of Algorithm 3, the weight of each inconsistent node is multiplied by p while the weight of each consistent node is multiplied by 1p. Equivalently, one can multiply the weight of each inconsistent node by = p 1p and leave other weights untouched. It is easy to see this modication preserves the ratio of the weights throughout the algorithm and hence does not change its behavior. With this modication, our algorithm looks similar to the Hedge algorithm: the weight of each nodev at roundt is `t(v) where` t (v) denotes the number of rounds the node has been inconsistent with the feedback. In other words, our algorithm falls in the category of multiplicative weight update 48 algorithms. On the other hand, it is fundamentally dierent from the Hedge algorithm because it is deterministic. Hedge algorithm normalizes the weights and treats them as probabilities to randomly draw an expert. Our algorithm, however, simply uses the weights to compute a median of the search space. 5.5 Applications Similar to Section 4.2, here we can apply Theorem 5.5 to each of our two paradigmatic applications, namely, interactive learning of a ranking and interactive learning of a binary classier. Corollary 5.8 For interactive learning of a ranking, in the presence of noise, there is an algorithm that discovers the target with probability at least 1 using at most n logn C(p) +o(n logn) +O( 1 2 ) queries in expectation. Corollary 5.9 For interactive learning of a ranking, in the presence of noise, there is an algorithm that discovers the target with probability at least 1 using at most logjS 0 j C(p) +o(logjS 0 j) +O( 1 2 ) queries in expectation where S 0 is the prior set to which the target is known to belong. As we move forward in this dissertation, the generic results that we prove for our framework can similarly be applied to these two problems. We will not state the application corollaries explicitly in future chapters. 49 Chapter 6 Computational Considerations In this chapter, we discuss computational complexity of the learning algorithms presented in Chap- ters 4 and 5. This result is based on our work [Emamjomeh-Zadeh and Kempe, 2017]. 6.1 Background The algorithm of Theorem 4.1 (Algorithm 1) for the noiseless setting keeps track of a set of potential targets, that is, the set of nodes that are consistent with the answers from all queries received so far. In each round, the algorithm computes a median of this set. 
The naive implementation of this algorithm requires additional memory that is linear in the number of nodes of the graph. More importantly, the time complexity of computing each query is polynomial in the size of the graph. Similarly, the algorithm of Theorem 5.5 (Algorithms 2 and 3) keeps track of the likelihood of each node, and in each round computes a weighted median with respect to the likelihoods. Therefore, its naive implementation requires linear memory and polynomial-time computation.

In the natural machine learning examples of our interactive learning model, the search space (that is, the number of nodes of $G$) is exponentially large in the natural parameters of the problem. For instance, in the problem of ranking $n$ items, the search space is the set of all rankings over the $n$ items; hence, its size is $n!$. Also, in the binary classification example, the size of the search space is $2^n$, where $n$ is the number of items. For this reason, we are interested in a learning algorithm whose running time is logarithmic in the size of the search space. In this section, we discuss a particular assumption under which this goal is achievable.

6.2 Re-implementation of the Learning Algorithms Using Sampling

In the noiseless setting, the algorithm needs to compute an unweighted median of the candidate set. In the noisy setting, the computational bottleneck of our learning algorithm is the computation of a weighted median with respect to the likelihoods of the nodes. Notice that Algorithm 1 is in fact a special case of the multiplicative weights update algorithm (Algorithm 2) with $p = 0$: if $p = 0$, then in each round the likelihood of every node that is inconsistent with the feedback is zeroed out, while the other likelihoods are multiplied by $1-p = 1$, that is, they remain unchanged. Therefore, to make our main learning algorithms computationally efficient, it is essentially enough to implement Algorithm 2 efficiently.

In this section, we show that in some examples, one can compute the medians efficiently by exploiting natural structures of the underlying problem. Indeed, Theorem 6.1 states that if one can draw independent random nodes from the graph in such a way that the probability of each node is "almost" proportional to its likelihood, then a median can be computed efficiently with high probability.

Theorem 6.1 ([Emamjomeh-Zadeh and Kempe, 2017], Theorem 2.6) Let $n$ be a natural measure of the input size, and assume that $\log N$ is polynomial in $n$. Assume that $G = (V,E)$ is undirected, all edge weights are integers, and the maximum degree and diameter (both with respect to the edge weights) are bounded by $\mathrm{poly}(n)$. Also assume w.l.o.g. that $\mathcal{L}$ is normalized to be a distribution over the nodes. (For Algorithm 1, $\mathcal{L}$ is uniform over all nodes consistent with all feedback up to that point.) Let $0 < \varepsilon \le \frac{1}{4}$ be a constant, and assume that there is an oracle that, in every round of the algorithm, runs in time polynomial in $n$ and returns a node $v$ drawn from a distribution $\mathcal{L}'$ with $d_{TV}(\mathcal{L},\mathcal{L}') \le \varepsilon$. Also assume that there is a polynomial-time algorithm that, given a node $v$, decides whether or not $v$ is consistent with any given query response. Then, for every $\gamma > 0$, in time $\mathrm{poly}(n, \frac{1}{\gamma})$, an algorithm can find, with high probability, a node $v$ with $\mathcal{L}(R(v,e)) \le \frac{1}{2} + 2\varepsilon + \gamma$ for every edge $e$ incident on $v$. Therefore, Algorithms 1 and 3 (for a sufficiently small $\gamma$) can be implemented to run in time $\mathrm{poly}(n, \frac{1}{\gamma})$.

Proof. The high-level idea is to use a simple local search algorithm to find a node $v$ with small $\Phi_{\mathcal{L}}(v)$. In order to execute the updating step, the algorithm needs to estimate $\mathcal{L}(R(v,v'))$ for all neighbors $v'$ of $v$ in $G$.
The key insight here is that $\mathcal{L}(R(v,v'))$ is exactly the probability that a node drawn from $\mathcal{L}$ is consistent with the feedback $v'$ when $v$ is queried. To get a sharp enough estimate of $\mathcal{L}(R(v,v'))$, in each iteration, enough samples are drawn to ensure that tail bounds kick in and provide high-probability guarantees. The high-level algorithm is given as Algorithm 4; details (in particular on Line 5) are provided below.

Algorithm 4 Finding a good approximate median($\varepsilon$, $\gamma$)
1: Let $\varepsilon' = \gamma/3$.
2: Let $v$ be an arbitrary node.
3: loop
4:   for each edge $e = (v,v')$ do
5:     Let $P_{v,v'}$ be the empirical estimate of $\mathcal{L}(R(v,v'))$.
6:   end for
7:   if $P_{v,v'} \le \frac{1}{2} + \varepsilon + 2\varepsilon'$ for every edge $e = (v,v')$ then
8:     return $v$.
9:   else
10:    Let $e = (v,v')$ be an edge out of $v$ with $P_{v,v'} > \frac{1}{2} + \varepsilon + 2\varepsilon'$.
11:    Set $v \leftarrow v'$.
12:  end if
13: end loop

In Line 5 of Algorithm 4, $P_{v,v'}$ is estimated as the fraction of samples $\hat v$ drawn from $\mathcal{L}'$ that are in $R(v,v')$. Below, for a particular desired high-probability guarantee, we choose the number of samples to guarantee that

$|P_{v,v'} - \mathcal{L}(R(v,v'))| \le \varepsilon + \varepsilon'$   (6.1)

with high probability. For most of the proof, we will assume that all of these high-probability events occurred; at the end, we will calculate the probability of this happening.

Whenever an edge $e = (v,v')$ has $P_{v,v'} \le \frac{1}{2} + \varepsilon + 2\varepsilon'$, we can bound $\mathcal{L}(R(v,v')) \le \frac{1}{2} + 2\varepsilon + 3\varepsilon' = \frac{1}{2} + 2\varepsilon + \gamma$. Thus, whenever Algorithm 4 returns a node $v$, the node satisfies the guarantee of Theorem 6.1. It remains to show that the algorithm terminates, and does so within a polynomial number of iterations. Define $\Phi_{\mathcal{L}}(v) = \sum_{v'}\mathcal{L}(v')\,d(v,v')$ to be the weighted distance (with respect to the edge weights $\omega_e$) from $v$ to all other nodes. When Algorithm 4 switches from $v$ to $v'$ in Line 11, by Inequality (6.1), we have $\mathcal{L}(R(v,v')) \ge \frac{1}{2} + \varepsilon + 2\varepsilon' - (\varepsilon + \varepsilon') = \frac{1}{2} + \varepsilon'$, so $\Phi_{\mathcal{L}}(v') \le \Phi_{\mathcal{L}}(v) - 2\varepsilon'\,\omega_e$. Because the value of $\Phi_{\mathcal{L}}(v)$ for all $v$ is bounded by the diameter of $G$, which was assumed to be bounded by $\mathrm{poly}(n)$, and the edge weights are all positive integers, Algorithm 4 terminates after $\mathrm{poly}(n, 1/\gamma)$ steps.

The only remaining part is to compute how many samples from the oracle are required to guarantee Inequality (6.1) with high enough probability. Drawing $r$ samples, by standard tail bounds, Inequality (6.1) fails with probability no more than $e^{-\Omega(\varepsilon'^2 r)}$. Because all degrees are bounded by $\mathrm{poly}(n)$, Line 5 of Algorithm 4 is executed only $\mathrm{poly}(n, 1/\gamma)$ times, so it suffices to draw $r = \mathrm{poly}(\frac{1}{\gamma}, \log n)$ samples in each iteration in order to apply a union bound over all iterations of the algorithm.
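A compact Python sketch of Algorithm 4 follows. Here, `sample()` stands in for the approximate-sampling oracle assumed by Theorem 6.1, the consistency test uses a precomputed distance table, and the sample size `num_samples` is left as a parameter rather than derived from the tail bound:

```python
def approximate_median(graph, dists, sample, epsilon, gamma, num_samples):
    """A sketch of Algorithm 4 under the assumptions of Theorem 6.1.

    sample() draws a node approximately from the likelihood distribution
    (total variation distance at most epsilon); dists[v][u] = d(v, u) is
    assumed precomputed; gamma controls the slack, with eps' = gamma/3.
    """
    eps_prime = gamma / 3.0
    v = next(iter(graph))                       # start at an arbitrary node
    while True:
        moved = False
        for y, w in graph[v]:                   # inspect each edge e = (v, y)
            draws = [sample() for _ in range(num_samples)]
            # Empirical estimate of L(R(v, y)): the fraction of sampled
            # nodes consistent with the response (v, y).
            P = sum(dists[v][u] == w + dists[y][u] for u in draws) / num_samples
            if P > 0.5 + epsilon + 2 * eps_prime:
                v = y                           # provably better neighbor: move
                moved = True
                break
        if not moved:
            return v   # every edge now cuts off at most ~half of the weight
```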
6.3 Learning of a Ranking with Sampling

In this section, we show how Theorem 6.1 can be used for interactive learning of a ranking in the noiseless setting. This is indeed the only example we have at the moment where we can apply Theorem 6.1 to make our interactive learning algorithm computationally efficient. Whether this theorem is applicable in other scenarios remains open.

When $p = 0$, i.e., the algorithm only receives correct feedback, it can be made computationally efficient using Theorem 6.1. To apply Theorem 6.1, it suffices to show that one can efficiently sample a (nearly) uniformly random permutation consistent with all feedback received so far. Since the feedback is assumed to be correct, the set of all pairs $(i,j)$ such that the user implied that element $i$ must precede element $j$ must be acyclic, and thus must form a partial order. The sampling problem is thus exactly the problem of sampling a linear extension of a given partial order. This is a well-known problem, and a beautiful result of Bubley and Dyer [Bubley and Dyer, 1999, Bubley, 2001] shows that the Karzanov-Khachiyan Markov chain [Karzanov and Khachiyan, 1991] mixes rapidly. Huber [Huber, 2006] shows how to modify the Markov chain sampling technique to obtain an exactly (instead of approximately) uniformly random linear extension of the given partial order. For the purpose of our interactive learning algorithm, the sampling results can be summarized as follows:

Theorem 6.2 (Huber [Huber, 2006]) Given a partial order over $n$ elements, let $\mathcal{L}$ be the set of all linear extensions, i.e., the set of all rankings consistent with the partial order. There is an algorithm that runs in expected time $O(n^3\log n)$ and returns a uniformly random sample from $\mathcal{L}$.

The maximum node degree in $G$ is $O(n^2)$, and the diameter of $G$ is $O(n^2)$. Substituting these bounds and the bound from Theorem 6.2 into Theorem 6.1, we obtain the following corollary:

Corollary 6.3 If all feedback is always correct, there is a computationally efficient interactive learning algorithm using at most $\log n! \le n\log n$ equivalence queries to find the target ordering.
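For intuition, the adjacent-transposition chain underlying these results takes only a few lines of Python. This is a rough sketch, not Huber's exact sampler: the starting extension comes from a naive topological sort, `precedes` encodes the partial order, and the number of steps is left as a parameter (the analyses cited above pin down the mixing time, whose constants differ slightly between chain variants):

```python
import random

def sample_linear_extension(items, precedes, steps, rng=random):
    """Approximately uniform linear extension via lazy adjacent transpositions.

    precedes(i, j) is True when the partial order forces i before j.
    Starting from any valid extension, each step picks a random adjacent
    pair and, if the partial order allows, swaps it with probability 1/2.
    """
    order = topological_sort(items, precedes)   # any valid starting extension
    n = len(order)
    for _ in range(steps):
        i = rng.randrange(n - 1)
        a, b = order[i], order[i + 1]
        if not precedes(a, b) and rng.random() < 0.5:   # lazy chain
            order[i], order[i + 1] = b, a
    return order

def topological_sort(items, precedes):
    """Naive quadratic topological sort of the partial order."""
    remaining, order = list(items), []
    while remaining:
        head = next(x for x in remaining
                    if not any(precedes(y, x) for y in remaining if y != x))
        remaining.remove(head)
        order.append(head)
    return order
```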
The situation is significantly more challenging when feedback could be incorrect, i.e., when $p > 0$. In this case, the user's feedback is not always consistent and may not form a partial order. In fact, we prove the following hardness result.

Theorem 6.4 ([Emamjomeh-Zadeh and Kempe, 2017], Theorem 3.8) There exists a $p$ (depending on $n$) for which the following holds. Given a set of user responses, let $\mathcal{L}(\pi)$ be the likelihood of $\pi$ given the responses, normalized so that $\sum_\pi\mathcal{L}(\pi) = 1$. Let $0 < \varepsilon < 1$ be any constant. There is no polynomial-time algorithm to draw a sample from a distribution $\mathcal{L}'$ with $d_{TV}(\mathcal{L},\mathcal{L}') \le 1-\varepsilon$, unless RP = NP.

Proof. We prove this theorem using a reduction from Minimum Feedback Arc Set, a well-known NP-complete problem [Karp, 1972]. Given a directed graph $G'$ and a number $k$, the Minimum Feedback Arc Set problem asks if there is a set of at most $k$ arcs of $G'$ whose removal will leave the remaining graph acyclic. This is equivalent to asking if there is a permutation of the nodes of $G'$ such that at most $k$ arcs go from higher-numbered nodes to lower-numbered ones.

Given $\varepsilon$, a graph $G'$ with $n$ nodes and $m$ edges, and $k$, we define the following sampling problem. Consider sampling from rankings of $n$ elements, let $p = \frac{1}{2(n+1)!}$, and let the $m$ user responses be exactly the (directed) edges of $G'$. For any permutation $\pi$, let $a_\pi$ be the number of queries that $\pi$ agrees with, and $b_\pi$ the number of queries that $\pi$ disagrees with. Then, for all $\pi$, $a_\pi + b_\pi = m$, and the (unnormalized) likelihood of $\pi$ is $L(\pi) = (1-p)^{a_\pi}\,p^{b_\pi}$. Let $b^* = \min_\pi b_\pi$, and let $\Pi^* = \{\pi \mid b_\pi = b^*\}$ be the set of all permutations minimizing $b_\pi$. Then, for any permutation $\pi \in \Pi^*$, we have

$L(\pi) = \left(1 - \frac{1}{2(n+1)!}\right)^{m-b^*}\left(\frac{1}{2(n+1)!}\right)^{b^*} =: L^*$.

On the other hand, for any permutation $\pi' \notin \Pi^*$, we get that

$L(\pi') \le \left(1 - \frac{1}{2(n+1)!}\right)^{m-b^*-1}\left(\frac{1}{2(n+1)!}\right)^{b^*+1} = L^*\cdot\frac{1/(2(n+1)!)}{1 - 1/(2(n+1)!)} \le \frac{L^*}{(n+1)!}$.

Thus, under the normalized likelihood distribution $\mathcal{L}$, the total sampling probability of all permutations $\pi \in \Pi^*$ must be

$\sum_{\pi\in\Pi^*}\mathcal{L}(\pi) = \frac{\sum_{\pi\in\Pi^*}L(\pi)}{\sum_{\pi'}L(\pi')} = \frac{L^*\,|\Pi^*|}{L^*\,|\Pi^*| + \sum_{\pi'\notin\Pi^*}L(\pi')} \ge \frac{L^*\,|\Pi^*|}{L^*\,|\Pi^*| + n!\,\frac{L^*}{(n+1)!}} \ge \frac{|\Pi^*|}{|\Pi^*| + 1/(n+1)} \ge 1 - \frac{1}{n+1}$.

If $\mathcal{L}'$ has total variation distance at most $1-\varepsilon$ from $\mathcal{L}$, it must satisfy $\sum_{\pi\in\Pi^*}\mathcal{L}'(\pi) \ge \sum_{\pi\in\Pi^*}\mathcal{L}(\pi) - (1-\varepsilon) \ge \varepsilon - \frac{1}{n+1}$. In particular, it must sample a permutation $\pi \in \Pi^*$ with constant probability $\varepsilon - \frac{1}{n+1}$.

A randomized algorithm can now simply sample $O(\log n)$ permutations according to $\mathcal{L}'$. If one of these permutations, applied to the nodes of $G'$, has at most $k$ edges going from higher-numbered to lower-numbered nodes, it constitutes a feedback arc set of at most $k$ edges, and the algorithm can correctly answer "Yes" to the Minimum Feedback Arc Set instance. When the algorithm sees no $\pi$ with fewer than $k+1$ edges going from higher-numbered to lower-numbered nodes, it answers "No." This answer may be incorrect. But notice that if it is incorrect, the Minimum Feedback Arc Set instance must have had a feedback arc set of at most $k$ edges, and the randomized algorithm would have sampled at least one corresponding permutation with high probability. Thus, when the algorithm answers "No," it is correct with high probability. Thus, we have an RP algorithm for Minimum Feedback Arc Set under the assumption of an efficient approximate sampling oracle.

It should be noted that the value of $p$ in the reduction is exponentially close to 0. In this range, incorrect feedback is so unlikely that with high probability, the algorithm will always see a partial order. It might then still be able to sample efficiently. On the other hand, for larger values of $p$ (e.g., constant $p$), sampling approximately from the likelihood distribution might be possible via a metropolized Karzanov-Khachiyan chain or a different approach. This problem is still open.

Chapter 7

Dynamic Target

In this chapter, we study our interactive learning framework in a more general setting in which the target may change throughout the learning process. This chapter is based on our work [Emamjomeh-Zadeh et al., 2020].

7.1 Background

So far in this dissertation, we have assumed that the target is static, that is, the target is the same structure throughout the entire learning process. In several real-world applications, however, this assumption is not realistic. For example, recall the example of recommendation systems: the optimal ordering of a set of items from the users' perspective can evolve over time. More importantly, responses in different rounds may be provided by different users and hence refer to different targets.

In the context of learning a ranking when the target is dynamic, an active learning model has been studied by [Anagnostopoulos et al., 2011, Vial et al., 2018, Besa et al., 2018]. In their models, the learner actively chooses pairs of items to compare.
Notice that in both models, there is a parameter that quanties how much the target can move. 7.2.1 The Shifting Target Model The shifting target model aims to model scenarios where the learner interacts with a \small" population of users (or a population with only a small number of types of users), each of whom has a possibly dierent structure. The learner does not know between consecutive rounds whether it is still interacting with the same user or a dierent one. More formally, there exists an unknown set TV of (known) size at most , such thatz t 2T for every 1tT . In other words, the target only moves within the set T . 59 A model very similar to our shifting target model was considered by [Deligkas et al., 2017]. In their model, however, a target is chosen randomly from a small pool with respect to a xed distribution. They show that if one node is more likely to be the target than all other nodes combined, one can identify this specic target in a logarithmic number of rounds. Their focus is on learning the identity of one target (or more targets, under extra assumptions), rather than minimizing the number of mistakes over a long sequence. We analyze our algorithms in the worst case. 7.2.2 The Drifting Target Model The drifting target model models scenarios where there is only one user (or the users share the same structure), but the structure slowly evolves over time. As an example, think about a ranking capturing a user's music preferences or preferred order of search results; either is wont to evolve over time. We model the slow evolution as follows. There exists a known directed unweighted \evolution graph" G 0 = (V;E 0 ) (on the same set of nodes V ), such that for every 1 < t T , z t 2fz t1 g[ N out G 0 (z t1 ). In other words, the target can only move along the edges of G 0 . G 0 need not be connected. We assume that G 0 , like G, is known to the learner; however, we explicitly allow for G 0 6=G. We write = max v2V jfN out G 0 (v)g[fvgj for the maximum out-degree of any node in G 0 (implicitly assuming that every node has at least a self-loop). 7.3 A Generic Mistake Bound In this section, we present a generic result that is applicable to both the shifting target and drifting target models, as well as more general models of moving targets. The generic learning algorithm, 60 however, is not computationally ecient. In later sections, we show how the more specic shifting target and drifting target models facilitate ecient implementations of the algorithm. Let G = (V;E;!) be an undirected connected and weighted graph and a =hz 1 ;:::;z T i the unknown true sequence of targets throughout the T rounds. Recall that p denotes the probability that each response is incorrect. Theorem 7.1 ([Emamjomeh-Zadeh et al., 2020], Theorem 5) Let A =V T be the set of all node sequences of length T . Let : A! R 0 be a function that assigns non-negative weights to these sequences, such that P a2A (a) 1. There is an online learning algorithm which makes at most 1 1H[p] log 1 (a ) mistakes in expec- tation. Each sequence a2 A can be considered as \recommending" a node to query in each of the T rounds. Inspired by the classic Hedge algorithm, we consider each sequence as a \meta-expert" and keep track of a weight for each. However, in contrast to the standard Hedge analysis, simply drawing a random meta-expert in every round according to their weights and outputting its recommendation does not guarantee the bound of Theorem 7.1. 
Instead, we utilize the "directional" information in $G$ provided by the shortest-path pointers which the algorithm receives. We adapt an idea from Chapter 5 to guarantee that after each mistake the algorithm makes, the relative weight of $a^*$ increases significantly. Unlike the Hedge algorithm, our algorithm is deterministic. A direct adaptation of the analysis of Chapter 5 results in a mistake bound of $\frac{1}{1-H[p]}\log\frac{1}{\pi(a^*)} + T\cdot H[B/T]$ for our problem. We modify this analysis in order to prove the bound of Theorem 7.1; this bound is tight in general (e.g., for the case when the target does not move), as pointed out in Chapter 5.

Proof. For every $1 \le t \le T+1$, the algorithm assigns a weight $\mu_t(a)$ to every sequence $a \in A$. Initially, $\mu_1(a) = \pi(a)$ as given in the input. Define $\Lambda_t = \sum_{a\in A}\mu_t(a)$ to be the total weight of all sequences in round $t$. For the purpose of analysis, we define $\hat\mu_t(a) = \log\mu_t(a)$ for every sequence $a \in A$ and, similarly, $\hat\Lambda_t = \log\Lambda_t$. Moreover, we assign a likelihood $\mathcal{L}_t(v)$ to every node $v$ in every round $t$, which is the total weight of the sequences which recommend $v$ in round $t$. Formally, $\mathcal{L}_t(v) = \sum_{a\in A}\mu_t(a)\cdot\mathbb{1}[a_t = v]$.

In every round $1 \le t \le T$, the algorithm's prediction is a median of $G$ with respect to $\mathcal{L}_t(\cdot)$. Let $x_t$ and $y_t$ denote the prediction of the algorithm and the response it receives, respectively. The algorithm then computes the weights of the sequences for the next round as follows: if the recommendation of the sequence $a$ in round $t$ is consistent with the response $y_t$, the weight of $a$ is multiplied by $1-p$; otherwise, it is multiplied by $p$. More formally, $\mu_{t+1}(a) = (1-p)\,\mu_t(a)$ if $a_t \in R(x_t,y_t)$, and $\mu_{t+1}(a) = p\,\mu_t(a)$ otherwise. Recall that if $x_t = y_t$, then $x_t$ itself is the only node that is consistent with the response, i.e., $R(v,v) = \{v\}$.

Lemma 7.2 For every $1 \le t \le T$, $\mathbb{E}[\hat\mu_{t+1}(a^*) - \hat\mu_t(a^*)] \ge -H[p]$.

Proof. In every round $t$, the response is correct with probability at least $1-p$. In this case, the recommendation of $a^*$ is consistent with the response; hence, $\mu_{t+1}(a^*) = (1-p)\,\mu_t(a^*)$, which implies that $\hat\mu_{t+1}(a^*) = \hat\mu_t(a^*) + \log(1-p)$. In the other case, due to the noise, the recommendation of $a^*$ may or may not be consistent with the response. Because at worst the weight is multiplied by $p$, we obtain the bound $\mu_{t+1}(a^*) \ge p\,\mu_t(a^*)$, i.e., $\hat\mu_{t+1}(a^*) \ge \hat\mu_t(a^*) + \log p$. By combining both cases, we get that

$\mathbb{E}[\hat\mu_{t+1}(a^*)] \ge (1-p)\left(\hat\mu_t(a^*) + \log(1-p)\right) + p\left(\hat\mu_t(a^*) + \log p\right) = \hat\mu_t(a^*) - H[p]$.

Now, summing over all $T$ rounds, we obtain that

$\mathbb{E}[\hat\mu_{T+1}(a^*)] \ge \log\pi(a^*) - T\cdot H[p]$.   (7.1)

Having obtained a lower bound on the weight of $a^*$ at the end of the $T$ rounds, we now prove an upper bound on $\hat\Lambda_{T+1}$. First, we state the following proposition, which is easy to verify.

Proposition 7.3 If $\mathcal{L}_t(u) > \frac{\Lambda_t}{2}$ for a node $u$, then $u$ is the unique median.

For each round $1 \le t \le T$, one of the following three scenarios happens.

(a) In round $t$, each node $v$ has $\mathcal{L}_t(v) \le \frac{\Lambda_t}{2}$. By (the same argument as in) Lemma 5.1, and because of the way the algorithm updates the weights (as in the analysis of Chapter 5), $\Lambda_{t+1} \le \frac{\Lambda_t}{2}$, and thus $\hat\Lambda_{t+1} \le \hat\Lambda_t - 1$.

(b) In round $t$, there is a node $v \ne a^*_t$ with $\mathcal{L}_t(v) > \frac{\Lambda_t}{2}$. By Proposition 7.3, $v$ is the unique median and hence the algorithm's prediction. However, $v \ne a^*_t$, so $v$ is not the correct target in this round. Define $\gamma = \mathcal{L}_t(v)/\Lambda_t$. By the assumption, $\frac{1}{2} < \gamma \le 1$. With probability $1-p$, the response is correct. In this case, $v$ is not consistent with the response, and therefore

$\Lambda_{t+1} \le p\,\mathcal{L}_t(v) + (1-p)\left(\Lambda_t - \mathcal{L}_t(v)\right) = \Lambda_t\left(p\gamma + (1-p)(1-\gamma)\right)$.

Taking logarithms, this implies that $\hat\Lambda_{t+1} \le \hat\Lambda_t + \log\left(p\gamma + (1-p)(1-\gamma)\right)$.
With the remaining probability, the response is adversarially incorrect. Importantly, it is impossible that both node $v$ and some other node in the graph are consistent with the response. Because both $1-p$ and $\gamma$ are larger than $\frac{1}{2}$, the smallest reduction in the total weight is achieved by having $v$ be consistent with the response. Hence, in this case, we obtain the following upper bound:

$\Lambda_{t+1} \le (1-p)\,\mathcal{L}_t(v) + p\left(\Lambda_t - \mathcal{L}_t(v)\right) = \Lambda_t\left((1-p)\gamma + p(1-\gamma)\right)$.

By taking logarithms, we obtain the bound $\hat\Lambda_{t+1} \le \hat\Lambda_t + \log\left((1-p)\gamma + p(1-\gamma)\right)$. Combining the two cases, we get that

$\mathbb{E}[\hat\Lambda_{t+1} - \hat\Lambda_t] \le (1-p)\log\left(p\gamma + (1-p)(1-\gamma)\right) + p\log\left((1-p)\gamma + p(1-\gamma)\right)$.

For every $0 < p < \frac{1}{2}$ and $\frac{1}{2} \le \gamma \le 1$, the derivative with respect to $\gamma$ is always negative, meaning that the expression is maximized when $\gamma = \frac{1}{2}$. Plugging this value of $\gamma$ into the inequality, we get that $\mathbb{E}[\hat\Lambda_{t+1} - \hat\Lambda_t] \le -1$.

(c) The final case is that in round $t$, $\mathcal{L}_t(a^*_t) > \frac{\Lambda_t}{2}$. By Proposition 7.3, $a^*_t$ is the unique median and hence the algorithm's prediction. In this case, the algorithm does not make a mistake (i.e., its prediction is the true target). Define $\gamma = \mathcal{L}_t(a^*_t)/\Lambda_t$. Similar to scenario (b), we consider two cases, based on whether the response is correct or not. With probability $1-p$, the response is correct; it then confirms that the algorithm's prediction is correct, and we have

$\Lambda_{t+1} = (1-p)\,\mathcal{L}_t(a^*_t) + p\left(\Lambda_t - \mathcal{L}_t(a^*_t)\right) = \Lambda_t\left((1-p)\gamma + p(1-\gamma)\right)$.

If the response is incorrect, we can again reason as in scenario (b), and obtain that

$\Lambda_{t+1} \le p\,\mathcal{L}_t(a^*_t) + (1-p)\left(\Lambda_t - \mathcal{L}_t(a^*_t)\right) = \Lambda_t\left(p\gamma + (1-p)(1-\gamma)\right)$.

Taking logarithms and combining both cases, we get that

$\mathbb{E}[\hat\Lambda_{t+1} - \hat\Lambda_t] \le p\log\left(p\gamma + (1-p)(1-\gamma)\right) + (1-p)\log\left((1-p)\gamma + p(1-\gamma)\right)$.

This time, for every $0 < p < \frac{1}{2}$ and $\frac{1}{2} \le \gamma \le 1$, the derivative with respect to $\gamma$ is positive; therefore, the expression is maximized at $\gamma = 1$. Thus, $\mathbb{E}[\hat\Lambda_{t+1} - \hat\Lambda_t] \le -H[p]$.

Let $M$ denote the number of rounds in which either scenario (a) or (b) happens. $M$ is an upper bound on the number of mistakes of the algorithm, and our preceding analysis shows that

$\mathbb{E}[\hat\Lambda_{T+1}] \le -M - (T-M)\,H[p]$.   (7.2)

Combining Inequalities (7.1) and (7.2), and using the fact that $\hat\Lambda_{T+1} \ge \hat\mu_{T+1}(a^*)$ (hence, $\mathbb{E}[\hat\Lambda_{T+1}] \ge \mathbb{E}[\hat\mu_{T+1}(a^*)]$), we can now complete the proof by bounding

$-M - (T-M)\,H[p] \ge \log\pi(a^*) - T\cdot H[p]$,

which can be rearranged to $M \le \frac{1}{1-H[p]}\log\frac{1}{\pi(a^*)}$.

The algorithm we presented in this section explicitly keeps track of weights for all sequences in $A$. If the target does not move, the number of valid sequences cannot exceed the number of nodes, which allows for an efficient implementation. However, with a moving target, there may be exponentially many sequences. In Sections 7.4 and 7.5, we discuss how to implement this algorithm more efficiently for the shifting target and drifting target models, by carefully choosing the function $\pi(\cdot)$. Here, we say that an algorithm is computationally efficient if it computes its prediction in each round computationally efficiently.

7.4 Results for the Shifting Target Model

In this section, we provide a more efficient implementation of the learning algorithm for the shifting target model, as well as a lower bound on the number of mistakes that any learning algorithm must make in this model.

Theorem 7.4 ([Emamjomeh-Zadeh et al., 2020], Theorem 7) If the target changes under the shifting target model, there is a deterministic algorithm that runs in time $O(N^\kappa\,\mathrm{poly}(N))$ and makes at most $\frac{1}{1-H[p]}\left(\log\binom{N}{\kappa} + (B+1)\log\kappa + T\cdot H[B/T]\right)$ mistakes in expectation.

Proof. We adapt the algorithm of Theorem 7.1. Let $\mathcal{S}_\kappa$ denote the collection of all subsets of $V$ of size $\kappa$. Consider the following random procedure to generate a sequence $a$ of length $T$.
In the following description, $b \in [0,1]$ is a constant, to be determined momentarily.

1. Pick a set $X \in \mathcal{S}_\kappa$ uniformly at random.
2. Pick an initial node $a_1 \in X$ uniformly at random.
3. For every $2 \le t \le T$, let $a_t = a_{t-1}$ with probability $1-b$; with the remaining probability, $a_t$ is chosen uniformly at random from $X$.

For a sequence $a$, let $\Gamma(a) \subseteq \mathcal{S}_\kappa$ denote the collection of all sets of (exactly) $\kappa$ nodes that are supersets of the support of $a$. If the support size of $a$ is larger than $\kappa$, then $\Gamma(a) = \emptyset$; if it is $\kappa$, then $|\Gamma(a)| = 1$. If the sequence's support is smaller than $\kappa$, then $|\Gamma(a)| > 1$. Notice that the random procedure can only generate $a$ when it chooses $X \in \Gamma(a)$.

As in Theorem 7.1, let $A = V^T$. For every sequence $a \in A$, let $\pi(a)$ be the probability that $a$ is generated by the random procedure described above. In particular, if $a$ has support larger than $\kappa$, the probability is $\pi(a) = 0$. Because $\pi(\cdot)$ is a probability distribution, $\sum_{a\in A}\pi(a) = 1$. And because the true sequence $a^*$ shifts at most $B$ times, $\pi(a^*) \ge \binom{N}{\kappa}^{-1}\frac{1}{\kappa}\left(\frac{b}{\kappa}\right)^B(1-b)^{T-B}$. Letting $b = \frac{B}{T}$ and applying Theorem 7.1 gives us the mistake bound of Theorem 7.4.

In order to bound the running time, we present a more efficient way to implement the algorithm of Theorem 7.1. The key insight is that in order to run the algorithm, one only needs to compute, in each round, the likelihood of each node. Fix some set $U \in \mathcal{S}_\kappa$ of nodes, and let $A_U$ denote the set of all length-$T$ sequences whose support is contained in $U$. For every node $u \in U$ and every round $1 \le t \le T$, let $\mathcal{L}_{t,U}(u) = \sum_{a\in A_U}\frac{\mu_t(a)}{|\Gamma(a)|}\,\mathbb{1}[a_t = u]$ be the "contribution" of sequences in $A_U$ in round $t$ to the likelihood of node $u$. Our faster implementation of the algorithm does not keep track of $\mu_t(a)$ for every sequence $a \in A$. Instead, in every round $1 \le t \le T$, it computes $\mathcal{L}_{t,U}(u)$ for every $U \in \mathcal{S}_\kappa$ and $u \in U$. This is sufficient to compute the cumulative likelihood $\mathcal{L}_t(u) = \sum_{U\ni u}\mathcal{L}_{t,U}(u)$ for every node $u$. In turn, the $\mathcal{L}_t(u)$ are sufficient for computing the median of the graph.

We now show how to inductively compute the $\mathcal{L}_{t,U}(u)$ for a fixed set $U$ and every $u \in U$. For the base case, because the first node is picked uniformly at random, $\mathcal{L}_{1,U}(u) = \binom{N}{\kappa}^{-1}\frac{1}{\kappa}$. Now assume that for some round $t$, the algorithm has already computed $\mathcal{L}_{t,U}(u)$ for all nodes $u$. As explained above, it can then compute the median $x_t$. As before, let $y_t$ denote the response. For each $u \in U$, if $u$ is consistent with the response, then, according to the description of the algorithm, the weight of every sequence $a \in A_U$ with $a_t = u$ is multiplied by $1-p$; otherwise, the weight of every such sequence is multiplied by $p$. Define intermediate variables

$\mathcal{L}'_{t,U}(u) = \mathcal{L}_{t,U}(u)\left((1-p)\,\mathbb{1}[u \in R(x_t,y_t)] + p\,\mathbb{1}[u \notin R(x_t,y_t)]\right)$.

By the definition of the random generative process for sequences, each sequence stays at the same node with probability $1-b$, and shifts to a uniformly random node in $U$ with the remaining probability $b$. Therefore, the likelihoods of the nodes for the next round can be computed as follows:

$\mathcal{L}_{t+1,U}(u) = (1-b)\,\mathcal{L}'_{t,U}(u) + \sum_{v\in U}\frac{b}{\kappa}\,\mathcal{L}'_{t,U}(v)$.

This implementation of the algorithm requires memory of size $\binom{N}{\kappa}\cdot\kappa$, and the running time is $O(N^\kappa\,\mathrm{poly}(N))$, as claimed.
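One round of this inductive computation, for a single fixed set $U$, looks as follows in Python (a sketch; `consistent_nodes` abstracts the test $u \in R(x_t, y_t)$):

```python
def update_likelihoods(L, consistent_nodes, p, b):
    """One round of the per-set update from the proof of Theorem 7.4 (sketch).

    L maps each node u of a fixed candidate set U to L_{t,U}(u);
    consistent_nodes is the set of u in U with u in R(x_t, y_t);
    b = B/T is the shift probability of the generative random walk.
    """
    # Multiplicative reweighting by the response, as in Algorithm 2.
    Lp = {u: L[u] * ((1 - p) if u in consistent_nodes else p) for u in L}
    # A sequence stays put with probability 1-b and jumps to a uniformly
    # random node of U with probability b (kappa = |U| choices).
    kappa = len(L)
    jump = (b / kappa) * sum(Lp.values())
    return {u: (1 - b) * Lp[u] + jump for u in Lp}
```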
Remark 7.5 Notice a perhaps interesting feature of our analysis. While the shifts of the target are entirely adversarial (subject to the bound on the support size and the total number of shifts), our analysis essentially "pretends" that the target performs a random walk over a uniformly random target set of size $\Delta$. Despite this "incorrect" shifting model, it achieves the same mistake bound as Theorem 7.1.

Remark 7.6 This algorithm needs to know $\Delta$ ahead of time in order to properly keep track of the likelihoods. When $\Delta$ is not known ahead of time, there is a slightly different random procedure for generating random sequences which results in essentially the same bound as that of Theorem 7.4. This random procedure is similar to the one we explained in this section; instead of picking $X$ from $\mathcal{S}_\Delta$, $X$ is a random set such that each node belongs to $X$ independently with probability $\frac{1}{N}$. The naïve implementation of our algorithm using this new random procedure, however, requires explicitly enumerating all sequences of length $T$. We do not know of any more computationally efficient implementation at this time.

7.4.1 Negative Result

Next, we state a lower bound on the number of mistakes for any algorithm (deterministic or randomized, efficient or not) in the shifting target model.

Theorem 7.7 ([Emamjomeh-Zadeh et al., 2020], Theorem 10) For every pair of positive integers $\Delta$ and $n$, there exists an undirected, connected and unweighted graph $G$ on a set of $N = \Delta n$ nodes such that under the shifting target model, every (possibly randomized and/or computationally inefficient) algorithm makes at least

$$\min\Big\{\,T - o(T)\,,\;\; \frac{1}{1-H[p]} \Big(\Delta \log N + (B - 2\Delta + 1) \log \Delta\Big) - \Delta \cdot o(\log N) - B \cdot o(\log \Delta)\,\Big\}$$

mistakes in expectation.

The proof of Theorem 7.7 is presented in Section 7.6. Notice that this lower bound does not quite match the upper bound of Theorem 7.4; there is an additive gap of essentially $\frac{T\,H[B/T]}{1-H[p]}$. This gap is most relevant when $\Delta$ is "small". Whether our upper bound is tight or not remains open.

7.5 Results for the Drifting Target Model

In this section, we present an improved implementation of the learning algorithm from Section 7.3 for the drifting target model. Unlike the implementation in Section 7.4, here we actually achieve a polynomial-time implementation. Recall that $\Delta$ is the maximum degree of the evolution graph $G'$.

Theorem 7.8 ([Emamjomeh-Zadeh et al., 2020], Theorem 11) Under the drifting target model, there is a polynomial-time learning algorithm which makes at most

$$\frac{1}{1-H[p]} \Big(\log N + B \log \Delta + T\,H[B/T]\Big)$$

mistakes in expectation.

Proof. As in the proof of Theorem 7.4, we will prescribe specific choices of weights $\pi(a)$ for sequences $a$; these weights facilitate efficient computation of the median node to query. As before, let $A = V^T$ denote the set of all sequences of length $T$. Consider a uniformly random walk on $G'$ with stalling probability $1-b$ (where $b$ will be determined below), starting from a uniformly random vertex $a_1 \in V$. That is, in each round $t$, with probability $1-b$, the walk stays at $a_{t+1} = a_t$; with the remaining probability, it moves to a uniformly random neighbor of $a_t$ in $G'$. Without loss of generality, we assume that every node has a self-loop in $G'$.

For each sequence $a \in A$, define $\pi(a)$ to be the probability that $a$ occurs as the result of this random walk process. Because $\pi(\cdot)$ is a probability distribution, $\sum_{a \in A} \pi(a) = 1$. And because the ground truth sequence $a^*$ moves at most $B$ times, it has initial weight $\pi(a^*) \ge \frac{1}{N} \cdot \big(\frac{b}{\Delta}\big)^B \cdot (1-b)^{T-B}$. Setting $b = B/T$, we apply Theorem 7.1, which gives a mistake bound of at most

$$\frac{1}{1-H[p]} \log \Big(\frac{1}{N} \cdot \Big(\frac{b}{\Delta}\Big)^{B} \cdot (1-b)^{T-B}\Big)^{-1},$$

which is exactly the claimed bound.

It remains to show how to implement this algorithm to run efficiently. We again show how to inductively compute the likelihoods. For the base case, because the first node is chosen uniformly at random, $\mathcal{L}_1(u) = \frac{1}{N}$ for every node $u$. Let $x_t$ and $y_t$ denote the prediction of the algorithm and the response in round $t$, respectively. Consider a node $u$.

• If $u$ is consistent with the response, i.e., $u \in R_G(x_t, y_t)$, then the weight of every sequence $a$ which predicts $u$ in round $t$ (i.e., $a_t = u$) is multiplied by $1-p$; as a result, so is the likelihood of $u$.

• Similarly, if $u$ is inconsistent with the response, i.e., $u \notin R_G(x_t, y_t)$, then the weight of every sequence $a$ which predicts $u$ in round $t$ is multiplied by $p$, and so is the likelihood of $u$.

Similar to the proof of Theorem 7.4, we define intermediate variables

$$\mathcal{L}'_t(u) = \mathcal{L}_t(u) \cdot \Big((1-p) \cdot \mathbb{1}[u \in R(x_t, y_t)] + p \cdot \mathbb{1}[u \notin R(x_t, y_t)]\Big).$$

Because the weights of sequences correspond to the probabilities of the random walk with stalling probability $1-b$, the new likelihoods are

$$\mathcal{L}_{t+1}(u) = (1-b) \cdot \mathcal{L}'_t(u) + \sum_{v \in N^{\mathrm{in}}_{G'}(u)} \frac{b}{|N^{\mathrm{out}}_{G'}(v)|} \cdot \mathcal{L}'_t(v).$$

Notice that the algorithm only needs to keep track of one variable per node and per round, and the computations are straightforward linear combinations. Hence, we obtain an efficient implementation.
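The following is a minimal Python sketch of this update (the identifiers are ours; `in_neighbors(u)` and `out_degree(v)` describe the evolution graph $G'$, and `consistent(u)` again stands in for the test $u \in R_G(x_t, y_t)$):

```python
def drifting_update(L, p, b, consistent, in_neighbors, out_degree):
    """One round of the drifting-target likelihood update (Theorem 7.8).

    L[u] is the current likelihood of node u; returns the likelihoods for
    the next round. Initially, L[u] = 1/N for every node u.
    """
    # Step 1: reweight by the (possibly noisy) response.
    L_prime = {u: L[u] * ((1 - p) if consistent(u) else p) for u in L}
    # Step 2: propagate along the lazy random walk on G': a sequence stays
    # put with probability 1 - b, and otherwise moves to a uniform neighbor.
    return {
        u: (1 - b) * L_prime[u]
           + sum(b * L_prime[v] / out_degree(v) for v in in_neighbors(u))
        for u in L
    }
```

In each round, the algorithm computes a median of $G$ with respect to these likelihoods and queries it; the update itself touches each edge of $G'$ only a constant number of times, which is where the polynomial running time comes from.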
7.5.1 Negative Results

Similar to Section 7.4.1, we state a lower bound, which again leaves an additive gap of $\frac{T\,H[B/T]}{1-H[p]}$.

Theorem 7.9 ([Emamjomeh-Zadeh et al., 2020], Theorem 12) For every pair of positive integers $\Delta$ and $n$, there exist undirected unweighted graphs $G$ and $G'$ over the same set of $N = \Delta n$ nodes such that $G$ is connected, the evolution graph $G'$ is $\Delta$-regular (with a self-loop for every node), and every (possibly randomized or inefficient) algorithm under the drifting target model makes at least

$$\min\Big\{\,T\,,\;\; \frac{1}{1-H[p]} \Big(\log N - o(\log N) + B\,\big(\log \Delta - o(\log \Delta)\big)\Big)\,\Big\}$$

mistakes in expectation.

Similar to our negative result in Section 7.4.1, our negative result in this section does not quite match our upper bound either. The proof of this theorem is also presented in Section 7.6.

7.6 Proofs of the Lower Bounds

Theorems 7.7 and 7.9 state two lower bounds, for the shifting target model and the drifting target model, respectively. Because their proofs are closely related, we present both of them here. Both are based on the lower bound of Theorem 7.10 for the noisy classic binary search problem with a uniform target.

7.6.1 A Lower Bound on Noisy Standard Binary Search

In the classic binary search problem, the algorithm has to find a target $z \in \{1,\ldots,n\}$ using binary comparisons. This is in fact equivalent to binary search on the line graphs that we discussed in Section 3.3.1. Formally, we define the noisy standard binary search problem as follows. A target $z$ is chosen uniformly at random from the set $\{1,\ldots,n\}$, and the learner's goal is to identify $z$. In each round $t$, the learner queries an element $x_t$. If $x_t = z$, the process ends. Otherwise, the response is a single bit, stating whether $x_t > z$ or not. The response is noisy, meaning that the bit is correct with probability $1-p$, and incorrect with the remaining probability. We now prove a theorem which we utilize later to prove Theorems 7.7 and 7.9.

Theorem 7.10 ([Emamjomeh-Zadeh et al., 2020], Theorem 13) Every (possibly randomized) algorithm for the noisy standard binary search problem requires at least $\frac{\log n}{1-H[p]} - o(\log n)$ queries in expectation.

Proof. Because the input and responses are not adversarial, but drawn from a known distribution, there is a deterministic optimal algorithm, so it suffices to prove the theorem for deterministic algorithms. Let $\mathcal{A}_0$ be a deterministic algorithm for the noisy standard binary search problem, and assume that it finds the target in at most $Q-1$ rounds in expectation. ($Q$ may depend on $p$ and $n$, but the dependencies are omitted to simplify the notation.)
We first show that, at a small additional cost, we can assume that the algorithm never uses too many queries.

Lemma 7.11 If there is a deterministic algorithm $\mathcal{A}_0$ that uses at most $Q-1$ rounds in expectation, then there is a deterministic algorithm $\mathcal{A}$ that finds the target in at most $Q$ rounds in expectation and never uses more than $(Q+1)n$ queries.

Proof. The algorithm $\mathcal{A}$ consists of running $\mathcal{A}_0$ for no more than $Qn$ rounds. If the target has not been found yet, then $\mathcal{A}_0$ is terminated, and $\mathcal{A}$ queries all the elements in $\{1,\ldots,n\}$ sequentially. This algorithm never uses more than $Qn + n$ queries. Moreover, the probability that $\mathcal{A}_0$ is terminated is, by Markov's inequality, no more than $\frac{1}{n}$. Thus, the expected query complexity of $\mathcal{A}$ is bounded by $(Q-1) + \frac{1}{n} \cdot n = Q$.

In order to prove Theorem 7.10, we show that for every fixed $\epsilon > 0$, if $n$ is large enough, then $\mathcal{A}$ requires at least $\frac{1-\epsilon}{1-H[p]} \log n$ queries in expectation. Throughout this proof, $p$ is a fixed constant. Moreover, we assume that $n$ is a (large enough) fixed constant and (without loss of generality) a power of 2. We later discuss how large $n$ should be (as a function of $\epsilon$).

We reduce the standard noisy binary search problem to a coding problem. Let $T_1$ and $T_2$ be two positive integers, depending on $n$ and $p$, which we will specify later. Consider the following random procedure $\mathcal{R}$ which generates a sequence of length $T_1 + T_2$: the first $T_1$ elements of the sequence are drawn independently and uniformly at random from the set $\{1,\ldots,n\}$; the final $T_2$ elements of the sequence are drawn i.i.d. from $\{0,1\}$, and each of them is 1 with probability $p$. The entropy of the sequence generated from this procedure is $T_1 \log n + T_2 H[p]$, implying Lemma 7.12.

Lemma 7.12 Every algorithm that constructs a binary encoding of the string generated from the random procedure $\mathcal{R}$ has to use at least $T_1 \log n + T_2 H[p]$ bits, in expectation.

We show that using a deterministic algorithm $\mathcal{A}$ for standard noisy binary search, we can construct a good binary encoding algorithm $\mathcal{B}$ for sequences generated by $\mathcal{R}$. By Lemma 7.11, we can assume w.l.o.g. that $\mathcal{A}$ finds the target in at most $Q$ rounds in expectation and never uses more than $(Q+1)n$ queries. The construction works as follows: Let $s = \langle z_1,\ldots,z_{T_1}, b_1,\ldots,b_{T_2} \rangle$ be a sequence generated from the random procedure $\mathcal{R}$, i.e., the $z_i$ for $1 \le i \le T_1$ are drawn independently and uniformly at random from the set $\{1,\ldots,n\}$, and each $b_i$ for $1 \le i \le T_2$ is 1 independently with probability $p$ and 0 otherwise.

For every $1 \le i \le T_1$, the algorithm $\mathcal{B}$ simulates the algorithm $\mathcal{A}$, assuming that the target is $z_i$. The $i$-th simulation of $\mathcal{A}$ is as follows. In each round, if $\mathcal{A}$ queries $z_i$, $\mathcal{B}$ terminates the simulation; otherwise, it uses a fresh bit among $b_1,\ldots,b_{T_2}$ (consumed sequentially) to determine whether to feed an incorrect or a correct response to $\mathcal{A}$. Notice that the simulation terminates if and when the response to a query confirms that the queried node is the target $z_i$. Up until that event, the response to each query is either "left" or "right." Therefore, the responses can be encoded into a binary string $c_i$. By the assumption on $\mathcal{A}$, $\mathbb{E}[|c_i|] \le Q$, and $|c_i| \le (Q+1)n$ always.

Let $c'_i$ be a binary string encoding of the length of $c_i$. Thus, $|c'_i| \le 1 + \log |c_i|$. Define a binary string $c''_i$ of length $|c''_i| = 2|c'_i|$, by adding a 1 after the last digit of $c'_i$ and adding a 0 after each of its other digits. Finally, define the binary string $\bar{c}_i$ as the concatenation of $c''_i$ with $c_i$. From any string with prefix $\bar{c}_i$, one can first extract $c''_i$, by finding the first digit '1' in an even position. From this, one obtains $c'_i$ from the leading odd digits; after learning the length of $c_i$, one can uniquely reconstruct $c_i$.
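As a sanity check, the following minimal Python sketch (identifiers ours) implements this self-delimiting encoding and the extraction just described:

```python
def encode_with_length(c: str) -> str:
    """Return the self-delimiting string c''_i + c_i for a binary string c.

    c' is the binary representation of len(c); c'' interleaves a 0 after
    every digit of c' except the last, which is followed by a 1.
    """
    c_prime = bin(len(c))[2:]
    c_dprime = "".join(d + "0" for d in c_prime[:-1]) + c_prime[-1] + "1"
    return c_dprime + c

def decode_with_length(s: str):
    """Extract (c, remainder) from a string whose prefix was produced above."""
    digits, i = [], 0
    while True:
        digits.append(s[i])      # odd positions carry the digits of c'
        if s[i + 1] == "1":      # a 1 in an even position ends the prefix
            break
        i += 2
    length = int("".join(digits), 2)
    start = i + 2                # first position after the length prefix
    return s[start:start + length], s[start + length:]
```

For instance, `encode_with_length("0110")` yields the prefix `"100001"` (encoding the length 4) followed by `"0110"`, and decoding recovers the original string together with whatever follows it.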
In turn, from $c_i$, one can uniquely recover the target $z_i$ because $\mathcal{A}$ is deterministic. Moreover, having $c_i$ and $z_i$, it is uniquely determined which responses were noisy. Because $\mathbb{E}[|c_i|] \le Q$ and $\log$ is a concave function, $\mathbb{E}[|c''_i|] \le 2(\log Q + 1) \le O(\log Q)$.

Define $\bar{c} = \bar{c}_1 \cdots \bar{c}_{T_1}$ as the binary string obtained by sequentially concatenating $\bar{c}_1,\ldots,\bar{c}_{T_1}$. Based on the previous paragraph, given $\bar{c}$, one can recover every $z_i$ as well as the bits $b_1,\ldots,b_L$, where $L = \sum_{1 \le i \le T_1} |c_i|$. Notice that $\mathbb{E}[L] \le T_1 Q$ and $L \le T_1 (Q+1) n$ always.

Let $\epsilon_1, \epsilon_2 > 0$ be real numbers, which we will let go to 0 later. Let $T_1$ be large enough so that $\Pr[L \ge T_1 Q (1+\epsilon_1)] < \epsilon_2$. The existence of such a $T_1$ follows from Chebyshev's Inequality because $\mathbb{E}[|c_i|] \le Q$, and the variance of $|c_i|$ is bounded by its maximum, which is at most $(Q+1)n$. Let $T_2 = T_1 Q (1+\epsilon_1)$. By definition of $T_2$, with probability at least $1 - \epsilon_2$, the $T_2$ random bits in the second part of the sequence $s$ are sufficient for the $T_1$ executions of $\mathcal{A}$. In this case, the binary string $\bar{c}$, concatenated with the bits $b_{L+1},\ldots,b_{T_2}$ which had been left unused in the simulations of $\mathcal{A}$, is sufficient to uniquely determine $s$. The length of this binary string is

$$(T_2 - L) + \sum_{i=1}^{T_1} |\bar{c}_i| \le T_2 + \sum_{i=1}^{T_1} O(\log |c_i|).$$

With the remaining probability, i.e., with probability at most $\epsilon_2$, the $T_2$ bits at the end of $s$ may not be sufficient for $\mathcal{B}$'s $T_1$ simulations of $\mathcal{A}$. In this case, $\mathcal{B}$ simply encodes $s$ by encoding each $z_i$ using $\log n$ bits, and appends the final $T_2$ bits. In this case, the length of the binary code is $T_1 \log n + T_2$.

The expected length of the code, denoted by $Q'$, is upper-bounded as follows:

$$\mathbb{E}[Q'] \le \epsilon_2 (T_1 \log n + T_2) + (1-\epsilon_2)\big(T_1 \cdot O(\log Q) + T_2\big) \le T_1 \big(\epsilon_2 \log n + O(\log Q)\big) + T_2 \le T_1 \big(\epsilon_2 \log n + O(\log Q) + Q(1+\epsilon_1)\big).$$

Lemma 7.12 implies that

$$\mathbb{E}[Q'] \ge T_1 \log n + T_2 H[p] = T_1 \big(\log n + Q H[p] (1+\epsilon_1)\big).$$

Combining both inequalities and canceling out the $T_1$ factor,

$$\log n + Q H[p] (1+\epsilon_1) \le \epsilon_2 \log n + O(\log Q) + Q (1+\epsilon_1).$$

Rearranging this inequality gives us that

$$\frac{1-\epsilon_2}{1+\epsilon_1} \cdot \frac{1}{1-H[p]} \cdot \log n \le Q + O(\log Q).$$

By letting $\epsilon_1, \epsilon_2 \to 0$, we obtain the claimed lower bound on the expected number of rounds $Q$ for any algorithm to find a random target.

7.6.2 Proof of Theorem 7.7

We next show how to use Theorem 7.10 to prove Theorem 7.7.

Proof. Let the graph $G$ be a $\Delta \times n$ grid. Formally, $V = \{(i,j) \mid 1 \le i \le \Delta,\ 1 \le j \le n\}$ is the set of $N = \Delta n$ nodes. Two nodes $(i,j)$ and $(i',j')$ are connected if and only if $|i-i'| + |j-j'| = 1$. Visualizing the grid, we call the first coordinate of a node its row number and the second coordinate its column number.

For each row $i$, the adversary draws a column $j_i$ independently and uniformly at random. Let $t^{(i)} = (i, j_i)$ be the resulting node in row $i$. These nodes $t^{(1)},\ldots,t^{(\Delta)}$ are the targets.

The first stage consists of $\Delta$ phases, each lasting until the learner has identified the target in a particular row. In phase $i$, the adversary lets $t^{(i)}$ be the target. Identifying $t^{(i)}$ is tantamount to identifying $j_i$, and therefore equivalent to the classic binary search problem on $n$ integers. Therefore, by Theorem 7.10, every learner must make at least $\frac{\log n}{1-H[p]} - o(\log n)$ mistakes in expectation until it identifies the target $t^{(i)}$ for the first time. As soon as the learner does, phase $i$ ends, and phase $i+1$ begins. In total over the $\Delta$ phases, the learner must thus make at least $\frac{\Delta \log n}{1-H[p]} - \Delta \cdot o(\log n)$ mistakes in expectation.

The second stage consists of $B - \Delta + 1$ phases.
In each phase, the adversary picks one of the $\Delta$ targets uniformly at random, independently of past choices. A phase ends when the learning algorithm identifies the target correctly.

Consider any one phase in the second stage. Unless the learner queried the correct row, the adversary will always provide feedback in the vertical direction, i.e., it identifies whether the target is above or below the queried node. Then, the learner's task in each phase is equivalent to identifying the target's row, and therefore to the classic binary search problem on a path of length $\Delta$. By Theorem 7.10, it takes the learner at least $\frac{\log \Delta}{1-H[p]} - o(\log \Delta)$ rounds in expectation to identify the target. The total number of mistakes across all $B - \Delta + 1$ phases of the second stage is therefore at least $\frac{(B-\Delta+1) \log \Delta}{1-H[p]} - B \cdot o(\log \Delta)$ in expectation.

Combining both stages, we obtain the claimed bound. Notice that if $T$ is "small," the adversary may not have enough time to finish this strategy. However, the same strategy then results in $T - o(T)$ mistakes in expectation, corresponding to the first term in $\min\{\ldots\}$.

7.6.3 Proof of Theorem 7.9

Next, we prove Theorem 7.9.

Proof. Let $G$ be the $\Delta \times n$ grid. Formally, $V = \{(i,j) \mid 1 \le i \le \Delta,\ 1 \le j \le n\}$ is the set of $N = \Delta n$ nodes. Two nodes $(i,j)$ and $(i',j')$ are connected in $G$ if and only if $|i-i'| + |j-j'| = 1$. The undirected graph $G'$ is defined on the same set of nodes: in $G'$, two nodes $(i,j)$ and $(i',j')$ are connected if and only if $j = j'$, i.e., they are in the same column. In other words, $G'$ is a disjoint (and disconnected) collection of complete graphs, each comprising the nodes of one of the grid's columns. Thus, each node of $G'$ has degree $\Delta$, counting the self-loop.

The adversary's strategy proceeds in $B+1$ phases $s = 1,\ldots,B+1$. Each phase $s$ has a designated target $z_s$ which stays fixed for the duration of the phase.¹ Phase $s$ ends when the learner queries a node in the same row as $z_s$. Until then, the adversary only reveals vertical information, i.e., it reveals an edge pointing up or down, depending on whether the learner's guess was below or above the target.

The initial target $z_1$ is a uniformly random node in $G$. Whenever a new phase $s+1$ starts, the target $z_{s+1}$ is chosen as a uniformly random node in the same column as $z_s$ (and thus also $z_1,\ldots,z_{s-1}$). Because the learner had only learned the row of $z_s$, when the row is changed to a new uniformly random one, the target is again uniformly random as far as the learner is concerned. By Theorem 7.10, each phase lasts for at least $\frac{\log \Delta}{1-H[p]} - o(\log \Delta)$ rounds in expectation.

Once the adversary has exhausted the $B$ moves, the target stays fixed. At this point, in order to identify the target's column (about which nothing has been revealed so far), the learner still requires at least $\frac{\log n}{1-H[p]} - o(\log n)$ queries in expectation. Hence, we obtain a lower bound of

$$\frac{(B+1) \log \Delta}{1-H[p]} + \frac{\log n}{1-H[p]} - (B+1) \cdot o(\log \Delta) - o(\log n).$$

Substituting $n = N/\Delta$ now gives the claimed bound. As in the proof of Theorem 7.7, if $T$ is "small," the adversary may not have enough time to finish this strategy. However, the same strategy then results in $T - o(T)$ mistakes in expectation, corresponding to the first term in $\min\{\ldots\}$.

¹ Notice that we slightly change the meaning of the notation $z_s$, which referred to the target for a given round $s$ throughout the body of this chapter.

Chapter 8

Non-Uniform Known Costs

In this chapter, we discuss a generalization of our framework in which a fixed, known querying cost is associated with each node.
The takeaway of this chapter is that non-uniform fixed costs significantly increase the computational hardness of the problem. This chapter is based on our work [Emamjomeh-Zadeh et al., 2016].

8.1 Computational Complexity of the Uniform Cost Setting

Recall our basic setting: the target is static and the responses are noiseless. We presented, in Chapter 4, a learning algorithm that finds the target within $\lfloor \log N \rfloor$ rounds. This bound is tight in the worst case. For instance, locating the target in a line graph requires $\lfloor \log N \rfloor$ queries. However, for some graphs $G$, fewer queries suffice. For example, if the graph $G$ is a star, then a single query at the center of the star is sufficient to locate the target. An interesting algorithmic question is the following: given a graph $G$, what is the optimal number of queries required (in the worst case) to locate the target in $G$?

There is an algorithm for this problem that runs in time $|E|^{O(\log N)} \cdot \mathrm{poly}(N)$ (where $|E|$ denotes the number of edges in the graph) if the costs are uniform. This algorithm uses the fact, shown in Chapter 4, that the optimal strategy does not use more than $\log N$ queries, and hence only a quasi-polynomial number of strategies need to be explored. This observation implies that finding the optimal strategy cannot be NP-hard unless the ETH is false. See Section 2.6 (Conjecture 1) for the definition of the ETH.

On the problem of computing the optimal strategy for a given graph, we prove two hardness results, stated below. We state the results here for the sake of completeness; because these results are not directly related to the scope of this dissertation, we only sketch their proofs.

Theorem 8.1 ([Emamjomeh-Zadeh et al., 2016], Theorem 9) If the ETH (Conjecture 1) holds, then the problem of finding the optimal strategy for a given graph does not admit any algorithm that runs in time $|E|^{o(\log N)}$.

The next theorem proves a stronger result assuming the SETH.

Theorem 8.2 ([Emamjomeh-Zadeh et al., 2016], Theorem 9) If the SETH (Conjecture 2) is correct, then the problem of finding the optimal strategy for a given graph does not admit any algorithm that runs in time $|E|^{(1-\epsilon) \log N}$ for any constant $\epsilon > 0$.

Proof sketches for both theorems are presented in Section 8.3.

8.2 Computational Complexity of the Non-Uniform Cost Setting

In this section, we focus on the same problem when making a mistake may have different costs to the learner. The new computational problem, then, is to find the optimal strategy for a given graph when the costs are not uniform. In this section, we only consider one model of cost, in which the learner knows all the costs ahead of time and they remain constant throughout learning. See Assumption 8.1.

Assumption 8.1 With every node $v$ is associated a cost $c_v \ge 0$. The costs are independent of the target and are constant throughout the learning process. The costs are known to the learner.

We only assume the basic setting in which the target is static and the responses are noiseless. We present a strong hardness result which holds even in this basic setting. Theorem 8.3 states a very strong result for computing the optimal strategy under Assumption 8.1.

Theorem 8.3 ([Emamjomeh-Zadeh et al., 2016], Theorem 11) Given an undirected positively weighted graph $G$, a total budget $K$, and the cost function $c : V \to \mathbb{R}_{>0}$, it is PSPACE-complete to decide whether there is a strategy to find the target using no more than the budget limit $K$. This result holds even when the diameter of $G$ is 13. This is interesting because the optimal strategy for a graph with constant diameter can be computed in polynomial time if the costs are uniform.
This hardness result suggests that the problem is significantly more complex when the costs are non-uniform, even if the learner knows them. This theorem is proved in Section 8.3.

8.3 Hardness Proofs

In this section, we first present the proof of Theorem 8.3. Then, we show how this proof can be modified to yield Theorems 8.1 and 8.2.

8.3.1 Proof of Theorem 8.3

To prove the PSPACE-hardness, we reduce from a well-known problem called QBF. See Section 2.6 for the definition of this problem. QBF can be considered as a two-player game in which the players take turns choosing values for the variables. During the $i$-th round, the first player assigns either true or false to the variable $x_i$; subsequently, the second player chooses a Boolean value for the variable $y_i$. The first player's objective is to satisfy $\mathcal{F}$ while the second player wants $\mathcal{F}$ to become false. The decision question is: is there a winning strategy for the first player?

We consider the adaptive query problem as a two-player game as well, in which the first player (whom we call the vertex player) queries one vertex at a time, while the second player (the edge player), for each queried vertex $x$, chooses an outgoing edge from $x$ lying on a shortest path from $x$ to the target. The vertex player wins if with at most a total cost of $K$ he can uniquely identify the target vertex based on the responses, while the edge player wins if after spending $K$, there is still more than one potential target vertex consistent with all answers.

Given the formula $\mathcal{F}$, we construct an unweighted graph $G = (V,E)$ with the following pieces. Let $\hat{k}$ denote the number of $\exists$ quantifiers in the formula.

• For each variable $x_i$, add two literal vertices of type I, named $u_i$ and $\bar{u}_i$. Similarly, corresponding to each variable $y_i$, add two literal vertices of type II, named $v_i$ and $\bar{v}_i$. For every $1 \le i < \hat{k}$, add undirected edges from both $u_i$ and $\bar{u}_i$ to both $v_i$ and $\bar{v}_i$. Add undirected edges between each pair of literal vertices of type II. Furthermore, add an extra vertex $\hat{v}$. Add undirected edges between $\hat{v}$ and each of the literal vertices of type II, as well as between $\hat{v}$ and both $u_{\hat{k}}$ and $\bar{u}_{\hat{k}}$.

• For each $1 \le i \le \hat{k}$, add two critical gadgets $T_i$ and $T'_i$ that will be specified momentarily. Choose two arbitrary distinct vertices in each critical gadget as critical nodes. Let $t_i$ and $\bar{t}_i$ be the critical nodes of $T_i$, and $t'_i$ and $\bar{t}'_i$ those of $T'_i$. Add undirected edges between $u_i$ and both $t_i$ and $t'_i$, as well as between $\bar{u}_i$ and both $\bar{t}_i$ and $\bar{t}'_i$. The $i$-th critical gadget consists of only two vertices connected by an edge. The cost of querying each of these vertices is $\hat{k} + 2 - i$.

• Corresponding to each clause $C_\ell$ in $\mathcal{F}$, add two clause gadgets $P_\ell$ and $P'_\ell$. Each clause gadget is an undirected path of length 7. Let $p_\ell$ and $p'_\ell$ be the midpoints of these paths.

• Whenever a literal ($x_i$, $\bar{x}_i$, $y_i$, or $\bar{y}_i$) appears in clause $C_\ell$, add two intermediate nodes specific to this pair of a literal and a clause. Add undirected edges between the intermediate nodes and the corresponding literal vertex ($u_i$, $\bar{u}_i$, $v_i$, or $\bar{v}_i$), and add an edge between the first intermediate node and $p_\ell$, and between the second intermediate node and $p'_\ell$. (The intermediate nodes connect the corresponding literal vertex to both $p_\ell$ and $p'_\ell$ via paths of length 2.)

The cost of querying every node is set to be 1, except for those nodes whose cost we specified otherwise. We set the total budget $K = \hat{k} + 2$. We begin with a simple lemma elucidating the role of the critical gadgets and clause gadgets.
The proof of the lemma is immediate from the construction of the gadgets.

Lemma 8.4 (Gadgets) We have the following two observations about the gadgets.

1. Conditioned on knowing that the target is in $T_i$ (or $T'_i$), and nothing else, a total budget of $\hat{k} + 2 - i = K - i$ is necessary and sufficient to find the target in the worst case.

2. Conditioned on knowing that the target is in a specific clause gadget $P_\ell$ or $P'_\ell$, but nothing more, a total cost of 2 suffices.

We claim that the first player in the QBF game has a winning strategy if and only if the vertex player can find any target in $G$ using at most a budget of $K$.

(1) First, assume that there exists a strategy for finding the target in $G$ using no more budget than $K$. Let $Q_i$ be the vertex queried by the vertex player in the $i$-th round. We claim the following, for each $i \le \hat{k}$: if $Q_{i'} \in \{u_{i'}, \bar{u}_{i'}\}$ for each $i' < i$, and all of the second player's responses were toward either $v_{i'}$ or $\bar{v}_{i'}$, then $Q_i \in \{u_i, \bar{u}_i\}$. The reason is that under the assumption about prior queries, all critical gadgets $T_i$ and $T'_i$ are still in the candidate set $S$. Having already used $i-1$ of the budget previously, the vertex player has only a budget of $K - (i-1)$ left, and $K - i$ is necessary in order to find a target in one of these critical gadgets, by the first part of Lemma 8.4. Since the edge player can also choose which of the two critical gadgets contains the target, the vertex player has to identify the correct gadget using at most one round, which is only accomplished by querying one of $u_i$, $\bar{u}_i$.

We now define the following mapping from the vertex player's strategy to a winning strategy for the first player in the formula game. When the vertex player queries $u_i$, the first player assigns $x_i = \text{true}$, whereas when the vertex player queries $\bar{u}_i$, the first player sets $x_i = \text{false}$. For $i < \hat{k}$, in response to the first player's setting of $x_i$, the second player will set $y_i$ either true or false. If the second player sets $y_i = \text{true}$, then we have the edge player reveal $\bar{v}_i$, whereas when $y_i = \text{false}$, the edge player reveals $v_i$ instead. By the previous claim, if $i < \hat{k}$, the vertex player's next query must be to either $u_{i+1}$ or $\bar{u}_{i+1}$, meaning that we can next set $x_{i+1}$ by the same procedure. For the last round of queries, we make the edge player reveal the edge toward $\hat{v}$. We thus obtain a variable assignment to all $x_i$ and $y_i$, and it remains to show that it satisfies all clauses $C_\ell$.

By assumption, having used $\hat{k} = K - 2$, the vertex player can always identify the target with the remaining budget of 2. Note that for each matching pair $(P_\ell, P'_\ell)$ of clause gadgets, either both $P_\ell$ and $P'_\ell$ are entirely in the candidate set, or both have been completely ruled out. This is because all responses were pointing to $v_i$ or $\bar{v}_i$, and were thus symmetric for both $P_\ell$ and $P'_\ell$. By the second part of Lemma 8.4, it would take at least 2 rounds of queries to identify a node in a known clause gadget, and one more to identify the correct gadget. This is too many rounds of queries, so no clause gadget can be in $S$, and all clause gadget instances must have been ruled out previously.

This means that for each clause $C_\ell$, there must have been $1 \le i \le \hat{k}$ such that the responses to the $i$-th query ruled out the target being in either $P_\ell$ or $P'_\ell$. Suppose that the $i$-th query was to a node $q_i \in \{u_i, \bar{u}_i\}$, and $v_i$ or $\bar{v}_i$ (or $\hat{v}$ if $i = \hat{k}$) is the answer.
Then, the clause gadgets could have been ruled out in one of two ways:

• $q_i$ is connected to $P_\ell$ and $P'_\ell$ via intermediate nodes, i.e., there is a path of length 2 from $q_i$ to $p_\ell$ and also to $p'_\ell$. These clause gadgets have been ruled out because the answer to the query was not toward one of the intermediate nodes (connected to $p_\ell$ or $p'_\ell$). In this case, by definition, the literal corresponding to $q_i$ (either $x_i$ or $\bar{x}_i$) is in $C_\ell$, and because $q_i$ was queried, the literal is set to true. Thus, $C_\ell$ is satisfied under the assignment.

• An intermediate node connects $v_i$ to $P_\ell$ (and $P'_\ell$) and the edge player chose $\bar{v}_i$, or, symmetrically, an intermediate node connects $\bar{v}_i$ to $P_\ell$ (and $P'_\ell$) and the edge player chose $v_i$. Without loss of generality, assume the first case. In that case, by definition, $y_i = \text{true}$ (because $\bar{v}_i$ was chosen), and by construction of the graph, $y_i \in C_\ell$. Again, this ensures that $C_\ell$ is satisfied under the assignment.

In summary, we have shown that all clauses are satisfied, meaning that the first player in the formula game has won the game.

(2) For the converse direction, assume that the first player has a winning strategy in the formula game. We will use this strategy to construct a winning strategy for the vertex player in the target search game. We begin by considering a round $i \le \hat{k}$ in which the vertex player needs to make a choice. Assume that for each round $i' < i$, $Q_{i'} \in \{u_{i'}, \bar{u}_{i'}\}$. Also assume that the edge player responded with either $v_{i'}$ or $\bar{v}_{i'}$ for each query of $u_{i'}$ or $\bar{u}_{i'}$. (We will see momentarily that these assumptions are warranted.) Interpret an answer pointing to $v_{i'}$ as setting $y_{i'} = \text{false}$, and an answer pointing to $\bar{v}_{i'}$ as setting $y_{i'} = \text{true}$. Consider the choice for $x_i$ prescribed by the first player's assumed winning strategy, based on the history so far. The vertex player will query $u_i$ if $x_i = \text{true}$, and $\bar{u}_i$ if $x_i = \text{false}$. We distinguish several cases, based on the response:

• If one of the queried vertices is the target, then clearly, the vertex player has won (given that we assume the noiseless setting).

• If the edge player's response is toward an instance of a critical gadget, then the target is known to lie in that critical gadget. By Lemma 8.4, there exists a query strategy for $T_i$ (or $T'_i$) which can find the target using a budget of no more than $K - i$. Together with the $i$ rounds of queries already used by the vertex player, this gives a successful identification with a budget of at most $K$ in total.

• If the answer is toward an intermediate node connected to one of the queried nodes, then the target must lie in the corresponding clause gadget, say $P_\ell$, or is one of the intermediate nodes connected to this gadget. By Lemma 8.4, it takes at most 2 more rounds of queries to identify the target; together with the first $i$ rounds of queries, this is a successful identification with a budget of at most $K$.

• This leaves the case when the edge player chooses the edge toward $v_i$ or $\bar{v}_i$ (or $\hat{v}$, if $i = \hat{k}$), justifying the assumption made earlier that for each of the first $i$ rounds of queries, the edge player responds by revealing edges toward either $v_i$ or $\bar{v}_i$.

In summary, the fact that the assignment satisfies all $C_\ell$ implies that the target cannot lie on any clause gadget. The fact that the edge player responded with edges toward $v_i$ or $\bar{v}_i$ to the first $\hat{k}$ rounds of queries implies that the target cannot lie on any critical gadget, either. The remaining case is the $\hat{k}$-th round of querying, when the edge player's response may contain the edge toward $\hat{v}$.
Then, the only candidate nodes remaining after $\hat{k}$ rounds of queries (each of which cost 1 unit for the learner) are (some of) the literal vertices, (some of) the intermediate nodes, and $\hat{v}$. In this case, 2 more rounds are sufficient: query $\hat{v}$, and subsequently query the vertex $u_i$, $\bar{u}_i$, $v_i$ or $\bar{v}_i$ with which the edge player responds. (Recall that there are edges from $\hat{v}$ to all the literal vertices.) That query either reveals the target, or points to an intermediate vertex which is then known to be the target.

8.3.2 Proof of Theorems 8.1 and 8.2

Let $\bigwedge_{\ell=1}^{m} C_\ell$ be an instance of CNF-SAT with $n$ variables and $m$ clauses (with $m$ polynomial in $n$). Without loss of generality, assume that $n = k^2$ is a perfect square. (If it is not, we can add $O(\sqrt{n})$ dummy variables to make it one.) Partition the variables into $k$ batches of $k$ variables each, labeled $x_{j,i}$. The overall construction idea is similar to the proof of Theorem 8.3 (Section 8.3.1).

• For each batch $j$ ($1 \le j \le k$), and each assignment $a \in \{0,1\}^k$, construct three vertices: an assignment vertex $v_{j,a}$ and two intermediate vertices $u_{j,a}$, $u'_{j,a}$. Add edges between $v_{j,a}$ and $u_{j,a}$, and between $v_{j,a}$ and $u'_{j,a}$. Add two extra nodes $\hat{v}$, $\hat{v}'$, connected via an edge. Moreover, connect $\hat{v}$ with all assignment vertices $v_{j,a}$.

• For each batch $j$ ($1 \le j \le k$), add two critical gadgets $T_j$ and $T'_j$, each a simple path of length $2^{k-j+3} - 1$. Let $t_j$ and $t'_j$ be the middle points of $T_j$ and $T'_j$, respectively. Connect $t_j$ to the intermediate nodes $u_{j,a}$ for all $a$, and $t'_j$ to $u'_{j,a}$ for all $a$. Hence, all assignment vertices are connected to the corresponding critical gadgets via paths of length two.

• Corresponding to each clause $C_\ell$ in the formula, add a clause gadget $P_\ell$, which is a simple path of length 7. For each assignment vertex $v_{j,a}$, if $a$ satisfies $C_\ell$, add a new intermediate node $u''_{j,a,\ell}$, and connect it to both $v_{j,a}$ and the middle node of $P_\ell$.

The overall outline is similar to Section 8.3.1. The key idea is again that to have any chance of finding a target in the critical gadgets, an adaptive strategy must pick exactly one assignment vertex from each batch; otherwise, a target in a critical gadget could not be identified. This allows us to establish a one-to-one correspondence between adaptive strategies and variable assignments. Revealing an edge to a critical or clause gadget would give the adaptive strategy an easy winning option, so one can show that w.l.o.g., all responses are toward $\hat{v}$. This rules out all clause gadgets for clauses satisfied by the assignment $a$ of the queried vertex $v_{j,a}$. In order to succeed in the final two rounds with $\hat{v}$, $\hat{v}'$, with a number of unqueried $v_{j,a}$ and many $u''_{j,a,\ell}$ still remaining, an algorithm must have eliminated all of the clause gadgets from consideration, which is accomplished only when all clauses are satisfied. (Conversely, if all critical and clause gadgets have been eliminated, the algorithm can next query $\hat{v}$ and the $v_{j,a}$ that is returned as the response.) Hence, a satisfying variable assignment exists if and only if $k+2$ queries are sufficient, as captured by the following lemma:

Lemma 8.5 There exists an adaptive strategy to find the target in the constructed graph within at most $k+2$ queries if and only if the CNF formula is satisfiable.

The constructed graph has $N = O(mk2^k)$ vertices and $M = O(mk2^k)$ edges; thus $\log N = k + o(k)$. Assume that some algorithm $\mathcal{A}$ decides whether there exists any adaptive strategy to find the target with $k+2$ queries.
We would obtain the following complexity-theoretic consequences:

1. If the formula is a 3-CNF-SAT formula, and the running time of $\mathcal{A}$ is $M^{o(\log N)} = M^{o(k)}$, then the reduction would give us an algorithm for 3-CNF-SAT with running time $M^{o(k)} = 2^{o(n)}$, which contradicts the ETH.

2. For general CNF-SAT instances, if the running time of $\mathcal{A}$ is $O(M^{(1-\epsilon) \log N})$, then the above reduction would solve CNF-SAT in time

$$O\Big(\big(m \sqrt{n}\, 2^{\sqrt{n}}\big)^{(1-\epsilon)\sqrt{n}}\Big) = O\big(2^{(1-\epsilon/2)n}\big),$$

contradicting the SETH.

Chapter 9

Posted-Price Auction

In this chapter, we depart from our generic framework and focus on one specific interactive learning problem, named the "online posted-price auction" problem. The results in this chapter are still unpublished.¹

9.1 Background

In "The Ketchup Conundrum" [Gladwell, 2004], Malcolm Gladwell describes the history of market research in the food industry and the sophisticated scientific techniques used by companies like Campbell Soup and Pepsi to optimize their products to the taste of customers. In his tale about mustard, Gladwell describes how the mustard brand Grey Poupon dominated the market in the 1980s and was able to charge more than twice the price of its competitors by understanding the properties of mustard that were most desired by the market and offering a mustard product that was substantially different from its competitors.

Pricing is an important component of revenue optimization, but one equally important component is to optimize the product being offered to increase its value to consumers: How sweet should soda be? How big should the fonts in a website be to maximize engagement? What color of orange juice is the best to boost sales? Often, the design space is quite complex. Gladwell describes the design space of pasta sauce as follows: "These were designed to differ in every conceivable way: spiciness, sweetness, tartness, saltiness, thickness, aroma, mouth feel, cost of ingredients, and so forth."

In this chapter, we study the problem of pricing in a very simple and basic setting. In Section 9.2, we formulate and discuss the problem of finding the optimal price for a fixed item. In Section 9.3, we consider the problem of finding the optimal item (for instance, the most likable combination of ingredients for pasta sauce) together with its optimal price. We study these problems in an online fashion: in each round, the seller interacts with the buyer by proposing an item and posting a price; she then observes the buyer's response to the offer, and they move to the next round.

As mentioned before, this problem does not fit into our generic framework. We depart from our framework in this chapter and present two algorithms specifically for pricing. A generalization of our framework that captures these problems is still unknown. This open problem is discussed in Section 10.2.

9.2 Selling a Single Item

First, we discuss the posted-price auction when there is only one item to sell. We also assume throughout this chapter that production of the item(s) in each round has no cost for the seller.²

On each day $t$, the seller sets a price $0 \le x_t \le 1$ for the item. Notice that she can choose a different price each day. The same buyer shows up every day and checks the price. The buyer has a hidden fixed value $v$ for the item. On each day $t$, the buyer buys the item if and only if $v \ge x_t$, that is, the price is at most his hidden value for the item.

¹ As of November 7, 2020.
² This assumption is realistic in many real examples. For instance, in online ad auctions, the item is a spot on a website where an ad can be shown.
If the seller's goal were to find the buyer's private value $v$, she could simply run classic binary search over the continuous interval $[0,1]$ and approximate $v$ within additive error $\epsilon$ using $\log \frac{1}{\epsilon}$ rounds. The seller's goal, however, is to maximize her revenue within $T$ rounds. In each round $t$:

• if $v \ge x_t$, the buyer buys the item and the seller receives $x_t$;
• if $v < x_t$, then the buyer does not buy the item and the seller does not make any money in this round.

If she knew $v$ ahead of time, she could set the price at $v$ in every round and make $v$ per round, so her revenue in hindsight is $Tv$. We use the standard notion of a regret bound, which resembles the notion of a "mistake bound" (used in Chapter 7) and was referred to earlier in this dissertation (in Section 1.2.1) as well. Suppose the learner's algorithm collects a cumulative revenue of $M$ in $T$ rounds in the worst case, that is, $M$ is a lower bound on the learner's revenue for every value of $v$. If the learning algorithm is probabilistic, then define $M$ as the expected revenue. The regret of the learner is defined as

$$Tv - M. \qquad (9.1)$$

Notice that the regret indicates how much more the learner could have collected if she knew $v$. The regret is always non-negative.

We first present a naive algorithm. The learner runs the binary search algorithm to estimate some value $\bar{v} \le v$ that is very close to $v$. Then, she posts $\bar{v}$ for the remaining rounds. It is easy to verify that this naive algorithm leads to a regret bound of $\Theta(\log T)$ in the worst case. With a very smart yet simple idea, [Kleinberg and Leighton, 2003] design an algorithm whose regret bound is $O(\log \log T)$. They also show that this bound is asymptotically tight even if $v$ is drawn uniformly at random from the interval $[0,1]$.

Theorem 9.1 ([Kleinberg and Leighton, 2003], Theorems 2.1 and 2.2) The following upper and lower bounds for the single-item posted-price auction match:

• There is a deterministic algorithm for the single-item posted-price auction whose regret bound in the worst case is $O(\log \log T)$.
• Even if $v$ is drawn uniformly at random from the interval $[0,1]$ (rather than being adversarial), the expected regret of every (possibly randomized) algorithm is $\Omega(\log \log T)$.

9.3 Selling Multiple Items

We next generalize the problem and consider the setting in which the seller has multiple items to sell. We assume that the items are listed explicitly. Let $n$ denote the number of items. Each item $a \in \{1,\ldots,n\}$ has a hidden value $v_a \in [0,1]$. The values are assumed to be all adversarially chosen. In this setup, in each round $t$, the learner chooses an item $a_t$ as well as a price $x_t$, and offers item $a_t$ at price $x_t$.

• If $v_{a_t} \ge x_t$, the buyer buys the item and the seller receives $x_t$;
• if $v_{a_t} < x_t$, then the buyer does not buy the item and the seller does not make any money in this round.

Let $a^* = \operatorname{argmax}_{1 \le a \le n} v_a$ be the item with maximum value and define $v^* = v_{a^*} = \max_{1 \le a \le n} v_a$ as the value of the most valuable item. If the seller knew the hidden values, she could offer $a^*$ at price $v^*$ in every round. In this case, her revenue would be $Tv^*$. This is the maximum revenue that the learner could possibly collect should she have all the information. If the expected revenue of an algorithm (without knowing the hidden values) is $M$, then the regret is defined as follows:

$$Tv^* - M. \qquad (9.2)$$

Achieving the optimal regret bound in this case remains an open problem.
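As a concrete baseline, the following is a minimal Python sketch (identifiers ours) of the naive single-item strategy from Section 9.2: binary search below $v$, then exploit. Theorem 9.2 below uses a single-item algorithm as a black box in exactly this interaction model.

```python
def naive_single_item(buyer_accepts, T):
    """Naive single-item posted-price strategy with O(log T) regret.

    buyer_accepts(x) -> bool models the buyer: True iff v >= x.
    Phase 1 binary-searches a price lo with lo <= v <= lo + 1/T;
    phase 2 posts the safe price lo for all remaining rounds.
    """
    lo, hi = 0.0, 1.0
    revenue, rounds = 0.0, 0
    while hi - lo > 1.0 / T and rounds < T:
        mid = (lo + hi) / 2
        rounds += 1
        if buyer_accepts(mid):
            revenue += mid   # accepted: v >= mid, so raise the lower bound
            lo = mid
        else:
            hi = mid         # rejected: v < mid; this round earns nothing
    for _ in range(T - rounds):
        if buyer_accepts(lo):  # always accepted, since lo <= v throughout
            revenue += lo
    return revenue
```

Each of the roughly $\log T$ search rounds loses at most 1, and each exploitation round loses at most $\frac{1}{T}$, giving regret $O(\log T)$; the $O(\log \log T)$ algorithm of [Kleinberg and Leighton, 2003] is substantially more careful about how much each probing round can lose.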
We have two preliminary results, stated as Theorems 9.2 and 9.3.

Theorem 9.2 There is a deterministic algorithm whose regret bound is $O(n \log \log T)$.

Proof. The algorithm of Theorem 9.2 is based on the result of [Kleinberg and Leighton, 2003] for a single item, that is, Theorem 9.1. Let $\mathcal{A}$ denote the algorithm of Theorem 9.1; recall that $\mathcal{A}$ is deterministic. We define our algorithm $\mathcal{B}$ using $\mathcal{A}$.

In algorithm $\mathcal{B}$, we keep a single instance of algorithm $\mathcal{A}$ as a black box. Our learning process consists of phases, each of which corresponds to one round in $\mathcal{A}$. Meanwhile, we keep track of the set of "active" items, denoted by $I$. Initially, $I = \{1,\ldots,n\}$. Let $x_t$ be the price that $\mathcal{A}$ posts in phase $t$. Algorithm $\mathcal{B}$ offers every active item at price $x_t$. This takes $|I|$ rounds, so phase $t$ consists of $|I|$ rounds of algorithm $\mathcal{B}$.

• If all the offers in phase $t$ are rejected, then the offer in $\mathcal{A}$ is also considered rejected, and the algorithm proceeds to the next phase.
• If at least one offer in phase $t$ is accepted, then all the items whose offers were rejected are considered inactive for the remaining rounds (that is, they are removed from $I$). The offer in $\mathcal{A}$ is considered accepted.

Notice that in algorithm $\mathcal{B}$, $a^*$ is never removed from $I$, because if the offer for $a^*$ is rejected in a phase, then all the other offers are rejected as well. The contribution of every other item to the regret of $\mathcal{B}$ is at most as much as the contribution of $a^*$: every other item is treated identically to $a^*$ as long as it is active, and it has no contribution to the regret of $\mathcal{B}$ once it becomes inactive. Thus, the regret of $\mathcal{B}$ is at most $n$ times the regret of $\mathcal{A}$.
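A minimal sketch of this phase simulation, assuming a black-box single-item algorithm exposing `next_price()` and `observe(accepted)` (these method names are ours):

```python
def multi_item_from_single_item(A, items, buyer_accepts, T):
    """Algorithm B of Theorem 9.2: phases of B simulate single rounds of A.

    A.next_price() proposes a price; A.observe(accepted) feeds one bit back.
    buyer_accepts(item, price) -> bool models the buyer with hidden values.
    """
    active = list(items)
    revenue, rounds = 0.0, 0
    while rounds < T:
        price = A.next_price()
        accepted = {}
        for item in active:              # one phase: offer every active item
            accepted[item] = buyer_accepts(item, price)
            if accepted[item]:
                revenue += price
            rounds += 1
            if rounds == T:
                return revenue           # horizon reached mid-phase
        if any(accepted.values()):
            # An item that rejected while another accepted has value below
            # the price, hence below v*, so it cannot be a*; deactivate it.
            active = [it for it in active if accepted[it]]
            A.observe(True)              # the offer in A counts as accepted
        else:
            A.observe(False)             # all rejected: A's offer rejected
    return revenue
```

Since every active item is treated exactly like $a^*$ while active and contributes nothing after deactivation, the regret of $\mathcal{B}$ is at most $n$ times that of $\mathcal{A}$, giving the $O(n \log \log T)$ bound.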
Theorem 9.3 There is a deterministic algorithm whose regret bound is $O(n + \log T)$.

Proof. This result is based on a complicated and extremely clever result for multi-dimensional binary search shown by [Kirkpatrick and Gao, 1990]. Their main result directly implies the following: there is a deterministic algorithm for our multi-item posted-price auction problem that finds $a \in \{1,\ldots,n\}$ and $\bar{v} \in [0,1]$ such that

$$\bar{v} \le v_a \le v^* \le \bar{v} + \frac{1}{T} \qquad (9.3)$$

and uses at most $O(n + \log T)$ rounds. In words, the algorithm of [Kirkpatrick and Gao, 1990] uses at most $O(n + \log T)$ rounds to find an element $a$ whose hidden value $v_a$ is close to $v^*$ and to estimate $v_a$ within an additive error of at most $\frac{1}{T}$.

Let $\mathcal{A}$ be the algorithm of [Kirkpatrick and Gao, 1990]. Our algorithm consists of two phases. In the first phase, the learner runs $\mathcal{A}$. It takes $O(n + \log T)$ rounds, and her cumulative regret in phase one is at most $O(n + \log T)$. After the first phase is over, the learner keeps offering item $a$ at price $\bar{v}$. All these offers in phase two are accepted, and because $v^* - \bar{v} \le \frac{1}{T}$, her cumulative regret in phase two is at most 1.

Notice that Theorems 9.2 and 9.3 are not comparable. It is still open whether a universally optimal algorithm exists.

Chapter 10

Conclusion

Thank you for reading this dissertation. We wrap it up by summarizing our contributions and listing some of the most important open problems.

10.1 Summary of our Main Contributions

In this dissertation, we presented a generic framework for interactive learning problems. We showed that our abstract framework captures several interesting applications, including learning a ranking and learning a (binary) classifier. We addressed several practical challenges of such interactive learning problems as well. In particular, we considered the setting in which the feedback that the learner receives is noisy, and we also studied the learning of a dynamic target.

Our main framework is based on a generalization of the classic binary search algorithm to metric spaces. This generalization is of interest in its own right in the theoretical computer science community. It also has applications beyond the framework that we presented in this dissertation. For instance, in [Emamjomeh-Zadeh and Kempe, 2018], we showed that this generalization of binary search to trees can be used as a building block for actively learning a hierarchical clustering.

10.2 Main Open Problems

Several open problems have been pointed out throughout this dissertation. To wrap up, we list the biggest ones.

10.2.1 Computation of the Optimal Strategy

In Chapter 4, we showed that in the absence of noise, a static target can be found using $\log N$ rounds. In some graphs, however, fewer rounds suffice. For instance, if the graph $G$ is an unweighted complete graph, then the learner can find the target by querying any arbitrary node. In Section 8.1, we showed that, given a graph $G$, there is an algorithm that finds the optimal number of queries for $G$ in quasi-polynomial time. Whether this problem admits a polynomial-time algorithm remains open. Notice that in this context, the running times of the algorithms are with respect to the size of the graph.

10.2.2 Closing the Gap for Dynamic Targets

Theorem 7.4 gives a positive result for the shifting target model. On the negative side, we showed a lower bound in Theorem 7.7. The two bounds do not quite match: there is a gap of essentially $\frac{T\,H[B/T]}{1-H[p]}$ between them. An interesting problem is to improve the lower bound, or to design a better algorithm to improve the upper bound. A similar gap exists between the positive result of Theorem 7.8 and the negative result of Theorem 7.9.

10.2.3 Non-Uniform Costs

Throughout most of this dissertation, we bounded the number of mistakes that the learner makes. In other words, we implicitly assumed that all mistakes have the same cost. This assumption is not always reasonable. Recall the example of a recommendation system: if the system's ultimate goal is to maximize the users' satisfaction, then the "cost" that the learner incurs in each round should be defined as the user's dissatisfaction. It is also reasonable to assume that the user's dissatisfaction in round $t$ is monotone in how far the proposed structure $x_t$ is from the target in this round.

The posted-price auction (Chapter 9) is a perfect example of a non-uniform cost setting. In the single-item version of the problem, define the hidden value $v$ as the target, and let $x_t$ denote the price posted in round $t$.

• If $x_t \le v$, then the seller makes $x_t$, which is below the optimum by $v - x_t$. This is the cost that the learner pays in this round.
• If $x_t > v$, the offer is declined and the seller makes no money in round $t$. In this case, her cost is $v$.

Designing a generic framework that captures non-uniform costs remains the main open problem of this dissertation.

Bibliography

[Alon et al., 2015] Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. (2015). Online learning with feedback graphs: Beyond bandits. In Journal of Machine Learning Research, volume 40.

[Alon et al., 2017] Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. (2017). Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826.

[Anagnostopoulos et al., 2011] Anagnostopoulos, A., Kumar, R., Mahdian, M., and Upfal, E. (2011). Sorting and selection on dynamic data. Theoretical Computer Science, 412(24):2564–2576.
[Angluin, 1988] Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319–342.

[Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256.

[Bei et al., 2013] Bei, X., Chen, N., and Zhang, S. (2013). On the complexity of trial and error. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 31–40.

[Ben-Asher et al., 1999] Ben-Asher, Y., Farchi, E., and Newman, I. (1999). Optimal search in trees. SIAM Journal on Computing, 28(6):2090–2102.

[Ben-Or and Hassidim, 2008] Ben-Or, M. and Hassidim, A. (2008). The Bayesian learner is optimal for noisy binary search (and pretty good for quantum as well). In Proc. 49th IEEE Symp. on Foundations of Computer Science, pages 221–230.

[Besa et al., 2018] Besa, J. J., Devanny, W. E., Eppstein, D., Goodrich, M. T., and Johnson, T. (2018). Optimally sorting evolving data. arXiv preprint arXiv:1805.03350.

[Boczkowski et al., 2016] Boczkowski, L., Feige, U., Korman, A., and Rodeh, Y. (2016). Searching trees with permanently noisy advice: Walking and query algorithms. arXiv preprint arXiv:1611.01403.

[Bubley, 2001] Bubley, R. (2001). Randomized Algorithms: Approximation, Generation, and Counting. Springer Science & Business Media.

[Bubley and Dyer, 1999] Bubley, R. and Dyer, M. (1999). Faster random generation of linear extensions. Discrete Mathematics, 201(1):81–88.

[Calabro et al., 2009] Calabro, C., Impagliazzo, R., and Paturi, R. (2009). The complexity of satisfiability of small depth circuits. In Proc. of 4th Intl. Workshop on Parameterized and Exact Computation, volume 5917 of Lecture Notes in Computer Science, pages 75–85.

[Cesa-Bianchi et al., 1997] Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. (1997). How to use expert advice. Journal of the ACM (JACM), 44(3):427–485.

[Cicalese et al., 2012] Cicalese, F., Jacobs, T., Laber, E., and Valentim, C. (2012). The binary identification problem for weighted trees. Theoretical Computer Science, 459:100–112.

[Cover, 1999] Cover, T. M. (1999). Elements of Information Theory. John Wiley & Sons.

[Crammer and Singer, 2002] Crammer, K. and Singer, Y. (2002). Pranking with ranking. In Proc. 16th Advances in Neural Information Processing Systems, pages 641–647.

[Deligkas et al., 2017] Deligkas, A., Mertzios, G. B., and Spirakis, P. G. (2017). Binary search in graphs revisited. In Mathematical Foundations of Computer Science, pages 20:1–20:14.

[Dereniowski et al., 2018] Dereniowski, D., Tiegel, S., Uznański, P., and Wolleb-Graf, D. (2018). A framework for searching in graphs in the presence of errors. arXiv preprint arXiv:1804.02075.

[Emamjomeh-Zadeh and Kempe, 2017] Emamjomeh-Zadeh, E. and Kempe, D. (2017). A general framework for robust interactive learning. In Proc. 31st Advances in Neural Information Processing Systems, pages 7085–7094.

[Emamjomeh-Zadeh and Kempe, 2018] Emamjomeh-Zadeh, E. and Kempe, D. (2018). Adaptive hierarchical clustering using ordinal queries. In Proc. 29th ACM-SIAM Symp. on Discrete Algorithms, pages 415–429. SIAM.

[Emamjomeh-Zadeh et al., 2020] Emamjomeh-Zadeh, E., Kempe, D., Mahdian, M., and Schapire, R. E. (2020). Interactive learning of a dynamic structure. In Proc. 31st Intl. Conf. on Algorithmic Learning Theory.

[Emamjomeh-Zadeh et al., 2016] Emamjomeh-Zadeh, E., Kempe, D., and Singhal, V. (2016). Deterministic and probabilistic binary search in graphs. In Proc. 48th ACM Symp. on Theory of Computing, pages 519–532, New York, NY, USA. ACM.
[Feige et al., 1994] Feige, U., Raghavan, P., Peleg, D., and Upfal, E. (1994). Computing with noisy information. SIAM Journal on Computing, 23(5):1001–1018.

[Freund and Schapire, 1997] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.

[Gladwell, 2004] Gladwell, M. (2004). The ketchup conundrum. New Yorker, 6.

[Granka et al., 2004] Granka, L. A., Joachims, T., and Gay, G. (2004). Eye-tracking analysis of user behavior in WWW search. In Proc. 27th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 478–479.

[György et al., 2007] György, A., Linder, T., Lugosi, G., and Ottucsák, G. (2007). The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403.

[Hoi et al., 2018] Hoi, S. C., Sahoo, D., Lu, J., and Zhao, P. (2018). Online learning: A comprehensive survey. arXiv preprint arXiv:1802.02871.

[Huber, 2006] Huber, M. (2006). Fast perfect sampling from linear extensions. Discrete Mathematics, 306(4):420–428.

[Impagliazzo and Paturi, 2001] Impagliazzo, R. and Paturi, R. (2001). On the complexity of k-SAT. Journal of Computer and System Sciences, 62:367–375.

[Joachims, 2002] Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proc. 8th Intl. Conf. on Knowledge Discovery and Data Mining, pages 133–142, New York, NY, USA. ACM.

[Jordan, 1869] Jordan, C. (1869). Sur les assemblages de lignes. J. Reine Angew. Math., 70(185).

[Karp, 1972] Karp, R. M. (1972). Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Plenum Press.

[Karp and Kleinberg, 2007] Karp, R. M. and Kleinberg, R. (2007). Noisy binary search and its applications. In Proc. 18th ACM-SIAM Symp. on Discrete Algorithms, pages 881–890, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.

[Karzanov and Khachiyan, 1991] Karzanov, A. and Khachiyan, L. (1991). On the conductance of order Markov chains. Order, 8(1):7–15.

[Kirkpatrick and Gao, 1990] Kirkpatrick, D. G. and Gao, F. (1990). Finding extrema with unary predicates. In International Symposium on Algorithms, pages 156–164. Springer.

[Kleinberg and Leighton, 2003] Kleinberg, R. and Leighton, T. (2003). The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proc. 44th IEEE Symp. on Foundations of Computer Science, pages 594–605. IEEE.

[Laber et al., 2002] Laber, E. S., Milidiú, R. L., and Pessoa, A. A. (2002). On binary searching with nonuniform costs. SIAM Journal on Computing, 31(4):1022–1047.

[Littlestone, 1988] Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318.

[Mozes et al., 2008] Mozes, S., Onak, K., and Weimann, O. (2008). Finding an optimal tree searching strategy in linear time. In Proc. 19th ACM-SIAM Symp. on Discrete Algorithms, pages 1096–1105, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.

[Onak and Parys, 2006] Onak, K. and Parys, P. (2006). Generalization of binary search: Searching in trees and forest-like partial orders. In Proc. 47th IEEE Symp. on Foundations of Computer Science, pages 379–388.

[Pedrotti, 1999] Pedrotti, A. (1999). Searching with a constant rate of malicious lies. In Proc. Intl. Conf. on Fun with Algorithms (FUN-98), pages 137–147. Citeseer.
[Pelc, 2002] Pelc, A. (2002). Searching games with errors – fifty years of coping with liars. Theoretical Computer Science, 270(1-2):71–109.

[Radlinski and Joachims, 2005] Radlinski, F. and Joachims, T. (2005). Query chains: Learning to rank from implicit feedback. In Proc. 11th Intl. Conf. on Knowledge Discovery and Data Mining, pages 239–248, New York, NY, USA. ACM.

[Rényi, 1961] Rényi, A. (1961). On a problem of information theory. MTA Mat. Kut. Int. Közl. B, 6:505–516.

[Rivest et al., 1980] Rivest, R. L., Meyer, A. R., Kleitman, D. J., Winklmann, K., and Spencer, J. (1980). Coping with errors in binary search procedures. Journal of Computer and System Sciences, 20(3):396–404.

[Sauer, 1972] Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147.

[Shelah, 1972] Shelah, S. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247–261.

[Ulam, 1991] Ulam, S. M. (1991). Adventures of a Mathematician. Univ of California Press.

[Vial et al., 2018] Vial, J. J. B., Devanny, W. E., Eppstein, D., Goodrich, M. T., and Johnson, T. (2018). Quadratic time algorithms appear to be optimal for sorting evolving data. In 2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 87–96.
Abstract
In many applications of machine learning, a system has to “learn” through interaction with its environment. For instance, a recommendation system or a search engine needs to learn the relative importance of a set of items in order to rank them properly for its users. Such systems observe the users' click patterns and exploit this information to gradually improve their understanding of the users' preferences.

In this dissertation, we present a general framework for the task of interactively learning a combinatorial structure (such as a ranking, a classifier, or a clustering) over a finite set of items. The learning task proceeds in T rounds, throughout which a learner aims to discover the true hidden combinatorial structure, called the “target”. In each round t, the learner proposes a structure xₜ and, in response, either learns that xₜ is the target or receives some partial information about the target, naturally in the form of a “correction” to xₜ.

We start by introducing a general abstract framework for interactive learning problems, and then extend it to address several real-world aspects of such problems. First, we take into account the fact that the feedback is usually noisy, that is, it may sometimes be incorrect (due to human error, for example); we discuss several algorithms that are robust and work even in the presence of noise. Next, we consider interactive learning of a target that itself changes over time. Finally, we briefly discuss the learning task when a non-uniform cost function is associated with the structures: different structures cost different amounts when proposed to the users, and the learner's objective is to minimize the total cost of the structures it proposes.

As the main building block of our framework, we introduce and analyze a natural generalization of the classic binary search algorithm to metric spaces. This abstract problem is of interest in its own right to the theoretical computer science community.
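To make the query-and-correction protocol concrete, the following is a minimal Python sketch of the noiseless, unweighted special case of binary search on a graph; it is an illustrative reconstruction, not the dissertation's verbatim algorithm, and the names graph_binary_search, bfs_distances, and the oracle interface are assumptions introduced here. The learner repeatedly queries a 1-median of the remaining candidate vertices; each correction (a neighbor lying on a shortest path toward the target) lets it discard at least half of the candidates, so the target is found in O(log n) queries.

# Sketch only: noiseless binary search on an unweighted graph.
# adj maps each vertex to a list of its neighbors.
from collections import deque

def bfs_distances(adj, source):
    """Single-source shortest-path distances in an unweighted graph (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def graph_binary_search(adj, oracle):
    """oracle(q) returns None if q is the target, else a neighbor of q
    on a shortest path from q to the target (the "correction")."""
    candidates = set(adj)
    # All-pairs distances; fine for a sketch, wasteful for large graphs.
    dist = {v: bfs_distances(adj, v) for v in adj}
    while True:
        # Query a 1-median of the candidates: a vertex minimizing the
        # total distance to the remaining candidate set.
        q = min(candidates, key=lambda v: sum(dist[v][z] for z in candidates))
        answer = oracle(q)
        if answer is None:
            return q
        # Keep only candidates consistent with the correction: the target
        # is strictly closer to the suggested neighbor than to q. The
        # median choice guarantees at most half of the candidates survive.
        candidates = {z for z in candidates if dist[answer][z] < dist[q][z]}

On a path graph this reduces to classic binary search, e.g.:

n = 16
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
target = 11
oracle = lambda q: None if q == target else (q + 1 if target > q else q - 1)
print(graph_binary_search(adj, oracle))  # 11, after about log2(n) queries

The noisy and weighted variants studied in the dissertation replace the hard elimination step with soft, multiplicative-weights-style updates over the vertices, but the median-query idea is the same.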
Conceptually similar
Online learning algorithms for network optimization with unknown variables
Understanding goal-oriented reinforcement learning
Robust and adaptive algorithm design in online learning: regularization, exploration, and aggregation
Theoretical foundations for dealing with data scarcity and distributed computing in modern machine learning
Computational aspects of optimal information revelation
Reinforcement learning with generative model for non-parametric MDPs
No-regret learning and last-iterate convergence in games
Data scarcity in robotics: leveraging structural priors and representation learning
Identifying and leveraging structure in complex cooperative tasks for multi-agent reinforcement learning
Leveraging training information for efficient and robust deep learning
Machine learning in interacting multi-agent systems
Active state learning from surprises in stochastic and partially-observable environments
Online reinforcement learning for Markov decision processes and games
Characterizing and improving robot learning: a control-theoretic perspective
Improving decision-making in search algorithms for combinatorial optimization with machine learning
Robust and adaptive online reinforcement learning
Do humans play dice: choice making with randomization
Rethinking perception-action loops via interactive perception and learned representations
Efficiently learning human preferences for proactive robot assistance in assembly tasks
Improving machine learning algorithms via efficient data relevance discovery
Asset Metadata
Creator: Emamjomeh-Zadeh, Ehsan (author)
Core Title: Interactive learning: a general framework and various applications
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 12/07/2020
Defense Date: 12/02/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: interactive learning, machine learning, OAI-PMH Harvest, online learning, theoretical computer science
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Dughmi, Shaddin (committee chair), Kempe, David (committee chair), Fulman, Jason (committee member), Luo, Haipeng (committee member), Teng, Shanghua (committee member)
Creator Email: ehsan7069@gmail.com, emamjome@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-409452
Unique identifier: UC11666624
Identifier: etd-EmamjomehZ-9185.pdf (filename), usctheses-c89-409452 (legacy record id)
Legacy Identifier: etd-EmamjomehZ-9185.pdf
Dmrecord: 409452
Document Type: Dissertation
Rights: Emamjomeh-Zadeh, Ehsan
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA