Robust and Adaptive Algorithm Design in Online Learning: Regularization, Exploration, and Aggregation

by Mengxiao Zhang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2024

Copyright 2024 Mengxiao Zhang

Acknowledgements

The Ph.D. journey is not a short period; rather, it is a substantial chapter marked by challenges and growth. Therefore, I want to express my sincere gratitude to all those who have supported me along the way.

First and foremost, I would like to express my deepest gratitude to my advisor, Haipeng Luo, for his tireless assistance throughout my entire Ph.D. journey. I can still vividly recall my initial apprehension and lack of knowledge about online learning when I first embarked on this path. I was filled with uncertainty, questioning whether I could meet Haipeng's expectations, especially when I was not making good progress and struggled in my first year. However, Haipeng was always patient in answering my naive questions and taught me from scratch how to conduct research in this field. His steadfast support has always been a beacon of strength for me. In terms of research, Haipeng always has keen insights into various problems and provides hands-on guidance whenever I run into obstacles. During our discussions, I have learned a lot, not only from the technical standpoint but also in terms of refining my research sensibilities. Haipeng serves as my role model of excellence, not only in his capacity as a skilled researcher but also as an exemplary mentor who guides and empowers students to pursue their own research journeys. Beyond research, I am also grateful to Haipeng for his invaluable advice on my future career and life. His profound influence has not only shaped my academic pursuits but also left an indelible mark on my personal and professional development.

During my Ph.D. study, I have been honored to have the privilege of doing research with many excellent collaborators, including Shi Chen, Chung-Wei Lee, Brendan Lucier, Paul Mineiro, Sarath Pattathil, Raif Rustamov, Haoran Shi, Alex Slivkins, Lei Tang, Hanghang Tong, Olga Vrousgou, Chu Wang, Yingfei Wang, Chen-Yu Wei, Yongning Wu, Yuqi Wu, Xiaojin Zhang, Yuheng Zhang, Zuohua Zhang, Peng Zhao, Zhi-Hua Zhou, and Hongyu Zhu. Specifically, I would like to express my gratitude to Chu Wang for being my internship mentor at Amazon for two consecutive summers. Throughout our discussions, I have learned more about the intricacies of real-world ads recommendation systems and the challenges of applying algorithms in this context. Additionally, I would like to thank Brendan Lucier and Alex Slivkins for hosting my internship at Microsoft Research, New York in Summer 2022. Brendan and Alex not only have a solid theoretical background in machine learning and algorithmic economics, but also provided me with much helpful guidance on implementing designed algorithms in a systematic manner in practice.

Our lab has always given me a feeling of warmth, and I want to express my deep thanks to my labmates, William Chang, Liyu Chen, Yifang Chen, Soumita Hait, Hikaru Ibayashi, Tiancheng Jin, Chung-Wei Lee, Spandan Senapati, Ram Deo-Campo Vuong, Chen-Yu Wei, Yan Wen, and Dongze Ye, for their companionship and support. Specifically, I thank Chen-Yu for numerous helpful research discussions.
Chen-Yu is knowledgeable in various research domains and can always come up with very smart ideas that I have never thought of. These discussions have enriched me, both in technical understanding and in intuitive insight. I thank Chung-Wei for providing delicious cookies, driving to various places, and engaging in fruitful discussions on various research projects. I thank Tiancheng and Yifang for being willing to listen to my random thoughts, including but not limited to the ups and downs in research and daily life; their willingness to engage with me has been incredibly helpful in relieving my anxiety and stress. I thank Liyu for helping me a lot during my job applications, and I thank Hikaru for discussions on deep learning and for sharing interesting animations. All your support and camaraderie have meant the world to me.

I thank Vatsal Sharan and Renyuan Xu for their willingness to serve on my thesis committee and for providing guidance and support during the process. I also thank Shaddin Dughmi, David Kempe, Yan Liu, Mahdi Soltanolkotabi, and Jiapeng Zhang for serving as committee members of my thesis proposal and oral qualification and for providing many helpful suggestions throughout these processes. I am grateful to Lizsl De Leon, Rita Wiraatmadja, Asiroh Cham, Kimberly Serrano, Andy Chen, Andrea Mora, Ellecia Williams, and other staff of the USC Viterbi School of Engineering for their assistance throughout my Ph.D. studies. My research would not have been possible without the support of NSF Awards IIS-1755781 and IIS-1943607. I would like to thank Haipeng Luo, Shi Chen, Brendan Lucier, Alex Slivkins, Chu Wang, and Yingfei Wang for their generous help with my job search materials. I also would like to thank Shi Chen, Wei Gu, Xiao Jin, Zhuojin Li, Haipeng Luo, Yingfei Wang, and Chen-Yu Wei for many helpful suggestions on my job talk.

I moved several times during my Ph.D. study, and I would like to express my gratitude to all my wonderful roommates over the years: Zhuojin Li, Tengxiao Wang, Haodi Yang, Hongxun Yang, Mingxuan Li, Meiyi Chen, Minghan Cen, Yufan Xie, and Wei Gu. I fondly recall how we prepared the New Year's Eve dinner together every Spring Festival, and I cherish all those joyful moments we shared. Your companionship has been invaluable to me. I also want to thank my friends, including but not limited to Sihan Chen, Xuyang Fang, Xiao Jin, Shunan Mao, Chenhui Qian, Tiancheng Qin, Haoyang Sang, Lingda Wang, Qizhong Wang, Hui Xu, Yinchen Xu, Zhirong Xu, Yukun Yue, Chaolun Zhang, and Peng Zhao. The memories of our time together, from traveling to gaming nights to casual conversations, will forever be cherished. Your companionship colored this journey with warmth and laughter.

Finally, I want to express my heartfelt gratitude to my dear parents for their endless love and support during my Ph.D. journey. Every weekend, we chat over video and share the happenings in our lives. Though my research topics may have remained a mystery to you, your compassionate understanding and encouragement have consistently sustained me through the most challenging times. Your unwavering support has given me the strength and the power to achieve this accomplishment.

The Ph.D. is an arduous journey, but with the support from you all, it has blossomed into a colorful adventure filled with growth and discovery.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Useful Notations
1.2 Preliminaries of Online Learning
1.2.1 Types of Feedback
1.3 Methodologies
1.4 Outlines of the Thesis
Chapter 2: Robust and Adaptive Algorithm Design for Online Learning with Feedback Graphs
2.1 Small-loss Bounds for Bandits with Feedback Graphs
2.1.1 Problem Setup and Notations
2.1.2 Strongly Observable Graphs
2.1.2.1 Reducing the Amount of Uniform Exploration
2.1.2.2 Õ(√((s+1)L⋆)) Regret Bound
2.1.2.3 Õ(√((κ+1)L⋆)) Regret Bound
2.1.2.4 Õ(min{√(αT), √(κL⋆)}) Regret Bound for Self-Aware Graphs
2.1.3 Weakly Observable Graphs
2.1.4 Conclusions and Open Problems
2.2 High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs
2.2.1 Problem Setup and Notations
2.2.2 Optimal High-Probability Regret for Strongly Observable Graphs
2.2.3 High-Probability Regret for Weakly Observable Graphs
2.2.4 Conclusions and Open Problems
2.3 Efficient Contextual Bandits with Informed Feedback Graphs
2.3.1 Related Work
2.3.2 Problem Setting and Preliminary
2.3.3 Algorithms and Regret Bounds
2.3.3.1 Algorithms via Minimax Reduction Design
2.3.3.2 Regret Bounds
2.3.3.3 Implementation
2.3.4 Examples with Closed-Form Solutions
2.3.5 Experiments
2.3.5.1 SquareCB.G under Different Feedback Graphs
2.3.5.2 Comparison between SquareCB.G and SquareCB
2.3.6 Discussion
2.4 Efficient Contextual Bandits with Uninformed Feedback Graphs
2.4.1 Preliminary
2.4.2 Algorithms and Regret Guarantees
2.4.3 Analysis
2.4.3.1 Analysis for Partially Revealed Graphs
2.4.3.2 Analysis for Fully Revealed Graphs
2.4.4 Experiments
2.4.4.1 Empirical Results on Synthetic Data
2.4.5 Empirical Results on Real Auction Data
Chapter 3: Robust and Adaptive Algorithm Design for Linear Bandits
3.1 High-Probability Adaptive Regret Bounds for Linear Bandits
3.1.1 Multi-armed bandits: an illustrating example
3.1.2 Generalization to adversarial linear bandits
3.1.3 Generalization to adversarial MDPs
3.2 Switching Regret for Adversarial Linear Bandits
3.2.1 Problem Setup and Notations
3.2.2 Corralling a Larger Band of Bandits: A Recipe
3.2.3 Optimal Switching Regret for Linear Bandits over ℓp Balls
3.2.4 Extension to Unconstrained Linear Bandits
3.2.4.1 Black-Box Reduction for Switching Regret of Unconstrained Linear Bandits
3.2.4.2 Subroutine: Switching Regret of Unconstrained Online Convex Optimization
3.2.4.3 Summary: Comparator-Adaptive Switching Regret for Unconstrained Linear Bandits
3.2.5 Conclusion and Discussions
Chapter 4: Adaptive Bandit Convex Optimization with Heterogeneous Curvature
4.1 Preliminaries and Problem Setup
4.2 Smooth Convex Bandits with Heterogeneous Strong Convexity
4.2.1 Proposed Algorithm and Main Theorem
4.2.2 Implications of Theorem 4.2.1
4.3 Lipschitz Convex Bandits with Heterogeneous Strong Convexity
4.4 Conclusion
Bibliography
Appendix A: Omitted Details in Section 2.1
A.1 Proofs for Section 2.1.2.1
A.2 Proofs for Section 2.1.2.2
A.3 Omitted details for Section 2.1.2.3
A.3.1 Hedge with Adaptive Learning Rates
A.3.2 Proofs of Theorem 2.1.3
A.3.3 Adaptive Version of Algorithm 1
A.4 Omitted details for Section 2.1.2.4
A.5 Omitted Details for Section 2.1.3
A.5.1 Proof of Theorem 2.1.4
A.5.2 Proofs for Theorem 2.1.5
A.5.3 Adaptive Version of Algorithm 2 for Directed Complete Bipartite Graphs
A.5.4 Proof of Theorem 2.1.6
A.5.5 Adaptive Version of Algorithm 2 for General Weakly Observable Graphs
Appendix B: Omitted Details in Section 2.2
B.1 Omitted Details in Section 2.2.2
B.1.1 Proof of Theorem 2.2.1
B.1.2 Proof of Theorem 2.2.2
B.2 Proofs for Section 2.2.3
B.3 Auxiliary Lemmas
Appendix C: Omitted Details in Section 2.3
C.1 Omitted Details in Section 2.3.3
C.1.1 Proof of Theorem 2.3.2
C.1.2 Proof of Theorem 2.3.3
C.1.3 Python Solution to Eq. (2.28)
C.1.4 Proof of Theorem 2.3.4
C.2 Omitted Details in Section 2.3.4
C.2.1 Cops-and-Robbers Graph
C.2.2 Apple Tasting Graph
C.2.3 Inventory Graph
C.2.4 Undirected and Self-Aware Graphs
C.3 Implementation Details in Experiments
C.3.1 Implementation Details in Section 2.3.5.1
C.3.2 Implementation Details in Section 2.3.5.2
C.3.2.1 Details for Results on Random Directed Self-aware Graphs
C.3.2.2 Details for Results on Synthetic Inventory Dataset
C.4 Adaptive Tuning of γ without the Knowledge of Graph-Theoretic Numbers
C.4.1 Strongly Observable Graphs
C.4.2 Weakly Observable Graphs
Appendix D: Omitted Details in Section 2.4
D.1 Omitted Details in Section 2.4.2
D.1.1 Value of the Minimax Program
D.1.2 Parameter-Free Algorithm in the Partially Revealed Feedback Graphs Setting
D.2 Omitted Details in Section 2.4.3.2
D.3 Implementation Details
D.3.1 Omitted Details in Section 2.4.4
Appendix E: Omitted Details in Section 3.1
E.1 Omitted details for Section 3.1.1
E.1.1 Proof of Theorem 3.1.1
E.1.2 Proof of Lemma 3.1.1
E.1.3 Proof of Theorem 3.1.2
E.2 Omitted details for Section 3.1.2
E.2.1 More explanation on Algorithm 8
E.2.2 Preliminary for analysis
E.2.3 Proof of Theorem 3.1.3
E.2.3.1 Bounding Deviation
E.2.3.2 Bounding Reg-Term
E.2.3.3 Proof of Theorem 3.1.3
E.3 Omitted details for Section 3.1.3
E.3.1 Preliminary
E.3.2 Algorithm for MDPs
E.3.3 Proof of Theorem 3.1.4
E.3.3.1 Useful lemmas
E.3.3.2 Bounding Error
E.3.3.3 Bounding Bias-1
E.3.3.4 Bounding Bias-2
E.3.3.5 Bounding Reg-Term
E.3.3.6 Putting everything together
E.3.4 Issues of other potential approaches
Appendix F: Omitted Details in Section 3.2
F.1 Potential Approaches for Switching Regret of Linear Bandits
F.2 Omitted Details for Section 3.2.3
F.2.1 Pseudocode of Base Algorithm
F.2.2 Unbiasedness of Loss Estimators
F.2.3 Regret Decomposition
F.2.4 Bounding Deviation and Pos-Bias
F.2.5 Bounding Base-Regret
F.2.6 Bounding Meta-Regret
F.2.7 Proof of Theorem 3.2.1
F.3 Extension to Smooth and Strongly Convex Set
F.3.1 Main Results
F.3.2 Preliminary
F.3.3 Unbiasedness of Loss Estimator
F.3.4 Regret Decomposition
F.3.5 Bounding Deviation and Pos-Bias
F.3.6 Bounding Base-Regret
F.3.7 Bounding Meta-Regret
F.3.8 Proof of Theorem F.3.1
F.4 Omitted Details for Section 3.2.4
F.4.1 Proof of Lemma 3.2.1
F.4.2 Algorithm for Unconstrained OCO with Switching Regret
F.4.3 Proof of Theorem 3.2.2
F.4.4 Data-dependent Switching Regret of Unconstrained Online Convex Optimization
F.4.5 Proof of Theorem 3.2.3
F.5 Lemmas Related to Online Mirror Descent
Appendix G: Omitted Details in Chapter 4
G.1 Omitted Details for Section 4.2
G.1.1 Proof of Lemma 4.2.1
G.1.2 Stability Lemma
G.1.3 Proof of Lemma 4.2.2
G.1.4 Proof of Lemma 4.2.3
G.1.5 Proof of Theorem 4.2.1
G.1.6 Proofs for Implications of Theorem 4.2.1
G.2 Omitted Details for Section 4.3
G.2.1 Proof of Theorem 4.3.1
G.2.2 Proofs for Implications of Theorem 4.3.1
G.3 Self-concordant Barrier Properties
G.4 Additional Lemmas
G.4.1 FTRL Lemma
G.4.2 Relations among strong convexity, smoothness and Lipschitzness
List of Tables

1.1 Outline of the thesis
2.1 Main results and comparisons with prior work. T is the number of rounds, L⋆ ≤ T is the total loss of the best arm, and α, κ, and d are the independence, clique partition, and weak domination numbers respectively. For our results for weakly observable graphs, γ can be any value in [1/3, 1/2], i⋆ is the best arm, S is the set of nodes with a self-loop, L_{i⋆_S} is the loss of the best arm in S, L_D is the average loss of nodes in a weakly dominating set, and dependence on other parameters is omitted. All our algorithms have parameter-free versions.
2.2 Summary of our results and comparisons with prior work. T is the number of rounds. K is the number of actions. α_t and d_t are respectively the independence number and the weak domination number of feedback graph G_t at round t. The results of [10, 121] are for a fixed feedback graph G (so G_t = G, α_t = α, and d_t = d for all t). Our high-probability regret bound for weakly observable graphs omits some lower-order terms; see Theorem 2.2.3 for the complete form.
4.1 A summary of our results for bandit convex optimization over T smooth d-dimensional functions, the t-th of which is σ_t-strongly convex. 𝒯 ⊂ [T] is a subset of rounds with no strong convexity. The dependency on parameters other than d and T can be found in the respective corollary (see also Footnote ∗). Note that our results are all achieved by one single adaptive algorithm.
4.2 A summary of our results for bandit convex optimization over T Lipschitz d-dimensional functions, the t-th of which is σ_t-strongly convex. 𝒯 ⊂ [T] is a subset of rounds with no strong convexity. The dependency on parameters other than d and T can be found in the respective corollary. Note that our results are all achieved by one single adaptive algorithm.

List of Figures

2.1 Examples of feedback graphs
2.2 An illustration of Algorithm 1 for a graph with 7 nodes. Here, we have S = {1, 2, 5, 6, 7}, S̄ = {3, 4}, and κ = 2 with C_1 = {1, 2} and C_2 = {5, 6, 7} being a minimum clique partition of G_S. The meta-algorithm operates over nodes 3 and 4, and also the two cliques, each with a Hedge instance running inside.
2.3 Left figure: Performance of SquareCB.G on RCV1 dataset under three different feedback graphs. Right figure: Performance comparison between SquareCB.G and SquareCB under random directed self-aware feedback graphs.
2.4 Performance comparison between SquareCB.G and SquareCB on synthetic inventory dataset. Left figure: Results under fixed discretized action set. Right figure: Results under adaptive discretization of the action set. Both figures show the superiority of SquareCB.G compared with SquareCB.
2.5 Comparison among SquareCB.UG, SquareCB, greedy, and a trivial baseline on one synthetic dataset with diverse contexts (top figure) and another one with poor diversity (bottom figure).
2.6 Comparison among SquareCB.UG, SquareCB, greedy, and a trivial baseline on a real auction dataset.
3.1 An illustration of the concept of lifting, the conic hull, and the Dikin ellipsoid. In this example d is 2, and the pink disk at the bottom is the original decision set Ω. The gray dot w is a point in Ω. In Algorithm 8, we lift the problem from R² to R³ and obtain the lifted, orange decision set Ω̄. For example, w is lifted to the black dot w̄ = (w, 1). Then we construct the conic hull of the lifted decision set, that is, the gray cone, and construct a normal barrier for this conic hull. By Lemma E.2.1, the Dikin ellipsoid centered at w̄ of this normal barrier (the green ellipsoid) is always within the cone. In Algorithm 8, if w̄ is the OMD iterate, we explore and play an action within the intersection of Ω̄ and the Dikin ellipsoid centered at w̄, that is, the (boundary of the) blue ellipse.

Abstract

In recent years, online learning, or data-driven sequential decision making, has become a central component of Artificial Intelligence and has been widely applied in many real applications. Specifically, online learning means that the learner interacts with an unknown environment and learns the model on the fly, which is more challenging than the classic offline learning setting where the dataset is available to the learner at the beginning of the learning process. In this thesis, we focus on designing algorithms for online learning with two pivotal characteristics: robustness and adaptivity. Motivated by the existence of unpredictable corruptions and noise in real-world online learning applications such as E-commerce recommendation systems, robustness is an important and desired property. It means that the designed algorithm is guaranteed to perform well even in highly adversarial environments. In contrast, adaptivity complements robustness by enhancing performance in benign environments. In broader terms, adaptivity means that the designed algorithm is able to automatically scale with certain intrinsic properties that reflect the difficulty of the problem. In order to achieve adaptivity and robustness, in this thesis we utilize the following three methodologies: regularization, exploration, and aggregation. Regularization has been widely used in the field of machine learning to control the dynamics of the decisions, which is especially important when facing a possibly adversarial environment. In online learning problems, very often the learner can only observe partial information about the environment, making an appropriate exploration method crucial. Aggregation, a natural idea for achieving adaptivity, combines multiple algorithms that work well in different environments. Though intuitive, this requires non-trivial algorithm design for different online learning problems. Using these methodologies, in this thesis we design robust and adaptive learning algorithms for a wide range of online learning problems. We first consider the problem of multi-armed bandits with feedback graphs, which includes the classic full-information expert problem, multi-armed bandits, and beyond. Then, we consider more complex problems including linear bandits and convex bandits, which involve an infinite number of actions. We hope that the techniques and algorithms developed in this thesis can help improve current online learning algorithms for real-world applications.
Chapter 1
Introduction

In the domain of Artificial Intelligence, data-driven sequential decision-making has emerged as an indispensable component. Specifically, sequential decision-making, or online learning, entails a learner dynamically interacting with an unknown environment, adjusting and refining her strategy based on real-time feedback. This learning framework fits a wide variety of applications, including clinical trials [138], design of game AIs [27, 139], and many operational management applications. Concretely, one classic application is the inventory control problem [81, 22], where a retailer, unaware of the exact daily demand, must engage with the unknown market and determine the optimal inventory policy in real time. Another example is autobidding systems [84, 23], where learners (bidders) continually refine their bidding strategies throughout the process based on previous feedback from the platform and other bidders. Drawn by the vast potential of online learning, we focus on developing practical online learning algorithms with strong theoretical foundations. Specifically, this thesis focuses on designing algorithms with two important characteristics: robustness and adaptivity.

Robustness is important for ensuring the viability of an online learning algorithm, allowing it to be resistant to feedback corruptions and noisy inputs. In real-world applications, the environment may be attacked by adversaries or influenced by previous actions of the learner. For example, in E-commerce recommendation systems [105], while commendable products generally receive positive feedback, adversarial fake reviews can also mislead these platforms. In contrast to robustness, which assures good performance even in worst-case environments, adaptivity ensures that the algorithm excels when the environment is benign. A common instance is the inventory control problem mentioned before: in general, the market condition is consistent, meaning that the daily demands follow a predictable, stationary distribution. In broader terms, adaptivity implies that the algorithm is able to automatically scale with certain intrinsic properties that reflect the difficulty of the problem. Recognizing the importance of robustness and adaptivity in real-world applications, in this thesis we focus on designing algorithms that provably satisfy these two properties in various types of online learning problems. In the rest of this chapter, we introduce some frequently used notations and briefly discuss the methodologies we use and the outline of the thesis.

1.1 Useful Notations

Throughout this thesis, we denote {1, . . . , m} by [m] for some positive integer m. For a differentiable convex function ψ defined on a convex set Ω, the associated Bregman divergence is defined as D_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ for any two points x, y ∈ Ω. For a positive definite matrix M ∈ R^{d×d} and a vector u ∈ R^d, we define ∥u∥_M = √(u⊤Mu) to be the quadratic norm of u with respect to M. For a vector v, we use ∥v∥_2 to denote its Euclidean norm and ∥v∥_p to denote its ℓ_p norm. The notation Õ(·) may hide logarithmic dependence on various parameters that will be specified in the respective later chapters.

1.2 Preliminaries of Online Learning

In this section, we formally introduce the online learning framework. In this framework, the learner and the environment interact for T rounds. At each round t ∈ [T], the learner first selects an action a_t from a known action set Ω.
Simultaneously, the environment selects a loss function f_t : Ω → [−1, 1]. The learner suffers a loss f_t(a_t) and observes feedback that reveals certain information about f_t. The learner's performance is measured via the notion of regret, or static regret:

    Reg ≜ ∑_{t=1}^{T} f_t(a_t) − min_{u∈Ω} ∑_{t=1}^{T} f_t(u),    (1.1)

which is defined as the difference between the total loss suffered by the learner and the loss suffered by the optimal action in hindsight. We call an online learning algorithm no-regret if it guarantees that Reg = o(T). Two extensions of this static regret notion are also considered in this thesis.

S-switching regret. While static regret is a reasonable measure, a stronger notion of regret is the S-switching regret. Instead of comparing with the loss of the best fixed action in hindsight, the S-switching regret compares with the loss of a sequence of actions that switches at most S − 1 times. Formally, the regret is defined as follows:

    Reg(u_1, . . . , u_T) ≜ ∑_{t=1}^{T} f_t(a_t) − ∑_{t=1}^{T} f_t(u_t),    (1.2)

where the benchmark sequence (u_1, . . . , u_T) satisfies 1 + ∑_{t=2}^{T} 1{u_t ≠ u_{t−1}} ≤ S and u_t ∈ Ω for all t ∈ [T]. When S = 1, Eq. (1.2) reduces to Eq. (1.1) after taking the maximum over the benchmark sequence (u_1, . . . , u_T).

Contextual online learning. In the contextual online learning framework, before the learner selects her action a_t at each round t, she first observes a context x_t. Since the context can be very different from round to round, instead of comparing with the loss of the best fixed action, it is more reasonable to compare with the loss of the best policy, which maps contexts to actions. Specifically, the regret in the contextual setting is defined as follows:

    RegCB = ∑_{t=1}^{T} f_t(a_t) − min_{π⋆∈Π} ∑_{t=1}^{T} f_t(π⋆(x_t)),

where Π is a set of policies.

1.2.1 Types of Feedback

Throughout this thesis, we consider the following three types of feedback.

• Full-information. In this case, the learner knows the whole loss function f_t.

• Bandits. In this case, the learner only knows f_t(a_t), the loss of her chosen action.

• Graph feedback. In this case, the action set is a finite set of size K and there is a directed graph G = ([K], E), where [K] denotes the node set and E denotes the edge set. The learner can observe the loss of action j ∈ [K] if and only if the edge (a_t, j) is in the edge set E. More examples and motivations are given in Chapter 2.

1.3 Methodologies

In this section, we briefly introduce the three main methodologies that we use to design adaptive and robust algorithms. More details on how we apply these methodologies will be introduced in later chapters.

• Regularization. Regularization techniques are widely used in machine learning and artificial intelligence to control the dynamics of the learning process. In particular, in the context of online learning, regularization is important in order to achieve o(T) regret. More concretely, one can show that even under full-information feedback, the follow-the-leader (FTL) algorithm, which chooses a_t = argmin_{a∈Ω} ∑_{τ=1}^{t−1} f_τ(a) at each round t, suffers Θ(T) regret in the worst case. However, the regularized version of FTL, follow-the-regularized-leader (FTRL), is able to achieve o(T) regret. Specifically, FTRL picks a_t = argmin_{a∈Ω} ∑_{τ=1}^{t−1} f_τ(a) + ψ(a), where ψ is the regularizer. With certain choices of ψ, one can show that FTRL achieves o(T) regret for a broad class of loss functions. Throughout this thesis, we will show how to design different regularizers in order to achieve robust and adaptive guarantees.
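As a concrete illustration of the FTRL recipe just described (a minimal sketch of ours, not an algorithm from this thesis): over the probability simplex with the negative-entropy regularizer ψ(p) = (1/η) ∑_i p_i ln p_i and linear losses, the FTRL minimizer has the familiar exponential-weights closed form. The learning rate, the synthetic losses, and the helper name ftrl_entropy below are illustrative assumptions.

```python
import numpy as np

def ftrl_entropy(loss_vectors, eta=0.1):
    """FTRL on the simplex with regularizer psi(p) = (1/eta) * sum_i p_i log p_i
    and linear losses f_t(p) = <p, loss_t>; the argmin is exponential weights."""
    K = loss_vectors.shape[1]
    cum_loss = np.zeros(K)            # sum of past loss vectors
    total_learner_loss = 0.0
    for loss in loss_vectors:
        # FTRL iterate: p_t = argmin_p <p, cum_loss> + psi(p)  (closed form)
        w = np.exp(-eta * (cum_loss - cum_loss.min()))   # shift for stability
        p = w / w.sum()
        total_learner_loss += p @ loss                   # expected loss this round
        cum_loss += loss
    best_fixed = cum_loss.min()       # total loss of the best fixed action in hindsight
    return total_learner_loss - best_fixed               # static regret, Eq. (1.1)

# toy run: two actions, action 0 is slightly better on average
rng = np.random.default_rng(0)
losses = rng.random((1000, 2)) * np.array([0.9, 1.0])
print(ftrl_entropy(losses))   # grows like sqrt(T), i.e., o(T)
```

The same code run on FTL (eta → ∞, i.e., putting all mass on the current leader) can be forced to suffer linear regret on alternating losses, which is the contrast the paragraph above draws.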
• Exploration. As mentioned in Section 1.2.1, in online learning problems the learner may receive only partial feedback from the environment. Therefore, it is important for the learner to explore in order to gather more information for future decisions. A typical issue in sequential learning is balancing the so-called exploration-exploitation tradeoff. Specifically, exploration involves trying out different actions to gain more information about the environment, potentially leading to the discovery of better strategies in the long run. However, exploration may also yield lower immediate rewards compared to exploitation, which involves leveraging known information to maximize short-term gains. Therefore, finding the right balance between exploration and exploitation is crucial for achieving optimal performance in online learning scenarios.

• Aggregation. In order to be adaptive to different types of environments, one idea is to maintain a group of base algorithms, each of which performs well in a certain environment, and then design a meta algorithm that combines all the base algorithms. Though the idea is intuitive, this is challenging in the context of online learning, especially when only partial feedback is available. Specifically, the difficulty is to ensure that the meta algorithm achieves performance comparable to what each base algorithm would achieve if it were run on its own. Suppose that one base algorithm is initially exploratory but excels later on. Then this base algorithm may lose the chance to be selected by the meta algorithm due to its poor performance in the earlier phase, meaning that it never gets to explore enough to reach its good-performance regime. Therefore, the aggregation requires careful design, and we will discuss in detail how to design such meta algorithms in later chapters.

1.4 Outlines of the Thesis

In this thesis, we apply the techniques mentioned above to design adaptive and robust algorithms for a wide variety of online learning problems with partial information. Specifically, our main contributions are summarized below:

• In Chapter 2, we consider the problem of online learning with feedback graphs. We first derive learning algorithms for this problem achieving adaptive regret guarantees when the feedback graph is fixed and known (Section 2.1). Then, we extend the fixed feedback graph case to the time-varying feedback graph case and derive robust guarantees which hold with high probability (Section 2.2). Finally, we consider the contextual generalization of this problem and derive efficient learning algorithms when the feedback graph is either known (Section 2.3) or unknown (Section 2.4) to the learner.

• In Chapter 3, we study the linear bandits problem, where the loss function is linear and the learner can only observe the loss of her chosen action. While expected regret guarantees are extensively studied for linear bandits, high-probability regret guarantees, a more robust measure, are not well explored. Since in many online learning applications the learner has only one chance to access the sequential data, it is important to have high-probability regret guarantees.
In Section 3.1, we derive the first algorithm achieving the optimal high-probability regret guarantee for linear bandits. In Section 3.2, we study the S-switching regret defined in Eq. (1.2) and propose the first algorithm that achieves the optimal S-switching regret when Ω is the ℓ_p ball with p ∈ (1, 2].

• In Chapter 4, we consider the bandit convex optimization problem, where the loss function is convex in general. This includes the problem of linear bandits considered in Chapter 3. While previous works assume known and homogeneous curvature of these loss functions, we study a heterogeneous setting where each function has its own curvature that is only revealed after the learner makes a decision. We develop an efficient algorithm that is able to adapt to the curvature on the fly. Specifically, our algorithm not only recovers or even improves existing results for several homogeneous settings, but also leads to surprising results for various heterogeneous settings, providing guarantees that are adaptive to the easiness of the environment.

Table 1.1 provides a mapping between each chapter/section and the source paper of the content.

Table 1.1: A mapping from chapters/sections to papers
Chapter 2, Section 2.1: Lee, Luo, and Zhang [100]
Chapter 2, Section 2.2: Luo, Tong, Zhang, and Zhang [109]
Chapter 2, Section 2.3: Zhang, Zhang, Vrousgou, Luo, and Mineiro [148]
Chapter 2, Section 2.4: Zhang, Zhang, Luo, and Mineiro [147]
Chapter 3, Section 3.1: Lee, Luo, Wei, and Zhang [99]
Chapter 3, Section 3.2: Luo, Zhang, Zhao, and Zhou [112]
Chapter 4: Luo, Zhang, and Zhao [111]

Chapter 2
Robust and Adaptive Algorithm Design for Online Learning with Feedback Graphs

In this chapter, we consider the problem of online learning with feedback graphs. As mentioned in Chapter 1, in this problem the learner and the environment interact for T rounds. At each round t ∈ [T], the learner selects an action from a finite set of actions of size K and suffers the loss of the chosen action. Then, the learner's observation is defined by a directed feedback graph G_t = ([K], E_t), where [K] is the node set and E_t ⊆ [K] × [K] is the edge set. Each node i ∈ [K] in the graph G_t represents one of the actions in the action set, and if (i, j) ∈ E_t for some i, j ∈ [K], then the learner is able to observe the loss of action j if action i is selected. When the feedback graph is a clique, meaning that (i, j) ∈ E_t for all i, j ∈ [K] and t ∈ [T], it corresponds to classic online learning with full-information feedback [67], where the learner can observe the loss of each action no matter what action is selected. When the feedback graph only contains self-loops, meaning that E_t = {(i, i)}_{i∈[K]} for all t ∈ [T], it corresponds to the classic multi-armed bandit problem [19], where the learner can only observe the loss of her chosen action.

[Figure 2.1: Examples of feedback graphs. (a) Full information; (b) multi-armed bandits; (c) apple tasting; (d) revealing action.]

Besides these two classic examples, this framework in fact includes a broader range of interesting online learning applications. One example is the apple tasting problem [75]. In this problem, the learner is required to check a sequence of apples. When an apple comes, the learner needs to decide whether to taste the apple (action 1) in order to check whether the apple is rotten, or to ship the apple to the market (action 2). The learner incurs a unit loss if a rotten apple is shipped to the market or a good apple is eaten.
In this application, the learner observes the loss of both actions if she chooses action 1, since she then knows the quality of the apple, and she observes no information if she chooses action 2, since she does not know whether the apple is rotten. This application can then be modeled as online learning with a feedback graph, with G_t = ([2], {(1, 1), (1, 2)}) for all t ∈ [T]. Another example is the label efficient or revealing action problem (Example 6.4 in [40]). In this problem, action 1 is called the revealing action, which incurs a constant unit loss but reveals the loss of all actions. On the other hand, the learner observes nothing if she picks any of the actions 2 to K. In this example, we have G_t = ([K], {(1, i), i ∈ [K]}) for all t ∈ [T].

In the following, we introduce some graph-related notions and other useful notations.

Observability. Given a directed feedback graph G = ([K], E), define N^in(i) ≜ {j : (j, i) ∈ E} to be the set of nodes that can observe node i. We call a node i observable if N^in(i) ≠ ∅. We call an observable node a strongly observable node if either i ∈ N^in(i) or N^in(i) = [K]\{i}, and a weakly observable node otherwise. Similarly, we say a graph is observable if all of its nodes are observable. An observable graph is strongly observable if all nodes are strongly observable, and weakly observable otherwise.

Self-aware graphs. We denote by S ≜ {i ∈ [K] : i ∈ N^in(i)} the subset of nodes with a self-loop, and by S̄ ≜ [K]\S the subset of nodes without a self-loop. We further use s and s̄ to denote |S| and |S̄|. A graph G = ([K], E) is self-aware if S = [K], that is, every node has a self-loop, which is a special case of strongly observable graphs.

Independence sets and cliques. For completeness, we include the standard concepts of independence sets and cliques of a directed graph. An independent set I is a subset of nodes such that for any two distinct nodes i, j ∈ I, we have (i, j) ∉ E. The independence number α of a graph is the cardinality of its largest independent set. A clique C is a subset of nodes such that for any two distinct nodes i, j ∈ C, we have (i, j) ∈ E. A clique partition {C_1, . . . , C_m} of a graph is a partition of its nodes such that each C_k in this collection is a clique. The clique partition number of a graph is the cardinality of its smallest clique partition. Define κ as the clique partition number of the subgraph G_S obtained by restricting G to only the nodes in S. Note that α ≤ κ + 1 always holds.

Weakly dominating sets. Following [11], for a weakly observable graph, we define a weakly dominating set D to be a set of nodes such that all weakly observable nodes can be observed by at least one node in D. Define the weak domination number d of graph G as the cardinality of its smallest weakly dominating set.

The structure of this chapter is as follows:

• In Section 2.1, we consider the problem with the feedback graph known and fixed over rounds, meaning that G_t = G for all t ∈ [T]. Specifically, we study small-loss bounds for the adversarial multi-armed bandits problem with a feedback graph, that is, adaptive regret bounds that depend on the loss of the best arm or related quantities instead of the total number of rounds. We derive the first small-loss bound for general strongly observable graphs, resolving an open problem proposed in [113]. Specifically, we develop an algorithm with regret Õ(√((κ+1)L⋆)), where L⋆ is the loss of the best arm, and for the self-aware graph case, we improve the regret to Õ(min{√(αT), √(κL⋆)}).
Our results significantly improve and extend those of [113], who only consider self-aware undirected graphs. Furthermore, we also make the first attempt at deriving small-loss bounds for weakly observable graphs. We first prove that no typical small-loss bounds are achievable in this case, and then propose algorithms with alternative small-loss bounds in terms of the loss of some specific subset of arms. A surprising side result is that Õ(√T) regret is achievable even for weakly observable graphs as long as the best arm has a self-loop.

• In Section 2.2, we extend the problem with a fixed feedback graph to time-varying feedback graphs, where the G_t's change over time and can be adversarially chosen by the environment. Specifically, we study high-probability regret bounds for this problem, which are a more robust guarantee compared with expected regret bounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret Õ(√(∑_{t=1}^{T} α_t) + max_{t∈[T]} α_t) with high probability, where α_t is the independence number of the feedback graph at round t. Compared to the best existing result [121], which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any poly(K) dependence that can be prohibitively large for applications such as contextual bandits.

• In Section 2.3, we consider a contextual generalization of this problem, in which the learner is first given a context at each round and both the loss and the feedback graph can depend on the context. For this problem, we propose and analyze an approach to contextual bandits with feedback graphs based upon reduction to regression. The resulting algorithms are computationally practical and achieve established minimax rates.

• Finally, in Section 2.4, we further extend this contextual learning problem to the more challenging uninformed case, where the feedback graph G_t is either only revealed after the learner makes her decision or even never fully revealed at all. We develop the first contextual algorithm for such uninformed settings, via an efficient reduction to online regression over both the losses and the graphs. Importantly, we show that it is critical to learn the graphs using log loss instead of squared loss to obtain favorable regret guarantees. We also demonstrate the empirical effectiveness of our algorithm on a bidding application using both synthetic and real-world data.

2.1 Small-loss Bounds for Bandits with Feedback Graphs

In this section, we consider deriving learning algorithms that achieve adaptive performance guarantees for online learning with feedback graphs in the adversarial environment. As mentioned, adversarial multi-armed bandits with feedback graphs is an online learning model that generalizes the classic expert problem [67] as well as the standard multi-armed bandits problem [19]. In this model, the learner needs to choose one of the K arms at each round, while simultaneously the adversary decides the loss of each arm. Afterwards, the learner receives feedback based on a graph with the K arms as nodes. Specifically, the learner observes the loss of every arm to which the chosen arm is connected. Clearly, the full-information expert problem corresponds to having a complete feedback graph, while the standard multi-armed bandits problem corresponds to having a feedback graph with only self-loops. Alon et al. [11] provided a full characterization of the minimax regret for this problem.
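This characterization is stated in terms of two graph quantities introduced earlier in this chapter: the independence number α and the weak domination number d. Both are NP-hard to compute in general, so the brute-force sketch below (ours, with illustrative function names and a 0-indexed toy graph) is only meant to make the definitions concrete on tiny examples such as the revealing-action graph of Figure 2.1(d).

```python
from itertools import combinations

def in_neighbors(K, E, i):
    # N^in(i) = {j : (j, i) in E}
    return {j for (j, i2) in E if i2 == i}

def independence_number(K, E):
    # largest set with no edge (in either direction) between distinct members
    for size in range(K, 0, -1):
        for I in combinations(range(K), size):
            if all((i, j) not in E and (j, i) not in E
                   for i in I for j in I if i != j):
                return size
    return 0

def weak_domination_number(K, E):
    # smallest set D observing every weakly observable node
    def weakly_observable(i):
        nin = in_neighbors(K, E, i)
        return len(nin) > 0 and i not in nin and nin != set(range(K)) - {i}
    W = [i for i in range(K) if weakly_observable(i)]
    if not W:
        return 0
    for size in range(1, K + 1):
        for D in combinations(range(K), size):
            if all(in_neighbors(K, E, i) & set(D) for i in W):
                return size

# revealing-action graph on K = 4 arms: arm 0 observes every arm (incl. itself)
K = 4
E = {(0, j) for j in range(K)}
print(independence_number(K, E))      # 3: arms {1, 2, 3} share no edges
print(weak_domination_number(K, E))   # 1: {0} observes all weakly observable nodes
```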
Specifically, it was shown that the minimax regret for strongly observable graphs and that for weakly observable graphs are Θ̃(√(αT)) and Θ̃(d^{1/3} T^{2/3}) respectively, where T is the total number of rounds and α and d are the independence number and weak domination number of the feedback graph respectively. However, it is well known that more adaptive, data-dependent regret bounds are achievable for a wide range of online learning problems. Among them, perhaps the most common are the so-called first-order or small-loss bounds, which replace the dependence on T with the total loss of the best arm L⋆ ≤ T. Such bounds are usually never worse than the worst-case bounds, but could be potentially much smaller if a relatively good arm exists. Achieving small-loss bounds for bandits with feedback graphs has been surprisingly challenging. Lykouris, Sridharan, and Tardos [114] made the first attempt, and their algorithms achieve regret Õ(α^{1/3} L⋆^{2/3}) or Õ(√(κL⋆)) (κ is the clique partition number), but only for self-aware undirected graphs (self-aware means that every node has a self-loop). It was left as a major open problem whether better and more general small-loss bounds are achievable.

Table 2.1: Main results and comparisons with prior work. T is the number of rounds, L⋆ ≤ T is the total loss of the best arm, and α, κ, and d are the independence, clique partition, and weak domination numbers respectively. For our results for weakly observable graphs, γ can be any value in [1/3, 1/2], i⋆ is the best arm, S is the set of nodes with a self-loop, L_{i⋆_S} is the loss of the best arm in S, L_D is the average loss of nodes in a weakly dominating set, and dependence on other parameters is omitted. All our algorithms have parameter-free versions.

Strongly observable, general: minimax [11] Θ̃(√(αT)); [114] N/A; our work Õ(√((κ+1)L⋆)).
Strongly observable, special case (self-aware): minimax [11] Θ̃(√(αT)); [114] Õ(α^{1/3} L⋆^{2/3}) or Õ(√(κL⋆)) (undirected graphs only); our work Õ(min{√(κL⋆), √(αT)}).
Weakly observable, general: minimax [11] Θ̃(d^{1/3} T^{2/3}); [114] N/A; our work (no o(L⋆) bounds achievable) Õ(L_D^{1−γ}) if i⋆ ∈ S, and Õ(L_D^{(1+γ)/2}) otherwise.
Weakly observable, special case (directed complete bipartite): minimax Θ̃(T^{2/3}); our work Õ(√(L⋆)) if i⋆ ∈ S, and Õ(L_{i⋆_S}^{2/3}) otherwise.

Our work makes a significant step towards a full understanding of small-loss bounds for bandits with a fixed directed feedback graph. Specifically, our contributions are (see also Table 2.1):

• (Section 2.1.2.3) For general strongly observable graphs, we develop an algorithm that enjoys regret Õ(√((κ+1)L⋆)). This is the first small-loss bound for the general case, extending the results of [114] that only hold for self-aware undirected graphs and resolving an open problem therein.

• (Section 2.1.2.4) For the special case of self-aware (directed) graphs, we develop an algorithm with regret Õ(min{√(αT), √(κL⋆)}), again strictly improving [114] by providing an extra robustness guarantee (note that α ≤ κ always holds for self-aware graphs).

• (Section 2.1.3) For weakly observable graphs (where small-loss bounds have not been studied before at all), we prove that no algorithm can achieve typical small-loss bounds (such as o(L⋆)). Despite this negative result, we develop an algorithm with regret Õ(L_D^{2/3}), where L_D is the average loss of a weakly dominating set (and dependence on other parameters is omitted for simplicity). More generally, we also achieve different trade-offs between the case when the best arm has a self-loop and the case without, such as Õ(√(L_D)) versus Õ(L_D^{3/4}).
We further consider a special case with a complete bipartite graph, and show that our algorithm achieves Õ(√(L⋆)) when the best arm has a self-loop and Õ(L_{i⋆_S}^{2/3}) otherwise, where L_{i⋆_S} is the loss of the best arm with a self-loop.∗ A surprising implication of our result is that Õ(√T) regret is possible even for weakly observable graphs as long as the best arm has a self-loop.

∗ See Section 2.1.1 for a detailed definition of a complete bipartite graph.

• We also provide parameter-free versions of all our algorithms using sophisticated doubling tricks, which, we emphasize, is highly non-trivial for bandit settings, especially because some of our algorithms consist of a layered structure combining different subroutines.

Our algorithms are based on the well-known Online Mirror Descent framework, but importantly with a suite of different techniques including hybrid regularizers, an unconstrained loss-shifting trick, increasing learning rates, combining algorithms with partial information, adding correction terms to loss estimators, and their combination in an innovative way. We defer further discussion of the novelty of each component and comparisons with prior work to the description of each algorithm.

Related work. The bandits with feedback graphs model was first proposed by [115]. Later, Alon et al. [11, 13] gave a full characterization of the minimax regret for this problem. There are many follow-ups that consider different variants and extensions of this problem, such as [92, 55, 127, 14]. Small-loss bounds have been widely studied in the online learning literature. For the full-information expert problem, the classic Hedge algorithm [67] already achieves Õ(√(L⋆)) regret. For the standard multi-armed bandits problem and its variant semi-bandits, there are also several different approaches that achieve Õ(√(L⋆)) regret [9, 122, 63, 143, 38]. Even for the challenging contextual bandits setting (which is in fact a special case of learning with time-varying feedback graphs), it was shown by Allen-Zhu, Bubeck, and Li [8] that Õ(√(L⋆)) regret is also achievable. The work most related to ours is [114]. As mentioned, we significantly extend their results to more general graphs, including graphs with directed edges, graphs without self-loops, and even weakly observable graphs, and we also improve their bound for self-aware graphs. Our algorithms are also based on very different ideas compared to theirs, which are mainly built on the recursive arm-freezing technique. We point out, however, that they also studied high-probability bounds and time-varying graphs for some cases, which is not the focus of this work.

2.1.1 Problem Setup and Notations

Before the game starts, the adversary decides a sequence of T loss vectors ℓ_1, . . . , ℓ_T ∈ [0, 1]^K for some integers K ≥ 2 and T ≥ 2K, and a directed feedback graph G = ([K], E) for E ⊆ [K] × [K], which is fixed and known. Each node in the graph represents one of the K arms, and within Section 2.1 we use the terms "arm" and "node" interchangeably. At each round t ∈ [T], the learner selects an arm i_t ∈ [K] and incurs loss ℓ_{t,i_t}. At the end of this round, the learner receives feedback according to G. Specifically, the learner observes the loss of arm i for all i such that i_t ∈ N^in(i), where N^in(i) ≜ {j : (j, i) ∈ E} is the set of nodes that can observe i.
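To make the feedback model concrete, the following minimal sketch (ours, with illustrative names; the sampling distribution and loss values are arbitrary) simulates one round of this protocol on the apple-tasting graph and also forms the importance-weighted loss estimates that Section 2.1.2.1 introduces formally in Eq. (2.1).

```python
import numpy as np

def play_one_round(p_t, losses, E, rng):
    """One round of the graph-feedback protocol, plus the importance-weighted
    estimator hat-ell_{t,i} = ell_{t,i} * 1{i_t in N^in(i)} / W_{t,i}, where
    W_{t,i} = sum_{j in N^in(i)} p_{t,j} (cf. Eq. (2.1)); it is unbiased since
    arm i is observed with probability exactly W_{t,i}."""
    K = len(p_t)
    i_t = rng.choice(K, p=p_t)                               # learner samples an arm
    N_in = [{j for j in range(K) if (j, i) in E} for i in range(K)]
    observed = [i for i in range(K) if i_t in N_in[i]]       # arms whose loss is revealed
    ell_hat = np.zeros(K)
    for i in observed:
        W_ti = sum(p_t[j] for j in N_in[i])
        ell_hat[i] = losses[i] / W_ti
    return i_t, observed, ell_hat

# apple-tasting graph on 2 arms: E = {(1,1), (1,2)} in the 1-indexed notation of
# the text, i.e. {(0,0), (0,1)} with 0-indexed arms; arm 0 ("taste") sees both
# losses, arm 1 ("ship") sees nothing
E = {(0, 0), (0, 1)}
rng = np.random.default_rng(0)
p_t = np.array([0.3, 0.7])
losses = np.array([0.2, 1.0])
print(play_one_round(p_t, losses, E, rng))
```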
The expected regret with respect to an arm i is defined as Regi ≜ E "X T t=1 ℓt,it − X T t=1 ℓt,i# , which is the expected difference between the learner’s total loss and that of arm i (the expectation is with respect to the learner’s internal randomness). We denote the best arm by i ⋆ ≜ argmini∈[K] PT t=1 ℓt,i and define Reg ≜ Regi ⋆ , which is in consistent with the definition shown in Section 1.2 after taking expectation. Alon et al. [11] showed that the minimax regret (in terms of T) for strongly observable graphs and weakly observable graphs are Reg = Θ( e √ T) and Reg = Θ( e T 2/3 ) respectively. Our goal is to obtain the so-called small-loss regret bounds that could potentially be much smaller than the minimax bounds. Specifically, for an arm i, we denote its total loss by Li ≜ PT t=1 ℓt,i ≤ T, and we use 16 the shorthand L⋆ ≜ Li ⋆ . Our goal is to replace the dependence of T by L⋆ when bounding Reg, or more generally, to replace T by Li when bounding Regi for each arm i. Special cases. For weakly observable feedback graphs, we also consider a special case of weakly observable graphs with (i, j) ∈ E for every i ∈ S and every j ∈ S¯, and call it a directed complete bipartite graph. † Recall that S ≜ {i ∈ [K] : i ∈ Nin(i)} is the subset of nodes with a self-loop. Some of our algorithms rely on having a clique partition of GS, which we assume is given, even though it is in general NP-hard to find [87]. We emphasize that, however, our algorithms work with any clique partition and the bounds hold with κ replaced by the size of this partition. Similarly, we assume that a weakly dominating set of size d is given, but our algorithms work using any weakly dominating set and our bounds hold with d replaced by the size of this set. Other notations. For two matrices M1 and M2, M1 ⪯ M2 means that M2−M1 is positive semi-definite, and for two vectors v1 and v2, v1 ⪯ v2 means that v1 is coordinate-wise less than or equal to v2. We denote the (K − 1)-dimensional simplex by ∆(K), the diagonal matrix with v1, . . . , vK ∈ R on the diagonal by diag{v1, . . . , vK}, and the all-zero and all-one vectors in R K by 0 and 1 respectively. The notation Oe(·) hides logarithmic dependence on K and T. 2.1.2 Strongly Observable Graphs In this section, we focus on strongly observable graphs. We first show how to achieve Oe( p (κ + 1)L⋆) regret in general, and then discuss how to further improve it to Oe(min{ √ αT , √ κL⋆}) for the special case of self-aware graphs (Section 2.1.2.4). There are three key components/ideas to achieve Oe( p (κ + 1)L⋆) regret. Specifically, starting from the Exp3.G algorithm of [11], which is an instance of Online Mirror Descent (OMD) with the entropy †Note that unlike the traditional definition of bipartite graphs, here we allow additional edges other than those from S to S¯, as adding more edges only makes the problem easier. 17 regularizer, natural loss estimators for feedback graphs, and an additional Θ(1/ √ T) amount of uniform exploration, we make the following three modifications: • (Section 2.1.2.1) Reduce the amount of uniform exploration to Θ(1/T) and add a constant amount of log-barrier to the regularizer. We show that this modification maintains the same Oe( √ αT) regret as Exp3.G, but importantly, the smaller amount of uniform exploration is the key for further obtaining small-loss bounds. • (Section 2.1.2.2) Replace the entropy regularizer with the log-barrier regularizer for nodes in S. This leads to a small-loss bound of order Oe( p (s + 1)L⋆). 
• (Section 2.1.2.3) Create a clique partition for nodes in S, run a Hedge variant within each clique, and run the algorithm from Section 2.1.2.2 treating each clique as a meta-node, which finally improves the regret to Oe( p (κ + 1)L⋆). This part relies on highly nontrivial extensions of techniques from [7] on combining algorithms in the bandit setting. 2.1.2.1 Reducing the Amount of Uniform Exploration We start by describing the Exp3.G algorithm of [11]. It maintains a distribution pt ∈ ∆(K) for each time t, and samples it according to pt . Then a standard importance-weighted loss estimator ℓbt is constructed such that ℓbt,i = ℓt,i Wt,i 1 it ∈ N in(i) where Wt,i ≜ X j∈Nin(i) pt,j . (2.1) It is clear that E[ℓbt,i] = ℓt,i, that is, the estimator is unbiased. With such a loss estimator, the distribution is updated according to the classic OMD algorithm: pt+1 = argmin p∈Ω D p, ℓbt E + Dψ (p, pt), with p1 = argmin p∈Ω ψ(p). (2.2) 18 For Exp3.G, ψ(p) = 1 η P i∈[K] pi ln pi is the standard Shannon entropy regularizer with learning rate η ≤ 1/2, and Ω = n p ∈ ∆(K) : pi ≥ 2η K , ∀i ∈ [K] o is the clipped simplex and enforces O(η) amount of uniform exploration.‡ Standard OMD analysis shows that the instantaneous regret of Exp3.G against any u ∈ ∆(K) at time t is bounded as ⟨pt −u, ℓbt⟩ ≤ Dψ(u, pt)− Dψ(u, pt+1) +∥ℓbt∥ 2 ∇−2ψ(pt) . However, the last term ∥ℓbt∥ 2 ∇−2ψ(pt) (often called the local-norm term) could be prohibitively large for general strongly observable graphs. The analysis of Exp3.G overcomes this issue via a key loss shifting trick. Specifically, it is shown that the following more general bound holds D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) (2.3) for any z ≤ 1/η, and in particular, with z = P i∈S¯ pt,iℓbt,i, the local-norm term ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) is bounded by Oe(ηα) in expectation. This choice of z is indeed not larger than 1/η due to the form of ℓbt and importantly the O(η) amount of uniform exploration. The rest of the analysis is straightforward and shows that Reg = Oe( 1 η + ηαT), which is Oe( √ αT) by picking η = 1/ √ αT. To obtain small-loss bounds, one clear obstacle in Exp3.G is the uniform exploration, which contributes to O(ηT) = O( √ T) regret already by the above optimal choice of η. Our first step is thus to get rid of this large amount uniform exploration, and we take an approach that is completely different from [114]. Specifically, noting that the key reason to have uniform exploration is the constraint z ≤ 1/η in the loss shifting trick Eq. (2.3), our key idea is to remove this constraint completely, which turns out to be possible if the regularizer contains a constant amount of log-barrier (similar to [33, 151]), as shown in the following lemma. ‡ In the original Exp3.G algorithm, Ω is the exact simplex ∆(K) and uniform exploration is enforced by sampling it according to pt with probability 1 − 2η and according to a uniform distribution with probability 2η. Nevertheless, our slight modification essentially serves the same purpose and makes subsequent discussions easier. 19 Lemma 2.1.1 (Unconstrained Loss Shifting). Let pt+1 = argminp∈Ω ⟨p, ℓbt⟩ + Dψ(p, pt), for Ω ⊆ ∆(K) and ψ : Ω → R. Suppose (a) 0 ≤ ℓbt,i ≤ max n 1 pt,i , 1 1−pt,io , ∀i ∈ [K], (b) ∇−2ψ(p) ⪯ 4∇−2ψ(q) holds when p ⪯ 2q, (c) ∇2ψ(p) ⪰ diag {64K/p 2 1 , . . . , 64K/p 2 K} , ∀p ∈ Ω. Then we have D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + 8 min z∈R ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) . (2.4) Condition (a) is clearly satisfied for strongly observable graphs if ℓbt is defined by Eq. 
(2.1) since Wt,i ≥ pt,i fori ∈ S and Wt,i = 1−pt,i fori /∈ S. Condition (b) is trivially satisfied for all common regularizers for the simplex such as Shannon entropy, Tsallis entropy, log-barrier, and any of their combinations. Finally, to ensure Condition (c), one only needs to include a log-barrier component c P i∈[K] ln 1 pi for c ≥ 64K in the regularizer, whose Hessian is exactly diag{c/p 2 1 , . . . , c/p 2 K}. This inspires us to propose the following hybrid regularizer ψ(p) = 1 η X i∈[K] pi ln pi + c X i∈[K] ln 1 pi , (2.5) and we prove the following theorem. Theorem 2.1.1. The OMD update Eq. (2.2) with Ω = p ∈ ∆(K) : pi ≥ 1 T , ∀i ∈ [K] , ℓbt defined in Eq. (2.1), and ψ defined in Eq. (2.5) for c = 64K ensures Reg ≤ Oe 1 η + ηαT + K2 for any strongly observable graph. Choosing η = 1/ √ αT, we have Reg = Oe √ αT + K2 . Note that we still enforce a small 1/T amount of uniform exploration, which is important for a technical lemma (Lemma 5 of [11]), but this only contributes O(K) regret. Also, adding the log-barrier leads to a small O(K2 ) overhead in the Bregman divergence term Dψ(u, p1), but only makes the local-norm term smaller and thus minz ∥ℓbt−z·1∥ 2 ∇−2ψ(pt) is still of order Oe(ηα)in expectation. The proof of Theorem 2.1.1 is now straightforward and is deferred to Appendix A.1. 2.1.2.2 Oe p (s + 1)L⋆ Regret Bound Having solved the uniform exploration issue, we now discuss our first attempt to obtain small-loss bounds for strongly observable graphs. Inspired by the fact that for multi-armed bandits, that is, the case where E contains all the self-loops but nothing else, OMD with the log-barrier regularizer achieves a small-loss bound [63], we propose to replace the entropy regularizer with the log-barrier for all nodes in S, while keeping the hybrid regularizer Eq. (2.5) for nodes in S¯: ψ(p) = 1 η X i∈S ln 1 pi + 1 η X i∈S¯ pi ln pi + c X i∈S¯ ln 1 pi . (2.6) We note that it is important not to also use 1/η amount of log-barrier for nodes in S¯, since this leads to an overhead of K/η for the Bregman divergence term and thus at best gives Oe( √ KL⋆) regret. We prove the following theorem for our proposed algorithm. Theorem 2.1.2. OMD Eq. (2.2) with Ω = p ∈ ∆(K) : pi ≥ 1 T , ∀i ∈ [K] , ℓbt defined in Eq. (2.1), and ψ defined in Eq. (2.6) for c = 64K and η ≤ 1 64K ensures Reg = O s ln T +ln K η + ηL⋆ + K2 ln T for any strongly observable graph. Choosing η = min nqs+1 L⋆ , 1 64K o gives Oe p (s + 1)L⋆ + K2 . § While the algorithmic idea is straightforward, the main challenge in the analysis is to deal with the nodes in S¯. Specifically, for the particular choices of η and c, we know that the conditions of Lemma 2.1.1 hold, and the key is thus again to bound the local-norm term minz∈R ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) . Simply taking z = 0 or the previous choice z = P i∈S¯ pt,iℓbt,i from [11] does not give the desired bound. Instead, we propose a novel shift: z = ℓbt,i0 for some i0 ∈ S¯ with pt,i0 ≥ 1/2, or z = 0 if no such i0 exists. Direct calculations then show ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) = O(η⟨pt , ℓbt⟩). In expectation, the local-norm term is thus § In fact, the s dependence can be improved to the number of nodes that are not observed by every other nodes (by using log-barrier only for these nodes). However, we simplify the presentation with a looser bound since this improvement does not help improve the final main result in Section 2.1.2.3. 2 related to the loss of the algorithm ⟨pt , ℓt⟩, and it is well-known that this is enough for obtaining small-loss bounds. 
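To illustrate how the hybrid regularizer enters the update, here is a minimal numerical sketch (ours) of one OMD step, i.e., Eq. (2.2) with the regularizer of Eq. (2.6) over the clipped simplex, written with a generic off-the-shelf solver; the analysis does not depend on how this argmin is computed, and the function names below are introduced only for illustration.

import numpy as np
from scipy.optimize import minimize

def hybrid_regularizer(S_mask, eta, c):
    """Eq. (2.6): (1/eta) log-barrier on nodes in S, (1/eta) entropy + c log-barrier on S-bar."""
    def psi(p):
        return (-np.sum(np.log(p[S_mask])) / eta
                + np.sum(p[~S_mask] * np.log(p[~S_mask])) / eta
                - c * np.sum(np.log(p[~S_mask])))
    def grad(p):
        g = np.empty_like(p)
        g[S_mask] = -1.0 / (eta * p[S_mask])
        g[~S_mask] = (1.0 + np.log(p[~S_mask])) / eta - c / p[~S_mask]
        return g
    return psi, grad

def omd_step(p_t, loss_hat, psi, grad_psi, T):
    """Eq. (2.2): p_{t+1} = argmin_{p in Omega} <p, loss_hat> + D_psi(p, p_t), Omega = {p : p_i >= 1/T}."""
    def objective(p):
        bregman = psi(p) - psi(p_t) - grad_psi(p_t) @ (p - p_t)
        return p @ loss_hat + bregman
    constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
    bounds = [(1.0 / T, 1.0)] * len(p_t)
    return minimize(objective, p_t, bounds=bounds, constraints=constraints, method="SLSQP").x

# Example: 4 arms, arms {0, 1} have self-loops, uniform prior, one estimated loss vector.
S_mask = np.array([True, True, False, False])
psi, grad_psi = hybrid_regularizer(S_mask, eta=0.1, c=64 * 4)
p_next = omd_step(np.ones(4) / 4, np.array([0.5, 0.0, 1.0, 0.2]), psi, grad_psi, T=1000)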
For the complete proof, see Appendix A.2. 2.1.2.3 Oe p (κ + 1)L⋆ Regret Bound Finally, we discuss how to further improve our bound to Oe( p (κ + 1)L⋆). Let C1, . . . , Cκ be a clique partition of GS (recall that GS is G restricted to S). Essentially, we compress each clique as one metanode, and together with nodes from S¯, this creates a meta-graph with β ≜ κ+ ¯s nodes, which can be seen as a “low-resolution” version of G. We index these meta-nodes as 1, . . . , κ, and without loss of generality we assume that the original indices of nodes in S¯ are κ+ 1, . . . , κ+ ¯s, so that [β] is the set of nodes for this meta-graph, and [κ] is the set of nodes with a self-loop (taking the same role as S in the original graph). If we were actually dealing with a problem with this meta-graph, running the algorithm from Theorem 2.1.2 would thus give Oe( p (κ + 1)L⋆) regret. To solve the original problem, however, we need to specify what to do when a meta-node is selected. Note that within a meta-node, we are essentially facing the classic expert problem with fullinformation [67] since nodes are all connected with each other. A natural idea is thus to run an expert algorithm with a small-loss bound within each clique, and when a clique is selected, follow the suggestion of the corresponding expert algorithm. We choose to use an adaptive version of Hedge [67] as the expert algorithm, with details deferred to Algorithm 15 in Appendix A.3. Importantly, the regret of Hedge has only logarithmic dependence on the number of nodes and thus does not defeat the purpose of obtaining Oe( p (κ + 1)L⋆) regret. Figure 2.2 illustrates the main idea of our algorithm, and Algorithm 1 shows the complete pseudocode. We use i to index a node in the original graph and j to index a node in the meta-graph. Note that nodes from S¯ appear in both graphs so they are indexed by either i or j, depending on the context. The κ instances of Hedge are denoted by A1, . . . , Aκ, where Aj only operates over nodes in Cj . For notational 2 1 2 3 4 5 6 7 META-ALGORITHM A2 HEDGE A1 HEDGE C1 C2 meta-node (clique) algorithm observes the loss of operates over node Figure 2.2: An illustration of Algorithm 1 for a graph with 7 nodes. Here, we have S = {1, 2, 5, 6, 7}, S¯ = {3, 4}, and κ = 2 with C1 = {1, 2} and C2 = {5, 6, 7} being a minimum clique partition of GS. The meta-algorithm operates over nodes 3 and 4, and also the two cliques, each with a Hedge instance running inside. convenience, however, we require Aj to output at time t a distribution pe (j) t ∈ ∆(K) over all nodes with the constraint pe (j) t,i = 0 for all i /∈ Cj (Line 1), and we also feed the estimated losses of all nodes to Aj (Line 6), even though it ignores those outside Cj . The algorithm maintains a distribution pt ∈ ∆(β) for the meta-graph, updated using the algorithm from Theorem 2.1.2 (Line 8). The only difference is that we use a time-varying regularizer defined in Eq. (2.7), where the learning rate ηt,j for meta-node j ∈ [κ] is time-varying and also different for different j (all starting from η1,j = η; more explanation to follow). At the beginning of time t, the algorithm first samples jt ∈ pt . If jt happens to be a node in S¯, we play it = jt ; otherwise, we sample it from the distribution received from Ajt . See Line 2 and Line 3. After playing arm it and receiving loss feedback, we construct loss estimator ℓet ∈ R K for nodes in G (Line 5) and estimator ℓbt ∈ R β for nodes in the meta-graph (Line 7). 
The estimator for nodes in S¯, for either G or the meta-graph, is exactly the same as the standard one described in Eq. (2.1). The estimator for a node i ∈ S also essentially follows Eq. (2.1), except that we ignore all edges that point to i but are not from those nodes in the same clique, so the probability of observing i is pt,j for j being the index of 23 Algorithm 1 Algorithm for General Strongly Observable Graphs Input: Feedback graph G and a clique partition {C1, . . . , Cκ} of GS, parameters η, c. Define: β = κ + ¯s and Ω = p ∈ ∆(β) : pj ≥ 1 T , ∀j ∈ [β] . Initialize: p1 = argminp∈Ω ψ1(p) (see Eq. (2.7)), η1,j = η, ρ1,j = 2κ, ∀j ∈ [κ]. Initialize: an instance Aj of adaptive Hedge (Algorithm 15) with nodes in Cj , ∀j ∈ [κ]. for t = 1, 2, . . . , T do 1 For each j ∈ [κ], receive pe (j) t ∈ {p ∈ ∆(K) : pt,i = 0, ∀i /∈ Cj} from Aj . 2 Sample jt ∼ pt . 3 if jt ∈ [κ] then draw it ∼ pe (jt) t ; else let it = jt .; 4 Pull arm it and receive feedback ℓt,i for all i such that it ∈ Nin(i). 5 Construct estimator ℓet ∈ R K such that ℓet,i = ( ℓt,i pt,jt 1{jt ∈ [κ], i ∈ Cjt}, for i ∈ S, ℓt,i 1−pt,i 1{i ̸= it}, for i ∈ S. ¯ 6 For each j ∈ [κ], feed ℓet to Aj . 7 Construct estimator ℓbt ∈ R β for meta-nodes such that ℓbt,j = (D pe (j) t , ℓet E , for j ∈ [κ], ℓet,j , for j ∈ S. ¯ 8 Compute pt+1 = argminp∈Ω nDp, ℓbt E + Dψt (p, pt) o where ψt(p) = X j∈[κ] 1 ηt,j ln 1 pj + 1 η X j∈S¯ pj ln pj + c X j∈S¯ ln 1 pj . (2.7) 9 for j ∈ [κ] do if 1 pt+1,j > ρt,j then set ρt+1,j = 2 pt+1,j , ηt+1,j = ηt,je 1 ln T .; else set ρt+1,j = ρt,j , ηt+1,j = ηt,j .; the clique to which i belongs.¶ Finally, the estimator for a meta-node j ∈ [κ] is simply ⟨pe (j) t , ℓet⟩, which is the estimated loss of the corresponding Hedge Aj . While the idea of combining algorithms in such a two-level hierarchy is natural and straightforward, doing it in a partial-information setting is notoriously challenging and requires extra techniques, as explained in detail in [7]. In a word, the difficulty is that the Hedge algorithms do not always receive feedback and thus do not yield the usual regret bound as one would get for a full-information problem. To address this issue, we apply the increasing learning rate technique from [7]. Specifically, we maintain a threshold ρt,j for each time t and each meta-node j (starting from ρ1,j = 2κ). Every time after the OMD update, if pt+1,j becomes too small and 1/pt+1,j exceeds the threshold ρt,j , we increase the learning rate for j by a ¶One could also follow exactly Eq. (2.1) to construct more complicated estimators, but it does not lead to a better bound. 24 factor of e 1 ln T and set the new threshold to be ρt+1,j = 2/pt+1,j (Line 9). The high-level idea behind this technique is that when the probability of picking a clique is small and thus the corresponding Hedge receives little feedback, increasing the corresponding learning rate allows its faster recovery, should a node in the clique become the best node later. For more intuition, we refer the reader to [7]. We are now ready to show the main theorem of this section. Theorem 2.1.3. Algorithm 1 with c = 64β and η ≤ ηmax ≜ min n 1 64β , 1 1000(ln T) ln2 (KT) o guarantees: Reg ≤ Oe κ ln T + ln K η + ηL⋆ + β 2 ln T for any strongly observable graph. Choosing η = min n ηmax, qκ+1 L⋆ o gives Oe p (κ + 1)L⋆ + β 2 . Our algorithm strictly improves over the GREEN-IX-Graph algorithm of [114] (in terms of expected regret), which achieves the same bound but only for undirected self-aware graphs. The proof of Theorem 2.1.3 is deferred to Appendix A.3. 
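The two-level structure just described can be summarized in the following illustrative Python sketch (ours), which covers only the sampling step, the estimator constructions of Lines 5 and 7, and the learning-rate increase of Line 9; the meta OMD update of Line 8 and the adaptive Hedge subroutine (Algorithm 15) are abstracted away, and only observed losses are ever read.

import numpy as np

def sample_and_estimate(p_meta, cliques, S_bar, hedge_dists, loss_t, rng):
    """One round of the two-level scheme of Algorithm 1 (sampling plus Lines 5 and 7).

    p_meta: distribution over the beta = kappa + |S_bar| meta-nodes;
    cliques: list of arrays C_1, ..., C_kappa of original node indices;
    S_bar: array of original nodes without a self-loop;
    hedge_dists[j]: the distribution of Hedge instance A_j over C_j.
    loss_t is passed in full for brevity, but only observed entries are used below.
    """
    kappa, K = len(cliques), len(loss_t)
    j_t = rng.choice(len(p_meta), p=p_meta)
    i_t = rng.choice(cliques[j_t], p=hedge_dists[j_t]) if j_t < kappa else S_bar[j_t - kappa]

    ell_tilde = np.zeros(K)                      # Line 5: estimator for original nodes
    if j_t < kappa:                              # the selected clique is fully observed
        C = cliques[j_t]
        ell_tilde[C] = loss_t[C] / p_meta[j_t]
    for idx, i in enumerate(S_bar):              # nodes without self-loop: observed unless played
        if i != i_t:
            ell_tilde[i] = loss_t[i] / (1.0 - p_meta[kappa + idx])

    ell_hat = np.zeros(len(p_meta))              # Line 7: estimator for meta-nodes
    for j, C in enumerate(cliques):
        ell_hat[j] = hedge_dists[j] @ ell_tilde[C]
    ell_hat[kappa:] = ell_tilde[S_bar]
    return j_t, i_t, ell_tilde, ell_hat

def maybe_increase_lr(p_next, rho, eta_meta, T):
    """Line 9: when 1/p_{t+1,j} exceeds its threshold, raise clique j's learning rate."""
    for j in range(len(rho)):
        if 1.0 / p_next[j] > rho[j]:
            rho[j] = 2.0 / p_next[j]
            eta_meta[j] *= np.exp(1.0 / np.log(T))
    return rho, eta_meta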
Moreover, in Appendix A.3.3 we also provide an adaptive version of our algorithm with a sophisticated doubling trick, which achieves the same bound but without the need of knowing L⋆ for learning rate tuning. We remark that doing doubling trick in such a two-level structure and with partial information is highly non-trivial. 2.1.2.4 Oe min{ √ αT , √ κL⋆} Regret Bound for Self-Aware Graphs Although our bound Oe( p (κ + 1)L⋆) could be much smaller than the worst-case bound Oe( √ αT), it is not always better since α ≤ κ + 1. To remedy this drawback, we propose another algorithm with an improved Oe(min{ √ αT , √ κL⋆}) regret for the special case of self-aware graphs. The high-level idea is to first run an algorithm with Oe( √ κL⋆) regret while keeping an estimate of L⋆, and when we are confident that Oe( √ αT) is the better bound, switch to any algorithm with Oe( √ αT) regret. For the first part, using Algorithm 1 would create some technical issues and we are unable to analyze it unfortunately (otherwise 2 we could have dealt with general strongly observable graphs). Instead, we introduce a new algorithm using a similar clipping technique of [114]. We emphasize that the key challenge here is that the algorithm has to be adaptive in the sense that it does not need the knowledge of L⋆ — otherwise, simply comparing the two bounds and running the corresponding algorithm with the better bound solves the problem already. We again design a sophisticated doubling trick to resolve this issue. All details are included in Appendix A.4. 2.1.3 Weakly Observable Graphs In this section, we consider small-loss bounds for weakly observable graphs, which have not been studied before. Recall that the minimax regret bound in this case is Θ( e d 1/3T 2/3 ) where d is the weak domination number. The most natural small-loss bound one would hope for is therefore Θ( e d 1/3L 2/3 ⋆ ). However, it turns out that this is not achievable, and in fact, no typical small-loss bounds are achievable for any weakly observable graph, as we prove in the following theorem. Theorem 2.1.4. For any weakly observable graph and any algorithm A (without the knowledge of L⋆), if A guarantees Oe(1) regret when L⋆ = 0 (ignoring dependence on K), then there exists a sequence of loss vectors such that the regret of A is Ω(T 1−ε ) for any ε ∈ (0, 1/3). Note that this precludes small-loss bounds such as Oe(L 2/3 ⋆ ), or even Oe(min{L a ⋆ , T 2/3}) for any a > 0. The proof crucially relies on the fact that for a weakly observable graph, one can always find a pair of nodes u and v such that neither of them can observe u. See Appendix A.5 for details. Despite this negative result, nevertheless, we provide alternative first-order regret bounds in terms of the loss of some specific subset of nodes. Specifically, ignoring dependence on other parameters, for the special case of directed complete bipartite graphs, we obtain regret Oe( √ L⋆) when i ⋆ ∈ S and Oe(L 2/3 i ⋆ S ) otherwise, where i ⋆ S = argmini∈S PT t=1 ℓt,i is the best node with a self-loop. Moreover, for general weakly observable graphs, we achieve regret Oe( √ LD) when i ⋆ ∈ S and Oe(L 3/4 D ) otherwise (or other different 26 Algorithm 2 Algorithm for Weakly Observable Graphs Input: Feedback graph G, decision set Ω, parameter η ≤ 1 5 and η¯. Define: hybrid regularizer ψ(p) = 1 η P i∈S ln 1 pi + 1 η¯ P i∈S¯ pi ln pi . Initialize: p1 is such that p1,i = 1 2s for i ∈ S and p1,i = 1 2¯s for i ∈ S¯. for t = 1, 2, . . . 
, T do 1 Play arm it ∼ pt and receive feedback ℓt,i for all i such that it ∈ Nin(i). 2 Construct estimator ℓbt such that ℓbt,i = ℓt,i Wt,i 1{it ∈ Nin(i)} where Wt,i = P j∈Nin(i) pt,j . 3 Construct correction term at such that at,i = ( 2ηpt,iℓb2 t,i, for i ∈ S, 2¯ηℓb2 t,i, for i ∈ S. ¯ 4 Compute pt+1 = argminp∈Ω ⟨p, ℓbt + at⟩ + Dψ(p, pt) . trade-offs between the two cases), where D is a weakly dominating set and LD ≜ 1 |D| P i∈D PT t=1 ℓt,i is the average total loss of nodes in D. Our algorithm is summarized in Algorithm 2. The key algorithmic idea is inspired by the work of [143]. They show that a variant of OMD with certain correction terms added to the loss estimators leads to a regret bound on Regi where the typical local-norm term ∥ℓbt∥ 2 ∇−2ψ(pt) is replaced by a term that is only in terms of the information of arm i. For our problem this is the key to achieve different orders of regret for different arms. More specifically, the algorithm performs a standard OMD update, except that ℓbt is replaced by ℓbt + at for some correction terms at (Line 4). Before specifying our correction term, we first describe the regularizer. Similar to the algorithms for strongly observable graphs, we again use a hybrid regularizer ψ(p) = 1 η P i∈S ln 1 pi + 1 η¯ P i∈S¯ pi ln pi , that is, log-barrier for nodes in S and entropy for nodes in S¯. Note that we do not enforce a small amount of log-barrier for every node as in Section 2.1.2 (reasons to follow). Also note that the learning rate for nodes in S and S¯ are different (η and η¯ respectively), which is also crucial for getting different orders of regret for different nodes. In light of this specific choice of regularizer, our correction term at is defined as in Line 3, because ηpt,iℓb2 t,i is the typical correction term for log-barrier [143], and on the other hand η¯ℓb2 t,i is the typical one for entropy [136]. Mixing these two correction terms is novel as far as we know. 27 The estimator ℓbt is constructed exactly by Eq. (2.1), and it remains to specify the decision set Ω ⊆ ∆(K), which is different for different cases and will be discussed separately. In both cases, the decision set is such that some relatively large amount of uniform exploration is enforced over a subset of nodes, which is also the reason why the small amount of log-barrier is not needed anymore. Similar to the analysis of [143], we prove the following lemma (see Appendix A.5). Lemma 2.1.2. Algorithm 2 ensures D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + ⟨u, at⟩ for all u ∈ Ω, as long as Ω is a subset of {p ∈ ∆(K) : P j∈Nin(i) pt,j ≥ 5¯η, ∀i ∈ S¯}. Naturally, to compete with node i, we let u almost concentrate on i, in which case ⟨u, at⟩ is roughly at,i (only in terms of i; key difference compared to Eq. (2.4)). To understand why this is useful, consider the case when i ∈ S so at,i is ηpt,iℓb2 t,i. By the construction of the loss estimator, the latter is bounded by ηℓt,i in expectation, which is the key of getting Oe( √ T) regret in this case. Directed Complete Bipartite Graphs. For the special case of directed complete bipartite graphs, we take Ω = p ∈ ∆(K) : P i∈S pi ≥ √ η¯ , which ensures that every node in S¯ is observed with probability at least √ η¯ (by the definition of directed complete bipartite graphs). This, however, unavoidably introduces dependence on Li ⋆ S when bounding Regi for i ∈ S¯, as shown below. Theorem 2.1.5. 
For any directed complete bipartite graph, Algorithm 2 with η ≤ 1 5 , η¯ ≤ 1 25 and Ω = p ∈ ∆(K) : P i∈S pi ≥ √ η¯ guarantees: Regi ≤ s ln T η + 2ηLi + 2s, for i ∈ S, 2s ln T η + 2 ln K η¯ + 2√ ηL¯ i ⋆ S + 2√ ηL¯ i + 2s, for i ∈ S, ¯ where i ⋆ S = argmini∈S PT t=1 ℓt,i. Choosing η = min q s Li ⋆ S , 1 5 and η¯ = min n L −2/3 i ⋆ S , 1 25o , we have: Regi = Oe √ sLi + s for i ∈ S and Regi = Oe L 2/3 i ⋆ S + psLi ⋆ S + s for i ∈ S¯. 2 Note that even though Alon et al. [11] show that the worst-case regret of any weakly observable graph is Ω(T 2/3 ), our result indicates that for directed complete bipartite graphs, one can in fact achieve much better regret of order Oe( √ T) when the best node has a self-loop, while still maintaining the worst-case regret Oe(T 2/3 ). Moreover, in the former case, we can even achieve a typical small-loss bound, while in the latter case, the regret could be better than Oe(T 2/3 ) as long as the best node in S has sublinear total loss. In Appendix A.5.3, we also provide an adaptive version of the algorithm without the need of knowing Li or Li ⋆ S to tune learning rates (while maintaining the same bound), which requires a nontrivial combination of a clipping technique and doubling trick. General Case. For general weakly observable graphs, following similar ideas of forcing the algorithm to observe nodes in S¯ with a large enough probability, we take Ω = {p ∈ ∆(K) : pi ≥ δ, ∀i ∈ D} where D is a minimum weakly dominating set with size d and δ is some parameter. By definition, this ensures that each node in S¯ is observed with probability at least δ. However, this also introduces dependence on LD for Regi , even when i ∈ S, as shown in the following theorem. Theorem 2.1.6. For any weakly observable graph, Algorithm 2 with 1 T ≤ δ ≤ min 1 125 , 1 4s , 1 4d , η ≤ 1 25 , η¯ ≤ δ 4 3 , and Ω = {p ∈ ∆(K) : pi ≥ δ, ∀i ∈ D} guarantees: Regi ≤ 2s ln T η + 2ηLi + 2δdLD + 2s, for i ∈ S, s ln T η + ln(2¯s) η¯ + 2¯ηLi δ + 2δdLD + 2s, for i ∈ S. ¯ For any γ ∈ [ 1 3 , 1 2 ], setting δ = min{ 1 125 , 1 4s , 1 4d , L−γ D }, η = min nq 1 LD , 1 25o , and η¯ = min nq δ LD , δ 4 3 o gives Reg = Oe L 1−γ D if i ⋆ ∈ S and Oe L (1+γ)/2 D otherwise (ignoring dependence on s and d). Note that unlike the special case of directed complete bipartite graphs, we face a trade-off here when setting the parameters, due to the extra parameter δ that appears in both cases (i ∈ S or i ∈ S¯). For example, when picking γ = 1 3 , we achieve Reg = Oe L 2/3 D always, better than the worst-case bound as 2 long as LD is sublinear. On the other hand, picking γ = 1 2 , we achieve Reg = Oe √ LD when i ⋆ ∈ S and Oe L 3/4 D otherwise. Once again, we provide an adaptive version in Appendix A.5.5. 2.1.4 Conclusions and Open Problems In Section 2.1, we provide various new results on small-loss bounds for bandits with a directed feedback graph (either strongly observable or weakly observable), making a significant step towards a full understanding of this problem. One clear open question is whether one can achieve Oe( √ αL⋆) regret for strongly observable graphs, which would be a strict improvement over the minimax regret Oe( √ αT). Note again that our bound Oe( p (κ + 1)L⋆) is weaker since α ≤ κ + 1 always holds. Achieving this ideal bound appears to require new ideas. Even for the special case of self-aware graphs, the problem remains challenging and the closest result is the bound Oe(α 1/3L⋆ 2/3 ) by Lykouris, Sridharan, and Tardos [114]. 
Another future direction is to generalize our results to time-varying feedback graphs, which also appears to require new ideas.

2.2 High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs

In Section 2.1, we consider the problem where the feedback graph is fixed and known to the learner. However, as mentioned in [115], in many real-world applications, the feedback graphs can change arbitrarily over time. For example, in the web advertising application, the learner picks an ad at each round and then observes whether the user likes or dislikes the ads that are similar to the selected ad. However, the similarity among the ads can be very different for different users and can change in an arbitrary way. Moreover, while, as mentioned in Section 2.1, Alon et al. [10] characterized the minimax expected regret for this problem with a fixed feedback graph $G$ by providing both a regret lower bound and an algorithm achieving a matching regret upper bound, it is known that these algorithms exhibit a huge variance and can in fact suffer $\Theta(T)$ regret with constant probability (see [98]), which is clearly undesirable in practice. To mitigate this issue, Alon et al. [13] designed an algorithm called ELP.P, which ensures $\widetilde{O}(\sqrt{mT} + m^2 T^{1/4})$ regret with high probability for self-aware graphs (a special case of strongly observable graphs in which every node has a self-loop), where $m$ is the size of the maximal acyclic subgraph of $G$ and can be much larger than $\alpha$. On the other hand, Neu [121] developed the Exp3-IX algorithm, which uses implicit exploration in the loss estimator construction and achieves an $\widetilde{O}(\sqrt{\alpha T} + K)$ high-probability regret bound, also for self-aware graphs. While the bound is almost optimal, the additional $K$ term could be prohibitively large for applications such as contextual bandits where $K$ is the number of policies (usually considered as exponentially large).

In this section, we significantly improve these results and extend them to more general graphs. For full generality, we also consider a sequence of time-varying feedback graphs $G_1, \ldots, G_T$, each of which can be chosen adaptively by the environment based on the learner's previous actions. We denote the independence number of $G_t$ by $\alpha_t$ and its weak domination number by $d_t$.

Table 2.2: Summary of our results and comparisons with prior work. $T$ is the number of rounds. $K$ is the number of actions. $\alpha_t$ and $d_t$ are respectively the independence number and the weak domination number of feedback graph $G_t$ at round $t$. The results of [10, 121] are for a fixed feedback graph $G$ (so $G_t = G$, $\alpha_t = \alpha$, and $d_t = d$ for all $t$). Our high-probability regret bound for weakly observable graphs omits some lower-order terms; see Theorem 2.2.3 for the complete form.

Graph type                       | Expected regret [10]                        | High-probability regret [121]        | High-probability regret (our work)
Strongly observable, general     | $\widetilde{O}(\sqrt{\alpha T})$            | N/A                                  | $\widetilde{O}(\sqrt{\sum_{t=1}^T \alpha_t} + \max_{t\in[T]} \alpha_t)$
Strongly observable, self-aware  | $\widetilde{O}(\sqrt{\alpha T})$            | $\widetilde{O}(\sqrt{\alpha T} + K)$ | $\widetilde{O}(\sqrt{\sum_{t=1}^T \alpha_t} + \max_{t\in[T]} \alpha_t)$
Weakly observable                | $\widetilde{O}(d^{1/3}T^{2/3} + \sqrt{KT})$ | N/A                                  | $\widetilde{O}((\sum_{t=1}^T d_t)^{1/3} T^{1/3} + \frac{1}{T}\sum_{t=1}^T d_t)$

Our main contributions are (see also Table 2.2):

• In Section 2.2.2, we start with a refined analysis showing that Exp3-IX of [121] in fact achieves a better $\widetilde{O}((\sum_{t=1}^T \alpha_t)^{1/2} + \max_{t\in[T]} \alpha_t)$ high-probability regret bound for self-aware graphs, removing the $\widetilde{O}(K)$ dependence of [121].
We then extend the same bound to the more general strongly observable graphs via a new algorithm that, on top of the implicit exploration technique of Exp3-IX, further injects certain positive bias to the loss estimator of an action that has no self-loop but is selected with more than 1/2 probability, making it a pessimistic estimator. • In Section 2.2.3, we propose an algorithm with high-probability regret Oe((PT t=1 dt) 1/3T 1/3+ 1 T PT t=1 dt) for weakly observable graphs (ignoring some lower-order terms). To the best of our knowledge, this is the first algorithm with (near-optimal) high-probability regret guarantees for such graphs. Moreover, our bound even improves the expected regret bound of [10] by removing the Oe( √ KT) term. We remark that for simplicity we prove our results by assuming the knowledge of α1, α2, . . . , αT or d1, d2, . . . , dT to tune the parameters, but this can be easily relaxed using the standard doubling trick, making our algorithms completely parameter-free. 32 Techniques. Our algorithms are based on the well-known Online Mirror Descent (OMD) framework with the entropy regularizer. However, several crucial techniques are needed to achieve our results, including implicit exploration, explicit uniform exploration, injected positive bias, and a loss shifting trick. Among them, using positive bias and thus a pessimistic loss estimator is especially notable since most earlier works use optimistic underestimators for achieving high-probability regret bounds. The combination of these techniques also requires non-trivial analysis. Related Works. This section focuses on achieving high-probability regret bounds, which is relatively less studied in the bandit literature but as mentioned extremely important due to the potentially large variance of the regret. As far as we know, to achieve high-probability regret bounds for adversarial bandit problems, there are three categories of methods as discussed below. The first method is to inject a negative bias to the loss estimators, making them optimistic and trading unbiasedness for lower variance. Examples include the very first work in this line for standard MAB [19], linear bandits [24, 1, 154], and bandits with self-aware feedback graphs [13]. The second method is the so-called implicit exploration approach [91] (which also leads to optimistic estimators). Neu [121] used this method to achieve Oe( √ KT) regret for MAB and Oe( √ αT +K) regret for bandits with a fixed self-aware feedback graph, improving over the results by [13]. Lykouris, Sridharan, and Tardos [113] also used implicit exploration and achieved high-probability first-order regret bound for bandits with self-aware undirected feedback graphs. However, their regret bounds are either suboptimal in T or in terms of the clique partition number of the graph (which can be much larger than the independence number). The third method is to use OMD with a self-concordant barrier and an increasing learning rate scheduling, proposed by Lee et al. [99]. They used this method to achieve high-probability data-dependent regret bounds for MAB, linear bandits, and episodic Markov decision processes. However, using a self-concordant barrier regularizer generally leads to Oe( √ KT) regret in bandits with strongly observable feedback graphs 33 and Oe(K 1/3T 2/3 ) regret in bandits with weakly observable feedback graphs, making it suboptimal compared to the minimax regret bound Oe( √ αT) and Oe(d 1/3T 2/3 ) respectively. 
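As a point of reference for the methods surveyed above, the following short sketch (ours, for illustration only) contrasts the standard unbiased importance-weighted estimator with the two optimistic constructions mentioned: an explicitly negatively-biased estimator in the spirit of [19, 13] and the implicit-exploration (IX) estimator of [91, 121]; the exact forms used in those works may differ in details.

def iw_estimator(loss_i, observed, W_i):
    """Unbiased importance-weighted estimator: loss / P(observe), only if observed (0/1 flag)."""
    return loss_i * observed / W_i

def negative_bias_estimator(loss_i, observed, W_i, beta):
    """Optimistic estimator with an explicit negative bias (schematic form)."""
    return loss_i * observed / W_i - beta / W_i

def ix_estimator(loss_i, observed, W_i, gamma):
    """Implicit-exploration estimator: the extra gamma in the denominator biases it low."""
    return loss_i * observed / (W_i + gamma)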
All our algorithms adopt the implicit exploration technique for nodes with self-loop. For strongly observable graphs, we find it necessary to further adopt the injected bias idea for nodes without selfloop, but contrary to prior works, our bias is positive, which makes the loss overestimated and intuitively prevents the algorithm from picking such nodes too often without seeing their actual loss. 2.2.1 Problem Setup and Notations At each round t ∈ [T], the learner selects one of the K available actions it ∈ [K], while the adversary decides a loss vector ℓt ∈ [0, 1]K with ℓt,i being the loss for action i, and a directed feedback graph Gt = ([K], Et) where Et ⊆ [K] × [K]. The adversary can be adaptive and chooses ℓt and Gt based on the learner’s previous actions i1, . . . , it−1 in an arbitrary way. At the end of round t, the learner observes some information about ℓt according to the feedback graph Gt . Specifically, she observes the loss of action j for all j such that it ∈ Nin t (j), where Nin t (j) = {i ∈ [K] : (i, j) ∈ Et} is the set of nodes that can observe node j. We use the standard notion of regret defined in Section 1.2 to measure of the learner’s performance, defined as the difference between the total loss of the learner and that of the best fixed action in hindsight Reg ≜ X T t=1 ℓt,it − X T t=1 ℓt,i∗ , where i ∗ = argmini∈[K] PT t=1 ℓt,i. In this work, we focus on designing algorithms with high-probability regret guarantees. We refer the reader to [10] for the many examples of such a general model, and only point out that the contextual bandit problem [95] is indeed a special case where each node corresponds to a policy and each Gt is the union of several cliques. Each such clique consists of all polices that make the same decision for 34 the current context at round t. In this case, K, the number of policies, is usually considered as exponentially large, and only polylog(K) dependence on the regret is acceptable. This justifies the significance of our results that indeed remove poly(K) dependence from existing regret bounds. Informed/Uninformed Setting. Under the informed setting, the feedback graph Gt is shown to the learner at the beginning of round t before she selects it . In other words, the learner’s decision at round t can be dependent on Gt . In contrast, the uninformed setting is a harder setting, in which the learner observes Gt only at the end of round t after she selects it . For strongly observable graphs, we study the harder uninformed setting, while for weakly observable graphs, in light of the Ω(K 1/3T 2/3 ) regret lower bound of Theorem 9 of [12], we only study the informed setting. Other Notations. Define St ≜ {i ∈ [K] : i ∈ Nin t (i)} as the set of nodes with self-loop in Gt . Define αt and dt as the independence number and the weakly domination number of Gt . For notational convenience, for two vectors x, y ∈ R K and an arbitrary index set U ⊆ [K], we define ⟨x, y⟩U ≜ P i∈U xiyi to be the partial inner product with respect to the coordinates in U. We denote the (K − 1)-dimensional simplex by ∆K, the all-one vector in R K by 1, and the i-th standard basis vector in R K by ei . We use the Oe(·) notation to hide factors that are logarithmic in K and T. ∥ 2.2.2 Optimal High-Probability Regret for Strongly Observable Graphs In this section, we consider the uninformed setting with strongly observable graphs, that is, each Gt is strongly observable and revealed to the learner after she selects it at round t. 
We propose an algorithm which achieves Oe((PT t=1 αt) 1/2 + maxt∈[T] αt) high-probability regret bound. As mentioned, this result ∥ In the text, O(·) and Oe(·) often further hide lower-order terms (in terms of T dependence) and poly(log(1/δ)) factors for simplicity. However, in all formal theorem/lemma statements, we use O(·) to hide universal constants only and Oe(·) to also hide factors logarithmic in K and T. 35 improves over those from [121, 13] in two aspects: first, they only consider self-aware graphs;∗∗ second, our bound enjoys the optimal ( PT t=1 αt) 1/2 dependence with no poly(K) dependence at all. To present our algorithm, which is built on top of the Exp3-IX algorithm of [121], we start by reviewing how Exp3-IX works and how it achieves Oe((PT t=1 αt) 1/2 + K) high-probability regret bound for selfaware graphs. At each round t, after picking the action it randomly according to pt ∈ ∆K and observing the loss ℓt,j for all j such that it ∈ Nin t (j), Exp3-IX constructs the underestimator ℓbt for ℓt , such that ℓbt,i = ℓt,i Wt,i+γ · 1{it ∈ Nin t (i)} where Wt,i = P j∈Nin t (i) pt,j is the probability of observing ℓt,i and γ > 0 is a bias parameter. Then, the strategy at round t + 1 is computed via the standard multiplicative weight update (equivalent to OMD with entropy regularizer): pt+1,i ∝ pt,i exp(−ηℓbt,i)for all i ∈ [K] where η > 0 is the learning rate. Following standard analysis of OMD, we know that for any j ∈ [K], X T t=1 D pt − ej , ℓbt E ≤ log K η | {z } Bias-Term + η X T t=1 X K i=1 pt,iℓb2 t,i | {z } Stability-Term . (2.8) To derive the high-probability regret bound from Eq. (2.8), Neu [121] first shows that with probability at least 1 − δ, the following two inequalities hold due to the underestimation: X T t=1 ℓbt,j − ℓt,j ≤ log(2K/δ) 2γ , ∀j ∈ [K] (2.9) X T t=1 X K i=1 pt,i Wt,i + γ ℓbt,i − ℓt,i ≤ K log(2K/δ) 2γ . (2.10) ∗∗Although [121] only considers a fixed feedback graph (i.e. Gt = G for all t ∈ [T]), their result can be directly generalized to time-varying feedback graphs. On the other hand, we point out that [13] only considers the easier informed setting. 36 Define Qt ≜ P i∈St pt,i Wt,i+γ , which is simply PK i=1 pt,i Wt,i+γ for self-aware graphs. Using Eq. (2.10), the stability term can be upper bounded as follows: Stability-Term ≤ η X T t=1 X K i=1 pt,i Wt,i + γ ℓbt,i ≤ η X T t=1 Qt + Oe Kη γ . (2.11) To connect the true regret PT t=1(ℓt,it − ℓt,i∗ ) with PT t=1 D pt − ei ∗ , ℓbt E , direct calculation shows: X T t=1 (ℓt,it − ℓt,i∗ ) = X T t=1 D pt − ei ∗ , ℓbt E + X T t=1 (ℓt,it − ⟨pt , ℓt⟩) +X T t=1 ℓbt,i∗ − ℓt,i∗ + X T t=1 X K i=1 Wt,i − 1{it ∈ N in t (i)} pt,iℓt,i Wt,i + γ + X T t=1 X K i=1 γpt,iℓt,i Wt,i + γ . (2.12) In this last expression (summation of five terms), the first term is bounded using Eq. (2.8) and Eq. (2.11); the second term is upper bounded by Oe( √ T) via standard Azuma’s inequality; the third term is bounded by Oe(1/γ) according to Eq. (2.9); the fourth term is a summation over a martingale sequence and can be bounded by Oe qPT t=1 Qt + K with high probability via Freedman’s inequality; and the last term can be bounded by γ PT t=1 Qt . Combining all the bounds above, we obtain that with high probability, the regret is bounded as follows: X T t=1 (ℓt,it − ℓt,i∗ ) ≤ Oe 1 η + ηK γ + 1 γ + vuutX T t=1 Qt + K + (η + γ) X T t=1 Qt . Finally, using the fact that Qt = Oe(αt) (Lemma 1 of [91], included as Lemma B.3.1 in this work) and choosing γ and η optimally gives Oe((PT t=1 αt) 1/2 + K) high-probability bound. 
Improvement from Oe(K) to Oe(maxt∈[T] αt). We now show that with a refined analysis, the undesirable Oe(K) dependence can be improved to Oe(maxt∈[T] αt) (still for self-aware graphs using the same Exp3-IX algorithm). From the previous analysis sketch of [121], we can see that the Oe(K) dependency 3 comes from two terms: Stability-Term and the fourth term in Eq. (2.12). The upper bound of StabilityTerm is derived by using Eq. (2.10) and the fourth term in Eq. (2.12) is bounded via Freedman’s inequality. We show that both of these two bounds are in fact loose and can be improved by using a strengthened Freedman’s inequality (Lemma 9 of [154], included as Lemma B.3.3 in the appendix). Specifically, we prove the following lemma to bound these two terms. Note that this lemma is not restricted to self-aware graphs, and we will use it later for both general strongly observable graphs and weakly observable graphs. Lemma 2.2.1. For all t and i ∈ St , let ℓbt,i be the underestimator ℓt,i Wt,i+γ · 1{it ∈ Nin t (i)} with γ ≤ 1 2 . Then, with probability at least 1 − δ, the following two inequalities hold: X T t=1 X i∈St pt,i Wt,i + γ ℓbt,i − ℓt,i ≤ O X T t=1 Q2 t γUT + UT log KT δγ ! , (2.13) X T t=1 X i∈St Wt,i − 1{it ∈ N in t (i)} pt,iℓt,i Wt,i + γ ≤ O vuutX T t=1 Qtι1 + max t∈[T] Qtι1 , (2.14) where Qt = P i∈St pt,i Wt,i+γ , Ut = max n 1, 2 maxt∈[T ] Qt γ o , and ι1 = log 2 maxt Qt+2√PT t=1 Qt δ . The full proof is deferred to Appendix B.1.1. As Qt/γ ≤ UT = Θ(maxt∈[T] Qt)/γ for all t ∈ [T] and Qt ≤ Oe(αt) , Lemma 2.2.1 shows that Stability-Term is bounded by Oe(η PT t=1 Qt + ηUT ) = Oe(η PT t=1 αt + η maxt∈[T] αt/γ), which only has logarithmic dependence on K, unlike the Oe(ηK/γ) bound of Eq. (2.10)! For the fourth term in Eq. (2.12), Lemma 2.2.1 shows that it can actually be bounded by Oe((PT t=1 αt) 1/2 + maxt∈[T] αt), which again has no poly(K) dependence. Combining Lemma 2.2.1 with the rest of the analysis outlined earlier, we know that Exp3-IX in fact achieves Oe((PT t=1 αt) 1/2 + maxt∈[T] αt) high-probability regret for self-aware graphs, formally stated in the following theorem. The full proof is deferred to Appendix B.1.1. Theorem 2.2.1. Exp3-IX with the optimal choice of η > 0 and γ > 0 guarantees that with probability at least 1 − δ, Reg = Oe qPT t=1 αt log 1 δ + maxt∈[T] αt log 1 δ . 3 Algorithm 3 Algorithm for Strongly Observable Graphs Input: Parameter γ, β, η, T = ∅. Define: Regularizer ψ(p) = 1 η PK i=1 pi ln pi . Initialize: p1 is such that p1,i = 1 K for all i ∈ [K]. for t = 1, 2, . . . , T do 1 Calculate pet = (1 − η)pt + η K 1. 2 Sample action it from pet . 3 Receive the feedback graph Gt and the feedback ℓt,j for all j such that it ∈ Nin t (j). 4 Construct estimator ℓbt ∈ R K such that ℓbt,i = ℓt,i1{it∈Nin t (i)} Wt,i+γ1{i∈St} where Wt,i = P j∈Nin t (i) pet,j . 5 If there exists a node jt ∈ S¯ t with pet,jt > 1 2 (at most one such jt exists), set T ← {t} ∪ T . 6 Construct bias bt ∈ R K such that bt,i = β Wt,i 1{t ∈ T , i = jt}. 7 Compute pt+1 = argminp∈∆K ⟨p, ℓbt + bt⟩ + Dψ(p, pt) . Generalization to Strongly Observable Graphs. Next, we show how to deal with general strongly observable graphs. The pseudocode of our proposed algorithm is shown in Algorithm 3. Compared to Exp3-IX, there are three main differences. First, we have an additional η amount of uniform exploration over all actions (Line 1). 
Second, while keeping the same loss estimator construction for node i ∈ St at each round t, for i ∈ S¯ t (nodes without self-loop), we construct a standard unbiased estimator (Line 4). Third, if there exists jt ∈ S¯ t such that the probability of choosing action jt is larger than 1 2 , then we further add positive bias β/Wt,jt to the loss estimator ℓbt,jt (encoded via the bt vector; see Line 6 and Line 7), making it a pessimistic over-estimator. Intuitively, the reason of doing so is that we should avoid picking actions without self-loop too often even if past data suggest that it is a good action, since the only way to observe its loss and see if it stays good is by selecting other actions. A carefully chosen positive bias injected to the loss estimator of such actions would exactly allow us to achieve this goal. In what follows, by outlining the analysis of our algorithm, we further explain why we make each of these three modifications from a technical perspective. First, we note that with a nonempty S¯ t , the key fact used earlier PK i=1 pt,i Wt,i+γ = Oe(αt) is no longer true, making the fourth and the fifth term in Eq. (2.12) larger than desired if we still do implicit exploration for all nodes. Therefore, for nodes in S¯ t , we go back to the original inverse importance weighted unbiased 39 estimators (Line 4), and decompose the regret against any fixed action j ∈ [K] differently into the following six terms: X T t=1 (ℓt,it − ℓt,j ) ≤ X T t=1 ℓt,it − X T t=1 ⟨pet , ℓt⟩ ! | {z } term (1) + X T t=1 ⟨pet − pt , ℓt⟩ ! | {z } term (2) + X T t=1 ⟨pt − ej , ℓt⟩S¯t − X T t=1 Et D pt − ej , ℓbt E S¯t ! | {z } term (3) + X T t=1 Et D pt − ej , ℓbt E S¯t − X T t=1 D pt − ej , ℓbt E S¯t ! | {z } term (4) + X T t=1 D pt − ej , ℓt − ℓbt E St ! | {z } term (5) + X T t=1 D pt − ej , ℓbt E ! | {z } term (6) . (2.15) term (1) can be bounded again by Oe( √ T) via standard Azuma’s inequality. term (2) can be bounded by ηT due to the O(η) amount of uniform exploration. term (3) is simply 0 as ℓbt,i is unbiased for i ∈ S¯ t . term (5) can similarly be written as X T t=1 (ℓbt,j − ℓt,j )1{j ∈ St} + X T t=1 X i∈St Wt,i − 1{it ∈ N in t (i)} pt,iℓt,i Wt,i + γ + X T t=1 X i∈St γpt,iℓt,i Wt,i + γ since the loss estimator construction for i ∈ St is the same as the one in Exp3-IX. Following how we handled the last three terms in Eq. (2.12), we have with high probability, term (5) ≤ Oe vuutX T t=1 αt + max t∈[T] αt + γ X T t=1 αt + 1 γ . (2.16) The formal statement and the proof are deferred to Lemma B.1.2 in Appendix B.1.2. 4 The key challenge lies in controlling term (4) and term (6). To this end, let us first consider the variance of the ⟨pt − ej , ℓbt⟩S¯t . Let T = {t : ∃j ∈ S¯ t , pet,jt > 1 2 } be the final value of T in Algorithm 3. If t /∈ T , then ℓbt,i ≤ 1 1−pet,i ≤ 2 for all i ∈ S¯ t and the variance of ⟨pt − ej , ℓbt⟩S¯t is a constant; otherwise, we know that there is at most one node jt ∈ S¯ t such that pt,jt > 1 2 . Direct calculation shows that the variance of ⟨pt − ej , ℓbt⟩S¯t is bounded by Oe( 1 Wt,jt · 1{j ̸= jt}). With the help of the Freedman’s inequality and the fact Wt,jt = Ω(η) thanks to the uniform exploration (Line 1), we prove in Lemma B.1.1 of Appendix B.1.2 that term (4) can be bounded as follows: term (4) ≤ Oe vuutX T t=1 1 + 1{j ̸= jt , t ∈ T } Wt,jt + 1 η . (2.17) Handling this potentially large deviation is exactly the reason we inject a positive bias to the loss estimator (Line 6). 
Specifically, when t ∈ T , we add bt,jt = β Wt,jt to the loss estimator ℓbt,jt for some parameter β > 0. With the help of this positive bias, we can decompose term (6) as follows: term (6) = X T t=1 D pt − ej , ℓbt + bt E − X T t=1 ⟨pt − ej , bt⟩. (2.18) Direct calculation shows that the second negative term is of order −Θ(P t∈T β Wt,jt · 1{j ̸= jt})) + Θ(P t∈T β·1{j = jt}), large enough to cancel the large deviation in Eq. (2.17). Specifically, using AM-GM inequality, we obtain term (4) − X T t=1 ⟨pt − ei ∗ , bt⟩ ≤ Oe 1 η + √ T + 1 β + βT . (2.19) 41 The final step is to handle the first term in Eq. (2.18). Similar to Eq. (2.8), standard analysis of online mirror descent shows that X T t=1 D pt − ej , ℓbt + bt E ≤ log K η + η X T t=1 X K i=1 pt,i ℓbt,i + bt,i2 , (2.20) However, unlike the case for self-aware graphs, when there exist nodes without a self-loop, the second term may be prohibitively large when t ∈ T . Inspired by [10], we address this issue with a loss shifting trick. Specifically, the following refined version of Eq. (2.20) holds: X T t=1 D pt − ej , ℓbt + bt E ≤ log K η + 2η X T t=1 X K i=1 pt,i ℓbt,i + bt,i − zt 2 , (2.21) for any zt ≤ 3 η , t ∈ [T]. We choose zt = 0 when t /∈ T and zt = ℓbt,jt + bt,jt when t ∈ T , which satisfies the condition zt ≤ 3 η again thanks to the O(η) amount of uniform exploration over all nodes (Line 1). With such a loss shift zt , continuing with Eq. (2.21) it can be shown that: X T t=1 D pt − ej , ℓbt + bt E ≤ Oe 1 η + η X t /∈T X i∈[K] pt,iℓb2 t,i + η X t∈T 1{it ̸= jt} Wt,jt + β 2T . (2.22) Note that for t ∈ T , ℓbt,i ≤ 2 for all i ∈ S¯ t . Therefore, the second term in Eq. (2.22) can be bounded by Oe(ηT + η PT t=1 P i∈St pt,i Wt,i+γ ℓbt,i) ≤ Oe(ηT + η PT t=1 αt + η/γ · maxt∈[T] αt) where the inequality is by using Lemma 2.2.1. The third term can be bounded by Oe( √ ηT) with high probability by using Freedman’s inequality. Together with Eq. (2.19), Eq. (2.16), the bounds for term (1), term (2), and term (3), and the optimal choice of the parameters η, β, and γ, we arrive at the following main theorem for general strongly observable graphs (see Appendix B.1.2 for the full proof). 42 Algorithm 4 Algorithm for Weakly Observable Graphs Input: Parameters η, γ, ε. Define: Regularizer ψ(p) = 1 η PK i=1 pi ln pi . Initialize: p1,i = 1 K for all i ∈ [K]. for t = 1, 2, . . . , T do 1 Receive the feedback graph Gt and (approximately) find a smallest weakly dominating set Dt . 2 Let pet = (1 − ε|Dt |)pt + ε · 1Dt (1Dt is a vector with 1 for coordinates in Dt and 0 otherwise). 3 Sample action it ∼ pet . 4 Receive feedback ℓt,j for all j such that it ∈ Nin t (j). 5 Construct estimator ℓbt ∈ R K such that ℓbt,i = ℓt,i1{it∈Nin t (i)} Wt,i+γ1{i∈St} where Wt,i = P j∈Nin t (i) pet,j . 6 Compute pt+1 = argminp∈∆K ⟨p, ℓbt⟩ + Dψ(p, pt) . Theorem 2.2.2. Algorithm 3 with parameter γ = β = η = min √ 1 PT t=1 αt log(1/δ) , 1 2 guarantees that with probability at least 1 − 6δ, the regret is bounded as follows: Reg ≤ Oe vuutX T t=1 αt log 1 δ + max t∈[T] αt log 1 δ . To the best of our knowledge, this is the first optimal high-probability regret bound for general strongly observable graphs, importantly without any poly(K) dependence. 
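To make the three modifications of Algorithm 3 concrete, here is an illustrative sketch (ours) of a single round, assuming losses in [0, 1] and the graph given as in-neighbor sets; since the regularizer is plain entropy over the full simplex, the OMD step of Line 7 reduces to a multiplicative-weights update on the biased estimator loss_hat + b.

import numpy as np

def algorithm3_round(p, N_in, S, loss, eta, gamma, beta, rng):
    """One round of Algorithm 3: uniform exploration, IX on self-loop nodes,
    and a positive bias on a node without a self-loop played with probability > 1/2."""
    K = len(p)
    p_tilde = (1.0 - eta) * p + eta / K                 # Line 1: uniform exploration
    i_t = rng.choice(K, p=p_tilde)                      # Line 2: sample the action

    W = np.array([p_tilde[list(N_in[i])].sum() for i in range(K)])
    observed = np.array([float(i_t in N_in[i]) for i in range(K)])
    in_S = np.array([i in S for i in range(K)])
    loss_hat = loss * observed / (W + gamma * in_S)     # Line 4: IX only for i in S_t

    b = np.zeros(K)                                     # Lines 5-6: positive bias (pessimism)
    heavy = [j for j in range(K) if not in_S[j] and p_tilde[j] > 0.5]
    if heavy:                                           # at most one such node can exist
        j_t = heavy[0]
        b[j_t] = beta / W[j_t]

    w = p * np.exp(-eta * (loss_hat + b))               # Line 7: entropy-OMD = MW update
    return w / w.sum(), i_t, loss[i_t]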
While in this theorem we assume the knowledge of αt for all t ∈ [T] to tune the parameters η, β and γ, a standard doubling trick can be applied to remove this requirement and make Algorithm 3 completely parameter-free.†† 2.2.3 High-Probability Regret for Weakly Observable Graphs In this section, we study the setting where the feedback graph Gt is weakly observable for allt ∈ [T]. Under the uninformed setting,Theorem 9 of [12] proves that the lower bound of expected regret is Ω(K 1/3T 2/3 ). To get rid of the poly(K) dependence, we thus consider the informed setting, in which Gt is revealed to the learner before she selects it . We propose a simple algorithm to achieve Oe(T 1/3 ( PT t=1 dt) 1/3 + 1 T PT t=1 dt) high-probability regret bound. ††This can be achieved efficiently by applying a standard doubling trick on the quantity Bt = √ 1 Pt τ=1 Qτ , t ∈ [T]. 43 Our algorithm is summarized in Algorithm 4, which is a combination of Exp3.G [10] and Exp3-IX. Following Exp3.G, we add uniform exploration over a smallest weakly dominating set (Line 2).‡‡ In this way, each weakly observable node has at least ε probability to be observed, which is essential to control the variance of the loss estimators. Similar to Algorithm 3, we apply implicit exploration for nodes with self-loops when constructing their loss estimators (Line 5). Different from Algorithm 3, we do not need to inject bias any more. This is because with the combination of uniform exploration and implicit exploration, our algorithm already achieves Oe(T 2/3 ) bound which is optimal for weakly observable graphs. Our main result in summarized below (see Appendix B.2 for the proof). Theorem 2.2.3. Algorithm 4 with parameter ε = min{ 1 2 , T 1/3 ( PT t=1 dt) −2/3 ln(1/δ) 1/3}, γ = r ln(1/δ) PT t=1 αet , η = min n T −1/3 ( PT t=1 dt) −1/3 ln(1/δ) −1/3 , γo ensures with probability at least 1 − 4δ: Reg ≤ Oe T 1 3 X T t=1 dt !1 3 ln 1 3 (1/δ) + 1 T X T t=1 dt ln 1 δ + vuutX T t=1 αet ln 1 δ + max t∈[T] αet ln 1 δ , where αet is the independence number of the subgraph induced by nodes with self-loops in Gt . When Gt = G for all t, our bound becomes Oe d 1/3T 2/3 + √ αT + α + d , where α is the independence number of the subgraph of G induced by its nodes with self-loops and d is the weak domination number of G. This even improves over the Oe(d 1/3T 2/3 + √ KT) expected regret bound of [10], removing any poly(K) dependence. To prove Theorem 2.2.3, similar to the analysis in Section 2.2.2, we first decompose the regret against any fixed action j ∈ [K] as follows: X T t=1 (ℓt,it − ⟨pet , ℓt⟩) | {z } term (a) + X T t=1 ⟨pet − pt , ℓt⟩ | {z } term (b) + X T t=1 D pt − ej , ℓt − ℓbt E | {z } term (c) + X T t=1 D pt − ej , ℓbt E | {z } term (d) . (2.23) ‡‡While finding it exactly is computational hard, it suffices to find an approximate one with size O(dt log K), which can be done in polynomial time. 44 term (a)is of order Oe( √ T) via Azuma’s inequality. By the definition of pet and direct calculation, term (b) is of order O(ε PT t=1 dt). To bound term (c), we state the following lemma, which controls the deviation between real losses and loss estimators. The proof starts by considering nodes in St and S¯ t separately, followed by standard concentration inequalities; see Appendix B.2 for details. Lemma 2.2.2. Algorithm 4 guarantees the following with probability at least 1 − δ X T t=1 D pt , ℓt − ℓbt E ≤ Oe vuutX T t=1 αet ln 1 δ + γ X T t=1 αet + max t∈[T] αet ln 1 δ + r T ε ln 1 δ . 
Furthermore, with probability at least 1 − δ, for any i ∈ [K], the following inequality holds: X T t=1 ℓbt,i − ℓt,i ≤ Oe r T ε ln 1 δ + 1 ε ln 1 δ + 1 γ ln 1 δ ! . Next, we prove the following lemma bounding term (d) (see Appendix B.2 again for the full proof). Lemma 2.2.3. Algorithm 4 guarantees that with probability at least 1 − δ term (d) ≤ Oe 1 η + η X T t=1 αet + η γ max t αet ln 1 δ + η r T ε 3 ln 1 δ + ηT ε + η ε 2 ln 1 δ ! . Proof sketch. First, we apply standard OMD analysis [31] and obtain term (d) ≤ log K η + 2η X T t=1 X i∈St pt,iℓb2 t,i + 2η X T t=1 X i∈S¯t pt,iℓb2 t,i. We can bound the second term by Oe η PT t=1 αet + η γ maxt αet ln(1/δ) using Eq. (2.13) in Lemma 2.2.1. For the third term, based on the definition of ℓbt,i for i ∈ S¯ t , we decompose it as follows η X T t=1 X i∈S¯t pt,iℓb2 t,i ≤ η X T t=1 X i∈S¯t pt,i Wt,i (ℓbt,i − ℓt,i) + η X T t=1 X i∈S¯t pt,i Wt,i ℓt,i. 45 We bound the first term by Oe(η p T /ε3 + η/ε2 ) using Freedman’s inequality. With the help of uniform exploration, we know that Wt,i ≥ ε and thus the second term is bounded by ηT ε . With the help of Lemma 2.2.2 and Lemma 2.2.3, we are ready to prove Theorem 2.2.3. Proof of Theorem 2.2.3. Putting results from Eq. (2.23), Lemma 2.2.2, and Lemma 2.2.3 together, our regret bound becomes Reg ≤ Oe ε X T t=1 dt + vuutX T t=1 αet ln 1 δ + γ X T t=1 αet + max t αet ln 1 δ + 1 γ log 1 δ + r T ε ln 1 δ + Oe 1 η + η X T t=1 αet + η γ max t αet ln 1 δ + η r T ε 3 ln 1 δ + ηT ε + η ε 2 ln 1 δ ! . By picking ε, η, and γ as stated in Theorem 2.2.3, we achieve that with probability at least 1 − δ, Reg ≤ Oe T 1 3 X T t=1 dt !1 3 ln 1 3 1 δ + 1 T X T t=1 dt ln 1 δ + vuutX T t=1 αet ln 1 δ + max t αet ln 1 δ . Again, we can apply the standard doubling trick to tune η, γ, and ε adaptively without requiring the knowledge of dt and αet for t ∈ [T] ahead of time. 2.2.4 Conclusions and Open Problems In this work, we design algorithms that achieve near-optimal high-probability regret bounds for adversarial MAB with time-varying feedback graphs for both the strongly observable case and the weakly observable case. We achieve Oe((PT t=1 αt) 1/2 + maxt∈[T] αt) regret for strongly observable graphs, improving and extending the results of [121], which only considers self-aware graphs and suffers an Oe(K) term. In addition, we derive the first high-probability regret bound for weakly observable graph setting, which also depends on K only logarithmically and is order optimal. 46 One open problem is whether one can achieve high-probability data-dependent regret bounds for this problem, such as the so-called small-loss bounds which scales with the loss of the best action. In Section 2.1, we achieved expected regret bound Oe( √ κL⋆) for a fixed graph where κ is the clique partition number and L⋆ is the loss of the best action. Achieving the same bound with high-probability under an adaptive adversary appears to require new ideas. 47 2.3 Efficient Contextual Bandits with informed Feedback Graphs In this section, we generalize the problem of non-contextual online learning with feedback graphs considered in Section 2.1 and Section 2.2 to the contextual setting, in which the learner observes a context before making her decision at each round and both the loss and the feedback graph can be dependent on the context. While contextual bandits have enjoyed broad applicability [29], the statistical complexity of learning with bandit feedback imposes a data lower bound for application scenarios [5]. 
This has inspired various mitigation strategies, including exploiting function class structure for improved experimental design [152], and composing with memory for learning with fewer samples [130]. In this section, we exploit alternative graph feedback patterns to accelerate learning: intuitively, there is no need to explore a potentially suboptimal action if a presumed better action, when exploited, yields the necessary information. The framework of bandits with feedback graphs is mature and provides a solid theoretical foundation for incorporating additional feedback into an exploration strategy [115, 12, 13]. Succinctly, in this framework, the observation of the learner is decided by a directed feedback graph G: when an action is played, the learner observes the loss of every action to which the chosen action is connected. When the graph only contains self-loops, this problem reduces to the classic bandit case. For non-contextual bandits with feedback graphs, it is mentioned before that Alon et al. [12] provides a full characterization on the minimax regret bound with respect to different graph theoretic quantities associated with G according to the type of the feedback graph. However, contextual bandits with feedback graphs have received less attention [134, 141]. Specifically, there is no prior work offering a solution for general feedback graphs and function classes. In this section, we take an important step in this direction by adopting recently developed minimax algorithm design principles in contextual bandits, which leverage realizability and reduction to regression to construct practical algorithms with strong statistical guarantees [57, 58, 59, 61, 60, 152]. Using this strategy, we construct a practical algorithm for contextual bandits with feedback graphs that achieves the optimal regret 48 bound. Moreover, although our primary concern is accelerating learning when the available feedback is more informative than bandit feedback, our techniques also succeed when the available feedback is less informative than bandit feedback, e.g., in spam filtering where some actions generate no feedback. More specifically, our contributions are as follows. Contributions. In this section, we extend the minimax framework proposed in [60] to contextual bandits with general feedback graphs, aiming to promote the utilization of different feedback patterns in practical applications. Following [58, 60, 152], we assume that there is an online regression oracle for supervised learning on the loss. Based on this oracle, we propose SquareCB.G, the first algorithm for contextual bandits with feedback graphs that operates via reduction to regression (Algorithm 5). Eliding regression regret factors, our algorithm achieves the matching optimal regret bounds for deterministic feedback graphs, with Oe( √ αT) regret for strongly observable graphs and Oe(d 1 3 T 2 3 ) regret for weakly observable graphs, where α and d are respectively the independence number and weakly domination number of the feedback graph.Notably, SquareCB.G is computationally tractable, requiring the solution to a convex program (Theorem 2.3.4), which can be readily solved with off-the-shelf convex solvers (Appendix C.1.3). In addition, we provide closed-form solutions for specific cases of interest (Section 2.3.4), leading to a more efficient implementation of our algorithm. Empirical results further showcase the effectiveness of our approach (Section 2.3.5). 2.3.1 Related Work As mentioned in previous sections, Alon et al. 
[12] characterized the minimax rates in terms of graphtheoretic quantities. Follow-on work includes relaxing the assumption that the graph is observed prior to decision [45]; analyzing the distinction between the stochastic and adversarial settings [13]; considering stochastic feedback graphs [103, 54]; instance-adaptivity [83]; data-dependent regret bound [113, 100]; and high-probability regret under adaptive adversary [121, 109]. 49 However, the contextual bandit problem with feedback graphs has received relatively less attention. Wang et al. [141] provide algorithms for adversarial linear bandits with uninformed graphs and stochastic contexts. However, this work assumes several unrealistic assumptions on both the policy class and the context space and is not comparable to our setting, since we consider the informed graph setting with adversarial context. Singh et al. [134] study a stochastic linear bandits with informed feedback graphs and are able to improve over the instance-optimal regret bound for bandits derived in [97] by utilizing the additional graph-based feedbacks. Our work in this section is also closely related to the recent progress in designing efficient algorithms for classic contextual bandits. Starting from [95], numerous works have been done to the development of practically efficient algorithms, which are based on reduction to either cost-sensitive classification oracles [53, 6] or online regression oracles [58, 59, 60, 152]. Following the latter trend, our work assumes access to an online regression oracle and extends the classic bandit problems to the bandits with general feedback graphs. 2.3.2 Problem Setting and Preliminary We consider the following contextual bandits problem with informed feedback graphs. The learning process goes in T rounds. At each round t ∈ [T], an environment selects a context xt ∈ X , a (stochastic) directed feedback graph Gt ∈ [0, 1]A×A, and a loss distribution Pt : X → ∆([−1, 1]A); where A is the action set with finite cardinality K. For convenience, we use A and [K] interchangeably for denoting the action set. Both Gt and xt are revealed to the learner at the beginning of each round t. Then the learner selects one of the actions at ∈ A, while at the same time, the environment samples a loss vector ℓt ∈ [−1, 1]A from Pt(·|xt). The learner then observes some information about ℓt according to the feedback graph Gt . Specifically, for each action j, she observes the loss of action j with probability Gt(at , j), resulting in a realization At , which is the set of actions whose loss is observed. With a slight abuse of 50 notation, denote Gt(·|a) as the distribution of At when action a is picked. We allow the context xt , the (stochastic) feedback graphs Gt and the loss distribution Pt(·|xt) to be selected by an adaptive adversary. When convenient, we will consider G to be a K-by-K matrix and utilize matrix notation. Other Notations. Let ∆(K) denote the set of all Radon probability measures over a set [K]. conv(S) represents the convex hull of a set S. Denote I as the identity matrix with an appropriate dimension. For a K-dimensional vector v, diag(v) denotes the K-by-K matrix with the i-th diagonal entry vi and other entries 0. We use R K ≥0 to denote the set of K-dimensional vectors with non-negative entries. We use the Oe(·) notation to hide factors that are polylogarithmic in K and T. Realizability. 
We assume that the learner has access to a known function class $\mathcal{F} \subset (\mathcal{X}\times\mathcal{A} \mapsto [-1,1])$ which characterizes the mean of the loss for a given context-action pair, and we make the following standard realizability assumption studied in the contextual bandit literature [5, 57, 58, 133].

Assumption 1 (Realizability). There exists a regression function $f^\star\in\mathcal{F}$ such that $\mathbb{E}[\ell_{t,a} \mid x_t] = f^\star(x_t, a)$ for any $a\in\mathcal{A}$ and all $t\in[T]$.

Two comments are in order. First, we remark that, similar to [59], misspecification can be incorporated while maintaining computational efficiency, but we do not complicate the exposition here. Second, Assumption 1 induces a "semi-adversarial" setting, wherein nature is completely free to determine the context and graph sequences, and has considerable latitude in determining the loss distribution subject to a mean constraint.

Regret. For each regression function $f\in\mathcal{F}$, let $\pi_f(x_t) := \arg\min_{a\in\mathcal{A}} f(x_t, a)$ denote the induced policy, which chooses the action with the least loss with respect to $f$. Define $\pi^\star := \pi_{f^\star}$ as the optimal policy. We measure the performance of the learner via regret to $\pi^\star$:
\[
\mathrm{Reg}_{\mathrm{CB}} := \sum_{t=1}^T \ell_{t,a_t} - \sum_{t=1}^T \ell_{t,\pi^\star(x_t)},
\]
which is the difference between the loss suffered by the learner and the loss she would suffer by applying policy $\pi^\star$.

Regression Oracle. We assume access to an online regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$ for function class $\mathcal{F}$, which is an algorithm for online learning with squared loss. We consider the following protocol. At each round $t\in[T]$, the algorithm produces an estimator $\widehat{f}_t\in\mathrm{conv}(\mathcal{F})$, then receives a set of context-action-loss tuples $\{(x_t, a, \ell_{t,a})\}_{a\in A_t}$ where $A_t\subseteq\mathcal{A}$. The goal of the oracle is to accurately predict the loss as a function of the context and action, and we evaluate its performance via the square loss $\sum_{a\in A_t}(\widehat{f}_t(x_t,a)-\ell_{t,a})^2$. We measure the oracle's cumulative performance via the following square-loss regret to the best function in $\mathcal{F}$.

Assumption 2 (Bounded square-loss regret). The regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$ guarantees that for any (potentially adaptively chosen) sequence $\{(x_t, a, \ell_{t,a})\}_{a\in A_t, t\in[T]}$ in which $A_t\subseteq\mathcal{A}$,
\[
\sum_{t=1}^T\sum_{a\in A_t}\big(\widehat{f}_t(x_t,a)-\ell_{t,a}\big)^2 - \inf_{f\in\mathcal{F}}\sum_{t=1}^T\sum_{a\in A_t}\big(f(x_t,a)-\ell_{t,a}\big)^2 \le \mathrm{Reg}_{\mathrm{Sq}}.
\]

For finite $\mathcal{F}$, Vovk's aggregation algorithm yields $\mathrm{Reg}_{\mathrm{Sq}} = O(\log|\mathcal{F}|)$ [140]. This regret depends on the scale of the loss function, but this need not grow linearly with the size of $A_t$; e.g., the loss scale can be bounded by 2 in classification problems. See Foster and Krishnamurthy [61] for additional examples of online regression algorithms.

2.3.3 Algorithms and Regret Bounds

In this section, we provide our main algorithms and results.

2.3.3.1 Algorithms via Minimax Reduction Design

Our approach is to adapt the minimax formulation of [60] to contextual bandits with feedback graphs. In the standard contextual bandits setting (that is, $G_t = I$ for all $t$), Foster et al. [60] define the Decision-Estimation Coefficient (DEC) for a parameter $\gamma>0$ as $\mathrm{dec}_\gamma(\mathcal{F}) := \sup_{\widehat{f}\in\mathrm{conv}(\mathcal{F}),\, x\in\mathcal{X}}\mathrm{dec}_\gamma(\mathcal{F};\widehat{f},x)$, where
\[
\mathrm{dec}_\gamma(\mathcal{F};\widehat{f},x) := \inf_{p\in\Delta(K)}\mathrm{dec}_\gamma(p,\mathcal{F};\widehat{f},x)
:= \inf_{p\in\Delta(K)}\sup_{\substack{a^\star\in[K]\\ f^\star\in\mathcal{F}}}\mathbb{E}_{a\sim p}\Big[f^\star(x,a) - f^\star(x,a^\star) - \frac{\gamma}{4}\cdot\big(\widehat{f}(x,a)-f^\star(x,a)\big)^2\Big]. \tag{2.24}
\]

Their proposed algorithm is as follows. At each round $t$, after receiving the context $x_t$, the algorithm first computes $\widehat{f}_t$ by calling the regression oracle. Then, it solves for the distribution $p_t$ attaining the minimax value in Eq. (2.24) with $\widehat{f}$ and $x$ replaced by $\widehat{f}_t$ and $x_t$. Finally, the algorithm samples an action $a_t$ from the distribution $p_t$ and feeds the observation $(x_t, a_t, \ell_{t,a_t})$ to the oracle.
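In this standard bandit case ($G_t = I$), a near-minimizer of Eq. (2.24) is known in closed form: the inverse-gap weighting distribution of SquareCB [58], which certifies an $O(K/\gamma)$ bound on the DEC. The following sketch implements that closed form for illustration only; the function name and interface are ours, and constants may differ from the variant analyzed in [58] since losses here lie in $[-1,1]$.

import numpy as np

def inverse_gap_weighting(f_hat, gamma):
    """Illustrative closed form for the standard bandit DEC (G_t = I), following
    the inverse-gap weighting scheme of SquareCB [58]; not this thesis's code.

    f_hat: length-K array of predicted losses f_hat_t(x_t, a).
    gamma: exploration parameter (larger gamma means greedier play).
    """
    K = len(f_hat)
    best = int(np.argmin(f_hat))            # greedy action under the predictor
    gaps = f_hat - f_hat[best]              # non-negative prediction gaps
    p = np.zeros(K)
    for a in range(K):
        if a != best:
            p[a] = 1.0 / (K + gamma * gaps[a])
    p[best] = 1.0 - p.sum()                 # remaining mass on the greedy action
    return p

Actions whose predicted loss is close to the minimum receive more exploration, which is exactly the behavior the DEC trade-off in Eq. (2.24) rewards.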
Foster et al. [60] show that for any value of $\gamma$, the algorithm above guarantees that
\[
\mathbb{E}[\mathrm{Reg}_{\mathrm{CB}}] \le T\cdot\mathrm{dec}_\gamma(\mathcal{F}) + \frac{\gamma}{4}\cdot\mathrm{Reg}_{\mathrm{Sq}}. \tag{2.25}
\]
However, the minimax problem Eq. (2.24) may not be efficiently solvable in many cases. Therefore, instead of obtaining the distribution $p_t$ that attains the exact minimax value of Eq. (2.24), Foster et al. [60] show that any distribution that gives an upper bound $C_\gamma$ on $\mathrm{dec}_\gamma(p,\mathcal{F};\widehat{f},x)$ also works and enjoys a regret bound with $\mathrm{dec}_\gamma(\mathcal{F})$ replaced by $C_\gamma$ in Eq. (2.25).

To extend this framework to the setting with feedback graph $G$, we define $\mathrm{dec}_\gamma(\mathcal{F};\widehat{f},x,G)$ as follows:
\[
\mathrm{dec}_\gamma(\mathcal{F};\widehat{f},x,G) := \inf_{p\in\Delta(K)}\mathrm{dec}_\gamma(p,\mathcal{F};\widehat{f},x,G)
:= \inf_{p\in\Delta(K)}\sup_{\substack{a^\star\in[K]\\ f^\star\in\mathcal{F}}}\mathbb{E}_{a\sim p}\Bigg[f^\star(x,a) - f^\star(x,a^\star) - \frac{\gamma}{4}\,\mathbb{E}_{A\sim G(\cdot|a)}\bigg[\sum_{a'\in A}\big(\widehat{f}(x,a')-f^\star(x,a')\big)^2\bigg]\Bigg]. \tag{2.26}
\]
Compared with Eq. (2.24), the difference is that we replace the squared estimation error on action $a$ by the expected one over the observed set $A\sim G(\cdot|a)$, which intuitively utilizes more feedback from the graph structure. When the feedback graph is the identity matrix, we recover Eq. (2.24).

Based on $\mathrm{dec}_\gamma(\mathcal{F};\widehat{f},x,G)$, our algorithm SquareCB.G is shown in Algorithm 5. As is done in [60], in order to derive an efficient algorithm, instead of solving for the distribution $p_t$ with respect to the supremum over $f^\star\in\mathcal{F}$, we solve for the $p_t$ that minimizes $\mathrm{dec}_\gamma(p;\widehat{f}_t,x_t,G_t)$ (Eq. (2.27)), which takes the supremum over $f^\star\in\Phi := (\mathcal{X}\times[K]\mapsto\mathbb{R})$ and thus leads to an upper bound on $\mathrm{dec}_\gamma(\mathcal{F};\widehat{f}_t,x_t,G_t)$. Then, we receive the losses $\{\ell_{t,j}\}_{j\in A_t}$ and feed the tuples $\{(x_t,j,\ell_{t,j})\}_{j\in A_t}$ to the regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$.

Algorithm 5 SquareCB.G. (Theorem 2.3.4 provides an efficient implementation of Eq. (2.27).)
Input: parameter $\gamma\ge 4$, a regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$.
for $t = 1, 2, \ldots, T$ do
  Receive context $x_t$ and directed feedback graph $G_t$.
  Obtain an estimator $\widehat{f}_t$ from the oracle $\mathrm{Alg}^{\mathrm{Sq}}$.
  Compute the distribution $p_t\in\Delta(K)$ such that $p_t = \arg\min_{p\in\Delta(K)}\mathrm{dec}_\gamma(p;\widehat{f}_t,x_t,G_t)$, where
  \[
  \mathrm{dec}_\gamma(p;\widehat{f}_t,x_t,G_t) := \sup_{\substack{a^\star\in[K]\\ f^\star\in\Phi}}\mathbb{E}_{a\sim p}\Bigg[f^\star(x_t,a) - f^\star(x_t,a^\star) - \frac{\gamma}{4}\,\mathbb{E}_{A\sim G_t(\cdot|a)}\bigg[\sum_{a'\in A}\big(\widehat{f}_t(x_t,a')-f^\star(x_t,a')\big)^2\bigg]\Bigg], \tag{2.27}
  \]
  and $\Phi := (\mathcal{X}\times[K]\mapsto\mathbb{R})$.
  Sample $a_t$ from $p_t$ and observe $\{\ell_{t,j}\}_{j\in A_t}$ where $A_t\sim G_t(\cdot|a_t)$.
  Feed the tuples $\{(x_t,j,\ell_{t,j})\}_{j\in A_t}$ to the oracle $\mathrm{Alg}^{\mathrm{Sq}}$.

Following an analysis similar to [60], we show that to bound the regret $\mathrm{Reg}_{\mathrm{CB}}$, we only need to bound $\mathrm{dec}_\gamma(p_t;\widehat{f}_t,x_t,G_t)$.

Theorem 2.3.1. Suppose $\mathrm{dec}_\gamma(p_t;\widehat{f}_t,x_t,G_t)\le C\gamma^{-\beta}$ for all $t\in[T]$ and some $\beta>0$. Then Algorithm 5 with $\gamma = \max\big\{4,\, (CT)^{\frac{1}{\beta+1}}\mathrm{Reg}_{\mathrm{Sq}}^{-\frac{1}{\beta+1}}\big\}$ guarantees that
\[
\mathbb{E}[\mathrm{Reg}_{\mathrm{CB}}] \le O\Big(C^{\frac{1}{\beta+1}}\,T^{\frac{1}{\beta+1}}\,\mathrm{Reg}_{\mathrm{Sq}}^{\frac{\beta}{\beta+1}}\Big).
\]

The proof is deferred to Appendix C.1. In Section 2.3.3.3, we give an efficient implementation for solving Eq. (2.27) via a reduction to convex programming.
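To make the reduction concrete, the following Python sketch mirrors the main loop of Algorithm 5. Everything here is an assumed interface for illustration (it is not the implementation from Appendix C.1.3): env, oracle, and solve_dec stand in for the environment, the online regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$, and any routine returning a (near-)minimizer of Eq. (2.27), e.g., the convex program of Theorem 2.3.4 or one of the closed forms of Section 2.3.4.

import numpy as np

def square_cb_g(env, oracle, solve_dec, gamma, T, rng=None):
    """Schematic main loop of Algorithm 5 (SquareCB.G); assumed interfaces:
      env.observe(t)              -> (context x_t, graph G_t as a K-by-K matrix)
      oracle.predict(x)           -> length-K vector of predicted losses
      solve_dec(f_hat, G, gamma)  -> distribution p_t for Eq. (2.27)
      env.play(t, a)              -> {j: ell_{t,j}} for every observed action j in A_t
      oracle.update(x, observed)  -> online squared-loss regression update
    """
    rng = rng or np.random.default_rng()
    for t in range(T):
        x, G = env.observe(t)
        f_hat = oracle.predict(x)
        p = solve_dec(f_hat, G, gamma)       # exploration distribution over [K]
        a = rng.choice(len(p), p=p)          # sample a_t ~ p_t
        observed = env.play(t, a)            # note: a itself need not be observed
        oracle.update(x, observed)           # feed tuples {(x_t, j, ell_{t,j})}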
2.3.3.2 Regret Bounds

In this section, we derive regret bounds for Algorithm 5 when the $G_t$'s are specialized to deterministic graphs, i.e., $G_t\in\{0,1\}^{\mathcal{A}\times\mathcal{A}}$. We utilize discrete graph notation $G = ([K], E)$, where $E\subseteq[K]\times[K]$, and with a slight abuse of notation define $N_{\mathrm{in}}(G,j) \triangleq \{i\in\mathcal{A} : (i,j)\in E\}$ as the set of nodes that can observe node $j$. In this case, at each round $t$, the observed node set $A_t$ is a deterministic set which contains every node $i$ satisfying $a_t\in N_{\mathrm{in}}(G_t,i)$. In the following, we use several graph-theoretic quantities for deterministic feedback graphs [12].

Strongly Observable Graphs. In the following theorem, we bound the DEC achieved by Algorithm 5 for strongly observable graphs, which in turn yields its regret bound.

Theorem 2.3.2 (Strongly observable graphs). Suppose that the feedback graph $G_t$ is deterministic and strongly observable with independence number no more than $\alpha$. Then Algorithm 5 guarantees that
\[
\mathrm{dec}_\gamma(p_t;\widehat{f}_t,x_t,G_t) \le O\left(\frac{\alpha\log(K\gamma)}{\gamma}\right).
\]

In contrast to existing works that derive a closed-form solution of $p_t$ in order to show how large the DEC can be [58, 61], in our case we prove the upper bound on $\mathrm{dec}_\gamma(p_t;\widehat{f}_t,x_t,G_t)$ using Sion's minimax theorem and the graph-theoretic lemma proven in [12]. The proof is deferred to Appendix C.1.1. Combining Theorem 2.3.2 and Theorem 2.3.1, we directly have the following corollary.

Corollary 2.3.1. Suppose that $G_t$ is deterministic, strongly observable, and has independence number no more than $\alpha$ for all $t\in[T]$. Algorithm 5 with the choice $\gamma = \max\big\{4,\, \sqrt{\alpha T/\mathrm{Reg}_{\mathrm{Sq}}}\big\}$ guarantees that $\mathbb{E}[\mathrm{Reg}_{\mathrm{CB}}] \le \widetilde{O}\big(\sqrt{\alpha T\,\mathrm{Reg}_{\mathrm{Sq}}}\big)$.

For conciseness, Corollary 2.3.1 states a regret guarantee that depends on the largest independence number of $G_t$ over $t\in[T]$. However, we are in fact able to achieve a more adaptive regret bound of order $\widetilde{O}\big(\sqrt{\sum_{t=1}^T\alpha_t\,\mathrm{Reg}_{\mathrm{Sq}}}\big)$, where $\alpha_t$ is the independence number of $G_t$. It is straightforward to achieve this by applying a standard doubling trick to the quantity $\sum_{t=1}^T\alpha_t$, assuming we can compute $\alpha_t$ given $G_t$, but we take one step further and show that it is in fact unnecessary to compute $\alpha_t$ (which, after all, is NP-hard [88]): we provide an adaptive tuning strategy for $\gamma$ that keeps track of the cumulative value of the quantity $\min_{p\in\Delta(K)}\mathrm{dec}_\gamma(p;\widehat{f}_t,x_t,G_t)$ and show that this efficient method also achieves the adaptive $\widetilde{O}\big(\sqrt{\sum_{t=1}^T\alpha_t\,\mathrm{Reg}_{\mathrm{Sq}}}\big)$ regret guarantee; see Appendix C.4 for details.

Weakly Observable Graphs. For weakly observable graphs, we have the following theorem.

Theorem 2.3.3 (Weakly observable graphs). Suppose that the feedback graph $G_t$ is deterministic and weakly observable with weak domination number no more than $d$. Then Algorithm 5 with $\gamma\ge 16d$ guarantees that
\[
\mathrm{dec}_\gamma(p_t;\widehat{f}_t,x_t,G_t) \le O\left(\sqrt{\frac{d}{\gamma}} + \frac{\widetilde{\alpha}\log(K\gamma)}{\gamma}\right),
\]
where $\widetilde{\alpha}$ is the independence number of the subgraph induced by the nodes with self-loops in $G_t$.

The proof is deferred to Appendix C.1.2. Similar to Theorem 2.3.2, we do not derive a closed-form solution for the strategy $p_t$ but prove this upper bound using the minimax theorem. Combining Theorem 2.3.3 and Theorem 2.3.1, we obtain the following regret bound for weakly observable graphs, whose proof is deferred to Appendix C.1.2.

Corollary 2.3.2. Suppose that $G_t$ is deterministic, weakly observable, and has weak domination number no more than $d$ for all $t\in[T]$. In addition, suppose that the independence number of the subgraph induced by nodes with self-loops in $G_t$ is no more than $\widetilde{\alpha}$ for all $t\in[T]$. Then, Algorithm 5 with $\gamma = \max\big\{16d,\, \sqrt{\widetilde{\alpha}T/\mathrm{Reg}_{\mathrm{Sq}}},\, d^{\frac{1}{3}}T^{\frac{2}{3}}\mathrm{Reg}_{\mathrm{Sq}}^{-\frac{2}{3}}\big\}$ guarantees that
\[
\mathbb{E}[\mathrm{Reg}_{\mathrm{CB}}] \le \widetilde{O}\Big(d^{\frac{1}{3}}T^{\frac{2}{3}}\mathrm{Reg}_{\mathrm{Sq}}^{\frac{1}{3}} + \sqrt{\widetilde{\alpha}T\,\mathrm{Reg}_{\mathrm{Sq}}}\Big).
\]

Similarly to the strongly observable case, we also derive an adaptive tuning strategy for $\gamma$ that achieves the more refined regret bound $\widetilde{O}\big(\sqrt{\sum_{t=1}^T\widetilde{\alpha}_t\,\mathrm{Reg}_{\mathrm{Sq}}} + (\sum_{t=1}^T\sqrt{d_t})^{\frac{2}{3}}\mathrm{Reg}_{\mathrm{Sq}}^{\frac{1}{3}}\big)$, where $\widetilde{\alpha}_t$ is the independence number of the subgraph induced by nodes with self-loops in $G_t$ and $d_t$ is the weak domination number of $G_t$. This is again achieved without explicitly computing $\widetilde{\alpha}_t$ and $d_t$; see Appendix C.4 for details.
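The adaptive tuning of $\gamma$ mentioned above is developed in Appendix C.4. As a purely illustrative template of the underlying doubling idea (track a cumulative quantity and double the parameter once it crosses a threshold), one could organize the schedule as below; the threshold rule and the choice to simply double $\gamma$ are our own assumptions and are not claimed to match the scheme of Appendix C.4.

def doubling_gamma_schedule(dec_value, reg_sq_bound, gamma0=4.0, T=10000):
    """Illustrative doubling-trick template for tuning gamma online.
    dec_value(t, gamma): value of min_p dec_gamma(p; f_hat_t, x_t, G_t) at round t.
    reg_sq_bound: an upper bound on the oracle regret Reg_Sq.
    Yields the gamma used at each round; all details are illustrative only.
    """
    gamma, running_dec = gamma0, 0.0
    for t in range(T):
        yield gamma
        running_dec += dec_value(t, gamma)
        # Regret is roughly sum_t dec + gamma * Reg_Sq; once the first term
        # dominates the second, double gamma and restart the running sum.
        if running_dec > gamma * reg_sq_bound:
            gamma *= 2.0
            running_dec = 0.0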
2.3.3.3 Implementation

In this section, we show that solving $\arg\min_{p\in\Delta(K)}\mathrm{dec}_\gamma(p;\widehat{f},x,G)$ in Algorithm 5 is equivalent to solving a convex program, which can be easily and efficiently implemented in practice.

Theorem 2.3.4. Solving $\arg\min_{p\in\Delta(K)}\mathrm{dec}_\gamma(p;\widehat{f},x,G)$ is equivalent to solving the following convex optimization problem:
\[
\min_{p\in\Delta(K),\,z}\;\; p^\top\widehat{f} + z \tag{2.28}
\]
\[
\text{subject to}\quad \forall a\in[K]:\;\; \frac{1}{\gamma}\,\|p - e_a\|^2_{\mathrm{diag}(G^\top p)^{-1}} \le \widehat{f}(x,a) + z, \qquad G^\top p \succ 0,
\]
where $\widehat{f}$ in the objective is shorthand for $\widehat{f}(x,\cdot)\in\mathbb{R}^K$, $e_a$ is the $a$-th standard basis vector, and $\succ$ means element-wise greater.

The proof is deferred to Appendix C.1.4. Note that this implementation is not restricted to deterministic feedback graphs but applies to the general stochastic feedback graph case. In Appendix C.1.3, we provide the roughly 20 lines of Python code that solve Eq. (2.28).

2.3.4 Examples with Closed-Form Solutions

In this section, we present examples and corresponding closed-form solutions of $p$ that make the value $\mathrm{dec}_\gamma(p;\widehat{f},x,G)$ upper bounded by at most a constant factor of $\min_p\mathrm{dec}_\gamma(p;\widehat{f},x,G)$. This offers an alternative to solving the convex program defined in Theorem 2.3.4 for special (and practically relevant) cases, thereby enhancing the efficiency of our algorithm. All the proofs are deferred to Appendix C.2.

Cops-and-Robbers Graph. The "cops-and-robbers" feedback graph $G_{\mathrm{CR}} = \mathbf{1}\mathbf{1}^\top - I$, also known as the loopless clique, is the full-feedback graph with the self-loops removed. Therefore, $G_{\mathrm{CR}}$ is strongly observable with independence number $\alpha=1$. Let $a_1$ be the node with the smallest value of $\widehat{f}$ and $a_2$ be the node with the second smallest value of $\widehat{f}$. Our proposed closed-form distribution $p$ is supported only on $\{a_1, a_2\}$ and is defined as follows:
\[
p_{a_1} = 1 - \frac{1}{2+\gamma(\widehat{f}_{a_2} - \widehat{f}_{a_1})}, \qquad p_{a_2} = \frac{1}{2+\gamma(\widehat{f}_{a_2} - \widehat{f}_{a_1})}. \tag{2.29}
\]
In the following proposition, we show that with the construction of $p$ in Eq. (2.29), $\mathrm{dec}_\gamma(p;\widehat{f},x,G_{\mathrm{CR}})$ is upper bounded by $O(1/\gamma)$, which matches the order of $\min_p\mathrm{dec}_\gamma(p;\widehat{f},x,G)$ based on Theorem 2.3.2 since $\alpha=1$.

Proposition 2.3.1. When $G = G_{\mathrm{CR}}$, given any $\widehat{f}$ and context $x$, the closed-form distribution $p$ in Eq. (2.29) guarantees that $\mathrm{dec}_\gamma(p;\widehat{f},x,G_{\mathrm{CR}}) \le O\big(\frac{1}{\gamma}\big)$.

Apple Tasting Graph. The apple tasting feedback graph $G_{\mathrm{AT}} = \big(\begin{smallmatrix}1 & 1\\ 0 & 0\end{smallmatrix}\big)$ consists of two nodes, where the first node reveals everything and the second node reveals nothing. This scenario was originally proposed by Helmbold, Littlestone, and Long [75] and has recently been called the spam filtering graph [79]. The independence number of $G_{\mathrm{AT}}$ is 1. Let $\widehat{f}_1$ be the oracle prediction for the first node and $\widehat{f}_2$ be the prediction for the second node. We present a closed-form solution $p$ for Eq. (2.27) as follows:
\[
p_1 = \begin{cases} 1 & \widehat{f}_1 \le \widehat{f}_2, \\[4pt] \dfrac{2}{4+\gamma(\widehat{f}_1-\widehat{f}_2)} & \widehat{f}_1 > \widehat{f}_2, \end{cases} \qquad p_2 = 1 - p_1. \tag{2.30}
\]
We show in the following proposition that this distribution $p$ satisfies that $\mathrm{dec}_\gamma(p;\widehat{f},x,G_{\mathrm{AT}})$ is upper bounded by $O(1/\gamma)$. We remark that directly applying results from [60] cannot lead to a valid upper bound, since the second node does not have a self-loop.

Proposition 2.3.2. When $G = G_{\mathrm{AT}}$, given any $\widehat{f}$ and context $x$, the closed-form distribution $p$ in Eq. (2.30) guarantees that $\mathrm{dec}_\gamma(p;\widehat{f},x,G_{\mathrm{AT}}) \le O\big(\frac{1}{\gamma}\big)$.
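Both closed forms above translate directly into code. The following sketch implements Eq. (2.29) and Eq. (2.30) verbatim; the function names are ours, and f_hat denotes the vector of oracle predictions $\widehat{f}(x,\cdot)$.

import numpy as np

def p_cops_and_robbers(f_hat, gamma):
    """Closed-form distribution of Eq. (2.29) for G_CR (the loopless clique)."""
    order = np.argsort(f_hat)
    a1, a2 = order[0], order[1]          # smallest and second smallest prediction
    p = np.zeros(len(f_hat))
    p[a2] = 1.0 / (2.0 + gamma * (f_hat[a2] - f_hat[a1]))
    p[a1] = 1.0 - p[a2]
    return p

def p_apple_tasting(f_hat, gamma):
    """Closed-form distribution of Eq. (2.30) for G_AT (two actions)."""
    f1, f2 = f_hat[0], f_hat[1]
    p1 = 1.0 if f1 <= f2 else 2.0 / (4.0 + gamma * (f1 - f2))
    return np.array([p1, 1.0 - p1])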
Inventory Graph. In this application, the algorithm needs to decide the inventory level in order to fulfill the realized demand arriving at each round. Specifically, there are $K$ possible inventory levels $a_1 < a_2 < \cdots < a_K$, and the feedback graph $G_{\mathrm{inv}}$ has entries $G(i,j) = 1$ for all $1\le j\le i\le K$ and $G(i,j) = 0$ otherwise, meaning that picking the inventory level $a_i$ informs the learner about all actions $a_j$ with $j\le i$. This is because items are consumed until either the demand or the inventory is exhausted. The independence number of $G_{\mathrm{inv}}$ is 1. Therefore, (very) large $K$ is statistically tractable, but naively solving the convex program Eq. (2.28) incurs computational cost superlinear in $K$. We show in the following proposition that there exists an analytic form of $p$ which guarantees that $\mathrm{dec}_\gamma(p;\widehat{f},x,G_{\mathrm{inv}})$ can be bounded by $\widetilde{O}(1/\gamma)$.

Proposition 2.3.3. When $G = G_{\mathrm{inv}}$, given any $\widehat{f}$ and context $x$, there exists a closed-form distribution $p\in\Delta(K)$ guaranteeing that $\mathrm{dec}_\gamma(p;\widehat{f},x,G_{\mathrm{inv}}) \le \widetilde{O}\big(\frac{1}{\gamma}\big)$, where $p$ is defined as
\[
p_j = \max\Big\{\frac{1}{1+\gamma(\widehat{f}_j - \min_i\widehat{f}_i)} - \sum_{j'>j} p_{j'},\; 0\Big\} \quad\text{for all } j\in[K].
\]

Undirected Self-Aware Graph. For an undirected and self-aware feedback graph $G$, meaning that $G$ is symmetric and has all diagonal entries equal to 1, we also show that a certain closed-form solution $p$ satisfies that $\mathrm{dec}_\gamma(p;\widehat{f},x,G)$ is bounded by $O(\frac{\alpha}{\gamma})$.

Proposition 2.3.4. When $G$ is an undirected self-aware graph, given any $\widehat{f}$ and context $x$, there exists a closed-form distribution $p\in\Delta(K)$ guaranteeing that $\mathrm{dec}_\gamma(p;\widehat{f},x,G) \le O\big(\frac{\alpha}{\gamma}\big)$.

2.3.5 Experiments

In this section, we use empirical results to demonstrate the significant benefit of SquareCB.G in leveraging the graph feedback structure and its superior effectiveness compared to SquareCB. Following [61], we use the progressive validation (PV) loss as the evaluation metric, defined as $L_{\mathrm{pv}}(T) = \frac{1}{T}\sum_{t=1}^T\ell_{t,a_t}$. All the feedback graphs used in the experiments are deterministic. We run experiments on an Intel Xeon Gold 6240R 2.4GHz CPU, and the convex program solver is implemented via Vowpal Wabbit [94].

2.3.5.1 SquareCB.G under Different Feedback Graphs

In this subsection, we show that SquareCB.G benefits from considering the graph structure by evaluating its performance under three different feedback graphs. We conduct experiments on the RCV1 dataset and leave the implementation details to Appendix C.3.1. The performance of SquareCB.G under the bandit graph, the full-information graph, and the cops-and-robbers graph is shown in the left part of Figure 2.3. We observe that SquareCB.G performs best under the full-information graph and worst under the bandit graph. Under the cops-and-robbers graph, much of the gap between bandit and full information is eliminated. This improvement demonstrates the benefit of utilizing graph feedback for accelerating learning.

Figure 2.3: Left figure: performance of SquareCB.G on the RCV1 dataset under three different feedback graphs. Right figure: performance comparison between SquareCB.G and SquareCB under random directed self-aware feedback graphs.

2.3.5.2 Comparison between SquareCB.G and SquareCB

In this subsection, we compare the effectiveness of SquareCB.G with the SquareCB algorithm. To ensure a fair comparison, both algorithms update the regressor using the same feedback based on the graph; the only distinction lies in how they compute the action probability distribution. We summarize the main results here and leave the implementation details to Appendix C.3.2.

Results on Random Directed Self-aware Graphs. We conduct experiments on the RCV1 dataset using random directed self-aware feedback graphs. Specifically, at round $t$, the deterministic feedback graph $G_t$ is generated as follows.
The diagonal elements of Gt are all 1, and each off-diagonal entry is drawn from a Bernoulli(3/4) distribution. The results are presented in the right part of Figure 2.3. Our SquareCB.G consistently outperforms SquareCB and demonstrates lower variance, particularly when the number of iterations was small. This is because when there are fewer samples available to train the regressor, it is more crucial to design an effective algorithm that can leverage the graph feedback information. Results on Synthetic Inventory Dataset In the inventory graph experiments, we create a synthetic inventory dataset and design a loss function for each inventory level at ∈ [0, 1] with demand dt ∈ [0, 1]. 61 Figure 2.4: Performance comparison between SquareCB.G and SquareCB on synthetic inventory dataset. Left figure: Results under fixed discretized action set. Right figure: Results under adaptive discretization of the action set. Both figures show the superiority of SquareCB.G compared with SquareCB. Since the action set [0, 1] is continuous, we discretize the action set in two different ways to apply the algorithms. Fixed discretized action set. In this setting, we discretize the action set using fixed grid size ε ∈ { 1 100 , 1 300 , 1 500 }, which leads to a finite action set A of size 1 ε + 1. Note that according to Theorem 2.3.2, our regret does not scale with the size of the action set (to within polylog factors), as the independence number is always 1. The results are shown in the left part of Figure 2.4. We remark several observations from the results. First, our algorithm SquareCB.G outperforms the previous SquareCB algorithm for all choices K ∈ {101, 301, 501}. This indicates that SquareCB.G utilizes a better exploration scheme and effectively leverages the structure of Ginv. Second, we observe that SquareCB.G indeed does not scale with the size of the discretized action set A, since under different discretization scales, SquareCB.G has similar performances and the slight differences are from the improved approximation error with finer discretization. This matches the theoretical guarantee that we prove in Theorem 2.3.2. On the other hand, SquareCB does perform worse when the size of the action set increases, matching its theoretical guarantee which scales with the square root of the size of the action set. 62 Adaptively changing action set. In this setting, we adaptively discretize the action set [0, 1] according to the index of the current round. Specifically, for SquareCB.G, we uniformly discretize the action set [0, 1] with size ⌈ √ t⌉, whose total discretization error is O( √ T) due to the Lipschitzness of the loss function. For SquareCB, to optimally balance the dependency on the size of the action set and the discretization error, we uniformly discretize the action set [0, 1] into ⌈t 1 3 ⌉ actions. The results are illustrated in the right part of Figure 2.4. We can observe that SquareCB.G consistently outperforms SquareCB by a clear margin. 2.3.6 Discussion In this section, we consider the design of practical contextual bandits algorithm with provable guarantees. Specifically, we propose the first efficient algorithm that achieves near-optimal regret bound for contextual bandits with general directed feedback graphs with an online regression oracle. 
While we study the informed graph feedback setting, where the entire feedback graph is exposed to the algorithm prior to each decision, many practical problems of interest are possibly uninformed graph feedback problems, where the graph is unknown at the decision time. To handle the uninformed graph feedback case, one idea is to consume the additional feedback in the online regressor and adjust the prediction loss to reflect this additional structure, e.g., using the more general version of the E2D framework which incorporates arbitrary side observations [60]. Cohen, Hazan, and Koren [45] consider this uninformed setting in the non-contextual case and prove a sharp distinction between the adversarial and stochastic settings: even if the graphs are all strongly observable with bounded independence number, in the adversarial setting the minimax regret is Θ(T) whereas in the stochastic setting the minimax regret is Θ(√ αT). Intriguingly, our setting is semi-adversarial due to realizability of the mean loss, and therefore it is apriori unclear whether the negative adversarial result applies. In fact, in the next section, we show that it is possible to jointly learn both the loss and the feedback graph simultaneously in the uninformed contextual setting. 63 In addition, bandits with graph feedback problems often present with associated policy constraints, e.g., for the apple tasting problem, it is natural to rate limit the informative action. Therefore, another interesting direction is to combine our algorithm with the recent progress in contextual bandits with knapsack [135], leading to more practical algorithms. 64 2.4 Efficient Contextual Bandits with Uninformed Feedback Graphs In Section 2.3, we consider the informed contextual bandits with feedback graphs, where the feedback graph at each round is known to the learner before her making the decision. While this model includes many real-world applications such as contextual inventory control, there are other applications where the learner does not have any knowledge of the feedback graph when making the decision. For example, in the online pricing application [45], at each round, the seller (learner) has to publish a selling price for her product. Then, the buyer will decide whether to buy the product or not based on whether the published price is above his own private value or not. From the seller’s perspective, the only feedback is whether or not the buyer makes the purchase. In this case, the feedback graph is determined by the buyer’s private value, which is not known to the seller when (or even after) she makes the decision (details on the structure of the feedback graph are introduced in Section 2.4.4). Cohen, Hazan, and Koren [45] first considered this more challenging uninformed setting without context and derived algorithms achieving near-optimal regret guarantees. However, there is no prior work offering a solution for contextual bandits with uninformed feedback graphs. In this section, we take the first step in this direction and propose the first efficient algorithms achieving strong regret guarantees. Specifically, our contributions are as follows. Contributions. Our algorithm, SquareCB.UG, is based on the SquareCB.G algorithm of Zhang et al. [148], which is introduced in Algorithm 5 in Section 2.3. Assuming realizability on the loss function and an online square loss regression oracle, Zhang et al. [148] extended the minimax framework for contextual bandits [58, 61] to contextual bandits under informed feedback graphs. 
With uninformed graphs, we further assume that they are realizable by another function class G and propose to learn them simultaneously so that in each round we can plug in the predicted graph into SquareCB.G. While the idea is natural, our 65 analysis is highly non-trivial and, perhaps more importantly, reveals that it is crucial to learn these graphs using log loss instead of squared loss. More specifically, within the uninformed setting, we analyze two different types of feedback on the graph structure. In the partially revealed graph setting, the learner only observes which actions are connected to the selected action, and our algorithm achieves Oe( p α(G)T) regret (ignoring the regression overhead), where α(G) is the maximum expected independence number over all graphs in G; in the easier fully revealed graph setting, the learner observes the entire graph after her decision, and our algorithm achieves an improved Oe qPT t=1 αt regret bound, where αt denotes the expected independence number of the feedback graph at round t. §§ We note that this latter bound even matches the optimal regret for the easier informed setting [148]. In addition to these strong theoretical guarantees, we also empirically test our algorithm on a bidding application with both synthetic and real-world data and show that it indeed outperforms the greedy algorithm or algorithms that ignore the additional feedback from the graphs. 2.4.1 Preliminary Throughout this section, we denote the set of distributions over some set S by ∆(S), and the convex hull of some set S by conv(S). For a vector v ∈ R m and a matrix M ∈ R m×m, vi denotes the i-th coordinate of v and Mi,j denotes the (i, j)’s entry of M for i, j ∈ [m]. The contextual bandits problem with uninformed feedback graphs proceeds in T rounds. At each round t, the environment (possibly randomly and adaptively) selects a context xt from some arbitrary context space X , a loss vector ℓt ∈ [0, 1]K specifying the loss of each of the K possible actions, and finally a directed feedback graph Gt = ([K], Et) where Et ⊆ [K] × [K] denotes the set of directed edges. The learner then observes the context xt (but not ℓt or Gt ) and has to select an action it ∈ [K]. At the §§For simplicity, in this section, we only consider the strongly observable graphs where √ T-regret is achievable, but our ideas can be directly generalized to weakly observable graphs as well (where T 2/3 -regret is minimax optimal). 66 end of this round, the learner suffers loss ℓt,it and observes the loss of every action connected to it (not necessarily including it itself): ℓt,j for all j ∈ At , where At = {j ∈ [K],(it , j) ∈ Et}. In the partially revealed graph setting, the learner does not observe anything else about the graph (other than At ), while in the fully revealed graph setting, the learner additionally observes the entire graph (that is, Et ). Alon et al. [12] showed that in the non-contextual version of this problem, there are essentially only two types of nontrivial and learnable feedback graphs: strongly observable graphs and weakly observable graphs. For simplicity, in this section, we focuses solely on the first type, that is, we assume that Gt is always strongly observable, meaning that for each node i ∈ [K], either it can observe itself ((i, i) ∈ Et ) or it can be observed by any other nodes ((j, i) ∈ Et for any j ∈ [K]\{i}). As mentioned in Footnote §§, our results can be directly generalized to weakly observable graphs as well. 
Bandits with uninformed feedback graphs naturally capture many applications such as online pricing, viral marketing, and recommendation in social networks [91, 12, 13, 128, 45, 104]. By incorporating contexts, which are broadly available in practice, our model significantly increases its applicability in the real world. For a concrete example, see Section 2.4.4 for an application of bidding in a first-price auction.

Realizability and oracle assumptions. Following a line of recent works on developing efficient contextual bandit algorithms, we make the following standard realizability assumption on the loss function, stating that the expected loss of each action can be perfectly predicted by an unknown loss predictor from a known class:

Assumption 3 (Realizability of mean loss). We assume that the learner has access to a function class $\mathcal{F}\subseteq(\mathcal{X}\times[K]\mapsto[0,1])$ in which there exists an unknown regression function $f^\star\in\mathcal{F}$ such that for any $i\in[K]$ and $t\in[T]$, we have $\mathbb{E}[\ell_{t,i}\mid x_t] = f^\star(x_t,i)$.

The goal of the learner is naturally to be comparable to an oracle strategy that knows $f^\star$ ahead of time, formally measured by the (expected) regret:
\[
\mathrm{Reg}_{\mathrm{CB}} \triangleq \mathbb{E}\Bigg[\sum_{t=1}^T\Big(f^\star(x_t,i_t) - \min_{i\in[K]} f^\star(x_t,i)\Big)\Bigg].
\]
To efficiently minimize regret for a general class $\mathcal{F}$, it is important to assume some oracle access to this class. To this end, we follow prior works and assume that the learner is given an online regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$ for function class $\mathcal{F}$, which follows this protocol: at each round $t\in[T]$, the oracle $\mathrm{Alg}^{\mathrm{Sq}}$ produces an estimator $f_t\in\mathrm{conv}(\mathcal{F})$, then receives a context $x_t$ and a set $S_t$ of action-loss pairs of the form $(a,c)\in[K]\times[0,1]$. The squared loss of the oracle for this round is $\sum_{(a,c)\in S_t}(f_t(x_t,a)-c)^2$, which is on average assumed to be close to that of the best predictor in $\mathcal{F}$:

Assumption 4 (Bounded squared loss regret). The regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$ guarantees:
\[
\sum_{t=1}^T\sum_{(a,c)\in S_t}\big(f_t(x_t,a)-c\big)^2 - \inf_{f\in\mathcal{F}}\sum_{t=1}^T\sum_{(a,c)\in S_t}\big(f(x_t,a)-c\big)^2 \le \mathrm{Reg}_{\mathrm{Sq}}.
\]
Here, $\mathrm{Reg}_{\mathrm{Sq}}$ is a regret bound that is sublinear in $T$ and depends on some complexity measure of $\mathcal{F}$; see, e.g., Foster and Rakhlin [58] for concrete examples of such oracles and the corresponding regret bounds. The point is that online regression is such a standard machine learning practice that reducing our problem to online regression is both theoretically reasonable and practically desirable.

So far, we have made exactly the same assumptions as in Section 2.3, which studies the informed setting. In our uninformed setting, however, since nothing is known about the feedback graph before deciding which action to take, we propose to additionally learn the feedback graphs, which requires the following extra realizability and oracle assumptions related to the graphs.

Algorithm 6 SquareCB.UG
Input: parameter $\gamma\ge 0$, a regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$ for loss prediction, and a regression oracle $\mathrm{Alg}^{\mathrm{Log}}$ for graph prediction.
for $t = 1, 2, \ldots, T$ do
  Receive context $x_t$.
  Obtain a loss estimator $f_t$ from the oracle $\mathrm{Alg}^{\mathrm{Sq}}$ and a graph estimator $g_t$ from $\mathrm{Alg}^{\mathrm{Log}}$.
  Compute $p_t = \arg\min_{p\in\Delta([K])}\mathrm{dec}_\gamma(p; f_t, g_t, x_t)$ (a simple convex program; see Appendix D.3), where
  \[
  \mathrm{dec}_\gamma(p; f, g, x) = \sup_{\substack{i^\star\in[K]\\ v^\star\in[0,1]^K}}\Bigg[\sum_{i=1}^K p_i v^\star_i - v^\star_{i^\star} - \frac{\gamma}{4}\sum_{i=1}^K p_i\sum_{j=1}^K g(x,i,j)\big(f(x,j)-v^\star_j\big)^2\Bigg]. \tag{2.31}
  \]
  The environment decides the loss $\ell_t$ and the feedback graph $G_t = ([K], E_t)$.
  The learner samples $i_t$ from $p_t$ and observes $\{\ell_{t,j}\}_{j\in A_t}$, where $A_t = \{j\in[K] : (i_t,j)\in E_t\}$.
  Feed $x_t$ and $\{(j,\ell_{t,j})\}_{j\in A_t}$ to the oracle $\mathrm{Alg}^{\mathrm{Sq}}$.
  Case 1 (Partially revealed graph): construct $S_t = \{(i_t,j,1)\}_{j\in A_t}\cup\{(i_t,j,0)\}_{j\notin A_t}$.
  Case 2 (Fully revealed graph): observe $E_t$ and construct $S_t = \{(i,j,1)\}_{(i,j)\in E_t}\cup\{(i,j,0)\}_{(i,j)\notin E_t}$.
  Feed $x_t$ and $S_t$ to the oracle $\mathrm{Alg}^{\mathrm{Log}}$.
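For concreteness, the construction of the training set $S_t$ in the two cases at the end of Algorithm 6 can be written as follows; this is an illustrative sketch and the function name is ours.

def graph_oracle_examples(i_t, observed, K, E_t=None):
    """Training tuples (i, j, b) fed to Alg_Log at round t.
    observed: the set A_t of actions whose losses were seen.
    E_t: the set of directed edges, available only in the fully revealed case.
    """
    if E_t is None:   # Case 1: partially revealed graph
        return [(i_t, j, 1 if j in observed else 0) for j in range(K)]
    # Case 2: fully revealed graph
    return [(i, j, 1 if (i, j) in E_t else 0) for i in range(K) for j in range(K)]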
Assumption 5 (Realizability of mean graph). We assume that the learner has access to a function class $\mathcal{G}\subseteq(\mathcal{X}\times[K]\times[K]\mapsto[0,1])$ in which there exists a regression function $g^\star\in\mathcal{G}$ such that for any $(i,j)\in[K]\times[K]$ and $t\in[T]$, we have $\mathbb{E}[\mathbf{1}\{(i,j)\in E_t\}\mid x_t] = g^\star(x_t,i,j)$.

Similarly, since we do not impose specific structure on $\mathcal{G}$, we assume that the learner can access $\mathcal{G}$ through another online oracle $\mathrm{Alg}^{\mathrm{Log}}$: at each round $t\in[T]$, $\mathrm{Alg}^{\mathrm{Log}}$ produces an estimator $g_t\in\mathrm{conv}(\mathcal{G})$, then receives a context $x_t$ and a set $S_t$ of tuples of the form $(i,j,b)\in[K]\times[K]\times\{0,1\}$, where $b=1$ ($b=0$) means $i$ is (is not) connected to $j$. Importantly, our analysis shows that it is critical for this oracle to learn the graphs using log loss instead of squared loss (hence the name $\mathrm{Alg}^{\mathrm{Log}}$); see the detailed explanation in Section 2.4.3.1. More specifically, we assume that the oracle satisfies the following regret bound measured by log loss:

Assumption 6 (Bounded log loss regret). The regression oracle $\mathrm{Alg}^{\mathrm{Log}}$ guarantees:
\[
\sum_{t=1}^T\sum_{(i,j,b)\in S_t}\ell_{\log}\big(g_t(x_t,i,j), b\big) - \inf_{g\in\mathcal{G}}\sum_{t=1}^T\sum_{(i,j,b)\in S_t}\ell_{\log}\big(g(x_t,i,j), b\big) \le \mathrm{Reg}_{\mathrm{Log}},
\]
where for two scalars $u, v\in[0,1]$, $\ell_{\log}(u,v)$ is defined as
\[
\ell_{\log}(u,v) = v\log\frac{1}{u} + (1-v)\log\frac{1}{1-u}.
\]

Once again, the bound $\mathrm{Reg}_{\mathrm{Log}}$ is sublinear in $T$ and depends on some complexity measure of $\mathcal{G}$. We note that regression using log loss is also highly standard in practice. For concrete examples, we refer the reader to Foster and Krishnamurthy [61], where the same log loss oracle was used (for the different purpose of obtaining first-order regret guarantees for contextual bandits). In our analysis, we also make use of the following important technical lemma, which connects the log loss regret with the so-called triangular discrimination under the realizability assumption.

Lemma 2.4.1 (Proposition 5 of Foster and Krishnamurthy [61]). Suppose that for each $t$ and $(i,j,b)\in S_t$, we have $\mathbb{E}[b\mid x_t] = g^\star(x_t,i,j)$. Then the oracle $\mathrm{Alg}^{\mathrm{Log}}$ guarantees:
\[
\mathbb{E}\Bigg[\sum_{t=1}^T\sum_{(i,j,b)\in S_t}\frac{\big(g_t(x_t,i,j)-g^\star(x_t,i,j)\big)^2}{g_t(x_t,i,j)+g^\star(x_t,i,j)}\Bigg] \le 2\,\mathrm{Reg}_{\mathrm{Log}}.
\]

Independence number. It is known that for strongly observable graphs, the independence number characterizes the minimax regret [12]. Specifically, an independence set of a directed graph is a subset of nodes in which no two distinct nodes are connected. The size of the largest independence set in a graph $G$ is called its independence number, denoted by $\alpha(G)$. Since we consider stochastic graphs, we further define the independence number with respect to a $g\in\mathrm{conv}(\mathcal{G})$ and a context $x$ as
\[
\alpha(g,x) \triangleq \inf_{q\in\mathcal{Q}(g,x)}\mathbb{E}_{G\sim q}[\alpha(G)],
\]
where $\mathcal{Q}(g,x)$ denotes the set of all distributions over strongly observable graphs whose expected edge connections are specified by $g(x,\cdot,\cdot)$ (that is, for any $q\in\mathcal{Q}(g,x)$, we have $\mathbb{E}_{([K],E)\sim q}[\mathbf{1}\{(i,j)\in E\}] = g(x,i,j)$ for all $(i,j)$). With this notion, the difficulty of $G_t$ is characterized by the independence number $\alpha_t \triangleq \alpha(g^\star, x_t)$. In the more challenging partially revealed graph setting, however, our result depends on the worst-case independence number over the entire class $\mathcal{G}$: $\alpha(\mathcal{G}) \triangleq \sup_{g\in\mathcal{G},\, x\in\mathcal{X}}\alpha(g,x)$.
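As a concrete, purely illustrative instance of the $\mathrm{Alg}^{\mathrm{Log}}$ interface in Assumptions 5 and 6, one can maintain an independent online logistic model per edge and update it with the log loss $\ell_{\log}$. The model choice and step size below are our own assumptions; the experiments in Section 2.4.4 instead predict the competing price with a linear classifier.

import numpy as np

class LogisticGraphOracle:
    """Minimal online graph-regression oracle trained with log loss (a sketch of
    the Alg_Log interface, not the thesis implementation)."""
    def __init__(self, context_dim, K, lr=0.1):
        self.w = np.zeros((K, K, context_dim))   # one weight vector per edge (i, j)
        self.lr = lr

    def predict(self, x):
        """Return g_t(x, ., .) in [0, 1]^{K x K}."""
        return 1.0 / (1.0 + np.exp(-(self.w @ x)))

    def update(self, x, tuples):
        """tuples: iterable of (i, j, b) with b in {0, 1}; one SGD step on the
        log loss b*log(1/u) + (1-b)*log(1/(1-u)) for each tuple."""
        for i, j, b in tuples:
            u = 1.0 / (1.0 + np.exp(-(self.w[i, j] @ x)))
            self.w[i, j] -= self.lr * (u - b) * x   # gradient of the log loss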
2.4.2 Algorithms and Regret Guarantees

In this section, we introduce our algorithm SquareCB.UG and its regret guarantees. To describe our algorithm, we first briefly introduce the SquareCB.G algorithm of Zhang et al. [148] for the informed setting: at each round $t$, given the loss estimator $f_t$ obtained from the regression oracle $\mathrm{Alg}^{\mathrm{Sq}}$ and the feedback graph $G_t$, SquareCB.G finds the action distribution $p_t\in\Delta([K])$ by solving $\arg\min_p\mathrm{dec}_\gamma(p; f_t, G_t, x_t)$, where the Decision-Estimation Coefficient (DEC) is defined as
\[
\mathrm{dec}_\gamma(p; f_t, G_t, x_t) \triangleq \sup_{\substack{i^\star\in[K]\\ v^\star\in[0,1]^K}}\Bigg[\sum_{i=1}^K p_i v^\star_i - v^\star_{i^\star} - \frac{\gamma}{4}\sum_{i=1}^K p_i\sum_{j=1}^K G_{t,i,j}\big(f_t(x_t,j)-v^\star_j\big)^2\Bigg] \tag{2.32}
\]
for some parameter $\gamma>0$, where we abuse notation by letting $G_t$ also represent its adjacency matrix. The idea of the DEC originates from Foster and Rakhlin [58] for contextual bandits and has since become a general way to tackle interactive decision making problems [60]. The first two terms within the supremum of Eq. (2.32) constitute the instantaneous regret of strategy $p$ against the best action $i^\star$ with respect to a loss vector $v^\star$, and the third term corresponds to the expected squared loss between the loss predictor $f_t(x_t,\cdot)$ and the loss vector $v^\star$ on the observed actions (since each action $i$ is selected with probability $p_i$ and, conditioned on $i$ being selected, each action $j$ is observed with probability $G_{t,i,j}$). Because the true loss vector is unknown, a supremum over $v^\star$ is taken (that is, the worst case is considered). The goal of the learner is to pick $p_t$ to minimize this DEC, since a small DEC means that the regret suffered by the learner is close to the regret of the regression oracle (which is assumed to be bounded). After selecting an action $i_t\sim p_t$ and seeing
Since RegSq and RegLog are both sublinear in T, this regret bound is also sublinear in T. More importantly, it has no polynomial dependence on the total number of actions K, and instead only depends ¶¶The notation Oe(·) hides logarithmic dependence on K and T. 72 on the worst-case independence number α(G) ≤ K. There are indeed important applications where the independence number of every encountered feedback graph must be small or even independent of K (e.g., the inventory control problem discussed in Zhang et al. [148] or the bidding application in Section 2.4.4), in which case it only makes sense to pick G such that α(G) is also small. While SquareCB.UG requires setting γ with the knowledge of α(G), in Appendix D.1.2, we show that applying certain doubling trick on the DEC value leads to the same regret bound even without the knowledge of α(G). However, it would be even better if instead of paying α(G) every round, we only pay the independence number of the corresponding stochastic feedback graph at each round t, that is, αt . While it is unclear to us whether this is achievable with partially revealed graphs, in the next theorem, we show that SquareCB.UG indeed achieves this in the easier fully revealed graph setting. Theorem 2.4.2. Under Assumptions 3-6 and fully revealed feedback graphs, SquareCB.UG with γ = max 12, r PT t=1 αt max{RegSq,RegLog} guarantees: RegCB = Oe vuutX T t=1 αt max{RegSq, RegLog} . In other words, we replace the α(G)T term in the regret with the smaller and more adaptive quantity PT t=1 αt , indicating that the complexity of learning only depends on how difficult each encountered graph is, but not the worst case difficulty among all the possible graphs in G. 2.4.3 Analysis In this section, we provide some key steps of our analysis, highlighting 1) why it is enough to replace the true graph in Eq. (2.32) with the graph estimator gt ; 2) why using log loss in the graph regression oracle is important; and 3) why having fully revealed graphs helps improve the dependence from α(G) to αt . 73 2.4.3.1 Analysis for Partially Revealed Graphs While we present the DEC in Eq. (2.31) as a natural modification of Eq. (2.32) in the absence of the true graph, it in fact can be rigorously derived as an upper bound on another DEC more tailored to our original problem. Specifically, we define for two parameters γ1, γ2 > 0: decP γ1,γ2 (p; f, g, x) ≜ sup i ⋆∈[K],v⋆∈[0,1]K M⋆∈[0,1]K×K X K i=1 piv ⋆ i − v ⋆ i ⋆ − γ1 X K i=1 pi X K j=1 M⋆ i,j (f(x, j) − v ⋆ j ) 2 −γ2 X K i=1 pi X K j=1 (M⋆ i,j − g(x, i, j))2 M⋆ i,j + g(x, i, j) . (2.33) Similar to Eq. (2.32), the first two terms within the supermum represent the instantaneous regret of strategy p against the best action i ⋆ with respect to a loss vector v ⋆ . The third term is also similar and represents the squared loss of AlgSq under strategy p, but since the true graph Gt is unknown, it is replaced with the worst-case adjacent matrix M⋆ (hence the supremum over M⋆ ). Finally, the last additional term is the expected triangular discrimination between M⋆ i,· and g(x, i, ·) when i is sampled from p, which, according to Lemma 2.4.1, represents the log loss regret of AlgLog. Once again, the idea is that if for every round t, we can find a strategy pt with a small DEC value decP γ1,γ2 (pt ; ft , gt , xt), then the learner’s overall regret RegCB will be close to the square loss regret RegSq of AlgSq plus the log loss regret RegLog of AlgLog, both of which are assumed to be reasonably small. This is formally stated below. Theorem 2.4.3. 
Under Assumptions 3-6, for any γ1, γ2 ≥ 0, the regret RegCB of SquareCB.UG is at most E "X T t=1 decP γ1,γ2 (pt ; ft , gt , xt) # + γ1RegSq + 2γ2RegLog. 74 Now, instead of directly minimizing the DEC to find the strategy pt (which is analytically complicated), the following lemma shows that the easier form of Eq. (2.31) serves as an upper bound of Eq. (2.33), further explaining our algorithm design. Lemma 2.4.2. For any p ∈ ∆([K]), g ∈ conv(G), f ∈ conv(F), and x ∈ X , we have decP 3 4 γ, 1 4 γ (p; f, g, x) ≤ decγ(p; f, g, x). To prove this lemma, we need to connect the squared loss with respect to M⋆ and that with respect to g. We achieve so using the following lemma, which also reveals why triangular discrimination naturally comes out. Lemma 2.4.3. For any z, z′ > 0, the following holds: 3z ′ ≥ z − (z − z ′ ) 2 z + z ′ . Proof. By AM-GM inequality, we have 2(z − z ′ ) ≤ (z + z ′ ) + (z − z ′ ) 2 z + z ′ . Rearranging then finishes the proof. Proof of Lemma 2.4.2. Using Lemma 2.4.3 with z ′ = M⋆ i,j (f(x, j) − v ⋆ j ) 2 , z = g(x, i, j)(f(x, j) − v ⋆ j ) 2 , 75 we have 3M⋆ i,j (f(x, j) − v ⋆ j ) 2 ≥ g(x, i, j)(f(x, j) − v ⋆ j ) 2 − (M⋆ i,j − g(x, i, j))2 M⋆ i,j + g(x, i, j) (f(x, j) − v ⋆ j ) 2 ≥ g(x, i, j)(f(x, j) − v ⋆ j ) 2 − (M⋆ i,j − g(x, i, j))2 M⋆ i,j + g(x, i, j) , where the last step uses the fact (f(x, j)−v ⋆ j ) 2 ≤ 1. Applying this inequality to the definition of Eq. (2.33) and plugging in the parameters γ1 = 3 4 γ and γ2 = 1 4 γ, we see that the two triangular discrimination terms cancel, leading us to exactly Eq. (2.31). The importance of using log loss when learning the feedback graphs now becomes clear: the log loss regret turns out to be exactly the price one needs to pay by pretending that the graph estimator is the true graph. The last step of the analysis is to show that the minimum DEC value, which our final strategy pt achieves, is reasonably small and related to some independence number with no polynomial dependence on the total number of actions: Lemma 2.4.4. For any g ∈ conv(G), f ∈ conv(F), x ∈ X , and γ ≥ 4, we have min p∈∆([K]) decγ(p; f, g, x) = O α(g, x) log(Kγ) γ . The proof of this lemma is deferred to Appendix D.1 and is a refinement and generalization of Theorem 3.2 of Zhang et al. [148] which only concerns deterministic graphs. Combining everything, we are now ready to prove Theorem 2.4.1. 76 Proof of Theorem 2.4.1. Setting γ1 = 3 4 γ and γ2 = 1 4 γ in Theorem 2.4.3 and combining it with Lemma 2.4.2, we know that RegCB is at most X T t=1 decγ(pt ; ft , gt , xt) + 3 4 γRegSq + 1 2 γRegLog. (2.34) Since pt is chosen by minimizing decγ(p; ft , gt , xt), we further apply Lemma 2.4.4 to bound the regret by O X T t=1 α(gt , xt) log(Kγ) γ ! + 3 4 γRegSq + 1 2 γRegLog. Finally, realizing α(gt , xt) ≤ α(G)(see Lemma D.1.1) and plugging in the choice of γ finishes the proof. 2.4.3.2 Analysis for Fully Revealed Graphs From the analysis for the partially revealed graph setting, we see that the α(G) dependence in fact comes from α(gt , xt), the independence number with respect to the graph estimator gt . To improve it to αt = α(g ⋆ , xt), we again need to connect two different graphs in the DEC definition using Lemma 2.4.3, as shown below. Lemma 2.4.5. For g, g′ ∈ conv(G), f ∈ conv(F), and x ∈ X , we have min p∈∆([K]) decγ(p; f, g′ , x) ≤ min p∈∆([K]) dec γ 3 (p; f, g, x) + γ 12 X K i=1 X K j=1 (g(x, i, j) − g ′ (x, i, j))2 g(x, i, j) + g ′(x, i, j) . Proof sketch. 
For a fix p, similarly to the proof of Lemma 2.4.2, we apply Lemma 2.4.3 with z ′ = g ′ (x, i, j)(f(x, j) − v ⋆ j ) 2 , z = g(x, i, j)(f(x, j) − v ⋆ j ) 2 77 for each i and j and arrive at decγ(p; f, g′ , x) ≤ dec γ 3 (p; f, g, x) + γ 12 X K i=1 pi X K j=1 (g(x, i, j) − g ′ (x, i, j))2 g(x, i, j) + g ′(x, i, j) . Further upper bounding pi by 1 in the last term and then taking min over p on both sides finishes the proof. This lemma allows us to connect the minimum DEC value with respect to gt and that with respect to g ⋆ , but with the price of γ 12 PK i=1 PK j=1 (gt(xt,i,j)−g ⋆(xt,i,j))2 gt(xt,i,j)+g ⋆(xt,i,j) , which, under fully revealed graphs, is essentially the per-round log loss regret in light of Lemma 2.4.1 since the oracle AlgLog indeed receives observations for all (i, j) pairs (importantly, this does not hold for partially revealed graphs). With this insight, we are ready to prove the main theorem. Proof of Theorem 2.4.2. First note that Theorem 2.4.3 in fact also holds for fully revealed graphs; this is intuitively true simply because the fully revealed case is easier than the partially revealed case, and is formally explained in the proof of Theorem 2.4.3. Therefore, combining it with Lemma 2.4.2, we still have RegCB bounded by X T t=1 min p decγ(p; ft , gt , xt) + 3 4 γRegSq + 1 2 γRegLog. By Lemma 2.4.5, this is at most X T t=1 min p dec γ 3 (p; ft , g⋆ , xt) + 3 4 γRegSq + 1 2 γRegLog + γ 12 X T t=1 X K i=1 X K j=1 (gt(xt , i, j) − g ⋆ (xt , i, j))2 gt(xt , i, j) + g ⋆(xt , i, j) . Finally, using Lemma 2.4.4 and Lemma 2.4.1, the above is further bounded by O X T t=1 αt log(Kγ) γ ! + 3 4 γRegSq + 2 3 γRegLog, 78 which completes the proof with our choice of γ. 2.4.4 Experiments In this section, we show empirical results of our SquareCB.UG algorithm by testing it on a bidding application. We start by describing this application, followed by modelling it as an instance of our partially revealed graph setting. Specifically, consider a bidder (the learner) participating in a first-price auction. At each round t, the bidder observes some context xt , while the environment decides a competing price wt ∈ [0, 1] (the highest price of all other bidders) and the value of the learner vt ∈ [0, 1] for the current item (unknown to the learner herself). Then, the learner decides her bid ct ∈ [0, 1]. If ct ≥ wt , the learner wins the auction, pays ct to the auctioneer (first-price), and observes her reward vt − ct ; otherwise, the learner loses the auction without observing her value vt , and her reward is 0. In either case, at the end of this round, the auctioneer announces the winning bid to all bidders. For the learner, this information is only meaningful when she loses the auction, in which case wt (the winning bid) is revealed to her. This problem is a natural instance of our model. Specifically, we let the learner choose her bid ct from a discretized set Aε = {0, ε, . . . , 1−ε, 1} of size K = 1 ε + 1 for some granularity ε. For ease of presentation, the i-th bid (i − 1)ε is denoted by ai . The feedback graph Gt is completely determined by the competing price wt in the following way: Gt,i,j = 1, ai < wt and aj < wt , 1, ai ≥ wt and j ≥ i, 0, otherwise, where we again overload the notation Gt to represent its adjacent matrix. 
This is because when bidding lower than the competing price wt , the learner observes wt and knows that bidding anything below wt gives 0 reward; and when bidding higher than wt , the bidder only knows that she would still win if she 79 were to bid even higher, and the corresponding reward can be calculated since she knows her value vt in this case. It is clear that this graph is strongly observable with independence number at most 2 and is only partially revealed at the end of each round if the learner wins (and fully revealed otherwise). On the other hand, the reward of action i is 1 [ai ≥ wt ] · (vt − ai), which we translate to a loss in [0, 1] by shifting and scaling: ℓt,i = 1 2 (1 − 1 [ai ≥ wt ] · (vt − ai)) ∈ [0, 1]. (2.35) Regression oracles. For the graph predictor, since the feedback graph is determined by the competing price wt , we use a linear classification model to predict the distribution of wt . Then at each round, we sample a competing price from this distribution, leading to the predicted feedback graph gt . For the loss predictor ft , since losses are determined by wt and vt , we use a two-layered fully connected neural network to predict the value vt and construct the loss predictors according to Eq. (2.35) with wt and vt replaced by their predicted values. For more details of the oracles and their training, see Appendix D.3. Implementation of SquareCB.UG. While Eq. (2.31) can be solved by a convex program, in order to implement SquareCB.UG even more efficiently, we use a closed-form solution of pt enabled by the specific structure of the predicted graph gt in this application. See Appendix D.3 for more details. 2.4.4.1 Empirical Results on Synthetic Data Data. We first generate two synthetic datasets {(xt , wt , vt)} T t=1 with T = 5000 and xt ∈ R 32 for all t ∈ [T]. The competing price wt and the value vt are generated by wt = √ 1 32 θ ⊤ 1 xt + εt , vt = wt + max{ √ 40 32 θ ⊤ 2 xt , 0}, where θ1, θ2 ∈ R 32 are sampled from standard Gaussian and εt ∼ N (0, 0.05) is a small noise. All wt and vt ’s are then normalized to [0, 1]. The two datasets only differ in how {xt} T t=1 are generated. Specifically, in the context of linear bandits, Bastani, Bayati, and Khosravi [26] showed that 80 Figure 2.5: Comparison among SquareCB.UG, SquareCB, greedy, and a trivial baseline on one synthetic dataset with diverse contexts (top figure) and another one with poor diversity (bottom figure). whether the simple greedy algorithm with no explicit exploration performs well depends largely on the context’s diversity, roughly captured by the minimum eigenvalue of its covariance matrix. We thus follow their work and generate two datasets where the first one enjoys good diversity and second one does not; see Appendix D.3 for details. Results. We compare our SquareCB.UG with SquareCB [58] (which ignores the additional feedback from graphs), the greedy algorithm (which simply picks the best action according to the loss predictors), and a trivial baseline that always bids 0. For the first three algorithms, we try three different granularity values ε ∈ { 1 25 , 1 50 , 1 75 } leading to three increasing number of actions, and we run each of them 4 times and plot the averaged normalized regret (RegCB/T) together with the standard deviation in Figure 2.5. We observe that our algorithm performs the best and, unlike SquareCB, its regret almost does not increase when the number of action increases, matching our theoretical guarantee. 
In addition, consistent with Bastani, Bayati, and Khosravi [26], while greedy indeed performs quite well when the contexts are diverse (top figure), it performs almost the same as the trivial “not bidding” baseline and suffers linear regret in the absence of diverse contexts (bottom figure). 81 Figure 2.6: Comparison among SquareCB.UG, SquareCB, greedy, and a trivial baseline on a real auction dataset. 2.4.5 Empirical Results on Real Auction Data Data. We also conduct experiments on a subset of 5000 samples of a real eBay auction dataset used in Mohri and Medina [119]; see Appendix D.3 for details. Results. We compare the four algorithms in the same way, with the only difference being the discretization granularity value ε ∈ { 1 50 , 1 100 , 1 150 }. The results are shown in Figure 2.6. Similar to what we observe in synthetic datasets, SquareCB.UG consistently outperforms other algorithms, demonstrating the advantage of exploration with graph information. The greedy algorithm’s performance is unstable and has a relatively large variance due to the lack of exploration. With all these results for both real and synthetic datasets, we show that our algorithm indeed effectively explores the environment utilizing the uninformed graph structure and is robust to different types of environments. 82 Chapter 3 Robust and Adaptive Algorithm Design for Linear Bandits In this chapter, we consider the problem of linear bandits. Different from the problem considered in Chapter 2, in linear bandits, the action set is no longer a set of finite size [K] but a convex set Ω ⊆ R d in d-dimensional space containing infinite number of actions. At each round t, the learner selects an action at ∈ Ω and the suffered loss is defined as the inner product of at and the loss vector ℓt that can be possibly adversarially chosen by the environment. This generalization allows the model to capture real-life problems such as building a personalized news recommendation system [102]. In this application, each time t corresponds to a visit of some user to the website. The available news articles at that time as well as the user’s information are then used to generate a feature vector for each article. The loss is then based on whether the user clicks on the recommended article or not. Suppose that the probability that the user clicks a certain article can be perfectly modeled by an unknown linear predictor. Then, the expected loss can be modeled as a linear function of feature vector of the selected article. In this chapter, we aim to design algorithms for linear bandits with adaptive and robust guarantees. Specifically, the structure of this chapter is as follows: 83 • In Section 3.1, we develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary, which not only resolves an open problem of Bartlett et al. [24] and Abernethy and Rakhlin [1] by designing the first general and efficient algorithm with a high-probability regret bound for adversarial linear bandits, but also resolves an open problem asked by Neu [121] by obtaining data-dependent high-probability regret guarantees. While existing approaches all require carefully constructing optimistic and biased loss estimators, our approach uses standard unbiased estimators and relies on a simple increasing learning rate schedule, together with the help of logarithmically homogeneous self-concordant barriers and a strengthened Freedman’s inequality. 
Moreover, our approach can also be applied to learning adversarial Markov Decision Processes and provides the first algorithm with a high-probability small-loss bound for this problem. • In Section 3.2, we consider deriving S-switching regret guarantee for adversarial linear bandits, whose benchmark is the loss of a sequence of comparators with at most S − 1 switches, providing a more robust performance guarantee. To achieve this, we first consider a problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The Corral algorithm of Agarwal et al. [7] and its variants [59] achieve this goal with a regret overhead of order Oe( √ MT) where M is the number of base algorithms and T is the time horizon. The polynomial dependence on M, however, prevents one from applying these algorithms to many applications where M is poly(T) or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on M as long as some conditions are satisfied. We then apply our recipe to the problem of adversarial linear bandits over a d-dimensional ℓp unitball for p ∈ (1, 2]. Specifically, by corralling a large set of T base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret Oe( √ dST) when 84 S is known. We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits. 85 3.1 High-Probability Adaptive Regret bounds for Linear Bandits In this section, we consider deriving high-probability regret guarantees for adversarial linear bandits. Online learning with partial information in an adversarial environment, such as the non-stochastic Multiarmed Bandit (MAB) problem [19], is a well-studied topic. However, the majority of work in this area has been focusing on obtaining algorithms with sublinear expected regret bounds, and these algorithms can in fact be highly unstable and suffer a huge variance. As we mentioned in Section 2.2, it is known that the classic Exp3 algorithm [19] for MAB suffers linear regret with a constant probability (over its internal randomness), despite having nearly optimal expected regret (see Section 11.5, Note 1 of [98]), making it a clearly undesirable choice in practice. To address this issue, a few works develop algorithms with regret bounds that hold with high probability, including those for MAB [19, 16, 121], linear bandits [24, 1], bandits with feedback graphs as we introduced in Section 2.2, and even adversarial Markov Decision Processes (MDPs) [85]. Getting highprobability regret bounds is also the standard way of deriving guarantees against an adaptive adversary whose decisions can depend on learner’s previous actions. This is especially important for problems such as routing in wireless networks (modeled as linear bandits in [21]) where adversarial attacks can indeed adapt to algorithm’s decisions on the fly. As far as we know, all previous existing high-probability methods (listed above) are based on carefully constructing biased loss estimators that enjoy smaller variance compared to standard unbiased ones. 
While this principle is widely applicable, the actual execution can be cumbersome; for example, the scheme proposed in [1] for linear bandits needs to satisfy seven conditions (see their Theorem 4), and other than two examples with specific action sets, no general algorithm satisfying these conditions was provided. In this section, we develop a new and simple approach to obtaining high-probability regret bounds that works for a wide range of bandit problems with an adaptive adversary (including MAB, linear bandits, MDP, and more). Somewhat surprisingly, in contrast to all previous methods, our approach uses 86 standard unbiased loss estimators. More specifically, our algorithms are based on Online Mirror Descent with a self-concordant barrier regularizer [2], a standard approach with expected regret guarantees. The key difference is that we adopt an increasing learning rate schedule, inspired by several recent works using similar ideas for completely different purposes (e.g., [7]). At a high level, the effect of this schedule magically cancels the potentially large variance of the unbiased estimators. Apart from its simplicity, there are several important advantages of our approach. First of all, our algorithms all enjoy data-dependent, or adaptive regret bounds, which could be much smaller than the majority of existing high-probability bounds in the form of Oe( √ T) where T is the number of rounds. As a key example, we provide details for obtaining a particular kind of such bounds called “small-loss” bounds in the form Oe( √ L⋆), where L ⋆ ≤ T is the loss of the benchmark in the regret definition. For MAB and linear bandits, our approach also obtains bounds in terms of the variation of the environment in the vein of [70, 126, 143, 37], resolving an open problem asked by Neu [121]. Second, our approach provides the first general and efficient algorithm for adversarial linear bandits (also known as bandit linear optimization) with a high-probability regret guarantee. As mentioned, Abernethy and Rakhlin [1] provide a general recipe for this task but in the end only show concrete examples for two specific action sets. The problem of obtaining a general and efficient approach with regret Oe( √ T) was left open since then. The work of [24] proposes an inefficient but general approach, while the work of [68, 30] develop efficient algorithms for polytopes but with Oe(T 2/3 ) regret. We not only resolve this long-standing open problem, but also provide improved data-dependent bounds. Third, our approach is also applicable to learning episodic MDPs with unknown transition, adversarial losses, and bandit feedback. The algorithm is largely based on a recent work [85] on the same problem where a high-probability Oe( √ T) regret bound is obtained. We again develop the first algorithm with a high-probability small-loss bound Oe( √ L⋆) in this setting. The problem in fact shares great similarity with the simple MAB problem. However, none of the existing methods for obtaining small-loss bounds for MAB 87 can be generalized to the MDP setting (at least not in a direct manner) as we argue in Section 3.1.3. Our approach, on the other hand, generalizes directly without much effort. Techniques. Most new techniques we developed in this section is in the algorithm for linear bandits (Section 3.1.2), which is based on the SCRiBLe algorithm from the seminal work [2, 3]. 
The first difference is that we propose to lift the problem from R d to R d+1 (where d is the dimension of the problem) and use a logarithmically homogeneous self-concordant barrier of the conic hull of the action set (which always exists) as the regularizer for Online Mirror Descent. The nice properties of such a regularizer lead to a smaller variance of the loss estimators. Equivalently, this can be viewed as introducing a new sampling scheme for the original SCRiBLe algorithm in the space of R d . The second difference is the aforementioned new learning rate schedule, where we increase the learning rate by a small factor whenever the Hessian of the regularizer at the current point is “large” in some sense. In addition, we also provide a strengthened version of the Freedman’s concentration inequality for martingales [66], which is crucial to all of our analysis and might be of independent interest. Related work. In online learning, there are subtle but important differences and connections between the concept of pseudo-regret, expected regret, and the actual regret, in the context of either oblivious or adaptive adversary. We refer the readers to [16] for detailed related discussions. While getting expected small-loss regret is common [9, 122, 63, 8, 100], most existing high-probability bounds are of order Oe( √ T). Although not mentioned in the original paper, the idea of implicit exploration from [121] can lead to high-probability small-loss bounds for MAB (see Section 12.3 Note 4 of [98]). Lykouris, Sridharan, and Tardos [113] adopt this idea together with a clipping trick to derive small-loss bounds for more general bandit problems with graph feedback. We are not aware of other works with highprobability small-loss bounds in the bandit literature. Note that in Section 6 of [16], some high-probability 88 “small-reward” bounds are derived, and they are very different in nature from small-loss bounds (specifically, the former is equivalent to Oe( √ T − L⋆) in our notation). We are also not aware of high-probability version of other data-dependent regret bounds such as those from [70, 126, 143, 37]. The idea of increasing learning rate was first used in the seminal work of Bubeck, Eldan, and Lee [36] for convex bandits. Inspired by this work, Agarwal et al. [7] first combined this idea with the log-barrier regularizer for the problem of “corralling bandits”. Since then, this particular combination has proven fruitful for many other problems [143, 110, 100]. We also use it for MAB and MDP, but our algorithm for linear bandits greatly generalizes this idea to any self-concordant barrier. Structure and notation. In Section 3.1.1, we start with a warm-up example on MAB, which is the cleanest illustration on the idea of using increasing learning rates to control the variance of unbiased estimators. Then in Section 3.1.2 and Section 3.1.3, we greatly generalize the idea to linear bandits and MDPs respectively. We focus on showing small-loss bounds as the main example, and only briefly discuss how to obtain other data-dependent regret bounds, since the ideas are very similar. 
We introduce the notation for each setting in the corresponding section, but will use the following general notation throughout Section 3.2: for a positive integer n, ∆n represents the (n − 1)-dimensional simplex; ei stands for the i-th standard basis vector and 1 stands for the all-one vector (both in an appropriate dimension depending on the context); for a matrix M ∈ R d×d , λmax(M) denotes the largest eigenvalue of M; Et [·] is a shorthand for the conditional expectation given the history before round t; Oe(·) hides all logarithmic terms. 3.1.1 Multi-armed bandits: an illustrating example We start with the most basic bandit problem, namely adversarial MAB [19], to demonstrate the core idea of using increasing learning rate to reduce the variance of standard algorithms. The MAB problem proceeds in rounds between a learner and an adversary. For each round t = 1, . . . , T, the learner selects one of 89 the d available actions it ∈ [d], while simultaneously the adversary decides a loss vector ℓt ∈ [0, 1]d with ℓt,i being the loss for arm i. An adaptive adversary can choose ℓt based on the learner’s previous actions i1, . . . , it−1 in an arbitrary way, while an oblivious adversary cannot and essentially decides all ℓt ’s ahead of time (knowing the learner’s algorithm). At the end of round t, the learner observes the loss of the chosen arm ℓt,it and nothing else. The standard measure of the learner’s performance is the regret as introduced in Chapter 1 and Chapter 2, which is defined as Reg = PT t=1 ℓt,it − mini∈[d] PT t=1 ℓt,i, that is, the difference between the total loss of the learner and that of the best fixed arm in hindsight. A standard framework to solve this problem is Online Mirror Descent (OMD), which at time t samples it from a distribution wt , updated in the following recursive form: wt+1 = argmin w∈∆d w, ℓbt + Dψt (w, wt), where ψt is the regularizer and ℓbt is an estimator for ℓt . The standard estimator is the importance-weighted estimator: ℓbt,i = ℓt,i1{it = i}/wt,i, which is clearly unbiased. Together with many possible choices of the regularizer (e.g., the entropy regularizer recovering Exp3 [19]), this ensures (nearly) optimal expected regret bound E[Reg] = Oe( √ dT) against an oblivious adversary. To obtain high-probability regret bounds (and also as a means to deal with adaptive adversary), various more sophisticated loss estimators have been proposed. Indeed, the key challenge in obtaining highprobability bounds lies in the potentially large variance of the unbiased estimators: Et ℓb2 t,i = ℓ 2 t,i/wt,i is huge if wt,i is small. The idea of all existing approaches to addressing this issue is to introduce a slight bias to the estimator, making it an optimistic underestimator of ℓt with lower variance (see e.g., [19, 16, 121]). Carefully balancing the bias and variance, these algorithms achieve Reg = Oe( p dT ln(d/δ)) with probability at least 1 − δ against an adaptive adversary. 90 Algorithm 7 OMD with log-barrier and increasing learning rates for Multi-armed Bandits Input: initial learning rate η. Define: increase factor κ = e 1 ln T , truncated simplex Ω = w ∈ ∆d : wi ≥ 1 T , ∀i ∈ [d] . Initialize: for all i ∈ [d], w1,i = 1/d, ρ1,i = 2d, η1,i = η. for t = 1, 2, . . . , T do Sample it ∼ wt , observe ℓt,it , and construct estimator ℓbt,i = ℓt,i1{it=i} wt,i for all i ∈ [d]. Compute wt+1 = argminw∈Ω w, ℓbt + Dψt (w, wt) where ψt(w) = Pd i=1 1 ηt,i ln 1 wi . 
for $i \in [d]$ do
  if $1/w_{t+1,i} > \rho_{t,i}$ then set $\rho_{t+1,i} = 2/w_{t+1,i}$ and $\eta_{t+1,i} = \eta_{t,i}\,\kappa$;
  else set $\rho_{t+1,i} = \rho_{t,i}$ and $\eta_{t+1,i} = \eta_{t,i}$.

Our algorithm. In contrast to all these existing approaches, we next show that, perhaps surprisingly, using the standard unbiased estimator can also lead to the same (in fact, an even better) high-probability regret bound. We start by choosing a particular regularizer called the log-barrier with time-varying and individual learning rates $\eta_{t,i}$: $\psi_t(w) = \sum_{i=1}^{d} \frac{1}{\eta_{t,i}} \ln\frac{1}{w_i}$, which is a self-concordant barrier for the positive orthant [120] and has been used for MAB in several recent works [63, 7, 33, 143, 37]. As mentioned in Section 3.1, the combination of the log-barrier and a particular increasing learning rate schedule has proven powerful for many different problems since the work of [7], and we apply it here as well. Specifically, the learning rates start with a fixed value $\eta_{1,i} = \eta$ for all arms $i \in [d]$, and every time the probability of selecting an arm $i$ becomes too small, in the sense that $1/w_{t+1,i} > \rho_{t,i}$ for some threshold $\rho_{t,i}$ (starting at $2d$), we set the new threshold to $2/w_{t+1,i}$ and increase the corresponding learning rate $\eta_{t,i}$ by a small factor $\kappa$. The complete pseudocode is shown in Algorithm 7. The only slight difference compared to the algorithm of [7] is that instead of enforcing a $1/T$ amount of uniform exploration explicitly (which makes sure that each learning rate is increased at most $O(\ln T)$ times), we directly perform OMD over a truncated simplex $\Omega = \{w \in \Delta_d : w_i \ge 1/T,\ \forall i \in [d]\}$, making the analysis cleaner.

As explained in [7], increasing the learning rate in this way allows the algorithm to quickly realize that some arms start to catch up even though they were underperforming in earlier rounds. This is also the hardest case in our context of obtaining high-probability bounds, because these arms have low-quality estimators at some point. At a technical level, this effect is neatly captured by a negative term in the regret bound, which we summarize below.

Lemma 3.1.1. Algorithm 7 ensures
$$\sum_{t=1}^{T} \ell_{t,i_t} - \sum_{t=1}^{T} \big\langle u, \hat{\ell}_t\big\rangle \;\le\; O\left(\frac{d\ln T}{\eta} + \eta \sum_{t=1}^{T} \ell_{t,i_t}\right) - \frac{\langle \rho_T, u\rangle}{10\,\eta \ln T}$$
for any $u \in \Omega$.

The important part is the last negative term involving the final thresholds $\rho_T$, whose magnitude is large whenever an arm has a small sampling probability at some point over the $T$ rounds. This bound has been proven in previous works such as [7] (see a proof in Appendix E.1.2), and next we use it to show that the algorithm in fact enjoys a high-probability regret bound, which was not discovered before. Indeed, comparing Lemma 3.1.1 with the definition of regret, one sees that as long as we can relate the estimated loss of the benchmark $\sum_t \langle u, \hat{\ell}_t\rangle$ to its true loss $\sum_t \langle u, \ell_t\rangle$, we immediately obtain a regret bound by setting $u = (1 - \frac{d}{T}) e_{i^\star} + \frac{1}{T}\mathbf{1} \in \Omega$, where $i^\star = \arg\min_i \sum_t \ell_{t,i}$ is the best arm.

A natural approach is to apply a standard concentration inequality, in particular Freedman's inequality [66], to the martingale difference sequence $\langle u, \hat{\ell}_t - \ell_t\rangle$. The deviation in Freedman's inequality is in terms of the variance of $\langle u, \hat{\ell}_t\rangle$, which in turn depends on $\sum_i u_i / w_{t,i}$. As explained earlier, the negative term is exactly related to this quantity and can thus cancel the potentially large variance! One caveat, however, is that the deviation in Freedman's inequality also depends on a fixed upper bound on the random variable $\langle u, \hat{\ell}_t\rangle \le \sum_i u_i / w_{t,i}$, which could be as large as $T$ (since $w_{t,i} \ge 1/T$) and ruin the bound.

If the dependence on such a fixed upper bound could be replaced with the (random) upper bound $\sum_i u_i / w_{t,i}$, then we could again use the negative term to cancel this dependence. Fortunately, since $\sum_i u_i / w_{t,i}$ is measurable with respect to the $\sigma$-algebra generated by everything before round $t$, we are
If the dependence on such a fixed upper bound could be replaced with the (random) upper bound P i ui/wt,i, then we could again use the negative term to cancel this dependence. Fortunately, since P i ui/wt,i is measurable with respect to the σ-algebra generated by everything before round t, we are 92 indeed able to do so. Specifically, we develop the following strengthened version of Freedman’s inequality, which might be of independent interest. Theorem 3.1.1. Let X1, . . . , XT be a martingale difference sequence with respect to a filtration F1 ⊆ · · · ⊆ FT such that E[Xt |Ft ] = 0. Suppose Bt ∈ [1, b] for a fixed constant b is Ft-measurable and such that |Xt | ≤ Bt holds almost surely. Then with probability at least 1 − δ we have X T t=1 Xt ≤ C p 8V ln (C/δ) + 2B ⋆ ln (C/δ) , where V = max 1, PT t=1 E[X2 t |Ft ] , B⋆ = maxt∈[T] Bt , and C = ⌈log(b)⌉⌈log(b 2T)⌉. This strengthened Freedman’s inequality essentially recovers the standard one when Bt is a fixed quantity. In our application, Bt is exactly ⟨ρt , u⟩ which is Ft-measurable. With the help of this concentration result, we are now ready to show the high-probability guarantee of Algorithm 7. Theorem 3.1.2. Algorithm 7 with a suitable choice of η ensures that with probability at least 1 − δ, Reg = Oe p dL⋆ ln (d/δ) + d ln (d/δ) , where L ⋆ = mini PT t=1 ℓt,i is the loss of the best arm. The proof is a direct combination of Lemma 3.1.1 and Theorem 3.1.1 and can be found inAppendix E.1.3. Our high-probability guarantee is of the same order Oe( p dT ln(d/δ)) as in previous works [19, 16] since L ⋆ = O(T). However, as long as L ⋆ = o(T) (that is, the best arm is of high quality), our bound becomes much better. This kind of high-probability small-loss bounds appears before (e.g., [113]). Nevertheless, in Section 3.1.3 we argue that only our approach can directly generalize to learning MDPs. Finally, we remark that the same algorithm can also obtain other data-dependent regret bounds by changing the estimator to ℓbt,i = (ℓt,i − mt,i)1{it = i}/wt,i + mt,i for some optimistic prediction mt . We 9 refer the reader to [143] for details on how to set mt in terms of observed data and what kind of bounds this leads to, but the idea of getting the high-probability version is completely the same as what we have illustrated here. This resolves an open problem mentioned in Section 5 of [121]. 3.1.2 Generalization to adversarial linear bandits Next, we significantly generalize our approach to adversarial linear bandits, which is the main algorithmic contribution of Section 3.1. Linear bandits generalize MAB from the simplex decision set ∆d to an arbitrary convex body Ω ⊆ R d . For each round t = 1, . . . , T, the learner selects an action wet ∈ Ω while simultaneously the adversary decides a loss vector ℓt ∈ R d , assumed to be normalized such that maxw∈Ω | ⟨w, ℓt⟩ | ≤ 1. Again, an adaptive adversary can choose ℓt based on the learner’s previous actions, while an oblivious adversary cannot. At the end of round t, the learner suffers and only observes loss ⟨wet , ℓt⟩. The regret of the learner is defined as Reg = maxu∈Ω PT t=1 ⟨wet − u, ℓt⟩, which is the difference between the total loss of the learner and that of the best fixed action within Ω. Linear bandits subsume many other well-studied problems such as online shortest path for network routing, online matching, and other combinatorial bandit problems (see e.g., [18, 39]). 
The seminal work of Abernethy, Hazan, and Rakhlin [2] develops the first general and efficient linear bandit algorithm (called SCRiBLe in its journal version [3]) with expected regret Oe(d √ νT) (against an oblivious adversary), which uses a ν-self-concordant barrier as the regularizer for OMD. It is known that any convex body in R d admits a ν-self-concordant barrier with ν = O(d) [120]. The minimax regret of this problem is known to be of order Oe(d √ T) [50, 32], but efficiently achieving this bound (in expectation) requires a log-concave sampler and a volumetric spanner of Ω [71]. High-probability bounds for linear bandits are very scarce, especially for a general decision set Ω. In [24], an algorithm with high-probability regret Oe( √ d 3T ln(1/δ)) was developed, but it cannot be implement efficiently. In [1], a general recipe was provided, but seven conditions need to be satisfied to arrive 94 at a high-probability guarantee, and only two concrete examples were shown (when Ω is the simplex or the Euclidean ball). We propose a new algorithm based on SCRiBLe, which is the first general and efficient linear bandit algorithm with a high-probability regret guarantee, resolving the problem left open since the work of [24, 1]. Issues of SCRiBLe. To introduce our algorithm, we first review SCRiBLe. As mentioned, it is also based on OMD and maintains a sequence w1, . . . , wT ∈ Ω updated as wt+1 = argmin w∈Ω w, ℓbt + 1 η Dψ(w, wt), where ℓbt is an estimator for ℓt , η is some learning rate, and importantly, ψ is a ν-self-concordant barrier for Ω which, again, always exists. To ensure smooth readability, we defer the definition and properties of self-concordant barriers to Appendix E.2.2. To incorporate exploration, the actual point played by the algorithm at time t is wet = wt + H −1/2 t st where Ht = ∇2ψ(wt) and st is uniformly randomly sampled from the d-dimensional unit sphere S d . ∗ The point wet is on the boundary of the Dikin ellipsoid centered at wt (defined as {w : ∥w − wt∥Ht ≤ 1}) and is known to be always within Ω. Finally, the estimator ℓbt is constructed as d ⟨wet , ℓt⟩ H 1/2 t st , which can be computed using only the feedback ⟨wet , ℓt⟩ and is unbiased as one can verify. The analysis of [2] shows the following bound related to the loss estimators: X T t=1 wt − u, ℓbt ≤ Oe ν η + ηd2T ∗ In fact, st can be sampled from any orthonormal basis of R d together with their negation. For example, in the original SCRiBLe, the eigenbasis of Ht is used as this orthonormal basis. The version of sampling from a unit sphere first appears in [132], which works more generally for convex bandits. 95 Algorithm 8 SCRiBLe with lifting and increasing learning rates Input: decision set Ω ⊆ R d , a ν-self-concordant barrier ψ for Ω, initial learning rate η. Define: increase factor κ = e 1 100d ln(νT ) , normal barrier Ψ(w) = Ψ(w, b) = 400 ψ w b − 2ν ln b . Initialize: w1 = argminw∈Ω ψ(w), w1 = (w1, 1), H1 = ∇2Ψ(w1), η1 = η, S = {1}. Define: shrunk lifted decision set Ω′ = {w = (w, 1) : w ∈ Ω, πw1 (w) ≤ 1 − 1 T }. 1 for t = 1, 2, . . . , T do 2 Uniformly at random sample st from H − 1 2 t ed+1⊥ ∩ S d+1 . 3 Compute wet = wt + H − 1 2 t st ≜ (wet , 1). 4 Play wet , observe loss ⟨wet , ℓt⟩, and construct loss estimator ℓbt = d⟨wet , ℓt⟩H 1 2 t st . 5 Compute wt+1 = argminw∈Ω′ w, ℓbt + DΨt (w, wt), where Ψt = 1 ηt Ψ. 6 Compute Ht+1 = ∇2Ψ(wt+1). 7 if λmax(Ht+1 − P τ∈S Hτ ) > 0 then S ← S ∪ {t + 1} and set ηt+1 = ηtκ; ; 8 else set ηt+1 = ηt .; for any u ∈ Ω (that is not too close to the boundary). 
Since Et [ wt − u, ℓbt ] = Et [ wet − u, ℓt ], this immediately yields an expected regret bound (for an oblivious adversary). However, to obtain a highprobability bound, one needs to consider the deviation of PT t=1 wt − u, ℓbt from PT t=1 wet − u, ℓt . Applying our strengthened Freedman’s inequality (Theorem 3.1.1) with Xt = wet − u, ℓt − wt − u, ℓbt , with some direct calculations one can see that both the variance term V and the range term B⋆ from the theorem are related to maxt∥wt∥Ht and maxt∥u∥Ht , both of which can be prohibitively large. We next discuss how to control each of these two terms, leading to the two new ideas of our algorithm (see Algorithm 8). Controlling ∥wt∥Ht . Readers who are familiar with self-concordant functions would quickly realize that ∥wt∥Ht = q w⊤ t ∇2ψ(wt)wt is simply √ ν provided that ψ is also logarithmically homogeneous. A logarithmically homogeneous self-concordant barrier is also called a normal barrier (see Appendix E.2.2 for formal definitions and related properties). However, normal barriers are only defined for cones instead of convex bodies. Inspired by this fact, we propose to lift the problem to R d+1. To make the notation clear, we use bold letters for vectors in R d+1 and matrices in R (d+1)×(d+1). The lifting is done by operating over a lifted Ω Ω Dikin ellipsoid w w Exploration region Conic hull Lifting Figure 3.1: An illustration of the concept of lifting, the conic hull, and the Dikin ellipsoid. In this example d is 2, and the pink disk at the bottom is the original decision set Ω. The gray dot w is a point in Ω. In Algorithm 8, we lift the problem from R 2 to R 3 , and obtain the lifted, orange, decision set Ω. For example, w is lifted to the black dot w = (w, 1). Then we construct the conic hull of the lifted decision set, that is, the gray cone, and construct a normal barrier for this conic hull. By Lemma E.2.1, the Dikin ellipsoid centered at w of this normal barrier (the green ellipsoid), is alway within the cone. In Algorithm 8, if w is the OMD iterate, we explore and play an action within the intersection of Ω and the Dikin ellipsoid centered at w, that is, the (boundary of) the blue ellipse. 97 decision set Ω = {w = (w, 1) ∈ R d+1 : w ∈ Ω}, that is, we append a dummy coordinate with value 1 to all actions. The conic hull of this set is K = {(w, b) : w ∈ R d , b ≥ 0, 1 bw ∈ Ω}. We then perform OMD over the lifted decision set but with a normal barrier defined over the cone K as the regularizer to produce the sequence w1, . . . , wT (Line 5). In particular, using the original regularizer ψ we construct the normal barrier as: Ψ(w, b) = 400 ψ w b − 2ν ln b . † Indeed, Proposition 5.1.4 of [120] asserts that this is a normal barrier for K with self-concordant parameter O(ν). So far nothing really changes since Ψ(w, 1) = 400ψ(w). However, the key difference is in the way we sample the point wet . If we still follow SCRiBLe to sample from the Dikin ellipsoid centered at wt , it is possible that the sampled point leaves Ω. To avoid this, it is natural to sample only the intersection of the Dikin ellipsoid and Ω (again an ellipsoid). Algebraically, this means setting wet = wt + H −1/2 t st where Ht = ∇2Ψ(wt) and st is sampled uniformly at random from (H −1/2 t ed+1) ⊥ ∩ S d+1 (v ⊥ is the space orthogonal to v). Indeed, since st is orthogonal to H −1/2 t ed+1, the last coordinate of H −1/2 t st is zero, making wet = (wet , 1) stay in Ω. See Line 2 and Line 3. 
To sample st efficiently, one can either sample a vector uniformly randomly from S d+1, project it onto the subspace perpendicular to H −1/2 t ed+1, and then normalize; or sample a vector st uniformly randomly from S d , then normalize H 1 2 t (s ⊤ t , 0)⊤ to obtain st . Finally, after playing wet and observing ⟨wet , ℓt⟩, we construct the loss estimator the same way as SCRiBLe: ℓbt = d ⟨wet , ℓt⟩ H 1/2 t st (Line 4). Lemma E.2.9 shows that the first d coordinates of ℓbt is indeed an unbiased estimator of ℓt . This makes the entire analysis of SCRiBLe hold in R d+1, but now the key term ∥wt∥Ht we want to control is exactly 20√ 2ν (see Lemma E.2.5 and Lemma E.2.6)! We provide an illustration of the lifting idea in Figure 3.1. One might ask whether this lifting is necessary; indeed, one can also spell out the algorithm in R d (see Appendix E.2.1). Importantly, compared to SCRiBLe, the key difference is still that the sampling scheme has changed: the sampled point is not †Our algorithm works with any normal barrier, not just this particular one. We use this particular form to showcase that we only require a self-concordant barrier of the original set Ω, exactly the same as SCRiBLe. necessarily on the Dikin ellipsoid with respect to ψ. In other words, another view of our algorithm is that it is SCRiBLe with a new sampling scheme. We emphasize that, however, it is important (or at least much cleaner) to perform the analysis in R d+1. In fact, even in Algorithm 7 for MAB, similar lifting implicitly happens already since ∆d is a convex body in dimension d − 1 instead of d, and log-barrier is indeed a canonical normal barrier for the positive orthant. Controlling ∥u∥Ht . Next, we discuss how to control the term ∥u∥Ht , or rather ∥u∥Ht after the lifting. This term is the analogue of P i ui wt,i for the case of MAB, and our goal is again to cancel it with the negative term introduced by increasing the learning rate. Indeed, a closer look at the OMD analysis reveals that increasing the learning rate at the end of time t brings a negative term involving −DΨ(u, wt+1) in the regret bound. In Lemma E.2.13, we show that this negative term is upper bounded by −∥u∥Ht+1 + 800ν ln(800νT + 1), making the canceling effect possible. It just remains to figure out when to increase the learning rate and how to make sure we only increase it logarithmic (in T) times as in the case for MAB. Borrowing ideas from Algorithm 7, intuitively one should increase the learning rate only when Ht is “large” enough, but the challenge is how to measure this quantitatively. Only looking at the eigenvalues of Ht , a natural idea, does not work as it does not account for the fact that the directions of eigenvectors are changing over time. Instead, we propose the following condition: at the end of time t, increase the learning rate by a factor of κ if λmax(Ht+1 − P τ∈S Hτ ) > 0, with S containing all the previous time steps prior to time t where the learning rate was increased (Line 7). First, note that this condition makes sure that we always have enough negative terms to cancel maxt∥u∥Ht . Indeed, suppose t is the time with the largest ∥u∥Ht+1 . If we have increased the learning rate at time t, then the introduced negative term exactly matches ∥u∥Ht+1 as mentioned above; otherwise, the condition did not hold and by definition we have ∥u∥Ht+1 ≤ qP τ∈S∥u∥ 2 Hs ≤ P τ∈S∥u∥Hτ , meaning that the negative terms introduced in previous steps are already enough to cancel ∥u∥Ht+1 . 
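The remaining per-round operations of Algorithm 8 then reduce to constructing the loss estimator and checking the learning-rate condition just described. The following is a sketch only (the OMD update over $\Omega'$ is omitted, and the helper names are hypothetical); the set $S$ is represented by the stored Hessians and is initialized with $\mathbf{H}_1$, matching the pseudocode.

```python
import numpy as np

def loss_estimator(d, observed_loss, H_sqrt, s):
    """SCRiBLe-style estimator ell_hat_t = d * <w_tilde_t, ell_t> * H_t^{1/2} s_t,
    whose first d coordinates are unbiased for ell_t."""
    return d * observed_loss * (H_sqrt @ s)

def maybe_increase_learning_rate(H_next, increase_set, eta, kappa):
    """Increase eta by the factor kappa iff lambda_max(H_{t+1} - sum_{tau in S} H_tau) > 0,
    in which case the new Hessian is recorded (i.e., t + 1 is added to S)."""
    gap = H_next - sum(increase_set)          # increase_set is non-empty (contains H_1)
    if np.max(np.linalg.eigvalsh(gap)) > 0:
        increase_set.append(H_next)
        eta *= kappa
    return eta, increase_set
```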
99 Second, in Lemma E.2.12 we show that this schedule indeed makes sure that the learning rate is increased by only Oe(d) times. The key idea is to prove that det(P τ∈S Hτ ) is at least doubled each time we add one more time step to S. Thus, if the eigenvalues of Ht are bounded, |S| cannot be too large. Ensuring the last fact requires a small tweak to the OMD update (Line 5), where we constrain the optimization over a slightly shrunk version of Ω defined as Ω′ = {w ∈ Ω : πw1 (w) ≤ 1 − 1 T }. Here, π is the Minkowsky function and we defer its formal definition to Appendix E.2.2, but intuitively Ω′ is simply obtained by shrinking the lifted decision set by a small amount of 1/T with respect to the center w1 (which is the analogue of the truncated simplex for MAB). This makes sure that wt is never too close to the boundary, and in turn makes sure that the eigenvalues of Ht are bounded. This concludes the two main new ideas of our algorithm; see Algorithm 8 for the complete pseudocode. Clearly, our algorithm can be implemented as efficiently as the original SCRiBLe. The regret guarantee is summarized below. Theorem 3.1.3. Algorithm 8 with a suitable choice of η ensures that with probability at least 1 − δ: Reg = Oe d 2ν q T ln 1 δ + d 2ν ln 1 δ , against an oblivious adversary; Oe d 2ν q dT ln 1 δ + d 3ν ln 1 δ , against an adaptive adversary. Moreover, if ⟨w, ℓt⟩ ≥ 0 for all w ∈ Ω and all t, then T in the bounds above can be replaced by L ⋆ = minu∈Ω PT t=1 ⟨u, ℓt⟩, that is, the total loss of the best action. Our results are the first general high-probability regret guarantees achieved by an efficient algorithm (for either oblivious or adaptive adversary). We not only achieve √ T-type bounds, but also improve it to √ L⋆-type small-loss bounds, which does not exist before. Note that the latter holds only when losses are nonnegative, which is a standard setup for small-loss bounds and is true, for instance, for all combinatorial bandit problems where Ω ⊆ [0, 1]d lives in the positive orthant. Similarly to MAB, we can also obtain 100 other data-dependent regret bounds by only changing the estimator to d⟨wet , ℓt − mt⟩H 1/2 t st + mt for some predictor mt (see [126, 37]).‡ Ignoring lower order terms, our bound for oblivious adversaries is d √ ν times worse than the expected regret of SCRiBLe. For adaptive adversary, we pay extra dependence on d, which is standard since an extra union bound over u is needed and is discussed in [1] as well. The minimax regret for adaptive adversary is still unknown. Reducing the dependence on d for both cases is a key future direction. 3.1.3 Generalization to adversarial MDPs Finally, we briefly discuss how to generalize Algorithm 7 for MAB to learning adversarial Markov Decision Processes (MDPs), leading to the first algorithm with a high-probability small-loss regret guarantee for this problem. We consider an episodic MDP setting with finite horizon, unknown transition kernel, bandit feedback, and adversarial losses, the exact same setting as the recent work [85] (which is the state-of-theart for adversarial tabular MDPs; see [85] for related work). Specifically, the problem is parameterized by a state space X, an action space A, and an unknown transition kernel P : X × A × X → [0, 1] with P(x ′ |x, a) being the probability of reaching state x ′ after taking action a at state x. Without loss of generality (see discussions in [85]), the state space is assumed to be partitioned into J + 1 layers X0, . . . 
, XJ where X0 = {x0} and XJ = {xJ } contain only the start and end state respectively, and transitions are only possible between consecutive layers. The learning proceeds in T rounds/episodes. In each episode t, the learner starts from state x0 and decides a stochastic policy πt : X × A → [0, 1], where πt(a|x) is the probability of selecting action a at state x. Simultaneously, the adversary decides a loss function ℓt : X ×A → [0, 1], with ℓt(x, a) being the loss of selecting action a at state x. Once again, an adaptive adversary chooses ℓt based on all learner’s actions in ‡One caveat is that this requires measuring the learner’s loss in terms of ⟨wt, ℓt⟩, as opposed to wet, ℓt , since the deviation between these two is not related to mt. 101 previous episodes, while an oblivious adversary chooses ℓt only knowing the learner’s algorithm. Afterwards, the learner executes the policy in the MDP for J steps and generates/observes a state-action-loss sequence (x0, a0, ℓt(x0, a0)), . . . ,(xJ−1, aJ−1, ℓt(xJ−1, aJ−1)) before reaching the final state xJ . With a slight abuse of notation, we use ℓt(π) = E hPJ−1 k=1 ℓt(xk, ak) | P, πi to denote the expected loss of executing policy π in episode t. The regret of the learner is then defined as Reg = PT t=1 ℓt(πt)−minπ PT t=1 ℓt(π), where the min is over all possible policies. Based on several prior works [153, 129], Jin et al. [85] showed the deep connection between this problem and adversarial MAB. In fact, with the help of the “occupancy measure” concept, this problem can be reformulated in a way that becomes very much akin to adversarial MAB and can be essentially solved using OMD with some importance-weighted estimators. We refer the reader to [85] and Appendix E.3.1 for details. The algorithm of [85] achieves Reg = Oe(J|X| p |A|T) with high probability. Since the problem has great similarity with MAB, the natural idea to improve the bound to a smallloss bound is to borrow techniques from MAB. Prior to our work, obtaining high-probability small-loss bounds for MAB can only be achieved by either the implicit exploration idea from [121] or the clipping idea from [9, 113]. Unfortunately, in Appendix E.3.4, we argue that neither of them works for MDPs, at least not in a direct way we can see, from perspectives of both the algorithm and the analysis. On the other hand, our approach from Algorithm 7 immediately generalizes to MDPs without much effort. Compared to the algorithm of [85], the only essential differences are to replace their regularizer with log-barrier and to apply a similar increasing learning rate schedule. The algorithm is deferred to Appendix E.3.2 and we show the main theorem below. Theorem 3.1.4. Algorithm 21 with a suitable choice of η ensures that with probability at least 1 − δ, Reg = Oe |X| r J|A|L⋆ ln 1 δ + |X| 5 |A| 2 ln2 1 δ ! , 102 where L ⋆ = minπ PT t=1 ℓt(π) ≤ JT is the total loss of the best policy. We remark that our bound holds for both oblivious and adaptive adversaries, and is the first highprobability small-loss bounds for adversarial MDPs.§ This matches the bound of [85] in the worst case (including the lower-order term Oe(|X| 5 |A| 2 ) hidden in their proof), but could be much smaller as long as a good policy exists with L ⋆ = o(T). It is still open whether this bound is optimal or not. §Obtaining other data-dependent regret bounds as in MAB and linear bandits is challenging in this case, since there are several terms in the regret bound that are naturally only related to L ⋆ . 
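As a concrete illustration of this episodic setting, the following sketch (with hypothetical data structures) evaluates the expected episode loss $\ell_t(\pi)$ of a stochastic policy by propagating state-occupancy probabilities through the layers; it is not part of the learning algorithm, which only has bandit feedback and an unknown transition kernel.

```python
def expected_policy_loss(layers, P, pi, loss):
    """Expected episode loss of a stochastic policy pi in a layered MDP.
    layers: list of lists of states (X_0, ..., X_J); P[(x, a)]: dict next_state -> prob;
    pi[x]: dict action -> prob; loss[(x, a)]: loss in [0, 1]."""
    occ = {layers[0][0]: 1.0}                    # start deterministically at x_0
    total = 0.0
    for k in range(len(layers) - 1):             # visit layers X_0, ..., X_{J-1}
        nxt = {x: 0.0 for x in layers[k + 1]}
        for x, px in occ.items():
            for a, pa in pi[x].items():
                total += px * pa * loss[(x, a)]  # expected loss collected at (x, a)
                for x2, p2 in P[(x, a)].items():
                    nxt[x2] += px * pa * p2      # push occupancy to the next layer
        occ = nxt
    return total
```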
103 3.2 Switching Regret for Adversarial Linear Bandits In this section, we switch our focus from classic regret defined in Eq. (1.1) to S-switching regret, which is a stronger notion of regret.¶ S-switching regret measures the learner’s performance against a sequence of changing comparators with at most S − 1 switches. To tackle this problem, we start from considering a more general problem of combining a set of bandit algorithms to learn the best one on the fly. The connection with the S-switching regret will soon be clear in later paragraphs. In fact, by combining a set of base algorithms, each dedicated for a certain type of environments, the final meta algorithm can then automatically adapt to and perform well in every problem instance encountered, as long as the price of such meta-level learning is small enough. While such ideas have a long history in online learning, doing so with partial information (that is, bandit feedback) is particularly challenging, and only recently have we seen success in various settings [7, 124, 59, 100, 93, 144, 36, 142]. We focus on an adversarial setting where the data are generated in an arbitrary and potentially malicious manner. The closest work is [7], where a generic algorithm called Corral is developed to learn over a set of M base algorithms with extra regret overhead Oe( √ MT) after T rounds. In order to maintain Oe( √ T) overall regret, which is often the optimal bound and the goal when designing bandit algorithms, Corral can thus at most tolerate M = poly(log T) base algorithms. However, there are many applications where M needs to be much larger to cover all possible scenarios of interest (we will soon provide an example where M needs to be of order T). Therefore, a natural question arises: can we corral an even larger band of bandit algorithms, ideally with only logarithmic dependence on M in the regret? As an attempt to answer this question, we focus on the adversarial linear bandit problem and develop a new recipe to combine base algorithms, which reduces the problem to designing good unbiased loss estimators for the base algorithms and good optimistic loss estimators for the meta algorithm. As long as these estimators ensure certain properties, the resulting algorithm enjoys logarithmic dependence on M ¶ Switching regret is also known as tracking regret or shifting regret in many previous works (e.g., [76, 41, 69]). 104 in the regret. We discuss this recipe in detail along with a warm-up example on the classic multi-armed bandit problem in Section 3.2.2. Then, as a main example, in Section 3.2.3, we apply this recipe to develop the first optimal switching regret bound for adversarial linear bandits over a d-dimensional ℓp unit ball with p ∈ (1, 2]. A standard technique to achieve switching regret guarantee in the full-information setting is by combining T base algorithms, each of which starts at a different time step and is guaranteed to perform well against a fixed comparator starting from this step (that is, a standard static regret guarantee); see for example [74, 51, 108]. However, applying the same idea to bandit problems was not possible before because as mentioned, previous methods such as Corral cannot afford T base algorithms.∥ However, this is exactly where our approach shines! 
Indeed, by using our recipe to combine T instances of the algorithm of [33] together with carefully designed loss estimators, we manage to achieve logarithmic dependence on the number of base algorithms, resulting in the optimal (up to logarithmic factors) switching regret Oe( √ dST) for this problem for any fixed S. As another example, in Appendix F.3 we also generalize our results from ℓp balls to smooth and strongly convex sets. Finally, in Section 3.2.4 we further generalize our results to the unconstrained linear bandit problem and obtain the first comparator-adaptive switching regret of order Oe maxk∈[S] ∥˚uk∥2 · √ dST where ˚uk is the k-th (arbitrary) comparator. The algorithm requires two components, one of which is exactly our new algorithm developed for ℓp balls, the other being a new parameter-free algorithm for unconstrained Online Convex Optimization with the first comparator-adaptive switching regret. We note that this latter algorithm/result might be of independent interest. High-level ideas. For such as a meta learning framework, it is standard to decompose the overall regret as meta-regret, which measures the regret of the meta algorithm to the best base algorithm, and ∥One can compromise and corral o(T) base algorithms instead, which leads to suboptimal switching regret; see such an attempt in Appendix G in [110]. 10 base-regret, which measures the best base algorithm to the best elementary action. The main difficulty for bandit problems is that, it is hard to control base-regret in such a framework due to possible starvation of feedback for the base algorithm. The Corral algorithm of Agarwal et al. [7] addresses this via a new meta algorithm based on Online Mirror Descent (OMD) with the log-barrier regularizer and an increasing learning rate schedule, which together provides a negative term in meta-regret large enough to (approximately) cancel base-regret. However, the log-barrier regularizer unavoidably introduces poly(M) dependence in meta-regret. Our ideas to address this issue are two-fold. First, to make sure meta-regret enjoys logarithmic dependence on M, we borrow the idea of the Exp4 algorithm [19], which combines M static experts (instead of learning algorithms) without paying polynomial dependence on M. This is achieved by OMD with the negative entropy regularizer, plus a better loss estimator with lower variance for each expert. In our case, this requires coming up with similar low-variance loss estimator for each base algorithm as well as updating each base algorithm no matter whether it is selected by the meta algorithm or not (in contrast, Corral only updates the selected base algorithm). Without the log-barrier regularizer, however, we now cannot use the same increasing learning rate schedule as Corral to generate a large enough negative term to cancel base-regret. To address this, our second main idea is to inject negative bias to the loss estimators (making term optimistic underestimators), with the goal of generating a reasonably small positive bias in the regret and at the same time a large enough negative bias to cancel base-regret. This idea is similar to that of Foster et al. [59], but they did not push it far enough and only improved Corral on logarithmic factors. Related work. Since the work of Agarwal et al. [7], there have been several follow-ups in the same direction, either for adversarial environments [59] or stochastic environments [124, 48, 15, 93]. 
The problem is also highly related to model selection in online learning with bandit feedback [62, 64, 116]. 106 The optimal regret for adversarial linear bandits over a general d-dimensional set is Oe(d √ T) [50, 32], but it becomes Oe( √ dT) for the special case of ℓp balls with p ∈ [1, 2] [33]. To the best of our knowledge, switching regret has not been studied for adversarial linear bandits, except for its special case of multiarmed bandits [19, 17]. We discuss several natural attempts in Appendix F.1 to extend existing methods to linear bandits, but the best we can get is Oe(d √ ST) via combining the Exp2 algorithm [32] and the idea of uniform mixing [76, 19]. On the other hand, our proposed approach achieves the optimal Oe( √ dST) regret. In fact, our algorithm is also more computationally efficient as Exp2 requires log-concave sampling. We assume a known and fixed S in most places. Achieving the same result for all S simultaneously is known to be impossible for adaptive adversaries [116], and remains open for oblivious adversaries (our setting) even for the classic multi-armed bandit problem, so this is beyond the scope of this work. We mention that, however, without knowing S we can still achieve Oe(S √ dT) regret via a slightly different parameter tuning of our algorithm, or Oe( √ dST +T 3/4 )regret via wrapping our algorithm with the generic Bandits-over-Bandits strategy of Cheung, Simchi-Levi, and Zhu [44]. On the other hand, for the easier piecewise stochastic environments, adapting to unknown S without price has been shown possible for different problems including multi-armed bandits [20], contextual bandits [43], and many more [144]. Regarding our extension to the unconstrained setting, while unconstrained online learning has been extensively studied in the full-information setting with gradient feedback since the work of Mcmahan and Streeter [117] (see e.g., [123, 118, 65, 47, 49]), as far as we know [78] is the only existing work considering the same with bandit feedback. They consider static regret and propose a black-box reduction approach, taking inspiration from a similar reduction from the full-information setting [49]. We consider the more general switching regret, and our algorithm is also built on a similar reduction. 107 3.2.1 Problem Setup and Notations Problem setup. While our idea is applicable to more general setting, for ease of discussions we focus on the adversarial linear bandit problem throughout Section 3.2. Specifically, at the beginning of a T-round game, an adversary (knowing the learner’s algorithm) secretly chooses a sequence of linear loss functions parametrized by ℓ1, . . . , ℓT ∈ R d . Then, at each round t ∈ [T], the learner makes a decision by picking a point (also called action) xt from a known feasible domain Ω ⊆ R d , and subsequently suffers and observes the loss ℓ ⊤ t xt . Note that ℓ ⊤ t xt is the only feedback on ℓt revealed to the learner. We measure the learner’s performance via the switching regret, defined as Reg(u1, . . . , uT ) ≜ X T t=1 ℓ ⊤ t xt − X T t=1 ℓ ⊤ t ut = X S k=1 X t∈Ik ℓ ⊤ t (xt − ˚uk), (3.1) where u1, . . . , uT ∈ Ω is a sequence of arbitrary comparators with S −1 switches for some known S (that is, PT t=2 1{ut−1 ̸= ut} = S −1) and I1, . . . , IS denotes a partition of [T] such that for each k, ut remains the same (denoted by ˚uk) for all t ∈ Ik. Except for comparator-adaptive bounds discussed in Section 3.2.4, our results have no explicit dependence on ˚u1, . . . 
, $\mathring{u}_S$ other than the number $S$, so we often use $\text{Reg}_S$ as a shorthand for $\text{Reg}(u_1, \ldots, u_T)$. The classic static regret is simply $\text{Reg}_1$ (that is, competing with a fixed comparator throughout), which we also simply write as $\text{Reg}$.

Notations. For any integer $n$, we denote by $\Delta_n$ the simplex $\{p \in \mathbb{R}^n_{\ge 0} \mid \sum_{i=1}^{n} p_i = 1\}$. We use $e_i$ to denote the standard basis vector (of appropriate dimension) with the $i$-th coordinate being $1$ and the others being $0$. Given a vector $x \in \mathbb{R}^d$, its $\ell_p$ norm is defined by $\|x\|_p = (\sum_{n=1}^{d}|x_n|^p)^{1/p}$. $\mathbb{E}_t[\cdot]$ denotes the conditional expectation given the history before round $t$. The $\widetilde{O}(\cdot)$ notation omits logarithmic dependence on the time horizon $T$ and the dimension $d$.

Protocol 9 Combining M base algorithms in adversarial linear bandits
for $t = 1, \ldots, T$ do
  Each base algorithm $\mathcal{B}_i$ submits an action $\tilde{a}^{(i)}_t \in \Omega$ to the meta algorithm, for all $i \in [M]$.
  Meta algorithm selects $x_t$ such that $\mathbb{E}_t[x_t] = \sum_{i\in[M]} p_{t,i}\,\tilde{a}^{(i)}_t$ for some distribution $p_t \in \Delta_M$.
  Play $x_t$ and receive feedback $\ell_t^\top x_t$.
  Construct base loss estimator $\hat{\ell}_t \in \mathbb{R}^d$ and meta loss estimator $\hat{c}_t \in \mathbb{R}^M$.
  Base algorithms $\{\mathcal{B}_i\}_{i=1}^{M}$ update themselves based on the base loss estimator $\hat{\ell}_t$.
  Meta algorithm updates the weight $p_{t+1}$ based on $p_t$ and the meta loss estimator $\hat{c}_t$.

3.2.2 Corralling a Larger Band of Bandits: A Recipe

In this section, we describe our general recipe for corralling a large set of bandit algorithms. We start by showing a general and natural protocol for such a meta-base framework in Protocol 9. Specifically, suppose we maintain $M$ base algorithms $\{\mathcal{B}_i\}_{i=1}^{M}$. At the beginning of each round, each base algorithm $\mathcal{B}_i$ submits its own action $\tilde{a}^{(i)}_t \in \Omega$ to the meta algorithm, which then decides the final action $x_t$ with expectation $\sum_{i\in[M]} p_{t,i}\,\tilde{a}^{(i)}_t$ for some distribution $p_t \in \Delta_M$ specifying the importance/quality of each base algorithm. After playing $x_t$ and receiving the feedback $\ell_t^\top x_t$, we construct a base loss estimator $\hat{\ell}_t \in \mathbb{R}^d$ and a meta loss estimator $\hat{c}_t \in \mathbb{R}^M$. As their names suggest, the base loss estimator estimates $\ell_t$ and is used to update each base algorithm, while the meta loss estimator estimates $A_t^\top \ell_t$, where the $i$-th column of $A_t \in \mathbb{R}^{d\times M}$ is $\tilde{a}^{(i)}_t$, and is used to update the meta algorithm to obtain the next distribution $p_{t+1} \in \Delta_M$.

In the following, we formalize the high-level idea discussed in the beginning of Section 3.2. For simplicity, we focus on the static regret $\text{Reg}$ in this discussion (that is, $S = 1$) and let $u$ be the fixed comparator. The first step is to decompose the regret into two parts as mentioned in the beginning of Section 3.2: as long as $\hat{\ell}_t$ and $\hat{c}_t$ are unbiased estimators (that is, $\mathbb{E}_t[\hat{\ell}_t] = \ell_t$ and $\mathbb{E}_t[\hat{c}_t] = A_t^\top \ell_t$), one can show that for all $j \in [M]$,
$$\mathbb{E}[\text{Reg}] = \underbrace{\mathbb{E}\bigg[\sum_{t=1}^{T}\langle p_t - e_j, \hat{c}_t\rangle\bigg]}_{\textsf{meta-regret}} + \underbrace{\mathbb{E}\bigg[\sum_{t=1}^{T}\big\langle \tilde{a}^{(j)}_t - u, \hat{\ell}_t\big\rangle\bigg]}_{\textsf{base-regret}}. \tag{3.2}$$
Controlling base-regret is the key challenge. Indeed, even if the base algorithm enjoys a good regret guarantee when running on its own, it might not ensure the same guarantee any more in this meta-base
As mentioned, the way Corral [7] addresses this issue is by using OMD with the log-barrier regularizer and increasing learning rates as the meta algorithm, which ensures that meta-regret is at most Oe( √ MT) plus some negative term large enough to cancel the prohibitively large part of base-regret. The poly(M) dependence in their approach is unavoidable because they treat the problem that the meta algorithm is facing as a classic multi-armed bandit problem and ignores the fact that information can be shared among different base algorithms. The recent follow-up [59] shares the same issue. Instead, we propose the following idea. We use OMD with entropy regularizer (a.k.a. multiplicative weights update) as the meta algorithm to update pt+1, usually in the form pt+1,i ∝ pt,ie −εbct,i where ε > 0 is some learning rate. This first ensures that the so-called regularization penalty term in meta-regret is of order log M ε instead of M ε as in Corral. To control the other so-called stability term in meta-regret, the estimator bct has to be constructed in a way with low variance, but we defer the discussion and first look at how to control base-regretin this case. Since we are no longer using the log-barrier regularizer of Corral, a different way to generate a large negative term in meta-regret to cancel base-regret is needed. To this end, we propose to inject a (negative) bias bt ∈ RM + to the meta loss estimator bct , making it an optimistic underestimator. More specifically, introduce another notation ct for some unbiased estimator of A⊤ t ℓt . Then the adjusted meta loss estimator is defined as bct = ct − bt . Since bct is biased now, the decomposition (3.2) needs to be updated accordingly as E[Reg] = E "X T t=1 ⟨pt − ej , bct⟩ # | {z } meta-regret + E "X T t=1 D ea (j) t − u, ℓbt E # | {z } base-regret + E "X T t=1 ⟨pt , bt⟩ # | {z } Pos-Bias − E "X T t=1 ⟨ej , bt⟩ # | {z } Neg-Bias . 110 Based on this decomposition, our goal boils down to designing good base and meta loss estimators such that the following three terms are all well controlled: base-regret − Neg-Bias ≤ Target, (3.3) Pos-Bias ≤ Target, (3.4) meta-regret ≤ Target. (3.5) Here, Target represents the final targeted regret bound with logarithmic dependence on M and usually √ T-dependence on T (such as Oe( √ dST log M) for our main application of switching regret discussed in Section 3.2.3).∗∗ A recipe. We are now ready to summarize our recipe in the following three steps. • Step 1. Start from designing ℓbt , which often follows similar ideas of the original base algorithm. • Step 2. Then, by analyzing base-regretwith such a base loss estimator, figure out what bt needs to be in order to ensure Eq. (3.3) and Eq. (3.4) simultaneously. • Step 3. Finally, design ct to ensure Eq. (3.5). As mentioned in the beginning of Section 3.2, this is a problem similar to combining static experts as in the Exp4 algorithm [19], and the key is to ensure that ct allows information sharing among base algorithms and enjoys low variance. A natural choice is ct,i = ⟨ea (i) t , ℓbt⟩, which is exactly what Exp4 does and works in the toy example we show below, but sometimes one needs to replace ℓbt with yet another better unbiased estimator of ℓt , which turns out to be indeed the case for our main example in Section 3.2.3. 
∗∗We present our recipe with a targeted regret bound just because this is indeed the case for our main application (of getting switching regret bounds), but this is not necessary for our approach, and we can for example also achieve some of the applications discussed in [7] where the meta algorithm achieves different regret bounds (that is, not a single target) in different environments. 111 A toy example. Now, we provide a warm-up example to show how to successfully apply our three-step recipe to the multi-armed bandit problem. We note that this example does not really lead to meaningful applications, as in the end we are simply combining different copies of the exact same algorithm. Nevertheless, this serves as a simple and illustrating exercise to execute our recipe, paving the way for the more complicated scenario to be discussed in the next section. Specifically, in multi-armed bandit, we have Ω = ∆d and ℓt ∈ [0, 1]d for all t ∈ [T], and we set the target to be Target = Oe( √ dT log M) (optimal up to logarithmic factors). The meta algorithm is as specified before (multiplicative weights update). For the base algorithm, we choose a slight variant of the classic Exp3 algorithm [19], so that ea (i) t+1 = argmina∈∆d∩[η,1]d ⟨a, ℓbt⟩ + 1 η Dψ(a, a (i) t ) , where η > 0 is a clipping threshold (and also a learning rate) and ψ(a) = Pd n=1 an log an is the negative entropy. Given qt = PM i=1 pt,iea (i) t ∈ ∆d, the meta algorithm naturally samples an arm nt ∈ [d] according to qt , meaning xt = ent . Step 1. With the feedback ℓ ⊤ t xt = ℓt,nt , following Exp3 we let the base loss estimator be the standard importance-weighted estimator: ℓbt = ℓt,nt qt,nt xt , which is clearly unbiased with Et [ℓbt ] = ℓt . Step 2. By standard analysis (e.g.,Theorem 3.1 in [31]), base-regret is at most ηdT + log d η + ηE "X T t=1 X d n=1 ea (j) t,nℓb2 t,n# . Since Et [ℓb2 t,n] = ℓ 2 t,n qt,n , the last term is further bounded by ηE PT t=1 Pd n=1 ea (j) t,n/qt,n . This is exactly the problematic stability term that can be prohibitively large. We thus directly define the bias term bt,j as 112 η Pd n=1 ea (j) t,n/qt,n, so that base-regret − Neg-Bias is simply bounded by ηdT + log d η . Picking the optimal η ensures Eq. (3.3). On the other hand, Pos-Bias happens to be small as well: Pos-Bias = E "X T t=1 ⟨pt , bt⟩ # = ηE "X T t=1 X M i=1 pt,iX d n=1 ea (i) t,n/qt,n# = ηE "X T t=1 X d n=1 qt,n/qt,n# = ηdT, ensuring Eq. (3.4). Step 3. Finally, we use the natural meta loss estimator ct,i = ⟨ea (i) t , ℓbt⟩. Since qt,n ≥ η due to the clipping threshold and thus 0 ≤ bt,i ≤ 1 and bct,i ≥ −1 (that is, not too negative), standard analysis shows meta-regret ≤ log M ε + εE "X T t=1 X M i=1 pt,ibc 2 t,i# , with the last term further bounded by 2εE PT t=1 PM i=1(pt,ic 2 t,i + pt,ib 2 t,i) ≤ 4εdT. Picking the optimal ε in the final bound leads to meta-regret ≤ log M ε + 4εdT, ensuring Eq. (3.5). This concludes our example and shows that our recipe indeed enjoys logarithmic dependence on M in this case, which Corral fails to achieve. At a high level, our method avoids the poly(M) dependence by information sharing among different base algorithms: we update every base algorithm no matter whether it is selected by the meta algorithm or not, while in contrast, Corral only updates the selected base algorithm, which makes the poly(M) dependence unavoidable. 
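The following sketch puts the three steps of this toy example together in code. It is an illustration only: in particular, the base update below replaces the exact clipped-OMD step over the clipped simplex with a simple multiplicative-weights step followed by clipping and renormalization, and all array names are ours.

import numpy as np

def toy_corral_round(A, p, ell, eta, eps, rng):
    # One round of corralling M clipped-Exp3 copies on a d-armed bandit (sketch).
    # A: (M, d) array whose row i is the base action a_t^{(i)} in the simplex;
    # p: (M,) meta distribution; ell: (d,) true loss vector (only the played entry is used);
    # eta: base learning rate / clipping threshold; eps: meta learning rate.
    q = p @ A                                       # induced arm distribution q_t
    n = rng.choice(len(q), p=q)                     # play arm n_t ~ q_t
    l_hat = np.zeros_like(q)
    l_hat[n] = ell[n] / q[n]                        # importance-weighted estimator of l_t (Step 1)
    b = eta * (A / q).sum(axis=1)                   # bias b_{t,i} = eta * sum_n a_{t,n}^{(i)} / q_{t,n} (Step 2)
    c = A @ l_hat                                   # Exp4-style estimator c_{t,i} = <a_t^{(i)}, l_hat_t> (Step 3)
    c_hat = c - b
    A = A * np.exp(-eta * l_hat)                    # every base updates on the same l_hat
    A /= A.sum(axis=1, keepdims=True)
    A = np.clip(A, eta, None)                       # approximate projection onto the clipped simplex
    A /= A.sum(axis=1, keepdims=True)
    w = p * np.exp(-eps * (c_hat - c_hat.min()))    # biased multiplicative weights for the meta algorithm
    return A, w / w.sum()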
113 3.2.3 Optimal Switching Regret for Linear Bandits over ℓp Balls As the main application in this work, we now discuss how to apply our recipe to achieve the optimal switching regret for adversarial linear bandits over ℓp balls. In this problem, the feasible domain is an ℓp unit ball for some p ∈ (1, 2], namely, Ω = {x ∈ R d | ∥x∥p ≤ 1}, and each ℓt is assumed to be from the dual ℓq unit ball with q = p/(p − 1), such that |ℓ ⊤ t x| ≤ 1 for all x ∈ Ω and t ∈ [T]. Bubeck, Cohen, and Li [33] show that the optimal regret in this case is Θ(√ dT), which is better than the general linear bandit problem by a factor of √ d. This implies that the optimal switching regret for this problem is Ω(√ dST) — indeed, simply consider the case where I1, . . . , IS is an even partition of [T] and the adversary forces the learner to suffer Ω(p d|Ik|) = Ω(p dT /S) regret on each interval Ik by generating a new worst case instance regarding the static regret. Therefore, our target regret bound here is set to Target = Oe( √ dST). We remind the reader that this problem has not been studied before and that in Appendix F.1, we discuss other potential approaches and why none of them is able to achieve this goal. The pseudocode of our final algorithm is shown in Algorithm 10. At a high-level, it is simply following the standard idea in the literature on obtaining switching regret, that is, maintain a set of M = T base algorithms with static regret guarantees, the t-th of which Bt starts learning from time step t (before time t, one pretends that Bt picks the same action as the meta algorithm). If the meta algorithm itself enjoys a switching regret guarantee,†† then by competing with Bjk on interval Ik where jk is the first time step of Ik so that Bjk enjoys a (static) regret guarantee on Ik, the overall algorithm enjoys a switching regret for the original problem. While this is a standard and simple idea, applying it to the bandit setting was not possible before our work due to the large number of base algorithms (T) needed to be combined. Our approach, however, is able to overcome this with logarithmic dependence on M, making it the first successful execution of this long-standing idea in bandit problems. ††We point out that in the full-information setting, even a certain static regret guarantee from the meta algorithm is enough, but a switching regret guarantee is needed in the bandit setting for technical reasons. 114 Base algorithm overview. Our base algorithm is naturally the one proposed in [33] that achieves Oe( √ dT) static regret.‡‡ Specifically, let Ω ′ = {x | ∥x∥p ≤ 1 − γ} for some clipping parameter γ be a slightly smaller ball. At each round t, each base algorithm Bi (for i ≤ t) has a vector a (i) t ∈ Ω ′ at hand. Then, it generates a Bernoulli random variable ξ (i) t with mean ∥a (i) t ∥p. If ξ (i) t = 0, then its final decision ea (i) t is uniformly sampled from {±en} d n=1; otherwise, ea (i) t = a (i) t /∥a (i) t ∥p. Next, Bi submits (ea (i) t , a (i) t , ξ(i) t ) to the meta algorithm. After receiving the base loss estimator ℓbt (to be specified later), Bi updates a (i) t+1 using OMD with the regularizer R(a) = − ln(1 − ∥a∥ p p), that is, a (i) t+1 = argmina∈Ω′ ⟨a, ℓbt⟩ + 1 η DR(a, a (i) t ) for some learning rate η > 0. We defer the pseudocode Algorithm 22 to Appendix F.2.1. Meta algorithm overview. 
The meta algorithm maintains the distribution pt ∈ ∆T again via multiplicative weights update, but since a switching regret guarantee is required as mentioned, a slight variant studied in [19] is needed which mixes the multiplicative weights update with a uniform distribution: pt+1,i = (1 − µ) pt,i exp(−εbct,i) PT j=1 pt,j exp(−εbct,j ) + µ T for some mixing rate µ, learning rate ε, and meta loss estimator bct (to be specified later). As mentioned, at time t, all base algorithm Bi with i > t should be thought of as making the same decision as the meta algorithm, so in a sense we are looking for an action xet such that xet = Pt i=1 pt,iea (i) t + PT i=t+1 pt,ixet , or equivalently xet = Pt i=1 pbt,iea (i) t with a distribution pbt ∈ ∆t satisfying pbt,i ∝ pt,i. Combining this with some extra exploration for technical reasons, the final decision xt of our algorithm is decided as follows: sample a Bernoulli random variable ρt with mean β (a small parameter); if ρt = 1, then xt is uniformly sampled from {±en} d n=1, otherwise xt is sampled from ea (1) t , . . . , ea (t) t according to the distribution pbt . See Line 3, Line 4, and Line 9. We are now ready to follow the three steps of our recipe to design the loss estimators. ‡‡To be more accurate, the version we present here is a slightly simpler variant with the same guarantee. 115 Algorithm 10 Algorithm for adversarial linear bandits over ℓp balls with optimal switching regret Input: clipping parameter γ, base learning rate η, meta learning rate ε, mixing rate µ, exploration parameter β, bias coefficient λ, initial uniform distribution p1 ∈ ∆T . for t = 1, . . . , T do 1 Start a new base algorithm Bt , which is an instance of Algorithm 22 with learning rate η, clipping parameter γ, and initial round t. 2 Receive local decision (ea (i) t , a (i) t , ξ(i) t ) from base algorithm Bi for each i ≤ t. 3 Compute the renormalized distribution pbt ∈ ∆t such that pbt,i ∝ pt,i for i ∈ [t]. 4 Sample a Bernoulli random variable ρt with mean β. If ρt = 1, uniformly sample xt from {±en} d n=1; otherwise, sample it ∈ [t] according to pbt , and set xt = ea (it) t and ξt = ξ (it) t . 5 Make the final decision xt and receive feedback ℓ ⊤ t xt . 6 Construct the base loss estimator ℓbt ∈ R d as follows and send it to all base algorithms {Bi} t i=1: ℓbt = 1{ρt = 0}1{ξt = 0} 1 − β · d(ℓ ⊤ t xt) 1 − Pt i=1 pbt,i∥a (i) t ∥p · xt . (3.6) 7 Construct another loss estimator ¯ℓt ∈ R d as ¯ℓt = Mf−1 t xtx ⊤ t ℓt , (3.7) where Mft = β d Pd n=1 ene ⊤ n + (1 − β) Pt i=1 pbt,iea (i) t ea (i)⊤ t . 8 Construct the meta loss estimator bct ∈ R T as: bct,i = ( ⟨ea (i) t , ¯ℓt⟩ − bt,i, i ≤ t, Pt j=1 pbt,jbct,j , i > t, where bt,i = 1 λT(1 − β) 1 − ∥a (i) t ∥p 1 − Pt j=1 pbt,j∥a (j) t ∥p . (3.8) 9 Meta algorithm updates the weight pt+1 ∈ ∆T according to pt+1,i = (1 − µ) pt,i exp(−εbct,i) PT j=1 pt,j exp(−εbct,j ) + µ T , ∀i ∈ [T]. (3.9) Step 1. The design of the base loss estimator ℓbt mostly follows [33], except for the extra consideration due to the sampling scheme of the meta algorithm (Line 4). The final form is shown in Eq. (3.6), and a direct calculation verifies its unbiasedness Et [ℓbt ] = ℓt (see Lemma F.2.1). Step 2. With ℓbt fixed, for an interval Ik, we analyze the static regret of Bjk on this interval (recall that jk is the first time step of Ik), mostly following the analysis of Bubeck, Cohen, and Li [33]. This corresponds to 116 base-regret (since we have moved from static regret to switching regret). 
More concretely, in Lemma F.2.3 we show for some universal constant C1 > 0: E " X t∈Ik D ea (jk) t − ˚uk, ℓbt E # ≤ log(1/γ) η + ηC1 X t∈Ik 1 − ∥a (jk) t ∥p 1 − Pt j=1 pbt,j∥a (j) t ∥p . Again, the second term above is the prohibitively large term, and we thus define bt,i in the same form; see Eq. (3.8). As long as the parameters are chosen such that ηC1 ≤ 1 λT(1−β) , base-regret − Neg-Bias is simply bounded by log(1/γ) η , and Eq. (3.3) can be ensured. Direct calculation shows that with such a bias term bt,i, Pos-Bias is also small enough to ensure Eq. (3.4); see Appendix F.2.4. Step 3. Finally, it remains to design unbiased loss estimator ct,i and finalize the meta loss estimator bct,i. As mentioned, a natural choice would be ct,i = ⟨ea (i) t , ℓbt⟩. However, despite its unbiasedness, it turns out to suffer a large variance in this case and cannot lead to a favorable guarantee for meta-regret. To address this issue, we introduce yet another unbiased loss estimator ¯ℓt for ℓt , defined in Eq. (3.7), which follows standard idea from the general linear bandit literature (see for example the Exp2 algorithm of Bubeck, Cesa-Bianchi, and Kakade [32]). With that, ct,i is defined as ⟨ea (i) t , ¯ℓt⟩ instead, which now has a small enough variance. We find it intriguing that using different unbiased loss estimators (ℓbt for base algorithms and ¯ℓt for the meta algorithm) for the same quantity ℓt appears to be necessary for this problem. As the final piece of the puzzle, we set bct,i = ct,i − bt,i for i ≤ t as our recipe describes, and for i > t, recall that these base algorithms are thought of as making the same prediction of the meta algorithm, thus we set bct,i = Pt j=1 pbt,jbct,j ; see Eq. (3.8). This ensures an important property ⟨pt , bct⟩ = P i≤t pbt,ibct,i, which we use to finally prove that meta-regret is small enough to ensure Eq. (3.5) (see Lemma F.2.6). This concludes the description of our entire algorithm. Algorithm 10 summarizes the main update procedures. We briefly discuss the computational and space complexity. Note that each base algorithm performs a simple OMD update with a barrier regularizer, and it suffices to obtain an approximate solution 117 with 1 T precision, which takes O(poly(d)) space and O(poly(d log T)) time via for example the Interior Point Method. On the other hand, the computational/space complexity of the meta algorithm is clearly O(T poly(d)) per round. We formally prove in Appendix F.2 that our algorithm enjoys the following switching regret. Theorem 3.2.1. With parameters γ = 4C q dS T , η = C q S dT , ε = min q S dT , 1 16d , C2 2 , µ = 1 T , β = 8dε, and λ = √ C dST , Algorithm 10 guarantees E[RegS] = E "X T t=1 ℓ ⊤ t xt − X T t=1 ℓ ⊤ t ut # = Oe √ dST , where C = √ p − 1·2 − 2 p−1 , and u1, . . . , uT ∈ Ω are arbitrary comparators such that PT t=2 1{ut−1 ̸= ut} ≤ S − 1. We point out again that this is the first optimal switching regret guarantee for linear bandits over ℓp balls with p ∈ (1, 2], demonstrating the importance of our new corralling method. Extensions to smooth and strongly convex domain. Our ideas and results can be generalized to adversarial linear bandits over any smooth and strongly convex set, a setting studied in [89]. Specifically, for a smooth and strongly convex set containing the ℓp unit ball and contained by the dual ℓq unit ball (for some p ∈ (1, 2]), our algorithm achieves Oe d 1/p √ ST switching regret. We defer all details to Appendix F.3. 
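Before moving on, the following sketch gathers the estimator constructions of Algorithm 10 (Eq. (3.6)-(3.8)) in one place; the function signature and variable names are illustrative, and the surrounding sampling logic (Lines 3-4 of Algorithm 10) is omitted.

import numpy as np

def algorithm10_estimators(x, loss, A_tilde, A, p_hat, xi, rho, beta, lam, T, p):
    # x: (d,) played action x_t; loss: scalar feedback <l_t, x_t>;
    # A_tilde, A: (t, d) arrays of the submitted actions a~_t^{(i)} and pre-sampling actions a_t^{(i)};
    # p_hat: (t,) renormalized meta distribution; xi, rho: the Bernoulli variables of the
    # selected base and of the extra exploration; beta, lam, T, p: exploration rate,
    # bias coefficient, horizon, and the exponent of the l_p ball.
    d = x.shape[0]
    norms = np.linalg.norm(A, ord=p, axis=1)               # ||a_t^{(i)}||_p
    denom = 1.0 - p_hat @ norms
    # base loss estimator (Eq. (3.6)), nonzero only when rho_t = 0 and xi_t = 0
    l_hat = ((rho == 0) * (xi == 0) / (1 - beta)) * (d * loss / denom) * x
    # second unbiased estimator l_bar_t (Eq. (3.7)), used only by the meta algorithm
    M = (beta / d) * np.eye(d) + (1 - beta) * (A_tilde * p_hat[:, None]).T @ A_tilde
    l_bar = np.linalg.solve(M, x) * loss                   # M_t^{-1} x_t (x_t^T l_t)
    # meta loss estimator with the negative bias (Eq. (3.8))
    b = (1.0 - norms) / (lam * T * (1 - beta) * denom)
    c_hat_active = A_tilde @ l_bar - b                     # entries i <= t
    c_hat_future = p_hat @ c_hat_active                    # shared entry for all i > t
    return l_hat, c_hat_active, c_hat_future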
3.2.4 Extension to Unconstrained Linear Bandits In this section, we further extend our results on linear bandits to the unconstrained setting, that is, Ω = R d , which means both the learner’s decisions {xt} T t=1 and the comparators {ut} T t=1 can be chosen arbitrarily in R d . The loss vectors are assumed to have bounded ℓ2 norm: ∥ℓt∥2 ≤ 1 for all t ∈ [T]. As mentioned, [78] 11 is the only existing work considering the same setting. They study static regret and achieve a comparatoradaptive bound E[Reg] = Oe(∥u∥2 √ dT) simultaneously for all u (the fixed comparator).§§ Building on our results in Section 3.2.3, we generalize their static regret bound to switching regret and achieve a similar comparator-adaptive bound E[Reg(u1, . . . , uT )] = Oe maxk∈[S] ∥˚uk∥2 · √ dST simultaneously for all u1, . . . , uT with S − 1 switches. Instead of using our recipe and starting from scratch to solve this problem, we directly make use of the reduction of Hoeven, Cutkosky, and Luo [78] that reduces the unconstrained problem to the constrained counterpart (already solved by our Algorithm 10) plus another one-dimensional unconstrained problem; see Section 3.2.4.1. To solve the latter problem, in Section 3.2.4.2 we design a new unconstrained algorithm for general Online Convex Optimization (OCO) that enjoys a comparator-adaptive switching regret guarantee and might be of independent interest. Finally, we summarize the overall algorithm and provide the formal guarantees in Section 3.2.4.3. 3.2.4.1 Black-Box Reduction for Switching Regret of Unconstrained Linear Bandits The reduction of Hoeven, Cutkosky, and Luo [78] takes heavy inspiration from [49]. Specifically, suppose that we have two subroutines denoted by AZ and AV: AZ is a constrained linear bandit algorithm over the ℓ2 ball Z = {z ∈ R d | ∥z∥2 ≤ 1} and AV is an unconstrained and one-dimensional online linear optimization algorithm with full-information feedback (in fact, in the one-dimensional linear case, there is no difference between full-information and bandit feedback). Then, one can solve an unconstrained linear bandit problem as follows: at each round t ∈ [T], the learner makes the decision xt = vt · zt , where zt ∈ Z is the direction returned by the constrained bandit algorithm AZ, and vt ∈ R is the scalar returned by the one-dimensional algorithm AV. After observing the loss ℓ ⊤ t xt , the learner then feeds ℓ ⊤ t zt = ℓ⊤ t xt vt to both AZ and AV so they can update themselves. See Algorithm 11 for the pseudocode. §§The actual bound stated in [78] is actually Oe(∥u∥2d √ T), but it is straightforward to see that it can be improved to Oe(∥u∥2 √ dT) by picking the optimal linear bandit algorithm over ℓ2 balls in their reduction. 11 Algorithm 11 Comparator-adaptive algorithm for unconstrained linear bandits Input: subroutine AV (unconstrained OCO algorithm), subroutine AZ (constrained linear bandits algorithm), Z = {z | ∥z∥2 ≤ 1}. for t = 1 to T do Receive the direction zt ∈ Z from subroutine AZ. Receive the magnitude vt ∈ R from subroutine AV. Submit xt = zt · vt and receive and observe the loss ℓ ⊤ t xt . Send ℓ ⊤ t zt = ℓ ⊤ t xt/vt as the feedback for subroutine AZ. Construct linear function ft(v) ≜ v · ℓ ⊤ t zt as the feedback for subroutine AV. Hoeven, Cutkosky, and Luo [78] show that the static regret of such a reduction can be expressed using the regret of the two subroutines. This can be directly generalized to switching regret, formally described below (see Appendix F.4.1 for the proof). Lemma 3.2.1. 
For an interval I ⊆ [T], let RegV I (v) = P t∈I(vt−v)⟨zt , ℓt⟩ be the regret of the unconstrained one-dimensional algorithm AV against a comparator v ∈ R on this interval, and similarly RegZ I (z) = P t∈I⟨zt −z, ℓt⟩ be the regret of the constrained linear bandits algorithm AZ against a comparator z ∈ Z = {z ∈ R d | ∥z∥2 ≤ 1} on this interval. Then Algorithm 11 (with decision xt = zt · vt) satisfies Reg(u1, . . . , uT ) = X S k=1 RegV Ik (∥˚uk∥2) +X S k=1 ∥˚uk∥2 · RegZ Ik ˚uk ∥˚uk∥2 , (3.10) where we recall that I1, . . . , IS denotes a partition of [T] such that for each k, ut remains the same (denoted by ˚uk) for all t ∈ Ik. One can see that the first term in Eq. (3.10) is clearly the switching regret of AV, while the second term, after upper bounded by maxk∈[S]∥˚uk∥2 PS k=1 RegZ Ik ˚uk ∥˚uk∥2 , is the switching regret of AZ scaled by the maximum comparator norm. Therefore, to control the second term, we simply apply our Algorithm 10 as the subroutine AZ, making it at most Oe maxk∈[S] ∥˚uk∥2 · √ dST . On the other hand, to the best of our knowledge, there are no existing unconstrained algorithms with switching regret guarantees. To this end, we design one such algorithm in the next section. In fact, for full generality, we do so for the more 12 general unconstrained OCO problem of arbitrary dimension without the knowledge of S, which might be of independent interest. 3.2.4.2 Subroutine: Switching Regret of Unconstrained Online Convex Optimization As a slight detour, in this section we consider a general unconstrained OCO problem: at round t ∈ [T], the learner makes a decision vt ∈ R d and simultaneously the adversary chooses a loss function ft : R d 7→ R, then the algorithm suffers loss ft(vt) and observes the gradient ∇ft(vt) as feedback. Notably, the feasible domain is R d (that is, no constraints). The goal of the learner is to minimize the switching regret Reg(u1, . . . , uT ) = X T t=1 ft(vt) − X T t=1 ft(ut) = X S k=1 X t∈Ik ft(vt) − ft(˚uk) , (3.11) where the notations I1, . . . , IS and ˚u1, . . . ,˚uS ∈ R d are defined similarly as in Section 4.1. Without loss of generality, it is assumed that maxx ∥∇ft(x)∥2 ≤ 1 for all t. Note that this setup is a strict generalization of what we need for the one-dimensional subroutine AV discussed in Section 3.2.4.1. Our idea is once again via a meta-base framework, which is in fact easier than our earlier discussions because now we have gradient feedback. There are two quantities that we aim to adapt to: the number of switches S and the comparator norm ∥˚uk∥2 (although the latter can be unbounded, it suffices to consider a maximum norm of 2 T as Appendix D.5 in [42] shows). Therefore, we create an exponential grid for these two quantities, and maintain one base algorithm for each possible configuration. These base algorithms only need to satisfy some mild conditions specified in Requirement 1 of Appendix F.4.2, and many existing algorithms such as [51, 86, 146, 46] indeed meet the requirements. The design of the meta algorithm requires some care to ensure the desirable adaptive guarantees, and we achieve so by building upon the recent progress in the classic expert problem [42]. In short, our meta algorithm is OMD with a multi-scale entropy regularizer and certain important correction terms. We defer 121 Algorithm 12 Comparator-adaptive algorithm for unconstrained OCO Input: base algorithm B. Define: H = ⌈log2 T⌉ + T + 1 and R = ⌈log2 T⌉. Define: clipped domain Ω = {w | w ∈ ∆N and wt,(i,r) ≥ 1 T2·2 2i , ∀i ∈ [H], r ∈ [R]}. 
Define: weighted entropy regularizer ψ(w) ≜ P (i,r)∈[H]×[R] ci ηr w(i,r) ln w(i,r) with ci = T −1 · 2 i−1 for i ∈ [H] and ηr = 1 32·2 r for r ∈ [R]. Initialization: for (i, r) ∈ [H] × [R], initiate base algorithm Bi,r ← B(Ωi) with Ωi = {x | ∥x∥2 ≤ Di}, which is an instance of B, and prior distribution w1,(i,r) ∝ η 2 r/c2 i . for t = 1 to T do Each base learner B(i,r) returns the local decision vt,(i,r) for each i ∈ [H] and r ∈ [R]. Make the final decision vt = P (i,r)∈[H]×[R] wt,(i,r)vt,(i,r) and receive feedback ∇ft(vt). Construct feedback loss ℓt ∈ R N and correction term at ∈ R N for meta algorithm : ℓt,j ≜ ⟨∇ft(vt), vt,j ⟩, at,j ≜ 32 ηr ci ℓ 2 t,j , ∀j = (i, r) ∈ [H] × [R]. Meta algorithm updates the weight by wt+1 = argminw∈Ω ⟨w, ℓt + at⟩ + Dψ(w, wt). the details to Appendix F.4.2 and only present the pseudocode of the full algorithm in Algorithm 12. Below we present the main comparator-adaptive switching regret guarantee of this algorithm. Theorem 3.2.2. Algorithm 12 with a base algorithm satisfying Requirement 1 guarantees that for any S, any partition I1, . . . , IS of [T], and any comparator sequence ˚u1, . . . ,˚uS ∈ R d , we have X S k=1 X t∈Ik ft(vt) − X t∈Ik ft(˚uk) ≤ Oe X S k=1 ∥˚uk∥2 p |Ik| ! ≤ Oe max k∈[S] ∥˚uk∥2 · √ ST . We emphasize again that in contrast to our other results on bandit problems, the guarantee above is achieved for all S simultaneously (in other words, the algorithm does not need the knowledge of S). It also adapts to the norm of the comparator ∥˚uk∥2 on each interval Ik, instead of only the maximum norm maxk∈[S]∥˚uk∥2. As another remark, if the base algorithms further guarantee a data-dependent regret (this is satisfied by for example the algorithm of Zhang, Liu, and Zhou [146]), the switching regret can be further improved to Oe PS k=1∥˚uk∥2 qP t∈Ik ∥∇ft(vt)∥ 2 2 ≤ Oe maxk∈[S]∥˚uk∥2· q S PT t=1∥∇ft(vt)∥ 2 2 , replacing the dependence on T by the cumulative gradient norm square. This result holds even if the algorithm is required to make decisions from a bounded domain, thus improving the Oe Dmaxq S PT t=1∥∇ft(vt)∥ 2 2 122 result of prior works [46, 150, 149] where Dmax is the diameter of the domain. See Appendix F.4.4 for details. 3.2.4.3 Summary: Comparator-Adaptive Switching Regret for Unconstrained Linear Bandits Combining all previous discussions, we now present the final result on unconstrained linear bandits. Theorem 3.2.3. Using Algorithm 10 (with p = 2) as the subroutine AZ and Algorithm 12 as the subroutine AV in the black-box reduction Algorithm 11, the overall algorithm enjoys the following comparator-adaptive switching regret against any partition I1, . . . , IS of [T] and any corresponding comparators˚u1, . . . ,˚uS ∈ R d : E[RegS] ≤ Oe X S k=1 ∥˚uk∥2 r dT S + r dS T |Ik| !! ≤ Oe max k∈[S] ∥˚uk∥2 · √ dST . The proof can be found in Appendix F.4.5. Again, this is the first switching regret for unconstrained linear bandits, and it strictly generalizes the static regret results of Hoeven, Cutkosky, and Luo [78]. Although we are not directly using our new corralling recipe to achieve this result, it clearly serves as an indispensable component for this result due to the usage of Algorithm 10. 3.2.5 Conclusion and Discussions In Section 3.2, we propose a new mechanism for combining a collection of bandit algorithms with regret overhead only logarithmically depending on the number of base algorithms. As a case study, we provide a set of new results on switching regret for adversarial linear bandits using this recipe. 
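To make the bookkeeping of Algorithm 12 concrete, here is a short sketch of its exponential grid and of the per-round feedback construction; the base learners themselves and the constrained OMD step over the clipped domain are abstracted away, and grad_fn stands for the gradient oracle returning the observed gradient, so everything here is illustrative rather than a full implementation.

import numpy as np

def make_grid(T):
    # Exponential grid of Algorithm 12: index i ranges over comparator-norm scales,
    # index r over learning rates, with c_i = 2^{i-1}/T and eta_r = 1/(32 * 2^r).
    H = int(np.ceil(np.log2(T))) + T + 1
    R = int(np.ceil(np.log2(T)))
    c = np.array([2.0 ** (i - 1) / T for i in range(1, H + 1)])
    eta = np.array([1.0 / (32.0 * 2 ** r) for r in range(1, R + 1)])
    return c, eta

def meta_feedback(grad_fn, V, w, c, eta):
    # V: (H*R, d) decisions of the base learners, with the learning-rate index r varying
    # fastest within each norm index i; w: meta weights in the clipped domain Omega.
    v = w @ V                                        # combined decision v_t = sum_j w_{t,j} v_{t,j}
    grad = grad_fn(v)                                # gradient feedback at v_t
    ell = V @ grad                                   # surrogate losses l_{t,j} = <grad, v_{t,j}>
    ratio = (eta[None, :] / c[:, None]).reshape(-1)  # eta_r / c_i for each pair j = (i, r)
    a = 32.0 * ratio * ell ** 2                      # correction terms a_{t,j}
    return v, ell, a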
One future direction is to extend our switching regret results to linear bandits with general domains or even to general convex bandits, which appears to require additional new ideas to execute our recipe. Another interesting direction is to find more applications for our corralling mechanism beyond obtaining switching regret, as we know that logarithmic dependence on the number of base algorithms is possible. 123 Chapter 4 Adaptive Bandit Convex Optimization with Heterogeneous Curvature In this chapter, we consider the problem of adversarial bandit convex optimization, formulated as the following sequential learning process of T rounds. At the beginning, knowing the learner’s algorithm, an adversary decides an arbitrary sequence of T convex loss functions f1, . . . , fT : Ω 7→ R over some convex domain Ω ⊂ R d . Then, at each round t, the learner is required to select a point xt ∈ Ω, and afterwards observes and suffers her loss ft(xt). The performance of the learner is measured by her regret, the difference between the her total suffered loss and that of the best fixed point in hindsight. This subsumes the problem of linear bandits considered in Chapter 3. Without further assumptions, Bubeck, Eldan, and Lee [36] achieve Oe(d 10.5 √ T) regret with large computational complexity of Oe(poly(d)T) per round. Another work by Lattimore [96] improves the regret to Oe(d 2.5 √ T) but with even larger computational complexity that is exponential in d and T per round. On the other hand, the current best lower bound is Ω(d √ T) [50], exhibiting a large gap in the d dependency. It has been shown that, however, curvature of the loss functions helps — for example, when the functions are all smooth and strongly convex, Hazan and Levy [72] develop a simple and efficient Followthe-Regularized-Leader (FTRL) type algorithm with Oe(d 3/2 √ T) regret; even when the functions are only smooth, Saha and Tewari [132] show that Oe(dT 2/3 ) regret is achievable again via a simple and efficient FTRL variant, despite the suboptimal dependency in T. 124 Table 4.1: A summary of our results for bandit convex optimization over T smooth d-dimensional functions, the t-th of which is σt-strongly convex. T ⊂ [T] is a subset of rounds with no strong convexity. The dependency on parameters other than d and T can be found in the respective corollary (see also Footnote ∗). Note that our results are all achieved by one single adaptive algorithm. Strong Convexity {σt} T t=1 Previous Works Our Results (Algorithm 13) σt = 0, ∀t ∈ [T] Oe(dT 2/3 ) [132] Oe(d 2/3T 2/3 ) (Corollary 4.2.1) σt = σ > 0, ∀t ∈ [T] Oe(d 3/2 √ T) [72] Oe(d 3/2 √ T) (Corollary 4.2.2) σt = σ1{t /∈ T }, |T | = T 3/4 or T = [T/2, T] N/A Oe(d 3/2 √ T) (Corollary 4.2.3) σt = t −α, ∀t ∈ [T] N/A ( Oe(d 3/2T (1+α)/2 ), α ∈ [0, 1/3) Oe(d 2/3T 2/3 ), α ∈ [1/3, 1] (Corollary 4.2.4) However, all such existing results making use of curvature assume a homogeneous setting, that is, all loss functions share the same curvature parameters that are known ahead of time. Ignoring the ubiquitous heterogeneity in online data is either unrealistic or forcing one to use a conservative curvature parameter (e.g., the smallest strong convexity parameter among all functions), while intuitively, being able to exploit and adapt to the individual curvature of each loss function should result in much better performance. 
Motivated by this fact and inspired by the work of Bartlett, Hazan, and Rakhlin [25] who consider a heterogeneous setting in Online Convex Optimization with stronger gradient feedback, we study a similar setting where each ft has its own strong convexity σt ≥ 0, revealed only at the end of round t after the learner decides xt . We provide examples in Section 4.1 to illustrate why this is a realistic setup even in the bandit setting where traditionally only ft(xt) is revealed. In this setting, we develop efficient algorithms that automatically adapt to the heterogeneous curvature and enjoy strong adaptive regret bounds. These bounds not only recover or even improve the existing results in the homogeneous setting, but also reveal interesting new findings in some hybrid scenarios. More specifically, our results are as follows (for simplicity, only the dependency on d and T is shown; see respective sections for the complete bounds). • We start with the case where all loss functions are smooth in Section 4.2. Our algorithm achieves Oe(d 2/3T 2/3 ) regret if no functions are strongly convex (that is, σt = 0 for all t), improving the Oe(dT 2/3 ) 125 Table 4.2: A summary of our results for bandit convex optimization over T Lipschitz d-dimensional functions, the t-th of which is σt-strongly convex. T ⊂ [T] is a subset of rounds with no strong convexity. The dependency on parameters other than d and T can be found in the respective corollary. Note that our results are all achieved by one single adaptive algorithm. Strong Convexity {σt} T t=1 Previous Works Our Results (Algorithm 14) σt = 0, ∀t ∈ [T] Oe(d 3/4T 3/4 ) [107] Oe( √ dT 3/4 ) (Corollary 4.3.1) σt = σ > 0, ∀t ∈ [T] Oe(d 4/3T 2/3 ) [4] Oe(d 2/3T 2/3 ) (Corollary 4.3.2) σt = σ1{t /∈ T }, |T | = T 8/9 or T = [T/2, T] N/A Oe(d 2/3T 2/3 ) (Corollary 4.3.3) σt = t −α, ∀t ∈ [T] N/A ( Oe(d 2/3T (2+α)/3 ), α ∈ [0, 1/4) Oe( √ dT 3/4 ), α ∈ [1/4, 1] (Corollary 4.3.4) bound of [132], and Oe(d 3/2 √ T) regret if all functions happen to be σ-strongly convex (that is, σt = σ for all t), matching that from [72]. In fact, our algorithm achieves the latter result even if Θ(T 3/4 ) of the functions have no strong convexity (and sometimes even if a constant fraction of the functions have no strong convexity). More generally, our bound interpolates between these two extremes. For example, if σt is decaying at the rate of 1/t α for some α ∈ [0, 1], then the regret is Oe(d 3/2T (1+α)/2 ) when α ≤ 1/3, and Oe(d 2/3T 2/3 ) otherwise.∗ See Table 4.1 for a summary. We note that the improvement over [132] comes as a side product of the better regularization technique of our algorithm. • We then consider another scenario where all loss functions are Lipschitz in Section 4.3. We develop another algorithm that achieves Oe( √ dT 3/4 ) regret when no functions are strongly convex, improving the Oe(d 3/4T 3/4 ) bound of [107], and Oe(d 2/3T 2/3 ) regret when all functions are σ-strongly convex, improving the Oe(d 4/3T 2/3 ) bound of [4]. These improvements again come as a side product of our better regularization. Similarly, the Oe(d 2/3T 2/3 ) result holds even if Θ(T 8/9 ) of the functions (or sometimes a constant fraction of them) have no strong convexity. For similar intermediate bounds in the example when σt = 1/t α, see Table 4.2. ∗This is a simplified and loosen version of Corollary 4.2.4, which explains the discontinuity in α. 126 Techniques. Our algorithm is also a variant of FTRL, with two crucial new ingredients to handle heterogeneous curvature. 
First, we extend the idea of Bartlett, Hazan, and Rakhlin [25] to adaptively add ℓ2 regularization to the loss functions and adaptively tune the learning rate. Doing so in the bandit setting is highly nontrivial and requires our second technical ingredient, which is to lift the problem to the (d + 1)-dimensional space and then apply a logarithmically homogeneous self-concordant barrier in the FTRL update. This technique is inspired by a recent work of Lee et al. [99] on achieving high-probability regret bounds for adversarial linear bandits, but the extension from linear bandits to convex bandits is nontrivial. In fact, the purpose of using this technique is also different: they need to bound the variance of the learner’s loss, which is related to bounding x ⊤∇2ψ(x)x for some regularizer ψ, while we need to bound the stability of the algorithm, which is related to bounding ∇ψ(x)∇−2ψ(x)∇ψ(x), but it turns out that when ψ is a logarithmically homogeneous ν-self-concordant barrier, then these two quantities are exactly the same and bounded by ν. Related work. Bandit convex optimization has been broadly studied under different loss function structures, including Lipschitz functions [90, 56], linear functions [2, 3, 32], smooth functions [132], strongly convex functions [4], smooth and strongly convex functions [72, 82], quadratic functions [137], pseudo-1- dimensional functions [131], and others. Without any structure (other than convexity), a series of progress has been made over recent years [34, 73, 35, 36], but as mentioned, even the best result [36] has a large dependency on d in the regret and is achieved by an impractical algorithm with large computational complexity. Our comparisons in this work (such as those in Table 4.1 and Table 4.2) thus mainly focus on more efficient and practical methods in the literature that share the same FTRL framework.† Closest to our heterogeneous setting is the work on Online Convex Optimization by Bartlett, Hazan, and Rakhlin [25], where at the end of each round, σt and ∇ft(xt) are revealed (versus σt and ft(xt) in †When the functions are smooth only, our comparison is based on [132], instead of the seemingly better results of prior works [52, 145], because the latter ones are unfortunately wrong as pointed out in [80]. 127 our setting). Due to the stronger feedback, their algorithm achieves O( √ T) regret without any strong convexity, O(log T) regret if all functions are strongly convex, and generally something in between. Our results are in the same vein, and as mentioned, our algorithm is also heavily inspired by theirs. Another potential approach to adapting to different environments is to have a meta algorithm learning over a set of base algorithms, each dedicated to a specific environment. Doing so in the bandit setting, however, is highly challenging [7] or even impossible sometimes [116]. For example, even if one only aims to adapt to two environments, one with only smooth functions and the other with smooth and strongly convex functions, the approach of Agarwal et al. [7] is only able to achieve Oe(T 3/4 ) regret for the first environment if one insists to enjoy Oe( √ T) regret in the second one. 4.1 Preliminaries and Problem Setup We start by reviewing some basic definitions. Definition 4.1.1. We say that a differentiable function f : Ω 7→ R is β-smooth over the feasible set Ω if for any x, y ∈ Ω, ∥∇f(x) − ∇f(y)∥2 ≤ β∥x − y∥2 holds. Definition 4.1.2. 
We say that a function f : Ω 7→ R is L-Lipschitz over the feasible set Ω if for any x, y ∈ Ω, |f(x) − f(y)| ≤ L∥x − y∥2 holds. Definition 4.1.3. We say that a differentiable function f : Ω 7→ R is σ-strongly convex over the feasible set Ω if for any x, y ∈ Ω, f(y) ≥ f(x) + ∇f(x) ⊤(y − x) + σ 2 ∥x − y∥ 2 2 holds. Problem setup. Bandit Convex Optimization (BCO) can be modeled as a T-round games between a learner and an oblivious adversary. Before the game starts, the adversary (knowing the learner’s algorithm) secretly decides an arbitrary sequence of convex functions f1, . . . , fT : Ω 7→ R, where Ω ⊂ R d is a known compact convex domain. In Section 4.2 we assume that all ft ’s are β-smooth for some known parameter β ≥ 0, while in Section 4.3 we assume that they are all L-Lipschitz for some known parameter L ≥ 0. 128 In both cases, we denote by σt ≥ 0 the strong convexity parameter of ft , initially unknown to the learner (note that σt could be zero). At each round t ∈ [T] ≜ {1, . . . , T} of the game, the learner chooses an action xt ∈ Ω based on all previous observations, and subsequently suffers loss ft(xt). The adversary then reveals both ft(xt) and σt to the learner. Compared to previous works where σt is the same for all t and known to the learner ahead of time, our setting is clearly more suitable for applications with heterogeneous curvature. We provide such an example below, which also illustrates why it is reasonable for the learner to observe σt at the end of round t. Examples. Consider a problem where ft(x) = PN i=1 gt,i(c ⊤ t,ix). Here, N is the number of users in each round, ct,i represents some context of the i-th user in round t, the learner’s decision x is used to make a linear prediction c ⊤ t,ix for this user, and her loss is evaluated via a convex function gt,i which incorporates some ground truth for this user (e.g., labels in the case of classification where gt,i could be the logistic loss, or responses in the case of regression where gt,i could be the squared loss). Note that the (heterogeneous) strong convexity σt of ft is at least µλmin( PN i=1 ct,ic ⊤ t,i) where λmin(·) denotes the minimum eigenvalue of a matrix, and µ is a lower bound on the second derivative of gt,i, often known ahead of time assuming some natural boundedness of x and ct,i. In this setup, there are several situations where our feedback model is reasonable. For example, it might be the case that the loss ft(xt) as well as the context ct,i is visible to the learner, but the ground truth (and thus gt,i) is not. In this case, based on the earlier bound on the strong convexity, the learner can calculate σt herself. As another example, due to privacy consideration, the user’s context ct,i might not be revealed to the learner, but it is acceptable to reveal a single number λmin( PN i=1 ct,ic ⊤ t,i) summarizing this batch of users. Clearly, σt can also be calculated by the learner in this case. 129 Objective and simplifying assumptions. The objective of the learner is to minimize her (expected) regret, defined as Reg = E "X T t=1 ft(xt) # − min x∈Ω E "X T t=1 ft(x) # , (4.1) which is the difference between the expected loss suffered by the learner and that of the best fixed action (the expectation is with respect to the randomness of the learner). Without loss of generality, we assume maxx∈Ω |ft(x)| ≤ 1 for all t ∈ [T], Ω contains the origin, and maxx∈Ω ∥x∥2 = 1. ‡ Notations. We adopt the following notational convention throughout this chapter. 
Generally, we use lowercase letters to denote vectors and capitalized letters to denote matrices. S d and B d respectively denote the unit sphere and unit ball in R d , 1 and 0 respectively denote the all-one and all-zero vectors with an appropriate dimension, and I denotes the identity matrix with an appropriate dimension. For a vector x ∈ R d , we use x[i:j] ∈ R j−i+1 to denote the induced truncated vector consisting of the i-th to j-th coordinates of x. For a sequence of scalars a1, . . . , at , we use ai:j ≜ Pj k=i ak to denote the cumulative summation, and {as} q s=p to denote the subsequence ap, ap+1, . . . , aq. The Oe(·) notation omits the logarithmic dependence on the horizon T. § Et [·] is a shorthand for the conditional expectation given the history before round t. 4.2 Smooth Convex Bandits with Heterogeneous Strong Convexity Throughout this section, we assume that all convex loss functions f1, . . . , fT are β-smooth (see Assumption 4.1.1). To present our adaptive algorithm in this case, we start by reviewing two important existing algorithms upon which ours is built. ‡This is without loss of generality because for any problem with |ft(x)| ≤ B and maxx,x′∈Ω ∥x − x ′ ∥2 = D, we can solve it via solving a modified problem with convex domain U = {u = 1 D (x − x0) : x ∈ Ω} for any fixed x0 ∈ Ω and loss functions gt(u) = 1 B ft(Du + x0), which then satisfies our simplifying assumptions maxu∈U |gt(u)| ≤ 1, 0 ∈ U, and maxu∈U ∥u∥2 = 1. § In the texts, for simplicity Oe(·) might also hide dependency on other parameters such as the dimension d. However, in all formal theorem/lemma statements, this will not be the case. 130 Review of Adaptive Online Gradient Descent (AOGD). As mentioned, the AOGD algorithm proposed by Bartlett, Hazan, and Rakhlin [25] is designed for a similar heterogeneous setting but with the stronger gradient feedback. The first key idea of AOGD is that, instead of learning over the original loss functions {ft} T t=1, one should learn over their ℓ2-regularized version {fet} T t=1 where fet(x) = ft(x) + λt 2 ∥x∥ 2 2 for some coefficient λt . Intuitively, λt is large when previous loss functions exhibit not enough strong convexity to stabilize the algorithm, and small otherwise. How to exactly tune λt based on the observed {σs} t s=1 is their second key idea — they show that λt should balance two terms and satisfy 3 2 λt = 1 σ1:t + λ1:t , (4.2) which results in a quadratic equation of λt and can be solved in closed form. The final component of AOGD is simply to run gradient descent on {fet} T t=1 with some adaptively decreasing learning rates. Review of BCO with smoothness and strong convexity. Hazan and Levy [72] consider the BCO problem with β-smooth and σ-strongly convex loss functions. Their FTRL-based algorithm maintains an auxiliary sequence y1, . . . , yT ∈ Ω via yt+1 = argmin x∈Ω (X t s=1 ⟨gs, x⟩ + σ 2 ∥x − ys∥ 2 2 + 1 η ψ(x) ) , (4.3) where gs is some estimator of ∇fs(ys), η > 0 is some fixed learning rate, and the regularizer ψ is a ν-self-concordant barrier whose usage in BCO is pioneered by Abernethy, Hazan, and Rakhlin [2] (see Appendix G.3 for definition). The rational behind the squared distance terms in this update is that, due to strong convexity, we have fs(ys)−fs(x) ≤ ges(ys)−ges(x)for any x and ges(x) = ∇fs(ys) ⊤x+ σ 2 ∥x−ys∥ 2 2 , meaning that it suffices to consider ge1, . . . , geT as the loss functions. 
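For completeness, this claim is just the definition of strong convexity (Definition 4.1.3) rearranged: since $f_s(x) \ge f_s(y_s) + \nabla f_s(y_s)^\top (x - y_s) + \frac{\sigma}{2}\|x - y_s\|_2^2$, we have
$$f_s(y_s) - f_s(x) \;\le\; \nabla f_s(y_s)^\top (y_s - x) - \frac{\sigma}{2}\|x - y_s\|_2^2 \;=\; \widetilde{g}_s(y_s) - \widetilde{g}_s(x),$$
so any regret bound proven for the surrogate losses $\widetilde{g}_1, \dots, \widetilde{g}_T$ carries over to $f_1, \dots, f_T$.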
Having yt , the algorithm makes the final prediction xt by adding certain curvature-adaptive and shrinking exploration to yt : xt = yt+H −1/2 t ut , 131 where Ht = ∇2ψ(yt) + ησtI and ut is chosen from the unit sphere S d uniformly at random (xt ∈ Ω is guaranteed by the property of self-concordant barriers). Finally, with the feedback ft(xt), the gradient estimator is constructed as gt = d · ft(xt)H 1/2 t ut , which can be shown to be an unbiased and low-variance estimator of the gradient of some smoothed version of ft at yt . 4.2.1 Proposed Algorithm and Main Theorem We are now ready to describe our algorithm. Following [25], our first step is also to consider learning over the ℓ2-regularized loss functions: fet(x) = ft(x) + λt 2 ∥x∥ 2 2 with an adaptively chosen λt > 0 (note that with the bandit feedback ft(xt), we can also evaluate fet(xt)). While Bartlett, Hazan, and Rakhlin [25] apply gradient descent, the standard and optimal algorithm for strongly-convex losses with gradient feedback, here we naturally apply the algorithm of Hazan and Levy [72] to this sequence of regularized loss functions instead. Since fet is (σt + λt)-strongly convex, following Eq. (4.3) and adopting a decreasing learning rate ηt shows that we should maintain the auxiliary sequence y1, . . . , yT according to yt+1 = argmin x∈Ω (X t s=1 ⟨gs, x⟩ + σs + λs 2 ∥x − ys∥ 2 2 + 1 ηt+1 ψ(x) ) , (4.4) where similarly gt = d · fet(xt)H 1/2 t ut for xt = yt + H −1/2 t ut , Ht = ∇2ψ(yt) + ηt (σ1:t−1 + λ1:t−1) I, and ut chosen randomly from the unit sphere. While this forms a natural and basic framework of our algorithm, there is in fact a critical issue when analyzing such a barrier-regularized FTRL algorithm due to the decreasing learning rate, which was never encountered in the literature as far as we know since all related works using this framework adopt a fixed learning rate (see e.g., [2, 132, 70, 126, 72, 37]) or an increasing learning rate [99]. More specifically, to bound the stability ⟨yt − yt+1, gt⟩ of the algorithm, all analysis for barrier-regularized FTRL implicitly or explicitly requires bounding the Newton decrement, which in our context is ∥∇Gt(yt)∥ 2 ∇−2Gt(yt) for Gt 132 being the objective function in the FTRL update Eq. (4.4). To simplify this term, note that since ψ is a barrier and yt minimizes Gt−1, we have ∇Gt−1(yt) = 0 = X t−1 s=1 gs + (σs + λs)(yt − ys) + 1 ηt ∇ψ(yt). Further combining this with ∇Gt(yt) = Pt s=1 gs + Pt−1 s=1(σs + λs)(yt − ys) + 1 ηt+1 ∇ψ(yt) shows ∇Gt(yt) = gt+( 1 ηt+1 − 1 ηt )∇ψ(yt). Now, if ηt is fixed for all t, then the Newton decrement simply becomes ∥gt∥ 2 ∇−2Gt(yt) ⪯ ∥gt∥ 2 ∇−2Gt−1(yt) , which by the definition of gt is directly bounded by ηtd 2 . However, with decreasing learning rates, the extra term contributes to a term of order ( 1 ηt+1 − 1 ηt ) 2∥∇ψ(yt)∥ 2 ∇−2Gt(yt) , which could be prohibitively large unfortunately.¶ To resolve this issue, our key observation is: ∥∇ψ(yt)∥ 2 ∇−2Gt(yt) ⪯ ηt+1∥∇ψ(yt)∥ 2 ∇−2ψ(yt) , and ∥∇ψ(yt)∥ 2 ∇−2ψ(yt) is always bounded by ν as long as the self-concordant barrier ψ is also logarithmically homogeneous (see Appendix G.3 for definition and Lemma G.3.5 for this property). A logarithmically homogeneous self-concordant barrier is also called a normal barrier for short, and it is only defined for a cone (recall that our feasible set Ω, on the other hand, is always bounded, thus not a cone). Fortunately, this issue has been addressed in a recent work by Lee et al. [99] on achieving high probability regret bounds for adversarial linear bandits. 
Their motivation is different, that is, to bound the variance of the learner’s loss, related to ∥yt∥ 2 ∇2ψ(yt) using our notation, and this turns out to be bounded when ψ is a normal barrier — in fact, ∥yt∥ 2 ∇2ψ(yt) and ∥∇ψ(yt)∥ 2 ∇−2ψ(yt) are exactly the same in this case! Their solution regarding Ω not being a cone is to first lift it to R d+1 and find a normal barrier of the conic hull of this lifted domain ¶ Instead of FTRL, one might wonder if using the highly related Online Mirror Descent framework could solve the issue caused by decreasing learning rates. We point out that while this indeed addresses the issue for bounding the stability term, it on the other hand introduces a similar issue for the regularizaton penalty term PT t=2( 1 ηt+1 − 1 ηt )Dψ(x, yt). 13 (which always exists), then perform FTRL over the lifted domain with this normal barrier regularizer. We extend their idea from linear bandits to convex bandits, formally described below (see also Algorithm 13 for the pseudocode). Lifted domain and normal barrier. To make the dimension of a vector/matrix self-evident, we use bold letters to represent vectors in R d+1 and matrices in R (d+1)×(d+1). Define the lifted domain as Ω = {x = (x, 1) | x ∈ Ω} ⊆ R d+1, which simply appends an additional coordinate with constant value 1 to all points in Ω. Define the conic hull of this set as K = {(x, b) | x ∈ R d , b ≥ 0, 1 b x ∈ Ω}. Our algorithm requires using a normal barrier over K as a regularizer (which always exists). While any such normal barrier works, we simply use a canonical one constructed from a ν-self concordant barrier ψ of Ω, defined via Ψ(x) = Ψ(x, b) = 400ψ(x/b) − 2ν ln b and proven to be a Θ(ν)-normal barrier over K in Proposition 5.1.4 of [120]. This also shows that our algorithm requires no more than that of Hazan and Levy [72]. Algorithm 13 then performs FTRL in the lifted domain using Ψ as the regularizer to maintain an auxiliary sequence y1, . . . , yT ; see Line 16. This follows the earlier update rule in Eq. (4.4), except for an additional ℓ2 regularization term λ0 2 ∥x∥ 2 2 added for technical reasons. With yt at hand, we compute Hessian matrix Ht = ∇2Ψ(yt)+ηt (σ1:t−1 + λ0:t−1) I similarly as before (Line 10). What is slightly different now is the exploration (Line 12): we sample ut uniformly at random from the set S d+1 ∩ H − 1 2 t ed+1⊥ where w⊥ denotes the space orthogonal to w, and then obtain a point xt = yt + H − 1 2 t ut . It can been shown that xt is always on the intersection of the lifted domain Ω and the surface of some ellipsoid centered at yt — we refer the reader to Figure 3.1 for a pictorial illustration and their description for how to sample ut efficiently. Since xt ∈ Ω, it is in the form of (xt , 1), where xt will be the final decision of the algorithm. Upon receiving ft(xt) and σt , Algorithm 13 computes the ℓ2 regularization coefficient λt in some way (to be discussed soon), gradient estimator gt as in earlier discussion (Line 14), learning rate ηt+1 as in Line 15, and finally yt+1 via the aforementioned FTRL. 13 Algorithm 13 Adaptive Smooth BCO with Heterogeneous Strong Convexity Input: smoothness parameter β and a ν-self-concordant barrier ψ for the feasible domain Ω. Define: lifted feasible set Ω = {x = (x, 1) | x ∈ Ω}. Define: Ψ is a normal barrier of the conic hull of Ω: Ψ(x) = Ψ(x, b) = 400(ψ(x/b) − 2ν ln b). Define: ρ = 512ν(1 + 32√ ν) 2 . Initialize: λ0 = max (β + 1)ρν−1 , d2 (β + 1) and η1 = 1 2d qβ+1 λ0 + ν T log T . Initialize: y1 = (y1, 1) = argminx∈Ω Ψ(x). for t = 1, 2, . . . 
, T do 10 Compute Ht = ∇2Ψ(yt) + ηt (σ1:t−1 + λ0:t−1) I. 11 Draw ut uniformly at random from S d+1 ∩ H − 1 2 t ed+1⊥ . ▷ w⊥: space orthogonal to w 12 Compute xt = (xt , 1) = yt + H − 1 2 t ut , play the point xt , and observe ft(xt) and σt . 13 Compute regularization coefficient λt ∈ (0, 1). ▷ See Eq. (4.6) and related discussions 14 Compute gradient estimator gt = d ft(xt) + λt 2 ∥xt∥ 2 2 H 1 2 t ut . 15 Compute learning rate ηt+1 = 1 2d q β+1 σ1:t+λ0:t + ν T log T . 16 Update yt+1 = argminx∈Ω nPt s=1 g ⊤ s x + σs+λs 2 ∥x − ys∥ 2 2 + λ0 2 ∥x∥ 2 2 + 1 ηt+1 Ψ(x) o . Guarantees and ℓ2 regularization coefficient tuning. We now present some guarantees of our algorithm that hold regardless of the tuning of {λt} T t=1. First, we show that except for the last coordinate, gt is an unbiased estimator of a smoothed version of fet(yt), where fet(y) = ft(y[1:d] ) + λt 2 ∥y[1:d]∥ 2 2 . This is a non-trivial generalization of Lemma B.9 of [99] from linear functions to convex functions. See Appendix G.1.1 for the proof. Lemma 4.2.1. For each t ∈ [T], we have Et [gt,i] = ∇fbt(yt)[i] for alli ∈ [d], where fbt is the smoothed version of fet defined as fbt(x) ≜ Eb fet(x + H −1/2 t b) , where b is uniformly sampled from B d+1 ∩ (H − 1 2 t ed+1) ⊥. Thanks to the unbiasedness of the gradient estimators and the crucial properties of normal barrier, we prove the following regret guarantee of Algorithm 13 in Appendix G.1.3. Lemma 4.2.2. With any regularization coefficients {λt} T t=1 ∈ (0, 1), Algorithm 13 guarantees: Reg = Oe d √ νT + λ1:T −1 + T X−1 t=1 d √ β + 1 √ σ1:t + λ0:t ! , (4.5) if loss functions {ft} T t=1 are all β-smooth and T ≥ ρ (a constant defined in Algorithm 13). We are now in the position to specify the tuning of the regularization coefficients. Based on the bound in Eq. (4.5), we propose to balance the last two terms by picking λt ∈ (0, 1) such that: λt = d √ β + 1 √ σ1:t + λ0:t , (4.6) which must exist since when λt = 0, the left-hand side is smaller than the right-hand side, while when λt = 1, the left-hand side is larger than the right-hand side by the definition of λ0. Note that unlike the AOGD tuning in Eq. (4.2), our tuning leads to a cubic equation of λt , which does not admit a closed-form. However, the earlier argument on its existence clearly also implies that it can be computed via a simple and efficient binary search (using information available at the end of round t). Our next lemma is in the same vein as Lemma 3.1 of [25], which shows that our adaptive tuning is almost as good as the optimal tuning (that knows all σt ’s ahead of time). This lemma turns out to be crucial to achieve our guarantees that adapt to the environment with heterogeneous curvatures. Lemma 4.2.3. Define B({λs} t s=1) ≜ λ1:t + Pt τ=1 d √ β+1 √ σ1:τ +λ0:τ , with λ0 defined in Algorithm 13. Then the sequence {λt} T t=1 attained by solving Eq. (4.6) satisfies for all t ∈ [T]: B({λs} t s=1) ≤ 2 min {λ∗ s} t s=1≥0 B({λ ∗ s} t s=1). (4.7) The proof of Lemma 4.2.3 is deferred in Appendix G.1.4. Combining Lemma 4.2.2 and Lemma 4.2.3, we obtain the final regret guarantee in Theorem 4.2.1, whose proof can be found in Appendix G.1.5. Theorem 4.2.1. Algorithm 13 with adaptive tuning Eq. (4.6) ensures for any sequence λ ∗ 1 , . . . , λ∗ T ≥ 0: Reg = Oe d √ νT + λ ∗ 1:T −1 + T X−1 t=1 d √ β + 1 p σ1:t + λ ∗ 0:t ! , (4.8) when loss functions {ft} T t=1 are all β-smooth and T ≥ ρ (a constant defined in Algorithm 13). 
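As a computational remark, the binary search for the tuning equation Eq. (4.6) mentioned earlier can be implemented in a few lines; the following is a minimal sketch with illustrative variable names (sigma_1t and lam_0_prev denote the cumulative sums of the strong convexity parameters and of the coefficients chosen so far, both available at the end of round t).

def solve_lambda(sigma_1t, lam_0_prev, d, beta, tol=1e-9):
    # Bisection for Eq. (4.6):  lam = d * sqrt(beta + 1) / sqrt(sigma_{1:t} + lam_{0:t-1} + lam).
    # The left-hand side increases and the right-hand side decreases in lam, and by the
    # choice of lam_0 the two sides cross inside (0, 1), so bisection finds the solution.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        rhs = d * (beta + 1) ** 0.5 / (sigma_1t + lam_0_prev + mid) ** 0.5
        if mid < rhs:
            lo = mid          # lam is still too small
        else:
            hi = mid
    return 0.5 * (lo + hi)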
136 We leave the discussion on the many implications of this general regret bound to the next subsection, and make a final remark on the per-round computational complexity of Algorithm 13. Note that although the objective function in the FTRL update (Line 16) contains O(T) terms, it is clear that by storing and updating some statistics (such as Pt s=1 gs and Pt s=1(σs + λs)ys), one can evaluate its function value and gradient in time independent of T. Approximating solving the FTRL update (up to precision 1/poly(T)) via for example the interior point method thus only requires O(poly(d log T)) time. This is more efficient than the method of Bubeck, Eldan, and Lee [36], which requires O(poly(d log T)T) time per round even when the domain is a polytope. 4.2.2 Implications of Theorem 4.2.1 In the following, we investigate several special cases and present direct corollaries of Theorem 4.2.1 to demonstrate that our algorithm not only matches/improves existing results for homogeneous settings, but also leads to interesting intermediate results in some heterogeneous settings. Note that since our regret bound in Theorem 4.2.1 holds for any choice of the sequence {λ ∗ t } T t=1 ≥ 0, in each case below we will simply provide a specific sequence of {λ ∗ t } T t=1 that leads to a favorable guarantee. For simplicity, we also directly replace ν with O(d) (in both our bounds and previous results) since it is well known that any convex set admits an O(d)-self-concordant barrier [120]. First, consider the case when no functions have strong convexity, that is, σt = 0 for all t. This degenerates to the same homogeneous setting as [132], where their algorithm achieves Oe(d 3/2 √ T + β 1/3dT 2/3 ) regret. As a side product of our adaptive ℓ2-regularization, our algorithm manages to achieve even better dependency on the dimension d. Indeed, by choosing λ ∗ 1 = (1 + β) 1/3d 2/3T 2/3 and λ ∗ t = 0 for all t ≥ 2 in Theorem 4.2.1, we obtain the following corollary. 137 Corollary 4.2.1 (Smooth BCO without strong convexity). When ft is β-smooth and 0-strongly convex for all t ∈ [T], Algorithm 13 guarantees that Reg = Oe d 3 2 √ T + (1 + β) 1 3 d 2 3 T 2 3 . Second, we consider the case when all functions are σ-strongly convex for some constant σ > 0, which degenerates to the same homogeneous setting as [72]. By picking λ ∗ t = 0 for all t ≥ 0 in Theorem 4.2.1, our algorithm achieves the same result as theirs. Corollary 4.2.2 (Smooth BCO with σ-strong convexity). When ft is β-smooth and σ-strongly convex, i.e., σt = σ > 0 for all t ∈ [T], Algorithm 13 guarantees that Reg = Oe d 3 2 √ T + d r T(1 + β) σ ! . Third, we investigate an intermediate setting with a mixture of σ-strongly convex and 0-strongly convex functions. Specifically, suppose that there are M functions with no strong convexity, and the rest are σ-strongly convex. According to Theorem 4.2.1, the worst case scenario for our algorithm is when these M functions appear in the first M rounds, while the best scenarios is when they are in the last M rounds. Considering these two extremes and picking {λ ∗ t } T t=1 correspondingly, we obtain the following corollary (see Appendix G.1.6 for the proof). Corollary 4.2.3 (Smooth BCO with a mixture of convex and σ-strongly convex functions). Suppose that {ft} T t=1 are β-smooth and T − M of them are σ-strongly convex. Then Algorithm 13 guarantees Reg = Oe d 2 3 (1 + β) 1 3M 2 3 + d 3 2 √ T + d r (1 + β)(T − M) σ ! . 
138 If these T − M functions appear in the first T − M rounds, then the bound is further improved to Reg = Oe d 3 2 √ T + dTs 1 + β σ(T − M) ! . To better interpret these bounds, we consider how large M can be (that is, how many functions without strong convexity we can tolerate) to still ensure Oe( √ T) regret — in the general case (the first bound of the corollary), we see that we can tolerate M = O(T 3/4 ), while in the best case (the second bound), we can even tolerate M being any constant fraction of T! On the other hand, a naive method of discarding all functions without strong convexity can only tolerate M = Oe( √ T). Finally, following [25] we consider a situation with decaying strong convexity: σt = t −α for some α ∈ [0, 1]. We prove the following corollary; see Appendix G.1.6 for the proof. Corollary 4.2.4 (Smooth BCO with decaying strong convexity). When ft is β-smooth and σt-strongly convex with σt = t −α for some α ∈ [0, 1], Algorithm 13 guarantees Reg = Oe d 3 2 √ T + d √ 1 + βT 1+α 2 α ∈ [0, 1 3 − 2 3 logT d − 1 3 logT (1 + β)], Oe d 3 2 √ T + (1 + β) 1 3 d 2 3 T 2 3 α ∈ [ 1 3 − 2 3 logT d − 1 3 logT (1 + β), 1]. 4.3 Lipschitz Convex Bandits with Heterogeneous Strong Convexity In this section, we consider a similar setting where instead of assuming smoothness, we assume that functions {ft} T t=1 are known to be L-Lipschitz (Assumption 4.1.2). The strong convexity parameter σt of function ft is still only revealed at the end of round t. We extend our algorithm to this case and present it in Algorithm 14, which differs from Algorithm 13 only in the tuning of the learning rate ηt (see Line 6) and the regularization coefficient λt (see Line 4). These tunings are different because of the different structures 139 Algorithm 14 Adaptive Lipschitz BCO with Heterogeneous Strong Convexity Input: Lipschitz parameter L and a ν-self-concordant barrier ψ for the feasible domain Ω. Define: lifted feasible set Ω = {x = (x, 1) | x ∈ Ω}. Define: Ψ is a normal barrier of the conic hull of Ω: Ψ(x) = Ψ(x, b) = 400(ψ(x/b) − 2ν ln b). Define: ρ ′ = 2 16(16√ νd1/3 (4L+1)1/3+(L+1)2/3 ) 3 d . Initialize: λ0 = max{ρ ′ , d2 (L + 1)2} and η1 = (L + 1) 2 3 d − 4 3 ( 1 λ0 + 1 T ) 1 3 . Initialize: y1 = (y1, 1) = argminx∈Ω Ψ(x). for t = 1, 2, . . . , T do 1 Define Ht = ∇2Ψ(yt) + ηt (σ1:t−1 + λ0:t−1) I. 2 Draw ut uniformly at random from S d+1 ∩ H − 1 2 t ed+1⊥ . ▷ w⊥: space orthogonal to w 3 Compute xt = (xt , 1) = yt + H − 1 2 t ut , play the point xt , and observe ft(xt) and σt . 4 Compute regularization coefficient λt ∈ (0, 1) as the solution of the following equation λt = d 2 3 (L + 1) 2 3 (σ1:t + λ0:t) 1 3 . (4.9) 5 Compute gradient estimator gt = d ft(xt) + λt 2 ∥xt∥ 2 2 H 1 2 t ut . 6 Compute learning rate ηt+1 = d − 4 3 (L + 1) 2 3 · 1 σ1:t+λ0:t + 1 T 1 3 . 7 Update yt+1 = argminx∈Ω nPt s=1 g ⊤ s x + σs+λs 2 ∥x − ys∥ 2 2 + λ0 2 ∥x∥ 2 2 + 1 ηt+1 Ψ(x) o . in the setting, but their design follows the same idea as before. Similar to Theorem 4.2.1, we prove the following theorem (see Appendix G.2 for the proof). Theorem 4.3.1. Algorithm 14 ensures for any sequence λ ∗ 1 , . . . , λ∗ T ≥ 0: Reg = Oe d 4 3 νT 1 3 + λ ∗ 0:T + X T t=1 d 2 3 (L + 1) 2 3 (σ1:t−1 + λ ∗ 0:t−1 ) 1 3 ! , (4.10) when all the functions are L-Lipschitz and T ≥ ρ ′ (a constant defined in Algorithm 14). 
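The adaptive tuning of Algorithm 14 can be computed in the same way as in the smooth case, now with the cube-root equation Eq. (4.9) and the learning rate of Line 6; the sketch below is again illustrative, with sigma_1t and lam_0_prev denoting the same cumulative sums as before.

def lipschitz_tuning_step(sigma_1t, lam_0_prev, d, L, T, tol=1e-9):
    # Solve Eq. (4.9), lam = d^{2/3} (L+1)^{2/3} / (sigma_{1:t} + lam_{0:t-1} + lam)^{1/3},
    # by bisection on (0, 1), then compute eta_{t+1} as in Line 6 of Algorithm 14.
    target = d ** (2 / 3) * (L + 1) ** (2 / 3)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid < target / (sigma_1t + lam_0_prev + mid) ** (1 / 3):
            lo = mid
        else:
            hi = mid
    lam_t = 0.5 * (lo + hi)
    eta_next = d ** (-4 / 3) * (L + 1) ** (2 / 3) * (1.0 / (sigma_1t + lam_0_prev + lam_t) + 1.0 / T) ** (1 / 3)
    return lam_t, eta_next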
Similar to Section 4.2.2, we now discuss the implications of this theorem in several special cases, demonstrating that our algorithm not only improves existing results in the homogeneous settings as a side product of the better regularization technique, but also achieves favorable guarantees in some heterogeneous settings. Again, we plug in $\nu = O(d)$ for simplicity.

First, we consider the case when no functions have strong convexity, which degenerates to the same homogeneous setting studied in [90, 56, 107]. Among these results, the best regret bound is the $\widetilde{O}(\sqrt{L}\, d^{3/4} T^{3/4})$ bound shown in [107]. By picking $\lambda_1^* = \sqrt{d(L+1)}\, T^{3/4}$ and $\lambda_t^* = 0$ for all $t \geq 2$ in Theorem 4.3.1, we achieve the following guarantee with improved dimension dependency.

Corollary 4.3.1 (Lipschitz BCO without strong convexity). When $f_t$ is $L$-Lipschitz and $0$-strongly convex for all $t \in [T]$, Algorithm 14 guarantees
$$\mathrm{Reg} = \widetilde{O}\Big(\sqrt{d(L+1)}\, T^{\frac{3}{4}}\Big).$$

Second, we consider the case when all loss functions are $\sigma$-strongly convex. This degenerates to the homogeneous setting studied in [4], where they achieve $\widetilde{O}(d^{4/3} L^{2/3} \sigma^{-1/3} T^{2/3})$ regret.∥ Once again, by picking $\lambda_t^* = 0$ for all $t \geq 1$ in Theorem 4.3.1, we obtain the following result with improved dimension dependency.

∥The bound stated in their paper has $d^{2/3}$ dependency on the dimension, but that is under a different assumption on $\Omega$. Translating their setting to ours via a reshaping trick (Section 3.2 in [56]) leads to the $d^{4/3}$ dependency.

Corollary 4.3.2 (Lipschitz BCO with $\sigma$-strong convexity). When $f_t$ is $L$-Lipschitz and $\sigma$-strongly convex, i.e., $\sigma_t = \sigma$ for all $t \in [T]$, Algorithm 14 guarantees that
$$\mathrm{Reg} = \widetilde{O}\Big(d^{\frac{2}{3}}(L+1)^{\frac{2}{3}} \sigma^{-\frac{1}{3}} T^{\frac{2}{3}}\Big).$$

Third, we consider the case with a mixture of $0$-strongly convex and $\sigma$-strongly convex functions.

Corollary 4.3.3 (Lipschitz BCO with a mixture of convex and $\sigma$-strongly convex functions). Suppose that $\{f_t\}_{t=1}^{T}$ are $L$-Lipschitz and $T - M$ of them are $\sigma$-strongly convex. Then Algorithm 14 guarantees
$$\mathrm{Reg} = \widetilde{O}\Big(\sqrt{d(L+1)}\, M^{\frac{3}{4}} + d^{\frac{2}{3}}(L+1)^{\frac{2}{3}} \sigma^{-\frac{1}{3}} (T-M)^{\frac{2}{3}}\Big).$$
If these $T - M$ functions appear in the first $T - M$ rounds, then the bound is further improved to
$$\mathrm{Reg} = \widetilde{O}\left(\frac{d^{\frac{2}{3}}(L+1)^{\frac{2}{3}}\, T}{\sigma^{\frac{1}{3}}(T-M)^{\frac{1}{3}}}\right).$$

The proof can be found in Appendix G.2.2. Similar to the discussion in Section 4.2.2, we consider how large $M$ can be to still ensure $\widetilde{O}(T^{2/3})$ regret — according to the first bound, we can always tolerate $M = O(T^{8/9})$, while in the best case (the second bound), we can tolerate $M$ being any constant fraction of $T$. These are again much stronger compared to the naive method of discarding all functions without strong convexity, which can only tolerate $M = \widetilde{O}(T^{2/3})$.

Finally, we consider the example with $\sigma_t = t^{-\alpha}$ again. See Appendix G.2.2 for the proof.

Corollary 4.3.4 (Lipschitz BCO with decaying strong convexity). When $f_t$ is $L$-Lipschitz and $\sigma_t$-strongly convex with $\sigma_t = t^{-\alpha}$ for some $\alpha \in [0, 1]$, Algorithm 14 guarantees
$$\mathrm{Reg} = \begin{cases} \widetilde{O}\Big(d^{\frac{2}{3}}(L+1)^{\frac{2}{3}} T^{\frac{2+\alpha}{3}}\Big), & \alpha \in \big[0,\; \tfrac{1}{4} - \tfrac{1}{2}\log_T(L+1) - \tfrac{1}{2}\log_T d\big], \\[4pt] \widetilde{O}\Big(\sqrt{d(L+1)}\, T^{\frac{3}{4}}\Big), & \alpha \in \big[\tfrac{1}{4} - \tfrac{1}{2}\log_T(L+1) - \tfrac{1}{2}\log_T d,\; 1\big]. \end{cases}$$

4.4 Conclusion

In this chapter, we initiate the study of bandit convex optimization with heterogeneous curvature and propose algorithms with strong guarantees that automatically adapt to the individual curvature of each loss function. As the first step in this direction, we have assumed homogeneous smoothness or Lipschitzness and only considered heterogeneous strong convexity. Extending the heterogeneity to the other curvature parameters is an immediate next step.
Moreover, it is worth investigating an even more challenging setting where the individual curvature information is not revealed to the learner at the end of each round, or at least has to be learned via other weaker and indirect feedback (such as some rough and potentially incorrect estimation of the curvature).

Bibliography

[1] Jacob Abernethy and Alexander Rakhlin. “Beating the adaptive bandit with high probability”. In: 2009 Information Theory and Applications Workshop. IEEE. 2009, pp. 280–289.
[2] Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. “Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization”. In: Conference on Learning Theory. 2008.
[3] Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. “Interior-point methods for full-information and bandit online learning”. In: IEEE Transactions on Information Theory 58.7 (2012), pp. 4164–4175.
[4] Alekh Agarwal, Ofer Dekel, and Lin Xiao. “Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback”. In: Proceedings of the 23rd Conference on Learning Theory (COLT). 2010, pp. 28–40.
[5] Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert Schapire. “Contextual bandit learning with predictable rewards”. In: Artificial Intelligence and Statistics. PMLR. 2012, pp. 19–26.
[6] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. “Taming the monster: A fast and simple algorithm for contextual bandits”. In: International Conference on Machine Learning. PMLR. 2014, pp. 1638–1646.
[7] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. “Corralling a Band of Bandit Algorithms”. In: Proceedings of the 2017 Conference on Learning Theory. 2017.
[8] Zeyuan Allen-Zhu, Sébastien Bubeck, and Yuanzhi Li. “Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits”. In: Proceedings of the 35th International Conference on Machine Learning. 2018.
[9] Chamy Allenberg, Peter Auer, László Györfi, and György Ottucsák. “Hannan consistency in on-line learning in case of unbounded losses under partial monitoring”. In: International Conference on Algorithmic Learning Theory. 2006.
[10] Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. “Online learning with feedback graphs: Beyond bandits”. In: Conference on Learning Theory. PMLR. 2015, pp. 23–35.
[11] Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. “Online Learning with Feedback Graphs: Beyond Bandits”. In: Proceedings of The 28th Conference on Learning Theory. 2015.
[12] Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. “Online Learning with Feedback Graphs: Beyond Bandits”. In: arXiv preprint arXiv:1502.07617 (2015).
[13] Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. “Nonstochastic multi-armed bandits with graph-structured feedback”. In: SIAM Journal on Computing (2017).
[14] Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. “Bandits with Feedback Graphs and Switching Costs”. In: Advances in Neural Information Processing Systems 32. 2019.
[15] Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. “Corralling stochastic bandit algorithms”. In: The 24th International Conference on Artificial Intelligence and Statistics (AISTATS). 2021, pp. 2116–2124.
[16] Jean-Yves Audibert and Sébastien Bubeck. “Minimax policies for adversarial and stochastic bandits”. In: Conference on Learning Theory. 2009.
[17] Jean-Yves Audibert and Sébastien Bubeck.
“Regret Bounds and Minimax Policies under Partial Monitoring”. In: Journal of Machine Learning Research 11 (2010), pp. 2785–2836. [18] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. “Minimax policies for combinatorial prediction games”. In: Proceedings of the 24th Annual Conference on Learning Theory. 2011. [19] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. “The nonstochastic multiarmed bandit problem”. In: SIAM journal on computing 32.1 (2002), pp. 48–77. [20] Peter Auer, Pratik Gajane, and Ronald Ortner. “Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes”. In: Proceedings of the 32nd Conference on Learning Theory (COLT). 2019, pp. 138–158. [21] Baruch Awerbuch and Robert D Kleinberg. “Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches”. In: Symposium on Theory of Computing. 2004. [22] Sven Axsäter. Inventory control. Vol. 225. Springer, 2015. [23] Santiago R Balseiro and Yonatan Gur. “Learning in repeated auctions with budgets: Regret minimization and equilibrium”. In: Management Science 65.9 (2019), pp. 3952–3968. [24] Peter Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. “High-probability regret bounds for bandit online linear optimization”. In: Conference on Learning Theory. Omnipress. 2008, pp. 335–342. [25] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. “Adaptive Online Gradient Descent”. In: Advances in Neural Information Processing Systems 20 (NIPS). 2007, pp. 65–72. 145 [26] Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. “Mostly exploration-free algorithms for contextual bandits”. In: Management Science. Vol. 67. INFORMS, 2021, pp. 1329–1349. [27] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. “Dota 2 with large scale deep reinforcement learning”. In: arXiv preprint arXiv:1912.06680 (2019). [28] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. “Contextual bandit algorithms with supervised learning guarantees”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings. 2011, pp. 19–26. [29] Djallel Bouneffouf, Irina Rish, and Charu Aggarwal. “Survey on applications of multi-armed and contextual bandits”. In: 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE. 2020, pp. 1–8. [30] Gábor Braun and Sebastian Pokutta. “An efficient high-probability algorithm for linear bandits”. In: arXiv preprint arXiv:1610.02072 (2016). [31] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. “Regret analysis of stochastic and nonstochastic multi-armed bandit problems”. In: Foundations and Trends® in Machine Learning 5.1 (2012), pp. 1–122. [32] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. “Towards minimax policies for online linear optimization with bandit feedback”. In: Conference on Learning Theory. 2012. [33] Sébastien Bubeck, Michael Cohen, and Yuanzhi Li. “Sparsity, variance and curvature in multi-armed bandits”. In: Proceedings of Algorithmic Learning Theory. 2018. [34] Sébastien Bubeck, Ofer Dekel, Tomer Koren, and Yuval Peres. “Bandit Convex Optimization: √ T Regret in One Dimension”. In: Proceedings of the 28th Conference on Learning Theory (COLT). 2015, pp. 266–278. [35] Sébastien Bubeck and Ronen Eldan. “Multi-scale exploration of convex functions and bandit convex optimization”. 
In: Proceedings of the 29th Conference on Learning Theory (COLT). Vol. 49. 2016, pp. 583–589. [36] Sébastien Bubeck, Ronen Eldan, and Yin Tat Lee. “Kernel-Based Methods for Bandit Convex Optimization”. In: vol. 68. 4. 2021. [37] Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei. “Improved path-length regret bounds for bandits”. In: Proceedings of the 32nd Conference On Learning Theory. 2019. [38] Sébastien Bubeck and Mark Sellke. “First-order regret analysis of thompson sampling”. In: International Conference on Algorithmic Learning Theory. 2020. [39] Nicolo Cesa-Bianchi and Gábor Lugosi. “Combinatorial bandits”. In: Journal of Computer and System Sciences 78.5 (2012), pp. 1404–1422. 146 [40] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006. [41] Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. “Mirror Descent Meets Fixed Share (and feels no regret)”. In: Advances in Neural Information Processing Systems 25 (NIPS). 2012, pp. 989–997. [42] Liyu Chen, Haipeng Luo, and Chen-Yu Wei. “Impossible Tuning Made Possible: A New Expert Algorithm and Its Applications”. In: Proceedings of the 34th Conference on Learning Theory (COLT). 2021, pp. 1216–1259. [43] Yifang Chen, Chung-Wei Lee, Haipeng Luo, and Chen-Yu Wei. “A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free”. In: Proceedings of the 32nd Conference on Learning Theory (COLT). 2019, pp. 696–726. [44] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. “Learning to Optimize under Non-Stationarity”. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS). 2019, pp. 1079–1087. [45] Alon Cohen, Tamir Hazan, and Tomer Koren. “Online learning with feedback graphs without the graphs”. In: International Conference on Machine Learning. PMLR. 2016, pp. 811–819. [46] Ashok Cutkosky. “Parameter-free, Dynamic, and Strongly-Adaptive Online Learning”. In: Proceedings of the 37th International Conference on Machine Learning (ICML). 2020, pp. 2250–2259. [47] Ashok Cutkosky and Kwabena Boahen. “Online Learning Without Prior Information”. In: Proceedings of the 30th Conference on Learning Theory (COLT). 2017, pp. 643–677. [48] Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, and Manish Purohit. “Dynamic Balancing for Model Selection in Bandits and RL”. In: Proceedings of the 38th International Conference on Machine Learning (ICML). 2021, pp. 2276–2285. [49] Ashok Cutkosky and Francesco Orabona. “Black-Box Reductions for Parameter-free Online Learning in Banach Spaces”. In: Proceedings of the 31st Conference on Learning Theory (COLT). 2018, pp. 1493–1529. [50] Varsha Dani, Sham M Kakade, and Thomas P Hayes. “The price of bandit information for online optimization”. In: Advances in Neural Information Processing Systems. 2008. [51] Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. “Strongly Adaptive Online Learning”. In: Proceedings of the 32nd International Conference on Machine Learning (ICML). 2015, pp. 1405–1411. [52] Ofer Dekel, Ronen Eldan, and Tomer Koren. “Bandit Smooth Convex Optimization: Improving the Bias-Variance Tradeoff”. In: Advances in Neural Information Processing Systems 28 (NIPS). 2015, pp. 2926–2934. [53] Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. “Efficient optimal learning for contextual bandits”. In: Conference on Uncertainty in Artificial Intelligence. 2011, pp. 169–178. 
147 [54] Emmanuel Esposito, Federico Fusco, Dirk van der Hoeven, and Nicolò Cesa-Bianchi. “Learning on the Edge: Online Learning with Stochastic Feedback Graphs”. In: Advances in Neural Information Processing Systems. 2022. [55] Zhili Feng and Po-Ling Loh. “Online learning with graph-structured feedback against adaptive adversaries”. In: 2018 IEEE International Symposium on Information Theory (ISIT). 2018. [56] Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. “Online convex optimization in the bandit setting: gradient descent without a gradient”. In: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 2005, pp. 385–394. [57] Dylan Foster, Alekh Agarwal, Miroslav Dudik, Haipeng Luo, and Robert Schapire. “Practical contextual bandits with regression oracles”. In: International Conference on Machine Learning. PMLR. 2018, pp. 1539–1548. [58] Dylan Foster and Alexander Rakhlin. “Beyond ucb: Optimal and efficient contextual bandits with regression oracles”. In: International Conference on Machine Learning. PMLR. 2020, pp. 3199–3210. [59] Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. “Adapting to misspecification in contextual bandits”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 11478–11489. [60] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. “The statistical complexity of interactive decision making”. In: arXiv preprint arXiv:2112.13487. 2021. [61] Dylan J Foster and Akshay Krishnamurthy. “Efficient first-order contextual bandits: Prediction, allocation, and triangular discrimination”. In: Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 18907–18919. [62] Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. “Model selection for contextual bandits”. In: Advances in Neural Information Processing Systems (NeurIPS) 32 (2019). [63] Dylan J Foster, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. “Learning in Games: Robustness of Fast Convergence”. In: Advances in Neural Information Processing Systems 29. 2016. [64] Dylan J. Foster, Akshay Krishnamurthy, and Haipeng Luo. “Open Problem: Model Selection for Contextual Bandits”. In: Proceedings of the 33rd Conference on Learning Theory (COLT). 2020, pp. 3842–3846. [65] Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. “Adaptive Online Learning”. In: Advances in Neural Information Processing Systems 28 (NIPS). 2015, pp. 3375–3383. [66] David A Freedman. “On tail probabilities for martingales”. In: the Annals of Probability (1975), pp. 100–118. [67] Yoav Freund and Robert E Schapire. “A decision-theoretic generalization of on-line learning and an application to boosting”. In: Journal of computer and system sciences (1997). 148 [68] András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. “The on-line shortest path problem under partial monitoring”. In: Journal of Machine Learning Research 8.Oct (2007), pp. 2369–2403. [69] András György and Csaba Szepesvári. “Shifting Regret, Mirror Descent, and Matrices”. In: Proceedings of the 33nd International Conference on Machine Learning (ICML). 2016, pp. 2943–2951. [70] Elad Hazan and Satyen Kale. “Better algorithms for benign bandits”. In: Journal of Machine Learning Research 12.Apr (2011), pp. 1287–1311. [71] Elad Hazan and Zohar Karnin. “Volumetric spanners: an efficient exploration basis for learning”. In: The Journal of Machine Learning Research 17.1 (2016), pp. 4062–4095. [72] Elad Hazan and Kfir Y. Levy. 
“Bandit Convex Optimization: Towards Tight Bounds”. In: Advances in Neural Information Processing Systems 27 (NIPS). 2014, pp. 784–792. [73] Elad Hazan and Yuanzhi Li. “An optimal algorithm for bandit convex optimization”. In: arXiv preprint arXiv:1603.04350 (2016). [74] Elad Hazan and Comandur Seshadhri. “Adaptive algorithms for online decision problems”. In: Electronic colloquium on computational complexity (ECCC) 14.088 (2007). [75] David P Helmbold, Nicholas Littlestone, and Philip M Long. “Apple tasting”. In: vol. 161. 2. Elsevier, 2000, pp. 85–139. [76] Mark Herbster and Manfred K. Warmuth. “Tracking the Best Expert”. In: Machine Learning 32.2 (1998), pp. 151–178. [77] Mark Herbster and Manfred K. Warmuth. “Tracking the Best Linear Predictor”. In: Journal of Machine Learning Research 1 (2001), pp. 281–309. [78] Dirk van der Hoeven, Ashok Cutkosky, and Haipeng Luo. “Comparator-Adaptive Convex Bandits”. In: Advances in Neural Information Processing Systems 33 (NeurIPS). 2020. [79] Dirk van der Hoeven, Federico Fusco, and Nicolò Cesa-Bianchi. “Beyond bandit feedback in online multiclass classification”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 13280–13291. [80] Xiaowei Hu, Prashanth L. A., András György, and Csaba Szepesvári. “(Bandit) Convex Optimization with Biased Noisy Gradient Oracles”. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS). 2016, pp. 819–828. [81] Woonghee Tim Huh and Paat Rusmevichientong. “A nonparametric asymptotic analysis of inventory planning with censored demand”. In: Mathematics of Operations Research 34.1 (2009), pp. 103–123. [82] Shinji Ito. “An Optimal Algorithm for Bandit Convex Optimization with Strongly-Convex and Smooth Loss”. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS). 2020, pp. 2229–2239. 149 [83] Shinji Ito, Taira Tsuchiya, and Junya Honda. “Nearly Optimal Best-of-Both-Worlds Algorithms for Online Learning with Feedback Graphs”. In: Advances in Neural Information Processing Systems (2022). [84] Krishnamurthy Iyer, Ramesh Johari, and Mukund Sundararajan. “Mean field equilibria of dynamic auctions with learning”. In: Management Science 60.12 (2014), pp. 2949–2970. [85] Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. “Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition”. In: International Conference on Machine Learning. 2020. [86] Kwang-Sung Jun, Francesco Orabona, Stephen Wright, and Rebecca Willett. “Improved Strongly Adaptive Online Learning using Coin Betting”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS). 2017, pp. 943–951. [87] Richard M Karp. “Reducibility among combinatorial problems”. In: Complexity of computer computations. Springer, 1972, pp. 85–103. [88] Richard M Karp. “Reducibility among combinatorial problems”. In: (1972), pp. 85–103. [89] Thomas Kerdreux, Christophe Roux, Alexandre d’Aspremont, and Sebastian Pokutta. “Linear Bandits on Uniformly Convex Sets”. In: Journal of Machine Learning Research 22.284 (2021), pp. 1–23. [90] Robert D. Kleinberg. “Nearly Tight Bounds for the Continuum-Armed Bandit Problem”. In: Advances in Neural Information Processing Systems 17 (NIPS). 2004, pp. 697–704. [91] Tomás Kocák, Gergely Neu, Michal Valko, and Rémi Munos. “Efficient learning by implicit exploration in bandit problems with side observations”. 
In: Advances in Neural Information Processing Systems 27 (2014). [92] Tomáš Kocák, Gergely Neu, and Michal Valko. “Online Learning with Noisy Side Observations”. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. 2016. [93] Sanath Kumar Krishnamurthy, Vitor Hadad, and Susan Athey. “Adapting to misspecification in contextual bandits with offline regression oracles”. In: Proceedings of the 38th International Conference on Machine Learning (ICML). 2021, pp. 5805–5814. [94] John Langford, Lihong Li, and Alex Strehl. Vowpal wabbit online learning project. 2007. [95] John Langford and Tong Zhang. “The epoch-greedy algorithm for contextual multi-armed bandits”. In: Advances in Neural Information Processing Systems 20.1 (2007), pp. 96–1. [96] Tor Lattimore. “Improved regret for zeroth-order adversarial bandit convex optimisation”. In: Mathematical Statistics and Learning 2.3 (2020), pp. 311–334. [97] Tor Lattimore and Csaba Szepesvari. “The end of optimism? an asymptotic analysis of finite-armed linear bandits”. In: Artificial Intelligence and Statistics. PMLR. 2017, pp. 728–737. 150 [98] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. [99] Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. “Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 15522–15533. [100] Chung-Wei Lee, Haipeng Luo, and Mengxiao Zhang. “A closer look at small-loss bounds for bandits with graph feedback”. In: Conference on Learning Theory. PMLR. 2020, pp. 2516–2564. [101] David D Lewis, Yiming Yang, Tony Russell-Rose, and Fan Li. “Rcv1: A new benchmark collection for text categorization research”. In: Journal of machine learning research 5.Apr (2004), pp. 361–397. [102] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. “A contextual-bandit approach to personalized news article recommendation”. In: Proceedings of the 19th international conference on World wide web. 2010, pp. 661–670. [103] Shuai Li, Wei Chen, Zheng Wen, and Kwong-Sak Leung. “Stochastic online learning with probabilistic graph feedback”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 2020, pp. 4675–4682. [104] Fang Liu, Swapna Buccapatnam, and Ness Shroff. “Information directed sampling for stochastic bandits with graph feedback”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 2018. [105] Yi Liu and Lihong Li. “A map of bandits for e-commerce”. In: arXiv preprint arXiv:2107.00680 (2021). [106] László Lovász and Santosh Vempala. “The geometry of logconcave functions and sampling algorithms”. In: Random Structures & Algorithms 30.3 (2007), pp. 307–358. [107] Haipeng Luo. “Lecture Note 18, Introduction to Online Learning”. In: 2017. url: https://haipeng-luo.net/courses/CSCI699/lecture18.pdf. [108] Haipeng Luo and Robert E Schapire. “Achieving All with No Parameters: AdaNormalHedge”. In: Proceedings of the 28th Annual Conference Computational Learning Theory (COLT). 2015, pp. 1286–1304. [109] Haipeng Luo, Hanghang Tong, Mengxiao Zhang, and Yuheng Zhang. “Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs”. In: International Conference on Algorithmic Learning Theory. PMLR. 2023, pp. 1074–1100. [110] Haipeng Luo, Chen-Yu Wei, and Kai Zheng. “Efficient online portfolio with logarithmic regret”. In: Advances in Neural Information Processing Systems. 2018. 
[111] Haipeng Luo, Mengxiao Zhang, and Peng Zhao. “Adaptive bandit convex optimization with heterogeneous curvature”. In: Conference on Learning Theory. PMLR. 2022, pp. 1576–1612. 151 [112] Haipeng Luo, Mengxiao Zhang, Peng Zhao, and Zhi-Hua Zhou. “Corralling a larger band of bandits: A case study on switching regret for linear bandits”. In: Conference on Learning Theory. PMLR. 2022, pp. 3635–3684. [113] Thodoris Lykouris, Karthik Sridharan, and Éva Tardos. “Small-loss bounds for online learning with partial information”. In: Conference on Learning Theory. PMLR. 2018, pp. 979–986. [114] Thodoris Lykouris, Karthik Sridharan, and Éva Tardos. “Small-loss bounds for online learning with partial information”. In: Proceedings of the 31st Conference On Learning Theory. 2018. [115] Shie Mannor and Ohad Shamir. “From Bandits to Experts: On the Value of Side-Observations”. In: Advances in Neural Information Processing Systems 24. 2011. [116] Teodor Vanislavov Marinov and Julian Zimmert. “The Pareto Frontier of model selection for general Contextual Bandits”. In: Advances in Neural Information Processing Systems 34 (NeurIPS). 2021. [117] Brendan Mcmahan and Matthew Streeter. “No-regret algorithms for unconstrained online convex optimization”. In: Advances in Neural Information Processing Systems 25 (NIPS). 2012. [118] H. Brendan McMahan and Francesco Orabona. “Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations”. In: Proceedings of The 27th Conference on Learning Theory (COLT). 2014, pp. 1020–1039. [119] Mehryar Mohri and Andrés Munoz Medina. “Learning algorithms for second-price auctions with reserve”. In: Journal of Machine Learning Research. Vol. 17. JMLR. org, 2016, pp. 2632–2656. [120] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. Vol. 13. Siam, 1994. [121] Gergely Neu. “Explore no more: Improved high-probability regret bounds for non-stochastic bandits”. In: Advances in Neural Information Processing Systems 28 (2015). [122] Gergely Neu. “First-order regret bounds for combinatorial semi-bandits”. In: Proceedings of The 28th Conference on Learning Theory. 2015. [123] Francesco Orabona. “Dimension-Free Exponentiated Gradient”. In: Advances in Neural Information Processing Systems 26 (NIPS). 2013, pp. 1806–1814. [124] Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvári. “Model Selection in Contextual Stochastic Bandit Problems”. In: Advances in Neural Information Processing Systems 33 (NeurIPS). 2020, pp. 10328–10337. [125] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. “Pytorch: An imperative style, high-performance deep learning library”. In: Advances in neural information processing systems 32 (2019). 152 [126] Alexander Rakhlin and Karthik Sridharan. “Online Learning with Predictable Sequences”. In: Proceedings of the 26th Annual Conference on Learning Theory. 2013. [127] Anshuka Rangi and Massimo Franceschetti. “Online learning with feedback graphs and switching costs”. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. 2019. [128] Anshuka Rangi and Massimo Franceschetti. “Online learning with feedback graphs and switching costs”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2019, pp. 2435–2444. [129] Aviv Rosenberg and Yishay Mansour. 
“Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function”. In: Advances in Neural Information Processing Systems. 2019. [130] Mark Rucker, Joran T Ash, John Langford, Paul Mineiro, and Ida Momennejad. “Eigen Memory Tree”. In: arXiv preprint arXiv:2210.14077 (2022). [131] Aadirupa Saha, Nagarajan Natarajan, Praneeth Netrapalli, and Prateek Jain. “Optimal regret algorithm for Pseudo-1d Bandit Convex Optimization”. In: Proceedings of the 38th International Conference on Machine Learning (ICML). 2021, pp. 9255–9264. [132] Ankan Saha and Ambuj Tewari. “Improved regret guarantees for online smooth convex optimization with bandit feedback”. In: International Conference on Artificial Intelligence and Statistics. 2011. [133] David Simchi-Levi and Yunzong Xu. “Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability”. In: Mathematics of Operations Research. Vol. 47. INFORMS, 2022, pp. 1904–1931. [134] Rahul Singh, Fang Liu, Xin Liu, and Ness Shroff. “Contextual bandits with side-observations”. In: arXiv preprint arXiv:2006.03951 (2020). [135] Aleksandrs Slivkins and Dylan Foster. “Efficient Contextual Bandits with Knapsacks via Regression”. In: arXiv preprint arXiv:2211.07484 (2022). [136] Jacob Steinhardt and Percy Liang. “Adaptivity and optimism: An improved exponentiated gradient algorithm”. In: International Conference on Machine Learning. 2014, pp. 1593–1601. [137] Arun Sai Suggala, Pradeep Ravikumar, and Praneeth Netrapalli. “Efficient Bandit Convex Optimization: Beyond Linear Losses”. In: Proceedings of the 34th Conference on Learning Theory (COLT). 2021, pp. 4008–4067. [138] Sofía S Villar, Jack Bowden, and James Wason. “Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges”. In: Statistical science: a review journal of the Institute of Mathematical Statistics 30.2 (2015), p. 199. 153 [139] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learning”. In: Nature 575.7782 (2019), pp. 350–354. [140] Vladimir G Vovk. “A game of prediction with expert advice”. In: Proceedings of the eighth annual conference on Computational learning theory. 1995, pp. 51–60. [141] Lingda Wang, Bingcong Li, Huozhi Zhou, Georgios B Giannakis, Lav R Varshney, and Zhizhen Zhao. “Adversarial linear contextual bandits with graph-structured side observations”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 2021, pp. 10156–10164. [142] Chen-Yu Wei, Christoph Dann, and Julian Zimmert. “A Model Selection Approach for Corruption Robust Reinforcement Learning”. In: Proceedings of the 33rd International Algorithmic Learning Theory (ALT). 2022, pp. 1043–1096. [143] Chen-Yu Wei and Haipeng Luo. “More Adaptive Algorithms for Adversarial Bandits”. In: Proceedings of the 31st Conference On Learning Theory. 2018. [144] Chen-Yu Wei and Haipeng Luo. “Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach”. In: Proceedings of the 34th Conference on Learning Theory (COLT). 2021, pp. 4300–4354. [145] Scott Yang and Mehryar Mohri. “Optimistic Bandit Convex Optimization”. In: Advances in Neural Information Processing Systems 29 (NIPS). 2016, pp. 2289–2297. [146] Lijun Zhang, Tie-Yan Liu, and Zhi-Hua Zhou. “Adaptive Regret of Convex and Smooth Functions”. 
In: Proceedings of the 36th International Conference on Machine Learning (ICML). 2019, pp. 7414–7423. [147] Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, and Paul Mineiro. “Efficient Contextual Bandits with Uninformed Feedback Graphs”. In: arXiv preprint arXiv:2402.08127 (2024). [148] Mengxiao Zhang, Yuheng Zhang, Olga Vrousgou, Haipeng Luo, and Paul Mineiro. “Practical Contextual Bandits with Feedback Graphs”. In: Conference on Neural Information Processing Systems. 2023. [149] Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. “Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization”. In: Journal of Machine Learning Research 25 (2024). [150] Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. “Dynamic Regret of Convex and Smooth Functions”. In: Advances in Neural Information Processing Systems 33 (NeurIPS). 2020, pp. 12510–12520. [151] Kai Zheng, Haipeng Luo, Ilias Diakonikolas, and Liwei Wang. “Equipping Experts/Bandits with Long-term Memory”. In: Advances in Neural Information Processing Systems 32. 2019. 154 [152] Yinglun Zhu and Paul Mineiro. “Contextual bandits with smooth regret: Efficient learning in continuous action spaces”. In: International Conference on Machine Learning. PMLR. 2022, pp. 27574–27590. [153] Alexander Zimin and Gergely Neu. “Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search”. In: Advances in Neural Information Processing Systems. 2013. [154] Julian Zimmert and Tor Lattimore. “Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits”. In: Conference on Learning Theory. PMLR. 2022, pp. 3285–3312. 155 Appendix A Omitted Details in Section 2.1 A.1 Proofs for Section 2.1.2.1 In this section, we prove Lemma 2.1.1 and Theorem 2.1.1. To prove Lemma 2.1.1, we first show the following auxiliary lemma, which states that the OMD update enjoys multiplicative stability under certain conditions. Lemma A.1.1. Let pt+1 = argminp∈Ω D p, ℓbt E + Dψ(p, pt) for Ω ⊆ ∆(K) and ψ : Ω → R such that ∇2ψ(p) ⪰ diag n C1 p 2 i , . . . , C1 p 2 K o for some C1 ≥ 9 and ∇−2ψ(p) ⪯ 4∇−2ψ(q) as long as p ⪯ 2q. If there exists z ∈ R such that ∥ℓbt − z · 1∥∇−2ψ(pt) ≤ 1 8 , then we have 1 2 pt ⪯ pt+1 ⪯ 2pt . Proof. The proof follows similar ideas of recent work such as [143] or [37]. Let Ft(p) ≜ D p, ℓbt − z · 1 E + Dψ(p, pt), for z satisfying the condition ∥ℓbt − z · 1∥∇−2ψ(pt) ≤ 1 8 . As we only shift each entry of the loss estimator by a constant, according to the algorithm, we have pt+1 = argminp∈Ω Ft(p). We first show that Ft(p ′ ) ≥ Ft(pt) for any p ′ ∈ Ω such that ∥p ′ − pt∥∇2ψ(pt) = 1. We start by applying Taylor expansion: Ft(p ′ ) = Ft(pt) + ∇Ft(pt) ⊤(p ′ − pt) + 1 2 (p ′ − pt) ⊤∇2Ft(ξ)(p ′ − pt) = Ft(pt) + ℓbt − z · 1 ⊤ (p ′ − pt) + 1 2 ∥p ′ − pt∥ 2 ∇2ψ(ξ) ≥ Ft(pt) − ∥ℓbt − z · 1∥∇−2ψ(pt)∥p ′ − pt∥∇2ψ(pt) + 1 2 ∥p ′ − pt∥ 2 ∇2ψ(ξ) 156 = Ft(pt) − ∥ℓbt − z · 1∥∇−2ψ(pt) + 1 2 ∥p ′ − pt∥ 2 ∇2ψ(ξ) , where the inequality is by Hölder’s inequality and ξ is some point on the line segment between pt and p ′ . By the condition ∇2ψ(p) ⪰ diag n 9 p 2 i , . . . , 9 p 2 K o , we have 1 = ∥p ′ − pt∥ 2 ∇2ψ(pt) ≥ 9 P i∈[K] (p ′ i−pt,i) 2 p 2 t,i , which implies |p ′ i−pt,i| pt,i ≤ 1 3 for all i ∈ [K]. Therefore, we have ξ ⪯ 4 3 pt ⪯ 2pt , which leads to ∇2ψ(ξ) ⪰ 1 4∇2ψ(pt) according to the assumption. Plugging it into the previous inequality, we have Ft(p ′ ) − Ft(pt) ≥ −∥ℓbt − z · 1∥∇−2ψ(pt) + 1 2 ∥p ′ − pt∥ 2 ∇2ψ(ξ) ≥ − 1 8 + 1 8 = 0. 
Therefore, according to the optimality of pt+1 and the convexity of Ft , we have ∥pt+1 − pt∥∇2ψ(pt) ≤ 1. Following the previous analysis, we further have: 1 ≥ ∥pt+1 − pt∥ 2 ∇2ψ(pt) ≥ 9 X i∈[K] (pt+1,i − pt,i) 2 p 2 t,i ≥ 9 (pt+1,j − pt,j ) 2 p 2 t,j , ∀j ∈ [K]. So we conclude pt+1,i ∈ 2 3 pt,i, 4 3 pt,i ⊆ [ 1 2 pt,i, 2pt,i] for all i ∈ [K], finishing the proof. The next lemma further shows that the condition ∃z, ∥ℓbt − z · 1∥∇−2ψ(pt) ≤ 1 8 is easily satisfied as long as 0 ≤ ℓbt,i ≤ max n 1 pt,i , 1 1−pt,io for all i. Lemma A.1.2. If 0 ≤ ℓbt,i ≤ max n 1 pt,i , 1 1−pt,io for all i ∈ [K], under the same conditions of Lemma A.1.1 with C1 = 64K, there exists z ∈ R such that ∥ℓbt − z · 1∥∇−2ψ(pt) ≤ 1 8 . Proof. If pt,i ≤ 1 2 for all i ∈ [K], then we have pt,iℓbt,i ≤ max n 1, pt,i 1−pt,io ≤ 1 for all i ∈ [K]. In this case, z = 0 satisfies: ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) ≤ X i∈[K] p 2 t,iℓb2 t,i C1 ≤ K C1 = 1 64 . 157 On the other hand, if there is one node it,0 such that pt,it,0 > 1 2 , then pt,i ≤ 1 2 and pt,iℓbt,i ≤ 1 must be true for all i ̸= it,0. In this case picking z = ℓbt,it,0 gives the following bound on ∥ℓbt − z · 1∥∇−2ψ(pt) : ℓbt − ℓbt,it,0 1 2 ∇−2ψ(pt) ≤ 1 C1 X i̸=it,0 p 2 t,i(ℓbt,i − ℓbt,it,0 ) 2 ≤ 1 C1 X i̸=it,0 p 2 t,iℓb2 t,i + p 2 t,iℓb2 t,it,0 ≤ (K − 1) C1 + (1 − pt,it,0 ) 2 ℓb2 t,it,0 C1 ( P i̸=it,0 p 2 t,i ≤ (1 − pt,it,0 ) 2 ) ≤ 1 64 . (0 ≤ ℓbt,it,0 ≤ 1 1−pt,it,0 ) Combining the two cases finishes the proof. Now we are ready to prove Lemma 2.1.1. Proof. of Lemma 2.1.1. For any time step t and any z ∈ R, we first follow standard Online Mirror Descent analysis and show D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + 2∥ℓbt − z · 1∥ 2 ∇−2ψ(ξ) . (A.1) for some ξ on the line segment of pt and pt+1. Define Ft(p) ≜ D p, ℓbt − z · 1 E + Dψ (p, pt). As we only shift each entry of the loss estimator by a constant, according to the algorithm, we have pt+1 = argminp∈Ω Ft(p) and thus by Taylor expansion, it holds for some ξ on the line segment of pt and pt+1 that Ft(pt) − Ft(pt+1) = ∇Ft(pt+1) ⊤ (pt − pt+1) + 1 2 (pt − pt+1) ⊤ ∇2Ft(ξ) (pt − pt+1) ≥ 1 2 ∥pt − pt+1∥ 2 ∇2ψ(ξ) . (A.2) 158 On the other hand, by the non-negativity of Bregman divergence and Hölder’s inequality, we have Ft(pt) − Ft(pt+1) = D pt − pt+1, ℓbt − z · 1 E − Dψ(pt+1, pt) ≤ D pt − pt+1, ℓbt − z · 1 E ≤ ∥pt − pt+1∥∇2ψ(ξ) · ∥ℓbt − z · 1∥∇−2ψ(ξ) . (A.3) Combining Eq. (A.2) and Eq. (A.3), we have ∥pt − pt+1∥∇2ψ(ξ) ≤ 2∥ℓbt − z · 1∥∇−2ψ(ξ) . Furthermore, standard analysis of Online Mirror Descent (see e.g. Lemma 6 of [143]) shows D pt − u, ℓbt E = D pt − u, ℓbt − z · 1 E ≤ Dψ(u, pt) − Dψ(u, pt+1) + D pt − pt+1, ℓbt − z · 1 E , which proves Eq. (A.1) after applying Hölder’s inequality again and using the previous conclusion ∥pt − pt+1∥∇2ψ(ξ) ≤ 2∥ℓbt − z · 1∥∇−2ψ(ξ) . Finally, according to Lemma A.1.2, we know that the conditions of Lemma A.1.1 hold, and thus multiplicative stability 1 2 pt ⪯ pt+1 ⪯ 2pt holds, implying ξ ⪯ 2pt . By Condition (b) of the lemma statement, we have ∇−2ψ(ξ) ⪯ 4∇−2ψ(pt), which shows ∥ℓbt − z · 1∥ 2 ∇−2ψ(ξ) ≤ 4∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) and completes the proof as z is arbitrary. Finally, we prove Theorem 2.1.1. Proof. of Theorem 2.1.1. According to the choice of c and the construction of loss estimators, the conditions of Lemma 2.1.1 hold and we have D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + 8 min z∈R ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) . 159 To bound the local-norm term minz∈R ∥ℓbt −z ·1∥ 2 ∇−2ψ(pt) , one could follow the analysis of [11]. 
However, to be consistent with other proofs in this work, we provide a different analysis based on a novel loss shift (that is critical for all other proofs). Specifically, we consider two cases. First, if pt,i < 1 2 holds for all i ∈ S¯, then we relax the local-norm term by taking z = 0: min z∈R ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) ≤ ∥ℓbt∥ 2 ∇−2ψ(pt) = X i∈[K] 1 c/p 2 t,i + 1/ηpt,i ℓb2 t,i ≤ X i∈[K] ηpt,iℓb2 t,i ≤ X i∈S ηpt,iℓb2 t,i + 2X i∈S¯ ηpt,iℓbt,i, where the last step is because ℓbt,i ≤ 1 1−pt,i ≤ 2 for i ∈ S¯. On the other hand, if there exists it,0 ∈ S¯ such that pt,it,0 ≥ 1 2 , then we take z = ℓbt,it,0 and arrive at: min z∈R ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) ≤ ∥ℓbt − ℓbt,it,0 · 1∥ 2 ∇−2ψ(pt) ≤ X i̸=it,0 ηpt,i ℓbt,i − ℓbt,it,0 2 ≤ X i̸=it,0 ηpt,i ℓb2 t,i + ℓb2 t,it,0 = X i∈S ηpt,iℓb2 t,i + X i∈S,i ¯ ̸=it,0 ηpt,iℓb2 t,i + X i∈[K],i̸=it,0 ηpt,iℓb2 t,it,0 ≤ X i∈S ηpt,iℓb2 t,i + 2 X i∈S,i ¯ ̸=it,0 ηpt,iℓbt,i + X i∈[K],i̸=it,0 ηpt,iℓb2 t,it,0 = X i∈S ηpt,iℓb2 t,i + 2 X i∈S,i ¯ ̸=it,0 ηpt,iℓbt,i + η 1 − pt,it,0 ℓb2 t,it,0 ≤ X i∈S ηpt,iℓb2 t,i + 2X i∈S¯ ηpt,iℓbt,i, 16 where the second to last inequality is because ℓbt,i ≤ 1 1−pt,i ≤ 2 for i ∈ S¯ \ {it,0} and the final inequality is because (1 − pt,it,0 )ℓbt,it,0 ≤ 1−pt,it,0 1−pt,it.0 = 1 ≤ 2pt,it,0 . Therefore, combining the two cases we have shown: D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + 8η X i∈S pt,iℓb2 t,i + 16η X i∈S¯ pt,iℓbt,i. Summing over t and telescoping, we further have: X T t=1 D pt − u, ℓbt E ≤ Dψ(u, p1) + 8η X T t=1 X i∈S pt,iℓb2 t,i + 16η X T t=1 X i∈S¯ pt,iℓbt,i. ≤ Dψ(u, p1) + 8η X T t=1 X i∈S pt,i Wt,i ℓbt,i + 16η X T t=1 X i∈S¯ pt,iℓbt,i. We choose u = 1 − K T ei ⋆ + 1 T · 1. By the optimality of p1, we bound the Bregman divergence term as: Dψ(u, p1) ≤ ψ(u) − ψ(p1) ≤ 1 η X i∈[K] p1,i ln 1 p1,i + c X i∈[K] ln 1 ui ≤ ln K η + cK ln T. Comparing u and ei ⋆ , we bound PT t=1 D pt − ei ⋆ , ℓbt E by ln K η + cK ln T + 8η X T t=1 X i∈S pt,i Wt,i ℓbt,i + 16η X T t=1 X i∈S¯ pt,iℓbt,i + 1 T X T t=1 X i∈[K] ℓbt,i. Taking expectation over both sides, we arrive at: Reg ≤ ln K η + cK ln T + E 8η X T t=1 X i∈S pt,i Wt,i ℓt,i + 16η X T t=1 X i∈S¯ pt,iℓt,i + 1 T X T t=1 X i∈[K] ℓt,i ≤ ln K η + cK ln T + 32ηαT ln 4KT α + 16ηT + K = Oe 1 η + ηαT + K2 , 16 where the last inequality uses the fact ℓt,i ≤ 1 and also a graph-theoretic lemma (Lemma 5 in [11]) which asserts P i∈S pt,i Wt,i ≤ 4α ln 4KT α . This finishes the proof. A.2 Proofs for Section 2.1.2.2 We prove Theorem 2.1.2 in this section. Proof. of Theorem 2.1.2. Similar to the proof of Theorem 2.1.1, the conditions of Lemma 2.1.1 hold and we have D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + 8 min z∈R ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) . Once again, we bound the local-norm term by considering two cases separately. (i). If pt,i < 1 2 holds for all i ∈ S¯, then choosing z = 0 we have: min z∈R ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) ≤ ∥ℓbt∥ 2 ∇−2ψ(pt) = X i∈S ηp2 t,iℓb2 t,i + X i∈S¯ 1 c/p 2 t,i + 1/ηpt,i ℓb2 t,i ≤ X i∈S ηp2 t,iℓb2 t,i + X i∈S¯ ηpt,iℓb2 t,i ≤ X i∈S ηpt,iℓbt,i + 2X i∈S¯ ηpt,iℓbt,i ≤ 2η D pt , ℓbt E . The third inequality is because pt,iℓbt,i ≤ 1 for i ∈ S and ℓbt,i ≤ 1 1−pt,i ≤ 2 for i ∈ S¯. (ii). 
If there exists ∃it,0 ∈ S¯ such that pt,it,0 ≥ 1 2 , the choosing z = ℓbt,it,0 we have: min z∈R ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) ≤ ∥ℓbt − ℓbt,it,0 · 1∥ 2 ∇−2ψ(pt) 16 = X i∈S ηp2 t,i ℓbt,i − ℓbt,it,0 2 + X i∈S,i ¯ ̸=it,0 1 c/p 2 t,i + 1/ηpt,i ℓbt,i − ℓbt,it,0 2 ≤ X i∈S ηp2 t,i ℓb2 t,i + ℓb2 t,it,0 + X i∈S,i ¯ ̸=it,0 ηpt,i ℓb2 t,i + ℓb2 t,it,0 = X i∈S ηp2 t,iℓb2 t,i + X i∈S,i ¯ ̸=it,0 ηpt,iℓb2 t,i + X i̸=it,0 ηpt,iℓb2 t,it,0 = X i∈S ηp2 t,iℓb2 t,i + X i∈S,i ¯ ̸=it,0 ηpt,iℓb2 t,i + η 1 − pt,it,0 ℓb2 t,it,0 ≤ X i∈S ηpt,iℓbt,i + 2 X i∈S,i ¯ ̸=it,0 ηpt,iℓbt,i + 2ηpt,it,0 ℓbt,it,0 ≤ 2η D pt , ℓbt E . The second to last inequality is because pt,iℓbt,i ≤ 1 for i ∈ S, ℓbt,i ≤ 1 1−pt,i ≤ 2 for i ∈ S¯\{it,0}, and (1 − pt,it,0 )ℓbt,it,0 ≤ 1 ≤ 2pt,it,0 . Combining the two cases, we have D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + 16η D pt , ℓbt E , and summing over t and telescoping, we further have X T t=1 D pt − u, ℓbt E ≤ Dψ(u, p1) + 16η X T t=1 D pt , ℓbt E . We choose u = 1 − K T ei ⋆ + 1 T · 1 and calculate the Bregman divergence term as Dψ(u, p1) ≤ ψ(u) − ψ(p1) (by optimality of p1) ≤ 1 η X i∈S ln 1 ui + 1 η X i∈S¯ p1,i ln 1 p1,i + c X i∈S¯ ln 1 ui ≤ s ln T η + ln K η + cK ln T. 1 Comparing the difference between u and ei ⋆ and rearranging, we arrive at: X T t=1 D pt − ei ⋆ , ℓbt E ≤ 1 1 − 16η s ln T + ln K η + cK ln T + 1 T X T t=1 X i∈[K] ℓbt,i + 16η X T t=1 ℓbt,i⋆ . Taking expectation on both sides shows Reg ≤ 1 1 − 16η s ln T + ln K η + cK ln T + K + 16ηL⋆ = O s ln T + ln K η + ηL⋆ + K2 ln T , finishing the proof. A.3 Omitted details for Section 2.1.2.3 In this section, we provide omitted details for Section 2.1.2.3, including the adaptive Hedge subroutine used in Algorithm 1 and its regret bound (Appendix A.3.1), the proof of Theorem 2.1.3 (Appendix A.3.2), and an adaptive version of Algorithm 1 and its analysis (Appendix A.3.3). A.3.1 Hedge with Adaptive Learning Rates We first provide details of the Hedge variant used in Algorithm 1. Algorithm 15 shows the complete pseudocode. Note that as described in Section 2.1.2.3, each Hedge instance only operates over a subset of arms, denoted by C as an input of the algorithm. At each time t, the algorithm proposes a distribution pet , and then receives a loss vector ℓet ∈ R K + . Vanilla Hedge is simply OMD with the entropy regularizer over the simplex. Our variant makes the following two modifications. First, the decision set Ω is restricted to a subset of simplex so that zero probability is assigned to arms outside C and at least 1 |C|T probability is assigned to each arm in C for 164 Algorithm 15 Hedge with Adaptive Learning Rates Input: The number of arms K, the set of active arms C ⊆ [K]. Define: Ω = n p ∈ ∆(K) : pi ≥ 1 |C|T , ∀i ∈ C, and pi = 0, ∀i /∈ C o Initialize: pe1 is the uniform distribution over C. for t = 1, 2, . . . , T do 1 Propose distribution pet . 2 Receive feedback ℓet ∈ R K + . 3 Compute pet+1 = argminp∈Ω nDp, ℓet E + Dψt (p, pet) o , where ψt(p) = 1 ηt X i∈[K] pi ln pi , with ηt = s 1 1 + Pt τ=1 PK i=1 peτ,iℓe2 τ,i . exploration purpose. Second, we apply an adaptive time-varying learning rate as specified in Line 3. This adaptive learning rate schedule ensures an adaptive regret bound (which is important for our analysis), as shown in the following lemma. Lemma A.3.1. Algorithm 15 ensures that for any i ∈ C, we have X T t=1 D pet − ei , ℓet E ≤ 25ρ ln2 (KT) + 10 ln(KT) vuutρ X T t=1 ℓet,i, (A.4) where ρ = max n 1, maxt∈[T],i∈C ℓet,io . Proof. Let qt+1,i = pet,i exp(−ηtℓet,i). 
One can verify pet+1 = argminp∈Ω Dψt (p, qt+1) and also for any u ∈ Ω, D pet − u, ℓet E = Dψt (u, pet) − Dψt (u, qt+1) + Dψt (pet , qt+1) ≤ Dψt (u, pet) − Dψt (u, pet+1) + Dψt (pet , qt+1), 165 where the second step uses the generalized Pythagorean theorem. On the other hand, using the fact that exp(−x) ≤ 1 − x + x 2 for any x ≥ 0, we also have Dψt (pet , qt+1) = 1 ηt X i∈[K] pet,i ln pet,i qt+1,i + qt+1,i − pet,i = 1 ηt X i∈[K] pet,i exp(−ηtℓet,i) − 1 + ηtℓet,i ≤ ηt X i∈[K] pet,iℓe2 t,i. Summing over t we have shown X T t=1 D pet − u, ℓet E ≤ X T t=1 (Dψt (u, pet) − Dψt (u, pet+1)) +X T t=1 ηt X K i=1 pet,iℓe2 t,i ≤ KL(u||pe1) + T X−1 t=1 1 ηt+1 − 1 ηt KL(u||pet+1) +X T t=1 ηt X K i=1 pet,iℓe2 t,i ≤ KL(u||pe1) + maxp∈Ω KL(u||p) ηT + X T t=1 ηt X K i=1 pet,iℓe2 t,i ≤ ln K + ln(KT) ηT + X T t=1 X K i=1 pet,iℓe2 t,i q 1 + Pt τ=1 PK i=1 peτ,iℓe2 τ,i ≤ ln K + ln(KT) ηT + Z PT t=1 PK i=1 pet,iℓe2 t,i 0 1 √ x + 1 dx ≤ ln K + ln(KT) ηT + 2 vuut1 +X T t=1 X K i=1 pet,iℓe2 t,i = ln K + (ln(KT) + 2) vuut1 +X T t=1 X K i=1 pet,iℓe2 t,i. Now choosing u = 1 − 1 T ei + 1 |C|T · 1C ∈ Ω where 1C is the vector with one for coordinates in C and zero otherwise, we have X T t=1 D pet − ei , ℓet E ≤ ln K + (ln(KT) + 2) vuut1 +X T t=1 X K i=1 pet,iℓe2 t,i + 1 |C|T X T t=1 X i∈C ℓet,i ≤ 4 ln(KT) + 3 ln(KT) vuutX T t=1 X K i=1 pet,iℓe2 t,i + 1 |C|T X T t=1 X i∈C ℓet,i 16 ≤ 4 ln(KT) + 3 ln(KT) vuutρ X T t=1 D pet , ℓet E + ρ. Let LeT ≜ PT t=1 D pet , ℓet E and LeT,i ≜ PT t=1 ℓet,i. By solving the quadratic inequality, we have q LeT ≤ 3 ln(KT) √ρ + q 9 ln2 (KT)ρ + 4 · (4 ln(KT) + ρ + LeT,i) 2 ≤ 5 ln(KT) √ ρ + q LeT,i. Finally, squaring both sides proves LeT − LeT,i ≤ 25 ln2 (KT)ρ + 10 ln(KT) q ρLeT,i. A.3.2 Proofs of Theorem 2.1.3 To prove Theorem 2.1.3, we combine the regret bounds of the meta-algorithm and the Hedge subroutine. For the former, we prove the following lemma, which combines the result of Theorem 2.1.2 and the effect of the increasing learning rate schedule proposed in [7], leading to an important negative regret term. Lemma A.3.2. Algorithm 1 with c = 64β and η ≤ 1 64β ensures that for any j ∈ [β], X T t=1 D pt − ej , ℓbt E ≤ O κ ln T + ln K η + β 2 ln T + 80η X T t=1 D pt , ℓbt E + 1 T X T t=1 X j∈[β] ℓbt,j − ρT,j 20η ln T 1 {j ∈ [κ]} . (A.5) Proof. We first show that according to our increasing learning rate schedule, the final learning rate is upper bounded by a constant times the original learning rate. Fix a node j ∈ [κ]. Let nj be such that ηT,j = σ nj η1,j with σ = e 1 ln T , where we assume nj ≥ 1 (the case nj = 0 is trivial as one will see). Let t1, ..., tnj be the rounds in which the learning rate update is executed for node j. Since 1 ptnj +1,j > ρtnj ,j > 2ρtnj−1,j > ... > 2 nj−1ρ1,j = 2njκ and 1 ptnj +1,1 ≤ T, we have nj ≤ log2 T. Therefore, we have ηT,j ≤ σ log2 T η1,j ≤ 5η1,j = 5η. 167 Next, according to our choice of c and η, the conditions of Lemma 2.1.1 hold and we have D pt − u, ℓbt E ≤ Dψt (u, pt) − Dψt (u, pt+1) + 8 min z ∥ℓbt − z · 1∥ 2 ∇−2ψt(pt) . We consider the Bregman divergence terms and choose u = 1 − K T ej + 1 T 1. If j ∈ [κ], then with h(y) = y − 1 − ln y we have X T t=1 Dψt (u, pt) − Dψt (u, pt+1) ≤ Dψ1 (u, p1) + T X−1 t=1 Dψt+1 (u, pt+1) − Dψt (u, pt+1) ≤ Dψ1 (u, p1) + 1 ηtnj +1,j − 1 ηtnj ,j ! h uj ptnj +1,j ! = Dψ1 (u, p1) + 1 − σ σ nj η h uj ptnj +1,j ! ≤ κ ln T + ln K η + cβ ln T − 1 5η ln T h uj ptnj +1,j ! 
, where we use the facts 1 − σ ≤ − 1 ln T and σ nj ≤ 5 as shown earlier, and also the exact same analysis of bounding Dψ1 (u, p1) as in the proof of Theorem 2.1.2. Note that uj ptnj +1,j ≥ 1 2ptnj +1,j ≥ 2 nj−1κ ≥ 1. Combining the facts that h(y) is increasing when y ≥ 1 and ρT,j = 2 ptnj +1,j ≤ 2T, we have: h uj ptnj +1,j ! ≥ h 1 2ptnj +1,j ! = ρT,j 4 − 1 − ln ρT,j 4 ≥ ρT,j 4 − 2 ln T. We have thus shown when j ∈ [κ], X T t=1 Dψt (u, pt) − Dψt (u, pt+1) ≤ O κ ln T + ln K η + cβ ln T − ρT,j 20η ln T 1 On the other hand, if j ∈ S¯, then we have by the monotonicity of learning rates X T t=1 Dψt (u, pt) − Dψt (u, pt+1) ≤ Dψ1 (u, p1) + T X−1 t=1 Dψt+1 (u, pt+1) − Dψt (u, pt+1) ≤ Dψ1 (u, p1) ≤ κ ln T + ln K η + cβ ln T. It remains to deal with the local-norm term minz ∥ℓbt − z · 1∥ 2 ∇−2ψt(pt) . Following the exact analysis in the proof of Theorem 2.1.2 and the fact ηt,j ≤ 5η for all t ∈ [T] and j ∈ [κ], we have: min z ∥ℓbt − z · 1∥ 2 ∇−2ψt(pt) ≤ 5 min z ∥ℓbt − z · 1∥ 2 ∇−2ψ1(pt) ≤ 10η D pt , ℓbt E . Combining the bounds for the Bregman divergence terms and the local-norm term, and accounting for the difference between u and ej complete the proof. We are now ready to prove Theorem 2.1.3. Proof. of Theorem 2.1.3. The main idea of the proof is as follows. When i ⋆ ∈ S¯, the regret is exactly E hPT t=1 D pt − ei ⋆ , ℓbt Ei. Therefore, Lemma A.3.2 already provides the small-loss bound guarantee by rearranging the terms and taking expectation on both sides. When i ⋆ ∈ Cj , according to our loss estimator construction, the regret is exactly the regret of the meta-algorithm plus the regret of Aj , and we apply Lemma A.3.2 and Lemma A.3.1 to bound each of these two parts and importantly use the negative term from Eq. (A.5) to cancel the corresponding terms in Eq. (A.4). Formally, when i ⋆ ∈ S¯, we apply Lemma A.3.2 with j = i ⋆ and rearrange terms to arrive at X T t=1 D pt − ei ⋆ , ℓbt E ≤ O κ ln T + ln K η + cβ ln T + η X T t=1 ℓbt,i⋆ + 1 T X T t=1 X j∈[β] ℓbt,j . 16 Note that in this case E h ℓbt,i⋆ i = E h ℓt,i⋆ 1−pt,i⋆ 1{it ̸= i ⋆} i = ℓt,i⋆ for all t ∈ [T]. Thus, taking expectation shows Reg = O κ ln T + ln K η + β 2 ln T + ηL⋆ . On the other hand, when i ⋆ ∈ Cj for some j ∈ [κ], we decompose the regret as Reg = E "X T t=1 D pt , ℓbt E − X T t=1 D ei ⋆ , ℓet E # = E "X T t=1 D pt − ej , ℓbt E # + E "X T t=1 D pe (j) t − ei ⋆ , ℓet E # . Here, in the first equality, we use the facts E hDpt , ℓbt Ei = E X j∈[κ] pt,j X i∈Cj pe (j) t,i ℓt,i pt,j 1{j = jt} + X i∈S¯ pt,i ℓt,i 1 − pt,i 1{i ̸= it} = E X j∈[κ] pt,j X i∈Cj pe (j) t,i ℓt,i + X i∈S¯ pt,iℓt,i = E [ℓt,it ] and E h ℓet,i⋆ i = E h ℓt,i⋆ pt,j 1 {j = jt} i = ℓt,i⋆ ; and the second equality is directly by the definition of ℓbt,j for j ∈ [κ]. For the first part of the decomposition, we apply Lemma A.3.2 directly; for the second part, noting that the scale of ℓe(j) t for all t ∈ [T] is no more than ρT,j , according to Lemma A.3.1 we have: X T t=1 D pe (j) t − ei ⋆ , ℓet E ≤ 25ρT,j ln2 (KT) + 10 ln(KT) vuutρT,j X T t=1 ℓet,i⋆ . 170 Combining the two gives X T t=1 D pt , ℓbt E − X T t=1 D ei ⋆ , ℓet E ≤ O κ ln T + ln K η + cβ ln T + 80η X T t=1 D pt , ℓbt E + 1 T X T t=1 X j∈[β] ℓbt,j − ρT,j 40η ln T + 25ρT,j ln2 (KT) − ρT,j 40η ln T + 10 ln(KT) vuutρT,j X T t=1 ℓet,i⋆ ≤ O κ ln T + ln K η + cβ ln T + 80η X T t=1 D pt , ℓbt E + 1 T X T t=1 X j∈[β] ℓbt,j + 1000η(ln T) ln2 (KT) X T t=1 ℓet,i⋆ , (A.6) where the second inequality is by the fact −ax + √ bx ≤ b 4a for a, b > 0 and also the condition η ≤ 1 1000(ln T) ln2 (KT) . 
By rearranging we have: X T t=1 D pt , ℓbt E − X T t=1 D ei ⋆ , ℓet E ≤ O κ ln T + ln K η + cβ ln T + η(ln T) ln2 (KT) X T t=1 ℓet,i⋆ ! + 1 T X T t=1 X i∈[β] ℓbt,i. (A.7) Taking expectation on both sides finishes the proof. A.3.3 Adaptive Version of Algorithm 1 In this section, we provide Algorithm 16, an adaptive version of Algorithm 1 with a doubling trick to remove the need of tuning the learning rate η in terms of L⋆. The algorithm mostly follows Algorithm 1, starting from a relatively large value of η. The key difference is that at the end of each round, we check if condition κ+1 η ≤ η Pt τ=Tλ+1 D pt , ℓbt E holds, where Tλ + 1 is the time step of the most recent reset. If the condition holds, it implies that the current learning rate η is not small enough, and we thus halve the 171 Algorithm 16 Adaptive Version of Algorithm 1 Input: Feedback graph G and a clique partition {C1, . . . , Cκ} of GS, parameter η and c. Define: β = κ + ¯s and Ω = p ∈ ∆(β) : pj ≥ 1 T , ∀j ∈ [β] . for λ = 1, 2, . . . do Tλ = t − 1, ηt,j = η, ρt,j = 2κ, ∀j ∈ [κ], pt = argminp∈Ω ψt(p) (ψt defined in Eq. (2.7)) Create an instance Aj of adaptive Hedge (Algorithm 15) with nodes in Cj , ∀j ∈ [κ]. while t ≤ T do Execute Line 1 to Line 9 of Algorithm 1. if κ+1 η ≤ η Pt τ=Tλ+1 D pt , ℓbt E then η ← η 2 , t ← t + 1. Break t ← t + 1. learning rate, and at the same time reset the algorithm, which includes resetting the parameters ηt,j , ρt,j , and the distribution pt , as well as resetting the Hedge instances. Below we prove that Algorithm 16 achieves the same regret bound as Algorithm 1 without knowing L⋆. The main difficulty of the doubling trick analysis is that D pt , ℓbt E is not well bounded when the graph is not self-aware, which is not the case in prior works such as [143]. We resolve this issue by again utilizing the negative regret term from the increasing learning rate schedule. Theorem A.3.1. Algorithm 16 with c = 64β and η = 1 2000(ln T) ln2 (KT)+80κ ln T guarantees Reg = Oe p (κ + 1)L⋆ + β 2 . Proof. We call the time steps between two resets an epoch (indexed by λ) and let ηλ be the value of η during epoch λ so that ηλ = 21−λη1. Also let λ ⋆ be the index of the last epoch. For notational convenience, define Reg d ≜ PT t=1 D pt , ℓbt E − PT t=1 ℓet,i⋆ , i⋆ ∈ S. PT t=1 D pt − ei ⋆ , ℓbt E , i⋆ ∈ S. ¯ 172 Note that Reg = E[Reg d]. We will first prove the following Reg d ≤ λ X⋆ λ=1 Oe κ + 1 ηλ + β 2 + 1 T X T t=1 X i∈[β] ℓbt,i. (A.8) To show this, consider the regret in each epoch λ. When i ⋆ ∈ S¯, we have: T Xλ+1 t=Tλ+1 D pt − ei ⋆ , ℓbt E ≤ O cβ ln T + κ ln T + ln K ηλ + 40ηλ T Xλ+1 t=Tλ+1 min z ∥ℓbt − z · 1∥ 2 ∇−2ψt(pt) + 1 T T Xλ+1 t=Tλ+1 X i∈[β] ℓbt,i ≤ Oe β 2 + κ + 1 ηλ + 80ηλ TλX +1−1 t=Tλ+1 D pt , ℓbt E + 5 8 + 1 T T Xλ+1 t=Tλ+1 X i∈[β] ℓbt,i ≤ Oe β 2 + κ + 1 ηλ + 1 T T Xλ+1 t=Tλ+1 X i∈[β] ℓbt,i. Here, the first inequality is according to the analysis of Lemma A.3.2. In the second inequality, we bound minz ∥ℓbt −z ·1∥ 2 ∇−2ψ(pt) by 2 D pt , ℓbt E for t = Tλ + 1, . . . , Tλ+1 −1 by the same analysis of Theorem 2.1.2, and bound the same term for t = Tλ+1 by 1 64 by Lemma A.1.2. The final inequality is because the reset condition does not hold for t = Tλ+1 − 1. Summing over the epochs proves Eq. (A.8) for the first case. When i ⋆ ∈ Cj for some j ∈ [κ], similar to the previous analysis and the derivation of Eq. 
(A.6), we have: T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓet,i⋆ ≤ O cβ ln T + κ ln T + ln K ηλ + 40ηλ T Xλ+1 t=Tλ+1 min z ∥ℓbt − z · 1∥ 2 ∇−2ψ(pt) + 1 T T Xλ+1 t=Tλ+1 X i∈[β] ℓbt,i − ρTλ+1,j 80ηλ ln T − ρTλ+1,j 80ηλ ln T + 25ρTλ+1,j ln2 (KT) − ρTλ+1,j 40ηλ ln T + 10 ln(KT) vuutρTλ+1,j T Xλ+1 t=Tλ+1 ℓet,i⋆ 173 ≤ Oe β 2 + κ + 1 ηλ − ρTλ+1,j 80ηλ ln T + 1000ηλ(ln T) ln2 (KT) T Xλ+1 t=Tλ+1 ℓet,i⋆ + 1 T T Xλ+1 t=Tλ+1 X i∈[β] ℓbt,i, where the second inequality uses the fact ηλ ≤ η1 ≤ 1 2000 ln T ln2 (KT) and the AM-GM inequality. By rearranging terms, we have 1 + 1000η(ln T) ln2 (KT) T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓet,i⋆ ≤ Oe β 2 + κ + 1 ηλ − ρTλ+1,j 80ηλ ln T + 1000ηλ(ln T) ln2 (KT) T Xλ+1 t=Tλ+1 D pt , ℓbt E + 1 T T Xλ+1 t=Tλ+1 X i∈[β] ℓbt,i ≤ Oe β 2 + κ + 1 ηλ + 1 2 D pTλ+1 , ℓb Tλ+1 E − ρTλ+1,j 80ηλ ln T + 1 T T Xλ+1 t=Tλ+1 X i∈[β] ℓbt,i, where the second inequality is again because the condition κ+1 ηλ ≤ ηλ PTλ+1−1 t=Tλ+1 D pt , ℓbt E does not hold and ηλ ≤ η1 ≤ 1 2000(ln T) ln2 (KT) . Now if for all i ∈ S¯, pTλ+1,i < 1 2 , then we have D pTλ+1 , ℓb Tλ+1 E ≤ X j ′∈[κ] pTλ+1,j′ 1 pTλ+1,j′ + X i∈S¯ pTλ+1,i 1 − pTλ+1,i ≤ β. Otherwise, we have exactly one i0 ∈ S¯ such that pTλ+1,i0 > 1 2 and then we have D pTλ+1 , ℓb Tλ+1 E ≤ X j ′∈[κ] pTλ+1,j′ 1 pTλ+1,j′ + X i∈S¯ pTλ+1,i 1 − pTλ+1,i ≤ κ + 1 1 − pTλ+1,i0 ≤ κ + ρTλ+1,j . The last inequality is because 1 − pTλ+1,i0 ≥ pTλ+1,j . Therefore, as ηλ ≤ η1 = 1 2000 ln T ln2 (KT)+80κ ln T , we always have D pTλ+1 , ℓb Tλ+1 E ≤ ρTλ+1,j 80ηλ ln T . 17 We have thus shown T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓet,i⋆ ≤ Oe β 2 + κ + 1 ηλ + 1 T T Xλ+1 t=Tλ+1 X i∈[β] ℓbt,i. Summing up the regret from epoch 1 to λ ⋆ gives Eq. (A.8). Next, using the definition of ηλ, we further have Reg d ≤ λ X⋆ λ=1 Oe κ + 1 ηλ + β 2 + 1 T X T t=1 X i∈[β] ℓbt,i ≤ Oe κ + 1 ηλ⋆ + β 2λ ⋆ + 1 T X T t=1 X i∈[β] ℓbt,i. (A.9) When λ ⋆ = 1, direct calculation gives Reg d ≤ Oe(β 2 ) + 1 T PT t=1 P i∈[β] ℓbt,i. On the other hand, if λ ⋆ ≥ 2, consider the time step at the end of epoch λ ⋆ − 1. Using the reset condition, we have: (κ + 1) 2 λ ⋆−2 η1 2 = (κ + 1) 1 η 2 λ⋆−1 ≤ X Tλ⋆ t=Tλ⋆−1+1 D pt , ℓbt E ≤ X T t=1 D pt , ℓbt E ≤ T 2 . So λ ⋆ = O(ln T), κ+1 ηλ⋆ 2 = Oe (κ + 1)PT t=1 D pt , ℓbt E. Plugging these into Eq. (A.9), we have Reg d ≤ Oe vuut(κ + 1) X T t=1 D pt , ℓbt E ! + β 2 + 1 T X T t=1 X i∈[β] ℓbt,i, which also holds for the case λ ⋆ = 1. Finally, taking expectation on both sides gives: Reg = E h Reg d i ≤ Oe E vuut(κ + 1) X T t=1 D pt , ℓbt E ! + β 2 ≤ Oe vuut(κ + 1) E "X T t=1 D pt , ℓbt E #! + β 2 ≤ Oe vuut(κ + 1) E "X T t=1 ⟨pt , ℓt⟩ #! + β 2 . Solving the quadratic inequality, we obtain the regret bound Reg ≤ Oe p (κ + 1)L⋆ + β 2 . 175 A.4 Omitted details for Section 2.1.2.4 Algorithm 17 An Algorithm with Regret Oe(min{ √ αT , √ κL⋆}) for Self-aware Graphs Input: A clique partition {C1, . . . , Cκ} of GS, parameter ηinit and ε. Define: Ω = {p ∈ ∆(K) : pi ≥ 1 T , ∀i ∈ [K]}. 1 for m = 1, 2, . . . , log2 T do 2 η = ηinit. 3 for λ = 1, 2, . . . do 4 pt = 1 K · 1, Tλ = t − 1. 5 while t ≤ T do 6 Pull arm it ∼ pt and receive feedback ℓt,i for all i such that it ∈ Nin(i). 7 Construct estimator ℓbt ∈ R K such that ℓbt,i = ℓt,i·1{it∈Nin P (i)} j∈Nin(i) pt,j , if pt,i > 0. 0, if pt,i = 0. 8 Compute pbt+1 = argminp∈∆K nDp, ℓbt E + Dψ(p, pbt) o , where ψ(p) = 1 η X i∈[K] pi ln pi . 9 pt = pbt . 10 for j = 1, 2, . . . , κ do 11 if P i∈Cj pt+1,i ≤ ε then 12 for i ∈ Cj do 13 pt+1,i = 0. 14 Renormalize pt+1 such that pt+1 ∈ ∆(K). 
15 if 1 η ≤ 4ηκ mini∈[K] nPt τ=Tλ+1 ℓbτ,io then 16 η ← η 2 , ε = max{2η, 1 T }, t ← t + 1. 17 if η ≤ q 1 αT then 18 Jump to Line 1. 19 Jump to Line 3. 20 break. 21 Run the algorithm from Theorem 2.1.1 (from scratch) for the rest of the game. In this section, we discuss how to obtain Oe(min{ √ αT , √ κL⋆}) regret for self-aware graphs. Algorithm 17 shows the complete pseudocode. It consists of two stages. The first stage is when m ≤ log2 T and runs yet another parameter-free algorithm with Oe( √ κL⋆) regret for self-aware graphs (for technical reasons we are not able to use Algorithm 1 directly here). The second stage exactly runs the algorithm mentioned in Section 2.1.2.1 (Theorem 2.1.1), which achieves Oe( √ αT) regret. 176 The first stage mainly follows the clipping idea of [9] for OMD with entropy regularizer. At each round t, after updating the distribution with OMD, we clip the probability of a clique to zero if it has low probability to be chosen (Line 11 and Line 13), and normalize the distribution after clipping (Line 14). Then, a doubling trick is introduced: once the condition 1 η ≤ 4ηκ mini∈[K]{ Pt τ=Tλ+1 ℓbτ,i} holds (Line 15), we halve the learning rate and reset (Line 19). We point out that thanks to the clipping trick, the term mini∈[K]{ Pt τ=Tλ+1 ℓbτ,i} is nicely bounded, which is crucial for the doubling trick analysis. We say that we start a new epoch in this case. Furthermore, once the learning rate is smaller than p1/αT (Line 17), we jump to Line 1 and reset it to the initial learning rate η = ηinit in Line 2. We say that we start a new meta-epoch in this case. After having log2 T meta-epochs, the algorithm is confident that √ αT ≤ Oe( √ κL⋆), and thus switches to the second stage and runs the algorithm introduced in Section 2.1.2.1 with regret Oe( √ αT). We point out that in fact any algorithm with Oe( √ αT) regret is acceptable for this second stage. The guarantee of this algorithm is shown below. Theorem A.4.1. Algorithm 17 with ηinit = 1 4κ and ε = 1 2κ guarantees Reg = Oe min n√ αT , p κL⋆ o + K2 . To prove the theorem, we first prove a bound on the regret within an epoch λ, where η ≤ 1 4κ and ε = max{2η, 1 T } are fixed. Lemma A.4.1. Fix a meta-epoch and consider an epoch λ in the first stage. Algorithm 17 guarantees T Xλ+1 t=Tλ+1 D pt , ℓbt E − min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i = Oe 1 η + ηκ min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i + κ + K T min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i. . 177 Proof. According to the analysis of online mirror descent with entropy regularizer [40], we have T Xλ+1 t=Tλ+1 D pbt , ℓbt E − min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i ≤ ln K η + η T Xλ+1 t=Tλ+1 X K i=1 pbt,iℓb2 t,i ≤ ln K η + η T Xλ+1 t=Tλ+1 X j∈[κ] X i∈Cj P pbt,i i ′∈Cj pt,i′ ℓbt,i. As we do clipping for each clique, if pt,i > 0, then 1 − κε ≤ pbt,i pt,i ≤ 1 and otherwise ℓbt,i = 0. Therefore, we have T Xλ+1 t=Tλ+1 (1 − κε) D pt , ℓbt E − min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i ≤ ln K η + η T Xλ+1 t=Tλ+1 X j∈[κ] X i∈Cj P pt,i i ′∈Cj pt,i′ ℓbt,i (A.10) Now consider a fixed clique Cj that is not clipped at time t. We have P pt,i i ′∈Cj pt,i′ = P pbt,i i ′∈Cj pbt,i′ = exp(−η Pt−1 τ=1 ℓbτ,i) P i ′∈Cj exp(−η Pt−1 τ=1 ℓb τ,i′) , which can be considered as a probability distribution generated by online mirror descent with entropy regularizer inside the clique. 
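For concreteness, the conditional distribution just described is simply exponential weights over the cumulative loss estimates restricted to the clique. The short Python sketch below illustrates this intra-clique computation only; the names cum_loss_hat, clique, and eta are illustrative placeholders and not quantities taken from the pseudocode.

import numpy as np

def clique_exp_weights(cum_loss_hat, clique, eta):
    # Exponential-weights (entropy-regularized OMD) distribution restricted to one clique:
    # returns p_{t,i} / sum_{i' in C_j} p_{t,i'} for i in C_j, given the cumulative loss
    # estimates sum_{tau < t} hat-ell_{tau,i} and the learning rate eta.
    z = -eta * cum_loss_hat[clique]
    z -= z.max()                 # shift for numerical stability; does not change the ratios
    w = np.exp(z)
    return w / w.sum()

# toy usage: a three-node clique in which node 1 has accumulated the least estimated loss
print(clique_exp_weights(np.array([3.0, 1.0, 2.5]), [0, 1, 2], eta=0.1))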
Therefore for any clique Cj , j ∈ [κ], we have T Xλ+1 t=Tλ+1 X i∈Cj P pt,i i ′∈Cj pt,i′ ℓbt,i − min i∈Cj T Xλ+1 t=Tλ+1 ℓbt,i ≤ ln K η + η T Xλ+1 t=Tλ+1 X i∈Cj P pt,i i ′∈Cj pt,i′ ℓb2 t,i ≤ ln K η + η ε T Xλ+1 t=Tλ+1 X i∈Cj P pt,i i ′∈Cj pt,i′ ℓbt,i ≤ ln K η + 1 2 T Xλ+1 t=Tλ+1 X i∈Cj P pt,i i ′∈Cj pt,i′ ℓbt,i. 178 The second inequality is because ℓbt,i ≤ 1 ε for all i ∈ Cj , j ∈ [κ] and the third inequality is because ε ≥ 2η. Rearranging the terms, we have T Xλ+1 t=Tλ+1 X i∈Cj P pt,i i ′∈Cj pt,i′ ℓbt,i ≤ 2 ln K η + 2 min i∈Cj T Xλ+1 t=Tλ+1 ℓbt,i. Therefore, by combining with Eq. (A.10), we have (1 − κε) T Xλ+1 t=Tλ+1 D pt , ℓbt E − min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i ≤ ln K η + η X j∈[κ] 2 ln K η + 2 min i∈Cj T Xλ+1 t=Tλ+1 ℓbt,i ≤ ln K η + 2κ ln K + 2η X j∈[κ] min i∈Cj T Xλ+1 t=Tλ+1 ℓbt,i. (A.11) Furthermore, let Ti0 be the last round such that ℓbt,i0 > 0 for some i0 ∈ Cj . Note that Ti0 is also the last round such that ℓt,i′ > 0 for all i ′ ∈ Cj . Then we have for any i ′ ∈ [K]: ε ≤ X i∈Cj pbTi0 ,i = P i∈Cj exp(−η PTi0−1 τ=T(s)+1 ℓbτ,i) PK i=1 exp(−η PTi0−1 τ=T(s)+1 ℓbτ,i) ≤ P i∈Cj exp(−η PTλ+1 t=Tλ+1 ℓbt,i + η ε ) PK i=1 exp(−η PTλ+1 t=Tλ+1 ℓbt,i) ≤ |Cj | exp(−η mini∈Cj PTλ+1 t=Tλ+1 ℓbt,i + η ε ) exp(−η PTλ+1 t=Tλ+1 ℓb t,i′) , where the second inequality is because ℓb Ti0 ,i ≤ 1 ε for all i ∈ Cj and the fact ℓbt,i ≥ 0 for all t ∈ [T] and i ∈ [K]. Therefore, by rearranging terms, for any i ′ ∈ [K] and any j ∈ [κ], we have min i∈Cj T Xλ+1 t=Tλ+1 ℓbt,i ≤ T Xλ+1 t=Tλ+1 ℓb t,i′ + 1 ε + 1 η ln K ε . 179 Combining Eq. (A.11) and choosing i ′ = argmini∈[K] PTλ+1 t=Tλ+1 ℓbt,i further show (1 − κε) T Xλ+1 t=Tλ+1 D pt , ℓbt E − min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i ≤ ln K η + 2κ ln K + 2ηκ min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i + 1 ε + 1 η ln K ε ≤ ln K η + 2ηκ min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i + 3κ ln K + 2κ ln(KT) ≤ ln K η + 2ηκ min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i + 5κ ln(KT), where the second inequality is because ε = max{2η, 1 T }. Finally, rearranging terms again shows T Xλ+1 t=Tλ+1 D pt , ℓbt E − min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i ≤ 1 1 − κε ln K η + κ(2η + ε) min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i + 5κ ln(KT) ≤ 2 ln K η + 8ηκ min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i + 10κ ln(KT) + 2K T min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i. The second inequality is because η ≤ 1 4κ , which means κε ≤ max{ κ T , 2ηκ} ≤ 1 2 , as T ≥ 2K. Next we bound the regret within a meta-epoch. In the remaining of the section, we use Tm to denote the set of rounds in meta-epoch m. Lemma A.4.2. For any meta-epoch m in the first stage, Algorithm 17 guarantees X t∈Tm D pt , ℓbt E − min i∈[K] X t∈Tm ℓbt,i ≤ Oe min √ αT , s κ min i∈[K] X t∈Tm ℓbt,i + κ + K T min i∈[K] X t∈Tm ℓbt,i . 180 Proof. Let ηλ = 21−λη1 and ελ = max{2ηλ, 1/T} be the value of η and ε during epoch λ and let λ ⋆ be the index of the last epoch. Consider the regret in epoch λ. Using Lemma A.4.1, we know that T Xλ+1 Tλ+1 D pt , ℓbt E − min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i = Oe 1 ηλ + ηλκ min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i + κ + K T min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i. = Oe 1 ηλ + ηλκ ελ + κ + K T min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i (the update rule and ℓb Tλ+1,i ≤ 1 ελ for all i ∈ [K]) = Oe 1 ηλ + κ + K T min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i (ελ = max{ 1 T , 2ηλ}) = Oe 1 ηλ + K T min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i . (ηλ ≤ η1 = 1 4κ ) Taking a summation from λ = 1, 2, . . . , λ⋆ , we have X t∈Tm D pt , ℓbt E − min i∈[K] X t∈Tm ℓbt,i ≤ λ X⋆ λ=1 Oe 1 ηλ + K T min i∈[K] T Xλ+1 t=Tλ+1 ℓbt,i ≤ Oe 1 ηλ⋆ + K T min i∈[K] X t∈Tm ℓbt,i! . (A.12) Now we come to bound 1 ηλ⋆ . 
If λ ⋆ = 1, we have P t∈Tm D pt , ℓbt E − mini∈[K] P t∈Tm ℓbt,i ≤ Oe(K) + 2K T mini∈[K] P t∈Tm ℓbt,i. If λ ⋆ ≥ 2, consider the last round of epoch λ ⋆ − 1. According to the update rule of η, we have 1 η 2 λ⋆−1 ≤ 4κ min i∈[K] X Tλ⋆ t=Tλ⋆−1+1 ℓbt,i ≤ 4κ min i∈[K] X t∈Tm ℓbt,i. 181 Therefore, we have 1 ηλ⋆ ≤ q 16κ mini∈[K] P t∈Tm ℓbt,i. In addition, note that ηλ⋆ ≥ q 1 αT by Line 17 of the algorithm. So 1 ηλ⋆ = Oe min √ αT , q κ mini∈[K] P t∈Tm ℓbt,i + κ . Plugging this into Eq. (A.12) completes the proof. Now we are ready to prove Theorem A.4.1. Proof. of Theorem A.4.1 We first show that with high probability, the algorithm does not enter the second stage if αT ≥ 64κL⋆. Let m⋆ ≤ log2 T be the number of meta-epochs executed in the first stage. Define random variables as follows: Rm ≜ min i∈[K] X t∈Tm ℓbt,i, m ∈ [m⋆ ]. As we reset all the parameters for each meta-epoch, we have E [Rm | R1, . . . , Rm−1] ≤ E " min i∈[K] X T t=1 ℓbt,i# ≤ min i∈[K] E "X T t=1 ℓbt,i# = L⋆. On the other hand, according to Line 17, the learning rate η used at the last round of each meta-epoch m is at most q 4 αT before being halved. Combining with Line 15, we have αT ≤ 4 η 2 ≤ 16κRm. Now suppose αT ≥ 64κL⋆. Using Markov inequality, we have Prob Rm ≥ αT 16κ R1, . . . , Rm−1 ≤ 16κL⋆ αT ≤ 1 4 , ∀m ∈ [m⋆ ]. Therefore, the probability that the algorithm reaches the second stage when αT ≥ 64κL⋆ is upper bounded as follows Prob [Algorithm 17 reaches the second stage] ≤ log Y2 T m=1 Prob Rm ≥ αT 16κ R1, . . . , Rm−1 182 ≤ 1 4 log2 T = 1 T2 . This shows that it is unlikely to reach the second stage when αT ≥ 64κL⋆. Now consider the expected regret when αT ≥ 64κL⋆. With probability 1 − 1 T2 , the regret only comes from the first stage. According to Lemma A.4.2, we have Reg ≤ E "Xm⋆ m=1 X t∈Tm D pt , ℓbt E − min i∈[K] X t∈Tm ℓbt,i!# + 1 T2 · T ≤ Oe E min √ αT , vuutκ min i∈[K] X T t=1 ℓbt,i + K ≤ Oe min n√ αT , p κL⋆ o + K . On the other hand, when αT ≤ 64κL⋆, the regret is bounded by the sum of the worst-case regret from both stages. According to Theorem 2.1.1 and Lemma A.4.2, we arrive at Reg ≤ E "Xm⋆ m=1 X t∈Tm D pt , ℓbt E − min i∈[K] X t∈Tm ℓbt,i!# + Oe √ αT + K2 ≤ Oe √ αT + K2 = Oe min n√ αT , p κL⋆ o + K2 . Combining the two cases completes the proof. A.5 Omitted Details for Section 2.1.3 In this section, we provide omitted details for Section 2.1.3, including the proof of Theorem 2.1.4 (Appendix A.5.1), the proof of Theorem 2.1.5 (Appendix A.5.2), an adaptive version of Algorithm 2 for directed complete bipartite graphs and its analysis (Appendix A.5.3), the proof of Theorem 2.1.6 (Appendix A.5.4), and an adaptive version of Algorithm 2 for weakly observable graphs and its analysis (Appendix A.5.5). For 183 notational convenience, we use 1U to denote a vector in R K whose i-th coordinate is 1 if i ∈ U and 0 otherwise. We also define h(x) ≜ x−1−ln x so that for a hybrid regularizer of the form ψ(p) = 1 η P i∈S ln 1 pi + 1 η¯ P i∈S¯ pi ln pi , its associated Bregman divergence is Dψ(p, q) = 1 η P i∈S h(pi/qi) + 1 η¯ P i∈S¯ pi ln(pi/qi). A.5.1 Proof of Theorem 2.1.4 Proof. Fix any ε ∈ (0, 1/3). Since the feedback graph is weakly observable, there must exist two nodes u and v, such that u does not have a self-loop and v cannot observe u. Consider the following environment: ∀t ∈ [T], ℓt,u = 0, ℓt,v = T −b for some b such that ε < b < 1 − 2ε, and ℓt,w = 1, ∀w ̸= u, v. We call nodes other than u and v “bad arms”. Note that the loss of each node is constant over time and L⋆ = 0. 
Since A achieves Oe(1) regret when L⋆ = 0, the expected number of times that bad arms are selected by A (denoted by Nbad) is at most O(T ε ). Similarly, the expected number of times that v is selected by A (denoted by Nv) is at most O(T b+ε ). The condition Nbad = O(T ε ) implies that we can find a interval of length T 2Nbad+1 = Ω(T 1−ε ) such that A selects bad arms less than 1 2 times in expectation in this interval. Therefore, by Markov inequality, we have with probability 1 2 , A does not select bad arms at all in this interval. Now consider creating another environment by switching the loss of u to 1 only in this interval. Note that with probability 1 2 , A cannot notice the change because u’s loss is not revealed if none of the bad arms is selected. Since Nv = O(T b+ε ), we conclude that A suffers expected loss Ω(T 1−ε − Nv) = Ω(T 1−ε ) in this interval using the condition b < 1 − 2ε. Moreover, v becomes the best arm in this new environment and L⋆ = T 1−b . Therefore, A suffers expected regret Ω(T 1−ε − T 1−b ) = Ω(T 1−ε ) since ε < b, which completes the proof. A.5.2 Proofs for Theorem 2.1.5 We first prove Lemma 2.1.2, which will be useful for the proofs of Theorem 2.1.5 and Theorem 2.1.6. 184 Proof. of Lemma 2.1.2 Let pet+1 = argminp∈RK + nDp, ℓbt + at E + Dψ(p, pt) o . One can verify that pt+1 = argminp∈Ω {Dψ(p, pet+1)} and for any u ∈ Ω, we have D pt − u, ℓbt + at E = Dψ(u, pt) − Dψ(u, pet+1) + Dψ(pt , pet+1) ≤ Dψ(u, pt) − Dψ(u, pt+1) + Dψ(pt , pet+1) = Dψ(u, pt) − Dψ(u, pt+1) + 1 η X i∈S pt,i pet+1,i − 1 − ln pt,i pet+1,i + 1 η¯ X i∈S¯ pt,i ln pt,i pet+1,i − (pt,i − pet+1,i) , where the second inequality is by the generalized Pythagorean theorem. According to the choice of ψ, one can also obtain the close-form of pet+1, which satisfies pt,i pet+1,i = 1 + ηpt,i(ℓbt,i + at,i), for i ∈ S; ln pt,i pet+1,i = ¯η(ℓbt,i + at,i), for i ∈ S. ¯ Plugging this into the previous inequality shows D pt − u, ℓbt + at E ≤ Dψ(u, pt) − Dψ(u, pt+1) + 1 η X i∈S ηpt,i ℓbt,i + at,i − ln 1 + ηpt,i ℓbt,i + at,i + 1 η¯ X i∈S¯ ηp¯ t,i ℓbt,i + at,i − pt,i + pt,i exp −η¯ ℓbt,i + at,i Using the facts ln(1 + x) ≥ x − x 2 and exp(−x) ≤ 1 − x + x 2 for any x ≥ 0, and realizing ℓbt,i + at,i ≥ 0 holds, we further have D pt − u, ℓbt + at E ≤ Dψ(u, pt) − Dψ(u, pt+1) +X i∈S ηp2 t,i ℓbt,i + at,i2 + X i∈S¯ ηp¯ t,i ℓbt,i + at,i2 . (A.13) 185 Next we use the conditions for η and η¯ and the concrete form of at to bound the last two terms as X i∈S ηp2 t,i ℓbt,i + at,i2 + X i∈S¯ ηp¯ t,i ℓbt,i + at,i2 = X i∈S ηp2 t,i ℓbt,i + 2ηpt,iℓb2 t,i2 + X i∈S¯ ηp¯ t,i ℓbt,i + 2¯ηℓb2 t,i2 = X i∈S ηp2 t,iℓb2 t,i 1 + 2ηpt,iℓbt,i2 + X i∈S¯ ηp¯ t,iℓb2 t,i 1 + 2¯ηℓbt,i2 ≤ X i∈S ηp2 t,iℓb2 t,i (1 + 2η) 2 + X i∈S¯ ηp¯ t,iℓb2 t,i 1 + 2 5 2 (pt,iℓbt,i ≤ 1, i ∈ S; ¯ηℓbt,i ≤ 1 5 , i ∈ S¯) ≤ ⟨pt , at⟩. (η ≤ 1 5 ) The proof is completed by plugging the inequality above into Eq. (A.13) and rearranging terms. Proof. of Theorem 2.1.5 The condition of Lemma 2.1.2 holds since according to the definition of η¯ and Ω, we have η¯ Wt,i ≤ η¯√ η¯ = √ η¯ ≤ 1 5 , ∀t ∈ [T] and i ∈ S¯. Thus, by Lemma 2.1.2, we have for any u ∈ Ω D pt − u, ℓbt E ≤ Dψ(u, pt) − Dψ(u, pt+1) + ⟨u, at⟩. Summing over t ∈ [T], we have: X T t=1 D pt − u, ℓbt E ≤ Dψ(u, p1) +X T t=1 ⟨u, at⟩ = 1 η X i∈S h ui p1,i + 1 η¯ X i∈S¯ ui ln ui p1,i + X T t=1 ⟨u, at⟩. When comparing to a node i ∈ S, we set u = 1 T · 1S + 1 − s T · ei ∈ Ω. 
Using the definition of p1, we have Dψ(u, p1) = 1 η X i∈S h ui p1,i + 1 η¯ X i∈S¯ ui ln ui p1,i = 1 η X j̸=i,j∈S h 2s T + 1 η h 2s 1 − s − 1 T 18 ≤ s − 1 η 2s T − 1 − ln 2s T + 1 η h(2s) = s − 1 η 2s T − 1 − ln 2s T + 1 η (2s − 1 − ln 2s) ≤ (s − 1) ln T η + 2s − 1 − s ln 2s η ≤ s ln T η , The first inequality is because h(y) is increasing when y ≥ 1 and T ≥ 2K ≥ 2s. The second inequality is also because T ≥ 2s and the last one is because 2s − 1 − s ln 2s ≤ 1 ≤ ln T for all s > 0. Therefore, we have X T t=1 D pt − ei , ℓbt E ≤ s ln T η + 2η X T t=1 pt,iℓb2 t,i + 2η T X T t=1 X j∈S pt,j ℓb2 t,j + 1 T X T t=1 X j∈S ℓbt,j ≤ s ln T η + 2η X T t=1 ℓbt,i + 2 T X T t=1 X j∈S ℓbt,j , (A.14) where the second inequality is because pt,iℓbt,i ≤ 1 for all i ∈ S and η ≤ 1 5 ≤ 1 2 . When comparing with node i ∈ S¯, let bi ⋆ S = argmini∈S PT t=1 ℓbt,i and we choose u = √ η¯ · ebi ⋆ S + 1 T · 1S + 1 − s T − √ η¯ ei ∈ Ω. According to the choice of p1, we bound the Bregman divergence term as Dψ(u, p1) = 1 η X i∈S h ui p1,i + 1 η¯ X i∈S¯ ui ln ui p1,i = 1 η X j̸=bi ⋆ S ,j∈S h 2s T + 1 η h 2s 1 T + √ η¯ + 1 η¯ 1 − s T − √ η¯ ln 2¯s 1 − s T − √ η¯ ≤ s η 2s T − 1 − ln 2s T + 1 η 2s T + 2s √ η¯ − 1 − ln 2s T + 2s √ η¯ + 1 η¯ 1 − s T − √ η¯ ln (2¯s) ≤ s ln T η + 1 η 2s √ η¯ + ln T + 1 η¯ ln (2¯s) ≤ s ln T + s + ln T η + ln(2¯s) η¯ ≤ 2s ln T η + ln(2¯s) η¯ . 1 Here, the second inequality is because T ≥ 2K ≥ 2s; the third inequality is because √ η¯ ≤ 1 5 ≤ 1 2 ; and the last inequality is because ln T ≥ 1 and s ≥ 1. Therefore, we have X T t=1 D pt − ei , ℓbt E ≤ 2s ln T η + ln(2¯s) η¯ + 2¯η X T t=1 ℓb2 t,i + 2η T X T t=1 X j∈S pt,j ℓb2 t,j + 2η √ η¯ X T t=1 pt,bi ⋆ S ℓb2 t,bi ⋆ S + √ η¯min i∈S X T t=1 ℓbt,i + 1 T X T t=1 X j∈S ℓbt,j ≤ 2s ln T η + ln(2¯s) η¯ + 2√ η¯ X T t=1 ℓbt,i + 2√ η¯min i∈S X T t=1 ℓbt,i + 2 T X T t=1 X j∈S ℓbt,j , (A.15) where the second inequality is because ℓbt,i ≤ √ 1 η¯ for nodes i ∈ S¯, pt,iℓbt,i ≤ 1 for all i ∈ S, t ∈ [T] and η ≤ 1 5 ≤ 1 2 . Now we take expectation over both sides. If i ∈ S, then Regi = E "X T t=1 D pt − ei , ℓbt E # ≤ s ln T η + 2ηLi + 2s. If i ∈ S¯, then Regi = E "X T t=1 D pt − ei , ℓbt E # ≤ 2s ln T η + 2 ln K η¯ + 2√ ηL¯ i + 2√ η¯E " min i∈S X T t=1 ℓbt,i# + 2s ≤ 2s ln T η + 2 ln K η¯ + 2√ ηL¯ i + 2√ η¯min i∈S E "X T t=1 ℓbt,i# + 2s = 2s ln T η + 2 ln K η¯ + 2√ ηL¯ i + 2√ ηL¯ i ⋆ S + 2s. 188 The first inequality is because ln(2¯s) ≤ ln(2K) ≤ 2 ln K as K ≥ 2, and the second inequality is because of Jensen’s inequality. Finally, choosing η = min{ q s/Li ⋆ S , 1/5}, η¯ = min{L −2/3 i ⋆ S , 1/25}, we have for i ∈ S Regi ≤ Regi ⋆ S ≤ O s ln T + q sLi ⋆ S ≤ Oe p sLi + s . For i ∈ S¯, if Li ≤ Li ⋆ S , we have Regi ≤ 2s ln T η + 2 ln K η¯ + 4√ ηL¯ i ⋆ S + 2s ≤ Oe L 2/3 i ⋆ S + q sLi ⋆ S + s . Otherwise, we have Regi ≤ Regi ⋆ S ≤ Oe q sLi ⋆ S + s . Combining the above results finishes the proof. A.5.3 Adaptive Version of Algorithm 2 for Directed Complete Bipartite Graphs In order to make Algorithm 2 parameter-free, one may consider directly applying doubling trick. However, one technical issue comes from analyzing the last round before each restart where the loss estimator might be too large. To address this issue, we combine Algorithm 2 and the clipping technique, together with a double trick. The full algorithm is described in Algorithm 18. Similar to previous doubling trick algorithms, we start from some large η and η¯, run the procedures of Algorithm 2 (Line 5 to Line 8) and reduce the learning rate when the accumulated estimated loss is too large (Line 10 and Line 11). 
The key difference is that we again follow the clipping idea of [9]: after computing pbt+1 through OMD, we do clipping with threshold µ and renormalization for nodes in S (Line 9). In this way, ℓbt,i defined in Line 6 is well upper bounded 189 Algorithm 18 Adaptive Version of Algorithm 2 for Directed Complete Bipartite Graphs Input: Feedback graph G and parameter η ≤ 1 5 . Initialize: p1 is such that p1,i = 1 2s for i ∈ S and p1,i = 1 2¯s for i ∈ S¯. for λ = 1, 2, . . . do 1 pt = p1, η¯ = s − 2 3 η 4 3 , Tλ = t − 1. 2 Define decision set Ω = p ∈ ∆(K) : P i∈S pi ≥ √ η¯ . 3 Define hybrid regularizer ψ(p) = 1 η P i∈S ln 1 pi + 1 η¯ P i∈S¯ pi ln pi . 4 while t ≤ T do 5 Play arm it ∼ pt and receive feedback ℓt,i for all i such that it ∈ Nin(i). 6 Construct estimator ℓbt such that ℓbt,i = ( 0, pt,i = 0, ℓt,i Wt,i · 1{it ∈ Nin(i)}, pt,i > 0, , where Wt,i = P j∈Nin(i) pt,j . 7 Construct correction term at such that at,i = ( 2ηpt,iℓb2 t,i, for i ∈ S, 2¯ηℓb2 t,i, for i ∈ S. ¯ 8 Compute pbt+1 = argminp∈Ω nDp, ℓbt + at E + Dψ(p, pbt) o . 9 Construct pt+1 as follows, where µ = η √ η¯ s : pt+1,i = ( P pbt+1,i1{pbt+1,i≥µ} i ′∈S pbt+1,i′·1{pbt+1,i′≥µ} · P i ′∈S pbt+1,i′, if i ∈ S, pbt+1,i, if i ∈ S. ¯ 10 if s η ≤ η mini∈S Pt τ=Tλ+1 ℓbt,i then 11 η ← η/2, t ← t + 1. Break. t ← t + 1. for all i ∈ [K], t ∈ [T], which is crucial for the doubling trick analysis. Formally, we prove the following theorem. Theorem A.5.1. Algorithm 18 with η = min{ 1 5 , 1 s } guarantees for any directed complete bipartite graph: Reg = Oe psLi ⋆ S + s 2 , if i ⋆ ∈ S. Oe L 2/3 i ⋆ S + psLi ⋆ S + s 2 , if i ⋆ ∈ S. ¯ Proof. We call the time steps between two resets an epoch (indexed by λ) and let ηλ, η¯λ, and µλ be the value of η, η¯, and µ during epoch λ so that ηλ = 21−λη1, η¯λ = s −2/3η 4/3 λ , and µλ = ηλ √ η¯λ s . Also let λ ⋆ be the index of the last epoch. As we only do clipping restricted on the nodes in S, all nodes in S¯ can still 190 be observed with probability greater than zero. Therefore, ℓbt,i is still unbiased for any node i ∈ S¯ and we have E [ℓt,it ] = E[⟨pt , ℓbt⟩]. In addition, in each epoch λ, as the clipping threshold is µλ ≤ √ η¯λ s , at least one node in S will survive and we have 1 ≤ pt,it pbt,it ≤ 1 1− √ sµλ η¯λ = 1 1−ηλ . Now we consider the regret in epoch λ. We will prove that 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ Oe s ηλ + 1 T PTλ+1 t=Tλ+1 P i∈S ℓbt,i , if i ⋆ ∈ S, Oe s ηλ + 1 η¯λ + 1 T PTλ+1 t=Tλ+1 P i∈S ℓbt,i , if i ⋆ ∈ S. ¯ (A.16) When i ⋆ ∈ S, we have 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − min i∈S T Xλ+1 t=Tλ+1 ℓbt,i ≤ Oe s ηλ + ηλ min i∈S T Xλ+1 t=Tλ+1 ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe s ηλ + ηλ min i∈S TλX +1−1 t=Tλ+1 ℓbt,i + ηλ max i∈S ℓb Tλ+1,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe s ηλ + s 4 3 η − 2 3 λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe s ηλ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . The second inequality is derived by rearranging terms in Eq. (A.14). The fourth inequality is because s ηλ ≤ ηλ mini∈S PTλ+1−1 t=Tλ+1 ℓbt,i does not hold and ℓb Tλ+1,i ≤ 1 µλ holds for all i ∈ [K]. The last inequality is because ηλ ≤ η1 ≤ 1 s . On the other hand, if i ⋆ ∈ S¯, we have 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ 191 ≤ Oe s ηλ + 1 η¯λ + √ η¯λ + ηλ T Xλ+1 t=Tλ+1 ℓbt,i⋆ + √ η¯λ min i∈S T Xλ+1 t=Tλ+1 ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . which is also derived by rearranging terms in Eq. (A.15). Then we consider the following two cases. 
If PTλ+1 t=Tλ+1 ℓbt,i⋆ ≤ mini∈S PTλ+1 t=Tλ+1 ℓbt,i, we have 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ Oe s ηλ + 1 η¯λ + √ η¯λ + ηλ min i∈S T Xλ+1 t=Tλ+1 ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe s ηλ + 1 η¯λ + √ η¯λ + ηλ min i∈S TλX +1−1 t=Tλ+1 ℓbt,i + √ η¯λ + ηλ max i∈S ℓb Tλ+1,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe s ηλ + 1 η¯λ + √ η¯λ + ηλ max i∈S ℓb Tλ+1,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe s ηλ + 1 η¯λ + s ηλ + s 4 3 η − 2 3 λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i = Oe s ηλ + 1 η¯λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . Here, the third inequality is because s ηλ ≤ ηλ mini∈S PTλ+1−1 t=Tλ+1 ℓbt,i does not hold, which also implies that 1 ηλ ≤ ηλ mini∈S PTλ+1−1 t=Tλ+1 ℓbt,i does not hold; the fourth inequality is also ℓb Tλ+1,i ≤ 1 µλ for all i ∈ S; and the last inequality is because ηλ ≤ η1 ≤ 1 s . On the other hand, if PTλ+1 t=Tλ+1 ℓbt,i⋆ ≥ mini∈S PTλ+1 t=Tλ+1 ℓbt,i, then based on previous results, we have 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − min i∈S T Xλ+1 t=Tλ+1 ℓbt,i ≤ Oe s ηλ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . Combining the two cases, we finish proving Eq. (A.16). Now we sum up the regret over all epochs λ = 1, 2, . . . , λ⋆ . For i ⋆ ∈ S, we have λ X⋆ λ=1 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ λ X⋆ λ=1 Oe s ηλ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe s ηλ⋆ + 1 T X T t=1 X i∈S ℓbt,i! = Oe 2 λ ⋆ s η1 + 1 T X T t=1 X i∈S ℓbt,i! (A.17) For i ⋆ ∈ S¯, we have λ X⋆ λ=1 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ λ X⋆ λ=1 Oe s ηλ + 1 η¯λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe s ηλ⋆ + 1 η¯λ⋆ + 1 T X T t=1 X i∈S ℓbt,i! = Oe 2 λ ⋆ s η1 + 2 4 3 λ ⋆ η¯1 + 1 T X T t=1 X i∈S ℓbt,i! . (A.18) Below we show that λ ⋆ is well upper bounded. When λ ⋆ ≥ 2, consider the last time step of epoch λ ⋆ − 1, we have 2 2λ ⋆−2 s η 2 1 = s η 2 λ⋆−1 ≤ min i∈S X Tλ⋆ t=Tλ⋆−1+1 ℓbt,i ≤ min i∈S X T t=1 ℓbt,i. Therefore, we know that 2 λ ⋆ ≤ 2η1 q mini∈S PT t=1 ℓbt,i s . Plugging this into Eq. (A.17), we have for i ⋆ ∈ S, λ X⋆ λ=1 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ Oe 2 λ ⋆ s η1 + 1 T X T t=1 X i∈S ℓbt,i! ≤ Oe vuuts min i∈S X T t=1 ℓbt,i + 1 T X T t=1 X i∈S ℓbt,i . 193 On the other hand, for i ⋆ ∈ S¯, plugging 2 λ ⋆ ≤ 2η1 q mini∈S PT t=1 ℓbt,i s into Eq. (A.18) gives λ X⋆ λ=1 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ Oe 2 λ ⋆ s η1 + 2 4 3 λ ⋆ η¯1 + 1 T X T t=1 X i∈S ℓbt,i! ≤ Oe vuuts min i∈S X T t=1 ℓbt,i + min i∈S X T t=1 ℓbt,i!2 3 + 1 T X T t=1 X i∈S ℓbt,i . Combining with the case λ ⋆ = 1, we have the following result 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ Oe q s mini∈S PT t=1 ℓbt,i + s 2 + 1 T PT t=1 P i∈S ℓbt,i , if i ⋆ ∈ S, Oe q s mini∈S PT t=1 ℓbt,i + mini∈S PT t=1 ℓbt,i 2 3 + s 2 + 1 T PT t=1 P i∈S ℓbt,i , if i ⋆ ∈ S. ¯ Now we take the expectation over both sides. First, for the left hand side, we have E λ X⋆ λ=1 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ = E λ X⋆ λ=1 1 1 − ηλ T Xλ+1 t=Tλ+1 D pbt , ℓbt E − T Xλ+1 t=Tλ+1 ℓt,it + E "X T t=1 ℓt,it − ℓbt,i⋆ # = E "X T t=1 1 1 − ηλt D pbt , ℓbt E − ℓt,it # + E "X T t=1 ℓt,it − ℓt,i⋆ # = X T t=1 Eλt Eit|λt 1 1 − ηλt D pbt , ℓbt E − ℓt,it + E "X T t=1 ℓt,it − ℓt,i⋆ # ≥ E "X T t=1 ℓt,it − ℓt,i⋆ # . 194 Here, λt represents the epoch that time t belongs to. The last inequality is because of the fact that whether t is in epoch λ or not is independent of what action is realized in time t and 1 ≤ pt,it pbt,it ≤ 1 1−ηλ . Next we consider the right hand side. 
For i ⋆ ∈ S, we have E Oe vuuts min i∈S X T t=1 ℓbt,i + s 2 + 1 T X T t=1 X i∈S ℓbt,i ≤ Oe vuuts min i∈S E "X T t=1 ℓbt,i# + s 2 = Oe q sLi ⋆ S + s 2 . where we use Jensen’s inequality. For i ⋆ ∈ S¯, we have E Oe vuuts min i∈S X T t=1 ℓbt,i + min i∈S X T t=1 ℓbt,i!2 3 + s 2 + 1 T X T t=1 X i∈S ℓbt,i ≤ Oe vuuts min i∈S E "X T t=1 ℓbt,i# + min i∈S E "X T t=1 ℓbt,i#!2 3 + s 2 = Oe q sLi ⋆ S + L 2/3 i ⋆ S + s 2 . Finally combining the results above proves the theorem statement: Reg = Oe psLi ⋆ S + s 2 , if i ⋆ ∈ S, Oe L 2/3 i ⋆ S + psLi ⋆ S + s 2 , if i ⋆ ∈ S. ¯ 195 A.5.4 Proof of Theorem 2.1.6 Proof. The condition of Lemma 2.1.2 holds since according to the choice of η¯ and δ, we have η¯ Wt,i ≤ η¯ δ ≤ δ 1/3 ≤ 1 5 , ∀t ∈ [T], i ∈ S¯. Therefore, we know that for any u ∈ Ω, X T t=1 D pt − u, ℓbt E ≤ 1 η X i∈S h ui p1,i + 1 η¯ X i∈S¯ ui ln ui p1,i + X T t=1 ⟨u, at⟩. Set u = 1 T · 1S\D + δ · 1D + 1 − dδ − |S\D| T · ei . When comparing with i ∈ S, we have X j∈S h uj p1,j + X j∈S¯ uj ln uj p1,j = X j̸=i,j∈S\D 2s T − 1 − ln 2s T + X j̸=i,j∈S∩D (2sδ − 1 − ln 2sδ) + 2s 1 − dδ − |S\D| T − 1 − ln 2s 1 − dδ − |S\D| T + |S¯ ∩ D|(δ ln(2¯sδ)) ≤ (s − 1) ln T 2s + 2s − 1 − ln s 2 (T ≥ 2K ≥ 2s and 1 T ≤ δ ≤ min 1 4d , 1 4s ) = (s − 1) ln T − s ln 2s + 2s + ln 4 − 1 ≤ 2s ln T. (−s ln 2s + 2s + ln 4 − 1 ≤ 2 ≤ (s + 1) ln T) Therefore, X T t=1 D pt − ei , ℓbt E ≤ 2s ln T η + 2η X T t=1 pt,iℓb2 t,i + δ X T t=1 2η X j∈S∩D pt,j ℓb2 t,j + 2¯η X j∈S¯∩D ℓb2 t,j + 2η T X T t=1 X j∈S\D pt,j ℓb2 t,j + δ X T t=1 X j∈D ℓbt,j + 1 T X T t=1 X j∈S\D ℓbt,j ≤ 2s ln T η + 2η X T t=1 ℓbt,i + 2δ X T t=1 X j∈D ℓbt,j + 2 T X T t=1 X j∈S ℓbt,j . 196 The second inequality is because pt,j ℓbt,j ≤ 1 for j ∈ S, ℓbt,j ≤ 1 δ for j ∈ S¯, 2¯η ≤ δ and 2η ≤ 1. The reason that ℓbt,j ≤ 1 δ holds for j ∈ S¯ is that if i is weakly observable, this trivially holds according to the definition of Ω. Otherwise, j can be observed by all the other nodes, which include at least one weakly observable node. This shows that ℓbt,j ≤ 1 δ . On the other hand, when comparing with i ∈ S¯, we have X j∈S h uj p1,j + X j∈S¯ uj ln uj p1,j = X j∈S\D 2s T − 1 − ln 2s T + X j∈S∩D (2sδ − 1 − ln 2sδ) + |S¯ ∩ D| − 1 (δ ln(2¯sδ)) + ui ln(2¯sui) ≤ s ln T 2s + ln(2¯s) ≤ s ln T + ln(2¯s) because T ≥ 2K ≥ 2s, 1 T ≤ δ ≤ 1 4s and ui ≤ 1. Therefore, X T t=1 D pt − ei , ℓbt E ≤ s ln T η + ln(2¯s) η¯ + 2η X T t=1 ℓb2 t,i + δ X T t=1 2η X j∈S∩D pt,j ℓb2 t,j + 2¯η X j∈S¯∩D ℓb2 t,j + 2η T X T t=1 X j∈S\D pt,j ℓb2 t,j + δ X T t=1 X j∈D ℓbt,j + 1 T X T t=1 X j∈S\D ℓbt,j ≤ s ln T η + ln(2¯s) η¯ + 2η δ X T t=1 ℓbt,i + 2δ X T t=1 X j∈D ℓbt,j + 2 T X T t=1 X j∈S ℓbt,j , where the second inequality holds by the same reasons for the case i ∈ S. Taking expectation over both sides, we have for i ∈ S Regi ≤ 2s ln T η + 2η X T t=1 E h ℓbt,ii + 2δ X T t=1 X j∈D E h ℓbt,ji + 2 T X T t=1 X j∈S E h ℓbt,ji ≤ 2s ln T η + 2ηLi + 2δdLD + 2s. 19 Algorithm 19 Adaptive Version for Algorithm 2 for General Weakly Observable Graphs Input: Feedback graph G and parameter δ. Initialize: p1 is such that p1,i = 1 2s for i ∈ S and p1,i = 1 2¯s for i ∈ S¯. 1 for λ = 1, 2, . . . do 2 pt = p1, η = δ 1 2γ , η¯ = δ 1 2 + 1 2γ , Tλ = t − 1. 3 Define decision set Ω = {p ∈ ∆(K) : pi ≥ δ, i ∈ D}. 4 Define hybrid regularizer ψ(p) = 1 η P i∈S ln 1 pi + 1 η¯ P i∈S¯ pi ln pi . 5 while t ≤ T do 6 Execute Line 1 to Line 4 of Algorithm 2. 7 if δ − 1 γ ≤ Pt τ=Tλ+1 P i∈D ℓbt,i then 8 δ ← δ 2 , t ← t + 1. 9 Break. 10 t ← t + 1. 
Similarly for i ∈ S¯, we have Regi ≤ s ln T η + ln(2¯s) η¯ + 2¯η δ E "X T t=1 ℓbt,i# + 2δ X T t=1 X j∈D E h ℓbt,ji + 2 T X T t=1 X j∈S E h ℓbt,ji ≤ s ln T η + ln(2¯s) η¯ + 2¯ηLi δ + 2δdLD + 2s. This finishes the proof. A.5.5 Adaptive Version of Algorithm 2 for General Weakly Observable Graphs In this section, we introduce the adaptive version of Algorithm 2 for general weakly observable graphs. For conciseness, we ignore the dependence on s and d and prove that we can achieve the same result of Theorem 2.1.6. We also assume that the weakly dominating set D contains at least one node in S if S is not empty. Otherwise, we can add an arbitrary node in S to D and the new set is still a weakly dominant set (just with size increased by 1). Algorithm 19 shows the full pseudocode. Unlike the directed complete bipartite graphs case, here we do not need the clipping technique. Formally, we have the following theorem. 198 Theorem A.5.2. Algorithm 19 with δ = min{ 1 125 , 1 4d , 1 4s } guarantees: Reg = Oe L 1−γ D , if i ⋆ ∈ S. Oe L (1+γ)/2 D , if i ⋆ ∈ S. ¯ Proof. We call the time steps between two resets an epoch (indexed by λ) and let δλ, ηλ, and η¯λ be the value of δ, η, and η¯ during epoch λ so that δλ = 21−λ δ1, ηλ = δ 1/2γ λ , and η¯λ = δ (1+γ)/2γ λ . Also let λ ⋆ be the index of the last epoch. According to the analysis in Theorem 2.1.6, for fixed η, η¯ and δ, we have X T t=1 D pt − ei ⋆ , ℓbt E ≤ Oe 1 η + η PT t=1 ℓbt,i⋆ + δ PT t=1 P i∈D ℓbt,i + 1 T PT t=1 P i∈S ℓbt,i , if i ⋆ ∈ S. Oe 1 η + 1 η¯ + η¯ δ PT t=1 ℓbt,i⋆ + δ PT t=1 P i∈D ℓbt,i + 1 T PT t=1 P i∈S ℓbt,i , if i ⋆ ∈ S. Here, Oe(·) suppresses all the constant factors, log factors and the dependence on s and d. Now we consider the regret of epoch λ. We will prove that T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ Oe δ 1− 1 γ λ + 1 T PTλ+1 t=Tλ+1 P i∈S ℓbt,i , if i ⋆ ∈ S, Oe δ − 1 2 − 1 2γ λ + 1 T PTλ+1 t=Tλ+1 P i∈S ℓbt,i , if i ⋆ ∈ S. ¯ (A.19) If i ⋆ ∈ S, we have from the proof of Theorem 2.1.6 T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ Oe 1 ηλ + ηλ T Xλ+1 t=Tλ+1 ℓbt,i⋆ + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . Compare the terms PTλ+1 t=Tλ+1 ℓbt,i⋆ and PTλ+1 t=Tλ+1 P i∈D ℓbt,i. If PTλ+1 t=Tλ+1 ℓbt,i⋆ ≤ PTλ+1 t=Tλ+1 P i∈D ℓbt,i, then we have T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ 199 ≤ Oe 1 ηλ + δ 1 2γ λ + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i (ηλ = δ 1 2γ λ ) ≤ Oe δ − 1 2γ λ + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i (δλ ≤ 1 and γ ∈ [1/3, 1/2]) ≤ Oe δ − 1 2γ λ + δ 1− 1 γ λ + δλ X i∈D ℓb Tλ+1,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i (δ −1/γ λ > PTλ+1−1 τ=Tλ+1 P i∈D ℓbt,i) ≤ Oe δ 1− 1 γ λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . The last inequality is because for i ∈ D, if i has a self-loop, then ℓbt,i ≤ 1 pt,i ≤ 1 δλ ; if i does not have a self-loop, it can be observed either by all the other nodes or at least one node in D, which also means that ℓbt,i ≤ 1 δλ . Therefore, δλ P i∈D ℓb Tλ+1,i ≤ d. We also use δ −1/2γ λ ≤ δ 1−1/γ λ as δλ ≤ 1 and γ ∈ [1/3, 1/2]. If PTλ+1 t=Tλ+1 ℓbt,i⋆ ≥ PTλ+1 t=Tλ+1 P i∈D ℓbt,i, then let i0 ∈ D be the node with a self-loop in D and we have PTλ+1 t=Tλ+1 ℓbt,i⋆ ≥ PTλ+1 t=Tλ+1 ℓbt,i0 . Therefore, T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i0 ≤ Oe 1 ηλ + ηλ T Xλ+1 t=Tλ+1 ℓbt,i0 + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe δ − 1 2γ λ + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe δ 1− 1 γ λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . 
(A.20) The third inequality is because i0 ∈ D and also ηλ ≤ δλ, and the last inequality is by the same reason as in the previous case. 200 Next we consider the case i ⋆ ∈ S¯. We have again from the proof of Theorem 2.1.6: T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ Oe 1 ηλ + 1 η¯λ + η¯λ δλ T Xλ+1 t=Tλ+1 ℓbt,i⋆ + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i = Oe δ − 1 2 − 1 2γ λ + δ − 1 2 + 1 2γ λ T Xλ+1 t=Tλ+1 ℓbt,i⋆ + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i , where the last step is because ηλ = δ 1/2γ λ ≥ δ (γ+1)/2γ λ = ¯ηλ. Consider the following two cases. If S = ∅, then D contains at least one node in S¯, which means that mini∈S¯ PTλ+1 t=Tλ+1 ℓbt,i ≤ PTλ+1 t=Tλ+1 P i∈D ℓbt,i. Therefore, following similar steps as in previous cases, we have T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ T Xλ+1 t=Tλ+1 D pt , ℓbt E − min i∈S¯ T Xλ+1 t=Tλ+1 ℓbt,i ≤ Oe δ − 1 2 − 1 2γ λ + δ − 1 2 + 1 2γ λ min i∈S¯ T Xλ+1 t=Tλ+1 ℓbt,i + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe δ − 1 2 − 1 2γ λ + δ − 1 2 + 1 2γ λ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe δ − 1 2 − 1 2γ λ + δ − 1 2 + 1 2γ λ X i∈D ℓb Tλ+1,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe′ δ − 1 2 − 1 2γ λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . If S ̸= ∅, then D contains at least one node in S. Let this node be i0. Now we compare PTλ+1 t=Tλ+1 ℓbt,i⋆ with PTλ+1 t=Tλ+1 ℓbt,i0 . If PTλ+1 t=Tλ+1 ℓbt,i⋆ ≤ PTλ+1 t=Tλ+1 ℓbt,i0 , then we have T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ 201 ≤ Oe δ − 1 2 − 1 2γ λ + δ − 1 2 + 1 2γ λ T Xλ+1 t=Tλ+1 ℓbt,i0 + δλ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe δ − 1 2 − 1 2γ λ + δ − 1 2 + 1 2γ λ T Xλ+1 t=Tλ+1 X i∈D ℓbt,i + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe δ − 1 2 − 1 2γ λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i , where the second inequality is because PTλ+1 t=Tλ+1 ℓbt,i0 ≤ PTλ+1 t=Tλ+1 P i∈D ℓbt,i. Otherwise, according to Eq. (A.20), we have T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i⋆ ≤ T Xλ+1 t=Tλ+1 D pt , ℓbt E − T Xλ+1 t=Tλ+1 ℓbt,i0 ≤ Oe δ 1− 1 γ λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i . By combining all the above cases, we finish the proof of Eq. (A.19). Summing up the regret from λ = 1, 2, . . . , λ⋆ , we have for i ⋆ ∈ S: X T t=1 D pt , ℓbt E − X T t=1 ℓbt,i⋆ ≤ λ X⋆ λ=1 Oe δ 1− 1 γ λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe δ 1− 1 γ λ⋆ + 1 T X T t=1 X i∈S ℓbt,i! ≤ Oe δ 1− 1 γ 1 · 2 λ ⋆ −1+ 1 γ + 1 T X T t=1 X i∈S ℓbt,i! . and for i ⋆ ∈ S¯: X T t=1 D pt , ℓbt E − X T t=1 ℓbt,i⋆ ≤ λ X⋆ λ=1 Oe δ − 1 2 − 1 2γ λ + 1 T T Xλ+1 t=Tλ+1 X i∈S ℓbt,i ≤ Oe δ − 1 2 − 1 2γ λ⋆ + 1 T X T t=1 X i∈S ℓbt,i! ≤ Oe δ − 1 2 − 1 2γ 1 · 2 λ ⋆ 1 2 + 1 2γ + 1 T X T t=1 X i∈S ℓbt,i! . 202 Below we upper bound λ ⋆ . If λ ⋆ ≥ 2, consider the last round of epoch λ ⋆ − 1. According to the update rule, we have δ1 · 2 −λ ⋆+1− 1 γ = δ − 1 γ λ⋆−1 ≤ X Tλ⋆ t=Tλ⋆−1+1 X i∈D ℓbt,i ≤ X T t=1 X i∈D ℓbt,i. Plugging the above result to both cases, we have X T t=1 D pt − ei ⋆ , ℓbt E ≤ Oe 1 T PT t=1 P i∈S ℓbt,i + PT t=1 P i∈D ℓbt,i1−γ , if i ⋆ ∈ S, Oe 1 T PT t=1 P i∈S ℓbt,i + PT t=1 P i∈D ℓbt,i(1+γ) 2 ! , if i ⋆ ∈ S. ¯ Combining the case λ ⋆ = 1, taking expectation over both sides and using Jensen’s inequality, we have: Reg ≤ Oe E PT t=1 P i∈D ℓbt,i1−γ ≤ Oe L 1−γ D , if i ⋆ ∈ S, Oe E " PT t=1 P i∈D ℓbt,i(1+γ) 2 #! ≤ Oe L 1+γ 2 D , if i ⋆ ∈ S. ¯ This completes the proof. 
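To close this appendix, note that the adaptive variants above (Algorithms 16, 18, and 19) share one epoch structure: run the base update, accumulate an estimated-loss statistic within the current epoch, and halve the tuned parameter and reset once that statistic exceeds the threshold determined by the current parameter (e.g., the sum of ⟨p_t, ℓ̂_t⟩ against (κ+1)/η² in Algorithm 16, or the sum over D of ℓ̂_{t,i} against δ^{-1/γ} in Algorithm 19). The Python sketch below is only an illustrative skeleton of this doubling-trick loop, not the exact pseudocode of any of these algorithms; base_step, epoch_statistic, and threshold are hypothetical callbacks standing in for the algorithm-specific quantities.

import random

def doubling_trick(T, param_init, base_step, epoch_statistic, threshold):
    # Generic epoch/reset loop: play rounds with the current parameter, and once the
    # accumulated statistic of the epoch reaches threshold(param), halve the parameter
    # and start a new epoch (resetting the epoch statistics).
    param = param_init
    t = 1
    while t <= T:
        stats = []                      # per-round contributions collected in this epoch
        while t <= T:
            stats.append(base_step(t, param))
            t += 1
            if epoch_statistic(stats) >= threshold(param):
                param /= 2              # reset condition triggered: halve and restart
                break
    return param

# toy usage: halve eta whenever the epoch's summed (random) losses exceed 1/eta^2
final_eta = doubling_trick(
    T=1000, param_init=0.5,
    base_step=lambda t, eta: random.random(),
    epoch_statistic=sum,
    threshold=lambda eta: 1.0 / eta ** 2,
)
print(final_eta)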
203 Appendix B Omitted Details in Section 2.2 B.1 Omitted Details in Section 2.2.2 B.1.1 Proof of Theorem 2.2.1 In this section, we prove Theorem 2.2.1, which shows that the regret of the Exp3-IX algorithm [121] does not necessarily has linear dependence on the number of actions K (that appears in the original analysis), but is instead Oe((PT t=1 αt) 1/2 + maxt∈[T] αt) with high probability. First, we prove Lemma 2.2.1, which shows a tighter concentration between ℓbt and ℓt and is crucial to the improvement from Oe(K) to Oe(maxt∈[T] αt). Proof of Theorem 2.2.1. We first prove Eq. (2.13). Let Xt,1 = X i∈St pt,i Wt,i + γ (ℓbt,i − ℓt,i), Qt = X i∈St pt,i Wt,i + γ . According to the definition of ℓbt,i and the fact that ℓt ∈ [0, 1]K, we know that |Xt,1| ≤ X i∈St pt,i (Wt,i + γ) 1 γ + 1 ≤ 2Qt γ , 204 where we use the fact that γ ≤ 1 2 . Next, consider the term Et [X2 t,1 ]. Et [X2 t,1 ] ≤ Et X i∈St pt,i Wt,i + γ ℓbt,i!2 ≤ Et X i∈St pt,i (Wt,i + γ) 2 1{it ∈ N in t (i)} ! X j∈St pt,j (Wt,j + γ) 2 1{it ∈ N in t (j)} ≤ Et X i∈St pt,i (Wt,i + γ) 2 1{it ∈ N in t (i)} ! X j∈St pt,j (Wt,j + γ) 2 ≤ Qt γ Et "X i∈St pt,i (Wt,i + γ) 2 1{it ∈ N in t (i)} # ≤ Q2 t γ . Note that Qt ≤ K. Therefore, Xt,1 ≤ 2K γ and Et [X2 t,1 ] ≤ K2 γ . Then, using Lemma B.3.3, we know that with probability at least 1 − δ, X T t=1 Xt,1 ≤ 3 vuutX T t=1 Q2 t γ log 2K δ s T γ ! + 2 max{1, max t∈[T] Xt,1} log 2K δ s T γ ! . ≤ O X T t=1 Q2 t γUT + UT log KT δγ ! , where UT = max{1, maxt∈[T] Xt,1} and the last inequality is because of AM-GM inequality. Next, we prove Eq. (2.14). Let Xt,2 = P i∈St (Wt,i − 1{it ∈ Nin t (i)}) pt,iℓt,i Wt,i+γ . Direct calculation shows that |Xt,2| ≤ 2Qt . Consider its conditional variance: Et [X2 t,2 ] ≤ Et X i∈St pt,i Wt,i + γ 1{it ∈ N in t (i)} !2 = X i∈St pt,i X j∈St pt,j Wt,j + γ ≤ Qt . 205 Define ι1 = ln 2 maxt Qt+2√PT t=1 Qt δ . Applying Lemma B.3.3, we can obtain that with probability at least 1 − δ, X T t=1 Xt,2 ≤ O vuutX T t=1 Qtι1 + max t∈[T] Qtι1 . Next, we are ready to prove Theorem 2.2.1. Proof of Theorem 2.2.1. According to Eq. (2.12), for an arbitrary comparator j ∈ [K], we decompose the overall regret as follows: X T t=1 (ℓt,it − ℓt,j ) = X T t=1 D pt − ej , ℓbt E + X T t=1 (ℓt,it − ⟨pt , ℓt⟩) +X T t=1 ℓbt,j − ℓt,j + X T t=1 X K i=1 Wt,i − 1{it ∈ N in t (i)} pt,iℓt,i Wt,i + γ + X T t=1 X K i=1 γpt,iℓt,i Wt,i + γ . (B.1) According to the standard analysis of Exp3 [31], the first term of Eq. (B.1) can be bounded as follows: X T t=1 D pt − ej , ℓbt E ≤ log K η + η X T t=1 X K i=1 pt,iℓb2 t,i ≤ log K η + η X T t=1 X K i=1 pt,i Wt,i + γ ℓbt,i ≤ log K η + η X T t=1 X K i=1 pt,i Wt,i + γ ℓt,i + O η X T t=1 Q2 t maxτ∈[T] Qτ + η maxt∈[T] Qt γ log KT δγ ! , where the last inequality holds with probability at least 1 − δ according to Lemma 2.2.1. 20 According to standard Hoeffding-Azuma inequality, we know that with probability at least 1 − δ, the second term of Eq. (B.1) is bounded as follows: X T t=1 (ℓt,it − ⟨pt , ℓt⟩) ≤ O r T log 1 δ ! . Based on Corollary 1 in [121], with probability at least 1 − δ, the third term is bounded as follows: for all j ∈ [K], X T t=1 ℓbt,j − ℓt,j ≤ log(K/δ) 2γ . The fourth term of Eq. (B.1) can be bounded by using Lemma 2.2.1 and the final term of Eq. (B.1) is bounded by O(γ PT t=1 Qt) (recall that Qt = P i∈St pt,i Wt,i+γ ). Combining all the terms, we know that with probability at least 1 − 3δ, X T t=1 (ℓt,it − ℓt,j ) ≤ Oe 1 η + log(1/δ) γ + (η + γ) X T t=1 Qt + η maxt∈[T] Qt γ log 1 δ ! + Oe r T log 1 δ + vuutX T t=1 log 1 δ + max t∈[T] Qt log 1 δ . 
According to Lemma B.3.1, we know that Qt = Oe(αt). Finally, choosing η = γ = r log(1/δ) PT t=1 αt and picking δ ′ = δ 3 finishes the proof. B.1.2 Proof of Theorem 2.2.2 In this section, we prove our main result Theorem 2.2.2 in the strongly observable setting. To prove Theorem 2.2.2, according to Eq. (2.15), we can decompose the overall regret with respect to any j ∈ [K] as follows 207 X T t=1 (ℓt,it − ℓt,j ) ≤ X T t=1 ℓt,it − X T t=1 ⟨pet , ℓt⟩ ! | {z } term (1) + X T t=1 ⟨pet − pt , ℓt⟩ ! | {z } term (2) + X T t=1 ⟨pt − ej , ℓt⟩S¯t − X T t=1 Et D pt − ej , ℓbt E S¯t ! | {z } term (3) + X T t=1 Et D pt − ej , ℓbt E S¯t − X T t=1 D pt − ej , ℓbt E S¯t ! | {z } term (4) + X T t=1 D pt − ej , ℓt − ℓbt E St ! | {z } term (5) + X T t=1 D pt − ej , ℓbt E ! | {z } term (6) . (B.2) With the help of Hoeffding-Azuma’s inequality, we know that with probability at least 1 − δ, term (1) ≤ O( p T log(1/δ)). term (2) ≤ O(ηT) because of the definition of pet and pt . term (3) = 0 as ℓbt,i is an unbiased estimator of ℓt,i for i ∈ S¯ t . In the next three sections, we bound term (4), term (5) and term (6) respectively. Bounding term (4). Using Freedman’s inequality, we prove the following lemma: Lemma B.1.1. With probability at least 1 − δ, term (4) ≤ 2 + 4 η log 1 δ + 2 vuut 4T + X T t=1 1{t ∈ T , j ̸= jt} Wt,jt ! log 1 δ . Proof. Let Yt = Et D pt − ej , ℓbt E S¯t − D pt − ej , ℓbt E S¯t . Then, |Yt | ≤ Et D pt − ej , ℓbt E S¯t + 2X i∈S¯t pt,i K−1 K η ≤ 2 + 4 η . 208 If t /∈ T , then we know that Wt,i ≥ 1/2 for all i ∈ S¯ t and Et [Y 2 t ] ≤ Et D pt − ej , ℓbt E2 S¯t ≤ 4. If t ∈ T , then we know that Wt,i ≥ 1/2 for all i ∈ S¯ t except for i = jt . When j ̸= jt , we can bound Et [Y 2 t ] as follows: Et [Y 2 t ] ≤ Et D pt − ej , ℓbt E2 S¯t ≤ 2Et D pt , ℓbt E2 S¯t + 2Et D ej , ℓbt E2 S¯t ≤ 2Et X i∈S¯t pt,iℓb2 t,i + 2 Wt,jt ≤ 2Et X i∈S¯t pt,iℓbt,i Wt,i + 2 Wt,jt ≤ 4 + 4 Wt,jt . If j = jt , we know that D pt − ej , ℓbt E S¯t = P i∈S¯t,i̸=jt pt,iℓbt,i + (1 − pt,jt ) · 1 Wt,jt ≤ 2 + 1 1−η ≤ 4 as η ≤ 1 2 . Then we know that Et [Y 2 t ] ≤ 16. Therefore, according to Freedman’s inequality Lemma B.3.2, we know that with probability at least 1 − δ, X T t=1 Yt ≤ min λ∈[0,1/(2+4/η)] ( log(1/δ) λ + λ X T t=1 Et [Y 2 t ] ) ≤ min λ∈[0,1/(2+4/η)] ( log(1/δ) λ + λ 16T + 4X T t=1 1{t ∈ T , j ̸= jt} Wt,jt !) ≤ 2 + 4 η log 1 δ + 2 vuut 4T + X T t=1 1{t ∈ T , j ̸= jt} Wt,jt ! log 1 δ . Bounding term (5). The following lemma gives a bound on term (5). The proving technique is similar to the one that we use to bound the last three terms in Eq. (B.1). 209 Lemma B.1.2. With probability at least 1 − 2δ, term (5) ≤ Oe vuutX T t=1 Qt log 1 δ + max t∈[T] Qt log 1 δ + γ X T t=1 Qt + 1 γ log 1 δ . Proof. We bound PT t=1 D pt , ℓt − ℓbt E St and PT t=1 D ej , ℓt − ℓbt E St separately. Note that ℓbt,i is an underbiased estimator of ℓt,i for i ∈ St . Direct calculation shows that X T t=1 D pt , ℓt − ℓbt E St = X T t=1 X i∈St pt,i(ℓt,i − ℓbt,i) = X T t=1 X i∈St (Wt,i − 1{it ∈ N in t (i)}) pt,iℓt,i Wt,i + γ + X T t=1 X i∈St γ pt,iℓt,i Wt,i + γ . (B.3) Therefore, according to Lemma 2.2.1, with probability at least 1 − δ, X T t=1 D pt , ℓt − ℓbt E St ≤ Oe vuutX T t=1 Qt log 1 δ + max t∈[T] Qt log 1 δ + γ X T t=1 Qt . Next, consider the term − PT t=1 D ej , ℓt − ℓbt E St = PT t=1(ℓbt,j − ℓt,j ) · 1{j ∈ St}. 
Similar to the proof of Corollary 1 in [121], define ¯ℓt,i = ℓt,i Wt,i 1{it ∈ Nin t (i)} and we know that for any i ∈ [K], ℓbt,i1{i ∈ St} ≤ ℓt,i Wt,i + γ 1{it ∈ N in t (i), i ∈ St} ≤ ℓt,i Wt,i + γℓt,i 1{it ∈ N in t (i), i ∈ St} ≤ 1 2γ 2γℓt,i/Wt,i 1 + γℓt,i/Wt,i 1{it ∈ N in t (i), i ∈ St} ≤ 1 2γ log 1 + 2γ ¯ℓt,i 1{i ∈ St} = 1 2γ log 1 + 2γ ¯ℓt,i1{i ∈ St} . 2 Therefore, we know that Et h exp 2γℓbt,i1{i ∈ St} i ≤ Et 1 + 2γ ¯ℓt,i1{i ∈ St} = 1 + 2γℓt,i1{i ∈ St} ≤ exp (2γℓt,i1{i ∈ St}). Define Zt = exp 2γ1{i ∈ St} ℓbt,i − ℓt,i and according to previous analysis, we know that Zt is a super-martingale and by Markov inequality, we obtain that Pr "X T t=1 ℓbt,i − ℓt,i 1{i ∈ St} > ε# = Pr " exp 2γ X T t=1 ℓbt,i − ℓt,i 1{i ∈ St} ! > exp(2γε) # ≤ exp(−2γε). Taking a union bound over i ∈ [K], we know that with probability at least 1 − δ, for all i ∈ [K], X T t=1 ℓbt,i − ℓt,i 1{i ∈ St} ≤ log(K/δ) 2γ . (B.4) Combining both parts gives the bound for term (5): with probability at least 1 − 2δ, term (5) ≤ Oe vuutX T t=1 Qt log 1 δ + max t∈[T] Qt log 1 δ + γ X T t=1 Qt + 1 γ log 1 δ . (B.5) Bounding term (6). For completeness, before bounding term (6), we show the following OMD analysis lemma. 211 Lemma B.1.3. Suppose that p ′ = argminp∈∆K {⟨p, ℓ⟩ + Dψ(p, pt)} with ψ(p) = 1 η PK i=1 pi ln pi . If ηℓi ≥ −3 for all i ∈ [K], then for any u ∈ ∆K, the following inequality hold: ⟨p − u, ℓ⟩ ≤ Dψ(u, p) − Dψ(u, p′ ) + 2η X K i=1 piℓ 2 i . Proof. Let qi = pi exp(−ηℓi) and direct calculation shows that p ′ = argminp∈∆K Dψ(p, q) and for all u ∈ ∆K, ⟨p − u, ℓ⟩ = Dψ(u, p) − Dψ(u, q) + Dψ(p, q) ≤ Dψ(u, p) − Dψ(u, p′ ) + Dψ(p, q), where the second step uses the generalized Pythagorean theorem. On the other hand, using the inequality exp(−x) ≤ 1 − x + 2x 2 for any x ≥ −3, we know that Dψ(p, q) = 1 η X K i=1 pi ln pi qi + qi − pi ≤ 1 η X K i=1 pi (exp(−ηℓi) − 1 + ηℓi) ≤ 2η X K i=1 piℓ 2 i , where the inequality is because ηℓi ≥ −3. Now we are ready to bound term (6) as follows. Lemma B.1.4. With probability at least 1 − 2δ, term (6) ≤ Oe 1 η + ηT + η X T t=1 Qt + log 1 δ + 48p ηT + β 2T + η maxt∈[T] Qt γ log 1 δ ! − β X T t=1 pt,jt Wt,jt 1{t ∈ T } + β X T t=1 1{t ∈ T , j = jt} Wt,jt . 212 Proof. Recall that according to the definition in Algorithm 3, T ≜ {t | there exists jt ∈ S¯ t and pt,jt > 1/2}. To apply Lemma B.1.3, we first need to verify the scale of ℓbt + bt − zt where zt = ℓbt,jt + bt,jt if t ∈ T . If t /∈ T , then we know that for all i ∈ [K], η(ℓbt,i + bt,i) = ηℓbt,i ≥ 0. If t ∈ T , note that with an η amount of uniform exploration, ηzt = η ℓbt,jt + bt,jt ≤ η(1 + β) · 1 K−1 K η ≤ 2(1 + β) ≤ 3, where the second inequality is because K ≥ 2 and the last inequality is because β ≤ 1 2 . Therefore, we know that η(ℓbt,i + bt,i − zt) ≥ −3 for all i ∈ [K]. Therefore, applying Lemma B.1.3 with p = pt and p ′ = pt+1 and taking summation over t ∈ [T], we know that for any u ∈ ∆K, X T t=1 D pt − u, ℓbt + bt E ≤ X T t=1 (Dψ(u, pt) − Dψ(u, pt+1)) + 2η X T t=1 X K i=1 pt,iℓb2 t,i · 1{t /∈ T } + 2η X T t=1 X K i=1 pt,i ℓbt,i + bt,i − zt 2 · 1{t ∈ T } ≤ Dψ(u, p1) η + 2η X T t=1 X K i=1 pt,iℓb2 t,i1{t /∈ T } + 6η X T t=1 X i̸=jt pt,i ℓb2 t,i + b 2 t,i + z 2 t 1{t ∈ T } = Dψ(u, p1) η + 2η X T t=1 X K i=1 pt,iℓb2 t,i1{t /∈ T } + 6η X T t=1 X i̸=jt pt,i ℓb2 t,i + z 2 t 1{t ∈ T }. 213 For t /∈ T , we know that ℓbt,i ≤ 2 for i ∈ S¯ t and X i∈S¯t pt,iℓb2 t,i ≤ 4 X i /∈St pt,i ≤ 4, X i∈St pt,iℓb2 t,i ≤ X i∈St pt,i Wt,i + γ ℓbt,i. 
For t ∈ T , we know that ℓbt,i ≤ 2 for all i ∈ S¯ t\{jt} and X i∈St pt,iℓb2 t,i ≤ X i∈St pt,i Wt,i + γ ℓbt,i, X i∈S¯t,i̸=jt pt,iℓb2 t,i ≤ 4 X i∈S¯t,i̸=jt pt,i ≤ 4, X i̸=jt pt,iz 2 t ≤ Wt,jt · ℓbt,jt + bt,jt 2 ≤ 2Wt,jt ℓb2 t,jt + 2 β 2 Wt,jt ≤ 2ℓbt,jt + 4β 2 η , where the last inequality uses the fact that Wt,jt ≥ K−1 K η ≥ 1 2 η. For any j ∈ [K], let u = ej ∈ ∆K. Combining all the above inequalities, we can obtain that X T t=1 D pt − ej , ℓbt + bt E ≤ Dψ(ej , p1) η + 24ηT + 6η X T t=1 X i∈St pt,i Wt,i + γ ℓbt,i + 12η X T t=1 ℓbt,jt1{t ∈ T } + 24β 2X T t=1 1{t ∈ T } ≤ Dψ(ej , p1) η + 24ηT + 6η X T t=1 X i∈St pt,i Wt,i + γ ℓbt,i + 12η X T t=1 ℓbt,jt1{t ∈ T } + 24β 2T. (B.6) We first bound the term PT t=1 ℓbt,jt1{t ∈ T }. Let Zt = ℓbt,jt1{t ∈ T } − ℓt,jt1{t ∈ T }. We know that Et [Zt ] = 0 and |Zt | ≤ 1 K−1 K η ≤ 2 η . In addition, Et Z 2 t ≤ Et " 1 W2 t,jt · 1{it ̸= jt} # · 1{t ∈ T } = 1{t ∈ T } Wt,jt . 214 Therefore, by Freedman’s inequality (Lemma B.3.2), we can obtain that with probability at least 1 − δ, X T t=1 Zt ≤ min λ∈[0, η 2 ] ( ln(1/δ) λ + λ X T t=1 Et [Z 2 t ] ) ≤ 2 ln(1/δ) η + 2 vuutX T t=1 1{t ∈ T } Wt,jt ≤ 2 ln(1/δ) η + 4s T η . Combining with Eq. (B.6), we know that with probability at least 1 − 2δ X T t=1 D pt − ej , ℓbt E ≤ Dψ(ej , p1) η + 24ηT + 6η X T t=1 X i∈St pt,i Wt,i + γ ℓbt,i + 24 log(1/δ) + 48p ηT + 24β 2T − X T t=1 ⟨pt − ej , bt⟩ ≤ log K η + 24ηT + 6η X T t=1 Qt + 24 log 1 δ + 48p ηT + 24β 2T + Oe η X T t=1 Qt + η maxt∈[T] Qt γ log 1 δ ! − X T t=1 ⟨pt − ej , bt⟩ ≤ Oe 1 η + ηT + η X T t=1 Qt + log 1 δ + p ηT + β 2T + η maxt∈[T] Qt γ log 1 δ ! − β X T t=1 pt,jt Wt,jt 1{t ∈ T } + β X T t=1 1{t ∈ T , j = jt} Wt,jt , where the second inequality is because of Lemma 2.2.1 and the choice of p1 = 1 K · 1. With Lemma B.1.1, Lemma B.1.2 and Lemma B.1.4 on hand, we are ready to prove Theorem 2.2.2. Proof of Theorem 2.2.2. According to Eq. (B.2), Lemma B.1.1, Lemma B.1.2, Lemma B.1.4, and the bounds on term (1), term (2) and term (3), we know that with probability at least 1 − 6δ, for any j ∈ [K], X T t=1 (ℓt,it − ℓt,j ) ≤ Oe r T log 1 δ ! + O(ηT) + 2 + 4 η log 1 δ + 2 vuut 4T + X T t=1 1{t ∈ T , j ̸= jt} Wt,jt ! log 1 δ + Oe vuutX T t=1 Qt log 1 δ + max t∈[T] Qt log 1 δ + γ X T t=1 Qt + log(1/δ) γ 215 + Oe 1 η + ηT + η X T t=1 Qt + log 1 δ + 48p ηT + β 2T + η maxt∈[T] Qt γ log 1 δ ! − β X T t=1 pt,jt Wt,jt 1{t ∈ T } + β X T t=1 1{t ∈ T , j = jt} Wt,jt ≤ Oe 1 η + log(1/δ) γ + (η + γ) X T t=1 Qt + vuutX T t=1 Qt log 1 δ + p ηT + β 2T + η γ + 1 max t∈[T] Qt log 1 δ + 2 vuut 4T + X T t=1 1{t ∈ T , j ̸= jt} Wt,jt ! log 1 δ − β X T t=1 pt,jt Wt,jt 1{t ∈ T } + β X T t=1 1{t ∈ T , j = jt} Wt,jt . (B.7) Consider the last three terms: 2 vuut 4T + X T t=1 1{t ∈ T , j ̸= jt} Wt,jt ! log 1 δ − β X T t=1 pt,jt Wt,jt 1{t ∈ T } + β X T t=1 1{t ∈ T , j = jt} Wt,jt = 2 vuut 4T + X T t=1 1{t ∈ T , j ̸= jt} Wt,jt ! log 1 δ − β X T t=1 pt,jt Wt,jt 1{t ∈ T , j ̸= jt} + β X T t=1 (1 − pt,jt )1{t ∈ T , j = jt} Wt,jt ≤ O r T log 1 δ ! + 1 β + β X T t=1 P i̸=jt pt,i1{t ∈ T , j = jt} (1 − η) P i̸=jt pt,i ≤ O r T log 1 δ + 1 β + βT! . where the first inequality uses the AM-GM inequality and the second inequality uses the fact that η ≤ 1 2 . Combining with Eq. (B.7), we obtain X T t=1 (ℓt,it − ℓt,j ) ≤ Oe 1 η + log(1/δ) γ + (η + γ) X T t=1 Qt + vuutX T t=1 Qt log 1 δ + p ηT + β 2T + η γ + 1 max t∈[T] Qt log 1 δ + O r T log 1 δ + 1 β + βT! . 216 Using Lemma B.3.1, we know that Qt ≤ 2αt log 1 + ⌈K2/γ⌉ + K αt + 2 ≤ 4αt log 1 + ⌈K2/γ⌉ + K αt = Oe(αt). 
Picking η = β = γ = 1/ qPT t=1 αt log(1/δ), we achieve that with probability at least 1 − 6δ, RegT ≤ Oe vuutX T t=1 αt log 1 δ + max t∈[T] αt ln 1 δ . This finishes our proof. B.2 Proofs for Section 2.2.3 In this section, we prove Lemma 2.2.2 and Lemma 2.2.3. The key of the proof is to use a careful analysis of Freedman’s inequality with the help of uniform exploration and implicit exploration. Proof. of Lemma 2.2.2. Recall that ⟨p, ℓ⟩S ≜ P i∈S piℓi for any S ⊆ [K]. Therefore, we decompose the target PT t=1⟨pt , ℓt − ℓbt⟩ as follows X T t=1 D pt , ℓt − ℓbt E = X T t=1 D pt , ℓt − ℓbt E St + X T t=1 D pt , ℓt − ℓbt E S¯t . Bounding PT t=1⟨pt , ℓt − ℓbt⟩St : We proceeds as follows X T t=1 X i∈St pt,i(ℓt,i − ℓbt,i) = X T t=1 X i∈St γ pt,iℓt,i Wt,i + γ + X T t=1 X i∈St (Wt,i − 1{it ∈ N in t (i)}) pt,iℓt,i Wt,i + γ . 217 Recall that Qt = P i∈St pt,i Wt,i+γ . As ℓt ∈ [0, 1]K, it is clear that P i∈St γ pt,iℓt,i Wt,i+γ ≤ γQt . To bound the second term, according to Lemma 2.2.1, let ι1 = log 2 maxt Qt+2√PT t=1 Qt δ ′ , we know that with probability at least 1 − δ ′ , X T t=1 X i∈St Wt,i − 1{it ∈ N in t (i)} pt,iℓt,i Wt,i + γ ≤ O vuutX T t=1 Qtι1 + max t∈[T] Qtι1 . Consider the subgraph Get of Gt = ([K], Et) where Get = (St , Eet) and Eet ⊆ Et is the set of edges with respect to the nodes in St . Applying Lemma B.3.1 on the subgraph Get , we know that Qt ≤ X i∈S¯t pt,i pt,i + P j∈Nin t (i) pt,j + γ ≤ Oe(αt). (B.8) Combining all the above equations, we know that with probability at least 1 − δ ′ , X T t=1 D pt , ℓt − ℓbt E ≤ O γ X T t=1 Qt + vuutX T t=1 Qtι1 + max t∈[T] Qtι1 ≤ Oe γ X T t=1 αet + vuutX T t=1 αetι1 + max t∈[T] αetι1 . (B.9) Bounding PT t=1⟨pt , ℓt − ℓbt⟩S¯t : Since the loss estimators for nodes without a self-loop are unbiased, we directly apply Lemma B.3.2 to bound PT t=1⟨pt , ℓt − ℓbt⟩S¯t . Note that X i∈S¯t pt,i(ℓt,i − ℓbt,i) ≤ X i∈S¯t pt,iℓt,i ≤ 1 Et X i∈S¯t pt,i(ℓt,i − ℓbt,i) 2 ≤ Et X i∈S¯t pt,iℓbt,i 2 ≤ 1 ε Et X i∈S¯t pt,iℓbt,i ≤ 1 ε , 21 where the second inequality is because ℓbt,i ≤ 1 ε for all i ∈ S¯ t . Therefore, using Lemma B.3.2, we obtain that with probability at least 1 − δ ′ X T t=1 X i∈S¯t pt,i(ℓt,i − ℓbt,i) ≤ 2 r T ε ln(1/δ′) + ln(1/δ′ ). (B.10) Let δ ′ = δ 2 . Combining Eq. (B.9), Eq. (B.10), we prove the result of first part. For the second part, we consider the cases where i ∈ St and i ∈ S¯ t separately. X T t=1 ℓbt,i − ℓt,i = X T t=1 ℓbt,i − ℓt,i 1{i ∈ St} + X T t=1 ℓbt,i − ℓt,i 1{i ∈ S¯ t}. The analysis for the first term is the same as Eq. (B.4) and we can obtain that with probability at least 1−δ ′ , for all i ∈ [K], X T t=1 ℓbt,i − ℓt,i 1{i ∈ St} ≤ log(K/δ′ ) 2γ . (B.11) For the second term, note that (ℓbt,i − ℓt,i)1{i ∈ S¯ t} ≤ 1 ε as ℓbt,i ≤ 1 ε for all i ∈ S¯ t . In addition, the conditional variance is bounded as follows Et (ℓbt,i − ℓt,i)1{i ∈ S¯ t} 2 ≤ Et h ℓb2 t,i1{i ∈ S¯ t} i ≤ 1 ε . Using Lemma B.3.2 and an union bound over [K], for all i ∈ [K], we have that with probability at least 1 − δ ′ X T t=1 (ℓbt,i − ℓt,i)1{i ∈ S¯ t} ≤ r T ε ln K δ ′ + 1 ε ln K δ ′ . (B.12) Combining Eq. (B.11) and Eq. (B.12) and picking δ ′ = δ 2 , we prove the second part. 219 Next, we prove Lemma 2.2.3, which bounds the estimated regret term (d) in Eq. (2.23). Proof. of Lemma 2.2.3. We apply standard OMD analysis [31] and obtain that term (d) ≤ log K η + η X T t=1 X K i=1 pt,iℓb2 t,i ≤ log K η + η X T t=1 X i∈St pt,iℓb2 t,i + η X T t=1 X i∈S¯t pt,iℓb2 t,i. 
For the second term, using Lemma 2.2.1, we know that with probability at least 1 − δ ′ η X T t=1 X i∈St pt,iℓb2 t,i ≤ η X T t=1 X i∈St pt,i Wt,i + γ ℓbt,i ≤ η X T t=1 X i∈St pt,i Wt,i + γ ℓt,i + Oe η X T t=1 Qt + max t∈[T] Qt γ log 1 δ ′ ! ≤ Oe η X T t=1 αet + η γ max t αet ln 1 δ ′ ! . (B.13) For the third term, we decompose it as η X T t=1 X i∈S¯t pt,iℓb2 t,i ≤ η X T t=1 X i∈S¯t pt,i Wt,i ℓbt,i ≤ η X T t=1 X i∈S¯t pt,i Wt,i (ℓbt,i − ℓt,i) | {z } term (i) +η X T t=1 X i∈S¯t pt,i Wt,i ℓt,i | {z } term (ii) . To bound term (i), note that with uniform exploration on the dominating set, P i∈S¯t pt,i Wt,i (ℓbt,i − ℓt,i) ≤ P i∈S¯t pt,i Wt,i ℓbt,i ≤ 1 ε 2 . Next, we consider the conditional variance: Et X i∈S¯t pt,i Wt,i (ℓbt,i − ℓt,i) 2 ≤ Et X i∈S¯ pt,i W2 t,i 1{it ∈ N in t (i)} X j∈S¯t pt,j W2 t,j 1{it ∈ N in t (j)} ≤ 1 ε 3 . 220 Using Freedman’s inequality Lemma B.3.2, we know that with probability at least 1 − δ ′ , η X T t=1 X i∈S¯t pt,i Wt,i (ℓbt,i − ℓt,i) ≤ η r T ε 3 ln 1 δ ′ + η ε 2 ln 1 δ ′ . (B.14) For term (ii), we directly bound it by noticing that Wt,i ≥ ε for all i ∈ S¯ t η X T t=1 X i∈S¯t pt,i Wt,i ℓt,i ≤ ηT ε . (B.15) Combining Eq. (B.13), Eq. (B.14), Eq. (B.15) and picking δ ′ = δ 2 , we finish the proof. B.3 Auxiliary Lemmas In this section, we show several auxiliary lemmas that are useful in the analysis. Lemma B.3.1 (Lemma 1 in [91]). Let G = (V, E) be a directed graph with |V | = K, in which each node i ∈ V is assigned a positive weight wi . Assume that P i∈V wi ≤ 1, then X i∈V wi wi + P j∈Nin(i) wj + γ ≤ 2α log 1 + ⌈K2/γ⌉ + K α + 2, where α is the independence number of G. Lemma B.3.2 (Freedman’s inequality, Theorem 1 [28]). Let X1, X2, . . . , XT be a martingale difference sequence with respect to a filtration F1 ⊆ F2 ⊆ . . . FT such that E[Xt |Ft ] = 0. Assume for all t, Xt ≤ R. Let V = PT t=1 E[X2 t |Ft ]. Then for any δ > 0, with probability at least 1−δ, we have the following guarantee: X T t=1 Xt ≤ inf λ∈ 0, √ e−2 R √ ln(1/δ) p (e − 2) ln(1/δ) λV + 1 λ = inf λ′∈[0, 1 R ] ln(1/δ) λ′ + (e − 2)λ ′V . 221 Lemma B.3.3 (Strengthened Freedman’s inequality, Theorem 9 [154]). Let X1, X2, . . . , XT be a martingale difference sequence with respect to a filtration F1 ⊆ F2 ⊆ . . . FT such that E[Xt |Ft ] = 0 and assume E[|Xt ||Ft ] < ∞ a.s. Then with probability at least 1 − δ X T t=1 Xt ≤ 3 s VT log 2 max{UT , √ VT } δ + 2UT log 2 max{UT , √ VT } δ , where VT = PT t=1 E[X2 t |Ft ], UT = max{1, maxs∈[T] Xs}. 222 Appendix C Omitted Details in Section 2.3 C.1 Omitted Details in Section 2.3.3 Theorem 2.3.1. Suppose decγ(pt ; fbt , xt , Gt) ≤ Cγ−β for all t ∈ [T] and some β > 0, Algorithm 5 with γ = max{4,(CT) 1 β+1 Reg − 1 β+1 Sq } guarantees that E [RegCB] ≤ O C 1 β+1 T 1 β+1 Reg β β+1 Sq . Proof. Following [59], we decompose RegCB as follows: E[RegCB] = E "X T t=1 f ⋆ (xt , at) − X T t=1 f ⋆ (xt , π⋆ (xt))# = E "X T t=1 f ⋆ (xt , at) − f ⋆ (xt , π⋆ (xt)) − γ 4 EA∼Gt(·|at) "X a∈A fbt(xt , a) − f ⋆ (xt , a) 2 #!# + γ 4 E "X T t=1 EA∼Gt(·|at) "X a∈A fbt(xt , a) − f ⋆ (xt , a) 2 ## ≤ E X T t=1 max a ⋆∈[K] f∈(X ×[K]7→R) Eat∼pt " f(xt , at) − f(xt , a⋆ ) − γ 4 EA∼Gt(·|at) "X a∈A fbt(xt , a) − f(xt , a) 2 ## + γ 4 E "X T t=1 EA∼Gt(·|at) "X a∈A fbt(xt , a) − f ⋆ (xt , a) 2 ## = E "X T t=1 decγ(pt ; fbt , xt , Gt) # + γ 4 E "X T t=1 EA∼Gt(·|at) "X a∈A fbt(xt , a) − f ⋆ (xt , a) 2 ## (C.1) 223 ≤ CT γ−β + γ 4 E "X T t=1 EA∼Gt(·|at) "X a∈A fbt(xt , a) − f ⋆ (xt , a) 2 ##. 
Next, since E[ℓt,a | xt ] = f ⋆ (xt , a) for all t ∈ [T] and a ∈ A, we know that E "X T t=1 EA∼Gt(·|at) "X a∈A fbt(xt , a) − f ⋆ (xt , a) 2 ## = E "X T t=1 EA∼Gt(·|at) "X a∈A fbt(xt , a) − ℓt,a2 − X a∈A (f ⋆ (xt , a) − ℓt,a) 2 ## ≤ RegSq, (C.2) where the final inequality is due to Assumption 2. Therefore, we have E[RegCB] ≤ CT γ−β + γ 4 RegSq. Picking γ = max 4, CT RegSq 1 β+1 , we obtain that E [RegCB] ≤ O C 1 β+1 T 1 β+1 Reg β β+1 Sq . C.1.1 Proof of Theorem 2.3.2 Before proving Theorem 2.3.2, we first show the following key lemma, which is useful in proving that decγ(p; f, x, G b ) is convex for both strongly and weakly observable feedback graphs G. We highlight that the convexity of decγ(p; f, x, G b )is crucial for both proving the upper bound of minp∈∆(K) decγ(p; f, x, G b ) and showing the efficiency of Algorithm 5. 224 Lemma C.1.1. Suppose u, v, x ∈ R d with ⟨v, x⟩ > 0. Then both g(x) = ⟨u,x⟩ 2 ⟨v,x⟩ and h(x) = (1−⟨u,x⟩) 2 ⟨v,x⟩ are convex in x. Proof. The function f(x, y) = x 2/y is convex for y > 0 due to ∇2 f(x, y) = 2 y 3 y −x y −x ⊤ ⪰ 0. By composition with affine functions, both g(x) = f(⟨u, x⟩,⟨v, x⟩) and h(x) = f(1 − ⟨u, x⟩,⟨v, x⟩) are convex. Theorem 2.3.2 (Strongly observable graphs). Suppose that the feedback graph Gt is deterministic and strongly observable with independence number no more than α. Then Algorithm 5 guarantees that decγ(pt ; fbt , xt , Gt) ≤ O α log(Kγ) γ . Proof. For conciseness, we omit the subscript t. Direct calculation shows that for all a ⋆ ∈ [K], Ea∼p f ⋆ (x, a) − f ⋆ (x, a⋆ ) − γ 4 X a ′∈Nin(G,a) (fb(x, a′ ) − f ⋆ (x, a′ ))2 = X K a=1 paf ⋆ (x, a) − f ⋆ (x, a⋆ ) − γ 4 X K a=1 Wa fb(x, a) − f ⋆ (x, a) 2 , where Wa = P a ′∈Nin(G,a) pa ′. Therefore, taking the gradient over f ∗ (x, ·) and we know that sup f ⋆∈(X ×[K] 7→ R) "X K a=1 paf ⋆ (x, a) − f ⋆ (x, a⋆ ) − γ 4 X K a=1 Wa fb(x, a) − f ⋆ (x, a) 2 # = X K a=1 pafb(x, a) − fb(x, a⋆ ) + 1 γ ∥p − ea ⋆ ∥ 2 diag(W)−1 . 225 Then, denote fb∈ R K to be fb(x, ·) and consider the following minimax form: inf p∈∆(K) sup a ⋆∈A (X K a=1 pafb(x, a) − fb(x, a⋆ ) + 1 γ ∥p − ea ⋆ ∥ 2 diag(W)−1 ) = min p∈∆(K) max a ⋆∈A X K a=1 pafb(x, a) − fb(x, a⋆ ) + 1 γ X a̸=a ⋆ p 2 a Wa + 1 γ (1 − pa ⋆ ) 2 Wa ⋆ (C.3) = min p∈∆K max q∈∆K (X K a=1 (pa − qa)fba + 1 γ X K a=1 p 2 a (1 − qa) Wa + X K a=1 qa(1 − pa) 2 γWa ) (C.4) = max q∈∆K min p∈∆K (X K a=1 (pa − qa)fba + 1 γ X K a=1 p 2 a (1 − qa) Wa + X K a=1 qa(1 − pa) 2 γWa ) , (C.5) where the last equality is due to Sion’s minimax theorem and the fact that Eq. (C.3) is convex in p ∈ ∆(K) by applying Lemma C.1.1 with u = ea and v = ga for each a ∈ [K], where ga ∈ {0, 1} K is defined as ga,i = 1{(i, a) ∈ E}, G = ([K], E), ∀i ∈ [K]. Choose pa = (1 − 1 γ )qa + 1 γK for all a ∈ [K]. Let S be the set of nodes in [K] that have a self-loop. 
Then we can upper bound the value above as follows: max q∈∆(K) min p∈∆(K) (X K a=1 (pa − qa)fba + 1 γ X K a=1 p 2 a (1 − qa) Wa + X K a=1 qa(1 − pa) 2 γWa ) ≤ max q∈∆(K) 2 γ + 1 γ X K a=1 (1 − 1 γ )qa + 1 γK 2 (1 − qa) + qa 1 − (1 − 1 γ )qa − 1 γK 2 Wa ≤ max q∈∆(K) 2 γ + 1 γ X K a=1 2 (1 − 1 γ ) 2 q 2 a + 1 γ 2K2 (1 − qa) + qa 1 − (1 − 1 γ )qa 2 Wa ≤ max q∈∆(K) 2 γ + 2 γ 2 + 1 γ X K a=1 2q 2 a (1 − qa) + 2qa (1 − qa) 2 + 2q 3 a γ 2 Wa (Wa = P j∈Nin(G,a) pj ≥ 1 γK for all a ∈ [K]) ≤ max q∈∆(K) ( 2 γ + 2 γ 2 + 2 γ X K a=1 qa(1 − qa) Wa + 2 γ 3 X K a=1 q 3 a Wa ) = max q∈∆(K) ( 2 γ + 2 γ 2 + 2 γ X K a=1 qa(1 − qa) Wa + 2 γ 3 X a∈S q 3 a Wa + 2 γ 3 X a /∈S q 3 a Wa ) (C.6) 226 ≤ max q∈∆(K) ( 2 γ + 2 γ 2 + 2 γ X K a=1 qa(1 − qa) Wa + 2 γ 3 X a∈S q 2 a + 2 γ 3 X a /∈S q 3 a K−1 γK ) (if a /∈ S, Wa = 1 − pa ≥ K−1 γK ) ≤ max q∈∆(K) ( 8 γ + 2 γ X K a=1 qa(1 − qa) Wa ) . (K ≥ 2) Next we bound 2qa(1−qa) Wa for each a ∈ [K]. If a ∈ [K]\S, we have Wa = 1 − pa and 2qa(1 − qa) Wa ≤ 2qa(1 − qa) 1 − (1 − 1 γ )qa − 1 γK ≤ 2qa(1 − qa) (1 − 1 γ )(1 − qa) + K−1 γK ≤ 2 1 − 1 γ qa ≤ 4qa. (C.7) If a ∈ S, we know that X a∈S 2qa(1 − qa) Wa ≤ X a∈S 2qa(1 − qa) P j∈Nin(G,a) ((1 − 1 γ )qj + 1 γK ) ≤ γ γ − 1 X a∈S 2((1 − 1 γ )qa + 1 γK )(1 − qa) P j∈Nin(G,a) ((1 − 1 γ )qj + 1 γK ) ≤ 4 X a∈S ((1 − 1 γ )qa + 1 γK ) P j∈Nin(G,a) ((1 − 1 γ )qj + 1 γK ) ≤ O(α log(Kγ)), (C.8) where the last inequality is due to Lemma 5 in [12]. We include this lemma (Lemma D.1.2) for completeness. Combining all the above inequalities, we obtain that inf p∈∆(K) sup a ⋆∈A (X K a=1 pafb(x, a) − fb(x, a⋆ ) + 1 γ ∥p − ea ⋆ ∥ 2 diag(W)−1 ) = max q∈∆(K) min p∈∆(K) (X K a=1 (pa − qa)fba + 1 γ X K a=1 p 2 a (1 − qa) Wa + X K a=1 qa(1 − pa) 2 γWa ) ≤ max q∈∆(K) ( 8 γ + 2 γ X K a=1 qa(1 − qa) Wa ) ≤ O α log(Kγ) γ . 227 C.1.2 Proof of Theorem 2.3.3 Theorem 2.3.3 (Weakly observable graphs). Suppose that the feedback graph Gt is deterministic and weakly observable with weak domination number no more than d. Then Algorithm 5 with γ ≥ 16d guarantees that decγ(pt ; fbt , xt , Gt) ≤ O s d γ + αe log(Kγ) γ ! , where αe is the independence number of the subgraph induced by nodes with self-loops in Gt . Proof. Similar to the strongly observable graphs setting, for weakly observable graphs, we know that decγ(p; f, x, G b ) = max q∈∆K min p∈∆K (X K a=1 (pa − qa)fba + 1 γ X K a=1 p 2 a (1 − qa) Wa + X K a=1 qa(1 − pa) 2 γWa ) . (C.9) Choose pa = (1− 1 γ −ηd)qa + 1 γK +η1{a ∈ D} where D with |D| = d is the minimum weak dominating set of G and 0 < η ≤ 1 4d is some parameter to be chosen later. Substituting the form of p to Eq. (C.9) and using the fact that |fba| ≤ 1 for all a ∈ [K], we can obtain that decγ(p; f, x, G b ) ≤ max q∈∆K ( 2 γ + ηd + 1 γ X K a=1 p 2 a (1 − qa) Wa + X K a=1 qa(1 − pa) 2 γWa ) . Then we can upper bound the value above as follows: decγ(p; f, x, G b ) ≤ max q∈∆K 2 γ + ηd + 1 γ X K a=1 (1 − 1 γ − ηd)qa + 1 γK + η1{a ∈ D} 2 (1 − qa) Wa 228 + X K a=1 qa 1 − (1 − 1 γ − ηd)qa 2 Wa ≤ max q∈∆K 2 γ + ηd + 1 γ X a /∈D qa + 1 γK 2 (1 − qa) + qa (1 − qa) + 1 γ qa + ηdqa 2 Wa + 1 γ X a∈D qa + 1 γK + η 2 (1 − qa) + qa (1 − qa) + 1 γ qa + ηdqa 2 Wa ≤ max q∈∆K 2 γ + ηd + 1 γ X a /∈D 2 q 2 a + 1 γ 2K2 (1 − qa) + 3qa (1 − qa) 2 + q 2 a γ 2 + η 2d 2 q 2 a Wa + 1 γ X a∈D 3 q 2 a + 1 γ 2K2 + η 2 (1 − qa) + 3qa (1 − qa) 2 + q 2 a γ 2 + η 2d 2 q 2 a Wa . (C.10) Now consider a ∈ D. If a ∈ S, then we know that Wa ≥ η; Otherwise, we know that this node can be observed by at least one node in D, meaning that Wa ≥ η. 
Combining the two cases above, we know that 1 γ X a∈D 3 q 2 a + 1 γ 2K2 + η 2 (1 − qa) + 3qa (1 − qa) 2 + 1 γ 2 q 2 a + η 2d 2 q 2 a Wa ≤ 3 ηγ X a∈D q 2 a + 1 γ 2K2 + η 2 (1 − qa) + qa (1 − qa) 2 + 1 γ 2 q 2 a + η 2 d 2 q 2 a ≤ 3 ηγ X a∈D qa − q 2 a + 1 γ 2 q 3 a + η 2 d 2 q 3 a + 1 γ 2K2 + η 2 ≤ O 1 ηγ + dη γ + 1 ηγ3K (η ≤ 1 4d and γ ≥ 16d) ≤ O 1 ηγ , (C.11) 229 where the last inequality is because η ≤ 1 4d and γ ≥ 16d. Consider a /∈ D. Let S0 be the set of nodes which either have a self loop or can be observed by all the other node. Recall that S represents the set of nodes with a self-loop. Then similar to the derivation of Eq. (C.6), we know that for a ∈ S0, 1 γ X a /∈D,a∈S0 2 q 2 a + 1 γ 2K2 (1 − qa) + 3qa (1 − qa) 2 + q 2 a γ 2 + η 2d 2 q 2 a Wa ≤ 1 γ X a /∈D,a∈S0 2q 2 a (1 − qa) + 3qa (1 − qa) 2 + q 2 a γ 2 + η 2d 2 q 2 a Wa + O 1 γ 2 + 1 ηγ3K (Wa ≥ 1 γK if a ∈ S and Wa ≥ η if a ∈ [K]\S) ≤ O 1 γ X a∈S0,a /∈D qa(1 − qa) Wa + 1 γ 3 X a∈S,a /∈D q 2 a + 1 γ 3 X a∈S0,a /∈D∪S q 3 a K−1 γK + 1 γ 2 + 1 ηγ3K + O 1 γ X a∈S,a /∈D η 2 d 2 q 2 a + 1 γ X a∈S0,a /∈D∪S η 2d 2 q 3 a η (for a ∈ S0, a /∈ S, Wa ≥ max{ K−1 γK , η}) ≤ O 1 γ X a∈S0,a /∈D qa(1 − qa) Wa + 1 ηγ . (C.12) For a /∈ S0, we know that Wa ≥ η. Therefore, 1 γ X a /∈D∪S0 2 q 2 a + 1 γ 2K2 (1 − qa) + 3qa (1 − qa) 2 + q 2 a γ 2 + η 2d 2 q 2 a Wa ≤ 1 γη X a /∈D∪S0 2 q 2 a + 1 γ 2K2 (1 − qa) + 3qa (1 − qa) 2 + q 2 a γ 2 + 1 16 q 2 a ≤ 1 γη X a /∈D∪S0 2qa(1 − qa) + 1 γ 2K2 + 2q 3 a γ 2 + 3 16 q 3 a ≤ O 1 γη . (C.13) Plugging Eq. (C.11), Eq. (C.12), and Eq. (C.13) to Eq. (C.10), we obtain that decγ(p; f, x, G b ) ≤ O 1 γ + ηd + 1 γη + 1 γ X a∈S0,a /∈D qa(1 − qa) Wa (C.14) 230 Consider the last term. If a ∈ S0\S, similar to Eq. (C.7), we know that qa(1 − qa) Wa ≤ qa(1 − qa) 1 − (1 − 1 γ − dη)qa − 1 γK ≤ qa(1 − qa) (1 − 1 γ − ηd)(1 − qa) ≤ 1 1 − 1 γ − ηd qa ≤ O(qa), where the last inequality is due to γ ≥ 16d and η ≤ 1 4d . If a ∈ S, similar to Eq. (C.8), we know that X a∈S qa(1 − qa) Wa ≤ X a∈S qa(1 − qa) P j∈Nin(G,a) ((1 − 1 γ − ηd)qj + 1 γK ) ≤ γ γ − 1 − γηd X a∈S ((1 − 1 γ − ηd)qa + 1 γK )(1 − qa) P j∈Nin(G,a) ((1 − 1 γ − ηd)qj + 1 γK ) ≤ 2 X a∈S (1 − 1 γ − ηd)qa + 1 γK P j∈Nin(G,a) (1 − 1 γ − ηd)qj + 1 γK (γ ≥ 4, η ≤ 1 4d ) ≤ O(αe log(Kγ)), (C.15) where the last inequality is again due to Lemma 5 in [12] and αe is the independence number of the subgraph induced by nodes with self-loops in G. Plugging Eq. (C.15) to Eq. (C.14) gives decγ(p; f, x, G b ) ≤ O ηd + 1 γη + αe log(Kγ) γ . Picking η = q 1 γd ≤ 1 4d proves the result. Next, we prove Corollary 2.3.2 by combining Theorem 2.3.3 and Theorem 2.3.1. Corollary 2.3.2. Suppose that Gt is deterministic, weakly observable, and has weak domination number no more than d for all t ∈ [T]. In addition, suppose that the independence number of the subgraph induced by nodes with self-loops in Gt is no more than αe for all t ∈ [T]. Then, Algorithm 5 with γ = max{16d, q αT / e RegSq, d 1 3 T 2 3 Reg− 2 3 Sq } guarantees that 231 E[RegCB] ≤ Oe d 1 3 T 2 3 Reg 1 3 Sq + q αTe RegSq . Proof. Combining Eq. (C.1), Eq. (C.2) and Theorem 2.3.3, we can bound RegCB as follows: E[RegCB] ≤ O s d γ T + αTe log(Kγ) γ + γRegCB! . Picking γ = max 16d, q αT / e RegSq, d 1 3 T 2 3 Reg− 2 3 Sq finishes the proof. C.1.3 Python Solution to Eq. (2.28) def makeProblem ( nactions ): import cvxpy as cp sqrtgammaG = cp . Parameter (( nactions , nactions ) , nonneg = True ) sqrtgammafhat = cp . Parameter ( nactions ) p = cp . Variable ( nactions , nonneg = True ) sqrtgammaz = cp . Variable () objective = cp . 
Minimize ( sqrtgammafhat @ p + sqrtgammaz ) constraints = [ cp . sum( p ) == 1 ] + [ cp . sum([ cp . quad_over_lin ( eai - pi , vi ) for i , (pi , vi ) in enumerate (zip(p , v ) ) for eai in ( 1 if i == a else 0 ,) ]) <= sqrtgammafhata + sqrtgammaz for v in ( sqrtgammaG @ p ,) for a , sqrtgammafhata in enumerate ( sqrtgammafhat ) ] 232 problem = cp . Problem ( objective , constraints ) assert problem . is_dcp ( dpp= True ) # proof of convexity return problem , sqrtgammaG , sqrtgammafhat , p , sqrtgammaz This particular formulation multiplies both sides of the constraint in Eq. (2.28) by √γ while scaling the objective by √γ. While mathematically equivalent to Eq. (2.28), empirically it has superior numerical stability for large γ. For additional stability, when using this routine we recommend subtracting off the minimum value from fb, which is equivalent to making the substitutions √γfb← √γfb− √γ mina fba and z ← z + √γ mina fba and then exploiting the 1 ⊤p = 1 constraint. C.1.4 Proof of Theorem 2.3.4 Theorem 2.3.4. Solving argminp∈∆(K) decγ(p; f, x, G b ) is equivalent to solving the following convex optimization problem. min p∈∆(K),z p ⊤fb+ z (2.28) subject to ∀a ∈ [K] : 1 γ ∥p − ea∥ 2 diag(G⊤p)−1 ≤ fb(x, a) + z, G ⊤p ≻ 0, where fbin the objective is a shorthand for fb(x, ·) ∈ R K, ea is the a-th standard basis vector, and ≻ means element-wise greater. Proof. Denote f ⋆ = f ⋆ (x, ·) ∈ R K. Note that according to the definition of G, we know that (G⊤p)i denotes the probability that action i’s loss is revealed when the selected action a is sampled from distribution p. Then, we know that decγ(p; f, x, G b ) 233 = sup a ⋆∈[K] f ⋆∈RK Eat∼p " f ⋆ at − f ⋆ a ⋆ − γ 4 EA∼G(·|at) "X a∈A fba − f ⋆ a 2 ## = sup a ⋆∈[K] f ⋆∈RK (p − ea ⋆ ) ⊤f ⋆ − γ 4 X a∈[K] ∥fb− f ⋆ ∥ 2 diag(G⊤p) = sup a ⋆∈[K] (p − ea ⋆ ) ⊤fb+ 1 γ ∥p − ea ⋆ ∥ 2 diag(G⊤p)−1 G ⊤p ≻ 0 = p ⊤fb+ max a ⋆∈[K] 1 γ ∥p − ea ⋆ ∥ 2 diag(G⊤p)−1 − e ⊤ a ⋆ fb , where the third equality is by picking f ⋆ to be the maximizer and introduces a constraint. Therefore, the minimization problem minp∈∆(K) decγ(p; f, x, G b ) can be written as the following constrained optimization by variable substitution: min p∈∆(K),z p ⊤fb+ z subject to ∀a ∈ [K] : 1 γ ∥p − ea∥ 2 diag(G⊤p)−1 ≤ e ⊤ a fb+ z, G ⊤p ≻ 0. The convexity of the constraints follows from Lemma C.1.1. C.2 Omitted Details in Section 2.3.4 In this section, we provide proofs for Section 2.3.4. We define Wa := P a ′∈Nin(G,a) pa ′ to be the probability that the loss of action a is revealed when selecting an action from distribution p. Let fb = fb(x, ·) ∈ R K and f = f(x, ·) ∈ R K. Direct calculation shows that for any a ⋆ ∈ [K], f ⋆ = argmax f∈RK Ea∼p f(x, a) − f(x, a⋆ ) − γ 4 · X a ′∈Nin(G,a) (fbt(x, a′ ) − f(x, a′ ))2 234 = 2 γ diag(W) −1 (p − ea ⋆ ) + f.b Therefore, substituting f ⋆ into Eq. (2.27), we obtain that decγ(p; f, x, G b ) = max a ⋆∈[K] ( 1 γ X K a=1 p 2 a Wa + 1 − 2pa ⋆ Wa ⋆ ! + D p − ea ⋆ , fb E ) . (C.16) Without loss of generality, we assume the mini∈[K] fbi = 0. This is because shifting fb by mini∈[K] fbi does not change the value of D p − ea ⋆ , fb E . In the following sections, we provide proofs showing that a certain closed-form of p leads to optimal decγ(p; f, x, G b ) up to constant factors for several specific types of feedback graphs, respectively. C.2.1 Cops-and-Robbers Graph Proposition 2.3.1. When G = GCR, given any fb, context x, the closed-form distribution p in Eq. (2.29) guarantees that decγ(p; f, x, G b CR) ≤ O 1 γ . Proof. 
We use the following notation for convenience: p1 := pa1 , p2 := pa2 , fb1 := fba1 = 0, fb2 := fba2 . For the cops-and-robbers graph and closed-form solution p in Eq. (2.29), Eq. (C.16) becomes: decγ(p; f, x, G b CR) = max a ⋆∈[K] 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E . If a ⋆ ̸= a1 and a ⋆ ̸= a2, we know that 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E = 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 + p1fb1 + p2fb2 − fba ⋆ ≤ 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 − p1fb2 (fba ⋆ ≥ fb2 ≥ fb1 = 0) 235 ≤ 1 γ 1 1 − p1 + 1 + 1 − p1fb2 (p1 ∈ [ 1 2 , 1], p1 ≥ p2 ∈ [0, 1 2 ]) = 1 γ 4 + γfb2 − 1 − 1 2 + γfb2 fb2 ≤ 5 γ . If a ⋆ = a2, we can obtain that 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E = 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 − 2p2 p1 + p1fb1 + p2fb2 − fb2 ≤ 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 − 2(1 − p1) p1 − p1fb2 (fb1 = 0) ≤ 1 γ 1 1 − p1 + 1 + 2 − 1 p1 − p1fb2 (p1 ∈ [ 1 2 , 1], p2 ∈ [0, 1 2 ]) ≤ 1 γ 5 + γfb2 − 1 − 1 2 + γfb2 fb2 (p1 = 1 2+γfb2 ) ≤ 6 γ . If a ⋆ = a1, we have 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E ≤ 1 γ p 2 1 1 − p1 + (1 − p1) 2 p1 + 1 − 2p1 1 − p1 + (1 − p1)fb2 ≤ 1 γ 1 − p1 + (1 − p1) 2 p1 + (1 − p1)fb2 ≤ 1 γ 1 + 1 2 + fb2 2 + γfb2 (p1 ∈ [ 1 2 , 1]) ≤ 3 γ . Putting everything together, we prove that decγ(p; f, x, G b CR) ≤ 6 γ ≤ O 1 γ . 236 C.2.2 Apple Tasting Graph Proposition 2.3.2. When G = GAT, given any fb, context x, the closed-form distribution p in Eq. (2.30) guarantees that decγ(p; f, x, G b AT) ≤ O( 1 γ ). Proof. For the apple tasting graph and closed-form solution p in Eq. (2.30), Eq. (C.16) becomes: decγ(p; f, x, G b ) = max a ⋆∈[K] 1 γ p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E . Suppose fb1 = 0, we know that p1 = 1, p2 = 0 and 1. If a ⋆ = 1, we have 1 γ p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E = 0. 2. If a ⋆ = 2, direct calculation shows that 1 γ p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E ≤ 2 γ . Suppose fb2 = 0, we know that p1 = 2 4+γfb1 , p2 = 1 − p1 and 1. If a ⋆ = 1, we have 1 γ p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E = 1 γ p1 + (1 − p1) 2 p1 + 1 − 2p1 p1 − (1 − p1)fb1 = 2(1 − p1) 2 γp1 − (1 − p1)fb1 = (2 + γfb1) 2 γ(4 + γfb1) − (1 − p1)fb1 237 ≤ 4 + γfb1 γ + 2fb1 4 + γfb1 − fb1 ≤ 6 γ . 2. If a ⋆ = 2, direct calculation shows that 1 γ p1 + (1 − p1) 2 p1 + 1 − 2pa ⋆ Wa ⋆ + D p − ea ⋆ , fb E = 2p1 γ + p1fb1 ≤ 1 γ + 2fb1 4 + γfb1 ≤ 3 γ . Putting everything together, we prove that decγ(p; f, x, G b AT) ≤ 6 γ ≤ O 1 γ . C.2.3 Inventory Graph Proposition 2.3.3. When G = Ginv, given any fb, context x, there exists a closed-form distribution p ∈ ∆(K) guaranteeing that decγ(p; f, x, G b inv) ≤ Oe( 1 γ ), where p is defined as follows: pj = max{ 1 1+γ(fbj−mini fbi) − P j ′>j pj ′, 0} for all j ∈ [K]. Proof. Based on the distribution defined above, define A ⊆ [K] to be the set such that for all i ∈ A, pi > 0 and denote N = |A|. We index each action in A by k1 < k2 < · · · < kN = K. According to the definition of pi , we know that pi is strictly positive only when fbi < fbj for all j > i and specifically, when pi > 0, we know that Wi = P j≥i pj = 1 1+γfbi (recall that mini fbi = 0 since we shift fb). Therefore, define WkN+1 = 0 and we know that decγ(p; f, x, G b inv) = X N i=1 pki fb ki + 1 γ X K a=1 p 2 a Wa + max a ⋆∈[K] 1 − 2pa ⋆ γWa ⋆ − fba ⋆ ≤ X N i=1 Wki − Wki+1 fb ki + 1 γ + max a ⋆∈[K] 1 − 2pa ⋆ γWa ⋆ − fba ⋆ ≤ 2 γ + N X−1 i=1 1 1 + γfb ki − 1 1 + γfb ki+1 ! 
fb ki + max a ⋆∈[K] 1 − 2pa ⋆ γWa ⋆ − fba ⋆ 23 ≤ 3 γ + X N i=2 fb ki − fb ki−1 1 + γfb ki + max a ⋆∈[K] 1 − 2pa ⋆ γWa ⋆ − fba ⋆ . Direct calculation shows that X N i=2 fb ki − fb ki−1 1 + γfb ki = 1 γ X N i=2 fb ki − fb ki−1 1 γ + fb ki ≤ 1 γ Z fbkN 0 1 1 γ + x dx ≤ ln(1 + γ) γ . (C.17) Next, consider the value of a ⋆ ∈ [K] that maximizes 1−2pa⋆ γWa⋆ − fba ⋆ . If a ⋆ ≤ k1, then we know that Wa ⋆ = 1 and 1−2pa⋆ γWa⋆ −fba ⋆ ≤ 1 γ . Otherwise, suppose that ki < a⋆ ≤ ki+1 for some i ∈ [N −1]. According to the definition of p, if a ⋆ ̸= ki+1 we know that pa ⋆ = 0 and 1 1 + γfba ⋆ ≤ X j ′>a⋆ pj ′ = Wki+1 = Wa ⋆ . Therefore, 1 − 2pa ⋆ γWa ⋆ − fba ⋆ = 1 γWa ⋆ − fba ⋆ ≤ 1 γ . Otherwise, Wa ⋆ = Wki+1 and 1−2pa⋆ γWa⋆ − fba ⋆ ≤ 1 γWki+1 − fb ki+1 = 1 γ . Combining the two cases above and Eq. (C.17), we obtain that decγ(p; f, x, G b inv) ≤ 3 γ + ln(1 + γ) γ + 1 γ = Oe 1 γ . 239 C.2.4 Undirected and Self-Aware Graphs Proposition 2.3.4. When G is an undirected self-aware graph, given any fb, context x, there exists a closedform distribution p ∈ ∆(K) guaranteeing that decγ(p; f, x, G b ) ≤ O α γ . Proof. We first introduce the closed-form of p and then show that decγ(p; f, x, G b ) ≤ O( α γ ). Specifically, we first sort fba in an increasing order and choose a maximal independent set by choosing the nodes in a greedy way. Specifically, we pick k1 = argmini∈[K] fbi . Then, we ignore all the nodes that are connected to k1 and select the node a with the smallest fba in the remaining node set. This forms a maximal independent set I ⊆ [K], which has size no more than α and is also a dominating set. Set pa = 1 α+γfba for a ∈ I\{k1} and pk1 = 1 − P a̸=k1,a∈I pa. This is a valid distribution as we only choose at most α nodes and pa ≤ 1/α for all a ∈ I\{k1}. Now we show that decγ(p; f, x, G b ) ≤ O( α γ ). Specifically, we only need to show that with this choice of p, for any a ⋆ ∈ [K], X K a=1 pafba − fba ⋆ + 1 γ X K a=1 p 2 a Wa + 1 − 2pa ⋆ γWa ⋆ ≤ O α γ . Plugging in the form of p, we know that X K a=1 pafba − fba ⋆ + 1 γ X K a=1 p 2 a Wa + 1 − 2pa ⋆ γWa ⋆ ≤ X a∈I\{k1} fba α + γfba − fba ⋆ + 1 − 2pa ⋆ γWa ⋆ + 1 γ (pa ≤ Wa for all a ∈ [K]) ≤ α γ − fba ⋆ + 1 − 2pa ⋆ γWa ⋆ . (|I| ≤ α) 240 If a ⋆ = k1, then we can obtain that 1−2pa⋆ γWa⋆ ≤ 1 γWk1 ≤ α γ as pk1 ≥ 1 α according to the definition of p. Otherwise, note that according to the choice of the maximal independent set I, Wa ⋆ ≥ 1 α+γfb a′ for some a ′ ∈ I such that fb a ′ ≤ fba ⋆ . Therefore, −fba ⋆ + 1 − 2pa ⋆ γWa ⋆ ≤ −fba ⋆ + 1 γWa ⋆ ≤ −fba ⋆ + α + γfb a ′ γ ≤ α γ . Combining the two inequalities above together proves the bound. C.3 Implementation Details in Experiments C.3.1 Implementation Details in Section 2.3.5.1 We conduct experiments on RCV1 [101], which is a multilabel text-categorization dataset. We use a subset of RCV1 containing 50000 samples and K = 50 sub-classes. Therefore, the feedback graph in our experiment has K = 50 nodes. We use the bag-of-words vector of each sample as the context with dimension d = 47236 and treat the text categories as the arms. In each round t, the learner receives the bag-of-words vector xt and makes a prediction at ∈ [K] as the text category. The loss is set to be ℓt,at = 0 if the sample belongs to the predicted category at and ℓt,at = 1 otherwise. The function class we consider is the following linear function class: F = {f : f(x, a) = Sigmoid((Mx)a), M ∈ R K×d }, where Sigmoid(u) = 1 1+e−u for any u ∈ R. 
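As described next, the regression oracle for this class is implemented with online gradient descent on the squared loss. For concreteness, the following is a minimal sketch of such a sigmoid-linear oracle (the class and method names are ours, and updating only on the actions whose losses are revealed is a convention we assume here; this is an illustration rather than the exact implementation used in the experiments):

import numpy as np

class SigmoidLinearOracle:
    # Online regression oracle for F = {f(x, a) = Sigmoid((Mx)_a)}:
    # online gradient descent on the squared loss (f(x, a) - ell_a)^2,
    # applied to the actions whose losses were revealed by the feedback graph.
    def __init__(self, num_actions, dim, lr=0.5):
        self.M = np.zeros((num_actions, dim))   # M in R^{K x d}
        self.lr = lr

    def predict(self, x):
        return 1.0 / (1.0 + np.exp(-self.M @ x))   # f(x, a) for every arm a

    def update(self, x, observed, losses):
        f = self.predict(x)
        for a in observed:
            # gradient of (f(x, a) - ell_a)^2 with respect to the a-th row of M
            grad = 2.0 * (f[a] - losses[a]) * f[a] * (1.0 - f[a]) * x
            self.M[a] -= self.lr * grad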
The oracle is implemented by applying online gradient descent with learning rate η searched over {0.1, 0.2, 0.5, 1, 2, 4}. As suggested by [61], we use a timevarying exploration parameter γt = c · √ αt, where t is the index of the iteration, c is searched over {8, 16, 32, 64, 128}, and α is the independence number of the corresponding feedback graph. Our code is 241 built on PyTorch framework [125]. We run 5 independent experiments with different random seeds and plot the mean and standard deviation value of PV loss. C.3.2 Implementation Details in Section 2.3.5.2 C.3.2.1 Details for Results on Random Directed Self-aware Graphs We conduct experiments on a subset of RCV1 containing 10000 samples with K = 10 sub-classes. Our code is built on Vowpal Wabbit [95]. For SqaureCB, the exploration parameter γt at round t is set to be γt = c · √ Kt, where t is the index of the round and c is the hyper-parameter searched over set {8, 16, 32, 64, 128}. The remaining details are the same as described in Appendix C.3.1. C.3.2.2 Details for Results on Synthetic Inventory Dataset In this subsection, we introduce more details in the synthetic inventory data construction, loss function constructions, oracle implementation, and computation of the strategy at each round. Dataset. In this experiment, we create a synthetic inventory dataset constructed as follows. The dataset includes T data points, the t-th of which is represented as (xt , dt) where xt ∈ R m is the context and dt is the realized demand given context xt . Specifically, in the experiment, we choose m = 100 and xt ’s are drawn i.i.d from Gaussian distribution with mean 0 and standard deviation 0.1. The demand dt is defined as dt = 1 √ m x ⊤ t θ + εt , where θ ∈ R m is an arbitrary vector and εt is a one-dimensional Gaussian random variable with mean 0.3 and standard deviation 0.1. After all the data points {(xt , dt)} T t=1 are constructed, we normalize dt to [0, 1] by setting dt ← dt−mint ′∈[T ] dt ′ maxt ′∈[T ] dt ′−mint ′∈[T ] dt ′ . In all our experiments, we set T = 10000. 242 Loss construction. Next, we define the loss at round t when picking the inventory level at with demand dt , which is defined as follows: ℓt,at = h · max{at − dt , 0} + b · max{dt − at , 0}, (C.18) where h > 0 is the holding cost per remaining items and b > 0 is the backorder cost per remaining items. In the experiment, we set h = 0.25 and b = 1. Regression oracle. The function class we use in this experiment is as follows: F = {f : f(x, a) = h · max{a − (x ⊤θ + β), 0} + b · max{x ⊤θ + β − a, 0}, θ ∈ R m, β ∈ R}. This ensures realizability according to the definition of our loss function shown in Eq. (C.18). The oracle uses online gradient descent with learning rate η searched over {0.01, 0.05, 0.1, 0.5, 1}. Calculation of pt . To make SquareCB.G more efficient, instead of solving the convex program defined in Eq. (2.28), we use the closed-form of pt derived in Proposition 2.3.3, which only requires O(K) computational cost and has the same theoretical guarantee (up to a constant factor) as the one enjoyed by the solution solved by Eq. (2.28). Similar to the case in Appendix C.3.1, at each round t, we pick γt = c · √ t with c searched over the set {0.25, 0.5, 1, 2, 3, 4}. Note again that the independence number for inventory graph is 1. We run 8 independent experiments with different random seeds and plot the mean and standard deviation value of PV loss. 
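For reference, the two ingredients described above, namely the inventory loss in Eq. (C.18) and the O(K) closed-form distribution of Proposition 2.3.3 used in the "Calculation of pt" paragraph, admit the following minimal sketch (the function names are ours, the default costs match h = 0.25 and b = 1 from the experiment, and this is an illustration rather than the released experiment code):

import numpy as np

def inventory_loss(a, d, h=0.25, b=1.0):
    # Eq. (C.18): holding cost for leftover stock plus backorder cost for unmet demand
    return h * max(a - d, 0.0) + b * max(d - a, 0.0)

def inventory_closed_form(f_hat, gamma):
    # Proposition 2.3.3:
    #   p_j = max{1 / (1 + gamma * (f_hat_j - min_i f_hat_i)) - sum_{j' > j} p_{j'}, 0},
    # computed with a single O(K) backward pass over the inventory levels.
    f = np.asarray(f_hat, dtype=float)
    f = f - f.min()
    K = len(f)
    p = np.zeros(K)
    tail = 0.0                      # running value of sum_{j' > j} p_{j'}
    for j in range(K - 1, -1, -1):
        p[j] = max(1.0 / (1.0 + gamma * f[j]) - tail, 0.0)
        tail += p[j]
    return p                        # sums to 1: the minimizer of f_hat absorbs the remaining mass

At round t, one would call inventory_closed_form with the oracle's predictions ft(xt, ·) and the exploration parameter γt = c · √t described above, and sample the inventory level from the returned distribution.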
243 C.4 Adaptive Tuning of γ without the Knowledge of Graph-Theoretic Numbers In this section, we show how to adaptively tune parameter γ in order to achieve Oe qPT t=1 αtRegSq regret in the strongly observable graphs case and Oe PT t=1 √ dt 2 3 Reg 1 3 Sq + qPT t=1 αetRegSq in the weakly observable graphs case. C.4.1 Strongly Observable Graphs In order to achieve Oe PT t=1 αtRegSq regret guarantee without the knowledge of αt , we apply a doubling trick on γ based on the value of minp∈∆(K) decγ(p; fbt , xt , Gt). Specifically, our algorithm goes in epochs with the parameter γ being γs in the s-th epoch. We initialize γ1 = q T RegSq . As proven in Theorem 2.3.2, we know that γ · min p∈∆(K) decγ(p; fbt , xt , Gt) ≤ Oe(αt). Therefore, within each epoch s (with start round bs), at round t, we calculate the value Qt = X t τ=bs min p∈∆(K) decγs (p; fbt , xt , Gt), (C.19) which is bounded by Oe 1 γs Pt τ=bs ατ and is in fact obtainable by solving the convex program. Then, we check whether Qt ≤ γsRegSq. If this holds, we continue our algorithm using γs; otherwise, we set γs+1 = 2γs and restart the algorithm. 244 Now we analyze the performance of the above described algorithm. First, note that for any t, within epoch s, γsQt ≤ Oe X T τ=1 ατ ! , meaning that the number of epoch S is bounded by S = log2 C1 + log4 PT t=1 αt T for certain C1 > 0 which only contains constant and log terms. Next, consider the regret in epoch s with Is = [bs, es]. According to Eq. (C.1), we know that the regret within epoch s is bounded as follows: E "X t∈Is f ⋆ (xt , at) − X t∈Is f ⋆ (xt , π⋆ (xt))# ≤ E "X t∈Is decγs (pt ; fbt , xt , Gt) # + γs 4 RegSq ≤ E X t∈[bs,es−1] decγs (pt ; fbt , xt , Gt) + Oe αes γs + γs 4 RegSq ≤ Oe(γsRegSq), (C.20) where the last inequality is because at round t = es − 1, Qt ≤ γsRegSq is satisfied. Taking summation over all S epochs, we know that the overall regret is bounded as E[RegCB] ≤ X S s=1 Oe(γsRegSq) = X S s=1 Oe 2 s−1 q TRegSq ≤ Oe 2 S q TRegSq = Oe vuutX T t=1 αtRegSq , (C.21) which finishes the proof. 245 C.4.2 Weakly Observable Graphs For the weakly observable graphs case, to achieve the target regret without the knowledge of αet and dt , which are the independence number of the subgraph induced by nodes with self-loops in Gt and the weak domination number of Gt , we apply the same approach as the one applied in the strongly observable graph case. Note that according to Theorem 2.3.3, within epoch s, we have Qt ≤ C2 Pt τ=bs √ dτ √γs + Pt τ=bs αeτ γs for certain C2 > 1 only containing constants and log factors. In the weakly observable graphs case, we know that the number of epoch is bounded by S = 2 + log2 C2 + max log4 PT t=1 αet T , log2 ( PT t=1 √ dt) 2 3 √ T · Reg 1 6 Sq since we have γS = 4C2 · s 1 RegSq max vuutX T t=1 αet , ( PT t=1 √ dt) 2 3 Reg 1 6 Sq , and at round t in epoch S, Qt ≤ C2 PT τ=1 αeτ γS + PT τ=1 √ dτ √γS ! ≤ γSRegSq, meaning that epoch S will never end. Therefore, following Eq. (C.20) and Eq. (C.21), we can obtain that E[RegCB] ≤ Oe 2 S q TRegSq = Oe vuutX T t=1 αetRegSq + X T t=1 p dt !2 3 Reg 1 3 Sq , which finishes the proof. 246 Appendix D Omitted Details in Section 2.4 D.1 Omitted Details in Section 2.4.2 We start with restating Theorem 2.4.3 along with its proof. Theorem 2.4.3. Under Assumptions 3-6, for any γ1, γ2 ≥ 0, the regret RegCB of SquareCB.UG is at most E "X T t=1 decP γ1,γ2 (pt ; ft , gt , xt) # + γ1RegSq + 2γ2RegLog. Proof. Define π ⋆ (x) = argmini∈[K] f ⋆ (x, i). 
We decompose RegCB as follows: RegCB = E "X T t=1 X K i=1 pt,if ⋆ (xt , i) − X T t=1 f ⋆ (xt , π⋆ (xt))# = E X T t=1 X K i=1 pt,if ⋆ (xt , i) − f ⋆ (xt , π⋆ (xt)) − γ1 X K i=1 X K j=1 pt,ig ⋆ (xt , i, j) (f ⋆ (xt , j) − ft(xt , j))2 − E X T t=1 γ2 X K i=1 X K j=1 pt,i (g ⋆ (xt , i, j) − gt(xt , i, j))2 g ⋆(xt , i, j) + gt(xt , i, j) (D.1) + γ1E X T t=1 X K i=1 X K j=1 pt,ig ⋆ (xt , i, j) (f ⋆ (xt , j) − ft(xt , j))2 247 + γ2E X T t=1 X K i=1 X K j=1 pt,i (g ⋆ (xt , i, j) − gt(xt , i, j))2 g ⋆(xt , i, j) + gt(xt , i, j) ≤ E X T t=1 max i ⋆∈[K] v ⋆∈[0,1]K,M⋆∈[0,1]K×K X K i=1 pt,iv ⋆ i − v ⋆ i ⋆ − γ1 X K i=1 X K j=1 pt,iM⋆ i,j v ⋆ j − ft(xt , j) 2 −γ2 X K i=1 X K j=1 pt,i (M⋆ i,j − gt(xt , i, j))2 M⋆ i,j + gt(xt , i, j) (D.2) + γ1E X T t=1 X K i=1 X K j=1 pt,ig ⋆ (xt , i, j) (f ⋆ (xt , j) − ft(xt , j))2 + γ2E X T t=1 X K i=1 X K j=1 pt,i (g ⋆ (xt , i, j) − gt(xt , i, j))2 g ⋆(xt , i, j) + gt(xt , i, j) = E "X T t=1 decP γ1,γ2 (pt , ft , gt , xt) # + γ1E X T t=1 X K i=1 X K j=1 pt,ig ⋆ (xt , i, j) (f ⋆ (xt , j) − ft(xt , j))2 + γ2E X T t=1 X K i=1 X K j=1 pt,i (g ⋆ (xt , i, j) − gt(xt , i, j))2 g ⋆(xt , i, j) + gt(xt , i, j) . (D.3) Next, since E[ℓt,a | xt ] = f ⋆ (xt , a) for all t ∈ [T] and a ∈ [K], we know that E X T t=1 X K i=1 X K j=1 pt,ig ⋆ (xt , i, j) (f ⋆ (xt , j) − ft(xt , j))2 = E "X T t=1 EAt "X i∈At (ft(xt , i) − ℓt,i) 2 − X i∈At (f ⋆ (xt , i) − ℓt,i) 2 ## ≤ RegSq, (D.4) where the inequality is due to Assumption 4 and the way the algorithm feeds the oracle AlgSq. In addition, according to Lemma 2.4.1 and the way the algorithm feeds the oracle AlgLog in the partially revealed graph setting, we know that E X T t=1 X K i=1 X K j=1 pt,i (g ⋆ (xt , i, j) − gt(xt , i, j))2 g ⋆(xt , i, j) + gt(xt , i, j) (D.5) = E "X T t=1 Eit∼pt "X K i=1 (g ⋆ (xt , it , i) − gt(xt , it , i))2 g ⋆(xt , it , i) + gt(xt , it , i) ## ≤ 2RegLog. (D.6) 24 Combining the above two inequalities, we obtain that RegCB ≤ E "X T t=1 decP γ1,γ2 (pt ; ft , gt , xt) # + γ1RegSq + 2γ2RegLog. (D.7) Moreover, Eq. (D.7) also holds in the fully revealed feedback graph setting since in this case, according to Lemma 2.4.1 and the fact that the algorithm feeds all action-pairs to AlgLog, we have E X T t=1 X K i=1 X K j=1 pt,i (g ⋆ (xt , i, j) − gt(xt , i, j))2 g ⋆(xt , i, j) + gt(xt , i, j) (D.8) ≤ E X T t=1 X K i=1 X K j=1 (g ⋆ (xt , i, j) − gt(xt , i, j))2 g ⋆(xt , i, j) + gt(xt , i, j) ≤ 2RegLog. (D.9) Plugging Eq. (D.8) and Eq. (D.4) into Eq. (D.3) finishes the proof in the fully revealed feedback graph setting. D.1.1 Value of the Minimax Program In this subsection, we show that for any context x ∈ X , g ∈ conv(G) and f ∈ conv(F), the minimum DEC value is roughly of order α(g,x) γ . Lemma 2.4.4. For any g ∈ conv(G), f ∈ conv(F), x ∈ X , and γ ≥ 4, we have min p∈∆([K]) decγ(p; f, g, x) = O α(g, x) log(Kγ) γ . Proof. To bound decγ(p; f, g, x), it suffices to bound decγ(p; f, g, x) defined as follows, which relaxes the constraint from v ⋆ ∈ [0, 1]K to v ⋆ ∈ R K: decγ(p; f, g, x) = max i ⋆∈[K] v ⋆∈RK X K i=1 piv ⋆ i − v ⋆ i ⋆ − 1 4 γ X K i=1 X K j=1 pjg(x, i, j)(f(x, i) − v ⋆ i ) 2 . 249 For a positive definite matrix M ∈ R K×K, we define norm ∥z∥M = √ z⊤Mz. 
By taking the gradient with respect to v ⋆ and setting it to zero, we know that for any i ⋆ ∈ [K] and p ∈ ∆([K]), max v ⋆∈RK X K i=1 piv ⋆ i − v ⋆ i ⋆ − 1 4 γ X K i=1 X K j=1 pjg(x, i, j)(f(x, i) − v ⋆ i ) 2 = X K i=1 pif(x, i) − f(x, i⋆ ) + 1 γ ∥p − ei ⋆ ∥ 2 W(p,g,x) , (D.10) where W(g, p, x) is a diagonal matrix with the i-th diagonal entry being PK j=1 pjg(x, j, i) and ei ⋆ ∈ R K corresponds to the basic vector with the i-th coordinate being 1. Then, direct calculation shows that min p∈∆([K]) decγ(p; f, g, x) = min p∈∆([K]) max i ⋆∈[K] max v ⋆∈RK X K i=1 piv ⋆ i − v ⋆ i ⋆ − 1 4 γ X K i=1 X K j=1 pjg(x, i, j)(f(x, i) − v ⋆ i ) 2 = min p∈∆([K]) max i ⋆∈[K] (X K i=1 pif(x, i) − f(x, i⋆ ) + 1 γ ∥p − ei ⋆ ∥ 2 W(p,g,x) ) (according to Eq. (D.10)) = min p∈∆([K]) max i ⋆∈[K] X K i=1 pif(x, i) − f(x, i⋆ ) + 1 γ X i̸=i ⋆ p 2 P i K i ′=1 pi ′g(x, i′ , i) + 1 γ (1 − pi ⋆ ) 2 PK i ′=1 pi ′g(x, i′ , i⋆) (D.11) = min p∈∆([K]) max q∈∆([K]) (X K i=1 pif(x, i) − X K i=1 qif(x, i) + 1 γ X K i=1 (1 − qi)p 2 P i K i ′=1 pi ′g(x, i′ , i) (D.12) + 1 γ X K i=1 qi(1 − pi) 2 PK i ′=1 pi ′g(x, i′ , i) ) = max q∈∆([K]) min p∈∆([K]) (X K i=1 pift(x, i) − X K i=1 qif(x, i) + 1 γ X K i=1 (1 − qi)p 2 P i K i ′=1 pi ′g(x, i′ , i) (D.13) + 1 γ X K i=1 qi(1 − pi) 2 PK i ′=1 pi ′g(x, i′ , i) ) , (D.14) where the last inequality is due to Sion’s minimax theorem. 250 Picking pa = 1 − 1 γ qa + 1 γK for all a ∈ [K], we obtain that for any distribution µ ∈ Q(g, x), max q∈∆([K]) min p∈∆([K]) (X K i=1 pif(x, i) − X K i=1 qif(x, i) + 1 γ X K i=1 (1 − qi)p 2 P i K i ′=1 pi ′g(x, i′ , i) (D.15) + 1 γ X K i=1 qi(1 − pi) 2 P i ′∈A pi ′g(x, i′ , i) ) (i) ≤ max q∈∆([K]) ( − 1 γ X K i=1 qif(x, i) + 1 γK X K i=1 f(x, i) (D.16) + 1 γ X K i=1 (1 − qi) qi − 1 γ qi + 1 γK 2 + qi 1 − (1 − 1 γ )qi − 1 γK 2 PK i ′=1 pi ′g(x, i′ , i) (ii) ≤ max q∈∆([K]) 1 γ + 1 γ X K i=1 2 (1 − 1 γ ) 2 q 2 i + 1 γ 2K2 (1 − qi) + qi 1 − (1 − 1 γ )qi 2 PK i ′=1 pi ′gt(xt , i′ , i) ≤ max q∈∆([K]) 1 γ + 1 γ X K i=1 2 q 2 i + 1 γ 2K2 (1 − qi) + qi (1 − qi) + qi γ 2 PK i ′=1 pi ′gt(xt , i′ , i) (iii) ≤ max q∈∆([K]) 1 γ + 2 γ 2 + 1 γ X K i=1 2q 2 i (1 − qi) + 2qi (1 − qi) 2 + 2q 3 i γ 2 PK i ′=1 pi ′g(x, i′ , i) = max q∈∆([K]) ( 1 γ + 2 γ 2 + 2 γ X K i=1 qi(1 − qi) PK i ′=1 pi ′g(x, i′ , i) + 2 γ 3 X K i=1 q 3 P i K i ′=1 pi ′g(x, i′ , i) ) (iv) = max q∈∆([K]) 1 γ + 2 γ 2 + 2 γ X K i=1 qi(1 − qi) EG∼µ hPK i ′=1 pi ′Gi ′ ,ii + 2 γ 3 X K i=1 q 3 i EG∼µ hPK i ′=1 pi ′Gi ′ ,ii , where (i) is by replacing pi with (1 − 1 γ )qi + 1 γK (except for the pi ′ in the denominators); (ii) holds since f(x, i) ∈ [0, 1] for all i ∈ [K] and (a + b) 2 ≤ 2(a 2 + b 2 ), and we drop the last − 1 γK term; (iii) is because PK i ′=1 pi ′g(x, i′ , i) ≥ 1 γK for all i ∈ [K] since pi ′ ≥ 1 γK for all i ′ ∈ [K] and PK i ′=1 g(x, i′ , i) ≥ 1 as g(x, ·, ·) is the mean graph of a distribution of strongly observable graphs; (iv) is by definition of Q(g, x) and with an abuse of notation, Gi,j represents the (i, j)-th entry of the adjacent matrix of G. For a feedback graph G ∈ {0, 1} K×K and a distribution u ∈ ∆([K]), with an abuse of notation, define W(G, u) ∈ [0, 1]K as the probability for each node i ∈ [K] to be observed according to u and G. 251 Specifically, for each i ∈ [K], Wi(G, u) = PK j=1 ujGj,i. In addition, let S ⊆ [K] be the nodes in G that have self-loops, meaning that Gi,i = 1 for all i ∈ S. 
Then, using Jensen’s inequality, we know that min p∈∆([K]) decγ(p; f, g, x) ≤ max q∈∆([K]) EG∼µ " 1 γ + 2 γ 2 + 2 γ X K i=1 qi(1 − qi) PK i ′=1 pi ′Gi ′ ,i + 2 γ 3 X K i=1 q 3 P i K i ′=1 pi ′Gi ′ ,i # (Jensen’s inequality) (i) = max q∈∆([K]) EG∼µ " 1 γ + 2 γ 2 + 2 γ X K i=1 qi(1 − qi) Wi(G, p) + 2 γ 3 X i∈S q 3 i Wi(G, p) + 2 γ 3 X i /∈S q 3 i Wi(G, p) # (ii) ≤ max q∈∆([K]) EG∼µ " 1 γ + 2 γ 2 + 2 γ X K i=1 qi(1 − qi) Wi(G, p) + 2 γ 2(γ − 1) X i∈S q 2 i + 2 γ 3 X i /∈S q 3 i K−1 γK # (iii) ≤ max q∈∆([K]) EG∼µ " O 1 γ + 1 γ X K i=1 qi(1 − qi) Wi(G, p) !# (iv) ≤ O 1 γ + O 1 γ · EG∼µ " max q∈∆([K]) X K i=1 qi(1 − qi) Wi(G, p) # , (D.17) where (i) is by definition of W(G, p); (ii) is because for all i ∈ S, Wi(G, p) ≥ pi ≥ (1 − 1 γ )qi , and for all i /∈ S, every other node in G can observe i and Wi(G, p) = 1 − pi ≥ K−1 γK since pi ≥ 1 γK ; (iii) is because K ≥ 2 and γ ≥ 4; (iv) is again due to Jensen’s inequality. Next we bound PK i=1 qi(1−qi) Wi(G,p) for any strongly observable graph G with independence number α. For notational convenience, we omit the index G and denote Wi(G, p) by Wi(p). If i ∈ [K]\S, we know that Wi(p) = 1 − pi and qi(1 − qi) Wi(p) = qi(1 − qi) 1 − 1 − 1 γ qi − 1 γK = qi(1 − qi) 1 − 1 γ (1 − qi) + K−1 γK ≤ 1 1 − 1 γ qi ≤ 2qi , (D.18) 252 where the first equality is because pi = (1 − 1 γ )qi + 1 γK for all i ∈ [K] and the last inequality is because γ ≥ 4. If i ∈ S, we know that X i∈S qi(1 − qi) Wi(p) = X i∈S qi(1 − qi) P j:Gj,i=1 1 − 1 γ qj + 1 γK ≤ γ γ − 1 X i∈S (1 − 1 γ )qi + 1 γK (1 − qi) P j:Gj,i=1 1 − 1 γ qj + 1 γK ≤ 2 X i∈S (1 − 1 γ )qi + 1 γK (1 − qi) P j:Gj,i=1 1 − 1 γ qj + 1 γK ≤ O(α log(Kγ)), (D.19) where the last inequality is due to Lemma D.1.2. Combining Eq. (D.18) and Eq. (D.19), we know that for any q ∈ ∆([K]) and strongly observable graph G with independence number α, PK i=1 qi(1−qi) Wi(G,p) ≤ O(α log(Kγ)). Plugging the above back to Eq. (D.17), we know that min p∈∆([K]) decγ(p; f, g, x) ≤ O 1 γ · EG∼µ [O(α(G) log(Kγ))] ≤ O α(g, x) log(Kγ) γ , where the last inequality is due to the definition of α(g, x). The following two auxiliary lemmas have been used in our analysis. Lemma D.1.1. For all gb ∈ conv(G) and context x ∈ X , α(g, x b ) ≤ α(G). 253 Proof. Let u ∈ ∆(G) be such that Eg∼u[g] = gb. For each g ∈ G, consider any qg ∈ Q(g, x). We have by definition qb ≜ Eg∼u[qg] ∈ Q(g, x b ), leading to α(g, x b ) = inf q∈Q(g,x b ) EG∼q[α(G)] ≤ EG∼qb[α(G)] = Eg∼u EG∼qg [α(G)] . Since qg can be any distribution in Q(g, x), the above implies α(g, x b ) ≤ Eg∼u inf ρ∈Q(g,x) EG∼ρ[α(G)] = Eg∼u [α(g, x)] , which is at most Eg∼u[supx∈X α(g, x)] ≤ α(G), finishing the proof. Lemma D.1.2 (Lemma 5 in [12]). Let G = (V, E) be a directed graph with |V | = K, in which Gi,i = 1 for all vertices i ∈ [K]. Assign each i ∈ V with a positive weight wi such that Pn i=1 wi ≤ 1 and wi ≥ ε for all i ∈ V for some constant 0 < ε < 1 2 . Then X K i=1 P wi j:Gj,i=1 wj ≤ 4α(G) ln 4K α(G)ε , where α(G) is the independence number of G. D.1.2 Parameter-Free Algorithm in the Partially Revealed Feedback Graphs Setting In this section, we show that applying doubling trick to Algorithm 6 achieves the same regret without the knowledge of α(G) in the partially revealed feedback setting. The idea follows Zhang et al. [148], which utilizes the value of the minimax problem Eq. (2.31) to guide the choice of γ. 254 Specifically, our algorithm goes in epochs with the parameter γ being γs in the s-th epoch and γ1 = q T max{RegSq,RegLog} . 
Within each epoch s (with starting round bs), at round t, we calculate the value Rt = X t τ=bs min p∈∆([K]) decγs (p; fτ , gτ , xτ ), (D.20) and decide whether to start a new epoch by checking whether Rt ≤ γs max RegSq, RegLog . Specifically, if Rt ≤ γs max RegSq, RegLog , we continue our algorithm using γs; otherwise, we set γs+1 = 2γs and restart the algorithm. Now we analyze the performance of the above described algorithm. Denote the s-th epoch to be Is = {bs, bs + 1, . . . , es} ≜ [bs, es] and let S be the total number of epochs. First, using Lemma D.1.1 and Lemma 2.4.4, we know that for any t within epoch s, we have γsRt ≤ Oe X t∈Is α(G) ! ≤ Oe(α(G)T). Applying this to the last round of the (S − 1)-th epoch, we obtain: γ 2 S−1 max RegSq, RegLog ≤ Oe(α(G)T), which, together with γS−1 = 2S−2γ1 and the definition of γ1, implies 2 S = Oe( p α(G)). Next, consider the regret in epoch s. According to Eq. (2.34), we know that the regret within epoch s is bounded as follows: E "X t∈Is f ⋆ (xt , it) − X t∈Is min i∈[K] f ⋆ (xt , i) # ≤ E "X t∈Is decγs (pt ; ft , gt , xt) # + 3γs 4 RegSq + 1 2 γsRegLog 255 ≤ E X t∈[bs,es−1] decγs (pt ; ft , gt , xt) + Oe α(G) γs + 2γs max{RegSq, RegLog} ≤ Oe(γs max{RegSq, RegLog}), (D.21) where the second inequality uses Lemma 2.4.2 and Lemma 2.4.4 again, and the last inequality is because at round t = es − 1, Rt ≤ γs max{RegSq, RegLog} is satisfied and α(G) γs ≤ T γs ≤ γs max{RegSq, RegLog}. Taking summation over all S epochs, we know that the overall regret is bounded as E[RegCB] ≤ X S s=1 Oe(γs max{RegSq, RegLog}) = X S s=1 Oe 2 s−1 q T max{RegSq, RegLog} ≤ Oe 2 S q T max{RegSq, RegLog} = Oe q α(G)T max{RegSq, RegLog} , (D.22) which is exactly the same as Theorem 2.4.1. D.2 Omitted Details in Section 2.4.3.2 Lemma 2.4.5. For g, g′ ∈ conv(G), f ∈ conv(F), and x ∈ X , we have min p∈∆([K]) decγ(p; f, g′ , x) ≤ min p∈∆([K]) dec γ 3 (p; f, g, x) + γ 12 X K i=1 X K j=1 (g(x, i, j) − g ′ (x, i, j))2 g(x, i, j) + g ′(x, i, j) . Proof. For any v ∈ [0, 1]K, applying Lemma 2.4.3 with z ′ = g ′ (x, i, j)(f(x, j) − vj ) , z = g(x, i, j)(f(x, j) − vj ) 2 , for each i and j, we know that − 3γ X K i=1 X K j=1 pig ′ (x, i, j)(f(x, j) − vj ) 2 − γ X K i=1 X K j=1 pi (g(x, i, j) − g ′ (x, i, j))2 g(x, i, j) + g ′(x, i, j) 256 ≤ −γ X K i=1 X K j=1 pig(x, i, j)(f(x, j) − vj ) 2 . (D.23) Plugging this back to the definition of decγ(p; f, g, x), we get min p∈∆([K]) decγ(p; f, g′ , x) = min p∈∆([K]) max i ⋆∈[K] v ⋆∈[0,1]K X K i=1 piv ⋆ i − v ⋆ i ⋆ − 1 4 γ X K i=1 X K j=1 pig ′ (x, i, j)(f(x, j) − v ⋆ j ) 2 (i) ≤ min p∈∆([K]) max i ⋆∈[K] v ⋆∈[0,1]K X K i=1 piv ⋆ i − v ⋆ i ⋆ − 1 12 γ X K i=1 X K j=1 pig(x, i, j)(f(x, j) − v ⋆ j ) 2 + 1 12 γ X K i=1 X K j=1 pi (g(x, i, j) − g ′ (x, i, j))2 g(xt , i, j) + g ′(x, i, j) (ii) ≤ min p∈∆([K]) max i ⋆∈[K] v ⋆∈[0,1]K X K i=1 piv ⋆ i − v ⋆ i ⋆ − 1 12 γ X K i=1 X K j=1 pig(x, i, j)(f(x, j) − v ⋆ j ) 2 + 1 12 γ X K i=1 X K j=1 (g(x, i, j) − g ′ (x, i, j))2 g(x, i, j) + g ′(x, i, j) = min p∈∆([K]) dec γ 3 (p; f, g, x) + 1 12 γ X K i=1 X K j=1 (g(x, i, j) − g ′ (x, i, j))2 g(x, i, j) + g ′(x, i, j) , where (i) uses Eq. (D.23) and (ii) holds by trivially bounding pi in the last term by 1. D.3 Implementation Details We first point out that the DEC defined in Zhang et al. [148] in fact relaxes the constraint v ⋆ ∈ [0, 1]K to v ⋆ ∈ R K, which makes the problem of minimizing the DEC a simple convex program (see their Theorem 3.6). 
In the partially revealed graph setting, we in fact can do the exact same trick because it does not affect our analysis at all. Since our experiments are for an application with partially revealed graphs, we indeed implemented our algorithm in this way for simplicity. 257 However, this relaxation does not work for the fully revealed graph setting, since the analysis of Lemma 2.4.5 relies on v ⋆ ∈ [0, 1]K. Nevertheless, minimizing the DEC is still a relatively simple convex problem. To see this, we first fix an i ⋆ ∈ [K] and work on the supremum over v ⋆ ∈ [0, 1]K. Specifically, define v ⋆ (i ⋆ ) = argmax v∈[0,1]K X K i=1 pivi − vi ⋆ − γ 4 X K i=1 pi X K j=1 g(x, i, j)(f(x, j) − vj ) 2 . Direct calculation shows that v ⋆ j (i ⋆ ) = max n f(x, j) − (pj−1{j=i ⋆}) 2 γ PK i=1 pig(x,i,j) , 0 o for all j ∈ [K]. Let hi ⋆ (p) be the maximum value attained by v ⋆ (i ⋆ ), which is convex in p since it is a point-wise supremum over functions linear in p. It is then clear that solving argminp∈∆([K]) decγ(p; f, g, x) is equivalent to solving the following constrained convex problem: min u∈R,p∈∆([K]) u s.t. hi ⋆ (p) ≤ u, ∀i ⋆ ∈ [K]. D.3.1 Omitted Details in Section 2.4.4 In this section, we include the omitted details for our experiments. Dataset Details. For the synthetic datasets, as mentioned in Section 2.4.4.1, the two datasets differ in how {xt} T t=1 are generated. In the first dataset, each coordinate of xt ∈ R 32 is independently drawn from N (0, 1); in the second dataset, the first 8 coordinates of xt is independently drawn from N (0, 1) and the remaining coordinates are all 1. The real auction dataset we used in Section 2.4.5 is an eBay auction dataset (available at https://cims.nyu.edu/~munoz/data/) with the t-th datapoint consisting of a 78-dimensional feature vector xt , a winning price of the auction vt , and a competing price wt . We treat the winning price 258 as the value of the learner in our experiment. We randomly select a subset of 5000 data points whose winning price is in range [100, 300] and normalize the value and the competing price to range [0, 1]. Model Details. We implement the graph oracle as a linear classification model, aiming to predict the distribution of the competing price, denoted as pw(xt) ∈ ∆([K]). With pw(xt), we sample wbt ∼ pw(xt) and the predicted graph gt is calculated as gt(xt , i, j) = 1[i ≥ wbt ] · 1[i ≤ j] + 1[i < wbt ] · 1[j < wbt ]. (D.24) We implement the loss oracle as a two-layer fully connected neural network with hidden size 32. The neural network predicts the value of the data point xt , denoted as vbt . The predicted loss of each arm i is then calculated as: ft(xt , i) = 1 2 [1 − 1 [ai ≥ wbt ] · (vbt − ai)] , (D.25) where ai = (i − 1)ε. Training Details. For the graph oracle, the loss function of each round t is calculated as: 1 K X (i,j,b)∈St ℓlog(Ewbt∼pw(xt) [gt(xt , i, j)], b), where St is the input dataset defined in SquareCB.UG. For the loss oracle, the loss function is calculated as 1 |At| P j∈At (ft(xt , j) − ℓt,j ) 2 . We apply online gradient descent to train both models. Since the loss regression model aims to predict the value, we only update it when the learner wins the auction and observes the value. For experiments on the real auction dataset, learning rate is searched over {0.005, 0.01, 0.05} for the loss oracle 259 and over {0.01, 0.05} for the graph regression oracle. For experiments on the synthetic datasets, they are searched over {0.005, 0.01, 0.02} and {0.01, 0.05} respectively. 
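Before listing the remaining hyper-parameters, we give a concrete reference for the model outputs defined above. The sketch below computes the graph prediction of Eq. (D.24) and the loss prediction of Eq. (D.25) from a sampled competing price and a predicted value (the function names are ours, and we interpret the index comparisons in Eq. (D.24) as comparisons between the corresponding bids ai = (i − 1)ε and the sampled competing price):

import numpy as np

def predicted_graph(K, eps, w_hat):
    # Eq. (D.24): entry (i, j) is 1 iff playing bid a_i reveals the loss of bid a_j,
    # where a_i = (i - 1) * eps and w_hat is the sampled competing price.
    bids = np.arange(K) * eps
    win = bids >= w_hat                  # bids that beat the sampled competing price
    g = np.zeros((K, K))
    for i in range(K):
        if win[i]:
            g[i, :] = bids >= bids[i]    # winning reveals every bid at least as high
        else:
            g[i, :] = ~win               # losing reveals every other losing bid
    return g

def predicted_losses(K, eps, w_hat, v_hat):
    # Eq. (D.25): predicted loss of each bid given the predicted value v_hat
    bids = np.arange(K) * eps
    return 0.5 * (1.0 - (bids >= w_hat) * (v_hat - bids))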
For SquareCB, we set the exploration parameter γ = c · √ KT (based on what its theory suggests), where c is searched over {0.5, 1, 2}. For our SquareCB.UG, we set γ = c · √ T, where c is also searched over {0.5, 1, 2}. The experiment on the real auction dataset is repeated with 8 different random seeds and the experiment on the synthetic datasets is repeated with 4 different random seeds. A Closed-Form Solution. In this part, we introduce a closed form of pt which leads to a more efficient implementation of SquareCB.UG for the specific setting considered in the experiments. Specifically, given the predicted competing price wbt , let b ∈ [ 1 ε + 1] be the smallest action such that (b − 1)ε ≥ wbt . Given ft and gt defined in Eq. (D.25) and Eq. (D.24), define a closed-form p which concentrates on action 1 (bidding 0) and action b as follows, p1 = 1 2+γ(1/2−ft(xt,b)) ft(xt , b) ≤ 1 2 1 − 1 2+γ(ft(xt,b)−1/2) ft(xt , b) > 1 2 , pb = 1 − p1, (D.26) In the following lemma, we prove that the closed-form probability distribution in Eq. (D.26) guarantees decγ(p; ft , gt , xt) ≤ 4 γ , which is enough for all our analysis to hold (despite the fact that it does not exactly minimize the DEC). Lemma D.3.1. For any xt ∈ X , ft ∈ F, and gt in the form of Eq. (D.24), the probability distribution p defined in Eq. (D.26) guarantees decγ(p, ft , gt , xt) ≤ 4 γ . Proof. According to the analysis in Lemma 2.4.4, it suffices to bound decγ(p, ft , gt , xt). Based on Eq. (D.11), we have decγ(p, ft , gt , xt) 260 = max i ⋆∈[K] X K i=1 pift(xt , i) − ft(xt , i⋆ ) + 1 γ X i̸=i ⋆ p 2 P i K i ′=1 pi ′gt(xt , i′ , i) + 1 γ (1 − pi ⋆ ) 2 PK i ′=1 pi ′gt(xt , i′ , i⋆) ≤ max i ⋆∈[K] (X K i=1 pift(xt , i) − ft(xt , i⋆ ) + 1 γ X K i=1 p 2 P i K i ′=1 pi ′gt(xt , i′ , i) + 1 γ PK i ′=1 pi ′gt(xt , i′ , i⋆) ) . Note that according to Eq. (D.25) and the definition of b, for i < b, ft(x, i) = 1 2 . We now first consider the case ft(xt , b) ≤ 1 2 and p1 = 1 2+γ( 1 2 −ft(xt,b)) ≤ 1 2 . 1. Suppose 1 ≤ i ⋆ < b, we observe that ft(xt , i⋆ ) = 1 2 and have decγ(p, ft , gt , xt) ≤ p1 2 + pbft(xt , b) − 1 2 + 1 γ 1 p1 + 1 = p1 2 + (1 − p1)ft(xt , b) − 1 2 + 1 γ 2 + γ 1 2 − ft(xt , b) + 1 = p1 2 − p1ft(xt , b) + 3 γ = p1 1 2 − ft(xt , b) + 3 γ ≤ 4 γ . 2. Suppose b ≤ i ⋆ ≤ K. According to Eq. (D.25), we know that ft(xt , i⋆ ) ≥ ft(xt , b) and obtain decγ(p, ft , gt , xt) ≤ p1 2 + pbft(xt , b) − ft(xt , b) + 1 γ 1 pb + 1 ≤ p1 2 + (1 − p1)ft(xt , b) − ft(xt , b) + 3 γ (since pb = 1 − p1 ≥ 1 2 ) = p1 1 2 − ft(xt , b) + 3 γ ≤ 4 γ . Then we consider the case when ft(xt , b) > 1 2 and p1 = 1 − 1 2+γ(ft(xt,b)− 1 2 ) > 1 2 . 261 1. Suppose 1 ≤ i ⋆ < b, we have decγ(p, ft , gt , xt) ≤ p1 2 + pbft(xt , b) − 1 2 + 1 γ 1 p1 + 1 = 1 2 + pb ft(xt , b) − 1 2 − 1 2 + 1 γ 1 p1 + 1 ≤ pb ft(xt , b) − 1 2 + 3 γ (since p1 > 1 2 ) ≤ 4 γ . 2. Suppose b ≤ i ⋆ ≤ K, we have decγ(p, ft , gt , xt) ≤ p1 2 + pbft(xt , b) − ft(xt , b) + 1 γ 1 pb + 1 = 1 2 + pb ft(xt , b) − 1 2 − ft(xt , b) + 1 γ 2 + γ ft(xt , b) − 1 2 + 1 = pb ft(xt , b) − 1 2 + 3 γ ≤ 4 γ . Combining the two cases finishes the proof. 262 Appendix E Omitted Details in Section 3.1 E.1 Omitted details for Section 3.1.1 E.1.1 Proof of Theorem 3.1.1 First we generalize the proof of the standard Freedman’s inequality in the following way. For any λt that is Ft-measurable and such that λt ≤ 1/Bt , we have with Et [·] ≜ E[·|Ft ]: Et h e λtXt i ≤ Et 1 + λtXt + λ 2 t X2 t = 1 + λ 2 tEt X2 t ≤ exp λ 2 tEt X2 t . 
(E.1) Now for any t define random variable Zt such that Z0 = 1 and Zt ≜ Zt−1 · exp(λtXt − λ 2 tEt X2 t ) = exp X t s=1 λsXs − X t s=1 λ 2 sEs X2 s ! . From Eq. (E.1), we have Et [Zt ] = Zt−1 · exp(−λ 2 tEt X2 t )Et h e λtXt i ≤ Zt−1 · exp(−λ 2 tEt X2 t ) exp(λ 2 tEt X2 t ) ≤ Zt−1. 26 Therefore, taking the overall expectation we have E [ZT ] ≤ E [ZT −1] ≤ · · · ≤ E [Z0] = 1. Using Markov’s inequality, we have Pr ZT ≥ 1 δ ′ ≤ δ ′ . In other words, we have with probability at least 1 − δ ′ , X T t=1 λtXt ≤ ln(1/δ′ ) +X T t=1 λ 2 tEt X2 t . (E.2) The proof of the standard Freedman’s inequality takes all λt to be the same fixed value, while in our case it is important to apply Eq. (E.2) several times with different sets of values of λt . Specifically, for each i ∈ [⌈log2 (b 2T)⌉] and j ∈ [⌈log2 b⌉], set λt = λ ≜ min 2 −j , q ln(1/δ′)/2 i , for t ∈ Tj , where Tj ≜ t : 2j−1 ≤ max s≤t Bs ≤ 2 j , and λt = 0 otherwise. Clearly λt is Ft-measurable (since B1, . . . , Bt are Ft-measurable). Applying Eq. (E.2) gives X t∈Tj Xt ≤ ln(1/δ′ ) λ + X t∈Tj λEt X2 t ≤ 2 j ln(1/δ′ ) + q 2 i ln(1/δ′) + λ X T t=1 Et X2 t ( 1 λ ≤ max{2 j , p 2 i/ ln(1/δ′)}) ≤ 2 max s∈Tj Bs ln(1/δ′ ) + q 2 i ln(1/δ′) + V r ln(1/δ′) 2 i (2 j−1 ≤ maxs∈Tj Bs) ≤ 2B ⋆ ln(1/δ′ ) + q 2 i ln(1/δ′) + V r ln(1/δ′) 2 i . 264 By a union bound, the above holds with probability at least 1 − Cδ′ for all i ∈ [⌈log2 (b 2T)⌉] and j ∈ [⌈log2 b⌉]. In particular, since 1 ≤ V ≤ b 2T (almost surely), there exists an i ⋆ ∈ [⌈log2 (b 2T)⌉] such that 2 i ⋆−1 ≤ V ≤ 2 i ⋆ , and thus X T t=1 Xt = X j∈[⌈log2 b⌉] X t∈Tj Xt ≤ C · 2B ⋆ ln(1/δ′ ) + q 2 i ⋆ ln(1/δ′) + V r ln(1/δ′) 2 i ⋆ ! ≤ C · 2B ⋆ ln(1/δ′ ) + p 2V ln(1/δ′) + p V ln(1/δ′) ≤ C · 2B ⋆ ln(1/δ′ ) + p 8V ln(1/δ′) . Finally replacing δ ′ with δ/C finishes the proof. E.1.2 Proof of Lemma 3.1.1 First note that ℓt,it = D wt , ℓbt E . Using standard OMD analysis (e.g.,Lemma 12 of [7]), we have ℓt,it − D u, ℓbt E ≤ Dψt (u, wt) − Dψt (u, wt+1) +X d i=1 ηt,iw 2 t,iℓb2 t,i. (E.3) Summing the first two terms on the right hand side over t shows (here h(y) = y − 1 − ln y): X T t=1 (Dψt (u, wt) − Dψt (u, wt+1)) ≤ Dψ1 (u, w1) + T X−1 t=1 Dψt+1 (u, wt+1) − Dψt (u, wt+1) (DψT (u, wT +1) ≥ 0) = 1 η X d j=1 h uj w1,j + X d j=1 T X−1 t=1 1 ηt+1,j − 1 ηt,j h uj wt+1, j . (E.4) For the first term, since uj ≥ 1 T and w1,j = 1 d for each j, we have 1 η X d j=1 h uj w1,j = 1 η X d j=1 − ln(duj ) ≤ d ln T η . 26 Now we analyze the second term for each j. Note that ηT,j = κ nj η1,j where nj is the number of times Algorithm 7 increases the learning rate for arm j. Let tj be the time step such that ηT,j = ηtj+1,j = κηtj ,j , that is, the last time step where the learning rate for arm j is increased. Then we have 1 ηtj+1,j − 1 ηtj ,j h uj wtj+1, j = 1 − κ κ nj η h uj wtj+1,j ≤ −h uj wtj+1,j 5η ln T = −h ujρT ,j 2 5η ln T , where we use the facts 1 − κ ≤ − 1 ln T and κ nj ≤ 5. The term −h ujρT ,j 2 is bounded by −h ujρT,j 2 = ln ujρT,j 2 − ujρT,j 2 + 1 ≤ 1 + ln T − ujρT,j 2 , where the inequality is because ujρT ,j 2 ≤ 1 wtj+1,j ≤ T. Plugging this result for every j back to Eq. (E.4), we get X T t=1 Dψt (u, wt) − Dψt (u, wt+1) ≤ d ln T η + X d j=1 2 + 2 ln T − ujρT,j 10η ln T = O d ln T η − ⟨ρT , u⟩ 10η ln T . Finally, since ηt,iw 2 t,iℓb2 t,i ≤ ηt,it ℓt,it ≤ ηT,it ℓt,it ≤ 5ηℓt,it , summing Eq. (E.3) over t gives: X T t=1 ℓt,it − D u, ℓbt E ≤ X T t=1 (Dψt (u, wt) − Dψt (u, wt+1)) +X T t=1 X d i=1 ηt,iw 2 t,iℓb2 t,i ≤ O d ln T η − ⟨ρT , u⟩ 10η ln T + 5η X T t=1 ℓt,it = O d ln T η + η X T t=1 ℓt,it ! − ⟨ρT , u⟩ 10η ln T . 
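Before Theorem 3.1.1 is applied in the next subsection, we note that the final bound can be checked numerically. The short script below simulates a Rademacher martingale-difference sequence and estimates how often the deviation exceeds the bound derived above (the variable names are ours and the script is purely illustrative; it plays no role in the analysis):

import numpy as np

rng = np.random.default_rng(0)

def strengthened_freedman_bound(B_star, V, b, T, delta):
    # The high-probability bound derived above:
    #   sum_t X_t <= C * (2 B* ln(C / delta) + sqrt(8 V ln(C / delta))),
    # with C = ceil(log2 b) * ceil(log2(b^2 T)).
    C = int(np.ceil(np.log2(b))) * int(np.ceil(np.log2(b ** 2 * T)))
    return C * (2 * B_star * np.log(C / delta) + np.sqrt(8 * V * np.log(C / delta)))

T, b, delta, trials = 1000, 2.0, 0.05, 2000
violations = 0
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=T)   # martingale differences with |X_t| <= B_t = 1 <= b
    V = float(T)                          # sum_t E_t[X_t^2] = T here, so 1 <= V <= b^2 T
    if X.sum() > strengthened_freedman_bound(B_star=1.0, V=V, b=b, T=T, delta=delta):
        violations += 1
print(f"empirical violation rate: {violations / trials:.4f} (target: at most {delta})")

In this simple regime the bound is far from tight, so the observed violation rate is essentially zero.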
2 E.1.3 Proof of Theorem 3.1.2 Fix any i ⋆ ∈ [d] and let u = (1 − d T )e ⋆ + 1 T 1, where e ⋆ is the one-hot vector for i ⋆ . First note that X T t=1 (ℓt,it − ℓt,i⋆ ) = X T t=1 ℓt,it − D u, ℓbt E + X T t=1 D u, ℓbt − ℓt E + X T t=1 ⟨u − e ⋆ , ℓt⟩ ≤ X T t=1 ℓt,it − D u, ℓbt E + X T t=1 D u, ℓbt − ℓt E + d. For the first term, using Lemma 3.1.1, we have X T t=1 ℓt,it − D u, ℓbt E ≤ O d ln T η + ηLT − ⟨ρT , u⟩ 10η ln T , (E.5) where LT = PT t=1 ℓt,it . For the second term above, we use Theorem 3.1.1 with Xt = D u, ℓbt − ℓt E , Bt = ⟨ρt , u⟩ ∈ [1, T], b = T, and the fact Et [X2 t ] ≤ Et D u, ℓbt E2 = Et " u 2 it ℓ 2 t,it p 2 t,it # ≤ X d i=1 u 2 i ℓt,iρT,i ≤ ⟨ρT , u⟩ ⟨u, ℓt⟩, showing that with probability at least 1 − δ ′ , X T t=1 D u, ℓbt − ℓt E ≤ C p 8Lu ⟨ρT , u⟩ln (C/δ′) + 2 ⟨ρT , u⟩ln C/δ′ , (E.6) where Lu = D u,PT t=1 ℓt E and C = ⌈log(b)⌉⌈log(b 2T)⌉ = ⌈log(T)⌉⌈3 log(T)⌉. With η ≤ 1 40C ln T ln(C/δ′) , we then have with probability at least 1 − δ ′ , X T t=1 ℓt,it − ℓt,i⋆ 26 ≤ Oe d η + ηLT − ⟨ρT , u⟩ 10η ln T + C p 8Lu ⟨ρT , u⟩ln (C/δ′) + 2 ⟨ρT , u⟩ln C/δ′ (Eq. (E.5) and Eq. (E.6)) ≤ Oe d η + ηLT + η40C 2Lu ln(C/δ′ ) ln T − ⟨ρT , u⟩ 20η ln T + 2C ⟨ρT , u⟩ln(C/δ′ ) (AM-GM inequality) ≤ Oe d η + ηLT ln(1/δ′ ) + ηLu ln(1/δ′ ) . (η < 1 40C ln T ln(C/δ) ) Therefore, rearranging the terms, using the fact Lu ≤ L ⋆ + d, and choosing η = min (r d L⋆ ln(1/δ′), 1 40C ln T ln(C/δ′) , 1 2 ) , we have with probability 1 − δ ′ , X T t=1 ℓt,it − ℓt,i⋆ = Oe p dL⋆ ln(1/δ′) + d ln(1/δ′ ) , where L ⋆ = PT t=1 ℓt,i⋆ . This finishes the proof when the adversary is oblivious. For adaptive adversaries, taking a union bound over all possible best arms i ⋆ ∈ [d] and setting δ ′ = δ/d, we have with probability 1 − δ, Reg = Oe p dL⋆ ln(d/δ) + d ln(d/δ) , finishing the proof. Remark 1. Although the proof above requires tuning the initial learning rate η in terms of the unknown quantity L ⋆ , standard doubling trick can remove this restriction (even in the bandit setting). We refer the reader to Lee, Luo, and Zhang [100] for detailed exposition on how to achieve so. 26 Algorithm 20 d-dimensional version of Algorithm 8 Input: decision set Ω ⊆ R d , a ν-self-concordant barrier ψ(w) for Ω, initial learning rate η. Define: increase factor κ = e 1 100d ln(νT ) , Ψ(w, b) = 400 ψ w b − 2ν ln b . Initialize: w1 = argminw∈Ω ψ(w), H1 = ∇2Ψ(w1, 1), η1 = η, S = {1}. Define: shrunk decision set Ω ′ = {w ∈ Ω: πw1 (w) ≤ 1 − 1 T }, J = [Id, 0d] ∈ R d×(d+1) . for t = 1, 2, . . . , T do 1 Uniformly at random sample st from H − 1 2 t ed+1⊥ ∩ S d+1 . 2 Compute wet = wt + JH − 1 2 t st . 3 Play wet , observe loss ⟨wet , ℓt⟩, and construct loss estimator ℓbt = d⟨wet , ℓt⟩JH 1/2 t st . 4 Compute wt+1 = argminw∈Ω′ nDw, ℓbt E + Dψt (w, wt) o , where ψt = 1 ηt ψ. 5 Compute Ht+1 = ∇2Ψ(wt+1, 1). 6 if λmax(Ht+1 − P τ∈S Hτ ) > 0 then S ← S ∪ {t + 1} and set ηt+1 = ηtκ; ; 7 else set ηt+1 = ηt . ; E.2 Omitted details for Section 3.1.2 E.2.1 More explanation on Algorithm 8 Here, we provide a d-dimensional version of Algorithm 8 by removing the explicit lifting and performing OMD in R d ; see Algorithm 20. It is clear that this version is exactly the same as Algorithm 8. Compared to the original SCRiBLe, one can see that besides the increasing learning rate schedule, the only difference is how the point wet is computed. In particular, one can verify that wet does not necessarily satisfy ∥wet − wt∥∇2ψ(wt) = 1, meaning that wet is not necessarily on the boundary of the Dikin ellipsoid centered at wt with respect to ψ. 
In other words, our algorithm provides a new sampling scheme for SCRiBLe. E.2.2 Preliminary for analysis In this section, we introduce the preliminary of self-concordant barriers and normal-barriers, including the definitions and some useful properties that will be used frequently in later analysis. Self-concordant barriers. Let ψ : int(Ω) → R be a C 3 smooth convex function. ψ is called a selfconcordant barrier on Ω if it satisfies: 2 • ψ(xi) → ∞ as i → ∞ for any sequence x1, x2, · · · ∈ int(Ω) ⊂ R d converging to the boundary of Ω; • for all w ∈ int(Ω) and h ∈ R d , the following inequality always holds: X d i=1 X d j=1 X d k=1 ∂ 3ψ(w) ∂wi∂wj∂wk hihjhk ≤ 2∥h∥ 3 ∇2ψ(w) . We further call ψ is a ν-self-concordant barrier if it satisfies the conditions above and also ⟨∇ψ(w), h⟩ ≤ √ ν∥h∥∇2ψ(w) for all w ∈ int(Ω) and h ∈ R d . Lemma E.2.1 (Theorem 2.1.1 in [120]). If ψ is a self-concordant barrier on Ω, then the Dikin ellipsoid centered at w ∈ int(Ω), defined as {v : ∥v − w∥∇2ψ(w) ≤ 1}, is always within Ω. Moreover, ∥h∥∇2ψ(v) ≥ ∥h∥∇2ψ(w) 1 − ∥v − w∥ ∇2ψ(w) holds for any h ∈ R d and any v with ∥v − w∥∇2ψ(w) ≤ 1. Lemma E.2.2 (Theorem 2.5.1 in [120]). For any closed convex body Ω ⊂ R d , there exists an O(d)-selfconcordant barrier on Ω. Lemma E.2.3 (Corollary 2.3.1 in [120]). Let ψ be a self-concordant barrier for Ω ⊂ R d . Then for any w ∈ int(Ω) and any h ∈ Ω such that w + bh ∈ Ω for all b ≥ 0, we have ∥h∥∇2ψ(w) ≤ − ⟨∇ψ(w), h⟩. 270 Normal Barriers. Let K ⊆ R d be a closed and proper convex cone and let θ ≥ 1. A function ψ : int(K) → R is called a θ-logarithmically homogeneous self-concordant barrier (or simply a θ-normal barrier) on K if it is self-concordant on int(K) and is logarithmically homogeneous with parameter θ, which means ψ(tw) = ψ(w) − θ ln t, ∀w ∈ int(K), t > 0. The following two lemmas show the relationship between θ-normal barriers and θ-self-concordant barriers. Lemma E.2.4 (Corollary 2.3.2 in [120]). A θ-normal barrier on K is a θ-self-concordant barrier on K. Lemma E.2.5 (Proposition 5.1.4 in [120]). Suppose f is a θ-self-concordant barrier on K ⊆ R d . Then the function F(w, b) = 400 f w b − 2θ ln b , is a 800θ-normal barrier for con(K) ⊆ R d+1, where con(K) = {0} ∪ {(w, b) : w b ∈ K, w ∈ R d , b > 0} is the conic hull of K lifted to R d+1 (by appending 1 to the last coordinate). Note that our regularizer Ψ defined in Algorithm 8 is exactly based on this formula. We point out that, however, our entire analysis works for any O(ν)-normal barrier Ψ, as we will only use the following general properties of normal barriers, instead of the concrete form of Ψ. As mentioned in Footnote †, we use this concrete formula only to emphasize that, just as SCRiBLe, our algorithm requires only a selfconcordant barrier of the original set Ω. Lemma E.2.6 (Proposition 2.3.4 in [120]). If ψ is a θ-normal barrier on K, then we have for all w, u ∈ int(K), 1. ∥w∥ 2 ∇2ψ(w) = w ⊤∇2ψ(w)w = θ, 2. ∇2ψ(w)w = −∇ψ(w), 271 3. ψ(u) ≥ ψ(w) − θ ln −⟨∇ψ(w),u⟩ θ . Next, we show the definition of Minkowsky functions, which is used to define the shrunk decision domain similar to the clipped simplex in multi-armed bandit setting. Minkowsky functions. The Minkowsky function of a convex body Ω with the pole at w ∈ int(Ω) is a function πw : Ω → R defined as πw(u) = inf t > 0 w + u − w t ∈ Ω . The last lemma shows several useful properties using the Minkowsky function. Lemma E.2.7 (Proposition 2.3.2 in [120]). Let ψ be a ν-self-concordant barrier on Ω ⊆ R d and u, w ∈ int(Ω). 
Then for any h ∈ R d , we have ∥h∥∇2ψ(u) ≤ 1 + 3ν 1 − πw(u) ∥h∥∇2ψ(w) , | ⟨∇ψ(u), h⟩ | ≤ ν 1 − πw(u) ∥h∥∇2ψ(w) , ψ(u) − ψ(w) ≤ ν ln 1 1 − πw(u) . E.2.3 Proof of Theorem 3.1.3 To prove the theorem, we decompose the regret against any fixed u ⋆ ∈ Ω (with u ⋆ = (u ⋆ , 1) ∈ Ω) into the following three terms: X T t=1 ⟨wet − u ⋆ , ℓt⟩ = X T t=1 ⟨wet − u ⋆ , ℓt⟩ (define ℓt = (ℓt , 0)) 272 = X T t=1 ⟨wet , ℓt⟩ − D wt , ℓbt E + D u, ℓbt − ℓt E | {z } Deviation + X T t=1 D wt − u, ℓbt E | {z } Reg-Term + X T t=1 ⟨u − u ⋆ , ℓt⟩, (E.7) where u = 1 − 1 T · u ⋆ + 1 T · w1 ∈ Ω′ . Note that the last term is trivially bounded by 2 as X T t=1 ⟨u − u ⋆ , ℓt⟩ = X T t=1 ⟨u − u ⋆ , ℓt⟩ = 1 T X T t=1 ⟨u ⋆ − w1, ℓt⟩ ≤ 2, where the last inequality is because | ⟨w, ℓt⟩ | ≤ 1 for all w ∈ Ω. In the following sections, we show how to bound other terms. Specifically, we bound Deviation in Section E.2.3.1 and Reg-Term in Section E.2.3.2. Finally we prove Theorem 3.1.3 in Section E.2.3.3. We will use the following notations in the remaining of this section (the first two are mentioned above already): ℓt ≜ (ℓt , 0), u ≜ (u, 1) ≜ 1 − 1 T · u ⋆ + 1 T · w1 ∈ Ω′ , ρ ≜ max t∈[T] ∥u∥Ht , (E.8) LT ≜ X T t=1 ⟨wet , ℓt⟩, LT ≜ X T t=1 |⟨wet , ℓt⟩| , (E.9) ˚LT ≜ X T t=1 Et [|⟨wet , ℓt⟩|] , Lu ≜ X T t=1 |⟨u, ℓt⟩| . (E.10) Before proceeding, we provide one useful lemma. Lemma E.2.8. We have ∥u∥H1 ≤ 800ν. Proof. Clearly, for any b > 0, we have w1 + bu still in the conic hull of Ω. According to Lemma E.2.3, we thus have ∥u∥H1 ≤ ⟨−∇Ψ(w1),u⟩. Note that Ψ is a 800ν-normal barrier by Lemma E.2.5. By the first order optimality condition of w1 and Lemma E.2.6, we then have 0 ≤ ⟨∇Ψ(w1),u − w1⟩ = ⟨∇Ψ(w1),u⟩ + 800ν. 27 Combining the above gives ∥u∥H1 ≤ ⟨−∇Ψ(w1),u⟩ ≤ 800ν. E.2.3.1 Bounding Deviation We first show that ℓbt is an unbiased estimator of ℓt for the first d coordinates. Lemma E.2.9. We have Et h ℓbt,ii = ℓt,i for i ∈ [d]. Proof. Let v = H −1/2 t ed+1 H −1/2 t ed+1 2 . First note that Et [sts ⊤ t ] = 1 d I − vv⊤ (E.11) by the definition of st . Then by the definition of ℓbt , we have Et h ℓbt i = Et d wt + H − 1 2 t st , ℓt · H 1 2 t st = Et d ⟨wt , ℓt⟩ · H 1 2 t st + d · H 1 2 t st H − 1 2 t st , ℓt = d ⟨wt , ℓt⟩ · H 1 2 t Et [st ] + Et d · H 1 2 t sts ⊤ t H − 1 2 t ℓt = d · H 1 2 t Et h sts ⊤ t i H − 1 2 t ℓt (Et [st ] = 0 by symmetry) = H 1 2 t I − vv⊤ H − 1 2 t ℓt (Eq. (E.11)) = ℓt − ed+1e ⊤ d+1H−1 t ℓt H −1/2 t ed+1 2 2 . Noticing that the first d coordinates of ed+1e ⊤ d+1H−1 t ℓt are all zeros concludes the proof. Now we are ready to bound Deviation. 274 Lemma E.2.10. With probability at least 1 − δ, we have Deviation ≤ 161Cdq (ν + ρ 2) ˚LT ln(C/δ) + C q 32Lu ln(C/δ) + 64Cd √ ν + ρ ln(C/δ), where C = Θ(ln2 (dνT)). Proof. Define Xt ≜ ⟨wet , ℓt⟩ − D wt , ℓbt E + D u, ℓbt − ℓt E and we have Deviation = PT t=1 Xt . The goal is to apply our strengthened Freedman’s inequality Theorem 3.1.1. To this end, first we show Et [Xt ] = 0. Indeed, we have Et [wet ] = wt and Et [Xt ] = ⟨wt , ℓt⟩ − ⟨wt , ℓt⟩ + ⟨u, ℓt − ℓt⟩ − Et h (wt,d+1 − ut,d+1)ℓb t,d+1i (Lemma E.2.9) = 0. (wt,d+1 = ut,d+1 = 1) Next, we bound Xt by a Ft-measurable random variable Bt ≜ 32d √ ν +d∥u∥Ht . 
This can be shown using the properties of a normal barrier: Xt = ⟨wet , ℓt⟩ − D wt , ℓbt E + D u, ℓbt − ℓt E = ⟨wet , ℓt⟩ − wt , d · ⟨wet , ℓt⟩ · H 1 2 t st + u, d · ⟨wet , ℓt⟩ H 1 2 t st − ℓt = ⟨wet , ℓt⟩ 1 − dw⊤ t H 1 2 t st + d ⟨wet , ℓt⟩u ⊤H 1 2 t st − ⟨u, ℓt⟩ ≤ 2 + d w⊤ t H 1 2 t st + d u ⊤H 1 2 t st (| ⟨w, ℓt⟩ | ≤ 1 for any w ∈ Ω) ≤ 2 + d∥wt∥Ht + d∥u∥Ht (by Cauchy-Schwarz inequality and s ⊤ t st = 1) ≤ 2 + 20d √ 2ν + d∥u∥Ht (Lemma E.2.5 and Lemma E.2.6) ≤ 32d √ ν + d∥u∥Ht . (ν ≥ 1) 27 Then, we show that Bt is bounded by a constant b ≜ 2 × 106dν2T for all t: Bt ≤ 32d √ ν + d∥u∥H1 · 1 + 2400ν 1 − πw1 (wt) (Lemma E.2.7) ≤ 32d √ ν + d∥u∥H1 (1 + 2400ν)T (wt ∈ Ω′ ) ≤ 32d √ ν + 800dν(1 + 2400ν)T (Lemma E.2.8) ≤ 2 × 106 dν2T. (ν ≥ 1) The last step before applying Theorem 3.1.1 is to calculate Et [X2 t ]. We first write Et [X2 t ] = Et ⟨wet , ℓt⟩ − D wt , ℓbt E + D u, ℓbt − ℓt E2 ≤ 2Et ⟨wet , ℓt⟩ − D wt , ℓbt E2 + 2Et D u, ℓbt − ℓt E2 . (E.12) The first term is bounded by: Et ⟨wet , ℓt⟩ − D wt , ℓbt E2 = Et " ⟨wet , ℓt⟩ 2 1 − wt , d · H 1 2 t st 2 # ≤ Et " ⟨wet , ℓt⟩ 2 2d 2 w⊤ t H 1 2 t st 2 + 2!# ≤ Et " |⟨wet , ℓt⟩| 2d 2 w⊤ t H 1 2 t st 2 + 2!# (⟨wet , ℓt⟩ ≤ 1) ≤ Et |⟨wet , ℓt⟩| 2d 2 ∥wt∥ 2 Ht ∥st∥ 2 2 + 2 (Cauchy-Schwarz inequality) ≤ Et |⟨wet , ℓt⟩| 1600d 2 ν + 2 (∥st∥ 2 2 = 1 and Lemma E.2.6) ≤ 1602d 2 νEt [|⟨wet , ℓt⟩|] . 2 Similarly, the second term is bounded by: Et D u, ℓbt − ℓt E2 ≤ Et " − ⟨u, ℓt⟩ + d ⟨wet , ℓt⟩u ⊤H 1 2 t st 2 # ≤ Et " 2 |⟨u, ℓt⟩| + 2d 2 |⟨wet , ℓt⟩| · u ⊤H 1 2 t st 2 # (⟨wet , ℓt⟩ ≤ 1) ≤ Et 2 |⟨u, ℓt⟩| + 2d 2 |⟨wet , ℓt⟩| · ∥u∥ 2 Ht . Plugging these bounds to Eq. (E.12), we have Et [X2 t ] ≤ 3204d 2 νEt [|⟨wet , ℓt⟩|] + 4 |⟨u, ℓt⟩| + 4d 2Et [|⟨wet , ℓt⟩|] ∥u∥ 2 Ht . Summing over t gives X T t=1 Et [X2 t ] ≤ 3204d 2X T t=1 ν + ∥u∥ 2 Ht Et [|⟨wet , ℓt⟩|] + 4X T t=1 |⟨u, ℓt⟩| ≤ 3204d 2 ν + max t∈[T] ∥u∥ 2 Ht X T t=1 Et [|⟨wet , ℓt⟩|] + 4X T t=1 |⟨u, ℓt⟩| = 3204d 2 ν + ρ 2 ˚LT + 4Lu. Therefore, choosing B⋆ = 32d( √ ν + ρ), b = 2 × 106dν2T, C = ⌈log2 b⌉⌈log2 b 2T⌉ = Θ(ln2 (dνT)) and using Theorem 3.1.1, we obtain with probability 1 − δ, X T t=1 Xt = X T t=1 ⟨wet , ℓt⟩ − D wt , ℓbt E + D u, ℓbt − ℓt E ≤ C q 25632d 2 (ν + ρ 2) ˚LT ln(C/δ) + 32Lu ln(C/δ) + 64Cd √ ν + ρ ln(C/δ). Finally, using √ a + b ≤ √ a + √ b, the first term above is bounded by 161C q d 2 (ν + ρ 2) ˚LT ln(C/δ) + C q 32Lu ln(C/δ), which finishes the proof. E.2.3.2 Bounding Reg-Term The goal of this section is to prove the following bound on Reg-Term. Lemma E.2.11. Let S be its final value after running Algorithm 8 for T rounds and S ′ = S \ {1, T + 1}. Then as long as η ≤ 1 80d , we have Reg-Term ≤ Oe ν η − P s∈S′ ∥u∥Hs 5ηad ln(νT) + 40ηd2LT . for a = 100. To prove this lemma, we first prove three useful lemmas. The first one shows that the number of times Algorithm 8 increases the learning rate is upper bounded by O(d log2 (dνT)). Lemma E.2.12. Assume that T ≥ 8. Let n be the number of times Algorithm 8 increases the learning rate. Then n ≤ ad log2 (νT) for a = 100. Consequently, we have ηt ≤ 5η for all t ∈ [T]. Proof. Let S = {t1, . . . , tn+1} be its final value after running Algorithm 8 for T rounds, which means n is the number of times the algorithm has increased the learning rate, t1 = 1, and for i = 2, . . . , n + 1, ηti = ηti−1κ holds. Let Ai = Pi j=1 Htj . Then for any i > 1, according to the update rule, there exists a vector p ∈ R d+1 such that p ⊤Hti p ≥ p ⊤Ai−1p and thus p ⊤Aip ≥ 2p ⊤Ai−1p. Since a selfconcordant function is strictly convex, Ai is positive definite for all i ∈ [n]. 
Therefore, let q = A 1 2 i−1 p and 278 we have q ⊤A − 1 2 i−1AiA − 1 2 i−1 q ≥ 2∥q∥ 2 2 . This implies that the largest eigenvalue of A − 1 2 i−1AiA − 1 2 i−1 is at least 2. Furthermore, the smallest eigenvalue of A − 1 2 i−1AiA − 1 2 i−1 is at least 1 since A − 1 2 i−1AiA − 1 2 i−1 = A − 1 2 i−1 (Ai−1 + Hti ) A − 1 2 i−1 = I + A − 1 2 i−1HtiA − 1 2 i−1 ⪰ I. Therefore, we have 2 ≤ det(A − 1 2 i−1AiA − 1 2 i−1 ) = det(Ai) det(Ai−1) , which implies that det(An+1) ≥ 2 n det(A1). Next we show an upper bound for det(An+1) det(A1) . Consider any (d+ 1)-dimensional unit vector r. For each i ∈ [n + 1], applying Lemma E.2.7 with h = H − 1 2 1 r, u = wti and w = w1, we have , ∥h∥ 2 Hti = r ⊤H − 1 2 1 HtiH − 1 2 1 r ≤ 1 + 2400ν 1 − πw1 (wti ) 2 ∥h∥ 2 H1 ≤ (1 + 2400ν) 2T 2 . Taking a summation over all i ∈ [n + 1], we obtain r ⊤A − 1 2 1 An+1A − 1 2 1 r ≤ (n + 1)(1 + 2400ν) 2T 2 , which means that λmax A − 1 2 1 An+1A − 1 2 1 ≤ (n + 1)(1 + 2400ν) 2T 2 , and thus det(An+1) det(A1) = det A − 1 2 1 An+1A − 1 2 1 ≤ (n + 1)(1 + 2400ν) 2T 2 d+1 . 27 Combining with det(An+1) det(A1) ≥ 2 n , we have n ≤ (d + 1) log2 (n + 1) + 2(d + 1) log2 ((1 + 2400ν)T) ≤ ad log2 (νT), for a = 100. To show that ηt ≤ 5η for t, notice that exp(log2 (νT)/ln(νT)) ≤ 5. Therefore, ηt ≤ κ n η = exp n ad ln(νT) η ≤ 5η, finishing the proof. The second lemma gives a lower bound of the Bregman divergence between u and wt , which contains an important term to cancel Deviation in later analysis. Lemma E.2.13. For all t ∈ [T], DΨ(u, wt) ≥ −800ν ln (800νT) − 800ν + ∥u∥Ht . Proof. Note again that Ψ is a 800ν-normal barrier of Ω by Lemma E.2.5. By the definition of Bregman divergence, we have DΨ(u, wt) = Ψ(u) − Ψ(wt) − ⟨∇Ψ(wt),u − wt⟩ ≥ −800ν ln −u ⊤∇Ψ(wt) 800ν − ⟨∇Ψ(wt),u⟩ − 800ν. (Lemma E.2.6 and Lemma E.2.8) According to Lemma E.2.7 and Lemma E.2.8, we know that u ⊤∇Ψ(wt) ≤ 800ν 1 − πw1 (wt) ∥u∥H1 ≤ 800νT∥u∥H1 ≤ 640000ν 2T. On the other hand, according to Lemma E.2.3, we have −∇Ψ(wt) ⊤u ≥ ∥u∥Ht . 280 Combining everything, we have DΨ(u, wt) ≥ −800ν (ln(800νT) + 1) + ∥u∥Ht , finishing the proof. The third lemma gives a bound for the so-called stability term. Lemma E.2.14. If η ≤ 1 80d , then Algorithm 8 guarantees ∥wt − wt+1∥Ht ≤ 40η∥ℓbt∥H−1 t for all t ∈ [T]. Proof. Let Ft(w) = D w, ℓbt E + 1 ηt DΨ(w, wt). We have Ft(wt) − Ft(wt+1) = (wt − wt+1) ⊤ℓbt − 1 ηt DΨ(wt+1, wt) ≤ (wt − wt+1) ⊤ℓbt ≤ ∥wt − wt+1∥Ht · ∥ℓbt∥H−1 t , (E.13) where the last line uses the nonnegativity of Bregman divergence and also Hölder’s inequality. On the other hand, by Taylor’s theorem, there exists a point ξ on the segment connecting wt and wt+1 such that Ft(wt) − Ft(wt+1) = ∇Ft(wt+1) ⊤(wt − wt+1) + 1 2 (wt − wt+1)∇2Ft(ξ)(wt − wt+1) ≥ 1 2 (wt − wt+1)∇2Ft(ξ)(wt − wt+1) (by first order optimality of wt+1 = argminw∈Ω′ Ft(w)) = 1 2ηt ∥wt − wt+1∥ 2 ∇2Ψ(ξ) . (E.14) Next we will prove ∥wt −wt+1∥∇2Ψ(ξ) ≥ 1 2 ∥wt −wt+1∥Ht . To do so, we first show ∥wt −wt+1∥Ht ≤ 1 2 . It is in turn sufficient to show Ft(w′ ) ≥ Ft(wt), for all w′ such that ∥w′ − wt∥Ht = 1 2 , 281 since wt+1 is the minimizer of the convex function Ft . Indeed, using Taylor’s theorem again and denoting w′ − wt by h, we have a point ξ ′ on the segment between w′ and wt such that Ft w′ = Ft (wt) + ∇Ft (wt) ⊤ h + 1 2 h ⊤∇2Ft(ξ ′ )h = Ft (wt) + ℓb⊤ t h + 1 2ηt ∥h∥ 2 ∇2Ψ(ξ ′) ≥ Ft (wt) + ℓb⊤ t h + 1 2ηt ∥h∥ 2 Ht 1 − wt − ξ ′ Ht 2 (Lemma E.2.1) ≥ Ft (wt) + ℓb⊤ t h + 1 160η (∥h∥Ht = 1 2 , ∥wt − ξ ′∥Ht ≤ 1 2 , and Lemma E.2.12) ≥ Ft (wt) − ∥ℓbt∥H−1 t ∥h∥Ht + 1 160η (Hölder’s inequality) ≥ Ft (wt) − d 2 + 1 160η . 
(∥ℓbt∥H−1 t ≤ d |⟨wet , ℓt⟩| ≤ d) Under the condition η ≤ 1 80d , we have thus shown that Ft(w′ ) ≥ Ft(wt) and consequently ∥wt − wt+1∥Ht ≤ 1 2 and ∥wt − ξ∥Ht ≤ 1 2 . Now according to Lemma E.2.1 again, we have ∥wt − wt+1∥∇2Ψ(ξ) ≥ ∥wt − wt+1∥Ht (1 − ∥wt − ξ∥Ht ) ≥ 1 2 ∥wt − wt+1∥Ht . Plugging it into Eq. (E.14) and combining Eq. (E.13) give ∥ℓbt∥H−1 t ≥ 1 8ηt ∥wt − wt+1∥Ht ≥ 1 40η ∥wt − wt+1∥Ht , where the last inequality uses Lemma E.2.12. Rearranging finishes the proof. Now we are ready to prove the bound for Reg-Term stated in Lemma E.2.11. Proof of Lemma E.2.11. We first verify that u is in Ω ′ . Indeed, according to the definition of u, we have w1 + 1 1 − 1 T · (u − w1) = u ⋆ ∈ Ω, 28 which by the definition of Minkowsky function shows that πw1 (u) ≤ 1−1/T and thus u ∈ Ω ′ . According to the standard analysis of Online Mirror Descent, for example, Lemma 6 of [143], we then have D wt , ℓbt E − D u, ℓbt E ≤ DΨt (u, wt) − DΨt (u, wt+1) + D wt − wt+1, ℓbt E . (E.15) We first focus on the term DΨt (u, wt) − DΨt (u, wt+1). Taking a summation over t = 1, 2, . . . , T, we have X T t=1 DΨt (u, wt) − DΨt (u, wt+1) ≤ DΨ1 (u, w1) + T X−1 t=1 DΨt+1 (u, wt+1) − DΨt (u, wt+1) ≤ DΨ1 (u, w1) +Xn i=2 1 ηti − 1 ηti−1 DΨ(u, wti ), where we recall the definition of t1, . . . , tn defined in the beginning of the proof of Lemma E.2.12. The first term can be bounded by DΨ1 (u, w1) = 1 η DΨ(u, w1) = Ψ(u) − Ψ(w1) η − 1 η · ⟨∇Ψ(w1),u − w1⟩ ≤ Ψ(u) − Ψ(w1) η (by first order optimality of w1) ≤ 800ν ln T η . (Lemma E.2.7) For the second term, using 1 − κ ≤ − 1 ad ln(νT) for a = 100 and Lemma E.2.12, we have 1 ηti − 1 ηti−1 ≤ 1 − κ ηti ≤ − 1 5ηad ln(νT) . Therefore, X T t=1 DΨt (u, wt) − DΨt (u, wt+1) 28 ≤ 800ν ln T η − Xn i=2 1 5ηad ln(νT) · DΨ(u, wti ) ≤ Oe ν η − 1 5ηad ln(νT) · Xn i=2 ∥u∥Hti − 800ν − 800ν ln(800νT) (Lemma E.2.13) = Oe ν η − 1 5ηad ln(νT) Xn i=2 ∥u∥Hti . For the second term in Eq. (E.15), that is, D wt − wt+1, ℓbt E , taking summation over t ∈ [T] we have X T t=1 D wt − wt+1, ℓbt E ≤ X T t=1 ∥wt − wt+1∥Ht∥ℓbt∥H−1 t (Hölder’s inequality) ≤ 40η X T t=1 ∥ℓbt∥ 2 H−1 t (Lemma E.2.14) = 40η X T t=1 d 2 ⟨wet , ℓt⟩ 2 s ⊤ t H 1/2 t H−1 t H 1/2 t st ≤ 40η X T t=1 d 2 |⟨wet , ℓt⟩| = 40ηd2LT . Combining everything finishes the proof. E.2.3.3 Proof of Theorem 3.1.3 To prove Theorem 3.1.3, we first prove the following main lemma. Lemma E.2.15. Algorithm 8 with η ≤ 1 640aCd2 ln(νT) ln(C/δ) guarantees that with probability at least 1 − δ, X T t=1 ⟨wet − u ⋆ , ℓt⟩ ≤ Oe ν η + ηd2LT + q Lu ln(1/δ) + (√ ν + ρ) 161Cdq ln(C/δ)˚LT − 1 10ηad ln(νT) , where a = 100, C = Θ(ln2 (dνT)) is defined in Lemma E.2.9, and we recall all other notations defined in Equations (E.8)-(E.10). 284 Proof. Recall the decomposition of regret shown in Eq. (E.7). Combining the result of Lemma E.2.10 and Lemma E.2.11, we have when η ≤ 1 80d , X T t=1 ⟨wet − u ⋆ , ℓt⟩ ≤ Oe ν η − P s∈S′ ∥u∥Hs 5ηad ln(νT) + 40ηd2LT + 64Cd √ ν + ρ ln(C/δ) + 161Cdq (ν + ρ 2) ˚LT ln(C/δ) + C q 32Lu ln(C/δ). = Oe ν η + ηd2LT + q Lu ln(C/δ) − P s∈S′ ∥u∥Hs 5ηad ln(νT) + 64Cd √ ν + ρ ln(C/δ) + 161C q d 2 (ν + ρ 2) ˚LT ln(C/δ). (E.16) Now consider the value of ρ = ∥u∥Ht⋆ where t ⋆ ∈ argmaxt∈[T] ∥u∥Ht , compared to the negative term above. Suppose t ⋆ ∈ S, then we have ρ ≤ max ( ∥u∥H1 , X s∈S′ ∥u∥Hs ) ≤ 800ν + X s∈S′ ∥u∥Hs , where we use Lemma E.2.8 again to bound ∥u∥H1 . On the other hand, if t ⋆ ∈ S / , then according to the update rule of S in Algorithm 8, we have Ht ⋆ ⪯ H1 + P s∈S′ Hs, which means ρ = q ∥u∥ 2 Ht⋆ ≤ s ∥u∥ 2 H1 + X s∈S′ ∥u∥ 2 Hs ≤ 800ν + X s∈S′ ∥u∥Hs . 
Therefore, we continue to bound the last three terms in Eq. (E.16) as 800ν − ρ 5ηad ln(νT) + 64Cd √ ν + ρ ln(C/δ) + 161Cdq (ν + ρ 2) ˚LT ln(C/δ) ≤ O ν η − √ ν + ρ 5ηad ln(νT) + 64Cd √ ν + ρ ln(C/δ) + 161Cdq (ν + ρ 2) ˚LT ln(C/δ) ≤ O ν η − √ ν + ρ 10ηad ln(νT) + 161Cdq (ν + ρ 2) ˚LT ln(C/δ) (η ≤ 1 640aCd2 ln(νT) ln(C/δ) ) ≤ O ν η + (√ ν + ρ) 161Cdq ln(C/δ)˚LT − 1 10ηad ln(νT) . Plugging this back into Eq. (E.16) finishes the proof. Now we are ready to prove the main theorem. For convenience, we restate the theorem below. Theorem E.2.1. Algorithm 8 with an appropriate choice of η ensures that with probability at least 1 − δ: Reg = Oe(d 2ν q T ln 1 δ + d 2ν ln 1 δ ), against an oblivious adversary; Oe(d 2ν q dT ln 1 δ + d 3ν ln 1 δ ), against an adaptive adversary. Moreover, if ⟨w, ℓt⟩ ≥ 0 for all w ∈ Ω and all t, then T in the bounds above can be replaced by L ⋆ = minu∈Ω PT t=1 ⟨u, ℓt⟩, that is, the total loss of the best action. Proof. Using Lemma E.2.15 and the fact that |⟨wet , ℓt⟩| ≤ 1 and |⟨u, ℓt⟩| ≤ 1 for all t ∈ [T], we have X T t=1 ⟨wet − u ⋆ , ℓt⟩ ≤ Oe ν η + ηd2T + r T ln 1 δ ! + (√ ν + ρ) 161Cdp T ln(C/δ) − 1 10ηad ln(νT) . With η = min ( 1 640aCd2 ln(νT) ln(C/δ) , 1 1610aCd2 ln(νT) p T ln(C/δ) ) , the last term becomes nonpositive, and we arrive at X T t=1 ⟨wet − u ⋆ , ℓt⟩ ≤ Oe d 2 ν r T ln 1 δ + d 2 ν ln 1 δ ! , (E.17) for any fixed u ⋆ ∈ Ω, which completes the proof for the oblivious case. To obtain a bound for an adaptive adversary, we discrete the feasible set Ω and then take a union bound. Specifically, define BΩ as follows: BΩ ≜ ⌈α⌉⌈β⌉, α ≜ max w,w′∈Ω ∥w − w ′ ∥∞, β ≜ max ℓ∈Ω◦ ∥ℓ∥∞, 286 where Ω ◦ ≜ {ℓ : | ⟨w, ℓ⟩ | ≤ 1, ∀w ∈ Ω} is the set of feasible loss vectors. Then we discretize Ω into a finite set Ω of (BΩT) d points, such that for any u ⋆ ∈ Ω, there exists u ∈ Ω, such that ∥u − u ⋆∥∞ ≤ 1 ⌈β⌉T . This means that X T t=1 ⟨u − u ⋆ , ℓt⟩ ≤ X T t=1 d ⌈β⌉T · max i ℓt,i ≤ d. Therefore, it suffices to only consider regret against the points in Ω. Taking a union bound and replacing δ with δ (BΩT) d in Eq. (E.17) finish the proof for the worst-case bound for adaptive adversaries. In the remaining of the proof, we show that if ⟨w, ℓt⟩ ∈ [0, 1] for all w ∈ Ω and t ∈ [T], T can be replaced by L ⋆ in both bounds. As ⟨w, ℓt⟩ is always positive, we have Et [|⟨wet , ℓt⟩|] = Et [⟨wet , ℓt⟩] = ⟨wt , ℓt⟩, Lu = PT t=1 ⟨u, ℓt⟩ ≤ L ⋆ + 2, and LT = LT = PT t=1 ⟨wet , ℓt⟩. Using standard Freedman’s inequality, we have with probability at least 1 − δ, ˚LT − LT ≤ ˚LT 2 + 3 ln(1/δ). Rearranging gives ˚LT ≤ 2LT + 6 ln(1/δ). Using Lemma E.2.15 again, we have X T t=1 ⟨wet − u ⋆ , ℓt⟩ ≤ Oe ν η + ηd2LT + r L⋆ ln 1 δ ! + (√ ν + ρ) 161Cds ln(C/δ) 2LT + 6 ln 1 δ − 1 10ηad ln(νT) ! . 287 With η = min 1 640aCd2 ln(νT) ln(C/δ) , 1 1610aCd2 ln(νT) √ (2LT +6 ln(1/δ)) ln(C/δ) , the last term becomes nonpositive, and we arrive at X T t=1 ⟨wet − u ⋆ , ℓt⟩ ≤ Oe d 2 ν r LT ln 1 δ + r L⋆ ln 1 δ + d 2 ν ln 1 δ ! Solving the quadratic inequality in terms of √ LT gives the following high probability regret bound X T t=1 ⟨wet − u ⋆ , ℓt⟩ ≤ Oe d 2 ν r L⋆ ln 1 δ + d 2 ν ln 1 δ ! . This finishes the proof for the case with oblivious adversaries, and the case with adaptive adversaries is again by taking a union bound as done earlier. Remark 2. The tuning of η in the proof above depends on the unknown quantity LT . In fact, the issue seems even more severe than that pointed out in Remark 1 because LT depends on the algorithm’s behavior, which in turns depends on η itself. 
We point out that, however, this can again be addressed using a doubling trick, making the algorithm completely parameter-free. We omit the details but refer the reader to Algorithm 4 of [100] for very similar ideas.

E.3 Omitted details for Section 3.1.3

E.3.1 Preliminary

In this section, we introduce the concept of occupancy measure (used in previous works already; see [85]), which helps reformulate adversarial MDP problems in a way very similar to adversarial MAB problems.

For a state $x$, let $k(x)$ denote the index of the layer to which state $x$ belongs. Given a policy $\pi$ and a transition function $P$, we define the occupancy measure $w^{P,\pi} \in \mathbb{R}^{X \times A \times X}$ as follows:
$$w^{P,\pi}(x, a, x') = \Pr\big[x_k = x, a_k = a, x_{k+1} = x' \,\big|\, P, \pi\big],$$
where $k = k(x)$. In other words, $w^{P,\pi}(x, a, x')$ is the probability of visiting the triple $(x, a, x')$ if we execute policy $\pi$ in an MDP with transition function $P$.

According to this definition, any occupancy measure $w$ has the following two properties. First, based on the layered structure, each layer is visited exactly once in each episode, which means that for each $k = 0, 1, \ldots, J-1$,
$$\sum_{x \in X_k, a \in A, x' \in X_{k+1}} w(x, a, x') = 1. \qquad \text{(E.18)}$$
Second, the probability of entering a state when coming from the previous layer equals the probability of leaving that state to the next layer. Therefore, for each $k = 1, 2, \ldots, J-1$ and all $x \in X_k$,
$$\sum_{x' \in X_{k-1}, a \in A} w(x', a, x) = \sum_{x' \in X_{k+1}, a \in A} w(x, a, x'). \qquad \text{(E.19)}$$

Moreover, the following lemma shows that if $w$ satisfies the above two properties, then $w$ is an occupancy measure with respect to some transition function $P^w$ and policy $\pi^w$.

Lemma E.3.1 (Lemma 3.1 in [129]). Any $w \in [0, 1]^{|X| \times |A| \times |X|}$ satisfies Eq. (E.18) and Eq. (E.19) if and only if it is a valid occupancy measure associated with the following induced transition function $P^w$ and policy $\pi^w$:
$$P^w(x'|x, a) = \frac{w(x, a, x')}{\sum_{y \in X_{k(x)+1}} w(x, a, y)}, \qquad \pi^w(a|x) = \frac{\sum_{x' \in X_{k(x)+1}} w(x, a, x')}{\sum_{a' \in A}\sum_{x' \in X_{k(x)+1}} w(x, a', x')}.$$

Following [85], we denote by $\Delta$ the set of all valid occupancy measures. For a fixed transition function $P$, we denote by $\Delta(P) \subseteq \Delta$ the set of occupancy measures whose induced transition function $P^w$ is exactly $P$. In addition, we denote by $\Delta(\mathcal{P}) \subseteq \Delta$ the set of occupancy measures whose induced transition function $P^w$ belongs to a set of transition functions $\mathcal{P}$. With a slight abuse of notation, we define $w(x, a) = \sum_{x' \in X_{k(x)+1}} w(x, a, x')$ for all $x \neq x_J$ and $a \in A$.

Using the notations introduced above, the expected loss of using policy $\pi$ at round $t$ is exactly $\langle w^{P,\pi}, \ell_t\rangle \triangleq \sum_{x,a} w^{P,\pi}(x, a)\ell_t(x, a)$. Let $\pi_t$ be the policy chosen at round $t$. Then the total expected loss (with respect to the randomness of the transition function) is $\sum_{t=1}^T \langle w^{P,\pi_t}, \ell_t\rangle$ and the total regret can be written as
$$\text{Reg} = \sum_{t=1}^T \ell_t(\pi_t) - \min_\pi \sum_{t=1}^T \ell_t(\pi) = \sum_{t=1}^T \langle w_t - u^\star, \ell_t\rangle = L_T - L^\star, \qquad \text{(E.20)}$$
where $u^\star = w^{P,\pi^\star}$ is the occupancy measure induced by the optimal policy $\pi^\star = \mathrm{argmin}_\pi \sum_{t=1}^T \ell_t(\pi)$, $w_t = w^{P,\pi_t}$, $L_T \triangleq \sum_{t=1}^T \langle w_t, \ell_t\rangle$, and $L^\star \triangleq \sum_{t=1}^T \langle u^\star, \ell_t\rangle$. When the regret is written in this way, it is clear that the problem is very similar to MAB or linear bandits, with $\Delta(P)$ being the decision set and $\ell_t$ parametrizing the linear loss function at time $t$.

E.3.2 Algorithm for MDPs

In this section, we introduce our algorithm that achieves a high-probability small-loss regret bound for the MDP setting. The full pseudocode of the algorithm is shown in Algorithm 21.
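Before turning to the algorithm itself, the following minimal sketch illustrates the occupancy-measure characterization above: it checks the constraints (E.18)–(E.19) and extracts the induced $P^w$ and $\pi^w$ of Lemma E.3.1. The dictionary encoding of $w$, the `layers`/`actions` arguments, and the function names are assumptions introduced only for this illustration; this is not the implementation of Algorithm 21.

```python
# Minimal sketch (illustration only): validating Eq. (E.18)-(E.19) and
# computing the induced transition P^w and policy pi^w from Lemma E.3.1.
# w maps triples (x, a, x') to probabilities; layers = [X_0, ..., X_J].

def is_valid_occupancy(w, layers, actions, tol=1e-9):
    J = len(layers) - 1
    # Eq. (E.18): each layer k = 0, ..., J-1 carries total mass exactly 1.
    for k in range(J):
        mass = sum(w.get((x, a, x2), 0.0)
                   for x in layers[k] for a in actions for x2 in layers[k + 1])
        if abs(mass - 1.0) > tol:
            return False
    # Eq. (E.19): flow into each state equals flow out, for k = 1, ..., J-1.
    for k in range(1, J):
        for x in layers[k]:
            inflow = sum(w.get((x1, a, x), 0.0)
                         for x1 in layers[k - 1] for a in actions)
            outflow = sum(w.get((x, a, x2), 0.0)
                          for a in actions for x2 in layers[k + 1])
            if abs(inflow - outflow) > tol:
                return False
    return True

def induced_transition_and_policy(w, layers, actions):
    """P^w(x'|x,a) and pi^w(a|x) as in Lemma E.3.1 (zero-mass pairs skipped)."""
    P, pi = {}, {}
    J = len(layers) - 1
    for k in range(J):
        for x in layers[k]:
            state_mass = sum(w.get((x, a, x2), 0.0)
                             for a in actions for x2 in layers[k + 1])
            for a in actions:
                pair_mass = sum(w.get((x, a, x2), 0.0) for x2 in layers[k + 1])
                if state_mass > 0:
                    pi[(x, a)] = pair_mass / state_mass
                for x2 in layers[k + 1]:
                    if pair_mass > 0:
                        P[(x2, x, a)] = w.get((x, a, x2), 0.0) / pair_mass
    return P, pi
```

In particular, the second routine mirrors how the update $\pi_{t+1} = \pi^{\widehat{w}_{t+1}}$ in Algorithm 21 is to be read.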
The algorithm is very similar to UOB-REPS introduced in [85], except for the following two modifications. First, in [85], they propose a loss estimator akin to the importance-weighted estimator using the socalled upper occupancy bound, denoted by ϕt(x, a) in our notation. Indeed, the actual probability wt(x, a) of visiting state-action pair (x, a) is unknown (due to the unknown transition), and thus standard unbiased importance-weighted estimators do not apply directly. Instead, since the algorithm maintains a confidence set Pi (for epoch i) of all the plausible transition functions based on observations, one can calculate the largest probability of visiting state-action pair (x, a) under policy πt , among all the plausible transition functions, which is exactly the definition of ϕt(x, a) and can be computed efficiently via the sub-routine Comp-UOB as shown in [85]. In addition, Jin et al. [85] also apply the idea of implicit exploration from [121] and introduce an extra bias with a parameter γ > 0, leading to the following loss estimator: ℓbt(x, a) = ℓt(x, a) ϕt(x, a) + γ 1t(x, a), which is crucial for them to derive a high-probability bound. As one can see in Eq. (E.21), the first difference of our algorithm is that we remove this implicit exploration (that is, γ = 0), similarly to our MAB algorithm in Section 3.1.1. As we later explain in Appendix E.3.4, removing this implicit exploration is important for obtaining a small-loss bound. Second, while UOB-REPS uses the entropy regularizer with a fixed learning rate, we use the log-barrier regularizer with time-varying and individual learning rates for each state-action pair, defined in Eq. (E.22), which is a direct generalization of Algorithm 7 for MAB. The way we increase the learning rate is also essentially identical to the MAB case; see the last part of Algorithm 21. We also point out that the analogue 291 of the clipped simplex used in Algorithm 7 is now ∆(Pi)∩Ω where ∆(Pi)is the set of occupancy measures with induced transition functions in the confidence set Pi , and Ω (defined at the beginning of Algorithm 21) contains all wb with each entry not smaller than 1/(T 3 |X| 2 |A|), which ensures that the learning rates cannot be increased by too many times. E.3.3 Proof of Theorem 3.1.4 In this section, we analyze Algorithm 21 and prove Theorem 3.1.4. We start with decomposing the regret into five terms (recall the definitions of wt and u ⋆ in Eq. (E.20) and wbt and ℓbt in Algorithm 21): X T t=1 ⟨wt − u ⋆ , ℓt⟩ = X T t=1 ⟨wt − wbt , ℓt⟩ | {z } Error + X T t=1 D wbt , ℓt − ℓbt E | {z } Bias-1 + X T t=1 D wbt − u, ℓbt E | {z } Reg-Term + X T t=1 D u, ℓbt − ℓt E | {z } Bias-2 + X T t=1 ⟨u − u ⋆ , ℓt⟩ | {z } Bias-3 . Here u is defined as u = 1 − 1 T u ⋆ + 1 T|A| X a∈A w P0,πa , (E.24) where πa is the policy that chooses action a at every state, and the definition of the transition function P0 is deferred to Lemma E.3.4. Note that u is random in the case with adaptive adversaries. In the remaining of this subsection, we first provide a few useful lemmas in Appendix E.3.3.1, and then bound Error in Appendix E.3.3.2, Bias-1 in Appendix E.3.3.3, Bias-2 in Appendix E.3.3.4, and Reg-Term in Appendix E.3.3.5. Note that Bias-3 can be trivially bounded by J as Bias-3 = X T t=1 ⟨u − u ⋆ , ℓt⟩ ≤ 1 T|A| X a∈A X T t=1 w πa,P0 , ℓt ≤ J. (E.25) We finally put everything together and prove Theorem 3.1.4 in Appendix E.3.3.6. 292 E.3.3.1 Useful lemmas The first two lemmas are from [85]. Lemma E.3.2 (Lemma 2 in [85]). 
With probability at least 1 − 4δ, we have for all k = 0, 1, . . . , J − 1 and (x, a, x′ ) ∈ Xk × A × Xk+1, P(x ′ |x, a) − P¯ i(x ′ |x, a) ≤ εi(x ′ |x, a) 2 . (E.26) Consequently, we have P ∈ Pi for all i. Lemma E.3.3 (Lemma 10 in [85]). With probability at least 1 − δ, we have for all k = 0, . . . , J − 1, X T t=1 X x∈Xk,a∈A wt(x, a) max{1, Nit (x, a)} = Oe |Xk| · |A| + ln 1 δ X T t=1 X x∈Xk,a∈A wt(x, a) − 1t(x, a) p max{1, Nit (x, a)} ≤ X T t=1 X x∈Xk,a∈A wt(x, a) max{1, Nit (x, a)} + Oe ln 1 δ ≤ Oe |Xk| · |A| + ln 1 δ , where it is the index of the epoch to which episode t belongs. Next, we prove a lemma showing that there exists a transition function P0 that always lies in the confidence set Pi of the algorithm, such that for any action a ∈ A and any two states x, x′ in consecutive layers, the probability of reaching x ′ by taking action a at state x is at least 1 T|X| . Lemma E.3.4. With probability at least 1−4δ, there exists P0 ∈ ∩iPi such that for all k < J, x ∈ Xk, a ∈ A, and x ′ ∈ Xk+1, we have P0(x ′ |x, a) ≥ 1 T|X| . Proof. The construction of P0 is as follows. First we start with P0 = P. Then for each fixed (x, a), we focus on the distribution P0(·|x, a). In particular, for all x ′ ∈ Xk(x)+1 such that P0(x ′ |x, a) < 1 T|X| , we 293 move the weight from the largest entry of P0(·|x, a) to this entry so that P0(x ′ |x, a) = 1 T|X| and P0(·|x, a) remains a valid distribution. Repeat the same for all (x, a) pairs finishes the construction of P0. Clearly, P0 satisfies P0(x ′ |x, a) ≥ 1 T|X| , and it remains to show P0 ∈ Pi for all i. To this end, we first note that P0(x ′ |x, a) − P(x ′ |x, a) ≤ |Xk(x′) | T|X| ≤ 1 T ≤ εi(x ′ |x, a) 2 holds for all k = 0, 1, . . . , J − 1 and (x, a, x′ ) ∈ Xk × A × Xk+1. Combining this with Eq. (E.26) then shows that P0(x ′ |x, a) − P¯ i(x ′ |x, a) ≤ εi(x ′ |x, a), indicating that P0 is indeed in Pi by the definition of Pi . The next lemma shows that the upper occupancy bound for each state-action pair is lower bounded. Lemma E.3.5. We have ϕt(x, a) ≥ 1 T3|X| 2|A| for all x ∈ X and a ∈ A. Proof. This is simply by the definition of ϕt in Eq. (E.23) and the definition of Ω: ϕt(x, a) ≥ wbt(x, a) ≥ 1 T3|X| 2|A| . The last lemma is an improvement of Lemma 4 of [85] and is important for bounding Error and Bias-1 in terms of √ LT , as opposed to √ T (which is the case in [85]). Lemma E.3.6. With probability at least 1−6δ, for any t and any collection of transition functions {P x t }x∈X such that P x t ∈ Pit for all x (where it is the index of the epoch to which episode t belongs), we have X T t=1 X x∈X,a∈A w P x t ,πt (x, a) − wt(x, a) ℓt(x, a) = Oe |X| r J|A|LT ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! . 294 Proof. The proof is technical but mostly follows the same ideas of that for Lemma 4 of [85]. We first assume that the events of Lemma E.3.2 and Lemma E.3.3 hold, which happens with probability at least 1 − 5δ. According to the proof of Lemma 4 of [85] (specifically their Eq. (15)), we have for any pair (x, a), w P x t ,πt (x, a) − wt(x, a) · ℓt(x, a) ≤ k( Xx)−1 m=0 X xm,am,xm+1 ε ⋆ it (xm+1|xm, am)wt(xm, am)w P x t ,πt (x, a|xm+1) · ℓt(x, a), (E.27) where ε ⋆ it (x ′ |x, a) = O s P(x′ |x,a) ln T |X||A| δ max{1,Nit (x,a)} + ln T |X||A| δ max{1,Nit (x,a)} , and for an occupancy measure w, w(x, a|x ′ ) denotes the probability of encountering the pair (x, a) given that x ′ was visited earlier, under policy π w and P w. By their Eq. 
(16), we also have |w P x t ,πt (x, a|xm+1) − wt(x, a|xm+1)| ≤ πt(a|x) k( Xx)−1 h=m+1 X x ′ h ,a′ h ,x′ h+1 ε ⋆ it (x ′ h+1|x ′ h , a′ h )wt(x ′ h , a′ h |xm+1), (E.28) Combining Eq. (E.27) and Eq. (E.28), summing over all t and (x, a), and using the short-hands zm ≜ (xm, am, xm+1) and z ′ h ≜ (x ′ h , a′ h , x′ h+1), we have X T t=1 X x∈X,a∈A |w P x t ,πt (x, a) − wt(x, a)| · ℓt(x, a) ≤ X t,x,a k( Xx)−1 m=0 X zm ε ⋆ it (xm+1|xm, am)wt(xm, am)wt(x, a|xm+1)ℓt(x, a) + X t,x,a k( Xx)−1 m=0 X zm ε ⋆ it (xm+1|x)wt(xm, am) · πt(a|x) k( Xx)−1 h=m+1 X z ′ h ε ⋆ it (x ′ h+1|x ′ h , a′ h )wt(x ′ h , a′ h |xm+1) = X t X k<J X k−1 m=0 X zm ε ⋆ it (xm+1|xm, am)wt(xm, am) X x∈Xk,a∈A wt(x, a|xm+1)ℓt(x, a) 295 + X t X 0≤m<h<k<J X zm,z′ h ε ⋆ it (xm+1|x)wt(xm, am)ε ⋆ it (x ′ h+1|x ′ h , a′ h )wt(x ′ h , a′ h |xm+1) · X x∈Xk,a∈A πt(a|x) ! = X 0≤m<k<J X t,zm ε ⋆ it (xm+1|xm, am)wt(xm, am) X x∈Xk,a∈A wt(x, a|xm+1)ℓt(x, a) + X 0≤m<h<k<J |Xk| X t,zm,z′ h ε ⋆ it (xm+1|x)wt(xm, am)ε ⋆ it (x ′ h+1|x ′ h , a′ h )wt(x ′ h , a′ h |xm+1) ≤ X 0≤m<k<J X t,zm ε ⋆ it (xm+1|xm, am)wt(xm, am) X x∈Xk,a∈A wt(x, a|xm+1)ℓt(x, a) + |X| X 0≤m<h<J X t,zm,z′ h ε ⋆ it (xm+1|x)wt(xm, am)ε ⋆ it (x ′ h+1|x ′ h , a′ h )wt(x ′ h , a′ h |xm+1) ≜ B1 + |X|B2. It remains to bound B1 and B2. First, B2 is exactly the same as in the proof of Lemma 4 of [85]. Below, we outline the proof with the dependence on all parameters explicit (indeed, this is hidden in their proof). First, according to their analysis, B2 is bounded by Oe X 0≤m<h<J X t,zm,z′ h s P(xm+1|xm, am) ln 1 δ max{1, Nit (xm, am)} wt(xm, am) s P(x ′ h+1|x ′ h , a′ h ) ln 1 δ max{1, Nit (x ′ h , a′ h )} wt(x ′ h , a′ h |xm+1) + Oe X 0≤m<h<J X t,zm,z′ h wt(xm, am) ln 1 δ max{1, Nit (xm, am)} + Oe X 0≤m<h<J X t,zm,z′ h wt(x ′ h , a′ h ) ln 1 δ max{1, Nit (x ′ h , a′ h )} . They show that the first term is bounded by Oe(|X| 2 |A| ln2 (1/δ)). For the second term, we have X 0≤m<h<J X t,zm,z′ h wt(xm, am) ln 1 δ max{1, Nit (xm, am)} ≤ J X−1 h=0 |Xh| · |A| · |Xh+1| ln 1 δ ! J X−1 m=0 |Xm+1| · X t,x∈Xm,a∈A wt(xm, am) max{1, Nit (xm, am)} ≤ O |X| 2 |A| ln 1 δ · Oe |X| 2 |A| + |X| ln 1 δ (Lemma E.3.3) 296 ≤ Oe |X| 4 |A| 2 ln 1 δ + |X| 3 |A| ln 1 δ . The third term can be bounded in the exact same way. Therefore, we arrive at |X|B2 ≤ Oe |X| 5 |A| 2 ln(1/δ) + |X| 4 |A| ln2 (1/δ) . (E.29) Next we show that B1 is bounded by Oe(|X| p J|A|LT ln(1/δ) + |X| 3 |A| ln(1/δ)). According to the definition of ε ⋆ it , we have B1 = O X 0≤m<k<J X t,zm wt(xm, am) X x∈Xk,a∈A wt(x, a|xm+1)ℓt(x, a) · vuut P(xm+1|xm, am) ln T|X||A| δ max{1, Nit (xm, am)} + O X 0≤m<k<J X t,zm wt(xm, am) ln T|X||A| δ max{1, Nit (xm, am)} . (E.30) According to Lemma E.3.3, the second term is bounded as O X 0≤m<k<J X t,zm wt(xm, am) ln T|X||A| δ max{1, Nit (xm, am)} ≤ Oe J|X| 2 |A| + J|X| ln 1 δ . (E.31) In the following, we define ℓt(k|x, a) ≜ P xk∈Xk,ak∈A ℓt(xk, ak)wt(xk, ak|x, a) where wt(x ′ , a′ |x, a) is the probability of encountering pair (x ′ , a′ ) given that pair (x, a) was encountered earlier, under policy πt and transition P. For the first term of Eq. 
(E.30), we then have X 0≤m<k<J X t,zm wt(xm, am) X x∈Xk,a∈A wt(x, a|xm+1)ℓt(x, a) · vuut P(xm+1|xm, am) ln T|X||A| δ max{1, Nit (xm, am)} ≤ X 0≤m<k<J X t,zm wt(xm, am) vuuut X x∈Xk,a∈A wt(x, a|xm+1)ℓt(x, a) · P(xm+1|xm, am) ln T|X||A| δ max{1, Nit (xm, am)} 29 ≤ X 0≤m<k<J X t,xm,am wt(xm, am) vuut|Xm+1| · ℓt(k|xm, am) ln T|X||A| δ max{1, Nit (xm, am)} (Cauchy-Schwarz inequality) ≤ X 0≤m<k<J s |Xm+1| ln T|X||A| δ · X t,xm,am 1t(xm, am) s ℓt(k|xm, am) max{1, Nit (xm, am)} + wt(xm, am) − 1t(xm, am) p max{1, Nit (xm, am)} ! . (E.32) According to Lemma E.3.3 again, we have for all m = 0, 1, . . . , J − 1, X T t=1 X xm,am wt(xm, am) − 1t(xm, am) p max{1, Nit (xm, am)} ≤ Oe (|Xm||A| + ln(1/δ)). For the term P t,xm,am 1t(xm, am) q ℓt(k|xm,am) max{1,Nit (xm,am)} , using Cauchy-Schwarz inequality, we have X t,xm,am 1t(xm, am) s ℓt(k|xm, am) max{1, Nit (xm, am)} ≤ X xm,am vuutX T t=1 1t(xm, am) max{1, Nit (xm, am)} · vuutX T t=1 1t(xm, am)ℓt(k|xm, am) (Cauchy-Schwarz inequality) ≤ O vuuut|Xm||A| X T t=1 X x∈Xm,a∈A 1t(xm, am)ℓt(k|x, a) · ln T , (E.33) where the last step uses Cauchy-Schwarz inequality again and the fact PT t=1 1t(xm,am) max{1,Nit (xm,am)} ≤ O(ln T). Combining Eq. (E.32) and Eq. (E.33), we have X 0≤m<k<J X t,zm wt(xm, am) X x∈Xk,a∈A wt(x, a|xm+1)ℓt(x, a) · vuut P(xm+1|xm, am) ln T|X||A| δ max{1, Nit (xm, am)} ≤ Oe X 0≤m<k<J r |Xm||A||Xm+1| ln 1 δ vuutX T t=1 X x∈Xm,a∈A 1t(xm, am)ℓt(k|x, a) 298 ≤ Oe J X−1 m=0 vuutJ|Xm||A||Xm+1| X T t=1 X k>m X x∈Xm,a∈A 1t(xm, am)ℓt(k|x, a) ln 1 δ . (Cauchy-Schwarz inequality) Further note that Et X k>m X x∈Xm,a∈A 1t(xm, am)ℓt(k|x, a) = X k>m X x∈Xm,a∈A wt(x, a)ℓt(k|x, a) = X x∈Xm,a∈A wt(x, a) X k>m X x′∈Xk,a′∈A wt(x ′ , a′ |x, a)ℓt(x ′ , a′ ) = X k>m X x′∈Xk,a′∈A wt(x ′ , a′ )ℓt(x ′ , a′ ) ≤ ⟨wt , ℓt⟩ and Et X k>m X x∈Xm,a∈A 1t(xm, am)ℓt(k|x, a) 2 ≤ J ⟨wt , ℓt⟩. Using Freedman inequality J times with parameter δ/J for m = 0, 1, . . . , J −1 and taking a union bound, we have with probability 1 − δ, for all m = 0, 1, . . . , J − 1, X T t=1 X k>m X x∈Xm,a∈A 1t(xm, am)ℓt(k|x, a) − X T t=1 ⟨wt , ℓt⟩ ≤ Oe vuutJ X T t=1 ⟨wt , ℓt⟩ln 1 δ + J ln 1 δ = Oe r JLT ln 1 δ + J ln 1 δ ! . 299 Therefore, using AM-GM inequality, we have X T t=1 X k>m X x∈Xm,a∈A 1t(xm, am)ℓt(k|x, a) ≤ Oe LT + J ln 1 δ . Combining the results above and Eq. (E.31), we know that with probability at least 1 − δ, B1 ≤ Oe |X| r J|A|LT ln 1 δ + J|X| p |A| ln 1 δ ! + Oe J|X| 2 |A| + J|X| ln 1 δ ≤ Oe |X| r J|A|LT ln 1 δ + |X| 3 |A| ln 1 δ ! . (E.34) Finally, combining Eq. (E.29) and Eq. (E.34) and considering the failure probability of the events of Lemma E.3.2 and Lemma E.3.3, we have with probability 1 − 6δ, B1 + |X|B2 ≤ Oe |X| r J|A|LT ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! , finishing the proof. E.3.3.2 Bounding Error Lemma E.3.7. With probability at least 1 − 6δ, we have Error = X T t=1 ⟨wt − wbt , ℓt⟩ = Oe |X| r J|A|LT ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! , Proof. Note that according to the definition of wbt , the transition function P wbt induced by wbt is in Pit . Therefore, applying Lemma E.3.6, we know that with probability at least 1 − 6δ, Error = X T t=1 ⟨wbt − wt , ℓt⟩ 300 ≤ X T t=1 X x∈X,a∈A |wbt(x, a) − wt(x, a)| ℓt(x, a) ≤ Oe |X| r J|A|LT ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! , completing the proof. E.3.3.3 Bounding Bias-1 Lemma E.3.8. With probability at least 1 − 7δ, we have Bias-1 = X T t=1 D wbt , ℓt − ℓbt E ≤ Oe |X| r J|A|LT ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! . Proof. 
First we write X T t=1 D wbt , ℓt − ℓbt E = X T t=1 D wbt , Et h ℓbt i − ℓbt E + X T t=1 D wbt , ℓt − Et h ℓbt iE . Since wbt(x, a) ≤ ϕt(x, a) by the definition of ϕt , we have D wbt , ℓbt E ≤ X J k=1 X x∈Xk,a∈A wbt(x, a) ϕt(x, a) · 1t(x, a) ≤ J, Et D wbt , ℓbt E2 ≤ Et h J · D wbt , ℓbt Ei = J X x,a wbt(x, a) · ℓt(x, a) ϕt(x, a) · wt(x, a) ≤ J · ⟨wt , ℓt⟩, and thus according to Freedman inequality, we have with probability at least 1 − δ, X T t=1 D wbt , Et h ℓbt i − ℓbt E ≤ O vuutJ X T t=1 ⟨wt , ℓt⟩ln 1 δ + J · ln 1 δ = O r JLT ln 1 δ + |X| ln 1 δ ! . (E.35) 301 For the second term, we have X T t=1 D wbt , ℓt − Et h ℓbt iE = X t,x,a wbt(x, a)ℓt(x, a) · 1 − wt(x, a) ϕt(x, a) ≤ X t,x,a |ϕt(x, a) − wt(x, a)| · ℓt(x, a). By the definition of ϕt , one has ϕt = w P x t ,πt for P x t = argmaxPb∈Pit P a w P ,π b t (x, a). Therefore, according to Lemma E.3.7, we have with probability at least 1 − 6δ, X T t=1 D wbt , ℓt − Et h ℓbt iE ≤ Oe |X| r J|A|LT ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! . (E.36) Combining Eq. (E.35) and Eq. (E.36) proves the lemma. E.3.3.4 Bounding Bias-2 Lemma E.3.9. With probability at least 1 − 5δ, we have Bias-2 = X T t=1 D u, ℓbt − ℓt E ≤ C X x∈X,a∈A u(x, a) vuut8ρT (x, a) X T t=1 ℓt(x, a) ln C|X||A| δ + 2C ⟨u, ρT ⟩ln C|X||A| δ , for some constant C = Oe(1). Proof. First we write X T t=1 D u, ℓbt − ℓt E = X T t=1 D u, Et h ℓbt i − ℓt E + X T t=1 D u, ℓbt − Et h ℓbt iE . 302 The first term is nonpositive under the event of Lemma E.3.2 as for any (x, a) ∈ X × A, wt(x, a) ≤ ϕt(x, a) by the definition of ϕt and thus Et h ℓbt(x, a) i − ℓt(x, a) = wt(x, a) · ℓt(x, a) ϕt(x, a) − ℓt(x, a) ≤ 0. (E.37) For the second term, note that for each (x, a) ∈ X × A, we have ℓbt(x, a) = ℓt(x, a) ϕt(x, a) · 1t(x, a) ≤ T 3 |X| 2 |A|, (Lemma E.3.5) ℓbt(x, a) = ℓt(x, a) ϕt(x, a) · 1t(x, a) ≤ ρt(x, a), and X T t=1 Et h ℓbt(x, a) 2 i ≤ X T t=1 Et ℓt(x, a) ϕt(x, a) 2 · 1t(x, a) ≤ ρT (x, a) X T t=1 ℓt(x, a). Therefore, using Theorem 3.1.1 with Xt = ℓbt(x, a) − Et h ℓbt(x, a) i , Bt = ρt(x, a), B⋆ = ρT (x, a), b = T 3 |X| 2 |A|, C = ⌈log2 b⌉⌈log2 b 2T⌉ = Oe(1), we have with probability at least 1 − δ |X||A| , X T t=1 ℓbt(x, a) − Et h ℓbt(x, a) i ≤ C vuut8ρT (x, a) X T t=1 ℓt(x, a) ln C|X||A| δ + 2ρT (x, a) ln C|X||A| δ . Taking a union bound over all (x, a) ∈ X × A, multiplying both sides by u(x, a), and summing up all these inequalities, we have with probability at least 1 − δ, X T t=1 D u, ℓbt − Et h ℓbt iE ≤ C X x∈X,a∈A u(x, a) vuut8ρT (x, a) X T t=1 ℓt(x, a) ln C|X||A| δ + 2ρT (x, a) ln C|X||A| δ . (E.38) 303 Combining Eq. (E.37) and Eq. (E.38) finishes the proof. E.3.3.5 Bounding Reg-Term Lemma E.3.10. With probability at least 1 − 4δ, we have Reg-Term = X T t=1 D wbt − u, ℓbt E ≤ Oe |X| 2 |A| η + 5ηLT − ⟨u, ρT ⟩ 70η ln T , where LT = PT t=1 P x∈X,a∈A 1t(x, a)ℓt(x, a). Proof. We condition on the event of Lemma E.3.2. First, we prove that u ∈ ∆(Pi) ∩ Ω for all i (recall its definition in Eq. (E.24)). Indeed, for any fixed (x, a, x′ ) ∈ Xk × A × Xk+1, k = 0, 1, . . . , J − 1, we have (with w P0,πa(x) being the probability of visiting x under P0 and πa) u(x, a, x′ ) ≥ 1 T|A| w P0,πa (x, a, x′ ) = 1 T|A| w P0,πa (x)P0(x ′ |x, a) ≥ 1 T|A| X x′′∈Xk(x)−1 w P0,πa (x ′′) · P0(x|x ′′, a) · 1 T|X| (Lemma E.3.4) ≥ 1 T3|X| 2|A| X x′′∈Xk(x)−1 w P0,πa (x ′′) , (Lemma E.3.4 again) = 1 T3|X| 2|A| , which shows u ∈ Ω. 
On the other hand, since P ∈ Pi under Lemma E.3.2 and P0 ∈ Pi as well by Lemma E.3.4, we have u ⋆ ∈ ∆(Pi) and w P0,πa ∈ ∆(Pi), which indicates that, as a convex combination of u ⋆ and w P0,πa for all a, u has to be in ∆(Pi) as well. 304 Therefore, by standard OMD analysis (e.g.,Lemma 12 of [7]), we have D wbt − u, ℓbt E ≤ Dψt (u, wbt) − Dψt (u, wbt+1) + J X−1 k=0 X (x,a,x′)∈Xk×A×Xk+1 ηt(x, a)wb 2 t (x, a, x′ )ℓb2 t (x, a) ≤ Dψt (u, wbt) − Dψt (u, wbt+1) + J X−1 k=0 X (x,a)∈Xk×A ηt(x, a)wb 2 t (x, a)ℓb2 t (x, a) ( P x′∈Xk+1 wbt(x, a, x′ ) 2 ≤ wbt(x, a) 2 ) ≤ Dψt (u, wbt) − Dψt (u, wbt+1) + X x∈X,a∈A ηt(x, a)1t(x, a)ℓt(x, a). (wbt(x, a) ≤ ϕt(x, a)) Summing over t gives X T t=1 D wbt − u, ℓbt E ≤ Dψ1 (u, wb1) + T X−1 t=1 Dψt+1 (u, wbt+1) − Dψt (u, wbt+1) + X T t=1 X x∈X,a∈A ηt(x, a)1t(x, a)ℓt(x, a). (E.39) Next, for a fixed (x, a) pair, let n(x, a) be the total number of times the learning rate for (x, a) has increased, such that ηT (x, a) = ηκn(x,a) , and let t1, . . . , tn(x,a) be the rounds where ηt(x, a) is increased, such that ηti+1(x, a) = ηti (x, a)κ. Then since 1 ϕtn(x,a)+1(x,a) > ρtn(x,a) (x, a) > 2ρtn(x,a)−1 (x, a) > · · · > 2 n(x,a)−1ρ1(x, a) > 2 n(x,a) |A| and 1 ϕtn(x,a)+1(x,a) ≤ T 3 |X| 2 |A| (Lemma E.3.5), we have n ≤ log2 T 3 |X| 2 ≤ 7 log2 T. Therefore, we have ηt(x, a) ≤ ηe 7 log2 T 7 ln T ≤ 5η for any t, x ∈ X, and a ∈ A, and the last term in Eq. (E.39) is thus bounded by 5ηLT . For the second term, with h(y) = y − 1 − ln y , we have T X−1 t=1 Dψt+1 (u, wbt+1) − Dψt (u, wbt+1) ≤ T X−1 t=1 J X−1 k=0 X x∈Xk,a∈A,x′∈Xk+1 1 ηt+1(x, a) − 1 ηt(x, a) h u(x, a, x′ ) wbt+1(x, a, x′) 30 ≤ J X−1 k=0 X x∈Xk,a∈A,x′∈Xk+1 1 − κ η · κ n(x,a) · h u(x, a, x′ ) wbtn(x,a)+1(x, a, x′) ! ≤ − 1 35η ln T J X−1 k=0 X x∈Xk,a∈A,x′∈Xk+1 h u(x, a, x′ ) wbtn(x,a)+1(x, a, x′) ! (1 − κ ≤ − 1 7 ln T and κ n(x,a) ≤ e 7 log2 T 7 ln T ≤ 5) = − 1 35η ln T J X−1 k=0 X x∈Xk,a∈A,x′∈Xk+1 u(x, a, x′ ) wbtn(x,a)+1(x, a, x′) − 1 − ln u(x, a, x′ ) wbtn(x,a)+1(x, a, x′) ! ≤ |X| 2 |A|(1 + 6 ln T) 35η ln T − 1 35η ln T J X−1 k=0 X x∈Xk,a∈A,x′∈Xk+1 u(x, a, x′ ) wbtn(x,a)+1(x, a, x′) (ln u(x,a,x′ ) wbtn(x,a)+1(x,a,x′) ≤ 6 ln T) ≤ |X| 2 |A| 5η − 1 35η ln T J X−1 k=0 X x∈Xk,a∈A,x′∈Xk+1 u(x, a, x′ ) ϕtn(x,a)+1(x, a) = |X| 2 |A| 5η − 1 35η ln T J X−1 k=0 X x∈Xk,a∈A u(x, a) ϕtn(x,a)+1(x, a) = |X| 2 |A| 5η − ⟨u, ρT ⟩ 70η ln T . (ρT (x, a) = 2 ϕtn(x,a)+1(x,a) ) Finally, we bound the first term in Eq. (E.39): Dψ1 (u, wb1) = 1 η J X−1 k=0 X (x,a,x′)∈Xk×A×Xk+1 h u(x, a, x′ ) wb1(x, a, x′) = 1 η J X−1 k=0 X (x,a,x′)∈Xk×A×Xk+1 h |Xk| · |A| · |Xk+1| · u(x, a, x′ ) = 1 η J X−1 k=0 X (x,a,x′)∈Xk×A×Xk+1 ln 1 |Xk| · |A| · |Xk+1| · u(x, a, x′) ≤ Oe |X| 2 |A| η . Combining all the bounds finishes the proof. 30 E.3.3.6 Putting everything together Now we are ready to prove Theorem 3.1.4. For completeness, we restate the theorem below. Theorem E.3.1. Algorithm 21 with a suitable choice of η ensures that with probability at least 1 − δ, Reg = Oe |X| q J|A|L⋆ ln 1 δ + |X| 5 |A| 2 ln2 1 δ . Proof. First, note that Et J X−1 k=0 X x∈Xk,a∈A 1t(x, a) · ℓt(x, a) = ⟨wt , ℓt⟩ ≤ J, Et J X−1 k=0 X x∈Xk,a∈A 1t(x, a) · ℓt(x, a) 2 ≤ J · ⟨wt , ℓt⟩. Therefore, using Freedman’s inequality, we have with probability at least 1 − δ LT − LT ≤ 2 r JLT ln 1 δ + J ln 1 δ , where LT is defined in Lemma E.3.10. Furthermore, using AM-GM inequality, we have with probability at least 1 − δ, LT ≤ 2LT + 2J ln 1 δ . 
(E.40) Choosing η ≤ 1 280C ln(C|X||A|/δ) ln T , combining Lemma E.3.7, Lemma E.3.8, Lemma E.3.9 and Lemma E.3.10 and letting Lu ≜ PT t=1 ⟨u, ℓt⟩, we have with probability at least 1 − 22δ: LT − L ⋆ ≤ Oe |X| r J|A|LT ln 1 δ + |X| 2 |A| η ! + 5ηLT + 2C ⟨u, ρT ⟩ln C|X||A| δ − ⟨u, ρT ⟩ 140η ln T | {z } term1 307 + X x∈X,a∈A u(x, a) C vuut8ρT (x, a) X T t=1 ℓt(x, a) ln C|X||A| δ − ρT (x, a) 140η ln T | {z } term2 + Oe |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ≤ Oe |X| r J|A|LT ln 1 δ + |X| 2 |A| η + ηLu ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! + 10ηLT (term1 is nonpositive, AM-GM inequality for term2, and Eq. (E.40)) ≤ Oe |X| r J|A|LT ln 1 δ + |X| 2 |A| η + ηL⋆ ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! + 10ηLT . (Eq. (E.25)) As η ≤ 1 280C ln(C|X||A|/δ) ln T ≤ 1 20 , rearranging the terms gives LT − L ⋆ ≤ Oe |X| r J|A|LT ln 1 δ + |X| 2 |A| η + ηL⋆ ln 1 δ + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ ! . Finally, choosing η = min r |X| 2|A| L⋆ ln 1 δ , 1 280C ln(C|X||A|/δ) ln T , δ = δ ′/22, and solving the quadratic inequality, we have with probability at least 1 − δ ′ , LT − L ⋆ ≤ Oe |X| r J|A|L⋆ ln 1 δ ′ + |X| 5 |A| 2 ln 1 δ ′ + |X| 4 |A| ln2 1 δ ′ ! , finishing the proof. Remark 3. Similarly to the MAB case, the proof above requires tuning the initial learning rate η in terms of the unknown quantity L ⋆ , and again, using standard doubling trick can remove this restriction, as pointed out in Remark 1. 308 E.3.4 Issues of other potential approaches In this section, we discuss why the idea of clipping [9] or implicit exploration [121] may not be directly applicable to achieve near-optimal high-probability small-loss bounds. Implicit exploration. First, we consider using implicit exploration to achieve high-probability smallloss bounds for MDPs. As mentioned in Appendix E.3.2, this means using the following loss estimator: ℓbt = ℓt(x,a) ϕt(x,a)+γ · 1t{x, a} for all x ∈ X and a ∈ A and some parameter γ > 0, and without using our increasing learning schedule. The concentration results of Lemma 12 of [85] show that the deviation contains a term of order 1/γ, meaning that γ cannot be too small. Repeating the same analysis, one can see that the main difficulty of obtaining high-probability smallloss bounds in this case is to bound Bias-2 by the loss of the algorithm LT = PT t=1 ⟨wt , ℓt⟩ or L ⋆ , instead of the number of episodes T. Indeed, consider the term PT t=1 D wbt , ℓt − Et h ℓbt iE: X T t=1 D wbt , ℓt − Et h ℓbt iE = X T t=1 X x∈X,a∈A wbt(x, a)ℓt(x, a) · 1 − wt(x, a) ϕt(x, a) + γ ≤ X T t=1 X x∈X,a∈A |ϕt(x, a) − wt(x, a)|ℓt(x, a) + γ γ + ϕt(x, a) · wbt(x, a)ℓt(x, a). The first term can still be bounded by O |X| p J|A|LT + |X| 5 |A| 2 ln 1 δ + |X| 4 |A| ln2 1 δ according to Lemma E.3.8. For the second term, while it is at most γ PT t=1 P x,a ℓt(x, a) ≤ γ|X||A|T, it is not clear at all how to bound it in terms of LT or L ⋆ . For MAB (where there is only one state x0), it is possible to show that PT t=1 ℓt(x0, a) ≤ PT t=1 ℓt(x0, a⋆ )+Oe( 1 η + 1 γ ) for all a ̸= a ⋆ where a ⋆ is the best action, making it possible to connect the second term with L ⋆ . However, we do not see a way of doing similar analysis for general MDPs. Clipping. On the other hand, the idea of clipping for MAB is to clip all small probabilities so that actions with probability smaller than γ are never selected. Even from an algorithmic perspective, it is not clear 309 how to generalize this idea to MDPs, because it is possible that for a state x, wbt(x, a) is smaller than γ for all a. 
In this case, the clipping idea suggests not to “pick” (x, a) at all for any a, but there is no way to ensure that if the transition function is such that x can always be visited with some positive probability regardless of the policy we execute. Moreover, even if there is a way to fix this, the analysis of clipping for MAB is also similar to the idea of implicit exploration in terms of obtaining small-loss bounds of order Oe( √ L⋆), and as we argued already, even for implicit exploration there are difficulties in generalizing the analysis to MDPs. 310 Algorithm 21 Upper Occupancy Bound Log Barrier Policy Search Input: state space X, action space A, learning rate η, and confidence parameter δ. Define: κ = e 1 7 ln T , Comp-UOB is Algorithm 3 of [85], and Ω = wb : wb(x, a, x′ ) ≥ 1 T3|X| 2|A| , ∀k ∈ {0, 1, . . . , J − 1}, x ∈ Xk, a ∈ A, x′ ∈ Xk+1 . Initialization: Set epoch index i = 1 and confidence set P1 as the set of all transition functions. For all k = 0, . . . , J − 1,(x, a, x′ ) ∈ Xk × A × Xk+1, set wb1(x, a, x′ ) = 1 |Xk||A||Xk+1| , π1 = π wb1 , η1(x, a) = η, ρ1(x, a) = 2|Xk||A|, ϕ1(x, a) = Comp-UOB(π1, x, a,P1), N0(x, a) = N1(x, a) = G0(x ′ |x, a) = G1(x ′ |x, a) = 0. 1 for t = 1, 2, . . . , T do 2 Execute policy πt for J steps and obtain trajectory xk, ak, ℓt(xk, ak) for k = 0, . . . , J − 1. 3 Construct loss estimators for all (x, a) ∈ X × A: ℓbt(x, a) = ℓt(x, a) ϕt(x, a) 1t(x, a), where 1t(x, a) = 1{xk(x) = x, ak(x) = a}. (E.21) 4 Update counters: for each k = 0, 1, . . . , J − 1, Ni(xk, ak) ← Ni(xk, ak) + 1, Gi(xk+1|xk, ak) ← Gi(xk+1|xk, ak) + 1. 5 if ∃k, Ni(xk, ak) ≥ max{1, 2Ni−1(xk, ak)} then 6 Increase epoch index i ← i + 1. 7 Initialize new counters: Ni = Ni−1, Gi = Gi−1 (copy all entries). 8 Compute confidence set Pi = Pb : Pb(x ′ |x, a) − P¯ i(x ′ |x, a) ≤ εi(x ′ |x, a), ∀(x, a, x′ ) ∈ Xk × A × Xk+1, k = 0, 1, . . . , J − 1 , where P¯ i(x ′ |x, a) = Gi(x ′ |x,a) max{1,Ni(x,a)} and εi(x ′ |x, a) ≜ 4 vuut P¯ i(x ′ |x, a) ln T|X||A| δ max{1, Ni(x, a) − 1} + 28 ln T|X||A| δ 3 max{1, Ni(x, a) − 1} . 9 Compute wbt+1 = argminw∈∆(Pi)∩Ω n ⟨w, ℓbt⟩ + Dψt (w, wbt) o , where ψt(w) = J X−1 k=0 X (x,a,x′)∈Xk×A×Xk+1 1 ηt(x, a) ln 1 w(x, a, x′) . (E.22) 10 Update policy πt+1 = π wbt+1 . 11 for each (x, a) ∈ X × A do 12 Update upper occupancy bound: ϕt+1(x, a) = max Pb∈Pi w P ,π b t+1 (x, a) = Comp-UOB(πt+1, x, a,Pi). (E.23) 13 if 1 ϕt+1(x,a) ≥ ρt(x, a) then ρt+1(x, a) = 2 ϕt+1(x,a) , ηt+1(x, a) = ηt(x, a) · κ. ; 14 else ρt+1(x, a) = ρt(x, a), ηt+1(x, a) = ηt(x, a). ; 311 Appendix F Omitted Details in Section 3.2 F.1 Potential Approaches for Switching Regret of Linear Bandits As mentioned in Section 3.2, to the best of our knowledge, we are not aware of any paper with switching regret for adversarial linear bandits. In this section, we present two potential approaches to achieve switching regret for adversarial linear bandits with ℓp-ball feasible domain, however, the regret bounds are suboptimal. Method 1. Periodical Restart. The first generic method for tackling the switching regret of linear bandits is by running a classic linear bandits algorithm with a periodical restart. Specifically, suppose we employ an algorithm A as the base algorithm and restart it for every ∆ > 0 rounds. 
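For concreteness, a minimal sketch of this restart wrapper is given below before continuing with its analysis; `BaseAlgo` and its `act`/`update` interface, as well as `get_loss`, are hypothetical names used only for illustration (any base algorithm $\mathcal{A}$ with $\widetilde{O}(\sqrt{\Delta})$ regret over a window of length $\Delta$, such as SCRiBLe, could be plugged in).

```python
# Minimal sketch of Method 1 (periodic restart). `BaseAlgo`, `act`, `update`,
# and `get_loss` are hypothetical names for a generic adversarial linear
# bandit algorithm A and its bandit feedback; they are not from the original text.
def run_with_restarts(BaseAlgo, get_loss, T, S, d):
    Delta = max(1, round((T / max(S, 1)) ** (2 / 3)))  # window length
    total_loss = 0.0
    algo = None
    for t in range(T):
        if t % Delta == 0:          # restart the base algorithm every Delta rounds
            algo = BaseAlgo(d, horizon=Delta)
        x_t = algo.act()            # point played in round t
        loss_t = get_loss(t, x_t)   # bandit feedback <x_t, ell_t>
        algo.update(loss_t)         # base algorithm only sees its own window
        total_loss += loss_t
    return total_loss
```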
Then, the switching regret of the overall algorithm satisfies
$$\mathbb{E}[\text{Reg}_S] \leq S \cdot \Delta + \Big(\frac{T}{\Delta} - S\Big) \cdot \text{Reg}(\mathcal{A}; \Delta) \leq \widetilde{O}\Big(S\Delta + \frac{T}{\sqrt{\Delta}}\Big) = \widetilde{O}\big(S^{\frac{1}{3}} T^{\frac{2}{3}}\big), \qquad \text{(F.1)}$$
where the first inequality holds because at most $S$ of the periods contain a shift of comparators and we bound the regret in those periods trivially by $S\Delta$, while in the remaining periods the regret is controlled by the base algorithm $\mathcal{A}$. The second inequality holds by choosing a base algorithm $\mathcal{A}$ whose regret over a window of length $\Delta$ is of order $\widetilde{O}(\sqrt{\Delta})$, which is satisfied by, for example, SCRiBLe [2]. The last equality follows by setting the period length optimally as $\Delta = \lceil (T/S)^{\frac{2}{3}} \rceil$. To summarize, the restarting algorithm applies to general adversarial linear bandits and attains a suboptimal switching regret of order $\widetilde{O}(S^{\frac{1}{3}} T^{\frac{2}{3}})$, given the knowledge of $S$.

Method 2. Exp2 with Fixed-share Update. The second method is to use the Exp2 algorithm [50] with a uniform mixing update [76, 19], which gives an $\widetilde{O}(d\sqrt{ST})$ switching regret for adversarial linear bandits with a general convex and compact domain. Note that this method is based on continuous exponential weights and thus requires log-concave sampling [106], which is theoretically efficient but usually time-consuming in practice. More importantly, the dimension dependence is linear and hence not optimal when the feasible domain is an $\ell_p$ ball, $p \in (1, 2]$.

Beyond the above two methods, one may wonder whether we can simply use FTRL/OMD with some barrier regularizer (such as SCRiBLe [50]) along with either a uniform mixing update [76, 19] or a clipped domain [77] to achieve switching regret for linear bandits. However, this attempt fails because the regularization term in the regret bound becomes too large to control, due to the property of the barrier regularizer. Indeed, this method cannot even achieve switching regret guarantees for MAB for the same reason.

F.2 Omitted Details for Section 3.2.3

In this section, we provide the omitted details for Section 3.2.3, including the pseudocode of the base algorithm (in Appendix F.2.1) and the proof of Theorem 3.2.1 (in Appendix F.2.2 – F.2.7). To prove Theorem 3.2.1, we first prove the unbiasedness of the loss estimators in Appendix F.2.2, then decompose the regret in Appendix F.2.3, and subsequently upper bound each term in Appendix F.2.4, Appendix F.2.5, and Appendix F.2.6. We finally put everything together and present the proof in Appendix F.2.7.

F.2.1 Pseudocode of Base Algorithm

Algorithm 22 shows the pseudocode of the base algorithm for linear bandits with the $\ell_p$ unit-ball feasible domain, which is the same as the one proposed in [33].

Algorithm 22 Base algorithm for linear bandits on $\ell_p$ ball
Input: learning rate $\eta$, clipping parameter $\gamma$, initial round $t_0$.
Define: clipped feasible domain $\Omega' = \{x \mid \|x\|_p \leq 1 - \gamma\}$.
Initialize: $a^{(t_0)}_{t_0} = \mathrm{argmin}_{x \in \Omega'} R(x)$ and $\xi^{(t_0)}_{t_0} = 0$. Draw $\widetilde{a}^{(t_0)}_{t_0}$ uniformly at random from $\{\pm e_n\}_{n=1}^d$.
for $t = t_0$ to $T$ do
    Send $(\widetilde{a}^{(t_0)}_t, a^{(t_0)}_t, \xi^{(t_0)}_t)$ to the meta algorithm.
    Receive a loss estimator $\widehat{\ell}_t$.
    Update the strategy based on OMD with regularizer $R(x) = -\log(1 - \|x\|_p^p)$:
    $$a^{(t_0)}_{t+1} = \mathrm{argmin}_{a \in \Omega'} \Big\{\big\langle a, \widehat{\ell}_t\big\rangle + \frac{1}{\eta} D_R\big(a, a^{(t_0)}_t\big)\Big\}. \qquad \text{(F.2)}$$
    Generate a random variable $\xi^{(t_0)}_{t+1} \sim \mathrm{Ber}\big(\|a^{(t_0)}_{t+1}\|_p\big)$ and set
    $$\widetilde{a}^{(t_0)}_{t+1} = \begin{cases} a^{(t_0)}_{t+1}\big/\|a^{(t_0)}_{t+1}\|_p & \text{if } \xi^{(t_0)}_{t+1} = 1, \\ \delta e_n & \text{if } \xi^{(t_0)}_{t+1} = 0,\end{cases}$$
    where $n$ is uniformly chosen from $\{1, \ldots, d\}$ and $\delta$ is a uniform random variable over $\{-1, +1\}$.

F.2.2 Unbiasedness of Loss Estimators

The following lemma (Lemma F.2.1, stated after the short illustration below) shows the unbiasedness of the constructed loss estimators for both the meta and the base algorithms.
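Before stating the lemma, here is a minimal Monte Carlo sketch of the two facts about the sampling step of Algorithm 22 that the analysis repeatedly uses: $\mathbb{E}[\widetilde{a}_t] = a_t$, and $\mathbb{E}[(1-\xi_t)\,\widetilde{a}_t\widetilde{a}_t^\top] = \frac{1-\|a_t\|_p}{d} I$. The helper name `sample_play`, the test parameters, and the use of NumPy are illustrative assumptions, not part of the algorithm.

```python
# Minimal Monte Carlo sketch (illustration only) of the sampling step of
# Algorithm 22: (i) E[a_tilde] = a, used in the regret decomposition of
# Appendix F.2.3; (ii) E[(1 - xi) a_tilde a_tilde^T] = (1 - ||a||_p)/d * I,
# used in the proof of Lemma F.2.1 for the base loss estimator.
import numpy as np

def sample_play(a, p, rng):
    """One draw of (a_tilde, xi) as in Algorithm 22."""
    norm = np.linalg.norm(a, ord=p)
    xi = rng.random() < norm                 # xi ~ Ber(||a||_p)
    if xi:
        return a / norm, 1
    n = rng.integers(len(a))                 # uniform coordinate
    delta = rng.choice([-1.0, 1.0])          # uniform sign
    e = np.zeros(len(a)); e[n] = delta
    return e, 0

def check(d=4, p=1.5, trials=200_000, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.uniform(-0.3, 0.3, size=d)       # some point with ||a||_p < 1
    mean_play = np.zeros(d)
    second_moment = np.zeros((d, d))
    for _ in range(trials):
        x, xi = sample_play(a, p, rng)
        mean_play += x
        if xi == 0:
            second_moment += np.outer(x, x)
    mean_play /= trials
    second_moment /= trials
    target = (1 - np.linalg.norm(a, ord=p)) / d * np.eye(d)
    print(np.max(np.abs(mean_play - a)))            # ~ 0 up to Monte Carlo error
    print(np.max(np.abs(second_moment - target)))   # ~ 0 up to Monte Carlo error

check()
```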
Lemma F.2.1. The meta loss estimator ¯ℓt defined in Eq. (3.7) and the base loss estimator ℓbt defined in Eq. (3.6) satisfy that Et [ ¯ℓt ] = ℓt and Et [ℓbt ] = ℓt for all t ∈ [T]. Proof. We first show the unbiasedness of the meta loss estimator ¯ℓt . According to the definition in Eq. (3.7), we have Et [ ¯ℓt ] = Et [Mf−1 t xtx ⊤ t ℓt ] 314 = β d X d n=1 ene ⊤ n + (1 − β) X t i=1 pbt,iea (i) t ea (i)⊤ t !−1 · Et [xtx ⊤ t ]ℓt = β d X d n=1 ene ⊤ n + (1 − β) X t i=1 pbt,iea (i) t ea (i)⊤ t !−1 · β d X d n=1 ene ⊤ n + (1 − β) X t i=1 pbt,iea (i) t ea (i)⊤ t ! ℓt = ℓt . (F.3) Next, we show the unbiasedness of the base loss estimator ℓbt . According to the definition in Eq. (3.6), we have Et [ℓbt ] = Et " 1 − ξt 1 − β · d 1 − Pt i=1 pbt,i∥a (i) t ∥p · (ℓ ⊤ t xt) · xt · 1{ρt = 0} # = Et X t j=1 pbt,j · d(1 − ξ (j) t ) 1 − Pt i=1 pbt,i∥a (i) t ∥p · (ℓ ⊤ t ea (j) t ) · ea (j) t = X t j=1 pbt,j (1 − ∥a (j) t ∥p) · d 1 − Pt i=1 pbt,i∥a (i) t ∥p · 1 d X d n=1 ene ⊤ n ℓt = ℓt . In above derivations, the first step simply substitutes the definition of loss estimator, the second step holds due to the sampling scheme of Algorithm 10 (see Line 4), and the third step is because of the sampling mechanism in base algorithm (see Algorithm 22). This finishes the proof. F.2.3 Regret Decomposition We introduce shifted comparators u ′ t = (1 − γ)ut and ˚u ′ k = (1 − γ)˚uk to ensure that u ′ t ∈ Ω ′ for t ∈ [T] and ˚u ′ k ∈ Ω ′ for k ∈ [S], where Ω ′ = {x | ∥x∥p ≤ 1 − γ}. Based on the unbiasedness of ℓbt and ¯ℓt , the expected regret can be decomposed as E[RegS] 315 = E "X T t=1 ⟨xt , ℓt⟩ −X T t=1 ⟨ut , ℓt⟩ # = E "X T t=1 ⟨xt , ℓt⟩ # − E "X T t=1 D u ′ t , ℓbt E # + E "X T t=1 u ′ t − ut , ℓt # = (1 − β)E "X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E # − E "X T t=1 D u ′ t , ℓbt E # + E "X T t=1 u ′ t − ut , ℓt # = E "X T t=1 X t i=1 pbt,i D ea (i) t , ¯ℓt E # − E "X T t=1 D u ′ t , ℓbt E # + E "X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E # = E "X T t=1 X t i=1 pbt,ict,i# − E "X T t=1 D u ′ t , ℓbt E # + E "X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E # , where the third equation holds because of the sampling scheme of xt : with probability β, the action xt is uniformly sampled from {±en}, n ∈ [d]; with probability 1−β, the action is sampled from {(ea (i) t , ξ(i) t )} t i=1 according to pbt . In the last step, we recall that the notation ct ∈ R t is defined by ct,i = ⟨ea (i) t , ¯ℓt⟩ for all i ∈ [t]. We further decompose the above regret into several intervals. To this end, we split the horizon to a partition I1, . . . , IS. Let jk be the start time stamp of Ik. Note again that we use ˚uk ∈ Ω to denote the comparator in Ik for k ∈ [S], which means that ut = ˚uk for all t ∈ Ik. Then we have E[RegS] ≤ E X S k=1 X t∈Ik X t i=1 pbt,ict,i − D ˚u ′ k , ℓbt E ! 
+ E "X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E # = E X S k=1 X t∈Ik ⟨pbt − ejk , ct⟩ + X S k=1 X t∈Ik ⟨ejk , ct⟩ − D ˚u ′ k , ℓbt E + E "X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E # = E X S k=1 X t∈Ik ⟨pbt − ejk , ct − bt⟩ + X S k=1 X t∈Ik ⟨ejk , ct⟩ − D ˚u ′ k , ℓbt E + E X S k=1 X t∈Ik ⟨pbt − ejk , bt⟩ + E "X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E # 316 = E X S k=1 X t∈Ik X i∈[t] pbt,ibct,i − bct,jk + X S k=1 X t∈Ik ⟨ejk , ct⟩ − D ˚u ′ k , ℓbt E (bct,i = ct,i − bt,i for i ∈ [t]) + E X S k=1 X t∈Ik X t i=1 (pbt,i − ejk )bt,i + E "X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E # = E X S k=1 X t∈Ik X i∈[t] ⟨pt − ejk , bct⟩ + X S k=1 X t∈Ik Dea (jk) t , ¯ℓt E − D ˚u ′ k , ℓbt E + E X S k=1 X t∈Ik X t i=1 (pbt,i − ejk )bt,i + E "X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E # = E "X S k=1 X t∈Ik ⟨pt − ejk , bct⟩ | {z } Meta-Regret + X S k=1 X t∈Ik D a (jk) t − ˚u ′ k , ℓbt E | {z } Base-Regret + X T t=1 X t i=1 pbt,ibt,i | {z } Pos-Bias − X S k=1 X t∈Ik bt,jk | {z } Neg-Bias + X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E | {z } Deviation # , (F.4) where the second-last equality is due to the constructions of pbt and bct (see Line 8 in Algorithm 10), ⟨pt , bct⟩ = X i∈[t] pt,ibct,i + X i>t pt,ibct,i = X i∈[t] pbt,i X j∈[t] pt,j bct,i + X i>t pt,i X j∈[t] pbt,jbct,j = X i∈[t] pbt,i X j∈[t] pt,j + X i>t pt,i bct,i = X i∈[t] pbt,ibct,i, and the last equality is based on the definition of ea (i) t and a (i) t and the following equation: E hDea (i) t , ¯ℓt E − D u, ℓbt Ei = E hDea (i) t , Epbt [ ¯ℓt ] E − D u, ℓbt Ei = E hDea (i) t , ℓt E − D u, ℓbt Ei = E hDEξ (i) t [ea (i) t ], ℓt E − D u, ℓbt Ei = E hDa (i) t , E[ℓbt ] E − D u, ℓbt Ei = E hDa (i) t − u, ℓbt Ei . As a consequence, we upper bound the expected switching regret by five terms as shown in Eq. (F.4), including: Meta-Regret, Base-Regret, Pos-Bias, Neg-Bias, and Deviation. In the following, we will bound each term respectively. 317 F.2.4 Bounding Deviation and Pos-Bias Deviation. Deviation can be simply bounded by (β + γ)T as X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E ≤ X T t=1 ((1 − γ) − 1)⟨ut , ℓt⟩ + βT ≤ X T t=1 (1 − (1 − γ)) + βT ≤ (β + γ)T. (F.5) where the first and second inequalities hold because we have |ℓ ⊤ t x| ≤ 1 for any x ∈ Ω and t ∈ [T]. Pos-Bias. According to the definition of bt,i, we show that Pos-Bias is at most 1 λT(1 − β) X T t=1 X t i=1 pbt,i(1 − ∥a (i) t ∥p) 1 − Pt j=1 pbt,j∥a (j) t ∥p = 1 λ(1 − β) ≤ 2 λ . (F.6) Hence, it remains to evaluate Base-Regret and Meta-Regret, and in the following two subsections we present their upper bounds, respectively. F.2.5 Bounding Base-Regret In order to bound Base-Regret, we need to introduce the following lemma proven in [33], which shows that the dual local norm with respect to the regularizer R(x) = − log(1− ∥x∥ p p) is well bounded. This will later be shown to be crucial in controlling the stability of at updated by the online mirror descent shown in Algorithm 22. 318 Lemma F.2.2 (Lemma 2 in [33]). Let x, ℓ ∈ R d such that ∥x∥p < 1, ∥ℓ∥0 = 1 and ∥ℓ∥2 ≤ 1. Let y ∈ R d such that ∇R(y) ∈ [∇R(x), ∇R(x) + ℓ], R(x) = − log(1 − ∥x∥ p p). Then, we have for p ∈ (1, 2], ∥ℓ∥ 2 y,∗ ≤ 2 3 p−1 (1 − ∥x∥ p p) p(p − 1) X d n=1 |xn| 2−p + |ℓn| 2−p p−1 ℓ 2 n . 
In above, for a vector h ∈ R d , ∥h∥0 ≜ #{n | hn ̸= 0} denotes the number of non-zero entries, ∥h∥x ≜ p h⊤∇2R(x)h denotes the local norm induced by R at x, and ∥h∥x,∗ ≜ p h⊤(∇2R(x))−1h denotes the dual local norm. Then we are ready to bound Base-Regret for each k ∈ [S]. Note that for each k ∈ [S], as jk is the start time stamp of interval Ik, and base algorithm Bt starts at round t, we know that P t∈Ik ⟨a (jk) t − ˚uk, ℓbt⟩ is in fact the (estimated) static regret against comparator ˚uk for Bjk . Lemma F.2.3. For an arbitrary interval I started at round j, if γ = 4dηj ′ for all j ′ ∈ [T], Algorithm 10 ensures that the base regret of Bj with learning rate η for any comparator u ∈ Ω ′ is at most E "X t∈I D a (j) t − u, ℓbt E # ≤ log(1/γ) η + 2 4 p−1 dη (p − 1)(1 − β) X t∈I 1 − ∥a (j) t ∥p 1 − Pt i=1 pbt,i∥a (i) t ∥p . (F.7) Proof. Since the base algorithm Bj performs the online mirror descent over loss ℓbt with learning rate η, see update in Eq. (F.2), according to the standard analysis of OMD (see Lemma F.5.1) we have E "X t∈I D a (j) t − u, ℓbt E # ≤ R(u) − R(a (j) j ) η + 1 η X t∈I E h DR∗ ∇R(a (j) t ) − ηℓbt , ∇R(a (j) t ) i . Consider the first term. As a (j) j = argminx∈Ω R(x) and u ∈ Ω ′ = {x | ∥x∥p ≤ 1 − γ}, we have R(u) − R(a (j) j ) ≤ − log(1 − (1 − γ)) ≤ − log γ. (F.8) 319 For the second term, in the following we will employ Lemma F.2.2 to show that Et h DR∗ ∇R(a (j) t ) − ηℓbt , ∇R(a (j) t ) i ≤ η 2 · d · 2 4 p−1 (p − 1)(1 − β) · 1 − ∥a (j) t ∥p 1 − Pt i=1 pbt,i∥a (i) t ∥p . (F.9) To this end, we need to verify the condition of Lemma F.2.2. In fact, according to the definition of the base loss estimator in Eq. (3.6), ℓbt is a non-zero vector only when Algorithm 10 samples from one of the base algorithm instances and ξt = 0, meaning that xt = ±en for some n ∈ [d] according to Algorithm 22. Using the fact that a (i) t ∈ Ω ′ and β ≤ 1 2 , we have ∥a (i) t ∥p ≤ 1 − γ and ∥ηℓbt∥2 ≤ ηd (1 − β)(1 − Pt i=1 pbt,i(1 − γ)) ≤ ηd γ(1 − β) ≤ 2ηd γ ≤ 1 2 , where the last inequality is because of the choice of γ = 4dη. In addition, based on the definition of ℓbt , we have ∥ηℓbt∥0 = 1. Therefore, we can apply Lemma F.2.2 and obtain that Et h DR∗ ∇R(a (j) t ) − ηℓbt , ∇R(a (j) t ) i = Et h ∥ηℓbt∥ 2 yt,∗ i ≤ η 2 · 2 3 p−1 (1 − ∥a (j) t ∥ p p) p(p − 1) X d n=1 Et h|a (j) t,n| 2−p + |ηℓbt,n| 2−p p−1 ℓb2 t,ni = η 2 · 2 3 p−1 (1 − ∥a (j) t ∥ p p) p(p − 1) X d n=1 Et h |a (j) t,n| 2−p · ℓb2 t,ni | {z } term (a) + X d n=1 Et h |ηℓbt,n| 2−p p−1 · ℓb2 t,ni | {z } term (b) ! , where the first equality holds for some yt ∈ [∇R(a (j) t ) − ηℓbt , ∇R(a (j) t )] by the definition of Bregman divergence and the mean value theorem, the second inequality is by Lemma F.2.2. The last equality splits the desired quantity into two terms, and we upper bound term (a) and term (b) respectively. 320 For term (a), substituting the definition of loss estimator ℓbt (see definition in Eq. (3.6)) yields X d n=1 Et h |a (j) t,n| 2−p · ℓb2 t,ni = d 2 (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥p) 2 X d n=1 |a (j) t,n| 2−p · X t τ=1 pbt,τEt h (1 − ξ (τ) t ) 2ea (τ) 2 t,n ⟨ea (τ) t , ℓt⟩ 2 i = d 2 ) (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥p) 2 X d n=1 |a (j) t,n| 2−p · X t τ=1 pbt,τ (1 − ∥a (τ) t ∥p) · 1 d X d n′=1 1{n ′ = n}ℓ 2 t,n′ = d (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥p) X d n=1 |a (j) t,n| 2−p ℓ 2 t,n ≤ d (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥p) , where the last inequality is because of Hölder’s inequality, ∥ℓt∥q ≤ 1 and ∥a (j) t ∥p ≤ 1. 
For term (b), again by definition of the loss estimator, we have X d n=1 Et h |ηℓbt,n| 2−p p−1 · ℓb2 t,ni ≤ X d n=1 Et η(1 − ξt)d(xtx ⊤ t ℓt)n (1 − β)γ 2−p p−1 · ℓb2 t,n = X d n=1 Et η(1 − ξt)d(xtx ⊤ t ℓt)n (1 − β)γ 2−p p−1 · (1 − ξt) 2d 2 (xtx ⊤ t ℓt) 2 n · 1{ρt = 0} (1 − β) 2(1 − Pt i=1 pbt,i∥a (i) t ∥p) 2 ≤ X d n=1 Et (xtx ⊤ t ℓt)n 2 2−p p−1 · (1 − ξt) 2d 2 (xtx ⊤ t ℓt) 2 n · 1{ρt = 0} (1 − β) 2(1 − Pt i=1 pbt,i∥a (i) t ∥p) 2 (γ = 4dη, 1 − β ≥ 1 2 ) ≤ d 2 (1 − β) 2(1 − Pt i=1 pbt,i∥a (i) t ∥p) 2 X d n=1 Et h (1 − ξt) 2 (xtx ⊤ t ℓt) q n · 1{ρt = 0} i (note that 2−p p−1 + 2 = q) ≤ 1 (1 − β) 2 · d 2 (1 − Pt i=1 pbt,i∥a (i) t ∥p) 2 · X d n=1 (1 − β) X t τ=1 pbt,τ (1 − ∥a (τ) t ∥p) · 1 d · ℓ q t,n ≤ d (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥p) . 321 Combining the above upper bounds for term (a) and term (b), we obtain Et h DR∗ ∇R(a (j) t ) − ηℓbt , ∇R(a (j) t ) i ≤ η 2 1 − β · 2d · 2 3 p−1 p(p − 1) · 1 − ∥a (j) t ∥ p p 1 − Pt i=1 pbt,i∥a (i) t ∥p ≤ η 2 1 − β · d · 2 4 p−1 p(p − 1) · 1 − ∥a (j) t ∥ p p 1 − Pt i=1 pbt,i∥a (i) t ∥p ≤ η 2 1 − β · d · 2 4 p−1 p − 1 · 1 − ∥a (j) t ∥p 1 − Pt i=1 pbt,i∥a (i) t ∥p . Note that the last step is true because we have 1 − ∥x∥ p p ≤ p(1 − ∥x∥p) by the following inequality 1 + p · ∥x∥ p p − 1 p ≤ 1 + ∥x∥ p p − 1 p p , which holds due to p ∈ (1, 2] and 0 ≤ ∥x∥p ≤ 1 as well as the Bernoulli’s inequality that 1+rθ ≤ (1+θ) r for any r ≥ 1 and θ ≥ −1. Therefore, we finish proving the desired upper bound in Eq. (F.9). Further combining it with the upper bound in Eq. (F.8) finishes the proof of Lemma F.2.3. We will show later that the second term in the bound shown in Eq. (F.7) can in fact be cancelled by the Neg-Bias. Finally, we bound the term Meta-Regret. F.2.6 Bounding Meta-Regret We prove the following lemma to bound the Meta-Regret. Lemma F.2.4. For an arbitrary interval I ⊆ [T] started at round j, setting ε λγT ≤ 1 8 , β = 8dε and µ = 1 T , Algorithm 10 guarantees that X t∈I ⟨pt − ej , bct⟩ ≤ 2 ln T ε + ε X t∈I X T i=1 pt,ibc 2 t,i + O |I| εT . (F.10) 322 Proof. Note that the meta algorithm essentially performs the exponential weights with a fixed-share update and sleeping expert. Define vt+1,i ≜ pt,i exp(−εbct,i) PT t=1 pt,i exp(−εbct,i) for all i ∈ [T]. Then pt+1,i = µ T + (1 − µ)vt+1,i. Note that ⟨pt , bct⟩ + 1 ε ln X T i=1 pt,i exp(−εbct,i) ! ≤ ⟨pt , bct⟩ + 1 ε ln X T i=1 pt,i(1 − εbct,i + ε 2bc 2 t,i) ! = ⟨pt , bct⟩ + 1 ε ln 1 − ε ⟨pt , bct⟩ + ε 2X T i=1 pt,ibc 2 t,i! ≤ ε X T i=1 pt,ibc 2 t,i. The first inequality is because exp(−x) ≤ 1−x+x 2 holds for x ≥ −1 2 . To show that ε maxi∈[t] |bct,i| ≤ 1 2 , we have ε max i∈[t] |bct,i| = ε max i∈[t] D ea (i) t , Mf−1 t xtx ⊤ t ℓt E − bi ≤ ε max i∈[t] ea (i)⊤ t Mf−1 t xt + ε max i∈[t] |bt,i| . We can bound the first term by Hölder’s inequality ε max i∈[t] ea (i)⊤ t Mf−1 t xt ≤ ε max i∈[t] ∥ea (i) t ∥p∥Mf−1 t xt∥q ≤ ε∥Mf−1 t xt∥2 ≤ εd β ∥xt∥2 ≤ εd β ∥xt∥p ≤ εd β , where 1 p + 1 q = 1. The second inequality is by ∥ea (i) t ∥p ≤ 1 and p ≤ 2 ≤ q. The third one is because Mft has the smallest eigenvalue β d . By the definition of bt,i, we can bound the second term as ε max i∈[t] |bt,i| ≤ ε λT(1 − β) · 1 γ ≤ 2ε λT γ . 323 Therefore, according to the choice of ε, γ and λ, we have ε maxi∈[t] |bct,i| ≤ 1 2 . Furthermore, by the definition of vt+1,i, we have PT j=1 pt,j exp(−εbct,j ) = pt,i exp(−εbct,i)/vt+1,i. Therefore, we have 1 ε ln X T j=1 pt,j exp(−εbct,j ) = − 1 ε ln vt+1,i pt,i − bct,i. 
Combining the two equations and taking summation over t ∈ I, we have for any ej ∈ ∆T , j ∈ [T], X t∈I ⟨pt , bct⟩ −X t∈I ⟨ej , bct⟩ ≤ ε X t∈I X T i=1 pt,ibc 2 t,i + 1 ε X t∈I ln vt+1,j pt,j . Further note that X t∈I ln vt+1,j pt,j = X t∈I ln pt+1,j pt,j + X t∈I ln vt+1,j µ T + (1 − µ)vt+1,j ≤ ln pq+1,j ps,j + |I| log 1 1 − µ (let I = [s, q]) ≤ log(T 2 ) + O |I| T (F.11) where the last step is due to pt,j ≥ µ T = 1 T2 for j ∈ [T] and t ∈ [T], and moreover, we have log( 1 1−µ ) = log(1 + µ 1−µ ) = O(1/T) as µ = 1 T ≤ 1 2 . Combining the above two inequalities achieves X t∈I ⟨pt − ej , bct⟩ ≤ ε X t∈I X T i=1 pt,ibc 2 t,i + 2 log T ε + O |I| εT . which finishes the proof. Next, we prove the following lemma, which bounds the second term shown in Eq. (F.10) 324 Lemma F.2.5. For any t ∈ [T], setting λ 2γ = Θ q 1 dST3 and β ≤ 1 2 , Algorithm 10 guarantees that X T i=1 pt,ibc 2 t,i ≤ X i∈[t] pbt,ibc 2 t,i ≤ 2 X i∈[t] pbt,ic 2 t,i + O r dS T ! , (F.12) where ct,i = ⟨ea (i) t , ¯ℓt⟩ for i ∈ [t]. Proof. According to the definition of pbt , we have X T i=1 pt,ibc 2 t,i = X i∈[t] pt,ibc 2 t,i + X i>t pt,i X i∈[t] pbt,ibct,i 2 ≤ X i∈[t] pt,ibc 2 t,i + X i>t pt,i X i∈[t] pbt,ibc 2 t,i = X i∈[t] pt,i X i∈[t] pbt,ibc 2 t,i + X i>t pt,i! X i∈[t] pbt,ibc 2 t,i = X i∈[t] pbt,ibc 2 t,i, where the inequality is because of Cauchy-Schwarz inequality. Besides, recall that ct,i = ⟨ea (i) t , ¯ℓt⟩ and bc 2 t,i = (ct,i − bt,i) 2 ≤ 2c 2 t,i + 2b 2 t,i. According to the definition of bt,i, we know that X i∈[t] pbt,ib 2 t,i ≤ 4 (λT) 2 1 γ X i∈[t] pbt,i 1 − ∥a (i) t ∥p 1 − P j∈[t] pbt,j∥a (j) t ∥p = 4 (λT) 2 1 γ = O r dS T ! , where the first inequality uses the fact that bt,i ≤ 1 λT γ(1−β) ≤ 2 λT γ and the last step holds because we choose λ 2γ = Θ q 1 dST3 . Combining Lemma F.2.4 and Lemma F.2.5, we obtain the following lemma to bound the meta-regret. Lemma F.2.6. Define C = √ p − 1·2 − 2 p−1 . Set parameters ε = min q S dT , 1 16d , C2 2 , β = 8dε, λ = √ C dST , γ = 4C q dS T and µ = 1 T . Then, Algorithm 10 guarantees that E [Meta-Regret] ≤ Oe √ dST . 325 Proof. It is evident to verify that the choice of ε, λ, β and γ satisfies the condition required in Lemma F.2.4 and Lemma F.2.5, then based on the two lemmas, with β = 8dε ≤ 1 2 , for each interval Ik, we have E X t∈Ik ⟨pt − ejk , bct⟩ ≤ 2 ln T ε + 2εE X t∈Ik X i∈[t] pbt,ic 2 t,i + O ε|Ik| r dS T ! + O |Ik| εT ≤ 2 ln T ε + 2εE X t∈Ik X t i=1 pbt,iea (i)⊤ t Mf−1 t xtx ⊤ t Mf−1 t ea (i) t + O ε|Ik| r dS T ! + O |Ik| εT ≤ 2 ln T ε + 2ε X t∈Ik X t i=1 pbt,iea (i)⊤ t Mf−1 t ea (i) t + O ε|Ik| r dS T ! + O |Ik| εT (Et [xtx ⊤ t ] = Mft ) ≤ 2 ln T ε + 2ε 1 − β X t∈Ik X t i=1 pbt,iea (i)⊤ t X t i=1 pbt,iea (i) t ea (i)⊤ t !−1 ea (i) t + O ε|Ik| r dS T ! + O |Ik| εT ≤ 2 ln T ε + 4εd|Ik| + O ε|Ik| r dS T ! + O |Ik| εT . (F.13) Summing the regret over all the intervals achieves the following meta-regret upper bound: E [Meta-Regret] = E X S k=1 X t∈Ik ⟨pt − ejk , bct⟩ ≤ 2S ln T ε + 4εdT + O ε √ dST + O (1/ε) ≤ Oe √ dST , (F.14) where the last inequality is because we choose ε = min q S dT , C2 2 , 1 16d . F.2.7 Proof of Theorem 3.2.1 Putting everything together, we are now ready to prove our main result (Theorem 3.2.1). 326 Proof. Based on the regret decomposition in Eq. (F.4), upper bound of bias term in Eq. (F.5), upper bound of positive term Eq. (F.6), base regret upper bound in Lemma F.2.3 and meta regret upper bound in Eq. (F.14), we have E[RegS] = E X S k=1 X t∈Ik ⟨xt − ˚uk, ℓt⟩ ≤ 2 λ + X S k=1 log(1/γ) ηjk + 2 4 p−1 dηjk (p − 1)(1 − β) − 1 λT(1 − β) ! 
X t∈Ik 1 − ∥a (jk) t ∥p 1 − Pt i=1 pbt,i∥a (i) t ∥p + (β + γ)T + Oe √ dST . Importantly, note that the coefficient of the third term is actually zero. Indeed, due to the parameter configurations that γ = 4C q dS T , η = C q S dT , λ = √ C dST , β = 8dε, ε = min 1 16d , C2 2 , q S dT and C = √ p − 1 · 2 − 2 p−1 , we can verify that 2 4 p−1 dη p − 1 − 1 λT = 2 2 p−1 √ dS p (p − 1)T − √ dS C √ T = 0. Therefore, we obtain the following switching regret: E[RegS] ≤ 2 λ + 8√ dST + 4C √ dST + Oe( √ dST) ≤ Oe √ dST , which finishes the proof. In addition, we also provide the following theorem showing the expected interval regret bound, which will be useful in the later analysis, for example, the unconstrained linear bandits in Section 3.2.4. 327 Theorem F.2.1. Define C = √ p − 1 · 2 − 2 p−1 . Set parameters ε = min q S dT , 1 16d , C2 2 , β = 8dε, λ = √ C dST , γ = 4C q dS T , µ = 1 T and η = C q S dT . Then, Algorithm 10 guarantees that for any interval I and comparator u ∈ Ω, E "X t∈I ℓ ⊤ t xt − X t∈I ℓ ⊤ t u # ≤ Oe r dT S + |I|r dS T ! . (F.15) Proof. Based on the regret decomposition Eq. (F.4), Eq. (F.5), Eq. (F.6), Lemma F.2.3 and Eq. (F.13) within rounds t ∈ I starting at round j, we have E "X t∈I ℓ ⊤ t xt − X t∈I ℓ ⊤ t u # ≤ 2|I| λT + log(1/γ) ηj + 2 4 p−1 dηj (p − 1)(1 − β) − 1 λT(1 − β) !X t∈I 1 − ∥a (j) t ∥p 1 − Pt i=1 pbt,i∥a (i) t ∥p + (β + γ)|I| + Oe ε|I|r dS T ! + O |I| εT . Again, note that according to the choice of γ, η, λ, β and ε, we have 2 4 p−1 dη p − 1 − 1 λT = 2 2 p−1 √ dS p (p − 1)T − √ dS C √ T = 0. Therefore, we have E "X t∈I ℓ ⊤ t xt − X t∈I ℓ ⊤ t u # ≤ 2 C |I|r dS T + log 1 4C · q T dS C · r dT S + 8d|I|r S dT + 4C|I|r dS T + Oe ε|I|r dS T + |I| εT ! ≤ Oe r dT S + |I|r dS T ! , 328 which finishes the proof. F.3 Extension to Smooth and Strongly Convex Set In this section, we extend our results for linear bandits with ℓp-ball feasible domain in Section 3.2.3 to the setting when the feasible domain is a smooth and strongly convex set. Kerdreux et al. [89] studied the static regret for linear bandits in this setting, and we focus on the S-switching regret. F.3.1 Main Results Formally, we investigate adversarial linear bandits with a smooth and strongly convex feasible domain. In the following, we present the definitions of smooth set Definition 1 in [89] and strongly convex set Definition 3 in [89]. Definition F.3.1 (smooth set). A compact convex set Ω is smooth if and only if |NΩ(x) ∩ ∂Ω ◦ | = 1 for any x ∈ ∂Ω, where NΩ(x) ≜ {u ∈ R d | ⟨x − y, u⟩ ≥ 0, ∀y ∈ Ω}, ∂Ω is the boundary of Ω and Ω ◦ = {u ∈ R d | ⟨u, x⟩ ≤ 1, ∀x ∈ Ω} is the polar of Ω. Definition F.3.2 (strongly convex set). Let Ω be a centrally symmetric set with non-empty interior. Let α > 0 be the curvature coefficient. The set Ω is α-strongly convex with respect to ∥ · ∥Ω if and only if for any x, y, z ∈ Ω and γ ∈ [0, 1], we have γx + (1 − γ)y + α 2 γ(1 − γ)∥x − y∥ 2 Ω · z ∈ Ω, where ∥x∥Ω ≜ inf{λ > 0 | x ∈ λΩ} is the gauge function to Ω. Conventionally, we assume that |ℓ ⊤ t x| ≤ 1 holds for all x ∈ Ω and t ∈ [T]. We also assume that ℓp(1) ⊆ Ω ⊆ ℓq(1) with p ∈ (1, 2] and 1 p + 1 q = 1, where ℓs(r) ≜ {x ∈ R d | ∥x∥s ≤ r} denotes the ℓs-norm ball (s ≥ 1) with radius r > 0. We here stress the connection and difference between the strongly 329 convex set setting and the ℓp-ball setting considered in Section 3.2.3. Note that Ω is a subset of ℓq ball and includes ℓp ball. Besides, ℓp ball is also smooth when p ∈ (1, 2]. Therefore, it includes ℓp-ball feasible set for p ∈ (1, 2] but can be more general. 
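For concreteness (an illustration only), when Ω = ℓ_p(1) itself the gauge and the polar specialize to
\[
\|x\|_{\Omega} = \|x\|_{p}, \qquad \Omega^{\circ} = \ell_{q}(1), \qquad \tfrac{1}{p} + \tfrac{1}{q} = 1,
\]
so the assumption ℓ_p(1) ⊆ Ω ⊆ ℓ_q(1) holds with equality on the left; note, however, that Section 3.2.3 used the different regularizer R(x) = −log(1 − ∥x∥_p^p) rather than the gauge-based regularizer introduced below.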
Nevertheless, the switching regret bound we will prove is Oe(d 1/p √ ST), which recovers the Oe( √ dST) switching regret of ℓp-ball feasible domain in Theorem 3.2.1 only when p = 2 but leads to a slightly worse dependence on d when p ∈ (1, 2). Note that as p > 1, this bound is still better than Oe(d √ ST). Our proposed algorithm for smooth and strongly convex set is basically the same as the one proposed for the ℓp ball setting, except that we now need to modify the base algorithm based on the algorithm introduced in [89] and also need to revise the construction of injected bias bt,i and the loss estimator ℓbt in the meta level. Specifically, in the base algorithm we use online mirror descent with the following regularizer, R(x) = − ln(1 − ∥x∥Ω) − ∥x∥Ω, whose detailed update procedures are presented in Algorithm 24. For the meta algorithm, the update procedures are in Algorithm 23, notably, the injected bias bt is constructed according to Eq. (F.18) and the base loss estimator ℓbt is constructed according to Eq. (F.16). We have the following theorem regarding the switching regret of our proposed algorithm for linear bandits on smooth and strongly convex feasible domain. Theorem F.3.1. Consider a compact convex set Ω that is centrally symmetric with non-empty interior. Suppose that Ω is smooth and α-strongly convex with respect to ∥ · ∥Ω and ℓp(1) ⊆ Ω ⊆ ℓq(1), p ∈ (1, 2], 1 p + 1 q = 1. Define C = q α 10α+8 . Set parameters γ = 4Cd 1 q q S T , λ = Cd− 1 q √ ST , β = 8d 2 p ε, ε = min 1 16d 2 p , C2 2 , d− 1 p q S T , µ = 1 T and η = Cd− 1 p q S T . Then, Algorithm 23 guarantees E[RegS] = E "X T t=1 ℓ ⊤ t xt − X T t=1 ℓ ⊤ t ut # ≤ Oe d 1/p √ ST , 330 Algorithm 23 Algorithm for adversarial linear bandits over smooth and strongly convex set with switching regret Input: clipping parameter γ, base learning rate η, meta learning rate ε, mixing rate µ, exploration parameter β, bias coefficient λ, initial uniform distribution p1 ∈ ∆T . for t = 1 to T do Start a new base algorithm Bt , which is an instance of Algorithm 24 with learning rate η, clipping parameter γ, and initial round t. Receive local decision (ea (i) t , a (i) t , ξ(i) t ) from base algorithm Bi for each i ≤ t. Compute the renormalized distribution pbt ∈ ∆t such that pbt,i ∝ pt,i for i ∈ [t]. Sample a Bernoulli random variable ρt with mean β. If ρt = 1, uniformly sample xt from {±en} d n=1; otherwise, sample it ∈ [t] according to pbt , and set xt = ea (it) t and ξt = ξ (it) t . Make the final decision xt and receive feedback ℓ ⊤ t xt . Construct the base loss estimator ℓbt ∈ R d as follows and send it to all base algorithms {Bi} t i=1: ℓbt = 1{ρt = 0}1{ξt = 0} 1 − β · d(ℓ ⊤ t xt) 1 − Pt i=1 pbt,i∥a (i) t ∥Ω · xt . (F.16) Construct another loss estimator ¯ℓt ∈ R d as ¯ℓt = Mf−1 t xtx ⊤ t ℓt , (F.17) where Mft = β d Pd i=1 eie ⊤ i + (1 − β) Pt i=1 pbt,iea (i) t ea (i)⊤ t . Construct the meta loss estimator bct ∈ R T as: bct,i = ( ⟨ea (i) t , ¯ℓt⟩ − bt,i, i ≤ t, Pt j=1 pbt,jbct,j , i > t, where bt,i = 1 λT(1 − β) 1 − ∥a (i) t ∥Ω 1 − Pt j=1 pbt,j∥a (j) t ∥Ω . (F.18) Meta algorithm updates the weight pt+1 ∈ ∆T according to pt+1,i = (1 − µ) pt,i exp(−εbct,i) PT j=1 pt,j exp(−εbct,j ) + µ T , ∀i ∈ [T]. (F.19) where u1, . . . , uT ∈ Ω is the comparator sequence such that PT t=2 1{ut−1 ̸= ut} ≤ S − 1. In the following, we first introduce some definitions and lemmas useful for the analysis in strongly convex set in Appendix F.3.2 and then prove Theorem F.3.1 in Appendix F.3.3–F.3.8. 
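Before turning to the analysis, we include a small numerical sketch of the sampling and meta-update steps of Algorithm 23, i.e., the renormalization to the awake base algorithms, the exploration step, and the fixed-share exponential-weights update of Eq. (F.19). This is an illustration only, written under the assumption that the base decisions ã_t^(i) and the meta losses ĉ_t of Eq. (F.18) have already been produced; the NumPy realization and all function names are ours and not part of the algorithm's specification (bookkeeping of ξ_t is also omitted).

import numpy as np

def renormalize_awake(p, t):
    # Only base algorithms B_1, ..., B_t exist at round t, so the played
    # distribution \hat p_t is p_t restricted to its first t coordinates.
    return p[:t] / p[:t].sum()

def sample_action(p_hat, tilde_a, beta, rng):
    # Exploration step of Algorithm 23: with probability beta play a uniformly
    # random signed coordinate vector; otherwise follow a base algorithm drawn
    # according to \hat p_t. tilde_a has shape (t, d).
    t, d = tilde_a.shape
    if rng.random() < beta:
        x = np.zeros(d)
        x[rng.integers(d)] = rng.choice([-1.0, 1.0])
        return x
    return tilde_a[rng.choice(t, p=p_hat)]

def fixed_share_update(p, c_hat, eps, mu):
    # One step of Eq. (F.19): exponential weights on the meta losses \hat c_t,
    # followed by mixing a mu-fraction of the uniform distribution back in.
    v = p * np.exp(-eps * c_hat)
    v = v / v.sum()
    return (1.0 - mu) * v + mu / len(p)

Here rng stands for a NumPy generator such as np.random.default_rng(). In the analysis, µ = 1/T, which keeps every weight at least µ/T = 1/T² and is exactly what makes the log(T²) term in Eq. (F.11) available.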
To prove Theorem F.3.1, similar to the analysis structure in Appendix F.2, we first prove the unbiasedness of loss estimators in Appendix F.3.3, and then in Appendix F.3.4, we decompose the regret into several terms, and subsequently upper bound each term in Appendix F.3.5, Appendix F.3.6, and Appendix F.3.7. We finally put everything together and present the proof in Appendix F.3.8. 331 Algorithm 24 Base algorithm for linear bandits on strongly convex set Input: learning rate η, clipping parameter γ, initial round t0. Define: clipped feasible domain Ω ′ = {x | ∥x∥Ω ≤ 1 − γ, x ∈ Ω}. Initialize: a (t0) t0 = argminx∈Ω′ R(x) and ξ (t0) t0 = 0. Draw ea (t0) t0 uniformly randomly from {±en} d n=1. for t = t0 to T do Send (ea (t0) t , a (t0) t , ξ(t0) t ) to the meta algorithm. Receive a loss estimator ℓbt . Update the strategy based on OMD with regularizer R(x) = − log(1 − ∥x∥Ω) − ∥x∥Ω: a (t0) t+1 = argmin a∈Ω′ D a, ℓbt E + 1 η DR(a, a (t0) t ) . (F.20) Generate a random variable ξ (t0) t+1 ∼ Ber(∥a (t0) t+1∥Ω) and set ea (t0) t+1 = ( a (t0) t+1/∥a (t0) t+1 ∥Ω if ξ (t0) t+1 = 1, δen if ξ (t0) t+1 = 0, where n is uniformly chosen from {1, . . . , d} and δ is a uniform random variable over {−1, +1}. F.3.2 Preliminary This subsection collects some useful definitions and lemmas for the analysis. We refer the reader to [89] for detailed introductions. Define ∥ · ∥Ω is the gauge function to Ω as ∥x∥Ω = inf{λ > 0 | x ∈ λΩ}. (F.21) The polar of Ω is defined as Ω ◦ = {ℓ ∈ R d | ⟨x, ℓ⟩ ≤ 1, ∀x ∈ Ω}. If Ω is symmetric, then based on the assumption |⟨x, ℓt⟩| ≤ 1, we have ℓt ∈ Ω ◦ . Based on the definition of gauge function, we have ∥x∥Ω ≤ 1 for all x ∈ Ω. In addition, we have the Hölder’s inequality ⟨x, ℓ⟩ ≤ ∥x∥Ω · ∥ℓ∥Ω◦ . In this problem, we also assume that ℓp(1) ⊆ Ω ⊆ ℓq(1), p ∈ (1, 2], 1 p + 1 q = 1 which implies that {±en}n∈[d] ⊆ Ω. By the definition of Ω ◦ , we also have ℓp(1) ⊆ Ω ◦ ⊆ ℓq(1). The following lemmas show some useful identities for the regularizer R(x). 332 Lemma F.3.1 (Lemma 5 of Kerdreux et al. [89]). A gauge function ∥ · ∥Ω is differentiable at x ∈ R d\{0} if and only if its support set S(Ω◦ , x) = {h ∈ Ω ◦ | ⟨h, x⟩ = suph′∈Ω◦ ⟨h ′ , x⟩} contains a single point h. If this is the case, we have ∇∥ · ∥Ω(x) = d. Besides, the following assertions are true: (1) ∥(∇∥ · ∥Ω(x))∥Ω◦ = 1; (2) ∇∥ · ∥Ω(λx) = ∇∥ · ∥Ω(x), for any λ > 0; (3) if Ω ◦ is strictly convex, then ∥ · ∥Ω is differentiable in R d\{0}. Lemma F.3.2 (Corollary 8 of Kerdreux et al. [89]). Let Ω be a centrally symmetric set with non empty interior. Assume that Ω is α-strongly convex with respect to ∥ · ∥Ω. Then for any (u, v) ∈ R n , D1 2 ∥·∥2 Ω◦ (u, v) ≤ 4(α + 1) α ∥u − v∥ 2 Ω◦ . Lemma F.3.3 (Lemma 15 of Kerdreux et al. [89]). Assume Ω ⊆ R d is strictly convex compact and smooth set. Let x ∈ Ω such that ∥x∥Ω < 1 and h ∈ R d\{0}. We have R(x) is differentiable on int(Ω) and ∇R(x) = ∥x∥Ω 1 − ∥x∥Ω · ∇∥ · ∥Ω(x), R ∗ (h) = ∥h∥Ω◦ − ln(1 + ∥h∥Ω◦ ), ∇R ∗ (h) = ∥h∥Ω◦ 1 + ∥h∥Ω◦ ∇∥ · ∥Ω◦ (h). F.3.3 Unbiasedness of Loss Estimator We first show that the loss estimator for the meta algorithm ¯ℓt and the one for the base algorithm ℓbt constructed in Algorithm 23 are unbiased. Lemma F.3.4. The meta loss estimator ¯ℓt defined in Eq. (F.17) and base loss estimator ℓbt defined in Eq. (F.16) satisfy that Et [ ¯ℓt ] = ℓt and Et [ℓbt ] = ℓt for all t ∈ [T]. 333 Proof. The the unbiasedness of ¯ℓt can be proven in the exact same way as in Eq. (F.3). 
For ℓbt , according to the sampling scheme of xt , we have Et [ℓbt ] = Et " 1 − ξt 1 − β · d 1 − Pt i=1 pbt,i∥a (i) t ∥Ω xtx ⊤ t ℓt · 1{ρt = 0} # = Et " (1 − ξt) · d 1 − Pt i=1 pbt,i∥a (i) t ∥Ω xtx ⊤ t ℓt ρt = 0# = Et X t j=1 pbt,j · d(1 − ξ (j) t ) 1 − Pt i=1 pbt,i∥a (i) t ∥Ω ea (j) t ea (j)⊤ t ℓt = X t j=1 pbt,j · d(1 − ∥a (j) t ∥Ω) 1 − Pt i=1 pbt,i∥a (i) t ∥Ω 1 d X d n=1 ene ⊤ n ℓt = ℓt . This ends the proof. F.3.4 Regret Decomposition Similar to the analysis in Appendix F.2, we decompose the expected switching regret into five terms and then bound each term respectively. Again, we split the horizon to I1, . . . , IS, and let jk be the start time stamp of Ik. We introduce u ′ t = (1 − γ)ut and ˚u ′ k = (1 − γ)˚uk to ensure that u ′ t ∈ Ω ′ for t ∈ [T] and ˚u ′ k ∈ Ω ′ for k ∈ [S], where Ω ′ = {x | ∥x∥Ω ≤ 1 − γ, x ∈ Ω}. Similar to the decomposition method of Eq. (F.4), the expected regret can be decomposed as E[RegS] = E "X T t=1 ⟨xt , ℓt⟩ −X T t=1 ⟨ut , ℓt⟩ # = E "X S k=1 X t∈Ik ⟨pt − ejk , bct⟩ | {z } Meta-Regret + X S k=1 X t∈Ik D a (jk) t − ˚u ′ k , ℓbt E | {z } Base-Regret + X T t=1 X t i=1 pbt,ibt,i | {z } Pos-Bias − X S k=1 X t∈Ik bt,jk | {z } Neg-Bias + X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E | {z } Deviation # . (F.22) In the following, we will bound each term respectively. 334 F.3.5 Bounding Deviation and Pos-Bias Deviation. Deviation term can still be bounded by (β + γ)T as X T t=1 u ′ t − ut , ℓt − β X T t=1 X t i=1 pbt,i D ea (i) t , ℓt E ≤ X T t=1 ((1 − γ) − 1)⟨ut , ℓt⟩ + βT ≤ X T t=1 (1 − (1 − γ)) + βT = (β + γ)T. (F.23) Pos-Bias. According to the definition of bt,i, we have 1 λT(1 − β) X T t=1 X t i=1 pbt,i(1 − ∥a (i) t ∥Ω) 1 − Pt j=1 pbt,j∥a (j) t ∥Ω = 1 λ(1 − β) ≤ 2 λ , (F.24) where the last inequality is because β ≤ 1 2 . In the following two subsections, we bound Base-Regret and Meta-Regret respectively. F.3.6 Bounding Base-Regret Before bounding the term Base-Regret, we show the following two lemmas which will be useful in the analysis. The first lemma bounds the scale of the loss estimator used for the base algorithm. Lemma F.3.5. For any x ∈ (1 − γ)Ω and η, define u = ∇R(x) − ηℓbt and v = ∇R(x) with ℓbt defined in Algorithm 23. We have ∥u∥Ω◦ − ∥v∥Ω◦ 1 + ∥v∥Ω◦ ≥ − 2ηd γ . 335 Proof. First, note that using Lemma F.3.1 and Lemma F.3.3, the denominator can be written as 1 1 + ∥v∥Ω◦ = 1 1 + ∥∇R(x)∥Ω◦ = (1 − ∥x∥Ω)(∥(∇∥ · ∥Ω(x))∥Ω◦ ) −1 = 1 − ∥x∥Ω. (F.25) For the numerator, note that a (i) t ∈ (1 − γ)Ω for all t ∈ [T] and i ∈ [t] and β ≤ 1 2 , we have ∥ℓbt∥Ω◦ ≤ d (1 − β)(1 − (1 − γ))|x ⊤ t ℓt | · ∥xt∥Ω◦ · 1{xt ∈ {±en}n∈[d]} ≤ 2d∥xt∥Ω◦ γ 1{xt ∈ {±en}n∈[d]}. Therefore, according to triangle inequality, we have ∥u∥Ω◦ − ∥v∥Ω◦ ≥ −η∥ℓbt∥Ω◦ ≥ − 2dη γ ∥xt∥Ω◦ · 1{xt ∈ {±en}n∈[d]}. Note that Ω ⊆ ℓq(1), we have ℓq(1)◦ = ℓp(1) ⊆ Ω ◦ , which means that en ∈ Ω ◦ . This means that ∥en∥Ω◦ ≤ 1 and we have ∥u∥Ω◦ − ∥v∥Ω◦ ≥ − 2ηd γ , which finishes the proof. The second lemma helps to bound the stability of the base algorithm, which is originally introduced in Lemma 17 of [89]. For completeness, we include the proof here. Lemma F.3.6. Suppose Ω to be a α-strongly convex and centrally symmetric set with non-empty interior. Let x ∈ Ω such that ∥x∥Ω ≤ 1 − γ and if η∥ℓbt∥Ω◦ ≤ 1 2 , DR∗ (∇R(x) − ηℓbt , ∇R(x)) ≤ (1 − ∥x∥Ω) 1 + 4(α + 1) α η 2 ∥ℓbt∥ 2 Ω◦ . 336 Proof. Define u = ∇R(x) − ηℓbt , v = ∇R(x) and z = ∥u∥Ω◦−∥v∥Ω◦ 1+∥v∥Ω◦ . 
By the definition of Bregman divergence and using Lemma F.3.3, we have DR∗ (u, v) = R ∗ (u) − R ∗ (v) − ⟨∇R ∗ (v), u − v⟩ = ∥u∥Ω◦ − ∥v∥Ω◦ − ln 1 + ∥u∥Ω◦ 1 + ∥v∥Ω◦ − ∥v∥Ω◦ 1 + ∥v∥Ω◦ ⟨∇∥ · ∥Ω◦ (v), u − v⟩ = z − ln(1 + z) + 1 1 + ∥ν∥Ω◦ [∥v∥Ω◦ (∥u∥Ω◦ − ∥v∥Ω◦ ) − ∥v∥Ω◦ ⟨∇∥ · ∥Ω◦ , u − v⟩] = z − ln(1 + z) − 1 2 (∥u∥Ω◦ − ∥v∥Ω◦ ) 2 1 + ∥v∥Ω◦ + D1 2 ∥·∥2 Ω◦ (u, v) 1 + ∥v∥Ω◦ ≤ z − ln(1 + z) + D1 2 ∥·∥2 Ω◦ (u, v) 1 + ∥v∥Ω◦ Note that z ≥ −1 2 as ∥u∥Ω◦−∥v∥Ω◦ 1+∥v∥Ω◦ ≥ −η∥ℓbt∥Ω◦ ≥ −1 2 , we have z − ln(1 + z) ≤ z 2 . Therefore, we have DR∗ (u, v) ≤ ∥u∥Ω◦ − ∥v∥Ω◦ 1 + ∥v∥Ω◦ 2 + 1 1 + ∥v∥Ω◦ D1 2 ∥·∥2 Ω◦ (u, v). Note that according to Lemma F.3.1, we have 1 1+∥v∥Ω◦ = 1 − ∥x∥Ω. Therefore, using triangle inequality leads to DR∗ (u, v) ≤ (1 − ∥x∥Ω) 2 ∥u − v∥ 2 Ω◦ + (1 − ∥x∥Ω)D1 2 ∥·∥2 Ω◦ (u, v). Finally, using Lemma F.3.2, we have DR∗ (u, v) ≤ (1 − ∥x∥Ω) 2 ∥u − v∥ 2 Ω◦ + (1 − ∥x∥Ω) · 4(α + 1) α ∥u − v∥ 2 Ω◦ ≤ (1 − ∥x∥Ω) 1 + 4(α + 1) α η 2 ∥ℓbt∥ 2 Ω◦ . 337 With the help of Lemma F.3.5 and Lemma F.3.6, we are able to bound Base-Regret. Lemma F.3.7. For an arbitrary interval I started at round j, setting γ = 4dη′ for all j ′ ∈ [T], Algorithm 23 ensures that the base regret of Bj with learning rate η (starting from round j) for any comparator u ∈ Ω ′ is at most E "X t∈I D a (j) t − u, ℓbt E # ≤ log(1/γ) η + 2d 2 p η 1 − β · 1 + 4(α + 1) α X t∈I 1 − ∥a (j) t ∥Ω 1 − Pt i=1 pbt,i∥a (i) t ∥Ω . Proof. Again, according to the standard analysis of OMD (see Lemma F.5.1) we have E "X t∈I D a (j) t − u, ℓbt E # ≤ R(u) − R(a (j) j ) η + 1 η X t∈I E h DR∗ ∇R(a (j) t ) − ηℓbt , ∇R(a (j) t ) i . The first term can still be upper bounded by log(1/γ) η as a (j) j = argminx∈Ω′ R(x) and u ∈ Ω ′ = {x | ∥x∥Ω ≤ 1 − γ}, we have R(u) − R(a (j) j ) ≤ − log(1 − (1 − γ)) − 0 = − log γ. For the second term, we will show that Et h DR∗ ∇R(a (j) t ) − ηℓbt , ∇R(a (j) t ) i ≤ 2d 2 p η 2 1 − β · 1 + 4(α + 1) α X t∈I 1 − ∥a (j) t ∥Ω 1 − Pt i=1 pbt,i∥a (i) t ∥Ω . According to Eq. (F.25) and the choice of η and γ, we have η∥ℓbt∥Ω◦ ≤ 2dη γ = 1 2 . Based on Lemma F.3.6, we only need to show that Et h ∥ℓbt∥ 2 Ω◦ i ≤ 2d 2 p (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥Ω) . 338 In fact, according to the definition of ℓbt , we have Et h ∥ℓbt∥ 2 Ω◦ i ≤ d 2 (1 − β) 2(1 − Pt i=1 pbt,i∥a (i) t ∥Ω) 2 Et h (1 − ξt) 2 ∥xt∥ 2 Ω◦ · |x ⊤ t ℓt | 2 · 1{ρt = 0} i ≤ d 2 (1 − β) 2(1 − Pt i=1 pbt,i∥a (i) t ∥Ω) 2 Et (1 − β) X t j=1 pbt,j (1 − ξ (j) t ) 2 ∥ea (j) t ∥ 2 Ω◦ · |ea (j)⊤ t ℓt | 2 ≤ d 2 (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥Ω) 2 X t j=1 pbt,jEt " (1 − ∥a (j) t ∥Ω) · 1 d X d n=1 ∥en∥ 2 Ω◦ · |ℓt,n| 2 # . Note that Ω ⊆ ℓq(1), we have ℓp(1) ⊆ Ω ◦ , which means that en ∈ Ω ◦ and ∥en∥Ω◦ ≤ 1. Also using the fact that ℓp(1) ⊆ Ω, we have ℓt ∈ Ω ◦ ⊆ ℓq(1) and ∥ℓt∥ 2 2 ≤ d 1− 2 q ∥ℓt∥ 2 q ≤ d 1− 2 q . Therefore, we have Ej h ∥ℓbt∥ 2 Ω◦ i ≤ 2d 2 p Pt j=1 pbt,j (1 − ∥a (j) t ∥Ω) (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥Ω) 2 = 2d 2 p (1 − β)(1 − Pt i=1 pbt,i∥a (i) t ∥Ω) , which finishes the proof. F.3.7 Bounding Meta-Regret In this section, we first prove several useful lemmas and then bound the term Meta-Regret. We prove the following lemma, which is a counterpart of Lemma F.2.4. Lemma F.3.8. For an arbitrary interval I ⊆ [T] started at round j, setting ε λγT ≤ 1 8 , β = 8d 2 p ε ≤ 1 2 and µ = 1 T , Algorithm 23 guarantees that X t∈I ⟨pt − ej , bct⟩ ≤ 2 ln T ε + ε X t∈I X T i=1 pt,ibc 2 t,i + O |I| εT . (F.26) 339 Proof. Define vt+1,i ≜ pt,i exp(−εbct,i) PT t=1 pt,i exp(−εbct,i) for all i ∈ [T]. Then pt+1,i = µ T + (1 − µ)vt+1,i. Note that ⟨pt , bct⟩ + 1 ε ln X T i=1 pt,i exp(−εbct,i) ! 
≤ ⟨pt , bct⟩ + 1 ε ln X T i=1 pt,i(1 − εbct,i + ε 2bc 2 t,i) ! = ⟨pt , bct⟩ + 1 ε ln 1 − ε ⟨pt , bct⟩ + ε 2X T i=1 pt,ibc 2 t,i! ≤ ε X T i=1 pt,ibc 2 t,i. The first inequality is because exp(−x) ≤ 1 − x + x 2 for x ≥ −1 2 and according to the choice of ε, γ and λ, we have ε max i∈[t] |bct,i| ≤ ε max i∈[t] ea (i)⊤ t Mf−1 t xt − bt,i ≤ ε max i∈[t] ea (i)⊤ t Mf−1 t xt + ε max i∈[t] |bt,i| . For the first term, by using Hölder’s inequality, we have ε max i∈[t] ea (i)⊤ t Mf−1 t xt ≤ ε max i∈[t] ∥ea (i) t ∥Ω · ∥Mf−1 t xt∥Ω◦ ≤ ε∥Mf−1 t xt∥p (ea (i) t ∈ Ω and ℓp(1) ⊆ Ω ◦ ) ≤ εd 1 p − 1 2 ∥Mf−1 t xt∥2 ≤ εd 1 2 + 1 p β · ∥xt∥2 (Mft ⪰ β d I) ≤ εd 2 p β . (∥x∥2 ≤ d 1 2 − 1 q ∥x∥q ≤ d 1 p − 1 2 ) In above argument, we use the fact that for vector x ∈ R d and 0 < s < r, we have ∥x∥r ≤ ∥x∥s ≤ d 1 s − 1 r ∥x∥r. Moreover, note that p ∈ (1, 2] and ℓp(1) ⊆ Ω ⊆ ℓq(1). 340 For the second term, according to the definition of bt,i, |bt,i| ≤ 1 λT(1−β)γ ≤ 2 λT γ . Therefore, combining the above two bounds shows that ε maxi∈[t] |bct,i| ≤ 1 8 + 1 4 ≤ 1 2 according to the choice of ε, γ, and λ. Furthermore, by the definition of vt+1,i, we have PT j=1 pt,j exp(−εbct,j ) = pt,i exp(−εbct,i)/vt+1,i. Therefore, we have 1 ε ln X T j=1 pt,j exp(−εbct,j ) = − 1 ε ln vt+1,i pt,i − bct,i. Combining the two equations and taking summation over t ∈ I, we have for any ej ∈ ∆T , j ∈ [T], X t∈I ⟨pt , bct⟩ −X t∈I ⟨ej , bct⟩ ≤ ε X t∈I X T i=1 pt,ibc 2 t,i + 1 ε X t∈I ln vt+1,j pt,j . The second term can be dealt with according to Eq. (F.11) and we then have X t∈I ⟨pt − ej , bct⟩ ≤ 2 ln T ε + ε X t∈I X T i=1 pt,ibc 2 t,i + O |I| εT , which finishes the proof. Next, we prove the following lemma, which bounds the second term shown in Eq. (F.26) Lemma F.3.9. For any t ∈ [T], setting λ 2γ = Θ d − 1 q q 1 ST3 , Algorithm 23 guarantees that X T i=1 pt,ibc 2 t,i ≤ X i∈[t] pbt,ibc 2 t,i ≤ 2 X i∈[t] pbt,ic 2 t,i + O d 1 q r S T ! , (F.27) where ct,i = ⟨ea (i) t , ¯ℓt⟩. Proof. According to the definition of pbt and bct , we have X T i=1 pt,ibc 2 t,i = X i∈[t] pt,ibc 2 t,i + X i>t pt,i X t j=1 pbt,jbct,j 2 ≤ X i∈[t] pt,ibc 2 t,i + X i /∈[t] pt,i X j∈[t] pbt,jbc 2 t,j 341 = X i∈[t] pt,i X i∈[t] pbt,ibc 2 t,i + X i /∈[t] pt,i X i∈[t] pbt,ibc 2 t,i = X i∈[t] pbt,ibc 2 t,i, where the inequality is because of Cauchy-Schwarz inequality. Besides, recall that ct,i = D ea (i) t , ¯ℓt E and bc 2 t,i = (ct,i − bt,i) 2 ≤ 2c 2 t,i + 2b 2 t,i. According to the definition of bt,i, we know that X i∈[t] pbt,ib 2 t,i ≤ 4 (λT) 2 1 γ X i∈[t] pbt,i 1 − ∥a (i) t ∥Ω 1 − P i∈[t] pbt,i∥a (i) t ∥Ω = 4 (λT) 2 1 γ = O d 1 q r S T ! , where the last step holds because we choose λ 2γ = Θ d − 1 q q 1 ST3 . Combining Lemma F.3.8 and Lemma F.3.9, we obtain the following lemma showing the upper bound for Meta-Regret, which is exactly the same as Lemma F.2.6 except for the choice of parameters. Lemma F.3.10. Define C = q α 10α+8 . Set ε = min d − 1 p q S T , 1 16d 2 p , C2 2 , β = 8d 2 p ε, λ = Cd− 1 q √ ST , γ = 4Cd 1 q q S T and µ = 1 T . Algorithm 23 guarantees that E [Meta-Regret] ≤ Oe d 1 p √ ST . Proof. First, it is direct to check that the choice of λ, γ and ε satisfies the condition required in Lemma F.3.8 and Lemma F.3.9. Based on the two lemmas, for each interval Ik, let jk be the start time stamp for Ik. As β = 8d 2 p ε ≤ 1 2 , we follow the derivation of Eq. (F.13) and obtain that E X t∈Ik ⟨pt − ejk , bct⟩ ≤ 2 ln T ε + 4εd|Ik| + O ε|Ik|d 1 q r S T ! + O |Ik| εT . 
Summing the regret over all the intervals achieves the bound for Meta-Regret: E [Meta-Regret] = E X S k=1 X t∈Ik ⟨pbt − ejk , bct⟩ 342 ≤ 2S ln T ε + 4εdT + O d 1 q √ ST + O(1/ε) ≤ Oe d 1 p √ ST , where the last inequality is because we choose ε = min d − 1 p q S T , C2 2 , 1 16d 2 p . F.3.8 Proof of Theorem F.3.1 Putting everything together, we are now ready to prove our main result (Theorem F.3.1) in the setting when the feasible domain is α-strongly convex. Proof. First, it is evident to check that the parameter choice satisfies the condition required in Lemma F.3.7 and Lemma F.3.10. Therefore, based on the regret decomposition in Eq. (F.22), upper bound of bias term in Eq. (F.23), upper bound of positive term Eq. (F.24), base regret upper bound in Lemma F.3.7 and meta regret upper bound in Lemma F.3.10, we have E[RegS] = E X S k=1 X t∈Ik ⟨xt − ˚uk, ℓt⟩ ≤ 2 λ + X S k=1 log(1/γ) ηjk + 2d 2 p η (1 − β) · 5α + 4 α − 1 λT(1 − β) ! X t∈Ik 1 − ∥a (jk) t ∥Ω 1 − Pt i=1 pbt,i∥a (i) t ∥Ω + (β + γ)T + Oe d 1 p √ ST . Importantly, note that the coefficient of the third term is actually zero. Indeed, due to the parameter configurations that γ = 4Cd 1 q q S T , η = Cd− 1 p q S T , λ = Cd− 1 q √ ST , β = 8d 2 p ε, ε = min 1 16d 2 p , C2 2 , d− 1 p q S T and C = q α 10α+8 , we can verify that 2dη(5α + 4) α − 1 λT = 0. Then we can achieve E[RegS] ≤ Oe d 1 p √ ST and complete the proof. 343 F.4 Omitted Details for Section 3.2.4 In this section, we consider the switching regret of unconstrained linear bandits. F.4.1 Proof of Lemma 3.2.1 Proof. Our switching regret decomposition for linear bandits is inspired by the existing black-box reduction designed for the full information online convex optimization [49] and static regret of linear bandits [78]. Indeed, the switching regret can be decomposed in the following way. Reg(u1, . . . , uT ) = X T t=1 ℓ ⊤ t xt − X T t=1 ℓ ⊤ t ut = X S k=1 X t∈Ik ℓ ⊤ t xt − X S k=1 X t∈Ik ℓ ⊤ t ˚uk = X S k=1 X t∈Ik ℓ ⊤ t (zt · vt − ˚uk) (xt = zt · vt ) = X S k=1 X t∈Ik ⟨zt , ℓt⟩(vt − ∥˚uk∥2) + ∥˚uk∥2 X t∈Ik zt − ˚uk ∥˚uk∥2 , ℓt = X S k=1 RegV Ik (∥˚uk∥2) +X S k=1 ∥˚uk∥2 · RegZ Ik ˚uk ∥˚uk∥2 , which finishes the proof. F.4.2 Algorithm for Unconstrained OCO with Switching Regret In this section, we present the details of our proposed algorithm for unconstrained OCO with switching regret. Under the unconstrained setup, the diameter of the feasible domain is D = ∞. However, as observed in Appendix D.5 in [42], we can simply assume maxk∈[S]∥˚uk∥2 ≤ 2 T . Otherwise, we will have T ≤ log2 (maxk∈[S]∥˚uk∥2), and by constraining the learning algorithm such that ∥vt∥2 ≤ 2 T , we can 344 obtain the following trivial upper bound for switching regret: Reg ≤ PT t=1∥∇ft(vt)∥2∥vt − ut∥2 ≤ T(2T + maxk∈[S]∥˚uk∥2) = Oe(maxk∈[S]∥˚uk∥2), which is already adaptive to the comparators. Therefore, we can simply focus on the constrained online learning with a maximum diameter D = 2T . In addition, as mentioned earlier, we do not assume the knowledge of the number of switch S in advance in this part. To this end, we propose a two-layer approach to simultaneously adapt to the unknown scales of the comparators and the unknown number of switch, which consists of a meta algorithm learning over a set of base learners. Below we specify the details. Base algorithm. The base algorithm tackles OCO problem with a given scale of feasible domain. 
The only requirement is as follows: given a constrained domain Ω ⊆ R d with diameter D = supx∈Ω∥x∥2, base algorithm running over Ω ensures an Oe(D p |I|) static regret over any interval I ⊆ [T]. Formally, we assume the base algorithm to satisfy the following requirement. Requirement 1. Consider the online convex optimization problem consisting a convex feasible domain Ω ⊆ R d and a sequence of convex loss functions f1, . . . , fT , where ft : Ω 7→ R and we assume 0 ∈ Ω and ∥∇ft(v)∥2 ≤ 1 for any v ∈ Ω and t ∈ [T]. An online algorithm A running over this problem returns the decision sequence v1, . . . , vT ∈ Ω. We require the algorithm A to ensure the following regret guarantee X t∈I ft(vt) − min u∈Ω X t∈I ft(u) ≤ Oe D p |I| (F.28) for any interval I ⊆ [T], where D = supx∈Ω∥x∥2 is the diameter of the feasible domain. This requirement can be satisfied by recent OCO algorithms with interval regret (or called strongly adaptive regret) guarantee, such as Algorithm 1 of Daniely, Gonen, and Shalev-Shwartz [51], Algorithm 2 of Jun et al. [86], Theorem 6 of Cutkosky [46]. We denote by B any suitable base algorithm. 345 Since both the scale of comparators and the number of switch are unknown in advance, we maintain a set of base algorithm instances, defined as S = n Bi,r, ∀(i, r) ∈ [H] × [R] Bi,r ← B(Ωi), with Ωi = {x | ∥x∥2 ≤ Di = T −1 · 2 i−1 } o . (F.29) In above, H = ⌈log2 T⌉ + T + 1 and the index i ∈ [H] maintain a grid to deal with uncertainty of unknown comparators’ scale; R = ⌈log2 T⌉ and the index r ∈ [R] maintains a grid to handle uncertainty of unknown number of switch S. There are in total N = H · R base learners. For i ∈ [H] and r ∈ [R], the base learner Bi,r is an instantiation of the base algorithm whose feasible domain is Ωi ⊆ R d with diameter Di , and vt,(i,r) denotes her returned decision at round t. We stress that even if S is known, the two-layer structure remains necessary due to the unknown comparators’ scale. Meta algorithm. Then, a meta algorithm is used to combine all those base learners, and more importantly, the regret of meta algorithm should be adaptive to the individual loss scale of each base learner, such that the overall algorithm can achieve a comparator-adaptive switching regret. We achieve so by building upon the recent progress in the classic expert problem [42]. Our proposed algorithm is OMD with a multi-scale entropy regularizer and certain important correction terms. Specifically, let the weight vector produced by the meta algorithm be wt ∈ ∆N , then the overall decision is vt = PH i=1 PR r=1 wt,(i,r)vt,(i,r) , and the weight is updated by wt+1 = argmin w∈Ω ⟨w, ℓt + at⟩ + Dψ(w, wt), (F.30) where Ω = {w | w ∈ ∆N and wt,(i,r) ≥ 1 T2·2 2i , ∀i ∈ [H], r ∈ [R]} is the clipped domain. Besides, the meta loss ℓt , the correction term at , and a certain regularizer ψ are set as follows: 346 • The regularizer ψ : ∆N 7→ R is set as a weighted negative-entropy regularizer defined as ψ(w) ≜ X (i,r)∈[H]×[R] ci ηr w(i,r) ln w(i,r) with ci = T −1 · 2 i−1 and ηr = 1 32 · 2 r . (F.31) • The feedback loss of meta algorithm ℓt ∈ R N is set as such to measure the quality of each base learner: ℓt,(i,r) ≜ ⟨∇ft(vt), vt,(i,r) ⟩ for any (i, r) ∈ [H] × [R]. • The correction term at ∈ R N is set as: at,(i,r) ≜ 32 ηr ci ℓ 2 t,(i,r) for any (i, r) ∈ [H] × [R], which is essential to ensure the meta regret compatible to the final comparator-adaptive bound. The entire algorithm consists of meta algorithm specified above and base algorithm satisfying Requirement 1. 
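To make the two-layer construction concrete, the following is a schematic sketch of the grid in Eq. (F.29) together with one meta step based on Eq. (F.30) and Eq. (F.31). It is a sketch only, under simplifying assumptions: the base learners are treated as black boxes supplying their decisions v_{t,(i,r)}, and the Bregman projection onto the clipped domain Ω in Eq. (F.30) is replaced by plain renormalization over the simplex; all function and parameter names are ours and purely illustrative.

import numpy as np

def build_grid(T, log2_max_scale=None):
    # Scales c_i = 2^{i-1}/T and learning rates eta_r = 1/(32 * 2^r), as in
    # Eq. (F.29) and Eq. (F.31). The theoretical choice uses log2_max_scale = T
    # (so H = ceil(log2 T) + T + 1); for a bounded domain, Remark 4 replaces T
    # by ceil(log2 Dmax), which also avoids numerical overflow in this sketch.
    if log2_max_scale is None:
        log2_max_scale = T
    H = int(np.ceil(np.log2(T))) + log2_max_scale + 1
    R = int(np.ceil(np.log2(T)))
    c = 2.0 ** np.arange(H) / T                        # c_i for i = 1, ..., H
    eta = 1.0 / (32.0 * 2.0 ** np.arange(1, R + 1))    # eta_r for r = 1, ..., R
    return c, eta

def combined_decision(w, base_decisions):
    # v_t = sum_{i,r} w_{t,(i,r)} v_{t,(i,r)}; w has shape (H, R),
    # base_decisions has shape (H, R, d).
    return np.tensordot(w, base_decisions, axes=([0, 1], [0, 1]))

def meta_step(w, grad, base_decisions, c, eta):
    # grad is the gradient of f_t at the combined decision v_t.
    ell = base_decisions @ grad                  # ell_{t,(i,r)} = <grad, v_{t,(i,r)}>
    rate = eta[None, :] / c[:, None]             # coordinate-wise rate eta_r / c_i
    corr = 32.0 * rate * ell ** 2                # correction term a_{t,(i,r)}
    w_new = w * np.exp(-rate * (ell + corr))     # mirror step for the weighted entropy
    return w_new / w_new.sum()                   # renormalize (clipping to Omega omitted)

A faithful implementation would in addition keep each weight above 1/(T² · 2^{2i}) via the Bregman projection in Eq. (F.30), and would initialize w_{1,(i,r)} ∝ η_r² c_i², the prior used in the proof of Theorem 3.2.2 below.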
We show the pseudocode in Algorithm 12. F.4.3 Proof of Theorem 3.2.2 Proof. Consider the k-th interval Ik. The regret within this interval can be decomposed as follows. X t∈Ik ft(vt) − ft(˚uk) = X t∈Ik ft(vt) − ft(vt,j ) + X t∈Ik ft(vt,j ) − ft(˚uk) ≤ X t∈Ik ⟨∇ft(vt), vt − vt,j ⟩ + X t∈Ik ft(vt,j ) − ft(˚uk) = X t∈Ik ⟨wt − ej , ℓt⟩ | {z } meta-regret + X t∈Ik ft(vt,j ) − ft(˚uk) | {z } base-regret , (F.32) where the final equality is because ℓt,j = ⟨∇ft(vt), vt,j ⟩ and vt = P j ′∈[H]×[R] wt,j′vt,j′. Note that the decomposition holds for any index j = (i, r) ∈ [H] × [R]. 347 We first consider the case when ∥˚uk∥2 ≥ 1 T and will deal with the other case (when ∥˚uk∥2 < 1 T ) at the end of the proof. Under such a circumstance, we can choose (i, r) = (i ∗ k , r∗ k ) such that ci ∗ k = T −1 · 2 i ∗ k−1 ≤ ∥˚uk∥2 ≤ T −1 · 2 i ∗ k = ci ∗ k+1, and ηr ∗ k = 1 32 · 2 r ∗ k ≤ 1 32p |Ik| ≤ 1 32 · 2 r ∗ k−1 = ηr ∗ k−1, (F.33) which is valid as i ∈ [H] = [⌈log2 T⌉ + T + 1] and r ∈ [R] = [⌈log2 T⌉]. We now give the upper bounds for meta-regretand base-regret respectively. base-regret. Based on the assumption of base algorithm, we have base learner Bj ∗ k satisfying X t∈Ik ft(vt,j∗ k ) − ft(˚uk) ≤ Oe ci ∗ k p |Ik| ≤ Oe ∥˚uk∥2 p |Ik| , (F.34) where we use the interval regret guarantee of base algorithm (see Requirement 1) and also use the fact that the diameter of the feasible domain for base learner Bj ∗ k is 2 i ∗ k as Ωi ∗ k = {x | ∥x∥2 ≤ Di ∗ k } and Di ∗ k = ci ∗ k . The last inequality holds by the choice of i ∗ k shown in Eq. (F.33). meta-regret. The meta algorithm is essentially online mirror descent with a weighted entropy regularizer. Based on Lemma 1 in [42], if for all i ∈ [H] and r ∈ [R], 32 ηr ci |ℓt,(i,r) | ≤ 1, then we have for any q ∈ Ω, X t∈Ik ⟨wt − q, ℓt⟩ ≤ X t∈Ik Dψ(q, wt) − Dψ(q, wt+1) + 32 X t∈Ik X i∈[H] X r∈[R] ηr ci q(i,r)ℓ 2 t,(i,r) . (F.35) Note that this is a simplified version of Lemma 1 in [42] for the interval regret, which employs a fixed learning rate for each action and does not include the optimism in the algorithm. We present the simplified lemma in Lemma F.5.2 in Appendix F.5 for completeness. 348 To this end, we first verify the condition of 32 ηr ci |ℓt,(i,r) | ≤ 1 for all i ∈ [H], r ∈ [R]. In fact, 32ηr ci ℓt,(i,r) ≤ 1 ci · 2 r ∥∇ft(vt)∥2 · ∥vt,i∥2 ≤ 1 2 r ≤ 1, where the first inequality is by the definition of ηr = 1 32·2 r and the construction of meta loss ℓt,(i,r) = ⟨∇ft(vt), vt,(i,r) ⟩, the second inequality is because ∥vt,i∥2 ≤ ci and ∥∇ft(v)∥2 ≤ 1 for all v ∈ R d , and the third inequality holds as r ≥ 1. Then we define e¯j ∗ k ≜ e¯(i ∗ k ,r∗ k ) = 1 − R · a0 T2 e(i ∗ k ,r∗ k ) + X (i,r)∈[H]×[R] 1 T2 · 2 2i e(i,r) , where a0 = PH i=1 1 2 2i = 1 3 (1− 1 4H ) is a constant which guarantees e¯j ∗ k ∈ Ω. Using Eq. (F.35) with q = ¯ej ∗ k , we have X t∈Ik D wt − e¯j ∗ k , ℓt E ≤ X t∈Ik Dψ(¯ej ∗ k , wt) − Dψ(¯ej ∗ k , wt+1) + 32 X t∈Ik X i∈[H] X r∈[R] ηr ci e¯j ∗ k ,(i,r)ℓ 2 t,(i,r) = Dψ(¯ej ∗ k , wsk ) − Dψ(¯ej ∗ k , wsk+1 ) + 32 X t∈Ik X i∈[H] X r∈[R] ηr ci e¯j ∗ k ,(i,r)ℓ 2 t,(i,r) , where sk denotes the starting index of the interval Ik and sk+1 is defined as T + 1 if Ik is the last interval. The two terms on the right-hand side are called bias term and stability term respectively. In the following, we will give their upper bound individually. For the bias term, we have Dψ(¯ej ∗ k , wsk ) − Dψ(¯ej ∗ k , wsk+1 ) = X i∈[H] X r∈[R] ci ηr e¯j ∗ k ,(i,r) ln wsk,(i,r) wsk+1,(i,r) + wsk,(i,r) − wsk+1,(i,r) ! (by definition in Eq. 
(F.31)) 349 = X i∈[H] X r∈[R] ci ηr e¯j ∗ k ,(i,r) ln wsk,(i,r) wsk+1,(i,r) ! + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) ≤ ci ∗ k ηr ∗ k ln T 2 · 2 2i ∗ k + X (i,r)̸=(i ∗ k ,r∗ k ) 1 T2 · 2 2i ci ηr ln T 2 · 2 2i + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) (wsk , wsk+1 ∈ Ω) ≤ ci ∗ k ηr ∗ k ln 4T 4 · c 2 i ∗ k + X (i,r)̸=(i ∗ k ,r∗ k ) 2 ln T + (4 ln 2) · T 32 · T3 · 2 i+r+1 + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) = Oe ci ∗ k ηr ∗ k ln ci ∗ k ! + Oe 1 T2 + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) . (F.36) Moreover, for the stability term, we have 32 X t∈Ik X i∈[H] X r∈[R] ηr ci e¯j ∗ k ,(i,r)ℓ 2 t,(i,r) = 32 X t∈Ik ηr ∗ k ci ∗ k 1 − R · a0 T2 + 1 T2 · 2 2i ∗ k ℓ 2 t,(i ∗ k ,r∗ k ) + 32 X t∈Ik X (i,r)̸=(i ∗ k ,r∗ k ) ηr ci e¯j ∗ k ,(i,r)ℓ 2 t,(i,r) ≤ 32 X t∈Ik ηr ∗ k ci ∗ k ℓ 2 t,(i ∗ k ,r∗ k ) + 32 X t∈Ik X i∈[H] X r∈[R] ηrci T2 · 2 2i ≤ O ηr ∗ k ci ∗ k |Ik| + X t∈Ik X i∈[H] X r∈[R] 1 T3 · 2 i+r+1 = O ηr ∗ k ci ∗ k |Ik| + O 1 T2 (F.37) where the two inequalities hold as ℓ 2 t,(i,r) = ⟨∇ft(vt), vt,(i,r) ⟩ 2 ≤ ∥∇ft(vt)∥ 2 2 ∥vt,(i,r)∥ 2 2 ≤ c 2 i . Combining the upper bounds of bias term in Eq. (F.36) and stability term in Eq. (F.37), we get X t∈Ik D wt − e¯j ∗ k , ℓt E ≤ Oe ηr ∗ k ci ∗ k |Ik| + ci ∗ k ηr ∗ k ln ci ∗ k ! + Oe 1 T2 + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) . 35 Further, notice that X t∈Ik D e¯j ∗ k − ej ∗ k , ℓt E ≤ X t∈Ik X i∈[H] X r∈[R] 1 T2 · 2 2i · ℓt,(i,r) ≤ X t∈Ik X i∈[H] X r∈[R] 1 T3 · 2 i+1 ≤ Oe 1 T2 , (F.38) and we thus obtain the overall meta regret upper bound in the interval Ik: X t∈Ik D wt − ej ∗ k , ℓt E = X t∈Ik D wt − e¯j ∗ k , ℓt E + X t∈Ik D e¯j ∗ k − ej ∗ k , ℓt E ≤ Oe ηr ∗ k ci ∗ k |Ik| + ci ∗ k ηr ∗ k ln ci ∗ k ! + Oe 1 T2 + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) = Oe ∥˚uk∥2 p |Ik| + Oe 1 T2 + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) , (F.39) where the last inequality is because of the choice of i ∗ k and r ∗ k defined in Eq. (F.33). The Oe(·)-notation omits logarithmic dependence on T and comparator norm ∥˚uk∥2. Overall Regret. The overall regret is obtained by combining the base regret and meta regret and further summing over all the intervals I1, . . . , IS. Indeed, we have the following total meta-regret by taking summation over intervals on Eq. (F.39), X S k=1 X t∈Ik D wt − ej ∗ k , ℓt E ≤ Oe X S k=1 ∥˚uk∥2 p |Ik| ! + Oe S T2 + X S k=1 X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) ≤ Oe X S k=1 ∥˚uk∥2 p |Ik| ! + X i∈[H] X r∈[R] ci ηr w1,(i,r) = Oe X S k=1 ∥˚uk∥2 p |Ik| ! , (F.40) 351 where the final equality is because we choose w1,(i,r) ∝ η 2 r c 2 i for all (i, r) ∈ [H]×[R]. Indeed, such a setting of prior distribution ensures that X i∈[H] X r∈[R] ci ηr · w1,(i,r) = P i∈[H] P r∈[R] ηr ci P i∈[H] P r∈[R] η 2 r c 2 i = 16 T · P i∈[H] P r∈[R] 1 2 i+r P i∈[H] P r∈[R] 1 2 2i+2r = 144 T · 1 − 1 2 R 1 − 1 2 H 1 − 1 4 R 1 − 1 4 H = 144 T · 1 1 + 1 2 R 1 + 1 2 H ≤ O 1 T , and also guarantees that w1 ∈ Ω since for any (i, r) ∈ [H] × [R], w1,(i,r) = η 2 r c 2 i P i ′∈[H] P r ′∈[R] η 2 r′ c 2 i ′ = 1 2 2i+2r P i ′∈[H] P r ′∈[R] 1 2 2i ′+2r′ ≥ 1 T2 · 2 2i · 1 1 9 1 − 1 4 R 1 − 1 4 H ≥ 1 T2 · 2 2i , where the first inequality holds in that we have 2 r ≤ T for r ∈ [R]. Substituting the meta regret upper bound Eq. (F.40) and the base regret upper bound Eq. (F.34) into the regret decomposition Eq. (F.32) obtains that X S k=1 X t∈Ik ft(vt) − ft(˚uk) ≤ Oe X S k=1 ∥˚uk∥2 p |Ik| ! ≤ Oe max k∈[S] ∥˚uk∥2 · √ ST , (F.41) which finishes the proof for the case when ∥˚uk∥2 ≥ 1 T holds for every k ∈ [S]. 
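For clarity, the last inequality in Eq. (F.41) is an application of the Cauchy–Schwarz inequality to the interval lengths:
\[
\sum_{k=1}^{S} \|\mathring{u}_k\|_2 \sqrt{|I_k|}
\;\le\; \max_{k\in[S]} \|\mathring{u}_k\|_2 \sum_{k=1}^{S} \sqrt{|I_k|}
\;\le\; \max_{k\in[S]} \|\mathring{u}_k\|_2 \sqrt{S \sum_{k=1}^{S} |I_k|}
\;=\; \max_{k\in[S]} \|\mathring{u}_k\|_2 \,\sqrt{ST}.
\]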
We now consider the case when the condition is violated. Suppose for some k ∈ [S], it holds that ∥˚uk∥2 < 1 T . Then, we pick any ˚u ′ k ∈ R d such that ∥˚u ′ k ∥2 = 1 T , and obtain that X t∈Ik ft(vt) − ft(˚uk) = X t∈Ik ft(vt) − ft(˚u ′ k ) + X t∈Ik ft(˚u ′ k ) − ft(˚uk) = X t∈Ik ft(vt) − ft(˚u ′ k ) + X t∈Ik ∥∇ft(˚u ′ k )∥2∥˚u ′ k − ˚u ≤ X t∈Ik ft(vt) − ft(˚u ′ k ) + O |Ik| T . Clearly, the last additional term will not be the issue even after summation over S intervals. Moreover, notice that now the comparator ˚u ′ k satisfies the condition of ∥˚u ′ k ∥2 ≥ 1 T , we can still use the earlier results including the base regret bound in Eq. (F.34) and meta regret bound in Eq. (F.39). Thus, we can guarantee the same regret bound as Eq. (F.41) under this scenario. Hence, we finish the proof for the overall theorem. We finally remark that our algorithm for unconstrained OCO actually does not require the knowledge of S ahead of time. F.4.4 Data-dependent Switching Regret of Unconstrained Online Convex Optimization In this subsection, we further consider achieving data-dependent switching regret bound for unconstrained online convex optimization. In Appendix F.4.2, we require the base algorithm to achieve an Oe(D p |I|) interval regret for any interval I ⊆ [T], where D is the diameter of the feasible domain. See Requirement 1 for more details. To achieve a data-dependent switching regret for unconstrained OCO, we require a stronger regret for the base algorithm. Requirement 2. Consider the online convex optimization problem consisting a convex feasible domain Ω ⊆ R d and a sequence of convex loss functions f1, . . . , fT , where ft : Ω 7→ R and we assume 0 ∈ Ω and ∥∇ft(v)∥2 ≤ 1 for any v ∈ Ω and t ∈ [T]. An online algorithm A running over this problem returns the decision sequence v1, . . . , vT ∈ Ω. We require the algorithm A to ensure the following regret guarantee X t∈I ft(vt) − min u∈Ω X t∈I ft(u) ≤ Oe D sX t∈I ∥∇ft(vt)∥ 2 2 (F.42) for any interval I ⊆ [T], where D = supx∈Ω∥x∥2 is the diameter of the feasible domain. 353 This requirement can be satisfied by recent OCO algorithm with data-dependent interval regret guarantee, such as Algorithm 2 of [146] and the algorithm specified by Theorem 6 of [46]. Using the new base algorithm and the same meta algorithm as Appendix F.4.2, the overall algorithm can ensure a data-dependent comparator-adaptive switching regret. Theorem F.4.1. Algorithm 12 with a base algorithm satisfying Requirement 2 guarantees that for any S, any partition I1, . . . , IS of [T], and any comparator sequence ˚u1, . . . ,˚uS ∈ R d , we have X S k=1 X t∈Ik ft(vt) − X t∈Ik ft(˚uk) ≤ Oe X S k=1 ∥˚uk∥2 sX t∈Ik ∥∇ft(vt)∥ 2 2 ≤ Oe max k∈[S] ∥˚uk∥2 · vuutS X T t=1 ∥∇ft(vt)∥ 2 2 . (F.43) Notably, the algorithm does not require the prior knowledge of the number of switch S as the input. Proof. The argument follows the proof of Appendix F.4.3. Similar to Eq. (F.32), the regret within the interval can be decomposed into meta-regret and base-regret: X t∈Ik ft(vt) − ft(˚uk) ≤ X t∈Ik ⟨wt − ej , ℓt⟩ | {z } meta-regret + X t∈Ik ft(vt,j ) − ft(˚uk) | {z } base-regret , (F.44) which holds for any index j = (i, r) ∈ [H] × [R]. We first the case when ∥˚uk∥2 ≥ 1 T and will deal with the other case (when ∥˚uk∥2 < 1 T ) at the end of the proof. 
Under such a circumstance, we can choose (i, r) = (i ∗ k , r∗ k ) such that ci ∗ k = T −1 · 2 i ∗ k−1 ≤ ∥˚uk∥2 ≤ T −1 · 2 i ∗ k = ci ∗ k+1, and ηr ∗ k = 1 32 · 2 r ∗ k ≤ 1 32qP t∈Ik ∥∇ft(vt)∥ 2 2 ≤ 1 32 · 2 r ∗ k−1 = ηr ∗ k−1, (F.45) 354 which is valid as i ∈ [H] = [⌈log2 T⌉ + T + 1] and r ∈ [R] = [⌈log2 T⌉]. We now give the upper bounds for meta-regretand base-regret respectively. base-regret. Based on the assumption of base algorithm, we have base learner Bj ∗ k satisfies X t∈Ik ft(vt,j∗ k ) − ft(˚uk) ≤ Oe 2 i ∗ k sX t∈Ik ∥∇ft(vt)∥ 2 2 ≤ Oe ∥˚uk∥2 sX t∈Ik ∥∇ft(vt)∥ 2 2 , (F.46) where we use the interval regret guarantee of base algorithm (see Requirement 2) and also use the fact that the diameter of the feasible domain for base learner Bj ∗ k is 2 i ∗ k as Ωi ∗ k = {x | ∥x∥2 ≤ Di ∗ k } and Di ∗ k = ci ∗ k . The last inequality holds by the choice of i ∗ k shown in Eq. (F.45). meta-regret. Note that the meta algorithm remains the same, so we will only improve the analysis to show that the meta algorithm can also enjoy a data-dependent guarantee. The bias term will not be affected, which is the same as the data-independent one presented in Eq. (F.36), and the main modification will be conducted on the stability term. Indeed, continuing the analysis of the stability term exhibited in Eq. (F.37), we have 32 X t∈Ik X i∈[H] X r∈[R] ηr ci e¯j ∗ k ,(i,r)ℓ 2 t,(i,r) ≤ 32 X t∈Ik ηr ∗ k ci ∗ k ℓ 2 t,(i ∗ k ,r∗ k ) + O 1 T2 ≤ O ηr ∗ k ci ∗ k X t∈Ik ∥∇ft(vt)∥ 2 2 + O 1 T2 (F.47) 355 where the last inequality holds as ℓ 2 t,(i,r) = ⟨∇ft(vt), vt,(i,r) ⟩ 2 ≤ ∥∇ft(vt)∥ 2 2 ∥vt,(i,r)∥ 2 2 ≤ c 2 i ∥∇ft(vt)∥ 2 2 . Then, combining the upper bounds of bias term Eq. (F.36), above stability term Eq. (F.47), and additional term Eq. (F.38) leads to the following result: X t∈Ik D wt − ej ∗ k , ℓt E = X t∈Ik D wt − e¯j ∗ k , ℓt E + X t∈Ik D e¯j ∗ k − ej ∗ k , ℓt E ≤ Oe ηr ∗ k ci ∗ k X t∈Ik ∥∇ft(vt)∥ 2 2 + ci ∗ k ηr ∗ k ln ci ∗ k + Oe 1 T2 + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) = Oe ∥˚uk∥2 sX t∈Ik ∥∇ft(vt)∥ 2 2 + Oe 1 T2 + X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) , (F.48) where the last inequality is because of the choice of i ∗ k and r ∗ k defined in Eq. (F.45). Summing over all the intervals I1, . . . , IS achieves a data-dependent upper bound for the meta-regret: X S k=1 X t∈Ik D wt − ej ∗ k , ℓt E ≤ Oe X S k=1 ∥˚uk∥2 sX t∈Ik ∥∇ft(vt)∥ 2 2 + X S k=1 X i∈[H] X r∈[R] ci ηr wsk,(i,r) − wsk+1,(i,r) ≤ Oe X S k=1 ∥˚uk∥2 sX t∈Ik ∥∇ft(vt)∥ 2 2 + X i∈[H] X r∈[R] ci ηr w1,(i,r) = Oe X S k=1 ∥˚uk∥2 sX t∈Ik ∥∇ft(vt)∥ 2 2 ≤ Oe max k∈[S] ∥˚uk∥2 · vuutS X T t=1 ∥∇ft(vt)∥ 2 2 . (F.49) The last equality holds by the same argument for Eq. (F.40) and the final inequality is by Cauchy-Schwarz inequality. Combining the meta-regret and base-regret upper bounds finishes the proof for the case when ∥˚uk∥2 ≥ 1 T holds for every k ∈ [S]. 356 In addition, when the above condition of the comparators’ norm is violated, we can deal with the scenario by the same argument at the end of Appendix F.4.3 and attain the same regret guarantee. Hence, we finish the proof of the overall theorem. Remark 4. Note that Theorem F.4.1 is for the unconstrained OCO setting, while from the proof we can see that actually the result holds even if the algorithm is required to make decisions from a bounded domain. Indeed, in the unconstrained setting, we only need to focus on a bounded domain with maximum diameter 2 T as observed in Appendix D.5 of [42]. 
As a result, when working under constrained OCO with a diameter Dmax > 0, we can still use our algorithm by simply maintaining the set of base algorithm instances as S ′ = n Bi,r, ∀(i, r) ∈ [H′ ] × [R] Bi,r ← B(Ωi), with Ωi = {x | ∥x∥2 ≤ Di = T −1 · 2 i−1 } o . where H′ = ⌈log2 T⌉ + ⌈log2 Dmax⌉ + 1 and R = ⌈log2 T⌉ now. Thus, our result strictly improves the Oe Dmaxq S PT t=1∥∇ft(vt)∥ 2 2 result of prior works [46, 150, 149] for the constrained OCO setting. F.4.5 Proof of Theorem 3.2.3 Proof. From Lemma 3.2.1, we have Reg(u1, . . . , uT ) = X S k=1 RegV Ik (∥˚uk∥2) +X S k=1 ∥˚uk∥2 · RegZ Ik ˚uk ∥˚uk∥2 . (F.50) In the following, we bound the two terms respectively. The first term on the right-hand side of Eq. (F.50) is the switching regret of the OCO algorithm AV, we have X S k=1 RegV Ik (∥˚uk∥2) = X S k=1 X t∈Ik ft(vt) − ft(∥˚uk∥2) ≤ Oe X S k=1 ∥˚uk∥2 p |Ik| ! , 357 where the first equality is due to the definition of online function ft(v) = v · ⟨ℓt , zt⟩ and the second inequality holds by the regret guarantee of AV proven in Theorem 3.2.2. The second term on the right-hand side of Eq. (F.50) requires the switching regret analysis of the online algorithm for constrained linear bandits AZ. Indeed, since the comparator satisfies that ∥ ˚uk ∥˚uk∥2 ∥2 = 1, the subroutine AZ can be chosen as the proposed algorithm for linear bandits with ℓp-ball feasible domain (with p = 2), see Algorithm 10. We thus get the following regret bound according to Theorem F.2.1: E RegZ Ik ˚uk ∥˚uk∥2 ≤ Oe r dT S + r Sd T |Ik| ! . Substituting the above two upper bounds in Eq. (F.50) gives that E [Reg(u1, . . . , uT )] = X S k=1 E RegV Ik (∥˚uk∥2) + X S k=1 E ∥˚uk∥2 · RegZ Ik ˚uk ∥˚uk∥2 ≤ Oe X S k=1 ∥˚uk∥2 p |Ik| ! + Oe X S k=1 ∥˚uk∥2 r dT S + r Sd T |Ik| ! ≤ Oe X S k=1 ∥˚uk∥2 r dT S + r Sd T |Ik| ! ≤ Oe max k∈[S] ∥˚uk∥2 · √ dST where the second inequality is because p |Ik| ≤ q dT S + q Sd T |Ik|. Hence, we finish the proof. F.5 Lemmas Related to Online Mirror Descent This section collects several useful lemmas related to online mirror descent (OMD). We first introduce a general regret guarantee for OMD due to Bubeck, Cesa-Bianchi, et al. [31]. Lemma F.5.1 (Theorem 5.5 of Bubeck, Cesa-Bianchi, et al. [31]). Let D ⊂ R d be an open convex set and let D be the closure of D. Let Ω be a compact and convex set and let F be a Legendre function defined on D ⊃ Ω 358 such that ∇F(x)−ε∇ℓ(x) ∈ D∗ holds for any (x, ℓ) ∈ (Ω ∩ D)× L, where D∗ = ∇F(D) is the dual space of D under F. Consider the following online mirror descent: x ′ t+1 = ∇F ∗ (∇F(xt) − ε∇ℓt(xt)), xt+1 = argmin x∈Ω DF (x, x′ t+1), (F.51) where F ∗ is the Legendre–Fenchel transform of F defined by F ∗ (u) = supx∈Ω(x ⊤u − F(x)). Then, we have X T t=1 ℓt(xt) − X T t=1 ℓt(x) ≤ F(x) − F(x1) ε + 1 ε X T t=1 DF∗ ∇F(xt) − ε∇ℓt(xt), ∇F(xt) . (F.52) We next introduce an important lemma related to the online mirror descent with weighted entropy regularizer, which is a version of Lemma 1 [42] in the fixed learning rate and non-optimistic setting. Note that this is actually an interval version of Lemma 1 [42], replacing the summation range from [T] to an interval I ⊆ [T], which is also used in Appendix C.3 [42]. Lemma F.5.2 (Lemma 1 of Chen, Luo, and Wei [42]). Consider the following online mirror descent update over a compact convex decision subset Ω ⊆ ∆d, wt+1 = argmin w∈Ω n ⟨w, ℓt + at⟩ + Dψ(w, wt) o where ψ(w) = Pd n=1 1 ηn wn ln wn is the weighted entropy regularizer. 
Suppose that for all t ∈ [T], it holds that 32ηn|ℓt,n| ≤ 1 for all n ∈ [d] such that wt,n > 0, then the above update ensures for any u ∈ Ω, X t∈I ⟨ℓt , wt − u⟩ ≤ X t∈Ik Dψ(u, wt) − Dψ(u, wt+1) + 32X t∈I X d n=1 ηnunℓ 2 t,n − 16X t∈I X d n=1 ηnwt,nℓ 2 t,n. 359 Appendix G Omitted Details in Chapter 4 G.1 Omitted Details for Section 4.2 G.1.1 Proof of Lemma 4.2.1 In this section, we prove one of our key lemmas (Lemma 4.2.1), which shows that the lifted gradient estimator constructed in Line 14 of Algorithm 13 is an unbiased estimator in the first d dimensional coordinates. Proof. Fix any t ∈ [T] and let w = H − 1 2 t ed+1/ H − 1 2 t ed+1 2 . As u ∼ S d+1 ∩ H − 1 2 t ed+1⊥ and b ∼ B d+1 ∩ H − 1 2 t ed+1⊥ , there exists a transformation matrix M ∈ R d×(d+1) that satisfies Mw = 0, M⊤M = I − ww⊤ and MM⊤ = I, such that u = M⊤v, b = M⊤b where v is uniformly drawn from S d and b is uniformly drawn from B d . In fact, the d row vectors of M together with w forms a set of unit orthogonal base in the (d + 1)-dimensional space. Recall the two following functions whose feasible domain is in (d + 1)-dimensional space. ft(x) ≜ ft(x[1:d] ), (G.1) fet(x) ≜ ft(x) + λt 2 ∥x[1:d]∥ 2 2 = ft(x[1:d] ) + λt 2 ∥x[1:d]∥ 2 2 (G.2) 3 Then we define the following functions in the d-dimensional space. Let Jt : R d → R such that Jt(x) = fet(H − 1 2 t M⊤x) and Jbt(x) = Eb∼Bd [Jt(x + b)]. In addition, we denote ybt = (yt , 0) ∈ R d+1 that appends an additional constant value 0 to yt in the (d + 1)-th coordinate. Then by the definition of gt , we have Et [gt ] = dE u∼S d+1∩ H − 1 2 t ed+1⊥ fet(yt + H − 1 2 t u)H 1 2 t u = dEv∼S d Jt(MH 1 2 t ybt + v)H 1 2 t M⊤v = dH 1 2 t M⊤Ev∼S d Jt(MH 1 2 t ybt + v)v = H 1 2 t M⊤∇Jbt(MH 1 2 t ybt), where the final equality is due to Lemma 5 of Flaxman, Kalai, and McMahan [56]. The second equality is because of the following reasoning. Note that by the definition of Jt and properties of M, we have Jt(MH 1 2 t ybt + v) = fet(H − 1 2 t M⊤(MH 1 2 t ybt + v)) = fet(H − 1 2 t (I − ww⊤)H 1 2 t ybt + H − 1 2 t M⊤v) = fet((I − H − 1 2 t ww⊤H 1 2 t )ybt + H − 1 2 t M⊤v). In addition, by the definition of w, we have I − H − 1 2 t ww⊤H 1 2 t = I − H−1 t ed+1e ⊤ d+1 H − 1 2 t ed+1 2 2 Note that the second term has all entries 0 except for the last column, i.e., the (d+ 1)-th one. Therefore we have H − 1 2 t ww⊤H 1 2 t ybt = 0, which leads to Jt(MH 1 2 t ybt + v) = fet(ybt + H − 1 2 t M⊤v) = fet(yt + H − 1 2 t M⊤v), 36 where the last equality is because fet(x) has no dependence on the last coordinate of x. Furthermore, according to the definition of Jbt(x), we have ∇Jbt(x) = H − 1 2 t M⊤ ⊤ Eb∼Bd ∇fet(H − 1 2 t M⊤(x + b)) = H − 1 2 t M⊤ ⊤ Eb∼Bd ∇fet(H − 1 2 t M⊤x + H − 1 2 t M⊤b) = H − 1 2 t M⊤ ⊤ E b∼Bd+1∩ H − 1 2 t ed+1⊥ ∇fet(H − 1 2 t M⊤x + H − 1 2 t b) = MH− 1 2 t ∇fbt(H − 1 2 t M⊤x), where the fourth equality is by the definition of fbt . Therefore, we get Et [gt ] = H 1 2 t M⊤∇Jbt(MH 1 2 t ybt) = H 1 2 t M⊤MH− 1 2 t ∇fbt(H − 1 2 t M⊤MH 1 2 t ybt) = (I − H 1 2 t ww⊤H − 1 2 t )∇fbt((I − H − 1 2 t ww⊤H 1 2 t )ybt). (G.3) Since w = H − 1 2 t ed+1 · H − 1 2 t ed+1 −1 2 , we have I − H 1 2 t ww⊤H − 1 2 t = I − ed+1e ⊤ d+1H−1 t H − 1 2 t ed+1 2 2 , I − H − 1 2 t ww⊤H 1 2 t = I − H−1 t ed+1e ⊤ d+1 H − 1 2 t ed+1 2 2 . This shows that for any x ∈ R d+1 , (I − H 1 2 t ww⊤H − 1 2 t )x [1:d] = x[1:d] and also we have (I − H − 1 2 t ww⊤H 1 2 t )ybt = ybt because the (d + 1)-th coordinate of ybt is 0. Combining the above with Eq. 
(G.3) yields the following result: Et [gt ] = I − ed+1e ⊤ d+1H−1 t H − 1 2 t ed+1 2 2 ∇fbt(ybt) = I − ed+1e ⊤ d+1H−1 t H − 1 2 t ed+1 2 2 h ∇fbt(yt)[1:d] ; 0] = h ∇fbt(yt)[1:d] ; ∗ i , where ∗ ∈ R denotes the last coordinate of the expectation of the gradient estimator that can be calculated according to the context. Note that the last step is true by noting that fbt is defined as a smoothed function of fet that is irrelevant to the (d + 1)-th coordinate. Hence, we show that the first d dimensions of the estimator constructed in Line 14 are unbiased and finish the proof. G.1.2 Stability Lemma In this section, we prove the following lemma which shows the stability of the dynamics of our algorithm. We point out that this stability lemma is the main technical reason that we introduce the lifting idea. Lemma G.1.1. Consider the following FTRL update: yt+1 = argmin x∈Ω (X t s=1 g ⊤ s x + σs + λs 2 ∥x − ys∥ 2 2 + λ0 2 ∥x∥ 2 2 + 1 ηt+1 Ψ(x) ) , where Ψ(x) = Ψ(x, b) = 400(ψ(x/b) − 2ν ln b) is a normal barrier of the conic hull of Ω defined by con(Ω) = {0} ∪ {(w, b) | w b ∈ Ω, w ∈ R d , b > 0}, and ψ is a ν-self-concordant barrier of Ω ⊆ R d , Ht = ∇2Ψ(yt) + ηt(σ1:t−1 + λ0:t−1), us is uniformly sampled from S d+1 ∩ H − 1 2 s ed+1⊥ and gs = d fs(xs) + λs 2 ∥xs∥ 2 2 H 1 2 s us for s ∈ [t], Suppose that the following two conditions hold: (1) the sequence of learning rates {ηt} T t=1 is non-increasing and satisfies 1 ηt+1 − 1 ηt ≤ C(λt + σt) p for some C > 0 and p > 0; (2) σt ≤ γ holds for some γ > 0 and η1 ≤ 1 32(d+16√ νC(γ+1)p) , λt ∈ (0, 1) holds for all t ∈ [T], and λ0 > 0. Then, we have ∥yt − yt+1∥Ht ≤ 1 2 . 3 Proof. Define the objective of FTRL update to be Ft+1(x) = Pt s=1 ℓs(x) + Rt+1(x) with Rt+1(x) = λ0 2 ∥x∥ 2 2 + 1 ηt+1 Ψ(x) and ℓs(x) ≜ ⟨gs, x⟩ + σs+λs 2 ∥x − ys∥ 2 2 . Therefore, we have yt+1 = argmin x∈Ω Ft+1(x). Define Ψt(x) ≜ Ψ(x) + ηtλ0 2 ∥x∥ 2 2 + ηt Pt−1 s=1 σs+λs 2 ∥x − ys∥ 2 2 . With this definition, we have Ht = ∇2Ψt(yt) and Ft+1(x) = Pt s=1 ⟨gs, x⟩ + 1 ηt+1 Ψt+1(x). Moreover, according to the definition of selfconcordant function (see Definition G.3.1), we know that Ψt is also a self-concordant function. Because of the convexity of Ft+1, in order to prove the desired conclusion, it suffices to show that for any y ′ ∈ Ω satisfying ∥y ′ − yt∥Ht = 1 2 , we have Ft+1(y ′ ) ≥ Ft+1(yt). To this end, we first calculate ∇Ft+1(yt) and the Hessian of Ft+1 as follows: ∇Ft+1(yt) = X t s=1 ∇ℓs(yt) + λ0yt + 1 ηt+1 ∇Ψ(yt) = X t s=1 gs + X t s=1 (σs + λs)(yt − ys) + λ0yt + 1 ηt+1 ∇Ψ(yt) = ∇Ft(yt) + gt + 1 ηt+1 − 1 ηt ∇Ψ(yt). (G.4) ∇2Ft+1(x) = 1 ηt+1 ∇2Ψt+1(x) = 1 ηt+1 ∇2Ψ(x) + λ0I + X t s=1 (σs + λs)I ! ⪰ 1 ηt ∇2Ψ(x) + λ0I + X t−1 s=1 (σs + λs)I ! = 1 ηt ∇2Ψt(x), (G.5) where the inequality is because ηt ≥ ηt+1. Based on the above, using Taylor’s expansion of Ft+1 at yt , we know that there exists ξt that lies in the line segment of yt and y ′ such that: Ft+1(y ′ ) = Ft+1(yt) + h ⊤∇Ft+1(yt) + 1 2 ∥h∥ 2 ∇2Ft+1(ξt) (h ≜ y ′ − yt and ξt ∈ [yt , y ′ ]) 364 ≥ Ft(yt) + h ⊤∇Ft(yt) + 1 ηt+1 − 1 ηt ∇Ψ(yt) ⊤h + g ⊤ t h + 1 2ηt ∥h∥ 2 ∇2Ψt(ξt) (by Eq. (G.4) and Eq. 
(G.5)) ≥ Ft(yt) + h ⊤∇Ft(yt) + 1 ηt+1 − 1 ηt ∇Ψ(yt) ⊤h + g ⊤ t h + 1 2ηt ∥h∥ 2 Ht · (1 − ∥yt − ξt∥Ht ) 2 (∇2Ψt(yt) = Ht , Ψt is a self-concordant function and by Lemma G.3.2) ≥ Ft(yt) − 1 ηt+1 − 1 ηt ∇Ψ(yt) ⊤h − ∥gt∥ ∗ Ht ∥h∥Ht + 1 2ηt ∥h∥ 2 Ht · (1 − ∥yt − ξt∥Ht ) 2 (first-order optimality of yt ) ≥ Ft(yt) − 1 ηt+1 − 1 ηt ∇Ψ(yt) ⊤h − 1 2 ∥gt∥ ∗ Ht + 1 32ηt (∥h∥Ht = 1 2 , ∥yt − ξt∥Ht ≤ 1 2 ) ≥ Ft(yt) − C(σt + λt) p ∇Ψ(yt) ⊤h − 1 2 ∥gt∥ ∗ Ht + 1 32ηt (by definition of ηt ) ≥ Ft(yt) − C(γ + 1)p ∇Ψ(yt) ⊤h − 1 2 ∥gt∥ ∗ Ht + 1 32η1 . (σt ≤ γ and λt ∈ (0, 1), {ηt} T t=1 is monotonically non-increasing.) Furthermore, note that the third term, which is the gradient local norm, can be upper bounded by ∥gt∥ ∗2 Ht = d 2 ft(xt) + λt 2 ∥xt∥ 2 2 2 ∥H 1 2 t u∥ ∗ Ht 2 ≤ d 2 1 + λt 2 2 u ⊤H 1 2 t H−1 t H 1 2 t u ≤ 4d 2 . (G.6) For the second term, we have ∇Ψ(yt) ⊤h ≤ ∥∇Ψ(yt)∥∇−2Ψ(yt)∥h∥∇2Ψ(yt) ≤ ∥∇Ψ(yt)∥∇−2Ψ(yt)∥h∥∇2Ψt(yt) (∇2Ψt(y) ⪰ ∇2Ψ(y)) = ∥yt∥∇2Ψ(yt)∥h∥∇2Ψt(yt) = √ ν¯ 2 . 3 The last two equations make use of the properties of ν¯-normal barrier (see Lemma G.3.5): ∇2Ψ(yt)yt = −∇Ψ(yt) and ∥yt∥ 2 ∇2Ψ(yt) = ¯ν, as well as the fact that ∥h∥Ht = 1 2 . Moreover, the constructed normal barrier satisfies that ν¯ = 800ν (see Lemma G.3.4). Therefore, we have Ft+1(y ′ ) ≥ Ft+1(yt) − √ 800ν 2 C(γ + 1)p − d + 1 32η1 ≥ Ft+1(yt), where the last step is due to the setting of η1 ≤ 1 32(d+16√ νC(γ+1)p) . Hence, we complete the proof. To apply Lemma G.1.1, when all the functions are β-smooth (see Assumption 4.1.1), we can choose γ = β to satisfy the condition σt ≤ γ; when all the functions are L-Lipschitz (see Assumption 4.1.2), we show in Lemma G.4.2 that choosing γ = 4L satisfies the condition of σt ≤ γ. G.1.3 Proof of Lemma 4.2.2 To bound the expected regret, we decompose the cumulative regret with respect to x ∈ Ω in the following way using the functions in the lifted domain defined in Lemma 4.2.1, Eq. (G.1) and Eq. (G.2): E "X T t=1 ft(xt) − X T t=1 ft(x) # = E "X T t=1 ft(xt) − X T t=1 ft(x) # = E "X T t=1 ft(xt) − X T t=1 ft(xe) # + E "X T t=1 ft(xe) − X T t=1 ft(x) # = E "X T t=1 ft(xt) − X T t=1 ft(yt) # | {z } Exploration + E "X T t=1 ft(yt) − X T t=1 fet(yt) # | {z } Regularization I + E "X T t=1 fet(yt) − X T t=1 fbt(yt) # | {z } Smooth I + E "X T t=1 fbt(yt) − X T t=1 fbt(xe) # | {z } Reg Term + E "X T t=1 fbt(xe) − X T t=1 fet(xe) # | {z } Smooth II + E "X T t=1 fet(xe) − X T t=1 ft(xe) # | {z } Regularize II 366 + E "X T t=1 ft(xe) − X T t=1 ft(x) # | {z } Comparator Bias , (G.7) where in the second equality, we define xe ≜ 1 − 1 T x+ 1 T ·y1, where y1 = argminx∈Ω Ψ(x). Note that both xe and y1 belong to the shrunk lifted feasible set Ωe = {x = (x, 1) | x ∈ Ω, πy1 (x) ≤ 1 − 1 T }. We remind the readers the notations xe = xe[1:d] and y1 = y1[1:d] , and we have xe = (x, e 1) and y1 = (y1, 1). We now bound the each term of the regret decomposition in Eq. (G.7) individually. First, for the two terms Regularization I and Regularization II, we have for any x ∈ Ω, E "X T t=1 ft(x) − X T t=1 fet(x) # = E "X T t=1 ft(x) − X T t=1 fet(x) # ≤ X T t=1 λt 2 , (G.8) which essentially is the bias due to introducing the regularization term. Second, consider the two terms Smooth I and Smooth II. According to the definition of fet shown in Lemma 4.2.1, we know that fet is (β + λt)-smooth. 
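(As an aside, the estimator analyzed in Appendix G.1.1 above is easy to sanity-check numerically. The following minimal NumPy sketch uses a toy quadratic loss and a synthetic matrix $H_t$; the loss $Q, c$, the coefficient $\lambda_t$, the matrix $H$, and the iterate $y$ are all made-up inputs for the sketch and are not taken from Algorithm 13. It draws $u$ uniformly from $\mathbb{S}^{d+1}\cap\{H_t^{-1/2}e_{d+1}\}^{\perp}$ by projecting a Gaussian vector and checks that the first $d$ coordinates of the averaged estimator match $\nabla\widetilde{f}_t(y_t)_{[1:d]}$, which for a quadratic loss coincides with the gradient of its ball-smoothed version.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples, lam = 4, 200_000, 0.1

# Toy quadratic loss on the first d coordinates (a hypothetical stand-in for f_t).
Q = np.diag(rng.uniform(1.0, 2.0, size=d))
c = rng.normal(size=d)

# Synthetic positive-definite H_t on the lifted (d+1)-dimensional space; we use its
# symmetric square root, as the argument in Appendix G.1.1 does implicitly.
A = rng.normal(size=(d + 1, d + 1))
H = A @ A.T + np.eye(d + 1)
evals, evecs = np.linalg.eigh(H)
H_sqrt = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
H_isqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

y = np.append(rng.uniform(-0.3, 0.3, size=d), 1.0)   # lifted iterate, last coordinate 1

# Direction to stay orthogonal to: w ∝ H_t^{-1/2} e_{d+1}.
w = H_isqrt[:, -1] / np.linalg.norm(H_isqrt[:, -1])

# Sample u uniformly from S^{d+1} ∩ {H_t^{-1/2} e_{d+1}}^⊥ by projecting Gaussian vectors.
Z = rng.normal(size=(n_samples, d + 1))
Z -= np.outer(Z @ w, w)
U = Z / np.linalg.norm(Z, axis=1, keepdims=True)

# g_t = d * f_tilde(y_t + H_t^{-1/2} u) * H_t^{1/2} u, where f_tilde ignores the last coordinate.
X = (y + U @ H_isqrt)[:, :d]                          # first d coordinates of the query points
vals = 0.5 * np.einsum('ij,jk,ik->i', X, Q, X) + X @ c + 0.5 * lam * (X * X).sum(axis=1)
g_avg = d * (vals[:, None] * (U @ H_sqrt)).mean(axis=0)

# For a quadratic loss, ball-smoothing leaves the gradient unchanged, so the first d
# coordinates of E[g_t] should match ∇f_tilde(y_t)[1:d] = (Q + lam I) y[:d] + c.
target = Q @ y[:d] + c + lam * y[:d]
print("estimate:", g_avg[:d])
print("gradient:", target)
print("max deviation:", np.abs(g_avg[:d] - target).max())   # Monte Carlo error, ~1/sqrt(n_samples)
```

The printed deviation only reflects Monte Carlo noise; increasing n_samples drives it toward zero, in line with the unbiasedness shown in Lemma 4.2.1.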
Using the fact that perturbation b has mean 0, we can bound the two term as follows: for any x ∈ Ω, Eb "X T t=1 fet x + H − 1 2 t b − X T t=1 fet(x) # ≤ X T t=1 β + λt 2 H − 1 2 t b 2 2 ≤ X T t=1 d(β + λt) p (β + 1)(σ1:t−1 + λ0:t−1) , (G.9) where the second inequality is because Ht ⪰ ηt(σ1:t−1 + λ0:t−1I) and ηt = 1 2d q β+1 σ1:t−1+λ0:t−1 + ν T log T . Third, by definition of yt and the β-smoothness of function ft , Exploration term can be bounded by E "X T t=1 ft(xt) − X T t=1 ft(yt) # ≤ X T t=1 β 2 H − 1 2 t ut 2 2 ≤ X T t=1 dβ p (β + 1)(σ1:t−1 + λ0:t−1) . (G.10) 3 Fourth, for Comparator Bias, according to the definition of xe and using the convexity property of ft , we have E "X T t=1 ft(xe) − X T t=1 ft(x) # ≤E "X T t=1 ft 1 T y1 + 1 − 1 T x − X T t=1 ft(x) # ≤E " 1 T X T t=1 ft(y1) − 1 T X T t=1 ft(x) # ≤ 2. (G.11) Therefore, it suffices to further bound the Reg Term, which is the expected regret over the smoothed version of the lifted online functions. The following lemma proves the upper bound for the Reg Term. We remark that bounding this Reg Term is the most challenging part of the proof and is also the technical reason for us to lift the domain. Lemma G.1.2. When loss functions {ft} T t=1 are all β-smooth, if T ≥ ρ (a constant defined in Algorithm 13), Algorithm 13 guarantees that Reg Term ≤ Oe d √ νT + X T t=1 d √ β + 1 p σ1:t−1 + λ0:t−1 ! . (G.12) Proof. According to the definition of Reg Term, we have E "X T t=1 fbt(yt) − X T t=1 fbt(xe) # ≤ E "X T t=1 ∇fbt(yt) ⊤(yt − xe) − σt + λt 2 ∥yt − xe∥ 2 2 # (G.13) = E "X T t=1 g ⊤ t (yt − xe) − σt + λt 2 ∥yt − xe∥ 2 2 # (G.14) = E "X T t=1 ℓt(yt) − X T t=1 ℓt(xe) # . (G.15) In above, Eq. (G.13) holds owing to the (σt +λt)-strong-convexity of fbt (actually only in the first d dimension but it is enough as yt and xe have the same last coordinate); Eq. (G.14) is true because Lemma 4.2.1 ensures that gt is an unbiased estimator of ∇fbt(yt) in the first d coordinates and meanwhile yt−x has the 368 last coordinate 0. The last step shown in Eq. (G.15) is by introducing the surrogate loss ℓt : Ω 7→ R, defined as ℓt(x) ≜ ⟨gt , x⟩ + σt+λt 2 ∥x − yt∥ 2 2 . Note that according to this construction, we have ∇ℓt(yt) = gt . In addition, our FTRL update rule can be written in the following two forms: yt+1 = argmin x∈Ω (X t s=1 ⟨gs, x⟩ + σs + λs 2 ∥ys − x∥ 2 2 + λ0 2 ∥x∥ 2 2 + 1 ηt+1 Ψ(x) ) = argmin x∈Ω (X t s=1 ℓs(x) + λ0 2 ∥x∥ 2 2 + 1 ηt+1 Ψ(x) ) = argmin x∈Ω (X t s=1 ⟨gs, x⟩ + 1 ηt+1 Ψt+1(x) ) , where Ψt+1(x) = Ψ(x) + ηt+1 λ0 2 ∥x∥ 2 2 + Pt s=1 σs+λs 2 ∥x − ys∥ 2 2 . As discussed in Lemma G.1.1 Ψt+1 is still a self-concordant function and moreover Ht+1 = ∇2Ψt+1(yt+1). Recall the definition Rt+1(x) = λ0 2 ∥x∥ 2 2 + 1 ηt+1 Ψ(x) and Ft+1(x) = Pt s=1 ℓs(x) + Rt+1(x). Denote by R′ t+1(x) = λ0 2 ∥x∥ 2 2 + 1 ηt+1 (Ψ(x) − Ψ(y1)) the (shifted) regularizer and by Qt+1(x) = Pt s=1 ℓs(x) + R′ t+1(x) and (shifted) FTRL objective. Therefore, we have Ft+1(x) = Qt+1(x) + 1 ηt+1 Ψ(y1). Then, yt+1 = argminx∈Ω Ft+1(x) = argminx∈Ω Qt+1(x) according to the FTRL update rule, and we have X T t=1 ℓt(yt) − X T t=1 ℓt(xe) ≤ R′ T +1(xe) − R′ 1 (y1) +X T t=1 ∇ℓt(yt) ⊤(yt − yt+1) − X T t=1 DQt+ℓt (yt+1, yt) +X T t=1 R′ t (yt+1) − R′ t+1(yt+1) ≤ R′ T +1(xe) − R′ 1 (y1) +X T t=1 g ⊤ t (yt − yt+1) − X T t=1 DQt+ℓt (yt+1, yt) ≤ R′ T +1(xe) − R′ 1 (y1) +X T t=1 g ⊤ t (yt − yt+1) − X T t=1 DQt (yt+1, yt). 
369 In above, the first inequality is due to the standard FTRL analysis as shown in Lemma G.4.1; the second inequality is true because the surrogate loss satisfies that ∇ℓt(yt) = gt and 0 ≤ R′ t (x) ≤ R′ t+1(x) holds for any x ∈ Ω as the learning rate is monotonically non-increasing and y1 = argminx∈Ω Ψ(x). The last inequality follows from ∇2 ℓt(x) = (σt + λt)I and the following inequality: DQt+ℓt (yt+1, yt) = DQt (yt+1, yt) + Dℓt (yt+1, yt) = DQt (yt+1, yt) + σt + λt 2 ∥yt+1 − yt∥ 2 2 ≥ DQt (yt+1, yt). In addition, by Taylor expansion, we know that DQt (yt+1, yt) = 1 2 ∥yt+1 − yt∥ 2 ∇2Qt(ξt) for some ξt ∈ [yt , yt+1], and ∇2Qt(x) = ∇2Ft(x) = 1 ηt ∇2Ψt(x) as shown in the first equality of Eq. (G.5). Therefore, combining all above, we get that X T t=1 ℓt(yt) − X T t=1 ℓt(xe) ≤ λ0 2 ∥xe∥ 2 2 + Ψ(xe) − Ψ(y1) ηT +1 + X T t=1 g ⊤ t (yt − yt+1) − X T t=1 1 2ηt ∥yt+1 − yt∥ 2 ∇2Ψt(ξt) . (G.16) In the following, we proceed to analyze the crucial terms g ⊤ t (yt − yt+1) and ∥yt+1 − yt∥ 2 ∇2Ψt(ξt) . For the first term, by Holder’s inequality, we have g ⊤ t (yt − yt+1) ≤ ∥gt∥ ∗ Ht · ∥yt − yt+1∥Ht . (G.17) The second term is more involved to analyze. To do this, we first verify that the conditions required in Lemma G.1.1 are indeed satisfied. First, it is direct to see that {ηt} T t=1 is non-increasing and 1 ηt+1 − 1 ηt ≤ 2d 1 q β+1 λ0:t+σ1:t + ν T log T − 1 q β+1 λ0:t−1+σ1:t−1 + ν T log T 370 ≤ 2d √ β + 1 p λ0:t + σ1:t − p λ0:t−1 + σ1:t−1 ≤ 2d √ λt + σt √ β + 1 , where the second inequality is because (a + c) −1/2 − (b + c) −1/2 is decreasing in c when a ≤ b. Therefore, this satisfies that η −1 t+1 − η −1 t ≤ C(λt + σt) p with C = 2d(β + 1)−1/2 and p = 1 2 . In addition, note that as T ≥ ρ = 512ν(1 + 32√ ν) 2 and λ0 ≥ (β + 1)ρν−1 , we have η1 = 1 2d s β + 1 λ0 + ν T log T ≤ 1 2d r 2ν ρ = 1 32d(1 + 32√ ν) = 1 32(d + 16√ νC(γ + 1)p) , with γ = β. Therefore, according to Lemma G.1.1, we show that ∥yt − yt+1∥Ht ≤ 1 2 . Then, due to the nice properties of optimization with self-concordant functions (see Lemma G.3.2), we obtain that ∥yt+1 − yt∥∇2Ψt(ξt) ≥ ∥yt+1 − yt∥∇2Ψt(yt) · (1 − ∥yt+1 − ξt∥∇2Ψt(yt) ) ≥ 1 2 ∥yt+1 − yt∥Ht , (G.18) where the last inequality makes use of the result ∥yt−yt+1∥Ht ≤ 1 2 as well as the fact that ∇2Ψt(yt) = Ht . Plugging inequalities Eq. (G.17) and Eq. (G.18) to the regret upper bound achieves X T t=1 ℓt(yt) − X T t=1 ℓt(xe) ≤ λ0 2 ∥xe∥ 2 2 + Ψ(xe) − Ψ(y1) ηT +1 + X T t=1 g ⊤ t (yt − yt+1) − X T t=1 1 2ηt ∥yt+1 − yt∥ 2 ∇2Ψt(ξt) ≤ λ0 2 ∥xe∥ 2 2 + Ψ(xe) − Ψ(y1) ηT +1 + X T t=1 ∥gt∥ ∗ Ht · ∥yt − yt+1∥Ht − 1 8ηt ∥yt+1 − yt∥ 2 Ht ≤ λ0 2 ∥xe∥ 2 2 + Ψ(xe) − Ψ(y1) ηT +1 + X T t=1 2ηt∥gt∥ ∗2 Ht (G.19) ≤ O(1) + Ψ(xe) − Ψ(y1) ηT +1 + X T t=1 8ηtd 2 , (Eq. (G.6)) 371 ≤ O(1) + Ψ(xe) − Ψ(y1) ηT +1 + X T t=1 8d √ β + 1 p σ1:t−1 + λ0:t−1 + O d p νT log T , ≤ O(1) + O d p νT log T + X T t=1 8d √ β + 1 p σ1:t−1 + λ0:t−1 , (G.20) where Eq. (G.20) holds because of the following three facts. First, as both xe and y1 belong to the shrunk lifted domain Ωe = {x | πy1 (x) ≤ 1 − 1 T }, based on Lemma G.3.1, we have that 0 ≤ Ψ(x) − Ψ(y1) ≤ ν¯ ln 1 1−πy1 (x) ≤ ν¯ log T holds for all x ∈ Ωe. Second, as demonstrated in Lemma G.3.4, the normal barrier we choose in Algorithm 13 ensures that ν¯ = 800ν = O(ν). Third, ηT +1 ≥ 1 2d q ν T log T . This finishes the proof of Lemma G.1.2. Now we are ready to prove our main lemma (Lemma 4.2.2). Below we restate the lemma for convenience. Lemma G.1.3. 
With any regularization coefficients {λt} T t=1 ∈ (0, 1), Algorithm 13 guarantees: Reg = Oe d √ νT + λ1:T −1 + T X−1 t=1 d √ β + 1 √ σ1:t + λ0:t ! , (G.21) if loss functions {ft} T t=1 are all β-smooth and T ≥ ρ (a constant defined in Algorithm 13). Proof. Combining all the above terms in Eq. (G.8), Eq. (G.9), Eq. (G.10), Eq. (G.11), and Eq. (G.12) as well as the decomposition in Eq. (G.7), we obtain the following expected regret upper bound: E "X T t=1 ft(xt) − X T t=1 ft(x) # Eq.(G.7) ≤ E "X T t=1 fbt(yt) − X T t=1 fbt(xe) # + 2 + E "X T t=1 ft(xt) − X T t=1 ft(yt) # + E "X T t=1 ft(yt) − X T t=1 fet(yt) # + E "X T t=1 fet(yt) − X T t=1 fbt(yt) # + E "X T t=1 fbt(xe) − X T t=1 fet(xe) # + E "X T t=1 fet(xe) − X T t=1 ft(xe) # 37 ≤ O λ1:T + d p νT log T + X T t=1 d(β + λt) √ β + 1p σ1:t−1 + λ0:t−1 + X T t=1 d √ β + 1 p σ1:t−1 + λ0:t−1 ! ≤ Oe λ1:T + d √ νT + X T t=1 d √ β + 1 p σ1:t−1 + λ0:t−1 ! (λt ∈ (0, 1)) ≤ Oe λ1:T −1 + d √ νT + T X−1 t=1 d √ β + 1 √ σ1:t + λ0:t ! , (G.22) where the last step hold because our choice of regularization coefficients λt ∈ (0, 1) for t ∈ [T] and the input parameter λ0 ≥ 1, which finishes the proof. G.1.4 Proof of Lemma 4.2.3 Proof. We prove the claim Eq. (4.7) by induction, whose proof technique is similar to Lemma 3.1 in [25]. Consider the base case when t = 1. For simplicity, we define λ ∗ 0 = λ0. If λ1 ≤ λ ∗ 1 ≜ argminλ1≥0 B(λ1), we have B(λ1) = λ1 + d √ β+1 √ σ1+λ0:1 = 2λ1 ≤ 2λ ∗ 1 ≤ 2B(λ ∗ 1 ), where the second equality is true because of the condition in Eq. (4.6). Otherwise, we have B(λ1) = 2d √ β+1 √ σ1+λ0:1 ≤ 2d √ √ β+1 σ1+λ ∗ 0:1 ≤ 2B(λ ∗ 1 ). Combining both scenarios verifies the base case. Suppose we have B({λs} t−1 s=1) ≤ 2 min{λ′ s} t−1 s=1≥0 B({λ ′ s} t−1 s=1). With a slight abuse of notation, we set {λ ∗ s} t s=1 = argmin{λ′ s} t s=1≥0 B({λ ′ s} t s=1). Similarly, if λ1:t ≤ λ ∗ 1:t , we have B({λs} t s=1}) = λ1:t + X t s=1 d √ β + 1 √ σ1:s + λ0:s = λ1:t + X t s=1 λs = 2λ1:t ≤ 2λ ∗ 1:t ≤ 2B({λ ∗ s} t s=1). Otherwise, we have λt + d √ β + 1 √ σ1:t + λ0:t = 2d √ β + 1 √ σ1:t + λ0:t ≤ 2d √ β + 1 p σ1:t + λ ∗ 0:t ≤ 2 λ ∗ t + d √ β + 1 p σ1:t + λ ∗ 1:t ! . 373 Using the induction hypothesis, we have B({λs} t s=1}) = B({λs} t−1 s=1}) + λt + d √ β + 1 √ σ1:t + λ0:t ≤ min {λ′ s} t−1 s=1≥0 2B({λ ′ s} t−1 s=1) + 2 λ ∗ t + d √ β + 1 p σ1:t + λ ∗ 1:t ! ≤ 2B({λ ∗ s} t−1 s=1) + 2 λ ∗ t + d √ β + 1 p σ1:t + λ ∗ 1:t ! = 2B({λ ∗ s} t s=1), where the first inequality is because of the induction hypothesis. Combining both cases, we have that B({λs} t s=1}) ≤ 2B({λ ∗ s} t s=1). G.1.5 Proof of Theorem 4.2.1 Proof. By Lemma 4.2.2 we have Reg ≤ Oe λ1:T −1 + d √ νT + T X−1 t=1 d √ β + 1 √ σ1:t + λ0:t ! , which holds for any sequence of regularization coefficients λ1, . . . , λT ∈ (0, 1). Moreover, due to the specific calculation of regularization coefficients (see Eq. (4.6)) and from Lemma 4.2.3, we immediately achieve for any λ ∗ 1 , . . . , λ∗ T ≥ 0, Reg ≤ Oe λ ∗ 1:T −1 + d √ νT + T X−1 t=1 d √ β + 1 p σ1:t + λ ∗ 0:t ! , which finishes the proof of Theorem 4.2.1. G.1.6 Proofs for Implications of Theorem 4.2.1 In this section, we provide the proofs of implications in Section 4.2.2. 374 Proof of Corollary 4.2.1. Since Theorem 4.2.1 holds for any non-negative sequence of {λ ∗ t } T t=1, in particular, we choose λ ∗ 1 = (1 + β) 1 3 d 2 3 T 2 3 and λ ∗ t = 0 for all t ≥ 2, then with ν = O(d), we obtain that Reg Eq. 
(4.8) ≤ Oe (1 + β) 1 3 d 2 3 T 2 3 + d 3 2 √ T + (1 + β) 1 3 d 2 3 T 2 3 = Oe d 3 2 √ T + (1 + β) 1 3 d 2 3 T 2 3 , where the last step holds due to ν = O(d) (see Lemma G.3.3). Proof of Corollary 4.2.2. Since Theorem 4.2.1 holds for any non-negative sequence of {λ ∗ t } T t=1, in particular, we choose λ ∗ t = 0 for all t ≥ 1 and, then with ν = O(d), we obtain that Reg Eq. (4.8) ≤ Oe d √ νT + d r (1 + β)T σ ! = Oe d 3 2 √ T + d r T(1 + β) σ ! , which ends the proof. Proof of Corollary 4.2.3. In the first environment where there are M rounds in which the loss function is 0-strongly convex, to make the right hand side of Eq. (4.8) the largest, we have στ = 0 when τ ∈ [M] and στ = σ when τ > M. Set λ ∗ t = 0 for all t ≥ 2. According to Eq. (4.8) shown in Theorem 4.2.1 and the choice ν = O(d), we have Reg ≤ Oe λ ∗ 1 + d √ νT + T X−1 t=1 d √ 1 + β p λ ∗ 1 + σ1:t ! ≤ Oe λ ∗ 1 + d 3 2 √ T + d √ 1 + βM p λ ∗ 1 + d p 1 + β min ( (T − M) p λ ∗ 1 , r T − M σ )! ≤ Oe d 2 3 (1 + β) 1 3M 2 3 + d 3 2 √ T + d p 1 + β min ( T − M (1 + β) 1 6 d 1 3M 1 3 , r T − M σ )! , (choosing λ ∗ 1 = (1 + β) 1 3 d 2 3M 2 3 ) 375 which leads to the first regret bound. Next, we consider the second environment where the first T −M loss functions are σ-strongly convex. Similarly, we choose λ ∗ t = 0 for t ≥ 2 and we have our regret bounded as follows: Reg ≤ Oe λ ∗ 1 + d √ νT + T X−1 t=1 d √ 1 + β p λ ∗ 1 + σ1:t ! ≤ Oe λ ∗ 1 + d 3 2 √ T + dM√ 1 + β p (T − M)σ + λ ∗ 1 + d p 1 + β min ( T − M p λ ∗ 1 , r T − M σ )! . When T − M = Θ(T), we have Reg ≤ Oe λ ∗ 1 + d 3 2 √ T + dM√ 1 + β p T σ + λ ∗ 1 + d p 1 + β r T σ ! ≤ Oe λ ∗ 1 + d 3 2 √ T + dM√ 1 + β √ T σ + d p 1 + β r T σ ! ≤ Oe d 3 2 √ T + dT √ 1 + β p σ(T − M) ! , where the last inequality is by choosing λ ∗ 1 = 0. When T −M = o(T), we have M = Θ(T). Furthermore, if λ ∗ 1 ≤ σ(T − M), we have T √−M λ ∗ 1 ≥ q T −M σ and therefore, Reg ≤ Oe λ ∗ 1 + d 3 2 √ T + dT √ 1 + β p (T − M)σ + d p 1 + β r T − M σ ! ≤ Oe d 3 2 √ T + dT √ β + 1 p σ(T − M) ! , where the last inequality is by choosing λ ∗ 1 = 0. On the other hand, if λ ∗ 1 ≥ σ(T − M), we have Reg ≤ Oe λ ∗ 1 + d 3 2 √ T + dT √ 1 + β p λ ∗ 1 ! ≤ Oe σ(T − M) + d 2 3 (1 + β) 1 3 T 2 3 + d 3 2 √ T , 376 where the last inequality is by choosing λ ∗ 1 = max n σ(T − M), d 2 3 (1 + β) 1 3 T 2 3 o . Combining the above bounds, we have Reg ≤ Oe d 3 2 √ T + min ( dT √ 1 + β p σ(T − M) , σ(T − M) + d 2 3 (1 + β) 1 3 T 2 3 )! , which finishes the proof. Proof of Corollary 4.2.4. Since Theorem 4.2.1 holds for any sequence of {λ ∗ t } T t=1, in particular, we choose λ ∗ t = 0 for all t ≥ 2 and set λ ∗ 1 = (1 + β) µ0 d µ1 · T µ2 with µ2 < 1, then we obtain that Reg ≤ Oe λ ∗ 1 + d √ νT + d p β + 1 T X−1 t=1 1 p t 1−α + λ ∗ 1 ! ≤ Oe λ ∗ 1 + d 3 2 √ T + d p β + 1 min ( T 1 2 + α 2 , T p λ ∗ 1 )! = Oe (1 + β) µ0 d µ1 T µ2 + d 3 2 √ T + d p β + 1 min n T 1+α 2 ,(1 + β) − µ0 2 d − µ1 2 T 1− µ2 2 o . First, the above bound can be upper bounded by Reg ≤ Oe d 3 2 √ T + (1 + β) µ0 d µ1 T µ2 + (1 + β) 1−µ0 2 d 1− µ1 2 T 1− µ2 2 ≤ Oe(d 3 2 √ T + (1 + β) 1 3 d 2 3 T 2 3 ), where the last inequality is by choosing µ0 = 1 3 , µ1 = 2 3 and µ = 2 3 . Furthermore, when α ∈ [0, 1 3 − 1 3 logT (β + 1) − 2 3 logT d], we have T 1+α 2 ≤ (1 + β) − 1 6 d − 1 3 T 2 3 . Therefore, set µ0 = µ1 = µ2 = 0 and we have Reg ≤ Oe d 3 2 √ T + d p 1 + βT 1+α 2 . Combining both situations finishes the proof. 377 G.2 Omitted Details for Section 4.3 In this section, we show the proof in the Lipschitz BCO setting. 
Specifically, we show the proof for the main theorem of Lipschitz BCO in Appendix G.2.1 and show the proofs for the implications of Theorem 4.3.1 in Appendix G.2.2. G.2.1 Proof of Theorem 4.3.1 Following the same regret decomposition as Eq. (G.7), we decompose the regret into the following terms where xe is defined the same as the one in Eq. (G.7). E "X T t=1 ft(xt) − X T t=1 ft(x) # = E "X T t=1 ft(xt) − X T t=1 ft(yt) # | {z } Exploration + E "X T t=1 ft(yt) − X T t=1 fet(yt) # | {z } Regularization I + E "X T t=1 fet(yt) − X T t=1 fbt(yt) # | {z } Smooth I + E "X T t=1 fbt(yt) − X T t=1 fbt(xe) # | {z } Reg Term + E "X T t=1 fbt(xe) − X T t=1 fet(xe) # | {z } Smooth II + E "X T t=1 fet(xe) − X T t=1 ft(xe) # | {z } Regularization II + E "X T t=1 ft(xe) − X T t=1 ft(x) # | {z } Comparator Bias , (G.23) For terms Regularization I and Regularization II, we bound them in the same way as shown in Eq. (G.8): for any x ∈ Ω, E "X T t=1 ft(x) − X T t=1 fet(x) # ≤ X T t=1 λt 2 . (G.24) 378 For terms Smooth I and Smooth II, instead of using the smoothness property in Appendix G.1, we use the Lipschitzness of fet and bound the two terms as follows: Eb "X T t=1 fet(x + H − 1 2 t b) − X T t=1 fet(x) # ≤ X T t=1 (L + λt) H − 1 2 t b 2 ≤ X T t=1 L + 1 p ηt(σ1:t−1 + λ0:t−1) ≤ X T t=1 d 2 3 (L + 1) 2 3 (σ1:t−1 + λ0:t−1) 1 3 , (G.25) where the last inequality is by the definition of ηt ≥ d − 4 3 (L + 1) 2 3 (σ1:t−1 + λ0:t−1) − 1 3 . For term Exploration, we again use the Lipschitzness of ft and have E "X T t=1 ft(xt) − X T t=1 ft(yt) # ≤ E "X T t=1 L∥xt − yt∥2 # ≤ X T t=1 L H − 1 2 t ut 2 ≤ d 2 3 (L + 1) 2 3 (σ1:t−1 + λ0:t−1) 1 3 . (G.26) For term Comparator Bias, as shown in Eq. (G.11), we have Comparator Bias ≤ 2. (G.27) Next, we show the following lemma bounding Reg Term. Lemma G.2.1. When loss functions {ft} T t=1 are all L-Lipschitz, if T ≥ ρ ′ (a constant defined in Algorithm 14), Algorithm 14 guarantees that Reg Term ≤ Oe d 4 3 νT 1 3 + X T t=1 d 2 3 (L + 1) 2 3 (σ1:t−1 + λ0:t−1) 1 3 ! . (G.28) 379 Proof. Similar to the analysis in Lemma G.1.2, we first verify the conditions in Lemma G.1.1 are satisfied. It is direct to see that {ηt} T t=1 is non-increasing and 1 ηt − 1 ηt+1 ≤ d 4 3 (L + 1)− 2 3 1 σ1:t + λ0:t + 1 T − 1 3 − 1 σ1:t−1 + λ0:t−1 + 1 T − 1 3 ! ≤ d 4 3 (L + 1)− 2 3 (σ1:t + λ0:t) 1 3 − (σ1:t−1 + λ0:t−1) 1 3 ≤ d 4 3 (L + 1)− 2 3 (σt + λt) 1 3 . Therefore, η −1 t+1 − η −1 t ≤ C(σt + λt) p with C = d 4 3 (L + 1)− 2 3 and p = 1 3 . Also, because of Lemma G.4.2, choosing γ = 4L ensures that σt ≤ γ for all t ∈ [T]. Moreover, because of the choice of λ0 and T ≥ λ0, we have η1 = d − 4 3 (L + 1) 2 3 1 λ0 + 1 T 1 3 ≤ d − 4 3 (L + 1) 2 3 λ − 1 3 0 · 2 1 3 ≤ d − 4 3 (L + 1) 2 3 · d 1 3 32 · 1 16√ νd 1 3 (4L + 1) 1 3 + (L + 1) 2 3 = 1 32d · (L + 1) 2 3 16√ νd 1 3 (4L + 1) 1 3 + (L + 1) 2 3 = 1 32(d + 16√ νC(4L + 1)p) . Therefore, according to Lemma G.1.1, ∥yt − yt+1∥Ht ≤ 1 2 . In addition, according to Eq. (G.6), we have ∥gt∥ ∗ Ht ≤ 2d for all t ∈ [T]. Therefore, Eq. (G.19) holds. Noticing that ηt = d − 4 3 (L + 1) 2 3 (σ1:t−1 + λ0:t−1) − 1 3 , and using Eq. (G.15) and Eq. (G.19), we have Reg Term = E "X T t=1 fbt(yt) − X T t=1 fbt(xe) # ≤ E "X T t=1 ℓt(yt) − X T t=1 ℓt(xe) # (Eq. (G.15)) 380 ≤ λ0 2 ∥xe∥ 2 2 + Ψ(xe) − Ψ(y1) ηT +1 + X T t=1 2ηt∥gt∥ ∗2 Ht (Eq. (G.19)) ≤ O(1) + Oe d 4 3 νT 1 3 + X T t=1 8d 2 3 (L + 1) 2 3 (σ1:t−1 + λ0:t−1) 1 3 , where we use the fact that ηT +1 ≥ d − 4 3 (L + 1) 2 3 T − 1 3 ≥ d − 4 3 T − 1 3 . This finishes the proof. 
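Eq. (4.9) itself is stated in the main text and not reproduced here; judging from how it is used in the proof of Lemma G.2.2 below, each coefficient solves the one-dimensional fixed-point equation $\lambda_t = d^{2/3}(L+1)^{2/3}(\sigma_{1:t}+\lambda_{0:t})^{-1/3}$. Since the left-hand side is increasing in $\lambda_t$ and the right-hand side is decreasing, bisection finds the root. The sketch below is a hypothetical illustration (the values of $d$, $L$, $\lambda_0$, and the $\sigma_t$ sequence are made up): it computes such a coefficient sequence and checks the factor-2 guarantee of Lemma G.2.2 against constant comparator sequences.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, L = 500, 3, 2.0
lam0 = d**2 * (L + 1)**2                      # consistent with the requirement lambda_0 >= d^2 (L+1)^2
sigma = rng.choice([0.0, 0.5], size=T)        # made-up per-round strong-convexity estimates

def coef_term(S):
    """Per-round term of the objective B': d^{2/3} (L+1)^{2/3} / S^{1/3}."""
    return d**(2 / 3) * (L + 1)**(2 / 3) / S**(1 / 3)

def solve_lambda_t(prefix):
    """Bisection for lambda solving lambda = coef_term(prefix + lambda), prefix = sigma_{1:t} + lambda_{0:t-1}."""
    lo, hi = 0.0, 1.0                         # the root lies in (0, 1) because of the choice of lambda_0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mid < coef_term(prefix + mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lams, prefix = [], lam0
for t in range(T):
    prefix += sigma[t]
    lam_t = solve_lambda_t(prefix)
    lams.append(lam_t)
    prefix += lam_t

def B_prime(lam_seq):
    """B'({lambda_s}) = lambda_{1:T} + sum_tau coef_term(sigma_{1:tau} + lambda_{0:tau}), as in Eq. (G.30)."""
    total_lam, run, val = 0.0, lam0, 0.0
    for t in range(T):
        run += sigma[t] + lam_seq[t]
        total_lam += lam_seq[t]
        val += coef_term(run)
    return total_lam + val

adaptive = B_prime(lams)
best_const = min(B_prime([lam] * T) for lam in np.linspace(0.0, 3.0, 301))
print(f"adaptive: {adaptive:.2f}, best constant: {best_const:.2f}, ratio: {adaptive / best_const:.2f}")
```

The printed ratio should stay below 2, consistent with Eq. (G.31), since the best constant sequence is itself a feasible comparator in the minimum on the right-hand side.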
Finally, we combine the above terms and show the following theorem, which holds for an arbitrary sequence of {λt} T t=1 with λt ∈ (0, 1) for all t ∈ [T], not necessarily satisfying Eq. (4.9). Theorem G.2.1. With any regularization coefficients {λt} T t=1 ∈ (0, 1), Algorithm 14 guarantees: Reg ≤ Oe X T t=1 d 2 3 (L + 1) 2 3 (σ1:t−1 + λ0:t−1) 1 3 + λ1:T ! , (G.29) if loss functions {ft} T t=1 are all L-Lipschitz and T ≥ ρ ′ (a constant defined in Algorithm 14). Proof. Combining Eq. (G.24), Eq. (G.25), Eq. (G.26), Eq. (G.27) and Eq. (G.28), we have E "X T t=1 ft(xt) − X T t=1 ft(x) # ≤ Oe X T t=1 d 2 3 (L + 1) 2 3 (σ1:t−1 + λ0:t−1) 1 3 + d 4 3 νT 1 3 + λ1:T ! . Next we show that if we choose the adaptive regularization coefficients as shown in Eq. (4.9), the obtained regret bound is no worse than the one with an optimal tuning of {λt} T t=1. Lemma G.2.2. Consider the following objective B ′ ({λs} t s=1) ≜ λ1:t + X t τ=1 d 2 3 (L + 1) 2 3 (σ1:τ + λ0:τ ) 1 3 , (G.30) 381 with λ0 defined in Algorithm 14. Then the sequence {λt} T t=1 attained by solving Eq. (4.9) satisfies that for all t ∈ [T], λt ∈ (0, 1) and B ′ ({λs} t s=1) ≤ 2 min {λ∗ s} t s=1≥0 B ′ ({λ ∗ s} t s=1). (G.31) Proof. First, we show that there exists a coefficient λt ∈ (0, 1) for all t ∈ [T] that satisfies the fixed-point problem Eq. (G.30). Indeed, we have the following two observations: • on one hand, when setting λt = 0, the LHS of Eq. (G.31) equals to 0, while the RHS of Eq. (G.31) is strictly larger than 0; • on the other hand, when setting λt = 1, the LHS of Eq. (G.31) is equal to 1 but the RHS of Eq. (G.31) is strictly less than 1 due to the choice of λ0 ≥ d 2 (L + 1)2 . Combining both facts shows that there exists a coefficient λt ∈ (0, 1) that satisfies Eq. (G.31). We prove this by induction similar to Lemma 3.1 in [25]. Again, we set λ ∗ 0 = λ0. Consider the case of t = 1. If λb1 ≤ λ ∗ 1 , we have B ′ (λb1) = λb1 + d 2 3 (L+1) 2 3 (σ1+λb0:1) 1 3 = 2λb1 ≤ 2λ ∗ 1 ≤ 2B ′ (λ ∗ 1 ). Otherwise, we have B ′ (λb1) = 2d 2 3 (L+1) 2 3 (σ1+λb0:1) 1 3 ≤ 2d 2 3 (L+1) 2 3 (λ ∗ 0:1+σ1) 1 3 ≤ 2B ′ (λ ∗ 1 ). Suppose we have B ′ ({λbs} t−1 s=1) ≤ 2 min{λ′ s} t−1 s=1≥0 B ′ ({λ ′ s} t−1 s=1). With a slight abuse of notation, we set {λ ∗ s} t s=1 = argmin{λ′ s} t s=1≥0 B ′ ({λ ′ s} t s=1). Similarly, if λb1:t ≤ λ ∗ 1:t , we have B ′ ({λbs} t s=1}) = λb1:t + X t s=1 d 2 3 (L + 1) 2 3 (σ1:s + λb0:s) 1 3 = λb1:t + X t s=1 λbs ≤ 2λ ∗ 1:t ≤ 2B ′ ({λ ∗ s} t s=1). Otherwise, we have λbt + d 2 3 (L + 1) 2 3 (σ1:t + λb0:t) 1 3 = 2d 2 3 (L + 1) 2 3 (σ1:t + λb0:t) 1 3 ≤ 2d 2 3 (L + 1) 2 3 (σ1:t + λ ∗ 0:t ) 1 3 ≤ 2 λ ∗ t + d 2 3 (L + 1) 2 3 (σ1:t + λ ∗ 1:t ) 1 3 ! . Using the induction hypothesis, we have B ′ ({λbs} t s=1}) ≤ 2B ′ ({λ ∗ s} t s=1). 382 Therefore, combining Theorem G.2.1 and Lemma G.2.2 gives the proof of Theorem 4.3.1. G.2.2 Proofs for Implications of Theorem 4.3.1 In this subsection, we prove the corollaries presented in Section 4.3. Proof of Corollary 4.3.1. Since Theorem 4.3.1 holds for any sequence of {λ ∗ t } T t=1, in particular, we choose λ ∗ 1 = p (L + 1)dT 3/4 and λ ∗ t = 0 for all t ≥ 2, then we obtain that Reg Eq. (4.10) ≤ Oe p d(L + 1)T 3 4 , which completes the proof. Proof of Corollary 4.3.2. Choose λ ∗ t = 0 for all t ≥ 1 and by Theorem 4.3.1, we obtain that Reg Eq. (4.10) ≤ Oe X T t=1 d 2 3 (L + 1) 2 3 σ 1 3 t 1 3 ! = Oe((L + 1) 2 3 d 2 3 T 2 3 σ − 1 3 ), which completes the proof. Proof of Corollary 4.3.3. In the first environment where there are M rounds such that the loss function is 0-strongly convex. 
In order to make the right hand side of Eq. (4.10) the largest, we have στ = 0 when τ ∈ [M] and στ = σ when τ > M. Set λ ∗ t = 0 for all t ≥ 2, then Theorem 4.3.1 implies that (omitting the Oe(d 4 3 νT 1 3 ) low-order term) Reg ≤ Oe λ ∗ 1 + X T t=1 d 2 3 (L + 1) 2 3 (σ1:t + λ ∗ 1 ) 1 3 ! ≤ Oe λ ∗ 1 + λ ∗ 1 − 1 3M d 2 3 (L + 1) 2 3 + d 2 3 (L + 1) 2 3 min n σ − 1 3 (T − M) 2 3 , λ∗ 1 − 1 3 (T − M) o 383 ≤ Oe p d(L + 1)M 3 4 + d 2 3 (L + 1) 2 3 min ( σ − 1 3 (T − M) 2 3 , T − M d 1 6 (L + 1) 1 6M 1 4 )! , where the last inequality is by choosing λ ∗ 1 = p d(L + 1)M 3 4 . This proves the first result. Consider the second type of environment where the first T−M rounds are σ-strongly convex functions and the remaining rounds are 0-strongly convex functions. Still set λ ∗ t = 0 for all t ≥ 2 and we have Reg ≤ Oe λ ∗ 1 + X T t=1 d 2 3 (L + 1) 2 3 (σ1:t + λ ∗ 1 ) 1 3 ! ≤ Oe λ ∗ 1 + d 2 3 (L + 1) 2 3 min n λ ∗ 1 − 1 3 (T − M), σ− 1 3 (T − M) 2 3 o + d 2 3M(L + 1) 2 3 (λ ∗ 1 + σ(T − M)) 1 3 ! . When T − M = Θ(T), we have Reg ≤ Oe λ ∗ 1 + d 2 3 (L + 1) 2 3 min n λ ∗ 1 − 1 3 T, σ− 1 3 T 2 3 o + d 2 3M(L + 1) 2 3 (λ ∗ 1 + σT) 1 3 ! ≤ Oe λ ∗ 1 + d 2 3 T(L + 1) 2 3 (σT) 1 3 ! ≤ Oe d 2 3 T(L + 1) 2 3 σ 1 3 (T − M) 1 3 ! , where the last inequality is by choosing λ ∗ 1 = 0. When T −M = o(T), we have M = Θ(T). Furthermore, when λ1 ≤ σ(T − M), we have λ ∗ 1 − 1 3 (T − M) ≥ σ − 1 3 (T − M) 2 3 and therefore, Reg ≤ Oe λ ∗ 1 + d 2 3 (L + 1) 2 3 σ − 1 3 T 2 3 + d 2 3 T(L + 1) 2 3 (λ ∗ 1 + σ(T − M)) 1 3 ! ≤ Oe d 2 3 (L + 1) 2 3 T σ 1 3 (T − M) 1 3 ! , where the last inequality is by choosing λ ∗ 1 = 0. When λ ∗ 1 ≥ σ(T − M), we have λ ∗ 1 − 1 3 (T − M) ≤ σ − 1 3 (T − M) 2 3 and therefore, Reg ≤ Oe λ ∗ 1 + d 2 3 λ ∗ 1 − 1 3 (L + 1) 2 3 T + d 2 3 (L + 1) 2 3 T (λ ∗ 1 + σ(T − M)) 1 3 ! ≤ Oe λ ∗ 1 + d 2 3 (L + 1) 2 3 λ ∗ 1 − 1 3 T 384 ≤ Oe σ(T − M) + p d(L + 1)T 3 4 , where the last inequality is by choosing λ ∗ 1 = max n σ(T − M), p d(L + 1)T 3 4 o . Combining the two cases, we have Reg ≤ Oe min ( d 2 3 T(L + 1) 2 3 σ 1 3 (T − M) 1 3 , σ(T − M) + p d(L + 1)T 3 4 )! , leading to the second conclusion. Proof of Corollary 4.3.4. Since Theorem 4.3.1 holds for any sequence of {λ ∗ t } T t=1, in particular, we choose λ ∗ t = 0 for all t ≥ 2 and set λ ∗ 1 = (L + 1)µ0 d µ1 T µ2 , we obtain that (again omitting the low-order term) Reg ≤ Oe λ ∗ 1 + X T t=1 d 2 3 (L + 1) 2 3 (t 1−α + λ ∗ 1 ) 1 3 ! ≤ Oe (L + 1)µ0 d µ1 T µ2 + d 2 3 (L + 1) 2 3 min (L + 1)− µ0 3 d − µ1 3 T 1− µ2 3 , T 2+α 3 . First, the above bound can be upper bounded by Reg ≤ Oe (L + 1)µ0 d µ1 T µ2 + d 2−µ1 3 (L + 1) 2−µ0 3 T 1− µ2 3 = Oe p d(L + 1)T 3 4 , where the last equality is true by choosing µ0 = 1 2 , µ1 = 1 2 and µ2 = 3 4 . Second, when α ∈ [0, 1 4 − 1 2 logT (L + 1) − 1 2 logT d], we choose µ0 = µ1 = µ2 = 0 and have Reg ≤ Oe d 2 3 (L + 1) 2 3 T 2+α 3 . Again, we emphasize that the setting of λ ∗ 1 is required in the analysis only and will not affect the algorithmic procedures. Combining both situations finishes the proof. 385 G.3 Self-concordant Barrier Properties In this section, we list several basic definitions and some important properties related to self-concordant barriers. Some of them are already stated in Appendix E.2.2 but for the ease of reading, we also include them in this section. Definition G.3.1 (Self-Concordant Functions). Let Ω ⊆ R d be a closed convex domain with a nonempty interior int(Ω). 
A function $R : \operatorname{int}(\Omega) \to \mathbb{R}$ is called self-concordant on $\Omega$ if (i) $R$ is a three-times continuously differentiable convex function that approaches infinity along any sequence of points approaching $\partial\Omega$; and (ii) $R$ satisfies the differential inequality: for every $h \in \mathbb{R}^d$ and $x \in \operatorname{int}(\Omega)$,
\[
|\nabla^3 R(x)[h,h,h]| \le 2\big(\nabla^2 R(x)[h,h]\big)^{3/2},
\]
where the third-order differential is defined as
\[
\nabla^3 R(x)[h,h,h] \triangleq \frac{\partial^3}{\partial t_1\,\partial t_2\,\partial t_3}\, R\big(x + t_1 h + t_2 h + t_3 h\big)\Big|_{t_1 = t_2 = t_3 = 0}.
\]
Given a real $\nu \ge 1$, $R$ is called a $\nu$-self-concordant barrier ($\nu$-SCB) for $\Omega$ if $R$ is self-concordant on $\Omega$ and, in addition, for every $h \in \mathbb{R}^d$ and $x \in \operatorname{int}(\Omega)$,
\[
|\nabla R(x)[h]| \le \nu^{1/2}\big(\nabla^2 R(x)[h,h]\big)^{1/2}.
\]
Given a self-concordant function $R$ on $\Omega$, for any $h \in \mathbb{R}^d$ the induced local norm and its dual are defined as
\begin{align}
\|h\|_{x} \triangleq \|h\|_{\nabla^2 R(x)} = \sqrt{h^{\top}\nabla^2 R(x)h}, \qquad \|h\|^{*}_{x} \triangleq \|h\|^{*}_{\nabla^2 R(x)} = \sqrt{h^{\top}\big(\nabla^2 R(x)\big)^{-1}h}. \tag{G.32}
\end{align}

Below, we present several key technical lemmas regarding self-concordant functions.

Lemma G.3.1 (Proposition 2.3.2 of [120]). Let $R : \operatorname{int}(\Omega) \to \mathbb{R}$ be a $\nu$-self-concordant barrier over the closed convex set $\Omega \subseteq \mathbb{R}^d$. Then for any $x, y \in \operatorname{int}(\Omega)$, we have $R(y) - R(x) \le \nu\log\frac{1}{1-\pi_x(y)}$, where $\pi_x(y) \triangleq \inf\{t \ge 0 \mid x + t^{-1}(y - x) \in \Omega\}$ is the Minkowski function of $\Omega$ with pole at $x$, which always takes values in $[0,1]$.

Lemma G.3.2 (Theorem 2.1.1 of [120]). Let $\psi$ be a self-concordant function on the closed convex body $\Omega \subseteq \mathbb{R}^d$ with non-empty interior. Then
\begin{align}
\|h\|_{\nabla^2\psi(x')} \ge \|h\|_{\nabla^2\psi(x)}\big(1 - \|x - x'\|_{\nabla^2\psi(x)}\big) \tag{G.33}
\end{align}
holds for any $h \in \mathbb{R}^d$ and any $x \in \operatorname{int}(\Omega)$ with $x' \in \mathcal{E}_1(x)$, where $\mathcal{E}_1(x) \triangleq \{y \in \mathbb{R}^d \mid \|y - x\|_x \le 1\}$ denotes the Dikin ellipsoid of $\psi$ at $x$, and we always have $\mathcal{E}_1(x) \subseteq \Omega$.

Lemma G.3.3 (Theorem 2.5.1 of [120]). For each closed convex body $\Omega \subseteq \mathbb{R}^d$, there exists an $O(d)$-self-concordant barrier on $\Omega$.

In the following, we present the definition and properties of a normal barrier.

Definition G.3.2 (Normal Barriers). Let $\mathcal{K} \subseteq \mathbb{R}^d$ be a closed and proper convex cone and let $\bar\nu \ge 1$. A function $\psi : \operatorname{int}(\mathcal{K}) \to \mathbb{R}$ is called a $\bar\nu$-logarithmically homogeneous self-concordant barrier (or simply a $\bar\nu$-normal barrier) on $\mathcal{K}$ if it is self-concordant on $\operatorname{int}(\mathcal{K})$ and is logarithmically homogeneous with parameter $\bar\nu$, namely,
\[
\psi(tw) = \psi(w) - \bar\nu\ln t, \qquad \forall w \in \operatorname{int}(\mathcal{K}),\ t > 0.
\]

Lemma G.3.4 (Proposition 5.1.4 of [120]). Suppose $\psi$ is a $\nu$-self-concordant barrier on $\Omega \subseteq \mathbb{R}^d$. Then the function
\[
\Psi(w, b) \triangleq 400\Big(\psi\big(\tfrac{w}{b}\big) - 2\nu\ln b\Big)
\]
is a $\bar\nu$-normal barrier on $\operatorname{con}(\Omega) \subseteq \mathbb{R}^{d+1}$ with $\bar\nu = 800\nu$, where $\operatorname{con}(\Omega) = \{0\} \cup \{(w, b) \mid \frac{w}{b} \in \Omega,\ w \in \mathbb{R}^d,\ b > 0\}$ is the conic hull of $\Omega$ lifted to $\mathbb{R}^{d+1}$ (by appending a dummy variable $1$ as the last coordinate).

Lemma G.3.5 (Proposition 2.3.4 of [120]). Suppose $\psi$ is a $\bar\nu$-normal barrier on $\Omega \subseteq \mathbb{R}^d$. Then for any $x, y \in \operatorname{int}(\Omega)$, we have (1) $\|x\|^2_{\nabla^2\psi(x)} = x^{\top}\nabla^2\psi(x)x = \bar\nu$; (2) $\nabla^2\psi(x)x = -\nabla\psi(x)$; (3) $\psi(y) \ge \psi(x) - \bar\nu\ln\frac{-\langle\nabla\psi(x),\, y\rangle}{\bar\nu}$; (4) $\|\nabla\psi(x)\|^2_{\nabla^{-2}\psi(x)} = \bar\nu$.
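To make Lemma G.3.4 and Lemma G.3.5 concrete, the following minimal sketch builds the lifted barrier $\Psi(w,b) = 400(\psi(w/b) - 2\nu\ln b)$ for the toy case $\Omega = [-1,1]$ with the logarithmic barrier $\psi(x) = -\ln(1-x) - \ln(1+x)$ (a $2$-self-concordant barrier), and numerically checks logarithmic homogeneity with $\bar\nu = 800\nu$ as well as properties (1) and (2) of Lemma G.3.5 via finite differences. The choice of $\Omega$, the test point, and the step sizes are all illustrative.

```python
import numpy as np

nu = 2.0                                   # psi below is a 2-self-concordant barrier for [-1, 1]
nu_bar = 800.0 * nu                        # parameter of the lifted normal barrier (Lemma G.3.4)

def psi(x):                                # logarithmic barrier of Omega = [-1, 1]
    return -np.log(1.0 - x) - np.log(1.0 + x)

def Psi(z):                                # lifted barrier on con(Omega): z = (w, b), w/b in (-1, 1), b > 0
    w, b = z
    return 400.0 * (psi(w / b) - 2.0 * nu * np.log(b))

def grad(f, z, h=1e-5):                    # central-difference gradient
    z = np.asarray(z, dtype=float)
    g = np.zeros_like(z)
    for i in range(len(z)):
        e = np.zeros_like(z); e[i] = h
        g[i] = (f(z + e) - f(z - e)) / (2.0 * h)
    return g

def hess(f, z, h=1e-4):                    # central-difference Hessian built from the gradient
    z = np.asarray(z, dtype=float)
    H = np.zeros((len(z), len(z)))
    for i in range(len(z)):
        e = np.zeros_like(z); e[i] = h
        H[:, i] = (grad(f, z + e) - grad(f, z - e)) / (2.0 * h)
    return 0.5 * (H + H.T)

z = np.array([0.3, 1.5])                   # test point: w/b = 0.2 lies in (-1, 1)

# Logarithmic homogeneity: Psi(t z) = Psi(z) - nu_bar * ln(t).
t = 1.7
print("homogeneity gap:", Psi(t * z) - (Psi(z) - nu_bar * np.log(t)))

# Lemma G.3.5, (1) and (2): z^T ∇²Psi(z) z = nu_bar and ∇²Psi(z) z = -∇Psi(z).
g, H = grad(Psi, z), hess(Psi, z)
print("z^T H z :", z @ H @ z, "(should be about", nu_bar, ")")
print("H z + g :", H @ z + g, "(should be about zero, up to finite-difference error)")
```

Both identities follow from logarithmic homogeneity alone (Euler's identity applied to the degree-$(-1)$ homogeneous gradient), so the same check works for any barrier constructed via Lemma G.3.4.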
G.4 Additional Lemmas

G.4.1 FTRL Lemma

For completeness, we present the following general result for FTRL-type algorithms.

Lemma G.4.1. Let $\Omega \subseteq \mathbb{R}^d$ be a closed and convex feasible set, and denote by $R_t : \Omega \to \mathbb{R}$ the convex regularizer and by $f_t : \Omega \to \mathbb{R}$ the convex online functions. Denote by $F_t(x) = R_t(x) + \sum_{s=1}^{t-1} f_s(x)$ the FTRL objective, so that the FTRL update rule is specified as $x_t \in \operatorname{argmin}_{x\in\Omega} F_t(x)$. Then, for any $u \in \Omega$ we have
\begin{align}
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u) \le R_{T+1}(u) - R_1(x_1) + \sum_{t=1}^{T}\nabla f_t(x_t)^{\top}(x_t - x_{t+1}) - \sum_{t=1}^{T} D_{F_t + f_t}(x_{t+1}, x_t) + \sum_{t=1}^{T}\big(R_t(x_{t+1}) - R_{t+1}(x_{t+1})\big), \tag{G.34}
\end{align}
where $D_{F_t+f_t}(\cdot,\cdot)$ denotes the Bregman divergence induced by the function $F_t + f_t$.

Proof. It is easy to verify that the following equation holds for any comparator $u \in \Omega$:
\[
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u) = R_{T+1}(u) - R_1(x_1) + F_{T+1}(x_{T+1}) - F_{T+1}(u) + \sum_{t=1}^{T}\big(F_t(x_t) - F_{t+1}(x_{t+1}) + f_t(x_t)\big).
\]
Moreover, we have
\begin{align*}
F_t(x_t) - F_{t+1}(x_{t+1}) + f_t(x_t) &= F_t(x_t) + f_t(x_t) - \big(F_t(x_{t+1}) + f_t(x_{t+1})\big) + R_t(x_{t+1}) - R_{t+1}(x_{t+1}) \\
&= \langle\nabla F_t(x_t) + \nabla f_t(x_t),\, x_t - x_{t+1}\rangle - D_{F_t+f_t}(x_{t+1}, x_t) + R_t(x_{t+1}) - R_{t+1}(x_{t+1}) \\
&\le \langle\nabla f_t(x_t),\, x_t - x_{t+1}\rangle - D_{F_t+f_t}(x_{t+1}, x_t) + R_t(x_{t+1}) - R_{t+1}(x_{t+1}),
\end{align*}
where the last inequality holds by the first-order optimality condition of $x_t \in \operatorname{argmin}_{x\in\Omega} F_t(x)$, which gives $\langle\nabla F_t(x_t), x_t - x_{t+1}\rangle \le 0$; in addition, $F_{T+1}(x_{T+1}) - F_{T+1}(u) \le 0$ by the optimality of $x_{T+1}$. Hence, combining the above equations finishes the proof.

G.4.2 Relations among strong convexity, smoothness and Lipschitzness

In this section, we discuss the relations among strong convexity, smoothness, and Lipschitzness. First, we point out a minor technical flaw that appeared in two previous works on BCO [132, 72]. In both works, the authors use the statement that a convex function $f$ that is $\beta$-smooth and has values bounded in $[-1, 1]$ has Lipschitz constant no more than $2\beta + 1$ when $\max_{x,x'\in\mathcal{X}}\|x - x'\|_2 \in [2, 4]$. However, this is not correct, as the following counterexample shows.

Example 1. Consider the following function in a $2$-dimensional space: $f(x, y) = Gy$, where $G > 1$ can be arbitrarily large and the first coordinate does not affect the function value. The feasible domain is defined as $\mathcal{X} = \{(x, y) \mid x \in [-1, 1],\ y \in [-\frac{1}{G}, \frac{1}{G}]\}$, whose diameter lies in $[2, 4]$. It is direct to see that the function $f$ is $0$-smooth and has values bounded in $[-1, 1]$. However, its Lipschitz constant is $G$, which can be arbitrarily large. Hazan and Levy [72] and Saha and Tewari [132] use this property to bound the term Comparator Bias in Eq. (G.7). We fix this by using the property of convexity instead.

Next, we discuss the relationship between strong convexity and Lipschitzness. Specifically, the following lemma shows that for a convex function $f$ that is $L$-Lipschitz and defined over a bounded domain with diameter $D$, its strong convexity parameter $\sigma$ is upper bounded by $\frac{4L}{D}$.

Lemma G.4.2. If a convex function $f : \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz and $\sigma$-strongly convex, and has bounded domain diameter $\max_{x,x'\in\mathcal{X}}\|x - x'\|_2 = D$, then we have $\sigma \le \frac{4L}{D}$.

In fact, for any $x, y \in \mathcal{X}$ we have
\[
L\|x - y\|_2 \ge f(x) - f(y) \ge \nabla f(y)^{\top}(x - y) + \frac{\sigma}{2}\|x - y\|_2^2.
\]
Choosing $y = \operatorname{argmin}_{x\in\mathcal{X}} f(x)$, we have
\[
L\|x - y\|_2 \ge f(x) - f(y) \ge \nabla f(y)^{\top}(x - y) + \frac{\sigma}{2}\|x - y\|_2^2 \ge \frac{\sigma}{2}\|x - y\|_2^2.
\]
Therefore, $\sigma \le \frac{2L}{\|x - y\|_2}$ for any $x \in \mathcal{X}$, which implies $\sigma \le \frac{4L}{D}$. This is because we can choose $x_1, x_2 \in \mathcal{X}$ such that $\|x_1 - x_2\|_2 = D = \max_{x,x'\in\mathcal{X}}\|x - x'\|_2$. Then we have $\|x_1 - y\|_2 + \|x_2 - y\|_2 \ge \|x_1 - x_2\|_2 = D$, which means that either $\|x_1 - y\|_2 \ge \frac{D}{2}$ or $\|x_2 - y\|_2 \ge \frac{D}{2}$. This shows that when $f$ is both $\sigma$-strongly convex and $L$-Lipschitz, we have $\sigma \le \frac{4L}{D}$.
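A quick numerical illustration of Example 1 and Lemma G.4.2 follows; the specific values of $G$, $\sigma$, and $D$ below are arbitrary choices for the sketch.

```python
import numpy as np

# Example 1: f(x, y) = G * y on X = [-1, 1] x [-1/G, 1/G].
G = 1e6
corners = np.array([[sx, sy / G] for sx in (-1.0, 1.0) for sy in (-1.0, 1.0)])
values = G * corners[:, 1]                          # f is linear, so its extremes are at the corners
diameter = max(np.linalg.norm(p - q) for p in corners for q in corners)

print("value range     :", values.min(), "to", values.max())   # stays in [-1, 1]
print("domain diameter :", round(diameter, 6))                  # lies in [2, 4]
print("Lipschitz const :", np.linalg.norm([0.0, G]))            # ||grad f|| = G, arbitrarily large
# f is linear, hence 0-smooth, yet its Lipschitz constant is G: boundedness plus smoothness
# alone do not control the Lipschitz constant on such a thin domain.

# Lemma G.4.2: a sigma-strongly-convex, L-Lipschitz function on a domain of diameter D
# must satisfy sigma <= 4L / D.  Check on f(x) = (sigma/2) ||x||^2 over a ball of radius D/2.
sigma, D = 3.0, 2.0
L = sigma * (D / 2.0)            # largest gradient norm on the ball, i.e., the Lipschitz constant
print("sigma           :", sigma)
print("4L / D          :", 4.0 * L / D, "(upper bound from Lemma G.4.2)")
```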
Abstract
In recent years, online learning, or data-driven sequential decision making, has become a central component of Artificial Intelligence and has been widely applied in many real-world applications. Specifically, in online learning the learner interacts with an unknown environment and learns the model on the fly, which is more challenging than the classic offline learning setting, where the dataset is available to the learner at the beginning of the learning process. In this thesis, we focus on designing online learning algorithms with two pivotal characteristics: robustness and adaptivity.
Motivated by the unpredictable corruptions and noise present in real-world online learning applications such as e-commerce recommendation systems, robustness is an important and desirable property: it means that the designed algorithm is guaranteed to perform well even in highly adversarial environments. Adaptivity complements robustness by enhancing performance in benign environments. In broader terms, adaptivity means that the designed algorithm automatically scales with intrinsic quantities that reflect the difficulty of the problem.
To achieve adaptivity and robustness, this thesis utilizes three methodologies: regularization, exploration, and aggregation. Regularization has been widely used in machine learning to control the dynamics of the decisions, which is especially important when facing a possibly adversarial environment. In online learning problems, the learner can very often observe only partial information about the environment, making an appropriate exploration method crucial. Aggregation, a natural idea for achieving adaptivity, combines multiple algorithms that work well in different environments. Though intuitive, it requires non-trivial algorithm design for different online learning problems.
Using these methodologies, we design robust and adaptive learning algorithms for a wide range of online learning problems. We first consider the problem of multi-armed bandits with feedback graphs, which includes the classic full-information expert problem, multi-armed bandits, and beyond. We then consider more complex problems, including linear bandits and convex bandits, which involve an infinite number of actions. We hope that the techniques and algorithms developed in this thesis can help improve current online learning algorithms for real-world applications.