USING FORMAL OPTIMIZATION TECHNIQUES TO IMPROVE THE PERFORMANCE OF MOBILE AND DATA CENTER NETWORKS

Weng Chon Ao

A thesis submitted to the FACULTY OF THE USC GRADUATE SCHOOL for the degree of Doctor of Philosophy in Electrical Engineering at University of Southern California, December 2017

Abstract

Mobile edge cloud networking is a new paradigm that provides computing and storage capabilities at the edge of pervasive radio access networks, in close proximity to mobile users. One specific example is the cloud radio access network (C-RAN), which has been proposed to centralize baseband units as an edge cloud data center that provides computation resources and content storage or caching, thus addressing both cost and performance concerns of cellular systems. In this thesis, various techniques are applied to analyze and optimize resource allocation and scheduling in such mobile and data center networks.

Introduction

The mobile edge cloud network architecture is shown in Fig. 1, where an edge cloud connects to (and services) a number of neighboring base stations, and edge clouds are connected to the backend cloud. Fig. 1 also shows the organization of this thesis. In Chapter 1, we consider the online multi-tier multi-cell user association problem. Our proposed online algorithm has the best performance bound to date. In Chapter 2, we develop a novel architecture for fast content delivery via distributed caching and small cell cooperation. Our proposed architecture allows a joint optimization of the cache content allocation in the application layer and the cooperative transmissions in the physical layer. In Chapter 3, we design a data-locality-aware user grouping algorithm for multi-user beamforming precoding in a C-RAN. In order to satisfy the delay constraint for precoding, the number of data transfers across racks is regularized according to the congestion level in the edge cloud data center network.
In Chapter 4, we propose an analytical framework that allows a joint optimization of the workload distribution and capacity augmentation to minimize the job completion time in hybrid data center networks.

Figure 1: Mobile edge cloud network architecture.

Contents

1 An Efficient Approximation Algorithm for Online Multi-Tier Multi-Cell User Association
  1.1 Introduction
  1.2 Prior Work and Contribution
    1.2.1 Static user association
    1.2.2 Online user association
  1.3 System Model
    1.3.1 Network topology
    1.3.2 Data rates in the massive MIMO regime
    1.3.3 Data rates in the MU-MIMO full multiplexing gain regime
    1.3.4 User scheduling
    1.3.5 Optimal user-BS association
    1.3.6 Special cases
  1.4 Online association algorithms
    1.4.1 User-centric online algorithm
    1.4.2 Cell-centric randomized online algorithm
    1.4.3 Cell-centric deterministic online algorithm
  1.5 Performance analysis
    1.5.1 Performance bounds
    1.5.2 Rationale of equal time allocation
  1.6 Extensions
    1.6.1 Heterogeneous users and user priority
    1.6.2 Departing users
    1.6.3 Base station cooperation
  1.7 Simulation results
    1.7.1 Two-tier heterogeneous cellular network in massive MIMO scenario
    1.7.2 The effect of biasing
    1.7.3 Multi-channel WiFi network in MU-MIMO full multiplexing gain scenario
    1.7.4 Departing users
  1.8 Conclusion
  1.9 Appendix

2 Fast Content Delivery via Distributed Caching and Small Cell Cooperation
  2.1 Introduction
  2.2 Prior work and contributions
  2.3 System model
    2.3.1 Topology
    2.3.2 Caching strategies and cache-driven cooperation policies
    2.3.3 Channel model
  2.4 Performance analysis in terms of rates
    2.4.1 Optimal and randomized caching under MRT
    2.4.2 Optimal and threshold-based caching under ZFBF
    2.4.3 Joint MRT-ZFBF
  2.5 Performance analysis in terms of delay
    2.5.1 Optimal caching under MRT
    2.5.2 Threshold-based caching under ZFBF
  2.6 Numerical results
    2.6.1 Data rates under diversity gains
    2.6.2 Data rates under multiplexing gains
    2.6.3 Data rates under joint MRT-ZFBF
    2.6.4 Delay analysis
  2.7 Practical considerations
  2.8 Conclusion
  2.9 Appendix
    2.9.1 Co-channel interference
    2.9.2 Proof of Proposition 2.1
    2.9.3 Proof of Theorem 2.2
    2.9.4 Proof of Theorem 2.3
    2.9.5 Proof of Theorem 2.4
    2.9.6 Proof of Theorem 2.5

3 Data-locality-aware User Grouping in Cloud Radio Access Networks
  3.1 Introduction
  3.2 Prior Work
  3.3 System model
    3.3.1 User grouping
    3.3.2 Multi-user ZFBF precoding and the need to transfer data
    3.3.3 Simple motivating examples
  3.4 Problem formulation and analysis
    3.4.1 Randomized/CSI-based user grouping
    3.4.2 Data-locality-aware user grouping
    3.4.3 Joint CSI- and data-locality-aware user grouping
    3.4.4 Regularized resource block minimization
    3.4.5 Regularized spectral efficiency maximization
  3.5 Extensions
    3.5.1 The case with replicas
    3.5.2 The case with multiple BSs
    3.5.3 The case with multi-antenna users
  3.6 Simulation and numerical results
    3.6.1 Randomized/CSI-based vs data-locality-aware user grouping
    3.6.2 Joint CSI- and data-locality-aware user grouping
    3.6.3 Regularized resource block minimization vs regularized spectral efficiency maximization
    3.6.4 Accuracy of the 2-approximation algorithm
  3.7 Conclusion

4 Joint Workload Distribution and Capacity Augmentation in Hybrid Datacenter Networks
  4.1 Introduction
  4.2 Related work
  4.3 System Model
  4.4 Workload not amenable to pipelining
    4.4.1 Problem formulation
    4.4.2 Proposed algorithm
  4.5 Workload amenable to pipelining
    4.5.1 Problem formulation
    4.5.2 Proposed algorithm
    4.5.3 Special cases
  4.6 Network Cost for Data Transmission
    4.6.1 Workload not amenable to pipelining
    4.6.2 Workload amenable to pipelining
  4.7 Simulation results
    4.7.1 Varying wired link capacity
    4.7.2 Varying total wireless bandwidth and wired link capacity
    4.7.3 Varying service rate
    4.7.4 Varying service rate and workload size
    4.7.5 Varying the number of racks
    4.7.6 Job completion time-data transfer trade-off
  4.8 Conclusion
  4.9 Appendix

Bibliography

Chapter 1

An Efficient Approximation Algorithm for Online Multi-Tier Multi-Cell User Association

The constantly growing wireless bandwidth demand is pushing wireless networks to multi-tier architectures consisting of a macrocell tier and a number of dense small cell deployment tiers. In such a multi-tier multi-cell environment, the classic problem of associating users to base stations becomes both more challenging and more critical to the overall network performance. Most previous analytical work is focused on designing static user-cell association algorithms which, to achieve optimality, are periodically applied whenever there are new user arrivals, thus potentially inducing a large number of re-associations for previously arrived users. On the other hand, practical online algorithms that do not allow any such user re-association are often based on heuristics and may not have any performance guarantees. In this work, we propose online algorithms for the multi-tier multi-cell user association problem that have provable performance guarantees which improve previously known bounds by a sizable amount. The proposed algorithms are motivated by online combinatorial auctions, while capturing and leveraging the relative sparsity of choices in wireless networks as compared to auction setups. Our champion algorithm is a 1/(2 - 1/a) approximation algorithm, where a is the maximum number of feasible associations for a user and is, in general, small due to path loss. Our analysis takes into account state-of-the-art wireless technologies such as massive and multiuser MIMO, and practical aspects of the system such as the fact that highly mobile users have a preference to connect to larger cell tiers to keep the signaling overhead low.
In addition to establishing formal performance bounds, we also conduct simulations under realistic assumptions which establish the superiority of the proposed algorithm over existing approaches under real-world scenarios. [1, 2]

1.1 Introduction

To support the tremendous growth of wireless data traffic fueled by popular applications like video streaming, enterprise networks consist of dense deployments of access points (APs), while a dense deployment of small cells (e.g., microcells and femtocells) under the coverage of macrocells has been proposed for future cellular networks in the upcoming 5G standard [3]. Such small cells could operate at a different frequency spectrum than macrocells (e.g., millimeter wave systems at 60 GHz [4]), and the performance of the overall cellular network can be sizably improved by this heterogeneous multi-tier architecture [5-8]. In addition, it is envisioned that antenna arrays will be deployed at cells to provide large spatial multiplexing gain with low-complexity linear precoding via a large number of antenna elements (massive MIMO) and/or via multiuser MIMO schemes, see, for example, [9-12].

In the context of such a multi-tier, multi-cell MIMO-enabled network, users typically have multiple choices when it comes to associating with a base station (BS), and the association depends on many factors such as the quality of the received signal from the base stations at each user, the system load at the base stations, the user mobility, etc. Further, the fundamental problem of how to properly associate users with base stations so that the overall system performance is maximized is both more complex and more critical in such deployments, because dense small-cell deployments may have significant intra- and inter-tier interference and may operate in the interference-limited rather than the power-limited regime.
There is a large body of prior work in academia on the user-BS association problem, which is usually formulated as a static optimization problem assuming full knowledge of the information of all users (e.g., the number of users and the users' rates), see, for example, [13-27] and references therein. In an effort to make such optimization problems more tractable, researchers have resorted to relaxation, which leads to fractional solutions (where users are associated with multiple base stations and associate with each one of them for a fraction of time). What is more, to account for the dynamic nature of user arrivals while guaranteeing good performance, such static approaches are periodically applied, potentially inducing a large number of re-associations for previously arrived users. However, real-world systems use neither fractional associations nor re-associations (except for connectivity reasons, e.g., mobile users' handoffs).

On the other hand, practical online algorithms that are used in the industry are based on simple heuristics which waste precious system capacity and lead to suboptimal performance [17], while offering no performance guarantees. For example, by default, in today's cellular/WiFi networks users simply associate with the BS/AP from which they receive the strongest signal. And, some manufacturers of dense enterprise WiFi networks have recently attempted to impose some sort of load balancing by capping the maximum number of users an access point may associate with [28], while the LTE standard allows the introduction of a bias to offload users from macrocells to small cells when the latter are present, even when the signal from the macrocells is stronger.

In this work, we propose novel online algorithms for the multi-tier user-BS association problem (the single-tier user-BS association problem is obviously a special case), which are both practical (using neither fractional- nor re-associations) and provably near-optimal.
The algorithms are motivated by online combinatorial auctions (bidders bid on objects) [29-31], where the base stations act as bidders and the users act as objects. By applying properties of wireless systems to the analysis of the online algorithms for combinatorial auctions, we are able to prove a performance guarantee which is close to the optimal. Specifically, we exploit the fact that a user can only receive and decode reference signals from a small number of nearby base stations due to path loss and interference. Therefore, the candidate set of feasible associations of a user is small, whereas in combinatorial auctions each bidder is in general assumed to have a positive valuation for every object. It turns out that by taking advantage of such "sparsity" together with introducing random decisions which favor "better" association candidates, our champion online algorithm achieves at least 1/(2 - 1/a) of the optimal, which, for typical values of a, say 2 or 3, yields about 60-67% of the optimal performance guaranteed. To the best of our knowledge, this is the tightest known bound achievable by online association algorithms, see Section 1.2 for more details.

The remainder of this work is organized as follows. We present related work and highlight our contributions in Section 1.2. Section 1.3 describes the system model, where we consider both the massive and multiuser MIMO scenarios and formally state the user association problem. In Section 1.4, we present our online multi-tier multi-cell user association algorithms. The performance analysis of the algorithms is presented in Section 1.5. We discuss how the proposed algorithm can be applied in various practical scenarios of interest in Section 1.6. Section 1.7 presents numerical and simulation results for a number of real-world scenarios. Last, Section 1.8 concludes the work.
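As a quick sanity check on the quoted figures, the 60-67% range corresponds to evaluating the guarantee 1/(2 - 1/a) at small candidate-set sizes (a minimal sketch; the function name is ours, not the thesis's notation):

```python
def approximation_bound(a):
    """Performance guarantee 1/(2 - 1/a) for a candidate set of maximum size a >= 1."""
    return 1.0 / (2.0 - 1.0 / a)

# a = 1: only one feasible BS, so the online choice is trivially optimal.
# a = 2 and a = 3 give roughly 67% and 60% of the optimal, as quoted above;
# as a grows, the guarantee degrades toward 1/2.
print([round(approximation_bound(a), 3) for a in (1, 2, 3)])  # → [1.0, 0.667, 0.6]
```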
1.2 Prior Work and Contribution

We start with a discussion of prior work on static user-cell association schemes, which are usually applied periodically and induce re-associations of previously arrived users, followed by a discussion of prior work on online schemes, whose association decision for a user is irrevocably made at the time the user arrives.

1.2.1 Static user association

Static user association (also known as load balancing) has been well studied in the literature in the context of both WiFi networks and cellular networks, see, for example, [13-27] and references therein. In general, a topology with users and base stations/access points is given, and the association problem is formulated as an optimization problem. In the presence of new users arriving over time, the problem is solved from scratch each time a new user arrives, or at periodic intervals to reduce the overhead, inducing a potentially large number of re-associations.

In [13] the authors study the user-AP association problem ensuring a max-min fair bandwidth allocation. In [14] the authors perform joint AP channel selection and user association to minimize the user transmission delay. In [15], the authors associate users such that load balancing is achieved among APs. They achieve this by adjusting the power and thus the coverage of the APs. In [16], the authors propose a distributed user association policy that adapts to spatial traffic loads to achieve flow-level cell load balancing.

A recent overview of load balancing techniques in cellular networks can be found in [17]. In one of the works referred to therein [18], the authors formulate the user-BS association problem as an integer programming problem. After relaxation of the integral constraints, the problem is reduced to a convex optimization problem, and dual algorithms are developed to iteratively solve for the optimal.
While the relaxation leads to a plausible way to solve the optimization problem fast, it imposes unrealistic constraints, as users end up associating with multiple base stations, spending a fraction of their time associated with each of them. In [19], the user-BS association problem is investigated in the context of massive MIMO wireless networks. Under the time scale over which the large-scale channel coefficients remain constant, the association problem is formulated as a network utility maximization problem that gives the fraction of time a user associates with each base station. The problem is further extended to the case with base station cooperation in [20]. In [21, 22] the multi-tier user-BS association problem is analyzed using stochastic geometry, and in [23] a game-theoretic model is proposed to associate users with different radio access technologies. In [24], the authors propose approximation algorithms for the user-BS association problem to minimize the maximum total service time among all base stations. In [25], the authors study the user association problem in millimeter wave wireless networks operating at 60 GHz. Last, in [26, 27], the authors design an efficient auction-based algorithm for user-cell association in 60 GHz networks by exploiting the structure of the problem.

As already stated, to accommodate new user arrivals while maintaining high performance, the optimization in all this prior work is periodically applied, thus potentially inducing a large number of re-associations for previously arrived users.

1.2.2 Online user association

Contrary to the static case, there is less related work on designing online approximation algorithms for the user-BS association problem, where the association decision for a user is irrevocably made at the time of its arrival. In [32], the authors propose a heuristic online algorithm for dynamic user association.
In [33] the authors introduce a 1/8 approximation algorithm for online user-BS association to maximize the sum rate of the users under equal time-sharing scheduling and equal power allocation, and in [34] they introduce a 1/2 approximation algorithm to maximize the sum rate under a broadcast channel with receiver cooperation scenario and water-filling power allocation. Last, in [35] the authors derive an association algorithm aiming at minimizing the maximum load among all base stations. The performance bound of the proposed algorithm is proportional to the ratio of the minimum user rate over the maximum user rate; in real-world systems the maximum rate exceeds the minimum by a factor of more than 10, yielding an approximation bound a bit looser than 1/10.

In this work, we consider the online multi-tier multi-cell user association problem with the objective of maximizing the sum utility of the users, which can be written as the sum of "base station utility functions". A base station utility function is defined as the sum utility of its associated users. As a concrete example, we will analyze the logarithmic user utility (with respect to the data rate) with a bias of associating users with high mobility to tiers operating at low frequency with large cell coverage. Note that the logarithmic user utility captures the concept of proportional fairness [18]. In addition to the fact that proportional fairness is a good approximation of the operational point of today's networks, under mild assumptions it also yields a monotone and submodular base station utility, which renders the problem analytically tractable. Our proposed online algorithm is proved to be a 1/(2 - 1/a) approximation algorithm (compared against the optimal algorithm that allows re-associations for previously arrived users whenever a new user arrives), where the parameter a equals the maximum number of potential associations of a user. Note that the smaller the value of a, the tighter the bound.
(For a = 1 there is only one choice and there is no association decision to be made for a user.) Due to path loss, signal degradation, interference in the wireless medium, and the physical deployment of base stations, a is typically small, yielding a bound which is much tighter than the previous best known bound for an online association algorithm under realistic assumptions, and it is the tightest among all prior bounds.

1.3 System Model

1.3.1 Network topology

Let U = {1, 2, ..., M} be the set of users and the cardinality of U be M. Without loss of generality, we index the users according to their arrival to the system, i.e., user 1 arrives first and user M arrives last. Note that our proposed online algorithm does not need to know the total number of users M. In other words, the performance guarantee holds for any instant of user arrival m, m = 1, 2, ..., M. The users are just arriving online, and each user shall be associated upon arrival with one of the base stations.

We consider a multi-tier heterogeneous network with K tiers and we denote the set of tiers as K = {1, 2, ..., K}. We assume that there are N_k base stations (denoted as B_k = {1, 2, ..., N_k}) operating at tier k ∈ K. As a result, each base station is indexed by a tuple (j, k), k ∈ K, j ∈ B_k. The bandwidth of the spectrum band of the kth tier is denoted as W_k, and the spectrum bands of different tiers do not overlap. We consider a single-carrier system where each base station in the kth tier uses the whole spectrum band with bandwidth W_k for data transmission. (The analysis can be easily generalized to a multi-carrier system where the spectrum is divided into time-frequency slots (resource blocks), as well as to a multi-channel system with pre-allocated channels; see Section 1.7 for a multi-channel scenario.) Since the base stations in the same tier share the same spectrum band, their transmissions will interfere with each other.

Figure 1.1: A scenario of multi-tier user-BS association.
Last, note that if different tiers use the same spectrum band, e.g., as with today's macro and small cells in cellular networks, the only change would be to replace the interference from a single tier with the sum of the interference from all tiers using the same spectrum band, and the analysis would work the same way.

We consider the multi-tier cellular downlink user-BS association scenario depicted in Fig. 1.1. For each user i ∈ U, we define the set A_i as the set of base stations that user i can potentially be associated with. Specifically, A_i is the set of base stations from which the received SINR at user i is larger than some threshold θ (which is chosen to ensure successful decoding of data messages), i.e.,

  A_i ≜ {(j, k) : SINR_{i,j,k} ≥ θ, k ∈ K, j ∈ B_k},   (1.1)

where SINR_{i,j,k} is the received SINR at user i from base station (j, k). Note that if A_i is empty for user i, then user i cannot be associated with any base station and is excluded from the system. The value of the received SINR (and thus the data rate) depends on the signaling scheme that we use. In the following two subsections, we consider two popular options. Specifically, we consider the signaling and the data rates in the massive MIMO regime and in the multi-user (MU) MIMO full multiplexing gain regime, respectively.

1.3.2 Data rates in the massive MIMO regime

We assume that the system operates in the massive MIMO regime, in which each base station is equipped with a large antenna array, while the users are assumed to be equipped with a single antenna. Let L_{j,k} denote the number of antennas at base station (j, k) and S_{j,k} denote the number of users that base station (j, k) can simultaneously service on any given time slot. In other words, S_{j,k} is the spatial multiplexing gain of base station (j, k), and the ratio

  ν_{j,k} ≜ S_{j,k} / L_{j,k}   (1.2)

is the corresponding spatial load [19]. We assume Time Division Duplex (TDD) operation with reciprocity-based channel state estimation.
As a result, each base station antenna close to user i can estimate its downlink channel coefficient to user i from the uplink pilot transmitted by user i, facilitating the training of large antenna arrays with training overhead proportional to S_{j,k}.

In the massive MIMO regime, the value of SINR_{i,j,k} depends on the beamforming technique that we use. We consider the following two commonly used schemes: Conjugate Beamforming (CB) and Zero-Forcing BeamForming (ZFBF). Under conjugate beamforming, the SINR can be expressed as follows [9, 19]:

  SINR^CB_{i,j,k} = (P_{j,k} g²_{i,j,k} / ν_{j,k}) / (W_k N_0 + α Σ_{l ∈ B_k} P_{l,k} g_{i,l,k} + β Σ_{l ∈ B_k^{q(i)}, l ≠ j} P_{l,k} g²_{i,l,k} / ν_{l,k}),   (1.3)

where P_{j,k} is the transmission power of base station j at tier k, and g_{i,j,k} is the channel gain between user i and base station j at tier k that captures the effects of path loss and shadowing. The effect of small-scale fading is modeled through Rayleigh fading coefficients. Note that the Rayleigh fading coefficients do not appear in Eq. (1.3) since, in the massive MIMO regime, the effect of small-scale fading averages out over the antenna array (a fact commonly referred to as channel hardening). N_0 is the noise power spectral density, and α and β are normalization constants. The term Σ_{l ∈ B_k} P_{l,k} g_{i,l,k} is the interference received from base stations operating at the same tier. The set B_k^{q(i)} denotes the set of base stations operating at tier k using the pilot signal q(i) that is also used by user i. Hence, Σ_{l ∈ B_k^{q(i)}, l ≠ j} P_{l,k} g²_{i,l,k} / ν_{l,k} is the interference received from base stations operating at the same tier and using the same pilot signal as user i (commonly called pilot contamination). Under ZFBF, the SINR is given by [19]:

  SINR^ZFBF_{i,j,k} = ((1 − ν_{j,k}) P_{j,k} g²_{i,j,k} / ν_{j,k}) × [W_k N_0 + σ² P_{j,k} g_{i,j,k} + Σ_{l ∈ B_k, l ≠ j} P_{l,k} g_{i,l,k} + Σ_{l ∈ B_k^{q(i)}, l ≠ j} (1 − ν_{l,k}) P_{l,k} g²_{i,l,k} / ν_{l,k}]^{−1},   (1.4)

where 1/σ² is the SNR of the uplink pilot signal and the rest of the notation is as before.
Note that the intra-cell interference term σ² P_{j,k} g_{i,j,k} vanishes when the uplink SNR 1/σ² → ∞. The other two terms correspond to inter-cell interference and pilot contamination, respectively.

The data rate (bits/s) between user i ∈ U and base station (j, k) ∈ A_i is given by

  c_{i,j,k} = W_k log(1 + SINR^{CB/ZFBF}_{i,j,k}),  i ∈ U, (j, k) ∈ A_i,   (1.5)

where Shannon's formula is used; it can be extended to accommodate real-world features like modulation and coding tables, see, for example, [36].

1.3.3 Data rates in the MU-MIMO full multiplexing gain regime

We assume that base station (j, k) has L_{j,k} antennas and the users are equipped with a single antenna. We assume that ZFBF is used for MU-MIMO beamforming and we are able to use all degrees of freedom. In other words, base station (j, k) can provide a full multiplexing gain of order S_{j,k} = L_{j,k} to support L_{j,k} simultaneous data transmissions/streams to its associated users. Similar to [37, 38], under equal power allocation on each data stream and by using random matrix theory, we obtain the following deterministic approximation for the SINR:

  SINR^MU-MIMO_{i,j,k} = (P_{j,k} / L_{j,k}) g_{i,j,k} / (W_k N_0 + Σ_{l ∈ B_k, l ≠ j} P_{l,k} g_{i,l,k}).   (1.6)

Then, the data rate (bits/s) between user i and base station (j, k) is given by

  c_{i,j,k} = W_k log(1 + SINR^MU-MIMO_{i,j,k}),  i ∈ U, (j, k) ∈ A_i.   (1.7)

1.3.4 User scheduling

Let the association variable be x_{i,j,k}, where x_{i,j,k} = 1 if user i is associated with base station (j, k) ∈ A_i and x_{i,j,k} = 0 otherwise. The actual data rate that user i will receive, denoted r_{i,j,k}, depends on the user scheduling mechanism. We assume that when a base station is associated with multiple users and the number of associated users is larger than the spatial multiplexing gain S_{j,k}, equal time-sharing is used to schedule the users. (This is not only what happens in most real-world systems, but also the optimal schedule under our scenario; see Section 1.5.2 for further details and a proof.)
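For intuition, the rate expressions of Eqs. (1.5)-(1.7) are straightforward to evaluate numerically. The sketch below uses illustrative arguments and our own variable names (the per-stream power split P/L follows the equal power allocation assumed for Eq. (1.6)):

```python
import math

def shannon_rate(W, sinr):
    """Eqs. (1.5)/(1.7): data rate c = W * log2(1 + SINR), in bits/s for W in Hz."""
    return W * math.log2(1.0 + sinr)

def mu_mimo_sinr(P, L, g, interference, W, N0):
    """Eq. (1.6): equal per-stream power P/L on the serving link, over
    noise (W * N0) plus aggregate inter-cell interference."""
    return (P / L) * g / (W * N0 + interference)

# e.g., a 20 MHz band at SINR = 3 (about 4.8 dB) supports 40 Mbit/s.
print(shannon_rate(20e6, 3))  # → 40000000.0
```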
Specifically, we have

  r_{i,j,k} = c_{i,j,k},  if Σ_{l ∈ U} x_{l,j,k} ≤ S_{j,k},   (1.8)

and

  r_{i,j,k} = S_{j,k} c_{i,j,k} / Σ_{l ∈ U} x_{l,j,k},  if Σ_{l ∈ U} x_{l,j,k} > S_{j,k}.   (1.9)

Furthermore, the utility function of user i is denoted as U_i(r_{i,j,k}, v_i, z_{j,k}), which is a function of the actual data rate r_{i,j,k}, the speed of the user v_i, and the coverage of base station (j, k), z_{j,k}. The multi-tier user-BS association problem is to find the association such that the sum utility of the users is maximized.

1.3.5 Optimal user-BS association

Suppose a user arrives at time t and let U^t denote the set of users currently in the system. Given this user set, we consider the following static multi-tier user-BS association problem (denoted as Q^t), which can be used to obtain the optimal user-BS association configuration at time t:

  Q^t: maximize_{x_{i,j,k}}  Σ_{i ∈ U^t} Σ_{(j,k) ∈ A_i} x_{i,j,k} U_i(r_{i,j,k}, v_i, z_{j,k})
       subject to  Σ_{(j,k) ∈ A_i} x_{i,j,k} = 1,  i ∈ U^t,
                   x_{i,j,k} ∈ {0, 1},  i ∈ U^t, (j, k) ∈ A_i,
                   r_{i,j,k} = c_{i,j,k} if Σ_{l ∈ U^t} x_{l,j,k} ≤ S_{j,k},
                   r_{i,j,k} = S_{j,k} c_{i,j,k} / Σ_{l ∈ U^t} x_{l,j,k} if Σ_{l ∈ U^t} x_{l,j,k} > S_{j,k},  i ∈ U^t, (j, k) ∈ A_i,   (1.10)

where the first constraint ensures that a user can only be associated with a single base station. We denote the optimal value as OPT(Q^t). To take the user dynamics into account, we solve problem Q^t from scratch every time t a new user arrives, that is, we apply the static optimization formulation periodically. This guarantees optimality at every time t, and, to simplify notation, we drop the superscript t and refer to the user-BS association problem as problem Q and to its optimal value as OPT(Q) from this point on. Clearly, the periodic application of the static optimization may yield a large number of re-associations, since previously associated users may have to be re-associated.
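The per-user rate used in Q^t follows the equal time-sharing rule of Eqs. (1.8)-(1.9), which is simple to state in code (a minimal sketch; function and argument names are ours):

```python
def actual_rate(c, n_associated, S):
    """Eqs. (1.8)-(1.9): a user gets its full rate c while the number of
    associated users is at most the multiplexing gain S; beyond that, the
    S simultaneous streams are equally time-shared among all users."""
    if n_associated <= S:
        return c
    return S * c / n_associated

# e.g., with S = 4 streams, 8 associated users each get half their peak rate.
print(actual_rate(10.0, 8, 4))  # → 5.0
```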
1.3.6 Special cases

When $\eta_{j,k} = 1$, problem $Q^t$ above can be simplified as
$$
\begin{aligned}
\underset{x_{i,j,k}}{\text{maximize}} \quad & \sum_{i \in \mathcal{U}^t} \sum_{(j,k) \in \mathcal{A}_i} x_{i,j,k}\, U_i\!\left(\frac{c_{i,j,k}}{\sum_{l \in \mathcal{U}^t} x_{l,j,k}},\ v_i,\ z_{j,k}\right) \\
\text{subject to} \quad & \sum_{(j,k) \in \mathcal{A}_i} x_{i,j,k} = 1, \quad i \in \mathcal{U}^t \\
& x_{i,j,k} \in \{0, 1\}, \quad i \in \mathcal{U}^t,\ (j,k) \in \mathcal{A}_i.
\end{aligned} \tag{1.11}
$$
In the massive MIMO regime, the special case with $\eta_{j,k} = 1$ corresponds to having an array gain of order $L_{j,k}$ for the desired signal but not having any multiplexing gain. In the MU-MIMO full multiplexing gain regime, the special case with $\eta_{j,k} = L_{j,k} = 1$ corresponds to a point-to-point single-input single-output (SISO) channel, which is the specific scenario we consider in [1] (without using a bias to associate users with high mobility to large cells, as we do in this work).

1.4 Online association algorithms

In the following, we consider three online algorithms for the multi-tier user-BS association, where the users arrive online (user 1 arrives first and user $M$ arrives last) and the association decision is immediately and irrevocably made upon each user's arrival (i.e., we do not allow re-associations for previously arrived users). The first online algorithm is user-centric in that the user makes a decision based on its own performance. The second, which is the algorithm we advocate, is cell-centric in the sense that the association decision strives to maximize the performance of cells, and the third is a deterministic, somewhat simplified version of the second.

1.4.1 User-centric online algorithm

In the user-centric algorithm (Algorithm 1), when a user arrives, the user is associated with the base station that maximizes the user's own utility.

Algorithm 1 User-centric online algorithm
1. Initialize $s_{j,k} \leftarrow 0$, $k \in \mathcal{K}$, $j \in \mathcal{B}_k$;
2. for $i = 1, \dots, M$ do
3. Associate user $i$ with base station $j^*$ at tier $k^*$, where $(j^*, k^*) = \arg\max_{(j,k) \in \mathcal{A}_i} U_i(r_{i,j,k}, v_i, z_{j,k})$, with $r_{i,j,k} = c_{i,j,k}$ if $s_{j,k} + 1 \le \eta_{j,k}$, and $r_{i,j,k} = \eta_{j,k}\, c_{i,j,k} / (s_{j,k} + 1)$ otherwise;
4. $s_{j^*,k^*} \leftarrow s_{j^*,k^*} + 1$;
5.
end for

The variable $s_{j,k}$ records the number of users associated with base station $j$ at tier $k$. Note that at the end of the algorithm, we have $\sum_{k \in \mathcal{K}} \sum_{j \in \mathcal{B}_k} s_{j,k} = M$. In practice, when a user arrives, the user can obtain the information of the system load $s_{j,k}$, the cell range $z_{j,k}$, and the available degrees of freedom $\eta_{j,k}$ by base station broadcast, and the data rate $c_{i,j,k}$ by training, sensing, and estimation; see, for example, [28]. We denote the resulting sum utility of the users under the user-centric online algorithm as $ALG_1(Q)$.

1.4.2 Cell-centric randomized online algorithm

To facilitate analysis, let us first introduce the concept of the utility of a base station. The utility of base station $j$ at tier $k$ (denoted as $V_{j,k}$) is defined as the sum utility of its associated users. The domain of $V_{j,k}$ (denoted as $\mathcal{A}_{j,k}$) is the set of users that base station $j$ at tier $k$ can be associated with, i.e.,
$$\mathcal{A}_{j,k} \triangleq \{i \in \mathcal{U} : (j,k) \in \mathcal{A}_i\}. \tag{1.12}$$
We have
$$V_{j,k}(\mathcal{S}) = \sum_{i \in \mathcal{S}} U_i(r_{i,j,k}, v_i, z_{j,k}), \quad k \in \mathcal{K},\ j \in \mathcal{B}_k,\ \mathcal{S} \subseteq \mathcal{A}_{j,k},$$
$$r_{i,j,k} = c_{i,j,k}\, \mathbf{1}_{\{|\mathcal{S}| \le \eta_{j,k}\}} + \frac{\eta_{j,k}\, c_{i,j,k}}{|\mathcal{S}|}\, \mathbf{1}_{\{|\mathcal{S}| > \eta_{j,k}\}}, \tag{1.13}$$
where $\mathcal{S}$ denotes the set of users that base station $(j,k)$ is associated with, $|\mathcal{S}|$ is the cardinality of $\mathcal{S}$, and $\mathbf{1}_{\{\cdot\}}$ is the indicator function. In addition, we let $V_{j,k}(\emptyset) = 0$. We further define the marginal utility of base station $j$ at tier $k$ for associating with a "new" user $i$, given the set of "previously" associated users $\mathcal{S}$, as
$$V_{j,k}(i \mid \mathcal{S}) = V_{j,k}(\mathcal{S} \cup \{i\}) - V_{j,k}(\mathcal{S}), \quad i \in \mathcal{A}_{j,k},\ \mathcal{S} \subseteq \mathcal{A}_{j,k},\ i \notin \mathcal{S}. \tag{1.14}$$

Algorithm 2 Cell-centric randomized online algorithm
1. Initialize $\mathcal{S}_{j,k} \leftarrow \emptyset$, $k \in \mathcal{K}$, $j \in \mathcal{B}_k$;
2. for $i = 1, \dots, M$ do
3. Associate user $i$ with base station $j$ at tier $k$ with probability
$$\frac{V_{j,k}(i \mid \mathcal{S}_{j,k})^{|\mathcal{A}_i| - 1}}{\sum_{(j,k) \in \mathcal{A}_i} V_{j,k}(i \mid \mathcal{S}_{j,k})^{|\mathcal{A}_i| - 1}}, \quad (j,k) \in \mathcal{A}_i. \tag{1.15}$$
Let the selected base station be $(j^*, k^*)$;
4. $\mathcal{S}_{j^*,k^*} \leftarrow \mathcal{S}_{j^*,k^*} \cup \{i\}$;
5. end for

In the cell-centric randomized algorithm (Algorithm 2), when a user arrives, the user is associated with a base station in a probabilistic manner.
Specifically, the probability of associating a user with a base station is proportional to a power of the base station's marginal utility of including that user. In this sense, a user will most likely be associated with the base station with the highest marginal utility. The variable $\mathcal{S}_{j,k}$ records the set of users that base station $(j,k)$ is associated with. At the end of the algorithm, the sets $\mathcal{S}_{j,k}$, $k \in \mathcal{K}$, $j \in \mathcal{B}_k$, form a partition of the users, and $\sum_{k \in \mathcal{K}} \sum_{j \in \mathcal{B}_k} |\mathcal{S}_{j,k}| = M$.

When a user $i$ arrives, it can compute the probability in Eq. (1.15) by collecting the broadcast values $V_{j,k}(i \mid \mathcal{S}_{j,k})$, $(j,k) \in \mathcal{A}_i$, from all base stations in its candidate set. The value $V_{j,k}(i \mid \mathcal{S}_{j,k})$ can be computed by base station $(j,k)$ using the values of the system load $\mathcal{S}_{j,k}$, the cell range $z_{j,k}$, the available degrees of freedom $\eta_{j,k}$, the user speed $v_i$, and the data rate $c_{i,j,k}$ (see Eqs. (1.13) and (1.14)). Note that the amount of information required is the same as that in Algorithm 1. As before, we denote the resulting sum utility of the users under the cell-centric randomized online algorithm as $ALG_2(Q)$. Note that this can be written as the sum of base station utility functions, that is, $ALG_2(Q) = \sum_{k \in \mathcal{K}} \sum_{j \in \mathcal{B}_k} V_{j,k}(\mathcal{S}_{j,k})$.

Table 1.1: Main notation
Data rate that user $i$ gets if associated with BS $(j,k)$: $r_{i,j,k}$
Maximum data rate (no time sharing) that user $i$ can get if associated with BS $(j,k)$: $c_{i,j,k}$
User speed: $v_i$
BS coverage: $z_{j,k}$
Spatial multiplexing gain: $\eta_{j,k}$
User utility: $U_i(\cdot,\cdot,\cdot)$
Bias function: $f(\cdot,\cdot)$
Base station utility: $V_{j,k}(\cdot)$
Marginal base station utility: $V_{j,k}(\cdot \mid \cdot)$
Set of users that BS $(j,k)$ can be associated with: $\mathcal{A}_{j,k}$
Set of BSs that user $i$ can be associated with: $\mathcal{A}_i$

1.4.3 Cell-centric deterministic online algorithm

In the previous subsection, we introduced the cell-centric randomized online algorithm. It is natural to consider its deterministic counterpart.
Specifically, when a user arrives, the user is associated with the base station with the highest marginal utility of including that user.

Algorithm 3 Cell-centric deterministic online algorithm
1. Initialize $\mathcal{S}_{j,k} \leftarrow \emptyset$, $k \in \mathcal{K}$, $j \in \mathcal{B}_k$;
2. for $i = 1, \dots, M$ do
3. Associate user $i$ with base station $j^*$ at tier $k^*$, where $(j^*, k^*) = \arg\max_{(j,k) \in \mathcal{A}_i} V_{j,k}(i \mid \mathcal{S}_{j,k})$;
4. $\mathcal{S}_{j^*,k^*} \leftarrow \mathcal{S}_{j^*,k^*} \cup \{i\}$;
5. end for

Compared to the randomized version (Algorithm 2), the deterministic version (Algorithm 3) is easier to implement. However, it will be shown that the deterministic version has a worse performance guarantee than the randomized one. We denote the resulting sum utility of the users under the cell-centric deterministic online algorithm as $ALG_3(Q)$.

1.5 Performance analysis

In this section we establish the performance bounds for the two cell-centric online algorithms using the theory of online combinatorial auctions. The main notation is summarized in Table 1.1. To apply results from online combinatorial auctions, we first need to prove that the specific base station utility function for our application, namely $V_{j,k}(\cdot)$ in Eq. (1.13), is submodular and monotone. As a concrete example, we consider the following user utility:
$$U_i(r_{i,j,k}, v_i, z_{j,k}) = \log(r_{i,j,k}) + f(v_i, z_{j,k}). \tag{1.16}$$

Figure 1.2: Consider three cells with cell range $y_1 > y_2 > y_3$. While slow-moving users are almost indifferent to which cell they associate with when it comes to signaling overhead, for high-speed users the larger the cell range the better.

The user utility consists of two parts: the logarithmic user utility ($\log(r_{i,j,k})$) with respect to the data rate, which is commonly used in wireless networks to provide proportional fairness among users [18], and the bias $f(v_i, z_{j,k})$ for associating users with high mobility to tiers operating at low frequency with large cell coverage, which depends on the user mobility ($v_i$) and the cell coverage$^1$ ($z_{j,k}$).
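The three online algorithms of Section 1.4 can be sketched compactly under the utility of Eq. (1.16). The sketch below is illustrative only: the data layout (per-BS rate and bias tables) and all names are ours, natural logarithms are used, and marginal utilities are assumed positive (the monotonicity condition of Lemma 1.2) so that the sampling weights of Eq. (1.15) are well defined.

```python
import math
import random

def scheduled(c, load, dof):
    """Rate after admission: Eqs. (1.8)-(1.9)."""
    return c if load <= dof else dof * c / load

def bs_utility(users, rates, bias, dof):
    """V_{j,k}(S) of Eq. (1.18): sum of log-rates plus bias, with the
    dof streams shared equally once the cell is loaded beyond dof."""
    n = len(users)
    share = 1.0 if n <= dof else dof / n
    return sum(math.log(share * rates[i]) + bias[i] for i in users)

def marginal(S, i, rates, bias, dof):
    """Marginal utility V_{j,k}(i | S) of Eq. (1.14)."""
    return bs_utility(S | {i}, rates, bias, dof) - bs_utility(S, rates, bias, dof)

def associate(arrivals, cand, rates, bias, dof, rule):
    """Online user-BS association; every decision is final.
    rule: 'user' (Alg. 1), 'rand' (Alg. 2), 'det' (Alg. 3)."""
    assoc = {b: set() for b in dof}                      # S_{j,k} per BS
    for i in arrivals:
        if rule == 'user':                               # maximize own utility
            best = max(cand[i], key=lambda b: math.log(
                scheduled(rates[b][i], len(assoc[b]) + 1, dof[b])) + bias[b][i])
        else:
            gains = {b: marginal(assoc[b], i, rates[b], bias[b], dof[b])
                     for b in cand[i]}
            if rule == 'det':                            # highest marginal utility
                best = max(gains, key=gains.get)
            else:                                        # Eq. (1.15): prob ~ gain^(|A_i|-1)
                bs = list(gains)
                w = [gains[b] ** (len(bs) - 1) for b in bs]
                best = random.choices(bs, weights=w)[0]
        assoc[best].add(i)
    return assoc

# Two single-antenna BSs (dof 1), three users, zero bias; rates in bit/s.
rates = {'A': {0: 8e6, 1: 8e6, 2: 2e6}, 'B': {0: 4e6, 1: 4e6, 2: 4e6}}
bias = {'A': {0: 0.0, 1: 0.0, 2: 0.0}, 'B': {0: 0.0, 1: 0.0, 2: 0.0}}
dof = {'A': 1, 'B': 1}
cand = {0: ['A', 'B'], 1: ['A', 'B'], 2: ['A', 'B']}
print(associate([0, 1, 2], cand, rates, bias, dof, rule='det'))
```

In this toy run the deterministic cell-centric rule steers users 1 and 2 away from the loaded BS 'A' even though it offers user 1 a higher standalone rate, which is exactly the load-balancing behavior the marginal-utility criterion is designed to produce.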
The bias function $f(x, y)$ is defined on the domain $x \ge 0$, $y \ge 0$ and satisfies:
$$1.\ f(x, y) \ge 0; \qquad 2.\ \text{for any given } x,\ f(x, y_1) > f(x, y_2),\ \forall\, y_1 > y_2. \tag{1.17}$$
The bias function $f(x, y)$ defined above implies that a user prefers to be associated with a cell with a large coverage, which makes sense from a practical point of view: the smaller the cell, the more frequent the handoffs from one cell to another, thus larger cells keep the signaling overhead at reasonable levels. See Fig. 1.2 for an example. Under the user utility of Eq. (1.16), the base station utility function becomes
$$V_{j,k}(\mathcal{S}) = \begin{cases} \sum_{i \in \mathcal{S}} \left[\log(c_{i,j,k}) + f(v_i, z_{j,k})\right] & \text{if } |\mathcal{S}| \le \eta_{j,k} \\ \sum_{i \in \mathcal{S}} \left[\log\!\left(\frac{\eta_{j,k}\, c_{i,j,k}}{|\mathcal{S}|}\right) + f(v_i, z_{j,k})\right] & \text{if } |\mathcal{S}| > \eta_{j,k} \end{cases}, \quad k \in \mathcal{K},\ j \in \mathcal{B}_k,\ \mathcal{S} \subseteq \mathcal{A}_{j,k}. \tag{1.18}$$
In addition, the marginal base station utility function can be derived as follows. If $|\mathcal{S}| \le \eta_{j,k} - 1$, we have
$$V_{j,k}(i \mid \mathcal{S}) = \log(c_{i,j,k}) + f(v_i, z_{j,k}); \tag{1.19}$$
if $|\mathcal{S}| \ge \eta_{j,k}$, we have
$$\begin{aligned} V_{j,k}(i \mid \mathcal{S}) &= V_{j,k}(\mathcal{S} \cup \{i\}) - V_{j,k}(\mathcal{S}) \\ &= \sum_{l \in \mathcal{S} \cup \{i\}} \left[\log\!\left(\frac{\eta_{j,k}\, c_{l,j,k}}{|\mathcal{S}| + 1}\right) + f(v_l, z_{j,k})\right] - \sum_{l \in \mathcal{S}} \left[\log\!\left(\frac{\eta_{j,k}\, c_{l,j,k}}{|\mathcal{S}|}\right) + f(v_l, z_{j,k})\right] \\ &= \log(\eta_{j,k}\, c_{i,j,k}) + f(v_i, z_{j,k}) + |\mathcal{S}| \log|\mathcal{S}| - (|\mathcal{S}| + 1) \log(|\mathcal{S}| + 1). \end{aligned} \tag{1.20}$$

$^1$The coverage of a cell depends on a number of factors such as the power, the carrier frequency, and the elevation of the tower.

Definition 1.1. The base station utility function $V_{j,k}(\cdot)$ is submodular if $V_{j,k}(i \mid \mathcal{S}) \ge V_{j,k}(i \mid \mathcal{T})$ for all $i \in \mathcal{A}_{j,k}$, $\mathcal{S} \subseteq \mathcal{T} \subseteq \mathcal{A}_{j,k}$, $i \notin \mathcal{T}$.

Definition 1.2. The base station utility function $V_{j,k}(\cdot)$ is monotone if $V_{j,k}(i \mid \mathcal{S}) \ge 0$ for all $i \in \mathcal{A}_{j,k}$, $\mathcal{S} \subseteq \mathcal{A}_{j,k}$, $i \notin \mathcal{S}$.

Lemma 1.1. The base station utility function $V_{j,k}(\cdot)$ in Eq. (1.18) is submodular.

Proof. Let $i \in \mathcal{A}_{j,k}$, $\mathcal{S} \subseteq \mathcal{T} \subseteq \mathcal{A}_{j,k}$, $i \notin \mathcal{T}$ be given. There are three cases. In the first case, we consider $|\mathcal{T}| \le \eta_{j,k} - 1$. Clearly, we have $V_{j,k}(i \mid \mathcal{S}) = V_{j,k}(i \mid \mathcal{T})$. In the second case, we consider $|\mathcal{S}| \le \eta_{j,k} - 1$ and $|\mathcal{T}| \ge \eta_{j,k}$. We have
$$\begin{aligned} V_{j,k}(i \mid \mathcal{S}) - V_{j,k}(i \mid \mathcal{T}) &= -\log(\eta_{j,k}) - |\mathcal{T}| \log|\mathcal{T}| + (|\mathcal{T}| + 1) \log(|\mathcal{T}| + 1) \\ &= \log \frac{(|\mathcal{T}| + 1)^{|\mathcal{T}|+1}}{|\mathcal{T}|^{|\mathcal{T}|}\, \eta_{j,k}} \ge \log \frac{(|\mathcal{T}| + 1)^{|\mathcal{T}|+1}}{|\mathcal{T}|^{|\mathcal{T}|+1}} > 0. \end{aligned} \tag{1.21}$$
In the third case, we consider $|\mathcal{S}| \ge \eta_{j,k}$.
To check that $V_{j,k}(i \mid \mathcal{S}) \ge V_{j,k}(i \mid \mathcal{T})$, it is equivalent to show that $|\mathcal{S}| \log|\mathcal{S}| - (|\mathcal{S}| + 1) \log(|\mathcal{S}| + 1) \ge |\mathcal{T}| \log|\mathcal{T}| - (|\mathcal{T}| + 1) \log(|\mathcal{T}| + 1)$, which in turn is equivalent to showing that the function $h(x) \triangleq x \log x - (x + 1) \log(x + 1)$, $x > 0$, is decreasing. Indeed, we have
$$h'(x) = \log x - \log(x + 1) = -\log\!\left(1 + \frac{1}{x}\right) < 0, \quad \forall\, x > 0, \tag{1.22}$$
which implies that $h(x)$ is decreasing. As a result, in all cases we have $V_{j,k}(i \mid \mathcal{S}) \ge V_{j,k}(i \mid \mathcal{T})$. We conclude that $V_{j,k}(\cdot)$ is submodular.

Lemma 1.2. If $c_{i,j,k} \ge \max\left\{\frac{|\mathcal{A}_{j,k}|}{\eta_{j,k}}\, e^{1 - f(v_i, z_{j,k})},\ 1\right\}$ bits/s, $\forall\, i \in \mathcal{A}_{j,k}$, then $V_{j,k}(\cdot)$ in Eq. (1.18) is monotone.

Note: For the monotonicity to hold we need $c_{i,j,k} \ge \max\left\{\frac{|\mathcal{A}_{j,k}|}{\eta_{j,k}}\, e^{1 - f(v_i, z_{j,k})},\ 1\right\}$ bits/s, $\forall\, i \in \mathcal{A}_{j,k}$, i.e., we need the data rate (measured in bits/s) between base station $(j,k)$ and user $i \in \mathcal{A}_{j,k}$ to be larger than $\frac{e^{1 - f(v_i, z_{j,k})}}{\eta_{j,k}}$ times the number of users that base station $(j,k)$ can be associated with, which is trivially satisfied for any real-world scenario.

Proof. Let $i \in \mathcal{A}_{j,k}$, $\mathcal{S} \subseteq \mathcal{A}_{j,k}$, $i \notin \mathcal{S}$ be given. From Eq. (1.20), we have
$$\begin{aligned} V_{j,k}(i \mid \mathcal{S}) &= \log(\eta_{j,k}\, c_{i,j,k}) + f(v_i, z_{j,k}) + |\mathcal{S}| \log|\mathcal{S}| - (|\mathcal{S}| + 1) \log(|\mathcal{S}| + 1) \\ &\overset{(a)}{\ge} \log(\eta_{j,k}\, c_{i,j,k}) + f(v_i, z_{j,k}) + (|\mathcal{A}_{j,k}| - 1) \log(|\mathcal{A}_{j,k}| - 1) - |\mathcal{A}_{j,k}| \log|\mathcal{A}_{j,k}| \\ &= \log(\eta_{j,k}\, c_{i,j,k}) + f(v_i, z_{j,k}) - \log \frac{|\mathcal{A}_{j,k}|^{|\mathcal{A}_{j,k}|}}{(|\mathcal{A}_{j,k}| - 1)^{|\mathcal{A}_{j,k}| - 1}} \\ &= \log(\eta_{j,k}\, c_{i,j,k}) + f(v_i, z_{j,k}) - \log|\mathcal{A}_{j,k}| - \log\!\left(1 + \frac{1}{|\mathcal{A}_{j,k}| - 1}\right)^{|\mathcal{A}_{j,k}| - 1} \\ &\ge \log(\eta_{j,k}\, c_{i,j,k}) + f(v_i, z_{j,k}) - \log(|\mathcal{A}_{j,k}|\, e), \end{aligned} \tag{1.23}$$
where (a) holds since the function $h(x) \triangleq x \log x - (x + 1) \log(x + 1)$, $x > 0$, is decreasing and thus, over $|\mathcal{S}| \le |\mathcal{A}_{j,k}| - 1$, achieves its minimum when $|\mathcal{S}| = |\mathcal{A}_{j,k}| - 1$. Therefore, if $c_{i,j,k} \ge \max\left\{\frac{|\mathcal{A}_{j,k}|}{\eta_{j,k}}\, e^{1 - f(v_i, z_{j,k})},\ 1\right\}$ bits/s, $\forall\, i \in \mathcal{A}_{j,k}$, we have $V_{j,k}(i \mid \mathcal{S}) \ge 0$ and thus $V_{j,k}(\cdot)$ is monotone.

1.5.1 Performance bounds

We first derive the performance bound of the cell-centric randomized online algorithm.

Theorem 1.1. Under the submodularity and monotonicity of $V_{j,k}(\cdot)$, we have $\mathbb{E}[ALG_2(Q)] \ge \frac{1}{2 - 1/a}\, OPT(Q)$, where $a \triangleq \max_{i \in \mathcal{U}} |\mathcal{A}_i|$.

Proof.
After establishing the submodularity and monotonicity of the base station utility function $V_{j,k}(\cdot)$, one may apply somewhat recent results from online combinatorial auctions, see [30], to get a lower bound equal to $\frac{1}{2 - 1/N}\, OPT(Q)$, where $N = \sum_{k=1}^{K} N_k$ is the total number of base stations in the $K$ tiers (which could be very large). We further tighten this bound by exploiting the "sparsity" of feasible associations of a user in a heterogeneous wireless cellular system, and show that $\mathbb{E}[ALG_2(Q)] \ge \frac{1}{2 - 1/a}\, OPT(Q)$, where $a = \max_{i \in \mathcal{U}} |\mathcal{A}_i|$ is the maximum number of potential associations of a user (see Eq. (1.1)). Clearly, since the bound deteriorates as $N$ and $a$ increase, smaller values of $a$ yield tighter bounds (we assume $a > 1$, since if $a = 1$ there is no decision to be made).

We prove the performance bound by induction on the number of users $M$. Let $Q$ be the original problem of associating $M$ users to base stations. For each $(j,k) \in \mathcal{A}_1$, we define $Q_{j,k}$ as the subproblem of associating the remaining users $2, \dots, M$ to the base stations, where the base station utility function $V_{j,k}(\cdot)$ is replaced by $V_{j,k}(\cdot \mid \{1\})$ (which is also a monotone submodular function). From the cell-centric randomized online algorithm, we have
$$\mathbb{E}[ALG_2(Q)] = \sum_{(j,k) \in \mathcal{A}_1} q_{j,k} \left\{ \mathbb{E}[ALG_2(Q_{j,k})] + V_{j,k}(\{1\}) \right\}, \tag{1.24}$$
where
$$q_{j,k} = \frac{V_{j,k}(\{1\})^{|\mathcal{A}_1| - 1}}{\sum_{(j,k) \in \mathcal{A}_1} V_{j,k}(\{1\})^{|\mathcal{A}_1| - 1}}, \quad (j,k) \in \mathcal{A}_1. \tag{1.25}$$
Let $\mathcal{S} = \{\mathcal{S}_{j,k},\ k \in \mathcal{K},\ j \in \mathcal{B}_k\}$ be the optimal association profile for the original problem $Q$, and let us assume that user $1 \in \mathcal{S}_{\tilde{j},\tilde{k}}$ for some $(\tilde{j}, \tilde{k}) \in \mathcal{A}_1$. Consider a new association profile $\mathcal{S}'$ which is the same as $\mathcal{S}$ except that user 1 is removed. Let us denote the value (the achieved sum user utility) of the subproblem $Q_{j,k}$ under the association profile $\mathcal{S}'$ as $\mathrm{Val}(Q_{j,k})$. (Obviously, we have $\mathrm{Val}(Q_{j,k}) \le OPT(Q_{j,k})$.)
By the submodularity and monotonicity of $V_{j,k}(\cdot)$, for all $(j,k) \in \mathcal{A}_1$, $(j,k) \neq (\tilde{j}, \tilde{k})$, we have $OPT(Q) - \mathrm{Val}(Q_{j,k}) \le V_{j,k}(\{1\}) + V_{\tilde{j},\tilde{k}}(\{1\})$, where $V_{\tilde{j},\tilde{k}}(\{1\})$ is the maximum "loss" due to the fact that the subproblem $Q_{j,k}$ does not have user 1 associated with base station $(\tilde{j}, \tilde{k})$, and $V_{j,k}(\{1\})$ is the maximum "loss" due to the fact that the subproblem $Q_{j,k}$ uses the utility function $V_{j,k}(\cdot \mid \{1\})$ (instead of $V_{j,k}(\cdot)$ in the original problem $Q$). For the case $(j,k) = (\tilde{j}, \tilde{k})$, we have $OPT(Q) - \mathrm{Val}(Q_{\tilde{j},\tilde{k}}) = V_{\tilde{j},\tilde{k}}(\{1\})$. As a result, we have
$$
\begin{aligned}
\frac{OPT(Q) - \sum_{(j,k) \in \mathcal{A}_1} q_{j,k}\, OPT(Q_{j,k})}{\sum_{(j,k) \in \mathcal{A}_1} q_{j,k}\, V_{j,k}(\{1\})}
&\le \frac{OPT(Q) - \sum_{(j,k) \in \mathcal{A}_1} q_{j,k}\, \mathrm{Val}(Q_{j,k})}{\sum_{(j,k) \in \mathcal{A}_1} q_{j,k}\, V_{j,k}(\{1\})} \\
&\le \frac{\sum_{(j,k) \in \mathcal{A}_1,\, (j,k) \neq (\tilde{j},\tilde{k})} q_{j,k} \left[ V_{j,k}(\{1\}) + V_{\tilde{j},\tilde{k}}(\{1\}) \right] + q_{\tilde{j},\tilde{k}}\, V_{\tilde{j},\tilde{k}}(\{1\})}{\sum_{(j,k) \in \mathcal{A}_1} q_{j,k}\, V_{j,k}(\{1\})} \\
&= 1 + \frac{V_{\tilde{j},\tilde{k}}(\{1\}) \sum_{(j,k) \in \mathcal{A}_1,\, (j,k) \neq (\tilde{j},\tilde{k})} V_{j,k}(\{1\})^{|\mathcal{A}_1| - 1}}{\sum_{(j,k) \in \mathcal{A}_1} V_{j,k}(\{1\})^{|\mathcal{A}_1|}} \\
&\overset{(a)}{\le} 1 + 1 - \frac{1}{|\mathcal{A}_1|} \le 2 - \frac{1}{\max_{i \in \mathcal{U}} |\mathcal{A}_i|} = 2 - \frac{1}{a},
\end{aligned} \tag{1.26}
$$
where (a) follows by the AM-GM inequality (see Appendix). Therefore, we have
$$
\begin{aligned}
OPT(Q) &\overset{(a)}{\le} \sum_{(j,k) \in \mathcal{A}_1} q_{j,k}\, OPT(Q_{j,k}) + \left(2 - \frac{1}{a}\right) \sum_{(j,k) \in \mathcal{A}_1} q_{j,k}\, V_{j,k}(\{1\}) \\
&\overset{(b)}{\le} \sum_{(j,k) \in \mathcal{A}_1} q_{j,k} \left(2 - \frac{1}{a}\right) \left[ \mathbb{E}[ALG_2(Q_{j,k})] + V_{j,k}(\{1\}) \right] \\
&\overset{(c)}{=} \left(2 - \frac{1}{a}\right) \mathbb{E}[ALG_2(Q)],
\end{aligned} \tag{1.27}
$$
where (a) follows from Eq. (1.26), (b) follows by induction, and (c) follows from Eq. (1.24).

Now, we proceed to derive the performance bound of the cell-centric deterministic online algorithm.
Thus, the above performance bounds hold for a generic submodular and monotone base station utility functionV j;k () and not just for the logarithmic user utility function with biasing which has been introduced in Eq. (1.18) as a concrete example. Also, recall that for this partic- ular utility function to be monotone we needc i;j;k max n jA j;k j j;k e 1f(vi;z j;k ) ; 1 o bits/s,8i2A j;k which is trivially satised in practice. 1.5.2 Rationale of equal time allocation When the number of the associated usersjSj at base station (j;k) is less than or equal to its spatial multiplexing gain j;k , each associated user can be active for the whole duration without the need of time sharing. However, whenjSj> j;k , some kind of time sharing is needed. In the above analysis, we assume that equal time sharing is used to schedule transmissions for users associated with the same base station whenjSj > j;k (see Eq. (1.18)). To motivate this assumption, we generalize equal time sharing to a more exible resource allocation scheme, in which dierent users are allowed to have dierent time portions for data transmissions, and show that under a logarithmic user utility with biasing, equal time sharing is optimal. For any base station (j;k); k2K; j2B k , letSA j;k be the set of users associated with it and assume thatjSj > j;k . Let us dene the time sharing variables w i;j;k ; i2S where P i2S w i;j;k = j;k and 0w i;j;k 1; i2S. The time sharing variables are optimized such that the sum utility of the users inS is maximized. In other words, whenjSj> j;k , the base station utility function is generalized from Eq. (1.18) to V j;k (S) = maximize w i;j;k X i2S log (w i;j;k c i;j;k ) +f(v i ;z j;k ) subject to X i2S w i;j;k = j;k 0w i;j;k 1; i2S: (1.28) Let us dene the Lagrange function L(w i;j;k ;) = X i2S log (w i;j;k c i;j;k ) +f(v i ;z j;k ) X i2S w i;j;k j;k ! ; (1.29) 21 where is the Lagrange multiplier. 
By taking the derivative of $L(w_{i,j,k}, \lambda)$ with respect to $w_{i,j,k}$ and setting the result to zero, we have
$$\frac{\partial L}{\partial w_{i,j,k}} = \frac{1}{w_{i,j,k}} - \lambda = 0 \ \Rightarrow\ w_{i,j,k} = \frac{1}{\lambda}. \tag{1.30}$$
Therefore, we have
$$\sum_{i \in \mathcal{S}} w_{i,j,k} = \frac{|\mathcal{S}|}{\lambda} = \eta_{j,k} \ \Rightarrow\ \lambda = \frac{|\mathcal{S}|}{\eta_{j,k}} \ \Rightarrow\ w_{i,j,k} = \frac{\eta_{j,k}}{|\mathcal{S}|}. \tag{1.31}$$
We can see that $0 \le \frac{\eta_{j,k}}{|\mathcal{S}|} \le 1$, so the optimal time sharing variables are indeed $w_{i,j,k} = \frac{\eta_{j,k}}{|\mathcal{S}|}$, $i \in \mathcal{S}$, showing that equal time allocation is optimal.

1.6 Extensions

In the following, we comment on how the proposed cell-centric randomized online algorithm can be applied in scenarios with user heterogeneity, departing users, and base station cooperation.

1.6.1 Heterogeneous users and user priority

Heterogeneous users refer to users that subscribe to different services. For example, some users are allowed to connect to all $K$ tiers while others are restricted to connect to one tier. Similarly, users can be divided into different classes with different priorities. For example, primary users with high priority are allowed to access all base stations while secondary users with low priority are not [39-46]. Both heterogeneous users and user priority can be incorporated into the analysis by restricting the set of tiers and/or base stations with which user $i$ may be associated in Eq. (1.1), while the rest of the analysis remains unchanged.

1.6.2 Departing users

The performance bound on the cell-centric randomized algorithm holds as users arrive online. However, when users leave the system, the performance bound may no longer hold. A simple way to guarantee the bound when a user leaves is to backtrack to the association profile just before this user's arrival, and consider re-associating users which arrived after this user. Specifically, suppose that there are $M$ users in the system, where, as previously discussed, user 1 arrived first, user $M$ arrived last, and they were associated with base stations by using Algorithm 2. Suppose now user $m$ leaves the system.
We first backtrack to the association profile just before user $m$'s arrival (i.e., the association profile $\mathcal{S}^{m-1} \triangleq \{\mathcal{S}^{m-1}_{j,k},\ k \in \mathcal{K},\ j \in \mathcal{B}_k\}$ generated at the $(m-1)$-th iteration of Algorithm 2) and then re-associate users $m+1$ to $M$. Clearly, this may result in a number of re-associations, which is not practical. In Section 1.7, we show that Algorithm 2 in the presence of user departures performs very close to the optimal, thus in practice there is no need to backtrack.

1.6.3 Base station cooperation

A dense deployment of small cells may yield even higher throughput when multiple neighboring base stations can cooperate with each other, an architecture often referred to as Coordinated Multi-Point (CoMP) [47-50], to form a cluster and coordinate their data transmissions such that they aggregate constructively. In a typical scenario of a two-tier heterogeneous network consisting of macro-BSs and femto-BSs, one may have tens or hundreds of femto-BSs inside a macrocell and hundreds or thousands of users. Thus, femto-BSs could be grouped into clusters of nearby femto-BSs which can concurrently serve a number of users. For example, one may have one such cluster per floor in a large building, or one cluster per building. Along these lines, assuming that lower-power base stations form cooperation clusters, the user-BS association problem can be generalized to a user-cluster association problem (note that it is possible that a cluster is just a single base station). While it is beyond the scope of this work to investigate clustering algorithms, we wish to extend our association algorithms to make them applicable to the CoMP setup. Let the set of base stations at tier $k$, $\mathcal{B}_k$, be partitioned into $G_k$ clusters $\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_{G_k}$, where $\mathcal{C}_m \cap \mathcal{C}_n = \emptyset$ and $\bigcup_{m=1}^{G_k} \mathcal{C}_m = \mathcal{B}_k$. We index the $m$-th cluster at tier $k$ by the tuple $(\mathcal{C}_m, k)$.
Suppose that distributed MU-MIMO ZFBF is used by the base stations in a cooperation cluster, say $(\mathcal{C}_m, k)$, to provide $\sum_{j \in \mathcal{C}_m} \eta_{j,k}$ degrees of freedom for spatial multiplexing [50]. Similar to [38], and using the same asymptotic regime as the one we used to derive Eq. (1.6), the SINR at user $i$ from cluster $(\mathcal{C}_m, k)$ becomes
$$\mathrm{SINR}^{\mathrm{CoMP}}_{i,\mathcal{C}_m,k} = \frac{g_{i,\mathcal{C}_m,k}}{W_k N_0 + \sum_{l \in \mathcal{B}_k \setminus \mathcal{C}_m} P_{l,k}\, g_{i,l,k}}, \tag{1.32}$$
where
$$g_{i,\mathcal{C}_m,k} \triangleq \frac{\left(\sum_{j \in \mathcal{C}_m} P_{j,k}\right) \left(\sum_{j \in \mathcal{C}_m} g_{i,j,k} / |\mathcal{C}_m|\right)}{\sum_{j \in \mathcal{C}_m} \eta_{j,k}}. \tag{1.33}$$
As a result, the data rate becomes $c_{i,\mathcal{C}_m,k} = W_k \log\left(1 + \mathrm{SINR}^{\mathrm{CoMP}}_{i,\mathcal{C}_m,k}\right)$. In addition, the "cluster" utility function can be stated as follows (see Eq. (1.18)). If $|\mathcal{S}| \le \sum_{j \in \mathcal{C}_m} \eta_{j,k}$, we have
$$V_{\mathcal{C}_m,k}(\mathcal{S}) = \sum_{i \in \mathcal{S}} \left[\log(c_{i,\mathcal{C}_m,k}) + f(v_i, z_{\mathcal{C}_m,k})\right]; \tag{1.34}$$
if $|\mathcal{S}| > \sum_{j \in \mathcal{C}_m} \eta_{j,k}$, we have
$$V_{\mathcal{C}_m,k}(\mathcal{S}) = \sum_{i \in \mathcal{S}} \left[\log\!\left(\frac{c_{i,\mathcal{C}_m,k} \sum_{j \in \mathcal{C}_m} \eta_{j,k}}{|\mathcal{S}|}\right) + f(v_i, z_{\mathcal{C}_m,k})\right], \tag{1.35}$$
where $z_{\mathcal{C}_m,k}$ is the coverage of cluster $(\mathcal{C}_m, k)$. By a straightforward extension of Lemmas 1.1 and 1.2, it is easy to show that this utility function is monotone and submodular as well, and the rest of the analysis goes through as before.

Figure 1.3: Two-tier heterogeneous network. (a) Non-homogeneous user density. (b) Homogeneous user density.
Figure 1.4: Performance under the two-tier heterogeneous network with non-homogeneous user density: (a) the sum log-rate utility (normalized w.r.t. the optimal); (b) the minimum user rate (normalized w.r.t. the randomized cell-centric); (c) the Jain's fairness index; (d) the sum user rate (normalized w.r.t. the randomized cell-centric).

Note that under this CoMP setup, the "sparsity" parameter $a$ is the maximum number of potential cluster associations for a user, and, since this is naturally smaller than the number of potential base station associations, the cell-centric randomized online algorithm has, in practice, an even tighter bound ($\frac{1}{2 - 1/a}$) with respect to the optimal than it had before.

1.7 Simulation results

1.7.1 Two-tier heterogeneous cellular network in massive MIMO scenario

We consider a two-tier heterogeneous cellular network consisting of macro-BSs and femto-BSs in a 2000 x 2000 m^2 area, as shown in Fig. 1.3. There are 4 macro-BSs and 32 femto-BSs, where two femto-BSs are uniformly distributed in each sub-square of size 500 x 500 m^2. There are 1000 users that arrive to the system online (one user arrival per unit time), whose locations are randomly drawn according to a non-homogeneous point process (users concentrate in interlacing sub-squares as in Fig. 1.3a, to account for the non-uniform distribution of users in practice) and a homogeneous point process (Fig. 1.3b). The transmit power
Figure 1.5: Performance under the two-tier heterogeneous network with homo- geneous user density. (a) Non-homogeneous user density. (b) Homogeneous user den- sity. Figure 1.6: Proportion of users associated with femto-cells in the two-tier het- erogeneous network. The transmit power of femto-BS is 20 dBm (solid color) and 35 dBm (faded color), respectively. of a macro-BS and a femto-BS are respectively assumed to be 46 dBm and 20 dBm and the spectrum bands of the two tiers are orthogonal, each with bandwidth 10 MHz, while transmissions at the same tier interfere with each other, as has been assumed in prior work [18] and in line with industry practice. Under the massive MIMO regime, a macro-BS is assumed to have 100 antennas to provide 10 degrees of freedom for spatial multiplexing, and each macro-BS uses the same set of 10 orthogonal pilots. Similarly, a femto-BS is assumed to have 40 antennas to provide 4 degrees of freedom for spatial multiplexing, and each femto-BS uses the same set of 4 orthogonal pilots. The background noise power is assumed to be104 dBm, and the path loss exponent is supposed to be 4, as is usually the case in outdoor environments [36]. Last, given the above parameters, for most realizations of the system deployment, and with an SINR threshold =3 dB for decoding as measured in real-world deployments [51], the maximum number of potential associations of a usera is calculated to be 3. We compare our proposed online algorithms with other algorithms that do not allow user re-associations and with the optimal. For the case with non- homogeneous user density, Fig. 
1.4a compares the performance of the randomized cell-centric online algorithm, the cell-centric online algorithm, the user-centric online algorithm, and the max-SINR online algorithm [17], according to which, when a user arrives, the user is associated with the base station that provides the user with the highest SINR value, regardless of the system load of the base station. From Fig. 1.4a we observe that the sum log-rate utility of the cell-centric online algorithm is very close to the optimal. (In the figure, the sum log-rate utility is normalized with respect to the optimal value.) As a result, we do not see any performance difference between the $\frac{1}{2 - 1/a}$ randomized approximation algorithm and the $\frac{1}{2}$ approximation algorithm. With respect to complexity, it is easy to see that all four online algorithms have a complexity of $O(Ma)$, which is linear in the number of users and the sparsity parameter.

Figure 1.7: The effect of biasing for associating high-mobility users to macro-BSs: (a) no bias ($\beta = 0$); (b) some bias ($\beta = 1$); (c) a large bias ($\beta = 4$). Solid (dotted) lines are the associations for high (low) mobility users. Only a subset of users is shown.

Motivated by the industry's desire to offer some notion of fairness to its users, we are also interested in comparing the minimum user rates and the Jain's fairness index [52] under the four algorithms. Note that the Jain's fairness index is between $\frac{1}{M}$ (worst case) and 1 (best case, when all users receive the same rate). As shown in Fig. 1.4b and Fig. 1.4c, the (randomized) cell-centric algorithm performs better than the others in terms of fairness too. Last, as can be seen in Fig.
1.4d, the max-SINR algorithm can achieve a higher sum user rate, while ignoring fairness considerations. Similar results can be observed in Fig. 1.5 for the case with homogeneous user density. Finally, in Fig. 1.6, we investigate the proportion of users associating to macro-cells and femto-cells. For both the non-homogeneous and homogeneous user density cases, we observe that under the cell-centric algorithm the proportion of users associating to femto-BSs is about 61%, while under the max-SINR algorithm it is about 44%. In addition, when the transmit power of femto-BSs is increased from 20 dBm to 35 dBm, we see only a small increase in the proportion of users associating to femto-BSs (about 66% under the cell-centric algorithm and 52% under the max-SINR algorithm).

Figure 1.8: Multi-channel WiFi network, with APs operating on four orthogonal channels: (a) regularly placed APs with non-homogeneous user density; (b) randomly placed APs with homogeneous user density.
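The Jain's fairness index [52] reported in the fairness panels above has a one-line definition; a minimal sketch (our naming):

```python
def jain_index(rates):
    """Jain's fairness index J = (sum r)^2 / (n * sum r^2).
    J ranges from 1/n (one user gets everything) to 1 (all rates equal)."""
    n = len(rates)
    return sum(rates) ** 2 / (n * sum(r * r for r in rates))

print(jain_index([5.0, 5.0, 5.0]))   # all rates equal -> 1.0
print(jain_index([10.0, 0.0, 0.0]))  # one user gets everything -> 1/3
```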
Figure 1.9: Performance under the multi-channel WiFi network with regularly placed APs and non-homogeneous user density: (a) the sum log-rate utility (normalized w.r.t. the optimal); (b) the minimum user rate (normalized w.r.t. the randomized cell-centric); (c) the Jain's fairness index; (d) the sum user rate (normalized w.r.t. the randomized cell-centric).

1.7.2 The effect of biasing

We investigate the effect of introducing a bias function for associating users with high mobility to tiers with large cell coverage. In Fig. 1.7, we have two classes of users, high-mobility and low-mobility users, and two tiers consisting of either macro- or femto-cells. The proportion of users with high mobility is about 50%. Only a subset of user associations (100 out of 1000) is shown. We consider as an example a bias function $f(x, y)$ along the lines of Fig. 1.2, with $x \in \{\text{high mobility}, \text{low mobility}\}$ and $y \in \{\text{macro-cell}, \text{femto-cell}\}$. Specifically, let $f(\text{high}, \text{macro}) = \beta$, and $f(\text{high}, \text{femto}) = f(\text{low}, \text{macro}) = f(\text{low}, \text{femto}) = 0$. As we increase $\beta$ from 0 (no bias) to 4 (a large bias), we can see from Figs. 1.7a-c that users with high mobility tend to be associated more and more with macro-cells.
Figure 1.12: Performance of the randomized cell-centric online algorithm against user dynamics. [Plots omitted: sum log-rate utility versus time slot (500 to 1000) for the optimal and the rand. cell-centric online algorithm, under (a) single user arrival/departure and (b) batch user arrival/departure.]

1.7.3 Multi-channel WiFi network in MU-MIMO full multiplexing gain scenario

We consider a different network topology motivated by enterprise WiFi networks. Specifically, consider the multi-channel conference hall topology depicted in Fig. 1.8a (the APs are placed regularly) and Fig. 1.8b (the APs are randomly distributed). There are 20 APs in a 300 x 250 m^2 area and each of them operates on one of four orthogonal channels (we use different colors to represent different channels). There are 200 users arriving to the system online (one user arrival per unit time), whose locations are independently drawn from a non-homogeneous point process (Fig. 1.8a), i.e., a two-dimensional uncorrelated normal distribution with mean (150 m, 125 m) and standard deviation 25 m, and a homogeneous point process (Fig. 1.8b). The transmit power of an AP is assumed to be 20 dBm and the channel bandwidth is assumed to be 20 MHz, in line with industry practice [28]. Under the MU-MIMO full multiplexing gain regime, each AP is assumed to be equipped with 2 antennas to provide 2 degrees of freedom for spatial multiplexing. The noise power is -101 dBm, and the path loss exponent is 3, a typical value for indoor environments [36]. For most realizations of the system deployment, and with an SINR threshold of 3 dB for decoding as reported in [28], the parameter a is calculated to be 4. For the case with non-homogeneous user density, Fig.
1.9 compares the performance of the four online algorithms in terms of the sum log-rate utility, the minimum user rate, Jain's fairness index, and the sum user rate, respectively. We can see that the (randomized) cell-centric algorithm outperforms the others in terms of all four metrics. Fig. 1.10 shows the results for the case with randomly distributed APs and homogeneous user density.

Last, in Fig. 1.11 we increase the number of antennas per AP of Fig. 1.8b and plot the resulting sum rate, normalized by the sum rate achieved with 2-antenna APs (the default case). In addition to the SINR regime considered so far (see above for the power and noise levels, which result in about 15 dB), we also plot the rates under a high SINR regime (about 35 dB). The results are consistent with MIMO theory, e.g., we see that in the high SINR regime the rate with 4-antenna APs is 1.8 times larger, and with 8-antenna APs about 3.3 times larger, than that with 2-antenna APs.

1.7.4 Departing users

As discussed in Section 1.6.2, the performance guarantee of the randomized cell-centric algorithm holds as users arrive online but do not leave the system. Here, we investigate the robustness of the randomized cell-centric online algorithm against user departures. Let us consider the topology of a two-tier heterogeneous cellular network in Fig. 1.3a. Suppose that users arrive online to the system from time slot 1 to time slot 1000. Upon the arrival of a user, the user is immediately associated with one base station (according to the randomized cell-centric online algorithm). Starting from time slot 500, in each subsequent time slot we also select users to depart from the system. In Fig. 1.12, we compare the performance of the randomized cell-centric online algorithm with the optimal, where the optimal is recomputed in every time slot. In Fig. 1.12a there is one arrival/departure per time slot, whereas in Fig.
1.12b arrivals and departures occur in batches where the batch size is uniformly distributed between 0 and 2. We can see that the sum utility of the randomized cell-centric algorithm is within 1% of the optimal as users join and leave the system, implying that our online algorithm is robust against user dynamics and that, in practice, we do not need to re-associate users when a user departs to guarantee near-optimal performance (see the discussion in Section 1.6.2).

1.8 Conclusion

In this work, we proposed efficient approximation algorithms for the online user association problem in a multi-tier multi-cell mobile network, which finds applications in today's enterprise WiFi networks and in next generation cellular systems. We showed that the approximation ratio of our champion algorithm is $\frac{1}{2a-1}$, where $a$ is the maximum number of potential associations for a user. The parameter $a$ is small due to the signal characteristics of the wireless medium, and the bound constitutes a significant improvement over the best known prior work. The proposed algorithms were applied to many scenarios of interest including systems with massive antenna arrays, systems with MU-MIMO capabilities, networks with prioritized user classes, and networks where transmitters coordinate to form clusters in a CoMP-like setup. Last, we showed via simulations that the proposed algorithms perform near optimal and possess desirable fairness properties under realistic scenarios.

1.9 Appendix

Suppose that $n \in \mathbb{N}$, $n \ge 2$, and $b_i \in \mathbb{R}$, $b_i > 0$, $i = 1,\ldots,n$. By the AM-GM inequality with $n$ variables, we have

$b_1^n + (n-1)b_i^n \ge n\, b_1 b_i^{n-1}, \quad i = 2,\ldots,n.$

Therefore, we have

$n \sum_{i=2}^n b_1 b_i^{n-1} \le \sum_{i=2}^n \left[ b_1^n + (n-1)b_i^n \right] = (n-1)\left( b_1^n + \sum_{i=2}^n b_i^n \right).$

By rearranging terms, we have $\frac{b_1 \left( \sum_{i=2}^n b_i^{n-1} \right)}{\sum_{i=1}^n b_i^n} \le 1 - \frac{1}{n}.$
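The inequality just derived is easy to sanity-check numerically. The following sketch (our illustration, not part of the thesis) draws random positive vectors and verifies the bound:

```python
import random

def check_bound(b):
    # Verifies b_1 * sum_{i>=2} b_i^{n-1} <= (1 - 1/n) * sum_i b_i^n,
    # the inequality obtained above from the AM-GM inequality.
    n = len(b)
    lhs = b[0] * sum(x ** (n - 1) for x in b[1:])
    rhs = (1.0 - 1.0 / n) * sum(x ** n for x in b)
    return lhs <= rhs * (1 + 1e-9)  # small slack for floating point

random.seed(0)
for _ in range(1000):
    n = random.randint(2, 6)
    assert check_bound([random.uniform(0.1, 5.0) for _ in range(n)])
```

Equality holds exactly when all the $b_i$ are equal, which is why a tiny floating-point slack is used in the comparison.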
In general, we can conclude that

$\frac{b_k \left( \sum_{i=1, i \ne k}^n b_i^{n-1} \right)}{\sum_{i=1}^n b_i^n} \le 1 - \frac{1}{n}, \quad k = 1,\ldots,n.$

Chapter 2

Fast Content Delivery via Distributed Caching and Small Cell Cooperation

The demand for higher and higher wireless data rates is driven by the popularity of mobile video content delivery through wireless devices such as tablets and smartphones. To achieve unprecedented mobile content delivery speeds while reducing backhaul cost and delay, in this work we propose a new system architecture that combines two recent ideas: distributed caching of content in small cells (FemtoCaching), and cooperative transmissions from nearby base stations (Coordinated Multi-Point). A key characteristic of the proposed architecture is the interdependence between the caching strategy and the physical layer coordination. Specifically, the caching strategy may cache different content in nearby base stations (BSs) to maximize the cache hit ratio, or cache the same content in multiple nearby BSs such that the corresponding BSs can transmit concurrently, e.g., to multiple users using zero-forcing beamforming, and achieve multiplexing gains. Such interdependency allows a joint cross-layer optimization. Given the popularity distribution of the content, the available cache size, and the network topology, we devise near-optimal caching strategies such that the system throughput is maximized or the system delay is minimized. Under realistic scenarios and assumptions, our analytical and simulation results show that our system yields significantly faster content delivery, which can be one order of magnitude faster than that of legacy systems. [5, 53]

2.1 Introduction

The popularity of mobile video streaming together with the proliferation of mobile devices such as smartphones and tablets are causing a tremendous growth of data traffic in cellular networks.
To address this challenge, the cellular industry is advocating a heterogeneous network architecture [1, 2, 6, 54-56] in which small cells (low power nodes), such as micro-BSs, pico-BSs, and femto-BSs, are deployed within traditional macrocells. These low power nodes provide short-range localized communication links, resulting in a higher density of spatial reuse of radio resources and thus in a higher overall network throughput.

There are many challenges in deploying a dense network of low power nodes. One such challenge that service providers consistently rank high is the deployment cost associated with connecting all the small cells to the backbone with fast links. Motivated by this, there is a growing interest in caching popular content at those low power nodes in a distributed manner, effectively trading off fast backhaul capacity for storage capacity. Specifically, the authors in [57-59] have introduced the concept of FemtoCaching, which is the idea of equipping femto-BSs with high storage capacity to store popular video files. When a user requests a video file, the user may be served by a nearby femto-BS that has the requested file in its cache over a high rate short-range wireless link. If the requested file is not in the cache of any nearby femto-BS, the user will be served directly by the macro-BS over a low rate long-range wireless link. Since the popularity distribution of video files changes at a much slower pace than that of user requests, cache updates (downloading popular video files via backhaul into the caches) can be done at off-peak hours, which results in a significant reduction of backhaul cost and delay while maintaining the performance benefits of a dense deployment of low power BSs.

Deploying a dense network of low power BSs yields even higher throughput when multiple neighboring BSs coordinate their data transmissions such that they aggregate constructively [47-50, 60].
As a matter of fact, in the absence of such BS coordination, interference between nearby BSs may cancel the performance gains of dense deployments, and service providers consistently rank the technological challenges related to this issue as yet another major challenge in the deployment of small cells. There are many schemes for BS coordination, and in this work we will consider the two most basic/popular ones: Maximum Ratio Transmission (MRT) and Zero-Forcing BeamForming (ZFBF) [61]. Consider that low power BSs form cooperation clusters. Then, under MRT, each BS in the cooperation cluster beamforms to a user such that the signals from the neighboring BSs are coherently combined, resulting in a diversity gain [62]. Under ZFBF, the BSs in the cluster simultaneously transmit multiple data streams to multiple users [50, 63], resulting in a multiplexing gain. Note that in the absence of offline cache updates, both MRT and ZFBF would further increase the cost and delay associated with backhaul, as they require multiple copies of the same files to be distributed to multiple BSs.

In this work, we propose a new system architecture that combines FemtoCaching and femto-BS cooperation. The proposed cooperation scheme is cache-driven in the sense that if a typical user requests a video file, only the neighboring femto-BSs that have the requested video file in their caches will participate in the cooperative transmission. In other words, the cluster of cooperating femto-BSs is dynamically formed on a per-request basis. An important aspect of our system architecture is the joint cross-layer optimization of the cache allocation (content placement) in the application layer and the cooperative transmission techniques (MRT for diversity and ZFBF for multiplexing) in the physical layer.
We jointly optimize these aspects of the system because caching different content in nearby caches increases the hit ratio, but caching the same content increases the chances of obtaining diversity and multiplexing gains. In general, the optimal cache allocation depends on a number of parameters, including the file popularity distribution, the cache size, the number of neighboring femto-BSs, and the transmission rate of the macro-BS in comparison to that of a femto-BS.

The remainder of this work is organized as follows. We present related work in Section 2.2. Section 2.3 describes the setup, the caching strategies, and the cache-driven cooperation policies. In Section 2.4 we derive analytical formulas for the achieved rates under a variety of scenarios. Section 2.5 analyzes the system performance in terms of the system delay. In Section 2.6 we present numerical results for a number of real-world scenarios, highlighting the gains from our framework. Notably, our schemes can increase the speed of content delivery by an order of magnitude without requiring fast backhaul speeds. Last, Section 2.7 discusses practical considerations and Section 2.8 concludes the work.

2.2 Prior work and contributions

This work is related to a number of prior lines of work. Our setup is that of heterogeneous networks, formed by distributing multi-tier low power nodes (e.g., micro-BSs and femto-BSs) in macro-cellular networks [1, 2, 6, 54, 55]. It builds upon prior work on BS cooperation, more generally known as Coordinated Multi-Point (CoMP), and on FemtoCaching. There is a long line of research in CoMP, see, for example, [47, 48]. FemtoCaching has been recently introduced to trade off backhaul capacity for cache capacity [57-59] and can be further applied to device-to-device communication networks [64, 65] and to coded caching [66, 67]. FemtoCaching itself builds upon prior work on distributed caching, content placement schemes, and content distribution networks, see, for example, [68].
In addition to using standard analytical tools like convex and integer optimization, combinatorics, and Shannon rate formulas, we also use stochastic geometry [69] to take into account co-channel interference in the context of heterogeneous networks, see, e.g., the relevant analysis in [70]. Content placement and caching in wireless cellular networks are investigated in [71], where the locations of base stations are distributed according to a Poisson point process. Directly related to this work is [72, 73], where the authors use a coding scheme to introduce redundancy in caches and create CoMP opportunities for cooperative transmissions. A fundamental difference between this prior work and our work is that we consider the effect of cache misses, since any type of redundancy decreases the number of distinct files that can be stored in finite size caches. To optimize the system performance, we appropriately control the stored redundancy for each individual file and dynamically (on a per-request basis) form a cluster of cooperating femto-BSs.

Figure 2.1: System model for cache-driven femto-BS cooperation: (a) optimal caching under MRT; (b) threshold-based caching under ZFBF. [Diagram omitted: a macro-BS and four femto-BSs, each with a cache of file indices, serving a typical user over distances d_M and d_F, connected to the core network via backhaul.]

Our contributions are as follows: We combine the concepts of FemtoCaching and BS cooperation to propose a novel, high-performing system architecture. We derive analytical expressions for the user rates and delays and jointly optimize the caching strategy and the PHY layer cooperation. We devise efficient caching strategies for providing diversity gains under MRT, multiplexing gains under ZFBF, and the optimal diversity-multiplexing tradeoff. Last, we study the performance of our schemes under practical scenarios and address deployment considerations.
2.3 System model

2.3.1 Topology

Consider a typical user in a macrocell. Suppose there are N femto-BSs and another K - 1 users in the neighborhood of the typical user. We denote by d_{0,k}, 1 <= k <= K, and d_{j,k}, 1 <= j <= N, 1 <= k <= K, the distance between the macro-BS and the kth user, and between the jth femto-BS and the kth user, respectively. Let these K co-located users be associated with the same N neighboring femto-BSs and the macro-BS. For simplicity, we assume that the distances between the co-located users and the N neighboring femto-BSs (the macro-BS) are the same, i.e., d_{j,k} = d_F (d_{0,k} = d_M); see Fig. 2.1b for a pictorial representation. (Footnote 1: This makes the analysis tractable without changing the behavior of the caching strategy. Specifically, while different user spatial distributions would yield different quantitative results, qualitatively they would be similar; see, for example, Fig. 2.8b and c, where the performance of various schemes is plotted for two different user spatial distributions.)

These N neighboring femto-BSs are candidates for cooperative transmissions; see Fig. 2.1a for a scenario where femto-BSs transmit the same content (say, file 1) to one user, and Fig. 2.1b for a scenario where femto-BSs transmit concurrently to multiple users (say, four different files, namely files 1, 2, 3, and 4, to four users).

Figure 2.2: Optimal caching under MRT, N = 4, m = 5.

In a typical real-world scenario one may have tens or hundreds of femto-BSs inside a macrocell and hundreds or thousands of users. Thus, femto-BSs would be grouped into clusters of nearby femto-BSs which can concurrently serve a number of users. For example, one may have one such cluster per floor of a large building, or one cluster per building. In the rest of this work we will focus on one such cluster of co-located femto-BSs and users.

In general, the users, the macro-BS, and the femto-BSs would be assigned different time-frequency slots (resource blocks) for data transmissions.
We will only focus on the downlink, and thus on transmissions towards the K users. Also, for simplicity, we will focus on a single frequency slot which the macro-BS and the femto-BSs may use to transmit to the users. In some of the physical layer schemes that we study, one time slot can be used for a single user only (e.g., MRT), while in others it can be used for multiple users concurrently (e.g., ZFBF). Depending on whether a user is served by the macro-BS, a single femto-BS, or multiple femto-BSs using MRT or ZFBF, the data rate that it receives during this time slot varies. Last, a control frame is used to collect requests from the users at the macro-BS, and then the users are served in subsequent time slots.

(Footnote 2: It is beyond the scope of this work to further investigate clustering algorithms; see, for example, [74, 75] for details.)

(Footnote 3: Cellular systems use OFDM, where subcarriers/frequency slots are usually assumed to be i.i.d. In this case, extending the analysis to multiple frequency slots is straightforward since we are not using instantaneous CSI but rather the outage probability; thus all one needs to do is to multiply the rate by the number of frequency slots [61].)

Figure 2.3: Threshold-based caching under ZFBF. [Diagram omitted: files 1 to T are cached at all N femto-BSs; files T+1 to T+N(m-T) are striped across the femto-BSs, one copy each.]

Figure 2.4: Control and data frames under (a) optimal caching under MRT, (b) threshold-based caching under ZFBF. [Diagram omitted: a control frame followed by time slots; under ZFBF, an epoch consists of portions K_1, K_2, K_3.]

2.3.2 Caching strategies and cache-driven cooperation policies

Suppose that there is a library of M video files which are ordered according to their (normalized) popularity p_i, 1 <= i <= M, with p_1 >= p_2 >= ... >= p_M and $\sum_{i=1}^M p_i = 1$. In other words, a typical user requests the ith file with probability p_i. Furthermore, we define the cdf of the file popularity distribution as $v_j \triangleq \sum_{i=1}^j p_i$, j = 1,...,M.
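The popularity law is kept general at this point; in the caching literature a Zipf distribution is the common concrete choice. The sketch below (our illustration; the exponent value is an assumption, not from the thesis) builds such a popularity vector and the cdf v_j just defined:

```python
def zipf_popularity(M, s=0.8):
    # p_i proportional to 1/i^s, so that p_1 >= p_2 >= ... >= p_M, sum = 1
    w = [1.0 / i ** s for i in range(1, M + 1)]
    Z = sum(w)
    return [x / Z for x in w]

def popularity_cdf(p):
    # v_j = sum_{i<=j} p_i, the cdf used throughout the analysis
    v, acc = [], 0.0
    for pi in p:
        acc += pi
        v.append(acc)
    return v
```

Quantities such as the type 2 hit probability v_{T+N(m-T)} - v_T used later in the chapter can then be read directly off this cdf.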
To simplify the analytical exposition, we assume that all files have the same size (say, L bits) and that the unit of cache size is one file. The macro-BS stores (or has access to) all of the M files, while each femto-BS has a cache that can store up to m files, with m < M. The caching strategy specifies for each file at which femto-BS(s) it will be cached, if any, and the cooperation policy specifies the PHY mechanism used for the cooperative transmissions. (Footnote 4: The caching decisions in a FemtoCaching architecture are based on content popularity statistics collected over a large time horizon and over numerous users from a large geographical region, rather than on personalized statistics, following the approach of CDNs. Thus, it makes sense to assume that all users draw content requests from the same content popularity distribution.) We have the following two different caching and cooperation strategies, aiming at providing diversity gains and multiplexing gains respectively.

Providing diversity gains by optimal caching under MRT

We formulate the caching problem under MRT as an optimization problem, see Section 2.4.1. The optimal caching strategy determines in how many femto-BSs a file will be cached; see, for example, Fig. 2.2, where files 1, 2 and 3 are cached in three femto-BSs, files 4, 5, 6 and 7 in two, files 8, 9 and 10 in one, and the remaining files can only be transmitted by the macro-BS. Because the aforementioned optimization problem can be shown to be NP-hard, we also provide a convex programming formulation by replacing the hard cache size constraint with a probabilistic one, according to which each file i is cached with probability q_i determined by the convex formulation, under the constraint that the expected number of cached files at each femto-BS is less than or equal to m; see Section 2.4.1 for details.
Under MRT, each femto-BS that has the requested file in its cache beamforms its signal to the user who requested the file so that the signals (from the cooperating femto-BSs) are coherently combined at the receiver, producing a diversity gain for the desired signal. If none of the femto-BSs have the requested file, the request will be served by the macro-BS. For example, in Fig. 2.1a, if a user requests the 3rd file, then femto-BSs 1, 2, and 4 will beamform to the user. Note that no data file exchanges are needed for the cooperation, but control signals such as channel state information may need to be distributed among the femto-BSs in the cooperation cluster. Last, Fig. 2.4a shows a typical sequence of time slots for the optimal caching under MRT scheme, where, following a control frame, each file request is served for one time slot.

Providing multiplexing gains by threshold-based caching under ZFBF

Formulating the caching problem under ZFBF as an optimization problem leads to an exponential state space, see Section 2.4.2 for details. What is more, given any cache state, computing the rate under ZFBF requires an exponential number of evaluations due to its dependence on the selection of users to be concurrently served. Motivated by this, we introduce some structure by grouping files together based on their popularity and caching the same number of copies of the files of a group. As an example, if a single threshold T is used, 0 <= T <= m, we would cache files 1 to T (referred to as type 1 files) in all of the femto-BSs, and files T+1 to T+N(m-T) (referred to as type 2 files) in exactly one of the femto-BSs (see Fig. 2.3). That is, we would have N copies of each of the T most popular files, one copy of each of the files T+1 to T+N(m-T), and the remaining files T+N(m-T)+1 to M (referred to as type 3 files) would be downloaded only via the macro-BS.
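The single-threshold placement just described can be sketched programmatically as follows; the helper name and the concrete file indexing are ours, not the thesis's:

```python
def threshold_placement(M, N, m, T):
    """Single-threshold cache placement (cf. Fig. 2.3): files 1..T
    (type 1) go in every femto-BS cache; files T+1 .. T+N*(m-T)
    (type 2) are striped one copy each across the N femto-BSs; the
    rest (type 3) live only at the macro-BS."""
    assert 0 <= T <= m
    caches = [list(range(1, T + 1)) for _ in range(N)]
    f = T + 1
    for _slot in range(m - T):       # remaining cache slots per femto-BS
        for bs in range(N):
            if f <= M:
                caches[bs].append(f)
                f += 1
    type3_start = T + N * (m - T) + 1
    return caches, type3_start
```

With N = 4 and m = 5, setting T = 2 yields 2 type 1 files replicated everywhere and 12 type 2 files cached once each, matching the counts in the text.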
Since all the femto-BSs cache type 1 files, the N femto-BSs can coordinate their transmissions and serve N type 1 file requests simultaneously, producing a multiplexing gain of order N. No cooperative transmission is used for serving a type 2 (type 3) file request because only a single femto-BS (the macro-BS) caches the requested file. Fig. 2.4b shows a typical sequence of epochs for the single-threshold-based caching under ZFBF scheme, where an epoch is defined as a collection of time slots of length K_1 + K_2 + K_3. In the following analysis (see Section 2.4.2) we will normalize the epoch length to one, i.e., K_1 + K_2 + K_3 = 1. Thus, K_i, i = 1, 2, 3, will be the portion of time allocated to transmitting type i files.

2.3.3 Channel model

We assume that the users are equipped with a single antenna. For simplicity of exposition we start with the assumption that macro- and femto-BSs also have one antenna, and then generalize to the multi-antenna case in Section 2.4.2. We consider quasi-static Rayleigh flat fading channels with unit mean power. We denote the transmit power of a macro-BS and a femto-BS by P_M and P_F, and the data rates of a macro-BS and a femto-BS by R_M and R_F. The bandwidth of the frequency slot is denoted by W, the path loss exponent by α, and the noise power spectral density by N_0. To simplify the notation in the analysis, we also define the effective data rate as the data rate multiplied by the non-outage (transmission success) probability (i.e., the probability that the channel can support the data rate [61]). Thus, for a macro-BS the effective data rate equals

$\tilde{R}_M \triangleq R_M \Pr\left( W \log\left(1 + \frac{P_M S^{(1)} d_M^{-\alpha}}{N_0 W}\right) > R_M \right)$

and for a femto-BS cluster with a diversity of order j it equals

$\tilde{R}_F^{(j)} \triangleq R_F \Pr\left( W \log\left(1 + \frac{P_F S^{(j)} d_F^{-\alpha}}{N_0 W}\right) > R_F \right),$

where S^{(1)} is an exponential random variable with unit mean (Rayleigh fading), and S^{(j)} is the sum of j i.i.d. exponential random variables with unit mean, due to the coherent combining of the signals from j femto-BSs.
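Since S^(j) is the sum of j unit-mean exponentials (an Erlang-j variable), the non-outage probabilities above have closed forms, which the following sketch evaluates (symbol names mirror the text; any numeric inputs are placeholders):

```python
import math

def theta(R, W, N0, d, alpha, P):
    # SNR threshold implied by requiring W*log2(1 + SNR) > R:
    # theta = (2^{R/W} - 1) * N0 * W * d^alpha / P
    return (2 ** (R / W) - 1) * N0 * W * d ** alpha / P

def eff_rate_macro(RM, th_M):
    # ~R_M = R_M * Pr(S^(1) > th_M), with S^(1) ~ Exp(1) => e^{-th_M}
    return RM * math.exp(-th_M)

def eff_rate_femto(RF, th_F, j):
    # ~R_F^(j) = R_F * Pr(S^(j) > th_F); S^(j) is Erlang(j, 1), whose
    # tail probability is e^{-t} * sum_{k<j} t^k / k!
    return RF * math.exp(-th_F) * sum(th_F ** k / math.factorial(k)
                                      for k in range(j))
```

Coherent combining of j femto-BS signals only improves the tail probability, so eff_rate_femto is nondecreasing in j and is always upper-bounded by the raw rate R_F.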
Note that, for further simplicity, when j = 1 we denote $\tilde{R}_F^{(1)}$ as $\tilde{R}_F$. The above channel model can be easily extended to take into account co-channel interference from other macro-BSs and from femto-BSs outside the cooperation cluster that use the same frequency slot at the same time as the typical user under study (i.e., the same resource block is spatially reused); see Appendix 2.9.1. Table 2.1 summarizes the main notation used in the work.

2.4 Performance analysis in terms of rates

We study the saturation throughput, that is, we assume that there are always enough requests for files, and thus enough pending bits, to be transmitted to the various users at every time slot. (Footnote 5: By using ZFBF, if there are N transmitters and K receivers (K <= N), the N transmitters can transmit K independent (non-interfering or spatially isolated) streams to the K receivers simultaneously, each with a diversity of order 1.)

Table 2.1: Main notations
  Number of femto-BSs: N
  Number of antennas in a macro-BS (femto-BS): L_M (L_F)
  Distance between a typical user and a macro-BS (femto-BS): d_M (d_F)
  Number of video files: M
  Cache size in a femto-BS: m
  Number of users: K
  File popularity distribution: p_i (cdf v_j)
  File size: L
  Bandwidth of a frequency slot: W
  Transmit power of a macro-BS (femto-BS): P_M (P_F)
  Data rate of a macro-BS (femto-BS): R_M (R_F)
  Effective data rate of a macro-BS (femto-BS): ~R_M (~R_F / ~R_F^{(j)})
  Path loss exponent: α
  Noise power spectral density: N_0
  Effective data rate with MRT (ZFBF): R_MRT (R_ZFBF)
  Effective data rate with MRT-ZFBF: R_MRT-ZFBF
  Arrival rate of file requests: λ
  System delay with MRT (ZFBF): W_MRT (W_ZFBF)

In the following, integer and convex optimization are used in the performance analysis of the caching strategies under MRT and ZFBF.

2.4.1 Optimal and randomized caching under MRT

We aim to maximize the effective data rate of a typical user (denoted R_MRT) averaged over the file popularity distribution, with respect to the caching decision variables and subject to the cache size constraint.
Let x_{i,j} be an indicator variable where x_{i,j} = 1 if the ith file is cached with j copies in j femto-BSs and x_{i,j} = 0 otherwise. Note that x_{i,0} = 1 indicates that the ith file is only cached in the macro-BS. For example, in Fig. 2.2, x_{1,3} = x_{2,3} = x_{3,3} = 1, x_{4,2} = x_{5,2} = x_{6,2} = x_{7,2} = 1, x_{8,1} = x_{9,1} = x_{10,1} = 1, x_{11,0} = ... = x_{M,0} = 1, and the rest of the x_{i,j} are zero. We have the following optimization problem:

$\begin{aligned} \underset{x_{i,j}}{\text{maximize}} \quad & \sum_{i=1}^M \sum_{j=0}^N p_i U_j x_{i,j} \triangleq R_{\text{MRT}} \\ \text{subject to} \quad & \sum_{i=1}^M \sum_{j=0}^N j\, x_{i,j} \le mN, \\ & \sum_{j=0}^N x_{i,j} = 1, \quad i = 1,\ldots,M, \\ & x_{i,j} \in \{0,1\}, \quad i = 1,\ldots,M,\ j = 0,\ldots,N, \end{aligned} \quad (2.1)$

where p_i is the file popularity distribution (i.e., the probability that the typical user requests the ith file), $U_0 \triangleq \tilde{R}_M$ is the effective data rate for transmitting a file via the macro-BS, and $U_j \triangleq \tilde{R}_F^{(j)}$, j = 1,...,N, is the effective data rate for transmitting a file via a cluster of j femto-BSs. The constraint $\sum_{i=1}^M \sum_{j=0}^N j\, x_{i,j} \le mN$ corresponds to the total cache size constraint. (Since what matters is the total number of femto-BSs in which a file is cached and not the specific femto-BSs, there is no need for individual cache size constraints; the caches are filled as indicated by the arrow in Fig. 2.2.) We note that the above problem is a multiple-choice knapsack problem [76], which is NP-hard. We can solve the multiple-choice knapsack problem by using a large-scale mixed integer linear programming solver such as Gurobi, see Section 2.6. In the following, we further provide a convex programming relaxation by replacing the hard cache size constraint with a probabilistic one. Specifically, we assume that for each femto-BS the caching strategy caches the ith file with probability q_i, 0 <= q_i <= 1, i = 1,...,M, subject to the probabilistic cache size constraint $\sum_{i=1}^M q_i \le m$.
In other words, the expected number of cached files at each femto-BS is less than or equal to m, as can be seen by defining the random variables X_i = 1 if the ith file is cached at the femto-BS under consideration and X_i = 0 otherwise, and noting that $E[\sum_{i=1}^M X_i] = \sum_{i=1}^M q_i$. The convex programming formulation (called randomized caching under MRT) is written as:

$\begin{aligned} \underset{q_1,\ldots,q_M}{\text{maximize}} \quad & \sum_{i=1}^M p_i U(q_i) \\ \text{subject to} \quad & \sum_{i=1}^M q_i \le m, \\ & 0 \le q_i \le 1, \quad i = 1,\ldots,M, \end{aligned} \quad (2.2)$

where U(q_i) is the effective data rate for the transmission of the ith file. Since we have N femto-BSs, the number of copies of the ith file in the femto-BSs is a binomial random variable with mean N q_i. Thus, the number of cooperating femto-BSs for the transmission of the ith file (denoted C_i) follows the law

$\Pr(C_i = j) = \binom{N}{j} q_i^j (1-q_i)^{N-j}, \quad j = 0, 1, \ldots, N. \quad (2.3)$

The effective data rate for the transmission of the ith file, U(q_i), can be computed as

$U(q_i) = \Pr(C_i = 0)\, \tilde{R}_M + \sum_{j=1}^N \Pr(C_i = j)\, \tilde{R}_F^{(j)} = (1-q_i)^N \tilde{R}_M + \sum_{j=1}^N \binom{N}{j} q_i^j (1-q_i)^{N-j} \tilde{R}_F^{(j)}. \quad (2.4)$

Since we have assumed Rayleigh fading, we can rewrite the effective data rates as $\tilde{R}_M = R_M e^{-\theta_M}$ and $\tilde{R}_F = R_F e^{-\theta_F}$, where $\theta_M = (2^{R_M/W} - 1) N_0 W d_M^{\alpha} / P_M$ and $\theta_F = (2^{R_F/W} - 1) N_0 W d_F^{\alpha} / P_F$. We have the following proposition.

Proposition 2.1. If $R_M e^{-\theta_M} \le R_F e^{-\theta_F} (1 - \theta_F)$, then U(q_i) is concave and thus $\sum_{i=1}^M p_i U(q_i)$ is concave on the domain $0 \le q_i \le 1$, $i = 1,\ldots,M$.

Proof. The proof is provided in Appendix 2.9.2.

The condition $R_M e^{-\theta_M} \le R_F e^{-\theta_F} (1 - \theta_F)$ implies that the marginal rate gain of including one more femto-BS in the cluster to perform cooperative transmission is decreasing and is smaller than the difference between the rates of a femto-BS and a macro-BS (see Appendix 2.9.2 for a detailed discussion). These conditions usually hold in practice.

Theorem 2.1.
The optimal solution $q_i^*$ of Problem (2.2) satisfies

$q_i^* = \left[ U'^{-1}\!\left( \frac{\lambda}{p_i} \right) \right]_0^1, \quad i = 1,\ldots,M, \qquad \sum_{i=1}^M q_i^* = m, \quad (2.5)$

where $\lambda$ is the Lagrange multiplier, $U'^{-1}(\cdot)$ is the inverse function of $U'(\cdot)$, and $[x]_0^1 \triangleq \min(\max(0, x), 1)$.

Proof. By Proposition 2.1, Problem (2.2) is a maximization problem with a concave objective and linear constraints. Therefore, the optimal solution $q_i^*$ satisfies the KKT conditions [77], which are Eq. (2.5). In addition, $U'^{-1}(\cdot)$ exists since $U'(\cdot)$ is monotone (see Lemma 2.1 in Appendix 2.9.2).

Note that $U'^{-1}(\cdot)$ is a decreasing function. So, $q_i^*$ increases as p_i increases, i.e., we cache a more popular file with higher probability, which makes sense intuitively. Finally, for any instantiation of the randomized caching, we can restore the hard cache size constraint by limiting the number of cached files at each femto-BS to m. Such a cut-off induces little performance loss, as can be seen in Fig. 2.8d.

Remark 2.1. Problem (2.1) maximizes the instantaneous data rate. If we are interested in studying the rate over a long time horizon assuming an infinite backlog of file requests, then how we select which file request to service next affects the result. To see this, notice that Problem (2.1) corresponds to selecting a file-i request with probability p_i, and, once the x_{i,j}'s are fixed, this sets the rate $U_i \triangleq \sum_{j=0}^N U_j x_{i,j}$ with which each file-i request will be served. A file-i request is serviced with rate U_i for a $\frac{p_i L / U_i}{\sum_{k=1}^M p_k L / U_k}$ portion of time, and the ensuing long term (time average) rate can be easily computed as $\left( \sum_{i=1}^M p_i \frac{1}{U_i} \right)^{-1}$. In Section 2.5.1, we study such long term quantities, where we optimize the caching decisions x_{i,j} to minimize the total system delay (service plus queueing delay) or just the service delay (which is equivalent to maximizing the long term service/data rate) under a specific file request arrival process.
2.4.2 Optimal and threshold-based caching under ZFBF
We aim to maximize the effective data rate of $K$ users averaged over the file popularity distribution, with respect to the caching decision variables, subject to the cache size constraint. Let us define the set of file requests from the $K$ users as $\mathcal{K} \triangleq \{(r_1,\ldots,r_K) : r_k \in \{1,\ldots,M\},\, k = 1,\ldots,K\}$, where $r_k$ is the index of the file requested by user $k$ and, thus, $p_{r_k}$ is the probability that user $k$ requests file $r_k$. Let $y_{i,j}$ be an indicator variable where $y_{i,j} = 1$ if femto-BS $j$ caches the $i$th file and $y_{i,j} = 0$ otherwise. We have the following optimization problem:
\[
\begin{aligned}
\underset{y_{i,j}}{\text{maximize}} \quad & \sum_{(r_1,\ldots,r_K) \in \mathcal{K}} p_{r_1} \cdots p_{r_K}\, U(r_1,\ldots,r_K; y_{i,j}\ \forall i,j) \\
\text{subject to} \quad & \sum_{i=1}^{M} y_{i,j} \le m, \quad j = 1,\ldots,N, \\
& y_{i,j} \in \{0,1\}, \quad i = 1,\ldots,M, \; j = 1,\ldots,N,
\end{aligned} \tag{2.6}
\]
where $U(r_1,\ldots,r_K; y_{i,j}\ \forall i,j)$ is the data rate of serving the file requests from the $K$ users by using ZFBF, which depends on which femto-BSs cache each file. To see this, consider the caching pattern in Fig. 2.2. Files 4, 5, 6 and 7 are cached in two femto-BSs each. If two users request, say, files 4 and 6, then we get a multiplexing gain of 2, whereas if they request, say, files 4 and 5, we cannot get any multiplexing gain because no two femto-BSs have both files in their caches. Thus the rates depend on the specific caching pattern. In contrast, in the analysis under MRT it was enough to know the total number of femto-BSs that cache a particular file in order to compute rates.
It is evident that the above problem has an exponential state space. What is more, given any cache state, computing the rate under ZFBF requires an exponential number of evaluations. Motivated by this, we introduce some structure to the problem by defining thresholds to separate files into groups, creating the same number of copies for the files of the same group, and optimizing the value of these thresholds.
We first consider the case of a single threshold $T$, $0 \le T \le m$, where the file requests are generated by $K$ users with $K \ge N$; thus the maximum multiplexing gain of $N$ is achievable. (Later we consider the case where $K < N$, which restricts the multiplexing gain to $K$, and the case of multiple thresholds.) Recall that in this case (see Section 2.3.2) the $T$ most popular files (type 1 files) are cached in all $N$ femto-BSs, the next $N(m-T)$ most popular files (type 2 files) are cached in a single femto-BS each, and the rest of the files (type 3) can be retrieved from the macro-BS only. As a result, the probability of a typical file request being type 1 is $v_T$, the probability of being type 2 is $v_{T+N(m-T)} - v_T$, and the probability of being type 3 is $1 - v_{T+N(m-T)}$, where recall that $v_j$ denotes the cdf of the file popularity distribution. By using ZFBF, we can simultaneously transmit $N$ type 1 files with sum rate $N \tilde{R}_F$, while the transmission rate for a type 2 (type 3) file is $\tilde{R}_F$ ($\tilde{R}_M$). Under this threshold structure, the optimization problem is simplified to (see also [5])
\[
\begin{aligned}
\underset{T}{\text{maximize}} \quad & \sum_{i=0}^{K} \sum_{j=0}^{K-i} \frac{K!}{i!\,j!\,(K-i-j)!}\, v_T^i \left( v_{T+N(m-T)} - v_T \right)^j \left( 1 - v_{T+N(m-T)} \right)^{K-i-j} \\
& \qquad \times \left[ \frac{i}{K} \min\{i,N\} \tilde{R}_F + \frac{j}{K} \tilde{R}_F + \frac{K-i-j}{K} \tilde{R}_M \right] \\
\text{subject to} \quad & T \in \{0,1,\ldots,m\}.
\end{aligned} \tag{2.7}
\]
Following the lines of Remark 2.1, Problem (2.7) maximizes the instantaneous rate averaged over the $K$ users, $i$ of which selected a type 1 file, $j$ a type 2 file, and $K-i-j$ a type 3 file. If we are interested in studying the long-term rate, then note that different types of files have different service rates; thus the portion of time spent servicing type $i$ files affects the long-term data rate. In general, let $\mathcal{K}_i$ denote the portion of time dedicated to servicing type $i$ files. Above we have implicitly assumed $\mathcal{K}_1 = \frac{i}{K}$, $\mathcal{K}_2 = \frac{j}{K}$ and $\mathcal{K}_3 = \frac{K-i-j}{K}$, which would result in serving many more type 1 files than type 2 and 3 files if we were to study the system over a long time horizon. With this in mind, the optimization problem can be generalized as:
\[
\begin{aligned}
\underset{T}{\text{maximize}} \quad & \sum_{i=0}^{K} \sum_{j=0}^{K-i} \frac{K!}{i!\,j!\,(K-i-j)!}\, v_T^i \left( v_{T+N(m-T)} - v_T \right)^j \left( 1 - v_{T+N(m-T)} \right)^{K-i-j} \\
& \qquad \times \left[ \mathcal{K}_1 \min\{i,N\} \tilde{R}_F + \mathcal{K}_2 \tilde{R}_F + \mathcal{K}_3 \tilde{R}_M \right] \\
\text{subject to} \quad & T \in \{0,1,\ldots,m\}.
\end{aligned} \tag{2.8}
\]
Motivated by fairness concerns, we allocate system resources to serve the three types of file requests in proportion to their traffic load, that is, $\mathcal{K}_1 \min\{i,N\} \tilde{R}_F : \mathcal{K}_2 \tilde{R}_F : \mathcal{K}_3 \tilde{R}_M = i : j : K-i-j$. Also, to simplify the above multinomial formulation in order to obtain more insight, we consider the limiting regime where the number of requests is large enough that the fractions of type 1, type 2, and type 3 file requests converge to $v_T$, $v_{T+N(m-T)} - v_T$, and $1 - v_{T+N(m-T)}$, respectively. The allocation of system resources now satisfies $\mathcal{K}_1 N \tilde{R}_F : \mathcal{K}_2 \tilde{R}_F : \mathcal{K}_3 \tilde{R}_M = v_T : v_{T+N(m-T)} - v_T : 1 - v_{T+N(m-T)}$ and $\mathcal{K}_1 + \mathcal{K}_2 + \mathcal{K}_3 = 1$. Using simple algebra we obtain
\[
\mathcal{K}_1 = \frac{v_T}{C N \tilde{R}_F}, \quad \mathcal{K}_2 = \frac{v_{T+N(m-T)} - v_T}{C \tilde{R}_F}, \quad \mathcal{K}_3 = \frac{1 - v_{T+N(m-T)}}{C \tilde{R}_M}, \tag{2.9}
\]
where $C = \frac{v_T}{N \tilde{R}_F} + \frac{v_{T+N(m-T)} - v_T}{\tilde{R}_F} + \frac{1 - v_{T+N(m-T)}}{\tilde{R}_M}$. As a result, the effective data rate (denoted as $R_{\mathrm{ZFBF}}(T)$) can be computed as
\[
R_{\mathrm{ZFBF}}(T) = \mathcal{K}_1 N \tilde{R}_F + \mathcal{K}_2 \tilde{R}_F + \mathcal{K}_3 \tilde{R}_M
= \left( \frac{v_T}{N \tilde{R}_F} + \frac{v_{T+N(m-T)} - v_T}{\tilde{R}_F} + \frac{1 - v_{T+N(m-T)}}{\tilde{R}_M} \right)^{-1}, \tag{2.10}
\]
and our optimization problem is further simplified to
\[
\underset{T}{\text{maximize}} \quad R_{\mathrm{ZFBF}}(T) \qquad \text{subject to} \quad T \in \{0,1,\ldots,m\}. \tag{2.11}
\]
We can find the optimal caching threshold $T^\star$ by enumeration of the solution space $T = 0,1,\ldots,m$, which is linear in the cache size $m$ and thus easy to compute in practice. Furthermore, we can analytically solve for the optimal $T$ when the discrete file popularity distribution can be approximated by a continuous one. For example, consider the Zipf distribution with parameter $s$, i.e., $p_i = c_{M,s}/i^s$, $i = 1,\ldots,M$, where $c_{M,s}$ is the normalization constant. We have the following theorem:
Theorem 2.2. Under the Zipf distribution with parameter $s$, the optimal solution $T^\star$ of Problem (2.11) with the objective function defined by Eq. (2.10) can be computed as
\[
T^\star \approx \left[ \frac{Nm}{\Delta + N - 1} \right]_0^m, \qquad \Delta \triangleq \left( \frac{N (\tilde{R}_F - \tilde{R}_M)}{\tilde{R}_M} \right)^{1/s}, \tag{2.12}
\]
where the notation $[x]_0^m \triangleq \min(\max(0,x),m)$.
Proof.
The proof is provided in Appendix 2.9.3.
In Fig. 2.10, we compare the performance of the unstructured optimal formulation (Eq. (2.6)) versus the structured, threshold-based formulation for both the multinomial (Eq. (2.8)) and the limiting (Eq. (2.11)) case, where for (2.6) and (2.8) we allocate system resources to serve the three types of file requests in proportion to their traffic load, as we have done for (2.11). It is evident that the performance penalty from introducing thresholds is minimal.
Tradeoff between multiplexing gain and hit ratio
We consider two special cases which are easy to analyze and provide intuition. When $T = m$, we cache the most popular $m$ files in all $N$ femto-BSs (targeting a large multiplexing gain) and we have $R_{\mathrm{ZFBF}}(m) = \left( \frac{v_m}{N \tilde{R}_F} + \frac{1 - v_m}{\tilde{R}_M} \right)^{-1}$. If $v_m \approx 1$, that is, the $m$ most popular files contain most of the probability mass, we have $R_{\mathrm{ZFBF}}(m) \approx N \tilde{R}_F$.
When $T = 0$, we cache only one copy of the most popular $Nm$ files in the femto-BSs (targeting a large cache hit ratio) and we have $R_{\mathrm{ZFBF}}(0) = \left( \frac{v_{Nm}}{\tilde{R}_F} + \frac{1 - v_{Nm}}{\tilde{R}_M} \right)^{-1}$. If $v_{Nm} \approx 1$, we have $R_{\mathrm{ZFBF}}(0) \approx \tilde{R}_F$.
We can see that for a very skewed popularity distribution satisfying $v_m \approx 1$, the rate $R_{\mathrm{ZFBF}}$ with threshold $T = m$ is $N$ times higher than that with threshold $T = 0$, where $N$ is the maximum multiplexing gain.
More generally, Fig. 2.9 in Section 2.6 shows the resulting data rate under various threshold values for practical scenarios in the case of a single-threshold-based caching strategy. It is evident that there is a tradeoff associated with the value of the design parameter $T$. When $T$ is large, we benefit from the multiplexing gain, but more redundant files are held in the caches, resulting in an increasing number of requests towards the low-rate macro-BS (cache misses). When $T$ is small, we lose the multiplexing gain, but most of the files are in the caches of the femto-BSs, generating fewer requests towards the macro-BS. The optimal choice of $T$ depends on the file popularity distribution.
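Both routes to the optimal threshold, enumeration of Problem (2.11) and the closed form of Theorem 2.2, are cheap to implement. The sketch below is illustrative only (helper names and the test's rate values $\tilde{R}_F = 6$, $\tilde{R}_M = 1.2$ Mbps are our choices, echoing the numbers used later in Section 2.6):

```python
def zipf_cdf(M, s):
    """v_j = probability that a request falls within the j most popular files."""
    w = [1.0 / i ** s for i in range(1, M + 1)]
    tot, acc, cdf = sum(w), 0.0, [0.0]
    for x in w:
        acc += x
        cdf.append(acc / tot)
    return cdf                       # cdf[j] = v_j, with v_0 = 0

def R_zfbf(T, v, N, m, RF_eff, RM_eff):
    """Effective data rate of Eq. (2.10)."""
    hit_all = v[T]                                  # type 1 mass
    hit_one = v[T + N * (m - T)] - v[T]             # type 2 mass
    miss = 1.0 - v[T + N * (m - T)]                 # type 3 mass
    return 1.0 / (hit_all / (N * RF_eff) + hit_one / RF_eff + miss / RM_eff)

def best_threshold(v, N, m, RF_eff, RM_eff):
    """Enumerate T = 0..m, linear in m, as described for Problem (2.11)."""
    return max(range(m + 1), key=lambda T: R_zfbf(T, v, N, m, RF_eff, RM_eff))

def theorem_2_2_threshold(N, m, RF_eff, RM_eff, s):
    """Closed-form approximation of Eq. (2.12): T* = [Nm / (Delta + N - 1)] clipped to [0, m]."""
    delta = (N * (RF_eff - RM_eff) / RM_eff) ** (1.0 / s)
    return min(max(0.0, N * m / (delta + N - 1)), m)
```

For a uniform popularity distribution ($s = 0$) the enumeration returns $T^\star = 0$, and for a very skewed one the closed form and the enumeration agree on an interior optimum, matching the tradeoff discussed above.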
User-limited multiplexing gain
We now study the case where the number of users $K$ can be smaller than the number of femto-BSs $N$. When $K < N$, the maximum multiplexing gain is $K$, limited by the number of users. In this case, $\mathcal{K}_1$, $\mathcal{K}_2$, and $\mathcal{K}_3$ satisfy $\mathcal{K}_1 \min\{K,N\} \tilde{R}_F : \mathcal{K}_2 \tilde{R}_F : \mathcal{K}_3 \tilde{R}_M = v_T : v_{T+N(m-T)} - v_T : 1 - v_{T+N(m-T)}$. Similarly to before, after solving for $\mathcal{K}_1$, $\mathcal{K}_2$, and $\mathcal{K}_3$, we have
\[
R_{\mathrm{ZFBF}}(T) = \mathcal{K}_1 \min\{K,N\} \tilde{R}_F + \mathcal{K}_2 \tilde{R}_F + \mathcal{K}_3 \tilde{R}_M
= \left( \frac{v_T}{\min\{K,N\} \tilde{R}_F} + \frac{v_{T+N(m-T)} - v_T}{\tilde{R}_F} + \frac{1 - v_{T+N(m-T)}}{\tilde{R}_M} \right)^{-1}. \tag{2.13}
\]
To compute the optimal caching threshold, we have the following theorem:
Theorem 2.3. Under the Zipf distribution with parameter $s$, the optimal solution $T^\star$ of Problem (2.11) with the objective function defined by Eq. (2.13) can be computed as
\[
T^\star \approx \left[ \frac{Nm}{\Delta + N - 1} \right]_0^m, \qquad \Delta \triangleq \left( \frac{(N-1)\left( \frac{1}{\tilde{R}_F} - \frac{1}{\tilde{R}_M} \right)}{\frac{1}{\min\{K,N\} \tilde{R}_F} - \frac{1}{\tilde{R}_F}} \right)^{1/s}. \tag{2.14}
\]
Proof. The proof is provided in Appendix 2.9.4.
Multiple thresholds
The threshold-based caching under ZFBF scheme can be generalized to one with multiple thresholds, where the basic idea is that the more popular a file is, the larger the number of its copies in the caches. For simplicity, assume that the number of neighboring femto-BSs is $N = 2^n$ and define $n$ thresholds $T_0, T_1, \ldots, T_{n-1}$, which are design parameters.
Figure 2.5: Threshold-based caching under ZFBF with multiple thresholds ($m = 6$, $T_0 = 2$, $T_1 = 2$, $N = 2^n = 4$).
The $T_0$ most popular files will be stored in all $2^n$ femto-BSs as before ($2^n$ copies each), the next $2 T_1$ most popular files will have $2^{n-1}$ copies each, and in general, the threshold $T_i$ means that we allocate $2^n T_i$ storage units to cache $2^i T_i$ files, each with $2^{n-i}$ copies. Fig. 2.5a shows an example with $n = 2$ ($N = 4$). We partition the files into $n+2$ types (namely, type 0 files, type 1 files, ..., type $n+1$ files).
Type $i$ files, $i = 0,1,\ldots,n-1$, refer to the $2^i T_i$ files stored in the caches with $2^{n-i}$ copies each (see Fig. 2.5b). Type $n$ files refer to the $2^n \left( m - \sum_{i=0}^{n-1} T_i \right)$ files stored in the caches with a single copy each. Type $n+1$ files refer to the files stored only at the macro-BS. As a result, the probability of a typical file request being type $i$, denoted by $a_i$, is
\[
\begin{aligned}
a_0 &\triangleq v_{T_0}; \qquad a_i \triangleq v_{\sum_{j=0}^{i} 2^j T_j} - v_{\sum_{j=0}^{i-1} 2^j T_j}, \quad i = 1,\ldots,n-1; \\
a_n &\triangleq v_{\sum_{j=0}^{n-1} 2^j T_j + 2^n \left( m - \sum_{j=0}^{n-1} T_j \right)} - v_{\sum_{j=0}^{n-1} 2^j T_j}; \qquad a_{n+1} \triangleq 1 - a_0 - a_1 - \cdots - a_n.
\end{aligned} \tag{2.15}
\]
By using ZFBF, we can simultaneously transmit $\min\{K, 2^{n-i}\}$ type $i$ files with sum rate $\min\{K, 2^{n-i}\} \tilde{R}_F$, $i = 0,1,\ldots,n$. The transmission rate for a type $n+1$ file is $\tilde{R}_M$. We divide an epoch into $n+2$ portions with durations $\mathcal{K}_i$, $i = 0,1,\ldots,n+1$, for serving the type $i$ file requests. Following the same rationale as before, the epoch length is normalized to 1, i.e., $\sum_{i=0}^{n+1} \mathcal{K}_i = 1$, and the values of $\mathcal{K}_i$, $i = 0,1,\ldots,n+1$, satisfy $\mathcal{K}_0 \min\{K, 2^n\} \tilde{R}_F : \mathcal{K}_1 \min\{K, 2^{n-1}\} \tilde{R}_F : \cdots : \mathcal{K}_{n+1} \tilde{R}_M = a_0 : a_1 : \cdots : a_{n+1}$. Then, the effective data rate for the threshold-based caching under ZFBF scheme with multiple thresholds (denoted as $R^{\mathrm{multi}}_{\mathrm{ZFBF}}$) can be computed as
\[
R^{\mathrm{multi}}_{\mathrm{ZFBF}}(T_0,\ldots,T_{n-1}) = \mathcal{K}_{n+1} \tilde{R}_M + \sum_{i=0}^{n} \mathcal{K}_i \min\{K, 2^{n-i}\} \tilde{R}_F
= \left( \frac{a_{n+1}}{\tilde{R}_M} + \sum_{i=0}^{n} \frac{a_i}{\min\{K, 2^{n-i}\} \tilde{R}_F} \right)^{-1}. \tag{2.16}
\]
We maximize the effective data rate with respect to the caching thresholds:
\[
\begin{aligned}
\underset{T_i}{\text{maximize}} \quad & R^{\mathrm{multi}}_{\mathrm{ZFBF}}(T_0,\ldots,T_{n-1}) \\
\text{subject to} \quad & \sum_{i=0}^{n-1} T_i \le m, \quad T_i \in \{0,1,\ldots,m\}, \; i = 0,\ldots,n-1.
\end{aligned} \tag{2.17}
\]
The optimal caching thresholds $T_i^\star$, $i = 0,\ldots,n-1$, can be found by enumeration of the solution space, which is of size $m^n$. While this complexity is exponential in $n$, it is not very large in practice since the number of caching thresholds is quite small ($n = \log_2 N$). For example, even for relatively large clusters with, say, 8 cooperating femto-BSs there are at most 3 thresholds.
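The enumeration of Problem (2.17) is a direct product over the $m^n$ threshold combinations. A minimal sketch (parameter values and helper names are our illustrative choices; the Zipf cdf helper is repeated so the snippet is self-contained):

```python
import itertools

def zipf_cdf(M, s):
    """v_j for a Zipf(s) popularity over M files (v_0 = 0)."""
    w = [1.0 / i ** s for i in range(1, M + 1)]
    tot, acc, cdf = sum(w), 0.0, [0.0]
    for x in w:
        acc += x
        cdf.append(acc / tot)
    return cdf

def R_multi(Ts, v, n, m, K, RF_eff, RM_eff):
    """Effective rate of Eq. (2.16) for thresholds Ts = (T_0, ..., T_{n-1})."""
    a, cum = [], 0
    for i, Ti in enumerate(Ts):                 # types 0 .. n-1: 2^i T_i files each
        prev = cum
        cum += (2 ** i) * Ti
        a.append(v[cum] - v[prev])
    single = (2 ** n) * (m - sum(Ts))           # type n: one copy each
    a.append(v[cum + single] - v[cum])
    a.append(1.0 - v[cum + single])             # type n+1: macro-BS only
    inv = a[-1] / RM_eff
    for i in range(n + 1):
        inv += a[i] / (min(K, 2 ** (n - i)) * RF_eff)
    return 1.0 / inv

def best_thresholds(v, n, m, K, RF_eff, RM_eff):
    """Enumerate the feasible threshold combinations of Problem (2.17)."""
    feas = (t for t in itertools.product(range(m + 1), repeat=n) if sum(t) <= m)
    return max(feas, key=lambda t: R_multi(t, v, n, m, K, RF_eff, RM_eff))
```

With $n = \log_2 N$ small, the `itertools.product` enumeration stays tractable, as noted above.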
Multi-antenna base stations
Figure 2.6: Threshold-based caching under ZFBF with multi-antenna base stations ($m = 6$, $T = 3$, $N = 3$, $L_F = 2$).
Suppose that each femto-BS in the cluster has $L_F$ antennas and the macro-BS has $L_M$ antennas. Let us consider the single-threshold-based caching under ZFBF scheme. That is, as described before, we have three types of files, types 1, 2 and 3, with $N$, 1 and 0 copies respectively at the femto-BSs (type 3 files can be retrieved from the macro-BS). Since each femto-BS has $L_F$ antennas, it can potentially serve $L_F$ users simultaneously with ZFBF (single-cell multi-user MIMO mode). For analytical purposes, we can think of each femto-BS as having $L_F$ "virtual" copies of its cache (see Fig. 2.6; e.g., file 1' is a virtual version of file 1). This enables us to apply the analysis of Section 2.4.2 to the current case.
By using ZFBF in a multi-cell multi-user CoMP mode, where the ZFBF precoding is performed jointly across all antennas in all femto-BSs, we can simultaneously transmit $\min\{K, N L_F\}$ type 1 files with sum rate $\min\{K, N L_F\} \tilde{R}_F$. By using ZFBF in a single-cell multi-user MIMO mode, where ZFBF precoding is performed solely across the antennas of a single femto-BS, we can simultaneously transmit $\min\{K, L_F\}$ type 2 files with sum rate $\min\{K, L_F\} \tilde{R}_F$. By using ZFBF at the macro-BS (single-cell multi-user MIMO mode), we can simultaneously transmit $\min\{K, L_M\}$ type 3 files with sum rate $\min\{K, L_M\} \tilde{R}_M$. As before, we divide an epoch into three portions with durations $\mathcal{K}_1$, $\mathcal{K}_2$, and $\mathcal{K}_3$ for serving the type 1, 2 and 3 file requests respectively, where the epoch length is normalized to 1, i.e., $\mathcal{K}_1 + \mathcal{K}_2 + \mathcal{K}_3 = 1$. Following the same rationale as before, the values of $\mathcal{K}_1$, $\mathcal{K}_2$, and $\mathcal{K}_3$ satisfy $\mathcal{K}_1 \min\{K, N L_F\} \tilde{R}_F : \mathcal{K}_2 \min\{K, L_F\} \tilde{R}_F : \mathcal{K}_3 \min\{K, L_M\} \tilde{R}_M = v_T : v_{T+N(m-T)} - v_T : 1 - v_{T+N(m-T)}$.
Then, the effective data rate for the threshold-based caching under ZFBF scheme with multi-antenna BSs (denoted as $R^{\mathrm{MIMO}}_{\mathrm{ZFBF}}$) can be computed as
\[
R^{\mathrm{MIMO}}_{\mathrm{ZFBF}}(T) = \mathcal{K}_1 \min\{K, N L_F\} \tilde{R}_F + \mathcal{K}_2 \min\{K, L_F\} \tilde{R}_F + \mathcal{K}_3 \min\{K, L_M\} \tilde{R}_M
= \left( \frac{v_T}{\min\{K, N L_F\} \tilde{R}_F} + \frac{v_{T+N(m-T)} - v_T}{\min\{K, L_F\} \tilde{R}_F} + \frac{1 - v_{T+N(m-T)}}{\min\{K, L_M\} \tilde{R}_M} \right)^{-1}. \tag{2.18}
\]
To compute the optimal caching threshold, we have the following theorem:
Theorem 2.4. Under the Zipf distribution with parameter $s$, the optimal solution $T^\star$ of Problem (2.11) with the objective function defined by Eq. (2.18) can be computed as
\[
T^\star \approx \left[ \frac{Nm}{\Delta + N - 1} \right]_0^m, \qquad \Delta \triangleq \left( \frac{(N-1)\left( \frac{1}{\min\{K, L_F\} \tilde{R}_F} - \frac{1}{\min\{K, L_M\} \tilde{R}_M} \right)}{\frac{1}{\min\{K, N L_F\} \tilde{R}_F} - \frac{1}{\min\{K, L_F\} \tilde{R}_F}} \right)^{1/s}. \tag{2.19}
\]
Proof. The proof is provided in Appendix 2.9.5.
2.4.3 Joint MRT–ZFBF
It is known that both diversity and multiplexing gains can be achieved by a careful design of the ZFBF precoding [78]. Specifically, suppose there are $N$ transmitters and $K$ receivers ($K \le N$). Then, the $N$ transmitters can transmit $K$ independent (non-interfering or spatially isolated) streams to the $K$ receivers simultaneously, each with a diversity of order $N - K + 1$ [78]. We refer to this scheme as the MRT–ZFBF scheme. Assuming the threshold-based caching and following the analysis in Section 2.4.2, we define $\mathcal{K}_1$, $\mathcal{K}_2$, and $\mathcal{K}_3$ such that $\mathcal{K}_1 + \mathcal{K}_2 + \mathcal{K}_3 = 1$ and $\mathcal{K}_1 \min\{K,N\} \tilde{R}_F^{(N - \min\{K,N\} + 1)} : \mathcal{K}_2 \tilde{R}_F : \mathcal{K}_3 \tilde{R}_M = v_T : v_{T+N(m-T)} - v_T : 1 - v_{T+N(m-T)}$. Then, the effective data rate of the MRT–ZFBF scheme (denoted as $R_{\mathrm{MRT\text{--}ZFBF}}(T)$) can be computed as
\[
R_{\mathrm{MRT\text{--}ZFBF}}(T) = \mathcal{K}_1 \min\{K,N\} \tilde{R}_F^{(N - \min\{K,N\} + 1)} + \mathcal{K}_2 \tilde{R}_F + \mathcal{K}_3 \tilde{R}_M
= \left( \frac{v_T}{\min\{K,N\} \tilde{R}_F^{(N - \min\{K,N\} + 1)}} + \frac{v_{T+N(m-T)} - v_T}{\tilde{R}_F} + \frac{1 - v_{T+N(m-T)}}{\tilde{R}_M} \right)^{-1}. \tag{2.20}
\]
When $K < N$, each user gets a diversity of order $N - K + 1$, receiving a type 1 file at rate $\tilde{R}_F^{(N-K+1)}$. To compute the optimal caching threshold, we have the following theorem:
Theorem 2.5. Under the Zipf distribution with parameter $s$, the optimal solution $T^\star$ of Problem (2.11) with the objective function defined by Eq.
(2.20) can be computed as
\[
T^\star \approx \left[ \frac{Nm}{\Delta + N - 1} \right]_0^m, \qquad \Delta \triangleq \left( \frac{(N-1)\left( \frac{1}{\tilde{R}_F} - \frac{1}{\tilde{R}_M} \right)}{\frac{1}{\min\{K,N\} \tilde{R}_F^{(N-K+1)}} - \frac{1}{\tilde{R}_F}} \right)^{1/s}. \tag{2.21}
\]
Proof. The proof is provided in Appendix 2.9.6.
2.5 Performance analysis in terms of delay
In Section 2.4, we characterized the system performance of the cache-driven femto-BS cooperation scheme in terms of the achieved rate. In this section, file requests are assumed to arrive over time, and we use queueing theory to study the performance of the system in terms of the average system delay (including the queueing delay and the service time). We model the arrival of file requests as a Poisson process with rate $\lambda$, and the identities of the file requests are drawn from the file popularity distribution in an i.i.d. manner. In the following, we analyze the system delay under MRT and ZFBF, respectively.
2.5.1 Optimal caching under MRT
The data rate for serving a file request depends on the number of femto-BSs caching the requested file, ranging in the set $\{ \tilde{R}_M, \tilde{R}_F, \tilde{R}_F^{(2)}, \ldots, \tilde{R}_F^{(N)} \}$. Under the FIFO service discipline, the arrival and departure of the file requests can be modeled by an $M/G/1$ queue, where the arrival rate is $\lambda$ and the service time is a random variable, denoted by $X$,
\[
X = \frac{L}{U_j} \quad \text{with prob. } \sum_{i=1}^{M} p_i x_{i,j}, \quad j = 0,1,\ldots,N, \tag{2.22}
\]
where $L$ is the size of a file (bits), $\frac{L}{U_j} = \frac{L}{\tilde{R}_F^{(j)}}$, $j = 1,\ldots,N$, is the service time when the file request is served by a cluster of $j$ femto-BSs, and $\frac{L}{U_0} = \frac{L}{\tilde{R}_M}$ is the service time when the file request is served by the macro-BS.
Figure 2.7: Queueing model for threshold-based caching under ZFBF.
For a fixed arrival rate $\lambda$, we know from queueing theory [79] that the queue is stable if $\lambda < \frac{1}{\mathbb{E}[X]}$.
Furthermore, the average system delay of a file request can be obtained by using the Pollaczek–Khinchine formula [79], which is the sum of the average service time $\mathbb{E}[X] = \sum_{j=0}^{N} \frac{L}{U_j} \sum_{i=1}^{M} p_i x_{i,j}$ and the average queueing delay $\frac{\lambda \mathbb{E}[X^2]}{2(1 - \lambda \mathbb{E}[X])}$. To minimize the average system delay, we consider the following optimization problem:
\[
\begin{aligned}
\underset{x_{i,j}}{\text{minimize}} \quad & \sum_{i=1}^{M} \sum_{j=0}^{N} p_i \frac{L}{U_j} x_{i,j} + \frac{\lambda \sum_{i=1}^{M} \sum_{j=0}^{N} p_i \frac{L^2}{U_j^2} x_{i,j}}{2 \left( 1 - \lambda \sum_{i=1}^{M} \sum_{j=0}^{N} p_i \frac{L}{U_j} x_{i,j} \right)} \\
\text{subject to} \quad & \sum_{i=1}^{M} \sum_{j=0}^{N} j\, x_{i,j} \le mN, \qquad \sum_{j=0}^{N} x_{i,j} = 1, \quad i = 1,\ldots,M, \\
& \lambda \sum_{i=1}^{M} \sum_{j=0}^{N} p_i \frac{L}{U_j} x_{i,j} < 1, \qquad x_{i,j} \in \{0,1\}, \quad i = 1,\ldots,M, \; j = 0,\ldots,N.
\end{aligned} \tag{2.23}
\]
Note that the objective function is the sum of a linear function and a linear-fractional function; denote the optimal value as $W_{\mathrm{MRT}}$. The above problem can be transformed into a mixed-integer second-order cone program [80], which we solve using Gurobi; see the line $W_{\mathrm{MRT}}$ in Fig. 2.15.
Remark 2.2. If we remove the second part of the objective function and the third constraint of Problem (2.23), we end up minimizing the service time and thus also maximizing the service rate, along the lines of Remark 2.1.
2.5.2 Threshold-based caching under ZFBF
We organize the file requests according to their types into three queues. Type 1 requests, which arrive at rate $\lambda v_T$, are queued in queue 1 in order of their arrival. Similarly, type 2 requests, which arrive at rate $\lambda (v_{T+N(m-T)} - v_T)$, are queued in queue 2 in order of their arrival. Type 3 requests, which arrive at rate $\lambda (1 - v_{T+N(m-T)})$, are queued in queue 3 in order of their arrival. We assume a single server and consider the weighted round-robin service discipline with weights $\mathcal{K}_1$, $\mathcal{K}_2$, and $\mathcal{K}_3$ ($\mathcal{K}_1 + \mathcal{K}_2 + \mathcal{K}_3 = 1$), assigning resources proportionally to the load of each type, as discussed in Section 2.4.2. Specifically, we allocate a $\mathcal{K}_3$ portion of time to serve queue 3 with rate $\tilde{R}_M$ (the server is the macro-BS). We allocate a $\mathcal{K}_2$ portion of time to serve queue 2 with rate $\tilde{R}_F$ (the server is a femto-BS).
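For a fixed caching assignment $x_{i,j}$, the objective of Problem (2.23) is simply the Pollaczek–Khinchine delay of the induced $M/G/1$ queue. A small helper makes the formula concrete (an illustrative sketch with made-up rates and a toy assignment, not the mixed-integer SOCP actually solved with Gurobi):

```python
def mg1_delay(lam, p, x, U, L):
    """Average system delay E[X] + lam*E[X^2] / (2*(1 - lam*E[X]))
    for the M/G/1 queue of Eqs. (2.22)-(2.23).
    p[i]   : popularity of file i
    x[i][j]: 1 if file i is served with rate U[j] (j cached copies), else 0
    U[j]   : service rate with diversity order j (U[0] = macro-BS rate), bits/s
    L      : file size in bits
    Returns None if the stability condition lam * E[X] < 1 fails."""
    EX = sum(p[i] * (L / U[j]) * x[i][j]
             for i in range(len(p)) for j in range(len(U)))
    EX2 = sum(p[i] * (L / U[j]) ** 2 * x[i][j]
              for i in range(len(p)) for j in range(len(U)))
    rho = lam * EX
    if rho >= 1.0:
        return None          # unstable queue: third constraint of (2.23) violated
    return EX + lam * EX2 / (2 * (1 - rho))
```

When every file is served at a single deterministic rate, this reduces to the familiar $M/D/1$ delay, which is a convenient sanity check.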
We allocate a $\mathcal{K}_1$ portion of time to serve queue 1, where we collect a batch of up to $N$ type 1 requests and serve them jointly with rate $\tilde{R}_F$ (the server is the femto-BS cooperation cluster of size $N$ using ZFBF). Note that queue 1 is a so-called bulk queue with service in batches of maximum size $N$; see Fig. 2.7.
Let us first consider the stability of the three queues under the above specific fairness/scheduling criteria. Queue 3 is an $M/D/1$ queue with arrival rate $\lambda (1 - v_{T+N(m-T)})$ and service time $\frac{L}{\mathcal{K}_3 \tilde{R}_M}$, which is stable if $\lambda (1 - v_{T+N(m-T)}) < \frac{\mathcal{K}_3 \tilde{R}_M}{L} \Leftrightarrow \lambda < \frac{R_{\mathrm{ZFBF}}(T)}{L}$, where we use Eq. (2.9) and Eq. (2.10). Similarly, queue 2 is an $M/D/1$ queue with arrival rate $\lambda (v_{T+N(m-T)} - v_T)$ and service time $\frac{L}{\mathcal{K}_2 \tilde{R}_F}$, which is stable if $\lambda (v_{T+N(m-T)} - v_T) < \frac{\mathcal{K}_2 \tilde{R}_F}{L} \Leftrightarrow \lambda < \frac{R_{\mathrm{ZFBF}}(T)}{L}$. Last, queue 1 is an $M/D^{[1,N]}/1$ bulk queue (with service in batches of maximum size $N$), where the arrival rate is $\lambda v_T$ and the service time is $\frac{L}{\mathcal{K}_1 \tilde{R}_F}$, and it is stable if $\lambda v_T < \frac{N \mathcal{K}_1 \tilde{R}_F}{L} \Leftrightarrow \lambda < \frac{R_{\mathrm{ZFBF}}(T)}{L}$. Therefore, for a fixed arrival rate $\lambda$, all three queues are stable if $\lambda < \frac{R_{\mathrm{ZFBF}}(T)}{L}$.
The average system delay of a file request, denoted by $W_{\mathrm{ZFBF}}$, can be computed as
\[
W_{\mathrm{ZFBF}}(T) = v_T W_1 + (v_{T+N(m-T)} - v_T) W_2 + (1 - v_{T+N(m-T)}) W_3, \tag{2.24}
\]
where $W_i$ is the system delay for a type $i$ request. By the Pollaczek–Khinchine formula for an $M/D/1$ queue with arrival rate $\lambda (1 - v_{T+N(m-T)})$ and service time $\frac{L}{\mathcal{K}_3 \tilde{R}_M}$, $W_3$ can be obtained as
\[
W_3(T) = \frac{\left( 2 - \frac{\lambda L}{R_{\mathrm{ZFBF}}(T)} \right) L}{2 \left( 1 - \frac{\lambda L}{R_{\mathrm{ZFBF}}(T)} \right) R_{\mathrm{ZFBF}}(T) \left( 1 - v_{T+N(m-T)} \right)}. \tag{2.25}
\]
Similarly, for an $M/D/1$ queue with arrival rate $\lambda (v_{T+N(m-T)} - v_T)$ and service time $\frac{L}{\mathcal{K}_2 \tilde{R}_F}$, $W_2$ can be obtained as
\[
W_2(T) = \frac{\left( 2 - \frac{\lambda L}{R_{\mathrm{ZFBF}}(T)} \right) L}{2 \left( 1 - \frac{\lambda L}{R_{\mathrm{ZFBF}}(T)} \right) R_{\mathrm{ZFBF}}(T) \left( v_{T+N(m-T)} - v_T \right)}. \tag{2.26}
\]
The system delay for an $M/D^{[1,N]}/1$ bulk queue does not have a closed-form formula. We resort to numerical methods to obtain $W_1$ (see Section 2.6).
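Eqs. (2.25) and (2.26) share one form: every queue sees the common utilization $\rho = \lambda L / R_{\mathrm{ZFBF}}(T)$, and a class carrying a fraction of the traffic is served at that same fraction of the overall rate. A sketch (the function name and test values are ours):

```python
def type_delay(lam, frac, R_total, L):
    """M/D/1 Pollaczek-Khinchine system delay of Eqs. (2.25)-(2.26):
    a class carrying a `frac` share of the traffic, served during its
    proportional time share of a system with overall rate R_total."""
    rho = lam * L / R_total           # common utilization of every queue
    if rho >= 1.0:
        return None                   # stability requires lam < R_total / L
    return (2 - rho) * L / (2 * (1 - rho) * R_total * frac)
```

This agrees with plugging the per-class service time $S = L / (\mathrm{frac} \cdot R_{\mathrm{total}})$ and arrival rate $\lambda \cdot \mathrm{frac}$ into the standard $M/D/1$ form $S(2-\rho)/(2(1-\rho))$.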
To minimize the average system delay, we consider the following optimization problem:
\[
\begin{aligned}
\underset{T}{\text{minimize}} \quad & W_{\mathrm{ZFBF}}(T) \\
\text{subject to} \quad & \lambda < \frac{R_{\mathrm{ZFBF}}(T)}{L}, \qquad T \in \{0,1,\ldots,m\}.
\end{aligned} \tag{2.27}
\]
The optimal caching threshold can be obtained by enumeration of the solution space $T = 0,1,\ldots,m$.
Under medium/low loads, the asynchronous arrival of file requests poses a dilemma: wait until enough file requests are collected to concurrently serve many users, resulting in a larger multiplexing gain and thus service rate, or immediately serve fewer users, resulting in a smaller idle time but also a smaller multiplexing gain/service rate. Since we cannot service jobs preemptively, as wireless transmissions cannot be paused and restarted, the optimal policy is nontrivial. We propose the following policy to address the above dilemma, which strives to be work conserving while exploiting the multiplexing gain to the largest extent possible. Recall that the file requests are organized into three queues, one for each of the three types of requests. In addition, there is a (weighted round-robin) selector that decides on which queue the server will work (and the corresponding time portion). Suppose that the selector checks queue 1. Whenever the number of type 1 requests is larger than or equal to $N$, they are served in batches of size $N$. Otherwise, when the number of type 1 requests is smaller than $N$, the selector checks queues 2 and 3. If there are pending type 2 or type 3 requests, we serve those until $N$ type 1 requests have accumulated, at which point we serve those $N$ requests, and so on and so forth. If, however, both the queues for type 2 and type 3 requests are empty, we serve the fewer than $N$ type 1 requests in a batch, and so on and so forth. Note that under the above policy, type 1 file requests are served in a batch of size less than $N$ only if the queues for the type 2 and type 3 requests are empty. In Section 2.6 we use simulations to compare this policy against the weighted round-robin policy mentioned above.
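The policy just described can be stated as a small selector function. The sketch below is our paraphrase (in particular, checking queue 2 before queue 3 is an assumption; the thesis leaves that ordering to the weighted round-robin selector):

```python
def next_service(q1, q2, q3, N):
    """Work-conserving batching policy over the three queues.
    q1, q2, q3 are the current queue lengths; N is the cluster size.
    Returns (queue, batch_size), or None when all queues are empty."""
    if q1 >= N:
        return ("type1", N)      # enough type 1 requests: full multiplexing batch
    if q2 > 0:
        return ("type2", 1)      # serve others while type 1 requests accumulate
    if q3 > 0:
        return ("type3", 1)
    if q1 > 0:
        return ("type1", q1)     # nothing else pending: serve a partial batch
    return None                  # idle
```

The last branch encodes the policy's key property: a type 1 batch of size less than $N$ is served only when the type 2 and type 3 queues are both empty.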
Remark 2.3. Problem (2.27) minimizes the total system delay. Minimizing the service delay only, which is equivalent to maximizing the service rate, has already been treated in Problem (2.11) under an infinite backlog of file requests.
2.6 Numerical results
In this section we present simulation and performance results by numerically solving our analytical model in a number of practical scenarios. We assume that there is a library of $M = 1000$ video files, and the file popularity distribution follows the Zipf distribution [58, 81] with parameter $s$, i.e., $p_i = c_{M,s}/i^s$, $i = 1,\ldots,M$, where $c_{M,s}$ is the normalization constant.
Figure 2.8: Performance of the system with optimal caching under MRT. (a) $R_{\mathrm{MRT}}$ vs. cache size $m$ for $N = 0,\ldots,5$ femto-BSs ($N = 0$: macro-BS only), $s = 0.56$; (b) rate vs. Zipf parameter $s$ for $R_{\mathrm{MRT}}$, MaxDiversity, and MaxHitRatio, $N = 5$, $m = 100$; (c) as in (b) with co-located users; (d) optimal caching vs. randomized caching with and without cut-off, $N = 5$, $m = 100$.
Figure 2.9: Performance of the system with threshold-based caching under ZFBF, $N = 5$, for $m = 50, 100, 150$. (a) $R_{\mathrm{ZFBF}}(T)$ vs. caching threshold $T$ for $s = 0$ (uniform popularity distribution); (b) $s = 1$ (skewed); (c) $s = 2$ (very skewed); (d) $R_{\mathrm{ZFBF}}$ vs. $s \in [0,2]$.
Consider a macro-BS with transmission range 4000 m and a number of outdoor femto-BSs, each with transmission range up to 300 m, deployed inside the macro-BS cell. Note that these are typical transmission ranges for a macrocell and for low-power BSs; see, for example, the capabilities of metrocells, picocells, microcells, and femtocells defined in [82]. The number of femto-BSs inside a cluster ($N$), that is, the number of femto-BSs from which a typical user can receive useful signal, varies in the scenarios that we study. The data rate of the macro-BS is assumed to be $R_M = 2$ Mbps and the data rate of the femto-BSs is assumed to be $R_F = 10$ Mbps, again in line with industry practice. The transmit power of the macro-BS equals $P_M = 20$ W and that of the femto-BSs equals $P_F = 20$ mW, as has been assumed in prior works as well [54]. Consider a cluster of femto-BSs arranged in a circle covering an area of radius 200 m at distance 2000 m from the macro-BS. Unless otherwise noted, users are uniformly distributed within the circle. Thus, the distance of a user from a femto-BS of the cluster lies between 0 and 400 m, and the distance from the macro-BS lies between 1800 m and 2200 m. We assume that the path loss exponent equals $\alpha = 4$ and consider quasi-static Rayleigh flat fading channels with unit mean power. In addition, the bandwidth of the frequency slot is $W = 5$ MHz and the noise power spectral density varies from $N_0 = 4 \times 10^{-19}$ W/Hz to $N_0 = 8 \times 10^{-20}$ W/Hz. As a result, by substituting these numerical values into the formulas of Section 2.4, the transmission success (non-outage) probability for the macro- and femto-BSs with unit diversity varies from 0.6 to 0.9, and the effective data rate with unit diversity varies from $\tilde{R}_M = 2 \times 0.6 = 1.2$ to 1.8 Mbps for the macro-BS and from $\tilde{R}_F = 10 \times 0.6 = 6$ to 9 Mbps for the femto-BSs.
2.6.1 Data rates under diversity gains
We study the rate of a user under MRT achieved at an arbitrary time slot.
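The success probabilities quoted above can be reproduced directly from the Rayleigh non-outage expression of Section 2.4, $e^{-\theta}$ with $\theta = (2^{R/W}-1) N_0 W d^{\alpha} / P$ (a sanity-check sketch with the numbers listed in the previous paragraph; the helper itself is ours):

```python
import math

def success_prob(R, W, N0, d, P, alpha=4):
    """Non-outage probability exp(-theta) of a Rayleigh link with
    target rate R (bps), bandwidth W (Hz), noise PSD N0 (W/Hz),
    distance d (m), transmit power P (W), path loss exponent alpha."""
    theta = (2 ** (R / W) - 1) * N0 * W * d ** alpha / P
    return math.exp(-theta)
```

For the macro-BS link at $d = 2000$ m, this gives roughly 0.6 at $N_0 = 4 \times 10^{-19}$ W/Hz and roughly 0.9 at $N_0 = 8 \times 10^{-20}$ W/Hz, matching the range stated above.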
To highlight the effect of diversity gains on the data rate, we show results when the success (non-outage) probability with unit diversity equals 0.6. We later show results when the outage probability is smaller. In Fig. 2.8a we plot the rate as a function of the cache size ($m$) and the number of nearby femto-BSs ($N$) for a popularity distribution with a typical parameter, say, $s = 0.56$. As expected, the rate $R_{\mathrm{MRT}}$ increases with the cache size and the number of neighboring femto-BSs, the latter because as $N$ increases we have a larger diversity that reduces the transmission link failure (outage) probability. In this plot we also show the achieved rate when no femto-BSs are used. It is evident that using femto-BSs improves rates by 2-3x, and adding MRT results in an additional gain of 2-3x. Last, note that for a fixed cache size, as we increase the number of femto-BSs the marginal gain decreases, since the outage probability has already been reduced to a very small value.
In Fig. 2.8b we plot the rate as a function of the Zipf distribution parameter $s$. We vary $s$ from 0 (uniform distribution) to 1.5 (skewed distribution) and compare our scheme with the following two basic schemes, a "MaxDiversity" caching scheme and a "MaxHitRatio" caching scheme, to investigate how well our system adapts to changing levels of popularity. In the MaxDiversity scheme, we cache the most popular files 1 to $m$ in every femto-BS so that we have a diversity of order $N$ for all these files. Intuitively, this scheme would perform well for a skewed distribution with a large $s$. On the other hand, in the MaxHitRatio scheme we cache in the femto-BSs the most popular files 1 to $mN$, each with a single copy.^6 This scheme would perform well for a near-flat (uniform) popularity distribution with a small $s$. As shown in Fig.
2.8b, our proposed cross-layer optimization scheme adapts to the popularity distribution and controls the diversity gain (equivalently, the number of copies of a file in the caches of the femto-BSs) for each individual file, attaining good performance over the whole range of $s$. In Fig. 2.8c, we consider the case where the users are co-located at the center of the femto-BS cluster, and similar results can be observed, which shows that the co-location assumption in the analysis does not produce structurally different results from those under realistic user placements.
In Fig. 2.8d, we compare the optimal caching under MRT scheme, the randomized caching under MRT scheme (where the hard cache size constraint is replaced by a probabilistic one, see Section 2.4.1), and the randomized caching under MRT with cut-off (where, for any instantiation of the randomized caching, the number of cached files at each femto-BS is limited to $m$, restoring the hard cache size constraint). We can see that the randomized caching with cut-off performs close to the optimal caching.
^6 This scheme is also called FemtoCaching [57].
Figure 2.10: Comparison of problem formulations under ZFBF.
Figure 2.11: Comparison of different caching strategies under ZFBF, $N = 5$, $m = 100$.
Figure 2.12: Performance of the system under ZFBF with multiple thresholds, $N = 4$.
Figure 2.13: Performance of the system under ZFBF with multi-antenna BSs, $N = 5$, $m = 100$.
2.6.2 Data rates under multiplexing gains
Fig. 2.9a-c plots the rate $R_{\mathrm{ZFBF}}(T)$ as a function of the caching threshold $T$, $0 \le T \le m$. We assume that the cluster of neighboring femto-BSs consists of $N = 5$ femto-BSs. We consider three values for the cache size, $m = 50$, 100 and 150, and three values for the Zipf distribution parameter, $s = 0$, 1 and 2. In practice, a cluster of $N = 5$ femto-BSs may serve tens to hundreds of users; in other words, we assume $K \ge N$. From Fig. 2.9a we observe that for a uniform popularity distribution ($s = 0$) the optimal threshold is $T^\star = 0$, i.e., we prefer to have only a single copy of each file in the caches to increase cache hits, and choose not to have a multiplexing gain. On the other hand, in Fig. 2.9c, for the very skewed popularity distribution ($s = 2$), the optimal threshold $T^\star$ is closer to $m$, i.e., we prefer to have multiple copies of the most popular files in the caches of the femto-BSs, achieving a large multiplexing gain. Note that for the very skewed distribution, we have $v_m \approx 1$ ($v_m$ is the probability that a user requests one of the $m$ most popular files) and $R_{\mathrm{ZFBF}}(m) \approx N \tilde{R}_F$, where the multiplexing gain is $N$. For a skewed distribution, as shown in Fig. 2.9b with $s = 1$, the optimal threshold lies somewhere in the middle. Thus, our caching scheme can adapt to the file popularity distribution by properly setting the value of the caching threshold. Fig. 2.9a-c also shows the increase in rate with respect to the cache size $m$. We observe that the optimal caching threshold $T^\star$ is proportional to $m$, as suggested by Eq. (2.12). Finally, in Fig. 2.9d, we plot the achieved rate (at the optimal threshold $T^\star$) for the entire range of $s$ and $m$. By comparing the baseline macro-BS-only black curve in Fig. 2.8a to the rates in Fig. 2.9, it is evident that the rate gain from using both femto-BSs and ZFBF over using the macro-BS only can be very substantial, reaching 10-20x in the case of very skewed distributions.
To see the performance loss due to introducing the threshold structure, in Fig. 2.10 we compare the achieved rates under the "unstructured" formulation (Eq. (2.6)), the structured formulation with a single threshold (Eq. (2.8)), and its limiting behavior (Eq. (2.11)) for small-scale scenarios for which it is possible to enumerate all cases and compute the optimal rates for the unstructured case (M = 100, N = 3, m = 10, and K = 15).

[Figure 2.14: Performance of the system under MaxHitRatio, ZFBF (R_ZFBF), and MRT–ZFBF (R_MRT–ZFBF), N = 5, m = 100. (a) K = 3; (b) K >= 5; (c) K = 3, low outage probability of 0.1.]
[Figure 2.15: The system delay under MRT (W_MRT, MaxDiversity, MaxHitRatio), N = 5, m = 100, λ = 0.5.]
[Figure 2.16: The system delay under ZFBF (MaxHitRatio vs. W_ZFBF for s = 0, 1, 2), N = 5, m = 100.]
[Figure 2.17: The system delay under ZFBF (MaxHitRatio vs. W_ZFBF for m = 50, 100, 150), N = 5, s = 1.]
[Figure 2.18: The proposed policy concerning the asynchronous arrival of file requests (W_ZFBF vs. proposed policy for s = 0, 1, 2), N = 5, m = 100.]
It is observed that the performance gap between the first two formulations is within 2% and the performance gap between the first and the third is within 6% over the whole range of file popularity distributions s. Thus, the use of thresholding appears to induce minimal performance loss.

In Fig. 2.11, we compare our single-threshold-based caching scheme under ZFBF with the following two caching schemes, "MaxMultiplexing" and "MaxHitRatio", to investigate how well our system adapts to changing levels of popularity. In the MaxMultiplexing scheme, we cache the most popular files 1 to m in every femto-BS so that we have N copies of each of these files, allowing us to jointly service users requesting these files by using ZFBF with the maximum multiplexing gain N. The MaxMultiplexing scheme performs well for a skewed distribution with a large s. The reason is that we benefit from the multiplexing gain only when the most popular m files are requested, while the rest of the files can only be served by the low-rate macro-BS (cache misses). In the MaxHitRatio scheme, like before, we cache in the femto-BSs the most popular files 1 to mN, each with a single copy, and there is no cooperative transmission among the femto-BSs. By maximizing the cache hit ratio, the scheme performs well for a near-flat popularity distribution with a small s. Our proposed threshold-based caching scheme attains good performance over the whole range of s by choosing the optimal caching threshold T* to balance between the multiplexing gain and the cache hit ratio. Specifically, the performance gain against MaxMultiplexing is about 1.5x (when s is small), and the gain against MaxHitRatio is about 2.5x (when s is large).

Fig. 2.12 plots the rate when two thresholds are used (R^multi_ZFBF(T_0, T_1)) over the rate when one threshold is used (R_ZFBF(T)) in a setup with 4 femto-BSs (N = 2^n = 4).
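The cache-hit side of this tradeoff is easy to quantify. The sketch below computes, for an assumed Zipf catalog, the hit-ratio advantage of MaxHitRatio (one copy each of files 1..mN) over MaxMultiplexing (N copies of files 1..m); the catalog size and parameters are illustrative:

```python
def zipf_probs(M, s):
    """Zipf popularity over a catalog of M files (file 1 the most popular)."""
    w = [1.0 / (i ** s) for i in range(1, M + 1)]
    total = sum(w)
    return [x / total for x in w]

def hit_ratio(p, k):
    # probability that a request falls on one of the k most popular (cached) files
    return sum(p[:k])

M, N, m = 1000, 5, 100          # assumed catalog size and cluster parameters
p_flat = zipf_probs(M, s=0.0)
p_skew = zipf_probs(M, s=2.0)
# MaxHitRatio caches m*N distinct files; MaxMultiplexing caches only m files (N times)
gap_flat = hit_ratio(p_flat, m * N) - hit_ratio(p_flat, m)
gap_skew = hit_ratio(p_skew, m * N) - hit_ratio(p_skew, m)
```

For the flat distribution the extra distinct files matter (a hit-ratio gap of 0.4 in this setup), while for s = 2 the gap nearly vanishes, which is why trading cache diversity for multiplexing gain pays off only at large s.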
We can see that R^multi_ZFBF is at most 2.5% larger than R_ZFBF(T*) over the whole range of file popularity distributions (s) and cache sizes (m). Thus, the use of a single threshold appears to be enough to get most of the rate gains.

Last, Fig. 2.13 plots the rate R^MIMO_ZFBF as a function of the Zipf distribution parameter s under different combinations of the number of antennas at the femto-BSs and the macro-BS. We consider a cluster of N = 5 femto-BSs with L_F = 1 or 2 antennas and a macro-BS with L_M = 1 or 2 antennas. We can see that the rate is significantly increased by having an additional antenna at the femto-BSs to provide spatial multiplexing, especially when the file popularity is skewed (s > 1). On the other hand, when s is small (near-flat popularity), it is slightly better to have an additional antenna at the macro-BS (single-cell multi-user MIMO) than at the femto-BSs.

2.6.3 Data rates under joint MRT–ZFBF

Fig. 2.14 compares the rates for MaxHitRatio, ZFBF (R_ZFBF), and MRT–ZFBF (R_MRT–ZFBF), where in the latter two cases the rates are evaluated at their respective optimal caching thresholds. As shown in Fig. 2.14a, the MRT–ZFBF scheme outperforms the ZFBF scheme when the number of users K is less than the number of femto-BSs N. The reason is that if K < N, MRT–ZFBF achieves a diversity of order N − K + 1 >= 2 for all K users, increasing the transmission success (non-outage) probability from 0.6 to more than 0.9 (equality for a diversity of 2), which, in turn, increases the effective data rate. Of course, when K >= N, ZFBF performs the same as MRT–ZFBF, as shown in Fig. 2.14b. The results above are generated when the success probability with unit diversity equals 0.6. For a larger success probability, e.g., 0.9, the ZFBF scheme is expected to be very close to MRT–ZFBF. Fig. 2.14c confirms the above. Last, these plots offer additional data points about the performance benefits of our proposed caching strategies over MaxHitRatio, especially for a skewed file popularity.
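The jump from 0.6 to above 0.9 follows directly from the ccdf of S_(j) used in Section 2.9.2. A small check, calibrating the threshold γ so that unit diversity gives success probability 0.6:

```python
from math import exp, log, factorial

def success_prob(gamma, j):
    """Non-outage probability with diversity order j, using the ccdf of the
    combined channel gain S_(j):  sum_{n=0}^{j-1} e^{-gamma} gamma^n / n!."""
    return sum(exp(-gamma) * gamma ** n / factorial(n) for n in range(j))

gamma = -log(0.6)            # calibrated so unit diversity gives 0.6
p2 = success_prob(gamma, 2)  # diversity order 2: rises above 0.9
```

With this calibration, success_prob(gamma, 2) evaluates to about 0.906, matching the "from 0.6 to more than 0.9" observation for a diversity of 2.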
2.6.4 Delay analysis

In Fig. 2.15, we compare the performance of the optimal caching under MRT scheme (W_MRT) with the MaxDiversity scheme and the MaxHitRatio scheme in terms of the average system delay (including service time and queueing delay) under a Poisson arrival of file requests with rate λ = 0.5 requests per second. The file size is set as L = 1 Mb. Similar to the results for data rates, we can see that our proposed scheme outperforms the others over the whole range of s.

In Fig. 2.16, we compare the performance of the threshold-based caching under ZFBF scheme (W_ZFBF) with the MaxHitRatio scheme in terms of the average system delay under a Poisson arrival of file requests with rate λ = 0.1 to 30 requests per second. As the arrival rate increases, the system delay diverges, starting from smaller values of s to larger values of s. Also, the system can be stabilized for a larger range of arrival rates with the ZFBF scheme when compared to the stability region of the MaxHitRatio scheme (especially when s is large). The reason is that for a skewed file popularity, in the ZFBF scheme we cache multiple copies of the most popular files to achieve a larger multiplexing gain, and, in turn, we achieve a larger service rate. In Fig. 2.17, similar results can be observed as we fix the value of s = 1 and vary the cache size m = 50, 100, and 150.

Last, Fig. 2.18 shows the performance of the proposed policy in Section 2.5.2, which attempts to maximize the multiplexing gains by deferring serving type 1 requests up until there are many such requests in the queue that can be served concurrently. We can see that the proposed policy results in a better performance than the original weighted round robin policy for a moderately skewed popularity (s = 1).
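The divergence of the delay as the arrival rate approaches the service capacity, and the wider stability region of the higher-rate scheme, can be illustrated with a simple M/M/1 approximation (an assumption for illustration; the thesis' queueing model may differ, and the service rates below are hypothetical):

```python
def mm1_delay(lam, mu):
    """Mean system delay (service + queueing) of an M/M/1 queue with
    arrival rate lam and service rate mu; stable only for lam < mu."""
    if lam >= mu:
        return float('inf')
    return 1.0 / (mu - lam)

L = 1.0                  # file size in Mb
mu_zfbf = 16.0 / L       # hypothetical service rate with multiplexing (requests/s)
mu_hit = 8.0 / L         # hypothetical MaxHitRatio service rate (requests/s)
delays = [(lam, mm1_delay(lam, mu_hit), mm1_delay(lam, mu_zfbf))
          for lam in (5, 10, 15, 20)]
```

The lower-rate scheme saturates first (infinite delay for lam >= mu_hit), mirroring the smaller stability region of MaxHitRatio in Figs. 2.16-2.17.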
The difference is barely visible when the popularity is very skewed (s = 2), because almost all requests are for type 1 files and queue 1 is serviced almost exclusively, and when the popularity is flat (s = 0), because the optimal is to cache files only once and thus there is no multiplexing gain anyway. Last, the difference is quite small also when saturation occurs, since the queues are quite long and the weighted round robin policy also achieves high multiplexing gains.

2.7 Practical considerations

Updating the parameters of the caching strategy: The parameters of the caching strategy depend on the file popularity distribution p_i, e.g., x_{i,j} and q_i for MRT and the threshold T for ZFBF. The file popularity distribution can be estimated from file requests, see, for example, [83]. For the optimal caching under MRT, we need to solve the multiple-choice knapsack problem to obtain the caching decision variables x_{i,j}. For the randomized caching under MRT, the complexity of computing the caching strategy parameter q_i is low since q_i can be obtained efficiently by solving a convex optimization problem, see Theorem 2.1. The complexity of computing T in threshold-based caching is also tractable since the size of the solution space is linear in the cache size for the single-threshold case. In addition, when the discrete file popularity distribution can be approximated by a continuous one, the optimal threshold can be obtained analytically. For the multi-threshold case the solution space is exponential in the number of thresholds, but (i) even for large cluster sizes, e.g., 8 femto-BSs, the number of thresholds is as small as 3, and (ii) as we have shown in Fig. 2.12, the use of a single threshold is enough to get most of the rate gains, thus we do not anticipate the use of multiple thresholds in the majority of cases.
The right place to perform the above computations is the macro-BS, as it can keep track of all file requests and thus of the file popularity distribution, and it has more than enough computation power to compute the caching strategy parameters. Upon computing the parameters, the macro-BS will send their values to the femto-BSs. Last, since the time scale of significant changes in the file popularity distribution is on the order of a day or longer, updating the caching strategy parameters will happen infrequently.

Cache content update: The macro-BS and femto-BSs coordinate the cache content update (downloading popular video files via backhaul into caches in femto-BSs) according to updates in the caching strategy parameters. This can be done at off-peak hours because the time scale of significant changes in the file popularity distribution (e.g., days) is much larger than the time scale of receiving users' requests (e.g., seconds) [57]. Note also that only significant changes in the file popularity will result in a cache content update, while small changes in the file popularity will only result in reordering (relabeling) files in the caches.

2.8 Conclusion

In this work we proposed a new system architecture that jointly uses and optimizes distributed caching in femto-BSs and femto-BS cooperative transmissions. Our analytical and simulation results show that our system achieves an order of magnitude faster content delivery than legacy systems. The gains are particularly pronounced for skewed popularity distributions, where caching multiple copies of popular files across multiple femto-BSs yields particularly large diversity and multiplexing gains without sizably increasing cache misses. Potential future work includes extending our analysis to a two-tier system where the higher tier consists of caches at the cloud and the lower tier of caches at the edge (femto caches).
We also plan to extend our work to the case when no prior knowledge of the content popularity is available, or the popularity changes over time, leading to caching policies which dynamically adapt as they learn the current content popularity statistics.

2.9 Appendix

2.9.1 Co-channel interference

We approximate the spatial distribution of the interfering macro-BSs (outside the circle centered at the typical user with radius d_M, denoted as B(0, d_M)) as a Poisson point process \Phi with some density \lambda_M [70]. The co-channel interference from the interfering macro-BSs can be written as I_M \triangleq \sum_{x \in \Phi \setminus B(0, d_M)} P_M S_{x,(1)} D_x^{-\alpha}, where S_{x,(1)} denotes the channel gain for the small-scale fading of the interference from the x-th interfering macro-BS to the typical user, which is exponentially distributed with unit mean (Rayleigh fading), and D_x is the distance between the x-th interfering macro-BS and the typical user.

Similarly, we approximate the spatial distribution of the interfering femto-BSs (outside the circle centered at the typical user with radius d_F, denoted as B(0, d_F)) as a Poisson point process \Psi with some density \lambda_F. The co-channel interference from the interfering femto-BSs can be written as I_F \triangleq \sum_{y \in \Psi \setminus B(0, d_F)} P_F S_{y,(1)} D_y^{-\alpha}, where S_{y,(1)} and D_y are similarly defined.

Following the derivations in [84], the success probability of the transmission between the typical user and its serving macro-BS can be computed as

\Pr\left( W \log\left(1 + \frac{P_M S_{(1)} d_M^{-\alpha}}{I_M + I_F + N_0 W}\right) > R_M \right)
= \exp\left(-\theta_M N_0 W d_M^{\alpha} P_M^{-1}\right)
\cdot \exp\left\{ -\lambda_M \pi d_M^2 \left( \theta_M^{\delta} E_S\left[S^{\delta} \gamma(1-\delta, \theta_M S)\right] - E_S\left[1 - \exp(-\theta_M S)\right] \right) \right\}
\cdot \exp\left\{ -\lambda_F \pi d_F^2 \left( (\xi\theta_M)^{\delta} E_S\left[S^{\delta} \gamma(1-\delta, \xi\theta_M S)\right] - E_S\left[1 - \exp(-\xi\theta_M S)\right] \right) \right\},  (2.28)

where \theta_M \triangleq 2^{R_M/W} - 1, \delta \triangleq 2/\alpha, \xi \triangleq (d_M/d_F)^{\alpha} P_F/P_M, S is an exponential random variable with unit mean, and \gamma(a, z) \triangleq \int_0^z \exp(-t) t^{a-1} \, dt is the lower incomplete gamma function.

Similarly, the success probability of the transmission between the typical user and a cluster of j femto-BSs can be computed as

\Pr\left( W \log\left(1 + \frac{P_F S_{(j)} d_F^{-\alpha}}{I_M + I_F + N_0 W}\right) > R_F \right) = \sum_{k=0}^{j-1} \frac{(-1)^k}{k!} \left. \frac{d^k}{dt^k} V(t) \right|_{t=1},  (2.29)

where V(t) is defined as

V(t) \triangleq \exp\left(-\theta_F N_0 W d_F^{\alpha} t P_F^{-1}\right)
\cdot \exp\left\{ -\lambda_F \pi d_F^2 \left( (\theta_F t)^{\delta} E_S\left[S^{\delta} \gamma(1-\delta, \theta_F t S)\right] - E_S\left[1 - \exp(-\theta_F t S)\right] \right) \right\}
\cdot \exp\left\{ -\lambda_M \pi d_M^2 \left( (\theta_F t \xi^{-1})^{\delta} E_S\left[S^{\delta} \gamma(1-\delta, \theta_F t \xi^{-1} S)\right] - E_S\left[1 - \exp(-\theta_F t \xi^{-1} S)\right] \right) \right\}  (2.30)

and \theta_F \triangleq 2^{R_F/W} - 1. With these expressions, we can obtain the effective data rates in the presence of co-channel interference.

2.9.2 Proof of Proposition 2.1

First, let us define the function

G(q) \triangleq \sum_{j=0}^{N} \binom{N}{j} q^j (1-q)^{N-j} w_j, \quad 0 \le q \le 1,  (2.31)

where w_j \ge 0, j = 0, \ldots, N. We have the following lemma.

Lemma 2.1. If w_j \ge w_{j-1}, j = 1, \ldots, N, then G(q) is a non-decreasing function. Furthermore, if w_{j+1} - w_j \le w_j - w_{j-1}, j = 1, \ldots, N-1, then G(q) is concave.

Proof. Using basic analysis, the first derivative of G(q) can be computed as

G'(q) = \sum_{j=1}^{N} \binom{N}{j} j q^{j-1} (1-q)^{N-j} w_j - \sum_{j=0}^{N-1} \binom{N}{j} (N-j) q^j (1-q)^{N-j-1} w_j
= \sum_{j=1}^{N} \binom{N}{j} j q^{j-1} (1-q)^{N-j} w_j - \sum_{j=1}^{N} \binom{N}{j-1} (N-j+1) q^{j-1} (1-q)^{N-j} w_{j-1}
= \sum_{j=1}^{N} \frac{N!}{(j-1)!(N-j)!} q^{j-1} (1-q)^{N-j} (w_j - w_{j-1}).  (2.32)

When w_j \ge w_{j-1}, j = 1, \ldots, N, we have G'(q) \ge 0. So, G(q) is non-decreasing. Similarly, the second derivative of G(q) is computed as

G''(q) = \sum_{j=2}^{N} \binom{N}{j} j(j-1) q^{j-2} (1-q)^{N-j} w_j - \sum_{j=1}^{N-1} \binom{N}{j} j(N-j) q^{j-1} (1-q)^{N-j-1} w_j - \sum_{j=1}^{N-1} \binom{N}{j} j(N-j) q^{j-1} (1-q)^{N-j-1} w_j + \sum_{j=0}^{N-2} \binom{N}{j} (N-j)(N-j-1) q^j (1-q)^{N-j-2} w_j
= \sum_{j=1}^{N-1} \binom{N}{j+1} (j+1) j q^{j-1} (1-q)^{N-j-1} w_{j+1} - 2 \sum_{j=1}^{N-1} \binom{N}{j} j(N-j) q^{j-1} (1-q)^{N-j-1} w_j + \sum_{j=1}^{N-1} \binom{N}{j-1} (N-j+1)(N-j) q^{j-1} (1-q)^{N-j-1} w_{j-1}
= \sum_{j=1}^{N-1} \frac{N!}{(j-1)!(N-j-1)!} q^{j-1} (1-q)^{N-j-1} \left[ (w_{j+1} - w_j) - (w_j - w_{j-1}) \right].  (2.33)

When w_{j+1} - w_j \le w_j - w_{j-1}, j = 1, \ldots, N-1, we have G''(q) \le 0. So, G(q) is concave. ∎

Let w_0 = \tilde R_M and w_j = \tilde R^{(j)}_F, j = 1, \ldots, N. From Eq. (2.4) and Eq. (2.31), we can see that U(q_i) = \sum_{j=0}^{N} \binom{N}{j} q_i^j (1-q_i)^{N-j} w_j = G(q_i). In addition, since the ccdf of S_{(j)} is \bar F_{S_{(j)}}(z) = \sum_{n=0}^{j-1} \frac{1}{n!} e^{-z} z^n, we have

w_{j+1} - w_j = R_F \sum_{n=0}^{j} \frac{1}{n!} e^{-\gamma_F} \gamma_F^n - R_F \sum_{n=0}^{j-1} \frac{1}{n!} e^{-\gamma_F} \gamma_F^n = R_F \frac{1}{j!} e^{-\gamma_F} \gamma_F^j, \quad j = 1, \ldots, N-1,  (2.34)

w_1 - w_0 = R_F e^{-\gamma_F} - R_M e^{-\gamma_M}.  (2.35)

We note that the conditions w_{j+1} \ge w_j, j \ge 0 are equivalent to R_F e^{-\gamma_F} \ge R_M e^{-\gamma_M}. Also, the conditions w_{j+1} - w_j \le w_j - w_{j-1}, j \ge 2 (the marginal rate gain of including one more femto-BS into the cluster to perform cooperative transmission is decreasing) are equivalent to \gamma_F \le 2. Moreover, the condition w_2 - w_1 \le w_1 - w_0 (the aforementioned marginal gain is smaller than the difference between the rates of a femto-BS and a macro-BS) is equivalent to R_M e^{-\gamma_M} \le R_F e^{-\gamma_F} (1 - \gamma_F). As a result, by Lemma 2.1 and combining the above conditions, we conclude that if R_M e^{-\gamma_M} \le R_F e^{-\gamma_F} (1 - \gamma_F) holds, U(q_i) is concave and thus \sum_{i=1}^{M} p_i U(q_i) is concave.
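Lemma 2.1 is easy to sanity-check numerically. The sketch below evaluates G(q) on a grid for an assumed weight sequence with non-decreasing values and diminishing increments, and verifies monotonicity and midpoint concavity:

```python
from math import comb

def G(q, w):
    """Bernstein-weighted expectation G(q) = sum_j C(N,j) q^j (1-q)^(N-j) w_j."""
    N = len(w) - 1
    return sum(comb(N, j) * q ** j * (1 - q) ** (N - j) * w[j]
               for j in range(N + 1))

# assumed weights: non-decreasing with non-increasing increments (N = 5)
w = [1.0, 2.0, 2.8, 3.4, 3.8, 4.0]
qs = [k / 100 for k in range(101)]
vals = [G(q, w) for q in qs]
nondecreasing = all(vals[k] <= vals[k + 1] + 1e-12 for k in range(100))
concave = all(vals[k + 1] >= (vals[k] + vals[k + 2]) / 2 - 1e-12
              for k in range(99))
```

As the lemma predicts, both checks pass; note also that G(0) = w_0 and G(1) = w_N, the two endpoint rates.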
Also, the conditions w j+1 w j w j w j1 ; j 2 (the marginal rate gain of including one more femto-BS into the cluster to perform cooperative transmission is decreasing) are equivalent to F 2. Moreover, the condition w 2 w 1 w 1 w 0 (the aforementioned marginal gain is smaller than the dierence between the rates of a femto-BS and a macro-BS) is equivalent to R M e M R F e F (1 F ). As a result, by Lemma 2.1 and combining the above conditions, we conclude that if R M e M R F e F (1 F ) holds, U(q i ) is concave and thus P M i=1 p i U(q i ) is concave. 62 2.9.3 Proof of Theorem 2.2 Under the Zipf distribution with parameter s, i.e., p i =c M;s =i s ; i = 1;:::;M. v T can be approximated as v T = T X i=1 p i =c M;s T X i=1 1 i s c M;s Z T 1 1 x s dx: (2.36) Similarly, v T+N(mT) can be approximated as v T+N(mT) c M;s Z T+N(mT) 1 1 x s dx: (2.37) By substituting Eq. (2.36) and Eq. (2.37) into Eq. (2.10), dierentiatingR ZFBF (T ) with respect to T , and setting the result to zero, we have 1 T s N ~ R F + 1N (T+NmNT) s 1 T s ~ R F + N1 (T+NmNT) s ~ R M = 0: (2.38) After rearranging the terms, we have T +NmNT T = (N 1)( 1 ~ R F 1 ~ R M ) 1 N ~ R F 1 ~ R F ! 1 s : (2.39) Therefore, the optimal threshold can be obtained as T Nm +N 1 m 0 ; (2.40) where , (N 1)( 1 ~ R F 1 ~ R M ) 1 N ~ R F 1 ~ R F ! 1 s = N( ~ R F ~ R M ) ~ R M ! 1 s (2.41) and the notation [x] m 0 , min(max(0;x);m). 2.9.4 Proof of Theorem 2.3 Similar to the proof of Theorem 2.2, by substituting Eq. (2.36) and Eq. (2.37) into Eq. (2.13), dierentiating R ZFBF (T ) with respect to T , and setting the result to zero, we have 1 T s minfK;Ng ~ R F + 1N (T+NmNT) s 1 T s ~ R F + N1 (T+NmNT) s ~ R M = 0: (2.42) After rearranging the terms, we obtain the optimal threshold. 63 2.9.5 Proof of Theorem 2.4 Similar to the proof of Theorem 2.2, by substituting Eq. (2.36) and Eq. (2.37) into Eq. 
(2.18), differentiating R^MIMO_ZFBF(T) with respect to T, and setting the result to zero, we have

\frac{1}{T^s \min\{K, NL_F\} \tilde R_F} + \left( \frac{1-N}{(T+Nm-NT)^s} - \frac{1}{T^s} \right) \frac{1}{\min\{K, L_F\} \tilde R_F} + \frac{N-1}{(T+Nm-NT)^s} \cdot \frac{1}{\min\{K, L_M\} \tilde R_M} = 0.  (2.43)

After rearranging the terms, we obtain the optimal threshold.

2.9.6 Proof of Theorem 2.5

Similar to the proof of Theorem 2.2, by substituting Eq. (2.36) and Eq. (2.37) into Eq. (2.20), differentiating R_MRT–ZFBF(T) with respect to T, and setting the result to zero, we have

\frac{1}{T^s \min\{K, N\} \tilde R^{(N-K+1)}_F} + \left( \frac{1-N}{(T+Nm-NT)^s} - \frac{1}{T^s} \right) \frac{1}{\tilde R_F} + \frac{N-1}{(T+Nm-NT)^s} \cdot \frac{1}{\tilde R_M} = 0.  (2.44)

After rearranging the terms, we obtain the optimal threshold.

Chapter 3

Data-locality-aware User Grouping in Cloud Radio Access Networks

Cellular base band units of the future are expected to reside in a cloud data center which provides computation resources, content storage and caching, and a natural place to perform multi-user precoding, thus addressing both cost and performance concerns of cellular systems. Multi-user precoding relies on efficient user grouping schemes to maximize multiplexing gains. However, traditional user grouping schemes are unaware of data center constraints, and may induce a large number of data transfers across racks when fetching requested data to a certain rack for precoding. When congestion occurs in the data center network, the delay of data transfers across racks may exceed the channel coherence time. This would kill multi-user MIMO transmissions as channel state information becomes outdated. In this work we design novel data-locality-aware user grouping schemes which preferentially group users whose requested data are located under the same rack. We also design user grouping algorithms which adapt to the congestion level in the cloud data center. Specifically, a regularized spectral efficiency maximization problem is proposed where the number of data transfers across racks is introduced as a regularization term.
By adjusting the weight of the regularization term according to the congestion level, we gradually suppress data transfers across racks in forming user groups when congestion occurs. We reduce the above problem to a soft-capacitated facility location problem and we devise a 2-approximation user grouping algorithm. Last, we conduct simulations which show that the 2-approximation algorithm performs close to the optimal in practical scenarios, and study the tradeoff between higher spectral efficiency and lower data transfer cost.

[Figure 3.1: (a) C-RAN architecture. (b) BBU pool: a cloud data center for efficient computation and content caching.]

3.1 Introduction

In the cloud radio access network (C-RAN) architecture, base band units (BBUs) are centralized as a cloud data center (called BBU pool) separated from the remote radio heads (RRHs) deployed at base stations (BSs) to reduce energy consumption and better utilize computation resources [85, 86], see Fig. 3.1a. The centralized BBUs can not only provide computation resources to serve a large number of BSs but also enable coordination among BSs to increase spectral efficiency by using techniques like Coordinated MultiPoint (CoMP) [87]. In addition to providing computation resources, content like video files can be stored or cached at the BBU pool [88, 89]. A cloud data center consists of a large number of racks. See, for example, the Telco cloud proposed by Nokia [90, 91]. In each rack, there is a top-of-rack switch that connects to a large number of servers (for both computing and data storage). Moreover, a core switch connects to the top-of-rack switches to enable data transfers across different racks, see Fig. 3.1b. To perform computation over data, such as precoding, the data needs to be first fetched to a local drive. It is typically more expensive to fetch data across different racks than under the same rack because data transfers across racks are more susceptible to congestion and more prone to delay [92, 93].
In the BBU pool, a major computation task is to perform precoding over different data streams requested by different users to enable spatial multiplexing via multi-user MIMO [94] or massive MIMO [9, 10, 95]. Since most of today's BSs are equipped with at most 16 antennas, e.g., in the current LTE standard [51] codebooks for precoding are defined for up to 16 antenna ports, we focus the discussion on the multi-user MIMO case, enabled via Zero-Forcing BeamForming (ZFBF) precoding, though our analysis can be readily extended to the massive MIMO setting as well, where other precoders are more popular, e.g., conjugate beamforming.

ZFBF requires selecting a set of users to be served concurrently, an operation commonly referred to as user grouping, and it also requires instantaneous channel state information (CSI) to be collected to perform the precoding. Traditional user grouping schemes are not aware of the data locality in data centers, that is, of where the data reside. Instead, their task is to minimize the overhead from collecting CSI and/or to maximize the multiplexing gain. For example, the so-called randomized and round-robin schemes select users randomly or in a round-robin fashion, aiming to minimize overhead since CSI is collected only for the selected users once the group is formed. In contrast, in CSI-based user grouping [96-98] it is required to collect CSI from a large number of users and then select a subset of users with semi-orthogonal channel vectors to form a group, aiming to maximize the multiplexing gain.

Multi-user ZFBF precoding with CSI-based user grouping has higher chances to achieve the maximum spectral efficiency and is the precoder of choice in today's cellular standards. That said, it yields a higher data rate only if the turn-around time between CSI feedback and the actual data transmission is smaller than the channel coherence time. Otherwise, the CSI becomes outdated, causing significant performance degradation.
And the turn-around time might be large because after collecting CSI and selecting the users, we then have to fetch the requested data of the users to a local disk in a certain rack to perform precoding, as these data may be located across different racks in the data center. Depending on the congestion level in the data center, the data fetching time ranges from hundreds of microseconds to tens of milliseconds [92, 100]. This can be larger than the channel coherence time, since, depending on the user speed and the carrier frequency, the channel coherence time ranges from milliseconds (high-mobility users) to tens or hundreds of milliseconds (low-mobility or static users) [51].

Therefore, when congestion occurs in the cloud data center network, a reduction in the number of data transfers across racks is required to guarantee that the overall turn-around time is smaller than the channel coherence time and the wireless transmissions are successful. However, all the above traditional user grouping schemes are independent of where the requested data is located, potentially inducing a large number of data transfers across racks. With this in mind, we propose user grouping schemes under the C-RAN architecture which take into consideration where the data is located in the data center. Thus, we preferentially group together users whose requested data are located under the same rack, and only transfer data across racks if the ensuing rate increase is sizable. This significantly reduces the number of data transfers across racks, while still providing high wireless spectral efficiency through spatial multiplexing. The wireless spectral efficiency becomes higher as we allow more data transfers across racks to form user groups with a better-conditioned channel matrix, as long as the CSI does not become outdated.
On the other hand, when congestion occurs in the cloud data center, the inter-rack data fetching time increases such that the overall turn-around time may become a large fraction of (or even larger than) the channel coherence time and the CSI may become outdated, reducing the spectral efficiency. (Note that the fronthaul delay is also a part of the turn-around time. However, we focus on scenarios where the fronthaul connections between the BBU pool and the RRHs are fast enough [90, 91] for the fronthaul to not be a bottleneck. For scenarios where the fronthaul is part of the system optimization, the interested reader is referred to [94, 99] and references therein.) To devise a user grouping algorithm that is able to adapt to the congestion level in the cloud data center network, we propose a regularized spectral efficiency maximization problem where the number of data transfers across racks is introduced as a regularization term. When there is no congestion in the cloud data center, one can simply set the weight of the regularization term to zero, allowing an arbitrary number of data transfers across racks to maximize the wireless spectral efficiency (in this case our problem reduces to the traditional CSI-based user grouping). On the other hand, when congestion is high, one may impose a large weight to regulate the inter-rack data transfers and settle for "local" user groups. By adjusting the weight of the regularization term according to the congestion level, data transfers across racks can be controlled to achieve the desired system operating point.

The regularized spectral efficiency maximization problem is NP-hard. Motivated by this, we first introduce a simplified form of the problem (referred to as the regularized resource block minimization problem), where we use the number of resource blocks needed to serve a fixed number of users as a proxy for the spectral efficiency.
Then, we reduce this problem to a soft-capacitated facility location problem [101, 102], and devise a 2-approximation algorithm to efficiently solve it.

The remainder of this work is organized as follows. We present related work in Section 3.2. Section 3.3 describes the system model and presents motivating examples. The problem formulation and performance analysis are given in Section 3.4. We extend our formulation to various practical scenarios of interest in Section 3.5. Section 3.6 presents numerical and simulation results. Last, Section 3.7 concludes the work.

3.2 Prior Work

An introduction to the C-RAN architecture and the advantages of BBU centralization can be found in [85, 86] and in references therein. Such BBU centralization facilitates base station coordination like CoMP [87]. The concept of the BBU pool acting as a cloud data center for both computation and content caching or storage is introduced in [88, 103, 104]. Multiple recent works have evaluated the performance of the C-RAN architecture using simulations, real-world data, software-defined radios, and even prototypes, see, for example, [90, 91, 105-108].

A major task of the BBU pool is to perform multi-user ZFBF precoding over users' requested data streams to enable multi-user MIMO [94]. A central piece of multi-user ZFBF is user grouping. The analysis of the performance of randomized user grouping for user scheduling in multi-user MIMO and massive MIMO wireless networks can be found in [10, 95]. CSI-based user grouping for multi-user beamforming was investigated in [96-98], where the users are grouped according to the instantaneous CSI to exploit the multi-user diversity. User grouping can also be based on the channel distribution information, the second-order channel statistics, or some combination of the above approaches.
For example, in [109], a two-stage precoding scheme has been presented, where the users are first grouped according to the second-order channel statistics, and then, for the users that have the same such statistics, randomized user grouping is used to schedule a subset of users for multi-user ZFBF precoding.

CSI-based beamforming has been considered in a C-RAN with a focus on fronthaul optimization [94, 99]. Specifically, in [94], the authors designed a sparse beamformer to minimize the fronthaul power consumption. In [99], the authors considered enhanced RRHs that are also embedded with caches and baseband processing units. By using superposition coding, the precoding can be decomposed, performed partly at the BBU pool and partly at the enhanced RRH, subject to the fronthaul capacity constraint. As mentioned earlier, in this work we operate under scenarios where the fronthaul connections are fast enough [90, 91] and the fronthaul delay does not become a bottleneck in the turn-around time. Instead, we focus on the case where the data fetching time for precoding in the cloud data center becomes a bottleneck when congestion occurs.

To the best of our knowledge, this is the first work that takes into consideration the data locality information and the congestion level in the cloud data center in forming user groups for multi-user MIMO ZFBF precoding in a C-RAN.

3.3 System model

Consider a BS in the context of the C-RAN architecture. (Section 3.5 discusses how to extend the discussion and results to multiple BSs.) Suppose that the BS has Λ antennas and there are M users under the coverage of the BS, each equipped with a single antenna. (See Section 3.5 for the case with multi-antenna users.) Denote the users by the set M = {1, 2, ..., M}. Suppose that each of the M users makes a single data request. We denote the data requested by user i as d_i, i = 1, ..., M, which is stored in the BBU pool cloud data center.
Suppose that there are N racks (denoted by the set N = {1, 2, ..., N}) in the data center and assume that the requested data of a user is equally likely to be stored in any one of the N racks. The event that the data requested by user i is in rack j is represented as d_i ∈ r_j (we use the shorthand notation r_j to refer to rack j). We denote the set of users whose requested data are in rack j as M_j ≜ {i : d_i ∈ r_j}. Note that M_j, j = 1, 2, ..., N form a partition of the user set, where M_i ∩ M_j = ∅, i ≠ j, and ∪_{j=1}^{N} M_j = M. Our formulation can be extended to the case with data replication across different racks (M_i ∩ M_j ≠ ∅, i ≠ j), see Section 3.5 for a detailed discussion.

3.3.1 User grouping

Since the base station has Λ antennas, it can provide Λ degrees of freedom for spatial multiplexing (Λ is also called the spatial multiplexing gain) and can serve at most Λ users in a resource block (time-frequency slot). For simplicity, we assume that the unit size of a user's requested data is one resource block. User grouping refers to partitioning the set of users M into T groups, i.e., S_1, S_2, ..., S_T, where the groups are disjoint (S_i ∩ S_j = ∅, i ≠ j, and ∪_{i=1}^{T} S_i = M) and the cardinality of each group is less than or equal to the degrees of freedom (|S_i| ≤ Λ). By using multi-user ZFBF precoding, the users in the same group will be serviced simultaneously in one resource block, so a total number of T resource blocks will be used. Assuming Λ divides M, we partition the set of users M into M/Λ groups, each of size Λ, to fully utilize the degrees of freedom for data transmission and minimize the number of resource blocks needed to serve all users. The partition criteria (i.e., how to partition the M users into M/Λ groups) for the randomized/CSI-based user grouping schemes depend on the randomization strategy or the instantaneous CSI of the users and will be discussed in Section 3.4.1.
The partition criteria for the data-locality-aware user grouping scheme depend on the data locality (encoded by the sets M_j, j = 1, 2, ..., N): given a fixed number of M/β resource blocks, we form M/β user groups while minimizing the number of induced data transfers across racks; see Section 3.4.2. Since the resulting wireless rates depend not only on utilizing the maximum multiplexing gain (and thus the minimum number of resource blocks) but also on the degree of orthogonality among the users' channel vectors and thus on the CSI [96, 97], in Section 3.4.3 we extend the above scheme to also take into consideration the CSI when forming those M/β user groups, and term the scheme a joint CSI- and data-locality-aware user grouping scheme. In Sections 3.4.4 and 3.4.5, to further reduce the number of induced data transfers, we relax the constraint of using only M/β resource blocks (i.e., we allow forming T > M/β user groups). Since transferring data across racks takes time, if the associated delay is larger than the channel coherence time due to high congestion, the CSI will be invalidated and the resulting spectral efficiency will be significantly reduced. Motivated by this, we devise two user grouping algorithms that can adapt to the congestion level in the cloud data center. First, we propose a regularized resource block minimization problem where we minimize a weighted sum of the number of resource blocks used and the number of induced data transfers across racks (see Section 3.4.4). Second, we propose a regularized spectral efficiency maximization problem where we consider the data rate directly (see Section 3.4.5). By adjusting the weight of the regularization term according to the congestion level in the cloud data center, we can suppress data transfers across racks in forming user groups when congestion occurs. To simplify the analysis we assume β divides M in all analytical derivations, and we relax this in Section 3.6 where we present simulation results.
3.3.2 Multi-user ZFBF precoding and the need to transfer data

Let h_i be the 1 × β channel coefficient vector between the BS and user i and let H ≜ [h_1 ... h_M]^T be the M × β channel matrix. Fix a typical user group S. Let H(S) be the |S| × β sub-matrix of H whose rows correspond to the channel vectors of the users in S. The ZFBF precoding matrix is defined as V ≜ H(S)^†, where H(S)^† is the pseudoinverse of H(S). Let d(t) denote the |S| × 1 data vector corresponding to the data streams requested by the users in S and x(t) denote the β × 1 symbol vector to be transmitted by the BS. We have x(t) = Vd(t) = H(S)^† d(t). Under ZFBF precoding, the |S| × 1 received signal y(t) at the users is

y(t) = H(S)x(t) = H(S)H(S)^† d(t),   (3.1)

where we ignore the background noise. Note that when |S| ≤ β and H(S) has full row rank, we have H(S)H(S)^† = I and thus y(t) = d(t). In this case, we can see that by using ZFBF each user gets its own requested data stream without interference from other streams. Unless otherwise stated, we will assume that user groups of cardinality up to β yield a well-conditioned channel matrix and ZFBF can be used to serve all the users in the same group concurrently. In a C-RAN, the ZFBF precoding is performed at the BBU pool, where we jointly precode all data streams requested by the users in the same group S to get a spatial multiplexing gain of order |S|. Before the actual computation for ZFBF precoding happens, the requested data streams for users in S need to be fetched to the same rack, which may induce a large number of data transfers across racks, depending on whether data locality is taken into consideration upon the formation of the user groups or not.

3.3.3 Simple motivating examples

Suppose that the base station has β = 2 antennas that can provide two degrees of freedom for spatial multiplexing, sending two data streams in a single resource block. Suppose that there are M = 4 single-antenna users.
Last, suppose that the data requested by users 1 and 3 are in the first rack and the data requested by users 2 and 4 are in the second rack, i.e., M_1 = {1, 3} and M_2 = {2, 4}; see Fig. 3.2. If we group users 1 and 2 together and users 3 and 4 together for ZFBF precoding, i.e., S_1 = {1, 2} and S_2 = {3, 4}, then we need to first transfer data d_2 from r_2 to r_1 and data d_3 from r_1 to r_2, resulting in a total of two data transfers across racks. This is because, for ZFBF precoding, the coded symbols to be transmitted by the BS antenna array are weighted linear combinations of all streams. On the other hand, if we group users 1 and 3 together and users 2 and 4 together, i.e., S_1 = {1, 3} and S_2 = {2, 4}, then obviously there is no need to transfer any data across racks for precoding since, for example, for users 1 and 3 we have d_1 ∈ r_1 and d_3 ∈ r_1.

Figure 3.2: M_1 = {1, 3} and M_2 = {2, 4}. Data-locality-aware user grouping: S_1 = {1, 3} and S_2 = {2, 4}.

Figure 3.3: Regularized spectral efficiency maximization: When the weight for the regularization term is small, we have S_1 = {1, 3} and S_2 = {2, 4}, resulting in usage of two resource blocks and one data transfer; when the weight is large (we aim to suppress data transfers), we have S_1 = {1, 3}, S_2 = {2}, and S_3 = {4}, resulting in usage of three resource blocks and no data transfer.

Suppose now that d_1 ∈ r_1, d_3 ∈ r_1, d_2 ∈ r_2, and d_4 ∈ r_N, as shown in Fig. 3.3. Consider the following two user grouping schemes. The first one is to have two groups with S_1 = {1, 3} and S_2 = {2, 4}, in which case two resource blocks and one data transfer (say, moving d_4 from r_N to r_2) are required. The second one is to have three groups with S_1 = {1, 3}, S_2 = {2}, and S_3 = {4}, in which case three resource blocks are required (the degrees of freedom are not fully utilized) and no data transfer is needed. It is easy to see that to fully utilize the available degrees of freedom provided by the system, more data transfers across racks may be required.
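The interference-free reception underlying these examples follows from Eq. (3.1): when H(S) has full row rank, H(S)H(S)^† = I. A minimal NumPy sketch checks this numerically (the sizes β = 4 and |S| = 3 and the Rayleigh-fading model are illustrative assumptions, not taken from the examples above):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, S = 4, 3                    # beta BS antennas, |S| single-antenna users
# Rayleigh-faded |S| x beta channel matrix H(S)
H_S = (rng.standard_normal((S, beta))
       + 1j * rng.standard_normal((S, beta))) / np.sqrt(2)
V = np.linalg.pinv(H_S)           # ZFBF precoder V = H(S)^dagger
d = rng.standard_normal(S) + 1j * rng.standard_normal(S)   # requested streams
x = V @ d                         # transmitted symbol vector, beta x 1
y = H_S @ x                       # received signals, Eq. (3.1), noise ignored
print(np.allclose(y, d))          # True: each user gets its own stream
```

Since a generic |S| × β matrix with |S| ≤ β has full row rank, the pseudoinverse acts as a right inverse and the received vector reproduces the data vector exactly.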
We shall see in later sections that the number of data transfers across racks in forming user groups can be gradually suppressed by adjusting the weight of the regularization term according to the congestion level in the cloud data center.

3.4 Problem formulation and analysis

We start by deriving expressions for the number of data transfers induced by traditional user grouping schemes (Section 3.4.1) and the proposed data-locality-aware user grouping scheme (Section 3.4.2) under the assumption that any user group of size β yields a well-conditioned channel matrix and thus M/β resource blocks suffice to serve all users. We then present the details of how to jointly use the CSI and data locality information for user grouping (Section 3.4.3). To design a user grouping algorithm that can adapt to the congestion level in the cloud data center, we also consider the use of more than M/β resource blocks. Thus, we formulate a regularized resource block minimization problem and introduce an efficient approximation algorithm (Section 3.4.4). Finally, we discuss how to exploit CSI to further increase the spectral efficiency and formulate a regularized spectral efficiency maximization problem (Section 3.4.5).

3.4.1 Randomized/CSI-based user grouping

In this subsection, we compute the number of induced data transfers for randomized and CSI-based user grouping, and derive closed-form lower and upper bound expressions in the regime where the number of racks is a lot larger than the degrees of freedom (antennas) of the BS, which is trivially satisfied in practice. In randomized user grouping, we form the first group by selecting β users uniformly at random from the M users. Then, we form the second group by selecting β users uniformly at random from the remaining M − β users, and we continue this process until we have M/β groups. In CSI-based user grouping, user groups are formed by lumping together users with semi-orthogonal channel vectors [96].
Since the performance metric under study is the number of induced data transfers, the analysis is the same for both randomized and CSI-based user grouping schemes. Under such traditional user grouping schemes, the sets S_i, i = 1, ..., M/β are formed independently of the sets M_j, j = 1, ..., N, since users are grouped independently of the location of their requested data. As a result, the data requested by the users in a group will be randomly and uniformly distributed among the N racks. This is similar to the "balls into bins" problem where we throw β balls into N bins. To perform ZFBF precoding over the data streams requested by the users in a group, we need to first select a rack as the designated rack to perform the precoding. Intuitively, we select the rack that stores the largest number of requested data files as the designated rack (ties are randomly resolved). Then, we transfer all remaining requested data (located outside the designated rack) to the designated rack for precoding. Thus, the number of induced data transfers equals β minus the number of requested data files in the designated rack. For example, if β = 6, rack 1 stores two, rack 2 stores one, and rack 3 stores three requested data files, then rack 3 will be selected as the designated rack and a total of 3 data transfers will take place towards rack 3 (two from rack 1 and one from rack 2). As a result, the average number of induced data transfers under the randomized/CSI-based user grouping scheme (denoted by D_rand/csi) is equal to

D_rand/csi = (M/β) (β − E[number of requested data files in the designated rack]).   (3.2)

Since the number of requested data files in the designated rack is the same as the load of the maximum loaded bin of the "β balls into N bins" problem, we proceed with the analysis by defining X_j to be the number of requested data files (balls) in rack (bin) j.
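The expectation in Eq. (3.2) is straightforward to estimate by direct simulation. A small Monte Carlo sketch (the function name and defaults are ours) that mirrors the "β balls into N bins" argument, group by group:

```python
import random

def avg_transfers_random_grouping(M, N, beta, trials=2000, seed=1):
    """Monte Carlo estimate of D_rand/csi in Eq. (3.2): under randomized
    (or CSI-based) grouping, each group's beta requested files land uniformly
    in the N racks, and the designated rack is the most loaded one."""
    random.seed(seed)
    total = 0
    for _ in range(trials):
        for _ in range(M // beta):            # each of the M/beta groups
            load = [0] * N
            for _ in range(beta):             # beta balls into N bins
                load[random.randrange(N)] += 1
            total += beta - max(load)         # transfers = beta - designated load
    return total / trials
```

For instance, `avg_transfers_random_grouping(40, 10, 4)` returns an estimate of the average transfer count for M = 40, N = 10, β = 4, which can be compared against the bounds (3.5)-(3.6).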
By using the concentration results for the balls into bins problem [110], when N/polylog(N) ≤ β ≤ N log N, we have

Pr{ max_{j=1,...,N} X_j > (log N / log((N log N)/β)) (1 + log log((N log N)/β) / log((N log N)/β)) } = o(1),   (3.3)

and

Pr{ max_{j=1,...,N} X_j > log N / log((N log N)/β) } = 1 − o(1).   (3.4)

As a result, the asymptotic upper and lower bounds on the average number of induced data transfers are

D_rand/csi = (M/β) Σ_{k=1}^{β} (β − k) Pr{ max_{j=1,...,N} X_j = k } ≲ (M/β) (β − log N / log((N log N)/β)),   (3.5)

and

D_rand/csi ≳ (M/β) [ β − (log N / log((N log N)/β)) (1 + log log((N log N)/β) / log((N log N)/β)) ].   (3.6)

We compare both bounds with simulation results in Section 3.6. From Fig. 3.7, we can see that both bounds are quite tight.

3.4.2 Data-locality-aware user grouping

Consider a data-locality-aware user grouping scheme whose goal is to minimize the number of induced data transfers by properly grouping users into M/β groups. To find the user groups that minimize data transfers, we introduce the following optimization problem. Let x_{i,j} be a binary variable, where x_{i,j} = 1 if the requested data d_i is assigned to rack j for processing, that is, it will be jointly precoded with another β − 1 requested data files assigned to rack j. Let c_{i,j} be the associated cost of transferring data file d_i to rack j. Since we aim to count the number of induced data transfers across racks, we set c_{i,j} = 0 if d_i ∈ r_j and c_{i,j} = 1 otherwise. (In Section 3.4.4, we generalize to non-binary costs to take into account different routing paths.) Let y_j be the number of resource blocks required to service all the requested data assigned to rack j.

Figure 3.4: The number of induced data transfers in the data-locality-aware user grouping scheme.

The data-locality-aware user grouping problem can be formulated as follows:

minimize   Σ_{i∈M} Σ_{j∈N} c_{i,j} x_{i,j}
subject to Σ_{j∈N} x_{i,j} = 1,  ∀i ∈ M
           Σ_{i∈M} x_{i,j} ≤ β y_j,  ∀j ∈ N
           Σ_{j∈N} y_j = M/β
           x_{i,j} ∈ {0, 1},  ∀i ∈ M, j ∈ N
           y_j ∈ {0, 1, 2, ...},  ∀j ∈ N.   (3.7)

The objective function above is equal to the total number of induced data transfers across racks.
The first constraint ensures that the requested data d_i is assigned to some rack. The second constraint ensures that we allocate enough resource blocks to rack j to service all the requested data files assigned to it. Recall that the number of requested data files that can be serviced in one resource block by using ZFBF precoding is less than or equal to the number of BS antennas β. The third constraint ensures that the total number of resource blocks that we use is equal to M/β, like in the randomized/CSI-based case.

A polynomial time algorithm for Problem (3.7): The above problem can be solved in polynomial time as follows. Recall that M_j denotes the set of users whose requested data are in rack j. First, for each rack j = 1, ..., N with |M_j| ≥ β, we group batches of β users in M_j until |M_j| mod β users are left ungrouped. Note that in this step we form a total of Σ_{j=1}^N ⌊|M_j|/β⌋ groups, inducing no data transfers across racks. The number of ungrouped requested data files in rack j at the end of this step is |M_j| mod β. Second, we sort the set of numbers {m_j ≜ |M_j| mod β, j = 1, ..., N} in descending order (note that 0 ≤ m_j ≤ β − 1). We denote the sorted numbers as m_(1) ≥ m_(2) ≥ ... ≥ m_(N). Third, to ensure that we jointly precode and service β requested data files at each resource block, we find the number t such that

Σ_{j=1}^{t} (β − m_(j)) = Σ_{j=t+1}^{N} m_(j),

and transfer the requested data files stored in the racks corresponding to the set {m_(t+1), ..., m_(N)} to the racks corresponding to the set {m_(1), ..., m_(t)}, as shown in Fig. 3.4. Intuitively, the equation above finds the right way to partition racks such that the minimum number of data files are transferred from racks with a few ungrouped requests into racks with a few "empty" slots, to reach the ungrouped requests there and group them together. As a result, the average number of induced data transfers under the data-locality-aware user grouping scheme is equal to D_locality = E[Σ_{j=t+1}^{N} m_(j)].
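The three steps above can be sketched in a few lines of Python (a simplified illustration under our own conventions: the hypothetical input `data_rack` maps each user to the rack holding its requested file, and the function returns the groups along with the induced cross-rack transfer count):

```python
from collections import defaultdict

def locality_aware_groups(data_rack, beta):
    """Sketch of the three-step polynomial-time procedure for Problem (3.7)."""
    # Step 1: within each rack, form full groups of size beta;
    # these induce no cross-rack transfers.
    racks = defaultdict(list)
    for user, rack in data_rack.items():
        racks[rack].append(user)
    groups, leftovers = [], []
    for users in racks.values():
        while len(users) >= beta:
            groups.append(users[:beta])
            users[:] = users[beta:]
        if users:
            leftovers.append(users)      # m_j = |M_j| mod beta ungrouped users
    # Step 2: sort racks by ungrouped count, descending (m_(1) >= ... >= m_(N)).
    leftovers.sort(key=len, reverse=True)
    # Step 3: fill head racks up to beta with users moved from tail racks;
    # every moved file counts as one cross-rack transfer.
    transfers = 0
    k, l = 0, len(leftovers) - 1
    while k <= l:
        fill = leftovers[k]
        while len(fill) < beta and l > k:
            donor = leftovers[l]
            fill.append(donor.pop())
            transfers += 1
            if not donor:
                l -= 1
        groups.append(list(fill))
        k += 1
    return groups, transfers
```

On the example of Fig. 3.3 (d_1, d_3 in one rack, d_2 and d_4 in two other racks, β = 2), this yields two full groups and a single transfer, matching the first scheme discussed there.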
For M/β ≫ N, that is, when the number of resource blocks required to service one request of each user is a lot larger than the number of racks, we derive an approximation formula for D_locality. When M/β ≫ N and β > 1, the number of requested data files left ungrouped in rack j, i.e., m_j ≜ |M_j| mod β, will be approximately uniformly distributed in {0, 1, 2, ..., β − 1}. As a result, we have on average N/β racks left with i requested data files ungrouped, for i = 0, 1, ..., β − 1. Under this deterministic approximation, we may transfer the only ungrouped data file of N/β racks to the N/β racks which have β − 1 ungrouped data files, and so on and so forth, till we form groups each of size β as usual. Specifically, when β is odd, the total number of data file transfers equals

D_locality ≈ (N/β) (1 + 2 + ... + (β − 1)/2) = N(β + 1)(β − 1)/(8β),   (3.8)

and, when β is even, it equals

D_locality ≈ (N/β) (1 + 2 + ... + (β/2 − 1) + (1/2)(β/2)) = Nβ/8,   (3.9)

where the term (1/2)(β/2) comes from the fact that we have N/β racks with β/2 ungrouped data files, and we transfer these ungrouped data files from half of these racks to the other half. The above results can be obtained formally using order statistics [111] when m_j is uniformly distributed in [0, β − 1] and N → ∞. Specifically, it can be shown that when β is odd, we have D_locality = E[Σ_{j=t+1}^N m_(j)] → N(β + 1)(β − 1)/(8β) as N → ∞, and, when β is even, we have D_locality = E[Σ_{j=t+1}^N m_(j)] → Nβ/8 as N → ∞, where t = N(β − 1)/(2β) since tβ = E[Σ_{j=1}^N m_(j)] = E[Σ_{j=1}^N m_j] = N(β − 1)/2. We compare the above approximations with simulation results in Section 3.6; see Fig. 3.7.

3.4.3 Joint CSI- and data-locality-aware user grouping

The proposed data-locality-aware user grouping scheme in the previous subsection can be generalized to a joint CSI- and data-locality-aware user grouping scheme that also takes CSI into account when forming user groups. Let us consider again the problem of minimizing the number of induced data transfers by properly grouping M users into M/β groups (Problem (3.7)) and its corresponding polynomial time algorithm.
We are going to modify the polynomial time algorithm such that not only the optimum of Problem (3.7) is reached (i.e., we have the same minimum number of induced data transfers) but also all user groups of cardinality β yield a well-conditioned channel matrix. The polynomial time algorithm consists of three steps. We modify the first and last steps as follows.

Recall that in the first step of the polynomial time algorithm, we fix a rack j and group batches of β users in M_j until |M_j| mod β users are left ungrouped. Unlike before, where users are selected at random, here we select users based on CSI. Note that it is well known that CSI-based user grouping is NP-hard, and no tight approximation algorithms exist. Thus, we resort to a greedy approach which forms what is commonly referred to as semi-orthogonal user groups [96, 97]. To form the first group (of size β), we begin by choosing an arbitrary user i_1^1 ∈ M_j and insert it into the first group. Then, we choose another user i_1^2 whose channel vector is the most orthogonal to that of user i_1^1 and insert i_1^2 into the first group. After that, we choose another user i_1^3 whose channel vector is the most orthogonal to the signal space spanned by the channel vectors of users i_1^1 and i_1^2, and so on and so forth until we have β users in the first group, namely, {i_1^1, ..., i_1^β}. We repeat the above procedure to form the second group considering the candidate user set M_j \ {i_1^1, ..., i_1^β}. We continue this process until we form the ⌊|M_j|/β⌋-th user group, consisting of users {i_⌊|M_j|/β⌋^1, ..., i_⌊|M_j|/β⌋^β}.

We modify the last (third) step of the polynomial time algorithm as follows. When we transfer requested data from the racks in the tail portion of Fig. 3.4 to the racks in the head portion, we again use CSI and the above greedy algorithm repeatedly to decide the formation of batches of β users. Specifically, let us fix the rack with the most remaining ungrouped users, m_(1).
We select a total number of β − m_(1) users one by one (from the set of ungrouped users whose requested data files are stored in the racks corresponding to the set {m_(t+1), ..., m_(N)}) by using the aforementioned orthogonality principle, to form a group of size β. Then, we fix the rack with the second most remaining ungrouped users, m_(2), and so on and so forth. We continue this procedure till we fix the rack with m_(t) ungrouped users and group them together with the last remaining β − m_(t) users, forming the last group.

3.4.4 Regularized resource block minimization

In this subsection, we aim to design a user grouping algorithm that can adapt to the congestion level in the cloud data center. The idea is to suppress data transfers across racks in forming user groups when congestion occurs. We achieve this goal by relaxing the constraint of using exactly M/β resource blocks to serve M/β user groups of β users each (i.e., allowing the formation of a larger number of smaller user groups) and by introducing a regularization term. Specifically, we propose a regularized resource block minimization problem in which we minimize a weighted sum of the number of resource blocks needed to serve all user requests (Σ_{j∈N} y_j) and the cost associated with the data transfers across racks (Σ_{i∈M} Σ_{j∈N} c_{i,j} x_{i,j}), which acts as a regularization term with weight λ.

Figure 3.5: Data center topologies [112]: (a) FlatNet. (b) DCell. (c) BCube.

The regularized resource block minimization problem can be written as:

minimize   Σ_{j∈N} y_j + λ Σ_{i∈M} Σ_{j∈N} c_{i,j} x_{i,j}
subject to Σ_{j∈N} x_{i,j} = 1,  ∀i ∈ M
           Σ_{i∈M} x_{i,j} ≤ β y_j,  ∀j ∈ N
           x_{i,j} ∈ {0, 1},  ∀i ∈ M, j ∈ N
           y_j ∈ {0, 1, 2, ...},  ∀j ∈ N.   (3.10)

Note that here we allow a general cost c_{i,j} for transferring data file d_i to rack j (instead of restricting to the binary cost as in Section 3.4.2) to account for more complicated data center topologies [112, 113], as shown in Fig. 3.5. For example, c_{i,j} may be the cost of the shortest path to transfer data file d_i to rack j.
Note that c_{i,j} = 0 if d_i ∈ r_j. To solve the above problem, we first map it to the well-known metric soft-capacitated facility location problem (SCFLP). Then, we follow the approach presented in [101] to solve it with a 2-approximation algorithm.

Mapping Problem (3.10) to a metric SCFLP: Under a general cost c_{i,j}, Problem (3.10) is NP-hard since it can be mapped to the SCFLP [101]. In the SCFLP, soft capacity means that we can open multiple copies of the same facility at the same location, and each copy has a hard capacity for servicing customers. In the SCFLP, we minimize the sum of the costs for opening facilities and the costs for servicing customers by deciding which facilities to open, how many copies of each facility to open, and which customers are serviced by which facilities. The mapping between Problem (3.10) and the SCFLP is established as follows. Performing ZFBF precoding at rack j for y_j resource blocks is equivalent to opening facility j with y_j copies. The degrees of freedom β (the maximum number of requested data files that can be jointly precoded to be transmitted in one resource block) is equivalent to the hard capacity of a copy of a facility (one resource block can only serve up to β requested data files). The user data requests are equivalent to customers. Processing/precoding data d_i at rack j with the associated data transfer cost c_{i,j} is equivalent to servicing customer i by facility j with the associated servicing cost c_{i,j}. In the SCFLP, when the costs live in a metric space and thus satisfy the triangle inequality, we call it a metric SCFLP. Our problem is a metric SCFLP since our costs satisfy the triangle inequality. Indeed, let r(i) denote the rack that stores data d_i, denote by ĉ_{r(i),j} the cost for transferring data d_i via the shortest path between rack r(i) and rack j, and, more generally, denote by ĉ_{k,j} the cost of transferring a data file from rack k to rack j.
Then, we have c_{i,j} = ĉ_{r(i),j}, ĉ_{k,k} = 0, ĉ_{k,j} = ĉ_{j,k}, and ĉ_{k,j} ≤ ĉ_{k,l} + ĉ_{l,j} for any other rack l.

A 2-approximation algorithm for Problem (3.10): The metric SCFLP is NP-hard. The authors in [101] map the metric SCFLP to a linear facility location problem (LFLP) and then to an uncapacitated facility location problem (UFLP), which they solve using the Jain-Mahdian-Saberi (JMS) algorithm [114], yielding a 2-approximation solution to the original metric SCFLP problem. We follow the same procedure to solve Problem (3.10), which yields a 2-approximation solution [101].

The 2-approximation algorithm for Problem (3.10)
1. Let c̃_{i,j} = 1/β + λ c_{i,j}.
2. At the beginning, all requests are unassigned, all racks are unopened, and the budget of every request i, denoted by B_i, is initialized to 0. At every moment, each request i offers some money from its budget to each unopened rack j. The amount of this offer is equal to max{B_i − c̃_{i,j}, 0} if i is unassigned, and max{c̃_{i,j'} − c̃_{i,j}, 0} if it is assigned to some other rack j'.
3. While there is an unassigned request, increase the budget of each unassigned request at the same rate, until one of the following events occurs:
   - For some unopened rack j, the total offer that it receives from requests is equal to 1. In this case, we open rack j, and, for every request i (assigned or unassigned) which has a non-zero offer to rack j, we assign request i to rack j.
   - For some unassigned request i and some rack j that is already open, the budget of request i is equal to c̃_{i,j}. In this case, we assign request i to rack j.

To evaluate the performance of the 2-approximation algorithm, we use CVX with Gurobi to solve Problem (3.10), since it is a mixed integer linear programming problem. Gurobi reports the distance of its output to the optimum, and often this distance is zero. Thus, we are able to compare the 2-approximation algorithm against the optimum for small-scale scenarios; see Section 3.6 for the results.
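The budget/offer mechanics of steps 1-3 can be sketched with a coarse time-stepped loop. This is an illustrative approximation of the continuous budget growth, not the exact event-driven algorithm of [101, 114]; the scaled cost c̃_{i,j} = 1/β + λ c_{i,j}, the opening cost f = 1, and all names are our own assumptions:

```python
def two_approx_assign(c, lam, beta, f=1.0, step=0.01):
    """Time-stepped sketch of the budget/offer scheme described above.
    c[i][j] is the raw transfer cost of request i to rack j."""
    M, N = len(c), len(c[0])
    ct = [[1.0 / beta + lam * c[i][j] for j in range(N)] for i in range(M)]
    budget = [0.0] * M
    assigned = [None] * M
    opened = [False] * N

    def offer(i, j):
        # unassigned requests offer their budget surplus;
        # assigned requests offer the savings from switching racks
        base = budget[i] if assigned[i] is None else ct[i][assigned[i]]
        return max(base - ct[i][j], 0.0)

    while any(a is None for a in assigned):
        for i in range(M):                     # budgets grow at the same rate
            if assigned[i] is None:
                budget[i] += step
        for i in range(M):                     # event: can afford an open rack
            if assigned[i] is None:
                for j in range(N):
                    if opened[j] and budget[i] >= ct[i][j]:
                        assigned[i] = j
                        break
        for j in range(N):                     # event: offers cover opening cost f
            if not opened[j] and sum(offer(i, j) for i in range(M)) >= f:
                opened[j] = True
                for i in range(M):
                    if offer(i, j) > 0:
                        assigned[i] = j
    return assigned
```

On a toy instance with two requests local to rack 0 and one local to rack 1, the scheme opens both racks and assigns each request to its cheap rack, as expected.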
3.4.5 Regularized spectral efficiency maximization

Similar to what we did in Section 3.4.3 for the regularized resource block minimization problem, we can further increase the spectral efficiency by exploiting CSI and the aforementioned greedy algorithm to form semi-orthogonal user groups whenever there are more than β requests assigned to the same rack at the end of the 2-approximation algorithm. Specifically, we greedily add users to user groups such that the channel vector of each added user is the most orthogonal, among the ungrouped users, to the signal space spanned by the channel vectors of the users already in the group; see Section 3.4.3.

The procedure described above uses CSI to form better groups whenever there are more than β requests assigned to the same rack at the end of the 2-approximation algorithm. However, it might be the case that, upon transferring a request from a rack with β or fewer requests to another rack, the corresponding channel matrices are such that the spectral efficiency of the resulting user groups is improved. And, depending on the weight λ, the marginal gain from such a move might be larger than the marginal cost of the additional data transfer. Motivated by this, in the following we introduce a general formulation, referred to as the regularized spectral efficiency maximization problem, by using as a metric for the efficiency of the wireless channel not the number of resource blocks required, but the actual data rates achieved on these resource blocks. We optimize over the set of all possible partitions of the user set M, i.e., we optimize over S_1, S_2, ..., S_T, where S_k ∩ S_l = ∅, k ≠ l, ∪_{k=1}^T S_k = M, and the cardinality of each group is less than or equal to the degrees of freedom (|S_k| ≤ β). Let R_ZFBF(S_k) denote the sum rate of the users in group S_k under ZFBF precoding, where R_ZFBF(S_k) = Σ_{l∈S_k} log(1 + γ_l SNR) and γ_l = 1/[(H(S_k)H(S_k)^H)^{-1}]_{l,l} [96].
The regularized spectral efficiency maximization problem can be written as

maximize   (1/T) Σ_{k=1}^T R_ZFBF(S_k) − (λ/T) Σ_{k=1}^T min_{j∈N} ( Σ_{i∈S_k} c_{i,j} )
subject to T ∈ {1, 2, ..., M}
           S_k ∩ S_l = ∅,  k ≠ l,  k, l = 1, ..., T
           ∪_{k=1}^T S_k = M
           |S_k| ≤ β,  k = 1, ..., T.   (3.11)

Note that all the data files requested by the users in group S_k will be transferred to a rack j_k for ZFBF precoding. It is easy to see that j_k = arg min_{j∈N} (Σ_{i∈S_k} c_{i,j}), since this rack designation results in the minimum data transfer cost for users in group S_k. Like before, λ is the weight for the regularization term. As already mentioned, the pure spectral efficiency maximization problem (λ = 0 in Problem (3.11)) is well known to be NP-hard, and no tight approximation algorithms exist. Thus, we devise a greedy algorithm to solve Problem (3.11). Let us define the utility of a user group S_k as U(S_k) ≜ R_ZFBF(S_k) − λ min_{j∈N} (Σ_{l∈S_k} c_{l,j}), which is the sum rate minus the induced data transfer cost. In addition, we define the marginal utility of including a new user i into a group S_k as U(i|S_k) ≜ U(S_k ∪ {i}) − U(S_k).

Greedy algorithm for Problem (3.11)
1. Initialize M ← {1, ..., M}; k ← 0;
2. while M ≠ ∅ do
3.   k ← k + 1;
4.   S_k ← ∅;
5.   while (|S_k| < β and max_{i∈M} U(i|S_k) > θ) or S_k = ∅ do
6.     S_k ← S_k ∪ {i*}, where i* = argmax_{i∈M} U(i|S_k);
7.     M ← M \ {i*};
8.   end while
9. end while

In the greedy algorithm, when forming the k-th user group S_k, we include into S_k a new user i* that has the highest marginal utility, if the number of users in S_k is less than the degrees of freedom β and that highest marginal utility is larger than some threshold θ. Otherwise, we proceed to form the next group S_{k+1}. At the end of the algorithm, the sets S_1, S_2, ... form a partition of the user set M. Note that larger groups allow for more concurrent transmissions and thus better spectral efficiency, which Problem (3.11) takes into account by dividing the sum rate by T, the total number of groups, and optimizing the average group rate rather than the total sum rate.
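The pseudocode above translates almost line for line into code. A sketch with our own helper names, using the γ_l expression from this subsection for the ZFBF sum rate (SNR value, λ, and θ defaults are illustrative assumptions):

```python
import numpy as np

def zfbf_sum_rate(H, group, snr=2.0):
    # R_ZFBF(S) = sum_l log2(1 + gamma_l * SNR), gamma_l = 1/[(H_S H_S^H)^{-1}]_ll
    Hs = H[list(group)]
    G = np.linalg.inv(Hs @ Hs.conj().T)
    gammas = 1.0 / np.real(np.diag(G))
    return float(np.sum(np.log2(1.0 + gammas * snr)))

def greedy_partition(H, cost, beta, lam=0.1, theta=0.0):
    """Greedy sketch for Problem (3.11): grow each group by the user with the
    largest marginal utility U(i|S_k) while it stays above the threshold theta.
    cost[i][j] is the transfer cost of data file d_i to rack j."""
    def utility(group):
        if not group:
            return 0.0
        transfer = min(sum(cost[i][j] for i in group)
                       for j in range(len(cost[0])))
        return zfbf_sum_rate(H, group) - lam * transfer

    remaining = set(range(H.shape[0]))
    groups = []
    while remaining:
        S = []
        while len(S) < beta and remaining:
            base = utility(S)
            best = max(remaining, key=lambda i: utility(S + [i]) - base)
            if S and utility(S + [best]) - base <= theta:
                break                 # marginal utility too small: close the group
            S.append(best)
            remaining.remove(best)
        groups.append(S)
    return groups
```

Note that, as in line 5 of the pseudocode, an empty group always admits its first user regardless of θ, which guarantees the output is a partition of the user set.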
Since the greedy algorithm does not know a priori the optimal number of user groups, the user group utility and marginal utility defined above work with the group rate, and the parameter θ is used to parameterize over different group sizes (the smaller the θ, the larger the groups tend to be, and thus the smaller their number, and vice versa).

3.5 Extensions

3.5.1 The case with replicas

In data centers it is common to have multiple replicas of the same data file in different racks, to reduce the read/write latency and increase system reliability [115]. Let us define the set of racks that store the data file d_i requested by user i as R_i. Recall that ĉ_{k,j} denotes the cost of transferring a data file from rack k to rack j. Since we have |R_i| replicas of data file d_i, the associated cost of transferring data file d_i to rack j is the minimum cost of transferring the data file from any rack k ∈ R_i to rack j, that is, c_{i,j} = min_{k∈R_i} ĉ_{k,j}. Note that c_{i,j} = 0 if one of the replicas of data file d_i is stored in rack j already. From the discussion above it is easy to see that the presence of replicas yields the same formulation as before.

3.5.2 The case with multiple BSs

We consider the scenario shown in Fig. 3.1a, where there are multiple BSs/cells (say, G cells), each associated with a number of users. For independent cells (no inter-cell interference and no coordination among cells, like in CoMP [87]), it is easy to see that the global problem can be decomposed into G independent instantiations, leading to the same formulation as before. When cells are dependent, the problem is coupled. In the case of Problem (3.10), this is so because two users belonging to two different cells may or may not use the same resource block depending on inter-cell interference, and the problem cannot be mapped to the SCFLP. In the case of Problem (3.11), this is so because the rate of users depends both on inter-cell interference and on whether CoMP is used or not.
That said, the greedy algorithm can still be used.

3.5.3 The case with multi-antenna users

When a user has more than one antenna, it can receive more than one data stream in the same resource block. Let n_i, i ∈ M, be the number of antennas of user i. By replacing user i with n_i users of one antenna each, and serving different requests of the original user i with each new user, we can map the multi-antenna problem to the same formulation as before.

3.6 Simulation and numerical results

3.6.1 Randomized/CSI-based vs data-locality-aware user grouping

Fig. 3.6 compares the data-locality-aware user grouping scheme with the randomized/CSI-based user grouping scheme in terms of the total number of induced data transfers. We assume that the number of racks N equals 10, the number of antennas in a BS β equals 4, and the number of user data file requests M varies from 1 to 50. The data files requested by the M users are distributed among the N racks independently and uniformly at random. Under all schemes, we plot the number of induced data transfers needed to fully exploit the spatial multiplexing gain, i.e., we group the M users into ⌈M/β⌉ groups of size β = 4. Fig. 3.6 illustrates that the data-locality-aware user grouping scheme can significantly reduce the number of induced data transfers. This comes as no surprise since the data-locality-aware scheme has been designed with this goal in mind.

Figure 3.6: Comparison between the data-locality-aware and the randomized/CSI-based user grouping schemes. (N = 10, β = 4)
0 50 100 150 200 0 50 100 150 Number of user data file requests M Number of data transfers Randomized/CSI−based user grouping Rand/CSI asymptotic upper bound Rand/CSI asymptotic lower bound Data−locality−aware user grouping Data−locality approximation formula (b) N = 30; = 4. 0 50 100 150 200 0 50 100 150 Number of user data file requests M Number of data transfers Randomized/CSI−based user grouping Rand/CSI asymptotic upper bound Rand/CSI asymptotic lower bound Data−locality−aware user grouping Data−locality approximation formula (c) N = 50; = 4. Figure 3.7: Comparison of the asymptotic bounds / approximation formulas with simulation results. In Fig. 3.7, under the randomized/CSI-based user grouping scheme, we com- pare the asymptotic upper bound (Eq. (3.5)) and lower bound (Eq. (3.6)) for the number of induced data transfers with simulation results. These bounds hold for N and we vary N from 10 till 50 in Fig. 3.7a-c. We can see that both the upper and lower bounds give a good approximation. In addition, under the data-locality-aware user grouping scheme, we compare the approximation formula (Eq. (3.9)) for the number of induced data transfers with simulation results. The approximation holds for MN, and, indeed, it is accurate in this regime. Last, Fig. 3.7 illustrates that the number of induced data transfers for the data-locality-aware user grouping scheme does not increase with the number of user data le requestsM, while that for the randomized/CSI- based user grouping scheme increases linearly with M. 3.6.2 Joint CSI- and data-locality-aware user grouping We compare the performance of the data-locality-aware user grouping scheme with that of the joint CSI- and data-locality-aware user grouping scheme in 83 0 20 40 60 80 100 2 3 4 5 6 7 8 9 Number of requests per rack Spectral efficiency (bits/s/Hz) Joint CSI−data−locality−aware user grouping Data−locality−aware user grouping Figure 3.8: Joint CSI- and data-locality-aware user grouping scheme. 
terms of the average spectral efficiency measured in bits/s/Hz. Note that, as discussed in Section 3.4.3, both schemes induce the same number of data transfers. We assume that the number of antennas in a BS is 4 and assume i.i.d. Rayleigh fading channel coefficients with unit power. The spectral efficiency is calculated based on ZFBF precoding at an SNR value of 3 dB [96]. For the joint CSI- and data-locality-aware user grouping scheme, we use the greedy algorithm introduced in Section 3.4.3 to form semi-orthogonal user groups.

Fig. 3.8 shows that, as expected, the joint CSI- and data-locality-aware user grouping scheme achieves a higher spectral efficiency due to the better-conditioned channel matrices formed by the users in the same group. Note that the rate gains can be significant. For example, when the average number of requests per rack is 50, there is a 1.4x improvement in data rate over the data-locality-aware scheme. In addition, we can see that the gap between the two curves gradually increases as the number of requests per rack increases. The reason is that the channel matrices become progressively better conditioned due to increasing multiuser diversity.

3.6.3 Regularized resource block minimization vs. regularized spectral efficiency maximization

We study the Pareto optimal curves for the regularized resource block minimization problem and the regularized spectral efficiency maximization problem in Fig. 3.9 and Fig. 3.10, respectively. Like before, we assume that the number of racks N equals 10, the number of antennas in a BS equals 4, and the number of user data file requests M equals 50. Let the cost ĉ_{k,j} for transferring data from rack k to rack j be distributed in the range [0, 50]. Fig.
3.9 shows the Pareto optimal curve for the regularized resource block minimization problem (Problem (3.10)), where the number of resource blocks needed to service all requests is plotted against the cost associated with the induced data transfers as the weight ρ varies between 0 and 50. The Pareto-optimal curve is obtained by using CVX with Gurobi, a commercial solver for mixed integer linear programming problems. Note that using a smaller number of resource blocks to service a given number of user data file requests implies that a larger number of data transfers across racks is induced to form larger user groups for ZFBF precoding.

[Figure 3.9: Pareto optimal curve for the regularized resource block minimization problem (M = 50, N = 10, 4 antennas per BS). Axes: cost of the induced data transfers vs. number of resource blocks.]

[Figure 3.10: Pareto optimal curve for the regularized spectral efficiency maximization problem (M = 50, N = 10, 4 antennas per BS). Axes: cost of the induced data transfers vs. spectral efficiency (bits/s/Hz).]

Fig. 3.10 shows the Pareto optimal curve for the regularized spectral efficiency maximization problem (Problem (3.11)). The Pareto-optimal curve is obtained by using our proposed greedy algorithm in Section 3.4.5 with different values of the weight ρ and the threshold. As expected, the spectral efficiency increases as the number of induced data transfers increases. Note that the weight ρ is the control knob that can be adjusted by the system operator according to the congestion level in the cloud data center. For example, when there is no congestion in the cloud data center, ρ can be set to zero. That is, we focus on maximizing the spectral efficiency by forming user groups with

[Figure 3.11: Spectral efficiency for increasing use of CSI. Curves: CSI-fully aware, CSI-partially aware, CSI-unaware; axes: number of resource blocks vs. spectral efficiency (bits/s/Hz).]
[Figure 3.12: Performance of the 2-approximation algorithm (M = 50, N = 10, 4 antennas per BS). Axes: the weight ρ vs. the achieved cost; curves: 2x optimal, the 2-approximation algorithm, and optimal.]

well-conditioned channel matrices and do not care about the number of induced data transfers across racks (corresponding to operating the system at the rightmost point in Fig. 3.10). On the other hand, when congestion occurs in the cloud data center, we choose a large weight ρ to suppress data transfers across racks and preferentially group together users whose requested data are located under the same rack. Operating the system at the leftmost point in Fig. 3.10 corresponds to the case of using a very large weight for regularization (the data center network is heavily congested), resulting in no data transfers across racks in forming user groups.

Last, Fig. 3.11 shows the increase in the spectral efficiency obtained by exploiting CSI. The blue curve corresponds to Problem (3.10), the regularized resource block minimization problem. We refer to this as the CSI-unaware scheme. The red curve corresponds to Problem (3.10) with the addition of using CSI to form semi-orthogonal groups within a rack that has more requests assigned to it than the group size; see the first paragraph of Section 3.4.5. We refer to this as the CSI-partially-aware scheme. Last, the green curve corresponds to Problem (3.11), the regularized spectral efficiency maximization problem, where CSI is fully exploited. We refer to this as the CSI-fully-aware scheme. To create the blue and red curves, we use the Shannon rate formula with ZFBF precoding to obtain the corresponding rates for different numbers of resource blocks used. Since the CSI-partially-aware scheme uses CSI for some group formations, it has better channel gains, yielding about a 1.2x improvement. To create the green curve and compare it with the other two, we use Fig.
3.9 to map resource blocks to the cost of data transfers for the first two schemes, and then use Fig. 3.10 to directly get the corresponding rates for the CSI-fully-aware scheme for the aforementioned cost of data transfers. Since this scheme uses CSI for all group formations, it yields a higher gain, about 1.3x. This shows how the increasing use of CSI can lead to increasing spectral efficiency without causing additional data transfers at the cloud data center.

3.6.4 Accuracy of the 2-approximation algorithm

In Fig. 3.12, we compare the achieved objective value of the 2-approximation algorithm for the regularized resource block minimization problem with the value we get by using CVX with Gurobi, as we vary the weight ρ from 0 to 1. We can see that the 2-approximation algorithm behaves as expected, i.e., the achieved objective value lies within two times the optimal.

3.7 Conclusion

In this work, we introduced a novel data-locality-aware user grouping scheme for multi-user MIMO beamforming under the cloud radio access network architecture. By grouping users based on both the CSI and the locality of their requested data, we significantly reduced the number of induced data transfers across racks in the cloud data center, while providing the same level of spatial multiplexing gain. To design a user grouping algorithm that can adapt to the congestion level in the cloud data center, we proposed a regularized spectral efficiency maximization problem where the number of data transfers across racks serves as a regularization term whose weight can be controlled by a system operator. When there is no congestion, a small weight can be used, resulting in the traditional CSI-based user grouping scheme. When congestion occurs, a large weight can be used, suppressing data transfers across racks and preferentially grouping together users whose requested data are located in the same rack.
Last, under some simplifications, we cast the above problem as a soft-capacitated facility location problem, developed a 2-approximation user grouping algorithm to solve it, and studied via simulations and numerical analysis the tradeoff between spectral efficiency and data transfer cost.

Chapter 4

Joint Workload Distribution and Capacity Augmentation in Hybrid Datacenter Networks

In hybrid datacenter networks, wired connections are augmented with wireless links to facilitate data transfers between racks. The usage of mmWave/FSO wireless links enables dynamic bandwidth/capacity allocation with extremely small reconfiguration delay. Also, on-demand workload distribution, where the workload of a job is divided into multiple tasks that can be distributed/routed to different racks to be processed in parallel, allows better utilization of computational resources in data centers.

In prior work, the dynamic wireless capacity augmentation and workload distribution decisions were mostly made independently and in a heuristic manner for servicing distributed and parallel computing jobs. In this work, we propose a novel analytical framework and algorithms to jointly optimize both the wireless capacity augmentation and the workload distribution, so as to minimize the job completion time. Through extensive simulation studies, we show that the gain (in terms of the reduction in the job completion time) can be very substantial when such joint optimization is allowed.

4.1 Introduction

In parallel and distributed computing frameworks such as MapReduce and Hadoop, the workload of a job (say, a deep learning application [116]) is often divided into multiple tasks that can be distributed across multiple servers in different racks for parallel processing, reducing the response time. To improve computational resource utilization and alleviate hot spots in a data center, on-demand workload distribution is employed, e.g., via load balancing [117,118] and auto-sharding [119,120].
According to the Google report [120], a well-designed on-demand workload distribution allows data centers to process more jobs per unit time. Traditional datacenter networks, which consist of copper and optical fiber cables forming, say, a fat-tree topology [121], may not be able to support the recent advance of dynamic on-demand workload distribution. In traditional datacenter networks, the amount of capacity to provision between racks (or servers) is decided in advance. As a result, when the demand between two racks exceeds the provisioned capacity, congestion will occur. Such fixed/static link capacity allocation and link oversubscription restrict the support of dynamic on-demand workload distribution.

On the other hand, reconfigurable wireless interconnects with very low reconfiguration delay, such as mmWave links [122,123] and free-space optics (FSO) [124], have recently been deployed in hybrid datacenter networks [125-130] to enable dynamic capacity allocation (augmentation) between racks. For example, in [131], the reconfiguration delay of mmWave beamforming was shown to be extremely small even when the number of beams (i.e., the number of fanouts) is large. In [132], the reconfiguration delay of FSO was shown to be only 12 μs while supporting 18,432 fanouts.

In the literature, the on-demand workload distribution and wireless link augmentation were mostly treated/configured independently in a heuristic manner for servicing distributed and parallel computing jobs. In this work, we propose a novel analytical framework that allows a joint optimization of the dynamic workload distribution and wireless link augmentation in hybrid datacenter networks. Specifically, to minimize the job completion time, we decide which interconnects need their capacity augmented (and by how much) and how the workload of a job is distributed among the racks via these augmented links for parallel processing.
Motivated by different applications, we consider two types of workload: workload which is amenable to pipelining and workload which is not. For the latter, data processing/computation in a rack can only start after all data destined for the rack are received. For the former, data processing can start as soon as data reception starts. We show that the gain from the joint optimization is substantial: it can be as large as 30% for non-pipeline-amenable workload and 45% for pipeline-amenable workload. Note that real-world workload characteristics lie in between the two extremes (being non-pipeline-amenable or pipeline-amenable).

The remainder of this work is organized as follows. We present related work in Section 4.2. The system architecture is introduced in Section 4.3. The joint optimization of workload distribution and wireless link augmentation for non-pipeline-amenable and pipeline-amenable workload is discussed in Section 4.4 and Section 4.5, respectively. The analytical framework is extended to consider data transmission cost in Section 4.6. Simulation results are given in Section 4.7. Last, Section 4.8 concludes the work.

4.2 Related work

Hybrid datacenter networks: Prior work on hybrid datacenter networks mainly focuses on increasing data center network capacity by using wireless links, subject to a wireless bandwidth constraint, indoor channel characteristics, and a fixed wired topology. For example, the authors in [126] proposed a mmWave-based hybrid datacenter network architecture and optimized the wireless bandwidth allocation. An interference cancellation scheme was further proposed in [127] to increase the wireless link capacity. In [129], the authors demonstrated the usage of mmWave MIMO beamforming to dynamically configure hybrid datacenter networks. To perfectly reflect light beams between racks in FSO-based hybrid datacenter networks, optimal mirror placement was investigated in [130].
Workload distribution: The goal of on-demand workload distribution is to distribute the workload of a job from the original server to multiple servers across racks to improve computational or network resource utilization and alleviate hot spots. For example, the authors in [117,118] proposed load-balancers that better utilize network resources by optimizing network software and hardware simultaneously. Some general-purpose sharding systems, such as Orleans [119] and Slicer [120], split the workload of a job into multiple tasks and distribute them to multiple servers to reduce the response time and/or to relieve hot spots in data centers.

The work most closely related to ours is [133]. Its authors considered flow routing and antenna scheduling for congestion control (network resources) in a mmWave-based hybrid datacenter network, where a flow has a fixed source, a fixed destination, and a fixed amount of traffic sent from the source to the destination. In this work, we focus on optimizing both computational and network resources to service large-scale distributed and parallel computing jobs. In particular, both the destination(s) of the workload of a job (the racks where data will be processed) and the amount of workload routed to each destination rack are decision variables, which are optimized according to both the computational resources available in different racks and the network resources available among different racks (which are themselves variables that can be optimized by wireless capacity augmentation). We jointly optimize the wireless link capacity augmentation and the workload distribution to minimize the job completion time.

4.3 System Model

Let N = {1, 2, ..., N} be the set of racks in a data center. A rack consists of a number of servers for computation and storage.
Each pair of racks i and j is connected with a (logical) link with capacity C_f + c_{i,j}, where C_f is the fixed wired capacity between racks i and j, whose value depends on the wired topology [footnote 1], and c_{i,j} is the augmented wireless capacity from rack i to rack j, which can be dynamically reconfigured according to the current workload distribution. We assume that the total wireless bandwidth is C_w, which is shared by wireless links among racks via, say, FDMA, so that the wireless transmissions do not interfere with each other. We have Σ_{i∈N} Σ_{j∈N\{i}} c_{i,j} = C_w. [footnote 2]

[Figure 4.1: An illustration of the system model.]

Consider a job whose data are stored across different racks (a special case is that all the data of the job are stored in a single rack). Let d_i denote the amount of data stored in rack i. The total workload of the job is thus Σ_{i∈N} d_i. In addition, let S_i be the service rate available in rack i to process the data of the job. The units of d_i, S_i, C_f, and c_{i,j} are normalized, where d_i is measured in Mb and S_i, C_f, and c_{i,j} are measured in Mb/s. See Fig. 4.1 for an example.

Suppose that we are allowed to transfer the workload of the job among different racks for processing. Let r_{i,j} denote the portion of workload d_i to be transferred from rack i to rack j. In other words, we transfer the amount of workload d_i r_{i,j} from rack i to rack j. We have Σ_{j∈N} r_{i,j} = 1 for all i ∈ N. Note that d_i r_{i,i} is the amount of workload that stays in rack i for processing.

4.4 Workload not amenable to pipelining

In this section, we consider workload that is not amenable to pipelining, where data processing in a rack can only start after the rack receives all the data destined for it.

Footnote 1: Here we assume a fat-tree [121] datacenter network topology. By the non-blocking property of fat-tree [134], the wired link capacity (C_f) between any two racks is equal to the link capacity between a rack and its corresponding edge router.
The analysis can be easily extended to the case with different values of wired capacity between any two different racks.

Footnote 2: With a slight abuse of notation, we also use C_w to denote the product of the total wireless bandwidth (Hz) and the spectral efficiency (bits/s/Hz).

4.4.1 Problem formulation

Since the workload of the job is distributed among different racks to be processed in parallel, the job completion time is constrained by the slowest rack, i.e., the maximum completion time among the racks. In addition, since data processing at rack i only starts after it receives all the data destined for it, the completion time of rack i is the sum of the data transmission time (the time to receive all its designated data), max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}), and the data processing time, (Σ_{k∈N} d_k r_{k,i}) / S_i. As a result, the job completion time equals

max_{i∈N} { max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}) + (Σ_{k∈N} d_k r_{k,i}) / S_i }.

We aim to minimize the job completion time by properly routing/distributing the workload among different racks for parallel processing and by properly augmenting wireless link capacity for data transfers. We jointly optimize the routing variable r_{i,j}, which is the portion of d_i to be transferred from rack i to rack j for processing, and the wireless link capacity allocation variable c_{i,j}, which is the capacity augmentation from rack i to rack j. We have the following optimization problem:

min_{r_{i,j}, c_{i,j}} max_{i∈N} [ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}) + (Σ_{k∈N} d_k r_{k,i}) / S_i ]
subject to: Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
Σ_{i,j∈N} c_{i,j} = C_w,
r_{i,j}, c_{i,j} ≥ 0, for all i, j ∈ N. (4.1)

Note that if we are not allowed to augment wireless capacity between, say, racks k and l due to, say, a topology constraint, we can simply add the constraints c_{k,l} = 0 and c_{l,k} = 0 to Problem (4.1). In addition, if there is an upper bound c_upper on the amount of wireless capacity augmentation from rack k to rack l, we can add the constraint c_{k,l} ≤ c_upper to Problem (4.1).
The resulting constraint set is still polyhedral, and our analysis and algorithms still apply.

Definition 4.1. (Biconvex Set [135]) A set B ⊆ X × Y is biconvex on X × Y if B_x := {y ∈ Y : (x, y) ∈ B} is convex for every x ∈ X and B_y := {x ∈ X : (x, y) ∈ B} is convex for every y ∈ Y.

Definition 4.2. (Biconvex Function [135]) A function f : B → R is biconvex if B ⊆ X × Y is biconvex, f_x := f(x, ·) : B_x → R is convex on B_x for every x ∈ X, and f_y := f(·, y) : B_y → R is convex on B_y for every y ∈ Y.

Definition 4.3. (Biconvex Optimization Problem [135]) A problem min_{(x,y)∈B} f(x, y) is a biconvex optimization problem if B is biconvex on X × Y and f is biconvex on B.

Lemma 4.1. The subproblem of Problem (4.1) with fixed wireless capacity augmentation is a convex optimization problem.

Proof. When the wireless capacity augmentation c_{i,j} is fixed, Problem (4.1) becomes:

min_{r_{i,j}} max_{i∈N} [ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}) + (Σ_{k∈N} d_k r_{k,i}) / S_i ]
subject to: Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
r_{i,j} ≥ 0, for all i, j ∈ N. (4.2)

It is easy to see that Problem (4.2) can be transformed to a linear program [136].

Lemma 4.2. The subproblem of Problem (4.1) with fixed workload routing is a convex optimization problem.

Proof. When the workload routing variable r_{i,j} is fixed, Problem (4.1) becomes:

min_{c_{i,j}} max_{i∈N} [ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}) + (Σ_{k∈N} d_k r_{k,i}) / S_i ]
subject to: Σ_{i,j∈N} c_{i,j} = C_w,
c_{i,j} ≥ 0, for all i, j ∈ N. (4.3)

Since the function 1/(C_f + c_{i,j}) is convex for c_{i,j} ≥ 0, and the maximum of convex functions is convex, Problem (4.3) is a convex optimization problem [136].

Theorem 4.1. Problem (4.1) is a biconvex optimization problem.

Proof. This follows directly from Lemma 4.1 and Lemma 4.2.

We employ the Alternate Convex Search (ACS) algorithm [135] in the following subsection to solve the biconvex optimization Problem (4.1).
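Before turning to the algorithm, the objective of Problem (4.1) can be made concrete with a short sketch. The following code (a hypothetical toy instance; the numbers are illustrative, not results from this thesis) evaluates the non-pipelined job completion time for a given routing matrix r and augmentation matrix c:

```python
# Toy evaluation of the objective of Problem (4.1): the non-pipelined job
# completion time for given routing r[i][j] and augmentation c[i][j].
# All parameter values below are illustrative assumptions, not thesis data.

def completion_time(d, S, Cf, r, c):
    """Max over racks of (slowest incoming transfer + local processing time)."""
    N = len(d)
    times = []
    for i in range(N):
        # Transmission: rack i must wait for its slowest incoming transfer.
        trans = max((d[j] * r[j][i] / (Cf + c[j][i])
                     for j in range(N) if j != i), default=0.0)
        # Processing: all workload routed to rack i is served at rate S[i].
        proc = sum(d[k] * r[k][i] for k in range(N)) / S[i]
        times.append(trans + proc)
    return max(times)

d = [5000, 10000, 15000]          # Mb of data stored in each rack
S = [3000, 2000, 1000]            # Mb/s of service rate in each rack
Cf = 100.0                        # Mb/s fixed wired capacity

# Baseline: no workload distribution (r[i][i] = 1), no augmentation.
identity = [[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]]
zeros = [[0.0] * 3 for _ in range(3)]
print(completion_time(d, S, Cf, identity, zeros))   # 15.0 s (= max_i d_i/S_i)
```

Any routing/augmentation pair that offloads part of the slowest rack's data strictly reduces this objective, which is exactly what the joint optimization searches for.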
4.4.2 Proposed algorithm

It is not hard to see that Problem (4.1) can be equivalently stated as:

min_{r_{i,j}, c_{i,j}, z} z
subject to: d_j r_{j,i} / (C_f + c_{j,i}) + (Σ_{k∈N} d_k r_{k,i}) / S_i ≤ z, for all i, j ∈ N, i ≠ j,
Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
Σ_{i,j∈N} c_{i,j} = C_w,
r_{i,j}, c_{i,j} ≥ 0, for all i, j ∈ N. (4.4)

Algorithm 4: Alternate Convex Search (ACS)
Input: d_i for all i ∈ N, S_i for all i ∈ N, C_f, and C_w.
Output: z, r_{i,j} for all i, j ∈ N, and c_{i,j} for all i, j ∈ N.
1. Initialize c_{i,j} = 0 for all i, j ∈ N.
2. Initialize t = 0.
3. Initialize z = max_{i∈N} d_i / S_i (upper bound on the job completion time).
4. while true do
5.   Fix c_{i,j}, and solve Problem (4.4) to get r_{i,j} and z_r.
6.   if z_r < z then
7.     z = z_r and t = 0
8.   else
9.     t = t + 1
10.  end if
11.  Fix r_{i,j}, and solve Problem (4.4) to get c_{i,j} and z_c.
12.  if z_c < z then
13.    z = z_c and t = 0
14.  else
15.    t = t + 1
16.  end if
17.  if t ≥ 2 then
18.    break
19.  end if
20. end while

We apply Algorithm 4 to solve the biconvex optimization Problem (4.4). The idea of ACS is to iteratively solve two convex optimization problems until a local optimum is reached. At each iteration, we fix one decision variable (i.e., c_{i,j} or r_{i,j}) to arrive at a convex optimization problem. Then, we efficiently solve the resulting convex optimization problem using standard techniques [136]. At the next iteration, we fix the other decision variable to arrive at the other convex optimization problem, and efficiently solve it. Once the performance cannot be improved (t ≥ 2 in Algorithm 4), a local optimum has been reached and Algorithm 4 stops. The decisions c_{i,j} and r_{i,j} are returned.

4.5 Workload amenable to pipelining

In this section, we consider workload that is amenable to pipelining, where data processing in a rack can start as soon as the rack starts to receive its designated data.

4.5.1 Problem formulation

When we pipeline the data reception and the data processing, rack i starts data processing as soon as it starts to receive data.
We start by showing that the completion time of rack i is equal to the maximum of the data transmission time max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}) and the data processing time (Σ_{k∈N} d_k r_{k,i}) / S_i, since it depends on whether the bottleneck is the CPU or the network.

Lemma 4.3. For workload that is amenable to pipelining, the completion time of rack i equals max[ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}), (Σ_{k∈N} d_k r_{k,i}) / S_i ].

Proof. See the Appendix.

From Lemma 4.3, it follows that the job completion time equals max_{i∈N} max[ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}), (Σ_{k∈N} d_k r_{k,i}) / S_i ].

Like before, we aim to minimize the job completion time by properly routing/distributing the workload among different racks for parallel processing and by properly augmenting wireless link capacity between racks to facilitate data transfers. We have the following optimization problem:

min_{r_{i,j}, c_{i,j}} max_{i∈N} max[ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}), (Σ_{k∈N} d_k r_{k,i}) / S_i ]
subject to: Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
Σ_{i,j∈N} c_{i,j} = C_w,
r_{i,j}, c_{i,j} ≥ 0, for all i, j ∈ N. (4.5)

It is not hard to see that Problem (4.5) can be equivalently stated as:

min_{r_{i,j}, c_{i,j}} max[ max_{i,j∈N, i≠j} d_j r_{j,i} / (C_f + c_{j,i}), max_{i∈N} (Σ_{k∈N} d_k r_{k,i}) / S_i ]
subject to: Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
Σ_{i,j∈N} c_{i,j} = C_w,
r_{i,j}, c_{i,j} ≥ 0, for all i, j ∈ N. (4.6)

Theorem 4.2. Problem (4.6) is a quasiconvex optimization problem.

Proof. Since d_j r_{j,i} / (C_f + c_{j,i}) is a linear-fractional function and (Σ_{k∈N} d_k r_{k,i}) / S_i is a linear function, both are quasiconvex. Since the maximum of quasiconvex functions is quasiconvex [136], the objective function of Problem (4.6) is quasiconvex. As a result, Problem (4.6) is a quasiconvex optimization problem.

We employ a bisection method [136] in the following subsection to solve the quasiconvex optimization Problem (4.6).
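The bisection idea can be illustrated on a hypothetical two-rack instance where only rack 1 offloads a fraction r of its data to rack 2 over a single link carrying the entire wireless budget C_w. For a trial completion time z, feasibility reduces to checking whether an interval for r is nonempty (a closed-form stand-in for the general convex feasibility test; all numbers are illustrative):

```python
# Bisection on z for a two-rack, pipelined instance of Problem (4.7).
# Rack 1 offloads a fraction r of d1 to rack 2; the link gets the whole
# wireless budget Cw. Illustrative toy numbers, not thesis data.

def feasible(z, d1, d2, S1, S2, Cf, Cw):
    """Is there r in [0, 1] meeting every completion-time constraint <= z?"""
    lo = max(0.0, 1.0 - z * S1 / d1)            # rack 1 processing constraint
    hi = min(1.0,
             z * (Cf + Cw) / d1,                # transfer-time constraint
             (z * S2 - d2) / d1)                # rack 2 processing constraint
    return lo <= hi

def bisect_completion_time(d1, d2, S1, S2, Cf, Cw, tol=1e-6):
    z_lo, z_hi = 0.0, max(d1 / S1, d2 / S2)     # no-distribution upper bound
    while z_hi - z_lo > tol:
        z = (z_lo + z_hi) / 2
        if feasible(z, d1, d2, S1, S2, Cf, Cw):
            z_hi = z                            # z is achievable: shrink down
        else:
            z_lo = z                            # z is too small: raise floor
    return z_hi

# Rack 1 is storage-heavy and slow; rack 2 is idle and fast.
z = bisect_completion_time(d1=12000, d2=0, S1=1000, S2=4000, Cf=100, Cw=900)
print(round(z, 3))   # ≈ 6.0 s, halving the 12 s no-distribution baseline
```

Here the optimum splits the data evenly (r = 0.5): rack 1's residual processing, the transfer, and rack 2's pipelined completion all finish at z = 6 s, mirroring how the full Algorithm 5 equalizes the binding constraints.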
4.5.2 Proposed algorithm

Problem (4.6) can be equivalently stated as:

min_{r_{i,j}, c_{i,j}, z} z
subject to: d_j r_{j,i} / (C_f + c_{j,i}) ≤ z, for all i, j ∈ N, i ≠ j,
(Σ_{k∈N} d_k r_{k,i}) / S_i ≤ z, for all i ∈ N,
Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
Σ_{i,j∈N} c_{i,j} = C_w,
r_{i,j}, c_{i,j} ≥ 0, for all i, j ∈ N. (4.7)

We apply Algorithm 5 to solve the quasiconvex optimization Problem (4.7).

Algorithm 5: Bisection Method
Input: d_i for all i ∈ N, S_i for all i ∈ N, C_f, C_w, and a tolerance > 0.
Output: z, r_{i,j} for all i, j ∈ N, and c_{i,j} for all i, j ∈ N.
1. Initialize z_upper = max_{i∈N} d_i / S_i.
2. Initialize z_lower = 0.
3. while true do
4.   Set z = (z_upper + z_lower) / 2.
5.   Solve the convex feasibility Problem (4.7) with z fixed.
6.   if the convex feasibility problem is feasible then
7.     z_upper = z
8.   else
9.     z_lower = z
10.  end if
11.  if |z_upper - z_lower| ≤ the tolerance then
12.    break
13.  end if
14. end while

For initialization purposes, the upper bound of Problem (4.7) is chosen as z_upper = max_{i∈N} d_i / S_i, which corresponds to the case where we do not perform any workload distribution, and the lower bound of Problem (4.7) is chosen as z_lower = 0. The idea of Algorithm 5 is to iteratively test whether or not the current value of z is feasible. We first set z = (z_upper + z_lower) / 2. If the convex feasibility Problem (4.7) (with fixed z) is feasible, it implies that we can still reduce z; therefore, we set z_upper = z. Otherwise, if it is not feasible, we set z_lower = z. We then update the value of z to (z_upper + z_lower) / 2 and run the feasibility test again. Once the difference between z_upper and z_lower is smaller than the predetermined tolerance [136], Algorithm 5 stops and returns the decisions r_{i,j} and c_{i,j}.

4.5.3 Special cases

Similar to before, we are interested in the subproblems with fixed wireless capacity augmentation and fixed workload routing.
When the wireless capacity augmentation c_{i,j} is fixed, Problem (4.6) becomes:

min_{r_{i,j}} max[ max_{i,j∈N, i≠j} d_j r_{j,i} / (C_f + c_{j,i}), max_{i∈N} (Σ_{k∈N} d_k r_{k,i}) / S_i ]
subject to: Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
r_{i,j} ≥ 0, for all i, j ∈ N. (4.8)

Note that Problem (4.8) can be transformed to a linear program.

When the workload routing variable r_{i,j} is fixed, Problem (4.6) becomes:

min_{c_{i,j}} max[ max_{i,j∈N, i≠j} d_j r_{j,i} / (C_f + c_{j,i}), max_{i∈N} (Σ_{k∈N} d_k r_{k,i}) / S_i ]
subject to: Σ_{i,j∈N} c_{i,j} = C_w,
c_{i,j} ≥ 0, for all i, j ∈ N. (4.9)

Note that Problem (4.9) is a convex optimization problem.

4.6 Network Cost for Data Transmission

In this section, we extend our proposed analytical framework to the case where there is a data transmission cost w_{i,j} ≥ 0 per unit of data routed from rack i to rack j. In particular, we minimize a weighted sum of the job completion time and the data transmission cost Σ_{i,j∈N, i≠j} w_{i,j} d_i r_{i,j} by optimizing the workload distribution and wireless capacity augmentation. The data transmission cost serves as a regularization term to regulate the amount of data transferred between racks.

4.6.1 Workload not amenable to pipelining

For workload that is not amenable to pipelining, Problem (4.1) is generalized to:

min_{r_{i,j}, c_{i,j}} max_{i∈N} [ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}) + (Σ_{k∈N} d_k r_{k,i}) / S_i ] + (weight) · Σ_{i,j∈N, i≠j} w_{i,j} d_i r_{i,j}
subject to: Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
Σ_{i,j∈N} c_{i,j} = C_w,
r_{i,j}, c_{i,j} ≥ 0, for all i, j ∈ N, (4.10)

where the nonnegative weight trades off the job completion time against the data transmission cost. Note that Problem (4.10) is still a biconvex optimization problem, which can be solved by using Algorithm 4.
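The alternating structure of Algorithm 4 can be illustrated on a minimal biconvex toy function (a hypothetical stand-in for Problem (4.10), chosen so each convex subproblem has a closed form): f(x, y) = (xy - 1)^2 + 0.1(x^2 + y^2) is convex in x for fixed y and in y for fixed x, but not jointly convex.

```python
# Minimal alternate convex search (ACS, in the spirit of Algorithm 4) on a
# toy biconvex function. Each half-step exactly minimizes a convex
# one-variable subproblem, so the objective is non-increasing.

def f(x, y):
    return (x * y - 1.0) ** 2 + 0.1 * (x ** 2 + y ** 2)

def argmin_x(y):
    # Setting df/dx = 2*y*(x*y - 1) + 0.2*x = 0 gives x = y / (y**2 + 0.1).
    return y / (y ** 2 + 0.1)

def acs(x=2.0, y=0.5, iters=50):
    values = [f(x, y)]
    for _ in range(iters):
        x = argmin_x(y)          # convex subproblem in x (y fixed)
        y = argmin_x(x)          # by symmetry, the same formula solves for y
        values.append(f(x, y))
    return x, y, values

x, y, values = acs()
# Monotone descent to a partial (local) optimum at x = y = sqrt(0.9).
assert all(a >= b - 1e-12 for a, b in zip(values, values[1:]))
print(round(values[-1], 4))      # ≈ 0.19
```

As with Algorithm 4, the guarantee is monotone improvement to a partial optimum, not global optimality; for the actual Problems (4.4) and (4.10) each half-step would be a convex program solved numerically rather than in closed form.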
4.6.2 Workload amenable to pipelining

For workload that is amenable to pipelining, Problem (4.5) is generalized to:

min_{r_{i,j}, c_{i,j}} max_{i∈N} max[ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}), (Σ_{k∈N} d_k r_{k,i}) / S_i ] + (weight) · Σ_{i∈N} Σ_{j∈N, j≠i} w_{i,j} d_j r_{j,i}
subject to: Σ_{j∈N} r_{i,j} = 1, for all i ∈ N,
Σ_{i,j∈N} c_{i,j} = C_w,
r_{i,j}, c_{i,j} ≥ 0, for all i, j ∈ N. (4.11)

Note that the sum of a quasiconvex function (even a linear-fractional function) and a linear function is, in general, not quasiconvex [137]. Thus, we cannot use Algorithm 5 to solve Problem (4.11). However, it is not hard to see that Problem (4.11) is a biconvex optimization problem, which can be solved by using Algorithm 4.

4.7 Simulation results

We assume a fat-tree [121] datacenter network topology. By the non-blocking property of fat-tree [134], the fixed wired link capacity (C_f) between

[Figure 4.2: System performance under different values of wired link capacity C_f (0.6 ≤ C_f/C_w^avg ≤ 6). Panels: (a) job completion time, (b) gain from pipelining, (c) gain from joint optimization. Curves: Non-Pipeline/Pipeline under Optimal, Uniform, and C_w = 0 augmentation, plus No Transmission (r_{i,i} = 1).]
[Figure 4.3: System performance under different values of total wireless bandwidth C_w and wired link capacity C_f (C_f/C_w^avg = 3). Panels: (a) job completion time, (b) gain from pipelining, (c) gain from joint optimization.]

Table 4.1: The values of c_{i,j} and r_{i,j} (C_f = 100 Mbps)

Non-pipeline-amenable workload:
  c_{i,j} (Mbps): row 1: 0, 0, 0; row 2: 0, 0, 0; row 3: 589.7, 410.3, 0
  r_{i,j}:        row 1: 1, 0, 0; row 2: 0, 1, 0; row 3: 0.28, 0.11, 0.61

Pipeline-amenable workload:
  c_{i,j} (Mbps): row 1: 0, 0, 0; row 2: 0, 0, 0; row 3: 771.6, 228.4, 0
  r_{i,j}:        row 1: 1, 0, 0; row 2: 0, 1, 0; row 3: 0.4, 0.15, 0.45

any two racks is equal to the link capacity between a rack and its corresponding edge router. Consider a job whose workload is stored and will be processed among N = 3 racks (realistic large-scale cases will be considered in Section 4.7.5). Depending on the scenarios that we are interested in, the value of the wired link capacity between any two racks, C_f, varies between 100 and 1000 Mbps. The value of the total bandwidth for wireless link augmentation, C_w, varies between 100 and 1000 Mbps, to be shared among the 2·(N choose 2) = 2·(3 choose 2) = 6 possible directed links between pairs of racks. The service rate S_i available in rack i to process the data of the job varies between 1 and 10 Gbps. The workload of the job in rack i, d_i, varies between 5000 and 15000 Mb.
We use the following legends to label the various algorithms. For non-pipeline-amenable workload, we use "Non-Pipeline (Optimal)" to refer to the ACS Algorithm 4 that jointly optimizes the workload distribution and wireless capacity augmentation (Problem (4.1)). Let C_w^avg := C_w / (2·(N choose 2)) be the "average" wireless capacity augmentation that each directed link between a pair of racks would get if C_w were shared equally among all pairs. We use "Non-Pipeline (Uniform)" to refer to the case with uniform wireless capacity augmentation c_{i,j} = C_w^avg, i, j ∈ N, i ≠ j, in which we only optimize the workload distribution (Problem (4.2)). We use "Non-Pipeline (C_w = 0)" to refer to the case without wireless augmentation (Problem (4.2) with C_w = 0). For pipeline-amenable workload, we use "Pipeline (Optimal)" to refer to the bisection Algorithm 5 that jointly optimizes r_{i,j} and c_{i,j} (Problem (4.5)). We use "Pipeline (Uniform)" to refer to the case with uniform wireless capacity augmentation c_{i,j} = C_w^avg (Problem (4.8)). We use "Pipeline (C_w = 0)" to refer to the case without wireless augmentation (Problem (4.8) with C_w = 0). Last, we use "No Transmission (r_{i,i} = 1)" to refer to the case without any on-demand workload distribution, i.e., r_{i,i} = 1, i ∈ N.

4.7.1 Varying wired link capacity

In Fig. 4.2, we assume N = 3 with workload d_1 = 5000 Mb, d_2 = 10000 Mb, and d_3 = 15000 Mb. The service rates are S_1 = 3000 Mbps, S_2 = 2000 Mbps, and S_3 = 1000 Mbps. The total wireless bandwidth is C_w = 1000 Mbps. The wired link capacity C_f between any two racks varies from 100 to 1000 Mbps. In Table 4.1, we show the wireless capacity augmentation c_{i,j} and the workload distribution r_{i,j} for the case with C_f = 100 Mbps.
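As a sanity check on the non-pipeline-amenable entries of Table 4.1, the following sketch plugs the tabulated c_{i,j} and r_{i,j} into the per-rack completion-time expression of Problem (4.1). The near-equal per-rack times (all about 9.1 s, versus the 15 s no-transmission baseline) are exactly what one expects from a min-max optimum:

```python
# Plug the non-pipeline-amenable decisions of Table 4.1 (Cf = 100 Mbps) into
# the per-rack completion time of Problem (4.1):
#   max_{j != i} d_j r_{j,i} / (Cf + c_{j,i}) + sum_k d_k r_{k,i} / S_i.

d = [5000, 10000, 15000]                          # Mb
S = [3000, 2000, 1000]                            # Mb/s
Cf = 100.0                                        # Mb/s wired capacity
c = [[0, 0, 0], [0, 0, 0], [589.7, 410.3, 0]]     # Table 4.1 augmentation
r = [[1, 0, 0], [0, 1, 0], [0.28, 0.11, 0.61]]    # Table 4.1 routing

times = []
for i in range(3):
    trans = max((d[j] * r[j][i] / (Cf + c[j][i])
                 for j in range(3) if j != i), default=0.0)
    proc = sum(d[k] * r[k][i] for k in range(3)) / S[i]
    times.append(trans + proc)

baseline = max(d[i] / S[i] for i in range(3))     # 15 s without transmission
print([round(t, 2) for t in times], baseline)
```

Since the job completion time is the maximum of the per-rack times, an optimal solution leaves no slack at any rack; the small spread across the three values reflects the rounding of the tabulated decisions.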
Since rack 3 stores the most data but has the least computational resources available, while rack 1 stores the least data but has the most computational resources available, for the non-pipeline-amenable workload we augment the wireless link capacities c_{3,1} = 589.7 Mbps and c_{3,2} = 410.3 Mbps, and route an r_{3,1} = 0.28 (resp. r_{3,2} = 0.11) portion of d_3 from rack 3 to rack 1 (resp. rack 2) for parallel data processing. A similar augmentation and routing pattern can be observed for the pipeline-amenable workload case.

As shown in Fig. 4.2a, as expected, the job completion time decreases as C_f increases for both non-pipeline-amenable and pipeline-amenable workload, and the job completion time for pipeline-amenable workload is less than that for non-pipeline-amenable workload. Note that the smallest possible job completion time is reached by Pipeline (Optimal), Pipeline (Uniform), and Pipeline (C_w = 0) at C_f = 500, 900, and 1000 Mbps, respectively.

Fig. 4.2b shows the performance gain from pipelining. We plot the gain, defined as one minus the ratio of the job completion time with pipelining to that without pipelining, under optimal augmentation (optimizing both capacity augmentation and workload distribution), uniform augmentation (optimizing workload distribution only), and no augmentation (optimizing workload distribution only), respectively. For example, when C_f = 100 Mbps, the gain from pipelining is about 26% under optimal augmentation, about 13% under uniform augmentation, and about 5% under no augmentation. Note that the gain from pipelining begins to decrease once the point at which the smallest possible job completion time is achieved is reached, i.e., C_f = 500 and 900 Mbps for Pipeline (Optimal) and Pipeline (Uniform), respectively. Fig. 4.2c shows the performance gain from joint optimization.
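The gain metric used throughout these figures reduces to a one-line helper. The completion times in the example call are hypothetical, chosen only to mirror the roughly 26% pipelining gain quoted for C_f = 100 Mbps, and are not taken from Fig. 4.2:

```python
def pipelining_gain(t_pipeline, t_non_pipeline):
    # Gain = 1 - (completion time with pipelining) / (without pipelining).
    return 1.0 - t_pipeline / t_non_pipeline

# Hypothetical completion times in seconds, for illustration only.
print(round(pipelining_gain(7.4, 10.0), 2))  # 0.26, i.e., a 26% gain
```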
For both non-pipeline-amenable and pipeline-amenable workload, we plot the gain, defined as one minus the ratio of the job completion time with optimal augmentation to that with uniform augmentation (or without augmentation). In general, the gain decreases as the wired link capacity C_f increases, since the data transmission time becomes less of a bottleneck. The gains for pipeline-amenable workload are more prominent (about 45% and 30% at C_f = 100 Mbps), while those for non-pipeline-amenable workload are about 30% and 18% at C_f = 100 Mbps.

4.7.2 Varying total wireless bandwidth and wired link capacity

In Fig. 4.3, we assume N = 3 with workload d_1 = 5000 Mb, d_2 = 10000 Mb, and d_3 = 15000 Mb. The service rates are S_1 = 3000 Mbps, S_2 = 2000 Mbps, and S_3 = 1000 Mbps. The total wireless bandwidth for augmentation C_w varies from 100 to 1000 Mbps.

Figure 4.4: System performance under different values of service rate S_i and workload size d_i. (a) Varying S_i only; (b) varying S_i and d_i (the ratio d_i/S_i is kept fixed).

The wired link capacity C_f between any two racks also scales up such that the ratio C_f / C_w^avg is fixed at 3. As expected, in Fig. 4.3a the job completion time decreases as C_w increases for both non-pipeline-amenable and pipeline-amenable workload. Fig. 4.3b shows the performance gain from pipelining, which is most prominent for the case that allows a joint optimization of wireless capacity augmentation and workload distribution. From Fig.
4.3c, we can see that the performance gain from joint optimization is more significant when C_w is large.

4.7.3 Varying service rate

In Fig. 4.4a, we assume N = 3 with workload d_1 = 5000 Mb, d_2 = 10000 Mb, and d_3 = 15000 Mb. The wired link capacity between any two racks is C_f = 500 Mbps, and the total wireless bandwidth for augmentation is C_w = 1000 Mbps (C_f / C_w^avg = 3). The service rates are S_1 = 3000α Mbps, S_2 = 2000α Mbps, and S_3 = 1000α Mbps, where the scale factor α varies from 1 to 10. Fig. 4.4a shows that the job completion times decrease and converge (to the black curve without workload distribution) as the service rates increase (α varies from 1 to 10). When the service rates are sufficiently large (with respect to the link capacity), data processing tends to be performed locally, and the impact of workload distribution on the job completion time becomes less significant.

4.7.4 Varying service rate and workload size

In Fig. 4.4b, we assume N = 3 with workload d_1 = 5000α Mb, d_2 = 10000α Mb, and d_3 = 15000α Mb, and service rates S_1 = 3000α Mbps, S_2 = 2000α Mbps, and S_3 = 1000α Mbps. The value of α varies from 1 to 10. As a result, as α increases, both the workload size and the service rate in rack i increase, but the ratio d_i/S_i is kept fixed.

Figure 4.5: Non-pipeline-amenable workload (N = 32). (a) The values of c_{i,j}; (b) the values of r_{i,j}.

Figure 4.6: Pipeline-amenable workload (N = 32). (a) The values of c_{i,j}; (b) the values of r_{i,j}.
The wired link capacity between any two racks is C_f = 500 Mbps, and the total wireless bandwidth for augmentation is C_w = 1000 Mbps (C_f / C_w^avg = 3). In Fig. 4.4b, we observe that the job completion times increase when the service rate and the data size increase simultaneously at the same rate (α varies from 1 to 10). The reason is that while the data processing time remains the same, i.e., Σ_{k∈N} (α d_k) r_{k,i} / (α S_i) = Σ_{k∈N} d_k r_{k,i} / S_i, the data transmission time increases as the data size increases, since the wired/wireless link capacity does not scale up, i.e., max_{j∈N, j≠i} (α d_j) r_{j,i} / (C_f + c_{j,i}) = α · max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}). On the other hand, we observe a similar convergence behavior in which the job completion times approach the black curve (the case without workload distribution), indicating that the impact of workload distribution is less significant when the service rates are sufficiently larger than the wired/wireless link capacity.

Figure 4.7: System performance under different numbers of racks N (C_w is a constant, 1.2 ≤ C_f / C_w^avg ≤ 403.2). (a) Job completion time; (b) gain from pipelining; (c) gain from joint optimization.
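The scaling argument above, namely that the processing time Σ_k (α d_k) r_{k,i} / (α S_i) is independent of α while the transmission time (α d_j) r_{j,i} / (C_f + c_{j,i}) grows linearly in α, can be checked numerically. The sketch below uses rack 1 with the pipeline-amenable values of Table 4.1 purely for illustration (that table was computed at C_f = 100 Mbps, a different operating point than Fig. 4.4b):

```python
d = [5000, 10000, 15000]   # workload per rack, Mb
S = [3000, 2000, 1000]     # service rate per rack, Mbps
r = [[1, 0, 0], [0, 1, 0], [0.4, 0.15, 0.45]]       # r_{i,j}
c = [[0, 0, 0], [0, 0, 0], [771.6, 228.4, 0]]       # c_{i,j}, Mbps
C_f = 100                  # wired link capacity, Mbps
i = 0                      # rack 1 (0-indexed)

def processing_time(alpha):
    # sum_k (alpha d_k) r_{k,i} / (alpha S_i): the alphas cancel.
    return sum(alpha * d[k] * r[k][i] for k in range(3)) / (alpha * S[i])

def transmission_time(alpha):
    # max_{j != i} (alpha d_j) r_{j,i} / (C_f + c_{j,i}): linear in alpha.
    return max(alpha * d[j] * r[j][i] / (C_f + c[j][i])
               for j in range(3) if j != i)

for alpha in (1, 2, 5, 10):
    # Processing time is alpha-invariant; transmission time scales with alpha.
    assert abs(processing_time(alpha) - processing_time(1)) < 1e-9
    assert abs(transmission_time(alpha) - alpha * transmission_time(1)) < 1e-6
print(processing_time(1), transmission_time(1))
```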
Figure 4.8: System performance under different numbers of racks N (C_w scales with N such that C_w^avg = C_f). (a) Job completion time; (b) gain from pipelining; (c) gain from joint optimization.

Figure 4.9: Pareto-optimal curve for the job completion time and the amount of data transfer. (a) Non-pipeline-amenable; (b) pipeline-amenable.

4.7.5 Varying the number of racks

We vary the number of racks N ∈ {4, 8, 16, 32, 64}. For a fixed value of N, the workload is assumed to be d_i = 5000i Mb, i = 1, ..., N, and the service rate is assumed to be S_i = 1000(N − i + 1) Mbps, i = 1, ..., N. The wired link capacity between any two racks is C_f = 100 Mbps, and the total wireless bandwidth is C_w = 1000 Mbps. In Figs. 4.5 and 4.6, we show the wireless capacity augmentation c_{i,j} and the workload distribution r_{i,j} for the case with N = 32. Note that rack 1 stores the least data but has the most computational resources available, while rack 32 stores the most data but has the least computational resources available. For both non-pipeline-amenable and pipeline-amenable workload, we observe that most capacity augmentation and workload routing occur from racks with large indexes to racks with small indexes. We observe from Fig.
4.7a that, as the system scales up, jointly optimizing the workload distribution and the wireless capacity augmentation is particularly useful for pipeline-amenable workload. Note that since Pipeline (Uniform) and Pipeline (C_w = 0) have similar performance, uniform augmentation does not help remove hot spots at all. Figs. 4.7b and 4.7c show the gains from pipelining and from joint optimization, respectively. We can see again that as the system scales up, the performance with uniform augmentation is similar to that without augmentation, emphasizing the need for joint optimization. Fig. 4.8 shows a similar setup, except that we also scale up the total wireless bandwidth C_w such that C_w^avg = C_f. As the system scales up, the gains from both pipelining and joint optimization first increase and then become stable. Last, we note that real-world workload characteristics lie between the two extremes (being non-pipeline-amenable or pipeline-amenable).

4.7.6 Job completion time-data transfer trade-off

In Fig. 4.9, we consider the case where there is a data transmission cost w_{i,j} = 1 per unit of data routed from rack i to rack j. We assume N = 3 with workload d_1 = 5000 Mb, d_2 = 10000 Mb, and d_3 = 15000 Mb. The service rates are S_1 = 3000 Mbps, S_2 = 2000 Mbps, and S_3 = 1000 Mbps. The wired link capacity between any two racks is C_f = 500 Mbps, and the total wireless bandwidth is C_w = 1000 Mbps. Note that the data transmission cost serves as a regularization term to regulate the amount of data transferred between racks. We vary the weight between the job completion time and the data transmission cost from 0 to 0.002. The Pareto-optimal curves for the non-pipeline-amenable (Problem (4.10)) and the pipeline-amenable workload case (Problem (4.11)) are shown in Figs. 4.9a and 4.9b, respectively. We observe that there is a knee point (about 7000 Mb for the non-pipeline-amenable case and 9500 Mb for the pipeline-amenable case) above which additional data transmission helps little.
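The regularized objective behind Fig. 4.9 is a scalarization: job completion time plus a weight λ times the total transmission cost Σ_{i≠j} w_{i,j} d_i r_{i,j}, with w_{i,j} = 1 and λ swept from 0 to 0.002. The helper below only evaluates that objective for a given solution; it is not the solver for Problems (4.10)/(4.11), and the routing matrix and completion time in the example are borrowed from Table 4.1 and hypothetical, respectively:

```python
def data_transfer(d, r):
    # Total data routed between distinct racks (Mb), with w_{i,j} = 1.
    n = len(d)
    return sum(d[i] * r[i][j] for i in range(n) for j in range(n) if i != j)

def scalarized_objective(T, d, r, lam):
    # Job completion time plus lambda-weighted transmission cost.
    return T + lam * data_transfer(d, r)

d = [5000, 10000, 15000]                        # Mb
r = [[1, 0, 0], [0, 1, 0], [0.28, 0.11, 0.61]]  # illustrative routing (Table 4.1)
print(data_transfer(d, r))                      # about 5850 Mb, all off rack 3
print(scalarized_objective(10.0, d, r, 0.002))  # hypothetical T = 10 s
```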
The job completion time starts to level off, reaching the minimum job completion time.

4.8 Conclusion

In this work, we proposed a novel analytical framework and algorithms to jointly optimize the dynamic workload distribution and wireless link capacity augmentation in hybrid data center networks for servicing distributed and parallel computing jobs. With extensive simulation studies, we showed that the gain from joint optimization is substantial: as large as 30% for workload that is not amenable to pipelining and 45% for workload that is amenable to pipelining. For real-world workload, in which data reception and data processing can be pipelined to some degree but not entirely, the gain lies between these two extremes. Last, we explored the trade-off between the job completion time and the amount of workload transferred between racks.

4.9 Appendix

Fix rack i. Suppose that the data transmissions from racks j ∈ N\{i} to rack i start at the same time, and let t_j ≜ d_j r_{j,i} / (C_f + c_{j,i}), j ∈ N\{i}. We sort and re-index t_j, j ∈ N\{i}, such that t_{i_1} ≤ t_{i_2} ≤ ... ≤ t_{i_{N-1}}. In addition, let t̃ ≜ Σ_{k∈N} d_k r_{k,i} / S_i. We have two different cases. First, if t̃ ≥ t_{i_{N-1}}, the data processing at rack i finishes at time t̃, since the data reception rate Σ_{j=1}^{N-1} (C_f + c_{i_j,i}) 1{t ≤ t_{i_j}}, where 1{·} is the indicator function, is non-increasing in t, and rack i works at full rate S_i all the time (until time t̃). See Fig. 4.10 for an example. Second, if t̃ < t_{i_{N-1}}, the data processing at rack i finishes at time t_{i_{N-1}}. Note that rack i works at full rate S_i until some time t̂ (< t̃) and continues to work at the non-full rate Σ_{j=1}^{N-1} (C_f + c_{i_j,i}) 1{t ≤ t_{i_j}} (< S_i) from time t̂ to time t_{i_{N-1}}. The value t̂ satisfies

S_i t̂ = d_i r_{i,i} + ∫_0^{t̂} Σ_{j=1}^{N-1} (C_f + c_{i_j,i}) 1{t ≤ t_{i_j}} dt.   (4.12)

See Fig. 4.11 for an example. Combining the two cases, we conclude that the completion time of rack i is max{t_{i_{N-1}}, t̃} = max[ max_{j∈N, j≠i} d_j r_{j,i} / (C_f + c_{j,i}), Σ_{k∈N} d_k r_{k,i} / S_i ].
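The closed-form completion time derived in this appendix, max[max_{j≠i} d_j r_{j,i}/(C_f + c_{j,i}), Σ_k d_k r_{k,i}/S_i] per rack, with the job finishing when the slowest rack does, is direct to implement. The sketch below evaluates it on the pipeline-amenable solution of Table 4.1 (C_f = 100 Mbps), which is a solution to the pipelined model this formula describes; the function names are illustrative:

```python
def rack_completion_time(i, d, r, c, S, C_f):
    # max{ t_{i_{N-1}}, t_tilde }: the slowest incoming transmission
    # vs. the total processing time at rack i.
    n = len(d)
    transmission = max((d[j] * r[j][i] / (C_f + c[j][i])
                        for j in range(n) if j != i), default=0.0)
    processing = sum(d[k] * r[k][i] for k in range(n)) / S[i]
    return max(transmission, processing)

def job_completion_time(d, r, c, S, C_f):
    # The job finishes when the last rack finishes.
    return max(rack_completion_time(i, d, r, c, S, C_f)
               for i in range(len(d)))

# Pipeline-amenable solution from Table 4.1 (C_f = 100 Mbps).
d = [5000, 10000, 15000]   # Mb
S = [3000, 2000, 1000]     # Mbps
r = [[1, 0, 0], [0, 1, 0], [0.4, 0.15, 0.45]]
c = [[0, 0, 0], [0, 0, 0], [771.6, 228.4, 0]]
print(round(job_completion_time(d, r, c, S, C_f=100), 2))  # 6.88 s
```

Per-rack values come out nearly balanced (about 6.88, 6.85, and 6.75 s), which is what one would expect from a jointly optimized c_{i,j} and r_{i,j}.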
Figure 4.10: An example with N = 5 and t̃ ≥ t_{i_{N-1}}.

Figure 4.11: An example with N = 5 and t̃ < t_{i_{N-1}}.
Asset Metadata
Creator: Ao, Weng Chon (author)
Core Title: Using formal optimization techniques to improve the performance of mobile and data center networks
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 10/23/2018
Defense Date: 10/06/2017
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: approximation algorithm, biconvex optimization, caching, cloud radio access network, coordinated multipoint, data center, distributed and parallel computing, diversity, heterogeneous cellular networks, hybrid datacenter networks, load balancing, massive MIMO, mobility, multi-user MIMO, OAI-PMH Harvest, on-demand workload distribution, online algorithm, randomized algorithm, spatial multiplexing, user association, user grouping, wireless capacity augmentation
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Psounis, Konstantinos (committee chair), Krishnamachari, Bhaskar (committee member), Savla, Ketan (committee member)
Creator Email: jonathanwcao@gmail.com, wao@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-445098
Unique Identifier: UC11263225
Identifier: etd-AoWengChon-5852.pdf (filename), usctheses-c40-445098 (legacy record id)
Legacy Identifier: etd-AoWengChon-5852.pdf
Dmrecord: 445098
Document Type: Dissertation
Rights: Ao, Weng Chon
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA