MACHINE LEARNING FOR EFFICIENT NETWORK MANAGEMENT

by

Angelos Lazaris

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2020

Copyright 2020 Angelos Lazaris

Dedication

To my family ...

Acknowledgments

First of all, I would like to express my deepest appreciation to my advisor Professor Viktor K. Prasanna for believing in me, and for giving me the opportunity to work in the exciting field of Data Science and its applications to computer networking. His mentorship throughout this journey paved the path towards academic success. Without him, this work would not have been possible. I would also like to thank Professor Cauligi Raghavendra and Professor Jyotirmoy V. Deshmukh for serving on my dissertation committee and for providing their insightful feedback on how to improve the quality of this work.

During this journey, I also had the opportunity to work with other very talented researchers such as Professor Rajgopal Kannan and Dr. Ajitesh Srivastava, as well as Professor Minlan Yu, Professor Li Erran Li, Professor Y. Richard Yang, and Dr. Xin Huang. Our interactions definitely improved me as a researcher and strengthened my understanding of various areas of computer engineering, and I am deeply thankful for that. I would also like to thank all the members of the Data Science lab at USC for their valuable feedback on this work, as well as Diane Demetras for patiently answering all my graduate school questions and smoothly handling all the PhD program logistics. Last but not least, I would like to thank my family and especially my wife Inga for her unconditional support that made this thesis possible.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
   1.1 Network Management
   1.2 Active vs. Passive Measurement
   1.3 Measurement in Legacy Networks
   1.4 Software-Defined Networking (SDN)
   1.5 Measurement in SDN
   1.6 Knowledge-Defined Networking (KDN)
   1.7 Applications of Traffic Measurement & Prediction
   1.8 Challenges in Network Traffic Measurement
   1.9 Thesis Contributions
   1.10 Thesis Organization
2 Background
   2.1 Flow Model
   2.2 Prefix Measurement in SDN
   2.3 Traffic Matrix Estimation
   2.4 Measurement Task Model
   2.5 Major Notation
3 An LSTM Framework For Modeling Network Traffic
   3.1 Introduction
   3.2 Background & Motivation
   3.3 The CAIDA Dataset
   3.4 A Big Data Framework for Network Traffic Processing
   3.5 Methodology
      3.5.1 Long Short-Term Memory Model
      3.5.2 Time Series Clustering for Model Selection
      3.5.3 Data Transformations
   3.6 Evaluation
   3.7 Prediction Horizon
      3.7.1 Large Scale Experiment (CAIDA)
      3.7.2 Small Scale Experiment (Mininet)
   3.8 Related Work
   3.9 Conclusion
4 Link-Level Network Traffic Modeling
   4.1 Introduction
   4.2 Background & Motivation
      4.2.1 Link Throughput Time Series
      4.2.2 Auto-Regressive Time Series Modeling
   4.3 Aggregated Network Traffic Analysis
   4.4 Methodology
   4.5 Evaluation
   4.6 Related Work
   4.7 Conclusion
5 Prediction-Assisted Software-Defined Measurement
   5.1 Introduction
   5.2 Background & Motivation
      5.2.1 Measurements in Software-Defined Networks
   5.3 DeepFlow Design Overview
   5.4 Prediction-Assisted Measurement
      5.4.1 Active Flow Prefix Detection
      5.4.2 Optimizing the Measurement Location
      5.4.3 Using Predictions to Enhance Measurement
   5.5 Evaluation
   5.6 Related Work
   5.7 Conclusion
6 Measurement Orchestration Using Reinforcement Learning
   6.1 Introduction
   6.2 Background and Motivation
      6.2.1 Markov Decision Processes (MDP)
      6.2.2 Q-Learning
      6.2.3 Deep Reinforcement Learning
      6.2.4 Measurement Task
   6.3 Measurement Delay
   6.4 The DRLFlow Architecture
   6.5 Evaluation
   6.6 Discussion
   6.7 Related Work
   6.8 Conclusion
7 Conclusion and Future Work
   7.1 Summary of Contributions
   7.2 Future Work
      7.2.1 Network Traffic Modeling
      7.2.2 Software-Defined Measurement
Reference List

List of Tables

2.1 Notation
3.1 Features for network time series clustering.
5.1 DeepFlow latency breakdown.

List of Figures

1.1 Examples of NetFlow and SNMP measurement architectures.
1.2 The overall SDN architecture [1].
1.3 SDN measurement architecture.
1.4 The Knowledge-Defined Networking workflow [2].
1.5 Applications of network traffic measurement.
2.1 An example of a source IP prefix trie. The highlighted nodes correspond to measurement rules, and above each rule is its associated priority.
3.1 Network traffic time series from a Tier-1 ISP link aggregated over 1 second measurement intervals.
3.2 CAIDA network trace histograms for packet size and aggregate volume over 1 second periods.
3.3 Autocorrelation Coefficients (ACFs) for various lags for the trace in Figure 3.1. The shaded area corresponds to statistically non-significant ACF values.
3.4 Scalable flow aggregation for multiple mask sizes (prefixes) and multiple time epochs using a sequence of SQL unions that can be processed in parallel by a big data SQL platform.
3.5 Architecture of the data collection and processing pipeline.
3.6 Architecture of an LSTM unit that connects to other units of a recurrent neural network.
3.7 Average MAPE for various mask sizes and epoch durations.
3.8 Average MAPE for each of the 20 clusters used for CLSTM and CDLSTM.
3.9 Average MAPE for various prediction horizons for CLSTM.
3.10 Average MAPE for various prediction horizons for CDLSTM.
3.11 VLSTM predictions for the size of a /32 flow from Mininet simulation (epoch duration = 5 seconds).
3.12 VLSTM average MAPEs for various prediction horizons across the /32 flows of the Mininet simulation (epoch duration = 5 seconds).
4.1 SDN measurement and prediction architecture.
4.2 The Google B4 [3] network topology (black) with attached ingress (yellow) and egress (orange) switches.
4.3 The B4 network topology (black) with attached ingress (yellow) and egress (orange) nodes and their respective senders/receivers.
4.4 Sample link time series for various mask sizes and measurement epochs.
4.5 Average Autocorrelation Coefficients (ACFs) for various lags of the aggregated trace for various mask sizes and measurement epochs.
4.6 Average MAPE for each model across various mask sizes and epoch durations.
5.1 An example of a source IP prefix trie with three potential measurement architectures that provide various granularities. When predictions are used, the granularity level can be increased for the same TCAM utilization.
5.2 DeepFlow architecture components.
5.3 An example of the set of rules at a switch, represented as shaded areas in a two-dimensional plane. The red (dashed) lines in R5 correspond to the rule splitting AFPDA performs in order to detect the two active flows in the area of R5 (i.e. the two circles).
5.4 A finite state machine representation of the states of a given flow prefix that is used by the Prediction Assisted Measurement Algorithm (PAMA).
5.5 Percentiles for the flow duration and flow size of the CAIDA dataset.
5.6 AFPDA statistics for a 2D parsing of an artificial dataset to assess where significant flows are located.
5.7 The number of measurements conducted by the 2D AFPDA for an artificial dataset for various thresholds and mask sizes.
5.8 AFPDA statistics for 1D parsing of the CAIDA dataset for various significance thresholds θ and maximum mask sizes d_s (i.e. flow resolution).
5.9 Flow coverage gain achieved due to predictions for various regression window sizes and prediction horizons.
6.1 The interaction of an agent with its environment in a Markov Decision Process.
6.2 Average task completion rate for various available TCAM sizes per switch.

Abstract

Providing fine-grained traffic measurement at short time scales is crucial for many network management and optimization tasks such as traffic engineering, anomaly detection, load balancing, network accounting, network analytics, and traffic matrix estimation. Software-Defined Networks (SDN) can potentially enable fine-grained measurement by providing statistics for each forwarding rule of OpenFlow-enabled switches. However, TCAMs have limited size due to their high cost and power consumption, thus allowing only a fraction of the total number of active flows to be measured. In addition, in proactive SDN deployments, the active flows are not known a priori to the controller, which creates additional challenges for the measurement process. As a consequence, fine-grained flow measurement in SDN is not directly possible at scale, and more efficient mechanisms are needed in order to achieve it. In this thesis, we answer the following three important questions: 1) is it possible to know all the major flows in the network even if we do not have enough available measurement resources, 2) is it possible to model network flows efficiently at short time scales (i.e., < 1 minute) and forecast their behavior such that we reduce the need for exact measurement, and 3) can we optimize the measurement resources appropriately such that we achieve higher measurement resolution? To answer these questions, we first develop a framework that detects the most active source and destination prefixes in the network.
We then develop various Long Short-Term Memory (LSTM) models that can be used to provide short-term flow-rate predictions at microflow, megaflow, and link-level granularities, and propose a clustering framework that can be used to model similar time series together. By leveraging the prediction effectiveness and scalability of the proposed LSTMs, we develop a measurement framework that uses flow predictions in order to increase the measurement resolution of SDN. Finally, we develop a Deep Reinforcement Learning mechanism that optimizes the measurement parameters such that the flow resolution is maximized given a set of resource constraints.

Chapter 1

Introduction

In recent years, technological advances and the increased availability of Internet connectivity have created a totally different network traffic landscape compared to more than a decade ago. According to the 2019 Digital Trends report by HootSuite [4], the number of Internet users in 2019 increased by 9.1% compared to 2018, corresponding to 57% of the global population. From this pool of users, it was estimated that 92% stream videos online and 58% stream TV content online, with the average Internet use time across the globe being more than 6.5 hours. As these numbers suggest, the world is becoming more and more interconnected, and network traffic is also rapidly growing as a result of the user growth.

Managing such rapidly growing networks is not a trivial task. Network management relies on traffic measurements in order to enable a variety of tasks such as Traffic Engineering (TE), anomaly detection, load balancing, network analytics, and more. However, traffic measurements cannot be obtained easily at scale due to the overheads imposed on the network equipment in terms of CPU, available memory at the switches, or the overall control overhead.
In this thesis, we focus on the problem of network traffic measurement, and we propose methods that can be used to increase visibility into the network traffic by a) developing scalable models to predict network flow rates, b) using these prediction models to substitute frequent measurements with accurate predictions, and c) developing a measurement orchestration mechanism that optimizes the measurement parameters depending on the available resources in the network.

1.1 Network Management

Network management can be defined as "the process that includes the deployment, integration, and coordination of the hardware, software, and human elements to monitor, test, poll, configure, analyze, evaluate, and control the network and element resources to meet the real-time, operational performance, and Quality of Service (QoS) requirements at a reasonable cost" [5]. In this work, we focus on the monitoring aspect of network management, which is achieved through traffic measurement and which can provide in-depth visibility into the network in real time at reasonable cost and improve network control decisions.

1.2 Active vs. Passive Measurement

There are two main measurement mechanisms in networking, depending on the way the measurement data are generated. The first one is known as active measurement, and it is achieved using test packets injected into the network in order to measure a quantity of interest such as the available bandwidth [6], the one-way packet delay, the round-trip delay, the packet loss, or the jitter, or even to determine the network topology [7]. Common tools for such measurements include ICMP ping and traceroute, but various custom packet-probing mechanisms have also been developed, depending on the application.
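To make the active-measurement idea concrete, the sketch below estimates round-trip delay by timing a TCP handshake. This is an illustrative stand-in rather than one of the tools above: real probes such as ping use ICMP, which requires raw-socket privileges, so the TCP-based timing (and the default port) is an assumption of this sketch.

```python
import socket
import time

def probe_rtt(host: str, port: int = 80, timeout: float = 2.0) -> float:
    """Estimate the round-trip delay (in seconds) to `host` by timing
    how long a TCP three-way handshake takes to complete.

    Unlike ICMP ping, this needs no special privileges, but it measures
    the handshake (roughly one RTT) rather than an ICMP echo.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start
```

A sequence of such probes taken over time yields exactly the kind of delay time series that active measurement frameworks turn into jitter and loss estimates.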
The second measurement methodology is known as passive measurement, and it corresponds to the collection of various statistics by tapping the network and capturing packets, along with their corresponding timestamps, in a non-intrusive way [8, 9]. This way, there is no interference with the network traffic, since no probing packets are injected. In addition, the collection of packet-level logs can provide more granular information about the network traffic.

1.3 Measurement in Legacy Networks

In legacy networks, passive measurement is used more often to gather in-depth data about the network. There are two widely used mechanisms for enabling passive traffic measurement [9]: a) the use of specialized flow-level protocols such as NetFlow [10] and sFlow [11], and b) the use of applications built on top of the Simple Network Management Protocol (SNMP) that can provide link-level aggregated measurements [12, 13]. In the case of flow-level protocols, a sampling methodology is used, since the monitoring mechanisms cannot keep up with the high speeds of modern networks or with the large number of concurrent flows, especially in backbone and data center networks. However, sampling can degrade the quality of the resulting measurements or even miss important flows completely. In addition, the time resolution of NetFlow is limited to 5 minutes (lowest) in the vast majority of its implementations, thus not allowing for more short-term decision making. An example of a NetFlow measurement architecture is shown in Figure 1.1(a), from where we can see that NetFlow relies on a NetFlow Exporter that runs at the router and communicates with the NetFlow collector, which takes care of gathering and storing the flow data for further processing by a traffic analyzer.

[Figure 1.1: Examples of NetFlow and SNMP measurement architectures. (a) NetFlow; (b) SNMP.]
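As a minimal illustration of the passive approach, the sketch below bins captured packets into per-flow volume counts over fixed measurement intervals, in the spirit of what a flow-level collector produces. The (timestamp, src, dst, size) tuple layout is an assumption of this sketch, standing in for records parsed from a real capture.

```python
from collections import defaultdict

def aggregate_flows(packets, epoch: float = 1.0):
    """Bin captured packets into per-flow, per-epoch byte counts.

    `packets` is an iterable of (timestamp_sec, src_ip, dst_ip, size_bytes)
    tuples. Returns {(src_ip, dst_ip): {epoch_index: total_bytes}}, i.e.
    one flow-rate time series per source-destination pair.
    """
    series = defaultdict(lambda: defaultdict(int))
    for ts, src, dst, size in packets:
        series[(src, dst)][int(ts // epoch)] += size
    return series

# Three packets of one flow, spanning two 1-second epochs:
pkts = [(0.10, "10.0.0.1", "10.0.0.2", 1500),
        (0.70, "10.0.0.1", "10.0.0.2", 500),
        (1.20, "10.0.0.1", "10.0.0.2", 100)]
ts = aggregate_flows(pkts)
# ts[("10.0.0.1", "10.0.0.2")] == {0: 2000, 1: 100}
```

Because no probe traffic is generated, the fidelity of the resulting series depends entirely on how completely the tap captures packets, which is exactly where sampling-based collectors lose information.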
On the other hand, SNMP-based measurement can provide only a high-level view of the network, since only link-level statistics can be collected. The time resolution can be higher (i.e., 1 minute), with the trade-off of not being able to see information at the individual flow level. An example of an SNMP measurement architecture is shown in Figure 1.1(b), from where we can see that the architecture is quite similar to that of NetFlow, with the difference that the measurements characterize aggregated data such as ports, and are gathered by the Network Monitoring Station (NMS), from where they can be further processed by a traffic analyzer application.

1.4 Software-Defined Networking (SDN)

Legacy networks were not able to provide enough flexibility for implementing new forwarding protocols, supporting globally optimized routing decisions, or easily configuring the network switches. As a result, in 2011 a new movement started with the intent to decouple the control plane from the forwarding plane in order to enable innovation [1, 14]. This movement led to the new network paradigm called Software-Defined Networking (SDN). SDN was also associated with its main API specification called OpenFlow [15, 16], introduced in 2008, which specified how a logically centralized controller can communicate with OpenFlow-enabled switches in order to configure them as well as collect traffic counters. The overall design architecture of SDN is shown in Figure 1.2 [17, 1], from where we can see how the data plane (i.e., the infrastructure layer) is separated from the control layer through the use of the OpenFlow API, and also how the application layer is separated from the control layer through the use of custom-built application-specific APIs.

[Figure 1.2: The overall SDN architecture [1].]

The classification of SDN measurement as active or passive can be done in different ways, depending on the underlying data collection mechanism (i.e.
pull vs. push) [18], or the presence of probing packets [19, 20]. Throughout this thesis, we will follow the taxonomy illustrated in [18], and we will focus on active SDN measurement frameworks.

1.5 Measurement in SDN

One of the many advantages of SDN is the fact that it can provide traffic counters at the individual forwarding rule level (using the rules installed in the switch's TCAM), as well as port-level counters that require no explicit forwarding rules. This way, SDN can combine the benefits of NetFlow and SNMP, without any additional software agents running at the switches (e.g., a NetFlow exporter) other than the support of OpenFlow. The general measurement architecture of SDN is outlined in Figure 1.3, from where we can see that a logically centralized controller can run applications that implement custom measurement tasks, which are executed to collect flow and port statistics from the switches using the OpenFlow API, and which can later be stored in a database for further processing.

[Figure 1.3: SDN measurement architecture: a task scheduler, measurement tasks, measurement engine, and measurement database at the controller, with flow and port counters at the OpenFlow switches.]

1.6 Knowledge-Defined Networking (KDN)

One of the many advantages of SDN is the fact that it can enable the creation of a knowledge plane on top of the controller that leverages a measurement database and a set of Machine Learning (ML) models to make optimized, data-driven, network-wide decisions. The knowledge plane concept was first introduced in [21] and was associated with SDN in a new paradigm that became known as Knowledge-Defined Networking (KDN) in [2]. KDN is defined as the combination of SDN, telemetry, network analytics, and the knowledge plane.

[Figure 1.4: The Knowledge-Defined Networking workflow [2].]
The KDN paradigm is shown in Figure 1.4, from where we can see that above the control layer there is a workflow of ML models, human decision making, and automated decisions that can manage the network more efficiently through data-driven decision making.

The frameworks presented in this thesis fully align with the KDN paradigm in the sense that they provide a wide range of mechanisms to enhance network telemetry, enable fine-grained measurement at scale, and further enable ML-based or general data-driven decision making. In other words, our work can supply the "analytics platform" of the KDN workflow of Figure 1.4 with large amounts of data that can later be consumed by the knowledge plane.

1.7 Applications of Traffic Measurement & Prediction

Network traffic measurement has a wide variety of applications such as network security (e.g., anomaly detection), traffic control (e.g., traffic engineering), traffic characterization (e.g., traffic forecasting), and performance diagnostics (e.g., detecting traffic bottlenecks). A more complete list of applications is shown in Figure 1.5.

[Figure 1.5: Applications of network traffic measurement.]

Of all the applications shown in Figure 1.5, traffic forecasting can be considered one of the most important, since it can further enable measurement applications in a proactive manner (instead of a reactive one). For example, in order to prevent upcoming congestion on a link, along with the packet losses and increased delays it causes, short-term link-level predictions (e.g., 15 seconds ahead) can be used to proactively reroute traffic when a traffic spike is forecast, without having to wait for the routing protocols to react with delay. This can have a significant impact on the overall Quality of Experience (QoE) of the end users.
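As a toy version of the proactive rerouting loop just described, the sketch below forecasts the next epoch's link volume with a sliding-window mean and flags a likely congestion event. The window size, utilization threshold, and capacity are illustrative assumptions; the thesis develops LSTM models for this task rather than this naive baseline.

```python
def forecast_next(series, window: int = 4) -> float:
    """One-step-ahead forecast: mean of the last `window` observations.

    A deliberately naive predictor used only to illustrate the control
    loop; it ignores the trends and seasonality that real models capture.
    """
    tail = series[-window:]
    return sum(tail) / len(tail)

def should_reroute(link_mbps, capacity_mbps: float, utilization: float = 0.9) -> bool:
    """Proactively reroute if the forecast exceeds `utilization` of capacity."""
    return forecast_next(link_mbps) > utilization * capacity_mbps

# Rising load on a 100 Mbps link: the recent epochs average above 90 Mbps.
assert should_reroute([40, 80, 95, 98, 99], capacity_mbps=100)
assert not should_reroute([40, 42, 41, 43, 40], capacity_mbps=100)
```

The point of the example is the control structure, not the predictor: any one-step forecaster with the same signature (including the LSTMs of Chapter 3) can be dropped into the rerouting decision.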
Another important application of traffic measurement and prediction is the reduction of power consumption, whether at the data center level [22, 23, 24, 25, 26] or at the ISP [27] and backbone link level [28, 29]. According to the US Data Center Energy report published by the Lawrence Berkeley National Laboratory [30], the total energy consumption of US data centers was expected to reach 73 billion kWh in 2020, representing approximately 1.7% of total U.S. electricity consumption. Being able to reduce the power consumption of data centers can therefore have very large economic and environmental implications. To do so, one can leverage short-term traffic predictions about flow prefixes and turn off the equipment that is not needed to forward or process traffic (or enable power-saving mode in CPU cores, physical links, VMs, servers, etc.) [26]. A similar concept can be applied at the backbone link level or at the ISP level overall, where physical links belonging to bundled links can be turned off if they are not going to be needed in the near future, similar to the work in [28, 29, 27]. It is important to note here that the more granular the predictions (in both time and space), the larger the expected power savings, since the optimized system can adapt to the power needs more quickly without overprovisioning resources over long time horizons [26]. This way, companies can reduce their Operating Expenditure (OpEx) even if they have invested large amounts of Capital Expenditure (CapEx) to acquire over-provisioned infrastructure [31].

1.8 Challenges in Network Traffic Measurement

Despite the availability of existing measurement mechanisms for both legacy and SDN deployments, there are fundamental limitations in both that prevent us from achieving a scalable and in-depth view of the network.
Specifically, the sampling needed in frameworks like NetFlow and sFlow, as well as their coarse-grained time resolution (i.e., ≥ 5 minutes), can significantly limit the visibility and the usability of the data, in both space and time. On the other hand, the SDN paradigm also suffers from limited measurement resources, such as the expensive and power-hungry TCAMs that have very limited size (i.e., ≤ 4096 rules in most production switches) compared to the number of active flows in the network at any given time [32, 33, 34]. The same applies to the limited CPU and SRAM at the switches. In addition to the above, sketch-based solutions such as [35, 36, 37], which were introduced for SDN, suffer in terms of their large memory footprint, CPU utilization, and hash collision errors, and they also require switch modifications in order to be implemented in production switches. For these reasons, none of the existing solutions can provide both 1) a general framework that fits a wide variety of network management tasks, and 2) a scalable solution that can provide fine-grained visibility at short time scales (i.e., ≤ 5 minutes).

1.9 Thesis Contributions

The focus of this thesis is to develop a data-driven framework that provides network traffic measurement capabilities at scale for SDN deployments, and which can be easily deployed on production hardware switches as an SDN controller application without any hardware modifications, with the intent to enable a large variety of management and optimization tasks such as efficient resource use, power management, analytics, and more. The contributions of this thesis are summarized below:

• Network Flow Dataset Generator: In order to overcome the problem of unavailable structured flow-level datasets in the research community, we develop a framework for scalable analysis of very large passive network traffic logs (e.g.
PCAP files) in order to extract flow-level time series at various time resolutions and IP aggregation levels within minutes.

• Aggregated Network Traffic Modeling: We develop various Long Short-Term Memory (LSTM) models to predict real network flows with high accuracy at various time resolutions and IP aggregation levels.

• Cluster-Based Network Traffic Modeling at Scale: In order to provide a scalable framework for network traffic modeling, we propose the use of time series clustering to group similar time series together and train a single model per group, thus reducing the total number of models needed to predict network traffic effectively.

• Link-Level Network Traffic Modeling: We develop various LSTM models that are able to model link-level data at short time scales by taking into account correlations across links, which can be used to forecast network Traffic Matrices (TM) at short time horizons.

• Active Flow Prefix Detection Engine: We develop an algorithm to find active flow prefixes in proactive SDN deployments where granular active flow information is not available.

• High Resolution Monitoring by Leveraging Predictions: We develop a framework that relies on the detected active flow prefixes and LSTM predictions in order to provide continuous visibility into the network by combining measurements and predictions interchangeably.

• Measurement Orchestration Using Deep Reinforcement Learning: We develop a framework that leverages a deep reinforcement learning algorithm in order to optimize the measurement parameters given measurement resource constraints.

1.10 Thesis Organization

The rest of this thesis is organized as follows:

• Chapter 2 introduces some background concepts needed in order to formulate our problems and our proposed solutions.

• Chapter 3 presents our flow prefix modeling results as well as our proposed time series clustering framework that reduces the total number of models needed to predict diverse network time series.
• Chapter 4 presents our modeling results for link-level data at short time scales, which can enable more granular traffic matrices and short-term decision making.

• Chapter 5 presents DeepFlow, our proposed measurement framework that combines measurements and predictions in order to achieve continuous visibility into the network.

• Chapter 6 introduces DRLFlow, an extension of DeepFlow that optimizes the measurement parameters using deep reinforcement learning given delay constraints.

• Chapter 7 concludes the thesis and presents our future research directions.

Chapter 2

Background

In this chapter, we present some definitions that are needed in order to introduce the main ideas behind this thesis. We also provide background information on how measurement can be achieved in SDN using forwarding rule counters. Finally, we provide a table of the most important notation used throughout this thesis.

2.1 Flow Model

Network flows can be defined in the literature in many different ways, depending on the particular application. Throughout this thesis, and without loss of generality, we will use an IP-level definition that does not take into account port numbers or transport protocol information.

Definition 2.1.1. A flow is defined as a set of IP packets passing an observation point in the network during a certain time interval which share a set of common properties such as the source and the destination IP address or prefix.

As we can see from Definition 2.1.1, in the case of individual source-destination IP pairs, a flow corresponds to all the traffic exchanged between two IPs as seen from a given observation point. On the other hand, if we use IP prefixes to establish a source-destination IP prefix pair, then a flow will correspond to the aggregated traffic from all the individual source IPs matching the source prefix that is sent towards the set of destination IPs matching the destination prefix, and observed through a
In such a case, a flow can potentially correspond to the aggregated traffic of a university department, for example, or even a university overall, depending on the size of the prefix.

Definition 2.1.2. A flow-rate time series F_i = f_i^{(1)}, f_i^{(2)}, ..., f_i^{(n)} is an ordered set of n real-valued variables that correspond to the total traffic volume of flow f_i over n consecutive measurement intervals.

Definition 2.1.3. Given a time series F_i of length n, a subsequence F_i^{(p)} of F_i is a sampling of length w < n of contiguous positions from F_i, that is, F_i^{(p)} = f_i^{(p−w+1)}, f_i^{(p−w+2)}, ..., f_i^{(p)} for w ≤ p ≤ n.

Definitions 2.1.2 and 2.1.3 are needed in order to support limited time-window datasets that can be implemented as rolling time windows in a production system. In other words, modeling network time series does not need to use very old data (e.g. more than 1 year old) due to storage constraints, cost, and, more importantly, the need to adapt to the latest trends of the Internet traffic such as new services becoming available (e.g. more connected TV sets, more streaming providers, etc.) while, at the same time, taking into consideration hourly, weekly, monthly or yearly trends.

2.2 Prefix Measurement in SDN

In order to measure the traffic size of a given flow prefix over a measurement interval in SDN, we need to install a measurement rule at an SDN switch that this prefix is assigned to, which is going to capture the number of packets or bytes that match the given rule.

[Figure 2.1: An example of a source IP prefix trie. The highlighted nodes correspond to measurement rules and above each rule is its associated priority.]

In general, a flow that arrives at a given port of an SDN switch can potentially match many installed forwarding rules. In such a case, the switch will use rule priorities to break any ties.
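To make the tie-breaking concrete, the following sketch resolves a packet against a set of prioritized prefix rules like those of Figure 2.1. The rule representation and helper names here are illustrative, not an actual switch implementation:

```python
# Illustrative sketch of priority-based prefix matching: among all rules
# whose prefix matches a packet's source IP bits, the switch applies the
# one with the highest priority.

def matches(prefix, ip_bits):
    """A rule prefix like '01*' matches if all non-wildcard bits agree."""
    return all(p == '*' or p == b for p, b in zip(prefix, ip_bits))

def lookup(rules, ip_bits):
    """rules: list of (prefix, priority, out_port) tuples. Returns the
    matching rule with the highest priority, or None if nothing matches."""
    candidates = [r for r in rules if matches(r[0], ip_bits)]
    return max(candidates, key=lambda r: r[1]) if candidates else None

# Forwarding rules at 0** and 1**, plus measurement rules at 01* and 011
# with strictly higher priorities (p_0 > p_1 > p_2), same outgoing port.
rules = [('0**', 2, 1), ('1**', 2, 2), ('01*', 5, 1), ('011', 9, 1)]

assert lookup(rules, '000')[0] == '0**'   # only the coarse rule matches
assert lookup(rules, '010')[0] == '01*'   # measurement sub-rule wins
assert lookup(rules, '011')[0] == '011'   # most specific, highest priority
```

Because the measurement rules forward to the same port as the rule they refine, traffic is unaffected while their counters expose the finer-grained prefixes.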
From the measurement perspective, this priority architecture can be leveraged in order to achieve fine-grained measurement by splitting forwarding rules into potentially overlapping sub-rules with higher priorities. So, for example, if we assume that we use source IP based forwarding in an SDN switch and we have only two forwarding rules installed, at the nodes 0** and 1** of the prefix trie of Figure 2.1, then if we want to measure more granular prefixes such as 01* and 011, we will need to install two forwarding rules, one at 01* and one at 011, with the same outgoing port (so as not to disrupt routing) but with priorities p_1 > p_2 and p_0 > p_1, respectively. The process described above can also be generalized to the two-dimensional case, i.e. when we use both source and destination IP prefixes to match flows, or even when we use port numbers and MAC addresses to install very granular rules in the TCAMs.

2.3 Traffic Matrix Estimation

There are two main categories of Traffic Matrices (TM). The first one is known as the Origin-Destination (OD) matrix and it represents the volume of traffic flowing through all possible origin-destination pairs in a network, i.e. from the point that generates a packet to the point that receives it. The second type is known as the Ingress-Egress (IE) matrix and it represents only a subset of the OD matrix that can be observed or is relevant to a given operator (e.g. traffic flowing through the operator's edge routers) [38]. From the definitions above, we can see that OD matrices are much more granular, but they are hard to obtain due to the very large number of possible source-destination IP pairs. So, in practice, IPs are aggregated in prefixes to form a single traffic source for TM estimation purposes [38]. In general, a traffic matrix can be inferred using partial link measurements as well as other statistical methods that can replace missing information.
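As a toy numerical illustration of inferring flow sizes from partial link measurements, the sketch below solves the linear system R · x = y (formalized in Section 2.3) in the least-squares sense with numpy. The 2-link, 3-flow topology is made up for this example:

```python
import numpy as np

# Toy instance of the traffic matrix estimation problem R x = y:
# 2 links, 3 flows. The system is under-determined (L < K), so we take
# the least-squares / minimum-norm solution, which illustrates why link
# counts alone cannot pin down individual flow sizes.
R = np.array([[1.0, 1.0, 0.0],    # link 1 carries flows 1 and 2
              [0.0, 1.0, 1.0]])   # link 2 carries flows 2 and 3
x_true = np.array([10.0, 5.0, 20.0])
y = R @ x_true                     # observed link loads

x_hat, *_ = np.linalg.lstsq(R, y, rcond=None)
# The estimate reproduces the link loads exactly ...
assert np.allclose(R @ x_hat, y)
# ... but does not recover the true flow vector, since the system is ill-posed.
assert not np.allclose(x_hat, x_true)
```

Adding exact-match TCAM counters or predictions as extra rows, as in Equation 2.3.2, is precisely what shrinks this solution ambiguity.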
In legacy networks, SNMP measurements are frequently used to obtain link statistics that can be combined with routing information to infer the actual flow sizes to some extent. More formally, let G(S, L) be the graph representing the network topology, where S is the set of switches in the network and L is the set of physical links in the network. Also, let N = |S| and L = |L| be the total number of switches and links, respectively, and K the total number of flows in the network, where the volume of flow f_i at time period t is denoted as f_i^{(t)}, or for simplicity x_i. The routing in G is assumed to be known and represented by the L×K matrix R, each element r_{i,j} of which corresponds to the fraction of traffic of flow j going through link i.

If we represent all the flow sizes x_i in a vector x and the set of link measurements y_j in a vector y, then the traffic matrix estimation problem is described by the system of linear equations:

    R · x = y    (2.3.1)

that has L equations and K unknowns. Since the above linear system is ill-posed in general (i.e. there are far fewer links than micro-flows in the network), in this thesis we will see how our proposed prediction frameworks from Chapters 3 and 4 can be used in conjunction with additional flow counters from the TCAM rules at the switches in order to add more equations to the linear system from Equation 2.3.1. This way, we can estimate some of the x_i's, or partial sums of them (e.g. instead of x_1 and x_2 we might be able to estimate x_1 + x_2, etc.), thus improving the TM granularity. So, the TM linear system can take the following general form:

    [ r_11  r_12  ...  r_1K ]
    [  ...   ...  ...   ... ]
    [ r_L1  r_L2  ...  r_LK ]
    [ e_11  e_12  ...  e_1K ]       [ x_1 ]     [ y_1 ]
    [  ...   ...  ...   ... ]   ·   [ ... ]  =  [ ... ]     (2.3.2)
    [ e_E1  e_E2  ...  e_EK ]       [ x_K ]     [ y_Q ]
    [ p_11  p_12  ...  p_1K ]
    [  ...   ...  ...   ... ]
    [ p_P1  p_P2  ...  p_PK ]

where Q = L + E + P, and L is the number of link measurements retrieved, E is the number of exact match measurements retrieved from the TCAM, and P is the number of predictions (instead of measurements) generated.

Note: It is important to note here that, for a given row in the matrix on the LHS of Equation 2.3.2, not all the elements have to be non-zero, since measurements or predictions can be chosen ad hoc and the non-zero columns in the matrix should only be the ones that correspond to the IP prefix being measured or predicted.

2.4 Measurement Task Model

A traffic measurement task can be defined in many different ways depending on the application. As we will see in Chapter 5, most of the measurement frameworks are specialized in doing one thing such as finding the top-K flows (e.g. for K = 100), or measuring a specific prefix, etc. In this thesis, we define the measurement task in a more general way that aims to enable fine-grained measurement, while ensuring completion of the measurement processes. For this, we will be using a traffic rate threshold θ that will separate important and unimportant flows, as well as two maximum granularity thresholds, namely d_s and d_t, that correspond to the maximum source and destination aggregation prefix returned by the measurement framework (essentially capturing how granular the flows are allowed to be). More formally:

Definition 2.4.1. A measurement task M is defined as the process of collecting granular flow-rate information about all the flows in the network that have a rate of at least θ, and a source/destination IP prefix size of at most d_s and d_t, respectively. So, a task can be characterized by the 3-tuple M = (θ, d_s, d_t).

In Chapters 5 and 6, we will extend this definition to include delay guarantees, as well as task success criteria.

2.5 Major Notation

In Table 2.1 we summarize all the major notation used throughout this thesis.
Table 2.1: Notation

Symbol             Description
N                  Total number of switches in the network
K                  Total number of flows in the network
L                  Total number of physical links in the network
S                  Set of switches
L                  Set of physical links
F                  Set of flows
f_i                Flow i
s_j                Switch j
M_j                The total memory at switch j
m_j                The total free memory at switch j at a given time
S_{f_i}            The set of switches that flow f_i traverses
R                  Routing matrix
r_{i,j}            Routing matrix element (i, j) that represents the fraction of traffic of flow j that is forwarded through switch i
x                  Vector of flow sizes
y                  Vector of flow measurements
θ                  The minimum threshold for the size of a flow above which a flow is considered significant
w                  The number of inputs required to generate a prediction
T_p                The prediction horizon of the flow model
T                  The time it takes to complete a measurement task
τ                  The measurement task deadline
F_i^{(p)}          A time series subsequence of active flow i
l_{i,j}^{(t)}      Total traffic seen by port j of switch i during measurement epoch t
F_w                The set of all forwarding rules across all switches
M = (θ, d_s, d_t)  A measurement task with minimum flow rate threshold θ, maximum source IP prefix length d_s, and maximum destination IP prefix length d_t

Chapter 3
An LSTM Framework For Modeling Network Traffic

Forecasting fine-grained network traffic is crucial for many network management and optimization tasks such as traffic engineering, anomaly detection, network accounting, network analytics, load balancing, power management, and traffic matrix estimation. However, building models that are able to predict a wide variety of network traffic types is not a trivial task due to a) the diversity of network traffic, and b) the computational challenges in processing large datasets to train the prediction models.
In this chapter, we present a network traffic prediction framework that uses real network traces from a Tier-1 ISP to train a Long Short-Term Memory (LSTM) neural network and generate predictions at short time scales (≤ 30 seconds). In order to reduce the number of models needed to capture the very diverse dynamics of the various traffic sources, we develop a feature-based clustering framework that acts as a preprocessing step in order to group similar time series together and train a single model for each group. Our extensive experimental evaluation study shows that LSTMs can indeed be used to predict network traffic with low prediction errors.

3.1 Introduction

Fine-grained network traffic predictions can enable a large variety of network management tasks such as network monitoring, traffic engineering, anomaly detection, network accounting, network analytics, performance diagnostics, load balancing, and Traffic Matrix (TM) estimation [39, 40, 41, 42, 43, 18, 38, 44]. However, traditionally, prediction models for network traffic have only been developed for large aggregation time-windows (i.e. > 15 minutes in most of the cases) due to a) the very volatile nature of network traffic at smaller time scales, b) the lack of computational resources to process packet level training data at scale, and c) the lack of efficient models that can predict network flow rates with high accuracy. For this, fine-grained traffic predictions have been substituted by network traffic measurement frameworks that fall into one of the following three categories, depending on their objective [18]: a) frameworks that balance measurement overhead by using techniques like sampling, aggregation, and efficient heuristics, b) frameworks that trade off resource usage against measurement accuracy, and c) frameworks that target accurate measurements in real time for decision making.
However, none of the existing frameworks can provide a generic solution for granular traffic measurement, since either the framework will be application specific, or there is going to be some tradeoff that we are trying to balance. More recently, the development of Software-Defined Networking (SDN) allowed more fine-grained measurement to be performed by providing statistics for each forwarding rule of an OpenFlow-enabled switch. However, commodity hardware switches use TCAMs to collect statistics for every forwarding rule installed, which have very limited size due to their high cost and power consumption (4K rules in most of the switches). For this, only a small fraction of the total number of flows that a switch forwards can be monitored at any time. As is evident from the above, more efficient mechanisms are needed in order to be able to a) collect network measurements, and b) use the collected measurements in order to generate accurate traffic predictions for the traffic that cannot be measured, as well as traffic forecasts that characterize the future behavior of the network. In this chapter, we conduct an in-depth modeling study of backbone network traffic using a relatively recent dataset provided by CAIDA in [45] as well as the state-of-the-art deep-learning model for time series predictions, namely Long Short-Term Memory (LSTM). The contributions of our work are summarized below:

1. We provide a generic data processing architecture for analyzing large volumes of raw network traffic logs (e.g. in PCAP format) in order to model them at various time scales and various aggregation levels (megaflow level).

2. We provide an in-depth analysis of backbone network traffic that was captured relatively recently (i.e.
2016) and which contains new traffic dynamics that were unavailable in relevant studies two decades ago, since in recent years the multimedia content, the social networks, the mobile devices, the smart TVs and more, have completely changed the traffic landscape.

3. We provide an analysis of several variations of LSTMs for network traffic modeling, including vanilla LSTM, delta-based LSTMs (i.e. models that predict the consecutive flow-size deltas), and cluster-based LSTMs. This way we can shed more light on what type of model is good for what type of traffic.

4. We propose a set of features that can be used to cluster the traffic time series and use a single model per cluster in order to reduce the number of distinct models needed to cover a large variety of traffic dynamics, and thus increase the scalability of the framework.

3.2 Background & Motivation

In this chapter, and without loss of generality, we will focus on modeling the special case of Definition 2.1.1 where a flow is defined as the traffic that shares the same source and destination IP prefix (i.e. a megaflow), for the whole duration of our experiment. To do so, we develop a framework that aggregates data both spatially and temporally using a big data processing framework, as will be described in Section 3.4. The aggregation is done on source and destination IP prefix pairs that are characterized by their subnet mask size. Our goal is to use past measurements of the size of the traffic from an IP prefix over a given measurement period (e.g. the last 3 observations) and use them to predict the size for the next epoch. To train our model, we use the subsequence F_i^{(p)} with p = n/2 and w = n/2 data points, and to test our model we use the subsequence F_i^{(p)} with p = n and w = n/2, according to Definition 2.1.3. Throughout this chapter, n is defined as the length of the dataset, after the packets have been grouped in time epochs.

One of the biggest challenges in modeling network traffic is its diversity in terms of the dynamics of the underlying traffic (e.g. multimedia, web content, file-sharing, etc.) that makes it hard for a single model to generate good predictions, especially at small time scales. For this, larger aggregation windows (usually > 15
One of the biggest challenges in modeling network traffic is its diversity in terms of the dynamics of the underlying traffic (e.g. multimedia, web content, file- sharing, etc.) that makes it hard for a single model to generate good predictions especially in small time-scales. For this, larger aggregation windows (usually > 15 25 min) have been used where network traffic exhibits hourly, daily, and weekly peri- odicities, andthusmakesiteasiertopredict. However, thiscoarsegrainpredictions cannot always satisfy the requirements of many network management tasks. 3.3 The CAIDA Dataset 0 500 1000 1500 2000 2500 3000 3500 Epoch No. 0 200 400 Aggregated Traffic Size (MB) Aggregated Traffic For Source: /0 - Destination: /0 (Epoch Size = 1 sec) Figure 3.1: Network traffic time series from a Tier-1 ISP link aggregated over 1 second measurement intervals. In our study, we focus on a 1-hour long trace containing large volumes of traffic from a Tier-1 ISP from [45]. The trace is stored in PCAP format collected from thehigh-speed"equinix-chicago"monitorin2016. The"equinix-chicago"monitoris located at the Equinix data center in Chicago, IL, and is connected to a backbone 10GigE link of a Tier1 ISP that connects Chicago, IL and Seattle, WA. Each direction of the bidirectional link is monitored and logged separately and labeled as "directionA"(SeattletoChicago)and"directionB"(ChicagotoSeattle). However, since CAIDA is aware that some data in this dataset contain more than trivial amounts of packet loss (especially direction B) due to the way the monitoring equipment is set up and the high network speeds, in this work we are focusing only on direction A. In Figure 3.1 we show the resulting time series when we aggregate 26 the CAIDA dataset across all prefixes. The trace contains 1.65 billion IPv4 packets with total size of 0.98 TB. 0 1000 Packet Size (Bytes) 0.0 0.1 0.2 Frequency (%) Packet Size Histogram (a) Histogram of packet sizes. 
[Figure 3.2(b): Histogram of network traffic size for epoch = 1 sec.]

Figure 3.2: CAIDA network trace histograms for packet size and aggregate volume over 1 second periods.

In order to better understand the trace characteristics, we show in Figures 3.2(a) and 3.2(b) the histogram of packet sizes in the trace, and the histogram of the total traffic sizes for a measurement epoch of 1 second, respectively. As we can see from the graphs, the packet sizes are either small (e.g. TCP ACKs) or around 1500 bytes, which aligns with the fact that TCP packets make up 88.95% of the trace and UDP 10.73%. Another observation is that the aggregated traffic time series appears to follow a bimodal distribution with a long left tail that corresponds to the frequent drops in the traffic, for which it is not clear whether they were due to the measurement equipment of CAIDA or an inherent characteristic of the traffic. In order to better analyze the structure of the flow time series, we plot in Figure 3.3 the Autocorrelation Coefficients (ACF) of the traffic for various lags. ACFs are calculated using Equation 3.3.1.

[Figure 3.3: Autocorrelation Coefficients (ACFs) for various lags for the trace in Figure 3.1. The shaded area corresponds to statistically non-significant ACF values.]

Specifically, given the set of observations x_1, x_2, ..., x_n of a random variable with sample mean x̄, the lag-k autocorrelation coefficient is defined as:

    R̂(k) = (1 / ((n − k) σ²)) Σ_{t=1}^{n−k} (x_t − x̄)(x_{t+k} − x̄)    (3.3.1)

As we can see in Figure 3.3, all of the first 100 lags have significant ACF (the area outside the blue colored region), which is a good indicator that a model like LSTM, which was designed specifically to handle long-range dependencies effectively ([46]), would be a good candidate.
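Equation 3.3.1 can be computed directly; a minimal numpy sketch (the function and variable names are ours, not from the thesis code):

```python
import numpy as np

def acf(x, k):
    """Lag-k autocorrelation coefficient as in Equation 3.3.1:
    R(k) = 1/((n-k)*sigma^2) * sum_{t=1}^{n-k} (x_t - xbar)(x_{t+k} - xbar)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    sigma2 = x.var()              # population variance of the sample
    return ((x[:n - k] - xbar) * (x[k:] - xbar)).sum() / ((n - k) * sigma2)

# A strongly periodic series has high ACF at its period ...
t = np.arange(1000)
x = np.sin(2 * np.pi * t / 50)
assert acf(x, 50) > 0.9
# ... while white noise has near-zero ACF at any nonzero lag.
rng = np.random.default_rng(0)
assert abs(acf(rng.standard_normal(10000), 50)) < 0.1
```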
3.4 A Big Data Framework for Network Traffic Processing

One of the biggest challenges in the networking research community is the lack of available structured datasets that contain pre-aggregated traffic that can be easily fed into a machine learning model to test its effectiveness. To overcome this challenge, we design a framework that can leverage any available big data SQL platform and can easily process large amounts of packet level data in seconds, instead of the very time and resource consuming scripts that were traditionally used with limited capabilities. The network traces from CAIDA, which are available in large PCAP files, are first processed to extract the (timestamp, ip_version, source_ip, destination_ip, protocol, source_port, destination_port, packet_size) fields of each packet in CSV format. Since 99.68% of the packets in the traces are TCP or UDP, we explicitly focus on extracting only these two protocols from the trace. In addition, due to the large size of the resulting files (since they contain packet level information), we design a system that relies on SQL queries in order to generate the traffic aggregation at various time scales and mask sizes. The database platform used was Google BigQuery [47] and it was chosen due to its scalability and its low cost. To leverage BigQuery's fully managed backend, we first import the packet-level processed logs to the database (in CSV format), and then use a Python SQL query generator script to programmatically create and execute the SQL aggregation queries (in 100s of lines of SQL code) that will create the final dataset.
In order to generate the flow aggregates at various time scales and IP prefix sizes, we run a SQL query that is structured as shown in Figure 3.4.

SELECT *
FROM
(
  # mask = 1, epoch = 5 seconds
  SELECT '1' AS mask, '5' AS epoch,
         round_timestamp(timestamp, 5) AS timestamp,
         get_prefix(srcIp, 1) AS srcPrefix,
         get_prefix(dstIp, 1) AS dstPrefix,
         SUM(size) AS flowSize
  FROM `CAIDA.Data`
  GROUP BY 1,2,3,4,5
) UNION
(
  # mask = 2, epoch = 5 seconds
  SELECT '2' AS mask, '5' AS epoch,
         round_timestamp(timestamp, 5) AS timestamp,
         get_prefix(srcIp, 2) AS srcPrefix,
         get_prefix(dstIp, 2) AS dstPrefix,
         SUM(size) AS flowSize
  FROM `CAIDA.Data`
  GROUP BY 1,2,3,4,5
) UNION
...
(
  # mask = 24, epoch = 30 seconds
  SELECT '24' AS mask, '30' AS epoch,
         round_timestamp(timestamp, 30) AS timestamp,
         get_prefix(srcIp, 24) AS srcPrefix,
         get_prefix(dstIp, 24) AS dstPrefix,
         SUM(size) AS flowSize
  FROM `CAIDA.Data`
  GROUP BY 1,2,3,4,5
)

Figure 3.4: Scalable flow aggregation for multiple mask sizes (prefixes) and multiple time epochs using a sequence of SQL unions that can be processed in parallel by a big data SQL platform.

As we can see, we format the SQL queries as a sequence of SQL "unions" that are highly parallelizable and can speed up execution. In addition, every row of the database is processed using two User-Defined Functions (UDF) to a) aggregate the timestamps into time epochs (i.e. round_timestamp(timestamp, epoch)), and b) convert the individual IPs into prefixes of a given size (i.e. get_prefix(ip, mask)). It is important to note here that the above design can speed up the flow aggregation process by more than 300x compared to the same process running with a scripting language like Python on a single server machine.

Figure 3.5: Architecture of the data collection and processing pipeline.
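The semantics of the two UDFs referenced in the query can be sketched as follows; this is an illustrative Python re-implementation, since the actual UDF definitions used with BigQuery are not shown in the text:

```python
def round_timestamp(ts, epoch):
    """Bucket a (Unix) timestamp into its measurement epoch of `epoch` seconds."""
    return int(ts // epoch) * epoch

def get_prefix(ip, mask):
    """Keep the first `mask` bits of a dotted-quad IPv4 address, zeroing the rest."""
    bits = 0
    for octet in ip.split('.'):
        bits = (bits << 8) | int(octet)
    bits &= ((1 << mask) - 1) << (32 - mask) if mask else 0
    return '.'.join(str((bits >> s) & 0xFF) for s in (24, 16, 8, 0))

assert round_timestamp(1461331204.7, 5) == 1461331200
assert get_prefix('192.168.37.14', 24) == '192.168.37.0'
assert get_prefix('192.168.37.14', 1) == '128.0.0.0'
```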
The overall process is illustrated in Figure 3.5, where we show the steps involved from data collection (performed by CAIDA) to the LSTM flow modeling step.

[Figure 3.6: Architecture of an LSTM unit that connects to other units of a recurrent neural network.]

3.5 Methodology

3.5.1 Long Short-Term Memory Model

A Long Short-Term Memory (LSTM) model is a form of a recurrent neural network that has gained popularity in recent years due to its effectiveness in modeling complex time series with time lags of unknown size that separate important events [48]. The main idea of LSTM is the use of self-loops where the gradient can flow for long durations without vanishing or exploding. This, in combination with the use of a forget gate, allows the LSTM to accumulate knowledge that can be "forgotten" later depending on the input data. To the best of our knowledge, this is the first time that LSTM models are used for modeling fine-grained network flow sizes at short time scales. LSTMs are characterized by the following recursive equations:

    f^{(t)} = σ(W_f x^{(t)} + U_f h^{(t−1)} + b_f)        (3.5.1)
    i^{(t)} = σ(W_i x^{(t)} + U_i h^{(t−1)} + b_i)        (3.5.2)
    c̃^{(t)} = tanh(W_c x^{(t)} + U_c h^{(t−1)} + b_c)     (3.5.3)
    c^{(t)} = i^{(t)} ⊙ c̃^{(t)} + f^{(t)} ⊙ c^{(t−1)}     (3.5.4)
    o^{(t)} = σ(W_o x^{(t)} + U_o h^{(t−1)} + b_o)        (3.5.5)
    h^{(t)} = o^{(t)} ⊙ tanh(c^{(t)})                      (3.5.6)

where f^{(t)}, i^{(t)}, c̃^{(t)}, c^{(t)}, o^{(t)}, h^{(t)}, x^{(t)} are the forget gate, input gate, candidate state, current state, output gate, hidden state, and input data, respectively, W_f, W_i, W_c, W_o are the input weights for the forget gate, input gate, candidate state gate, and output gate, respectively, and U_f, U_i, U_c, U_o are the recurrent weights for the forget gate, input gate, candidate state, and output gate, respectively. In addition, ⊙ is the (element-wise) Hadamard product, σ(x) = 1/(1 + e^{−x}) is the sigmoid function, and b_f, b_i, b_c, b_o are the bias terms for the forget gate, input gate, candidate state, and output gate, respectively.
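Equations 3.5.1–3.5.6 map directly onto a single forward step of an LSTM cell. A minimal numpy sketch with randomly initialized (untrained) weights, purely to illustrate the gate computations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step following Equations 3.5.1-3.5.6. W, U, b hold the
    parameters for the forget (f), input (i), candidate (c) and output (o)
    gates; '*' below is the element-wise (Hadamard) product."""
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])        # (3.5.1)
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])        # (3.5.2)
    c_tilde = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])  # (3.5.3)
    c = i * c_tilde + f * c_prev                              # (3.5.4)
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])        # (3.5.5)
    h = o * np.tanh(c)                                        # (3.5.6)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {g: rng.standard_normal((n_hid, n_in)) for g in 'fico'}
U = {g: rng.standard_normal((n_hid, n_hid)) for g in 'fico'}
b = {g: np.zeros(n_hid) for g in 'fico'}

h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
assert h.shape == (n_hid,) and c.shape == (n_hid,)
assert np.all(np.abs(h) <= 1.0)   # h = o * tanh(c) is bounded by 1
```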
The architecture of an LSTM recurrent neural network is shown in Figure 3.6, from which we can see the internal structure of the LSTM unit at time t, as well as how it connects to past and future states to form a recurrent neural network.

3.5.2 Time Series Clustering for Model Selection

Network traffic is very heterogeneous in nature, depending on the application that generates the data, the transport protocol used (e.g. TCP, UDP, etc.), the available bandwidth at the edge, the congestion in the network, the time of the day, the distance between source and destination, the end-user behavior and more. These can create flow-rate time series that can have very different dynamics from prefix to prefix, and which would be hard to capture with a single machine learning model. On the other hand, using a different model for each flow prefix would not scale due to the large number of possible source-destination IP pairs. So, in order to be able to use a small set of models to model a large variety of flow prefixes with different dynamics, we use a time series clustering framework that assigns time series into a set of disjoint clusters and train a single machine learning model for each cluster. Given the set of all the active flows F_i^{(p)} for i ∈ {1, ..., K}, the goal of the time series clustering framework is to find a partition A_1, A_2, ..., A_k, where k is the total number of clusters and K the total number of flows. Let z_i ∈ {1, 2, ..., k} be the cluster to which time series F_i^{(p)} is assigned. Then, the clustering algorithm needs to find the optimal

    z_i^* = argmax_z p(z_i = z | x_i, D)    (3.5.7)

where x_i is the feature vector that characterizes the time series F_i^{(p)}, and D is the rest of the input data. In the analysis above, the number of clusters k can be set manually (e.g. in the case of k-Means) or derived automatically by the clustering algorithm (e.g. in DBSCAN).
Time series clustering can be done using three kinds of approaches [49]: a) raw data based methods, b) feature-based methods, and c) model-based methods. In this thesis, we propose the use of a feature-based method to cluster the time series F_i since approaches of this type a) work well for heterogeneous time series, b) require a smaller set of dimensions compared to a raw-data approach, and c) do not depend on a specific model of the data. However, the challenge here is to find a set of features that work well. For this, we use various features that characterize the dynamics of a flow such as the mean, variance, autocorrelation for various lags, the subnet size, the burstiness of the traffic, the entropy of the distribution of all the flow values, and remove any correlated features found. The full list of features used is shown in Table 3.1. The rationale for choosing these features is that a) they capture well the diverse dynamics of the various network time series, and b) they produced relatively good clusters, as indicated by various cluster quality metrics such as homogeneity, completeness, and silhouette coefficient.

Table 3.1: Features for network time series clustering.

Feature Name   Description
mean           sample mean of the flow rates
variance       variance of flow rates
median         median of flow rates
min            the smallest flow rate seen
max            the largest flow rate seen
ptp            the range of values (peak-to-peak)
skew           skewness of the flow rate time series
kurtosis       fourth central moment divided by the square of the variance
acf1           lag-1 autocorrelation coefficient
acf10          lag-10 autocorrelation coefficient
entropy        sample entropy of the flow rates
mask           the mask size used to aggregate the traffic
epoch          the epoch size used to aggregate the traffic over time

The clustering algorithms used are k-Means and DBSCAN.
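The Table 3.1 feature vector can be computed with plain numpy; the sketch below is illustrative (for example, the histogram-based entropy estimate is one of several possible choices, and the thesis code may differ):

```python
import numpy as np

def flow_features(series, mask, epoch):
    """The Table 3.1 feature vector for one flow-rate time series,
    with skewness, kurtosis, and ACFs computed directly with numpy."""
    x = np.asarray(series, dtype=float)
    mu, sd = x.mean(), x.std()

    def acf(k):
        return ((x[:-k] - mu) * (x[k:] - mu)).sum() / ((len(x) - k) * sd**2)

    # Entropy of the empirical distribution of flow rates (histogram based).
    counts, _ = np.histogram(x, bins=20)
    p = counts[counts > 0] / counts.sum()

    return {
        'mean': mu, 'variance': x.var(), 'median': np.median(x),
        'min': x.min(), 'max': x.max(), 'ptp': np.ptp(x),
        'skew': ((x - mu) ** 3).mean() / sd**3,
        'kurtosis': ((x - mu) ** 4).mean() / sd**4,
        'acf1': acf(1), 'acf10': acf(10),
        'entropy': -(p * np.log(p)).sum(),
        'mask': mask, 'epoch': epoch,
    }

f = flow_features(np.sin(np.linspace(0, 20 * np.pi, 1000)) + 2.0, mask=4, epoch=5)
assert abs(f['mean'] - 2.0) < 0.01 and f['acf1'] > 0.9
```

Each flow's feature dictionary becomes one row of the matrix fed to the clustering algorithm.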
For k-Means, we specify the number of clusters k (20 was used after experimenting with various values and deriving clustering quality metrics, but depending on the data we want to model, this can be tweaked accordingly) that need to be used to partition the time series features such that the within-cluster sum of squares is minimized as follows:

    z_i^* = argmin_z ‖x_i − μ_z‖₂²    (3.5.8)

where μ_z is the cluster's center, defined as

    μ_z = (1 / N_z) Σ_{i : z_i = z} x_i    (3.5.9)

and N_z is the total number of points in cluster z. On the other hand, DBSCAN only needs to be given the maximum distance between two samples for them to be considered as belonging to the same neighborhood, and also the minimum number of samples (10 was used as the minimum) in a neighborhood for a point to be considered as a core point (including the point itself).

3.5.3 Data Transformations

One very common approach when modeling data in practice is to apply several transformations to the data, depending on their distribution, in order to achieve certain properties. In this work, we apply the following transformations in various combinations in order to assess their effectiveness: a) normalize the time series, b) model the deltas (i.e. f_i^{(j)} − f_i^{(j−1)}, f_i^{(j−1)} − f_i^{(j−2)}, ...) instead of the actual values (this also changes the distribution of the data and may achieve better performance), and c) model the logarithm of the time series values.

3.6 Evaluation

In order to evaluate the effectiveness of the model, we proceed to analyze the CAIDA dataset processed by our big data framework described in Section 3.4. In order to quantify the modeling error, we calculate the Mean Absolute Percentage Error (MAPE) across all the time series, as defined below:

    MAPE = (100 / n) Σ_{t=1}^{n} |f_i^{(t)} − u_i^{(t)}| / |f_i^{(t)}|    (3.6.1)

where f_i^{(t)} is the actual flow-rate value and u_i^{(t)} the estimated flow-rate value for a given flow i.
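Equation 3.6.1 translates into a one-line numpy routine; a sketch, assuming (as in the text) that near-zero series have already been filtered out:

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error as in Equation 3.6.1, in percent.
    Assumes the actual values are non-zero (series that are mostly zero
    are removed from the analysis before evaluation)."""
    f = np.asarray(actual, dtype=float)
    u = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs(f - u) / np.abs(f))

assert abs(mape([100.0, 200.0], [110.0, 180.0]) - 10.0) < 1e-9  # (10% + 10%) / 2
assert mape([50.0, 50.0], [50.0, 50.0]) == 0.0
```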
Since MAPE can produce large errors when the actual time series values are close to 0, we remove from the analysis time series that appear to be 0 for more than 50% of the trace. In a real system, we can use a framework like the one in [50] to detect active flows and model those only. We implement the following four variations of LSTM using Keras [51] and TensorFlow [52]:

1. Vanilla LSTM (VLSTM): This is a simple LSTM architecture with an LSTM layer with 50 units, followed by a dense layer with 50 units, dropout of 20%, look-back window 3, 20 training epochs (not to be confused with the aggregation epoch used during the dataset creation), batch size 8, and a standard scaler on the time series data. The model was trained with a 50-50 train/test split.

2. Delta LSTM (DLSTM): This is exactly the same architecture as in 1) above, with the only difference that the input data have been pre-processed to calculate the deltas.

3. Cluster LSTM (CLSTM): This is an LSTM architecture that consists of 20 individual LSTM models (equal to the number of clusters found to be working well with k-Means, since DBSCAN did not produce good results), each trained using data from the cluster that the corresponding time series are grouped into. Each model has an LSTM layer with 50 units, followed by a dense layer with 50 units, dropout of 30%, look-back window 3, 20 training epochs, batch size 128, and a standard scaler on the time series data.

4. Cluster Delta LSTM (CDLSTM): This is exactly the same architecture as in 3) above, with the only difference that the input data have been pre-processed to calculate the deltas.

In addition to the above models, some more variations were also tested (e.g. using a log-transform) but are not included here since they did not provide any better results. The data are aggregated using epochs of 5, 10, 15, and 30 seconds, as well as subnet mask sizes ranging from 0 to 7, in all the possible combinations of the two.
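The look-back windowing and 50-50 split used by the models above can be sketched as follows; the Keras layers themselves are omitted, and the helper name is ours:

```python
import numpy as np

def make_supervised(series, w=3):
    """Turn a flow-rate series into supervised (X, y) pairs with a
    look-back window of w values: each row of X holds w consecutive
    observations and the corresponding y is the next value."""
    x = np.asarray(series, dtype=float)
    X = np.stack([x[i:i + w] for i in range(len(x) - w)])
    y = x[w:]
    return X, y

series = np.arange(10, dtype=float)        # stand-in for one flow-rate series
X, y = make_supervised(series, w=3)
assert X.shape == (7, 3) and y.shape == (7,)
assert np.allclose(X[0], [0, 1, 2]) and y[0] == 3.0

# 50-50 train/test split: the first half trains, the second half tests.
half = len(X) // 2
X_train, y_train = X[:half], y[:half]
X_test, y_test = X[half:], y[half:]
assert len(X_train) + len(X_test) == len(X)
```

For DLSTM/CDLSTM, the same helper is applied to np.diff(series) instead of the raw values.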
Due to the large number of the resulting active flows, we randomly sample up to 100 active flows per combination, run the models, and then repeat the process 10 times to calculate the final MAPE averages. The results are shown in Figures 3.7 and 3.8.

[Figure 3.7: Average MAPE for various mask sizes and epoch durations. Subfigures (a)-(d) show the average MAPE per mask size for VLSTM and DLSTM, for epoch sizes of 5, 10, 15, and 30 seconds, respectively.]

As we can see from Figure 3.7, smaller mask sizes and higher epoch sizes can be predicted more easily; however, even at larger mask sizes, the average MAPE can be less than 30%. In addition, it is important to mention here that this does not mean that LSTM cannot predict more fine-grained flows, since our experimentation showed that this increase in the MAPE is due to new traffic patterns being observed during the testing phase, something that can be easily resolved if we use more historical data.

[Figure 3.8: Average MAPE for each of the 20 clusters used for CLSTM and CDLSTM.]

Another interesting observation is that the delta transformation (i.e.
first-order differentiation) does not yield a large improvement over vanilla LSTM in most cases; when it does, however, the improvement is substantial, as is evident in Figure 3.8. This shows that there is a certain pattern that is better captured by the delta transformation. Figure 3.8 also suggests that each cluster can be further optimized by assigning it a separate model architecture instead of using the same architecture with different training data across clusters. Finally, Figure 3.8 shows that when all the individual time series have been grouped into similarity groups, the model generalizes better, since more data with similar patterns are taken into account.

3.7 Prediction Horizon

3.7.1 Large Scale Experiment (CAIDA)

In order to investigate the effect of the prediction horizon on the prediction error, we modify the CLSTM and CDLSTM models such that a set of T_p future predictions is generated before the regressors are updated to their actual values instead of their predictions. The results are shown in Figures 3.9 and 3.10, which show the MAPE vs. horizon plots for each of the 20 clusters of CLSTM and CDLSTM, respectively.

Figure 3.9: Average MAPE for various prediction horizons for CLSTM.

Figure 3.10: Average MAPE for various prediction horizons for CDLSTM.

As is evident from the plots (the exact cluster number does not really matter here, since we only focus on the general trend of the error), the predictions stay quite robust as the horizon increases, with a lot of clusters exhibiting a logarithmic trend. This shows that the approach of using LSTMs for predicting more than the next step is also feasible.
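Mechanically, the horizon experiment amounts to rolling any one-step predictor forward T_p epochs, feeding each prediction back in place of the not-yet-available measurement. A minimal sketch (the wrapper and its names are ours, not the thesis code):

```python
def predict_horizon(one_step_model, history, look_back, t_p):
    """Generate t_p future predictions from a one-step-ahead model by
    substituting predictions for the unknown future regressors."""
    window = list(history[-look_back:])
    predictions = []
    for _ in range(t_p):
        y_hat = one_step_model(window)   # one-step-ahead forecast
        predictions.append(y_hat)
        window = window[1:] + [y_hat]    # slide the window over the prediction
    return predictions
```

With an LSTM, `one_step_model` would wrap a call to the trained network on the (scaled) window; errors compound as t_p grows, which is exactly what the horizon plots measure.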
We should also note here that even though some of the MAPE scores appear quite high in the two figures (i.e. > 50%), we can still choose the better of CLSTM and CDLSTM for a given cluster, since, as we saw in Figure 3.8, for every cluster there is a model that has a relatively low MAPE.

3.7.2 Small Scale Experiment (Mininet)

In order to further analyze the prediction capabilities of LSTMs for various prediction horizons in small-scale topologies, we implement a real network topology from Google B4 [3] in Mininet [53]. Specifically, we simulate the B4 topology with 1 Gbps links, where each Open vSwitch has 5 hosts attached to it and they all concurrently send traffic to a random destination host over their shortest path. The traffic is sent using iperf, at a maximum rate chosen uniformly in the range 1 to 100 Mbps, over channels with 10 ms link delay and 1% loss rate. In addition, a sender is allowed to change its rate after an interval that varies uniformly between 400 and 1500 seconds. In Figure 3.11 a sample aggregated flow of 100 epochs is shown (epoch size = 5 seconds), as well as its prediction curve. As we can see from the graph, the prediction curve approximates the real flow size quite well, yielding an average MAPE of 3.9% across all the flows measured. One reason for this is the fact that in Mininet the TCP flows created were active for longer periods of time without other interfering traffic such as UDP, thus yielding more stationary flows that can be modeled with very high accuracy. Another reason is the fact that, due to the small number of flows and their similar nature, we could not use clustering; instead we simply used VLSTM to model individual flows. Figure 3.12 shows the average MAPEs obtained from the Mininet simulation scenario when we vary the prediction horizon T_p used when modeling individual flows. As we can see, the average MAPE remains low (i.e.
< 5%) for horizons up to 35 epochs, with the only exception of an 8% MAPE at 40 epochs, which is still relatively low. This indicates that in scenarios with stationary flows like this, our model can work very well even for long time horizons.

Figure 3.11: VLSTM predictions for the size of a /32 flow from the Mininet simulation (epoch duration = 5 seconds).

Figure 3.12: VLSTM average MAPEs for various prediction horizons across the /32 flows of the Mininet simulation (epoch duration = 5 seconds).

3.8 Related Work

The problem of modeling network flow time series is not new in the relevant literature. Most of the previous works have focused on modeling the aggregate size of a number of flows over time windows of several minutes [54, 55, 56]. These models are traditionally good for coarse-grained traffic matrix predictions, since they leverage long-range dependencies in order to predict how the overall volume seen by an observation point will behave in the future. On the other hand, there have been some efforts on modeling aggregated flow sizes at shorter time scales, such as [57, 58, 59], but none of them provides a generic and scalable solution for modern networks, since most of these works rely on traces that are more than 15 years old and differ significantly from modern traffic dynamics. In addition, the flow datasets used were significantly different from today's due to the significantly lower network speeds, the limited amount of multimedia traffic (e.g. video, VoIP, etc.), and the different traffic dynamics overall. Second, in the case of more recent examples such as [60], only aggregated traffic at the link (port) level was modeled at large timescales (i.e.
15 minutes), which differs significantly from what our framework aims to model.

3.9 Conclusion

In this chapter, we presented several variations of LSTM that can effectively model backbone network traffic. In addition, we presented a big data architecture where analysis like this can be conducted at scale by processing large PCAP files containing packet-level network logs. The results obtained validate the hypothesis that LSTM is a good candidate for network traffic modeling, especially when the time series are clustered together into similarity groups. In addition, we studied the effects of the prediction horizon of our model and illustrated that the prediction errors do not grow rapidly, and thus long time horizons can be used in practice.

Chapter 4

Link-Level Network Traffic Modeling

In this chapter, we continue our modeling study of real network traffic from CAIDA and investigate how accurate our models can be when the traffic is aggregated over physical links. In an SDN environment, link statistics are available without additional rules, so our proposed framework can be easily deployed in SDN to forecast future link throughputs and Traffic Matrices (TMs) in general.

4.1 Introduction

Link-level network traffic predictions at short time-scales can provide information that is crucial for various network management tasks such as traffic engineering, failure recovery, anomaly detection, performance diagnostics, load balancing, and Traffic Matrix (TM) estimation [61, 39, 40, 41, 42, 18, 38, 44]. For example, in order for the network to react to a congestion event early on and prevent packet losses and increased delays, link-level predictions can be used to reroute traffic accordingly without having to wait for the routing protocols to react with delay [60]. Another example is the use of predictions as baselines of normal behavior for the near future, and the treatment of any measured deviations from the baselines as detected anomalies [62, 63]. Finally, predictions can be used to
Finally, predictions can be used to 47 substitute traffic measurements whenever the available resources are limited and network telemetry cannot be performed deterministically [50]. The prediction models that have been proposed in the relevant literature have been designed for large aggregation time-windows (i.e. 15 minutes in the vast majority of the cases) due to the challenges in regards to the very volatile nature of network traffic in smaller time scales that makes predictions hard. Another factor that poses significant challenges for the evaluation of statistical models in smaller time scales is the lack of structured datasets that the research community can easily use to better understand the modern network traffic. The development of Software-Defined Networks (SDN) has enabled easier traffic measurement at short time-scales by leveraging the capabilities of the OpenFlow enabled switches to provide a centralized controller with statistics for each forwarding rule installed, as well as the traffic send/received by each link. This, makes SDN an ideal solution for bringing more visibility into the network and provide prediction models with the required data. In this chapter, we present an analysis of link-level aggregated network traffic at various aggregation levels and short time scales, and propose the use of state-of- the-art deep-learning models for time series predictions, namely Long Short-Term Memory (LSTM), in order to generate accurate traffic predictions. Due to the lack of publicly available structured datasets from both SDN and legacy networks at these time scales, we use the framework introduced in Section 3.4 that can process and aggregate traditional packet level logs (i.e. PCAP files) at scale, and apply it to a relatively recent dataset from a Tier-1 ISP provided by CAIDA in [45]. The contributions of our work are summarized below: 48 1. 
We present an in-depth analysis of the capabilities of LSTMs to model link-level aggregated network traffic at various aggregation levels and short time scales, which can enable short-term decision making, unlike the easier-to-predict longer time-scales that traditional TM estimation frameworks are based on.

2. We propose a prediction framework that can be easily deployed to production SDN topologies without requiring any switch modifications, since it leverages the port statistics that are available through the OpenFlow API in order to train the LSTM models.

3. We evaluate our framework using real backbone network traffic that was captured relatively recently (i.e. 2016) and which contains new traffic dynamics that were unavailable in relevant studies two or more decades ago, since in recent years multimedia content, social networks, mobile devices, smart TVs, and more have completely changed the traffic landscape.

4. We provide a comparison of several variations of LSTM, including vanilla LSTM, delta LSTM (which models the consecutive link throughput deltas), and multi-variate LSTM (which models all the link throughput time series at once, thus taking into account potential correlations), and compare with three versions of ARIMA-based models that have traditionally been used for link-level network traffic modeling at these time scales [58, 59].

4.2 Background & Motivation

Before presenting our analysis, we introduce some additional definitions that will be used throughout this chapter, as well as the Auto-Regressive family of models that will be used as baselines.

4.2.1 Link Throughput Time Series

Since network links act as flow aggregators, we can extend Definitions 2.1.1 and 2.1.2 to the case of directional link throughputs by summing the time series of the individual network flows that a link carries according to a routing protocol.
For this, consider a set of flow-rate time series $F_i$ (potentially mega-flows) of length n (zero-padding can be used to achieve equal n across time series), and a set of N switches that are connected with L directional links and forward traffic according to a routing matrix R with elements r(j,i) that represent the fraction of traffic of flow i that is forwarded through link j (assuming links have unique identifiers). Then:

Definition 4.2.1. A link-throughput time series for link j, or simply a link time series, $L_j = l_j^{(1)}, l_j^{(2)}, \ldots, l_j^{(n)}$ is an ordered set of n real-valued variables such that:

$$l_j^{(t)} = \sum_{\substack{i \in \{1,\ldots,N\} \\ r_{j,i} > 0}} f_i^{(t)}, \qquad (4.2.1)$$

for $t \in \{1,\ldots,n\}$.

Definition 4.2.2. Given a link-throughput time series $L_j$ of length n, a link-throughput subsequence $L_j^{(p)}$ of $L_j$ is a sampling of length $w < n$ of contiguous positions from $L_j$, that is, $L_j^{(p)} = l_j^{(p-w+1)}, l_j^{(p-w+2)}, \ldots, l_j^{(p)}$ for $w \le p \le n$.

Figure 4.1: SDN measurement and prediction architecture.

A link-throughput time series and all its subsequences can be easily derived by collecting link-level measurement data using the port statistics of any OpenFlow-enabled switch. Of course, in hybrid deployments (i.e. mixed SDN and legacy topologies), a combination of both OpenFlow and SNMP statistics can be used in order to build a more complete picture of the network. In this chapter, and without loss of generality, we focus on the SDN use-case only, which can be summarized in the architecture shown in Figure 4.1. As we can see, in order for an SDN controller to perform a given management task, it retrieves directional link counters from OpenFlow switches and stores them in a database in order to calculate the desired subsequences that will be used by the prediction server.
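Definition 4.2.1 translates directly into code: a link's series is the element-wise sum of the flow series that the routing matrix sends over it. A small illustrative sketch (the names are ours), assuming the routing matrix is given as a list of rows indexed by [link][flow]:

```python
def link_series(flows, routing, link):
    """l_j^(t) = sum of f_i^(t) over all flows i with r_{j,i} > 0 (Eq. 4.2.1)."""
    n = max(len(f) for f in flows)
    total = [0.0] * n                # zero-padding: shorter series contribute 0 past their end
    for i, f in enumerate(flows):
        if routing[link][i] > 0:     # flow i is routed over this link
            for t, v in enumerate(f):
                total[t] += v
    return total
```

A subsequence in the sense of Definition 4.2.2 is then just a slice `total[p - w:p]` of the resulting list.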
In order for the SDN controller to calculate the link-throughput time series, it collects the cumulative incoming/outgoing traffic counters of each switch port and converts them to differentials using the counters from the previous measurement epoch. Finally, the layer that stores, analyzes, models, and predicts the traffic can also be used to enable a knowledge plane, as described in [2], and can further support network-wide data-driven decision making. In our modeling study, or when implementing a prediction server as described above, we use a subsequence $L_j^{(p)}$ with $p = n/2$ and $w = n/2$ data points for model training, and a subsequence $L_j^{(p)}$ with $p = n$ and $w = n/2$ for model testing. In this chapter, n is defined as the full length of the dataset (described in more detail in Section 4.3) after the traffic has been grouped into time epochs. In a production deployment, n can be capped using a rolling time-window (i.e. the last n samples).

4.2.2 Auto-Regressive Time Series Modeling

The Auto-Regressive (AR) family of models has been used extensively for modeling time series, including in the context of computer networks for traffic prediction. Some examples can be found in [54, 64, 65]. More formally, let $X = X_1, X_2, \ldots, X_n$ be a time series with $X_i \in \mathbb{R}$. Then, an AR model of order p, or simply AR(p), is defined as [66]:

$$X_t = c + a_1 X_{t-1} + \cdots + a_p X_{t-p} + \epsilon_t, \qquad (4.2.2)$$

where c is a constant, $a_1, a_2, \ldots, a_p$ are the parameters (real numbers) of the model, $\epsilon_t$ is the error term, which is usually white noise, i.e. $\epsilon_t \sim N(0, \sigma^2)$, and $X_t$ is the prediction of the model at time t.

Similarly, a Moving Average (MA) process of order q, or simply MA(q), is defined as:

$$X_t = \mu + \epsilon_t + \beta_1 \epsilon_{t-1} + \beta_2 \epsilon_{t-2} + \cdots + \beta_q \epsilon_{t-q}, \qquad (4.2.3)$$

where $\mu = E[X]$ is the mean of X, $\epsilon_t$ corresponds to the (white) noise term, and $\beta_1, \beta_2, \ldots, \beta_q$ correspond to the real parameters of the model.
The Auto-Regressive Moving Average (ARMA) model of orders p and q is then defined by combining the AR(p) model with the MA(q) model as follows:

$$X_t = c + \epsilon_t + \sum_{i=1}^{p} \alpha_i X_{t-i} + \sum_{i=1}^{q} \beta_i \epsilon_{t-i}, \qquad (4.2.4)$$

where c is a constant and $\epsilon_t$ the white noise term.

The main disadvantage of the ARMA model is that it assumes stationarity of the underlying time series X. However, since this is usually not the case in practice, a common approach is to differentiate the initial time series (i.e. calculate the delta of each pair of consecutive values) d times, in order to produce a stationary version of the initial time series. This way, the Auto-Regressive Integrated Moving Average (ARIMA) model is defined using the three parameters (p,d,q), where p and q are defined from their respective processes above (i.e. AR and MA), and d corresponds to the degree of differentiation of the original time series required to convert it to stationary. More formally, ARIMA(p,d,q) is defined as follows:

$$X_t = c + \alpha_1 X_{t-1}^{(d)} + \alpha_2 X_{t-2}^{(d)} + \cdots + \alpha_p X_{t-p}^{(d)} + \epsilon_t + \beta_1 \epsilon_{t-1} + \beta_2 \epsilon_{t-2} + \cdots + \beta_q \epsilon_{t-q}, \qquad (4.2.5)$$

where $X^{(d)}$ denotes the time series X differenced d times.

Figure 4.2: The Google B4 [3] network topology (black) with attached ingress (yellow) and egress (orange) switches.

4.3 Aggregated Network Traffic Analysis

One of the biggest challenges in modeling aggregated network traffic is the ever-changing nature of the content generation and consumption landscape, which can make old datasets obsolete almost every decade. For example, the constant consumption of multimedia content, the connected TVs, the frequent web browsing from mobile devices, and the file-sharing applications are examples that were barely available more than a decade ago. So, all these new traffic types and patterns need to a) be better understood, and b) be modeled accurately using a scalable architecture.

Figure 4.3: The B4 network topology (black) with attached ingress (yellow) and egress (orange) nodes and their respective senders/receivers.
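For intuition, the AR(p) core of these baselines (Eq. 4.2.2) can be fit with ordinary least squares. The sketch below is ours and is not the ARIMA implementation used in the evaluation:

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares estimate of c and a_1..a_p in
    X_t = c + a_1 X_{t-1} + ... + a_p X_{t-p} + eps_t (Eq. 4.2.2)."""
    x = np.asarray(x, dtype=float)
    # one design-matrix row per target X_t: [1, X_{t-1}, ..., X_{t-p}]
    rows = np.array([[1.0] + list(x[t - p:t][::-1]) for t in range(p, len(x))])
    coef, *_ = np.linalg.lstsq(rows, x[p:], rcond=None)
    return coef[0], coef[1:]          # intercept c, lag coefficients a_1..a_p

def ar_forecast(c, a, recent):
    """One-step-ahead forecast; `recent` holds the last p values, newest first."""
    return c + sum(ai * xi for ai, xi in zip(a, recent))
```

The d and q terms of ARIMA(p,d,q) then correspond, respectively, to differencing the input d times before fitting and to adding the MA(q) regression on past noise terms, as in Eq. 4.2.5.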
The available traffic matrix datasets also suffer from the issues described above (i.e. ≥ 15 min time scales, from data collected in 2004 or 2006 [67, 68]). In addition, in such cases, network traffic exhibits hourly, daily, and weekly periodicities, which makes it easier to predict [60]. In our study we use the dataset from [45], which was captured in 2016 and was pre-processed at the flow prefix level using our proposed framework from Section 3.3. In order to analyze the characteristics of the traffic when grouped together into network links, we proceed to implement Google's B4 backbone topology from [3] (we also tried random topologies, with very similar results) and assign the source and destination prefixes from the CAIDA trace to ingress and egress routers as shown in Figure 4.2. Specifically, we group the source and destination IPs of each micro-flow into one of the following four aggregation levels, i.e. /8, /10, /12, and /14, at both the source and the destination IP, and assign these aggregated prefixes at random to one of the 3 ingress and egress nodes shown in Figures 4.2 and 4.3. To calculate the link-throughput subsequences observed by each link in our topology, we assign source and destination prefix pairs to links using shortest-path routing from each source to its destination. This way, each link observes various traffic aggregations, simulating a real network scenario. In order to better understand the effects of such aggregation, we present here some samples of the aggregated time series and their autocorrelation coefficients for various lags, mask sizes, and epochs. The resulting time series are shown in Figures 4.4(a)-(d), from where we can see that the time series exhibit very different behaviors when it comes to absolute magnitude, burstiness, and overall trend.
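The autocorrelation coefficients discussed next can be computed per link with a few lines of numpy (our sketch, not the analysis scripts); averaging the per-link results then gives curves like those in Figure 4.5:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation coefficients of one link series for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # work with deviations from the mean
    denom = float(np.dot(x, x))            # lag-0 term used to normalize
    return [float(np.dot(x[:-k], x[k:])) / denom for k in range(1, max_lag + 1)]
```

The average across links is simply `np.mean([acf(series, K) for series in links], axis=0)` for the desired maximum lag K.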
In Figures 4.5(a) and 4.5(b) we also show the average autocorrelation across links for epoch sizes of 15 and 30 sec, respectively (due to the large number of directed links, an average across links was calculated). The epochs omitted here exhibit behavior similar to that in 4.5(a). As we can see, the traffic aggregation reduced the ACF for all the time scales except the 30-second one, in which we can see high autocorrelations for the first 10-20 lags. This can pose challenges to models that are not capable of capturing such autocorrelations, unlike LSTMs, as we will see in the next section.

Figure 4.4: Sample link time series for various mask sizes and measurement epochs (panels (a)-(d): /8 at 5 sec, /8 at 10 sec, /8 at 15 sec, and /10 at 30 sec).

4.4 Methodology

Similar to our approach in Chapter 3, we apply several transformations to the data, depending on their distribution, in order to achieve certain properties. In this work, we apply the following transformations in various combinations in order to assess their effectiveness: a) normalize the time series, and b) model the deltas (i.e. $f_i^{(j)} - f_i^{(j-1)}, f_i^{(j-1)} - f_i^{(j-2)}, \ldots$) instead of the actual values.
The transformations aim to help the model focus on the relative feature importance rather than the absolute magnitudes.

Figure 4.5: Average Autocorrelation Coefficients (ACFs) for various lags of the aggregated trace, for various mask sizes and for measurement epochs of 15 and 30 sec.

In order to evaluate the ability of LSTMs to model the network traffic described in Section 4.3, we assess their effectiveness by calculating the Mean Absolute Percentage Error (MAPE), as defined in Equation 3.6.1.

4.5 Evaluation

We implemented three variations of LSTM using Keras [51] and Tensorflow [52], as well as three variations of ARIMA baselines [66]. In order to choose the best-performing hyper-parameters, we implemented random search [69], where random combinations of the model parameter values are chosen (i.e. look-back window, number of units, number of layers, dropout ratio, epochs, and batch size for LSTM, and parameters (p,d,q) for ARIMA) and the one that performs best is finally selected. The details of each model are shown below:

1. Vanilla LSTM (VLSTM): This is a simple LSTM architecture with an LSTM layer with 50 units, followed by a dense layer with 50 units, dropout of 20%, a look-back window of 3, 50 training epochs (not to be confused with the aggregation epoch used during dataset creation), batch size 4, and a standard scaler on the time series data. The model was trained using a 50-50 train/test split.
We model each link's traffic separately and calculate an average of all the MAPE errors across links.

2. Delta LSTM (DLSTM): This is exactly the same architecture as in 1) above, with the only difference that the input data have been pre-processed to calculate the time series deltas. The model was used to model each link's traffic separately and calculate an average of all the MAPE errors across links.

3. Multivariate LSTM (MLSTM): This is an LSTM architecture that consists of an LSTM layer with 3×(#links) units (for the last 3 observations of each link), followed by a dense layer of #links units, dropout of 30%, 50 training epochs, batch size 4, and a standard scaler on the time series data. This model produces predictions for all the links at once.

4. ARIMA model (ARIMA): This is an ARIMA model with p = 3, q = 0, and d = 0 parameters, which was found to perform well for a variety of traces. The parameter p corresponds to the number of lag observations included in the model, d corresponds to the number of times that the input data are differenced, and q corresponds to the size of the moving average window.

5. Delta ARIMA model (DARIMA): This is an ARIMA model with p = 3, q = 0, and d = 1 parameters that implements differentiation to improve stationarity.

6. First-Order Autoregressive ARIMA (AR1): This is an ARIMA model with p = 1, q = 0, and d = 0 parameters that generates predictions based on the last value seen.

We run each model 10 times to calculate the final MAPE averages. For the models that operate on each link separately, we calculate the average MAPE across links. All the model training times range between 5-30 seconds on a server with 32 GB of RAM and an Intel CPU with 16 cores. The average MAPEs for each model across all epochs are shown in Figures 4.6(a)-(d).

Figure 4.6: Average MAPE for each model (ar1, arima, darima, mlstm, vdlstm, vlstm) across various mask sizes and epoch durations (panels (a)-(d): epochs of 5, 10, 15, and 30 sec).

As we can see from the figures, all the LSTM models perform much better than the ARIMA ones in all the scenarios under consideration, which validates our hypothesis that LSTM is a good candidate for link-level traffic predictions at time-scales below 30 seconds. The difference becomes much bigger (i.e. more than 2x) for all time scales in the case of the /14 network assignment, due to the more challenging nature of the resulting aggregation, which separated volatile low-rate time series from non-volatile high-rate ones. The same pattern, at a smaller scale, is evident for smaller aggregation masks (i.e. /8, /10, /12). The time epoch also played an important role in model performance, with larger time epochs producing better results for all the models, which is expected due to the reduced variance of the traffic at higher aggregation scales. Among all the LSTM models, we can observe that VLSTM (per link) performs best overall, followed by VDLSTM and MLSTM. The reason for this is that VLSTM and VDLSTM are optimized for each link separately, whereas MLSTM underfits when trying to capture the traffic patterns across the whole network. Finally, VDLSTM performs relatively similarly to VLSTM, especially for higher epoch sizes. In the future, we are planning to further optimize MLSTM and use more data in order to reduce underfitting.
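The per-link averaging used in this evaluation can be sketched as follows; we assume MAPE has the usual percentage form of Equation 3.6.1 and skip zero actuals, which were filtered out of the analysis anyway:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error over one link's test series."""
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(terms) / len(terms)

def avg_mape(per_link_results):
    """Average the per-link MAPEs, as reported for the per-link models."""
    scores = [mape(a, p) for a, p in per_link_results]
    return sum(scores) / len(scores)
```

Each (actual, predicted) pair here is one link's test subsequence and the corresponding model output; the reported figures further average these scores over 10 runs.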
Finally, we can observe that all three ARIMA models exhibit very similar performance. Based on the above results, we can conclude that per-link LSTMs provide the best results, with quick training times (≤ 30 seconds) and up to a 2x reduction in MAPE compared to the ARIMA models.

4.6 Related Work

The problem of modeling network flow time series is not new in the relevant literature. Most of the previous works have focused on modeling the aggregate size of a number of flows over time windows of several minutes [54, 55, 56, 70, 71, 72, 65]. On the other hand, there have been some efforts on modeling aggregated flow sizes at shorter time scales, such as [57, 58, 59, 70, 64, 73], but none of them provides a generic and scalable solution for modern network traffic modeling at the prefix or link level, since most of these works rely on traces that are more than 15 years old and differ significantly from modern traffic dynamics, or they model the flows as an aggregate across the whole network. Of these works, [70] is the most relevant to our approach. However, [70] has some major differences, as follows: a) it uses the 2004 GEANT/ABILENE datasets (we use the 2016 CAIDA dataset), which contain 15-minute aggregates that are easier to model and not suitable for short-term/fast decision making; b) it also uses a small dataset with 5-second intervals, collected from a single link sending artificial traffic between 2 virtual machines and containing only 25K packets (we use 1.65 billion packets from a real ISP), and thus it cannot provide any conclusions about backbone traffic modeling at short time-scales. In addition to the above, our previous works in [50, 74] use small time-scale predictions as we do here. However, neither of them models network traffic at the link level, nor do they apply multivariate models as a way to capture network-wide correlations.
4.7 Conclusion

In this chapter, we presented several variations of LSTM that can effectively model backbone network traffic at the link level and for various time-epochs, and compared them with several ARIMA baseline models. The results obtained look very promising and validate the hypothesis that LSTM is a good candidate for link-level network traffic modeling.

Chapter 5

Prediction-Assisted Software-Defined Measurement

In this chapter, we present DeepFlow, a framework for scalable software-defined measurement that relies on an efficient mechanism that a) adaptively detects the most active source and destination prefixes in the network, b) collects fine-grained flow-size measurements for the most active prefixes and coarse-grained ones for the less active prefixes, and c) uses historical measurements in order to train a Long Short-Term Memory (LSTM) model that can be used to provide short-term predictions whenever exact flow counters cannot be placed at a switch due to its limited resources. Thus, the number of fine-grained flows measured can increase significantly without the need to use other flow sampling solutions that suffer from low accuracy (since flows can get missed).

5.1 Introduction

As we saw in the previous chapters, the availability of fine-grained network traffic measurement is necessary in order to provide the information required for a large variety of network management and optimization tasks [39, 40, 41, 42, 43, 18, 38]. Software-Defined Networking (SDN) is an emerging network architecture that decouples the control plane from the forwarding plane. OpenFlow [16] has provided a standard centralized mechanism for a network controller to install forwarding rules in the flow tables and retrieve statistics for a given flow (defined, for example, by a source and destination IP prefix), such as the total bytes or packets transferred.
This can enable a wide variety of fine-grained measurement tasks that can increase visibility into the network, as long as the forwarding rules are not so broad that they match multiple flows. However, in practice, the TCAM memory of the switches is fundamentally limited in size due to its high cost and power consumption. This causes the vast majority of SDN hardware vendors to limit the TCAMs to fewer than 4K L2/L3 rules, which is much smaller than the tens or hundreds of thousands of micro-flows that an SDN-enabled switch can concurrently forward [32], [33], [34]. Thus, more efficient mechanisms are needed in order to enable fine-grained traffic measurement. Prior work on SDN traffic measurement has either assumed specialized hardware support (e.g. sketches) at the switches [37, 36, 35], or it has focused only on specific measurement tasks such as heavy-hitter detection or anomaly detection [75, 44]. Moreover, in the context of TM estimation for SDN, prior work has either assumed that TCAMs have enough capacity to fit all the monitoring rules for all the flows of interest ([33, 76]), which is not the case in practice, or, when not, the top K (K being limited by the total TCAM space available) most important flows are measured with exact-match rules and the rest are measured in aggregate (e.g. [77, 78]). However, such mechanisms cannot be used to provide a universal low-level view of the network. In this chapter, we introduce DeepFlow, a framework for fine-grained software-defined measurement that can be directly deployed to production hardware switches and which leverages scalable machine learning predictions and an efficient active flow detection engine in order to provide visibility into the network traffic. DeepFlow operates at the control layer and leverages four main observations.

1.
A given flow (or group of flows) as defined by a source and destination IP prefix can be measured in any of the switches that it traverses throughout the network, and its exact measurement location can be optimized such that the total number of flows measured is maximized (a similar approach can be applied for a 5-tuple flow definition). 2. When network flows are measured in aggregate over a period of few seconds or more, then the short-term variation that individual flows might exhibit in small time scales is averaged out, and thus the aggregated flow can be modeled more accurately by an efficient time series prediction model. 3. If we periodically use flow rate predictions for some of the flows in the net- work, then we can free up TCAM space and let the controller measure other unexplored regions of the IP space, thus resulting in more in-depth view of the network. 4. If the traffic dynamics in the network changes significantly at a given epoch and that portion of the IP space is not measured with exact flow measure- ments during that epoch, then we can detect the change and install more fine-grained measurement rules by correlating the switch port statistics data (which are available "for free" in each epoch) with the routing information such that we detect with high probability which flow prefixes caused the traffic change. 66 Contributions: The main contributions of our proposed work are summarized below: 1. We introduce an algorithm that can adaptively detect active flow prefixes in the network that generate most of the traffic and should be monitored separately. 2. We introduce a framework to optimize the measurement location of each flow in order to increase parallelism and speedup the measurement process. 3. We implement the scalable framework for modeling diverse flow time series using Long Short-Term-Memory (LSTM) models from Chapter 3. 4. 
We introduce a framework for combining predictions with real measurements in order to increase the visibility into the network, while at the same time allow the controller to measure other unexplored areas of the IP space. One of the main advantages of DeepFlow is that it supports the Knowledge- Defined Networking (KDN) paradigm [2] according to which SDN can be used to enable a knowledge plane on top of the controller which can then be used for traffic predictions and forecasting, as well as other network management tasks that rely on high resolution flow rate data. This makes it quite different than other previous works that either only focus on detecting large flows, or performing a specific measurement task only. In addition, DeepFlow can be used to a) achieve more efficient traffic engineering that can increase the link utilization, b) protect critical traffic from congestion, c) improved load balancing, d) improve admission control, e) reduce power consumption in data centers or ISP networks by turning 67 off selectively oversubscribed infrastructure components, and f) improved security even during low volume attacks (due to the improved network visibility). 5.2 Background & Motivation According to the definition in [5] "Network management includes the deploy- ment, integration, and coordination of the hardware, software, and human ele- ments to monitor, test, poll, configure, analyze, evaluate, and control the network and element resources to meet the real-time, operational performance, and Quality of Service requirements at a reasonable cost". In this chapter, we focus on the measurement aspect of network management that can enable the in-depth moni- toring of the network while satisfying the requirements for real-time operation and reasonable cost. 
5.2.1 Measurements in Software-Defined Networks

SDN switches that support the OpenFlow specifications [16] can provide a variety of traffic counters for each flow table, flow entry, port, queue, group, group bucket, meter, and meter band. The counters primarily count the total number of bytes and packets received or transmitted, as well as the duration (in seconds) that these counters correspond to. The counters can be retrieved by the SDN controller depending on the needs of each measurement task. From the list of counters provided above, in this work we focus on the flow entry and port counters only, since a) these are the most fundamental ones and are supported by all the major switch vendors, and b) they do not require any complex switch configuration in order to be implemented.

Figure 5.1: An example of a source IP prefix trie with three potential measurement configurations that provide various granularities: (a) 1 TCAM rule used, measure(**) = 12 Mbps (no fine-grained measurement granularity); (b) 2 TCAM rules used, measure(0*) = 7 Mbps and measure(1*) = 5 Mbps (still no fine-grained measurement granularity); (c) 2 TCAM rules (the leaf node with higher priority) and 1 prediction used, measure(** − 00) = 9 Mbps, measure(00) = 3 Mbps, infer(01) = 1 Mbps, predict(1*) = 8 Mbps (more fine-grained measurement granularity). When predictions are used, the granularity level can be increased for the same TCAM utilization.

Even though the basic idea behind SDN measurement is simple (as introduced in Chapter 2), there are various measurement methodologies one can use, with a wide range of complexities. To get a better idea, we illustrate in Figure 5.1 three examples of flow measurements with a) low TCAM utilization, b) higher TCAM utilization, and c) higher TCAM utilization combined with flow predictions.
Specifically, in Figure 5.1 (a) we illustrate a 2-bit source prefix trie and assume that the forwarding rules at the SDN switch are of the form {src_ip=(**), dst_port=1}. In such a case, the statistics the switch stores about the single forwarding rule can cover up to 4 micro-flows (i.e., 00-11) and are not fine-grained. In Figure 5.1 (b), the same rule space from (a) is represented by partitioning the root prefix into the following two: {src_ip=(0*), dst_port=1} and {src_ip=(1*), dst_port=1}. In such a case, with two mega-flow rules we can achieve higher granularity, since each rule now stores statistics matching two micro-flows (i.e., 00, 01 and 10, 11). In Figure 5.1 (c), the same rule space from (a) is represented by combining the high-priority micro-flow rule {src_ip=(00), dst_port=1, priority=0} with a low-priority root prefix rule {src_ip=(**), dst_port=1, priority=1}. This way, only the traffic matching rule 00 is counted in the measurement of the first rule, whereas the rule at the root prefix counts everything but 00. As we can see, if we are able to enhance the measurement process with a prediction for prefix 1* (as an example) that models two micro-flows, then the size of flow 01 can be inferred using simple flow algebra, thus leaving only two out of the 4 micro-flows without an exact low-level measurement. The process described above can be generalized, as we will see in the following sections, to the 2-dimensional prefix plane (i.e., source and destination IP), and depending on the prediction capabilities available, we can achieve any desired granularity level. In addition, a spatial component can be added to the process to capture the location (i.e., switch) where a flow can be measured, which allows more flows to be measured concurrently for a given number of available TCAM rules.
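The flow algebra of Figure 5.1 (c) can be illustrated with a short sketch. The counter values are the ones from the example above; the helper name infer_unmeasured is purely illustrative:

```python
# Illustrative sketch of the flow algebra in Figure 5.1 (c).
# The low-priority root rule (**) counts everything EXCEPT the
# higher-priority exact rule for prefix 00.

def infer_unmeasured(root_counter, predictions):
    """Infer the leftover traffic the root rule saw but no prediction covers."""
    return root_counter - sum(predictions.values())

root_minus_00 = 9.0            # measure(** - 00) = 9 Mbps
exact = {"00": 3.0}            # measure(00) = 3 Mbps (high-priority rule)
predicted = {"1*": 8.0}        # predict(1*) = 8 Mbps (LSTM prediction)

# Traffic of prefix 01 = (everything but 00) - (predicted 1*) = 1 Mbps.
flow_01 = infer_unmeasured(root_minus_00, predicted)
```

Note that the two counters together still account for the full 12 Mbps of Figure 5.1 (a), so the inferred value is consistent with the aggregate.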
5.3 DeepFlow Design Overview

Based on the ideas presented in the previous section, we developed DeepFlow, a software-defined measurement framework that uses predictions to enhance the measurement process whenever exact measurements are not feasible, and which can be used by any application that leverages fine-grained traffic measurements. In this section, we present the main components of DeepFlow, as shown in Figure 5.2. In the analysis presented below, a flow is defined as the traffic between a source and destination IP prefix. However, the same concepts can be applied to higher dimensions (e.g., 5-tuples).

Figure 5.2: DeepFlow architecture components.

Measurement Tasks: In order to start collecting DeepFlow measurements, the network operator needs to specify the following input parameters: 1) the source and destination IP prefixes of interest (if we want to monitor the whole network, DeepFlow automatically calculates the union of the source and destination IP prefixes covered by the rules in all the TCAMs and uses this as an input, e.g., (54/8, 64/8)); 2) the maximum flow granularity for the source and destination prefixes, expressed by a subnet mask, e.g., (/24, /24); and 3) the minimum threshold θ above which the traffic of a given prefix is considered significant (e.g., 1 Mbps). If no threshold is provided, then all the flow sizes up to the prefix granularity specified above will be considered. The threshold θ can also be defined in terms of the total available bandwidth in the switches, e.g., 0.001% of the total bandwidth. Throughout this chapter, we will be using the measurement task definition introduced in Definition 2.4.1.
Prediction Engine & Measurement Database: DeepFlow's prediction engine uses historical data stored in the cloud in a measurement database (a data retention policy can be used that depends on the application needs) in order to train a deep neural network (deep in time) that is then used to generate predictions about the traffic of a given flow prefix. The prediction engine is exposed through an API to the DeepFlow scheduler. When a new prediction is needed, the DeepFlow scheduler makes a call to the API by providing the source and destination IP prefixes. The prediction engine then pulls the historical data from the measurement database and generates a prediction for the next epoch. It is important to note that it is not necessary to maintain a separate model for each flow pair since, as we will see in the subsequent sections, network flows exhibit similar traffic patterns that can be grouped together by subnet size, application, time of day, etc.

Flow-Size Change Detection Engine: In order to guarantee that the prediction engine does not produce large estimation errors in cases of sudden traffic changes in the network (e.g., a DDoS attack or other traffic anomalies), DeepFlow constantly monitors the aggregated volume of all the links and uses the Flow-Size Change Detection Engine to detect which prefixes are responsible for the volume change. The prefixes detected are then monitored in the next epoch with new measurement rules, instead of using predictions.

DeepFlow Scheduler & Measurement Engine: The DeepFlow Scheduler, given a set of measurement tasks provided by the network administrators, decides where and when to install exact flow measurements, as well as for which flows to use model predictions. For this, the scheduler uses the Measurement Engine, which takes care of adding or deleting flow counter rules, as well as retrieving flow measurements from the TCAMs.
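The shape of the prediction-engine API can be sketched as follows. This is an illustrative stand-in: the real engine serves per-cluster LSTM models, whereas a moving average is used here only to show the call flow (database lookup, history window, next-epoch value); all names are hypothetical:

```python
# Hypothetical sketch of the prediction-engine API. The thesis uses LSTM
# models; a moving average stands in here purely to illustrate the
# interface: the scheduler passes a (src, dst) prefix pair, the engine
# pulls the history from the measurement database and returns a
# next-epoch rate prediction.

class PredictionEngine:
    def __init__(self, measurement_db, window=3):
        self.db = measurement_db      # {(src_prefix, dst_prefix): [rates...]}
        self.window = window          # regression window w

    def predict_next_epoch(self, src_prefix, dst_prefix):
        history = self.db[(src_prefix, dst_prefix)]
        recent = history[-self.window:]
        # In the real system this would be an LSTM forward pass.
        return sum(recent) / len(recent)

db = {("54.0.0.0/8", "64.0.0.0/8"): [10.0, 12.0, 11.0, 13.0]}  # Mbps per epoch
engine = PredictionEngine(db, window=3)
rate = engine.predict_next_epoch("54.0.0.0/8", "64.0.0.0/8")   # -> 12.0
```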
For the rest of the flows, the DeepFlow scheduler uses the prediction engine, which predicts the flow sizes for the next epoch. In the following sections, we formulate DeepFlow and discuss the optimization steps that it takes in order to maximize the number of flows monitored.

5.4 Prediction-Assisted Measurement

5.4.1 Active Flow Prefix Detection

One of the main challenges in SDN measurement is the fact that the active flows in the network are not known to the controller a priori, nor is their relative importance. The reason is that, due to the limited size of the TCAM, wild-card rules are often used that can potentially match many flows (e.g., based on the destination prefix), and thus the traffic statistics of these rules cannot reveal exactly which fine-grained flows they belong to. For this, we developed the Active Flow Prefix Detection Algorithm (AFPDA), which operates on the 2-dimensional IP space (i.e., source and destination IP prefix) and iteratively detects the most active prefixes in the network. When AFPDA starts, it retrieves all the forwarding rules from all the switches, as well as information regarding the ingress switches in the network (provided as an input by the network administrator during setup).

Figure 5.3: An example of the set of rules at a switch, represented as shaded areas in a two-dimensional plane. The red (dashed) lines in R5 correspond to the rule splitting AFPDA performs in order to detect the two active flows in the area of R5 (i.e., the two circles).

The next step of AFPDA is to start the flow zoom-in process, where it tries to locate source/destination IP prefixes that have large volumes and need to be further split into longer active prefixes that can be measured separately. For this, AFPDA installs TCAM rules that have higher priority and differ only in the length of the source and destination IP prefixes (i.e., they are essentially subsets of them) compared to the initial rule, while keeping the rest of the rule values the same.
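The one-bit prefix splitting behind this zoom-in step can be sketched as follows. The helper names split_one and split_prefix are illustrative, and prefixes are shown as binary strings bounded by the task's maximum depths d_s and d_t:

```python
# Illustrative sketch of the splitting step in AFPDA's zoom-in process.
# A (source, destination) prefix pair is split by one bit per dimension,
# producing up to four longer 2-D sub-prefixes.

def split_one(bits, length, max_len):
    """Split a binary prefix string by one bit, unless max depth is reached."""
    if length >= max_len:
        return [(bits, length)]
    return [(bits + "0", length + 1), (bits + "1", length + 1)]

def split_prefix(src, src_len, dst, dst_len, d_s, d_t):
    """Return the 2-D sub-prefixes of a (source, destination) prefix pair."""
    return [
        (s, sl, d, dl)
        for s, sl in split_one(src, src_len, d_s)
        for d, dl in split_one(dst, dst_len, d_t)
    ]

# Splitting (0*, 1*) with d_s = d_t = 2 yields four sub-prefix pairs.
children = split_prefix("0", 1, "1", 1, 2, 2)
```

Each child would be installed as a higher-priority rule so that the parent rule only counts the leftover traffic.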
So, part of the rule's matching traffic will be offloaded to the new sub-rule, thus allowing more fine-grained measurement. By following the steps outlined in Algorithm 5.1, AFPDA splits the source/destination prefix plane into smaller regions iteratively, until further splitting measures flows with smaller volume than the minimum threshold θ, or until the maximum prefix lengths d_s, d_t have been reached (defined by the input task). If no threshold θ has been specified, the process stops when the maximum prefix lengths for the source and destination prefixes have been reached. This process is illustrated in the example of Fig. 5.3, where the rule space for rule R5 is split twice until two flows (the two circles) were detected that exceeded the threshold θ, with a maximum zoom level of /31 in this case.

Algorithm 5.1: ActiveFlowPrefixDetection
  Data: d_s, d_t, θ, F_w  // task & forwarding rules
  Result: List of active leaf nodes L_a
  F_measure ← ∅  // flows to measure
  F_install ← ∅  // flow counters to install
  T = PrefixTrie(F_w)  // convert rules to trie
  while True do
    for p ∈ T.GetCurrentRoots() do
      s = p.state
      if s == "measurement" then
        F_measure ← F_measure ∪ {p}
      if p.src.mask_length ≤ d_s && p.dst.mask_length ≤ d_t then
        if s == "unmeasured" then
          F_install ← F_install ∪ {p}
          p.state = "to_be_measured"
    F_location ← CounterILP(F_install)
    for switch, prefix in F_location do
      InstallRule(switch, prefix)
      T[prefix].state = "measurement"
      T[prefix].switch = switch
    for prefix in F_measure do
      traffic = CollectMeasurements(prefix.switch, prefix.rule)
      prefix.traffic.append(traffic)
      if traffic < θ then
        prefix.state = "terminal"
      else
        prefix.isActive = True
        prefix.isRoot = False
        sub_prefixes = SplitPrefix(prefix, d_s, d_t)
        for sub_prefix in sub_prefixes do
          sub_prefix.state = "unmeasured"
          // to trigger new measurements starting from this node
          sub_prefix.isRoot = True
  return GetActiveLeafNodes(T)

5.4.2 Optimizing the Measurement Location

In order to find the optimal locations where monitoring rules should be installed while maximizing the total number of flows monitored, we formulate the problem as an Integer Linear Program (ILP) that needs to be solved before installing monitoring rules at the switches. Let S = {s_1, s_2, ..., s_N} be the set of all switches in the network and F = {F_1, F_2, ..., F_K} be the set of all the (potentially aggregated) flows that we are interested in monitoring during a given measurement period. Also, let x_{i,j} be an auxiliary variable that takes the value 1 if flow F_i is monitored with a rule at switch j and 0 otherwise, and let m_j be the total available memory at switch j. Then, the problem of optimal rule placement can be written as:

  minimize    Σ_{j=1..N} ( m_j − Σ_{i=1..K} x_{i,j} )                        (5.4.1)
  subject to  Σ_{i=1..K} x_{i,j} ≤ m_j,   j = 1, ..., N                      (5.4.2)
              Σ_{j=1..N} x_{i,j} ≤ 1,     i = 1, ..., K                      (5.4.3)
              x_{i,j} ∈ {0, 1},           i = 1, ..., K;  j = 1, ..., N      (5.4.4)
              x_{i,j} ≤ r_{i,j},          i = 1, ..., K;  j = 1, ..., N      (5.4.5)

In the above linear program, the objective in Equation 5.4.1 maximizes the total number of flows monitored (by minimizing the unused rule capacity). Equation 5.4.2 guarantees that we will not try to install more rules at a switch than its available memory. Equation 5.4.3 forces the number of rules per flow to be at most one (i.e., no flow will be monitored at two switches). Equation 5.4.4 makes sure that the auxiliary variable x_{i,j} is binary, and Equation 5.4.5 ensures that rules are installed only at switches on the path of each flow, where r_{i,j} ∈ {0, 1} represents the routing of flow i through switch j.
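For illustration, a simplified greedy stand-in for the counter-placement problem can be sketched as follows. A real deployment would hand Equations 5.4.1-5.4.5 to an ILP solver; the helper names here are hypothetical, and the greedy first-fit is generally suboptimal, but it respects the same constraints:

```python
# Greedy stand-in for the counter-placement ILP (Eqs. 5.4.1-5.4.5).
# It respects per-switch capacity m_j (5.4.2), at most one rule per flow
# (5.4.3), and placement only on switches the flow traverses, i.e.
# r_{i,j} = 1 (5.4.5).

def place_counters(flows, capacity, routes):
    """flows: flow ids in priority order; capacity: {switch: free rules};
    routes: {flow: switches on its path}. Returns {flow: chosen switch}."""
    free = dict(capacity)
    placement = {}
    for f in flows:
        for sw in routes[f]:              # on-path switches only (5.4.5)
            if free.get(sw, 0) > 0:       # respect TCAM capacity (5.4.2)
                placement[f] = sw         # one measurement rule per flow (5.4.3)
                free[sw] -= 1
                break
    return placement

# Three flows over a toy 3-switch topology; F2 is path-constrained to s1,
# so it is considered first.
routes = {"F1": ["s1", "s2"], "F2": ["s1"], "F3": ["s2", "s3"]}
placement = place_counters(["F2", "F1", "F3"],
                           {"s1": 1, "s2": 1, "s3": 1}, routes)
# placement == {"F2": "s1", "F1": "s2", "F3": "s3"}
```

Unlike the ILP, the result depends on the order in which flows are considered; the exact formulation is order-independent and optimal.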
In the above analysis, we assume that if an aggregated flow gets split at some switch along its path, or two or more flows get merged at some switch along their paths, then we can follow one of the options below before solving the ILP in Equations 5.4.1-5.4.5: a) split a flow into its most fine-grained micro-flows that exist as forwarding rules at the switches and use rules that match these micro-flows in the ILP, or b) monitor the flows only at their aggregation switch without providing further granularity. However, the best option really depends on the application scenario and on the total number of micro-flows that appear as separate forwarding rules in the TCAMs. A similar approach applies in the case of multiple ingress switches per IP prefix, in which case we can flag the flows with a switch-specific tag (e.g., using a VLAN tag or MPLS label) that is used to differentiate between the portions of the traffic of a given prefix that are measured at a given location.

5.4.3 Using Predictions to Enhance Measurement

AFPDA can be run periodically to find active flow prefixes that exceed the threshold θ. After AFPDA has detected all the important flows (the process can take more than 1 epoch), the multiplexing of measurements and predictions can take place. For this, we have implemented an algorithm called the Prediction Assisted Measurement Algorithm (PAMA) that performs measurements in a round-robin fashion (using the same ILP described above) to collect the ground truth data needed to predict future values. Specifically, PAMA installs measurement rules for the active prefixes in order to collect w samples before starting to use predictions for the next T_p epochs for each prefix. So, if the total available memory in the network is m, then we can cover up to (⌊T_p/w⌋ + 1) × m flows before we start collecting measurements again for the first round of prefixes.
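The coverage expression above can be worked through numerically. The sketch below uses hypothetical values (w = 3 epochs, T_p = 6 epochs, m = 500 free rules), and the helper names are illustrative:

```python
# Worked example of the coverage expression: with a regression window w,
# prediction horizon T_p, and m free TCAM rules, PAMA covers
# (floor(T_p / w) + 1) * m flows per cycle. All values are hypothetical.

def flows_covered(w, t_p, m):
    """Flows measured or predicted per cycle (single-switch model)."""
    return (t_p // w + 1) * m

def horizon_needed(w, m, k):
    """Smallest horizon T_p (stepped in multiples of w) that covers k flows."""
    t_p = 0
    while flows_covered(w, t_p, m) < k:
        t_p += w
    return t_p

covered = flows_covered(w=3, t_p=6, m=500)    # (2 + 1) * 500 = 1500 flows
horizon = horizon_needed(w=3, m=500, k=1500)  # T_p = 6 epochs suffice
```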
This allows the system to multiplex predictions and measurements and achieve a more fine-grained view of the network. It is important to note that in networks where some free rules should always remain available for high-priority traffic, we do not need to use all the free TCAM space for measurements. In such cases, we can set m_i = min{m_i^true, α·M_i}, with m_i^true being the true free TCAM space at switch i, M_i the total TCAM memory of switch i, and 0 < α < 1 (e.g., α = 0.95) the maximum TCAM utilization constant that we want to achieve. Below, we provide an analysis of PAMA for a single-switch and a multi-switch scenario.

Single Switch Model: Let w be the number of past observations used to predict future flow rates, and let T_p be the number of future predictions provided by the model (using a rolling time window approach). Let m be the total available TCAM at the switch of interest. Then, in time w we will be measuring the same m flows to collect the w measurements that will be used to predict their sizes for the next T_p epochs. In the meanwhile, in the interval [w, w + T_p], we will have also collected w measurements for ⌊T_p/w⌋ × m more flows. So, overall, in time w + T_p we measure and generate predictions for (⌊T_p/w⌋ + 1) × m flows. This means that, in each measurement epoch, we will be collecting measurements for m flows and generating predictions for ⌊T_p/w⌋ × m. So, if we have a set of K important flows to cover (i.e., measure or generate predictions for), it will be:

  (⌊T_p/w⌋ + 1) × m = K,    (5.4.6)

which shows that the prediction horizon T_p can be set to allow any coverage of interest. An alternative approach could also be to adjust the significance threshold θ such that the number of significant flows K is reduced.

Multi-Switch Model: In the case of multiple switches, the overall benefit from multiplexing predictions can be increased by allowing flows to be measured in any of the switches they traverse.
However, the exact number of flows that can be covered in a given epoch depends on the routing matrix of the network, since not all flows can be measured at any of the N switches. So, assuming switch i has m_i free rules, and using the counter placement ILP to get the optimal measurement location of each flow, we can collect w measurements from up to m_i flows on a given switch and generate predictions for T_p epochs for each one of them. This allows up to:

  Σ_{i=1..N} (⌊T_p/w⌋ + 1) × m_i ≤ (⌊T_p/w⌋ + 1) × m    (5.4.7)

flows to be covered, where m is the total number of rules available across all the switches.

From the analysis above, it is evident that, given a (fixed) routing matrix and a set of significant flows F, in order to cover all the significant flows detected, the prediction horizon needs to be potentially extended for some flows. This also leads to the definition of a "completed task" as follows:

Definition 5.4.1. A measurement task that started to be executed at time t_1 is considered completed at time t_2 ≥ t_1 if t_2 is the time when a) all the active flow prefixes have been detected using AFPDA, and b) for every active flow, we have collected at least w samples. Its completion delay will then be T = t_2 − t_1.

Figure 5.4: A finite state machine representation of the states of a given flow prefix that is used by the Prediction Assisted Measurement Algorithm (PAMA).

In order to provide a generic framework that can work in a multi-switch scenario and handle any set of input parameters, we implement PAMA as a Finite State Machine (FSM) that maintains a state for each prefix with the following values: "unmeasured", "to-be-measured", "measurement", and "prediction", as illustrated in Figure 5.4. The "unmeasured" state characterizes prefixes that have not been visited yet by the algorithm. The "to-be-measured" state characterizes prefixes for which we need to install a measurement rule, and which will produce measurement data in the next epoch. The "measurement" state characterizes prefixes that have measurement data available to collect, and the "prediction" state characterizes prefixes for which we will have to use predictions during the current measurement epoch. The algorithm also uses a time-to-live (i.e., ttl) counter for each node of the FSM to keep track of the regression window w and the prediction window T_p. The pseudocode of PAMA is shown in Algorithm 5.2 below.

Algorithm 5.2: PredictionAssistedMeasurement
  Data: F_w, θ, d_s, d_t, w, T_p, period
  Result: Flow prefix measurements and predictions
  current_epoch = 0
  while True do
    F_measure ← ∅  // flows to measure
    F_install ← ∅  // flow counters to install
    F_predict ← ∅  // flows to predict
    if current_epoch % period == 0 then
      active_flows = ActiveFlowPrefixDetection(F_w, d_s, d_t, θ)
      flow_prefix_stats = InitializeFlowStats(active_flows)
    for flow_id in active_flows do
      f = flow_prefix_stats[flow_id]
      state = f.cur_state
      if state == "unmeasured" then
        F_install ← F_install ∪ {f}
        f.state = "to_be_measured"
      if state == "to_be_measured" then
        F_install ← F_install ∪ {f}
      if state == "measurement" then
        F_measure ← F_measure ∪ {f}
        if f.ttl == 0 then
          f.state = "prediction"
          f.ttl = T_p
      if state == "prediction" then
        F_predict ← F_predict ∪ {f}
        if f.ttl == 0 then
          f.state = "to_be_measured"
      f.ttl = f.ttl - 1
    F_location ← CounterILP(F_install)
    for flow_id, switch, prefix in F_location do
      InstallMeasurementRule(switch, prefix)
      flow_prefix_stats[flow_id].state = "measurement"
      flow_prefix_stats[flow_id].ttl = w
      flow_prefix_stats[flow_id].switch = switch
    for f in F_measure do
      f.traffic.append(CollectMeasurements(f.switch, f.rule))
    for f in F_predict do
      f.traffic.append(GeneratePrediction(f))
    current_epoch += 1

5.5 Evaluation

Figure 5.5: Percentiles for the flow duration and flow size of the CAIDA dataset: (a) flow duration percentiles for prefix size 10; (b) flow size percentiles for prefix size 10; (c) flow duration percentiles for prefix size 20; (d) flow size percentiles for prefix size 20.

In order to better understand the dynamics of the flows in the CAIDA dataset, we analyze all the flows in the trace and plot in Figures 5.5(a), 5.5(b), 5.5(c), and 5.5(d) the percentiles of the flow durations (in seconds) and the flow sizes (in bytes) for prefix sizes 10 and 20, respectively. As we can see from the 4 plots, at larger aggregation scales (i.e., prefix = 10), 80% of the flows appear to be active for more than 750 seconds, and only 20% of the flows appear to have a significantly larger size compared to the rest. On the other hand, at smaller aggregation scales (i.e., prefix = 20), 60% of the flows have a duration of 100 seconds or less, and again, only 20% of the flows appear to have a significantly larger size compared to the rest. From the above figures, we can infer that AFPDA does not need to rerun very often if we operate at higher aggregation levels; but as we zoom in, the traffic appears to be more dynamic and more frequent updates will be needed.
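Before proceeding, the per-prefix state transitions of Figure 5.4 that drive the experiments below can be sketched in a few lines. The step helper is an illustrative simplification of Algorithm 5.2; the state names and ttl bookkeeping follow Section 5.4.3:

```python
# Illustrative sketch of PAMA's per-prefix state machine (Figure 5.4).
# A prefix cycles: unmeasured -> to_be_measured -> measurement (w epochs)
# -> prediction (T_p epochs) -> to_be_measured -> ...

W, T_P = 3, 6   # regression window and prediction horizon (hypothetical)

def step(prefix):
    """Advance one prefix by one epoch and return its new state."""
    state = prefix["state"]
    if state == "unmeasured":
        prefix.update(state="to_be_measured")
    elif state == "to_be_measured":       # rule installed; data next epoch
        prefix.update(state="measurement", ttl=W)
    elif state == "measurement":
        prefix["ttl"] -= 1
        if prefix["ttl"] == 0:            # w samples collected
            prefix.update(state="prediction", ttl=T_P)
    elif state == "prediction":
        prefix["ttl"] -= 1
        if prefix["ttl"] == 0:            # horizon exhausted, re-measure
            prefix.update(state="to_be_measured")
    return prefix["state"]

p = {"state": "unmeasured", "ttl": 0}
trace = [step(p) for _ in range(2 + W + T_P)]   # one full cycle
```

Over one full cycle the prefix spends w epochs producing measurements and T_p epochs being served by predictions, matching the coverage analysis of Section 5.4.3.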
2D-AFPDA Evaluation Using Artificial Data: In order to evaluate the performance of AFPDA in the 2-dimensional prefix plane, we simulate its evolution under various traffic conditions that capture various probabilistic traffic splits on the prefix trie. Specifically, we implemented the 2D splitting for various static and random volume-splitting distributions that characterize how the overall volume of a subnet is distributed when the 2D splitting takes place. In the figures presented below, we show the results obtained by using random traffic splitting generated by a Dirichlet distribution, for a network with an aggregate traffic rate of 100 Gbps, a flow size threshold of 10 Mbps, and a maximum mask size of /15.

Figure 5.6: AFPDA statistics for a 2D parsing of an artificial dataset to assess where significant flows are located: (a) the percentage of high-volume prefixes for each source and destination IP mask size, as determined by AFPDA using random traffic splitting; (b) the total number of high-volume prefixes for each source and destination IP mask size, as determined by AFPDA using random traffic splitting.

Specifically, Figure 5.6(a) shows the percentage of high-volume prefixes (i.e., above the threshold) for each mask size. From the graph, we can see that the percentage drops exponentially as we keep splitting the 2-dimensional IP space further, which also shows that despite the fact that we have a potentially huge number of prefixes to explore, AFPDA eventually focuses only on a few thousand flows that are important. This is also evident in Figure 5.6(b), where the exact number of prefixes exceeding the threshold is shown for each mask size. From there, we can see that AFPDA converges pretty fast, since after mask size 8 the number of large flows decreases exponentially, and AFPDA skips any sub-prefixes that do not exceed the threshold. Finally, for the experiment shown above, the convergence time was 90 seconds, with an epoch size of 5 seconds and an average number of available measurement rules of 500 per epoch.

In order to better illustrate the behavior of the total measurements required for various input configurations (i.e., the threshold θ and the maximum mask size), we plot in Figure 5.7 the total number of AFPDA measurements for various mask sizes (here we set d_s = d_t = d) and thresholds between 1 Mbps and 5 Gbps.

Figure 5.7: The number of measurements conducted by the 2D AFPDA on an artificial dataset for various thresholds and mask sizes: (a) θ ≤ 10 Mbps; (b) 10 < θ ≤ 100 Mbps; (c) θ > 100 Mbps.

1D-AFPDA Evaluation Using the CAIDA Dataset: One optimization we can apply in order to speed up the convergence of AFPDA is to leverage potential asymmetries in the cardinalities of the sets of source and destination IPs seen by a switch (depending on the direction, the source IPs can be many more than the destination IPs) and break the 2D scan into two phases: 1) a 1D scan on the source prefix, and 2) a 2D scan on the (reduced) fine-grained prefixes found in step 1, in order to find the destination prefixes that appear to be active. Of course, step 2 might not be needed, depending on the application. In the experiment we present below, we analyze 1D-AFPDA using the CAIDA dataset, since the source IPs are significantly more numerous than the destination IPs. We replay the traffic using the B4 topology [3], where prefixes are randomly assigned to switches, and thus prefixes can be measured at various switches using the ILP formulation described in Section 5.4.2. Every switch has 1024 free TCAM rules that can be used for measurements. As we can see in Figure 5.8(a), the total number of large flows found in the CAIDA dataset ranges from 800 to 1500, depending on the threshold θ used, as well as on the maximum mask size (i.e., resolution) allowed. The number of flows increases almost linearly with the mask size, and its growth slows down for higher mask values. This shows that there is no significant change in the number of flows found if we increase the depth of the search, since most of the large flows are concentrated in individual subnets that have already been detected at higher layers of the prefix trie. In Figures 5.8(b) and 5.8(c), we can see that the measurement duration and the total number of measurements performed also exhibit a linear pattern. However, for smaller thresholds θ, the curves appear to have an increasing slope, which indicates that the algorithm spends more time searching as we approach the lower levels of the prefix trie. This is expected due to the exponential growth of the prefix trie; however, AFPDA does not explore all these states, since the network traffic is not spread across all these prefixes.
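The random traffic splitting used in the 2D-AFPDA simulation can be sketched as follows. This is a scaled-down, illustrative stand-in for the 100 Gbps experiment: function names and parameters are hypothetical, and the Dirichlet draw is implemented via normalized Gamma samples:

```python
# Toy version of the AFPDA simulation: a subnet's volume is split among
# its four 2-D children with Dirichlet-distributed proportions, and only
# children above the threshold are explored further. The rate, threshold,
# and depth are scaled-down stand-ins for the experiment in the text.
import random

random.seed(7)

def dirichlet(k, alpha=1.0):
    """Sample a length-k Dirichlet vector via normalized Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def explore(volume, depth, max_depth, threshold, found):
    """Recursively zoom into children whose volume exceeds the threshold."""
    if depth == max_depth:
        found.append(volume)
        return
    for share in dirichlet(4):
        child = volume * share
        if child >= threshold:            # small sub-prefixes are pruned
            explore(child, depth + 1, max_depth, threshold, found)

large = []
explore(volume=100.0, depth=0, max_depth=4, threshold=1.0, found=large)
```

As in the full experiment, every surviving leaf exceeds the threshold, and the vast majority of the 4^depth candidate prefixes are never visited.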
Finally, as we can see from Figure 5.8(d), for each θ value there is a point after which we can achieve nearly 90% coverage in terms of the total volume of traffic in the network. This shows that AFPDA can indeed be used to find the large flows in the network and keep monitoring them using predictions and measurements interchangeably.

In order to isolate the benefit coming from the PAM algorithm only, we also plot the flow coverage gain we get by using predictions for various regression window sizes and prediction horizons.

Figure 5.8: AFPDA statistics for 1D parsing of the CAIDA dataset for various significance thresholds θ and maximum mask sizes d_s (i.e., flow resolution): (a) number of large flows found for various θ and d_s; (b) measurement duration for various θ and d_s; (c) total AFPDA measurements for various mask sizes, for epoch = 15 sec; (d) total coverage (i.e., percentage of the total volume) of the significant flows found for various mask sizes, for epoch = 30 sec.

As we can see from Figure 5.9, as well as from Equations 5.4.6 and 5.4.7, for arbitrarily large horizons and small regression windows we can achieve the highest flow coverage gain. For example, for w = 3 and T_p = 6, we can achieve a 2x improvement in terms of the number of flows explored, which can have a big impact on network visibility.
Finally, in order to investigate whether there are any "bottlenecks" in the execution of DeepFlow, we proceed to measure the running times of all its major operations that can be costly, especially the ones related to generating predictions. For this, we have used a 16-core Intel CPU with 32 GB of RAM.

Figure 5.9: Flow coverage gain achieved due to predictions for various regression window sizes and prediction horizons.

The results are shown in Table 5.1, from which we can see the delays for model training, feature extraction, clustering, cluster assignment, prediction generation, and the ILP solver. It is important to note here that DeepFlow operations can be classified into two categories: a) cold start operations, which are needed only during setup and potentially during a periodic retraining (e.g., daily), and b) operations that need to be executed in each epoch. LSTM model training (a cold start operation) can be done in less than 10 minutes for each cluster, but since clusters can be trained in parallel, this process can be completed relatively fast. Among the other cold start operations, the data preprocessing for feature extraction can be done in approximately 3 minutes, since this task is also parallelizable. In a real system, this operation can also occur on the fly as flows come in. The cluster assignment and the prediction generation (from a pre-trained model) are pretty fast operations; even in our Python prototype, they took approximately 0.1 seconds to complete. Finally, the
In addition, since the progression of DeepFlow occurs in a Round-Robin fashion, we can easily cache past solutions and avoid solving the ILP in each epoch, in which case ILP also can be considered a cold start operation. From the above we can conclude that building a measurement system that leverages predictions can scale well since the operations that need to occur in real time have very small latency or can be precalculated. Table 5.1: DeepFlow latency breakdown. DeepFlow Stage Execution Time LSTM Model Training (Cold Start) 5-550 secs Feature Extraction (Cold Start) 186 secs Clustering (Cold Start) 274 secs Cluster Assignment 0.1 secs Generate Prediction 0.1 secs ILP solver 2.81 secs 5.6 Related Work The problem of network traffic measurement has been extensively studied in the relevant literature in both traditional networks and SDN. Traffic measurem- ent in SDN can be achieved using two main approaches: a) TCAM-based traffic counters (e.g. [75], [33], [77], [78]), and b) hash-based counters such as sketches (e.g. [37], [36], [79], [35], [80], [34] ). In this work (a shorter version of which was presented in [50], and [74]), similar to the work in [75] and [77], we focus on the problem of traffic measurement in SDN using TCAM-based counters since it provides immediate deployability in commercial switches. In addition, our previ- ous work in [81] provided a unified framework for overcoming the challenges that 91 switch diversity imposes to network management (including traffic measurement) that can introduce significant delays in the control or data plane and prevent the controller form being able to query the hardware switches for traffic counters. On the other hand, a) hash-based counters are not currently supported in commercial switches, b) they require complex switch upgrades in order to be implemented, and c) they do not provide a generic fine-grained measurement framework. 
Nevertheless, our proposed scheme can be adapted to use hash-based measurements as well in its data collection component, and then combine them with its prediction capabilities to increase the overall visibility into the network. Finally, the scope of our work spans mainly the area of Wide Area Networks (WANs), since the large number of flows in such networks poses significant scalability challenges for traffic measurement. However, our framework can also be extended to SDN networks of any scale.

According to [18], measurement frameworks can be classified into one of the following three categories, depending on whether they provide: a) a balance in overhead implications by using techniques like sampling, aggregation, efficient heuristics, etc., b) resource usage as a trade-off with measurement accuracy, and c) accurate measurements in real time for decision making. Below, we briefly cover relevant frameworks from all three categories. The authors in [82] propose iStamp, a scheme for flow measurement that uses a Multi-Armed Bandit algorithm to balance between measuring unimportant flows in aggregate and important flows with exact monitoring TCAM rules. This way, the authors achieve high monitoring accuracy for the flows of interest. iStamp provides a single-switch framework that does not aim to provide a generic measurement solution, nor does it use machine learning predictions to enhance the measurement coverage. In [75], the authors propose DREAM, a TCAM-based measurement framework that focuses on balancing measurement accuracy and the amount of resources (i.e., TCAM rules) that are used for monitoring specific flows. DREAM is suitable for tasks for which accuracy can be estimated, such as (Hierarchical) Heavy Hitter detection and change detection, and does not provide a generic framework for fine-grained measurement.
In [33], the authors present OpenTM, a framework for TM estimation in OpenFlow networks that is based on simple flow statistics retrieval from SDN switches. The paper assumes that all flow rules fit in the TCAM and thus can be tracked with exact match rules (i.e., no wildcard rules are used). In addition, OpenTM constantly requests flow counters from various switches over the path of a flow, thus generating significant overhead. A very similar approach is also presented in [76], where the frequency of flow counter retrieval is determined by the variability of each flow's size over time. In [77], the authors propose OpenMeasure, an extension of iStamp [82] for TM estimation in hybrid SDN deployments that uses an adaptive counter placement mechanism to detect large flows in the network, and then places exact match rules in the TCAM for the large flows detected. The framework uses a simple prediction scheme to estimate the size of a flow in the next measurement period and, based on that, to select the target flows to monitor, but predictions are not used to substitute measurements. The authors in [83] optimize the rule installation across switches to avoid using wildcard rules everywhere, in order to enable more visibility into the network. The statistics collected for each fine-grained rule can provide improved visibility. However, such an approach cannot provide a generic measurement framework; instead, it provides a mechanism for improving TCAM utilization while improving flow visibility to some extent, if there are available measurement resources. The authors in [78] present a TM estimation framework for hybrid SDN deployments that is a multi-switch extension of iStamp [82]. The proposed framework does not use online learning (unlike the work in [77]) and monitors flows by de-aggregating aggregated flows in order to provide more fine-grained measurements of important flows.
The paper in [60] uses an LSTM model to model aggregated traffic at the link (port) level at large timescales (i.e., 15 minutes), which differs significantly from our proposed work. Finally, the papers in [84] and [85] present neural network approaches for TM estimation in traditional IP networks (i.e., not SDN), where simple link measurements (i.e., SNMP) are used to estimate the Ingress-Egress router TM (i.e., aggregated flow-size estimation).

Of the previous work presented above, the most relevant is OpenMeasure [77]. However, there are significant differences compared to our proposed work, as described below: a) OpenMeasure does not provide fine-grained flow measurement (it picks only the most important flows that can fit in the TCAM and focuses only on those), b) OpenMeasure does not use predictions to substitute measurements; instead, it uses predictions to find the largest flows to monitor with TCAM rules in the next measurement epoch, c) the two models used for prediction are simple linear models that have not been studied for their effectiveness in predicting flow rates in small time ranges (i.e., < 15 minutes) with a smaller flow-aggregation ratio, where flows appear to be more volatile, and d) OpenMeasure is suitable for large measurement epochs (unlike DeepFlow, which operates at a ≤ 60 second timescale) with a large flow aggregation ratio. Finally, our previous work in [50] presents an introductory exploration of the main ideas behind DeepFlow, namely the overall architecture and its proof-of-concept implementation using two small network datasets. In addition, our work in [74] introduces the concept of network time series clustering for model selection, which is applied to various network prefixes to assess its modeling effectiveness, and our work in [86] introduces various LSTM models for link-level traffic predictions.

5.7 Conclusion

Providing fine-grained measurement is mandatory for many network applications.
In this chapter, we proposed DeepFlow, a prediction-assisted measurement framework for SDN that uses the available TCAM memory to install measurement rules for important flows, and uses an efficient machine learning algorithm to predict the size of the rest of the flows that cannot be monitored with exact match rules, by using historical data from previous measurement periods.

Chapter 6

Measurement Orchestration Using Reinforcement Learning

In the previous chapters, we saw how we can model network traffic at scale, and presented DeepFlow, a measurement framework that leverages predictions to increase the visibility into the network for any measurement task. However, the data collection process that is needed before generating any predictions can take some time, and it depends on dynamic conditions in the network, such as the current TCAM and CPU utilization at the switches. In this chapter, we present an extension of DeepFlow, called DRLFlow, which adjusts the measurement configuration parameters (i.e., θ, d_s, d_t) to optimize the flow resolution measured, given a set of resource constraints.

6.1 Introduction

The availability of measurement resources does not necessarily remain constant over time. For example, new forwarding decisions might need to be made that consume valuable TCAM space, thus subtracting resources from the available measurement pool and affecting the overall measurement speed (since fewer flows can be measured in each epoch). In addition, the CPU utilization at the switches can vary over time, depending on the underlying traffic conditions or potential additional packet processing rules installed at the switches. All of these create a dynamic environment that can affect the measurement latency.
To account for all these, we develop DRLFlow, an extension of the DeepFlow framework from Chapter 5 that automatically adapts to the changing environment or the different configuration needs, and chooses measurement configurations that can be completed by a given deadline. Motivated by the recent advances in the field of Reinforcement Learning (RL), and specifically Deep Reinforcement Learning (DRL), we configure DRLFlow to take into account past measurement delays for each task configuration and choose the one that can be completed on time.

6.2 Background and Motivation

In the sections below, we present the background information that is necessary in order to introduce our RL-based framework. Throughout this chapter, we will use Definitions 2.4.1 and 5.4.1 from Chapters 2 and 3, respectively.

6.2.1 Markov Decision Processes (MDP)

A Markov Decision Process (MDP) is a form of discrete-time stochastic control process that is suitable for solving decision making problems [87]. One of the most standard families of tools for solving MDP problems today is that of RL algorithms, such as Q-Learning [88]. An MDP can be described by the tuple (S, A, p, r), where S is the finite set of all possible states, A is the finite set of actions that can be taken from a given state, p is the transition probability from state s to state s' after an action a has been taken, and r is the reward that is received after action a has been taken. The goal of an MDP is to find an optimal policy $\pi^*: S \to A$, i.e., a mapping from a state to an action, that maximizes the expected total discounted reward over an infinite time horizon, $\sum_{t=0}^{\infty} \gamma^t r_t(s_t, a_t)$, where $a_t = \pi^*(s_t)$ and $\gamma \in [0, 1]$ is the discount factor [89]. The general architecture of an MDP problem is shown in Figure 6.1.

Figure 6.1: The interaction of an agent with its environment in a Markov Decision Process.
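As a concrete illustration of the MDP objective above, the sketch below rolls out a fixed policy on a toy deterministic MDP (the states, actions, and rewards are made up for illustration) and computes the discounted return $\sum_{t} \gamma^t r_t$:

```python
def discounted_return(rewards, gamma):
    """sum_t gamma**t * r_t: the quantity an optimal policy maximizes."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A toy deterministic MDP: (state, action) -> (next state, reward).
TRANSITIONS = {("s0", "stay"): ("s0", 0.0), ("s0", "move"): ("s1", 1.0),
               ("s1", "stay"): ("s1", 2.0), ("s1", "move"): ("s0", 0.0)}

def rollout(policy, start, steps):
    """Follow a policy (a state -> action mapping) and collect rewards."""
    state, rewards = start, []
    for _ in range(steps):
        state, r = TRANSITIONS[(state, policy[state])]
        rewards.append(r)
    return rewards
```

For instance, the policy that moves to s1 and stays there collects the reward sequence [1.0, 2.0, 2.0, ...], whose discounted value the agent seeks to maximize.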
6.2.2 Q-Learning

In order to find the optimal policy $\pi^*: S \to A$, we can define a value function $V^{\pi}(s)$ that captures the expected value obtained by following a given policy $\pi$ from a state $s$ as follows:

$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t(s_t, a_t) \mid s_0 = s\right]$  (6.2.1)

$\phantom{V^{\pi}(s)} = \mathbb{E}_{\pi}\left[r_t(s_t, a_t) + \gamma V^{\pi}(s_{t+1}) \mid s_0 = s\right]$  (6.2.2)

If we define the optimal Q-function as $Q^*(s, a) = r_t(s_t, a_t) + \gamma \mathbb{E}_{\pi}[V^{\pi}(s_{t+1})]$, then the optimal value function will be:

$V^*(s) = \max_{a_t}\left\{\mathbb{E}_{\pi}\left[r_t(s_t, a_t) + \gamma V^{\pi}(s_{t+1})\right]\right\}$  (6.2.3)

$\phantom{V^*(s)} = \max_{a}\left\{Q^*(s, a)\right\}$  (6.2.4)

In order to find the optimal values of the Q-function, we use an iterative process with the following update rule:

$Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t\left[r_t(s, a) + \gamma \max_{a'} Q_t(s', a') - Q_t(s, a)\right]$  (6.2.5)

where $s'$ is the state reached after taking action $a$ in state $s$, and $\alpha_t$ is the learning rate. As we can see from Equation 6.2.5, Q-learning iteratively learns the most rewarding next step from any given state. The learned values can be represented as a table of (state, action, reward) entries, which is known as the Q-table. However, the main disadvantage of Q-learning is that it is intractable for large state and action spaces, due to the large memory footprint of the resulting Q-table. For this reason, it was proposed in [90] to approximate the Q-function using a deep neural network that predicts Q-values instead of pre-calculating all of them in a table. This neural network became known as the Deep Q-Network (DQN), and the overall method as Deep Reinforcement Learning (DRL).

6.2.3 Deep Reinforcement Learning

The main idea of using a neural network to approximate the Q-function is simple in principle. However, in practice, there are significant challenges in making this method produce stable results [88, 91]. Specifically, to overcome the prediction stability issues, DRL implements two neural networks instead of one, in order to avoid updating a neural network on the fly while it generates predictions (otherwise the predictions would be very noisy).
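The tabular update rule of Equation 6.2.5 can be sketched as a single Q-learning step; the Q-table is a plain dictionary, and the learning rate, states, and rewards used below are illustrative:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]
```

Using a `defaultdict(float)` as the Q-table initializes all unseen (state, action) pairs to zero, which is the usual starting point for tabular Q-learning.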
One neural network is used to generate predictions (also called the fixed target Q-network), whereas the second one is used for the gradient descent updates. The weights of the second network are copied to the first one every few batches, where the copy frequency is a parameter that can be fine-tuned experimentally. The other common optimization technique used in DRL is what is known as experience replay. Experience replay leverages a FIFO buffer that stores recent experiences of the agent in the form $(s_t, a_t, r_t, s_{t+1})$. In order to construct a batch used to compute the loss function and the gradient, we randomly sample from this list of experiences, thus forcing more uniformity in the batch sample by allowing old and new experiences to be sampled, instead of only new ones. Finally, another technique that is commonly used in order to help an RL agent explore more states is the ε-greedy (epsilon-greedy) algorithm. According to the ε-greedy algorithm, the agent is configured to allow some small deviation from the currently optimal decision making, in order to explore new states/rewards with a very small probability (e.g., ε = 0.01). The probability ε can also be adjusted (decayed) over time, such that the algorithm finally converges to the optimal decisions.

6.2.4 Measurement Task

As we saw in Definition 2.4.1, a measurement task that started executing at time t_1 is considered completed at time t_2 ≥ t_1 if t_2 is the time when a) all the active flow prefixes have been detected using AFPDA (Algorithm 5.1), and b) for every active flow found, we have collected at least w samples. Its completion delay will then be T = t_2 − t_1. The rationale behind the above definition is that, by time t_2, the measurement framework will have collected the regressors needed in order to start generating predictions, as well as some measurement data collected periodically.
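The experience replay buffer and the ε-greedy action selection described in Section 6.2.3 can be sketched as follows (the capacity, batch size, and ε values are illustrative, not the exact DRLFlow settings):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of (s_t, a_t, r_t, s_{t+1}) experiences."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences evicted

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # uniform sampling mixes old and new experiences in each batch
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With ε = 0, `epsilon_greedy` always returns the greedy (argmax) action; a small positive ε injects occasional exploratory actions.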
However, finding the active flows and collecting measurements can take a random amount of time that depends on dynamic network conditions, as well as on the thresholds used.

Definition 6.2.1. A measurement task M = (θ, d_s, d_t) with completion delay requirement τ is considered successful if T ≤ τ, where T is the random variable that characterizes the time it takes for the task to be completed with an adjusted rate threshold θ_a ≥ θ, and adjusted source/destination IP prefix sizes of d_s^a ≤ d_s and d_t^a ≤ d_t, respectively.

From the above, we can see that while AFPDA starts from the top of the prefix trie and keeps zooming in, there is going to be a point where the prefix splitting will be adding far too many rules and thus increasing the delay of collecting measurements and generating predictions exponentially. For this, we develop a DRL-based framework to find the right adjusted parameters θ_a, d_s^a, and d_t^a that can balance the trade-off between delay and measurement resolution.

6.3 Measurement Delay

There are two main types of delay that are introduced and can affect a task's completion time. According to Definition 5.4.1, the overall delay has two major sources:

1. T_a: the time it takes for the AFPDA algorithm to find the active prefixes

2. T_w: the time it takes to collect w samples from each active flow found.

If we assume that the active prefixes have not been seen and assigned a cluster before (as described in Chapter 3), then we will have to include the delay of collecting more samples (e.g., 100 × w) per flow, such that we are able to calculate the statistics needed for clustering, as described in Table 3.1, and then assign the right model to each time series. Without loss of generality, we will assume here that we have operated the DeepFlow framework long enough to have seen all these prefixes in the past and to be able to classify them depending on their activity patterns.
Based on the above, the total measurement delay T should satisfy the relation:

$T = T_a + T_w$  (6.3.1)

The two types of delay above are not independent, since the more active flows K are found, the higher the total delay T_w will be in order to collect w data points for each one of them. On the other hand, the delay of each iteration of AFPDA depends on a variety of factors, such as:

1. the number of available TCAM rules at the switches that can be used for measurements

2. the number of switches that can be used to measure flows concurrently, which depends on the routing

3. the rule installation behaviors of the switches, as discussed in [81, 92], which can introduce delays due to different rule cache eviction policies, CPU or TCAM utilizations, and other implementation idiosyncrasies.

As we will see in the next section, the decision problem of optimizing the measurement parameter configuration can be modeled as an MDP and solved using a DRL agent that senses the environment and interacts with it by exploring the cost of various measurement configurations, thus building its optimal measurement policy over time.

6.4 The DRLFlow Architecture

In this section, we present DRLFlow, a Deep Reinforcement Learning version of DeepFlow that is able to optimize the measurement configuration parameters to meet a given target deadline τ. The main idea behind DRLFlow is to define the state, action, and reward spaces in a way that allows DRLFlow to adapt to environment changes and still make optimal measurement decisions, while keeping the dimensionality of the problem as low as possible. To do so, we quantize certain parameters, such as the supported θ, which can take values in the set Θ = {θ_min, θ_1, θ_2, ..., θ_n, θ_max} with θ_min < θ_1 < θ_2 < ... < θ_n < θ_max. Then, the problem can be formulated as an MDP as follows:

• State Space: The state space S corresponds to the finite set of all possible states of the system.
We define the state as:

$s_t = (K_t, \Delta_t)$  (6.4.1)

where K_t corresponds to the number of active flows found, and Δ_t = τ − T_t, where T_t is the completion time of the task with the given configuration parameters.

• Action Space: An action a_t can be any of the following:

– take no action
– increase the threshold θ_t to the next supported value
– decrease the threshold θ_t to the previous supported value
– increase the maximum source prefix d_st by 1
– decrease the maximum source prefix d_st by 1
– increase the maximum destination prefix d_tt by 1
– decrease the maximum destination prefix d_tt by 1.

Here, θ_t, d_st, d_tt correspond to temporal adjustments of the parameters θ, d_s, d_t, respectively.

• Reward Function: The reward function can be defined as:

$r_t = \frac{\sum_{i=1}^{K_t} f_i}{C \cdot \Delta_t}$  (6.4.2)

where f_i is the size of active prefix i found by AFPDA (their total number is K_t), C is the total throughput at the root of the prefix trie (as measured by summing the link statistics of all the ingress switches), and Δ_t is the difference between the deadline τ and the AFPDA completion time T. The intuition behind this reward function is that we want to give higher rewards to measurement parameters that uncover flows corresponding to a large portion of the total throughput C. In addition, we want to allow the system enough time to explore for large flows, so we want to spend as much time as possible up until right before the deadline (since more time means higher resolution), but not exceed the deadline τ.

6.5 Evaluation

In order to evaluate our proposed framework, we proceed to implement it using the CAIDA dataset from Chapter 3.3, working similarly to the evaluation of DeepFlow in Chapter 5. We implement the DRL network using Keras [51] and the TensorFlow API [52]. We use two convolutional layers and a dense layer with 50 nodes, dropout 0.2, the Adam optimizer, and a 0.001 learning rate.
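The reward in Equation 6.4.2 can be computed directly from the quantities defined above; a minimal sketch (the flow sizes, throughput, and deadline values in the usage below are made up for illustration):

```python
def drlflow_reward(flow_sizes, total_throughput, deadline, completion_time):
    """r_t = (sum of active flow sizes) / (C * (tau - T)), per Equation 6.4.2.

    Rewards grow with the fraction of total throughput C that the
    discovered flows cover, and with how closely the task approaches
    the deadline (smaller remaining slack delta = tau - T).
    Assumes completion_time < deadline (positive slack)."""
    delta = deadline - completion_time
    return sum(flow_sizes) / (total_throughput * delta)
```

For example, with flows of size 2.0 and 3.0 Mbps, total throughput C = 10.0 Mbps, deadline τ = 5 epochs, and completion at epoch 4, the reward is 5.0 / (10.0 · 1.0) = 0.5.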
We use a memory replay buffer with size 50000, batch size 32, and a 0.99 discount factor. We update the target Q-network every 5 batches, and we use an ε-greedy policy for exploration with a minimum ε = 0.001. We let the system run for 20000 episodes. Below, we assess our framework for its ability to meet the given deadline for various configurations while trying to maximize granularity. For our experiment, we use the same topology as in Chapter 5 (i.e., Google's B4 [3]) and an epoch size of 10 seconds. We run each task 100 times for 4 different switch TCAM sizes (i.e., available measurement rules), namely 200, 500, 1000, and 1500 rules, as well as 5 different deadlines, namely 5, 8, 10, 12, and 15 epochs. Since for such an experiment there is no notion of a baseline (since we can select task configurations that are arbitrarily bad and fail to complete before the deadline), we only show the average completion rate achieved using DRLFlow. As we can see from Figures 6.2(a), 6.2(b), 6.2(c), and 6.2(d), the smaller the available TCAM, the higher the non-completion rate. However, in all the cases where the task failed to complete within the pre-specified time, we observed that it eventually completed in the epoch right after the deadline. In addition, these failure rates are at most 10%, and only for the cases where there is not enough TCAM available at the switches and our framework takes longer to complete each iteration. Finally, we can see from the graphs that the completion rate becomes nearly 100% for the larger TCAM sizes, and it also increases as the deadline increases and gives our algorithm more time to search for important flows. This validates our hypothesis that we can optimize the measurement configurations given time and resource constraints.

6.6 Discussion

From the analysis in the previous sections, we can see that DRLFlow can act as an autonomous agent that optimizes the measurement configuration given a set of requirements.
Our main design considerations were granularity and delay. In a real application, our framework can be used as an offline-trained system that is exposed through an API to the DeepFlow controller, and which automatically updates itself every time the AFPDA algorithm is executed and measurement statistics are collected. This way, the system can adapt to the changes in its environment, while at the same time providing its best estimate to any new measurement task submitted for execution.

Figure 6.2: Average task completion rate for various available TCAM sizes per switch. (a) Average task completion rate for TCAM size = 200 per switch. (b) Average task completion rate for available TCAM size = 500 per switch. (c) Average task completion rate for available TCAM size = 1000 per switch. (d) Average task completion rate for available TCAM size = 1500 per switch.

6.7 Related Work

DRL was introduced recently in [90] in order to handle large state and action spaces and reduce the large memory footprint of traditional Q-Learning algorithms. To do this, a deep neural network approach was proposed that approximates the Q-function. In the context of communications and networking, RL and DRL have been applied in various domains in the past, ranging from resource allocation [93], power allocation [94], and caching [95], to channel selection [96], and more. A survey of the applications of RL in communications and networking can be found in [89].
From the related works, below we focus on the ones that are most relevant to our proposed framework. Specifically, the work in [97] applies DRL to traffic optimizations in modern data centers, such as load balancing. DRL has also been proposed recently in [98] as a way to efficiently manage the rules of SDN switches, reduce TCAM utilization, and reduce the control overhead in data center environments. The most relevant work is the one in [99], where the authors use Q-Learning and leverage very large lookup tables to choose the next action, with the goal of distributing rules across switches to avoid over-measuring flows at the switches and thus causing performance degradation. Despite the similarity in the frameworks used, our proposed framework in this thesis has a very different goal, which is to increase the visibility into the network by allowing fine-grained measurements under delay constraints; for this, it uses an ILP formulation to distribute the counters efficiently across switches and gives the system enough time to search for active flows of maximum granularity. It is important to mention here that in our framework, there is a way to reduce the TCAM utilization, and this is through predictions (indirectly) and by also setting a limit on the maximum utilizable TCAM memory (e.g., 90%), such that there is always room for new high-priority rules.

6.8 Conclusion

In this chapter, we presented an extension of DeepFlow, called DRLFlow, that uses Deep Reinforcement Learning in order to optimize the measurement parameters of the DeepFlow scheduler to maximize the measurement granularity, without exceeding a given delay constraint.

Chapter 7

Conclusion and Future Work

In this chapter, we summarize our research findings and discuss some ideas on how to extend this thesis.
7.1 Summary of Contributions

In this thesis, we first presented a unified framework for network traffic measurement that can be easily deployed on production hardware SDN switches, such as in Google's B4 network [3].

The first problem we solved was that of modeling diverse network flow time series, which has applications in various areas such as traffic engineering, power savings, data analytics, and more. The problem is extremely challenging due to the very diverse nature of network traffic, especially at short timescales. To tackle this problem, we proposed the use of time series clustering, which was able to group similar time series together and train one model per cluster, thus increasing the scalability of our approach significantly. In addition, our analysis showed that clustering can produce very good results when tested with real backbone network traffic from a Tier-1 ISP.

The second problem we solved was that of modeling link-level throughput time series. This problem is of great importance due to its applications in traffic engineering, anomaly detection, traffic matrix estimation, and more. We explored several variations of LSTM and ARIMA models and showed that LSTMs can predict link throughputs with very low errors, outperforming the baselines.

The third problem we solved was that of providing in-depth visibility into the network, something that can allow for more efficient network management, such as building solutions for energy savings in data center or ISP environments. For this, we proposed a unified framework that detects active prefixes in SDN, efficiently places flow counters throughout the network to increase the measurement parallelism, and, by multiplexing measurements and predictions, can achieve high monitoring resolution for all the major prefixes in the network.

Finally, the last problem we solved is that of selecting the right measurement parameters given a set of available resources in the network.
The problem with any measurement framework is the trade-off between measurement resources and measurement accuracy (or resolution in general). To overcome that, we proposed the use of a Deep Reinforcement Learning (DRL) algorithm that optimizes the thresholds used, given the available resources, in order to meet a target deadline.

7.2 Future Work

This work offers many directions to explore, in both the area of network traffic modeling and the area of software-defined measurement.

7.2.1 Network Traffic Modeling

In order to further improve the accuracy of our prefix flow rate (or link throughput) models, we plan to explore the following:

• Implement a Convolutional Neural Network (CNN) version of the LSTM model that can potentially model the network traffic volatility even better.

• Implement a different model per cluster, and not just the same LSTM architecture with different training data. This can further improve the accuracy of the cluster-based framework.

• Implement a compressed version of the flow prefix LSTM model that, instead of modeling the actual throughput of a prefix, models the split ratio of the throughput of a parent prefix into its children prefixes (i.e., longer prefixes) in the prefix trie. In other words, if the parent prefix has a size f_i and we can easily collect real measurement data for it using a megaflow rule in the TCAM, we are going to evaluate how well we can predict how this volume is distributed across its children nodes (starting with the immediate 2 children and potentially exploring deeper splits), where child node i sees c_i percent of the traffic and $\sum_i c_i = 1$. This will act as an additional normalization mechanism and force the model to learn to predict with the assistance of a ground truth data point from the current epoch, i.e., the parent node.
• Implement a round-robin sampling mechanism that samples which links to measure in each epoch and makes predictions based on sparse link data, instead of polling link counters during every epoch. This is equivalent to using longer prediction horizons for the link-level models.

7.2.2 Software-Defined Measurement

In the area of software-defined measurement, we are planning to evaluate the following directions:

• Speed up the execution of the Active Flow Prefix Detection Algorithm (AFPDA) using binary predictions for a given flow significance threshold θ. Since in a real network we expect to gather data over long time ranges, we propose the use of models that can be trained to detect active flows based on weekly, daily, or hourly patterns, and to predict whether a given prefix monitored by the SDN controller is active or not. This is expected to be a simpler learning problem, since we are not interested in learning exact throughput values, but only in whether the traffic source is on or off. To make the predictions more robust, we plan to predict longer time horizons (e.g., the next 15 minutes or the next hour) so that we avoid overfitting.

• Use dynamic prediction horizons depending on the prediction quality of each prefix. This way, prefixes that can be predicted more easily can remain in prediction mode for longer periods of time, compared to the more volatile ones. This is expected to further increase the quality and the resolution of the DeepFlow measurement framework.

Reference List

[1] "ONF SDN definition." [Online]. Available: https://www.opennetworking.org/sdn-definition

[2] A. Mestres, A. Rodriguez-Natal, J. Carner, P. Barlet-Ros, E. Alarcón, M. Solé, V. Muntés-Mulero, D. Meyer, S. Barkai, M. J. Hibbett et al., "Knowledge-defined networking," ACM SIGCOMM Computer Communication Review, vol. 47, no. 3, pp. 2–10, 2017.

[3] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M.
Zhu et al., "B4: Experience with a globally-deployed software defined WAN," in ACM SIGCOMM Computer Communication Review, vol. 43, no. 4. ACM, 2013, pp. 3–14.

[4] "Digital trends 2019," January 2019. [Online]. Available: http://www.808multimedia.com/winnt/kernel.htm

[5] T. Saydam and T. Magedanz, "From networks and network management into service and service management," Journal of Network and Systems Management, vol. 4, no. 4, pp. 345–348, 1996.

[6] C. Dovrolis, P. Ramanathan, and D. Moore, "Packet-dispersion techniques and a capacity-estimation methodology," IEEE/ACM Transactions on Networking, vol. 12, no. 6, pp. 963–977, 2004.

[7] P. Calyam, D. Krymskiy, M. Sridharan, and P. Schopis, "Active and passive measurements on campus, regional and national network backbone paths," in Proceedings, 14th International Conference on Computer Communications and Networks (ICCCN 2005), Oct 2005, pp. 537–542.

[8] J. Cleary, S. Donnelly, I. Graham, A. McGregor, and M. Pearson, "Design principles for accurate passive measurement," in Proceedings of PAM, 2000.

[9] W. John, S. Tafvelin, and T. Olovsson, "Passive internet measurement: Overview and guidelines based on experiences," Computer Communications, vol. 33, no. 5, pp. 533–550, 2010.

[10] Cisco Systems NetFlow services export version 9. [Online]. Available: https://www.ietf.org/rfc/rfc3954.txt

[11] sFlow version 5. [Online]. Available: https://sflow.org/sflow_version_5.txt

[12] A simple network management protocol (SNMP). [Online]. Available: https://tools.ietf.org/html/rfc1157

[13] Simple network management protocol (SNMP) applications. [Online]. Available: https://tools.ietf.org/html/rfc3413

[14] N. Feamster, J. Rexford, and E. Zegura, "The road to SDN," Queue, vol. 11, no. 12, pp. 20–40, 2013.

[15] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, "OpenFlow: enabling innovation in campus networks," ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp.
69–74, 2008.

[16] O. S. Specifications, "OpenFlow 1.5.1," Open Networking Foundation, vol. 3, 2015.

[17] ONF, "SDN architecture," https://www.opennetworking.org/wp-content/uploads/2013/02/TR_SDN_ARCH_1.0_06062014.pdf, June 2014.

[18] A. Yassine, H. Rahimi, and S. Shirmohammadi, "Software defined network traffic measurement: Current trends and challenges," IEEE Instrumentation & Measurement Magazine, vol. 18, no. 2, pp. 42–50, 2015.

[19] M. Jarschel, T. Zinner, T. Höhn, and P. Tran-Gia, "On the accuracy of leveraging SDN for passive network measurements," in 2013 Australasian Telecommunication Networks and Applications Conference (ATNAC). IEEE, 2013, pp. 41–46.

[20] Z. Su, T. Wang, Y. Xia, and M. Hamdi, "FlowCover: Low-cost flow monitoring scheme in software defined networks," in 2014 IEEE Global Communications Conference. IEEE, 2014, pp. 1956–1961.

[21] D. D. Clark, C. Partridge, J. C. Ramming, and J. T. Wroclawski, "A knowledge plane for the internet," in Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications. ACM, 2003, pp. 3–10.

[22] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown, "ElasticTree: Saving energy in data center networks," in NSDI, vol. 10, 2010, pp. 249–264.

[23] L. Zhou, L. N. Bhuyan, and K. Ramakrishnan, "Dream: Distributed energy-aware traffic management for data center networks," in Proceedings of the Tenth ACM International Conference on Future Energy Systems, 2019, pp. 273–284.

[24] J. Shuja, K. Bilal, S. A. Madani, M. Othman, R. Ranjan, P. Balaji, and S. U. Khan, "Survey of techniques and architectures for designing energy-efficient data centers," IEEE Systems Journal, vol. 10, no. 2, pp. 507–519, 2014.

[25] M. Masdari, S. S. Nabavi, and V. Ahmadi, "An overview of virtual machine placement schemes in cloud computing," Journal of Network and Computer Applications, vol. 66, pp. 106–127, 2016.

[26] T. Mastelic, A.
Oleksiak, H. Claussen, I. Brandic, J.-M. Pierson, and A. V. Vasilakos, "Cloud computing: Survey on energy efficiency," ACM Computing Surveys (CSUR), vol. 47, no. 2, pp. 1–36, 2014.

[27] L. Chiaraviglio, M. Mellia, and F. Neri, "Minimizing ISP network energy cost: Formulation and solutions," IEEE/ACM Transactions on Networking, vol. 20, no. 2, pp. 463–476, 2011.

[28] W. Fisher, M. Suchara, and J. Rexford, "Greening backbone networks: reducing energy consumption by shutting off cables in bundled links," in Proceedings of the first ACM SIGCOMM workshop on Green networking, 2010, pp. 29–34.

[29] M. Zhang, C. Yi, B. Liu, and B. Zhang, "GreenTE: Power-aware traffic engineering," in The 18th IEEE International Conference on Network Protocols. IEEE, 2010, pp. 21–30.

[30] A. Shehabi, S. Smith, D. Sartor, R. Brown, M. Herrlin, J. Koomey, E. Masanet, N. Horner, I. Azevedo, and W. Lintner, "United States data center energy usage report," Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States), Tech. Rep., 2016.

[31] eWEEK. Report finds over-provisioned hardware an expensive IT problem. [Online]. Available: https://www.eweek.com/it-management/report-finds-over-provisioned-hardware-an-expensive-it-problem

[32] M. Fernando, P. Esteves, C. Esteve et al., "Software-defined networking: A comprehensive survey," Proceedings of the IEEE, 2015.

[33] A. Tootoonchian, M. Ghobadi, and Y. Ganjali, "OpenTM: traffic matrix estimator for OpenFlow networks," in International Conference on Passive and Active Network Measurement. Springer, 2010, pp. 201–210.

[34] J. Kučera, D. A. Popescu, G. Antichi, J. Kořenek, and A. W. Moore, "Seek and push: Detecting large traffic aggregates in the dataplane," arXiv preprint arXiv:1805.05993, 2018.

[35] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman, "One sketch to rule them all: Rethinking network flow monitoring with UnivMon," in Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 2016, pp. 101–114.

[36] M. Moshref, M.
Yu, R. Govindan, and A. Vahdat, "SCREAM: Sketch resource allocation for software-defined measurement," in Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies. ACM, 2015, p. 14.

[37] M. Yu, L. Jose, and R. Miao, "Software defined traffic measurement with OpenSketch," in Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), 2013, pp. 29–42.

[38] P. Tune, M. Roughan, H. Haddadi, and O. Bonaventure, "Internet traffic matrices: A primer," Recent Advances in Networking, vol. 1, pp. 1–56, 2013.

[39] A. Soule, K. Salamatian, and N. Taft, "Combining filtering and statistical methods for anomaly detection," in Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement. USENIX Association, 2005, pp. 31–31.

[40] M. Roughan, M. Thorup, and Y. Zhang, "Traffic engineering with estimated traffic matrices," in Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement. ACM, 2003, pp. 248–258.

[41] T. Benson, A. Anand, A. Akella, and M. Zhang, "MicroTE: Fine grained traffic engineering for data centers," in Proceedings of the Seventh Conference on emerging Networking EXperiments and Technologies. ACM, 2011, p. 8.

[42] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee, "DevoFlow: Scaling flow management for high-performance networks," in ACM SIGCOMM Computer Communication Review, vol. 41, no. 4. ACM, 2011, pp. 254–265.

[43] C. Estan and G. Varghese, "New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice," ACM Transactions on Computer Systems (TOCS), vol. 21, no. 3, pp. 270–313, 2003.

[44] Y. Zhang, "An adaptive flow counting method for anomaly detection in SDN," in Proceedings of the ninth ACM conference on Emerging networking experiments and technologies. ACM, 2013, pp. 25–30.

[45] C. Walsworth, E. Aben, K. Claffy, and D. Andersen.
(2016) The CAIDA anonymized internet traces. [Online]. Available: http://www.caida.org/data/passive/passive_2016_dataset.xml

[46] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[47] Google BigQuery cloud data warehouse. [Online]. Available: https://cloud.google.com/bigquery/

[48] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[49] R. Xu and D. C. Wunsch, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, May 2005.

[50] A. Lazaris and V. K. Prasanna, "DeepFlow: a deep learning framework for software-defined measurement," in Proceedings of the 2nd Workshop on Cloud-Assisted Networking. ACM, 2017, pp. 43–48.

[51] F. Chollet et al., "Keras," https://github.com/fchollet/keras, 2015.

[52] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[53] B. Lantz, B. Heller, and N. McKeown, "A network in a laptop: rapid prototyping for software-defined networks," in Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks. ACM, 2010, p. 19.

[54] K. Papagiannaki, N. Taft, Z.-L. Zhang, and C. Diot, "Long-term forecasting of internet backbone traffic: Observations and initial models," in IEEE INFOCOM 2003, Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 2. IEEE, 2003, pp. 1178–1188.

[55] K.
Xu, Z.-L. Zhang, and S. Bhattacharyya, "Profiling internet backbone traffic: behavior models and applications," in ACM SIGCOMM Computer Communication Review, vol. 35, no. 4. ACM, 2005, pp. 169–180.

[56] C. You and K. Chandra, "Time series models for internet data traffic," in Proceedings 24th Conference on Local Computer Networks (LCN'99). IEEE, 1999, pp. 164–171.

[57] C. Barakat, P. Thiran, G. Iannaccone, C. Diot, and P. Owezarski, "Modeling internet backbone traffic at the flow level," IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2111–2124, 2003.

[58] S. Basu, A. Mukherjee, and S. Klivansky, "Time series models for internet traffic," in Proceedings of IEEE INFOCOM'96, Conference on Computer Communications, vol. 2. IEEE, 1996, pp. 611–620.

[59] A. Sang and S.-q. Li, "A predictability analysis of network traffic," Computer Networks, vol. 39, no. 4, pp. 329–345, 2002.

[60] A. Azzouni and G. Pujolle, "A long short-term memory recurrent neural network framework for network traffic matrix prediction," arXiv preprint arXiv:1705.05690, 2017.

[61] P. Patel, D. Bansal, L. Yuan, A. Murthy, A. Greenberg, D. A. Maltz, R. Kern, H. Kumar, M. Zikos, H. Wu, C. Kim, and N. Karri, "Ananta: Cloud scale load balancing," in Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, ser. SIGCOMM '13. New York, NY, USA: ACM, 2013, pp. 207–218. [Online]. Available: http://doi.acm.org/10.1145/2486001.2486026

[62] G. Qin, Y. Chen, and Y.-X. Lin, "Anomaly detection using LSTM in IP networks," in 2018 Sixth International Conference on Advanced Cloud and Big Data (CBD). IEEE, 2018, pp. 334–337.

[63] M. Cheng, Q. Xu, J. Lv, W. Liu, Q. Li, and J. Wang, "MS-LSTM: A multi-scale LSTM model for BGP anomaly detection," in 2016 IEEE 24th International Conference on Network Protocols (ICNP). IEEE, 2016, pp. 1–6.

[64] B. Zhou, D. He, and Z. Sun, "Traffic modeling and prediction using ARIMA/GARCH model," in Modeling and Simulation Tools for Emerging Telecommunication Networks.
Springer, 2006, pp. 101–121.

[65] Y.-W. Chen and C.-C. Chou, "Traffic modeling of a sub-network by using ARIMA," in 2001 International Conferences on Info-Tech and Info-Net, Proceedings, vol. 2. IEEE, 2001, pp. 730–735.

[66] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications. Springer-Verlag, 2005.

[67] S. Uhlig, B. Quoitin, J. Lepropre, and S. Balon, "Providing public intradomain traffic matrices to the research community," ACM SIGCOMM Computer Communication Review, vol. 36, no. 1, pp. 83–86, 2006.

[68] (2004) The 2004 Abilene traffic matrix dataset. [Online]. Available: http://www.maths.adelaide.edu.au/matthew.roughan/project/traffic_matrix/

[69] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.

[70] N. Ramakrishnan and T. Soni, "Network traffic prediction using recurrent neural networks," in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2018, pp. 187–193.

[71] P. Le Nguyen, Y. Ji et al., "Deep convolutional LSTM network-based traffic matrix prediction with partial information," in 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). IEEE, 2019, pp. 261–269.

[72] M. Li, C. Chen, C. Hua, and X. Guan, "CFlow: A learning-based compressive flow statistics collection scheme for SDNs," in ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–6.

[73] A. Mozo, B. Ordozgoiti, and S. Gomez-Canaval, "Forecasting short-term data center network traffic load with convolutional neural networks," PLOS ONE, vol. 13, no. 2, 2018.

[74] A. Lazaris and V. K. Prasanna, "An LSTM framework for modeling network traffic," in 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). IEEE, 2019, pp. 19–24.

[75] M. Moshref, M. Yu, R. Govindan, and A.
Vahdat, "DREAM: dynamic resource allocation for software-defined measurement," in ACM SIGCOMM Computer Communication Review, vol. 44, no. 4. ACM, 2014, pp. 419–430.

[76] N. L. Van Adrichem, C. Doerr, and F. A. Kuipers, "OpenNetMon: Network monitoring in OpenFlow software-defined networks," in 2014 IEEE Network Operations and Management Symposium (NOMS). IEEE, 2014, pp. 1–8.

[77] C. Liu, A. Malboubi, and C.-N. Chuah, "OpenMeasure: Adaptive flow measurement & inference with online learning in SDN," in 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2016, pp. 47–52.

[78] Y. Gong, X. Wang, M. Malboubi, S. Wang, S. Xu, and C.-N. Chuah, "Towards accurate online traffic matrix estimation in software-defined networks," in Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research. ACM, 2015, p. 26.

[79] Y. Li, R. Miao, C. Kim, and M. Yu, "FlowRadar: A better NetFlow for data centers," in 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016, pp. 311–324.

[80] Y. Yu, C. Qian, and X. Li, "Distributed and collaborative traffic monitoring in software defined networks," in Proceedings of the third workshop on Hot topics in software defined networking. ACM, 2014, pp. 85–90.

[81] A. Lazaris, D. Tahara, X. Huang, E. Li, A. Voellmy, Y. R. Yang, and M. Yu, "Tango: Simplifying SDN control with automatic switch property inference, abstraction, and optimization," in Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies. ACM, 2014, pp. 199–212.

[82] M. Malboubi, L. Wang, C.-N. Chuah, and P. Sharma, "Intelligent SDN based traffic (de)aggregation and measurement paradigm (iSTAMP)," in IEEE INFOCOM 2014 - IEEE Conference on Computer Communications. IEEE, 2014, pp. 934–942.

[83] S. Bera, S. Misra, and A. Jamalipour, "FlowStat: Adaptive flow-rule placement for per-flow statistics in SDN," IEEE Journal on Selected Areas in Communications, vol.
37, no. 3, pp. 530–539, 2019.

[84] H. Zhou, L. Tan, Q. Zeng, and C. Wu, "Traffic matrix estimation: A neural network approach with extended input and expectation maximization iteration," Journal of Network and Computer Applications, vol. 60, pp. 220–232, 2016.

[85] D. Jiang and G. Hu, "Large-scale IP traffic matrix estimation based on the recurrent multilayer perceptron network," in 2008 IEEE International Conference on Communications. IEEE, 2008, pp. 366–370.

[86] A. Lazaris and V. K. Prasanna, "Deep learning models for aggregated network traffic prediction," in 2019 15th International Conference on Network and Service Management (CNSM). IEEE, 2019, pp. 1–5.

[87] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[88] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[89] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, "Applications of deep reinforcement learning in communications and networking: A survey," IEEE Communications Surveys & Tutorials, 2019.

[90] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.

[91] N. Buduma and N. Locascio, Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. O'Reilly Media, 2017. [Online]. Available: https://books.google.com/books?id=EZFfrgEACAAJ

[92] A. Lazaris, D. Tahara, X. Huang, L. E. Li, A. Voellmy, Y. R. Yang, and M. Yu, "Jive: Performance driven abstraction and optimization for SDN," in Presented as part of the Open Networking Summit 2014 (ONS 2014), 2014.

[93] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, "Resource management with deep reinforcement learning," in Proceedings of the 15th ACM Workshop on Hot Topics in Networks, 2016, pp. 50–56.
[94] Y. S. Nasir and D. Guo, "Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks," IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2239–2250, 2019.

[95] C. Zhong, M. C. Gursoy, and S. Velipasalar, "A deep reinforcement learning-based framework for content caching," in 2018 52nd Annual Conference on Information Sciences and Systems (CISS). IEEE, 2018, pp. 1–6.

[96] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, "Deep reinforcement learning for dynamic multichannel access in wireless networks," IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 2, pp. 257–265, 2018.

[97] L. Chen, J. Lingys, K. Chen, and F. Liu, "AuTO: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization," in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. ACM, 2018, pp. 191–205.

[98] T.-Y. Mu, A. Al-Fuqaha, K. Shuaib, F. M. Sallabi, and J. Qadir, "SDN flow entry management using reinforcement learning," ACM Transactions on Autonomous and Adaptive Systems (TAAS), vol. 13, no. 2, p. 11, 2018.

[99] T. V. Phan, S. T. Islam, T. G. Nguyen, and T. Bauschert, "Q-DATA: Enhanced traffic flow monitoring in software-defined networks applying Q-learning," arXiv preprint arXiv:1909.01544, 2019.
Asset Metadata
Creator: Lazaris, Angelos (author)
Core Title: Machine learning for efficient network management
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 05/07/2020
Defense Date: 05/06/2020
Publisher: University of Southern California
Language: English
Advisor: Prasanna, Viktor K. (committee chair); Deshmukh, Jyotirmoy Vinay (committee member); Raghavendra, Cauligi (committee member)
Document Type: Dissertation
Tags: machine learning; network traffic measurement; network traffic modeling; SDN
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-299746
Repository: University of Southern California Digital Library