Internet security and quality-of-service provision via machine-learning theory
(USC Thesis Other)
INTERNET SECURITY AND QUALITY-OF-SERVICE PROVISION VIA MACHINE-LEARNING THEORY

by

Junghun Park

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2006

Copyright 2006 Junghun Park

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 3237707

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

UMI Microform 3237707
Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346

Dedication

This dissertation is dedicated to my parents and wife for their endless love.

Table of Contents

Dedication
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
  1.3 Contributions of the Research
  1.4 Organization of the Dissertation

2 Background Review
  2.1 TCP Control Information
  2.2 Denial-of-Service Attack
    2.2.1 Distributed Denial-of-Service (DDoS) Attack
  2.3 Markov Model and Hidden Markov Model
    2.3.1 Discrete-time Markov Model
    2.3.2 Hidden Markov Model
  2.4 Scalable Internet Traffic Measurement Tools

3 HMM-based TCP SYN Flooding Attack Detection
  3.1 Introduction
  3.2 Analysis of TCP SYN Flooding Attack
    3.2.1 Attack Identification
    3.2.2 Dynamics of Attack Packets
    3.2.3 Solution to TCP SYN Attack via OS Parameter Setting
    3.2.4 TCP Connection Loss Probability
  3.3 Proposed HMM-based Detector
    3.3.1 Review of Previous Work
    3.3.2 Feature Selection
    3.3.3 Detector Deployment
    3.3.4 Stateful and Stateless Detection Mechanisms
    3.3.5 Stationarity of Normalized Residue Sequences
    3.3.6 Proposed HMM-based Detector
    3.3.7 Training Process
    3.3.8 Decision Process
  3.4 Simulation Results
    3.4.1 Simulation Setup
    3.4.2 Detection Time
    3.4.3 Detection Rate
    3.4.4 Complexity
  3.5 Conclusion

4 Detection of the Bandwidth Depletion Attack
  4.1 Introduction
  4.2 Deployment
  4.3 Detector Using Multiple Markov Models
    4.3.1 Markov Model
    4.3.2 Detector Using Combined Markov Models
    4.3.3 Enhancement of Basic Proposed Detector
  4.4 Simulation Results and Analysis
    4.4.1 Simulation Environment
    4.4.2 Sequential-Batch Detector vs Multiple Markov Detector

5 Internet Packet Classification for QoS Provision
  5.1 Introduction
  5.2 Features Selection
    5.2.1 Feature Extraction
    5.2.2 Feature Reduction
    5.2.3 Proposed Feature Training and Testing System
  5.3 Classification Methods
    5.3.1 Naive Bayesian Approaches
    5.3.2 Decision Trees
  5.4 Early Classification Based on Partial Flow Information
  5.5 Simulation Results and Discussion
    5.5.1 Simulation Setup
    5.5.2 Accuracy and Complexity
    5.5.3 Complexity and Memory Requirements
    5.5.4 Robustness
    5.5.5 Comparison of Modified Multistage Filter and NetFlow
  5.6 Conclusion

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

References

List of Tables

3.1 Statistics of TCP SYN flood attacks in the collected data set.
3.2 Default values of the backlog queue size and the maximum duration of a half-open connection for three operating systems.
3.3 Statistical information for normal TCP SYN request traffic, in units of seconds.
3.4 The first- and second-order statistics of Z(n) for To = 2, 3, 4, 5.
3.5 Nine observation cases.
4.1 Parameters of the simulation environment.
5.1 Characteristics of classes under classification.
5.2 The list of candidate features.
5.3 The rank-ordered features from symmetrical uncertainty values, where Avg., Var., and AdvWin. mean the average, the variance, and the Advertised-Window in the TCP header, respectively.
5.4 Time complexity for Naive Bayesian (NB), NB Kernel Estimator (NBKE) and decision tree.
5.5 The number of flows of the classes in the trained data set from PSC.
5.6 Optimized features obtained from a wrapper method using GA and two classifiers in the P2A directions, where Avg., Var., and AdvWin. are the average and the variance of packet sizes and the Advertised-Window in the TCP header, respectively, and I-A time and pkt stand for inter-arrival time and packet, respectively.
5.7 Optimized features obtained from a variant of the wrapper method using FCBF and three classifiers in the P2A directions.
5.8 The TP rates according to major applications in classifiers with genetic search algorithm.
5.9 The space requirements of decision trees under different feature selection schemes of wrapper methods.
5.10 Accurate classification rates of various classification tools in the direction from the server to clients with traffic data sets in two different sites.

List of Figures

2.1 The three-way handshake mechanism and the corresponding state transition for the TCP protocol.
2.2 The distributed denial-of-service attack network and its operation.
2.3 A Markov model of three states with state transition probabilities.
2.4 An HMM with 3 states and M observations.
3.1 The distribution of the inter-arrival time of attack packets using TCP SYN.
3.2 The statistics of attacks in terms of the attack duration and the packet arrival rate, where each vertical segment denotes an attack.
3.3 The cumulative histogram of RTT measured in terms of the time between the TCP SYN packet and the responsive message (ACK, FIN or RST) from the TCP SYN sender.
3.4 The TCP connection loss probability as a function of the attack rate in the Solaris server.
3.5 The TCP connection loss probability as a function of the attack rate in the Windows 2000 server.
3.6 The deployment of the proposed HMM detector.
3.7 The flowchart for a stateful mechanism.
3.8 The behaviors of Z(n) under the normal and the attack traffic.
3.9 The observation events with non-zero probabilities in the proposed HMM under states 0, 1 and 2.
3.10 The average distances between a test HMM and three reference HMMs, where the x-axis denotes the number of observation sequences used in the training of the test HMM.
3.11 The plot of MCR as a function of the detection depth.
3.12 The detection time as a function of the attack rate (SYNs/sec).
3.13 The detection probability as a function of the attack rate (SYNs/sec).
4.1 The trellis diagrams for the three scenarios.
4.2 The proposed detector using Markov models.
4.3 The integration of two Markov models into one Markov model with two transitions.
4.4 The detection system using multiple combined Markov models.
4.5 Network topology used for simulation.
4.6 Performance comparison of sequential, the 1st-order and the 2nd-order Markov model detectors with attack traffic increasing linearly.
4.7 Performance comparison of sequential, the 1st-order and the 2nd-order Markov model detectors with attack traffic increasing abruptly.
4.8 The optimum detection path of the proposed detector with respect to different attack rates.
4.9 The MCR performance of the proposed detector using different attack training data.
5.1 Feature training and testing processes.
5.2 The basic operation in a multistage filter for the classification in the domain of time.
5.3 The accurate classification rates versus the number of features which are ranked according to symmetrical uncertainty values in (a) the A2P direction and (b) the P2A direction.
5.4 Comparison of accurate classification rates of REP, NBKE and J48 when the full feature set and two sets of features selected by GA and Moore's work are used.
5.5 Comparison of computational complexity of several classifiers in the P2A direction.
5.6 The accurate classification rates as a function of the number of packets used for classification.
5.7 The accurate detection rates versus the sampling rates in (a) the A2P direction and (b) the P2A direction.
Abstract

To detect DoS (Denial-of-Service) attacks, two mechanisms based on traffic pattern monitoring using HMMs (Hidden Markov Models) and multiple Markov models are proposed in this research. To effectively design a detector against the TCP SYN flooding attack, we first analyze the dynamic behavior of real-world attacks and then propose a stateful HMM detector to achieve early detection with high accuracy. By training multiple HMMs differently, the detector can combine the advantages of misuse detection and anomaly detection. With the stateful mechanism, the impact of background noise due to the protocol behavior can be mitigated. We compare the proposed HMM detector with the stateless Cumulative Sum (CUSUM) and the stateful CUSUM detector using trace-driven simulations. Simulation results show that the proposed HMM detector detects attacks earlier and achieves a higher detection rate under the same false alarm rate.

Furthermore, we develop a detector using multiple Markov models to detect the UDP flooding attack in wireless networks. A high-rate attack using UDP can be detected easily since there are few legitimate users using UDP in the network. However, it is difficult to detect subtle UDP flooding attacks since there are many UDP-based applications with a dynamic traffic rate. A Markov model is used to characterize the traffic pattern. Multiple Markov models are trained with normal traffic and some deviations from the normal traffic, and they are integrated into a single detector. The proposed detector is compared with the batch-sequential detection algorithm in terms of the false alarm rate and detection latency.

Finally, to support various Internet services such as QoS, security, and accounting, the Internet traffic classification problem is studied.
The proposed classification process consists of two steps: feature selection and classification. Candidate features that can be easily obtained by ISPs are considered. Then, we perform feature reduction to balance performance and complexity. Decision trees are adopted as classifiers. Simulations with real data demonstrate that the proposed classification scheme outperforms existing techniques.

Chapter 1

Introduction

1.1 Significance of the Research

The Internet is indispensable to our daily life nowadays. The success of the Internet is driven by the TCP/IP protocols. The IP protocol enables any node to send a packet to any other node using its IP address. While this flexibility offers scalability to the Internet, it also makes the Internet vulnerable to the denial-of-service (DoS) attack. In a DoS attack, a large number of packets are sent to an IP node so that key resources at the victim (such as the bandwidth, buffer, and CPU time needed to handle these packets) are quickly exhausted. As a result, legitimate users are prevented from accessing the node for services.

The Internet is used daily to provide important services such as stock trading, financial management, and on-line auctions. These services are time-critical. However, a DoS attack can delay or block these services, causing a communication breakdown between victim sites and their customers, which may in turn lead to financial loss and other damages.

The defense against the DoS attack consists of a detection mechanism and a response mechanism. To achieve an effective defense, robust detection within a short period of time after the launch of the attack is required. After the attack is detected, most response mechanisms block attack packets.
Good DoS detection algorithms have to meet several criteria. First, the detector needs both a high detection rate and a low false alarm rate. If an attack is missed, the damage is obvious as explained above. On the other hand, if there is a false alarm, packet blocking prevents legitimate users from accessing their data services. Besides, DoS attacks usually use spoofed source addresses, and traceback algorithms [54,56,57] have been developed to find the attack source. These algorithms are costly to implement. Second, the overhead of the DoS detector implementation should be small since no attack is in progress most of the time. Third, early detection is important in mitigating the impact of an attack.

The DoS attack has become more complex these days through the use of distributed sources and spoofed source addresses; this variant is called the Distributed DoS (DDoS) attack. Furthermore, sources generate attack packets at a lower rate in a DDoS attack to avoid easy detection. The low rate of attack flows makes their detection more difficult since their traffic is similar to normal traffic. The spoofed source address makes it difficult to differentiate normal and attack packets using the packet header information. Moreover, normal Internet traffic is often bursty, and an attacker can control the attack duration so that the attack traffic mimics normal traffic. As a result, it becomes increasingly difficult to detect the attack in a timely and accurate manner.

DoS attacks can be classified into two types: (i) depletion of computing resources and (ii) depletion of the bandwidth. For example, the TCP SYN flooding attack belongs to the first type while the UDP flooding attack belongs to the second type. In this research, we develop efficient detection mechanisms against both types. Our detection is based on traffic monitoring.
To design the detector, we first examine the traffic dynamics in terms of the flow of packets at a certain node of the network in a practical environment. The cases under study include both normal and attack traffic. The traffic dynamics of these different underlying scenarios are then modelled by several hidden Markov models (HMMs). Robust detection can then be achieved by comparing the likelihood values of these models for a particular observed traffic pattern.

When applied to DoS attacks, the HMM-based detector is shown to have several advantages over existing traffic-based DoS attack detection schemes such as the one based on the Cumulative Sum (CUSUM) [63]. For the TCP SYN flooding attack, we evaluate the performance of the proposed HMM detector against the stateless CUSUM [63] and the stateful CUSUM detector via trace-driven simulations, and observe that the proposed HMM detector is less affected by the sparkle noise that often degrades the performance of a detector. Furthermore, multiple Markov models (MMM) are developed for detecting the UDP flooding attack. The proposed MMM-based detector is compared with the batch-sequential detection algorithm studied in [5].

A high-performance classification method for Internet traffic is essential to various services offered by Internet Service Providers (ISPs). Some of these services are described below.

• Real-time QoS (Quality-of-Service) support
  QoS provision over the Internet can be addressed by considering the packet forwarding mechanism and network resource management. They should be designed to support the different rate/delay requirements of various flows. The flow-based service depends on the application type. Even though they have been extensively studied in the literature, QoS-based services are still not deployed by ISPs.
  One of the main obstacles is the lack of a simple yet accurate mechanism to classify the application in real time. Our proposed scheme for Internet traffic classification is expected to enhance the QoS operation in real-world applications.

• Detection against the bandwidth-depleting DoS attack
  A scalable mechanism to protect individual flows against the bandwidth-depleting DoS attack is still lacking at ISPs today. The main reason is that there is no good rule for selecting the threshold at which to raise an alarm when a DoS attack occurs. If we know the application type of a flow, we can find a threshold tailored to each individual flow to protect the network against DoS attacks, especially the distributed DoS (DDoS) attack that generates nuisance packets at a low rate from multiple attack sources. This approach allows DoS attack detection at an earlier stage thanks to the refined threshold value.

• Pricing for network traffic
  There are two typical types of network traffic pricing policies: pricing by the amount of information transferred (in bits) and pricing by the connection time. However, pricing can also be adjusted according to the requested QoS level. Ideally, a network administrator provides customers with different QoS levels for different applications via traffic classification, and charges customers accordingly.

1.2 Review of Previous Work

The DoS attack is a form of intrusion. Generally speaking, an intrusion detection system can be classified as either (i) a signature detection system [46,62] or (ii) an anomaly detection system [5,63]. A signature detection system detects attacks by comparing the traffic pattern with predefined attack signatures, while an anomaly detection system detects attacks when a deviation from normal traffic is observed.
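As a minimal sketch of the distinction between signature and anomaly detection, the following Python fragment contrasts the two detector families on per-interval SYN counts. Everything in it is a hypothetical illustration rather than part of the cited systems: the quantization levels, the `KNOWN_SIGNATURES` set, the baseline counts, and the z-score threshold of 3 are assumptions chosen only to make the example concrete.

```python
from statistics import mean, stdev

# Hypothetical quantization of per-interval SYN counts into coarse symbols.
def quantize(count):
    if count < 20:
        return "L"
    if count < 40:
        return "M"
    return "H"

# Hypothetical signature set: symbol patterns previously labeled as attacks.
KNOWN_SIGNATURES = {("H", "H", "H")}

def signature_detect(window):
    """Signature detection: flag a window only if it matches a stored attack pattern."""
    return tuple(quantize(c) for c in window) in KNOWN_SIGNATURES

def anomaly_detect(window, baseline, z_threshold=3.0):
    """Anomaly detection: flag a window whose mean deviates strongly from the baseline."""
    z = abs(mean(window) - mean(baseline)) / stdev(baseline)
    return z > z_threshold

baseline = [10, 12, 11, 9, 13, 10, 11, 12]  # assumed attack-free SYN counts
known_attack = [50, 55, 60]                 # matches the stored signature
novel_attack = [20, 22, 21]                 # moderate rate, no stored signature
normal = [11, 10, 12]

for name, w in [("known", known_attack), ("novel", novel_attack), ("normal", normal)]:
    print(name, signature_detect(w), anomaly_detect(w, baseline))
```

In this toy setting the signature detector flags only the stored pattern, while the anomaly detector also flags the pattern it has never seen, at the cost of depending on how well the baseline represents normal traffic.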
Signature detection and anomaly detection each have their own advantages and disadvantages. On one hand, the signature detection system detects known attacks faster than the anomaly detection system. On the other hand, the anomaly detection system is more robust in detecting unknown attacks. A hybrid approach that mitigates the weaknesses of the two approaches and magnifies their strengths was proposed in [55]. In this work, we also adopt the hybrid approach, using HMMs to model traffic patterns.

Several defense mechanisms against TCP SYN flooding attacks have been proposed, including attack mitigation, detection, and response. In the following, we discuss these mechanisms according to their location in the Internet.

• Server-based protection
  SYN cache and SYN cookies [38] are implemented in the BSD and the Linux kernels, respectively, to alleviate the impact of the attack on the server. They assign a reduced form of data structure to clients to make resource exhaustion caused by the attack less likely. For example, the SYN cache allocates the reduced data structure to each new request. The reduced data structure can decrease connection latency in addition to mitigating the attack. SYN cookies encode the state of the initial TCP connection in a value computed with a cryptographic function. This mechanism is more resilient to the attack since only a value is managed. However, it cannot support the negotiation of TCP options. Some of them, such as the window size or selective acknowledgement, are related to the overall throughput performance.

• Firewall-based protection
  A firewall provides two ways of defending against a SYN flooding attack, namely, relay and gateway. The firewall relay, such as Syn Defender [12] and Netscreen [33], intercepts a SYN packet going to a server. It acts on the server's behalf and replies with a SYN/ACK packet to the client.
  If a SYN comes from the client, the firewall plays the role of the server. The firewall gateway [32] sends the RST message to the server if an ACK does not come from the client within a certain amount of time. Thus, the firewall mechanism can protect the server's resources. The firewall protection is simple and effective. However, since it interferes with the end-to-end protocol, the end-to-end performance for normal users can be degraded as well.

• Router-based protection
  Router-based detection and response mechanisms have been proposed for detecting and localizing attacks in [42,63]. They provide a good solution against the attack since both detection and response can be achieved at the same time in protecting network resources. They are executed at the leaf router of a subnet, which monitors packets. If an attack is detected in outbound traffic, they can help block the source without running any traceback algorithm across the Internet. The cumulative sum (CUSUM) method was proposed by Wang et al. [63] to detect the TCP SYN flooding attack. The CUSUM method is inherently a sequential change point detection algorithm. However, its performance is poor in detecting low-rate attacks, which nevertheless occur often in a real-world environment as a consequence of the DDoS attack.

As described above, the firewall- and the server-based protections can only mitigate the attack impact. To remove the attack source, we have to rely on an attack detection and response mechanism, e.g. the IP traceback technique [45,54,56], which is a router-based protection scheme. However, the traceback algorithm is computationally expensive and may be inaccurate under the DDoS attack. To apply the IP traceback algorithm effectively, we need to differentiate DoS and DDoS attacks first. A classification algorithm using a spectral domain technique was proposed by Hussain et al. [31].
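The sequential change-point idea behind CUSUM, mentioned in the router-based discussion above, can be sketched as follows. This is a generic one-sided CUSUM and not necessarily the exact formulation of Wang et al. [63]; the `drift` and `threshold` parameters and the sample sequences are hypothetical values chosen for illustration.

```python
from typing import Iterable, Optional

def cusum_alarm_index(samples: Iterable[float], drift: float, threshold: float) -> Optional[int]:
    """Return the index of the first sample at which the one-sided CUSUM
    statistic exceeds `threshold`, or None if it never does."""
    s = 0.0
    for i, x in enumerate(samples):
        # Accumulate only the excess over the allowed drift; never go below zero.
        s = max(0.0, s + x - drift)
        if s > threshold:
            return i
    return None

# Hypothetical normalized traffic statistic (e.g. excess SYNs per interval).
normal = [0.1, 0.0, 0.2, 0.1, 0.0]
high_rate = normal + [2.0] * 5    # abrupt, strong change starting at index 5
low_rate = normal + [0.5] * 20    # subtle change starting at index 5

print(cusum_alarm_index(normal, drift=0.3, threshold=1.5))     # None
print(cusum_alarm_index(high_rate, drift=0.3, threshold=1.5))  # 5
print(cusum_alarm_index(low_rate, drift=0.3, threshold=1.5))   # 12
```

With these assumed values the alarm fires on the first attack sample of the high-rate sequence but only on the eighth attack sample of the low-rate sequence, which illustrates why low-rate attacks are harder for this kind of detector to catch quickly.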
Once DoS and DDoS attacks have been differentiated, the traceback algorithm against the DDoS attack [57] can be applied more conveniently. Generally speaking, DoS detection and classification should precede the traceback algorithm.

The review above has primarily focused on the resource-depletion attack. The bandwidth-depletion attack has received much less research attention. It is worth noting that a signal processing technique has been used to detect the bandwidth-depletion attack: the periodicity of the TCP behavior was analyzed by Cheng et al. [13] with the discrete Fourier transform to detect anomalous traffic.

Accurate classification of Internet traffic at the application level is essential to various services in today's Internet. These services include QoS, security, accounting, traffic engineering and provisioning for future resources. Some applications can be characterized by particular features. Claffy [17] showed that the joint distribution of flow length and the number of packets provides a good feature set to distinguish DNS (Domain Name Service) traffic. Dewes et al. [20] exploited the fact that most packets are smaller than 200 bytes in web chat traffic to analyze Internet chat systems. Wright et al. [36] adopted the Hidden Markov Model and trained the HMM with the packet size and arrival time as features for traffic classification. However, the performance is not good due to the limited number of features. Furthermore, the parametric distributions of features used in the HMM are not robust, since Internet traffic is non-stationary.

Different classification techniques are demanded by different applications. For example, P2P applications use arbitrary port numbers to hide themselves. To achieve reliable estimation, Karagiannis et al. [34] identified P2P flows at the transport layer using connection patterns of P2P networks without using the payload information. Soule et al. [58] addressed this problem by classifying aggregate flows into a smaller set of classes defined by their bandwidth usage. They proposed a method to model flow histograms using the Dirichlet Mixture Process for random distributions.

1.3 Contributions of the Research

In this work, we consider the router-based approach for attack detection, and take a hybrid approach in order to get the advantages of both signature and anomaly detection systems. Besides, we consider both the resource-depletion attack and the bandwidth-depletion attack. The major contributions of this research are summarized as follows.

• Detection against the TCP SYN flooding attack
  - We examine the properties of the TCP SYN flooding attack in terms of inter-arrival time from real attack samples. While there has been little study on the attack traffic model, we infer the influence of the attack traffic on the connection loss probability, which provides a better understanding of the problem and our solution.
  - We propose an HMM to capture the dynamics of network traffic, such as the change of the traffic volume. To obtain a stationary process in the HMM, the behavior of TCP SYN and ACK, FIN or RST packets is treated as the observations for the HMM. This behavior provides a clue for differentiating normal and attack traffic.
  - We propose three models in the proposed HMM detector. They are trained by the Baum-Welch re-estimation algorithm [52] with (i) normal traffic, (ii) low-rate attack traffic, and (iii) high-rate attack traffic.
The HMM trained with the high-rate attack traffic enables the detector to work as an anomaly detection system, while the HMM trained with the low-rate attack traffic gives the proposed detector enhanced capability in detecting the low-rate attack and lets it function as a signature detection system. Thus, the proposed detector can work as a hybrid detector.
- We demonstrate the improved performance of the proposed stateful HMM detector by measuring the computational load and the memory usage with real data sets. In addition, the proposed HMM detector has better detection performance than the CUSUM detector under the same false alarm rate.

• Detection against the UDP flooding attack for wireless networks
- We discuss the bandwidth-depletion DoS attack for wireless networks (especially 3G wireless networks), where the attack target is a set of users associated with a base station. The attack may occur even with a low attack traffic rate. This discussion helps us understand a security hole in wireless networks, such as ad-hoc and sensor networks, which are becoming increasingly important.
- We propose to use a second-order Markov model and multiple Markov models to enhance the detection performance. The proposed detector is compared with the batch-sequential detection algorithm [5] in terms of the false alarm rate and detection latency.

• Internet traffic classification
- A fast and robust scheme that classifies Internet packets according to their application types is investigated.
- For feature selection, practical features are extracted using tools such as the multistage filter and NetFlow. By using the genetic algorithm (GA) and a variant of the wrapper method, we obtain two sets of features for comparison.
- Decision trees such as J48 and REPTree and the bagging method using REPTree are used as classifiers.
Decision trees are trained with selected features from real traffic. The trained decision trees are compared with classifiers using Bayesian approaches in terms of accuracy, complexity, memory space, and robustness.
- It is demonstrated by simulation results that the decision tree with features selected by GA gives the best performance. Finally, early classification with a modified multistage filter is proposed to reduce collision errors for fast and robust performance.

1.4 Organization of the Dissertation

The rest of this dissertation is organized as follows. The background on the DoS attack and HMM is reviewed in Chapter 2. A new detection mechanism against the TCP SYN flooding attack is presented in Chapter 3, where an HMM-based detector is developed. We use real traffic samples collected in [31] for the modeling purpose. Then, we propose a stateful detection mechanism to reduce sparkle noise which may occur in the protocol behavior of TCP SYN-ACK, FIN or RST pairs. The trace-driven simulation based on the normal traffic samples is conducted to evaluate the proposed HMM detector and the CUSUM detector [63]. The issue of bandwidth-depleting DoS attacks in wireless networks is examined in Chapter 4, where multiple Markov models are used for attack detection. For Internet traffic classification, several features are identified and decision trees are compared with Bayesian approaches in Chapter 5. Finally, concluding remarks and future work are given in Chapter 6.

Chapter 2

Background Review

2.1 TCP Control Information

The TCP protocol has been developed to provide reliable end-to-end data transport services. TCP peers should be synchronized to support the service.
TCP peers operate according to their state-transition diagram to achieve synchronization. The synchronization process is initiated by a three-way handshaking mechanism for the connection setup. The synchronization process is also required between peers in terminating a TCP session. We show in Fig. 2.1 the three-way handshake mechanism in TCP establishment and teardown. Note that if only one TCP peer closes the connection while the other still keeps the connection open, this may confuse the operations and affect the performance of the network.

[Figure 2.1: The three-way handshake mechanism and the corresponding state transition for the TCP protocol.]

The 6-bit Flags field in the TCP header is used to relay control messages between TCP peers. The flags are SYN, FIN, ACK, PSH, URG, and RST. Flags SYN and FIN are used to request a new connection and close an existing connection, respectively. At teardown, each peer should send a TCP FIN packet for synchronization. The ACK flag is set any time for data acknowledgement. The PSH flag is set for delivering data to applications. The RST flag is used to abort a connection when an undesirable event occurs. For example, if a TCP packet arrives at a closed port, the packet receiver sends an RST packet to the sender. Unlike the FIN flag, a packet with the RST flag does not need a handshake, since the sender understands that the session no longer exists.

When a server receives a SYN request, it returns a SYN/ACK packet to the client.
Until the ACK packet from the client corresponding to the sent SYN/ACK packet arrives at the server, the server keeps the information on the SYN request, such as the socket data structure, in its system memory up to the connection timeout, which is set to 75 seconds by default in the Solaris and Windows 2000 operating systems (OS). If the client does not receive a SYN/ACK packet from the server after sending TCP SYN, the client retransmits the TCP SYN packet after a timeout, which can be set differently according to the client's OS. For example, the timeout values are set to 1, 3, and 7 seconds sequentially by default in the Solaris OS.

2.2 Denial-of-Service Attack

Security attacks can be categorized into active and passive attacks. Eavesdropping and traffic analysis are examples of passive attacks. Authentication, message modification, and denial-of-service (DoS) attacks are examples of active attacks. Some examples of network security attacks are given below.

• An attacker floods the network with a large number of nuisance packets so that legitimate traffic is blocked.
• An attacker prevents a particular individual or some specific users from accessing a service.
• An attacker disrupts a service to a specific system or person by causing a malfunction in hardware or software within a system.

The DoS attack is an attempt to prevent legitimate users of a service from using that service. It originates from the fact that some key resources in the Internet are limited, including bandwidth, buffers, and CPU time to compute the response. According to the attack target, the DoS attack can be classified into two types: the bandwidth-depletion attack and the computing-resource-depletion attack (such as memory buffer and CPU time).
It is possible to invoke the computing-resource-depletion attack with a smaller rate of attack packets than the bandwidth-depletion attack. This property requires the defense system to react differently to different types of attacks.

Several commonly observed DoS attacks include the TCP SYN flooding attack, the UDP flooding attack, the ICMP flooding attack, and the Domain Name Service (DNS) reflector attack [41]. The TCP SYN flooding attack is one of the computing-resource-depletion attacks. The UDP flooding attack and the DNS reflector attack belong to the bandwidth-depletion attack. The ICMP flooding attack can be included in both attack types. These attacks are described in detail below.

• TCP SYN flooding attack
The TCP SYN flooding attack exploits the limited queue in a server, which is used for the three-way handshake at the initial TCP setup stage. The handshake process makes a server keep a half-open connection during the round-trip time (RTT), as shown in Fig. 2.1. The server manages the half-open connection in its system memory, which is of a finite size. If a malicious user sends a large number of TCP open requests, called TCP SYN packets, the target server may use up its system memory. When a new TCP SYN packet comes in, there is no memory left and the server has to remove one request entry from the memory to make room for the new request. If the removed request is from a legitimate user, the user is not able to connect to the server. The attack that results in such a connection failure is called the TCP SYN flooding attack. Since this attack is simple to generate yet its impact is great, many reported Internet attacks in the past have been of this type.

• UDP flooding attack
The UDP flooding attack exploits the characteristics of the UDP protocol, which sends data packets without considering network congestion.
The attack 15 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. can achieve its goal easily by sending a large number of large-sized UDP packets. • ICMP flooding attack The attacker generates a flood of ICMP echo packets to a victim. The victim replies to each ICMP request. As the number of requests increases, a large number of replies consume the computing resource and the network band width of the victim. This attack is as simple as the UDP flooding attack. To give an example, the Smurf attack [11] is an ICMP reflector attack, in which an attacker generates spoofed ICMP packets from a given subnet. The intermediate node who receives the generated ICMP packets sends replies to the target. Then, the target receives a large number of ICMP echo replies. • DNS reflector attack The attacker sends a stream of DNS requests to multiple name servers, spoof ing the victim’s address in their source address fields [18]. If the target name server allows the query and is configured to be recursive, the response could contain more data than the original DNS request. As a result, more band width can be wasted. By deploying the ingress filter at the source network, the spoofed IP packets can be prevented. 2.2.1 Distributed Denial-of-Service (DDoS) Attack By the distributed denial-of-service (DDoS) attack, the attacker uses multiple at tack sources to generate nuisance packets to hide its true location and increase the attack effect. There are several commonly used DDoS attack tools, includ ing Trinoo, Tribe Flood Network (TEN), Tribe Flood Network 2000(TFN2K) and Shaft [19]. These tools tend to be more sophisticated. Each of them has some slight 16 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 
difference in generating attack packets and in establishing the communication between the attacker, masters and agents. The DDoS network consists of an attacker, masters, and agents, as shown in Fig. 2.2, which is borrowed from [19].

[Figure 2.2: The distributed denial-of-service attack network and its operation. The attacker infects masters and agents to build the DDoS network and issues attack commands; the agents then send attack traffic to the victim.]

To build a DDoS attack, an attacker first searches for compromised hosts. The hosts found play the role of masters, which download a scanning tool from the attacker and search for more remote compromised machines. The master infects other compromised machines with attack codes. Finally, the infected machines become agents that follow commands from the master. The process of building a DDoS network can be done automatically, as with a worm.

Once a DDoS network is constructed, the attacker can control the onset of the attack as well as its type, rate and duration. The attacker controls the scenario by commanding masters and agents to generate attack packets accordingly. Usually, agents send the packets at their maximum possible rates to increase the overall effect.

2.3 Markov Model and Hidden Markov Model

2.3.1 Discrete-time Markov Model

A system may be described at any time as being in one of a set of $N$ distinct states indexed by $\{1, 2, \ldots, N\}$. Fig. 2.3 illustrates a system with 3 states. The system changes state according to a set of state transition probabilities at discrete times, which are usually evenly spaced. When the discrete time is indexed with $t = 1, 2, \ldots$, we can represent the state at time $t$ as $q_t$. To fully describe the system, the specification of the current state and all of the predecessor states is needed.
Considering only the first order in the system, the dependence on the predecessor states is truncated to the preceding state. This can be represented as

$$P[q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots] = P[q_t = j \mid q_{t-1} = i]. \tag{2.1}$$

Finally, if the right-hand side of the above equation is independent of time, we can write the state transition probability as […]

$$\ldots, \qquad nT_0 < t < (n+1)T_0, \tag{3.2}$$

where $T_0$ is the unit observation time interval. Note that $S(n)$ is slightly larger than $Y(n)$ due to the loss of the SYN packet. Let

$$X(n) = S(n) - Y(n) \tag{3.3}$$

denote the residue, which usually depends on the time of day and the site, since the observation interval is limited while the RTT depends on the network load. To alleviate this dependency, $X(n)$ is normalized by the moving average of $Y(n)$,

$$\bar{Y}(n) = \alpha \bar{Y}(n-1) + (1 - \alpha) Y(n). \tag{3.4}$$

Then, the normalized $X(n)$ value can be written as

$$Z(n) = X(n) / \bar{Y}(n). \tag{3.5}$$

The pre-processing of $Z(n)$ is borrowed from [63]. $Z(n)$ will be close to zero under the normal case. If, however, the TCP SYN flooding attack starts, $Z(n)$ will increase. Thus, $Z(n)$ can be used as a feature.

3.3.3 Detector Deployment

When a detector operates, it is desirable that it has little impact on normal operations. The proposed detector can be placed at the leaf router or the gateway that connects an Intranet to the Internet. The leaf router provides the information of the network and the transport layers to the detector as long as its own process is not affected. The detector receives the relevant data of all inbound and outbound packets, and the correlation feature described in the last subsection can be extracted accordingly. Then, the detector monitors the occurrence of the attack.
If an attack is detected, the detector can give the network system administrator some related information, such as the target address and the source addresses of packets destined to the target.

The deployment of a detector at the leaf router was proposed in [42,63]. The advantages of this deployment include the following.

1. The end-to-end TCP performance can be kept.
2. The response mechanism after attack detection in outbound traffic does not depend on the expensive traceback algorithm.
3. The bandwidth that is wasted by the traversal of the attack packets can be saved.

However, it has some disadvantages, such as the increase of the computation overhead and the complexity of the detection algorithm at a centralized point. Overall, the deployment of detectors at the leaf router has more advantages than their deployment at the end points. Thus, we propose to place the proposed HMM detector at the leaf router or the gateway in a stub AS (autonomous system), as shown in Fig. 3.6.

[Figure 3.6: The deployment of the proposed HMM detector.]

In the case of a multi-homed AS, such as the large corporation in Fig. 3.6, the AS has more than one gateway to maintain reliability and load balancing. When the network is in the normal situation, packets in the same TCP session pass through the same leaf router so that the proposed detector can work well. However, if the packets go through different gateways, the correlation feature no longer holds. In this case, cooperation between leaf routers is needed.

To update $Z(n)$, packet classification has to be done first. The network-layer information is accessible since the detector operates at the leaf router.
The IP header provides the protocol of the upper level and the IP header length. The network-layer security of IPSec can prevent the detector from accessing the content in upper layers. However, the proposed HMM detector needs to access the Flags field in the TCP header to compute $S(n)$ and $Y(n)$. This can be achieved by a multi-layer IPSec protocol [67]. If the upper-layer protocol is TCP, the detector continues to the TCP header by skipping the number of bytes given by the IP header length. The Flags field in the TCP header represents the flag type of the packet, such as SYN, FIN, ACK, and RST. The field starts from the 107th bit of the TCP header. By observing the field, the detector can update $S(n)$ or $Y(n)$.

3.3.4 Stateful and Stateless Detection Mechanisms

The detector monitors inbound and outbound traffic separately, and there are two ways to obtain $Y(n)$, i.e., the stateful and the stateless mechanisms.

In a stateful mechanism, the detector counts TCP SYN packets for $S(n)$ at every observation interval and records the destination IP address and port, the source IP address and port, the event time and the sequence number of every TCP SYN packet during the observation interval. Other TCP packets are checked to update $Y(n)$. Furthermore, the detector checks both inbound and outbound traffic for the RST packet. If such a packet is the first RST packet from the same source address and port number after a TCP SYN is sent, it is counted in $Y(n)$ and the corresponding record is removed. After the time expires, the detector also removes records that have not been removed by any ACK.

In a stateless mechanism, the detector does not record any data, but updates $S(n)$ and $Y(n)$ with observed SYN and FIN/active RST packets, respectively. In this case, $Y(n)$ depends only on FIN/RST but not the first ACK packet from the active peer.
It counts all SYN packets arriving in the $n$th observation interval in $S(n)$. The observation interval of $Y(n)$ has to be extended by $t_d$ after the observation interval of $S(n)$, since FIN or RST comes at the end of a normal TCP session. It is desirable to set $t_d$ to the average duration of a normal TCP session. A long-lived connection will not be matched in the SYN-FIN/RST pair counting and is treated as noise in this mechanism.

The above two mechanisms have their own advantages and disadvantages. The stateless mechanism is simpler and demands a lower computational cost in computing $Y(n)$, at the cost of poorer detection performance. The poor performance is attributed to the following reasons. First, its required detection interval is larger than that of the stateful mechanism, since the stateless mechanism counts the FIN/RST packet in $Y(n)$ while the stateful mechanism counts the first ACK in $Y(n)$. Second, there exists background noise due to the RST packet from the passive peer and long-lived connections. Third, it cannot detect the attack that sends SYN and FIN to two different target addresses in the same subnet at the same time. Furthermore, the stateful mechanism can provide more information about the attacker, such as the IP source and destination addresses and port numbers. Thus, the stateful mechanism does not need the traceback algorithm [54] to find the real IP source address of attack packets, if an egress filter is used with the detector.

For the stateful detection mechanism, there exists background noise caused by the ACK that arrives later than the original observation time. Background noise degrades the detection performance. The stateful mechanism can reduce the noise by increasing the observation interval. Early detection is important in the sense that the loss of a normal user's request can be reduced as soon as possible under the attack.
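The stateless counting described above can be illustrated with a minimal Python sketch. It is only one reading of the description, not the implementation evaluated in this work: the function name, the `(timestamp, flag)` input format and the default values of `t0` and `td` are assumptions, every FIN or RST is counted without distinguishing active from passive RST packets, and a FIN/RST is credited to every interval whose extended window $[nT_0, (n+1)T_0 + t_d)$ contains it.

```python
from collections import Counter


def stateless_counts(packets, t0=10.0, td=10.0):
    """Return (S, Y) as dicts keyed by the observation interval index n.

    packets: iterable of (timestamp_in_seconds, flag), flag being a string
    such as "SYN", "FIN", "RST" or "ACK" (hypothetical input format).
    """
    syn = Counter()
    fin_rst = Counter()
    for ts, flag in packets:
        if flag == "SYN":
            # S(n): SYN packets arriving in [n*t0, (n+1)*t0)
            syn[int(ts // t0)] += 1
        elif flag in ("FIN", "RST"):
            # Y(n): FIN/RST packets arriving in [n*t0, (n+1)*t0 + td)
            first = max(0, int((ts - td) // t0))
            last = int(ts // t0)
            for n in range(first, last + 1):
                if n * t0 <= ts < (n + 1) * t0 + td:
                    fin_rst[n] += 1
    return dict(syn), dict(fin_rst)


if __name__ == "__main__":
    trace = [(1, "SYN"), (2, "SYN"), (3, "ACK"), (15, "FIN"), (25, "RST")]
    print(stateless_counts(trace))
```

No per-connection record is kept, which is why this variant is cheap but cannot tell which SYN a given FIN or RST belongs to.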
Wang et al. [63] adopted the stateless mechanism for their CUSUM detector. They set $t_d$ to 10 seconds based on the observation that most TCP sessions last 12-19 seconds. In other words, $Y(n)$ continues to be counted until 10 seconds after the observation interval for $S(n)$, which is also set to 10 seconds. Thus, the total observation time for $Y(n)$ is 20 seconds. In their results, the detector needs 150 seconds to detect the attack with an attack rate of 40 SYNs/sec. However, if an attack at that rate finishes within 150 seconds, the detector cannot detect it. In a real-world environment, the DDoS attack from a daemon finishes before 150 seconds and restarts after a while to conceal its source, as discussed in Sec. 3.2. A detector with such a long detection time therefore cannot prevent attack packets from leaving the leaf router and may not help users on slower links in connecting to the victim server.

In a stateful detection mechanism, every TCP packet from a client has to be checked for the first ACK, which increases the overhead significantly. To reduce the overhead, we propose to use a hash function. A hash value is computed for the incoming TCP packet using the destination address and port number, and it is used as an index into a hash table. Suppose that a hash table has $K$ indices ($K = 40$ in our implementation), where each index has a linked list of data elements. When a TCP SYN packet arrives in an observation interval, $S(n)$ is incremented by one and the relevant data of this packet (e.g., the source address and port, the destination address and port and the sequence number in the header of the packet) are stored as an element that is linked to the data list of its hash value. If another TCP packet arrives, its calculated hash value is used to search for the same information in the linked list.
If an element with the same information is found, we claim that the packet is the first packet after TCP SYN, so that the $Y(n)$ value is incremented by one and the element in the linked list is erased at the same time. If no such element is found, the packet is ignored. Before accessing a linked list, we check and remove expired elements at the head of the list. The flowchart of the above procedure is shown in Fig. 3.7.

[Figure 3.7: The flowchart for a stateful mechanism.]

With the above implementation, it is worthwhile to know the number of search jobs needed per incoming TCP packet for the traffic samples described in Sec. 3.2. With a hash size of $K = 40$, it was observed to be 4.6 search jobs on average. Thus, as compared with the stateless mechanism, the stateful mechanism demands one hash function evaluation and 4.6 search jobs as the extra computational overhead, which is the price to pay for fast and robust detection.

The maximum memory required for a stateful detector can be estimated as follows. One element in the linked list needs 24 bytes in total: 12 bytes for the IP source and destination addresses including ports, 8 bytes for time information, and 4 bytes for the sequence number. Furthermore, we should consider the maximum inter-arrival time within two observation intervals without any ACK. We assign a memory of 600 Kbytes to our stateful mechanism. This memory size can support 25000 users.
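The hash-table bookkeeping and the 24-byte record budget described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation measured in this work: the class name `StatefulCounter`, the `on_packet` signature, the use of Python's built-in `hash`, and the 10-second expiry (two observation intervals of 5 seconds) are all hypothetical choices, and 600 Kbytes is taken as 600,000 bytes.

```python
from collections import deque

RECORD_BYTES = 12 + 8 + 4              # addresses and ports + time + sequence number
MAX_RECORDS = 600_000 // RECORD_BYTES  # records that fit in the assigned memory


class StatefulCounter:
    """Hypothetical helper that tracks pending SYNs in K hash buckets."""

    def __init__(self, num_buckets=40, timeout=10.0):
        self.buckets = [deque() for _ in range(num_buckets)]
        self.timeout = timeout  # assumed expiry: two observation intervals
        self.s = 0              # S(n): SYN packets seen
        self.y = 0              # Y(n): first packets matched to a recorded SYN

    def on_packet(self, ts, src, sport, dst, dport, is_syn, seq=0):
        # The bucket index is derived from the destination address and port.
        bucket = self.buckets[hash((dst, dport)) % len(self.buckets)]
        # Remove expired records at the head of the list before searching it.
        while bucket and ts - bucket[0][0] > self.timeout:
            bucket.popleft()
        key = (src, sport, dst, dport)
        if is_syn:
            self.s += 1
            bucket.append((ts, key, seq))
            return
        for i, (_, recorded_key, _) in enumerate(bucket):
            if recorded_key == key:
                self.y += 1     # first packet following the recorded SYN
                del bucket[i]
                return
        # No matching record: the packet is ignored.


if __name__ == "__main__":
    counter = StatefulCounter()
    counter.on_packet(0.0, "10.0.0.1", 1234, "192.0.2.1", 80, True, seq=1)
    counter.on_packet(0.1, "10.0.0.1", 1234, "192.0.2.1", 80, False)
    print(counter.s, counter.y, RECORD_BYTES, MAX_RECORDS)
```

In a real detector the counters would be read and reset at the end of each observation interval; that step is omitted here for brevity.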
Considering the minimum TCP SYN inter-arrival time of 0.0004 seconds in our data set, up to 2500 users can be generated per second. One user can hold its assigned memory during one observation period, which is set to 5 seconds. The memory can also be used for keeping the attack TCP SYN requests arriving at the same rate as the minimum inter-arrival time. Thus, a memory of 600 Kbytes will be enough.

If an attack is detected in the outbound traffic, the detector can provide the information to the hosts with the source and/or the target IP addresses based on the information stored in the linked list. This is another advantage of the stateful detection. The CUSUM detector in [63] used a stateless mechanism to count TCP SYN and FIN or RST. The stateful detection mechanism can be integrated with the CUSUM detector. Our HMM detector to be discussed in Sec. 3.3 can adopt either the stateful or the stateless mechanism, too. Please note that the time window of 150 seconds required for the stateless CUSUM detector to detect the attack at a rate of 40 SYNs/sec is primarily due to the stateless detection mechanism. Both detectors with the stateful detection mechanism will be compared in Sec. 4.4.

3.3.5 Stationarity of Normalized Residue Sequences

The stationarity of the normalized residue sequence $Z(n)$ can be controlled by the length of the observation interval. As the length becomes larger, $Z(n)$ becomes close to zero with small variation. However, the length of the observation interval is related to the response time as well. The detector decides the occurrence of the attack at the end of each observation interval. Thus, we have to find a good length to balance detection speed and accuracy.

Let us consider the normal traffic samples in the collected data set as described in Sec. 3.2. There are three files in the data set.
Each of them was collected at the busiest hours of a different date. We removed the detected attack events from these files. The statistical information of normal TCP SYN requests in each file is shown in Table 3.3. Furthermore, we check the stationarity of $Z(n)$ by varying the length of the observation interval, i.e., $T_0 = 2, 3, 4, 5$ seconds. We see from Table 3.4 that $Z(n)$ appears to be wide-sense stationary when $T_0$ is equal to 5 seconds.

Table 3.3: The statistic information for normal TCP SYN request traffics in the unit of seconds.

                          File A    File B    File C
Duration                  39F3      1015.8    918.9
Mean inter-arrival time   (only two values legible: 0.0065, 0.0070)
Min inter-arrival time    0.0004    0.0004    0.0004
Max inter-arrival time    0.1514    0.0813    0.388

Table 3.4: The first and second order statistics of $Z(n)$ for $T_0 = 2, 3, 4, 5$.

T_0 (sec)           File A    File B    File C
2          mean     0.12      0.09      0.12
           std      0.06      0.02      0.03
3          mean     0.07      0.06      0.09
           std      0.02      0.02      0.26
4          mean     0.04      0.04      0.06
           std      0.01      0.01      0.01
5          mean     0.03      0.03      0.03
           std      0.01      0.01      0.01

Fig. 3.8 shows the dynamic values of $Z(n)$ under the normal and the attack traffic as a function of the observation interval index, where the attack rate is 30 SYNs/sec and $T_0 = 5$ seconds. Under the normal traffic, there is a small fluctuation in $Z(n)$, as shown in Fig. 3.8. For the attack case, we add attack traffic of 30 SYNs/sec to the normal traffic; $Z(n)$ then clearly has a larger mean and variance than in the normal case.

In general, it is difficult to predict the attack rate in an attack over an observation interval. On one hand, it is difficult to build an HMM according to the attack rate. On the other hand, $Z(n)$ increases as the attack rate increases. Thus, we can still build good HMM detectors to distinguish normal and attack traffic.
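The normalized residue of Eqs. (3.3)-(3.5) and the first- and second-order statistics of the kind reported in Table 3.4 can be illustrated with a minimal Python sketch. The smoothing factor `alpha = 0.9`, the initialization of the moving average with the first value of $Y(n)$, and the guard against a zero average are assumptions made for illustration; none of them is specified in the text, and the input counts below are made up.

```python
from statistics import mean, pstdev


def normalized_residue(s, y, alpha=0.9):
    """Return Z(n) = (S(n) - Y(n)) / Ybar(n) for per-interval counts s and y."""
    z = []
    y_bar = None
    for s_n, y_n in zip(s, y):
        # Eq. (3.4): moving average of Y(n); first value used as the start (assumed)
        y_bar = y_n if y_bar is None else alpha * y_bar + (1 - alpha) * y_n
        x_n = s_n - y_n                                # Eq. (3.3): residue
        z.append(x_n / y_bar if y_bar > 0 else 0.0)    # Eq. (3.5), guarded
    return z


if __name__ == "__main__":
    s_counts = [103, 105, 98, 104]    # hypothetical SYN counts per interval
    y_counts = [100, 100, 100, 100]   # hypothetical matched counts per interval
    z = normalized_residue(s_counts, y_counts)
    print(z, mean(z), pstdev(z))
```

Repeating the computation with counts aggregated over different interval lengths gives one mean and standard deviation per choice of $T_0$, which is the comparison summarized in Table 3.4.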
3.3.6 Proposed HMM-based Detector

An HMM is characterized by a triple, $\lambda = (A, B, \pi)$, where $A$ is the state-transition probability matrix, $B$ is the observation symbol probability distribution, and $\pi$ is the vector of the initial state probabilities [52]. Our proposed HMM uses the discrete space of $Z(n)$ as the observation. The Baum-Welch (BW) re-estimation algorithm can be used to optimize the HMM given some observation sequences via a training process. We propose three HMMs. One is used to model the behavior of legitimate users, while the other two are used to model low-rate and high-rate attack traffic.

[Figure 3.8: The behaviors of $Z(n)$ under the normal and the attack traffics. Curves: $Z(n)$ under the normal case and under the attack case with an attack rate of 30 SYNs/s; x-axis: observation time (unit = 5 sec).]

Observations

The observation in our HMM is the change of $Z(n)$ over time, where the time unit is the observation interval $T_0$. To simplify the detection process, the detector quantizes the $Z(n)$ values into three levels (i.e., the low, medium and high levels) with two thresholds. The first threshold $Th_1$ is set to the average of $Z(n)$ under the normal traffic, which is equal to 0.03, while the second threshold $Th_2$ is obtained by averaging $Z(n)$ under the attack traffic. $Th_2$ takes a much higher value, so that the probabilities for the normal traffic and the low-rate attack traffic to exceed it are very low. The quantized levels of $Z(n)$ in two consecutive observation intervals, i.e., the previous and the current observation intervals, are shown in Table 3.5, where L, M and H mean the “low”, “medium” and “high” activity
levels, respectively. It is possible to refine the activity of $Z(n)$ into more levels, which will increase the complexity of the training and the detection.

Table 3.5: Nine observation cases.

Observed symbols ($v_i$)            v1  v2  v3  v4  v5  v6  v7  v8  v9
Activity in the previous interval   L   L   L   M   M   M   H   H   H
Activity in the current interval    L   M   H   L   M   H   L   M   H

States

The state represents the change of $Z(n)$ between the previous and the current interval. Here, we consider the following three states.

• State 0: activity-decreasing state.
• State 1: activity-unchanged state.
• State 2: activity-increasing state.

As indicated by the state name, the activities decrease if the state is equal to 0. Thus, if $Z(n-1)$ is in the medium-activity level, the observed $Z(n)$ should be in the medium- or the low-activity level but is unlikely to be in the high-activity level. Thus, the probability of event $v_6$ is zero. Following the same arguments, the non-zero observation probabilities for each state are shown in Fig. 3.9. Thus, we demand

$$P(v_2) = P(v_3) = P(v_6) = 0 \quad \text{for State 0},$$
$$P(v_2) = P(v_3) = P(v_4) = P(v_6) = P(v_7) = P(v_8) = 0 \quad \text{for State 1},$$
$$P(v_4) = P(v_7) = P(v_8) = 0 \quad \text{for State 2}$$

in the observation probabilities for a given state.

[Figure 3.9: The observation events with non-zero probabilities in the proposed HMM under states 0, 1 and 2. Each panel (State 0, State 1, State 2 and its observations) is a 3-by-3 grid indexed by the activity level at interval $n-1$ and at interval $n$ (L: low, M: medium, H: high); $O_t$ denotes an observation at time $t$.]

3.3.7 Training Process

Performance of the HMM detector depends on the training process. Three models, $HMM(\lambda_n)$, $HMM(\lambda_l)$ and $HMM(\lambda_h)$, are obtained to model the normal, low-attack-rate and high-attack-rate traffic, respectively.
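Before any such model can be trained, the $Z(n)$ sequence has to be turned into the observation symbols of Table 3.5, and the zero-probability constraints listed above have to be imposed on the observation matrix. A minimal Python sketch of this preparation is given below. The threshold value `th2 = 0.3`, the handling of values exactly at a threshold, and the uniform initialization over the allowed symbols are assumptions for illustration only; the text gives $Th_1 = 0.03$ but no numerical value for $Th_2$. The sketch relies on the standard property that Baum-Welch re-estimation keeps an observation probability at zero once it is initialized to zero.

```python
def quantize(z, th1=0.03, th2=0.3):
    """Map Z(n) to an activity level: 0 = low, 1 = medium, 2 = high."""
    if z < th1:
        return 0
    if z < th2:
        return 1
    return 2


def encode(z_values, th1=0.03, th2=0.3):
    """Return the symbol indices 1..9 (v1..v9 of Table 3.5) for consecutive pairs."""
    levels = [quantize(z, th1, th2) for z in z_values]
    return [3 * prev + cur + 1 for prev, cur in zip(levels, levels[1:])]


# Symbols with non-zero probability per state, i.e. the complement of the
# zero constraints stated for States 0, 1 and 2.
ALLOWED = {
    0: {1, 4, 5, 7, 8, 9},   # activity-decreasing state
    1: {1, 5, 9},            # activity-unchanged state
    2: {1, 2, 3, 5, 6, 9},   # activity-increasing state
}


def initial_emission():
    """Build a 3x9 starting matrix B: uniform over allowed symbols, zero elsewhere."""
    rows = []
    for state in (0, 1, 2):
        p = 1.0 / len(ALLOWED[state])
        rows.append([p if v in ALLOWED[state] else 0.0 for v in range(1, 10)])
    return rows


if __name__ == "__main__":
    print(encode([0.01, 0.05, 0.5, 0.5]))
    print(initial_emission())
```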
The specific values of model parameters in $\lambda_n$, such as the state transition probabilities, the initial state probabilities and the observation probabilities for each state, are obtained from the data files described in Sec. 3.3.5 by applying the Baum-Welch (BW) re-estimation algorithm [52]. First, we assign the initial parameter values arbitrarily. The optimized parameter values are then obtained by the re-estimation procedure of the algorithm. The low-rate attack model parameters in $\lambda_l$ are generated by adding attack packets with a rate of 20 SYNs/sec and a duration of 20 sec to the normal traffic at random points. The attack rate and duration are selected so as to detect the real attack sources found in Sec. 3.2.2. The high-rate attack model parameters in $\lambda_h$ are generated by adding attack packets with a rate of 60 SYNs/sec and a duration of 20 sec into the normal traffic.

The number of observation sequences required for constructing an HMM can be estimated by the following distance equation [30],

$$D(\lambda_a, \lambda_b) = \log \Pr(O_N | \lambda_a) - \log \Pr(O_N | \lambda_b), \quad (3.6)$$

where $O_N$ denotes an observation sequence of $N$ observations. In practice, we can train several HMMs with different numbers of observation sequences. For example, one HMM is trained with the full set of normal traffic while others are trained with fewer observations. If the distance between the HMM trained with the full set of observations and the HMM trained with fewer observations starts to deviate, we can claim that the smaller number is the minimum number of samples required to train a stable HMM.

Figure 3.10: The average distances between a test HMM and three reference HMMs (trained with 300, 250 and 200 observation sequences), where the x-axis denotes the number of observation sequences used in the training of the test HMM.

In Fig. 3.10, the x-axis indicates the number of observation sequences used for training and the y-axis shows the average distance between two HMMs. Three reference HMMs are trained with a fixed number of observation sequences as shown in the legend of the figure. Then, another HMM is trained according to the number specified on the x-axis. The average distances between the test and the three reference HMMs are shown in the figure. From Fig. 3.10, we see that the three distances become reasonably small when the number of observation sequences is 100 or above. In the simulation, we used 100 observation sequences to train the normal HMM and 200 observation sequences to train the two attack HMMs.

3.3.8 Decision Process

The likelihood function of an observed sequence with respect to the three HMMs of model parameters $\lambda_n$, $\lambda_l$ and $\lambda_h$ is used to evaluate the underlying network state. The likelihood function is represented as $-\log(P(O|\lambda))$, where $O$ denotes the observation sequence $\{O_1, O_2, \ldots, O_N\}$. We compute the likelihood using the forward-backward procedure [52]. Finally, the decision at the interval between $nT_o$ and $(n+1)T_o$ can be made according to the following rule:

$$\text{Decision} = \begin{cases} \text{Attack}, & P_{attack} \le P_{normal} \\ \text{Normal}, & P_{normal} < P_{attack} \end{cases} \quad (3.7)$$

where $P_{normal} = -\log[P(O|\lambda_n)]$ and $P_{attack} = \min(-\log[P(O|\lambda_l)], -\log[P(O|\lambda_h)])$.

The length of the observation sequence, i.e., $(n+1)T_o$, is proportional to the detection time. For a short length, the accuracy of the detector is lower. The optimum depth is defined to be the detection depth that gives the earliest detection under some performance requirement on the detection rate.
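The decision rule of Eq. (3.7) can be sketched as follows, assuming each model is stored as a tuple (A, B, pi) of plain Python lists and that observation symbols are coded 0 to 8 rather than $v_1$ to $v_9$. Only the forward pass is needed to obtain the likelihood. The per-step scaling is a standard way to avoid numerical underflow and is not taken from the text, and reporting a tie as an attack is likewise an assumption.

```python
import math


def neg_log_likelihood(obs, A, B, pi):
    """Return -log P(O | lambda) using the scaled forward recursion.

    obs: non-empty list of symbol indices; A[i][j]: transition probability;
    B[i][k]: probability of symbol k in state i; pi[i]: initial probability.
    """
    n_states = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    nll = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return math.inf  # the sequence is impossible under this model
        nll -= math.log(scale)
        alpha = [a / scale for a in alpha]
    return nll


def decide(obs, lam_normal, lam_low, lam_high):
    """Apply the rule of Eq. (3.7) to one observation sequence."""
    p_normal = neg_log_likelihood(obs, *lam_normal)
    p_attack = min(
        neg_log_likelihood(obs, *lam_low),
        neg_log_likelihood(obs, *lam_high),
    )
    return "Attack" if p_attack <= p_normal else "Normal"
```

In use, the three tuples would hold the parameters obtained by Baum-Welch training in Sec. 3.3.7; the sketch does not perform that training.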
To find the optimum depth, we plot the curve of the misclassification rate (MCR) as a function of the detection depth in Fig. 3.11, where the MCR is the sum of the false alarm rate (FAR) and the miss alarm rate (MAR). We see from Fig. 3.11 that the HMM detector has good performance when the depth is greater than or equal to 8 observation intervals, or 40 seconds (8 x 5). In the simulation section, we set the optimum depth to be 8 observation intervals.

Figure 3.11: The plot of MCR as a function of the detection depth (unit = 5 sec) for attack rates of 30, 50, 70 and 90 SYNs/sec.

To achieve early detection against the high-rate attack, a sliding window approach is used. That is, one observation interval may be enough to detect a high-rate attack since its traffic pattern is very different from the normal traffic in the $Z(n)$ domain. Thus, the HMM detector keeps the previous 7 observation intervals and decides the occurrence of the high-rate attack by adding the current observation interval. After the decision, we shift the window by one interval so that a decision can be made in the next observation interval.

3.4 Simulation Results

3.4.1 Simulation Setup

We used three files in Table 3.3 as normal traffic. Several attack traffic traces were generated and added into the files for performance evaluation. The resultant files were used as the input to DoS detectors, including the proposed HMM and the CUSUM detectors. The detection time, the detection probability and the complexity are used as performance metrics. The detection time is the minimum time required to detect an attack after the attack starts.
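The sliding-window operation described in Sec. 3.3.8 can be sketched as follows. The helper name and the pluggable `decide` callback are hypothetical, and issuing a decision before the window has filled up to 8 intervals is an assumption of this example rather than something stated in the text.

```python
from collections import deque

WINDOW = 8  # optimum depth in observation intervals (Sec. 3.3.8)


def sliding_decisions(symbols, decide, window=WINDOW):
    """Yield (interval_index, decision) once per observation interval.

    The window holds the current interval plus up to window - 1 previous
    ones; appending to a full deque drops the oldest interval.
    """
    buf = deque(maxlen=window)
    for n, sym in enumerate(symbols):
        buf.append(sym)
        yield n, decide(list(buf))
```

Here `decide` stands for any function that maps a list of observation symbols to a decision, such as a likelihood comparison between the trained HMMs.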
For any detector, there is a tradeoff between the detection rate and the false alarm rate. To compare the performance of different detection schemes, we need to specify the false alarm rate, which can be controlled by adjusting tunable parameters in each detection algorithm with respect to the normal data set. We tuned both the HMM and the CUSUM detectors so that they would not issue any false alarms with respect to normal traffic files. Furthermore, the stateful detection mechanism can reduce the background noise for normal traffic. Thus, we can tune parameters in both detectors so that they can detect the attack accurately without false alarms. The tunable parameters in the CUSUM detector [63] are:
• a: the upper bound in case of normal operation,
• h: the lower bound of the increase in case of an attack,
• N: the threshold.
We set them to 0.06, 0.12 and 0.1, respectively, in the simulation. For the HMM detector, two thresholds $Th_1 = 0.03$ and $Th_2 = 0.33$ are used to quantize the $Z(n)$ value.

3.4.2 Detection Time

We generated attack packets at a fixed rate ranging from 30 to 100 SYNs per second, starting at a random point in normal files. Once the attack starts, attack packets are generated continuously until the end of the normal files to determine the average detection time. The same test condition was repeated to compute the average performance.

Figure 3.12: The detection time as a function of the attack rate (SYNs/sec) for the HMM detector, the CUSUM detector and the stateless CUSUM detector.

Fig. 3.12 shows the average detection time of three DoS detectors as a function of attack rates. The proposed stateful HMM detector is compared with the stateless CUSUM detector proposed in [63] and the stateful CUSUM detector.
From the figure, the stateless CUSUM detector is the slowest for all attacks. For the stateful detection, the proposed HMM detector has a shorter detection time than the CUSUM detector if the attack rate is lower than 80 SYNs per second. However, as the attack rate increases, both the CUSUM and the HMM detectors can respond to the attack within one observation interval. The reason why the CUSUM detector is slower for lower-rate attacks is that it takes a longer time to find the anomaly in the cumulative statistics from a low-rate attack. The detection time of the HMM detector is constant for attack rates above 40 SYNs per second.

Figure 3.13: The detection probability as a function of the attack rate (SYNs/sec) for the HMM detector and the CUSUM detector.

3.4.3 Detection Rate

Fig. 3.13 shows the detection probability as a function of the attack rate. Attack packets were generated in the same way as described in Sec. 3.4.2. Each event was tested 10 times at random points in normal files. We see clearly from Fig. 3.13 that the CUSUM detector is not sensitive to low-rate attacks. One reason for the poor performance of the CUSUM detector is that it has only one threshold. Another reason is that the CUSUM detector is not sensitive to the change of the attack behavior. The HMM detector models the dynamics of attack packets between two observation intervals. In contrast, the attack dynamics have less impact on the cumulative statistics in the CUSUM detector.

3.4.4 Complexity

Both the HMM detector and the stateful CUSUM detector have a higher complexity in checking the states of a packet as described in Sec. 3.3.4.
After that, the HMM detector has a complexity of $3 \times J^2 \times N$ per observation interval, where $J$ is the number of states and $N$ is the detection depth. Here, $J = 3$ and $N \le 8$. In contrast, the CUSUM detector needs a constant number of additions, comparisons and substitutions in the same observation interval. The HMM detector needs only slightly more computations than the CUSUM detector.

3.5 Conclusion

To effectively design the detector, we first analyzed the dynamic behaviors of attacks observed from a real world environment. Then, an HMM detector against the distributed TCP SYN flooding attack was presented. Finally, we compared its performance with that of the stateless and stateful CUSUM detectors using trace-driven simulations. Simulation results showed that the proposed HMM detector provides better performance in terms of detection rate and detection depth.

Chapter 4
Detection of the Bandwidth Depletion Attack

4.1 Introduction

In this chapter, we discuss two schemes to detect the DoS attack that depletes bandwidth in wireless networks. The DoS attack using IP spoofed packets can occur in IP-based wired and wireless networks, such as wireless LAN (WLAN) and 3G wireless networks. The target of the attack in a wired network is usually a server since the effect of the attack can be maximized. In contrast, mobile nodes in wireless networks are clients rather than servers. If the target of the attack is a node in a wireless network, the effect of the attack will be limited to the range of the access point or the base station that covers the target. The resources of a wireless network, such as the bandwidth in WLAN and the power in a 3G wireless network, are more limited so that it is possible to deny services in the targeted area with fewer attack packets.
Thus, the attack can have a severe impact on a group of users sharing common resources in wireless networks. The main issue in this chapter is the detection of the attack in a wireless network.

There are several types of IP-based DoS attacks, namely, those using TCP SYN, ICMP echo and UDP flood packets. The inbound packet traffic to a user in wireless networks usually comes from a server. Only the UDP flood attack and the TCP ACK attack are possible in a wireless network, since packets of the other types are destined for a server, and ICMP echo packets need not be used within a single-hop wireless network. In this chapter, we consider the UDP flood attack in a wireless network.

To detect the subtle UDP flooding attack, a legitimate traffic model for each application should be used. Otherwise, a detector that merely deploys a filter can detect only attacks that use subnet spoofing or send attack data at a very high rate, because large UDP packets are unusual in a subnet. However, there are many UDP-based applications with various dynamics, ranging from the low-frequency request/response of the Network Time Protocol, to the steady rate of RealAudio, to high-rate satellite data streaming. In this chapter, we focus on the steady-rate model since this type of application requires relatively long-lived sessions, which can more easily become a target of the attack.

4.2 Deployment

Taking into account the load that the detection task places on the network, it is desirable that the detectors be located, in a distributed manner, on the routers that are one hop away from the destination. For instance, the Packet Data Serving Node (PDSN) in 3G systems is one hop away from mobile users. The PDSN receives IP datagrams from the PDN (Packet Data Network), which is connected to the Internet, and establishes, maintains and terminates link layer sessions to the mobile node.
The PDSN operates in layer 3 and, as a leaf router, fragments packets for the Radio Access Network (RAN) to send frames in layer 2 [1]. To convert packets into frames, the PDSN has to manage each mobile user who is connected in the RAN. Therefore, incoming packets can be monitored in the PDSN for each mobile user and the proposed detector should be located in the PDSN. In a wireless LAN, an Access Point plays the same role as the PDSN. Thus, the detector should be deployed at the Access Point. By deploying the detector at the location one hop away from mobile users in wireless networks, we can protect wireless resources from the DoS attack. Furthermore, it is possible at that location to differentiate the retransmission caused by a poor channel in a wireless network from the transmission of data packets coming from the Internet.

4.3 Detector Using Multiple Markov Models

4.3.1 Markov Model

We can monitor the number of packets arriving in the queue allocated for each user in a 3G wireless network via the PDSN. The states of a Markov model for each user are defined according to the monitored number of arriving packets during a fixed time interval and two thresholds. Three states are defined here. State 0 represents that there are fewer packets than the lower threshold in the queue for a certain time unit. State 2 indicates that the number of arriving packets is more than the higher threshold. State 1 corresponds to a number of packets between the two thresholds, which means that we observe the expected number of CBR packets in the queue during the time unit.

We use the following two observations to distinguish the attack from the normal behavior. First, the number of arriving packets in the case of attack is much larger than the expected number under the normal behavior. Second, for the normal CBR flow, the number of arriving packets in a time unit should be constant if there is no network delay variation and no packet drop in bottleneck queues. However, there is network delay variation in practice so that the number of arriving packets in a time unit fluctuates.

Let us consider three scenarios. The first is the normal case of the CBR traffic without network delay. The second represents another normal case of the CBR traffic, with network delay variation or drops of some packets in a bottleneck queue. The last represents a case in which a traffic anomaly occurs. The corresponding trellis diagrams for the three cases are given in Fig. 4.1. The optimum parameters of the model, such as transition probabilities, can be obtained by a training process.

Figure 4.1: The trellis diagrams for the three scenarios: (a) normal, (b) normal with delay or packet drop, (c) attack case.

4.3.2 Detector Using Combined Markov Models

The detector presented in the previous section has limitations. First, the Markov model cannot represent the normal behavior and the attack behavior at the same time. Second, it takes a long time to detect the attack by comparing the transition probabilities of the normal model with those measured in the actual environment and, as a result, it is difficult to detect the attack behavior quickly. It is desirable to develop a detector that is able to represent the patterns of normal and attack states at the same time.

The states in the proposed model, which combines two Markov models, are defined as the event of the normal behavior or the attack behavior. They are denoted by $S = \{A\,(\text{attack}), N\,(\text{normal})\}$.
An observation in the combined model, which corresponds to a state in the Markov model, is denoted by $O = \{O_1, O_2, O_3\}$ and defined based on the monitored number of arriving packets during a fixed time interval. The observations are the same as the states of the Markov model described in the previous subsection. For example, $O_1$ means that there are fewer packets than the lower threshold in the fixed interval and $O_3$ means that there are more packets than the higher threshold in the interval. The proposed model is shown in Fig. 4.2.

Figure 4.2: The proposed detector using Markov models (observations and transitions between the Normal and Attack states over the detection intervals; $T_n$: detection interval).

Given a certain observation $O_i$, the conditional probabilities of the states in the proposed model are calculated using the two Markov models combined in the proposed model, i.e. a normal behavior Markov model and an attack behavior Markov model, as shown in Fig. 4.3. The normal model can be obtained from an environment without attack, while the attack model can be obtained when an attack node sends nuisance packets. Various attack models with different attack rates can be generated for the training purpose, where the attack rate is defined as the number of attack packets over the number of normal packets.

Figure 4.3: The integration of two Markov models (the normal Markov model and the attack Markov model) into one Markov model with two transitions (the proposed model).

Let $a_{s_{t-1}s_t}$ denote the transition probability of states in the attack Markov model from time $t-1$ to $t$, and $n_{s_{t-1}s_t}$ that in the normal Markov model. In the proposed model ($\lambda$), the probabilities that an attack and a normal event occur at time $t$ can be written as

$$P(S = A / O_i, \lambda) = \frac{a_{s_{t-1}s_t}}{a_{s_{t-1}s_t} + n_{s_{t-1}s_t}} \quad (4.1)$$

and

$$P(S = N / O_i, \lambda) = \frac{n_{s_{t-1}s_t}}{a_{s_{t-1}s_t} + n_{s_{t-1}s_t}}, \quad (4.2)$$

respectively.
The proposed detector works as follows. First, the detector estimates the expected packet rate of the CBR application per unit time for each user over an initial interval. We can assume that it takes a while before the attacker starts generating nuisance packets after monitoring the authentic packets. After the estimation, the detector starts to monitor the number of arriving packets during a certain detection interval $T$. The detector continues to monitor the number of arriving packets and calculates the probability of the normal event, $P(S = N, O, \lambda)$, and the probability of the attack event, $P(S = A, O, \lambda)$, as given by

$$P(S = N, O, \lambda) = \prod_{i=1}^{T} P(S = N / O_i, \lambda) \, P(O_i, \lambda) \quad (4.3)$$

and

$$P(S = A, O, \lambda) = \prod_{i=1}^{T} P(S = A / O_i, \lambda) \, P(O_i, \lambda), \quad (4.4)$$

respectively. After the detection interval $T$, the detector compares the two probabilities $P(S = N, O, \lambda)$ and $P(S = A, O, \lambda)$. The following maximum likelihood decision rule

$$\text{Decision} = \begin{cases} \text{Attack}, & \dfrac{P(S = A, O, \lambda)}{P(S = N, O, \lambda)} > 1 \\ \text{Normal}, & \dfrac{P(S = A, O, \lambda)}{P(S = N, O, \lambda)} < 1 \end{cases} \quad (4.5)$$

can be adopted for decision making. If those probabilities are equal, the probability of an attack is 50%.

The detection interval is an important parameter to select for good detection performance. If it is too short, the proposed detector may miss an attack or give a false alarm, thus degrading the performance. Generally speaking, the performance of the detector improves as the detection interval increases under a continued attack. However, a longer interval impairs the detector's ability to detect quickly. Furthermore, the reliability of the detector is governed by the limited number of training samples. Thus, when the detection interval increases beyond some extent, the detection reliability reaches a stable value and does not increase any further.
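The decision procedure of Eqs. (4.1) to (4.5) can be sketched in Python as follows. The sketch assumes that the conditional probabilities are the attack and normal transition probabilities normalized by their sum, and that the joint probabilities are products over the detection interval. Under that reading, the common factor $P(O_i, \lambda)$ cancels in the ratio of Eq. (4.5), so only the conditional terms are accumulated. The function names, the log-domain accumulation and the small floor `eps` that guards against zero probabilities are assumptions of this example.

```python
import math


def log_attack_ratio(states, attack_T, normal_T, eps=1e-9):
    """Return log[P(S=A, O, lambda) / P(S=N, O, lambda)].

    states: observed states (0, 1 or 2), one per time unit.
    attack_T, normal_T: first-order transition matrices of the attack
    and normal Markov models, indexed as T[previous][current].
    """
    log_ratio = 0.0
    for prev, cur in zip(states, states[1:]):
        a = max(attack_T[prev][cur], eps)  # floor is an assumption
        n = max(normal_T[prev][cur], eps)
        p_attack = a / (a + n)  # conditional probability as in Eq. (4.1)
        p_normal = n / (a + n)  # conditional probability as in Eq. (4.2)
        log_ratio += math.log(p_attack) - math.log(p_normal)
    return log_ratio


def decide(states, attack_T, normal_T):
    """Apply the ratio test of Eq. (4.5) in the log domain."""
    r = log_attack_ratio(states, attack_T, normal_T)
    if r > 0:
        return "Attack"
    if r < 0:
        return "Normal"
    return "Undecided"  # equal probabilities
```

Working with logarithms is one way to avoid the underflow that a long product of probabilities can cause; it is a choice made for this sketch.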
We define a parameter, called the optimum detection depth $T$, as the detection time needed for the detector to reach the steady-state performance.

4.3.3 Enhancement of Basic Proposed Detector

Weighting Factors

The distinguishing features between the normal behavior and the attack behavior are important for the detector to make a robust decision. From the trellis diagrams shown in Fig. 4.1, we see that the system has a different probability of being in the normal or the attack state, given its observation state 0, 1 or 2. For example, when it is in state 0, it has a higher probability of being in the normal state than in the attack state. On the other hand, when it is in state 2, it has a higher probability of being in the attack state than in the normal state. Based on these observations, we apply different weighting factors to the conditional probabilities $P(S = N / O_i, \lambda)$ and $P(S = A / O_i, \lambda)$, where $i = 0, 1, 2$, respectively. Consequently, the distinguishing features are amplified in the combined model in the form of conditional expectation. Furthermore, the weighting factors can be used to avoid underflow, which may occur in calculating conditional probabilities after monitoring for a long time.

Second-Order Markov Model

Besides weighting factors, we adopt the second-order Markov model in the proposed detector to make it more reliable. The Markov models of the observed states described above are of the first order. That means that the current state depends only on the previous state. However, in practice, more state information from the past is observed to be useful in attack detection. This is reasonable since the attack is characterized by a continuous stream of heavy-traffic packets. Thus, we use the second-order Markov model to enhance the detector performance. As the order of the Markov model increases, the detector becomes more reliable. However, there is a price to pay.
That is, the computational complexity increases exponentially with the order. Therefore, only the second-order Markov model is considered in the experiment. The second-order Markov model provides a good tradeoff between reliability and complexity.

Multiple Markov Models

The performance of the detector depends on the attack rate used in the model training. For example, consider two combined Markov models trained with attack rates of 5% and 10%, respectively. The latter is more reliable when the number of attack packets is around 10% of that of normal packets. However, if an attacker sends packets at a rate below 5%, the former detects the attack better. This observation suggests the simultaneous use of multiple combined Markov models, which are trained with various attack rates, to improve the reliability of attack detection.

Figure 4.4: The detection system using multiple combined Markov models. Observation sequences of data packets from the wireless network are fed to the combined Markov models for attacks 1 to n; each model calculates its normal and attack probabilities, and the maximum is selected for the decision.

Fig. 4.4 illustrates the proposed detection system that uses multiple combined Markov models. The system consists of a group of combined Markov model detectors, which have been trained with different attack rates, working in parallel. Each combined Markov model detector calculates the normal probability and the attack probability independently, and the one which has the highest ratio in Eqn. (4.5) is selected for final decision-making.

4.4 Simulation Results and Analysis

4.4.1 Simulation Environment

To compare the performance of the proposed detector with that presented in previous work [5], we conducted computer simulations under the conditions given in Table 4.1.
With the network simulator ns-2, the background traffic is generated with almost 20% UDP traffic and 80% TCP traffic. The target user is receiving the CBR service as shown in Fig. 4.5. Then, we emulated an attack node that generates nuisance packets and sends them to the target user. The attack rate, which is defined as the ratio of attack packets to normal packets, increases in two ways, i.e. abruptly and linearly. The parameters of the attack model were also obtained through the training process. The proposed detector is located at node 2.

Figure 4.5: Network topology used for simulation (attack node, TCP background source, UDP sink and CBR receiver as the victim; links are duplex links of 20 Mbps with drop-tail queues).

Traffic scenario
TCP background (Mbps)      8
UDP background (Mbps)      1.05
Victim receiver (Mbps)     1.0
Attack rate (%)            Linear increase (0.27 to 4.3)
Attack rate (%)            Abrupt increase (8.4 to 20.5)

Table 4.1: Parameters of the simulation environment.

4.4.2 Sequential-Batch Detector vs Multiple Markov Detector

The false alarm rates (FAR) of the sequential detection algorithm given in [5], the first-order and the second-order Markov model detectors are compared in Figs. 4.6 and 4.7 for linearly and abruptly increasing attack rates, respectively. It can be seen from both figures that the combined Markov model detector is more robust than the sequential detection approach. The second-order Markov model detector has the best performance with the lowest FAR.

Figure 4.6: Performance comparison of the sequential, the 1st-order and the 2nd-order Markov model detectors with attack traffic increasing linearly (x-axis: attack rate (%)).

Figure 4.7: Performance comparison of the sequential, the 1st-order and the 2nd-order Markov model detectors with attack traffic increasing abruptly (x-axis: attack rate (%)).

Comparing results in Figs. 4.6 and 4.7, we see that the advantage of the first-order Markov model detector over the sequential approach is more significant when the attack rate is increasing abruptly. The reason is that the difference between the normal and the attack pattern becomes more apparent when the attack rate increases abruptly. When the Markov model is trained to reflect this feature, it is able to achieve very good performance.

Figure 4.8: The optimum detection depth of the proposed detector with respect to different attack rates, for detectors trained with attack rates of 0.1%, 5%, 10% and 20% (x-axis: attack rate (%)).

Early detection is one of the major performance metrics for DoS detection in wireless networks. Here, we evaluate the detection time needed for the proposed detector in terms of the optimum detection depth. Fig. 4.8 shows the optimum detection depth, which is obtained at the minimum value of the Misclassification Rate (MCR), i.e. the sum of the false alarm rate and the missing alarm rate. The unit of the optimum detection depth is the expected arrival time of a packet. Note that the FAR of the proposed second-order Markov model detector is so small that it can be ignored. However, it does have a noticeable missing alarm rate.
Compared with the results of the sequential detector reported in [5], our detector achieves earlier detection than the sequential detector.

Figure 4.9: The MCR performance of the proposed detector using different attack training data (detectors trained with attack rates of 0.1%, 5%, 10% and 20%).

Fig. 4.9 shows that the detection performance of the proposed detector depends on the attack rate used in the training process. The combined Markov model trained with an attack rate of 5% gives the best results among the models in Fig. 4.9 when the incoming packet stream has an attack rate of around 5%. When the attack rate increases, e.g. to 10%, both combined models with the 5% and the 10% attack rates achieve the minimum MCR. However, we see from Fig. 4.8 that the combined model with the attack rate of 10% can detect the attack earlier than the model with the attack rate of 5%. Thus, for the overall detection performance in terms of both MCR and the optimum detection depth, the model with the 10% attack rate performs better since the difference between the attack rate used for combined model training and that of the incoming packet sequence is smaller.

Chapter 5
Internet Packet Classification for QoS Provision

A fast and robust scheme that classifies Internet packets according to their application types is examined in this chapter. The scheme is proposed to be deployed at the ISP (Internet Service Provider) due to considerations of scalability and reliability. The proposed classification scheme consists of two steps: feature selection and classification.
For feature selection, practical features are extracted using tools such as the multistage filter and NetFlow. By using the genetic algorithm (GA) and a variant of the wrapper method, we obtain two sets of features for comparison. As to classifiers, decision trees such as J48 and REPTree (Reduced Error Pruning Tree) and the bagging method using REPTree are considered. The decision trees are trained with selected features from real traffic. The trained decision trees are compared with classifiers using Bayesian approaches in terms of accuracy, complexity, memory space and robustness. It is demonstrated by simulation results that the decision tree with features selected by GA gives the best performance. Finally, early classification with a modified multistage filter is proposed to reduce collision errors for fast and robust performance.

5.1 Introduction

Classification of IP packets is essential to various Internet services, including quality of service (QoS), security, accounting, traffic engineering and provisioning of future resources. The importance of this problem for supporting QoS in practical applications was highlighted by Roughan et al. [53]. The classification has several major challenges. First, the accuracy has to be sufficiently high to meet the application requirements. The performance of the traditional port-based classification scheme is poor since ports are shared by multiple applications [43,53]. Second, real-time classification is desirable for real world applications. This implies that features used in classification should be simple and that the complexity and memory requirements of a practical classifier should be low as well, since a large number of packets will be processed by a router in an ISP per unit time.
Third, observed or extracted features can be contaminated by errors of measuring tools such as the multistage filter in [25] and NetFlow [15], which result in the collision error and the sampling error, respectively. Furthermore, we should take asymmetric routing into account, which is caused by multiple routers in an ISP. Multiple routers are often used for traffic load sharing. Since there is no guarantee that incoming and outgoing packets of one two-way session go through the same router, asymmetric routing forces classification to be performed using the information of one direction only.

Typically, a classification scheme consists of two steps: feature selection and classification. Possible features include fields in the header and the 1st and 2nd order statistics of selected fields or some parameters. Roughan et al. [53] used the nearest neighbor rule, linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) to classify traffic into a limited number of classes. The final decision was made using the Bayesian approach. However, the features used in their work vary from one class to the other. Furthermore, the Bayesian approach was applied under the assumption that the selected features follow the Gaussian distribution. This assumption does not hold in practice, which results in lower correct classification rates. Moore et al. [43] used a kernel function to estimate the feature distribution with supervised training samples. Usually, a large number of training samples is needed for good classification performance, which demands high computational complexity and large memory space. Furthermore, the kernel estimator may suffer from the overfitting problem.

In this work, features are obtained from supervised training data sets of real-world traffic packets.
For feature selection, we consider a slight modification of that used in the multistage filter [25]; namely, we add more memory to an entry so as to log more records of features. An entry in a filter is assigned to a flow for logging the information on the flow and it is re-initialized after the end of the flow. While the multistage filter adds one more field in an entry, the modified multistage filter adds, subtracts, multiplies, and divides some fields in an entry. Among several candidate features, a set of excellent features is selected according to the genetic algorithm (GA) [28,29] while another set is generated using a variant of the wrapper method. As to classifiers, we consider decision tree approaches such as J48 [51] and REPTree (Reduced Error Pruning Tree) [50]. They are trained with selected features from real-world Internet traffic. As a non-parametric method, the decision tree classifier is more robust against varying traffic statistics in different applications. The Bayesian approach and its variants, which were used extensively in previous work, are compared with the decision trees. They are tested with respect to various performance metrics, including robustness, accuracy, computational complexity and memory space. Finally, early classification is proposed to support real-time classification.

The rest of the chapter is organized as follows. The feature selection problem is examined in Sec. 5.2. Several classification algorithms are described and compared in Sec. 5.3. Some advanced issues are considered in Sec. 5.4. Simulation results are given and discussed in Sec. 5.5. Finally, concluding remarks are provided in Sec. 5.6.

5.2 Feature Selection

5.2.1 Feature Extraction

The classified object is an IP flow, which can be distinguished by its 5-tuple, i.e. the IP addresses and ports of the source and the destination and the protocol type.
The classes in this work are popularly used applications. The classification is performed based on a feature vector. The components of the feature vector can be decided according to the characteristics of the desired class types. The classes and their corresponding characteristics considered in this work are shown in Table 5.1.

To train a classifier, we need a supervised data set that maps flows representing all applications to their intended applications. However, it is difficult to obtain such a data set from real data sets in the public domain since the payload of packets is removed due to privacy concerns and the limited memory available for storage. Alternatively, we may collect the supervised data sets directly. However, mapping flows to their applications is a labor-intensive job. The classification of a supervised data set is often done based on the port number, which is however not reliable as argued in [43,53]. To overcome the trade-off between accuracy and robustness in supervised data sets, we selected some well-known applications whose ports are not likely to be shared by other applications.

Characteristics | Classes (Applications)
Interactive | WWW, Telnet, Chat (Messenger)
Bulk data | WWW, FTP, P2P (Kazaa, Gnutella)
Real time | Multimedia (QuickTime, Windows Media Player)
Email | Email (SMTP, POP, IMAP)
Transactions | Services (DNS, Oracle, X11)

Table 5.1: Characteristics of classes under classification.

The use of well-known ports is a necessary step in the collection of data related to classes. Based on the ports, we obtained supervised data sets from real raw data as performed in [53]. The applications using well-known server ports are shown with their characteristics in Table 5.1. Considering the fact that most traffic comes from web applications, WWW traffic using port 80 is included in the supervised data set.
In the extraction, two data sets within a flow using TCP are obtained according to the direction from the active peer to the passive peer (A2P) and the direction from the passive peer to the active peer (P2A). The direction can be decided by the TCP SYN flag. The separation is used to mimic the one-way traffic caused by asymmetric routing at the ISP. All features will be extracted from the one-way traffic in this work.

We demand that the extracted features be easily obtained or derived from measurement tools of the ISP such as the multistage filter or NetFlow. They are primarily packet-level information, including fields of the IP/TCP/UDP packet headers and the arrival time. Their statistics such as the average and the variance can be updated per observation sample. For example, we have

$$\bar{X}_{j+1} = \frac{j}{j+1}\,\bar{X}_j + \frac{1}{j+1}\,x_{j+1} \qquad (5.1)$$

and

$$\mathrm{var}(X_{j+1}) = \frac{1}{j}\left(x_{j+1} - \bar{X}_{j+1}\right)^2 + \frac{j}{j+1}\,\mathrm{var}(X_j) \qquad (5.2)$$

where $\bar{X}_j$ and $\mathrm{var}(X_j)$ are the average and the variance of feature X, respectively, at the j-th observation sample, and $x_{j+1}$ is the (j+1)-th observed value. These derived quantities can be computed easily without logging much past information, which also reduces the memory requirement [53].

Our selected features include the size and the number of "burst" packets. The burst characteristics can be determined using the arrival time information of packets. That is, consecutive packets are said to form a burst if their inter-arrival time is less than a predefined threshold. Some applications, such as WWW and P2P, have a heavy burst behavior, while streaming and interactive multimedia applications do not have such a behavior on average. Thus, this burst feature is expected to provide good discriminant power among these distinctive classes. Candidate features considered in our work are shown in Table 5.2. All of them can be obtained at relatively low cost.
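The per-sample updates and the burst rule described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the names `RunningStats` and `burst_stats` are hypothetical, the variance update is the standard one-pass recursion for the population variance (assumed here to be equivalent to the recursive form the text refers to), and the threshold is passed in by the caller.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Average and population variance of one feature, updated per sample."""
    count: int = 0
    mean: float = 0.0
    var: float = 0.0

    def update(self, x: float) -> None:
        j = self.count
        new_mean = (j * self.mean + x) / (j + 1)
        # One-pass variance update; only count, mean and var are stored.
        self.var = j * self.var / (j + 1) + j * (x - self.mean) ** 2 / (j + 1) ** 2
        self.mean = new_mean
        self.count = j + 1


def burst_stats(arrival_times, sizes, threshold):
    """Return (number, total size) of packets that belong to a burst.

    Assumption: two consecutive packets are both marked as burst packets
    when their inter-arrival time is below `threshold` (in seconds).
    """
    in_burst = [False] * len(arrival_times)
    for i in range(1, len(arrival_times)):
        if arrival_times[i] - arrival_times[i - 1] < threshold:
            in_burst[i - 1] = True
            in_burst[i] = True
    count = sum(in_burst)
    total_size = sum(s for s, b in zip(sizes, in_burst) if b)
    return count, total_size
```

Only three numbers per feature are kept in `RunningStats`, which reflects the memory argument made in the text; a real measurement tool would hold these fields in a filter entry rather than a Python object.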
It is worthwhile to mention that there are features that were discussed in the literature but are not listed in Table 5.2. For example, the FFT (Fast Fourier Transform) of the inter-arrival time can be a good feature as mentioned in [43]. However, it is expensive to compute this feature at the ISP due to the limited amount of memory. In contrast to the FFT, the candidate features can be extracted by adding more memory or registers to a multistage filter or NetFlow. The information logged in the tools is erased once the classification is done.

Method | Features
Directive information | duration of the flow, # of packets, initial AdvertisedWindow bytes, # of actual data packets, # of packets with the "PUSH" option
Recursive statistics (1st, 2nd order) | size of the packets, AdvertisedWindow bytes, inter-arrival time, # and size of the total burst packets
Using a variable updated every sample | # and size of the total burst packets, inter-arrival time

Table 5.2: The list of candidate features.

Some applications can be distinguished according to their protocol behaviors. Most applications operate on TCP only, while some applications use both TCP and UDP at the same time. For example, P2P, multimedia, game, DNS, and some chat applications use both TCP and UDP at the same time. P2P and DNS use TCP for the delivery of data and UDP for control packets, while multimedia and game applications use UDP for data and TCP for control packets. Two multistage filters are used to check the usage of TCP and UDP within an application. The input to one multistage filter for tracking UDP is only the IP addresses while the input to the other filter for tracking TCP includes the ports. Without collisions, we can decide those applications by checking the tables in the filters.

P2P applications can be distinguished by using ports. Karagiannis et al.
found, heuristically, the excluded ports of those applications using TCP/UDP IP pairs in [34]. Furthermore, they observed that an application that uses both TCP and UDP would be P2P if the application used a port which is not specified. We adopt their observation to decide the P2P application. Other methods in [34], such as the ratio of used ports to different source IP addresses and the specific strings in the application layer, are not considered since they are unrealistic at the ISP.

Multimedia applications use UDP for the delivery of data. The characteristics of the UDP traffic are useful for deciding the application. Most multimedia sessions have the characteristics that most UDP packet sizes are relatively small and the inter-arrival time of the packets is 1 to 2 seconds. Furthermore, the variance of the inter-arrival time is smaller. The UDP packets in other applications show some bursty behaviors or irregular packet sizes. Thus, the UDP characteristics can be used for the classification of the multimedia application. The features of UDP traffic can be obtained with the same methods mentioned in Table 5.2 except for the features related to TCP flags, such as the "PUSH" option.

5.2.2 Feature Reduction

The feature selection process should find a good balance between complexity (in terms of computation and storage) and performance (in terms of the correct classification rate). The more features are included, the more computation and memory are needed. The complexity issue becomes even more critical when we desire a solution that scales with the growth of the network. Furthermore, a large number of features may confuse some classification tools. Selected features should be relevant to the classes and not correlated much with each other.

There are two commonly used methods in feature reduction: the filter method and the wrapper method.
The filter method assigns scores to features according to evaluation strategies and selects the high-scored features, while the wrapper method uses classification results as feedback for feature adjustment. Each method has its own advantages and disadvantages. The filter method suffers from model bias; namely, different features may be suitable for different classifiers. The wrapper method suffers from the complexity of searching for good features tailored to the chosen classifier.

To compare several classifiers, we have to fix the set of features in use. These features can be selected by a filter method. As mentioned above, the filter method may be biased toward a classifier. Among various filter methods, we selected the one that generates the highest performance under the Naive Bayesian Kernel Estimator (NBKE), which is the Fast Correlation-Based Filter (FCBF) [65]. FCBF computes the normalized information gain of candidate features, which indicates the correlation between features and classes and is also known as symmetrical uncertainty (SU). It is computed from entropies measured in bits. Thus, the higher the SU value, the more relevant the feature under study is to the classes. By using FCBF, we can rank features in decreasing order as shown in Table 5.3.

Rank | P2A direction: SU | P2A direction: feature | A2P direction: SU | A2P direction: feature
1st | 0.257 | initial AdvWin. | 0.240 | PUSH packets
2nd | 0.214 | Avg. of AdvWin. | 0.212 | Avg. of AdvWin.
3rd | 0.199 | Var. of AdvWin. | 0.206 | Var. of AdvWin.
4th | 0.164 | Avg. packet size | 0.203 | Var. of packet size
5th | 0.160 | size of burst packets | 0.195 | minimum segment size
6th | 0.154 | Var. of packet size | 0.195 | initial AdvWin.
7th | 0.153 | PUSH packets | 0.194 | size of burst packets

Table 5.3: The rank-ordered features from symmetrical uncertainty values, where Avg., Var., and AdvWin. mean average, variance and AdvertisedWindow in the TCP header, respectively.
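As an illustration of how a symmetrical uncertainty ranking of this kind can be computed, the following Python sketch uses the usual definition SU(X, Y) = 2 [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)]. It assumes that feature values have already been discretized; the function names are hypothetical, and the redundancy-removal stage of FCBF is not shown.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(feature, labels):
    """Normalized information gain between a discretized feature and the classes."""
    h_x = entropy(feature)
    h_y = entropy(labels)
    if h_x + h_y == 0:
        return 0.0  # both are constant, so there is nothing to measure
    h_xy = entropy(list(zip(feature, labels)))
    info_gain = h_x + h_y - h_xy
    return 2.0 * info_gain / (h_x + h_y)


def rank_features(features, labels):
    """Return (name, SU) pairs sorted by decreasing SU.

    `features` maps a feature name to its list of discretized values.
    """
    scored = [(name, symmetrical_uncertainty(vals, labels))
              for name, vals in features.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A feature that determines the class exactly gets SU = 1 and a feature that is independent of the class gets SU = 0, so the ranking orders candidates by relevance to the classes.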
The ranked features are obtained from the real data sets mentioned in Sec. 5.5. In Table 5.3, P2A and A2P mean the direction from the passive peer to the active peer and that from the active peer to the passive peer, respectively. Among all potential features given in Table 5.2, only seven features are kept in Table 5.3. We see that the highest seven features in each direction are the same when the total number of training data samples is more than 500K. Thus, they are used in the later classification step. It also implies that a minimum of 500K samples of traffic flows are required in the training of classifiers.

In the A2P direction, the number of packets with the "PUSH" flag set has the top rank. The TCP sender sets the PUSH flag of a packet to indicate that all buffered data are delivered in the packet. In an interactive application, when a client sends a command to the server, the client would set the PUSH flag and wait for the server's response. Without the PUSH flag, this process may hang because the operating systems of the sender and the receiver may continue to wait for additional data. Thus, interactive applications tend to have a larger number of push packets. Furthermore, the "PUSH" flag of encrypted packets in applications such as sftp is used for their correct decryption. Thus, the number of push packets can be used as a good feature.

In the P2A direction, the AdvertisedWindow parameter in the TCP header is the key parameter to distinguish applications. It indicates the allowable size of the buffer at the receiver. The corresponding peer's transmission behaviors are controlled by this parameter. The peer should send packets of a size less than it. Thus, we can differentiate applications based on different transmission behaviors at the peer's side.
For example, the passive peer (server) may send a larger data file to the active peer in the WWW application while the active peer can send a larger data file in the ftp and email applications. This parameter will vary according to applications. As observed in [43], [20], [64], the statistics of the packet size provide good features, too.

Our selected features include the size and the number of "burst" packets. The burst characteristics can be determined using the arrival time information of packets. That is, consecutive packets are said to form a burst if their inter-arrival time is less than a predefined threshold. The minimum threshold will be the computing time for processing a packet in the sender. The maximum threshold will be less than one round trip time. The threshold for deciding burst packets is selected as 0.007 second in this work, because most round trip times in the training data set are less than this value. Some applications, such as WWW and P2P, have a heavy burst behavior, while streaming and interactive multimedia applications do not have such a behavior on average. Thus, this burst feature is expected to provide good discriminant power among these distinctive classes. These seven features will be used in the comparison of classifiers in Sec. 5.5.

Feature selection can be done with an off-line process, and this allows the use of methods of higher complexity. The wrapper method searches a space of $2^n$ feature sets, where n is the number of features. Thus, we have to find a balance between the search cost and the performance of the searched result. The cost is measured in terms of computation and memory space while the performance metric is the correct classification rate. There are several optimization techniques, such as branch-and-bound and exhaustive search, that can be used.
We use the genetic algorithm (GA) [4,28,29,60] for optimization based on the observation that the decision trees with features searched by GA provide the best performance. Furthermore, features selected by GA can work effectively with decision trees and the Bayesian approach.

GA consists of four components: a population of individuals, a fitness (objective) function, a stop criterion, and genetic operators (e.g. crossover and mutation). GA starts from a population of individuals, each of which corresponds to a candidate feature set in our current context. Each individual has some random genes, which correspond to the features under consideration. Suppose that there are n genes (or features). We can represent the entire set as an n-bit vector. If a gene is selected, we set the corresponding bit to one. Otherwise, it is zero. The population consists of a specified number of individuals with different genes. The gene selection process follows the "survival of the fittest" principle in Darwin's theory of evolution. The degree of fitness of each individual is computed by an objective function. The fitness is the correct classification rate in this work. Once the fitness values of all individuals are computed, GA repeats the evolution procedure until the stop criterion is met. The evolution procedure is stated below.

1. Create offspring by exchanging genes between parents, which consist of all individuals in the previous population. This is called "crossover".
2. Make a bit-level mutation to each offspring.
3. Select the two parents that have the best fitness.
4. Generate a population with the offspring and the selected two parents.
5. Calculate the fitness value of each individual and check the stop criterion.

Finally, the genes of the surviving individual are the selected features when the statistics of all individuals' fitness meet the stop criterion.
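The five-step evolution procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the configuration used in this work: the function name `ga_select_features`, the single-point crossover, the population size, the mutation rate and the fixed number of generations used as the stop criterion are all hypothetical choices, and `fitness` stands in for the correct classification rate of a classifier trained on the selected features.

```python
import random


def ga_select_features(n_features, fitness, pop_size=20, generations=30,
                       mutation_rate=0.05, seed=0):
    """Search n-bit feature masks; requires n_features >= 2 and pop_size >= 4.

    Returns the best mask found and the best fitness seen per generation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):  # assumed stop criterion: fixed generation count
        offspring = []
        while len(offspring) < pop_size - 2:
            # Step 1: crossover between two parents drawn from the whole population.
            p1, p2 = rng.sample(population, 2)
            cut = rng.randint(1, n_features - 1)
            child = p1[:cut] + p2[cut:]
            # Step 2: bit-level mutation of the offspring.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            offspring.append(child)
        # Step 3: keep the two fittest parents.
        elite = sorted(population, key=fitness, reverse=True)[:2]
        # Step 4: the new population is the offspring plus the two parents.
        population = offspring + elite
        # Step 5: evaluate fitness and record the best individual.
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history
```

Because the two fittest parents are always carried over, the best fitness never decreases from one generation to the next. In practice `fitness` would train and evaluate a classifier on held-out data, which is why this search is run off-line.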
Crossover and mutation are the two key factors that allow GA to outperform both classical gradient search and random search for high-dimensional, noisy and multi-modal objective functions. For more details on GA, we refer to [4,28,29,60].

5.2.3 Proposed Feature Training and Testing System

Fig. 5.1 shows the proposed system that uses a wrapper method in the training process and a filter method in the testing process. The training process is done off-line. GA is repeatedly applied to randomly divided training data sets and features are selected by the fittest individual. The classifier can be constructed using these selected features. The classifier should give performance similar to that obtained in training since the wrapper method tailored the performance to a predefined threshold.

Figure 5.1: Feature training and testing processes (an off-line training process that produces the final feature set and decision tree, and an on-line testing process that applies them to the testing data).

Moore et al. [43] used FCBF [65] to propose a variant of the wrapper method. The filter ranks features according to their calculated SU values in descending order. They obtained a set of good features by progressively adding one feature after the other and checking the performance of the wrapper method. The set of features with the best performance can then be decided. The total number of performance evaluations is the same as the total number of features, since features are ordered in descending order. This approach has a complexity of $O(n)$, which is significantly lower than that of the full search, namely, $O(2^n)$.
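The progressive search over SU-ranked features described above can be sketched as follows. This is a hedged Python illustration, not the code of [43]: the function name `best_ranked_prefix` is hypothetical and `evaluate` is an assumed callback that returns the wrapper performance (for example, the correct classification rate) of a feature subset.

```python
def best_ranked_prefix(ranked_features, evaluate):
    """Evaluate the top-1, top-2, ..., top-n ranked features and keep the best.

    Only n subsets are evaluated, one per prefix of the ranking, instead of
    the 2^n subsets of an exhaustive wrapper search.
    """
    best_score = float("-inf")
    best_k = 0
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        if score > best_score:
            best_score, best_k = score, k
    return ranked_features[:best_k], best_score
```

The sketch makes the limitation visible: a subset that skips a highly ranked feature is never evaluated, so the result is the best prefix of the ranking rather than the best subset overall.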
However, this approach does not work synergistically with decision trees. First, decision trees use the information gain to construct their trees automatically as proposed in [49]. Without FCBF, decision trees can reflect correlation using the information gain. Second, the search used in [43] may not be sufficient to find the optimum feature set. The optimum feature set needs to be searched with respect to performance rather than SU values (i.e. relations between classes and features). In other words, there could be some feature set that offers better performance. Indeed, the proposed hybrid method using GA does offer better performance than FCBF as shown in Sec. 5.5.

5.3 Classification Methods

A good classification algorithm should meet three criteria: low complexity, high accuracy and robustness. In order for an ISP to classify a great number of flows, the complexity of a classifier should be low while meeting the desired classification accuracy. To reduce the complexity, the training process should make the classifier simpler. However, the simplification process may ignore the gap between the training data set and the test data set since the feasible sample space of flows in the Internet is huge and it continues to change due to the emergence of new applications. Thus, a good classifier should be robust with respect to variations in new applications, while keeping high classification rates and low complexity.

5.3.1 Naive Bayesian Approaches

The naive Bayesian analysis technique was used in previous work due to its simplicity, where the conditional probability given a class was assumed to be Gaussian distributed. This is however not true in reality. To overcome this problem, kernel estimation was adopted in [43] to estimate the distribution. Thus, the density estimation can be expressed by

$$f(x \mid c_j) = \frac{1}{n_{c_j} h} \sum_{x_i \in c_j} K\!\left(\frac{x - x_i}{h}\right) \qquad (5.3)$$

where h is the kernel bandwidth, $j \in J$ is the class index, $n_{c_j}$ is the number of samples in class j in the training data sets and $K(z)$ is the normalized Gaussian density of the form $K(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)$.

The kernel estimation is however not suitable for this application due to the heavy computation involved. When computing $f(x \mid c_j)$ in (5.3) to classify an unknown flow, one has to perform $n_{c_j}$ computations. As $n_{c_j}$ increases, the estimation becomes more reliable but the complexity increases as well. Furthermore, this approach suffers from the overfitting problem so that the classifier does not work effectively for new data sets.

5.3.2 Decision Trees

In this work, we propose the use of a decision tree to solve the problem. The decision tree classifier demands higher complexity in the training process. However, this is acceptable since the training is done in an off-line mode.

A typical decision tree contains internal nodes, which represent tests to be performed, and leaf nodes, which represent all classification outcomes. The tree construction process [49] (or the training process) develops a classification rule from a supervised training data set. The process consists of two steps. First, we should identify the features that provide the best discriminant power among classes and determine a test (i.e. a branch in the decision tree) by selecting threshold values of those features. Then, we need to determine a sequence of tests, which corresponds to the decision tree generation.

In the test mode, the decision process is a series of tests, which corresponds to branches of the tree. The final decision goes to a certain leaf node, which gives the application class. Thus, the complexity of the classification process is the depth of the tree, which can be represented by $O(\log n)$, where n is the number of training samples.
The computational and memory requirements for Naive Bayesian (NB), Naive Bayesian using Kernel Estimator (NBKE) and the decision tree are shown in Table 5.4, where d is the number of features and m is the number of testing samples. The space (or memory) complexity in training a tree is the number of nodes, which is $1 + 2 + 4 + \cdots + n/2 \approx n$, that is, $O(n)$. Considering the time and space complexities of the classification process, the decision tree classification approach is better than NBKE.

Operation | NB | NBKE | Decision tree
Space for training | O(d) | O(dn) | O(m)
Train on n samples | O(dn) | O(dn) | O(dn^2 log n)
Test on m samples | O(dm) | O(dmn) | O(m log n)

Table 5.4: Time complexity for Naive Bayesian (NB), NB Kernel Estimator (NBKE) and decision tree.

The decision tree also suffers from the overfitting problem caused by noise in the training data, which degrades the classification performance. Thus, we need a pruning process to cut down sub-trees resulting from the noise. However, it is difficult to prune sub-trees optimally since there is no specific relation between the tree size and the classification error. This means that the pruning process is performed repeatedly until the error is made as small as possible. After a normal tree is generated, the pruning process is applied to the tree to increase the classification accuracy. Furthermore, the pruning process reduces the size of the decision tree so that the complexity of the decision tree in Table 5.4 is reduced. The reduction depends on the training data sets. Thus, the actual complexity will be evaluated after the pruning process with real Internet traffic data sets in Sec. 5.5.

According to the particular method used in calculating errors in the pruning process [23], there are some variants of the decision tree approach.
Among several decision tree classifiers, J48 [51] and REPTree (Reduced Error Pruning Tree) [50] are adopted here. J48 is the WEKA implementation of C4.5, which is the most popular in a series of decision tree methods. During the training process, every leaf can estimate its error ratio, i.e. the ratio of the number of wrongly classified incidents to the total number of incidents assigned to the leaf, from the supervised training data sets. The upper node can also calculate the weighted sum of the error estimates of all its leaves. If the weighted sum at the upper node is less than the error ratio combined from its leaves, all leaves under the node are pruned.

The REPTree (Reduced Error Pruning Tree) classifier [50] is adopted since it uses a fast pruning algorithm to increase the correct detection rate with respect to noisy training data. Furthermore, the pruned tree reduces the complexity of the classification process considerably. Generally speaking, pruning is used to find the best sub-tree of the initially grown tree with the minimum error for the test set. However, the number of sub-trees grows exponentially with the size of the initial tree. Thus, it is computationally impractical to search all sub-trees. REPTree yields a suboptimal tree under the restriction that a sub-tree can only be pruned if it does not contain a sub-tree with a lower classification error than itself.

The decision tree is unstable in the sense that a small change in the training data can result in a different tree [22]. The pruning process is less useful under highly noisy data. Internet traffic is non-stationary. There can exist some gaps between the training and the testing data. Bagging [6,7] provides a simple yet effective method to enhance the accuracy under noisy data. However, the higher accuracy is achieved at a higher computation cost.
For example, the bagging classifier generates multiple versions of a predictor (which is REPTree in our work) and the final classification result is decided by votes from the multiple predictors. Each predictor is trained using a randomly divided sub-set of the entire training set. Due to the complicated nature of Internet traffic, the bagging classifier is expected to provide better performance than the single REPTree at a higher complexity cost.

In other applications, decision trees have yielded higher accuracy as compared to other methods such as the nearest neighbor and the neural network classifiers. For problems similar to Internet traffic classification, where there is insufficient information on the statistics of the objects and features, decision trees work effectively. Furthermore, the decision tree approach has relatively lower classification complexity due to the pruning process. However, it has no means to overcome fixed errors in trees, so the training process has to be repeated periodically by the network administrator to update the classifier for Internet traffic classification.

5.4 Early Classification Based on Partial Flow Information

Traffic measurement is essential to feature extraction and classification. The multistage filter [25] provides a scalable traffic measurement tool at the ISP. However, some slight modifications are needed for the specific problem of our interest. The filter has several stages. In each stage, a table of counters is maintained, and the filter assigns a counter to a new flow. More memory space and registers are needed to record features per flow. In our implementation, one memory box is assigned to a flow for the classification purpose. The size of a box depends on how many features are to be used.
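The memory-box organization described above can be pictured with a small Python sketch. It is only an illustration under several assumptions: the class name `MultistageFeatureFilter`, the use of SHA-256 as the per-stage hash, the two logged fields, and the rule of reading the box with the fewest packets are hypothetical choices and are not taken from [25] or from our implementation.

```python
import hashlib


class MultistageFeatureFilter:
    """Several stages, each a table of memory boxes indexed by a hash of the flow key."""

    def __init__(self, stages, boxes_per_stage):
        self.stages = stages
        self.boxes_per_stage = boxes_per_stage
        self.tables = [[None] * boxes_per_stage for _ in range(stages)]

    def _index(self, stage, flow_key):
        # A different hash per stage is emulated by salting with the stage number.
        digest = hashlib.sha256(f"{stage}:{flow_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.boxes_per_stage

    def update(self, flow_key, packet_size):
        """Log one packet of the flow into its box at every stage."""
        for stage in range(self.stages):
            i = self._index(stage, flow_key)
            if self.tables[stage][i] is None:
                self.tables[stage][i] = {"packets": 0, "bytes": 0}
            self.tables[stage][i]["packets"] += 1
            self.tables[stage][i]["bytes"] += packet_size

    def read(self, flow_key):
        """Return the box with the fewest packets (assumed least mixed), or None."""
        boxes = [self.tables[s][self._index(s, flow_key)] for s in range(self.stages)]
        if any(box is None for box in boxes):
            return None
        return min(boxes, key=lambda box: box["packets"])

    def release(self, flow_key):
        """Reset the boxes of a flow in all stages so that they can be reused."""
        for stage in range(self.stages):
            self.tables[stage][self._index(stage, flow_key)] = None
```

With a single box per stage, two different flows necessarily land in the same box and their counts are summed, which shows how statistics get mixed when the number of boxes is small relative to the number of active flows.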
Two different flows may share the same memory box in one stage of the filter because the number of memory boxes is less than the total number of flows. Once a box is shared, the statistics of the flows are mixed. The mixed statistics in the box cannot be recovered without logging the flow information. As a result, the collided flows are often treated as one flow. Thus, the classification performance is degraded.

To alleviate the collision problem, we propose to use early classification based on partial flow information. In previous work, classification is performed only after a flow is completed, where the end of a flow can be indicated by the TCP FIN or RST flags, or by a timeout. We propose to perform classification with a sufficient number of packets before the end of the flow. Then, the memory box assigned to this flow can be released for another new flow.

Figure 5.2: The basic operation in a multistage filter for the classification in the domain of time. (Legend: $H_i(x)$ is the hash function at the i-th stage; each box is a memory box for extracting the features of a flow; B is the total number of memory boxes in a stage; s is the total number of stages. Flow A is generated at $t_0$, its classification is finished at $t_f$, and the memory boxes assigned to it are reset at $t_f$.)

The basic operation of a multistage filter for classification is shown in Fig. 5.2. Assume that flow A is generated at $t_0$. The filter assigns a memory box to this flow at every stage. If the classification of flow A is done at $t_f$, the filter resets the boxes assigned to flow A in all stages. The probability for flow A to have a collision in a stage is proportional to the number of flows, N, generated during $t_d = t_f - t_0$. It
is assumed that there are B memory boxes in a stage with N « B and these N are uniformly distributed among B memory boxes. Then, the collision probability in a stage is roughly N /B and the probability that N flows do not share the same memory as flow A in the filter is equal to P = ( l - § ) '. (5.4) where s is the number of stages. If there are fewer collision events, classification results are more reliable. We investigate the relation between td and N with real data sets in Sec. 5.5. 5.5 Sim ulation R esu lts and D iscu ssion 5.5.1 Simulation Setup We used three data sets collected at different places. Each data set is described below. • Data from USC/ISI. The Information Sciences Institute (ISI) of USC accessed to Los Nettos [39], a local ISP in Los Angeles, to collect attack and normal traffics [31]. They collected traffics from two peering links among five links at Los Nettos during the busiest hours. We used some of collected traffics and removed attack ones. Under the ISP, small and large corporations, universities and laboratories are connected. Thus, various applications are included in the data set. In addition, a great number of one-way flows were also observed. 91 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. • Data from PMA (Passive Measurement and Analysis). Our training data were collected by PSC (Pittsburgh Super computing Center) in April, 2005, while the test data were collected at the same place in June and October, 2005. To check the robustness of the proposed algorithm against time, we considered two cases, that had 2- and 6-month gaps between test and training data, respectively. • Data from CAIDA (Cooperative Association for Internet Data Analysis) [9]. CAIDA has monitored various locations in several ISPs in US. We used anonymized OC48 data from Link A (San Jose, CA) to test the dependency of the site. Table 5.5 shows the statistics of the classes in the training data sets. 
The data sets were collected at PSC during one month in April 2005. Even though the total data set is huge, the number of extracted flows is relatively low, as shown in Table 5.5, because we restrict the supervised traffic as mentioned in Sec. 5.2.1 and eliminate very short-lived flows with fewer than four packets to reduce the noise they cause. The statistics are shown for the A2P (active peer to passive peer) and the P2A (passive peer to active peer) directions in the table. The direction is distinguished by the TCP SYN flag. We also include the class "others" to measure the hit ratio, i.e., the number of correct classifications over the total number of flows. The class "others" includes all flows that are outside the applications defined in Table 5.1.

Flows | WWW | ftp | telnet | multimedia | SMTP | service | P2P | chat | others
A2P | 488337 | 9594 | 13 | 171 | 49090 | 7772 | 627 | 2789 | 397557
P2A | 300447 | 7146 | 9 | 86 | 34670 | 7918 | 226 | 402 | 596540
Table 5.5: The number of flows of the classes in the training data set from PSC.

From the data sets, all potential features were extracted by the procedures described in Sec. 5.2.1. Then, the features were converted to the input format used in WEKA [40], a verified classification tool that is popular in the data mining community. Furthermore, we applied the method mentioned in Sec. 5.2.2 to reduce the dimension of features. Then, with this reduced set of features, five classifiers were tested and compared: (1) the naive Bayesian classifier, (2) the naive Bayesian classifier using kernel estimation, (3) J48, (4) the REPTree classifier and (5) the bagging classifier using REPTrees. All parameters in each classifier were set to the default values provided by the tool. Simulation results remain about the same under repeated tests under the above conditions.

5.5.2 Accuracy and Complexity

Fig. 5.3(a) and Fig. 5.3(b) show the accurate classification rates versus the number of features used in the A2P direction and the P2A direction, respectively. The features are added in descending order of their symmetrical uncertainty (SU) rank. This makes it possible to compare the classifiers under relatively fair conditions. Traffic data collected at PSC are used for training and testing. The decision tree-based classifiers are clearly better than the two Bayesian methods. We can also see that the accurate rates of the decision trees start to saturate if seven or more features are used in both directions. This indicates that the seven selected features shown in Table 5.3 are enough to classify the application. Furthermore, the classification performance of the two naive Bayesian methods degrades as the number of features increases beyond seven. Additional features confuse the two methods. However, the performance of the bagging classifier is not significantly better than that of the REPTree classifier. This implies that the statistics of the training and test data sets are similar, even with a gap of six months. The classifiers are robust under a time gap of 6 months between the training and test data sets.

Features | NBKE (7) | J48 (11) | REP (8)
common | # of total pkts, minimum pkt size, initial AdvWin., Avg. of AdvWin., Avg. of pkt size (shared by all three classifiers)
different | Var. of pkt size, Avg. of I-A time | # of ack pkts, # of pkts with "PUSH", Avg. of I-A time, Var. of I-A time, # of data packets, Var. of AdvWin. | total size of burst pkts, # of pkts with "PUSH", Var. of AdvWin.
Table 5.6: Optimized features obtained from a wrapper method using GA and three classifiers in the P2A direction, where Avg. and Var. denote the average and the variance, AdvWin. denotes the AdvertisedWindow in the TCP header, and I-A time and pkt stand for inter-arrival time and packet, respectively.
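The SU-based ranking used for these curves can be sketched as below. This is an illustrative sketch only: it uses the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) and assumes that every feature has already been discretized into a finite set of values (continuous features such as packet size or inter-arrival time would need binning first). The function names are hypothetical, and the ranking in this work was produced with WEKA rather than with this code.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(feature, labels):
    """SU = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x = entropy(feature)
    h_y = entropy(labels)
    if h_x + h_y == 0.0:
        return 0.0  # both variables are constant
    h_xy = entropy(list(zip(feature, labels)))  # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)


def rank_features(columns, labels):
    """Return feature names sorted by decreasing SU with the class label."""
    scores = {name: symmetrical_uncertainty(col, labels)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Feeding the first k names returned by `rank_features` to each classifier, for k = 1, 2, ..., corresponds to moving along the x-axis of Fig. 5.3.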
So far, the set of selected features has been obtained from the ranked symmetrical uncertainty for the evaluation of classifiers. These features are, however, not yet optimized. They may be redundant among themselves and could be biased toward some classifier. It was mentioned in Sec. 5.2.2 that the wrapper method stops when a criterion is met. The criterion is a threshold on the ratio of the standard deviation to the mean of the accurate rates. The mean and the standard deviation are obtained from the evaluation of folds drawn from random and uniform divisions of the supervised training data set. Under a threshold of 0.01%, we obtain optimized features tailored to the classifiers using GA. In particular, REP and NBKE are selected and compared as representatives of the decision tree and Bayesian approaches, respectively.

Table 5.6 shows the features selected by NBKE, J48 and REP, respectively, under the wrapper method using GA. By comparing them with the data in Table 5.3, we see that some features are redundant while others are biased. For example, the number of push packets is redundant in REP, and the variance of the initial advertisement window and the size of burst packets are redundant in NBKE, since they are removed in Table 5.6. However, the features "the number of total packets" and "the minimum packet size" are important to the performance of NBKE and REP since they are kept by GA.

Figure 5.3: The accurate classification rates versus the number of features which are ranked according to symmetrical uncertainty values in (a) the A2P direction and (b) the P2A direction. (Panel (a): in the direction of the active peer to the passive peer; panel (b): in the direction of the passive peer to the active peer. x-axis: number of selected features, 1 to 15. Curves: NB, NBKE, J48, REPTree, Bagging.)

Order | NBKE | J48 | REP
1st | minimum pkt size | minimum pkt size | # of total pkts
2nd | initial AdvWin. | size of total burst pkts | # of pkts with "PUSH"
3rd | Avg. of AdvWin. | initial AdvWin. | minimum pkt size
4th | Var. of AdvWin. | Avg. of AdvWin. | size of total burst pkts
5th | Avg. of pkt size | Var. of AdvWin. | initial AdvWin.
6th | - | Avg. of pkt size | Avg. of AdvWin.
7th | - | - | Var. of AdvWin.
8th | - | - | Avg. of pkt size
9th | - | - | Var. of pkt size
Table 5.7: Optimized features obtained from a variant of the wrapper method using FCBF and three classifiers in the P2A direction.

Table 5.7 shows the features selected in [43]. We see some differences from those in Table 5.6 due to the different feature selection algorithms. The number of features chosen for REP by [43] is larger than that obtained from GA due to feature redundancy. The features selected by GA yield better performance in the decision trees and NBKE than those selected by FCBF in [43], as shown in Fig. 5.4.

Figure 5.4: Comparison of accurate classification rates of REP, NBKE and J48 when the full feature set and two sets of features selected by GA and Moore's work are used. (y-axis: accurate rate (%). Bars: REP, NBKE, J48.)

The true positive rate (TP rate) can be used as another metric to verify the biased effect of classifiers. The rate is defined as the number of correctly classified objects in a class over the total number of real objects in that class. The rates under various feature sets from GA are shown in Table 5.8. We see from the table that decision trees are superior to NBKE, and that classifier J48 with features selected by GA has the best classification performance.
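The per-class TP rate defined in this section can be computed as in the following sketch. The function name `true_positive_rates` and the toy labels in the usage lines are assumptions made for illustration; they are not taken from the data sets or tools used in this work.

```python
from collections import Counter


def true_positive_rates(actual, predicted):
    """Per-class TP rate: correctly classified flows of a class divided by
    the number of flows that really belong to that class."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Toy usage with made-up labels (not thesis data).
actual = ["www", "www", "chat", "ftp"]
predicted = ["www", "chat", "chat", "www"]
rates = true_positive_rates(actual, predicted)
```

Unlike the overall accurate rate, this metric is not dominated by the largest classes, which is why it is useful for exposing a classifier that is biased toward, for example, WWW traffic.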
In [20], Dewes et al. used the packet size, the inter-arrival time and the header (the first strings) in the application layer to identify Internet chat systems. However, we cannot use the header in the application layer to solve our problem. The packet size and the inter-arrival time may not be sufficient to separate the chat application from other applications within limited time and computing power, because the ranges of these features can overlap with those of other applications. The inter-arrival time of the packets with the PUSH flag is a good feature to distinguish the chat application. The reason is that chat applications behave like daily conversation, which is interactive. This interactivity can be represented by the inter-arrival time of the packets with the PUSH flag. The inter-arrival time of all packets cannot represent this behavior because the packets include some ACK packets that are not related to the interactive behavior. Furthermore, the total number of packets is observed to consist of the ACK packets and the PUSH packets evenly in a chat session. FTP also uses several packets with the PUSH flag set; however, the inter-arrival time of those packets is very short because FTP utilizes the channel as fully as possible. Thus, the number of packets with the PUSH flag and the average inter-arrival time between packets with the PUSH flag are used for distinguishing the chat application.

Classifiers | WWW | ftp | multimedia | SMTP | service | P2P | chat | others
NBKE | 0.86 | 0.80 | 0.83 | 0.86 | 0.83 | 0.99 | 0.?7 | 0.92
J48 | 0.99 | 0.98 | 0.94 | 0.94 | 0.94 | 0.99 | 0.84 | 0.92
REP | 0.99 | 0.94 | 0.90 | 0.94 | 0.98 | 0.99 | 0.82 | 0.?8
Table 5.8: The TP rates according to major applications in classifiers with genetic search algorithm. (A "?" marks a digit that is illegible in the source.)

Feature selection scheme | J48 | REP
GA | 1699 (11) | 1355 (8)
FCBF | 2137 (6) | 1305 (9)
Table 5.9: The space requirements of decision trees under different feature selection schemes of wrapper methods.

5.5.3 Complexity and Memory Requirements

We compare the computational complexity of several classifiers in the P2A direction in Fig. 5.5. The NBKE scheme has the highest computation while the REPTree scheme has the lowest. The same trend is also observed in the A2P direction.

The memory space requirement of decision trees depends on the statistics of the selected features and training data sets. The memory space requirements for different methods are shown in Table 5.9. The number in parentheses is the number of used features. We see that the GA-selected features demand smaller memory space.

Figure 5.5: Comparison of computational complexity of several classifiers in the P2A direction. (x-axis: number of training data samples. Curves: NB, NBKE, J48, REPTree, Bagging.)

5.5.4 Robustness

The classification performance at different sites is studied to check the robustness of various classification methods. Accurate classification rates are given for more data sets in Table 5.10. The results are obtained by using the feature set selected by the genetic search algorithm. The results are similar to those reported in Sec. 5.5.2, indicating the robustness with respect to different sites. It is observed that the traffic collected from Los Nettos and CAIDA serves as better training data for tests conducted in Los Nettos and CAIDA, respectively.
This is called site-dependent training, which is due to different traffic loads and different router performances that affect properties of features, such as the inter-arrival time, the window size, burst packets, etc.

Site | NB | NBKE | J48 | REP | Bag
Los Nettos | 74.1 | 8?.0 | 94.? | 96.1 | 9?.3
CAIDA | 69.2 | 8?.2 | 9?.3 | 94.8 | 96.?
Table 5.10: Accurate classification rates of various classification tools in the direction from the server to clients with traffic data sets in two different sites. (A "?" marks a digit that is illegible in the source.)

5.5.5 Comparison of Modified Multistage Filter and NetFlow

Early classification was proposed to reduce the collision probability in Sec. 5.4. Fig. 5.6 shows the accurate rate according to the number of packets used for classification. The x-axis is the number of packets used, which reflects the elapsed time between the initial setup of a flow and the time it is classified, even if the flow is not finished. The y-axis is the accurate rate. We see that a few packets are sufficient to yield high accurate classification rates while the collision probability is very low.

Figure 5.6: The accurate classification rates as a function of the number of packets used for classification. (x-axis: the number of packets used in the classification, 10 to 50. Curves: J48, REP, Bag.)

Figure 5.7: The accurate detection rates versus the sampling rates in (a) the A2P direction and (b) the P2A direction. (x-axis: sampling rate, 0.3 to 1. Curves: NB, NBKE, REPTree, Bagging.)
To investigate the sampling effect when NetFlow is enabled, we dropped packets in each data set according to the sampling rate p. In the implementation, a random number in [0,1] was generated for each packet, and the packet was dropped if the number was higher than p. The performance of all classifiers degrades for a fixed size of training data. However, if the classifier is trained with training data obtained using the same sampling rate as applied to the test data, the classifier has little performance degradation as compared to that with p = 1.0 (i.e., no sampling is applied). The correct detection performance versus the sampling rate for traffic in the A2P direction and the P2A direction is shown in Fig. 5.7(a) and Fig. 5.7(b), respectively. Given a sampling rate of 0.3, the correct detection rates of the REPTree and Bagging classifiers were lowered by about 10% in the A2P direction and 20% in the P2A direction as compared to those with p = 1.

5.6 Conclusion

The problem of classifying the application type of an Internet traffic flow at the ISP was examined. This is a challenging problem due to asymmetric routing and the large number of flows. For practical feature selection, we used a feature selection process by modifying the multistage filter and NetFlow slightly. Realistic features were identified under those tools and a set of optimum features was selected using GA. Another feature selection method using FCBF and a filter method were compared with GA. As classifiers, the decision trees J48 and REPTree and the bagging method using REPTree were considered. They were trained with selected features from real traffic. Trained decision trees were compared with classifiers using the Bayesian approach. It was concluded from simulation results that the decision tree using GA is the most effective.
To support real-time QoS and security, we proposed a process called early classification. Under a modified multistage filter, early classification can be highly reliable with a certain amount of memory in the filter. However, when NetFlow is used, the performance of real-time classification is lower than that of the multistage filter. We would like to overcome the effects caused by sampling errors to enhance the performance of early classification in NetFlow.

Chapter 6

Conclusion and Future Work

6.1 Conclusion

For DoS attack detection, we considered the TCP SYN flooding attack and the UDP flooding attack in Chapters 3 and 4, respectively, based on a traffic monitoring approach. For various Internet services, Internet traffic classification using decision trees was taken into account in Chapter 5. The main research results are summarized below.

To design the detector for the TCP SYN flooding attack, we examined real world attacks by studying attack traffic patterns and their impact. It was observed that some attackers use a low-rate attack to confuse the detector. The low-rate attack may mislead a target server into denying its clients that have a longer RTT by dropping their requests. Furthermore, multiple attack sources can be used to maximize the influence. Thus, it is important to develop a detector that has better discriminating power between various traffic flows in order to defend the TCP server against the TCP SYN flooding attack effectively. The detector using CUSUM [63] was compared with the proposed HMM detector in terms of detection accuracy, latency and computational overhead. The CUSUM algorithm uses one threshold to detect the occurrence of the attack.
If the threshold is low, the detector is too sensitive to be reliable because normal traffic has the self-similar property. If the threshold is high, the detector may miss low-rate attacks. To address this problem, the proposed detector uses three HMMs and two decision thresholds. A more refined decision can be obtained since the three HMMs can be tailored to normal, low-rate attack and high-rate attack traffic, respectively. Besides, the low threshold can be used to separate normal and low-rate attack traffic, while the high threshold can be used to separate low-rate and high-rate attack traffic. Trace-driven simulations demonstrated that the proposed HMM detector has a higher detection rate and a shorter detection latency. At the expense of slightly increased overhead, the HMM detector can accomplish earlier detection than the CUSUM detector, which is based on cumulative statistical data.

Then, we described a possible DoS attack scenario with the flooding of UDP packets against wireless users in Chapter 4. The wireless network has very limited resources, including bandwidth and power, so it is more vulnerable to the DoS attack than the wired one. To protect wireless users from the DoS attack, a detector using Markov models was proposed. The deployment of the detector was discussed. To enhance the performance of the proposed detector, a weighting factor and the second-order Markov model were used. The proposed detector was compared with the batch-sequential detection algorithm [5] in terms of the false alarm rate and detection latency. Our preliminary simulation results show that the proposed detector can be more reliable. This is an ongoing research effort. More work has to be done in the near future to make our study more complete along this direction.

To support real-time QoS and security, the problem of classifying the application type of an Internet traffic flow at the ISP was examined.
This is a challenging problem due to asymmetric routing and the large number of flows. For practical feature selection, we used a feature selection process by modifying the multistage filter and NetFlow slightly. Realistic features were identified under those tools and a set of optimum features was selected using GA. Another feature selection method using FCBF and a filter method were compared with GA. As classifiers, the decision trees J48 and REPTree and the bagging method using REPTree were considered. They were trained with selected features from real traffic. Trained decision trees were compared with classifiers using the Bayesian approach. It was concluded from simulation results that the decision tree using GA is the most effective.

6.2 Future Work

Even though our detector using multiple Markov models is effective if the received data rate of the normal traffic at the client side is constant or nearly constant in the time domain, our research on the protection against the UDP flooding attack as described in Chapter 4 is far from complete. In the near future, we will consider a more robust detection method against this type of attack. Specific research topics include the following.

• Generalized UDP Traffic Model and TCP/UDP Attack
The constant bit rate (CBR) assumption about the normal UDP traffic is too stringent. Even if the source generates CBR traffic, the receiver may not observe CBR traffic due to network delays on different paths. Thus, we need a more flexible model for characterizing the normal traffic and the attack traffic. Furthermore, in the data set [31], some attacks using TCP ACK have also been observed. Thus, the detector needs to react to attacks using TCP and UDP packets, respectively.
The main goal here is to detect more subtle attacks by studying their traffic patterns.

• Detection Based on HMM and/or Layered HMM
We have so far considered only the Markov model for the UDP flooding attack and have not yet implemented the HMM-based detector. The HMM-based detector is expected to give better performance. It can be tailored to different traffic classes as done in Chapter 3. For example, we may divide the traffic into three classes and separate them by two thresholds. Furthermore, to get more robust detection results, it may be worthwhile to consider a layered HMM. A layered HMM consists of multiple HMMs which have different observation durations. The layered HMM can monitor both the long-term and the short-term averages of the received data rate.

• Parameter Fine-tuning and Analysis
There are two parameters controlling the performance of the detector using the layered HMM, namely, the observation duration and the decision thresholds. If the observation duration is long, the layered HMM reflects the long-term average of the traffic. The comparison of long-term averages can give more accurate information, but a long observation duration leads to a long detection latency and makes the detector insensitive. If the higher threshold is set too low, the layered HMM will be unreliable. Thus, we need to determine these parameters based on real world UDP traffic data statistics. The trained layered HMM is expected to react to attack traffic differently.

• Hybrid Detectors and Decision Fusion
Furthermore, to complement our HMM detector, other techniques such as the discrete Fourier transform (DFT) of the TCP packet process [13] can be incorporated, too. The fusion mechanism of decisions from different detectors has to be studied.
• Extensive Experiments with Real and Simulated Data
Preliminary experiments have been performed to test the proposed detector using multiple Markov models against the UDP flooding attack. However, more experiments have to be conducted based on simulated and real world data. Simulated data are still needed since the real world data are more difficult to collect. Setting up a good simulation environment that captures the real world traffic patterns well is a challenging task.

• Improving the low performance under NetFlow
Under a modified multistage filter, the classification can be highly reliable with a certain amount of memory in the filter. However, when NetFlow is used, it is still difficult to do real-time classification and the performance is inferior to that of the multistage filter. It is interesting to examine ways to overcome the sampling error effect and enable early classification in NetFlow.

• Support of Internet services with proposed classification schemes
Internet traffic classification is essential to QoS and security. Based on the proposed classification tools, better QoS and security systems can be put in place. For example, QoS can be provided to users according to classified application types. The bridge between the proposed schemes and their real world deployment is to be explored further.

References

[1] 3GPP2, “3G wireless network management system high level,” 3GPP2 S.R0017, December 1999.
[2] A. Anderson and B. Nielsen, “An application of superpositions of two-state Markovian sources to the modeling self-similar behaviour,” in Proc. IEEE INFOCOM, Kobe, Japan, April 1997, pp. 196-204.
[3] ------, “A Markovian approach for modeling packet traffic with long-range dependence,” IEEE J. Select. Areas Commun., vol. 16, pp. 719-732, June 1998.
[4] J. Bala, J. Huang, H. Vafaie, K. DeJong, and H.
Wechsler, “Hybrid learning using genetic algorithms and decision trees for pattern classification,” in IJCAI conference, Montreal, August 1995.
[5] R. Blazek, H. Kim, B. Rozovskii, and A. Tartakovsky, “A novel approach to detection of Denial of Service attacks via adaptive sequential and batch-sequential change-point detection methods,” in Proc. IEEE Workshop on Information Assurance and Security, West Point, NY, June 2001, pp. 220-226.
[6] L. Breiman, “Bagging predictors,” Department of Statistics, University of California, Berkeley, Tech. Rep. 421, September 1994.
[7] ------, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[8] M. Burdach. Hardening the TCP/IP stack to SYN attacks. [Online]. Available: http://www.securityfocus.com/infocus/1729
[9] CAIDA, “CAIDA internet data - passive data sources.” [Online]. Available: http://www.caida.org/data/passive/index.xml
[10] CERT. (2001, January) Distributed Denial of Service attack tool. [Online]. Available: http://www.cert.org/incident-notes/IN-99_07.html
[11] CERT Coordination Center. Smurf attack. [Online]. Available: http://www.cert.org/advisories/CA-1998-01.html
[12] Check Point Software Tech. SynDefender. [Online]. Available: http://www.checkpoint.com/products/firewall-1
[13] C. Cheng, H. Kung, and K. Tan, “Use of spectral analysis in defense against DoS attacks,” in Proc. Globecom Conf., Taiwan, November 2002, pp. 2143-2148.
[14] CIAC. SUN’s TCP SYN flooding solutions. [Online]. Available: http://ciac.llnl.gov/ciac/bulletins/h-02.shtml
[15] Cisco, “Netflow.” [Online]. Available: http://www.cisco.com/warp/public/732/Tech/netflow
[16] ------, “Sampled netflow.” [Online]. Available: http://www.cisco.com/univercd/cc/td/doc/product/software/ios120/120newft/120limit/120s/120s11/12s_sanf.htm
[17] K. C. Claffy, “Internet traffic characterization,” Ph.D. dissertation,
University of California San Diego, San Diego, CA, 1994.
[18] CERT Coordination Center. DoS using nameservers. [Online]. Available: http://www.cert.org/advisories/CA-2000-04.html
[19] P. Criscuolo. (2000, February) Distributed denial of service, Trin00, TFN, TFN2000, and Stacheldraht. [Online]. Available: http://www.ciac.org/ciac/documents/CIAC-2319-Distributed-Denial-of-Service.pdf
[20] C. Dewes, A. Wichmann, and A. Feldmann, “An analysis of internet chat systems,” in Proc. ACM SIGCOMM Internet Measurement Conference, Miami, FL, October 2003, pp. 51-64.
[21] D. Dittrich. (1998, January) DDoS. [Online]. Available: http://staff.washington.edu/dittrich
[22] R. Duda, P. Hart, and D. Stork, Pattern classification, 2nd ed. Wiley-Interscience, 2001.
[23] F. Esposito, D. Malerba, and G. Semeraro, “A comparative analysis of methods for pruning decision trees,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 5, pp. 476-491, 1997.
[24] C. Estan, K. Keys, D. Moore, and G. Varghese, “Building a better NetFlow,” in Proc. ACM SIGCOMM, Portland, Oregon, August 2004.
[25] C. Estan and G. Varghese, “New directions in traffic measurement and accounting,” in Proc. ACM SIGCOMM, Pittsburgh, Pennsylvania, August 2002, pp. 187-200.
[26] C. Estan, G. Varghese, and M. Fisk, “Bitmap algorithm for counting active flows on high speed links,” in Proc. ACM SIGCOMM Internet Measurement Conference, Miami, FL, August 2002, pp. 153-166.
[27] A. Feldmann, “Characteristics of TCP connection arrivals,” AT&T Labs-Research, Florham, NJ, Tech. Rep., December 1998, technical report.
[28] J. Grefenstette. A user’s guide to genesis version 5.0. [Online]. Available: http://www.genetic-programming.com/c2003genesisgrefenstette.txt
[29] J. H. Holland, Adaptation in natural and artificial systems, 2nd ed. MIT Press, 1992.
[30] B. Huang and L.
Rabiner, “A probabilistic distance measure for hidden Markov models,” AT&T Tech. J., vol. 64, no. 2, pp. 391-408, February 1985.
[31] A. Hussain, J. Heidemann, and C. Papadopoulos, “A framework for classifying Denial of Service attacks,” in Proc. ACM SIGCOMM, Karlsruhe, Germany, August 2003, pp. 99-110.
[32] Internet Security Systems. RealSecure Network 10/100. [Online]. Available: http://www.iss.net
[33] Juniper Networks. Integrated Firewall/IPSec VPN. [Online]. Available: http://www.juniper.net/products/integrated
[34] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport layer identification of P2P traffic,” in Proc. ACM SIGCOMM Internet Measurement Conference, Sicily, October 2004, pp. 121-134.
[35] R. Kompella, S. Sing, and G. Varghese, “On scalable attack detection in the network,” in Proc. ACM SIGCOMM Internet Measurement Conference, Sicily, Italy, October 2004, pp. 187-200.
[36] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, “Hidden Markov Models in computational biology: Applications to protein modeling,” Journal of Molecular Biology, no. 235, pp. 1501-1531, Feb 1994.
[37] W. Leland, M. Taqqu, W. Willinger, and D. Wilson, “On the self-similar nature of ethernet traffic,” IEEE/ACM Trans. Networking, vol. 2, February 1994.
[38] J. Lemon, “Resisting SYN flooding DoS attacks with a SYN cache,” in Proceedings of USENIX BSDCon, February 2002.
[39] Los Nettos, “Los Nettos - Passing packets since 1988.” [Online]. Available: http://www.ln.net
[40] Machine Learning Lab in the University of Waikato, “WEKA.” [Online]. Available: http://www.cs.waikato.ac.nz/~ml
[41] J. Mirkovic, “D-WARD: Source-end defense against distributed Denial-of-Service attacks,” Ph.D. dissertation, University of California, Los Angeles, 2003.
[42] J. Mirkovic, G. Prier, and P.
Reiher, “Attacking DDoS at the source,” in Procceedings of ICNP, Paris, France, November 2002, pp. 312-321. A. Moore and D. Zuev, “Internet traffic classification using Bayesian analysis techniques,” in Proc. ACMSigmetrics, Alberta, Canada, June 2005, pp. 50-59. D. Moore, G. Voelker, and S. Savage, “Inferring internet denial of service activity,” in Proceedings of USENIX Security Symposium, Washington DC, WA, February 2001. K. Park and H. Lee, “On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack,” in Proc. IEEE INFOCOM, Anchorage, Alaska, April 2001, pp. 338-347. V. Paxson, “Bro: A system for detecting network intruders in real-time,” in Computer Networks, December 1999, pp. 2435-2463. V. Paxson and S. Floyd, “Wide-area traffic:the failure of Poisson modeling,” IEEE/ACM Trans. Networking, vol. 3, pp. 1-15, June 1995. L. Peterson and B. Davie, Computer Networks, 2nd ed. Morgan Kaufmann, 1998. J. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81-106,1986. , “Simplifying decision trees,” International Journal of Man-Machine Studies, vol. 27, pp. 221-234, 1987. , C f.5: Programs for Machine hearing. Morgan Kaufmann, 1993. L. Rabiner and B. Huang, “An introduction to hidden Markov models,” Proc. IEEE, vol. 77, pp. 257-286, February 1989. M. Roughan, S. Sen, 0 . Spatscheck, and N. Duffield, “Class-of-service mapping for QoS :A statistical signature-based approach to IP traffic classification,” in Proc. ACM SIGCOMM Internet Measurement Conference, Sicily, Italy, Octo ber 2004, pp. 135-148. 112 R eproduced wifh perm ission of fhe copyrighf owner. Further reproduction prohibited without perm ission. [54] S. Savage, D. Wether all, A. Karlin, and T. Anderson, “Practical network sup port for IP traceback,” in Proc. ACM SIGCOMM, San Diego, CA, August 2001, pp. 295-305. [55] R. Sekar, A. Gupta, J. Prullo, T. Shanbhag, A. Tiwari, H. Yang, and S. 
Zhou, “Specification-based anomaly detection: a new approach for detecting network intrusions,” in ACM conference on Computer and communications security, November 2002, pp. 265-274. [56] A. Soneren, C. Partridge, L. Sanchez, C. Jones, F. Kent, and W. Strayer, “Hash-based IP traceback,” in Proc. ACM SIGCOMM, San Diego, CA, August 2001, pp. 3-14. [57] D. Song and A. Perrig, “Advanced and authenticated marking schemes for IP traceback,” in Proc. IEEE INFOCOM, Anchorage, AL, April 2001, pp. 878-886. [58] A. Soule, K. Salamtian, N. Taft, R. Emilion, and K. Papagiannaki, “Flow clas sification by histograms or How to go on safari in the Internet,” in Proc. ACM Sigmetrics, New York, NY, June 2004, pp. 49-60. [59] N. Spring, R. Mahajan, and D. Wether all, “Measuring ISP topologies with rocketfuel,” in Proc. ACM SIGCOMM, Pittsburgh, Pennsylvania, August 2002, pp. 133-145. [60] G. Stein, B. Ghen, A. S. Wu, and K. Hua, “Decision tree classifier for network intrusion detection with GA-based feature selection,” in Proc. the 43rd ACM Southeast Conference, Kennesaw, GA, August 2005. [61] University of Southern California. Los nettos-passing packets since 1988. [Online]. Available: http://www.ln.net [62] G. Vigna and R. A. Kemmerer, “NetSTAT: A network based intrusion detec tion approach,” in Proc. 14*^. Comp. Sec. App. Conf., 1998, pp. 25-34. [63] H. Wang, D. Zhang, and K. Shin, “Detecting SYN flooding attacks,” in Proc. IEEE INFOCOM, New York, NY, June 2002, pp. 1530-1539. [64] G. Wright, F. Monrose, and G. Masson, “HMM profiles for network traffic classification,” in Proc. of the ACM workshop on visulization and data mining for computer security, Washington DC, October 2004, pp. 9-15. [65] L. Yu and H. Liu, “Feature selection for high-dimensional data A fast correlation-based filter solution,” in Proc. the 20th International Conference on Machine Learning (ICML 2003), 2003. 113 R eproduced with perm ission of the copyright owner. 
Further reproduction prohibited without perm ission. [66] S. Yu, Z. Liu, M. Squilante, C. Xia, and L. Zhang, “A hidden semi-Markov model for web workload self-similarity,” in Proc. IEEE Intn’ l. Conf. Perfor mance, Computing, and Communication Conference, April 2002, pp. 65-72. [67] Y. Zhang and B. Singh, “A multi-layer IPSec protocol,” in Proc. 9th USENIX Security Symposium, August 2000. 114 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission.
Asset Metadata
Creator
Park, Junghun
(author)
Core Title
Internet security and quality-of-service provision via machine-learning theory
School
Graduate School
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2006-08
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
computer science; engineering, electronics and electrical; OAI-PMH Harvest
Language
English
Contributor
Digitized by ProQuest
(provenance)
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c17-114137
Unique identifier
UC11349268
Identifier
3237707.pdf (filename), usctheses-c17-114137 (legacy record id)
Legacy Identifier
3237707.pdf
Dmrecord
114137
Document Type
Dissertation
Rights
Park, Junghun
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA