On Entropy in Network Traffic Anomaly Detection

Different systems, e.g., anomaly-based network intrusion detection systems (A-NIDS), have been continuously developed in order to ensure the integrity, availability, and confidentiality of networks. In this paper, we present a structured and comprehensive overview of the research into entropy-based A-NIDS, with the intention of providing researchers a quick introduction to the essential aspects of this topic. The main components of the general architecture of entropy-based A-NIDS are discussed. The high detection rates reported in the literature demonstrate the effectiveness of entropy-based approaches. Finally, some open issues in entropy-based network traffic anomaly detection are also highlighted.


Introduction
Given a traffic network and its set of selected traffic features X = {X_1, X_2, ..., X_p}, and N time instances of X, the normal and abnormal behaviors of the instances can be studied. The space of all instances of X builds the feature space, which can be mapped to another space by employing a function such as entropy. In the literature, Shannon and generalized Rényi and Tsallis entropy estimators, as well as probability estimators (Balanced [1], Balanced II [2]), are used. Chandola et al. (2009) [3] state that the term anomaly-based intrusion detection in networks refers to the problem of finding exceptional patterns in network traffic that do not conform to the expected normal behavior. This concept is accepted by the Internet community.
A generic architecture of entropy-based A-NIDS is presented in figure 1, see [4]. It usually consists of two stages: the training stage and the testing stage. In the training stage, a "normal" profile is found using a database of "normal" (anomaly-free) network traffic together with the feature extraction, windowing, and entropy calculation modules. In the testing stage, using the same feature extraction, windowing, and entropy calculation modules, anomalies in the current network traffic are detected and classified.

Chandola et al. (2009) [3] present "a survey in anomaly detection, grouping existing techniques into different categories based on the underlying approach adopted by each technique". Bhuyan et al. (2014) [5] provide an excellent survey of network anomaly detection, describing six distinct classes of methods and systems. However, these surveys are not focused directly on entropy-based A-NIDS.
This paper presents a structured and comprehensive overview of the research into entropy-based A-NIDS, with the intention of providing researchers a quick introduction to essential aspects of this topic. Using a general architecture of entropy-based A-NIDS, the different techniques proposed in the state of the art for the main modules are shown. The measures of information used by researchers and the most important metrics for testing the performance of detection and classification are presented. We also highlight some open issues in entropy-based network traffic anomaly detection.
The next sections describe the main modules of the entropy-based A-NIDS. Section 2 presents the main databases used by researchers in the field. Section 3 shows the most commonly employed features in network traffic and the windowing approach. Section 4 introduces the mathematical background, including Shannon entropy, generalized entropies, mutual information, Kullback-Leibler divergence, conditional entropy, and the techniques to estimate their values. Section 5 presents the decision functions defined by researchers in the detection stage. Section 6 shows the approaches employed in the classification stage and the most widely used metrics for evaluating the A-NIDS. Section 7 contains the conclusions of the paper and some important open issues on this topic.

Databases
Different databases have been used to evaluate the A-NIDS, and these databases are divided into two groups: synthetic and real.
The synthetic databases are generated artificially, e.g., the MIT-DARPA 1998, 1999, and 2000 databases, which include four major attack categories: Denial of Service (DoS), User to Root (U2R), Remote to User (R2U), and probes.
Some real public databases are CAIDA, which contains anonymized passive traffic traces from high-speed Internet backbone links, and the traffic data repository maintained by the MAWI Working Group of the WIDE Project. Other researchers have created their own databases at different universities, e.g., Carnegie Mellon University, Xi'an Jiaotong University, and Clemson University (GENI [6]), or have collected traffic from backbone networks such as SWITCH, Abilene, and Géant.
Nowadays, there is no public database large enough to exhaustively test and compare different algorithms and to draw significant conclusions about their performance and classification capabilities.

Feature Extraction
Motoda H. and Liu H. (2002) [7] state that feature selection is a process that chooses a subset of M features from the original set of N features (M ≤ N), so that the feature space is optimally reduced according to a certain criterion [8, 9]. Feature extraction is a process that extracts a set of new features from the original features through some functional mapping [10]. Assuming that there are N features Z_1, Z_2, ..., Z_N, after feature extraction another set of new features X_1, X_2, ..., X_M (M < N) is obtained via the mapping functions F_i, i.e., X_i = F_i(Z_1, Z_2, ..., Z_N).
Among the algorithms used to reduce the number of features in network traffic anomaly detection are: PCA [11], Mutual Information and linear correlation [12], decision tree [13], and maximum entropy [14].
In network traffic, the most commonly employed features are [2, 15-19]: source and destination IP addresses and source and destination port numbers. Other features extracted from headers are: protocol field, number of bytes, service, flag, and country code. Zhang et al. (2009) [20] divided the size of packets into seven types, and Gu et al. (2005) [21] defined 587 packet classes based on the port number.
At the flow level, the selected features were: flow duration, flow size distribution (FSD), and average packet size per flow. For KDD Cup 99, the 41 features or a subset of them were employed [12, 13]. On the other hand, Tellenbach et al. (2011) [22] used source port, country code, and other features, constructing the Traffic Entropy Spectrum (TES) as input data.
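The windowing step applied to these features can be sketched as follows; this is a minimal illustration, assuming W_i(L, τ) denotes the i-th window of L packets with an overlap of τ packets between consecutive windows (the function name and packet representation are ours):

```python
def sliding_windows(packets, L, tau):
    """Split a packet sequence into windows W_i(L, tau): each window
    holds L packets and consecutive windows share tau packets."""
    step = L - tau  # windows advance by L - tau packets
    return [packets[start:start + L]
            for start in range(0, len(packets) - L + 1, step)]

# 10 packets, L = 4, tau = 2 -> windows start at packets 0, 2, 4, 6
windows = sliding_windows(list(range(10)), L=4, tau=2)
```

Each window is then passed to the entropy calculation module; the choice of L and τ trades off data-volume reduction against the quality of the entropy estimates.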

Entropy Concepts Used in Network Traffic Anomaly Detection
Let X be a random variable which takes values in the set {x_1, x_2, ..., x_M}, p_i := P(X = x_i) the probability of occurrence of x_i, and M the cardinality of the finite set; hence, the Shannon entropy is:

H(X) = -\sum_{i=1}^{M} p_i \log p_i.

Based on the Shannon entropy [27], Rényi [28] and Tsallis [29] defined generalized entropies, which are related to the q-deformed algebra. The Rényi entropy is defined as:

H_R(X, q) = \frac{1}{1-q} \log \sum_{i=1}^{M} p_i^q,

and the Tsallis entropy is:

H_T(X, q) = \frac{1}{q-1} \left( 1 - \sum_{i=1}^{M} p_i^q \right).

When q → 1, the generalized entropies reduce to the Shannon entropy. In order to compare the changes of entropy at different times, the entropy is normalized, i.e., divided by its maximum value log M:

H_0(X) = \frac{H(X)}{\log M}.

For generalized entropies, the values q = 0.9 and q = 1.1 have been used to detect DoS and DDoS attacks by [30, 31], respectively. The sets q ∈ {-3} ∪ {-2, -1.75, ..., 1.75, 2}, q ∈ {1, 2, 3, ..., 15}, and q ∈ {1, 2, 3, ..., 10} were used to detect DDoS and scanning attacks, and low-rate and high-rate DDoS attacks, by [22, 32, 33].
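These three definitions, together with the normalization by log M, translate directly into code; a minimal sketch (function names are ours, natural logarithms are used):

```python
import math

def shannon_entropy(p):
    """H(X) = -sum_i p_i log p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi_entropy(p, q):
    """H_R(X, q) = (1 / (1 - q)) log sum_i p_i^q, for q != 1."""
    return math.log(sum(pi ** q for pi in p if pi > 0)) / (1 - q)

def tsallis_entropy(p, q):
    """H_T(X, q) = (1 / (q - 1)) (1 - sum_i p_i^q), for q != 1."""
    return (1 - sum(pi ** q for pi in p if pi > 0)) / (q - 1)

def normalized_entropy(p):
    """H(X) / log M, so that the value lies in [0, 1]."""
    return shannon_entropy(p) / math.log(len(p))

p = [0.5, 0.25, 0.25]
# As q -> 1, both generalized entropies approach the Shannon entropy.
```

Evaluating the generalized entropies at q slightly above and below 1 confirms numerically that both reduce to the Shannon entropy in the limit.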

Kullback-Leibler divergence
Consider two complete discrete probability distributions P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), with \sum_{i=1}^{n} p_i = \sum_{i=1}^{n} q_i = 1. The information divergence is a measure of the divergence between P and Q and is defined by [28]:

D_\rho(P||Q) = \frac{1}{\rho - 1} \log \sum_{i=1}^{n} p_i^{\rho} q_i^{1-\rho},

where ρ is the order of the information divergence. Consequently, the smaller D_ρ(P||Q) is, the closer the distributions P and Q are; D_ρ(P||Q) = 0 iff P = Q. When ρ → 1, the Kullback-Leibler (KL) divergence [34] is obtained:

D_1(P||Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}.
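Both divergences can be transcribed directly; this sketch assumes q_i > 0 wherever p_i > 0 (function names are ours):

```python
import math

def kl_divergence(p, q):
    """D_1(P||Q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def info_divergence(p, q, rho):
    """D_rho(P||Q) = (1 / (rho - 1)) log sum_i p_i^rho q_i^(1 - rho), rho != 1."""
    s = sum(pi ** rho * qi ** (1 - rho) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (rho - 1)

P = [0.7, 0.3]
Q = [0.5, 0.5]
```

Evaluating info_divergence at ρ close to 1 recovers the KL divergence, and D(P||P) = 0, matching the properties stated above.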

Mutual information
The conditional entropy of a variable Y given X, with alphabets X and Y, respectively, is defined as:

H(Y|X) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(y|x).

The mutual information (MI) [35] between two random variables X and Y is a measure of the amount of knowledge of Y supplied by X, or vice versa. If X and Y are independent, then their mutual information is zero. The MI of two random variables X and Y is defined as:

I(X; Y) = H(X) + H(Y) - H(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X),

where H(•) is entropy, H(X|Y) and H(Y|X) are conditional entropies, and H(X; Y) is the joint entropy of X and Y, defined as:

H(X; Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y),

where p(x, y) is the joint probability mass function.
The MI equation can also be written as:

I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)},

where p(x) and p(y) are the marginal probability mass functions of X and Y, respectively. In order to estimate the MI between X and Y, it is necessary to estimate p(x, y).
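This last form of the MI is straightforward to evaluate once p(x, y) is available; a sketch over a joint probability table (the nested-list representation is our assumption):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ).
    joint[i][j] holds the joint probability p(x_i, y_j)."""
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    return sum(pxy * math.log(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]  # I(X;Y) = 0
dependent = [[0.5, 0.0], [0.0, 0.5]]        # I(X;Y) = H(X) = log 2
```

The two example tables confirm the stated properties: independence gives zero MI, and a deterministic relationship gives I(X; Y) = H(X).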

Entropy calculation
As the full probability distribution is generally unknown or only partially known, different probability estimators are used, e.g., relative frequency, Balanced, and Balanced II, from which an estimate of the underlying probability distribution is built. The entropy is calculated using these estimators; the more accurate the estimators, the better the entropy estimates.
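As an illustration of this estimation step, the relative-frequency (maximum-likelihood) estimator can be sketched as below; the additively smoothed variant is shown only as a simple stand-in for the Balanced estimators of [1, 2], whose exact correction terms differ:

```python
from collections import Counter

def relative_frequency(samples, alphabet):
    """Maximum-likelihood estimate: p_i = n_i / N."""
    counts = Counter(samples)
    N = len(samples)
    return [counts[x] / N for x in alphabet]

def smoothed_frequency(samples, alphabet, alpha=1.0):
    """Additive (Laplace) smoothing: p_i = (n_i + alpha) / (N + alpha * M).
    NOTE: a simple illustration only, not the Balanced / Balanced II
    estimators of [1, 2], which use a more refined correction."""
    counts = Counter(samples)
    N, M = len(samples), len(alphabet)
    return [(counts[x] + alpha) / (N + alpha * M) for x in alphabet]
```

The smoothed variant assigns nonzero probability to unseen symbols, which matters for small windows, where the maximum-likelihood estimate is known to bias the entropy downward.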
Rahmani et al. (2009) [36] noted that the time series of IP-flow number and aggregate traffic size are strongly statistically dependent, and that when an attack occurs, it causes a rupture in the time series of joint entropy values. In order to calculate the joint entropy H(X; Y), they estimated p(x, y) of the time series X and Y using either the Gamma probability density function (when the number of connections was small) or the central limit theorem (when the number of connections was large enough). Liu et al. (2010) [18] calculated the conditional entropy H(Y|X), where Y and X are two of the most widely used traffic variables: source and destination IP addresses. Amiri et al. (2011) [12] used an estimator of MI developed by Kraskov [37], which employs entropy estimates from k-nearest-neighbor distances. Velarde-Alvarado et al. (2009) [2] estimated entropy values using the balanced estimator II as a probability estimator.

Anomaly detection
An anomaly in network traffic is a data pattern that does not conform to those representing normal traffic behavior. Therefore, anomaly detection is a broad field, where numerous anomaly detection methods are used for different applications.
Assuming that 1) X ∈ R^p is a p-dimensional real-valued random variable with a domain S_X ⊂ R^p representing traffic features, 2) x_i are instances of X, i.e., x_i ∈ S_X, and 3) data patterns of normal behavior are represented by the subspace S_N ⊂ S_X, anomaly detection determines whether an instance x_i belongs to S_N or not.
The space S X can be partitioned or divided into classes with the help of decision functions, allowing further classification.
In Santiago-Paz et al. (2015) [4], one decision function is based on the Mahalanobis distance [39] d_M^2(x_i), and a second decision function is given by f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x) - b for the One-Class Support Vector Machine (OC-SVM), where k(x_i, x) is a kernel. Huang et al. (2006) [40] computed the Rényi entropy (q = 3) of Coiflet wavelet coefficients.
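A minimal sketch of a Mahalanobis-distance decision function of this kind is shown below; the mean, inverse covariance, and threshold are assumed to have been learned from the "normal" profile, and all names are ours:

```python
def mahalanobis_sq(x, mean, cov_inv):
    """d_M^2(x) = (x - mean)^T Sigma^{-1} (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    return sum(d[i] * sum(cov_inv[i][j] * d[j] for j in range(n))
               for i in range(n))

def is_anomalous(x, mean, cov_inv, threshold):
    """Flag x when it falls outside the ellipsoid d_M^2(x) <= threshold."""
    return mahalanobis_sq(x, mean, cov_inv) > threshold

# With an identity covariance, d_M^2 reduces to the squared Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
```

The threshold fixes the size of the elliptical "normal" region in the entropy feature space; instances outside it are reported as anomalies.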
Velarde-Alvarado et al. (2009) [2] used the proportional uncertainty (PU) and the method of remaining elements (MRE) to detect anomalies. Tellenbach et al. (2011) [22] used the Kalman filter, PCA, and KLE as anomaly detection methods. Ma et al. (2014) [41] established a decision function based on the entropy of the source IP address Ĥ_s and the entropy of the destination IP address Ĥ_d. In [42, 43], a decision function based on entropy and a range of values was used to detect anomalies.
The use of entropy allows the A-NIDS to achieve high detection rates, see table 2. In addition, new measures based on entropy should be studied and used as a basis for other decision functions.

Classification

The authors of [44] state that, given 1) a training data set of the form {(x_i, y_i)}, where x_i ∈ S_X is a feature vector or data pattern and y_i ⊆ {1, ..., G} is the subset of the G class labels that are known to be correct labels for x_i, and 2) a discriminant function f(x; β_g) with class-specific parameters β_g for each class g = 1, ..., G, then class discriminant functions are used to classify an instance x as the class label that solves arg max_g f(x; β_g). Lakhina et al. (2005) [16] apply two clustering algorithms, k-means and hierarchical agglomeration, using a vector h = [H(srcIP), H(dstIP), H(srcPort), H(dstPort)]. Xu et al. (2005) [23] define three "free" feature dimensions and introduce an "Entropy-based Significant Cluster Extraction Algorithm" for clustering. Lima et al. (2011) [13] use the WEKA Simple K-Means algorithm, which employs Euclidean distance to compute distances between instances and clusters. A Support Vector Machine is applied by Tellenbach et al. (2011) [22] to classify the anomalies. Yao et al. (2012) [45] use Random Forests.
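The entropy vector h used by Lakhina et al. can be sketched as follows; the representation of a packet as a (src_ip, dst_ip, src_port, dst_port) tuple is our assumption:

```python
import math
from collections import Counter

def entropy_of(values):
    """Shannon entropy of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def entropy_vector(window):
    """h = [H(srcIP), H(dstIP), H(srcPort), H(dstPort)] for one window of
    (src_ip, dst_ip, src_port, dst_port) tuples."""
    return [entropy_of([pkt[k] for pkt in window]) for k in range(4)]

window = [("10.0.0.1", "10.0.1.1", 1234, 80),
          ("10.0.0.1", "10.0.1.2", 1235, 80),
          ("10.0.0.1", "10.0.1.3", 1236, 80)]
h = entropy_vector(window)
```

In this toy window, a single source scanning many destinations yields zero source-IP entropy and maximal destination-IP entropy, which is exactly the kind of shift in h that the clustering algorithms exploit.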
Santiago-Paz et al. (2014) [19] present the Entropy and Mahalanobis Distance (EMD) based algorithm to define elliptical regions in the feature space. In [4], OC-SVM and k-temporal nearest neighbors are used to improve classification accuracy.

The Classifier Metrics
Given a classifier and an instance, there are four possible outcomes: TN (the number of correct predictions that an instance is negative), FP (the number of incorrect predictions that an instance is positive), FN (the number of incorrect predictions that an instance is negative), and TP (the number of correct predictions that an instance is positive). With these entries, the following statistics are computed [46]: Accuracy (AC) is the proportion of the total number of predictions that were correct, AC = (TN + TP) / (TN + FP + FN + TP); True Positive Rate (TPR) is the proportion of positive cases that were correctly identified, TPR = TP / (FN + TP); True Negative Rate (TNR) is the proportion of negative cases that were classified correctly, TNR = TN / (TN + FP); False Negative Rate (FNR) is the proportion of positive cases that were incorrectly classified as negative, FNR = FN / (FN + TP); and the F-measure is a measure of a test's accuracy, F-measure = 2 · TPR · AC / (TPR + AC). In addition, Receiver Operating Characteristic (ROC) graphs illustrate the performance of a classifier.
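These metrics translate directly into code; a small sketch (the F-measure follows the definition given in the text, combining TPR and AC):

```python
def classifier_metrics(tn, fp, fn, tp):
    """Compute AC, TPR, TNR, FNR and F-measure from confusion-matrix counts."""
    ac = (tn + tp) / (tn + fp + fn + tp)
    tpr = tp / (fn + tp)
    tnr = tn / (tn + fp)
    fnr = fn / (fn + tp)
    f = 2 * tpr * ac / (tpr + ac)  # F-measure as defined in the text
    return {"AC": ac, "TPR": tpr, "TNR": tnr, "FNR": fnr, "F": f}

m = classifier_metrics(tn=50, fp=10, fn=5, tp=35)
```

Note that TPR + FNR = 1 by construction, so only one of the two needs to be reported.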
Although there are several types of distances, classifiers based on new closeness and farness measures of data patterns and pattern clusters should be studied.

Conclusions
This paper provides a structured and comprehensive overview of the state of the art in entropy-based A-NIDS. Using a general architecture of entropy-based A-NIDS, the different techniques proposed in the state of the art for the main modules are shown. The measures of information used by researchers and the most commonly employed metrics for testing the performance of detection and classification are presented. The high detection rates achieved demonstrate the effective use of entropy.

Open Issues
As noted in section 2, there is currently no public database large enough to exhaustively test and compare different algorithms and to draw significant conclusions about their performance and classification capabilities. Therefore, the construction of a common database with real "normal" and anomalous traffic for the evaluation of A-NIDS is needed.
The value of the q parameter for generalized entropies is found experimentally; its correct choice for the best anomaly detection is an open research problem.
For different networks, the larger the slot size, the more the entropy behaviors differ. In the near future, this behavior should be studied with more, and more recent, traces in order to determine whether a model learned from one network can be used in a different network.
Another open issue is related to the adequate window size for reducing the data volume, ensuring good entropy estimates and early detection of anomalies.
The set of decision functions and classifiers should be enriched with new entropy-based closeness and farness measures.
Table 1 presents the notation used in this paper.

Table 1.
Notation used in this paper.

Symbol          Meaning
H               Shannon entropy
H_R(•, q)       Rényi entropy
H_T(•, q)       Tsallis entropy
H(X; Y)         Joint entropy
H(Y|X)          Conditional entropy
I(X; Y)         Mutual information
p_i             Probability of occurrence of the element x_i
p(x, y)         Joint probability
p(x|y)          Conditional probability
q               Parameter of the generalized entropies
D_ρ(P||Q)       Information divergence, where ρ is the order of the information divergence
D_1(P||Q)       Kullback-Leibler divergence
f(x; β_g)       Discriminant function, where β_g are class-specific parameters
S_X             p-dimensional space of all traffic features
S_N             p-dimensional space of anomaly-free traffic features
W_i(L, τ)       i-th sliding window with L packets and τ as the overlapping parameter
x, y, z         Instances of the random variables X, Y, Z

Table 2.
Results of network traffic anomaly detection using entropy.