[IEEE 2009 Next Generation Internet Networks (NGI) - Aviero, Portugal (2009.07.1-2009.07.3)] 2009 Next Generation Internet Networks - Discriminating Internet Applications based on


Discriminating Internet Applications based on Multiscale Analysis

Eduardo Rocha, Paulo Salvador and Antonio Nogueira
University of Aveiro/Instituto de Telecomunicacoes
Campus de Santiago, 3810-193 Aveiro, Portugal
E-mail: {eduardorocha, salvador, nogueira}@ua.pt

Abstract—In the last few years, several new IP applications and protocols have emerged as the capability of networks to provide new services has increased. The rapid increase in the number of users of Peer-to-Peer (P2P) network applications, due to the fact that users can easily use network resources over these overlay networks, has also led to a drastic increase in the overall Internet traffic volume. An accurate mapping of Internet traffic to applications can be important for a broad range of network management and measurement tasks, including traffic engineering, service differentiation, performance/failure monitoring and security. Traditional mapping approaches have become increasingly inaccurate because many applications use non-default or ephemeral port numbers, use well-known port numbers associated with other applications, change application signatures or use traffic encryption. This paper presents a novel framework for identifying IP applications based on the multiscale behavior of the generated traffic: by performing clustering analysis over the multiscale parameters that are inferred from the measured traffic, we are able to efficiently distinguish between different IP applications. Besides achieving accurate identification results, this approach also avoids some of the limitations of existing identification techniques, namely their inability to deal with stringent confidentiality requirements.

Index Terms—Application identification, multiscale analysis, wavelets, cluster analysis, multifractal behavior.

I. INTRODUCTION

The ability to efficiently identify Internet applications can have a great impact on several network management tasks, including traffic engineering, service differentiation, Quality of Service (QoS) mechanisms, performance/failure monitoring and security. For example, once a service provider is able to associate traffic with its corresponding application, it can group traffic with different or similar statistical characteristics in order to optimize the bandwidth occupancy of the links; it can also use that ability to detect security attacks, such as worms, zombies and botnets, among others, and consequently trigger the appropriate defense and/or repair actions.

The identification of IP applications has traditionally been based on different techniques, each one having its own advantages but also important drawbacks that limit or discourage their use in certain identification scenarios: (i) port-based analysis has obvious limitations, since most applications allow users to change the default port numbers by manually selecting whatever port(s) they like; many newer applications are more inclined to use random ports, thus making ports unpredictable, and there is also a trend for applications to masquerade their function ports within well-known application ports; (ii) protocol analysis is ineffective since IP applications are continuously evolving and therefore their signatures can change; in addition, application developers can encrypt traffic, making protocol analysis more difficult; signature-based identification can affect network stability because it has to read and process all network traffic and, finally, protocol analysis is not able to deal with confidentiality requirements; (iii) syntactic and semantic analysis of the data flows can be a burden to network stability due to its high processing requirements and is not appropriate when dealing with confidentiality requirements because, in these situations, it is not possible to have access to the packet contents. Section II will briefly describe the most important related work on this subject, pointing out the main advantages and disadvantages of the different proposed methodologies.

With the emergence of new applications and protocols, namely P2P services, existing methodologies became inaccurate and new approaches involving machine learning [1], artificial intelligence [2] or statistical clustering [3] were proposed. Besides, several statistical analyses of measured Internet WAN traffic have revealed that multifractal structures (such as random cascades) can help explain the scaling behavior typically associated with networking mechanisms operating on small time scales (e.g. TCP flow control). The multifractal nature of network traffic was first noticed by Riedi et al. [4] and, since then, various studies have addressed the characterization and modeling of multifractal traffic. In this paper, we present a new approach to identify Internet applications that exploits the multiscaling characteristics of the traffic by clustering the corresponding statistical parameters. Basically, the proposed methodology performs a multiscaling analysis over the sampled flows corresponding to the different applications, estimating their multiscaling coefficients based on a discrete wavelet transform, and then uses a clustering algorithm to group in the same cluster those multiscaling coefficients that share a similar behavior.

This approach is immune to confidentiality restrictions because all statistical analysis is based on the number of packets or bytes per time interval, without using any payload information. Based on their relevance in current Internet traffic, three applications were selected to assess the efficiency of the proposed methodology: web-browsing, streaming and BitTorrent. From the obtained results, we could conclude that the proposed methodology is able to achieve very accurate results, as will be shown in section VI.

The paper is organized as follows: section II presents some of the most important works on traffic identification techniques that have been published so far; sub-sections III-A and III-B present some background on multiscaling, wavelets and cluster analysis, important concepts that are intensively used in this paper; section IV provides an overview of the traffic traces that are used in this study; section V presents the main steps of the proposed classification methodology; section VI presents and discusses the main results obtained and, finally, section VII presents some conclusions and directions for future work.

II. RELATED WORK

The problem of identifying IP applications has been studied for some years. At an early stage, the most widely applied approach was port-based classification, which relies on the simple concept that many applications have default ports on which they operate. However, this technique proved to be inefficient since many applications, such as peer-to-peer protocols and voice or video transmission, use ephemeral ports. Moreover, some applications disguise themselves by using ports that are usually associated with different protocols in order to bypass proxies and firewalls. A study conducted by Madhukar et al. [5] showed that port-based analysis no longer provides accurate results when compared to other identification methods, since unknown traffic varied between 40% and 65% of the total traffic. This study also confirmed that unknown traffic was more evident at night, which might suggest that it was generated by P2P applications. In a related work, Sen et al. [6] state that the default port of the Kazaa protocol only accounted for 30% of the total traffic generated by this protocol.

Several approaches were proposed to overcome these obstacles. One of these approaches is payload analysis, which is based on the fact that many Internet protocols and applications use characteristic signatures in their packets that can actually distinguish them. This technique can lead to very accurate results: T. Karagiannis et al. [7] developed a methodology to identify P2P traffic based on the examination of the user's payload; the results obtained were very precise and showed that reports claiming that P2P traffic was decreasing were wrong. Haffner et al. [8] developed a system for the automatic detection of applications' signatures and the obtained error rate was lower than 1%. Sen et al. [6] also used application signatures to identify Internet traffic and their results were also very accurate. Although it is a very promising and accurate technique, payload analysis also has some important disadvantages: frequently, access to the user's payload can be very difficult due to privacy and legal issues; some protocols use traffic encryption, making payload analysis useless; the lack of reliable and available protocol specifications for non-standardized and evolving applications can also be a barrier that is hard to overcome; finally, different client implementations of the same protocol may not follow the officially available specifications.

Another methodology that has been used for identifying Internet traffic is the study of the statistical properties of the traffic flows. This analysis is based on the fact that different applications typically generate different traffic patterns and, consequently, their distinct underlying protocols can be identified. In an impressive work [9], Karagiannis et al. developed a technique to identify P2P traffic flows based on their connection patterns at the social, functional and application levels. The authors state that although P2P protocols may use random ports or payload encryption, the traffic patterns they generate do not change. The accuracy they achieved was very high and they were also able to identify unknown P2P protocols. In a similar but more recent work [10], the authors also built behavioral profiles of the studied Internet applications that depict their prominent patterns. However, in this study, the application profiles were built from flow statistics. The results obtained from this classification procedure demonstrated that a very high accuracy level and a low rate of False Positives can be achieved using this approach. In another study [11], the authors proposed an identification methodology that uses the unique behaviors of file-sharing protocols during their data transfer and connection establishment phases, and their results showed that this technique is comparable, in terms of accuracy, to payload analysis. Madhukar A. and Williamson C. [5] compare several methods for classifying P2P applications. One of the studied techniques is statistical analysis and, unlike the work proposed in [9], the data set used did not contain any UDP traffic, relying only on the TCP SYN, FIN and RST headers to provide connection-level information. The results obtained show an increase in the volume of P2P traffic and indicate that this method can provide useful information. All these identification procedures suffer from the fact that traffic with the same statistical behavior can be classified as belonging to the same application, which may not be true. Moreover, traffic with unknown behavior is not classified.

The statistical properties of the traffic flows can also be used to discriminate applications by using clustering techniques. Clustering techniques are among the most widely used methodologies for identifying classes among a group of objects: they are used in different fields such as biology, finance and data mining. Erman J. et al. [3] used two unsupervised algorithms, K-Means and DBSCAN, to perform traffic classification. Their results indicated that these algorithms are useful tools for grouping traffic with similar characteristics. Despite this ability, these algorithms have to rely on other identification techniques to label the clusters.

Machine Learning Classifiers are also based on the statistical analysis of Internet traffic. A study conducted by McGregor et al. [1] successfully used machine learning techniques to create clusters for Internet traffic classification, and the classification results obtained were very accurate. Zander et al. [12] used an unsupervised machine learning technique where flows were automatically classified based on their statistical characteristics. These cluster-based classification methods are able to group Internet traffic using only transport layer information.

III. BACKGROUND

A. Multiscale analysis and wavelets

The notion of scaling can be briefly defined as the property of scale invariance [13], that is, the whole and its parts cannot be differentiated from each other [14]. Several works have shown the existence of scale invariance in network traffic [15]. However, this scaling behavior only holds for a finite range of time scales. Consequently, new mathematical models are necessary to cope with these particular behaviors and to provide a better approximation to the current data complexity. A class of processes that has been intensively used in data analysis is that of fractal processes, which are related to the signal decomposition in the limit of small time scales. This scaling behavior can be described by a local scaling exponent, known as the Hölder exponent, which is related to the degree of regularity of the data. A process $Y(t)$ is said to present Hölder regularity $h \geq 0$ at $t_0$ if it is possible to find a polynomial $P_{t_0}(t)$ of order $n = \lfloor h \rfloor$ and a constant $K \geq 0$ such that:

$$|Y(t) - P_{t_0}(t)| \leq K\,|t - t_0|^{h} \tag{1}$$

If the parameter $h$ is constant over time, then the process can be considered monofractal. However, if the Hölder exponent varies with time, then the process is multifractal. It is also known that processes which exhibit scaling in a second-order statistic are very likely to present scaling for all the remaining moments [16]. As an example, if the variance of the wavelet coefficients at octave $j$, $S_2(j)$, behaves like $S_2(j) \sim C\,j^{\alpha}$, then the scaling for the remaining moments will mostly behave like $S_q(j) = E[|d(j)|^q] \sim C(q)\,j^{\alpha(q)}$, where the scaling parameter $\alpha(q)$ varies between 0 and 1. The functions $S_q(j)$ are based on the increments of the process and are therefore known as partition functions. For self-similar processes, the function $\alpha(q)$ takes the form $\alpha(q) = Hq + q/2$, where the Hurst parameter, $H$, controls all the exponents. In this particular case, the processes are known as monofractals. When the function $\alpha(q)$ is not linear, the corresponding processes present multiscaling behavior. Multifractals fall into this category.

A wavelet $\psi(t)$ can be defined as a band-pass function oscillating at a central frequency $f_0$. By performing a scaling change, which may consist of an expansion or a compression, and a temporal shift, we obtain $\psi_{j,k}(t) = 2^{-j/2}\,\psi(2^{-j}t - k)$; that is, the central oscillating frequency moves to $2^{-j}f_0$ and the origin of the temporal reference moves to $2^{j}k$. Note that $j$ represents the temporal scale and $k$ represents the $k$-th coefficient corresponding to scale $j$, with $j_0$ being the largest time scale.

Wavelet decomposition also uses a low-pass function, $\phi(t)$, known as the scaling function, that can be scaled and temporally shifted in a similar way to the function $\psi(t)$. Therefore, a signal $X(t)$ can be built as a sum of the scaling and wavelet functions:

$$X(t) = \sum_{k} c_X(j_0, k)\,\phi_{j_0,k}(t) + \sum_{j=j_0}^{\infty} \sum_{k} d_X(j, k)\,\psi_{j,k}(t) \tag{2}$$

where $c_X(j_0, k)$ are the scaling coefficients and $d_X(j, k)$ are the wavelet coefficients. This equation represents the Discrete Wavelet Transform (DWT) of the signal $X(t)$.
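As an illustration of how the wavelet coefficients $d_X(j, k)$ can be obtained in practice, the following minimal sketch applies the pyramid algorithm with the orthonormal Haar wavelet. The choice of the Haar wavelet, the restriction to signals whose length is a power of two and the function name `haar_dwt` are assumptions made only for this example; the paper does not state which wavelet is used by the analysis tool.

```python
import math

def haar_dwt(signal):
    """Return (details, approx), where details[j - 1] holds d_X(j, k) for octave j.

    Assumes len(signal) is a power of two and uses the orthonormal Haar
    wavelet, chosen here only for simplicity.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        next_approx, detail = [], []
        for a, b in zip(approx[0::2], approx[1::2]):
            next_approx.append((a + b) / math.sqrt(2))  # low-pass (scaling) branch
            detail.append((a - b) / math.sqrt(2))       # band-pass (wavelet) branch
        details.append(detail)
        approx = next_approx
    return details, approx

# Four samples give two octaves of detail coefficients and one scaling coefficient.
details, approx = haar_dwt([4.0, 2.0, 5.0, 5.0])
```

Because the transform is orthonormal, the sum of the squared coefficients equals the energy of the input samples, which is what allows the coefficients at each octave to be read as the energy of the signal at that time scale.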

By defining the following estimators for the moments of order $q$:

$$\mu_j^{(q)} = \frac{1}{n_j} \sum_{k=1}^{n_j} |d_X(j, k)|^{q}, \quad q \in \mathbb{R} \tag{3}$$

where $n_j$ is the number of coefficients to be analyzed at octave $j$, it is possible to build energy diagrams of order $q$, in logarithmic scale, in order to analyze the multifractal scaling similarities. For small values of $j$, and assuming that the analyzed data presents a multifractal behavior, $\mu_j^{(q)}$ can be given by

$$\mu_j^{(q)} \approx 2^{\,j(\zeta(q) + q/2)} \tag{4}$$

The exponent $\alpha(q) = \zeta(q) + q/2$ can be estimated from the logarithmic-scale energy diagrams, for several values of $q$, making it possible to study the behavior of the function $\zeta(q)$. The logscale diagram (LD) is a log-log plot of the variance estimates of the wavelet details at each scale, against scale, complemented with confidence intervals about the estimates at each scale. It can be seen as a spectral estimator where large scale corresponds to low frequency. The same relation is also frequently written as $\zeta_q = \alpha_q - q/2$, which leads to the useful relation $\zeta_q = Hq$ for self-similar processes. Therefore, $\zeta_q$ is the slope read in the Zeta Diagram (ZD) or Multiscalar Diagram (MD). These multiscaling estimators will be used in this work to identify the traffic multiscaling characteristics.
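The estimation procedure described above can be sketched as follows: the estimators of equation (3) are computed for each octave and the exponent $\alpha(q)$ is read as the slope of $\log_2 \mu_j^{(q)}$ against $j$. This is only a rough illustration under stated assumptions: it uses an unweighted least-squares fit over all octaves (the logscale diagram normally relies on confidence intervals and a selected range of scales), it requires all estimators to be strictly positive, and the function names and synthetic coefficients are hypothetical.

```python
import math

def moment_estimators(details, q):
    """mu_j^(q): mean of |d_X(j, k)|^q over the n_j coefficients of each octave."""
    return [sum(abs(d) ** q for d in octave) / len(octave) for octave in details]

def scaling_exponent(mu):
    """Unweighted least-squares slope of log2(mu_j) against j = 1..J."""
    js = list(range(1, len(mu) + 1))
    ys = [math.log2(m) for m in mu]
    j_mean = sum(js) / len(js)
    y_mean = sum(ys) / len(ys)
    num = sum((j - j_mean) * (y - y_mean) for j, y in zip(js, ys))
    den = sum((j - j_mean) ** 2 for j in js)
    return num / den

# Synthetic coefficients whose magnitude doubles at every octave: |d_X(j, k)| = 2^j.
details = [[2.0 ** j] * 4 for j in range(1, 6)]
alpha_2 = scaling_exponent(moment_estimators(details, q=2))  # estimate of alpha(2)
zeta_2 = alpha_2 - 2 / 2                                      # zeta(q) = alpha(q) - q/2
```

For these synthetic coefficients $\mu_j^{(2)} = 4^j$, so the fitted slope is 2 and $\zeta(2) = 1$; repeating the fit for several values of $q$ gives the points of the function $\zeta(q)$.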

Wavelet analysis has been widely applied in intrusion detection due to its time-frequency property, which allows a signal to be decomposed into several components, each one at a different frequency. In a work presented by Kline et al. [17], wavelets were used to model network signals and detect traffic anomalies: the authors used fifteen traffic features from the well-known 1999 DARPA intrusion detection dataset as input signals, and the results obtained show that the proposed methodology was able to achieve high detection rates. A work conducted by V. Alarcon et al. [18] used the undecimated discrete wavelet transform and Bayesian analysis: the proposed algorithm detected sudden changes in the variance and frequency of the time series. A. Ramanathan [19] proposed a tool based on wavelets, named WADeS, to detect distributed denial of service (DDoS) attacks. This tool applied DWTs to the network signals and computed the variance of the corresponding coefficients in order to detect security attacks.

Multifractal theory and analysis can be found in several contexts, such as the study of turbulence and stock markets. These concepts were introduced in the networking area in [4], through an analysis of TCP traffic traces that were captured at the gateway of a LAN. The information collected allowed the differentiation of incoming and outgoing traffic and the detection of multifractal characteristics in the packet size values. These authors argue that multifractal behavior is related to the high-frequency components of the network signals, which are due to the rapid variations in the signal contents. In [20], evidence of scaling behavior at large and small time scales was also found: the authors observed the presence of multifractal behavior at small time scales, while long range dependence (LRD) was detected at large time scales. The observed behaviors were related to the effect that the transport-layer protocols and controls may have on traffic flows. In another work, A. Feldmann et al. [21] introduced the multiplicative cascade traffic model in order to accommodate the two previously mentioned scaling regimes that were identified in the analyzed traffic data. Once again, this behavior was related to the multifractal characteristics of the WAN traffic observed on small time scales. A similar work was conducted by D. Veitch et al. in [22], where similar results were achieved and the same separation between the two scaling regimes was identified.

B. Cluster analysis

Clustering aims to partition a set of objects into groups, or clusters, in such a way that objects in the same group are similar, whereas objects in different clusters are distinct. The creation of clusters is based on the concept of proximity between objects and groups of objects [23]. There are two common approaches to clustering observations: hierarchical and non-hierarchical methods, among which partitioning methods are the most common.

Hierarchical clustering techniques proceed either by a successive series of merges (agglomerative hierarchical methods) or by successive divisions (divisive hierarchical methods). The agglomerative methodologies start with as many clusters as objects and end with only one cluster, containing all objects. These are based on a measure of proximity between two objects and a criterion, relying on the distance between clusters, to decide which are the two closest clusters to be merged in each step of the agglomerative hierarchical procedure. Different approaches to measuring the distance between clusters give rise to different hierarchical methods. A widely used method is Ward's method, also known as the incremental sum of squares method, which uses the (squared) within-cluster and between-cluster distances to decide which clusters should be merged. The divisive methods work in the opposite direction.

Partitioning non-hierarchical clustering consists of dividing the data set into a pre-determined number of non-overlapping clusters so that each data object belongs to exactly one cluster. One example of a partitioning clustering methodology is the K-Means algorithm, which is also one of the simplest [24]. This algorithm builds spherical clusters and attempts to find a user-chosen number of clusters in the data set in such a way that they are disjoint and each is represented by its centroid. Within each cluster, this algorithm maximizes the homogeneity through a minimization of the mean squared error. The algorithm starts by randomly choosing the centroids of the K clusters. Subsequently, objects are assigned to the closest cluster; then, iteratively, the centroid of each cluster is re-computed and the objects are re-partitioned according to the new centers. This process continues until the membership of every cluster stabilizes. The K-Means algorithm will be used in our classification methodology.
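The K-Means steps described above (random choice of centroids, assignment to the closest centroid, re-computation of the centroids until the memberships stabilize) can be sketched in a few lines. This is a minimal illustration using squared Euclidean distance, not the implementation used in this work; the function names, the optional `initial_centroids` argument (added only to make runs reproducible) and the sample points are assumptions of the example.

```python
import random

def squared_distance(p, c):
    return sum((x - y) ** 2 for x, y in zip(p, c))

def kmeans(points, k, initial_centroids=None, max_iterations=100, seed=0):
    """Plain K-Means on a list of equal-length tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Random initial centroids unless the caller supplies them explicitly.
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iterations):
        # Assignment step: every object goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # memberships stabilized
            break
        labels = new_labels
        # Update step: every centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.9, 0.0), (10.1, 0.2)]
labels, centroids = kmeans(points, k=3)
```

Since the algorithm only converges to a local optimum, the partition obtained depends on the initial centroids; running it from several random initializations and keeping the partition with the lowest within-cluster squared error is a common precaution.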

Fig. 1. Number of bytes per sampling interval for a Streaming flow.

Fig. 2. Number of bytes per sampling interval for a Torrent flow.

Fig. 3. Number of bytes per sampling interval for an HTTP flow. (Plot axes: sampling interval, 0 to 18000, on the horizontal axis; number of bytes, 0 to 12000, on the vertical axis.)

IV. OVERVIEW OF THE TRAFFIC TRACES

The traffic traces used in this work were passively collected on the University of Aveiro network. The traces were measured on September 15, 2008, and are composed of all TCP and UDP flows of both upload and download traffic. Using TCPDump, publicly available at [25], the full packet header and the first 68 bytes of payload data were captured for each packet, together with their arrival instants.

Figures 1, 2 and 3 present the variation, before normalization, of the total (upload and download) number of bytes per sampling interval corresponding to a flow from the Streaming, BitTorrent and HTTP applications, respectively.

From these plots, we can see that there is a clear distinction between the three selected applications: HTTP traffic is characterized by non-periodic peaks of very short duration; the streaming traffic profile is characterized by a small variability but includes some periodic peaks of very short duration, with quite significant absolute values; BitTorrent traffic presents a medium bandwidth consumption and a noticeable variability around the average bandwidth. These significant differences between the traffic profiles associated with the different applications suggest that it should be possible to efficiently identify them using appropriate traffic descriptors. In our approach, we base the identification on the multiscaling parameters of the different traffic traces.

The dynamic range of these signals must be regulated through normalization [26] in order to eliminate (or at least minimize) the throughput effects of the network connection that was used to perform the measurements. Thus, all captured traces were normalized to their maximum values.

V. CLASSIFICATION METHODOLOGY

The proposed classification methodology consists of an off-line procedure that is periodically executed. There are two types of flows to process: known flows, that is, flows belonging to traces generated in a controlled lab environment, and unknown flows that are experimentally captured. These flows are mixed and processed together in order to create and label the clusters corresponding to their generating applications: since we are using unsupervised clustering, we believe this methodology will provide a good and accurate cluster arrangement. All flows are sampled using Wireshark [27] in order to extract the number of bytes and packets per sampling interval, both in the upload and download directions. In this study, each flow is analyzed over a half-hour interval, based on the number of bytes per sampling interval. The sampling interval used in this work was 100 ms.
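To make the sampling step concrete, the sketch below aggregates the packets of one flow into byte counts per 100 ms interval and then normalizes the resulting series to its maximum value. It is only a stand-in for the Wireshark-based sampling used in this work: the input format (integer timestamps in milliseconds paired with packet sizes in bytes), the function names and the packet values are assumptions made for the example.

```python
def bytes_per_interval(packets, interval_ms=100):
    """Sum packet sizes per sampling interval.

    packets: list of (timestamp_ms, size_bytes) pairs with integer timestamps.
    Returns one byte count per interval, starting at the first packet.
    """
    if not packets:
        return []
    start = min(t for t, _ in packets)
    end = max(t for t, _ in packets)
    counts = [0] * ((end - start) // interval_ms + 1)
    for t, size in packets:
        counts[(t - start) // interval_ms] += size
    return counts

def normalize_to_max(series):
    """Scale a series so that its largest value becomes 1."""
    peak = max(series)
    return [v / peak for v in series] if peak > 0 else list(series)

# Hypothetical packets of a single flow: (arrival time in ms, size in bytes).
packets = [(0, 100), (50, 200), (120, 300), (310, 400)]
counts = bytes_per_interval(packets)    # [300, 300, 0, 400]
normalized = normalize_to_max(counts)   # [0.75, 0.75, 0.0, 1.0]
```

With 100 ms intervals, a half-hour observation of a flow yields 18000 samples, which is the length of the series that is passed to the multiscaling analysis.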

Then, a multiscaling analysis is performed on the sampled flows, using the tool available at [16]. This tool estimates the multiscaling coefficients, based on a wavelet decomposition, enabling the analysis of complex processes, such as multifractals, where the single Hurst parameter is not sufficient to fully describe the data complexity. This methodology examines the dependence of the $q$-th order moment of the wavelet coefficients (in other words, the energy of the data signal) on the time scale. The moment of order $q$ can be defined as $E[X^q] = \int x^q f_X(x)\,dx$, where the range of $q$ is usually semi-infinite, going from some initial value up to infinity. However, the lower limit of $q$ is usually taken as zero, since negative moments often do not exist or can be problematic.

In order to minimize the effects of the different values of available bandwidth, the multiscaling coefficients obtained from the previous analysis are normalized to zero mean, for each moment. That is, the normalized estimators $\tilde{\mu}_j^{(q)}$ are computed as follows:

$$\tilde{\mu}_j^{(q)} = \mu_j^{(q)} - \sum_{k=1}^{n_j} \frac{\mu_j^{(q)}}{n_j} \tag{5}$$

This allows us to observe the energy variation at each time scale, for the different traces, irrespective of the available bandwidth. We performed this normalization because we want to differentiate applications based on the variations of their multiscaling coefficients, that is, variations of the energy at each scale, and not based on their absolute values.
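The effect of the zero-mean normalization can be illustrated with a short sketch. It assumes one plausible reading of the procedure, namely that, for a given moment $q$, the mean of the coefficients over all time scales is subtracted from the coefficient of each scale; the function name and the numeric values are hypothetical.

```python
def zero_mean(coefficients):
    """Subtract the mean of the sequence from each of its entries."""
    mean = sum(coefficients) / len(coefficients)
    return [c - mean for c in coefficients]

# Hypothetical coefficients of one flow for a given moment q, octaves j = 1..4.
mu_q = [1.0, 2.0, 4.0, 9.0]
normalized = zero_mean(mu_q)                    # [-3.0, -2.0, 0.0, 5.0]

# A second flow with the same variation across scales but a constant offset
# (for instance, a different overall level in the log domain).
shifted = zero_mean([m + 10.0 for m in mu_q])   # same values as `normalized`
```

Both flows end up with identical normalized coefficients, so the clustering step sees only how the energy varies from scale to scale and not its absolute level.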

Fig. 4. Flow diagram of the off-line and on-line classification methodology. (Off-line/periodical branch: known flows and unknown flows, flow sampling, extraction of the multifractal coefficients, clustering of the coefficients, analysis of the results, classification. On-line branch: on-line measurement, new flow? (yes/no), sampling of the first ∆ seconds, extraction of the multifractal coefficients, analysis of the results, classification.)

The multiscaling coefficients are then passed to a clustering procedure that groups in the same cluster the coefficients that exhibit a similar behavior. In this step, we used the K-Means algorithm since it is one of the simplest and most efficient clustering techniques, besides allowing the choice of the number of clusters and always converging to a local optimum. The number of clusters is set equal to the number of studied Internet applications. At the end of this process, the three clusters obtained contain the known traffic flows, together with the unknown ones, allowing us to classify any flow that is subsequently fed to the classification tool.
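One simple way to turn the resulting clusters into application labels is to let the known flows of each cluster vote, as sketched below. Majority voting is an assumption made for this illustration (the text only states that known and unknown flows are clustered together so that the clusters can be labelled), and the function name and the sample data are hypothetical.

```python
from collections import Counter

def label_clusters(cluster_ids, known_labels):
    """Map each cluster id to the most common application among its known flows.

    known_labels[i] is an application name for a known flow, or None for an
    unknown flow. Majority voting is an assumption made for this sketch.
    """
    votes = {}
    for cid, app in zip(cluster_ids, known_labels):
        if app is not None:
            votes.setdefault(cid, Counter())[app] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}

# Cluster assigned to each flow by the clustering step, and the application of
# the known (lab-generated) flows; None marks an unknown flow.
cluster_ids = [0, 0, 1, 1, 2, 2, 0, 1]
known_labels = ["HTTP", None, "Streaming", "Streaming", "BitTorrent", None, "HTTP", None]

names = label_clusters(cluster_ids, known_labels)
classified = [names.get(cid, "unlabelled") for cid in cluster_ids]
```

Every unknown flow then inherits the application name of the cluster it falls into; a cluster that contains no known flow stays unlabelled.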

Our methodology can be used in an on-line classification scenario, where new incoming traffic flows can be analyzed based only on their profiles corresponding to pre-defined time windows with a width of ∆ seconds. After sampling, the multifractal coefficients corresponding to the different flows are extracted and the flows are clustered using the same algorithm that was previously explained. Figure 4 presents the flow diagram that illustrates our off-line classification methodology, as well as its adaptation and/or integration into the on-line classification framework.

In order to evaluate the accuracy of the proposed methodology, we have inspected the packet payloads of all flows using a payload-based identification tool that was previously developed at our research group [28]. In this way, we can verify whether the different flows are correctly assigned to the cluster that was suggested/defined for each of the selected protocols.

VI. RESULTS

This section presents the obtained results, enabling us to evaluate the performance and the accuracy of the proposed methodology. The traffic flows presented in section IV were classified using the classification procedure described in section V: three clusters were defined, one per application, and the different traffic flows were assigned to them.


Fig. 5. First order coefficients for the different traffic flows.

Fig. 6. Second order coefficients for the different traffic flows.

Fig. 7. Third order coefficients for the different traffic flows.

Figures 5, 6 and 7 show the log-scale diagrams of the normalized (to zero mean) multiscaling coefficients, for the first three moments, corresponding to all analyzed flows from the three selected applications. As already mentioned, these coefficients account for the energy variation of the $q$-th order moment along the different temporal scales. Higher time scales are physically related to long-term actions, mainly at the session level, caused by the operating behavior of each application, while lower time scales are related to short-term interactions, such as user clicks on web page links.

As can be observed, flows generated by the same protocol present similar energy variations at all time scales, while flows from different protocols exhibit quite different behaviors. However, the distinction between the different application behaviors is more noticeable for the first moment coefficients, so this moment was chosen for the subsequent analysis. In this case, it seems that differentiating protocols based on their multifractal behavior should be a highly reliable approach. To test this claim, we used our classification methodology to classify the different traffic flows. The exact classification, which is used as the term of comparison, relies on the results obtained with the above-mentioned payload-based classification tool that was developed at our research group and presented in [28]. Table I presents the percentage of correctly classified flows for the first five moments.

TABLE I
PERCENTAGE OF FLOWS CORRECTLY CLASSIFIED

Moment   HTTP     Streaming   Torrent
1        100%     93.3%       93.3%
2        40%      53.3%       73.3%
3        40%      60%         73.3%
4        40%      60%         73.3%
5        53.3%    60%         60%

TABLE II
PERCENTAGE OF FALSE POSITIVES

Moment   HTTP     Streaming   Torrent
1        6.7%     0%          0%
2        16.7%    30%         13.3%
3        23.3%    26.7%       13.3%
4        53.3%    46.7%       6.7%
5        26.7%    26.7%       3.75%

As we can see from these results, the classification accuracy is very high when using the first moment and begins to degrade for higher order moments. So, efficiently differentiating between these three applications only requires the use of the first order coefficients.

We also examine the accuracy of our results by calculating the number of False Positives (FP) and False Negatives (FN). These are both undesirable inaccuracies which decrease the efficiency of any classification methodology. False Positives can be defined as the number of elements that are incorrectly classified as belonging to a certain class. On the other hand, False Negatives refer to the number of elements of a certain class that are erroneously classified as not belonging to that group. In order to compute the rate of False Positives, let us define $N$ as the total number of processed flows and $P$ as the number of non-application flows classified as application flows. The rate of False Positives can then be computed as $FP = P/N$ and is shown in Table II for our traffic flows.
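The two metrics can be computed directly from the reference (payload-based) labels and the labels produced by the clustering, as in the sketch below. It follows the definitions given in the text, with the fraction of correctly classified flows taken per application and the False Positive rate taken as $P/N$; the function name and the six sample flows are hypothetical and are not taken from the data set of this work.

```python
def classification_rates(true_labels, predicted_labels, application):
    """Return (fraction correctly classified, false-positive rate P/N) for one application."""
    n_total = len(true_labels)
    n_app = sum(1 for t in true_labels if t == application)
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == application and p == application)
    # P: flows of other applications that were classified as this application.
    false_pos = sum(1 for t, p in pairs if t != application and p == application)
    return correct / n_app, false_pos / n_total

# Hypothetical reference labels and cluster-based predictions for six flows.
true_labels = ["HTTP", "HTTP", "Streaming", "Streaming", "Torrent", "Torrent"]
predicted = ["HTTP", "HTTP", "Streaming", "HTTP", "Torrent", "Torrent"]

accuracy, fp_rate = classification_rates(true_labels, predicted, "HTTP")
```

In this example both HTTP flows are recognized and one Streaming flow is wrongly assigned to the HTTP cluster, so the HTTP accuracy is 1.0 and its False Positive rate is 1/6; the same misclassified flow counts as a False Negative for Streaming.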

From these results, we can see that the best classification results are achieved for the first moment, with a low number of False Positives. For the first moment, flows from the same application exhibit similar variations and are different enough from the flows of the other applications, allowing their efficient distinction. For the higher order moments, the accuracy decreases very significantly, to values around 40%. The lowest values are obtained for the HTTP browsing and Streaming applications, which means that their discrimination is harder to perform using these statistics (their energies begin to intermix over several time scales). Note that these moments, which are non-dimensional, correspond to the skewness (third moment) and kurtosis (fourth moment) of a distribution and, at these levels, it becomes difficult to differentiate application flows because, although they are statistically different, the shapes of their distributions tend to overlap.
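For reference, the standard textbook definitions of these two non-dimensional quantities are given below. The symbols X (a random variable), \mu (its mean) and \sigma (its standard deviation) are notation introduced here for illustration and do not appear in the paper.

```latex
\mathrm{Skew}[X] = \frac{\mathrm{E}\left[(X-\mu)^{3}\right]}{\sigma^{3}},
\qquad
\mathrm{Kurt}[X] = \frac{\mathrm{E}\left[(X-\mu)^{4}\right]}{\sigma^{4}}
```

Both ratios divide a central moment by the matching power of \sigma, which is why they carry no units.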

VII. CONCLUSIONS AND FUTURE WORK

Since the number and variety of Internet applications have increased in the last few years, it is crucial for Service Providers to map traffic to the corresponding applications. Important tasks such as security, traffic engineering and performance/failure monitoring strongly depend on the ability to correctly differentiate between applications. In this paper we presented a novel approach, based on clustering of the multiscaling coefficients, for the identification of Internet applications. By applying the developed framework to three types of Internet applications, HTTP, Streaming and BitTorrent, very accurate classification results were obtained. We also proposed an on-line identification framework, based on the proposed identification methodology, that can be used in real-time identification scenarios.

This work can be extended in several ways: more Internet protocols, such as VoIP and Remote Desktop, can be included in the classification analysis; different types of normalization can be explored, comparing their efficiencies; different sampling intervals can be tested and, most importantly, a new algorithm can be proposed in order to automatically choose the best time scales and the best moments (based on appropriate efficiency metrics).

REFERENCES

[1] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow clustering using machine learning techniques,” Proceedings of the Passive and Active Measurement Workshop (PAM2004), April 2004.

[2] A. Nogueira, M. de Oliveira, P. Salvador, R. Valadas, and A. Pacheco, “Classification of Internet users using discriminant analysis and neural networks,” Next Generation Internet Networks, 2005, pp. 341–348, April 2005.

[3] J. Erman, M. Arlitt, and A. Mahanti, “Traffic classification using clustering algorithms,” in MineNet ’06: Proceedings of the 2006 SIGCOMM workshop on Mining network data. New York, NY, USA: ACM, 2006, pp. 281–286.

[4] R. H. Riedi and J. Vehel, “Multifractal properties of TCP traffic: A numerical study,” Tech. Rep., Feb 1997.

[5] A. Madhukar and C. Williamson, “A longitudinal study of P2P traffic classification,” Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006. MASCOTS 2006. 14th IEEE International Symposium on, pp. 179–188, Sept. 2006.

[6] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of P2P traffic using application signatures,” in WWW ’04: Proceedings of the 13th international conference on World Wide Web. New York, NY, USA: ACM, 2004, pp. 512–521.

[7] T. Karagiannis, A. Broido, N. Brownlee, K. Claffy, and M. Faloutsos, “Is P2P dying or just hiding? [P2P traffic measurement],” Global Telecommunications Conference, 2004. GLOBECOM ’04. IEEE, vol. 3, pp. 1532–1538, Nov.-3 Dec. 2004.

[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated construction of application signatures,” in MineNet ’05: Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data. Philadelphia, Pennsylvania, USA: ACM Press, August 2005, pp. 197–202.

[9] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: multilevel traffic classification in the dark,” in SIGCOMM ’05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications. New York, NY, USA: ACM, 2005, pp. 229–240.

[10] Y. Hu, D.-M. Chiu, and J. Lui, “Application identification based on network behavioral profiles,” Quality of Service, 2008. IWQoS 2008. 16th International Workshop on, pp. 219–228, June 2008.

[11] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport layer identification of P2P traffic,” in IMC ’04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement. New York, NY, USA: ACM, 2004, pp. 121–134.

[12] S. Zander, T. Nguyen, and G. Armitage, “Automated traffic classification and application identification using machine learning,” The IEEE Conference on Local Computer Networks, 2005. 30th Anniversary., pp. 250–257, Nov. 2005.

[13] M. S. T. P. Abry, P. Flandrin, “Wavelets for the analysis, estimation and synthesis of scaling data,” 2000. [Online]. Available: http://citeseer.ist.psu.edu/395082.html

[14] B. Enescu, K. Ito, and Z. R. Struzik, “Wavelet-based multifractal analysis of real and simulated time-series of earthquakes,” in Annuals of Disaster Prevention Research Institute Annuals, vol. No. 47 B. Kyoto University, 2004.

[15] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, “On the self-similar nature of Ethernet traffic (extended version),” IEEE/ACM Transactions on Networking, vol. 2, no. 1, pp. 1–15, 1994.

[16] (2008, December) Darryl Veitch. [Online]. Available: http://www.cubinlab.ee.unimelb.edu.au/ darryl/

[17] P. B. J. Kline, D. Plonka, and R. Amos, “A signal analysis of network traffic anomalies,” in IMW ’02: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurment. New York, NY, USA: ACM, 2002, pp. 71–82.

[18] V. Alarcon-Aquino and J. Barria, “Anomaly detection in communication networks using wavelets,” Communications, IEE Proceedings-, vol. 148, no. 6, pp. 355–362, Dec 2001.

[19] A. Ramanathan, “Wades: A tool for distributed denial of service attack detection,” Master’s thesis, TAMU-ECE-2002-02, 2002.

[20] A. Feldmann, A. Gilbert, P. Huang, and W. Willinger, “Dynamics of IP traffic: A study of the role of variability and the impact of control,” in SIGCOMM, 1999, pp. 301–313. [Online]. Available: citeseer.nj.nec.com/feldmann99dynamics.html

[21] A. Feldmann, A. C. Gilbert, and W. Willinger, “Data networks as cascades: investigating the multifractal nature of Internet WAN traffic,” in SIGCOMM ’98: Proceedings of the ACM SIGCOMM ’98 conference on Applications, technologies, architectures, and protocols for computer communication. New York, NY, USA: ACM, 1998, pp. 42–55.

[22] D. Veitch, P. Abry, P. Flandrin, and P. Chainais, “Infinitely divisible cascade analysis of network traffic data,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, June 2000.

[23] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.

[24] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. L. Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281–297.

[25] (2009, March) Tcpdump/libpcap public repository. [Online]. Available: http://www.tcpdump.org

[26] D. Cochran, “A consequence of signal normalization in spectrum analysis,” Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on, vol. 4, pp. 2388–2391, Apr 1988.

[27] (2009, March) Wireshark: Go deep. [Online]. Available: http://www.wireshark.org/

[28] E. Rocha, H. Veiga, R. Valadas, P. Salvador, and A. Nogueira, “Module for identifying Internet applications and its integration in a peer-to-peer measurement tool,” in Proceedings of the IADIS International Conference, IADIS, Ed., 2007.