978-1-4799-1270-4/13/$31.00 ©2013 IEEE

[IEEE 2013 21st IEEE International Conference on Network Protocols (ICNP) - Goettingen, Germany (2013.10.7-2013.10.10)] 2013 21st IEEE International Conference on Network Protocols


Formal Methods to Improve the Identification and Validation of Network Traffic

Michael Finsterbusch, HTWK Leipzig, Germany

Jean-Alexander Müller, HfT Leipzig, Germany

Abstract: Internet traffic identification and validation has been the subject of intensive study for many years. It is used to provide Quality of Service, to provide security and to implement many other tasks. The reliability of these methods, however, has not been proved by formal verification. Therefore, the results of these methods can differ between the lab where they were developed and their deployment in real-world applications. In this paper, we present an idea to overcome this problem in order to guarantee reliability and provide more optimised solutions.

Keywords-traffic identification, protocol verification.

I. INTRODUCTION

The amount of internet traffic is growing rapidly. To reduce investment in new network infrastructure, internet service providers commonly use Quality of Service (QoS) techniques to provide a good experience to their customers. To realize QoS, the kind of traffic (real-time traffic, background traffic, etc.) needs to be assessed so that it can be handled adequately. There are two general methods to determine the kind of traffic. The first method signals the kind of service or the QoS class. Examples include IntServ and DiffServ, as well as Next Generation Networks (NGNs) such as IP Multimedia Subsystem (IMS) or Evolved Packet Core (EPC), which use SIP signalling. IntServ and DiffServ are no longer commonly in use, and the SIP signalling in NGNs only works for Voice over IP and IP-TV, whereas the majority of traffic cannot be marked and is handled as best effort. The second method consists of identifying the protocol or application of single network packets or flows in order to handle them appropriately. Techniques to identify network traffic can be divided into four classes: (1) port-based, (2) Pattern Matching, (3) Heuristics and (4) Protocol Decoding. The port-based approach [1] is the oldest and simplest, using only the well-known port numbers of the transport protocol to identify the flow. The Pattern Matching approach [2] uses fixed patterns or regular expressions to find characteristic signatures in network traffic. The heuristic methods [3] use statistical features to identify different kinds of traffic. The fourth technique, Protocol Decoding (PD), inspects the network packet payload and tries to decode it. If it is able to decode the packet and the constraints of the decoded protocol are fulfilled, then the packet or flow is identified. These methods of traffic identification are applied to provide QoS, for access control, to monitor the network for network planning, etc., and could also be used to fill the gap for non-signalled traffic in NGNs.
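To make the PD idea concrete, the following Python sketch tries to decode a payload as a DNS message header and accepts it only if a few constraints hold. The field layout follows the fixed 12-byte DNS header; the function name `looks_like_dns` and the particular constraints are illustrative assumptions and are much weaker than what a real protocol decoder would check.

```python
import struct

def looks_like_dns(payload: bytes) -> bool:
    """Decode the fixed 12-byte DNS header and check a few simple constraints.

    Illustrative only: the constraints are assumptions chosen for brevity.
    """
    if len(payload) < 12:
        return False
    # ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (16-bit, network byte order)
    _id, flags, qdcount, ancount, _nscount, _arcount = struct.unpack(
        "!6H", payload[:12]
    )
    qr = flags >> 15               # 0 = query, 1 = response
    opcode = (flags >> 11) & 0xF   # 0 = standard query
    reserved = (flags >> 6) & 0x1  # reserved bit, expected to be zero
    if opcode != 0 or reserved != 0:
        return False
    if qr == 0:
        # A standard query carries at least one question and no answers.
        return qdcount >= 1 and ancount == 0
    return True

# Hypothetical payloads for illustration.
query = struct.pack("!6H", 0x1234, 0x0100, 1, 0, 0, 0) + b"\x03www\x00"
noise = b"\xff" * 12
```

Here `looks_like_dns(query)` accepts the packet, while `looks_like_dns(noise)` rejects it because the decoded opcode violates the constraint.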

Most of the traffic identification methods have been well investigated, but until now little investigation has been done for PD. The only investigations made were on its traffic identification quality [4], [5], without a deeper look into its internal operation to find out why the quality is good or bad. One important reason for this low interest is that implementing a good PD for traffic classification takes a lot of work. In general, all protocol decoders must be implemented by hand. This requires not only detailed knowledge of the protocols and programming skills (C is mainly used), but also knowledge of system design to make the decoder feasible for real-time network traffic handling. PD is the preferred method of traffic identification in industry [6], [7], [8]. In this work we focus on PD because it is not well investigated, yet it is used in many commercial products and seems to give reliable results.

The remainder of this document is organised as follows. Section II describes the problems of the identification methods and of PD. Section III outlines how we plan to address these problems. Initial results are shown in Section IV. Section V covers related work and Section VI presents our future work and some conclusions.

II. PROBLEM STATEMENT

In research on traffic identification methods, there has so far been a lack of formal verification. The methods are only tested with recorded network traffic. This traffic is mostly very ‘synthetic’ [5]. This means the traffic was recorded under laboratory conditions and has very low variance, with few protocol implementations and protocol ‘dialects’. Thus, tests with this kind of traffic can only point to very general errors, and do not reveal side effects between different protocol detection modules, which result in false positives. A good example of this is L7-filter [9], which uses regular expressions for traffic identification. For instance, it uses the following two regular expressions to identify the protocols STUN and Subspace, respectively.

^[\x01\x02]................?$
^\x01....\x11\x10........\x01$

As can be seen, both regular expressions can match the same string. In most test cases, these two protocols will not be in the same network trace, but on real-world traffic this can lead to bad quality, because STUN is detected instead of Subspace, or vice versa. This kind of bad implementation and the lack of formal verification to find such shortcomings is a general problem of all identification methods.
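The overlap between the two signatures can be checked directly. The sketch below compiles both patterns as quoted above with Python's `re` module (using bytes patterns and `re.DOTALL` so that `.` matches any byte) and applies them to a payload constructed to satisfy both. The payload and the helper name `matching_protocols` are hypothetical and serve only as a demonstration.

```python
import re

# Signatures as quoted in the text for STUN and Subspace.
STUN = re.compile(rb"^[\x01\x02]................?$", re.DOTALL)
SUBSPACE = re.compile(rb"^\x01....\x11\x10........\x01$", re.DOTALL)

def matching_protocols(data: bytes) -> list:
    """Return the names of all signatures that match the payload."""
    names = []
    if STUN.search(data):
        names.append("STUN")
    if SUBSPACE.search(data):
        names.append("Subspace")
    return names

# Hypothetical 16-byte payload: starts with 0x01, has 0x11 0x10 at
# offsets 5 and 6, and ends with 0x01, so it satisfies both patterns.
payload = b"\x01" + b"\x00" * 4 + b"\x11\x10" + b"\x00" * 8 + b"\x01"

print(matching_protocols(payload))
```

Both names are reported for this payload, so the classification depends on which signature happens to be evaluated first.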

The next problem is related only to PD. The PD is implemented by hand in a programming language. Depending on the protocol, this is very expensive, complex and error-prone. Thus, much effort is needed to implement a large number of protocols. To reduce the complexity and the time needed for implementing the PD, the protocol validation is reduced to a minimum. This increases the possibility of side effects between different protocol decoders and reduces the number of fields of application. Generally, the protocol decoders can therefore only be used to identify network flows at startup. A later identification or the validation of the whole flow, which is necessary to detect anomalies, tunnels and attacks, is usually not possible. PD is the only identification method which can achieve protocol validation, so this potential should be exploited.

Another problem is the efficiency of the implementations. All known Open Source implementations [10], [11], [12], [13], [14] can only process network packets sequentially. Commercial products [15] also have this problem and use flow parallelisation to overcome it. But this is not the optimal solution. To implement PD, the decoder needs to store some information (flow context) about all current and previously processed packets of the flow. Commonly, the decoder needs up to 10 packets to identify a flow’s protocol. The flow context information is associated with the memory of the CPU which processes the flow first. Thus, a fixed allocation between flow and CPU is set. A hash function on the 5-tuple (src and dest IP addresses, src and dest port numbers, IP’s protocol/next header field) is commonly used to determine which flow belongs to which CPU. Depending on the hash function and the network traffic, in the worst case only one processor handles the whole network traffic.
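The worst case described above can be illustrated with a small sketch. It assumes a hypothetical assignment rule (CRC32 over the 5-tuple, modulo the number of CPUs); real products may use different hash functions, and the names `cpu_for_flow` and `load_per_cpu` are not taken from any cited implementation.

```python
import zlib
from collections import Counter

NUM_CPUS = 4  # assumed number of processing cores

def cpu_for_flow(src_ip, dst_ip, src_port, dst_port, proto):
    """Pin a flow to a CPU by hashing its 5-tuple (hypothetical rule)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % NUM_CPUS

def load_per_cpu(packets):
    """Count packets per CPU; packets is an iterable of 5-tuples."""
    load = Counter()
    for five_tuple in packets:
        load[cpu_for_flow(*five_tuple)] += 1
    return load

# Worst case: a single bulk flow makes up the whole traffic, so every
# packet hashes to the same CPU and the other cores stay idle.
bulk_flow = ("10.0.0.1", "10.0.0.2", 40000, 443, 6)
packets = [bulk_flow] * 1000
load = load_per_cpu(packets)
```

Because the flow context lives with one CPU, all 1000 packets land on a single core here, regardless of how many cores are available.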

III. HOW TO ADDRESS THESE ISSUES

To address all the problems of PD, we decided to create a Domain Specific Language (DSL). The DSL should be a purely declarative language. The declarative character of the language is intended to hide all complexity of network programming and to provide models of the protocols that can be used for formal verification, code generation and optimisation. With a declarative language the programmer only says what to do, but not how it should be done. Thus, complexity and error-proneness decrease.

To build a model of a protocol, the programmer only has to define the header of the protocol as well as the behaviour of the protocol, e.g., with a deterministic finite automaton. This could be taken from the protocol’s standardisation documents, or could be obtained by reverse engineering.
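As a rough illustration of what such a model could look like, the sketch below describes a made-up toy protocol purely as data (header fields with allowed values, plus a DFA over message types) and uses a small generic interpreter to validate a flow. This is not the DSL proposed in this paper; the model layout, the protocol and all names are assumptions for illustration.

```python
import struct

# Hypothetical declarative model of a toy protocol (not the authors' DSL).
TOY_MODEL = {
    "header": [  # (field name, struct format, allowed values or None)
        ("version", "B", {1}),
        ("msg_type", "B", {0, 1, 2}),
        ("length", "H", None),
    ],
    "dfa": {
        "start": "idle",
        "accepting": {"open", "closed"},
        "transitions": {  # (state, msg_type) -> next state
            ("idle", 0): "open",    # HELLO
            ("open", 1): "open",    # DATA
            ("open", 2): "closed",  # BYE
        },
    },
}

def decode_header(model, packet):
    """Decode the header fields; return None if a constraint is violated."""
    fmt = "!" + "".join(field_fmt for _, field_fmt, _ in model["header"])
    size = struct.calcsize(fmt)
    if len(packet) < size:
        return None
    values = struct.unpack(fmt, packet[:size])
    fields = {}
    for (name, _, allowed), value in zip(model["header"], values):
        if allowed is not None and value not in allowed:
            return None
        fields[name] = value
    return fields

def validate_flow(model, packets):
    """Header validation per packet plus behaviour validation via the DFA."""
    dfa = model["dfa"]
    state = dfa["start"]
    for packet in packets:
        fields = decode_header(model, packet)
        if fields is None:
            return False
        state = dfa["transitions"].get((state, fields["msg_type"]))
        if state is None:
            return False
    return state in dfa["accepting"]

hello = struct.pack("!BBH", 1, 0, 0)
data = struct.pack("!BBH", 1, 1, 5) + b"hello"
bye = struct.pack("!BBH", 1, 2, 0)
```

With this model, `validate_flow(TOY_MODEL, [hello, data, bye])` succeeds, while a flow that starts with `data` fails the behaviour check even though each header on its own is valid.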

Based on the protocol models, the compiler can perform formal verification to support the programmer and point out parts of the protocol definition which could lead to side effects with other protocols. The idea of the formal verification is primarily to check the protocol implementations so that the same stream of packets does not match multiple protocol definitions. In this way, all protocol definitions could be made so detailed that no side effects, i.e., no false positives, are possible. This could definitely exclude all side effects within the set of defined protocols, and result in a significant increase in quality and reliability, but this must be proved. A protocol identification system with a recall of 80% and a precision of 100% is much more useful than a system with a recall of 99% but a precision of 80%.
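The recall and precision comparison can be made tangible with a short calculation. The sketch below assumes a hypothetical trace containing 10000 flows of the protocol in question and derives the number of wrongly labelled flows from the definitions recall = TP / (TP + FN) and precision = TP / (TP + FP); the trace size and the function name are illustrative assumptions.

```python
from fractions import Fraction

def false_positives(actual_flows, recall, precision):
    """Number of flows wrongly labelled as the protocol.

    From precision = TP / (TP + FP) it follows that
    FP = TP * (1 - precision) / precision, with TP = actual_flows * recall.
    """
    true_positives = actual_flows * recall
    return true_positives * (1 - precision) / precision

# Hypothetical trace with 10000 flows of the protocol in question.
system_a = false_positives(10000, Fraction(80, 100), Fraction(1))
system_b = false_positives(10000, Fraction(99, 100), Fraction(80, 100))
print(system_a, system_b)  # 0 2475
```

Under these assumptions, the system with 80% recall misses 2000 flows but never mislabels foreign traffic, whereas the system with 99% recall attaches the wrong label to 2475 flows, which then receive the wrong treatment.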

IV. INITIAL RESULTS

As a first proof of concept, we implemented a compiler for a DSL in which we defined protocol headers and protocol behaviour. Based on this model, we could generate C code which does the protocol identification and verification. This proof-of-concept compiler does not support text-based protocols. We defined the protocols DNS, Oscar, TLS, RTP, Bittorrent and Edonkey with a few lines of DSL code. The generated program does header validation and behaviour validation, and we could switch the behaviour validation on or off. The test with only header validation showed good results with high recall and precision, because we used a detailed definition of the protocols. Still, there were some false positives. After enabling the behaviour validation as well, however, there were no more false positives. For the tests we used captured traffic already used in [16], [17]. For those papers we did an evaluation of machine learning algorithms and feature selection for network traffic identification, and we achieved better results with the proof-of-concept DSL than with the heuristic-based method.

During our investigations of heuristics with machine learning algorithms [16], [17], we recognized that it is not a problem to obtain high recall with heuristics, but this often results in low precision. Our latest results show that PD with [10], [11], [12], [13], [14] provides much higher precision.

As described in Section II, we found that some PD implementations such as [10], [11], [13] use only reduced validation functions. These implementations cannot, e.g., identify all DNS flows, because not all extensions and Resource Records are supported. The same behaviour was seen with other protocols.

Furthermore, we investigated the PD of [10], [11], [12], [13], [14] at code level and assembly language level and discovered that the PD needs only a subset of a general-purpose CPU’s instruction set. It mostly uses comparison functions as well as load and store functions to access the network data. None of the observed protocol decoders use floating point operations. This gives us some options for generating optimized code; e.g., modern CPUs have several arithmetic logic units, so they can process comparison functions in parallel.

V. RELATED WORK

Domain Specific Languages have been used for decades in protocol engineering. Examples are LOTOS [18], ESTELLE [19] and SDL [20].

We had difficulties finding any publications about formal verification of protocol identification methods based on heuristics, regular expressions or PD.

DSLs for network traffic identification or validation are binpac, NetPDL and SML. Binpac is described in [21] as “A yacc for Writing Application Protocol Parsers”. Binpac was developed to make it easy to implement syntax and semantic analysers for network protocols. The focus of this language is on protocols of layers 5 to 7. Binpac is a part of the Intrusion Detection System Bro [22] and generates C++ code for parsers and protocol validation modules used within Bro. This DSL can describe headers and behaviour of network protocols. It does not have highly intuitive syntax and semantics.

In [23] the DSL NetPDL (Network Protocols Description Language) was extended for use in network traffic identification. NetPDL is an XML-based language to specify protocol headers and potential upper layer protocols. Therefore, NetPDL is mainly used for protocol decoders like Wireshark. The NetPDL database contains detailed descriptions of many layer 2 to 4 protocols, but it does not provide good support for upper layer protocols. To use this language for traffic identification, the verify XML tag was added in [23]. This tag is mostly used to specify patterns with regular expressions to identify network traffic. In this way only the protocol headers or characteristic fields are described, but not the behaviour of the protocols. Thus, NetPDL does not provide protocol decoders suitable for traffic validation.

SML (Service Management Language) is the language used to write modules for Cisco’s Network Based Application Recognition (NBAR) [8]. With SML, protocol headers can be specified with Abstract Data Types, but analysis, identification or validation must be manually implemented in a C-like language.

None of these three DSLs (binpac, NetPDL and SML) is a purely declarative language. All of them use embedded code to implement decoding, type checks, session management, etc. Binpac, for instance, uses hand-written C++ code to realise some header validation functions. This code can be automatically checked for syntax and semantics on the C++ level, but not on the protocol level. This contradicts the defined goals of this work: to describe the protocols in a formal way, independent of a target platform, and to only establish what should be done but not how it should be done.

VI. CONCLUSIONS AND FUTURE WORK

In our next work we plan to develop formal verification methods to evaluate the protocol definitions. This includes the verification of the protocol headers, which can be binary, text-based or a combination of both. The challenges will be optional parameters and parameters of variable length. If we use regular expressions to describe text-based headers, we need a method to determine the similarity of two regular expressions. Determining the similarity of fixed strings can be done with string similarity metrics or string distance functions. Furthermore, the behaviour of the protocols must be verified. Some methods do exist to evaluate deterministic finite automata (DFA), such as reachability analysis, but we will need verification methods to evaluate whether different DFAs match the same sequence of network packets.
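One standard way to test whether two DFAs accept a common sequence is a reachability search over their product automaton. The sketch below is a minimal illustration of that textbook construction, not the verification method planned in this work. It assumes both DFAs are given over the same abstract alphabet of packet classes, and the dictionary layout and the example automata are hypothetical.

```python
from collections import deque

def common_sequence(dfa_a, dfa_b):
    """Return a shortest non-empty symbol sequence accepted by both DFAs.

    Performs a breadth-first search over pairs of states (the product
    automaton). Returns None if no common non-empty sequence exists.
    """
    start = (dfa_a["start"], dfa_b["start"])
    queue = deque([(start, [])])
    seen = set()  # pairs reached by at least one symbol
    while queue:
        (state_a, state_b), path = queue.popleft()
        if path and state_a in dfa_a["accepting"] and state_b in dfa_b["accepting"]:
            return path
        for (source, symbol), next_a in dfa_a["transitions"].items():
            if source != state_a:
                continue
            next_b = dfa_b["transitions"].get((state_b, symbol))
            if next_b is None:
                continue
            pair = (next_a, next_b)
            if pair not in seen:
                seen.add(pair)
                queue.append((pair, path + [symbol]))
    return None

# Hypothetical behaviour models over abstract packet classes.
proto_a = {
    "start": "s0",
    "accepting": {"s2"},
    "transitions": {("s0", "req"): "s1", ("s1", "resp"): "s2"},
}
proto_b = {
    "start": "t0",
    "accepting": {"t1"},
    "transitions": {("t0", "req"): "t0", ("t0", "resp"): "t1"},
}
proto_c = {
    "start": "u0",
    "accepting": {"u1"},
    "transitions": {("u0", "hello"): "u1"},
}
```

For these toy models, `common_sequence(proto_a, proto_b)` yields `["req", "resp"]` as a witness that both definitions accept the same packet sequence, while `common_sequence(proto_a, proto_c)` returns `None`. In practice the symbols would have to come from the header verification step, since two protocols only share a symbol if their header definitions can match the same packet.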

Besides developing formal verification methods, we also have to investigate how much payload of a network packet must be observed to clearly identify or validate the network traffic. This is important for two reasons. First, a smaller amount of observed data reduces the processing effort. Second, traffic identification and verification can affect data privacy, which is important for the acceptance of this technology.

To conclude this paper, we reaffirm that the task of identification and validation of network traffic is important for many fields of application. Formal verification methods are still not used, and this significantly decreases the reliability.

REFERENCES

[1] A. Moore and K. Papagiannaki, “Toward the Accurate Identification of Network Applications,” in Passive and Active Network Measurement, ser. Lecture Notes in Computer Science, 2005, vol. 3431, pp. 41–54.

[2] Y. Yang, H. Le, and V. Prasanna, “High Performance Dictionary-Based String Matching for Deep Packet Inspection,” in INFOCOM, 2010 Proceedings IEEE, 2010, pp. 1–5.

[3] T. Nguyen and G. Armitage, “A survey of techniques for internet traffic classification using machine learning,” Communications Surveys & Tutorials, IEEE, vol. 10, no. 4, pp. 56–76, 2008.

[4] J. Khalife et al., “Performance of OpenDPI in Identifying Sampled Network Traffic,” JNW, vol. 8, no. 1, pp. 71–81, 2013.

[5] T. Bujlow et al., Comparison of Deep Packet Inspection (DPI) Tools for Traffic Classification. Universitat Politècnica de Catalunya, 2013.

[6] (2013, July) PACE – Protocol and Application Classification Engine. [Online]. Available: http://www.ipoque.com/en/products/pace

[7] (2013, July) Deep Packet Inspection and Metadata Engine. http://www.qosmos.com/products/deep-packet-inspection-engine/.

[8] (2013, March) Network Based Application Recognition (NBAR). Cisco. http://www.cisco.com/en/US/products/ps6616/products_ios_protocol_group_home.html.

[9] (2013, July) Application Layer Packet Classifier for Linux. http://l7-filter.sourceforge.net/.

[10] (2013, July) OpenDPI. http://www.opendpi.org/.

[11] (2013, July) nDPI. http://www.ntop.org/products/ndpi/.

[12] (2013, July) Official IPP2P homepage. http://www.ipp2p.org/.

[13] (2013, July) HiPPIE. http://www.linux112.com/hippie-p313520.html.

[14] S. Alcock and R. Nelson, “Libprotoident: Traffic Classification Using Lightweight Packet Inspection,” WAND Network Research Group, Tech. Rep., 2012.

[15] Emerson Network Power, “Deep Packet Inspection (DPI) Use Cases, Requirements and Architectures,” White Paper, 2013.

[16] M. Finsterbusch et al., “Parameter Estimation for Heuristic Based Internet Traffic Classification.” ICIMP 2012, ISBN: 978-1-61208-201-1.

[17] C. Richter, M. Finsterbusch, K. Hanßgen, and J.-A. Müller, “Impact of Asymmetry of Internet Traffic for Heuristic Based Classification,” Int. Journal of Computer Networks (IJCN), vol. 4, no. 10, Dec. 2012.

[18] ISO/IEC, Information Processing Systems – Open Systems Interconnection: LOTOS, A Formal Description Technique Based on the Temporal Ordering of Observational Behavior, 1989.

[19] ISO, “Information Processing Systems – Open Systems Interconnection – Estelle, A Formal Description Technique Based on an Extended State Transition Model,” ISO 9074, May 1989.

[20] ITU-T, ITU-T Rec. Z.100 – Formal description techniques (FDT) – Specification and Description Language (SDL), 2002.

[21] R. Pang et al., “binpac: A yacc for Writing Application Protocol Parsers,” in Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, ser. IMC ’06, 2006, pp. 289–300.

[22] V. Paxson, “Bro: a System for Detecting Network Intruders in Real-Time,” Computer Networks, vol. 31, 1999.

[23] F. Risso et al., “Extending the NetPDL Language to Support Traffic Classification,” in GLOBECOM ’07. IEEE, 2007, pp. 22–27.