Classification of Applications in HTTP Tunnels By Gajen Piraisoody, Changcheng Huang,Biswajit Nandy,...

Classification of Applications in HTTP Tunnels

By

Gajen Piraisoody, Changcheng Huang, Biswajit Nandy, Nabil Seddigh

Electrical and Computer Engineering, Carleton University, Ottawa, ON, Canada

12 November 2013

Slide 2

Outline

• Overview

• Motivation

• Problem Statement

• Contribution

• Approach to classification

• Evaluation

• Conclusion

Slide 3

Overview – HTTP Tunnel

What is HTTP Tunnelled Traffic?

• HTTP port used to carry web traffic

• Non-HTTP applications are wrapped in HTTP protocols

• HTTP port now tunnels email, chat, video, image, audio, file-transfer and peer-to-peer traffic

Why HTTP Tunnel non-HTTP applications?

• HTTP clients (browser) are readily available and deployable

• Tunneling permits applications to bypass restricted network connectivity imposed by firewalls, proxies and NAT

Slide 4

Motivation

HTTP Traffic Classification

• HTTP traffic accounts for about 80% of the traffic in an entire network

• HTTP tunneled traffic is not identifiable by ports alone

• Tunneled traffic such as YouTube and Netflix is increasing in cloud networks

• Information on tunneled traffic helps cloud-centre management with planning, provisioning and ensuring quality of service

Why flow-based classification rather than DPI?

• Provides a scalable software solution (less CPU consumption)

• Can classify encrypted data

Slide 5

Problem Statement

Given network traffic measured with NetFlow

Find a way to classify HTTP tunnelled traffic

• Audio (Radio & Music), Video and File-transfer

No training dataset needed for the proposed algorithm

Use information available from NetFlow only

Slide 6

Contribution

Proposed scheme classifies HTTP tunneled traffic: audio (radio & music), video and file-transfer

Proposed scheme helps audio classification by using the 'occupancy' feature

Proposed scheme enhances classification performance by including a flow-group found using flows from Content Servers (subnet-masked IP of the long-flow)

Slide 7

Approach in detail

Identify long-flow HTTP traffic. Parameter: BPF

Classify radio traffic. Parameters: BPF, BPP, BPS, Occupancy

Classify music traffic. Parameters: BPF, BPP, BPS, Occupancy

Classify video traffic. Parameters: BPF, BPP, BPS, Flow-group

Classify file-transfer traffic. Parameters: BPF, BPP, BPS, Flow-group

Bytes-per-second (BPS), Bytes-per-flow (BPF), Bytes-per-pkt (BPP)
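The three flow attributes can be derived from the byte, packet and timing counters that a flow record carries. The sketch below is only an illustration: the `FlowRecord` class and its field names are hypothetical and do not follow any particular NetFlow export format, and the handling of zero-duration flows is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    """Hypothetical, simplified flow record (field names are assumptions)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    start: float   # flow start time, in seconds
    end: float     # flow end time, in seconds
    packets: int   # total packets in the flow
    bytes: int     # total bytes in the flow


def bytes_per_flow(flow: FlowRecord) -> int:
    """BPF: total bytes carried by the flow."""
    return flow.bytes


def bytes_per_packet(flow: FlowRecord) -> float:
    """BPP: average packet size in bytes (0.0 for an empty flow)."""
    return flow.bytes / flow.packets if flow.packets else 0.0


def bytes_per_second(flow: FlowRecord) -> float:
    """BPS: average byte rate over the flow duration.

    Assumption: a flow with zero duration is reported as 0.0 rather
    than raising an error.
    """
    duration = flow.end - flow.start
    return flow.bytes / duration if duration > 0 else 0.0


if __name__ == "__main__":
    flow = FlowRecord("203.0.113.10", "10.0.0.5", 80, 50000,
                      start=0.0, end=10.0, packets=100, bytes=150000)
    print(bytes_per_flow(flow), bytes_per_packet(flow), bytes_per_second(flow))
```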

Slide 8

Approach to Classification

Identify Long-flow HTTP Traffic

Classify Audio Traffic

Classify Video & File-transfer Traffic

Slide 9

Identify Long-flow HTTP Traffic

Identifying HTTP Traffic

A long-flow has a byte size larger than a threshold

Audio, video and file-transfer flows are generally long-flows

HTTP_PORTS: 80, 443, 1935, 8008, 8080, 8088, 8090
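A minimal sketch of this filtering step, assuming a flow is described by its two port numbers and its byte count. The port list is the one given on the slide; the long-flow threshold is not stated here, so `LONG_FLOW_MIN_BYTES` is a placeholder value chosen only for illustration.

```python
# Ports treated as HTTP, as listed on the slide.
HTTP_PORTS = {80, 443, 1935, 8008, 8080, 8088, 8090}

# Assumed value: the slide only says "larger than a threshold".
LONG_FLOW_MIN_BYTES = 500_000


def is_http_flow(src_port: int, dst_port: int) -> bool:
    """A flow counts as HTTP if either endpoint uses an HTTP port."""
    return src_port in HTTP_PORTS or dst_port in HTTP_PORTS


def is_long_http_flow(src_port: int, dst_port: int, flow_bytes: int,
                      min_bytes: int = LONG_FLOW_MIN_BYTES) -> bool:
    """True for HTTP flows whose byte size exceeds the threshold."""
    return is_http_flow(src_port, dst_port) and flow_bytes > min_bytes


if __name__ == "__main__":
    print(is_long_http_flow(80, 51234, 2_000_000))   # large download from port 80
    print(is_long_http_flow(80, 51234, 4_000))       # short HTTP flow
    print(is_long_http_flow(22, 51234, 2_000_000))   # large non-HTTP flow
```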

Slide 10

Approach

Identify Long-flow HTTP Traffic

Classify Audio Traffic

Classify Video & File-transfer Traffic

Slide 11

Classify Audio Traffic

99.4% of radio rates are between 20 and 320 Kbps (statistics from 3683 online radio web sites)

98% of online music rates are between 64 and 320 Kbps (statistics from more than 20 online music sites)

The 95% confidence interval of radio bytes-per-packet is between 900 and 1470 (Samruay et al. [1])

The 95% confidence interval of music bytes-per-packet is between 1260 and 1500 (Samruay et al. [1])
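The rate and bytes-per-packet ranges can be applied as a first filter on candidate audio flows. The sketch below treats the quoted ranges as hard bounds, which is an assumption made for illustration; the function names and the flow representation (bytes, packets, duration) are hypothetical.

```python
# Ranges quoted on the slide, used here as hard bounds (an assumption).
RADIO_RATE_KBPS = (20, 320)
MUSIC_RATE_KBPS = (64, 320)
RADIO_BPP = (900, 1470)
MUSIC_BPP = (1260, 1500)


def rate_kbps(flow_bytes: int, duration_s: float) -> float:
    """Average download rate in kilobits per second."""
    return flow_bytes * 8 / duration_s / 1000


def in_range(value: float, bounds: tuple) -> bool:
    low, high = bounds
    return low <= value <= high


def audio_candidates(flow_bytes: int, packets: int, duration_s: float) -> set:
    """Return the audio types whose rate and BPP ranges both match the flow."""
    if packets <= 0 or duration_s <= 0:
        return set()
    rate = rate_kbps(flow_bytes, duration_s)
    bpp = flow_bytes / packets
    matches = set()
    if in_range(rate, RADIO_RATE_KBPS) and in_range(bpp, RADIO_BPP):
        matches.add("radio")
    if in_range(rate, MUSIC_RATE_KBPS) and in_range(bpp, MUSIC_BPP):
        matches.add("music")
    return matches


if __name__ == "__main__":
    # A 128 Kbps stream observed for 60 seconds carries 960000 bytes.
    print(audio_candidates(960_000, 700, 60.0))
    print(audio_candidates(960_000, 1000, 60.0))
```

Because the two ranges overlap, a flow can match both types at this stage, which is why a further feature is needed to separate radio from music.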

  

Slide 12

Classify Audio Traffic

Behavioral analysis: an online audio listener typically listens to audio for more than 5 minutes

There are two distinct audio types: Radio & Music (songs)

New concept: Occupancy helps classify audio. Occupancy is the ratio of the flow duration to the entire duration of a chunk of time.
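One possible way to compute occupancy for a time window is sketched below. It assumes each flow is given as a (start, end) pair in seconds, that flows are clipped to the window, and that overlapping flows are merged so no second is counted twice; these details are assumptions for illustration and are not specified on the slide.

```python
def occupancy(flows, window_start: float, window_end: float) -> float:
    """Fraction of the window [window_start, window_end] covered by flows.

    flows: iterable of (start, end) times in seconds.
    Overlapping flows are merged (an assumption) before summing durations.
    """
    window = window_end - window_start
    if window <= 0:
        raise ValueError("window must have a positive duration")

    # Keep only flows that intersect the window, clipped to its bounds.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in flows
        if end > window_start and start < window_end
    )

    covered = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Gap before this flow: close the previous merged interval.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Flow overlaps or touches the current interval: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start

    return covered / window


if __name__ == "__main__":
    # Back-to-back flows covering the whole 5-minute window.
    print(occupancy([(0, 120), (120, 240), (240, 300)], 0, 300))
    # Two short bursts inside the same window.
    print(occupancy([(0, 30), (200, 230)], 0, 300))
```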

  

[Figure: average download rate (Mbps) for music (Grooveshark), radio (Hdradio) and video (CTV)]

Slide 13

Classify Audio Traffic

Difference between Radio & Music

Continuous: radio content appears to download during every second of the flow

Dirac: songs in a playlist are downloaded and played one at a time

The max/min size of a radio flow depends on the maximum flow-period configuration and the offered radio rates

The max/min size of a music flow depends on the max/min song duration and the offered online music rates

The 95% confidence interval of radio occupancy from DS-1, DS-2, SME-6, SME-7 and SME-8 is [82%, 100%]

The 95% confidence interval of music occupancy from DS-1, DS-2, SME-6, SME-7 and SME-8 is [0%, 55%]

Assumption: the minimum number of radio-flows is two (5 minutes at least)

Assumption: the minimum number of music-flows is two (5 minutes at least)

Assumption: the maximum radio-phase timeout is based on a flow-period (120 seconds)

Maximum music-phase timeout is based on the maximum song duration (382 seconds)
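The occupancy intervals suggest a simple decision rule for separating radio from music. The sketch below uses the two 95% confidence intervals directly as decision thresholds and returns "unknown" for values that fall between them; using the intervals this way, and the function and constant names, are assumptions for illustration rather than the exact rule of the proposed algorithm.

```python
# 95% confidence intervals of occupancy, used as thresholds (an assumption).
RADIO_OCCUPANCY = (0.82, 1.00)
MUSIC_OCCUPANCY = (0.00, 0.55)

# Both audio types are assumed to need at least two flows.
MIN_AUDIO_FLOWS = 2


def classify_audio(occupancy: float, n_flows: int) -> str:
    """Label a group of candidate audio flows as radio, music or unknown."""
    if n_flows < MIN_AUDIO_FLOWS:
        return "unknown"
    if RADIO_OCCUPANCY[0] <= occupancy <= RADIO_OCCUPANCY[1]:
        return "radio"
    if MUSIC_OCCUPANCY[0] <= occupancy <= MUSIC_OCCUPANCY[1]:
        return "music"
    return "unknown"


if __name__ == "__main__":
    print(classify_audio(0.97, 3))   # near-continuous download
    print(classify_audio(0.20, 4))   # short bursts, one per song
    print(classify_audio(0.70, 3))   # between the two intervals
    print(classify_audio(0.97, 1))   # too few flows
```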

Slide 14

Approach

Identify Long-flow HTTP Traffic

Classify Audio Traffic

Classify Video & File-transfer Traffic

Slide 15

Background

• Multimedia Distribution (3 types)

[Figure: multimedia distribution through a CDN. Components: Client (Web Browser, Media Player), Server (HTTP Server, Listening), CDN's Authoritative DNS Server, CDN_1 to CDN_n]

1) Client clicks on audio/video hyperlink

2) Metafile sent to client

3) Metafile

4) Request multimedia content

5) Responds with CDN site

6) From DNS lookup, request sent to CDN admin

7) Responds with address of all contents on all CDNs

8) Request multimedia content 1

9) Request multimedia content 2

10) Content 1

11) Content 2

Slide 16

Classify Video & File-transfer Traffic

Video flow-attributes (bytes-per-packet, bytes-per-flow, download rates) and the flow-group (FG) technique are used to classify video and file-transfers

Flow-group (FG)

• A video flow is associated with meta-data, style sheets and advertisements

• Kei et al. [3] defined FG as the number of flows that occur within a few seconds of the video-flow with the same destination-IP address

• Our expanded flow-group also includes flows that occur within a longer duration and that have the same subnet-masked source-IP address and the same destination-IP address
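The subnet-masked comparison in the expanded flow-group can be illustrated with Python's standard `ipaddress` module. The mask length is not given on the slide, so the /24 default below is an assumption, as are the function names.

```python
import ipaddress


def masked_source(ip: str, prefix_len: int = 24) -> str:
    """Return the network address of ip under the given prefix length.

    prefix_len=24 is an assumed mask; the slide does not state one.
    """
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return str(network.network_address)


def in_expanded_group(video_src: str, video_dst: str,
                      flow_src: str, flow_dst: str,
                      prefix_len: int = 24) -> bool:
    """True if a flow shares the video-flow's masked source and destination IP."""
    same_subnet = (masked_source(video_src, prefix_len)
                   == masked_source(flow_src, prefix_len))
    return same_subnet and video_dst == flow_dst


if __name__ == "__main__":
    print(masked_source("203.0.113.77"))
    print(in_expanded_group("203.0.113.77", "10.0.0.5",
                            "203.0.113.12", "10.0.0.5"))
```

Matching on the masked source address lets flows from neighbouring content servers in the same subnet count toward the group, while the timing condition is handled separately.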

  

An Example

Slide 17

[Figure: Flow Size. x-axis: flow-index (1 to 36); y-axis: Log10(Bytes)]

[Figure: Flow Duration. x-axis: flow-index (1 to 36); y-axis: Time (Seconds)]

Example cont'd

Slide 18

[Figure: Bytes-per-packet. x-axis: Flow Index (1 to 36); y-axis: bytes-per-packet (0 to 1600)]

[Figure: Type of Flow. x-axis: flow-index (1 to 36); y-axis categories: video-flow, flow-group, signal-flow]

Slide 19

Classify Video & File-transfer Traffic

[Figure: flow-group range (seconds) relative to the video-flow start at 0, with marks at -60, -4, 0, 1 and 10]

• Kei et al.'s flow-group: 98% of flow-group flows are within 4 seconds before the video-flow and 97.8% are within 1 second after the video-flow

• Improved flow-group: 94.4% of flow-group flows are within 60 seconds before the video-flow and 94.1% are within 10 seconds after the video-flow

• 92.6% of flow-group bytes-per-flow values are above 1000 and below 500000

• Almost 100% of flow-group bytes-per-packet values are above 200

All flow-group statistics are estimated from datasets DS-4 and DS-5

Slide 20

Classify Video & File-transfer Traffic

Start

Gather potential V/F flows

• flow > 0.5 MB

• & > 1260 bytes-per-pkt

• & > 128 Kbps

• & order by destination-IP and flow start time

For every potential V/F flow, gather potential flow-group (FG) flows when:

• FG flow > V/F start-time - 4

• & FG flow < V/F start-time + 1

• & FG flow and V/F have same dest-IP

• & FG flow between 1000 B and 0.5 MB

• & FG flow between 200 and 1500 BPP

If FG == true: inc FG counter

For V/F-phase gather potential FG flows:

• Same source IP address-subnet

• & same destination IP address

• & FG flow > V/F start-time - 60

• & FG flow < V/F start-time + 10

• & FG flow between 1000 B and 0.5 MB

• & FG flow between 200 and 1500 BPP

If FG == true: inc FG counter

If FG > 0: label video, else: label file-transfer

End

Green is original flow-group (FG), yellow is improvised flow-group. Both FG are run.
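The flowchart can be sketched in code as follows. This is an illustrative reading, not the authors' implementation: the `Flow` class and its field names are hypothetical, the /24 subnet mask is an assumption, 0.5 MB is taken as 500000 bytes, and the time conditions are applied to flow start times.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Flow:
    """Hypothetical flow record; field names are assumptions."""
    src_ip: str
    dst_ip: str
    start: float      # flow start time, in seconds
    duration: float   # flow duration, in seconds
    packets: int
    bytes: int


def bpp(f: Flow) -> float:
    return f.bytes / f.packets if f.packets else 0.0


def kbps(f: Flow) -> float:
    return f.bytes * 8 / f.duration / 1000 if f.duration > 0 else 0.0


def subnet(ip: str, prefix_len: int = 24):
    # prefix_len=24 is an assumed mask length.
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def is_vf_candidate(f: Flow) -> bool:
    """Potential video/file-transfer flow: large, full packets, fast."""
    return f.bytes > 500_000 and bpp(f) > 1260 and kbps(f) > 128


def fits_fg_profile(f: Flow) -> bool:
    """Size and packet-size bounds shared by both flow-group variants."""
    return 1000 <= f.bytes <= 500_000 and 200 <= bpp(f) <= 1500


def fg_count(vf: Flow, flows) -> int:
    """Run both flow-group checks and count every match."""
    count = 0
    for f in flows:
        if f is vf or f.dst_ip != vf.dst_ip or not fits_fg_profile(f):
            continue
        offset = f.start - vf.start
        # Original flow-group: 4 s before to 1 s after the V/F flow.
        original = -4 < offset < 1
        # Expanded flow-group: 60 s before to 10 s after, same source subnet.
        expanded = (-60 < offset < 10
                    and subnet(f.src_ip) == subnet(vf.src_ip))
        count += int(original) + int(expanded)
    return count


def classify_vf(flows):
    """Return (flow, label) pairs for every potential V/F flow."""
    candidates = sorted((f for f in flows if is_vf_candidate(f)),
                        key=lambda f: (f.dst_ip, f.start))
    return [(f, "video" if fg_count(f, flows) > 0 else "file-transfer")
            for f in candidates]


if __name__ == "__main__":
    flows = [
        Flow("203.0.113.10", "10.0.0.5", 100.0, 60.0, 1000, 1_400_000),
        Flow("203.0.113.20", "10.0.0.5", 98.0, 1.0, 10, 5_000),
        Flow("198.51.100.7", "10.0.0.9", 200.0, 60.0, 1000, 1_400_000),
    ]
    for flow, label in classify_vf(flows):
        print(flow.dst_ip, label)
```

Since the final label depends only on whether the counter is above zero, a companion flow matched by both checks does not change the outcome.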

Slide 21

Evaluation

Datasets used to test algorithms

Accuracy measurement assessment

• Precision is the system's correct predictions against all predicted values. That is, precision = TP / (TP + FP)

• Recall is the system's correct predictions against all actual correct values. That is, recall = TP / (TP + FN)

• F-Measure is the harmonic mean of recall and precision. That is, F-measure = 2 * Precision * Recall / (Precision + Recall)

• Accuracy is the proportion of true results. That is, accuracy = (TP + TN) / (TP + FP + FN + TN)

Compare against other algorithms

• NaïveBayes

• SVM (Support Vector Machine)
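The four evaluation measures can be computed from confusion-matrix counts as in the short sketch below. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumption made to keep the example safe to run.

```python
def precision(tp: int, fp: int) -> float:
    """Correct positive predictions over all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Correct positive predictions over all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of true results among all predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    tp, tn, fp, fn = 8, 8, 2, 2
    p, r = precision(tp, fp), recall(tp, fn)
    print(p, r, f_measure(p, r), accuracy(tp, tn, fp, fn))
```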

Slide 22

Evaluation – Datasets

                     SME-6         SME-7         SME-8
Date                 1/7/2013      1/22/2013     1/23/2013
Duration (s)         24723         28207         13628
Start-time (GMT-5)   10:18:04      10:29:04      10:56:20
Flows                249822        287616        198409
Packets              13376109      15351639      10170693
Bytes                11158181285   13589511746   8728052938
HTTP Flows           75485         87181         63951
HTTP Packets         7346663       8814438       5628558
HTTP Bytes           10456335955   12545720613   7982629610

Slide 23

Evaluation – Results

[Figure: F-Measure bar chart. Categories: SME6-Audio, SME6-File, SME6-Video, SME7-Audio, SME7-File, SME7-Video, SME8-Audio, SME8-File, SME8-Video. Series: NaiveBayes, SVM, Proposed Algorithm]

Bar labels:

27.5% 59.5% 39.4% 56.1% 79.7% 70.8% 66.5% 64.0% 86.6%

16.8% 23.2% 42.6% 21.6% 12.5% 40.4% 60.4% 49.1% 43.1%

84.9% 60.8% 72.9% 93.0% 93.6% 82.5% 85.1% 89.7% 94.2%

Slide 24

Evaluation – Results

Accuracy

                     SME-6   SME-7   SME-8
NaiveBayes           39.1%   73.5%   71.4%
SVM                  17.8%   16.3%   42.0%
Proposed Algorithm   70.5%   89.9%   90.9%

Slide 25

Conclusion

• Proposed algorithm uses a flow-based approach and classifies a high percentage of tunneled traffic: audio, video and file-transfer

• Proposed audio algorithm:

  • Used a concept called occupancy to classify radio & music traffic

• Proposed video & file-transfer algorithm:

  • Used improvised flow-group method to help increase classification accuracy of video and file-transfer traffic

• Proposed scheme's F-measure is at least 10% more than NaiveBayes and SVM

Slide 26

References

[1] S. Kaoprakhon and V. Visoottiviseth, "Classification of Audio and Video Traffic over HTTP Protocol," in Communications and Information Technology, 2009. ISCIT 2009. 9th International Symposium on, Sept. 2009.

[2] M. Twardos, "The Information Diet," 2011. [Online]. Available: http://theinformationdiet.blogspot.ca/2011/11/probability-distribution-of-song-length.html. [Accessed 2013].

[3] K. Takeshita, T. Kurosawa, M. Tsujino and M. Iwashita, "Evaluation of HTTP Video Classification Method Using Flow Group Information," in Telecommunications Network Strategy and Planning Symposium (NETWORKS), 2010 14th International, Sept. 2010.

[4] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos and K. Lee, "Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices," in ACM, 2008.

[5] D. M. W. Powers, "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation," in Journal of Machine Learning Technologies, Volume 2, Issue 1, 2011, pp. 37-63.